AlphaFold 2 Paper and Code Release: A Guide for New ML Engineers in Biological Problem Solving
The release of the AlphaFold 2 paper and code has generated considerable excitement in the scientific community. This breakthrough in protein structure prediction could reshape parts of biology and inspire a new generation of machine learning engineers to work on foundational biological problems. In this blog post, we aim to provide a self-contained introduction to the core concepts needed to understand AlphaFold 2-like technologies, assuming no background in biology and only some background in machine learning.
We start with the central dogma of biology, which describes how genetic information flows from DNA to RNA to protein. We then cover proteins, amino acids, nucleotides, and codons, the building blocks of biological systems. Understanding the four levels of protein structure (primary, secondary, tertiary, and quaternary) is essential for understanding protein folding. We also discuss protein domains, motifs, residues, and turns, the vocabulary used to describe the complex 3D structures of proteins.
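To make the DNA-to-RNA-to-protein flow concrete, here is a minimal Python sketch of transcription and translation using the standard genetic code. The function names (`transcribe`, `translate`) and the example sequence are illustrative assumptions, not taken from the AlphaFold 2 code, and the sketch ignores real-world details such as start-codon search, introns, and alternative genetic codes.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA -> mRNA: same sequence with T replaced by U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein: read codons (triplets) from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCAAGTGA"            # hypothetical toy gene
mrna = transcribe(dna)          # "AUGGCCAAGUGA"
print(mrna, translate(mrna))    # AUGGCCAAGUGA MAK
```

The resulting string of one-letter amino acid codes ("MAK" here) is the protein's primary structure, which is the input a structure predictor starts from.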
Distograms, which represent the pairwise distances between the amino acid residues of a protein as distributions over binned distance ranges, are a key representation in protein structure prediction. We also touch on the distinction between genotype (an organism's genetic makeup) and phenotype (its observable traits). Finally, we highlight three important bioinformatics tasks: multiple sequence alignment (MSA), protein 3D structure prediction, and genotype-to-phenotype prediction.
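As a rough illustration of the idea behind a distogram, the sketch below computes the pairwise distance matrix for a few residue coordinates and then assigns each distance to a bin. The coordinates and bin edges are made-up assumptions chosen for readability, not the values used by AlphaFold 2. A trained model predicts a probability distribution over such bins for every residue pair, whereas this sketch only derives the target bin indices from known coordinates.

```python
import math


def pairwise_distances(coords):
    """Return the n x n matrix of Euclidean distances between residue positions."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def bin_distances(dist_matrix, bin_edges):
    """Map each distance to the index of the first bin edge it falls below.

    Distances at or beyond the last edge go into one extra overflow bin.
    """
    def to_bin(d):
        for k, edge in enumerate(bin_edges):
            if d < edge:
                return k
        return len(bin_edges)

    return [[to_bin(d) for d in row] for row in dist_matrix]


# Hypothetical 3-residue "protein": one representative atom per residue,
# coordinates in angstroms.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
distances = pairwise_distances(coords)

# Hypothetical bin edges: [0, 2), [2, 4), [4, 8), and 8 or more.
bins = bin_distances(distances, bin_edges=[2.0, 4.0, 8.0])
print(bins)  # [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
```

The matrix is symmetric with zeros on the diagonal, and it does not change if the whole structure is rotated or translated, which is one reason distance-based representations are convenient prediction targets.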
To build machine learning models for biological tasks, DNA and amino acid sequences must first be turned into numerical representations that preserve their content. We discuss encoding strategies for biological sequences, including character-level encoding (one token per nucleotide or amino acid) and k-mer encoding (one token per substring of length k). We also explore how biology informs ML model design, focusing on attention mechanisms for processing MSAs and on Invariant Point Attention (IPA), the geometry-aware attention module in the structure module of AlphaFold 2.
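The two encoding strategies can be sketched in a few lines of plain Python. The helper names (`one_hot`, `kmers`, `kmer_ids`) and the choice of a four-letter DNA alphabet are assumptions made for illustration; real pipelines typically also handle ambiguous symbols such as N, padding, and much larger amino acid vocabularies.

```python
from itertools import product

DNA_ALPHABET = "ACGT"


def one_hot(seq, alphabet=DNA_ALPHABET):
    """Character-level encoding: one row per symbol, with a single 1 per row."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in seq.upper():
        row = [0] * len(alphabet)
        row[index[ch]] = 1
        rows.append(row)
    return rows


def kmers(seq, k=3, stride=1):
    """Split a sequence into substrings of length k (overlapping when stride < k)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def kmer_ids(seq, k=3, alphabet=DNA_ALPHABET):
    """k-mer encoding: map each k-mer to an integer id in a fixed vocabulary."""
    vocab = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    return [vocab[kmer] for kmer in kmers(seq.upper(), k)]


print(one_hot("ACGT"))     # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(kmers("ATGGCC"))     # ['ATG', 'TGG', 'GGC', 'GCC']
print(kmer_ids("ATGGCC"))  # [14, 58, 41, 37]
```

Character-level encoding keeps the vocabulary tiny (4 symbols for DNA) but produces long sequences, while k-mer encoding trades a larger vocabulary (4^k entries for DNA) for tokens that each carry more local context.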
In conclusion, while AlphaFold 2 is a significant advance in protein structure prediction, protein folding is not yet a completely solved problem. We provide resources on AlphaFold 2 and ML for biology for those who want to explore further. This blog post is intended as a guide to the core concepts needed to get started with AlphaFold 2 and computational biology.