AlphaFold Predicts Entire Human Proteome | DeepMind

AlphaFold Database: A Leap Forward in Protein Structure Prediction
DeepMind, alongside its research collaborators, has made available a comprehensive database detailing the 3D structures of almost all proteins found within the human body.
These structures were computationally predicted utilizing AlphaFold, the revolutionary protein folding system unveiled previously.
Impact on Scientific Research
This freely accessible database signifies a substantial advancement and offers considerable convenience to researchers spanning numerous scientific fields.
It is anticipated to serve as a cornerstone for a new era in both biological and medical research.
Database Details and Collaboration
The AlphaFold Protein Structure Database is the result of a partnership between DeepMind, the European Bioinformatics Institute, and other institutions.
Currently, the database contains hundreds of thousands of protein sequences, each with a structure predicted by AlphaFold.
Future plans involve expanding the database to include millions more proteins, effectively creating a global “protein almanac.”
A Significant Contribution from AI
“This undertaking, in our view, constitutes the most impactful contribution artificial intelligence has made to the progression of scientific understanding thus far,” stated Demis Hassabis, founder and CEO of DeepMind.
He further emphasized that this project exemplifies the positive societal benefits that can be derived from AI technologies.
The availability of this data is expected to accelerate discoveries and innovations across a wide range of scientific disciplines.
From Genome to Proteome
For those unfamiliar with the field of proteomics – a perfectly understandable situation – a helpful analogy is the extensive undertaking of human genome sequencing. This massive effort, carried out by numerous scientists and organizations globally throughout the late 1990s and early 2000s, proved to be a landmark achievement.
The completed genome has since been crucial in diagnosing and comprehending a vast array of conditions, as well as in the creation of pharmaceutical interventions and therapeutic strategies.
However, genome sequencing represented only the initial phase of research in this domain – akin to completing the border pieces of an enormous jigsaw puzzle.
Subsequently, attention shifted towards deciphering the human proteome, encompassing all proteins utilized by the human body and encoded within the genome.
The Complexity of the Proteome
The proteome presents a significantly greater challenge than the genome due to its inherent complexity. Like DNA, a protein is a linear chain of smaller molecular units.
While DNA uses only four bases (adenine, guanine, cytosine, and thymine), proteins are built from 20 distinct amino acids, each encoded by a triplet of bases known as a codon.
This difference alone introduces a substantial increase in complexity, but it is merely the beginning.
Protein sequences aren't simply informational code; they intricately fold into minuscule molecular structures, functioning as sophisticated machines within the body.
This transformation can be likened to evolving from binary code to a complex language capable of manifesting tangible objects.
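The step from bases to amino acids described above can be illustrated with a small sketch. It is not part of AlphaFold or the database; it only shows the standard genetic code at work, in which each three-base codon maps to one amino acid. The table below is deliberately partial, and the function name `translate` and the example sequences are made up for illustration.

```python
from itertools import product

# Partial codon table from the standard genetic code.
# Only the codons used in the examples are listed (assumption: toy input).
CODON_TABLE = {
    "ATG": "M",  # methionine, also the usual start codon
    "GCC": "A",  # alanine
    "GCT": "A",  # alanine again: several codons can encode one amino acid
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop codon
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence into one-letter amino acid codes."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Four bases taken three at a time give 4**3 possible codons,
# which is more than enough to cover 20 amino acids plus stop signals.
n_codons = len(list(product("ACGT", repeat=3)))

print(translate("ATGGCCAAATGGTAA"))
print(n_codons)
```

Note that the resulting string of letters says nothing yet about shape; that is the part of the problem the following paragraphs turn to.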
Challenges in Proteome Analysis
In practical terms, the proteome isn't just a collection of 20,000 sequences, each comprising hundreds of amino acids.
Each sequence possesses a unique physical conformation and performs a specific biological function.
Determining the three-dimensional structure resulting from a given amino acid sequence is a particularly difficult aspect of proteome analysis.
Traditionally, this has been achieved experimentally through techniques like x-ray crystallography, a lengthy and intricate process that can take months – even with access to state-of-the-art facilities – to resolve the structure of a single protein.
Computational prediction of protein structure has also been attempted, but its accuracy has historically been insufficient for reliable application – until the advent of AlphaFold.
A Paradigm Shift in Structural Biology
The field of computational proteomics has undergone a remarkable transformation. Moving beyond the distributed computing strategies of the past – such as the well-known Folding@home project – the last ten years have witnessed the development of increasingly refined methodologies. The arrival of AI-driven techniques then dramatically altered the landscape in 2019, with DeepMind’s AlphaFold surpassing all existing systems in performance.
Further advancements in 2020 yielded accuracy and reliability levels so substantial that some experts suggested the challenge of predicting a protein’s 3D structure from its amino acid sequence had been effectively resolved. This progress was exceptionally rapid, resolving a decades-old problem that previously demanded slow, costly, and only partially effective approaches.
While the technical details of DeepMind’s breakthroughs are best left to specialists in computational biology and proteomics for detailed analysis, the practical implications are immediately apparent. Since the release of AlphaFold 2 in 2020, the company has focused not only on refining the model but also on applying it to an extensive range of protein sequences.
Consequently, a predicted structure now exists for 98.5% of the proteins within the human proteome – a “folded” state, as it’s known in the field. This prediction is backed by a level of confidence from the AI model that researchers find trustworthy. Furthermore, the proteomes of 20 additional organisms, including yeast and E. coli, have been modeled, resulting in a total of approximately 350,000 protein structures.
This represents an unprecedented and vastly superior collection of vital information, and it will be accessible through a freely available, searchable database. Researchers can input a sequence or protein name to instantly retrieve the corresponding 3D structure. Details regarding the process and database are outlined in a recent publication in the journal Nature.
“The database, launching tomorrow, functions much like a search engine for protein structures,” explained Hassabis in a TechCrunch interview. “Users can visualize structures in 3D, explore the genetic sequence, and benefit from integration with EMBL-EBI’s other databases, allowing for the identification of related genes and proteins with similar functions.”
“As a scientist working with a complex protein,” stated Edith Heard of EMBL-EBI, “the ability to quickly determine a protein’s functional region is incredibly valuable. Previously, this would have taken years. Now, accessing the structure allows us to focus on understanding its function, accelerating scientific discovery by years – similar to the impact of genome sequencing decades ago.”
The novelty of this capability is such that Hassabis anticipates a fundamental shift within the field, and an evolution of the database itself in response to new demands.
“Structural biologists are still adjusting to the idea of readily accessing structural information that once required years of experimental determination,” he noted. “This should inspire new research questions and experimental designs. As we observe these changes, we will likely develop additional tools to support this newfound serendipity: for example, the ability to analyze 10,000 proteins sharing a specific characteristic. Such inquiries are currently impractical, but may become commonplace.”
The software itself has been released as open source, along with its development history, fostering further innovation. RoseTTAFold, an independently developed system from the University of Washington’s Baker Lab, built upon AlphaFold’s performance last year to create a comparable, yet more efficient, solution. DeepMind appears to have regained the lead with its latest iteration, but RoseTTAFold showed that the core principles are now widely available.
Groundbreaking Advances in Structural Biology
While it is encouraging to see structural bioinformaticians realize long-held aspirations, the collaboration between DeepMind and EMBL-EBI also brings tangible and immediate benefits. A prime example is their partnership with the Drugs for Neglected Diseases Initiative.
The DNDI, as its name suggests, concentrates on illnesses that do not typically attract the substantial attention or financial investment from major pharmaceutical companies and medical research institutions that would be necessary for treatment development. This lack of focus can be attributed to the diseases’ infrequent occurrence, or the economic limitations of those affected, even when the patient population reaches millions.
“This represents a significant practical challenge in clinical genetics, where a potential set of mutations is identified in a child with a condition, and the goal is to determine which alteration is most likely responsible for the genetic disease. Widespread availability of structural data will undoubtedly enhance our ability to accomplish this,” stated Ewan Birney of EMBL-EBI during a press briefing prior to the announcement.
Traditionally, analyzing the proteins suspected of causing a particular health issue has been a costly and lengthy process. For diseases affecting a limited number of individuals, both money and time are often scarce, since resources are prioritized for more prevalent conditions like cancers or dementia. However, being able to readily compare the structures of ten healthy proteins with ten mutated variants of the same protein could yield in seconds insights that might otherwise require years of meticulous laboratory investigation. Drug discovery and testing still take years, but for currently untreatable diseases that work could now begin immediately rather than being postponed until 2025.
(Update: Minor inaccuracies regarding the DNDI have been rectified in this article.)
An illustration depicting RNA polymerase II (a protein) functioning within yeast. Image Credits: Getty Images / JUAN GAERTNER/SCIENCE PHOTO LIBRARY
Not everything rests on computer predictions that have never been experimentally verified, as another, unrelated case shows. John McGeehan of the University of Portsmouth, who collaborated with DeepMind on a separate potential application, described how the predictions affected his team’s work on plastic decomposition.
“Upon submitting seven sequences to the DeepMind team, we already possessed experimental structures for two of them. This allowed us to validate the predictions upon their return, and it was a truly remarkable moment,” McGeehan explained. “The structures generated by DeepMind were identical to our crystal structures, and in some instances, even contained more comprehensive information. We were able to directly utilize this data to engineer more efficient enzymes for plastic breakdown, and those experiments are currently in progress. The acceleration to our project is, I would estimate, several years.”
The intention is to predict structures for every known and sequenced protein, approximately one hundred million, over the coming one to two years. In the vast majority of cases, biologists can expect a high degree of confidence in the results; the few structures that resist this methodology quickly become apparent.
The ability to inspect molecular structures in three dimensions has existed for decades, but determining those structures initially presents a significant challenge. Image Credits: DeepMind
The structural prediction process employed by AlphaFold often surpasses the capabilities of experimental methods. While some degree of uncertainty is inherent in any AI model’s operation, Hassabis emphasized that this system is not simply a “black box.”
“In this specific instance, explainability wasn’t merely a desirable feature—as is often the case in machine learning—but a fundamental requirement, given the critical nature of the intended applications,” he clarified. “We’ve invested more effort than ever before in ensuring the explainability of this system. This includes explainability at both the algorithmic level and in terms of the outputs, predictions, and structures, along with assessments of their reliability and the identification of trustworthy regions.”
Although he described the system as “miraculous,” Hassabis clarified that he does not consider the process magical; rather, he is astonished by the power of what the team’s collective work has produced.
“This was undoubtedly the most challenging project we’ve undertaken,” he stated. “Even with a complete understanding of the code and system functionality, and the ability to observe all outputs, the results remain somewhat miraculous. The transformation of a one-dimensional amino acid chain into these intricate three-dimensional structures—many of which are aesthetically pleasing as well as scientifically and functionally valuable—is truly remarkable. It was more an expression of awe than a claim of supernatural intervention.”
Unfolding the Proteome: Beyond AlphaFold
While the widespread effects of AlphaFold and the associated proteome database will take time to fully materialize, initial reports from collaborating researchers indicate substantial short- and long-term advancements are highly probable. However, this does not signify a complete resolution of the complexities inherent in understanding the proteome.
The inherent complexity of the proteome significantly surpasses that of the genome. Even with this significant leap forward, our understanding remains superficial. AlphaFold addresses a specific, albeit crucial, challenge: predicting the three-dimensional structure a sequence of amino acids will adopt.
Proteins do not function in isolation. They exist within a dynamic and intricate system, constantly undergoing conformational changes, fragmentation, and reformation. Their behavior is responsive to environmental conditions, the presence of various elements, and interactions with other proteins, leading to continuous reshaping.
Many human proteins for which AlphaFold provided predictions with limited confidence may be intrinsically “disordered.” These proteins exhibit a high degree of variability, making precise structural determination difficult. In such cases, a low-confidence prediction may itself be accurate, correctly flagging a protein that lacks the stable structure of a more static one.
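As a rough sketch of how a researcher might look for such low-confidence stretches, the example below scans a structure file for runs of residues with poor scores. The article does not describe the file format; the sketch relies on the convention that AlphaFold's PDB-format files store the per-residue confidence score (pLDDT, on a 0 to 100 scale) in the B-factor column. The cutoff of 50, the minimum run length of 3, the function names, and the synthetic eight-residue input are all illustrative assumptions, not values or tools taken from DeepMind.

```python
def toy_ca_line(serial: int, res_name: str, res_seq: int, plddt: float) -> str:
    """Build a synthetic PDB ATOM record for an alpha carbon (toy data only).

    Fields follow the fixed-column PDB layout; coordinates are set to zero.
    """
    return (
        f"{'ATOM':<6}{serial:>5} {' CA ':<4} {res_name:>3} {'A'}{res_seq:>4}"
        f"{'':4}{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


def ca_plddt(pdb_text: str) -> dict[int, float]:
    """Map residue number -> pLDDT, read from the B-factor column of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, 23-26 the residue number,
        # and 61-66 the B-factor (assumed here to carry pLDDT).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence_runs(scores, threshold=50.0, min_len=3):
    """Return (first, last) residue numbers of consecutive low-score runs."""
    runs, current = [], []
    for res in sorted(scores):
        is_low = scores[res] < threshold
        if is_low and (not current or res == current[-1] + 1):
            current.append(res)
            continue
        if len(current) >= min_len:
            runs.append((current[0], current[-1]))
        current = [res] if is_low else []
    if len(current) >= min_len:
        runs.append((current[0], current[-1]))
    return runs


# Synthetic scores: residues 4-7 fall below the assumed cutoff.
toy_scores = [95.1, 92.3, 88.0, 45.2, 38.9, 41.7, 30.5, 76.4]
toy_pdb = "\n".join(
    toy_ca_line(i, "ALA", i, s) for i, s in enumerate(toy_scores, start=1)
)

scores = ca_plddt(toy_pdb)
print(low_confidence_runs(scores))
```

A run flagged this way is only a candidate for disorder; a low score can also reflect a region the model simply could not resolve.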
The research team acknowledges the extensive work that remains. As Demis Hassabis stated, “The focus is now shifting towards tackling new challenges.” He further elaborated that areas like protein interaction, protein complexes, and ligand binding are already under investigation with preliminary projects underway.
Hassabis emphasized the historical context, noting that the computational biology community has pursued this problem for decades. He believes a major hurdle has now been overcome, stating, “We have now broken the back of that problem.”
Future Research Directions
- Investigating protein interaction networks.
- Analyzing the structures of protein complexes.
- Modeling ligand binding mechanisms.
- Further exploration of intrinsically disordered proteins.
These areas represent the next frontier in proteomic research, building upon the foundation laid by AlphaFold and promising further breakthroughs in our understanding of life’s fundamental building blocks.