Predicting protein-protein interactions in the human proteome

3 Nov 2025

Science

25 Sep 2025

Vol 390, Issue 6771

Editor’s summary

A grand challenge in molecular biology is understanding which proteins in an organism interact with which other proteins, ideally with high-resolution accuracy. Zhang et al. developed a deep learning model for predicting protein complex structures and applied it to the human interactome. Their method involves processing additional data from available genomes and using it to supplement multiple sequence alignments. They also added information from domain-domain interactions extracted from AlphaFold2 predictions. Their model provides predicted structures for about 3600 previously unknown interactions of human proteins, including proteins involved in immunity, metabolism, and cell signaling. —Michael A. Funk

Structured Abstract

INTRODUCTION

Cellular functions are executed by proteins, which rarely act alone. Instead, they operate through a complex network of protein-protein interactions (PPIs) that orchestrate a plethora of biological processes. Coevolutionary analysis and deep-learning (DL)–based protein structure prediction have enabled high-fidelity, proteome-wide PPI identification in model organisms such as yeast and bacteria, but the comprehensive mapping of the human interactome remains a major challenge.

RATIONALE

Many human proteins are specific to higher eukaryotes, and their relatively short evolutionary histories limit the strength of coevolutionary signals between interacting partners. Simultaneously, the human interactome is complex, with a higher proportion of weak and transient interactions that pose challenges for existing DL tools. We reasoned that both coevolutionary signal strength and DL network performance could be improved by harnessing large-scale sequence and structural data that remain underexploited by the community. First, we assembled protein sequences from unannotated draft eukaryotic genomes and unassembled genomic reads to create omicMSAs that are sevenfold deeper than multiple sequence alignments (MSAs) constructed from publicly available eukaryotic protein sequences. Second, we leveraged domains, the evolutionary and structural units of proteins, to construct a domain-domain interaction dataset derived from 200 million predicted protein structures, expanding the training set for PPI prediction by 16-fold. This large, distilled dataset enabled us to develop RoseTTAFold2-PPI (RF2-PPI), a DL network optimized for PPI prediction accuracy and speed.

RESULTS

We developed a scalable pipeline that integrates coevolutionary analysis and DL to predict interactions among 200 million human protein pairs. Leveraging omicMSAs and RF2-PPI, we achieved considerably higher performance than ColabFold and AlphaFold3 while requiring only a fraction of the computational resources, enabling proteome-scale screening. In total, we predicted more than 29,000 PPIs with an estimated precision of 80% and recall of 10 to 30%. For downstream analyses, we focused on a high-confidence subset of nearly 18,000 PPIs, ~3600 of which are previously unreported. The predicted protein complexes offer rich insights into human biology and disease. Compared with experimental databases, our predictions are enriched in PPIs that involve transmembrane proteins, highlighting this method’s ability to detect interactions in challenging biological contexts. For example, we predicted interactions between G protein–coupled receptors and their potential downstream effectors, upstream ligands, or modulators. By mapping disease-associated single–amino acid variants onto our high-confidence predictions, we identified 4950 PPIs with interface-localized variants, pointing to potential molecular mechanisms underlying genetic disorders. Lastly, our binary predictions enabled the reconstruction of higher-order protein assemblies and the identification of previously unknown components of well-studied complexes. For instance, we predicted a tubulin polyglutamylase complex, several assemblies involved in cilia biogenesis, and new components for the telomere maintenance complex and the GPI-GnT complex, which catalyzes the first step in GPI anchor biosynthesis.

CONCLUSION

By combining large-scale data mining with DL, we address the challenge of predicting the hundreds of thousands of PPIs among the 200 million possible human protein pairs. Our ability to confidently predict 10 to 30% of these interactions represents a substantial step toward the long-term goal of computing the complete human interactome.
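The headline figures in the abstract can be cross-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only and is not part of the authors' pipeline. It assumes a proteome of roughly 20,000 proteins (a round number not stated in the abstract) and treats the quoted precision and recall as exact. The function names are hypothetical.

```python
from math import comb

# Assumption (not stated in the abstract): about 20,000 human proteins.
N_PROTEINS = 20_000


def unordered_pairs(n: int) -> int:
    """Number of distinct unordered pairs that can be formed from n proteins."""
    return comb(n, 2)


def implied_interactome_size(n_predicted: int, precision: float, recall: float) -> float:
    """Estimate the total number of true PPIs implied by a prediction set.

    true positives = n_predicted * precision
    recall         = true positives / total true PPIs
    """
    true_positives = n_predicted * precision
    return true_positives / recall


pairs = unordered_pairs(N_PROTEINS)
# Figures quoted in the abstract: >29,000 predictions, 80% precision, 10 to 30% recall.
low = implied_interactome_size(29_000, 0.80, 0.30)
high = implied_interactome_size(29_000, 0.80, 0.10)

print(f"{pairs:,} candidate pairs")
print(f"implied interactome size: {low:,.0f} to {high:,.0f} PPIs")
```

Under these assumptions the pair count comes to just under 200 million, and the implied interactome size falls between roughly 77,000 and 232,000 interactions, which is the same order of magnitude as the "hundreds of thousands" of PPIs mentioned in the conclusion.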
