Introduction

Deciphering the functional consequence of genetic variation within and across populations is a fundamental question of biology. To address this, a combination of techniques to interrogate changes on both systems-wide and mechanistic scales is required (Fig. 1). Systems-wide approaches provide a high-level view and generate networks that describe how different proteins or genes relate to each other or to environmental perturbations. Such networks have proved highly informative, enabling functional annotations of proteins and conveying information on the architectures of entire biological systems1,2. Protein–protein interaction (PPI) networks describe which proteins interact3,4,5 (Fig. 1a). Experimental methods to determine PPIs include affinity purification–mass spectrometry (AP–MS)6,7, yeast two-hybrid (Y2H) screening8 and protein fractionation9. AP–MS and protein fractionation identify proteins that form complexes together in a cell type of interest, whereas Y2H uses a yeast reporter system to identify binary interactions. PPI networks describe proteins that are in physical contact but lack the resolution to discern mechanism, which often requires knowledge of the structures of the proteins and the complexes they form. Typically, high-resolution protein structures are determined using biophysical approaches, such as X-ray crystallography10, cryogenic electron microscopy (cryo-EM)11 and NMR spectroscopy12 (Fig. 1b). These methods are key for elucidating protein mechanisms and designing drugs that bind to active sites or disrupt PPIs. However, traditional structural biology methods are often time-consuming and rely on purification of the relevant proteins, which is not always feasible. Furthermore, they take place in vitro, which can introduce artefacts and may not always reflect biologically relevant protein conformations.

Fig. 1: Readouts, scale and resolution.

A complete understanding of cellular processes requires measurements of physical and functional properties at a low-resolution, systems-wide scale and at high resolution of individual components. a | Protein–protein interaction networks describe which proteins bind to each other and are generated using methods such as affinity purification–mass spectrometry (AP–MS), protein fractionation and yeast two-hybrid screening. b | High-resolution structures of proteins and their complexes are determined using biophysical methods, such as X-ray crystallography, cryogenic electron microscopy (cryo-EM) and NMR spectroscopy, that typically take place in vitro. c | Functional interaction networks (left panel) describe how different genes or proteins or regions thereof affect the function of each other, or how they respond to drugs. Functional connections are determined using methods such as genetic or chemical–genetic interaction mapping. Improvements in these methods and the related field of coevolution have recently enabled the structures of proteins and their complexes to be determined (right panel).

PPI mapping and traditional structural biology are centred on proteins and their physical attributes. Genetic methods provide a functional context by means of measuring the phenotypic consequences of perturbing proteins or PPI networks. The characterization of genetic interactions13, which describes how mutations in different genes affect one another, has proved a particularly useful complement to PPI networks. Systematic mapping of genetic interactions enables the generation of functional interaction networks, shedding light on the biological purpose of the PPIs14,15 (Fig. 1c, left panel). Until recently, systematic genetic analyses were applied only at a whole-gene or protein level, relying on traditional structural biology for deciphering mechanistic actions. Over the past decade, developments in genetic interaction mapping and the related field of coevolution, which studies how protein residues evolve together, have allowed structural biology to be tackled on a genetic basis. By identifying pairs of residues that are related through genetic interactions or coevolution, these methods are providing high-resolution functional information sufficient to model the structures of proteins and their complexes (Fig. 1c, right panel).

In this Review, we describe the fundamentals of coevolution and genetic interaction mapping, and outline how these methods have evolved over the past decades. We discuss how technical advances and the growth of protein sequence databases have enabled the application of these methods to inform structural modelling of proteins and protein complexes. We also describe chemical–genetic interaction mapping, which is closely related to genetic interaction mapping and has similarly been used for structural modelling. We list applications of these methods and discuss emerging approaches that will enable expansion into new systems. For brevity, we do not discuss traditional structural biology methods (reviewed in refs16,17,18,19).

Coevolution and deep learning approaches

The genetic material of all living organisms evolves over time. This evolution takes place in the form of alterations to the DNA sequence, often as single base substitutions. Coevolution analysis is based on the principle that amino acid residues in a protein, or in two interacting proteins, mutate and evolve together when they reside in the same functional region20. For example, in a single protein, spatially proximal amino acid residues that are essential to a specific function are likely to evolve together over time. Similarly, with two interacting proteins, if one protein evolves in the binding interface, the other protein can develop complementary changes in the interface to avoid disruption of the interaction site. This evolutionary phenomenon was observed more than three decades ago20, and its application to predicting residue–residue contacts was made feasible a few years later with the growth of protein sequence databases and increases in computational power21,22,23,24,25.

Modelling protein structures using coevolution

Accurate identification of residue–residue contacts is crucial for coevolution-based protein structure modelling. Residue–residue contacts are predicted by generating a multiple sequence alignment of a protein family and identifying correlations in amino acid changes for pairs of residue positions across the alignment. Early methods used local statistical models to determine covariation between residue pairs, relying on the assumption that each correlated residue pair is independent of all other pairs21,22,23,26,27. Thus, while computationally efficient, these approaches failed to accurately represent real proteins, in which each residue can interact with many others. As a result, the local approaches were not able to distinguish direct from indirect correlations between residue pairs. Direct correlations reflect true residue–residue contacts, whereas indirect correlations arise for pairs that coevolve without being in contact. Indirect correlations can arise, for example, between residues that are evolutionarily constrained through a network path of direct contacts28. Accurate structure prediction requires that only direct correlations be considered. Hence, the local statistical models were sufficient to predict contacts but lacked the resolving power necessary to model entire protein structures. During the past decade, local models have been replaced by global models, which recognize that correlated pairs are dependent on each other and furthermore incorporate the conservation of individual residues29,30,31,32,33. Global models enable the distinction of directly coupled residue pairs from those that should be excluded from the analysis because they are indirectly coupled. 
Crucially, these technical advancements have been accompanied by the rapid growth of protein sequence databases such as UniProt34, increasing the coverage of sequence space across the members of protein families and making possible the systematic comparison of evolutionary changes at residue level in prokaryotes. Together, these developments paved the way for using coevolution to model the structures of monomeric proteins. The first successful determination of protein folds using coevolution was achieved by EVfold35,36, followed by other methods, such as DCA-fold37, FILM3 (ref.38) and GREMLIN39 (Fig. 2a).
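The local covariation analysis described above can be illustrated with a minimal sketch. It assumes mutual information as the covariation measure, which is one common choice for a local model and is not necessarily the statistic used by the cited methods; the toy alignment and the function names are hypothetical. Each column pair is scored independently of all others, which is precisely why such scores cannot separate direct from indirect correlations.

```python
import math
from collections import Counter
from itertools import combinations


def column(msa, i):
    """Return the residues found at alignment position i across all sequences."""
    return [seq[i] for seq in msa]


def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


def rank_pairs(msa):
    """Score every pair of positions independently and sort by decreasing score."""
    length = len(msa[0])
    scores = {
        (i, j): mutual_information(column(msa, i), column(msa, j))
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy alignment: positions 0 and 1 always change together,
# position 2 is fully conserved, position 3 varies on its own.
toy_msa = ["AKLE", "AKLD", "GRLE", "GRLD", "AKLE", "GRLD"]

if __name__ == "__main__":
    for pair, score in rank_pairs(toy_msa):
        print(pair, round(score, 3))
```

In this toy case the perfectly covarying positions 0 and 1 rank first, whereas any pair involving the conserved position 2 scores zero. A global model would instead fit all pairwise couplings jointly, which is beyond the scope of this sketch.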

Fig. 2: Structural modelling of proteins and their complexes using coevolution.

Coevolution methods identify pairs of amino acid residues within or between proteins that have evolved together. Such pairs are often close in space and can be used to derive spatial restraints for structural modelling. a | To identify coevolving residue pairs in a protein, a multiple sequence alignment of its protein family is first generated. Pairs of sequence positions whose residue types change in a correlated fashion across the sequence alignment are coevolving and are likely to be close in space. Spatial restraints are generated based on predicted contacts and used for modelling the protein structures. b | Similar to part a, but coevolving residue pairs are here identified across the sequence alignments of an interacting pair of proteins. Here, the predicted residue contacts are thus between two different proteins, and the resulting restraints are used for modelling protein complexes instead of individual proteins. c | Random mutagenesis is carried out on an antibiotic resistance gene, and plasmids harbouring the gene variants are transformed into cells, followed by selection for functional copies of the gene. Surviving variants are again exposed to random mutagenesis and reintroduced into the assay. After a sufficient number of cycles, variants are deep sequenced to identify coevolving residue pairs and structural modelling is carried out as in part a. Filled circles represent sequence positions and the colours represent different residue types (grey denotes any residue type).

Modelling of protein complexes and prediction of PPIs using coevolution

The same coevolution principles used to determine residue–residue contacts within a protein can be used to determine residue–residue contacts between proteins. However, a key challenge lies in the identification of orthologues to generate the paired multiple sequence alignments required for quantifying coevolution among residues between two proteins. Only organisms that contain both interacting proteins can be used for the multiple sequence alignments, and the interacting pairs must be correctly paired in each species, which is particularly difficult if the proteins have paralogues that perform other cellular functions32,40,41,42,43. To enable prediction of PPIs and modelling of their interfaces (Fig. 2b), most studies have limited their scope to protein pairs that are likely to interact based on specific criteria. For example, several efforts have focused on protein pairs encoded close to each other in conserved genomic locations (for example, on the same operon)40,41, or pairs of protein families with members known to interact42,44. Although these studies demonstrated that coevolution could in principle be used for the systematic identification of PPIs, the challenges of scaling to unbiased and proteome-wide predictions made this unfeasible in practice. Furthermore, coevolution methods are computationally costly, and applying them to identify PPIs requires the combinatorial pairing of all possible interaction partners. A recent effort tackled these challenges via a combination of techniques to systematically identify PPIs in Escherichia coli and Mycobacterium tuberculosis using coevolution45. Hundreds of previously uncharacterized PPIs were discovered by quantifying the coevolution of residue pairs across several millions of protein pairs in both organisms. 
The high computational requirements were managed via a multistep protocol incorporating a faster pre-screen using local models26, followed by global models32,39 and structural modelling to home in on the highest confidence interactors. This study showed that coevolution is highly effective for PPI prediction in binary complexes, but less so in higher-order complexes or those that contain nucleic acids45.
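The pairing requirement for interprotein coevolution can be sketched as follows. This is a deliberately conservative illustration, not the procedure of the cited studies: it assumes that each family is supplied as a hypothetical mapping from species name to its aligned sequences, and it keeps only species that have exactly one member of each family, thereby sidestepping the paralogue problem rather than solving it.

```python
def pair_alignments(family_a, family_b):
    """Build a paired alignment from two per-species sequence collections.

    family_a and family_b map a species name to the list of aligned
    sequences from that species (hypothetical input format). Only species
    present in both families, with exactly one sequence in each, are kept.
    """
    paired = {}
    for species in sorted(family_a.keys() & family_b.keys()):
        seqs_a = family_a[species]
        seqs_b = family_b[species]
        # More than one candidate means paralogues; the correct partner
        # is ambiguous, so this sketch skips the species entirely.
        if len(seqs_a) == 1 and len(seqs_b) == 1:
            paired[species] = seqs_a[0] + seqs_b[0]
    return paired


# Hypothetical toy input.
family_a = {"sp1": ["AKL"], "sp2": ["GRL"], "sp3": ["AKV", "AKI"], "sp4": ["GKL"]}
family_b = {"sp1": ["DE"], "sp2": ["NE"], "sp3": ["DQ"], "sp5": ["NQ"]}

if __name__ == "__main__":
    print(pair_alignments(family_a, family_b))
```

Residue pairs that span the two halves of each concatenated row can then be scored for covariation in the same way as pairs within a single protein. Discarding every species with paralogues shrinks the alignment, which is one reason why the studies cited above restrict their scope or use more elaborate pairing criteria.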

Experimental evolution

Coevolution has proved powerful for determining the structures of proteins and their complexes. However, the requirement of large protein families with sufficient diversity and the obfuscating effects of paralogues limit the applicability of the approach. An experimental method (3Dseq)46 was recently developed that generates protein sequence variation in the laboratory, identifies coevolving residues in the resulting variants and then applies computational coevolution methods for structure modelling. The approach relies on iterative generation of mutations in a given gene using error-prone PCR and exposure to a medium that selects functional variants of the gene (Fig. 2c). Selected populations are deep sequenced, and coevolving residue pairs are identified by comparing sequences across the population, allowing inference of residue couplings and structural modelling using the same principles as for natural coevolution. The method was applied to two antibiotic resistance proteins from Pseudomonas (β-lactamase PSE1 and acetyltransferase AAC6) expressed in E. coli, with functional selection by ampicillin for PSE1 and kanamycin for AAC6, resulting in accurate high-resolution models of both structures46. As 3Dseq does not rely on natural variation, it is particularly well suited to proteins that lack the large number of family members required for natural coevolution modelling and should provide an avenue for tackling eukaryotic systems.

Deep learning-based approaches

In addition to experimental evolution, numerous computational developments have refined and extended the coevolution field. Improved statistical models30,39,47 have increased accuracy and decreased the required number of aligned protein sequences. Incorporation of metagenome sequencing datasets has provided a means of increasing the sequence space accessed by multiple sequence alignments48. Finally, several new methods, such as RaptorX49, ComplexContact50 and DeepCov51, use deep learning to extract and integrate additional protein sequence features with the coevolution data for contact prediction. Although these advances increased the accuracy of modelling and enabled systematic studies across prokaryotic proteomes, the technology has, in most cases, not been applied to eukaryotic proteins and complexes.

Recent advances in deep learning have led to a revolutionary development in the form of the neural network-based AlphaFold52, which routinely predicts protein structures at near-experimental accuracy, in prokaryotes as well as eukaryotes. The AlphaFold (version 2) engine makes use of constraints on protein structure derived from evolution, physics and geometry. During training, AlphaFold parses experimental protein structures deposited in the Protein Data Bank (PDB)53, as well as clustered protein sequence databases, such as BFD52 and UniRef90 (ref.54), learning rules that govern the modelling of structure from sequence. The neural network takes as input a multiple sequence alignment of a given protein and its family members and extracts evolutionary information for individual residues as well as for residue pairs. Combining this information with components learnt from the PDB enables the final structure prediction52.

AlphaFold has proved remarkably effective for determining the structures of individual proteins and their complexes. The AlphaFold model, trained on single protein chains, was showcased on nearly the entire human proteome, resulting in confident structure predictions for 58% of all residues55. In comparison, experimental efforts over the past several decades have together resulted in structural coverage of 17% of human protein residues55. Similarly, a study across 11 different proteomes found that AlphaFold added structure determination for on average 25 percentage points of additional residues over existing experimental structures or those that could be derived by homology modelling56. Interestingly, despite being trained on single proteins, AlphaFold proved capable of modelling the structures of protein complexes56,57,58. Most recently, AlphaFold-Multimer has been released, featuring a model trained on multimeric protein structures, which clearly outperforms the standard AlphaFold for modelling protein complex structures59.

Inspired by the performance of AlphaFold, the RoseTTAFold60 software was developed using similar ideas. The accuracy of RoseTTAFold is generally somewhat lower than that of AlphaFold, but the predictions are faster and require less computational power60. RoseTTAFold provided early evidence that this technology can model protein complexes in addition to individual proteins60. Recently, the respective strengths of RoseTTAFold and AlphaFold were combined to not only model but also identify protein complexes61. The high speed of RoseTTAFold was leveraged to examine more than 4 million paired multiple sequence alignments to generate a set of approximately 5,500 potential PPIs in Saccharomyces cerevisiae (budding yeast). AlphaFold was then applied to this smaller set to identify higher-confidence candidate protein complexes and model their structures61. Importantly, like all technologies discussed in this Review, these methods rely on data generated from experimental approaches and should be viewed as powerful complements to these62, rather than as replacements.
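The staged strategy described above, in which a fast method narrows a very large set of candidate pairs before a slower, more accurate method is applied, can be sketched generically. The two scoring callables, the cutoffs and the toy scores below are placeholders introduced for illustration; they do not correspond to the actual interfaces or confidence scores of RoseTTAFold or AlphaFold.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]


def two_stage_screen(
    pairs: Iterable[Pair],
    fast_score: Callable[[Pair], float],
    slow_score: Callable[[Pair], float],
    fast_cutoff: float,
    slow_cutoff: float,
) -> List[Tuple[Pair, float]]:
    """Pre-screen with a cheap score, then re-score only the survivors."""
    # Stage 1: the cheap score is evaluated for every candidate pair.
    shortlisted = [pair for pair in pairs if fast_score(pair) >= fast_cutoff]
    # Stage 2: the expensive score is evaluated for the shortlist only.
    rescored = [(pair, slow_score(pair)) for pair in shortlisted]
    confident = [(pair, score) for pair, score in rescored if score >= slow_cutoff]
    return sorted(confident, key=lambda item: item[1], reverse=True)


# Hypothetical toy scores standing in for the two models.
toy_fast = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.7}
toy_slow = {("A", "B"): 0.95, ("A", "C"): 0.99, ("B", "C"): 0.3}

if __name__ == "__main__":
    hits = two_stage_screen(
        toy_fast.keys(), toy_fast.get, toy_slow.get, fast_cutoff=0.5, slow_cutoff=0.8
    )
    print(hits)
```

The design trades recall for cost: a pair rejected by the fast stage is never seen by the slow stage, even if the slow stage would have scored it highly, so the first cutoff should be chosen permissively.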

Genetic and chemical–genetic interactions

A complementary approach to coevolution and deep learning-based methods leverages the measurement of genetic interactions, providing a means for structural modelling using sets of intentionally designed mutations.

For most organisms, such as Homo sapiens, budding yeast or E. coli, any given gene is typically directly functionally related to only a small number of other genes. Thus, when deleting or otherwise perturbing two different genes, the cellular response will most often reflect the combined effect of the two as independent contributions. Genetic interactions arise between genes for which the response deviates from this expectation, indicating that the genes are functionally related. Genetic interactions can be measured by multiple phenotypic readouts, but often centre around cell replication and survival as this can be informative for most systems, including unicellular organisms and human cancer cells. Positive genetic interactions arise when the cell is either no sicker (epistatic) or healthier (buffering) than the sickest single mutant. This may indicate factors that operate in the same pathway or are subunits of the same non-essential complex63. Conversely, negative genetic interactions (synthetic sick or lethal) occur when mutations in two genes lead to a more severe growth defect than expected. This may reflect factors that function in parallel pathways or are non-essential subunits of the same essential protein complex (Fig. 3a).
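The notion of a deviation from the expected combined effect can be made concrete with a small sketch. It assumes a multiplicative null model, in which the expected fitness of the double mutant is the product of the two single-mutant fitnesses measured relative to wild type; this is a widely used convention but only one of several possible definitions, and the tolerance and function names are illustrative.

```python
def interaction_score(fitness_a, fitness_b, fitness_ab):
    """Deviation of the double mutant from the multiplicative expectation.

    All fitness values are relative to wild type (wild type = 1.0).
    The multiplicative null model is an assumption of this sketch.
    """
    expected = fitness_a * fitness_b
    return fitness_ab - expected


def classify(fitness_a, fitness_b, fitness_ab, tolerance=0.05):
    """Label an interaction as positive, negative or absent (illustrative cutoff)."""
    score = interaction_score(fitness_a, fitness_b, fitness_ab)
    if score > tolerance:
        return "positive"
    if score < -tolerance:
        return "negative"
    return "none"


if __name__ == "__main__":
    # Double mutant matches the product of the single mutants: no interaction.
    print(classify(0.8, 0.9, 0.72))
    # Double mutant is no sicker than the sickest single mutant: positive.
    print(classify(0.8, 0.9, 0.80))
    # Double mutant is far sicker than expected: negative (synthetic sick).
    print(classify(0.8, 0.9, 0.10))
```

In practice, the tolerance would be derived from replicate measurements rather than fixed in advance.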

Fig. 3: Mapping of genetic and chemical–genetic interactions.

Genetic and chemical–genetic interactions describe the functional relationships between pairs of mutations or between a mutation and a drug, respectively. a | A positive genetic interaction between two gene deletions may indicate that the gene products operate in the same pathway (G1–G2 or G3–G4), whereas a negative interaction can arise if the products of the deleted genes belong to parallel pathways (for example, G1–G3). b | Positive interactions between a drug (D) and a gene deletion can indicate an antagonistic relationship (for example, D–G1), whereas a negative interaction may indicate that the gene product belongs to a parallel pathway of the drug target (for example, D–G3). c | The epistatic miniarray profile (E-MAP) and synthetic genetic array (SGA) approaches allow for high-throughput measurements of genetic or chemical–genetic interactions between a set of test mutants (y-axis) and a genome-scale library (x-axis). Each row constitutes the genetic interaction profile for a test mutant (A–E), and clustering these by similarity (tree on right) provides a functional organization of the mutants. d | Deep mutational scanning (DMS) can be used to measure genetic interactions between all pairwise combinations of point mutations in a gene. For each pair of residue positions (left), all possible combinations of amino acids (aa) are measured (right), which can be used to generate a composite genetic interaction score for the position pair. Depictions in parts c,d are illustrative subsets of much larger interaction maps.

Chemical–genetic interactions, similar to genetic interactions, describe how the presence or absence of a drug or environmental perturbation affects the phenotype of a single genetic mutation. Here, a positive interaction reflects that drug treatment has a lesser effect on the mutant phenotype than expected, which could indicate that the drug inhibits pathways in which the mutated gene functions. By contrast, negative chemical–genetic interactions arise when the effect of a mutation in the presence of a drug is more severe than expected, potentially indicating that the drug inhibits a parallel pathway (Fig. 3b). Notably, the relationships that form the basis of genetic and chemical–genetic interactions are often more complex than the illustrative examples provided here.

Systematic analysis of genetic and chemical–genetic interactions

Early work on concepts that underlie genetic interactions focused on small numbers of genes that were already known to affect a given phenotype of interest13. In the early 2000s, the creation of gene deletion libraries in budding yeast and advances in high-throughput technologies paved the way for systematic mapping of genetic and chemical–genetic interactions64. A key development was introduced by synthetic genetic array (SGA), which enabled the rapid crossing of a set of test mutants across a deletion library in a plate-based format, providing an efficient means of identifying synthetic lethal interactions15. A different method, diploid-based synthetic lethal analysis with microarrays (dSLAM), relied on barcoded yeast mutants grown in a pooled competitive format, where microarrays were used to quantify the amounts of the different single and double mutants65. These methods were primarily developed to identify negative genetic interactions. The ability to capture positive genetic interactions was introduced by epistatic miniarray profile (E-MAP), which expanded on SGA to provide quantitative measurements of the entire spectrum of genetic interactions in a high-throughput format66,67. This approach enables the generation of a continuous genetic interaction profile for each test mutant, consisting of its scores across all deletion library mutants; these profiles can be used to group together proteins that are functionally related or belong to the same complex14,67,68,69,70 (Fig. 3c). In parallel with these developments, related methods were designed for determining chemical–genetic interactions, following a similar format but using a library of chemical perturbations in place of the deletion library71,72 (Fig. 3c). Chemical–genetic interaction mapping relies on methods similar to those of genetic interaction mapping but is considerably less complex, as it simply relies on the addition of drugs to the plates or pools of single mutants65,71,72,73,74.
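Grouping test mutants by the similarity of their genetic interaction profiles can be sketched as follows. The example assumes Pearson correlation as the similarity measure and complete profiles without missing values; both are simplifying assumptions, and the scores and mutant names are invented for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equally long score vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rank_profile_similarity(profiles):
    """Compare all pairs of test mutants and sort by decreasing similarity."""
    similarities = {
        (m1, m2): pearson(profiles[m1], profiles[m2])
        for m1, m2 in combinations(sorted(profiles), 2)
    }
    return sorted(similarities.items(), key=lambda item: item[1], reverse=True)


# Hypothetical interaction scores of three test mutants against the same
# four library mutants (negative values denote negative interactions).
toy_profiles = {
    "A": [-3.0, 0.5, 2.0, -1.0],
    "B": [-2.5, 0.0, 1.5, -1.5],
    "C": [1.0, -2.0, 0.0, 3.0],
}

if __name__ == "__main__":
    for pair, r in rank_profile_similarity(toy_profiles):
        print(pair, round(r, 2))
```

Mutants A and B, whose profiles rise and fall together, emerge as the most similar pair and would be placed next to each other by hierarchical clustering, whereas C is anticorrelated with both.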

Systematic genetic and chemical–genetic interaction mapping (for example, chemical–genetic miniarray profile (CG-MAP)) have proved highly effective for organizing genes on the basis of function on both local and global levels14,67,68,69,70,71,74,75,76. The technologies have been adapted to different model systems, including Caenorhabditis elegans77, E. coli75,76, Schizosaccharomyces pombe78 and Drosophila melanogaster cell lines79. More recently, advances in RNA interference (RNAi) and CRISPR–Cas9 (ref.80) genome editing have enabled expansion into mammalian cells81,82,83,84,85.

Genetic interactions of point mutants

Most genetic interaction maps have focused on whole-gene deletions or knockdowns. However, early studies in budding yeast investigated the genetic interaction profiles for limited numbers of point mutants. For example, alanine scan mutations of the actin gene ACT1 were screened for genetic interactions with more than 200 genes that had been shown to exhibit complex haploinsufficiencies in a strain hemizygous for ACT1 (ref.86). The screen revealed that alanine mutations in close proximity on the actin surface shared many interactions (that is, exhibited similar genetic interaction profiles), suggesting that they may be disrupting the same PPI binding interfaces86. Similarly, an early budding yeast E-MAP that focused on chromatin biology included three alleles of the POL30 gene14, which encodes the multifunctional protein PCNA that functions in DNA replication and repair and in chromatin assembly. The pol30-79 point mutant allele gave rise to a genetic interaction profile similar to that of pol30-DAmP (a gene knockdown allele), suggesting a destabilizing effect on the protein. The genetic interaction profiles of these mutants were consistent with a defective DNA replication and repair system14,63,87. By contrast, the pol30-8 allele, which perturbs a different region of PCNA, exhibited genetic interactions relating to defects in chromatin assembly. Interestingly, this allele has been shown to diminish the PPI between PCNA and chromatin assembly factor 1 (CAF1)88. These results indicated that genetic interactions provide a high level of resolution and allow the dissection of multifunctional proteins into regions that are functionally and physically connected to other factors. Spurred by these findings, the E-MAP technology was extended to screen entire libraries of point mutations in a set of related proteins to generate point mutant E-MAPs (pE-MAPs)89,90. 
Quantitative SGA screens have also included large numbers of point mutations; however, these have generally been chosen on the basis of their phenotype as temperature-sensitive alleles of essential genes, rather than systematic mutations of a specific protein or complex68,69.

Concurrently with pE-MAP, a complementary approach termed deep mutational scanning (DMS) was developed91. DMS set out to tackle the problem of identifying the most informative mutations to study in a protein, without the requirement of preselecting residues of interest. To this end, the method allows for a comprehensive screen of point mutations in a protein or protein domain. DMS relies on the rapid synthesis of large numbers of mutations in a gene, in conjunction with a genotype–phenotype coupled selection assay. In its most basic form, DMS quantifies the effects of individual point mutations on a specific function, via the chosen selection assay. However, it can also be applied to pairs of point mutations to quantify genetic interactions91 (Fig. 3d).
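A minimal sketch of how DMS sequencing counts might be turned into genetic interaction scores is given below. It assumes that the effect of a variant is expressed as the log2 change in its abundance across selection relative to wild type, and that the expected effect of a double mutant is the sum of the two single-mutant effects in log space. Published analyses differ in their exact scoring and error models, and the pseudocount and function names here are illustrative.

```python
import math


def log_enrichment(before, after, wt_before, wt_after, pseudocount=0.5):
    """log2 change in variant abundance across selection, relative to wild type.

    The pseudocount guards against zero read counts (illustrative default).
    """
    variant_ratio = (after + pseudocount) / (before + pseudocount)
    wt_ratio = (wt_after + pseudocount) / (wt_before + pseudocount)
    return math.log2(variant_ratio / wt_ratio)


def pairwise_interaction(single_a, single_b, double):
    """Observed double-mutant effect minus the log-additive expectation."""
    return double - (single_a + single_b)


if __name__ == "__main__":
    # Hypothetical read counts before and after selection.
    wt = (1000, 1000)
    effect_a = log_enrichment(100, 50, *wt)
    effect_b = log_enrichment(100, 50, *wt)
    effect_ab = log_enrichment(100, 50, *wt)
    # The double mutant is no worse than either single mutant, so the
    # score is positive.
    print(round(pairwise_interaction(effect_a, effect_b, effect_ab), 2))
```

Scores for all amino acid combinations at a given pair of positions can then be summarized into one composite value for that position pair.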

The development of pE-MAP and DMS enabled the systematic study of the relationship between genetic interactions and residue distances in a protein structure. The first pE-MAP covered 53 budding yeast point mutants in RNA polymerase II (RNAPII), crossed against a library of 1,200 deletion and knockdown mutants89. This study revealed that pairs of residues that exhibited similar genetic interaction profiles were typically close in space, whether they resided in the same or different RNAPII subunits89,90. Several early DMS studies revealed similar patterns for the pairwise genetic interactions between point mutants92,93,94. For example, a screen of double mutants of 75 residues in the RRM2 domain of the budding yeast PAB1 protein showed that both positive and negative genetic interactions were enriched at shorter distances between the mutated residues92. These findings were supported in a screen of genetic interactions for all pairs of mutations in 55 residues of the IgG binding domain of streptococcal protein G (GB1)93. In some proteins, such as those regulated by allostery, these trends can differ. For example, a recent pE-MAP screen of the molecular switch Gsp1/Ran revealed that the genetic interaction profiles of interface mutations reflected their biophysical effects on the switch cycle kinetics, instead of their interface locations95. These studies highlight how genetic interactions ultimately report on mechanism and showcase the complementarity of this technology to traditional structural biology approaches.

Modelling the structures of proteins and their complexes using genetic and chemical–genetic interactions

Similar to coevolution, genetic interaction data have been used for structural modelling of proteins and their complexes. The key challenge remains how to derive spatial restraints between pairs of residues that can be used for modelling. pE-MAP and DMS provide complementary strengths for this purpose. For example, DMS can provide comprehensive genetic interaction measurements of all possible residue–residue combinations in a protein. Indeed, these fine-grained data can be used to model the secondary structure and tertiary structure of small proteins or domains96,97,98 (Fig. 4a,b). Two groups96,97 examined genetic interaction data from DMS scans of GB1 (ref.93), the RRM2 domain of the budding yeast PAB1 protein92, the human YAP65 WW domain99 and the heterodimer FOS–JUN100. The authors set out to use the genetic interaction data from each of these studies to predict structural contacts between residue pairs in the respective protein domains and to test whether the contacts could be used for structure determination96,97. The GB1 dataset was the most comprehensive and covered nearly all possible mutation pairs across 55 residues, which allowed the determination of residue contacts and accurate modelling of both secondary and tertiary structure of the domain96,97. The RRM2 and WW domain datasets covered only a fraction of the possible double mutants and were sequenced less deeply. Although contact prediction was possible with these datasets, the secondary structure predictions were not accurate. The fold of a 22–24 residue section of the WW domain could be modelled; however, the RRM2 domain fold could not96,97. The data for the FOS–JUN dimer covered a stretch of 32 residues on each monomer and enabled contact predictions across the interface96,97. The predicted contacts were then incorporated into a protein docking of the two monomers as spatial restraints, greatly improving the accuracy of the models compared with docking without DMS-derived restraints96. 
Finally, one of the studies also predicted contacts in an RNA molecule96,101, the twister ribozyme from Oryza sativa, suggesting that DMS could be used for RNA structure prediction. Interestingly, although the two studies96,97 harnessed different ranges of the genetic interaction data and used different interaction metrics for computing contact predictions, they nonetheless arrived at similar results. This suggests that the approach is robust and highlights the massive information content of DMS data. Accordingly, both groups showed that sparser data subsets still allowed modelling of the GB1 structure at an accuracy similar to that achieved when using the complete dataset. These findings highlight the potential of DMS as a structural biology tool, and other studies have further applied it to successfully reveal structural features of intrinsically disordered proteins102,103.

Fig. 4: Structural modelling of proteins and their complexes using genetic and chemical–genetic interactions.

a | Deep mutational scanning (DMS) relies on the rapid synthesis of mutated variants (blue, red or green) of a gene, which are cloned into vectors and introduced into an assay (here, cell-based) that competitively selects for variants with particular traits. The composition of variants is determined via deep sequencing before and after selection, allowing for identification of variants that are enriched or depleted by the selection. b | When using DMS to measure genetic interactions, each gene variant contains two point mutations (stars). The selection assay identifies mutant pairs that are enriched (positive genetic interaction) or depleted (negative genetic interaction) compared with an expectation from the quantities of each single mutant. Likely residue contacts are identified based on the genetic interactions and used for modelling the structure of the protein. c | The point mutant epistatic miniarray profile (pE-MAP) approach relies on in vivo screening of a set of point mutants in two or more interacting proteins against a large library of gene deletions and/or knockdowns (pE-MAP) or chemicals (chemical–genetic miniarray profile (CG-MAP)). The resulting genetic (or chemical–genetic) interaction profiles often consist of more than 1,000 genetic interactions for each point mutant. Pairwise comparison of the profiles provides measures of genetic similarity between all pairs of tested point mutants. High similarity between a pair of point mutants indicates a likely contact between the mutated residues. The structure of the protein complex is modelled using this relationship for pairs of residues that reside in different subunits of the complex.

Whereas DMS is well suited for modelling the structures of small proteins and domains, the pE-MAP approach is more appropriate for determining structures of protein assemblies. pE-MAP has lower coverage than DMS but enables comparison of genetic interactions across residues in any number of interacting proteins in a single screen, which facilitates the modelling of interactions. Additionally, pE-MAP provides systems-wide cellular information for every mutated residue via its genetic interaction profile with thousands of other mutants in different pathways and processes. A recent study harnessed these traits to use pE-MAP and chemical–genetic interaction data to determine the structures of protein complexes104 (Fig. 4c). Using a technique termed integrative structure determination105 (Box 1), the authors modelled the structures of three protein complexes: histones H3 and H4 in budding yeast; subunits Rpb1 and Rpb2 of RNAPII in budding yeast; and subunits RpoB and RpoC of bacterial RNA polymerase (RNAP) in E. coli. The histone pE-MAP included a comprehensive alanine scan as well as context-specific mutations, resulting in a map of 350 histone mutants crossed against 1,370 deletion or knockdown mutants104. Distance restraints between H3–H4 residue pairs were devised using the similarity of genetic interaction profiles between the corresponding mutations. These restraints were then applied to arrange the structures of the H3 and H4 subunits, capturing the interface of their interaction and obtaining an accurate structure of the H3–H4 complex. The RNAPII dataset provided an opportunity to test the performance of the approach on a system that differs vastly from that of the histones. Specifically, Rpb1 and Rpb2 are much larger than the histones (1,200–1,700 residues versus 100–140 residues) and the RNAPII pE-MAP is much sparser, with 53 point mutants crossed against 1,200 deletion or knockdown mutants89. In addition, the authors split Rpb1 into two domains for the structural modelling to test the applicability to a higher-order system. The model of this three-body complex proved accurate, suggesting that the approach is generalizable and can effectively harness the contents of sparse datasets. Extending the use of the approach to chemical–genetic interactions, the authors accurately modelled the RpoB–RpoC complex of bacterial RNAP using a CG-MAP of 44 point mutants subjected to 83 different environmental stresses106. This showed transferability of the approach to chemical–genetic interaction maps despite the reduced size of the interaction profiles in this dataset. Finally, in a comparison of integrative structure determination using cross-linking mass spectrometry (XL-MS) data and pE-MAP data, the authors found that the two performed similarly, but crucially led to higher-accuracy models when combined104. Thus, a key value of the methods described in this Review is that their data types are typically orthogonal to those traditionally used in structural biology, allowing data integration that results in improved models105 (Box 1).
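The step from genetic interaction profiles to distance restraints can be sketched as follows. The sketch assumes Pearson correlation as the similarity measure and a simple linear mapping from similarity to an upper distance bound. Both are illustrative assumptions, as are the residue labels, the distance limits in ångströms and the function names. The mapping used in the cited study is not reproduced here.

```python
import math

def pearson(x, y):
    """Pearson correlation over positions measured in both profiles.

    Missing scores are given as None and skipped. Returns None when the
    correlation is undefined (too few shared points or zero variance).
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

def similarity_to_upper_bound(similarity, min_dist=8.0, max_dist=40.0):
    """Hypothetical monotone mapping: higher similarity, tighter bound (in Å).

    Negative similarities are treated as uninformative and receive the
    loosest bound.
    """
    s = max(0.0, min(1.0, similarity))
    return max_dist - s * (max_dist - min_dist)

def inter_subunit_restraints(profiles_a, profiles_b):
    """Build restraints only for residue pairs in different subunits."""
    restraints = []
    for res_a, prof_a in profiles_a.items():
        for res_b, prof_b in profiles_b.items():
            r = pearson(prof_a, prof_b)
            if r is None:
                continue
            restraints.append((res_a, res_b, r, similarity_to_upper_bound(r)))
    # Most similar pairs first: these are the most likely contacts.
    return sorted(restraints, key=lambda item: -item[2])

# Toy profiles (invented values): one point mutant in subunit A and two in
# subunit B, each scored against the same four deletion strains.
subunit_a = {"A_res10": [1.0, -2.0, 0.5, 3.0]}
subunit_b = {
    "B_res7": [2.0, -4.0, 1.0, 6.0],
    "B_res55": [-1.0, 2.0, -0.5, -3.0],
}

for res_a, res_b, r, upper in inter_subunit_restraints(subunit_a, subunit_b):
    print(res_a, res_b, round(r, 2), upper)
```

In an integrative modelling setting, restraints of this kind would be scored alongside restraints from other data types, such as XL-MS cross-links, rather than used alone.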

Emerging approaches

A key promise and challenge for the methods discussed in this Review is the expansion into new systems, scales and organisms. The continued success of this field will rely on the effective integration of complementary data types to best make use of available methods (Fig. 1). In particular, the integration of experimental data with those from computational coevolution and deep learning models should prove valuable. Such efforts will likely benefit from a fine-grained interpretation of the scale and resolution represented by each data type. For example, it has been shown that residue–residue contacts derived from coevolution are more accurate when compared with experimentally determined side chain contacts than with more commonly used backbone contacts107. This finding suggests that the dominant effect observed in coevolution reflects side chain interactions, and could be harnessed to generate more precise models when computationally feasible.
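The distinction between backbone and side chain contacts can be made concrete with a small sketch. It assumes a backbone contact is defined by the distance between Cα atoms and a side chain contact by the minimum distance between any pair of side chain heavy atoms. The cutoffs of 8 Å and 5 Å are illustrative conventions chosen for this example, not values taken from the cited work, and the coordinates are invented.

```python
import math

BACKBONE_ATOMS = {"N", "CA", "C", "O"}

def backbone_contact(res_i, res_j, cutoff=8.0):
    """Contact defined by the distance between the two C-alpha atoms."""
    return math.dist(res_i["CA"], res_j["CA"]) <= cutoff

def side_chain_contact(res_i, res_j, cutoff=5.0):
    """Contact defined by the closest pair of side chain heavy atoms."""
    side_i = [xyz for name, xyz in res_i.items() if name not in BACKBONE_ATOMS]
    side_j = [xyz for name, xyz in res_j.items() if name not in BACKBONE_ATOMS]
    if not side_i or not side_j:
        # Residues without side chain heavy atoms (for example, glycine)
        # cannot form a side chain contact under this definition.
        return False
    closest = min(math.dist(a, b) for a in side_i for b in side_j)
    return closest <= cutoff

# Toy residues: atom name -> (x, y, z) in Å. The C-alpha atoms are 10 Å
# apart, but the side chains point towards each other.
residue_1 = {"CA": (0.0, 0.0, 0.0), "CB": (1.5, 0.0, 0.0), "CG": (3.0, 0.0, 0.0)}
residue_2 = {"CA": (10.0, 0.0, 0.0), "CB": (8.5, 0.0, 0.0), "CG": (7.0, 0.0, 0.0)}

print(backbone_contact(residue_1, residue_2))
print(side_chain_contact(residue_1, residue_2))
```

Here the pair would be missed by the backbone definition but counted by the side chain definition, which is the kind of discrepancy that affects how coevolution-derived contacts are benchmarked.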

To better complement computational methods, there is a need to increase the speed and coverage of experimental genetic approaches. Advances in CRISPR–Cas9 genome editing (Box 2) are setting the stage for such developments. For example, chemical–genetic interaction mapping is primed for modelling PPIs on a proteome-wide scale in yeast, using a recent method to efficiently generate point mutations while surveying their drug sensitivities in a multiplexed fashion108 (Box 2). Guided by global PPI maps109, and using individual protein structures from traditional structural biology methods or AlphaFold/RoseTTAFold, this system should in principle enable the modelling of interaction interface structures across the yeast proteome. In addition to facilitating increased scale, CRISPR–Cas9 genome editing can be used for the systematic generation of point mutations in mammalian cells110,111,112,113,114. At present, these approaches are not suitable for mammalian pE-MAP screening, owing to incomplete editing, off-target effects or other technical obstacles (Box 2). However, these limitations are steadily diminishing110, setting the stage for genetics-based structural modelling of protein complexes in human cells and providing a means of characterizing the effects of disease-causing mutations. By integration with recent efforts to generate multi-scale models of entire cells115,116,117,118,119, genetic interaction mapping could thus inform on global function as well as the structures of protein complexes.

One of the most crucial, and currently tractable, applications to human systems relates to the rapidly growing field of host–pathogen interaction mapping120,121,122,123,124. This area of research is centred on the systematic identification of PPIs between pathogen and host proteins and the generation of interaction networks between the two organisms (Fig. 5a). These networks have proved highly effective for interrogating the mechanisms of infection, revealing important aspects of pathogen life cycles, host factor functions and host–pathogen interplay, as well as providing potential targets for drug discovery120,121,122,123,124. Host–pathogen PPI networks could be used as a blueprint for genetic interaction mapping between pathogen point mutants and human gene knockouts or knockdowns. To generate these maps, human cells would be infected with virus harbouring the relevant point mutations, and the human proteins from the PPI maps would be knocked down or knocked out (Fig. 5b), allowing for the construction of a host–pathogen genetic interaction map (Fig. 5c). The genetic interaction profiles of the viral point mutants would then be converted into spatial restraints for structural modelling of viral protein complexes (Fig. 5d), which would ultimately be re-integrated into the PPI map. The platforms required for such efforts have recently been developed. For example, a technology for generating viral E-MAPs (vE-MAPs), using infectivity as the readout, was recently applied to HIV infection in human cells125. In an analogous fashion, DMS could be used for modelling individual viral proteins, by employing suitable selection assays126. For example, a DMS platform was developed to structurally map mutations in the SARS-CoV-2 Spike receptor-binding domain that alter ACE2 binding or escape antibody recognition127,128. Many pathogens adapt rapidly to circumvent immune and drug responses128,129,130. Genetic interaction-driven modelling of pathogen protein structures will provide an avenue to identify the mechanisms of these changes, laying the groundwork for therapeutic intervention.

Fig. 5: Structural characterization of host–pathogen interaction networks.

a | A host–pathogen protein–protein interaction (PPI) network generated using affinity purification–mass spectrometry. The edges denote PPIs between pairs of proteins. b | To generate a host–pathogen point mutant epistatic miniarray profile (pE-MAP), host cells are infected with point mutant virus strains, in combination with CRISPR–Cas9 knockout (KO) or knockdown (KD) of the host genes identified in the host–pathogen PPI network (part a). c | The resulting pE-MAP comprises genetic interaction profiles for the viral point mutants, containing their genetic interactions with the library of host gene KOs and KDs. d | Viral genetic interaction profiles are compared across the subunits of viral protein complexes and the similarities are used for modelling their structures, which can then be integrated into the original network.

Conclusions

Structural modelling of proteins and protein complexes using genetically derived restraints lies at the intersection of network biology and structural biology. Until recently, these major areas of research were disparate and had little overlap. Network biology provided a large-scale systems view of interactions within and between cellular processes, whereas structural biology supplied structures of individual proteins and complexes, typically derived in vitro. Genetics-based structural modelling uses spatial restraints derived from functional data, such as coevolution or genetic interactions, to compute structural models. The methods are efficient and low cost, and enable structural characterization of protein interaction interfaces, with a potential to cover entire protein–protein interactomes, including those of host–pathogen systems. These techniques are not meant to replace traditional structural biology methods, which remain the gold standard in terms of resolution. Instead, the orthogonal datasets produced by genetics-based modelling are primed to complement traditional structural biology methods to provide a more accurate and complete description of the structures of proteins in vivo.