Disease variant prediction with deep generative models of evolutionary data

Frazer, Jonathan; Notin, Pascal; Dias, Mafalda; Gomez, Aidan; Min, Joseph K.; Brock, Kelly; Gal, Yarin; Marks, Debora S.

doi:10.1038/s41586-021-04043-8

Article
Published: 27 October 2021

Disease variant prediction with deep generative models of evolutionary data

Nature volume 599, pages 91–95 (2021)Cite this article

68k Accesses
208 Citations
467 Altmetric
Metrics details

Subjects

A Publisher Correction to this article was published on 17 December 2021

This article has been updated

Abstract

Quantifying the pathogenicity of protein variants in human disease-related genes would have a marked effect on clinical decisions, yet the overwhelming majority (over 98%) of these variants still have unknown consequences^1,2,3. In principle, computational methods could support the large-scale interpretation of genetic variants. However, state-of-the-art methods^{4,5,6,7,8,9,10} have relied on training machine learning models on known disease labels. As these labels are sparse, biased and of variable quality, the resulting models have been considered insufficiently reliable¹¹. Here we propose an approach that leverages deep generative models to predict variant pathogenicity without relying on labels. By modelling the distribution of sequence variation across organisms, we implicitly capture constraints on the protein sequences that maintain fitness. Our model EVE (evolutionary model of variant effect) not only outperforms computational approaches that rely on labelled data but also performs on par with, if not better than, predictions from high-throughput experiments, which are increasingly used as evidence for variant classification^{12,13,14,15,16}. We predict the pathogenicity of more than 36 million variants across 3,219 disease genes and provide evidence for the classification of more than 256,000 variants of unknown significance. Our work suggests that models of evolutionary information can provide valuable independent evidence for variant interpretation that will be widely useful in research and clinical settings.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on Springer Link
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

**Fig. 2: EVE accurately predicts disease-causing variants.**

**Fig. 3: EVE is as good as functional experiments at predicting clinical interpretations of variants.**

**Fig. 4: Predictions for variants in 3,219 genes.**

Causal machine learning for predicting treatment outcomes

Article 19 April 2024

Refining the impact of genetic evidence on clinical success

Article Open access 17 April 2024

Highly accurate protein structure prediction with AlphaFold

Article Open access 15 July 2021

Data availability

The data analysed and generated in this study, including multiple sequence alignments used in training, ClinVar annotations used for validation, population frequencies and predictions from our model, are available in Supplementary Information and at evemodel.org. Predictions from other computational models are available through http://database.liulab.science/dbNSFP. Source data are provided with this paper.

Code availability

The model code is available at https://github.com/OATML-Markslab/EVE, https://doi.org/10.5281/zenodo.5389490.

Change history

17 December 2021
A Correction to this paper has been published: https://doi.org/10.1038/s41586-021-04207-6

References

Van Hout, C. V. et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature 586, 749–756 (2020).
Article ADS PubMed PubMed Central CAS Google Scholar
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Landrum, M. J. & Kattman, B. L. ClinVar at five years: delivering on the promise. Hum. Mutat. 39, 1623–1630 (2018).
Article PubMed Google Scholar
Raimondi, D. et al. DEOGEN2: prediction and interactive visualization of single amino acid variant deleteriousness in human proteins. Nucleic Acids Res. 45, W201-W206 (2017).
Article CAS PubMed PubMed Central Google Scholar
Feng, B. J. PERCH: a unified framework for disease gene prioritization. Hum. Mutat. 38, 243–251 (2017).
Article CAS PubMed PubMed Central Google Scholar
Ioannidis, N. M. et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99, 877-885 (2016).
Article CAS PubMed PubMed Central Google Scholar
Ionita-Laza, I., McCallum, K., Xu, B. & Buxbaum, J. D. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat. Genet. 48, 214–220 (2016).
Article CAS PubMed PubMed Central Google Scholar
Jagadeesh, K. A. et al. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat. Genet. 48, 1581-1586 (2016).
Article CAS PubMed Google Scholar
Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47, D886–D894 (2019).
Article CAS PubMed Google Scholar
Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
Article CAS PubMed PubMed Central Google Scholar
Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424 (2015).
Article PubMed PubMed Central Google Scholar
Findlay, G. M. et al. Accurate classification of BRCA1 variants with saturation genome editing. Nature 562, 217–222 (2018).
Article ADS CAS PubMed PubMed Central Google Scholar
Glazer, A. M. et al. High-throughput reclassification of SCN5A variants. Am. J. Hum. Genet. 107, 111–123 (2020).
Article CAS PubMed PubMed Central Google Scholar
Giacomelli, A. O. et al. Mutational processes shape the landscape of TP53 mutations in human cancer. Nat. Genet. 50, 1381–1387 (2018).
Article CAS PubMed PubMed Central Google Scholar
Mighell, T. L., Evans-Dutson, S. & O’Roak, B. J. A saturation mutagenesis approach to understanding PTEN lipid phosphatase activity and genotype–phenotype relationships. Am. J. Hum. Genet. 102, 943–955 (2018).
Article CAS PubMed PubMed Central Google Scholar
Jia, X. et al. Massively parallel functional testing of MSH2 missense variants conferring Lynch syndrome risk. Am. J. Hum. Genet. 108, 163–175 (2021).
Article CAS PubMed Google Scholar
Cao, Y. et al. The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals. Cell Res. 30, 717–731 (2020).
Article PubMed PubMed Central Google Scholar
Gudbjartsson, D. F. et al. Large-scale whole-genome sequencing of the Icelandic population. Nat. Genet. 47, 435–444 (2015).
Article CAS PubMed Google Scholar
Esposito, D. et al. MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biol. 20, 223 (2019).
Article PubMed PubMed Central Google Scholar
Trenkmann, M. Putting genetic variants to a fitness test. Nat. Rev. Genet. 19, 667 (2018).
Article CAS PubMed Google Scholar
Rehm, H. L. et al. ClinGen—the Clinical Genome Resource. N. Engl. J. Med. 372, 2235–2242 (2015).
Article CAS PubMed PubMed Central Google Scholar
Grimm, D. G. et al. The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Hum. Mutat. 36, 513–523 (2015).
Article PubMed PubMed Central Google Scholar
Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
Article CAS PubMed PubMed Central Google Scholar
Marks, D. S. et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE 6, e28766 (2011).
Article ADS CAS PubMed PubMed Central Google Scholar
Hopf, T. A. et al. Sequence co-evolution gives 3D contacts and structures of protein complexes. eLife 3, e03430 (2014).
Article PubMed Central Google Scholar
Lapedes, A., Giraud, B. & Jarzynski, C. Using sequence alignments to predict protein structure and stability with high accuracy. Preprint at https://arxiv.org/abs/1207.2484v1 (2012).
Vaser, R., Adusumalli, S., Leng, S. N., Sikic, M. & Ng, P. C. SIFT missense predictions for genomes. Nat. Protoc. 11, 1–9 (2016).
Article CAS PubMed Google Scholar
Reva, B., Antipin, Y. & Sander, C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res. 39, e118 (2011).
Article CAS PubMed PubMed Central Google Scholar
Rezende, D. J., Mohamed, S. & Wierstra, D. in Proceedings of the 31st International Conference on Machine Learning vol. 32 (eds Xing, E. P. & Jebara, T.) 1278–1286 (PMLR, 2014).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2013).
Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
Article CAS PubMed PubMed Central Google Scholar
Suzek, B. E., Wang, Y., Huang, H., McGarvey, P. B. & Wu, C. H. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
Article CAS PubMed Google Scholar
Kalia, S. S. et al. Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics. Genet. Med. 19, 249–255 (2017).
Article PubMed Google Scholar
Frigo, G. et al. Homozygous SCN5A mutation in Brugada syndrome with monomorphic ventricular tachycardia and structural heart abnormalities. Europace 9, 391–397 (2007).
Article PubMed Google Scholar
Itoh, H. et al. Asymmetry of parental origin in long QT syndrome: preferential maternal transmission of KCNQ1 variants linked to channel dysfunction. Eur. J. Hum. Genet. 24, 1160–1166 (2016).
Article CAS PubMed Google Scholar
Glazer, A. M. et al. Deep mutational scan of an SCN5A voltage sensor. Circ. Genom. Precis. Med. 13, e002786 (2020).
Article CAS PubMed PubMed Central Google Scholar
Bouvet, D. et al. Methylation tolerance-based functional assay to assess variants of unknown significance in the MLH1 and MSH2 genes and identify patients with Lynch syndrome. Gastroenterology 157, 421–431 (2019).
Article CAS PubMed Google Scholar
Pan, X. et al. Structure of the human voltage-gated sodium channel Na_v1.4 in complex with β1. Science 362, eaau2486 (2018).
Article PubMed CAS Google Scholar
Fishel, R. et al. The human mutator gene homolog MSH2 and its association with hereditary nonpolyposis colon cancer. Cell 75, 1027–1038 (1993).
Article CAS PubMed Google Scholar
Peltomaki, P. Role of DNA mismatch repair defects in the pathogenesis of human cancer. J. Clin. Oncol. 21, 1174-1179 (2003).
Article CAS PubMed Google Scholar
Warren, J. J. et al. Structure of the human MutSα DNA lesion recognition complex. Mol. Cell 26, 579–592 (2007).
Article CAS PubMed Google Scholar
Brnich, S. E. et al. Recommendations for application of the functional evidence PS3/BS3 criterion using the ACMG/AMP sequence variant interpretation framework. Genome Med. 12, 3 (2019).
Article PubMed PubMed Central Google Scholar
Lewontin, R. C. The Genetic Basis of Evolutionary Change (Columbia Univ. Press, 1974).
Kreitman, M. Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophila melanogaster. Nature 304, 412-417 (1983).
Article ADS CAS PubMed Google Scholar
Sunyaev, S. et al. Prediction of deleterious human alleles. Hum. Mol. Genet. 10, 591–597 (2001).
Article CAS PubMed Google Scholar
IUCN. The IUCN red list of threatened species. IUCN https://www.iucnredlist.org (2020).

Download references

Acknowledgements

We thank members of the Marks laboratory, OATML and C. Sander for many valuable discussions. J.F., M.D. and K.B. are supported by the Chan Zuckerberg Initiative CZI2018-191853. K.B. is also supported by the US National Institutes of Health (R01 R01GM120574). P.N. is supported by GSK and the UK Engineering and Physical Sciences Research Council (EPSRC ICASE award no. 18000077). A.G. is a Clarendon Scholar and Open Philanthropy AI Fellow. Y.G. holds a Turing AI Fellowship (Phase 1) at the Alan Turing Institute, which is supported by EPSRC grant reference V030302/1. D.S.M. holds a Ben Barres Early Career Award by the Chan Zuckerberg Initiative as part of the Neurodegeneration Challenge Network, CZI2018-191853.

Author information

These authors contributed equally: Jonathan Frazer, Pascal Notin, Mafalda Dias

Authors and Affiliations

Marks Group, Department of Systems Biology, Harvard Medical School, Boston, MA, USA
Jonathan Frazer, Mafalda Dias, Joseph K. Min, Kelly Brock & Debora S. Marks
OATML Group, Department of Computer Science, University of Oxford, Oxford, UK
Pascal Notin, Aidan Gomez & Yarin Gal
Broad Institute of Harvard and MIT, Cambridge, MA, USA
Debora S. Marks

Authors

Jonathan Frazer
View author publications
You can also search for this author in PubMed Google Scholar
Pascal Notin
View author publications
You can also search for this author in PubMed Google Scholar
Mafalda Dias
View author publications
You can also search for this author in PubMed Google Scholar
Aidan Gomez
View author publications
You can also search for this author in PubMed Google Scholar
Joseph K. Min
View author publications
You can also search for this author in PubMed Google Scholar
Kelly Brock
View author publications
You can also search for this author in PubMed Google Scholar
Yarin Gal
View author publications
You can also search for this author in PubMed Google Scholar
Debora S. Marks
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

D.S.M. and Y.G. led the research. J.F., P.N. and M.D. conceived and implemented the end-to-end approach. A.G. contributed technical advice. K.B. supported with data preparation. J.K.M. developed the website. J.F., P.N., M.D., Y.G. and D.S.M. wrote the manuscript.

Corresponding authors

Correspondence to Yarin Gal or Debora S. Marks.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature thanks Martin Kircher and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Bayesian VAE architecture details.

The Bayesian VAE architecture in EVE is comprised of a symmetric 3-layer encoder & decoder architecture (with 2,000-1,000-300 and 300-1,000-2,000 units respectively) and a latent space of dimension 50. After performing a one-hot encoding of the input sequence across amino acids (zeros in white, ones in green), we flatten the input before performing the forward pass through the network. We use a single set of parameters for the encoder (ϕp) and learn a fully-factorized gaussian distribution over the weights of the decoder (θp): weight samples for the decoder are obtained by sampling a random normal variable (rnv), multiplying that sample by the standard deviation parameters, and subsequently adding the mean parameters. A one-dimensional convolution is applied on the un-flattened output of the decoder to capture potential correlations between amino-acid usage. Finally, a softmax activation turns the final output into probabilities over amino acids at each position of the sequence (low values in white, high values in dark green). The overall network is trained by maximizing the Evidence Lower Bound (ELBO), which forms a tractable lower bound to the log-marginal likelihood (Supplementary Methods and Fig. 1).

Extended Data Fig. 2 Comparison of performance of Bayesian VAE and DeepSequence against 38 deep mutation scans.

Comparison between the performance of the Bayesian VAE architecture in EVE and DeepSequence⁴⁶ which achieves state-of-the-art performance on the protein function prediction task. “Evolutionary indices” were computed by sampling 2k times from the approximate posterior distribution and by ensembling the obtained indices over 5 independently trained VAEs (Supplementary Methods).

Extended Data Fig. 3 Evolutionary index separates pathogenic and benign variants.

a, Average evolutionary index per protein, and corresponding standard deviations, for variants with known Benign and Pathogenic ClinVar labels across 3,219 proteins (sorted by alphabetical order). On the right, marginal distributions of the means over the 3,219 proteins. Evolutionary index separates pathogenic and benign labels consistently across proteins. b, Two-component Gaussian Mixture Models (GMM) over the distributions of the evolutionary indices (histograms) for all the single amino acid variants of 3,219 proteins combined (top, left) and for P53, PTEN and SCN5A separately (top right, bottom left and right, respectively). The dashed black line is the marginal likelihood for the GMM model, i.e. the likelihood of a variant sequence after marginalizing the latent variable that corresponds to the mixture assignment; the dashed blue and red lines represent the relative share of the marginal likelihood from the benign and pathogenic clusters respectively (i.e. the product of the marginal likelihood by each cluster).

Extended Data Fig. 4 EVE prediction for actionable genes and EVE comparison to other computational methods, including meta-predictors.

a, EVE AUCs versus ClinVar labels for set of ACMG “actionable genes”³³ that have 15 or more labels (shown in parentheses). AUCs are computed both for EVE scores of all variants (pale blue), and of the 75% variants with most confident scores (dark blue) (Supplementary Methods). b, Performance comparison of EVE to state-of-the-art computational variant effect predictors: 7 unsupervised, 8 supervised, and 8 supervised meta-prediction methods. Size of marker indicates how many genes for which the method would be relevant (on a per-protein basis validation) (Supplementary Methods, Supplementary Notes 2, Fig. 2, Supplementary Tables 3, 4).

Extended Data Fig. 5 Computational model EVE as good as high-throughput experiments for clinical labels.

(Companion to Fig. 3) a, Comparison of computational model predictions (upper panels EVE score) and experimental assay predictions (lower panels, experimental assay metric) to ClinVar labels (dots) and VUS (crosses) and where pale red and pale blue crosses indicate EVE assignments of VUS. Dashed red and blue lines correspond to EVE predictions after removing the 25% most uncertain variants (computed on all variants across all proteins; see Supplementary Methods). x-axes are position in protein. Experimental measurements data from deep mutational scans of P53¹⁴, from left (WT_Nutlin-3, A549_p53NULL_Nutlin-3, A549_p53NULL_Etoposide), SCN5A¹³, and BRCA1¹². b, Scatter plots of experiment scores (y-axis) against EVE scores (x-axis). Experimental measurements data from deep mutational scans same as a (Supplementary Methods, Supplementary Table 6).

Extended Data Fig. 6 Comparison of label policies, and comparison of EVE and experimental predictions of clinical labels.

a, The y-axis is the subset of the ACMG actionable protein list with at least 5 benign and 5 pathogenic labels with at least a one-star review status in ClinVar, mean for the 3,219 proteins and mean for this subset. x-axis is AUCs computed using these labels (deep blue), labels with at least a two-star review status (light grey) and a more lenient labelling policy (sky blue), as defined in Supplementary Methods. b, AUC of EVE predictions (blue circle) and experimental predictions (blue cross) computed on ClinVar labels. Whilst most of the papers that provide these experimental results refer to the goal of predicting association to human disease, the assays vary in their relevance to disease phenotype. Results use high-quality labels whenever they are sufficient for robust validation (MSH2, P53, BRCA1) and lenient labels for all other cases, and 2017-release ClinVar data whenever experimental results were used in defining labels reported in 2021 (P53 and BRCA1). Reported averages of all displayed AUC values, and of AUCs computed exclusively on 2017 and 2021-reslease high-quality labels (Supplementary Methods, Supplementary Table 5,6).

Extended Data Fig. 7 EVE has many more genes that can be validated on, compared to supervised methods.

Mean number of genes, for EVE (dark blue) and a supervised method (light blue), that have sufficient labels for validation (5 (left), 10 (middle) and 20 labels (right)). We assume a 90% train 10% test random split of all labels in ClinVar for the supervised methods.

Extended Data Fig. 8 Data provided on our server evemodel.org.

Screenshot of evemodel.org for the example of KCNQ2. Our server provides information about each protein: aggregate AUC/Accuracy, performance curves (ROC & Precision-recall), variant-level EVE scores, classification and uncertainties, as well as the multiple sequence alignments used for training. All data is available to download both in bulk and for individual genes.

Extended Data Fig. 9 Fraction of genes per person with more than one variant.

Density function of the fraction of total genes per person with at least two variants, though not necessarily in the same chromosome. Data extracted from 50k genomes of the UK Biobank with self-reported ethnicity backgrounds (Supplementary Methods).

Extended Data Fig. 10 Performance as a function of alignment depth.

Average AUC of EVE scores as a function of N/L_cov for the subset of genes with at least 10 known clinical labels (5 benign and 5 pathogenic). For this subset of genes, the performance of the model can be carefully validated using AUCs. There is no strong correlation between alignment depth and performance: while models with very deep alignments tend to have good performance, models with very low N/L_cov can also have AUC close to 1.

Supplementary information

Supplementary Information

This file contains Supplementary Methods, Supplementary Note 1 on the limitations of supervised modeling methods and Supplementary Note 2 with comments on meta-predictors.

Reporting Summary

Supplementary Table 1

Statistics of multiple sequence alignments used in training.

Supplementary Table 2

Summary of class assignments and combination with other sources of evidence.

Supplementary Table 3

Comparison of performance of EVE and other computational models at predicting ClinVar labels.

Supplementary Table 4

Comparison of performance of EVE and other computational models at predicting results from high-throughput functional assays.

Supplementary Table 5

Sensitivity of predictions to label policy.

Supplementary Table 6

Comparison of performance of high-throughput experiments and EVE at predicting clinical labels.

Supplementary Table 7

Genes with at least 10 labels sorted by EVE performance.

Supplementary Table 8

Variants for which there is disagreement between EVE predictions and ClinVar labels.

Supplementary Table 9

All pairs of variants occurring in the same gene over the UK biobank population for the actionable genes defined by ACMG.

Source data

Source Data Fig. 2

Source Data Fig. 3

Source Data Fig. 4

Rights and permissions

Reprints and permissions

About this article

Cite this article

Frazer, J., Notin, P., Dias, M. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021). https://doi.org/10.1038/s41586-021-04043-8

Download citation

Received: 18 December 2020
Accepted: 20 September 2021
Published: 27 October 2021
Issue Date: 04 November 2021
DOI: https://doi.org/10.1038/s41586-021-04043-8

This article is cited by

Cellular and molecular mechanisms of aspartoacylase and its role in Canavan disease
- Martin Grønbæk-Thygesen
- Rasmus Hartmann-Petersen
Cell & Bioscience (2024)
Population-based BRCA germline mutation screening in the Han Chinese identifies individuals at risk of BRCA mutation-related cancer: experience from a clinical diagnostic center from greater Shanghai area
- Zhiyuan Wu
- Qingyun Zhang
- Ming Guan
BMC Cancer (2024)
FiTMuSiC: leveraging structural and (co)evolutionary data for protein fitness prediction
- Matsvei Tsishyn
- Gabriel Cia
- Fabrizio Pucci
Human Genomics (2024)
Minimum information and guidelines for reporting a multiplexed assay of variant effect
- Melina Claussnitzer
- Victoria N. Parikh
- Alan F. Rubin
Genome Biology (2024)
MAGPIE: accurate pathogenic prediction for multiple variant types using machine learning approach
- Yicheng Liu
- Tianyun Zhang
- Ning Shen
Genome Medicine (2024)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.