Publication: Building better models for human disease genetics
Abstract
Interpreting the functional consequences of genetic variants is a central challenge in human genomics. Despite rapid advances in sequencing technologies, the majority of protein-coding variants remain uncharacterized and are classified as variants of unknown significance (VUSs). Variant effect predictors (VEPs) aim to address this gap by scoring the potential impact of variants, but most current models are limited by biases in training data, lack of interpretability, and poor generalizability beyond known disease genes. These limitations are particularly problematic in clinical settings, where accurate, genome-wide prediction of variant effect is essential.
To promote the development of clinically useful and generalizable VEPs, we worked to establish best-practice guidelines to address transparency, training data sourcing, and evaluation design. These guidelines identify common pitfalls such as circularity, overfitting to clinical labels, and limited benchmarking scope. Emphasis is placed on rigorous separation between training and evaluation data and open-source availability of models and scores. Building on these principles, we introduced a large-scale benchmarking resource, ProteinGym, to harmonize data from hundreds of deep mutational scanning (DMS) experiments and clinical variant annotations, enabling robust comparisons across predictors.
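As a rough illustration of what comparing predictors across many DMS experiments can involve, the sketch below scores one predictor against each assay by Spearman rank correlation and then averages across assays. It is a minimal sketch under stated assumptions, not the ProteinGym pipeline: the function names, the dictionary layout, the minimum of three shared variants, and the toy scores are all hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def benchmark(predicted, measured, min_variants=3):
    """Return per-assay Spearman correlations and their unweighted mean.

    Both arguments map assay name -> {variant: score} (hypothetical layout).
    Only variants present in both the prediction and the assay are compared.
    """
    per_assay = {}
    for assay, dms in measured.items():
        shared = sorted(set(dms) & set(predicted.get(assay, {})))
        if len(shared) < min_variants:
            continue  # too few shared variants for a meaningful correlation
        per_assay[assay] = spearman(
            [predicted[assay][v] for v in shared],
            [dms[v] for v in shared],
        )
    return per_assay, mean(per_assay.values())


# Toy data (invented): the predictor agrees with assay_A and disagrees with assay_B.
measured = {
    "assay_A": {"A1V": 0.1, "A1G": 0.5, "A1P": 0.9},
    "assay_B": {"K2R": 0.2, "K2E": 0.6, "K2D": 0.8},
}
predicted = {
    "assay_A": {"A1V": -3.0, "A1G": -2.0, "A1P": -1.0},
    "assay_B": {"K2R": -1.0, "K2E": -2.0, "K2D": -3.0},
}
per_assay, overall = benchmark(predicted, measured)
```

Averaging per assay rather than pooling all variants keeps one very large experiment from dominating the summary, which is one reason a per-assay design is a reasonable default here.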
We developed a novel model, popEVE, to integrate evolutionary sequence data with human population variation in a probabilistic framework. By calibrating missense variant scores against gene-level constraint derived from population data, popEVE enables comparison of variant deleteriousness across the proteome without relying on clinical labels or allele frequency-based heuristics. Applied to rare disease cohorts, popEVE identifies over 100 novel candidate developmental disorder genes and successfully ranks causal variants without parental data. When extended to phenome-wide burden testing in population cohorts, our model uncovers hundreds of novel gene–phenotype associations and enables the construction of disease-specific polygenic risk scores from rare missense variants alone. These results demonstrate the utility of combining deep evolutionary context with human-specific constraint to build generalizable, clinically meaningful models of variant effect.
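To make the idea of burden testing with variant scores concrete, the following sketch counts, per gene, how many cases and controls carry at least one missense variant scored as deleterious, and compares the counts with a one-sided Fisher's exact test. This is a generic illustration, not the test used in this work: the score direction (more negative means more deleterious), the threshold of -5, the data layout, and all names are assumptions.

```python
from math import comb


def fisher_one_sided(case_carriers, case_total, control_carriers, control_total):
    """P(at least `case_carriers` carriers among cases) under the
    hypergeometric null that carrier status is independent of case status."""
    carriers = case_carriers + control_carriers
    total = case_total + control_total
    upper = min(carriers, case_total)
    favourable = sum(
        comb(carriers, k) * comb(total - carriers, case_total - k)
        for k in range(case_carriers, upper + 1)
    )
    return favourable / comb(total, case_total)


def gene_burden(carried, scores, is_case, threshold):
    """Return {gene: p-value} for genes with at least one qualifying carrier.

    carried:   {individual: [(gene, variant), ...]}
    scores:    {(gene, variant): score}, lower = more deleterious (assumed)
    is_case:   {individual: bool}
    threshold: variants with score <= threshold count toward the burden
    """
    counts = {}  # gene -> [case carriers, control carriers]
    for person, variants in carried.items():
        # Each individual counts at most once per gene.
        genes_hit = {g for g, v in variants if scores[(g, v)] <= threshold}
        for gene in genes_hit:
            counts.setdefault(gene, [0, 0])[0 if is_case[person] else 1] += 1
    n_case = sum(is_case.values())
    n_control = len(is_case) - n_case
    return {
        gene: fisher_one_sided(c, n_case, k, n_control)
        for gene, (c, k) in counts.items()
    }


# Toy cohort (invented): three of four cases carry a low-scoring G1 variant.
is_case = {"c1": True, "c2": True, "c3": True, "c4": True,
           "k1": False, "k2": False, "k3": False, "k4": False}
carried = {
    "c1": [("G1", "v1")], "c2": [("G1", "v1")], "c3": [("G1", "v1")],
    "c4": [("G2", "v3")], "k1": [("G1", "v2")],
}
scores = {("G1", "v1"): -6.0, ("G1", "v2"): -1.0, ("G2", "v3"): -0.5}
p_values = gene_burden(carried, scores, is_case, threshold=-5.0)
```

Because the score is comparable across genes, a single threshold can be applied proteome-wide in this sketch; with gene-specific scores, the cutoff would need to be chosen per gene.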