Building better models for human disease genetics

Date

2025-12-05

Citation

Orenbuch, Rose Adrienne. 2026. Building better models for human disease genetics. Doctoral Dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Interpreting the functional consequences of genetic variants is a central challenge in human genomics. Despite the rapid acceleration of sequencing technologies, the majority of protein-coding variants remain uncharacterized and classified as variants of unknown significance (VUSs). Variant effect predictors (VEPs) aim to address this gap by scoring the potential impact of variants, but most current models are limited by biases in training data, lack of interpretability, and poor generalizability beyond known disease genes. These limitations are particularly problematic in clinical settings where accurate, genome-wide prediction of variant effect is essential.

To promote the development of clinically useful and generalizable VEPs, we worked to establish best-practice guidelines to address transparency, training data sourcing, and evaluation design. These guidelines identify common pitfalls such as circularity, overfitting to clinical labels, and limited benchmarking scope. Emphasis is placed on rigorous separation between training and evaluation data and open-source availability of models and scores. Building on these principles, we introduced a large-scale benchmarking resource, ProteinGym, to harmonize data from hundreds of deep mutational scanning (DMS) experiments and clinical variant annotations, enabling robust comparisons across predictors.

We developed a novel model, popEVE, to integrate evolutionary sequence data with human population variation in a probabilistic framework. By calibrating missense variant scores against gene-level constraint derived from population data, popEVE enables comparison of variant deleteriousness across the proteome without relying on clinical labels or allele frequency-based heuristics. Applied to rare disease cohorts, popEVE identifies over 100 novel candidate developmental disorder genes and successfully ranks causal variants without parental data. When extended to phenome-wide burden testing in population cohorts, our model uncovers hundreds of novel gene–phenotype associations and enables the construction of disease-specific polygenic risk scores from rare missense variants alone. These results demonstrate the utility of combining deep evolutionary context with human-specific constraint to build generalizable, clinically meaningful models of variant effect.

Keywords

Clinical genetics, Evolutionary models, Machine learning, Missense mutation, Population genetics, Variant effect prediction, Artificial intelligence, Genetics, Bioinformatics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
