Latent Structure in Cancer Genomics: Methods for Reproducibility, Computational Efficiency, and Causality

Date

2026-01-05

The Harvard community has made this article openly available.

Citation

Landy, Jenna. 2026. Latent Structure in Cancer Genomics: Methods for Reproducibility, Computational Efficiency, and Causality. Doctoral Dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Cancer arises from genomic changes and complex interdependencies among latent, or unmeasurable, biological mechanisms. Latent mechanisms include transcriptional programs (gene expression patterns governing cellular activity) and mutational processes (patterns of DNA sequence changes). These mechanisms are central to cancer development and can influence clinical prognosis and response to treatment. Understanding these latent processes is critical for modern cancer research and requires statistical models that can recover them from high-dimensional data.

This dissertation focuses on biologically interpretable latent-variable models. Latent variables allow high-dimensional genomic data to be represented in a low-dimensional space that captures underlying biological mechanisms. The three chapters introduce new methods that use latent-variable models to address reproducibility, computational efficiency, and causality.

Reproducibility is key for strong, credible findings in cancer genomics. Chapter 1 introduces fdrSAFE, a selective ensembling algorithm for estimating local false discovery rates. We show that this approach achieves robust near-optimality: it performs well across settings, whereas each baseline approach performs poorly in at least one. In addition to improving accuracy, the method removes the need for an arbitrary model choice, which improves the reproducibility of multiple testing corrections based on local false discovery rates.
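For readers unfamiliar with the quantity being estimated, the following is a minimal sketch of a local false discovery rate under the standard two-groups mixture model. It is not the fdrSAFE algorithm. The normal null and alternative densities, the function names, and the parameter values (`PI0`, `ALT_MEAN`, `ALT_SD`) are illustrative assumptions that do not come from the dissertation.

```python
import math


def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of a normal distribution evaluated at z."""
    u = (z - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(z, pi0, alt_mean, alt_sd):
    """Two-groups local fdr: pi0 * f0(z) / (pi0 * f0(z) + (1 - pi0) * f1(z)).

    f0 is a standard normal null density; f1 is an assumed normal
    alternative density. Both choices are illustrative only.
    """
    null_part = pi0 * normal_pdf(z)
    alt_part = (1.0 - pi0) * normal_pdf(z, alt_mean, alt_sd)
    return null_part / (null_part + alt_part)


# Hypothetical mixture: 90% nulls, alternatives centered at z = 3.
PI0, ALT_MEAN, ALT_SD = 0.9, 3.0, 1.0

for z in (0.0, 2.0, 4.0):
    print(f"z = {z:.1f}  lfdr = {local_fdr(z, PI0, ALT_MEAN, ALT_SD):.3f}")
```

In practice the mixture components are unknown and must be estimated, and different estimation models can give different answers. That model choice is the step the abstract describes fdrSAFE as removing.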

With rapidly advancing technologies and expanding data sources, computational efficiency is essential for scalable genomic analysis. Chapter 2 develops bayesNMF, a computationally efficient Gibbs sampler for Poisson Bayesian non-negative matrix factorization (NMF), in which the learned latent factors model underlying mutational processes. It uses Metropolis-Hastings steps to avoid computationally intensive Poisson augmentation and defines high-overlap, geometry-informed proposal distributions. We show that it performs as well as current MCMC methods while running up to 30 times faster.
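To make the underlying model concrete, here is a minimal sketch of Poisson NMF fitted with the classical multiplicative updates for the generalized Kullback-Leibler divergence. This is a point-estimate technique swapped in for illustration. It is not the bayesNMF Gibbs sampler and involves no priors or Metropolis-Hastings steps. The toy count matrix, the rank, and all function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def poisson_deviance(v, w, h, eps=1e-12):
    """Generalized KL divergence D(V || WH); lower means a better Poisson fit."""
    wh = matmul(w, h)
    total = 0.0
    for i in range(len(v)):
        for j in range(len(v[0])):
            mu = wh[i][j] + eps
            if v[i][j] > 0:
                total += v[i][j] * math.log(v[i][j] / mu)
            total += mu - v[i][j]
    return total


def poisson_nmf(v, rank, n_iter=200, seed=0, eps=1e-12):
    """Fit V ~ Poisson(W H) with non-negative W, H by multiplicative updates."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.uniform(0.5, 1.5) for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.uniform(0.5, 1.5) for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # Update H using the current reconstruction W H.
        wh = matmul(w, h)
        for a in range(rank):
            col_sum = sum(w[i][a] for i in range(n_rows)) + eps
            for j in range(n_cols):
                num = sum(w[i][a] * v[i][j] / (wh[i][j] + eps)
                          for i in range(n_rows))
                h[a][j] *= num / col_sum
        # Recompute the reconstruction, then update W.
        wh = matmul(w, h)
        for a in range(rank):
            row_sum = sum(h[a][j] for j in range(n_cols)) + eps
            for i in range(n_rows):
                num = sum(h[a][j] * v[i][j] / (wh[i][j] + eps)
                          for j in range(n_cols))
                w[i][a] *= num / row_sum
    return w, h


# Hypothetical toy count matrix (rows and columns carry no real meaning).
counts = [
    [12, 0, 3, 9],
    [10, 1, 2, 8],
    [0, 7, 9, 1],
    [1, 8, 11, 0],
    [6, 4, 6, 5],
]
w, h = poisson_nmf(counts, rank=2)
print("deviance after fitting:", round(poisson_deviance(counts, w, h), 3))
```

A Bayesian treatment such as the one described above places priors on the factors and samples from their posterior, which gives uncertainty estimates that this point-estimate sketch does not.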

Translating methods into clinical use requires causal understanding to distinguish correlation from causation. Chapter 3 introduces a framework for causal inference on latent biological outcomes, enabling causal interpretation of treatment effects. It formalizes and quantifies the concept of learning-induced (li-)interference, which arises when learned latent outcomes depend on other samples' treatments, biasing causal estimates. This chapter also proposes an algorithm to mitigate li-interference and demonstrates promising results in both simulated and real cancer data.
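The idea that a learned latent outcome can depend on other samples' treatments can be illustrated with a small toy example. The sketch below is not the causalLFO algorithm or the dissertation's formal definition. It assumes a rank-one latent representation learned from the pooled data by power iteration, and the data values, the treatment effect size, and the function names are all hypothetical.

```python
import math


def leading_direction(rows, n_iter=100):
    """Top right singular vector of the data matrix, found by power iteration."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(n_iter):
        scores = [sum(x * c for x, c in zip(row, v)) for row in rows]
        v = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return v


def learned_outcomes(treatments):
    """Learn one latent direction from all samples, then score each sample.

    Assumed toy data-generating process: treatment adds 6 units to the
    third feature of the treated sample only.
    """
    base = [[8.0, 2.0, 1.0], [7.0, 3.0, 1.0], [9.0, 1.0, 2.0], [8.0, 2.0, 2.0]]
    rows = [[b[0], b[1], b[2] + 6.0 * t] for b, t in zip(base, treatments)]
    v = leading_direction(rows)
    return [sum(x * c for x, c in zip(row, v)) for row in rows]


# Sample 0 is untreated in both scenarios, so its observed data are identical.
y_a = learned_outcomes([0, 0, 0, 0])
y_b = learned_outcomes([0, 1, 1, 1])
print("sample 0 learned outcome, others untreated:", round(y_a[0], 3))
print("sample 0 learned outcome, others treated:  ", round(y_b[0], 3))
```

Sample 0's raw data never change, yet its learned outcome shifts because the latent direction is estimated from everyone's data. That dependence is the kind of effect the chapter describes as biasing causal estimates.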

Together, these methods contribute to a unified view of latent structure in cancer genomics. Each chapter is accompanied by an open-source R software package (fdrSAFE, bayesNMF, and causalLFO) to promote reproducibility and usability of these methods.

Keywords

Bayesian modeling, Causal inference, Latent variable models, Multiple hypothesis testing, Mutational signatures, Non-negative matrix factorization, Biostatistics, Bioinformatics, Genetics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service