Publication: Generative Statistical Methods for Biological Sequences
Abstract
Measuring and making sequences is central to modern biology and biomedicine. From evolutionary biology to immunology to therapeutics and beyond, scientists collect massive datasets of DNA, RNA and protein sequences, and create new sequences in the laboratory through large-scale DNA synthesis or genome editing. This dissertation is about the problem of learning from measurements of complex sequence data and predicting unobserved or future sequences that can be made in the laboratory. The dissertation describes new generative statistical methods for biological sequences, working within the framework of Bayesian statistics and probabilistic machine learning, and establishes theoretical guarantees on these methods using frequentist analysis. Part I proposes new tools for building biological sequence models, critiquing biological sequence models, and designing experiments to synthesize samples from biological sequence models. Part II deals with the use of misspecified models in biological sequence analysis and beyond, developing a new understanding of how such “wrong” models can be used effectively for estimation and discovery. Overall, the dissertation contributes principles and methods for reliable and accurate prediction, analysis and design of biological sequences across biology and biomedicine.