Publication: Nonparametric Methods for Building and Evaluating Models of Biological Sequences
Abstract
Probabilistic models of biological sequences are used to design drugs, make predictions about human health, and learn basic biology. Sequence data is high-dimensional, so a probabilistic model must make biological assumptions in order to predict and infer. However, these assumptions can come at the cost of the model's flexibility, fundamentally limiting its ability to make accurate predictions and learn new biology. Modern sequencing efforts and high-throughput experimentation are generating an ever-increasing amount of sequence data, in principle providing more and more information from which to learn the complexity of real sequence data. To leverage this wealth of data, this thesis builds nonparametric models and tests of sequences that incorporate biological prior knowledge while remaining flexible. This thesis builds methods to perform efficient, flexible, and reliable prediction and inference from DNA and protein data, at large and small scale, and in supervised and unsupervised settings.