Interpretable Machine Learning Methods with Applications in Genomics

Ploenzke, Matt

Citation

Ploenzke, Matt. 2020. Interpretable Machine Learning Methods with Applications in Genomics. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

A primary goal in biology is understanding the relationship between genomic sequence and cell state or function. Pharmacogenomic experiments, for instance, measure how different genomic profiles correlate with cell survival under varying drug dosages, thereby identifying genomic markers and signatures associated with effective therapy. ChIP-seq experiments, on the other hand, isolate proteins such as transcription factors (TFs) bound to the genome and subsequently measure the variability of the genomic sequence associated with these TFs across different conditions. In both cases the problem is formulated with some genomic input space, $X$, and interest lies in its associations with some outcome, $Y$. How one defines $X$ or $Y$ for a given application has a substantial downstream effect on the conclusions drawn. The focus of this dissertation is the development of methods for three -omics applications that address the importance of defining $X$ and $Y$ in a data-driven manner. Improved model interpretability and agreement with intuition highlight the benefit of such an approach for each application. In the pharmacogenomics application, a multi-level model is detailed to show the effect of assuming that the outcome variable $Y$ is a continuous univariate random variable when in fact $Y$ follows a two-component mixture distribution. Estimated associations between $X$ and $Y$ are compared under the differing assumptions, as are bivariate measures of association such as those between the $Y$ collected in one experiment and the $Y$ collected in another. The second application uses weight constraints and regularization to illustrate how the inherent structure of the genomic sequence $X$, namely that it is a string of nucleotides, allows one to transform $X$ into a set of learnable sequence motifs using the first-layer weights of convolutional neural networks (CNNs). These feature extractors allow one to encode prior information into the sequence-function analysis and to extract interpretable sequence motifs after fitting the model. The final results again focus on CNNs and TF binding and show the utility of employing an exponential activation function in the first-layer feature extractors. Specifically, measures of model interpretability improve relative to state-of-the-art methods, with no effect on test set accuracy. Interestingly, the functions learned with the exponential activation tend to be less noisy and more robust to hyper-parameter selection. A discussion of deep learning for TF binding applications completes the dissertation.

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Citable link to this page

https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37368851
