Interpretable Machine Learning Methods with Applications in Genomics
Citation: Ploenzke, Matt. 2020. Interpretable Machine Learning Methods with Applications in Genomics. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract: A primary goal in biology is understanding the relationship between genomic sequence and cell state or function. Pharmacogenomic experiments, for instance, measure how different genomic profiles correlate with cell survival under varying drug dosages, thus finding genomic markers and signatures associated with effective therapy. ChIP-seq experiments, on the other hand, isolate proteins and/or transcription factors (TFs) bound to the genome and subsequently measure genomic sequence variability with these TFs across different conditions. In both cases the fundamental problem is formulated with some genomic input space, $X$, and interest lies in its associations with some outcome $Y$. How one defines either $X$ or $Y$ for any given application has a tremendous downstream effect on the conclusions drawn. The focus of this dissertation is the development of methods for three -omics applications which address the importance of defining $X$ and $Y$ in a data-driven manner. Improved model interpretability and an agreement with intuition highlight the benefit of such an approach for each application. A multi-level model is detailed in the pharmacogenomics application to show the effect of assuming an outcome variable $Y$ is a continuous univariate random variable when in fact $Y$ follows a two-component mixture distribution. Estimated associations between $X$ and $Y$ are compared under the differing assumptions, as are bivariate measures of association such as those between the $Y$ collected in one experiment and those collected in another. The second application uses weight constraints and regularization to illustrate how the inherent structure of the genomic sequence $X$, namely being composed of a string of nucleotides, allows one to transform $X$ into a set of learnable sequence motifs using the first-layer weights in convolutional neural networks (CNNs).
These feature extractors allow one to encode prior information into the sequence-function analysis and to extract interpretable sequence motifs after fitting the model. The final set of results again focuses on CNNs and TF binding, showing the utility of employing an exponential activation function in the first-layer feature extractors. Specifically, measures of model interpretability are improved relative to state-of-the-art methods with no loss in test-set accuracy. Interestingly, the filters learned with the exponential activation tend to be less noisy and more robust to hyper-parameter selections. A discussion of deep learning for TF binding applications completes the dissertation.
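The intuition behind the exponential first-layer activation can be illustrated with a minimal NumPy sketch. This is not the dissertation's implementation: the toy motif, filter weights, sequence, and all function names below are assumptions chosen for illustration. A first-layer convolutional filter acts as a motif scanner over one-hot encoded DNA; compared with a ReLU, exponentiating the scan scores amplifies strong motif matches relative to background positions, which is one way a first-layer filter can end up representing a cleaner, more interpretable motif.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA string into a (len, 4) array."""
    idx = {c: i for i, c in enumerate(ALPHABET)}
    x = np.zeros((len(seq), 4))
    for j, c in enumerate(seq):
        x[j, idx[c]] = 1.0
    return x

def conv1d(x, w):
    """Valid cross-correlation of an (L, 4) input with a (k, 4) filter."""
    k = w.shape[0]
    return np.array([np.sum(x[i:i + k] * w) for i in range(len(x) - k + 1)])

# Toy filter scoring the motif "TACG": +1.5 for a matching base, -0.5 otherwise.
w = one_hot("TACG") * 2.0 - 0.5

seq = "GGTACGGG"  # contains one exact motif match starting at position 2
scores = conv1d(one_hot(seq), w)

relu = np.maximum(scores, 0.0)  # standard first-layer activation
expo = np.exp(scores)           # exponential activation

# Normalized responses: the exponential suppresses partial/background
# matches relative to the single strong hit far more sharply than ReLU.
print(relu / relu.max())
print(expo / expo.max())
```

Here the exact match scores 6.0 while background windows score at most 0, so after exponentiation the response is overwhelmingly concentrated at the true motif position; this amplification of strong matches is the rough mechanism behind the less noisy learned filters described above.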
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37368851
Collections: FAS Theses and Dissertations