dc.contributor.advisor Irizarry, Rafael
dc.contributor.author Ploenzke, Matt
dc.date.accessioned 2021-08-04T03:59:29Z
dc.date.created 2020
dc.date.issued 2020-08-10
dc.date.submitted 2020-11
dc.identifier.citation Ploenzke, Matt. 2020. Interpretable Machine Learning Methods with Applications in Genomics. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
dc.identifier.other 28086179
dc.identifier.uri https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37368851
dc.description.abstract A primary goal in biology is understanding the relationship between genomic sequence and cell state or function. Pharmacogenomic experiments, for instance, measure how different genomic profiles correlate with cell survival under varying drug dosages, thus finding genomic markers and signatures associated with effective therapy. ChIP-seq experiments, on the other hand, isolate proteins and/or transcription factors (TFs) bound to the genome and subsequently measure the genomic sequence variability associated with these TFs across different conditions. In both cases the fundamental problem is formulated with some genomic input space, $X$, and interest lies in associations with some outcome, $Y$. How one defines either $X$ or $Y$ for any given application has a tremendous downstream effect on the conclusions drawn. The focus of this dissertation is the development of methods for three -omics applications which address the importance of defining $X$ and $Y$ in a data-driven manner. Improved model interpretability and agreement with intuition highlight the benefit of such an approach for each application. A multi-level model is detailed in the pharmacogenomics application to show the effect of assuming that an outcome variable $Y$ is a continuous univariate random variable when in fact $Y$ follows a two-component mixture distribution. Estimated associations between $X$ and $Y$ are compared under the differing assumptions, as are bivariate measures of association, such as that between the $Y$ collected in one experiment and the $Y$ collected in another. The second application uses weight constraints and regularization to illustrate how the inherent structure of the genomic sequence $X$, namely that it is a string of nucleotides, allows one to transform $X$ into a set of learnable sequence motifs using the first-layer weights in convolutional neural networks (CNNs). These feature extractors allow one to encode prior information into the sequence-function analysis and to extract interpretable sequence motifs after fitting the model. The final results again focus on CNNs and TF binding and show the utility of employing an exponential activation function in the first-layer feature extractors. Specifically, measures of model interpretability are improved relative to state-of-the-art methods while test set accuracy is unaffected. Interestingly, the functions learned with the exponential activation tend to be less noisy and more robust to hyperparameter selection. A discussion of deep learning for TF binding applications completes the dissertation.
dc.format.mimetype application/pdf
dc.language.iso en
dash.license LAA
dc.subject Deep learning
dc.subject Interpretability
dc.subject Machine learning
dc.subject Pharmacogenomics
dc.subject TF binding
dc.subject Statistics
dc.title Interpretable Machine Learning Methods with Applications in Genomics
dc.type Thesis or Dissertation
dash.depositing.author Ploenzke, Matt
dc.date.available 2021-08-04T03:59:29Z
thesis.degree.date 2020
thesis.degree.grantor Harvard University Graduate School of Arts and Sciences
thesis.degree.level Doctoral
thesis.degree.name Ph.D.
dc.contributor.committeeMember Parmigiani, Giovanni
dc.contributor.committeeMember Braun, Danielle
dc.contributor.committeeMember Koo, Peter K
dc.type.material text
thesis.degree.department Biostatistics
dc.identifier.orcid 0000-0002-1354-7318
dash.author.email matthew.ploenzke@gmail.com