dc.contributor.advisor: Irizarry, Rafael
dc.contributor.author: Ploenzke, Matt
dc.date.accessioned: 2021-08-04T03:59:29Z
dc.date.created: 2020
dc.date.issued: 2020-08-10
dc.date.submitted: 2020-11
dc.identifier.citation: Ploenzke, Matt. 2020. Interpretable Machine Learning Methods with Applications in Genomics. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
dc.identifier.other: 28086179
dc.identifier.uri: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37368851
dc.description.abstract: A primary goal in biology is understanding the relationship between genomic sequence and cell state or function. Pharmacogenomic experiments, for instance, measure how different genomic profiles correlate with cell survival under varying drug dosages, with the aim of finding genomic markers and signatures associated with effective therapy. ChIP-seq experiments, on the other hand, isolate proteins and/or transcription factors (TFs) bound to the genome and subsequently measure the genomic sequence variability associated with these TFs across different conditions. In both cases, the fundamental problem is formulated with some genomic input space, $X$, and interest lies in its associations with some outcome, $Y$. How one defines either $X$ or $Y$ for any given application has a tremendous downstream effect on the conclusions drawn. The focus of this dissertation is the development of methods for three -omics applications which address the importance of defining $X$ and $Y$ in a data-driven manner. Improved model interpretability and an agreement with intuition highlight the benefit of such an approach for each application. A multi-level model is detailed in the pharmacogenomics application to show the effect of assuming that an outcome variable $Y$ is a continuous univariate random variable when in fact $Y$ follows a two-component mixture distribution. Estimated associations between $X$ and $Y$ are compared under the differing assumptions, as are bivariate measures of association such as those between the $Y$ collected in one experiment and those collected in another. The second application uses weight constraints and regularization to illustrate how the inherent structure of the genomic sequence $X$, namely that it is composed of a string of nucleotides, allows one to transform $X$ into a set of learnable sequence motifs using the first-layer weights in convolutional neural networks (CNNs).
These feature extractors allow one to encode prior information into the sequence-function analysis and extract interpretable sequence motifs after fitting the model. The final results again focus on CNNs and TF binding and show the utility of employing an exponential activation function in the first-layer feature extractors. Specifically, measures of model interpretability are improved relative to state-of-the-art methods, with no effect on test-set accuracy. Interestingly, the functions learned with the exponential activation tend to be less noisy and more robust to hyperparameter selection. A discussion of deep learning for TF binding applications completes the dissertation.
dc.format.mimetype: application/pdf
dc.language.iso: en
dash.license: LAA
dc.subject: Deep learning
dc.subject: Interpretability
dc.subject: Machine learning
dc.subject: Pharmacogenomics
dc.subject: TF binding
dc.subject: Statistics
dc.title: Interpretable Machine Learning Methods with Applications in Genomics
dc.type: Thesis or Dissertation
dash.depositing.author: Ploenzke, Matt
dc.date.available: 2021-08-04T03:59:29Z
thesis.degree.date: 2020
thesis.degree.grantor: Harvard University Graduate School of Arts and Sciences
thesis.degree.level: Doctoral
thesis.degree.name: Ph.D.
dc.contributor.committeeMember: Parmigiani, Giovanni
dc.contributor.committeeMember: Braun, Danielle
dc.contributor.committeeMember: Koo, Peter K
dc.type.material: text
thesis.degree.department: Biostatistics
dc.identifier.orcid: 0000-0002-1354-7318
dash.author.email: matthew.ploenzke@gmail.com