Outcome-Driven Clustering of Microarray Data
MetadataShow full item record
CitationHsu, Jessie. 2012. Outcome-Driven Clustering of Microarray Data. Doctoral dissertation, Harvard University.
AbstractThe rapid technological development of high-throughput genomics has given rise to complex high-dimensional microarray datasets. One strategy for reducing the dimensionality of microarray experiments is to carry out a cluster analysis to ﬁnd groups of genes with similar expression patterns. Though cluster analysis has been studied extensively, the clinical context in which the analysis is performed is usually considered separately if at all. However, allowing clinical outcomes to inform the clustering of microarray data has the potential to identify gene clusters that are more useful for describing the clinical course of disease. The aim of this dissertation is to utilize outcome information to drive the clustering of gene expression data. In Chapter 1, we propose a joint clustering model that assumes a relationship between gene clusters and a continuous patient outcome. Gene expression is modeled using cluster speciﬁc random effects such that genes in the same cluster are correlated. A linear combination of these random effects is then used to describe the continuous clinical outcome. We implement a Markov chain Monte Carlo algorithm to iteratively sample the unknown parameters and determine the cluster pattern. Chapter 2 extends this model to binary and failure time outcomes. Our strategy is to augment the data with a latent continuous representation of the outcome and specify that the risk of the event depends on the latent variable. Once the latent variable is sampled, we relate it to gene expression via cluster speciﬁc random effects and apply the methods developed in Chapter 1. The setting of clustering longitudinal microarrays using binary and survival outcomes is considered in Chapter 3. We propose a model that incorporates a random intercept and slope to describe the gene expression time trajectory. As before, a continuous latent variable that is linearly related to the random effects is introduced into the model and a Markov chain Monte Carlo algorithm is used for sampling. These methods are applied to microarray data from trauma patients in the Inﬂammation and Host Response to Injury research project. The resulting partitions are visualized using heat maps that depict the frequency with which genes cluster together.
Citable link to this pagehttp://nrs.harvard.edu/urn-3:HUL.InstRepos:9561188
- FAS Theses and Dissertations