Methods for Analyzing Sparse Genetic and Epigenetic Data: Single Cells to Population Levels
Kangeyan, Divy S.
Citation: Kangeyan, Divy S. 2019. Methods for Analyzing Sparse Genetic and Epigenetic Data: Single Cells to Population Levels. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Abstract: This dissertation is motivated by the large influx of sequencing data, in both the amount and the types of data, for which current statistical and computational methods are inadequate for handling the data and hence for addressing the corresponding scientific questions of interest.
In Chapter 1, we address the need for a data analysis platform for processing large amounts of next-generation sequencing (NGS) based methylation data. Bisulfite sequencing allows measurement of DNA methylation at base-pair resolution and has recently been adapted for use in single cells. We present a set of preprocessing pipelines that give users 1) reproducibility, 2) scalability, 3) integration with publicly available data, and 4) access to best-practice methods. The workflows produce output for visualization and further downstream analysis. Optional use of cloud computing resources facilitates the analysis of large datasets and integration with existing methylation data.
In Chapter 2, we focus on sparsity in single-cell DNA methylation data. Single-cell DNA methylation analysis has the potential to produce a high-resolution methylation landscape and to elucidate heterogeneity in methylation, but it suffers from low coverage because of the low quantity of input DNA. We find that, on average, only about 5-10% of CpGs are observed in typical single-cell libraries. We show how missingness of methylation status can bias metrics such as mean methylation estimates, as well as clustering analyses. We propose a joint analysis approach that leverages bulk sequencing data to infer bias-corrected single-cell methylation status.
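To make the effect of missingness on mean methylation estimates concrete, the following is a minimal toy sketch in Python, not the dissertation's method. The function name `observed_mean_methylation`, the 40-CpG toy cell, and the hand-picked set of observed sites are all assumptions introduced for illustration; the observed sites are deliberately chosen to over-represent unmethylated CpGs at roughly 10% coverage.

```python
import numpy as np


def observed_mean_methylation(calls):
    """Mean methylation over observed CpGs only; NaN marks an unobserved CpG."""
    calls = np.asarray(calls, dtype=float)
    observed = ~np.isnan(calls)
    if not observed.any():
        return float("nan")  # no CpG covered in this cell
    return float(calls[observed].mean())


# Hypothetical toy cell: 40 CpGs, half methylated (1) and half unmethylated (0).
truth = np.array([1.0] * 20 + [0.0] * 20)

# Assumed coverage pattern (not from the source): only 4 of 40 CpGs are
# observed, and 3 of those 4 happen to be unmethylated sites.
sparse = np.full(truth.shape, np.nan)
observed_idx = [0, 20, 21, 22]
sparse[observed_idx] = truth[observed_idx]

true_mean = observed_mean_methylation(truth)
sparse_mean = observed_mean_methylation(sparse)
coverage = float(np.mean(~np.isnan(sparse)))

print(f"coverage={coverage:.2f} true mean={true_mean:.2f} observed mean={sparse_mean:.2f}")
```

In this toy case the estimate from the observed sites is half the true mean. The distortion comes from which sites are covered, not only from how few; uniformly random missingness would instead add variance to the estimate without a systematic shift.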
In Chapter 3, we consider sparsity in rare variant data and how it can be used to infer population structure. Population substructure in genetic studies is often assessed by principal component analysis of genetic relatedness matrices (GRMs). With the general availability of whole-genome sequencing (WGS) platforms, rare variant data are now widely available. Because such variants are genetically “younger” than common variants, they should enable a fine-scale assessment of the substructure. Here, using the 1000 Genomes Project data, we compare the features of Jaccard-based GRMs with those of standard approaches that use the genetic covariance matrix, with respect to their ability to examine and infer fine-scale population substructure.
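The two kinds of matrices compared in the paragraph above can be sketched as follows. This is a hedged illustration under common textbook definitions, not necessarily the exact estimators used in the dissertation: the Jaccard index is taken between the sets of variants each individual carries, and the covariance-based GRM uses genotypes standardized by allele frequency. The function names `jaccard_grm` and `covariance_grm` and the toy genotype matrix are hypothetical.

```python
import numpy as np


def jaccard_grm(genotypes):
    """Pairwise Jaccard index between individuals' sets of carried variants."""
    carriers = (np.asarray(genotypes) > 0).astype(float)
    shared = carriers @ carriers.T                      # |A intersect B|
    counts = carriers.sum(axis=1)
    union = counts[:, None] + counts[None, :] - shared  # |A union B|
    # Pairs with an empty union (neither carries any variant) are set to 0.
    return np.divide(shared, union, out=np.zeros_like(shared), where=union > 0)


def covariance_grm(genotypes):
    """GRM from allele-frequency-standardized genotypes (0/1/2 minor-allele counts)."""
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                 # sample allele frequency per variant
    keep = (p > 0) & (p < 1)                 # drop monomorphic variants
    g, p = g[:, keep], p[keep]
    z = (g - 2 * p) / np.sqrt(2 * p * (1 - p))
    return z @ z.T / g.shape[1]


# Hypothetical toy data: 4 individuals x 6 rare variants.
G = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
])

J = jaccard_grm(G)
K = covariance_grm(G)
print(np.round(J, 3))
print(np.round(K, 3))
```

In the toy data, individuals 0 and 1 share two of three carried variants and individuals 2 and 3 share one of three, while the two pairs share nothing with each other, so the Jaccard matrix separates the groups directly. Principal component analysis would then be applied to either matrix, for example with `numpy.linalg.eigh`.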
Citable link to this page: http://nrs.harvard.edu/urn-3:HUL.InstRepos:42029774