Publication: Mathematical Methods for Single-cell Analysis
Date
2022-05-19
The Harvard community has made this article openly available.
Citation
DeMeo, Benjamin. 2022. Mathematical Methods for Single-cell Analysis. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
Single-cell technologies profile samples at cellular resolution, with great potential in the study of biological tissues. However, the discrete, high-volume, and high-dimensional nature of single-cell data poses significant challenges. In this work, we present mathematically principled and accessible approaches to the recovery of biological signal from these complex datasets.
The sheer size of modern single-cell datasets renders them inaccessible to many researchers. In Chapter 1, we present Hopper, a novel geometric approach for subsampling, or sketching, single-cell data while preserving its transcriptional heterogeneity. Hopper recovers rare cellular populations with low computational burden. For example, in a set of 1.3 million mouse brain cells, analysis of a 5,000-cell Hopper sketch reveals a population of just 64 inflammatory macrophages. Hopper democratizes the analysis of massive-scale single-cell studies, and ensures that analysis remains possible as experimental technologies improve.
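The abstract does not spell out Hopper's algorithm, so the following is only a minimal sketch of one standard geometric sketching strategy, farthest-first traversal, under the assumption that cells are rows of a numeric matrix. The function name `farthest_first_sketch` and the synthetic data are hypothetical and are not taken from the Hopper package. The sketch illustrates why a geometric subsample tends to retain rare populations: each new point is the cell farthest from everything already selected, so a small, distant cluster is reached early.

```python
import numpy as np


def farthest_first_sketch(X, k, seed=0):
    """Return indices of k rows of X chosen by farthest-first traversal.

    Illustrative only: a generic geometric subsampling routine,
    not the Hopper implementation.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of cells")
    # Start from a randomly chosen cell.
    chosen = [int(rng.integers(n))]
    # Distance from every cell to its nearest selected cell so far.
    min_dist = np.linalg.norm(X - X[chosen[0]], axis=1)
    for _ in range(k - 1):
        # Pick the cell that is farthest from the current sketch.
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        # Update nearest-selected distances with the new point.
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(chosen)


# Synthetic data (an assumption for illustration): one large population
# and one rare, well-separated population of 20 cells.
data_rng = np.random.default_rng(1)
common = data_rng.normal(0.0, 1.0, size=(5000, 10))
rare = data_rng.normal(25.0, 1.0, size=(20, 10))
X = np.vstack([common, rare])

sketch = farthest_first_sketch(X, k=50)
# Rows 5000 and above belong to the rare population.
rare_in_sketch = int(np.sum(sketch >= 5000))
```

Uniform random sampling of 50 cells from this matrix would often miss the 20 rare cells entirely, whereas the traversal above reaches the distant cluster within its first two picks. Each iteration costs one pass over the data, so the total cost grows with the number of cells times the sketch size.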
While sketching reduces the number of cells, dimensionality reduction reduces the number of features, collapsing tens of thousands of genes to 10-100 dimensions. However, this reduction often loses important information. In Chapter 2, we present Surprisal Component Analysis (SCA), which uses a novel information-theoretic formulation to extract interesting axes of variation from a dataset. We show that SCA enables more accurate clustering, visualization, and imputation of single-cell transcriptomic data, with broad applications to a variety of downstream tasks. For example, in immune cells, SCA recovers clinically relevant subpopulations like gamma-delta and MAIT cells where other methods cannot.
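For context, the kind of reduction described here, from thousands of genes down to a few dozen dimensions, is commonly done with principal component analysis (PCA). The sketch below shows that conventional baseline on a synthetic count matrix. It is not SCA, whose formulation the abstract does not give. The function name `pca_reduce` and the simulated counts are assumptions made for illustration.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components.

    Standard PCA via SVD, shown as a baseline; this is not SCA.
    """
    # Center each feature (gene) at zero.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


# Synthetic cells-by-genes count matrix (an assumption for illustration).
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(200, 2000)).astype(float)
# Log-transform counts, a common preprocessing step.
logged = np.log1p(counts)

Z = pca_reduce(logged, n_components=20)  # 200 cells, 20 dimensions
```

Because PCA ranks axes by total variance, a signal carried by only a handful of cells contributes little and can be discarded with the trailing components, which is one way a reduction of this kind can lose information about rare populations.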
While important, global dimensionality reduction has significant drawbacks, because the important axes of variation may vary by cellular context. In Chapter 3, we introduce a novel method for distilling local information across overlapping cellular locales, and combining it to obtain a far more nuanced view of the cellular states in a dataset. In real and synthetic data, we show significant improvement over even SCA in detection of small cellular populations. For example, in a 20-organ mouse atlas dataset generated by the Tabula Muris consortium, we uncover novel populations of microglia, hepatocytes, and likely many others.
We release these methods as open-source Python packages with comprehensive documentation. Alone or in combination, they offer a toolkit for accessible, principled, and effective signal extraction in single-cell data.
Keywords
Clustering, Data, Dimensionality, Information, Rare, Single-cell, Bioinformatics
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service