Publication:
Mathematical Methods for Single-cell Analysis

Date

2022-05-19

Citation

DeMeo, Benjamin. 2022. Mathematical Methods for Single-cell Analysis. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Single-cell technologies profile samples at cellular resolution, with great potential in the study of biological tissues. However, the discrete, high-volume, and high-dimensional nature of single-cell data poses significant challenges. In this work, we present mathematically principled and accessible approaches to the recovery of biological signal from these complex datasets.

The sheer size of modern single-cell datasets renders them inaccessible to many researchers. In Chapter 1, we present Hopper, a novel geometric approach for subsampling, or sketching, single-cell data while preserving its transcriptional heterogeneity. Hopper recovers rare cellular populations with low computational burden. For example, in a set of 1.3 million mouse brain cells, analysis of a 5,000-cell Hopper sketch reveals a population of just 64 inflammatory macrophages. Hopper democratizes the analysis of massive-scale single-cell studies and ensures that analysis remains possible as experimental technologies improve.

While sketching reduces the number of cells, dimensionality reduction reduces the number of features, collapsing tens of thousands of genes to 10-100 dimensions. However, this reduction often loses important information. In Chapter 2, we present Surprisal Component Analysis (SCA), which uses a novel information-theoretic formulation to extract interesting axes of variation from a dataset. We show that SCA enables more accurate clustering, visualization, and imputation of single-cell transcriptomic data, with broad applications to a variety of downstream tasks. For example, in immune cells, SCA recovers clinically relevant subpopulations like gamma-delta and MAIT cells where other methods cannot.

While important, global dimensionality reduction has significant drawbacks, because the important axes of variation may vary by cellular context. In Chapter 3, we introduce a novel method for distilling local information across overlapping cellular locales and combining it to obtain a far more nuanced view of the cellular states in a dataset. In real and synthetic data, we show significant improvement over even SCA in detection of small cellular populations. For example, in a 20-organ mouse atlas dataset generated by the Tabula Muris consortium, we uncover novel populations of microglia, hepatocytes, and likely many others.

We release these methods as open-source Python packages with comprehensive documentation. Alone or in combination, they offer a toolkit for accessible, principled, and effective signal extraction in single-cell data.
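
To make the idea of geometric sketching concrete, the following is a minimal illustration using greedy farthest-first traversal, a generic subsampling technique that favors coverage of the data's geometry over density. It is a sketch under stated assumptions, not the Hopper package or its API: the function name `farthest_first_sketch`, its parameters, and the toy data are all hypothetical, and the abstract does not specify the exact procedure Hopper uses.

```python
import numpy as np


def farthest_first_sketch(X, k, seed=0):
    """Return k row indices of X chosen by greedy farthest-first traversal.

    Hypothetical helper for illustration only; not part of any released package.
    """
    n = X.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of rows in X")
    rng = np.random.default_rng(seed)

    # Start from a random point, then track each point's distance
    # to its nearest already-chosen point.
    first = int(rng.integers(n))
    chosen = [first]
    min_dist = np.linalg.norm(X - X[first], axis=1)

    for _ in range(k - 1):
        # Pick the point farthest from everything chosen so far.
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        # Update nearest-chosen distances with the new point.
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))

    return np.array(chosen)


# Toy data (an assumption, not from the dissertation): 1,000 points in one
# dense cluster and 5 points in a distant, rare cluster.
rng = np.random.default_rng(1)
common = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
rare = rng.normal(loc=50.0, scale=1.0, size=(5, 2))
X = np.vstack([common, rare])

idx = farthest_first_sketch(X, k=10)
rare_in_sketch = int(np.sum(idx >= 1000))  # rows 1000-1004 are the rare cluster
```

In this toy setting, uniform random sampling of 10 points would usually miss the 5 rare points entirely, whereas the greedy traversal reaches the distant cluster within its first two picks. Each iteration costs one pass over the data, so this naive version scales as O(nk); a practical tool for millions of cells would need further optimization.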

Keywords

Clustering, Data, Dimensionality, Information, Rare, Single-cell, Bioinformatics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
