Publication: Clustering of Single-Cell and Text Data
Date
2023-07-31
Published Version
The Harvard community has made this article openly available.
Citation
Chen, Dieyi. 2023. Clustering of Single-Cell and Text Data. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract
In recent years, we have witnessed rapid growth in data volume, a significant fraction of
which is unlabeled. We study the problems of subject clustering in bioinformatics
and document clustering in Natural Language Processing (NLP). In these problems, the data
matrix is usually in very high dimension, but the useful signals for clustering are contained in
a low-dimensional latent structure, masked by complicated noise. This dissertation introduces
new methods and theory for such problems.
In Chapter 1, we study high-dimensional clustering under Rare and Weak signals (i.e., there
are only a small fraction of useful features and each useful feature contains very weak signals for
clustering). First, we theoretically investigate IF-PCA, a popular clustering method. We derive a
phase diagram: in the Possibility region, IF-PCA yields successful clustering; in the Impossibility
region, no polynomial-time algorithm can yield successful clustering. Next,
inspired by the appealing theoretical properties of IF-PCA, we combine it with a recent unsupervised
deep learning method, the variational auto-encoder (VAE), to deal simultaneously with
sparsity and non-linearity. We call this method IF-VAE. Last, we evaluate the performance of
IF-PCA and IF-VAE on 10 gene microarray data sets and 8 single-cell RNA-seq data sets, with
a comparison with popular methods for single-cell subject clustering (e.g., Seurat and SC3).
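At a high level, IF-PCA screens for influential features and then applies a spectral step to the retained columns. The following is a minimal sketch of that pipeline, assuming a Kolmogorov-Smirnov screen with a simple quantile cutoff (`frac` is our stand-in for a data-driven threshold such as Higher Criticism); it illustrates the idea rather than the dissertation's exact procedure:

```python
import numpy as np
from scipy.stats import kstest
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def if_pca(X, k, frac=0.2):
    """Simplified IF-PCA sketch: screen features, then PCA + k-means.

    X    : (n_samples, n_features) data matrix
    k    : number of clusters
    frac : fraction of features to retain (stand-in for a data-driven
           threshold such as Higher Criticism)
    """
    # Standardize each feature, then score it by how far its empirical
    # distribution deviates from N(0, 1) (Kolmogorov-Smirnov statistic);
    # features carrying cluster signal tend to look non-normal.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ks = np.array([kstest(Xs[:, j], "norm").statistic for j in range(X.shape[1])])
    keep = ks >= np.quantile(ks, 1.0 - frac)   # retain the top-scoring features
    # Spectral step: project the retained features onto the k-1 leading
    # principal components, then cluster the projections with k-means.
    pcs = PCA(n_components=k - 1).fit_transform(Xs[:, keep])
    return KMeans(n_clusters=k, n_init=10).fit_predict(pcs)
```

The screening step is what lets the method cope with Rare and Weak signals: noise features are discarded before PCA, so the leading components are not swamped by the many uninformative dimensions.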
In Chapter 2, we study bi-gram topic modeling. The topic model is one of the most popular
uni-gram models (a.k.a. bag-of-words models) for text analysis. However, it ignores word order
and the context of each word. Bi-gram models improve on uni-gram models, but it remains unclear
how to learn “topics” from the bi-grams. We propose two versions of bi-gram models. For each
model, we propose a tensor-decomposition approach to learn “topics”. These approaches yield
consistent topic estimation in simulations. As a related problem, we also study author clustering
based on the bi-grams of abstracts of statistical papers. We find evidence in real data that using
the bi-grams leads to more meaningful clusters than using the uni-grams.
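For concreteness, a bi-gram representation counts ordered word pairs rather than single words, retaining the local word order that a bag-of-words count discards. A minimal illustration (the function name is ours, not the dissertation's):

```python
from collections import Counter

def bigram_counts(docs):
    # Count ordered word pairs (bi-grams) across a list of documents.
    # Unlike a bag-of-words (uni-gram) count, this keeps local word order:
    # "topic model" and "model topic" are different bi-grams.
    counts = Counter()
    for doc in docs:
        words = doc.lower().split()
        counts.update(zip(words, words[1:]))
    return counts
```

The models in the chapter then learn topics from such pair counts via tensor decomposition; the counting step above is only the shared starting point.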
In Chapter 3, we focus on a sub-problem that arises in many soft-clustering problems, such
as network mixed membership estimation, topic modeling, and hyperspectral remote sensing.
In these problems, one of the key steps is to estimate a simplex structure from a noisy point
cloud, which we call the vertex hunting (VH) problem. Existing VH algorithms, such as
successive projection (SP), are susceptible to outliers. We propose a robust VH algorithm that
properly shrinks the estimated vertices towards the interior of the data cloud, so as to mitigate the effect
of outliers. The level of shrinkage is determined by maximizing a pseudo-likelihood and requires no
tuning parameter. Under an idealized model, we show that the proposed method has a faster
rate of convergence than existing VH algorithms.
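The successive projection (SP) baseline is easy to state: repeatedly take the point farthest from the span of the directions found so far. A minimal numpy sketch of that greedy core (omitting the affine-centering and denoising steps practical implementations add, and without the proposed shrinkage):

```python
import numpy as np

def successive_projection(X, K):
    """Greedy SP sketch: return row indices of K estimated vertices.

    At each step, pick the row of X with the largest residual norm, then
    project all rows onto the orthogonal complement of the chosen
    direction. An outlier with a large norm can hijack this greedy choice,
    which is the sensitivity the robust (shrinkage-based) method targets.
    """
    R = np.asarray(X, dtype=float).copy()
    picked = []
    for _ in range(K):
        j = int(np.argmax(np.linalg.norm(R, axis=1)))
        picked.append(j)
        u = R[j] / np.linalg.norm(R[j])       # direction of the new vertex
        R = R - np.outer(R @ u, u)            # remove that direction from all rows
    return picked
```

Because each pick is an extreme point of the residual cloud, a single outlier far outside the true simplex is selected with certainty; shrinking candidate vertices toward the interior trades a small bias for robustness against exactly this failure mode.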
Keywords
Statistics
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service