Publication:
Clustering of Single-Cell and Text Data

Date

2023-07-31

Published Version

Citation

Chen, Dieyi. 2023. Clustering of Single-Cell and Text Data. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

In recent years, we have witnessed rapid growth in data volume, a significant fraction of which is unlabeled. We study the problems of subject clustering in bioinformatics and document clustering in Natural Language Processing (NLP). In these problems, the data matrix is usually high-dimensional, but the useful signals for clustering are contained in a low-dimensional latent structure, masked by complicated noise. This dissertation introduces new methods and theory for such problems.

In Chapter 1, we study high-dimensional clustering under Rare and Weak signals (i.e., only a small fraction of the features are useful, and each useful feature contains only a weak signal for clustering). First, we theoretically investigate IF-PCA, a popular clustering method. We derive a phase diagram: in the Possibility region, IF-PCA yields successful clustering; in the Impossibility region, no polynomial-time algorithm can yield successful clustering. Next, inspired by the appealing theoretical properties of IF-PCA, we combine it with a recent unsupervised deep learning method, the variational auto-encoder (VAE), to deal simultaneously with sparsity and non-linearity. We call this method IF-VAE. Last, we evaluate the performance of IF-PCA and IF-VAE on 10 gene microarray data sets and 8 single-cell RNA-seq data sets, comparing them with popular methods for single-cell subject clustering (e.g., Seurat and SC3).

In Chapter 2, we study bi-gram topic modeling. The topic model is one of the most popular uni-gram models (a.k.a. bag-of-words models) for text analysis. However, it ignores word order and the context of each word. Bi-gram models improve on uni-gram models, but it remains unclear how to learn "topics" from bi-grams. We propose two versions of bi-gram models. For each model, we propose a tensor-decomposition approach to learn the "topics". These approaches yield consistent topic estimation in simulations.
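As a rough illustration only (not the dissertation's implementation), the IF-PCA idea — screen for influential features with a marginal statistic, project the retained features onto leading singular vectors, then cluster — can be sketched in NumPy. All names here are hypothetical, and variance is used as a stand-in screening statistic (the actual method ranks features by Kolmogorov-Smirnov scores against a null distribution):

```python
import numpy as np

def if_pca_sketch(X, k, n_keep):
    """Toy IF-PCA-style pipeline.
    X: (n_samples, p_features) data matrix; k: number of clusters;
    n_keep: number of features retained after screening."""
    # Influential-feature screening: rank features by a simple marginal
    # statistic (variance here, as a placeholder for the KS statistic).
    scores = X.var(axis=0)
    keep = np.argsort(scores)[::-1][:n_keep]
    Xs = X[:, keep]
    # Standardize the retained columns, then project the samples onto
    # the top (k - 1) left singular vectors.
    Xs = (Xs - Xs.mean(axis=0)) / (Xs.std(axis=0) + 1e-12)
    U, S, _ = np.linalg.svd(Xs, full_matrices=False)
    H = U[:, :k - 1] * S[:k - 1]
    # Lloyd's k-means on the low-dimensional projections.
    rng = np.random.default_rng(0)
    centers = H[rng.choice(len(H), k, replace=False)]
    for _ in range(50):
        labels = np.argmin(((H[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = H[labels == j].mean(axis=0)
    return labels
```

On data where only a few features carry a mean shift between groups, the screening step discards most pure-noise coordinates before PCA, which is the mechanism the Rare and Weak analysis formalizes.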
As a related problem, we also study author clustering based on the bi-grams of abstracts of statistics papers. We find evidence in real data that using bi-grams leads to more meaningful clusters than using uni-grams.

In Chapter 3, we focus on a sub-problem that arises in many soft-clustering problems, such as network mixed membership estimation, topic modeling, and hyperspectral remote sensing. In these problems, a key step is to estimate a simplex structure from a noisy point cloud, which we call the vertex hunting (VH) problem. Existing VH algorithms, such as successive projection (SP), are susceptible to outliers. We propose a robust VH algorithm that shrinks the estimated vertices towards the interior of the data cloud so as to mitigate the effect of outliers. The level of shrinkage is determined by maximizing a pseudo-likelihood and requires no tuning parameter. Under an idealized model, we show that the proposed method has a faster rate of convergence than existing VH algorithms.
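For context, the baseline that the robust method improves upon — the successive projection (SP) algorithm for vertex hunting — admits a short NumPy sketch. This is a generic textbook version under an idealized noiseless setting, not the dissertation's proposal; the homogenization step (appending a constant coordinate so the affine simplex becomes a cone) is a standard trick:

```python
import numpy as np

def successive_projection(X, K):
    """Generic SP sketch for vertex hunting.
    X: (n, d) points lying near the convex hull of K vertices.
    Returns K rows of X chosen as estimated vertices."""
    # Homogenize: append a constant coordinate so convex combinations
    # of the vertices sit on a cone through the origin.
    R = np.c_[X, np.ones(len(X))].astype(float)
    idx = []
    for _ in range(K):
        # The norm is convex, so it is maximized at an extreme point
        # of the (projected) point cloud, i.e., at a vertex image.
        j = int(np.argmax((R ** 2).sum(axis=1)))
        idx.append(j)
        # Project all points onto the orthogonal complement of the
        # chosen vertex direction; the chosen vertex maps to zero.
        u = R[j] / np.linalg.norm(R[j])
        R -= np.outer(R @ u, u)
    return X[idx]
```

Each greedy step picks the farthest remaining point and projects it out; because this argmax is driven by extreme points, a single outlier can hijack a vertex estimate — the susceptibility that motivates shrinking estimated vertices toward the interior of the cloud.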

Keywords

Statistics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
