Clustering via Deep Dictionary Learning

Tankala, Pranay Bennett

View/Open

pranay_tankala_senior_thesis.pdf (3.767Mb)

Author

Tankala, Pranay Bennett

Metadata

Show full item record

Citation

Tankala, Pranay Bennett. 2021. Clustering via Deep Dictionary Learning. Bachelor's thesis, Harvard College.

Abstract

Clustering, the process of discovering hidden groups in data, is a fundamentally important problem in statistical data analysis with applications ranging from image segmentation to sequence analysis to anomaly detection. Although clustering algorithms have traditionally utilized a variety of techniques from statistics, optimization, and linear algebra, in recent years, clustering algorithms have increasingly relied on deep learning to achieve state-of-the-art performance. Deep learning, the subfield of machine learning based on the use of artificial neural networks, has recently revolutionized a wide range of scientific disciplines. Unfortunately, the astonishing increase in accuracy and efficiency achieved by modern deep clustering algorithms has come at the cost of transparency and interpretability, with neural networks often making predictions and learning weights that are difficult to rigorously analyze.

In this work, we attempt to bridge the gap between state-of-the-art deep clustering methods and older but better-understood optimization algorithms for clustering. To this end, we propose, analyze, and evaluate novel, domain-agnostic clustering algorithms that offer the performance gains of deep learning while retaining the interpretability of existing clustering algorithms based on spectral methods and sparse self-representation. In contrast to black-box deep learning approaches, our algorithms facilitate rigorous analysis and provable performance guarantees.

Our first contribution is a novel theoretical analysis of a clustering algorithm developed in (Theodosis et al., 2021), which uses a highly structured recurrent neural network to perform subspace clustering, the task of partitioning a data set into clusters corresponding to low-dimensional linear subspaces of a high-dimensional ambient space. Using analytical tools from convex optimization and compressed sensing, we relate the convergence of this network's hidden states to properties of the underlying data distribution.

We also propose a novel manifold clustering algorithm (Tankala et al., 2021) that draws on techniques from sparse dictionary learning and nonlinear dimensionality reduction. Our method is based on algorithm unrolling, an emerging technique that uses deep learning to accelerate iterative optimization algorithms. Our algorithm enjoys tremendous computational advantages over related approaches and is both interpretable and flexible. We also prove theoretical performance guarantees and conduct experiments to show that our algorithm is highly efficient and performs competitively on synthetic and real-world data sets.

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Citable link to this page

https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37368572

Collections

FAS Theses and Dissertations [6136]

Contact administrator regarding this item (to report mistakes or request changes)