Clustering via Deep Dictionary Learning
Tankala, Pranay Bennett
MetadataShow full item record
CitationTankala, Pranay Bennett. 2021. Clustering via Deep Dictionary Learning. Bachelor's thesis, Harvard College.
AbstractClustering, the process of discovering hidden groups in data, is a fundamentally important problem in statistical data analysis with applications ranging from image segmentation to sequence analysis to anomaly detection. Although clustering algorithms have traditionally utilized a variety of techniques from statistics, optimization, and linear algebra, in recent years, clustering algorithms have increasingly relied on deep learning to achieve state-of-the-art performance. Deep learning, the subfield of machine learning based on the use of artificial neural networks, has recently revolutionized a wide range of scientific disciplines. Unfortunately, the astonishing increase in accuracy and efficiency achieved by modern deep clustering algorithms has come at the cost of transparency and interpretability, with neural networks often making predictions and learning weights that are difficult to rigorously analyze.
In this work, we attempt to bridge the gap between state-of-the-art deep clustering methods and older but better-understood optimization algorithms for clustering. To this end, we propose, analyze, and evaluate novel, domain-agnostic clustering algorithms that offer the performance gains of deep learning while retaining the interpretability of existing clustering algorithms based on spectral methods and sparse self-representation. In contrast to black-box deep learning approaches, our algorithms facilitate rigorous analysis and provable performance guarantees.
Our first contribution is a novel theoretical analysis of a clustering algorithm developed in (Theodosis et al., 2021), which uses a highly structured recurrent neural network to perform subspace clustering, the task of partitioning a data set into clusters corresponding to low-dimensional linear subspaces of a high-dimensional ambient space. Using analytical tools from convex optimization and compressed sensing, we relate the convergence of this network's hidden states to properties of the underlying data distribution.
We also propose a novel manifold clustering algorithm (Tankala et al., 2021) that draws on techniques from sparse dictionary learning and nonlinear dimensionality reduction. Our method is based on algorithm unrolling, an emerging technique that uses deep learning to accelerate iterative optimization algorithms. Our algorithm enjoys tremendous computational advantages over related approaches and is both interpretable and flexible. We also prove theoretical performance guarantees and conduct experiments to show that our algorithm is highly efficient and performs competitively on synthetic and real-world data sets.
Citable link to this pagehttps://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37368572
- FAS Theses and Dissertations