Publication: Learning from high-dimensional measurements
Abstract
This thesis considers the problems of distilling and quantifying information in high-dimensional measurements, with a focus on applications in biology. First, we explore the idea that underlying low-dimensional structure in high-dimensional data can be exploited to circumvent the curse of dimensionality in mutual information (MI) estimation. We develop a method that we call latent MI (LMI) approximation, which applies a nonparametric MI estimator to low-dimensional representations learned by a simple, theoretically motivated model architecture. Using several benchmarks, we show that, unlike existing techniques, LMI can approximate MI well for variables with more than $10^3$ dimensions when their dependence structure has low intrinsic dimensionality. Second, we study how measurement noise in data affects the quality of representation learning models. Using an information-theoretic metric of representation quality, we show that model performance scales predictably with molecular undersampling noise in single-cell genomic data. We show that the form of this relationship can be recovered from a simple Gaussian noise model, which provides an intuitive interpretation of the scaling law. Finally, we show that the same scaling relationship emerges in image classification problems, suggesting that noise scaling may be a general phenomenon.