Publication: Generalizing the Polya-Gamma Augmented Dynamic Topic Model to Learn Ideas
Abstract
Probabilistic topic models such as latent Dirichlet allocation are algorithms for detecting the underlying semantic structure of large corpora using Bayesian hierarchical modeling. Documents are modeled as distributions over topics, and topics are modeled as distributions over words. By inferring topics and their distributions from the text, useful insights can be drawn from massive collections of documents in an automated manner. The dynamic topic model is an extension of latent Dirichlet allocation that allows topics to evolve over time. Polya-gamma augmentation is an auxiliary variable scheme that allows the topic evolution to be expressed as a linear dynamical system whose state can be estimated using a Kalman smoother. Previous research on the Polya-gamma augmented dynamic topic model has always assumed the identity as the Kalman filter's observation model. In this work, we focus on extracting further patterns from the data by generalizing both the Polya-gamma augmented dynamic topic model and the cross-corpora Polya-gamma augmented dynamic topic model. Specifically, we derive methods to learn a corpus-specific, dimension-collapsing Kalman filter observation model that projects the feature space of the words onto a lower-dimensional manifold. This allows systematic differences between document collections to be modeled efficiently and in an interpretable manner. Additionally, the clustering of words in the lower-dimensional space allows us to uncover relationships between words. Topics go from being distributions over words to distributions over ideas, where ideas are latent clusters of related words, leading to a simpler interpretation of the learned topics. We then evaluate the performance of this new type of probabilistic topic model on two data sets, demonstrating its potential use for tracking sentiment evolution in history and politics and for uncovering autism spectrum disorder disease trajectories.
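To make the role of the observation model concrete, the following is a minimal, generic sketch of a Kalman filter with a Rauch-Tung-Striebel smoother in which the observation matrix is not the identity. It is not the inference procedure derived in this work. It assumes a random-walk latent state of dimension k (the "ideas"), a fixed observation matrix `H` of shape V x k that maps that state to V word-level pseudo-observations, and Gaussian noise covariances `Q` and `R`. The names `kalman_smoother`, `H`, `Q`, `R`, `m0`, and `P0`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
import numpy as np

def kalman_smoother(y, H, Q, R, m0, P0):
    """Kalman filter + RTS smoother for an assumed model:
    x_t = x_{t-1} + w_t,  w_t ~ N(0, Q)   (random-walk state, dimension k)
    y_t = H x_t + v_t,    v_t ~ N(0, R)   (observations, dimension V)
    """
    T, k = y.shape[0], m0.shape[0]
    m_pred = np.zeros((T, k)); P_pred = np.zeros((T, k, k))
    m_filt = np.zeros((T, k)); P_filt = np.zeros((T, k, k))

    m, P = m0, P0
    for t in range(T):
        # Predict step (identity transition).
        m_p, P_p = m, P + Q
        # Update step with a general observation matrix H.
        S = H @ P_p @ H.T + R                    # innovation covariance
        K = np.linalg.solve(S, H @ P_p).T        # gain K = P_p H^T S^{-1}
        m = m_p + K @ (y[t] - H @ m_p)
        P = P_p - K @ S @ K.T
        m_pred[t], P_pred[t] = m_p, P_p
        m_filt[t], P_filt[t] = m, P

    # Backward (RTS) pass.
    m_s, P_s = m_filt.copy(), P_filt.copy()
    for t in range(T - 2, -1, -1):
        # Smoother gain G = P_filt[t] P_pred[t+1]^{-1}.
        G = np.linalg.solve(P_pred[t + 1], P_filt[t]).T
        m_s[t] = m_filt[t] + G @ (m_s[t + 1] - m_pred[t + 1])
        P_s[t] = P_filt[t] + G @ (P_s[t + 1] - P_pred[t + 1]) @ G.T
    return m_s, P_s

# Toy example (all values hypothetical): V = 6 words load on k = 2 ideas.
rng = np.random.default_rng(0)
T, V, k = 20, 6, 2
H = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.0],
              [0.0, 1.0], [0.1, 0.9], [0.0, 0.8]])
Q = 0.05 * np.eye(k)
R = 0.5 * np.eye(V)
x = np.cumsum(rng.normal(scale=np.sqrt(0.05), size=(T, k)), axis=0)
y = x @ H.T + rng.normal(scale=np.sqrt(0.5), size=(T, V))

m_s, P_s = kalman_smoother(y, H, Q, R, np.zeros(k), np.eye(k))
print(m_s.shape, P_s.shape)
```

Setting `H` to the identity matrix (with V equal to k) recovers the identity observation model that the abstract attributes to earlier work. In this sketch `H` is fixed by hand, whereas the work described above learns it from the corpus.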