Publication: Generalizing the Polya-Gamma Augmented Dynamic Topic Model to Learn Ideas
Date
2016-06-22
Abstract
Probabilistic topic models such as latent Dirichlet allocation are algorithms for detecting the underlying semantic structures of large corpora using Bayesian hierarchical modeling. Documents are modeled as distributions over topics, and topics are modeled as distributions over words. Useful insights can be drawn in an automated manner from massive collections of documents by inferring topics and their distributions from the text. The dynamic topic model is an extension of latent Dirichlet allocation that allows topics to evolve along the time dimension. Polya-gamma augmentation is an auxiliary variable scheme that allows the topic evolution to be expressed in the form of a linear dynamical system whose state can be estimated using a Kalman smoother.
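The generative story described above, with documents as distributions over topics and topics as distributions over words, can be illustrated with a minimal sketch. This is a toy sampler for plain latent Dirichlet allocation only, not the dynamic or Polya-gamma augmented model. The function name `sample_lda_corpus` and the hyperparameters `alpha` and `eta` are hypothetical choices made for illustration.

```python
import numpy as np

def sample_lda_corpus(n_docs, doc_len, n_topics, vocab_size,
                      alpha=0.5, eta=0.5, seed=0):
    """Draw a toy corpus from the LDA generative process (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Each topic is a distribution over the vocabulary.
    topics = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)
    # Each document is a distribution over topics.
    doc_topic = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    docs = []
    for d in range(n_docs):
        # Assign a topic to every token, then draw a word from that topic.
        z = rng.choice(n_topics, size=doc_len, p=doc_topic[d])
        words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        docs.append(words)
    return topics, doc_topic, docs

# Usage: three short documents over a ten-word vocabulary.
topics, doc_topic, docs = sample_lda_corpus(
    n_docs=3, doc_len=20, n_topics=4, vocab_size=10)
```

Inference runs this process in reverse: given only the word indices in `docs`, it estimates `topics` and `doc_topic`.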
Previous research on the Polya-gamma augmented dynamic topic model has always assumed an identity observation model for the Kalman filter. In this work, we focus on extracting further patterns from the data by generalizing both the Polya-gamma augmented dynamic topic model and the cross-corpora Polya-gamma augmented dynamic topic model. Specifically, we derive methods to learn a corpus-specific, dimension-collapsing Kalman filter observation model that projects the feature space of the words onto a lower-dimensional manifold. This allows systematic differences between document collections to be modeled efficiently and interpretably. Additionally, the clustering of words in the lower-dimensional space allows us to uncover relationships between words. Topics go from being distributions over words to distributions over ideas, where ideas are latent clusters of related words, leading to a simpler interpretation of the learned topics. We then evaluate the performance of this new type of probabilistic topic model on two data sets, demonstrating its potential use for tracking sentiment evolution in history and politics and for uncovering autism spectrum disorder disease trajectories.
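To make the role of the observation model concrete, here is a minimal sketch of a textbook Kalman filter followed by a Rauch-Tung-Striebel smoother, where a low-dimensional state is observed through a matrix `H`. It is not the derivation from this work. It assumes a random-walk state transition, isotropic noise variances `q` and `r`, and a fixed `H` (in the work described above, the observation model is learned), and it omits the Polya-gamma augmentation step entirely. All names are hypothetical.

```python
import numpy as np

def kalman_smoother(y, H, q, r):
    """Filter and RTS-smooth a random-walk state observed through H.

    y: (T, V) observations; H: (V, K) observation matrix;
    q, r: scalar state and observation noise variances (assumed isotropic).
    Returns smoothed state means of shape (T, K).
    """
    T, V = y.shape
    K = H.shape[1]
    Q, R = q * np.eye(K), r * np.eye(V)
    m_pred = np.zeros((T, K))
    P_pred = np.zeros((T, K, K))
    m_filt = np.zeros((T, K))
    P_filt = np.zeros((T, K, K))
    m, P = np.zeros(K), np.eye(K)  # assumed prior on the state before t = 0
    for t in range(T):
        # Predict: random-walk transition x_t = x_{t-1} + noise.
        m_pred[t], P_pred[t] = m, P + Q
        # Update: gain G = P_pred H^T S^{-1}, with S the innovation covariance.
        S = H @ P_pred[t] @ H.T + R
        G = np.linalg.solve(S, H @ P_pred[t]).T
        m = m_pred[t] + G @ (y[t] - H @ m_pred[t])
        P = P_pred[t] - G @ H @ P_pred[t]
        m_filt[t], P_filt[t] = m, P
    # Backward (RTS) pass over the filtered estimates.
    m_smooth = m_filt.copy()
    for t in range(T - 2, -1, -1):
        # Smoother gain J = P_filt[t] P_pred[t+1]^{-1} (transition is identity).
        J = np.linalg.solve(P_pred[t + 1], P_filt[t]).T
        m_smooth[t] = m_filt[t] + J @ (m_smooth[t + 1] - m_pred[t + 1])
    return m_smooth

# Usage: a 2-dimensional state observed through a 5 x 2 matrix.
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [0.5, 0.5]])
x_true = np.array([1.0, -2.0])
y = np.tile(H @ x_true, (6, 1))
states = kalman_smoother(y, H, q=0.01, r=1e-4)
```

Setting `H = np.eye(K)` recovers the identity observation model assumed in earlier work. A tall `H` with fewer columns than rows lets a low-dimensional state explain observations in the full word space.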
Keywords
Artificial Intelligence, Statistics, Computer Science
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service