Publication:
Generalizing the Polya-Gamma Augmented Dynamic Topic Model to Learn Ideas

Date

2016-06-22

Abstract

Probabilistic topic models such as latent Dirichlet allocation are algorithms for detecting the underlying semantic structures of large corpora using Bayesian hierarchical modeling. Documents are modeled as distributions over topics, and topics are modeled as distributions over words. Useful insights can be drawn in an automated manner from massive collections of documents by inferring topics and their distributions from the text. The dynamic topic model is an extension of latent Dirichlet allocation that allows topics to evolve along the time dimension. Polya-gamma augmentation is an auxiliary variable scheme that allows the topic evolution to be expressed as a linear dynamical system whose state can be estimated using a Kalman smoother. Previous research involving the Polya-gamma augmented dynamic topic model has always assumed the identity matrix as the Kalman filter's observation model. In this work, we focus on extracting further patterns from the data by generalizing both the Polya-gamma augmented dynamic topic model and the cross-corpora Polya-gamma augmented dynamic topic model. Specifically, we derive methods to learn a corpus-specific, dimension-collapsing Kalman filter observation model that projects the feature space of the words onto a lower-dimensional manifold. This allows systematic differences between document collections to be modeled efficiently and in an interpretable manner. Additionally, the clustering of words in the lower-dimensional space allows us to uncover relationships between words. Topics go from being distributions over words to distributions over ideas, where ideas are latent clusters of related words, leading to a simpler interpretation of the learned topics. We then evaluate the performance of this new type of probabilistic topic model on two data sets, demonstrating its potential use for tracking sentiment evolution in history and politics and for uncovering autism spectrum disorder disease trajectories.
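To make the state-space view concrete, the sketch below illustrates the kind of linear dynamical system the abstract describes: low-dimensional latent "idea" states evolve over time and are mapped to word-level observations through a dimension-collapsing observation matrix C, with the state trajectory recovered by a Kalman filter and Rauch-Tung-Striebel smoother. This is a minimal illustration, not the thesis implementation: the observation matrix is fixed at random rather than learned, the Polya-gamma augmentation of count data is omitted, and all names (n_ideas, n_words, C, Q, R) are illustrative assumptions.

import numpy as np

def kalman_smoother(y, A, C, Q, R, mu0, V0):
    """Kalman filter + RTS smoother for
    x_t = A x_{t-1} + w_t,  w_t ~ N(0, Q)
    y_t = C x_t     + v_t,  v_t ~ N(0, R)."""
    T, d = len(y), len(mu0)
    mu_f = np.zeros((T, d)); V_f = np.zeros((T, d, d))   # filtered moments
    mu_p = np.zeros((T, d)); V_p = np.zeros((T, d, d))   # predicted moments
    # Forward (filtering) pass.
    for t in range(T):
        if t == 0:
            mu_p[t], V_p[t] = mu0, V0
        else:
            mu_p[t] = A @ mu_f[t - 1]
            V_p[t] = A @ V_f[t - 1] @ A.T + Q
        S = C @ V_p[t] @ C.T + R                    # innovation covariance
        K = V_p[t] @ C.T @ np.linalg.inv(S)         # Kalman gain
        mu_f[t] = mu_p[t] + K @ (y[t] - C @ mu_p[t])
        V_f[t] = (np.eye(d) - K @ C) @ V_p[t]
    # Backward (smoothing) pass.
    mu_s, V_s = mu_f.copy(), V_f.copy()
    for t in range(T - 2, -1, -1):
        J = V_f[t] @ A.T @ np.linalg.inv(V_p[t + 1])
        mu_s[t] = mu_f[t] + J @ (mu_s[t + 1] - A @ mu_f[t])
        V_s[t] = V_f[t] + J @ (V_s[t + 1] - V_p[t + 1]) @ J.T
    return mu_s, V_s

# Toy usage: 20 time steps, 50 words, 5 latent "ideas".
rng = np.random.default_rng(0)
T, n_words, n_ideas = 20, 50, 5
A = np.eye(n_ideas)                         # random-walk idea dynamics
C = rng.normal(size=(n_words, n_ideas))     # dimension-collapsing observation model
Q = 0.1 * np.eye(n_ideas)
R = 0.5 * np.eye(n_words)
x_true = np.cumsum(rng.normal(scale=0.3, size=(T, n_ideas)), axis=0)
y = x_true @ C.T + rng.normal(scale=0.7, size=(T, n_words))
mu_s, _ = kalman_smoother(y, A, C, Q, R, np.zeros(n_ideas), np.eye(n_ideas))
print(mu_s.shape)  # (20, 5): smoothed idea trajectories over time

In the thesis setting, C would itself be learned per corpus, and the columns of the word-to-idea projection induce the clusters of related words that the abstract calls ideas.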

Keywords

Artificial Intelligence, Statistics, Computer Science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
