Publication: Bayesian Text Classification and Summarization via a Class-Specified Topic Model
Date
2021
Published Version
Publisher
JMLR
Citation
Wang, Feifei, Junni L. Zhang, Yichao Li, Ke Deng, and Jun S. Liu. "Bayesian Text Classification and Summarization via a Class-Specified Topic Model." Journal of Machine Learning Research 22 (89): 1–48, 2021.
Abstract
We propose the Class-Specified Topic Model (CSTM) to deal with the tasks of text classification and class-specific text summarization. The model assumes that, besides a set of latent topics shared across classes, each class has its own set of class-specific latent topics. Each document is a probabilistic mixture of the class-specific topics associated with its class and the shared topics. Each class-specific or shared topic has its own probability distribution over a given dictionary. We develop Bayesian inference of CSTM in the semi-supervised scenario, with the supervised scenario as a special case. We analyze in detail the 20 Newsgroups dataset, a benchmark dataset for text classification, and demonstrate that CSTM outperforms a two-stage approach based on latent Dirichlet allocation (LDA), several existing supervised extensions of LDA, and an L1-penalized logistic regression. The good performance of CSTM is also demonstrated through Monte Carlo simulations and an analysis of the Reuters dataset.