Publication: Bayesian Text Classification and Summarization via a Class-Specified Topic Model
Abstract
We propose the Class-Specified Topic Model (CSTM) for text classification and class-specific text summarization. The model assumes that, in addition to a set of latent topics shared across classes, each class has its own set of class-specific latent topics. Each document is a probabilistic mixture of the shared topics and the class-specific topics associated with its class. Each topic, whether class-specific or shared, has its own probability distribution over a given dictionary. We develop Bayesian inference for CSTM in the semi-supervised scenario, with the supervised scenario as a special case. We analyze in detail the 20 Newsgroups dataset, a benchmark dataset for text classification, and demonstrate that CSTM outperforms a two-stage approach based on latent Dirichlet allocation (LDA), several existing supervised extensions of LDA, and an L1-penalized logistic regression. The strong performance of CSTM is also demonstrated through Monte Carlo simulations and an analysis of the Reuters dataset.
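The generative structure described in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the abstract does not state the priors, so the symmetric Dirichlet priors, the hyperparameters `alpha` and `beta`, the topic counts, and all function names below are assumptions chosen by analogy with LDA. The sketch shows only how a document of a given class draws its words from that class's topics plus the shared topics; it does not perform inference.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one symmetric Dirichlet(alpha) vector of length `size` (assumed prior)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index with probability proportional to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(label, shared_topics, class_topics, doc_length, alpha, rng):
    """Generate word ids for one document of class `label`."""
    # A document may only use its own class's topics plus the shared topics.
    topics = class_topics[label] + shared_topics
    # Document-level mixture weights over the available topics.
    theta = sample_dirichlet(alpha, len(topics), rng)
    words = []
    for _ in range(doc_length):
        z = sample_categorical(theta, rng)               # pick a topic
        words.append(sample_categorical(topics[z], rng)) # pick a word from it
    return words

# Hypothetical sizes and hyperparameters, for illustration only.
rng = random.Random(0)
vocab_size, n_classes = 50, 3
n_shared, n_per_class = 2, 2
alpha, beta = 0.5, 0.1

# Each topic is a probability distribution over the dictionary.
shared_topics = [sample_dirichlet(beta, vocab_size, rng) for _ in range(n_shared)]
class_topics = [
    [sample_dirichlet(beta, vocab_size, rng) for _ in range(n_per_class)]
    for _ in range(n_classes)
]

doc = generate_document(1, shared_topics, class_topics, 20, alpha, rng)
print(doc)
```

Under these assumptions, words drawn from class-specific topics carry the signal that separates classes, while the shared topics absorb vocabulary common to all classes.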