Publication:
Essays on Statistical Models for Text Analysis

Date

2023-08-02

The Harvard community has made this article openly available.


Citation

Eshima, Shusei. 2023. Essays on Statistical Models for Text Analysis. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.

Abstract

Fully automated content analysis based on statistical models has become popular among social scientists because of its scalability. Despite their use for measurement and exploration, these models have significant shortcomings. For measurement, they often capture specific concepts inaccurately, splitting one concept across several similar topics or merging distinct concepts into a single topic. For exploration, by assuming a flat topic structure, the models ignore the potential interconnectedness of the estimated topics. These limitations prevent the models from fully serving their intended purposes. This dissertation proposes three new statistical models that address these limitations.

The first paper demonstrates that keywords can substantially improve the measurement performance of topic models. An important advantage of the proposed keyword-assisted topic model (keyATM) is that it requires researchers to define topics before fitting the model to the data. This contrasts with the widespread practice of post hoc topic interpretation and adjustment, which compromises the objectivity of empirical findings. The keyATM can also incorporate covariates and model time trends. Applications show that the keyATM yields more interpretable results, classifies documents more accurately, and is less sensitive to the number of topics than standard topic models.

The second paper addresses a common problem in hierarchical topic models. Although such models have been used to explore large numbers of diverse topics in a corpus, existing models yield fragmented topics with overlapping themes, and the expected topic probability becomes exponentially smaller with the depth of the tree. To solve this intrinsic problem, the paper proposes a scale-invariant infinite hierarchical topic model (ihLDA). The ihLDA adaptively adjusts topic creation so that the expected topic probability decays considerably more slowly than in existing models, facilitating the estimation of deeper topic structures that encompass the diverse topics in a corpus. Furthermore, the ihLDA extends a widely used tree-structured prior in a hierarchical Bayesian way, which makes it possible to draw an infinite topic tree from the base tree while efficiently sampling the topic assignments of the words. Experiments demonstrate that the ihLDA achieves better topic uniqueness and hierarchical diversity than existing approaches, including state-of-the-art neural models.

The third paper proposes a keyword-based hierarchical language model. The hierarchical Pitman-Yor language model (HPYLM) is a widely used Bayesian language model that closely captures the power-law word-frequency distributions observed in natural language. However, its unsupervised nature makes it difficult to extract information tailored to users' needs from a corpus. This paper extends the HPYLM to a semi-supervised model by incorporating a Dirichlet Forest prior, yielding the semi-supervised hierarchical Pitman-Yor language model (sHPY). The sHPY outperforms human coders and state-of-the-art methods in two applications: keyword refinement and topic modeling.
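The core idea behind keyword assistance can be illustrated with a minimal sketch. This is not the actual keyATM implementation (whose details are in the first paper); it only shows, under assumed vocabulary, topics, and hyperparameter values, how giving each predefined topic extra pseudo-counts for its keywords anchors the topic to the researcher's concept before any data are seen.

```python
import numpy as np

# Hypothetical vocabulary and researcher-defined topics with keywords.
vocab = ["tax", "budget", "deficit", "war", "troops", "treaty"]
keywords = {
    "economy": ["tax", "budget", "deficit"],
    "defense": ["war", "troops", "treaty"],
}

base_prior = 0.01  # symmetric Dirichlet pseudo-count for ordinary words
boost = 1.0        # extra pseudo-count for a topic's own keywords (assumed value)

word_index = {w: i for i, w in enumerate(vocab)}
prior = np.full((len(keywords), len(vocab)), base_prior)
for k, words in enumerate(keywords.values()):
    for w in words:
        prior[k, word_index[w]] += boost

# Normalizing each row gives the expected topic-word distribution under the
# prior: each topic places most of its mass on its own keywords a priori.
expected = prior / prior.sum(axis=1, keepdims=True)
print(expected.round(3))
```

Because the keywords dominate each topic's prior, the fitted topics remain tied to the predefined concepts rather than drifting toward whatever co-occurrence patterns happen to dominate the corpus.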

Keywords

Quantitative Text Analysis, Topic Models, Political Science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.
