Publication:
On the unsupervised analysis of domain-specific Chinese texts

Date

2016

Publisher

Proceedings of the National Academy of Sciences

Citation

Deng, Ke, Peter K. Bol, Kate J. Li, and Jun S. Liu. 2016. “On the Unsupervised Analysis of Domain-Specific Chinese Texts.” Proc Natl Acad Sci USA 113, no. 22: 6154–6159. doi:10.1073/pnas.1516510113.

Abstract

With the growing availability of digitized text data, both public and private, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications. We introduce an unsupervised method, top-down word discovery and segmentation (TopWORDS), for simultaneously discovering and segmenting words and phrases from large volumes of unstructured Chinese texts, and propose ways to order discovered words and conduct higher-level context analyses. TopWORDS is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora. When outputs from TopWORDS are fed into context analysis tools such as topic modeling, word embedding, and association pattern finding, the results are as good as or better than those obtained using outputs of a supervised segmentation method.
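
As an illustration of the segmentation step only, the following Python sketch finds the most probable split of an unsegmented string under a simple unigram word model, using dynamic programming. It is not the authors' TopWORDS implementation: the function name `segment`, the `max_len` parameter, and the toy dictionary `toy_probs` with its probabilities are all hypothetical, and the word probabilities are hard-coded here rather than learned from raw text without supervision as the abstract describes.

```python
import math


def segment(text, word_probs, max_len=4):
    """Return the most probable segmentation of `text` under a unigram model.

    `word_probs` maps candidate words to (assumed, illustrative) probabilities.
    Candidate words longer than `max_len` characters are not considered.
    """
    n = len(text)
    # best[i] holds the log-probability of the best segmentation of text[:i].
    best = [-math.inf] * (n + 1)
    # back[i] holds the start index of the last word in that segmentation.
    back = [0] * (n + 1)
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prob = word_probs.get(text[start:end])
            if prob is None or best[start] == -math.inf:
                continue  # unknown word, or prefix cannot be segmented
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this dictionary")

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Hypothetical toy dictionary; the probabilities are made up for illustration.
toy_probs = {
    "研究": 0.2, "研究生": 0.05, "生命": 0.2, "命": 0.01, "起源": 0.2,
    "研": 0.01, "究": 0.01, "生": 0.01, "起": 0.01, "源": 0.01,
}

print(segment("研究生命起源", toy_probs))
```

With these made-up values, the split 研究 / 生命 / 起源 scores higher than the competing split 研究生 / 命 / 起源, so the former is returned.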

Keywords

word discovery, text segmentations, EM algorithm, Chinese history, blogs

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
