Publication:
Language Modeling by Clustering with Word Embeddings for Text Readability Assessment

Date

2017

Publisher

ACM

Citation

Cha, Miriam, Youngjune Gwon, and H. T. Kung. 2017. Language Modeling by Clustering with Word Embeddings for Text Readability Assessment. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, November 6-7, 2017, Singapore, Singapore, 2003-2006.

Abstract

We present a clustering-based language model using word embeddings for text readability prediction. Presumably, a Euclidean semantic space hypothesis holds true for word embeddings whose training is done by observing word co-occurrences. We argue that clustering with word embeddings in the metric space should yield feature representations in a higher semantic space appropriate for text regression. Also, by representing features in terms of histograms, our approach can naturally address documents of varying lengths. An empirical evaluation using the Common Core Standards corpus reveals that the features formed on our clustering-based language model significantly improve on the previously known results for the same corpus in readability prediction. We also evaluate the task of sentence matching based on semantic relatedness using the Wiki-SimpleWiki corpus and find that our features lead to superior matching performance.
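
The general idea in the abstract, clustering word embeddings in Euclidean space and describing each document as a histogram over clusters, can be illustrated with a minimal sketch. This is not the authors' implementation: the plain k-means routine, the toy 2-D vectors in EMBEDDINGS, and the function names kmeans and cluster_histogram are all assumptions made for illustration, and the record does not state which clustering algorithm or embedding model the paper uses.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(point, centroids[k]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means (an assumed stand-in for the paper's clustering step)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group;
        # an empty group keeps its previous centroid.
        centroids = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def cluster_histogram(tokens, embeddings, centroids):
    """Normalized histogram of cluster assignments for one document.

    The output length is always len(centroids), whatever the document length.
    Tokens without an embedding are skipped.
    """
    counts = Counter(nearest(embeddings[t], centroids)
                     for t in tokens if t in embeddings)
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0
            for k in range(len(centroids))]


# Hypothetical 2-D embeddings; real embeddings have hundreds of dimensions.
EMBEDDINGS = {
    "cat": [0.9, 0.1], "dog": [1.0, 0.2], "pet": [0.8, 0.0],
    "theorem": [0.1, 0.9], "lemma": [0.0, 1.0], "proof": [0.2, 0.8],
}

centroids = kmeans(list(EMBEDDINGS.values()), k=2)
short_doc = cluster_histogram("cat dog".split(), EMBEDDINGS, centroids)
long_doc = cluster_histogram(
    "cat dog pet cat dog pet proof".split(), EMBEDDINGS, centroids)
print(short_doc, long_doc)
```

Both documents map to a vector with one entry per cluster, which is how a histogram representation sidesteps differences in document length. Such fixed-length vectors could then be passed to any regression model for readability scoring.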
