Publication: Estimating and Comparing Entropy across Written Natural Languages Using PPM Compression
Date
2002
Citation
Behr, Jr., Frederic H., Victoria Fossum, Michael Mitzenmacher, and David Xiao. 2002. Estimating and Comparing Entropy across Written Natural Languages Using PPM Compression. Harvard Computer Science Group Technical Report TR-12-02.
Abstract
Previous work on estimating the entropy of written natural language has focused primarily on English. We expand this work by considering other natural languages, including Arabic, Chinese, French, Greek, Japanese, Korean, Russian, and Spanish. We present the results of PPM compression on machine-generated and human-generated translations of texts into various languages. Under the assumption that languages are equally expressive, and that PPM compression does well across languages, one would expect translated documents to compress to approximately the same size. We verify this empirically on a novel corpus of translated documents. As an application of this finding, we suggest using the size of compressed natural language texts as a means of automatically testing translation quality.
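The comparison described in the abstract can be illustrated with a short sketch: compress parallel translations of the same document and compare the resulting sizes. This is only a minimal illustration, not the authors' pipeline; since PPM is not available in the Python standard library, bz2 is used here as a stand-in compressor, and the file paths and language labels are hypothetical.

import bz2  # stand-in compressor; the paper uses PPM, which has no standard-library implementation

def compressed_size(path: str) -> int:
    """Return the size in bytes of the compressed file contents."""
    with open(path, "rb") as f:
        data = f.read()
    return len(bz2.compress(data, 9))

# Hypothetical parallel translations of one document.
translations = {
    "english": "corpus/doc1.en.txt",
    "french":  "corpus/doc1.fr.txt",
    "russian": "corpus/doc1.ru.txt",
}

sizes = {lang: compressed_size(path) for lang, path in translations.items()}
for lang, size in sizes.items():
    print(f"{lang}: {size} bytes compressed")

# If the languages are equally expressive and the compressor models each
# language well, the compressed sizes should be roughly equal; a large
# deviation could flag a low-quality translation.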
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service