Publication:
Estimating and Comparing Entropy across Written Natural Languages Using PPM Compression

Date

2002

Citation

Behr, Jr., Frederic H., Victoria Fossum, Michael Mitzenmacher, and David Xiao. 2002. Estimating and Comparing Entropy across Written Natural Languages Using PPM Compression. Harvard Computer Science Group Technical Report TR-12-02.

Abstract

Previous work on estimating the entropy of written natural language has focused primarily on English. We expand this work by considering other natural languages, including Arabic, Chinese, French, Greek, Japanese, Korean, Russian, and Spanish. We present the results of PPM compression on machine-generated and human-generated translations of texts into various languages. Under the assumption that languages are equally expressive, and that PPM compression does well across languages, one would expect translated documents to compress to approximately the same size. We verify this empirically on a novel corpus of translated documents. As an application of this finding, we suggest using the size of compressed natural language texts as a means of automatically testing translation quality.
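
For illustration, the comparison described in the abstract can be sketched in a few lines of code. The sketch is not taken from the report: it substitutes a general-purpose compressor (Python's bz2 module) for PPM, and the corpus file paths are hypothetical placeholders for a set of translations of the same document. Under the stated assumptions, the compressed sizes, and hence the ratios printed below, should come out roughly equal across languages.

import bz2  # stand-in for a PPM compressor; any strong general-purpose compressor illustrates the idea

def compressed_size(path):
    """Compress the raw bytes of a text file and return the compressed size in bytes."""
    with open(path, "rb") as f:
        return len(bz2.compress(f.read(), compresslevel=9))

# Hypothetical parallel corpus: the same document translated into several languages.
translations = {
    "English": "corpus/novel.en.txt",
    "French":  "corpus/novel.fr.txt",
    "Russian": "corpus/novel.ru.txt",
}

sizes = {lang: compressed_size(path) for lang, path in translations.items()}
baseline = sizes["English"]
for lang, size in sizes.items():
    # If the languages are equally expressive and the compressor models each well,
    # these ratios should stay close to 1; a large deviation may flag a poor translation.
    print(f"{lang}: {size} bytes compressed, ratio vs. English = {size / baseline:.2f}")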

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.
