Publication: Learning to Order & Learning to Correct
Date
2019-05-03
Authors
Schmaltz, Allen
Published Version
Citation
Schmaltz, Allen. 2019. Learning to Order & Learning to Correct. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Abstract
We investigate three core tasks in Natural Language Processing that are informative for building tools for writing assistance---word ordering, grammatical error identification, and grammar correction---shedding new light on old questions and providing new modeling approaches for real-world applications. The first task, word ordering, aims to correctly order a randomly shuffled sentence. Via this diagnostic task, we find evidence that strong surface-level models are at least as effective as models that utilize explicit syntactic structures for modeling ordering constraints, and we incorporate this insight when approaching the end-user grammar tasks.
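As a rough illustration of the word ordering diagnostic, the sketch below recovers the order of a shuffled sentence using only a surface-level score. The toy corpus, the add-one-smoothed bigram language model, and the exhaustive search over permutations are illustrative assumptions for a minimal example, not the dissertation's actual models or search procedure.

```python
import itertools
import math
from collections import Counter

# Word-ordering diagnostic (minimal sketch): score candidate orderings of a
# shuffled sentence with a surface-level language model and keep the best one.

def train_bigram_lm(corpus):
    """Collect unigram and bigram counts from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def score(order, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a candidate ordering."""
    toks = ["<s>"] + list(order) + ["</s>"]
    logp = 0.0
    for prev, cur in zip(toks, toks[1:]):
        logp += math.log((bigrams[(prev, cur)] + 1) /
                         (unigrams[prev] + vocab_size))
    return logp

def reorder(shuffled, unigrams, bigrams):
    """Exhaustively score permutations (tractable only for short sentences)."""
    vocab_size = len(unigrams)
    return max(itertools.permutations(shuffled),
               key=lambda order: score(order, unigrams, bigrams, vocab_size))

# Tiny illustrative corpus and a shuffled test sentence.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
unigrams, bigrams = train_bigram_lm(corpus)
print(reorder(["on", "cat", "the", "sat", "mat", "the"], unigrams, bigrams))
```

In practice, exhaustive search is replaced by beam search and the bigram scorer by a stronger surface-level (e.g., neural) language model; the point of the sketch is only that no explicit syntactic structure is consulted.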
An advantage of surface-level models is that additional training data is relatively straightforward to acquire. We perform an analysis of word ordering output from a surface-level model at scale. We find that remaining errors are associated with greater proportions of n-grams unseen in training, highlighting both a path for future improvements in effectiveness and the clear brittleness of such models, with implications for generation models more generally.
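The kind of error analysis described above can be sketched as follows: for each model output, measure the fraction of its n-grams that never appeared in the training data. The trigram order and whitespace-style tokenization here are illustrative choices, not the dissertation's exact configuration.

```python
# Proportion of output n-grams unseen in training (illustrative sketch).

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unseen_ngram_proportion(output_tokens, training_ngrams, n=3):
    """Fraction of the output's n-grams that are absent from the training set."""
    out = ngrams(output_tokens, n)
    if not out:
        return 0.0
    unseen = sum(1 for g in out if g not in training_ngrams)
    return unseen / len(out)

train_sents = [["the", "cat", "sat", "on", "the", "mat"]]
training_trigrams = {g for s in train_sents for g in ngrams(s, 3)}
print(unseen_ngram_proportion(["the", "cat", "sat", "on", "a", "mat"],
                              training_trigrams))
```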
The second task, grammatical error identification, seeks to classify whether or not a sentence contains a grammatical error. The third task, grammar correction, seeks to transduce a sentence that may or may not have errors into a corrected version. For both of these tasks, we utilize insights from the diagnostic word ordering task and adapt modern sequence models to improve effectiveness over contemporary work.
Modern sequence models for the grammar tasks require significant amounts of data to be effective. In a final section, we propose methods for noising well-formed text from limited amounts of human-annotated data. With our proposed data augmentation scheme, we demonstrate that sequence models can be trained with synthetic data to approach the effectiveness of models trained on substantially more human-annotated sentences. At the same time, such semi-supervised approaches are still clearly weaker than models trained with very large amounts of annotated data.
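The sketch below illustrates the general shape of such data augmentation: corrupt a well-formed sentence to produce a synthetic (errorful, corrected) pair for training a correction model. The specific noise operations and rates here are assumptions for illustration; the dissertation's scheme derives its noising from limited human-annotated data rather than these fixed heuristics.

```python
import random

# Noising well-formed text into synthetic (source, target) pairs (sketch).

def noise_sentence(tokens, drop_p=0.1, swap_p=0.1, dup_p=0.05, seed=None):
    """Return a corrupted copy of `tokens` with drops, duplications, and swaps."""
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        r = rng.random()
        if r < drop_p:
            continue                    # simulate a missing word
        noisy.append(tok)
        if r < drop_p + dup_p:
            noisy.append(tok)           # simulate a repeated word
    # Occasionally swap adjacent tokens to simulate ordering errors.
    for i in range(len(noisy) - 1):
        if rng.random() < swap_p:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy

clean = ["she", "has", "lived", "here", "for", "ten", "years"]
pair = (noise_sentence(clean, seed=0), clean)   # (errorful source, clean target)
print(pair)
```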
Keywords
natural language processing, computational linguistics, word ordering, grammar correction
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.