Learning to Order & Learning to Correct
Citation: Schmaltz, Allen. 2019. Learning to Order & Learning to Correct. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Abstract: We investigate three core tasks in Natural Language Processing that are informative toward building tools for writing assistance (word ordering, grammatical error identification, and grammar correction), shedding new light on old questions and providing new modeling approaches for real-world applications. The first task, word ordering, aims to correctly order a randomly shuffled sentence. Via this diagnostic task, we find evidence that strong surface-level models are at least as effective as models that use explicit syntactic structures for modeling ordering constraints, and we incorporate this insight when approaching the end-user grammar tasks.
An advantage of surface-level models is that additional training data is relatively straightforward to acquire. We perform an analysis of word ordering output with a surface-level model at scale. We find that remaining errors are associated with greater proportions of n-grams unseen in training, highlighting both a path for future improvements in effectiveness and the clear brittleness of such models, with implications for generation models more generally.
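To make the notion of "proportion of n-grams unseen in training" concrete, the following is a minimal sketch of one way such a statistic could be computed. It is an illustration only, not the dissertation's analysis code: the function names, the choice of bigrams, and the toy data are all assumptions.

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unseen_ngram_proportion(sentence_tokens, training_sentences, n=2):
    """Fraction of the sentence's n-grams that never occur in the training data.

    Hypothetical helper: the actual analysis may count, smooth, or
    normalize differently.
    """
    seen = set()
    for sent in training_sentences:
        seen.update(ngrams(sent, n))
    grams = ngrams(sentence_tokens, n)
    if not grams:
        # Sentence shorter than n tokens: no n-grams to evaluate.
        return 0.0
    unseen = sum(1 for g in grams if g not in seen)
    return unseen / len(grams)


# Toy data, invented for illustration.
training = [["the", "cat", "sat"], ["the", "dog", "ran"]]
output = ["the", "cat", "ran"]
proportion = unseen_ngram_proportion(output, training, n=2)
```

Here the output sentence has two bigrams, one of which ("cat", "ran") does not appear in the toy training set, so the proportion is 0.5.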
The second task, grammatical error identification, seeks to classify whether or not a sentence contains a grammatical error. The third task, grammar correction, seeks to transduce a sentence that may or may not have errors into a corrected version. For both of these tasks, we utilize insights from the diagnostic word ordering task and adapt modern sequence models to improve effectiveness over contemporary work.
Modern sequence models for the grammar tasks require significant amounts of data to be effective. In a final section, we propose methods for noising well-formed text from limited amounts of human-annotated data. With our proposed data augmentation scheme, we demonstrate that sequence models can be trained with synthetic data to approach the levels of effectiveness of models trained on substantially more human-annotated sentences. At the same time, such semi-supervised approaches are still clearly weaker than models trained with very large amounts of annotated data.
Citable link to this page: http://nrs.harvard.edu/urn-3:HUL.InstRepos:42029644
Collection: FAS Theses and Dissertations