
dc.contributor.advisor    Shieber, Stuart
dc.contributor.author    Schmaltz, Allen
dc.date.accessioned    2019-12-12T08:46:57Z
dc.date.created    2019-05
dc.date.issued    2019-05-03
dc.date.submitted    2019
dc.identifier.citation    Schmaltz, Allen. 2019. Learning to Order & Learning to Correct. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
dc.identifier.uri    http://nrs.harvard.edu/urn-3:HUL.InstRepos:42029644
dc.description.abstract    We investigate three core tasks in Natural Language Processing that are informative toward building tools for writing assistance: word ordering, grammatical error identification, and grammar correction. In doing so, we shed new light on old questions and provide new modeling approaches for real-world applications. The first task, word ordering, aims to correctly order a randomly shuffled sentence. Via this diagnostic task, we find evidence that strong surface-level models are at least as effective as models that utilize explicit syntactic structures for modeling ordering constraints, and we incorporate this insight when approaching the end-user grammar tasks. An advantage of surface-level models is that additional training data is relatively straightforward to acquire. We analyze word ordering output from a surface-level model at scale and find that the remaining errors are associated with greater proportions of n-grams unseen in training, highlighting both a path for future improvements in effectiveness and the clear brittleness of such models, with implications for generation models more generally. The second task, grammatical error identification, seeks to classify whether or not a sentence contains a grammatical error. The third task, grammar correction, seeks to transduce a sentence that may or may not contain errors into a corrected version. For both of these tasks, we utilize insights from the diagnostic word ordering task and adapt modern sequence models to improve effectiveness over contemporary work. Modern sequence models for the grammar tasks require significant amounts of data to be effective. In a final section, we propose methods for noising well-formed text using limited amounts of human-annotated data. With our proposed data augmentation scheme, we demonstrate that sequence models trained with synthetic data can approach the effectiveness of models trained on substantially more human-annotated sentences. At the same time, such semi-supervised approaches remain clearly weaker than models trained with very large amounts of annotated data.
dc.description.sponsorship    Engineering and Applied Sciences - Computer Science
dc.format.mimetype    application/pdf
dc.language.iso    en
dash.license    LAA
dc.subject    natural language processing
dc.subject    computational linguistics
dc.subject    word ordering
dc.subject    grammar correction
dc.title    Learning to Order & Learning to Correct
dc.type    Thesis or Dissertation
dash.depositing.author    Schmaltz, Allen
dc.date.available    2019-12-12T08:46:57Z
thesis.degree.date    2019
thesis.degree.grantor    Graduate School of Arts & Sciences
thesis.degree.level    Doctoral
thesis.degree.name    Doctor of Philosophy
dc.contributor.committeeMember    Rush, Alexander
dc.contributor.committeeMember    Grosz, Barbara
dc.type.material    text
thesis.degree.department    Engineering and Applied Sciences - Computer Science
dash.identifier.vireo
dash.author.email    allen.schmaltz@gmail.com

