Learning to Order & Learning to Correct

Schmaltz, Allen

dc.contributor.advisor	Shieber, Stuart
dc.contributor.author	Schmaltz, Allen
dc.date.accessioned	2019-12-12T08:46:57Z
dc.date.created	2019-05
dc.date.issued	2019-05-03
dc.date.submitted	2019
dc.identifier.citation	Schmaltz, Allen. 2019. Learning to Order & Learning to Correct. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
dc.identifier.uri	http://nrs.harvard.edu/urn-3:HUL.InstRepos:42029644
dc.description.abstract	We investigate three core tasks in Natural Language Processing that are informative toward building tools for writing assistance---word ordering, grammatical error identification, and grammar correction---shedding new light on old questions and providing new modeling approaches for real-world applications. The first task, word ordering, aims to correctly order a randomly shuffled sentence. Via this diagnostic task, we find evidence that strong surface-level models are at least as effective as the models utilizing explicit syntactic structures for modeling ordering constraints, and we incorporate this insight when approaching the end-user grammar tasks. An advantage of surface-level models is that additional training data is relatively straightforward to acquire. We perform an analysis of word ordering output with a surface-level model at scale. We find that remaining errors are associated with greater proportions of n-grams unseen in training, highlighting both a path for future improvements in effectiveness and the clear brittleness of such models, with implications for generation models, more generally. The second task, grammatical error identification, seeks to classify whether or not a sentence contains a grammatical error. The third task, grammar correction, seeks to transduce a sentence that may or may not have errors into a corrected version. For both of these tasks, we utilize insights from the diagnostic word ordering task and adapt modern sequence models to improve effectiveness over contemporary work. Modern sequence models for the grammar tasks require significant amounts of data to be effective. In a final section, we propose methods for noising well-formed text from limited amounts of human annotated data. With our proposed data augmentation scheme, we demonstrate that sequence models can be trained with synthetic data to approach the levels of effectiveness of models trained on substantially more human annotated sentences. At the same time, such semi-supervised approaches are still clearly weaker than models trained with very large amounts of annotated data.
dc.description.sponsorship	Engineering and Applied Sciences - Computer Science
dc.format.mimetype	application/pdf
dc.language.iso	en
dash.license	LAA
dc.subject	natural language processing
dc.subject	computational linguistics
dc.subject	word ordering
dc.subject	grammar correction
dc.title	Learning to Order & Learning to Correct
dc.type	Thesis or Dissertation
dash.depositing.author	Schmaltz, Allen
dc.date.available	2019-12-12T08:46:57Z
thesis.degree.date	2019
thesis.degree.grantor	Graduate School of Arts & Sciences
thesis.degree.level	Doctoral
thesis.degree.name	Doctor of Philosophy
dc.contributor.committeeMember	Rush, Alexander
dc.contributor.committeeMember	Grosz, Barbara
dc.type.material	text
thesis.degree.department	Engineering and Applied Sciences - Computer Science
dash.identifier.vireo
dash.author.email	allen.schmaltz@gmail.com

Files in this item

Name: SCHMALTZ-DISSERTATION-2019.pdf
Size: 1.422Mb
Format: PDF

This item appears in the following Collection(s)

FAS Theses and Dissertations [6136]
