Sentence-Level Grammatical Error Identification as Sequence-to-Sequence Correction

Allen Schmaltz  Yoon Kim  Alexander M. Rush  Stuart M. Shieber
Harvard University
{schmaltz@fas,yoonkim@seas,srush@seas,shieber@seas}.harvard.edu

Abstract

We demonstrate that an attention-based encoder-decoder model can be used for sentence-level grammatical error identification for the Automated Evaluation of Scientific Writing (AESW) Shared Task 2016. The attention-based encoder-decoder models can be used for the generation of corrections, in addition to error identification, which is of interest for certain end-user applications. We show that a character-based encoder-decoder model is particularly effective, outperforming other results on the AESW Shared Task on its own, and showing gains over a word-based counterpart. Our final model, a combination of three character-based encoder-decoder models, one word-based encoder-decoder model, and a sentence-level CNN, is the highest-performing system on the AESW 2016 binary prediction Shared Task.

1 Introduction

The recent confluence of data availability and strong sequence-to-sequence learning algorithms has the potential to lead to practical tools for writing support. Grammatical error identification is one such application of potential utility as a component of a writing support tool. Much of the recent work in grammatical error identification and correction has made use of hand-tuned rules and features that augment data-driven approaches, or individual classifiers for human-designated subsets of errors. Given a large, annotated dataset of scientific journal articles, we propose a fully data-driven approach for this problem, inspired by recent work in neural machine translation and, more generally, sequence-to-sequence learning (Sutskever et al., 2014; Bahdanau et al., 2014; Cho et al., 2014).

The Automated Evaluation of Scientific Writing (AESW) 2016 dataset is a collection of nearly 10,000 scientific journal articles (over 1 million sentences) published between 2006 and 2013 and annotated with corrections by professional, native English-speaking editors. The goal of the associated AESW Shared Task is to identify whether or not a given unedited source sentence was corrected by the editor (that is, whether a given source sentence has one or more grammatical errors, broadly construed).

This system report describes our approach and submission to the AESW 2016 Shared Task, which establishes the current highest-performing public baseline for the binary prediction task. Our primary contribution is to demonstrate the utility of an attention-based encoder-decoder model for the binary prediction task. We also provide evidence of tangible performance gains using a character-aware version of the model, building on the character-aware language modeling work of Kim et al. (2016). In addition to sentence-level classification, the models are capable of intra-sentence error identification and the generation of possible corrections. We also obtain additional gains by using an ensemble of a generative encoder-decoder and a discriminative CNN classifier.

2 Background

Recent work in natural language processing has shown strong results in sequence-to-sequence transformations using recurrent neural network models (Cho et al., 2014; Sutskever et al., 2014). Grammar correction and error identification can be cast as a sequence-to-sequence translation problem, in which an unedited (source) sentence is "translated" into a corrected (target) sentence in the same language.
Using this framework, sentence-level error identification then simply reduces to an equality check between the source and target sentences.

The goal of the AESW Shared Task is to identify whether a particular sentence needs to be edited (contains a "grammatical" error, broadly construed [1]). The dataset consists of sentences taken from academic articles annotated with corrections by professional editors. Annotations are described via insertions and deletions, which are marked with start and end tags. Tokens to be deleted are surrounded with the deletion start tag <del> and the deletion end tag </del>, and tokens to be inserted are surrounded with the insertion start tag <ins> and the insertion end tag </ins>. Replacements (as shown in Figure 1) are represented as deletion-insertion pairs. Unlike the data of the related CoNLL-2014 Shared Task (Ng et al., 2014), errors are not labeled with fine-grained types (article or determiner error, verb tense error, etc.).

More formally, we assume a vocabulary $V$ of natural language word types (some of which have orthographic errors) and a set Q = {<ins>, </ins>, <del>, </del>} of annotation tags. Given a sentence $s = [s_1, \ldots, s_I]$, where $s_i \in V$ is the $i$-th token of the sentence of length $I$, we seek to predict whether or not the gold, annotated target sentence $t = [t_1, \ldots, t_J]$, where $t_j \in Q \cup V$ is the $j$-th token of the annotated sentence of length $J$, is identical to $s$. We are given both $s$ and $t$ for supervised training. At test time, we are only given access to the sequence $s$, and we learn to predict the sequence $t$. Evaluation of this binary prediction task is via the F1-score, where the positive class is the one indicating that an error is present in the sentence (that is, where $s \neq t$) [2].

Evaluation is at the sentence level, but the paragraph-level context for each sentence is also provided. The paragraphs, themselves, are shuffled so that full article context is not available. A coarse academic field category is also provided for each paragraph. Our models described below make use of neither the paragraph context nor the field category, and they treat each sentence independently. Further information about the task is available in the Shared Task report (Daudaravicius et al., 2016).

[1] Some insertions and deletions in the shared task data represent stylistic choices, not all of which are necessarily recoverable given the sentence or paragraph context. For the purposes here, we refer to all such edits as "grammatical" errors.
[2] The 2016 Shared Task also included a probabilistic estimation track. We leave for future work the adaptation of our approach to that task.
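To make the reduction concrete, the following is a minimal Python sketch of the sentence-level decision rule and the F1 evaluation. The tag spellings and the toy data are illustrative assumptions, not code from our system.

```python
# Minimal sketch: reduce predicted corrections to binary error labels
# and score with F1 (positive class = "sentence contains an error").
# Tag names and the toy example below are illustrative assumptions.

ANNOTATION_TAGS = {"<ins>", "</ins>", "<del>", "</del>"}

def has_error(source_tokens, predicted_target_tokens):
    """A sentence is flagged iff the predicted target differs from the source.

    Any generated annotation tag (or any other divergence from the source)
    makes the sequences differ, so an exact-match check suffices.
    """
    return source_tokens != predicted_target_tokens

def f1_positive_class(gold_labels, predicted_labels):
    """F1 where the positive class indicates an error (s != t)."""
    tp = sum(1 for g, p in zip(gold_labels, predicted_labels) if g and p)
    fp = sum(1 for g, p in zip(gold_labels, predicted_labels) if not g and p)
    fn = sum(1 for g, p in zip(gold_labels, predicted_labels) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy example: one replacement error, one clean sentence.
source = [["We", "runs", "experiments", "."],
          ["We", "run", "experiments", "."]]
predicted = [["We", "<del>", "runs", "</del>", "<ins>", "run", "</ins>",
              "experiments", "."],
             ["We", "run", "experiments", "."]]
gold = [True, False]
preds = [has_error(s, t) for s, t in zip(source, predicted)]
print(preds, f1_positive_class(gold, preds))  # [True, False] 1.0
```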
3 Related Work

While this is the first year for a shared task focusing on sentence-level binary error identification, previous work and shared tasks have focused on the related tasks of intra-sentence identification and correction of errors. Until recently, standard hand-annotated grammatical error datasets were not available, complicating comparisons and limiting the choice of methods used. Given the lack of a large hand-annotated corpus at the time, Park and Levy (2011) demonstrated the use of the EM algorithm for parameter learning of a noise model using error data without corrections, performing evaluation on a much smaller set of sentences hand-corrected by Amazon Mechanical Turk workers.

More recent work has emerged as a result of a series of shared tasks, starting with the Helping Our Own (HOO) Pilot Shared Task run in 2011, which focused on a diverse set of errors in a small dataset (Dale and Kilgarriff, 2011), and the subsequent HOO 2012 Shared Task, which focused on the automated detection and correction of preposition and determiner errors (Dale et al., 2012). The CoNLL-2013 Shared Task (Ng et al., 2013) [3] focused on the correction of a limited set of five error types in essays by second-language learners of English at the National University of Singapore. The follow-up CoNLL-2014 Shared Task (Ng et al., 2014) [4] focused on the full generation task of correcting all errors in essays by second-language learners.

As with machine translation (MT), evaluation of the full generation task is still an open research area, but a subsequent human evaluation ranked the output from the CoNLL-2014 Shared Task systems (Napoles et al., 2015). The system of Felice et al. (2014) ranked highest, utilizing a combination of a rule-based system and phrase-based MT, with re-ranking via a large web-scale language model. Of the non-MT-based approaches, the Illinois-Columbia system was a strong performer, combining several classifiers trained for specific types of errors (Rozovskaya et al., 2014).

[3] http://www.comp.nus.edu.sg/~nlp/conll13st.html
[4] http://www.comp.nus.edu.sg/~nlp/conll14st.html

4 Models

We use an end-to-end approach that does not have separate components for candidate generation or re-ranking that make use of hand-tuned rules or explicit syntax, nor do we employ separate classifiers for human-differentiated subsets of errors, unlike some previous work for the related task of grammatical error correction. We next introduce two approaches for the task of sentence-level grammatical error identification: a binary classifier, and a sequence-to-sequence model that is trained for correction but can also be used for identification as a side effect.

4.1 Baseline Convolutional Neural Net

To establish a baseline, we follow past work that has shown strong performance with convolutional neural nets (CNNs) across various domains for sentence-level classification (Kim, 2014; Zhang and Wallace, 2015). We utilize the one-layer CNN architecture of Kim (2014) with the publicly available [5] word vectors trained on the Google News dataset, which contains about 100 billion words (Mikolov et al., 2013). We experiment with keeping the word vectors static (CNN-STATIC) and fine-tuning the vectors (CNN-NONSTATIC). The CNN models only have access to sentence-level labels and are not given correction-level annotations.

[5] https://code.google.com/archive/p/word2vec/
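As a concrete reference point, below is a minimal sketch of a Kim (2014)-style one-layer CNN sentence classifier. The use of PyTorch, and the filter widths, dimensions, and dropout rate, are illustrative assumptions rather than the settings of the system described here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    """One-layer CNN sentence classifier in the style of Kim (2014).

    Hyperparameters below are illustrative assumptions, not the exact
    settings used in this system.
    """

    def __init__(self, vocab_size, emb_dim=300, num_filters=100,
                 widths=(3, 4, 5), num_classes=2, freeze_embeddings=True):
        super().__init__()
        # CNN-STATIC freezes pretrained vectors; CNN-NONSTATIC fine-tunes them.
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.embedding.weight.requires_grad = not freeze_embeddings
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, w) for w in widths])
        self.dropout = nn.Dropout(0.5)
        self.out = nn.Linear(num_filters * len(widths), num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, emb_dim, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)
        # Convolve, apply ReLU, then max-pool over time for each filter width.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = self.dropout(torch.cat(pooled, dim=1))
        return self.out(features)  # logits over {no error, error}

model = SentenceCNN(vocab_size=50000)
logits = model(torch.randint(0, 50000, (8, 40)))  # batch of 8 sentences
print(logits.shape)  # torch.Size([8, 2])
```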
4.2 Encoder-Decoder

While it may seem more natural to utilize models trained for binary prediction, such as the aforementioned CNN, or, for example, the recurrent network approach of Dai and Le (2015), we hypothesize that training at the lowest granularity of annotations may be useful for the task. We also suspect that the generation of corrections is of sufficient utility for end-users to further justify exploring models that produce corrections in addition to identification. We thus use the Shared Task as a means of assessing the utility of a full generation model for the binary prediction task.

We propose two encoder-decoder architectures for this task. Our word-based architecture (WORD) is similar to that of Luong et al. (2015). Our character-based models (CHAR) still make predictions at the word level, but use a CNN and a highway network over characters instead of word embeddings as the input to the encoder and decoder, as depicted in Figure 1. We follow past work (Sutskever et al., 2014; Luong et al., 2015) in stacking multiple recurrent neural networks (RNNs), specifically Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) networks, in both the encoder and decoder.

Here, we model the probability of the target given the source, $p(t \mid s)$, with an encoder neural network that summarizes the source sequence and a decoder neural network that generates a distribution over the target words and tags at each step given the source. We start by describing the basic encoder and decoder architectures in terms of the WORD model, and then we describe the CHAR model departures from WORD.

Encoder  The encoder reads the source sentence and outputs a sequence of vectors, associated with each word in the sentence, which will be selectively accessed during decoding via a soft attentional mechanism. We use an LSTM network to obtain the hidden states $h^s_i \in \mathbb{R}^n$ for each time step $i$: $h^s_i = \mathrm{LSTM}(h^s_{i-1}, x^s_i)$. For the WORD models, $x^s_i \in \mathbb{R}^m$ is the word embedding for $s_i$, the $i$-th word in the source sentence. (The analogue for the CHAR models is discussed below.) The output of the encoder is the sequence of hidden state vectors $[h^s_1, \ldots, h^s_I]$. The initial hidden state of the encoder is set to zero (i.e., $h^s_0 \leftarrow 0$).

Figure 1: An illustration of the CHAR model architecture translating an example source sentence into the corrected target with a single word replacement. A CNN (here, with three filters of width two) is applied over character embeddings to obtain a fixed-dimensional representation of a word, which is given to a highway network (in light blue, above). Output from the highway network is used as input to an LSTM encoder/decoder. At each step of the decoder, its hidden state is interacted with the hidden states of the encoder to produce attention weights (for each word in the encoder), which are used to obtain the context vector via a convex combination. The context vector is combined with the decoder hidden state through a one-layer MLP (yellow), after which an affine transformation followed by a softmax is applied to obtain a distribution over the next word/tag. The MLP layer (yellow) is used as additional input (via concatenation) for the next time step. Generation continues until the end-of-sentence symbol is generated.

Decoder  The decoder is another LSTM that produces a distribution over the next target word/tag given the source vectors $[h^s_1, \ldots, h^s_I]$ and the previously generated target words/tags $t_{<j}$.
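To make the preceding description concrete, the following is a minimal PyTorch sketch of the two components just introduced: the CHAR-style input layer (a CNN over character embeddings followed by a highway layer) and one soft-attention step over the encoder states. The framework, dimensions, a single highway layer, and simple dot-product attention scores are our illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharWordEncoder(nn.Module):
    """CHAR-style input layer: a CNN over character embeddings, max-pooled
    over character positions, then passed through a highway layer to yield
    a fixed-dimensional word representation. Dimensions and the single
    filter width here are illustrative assumptions."""

    def __init__(self, num_chars, char_dim=15, num_filters=100, width=2):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, num_filters, width)
        # Highway layer: gated mixture of a nonlinear transform and identity.
        self.transform = nn.Linear(num_filters, num_filters)
        self.gate = nn.Linear(num_filters, num_filters)

    def forward(self, char_ids):
        # char_ids: (batch, word_len) -> (batch, char_dim, word_len)
        x = self.char_emb(char_ids).transpose(1, 2)
        # Max-pool over character positions: (batch, num_filters)
        y = F.relu(self.conv(x)).max(dim=2).values
        g = torch.sigmoid(self.gate(y))
        return g * F.relu(self.transform(y)) + (1 - g) * y

def attention_step(decoder_state, encoder_states):
    """Soft attention: score each encoder state against the decoder state
    (dot product here, as an assumption), softmax the scores, and take the
    convex combination of encoder states as the context vector."""
    # encoder_states: (batch, src_len, n); decoder_state: (batch, n)
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2))
    weights = F.softmax(scores.squeeze(2), dim=1)            # (batch, src_len)
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
    return context, weights

# Toy usage: embed a batch of 4 words of 6 characters, then one attention
# step over 7 encoder states of size 64.
enc_in = CharWordEncoder(num_chars=80)
word_vecs = enc_in(torch.randint(0, 80, (4, 6)))             # (4, 100)
h_enc = torch.randn(2, 7, 64)
context, w = attention_step(torch.randn(2, 64), h_enc)
print(word_vecs.shape, context.shape, w.sum(dim=1))          # weights sum to 1
```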