Sentence-Level Grammatical Error Identification as Sequence-to-Sequence Correction

Allen Schmaltz  Yoon Kim  Alexander M. Rush  Stuart M. Shieber
Harvard University
{schmaltz@fas,yoonkim@seas,srush@seas,shieber@seas}.harvard.edu

Abstract

We demonstrate that an attention-based encoder-decoder model can be used for sentence-level grammatical error identification for the Automated Evaluation of Scientific Writing (AESW) Shared Task 2016. The attention-based encoder-decoder models can be used for the generation of corrections, in addition to error identification, which is of interest for certain end-user applications. We show that a character-based encoder-decoder model is particularly effective, outperforming other results on the AESW Shared Task on its own, and showing gains over a word-based counterpart. Our final model, a combination of three character-based encoder-decoder models, one word-based encoder-decoder model, and a sentence-level CNN, is the highest-performing system on the AESW 2016 binary prediction Shared Task.

1 Introduction

The recent confluence of data availability and strong sequence-to-sequence learning algorithms has the potential to lead to practical tools for writing support. Grammatical error identification is one such application of potential utility as a component of a writing support tool. Much of the recent work in grammatical error identification and correction has made use of hand-tuned rules and features that augment data-driven approaches, or individual classifiers for human-designated subsets of errors. Given a large, annotated dataset of scientific journal articles, we propose a fully data-driven approach for this problem, inspired by recent work in neural machine translation and, more generally, sequence-to-sequence learning (Sutskever et al., 2014; Bahdanau et al., 2014; Cho et al., 2014).

The Automated Evaluation of Scientific Writing (AESW) 2016 dataset is a collection of nearly 10,000 scientific journal articles (over 1 million sentences) published between 2006 and 2013 and annotated with corrections by professional, native English-speaking editors. The goal of the associated AESW Shared Task is to identify whether or not a given unedited source sentence was corrected by the editor (that is, whether a given source sentence has one or more grammatical errors, broadly construed).

This system report describes our approach and submission to the AESW 2016 Shared Task, which establishes the current highest-performing public baseline for the binary prediction task. Our primary contribution is to demonstrate the utility of an attention-based encoder-decoder model for the binary prediction task. We also provide evidence of tangible performance gains using a character-aware version of the model, building on the character-aware language modeling work of Kim et al. (2016). In addition to sentence-level classification, the models are capable of intra-sentence error identification and the generation of possible corrections. We also obtain additional gains by using an ensemble of a generative encoder-decoder and a discriminative CNN classifier.

2 Background

Recent work in natural language processing has shown strong results in sequence-to-sequence transformations using recurrent neural network models (Cho et al., 2014; Sutskever et al., 2014). Grammar correction and error identification can be cast as a sequence-to-sequence translation problem, in which an unedited (source) sentence is "translated" into a corrected (target) sentence in the same language.
Using this framework, sentence-level error identification then simply reduces to an equality check between the source and target sentences.

The goal of the AESW Shared Task is to identify whether a particular sentence needs to be edited (contains a "grammatical" error, broadly construed [1]). The dataset consists of sentences taken from academic articles annotated with corrections by professional editors. Annotations are described via insertions and deletions, which are marked with start and end tags. Tokens to be deleted are surrounded with the deletion start tag <del> and the deletion end tag </del>, and tokens to be inserted are surrounded with the insertion start tag <ins> and the insertion end tag </ins>. Replacements (as shown in Figure 1) are represented as deletion-insertion pairs. Unlike the data of the related CoNLL-2014 Shared Task (Ng et al., 2014), errors are not labeled with fine-grained types (article or determiner error, verb tense error, etc.).

More formally, we assume a vocabulary $V$ of natural language word types (some of which have orthographic errors) and a set Q = {<ins>, </ins>, <del>, </del>} of annotation tags. Given a sentence $s = [s_1, \ldots, s_I]$, where $s_i \in V$ is the $i$-th token of the sentence of length $I$, we seek to predict whether or not the gold, annotated target sentence $t = [t_1, \ldots, t_J]$, where $t_j \in Q \cup V$ is the $j$-th token of the annotated sentence of length $J$, is identical to $s$. We are given both $s$ and $t$ for supervised training. At test time, we are only given access to the sequence $s$, and we learn to predict the sequence $t$. Evaluation of this binary prediction task is via the F1-score, where the positive class is the one indicating that an error is present in the sentence (that is, where $s \neq t$) [2].

Evaluation is at the sentence level, but the paragraph-level context for each sentence is also provided. The paragraphs, themselves, are shuffled so that full article context is not available. A coarse academic field category is also provided for each paragraph. Our models described below make use of neither the paragraph context nor the field category, and they treat each sentence independently. Further information about the task is available in the Shared Task report (Daudaravicius et al., 2016).

[1] Some insertions and deletions in the shared task data represent stylistic choices, not all of which are necessarily recoverable given the sentence or paragraph context. For the purposes here, we refer to all such edits as "grammatical" errors.
[2] The 2016 Shared Task also included a probabilistic estimation track. We leave for future work the adaptation of our approach to that task.
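To make the reduction concrete, the following is a minimal Python sketch of the sentence-level decision rule and the F1 evaluation. The tag spellings and the toy data are illustrative assumptions, not code from our system.

```python
# Minimal sketch: reduce predicted corrections to binary error labels
# and score with F1 (positive class = "sentence contains an error").
# Tag names and the toy example below are illustrative assumptions.

ANNOTATION_TAGS = {"<ins>", "</ins>", "<del>", "</del>"}

def has_error(source_tokens, predicted_target_tokens):
    """A sentence is flagged iff the predicted target differs from the source.

    Any generated annotation tag (or any other divergence from the source)
    makes the sequences differ, so an exact-match check suffices.
    """
    return source_tokens != predicted_target_tokens

def f1_positive_class(gold_labels, predicted_labels):
    """F1 where the positive class indicates an error (s != t)."""
    tp = sum(1 for g, p in zip(gold_labels, predicted_labels) if g and p)
    fp = sum(1 for g, p in zip(gold_labels, predicted_labels) if not g and p)
    fn = sum(1 for g, p in zip(gold_labels, predicted_labels) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy example: one replacement error, one clean sentence.
source = [["We", "runs", "experiments", "."],
          ["We", "run", "experiments", "."]]
predicted = [["We", "<del>", "runs", "</del>", "<ins>", "run", "</ins>",
              "experiments", "."],
             ["We", "run", "experiments", "."]]
gold = [True, False]
preds = [has_error(s, t) for s, t in zip(source, predicted)]
print(preds, f1_positive_class(gold, preds))  # [True, False] 1.0
```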
3 Related Work

While this is the first year for a shared task focusing on sentence-level binary error identification, previous work and shared tasks have focused on the related tasks of intra-sentence identification and correction of errors. Until recently, standard hand-annotated grammatical error datasets were not available, complicating comparisons and limiting the choice of methods used. Given the lack of a large hand-annotated corpus at the time, Park and Levy (2011) demonstrated the use of the EM algorithm for parameter learning of a noise model using error data without corrections, performing evaluation on a much smaller set of sentences hand-corrected by Amazon Mechanical Turk workers.

More recent work has emerged as a result of a series of shared tasks, starting with the Helping Our Own (HOO) Pilot Shared Task run in 2011, which focused on a diverse set of errors in a small dataset (Dale and Kilgarriff, 2011), and the subsequent HOO 2012 Shared Task, which focused on the automated detection and correction of preposition and determiner errors (Dale et al., 2012). The CoNLL-2013 Shared Task (Ng et al., 2013) [3] focused on the correction of a limited set of five error types in essays by second-language learners of English at the National University of Singapore. The follow-up CoNLL-2014 Shared Task (Ng et al., 2014) [4] focused on the full generation task of correcting all errors in essays by second-language learners.

As with machine translation (MT), evaluation of the full generation task is still an open research area, but a subsequent human evaluation ranked the output from the CoNLL-2014 Shared Task systems (Napoles et al., 2015). The system of Felice et al. (2014) ranked highest, utilizing a combination of a rule-based system and phrase-based MT, with re-ranking via a large web-scale language model. Of the non-MT-based approaches, the Illinois-Columbia system was a strong performer, combining several classifiers trained for specific types of errors (Rozovskaya et al., 2014).

[3] http://www.comp.nus.edu.sg/~nlp/conll13st.html
[4] http://www.comp.nus.edu.sg/~nlp/conll14st.html

4 Models

We use an end-to-end approach that does not have separate components for candidate generation or re-ranking that make use of hand-tuned rules or explicit syntax, nor do we employ separate classifiers for human-differentiated subsets of errors, unlike some previous work for the related task of grammatical error correction. We next introduce two approaches for the task of sentence-level grammatical error identification: a binary classifier, and a sequence-to-sequence model that is trained for correction but can also be used for identification as a side effect.

4.1 Baseline Convolutional Neural Net

To establish a baseline, we follow past work that has shown strong performance with convolutional neural nets (CNNs) across various domains for sentence-level classification (Kim, 2014; Zhang and Wallace, 2015). We utilize the one-layer CNN architecture of Kim (2014) with the publicly available [5] word vectors trained on the Google News dataset, which contains about 100 billion words (Mikolov et al., 2013). We experiment with keeping the word vectors static (CNN-STATIC) and fine-tuning the vectors (CNN-NONSTATIC). The CNN models only have access to sentence-level labels and are not given correction-level annotations.

[5] https://code.google.com/archive/p/word2vec/
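As a concrete reference point, below is a minimal sketch of a Kim (2014)-style one-layer CNN sentence classifier. The use of PyTorch, and the filter widths, dimensions, and dropout rate, are illustrative assumptions rather than the settings of the system described here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    """One-layer CNN sentence classifier in the style of Kim (2014).

    Hyperparameters below are illustrative assumptions, not the exact
    settings used in this system.
    """

    def __init__(self, vocab_size, emb_dim=300, num_filters=100,
                 widths=(3, 4, 5), num_classes=2, freeze_embeddings=True):
        super().__init__()
        # CNN-STATIC freezes pretrained vectors; CNN-NONSTATIC fine-tunes them.
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.embedding.weight.requires_grad = not freeze_embeddings
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, w) for w in widths])
        self.dropout = nn.Dropout(0.5)
        self.out = nn.Linear(num_filters * len(widths), num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, emb_dim, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)
        # Convolve, apply ReLU, then max-pool over time for each filter width.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = self.dropout(torch.cat(pooled, dim=1))
        return self.out(features)  # logits over {no error, error}

model = SentenceCNN(vocab_size=50000)
logits = model(torch.randint(0, 50000, (8, 40)))  # batch of 8 sentences
print(logits.shape)  # torch.Size([8, 2])
```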
4.2 Encoder-Decoder

While it may seem more natural to utilize models trained for binary prediction, such as the aforementioned CNN, or, for example, the recurrent network approach of Dai and Le (2015), we hypothesize that training at the lowest granularity of annotations may be useful for the task. We also suspect that the generation of corrections is of sufficient utility for end-users to further justify exploring models that produce corrections in addition to identification. We thus use the Shared Task as a means of assessing the utility of a full generation model for the binary prediction task.

We propose two encoder-decoder architectures for this task. Our word-based architecture (WORD) is similar to that of Luong et al. (2015). Our character-based models (CHAR) still make predictions at the word level, but use a CNN and a highway network over characters instead of word embeddings as the input to the encoder and decoder, as depicted in Figure 1. We follow past work (Sutskever et al., 2014; Luong et al., 2015) in stacking multiple recurrent neural networks (RNNs), specifically Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) networks, in both the encoder and decoder.

Here, we model the probability of the target given the source, $p(t \mid s)$, with an encoder neural network that summarizes the source sequence and a decoder neural network that generates a distribution over the target words and tags at each step given the source. We start by describing the basic encoder and decoder architectures in terms of the WORD model, and then we describe the CHAR model departures from WORD.

Encoder  The encoder reads the source sentence and outputs a sequence of vectors, associated with each word in the sentence, which will be selectively accessed during decoding via a soft attentional mechanism. We use an LSTM network to obtain the hidden states $h^s_i \in \mathbb{R}^n$ for each time step $i$: $h^s_i = \mathrm{LSTM}(h^s_{i-1}, x^s_i)$. For the WORD models, $x^s_i \in \mathbb{R}^m$ is the word embedding for $s_i$, the $i$-th word in the source sentence. (The analogue for the CHAR models is discussed below.) The output of the encoder is the sequence of hidden state vectors $[h^s_1, \ldots, h^s_I]$. The initial hidden state of the encoder is set to zero (i.e., $h^s_0 \leftarrow 0$).

Figure 1: An illustration of the CHAR model architecture translating an example source sentence into the corrected target with a single word replacement. A CNN (here, with three filters of width two) is applied over character embeddings to obtain a fixed-dimensional representation of a word, which is given to a highway network (in light blue, above). Output from the highway network is used as input to an LSTM encoder/decoder. At each step of the decoder, its hidden state is interacted with the hidden states of the encoder to produce attention weights (for each word in the encoder), which are used to obtain the context vector via a convex combination. The context vector is combined with the decoder hidden state through a one-layer MLP (yellow), after which an affine transformation followed by a softmax is applied to obtain a distribution over the next word/tag. The MLP layer (yellow) is used as additional input (via concatenation) for the next time step. Generation continues until the end-of-sentence symbol is generated.

Decoder  The decoder is another LSTM that produces a distribution over the next target word/tag given the source vectors $[h^s_1, \ldots, h^s_I]$ and the previously generated target words/tags $t_{<j}$.
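To make the preceding description concrete, the following is a minimal PyTorch sketch of the two components just introduced: the CHAR-style input layer (a CNN over character embeddings followed by a highway layer) and one soft-attention step over the encoder states. The framework, dimensions, a single highway layer, and simple dot-product attention scores are our illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharWordEncoder(nn.Module):
    """CHAR-style input layer: a CNN over character embeddings, max-pooled
    over character positions, then passed through a highway layer to yield
    a fixed-dimensional word representation. Dimensions and the single
    filter width here are illustrative assumptions."""

    def __init__(self, num_chars, char_dim=15, num_filters=100, width=2):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, num_filters, width)
        # Highway layer: gated mixture of a nonlinear transform and identity.
        self.transform = nn.Linear(num_filters, num_filters)
        self.gate = nn.Linear(num_filters, num_filters)

    def forward(self, char_ids):
        # char_ids: (batch, word_len) -> (batch, char_dim, word_len)
        x = self.char_emb(char_ids).transpose(1, 2)
        # Max-pool over character positions: (batch, num_filters)
        y = F.relu(self.conv(x)).max(dim=2).values
        g = torch.sigmoid(self.gate(y))
        return g * F.relu(self.transform(y)) + (1 - g) * y

def attention_step(decoder_state, encoder_states):
    """Soft attention: score each encoder state against the decoder state
    (dot product here, as an assumption), softmax the scores, and take the
    convex combination of encoder states as the context vector."""
    # encoder_states: (batch, src_len, n); decoder_state: (batch, n)
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2))
    weights = F.softmax(scores.squeeze(2), dim=1)            # (batch, src_len)
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
    return context, weights

# Toy usage: embed a batch of 4 words of 6 characters, then one attention
# step over 7 encoder states of size 64.
enc_in = CharWordEncoder(num_chars=80)
word_vecs = enc_in(torch.randint(0, 80, (4, 6)))             # (4, 100)
h_enc = torch.randn(2, 7, 64)
context, w = attention_step(torch.randn(2, 64), h_enc)
print(word_vecs.shape, context.shape, w.sum(dim=1))          # weights sum to 1
```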