Learning Neural Templates for Text Generation

Sam Wiseman   Stuart M. Shieber   Alexander M. Rush
School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
{swiseman,shieber,srush}@seas.harvard.edu

Abstract

While neural, encoder-decoder models have had significant empirical success in text generation, there remain several unaddressed problems with this style of generation. Encoder-decoder models are largely (a) uninterpretable, and (b) difficult to control in terms of their phrasing or content. This work proposes a neural generation system using a hidden semi-markov model (HSMM) decoder, which learns latent, discrete templates jointly with learning to generate. We show that this model learns useful templates, and that these templates make generation both more interpretable and controllable. Furthermore, we show that this approach scales to real data sets and achieves strong performance nearing that of encoder-decoder text generation models.

Figure 1: An example template-like generation from the E2E Generation dataset (Novikova et al., 2017). The source entity is Cotto, and the knowledge base x (top) contains 6 records, shown as type[value]: type[coffee shop], rating[3 out of 5], food[English], area[city centre], price[moderate], near[The Portland Arms]. The system generation ŷ (middle) is: "Cotto is a coffee shop serving English food in the moderate price range. It is located near The Portland Arms. Its customer rating is 3 out of 5." An induced neural template (bottom) is learned by the system and employed in generating ŷ; each cell represents a segment in the learned segmentation, and "blanks" show where slots are filled through copy attention during generation.

1 Introduction

With the continued success of encoder-decoder models for machine translation and related tasks, there has been great interest in extending these methods to build general-purpose, data-driven natural language generation (NLG) systems (Mei et al., 2016; Dušek and Jurcıcek, 2016; Lebret et al., 2016; Chisholm et al., 2017; Wiseman et al., 2017). These encoder-decoder models (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015) use a neural encoder model to represent a source knowledge base, and a decoder model to emit a textual description word-by-word, conditioned on the source encoding. This style of generation contrasts with the more traditional division of labor in NLG, which famously emphasizes addressing the two questions of "what to say" and "how to say it" separately, and which leads to systems with explicit content selection, macro- and micro-planning, and surface realization components (Reiter and Dale, 1997; Jurafsky and Martin, 2014).

Encoder-decoder generation systems appear to have increased the fluency of NLG outputs, while reducing the manual effort required. However, due to the black-box nature of generic encoder-decoder models, these systems have also largely sacrificed two important desiderata that are often found in more traditional systems, namely (a) interpretable outputs that (b) can be easily controlled in terms of form and content.
This work considers building interpretable and controllable neural generation systems, and proposes a specific first step: a new data-driven generation model for learning discrete, template-like structures for conditional text generation. The core system uses a novel, neural hidden semi-markov model (HSMM) decoder, which provides a principled approach to template-like text generation. We further describe efficient methods for training this model in an entirely data-driven way by backpropagation through inference. Generating with the template-like structures induced by the neural HSMM allows for the explicit representation of what the system intends to say (in the form of a learned template) and how it is attempting to say it (in the form of an instantiated template).

We show that we can achieve performance competitive with other neural NLG approaches, while making progress satisfying the above two desiderata. Concretely, our experiments indicate that we can induce explicit templates (as shown in Figure 1) while achieving competitive automatic scores, and that we can control and interpret our generations by manipulating these templates. Finally, while our experiments focus on the data-to-text regime, we believe the proposed methodology represents a compelling approach to learning discrete, latent-variable representations of conditional text.

2 Related Work

A core task of NLG is to generate textual descriptions of knowledge base records. A common approach is to use hand-engineered templates (Kukich, 1983; McKeown, 1992; McRoy et al., 2000), but there has also been interest in creating templates in an automated manner.
For instance, many authors induce templates by clustering sentences and then abstracting templated fields with hand-engineered rules (Angeli et al., 2010; Kondadadi et al., 2013; Howald et al., 2013), or with a pipeline of other automatic approaches (Wang and Cardie, 2013).

There has also been work in incorporating probabilistic notions of templates into generation models (Liang et al., 2009; Konstas and Lapata, 2013), which is similar to our approach. However, these approaches have always been conjoined with discriminative classifiers or rerankers in order to actually accomplish the generation (Angeli et al., 2010; Konstas and Lapata, 2013). In addition, these models explicitly model knowledge base field selection, whereas the model we present is fundamentally an end-to-end model over generation segments.

Recently, a new paradigm has emerged around neural text generation systems based on machine translation (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015). Most of this work has used unconstrained black-box encoder-decoder approaches. There has been some work on discrete variables in this context, including extracting representations (Shen et al., 2018), incorporating discrete latent variables in text modeling (Yang et al., 2018), and using non-HSMM segmental models for machine translation or summarization (Yu et al., 2016; Wang et al., 2017; Huang et al., 2018). Dai et al. (2017) develop an approximate inference scheme for a neural HSMM using RNNs for continuous emissions; in contrast we maximize the exact log-marginal, and use RNNs to parameterize a discrete emission distribution. Finally, there has also been much recent interest in segmental RNN models for non-generative tasks in NLP (Tang et al., 2016; Kong et al., 2016; Lu et al., 2016).

The neural text generation community has also recently been interested in "controllable" text generation (Hu et al., 2017), where various aspects of the text (often sentiment) are manipulated or transferred (Shen et al., 2017; Zhao et al., 2018; Li et al., 2018). In contrast, here we focus on controlling either the content of a generation or the way it is expressed by manipulating the (latent) template used in realizing the generation.

3 Overview: Data-Driven NLG

Our focus is on generating a textual description of a knowledge base or meaning representation. Following standard notation (Liang et al., 2009; Wiseman et al., 2017), let x = {r_1 ... r_J} be a collection of records. A record is made up of a type (r.t), an entity (r.e), and a value (r.m). For example, a knowledge base of restaurants might have a record with r.t = Cuisine, r.e = Denny's, and r.m = American. The aim is to generate an adequate and fluent text description ŷ_{1:T} = ŷ_1, ..., ŷ_T of x. Concretely, we consider the E2E Dataset (Novikova et al., 2017) and the WikiBio Dataset (Lebret et al., 2016). We show an example E2E knowledge base x in the top of Figure 1. The top of Figure 2 shows an example knowledge base x from the WikiBio dataset, where it is paired with a reference text y = y_{1:T} at the bottom.
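To make this notation concrete, the following minimal sketch (ours, not taken from the paper's released code; the Record container and field names are illustrative only) represents the E2E knowledge base of Figure 1 as a collection of (type, entity, value) records.

```python
from collections import namedtuple

# A record r consists of a type r.t, an entity r.e, and a value r.m (Section 3).
Record = namedtuple("Record", ["t", "e", "m"])

# The E2E knowledge base x from Figure 1, as a collection of records.
x = [
    Record(t="type",   e="Cotto", m="coffee shop"),
    Record(t="rating", e="Cotto", m="3 out of 5"),
    Record(t="food",   e="Cotto", m="English"),
    Record(t="area",   e="Cotto", m="city centre"),
    Record(t="price",  e="Cotto", m="moderate"),
    Record(t="near",   e="Cotto", m="The Portland Arms"),
]
```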
Figure 2: An example from the WikiBio dataset (Lebret et al., 2016), with a database x (top) for Frederick Parker-Rhodes, taken from his Wikipedia infobox, and the corresponding reference generation y (bottom): "Frederick Parker-Rhodes (21 March 1914 – 21 November 1987) was an English linguist, plant pathologist, computer scientist, mathematician, mystic, and mycologist."

The dominant approach in neural NLG has been to use an encoder network over x and then a conditional decoder network to generate y, training the whole system in an end-to-end manner. To generate a description for a given example, a black-box network (such as an RNN) is used to produce a distribution over the next word, from which a choice is made and fed back into the system. The entire distribution is driven by the internal states of the neural network.

While effective, relying on a neural decoder makes it difficult to understand what aspects of x are correlated with a particular system output. This leads to problems both in controlling fine-grained aspects of the generation process and in interpreting model mistakes.

As an example of why controllability is important, consider the records in Figure 1. Given these inputs an end-user might want to generate an output meeting specific constraints, such as not mentioning any information relating to customer rating. Under a standard encoder-decoder style model, one could filter out this information either from the encoder or the decoder, but in practice this would lead to unexpected changes in output that might propagate through the whole system.

As an example of the difficulty of interpreting mistakes, consider the following actual generation from an encoder-decoder style system for the records in Figure 2: "frederick parker-rhodes (21 november 1914 - 2 march 1987) was an english mycology and plant pathology, mathematics at the university of uk." In addition to not being fluent, it is unclear what the end of this sentence is even attempting to convey: it may be attempting to convey a fact not actually in the knowledge base (e.g., where Parker-Rhodes studied), or perhaps it is simply failing to fluently realize information that is in the knowledge base (e.g., Parker-Rhodes's country of residence).

Traditional NLG systems (Kukich, 1983; McKeown, 1992; Belz, 2008; Gatt and Reiter, 2009), in contrast, largely avoid these problems. Since they typically employ an explicit planning component, which decides which knowledge base records to focus on, and a surface realization component, which realizes the chosen records, the intent of the system is always explicit, and it may be modified to meet constraints.

The goal of this work is to propose an approach to neural NLG that addresses these issues in a principled way. We target this goal by proposing a new model that generates with template-like objects induced by a neural HSMM (see Figure 1). Templates are useful here because they represent a fixed plan for the generation's content, and because they make it clear what part of the generation is associated with which record in the knowledge base.
4 Background: Semi-Markov Models

What does it mean to learn a template? It is natural to think of a template as a sequence of typed text-segments, perhaps with some segments acting as the template's "backbone" (Wang and Cardie, 2013), and the remaining segments filled in from the knowledge base.

A natural probabilistic model conforming with this intuition is the hidden semi-markov model (HSMM) (Gales and Young, 1993; Ostendorf et al., 1996), which models latent segmentations in an output sequence. Informally, an HSMM is much like an HMM, except emissions may last multiple time-steps, and multi-step emissions need not be independent of each other conditioned on the state.

We briefly review HSMMs following Murphy (2002). Assume we have a sequence of observed tokens y_1 ... y_T and a discrete, latent state z_t ∈ {1, ..., K} for each timestep. We additionally use two per-timestep variables to model multi-step segments: a length variable l_t ∈ {1, ..., L} specifying the length of the current segment, and a deterministic binary variable f_t indicating whether a segment finishes at time t. We will consider in particular conditional HSMMs, which condition on a source x, essentially giving us an HSMM decoder.
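As a toy illustration of these variables (our own example, using one common indexing convention rather than anything prescribed by the paper), a six-token output split into three segments would be represented as follows.

```python
# Toy HSMM segmentation of y_1..y_6 into three segments of lengths 2, 3 and 1.
# z_t: latent state per timestep (constant within a segment);
# l_t: length of the segment the timestep belongs to;
# f_t: 1 exactly when a segment finishes at time t.
y = ["Cotto", "is", "a", "coffee", "shop", "."]
z = [7, 7, 12, 12, 12, 2]
l = [2, 2, 3, 3, 3, 1]
f = [0, 1, 0, 0, 1, 1]
```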
An HSMM specifies a joint distribution on the observations and latent segmentations. Letting θ denote all the parameters of the model, and using the variables introduced above, we can write the corresponding joint-likelihood as follows:

p(y, z, l, f | x; θ) = ∏_{t=0}^{T-1} p(z_{t+1}, l_{t+1} | z_t, l_t, x)^{f_t} × ∏_{t=1}^{T} p(y_{t-l_t+1:t} | z_t, l_t, x)^{f_t},

where we take z_0 to be a distinguished start-state, and the deterministic f_t variables are used for excluding non-segment log probabilities. We further assume p(z_{t+1}, l_{t+1} | z_t, l_t, x) factors as p(z_{t+1} | z_t, x) × p(l_{t+1} | z_{t+1}). Thus, the likelihood is given by the product of the probabilities of each discrete state transition made, the probability of the length of each segment given its discrete state, and the probability of the observations in each segment, given its state and length.

Figure 3: HSMM factor graph (under a known segmentation) to illustrate parameters. Here we assume z_1 is in the "red" state (out of K possibilities), and transitions to the "blue" state after emitting three words. The transition model, shown as T, is a function of the two states and the neural encoded source x. The emission model is a function of a "red" RNN model (with copy attention over x) that generates words 1, 2 and 3. After transitioning, the next word y_4 is generated by the "blue" RNN, but independently of the previous words.

5 A Neural HSMM Decoder

We use a novel, neural parameterization of an HSMM to specify the probabilities in the likelihood above. This full model, sketched out in Figure 3, allows us to incorporate the modeling components, such as LSTMs and attention, that make neural text generation effective, while maintaining the HSMM structure.

5.1 Parameterization

Since our model must condition on x, let r_j ∈ R^d represent a real embedding of record r_j ∈ x, and let x_a ∈ R^d represent a real embedding of the entire knowledge base x, obtained by max-pooling coordinate-wise over all the r_j. It is also useful to have a representation of just the unique types of records that appear in x, and so we also define x_u ∈ R^d to be the sum of the embeddings of the unique types appearing in x, plus a bias vector and followed by a ReLU nonlinearity.

Transition Distribution. The transition distribution p(z_{t+1} | z_t, x) may be viewed as a K × K matrix of probabilities, where each row sums to 1. We define this matrix to be

p(z_{t+1} | z_t, x) ∝ AB + C(x_u)D(x_u),

where A ∈ R^{K×m1} and B ∈ R^{m1×K} are state embeddings, and where C: R^d → R^{K×m2} and D: R^d → R^{m2×K} are parameterized non-linear functions of x_u. We apply a row-wise softmax to the resulting matrix to obtain the desired probabilities.

Length Distribution. We simply fix all length probabilities p(l_{t+1} | z_{t+1}) to be uniform up to a maximum length L. (We experimented with parameterizing the length distribution, but found that it led to inferior performance; forcing the length probabilities to be uniform encourages the model to cluster together functionally similar emissions of different lengths, while parameterizing them can lead to states that specialize to specific emission lengths.)
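A minimal PyTorch sketch of the transition parameterization above, with C and D implemented as the two-layer MLPs described in the Supplemental Material; tensor shapes, initialization, and module names are our own assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionModel(nn.Module):
    """p(z_{t+1} | z_t, x) ∝ AB + C(x_u) D(x_u), followed by a row-wise softmax."""
    def __init__(self, K, d, m1=64, m2=32, m3=64):
        super().__init__()
        self.A = nn.Parameter(torch.randn(K, m1) * 0.01)   # state embeddings
        self.B = nn.Parameter(torch.randn(m1, K) * 0.01)
        # C(x_u) and D(x_u): MLPs of the form U2(ReLU(U1 x_u)) over the type embedding x_u
        self.C = nn.Sequential(nn.Linear(d, m3), nn.ReLU(), nn.Linear(m3, K * m2))
        self.D = nn.Sequential(nn.Linear(d, m3), nn.ReLU(), nn.Linear(m3, m2 * K))
        self.K, self.m2 = K, m2

    def forward(self, x_u):                      # x_u: (d,)
        C = self.C(x_u).view(self.K, self.m2)    # (K, m2)
        D = self.D(x_u).view(self.m2, self.K)    # (m2, K)
        scores = self.A @ self.B + C @ D         # (K, K) unnormalized scores
        return F.log_softmax(scores, dim=-1)     # row-wise normalization (log-probs)
```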
Emission Distribution. The emission model models the generation of a text segment conditioned on a latent state and source information, and so requires a richer parameterization. Inspired by the models used for neural NLG, we base this model on an RNN decoder, and write a segment's probability as a product over token-level probabilities,

p(y_{t-l_t+1:t} | z_t = k, l_t = l, x) = ∏_{i=1}^{l_t} p(y_{t-l_t+i} | y_{t-l_t+1:t-l_t+i-1}, z_t = k, x) × p(⟨end⟩ | y_{t-l_t+1:t}, z_t = k, x) × 1{l_t = l},

where ⟨end⟩ is an end of segment token. The RNN decoder uses attention and copy-attention over the embedded records r_j, and is conditioned on z_t = k by concatenating an embedding corresponding to the k'th latent state to the RNN's input; the RNN is also conditioned on the entire x by initializing its hidden state with x_a.

More concretely, let h^k_{i-1} ∈ R^d be the state of an RNN conditioned on x and z_t = k (as above) run over the sequence y_{t-l_t+1:t-l_t+i-1}. We let the model attend over records r_j using h^k_{i-1} (in the style of Luong et al. (2015)), producing a context vector c^k_{i-1}. We may then obtain scores v_{i-1} for each word in the output vocabulary,

v_{i-1} = W tanh(g^k_1 ◦ [h^k_{i-1}, c^k_{i-1}]),

with parameters g^k_1 ∈ R^{2d} and W ∈ R^{V×2d}. Note that there is a g^k_1 vector for each of K discrete states. To additionally implement a kind of slot filling, we allow emissions to be directly copied from the value portion of the records r_j using copy attention (Gülçehre et al., 2016; Gu et al., 2016; Yang et al., 2016). Define copy scores,

ρ_j = r_j^T tanh(g^k_2 ◦ h^k_{i-1}),

where g^k_2 ∈ R^d. We then normalize the output-vocabulary and copy scores together, to arrive at

ṽ_{i-1} = softmax([v_{i-1}, ρ_1, ..., ρ_J]),

and thus

p(y_{t-l_t+i} = w | y_{t-l_t+1:t-l_t+i-1}, z_t = k, x) = ṽ_{i-1,w} + ∑_{j: r_j.m = w} ṽ_{i-1,V+j}.

An Autoregressive Variant. The model as specified assumes segments are independent conditioned on the associated latent state and x. While this assumption still allows for reasonable performance, we can tractably allow interdependence between tokens (but not segments) by having each next-token distribution depend on all the previously generated tokens, giving us an autoregressive HSMM. For this model, we will in fact use p(y_{t-l_t+i} = w | y_{1:t-l_t+i-1}, z_t = k, x) in defining our emission model, which is easily implemented by using an additional RNN run over all the preceding tokens. We will report scores for both non-autoregressive and autoregressive HSMM decoders below.
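Returning to the copy-attention equations above, the per-token score mixing can be sketched as follows (ours; shapes and variable names are assumptions, and the full model additionally feeds the latent-state embedding and copied-record information into the RNN input as described in the text).

```python
import torch
import torch.nn.functional as F

def emission_step_logprobs(h, c, r, W, g1, g2):
    """One decoding step of the segment RNN, schematically.

    h:  (d,)   RNN hidden state for latent state k
    c:  (d,)   attention context over the record embeddings
    r:  (J, d) embedded records r_1..r_J
    W:  (V, 2d), g1: (2d,), g2: (d,)  per-state parameters
    Returns log-probabilities over the extended vocabulary [V words | J copies].
    """
    v = W @ torch.tanh(g1 * torch.cat([h, c]))   # output-vocabulary scores, (V,)
    rho = r @ torch.tanh(g2 * h)                 # copy scores, one per record, (J,)
    return F.log_softmax(torch.cat([v, rho]), dim=0)

# The probability of emitting word w then sums w's vocabulary entry and the copy
# entries of all records whose value equals w, as in the equation above.
```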
5.2 Learning

The model requires fitting a large set of neural network parameters. Since we assume z, l, and f are unobserved, we marginalize over these variables to maximize the log marginal-likelihood of the observed tokens y given x. The HSMM marginal-likelihood calculation can be carried out efficiently with a dynamic program analogous to either the forward- or backward-algorithm familiar from HMMs (Rabiner, 1989).

It is actually more convenient to use the backward-algorithm formulation when using RNNs to parameterize the emission distributions, and we briefly review the backward recurrences here, again following Murphy (2002). We have:

β_t(j) = p(y_{t+1:T} | z_t = j, f_t = 1, x) = ∑_{k=1}^{K} β*_t(k) p(z_{t+1} = k | z_t = j)

β*_t(k) = p(y_{t+1:T} | z_{t+1} = k, f_t = 1, x) = ∑_{l=1}^{L} [ β_{t+l}(k) p(l_{t+1} = l | z_{t+1} = k) p(y_{t+1:t+l} | z_{t+1} = k, l_{t+1} = l) ],

with base case β_T(j) = 1. We can now obtain the marginal probability of y as p(y | x) = ∑_{k=1}^{K} β*_0(k) p(z_1 = k), where we have used the fact that f_0 must be 1, and we therefore train to maximize the log-marginal likelihood of the observed y:

ln p(y | x; θ) = ln ∑_{k=1}^{K} β*_0(k) p(z_1 = k).    (1)

Since the quantities in (1) are obtained from a dynamic program, which is itself differentiable, we may simply maximize with respect to the parameters θ by back-propagating through the dynamic program; this is easily accomplished with automatic differentiation packages, and we use pytorch (Paszke et al., 2017) in all experiments.
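The following sketch (ours) spells out the backward recursion in log space; the tensor layout for the precomputed segment log-probabilities is an assumption, but it illustrates why (1) can be maximized simply by backpropagating through the dynamic program.

```python
import torch

def hsmm_log_marginal(log_pi, log_trans, seg_logprobs, T, L):
    """Differentiable backward recursion for the HSMM log-marginal ln p(y | x).

    log_pi:       (K,)      log p(z_1 = k)
    log_trans:    (K, K)    log p(z_{t+1} = k' | z_t = k), time-invariant given x
    seg_logprobs: (T, L, K) seg_logprobs[t, l-1, k] = log p(y_{t+1:t+l} | z=k, l),
                  already including the (uniform) length probability p(l | z=k).
    Gradients flow through all inputs, so the likelihood is trained end-to-end.
    """
    K = log_pi.size(0)
    log_beta = [None] * (T + 1)           # log β_t(j)
    log_beta_star = [None] * (T + 1)      # log β*_t(k)
    log_beta[T] = torch.zeros(K)          # base case β_T(j) = 1
    for t in range(T - 1, -1, -1):
        terms = []                        # one (K,) tensor per allowed length l
        for l in range(1, min(L, T - t) + 1):
            terms.append(log_beta[t + l] + seg_logprobs[t, l - 1])
        log_beta_star[t] = torch.logsumexp(torch.stack(terms), dim=0)
        log_beta[t] = torch.logsumexp(log_trans + log_beta_star[t].unsqueeze(0), dim=1)
    return torch.logsumexp(log_pi + log_beta_star[0], dim=0)
```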
The bottom of Figure 1 shows a between the segments to be generated (e.g., the visualization of this extracted template, where dis- correct determiner for an upcoming noun phrase) crete states are replaced by the phrases they fre- in its encoding, allowing the segments themselves quently generate in the training data. to be decoded more or less independently, given x. With our templates z(i) in hand, we can then restrict the model to using (one of) them during 6 Data and Methods generation. In particular, given a new input x, we may generate by computing Our experiments apply the approach outlined above to two recent, data-driven NLG tasks. ŷ(i) = argmax p(y′, z(i) |x), (2) y′ 6.1 Datasets which gives us a generation ŷ(i) for each extracted Experiments use the E2E (Novikova et al., 2017) template z(i). For example, the generation in Fig- and WikiBio (Lebret et al., 2016) datasets, ex- ure 1 is obtained by maximizing (2) with x set to amples of which are shown in Figures 1 and 2, the database in Figure 1 and z(i) set to the template respectively. The former dataset, used for the 3179 2018 E2E-Gen Shared Task, contains approxi- currences in Section 5) to any segmentation that mately 50K total examples, and uses 945 distinct splits up a sequence yt+1:t+l that appears in some word types, and the latter dataset contains approx- rj , or that includes yt+1:t+l as a subsequence of imately 500K examples and uses approximately another sequence. Thus, we maximize (1) subject 400K word types. Because our emission model to these hard constraints. uses a word-level copy mechanism, any record Increasing the Number of Hidden States with a phrase consisting of n words as its value is While a larger K allows for a more expressive la- replaced with n positional records having a single tent model, computing K emission distributions word value, following the preprocessing of Lebret over the vocabulary can be prohibitively expen- et al. (2016). For example, “type[coffee shop]” sive. We therefore tie the emission distribution be- in Figure 1 becomes “type-1[coffee]” and “type- tween multiple states, while allowing them to have 2[shop].” a different transition distributions. For both datasets we compare with published encoder-decoder models, as well as with direct We give additional architectural details of our template-style baselines. The E2E task is eval- model in the Supplemental Material; here we note d uated in terms of BLEU (Papineni et al., 2002), that we use an MLP to embed rj ∈R , and a 1- NIST (Belz and Reiter, 2006), ROUGE (Lin, layer LSTM (Hochreiter and Schmidhuber, 1997) 2004), CIDEr (Vedantam et al., 2015), and ME- in defining our emission distributions. In order to TEOR (Banerjee and Lavie, 2005).2 The bench- reduce the amount of memory used, we restrict our mark system for the task is an encoder-decoder output vocabulary (and thus the height of the ma- style system followed by a reranker, proposed by trix W in Section 5) to only contain words in y Dušek and Jurcıcek (2016). We compare to this that are not present in x; any word in y present in x baseline, as well as to a simple but competitive is assumed to be copied. 
5.4 Discussion

Returning to the discussion of controllability and interpretability, we note that with the proposed model (a) it is possible to explicitly force the generation to use a chosen template z^(i), which is itself automatically learned from training data, and (b) that every segment in the generated ŷ^(i) is typed by its corresponding latent variable. We explore these issues empirically in Section 7.1.

We also note that these properties may be useful for other text applications, and that they offer an additional perspective on how to approach latent variable modeling for text. Whereas there has been much recent interest in learning continuous latent variable representations for text (see Section 2), it has been somewhat unclear what the latent variables to be learned are intended to capture. On the other hand, the latent, template-like structures we induce here represent a plausible, probabilistic latent variable story, and allow for a more controllable method of generation.

Finally, we highlight one significant possible issue with this model – the assumption that segments are independent of each other given the corresponding latent variable and x. Here we note that the fact that we are allowed to condition on x is quite powerful. Indeed, a clever encoder could capture much of the necessary interdependence between the segments to be generated (e.g., the correct determiner for an upcoming noun phrase) in its encoding, allowing the segments themselves to be decoded more or less independently, given x.

6 Data and Methods

Our experiments apply the approach outlined above to two recent, data-driven NLG tasks.

6.1 Datasets

Experiments use the E2E (Novikova et al., 2017) and WikiBio (Lebret et al., 2016) datasets, examples of which are shown in Figures 1 and 2, respectively. The former dataset, used for the 2018 E2E-Gen Shared Task, contains approximately 50K total examples and uses 945 distinct word types, and the latter dataset contains approximately 500K examples and uses approximately 400K word types. Because our emission model uses a word-level copy mechanism, any record with a phrase consisting of n words as its value is replaced with n positional records having a single word value, following the preprocessing of Lebret et al. (2016). For example, "type[coffee shop]" in Figure 1 becomes "type-1[coffee]" and "type-2[shop]."

For both datasets we compare with published encoder-decoder models, as well as with direct template-style baselines. The E2E task is evaluated in terms of BLEU (Papineni et al., 2002), NIST (Belz and Reiter, 2006), ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015), and METEOR (Banerjee and Lavie, 2005), using the official E2E NLG Challenge scoring scripts at https://github.com/tuetschek/e2e-metrics. The benchmark system for the task is an encoder-decoder style system followed by a reranker, proposed by Dušek and Jurcıcek (2016). We compare to this baseline, as well as to a simple but competitive non-parametric template-like baseline ("SUB" in tables), which selects a training sentence with records that maximally overlap (without including extraneous records) the unseen set of records we wish to generate from; ties are broken at random. Then, word-spans in the chosen training sentence are aligned with records by string-match, and replaced with the corresponding fields of the new set of records. (For categorical records, like "familyFriendly", which cannot easily be aligned with a phrase, we simply select only candidate training sentences with the same categorical value.)

The WikiBio dataset is evaluated in terms of BLEU, NIST, and ROUGE, and we compare with the systems and baselines implemented by Lebret et al. (2016), which include two neural, encoder-decoder style models, as well as a Kneser-Ney, templated baseline.

6.2 Model and Training Details

We first emphasize two additional methodological details important for obtaining good performance.

Constraining Learning. We were able to learn more plausible segmentations of y by constraining the model to respect word spans y_{t+1:t+l} that appear in some record r_j ∈ x. We accomplish this by giving zero probability (within the backward recurrences in Section 5) to any segmentation that splits up a sequence y_{t+1:t+l} that appears in some r_j, or that includes y_{t+1:t+l} as a subsequence of another sequence. Thus, we maximize (1) subject to these hard constraints.
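One way to realize these hard constraints is to precompute a mask over candidate segments (t, l) and add log 0 (i.e., −∞) to the corresponding segment scores before running the dynamic program. The sketch below (ours) implements a simplified reading of the span test; the released code may enforce the constraint differently.

```python
def allowed_segment_mask(y, record_values, L):
    """mask[t][l-1] is True iff the candidate segment y[t:t+l] neither splits nor
    partially overlaps an occurrence of a record value in y (one reading of the
    constraint; the paper applies it as a hard zero inside the recurrences)."""
    T = len(y)
    spans = []                                      # occurrences of record values
    for val in record_values:                       # each val is a list of words
        n = len(val)
        for s in range(T - n + 1):
            if y[s:s + n] == val:
                spans.append((s, s + n))            # half-open span [s, e)
    mask = [[True] * L for _ in range(T)]
    for t in range(T):
        for l in range(1, min(L, T - t) + 1):
            seg_start, seg_end = t, t + l
            for (s, e) in spans:
                overlaps = seg_start < e and s < seg_end
                contains = seg_start <= s and e <= seg_end
                if overlaps and not contains:       # partial overlap: forbidden
                    mask[t][l - 1] = False
    return mask
```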
Increasing the Number of Hidden States. While a larger K allows for a more expressive latent model, computing K emission distributions over the vocabulary can be prohibitively expensive. We therefore tie the emission distribution between multiple states, while allowing them to have different transition distributions.

We give additional architectural details of our model in the Supplemental Material; here we note that we use an MLP to embed r_j ∈ R^d, and a 1-layer LSTM (Hochreiter and Schmidhuber, 1997) in defining our emission distributions. In order to reduce the amount of memory used, we restrict our output vocabulary (and thus the height of the matrix W in Section 5) to only contain words in y that are not present in x; any word in y present in x is assumed to be copied. In the case where a word y_t appears in a record r_j (and could therefore have been copied), the input to the LSTM at time t+1 is computed using information from r_j; if there are multiple r_j from which y_t could have been copied, the computed representations are simply averaged.

For all experiments, we set d = 300 and L = 4. At generation time, we select the 100 most common templates z^(i), perform beam search with a beam of size 5, and select the generation with the highest overall joint probability.

For our E2E experiments, our best non-autoregressive model has 55 "base" states, duplicated 5 times, for a total of K = 275 states, and our best autoregressive model uses K = 60 states, without any duplication. For our WikiBio experiments, both our best non-autoregressive and autoregressive models use 45 base states duplicated 3 times, for a total of K = 135 states. In all cases, K was chosen based on BLEU performance on held-out validation data. Code implementing our models is available at https://github.com/harvardnlp/neural-template-gen.

7 Results

Our results on automatic metrics are shown in Tables 1 and 2. In general, we find that the templated baselines underperform neural models, whereas our proposed model is fairly competitive with neural models, and sometimes even outperforms them.

Table 1: Comparison of the system of Dušek and Jurcıcek (2016), which forms the baseline for the E2E challenge, a non-parametric, substitution-based baseline (see text), and our HSMM models (denoted "NTemp" and "NTemp+AR" for the non-autoregressive and autoregressive versions, resp.) on the validation and test portions of the E2E dataset. "ROUGE" is ROUGE-L. Models are evaluated using the official E2E NLG Challenge scoring scripts.

              BLEU   NIST   ROUGE  CIDEr  METEOR
  Validation
  D&J         69.25  8.48   72.57  2.40   47.03
  SUB         43.71  6.72   55.35  1.41   37.87
  NTemp       64.53  7.66   68.60  1.82   42.46
  NTemp+AR    67.07  7.98   69.50  2.29   43.07
  Test
  D&J         65.93  8.59   68.50  2.23   44.83
  SUB         43.78  6.88   54.64  1.39   37.35
  NTemp       55.17  7.14   65.70  1.70   41.91
  NTemp+AR    59.80  7.56   65.01  1.95   38.75

On the E2E data, for example, we see in Table 1 that the SUB baseline, despite having fairly impressive performance for a non-parametric model, fares the worst. The neural HSMM models are largely competitive with the encoder-decoder system on the validation data, despite offering the benefits of interpretability and controllability; however, the gap increases on test.

Table 2 evaluates our system's performance on the test portion of the WikiBio dataset, comparing with the systems and baselines implemented by Lebret et al. (2016). Again for this dataset we see that their templated Kneser-Ney model underperforms on the automatic metrics, and that neural models improve on these results. Here the HSMMs are competitive with the best model of Lebret et al. (2016), and even outperform it on ROUGE. We emphasize, however, that recent, sophisticated approaches to encoder-decoder style database-to-text generation have since surpassed the results of Lebret et al. (2016) and our own, and we show the recent seq2seq style results of Liu et al. (2018), who use a somewhat larger model, at the bottom of Table 2.

Table 2: Top: comparison of the two best neural systems of Lebret et al. (2016), their templated baseline, and our HSMM models (denoted "NTemp" and "NTemp+AR" for the non-autoregressive and autoregressive versions, resp.) on the test portion of the WikiBio dataset. Models marked with a † are from Lebret et al. (2016), and following their methodology we use ROUGE-4. Bottom: state-of-the-art seq2seq-style results from Liu et al. (2018).

                              BLEU   NIST  ROUGE-4
  Template KN †               19.8   5.19  10.7
  NNLM (field) †              33.4   7.52  23.9
  NNLM (field & word) †       34.7   7.98  25.8
  NTemp                       34.2   7.94  35.9
  NTemp+AR                    34.8   7.59  38.6
  Seq2seq (Liu et al., 2018)  43.65  -     40.32

7.1 Qualitative Evaluation

We now qualitatively demonstrate that our generations are controllable and interpretable.

Controllable Diversity. One of the powerful aspects of the proposed approach to generation is that we can manipulate the template z^(i) while leaving the database x constant, which allows for easily controlling aspects of the generation. In Table 3 we show the generations produced by our model for five different neural template sequences z^(i), while fixing x. There, the segments in each generation are annotated with the latent states determined by the corresponding z^(i). We see that these templates can be used to affect the word-ordering, as well as which fields are mentioned in the generated text. Moreover, because the discrete states align with particular fields (see below), it is generally simple to automatically infer to which fields particular latent states correspond, allowing users to choose which template best meets their requirements. We emphasize that this level of controllability is much harder to obtain for encoder-decoder models, since, at best, a large amount of sampling would be required to avoid generating around a particular mode in the conditional distribution, and even then it would be difficult to control the sort of generations obtained.

Table 3: Impact of varying the template z^(i) for a single x from the E2E validation data; generations are annotated with the segmentations of the chosen z^(i). Results were obtained using the NTemp+AR model from Table 1. Source records: name[Travellers Rest Beefeater], customerRating[3 out of 5], area[riverside], near[Raja Indian Cuisine].

1. [Travellers Rest Beefeater]55 [is a]59 [3 star]43 [restaurant]11 [located near]25 [Raja Indian Cuisine]40 [.]53
2. [Near]31 [riverside]29 [,]44 [Travellers Rest Beefeater]55 [serves]3 [3 star]50 [food]1 [.]2
3. [Travellers Rest Beefeater]55 [is a]59 [restaurant]12 [providing]3 [riverside]50 [food]1 [and has a]17 [3 out of 5]26 [customer rating]16 [.]2 [It is]8 [near]25 [Raja Indian Cuisine]40 [.]53
4. [Travellers Rest Beefeater]55 [is a]59 [place to eat]12 [located near]25 [Raja Indian Cuisine]40 [.]53
5. [Travellers Rest Beefeater]55 [is a]59 [3 out of 5]5 [rated]32 [riverside]43 [restaurant]11 [near]25 [Raja Indian Cuisine]40 [.]53
Interpretable States. Discrete states also provide a method for interpreting the generations produced by the system, since each segment is explicitly typed by the current hidden state of the model. Table 4 shows the impact of varying the template z^(i) for a single x from the WikiBio dataset. While there is in general surprisingly little stylistic variation in the WikiBio data itself, there is variation in the information discussed, and the templates capture this. Moreover, we see that particular discrete states correspond in a consistent way to particular pieces of information, allowing us to align states with particular field types. For instance, birth names have the same hidden state (132), as do names (117), nationalities (82), birth dates (101), and occupations (20).

Table 4: Impact of varying the template z^(i) for a single x from the WikiBio validation data; generations are annotated with the segmentations of the chosen z^(i). Results were obtained using the NTemp model from Table 2. Source infobox (kenny warren): name: kenny warren, birth date: 1 april 1946, birth name: kenneth warren deutscher, birth place: brooklyn, new york, occupation: ventriloquist, comedian, author, notable work: book - the revival of ventriloquism in america.

1. [kenneth warren deutscher]132 [(]75 [born]89 [april 1, 1946]101 [)]67 [is an american]82 [author]20 [and]1 [ventriloquist and comedian]69 [.]88
2. [kenneth warren deutscher]132 [(]75 [born]89 [april 1, 1946]101 [)]67 [is an american]82 [author]20 [best known for his]95 [the revival of ventriloquism]96 [.]88
3. [kenneth warren]16 ["kenny" warren]117 [(]75 [born]89 [april 1, 1946]101 [)]67 [is an american]127 [ventriloquist, comedian]28 [.]133
4. [kenneth warren]16 ["kenny" warren]117 [(]75 [born]89 [april 1, 1946]101 [)]67 [is a]104 [new york]98 [author]20 [.]133
5. [kenneth warren deutscher]42 [is an american]82 [ventriloquist, comedian]118 [based in]15 [brooklyn, new york]84 [.]88

To demonstrate empirically that the learned states indeed align with field types, we calculate the average purity of the discrete states learned for both datasets in Table 5. In particular, for each discrete state for which the majority of its generated words appear in some r_j, the purity of a state's record type alignment is calculated as the percentage of the state's words that come from the most frequent record type the state represents. This calculation was carried out over training examples that belonged to one of the top 100 most frequent templates. Table 5 indicates that discrete states learned on the E2E data are quite pure. Discrete states learned on the WikiBio data are less pure, though still rather impressive given that there are approximately 1700 record types represented in the WikiBio data, and we limit the number of states to 135. Unsurprisingly, adding autoregressiveness to the model decreases purity on both datasets, since the model may rely on the autoregressive RNN for typing, in addition to the state's identity.

Table 5: Empirical analysis of the average purity of discrete states learned on the E2E and WikiBio datasets, for the NTemp and NTemp+AR models. Average purities are given as percents, and standard deviations follow in parentheses. See the text for a full description of this calculation.

            NTemp        NTemp+AR
  E2E       89.2 (17.4)  85.4 (18.6)
  WikiBio   43.2 (19.7)  39.9 (17.9)
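For concreteness, the purity computation can be sketched as follows (ours; the exact accounting in the paper may differ, e.g. in how words not traceable to any record are handled).

```python
from collections import Counter

def average_state_purity(state_to_words, word_to_type):
    """Schematic version of the purity statistic in Table 5.

    state_to_words: state k -> list of words k generated (over training examples
                    whose template is among the 100 most frequent).
    word_to_type:   copied word -> record type it came from; words that do not
                    appear in any record are absent from this dict.
    """
    purities = []
    for k, words in state_to_words.items():
        typed = [word_to_type[w] for w in words if w in word_to_type]
        if 2 * len(typed) <= len(words):
            continue  # keep only states whose words mostly appear in some record
        _, top_count = Counter(typed).most_common(1)[0]
        purities.append(100.0 * top_count / len(typed))
    return sum(purities) / len(purities) if purities else 0.0
```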
8 Conclusion and Future Work

We have developed a neural, template-like generation model based on an HSMM decoder, which can be learned tractably by backpropagating through a dynamic program. The method allows us to extract template-like latent objects in a principled way in the form of state sequences, and then generate with them. This approach scales to large-scale text datasets and is nearly competitive with encoder-decoder models. More importantly, this approach allows for controlling the diversity of generation and for producing interpretable states during generation. We view this work both as the first step towards learning discrete latent variable template models for more difficult generation tasks, as well as a different perspective on learning latent variable text models in general. Future work will examine encouraging the model to learn maximally different (or minimal) templates, which our objective does not explicitly encourage, templates of larger textual phenomena, such as paragraphs and documents, and hierarchical templates.

Acknowledgments

SW gratefully acknowledges the support of a Siebel Scholars award. AMR gratefully acknowledges the support of NSF CCF-1704834, Intel Research, and Amazon AWS Research grants.

References

Gabor Angeli, Percy Liang, and Dan Klein. 2010. A simple domain-independent probabilistic approach to generation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 502–512. Association for Computational Linguistics.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
Anja Belz. 2008. Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models. Natural Language Engineering, 14(04):431–455.
Anja Belz and Ehud Reiter. 2006. Comparing automatic and human evaluation of NLG systems. In 11th Conference of the European Chapter of the Association for Computational Linguistics.
Andrew Chisholm, Will Radford, and Ben Hachey. 2017. Learning to generate one-sentence biographies from Wikidata. CoRR, abs/1702.06235.
KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
Hanjun Dai, Bo Dai, Yan-Ming Zhang, Shuang Li, and Le Song. 2017. Recurrent hidden semi-markov model. In International Conference on Learning Representations.
Ondrej Dušek and Filip Jurcıcek. 2016. Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. In The 54th Annual Meeting of the Association for Computational Linguistics, page 45.
Mark JF Gales and Steve J Young. 1993. The theory of segmental hidden Markov models. University of Cambridge, Department of Engineering.
Albert Gatt and Ehud Reiter. 2009. SimpleNLG: A realisation engine for practical applications. In Proceedings of the 12th European Workshop on Natural Language Generation, pages 90–93. Association for Computational Linguistics.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In ACL.
Çaglar Gülçehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In ACL.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780.
Blake Howald, Ravikumar Kondadadi, and Frank Schilder. 2013. Domain adaptable semantic clustering in statistical NLG. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, pages 143–154.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In International Conference on Machine Learning, pages 1587–1596.
Po-Sen Huang, Chong Wang, Sitao Huang, Dengyong Zhou, and Li Deng. 2018. Towards neural phrase-based machine translation. In International Conference on Learning Representations.
Dan Jurafsky and James H Martin. 2014. Speech and Language Processing. Pearson London.
Ravi Kondadadi, Blake Howald, and Frank Schilder. 2013. A statistical NLG framework for aggregated planning and realization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1406–1415.
Lingpeng Kong, Chris Dyer, and Noah A Smith. 2016. Segmental recurrent neural networks. In International Conference on Learning Representations.
Ioannis Konstas and Mirella Lapata. 2013. A global model for concept-to-text generation. Journal of Artificial Intelligence Research (JAIR), 48:305–346.
Karen Kukich. 1983. Design of a knowledge-based report generator. In ACL, pages 145–150.
Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. In EMNLP, pages 1203–1213.
J. Li, R. Jia, H. He, and P. Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. In North American Association for Computational Linguistics (NAACL).
Percy Liang, Michael I Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In ACL, pages 91–99. Association for Computational Linguistics.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2018. Table-to-text generation by structure-aware seq2seq learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.
Liang Lu, Lingpeng Kong, Chris Dyer, Noah A Smith, and Steve Renals. 2016. Segmental recurrent neural networks for end-to-end speech recognition. In Interspeech 2016, pages 385–389.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pages 1412–1421.
Kathleen McKeown. 1992. Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Studies in Natural Language Processing. Cambridge University Press.
Susan W McRoy, Songsak Channarukul, and Syed S Ali. 2000. YAG: A template-based generator for real-time systems. In Proceedings of the First International Conference on Natural Language Generation, pages 264–267. Association for Computational Linguistics.
Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. In NAACL HLT, pages 720–730.
Kevin P Murphy. 2002. Hidden semi-markov models (HSMMs). Unpublished notes.
Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.
Jekaterina Novikova, Ondrej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Saarbrücken, Germany.
Mari Ostendorf, Vassilios V Digalakis, and Owen A Kimball. 1996. From HMM's to segment models: A unified view of stochastic modeling for speech recognition. IEEE Transactions on Speech and Audio Processing, 4(5):360–378.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop.
Lawrence R Rabiner. 1989. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
Ehud Reiter and Robert Dale. 1997. Building applied natural language generation systems. Natural Language Engineering, 3(1):57–87.
Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6833–6844.
Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2018. Neural language modeling by jointly learning syntax and lexicon. In International Conference on Learning Representations.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.
Hao Tang, Weiran Wang, Kevin Gimpel, and Karen Livescu. 2016. End-to-end training approaches for discriminative segmental models. In 2016 IEEE Spoken Language Technology Workshop (SLT), pages 496–502.
Ke M Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. 2016. Unsupervised neural hidden markov models. In Workshop on Structured Prediction for NLP, pages 63–71.
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.
Chong Wang, Yining Wang, Po-Sen Huang, Abdelrahman Mohamed, Dengyong Zhou, and Li Deng. 2017. Sequence modeling via segmentations. In International Conference on Machine Learning, pages 3674–3683.
Lu Wang and Claire Cardie. 2013. Domain-independent abstract generation for focused meeting summarization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1395–1405.
Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263.
Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. Breaking the softmax bottleneck: A high-rank RNN language model. In International Conference on Learning Representations.
Zichao Yang, Phil Blunsom, Chris Dyer, and Wang Ling. 2016. Reference-aware language models. CoRR, abs/1611.01628.
Lei Yu, Jan Buys, and Phil Blunsom. 2016. Online segment to segment neural transduction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1307–1316.
Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, and Yann LeCun. 2018. Adversarially regularized autoencoders. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), pages 5897–5906.

A Supplemental Material

A.1 Additional Model and Training Details

Computing r_j. A record r_j is represented by embedding a feature for its type, its position, and its word value in R^d, and applying an MLP with a ReLU nonlinearity (Nair and Hinton, 2010) to form r_j ∈ R^d, similar to Yang et al. (2016) and Wiseman et al. (2017).

LSTM Details. The initial cell and hidden-state values for the decoder LSTM are given by Q_1 x_a and tanh(Q_2 x_a), respectively, where Q_1, Q_2 ∈ R^{d×d}. When a word y_t appears in a record r_j, the input to the LSTM at time t+1 is computed using an MLP with a ReLU nonlinearity over the concatenation of the embeddings for r_j's record type, word value, position, and a feature for whether it is the final position for the type. If there are multiple r_j from which y_t could have been copied, the computed representations are averaged. At test time, we use the MAP r_j to compute the input, even if there are multiple matches. For y_t which could not have been copied, the input to the LSTM at time t+1 is computed using the same MLP over y_t and three dummy features.
For the autoregressive HSMM, an additional 1-layer LSTM with d hidden units is used. We experimented with having the autoregressive HSMM consume either the tokens y_{1:t} in predicting y_{t+1}, or the average embedding of the field types corresponding to copied tokens in y_{1:t}. The former worked slightly better for the WikiBio dataset (where field types are more ambiguous), while the latter worked slightly better for the E2E dataset.

Transition Distribution. The function C(x_u), which produces hidden state embeddings conditional on the source, is defined as C(x_u) = U_2(ReLU(U_1 x_u)), where U_1 ∈ R^{m3×d} and U_2 ∈ R^{K×m2×m3}; D(x_u) is defined analogously. For all experiments, m1 = 64, m2 = 32, and m3 = 64.

Optimization. We train with SGD, using a learning rate of 0.5 and decaying by 0.5 each epoch after the first epoch in which validation log-likelihood fails to increase. When using an autoregressive HSMM, the additional LSTM is optimized only after the learning rate has been decayed. We regularize with Dropout (Srivastava et al., 2014).

A.2 Additional Learned Templates

In Tables 6 and 7 we show visualizations of additional templates learned on the E2E and WikiBio data, respectively, by both the non-autoregressive and autoregressive HSMM models presented in the paper. For each model, we select a set of five dissimilar templates in an iterative way by greedily selecting the next template (out of the 200 most frequent) that has the highest percentage of states that do not appear in the previously selected templates; ties are broken randomly. Individual states within a template are visualized using the three most common segments they generate.

Table 6: Five templates extracted from the E2E data with the NTemp model (top) and the NTemp+AR model (bottom).
Table 7: Five templates extracted from the WikiBio data with the NTemp model (top) and the NTemp+AR model (bottom).