Learning Neural Templates for Text Generation

Sam Wiseman   Stuart M. Shieber   Alexander M. Rush
School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
{swiseman,shieber,srush}@seas.harvard.edu

Abstract

While neural, encoder-decoder models have had significant empirical success in text generation, there remain several unaddressed problems with this style of generation. Encoder-decoder models are largely (a) uninterpretable, and (b) difficult to control in terms of their phrasing or content. This work proposes a neural generation system using a hidden semi-markov model (HSMM) decoder, which learns latent, discrete templates jointly with learning to generate. We show that this model learns useful templates, and that these templates make generation both more interpretable and controllable. Furthermore, we show that this approach scales to real data sets and achieves strong performance nearing that of encoder-decoder text generation models.

Figure 1: An example template-like generation from the E2E Generation dataset (Novikova et al., 2017). The source entity is Cotto, and the knowledge base x (top) contains 6 records, shown as type[value]: type[coffee shop], rating[3 out of 5], food[English], area[city centre], price[moderate], near[The Portland Arms]. The system generation ŷ (middle) is: "Cotto is a coffee shop serving English food in the moderate price range. It is located near The Portland Arms. Its customer rating is 3 out of 5." An induced neural template (bottom) is learned by the system and employed in generating ŷ; each cell represents a segment in the learned segmentation, and "blanks" show where slots are filled through copy attention during generation.

1 Introduction

With the continued success of encoder-decoder models for machine translation and related tasks, there has been great interest in extending these methods to build general-purpose, data-driven natural language generation (NLG) systems (Mei et al., 2016; Dušek and Jurcıcek, 2016; Lebret et al., 2016; Chisholm et al., 2017; Wiseman et al., 2017). These encoder-decoder models (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015) use a neural encoder model to represent a source knowledge base, and a decoder model to emit a textual description word-by-word, conditioned on the source encoding. This style of generation contrasts with the more traditional division of labor in NLG, which famously emphasizes addressing the two questions of "what to say" and "how to say it" separately, and which leads to systems with explicit content selection, macro- and micro-planning, and surface realization components (Reiter and Dale, 1997; Jurafsky and Martin, 2014).

Encoder-decoder generation systems appear to have increased the fluency of NLG outputs, while reducing the manual effort required. However, due to the black-box nature of generic encoder-decoder models, these systems have also largely sacrificed two important desiderata that are often found in more traditional systems, namely (a) interpretable outputs that (b) can be easily controlled in terms of form and content.
This work considers building interpretable and controllable neural generation systems, and proposes a specific first step: a new data-driven generation model for learning discrete, template-like structures for conditional text generation. The core system uses a novel, neural hidden semi-markov model (HSMM) decoder, which provides a principled approach to template-like text generation. We further describe efficient methods for training this model in an entirely data-driven way by backpropagation through inference. Generating with the template-like structures induced by the neural HSMM allows for the explicit representation of what the system intends to say (in the form of a learned template) and how it is attempting to say it (in the form of an instantiated template).

We show that we can achieve performance competitive with other neural NLG approaches, while making progress satisfying the above two desiderata. Concretely, our experiments indicate that we can induce explicit templates (as shown in Figure 1) while achieving competitive automatic scores, and that we can control and interpret our generations by manipulating these templates. Finally, while our experiments focus on the data-to-text regime, we believe the proposed methodology represents a compelling approach to learning discrete, latent-variable representations of conditional text.

2 Related Work

A core task of NLG is to generate textual descriptions of knowledge base records. A common approach is to use hand-engineered templates (Kukich, 1983; McKeown, 1992; McRoy et al., 2000), but there has also been interest in creating templates in an automated manner.
For instance, many authors induce templates by clustering sentences and then abstracting templated fields with hand-engineered rules (Angeli et al., 2010; Kondadadi et al., 2013; Howald et al., 2013), or with a pipeline of other automatic approaches (Wang and Cardie, 2013).

There has also been work in incorporating probabilistic notions of templates into generation models (Liang et al., 2009; Konstas and Lapata, 2013), which is similar to our approach. However, these approaches have always been conjoined with discriminative classifiers or rerankers in order to actually accomplish the generation (Angeli et al., 2010; Konstas and Lapata, 2013). In addition, these models explicitly model knowledge base field selection, whereas the model we present is fundamentally an end-to-end model over generation segments.

Recently, a new paradigm has emerged around neural text generation systems based on machine translation (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015). Most of this work has used unconstrained black-box encoder-decoder approaches. There has been some work on discrete variables in this context, including extracting representations (Shen et al., 2018), incorporating discrete latent variables in text modeling (Yang et al., 2018), and using non-HSMM segmental models for machine translation or summarization (Yu et al., 2016; Wang et al., 2017; Huang et al., 2018). Dai et al. (2017) develop an approximate inference scheme for a neural HSMM using RNNs for continuous emissions; in contrast we maximize the exact log-marginal, and use RNNs to parameterize a discrete emission distribution. Finally, there has also been much recent interest in segmental RNN models for non-generative tasks in NLP (Tang et al., 2016; Kong et al., 2016; Lu et al., 2016).

The neural text generation community has also recently been interested in "controllable" text generation (Hu et al., 2017), where various aspects of the text (often sentiment) are manipulated or transferred (Shen et al., 2017; Zhao et al., 2018; Li et al., 2018). In contrast, here we focus on controlling either the content of a generation or the way it is expressed by manipulating the (latent) template used in realizing the generation.

3 Overview: Data-Driven NLG

Our focus is on generating a textual description of a knowledge base or meaning representation. Following standard notation (Liang et al., 2009; Wiseman et al., 2017), let x = {r_1 ... r_J} be a collection of records. A record is made up of a type (r.t), an entity (r.e), and a value (r.m). For example, a knowledge base of restaurants might have a record with r.t = Cuisine, r.e = Denny's, and r.m = American. The aim is to generate an adequate and fluent text description ŷ_{1:T} = ŷ_1, ..., ŷ_T of x. Concretely, we consider the E2E Dataset (Novikova et al., 2017) and the WikiBio Dataset (Lebret et al., 2016). We show an example E2E knowledge base x in the top of Figure 1. The top of Figure 2 shows an example knowledge base x from the WikiBio dataset, where it is paired with a reference text y = y_{1:T} at the bottom.
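To make this notation concrete, the following minimal sketch (ours, not taken from the paper's released code; the Record container and field names are illustrative only) represents the E2E knowledge base of Figure 1 as a collection of (type, entity, value) records.

```python
from collections import namedtuple

# A record r consists of a type r.t, an entity r.e, and a value r.m (Section 3).
Record = namedtuple("Record", ["t", "e", "m"])

# The E2E knowledge base x from Figure 1, as a collection of records.
x = [
    Record(t="type",   e="Cotto", m="coffee shop"),
    Record(t="rating", e="Cotto", m="3 out of 5"),
    Record(t="food",   e="Cotto", m="English"),
    Record(t="area",   e="Cotto", m="city centre"),
    Record(t="price",  e="Cotto", m="moderate"),
    Record(t="near",   e="Cotto", m="The Portland Arms"),
]
```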
Figure 2: An example from the WikiBio dataset (Lebret et al., 2016), with a database x (top) for Frederick Parker-Rhodes, taken from his Wikipedia infobox, and the corresponding reference generation y (bottom): "Frederick Parker-Rhodes (21 March 1914 – 21 November 1987) was an English linguist, plant pathologist, computer scientist, mathematician, mystic, and mycologist."

The dominant approach in neural NLG has been to use an encoder network over x and then a conditional decoder network to generate y, training the whole system in an end-to-end manner. To generate a description for a given example, a black-box network (such as an RNN) is used to produce a distribution over the next word, from which a choice is made and fed back into the system. The entire distribution is driven by the internal states of the neural network.

While effective, relying on a neural decoder makes it difficult to understand what aspects of x are correlated with a particular system output. This leads to problems both in controlling fine-grained aspects of the generation process and in interpreting model mistakes.

As an example of why controllability is important, consider the records in Figure 1. Given these inputs an end-user might want to generate an output meeting specific constraints, such as not mentioning any information relating to customer rating. Under a standard encoder-decoder style model, one could filter out this information either from the encoder or the decoder, but in practice this would lead to unexpected changes in output that might propagate through the whole system.

As an example of the difficulty of interpreting mistakes, consider the following actual generation from an encoder-decoder style system for the records in Figure 2: "frederick parker-rhodes (21 november 1914 - 2 march 1987) was an english mycology and plant pathology, mathematics at the university of uk." In addition to not being fluent, it is unclear what the end of this sentence is even attempting to convey: it may be attempting to convey a fact not actually in the knowledge base (e.g., where Parker-Rhodes studied), or perhaps it is simply failing to fluently realize information that is in the knowledge base (e.g., Parker-Rhodes's country of residence).

Traditional NLG systems (Kukich, 1983; McKeown, 1992; Belz, 2008; Gatt and Reiter, 2009), in contrast, largely avoid these problems. Since they typically employ an explicit planning component, which decides which knowledge base records to focus on, and a surface realization component, which realizes the chosen records, the intent of the system is always explicit, and it may be modified to meet constraints.

The goal of this work is to propose an approach to neural NLG that addresses these issues in a principled way. We target this goal by proposing a new model that generates with template-like objects induced by a neural HSMM (see Figure 1). Templates are useful here because they represent a fixed plan for the generation's content, and because they make it clear what part of the generation is associated with which record in the knowledge base.
4 Background: Semi-Markov Models

What does it mean to learn a template? It is natural to think of a template as a sequence of typed text-segments, perhaps with some segments acting as the template's "backbone" (Wang and Cardie, 2013), and the remaining segments filled in from the knowledge base.

A natural probabilistic model conforming with this intuition is the hidden semi-markov model (HSMM) (Gales and Young, 1993; Ostendorf et al., 1996), which models latent segmentations in an output sequence. Informally, an HSMM is much like an HMM, except emissions may last multiple time-steps, and multi-step emissions need not be independent of each other conditioned on the state.

We briefly review HSMMs following Murphy (2002). Assume we have a sequence of observed tokens y_1 ... y_T and a discrete, latent state z_t ∈ {1, ..., K} for each timestep. We additionally use two per-timestep variables to model multi-step segments: a length variable l_t ∈ {1, ..., L} specifying the length of the current segment, and a deterministic binary variable f_t indicating whether a segment finishes at time t. We will consider in particular conditional HSMMs, which condition on a source x, essentially giving us an HSMM decoder.
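As a toy illustration of these variables (our own example, using one common indexing convention rather than anything prescribed by the paper), a six-token output split into three segments would be represented as follows.

```python
# Toy HSMM segmentation of y_1..y_6 into three segments of lengths 2, 3 and 1.
# z_t: latent state per timestep (constant within a segment);
# l_t: length of the segment the timestep belongs to;
# f_t: 1 exactly when a segment finishes at time t.
y = ["Cotto", "is", "a", "coffee", "shop", "."]
z = [7, 7, 12, 12, 12, 2]
l = [2, 2, 3, 3, 3, 1]
f = [0, 1, 0, 0, 1, 1]
```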
An HSMM specifies a joint distribution on the observations and latent segmentations. Letting θ denote all the parameters of the model, and using the variables introduced above, we can write the corresponding joint-likelihood as follows:

p(y, z, l, f | x; θ) = ∏_{t=0}^{T-1} p(z_{t+1}, l_{t+1} | z_t, l_t, x)^{f_t} × ∏_{t=1}^{T} p(y_{t-l_t+1:t} | z_t, l_t, x)^{f_t},

where we take z_0 to be a distinguished start-state, and the deterministic f_t variables are used for excluding non-segment log probabilities. We further assume p(z_{t+1}, l_{t+1} | z_t, l_t, x) factors as p(z_{t+1} | z_t, x) × p(l_{t+1} | z_{t+1}). Thus, the likelihood is given by the product of the probabilities of each discrete state transition made, the probability of the length of each segment given its discrete state, and the probability of the observations in each segment, given its state and length.

Figure 3: HSMM factor graph (under a known segmentation) to illustrate parameters. Here we assume z_1 is in the "red" state (out of K possibilities), and transitions to the "blue" state after emitting three words. The transition model, shown as T, is a function of the two states and the neural encoded source x. The emission model is a function of a "red" RNN model (with copy attention over x) that generates words 1, 2 and 3. After transitioning, the next word y_4 is generated by the "blue" RNN, but independently of the previous words.

5 A Neural HSMM Decoder

We use a novel, neural parameterization of an HSMM to specify the probabilities in the likelihood above. This full model, sketched out in Figure 3, allows us to incorporate the modeling components, such as LSTMs and attention, that make neural text generation effective, while maintaining the HSMM structure.

5.1 Parameterization

Since our model must condition on x, let r_j ∈ R^d represent a real embedding of record r_j ∈ x, and let x_a ∈ R^d represent a real embedding of the entire knowledge base x, obtained by max-pooling coordinate-wise over all the r_j. It is also useful to have a representation of just the unique types of records that appear in x, and so we also define x_u ∈ R^d to be the sum of the embeddings of the unique types appearing in x, plus a bias vector and followed by a ReLU nonlinearity.

Transition Distribution. The transition distribution p(z_{t+1} | z_t, x) may be viewed as a K × K matrix of probabilities, where each row sums to 1. We define this matrix to be

p(z_{t+1} | z_t, x) ∝ AB + C(x_u)D(x_u),

where A ∈ R^{K×m1} and B ∈ R^{m1×K} are state embeddings, and where C: R^d → R^{K×m2} and D: R^d → R^{m2×K} are parameterized non-linear functions of x_u. We apply a row-wise softmax to the resulting matrix to obtain the desired probabilities.

Length Distribution. We simply fix all length probabilities p(l_{t+1} | z_{t+1}) to be uniform up to a maximum length L. (We experimented with parameterizing the length distribution, but found that it led to inferior performance; forcing the length probabilities to be uniform encourages the model to cluster together functionally similar emissions of different lengths, while parameterizing them can lead to states that specialize to specific emission lengths.)
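A minimal PyTorch sketch of the transition parameterization above, with C and D implemented as the two-layer MLPs described in the Supplemental Material; tensor shapes, initialization, and module names are our own assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionModel(nn.Module):
    """p(z_{t+1} | z_t, x) ∝ AB + C(x_u) D(x_u), followed by a row-wise softmax."""
    def __init__(self, K, d, m1=64, m2=32, m3=64):
        super().__init__()
        self.A = nn.Parameter(torch.randn(K, m1) * 0.01)   # state embeddings
        self.B = nn.Parameter(torch.randn(m1, K) * 0.01)
        # C(x_u) and D(x_u): MLPs of the form U2(ReLU(U1 x_u)) over the type embedding x_u
        self.C = nn.Sequential(nn.Linear(d, m3), nn.ReLU(), nn.Linear(m3, K * m2))
        self.D = nn.Sequential(nn.Linear(d, m3), nn.ReLU(), nn.Linear(m3, m2 * K))
        self.K, self.m2 = K, m2

    def forward(self, x_u):                      # x_u: (d,)
        C = self.C(x_u).view(self.K, self.m2)    # (K, m2)
        D = self.D(x_u).view(self.m2, self.K)    # (m2, K)
        scores = self.A @ self.B + C @ D         # (K, K) unnormalized scores
        return F.log_softmax(scores, dim=-1)     # row-wise normalization (log-probs)
```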
Emission Distribution. The emission model models the generation of a text segment conditioned on a latent state and source information, and so requires a richer parameterization. Inspired by the models used for neural NLG, we base this model on an RNN decoder, and write a segment's probability as a product over token-level probabilities,

p(y_{t-l_t+1:t} | z_t = k, l_t = l, x) = ∏_{i=1}^{l_t} p(y_{t-l_t+i} | y_{t-l_t+1:t-l_t+i-1}, z_t = k, x) × p(⟨end⟩ | y_{t-l_t+1:t}, z_t = k, x) × 1{l_t = l},

where ⟨end⟩ is an end of segment token. The RNN decoder uses attention and copy-attention over the embedded records r_j, and is conditioned on z_t = k by concatenating an embedding corresponding to the k'th latent state to the RNN's input; the RNN is also conditioned on the entire x by initializing its hidden state with x_a.

More concretely, let h^k_{i-1} ∈ R^d be the state of an RNN conditioned on x and z_t = k (as above) run over the sequence y_{t-l_t+1:t-l_t+i-1}. We let the model attend over records r_j using h^k_{i-1} (in the style of Luong et al. (2015)), producing a context vector c^k_{i-1}. We may then obtain scores v_{i-1} for each word in the output vocabulary,

v_{i-1} = W tanh(g^k_1 ◦ [h^k_{i-1}, c^k_{i-1}]),

with parameters g^k_1 ∈ R^{2d} and W ∈ R^{V×2d}. Note that there is a g^k_1 vector for each of K discrete states. To additionally implement a kind of slot filling, we allow emissions to be directly copied from the value portion of the records r_j using copy attention (Gülçehre et al., 2016; Gu et al., 2016; Yang et al., 2016). Define copy scores,

ρ_j = r_j^T tanh(g^k_2 ◦ h^k_{i-1}),

where g^k_2 ∈ R^d. We then normalize the output-vocabulary and copy scores together, to arrive at

ṽ_{i-1} = softmax([v_{i-1}, ρ_1, ..., ρ_J]),

and thus

p(y_{t-l_t+i} = w | y_{t-l_t+1:t-l_t+i-1}, z_t = k, x) = ṽ_{i-1,w} + ∑_{j: r_j.m = w} ṽ_{i-1,V+j}.

An Autoregressive Variant. The model as specified assumes segments are independent conditioned on the associated latent state and x. While this assumption still allows for reasonable performance, we can tractably allow interdependence between tokens (but not segments) by having each next-token distribution depend on all the previously generated tokens, giving us an autoregressive HSMM. For this model, we will in fact use p(y_{t-l_t+i} = w | y_{1:t-l_t+i-1}, z_t = k, x) in defining our emission model, which is easily implemented by using an additional RNN run over all the preceding tokens. We will report scores for both non-autoregressive and autoregressive HSMM decoders below.
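Returning to the copy-attention equations above, the per-token score mixing can be sketched as follows (ours; shapes and variable names are assumptions, and the full model additionally feeds the latent-state embedding and copied-record information into the RNN input as described in the text).

```python
import torch
import torch.nn.functional as F

def emission_step_logprobs(h, c, r, W, g1, g2):
    """One decoding step of the segment RNN, schematically.

    h:  (d,)   RNN hidden state for latent state k
    c:  (d,)   attention context over the record embeddings
    r:  (J, d) embedded records r_1..r_J
    W:  (V, 2d), g1: (2d,), g2: (d,)  per-state parameters
    Returns log-probabilities over the extended vocabulary [V words | J copies].
    """
    v = W @ torch.tanh(g1 * torch.cat([h, c]))   # output-vocabulary scores, (V,)
    rho = r @ torch.tanh(g2 * h)                 # copy scores, one per record, (J,)
    return F.log_softmax(torch.cat([v, rho]), dim=0)

# The probability of emitting word w then sums w's vocabulary entry and the copy
# entries of all records whose value equals w, as in the equation above.
```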
5.2 Learning

The model requires fitting a large set of neural network parameters. Since we assume z, l, and f are unobserved, we marginalize over these variables to maximize the log marginal-likelihood of the observed tokens y given x. The HSMM marginal-likelihood calculation can be carried out efficiently with a dynamic program analogous to either the forward- or backward-algorithm familiar from HMMs (Rabiner, 1989).

It is actually more convenient to use the backward-algorithm formulation when using RNNs to parameterize the emission distributions, and we briefly review the backward recurrences here, again following Murphy (2002). We have:

β_t(j) = p(y_{t+1:T} | z_t = j, f_t = 1, x) = ∑_{k=1}^{K} β*_t(k) p(z_{t+1} = k | z_t = j)

β*_t(k) = p(y_{t+1:T} | z_{t+1} = k, f_t = 1, x) = ∑_{l=1}^{L} [ β_{t+l}(k) p(l_{t+1} = l | z_{t+1} = k) p(y_{t+1:t+l} | z_{t+1} = k, l_{t+1} = l) ],

with base case β_T(j) = 1. We can now obtain the marginal probability of y as p(y | x) = ∑_{k=1}^{K} β*_0(k) p(z_1 = k), where we have used the fact that f_0 must be 1, and we therefore train to maximize the log-marginal likelihood of the observed y:

ln p(y | x; θ) = ln ∑_{k=1}^{K} β*_0(k) p(z_1 = k).    (1)

Since the quantities in (1) are obtained from a dynamic program, which is itself differentiable, we may simply maximize with respect to the parameters θ by back-propagating through the dynamic program; this is easily accomplished with automatic differentiation packages, and we use pytorch (Paszke et al., 2017) in all experiments.
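The following sketch (ours) spells out the backward recursion in log space; the tensor layout for the precomputed segment log-probabilities is an assumption, but it illustrates why (1) can be maximized simply by backpropagating through the dynamic program.

```python
import torch

def hsmm_log_marginal(log_pi, log_trans, seg_logprobs, T, L):
    """Differentiable backward recursion for the HSMM log-marginal ln p(y | x).

    log_pi:       (K,)      log p(z_1 = k)
    log_trans:    (K, K)    log p(z_{t+1} = k' | z_t = k), time-invariant given x
    seg_logprobs: (T, L, K) seg_logprobs[t, l-1, k] = log p(y_{t+1:t+l} | z=k, l),
                  already including the (uniform) length probability p(l | z=k).
    Gradients flow through all inputs, so the likelihood is trained end-to-end.
    """
    K = log_pi.size(0)
    log_beta = [None] * (T + 1)           # log β_t(j)
    log_beta_star = [None] * (T + 1)      # log β*_t(k)
    log_beta[T] = torch.zeros(K)          # base case β_T(j) = 1
    for t in range(T - 1, -1, -1):
        terms = []                        # one (K,) tensor per allowed length l
        for l in range(1, min(L, T - t) + 1):
            terms.append(log_beta[t + l] + seg_logprobs[t, l - 1])
        log_beta_star[t] = torch.logsumexp(torch.stack(terms), dim=0)
        log_beta[t] = torch.logsumexp(log_trans + log_beta_star[t].unsqueeze(0), dim=1)
    return torch.logsumexp(log_pi + log_beta_star[0], dim=0)
```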
The bottom of Figure 1 shows a between the segments to be generated (e.g., the visualization of this extracted template, where dis- correct determiner for an upcoming noun phrase) crete states are replaced by the phrases they fre- in its encoding, allowing the segments themselves quently generate in the training data. to be decoded more or less independently, given x. With our templates z(i) in hand, we can then restrict the model to using (one of) them during 6 Data and Methods generation. In particular, given a new input x, we may generate by computing Our experiments apply the approach outlined above to two recent, data-driven NLG tasks. ŷ(i) = argmax p(y′, z(i) |x), (2) y′ 6.1 Datasets which gives us a generation ŷ(i) for each extracted Experiments use the E2E (Novikova et al., 2017) template z(i). For example, the generation in Fig- and WikiBio (Lebret et al., 2016) datasets, ex- ure 1 is obtained by maximizing (2) with x set to amples of which are shown in Figures 1 and 2, the database in Figure 1 and z(i) set to the template respectively. The former dataset, used for the 3179 2018 E2E-Gen Shared Task, contains approxi- currences in Section 5) to any segmentation that mately 50K total examples, and uses 945 distinct splits up a sequence yt+1:t+l that appears in some word types, and the latter dataset contains approx- rj , or that includes yt+1:t+l as a subsequence of imately 500K examples and uses approximately another sequence. Thus, we maximize (1) subject 400K word types. Because our emission model to these hard constraints. uses a word-level copy mechanism, any record Increasing the Number of Hidden States with a phrase consisting of n words as its value is While a larger K allows for a more expressive la- replaced with n positional records having a single tent model, computing K emission distributions word value, following the preprocessing of Lebret over the vocabulary can be prohibitively expen- et al. (2016). For example, “type[coffee shop]” sive. We therefore tie the emission distribution be- in Figure 1 becomes “type-1[coffee]” and “type- tween multiple states, while allowing them to have 2[shop].” a different transition distributions. For both datasets we compare with published encoder-decoder models, as well as with direct We give additional architectural details of our template-style baselines. The E2E task is eval- model in the Supplemental Material; here we note d uated in terms of BLEU (Papineni et al., 2002), that we use an MLP to embed rj ∈R , and a 1- NIST (Belz and Reiter, 2006), ROUGE (Lin, layer LSTM (Hochreiter and Schmidhuber, 1997) 2004), CIDEr (Vedantam et al., 2015), and ME- in defining our emission distributions. In order to TEOR (Banerjee and Lavie, 2005).2 The bench- reduce the amount of memory used, we restrict our mark system for the task is an encoder-decoder output vocabulary (and thus the height of the ma- style system followed by a reranker, proposed by trix W in Section 5) to only contain words in y Dušek and Jurcıcek (2016). We compare to this that are not present in x; any word in y present in x baseline, as well as to a simple but competitive is assumed to be copied. 
5.4 Discussion

Returning to the discussion of controllability and interpretability, we note that with the proposed model (a) it is possible to explicitly force the generation to use a chosen template z^(i), which is itself automatically learned from training data, and (b) that every segment in the generated ŷ^(i) is typed by its corresponding latent variable. We explore these issues empirically in Section 7.1.

We also note that these properties may be useful for other text applications, and that they offer an additional perspective on how to approach latent variable modeling for text. Whereas there has been much recent interest in learning continuous latent variable representations for text (see Section 2), it has been somewhat unclear what the latent variables to be learned are intended to capture. On the other hand, the latent, template-like structures we induce here represent a plausible, probabilistic latent variable story, and allow for a more controllable method of generation.

Finally, we highlight one significant possible issue with this model – the assumption that segments are independent of each other given the corresponding latent variable and x. Here we note that the fact that we are allowed to condition on x is quite powerful. Indeed, a clever encoder could capture much of the necessary interdependence between the segments to be generated (e.g., the correct determiner for an upcoming noun phrase) in its encoding, allowing the segments themselves to be decoded more or less independently, given x.

6 Data and Methods

Our experiments apply the approach outlined above to two recent, data-driven NLG tasks.

6.1 Datasets

Experiments use the E2E (Novikova et al., 2017) and WikiBio (Lebret et al., 2016) datasets, examples of which are shown in Figures 1 and 2, respectively. The former dataset, used for the 2018 E2E-Gen Shared Task, contains approximately 50K total examples and uses 945 distinct word types, and the latter dataset contains approximately 500K examples and uses approximately 400K word types. Because our emission model uses a word-level copy mechanism, any record with a phrase consisting of n words as its value is replaced with n positional records having a single word value, following the preprocessing of Lebret et al. (2016). For example, "type[coffee shop]" in Figure 1 becomes "type-1[coffee]" and "type-2[shop]."

For both datasets we compare with published encoder-decoder models, as well as with direct template-style baselines. The E2E task is evaluated in terms of BLEU (Papineni et al., 2002), NIST (Belz and Reiter, 2006), ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015), and METEOR (Banerjee and Lavie, 2005), using the official E2E NLG Challenge scoring scripts at https://github.com/tuetschek/e2e-metrics. The benchmark system for the task is an encoder-decoder style system followed by a reranker, proposed by Dušek and Jurcıcek (2016). We compare to this baseline, as well as to a simple but competitive non-parametric template-like baseline ("SUB" in tables), which selects a training sentence with records that maximally overlap (without including extraneous records) the unseen set of records we wish to generate from; ties are broken at random. Then, word-spans in the chosen training sentence are aligned with records by string-match, and replaced with the corresponding fields of the new set of records. (For categorical records, like "familyFriendly", which cannot easily be aligned with a phrase, we simply select only candidate training sentences with the same categorical value.)

The WikiBio dataset is evaluated in terms of BLEU, NIST, and ROUGE, and we compare with the systems and baselines implemented by Lebret et al. (2016), which include two neural, encoder-decoder style models, as well as a Kneser-Ney, templated baseline.

6.2 Model and Training Details

We first emphasize two additional methodological details important for obtaining good performance.

Constraining Learning. We were able to learn more plausible segmentations of y by constraining the model to respect word spans y_{t+1:t+l} that appear in some record r_j ∈ x. We accomplish this by giving zero probability (within the backward recurrences in Section 5) to any segmentation that splits up a sequence y_{t+1:t+l} that appears in some r_j, or that includes y_{t+1:t+l} as a subsequence of another sequence. Thus, we maximize (1) subject to these hard constraints.
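One way to realize these hard constraints is to precompute a mask over candidate segments (t, l) and add log 0 (i.e., −∞) to the corresponding segment scores before running the dynamic program. The sketch below (ours) implements a simplified reading of the span test; the released code may enforce the constraint differently.

```python
def allowed_segment_mask(y, record_values, L):
    """mask[t][l-1] is True iff the candidate segment y[t:t+l] neither splits nor
    partially overlaps an occurrence of a record value in y (one reading of the
    constraint; the paper applies it as a hard zero inside the recurrences)."""
    T = len(y)
    spans = []                                      # occurrences of record values
    for val in record_values:                       # each val is a list of words
        n = len(val)
        for s in range(T - n + 1):
            if y[s:s + n] == val:
                spans.append((s, s + n))            # half-open span [s, e)
    mask = [[True] * L for _ in range(T)]
    for t in range(T):
        for l in range(1, min(L, T - t) + 1):
            seg_start, seg_end = t, t + l
            for (s, e) in spans:
                overlaps = seg_start < e and s < seg_end
                contains = seg_start <= s and e <= seg_end
                if overlaps and not contains:       # partial overlap: forbidden
                    mask[t][l - 1] = False
    return mask
```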
Increasing the Number of Hidden States. While a larger K allows for a more expressive latent model, computing K emission distributions over the vocabulary can be prohibitively expensive. We therefore tie the emission distribution between multiple states, while allowing them to have different transition distributions.

We give additional architectural details of our model in the Supplemental Material; here we note that we use an MLP to embed r_j ∈ R^d, and a 1-layer LSTM (Hochreiter and Schmidhuber, 1997) in defining our emission distributions. In order to reduce the amount of memory used, we restrict our output vocabulary (and thus the height of the matrix W in Section 5) to only contain words in y that are not present in x; any word in y present in x is assumed to be copied. In the case where a word y_t appears in a record r_j (and could therefore have been copied), the input to the LSTM at time t+1 is computed using information from r_j; if there are multiple r_j from which y_t could have been copied, the computed representations are simply averaged.

For all experiments, we set d = 300 and L = 4. At generation time, we select the 100 most common templates z^(i), perform beam search with a beam of size 5, and select the generation with the highest overall joint probability.

For our E2E experiments, our best non-autoregressive model has 55 "base" states, duplicated 5 times, for a total of K = 275 states, and our best autoregressive model uses K = 60 states, without any duplication. For our WikiBio experiments, both our best non-autoregressive and autoregressive models use 45 base states duplicated 3 times, for a total of K = 135 states. In all cases, K was chosen based on BLEU performance on held-out validation data. Code implementing our models is available at https://github.com/harvardnlp/neural-template-gen.

7 Results

Our results on automatic metrics are shown in Tables 1 and 2. In general, we find that the templated baselines underperform neural models, whereas our proposed model is fairly competitive with neural models, and sometimes even outperforms them.

Table 1: Comparison of the system of Dušek and Jurcıcek (2016), which forms the baseline for the E2E challenge, a non-parametric, substitution-based baseline (see text), and our HSMM models (denoted "NTemp" and "NTemp+AR" for the non-autoregressive and autoregressive versions, resp.) on the validation and test portions of the E2E dataset. "ROUGE" is ROUGE-L. Models are evaluated using the official E2E NLG Challenge scoring scripts.

              BLEU   NIST   ROUGE  CIDEr  METEOR
  Validation
  D&J         69.25  8.48   72.57  2.40   47.03
  SUB         43.71  6.72   55.35  1.41   37.87
  NTemp       64.53  7.66   68.60  1.82   42.46
  NTemp+AR    67.07  7.98   69.50  2.29   43.07
  Test
  D&J         65.93  8.59   68.50  2.23   44.83
  SUB         43.78  6.88   54.64  1.39   37.35
  NTemp       55.17  7.14   65.70  1.70   41.91
  NTemp+AR    59.80  7.56   65.01  1.95   38.75

On the E2E data, for example, we see in Table 1 that the SUB baseline, despite having fairly impressive performance for a non-parametric model, fares the worst. The neural HSMM models are largely competitive with the encoder-decoder system on the validation data, despite offering the benefits of interpretability and controllability; however, the gap increases on test.

Table 2 evaluates our system's performance on the test portion of the WikiBio dataset, comparing with the systems and baselines implemented by Lebret et al. (2016). Again for this dataset we see that their templated Kneser-Ney model underperforms on the automatic metrics, and that neural models improve on these results. Here the HSMMs are competitive with the best model of Lebret et al. (2016), and even outperform it on ROUGE. We emphasize, however, that recent, sophisticated approaches to encoder-decoder style database-to-text generation have since surpassed the results of Lebret et al. (2016) and our own, and we show the recent seq2seq style results of Liu et al. (2018), who use a somewhat larger model, at the bottom of Table 2.

Table 2: Top: comparison of the two best neural systems of Lebret et al. (2016), their templated baseline, and our HSMM models (denoted "NTemp" and "NTemp+AR" for the non-autoregressive and autoregressive versions, resp.) on the test portion of the WikiBio dataset. Models marked with a † are from Lebret et al. (2016), and following their methodology we use ROUGE-4. Bottom: state-of-the-art seq2seq-style results from Liu et al. (2018).

                              BLEU   NIST  ROUGE-4
  Template KN †               19.8   5.19  10.7
  NNLM (field) †              33.4   7.52  23.9
  NNLM (field & word) †       34.7   7.98  25.8
  NTemp                       34.2   7.94  35.9
  NTemp+AR                    34.8   7.59  38.6
  Seq2seq (Liu et al., 2018)  43.65  -     40.32

7.1 Qualitative Evaluation

We now qualitatively demonstrate that our generations are controllable and interpretable.

Controllable Diversity. One of the powerful aspects of the proposed approach to generation is that we can manipulate the template z^(i) while leaving the database x constant, which allows for easily controlling aspects of the generation. In Table 3 we show the generations produced by our model for five different neural template sequences z^(i), while fixing x. There, the segments in each generation are annotated with the latent states determined by the corresponding z^(i). We see that these templates can be used to affect the word-ordering, as well as which fields are mentioned in the generated text. Moreover, because the discrete states align with particular fields (see below), it is generally simple to automatically infer to which fields particular latent states correspond, allowing users to choose which template best meets their requirements. We emphasize that this level of controllability is much harder to obtain for encoder-decoder models, since, at best, a large amount of sampling would be required to avoid generating around a particular mode in the conditional distribution, and even then it would be difficult to control the sort of generations obtained.

Table 3: Impact of varying the template z^(i) for a single x from the E2E validation data; generations are annotated with the segmentations of the chosen z^(i). Results were obtained using the NTemp+AR model from Table 1. Source records: name[Travellers Rest Beefeater], customerRating[3 out of 5], area[riverside], near[Raja Indian Cuisine].

1. [Travellers Rest Beefeater]55 [is a]59 [3 star]43 [restaurant]11 [located near]25 [Raja Indian Cuisine]40 [.]53
2. [Near]31 [riverside]29 [,]44 [Travellers Rest Beefeater]55 [serves]3 [3 star]50 [food]1 [.]2
3. [Travellers Rest Beefeater]55 [is a]59 [restaurant]12 [providing]3 [riverside]50 [food]1 [and has a]17 [3 out of 5]26 [customer rating]16 [.]2 [It is]8 [near]25 [Raja Indian Cuisine]40 [.]53
4. [Travellers Rest Beefeater]55 [is a]59 [place to eat]12 [located near]25 [Raja Indian Cuisine]40 [.]53
5. [Travellers Rest Beefeater]55 [is a]59 [3 out of 5]5 [rated]32 [riverside]43 [restaurant]11 [near]25 [Raja Indian Cuisine]40 [.]53
Interpretable States. Discrete states also provide a method for interpreting the generations produced by the system, since each segment is explicitly typed by the current hidden state of the model. Table 4 shows the impact of varying the template z^(i) for a single x from the WikiBio dataset. While there is in general surprisingly little stylistic variation in the WikiBio data itself, there is variation in the information discussed, and the templates capture this. Moreover, we see that particular discrete states correspond in a consistent way to particular pieces of information, allowing us to align states with particular field types. For instance, birth names have the same hidden state (132), as do names (117), nationalities (82), birth dates (101), and occupations (20).

Table 4: Impact of varying the template z^(i) for a single x from the WikiBio validation data; generations are annotated with the segmentations of the chosen z^(i). Results were obtained using the NTemp model from Table 2. Source infobox (kenny warren): name: kenny warren, birth date: 1 april 1946, birth name: kenneth warren deutscher, birth place: brooklyn, new york, occupation: ventriloquist, comedian, author, notable work: book - the revival of ventriloquism in america.

1. [kenneth warren deutscher]132 [(]75 [born]89 [april 1, 1946]101 [)]67 [is an american]82 [author]20 [and]1 [ventriloquist and comedian]69 [.]88
2. [kenneth warren deutscher]132 [(]75 [born]89 [april 1, 1946]101 [)]67 [is an american]82 [author]20 [best known for his]95 [the revival of ventriloquism]96 [.]88
3. [kenneth warren]16 ["kenny" warren]117 [(]75 [born]89 [april 1, 1946]101 [)]67 [is an american]127 [ventriloquist, comedian]28 [.]133
4. [kenneth warren]16 ["kenny" warren]117 [(]75 [born]89 [april 1, 1946]101 [)]67 [is a]104 [new york]98 [author]20 [.]133
5. [kenneth warren deutscher]42 [is an american]82 [ventriloquist, comedian]118 [based in]15 [brooklyn, new york]84 [.]88

To demonstrate empirically that the learned states indeed align with field types, we calculate the average purity of the discrete states learned for both datasets in Table 5. In particular, for each discrete state for which the majority of its generated words appear in some r_j, the purity of a state's record type alignment is calculated as the percentage of the state's words that come from the most frequent record type the state represents. This calculation was carried out over training examples that belonged to one of the top 100 most frequent templates. Table 5 indicates that discrete states learned on the E2E data are quite pure. Discrete states learned on the WikiBio data are less pure, though still rather impressive given that there are approximately 1700 record types represented in the WikiBio data, and we limit the number of states to 135. Unsurprisingly, adding autoregressiveness to the model decreases purity on both datasets, since the model may rely on the autoregressive RNN for typing, in addition to the state's identity.

Table 5: Empirical analysis of the average purity of discrete states learned on the E2E and WikiBio datasets, for the NTemp and NTemp+AR models. Average purities are given as percents, and standard deviations follow in parentheses. See the text for a full description of this calculation.

            NTemp        NTemp+AR
  E2E       89.2 (17.4)  85.4 (18.6)
  WikiBio   43.2 (19.7)  39.9 (17.9)
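For concreteness, the purity computation can be sketched as follows (ours; the exact accounting in the paper may differ, e.g. in how words not traceable to any record are handled).

```python
from collections import Counter

def average_state_purity(state_to_words, word_to_type):
    """Schematic version of the purity statistic in Table 5.

    state_to_words: state k -> list of words k generated (over training examples
                    whose template is among the 100 most frequent).
    word_to_type:   copied word -> record type it came from; words that do not
                    appear in any record are absent from this dict.
    """
    purities = []
    for k, words in state_to_words.items():
        typed = [word_to_type[w] for w in words if w in word_to_type]
        if 2 * len(typed) <= len(words):
            continue  # keep only states whose words mostly appear in some record
        _, top_count = Counter(typed).most_common(1)[0]
        purities.append(100.0 * top_count / len(typed))
    return sum(purities) / len(purities) if purities else 0.0
```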
8 Conclusion and Future Work

We have developed a neural, template-like generation model based on an HSMM decoder, which can be learned tractably by backpropagating through a dynamic program. The method allows us to extract template-like latent objects in a principled way in the form of state sequences, and then generate with them. This approach scales to large-scale text datasets and is nearly competitive with encoder-decoder models. More importantly, this approach allows for controlling the diversity of generation and for producing interpretable states during generation. We view this work both as the first step towards learning discrete latent variable template models for more difficult generation tasks, as well as a different perspective on learning latent variable text models in general. Future work will examine encouraging the model to learn maximally different (or minimal) templates, which our objective does not explicitly encourage, templates of larger textual phenomena, such as paragraphs and documents, and hierarchical templates.

Acknowledgments

SW gratefully acknowledges the support of a Siebel Scholars award. AMR gratefully acknowledges the support of NSF CCF-1704834, Intel Research, and Amazon AWS Research grants.

References

Gabor Angeli, Percy Liang, and Dan Klein. 2010. A simple domain-independent probabilistic approach to generation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 502–512. Association for Computational Linguistics.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
Anja Belz. 2008. Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models. Natural Language Engineering, 14(04):431–455.
Anja Belz and Ehud Reiter. 2006. Comparing automatic and human evaluation of NLG systems. In 11th Conference of the European Chapter of the Association for Computational Linguistics.
Andrew Chisholm, Will Radford, and Ben Hachey. 2017. Learning to generate one-sentence biographies from Wikidata. CoRR, abs/1702.06235.
KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
Hanjun Dai, Bo Dai, Yan-Ming Zhang, Shuang Li, and Le Song. 2017. Recurrent hidden semi-markov model. In International Conference on Learning Representations.
Ondrej Dušek and Filip Jurcıcek. 2016. Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. In The 54th Annual Meeting of the Association for Computational Linguistics, page 45.
Mark JF Gales and Steve J Young. 1993. The theory of segmental hidden Markov models. University of Cambridge, Department of Engineering.
Albert Gatt and Ehud Reiter. 2009. SimpleNLG: A realisation engine for practical applications. In Proceedings of the 12th European Workshop on Natural Language Generation, pages 90–93. Association for Computational Linguistics.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In ACL.
Çaglar Gülçehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In ACL.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780.
Blake Howald, Ravikumar Kondadadi, and Frank Schilder. 2013. Domain adaptable semantic clustering in statistical NLG. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, pages 143–154.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In International Conference on Machine Learning, pages 1587–1596.
Po-Sen Huang, Chong Wang, Sitao Huang, Dengyong Zhou, and Li Deng. 2018. Towards neural phrase-based machine translation. In International Conference on Learning Representations.
Dan Jurafsky and James H Martin. 2014. Speech and Language Processing. Pearson London.
Ravi Kondadadi, Blake Howald, and Frank Schilder. 2013. A statistical NLG framework for aggregated planning and realization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1406–1415.
Lingpeng Kong, Chris Dyer, and Noah A Smith. 2016. Segmental recurrent neural networks. In International Conference on Learning Representations.
Ioannis Konstas and Mirella Lapata. 2013. A global model for concept-to-text generation. Journal of Artificial Intelligence Research (JAIR), 48:305–346.
Karen Kukich. 1983. Design of a knowledge-based report generator. In ACL, pages 145–150.
Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. In EMNLP, pages 1203–1213.
J. Li, R. Jia, H. He, and P. Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. In North American Association for Computational Linguistics (NAACL).
Percy Liang, Michael I Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In ACL, pages 91–99. Association for Computational Linguistics.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2018. Table-to-text generation by structure-aware seq2seq learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.
Liang Lu, Lingpeng Kong, Chris Dyer, Noah A Smith, and Steve Renals. 2016. Segmental recurrent neural networks for end-to-end speech recognition. In Interspeech 2016, pages 385–389.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pages 1412–1421.
Kathleen McKeown. 1992. Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Studies in Natural Language Processing. Cambridge University Press.
Susan W McRoy, Songsak Channarukul, and Syed S Ali. 2000. YAG: A template-based generator for real-time systems. In Proceedings of the First International Conference on Natural Language Generation, pages 264–267. Association for Computational Linguistics.
Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. In NAACL HLT, pages 720–730.
Kevin P Murphy. 2002. Hidden semi-markov models (HSMMs). Unpublished notes.
Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.
Jekaterina Novikova, Ondrej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Saarbrücken, Germany.
Mari Ostendorf, Vassilios V Digalakis, and Owen A Kimball. 1996. From HMM's to segment models: A unified view of stochastic modeling for speech recognition. IEEE Transactions on Speech and Audio Processing, 4(5):360–378.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop.
Lawrence R Rabiner. 1989. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
Ehud Reiter and Robert Dale. 1997. Building applied natural language generation systems. Natural Language Engineering, 3(1):57–87.
Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6833–6844.
Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2018. Neural language modeling by jointly learning syntax and lexicon. In International Conference on Learning Representations.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.
Hao Tang, Weiran Wang, Kevin Gimpel, and Karen Livescu. 2016. End-to-end training approaches for discriminative segmental models. In 2016 IEEE Spoken Language Technology Workshop (SLT), pages 496–502.
Ke M Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. 2016. Unsupervised neural hidden markov models. In Workshop on Structured Prediction for NLP, pages 63–71.
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.
Chong Wang, Yining Wang, Po-Sen Huang, Abdelrahman Mohamed, Dengyong Zhou, and Li Deng. 2017. Sequence modeling via segmentations. In International Conference on Machine Learning, pages 3674–3683.
Lu Wang and Claire Cardie. 2013. Domain-independent abstract generation for focused meeting summarization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1395–1405.
Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263.
Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. Breaking the softmax bottleneck: A high-rank RNN language model. In International Conference on Learning Representations.
Zichao Yang, Phil Blunsom, Chris Dyer, and Wang Ling. 2016. Reference-aware language models. CoRR, abs/1611.01628.
Lei Yu, Jan Buys, and Phil Blunsom. 2016. Online segment to segment neural transduction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1307–1316.
Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, and Yann LeCun. 2018. Adversarially regularized autoencoders. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), pages 5897–5906.

A Supplemental Material

A.1 Additional Model and Training Details

Computing r_j. A record r_j is represented by embedding a feature for its type, its position, and its word value in R^d, and applying an MLP with a ReLU nonlinearity (Nair and Hinton, 2010) to form r_j ∈ R^d, similar to Yang et al. (2016) and Wiseman et al. (2017).

LSTM Details. The initial cell and hidden-state values for the decoder LSTM are given by Q_1 x_a and tanh(Q_2 x_a), respectively, where Q_1, Q_2 ∈ R^{d×d}. When a word y_t appears in a record r_j, the input to the LSTM at time t+1 is computed using an MLP with a ReLU nonlinearity over the concatenation of the embeddings for r_j's record type, word value, position, and a feature for whether it is the final position for the type. If there are multiple r_j from which y_t could have been copied, the computed representations are averaged. At test time, we use the MAP r_j to compute the input, even if there are multiple matches. For y_t which could not have been copied, the input to the LSTM at time t+1 is computed using the same MLP over y_t and three dummy features.
For the autoregressive HSMM, an additional 1-layer LSTM with d hidden units is used. We experimented with having the autoregressive HSMM consume either the tokens y_{1:t} in predicting y_{t+1}, or the average embedding of the field types corresponding to copied tokens in y_{1:t}. The former worked slightly better for the WikiBio dataset (where field types are more ambiguous), while the latter worked slightly better for the E2E dataset.

Transition Distribution. The function C(x_u), which produces hidden state embeddings conditional on the source, is defined as C(x_u) = U_2(ReLU(U_1 x_u)), where U_1 ∈ R^{m3×d} and U_2 ∈ R^{K×m2×m3}; D(x_u) is defined analogously. For all experiments, m1 = 64, m2 = 32, and m3 = 64.

Optimization. We train with SGD, using a learning rate of 0.5 and decaying by 0.5 each epoch after the first epoch in which validation log-likelihood fails to increase. When using an autoregressive HSMM, the additional LSTM is optimized only after the learning rate has been decayed. We regularize with Dropout (Srivastava et al., 2014).

A.2 Additional Learned Templates

In Tables 6 and 7 we show visualizations of additional templates learned on the E2E and WikiBio data, respectively, by both the non-autoregressive and autoregressive HSMM models presented in the paper. For each model, we select a set of five dissimilar templates in an iterative way by greedily selecting the next template (out of the 200 most frequent) that has the highest percentage of states that do not appear in the previously selected templates; ties are broken randomly. Individual states within a template are visualized using the three most common segments they generate.

Table 6: Five templates extracted from the E2E data with the NTemp model (top) and the NTemp+AR model (bottom).
Table 7: Five templates extracted from the WikiBio data with the NTemp model (top) and the NTemp+AR model (bottom).