The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Data-to-Text Generation with Content Selection and Planning

Ratish Puduppully, Li Dong, Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
[email protected], [email protected], [email protected]

Abstract

Recent advances in data-to-text generation have led to the use of large-scale datasets and neural network models which are trained end-to-end, without explicitly modeling what to say and in what order. In this work, we present a neural network architecture which incorporates content selection and planning without sacrificing end-to-end training. We decompose the generation task into two stages. Given a corpus of data records (paired with descriptive documents), we first generate a content plan highlighting which information should be mentioned and in which order, and then generate the document while taking the content plan into account. Automatic and human-based evaluation experiments show that our model¹ outperforms strong baselines, improving the state-of-the-art on the recently released ROTOWIRE dataset.

[Figure 1: Diagram of our approach. (Panels: record table → Content Selection Gate → Content Planning Network → Text Decoder → game summary excerpt.)]

1 Introduction

Data-to-text generation broadly refers to the task of automatically producing text from non-linguistic input (Reiter and Dale 2000; Gatt and Krahmer 2018). The input may be in various forms including databases of records, spreadsheets, expert system knowledge bases, simulations of physical systems, and so on. Table 1 shows an example in the form of a database containing statistics on NBA basketball games, and a corresponding game summary.

Traditional methods for data-to-text generation (Kukich 1983; McKeown 1992) implement a pipeline of modules including content planning (selecting specific content from some input and determining the structure of the output text), sentence planning (determining the structure and lexical content of each sentence) and surface realization (converting the sentence plan to a surface string). Recent neural generation systems (Lebret et al. 2016; Mei et al. 2016; Wiseman et al. 2017) do not explicitly model any of these stages; rather, they are trained in an end-to-end fashion using the very successful encoder-decoder architecture (Bahdanau et al. 2015) as their backbone.

Despite producing overall fluent text, neural systems have difficulty capturing long-term structure and generating documents more than a few sentences long. Wiseman et al. (2017) show that neural text generation techniques perform poorly at content selection; they struggle to maintain inter-sentential coherence, and more generally a reasonable ordering of the selected facts in the output text. Additional challenges include avoiding redundancy and being faithful to the input. Interestingly, comparisons against template-based methods show that neural techniques do not fare well on metrics of content selection recall and factual output generation (i.e., they often hallucinate statements which are not supported by facts in the database).

In this paper, we address these shortcomings by explicitly modeling content selection and planning within a neural data-to-text architecture. Our model learns a content plan from the input and conditions on the content plan in order to generate the output document (see Figure 1 for an illustration). An explicit content planning mechanism has at least three advantages for multi-sentence document generation: it represents a high-level organization of the document structure, allowing the decoder to concentrate on the easier tasks of sentence planning and surface realization; it makes the process of data-to-document generation more interpretable by generating an intermediate representation; and it reduces redundancy in the output, since it is less likely for the content plan to contain the same information in multiple places.

We train our model end-to-end using neural networks and evaluate its performance on ROTOWIRE (Wiseman et al. 2017), a recently released dataset which contains statistics of NBA basketball games paired with human-written summaries (see Table 1). Automatic and human evaluation shows that modeling content selection and planning improves generation considerably over competitive baselines.

Copyright (c) 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

¹Our code is publicly available at https://github.com/ratishsp/data2text-plan-py.

TEAM      WIN  LOSS  PTS  FG_PCT  RB  AST  ...
Pacers     4    6     99    42    40   17  ...
Celtics    5    4    105    44    47   22  ...

PLAYER         H/V  AST  RB  PTS  FG  CITY     ...
Jeff Teague     H    4    3   20   4  Indiana  ...
Miles Turner    H    1    8   17   6  Indiana  ...
Isaiah Thomas   V    5    0   23   4  Boston   ...
Kelly Olynyk    V    4    6   16   6  Boston   ...
Amir Johnson    V    3    9   14   4  Boston   ...
...

PTS: points, FT_PCT: free-throw percentage, RB: rebounds, AST: assists, H/V: home or visiting, FG: field goals, CITY: player team city.

The Boston Celtics defeated the host Indiana Pacers 105-99 at Bankers Life Fieldhouse on Saturday. In a battle between two injury-riddled teams, the Celtics were able to prevail with a much needed road victory. The key was shooting and defense, as the Celtics outshot the Pacers from the field, from three-point range and from the free-throw line. Boston also held Indiana to 42 percent from the field and 22 percent from long distance. The Celtics also won the rebounding and assisting differentials, while tying the Pacers in turnovers. There were 10 ties and 10 lead changes, as this game went down to the final seconds. Boston (5-4) has had to deal with a gluttony of injuries, but they had the fortunate task of playing a team just as injured here. Isaiah Thomas led the team in scoring, totaling 23 points and five assists on 4-of-13 shooting. He got most of those points by going 14-of-15 from the free-throw line. Kelly Olynyk got a rare start and finished second on the team with his 16 points, six rebounds and four assists.

Table 1: Example of data-records and document summary. Entities and values corresponding to the plan in Table 2 are boldfaced.

2 Related Work

The generation literature provides multiple examples of content selection components developed for various domains which are either hand-built (Kukich 1983; McKeown 1992; Reiter and Dale 1997; Duboue and McKeown 2003) or learned from data (Barzilay and Lapata 2005; Duboue and McKeown 2001; 2003; Liang et al. 2009; Angeli et al. 2010; Kim and Mooney 2010; Konstas and Lapata 2013). Likewise, creating summaries of sports games has been a topic of interest since the early beginnings of generation systems (Robin 1994; Tanaka-Ishii et al. 1998).

Earlier work on content planning has relied on generic planners (Dale 1988), on Rhetorical Structure Theory (Hovy 1993) and on schemas (McKeown et al. 1997). Such content planners are defined by analysing target texts and devising hand-crafted rules. Duboue and McKeown (2001) study ordering constraints for content plans and in follow-on work (Duboue and McKeown 2002) learn a content planner from an aligned corpus of inputs and human outputs. A few researchers (Mellish et al. 1998; Karamanis 2004) select content plans according to a ranking function.

More recent work focuses on end-to-end systems instead of individual components. However, most models make simplifying assumptions such as generation without any content selection or planning (Belz 2008; Wong and Mooney 2007) or content selection without planning (Konstas and Lapata 2012; Angeli et al. 2010; Kim and Mooney 2010). An exception is Konstas and Lapata (2013), who incorporate content plans represented as grammar rules operating on the document level. Their approach works reasonably well with weather forecasts, but does not scale easily to larger databases with richer vocabularies and longer text descriptions. The model relies on the EM algorithm (Dempster, Laird, and Rubin 1977) to learn the weights of the grammar rules, which can be very numerous even when tokens are aligned to database records as a preprocessing step.

Our work is closest to recent neural network models which learn generators from data and accompanying text resources. Most previous approaches generate from Wikipedia infoboxes, focusing either on single sentences (Lebret et al. 2016; Chisholm et al. 2017; Sha et al. 2017; Liu et al. 2017) or short texts (Perez-Beltrachini and Lapata 2018). Mei et al. (2016) use a neural encoder-decoder model to generate weather forecasts and soccer commentaries, while Wiseman et al. (2017) generate NBA game summaries (see Table 1). They introduce a new dataset for data-to-document generation which is sufficiently large for neural network training and adequately challenging for testing the capabilities of document-scale text generation (e.g., the average summary length is 330 words and the average number of input records is 628). Moreover, they propose various automatic evaluation measures for assessing the quality of system output. Our model follows on from Wiseman et al. (2017), addressing the challenges for data-to-text generation identified in their work. We are not aware of any previous neural network-based approaches which incorporate content selection and planning mechanisms and generate multi-sentence documents. Perez-Beltrachini and Lapata (2018) introduce a content selection component (based on multi-instance learning) without content planning, while Liu et al. (2017) propose a sentence planning mechanism which orders the contents of a Wikipedia infobox so as to generate a single sentence.

3 Problem Formulation

The input to our model is a table of records (see Table 1, left-hand side). Each record r_j has four features including its type (r_{j,1}; e.g., LOSS, CITY), entity (r_{j,2}; e.g., Pacers, Miles Turner), value (r_{j,3}; e.g., 11, Indiana), and whether a player is on the home or away team (r_{j,4}; see column H/V in Table 1), represented as {r_{j,k}} for k = 1..4. The output y is a document containing words y = y_1 ... y_|y| where |y| is the document length. Figure 2 shows the overall architecture of our model, which consists of two stages: (a) content selection and planning operates on the input records of a database and produces a content plan specifying which records are to be verbalized in the document and in which order (see Table 2); and (b) text generation produces the output text given the content plan as input; at each decoding step, the generation model attends over vector representations of the records in the content plan.

Let r = {r_j} (j = 1..|r|) denote a table of input records and y the output text. We model p(y|r) as the joint probability of text y and content plan z, given input r. We further decompose p(y, z|r) into p(z|r), a content selection and planning phase, and p(y|r, z), a text generation phase:

    p(y|r) = Σ_z p(y, z|r) = Σ_z p(z|r) p(y|r, z)

In the following we explain how the components p(z|r) and p(y|r, z) are estimated.
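To make the formulation concrete, here is a minimal Python sketch (our own illustration, not from the released code) of how an input record with its four features and a content plan over record indices can be represented:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    """One input record r_j with its four features r_{j,1..4}."""
    type: str       # r_{j,1}, e.g. "TEAM-PTS" or "PTS"
    entity: str     # r_{j,2}, e.g. "Celtics" or "Isaiah Thomas"
    value: str      # r_{j,3}, e.g. "105" or "Boston"
    home_away: str  # r_{j,4}, "H" or "V"

# A few records from the box score in Table 1:
table = [
    Record("TEAM-PTS", "Celtics", "105", "V"),
    Record("TEAM-PTS", "Pacers", "99", "H"),
    Record("PTS", "Isaiah Thomas", "23", "V"),
    Record("AST", "Isaiah Thomas", "5", "V"),
]

# A content plan z is a sequence of indices into `table`;
# the model factorizes p(y, z | r) = p(z | r) * p(y | r, z).
plan = [0, 1, 2, 3]
```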

[Figure 2: Generation model with content selection and planning; the content selection gate is illustrated in Figure 3. (Panels: a content selection and planning encoder-decoder over records r_1 ... r_|r|, encoded plan vectors e_1, e_2, ..., and a text decoder with generation and copy probabilities p_gen and p_copy.)]

[Figure 3: Content selection mechanism.]

Record Encoder

The input to our model is a table of unordered records, each represented as features {r_{j,k}} for k = 1..4. Following previous work (Yang et al. 2017; Wiseman et al. 2017), we embed features into vectors, and then use a multilayer perceptron to obtain a vector representation r_j for each record:

    r_j = ReLU(W_r [r_{j,1}; r_{j,2}; r_{j,3}; r_{j,4}] + b_r)

where [;] indicates vector concatenation, W_r ∈ R^{n×4n} and b_r ∈ R^n are parameters, and ReLU is the rectifier activation function.
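A minimal PyTorch sketch of this record encoder follows; the single shared embedding table and the concrete dimensions are our own assumptions, not necessarily the paper's exact setup:

```python
import torch
import torch.nn as nn

class RecordEncoder(nn.Module):
    """r_j = ReLU(W_r [r_j1; r_j2; r_j3; r_j4] + b_r)."""
    def __init__(self, vocab_size: int, n: int):
        super().__init__()
        # One shared embedding table for type/entity/value/H-V ids
        # (separate tables per feature would also be reasonable).
        self.embed = nn.Embedding(vocab_size, n)
        self.proj = nn.Linear(4 * n, n)  # W_r, b_r

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_records, 4) integer feature ids
        e = self.embed(feats)                 # (num_records, 4, n)
        concat = e.flatten(start_dim=1)       # (num_records, 4n)
        return torch.relu(self.proj(concat))  # (num_records, n)

enc = RecordEncoder(vocab_size=1000, n=600)
r = enc(torch.randint(0, 1000, (628, 4)))  # 628 records, as in ROTOWIRE
```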

Content Selection Gate

The context of a record can be useful in determining its importance vis-a-vis other records in the table. For example, if a player scores many points, it is likely that other meaningfully related records such as field goals, three-pointers, or rebounds will be mentioned in the output summary. To better capture such dependencies among records, we make use of the content selection gate mechanism shown in Figure 3. We first compute attention scores α_{j,k} over the input table and use them to obtain an attentional vector r_j^att for each record r_j:

    α_{j,k} ∝ exp(r_j^T W_a r_k)
    r_j^att = Σ_{k≠j} α_{j,k} r_k

where W_a ∈ R^{n×n} is a parameter matrix. We next apply the content selection gating mechanism to r_j, and obtain the new record representation r_j^cs via:

    g_j = sigmoid(W_g [r_j; r_j^att] + b_g)
    r_j^cs = g_j ⊙ r_j

where ⊙ denotes element-wise multiplication, and gate g_j ∈ [0, 1]^n controls the amount of information flowing from r_j. In other words, each element in r_j is weighed by the corresponding element of the content selection gate g_j.
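The gate can be sketched as follows, again in PyTorch; masking the self-attention diagonal reflects the k ≠ j sum above, and the exact wiring is our assumption:

```python
import torch
import torch.nn as nn

class ContentSelectionGate(nn.Module):
    """r_j^cs = g_j * r_j with g_j = sigmoid(W_g [r_j; r_j^att] + b_g)."""
    def __init__(self, n: int):
        super().__init__()
        self.w_a = nn.Linear(n, n, bias=False)  # W_a (bilinear attention)
        self.w_g = nn.Linear(2 * n, n)          # W_g, b_g

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        # r: (num_records, n)
        scores = r @ self.w_a(r).t()                   # score(j, k) = r_j^T W_a r_k
        mask = torch.eye(r.size(0), dtype=torch.bool)  # exclude k = j
        alpha = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
        r_att = alpha @ r                              # r_j^att = sum_{k != j} alpha_jk r_k
        g = torch.sigmoid(self.w_g(torch.cat([r, r_att], dim=-1)))
        return g * r                                   # element-wise gating
```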

Content Planning

In our generation task, the output text is long but follows a canonical structure. Game summaries typically begin by discussing which team won/lost, follow with various statistics involving individual players and their teams (e.g., who performed exceptionally well or under-performed), and finish with any upcoming games. We hypothesize that generation would benefit from an explicit plan specifying both what to say and in which order. Our model learns such content plans from training data. However, notice that ROTOWIRE (see Table 1) and most similar data-to-text datasets do not naturally contain content plans. Fortunately, we can obtain these relatively straightforwardly following an information extraction approach (which we explain in Section 4). Suffice it to say that plans are extracted by mapping the text in the summaries onto entities in the input table, their values, and types (i.e., relations). A plan is a sequence of pointers, with each entry pointing to an input record from {r_j} (j = 1..|r|). An excerpt of a plan is shown in Table 2. The order in the plan corresponds to the sequence in which entities appear in the game summary. Let z = z_1 ... z_|z| denote the content planning sequence. Each z_k points to an input record, i.e., z_k ∈ {r_j}. Given the input records, the probability p(z|r) is decomposed as:

    p(z|r) = Π_{k=1}^{|z|} p(z_k | z_{<k}, r)

where z_{<k} = z_1 ... z_{k-1}. The plan is decoded with a pointer network (Vinyals et al. 2015) which takes the content-selected record vectors as input. The first hidden state of the decoder is initialized by avg({r_j^cs}), i.e., the average of record vectors. At decoding step k, let h_k be the hidden state of the LSTM. We model p(z_k = r_j | z_{<k}, r) as the attention over input records:

    p(z_k = r_j | z_{<k}, r) ∝ exp(h_k^T W_c r_j^cs)

Value     Entity         Type          H/V
Boston    Celtics        TEAM-CITY     V
Celtics   Celtics        TEAM-NAME     V
105       Celtics        TEAM-PTS      V
Indiana   Pacers         TEAM-CITY     H
Pacers    Pacers         TEAM-NAME     H
99        Pacers         TEAM-PTS      H
42        Pacers         TEAM-FG_PCT   H
...
Isaiah    Isaiah Thomas  FIRST_NAME    V
Thomas    Isaiah Thomas  SECOND_NAME   V
...

Table 2: Content plan for the example in Table 1.
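A sketch of the plan decoder described above, with greedy decoding in place of the beam search the paper uses at inference time (the choice of initial LSTM input is our assumption):

```python
import torch
import torch.nn as nn

class ContentPlanner(nn.Module):
    """Pointer-style plan decoder: p(z_k = r_j | z_<k, r) ∝ exp(h_k^T W_c r_j^cs)."""
    def __init__(self, n: int):
        super().__init__()
        self.lstm = nn.LSTMCell(n, n)
        self.ptr = nn.Linear(n, n, bias=False)  # W_c

    def forward(self, r_cs: torch.Tensor, steps: int):
        # r_cs: (num_records, n); first hidden state = average of record vectors
        h = r_cs.mean(dim=0)
        c = torch.zeros_like(h)
        inp, plan = h, []  # feeding h as the first input is our assumption
        for _ in range(steps):
            h, c = self.lstm(inp.unsqueeze(0), (h.unsqueeze(0), c.unsqueeze(0)))
            h, c = h.squeeze(0), c.squeeze(0)
            logits = self.ptr(r_cs) @ h  # one score per input record
            j = int(logits.argmax())     # greedy; the paper uses beam search
            plan.append(j)
            inp = r_cs[j]                # feed the pointed record back in
        return plan
```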

Text Generation

The probability of output text y conditioned on content plan z and input table r is decomposed as p(y|r, z) = Π_{t=1}^{|y|} p(y_t | y_{<t}, z, r). We adopt an encoder-decoder architecture with attention: the content plan z is encoded into a sequence of vectors {e_k} over which the text decoder attends. Let d_t be the hidden state of the decoder at time step t; the probability of generating y_t from the output vocabulary is computed via:

    β_{t,k} ∝ exp(d_t^T W_b e_k)                          (1)
    q_t = Σ_k β_{t,k} e_k
    d_t^att = tanh(W_d [d_t; q_t])
    p_gen(y_t | y_{<t}, z, r) = softmax_{y_t}(W_y d_t^att + b_y)

where W_b, W_d, W_y and b_y are parameters. Our decoder is further equipped with a copy mechanism, allowing it to copy values directly from the records in the content plan. We introduce a binary variable u_t ∈ {0, 1} for each time step, indicating whether y_t is copied from a record value (with probability p_copy) or generated from the vocabulary (with probability p_gen); the output probability is

    p(y_t | y_{<t}, z, r) = Σ_{u_t ∈ {0,1}} p(y_t, u_t | y_{<t}, z, r)

where u_t is marginalized out. We experiment with two copy variants.

Joint Copy: The probability of copying from record values and generating from the vocabulary is globally normalized (Gu et al. 2016).

Conditional Copy: The switch u_t is predicted first and then used to select between p_copy and p_gen (Gulcehre et al. 2016).
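One decoding step of the text generator might look as follows; reusing the plan attention β as the copy distribution and the single-layer switch are our simplifications, not necessarily the paper's exact design:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One text-generation step: attention over the encoded plan {e_k},
    a vocabulary distribution p_gen, and a copy switch p(u_t = 1)."""
    def __init__(self, n: int, vocab_size: int):
        super().__init__()
        self.w_b = nn.Linear(n, n, bias=False)  # W_b in Eq. (1)
        self.w_d = nn.Linear(2 * n, n)          # W_d
        self.out = nn.Linear(n, vocab_size)     # W_y, b_y
        self.switch = nn.Linear(n, 1)           # copy/generate gate (an assumption)

    def forward(self, d_t: torch.Tensor, e: torch.Tensor):
        # d_t: (n,) decoder hidden state; e: (plan_len, n) plan encodings
        beta = torch.softmax(self.w_b(e) @ d_t, dim=0)       # Eq. (1)
        q_t = beta @ e                                       # context vector
        d_att = torch.tanh(self.w_d(torch.cat([d_t, q_t])))  # attentional state
        p_gen = torch.softmax(self.out(d_att), dim=-1)       # vocabulary dist.
        p_copy = beta  # copy dist. over plan records (our simplification)
        p_u = torch.sigmoid(self.switch(d_att))              # p(u_t = 1)
        # Conditional copy combines these as p_u * p_copy (mapped to tokens)
        # + (1 - p_u) * p_gen.
        return p_gen, p_copy, p_u
```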

Training and Inference

Our model is trained to maximize the log-likelihood of the gold³ content plan given table records r, and of the gold output text given the content plan and table records:

    max Σ_{(r,z,y)∈D} log p(z|r) + log p(y|r, z)

where D represents training examples (input records, plans, and game summaries). During inference, the output for input r is predicted by:

    ẑ = arg max_{z'} p(z'|r)
    ŷ = arg max_{y'} p(y'|r, ẑ)

where z' and y' represent content plan and output text candidates, respectively. For each stage, we utilize beam search to approximately obtain the best results.

³Strictly speaking, the content plan is not gold since it was not created by an expert but is the output of a fairly accurate IE system.
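Assuming a model object that exposes the two stages through (purely hypothetical) log-likelihood and search methods, the training objective and two-stage inference above reduce to:

```python
def two_stage_loss(model, r, z_gold, y_gold):
    # -log p(z|r) - log p(y|r, z): the two terms of the training objective.
    # `log_p_plan` / `log_p_text` are hypothetical method names.
    return -model.log_p_plan(z_gold, r) - model.log_p_text(y_gold, r, z_gold)

def two_stage_inference(model, r, beam_size=5):
    # Stage 1: search for the best plan; stage 2: best text given that plan.
    # `beam_search_plan` / `beam_search_text` are hypothetical method names.
    z_hat = model.beam_search_plan(r, beam_size)
    y_hat = model.beam_search_text(r, z_hat, beam_size)
    return z_hat, y_hat
```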

4 Experimental Setup

Data: We trained and evaluated our model on ROTOWIRE (Wiseman et al. 2017), a dataset of basketball game summaries paired with corresponding box- and line-score tables. The summaries are professionally written, relatively well structured and long (337 words on average). The number of record types is 39, the average number of records is 628, the vocabulary size is 11.3K words and the token count is 1.6M. The dataset is ideally suited for document-scale generation. We followed the data partitions introduced in Wiseman et al. (2017): we trained on 3,398 summaries, tested on 728, and used 727 for validation.

Content Plan Extraction: We extracted content plans from the ROTOWIRE game summaries following an information extraction (IE) approach. Specifically, we used the IE system introduced in Wiseman et al. (2017), which identifies candidate entity (i.e., player, team, and city) and value (i.e., number or string) pairs that appear in the text, and then predicts the type (aka relation) of each candidate pair. For instance, in the document in Table 1, the IE system might identify the pair "Jeff Teague, 20" and then predict that their relation is "PTS", extracting the record (Jeff Teague, 20, PTS). Wiseman et al. (2017) train an IE system on ROTOWIRE by determining word spans which could represent entities (i.e., by matching them against players, teams or cities in the database) and numbers. They then consider each entity-number pair in the same sentence, and if there is a record in the database with matching entities and values, the pair is assigned the corresponding record type or otherwise given the label "none" to indicate unrelated pairs.

We adopted their IE system architecture, which predicts relations by ensembling 3 convolutional models and 3 bidirectional LSTM models. We trained this system on the training portion of the ROTOWIRE corpus.⁴ On held-out data it achieved 94% accuracy, and recalled approximately 80% of the relations licensed by the records. Given the output of the IE system, a content plan simply consists of (entity, value, record type, H/V) tuples in their order of appearance in a game summary (the content plan for the summary in Table 1 is shown in Table 2). Player names are pre-processed to indicate the individual's first name and surname (see Isaiah and Thomas in Table 2); team records are also pre-processed to indicate the name of the team's city and the team itself (see Boston and Celtics in Table 2).

⁴A bug in the code of Wiseman et al. (2017) excluded number words from the output summary. We corrected the bug and this resulted in greater recall for the relations extracted from the summaries. See the supplementary material for more details.
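A minimal sketch of this plan-assembly step; the dictionary fields (including the text positions) are invented for illustration:

```python
def build_content_plan(ie_tuples):
    """Assemble a content plan from IE output: (entity, value, record type, H/V)
    tuples ordered by where the relation was found in the game summary.
    The 'position' field standing for the match offset is our own illustration."""
    ordered = sorted(ie_tuples, key=lambda t: t["position"])
    return [(t["entity"], t["value"], t["type"], t["home_away"]) for t in ordered]

# Invented positions for the opening of the summary in Table 1:
ie_tuples = [
    {"position": 45, "entity": "Pacers",  "value": "99",      "type": "TEAM-PTS",  "home_away": "H"},
    {"position": 4,  "entity": "Celtics", "value": "Boston",  "type": "TEAM-CITY", "home_away": "V"},
    {"position": 11, "entity": "Celtics", "value": "Celtics", "type": "TEAM-NAME", "home_away": "V"},
]
print(build_content_plan(ie_tuples))
# [('Celtics', 'Boston', 'TEAM-CITY', 'V'),
#  ('Celtics', 'Celtics', 'TEAM-NAME', 'V'),
#  ('Pacers', '99', 'TEAM-PTS', 'H')]
```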
Training Configuration: We validated model hyperparameters on the development set. We did not tune the dimensions of word embeddings and LSTM hidden layers; we used the same value of 600 reported in Wiseman et al. (2017). We used one-layer pointer networks during content planning, and two-layer LSTMs during text generation. Input feeding (Luong et al. 2015) was employed for the text decoder. We applied dropout (Zaremba et al. 2014) at a rate of 0.3. Models were trained for 25 epochs with the Adagrad optimizer (Duchi et al. 2011); the initial learning rate was 0.15, learning rate decay was selected from {0.5, 0.97}, and batch size was 5. For text decoding, we made use of BPTT (Mikolov et al. 2010) and set the truncation size to 100. We set the beam size to 5 during inference. All models are implemented in OpenNMT-py (Klein et al. 2017).

5 Results

Automatic Evaluation: We evaluated model output using the metrics defined in Wiseman et al. (2017). The idea is to employ a fairly accurate IE system (see the description in Section 4) on the gold and automatic summaries and compare whether the identified relations align or diverge. Let ŷ be the gold output, and y the system output. Content selection (CS) measures how well (in terms of precision and recall) the records extracted from y match those found in ŷ. Relation generation (RG) measures the factuality of the generation system as the proportion of records extracted from y which are also found in r (in terms of precision and number of unique relations). Content ordering (CO) measures how well the system orders the records it has chosen and is computed as the normalized Damerau-Levenshtein distance between the sequences of records extracted from y and ŷ. In addition to these metrics, we report BLEU (Papineni et al. 2002), with human-written game summaries as reference.
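For concreteness, the CO metric can be computed with the standard Damerau-Levenshtein recurrence over record sequences; the normalization shown is one common choice and may differ from the official evaluation scripts:

```python
def dld(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def content_ordering(pred_records, gold_records):
    """Normalized DLD between two record sequences (higher is better)."""
    dist = dld(pred_records, gold_records)
    return 1.0 - dist / max(len(pred_records), len(gold_records), 1)
```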

Our results on the development set are summarized in Table 3. We compare our Neural Content Planning model (NCP for short) against the two encoder-decoder (ED) models presented in Wiseman et al. (2017) with joint copy (JC) and conditional copy (CC), respectively. In addition to our own re-implementation of these models, we include the best scores reported in Wiseman et al. (2017), which were obtained with an encoder-decoder model enhanced with conditional copy (WS-2017). Table 3 also shows results when NCP uses oracle content plans (OR) as input. In addition, we report the performance of a template-based generator (Wiseman et al. 2017) which creates a document consisting of eight template sentences: an introductory sentence (who won/lost), six player-specific sentences (based on the six highest-scoring players in the game), and a conclusion sentence.

Model     RG #   RG P%   CS P%   CS R%   CO DLD%   BLEU
TEMPL     54.29  99.92   26.61   59.16   14.42      8.51
WS-2017   23.95  75.10   28.11   35.86   15.33     14.57
ED+JC     22.98  76.07   27.70   33.29   14.36     13.22
ED+CC     21.94  75.08   27.96   32.71   15.03     13.31
NCP+JC    33.37  87.40   32.20   48.56   17.98     14.92
NCP+CC    33.88  87.51   33.52   51.21   18.57     16.19
NCP+OR    21.59  89.21   88.52   85.84   78.51     24.11

Table 3: Automatic evaluation on ROTOWIRE development set using relation generation (RG) count (#) and precision (P%), content selection (CS) precision (P%) and recall (R%), content ordering (CO) in normalized Damerau-Levenshtein distance (DLD%), and BLEU.

Model     RG #   RG P%   CS P%   CS R%   CO DLD%   BLEU
ED+CC     21.94  75.08   27.96   32.71   15.03     13.31
CS+CC     24.93  80.55   28.63   35.23   15.12     13.52
CP+CC     33.73  84.85   29.57   44.72   15.84     14.45
NCP+CC    33.88  87.51   33.52   51.21   18.57     16.19
NCP       34.46    —     38.00   53.72   20.27       —

Table 4: Ablation results on ROTOWIRE development set using relation generation (RG) count (#) and precision (P%), content selection (CS) precision (P%) and recall (R%), content ordering (CO) in normalized Damerau-Levenshtein distance (DLD%), and BLEU.

Model     RG #   RG P%   CS P%   CS R%   CO DLD%   BLEU
TEMPL     54.23  99.94   26.99   58.16   14.92      8.46
WS-2017   23.72  74.80   29.49   36.18   15.42     14.19
NCP+JC    34.09  87.19   32.02   47.29   17.15     14.89
NCP+CC    34.28  87.47   34.18   51.22   18.58     16.50

Table 5: Automatic evaluation on ROTOWIRE test set using relation generation (RG) count (#) and precision (P%), content selection (CS) precision (P%) and recall (R%), content ordering (CO) in normalized Damerau-Levenshtein distance (DLD%), and BLEU.

As can be seen, NCP improves upon vanilla encoder-decoder models (ED+JC, ED+CC), irrespective of the copy mechanism being employed. In fact, NCP achieves comparable scores with either joint or conditional copy mechanism, which indicates that it is the content planner which brings performance improvements. Overall, NCP+CC achieves the best content selection and content ordering scores, as well as the highest BLEU. Compared to the best reported system in Wiseman et al. (2017), we achieve an absolute improvement of approximately 12% in terms of relation generation; content selection precision also improves by 5% and recall by 15%, content ordering increases by 3%, and BLEU by 1.5 points. The results of the oracle system (NCP+OR) show that content selection and ordering do indeed correlate with the quality of the content plan and that any improvements in our planning component would result in better output. As far as the template-based system is concerned, we observe that it obtains low BLEU and CS precision but scores high on CS recall and RG metrics. This is not surprising, as the template system is provided with domain knowledge which our model does not have, and thus represents an upper bound on content selection and relation generation. We also measured the degree to which the game summaries generated by our model contain redundant information, as the proportion of non-duplicate records extracted from the summary by the IE system: 84.5% of the records in NCP+CC are non-duplicates, compared to Wiseman et al. (2017) who obtain 72.9%, showing that our model is less repetitive.

We further conducted an ablation study with the conditional copy variant of our model (NCP+CC) to establish whether improvements are due to better content selection (CS) and/or content planning (CP). We see in Table 4 that content selection and planning individually contribute to performance improvements over the baseline (ED+CC), and accuracy further increases when both components are taken into account. In addition, we evaluated these components on their own (independently of text generation) by comparing the output of the planner (see the p(z|r) block in Figure 2) against gold content plans obtained using the IE system (see row NCP in Table 4). Compared to the full system (NCP+CC), content selection precision and recall are higher (by 4.5% and 2%, respectively), as is content ordering (by 1.8%). In another study, we used the CS and CO metrics to measure how well the generated text follows the content plan produced by the planner (instead of arbitrarily adding or removing information).
We found that NCP+CC generates game summaries which follow the content plan closely: CS precision is higher than 85%, CS recall is higher than 93%, and CO higher than 84%. This reinforces our claim that higher accuracies in the content selection and planning phase will result in further improvements in text generation.

The test set results in Table 5 follow a pattern similar to the development set. NCP achieves higher accuracy in all metrics, including relation generation, content selection, content ordering, and BLEU, compared to Wiseman et al. (2017). We provide examples of system output in Figure 4 and the supplementary material.

Human-Based Evaluation: We conducted two human evaluation experiments using the Amazon Mechanical Turk (AMT) crowdsourcing platform. The first study assessed relation generation by examining whether improvements in relation generation attested by automatic evaluation metrics are indeed corroborated by human judgments. We compared our best performing model (NCP+CC) with gold reference summaries, a template system, and the best model of Wiseman et al. (2017).

AMT workers were presented with a specific NBA game's box score and line score, and four (randomly selected) sentences from the summary. They were asked to identify supporting and contradicting facts mentioned in each sentence. We randomly selected 30 games from the test set. Each sentence was rated by three workers.

          #Support  #Contra  Gram    Cohere  Concise
Gold        2.98      0.28   11.78   16.00   13.78
TEMPL       6.98      0.21   -0.89   -4.89    1.33
WS-2017     3.19      1.09   -4.22   -4.89   -6.44
NCP+CC      4.90      0.90   -2.44   -2.44   -3.55

Table 6: Average number of supporting (#Support) and contradicting (#Contra) facts in game summaries and best-worst scaling evaluation (higher is better) for grammaticality (Gram), Coherence (Cohere), and Conciseness (Concise).

The left two columns in Table 6 contain the average number of supporting and contradicting facts per sentence, as determined by the crowdworkers, for each model. The template-based system has the highest number of supporting facts, even compared to the human gold standard. TEMPL does not perform any content selection, it includes a large number of facts from the database, and since it does not perform any generation either, it exhibits a few contradictions. Compared to WS-2017 and the Gold summaries, NCP+CC displays a larger number of supporting facts. All models are significantly different in the number of supporting facts (#Support) from TEMPL (using a one-way ANOVA with post-hoc Tukey HSD tests).⁵ NCP+CC is significantly different from WS-2017 and Gold. With respect to contradicting facts (#Contra), Gold and TEMPL are not significantly different from each other but are significantly different from the neural systems (WS-2017, NCP+CC).

In the second experiment, we assessed the generation quality of our model. We elicited judgments for the same 30 games used in the first study. For each game, participants were asked to compare a human-written summary, NCP with conditional copy (NCP+CC), Wiseman et al.'s (2017) best model, and the template system. Our study used Best-Worst Scaling (BWS; Louviere, Flynn, and Marley 2015), a technique shown to be less labor-intensive and to provide more reliable results as compared to rating scales (Kiritchenko and Mohammad 2017). We arranged every 4-tuple of competing summaries into 6 pairs. Every pair was shown to three crowdworkers, who were asked to choose which summary was best and which was worst according to three criteria: Grammaticality (is the summary fluent and grammatical?), Coherence (is the summary easy to read? does it follow a natural ordering of facts?), and Conciseness (does the summary avoid redundant information and repetitions?). The score of a system for each criterion is computed as the difference between the percentage of times the system was selected as the best and the percentage of times it was selected as the worst (Orme 2009). The scores range from -100 (absolutely worst) to +100 (absolutely best).

⁵All significance differences reported throughout this paper are with a level less than 0.05.
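The BWS score described above reduces to a few lines of Python; the judgment input format is our assumption:

```python
from collections import Counter

def bws_scores(judgments):
    """Best-worst scaling: score = %best - %worst per system (Orme 2009).
    `judgments` is a list of ((sys_a, sys_b), best, worst) tuples, one per
    pair shown to a crowdworker; this input format is our own assumption."""
    shown, best, worst = Counter(), Counter(), Counter()
    for (a, b), bst, wst in judgments:
        shown[a] += 1
        shown[b] += 1
        best[bst] += 1
        worst[wst] += 1
    # Normalize by how often each system appeared; ranges from -100 to +100.
    return {s: 100.0 * (best[s] - worst[s]) / shown[s] for s in shown}

print(bws_scores([(("Gold", "TEMPL"), "Gold", "TEMPL"),
                  (("Gold", "WS-2017"), "Gold", "WS-2017")]))
```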
The results of the second study are also summarized in Table 6. Gold summaries were perceived as significantly better compared to the automatic systems across all criteria (again using a one-way ANOVA with post-hoc Tukey HSD tests). NCP+CC was perceived as significantly more grammatical than WS-2017 but not compared to TEMPL, which does not suffer from fluency errors since it does not perform any generation. NCP+CC was perceived as significantly more coherent than TEMPL and WS-2017. The template fares poorly on coherence; its output is stilted and exhibits no variability (see the top block in Figure 4). With regard to conciseness, the neural systems are significantly worse than TEMPL, while NCP+CC is significantly better than WS-2017. By design, the template cannot repeat information since there is no redundancy in the sentences chosen to verbalize the summary.

TEMPL: The (10–2) defeated the Boston Celtics (6–6) 104–88. Klay Thompson scored 28 points (12–21 FG, 3–6 3PT, 1–1 FT) to go with 4 rebounds. Kevin Durant scored 23 points (10–13 FG, 1–2 3PT, 2–4 FT) to go with 10 rebounds. Isaiah Thomas scored 18 points (4–12 FG, 1–6 3PT, 9–9 FT) to go with 2 rebounds. scored 17 points (7–15 FG, 2–4 3PT, 1–2 FT) to go with 10 rebounds. scored 16 points (7–20 FG, 2–10 3PT, 0–0 FT) to go with 3 rebounds. Terry Rozier scored 11 points (3–5 FG, 2–3 3PT, 3–4 FT) to go with 7 rebounds. The Golden State Warriors' next game will be at home against the , while the Boston Celtics will travel to play the Bulls.

NCP+CC: The Golden State Warriors defeated the Boston Celtics 104–88 at TD Garden on Friday. The Warriors (10–2) came into this game winners of five of their last six games, but the Warriors (6–6) were able to pull away in the second half. Klay Thompson led the way for the Warriors with 28 points on 12–of–21 shooting, while Kevin Durant added 23 points, 10 rebounds, seven assists and two steals. Stephen Curry added 16 points and eight assists, while Draymond Green rounded out the box score with 11 points, eight rebounds and eight assists. For the Celtics, it was Isaiah Thomas who shot just 4–of–12 from the field and finished with 18 points. Avery Bradley added 17 points and 10 rebounds, while the rest of the Celtics combined to score just seven points. Boston will look to get back on track as they play host to the 76ers on Friday.

Figure 4: Example output from TEMPL (top) and NCP+CC (bottom). Text that accurately reflects a record in the associated box or line score is in blue, erroneous text is in red.

Taken together, our results show that content planning improves data-to-text generation across metrics and systems. We find that NCP+CC overall performs best; however, there is a significant gap between automatically generated summaries and human-authored ones.

6 Conclusions

We presented a data-to-text generation model which is enhanced with content selection and planning modules. Experimental results (based on automatic metrics and judgment elicitation studies) demonstrate that generation quality improves both in terms of the number of relevant facts contained in the output text, and the order according to which these are presented. Positive side-effects of content planning are additional improvements in the grammaticality and conciseness of the generated text. In the future, we would like to learn more detail-oriented plans involving inference over multiple facts and entities. We would also like to verify our approach across domains and languages.

References

Angeli, G.; Liang, P.; and Klein, D. 2010. A simple domain-independent probabilistic approach to generation. In EMNLP, 502–512.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Barzilay, R., and Lapata, M. 2005. Collective content selection for concept-to-text generation. In EMNLP.

Belz, A. 2008. Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models. Nat. Lang. Eng. 14(4):431–455.

Chisholm, A.; Radford, W.; and Hachey, B. 2017. Learning to generate one-sentence biographies from Wikidata. In EACL, 633–642.

Dale, R. 1988. Generating Referring Expressions in a Domain of Objects and Processes. Ph.D. Dissertation, University of Edinburgh.

Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 1–38.

Duboue, P. A., and McKeown, K. R. 2001. Empirically estimating order constraints for content planning in generation. In ACL, 172–179.

Duboue, P. A., and McKeown, K. R. 2002. Content planner construction via evolutionary algorithms and a corpus-based fitness function. In Proceedings of the International Natural Language Generation Conference, 89–96.

Duboue, P. A., and McKeown, K. R. 2003. Statistical acquisition of content selection rules for natural language generation. In EMNLP.

Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(Jul):2121–2159.

Gatt, A., and Krahmer, E. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. JAIR 61:65–170.

Gu, J.; Lu, Z.; Li, H.; and Li, V. O. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In ACL, 1631–1640.

Gulcehre, C.; Ahn, S.; Nallapati, R.; Zhou, B.; and Bengio, Y. 2016. Pointing the unknown words. In ACL, 140–149.

Hovy, E. H. 1993. Automated discourse generation using discourse structure relations. Artificial Intelligence 63(1-2):341–385.

Karamanis, N. 2004. Entity Coherence for Descriptive Text Structuring. Ph.D. Dissertation, School of Informatics, University of Edinburgh.

Kim, J., and Mooney, R. 2010. Generative alignment and semantic parsing for learning from ambiguous supervision. In Coling 2010: Posters, 543–551.

Kiritchenko, S., and Mohammad, S. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In ACL, volume 2, 465–470.

Klein, G.; Kim, Y.; Deng, Y.; Senellart, J.; and Rush, A. M. 2017. OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.

Konstas, I., and Lapata, M. 2012. Unsupervised concept-to-text generation with hypergraphs. In NAACL, 752–761.

Konstas, I., and Lapata, M. 2013. A global model for concept-to-text generation. J. Artif. Int. Res. 48(1):305–346.

Kukich, K. 1983. Design of a knowledge-based report generator. In ACL.

Lebret, R.; Grangier, D.; and Auli, M. 2016. Neural text generation from structured data with application to the biography domain. In EMNLP, 1203–1213.

Liang, P.; Jordan, M.; and Klein, D. 2009. Learning semantic correspondences with less supervision. In ACL, 91–99.

Liu, T.; Wang, K.; Sha, L.; Chang, B.; and Sui, Z. 2017. Table-to-text generation by structure-aware seq2seq learning. arXiv preprint arXiv:1711.09724.

Louviere, J. J.; Flynn, T. N.; and Marley, A. A. J. 2015. Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press.

Luong, T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In EMNLP, 1412–1421.

McKeown, K. R.; Pan, S.; Shaw, J.; Jordan, D. A.; and Allen, B. A. 1997. Language generation for multimedia healthcare briefings. In ANLP, 277–282.

McKeown, K. 1992. Text Generation. Cambridge University Press.

Mei, H.; Bansal, M.; and Walter, M. R. 2016. What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. In NAACL, 720–730.

Mellish, C.; Knott, A.; Oberlander, J.; and O'Donnell, M. 1998. Experiments using stochastic search for text planning. In Natural Language Generation.

Mikolov, T.; Karafiát, M.; Burget, L.; Černocký, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Orme, B. 2009. MaxDiff analysis: Simple counting, individual-level logit, and HB. Sawtooth Software.

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL, 311–318.

Perez-Beltrachini, L., and Lapata, M. 2018. Bootstrapping generators from noisy data. arXiv preprint arXiv:1804.06385.

Reiter, E., and Dale, R. 1997. Building applied natural language generation systems. Nat. Lang. Eng. 3(1):57–87.

Reiter, E., and Dale, R. 2000. Building Natural Language Generation Systems. New York, NY: Cambridge University Press.

Robin, J. 1994. Revision-Based Generation of Natural Language Summaries Providing Historical Background. Ph.D. Dissertation, Columbia University.

Sha, L.; Mou, L.; Liu, T.; Poupart, P.; Li, S.; Chang, B.; and Sui, Z. 2017. Order-planning neural text generation from structured data. arXiv preprint arXiv:1709.00155.

Tanaka-Ishii, K.; Hasida, K.; and Noda, I. 1998. Reactive content selection in the generation of real-time soccer commentary. In ACL.

Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. In NIPS, 2692–2700.

Wiseman, S.; Shieber, S.; and Rush, A. 2017. Challenges in data-to-document generation. In EMNLP, 2253–2263.

Wong, Y. W., and Mooney, R. 2007. Generation by inverting a semantic parser that uses statistical machine translation. In NAACL, 172–179.

Yang, Z.; Blunsom, P.; Dyer, C.; and Ling, W. 2017. Reference-aware language models. In EMNLP, 1851–1860.

Zaremba, W.; Sutskever, I.; and Vinyals, O. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
