Towards Comprehensive Description Generation from Factual Attribute-value Tables

Tianyu Liu1, Fuli Luo1, Pengcheng Yang1, Wei Wu1, Baobao Chang1,2 and Zhifang Sui1,2
1MOE Key Lab of Computational Linguistics, School of EECS, Peking University
2Peng Cheng Laboratory, Shenzhen, China
{tianyu0421, luofuli, yang pc, wu.wei, chbb, szf}@pku.edu.cn

Abstract

Comprehensive descriptions for factual attribute-value tables, which should be accurate, informative and loyal, can be very helpful for end users to understand structured data in this form. However, previous neural generators might suffer from key attribute missing, less informative and groundless information problems, which impede the generation of high-quality comprehensive descriptions for tables. To relieve these problems, we first propose a force attention (FA) method to encourage the generator to pay more attention to the uncovered attributes to avoid potential key attribute missing. Furthermore, we propose reinforcement learning for information richness to generate more informative as well as more loyal descriptions for tables. In our experiments, we utilize the widely used WIKIBIO dataset as a benchmark. Additionally, we create WB-filter based on WIKIBIO to test our model in simulated user-oriented scenarios, in which the generated descriptions should accord with particular user interests. Experimental results show that our model outperforms the state-of-the-art baselines on both automatic and human evaluation.

Attribute      Value
Birthplace     Utah, America
Position       forward (soccer player)

Comprehensive:      A Utah soccer player who plays as forward
Missing Key Attri.: A soccer player who plays as forward
Groundless info:    A Utah forward in the national team
Less Informative:   An American forward

Table 1: An example for comprehensive generation. Suppose we only have the two attribute-value tuples above; the underlined content is groundless information not mentioned in the source tables.

1 Introduction

Generating descriptions for factual attribute-value tables has attracted wide interest among NLP researchers, especially in a neural end-to-end fashion (e.g. Lebret et al. (2016); Liu et al. (2018); Sha et al. (2018); Bao et al. (2018); Puduppully et al. (2018); Li and Wan (2018); Nema et al. (2018)), as shown in Fig 1a. For broader potential applications in this field, we also simulate user-oriented generation, whose goal is to provide comprehensive generation for the selected attributes according to particular user interests, as in Fig 1b.

However, we find that previous models might miss key information and generate less informative and groundless content in their descriptions of the source tables. For example, in Table 1, the 'missing key attribute' case does not mention where the player comes from (birthplace), while the 'less informative' one chooses American rather than Utah. The case with groundless information contains 'in the national team', which is not mentioned in the source attributes. Although the 'key points missing' problem exists in many text-to-text and data-to-text datasets, for large-scale structured tables with vast heterogeneous attributes such as Wikipedia infoboxes, the 'key attribute missing' and 'less informative' problems might be even more challenging. The key attributes, like the 'position' of a basketball player or the 'political party' of a senator, are very likely to be unique features of particular tables, which usually appear much less frequently and are seldom mentioned compared with common attributes like 'Name' and 'Birthdate'. The 'groundless information' issue, which is also known as the 'hallucination' problem, remains a long-standing problem in NLG.

In this paper, we show that our model can generate more accurate and informative descriptions with less groundless content for tables. Firstly, we design a force-attention (FA) method to encourage the decoder to pay more attention to the uncovered attributes, by both stepwise and global constraints, to avoid potential key attribute missing.

In addition, we define the 'information richness' measurement of the generated descriptions with respect to the source tables. Based on that, we use reinforcement learning to encourage the generator to cover infrequent and rarely mentioned attributes as well as to generate more informative descriptions with less groundless content.

We test our models in two settings:

1) For neural table-to-text generation like Fig 1a, we test our model on WIKIBIO (Lebret et al., 2016), a crawled dataset from Wikipedia with paired infoboxes and associated descriptions. It is a widely used benchmark dataset for description generation for factual attribute-value tables and also a quite meaningful testbed for real-world scenarios with vast and heterogeneous attributes.

2) To test our model in the user-oriented setting, we filter WIKIBIO to form WB-filter. In this setting, we suppose all attributes in the source tables of WB-filter are selected by users and should be covered in the corresponding descriptions. We try to make sure the gold descriptions in WB-filter cover all the attributes of the source tables in this condition. Details are in Sec 4.

In Sec 3, we present our methods for generating comprehensive table descriptions (Table 1). Then we demonstrate how and why we create WB-filter (Sec 4.1) as well as evaluations (Sec 4.2), experimental configurations (Sec 4.3 and 4.4), case studies and visualizations (Sec 4.5) and error analysis (Sec 4.6).

Figure 1: The end-to-end (a) and user-oriented (b) table-to-text generation for an infobox (left) in WIKIBIO. The example infobox contains Name: Dillon Sheppard; Birthdate: 27 Feb 1979; Birthplace: Durban, South Africa; Current Club: Bidvest Wits; Number: 29; Height: 1.80 m (5 ft 11 in); Position: Left-winger. In (a), a table encoder and a description decoder generate the description end-to-end; in (b), the user selects the attributes Name, Current Club and Position, and the system generates "Name played as a Position in Current Club".

2 Background: Table-to-Description

2.1 Table Encoder

Given a structured table like Fig 1 (left), we model the attribute-value tuples in the table as a sequence of words with related attribute names. After serializing all the words in the 'Value' columns, for the i-th word x_i^{a_k} in the table whose attribute is a_k (the k-th attribute), we use the attribute name a_k and the word's position in that tuple to locate the word (Lebret et al., 2016). Specifically, we utilize a triple z_i^{a_k} = {a_k, p_{i+}^{a_k}, p_{i-}^{a_k}} to represent the structure information for word x_i^{a_k}, in which p_{i+}^{a_k} and p_{i-}^{a_k} are the positions of x_i^{a_k} counted from the beginning and end of a_k, respectively. For example, for the 'Birthplace' attribute in Fig 1 (left), we can use the triples {birthplace, 1, 4} and {birthplace, 4, 1} to represent the structure information of the words 'Durban' and 'Africa'. We concatenate the word x_t and its structure representation z_t at the t-th time step and feed them into an LSTM (Hochreiter and Schmidhuber, 1997) unit to encode the table: h_t = LSTM([x_t; z_t], h_{t-1}) is the t-th hidden state among the encoder states H = {h_t}_{t=1}^{T}. In the following sections, we might omit the superscript of x_i^{a_k} if it is not necessary.
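To make the serialization concrete, the sketch below is our own illustration (not the authors' released code) of how the (word, attribute, forward position, backward position) triples can be built from a toy infobox; the helper name and the dictionary layout are assumptions.

```python
from typing import List, Tuple

def serialize_table(table: dict) -> List[Tuple[str, str, int, int]]:
    """Flatten an attribute-value table into (word, attribute, p_plus, p_minus) tuples.

    p_plus counts the word's position from the beginning of the value,
    p_minus counts it from the end, as in Lebret et al. (2016).
    """
    triples = []
    for attribute, value in table.items():
        words = value.split()
        n = len(words)
        for i, word in enumerate(words, start=1):
            triples.append((word, attribute, i, n - i + 1))
    return triples

infobox = {"name": "Dillon Sheppard", "birthplace": "Durban , South Africa"}
for word, attr, p_plus, p_minus in serialize_table(infobox):
    print(word, attr, p_plus, p_minus)
# 'Durban' yields (birthplace, 1, 4) and 'Africa' yields (birthplace, 4, 1),
# matching the triples given in the running text above
```

Each word embedding would then be concatenated with embeddings of its attribute name and the two positions before being fed to the LSTM encoder.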

2.2 Description Decoder

For the generated description y*, the generated token y*_t at the t-th time step is predicted based on all the previously generated tokens y*_{<t}, the decoder hidden state s_t and a context vector c_t obtained by attending over the encoder states:

c_t = sum_{i=1}^{T} alpha_t^i h_i,   alpha_t^i = exp(g(s_t, h_i)) / sum_{j=1}^{T} exp(g(s_t, h_j))   (1)

where g(s_t, h_i) is a relevance score between s_t and h_i. We use the Bahdanau-style attention mechanism (Bahdanau et al., 2014) to calculate g(s_t, h_i):

g(s_t, h_i) = tanh(W_p h_i + W_q s_t + b)   (2)

W_s, W_t, W_p, W_q are learnable parameters.
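As a reference point, the snippet below is a minimal numpy sketch of Eqs 1-2 (our illustration, not the authors' code). Note that Eq 2 as printed returns a vector, so the sketch adds a projection vector v to reduce it to a scalar score; that projection, the shapes and the variable names are our assumptions.

```python
import numpy as np

def bahdanau_attention(H, s_t, W_p, W_q, b, v):
    """Additive attention: score_i = v . tanh(W_p h_i + W_q s_t + b), then softmax and context."""
    # H: (T, d_h) encoder states, s_t: (d_s,) decoder state
    scores = np.tanh(H @ W_p.T + s_t @ W_q.T + b) @ v   # (T,) relevance scores, Eq 2
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                 # attention weights over table words
    c_t = alpha @ H                                      # context vector c_t, Eq 1
    return alpha, c_t
```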
3 Comprehensive Table Description

The problems listed in Table 1 not only prevent the generator from producing comprehensive descriptions for selected entries in the tables (Fig 1b), but also prevent it from producing informative, accurate and loyal table descriptions (Fig 1a). So we propose two methods, force-attention (FA) and richness-oriented reinforcement learning, to produce accurate, informative and loyal descriptions.

3.1 Force-Attention Module

For the 'missing key attributes' problem (Table 1), we find that the generator usually focuses on particular attributes while the other attributes have relatively low attention values over the entire decoding procedure. So the force attention method is proposed to guide the decoder to pay more attention to the previously uncovered attributes with low attention values, to avoid potential key attribute missing. Note that the FA method focuses on attribute-level coverage rather than word-level coverage (Tu et al., 2016), as our goal is to reduce the 'missing key attributes' phenomenon instead of building a rigid word-by-word alignment between tables and descriptions.

Stepwise Forcing Attention: We define the attribute-level attention beta_t^{a_k} = avg_{x_i in a_k}(alpha_t^i) at the t-th step for attribute a_k as the average of the word-level attention values of the words in that attribute. The word-level coverage is defined as the sum of the attention vectors before the t-th step, theta_t^i = theta_{t-1}^i + alpha_t^i (Tu et al., 2016). In a similar way, we define the attribute-level coverage gamma_t^{a_k} = gamma_{t-1}^{a_k} + beta_t^{a_k} as the overall attention for attribute a_k before the t-th time step. The average word-level and attribute-level coverage are theta-bar_t^i = theta_t^i / t and gamma-bar_t^{a_k} = gamma_t^{a_k} / t, respectively.

Then we propose stepwise attention forcing, which explicitly guides the decoder to pay more attention to the uncovered attributes by calculating a new context vector c~_t = pi * c_t + (1 - pi) * v_t, compensating for the attributes ignored in the previous time steps. pi is a learnable vector. v_t is a compensation vector for the low-coverage attributes:

v_t = sum_{i=1}^{T} (max(zeta_t) - zeta_t^i) h_i,   zeta_t^i = min(theta-bar_t^i, gamma-bar_t^{a_k})   (3)

zeta_t is the modified average word-level coverage, with the average attribute-level coverage as an upper bound to avoid excessive compensation. Fig 2 shows a running example. The motivation is that we want the decoder to pay enough attention to all the attributes in the whole decoding process, which prevents missing key attributes because of the low attention values on them. Thus we compensate for the previously uncovered attributes (like 'currentclub' and 'position' in Fig 2) by v_t at the t-th time step.

Figure 2: Stepwise forcing attention at the 14-th step for the filtered version of the original infobox of Fig 1 in the WB-filter dataset (the decoded prefix is "Dillon Sheppard born 27 february 1979, Durban South Africa is a" and the next word is 'left-winger'). The panels show the average word-level coverage, the average attribute-level coverage and the compensation values. The uncovered attributes like 'currentclub' and 'position' (marked in orange and green) get high attention compensation (rightmost). Note that the word 'Sheppard' does not get any compensation (rightmost) because it has already received high attention in the previous steps.

Global Forcing Attention: Inspired by the soft-attention constraint of Xu et al. (2015), which encourages the generator to pay equal attention to every part of the image while generating image captions, we propose global forcing attention to avoid insufficient or excessive attention on certain attributes by adding the following loss to the primary seq2seq loss:

L_FA = lambda * sum_{k=1}^{K} [gamma-bar_{-1}^{a_k} - 1/K]^2   (4)

where K is the number of attributes in the table, and lambda is a hyper-parameter which is set to 0.3 based on evaluations on the validation data. gamma-bar_{-1}^{a_k} is the average attribute-level coverage for attribute a_k at the last time step.
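The following is a minimal numpy sketch of the stepwise compensation in Eq 3, the modified context vector, and the global loss in Eq 4, written from the formulas above; it is our own illustration, and the variable names, shapes and data structures are assumptions rather than the released implementation.

```python
import numpy as np

def force_attention_step(H, theta_bar, gamma_bar, attr_of_word, c_t, pi):
    """One stepwise forcing-attention update (Eq 3 and the blended context).

    H:            (T, d) encoder states
    theta_bar:    (T,) average word-level coverage up to step t
    gamma_bar:    dict attribute -> average attribute-level coverage up to step t
    attr_of_word: list of length T mapping each table word to its attribute
    c_t:          (d,) ordinary attention context
    pi:           gate in [0, 1] (a learnable vector in the paper)
    """
    # zeta caps word-level coverage by its attribute-level coverage to avoid over-compensation
    zeta = np.array([min(theta_bar[i], gamma_bar[attr_of_word[i]]) for i in range(len(H))])
    # words whose attributes received little attention so far get larger compensation weights
    v_t = ((zeta.max() - zeta)[:, None] * H).sum(axis=0)
    # blend the ordinary context with the compensation vector
    return pi * c_t + (1.0 - pi) * v_t

def global_forcing_attention_loss(gamma_bar_last, lam=0.3):
    """Global constraint (Eq 4): push the final average attribute coverage toward uniform 1/K."""
    K = len(gamma_bar_last)
    return lam * sum((g - 1.0 / K) ** 2 for g in gamma_bar_last.values())
```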

3.2 Reinforced Richness-oriented Learning

We also propose a reinforcement learning framework which encourages the generator to cover rare and seldom mentioned words and attributes in the table. The experiments and case studies show its effectiveness in dealing with the 'groundless information' and 'less informative' problems in Table 1.

3.2.1 Information Richness

The information richness (Eq 5) is the multiplication of the attribute-level and word-level richness of the descriptions with respect to the source tables.

Attribute-level Information Richness: Different tables that describe different objects are always featured by the unique attributes in the table. For example, a sportsman often has attributes like 'position' and 'debutyear'. The information in the unique attributes is harder to capture than that in common attributes like 'name' and 'birthdate', as the latter are very frequent in the training set. We define the information richness of an attribute a_k as f(a_k) = [freq(a_k)]^{-1} by calculating its frequency in the training set.

Word-level Information Richness: The unique words in the tables are more likely to be informative, such as a specific location, name or book. To calculate the word-level information richness, we firstly lemmatize all the words in the tables and filter the words with a stop-words list, which includes prepositions, symbols, numbers, etc. Then we randomly sample 5 synonyms of each word from WordNet (Miller, 1995). Finally, we calculate the word-level richness w(x_i^{a_k}) of the i-th word in attribute a_k by averaging the tf-idf values of x_i^{a_k} and its synonyms in the training set.

For a generated description y*, we lemmatize all the words in y* to get y~*. Then we calculate the information richness based on the related source table with T words and the gold description y, respectively:

Rich(y*) = sum_{i=1}^{T} [f(a_k) * w(x_i^{a_k}) * 1{x~_i^{a_k} in y~*}] / sum_{i=1}^{T} [f(a_k) * w(x_i^{a_k})]   (5)

in which x~_i^{a_k} represents any word among x_i^{a_k} and its synonyms in the table. The information richness measures the ratio of the information in the table that is covered by the description.
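A small sketch of Eq 5, assuming the attribute frequencies, tf-idf scores and synonym sets have been precomputed; this is our own illustration rather than the released implementation, and lemmatization and stop-word filtering are omitted for brevity.

```python
def information_richness(table_words, generated_tokens, attr_freq, tfidf, synonyms):
    """Rich(y*) from Eq 5.

    table_words:      list of (word, attribute) pairs from the serialized table
    generated_tokens: set of (lemmatized) tokens in the generated description
    attr_freq:        dict attribute -> frequency in the training set
    tfidf:            dict word -> averaged tf-idf of the word and its synonyms
    synonyms:         dict word -> set of synonyms sampled from WordNet
    """
    covered, total = 0.0, 0.0
    for word, attr in table_words:
        weight = (1.0 / attr_freq[attr]) * tfidf[word]   # f(a_k) * w(x_i)
        total += weight
        # the table word counts as covered if it or any of its synonyms is generated
        if generated_tokens & ({word} | synonyms.get(word, set())):
            covered += weight
    return covered / total if total > 0 else 0.0
```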
3.2.2 Reinforcement Learning

Reward Function: Different from previous models, which only measure how well the generated sentences match the target sentences, we design a mixed reward R_mix which contains both the BLEU-4 score and the information richness of the generated descriptions with respect to the source tables:

R_mix = lambda * R_info + (1 - lambda) * R_BLEU   (6)

lambda is set to 0.4 and 0.6 for WIKIBIO and WB-filter, respectively, based on evaluations on the validation data. Fig 6 shows how we choose lambda.

Training Algorithm: We use the REINFORCE algorithm (Williams, 1992) to learn an agent that maximizes the reward function R_mix. The training loss of sequence generation is defined as the negative expected reward:

L_RL = -E_{y^s ~ P_phi} [r(y^s) * log(P_phi(y^s))]   (7)

where P_phi(y^s) is the agent's policy, i.e. the word distribution of the description decoder (Eq 1), and r(.) is the reward function defined in Eq 6. In the implementation, y^s = {y_1^s, y_2^s, ..., y_{|Y|}^s} is a sequence sampled from P_phi by Monte-Carlo sampling. The policy gradients for Eq 7 can be calculated as:

grad_phi L_RL = lambda * grad_phi R_info + (1 - lambda) * grad_phi R_BLEU   (8)

We use the self-critical sequence training method (Rennie et al., 2017; Paulus et al., 2017) to reduce the variance of the gradients by subtracting a baseline reward from the mixed reward in Eq 6:

grad_phi R_BLEU ~= -[B(y^s, y) - B(y^g, y)] * grad_phi log(P_phi(y^s))   (9)

where B(a, b) is the BLEU score of sequence a compared with sequence b, and y^g is a sequence generated with greedy search. To calculate the information richness reward R_info for the lemmatized sampled sequence y^s, we use the information richness (Eq 5) of the related lemmatized gold description y with respect to the source table as the baseline reward:

grad_phi R_info ~= -[Rich(y^s) - Rich(y)] * grad_phi log(P_phi(y^s))   (10)

For more technical details, we refer the interested readers to Williams (1992); Ranzato et al. (2015); Rennie et al. (2017).
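The sketch below illustrates how the self-critical baselines in Eqs 9-10 can be combined into a single policy-gradient loss. It is a simplified illustration under our own assumptions (it reuses the information_richness helper sketched above and treats the reward as a constant with respect to the parameters), not the authors' code.

```python
import torch
from nltk.translate.bleu_score import sentence_bleu

def self_critical_loss(log_probs_sample, sample_tokens, greedy_tokens, reference_tokens,
                       table_words, attr_freq, tfidf, synonyms, lam=0.4):
    """Mixed self-critical policy-gradient loss for Eqs 6, 9 and 10.

    log_probs_sample: 1-D tensor with log P_phi(y_t^s) for each sampled token
    sample_tokens / greedy_tokens: token lists from sampling / greedy decoding
    reference_tokens: gold description tokens
    lam: weight of the richness reward (0.4 for WIKIBIO in the paper)
    """
    # BLEU reward with the greedy rollout as baseline (Eq 9)
    r_bleu = sentence_bleu([reference_tokens], sample_tokens) \
             - sentence_bleu([reference_tokens], greedy_tokens)
    # richness reward with the gold description's richness as baseline (Eq 10)
    r_info = information_richness(table_words, set(sample_tokens), attr_freq, tfidf, synonyms) \
             - information_richness(table_words, set(reference_tokens), attr_freq, tfidf, synonyms)
    advantage = lam * r_info + (1.0 - lam) * r_bleu
    # REINFORCE: scale the sampled sequence's log-likelihood by the (constant) advantage
    return -advantage * log_probs_sample.sum()
```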

4 Experiments

4.1 Datasets

We use two datasets to test our model in the context of end-to-end table description generation and of comprehensive generation for selected attributes in the user-oriented scenario.

For end-to-end description generation, we use the WIKIBIO dataset (Lebret et al., 2016) as the benchmark, which contains 728,321 articles from English Wikipedia (Sep 2015) and uses the first sentence of each article as the description.

To test our model in the user-oriented scenario, we filtered the WIKIBIO dataset to form a new dataset, WB-filter. To simulate user interests, we first select the top 100 most frequent attributes in WIKIBIO (we choose highly frequent attributes to ensure enough training instances in WB-filter for data-driven methods). After that, we manually filter irrelevant attributes (like 'caption', 'website' or 'signature') and merge identical attributes (like 'article title' and 'name') to avoid repetition. Then we leave out all the remaining attributes in the tables and filter out the instances in WIKIBIO whose descriptions cannot cover the selected attributes. To achieve this, we firstly lemmatize all the tokens in the infoboxes as well as those in the related gold biographies and filter them by a stop-words list, then we randomly retrieve 5 synonyms for every word in the infoboxes from WordNet. Finally, we make sure the gold biographies cover at least one word (or its synonym) for every attribute-value tuple among the chosen attributes and filter out the unqualified instances in WIKIBIO.

The 'frequency-coverage' figure in Fig 3 shows that 1) the filtering ensures that the WB-filter dataset achieves 100% Hit-1 coverage; 2) the WIKIBIO dataset suffers from both 'low frequency' and 'low coverage' problems, which means some key attributes in the tables are seldom mentioned by the descriptions. The cause of the 'low coverage' problem is the loose alignment between structured data and related descriptions. The two datasets are divided into training (80%), testing (10%) and validation (10%) sets.

Dataset              WIKIBIO    WB-filter
# instances          728321     88287
# Tokens per Bio     26.1       30.2
# Tokens per Table   53.1       20.8
# Attri. per Table   19.7       6.3
# Word overlap       9.5        12.1

Figure 3: The 'coverage-frequency' figure (left, each point represents an attribute) shows that many attributes have very low coverage and low frequency in the WIKIBIO dataset. Due to our filtering, the attributes in WB-filter have 100% Hit-1 coverage (Sec 4.2) and more overlapping words with the original tables, as shown in the data statistics (right, reproduced above).

4.2 Evaluation Metrics

Automatic Metrics: Following previous work (Lebret et al., 2016; Sha et al., 2018; Liu et al., 2018), we use BLEU-4 (Papineni et al., 2002) and ROUGE-4 (F measure) (Lin, 2004) for automatic evaluation. Furthermore, to evaluate how well the generated biographies cover the key points in the infoboxes, we also use information richness (Eq 5) as one of our automatic evaluation metrics. 'Hit at least 1 word' for an attribute means that a biography has at least one overlapping word with the words (or their synonyms) in that attribute, which are lemmatized and filtered by a stop-words list in the same way we build WB-filter in Sec 4.1. 'Hit-1 coverage' for an attribute is the ratio of the instances involving that attribute whose biographies 'hit at least 1 word' in that attribute.
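For clarity, here is a small sketch of the 'Hit at least 1 word' test and of Hit-1 coverage as described above; it is our own illustration, assumes lemmatization and stop-word filtering have already been applied, and the instance dictionary layout is hypothetical.

```python
def hits_attribute(bio_tokens, attribute_value_tokens, synonyms):
    """'Hit at least 1 word': does the biography mention any value word or one of its synonyms?"""
    bio = set(bio_tokens)
    for word in attribute_value_tokens:
        if bio & ({word} | synonyms.get(word, set())):
            return True
    return False

def hit1_coverage(instances, attribute, synonyms):
    """Ratio of instances containing the attribute whose biography hits at least one of its words."""
    relevant = [inst for inst in instances if attribute in inst["table"]]
    if not relevant:
        return 0.0
    hits = sum(hits_attribute(inst["bio_tokens"], inst["table"][attribute], synonyms)
               for inst in relevant)
    return hits / len(relevant)
```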
Human Evaluation: Since automatic evaluations like BLEU may not be reliable for NLG systems (Callison-Burch et al., 2006; Reiter and Belz, 2009; Reiter, 2018), we also use human evaluation, which involves generation fluency, coverage (how much of the given information in the infobox is mentioned in the related biography) and correctness (how much false or irrelevant information is mentioned in the biography). We firstly sampled 300 generated biographies from the generators for human evaluation. After that, we hired 3 third-party crowd-workers who are equipped with sufficient background knowledge to rank the given biographies. We present the generated descriptions to the annotators in a randomized order and ask them to be objective and not to guess which system a particular generated case is from. Two biographies may have the same ranking if it is hard to decide which one is better. The Pearson correlations of inter-annotator agreement are 0.76 and 0.71 (Table 3) on WIKIBIO and WB-filter, respectively.

4.3 Experimental Details

Following previous work (Liu et al., 2018), for WIKIBIO we select the most frequent 20,000 words and 1480 attributes in the training set as the word and attribute vocabulary. We tune the hyper-parameters based on the model performance on the validation set. The dimensions of word embedding, attribute embedding, position embedding and hidden unit are 500, 50, 600 and 10, respectively. The batch size, learning rate and optimizer for both datasets are 32, 5e-4 and Adam (Kingma and Ba, 2014), respectively. We use Xavier initialization (Glorot and Bengio, 2010) for all the parameters in our model. The global constraint of force-attention (Eq 4) is applied after 4 and 1.5 epochs of training for the WIKIBIO and WB-filter datasets, respectively, to avoid hurting the primary loss. Before the richness-oriented reinforced training, the neural generator is pre-trained for 8 and 4 epochs on the WIKIBIO and WB-filter datasets (with or without the force-attention module), respectively. We replace UNK tokens with the most relevant token in the source table according to the attention matrix (Jean et al., 2015).

(a) Automatic evaluation on WIKIBIO
Models                   BLEU    ROUGE
KN                       2.21    0.38
Template KN              19.80   10.70
NLM                      4.17    1.48
Table NLM                34.70   25.80
Order-planning           43.91   37.15
Struct-aware             44.89   41.21
Word-level Coverage*     43.44   39.84
Attri-level Coverage*    42.87   38.95
Seq2seq                  43.51   39.61
+ Force-Attention        44.46   40.58
+ Richness RL†           45.47   41.54

(b) Automatic evaluation on WB-filter
Models                   BLEU    ROUGE
Struct-aware*            40.81   36.52
Word-level Coverage*     38.85   35.11
Attri-level Coverage*    38.34   34.92
Seq2seq                  39.17   35.39
+ Force Attention        41.21   36.71
+ Richness RL†           42.03   37.55

Table 2: BLEU and ROUGE scores on the WIKIBIO and WB-filter datasets. The baselines with * are based on our implementation, while the others are reported by their authors. Models with † are trained using the RL criterion specified in Sec 3.2.2, while the remaining models are trained using maximum likelihood estimation (MLE).

4.4 Baselines

KN & Template KN: A template-based Kneser-Ney (KN) language model (Heafield et al., 2013). The extracted template for Table 1 is "name_1 name_2 ( born birthdate_1 ...". During inference, the decoder is constrained to emit words from the vocabulary or the special tokens in the tables.

Table NLM: Lebret et al. (2016) proposed a neural language model, Table NLM, taking the attribute information into consideration.

Order-planning: Sha et al. (2018) proposed a link matrix to model the order of the attribute-value tuples while generating biographies.

Struct-aware: Liu et al. (2018) proposed a structure-aware model using a modified LSTM unit and a specific attention mechanism to incorporate the attribute information.

Word & Attribute level Coverage: We also implement the implicit coverage method (Tu et al., 2016) for comparison. For word-level coverage, we replace Eq 2 with g(s_t, h_i) = tanh(W_p h_i + W_q s_t + W_m theta_t + b). For attribute-level coverage, we replace Eq 2 with g(s_t, h_i) = tanh(W_p h_i + W_q s_t + W_m gamma_t + b). theta_t and gamma_t are the word-level and attribute-level coverage defined in Sec 3.1.

4.5 Analysis of Experimental Results

Automatic evaluations are shown in Table 2 for WIKIBIO and WB-filter. The proposed force-attention module achieves 1.11/0.98 and 2.04/1.32 BLEU/ROUGE increases on the WIKIBIO and WB-filter datasets, respectively. Although the proposed force attention method does not outperform the 'struct-aware' method in terms of BLEU and ROUGE on the WIKIBIO dataset, we show its advantages in the user-oriented scenario as well as its ability to cover the key attributes, as shown in Tables 4 and 5. The richness-oriented reinforced module further enhances the model performance, helping our model outperform the state-of-the-art system (Liu et al., 2018) by about 0.79 BLEU and 0.58 ROUGE. Note that the BLEU and ROUGE scores are lower on the WB-filter dataset because, firstly, WIKIBIO has a much larger training set; secondly, the gold biographies might contain information beyond the tables. Although this phenomenon also occurs in WIKIBIO, the filtering of WB-filter magnifies this issue.

(a) Human evaluation on WIKIBIO
Models        Fluency   Coverage   Correctness
Seq2seq       1.87      1.99       1.95
Struct-aware  1.61      1.80       1.71
Our best      1.54      1.46       1.61

(b) Human evaluation on WB-filter
Models        Fluency   Coverage   Correctness
Seq2seq       2.02      1.88       1.93
Struct-aware  1.58      1.52       1.65
Our best      1.54      1.39       1.54

Table 3: Average ranking (lower is better) of 3 systems. We calculate the Pearson correlation to show the inter-annotator agreement.

Figure 4: The average attribute-level (green) and word-level (red) coverage of the models with or without the force-attention module for an infobox in WB-filter (higher values are darker) at the last decoding step. The vanilla seq2seq model ignores the 'birthplace' and 'position' attributes because of the low coverage on them, while the FA module attracts enough attention to them while decoding. Outputs shown in the figure:
seq2seq: Dillon Sheppard (born 27 february 1979) is a soccer who plays for Bidvest Wits.
S2S+cover: Dillon Sheppard (born 27 february 1979) is a soccer who plays for Bidvest Wits.
Sha et al. 2017: Dillon Sheppard (born 27 february 1979) is a soccer who plays for Bidvest Wits.
Liu et al. 2017 (struct-aware): Dillon Sheppard (born 27 february 1979) is a South African soccer who plays for Bidvest Wits.
seq2seq+FA: Dillon Sheppard (born 27 february 1979, Durban South Africa) is a left-winger in Bidvest Wits.
Ours: Dillon Sheppard (born 27 february 1979, Durban South Africa) is a footballer who plays as left-winger for Bidvest Wits.

(a) Ablation studies on WIKIBIO
Models                       BLEU    Rich
1 seq2seq                    43.51   28.21
2 + Stepwise (only)          43.69   30.01
3 + Global loss (only)       44.21   31.65
4 + Stepwise + Global loss   44.46   32.90
5 + Richness RL (only)       45.23   35.84
6 + All                      45.47   37.64

(b) Ablation studies on WB-filter
Models                       BLEU    Rich
1 seq2seq                    39.17   56.30
2 + Stepwise (only)          39.59   59.29
3 + Global loss (only)       40.83   61.12
4 + Stepwise + Global loss   41.21   62.81
5 + Richness RL (only)       41.66   63.89
6 + All                      42.03   64.41

Table 4: The ablation studies for our model. Models 2-4 are from the force-attention method. 'Rich' is the 'information richness' defined in Eq 5.

Figure 5: Hit-1 coverage (Sec 4.2) for attributes on the test sets of WIKIBIO and WB-filter. For better visualization, we first select the attributes whose frequencies are larger than 0.1%, then rank the Hit-1 coverage of these attributes (214 attributes in WIKIBIO; 26 attributes in WB-filter) in descending order.

Human evaluations in Table 3 show that our model achieves better generation coverage and correctness than all the baselines. Table 4 shows the ablation studies of our model.

As demonstrated in Table 5, we select an infobox from WIKIBIO and WB-filter respectively for case studies. Observing the generated description in WIKIBIO, we find that 1) compared with the vanilla seq2seq model, our force-attention module can cover the information in the 'Notableworks' attribute; 2) the richness-oriented module further helps our model to cover the 'Alma mater' and 'Notableworks' attributes, as they are infrequent attributes (more informative) in the dataset. Additionally, due to the rareness of the word 'kiev', our model is able to cover the related information. Similarly, the generated description for WB-filter covers the information from 'Organization' and 'Birthplace' with the help of the proposed model.

Fig 4 shows the effectiveness of the force-attention module: the decoder is guided to pay more attention to the uncovered attributes ('birthplace' and 'position') while decoding. Fig 5 shows that both proposed modules can boost the attribute-level coverage on the two datasets. Fig 6 (left) explains why our model can also improve end-to-end table description generation: attributes like 'position', 'battles' and 'political party' are key information for describing the infoboxes of sportsmen, soldiers and politicians. Fig 6 (right) shows the effects of lambda in Eq 6.

4.6 Error Analysis

Although the proposed models achieve competitive performance, we also observe some failure cases.

Name: Ivan Ohienko Metropolitan Ilarion ; Birthdate: 2 January 1882 ; Birthplace: Brusilov, Kiev governorate, Russian empire ; Deathdate: 29 March 1972 ; Deathplace: Winnipeg, Canada ; Occupation: cleric, historian, ethnographer, and scholar, writer, and translator ; Language: Ukrainian ; Nationality: Ukrainian ; Alma mater: Kiev university ; Notableworks: translation of the bible into ukrainian ; Article title: Ilarion Ohienko
Seq2seq: Ivan Ohienko Metropolitan ( January 2 , 1882 – March 29 , 1972 ) was a Ukrainian cleric , historian , ethnographer , writer , linguist , writer and scolar .
+Force-Attention: Ivan Ohienko Metropolitan Ilarion ( 2 January 1882 in Brusilov – 29 march 1972 in Winnipeg ) was a Ukrainian linguist , ethnographer , and scholar , best known for his translation of the bible into ukrainian .
+Richness-oriented RL: Ivan Ohienko Metropolitan Ilarion ( 2 January 1882 , Krusilov , Kiev governorate – 29 march 1972 , Winnipeg ) was a Ukrainian cleric , historian , ethnographer , and scholar of Kiev university , best known for his translation of the bible into ukrainian .

Name: Rajendra Singh ; Birthdate: 06 August 1959 ; Birthplace: Daula, Bagpat District, Uttar Pradesh ; Nationality: Indian ; Organization: Tarun Bharat Sangh ; Occupation: water conservationist ; Alma mater: Allahabad University
Seq2seq: Rajendra Singh is an Indian water conservationist.
+Force-Attention: Rajendra Singh (born 6 August 1959) is an Indian conservationist and a senior fellow of the Tarun Bharat Sangh.
+Richness-oriented RL: Rajendra Singh (born 6 august 1959, Uttar Pradesh) is an Indian water conservationist and a member of the Tarun Bharat Sangh.

Table 5: The generated cases in WIKIBIO (above) and WB-filter (below) datasets. The underlined texts, which are the key information of the source tables, are ignored by seq2seq model.

To sum up, the main failure mode is irrelevant information in the generated descriptions with respect to the source tables. For example, a biography about a football player might contain 'in the national football league' although the related infobox does not mention this piece of information, as similar expressions exist in many instances of the training set. Although our model can largely relieve this problem, as shown in the human evaluation (Table 3), it is still a general problem in NLG. As for the ability to cover important information in the tables, although our model covers much more comprehensive information than the previous models (Tables 2 and 3), some implicitly expressed (like whether a person is retired or not) or rarely covered (like 'spouse' or 'high school') attributes in the source tables might still be ignored in the descriptions generated by our model. Furthermore, those pieces of information which need some form of inference across several attributes (like a time span) may not be well represented by our model.

Figure 6: Hit-1 coverage (Sec 4.2) for some key attributes (left) on the test set of WIKIBIO, showing that our model helps to cover some key attributes while describing the tables. The right panel shows how we choose lambda in Eq 6 (R_mix = lambda * R_info + (1 - lambda) * R_BLEU) for the 'Seq2seq + RL' model on the validation set of WIKIBIO.

5 Related Work

Data-to-text is a language generation task that produces text for structured data, and table-to-text generation belongs to this family (Reiter and Dale, 2000). Many previous works (Barzilay and Lapata, 2005, 2006; Liang et al., 2009) treated the task as a pipelined system, which viewed content selection and surface realization as two separate tasks. Duboue and McKeown (2002) proposed a clustering approach in the biography domain by scoring the semantic relevance of the text and the paired knowledge base. In a similar vein, Barzilay and Lapata (2005) modeled the dependencies between American football records and identified the bits of information to be verbalized. Liang et al. (2009) and Angeli et al. (2010) extended the work of Barzilay and Lapata (2005) to the soccer and weather domains by learning the alignment between data and text using hidden variable models. Androutsopoulos et al. (2013) and Duma and Klein (2013) focused on generating descriptive language for ontologies and RDF triples. Most recent works utilize neural networks for data-to-text generation (Mahapatra et al., 2016; Wiseman et al., 2017; Laha et al., 2018; Kaffee et al., 2018; Freitag and Roy, 2018; Qader et al., 2018; Dou et al., 2018; Yeh et al., 2018; Jhamtani et al., 2018; Jain et al., 2018; Liu et al., 2017b, 2019; Peng et al., 2019; Dušek et al., 2019).

Some closely related work also focused on table-to-text generation. Mei et al. (2016) proposed an encoder-aligner-decoder framework for generating weather broadcasts. Hachey et al. (2017) used a table-text and text-table auto-encoder framework for table-to-text generation. Nema et al. (2018) proposed gated orthogonalization to avoid repetitions. Wiseman et al. (2018) used a neural semi-HMM to generate template-like descriptions for structured data. Our work somewhat shares similar goals with Kiddon et al. (2016); Tu et al. (2016); Liu et al. (2017a); Gong et al. (2018) in the sense that they emphasize easily ignored (usually less frequent) features or bits of information in the training procedure by smoothing or regularization. The greatest difference between our work and theirs is that our method is tailored for covering the key information embedded in the attributes (entries) of key-value tables rather than single words or labels. Although the deficient scores of Tu et al. (2016) in Table 2 demonstrate that word-level coverage-oriented methods may not be well suited to structured tables, we assume other word-level constraints may transfer to structured tables without losing efficiency. We leave the identification of potentially applicable word-level constraints to future work.

This paper focused on generating one-sentence biographies for infoboxes, like many previous works (Lebret et al., 2016; Hachey et al., 2017; Liu et al., 2018; Bao et al., 2018; Nema et al., 2018; Puduppully et al., 2018; Cao et al., 2018). Perez-Beltrachini and Lapata (2018) used the first paragraph of the Wikipedia pages as the gold biographies, aiming at generating longer biographies. We tried the same setting and unfortunately found that most generated biographies contain too much groundless information compared with the source infoboxes. This is because the related gold biographies taken from the first paragraph contain too much groundless information beyond the source infoboxes.

6 Conclusion and Future Work

We set up 3 goals for comprehensive description generation for attribute-value factual tables: accurate, informative and loyal. To achieve these goals, we propose the force-attention method, which encourages the generator to pay more attention to previously uncovered attributes to avoid potential key attribute missing. Richness-oriented reinforcement learning is proposed to cover more informative content in the source tables, which helps the generator produce informative and accurate descriptions. The experiments on the WIKIBIO and WB-filter datasets show the merits of our model. In the future, we will explore the representation of implicit information in the table, like whether a person is retired or not, or how long a sportsman's career is given starting and ending years, by including some inference strategies.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable suggestions. This work is supported by the National Science Foundation of China under Grant No. 61876004 and No. 61772040. The corresponding authors of this paper are Baobao Chang and Zhifang Sui.

References

Ion Androutsopoulos, Gerasimos Lampouras, and Dimitrios Galanis. 2013. Generating natural language descriptions from owl ontologies: the naturalowl system. Journal of Artificial Intelligence Research, 48:671–715.

Gabor Angeli, Percy Liang, and Dan Klein. 2010. A simple domain-independent probabilistic approach to generation. In EMNLP 2010, pages 502–512.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Jun-Wei Bao, Duyu Tang, Nan Duan, Yuanhua Lv, Ming Zhou, and Tiejun Zhao. 2018. Table-to-text: Describing table region with natural language. In AAAI 2018, pages 5020–5027.

Regina Barzilay and Mirella Lapata. 2005. Collective content selection for concept-to-text generation. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 331–338. Association for Computational Linguistics.

Regina Barzilay and Mirella Lapata. 2006. Aggregation via set partitioning for natural language generation. In NAACL, pages 359–366. Association for Computational Linguistics.

Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluation the role of bleu in machine translation research. In EACL 2006.

Juan Cao, Junpeng Gong, and Pengzhou Zhang. 2018. Open-domain table-to-text generation based on seq2seq. In Proceedings of the 2018 International Conference on Algorithms, Computing and Artificial Intelligence, page 72. ACM.

Longxu Dou, Guanghui Qin, Jinpeng Wang, Jin-Ge Yao, and Chin-Yew Lin. 2018. Data2text studio: Automated text generation from structured data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 13–18.

Pablo A Duboue and Kathleen R McKeown. 2002. Content planner construction via evolutionary algorithms and a corpus-based fitness function. In Proceedings of INLG 2002, pages 89–96.

Daniel Duma and Ewan Klein. 2013. Generating natural language from linked data: Unsupervised template extraction. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013), Long Papers, pages 83–94.

Ondřej Dušek, Jekaterina Novikova, and Verena Rieser. 2019. Evaluating the state-of-the-art of end-to-end natural language generation: The e2e nlg challenge. arXiv preprint arXiv:1901.07931.

Markus Freitag and Scott Roy. 2018. Unsupervised natural language generation with denoising autoencoders. arXiv preprint arXiv:1804.07899.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.

Chengyue Gong, Xu Tan, Di He, and Tao Qin. 2018. Sentence-wise smooth regularization for sequence to sequence learning. arXiv preprint arXiv:1812.04784.

Ben Hachey, Will Radford, and Andrew Chisholm. 2017. Learning to generate one-sentence biographies from wikidata. In EACL 2017, Volume 1: Long Papers, pages 633–642.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H Clark, and Philipp Koehn. 2013. Scalable modified kneser-ney language model estimation. In ACL (2), pages 690–696.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Parag Jain, Anirban Laha, Karthik Sankaranarayanan, Preksha Nema, Mitesh M Khapra, and Shreyas Shetty. 2018. A mixed hierarchical attention based encoder-decoder approach for standard table summarization. arXiv preprint arXiv:1804.07790.

Sébastien Jean, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL-IJCNLP 2015, Volume 1: Long Papers, pages 1–10.

Harsh Jhamtani, Varun Gangal, Eduard Hovy, Graham Neubig, and Taylor Berg-Kirkpatrick. 2018. Learning to generate move-by-move commentary for chess games from large-scale social forum data. In ACL 2018, Volume 1: Long Papers, pages 1661–1671.

Lucie-Aimée Kaffee, Hady ElSahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon S. Hare, and Elena Simperl. 2018. Learning to generate wikipedia summaries for underserved languages from wikidata. In NAACL-HLT 2018, Volume 2 (Short Papers), pages 640–645.

Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016. Globally coherent text generation with neural checklist models. In EMNLP 2016, pages 329–339.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Anirban Laha, Parag Jain, Abhijit Mishra, and Karthik Sankaranarayanan. 2018. Scalable micro-planned generation of discourse from structured data. CoRR, abs/1810.02889.

Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. In EMNLP 2016, pages 1203–1213.

Liunian Li and Xiaojun Wan. 2018. Point precisely: Towards ensuring the precision of data in generated texts using delayed copy mechanism. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1044–1055.

Percy Liang, Michael I Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In ACL-IJCNLP 2009, pages 91–99. Association for Computational Linguistics.

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out.

Tianyu Liu, Fuli Luo, Qiaolin Xia, Shuming Ma, Baobao Chang, and Zhifang Sui. 2019. Hierarchical encoder with auxiliary supervision for neural table-to-text generation: Learning better representation for tables. In Proceedings of AAAI.

Tianyu Liu, Kexiang Wang, Baobao Chang, and Zhifang Sui. 2017a. A soft-label method for noise-tolerant distantly supervised relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1790–1795.

Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2018. Table-to-text generation by structure-aware seq2seq learning. In AAAI 2018, pages 4881–4888.

Tianyu Liu, Bingzhen Wei, Baobao Chang, and Zhifang Sui. 2017b. Large-scale simple question generation by template-based seq2seq learning. In NLPCC 2017, pages 75–87.

Joy Mahapatra, Sudip Kumar Naskar, and Sivaji Bandyopadhyay. 2016. Statistical natural language generation from tabular non-textual data. In Proceedings of the 9th International Natural Language Generation conference, pages 143–152.

Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. What to talk about and how? selective generation using lstms with coarse-to-fine alignment. In NAACL HLT 2016, pages 720–730.

George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.

Preksha Nema, Shreyas Shetty, Parag Jain, Anirban Laha, Karthik Sankaranarayanan, and Mitesh M Khapra. 2018. Generating descriptions from structured data using a bifocal attention mechanism and gated orthogonalization. In NAACL-HLT 2018, Volume 1 (Long Papers), pages 1539–1550.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL 2002, pages 311–318.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Hao Peng, Ankur P. Parikh, Manaal Faruqui, Bhuwan Dhingra, and Dipanjan Das. 2019. Text generation with exemplar-based adaptive decoding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Laura Perez-Beltrachini and Mirella Lapata. 2018. Bootstrapping generators from noisy data. In NAACL-HLT 2018, Volume 1 (Long Papers), pages 1516–1527.

Ratish Puduppully, Li Dong, and Mirella Lapata. 2018. Data-to-text generation with content selection and planning. arXiv preprint arXiv:1809.00582.

Raheel Qader, Khoder Jneid, François Portet, and Cyril Labbé. 2018. Generation of company descriptions using concept-to-text and text-to-text deep models: dataset collection and systems evaluation. In Proceedings of the 11th International Conference on Natural Language Generation, pages 254–263.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732.

Ehud Reiter. 2018. A structured review of the validity of bleu. Computational Linguistics, 44(3):393–401.

Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.

Ehud Reiter and Robert Dale. 2000. Building natural language generation systems. Cambridge University Press.

Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In CVPR, volume 1, page 3.

Lei Sha, Lili Mou, Tianyu Liu, Pascal Poupart, Sujian Li, Baobao Chang, and Zhifang Sui. 2018. Order-planning neural text generation from structured data. In AAAI 2018, pages 5414–5421.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In ACL 2016, Volume 1: Long Papers.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer.

Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2017. Challenges in data-to-document generation. In EMNLP 2017, pages 2253–2263.

Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2018. Learning neural templates for text generation. In EMNLP 2018, pages 3174–3187.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.

Shyh-Horng Yeh, Hen-Hsen Huang, and Hsin-Hsi Chen. 2018. Precise description generation for knowledge base entities with local pointer network. In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pages 214–221. IEEE.
