Naver Labs Europe's Systems for the Document-Level Generation and Translation Task at WNGT 2019

Fahimeh Saleh* (Monash University) [email protected]
Alexandre Bérard, Ioan Calapodescu (Naver Labs Europe) [email protected]
https://europe.naverlabs.com
Laurent Besacier (Université Grenoble-Alpes) [email protected]
Abstract

Recently, neural models have led to significant improvements in both machine translation (MT) and natural language generation (NLG). However, the generation of long descriptive summaries conditioned on structured data remains an open challenge. Likewise, MT that goes beyond sentence-level context is still an open issue (e.g., document-level MT or MT with metadata). To address these challenges, we propose to leverage data from both tasks and do transfer learning between MT, NLG, and MT with source-side metadata (MT+NLG). First, we train document-based MT systems with large amounts of parallel data. Then, we adapt these models to pure NLG and MT+NLG tasks by fine-tuning with smaller amounts of domain-specific data. This end-to-end NLG approach, without data selection and planning, outperforms the previous state of the art on the Rotowire NLG task. We participated in the "Document Generation and Translation" task at WNGT 2019, and ranked first in all tracks.

1 Introduction

Neural Machine Translation (NMT) and Neural Language Generation (NLG) are at the forefront of recent advances in Natural Language Processing. Although state-of-the-art NMT systems have reported impressive performance on several languages, there are still many challenges in this field, especially when context is considered. Currently, the majority of NMT models translate sentences independently, without access to a larger context (e.g., other sentences from the same document, or structured information). Additionally, despite improvements in text generation, generating long descriptive summaries conditioned on structured data (e.g., table records) is still an open challenge. Existing models lack accuracy, coherence, or adequacy to the source material (Wiseman et al., 2017).

[*] This work was done while the author was visiting Naver Labs Europe.

The two aspects most often addressed by data-to-text generation techniques are identifying the most important information in the input data, and verbalizing the data as a coherent document: "What to talk about and how?" (Mei et al., 2016). These two challenges have been addressed separately, as different modules in pipeline systems (McKeown, 1985; Reiter and Dale, 2000), in an end-to-end manner with PCFGs or SMT-like approaches (Mooney and Wong, 2007; Angeli et al., 2010; Konstas and Lapata, 2013), or, more recently, with neural generation models (Wiseman et al., 2017; Lebret et al., 2016; Mei et al., 2016). In spite of generating fluent text, end-to-end neural generation models perform weakly in terms of content selection (Wiseman et al., 2017). Recently, Puduppully et al. (2019) trained an end-to-end data-to-document generation model on the Rotowire dataset (English summaries of basketball games with structured data) [1]. They aimed to overcome the shortcomings of end-to-end neural NLG models by explicitly modeling content selection and planning in their architecture.

[1] https://github.com/harvardnlp/boxscore-data

We suggest in this paper to leverage the data from both MT and NLG tasks with transfer learning. As both tasks have the same target (e.g., English-language stories), they can share the same decoder. The same encoder can also be used for NLG and MT if the NLG metadata is encoded as a text sequence. We first train domain-adapted document-level NMT models on large amounts of parallel data. Then we fine-tune these models on small amounts of NLG data, transitioning from MT to NLG. We show that separate data selection and ordering steps are not necessary if the NLG model is trained at the document level and given enough information. We propose a compact way to encode the data available in the original database, and enrich it with some extra facts that can easily be inferred with minimal knowledge of the task. We also show that NLG models trained with this data capture document-level structure and can select and order information by themselves.

2 Document-Level Generation and Translation Task

The goal of the Document-Level Generation and Translation (DGT) task is to generate summaries of basketball games, in two languages (English and German), using either structured data about the game, a game summary in the other language, or a combination of both. The task features 3 tracks, times 2 target languages (English or German): NLG (Data to Text), MT (Text to Text) and MT+NLG (Text + Data to Text). The data and evaluation are document-level, encouraging participants to generate full documents rather than sentence-based outputs. Table 1 describes the allowed parallel and monolingual corpora.

Corpus       Lang(s)  Split  Docs   Sents
DGT          EN-DE    train  242    3247
                      valid  240    3321
                      test   241    3248
Rotowire     EN       train  3398   45.5k
                      valid  727    9.9k
                      test   728    10.0k
WMT19-sent   EN-DE    train  -      28.5M
WMT19-doc    EN-DE    train  68.4k  3.63M
News-crawl   EN       train  14.6M  420M
             DE       train  25.1M  534M

Table 1: Statistics of the allowed resources. The English sides of DGT-train, valid and test are respectively subsets of Rotowire-train, valid and test. More monolingual data is available, but we only used Rotowire and News-crawl.

3 Our MT and NLG Approaches

All our models (MT, NLG, MT+NLG) are based on Transformer Big (Vaswani et al., 2017). Details for each track are given in the following sections.

3.1 Machine Translation Track

For the MT track, we followed these steps:

1. Train sent-level MT models on all the WMT19 parallel data (doc and sent) plus DGT-train.

2. Back-translate (BT) the German and English News-crawl by sampling (Edunov et al., 2018).

3. Re-train sentence-level MT models on a concatenation of the WMT19 parallel data, DGT-train and BT. The latter was split into 20 parts, one part for each training epoch. This is almost equivalent to oversampling the non-BT data by 20 and doing a single epoch of training.

4. Fine-tune the best sentence-level checkpoint (according to valid perplexity) on document-level data. Like Junczys-Dowmunt (2019), we truncated the WMT documents into sequences of at most 1100 BPE tokens. We also aggregated random sentences from WMT-sent into documents, and upsampled the DGT-train data. Contrary to Junczys-Dowmunt (2019), we do not use any sentence separator or document boundary tags.

5. Fine-tune the best doc-level checkpoint on DGT-train plus back-translated Rotowire-train and Rotowire-valid.

We describe the pre-processing and hyperparameters in Section 4. In steps (1) and (3), we train for at most 20 epochs, with early stopping based on newstest2014 perplexity. In step (4), we train for at most 5 additional epochs, with early stopping according to DGT-valid perplexity (doc-level). In the last step, we train for 100 epochs, with BLEU evaluation on DGT-valid every 10 epochs. We also compute the BLEU score of the best checkpoint according to DGT-valid perplexity, and keep the checkpoint with the highest BLEU.

The models in step (5) overfit very quickly, reaching their best valid perplexity after only 1 or 2 epochs. For DE-EN, we found that the best DGT-valid BLEU was achieved anywhere between 10 and 100 epochs (sometimes with a high valid perplexity). For EN-DE, perplexity and BLEU correlated better, and the best checkpoint according to both scores was generally the same. The same observations apply when fine-tuning on NLG or MT+NLG data in the next sections.

Like Berard et al. (2019), all our MT models use corpus tags: each source sentence starts with a special token which identifies the corpus it comes from (e.g., Paracrawl, Rotowire, News-crawl). At test time, we use the DGT tag.

One thing to note is that document-level decoding is much slower than its sentence-level counterpart [2]. The goal of this document-level fine-tuning was not to increase translation quality, but to allow us to use the same model for MT and NLG, which is easier to do at the document level.

[2] On a single V100, sent-level DGT-valid takes 1 minute to translate, while doc-level DGT-valid takes 6 minutes.

3.2 Natural Language Generation Track

The original metadata consists of one JSON document per game, containing information about the teams and their players. We first generate compact representations of this metadata as text sequences. Then, we fine-tune our doc-level MT models (from step 4) on the NLG task, using this representation on the source side and full stories on the target side. We train on a concatenation of DGT-train, Rotowire-train and Rotowire-valid. We filter the latter to remove games that are also in DGT-valid. Our metadata has the following structure:

1. Date of the game as text.

2. Home team information (winner/loser tag, team name and city, points in the game, season wins and losses, and team-level scores) and information about its next game (date, home/visitor tag, other team's name), inferred from the other JSON documents in Rotowire-train.

3. Visiting team information and details on its next game.

4. N best players of the home team (player name, followed by all his non-zero scores in a fixed order and his starting position). Players are sorted by points first, then by rebounds and assists.

5. N best players of the visiting team.

To help the models identify useful information, we use a combination of special tokens and positional information. For instance, the home team is always first, but a […]

4 Experiments

4.1 Data Pre-processing

We filter the WMT19-sent parallel corpus with langid.py (Lui and Baldwin, 2012) and remove sentences of more than 175 tokens or with a length ratio greater than 1.5. Then, we apply the official DGT tokenizer (based on NLTK's word_tokenize) to the non-tokenized text (everything but DGT and Rotowire).

We apply BPE segmentation (Sennrich et al., 2016) with a joined SentencePiece-like model (Kudo and Richardson, 2018), with 32k merge operations, learned on WMT + DGT-train (English + German). The vocabulary threshold is set to 100 and inline casing is applied (Berard et al., 2019). We employ the same joined BPE model and Fairseq dictionary for all models. The metadata is translated into the source language of the MT model used for initialization, and segmented into BPE (except for the special tokens) to allow transfer between MT and NLG. Then, we add a corpus tag to each source sequence, which specifies its origin (Rotowire, News-crawl, etc.).

Like Junczys-Dowmunt (2019), we split WMT19 documents that are too long into shorter documents (maximum 1100 BPE tokens). We also transform the sent-level WMT19 data into doc-level data by shuffling the corpus and grouping consecutive sentences into documents of random sizes.
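As an illustration, the metadata encoding of Section 3.2 can be sketched in a few lines of Python. This is a minimal sketch, not the authors' actual implementation: the field names, tag tokens and JSON layout below are hypothetical, since the paper specifies the overall structure (date, team blocks, N best players sorted by points, then rebounds and assists) but not the exact token inventory.

```python
# Sketch of flattening one Rotowire-style game record into a source-side
# token sequence (Section 3.2). Tag tokens and field names are hypothetical.

def encode_players(players, n_best):
    """Keep the N best players, sorted by points, then rebounds, then assists."""
    ranked = sorted(players,
                    key=lambda p: (p["pts"], p["reb"], p["ast"]),
                    reverse=True)
    tokens = []
    for p in ranked[:n_best]:
        tokens += ["<player>", p["name"]]
        # Non-zero scores only, in a fixed order, then the starting position.
        for key in ("pts", "reb", "ast"):
            if p[key]:
                tokens.append(f"<{key}> {p[key]}")
        tokens.append(f"<pos> {p['position']}")
    return tokens

def encode_game(game, n_best=4):
    """Flatten a game into one text sequence; the home team always comes first."""
    tokens = ["<DGT>", game["date"]]  # corpus tag, then the date as text
    for side in ("home", "visitors"):
        team = game[side]
        tag = "<winner>" if team["won"] else "<loser>"
        tokens += [f"<{side}>", tag, team["city"], team["name"],
                   f"<pts> {team['pts']}",
                   f"<wins> {team['wins']}", f"<losses> {team['losses']}"]
        tokens += encode_players(team["players"], n_best)
    return " ".join(tokens)
```

Sorting before truncating to the N best players is what makes a small N viable; in Table 7, dropping the players entirely ("No player") costs about 2.6 BLEU on DGT-valid relative to the 3-player baseline.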
[…] of 10 batches. When fine-tuning on DGT-train or on Rotowire + DGT-train (step 5 of the MT track, or NLG/MT+NLG fine-tuning), we use a fixed learning rate schedule (Adam with a 0.00005 learning rate) and a much smaller batch size (1500 tokens, on a single GPU, without delayed updates). We train for 100 epochs, compute DGT-valid perplexity at each epoch, and DGT-valid BLEU every 10 epochs.

4.3 BLEU Evaluation

Submitted models. For each track, we selected the best models according to their BLEU score on DGT-valid. The scores are shown in Table 3, and a description of the submitted models is given in Table 4. We compute BLEU using SacreBLEU with its tokenization set to "none" [4], as the model outputs and references are already tokenized with NLTK. Hayashi et al. (2019) give the full results of the task: the scores of the other participants, and values of other metrics (e.g., ROUGE).

[4] SacreBLEU signature: BLEU+case.mixed+numrefs.1+smooth.exp+tok.none+version.1.3.1

Our NLG models are "unconstrained" because the WMT19 parallel data, which we used for pre-training, was not allowed in this track. Similarly, we made two submissions for DE-EN MT: one constrained, where we fine-tuned the doc-level MT model on DGT-train only, and one unconstrained, where we also used back-translated Rotowire-train and valid. All the MT and MT+NLG models are ensembles of 5 fine-tuning runs. Cascading the English NLG model with the ensemble of EN-DE MT models gives a BLEU score of 14.9 on DGT-test, slightly lower than the end-to-end German NLG model (16.1). We see that in the same data conditions (unconstrained mode), the MT+NLG models are not better than the pure MT models. Furthermore, we evaluated the MT+NLG models with an MT-only source, and found only a slight decrease of ≈ 0.3 BLEU, which confirms our suspicion that the NLG information is mostly ignored.

Target  Track    Constrained  Valid  Test
EN      NLG      no           23.5   20.5
        MT       yes          60.2   58.2
        MT       no           64.2   62.2
        MT+NLG   yes          64.4   62.2
DE      NLG      no           16.9   16.1
        MT       yes          49.8   48.0
        MT+NLG   yes          49.4   48.2

Table 3: Doc-level BLEU scores on the DGT valid and test sets of our submitted models in all tracks.

Track        N best players  Details
NLG (EN)     4               Rotowire BT + DGT-train + tags
NLG (DE)     6               Rotowire BT + DGT-train + tags
MT (DE-EN)   N/A             Unconstrained: Rotowire BT + DGT-train + tags + ensemble
                             Constrained: DGT-train only + ensemble
MT (EN-DE)   N/A             DGT-train only + ensemble
MT+NLG (EN)  3               Rotowire BT + DGT-train + 20% text masking + tags + ensemble
MT+NLG (DE)  3               Rotowire BT + DGT-train + tags + ensemble

Table 4: Description of our submissions.

NMT analysis. Table 5 shows the BLEU scores of our MT models at different stages of training (sent-level, doc-level, fine-tuned), and compares them against one of the top contestants of the WMT19 news translation task (Ng et al., 2019).

Target  Model       Valid  Test  News 2019
EN      FAIR 2019   48.5   47.7  41.0
        Sent-level  55.6   54.2  40.9
        Doc-level   56.5   55.0  38.5
        Fine-tuned  61.7   59.6  21.7
DE      FAIR 2019   37.5   37.0  40.8
        Sent-level  47.3   46.7  42.9
        Doc-level   48.2   47.5  41.6
        Fine-tuned  48.0   46.7  41.3

Table 5: BLEU scores of the MT models at different stages of training, and comparison with the state of the art. Scores on DGT-valid and DGT-test are doc-level, while News 2019 is sent-level (and so is decoding). On the latter, we used the DGT corpus tag for DE-EN, and the Paracrawl tag for EN-DE (we chose the tags with the best BLEU on newstest2014). Scores by the "fine-tuned" models are averaged over 5 runs.

English NLG analysis. Table 6 shows a 5.7 BLEU improvement on Rotowire-test by our English NLG model compared to the previous state of the art. Figure 1 shows the DGT-valid BLEU scores of our English NLG models when varying the number of players selected in the metadata. We […]

Figure 1: DGT-valid BLEU (by the best checkpoint) depending on the maximum number of selected players for the English NLG track. [Plot not reproduced; axes: DGT-valid BLEU (20 to 23) vs. maximum players per team (0 to 8).]

Model                         Valid  Test
Baseline (3 players, sorted)  22.7   20.4
No player                     20.1   18.8
All players, sorted           22.7   20.9
All players, shuffled         22.0   20.0
(1) No next game              22.0   19.9
(2) No week day               22.2   20.5
(3) No player position        22.6   20.5
(4) No team-level sums        22.5   20.5
(5) Remove most tags          22.6   20.8
(1) to (5)                    21.3   19.7

Table 7: English NLG ablation study, starting from a 3 best player baseline (the submitted NLG model has 4 players). BLEU averaged over 3 runs. Standard deviation ranges between 0.1 and 0.4.

Week day, player position or team-level aggregated scores can be removed without hurting BLEU. However, information about next games seems useful. Interestingly, relying on position only and removing most tags (e.g., […]
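The model-selection loop used during fine-tuning (perplexity tracked every epoch, BLEU computed every 10 epochs, and the checkpoint with the highest DGT-valid BLEU submitted, since perplexity alone proved unreliable for DE-EN) can be sketched as follows. The helpers train_epoch, valid_ppl and valid_bleu are hypothetical stand-ins for the actual Fairseq training step and SacreBLEU scoring.

```python
# Sketch of the checkpoint-selection loop (Sections 3.1 and 4.3).
# train_epoch, valid_ppl and valid_bleu are hypothetical callables.

def select_checkpoint(train_epoch, valid_ppl, valid_bleu,
                      max_epochs=100, bleu_every=10):
    """Train up to max_epochs and return the checkpoint with the best BLEU."""
    best = {"bleu": float("-inf"), "epoch": None, "ppl": None}
    for epoch in range(1, max_epochs + 1):
        checkpoint = train_epoch(epoch)   # one epoch of training
        ppl = valid_ppl(checkpoint)       # perplexity tracked every epoch
        if epoch % bleu_every == 0:
            bleu = valid_bleu(checkpoint)  # expensive doc-level decoding
            if bleu > best["bleu"]:
                best = {"bleu": bleu, "epoch": epoch, "ppl": ppl,
                        "checkpoint": checkpoint}
    return best
```

Decoupling the cheap signal (perplexity, every epoch) from the expensive one (document-level BLEU, every 10 epochs) matters here because doc-level decoding is roughly six times slower than sent-level decoding on this data.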
Table 8: Correctly predicted information that is not explicitly in the metadata (+), or hallucinations (-).
4.4 Qualitative Evaluation

As shown in Table 2, the NLG model (3-player) has several good properties besides coherent document-level generation and the ability to "copy" metadata. It has learned generic information about the teams and players. As such, it can generate relevant information which is absent from the metadata (see Table 8). For example, the model correctly predicts the name of the stadium where the game was played. This implies that it knows which team is hosting (this information is encoded implicitly by the position of the team in the data), and what the stadium of this team's city is (not in the metadata). Other facts that are absent from the metadata, but predicted correctly nonetheless, are team aliases (e.g., the Sixers) and player nicknames (e.g., the Greek Freak). The model can also generate other surface forms for the team names (e.g., the other Cavalier).

The NLG model can infer some information from the structured data, like double-digit scores, "double-doubles" (e.g., when a player has more than 10 points and 10 assists) and "triple-doubles". On the other hand, some numerical facts are inaccurate (e.g., score differences or comparisons). Some facts which are not present in the structured data, like player injuries, season-level player statistics, the current ranking of a team, or timing information, are hallucinated. We believe that most of these hallucinations could be avoided by adding the missing facts to the structured data. More rarely, the model duplicates a piece of information.

Another of its flaws is poor generalization to new names (team, city or player). This can quickly be observed by replacing a team name with a fictional one in the metadata. In this case, the model almost always reverts to an existing team. This may be due to overfitting, as earlier checkpoints seem to handle unknown team names better, even though they give lower BLEU. This generalization property could be assessed by doing a new train/test split that does not share the same teams.

5 Conclusion

We participated in the 3 tracks of the DGT task: MT, NLG and MT+NLG. Our systems rely heavily on transfer learning, from document-level MT (a high-resource task) to document-level NLG (a low-resource task). Our submitted systems ranked first in all tracks.

For the MT track, the usual domain adaptation techniques performed well. The MT+NLG models did not show any significant improvement over pure MT. The MT models are already very good and probably do not need the extra context (which is generally encoded in the source-language summary already). Finally, our NLG models, bootstrapped from the MT models, produce fluent and coherent text, and are even able to infer some facts that are not explicitly encoded in the structured data. Some of their current limitations (mostly hallucinations) could be solved by adding extra information (e.g., injured players, current team rank, number of consecutive wins, etc.).

Our aggressive fine-tuning allowed us to specialize MT models into NLG models, but it will be interesting to study whether a single model can solve both tasks at once (i.e., with multi-task learning), possibly in both languages.

References

Gabor Angeli, Percy Liang, and Dan Klein. 2010. A Simple Domain-Independent Probabilistic Approach to Generation. In EMNLP.

Alexandre Berard, Ioan Calapodescu, and Claude Roux. 2019. NAVER LABS Europe's Systems for the WMT19 Machine Translation Robustness Task. In WMT - Shared Task Papers.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding Back-Translation at Scale. In EMNLP.

Hiroaki Hayashi, Yusuke Oda, Alexandra Birch, Ioannis Konstas, Andrew Finch, Minh-Thang Luong, Graham Neubig, and Katsuhito Sudoh. 2019. Findings of the Third Workshop on Neural Generation and Translation. In Proceedings of the Third Workshop on Neural Generation and Translation (WNGT).

Marcin Junczys-Dowmunt. 2019. Microsoft Translator at WMT 2019: Towards Large-Scale Document-Level Neural Machine Translation. In WMT - Shared Task Papers.

Ioannis Konstas and Mirella Lapata. 2013. A Global Model for Concept-to-Text Generation.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In EMNLP.

Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural Text Generation from Structured Data with Application to the Biography Domain. In EMNLP.

Marco Lui and Timothy Baldwin. 2012. langid.py: An Off-the-shelf Language Identification Tool. In Proceedings of the ACL 2012 System Demonstrations.

Kathleen R. McKeown. 1985. Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Cambridge University Press.

Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. What to talk about and how? Selective Generation using LSTMs with Coarse-to-Fine Alignment. In NAACL-HLT.

Raymond J. Mooney and Yuk Wah Wong. 2007. Generation by Inverting a Semantic Parser that Uses Statistical Machine Translation. In NAACL-HLT.

Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 News Translation Task Submission. In WMT - Shared Task Papers.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling Neural Machine Translation. In WMT.

Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-Text Generation with Content Selection and Planning. In Proceedings of the AAAI Conference on Artificial Intelligence.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In ACL.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NIPS.

Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in Data-to-Document Generation. In EMNLP.