The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Open Domain Event Text Generation∗

Zihao Fu,1 Lidong Bing,2 Wai Lam1
1Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong
2DAMO Academy, Alibaba Group
{zhfu, wlam}@se.cuhk.edu.hk; [email protected]

Abstract

Text generation tasks aim at generating human-readable text from different kinds of data. Normally, the generated text only contains the information included in the data, and its application is thus restricted to some limited scenarios. In this paper, we extend the task to an open domain event text generation scenario with an entity chain as its skeleton. Specifically, given an entity chain containing several related event entities, the model should retrieve from a trustworthy repository (e.g. Wikipedia) the detailed information of these entities and generate a description text based on the retrieved sentences. We build a new dataset called WikiEvent1 that provides 34K pairs of entity chain and its corresponding description sentences. To solve the problem, we propose a wiki augmented generator framework that contains an encoder, a retriever, and a decoder. The encoder encodes the entity chain into a hidden space while the decoder decodes from the hidden space and generates description text. The retriever retrieves relevant text from a trustworthy repository, which provides more information for generation. To alleviate the overfitting problem, we propose a novel random drop component that randomly deletes words from the retrieved sentences, making our model more robust for handling long input sentences. We apply the proposed model on the WikiEvent dataset and compare it with a few baselines. The experimental results show that our carefully-designed architecture does help generate better event text, and extensive analysis further uncovers the characteristics of the proposed task.

[Figure 1: Illustration of the Open Domain Event Text Generation task. The event description "The second Punic war between Rome and Carthage broke out in 219 BC and ended in 201 BC." is generated based on the given entity chain "Punic War — Roman Republic — Ancient Carthage".]

Introduction

In recent years, many natural language generation (NLG) tasks have been proposed to generate human-readable text based on different kinds of data. Gardent et al. (2017a; 2017b) propose the WebNLG task to generate text descriptions based on a group of knowledge base triples. The E2E (Novikova, Dušek, and Rieser 2017) task aims to generate restaurant reviews with respect to the given attributes. The WikiBio (Lebret, Grangier, and Auli 2016) task generates the first sentence of a person's biography based on its Wikipedia Infobox.

The current NLG tasks always assume that the generated text only contains the information included in the given data. However, this assumption is rigid and less practical in the real world. For example, in the novel writing scenario shown in Fig. 1, when a writer wants to describe the "Punic War", what comes into his mind first is only some entity words such as "Punic War", "Roman Republic" and "Ancient Carthage". Then, he digs into his own knowledge or turns to certain resources for help to acquire knowledge of these entities and compose a complete sentence. It would be very useful if such event text descriptions could be automatically generated.

In this paper, we propose a new task named Open Domain Event Text Generation (ODETG). In this task, we are given a sequence of entities called an entity chain. The goal is to generate a sentence with this entity chain as its skeleton. We assume that we only know the entity chain and require the model to retrieve from a trustworthy repository (e.g. Wikipedia) the detailed information of these entities and generate a description text based on the retrieved sentences. The ODETG task can be applied to more scenarios that the traditional generation task cannot handle easily.

∗The work described in this paper is substantially supported by a grant from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Code: 14204418).
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
1Available at https://github.com/fuzihaofzh/WikiEvent

Traditional tasks are restricted since they assume that the given data contain all the needed information, so the generated sentences can only describe the given information. Different from traditional generation tasks, the ODETG task requires the model to retrieve information from a trustworthy repository. This nature allows our new task to be used in much broader scenarios; meanwhile, it also makes the task more challenging.

We propose a novel Wiki Augmented Generator (WAG) framework to solve this problem of generating sentences based on a given entity chain. Our framework firstly retrieves sentences from a trustworthy sentence repository with queries formed from the entities, and then it generates an event description with the retrieved sentences. Accordingly, the core part of our framework contains an encoder, a retriever and a decoder. The encoder encodes the given entity chain into a hidden space. The retriever retrieves the related information for the given entity chain; precisely, it retrieves the sentences describing the entities from the sentence repository. The decoder generates text based on the encoder output and the retrieved text. We observe that since the retrieved sentences can be extremely long, the training procedure of the generation model can easily get overfitted. We employ a random drop component to alleviate the overfitting problem.

We build a real-world dataset called WikiEvent to evaluate the performance of our model. The dataset is built based on Wikipedia and provides 34K pairs of entity chains and their corresponding description text. We also build a sentence repository based on Wikipedia for the retriever to fetch useful text information for the generation sub-task. Presumably, it can be regarded as a trustworthy one and, of course, it could be replaced by other resources such as daily news. In summary, our contributions are as follows. (1) We propose a new task, namely, Open Domain Event Text Generation (ODETG), which is much more challenging and practical. (2) We contribute a new dataset called WikiEvent which can be used to evaluate the models for this task. (3) We propose a novel model tackling the characteristics of this task to solve the ODETG problem. (4) We propose an effective random drop component to tackle the overfitting problem encountered in the training process.

Related Works

Nowadays, many data-to-text generation tasks have been proposed to generate text aiming at explaining different kinds of data in a more human-readable form. For example, the WebNLG task (Gardent et al. 2017a; 2017b) generates sentences corresponding to a group of related triples sampled from DBpedia (Auer et al. 2007; Lehmann et al. 2015). A similar task is to generate a description text from the attributes of an entity, such as generating a short biography with person attributes in the WikiBio dataset (Lebret, Grangier, and Auli 2016), or a review from the restaurant attributes in the E2E dataset (Novikova, Dušek, and Rieser 2017). Chen and Mooney (2008) and Wiseman, Shieber, and Rush (2017) propose to generate a summarization of a competition. All these tasks strive to generate text exactly explaining the data without adding extra information to help understand the data.

On the other hand, many works have been proposed to generate text from different kinds of prompts. Fan, Lewis, and Dauphin (2018) use reddit's WritingPrompts data to expand a short story to a long one. (Park and Ahn 2018; Feng et al. 2018; Peng et al. 2018) propose to generate sentences from keywords. Yan (2016) proposes to generate a poem from given keywords. Drissi, Watkins, and Kalita (2018) propose to generate text by learning from the output of a summarization model. Wang, Chen, and Lee (2019) propose to generate text from hidden explainable topics. (Li et al. 2013; Li, Luong, and Jurafsky 2015; Martin et al. 2018) propose to generate text based on an abstract state translation. However, instead of using a trustworthy repository to provide information, these tasks generate text solely based on the prompts and knowledge from the training set. Different from them, our framework expands the semantics of the input entity chain by retrieving rich content from a trustworthy repository, i.e. Wikipedia.

WikiEvent Dataset Construction

We build our WikiEvent dataset based on the English Wikipedia dump2 with three steps: (1) selecting articles describing some events; (2) extracting hyperlinked entities to form (entity chain, text) pairs; (3) filtering the candidates to build the final dataset. In addition, we also build a trustworthy sentence repository from the Wikipedia dump.

2https://dumps.wikimedia.org/enwiki/20190320/

Article Selection

In order to find articles describing some events, we determine some keywords (namely Battle, Revolution, Revolt, Campaign, Rebellion, Siege, War, Conflict, Invasion, Incident, Conference, Treaty, Affair, Uprising, Expedition) that may refer to some events and choose the articles containing any keyword in their titles. We ignore the functional articles that contain words such as "Category", "List of", "disambiguation", "Image:", "File:" in the title.

Candidate Extraction

We split the article text into sentences with NLTK (Bird and Loper 2004). After we get the tokenized sentences, we extract the entities and corresponding text with regular expressions, as sketched below. For example, given the text in Wikipedia markup language "[[Washington, D.C.|Washington]] is the capital of the [[United States|U.S.]]", an entity is marked in the square brackets and it is separated into two parts by "|". The first part in the brackets is the formal entity name while the second part is the surface text shown in the rendered Wikipedia page. Therefore, the extracted entity chain is "Washington, D.C. — United States", while the extracted text description is "Washington is the capital of the U.S.". In this manner, one entity in different entity chains keeps the same form, while its different surface forms may occur in different description texts. After the extraction, we get a candidate set of (entity chain, text) pairs, and each of them contains an entity chain as the skeleton and a piece of text as the target description.
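The extraction step can be sketched as follows. This is a minimal illustration; the regular expression and helper name are ours, not the authors' released code:

```python
import re

# one wiki link: [[Formal name|surface text]] or [[Formal name]]
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def extract_pair(markup_sentence):
    """Return (entity chain, surface text) for one markup sentence."""
    # formal entity names, in order of appearance
    chain = [m.group(1) for m in LINK.finditer(markup_sentence)]
    # replace each link with its surface form (or the formal name if no "|")
    text = LINK.sub(lambda m: m.group(2) or m.group(1), markup_sentence)
    return chain, text

# extract_pair("[[Washington, D.C.|Washington]] is the capital of the [[United States|U.S.]]")
# -> (['Washington, D.C.', 'United States'], 'Washington is the capital of the U.S.')
```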

                train    test   dev
Battle          9,770    644    629
Revolution      1,221    64     83
Revolt          328      23     22
Campaign        1,748    113    98
Rebellion       425      31     23
Siege           1,689    109    112
War             8,187    551    558
Conflict        548      34     33
Invasion        584      40     41
Incident        506      43     48
Conference      2,371    154    166
Treaty          1,261    106    83
Affair          410      23     34
Uprising        299      19     25
Expedition      653      46     45
Total           30,000   2,000  2,000

Table 1: Dataset statistics.

Filtering

We use a rule-based filter to further clean the candidate pairs obtained above, since some of them may not be suitable for our task. We require the number of entities in each pair to be less than 10 and the target sentence length to be in the range of 20 to 50 words. We also filter out text containing certain keywords (such as "Category", "Portal", "Timeline of") since we observe that such text is unlikely to be a desirable event description. After some other filtering steps, we finally obtain a set of 34,000 (entity chain, event text) pairs, and split it into train, development (dev), and test sets. The statistics of the final dataset are shown in Table 1.

Trustworthy Sentence Repository

We also build a trustworthy sentence repository for storing the raw information supporting the process of retrieving useful related information for text generation. This repository is a collection of information from trustworthy sources (including Wikipedia, news, etc.). We split all the articles in the dump into sentences as in the Candidate Extraction section. By importing the sentences into Elasticsearch3, they can be retrieved with search queries, i.e. entity names.

3https://www.elastic.co/
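A minimal sketch of the import step, assuming the dump has already been split into sentences; the index name, the "text" field, and the 8.x elasticsearch Python client API are our assumptions, not details given in the paper:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_sentences(sentences):
    """Import repository sentences into Elasticsearch so that they can
    later be retrieved with entity names as search queries."""
    for i, sent in enumerate(sentences):
        es.index(index="wiki_sentences", id=i, document={"text": sent})
```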
Our Proposed Framework

Overview

Formally, we denote $e_i$ as the $i$th entity in an entity chain $E_c = [e_1, e_2, \cdots, e_n]$. Given an entity chain $E_c$, the goal of the model is to generate a target sequence $G = [g_1, g_2, \cdots, g_l]$, where $g_i$ is the $i$th word and $G$ maximizes the conditional probability:

$$\max_G P(G \mid E_c).$$

Our proposed Wiki Augmented Generator (WAG) framework is illustrated in Fig. 2. It is composed of three components, i.e. an encoder, a retriever and a decoder. The encoder, on the bottom left, contains a typical S2S encoder to encode the entity chain into hidden vectors. In order to utilize more detailed information, we require the model to retrieve related information for each entity from the trustworthy repository. The retriever, on the bottom right, uses the input entities as the query words to retrieve some sentences from the trustworthy repository. After that, the retrieved text goes through a random drop (RD) component which randomly removes some words during training. The retriever only depends on the entity chain and thus can be invoked as a pre-processing step. We denote $s_i$ as the $i$th sentence in the retrieved sentence list $S_w = [s_1, s_2, \cdots, s_n]$. The decoder, on the top, generates a hidden state vector $q$ for each generation step. Then, the decoder uses $q$ as a query vector to impose an attention weight on each word of the retrieved text for calculating an aggregated vector. The aggregated vector and the hidden state vector are then used to generate the final output.

Encoder

We develop the encoder based on a common S2S framework. It encodes the entity chain into a hidden representation which will be used by the decoder. Each entity in the entity chain $E_c$ is firstly concatenated one after another to form an input sequence $E_c = [w_{e_1}^{(1)}, \cdots, w_{e_1}^{(n_{e_1})}, \cdots, w_{e_n}^{(1)}, \cdots, w_{e_n}^{(n_{e_n})}]$, in which $w_{e_i}^{(j)}$ stands for the $j$th word in the $i$th entity in $E_c$. Subsequently, $E_c$ is sent to an embedding layer to get the embedding of each word:

$$E_h = \mathrm{Emb}(E_c),$$

where $E_h \in \mathbb{R}^{d_e \times n_c}$, $d_e$ stands for the dimension of the embedding, and $n_c$ is the total word count of $E_c$. The embedded representation $E_h$ is then sent to an RNN layer (here, we use a multi-layer multi-directional LSTM) to get the hidden representation for each word:

$$H = \mathrm{RNN}(E_h),$$

where $H \in \mathbb{R}^{d_h \times n_c}$ and $d_h$ is the size of each hidden vector. The final hidden state and cell state of this RNN network will be used to initialize the decoder's RNN network.
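A minimal PyTorch sketch of this encoder, under our own assumptions (a word-level vocabulary and a 2-layer bidirectional LSTM whose per-word output size is $d_h$; the paper sets $d_e = d_h = 500$):

```python
import torch
import torch.nn as nn

class ChainEncoder(nn.Module):
    def __init__(self, vocab_size, d_e=500, d_h=500, num_layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_e)       # E_h = Emb(E_c)
        self.rnn = nn.LSTM(d_e, d_h // 2, num_layers,  # H = RNN(E_h)
                           bidirectional=True)

    def forward(self, chain_ids):
        # chain_ids: (n_c,) word ids of the flattened entity chain
        E_h = self.emb(chain_ids).unsqueeze(1)         # (n_c, 1, d_e)
        H, (h_n, c_n) = self.rnn(E_h)                  # H: (n_c, 1, d_h)
        # h_n, c_n initialize the decoder RNN (reshaped as needed)
        return H.squeeze(1), (h_n, c_n)
```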

[Figure 2: Our proposed Wiki Augmented Generator (WAG). The encoder (bottom left) encodes the entity chain $E_c$ into $E_h$ and $H$; the retriever (bottom right) fetches sentences $S_w$ from the repository; the decoder (top) queries the embedded retrieved text $E_w$ with $q$ to obtain $c'$ and generate the output, e.g. "The second Punic war ...".]

Retriever

The retriever retrieves sentences from the sentence repository with traditional retrieval methods. We use the existing Elasticsearch tool to retrieve the relevant sentences from the trustworthy repository. It gives each sentence a score and returns the highly scored sentences. The score4 in Elasticsearch is calculated as follows:

$$s(e, d) = \frac{1}{e_q} \cdot C(e, d) \cdot \sum_{t \in e} \left( \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)^2 \cdot B(t) \cdot \mathrm{df}_d \right),$$

where $e$ is the query, $\frac{1}{e_q}$ is the query normalization factor, $C(e, d)$ is the coordination factor, $\mathrm{tf}(t, d)$ is the term frequency of the term $t$ in the document $d$, $\mathrm{idf}(t)$ is the inverse document frequency of the term $t$, $B(t)$ is the boost value for $t$, and $\mathrm{df}_d$ is the field-length norm of the document $d$. The detailed definitions can be found in the reference guide4. When retrieving related sentences for an entity chain $E_c$, we retrieve sentences for each entity respectively and combine them together. For the entity $e_i$, the score function can be expressed as:

$$\mathrm{score}(e_i, d) = \mathbb{1}[e_i \in d] \cdot s(e_1 + e_2 + \cdots + e_n, d),$$

where $\mathbb{1}[\cdot]$ is an indicator function: it equals 1 when $e_i \in d$ and 0 otherwise, and $e_1 + e_2 + \cdots + e_n$ denotes the concatenation of the entities into a long sequence. Subsequently, we select the top-$n_r$ sentences for each entity to make the retrieved sentence list $S_w$. It is then concatenated into a sequence $S_w = [w_{s_1}^{(1)}, \cdots, w_{s_1}^{(n_{s_1})}, \cdots, w_{s_m}^{(1)}, \cdots, w_{s_m}^{(n_{s_m})}]$, where $w_{s_i}^{(j)}$ stands for the $j$th word in the $i$th sentence of $S_w$. Since the retriever only depends on the entity chain, it can be conducted as a pre-processing step. Meanwhile, the retriever component is only based on statistics of words and thus it has no parameters to learn.

4https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html
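A minimal sketch of this per-entity retrieval step. The bool query is our own approximation of the score above: the must clause scores each sentence against the concatenated entities, i.e. $s(e_1 + \cdots + e_n, d)$, while the filter clause enforces the indicator $\mathbb{1}[e_i \in d]$ without affecting the relevance score (index and field names follow our earlier sketch):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def retrieve_for_chain(entity_chain, top_n=1):
    """Top-n sentences per entity: score against the concatenated
    entities, keeping only sentences that contain the entity itself."""
    full_query = " ".join(entity_chain)
    S_w = []
    for entity in entity_chain:
        resp = es.search(index="wiki_sentences", size=top_n, query={
            "bool": {
                "must": {"match": {"text": full_query}},       # s(e_1+...+e_n, d)
                "filter": {"match_phrase": {"text": entity}},  # 1[e_i in d]
            }
        })
        S_w += [hit["_source"]["text"] for hit in resp["hits"]["hits"]]
    return S_w
```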

Decoder

The decoder first applies a random drop component on the retriever output, and then it utilizes the remaining token sequence, combined with the encoder output, to generate the target text.

It has been observed that the retrieved sentences can be extremely long. Consequently, the training procedure can easily get overfitted, since the model can just ad-hocly focus on a particular portion of the retrieved text $S_w$ and use it to generate the target. To alleviate this problem, we propose a random drop (RD) mechanism that randomly removes words in $S_w$ with a drop ratio. Different from the traditional dropout layer, RD directly deletes the words, while the dropout layer just sets random dimensions to 0. Let $S_w' = \mathrm{RD}(S_w)$ denote the sequence after the RD operation. The RD component is only applied in the training stage and is set ineffective in the testing stage, since its purpose is to prevent overfitting during training.
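A minimal sketch of the RD component, assuming a tokenized retrieved sequence (the paper's drop ratio is 0.5; the fallback for an empty result is our own safeguard):

```python
import random

def random_drop(tokens, drop_ratio=0.5, training=True):
    """Randomly delete words from the retrieved sequence during training.
    Unlike dropout, which zeroes dimensions, RD removes tokens entirely
    and shortens the sequence; it is disabled at test time."""
    if not training:
        return tokens
    kept = [t for t in tokens if random.random() >= drop_ratio]
    return kept if kept else tokens[:1]  # avoid an empty sequence
```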

The hidden state and cell state of each decoding step ($t > 0$) are calculated by passing the previous step's states to an RNN cell:

$$h_t, c_t = \mathrm{RNN}([h_{t-1}, c_{t-1}]),$$

where $h_t \in \mathbb{R}^{d_h}$ is the hidden output of the RNN cell and $c_t \in \mathbb{R}^{d_h}$ is the cell state. The encoder's final state is used as $h_0$. $h_t$ is then sent to an attention layer to get the traditional attention context vector, called the query vector $q$ here, from $H$ weighted by the relatedness to the current state $h_t$:

$$q = \mathrm{Attn}(h_t, H),$$

where Attn is the traditional attention function (Bahdanau, Cho, and Bengio 2014). In the traditional S2S framework, the output of the attention layer, i.e. $q$, is directly used to predict the next output word. In our framework, since we have the extra retrieved text $S_w$, $q$ will be used as a query vector to fetch additional context information from $S_w$.

Specifically, we firstly embed $S_w$ as:

$$E_w = \mathrm{Emb}(S_w),$$

where $E_w \in \mathbb{R}^{d_e \times |S_w|}$ and $|\cdot|$ denotes the sequence length. It has been observed that $|S_w|$ can still be very large even if we use the random drop component to remove some words. If we directly use $E_w$ for the decoder, a lot of irrelevant information may be introduced and it thus hurts the performance. To alleviate this problem, we use an attention mechanism to select relevant information for the decoder. The vector $q \in \mathbb{R}^{d_h}$ first goes through a linear layer to get the transformed query vector:

$$q' = W_e q + b_e,$$

where $W_e \in \mathbb{R}^{d_e \times d_h}$ and $b_e \in \mathbb{R}^{d_e}$ are the transform matrix and bias vector respectively. The alignment weight is calculated by a simple matrix product of the embedding $E_w$ and $q'$:

$$s = E_w^\top q',$$

where $s \in \mathbb{R}^{|S_w|}$ is the alignment vector. Afterwards, $s$ is passed through a softmax layer to get a probability over all the words:

$$a = \frac{e^s}{\sum_{j=1}^{|S_w|} e^{s_j}},$$

where $a \in \mathbb{R}^{|S_w|}$ and the sum of all elements in $a$ equals 1. The probability gives each word in the retrieved sentences a weight. Then, a context vector is calculated as the weighted sum of the $E_w$ matrix with the weight $a$:

$$c' = E_w a,$$

where $c' \in \mathbb{R}^{d_e}$. It contains the information we need from the retrieved text.

The decoder generates the output text from the context vector and the query vector jointly. The context vector $c'$ is concatenated with the input query vector. Then it goes through a linear layer and a tanh layer to get the final context vector:

$$c = \tanh(W_c [q; c'] + b_c),$$

where $c \in \mathbb{R}^{d_h}$, $W_c \in \mathbb{R}^{d_h \times (d_h + d_e)}$, and $b_c \in \mathbb{R}^{d_h}$. $c$ is the final context vector and is used to calculate the probability over all the words as:

$$y_t = \mathrm{Softmax}(W_y c + b_y),$$

where $y_t \in \mathbb{R}^{|V|}$, $|V|$ is the size of the target vocabulary, $W_y \in \mathbb{R}^{|V| \times d_h}$, and $b_y \in \mathbb{R}^{|V|}$. The copy mechanism (He et al. 2017) is utilized to solve the out-of-vocabulary problem.
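A minimal PyTorch sketch of this retrieval attention, our own rendering of the equations above (batching and the copy mechanism are omitted; module and variable names are ours):

```python
import torch
import torch.nn as nn

class RetrievedTextAttention(nn.Module):
    """Attention over the embedded retrieved words E_w, queried by q."""
    def __init__(self, d_h=500, d_e=500, vocab_size=50000):
        super().__init__()
        self.W_e = nn.Linear(d_h, d_e)         # q' = W_e q + b_e
        self.W_c = nn.Linear(d_h + d_e, d_h)   # c = tanh(W_c [q; c'] + b_c)
        self.W_y = nn.Linear(d_h, vocab_size)  # y_t = softmax(W_y c + b_y)

    def forward(self, q, E_w):
        # q: (d_h,) decoder query; E_w: (d_e, |S_w|) embedded retrieved words
        q_prime = self.W_e(q)                  # (d_e,)
        s = E_w.t() @ q_prime                  # alignment vector, (|S_w|,)
        a = torch.softmax(s, dim=0)            # word-level attention weights
        c_prime = E_w @ a                      # retrieved context c', (d_e,)
        c = torch.tanh(self.W_c(torch.cat([q, c_prime])))  # final context
        return torch.softmax(self.W_y(c), dim=0)           # y_t over vocab
```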

Experiments

Experiment Setup

The output is evaluated with several metrics, namely BLEU (Papineni et al. 2002), NIST (Doddington 2002), METEOR (Banerjee and Lavie 2005), ROUGE_L (Lin 2004), and CIDEr (Vedantam, Lawrence Zitnick, and Parikh 2015). The evaluation is conducted with the package provided by Novikova, Dušek, and Rieser (2017), and we follow the setting in the package where ROUGE_L is a harmonic mean of precision and recall with $\beta = 1.2$, i.e. $F_\beta = \frac{(1+\beta^2) P R}{R + \beta^2 P}$. We build our model based on the traditional sequence-to-sequence (S2S) model (Sutskever, Vinyals, and Le 2014; Cho et al. 2014; Klein et al. 2017)5. We set both the embedding size and the hidden size to 500, the dropout rate to 0.3, and the optimization algorithm to SGD with an initial learning rate of 1.0 and a decay rate of 0.5. The random drop ratio is set to 0.5 and the retrieved sentence number is set to 1 for each entity. All the hyper-parameters are tuned on the dev set and some details are presented later.

5http://opennmt.net/OpenNMT-py/index.html

Comparison Models

We employ a few baselines that are developed with the S2S framework and its variants that use our entity chain and the retrieved sequence as input.

S2S-E utilizes the traditional S2S framework (Sutskever, Vinyals, and Le 2014; Cho et al. 2014; Klein et al. 2017) with the standard attention (Bahdanau, Cho, and Bengio 2014; Luong, Pham, and Manning 2015) and the copy mechanism (He et al. 2017). We flatten the entity chain into a simple sequence and separate each entity with a special token. It is then used as the input of the S2S framework.

S2S-W has the same structure as the S2S-E model. We firstly retrieve Wikipedia sentences from the repository with the entity chain as the key. Then, we directly use the retrieved text as the input of the S2S framework.

S2S-E-W-1 uses the entity chain as the S2S input sequence and uses the retrieved text as an additional feature vector. It encodes the retrieved text with an RNN network and averages the output states of all time steps to convert the text into a feature vector. This vector is then concatenated to the input vector at each decoding step to generate the output sequence.

S2S-E-W-2 firstly retrieves sentences from the trustworthy repository and concatenates the retrieved text at the end of the flattened entity chain. The concatenation is then sent into the S2S framework.

S2S-E-W-2' uses the same input as S2S-E-W-2 but adds a random drop (RD) component on the text before it is concatenated at the end of the flattened entity chain.

S2S-E-W-2'' is similar to S2S-E-W-2'. The difference is that S2S-E-W-2' drops words randomly while S2S-E-W-2'' drops words with a probability depending on whether the current word is near a word in the entity chain. Words far from the entity words are more likely to be dropped.

WAG w/o RD is an ablation variant of our WAG model, obtained by removing the RD component.

              BLEU   NIST   METEOR  ROUGE_L  CIDEr
S2S-E         0.200  4.38   0.200   0.377    1.47
S2S-W         0.157  3.92   0.156   0.316    1.11
S2S-E-W-1     0.210  4.84   0.201   0.377    1.51
S2S-E-W-2     0.224  5.16   0.204   0.385    1.62
S2S-E-W-2'    0.224  5.11   0.209   0.394    1.64
S2S-E-W-2''   0.231  5.51   0.209   0.387    1.62
WAG w/o RD    0.231  5.43   0.207   0.383    1.69
WAG           0.238  5.64   0.216   0.397    1.76

Table 2: Main results.

[Figure 3: The effect of the number of retrieved sentences. BLEU, METEOR, ROUGE_L-F, ROUGE_L-R, NIST, and CIDEr are plotted for 0 to 10 retrieved sentences per entity.]

[Figure 4: Training curve on every 2,000 steps. Accuracy, perplexity, and cross entropy on the train and valid sets for WAG and WAG w/o RD.]

Experimental Results

Main Results. The experimental results are shown in Table 2. It can be observed that our WAG model outperforms all the baseline models, and presumably we can draw the following three conclusions. (1) Compared with S2S-E and S2S-W, WAG utilizes both types of input information and thus it can naturally outperform the two baselines. (2) Compared with the four S2S-E-W-X baselines, WAG can more appropriately utilize the two types of information to generate better text. (3) The comparison with WAG w/o RD shows that our proposed RD component does help train a more robust model so that it can perform better at testing.

Specifically, S2S-W performs very poorly because, without the entity chain given as a skeleton, the model does not have sufficient guidance regarding what aspect to talk about, and it is even worse than only using the entity chain, i.e. S2S-E. The S2S-E-W-X baselines can easily outperform S2S-E and S2S-W, no matter how they combine the two types of information, even in a rather simple manner as used by S2S-E-W-1. S2S-E-W-2 further improves over S2S-E-W-1 by concatenating the entity chain and the text, which produces a more favourable input for the S2S model. By applying the drop operation for preventing training overfitting, S2S-E-W-2' and S2S-E-W-2'' get even better results. Overall, S2S-E-W-2'' performs slightly better than S2S-E-W-2' (i.e. more advantage on BLEU and NIST, even on METEOR, less disadvantage on ROUGE and CIDEr), probably because its drop mechanism is more sophisticated than RD, i.e. deleting more words that are less likely to be used in the target text. Note that WAG adopts the RD component for the drop operation since it is more straightforward and efficient. Even so, WAG can still outperform S2S-E-W-2'' on all 5 metrics, which can be attributed to the tailor-made model architecture of WAG. Concretely, WAG utilizes the fetched information at each decoding step by querying the retrieved text (i.e. its embedding matrix) with the query vector of that step, and thus it can generate text that complies better with the skeleton and meanwhile provides more appropriate semantic context from the fetched information.

Effect of the Number of Retrieved Sentences. In the above main results, we retrieve 1 sentence for each entity. To study the influence of the retrieved sentence number, we conduct an experiment by setting the number from 0 to 10. The result is illustrated in Fig. 3. It can be observed that under both the precision-oriented metric (i.e. BLEU) and the recall-oriented metric (i.e. ROUGE_L-R), introducing the retrieved text does help improve the result. In general, retrieving one or two sentences for each query entity achieves the best performance. As the number of retrieved sentences increases, the sentence quality decreases and some irrelevant sentences could be involved. Thus, it enlarges the difficulty for the model to distill the useful information.

Effect of Random Drop. To have a closer study of the effect of our random drop component, we plot 3 metrics, namely accuracy, perplexity and cross entropy, every 2,000 steps during the training stage in Fig. 4. The accuracy is defined as the ratio of the correct word count over the total word count. The perplexity is defined as the exponential of the loss over the total word count. The cross entropy is defined as the loss over the total word count. Higher accuracy with lower perplexity and cross entropy represents a better performance of the model. It can be observed that in training, WAG w/o RD outperforms WAG with RD.

[Figure 5: Attention map for the decoder, with three highlighted areas ①, ②, ③. The entity chain is "South America — Treaty of Madrid (1750)". The y-axis is the generated text, while the x-axis is the retrieved text, where "SENTSEP" is the separator between two sentences.]
The reason is that randomly dropping words makes the training data harder to fit; the other side of the coin is that the model is less likely to get overfitted.

Entity Chain: Western Desert Campaign — Second World War — Battle of El Agheila
Retrieved Sentences: During the Second World War the north-eastern between and the Egyptian border was the scene of heavy fighting between the Axis powers and the Western Allies, a period known as the Western Desert Campaign. The "Siege of" lasted for 241 days in 1941, after Axis forces advanced through from El Agheila in Operation Sonnenblume against Allied forces in, during the Western Desert Campaign (1940–1943) of the Second World War. The defence of "Outpost Snipe" in, took place in the Second Battle of, part of the Western Desert Campaign during the Second World War.
Generated Text: The Battle of El Agheila was a battle of the Western Desert Campaign of the Second World War.
Gold Text: The Battle of El Agheila was a brief engagement of the Western Desert Campaign of the Second World War.

Entity Chain: Tsardom of Russia — Constantinople — Istanbul — Ottoman Empire — Ottoman Empire
Retrieved Sentences: It ended the Russo-Turkish War of 1686-1700. The "Vilayet of Constantinople" or "Istanbul" was a first-level administrative division (vilayet) of the Ottoman Empire, encompassing the imperial capital, Constantinople (Istanbul). The treaty was concluded on 3 July (O.S.) 13 July 1700 in Constantinople. The Tsardom of Russia and the Ottoman Empire agreed on a truce set to expire in thirty years. "Garabet Yazmaciyan" (1868 Constantinople, Ottoman Empire; 1929 Istanbul, Turkey) was a prominent Ottoman painter of Armenian descent. birth place = Istanbul, Ottoman Empire
Generated Text: The Treaty of Constantinople was signed on 3 July 1700 between the Tsardom of Russia and the Ottoman Empire.
Gold Text: The Treaty of Constantinople or Istanbul was signed on 13 July 1700 between the Tsardom of Russia and the Ottoman Empire.

Entity Chain: Lo Giang — Battle of Lo Giang
Retrieved Sentences: During the Tet Offensive of 1968, the People's Army of Vietnam (PAVN) 2nd Division tried to capture Da Nang but they were defeated in the Battle of Lo Giang. "Lo Khac Tam" is a former Vietnamese officer and Lieutenant General who fought for the army of North Vietnam during the Vietnam War.
Generated Text: The Battle of Lo Giang was fought during the Vietnam War.
Gold Text: The Battle of Lo Giang was a battle during the Vietnam War.

Entity Chain: Dongnae — Siege of Dongnae
Retrieved Sentences: It resulted in the capture of Dongnae, a mountain castle on the way to Hanseong, by the Japanese. In 1592, after Japan's request for aid conquering Ming China was rebuffed, approximately 200,000 Japanese soldiers invaded Joseon, and the Japanese invasions of Korea (1592–98) began.
Generated Text: The Siege of Dongnae was a battle of the Japanese invasions of Korea (1592–98).
Gold Text: The Siege of Dongnae was a siege that occurred on 25 May 1592 during the Japanese invasions of Korea (1592–98).

Table 3: Examples of generated text.

Therefore, in validation, we observe that our WAG using RD suffers less from the overfitting problem. Specifically, both models get overfitted after 12,000 steps, but WAG w/o RD gets worse more quickly after that.

Case Study

In order to show that the decoder can focus on the correct position in the retrieved text, we draw the attention map for each generated word, as shown in Fig. 5. Brighter stands for higher attention. It can be observed that during the generation procedure, three major areas are focused on. The decoder pays more attention to the portion of "the Treaty of Madrid separated" in the retrieved text, i.e. area ①, when it generates the subject of the output. Then it utilizes the information fetched from area ② for outputting the time of the event. Finally, it outputs the parties of the treaty with the information fetched from area ③. It can be concluded that the decoder focuses on specific information and utilizes the corresponding information for generation.

Table 3 presents some generated sentences to give an intuitive understanding of the task and the model. It can be observed that the generated sentences contain more specific information with the given entity chain as the skeleton. For example, in the second case, our model retrieved the information "the treaty of Constantinople was signed on 13 July", which is not given by the entity chain. It shows that our proposed model can successfully retrieve related information from the trustworthy repository and can give a more detailed description of the entity chain.

Conclusions

We propose a new task, namely Open Domain Event Text Generation (ODETG). This task can be applied in scenarios where traditional generation tasks are not applicable. Besides, we contribute a new dataset, namely WikiEvent, to evaluate models that solve the ODETG problem. It contains 34K (entity chain, text) pairs, each composed of an entity chain and the corresponding description text. Moreover, we propose a novel framework containing an encoder, a retriever and a decoder. The encoder encodes the entity chain into a hidden representation. The retriever retrieves related information to enhance the generation procedure. The decoder, with a random drop component, successfully decodes a sequence depicting the entity chain with more detailed information. Our proposed model on the WikiEvent dataset outperforms a few baselines, which shows that our carefully-designed architecture does help generate better event text.

Extensive analysis further uncovers the characteristics of the new task and our proposed model.

In the future, we will explore the following directions: 1) we will try to incorporate more event articles to enrich the trustworthy repository; 2) we will explore using reinforcement learning to further refine the retrieved text.

References

Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; and Ives, Z. G. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, 722–735.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Banerjee, S., and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72.

Bird, S., and Loper, E. 2004. NLTK: The natural language toolkit. In The Companion Volume to the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 214–217. Barcelona, Spain: Association for Computational Linguistics.

Chen, D. L., and Mooney, R. J. 2008. Learning to sportscast: A test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning, 128–135. ACM.

Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734.

Doddington, G. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, 138–145.

Drissi, M.; Watkins, O.; and Kalita, J. 2018. Hierarchical text generation using an outline. arXiv preprint arXiv:1810.08802.

Fan, A.; Lewis, M.; and Dauphin, Y. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 889–898.

Feng, X.; Liu, M.; Liu, J.; Qin, B.; Sun, Y.; and Liu, T. 2018. Topic-to-essay generation with neural networks. In IJCAI, 4078–4084.

Gardent, C.; Shimorina, A.; Narayan, S.; and Perez-Beltrachini, L. 2017a. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 179–188.

Gardent, C.; Shimorina, A.; Narayan, S.; and Perez-Beltrachini, L. 2017b. The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation, 124–133.

He, S.; Liu, C.; Liu, K.; and Zhao, J. 2017. Generating natural answers by incorporating copying and retrieving mechanisms in sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 199–208.

Klein, G.; Kim, Y.; Deng, Y.; Senellart, J.; and Rush, A. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, System Demonstrations, 67–72.

Lebret, R.; Grangier, D.; and Auli, M. 2016. Neural text generation from structured data with application to the biography domain. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 1203–1213.

Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P. N.; Hellmann, S.; Morsey, M.; van Kleef, P.; Auer, S.; et al. 2015. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2):167–195.

Li, B.; Lee-Urban, S.; Johnston, G.; and Riedl, M. 2013. Story generation with crowdsourced plot graphs. In Twenty-Seventh AAAI Conference on Artificial Intelligence.

Li, J.; Luong, M.-T.; and Jurafsky, D. 2015. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057.

Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. In Workshop on Text Summarization Branches Out.

Luong, T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 1412–1421.

Martin, L. J.; Ammanabrolu, P.; Wang, X.; Hancock, W.; Singh, S.; Harrison, B.; and Riedl, M. O. 2018. Event representations for automated story generation with deep neural nets. In Thirty-Second AAAI Conference on Artificial Intelligence.

Novikova, J.; Dušek, O.; and Rieser, V. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, 201–206.

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 311–318.

Park, D., and Ahn, C. W. 2018. LSTM encoder-decoder with adversarial network for text generation from keyword. In International Conference on Bio-Inspired Computing: Theories and Applications, 388–396. Springer.

Peng, N.; Ghazvininejad, M.; May, J.; and Knight, K. 2018. Towards controllable story generation. In Proceedings of the First Workshop on Storytelling, 43–49.

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.

Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4566–4575.

Wang, Y.-S.; Chen, Y.-N.; and Lee, H.-Y. 2019. TopicGAN: Unsupervised text generation from explainable latent topics.

Wiseman, S.; Shieber, S.; and Rush, A. 2017. Challenges in data-to-document generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2253–2263.

Yan, R. 2016. i, Poet: Automatic poetry composition through recurrent neural networks with iterative polishing schema. In IJCAI, 2238–2244.
