Supporting Journalism by Combining Neural Language Generation and Knowledge Graphs

Marco Cremaschi, Federico Bianchi, Andrea Maurino, Andrea Primo Pierotti
Department of Computer Sciences, Systems and Communications, University of Milan-Bicocca, Viale Sarca 336, 20126 Milan, Italy
{marco.cremaschi,federico.bianchi,andrea.maurino}@unimib.it, [email protected]

Abstract

Natural Language Generation (NLG) is a field that is becoming relevant in several domains, including journalism. NLG techniques can be of great help to journalists, allowing a substantial reduction in the time required to complete repetitive tasks. In this position paper, we argue that automated tools can reduce the effort required of journalists when writing articles; at the same time, we introduce GazelLex (Gazette Lexicalization), a prototype that covers several steps of Natural Language Generation in order to create soccer articles automatically, using data from Knowledge Graphs and leaving journalists the possibility of refining and editing articles with additional information. We present our first results and the current limits of the approach, and we also describe some lessons learned that might be useful to readers who want to explore this field.

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Although automation is a phenomenon that is becoming more and more visible today, there are specialised jobs that require human effort to be completed. The job of a journalist is among these (Örnebring, 2010). However, recent technological progress in the field of Natural Language Generation (NLG) and the use of increasingly sophisticated artificial intelligence techniques allow the use of software capable of writing newspaper articles almost indistinguishable from human-written ones. These techniques can help journalists reduce the effort needed for repetitive tasks, such as data collection and drafting. This phenomenon is known as Automated Journalism: this new type of journalism uses algorithms to generate news under human supervision. During the past years several newsrooms, Forbes and ProPublica among the first, have begun to experiment with this technology, and adoption could spread soon (Graefe, 2016). Automated Journalism can bring a massive change to the sector: writing news is a business that endeavours to minimise costs while maintaining maximum efficiency and full speed, and thanks to this software the above-mentioned objectives can be achieved while generating good-quality articles (van Dalen, 2012). This new technology provides many advantages: the most evident are speed and the scale of news coverage. Of course, there are also problems and limitations. One of the most relevant is the dependence on structured data (Graefe, 2016), which is the reason why sports reports, financial articles, and forecasts are the topics most covered by software: they are all domains where the complexity of the topic can be managed by software using structured data. Similar structured data are not always available in other fields. In order to generate valuable text, approaches considering data contained in Knowledge Graphs (KGs) have recently been introduced in the literature (Gardent et al., 2017; Trisedya et al., 2018).

A Knowledge Graph (KG) describes real-world entities and the relations between them. KGs are an essential source of information, and their features allow the use of this information in different contexts, such as link prediction (Trouillon et al., 2016) and recommendation (Zhang et al., 2016). Popular KGs are the Google Knowledge Graph, Wikidata and DBpedia (Auer et al., 2007). Entities are defined in an ontology and thus can be classified using a series of types. The primary element of a KG for storing entity information is the Resource Description Framework (RDF) triple, in the format ⟨subject, predicate, object⟩. As RDF triples open many possibilities in Web data representation, utilising this data also in the NLG context is valuable (Perera et al., 2016). Interlinked KGs can be used to automatically extend the information relating to a given entity in an article. In our solution, we use DBpedia, one of the fastest growing Linked Data resources available free of charge; it is characterised by a high number of links from the Linked Data Cloud (https://wiki.dbpedia.org/dbpedia-2016-04-statistics). DBpedia is thus a central interlinking hub, an access point for retrieving information to be inserted in an article, as specified below.

Up to 2010, commercial providers in the NLG field were not popular, but in the last few years some companies have started to provide this kind of service. In 2016 there were 13 companies covering this field (Dörr, 2016) (e.g., Automated Insights, https://automatedinsights.com; Narrative Science, https://narrativescience.com). Approaches that try to integrate deep networks and text generation are now common in the literature (Gardent et al., 2017). These automated tools are going to become a standard method to help journalists during the news writing process.

We concentrate on examples of related work in the context of lexicalization from RDF data; we refer to surveys from the state of the art for a more detailed overview of the field (Reiter and Dale, 1997; Gatt and Krahmer, 2018; Moussallem et al., 2018). Semantic Web technologies like RDF can be used to enhance the power of current algorithms (Bouayad-Agha et al., 2012). The WebNLG challenge (Gardent et al., 2017) has been introduced to study the possibilities given by the combination of deep learning techniques and Semantic Web technologies. In a similar context, an approach based on Long Short-Term Memory (LSTM) networks has been proposed to generate text lexicalizations from RDF triples (Trisedya et al., 2018).

In this work, we aim to describe the possible automation process that can be used to help journalists in the news writing process. At the same time, we describe a new prototype we have created to support journalistic activities, GazelLex (Gazette Lexicalization). GazelLex, through the use of deep learning techniques, implements a Neural Machine Translation (NMT) approach to generate articles (sentences) starting from data composed of RDF triples. GazelLex is also able to generate videos containing the images and the prominent information of the article, and to generate audio using a speech synthesis module (Figure 1). To the best of our knowledge, our prototype is the first to provide an all-in-one integrated approach to NLG with RDF triples in the context of helping journalists write articles.

This paper is structured as follows: in Section 2, we analyse the state of the art on Natural Language Generation, showing that these methods to generate natural language are becoming popular. In Section 3 we describe our prototype, GazelLex, which combines neural methods and knowledge graphs to create soccer articles, and describe how this kind of tool can help journalism. In Section 4 we show a preliminary experimental analysis, while in Section 5 we provide conclusions.

Figure 1: The workflow of our model: content determination (attribute and schema definition), text structuring, training (training dataset generation from old articles and RDF triples, model training), and text generation (information extraction, RDF triples, deep learning), producing a new article with video and audio.

2 Natural Language Generation

NLG is a "sub-field of artificial intelligence and computational linguistics that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information" (Reiter and Dale, 1997; Reiter and Dale, 2000). In NLG, six "problems" must be addressed. Content determination: input data is always more detailed and richer than what we want to cover in the text (Gatt and Krahmer, 2018), so the aim is to filter and choose what to say. Text structuring: a clear text structure and the order of presentation of information are critical for readers; for this reason, pre-defining the templates is necessary. Sentence aggregation: sentences must not be disconnected; text therefore needs to be grouped in such a way that a "more fluid and readable" text (Gatt and Krahmer, 2018) is generated. Lexicalization: one of the most critical phases of the NLG process is how to express message blocks through words and phrases; this task is called lexicalization and concerns the actual conversion from messages to natural language. Reference expression generation: to avoid repetitions, it is essential to select ways to refer to entities using different methods (such as pronouns, proper nouns, or descriptions). Linguistic realisation: this concerns the combination of relevant words and phrases to form a sentence.
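The six tasks above can be sketched as a small pipeline. The following is a minimal illustrative sketch in Python; the function names, the toy rules, and the event data model are our own assumptions for illustration, not part of any system described in this paper:

```python
# A minimal sketch of the six classic NLG tasks as a pipeline.
# Function names, toy rules, and the event data model are illustrative
# assumptions, not the implementation of any system described here.

def content_determination(events):
    # Keep only events worth reporting.
    return [e for e in events if e["type"] == "goal"]

def text_structuring(events):
    # Present the selected content in chronological order.
    return sorted(events, key=lambda e: e["minute"])

def sentence_aggregation(events):
    # Group events by player so repeated facts share one sentence.
    grouped = {}
    for e in events:
        grouped.setdefault(e["player"], []).append(e)
    return grouped

def lexicalization(player, events):
    # Express a message block through words and phrases.
    verb = {1: "scored", 2: "scored twice",
            3: "scored a hat-trick"}.get(len(events), "scored")
    minutes = ", ".join(str(e["minute"]) for e in events)
    return f"{player} {verb} at minute {minutes}"

def referring_expression(sentence, player, prev_player):
    # Toy rule: use a pronoun when the previous sentence had the same subject.
    if player == prev_player:
        return "He" + sentence[len(player):]
    return sentence

def linguistic_realisation(sentences):
    # Combine the phrases into the final surface text.
    return ". ".join(sentences) + "."

def generate(events):
    selected = text_structuring(content_determination(events))
    sentences, prev = [], None
    for player, evs in sentence_aggregation(selected).items():
        sentences.append(referring_expression(
            lexicalization(player, evs), player, prev))
        prev = player
    return linguistic_realisation(sentences)
```

Real systems implement each stage with far richer rules or learned models; the point of the sketch is only the division of labour between the six tasks.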
As stated above, lexicalization is one of the most critical and complex tasks in the NLG process. Natural language vagueness and choosing the right words to express a concept are intricate issues to manage. Looking at the state of the art, recent research shows that an interesting solution in these cases is based on Machine Learning (ML) (Gatt and Krahmer, 2018). Moreover, WebNLG (Gardent et al., 2017), a recent challenge in the NLG field launched in 2017, confirms the idea that not only do we need ML methods to generate language, but we can also use KGs to enrich sentences with additional contextual information (e.g., contextual information about a player).

3 GazelLex

In this section, we give an example of the NLG process in a domain-specific view. As introduced previously, we developed a piece of software, named GazelLex, that can produce soccer articles. There are two main reasons for this choice: first of all, the project was partly commissioned by an Italian newspaper publisher. Furthermore, soccer and sports in general are good domains for developing NLG, because they are complex enough to be challenging, yet they are easy to manage and much data exists (Barzilay and Lapata, 2005). In this scenario we focused our attention on the final output, using a solution that combines neural networks with some handcrafted processes. We would like to underline that the data related to the games (e.g., number of goals, training) are extracted automatically from online services. Our approach is divided into five tasks, in order to address the five classic NLG sub-problems (Gatt and Krahmer, 2018); in the following, implementation details are provided for each phase.

3.1 Content Determination

To select the most relevant information, a handcrafted approach was chosen. To select the information to bring into the final output, we traced the most used data in soccer articles. One of the primary references was PASS, a personalised automated text system developed to write soccer articles (van der Lee et al., 2017). We took the kinds of information PASS used to fill its templates and enriched them with our own data fields. So we have some entities of type "TEAM", "FORMATION", and "COACH", and some predicates like "injuryAt", "yellowCardAt", and "violentFoulAt" (a list of all the entities and attributes is available here: https://goo.gl/LonnQ5). The software used this data to create triples, which the algorithms used to write the article.
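The mapping from structured match data to triples described in this step can be illustrated with a small sketch. The predicate names follow the examples given in this paper; the input event format and the function itself are our own assumptions:

```python
# Sketch: turning structured match data into (subject, predicate, object)
# triples, as in the content determination step. The input event format
# is an assumption; the predicate names follow the paper's examples.

def event_to_triples(event):
    """Map one match event to a list of (subject, predicate, object) triples."""
    triples = []
    if event["kind"] == "goal":
        triples.append((event["player"], "scoredAt", str(event["minute"])))
        triples.append((event["player"], "scoredWithScore", event["score"]))
        if event.get("assist"):
            triples.append((event["assist"], "assistTo", event["player"]))
    elif event["kind"] == "yellow_card":
        triples.append((event["player"], "yellowCardAt", str(event["minute"])))
    return triples

event = {"kind": "goal", "player": "Emre Can", "minute": 44,
         "score": "0-2", "assist": "Paulo Dybala"}
```

Applied to the event above, the function yields the three triples that later appear in the worked example of Table 1 (`scoredAt`, `scoredWithScore`, `assistTo`).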
3.2 Text Structuring

Since this is a domain-specific process, we developed a handcrafted template based on real articles. Aiming for a similar output, we imitated the journalist's job in the division of the text and the information contained in each part. We also considered the text structuring approach usually adopted in this domain, which presents the more general information first and then follows a chronological order (Gatt and Krahmer, 2018). In GazelLex, it is possible to find templates (e.g., complete or short article) resulting from the process described above, but it is also possible to modify them or create new ones (Figure 2).

Figure 2: Page for creating and editing templates related to Text Structuring.

3.3 Sentence Aggregation

In soccer data, many events could be redundant when written in an article. If a player scores a hat-trick in a match, writing the same sentence about each goal would be unpleasant to read, while grouping them in a single sentence is more concise and coherent. This task is "focused on domain- and application-specific rules" (Gatt and Krahmer, 2018). We aggregated the RDF triples defined in the preceding section to generate a group of triples that represents the content of our news article.

Figure 3: Page for the revision of the lexicalized triples.

3.4 Neural Lexicalization

As said above, we treated lexicalization as an NMT process, converting RDF data into natural language. To achieve this aim, we used a specific kind of neural network: the LSTM (Hochreiter and Schmidhuber, 1997). Their recent success in the NLG field is related to the many advantages they provide. Compared to traditional neural networks, LSTMs do not have limitations on input and output length. Furthermore, input and output are not independent, which is a vital advantage in language generation: to predict a word in a sentence it is useful to know and consider the previous ones, and the hidden states of the network keep the memory of what happened in previous timesteps. In this way, an LSTM can combine the previous state, the collected memory, and the input, allowing dependencies to be maintained in the long term. We experimented with NMT using OpenNMT, a now widely recognised tool for neural machine translation (http://opennmt.net) (Klein et al., 2017). Our neural architecture is based on a standard encoder-decoder structure with 4 LSTM layers containing 200 hidden neurons on both the encoder and the decoder. Input tokenization is based on the space character (recall that our RDF triples' elements are separated by spaces).
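Space-based tokenization implies that the triples are flattened into a single source sequence before being fed to the encoder. A minimal sketch of such a linearization follows; the underscore convention for multi-word entities is our own assumption, as the paper only states that triple elements are separated by spaces:

```python
# Sketch: linearizing RDF triples into a space-tokenized source sequence
# for an encoder-decoder model. The delimiter conventions are assumptions;
# the paper only states that triple elements are separated by spaces.

def linearize(triples):
    """Flatten (subject, predicate, object) triples into one source string.

    Multi-word entities are joined with underscores so that space-based
    tokenization keeps each element as a single token.
    """
    tokens = []
    for s, p, o in triples:
        tokens += [s.replace(" ", "_"), p, o.replace(" ", "_")]
    return " ".join(tokens)

triples = [("Paulo Dybala", "assistTo", "Emre Can"),
           ("Emre Can", "scoredAt", "44")]
```

A string produced this way would be one side of a training pair, with the reference sentence written by a journalist as the other side.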
3.5 Reference Expression Generation

We used different databases to avoid redundancy and give fluent text to the reader. Some online resources helped us to create a list of possible replacements for a team's or a player's name. Using DBpedia, we can find a nickname for an entity (Real Madrid players are also called Blancos or Merengues). Other resources we used are the Wikidata list of soccer team nicknames and the Topend Sports database.
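A nickname lookup of this kind can be expressed as a SPARQL query against the public DBpedia endpoint. The sketch below is one plausible way to do it, using the `dbp:nickname` property; the actual queries and properties used by the prototype are not specified in this paper, so both are assumptions (coverage also varies per DBpedia resource):

```python
# Sketch: querying DBpedia for an entity's nickname. The dbp:nickname
# property and the query shape are assumptions; fetch_nicknames needs
# network access when actually run.
import json
import urllib.parse
import urllib.request

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

def nickname_query(resource):
    """Build a SPARQL query for the nicknames of a DBpedia resource."""
    return (
        "SELECT ?nick WHERE { "
        f"<http://dbpedia.org/resource/{resource}> "
        "<http://dbpedia.org/property/nickname> ?nick . }"
    )

def fetch_nicknames(resource):
    """Send the query to the public endpoint and return nickname strings."""
    params = urllib.parse.urlencode(
        {"query": nickname_query(resource), "format": "application/json"})
    with urllib.request.urlopen(f"{DBPEDIA_ENDPOINT}?{params}") as resp:
        data = json.load(resp)
    return [b["nick"]["value"] for b in data["results"]["bindings"]]
```

The interlinked nature of DBpedia means the same pattern can retrieve other contextual facts about an entity to enrich an article.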

4 Analysis

In the following section, we show some insights into our tool and how it works. We present a use case, a recent soccer match, for which the generation process and the resulting text will be shown. The initial dataset for the training was created manually and consists of 4387 pairs of triples and lexicalizations. We drew inspiration from the state of the art to devise the architecture of our network (Gardent et al., 2017; Trisedya et al., 2018). From our primary experiments, the best performing model required two layers of bidirectional LSTM, but the model still suffers from some limitations (outlined in Section 4.2).

4.1 Use Case Exploration

To show the output of GazelLex, we took an example match and generated its lexicalization. We considered the football match played by Juventus F.C. and A.C. Chievo on the 21st of January. Our application gathered data from an online provider and converted the data into a triple format. A journalist can edit settings using a form (Figure 3): the journalist is in charge of deciding what is worth writing in the article and how it should appear to the end user; we recall that we can also define templates for our articles (Figure 2). The final output of this process looks like the one shown in Figure 4. In order to improve the quality of the sentences and to obtain results as close to the style of the journalist as possible (i.e., style transfer), GazelLex cyclically re-executes the training phase using the sentences validated by the journalist. The following is an example of lexicalization of triples relative to the use case (Table 1).

Figure 4: Lexicalization of triples from the Juventus-Chievo football match.

Table 1: Example of lexicalization.
⟨Paulo Dybala, assistTo, Emre Can⟩
⟨Emre Can, scoredAt, 44⟩
⟨Emre Can, scoredWithScore, 0-2⟩
"Paulo Dybala made an assist to Emre Can who scored at minute 44 increasing the gap to 0 - 2."

4.2 Current Limitations and Lessons Learned

In this section we outline the current limitations of our project and also report a few lessons learned that might be useful for other researchers who are currently exploring this field. One key part of the development process comes from the definition or selection of a good Knowledge Graph that can support the lexicalization; moreover, the definition of new RDF predicates is a difficult process that must be done carefully to avoid errors in the next steps. Our application currently supports the lexicalization of a small set of triples (i.e., we focused on goals and the final result); we decided to concentrate on this small set to generate a set of resulting sentences that can be manually inspected for quality. Our NLG model is based on a deep learning architecture, and thus some of the generated sentences are not well-formed owing to the structure of the network itself. While this is a problem that has to be solved in our setting, we have a journalist reviewing the article before it is released to the public: this allows us to have a model that is more flexible than standard pattern-based NLG, while the precision of the output can be controlled in a human-in-the-loop setting. Regarding the configuration of our model, we have replicated the state-of-the-art experiments (i.e., the approaches explained in (Gardent et al., 2017)) and we are currently experimenting with those architectures on our domain dataset. The results are yet to be quantitatively validated and they are preliminary, but they are promising as reported by journalists. In the future, we plan to carefully explore various architectures and consider the use of word embeddings to solve some of our current issues.

5 Conclusion

In this position paper we have analysed the future possibilities given by automated journalism. We have summarised the current state of the art on this topic, showing that there is increasing interest in automated natural language generation for the news sector. While we have shown an application related to the soccer domain, the principles and methodologies described are general, and they can be used in other fields (e.g., finance, weather reporting). We strongly believe that these tools can greatly help journalists work on what is really important (e.g., investigation, fact checking), leaving high-effort but low-value tasks to computers. The prototype we have described is a first step towards this automated process, and its results are surely promising.

Acknowledgments

This research has been supported in part by the Robo-Journalism project 2018-comm25-0047 in collaboration with Instal S.r.l. (instal.com). Special thanks to Carlo Mattioli and Alessandra Siano for their support during the development of the project.

References

[Auer et al. 2007] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735, Berlin, Heidelberg. Springer Berlin Heidelberg.

[Barzilay and Lapata 2005] Regina Barzilay and Mirella Lapata. 2005. Collective content selection for concept-to-text generation. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 331–338, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Bouayad-Agha et al. 2012] N. Bouayad-Agha, G. Casamayor, and L. Wanner. 2012. Natural language generation and semantic web technologies. Semantic Web Journal.

[Dörr 2016] Konstantin Nicholas Dörr. 2016. Mapping the field of algorithmic journalism. Digital Journalism, 4(6):700–722.

[Gardent et al. 2017] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation, pages 124–133. Association for Computational Linguistics.

[Gatt and Krahmer 2018] Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61(1):65–170, January.

[Graefe 2016] Andreas Graefe. 2016. Guide to automated journalism. Technical report, Tow Center for Digital Journalism, Columbia University.

[Hochreiter and Schmidhuber 1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

[Klein et al. 2017] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada, July. Association for Computational Linguistics.

[Moussallem et al. 2018] Diego Moussallem, Matthias Wauer, and Axel-Cyrille Ngonga Ngomo. 2018. Machine translation using semantic web technologies: A survey. Journal of Web Semantics, 51:1–19.

[Örnebring 2010] Henrik Örnebring. 2010. Technology and journalism-as-labour: Historical perspectives. Journalism, 11(1):57–74.

[Perera et al. 2016] Rivindu Perera, Parma Nand, and Gisela Klette. 2016. RealText-lex: A lexicalization framework for RDF triples. The Prague Bulletin of Mathematical Linguistics, 106(1):45–68.

[Reiter and Dale 1997] Ehud Reiter and Robert Dale. 1997. Building applied natural language generation systems. Natural Language Engineering, 3(1):57–87, March.

[Reiter and Dale 2000] Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, New York, NY, USA.

[Trisedya et al. 2018] Bayu Distiawan Trisedya, Jianzhong Qi, Rui Zhang, and Wei Wang. 2018. GTR-LSTM: A triple encoder for sentence generation from RDF data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1627–1637.

[Trouillon et al. 2016] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080.

[van Dalen 2012] Arjen van Dalen. 2012. The algorithms behind the headlines. Journalism Practice, 6(5-6):648–658.

[van der Lee et al. 2017] Chris van der Lee, Emiel Krahmer, and Sander Wubben. 2017. PASS: A Dutch data-to-text system for soccer, targeted towards specific audiences. In Proceedings of the 10th International Conference on Natural Language Generation, pages 95–104. Association for Computational Linguistics.

[Zhang et al. 2016] Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 353–362. ACM.