Automatic Generation of Formula 1 Reports

Tijmen Krijnen
University of Twente
P.O. Box 217, 7500AE Enschede
The Netherlands
[email protected]

ABSTRACT
In this work, a new language generation system is proposed. Language generation in the area of sports reports is an active interest of the NLG community. However, there has been no research into the automatic generation of Formula 1 race reports in this area. The goal of the system that will be created is to produce short Formula 1 race reports. The system will create the reports with the help of a deep learning system called GPT-2, which will be finetuned to produce short Formula 1 reports. This research shows that it is possible to generate reports with fluency, grammaticality, and cohesiveness similar to human-written reports. However, it also shows the two biggest pain points for the system: factuality and repetitiveness.

Keywords
natural language generation, data-to-text, deep learning, transformer models, GPT-2

1. INTRODUCTION
Natural language generation (NLG) and natural language understanding (NLU) are both parts of natural language processing (NLP). NLP is a research field that lies at the boundary of linguistics, computer science, and artificial intelligence, dealing with natural language interaction between computers and humans. NLU is the process of understanding natural language and producing a structured representation, whereas NLG is the process of turning structured data into human-readable text.

NLG has existed since the mid-1960s, when Weizenbaum developed ELIZA, the first program to make natural language conversation between humans and computers possible [1]. Further examples of NLG are the generation of weather forecasts [2], patient reports [3], and persuasive fashion product descriptions [4].

There are two types of NLG systems: template-based NLG systems and deep learning NLG systems. A template-based NLG system utilizes non-linguistic inputs and maps these to linguistic structures with gaps, where the data from the non-linguistic input is used to fill the gaps in the linguistic structure based on rules [5].

The other possibility is to use deep learning to generate natural language. A deep learning NLG system is created by training it on an extensive data set of input-and-output data. The input does not need to be structured data. For example, the input could be images, and the NLG system could be trained to produce a caption for them, or the input could be texts, and the system must learn how to translate or summarize them.

The proposed research will focus on automatically generating Formula 1 race reports; the race reports will be generated using an existing pre-trained NLG system. The research aims to finetune the existing NLG system such that the generated reports are factual, non-repetitive, fluent, grammatically correct, and cohesive.

In this paper, firstly, the problem statement and the research question that it prompts will be described in section 2. Secondly, the related work in deep learning NLG systems will be discussed in section 3. Then, how the research has been performed will be described in section 4. After that, the results of the language model and the results of the survey will be discussed in section 5. Lastly, section 6 will conclude the research and discuss possible future work in this area of research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
35th Twente Student Conference on IT, July 2nd, 2021, Enschede, The Netherlands.
Copyright 2021, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science.

2. PROBLEM STATEMENT
Automatic report generation for sports is an active topic of research inside the NLG community. In the literature there are, among others, systems for generating football reports [6], National Basketball Association (NBA) news [7], and minor baseball league news [8]. However, as of now, there is no system for the generation of Formula 1 reports. Consequently, this research will focus on creating an NLG system that can automatically generate Formula 1 reports.

Automatic report generation is an active topic inside the NLG community because NLG has multiple benefits. NLG can help media outlets publish more content and make their content more diverse, e.g., historical coverage. This is possible because NLG is faster and cheaper than writing reports. While it would be most efficient if the generated article did not need any attention, this is not always the case. NLG systems do not always generate quality reports, and sometimes they need to be edited before they can be published. However, this would still help improve efficiency and save time and money.

Another task for which NLG could potentially be used is live reporting and updating. As soon as data comes in, the language model could generate a minor update for people who cannot watch the event live. Another benefit of automatic reporting is the ability to translate these updates and reports into multiple languages automatically.

2.1 Research Question
The problem statement leads to the following research question.

How well is a deep learning NLG system able to generate race reports for Formula 1?

The research question will be answered through the following sub-questions.

RQ1: How factual are the generated reports?
RQ2: How repetitive are the generated reports?
RQ3: How fluent are the generated reports?
RQ4: How grammatically correct are the generated reports?
RQ5: How cohesive are the generated reports?

3. RELATED WORK
This section of the report will discuss the previous research performed and the available literature on this matter.

In 2019, Radford et al. created the generative pre-trained transformer 2 (GPT-2), a language model. A language model is a system which has learned to predict the probability of a word or a sequence of words, given the previous word(s). This means that, when provided the start of a sentence, GPT-2 can complete it with a highly likely continuation. GPT-2 has not been trained with explicit supervision for a single task [9]. On the contrary, this system was designed as a general-purpose language model, which should be able to perform all kinds of tasks, such as question answering, translation, text generation, and summarization.

In 2019, Budzianowski and Vulić researched the possibility of using GPT-2 for task-oriented dialogue systems [10]. They proposed a task-oriented dialogue model that only uses text input. The model has not been finetuned further for the purpose of a task-oriented dialogue system. They found that the model performed just as well as a model trained explicitly for this purpose.

In 2020, Lee and Hsiang researched the possibility of using GPT-2 to generate patent claims [11]. They found that the generated patent claims could be coherent in surface form. However, they also found that a more considerable challenge was how to measure the semantic quality of the generated text.

[...]

A dataset or API with race information was required. This dataset or API needed to supply information about the race. At first, the requirements for the API were quite extensive: the API needed to provide information such as race winner, race duration, number of laps, lap times, location, race name, and pit stops. Furthermore, the drivers' and constructors' standings were also required. However, during the second phase it became clear that not all of this data could be generated. There were a couple of potential APIs, such as Sportradar, Sportmonks, api-sports, and the Ergast API. In the end, the Ergast API was used, since it had all of the required data and it did not require a subscription or any kind of payment. Furthermore, there also exists a wrapper package for the Ergast API called fastf1 in Python, which makes it very easy to work with the data. Using this package, the data required for the second phase was acquired. The data that was acquired was:

• winning driver's name (e.g. "Michael Schumacher")
• second place driver's name (e.g. "Mika Hakkinen")
• third place driver's name (e.g. "Ayrton Senna")
• winning constructor's name (e.g. "Ferrari")
• race name (e.g. "Italian Grand Prix")
• race location (e.g. "Monza")
• year (e.g. "2000")

For the output data, Formula 1 race reports were required. The reports could be from any source; however, they do all need to be in English. There are, of course, many news sites providing Formula 1 race reports; however, to train the language model, as many reports as possible were required. So, even though sites like BBC Sport and the official Formula 1 website provide quality reports, they either provided only a couple of reports, or the website was more difficult to scrape. In the end, a website called Crash.net was used; it was chosen because it contained Formula 1 reports dating all the way back to the year 2000 and because it was structured such that scraping it would not take too much time. On average, each year from 2000 until 2020 had about 20 races; in the end, this resulted in a total of 397 scraped reports. An example of a report from crash.net can be seen in Example: Crash.net in subsection 4.3.

4.2 Phase Two: Finetuning
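The template-based approach described in section 1 (a linguistic structure with gaps, filled from non-linguistic data based on rules [5]) can be sketched as follows. The template wording and field names here are invented for illustration, not taken from any cited system:

```python
# Minimal sketch of a template-based NLG system: a fixed linguistic
# structure with gaps, filled from structured race data.
# The template text and the field names are illustrative assumptions.

TEMPLATE = ("{winner}, driving for {constructor}, won the {year} "
            "{race_name} at {location}, ahead of {second} and {third}.")

def fill_template(race: dict) -> str:
    """Fill the gaps in the linguistic structure with race data."""
    return TEMPLATE.format(**race)

race = {
    "winner": "Michael Schumacher", "constructor": "Ferrari",
    "year": 2000, "race_name": "Italian Grand Prix",
    "location": "Monza", "second": "Mika Hakkinen", "third": "Ayrton Senna",
}
print(fill_template(race))
```

Such a system is cheap and reliable but can only ever produce the sentence shapes its rules anticipate, which is exactly the limitation the deep learning approach aims to avoid.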
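The definition of a language model given in section 3 — predicting the probability of the next word given the previous word(s), and completing the start of a sentence with a likely continuation — can be illustrated with a toy bigram model. GPT-2 implements the same idea with a large transformer over subword tokens; this sketch, with an invented miniature corpus, only shows the underlying principle:

```python
from collections import Counter, defaultdict

# Toy bigram language model: estimate P(next word | previous word)
# from counts, then greedily complete a prompt. The corpus is an
# invented miniature example.
corpus = ("hamilton wins the race . verstappen wins the title . "
          "hamilton takes the win .").split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# Conditional probability of one continuation:
p = counts["hamilton"]["wins"] / sum(counts["hamilton"].values())
print(p)  # P(wins | hamilton) = 0.5

def complete(word: str, steps: int = 3) -> list:
    """Greedily append the most probable next word at each step."""
    out = [word]
    for _ in range(steps):
        candidates = counts[out[-1]].most_common(1)
        if not candidates:
            break
        out.append(candidates[0][0])
    return out

print(complete("hamilton"))
```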
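One plausible way to pair the acquired race fields from phase one with a scraped report to form a single finetuning example is sketched below. The `<|data|>`/`<|report|>` delimiters and the key names are assumptions for illustration; the paper's actual encoding is not specified in this excerpt:

```python
def make_training_example(race_data: dict, report: str) -> str:
    """Serialize the structured race data followed by the target report,
    so a language model can learn data-to-text generation.
    The delimiter tokens and key names are invented for illustration."""
    data = ", ".join(f"{k}: {v}" for k, v in race_data.items())
    return f"<|data|> {data} <|report|> {report}"

race_data = {
    "winner": "Michael Schumacher",
    "second": "Mika Hakkinen",
    "third": "Ayrton Senna",
    "constructor": "Ferrari",
    "race": "Italian Grand Prix",
    "location": "Monza",
    "year": 2000,
}
example = make_training_example(
    race_data, "Michael Schumacher won the Italian Grand Prix at Monza.")
print(example)
```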
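Extracting the report text from a scraped page could look roughly as follows, using only the standard library. The real Crash.net markup is not reproduced here; the parser and the sample HTML are illustrative assumptions:

```python
from html.parser import HTMLParser

# Sketch of pulling report paragraphs out of a race-report page.
# The sample HTML below is invented; real pages would be fetched
# and would have site-specific markup.
class ReportParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

sample = ("<html><body><p>Schumacher wins at Monza.</p>"
          "<p>Hakkinen second.</p></body></html>")
parser = ReportParser()
parser.feed(sample)
print("\n".join(parser.paragraphs))
```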