Generation of Sport News Articles from Match Text Commentary
UKRAINIAN CATHOLIC UNIVERSITY

MASTER THESIS

Generation of Sport News Articles from Match Text Commentary

Author: Denys PORPLENKO
Supervisor: PhD Valentin MALYKH

A thesis submitted in fulfillment of the requirements for the degree of Master of Science in the Department of Computer Sciences, Faculty of Applied Sciences.

Lviv 2020

Declaration of Authorship

I, Denys PORPLENKO, declare that this thesis titled, “Generation of sport news articles from match text commentary”, and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:
Date:

Abstract

Nowadays, thousands of sporting events take place every day. Most sports news (reports on the results of sports competitions) is still written by hand, despite its templated structure. In this work, we investigate whether it is possible to generate such news from a broadcast: a set of comments that describe a game in real time. We address this problem for the Russian language and treat it as a summarization task, applying both extractive and abstractive approaches. Among the extractive models, we do not obtain significant results.
However, we build an Oracle model that shows the best achievable extractive result: 0.21 F1 for ROUGE-1. With the abstractive approach, we reach 0.26 F1 for ROUGE-1 using an NMT framework with Bidirectional Encoder Representations from Transformers (BERT) as the encoder and thesaurus-based text augmentation. Other types of encoders do not show significant improvements.

Acknowledgements

First of all, I want to thank my supervisor Valentin Malykh from the Moscow Institute of Physics and Technology, who gave me much useful advice regarding the experiments, results, structure, and content of this work. I am also grateful to Vladislav Denisov from the Moscow Technical University of Communications and Informatics for providing the data corpus, and to the Open Data Science community (https://ods.ai/) for granting computational resources for this research. I am also grateful to the seminar supervisor Hanna Pylieva (Ukrainian Catholic University, Data Science Master program alumna) for discussing the progress and the past and future stages of this project. Finally, I want to thank the Ukrainian Catholic University, and Oleksii Molchanovskyi personally, for the most progressive master program in data science in Ukraine and for the opportunity to take part in it.

Contents

Declaration of Authorship
Abstract
Acknowledgements

1 Introduction
  1.1 Motivation
  1.2 Goals of the master thesis
  1.3 Thesis structure

2 Related work
  2.1 Extractive and abstractive approaches for summarization
  2.2 Generating news and sport summaries

3 Dataset description
  3.1 Broadcast
  3.2 News
  3.3 Additional information about the dataset

4 Background and theory information
  4.1 Metrics
    4.1.1 ROUGE
    4.1.2 Cross-entropy
  4.2 Extractive approach
    4.2.1 PageRank
    4.2.2 TextRank
    4.2.3 LexRank
  4.3 Word embedding
    4.3.1 Word2vec
    4.3.2 FastText
  4.4 Abstractive approach
    4.4.1 RNN
    4.4.2 LSTM
    4.4.3 Sequence-to-sequence model
    4.4.4 Neural Machine Translation
    4.4.5 Attention mechanism
    4.4.6 Transformers
    4.4.7 BERT
  4.5 Differences in broadcast and news styles

5 Model
  5.1 OpenNMT
  5.2 PreSumm

6 Experiments
  6.1 TextRank approaches
    Implementation details
    Results and examples
  6.2 LexRank
    Implementation details
    Results and examples
  6.3 Oracle
    Implementation details
    Results and examples
  6.4 OpenNMT
    Implementation details
  6.5 PreSumm
    Implementation details
    6.5.1 BertSumAbs
    6.5.2 RuBertSumAbs
    6.5.3 BertSumExtAbs
    6.5.4 BertSumAbs1024
    6.5.5 OracleA
    6.5.6 BertSumAbsClean
    6.5.7 AugAbs
  6.6 Human evaluation
  6.7 Comparison of human judgment and ROUGE scores

7 Conclusion
  7.1 Contribution
  7.2 Future work

A Examples of generated news

Bibliography

List of Figures

3.1 Number of comments per sport game.
3.2 Number of tokens (split by space) per broadcast.
3.3 Number of news items per sport game. News are split into two groups: before and after news.
3.4 Number of tokens (split by space) per sport game. We cut news with more than 1000 tokens.
4.1 Stages of the TextRank algorithm.
4.2 Two-dimensional PCA projection of countries and their capital cities. The figure illustrates the ability of the model to implicitly learn the relationships between them without any supervised information about what a capital city means. Source: (Mikolov et al., 2013a)
4.3 The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word. Source: (Mikolov et al., 2013b)
4.4 A Recurrent Neural Network (RNN). Three time steps are shown.
4.5 The detailed internals of an LSTM. Source: (CS224n: NLP with Deep Learning)
4.6 Graphical illustration of the proposed model trying to generate the t-th target word y_t given a source sentence (x_1, x_2, ..., x_T). Source: (Bahdanau, Cho, and Bengio, 2014)
4.7 The Transformer model architecture. Source: (Vaswani et al., 2017)
4.8 Pre-training and fine-tuning procedures for the BERT model. The same architectures and model parameters (to initialize models for different downstream tasks) are used in both pre-training and fine-tuning stages.
During fine-tuning, all parameters are fine-tuned. Source: (Devlin et al., 2018)
5.1 Architecture of the original BERT model (left) and BERTSUM (right). Source: (Liu and Lapata, 2019)
6.1 Distribution of the number of sentences selected by the Oracle model.
6.2 Distribution of the number of tokens in news (the target for the PreSumm experiments).
6.3 Distribution of the number of tokens in broadcasts (the source for the PreSumm experiments).

List of Tables

3.1 Examples of comments for the same sport game.
3.2 Examples of news for sport games.
3.3 Examples of after-game news with pre-game or general contexts.
3.4 Examples of service/advertising information inside broadcasts and news.
6.1 ROUGE scores of TextRank approaches. The PageRank W2V and PageRank FT models are based on PageRank and use word2vec and FastText embeddings, respectively. TextRank Gns is the algorithm from the Gensim toolkit.
6.2 ROUGE-1/ROUGE-2/ROUGE-L scores of the LexRank results.
6.3 Examples of summaries generated using the LexRank approach.
6.4 ROUGE scores for all extractive approaches.
6.5 ROUGE scores for summaries generated using the BertSumAbs model.
6.6 ROUGE scores for summaries generated using RuBertSumAbs.
6.7 ROUGE scores for summaries generated using BertSumExtAbs.
6.8 ROUGE scores for summaries generated using BertSumAbs1024.
6.9 ROUGE scores for summaries generated using
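ROUGE-1 F1, the metric reported in the abstract and throughout the result tables, is a unigram-overlap F-score between a generated summary and a reference. The following is a minimal illustrative sketch with naive whitespace tokenization and clipped counts; it is not the ROUGE implementation used for the thesis evaluations, which typically also applies stemming and language-specific preprocessing:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Each reference unigram may be matched at most as many times as it occurs.
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the team won the match", "the team won the final match"), 2))  # → 0.91
```

Under this definition the Oracle score of 0.21 F1 means that even the best possible selection of broadcast sentences shares only about a fifth of its unigram mass with the hand-written news, which is why the abstractive models are needed.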