Employing a Transformer Language Model for Information Retrieval and Document Classification
DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ENGINEERING PHYSICS AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Employing a Transformer Language Model for Information Retrieval and Document Classification
Using OpenAI's generative pre-trained transformer, GPT-2

ANTON BJÖÖRN
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Machine Learning
Date: July 11, 2020
Supervisor: Håkan Lane
Examiner: Viggo Kann
School of Electrical Engineering and Computer Science
Host company: SSC - Swedish Space Corporation
Company Supervisor: Jacob Ask
Swedish title: Transformermodellers användbarhet inom informationssökning och dokumentklassificering

Abstract

As the information flow on the Internet keeps growing, it becomes increasingly easy to miss important news that does not have mass appeal. Combating this problem calls for increasingly sophisticated information retrieval methods. Pre-trained transformer-based language models have shown great generalization performance on many natural language processing tasks. This work investigates how well such a language model, OpenAI's Generative Pre-trained Transformer 2 (GPT-2), generalizes to information retrieval and classification of online news articles, written in English, with the purpose of comparing this approach with the more traditional method of Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. The aim is to shed light on how useful state-of-the-art transformer-based language models are for the construction of personalized information retrieval systems. Using transfer learning, the smallest version of GPT-2 is trained to rank and classify news articles, achieving similar results to the purely TF-IDF based approach.
While the average Normalized Discounted Cumulative Gain (NDCG) achieved by the GPT-2 based model was about 0.74 percentage points higher, the sample size was too small to give these results high statistical certainty.

Keywords: Deep Learning, Transformer Models, Information Retrieval, Ranking, Generative Pre-training, Document Classification

Sammanfattning (Swedish abstract, translated)

The information flow on the Internet keeps growing, making it ever easier to miss important news that does not interest a large number of people. Combating this problem requires increasingly sophisticated information retrieval methods. For several years now, pre-trained transformer models have taken over as the most prominent neural networks for handling text. This work investigates how well such a language model, OpenAI's Generative Pre-trained Transformer 2 (GPT-2), can generalize from generating text to being used for information retrieval and classification of texts. To evaluate this, a transformer-based model is compared with a more traditional Term Frequency-Inverse Document Frequency (TF-IDF) vectorization model. The goal is to clarify how useful pre-trained transformer models actually are in the construction of specialized information retrieval systems. The smallest version of the language model GPT-2 is adapted and retrained to rank and classify news articles, written in English, and achieves performance similar to the TF-IDF based model. The GPT-2 based model achieved on average 0.74 percentage points higher Normalized Discounted Cumulative Gain (NDCG), but the sample size was not large enough to give these results high statistical certainty.

Keywords: deep learning, transformer models, information retrieval, ranking, generative pre-training, document classification

Contents

1 Introduction
  1.1 Overview
    1.1.1 The Space Industry
    1.1.2 Swedish Space Corporation
    1.1.3 Problem Formulation
    1.1.4 Approach
  1.2 Research Question
    1.2.1 Hypothesis
    1.2.2 Conditions & Limitations
2 Background
  2.1 Artificial Neural Networks
    2.1.1 The Perceptron
    2.1.2 Deep Neural Networks & Stochastic Gradient Descent
  2.2 Transformer Networks
    2.2.1 Network Architecture
    2.2.2 Generative Pre-Training and GPT-2
    2.2.3 Generalization Capabilities of GPT-2
  2.3 Information Retrieval
    2.3.1 Document Ranking & Relevance Score
    2.3.2 TF-IDF
    2.3.3 Neural Relevance Scoring
    2.3.4 Ensemble Relevance Scoring (TF-IDF + GPT-2)
  2.4 Evaluation
3 Method
  3.1 Data Collection
    3.1.1 Web Scraping
    3.1.2 SSC News Summaries & Fundamental Data Set
    3.1.3 Daily Web Scraping & Main Data Set
    3.1.4 Labeling of Main Data Set
  3.2 TF-IDF Based Model
    3.2.1 Training
    3.2.2 Scoring and Classification
  3.3 GPT-2 Based Model
    3.3.1 Adaptation
    3.3.2 Relevance Classifier
    3.3.3 Zone Classifier
    3.3.4 Data Set
    3.3.5 Training
    3.3.6 Scoring & Classification
  3.4 Ensemble Models
    3.4.1 Multiplicative Model
    3.4.2 Re-ranking Model
  3.5 Evaluation
    3.5.1 N-Fold Cross Validation
    3.5.2 Normalized Discounted Cumulative Gain
    3.5.3 Classification Accuracy
    3.5.4 Resource Demands
  3.6 Statistical Significance & Null Hypothesis
4 Results
  4.1 Relevance Scoring
  4.2 Classification Accuracy
  4.3 NDCG vs Accuracy
  4.4 NDCG@10
  4.5 T-Test
  4.6 Resource Demands
5 Discussion
  5.1 Model Performance
  5.2 Model Robustness
  5.3 Ensemble Models
  5.4 Resource Demands
  5.5 Ethical Considerations
  5.6 Sustainability & Societal Considerations
6 Conclusions
  6.1 Employment of Transformer Language Models
  6.2 Future Work
Bibliography

Chapter 1: Introduction

This chapter introduces the problem and provides some basic information about the host company and the approach that was applied.
1.1 Overview

The amount of information published on the Internet every day is vast. For some companies, the amount of news published within their industry alone may be more than a single person has time to read, making it hard to stay up to date. Furthermore, many forces are at work behind what is published and how it is published (click-baiting, biases, paid articles), which can make finding the most relevant and trustworthy news challenging. This flood of information calls for sophisticated information retrieval techniques to aid in navigating the growing information landscape.

1.1.1 The Space Industry

The space industry has seen big changes since the beginning of the century, with launch prices reduced by about a factor of 20 [1], accompanied by a major shift towards commercialisation [2, 1]. With a boom in space-related startups and commercial interest [1], the information flow within the industry is higher than it has ever been. With future plans ranging from huge satellite constellations providing global Internet services [3] to the colonization of Mars [4] and new Moon missions [5], there seems to be no end to the coming growth and change.

1.1.2 Swedish Space Corporation

Swedish Space Corporation (SSC), previously known as Rymdbolaget, is a mainly commercial space company owned by the Swedish government. SSC operates the Esrange spaceport in northern Sweden, where it carries out missions for its customers, including launching experiments aboard sounding rockets and weather balloons. The company also provides science and launch services, satellite ground network services from its many ground stations around the world, and engineering and spacecraft operation services.
1.1.3 Problem Formulation

Swedish Space Corporation (SSC) wants to help its employees stay up to date with industry events in a time-efficient manner by constructing a program that can automatically retrieve and present the most relevant news items for SSC. The system should automatically collect news articles from selected sites and, upon request, produce a summary of news collected within a given range of dates. The scientific question being investigated is whether a modern transformer model, more specifically Generative Pre-trained Transformer 2 (GPT-2) [6], can be used to improve the results of such a system by being employed for document ranking and classification. The primary objective of the system is document ranking; classification is a secondary objective.

1.1.4 Approach

Collection of news articles will be done by crawling and scraping pages from a list of online news sites created with the help of SSC. This web scraper will be used both to collect data to be labeled for training and to collect articles for creating the news summaries in the finished software. Two different methods for document relevance scoring (ranking) and classification will be implemented. One will be a vector space model based on TF-IDF (term frequency-inverse document frequency) and will serve as a baseline for performance. The other method will perform relevance scoring and classification via transfer learning on the pre-trained transformer model GPT-2 [6]. All web pages scraped will be written in English, since GPT-2 is mainly trained on English text.

1.2 Research Question

"Is employing a transformer language model using transfer learning better than a proven traditional approach for the purpose of information retrieval and document classification?"

This research question is evaluated by comparing the performance of the two different approaches.
In this work, "better" for the purpose of information retrieval is defined as retrieving results of higher relevance, and "better" for document classification is defined as achieving a higher classification accuracy. In addition to these metrics of performance, the resource demands of both approaches will also be taken into consideration, in the form of execution time. To answer this question, software must be constructed and two different relevance scoring and classification methods implemented into it.
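The two notions above can be made concrete with a short sketch: a TF-IDF vector space baseline that ranks articles by cosine similarity to an interest profile, and the NDCG metric used to judge retrieved relevance. This is a minimal illustration assuming scikit-learn and NumPy; the function names (`rank_articles`, `ndcg`), the interest-profile formulation, and the toy articles are hypothetical and not taken from the thesis implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_articles(profile, articles):
    """Rank articles by TF-IDF cosine similarity to an interest profile.

    Returns article indices, most relevant first.
    """
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on articles plus the profile so all vectors share one vocabulary.
    matrix = vectorizer.fit_transform(list(articles) + [profile])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return list(np.argsort(scores)[::-1])

def ndcg(relevances, k=None):
    """NDCG: DCG of the given ranking divided by DCG of the ideal ranking.

    `relevances` are graded relevance labels in ranked order.
    """
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1)
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[:ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Toy usage: rank two hypothetical articles against a profile.
profile = "space industry rocket launches and satellites"
articles = [
    "SSC carries out a sounding rocket launch from Esrange",
    "celebrity gossip and cooking recipes",
]
order = rank_articles(profile, articles)  # → [0, 1]: the space article first
```

A perfect ranking gives NDCG 1.0 regardless of the label scale, which is what makes averaged NDCG comparable across queries when the two models are evaluated against each other.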