DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Introducing a Hierarchical Attention Transformer for document embeddings
Utilizing state-of-the-art word embeddings to generate numerical representations of text documents for classification

VIKTOR KARLSSON
Master in Computer Science
Date: December 8, 2019
Supervisor: Hamid Reza Faragardi
Examiner: Olov Engwall
KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science
Swedish title: Introduktion av Hierarchical Attention Transformer för dokumentrepresentationer

Abstract

The field of Natural Language Processing has produced a plethora of algorithms for creating numerical representations of words or subsets thereof. These representations encode the semantics of each unit, which for word-level tasks enables immediate utilization. Document-level tasks, on the other hand, require special treatment in order for fixed-length representations to be generated from varying-length documents.

We develop the Hierarchical Attention Transformer (HAT), a neural network model which utilizes the hierarchical nature of written text for creating document representations. The network relies entirely on attention, which enables interpretability of its inferences and allows context to be attended to from anywhere within the sequence.

We compare our proposed model to current state-of-the-art algorithms in three scenarios: datasets of documents with an average length (1) less than three paragraphs, (2) greater than an entire page, and (3) greater than an entire page with a limited amount of training documents. HAT outperforms its competitors in cases 1 and 2, reducing the relative error by up to 33% and 32.5%, respectively. HAT becomes increasingly difficult to optimize in case 3, where it does not perform better than its competitors.

Sammanfattning

Within the field of Natural Language Processing there exists a wealth of algorithms for creating numerical representations of words or smaller units. These representations capture the semantic properties of the words, which for word-level problems can be used directly; named entity recognition is one example of such a problem. Document-level problems, on the other hand, require special approaches in order to create fixed-length representations even when the document length varies.

This degree project develops the Hierarchical Attention Transformer (HAT), a neural network that makes use of the hierarchical structure of documents to combine word-level information into a document-level representation. The network is based entirely on attention, which makes it possible to use information from the whole sequence and to interpret the model's conclusions.

HAT is compared against the currently best-performing document classification algorithms in three scenarios: collections of documents with an average length (1) shorter than three paragraphs, (2) longer than a full page, and (3) longer than a full page where the number of training documents is limited. HAT performs better than its competitors in cases 1 and 2, where the error was reduced by up to 33% and 32.5%, respectively. The optimization of HAT grew in complexity for case 3, for which the results did not beat the competitors.

Contents

1 Introduction
  1.1 Background
  1.2 Research question
    1.2.1 Delimitation
    1.2.2 Relevancy and business value
  1.3 Research methodology
  1.4 Ethics, sustainability and societal impact
  1.5 Outline
2 Background
  2.1 Neural networks
    2.1.1 Architecture
  2.2 Training techniques
    2.2.1 Dataset management
    2.2.2 Back-propagation
    2.2.3 Techniques for improving model performance
  2.3 Recurrent neural networks
  2.4 Transformer
    2.4.1 Overview
    2.4.2 Encoder
  2.5 Natural Language Processing
    2.5.1 Bag of words
    2.5.2 Tf-idf
    2.5.3 Language modelling
  2.6 Contextualized word embeddings
    2.6.1 Context vectors
    2.6.2 Bidirectional Encoder Representations from Transformers
  2.7 Our contribution
    2.7.1 Extracting features from BERT for documents
    2.7.2 The Hierarchical Attention Transformer
3 Related works
  3.1 Smooth Inverse Frequency
  3.2 Paragraph Vector
  3.3 Document Vector Through Corruption
  3.4 Hierarchical Attention Network
  3.5 Word Mover's Embeddings
    3.5.1 Word Mover's Distance
    3.5.2 Word Mover's Embeddings
4 Methods
  4.1 Research questions
  4.2 Method
    4.2.1 Baseline comparison
    4.2.2 Classification of page-long documents
    4.2.3 Limiting the number of training instances
  4.3 Training procedure
    4.3.1 Hierarchical Attention Transformer
    4.3.2 Smooth Inverse Frequency
    4.3.3 Word Mover's Embeddings
    4.3.4 Dataset management and model evaluation
5 Results
  5.1 Baseline comparison
  5.2 IMDB-long-n
6 Discussion
  6.1 Baseline comparison
    6.1.1 BBC Sport
    6.1.2 OHSUMED
  6.2 IMDB-long-n
  6.3 Validity discussion
    6.3.1 Internal validity
    6.3.2 External validity
7 Conclusions
  7.1 Future work
Bibliography
A Attention examples

Chapter 1

Introduction

Written text has for centuries enabled humanity to accumulate knowledge across generations. It has also made communication over both distance and time possible. It is reasonable to believe that keeping track of and categorizing these documents never posed a problem during the infancy of this technology. With the inception of the Internet and the democratization of speech, however, it is no surprise that this medium has exploded in volume. Categorizing even a fraction of this data would today be an unreasonable task for humans. This has led to many new and interesting problems within the field of Machine Learning (ML), automatic document categorization being one of them.

Enabling computers to make sense of the infinitely varying and complex structure of written language has been an ongoing research area for more than four decades [1]. Major breakthroughs have been achieved during the last few years [2, 3], enabling new use cases. This is partly due to the ever-growing amount of available text data together with cross-pollination of ideas from different research fields within ML. Transfer learning techniques have been brought from the field of Computer Vision to Natural Language Processing (NLP). This has enabled word representation models with millions of parameters to be trained on datasets of billions of words [3] for use in new domains where data sparsity might otherwise prevent their success.

1.1 Background

An early approach for enabling computers to understand text dates back to the mid 1950s. At that time, a document was represented by a list of its word counts [1]. This technique, called a Bag of words model, has some glaring issues but can still produce useful document representations for classification and clustering. The two main shortcomings are its total neglect of word ordering and of similarity between words, both of which are invaluable for complete language understanding.
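As a minimal illustration of this representation, the sketch below builds fixed-length count vectors over a shared vocabulary. The toy corpus and all names are illustrative only and do not appear in the thesis.

    from collections import Counter

    # Toy corpus; every document becomes a vector of word counts over a shared vocabulary.
    docs = ["the cat sat on the mat",
            "the dog sat on the log"]

    # Fixed vocabulary built from the corpus, sorted for a stable index order.
    vocab = sorted({word for doc in docs for word in doc.split()})

    def bag_of_words(doc):
        counts = Counter(doc.split())
        # Fixed-length vector: one count per vocabulary entry, regardless of document length.
        return [counts[word] for word in vocab]

    vectors = [bag_of_words(doc) for doc in docs]

Note how the resulting vectors neither preserve the order of the words nor relate "cat" to "dog", which are exactly the two shortcomings discussed above.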
These two issues can be solved by training a model to find numerical representations of words which encode their semantics. Such an approach results in the representations of similar words lying close to each other in the high-dimensional embedding space, thus preserving their semantic relatedness [2].

The growing interest in deep neural networks has enabled researchers to construct multi-million-parameter models able to create even richer encodings for words. The performance of these models further closes the gap to human performance in many different tasks, from question answering to document classification [3].

A research area born from these achievements has focused on how to use this low-level information provided for each word to create meaningful representations of sentences, paragraphs or even documents. A lot of work has gone into clever weighting schemes [4, 5, 6] and neural models, both linear [7] and recurrent [8], with the common goal of generating document embeddings. There are, to the best of our knowledge, still particularly interesting techniques yet to be examined, which is what this thesis will explore.

1.2 Research question

This thesis will investigate a novel technique for creating document representations from pre-trained embeddings of the words within. To examine this, we will answer the following research questions:

• Is there merit in using state-of-the-art word embeddings together with a neural network model relying only on attention to create document embeddings for use in a classification problem?

• How does the amount of training data in the case of longer documents affect the performance of our proposed model compared to current document embedding algorithms?

The merit of our attention model can only be evaluated when compared with state-of-the-art document embedding algorithms; these will be described in later chapters. Further, we consider longer documents to be ones spanning more than 500 words. This is roughly the amount of text that fits on a single A4 page, which closely coincides with the 512-token limit of Devlin et al.'s algorithm [3].

1.2.1 Delimitation

This thesis will only evaluate the proposed model together with the embedding algorithm presented in [3], even though it is possible to use other embedding algorithms. The performance is expected to vary with other models, but this will not be studied. Further, this work is limited to studying the effect of varying the amount of training data for one particular dataset.
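To make the 500-word threshold from Section 1.2 concrete, the sketch below splits a document into pieces that respect a maximum sequence length. It uses whitespace tokens as a rough stand-in for the subword tokenizer of the embedding algorithm in [3], so the split points are only approximate and the code is not part of the proposed method.

    MAX_TOKENS = 512  # sequence limit assumed for the pre-trained embedding model

    def split_document(text, max_tokens=MAX_TOKENS):
        # Whitespace "tokens" as a simple approximation of subword tokenization.
        tokens = text.split()
        return [" ".join(tokens[i:i + max_tokens])
                for i in range(0, len(tokens), max_tokens)]

A document longer than a single A4 page therefore ends up as two or more segments, each short enough to be embedded on its own.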