Comparing Feature Extraction Methods and Effects of Pre-Processing Methods for Multi-Label Classification of Textual Data

DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Comparing Feature Extraction Methods and Effects of Pre-Processing Methods for Multi-Label Classification of Textual Data

MARTIN EKLUND

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Computer Science
Date: June 27, 2018
Supervisor: Håkan Lane
Examiner: Olof Bälter
Swedish title: Utvärdering av Metoder för Extraktion av Särdrag och Förbehandling av Data för Multi-Taggning av Textdata

Abstract

This thesis aims to investigate how different feature extraction methods applied to textual data affect the results of multi-label classification. Two different Bag of Words extraction methods are used, specifically the Count Vector and the TF-IDF approaches. A word embedding method is also investigated, called the GloVe extraction method. Multi-label classification can be useful for categorizing items, such as pieces of music or news articles, that may belong to multiple classes or topics. The effect of using different pre-processing methods is also investigated, such as the use of N-grams, stop-word elimination, and stemming. Two different classifiers, an SVM and an ANN, are used for multi-label classification using a Binary Relevance approach. The results indicate that the choice of extraction method has a meaningful impact on the resulting classifications, but that no one method consistently outperforms the others. Instead, the results show that the GloVe extraction method performs the best for the recall metrics, while the Bag of Words methods perform the best for the precision metrics.

Sammanfattning (Summary)

This work aims to investigate what effect different methods for extracting features from textual data have when they are used for multi-label classification of that data. Two methods based on Bag of Words are investigated, namely the Count Vector method and the TF-IDF method. A method that uses word embeddings, called the GloVe method, is also investigated. Multi-label classification can be useful when the data, for example pieces of music or news articles, can belong to several classes or topics. The use of several different methods for pre-processing the data is also investigated, such as the use of N-grams, the elimination of uninteresting words, and the transformation of words with different inflections to a common stem. Two different classifiers, an SVM and an ANN, are used for the multi-label classification through a method called Binary Relevance. The results show that the choice of feature extraction method plays a meaningful role for the resulting multi-label classification, but that no single method gives the best results across all tests. Instead, the results indicate that the GloVe-based extraction method performs best on the recall metrics, while the Bag of Words methods perform best on the precision metrics.

Contents

1 Introduction
    1.1 Problem Statement
    1.2 Scope
    1.3 Objective
2 Background
    2.1 Multi-Label Classification
        2.1.1 Methods
    2.2 Content-Based Recommendation
        2.2.1 Exploitation and Exploration
    2.3 Pre-Processing
        2.3.1 Tokenization and N-grams
        2.3.2 Stemming
        2.3.3 Stop-Word Elimination
    2.4 Feature Extraction
        2.4.1 Bag of Words
        2.4.2 Term Frequency-Inverse Document Frequency
        2.4.3 Word Embeddings and GloVe
    2.5 Classifiers
        2.5.1 Support Vector Machine
        2.5.2 Artificial Neural Networks
    2.6 Related Work
        2.6.1 Word Embeddings for Single-Label Classification
        2.6.2 Bag of Words for Multi-Label Classification
        2.6.3 Research Gap
3 Method
    3.1 Dataset
    3.2 Pre-Processing
    3.3 Extraction Methods
        3.3.1 TF-IDF
        3.3.2 Bag of Words/Count Vector
        3.3.3 GloVe
    3.4 Classifiers
    3.5 Evaluation
        3.5.1 Precision
        3.5.2 Recall
        3.5.3 F-Score
4 Results
    4.1 TF-IDF
    4.2 Count Vector
    4.3 GloVe
    4.4 Effect of N-grams
    4.5 Stop-Words
    4.6 Stemming
    4.7 Best Scores
5 Discussion
    5.1 Effects of Preprocessing
        5.1.1 N-grams
        5.1.2 Stop-Word Elimination
        5.1.3 Stemming
    5.2 Extraction Methods
        5.2.1 TF-IDF and Count Vector
        5.2.2 Bag of Words vs. Word Embeddings
    5.3 Concerns
    5.4 Future Work
6 Conclusion
Bibliography
A Appended Material

Chapter 1 Introduction

The amount of digital content available on the Internet is steadily growing, and it can prove challenging to make the best use of the vast amount of data that is available. One way to handle this problem is to classify data into different categories, thus giving a better overview of what kinds of data are available. This can, for instance, enable users of a news site to find the articles that they are interested in, or enable users on social media to locate photos of themselves. To avoid having to categorize the data manually, one can instead employ machine learning techniques to automate the process. Such techniques usually require distinguishing features of an object in order to be able to classify it.

There are several different ways to extract features from different types of data, and different methods may prove suitable in different situations. For instance, when classifying images of apples and oranges, a crude extraction method could be to take the average pixel values of the images.
If the average pixel value is close to orange, then the image would be classified as an orange, and if it is closer to green then it would be classified as an apple. This method would, however, most likely also classify a tiger as an orange, since it takes no feature other than color into account. By choosing a more appropriate extraction method this could hopefully be avoided. Instead of only taking the average pixel values into account, one could also consider features such as the shape, size and texture of the object. Taking these features into account, an image of a tiger would most likely not be classified as an orange.

Categorized data items can, among other things, be used to provide recommendations of similar items to a user depending on which categories the user has previously shown an interest in. By using a multi-label approach, the problem of Exploitation and Exploration described in section 2.2.1 could hopefully be somewhat mitigated. This could be done by recommending items that might not be the ones most similar to the user's viewing history, but that still lie relatively close to other categories that the user has previously shown an interest in.

1.1 Problem Statement

The aim of this report is to evaluate how different methods for feature extraction affect a multi-label classification problem with textual data. The effect of different pre-processing methods is also investigated. If the choice of extraction method has a significant impact on the final classification results, then it might be better to choose the right extraction method rather than spend too much time optimizing the classifier itself. The questions that this report will attempt to answer are:

• Which one of the three feature extraction methods (Count Vector, TF-IDF and GloVe) performs the best?
• Does one of these feature extraction methods perform the best even when used with different classifier models?
• Are commonly used pre-processing methods always useful when applied to different feature extraction methods?

1.2 Scope

The main goal of this thesis is to examine the effect of feature extraction methods on multi-label classification of textual data. Three feature extraction methods are evaluated in conjunction with two classifier models. It would be possible to include more classifier models, but two should hopefully be enough to examine whether or not extraction methods yield similar results when used with different classifiers.

The resulting classifications may in turn aid in building a simple recommendation system, but it is outside the scope of this thesis to evaluate such a system. Instead, this work is to be viewed as examining the classification foundation for such a system. This work is being done in association with the Swedish Pensions Agency (Pensionsmyndigheten), who are interested in developing a prototype for a recommender system.

1.3 Objective

The objective of this report is to investigate how different approaches for feature extraction affect the results of multi-label classification. The goal of any classification task is to obtain results that are as good as possible. If the choice of feature extraction method has a great impact, it might be worth spending more time on selecting an appropriate feature extraction method rather than on optimizing a certain classifier. This report also investigates whether it is always useful to apply commonly used pre-processing methods to text data when different extraction methods are used. Hopefully the results of this work could prove useful for better classifying multi-label data.

Chapter 2 Background

2.1 Multi-Label Classification

In single-label classification a data sample is assigned only one label (or category) from a set of disjoint labels.
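
To make the contrast with the multi-label setting concrete, the following minimal Python sketch encodes a toy corpus in which a document may carry several labels at once. It then splits the task into one binary problem per label, which is the idea behind the Binary Relevance approach named in the abstract. The documents, label names, and helper functions are hypothetical and chosen only for illustration; they are not taken from the dataset or implementation used in this thesis.

```python
# Toy corpus (hypothetical): each document is paired with a SET of labels,
# so a document may belong to several categories at once.
documents = [
    ("Central bank raises interest rates", {"economy"}),
    ("Team owner sells club for record sum", {"sports", "economy"}),
    ("Striker scores twice in cup final", {"sports"}),
]


def label_space(samples):
    """Return the sorted list of all labels that occur in the samples."""
    return sorted({label for _, labels in samples for label in labels})


def to_indicator_rows(samples, labels):
    """Encode each sample's label set as a 0/1 vector over `labels`."""
    return [
        [1 if label in sample_labels else 0 for label in labels]
        for _, sample_labels in samples
    ]


def binary_relevance_targets(samples, labels):
    """Split the multi-label problem into one binary target list per label."""
    rows = to_indicator_rows(samples, labels)
    return {label: [row[i] for row in rows] for i, label in enumerate(labels)}


labels = label_space(documents)
indicator = to_indicator_rows(documents, labels)
targets = binary_relevance_targets(documents, labels)

print(labels)     # ['economy', 'sports']
print(indicator)  # [[1, 0], [1, 1], [0, 1]]
print(targets)    # {'economy': [1, 1, 0], 'sports': [0, 1, 1]}
```

In a single-label problem every row of the indicator matrix would contain exactly one 1, whereas here the second document has two. Each per-label target list can then be handed to an ordinary binary classifier, one classifier per label.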

