Evaluation and Comparison of Word Embedding Models, for Efficient Text Classification
Ilias Koutsakis (University of Amsterdam, Amsterdam, The Netherlands), [email protected]
George Tsatsaronis (Elsevier, Amsterdam, The Netherlands), [email protected]
Evangelos Kanoulas (University of Amsterdam, Amsterdam, The Netherlands), [email protected]

Abstract

Recent research on word embeddings has shown that they tend to outperform distributional models on word similarity and analogy detection tasks. However, there is not enough evidence on whether such embeddings can improve the text classification of longer pieces of text, e.g. articles. More specifically, it is not yet clear whether the usage of word embeddings has a significant effect on various text classifiers, and how word embeddings perform after being trained with different numbers of dimensions (not only the standard size = 300).

In this research, we determine that the use of word embeddings can create feature vectors that not only provide a formidable baseline, but also outperform traditional, count-based methods (bag of words, tf-idf) for the same number of dimensions. We also show that word embeddings appear to improve accuracy, even further than the distributional models baseline, when the amount of data is relatively small. In addition, the averaging of word embeddings appears to be a simple but effective way to keep a significant portion of the semantic information. Besides the overall performance and the standard classification metrics (accuracy, recall, precision, F1 score), the time complexity of the embeddings will also be compared.

Keywords: text classification, word embeddings, fasttext, word2vec, vectorization

Acknowledgments

I would like to thank my supervisors, George Tsatsaronis and Evangelos Kanoulas, for guiding me and providing me with their years of experience, patience, and support during the completion of this thesis. I would also like to thank Elsevier, for giving me the chance to work with and contribute to the work of amazing individuals in such an establishment, and to take advantage of their know-how and data during my research. Last, I would like to thank my mother, for constantly being on my side, supporting and pushing me.

1 Introduction

The accumulating amount of text data is increasing exponentially, day by day. This trend of constant production and analysis of textual information is not going to stop anytime soon. On the contrary, even non-traditional businesses, like banks¹, are starting to use Natural Language Processing for R&D and HR purposes. It is no wonder, then, that industries try to take advantage of new methodologies to create state-of-the-art products, in order to cover their needs and stay ahead of the curve.

This is how the concept of word embeddings became popular again in the last few years, especially after the work of Mikolov et al. [14], which showed that shallow neural networks can provide word vectors with some amazing geometrical properties. The central idea is that words can be mapped to fixed-size vectors of real numbers. Those vectors should be able to hold semantic context, which is why they have been very successful in sentiment analysis, word disambiguation and syntactic parsing tasks. However, those vectors can still represent single words only, something that, although useful, limits the usage of word embeddings in a wider context.

¹ https://www.ibm.com/blogs/watson/2016/06/natural-language-processing-transforming-financial-industry-2/

1.1 Motivation

The motivation behind this thesis is to examine whether or not word embeddings can be used in document classification tasks. It is important to understand the main problem, which is how to form a document feature vector from the separate word vectors. We should also examine the possible performance issues that different dataset sizes can cause, but most importantly, datasets with different semantics.

In the following, we present our findings, comparing the usage of word embeddings in text classification tasks, through the averaging of the word vectors, and their performance in comparison to the baseline, distributional models (bag of words and tf-idf).
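To make the averaging approach concrete, the sketch below shows one simple way to collapse per-word vectors into a single document feature vector. This is a minimal illustration under assumptions: the toy embedding table, its dimensionality, and the whitespace tokenizer are placeholders, not the trained models or preprocessing evaluated later in this thesis.

    import numpy as np

    # Toy lookup table standing in for a trained embedding model (e.g. word2vec or fastText);
    # in practice every word maps to a dense vector of, say, 100 or 300 dimensions.
    EMBEDDING_DIM = 4
    toy_embeddings = {
        "word":       np.array([0.1, 0.3, -0.2, 0.5]),
        "embeddings": np.array([0.0, 0.8, 0.1, -0.4]),
        "are":        np.array([0.2, -0.1, 0.0, 0.1]),
        "useful":     np.array([0.4, 0.2, 0.3, 0.0]),
    }

    def document_vector(text, embeddings, dim=EMBEDDING_DIM):
        """Average the vectors of all in-vocabulary tokens into one fixed-size feature vector."""
        tokens = text.lower().split()  # naive whitespace tokenizer
        vectors = [embeddings[t] for t in tokens if t in embeddings]
        if not vectors:  # no known words: fall back to the zero vector
            return np.zeros(dim)
        return np.mean(vectors, axis=0)

    print(document_vector("Word embeddings are useful", toy_embeddings))

The resulting fixed-size vector can then be fed to any of the classifiers described in the next section, in exactly the same way as a bag-of-words or tf-idf representation.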
2 Classification and Evaluation Metrics

Classification is the process where, given a set of classes, we try to determine one or more predefined classes/labels that a given object belongs to [13]. More specifically, using a learning method or learning algorithm, we wish to train a classifier or classification function γ that maps documents to classes:

    γ : X → C

This group of machine learning algorithms is called supervised learning, because a supervisor (the human who defines the classes and labels the training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ, where D is the training set containing the documents. The learning method Γ takes the training set D as input and returns the learned classification function γ.

It is important to note that binary datasets (with true/false labels only) and multiclass datasets (with a variety of classes to choose from) are not necessarily different problems. More specifically, the multiclass classification problem is usually reduced to a binary problem using the "one-vs-rest" method, where each class iteratively becomes the "true" class and is evaluated against the rest of the classes (which collectively become the "false" class).

You can see an example of the process in Figure 1, taken from the NLTK documentation [12].

Figure 1. A representation of the classification procedure, showing the training and the prediction part. (source: NLTK Documentation)

What follows is a short description of the classifiers used in this thesis, as well as the most common classifier evaluation metrics, both statistical and visual.

2.1 Common classifiers

For this thesis, we needed to test different classification algorithms and compare the results. We decided to settle on a few algorithms, representative of the different classification methods that exist. The implementations used are the ones found in Scikit-Learn, a machine learning library for Python [17]. The selected algorithms are the following:

2.1.1 Naive Bayes

Naive Bayes (NB) is a classifier based on applying Bayes' theorem with the "naive" assumption of independence between all the features. It is considered a particularly powerful machine learning algorithm, with multiple applications, especially in document classification and spam filtering [25]. In our experiments, we used the Gaussian Naive Bayes implementation from Scikit-Learn, which uses the following formula:

    P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}} \exp\left( -\frac{(x_i - \mu_y)^{2}}{2\sigma_y^{2}} \right)

2.1.2 Logistic Regression

Logistic regression (LR) is a regression model where the dependent variable is categorical. This allows it to be used in classification, where, formulated as an optimization problem [20], it minimizes the following cost function:

    \mathrm{minimize} \quad \frac{1}{n} \sum_{i=1}^{n} \log\left( 1 + \exp(-b_i\, a_i^{T} x) \right)

It was invented in 1958 [22], and it is similar to the naive Bayes classifier. But instead of using probabilities to set the model's parameters, it searches for the parameters that will maximize the classifier's performance [12].

2.1.3 Random Forest

Ensemble methods try to improve accuracy and generalization in predictions by using several estimators instead of one. Random Forest (RF) is an ensemble method based on Decision Trees. It uses the averaging method of ensembling, i.e. it averages the predictions of the classifiers used. Decision Trees differ from the previously presented algorithms, as they are a non-parametric supervised learning method that tries to create a model predicting the value of a target variable based on several input variables [24]. You can see an example of the rules created in Figure 2.

Figure 2. A tree showing survival of passengers on the Titanic. The figures under the leaves show the probability of survival and the percentage of observations in the leaf. (source: Wikipedia)

2.1.4 Support Vector Machines

Support Vector Machines/Classifiers (SVC) make use of hyperplanes, in a high- or infinite-dimensional space, which can be used for classification. They are very effective, compared to other algorithms, in cases where [17]:

• the dataset is very high-dimensional; and/or
• the number of dimensions is higher than the number of samples.

Both of these conditions apply directly to text classification, and although SVC is slower than other algorithms (especially Bayesian ones), it is very performant and memory efficient.

2.1.5 k-Nearest Neighbors

One of the simplest and most important algorithms in data mining is k-Nearest Neighbors (kNN). It is a non-parametric method used for classification, where the input consists of the k closest training examples in the feature space, and the output is the predicted class [2]. Every input item is classified by a majority vote of its neighbors. It is considered very sensitive to the structure of the data; thus it is commonly used for datasets with a small number of samples.
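To show how these classifiers are exercised in practice, the snippet below trains each of the five Scikit-Learn implementations listed above on the same document features and prints their test-set accuracy. This is a minimal sketch under assumptions: the 20 newsgroups subset, the 300-feature tf-idf representation, the LinearSVC variant standing in for the SVM classifier, and the hyperparameters are illustrative choices, not the experimental setup reported in this thesis.

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import LinearSVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    # Small illustrative corpus: two newsgroup categories, with the predefined train/test split.
    categories = ["sci.space", "rec.autos"]
    train = fetch_20newsgroups(subset="train", categories=categories)
    test = fetch_20newsgroups(subset="test", categories=categories)

    # Baseline distributional representation (tf-idf); averaged embeddings could be swapped in here.
    vectorizer = TfidfVectorizer(max_features=300)
    X_train = vectorizer.fit_transform(train.data).toarray()  # GaussianNB expects a dense array
    X_test = vectorizer.transform(test.data).toarray()

    classifiers = {
        "Naive Bayes": GaussianNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100),
        "Support Vector Machine": LinearSVC(),
        "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    }

    for name, clf in classifiers.items():
        clf.fit(X_train, train.target)     # learn the classification function, i.e. Γ(D) = γ
        predictions = clf.predict(X_test)  # apply γ to map unseen documents to classes
        print(f"{name}: accuracy = {accuracy_score(test.target, predictions):.3f}")

The metrics defined in the next subsection (precision, recall, F1 score, as well as the confusion matrix) can be computed in the same way through the functions in sklearn.metrics.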
2.2 Statistical Evaluation Metrics

The statistical metrics used to evaluate the classifiers in this thesis are the following:

• Accuracy: the fraction of samples that the classifier labels correctly, defined as:

    \mathrm{Accuracy} = \frac{tp + tn}{P + N}

• Precision: the ability of the classifier not to label as positive a sample that is negative, defined as:

    \mathrm{Precision} = \frac{tp}{tp + fp}

• Recall: the ability of the classifier to find all the positive samples, defined as:

    \mathrm{Recall} = \frac{tp}{tp + fn}

• F1 score: the weighted harmonic mean of the precision and recall, defined as:

    F_1 = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

2.3 Visual Evaluation Metrics - Confusion Matrix

In addition to the above, we can also make use of a visual metric, the confusion matrix. Also known as an error matrix, it is a special kind of contingency table, with dimensions equal to the number of classes. It summarizes the algorithm's performance by exposing the false positives and false negatives.

2.4 Differentiating between binary and multiclass datasets

In order to differentiate the classification metrics used for