
Evaluation and comparison of models for efficient text classification

Ilias Koutsakis, University of Amsterdam, Amsterdam, The Netherlands ([email protected])
George Tsatsaronis, Elsevier, Amsterdam, The Netherlands ([email protected])
Evangelos Kanoulas, University of Amsterdam, Amsterdam, The Netherlands ([email protected])

Abstract

Recent research on word embeddings has shown that they tend to outperform distributional models on word similarity and analogy detection tasks. However, there is not enough information on whether or not such embeddings can improve text classification of longer pieces of text, e.g. articles. More specifically, it is not clear yet whether or not the usage of word embeddings has a significant effect on various text classifiers, and what the performance of word embeddings is after being trained in different amounts of dimensions (not only the standard size = 300).
In this research, we determine that the use of word embeddings can create feature vectors that not only provide a formidable baseline, but also outperform traditional, count-based methods (bag of words, tf-idf) for the same amount of dimensions. We also show that word embeddings appear to improve accuracy, even further than the distributional models baseline, when the amount of data is relatively small. In addition to that, the averaging of word embeddings appears to be a simple but effective way to keep a significant portion of the semantic information. Besides the overall performance and the standard classification metrics (accuracy, recall, precision, F1 score), the time complexity of the embeddings will also be compared.

Keywords: text classification, word embeddings, fasttext, word2vec, vectorization

Acknowledgments

I would like to thank my supervisors, George Tsatsaronis and Evangelos Kanoulas, for guiding me and providing me with their years of experience, patience, and support during the completion of this thesis.
I would also like to thank Elsevier, for giving me the chance to work and contribute to the work of amazing individuals in such an establishment, and to take advantage of their know-how and data during my research.
Last, I would like to thank my mother, for constantly being on my side, supporting and pushing me.

1 Introduction

The accumulating amount of text data is increasing exponentially, day by day. This trend of constant production and analysis of textual information is not going to stop anytime soon. On the contrary, even non-traditional businesses, like banks¹, start using Natural Language Processing for R&D and HR purposes. It is no wonder, then, that industries try to take advantage of new methodologies to create state-of-the-art products, in order to cover their needs and stay ahead of the curve.
This is how the concept of word embeddings became popular again in the last few years, especially after the work of Mikolov et al. [14], which showed that shallow neural networks can provide word vectors with some remarkable geometrical properties. The central idea is that words can be mapped to fixed-size vectors of real numbers. Those vectors should be able to hold semantic context, and this is why they were very successful in sentiment analysis, word disambiguation and syntactic parsing tasks. However, those vectors can still represent single words only, something that, although useful, really limits the usage of word embeddings in a wider context.

¹ https://www.ibm.com/blogs/watson/2016/06/natural-language-processing-transforming-financial-industry-2/

1.1 Motivation

The motivation behind this thesis is to examine whether or not word embeddings can be used in text classification tasks. It is important to understand the main problem, which is how to form a document feature vector from the separate word vectors. We should also examine the possible performance issues that different dataset sizes can have, but most importantly, datasets with different semantics.
In the following, we will present our findings, comparing the usage of word embeddings in text classification tasks, through the averaging of the word vectors, and their performance in comparison to baseline, distributional models (bag of words and tf-idf).

2 Classification and Evaluation Metrics

Classification is the process where, given a set of classes, we try to determine one or more predefined classes/labels that a given object belongs to [13]. More specifically, using a learning method or learning algorithm, we wish to train a classifier or classification function γ that maps documents to classes:

γ : X → C

This group of algorithms is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the learning method by Γ and write Γ(D) = γ, where D is the training set containing the documents. The learning method Γ takes the training set D as input and returns the learned classification function γ.
It is important to note that binary datasets (with true/false labels only) and multiclass datasets (with a variety of classes to choose from) are not necessarily different problems. More specifically, the multiclass classification problem is usually delegated to a binary problem using the "one-vs-rest" method, where each class iteratively becomes the "true" class and is evaluated against the rest of the classes (which collectively become the "false" class).
You can see an example of the process in Figure 1, taken from the NLTK documentation [12]. What follows is a short description of the classifiers used in this thesis, as well as the most common classifier evaluation metrics, both statistical and visual.

Figure 1. A representation of the classification procedure, showing the training and the prediction part. (source: NLTK Documentation)

2.1 Common classifiers

For this thesis, we needed to test different classification algorithms and compare the results. We decided to settle on a few algorithms, representative of the different classification methods that exist. The implementations used are the ones found in Scikit-Learn, a machine learning library for Python [17]. The selected algorithms are the following.

2.1.1 Naive Bayes

Naive Bayes (NB) is a classifier based on applying Bayes' theorem with the "naive" assumption of independence between all the features. It is considered a particularly powerful machine learning algorithm, with multiple applications, especially in document classification and spam filtering [25]. In our experiments, we used the Gaussian Naive Bayes implementation from Scikit-Learn, which uses the following formula:

P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)
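As a concrete reference point, the following is a minimal sketch of how this Scikit-Learn estimator is applied; the toy feature matrix, labels and variable names are illustrative assumptions and not the experimental setup of this thesis.

# Minimal sketch: Gaussian Naive Bayes from Scikit-Learn on toy data.
# In the experiments the rows of X would be document feature vectors
# (see Sections 3-5); here they are made-up two-dimensional points.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.2, 1.1], [0.4, 0.9], [3.2, 0.1], [2.9, 0.3]])  # toy document vectors
y = np.array([0, 0, 1, 1])                                      # toy class labels

clf = GaussianNB()
clf.fit(X, y)
print(clf.predict([[0.3, 1.0]]))  # -> [0]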

2.1.2 Logistic Regression

Logistic regression (LR) is a regression model where the dependent variable is categorical. This allows it to be used in classification, where, as an optimization problem [20], it minimizes the following cost function:

\min_{x} \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-b_i \, a_i^{T} x)\right)

It was invented in 1958 [22], and it is similar to the naive Bayes classifier. But instead of using probabilities to set the model's parameters, it searches for the parameters that will maximize the classifier performance [12].

2.1.3 Ensemble Methods

Ensemble methods try to improve accuracy and generalization in predictions by using several estimators instead of one. Random Forest (RF) is an ensemble method based on Decision Trees. It uses the averaging method of ensembling, i.e. it averages the predictions of the classifiers used. Decision Trees differ from the previously presented algorithms, as they are a non-parametric supervised learning method that tries to create a model that predicts the value of a target variable based on several input variables [24]. You can take a look at an example of created rules in Figure 2.

2.1.4 Support Vector Machines

Support Vector Machines/Classifiers (SVC) make use of hyperplanes, in a high- or infinite-dimensional space, which can be used for classification. They are very effective, compared to other algorithms, in cases where [17]:
• the dataset is very high-dimensional; and/or
• the number of dimensions is higher than the number of samples.
Both of those are immediately applicable in text classification, and although it is slower than other algorithms (especially Bayesian ones), it is very performant and memory efficient.

2.1.5 k-Nearest Neighbors

One of the simplest and most important algorithms in data mining is the k-Nearest Neighbors (k-NN). It is a non-parametric method used for classification, where the input consists of the k closest training examples in the feature space, and the output is the predicted class [2]. Every input item is classified by a majority vote of its neighbors.
It is considered very sensitive to the structure of the data, thus it is commonly used for datasets with a small amount of samples. Nearest neighbor rules in effect implicitly compute the decision boundary. For high-dimensional data (e.g. textual data), dimension reduction is usually performed prior to applying the k-NN algorithm, in order to avoid the effects of the curse of dimensionality [3].
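Since all of these estimators share the same fit/predict interface in Scikit-Learn, they can be collected and evaluated in a single loop. The sketch below shows one way to instantiate them with (near-)default settings; the use of LinearSVC for the SVC family and k = 5 for k-NN are assumptions made for illustration, not settings reported in this thesis.

# Sketch: the classifier families of Section 2.1, instantiated with defaults.
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    "NB":  GaussianNB(),
    "LR":  LogisticRegression(),
    "RF":  RandomForestClassifier(),
    "SVC": LinearSVC(),                            # linear SVM, assumed variant
    "kNN": KNeighborsClassifier(n_neighbors=5),    # k is an illustrative choice
}

# Typical usage (X_train/X_test are document feature matrices):
# for name, clf in classifiers.items():
#     clf.fit(X_train, y_train)
#     predictions = clf.predict(X_test)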
2.2 Statistical Evaluation Metrics

The metrics used below apply to binary classification. As mentioned before, in the case of multiclass datasets, each class separately is considered the positive one, and a "one-vs-rest" approach is used to get metric results separately. This becomes especially important when considering the different metrics, as some of them, e.g. the F1 score, are defined for binary classification only, and we need to be careful with the results.
In this context we consider tp, tn, fp, fn as true positive, true negative, false positive and false negative, respectively. We also consider P as the total positive samples and N as the total negative samples. It is important to note that since the methods below mainly use fractions, the best value is 1 and the worst value is 0.

• Accuracy: the fraction of correct predictions on a dataset. It can be computed either using the count of the correct predictions or their fraction on the total. We are using the fraction method, defined as:

  Accuracy = \frac{tp + tn}{P + N}

• Precision: the ability of the classifier not to label as positive a sample that is negative, defined as:

  Precision = \frac{tp}{tp + fp}

• Recall: the ability of the classifier to find all the positive samples, defined as:

  Recall = \frac{tp}{tp + fn}

• F1 score: the weighted harmonic mean of the precision and recall, defined as:

  F1 = 2 \times \frac{precision \times recall}{precision + recall}

Figure 2. A decision tree showing the survival of passengers on the Titanic. The figures under the leaves show the probability of survival and the percentage of observations in the leaf. (source: Wikipedia)

2.3 Visual Evaluation Metrics - Confusion Matrix

In addition to the above, we can also make use of a visual metric, the confusion matrix. Also known as the error matrix, it is a special kind of contingency table, with dimensions equal to the number of classes. It summarizes the algorithm performance by exposing the false positives and false negatives.

2.4 Differentiating between binary and multiclass datasets

In order to differentiate the classification metrics used for binary and multiclass datasets, there are a few ways to average binary metric calculations across the set of classes, each of which is useful in different scenarios. The averaging methods (explained at length in the scikit-learn documentation²) that are applicable here are:
• micro: gives each sample-class pair an equal contribution to the overall metric (except as a result of sample weight). Rather than summing the metric per class, this sums the dividends and divisors that make up the per-class metrics to calculate an overall quotient.
• macro: simply calculates the mean of the binary metrics, without adding any weights. It can be especially useful in cases where classes with a small number of samples are important, so their performance needs to be taken into consideration.
• weighted: tries to rectify class imbalance by computing the average of binary metrics in which each class' score is weighted by its presence in the true data sample.

² http://scikit-learn.org/stable/modules/model_evaluation.html

For example, let's assume that we classified a dataset and we have some results. If we consider y_true as the correct labels for each item, y_pred as the predicted labels, and L as the set of labels, then for Precision we would get:

  Prec_{micro} = P(y_{true}, y_{pred})

  Prec_{macro} = \frac{1}{|L|} \sum_{l \in L} P(y_{true,l}, y_{pred,l})

In this thesis, we will be using macro averaging, as all our datasets are completely balanced, so the weighted or micro average is not useful at all.
In addition, in the case of visual metrics, the representations change according to the requirements. For example, in the case of multiclass tasks, the confusion matrix becomes an N × N table, where N is the number of classes.
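These definitions map directly onto Scikit-Learn's metrics module. The short sketch below computes accuracy, the macro-averaged precision/recall/F1, and a confusion matrix; the label arrays are hypothetical stand-ins for real predictions.

# Sketch: the metrics of Sections 2.2-2.4 computed with Scikit-Learn.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = [0, 1, 2, 2, 1, 0]   # hypothetical gold labels
y_pred = [0, 2, 2, 2, 1, 0]   # hypothetical classifier output

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")   # macro = unweighted mean over classes
matrix = confusion_matrix(y_true, y_pred)   # N x N table for a multiclass task

print(accuracy, precision, recall, f1)
print(matrix)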

3 The Vector Space Model (VSM)

In order to use the text documents in a classifier, we need to create an appropriate and usable representation. This is achieved using the Vector Space Model (VSM), which was developed by Gerard Salton and his colleagues in 1975 [19] for the SMART information retrieval system. In the VSM, each document in a collection is represented as a point in a space (a vector in a vector space), and its usage revolutionized natural language processing and information retrieval.
In the VSM, all the documents and the queries are represented as vectors, where each dimension is a distinct word. E.g.:

  document_i = (w_{i,1}, w_{i,2}, ..., w_{i,n})
  document_{i+1} = (w_{i+1,1}, w_{i+1,2}, ..., w_{i+1,n})
  query = (w_1, w_2, ..., w_n)

Of course, the VSM provides the outline of the procedure, but there are different ways of weighting the words/features of the vector: the Bag of Words and Tf-Idf. An explanation of both follows.

3.1 Bag of Words (BoW)

The bag of words is a simplified representation of a document, based on one-hot encoding [23], where each word is represented in a vector by a categorical feature. It is one of the simplest and most effective tools in text mining and information retrieval. Its implementation in text mining is quite simple, and it requires having the whole corpus beforehand, in order to know the dimensions of the feature vectors. After transforming the text into a "bag of words", we can calculate various measures to characterize the text, but usually we use it together with the "term frequency", which is essentially the count of the words in each document.

3.2 Tf-Idf

Tf-Idf is an evolution of the bag of words, as it works under the assumption that not all of the words are equally important, no matter how often they appear. According to Aggarwal et al. [1], the most significant words are not the ones that appear most often, as these tend to be linking words such as "the", "or", "and", which are crucial to the structure of the document but do not carry importance. So, a way needs to be used that re-weights the count features (bag of words) into floating point values suitable for usage by a classifier.
One of the main methods used for text processing is therefore the vector-space based Tf-Idf (Term Frequency × Inverse Document Frequency) representation (Salton, 1983). As the name says, the formula has two distinct parts:
• the term frequency (tf), which is provided by the formula:

  tf = f_{t,d}

• and the inverse document frequency, which is:

  idf = \log\frac{N}{n_t}

In this thesis, we will be using the default Tf-Idf vectorizer class of Scikit-Learn, which uses the "smooth idf" variation³.

³ http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

4 Word Embeddings

The idea of using vector representations for words is not a new one. The earliest attempts to express word semantics as vectors stretch back to the 1950s (Osgood [16]); however, they have become a popular topic only in recent years. In the 1990s, the first methods to automatically generate contextual features were developed, with one of the most important ones being LSA [6]. The latest, massive interest in semantic word representations came from the recent research of Mikolov et al., who introduced the word2vec method [14].
In this chapter, we will describe the two different models that are part of the word2vec algorithm: Continuous Bag of Words (CBoW) and Skip-Gram. It is important to note that we will be using two different libraries: Gensim [18] and Facebook's FastText, by Bojanowski et al. [4]. A representation of the two models can be seen in Figure 4.

Figure 3. Plot of word embeddings, showing their dimensional qualities.

Figure 4. The CBoW and Skip-Gram models.
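Before describing the two architectures in detail, the sketch below shows how such models can be trained with the Gensim library, using the window, min_count and sample settings mentioned later in Section 5.1. The toy corpus is hypothetical, min_count is lowered to 1 only so the tiny corpus is not filtered out, and Gensim's own FastText class is used here for brevity, whereas the thesis experiments use Facebook's FastText library; depending on the Gensim version, the dimensionality argument is vector_size (newer) or size (older).

# Sketch: training Word2Vec and FastText embeddings with Gensim.
from gensim.models import Word2Vec, FastText

corpus = [["the", "movie", "was", "great"],
          ["the", "plot", "was", "boring"]]      # toy, pre-tokenized documents

w2v_sg = Word2Vec(corpus, vector_size=100, window=5,
                  min_count=1, sample=0.001, sg=1)   # sg=1 -> Skip-Gram
w2v_cbow = Word2Vec(corpus, vector_size=100, window=5,
                    min_count=1, sample=0.001, sg=0)  # sg=0 -> CBoW
ft_sg = FastText(corpus, vector_size=100, window=5,
                 min_count=1, sample=0.001, sg=1)     # subword-aware variant

print(w2v_sg.wv["movie"].shape)   # -> (100,)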

4.1 CBoW and Skip-Gram

Word2Vec differs from the previous, distributional models by its iterative nature, as it does not require computing any global information (e.g. a co-occurrence matrix), but iterates over the whole dataset while adjusting the vectors, which makes it much more memory-efficient and easier to reuse.
CBoW and Skip-Gram are similar, with one big difference: although both of them rely on the same concept of moving a filter window of fixed size along the whole dataset and trying to predict certain words, CBoW predicts the central word from the context (surrounding) words, while Skip-Gram does the opposite and predicts the context words from the central word.
Both word2vec models are considered shallow learning because, as seen in the picture, they only include one hidden and one output layer. The input of the neural network is every context word, which is encoded using one-hot encoding. This means that for C words, each word is represented by a C-dimensional vector. The hidden layer simply has to get the average of all those word vectors, using the representations that form a weight matrix W1. The important differences between the two models are:
• For CBoW, following from the hidden layer to the output layer, the second weight matrix W2 can be used to compute a score for each word in the vocabulary, and softmax can be used to obtain the distribution of words.
• For Skip-Gram, on the other hand, at the output layer we now output C multinomial distributions instead of just one. The training objective is to minimize the summed prediction error across all context words in the output layer.
Both CBoW and Skip-Gram are used extensively, and we will experiment with both.

4.2 Document vectorization using word embeddings

In order to retrieve a single feature vector for each document, we needed to find a way to combine the word vectors of each text collection. According to Mikolov et al., the averaging of the word vectors seems to provide a sufficient baseline, so we decided to test that on our own datasets, in different dimensions. The algorithm used to create the feature vector from the word embeddings is the following:

Algorithm 1: Vectorization of a document through the averaging of the word embeddings.
Data: a list of words, representing a document
Result: the document feature vector
1  initialize number of words = 0;
2  create a feature vector of 0s, with length = word embeddings dimensions (e.g. 300);
3  for word = 1, 2, ..., n do
4      if word in word2vec trained model then
5          increase the number of words by 1;
6          get the word vector from the model;
7          add the word vector to the feature vector (elementwise);
8      end
9  end
10 divide the feature vector by the number of words to get the average;
11 return the feature vector of the document;
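A minimal Python sketch of Algorithm 1 is given below. It assumes a trained Gensim KeyedVectors object (e.g. model.wv from the previous sketch) and a pre-tokenized document; it is an illustration of the averaging idea, not the original implementation.

# Sketch of Algorithm 1: average the vectors of the in-vocabulary tokens.
import numpy as np

def document_vector(tokens, keyed_vectors):
    """Average the embeddings of all in-vocabulary tokens of one document."""
    dims = keyed_vectors.vector_size
    feature_vector = np.zeros(dims, dtype=np.float32)
    n_words = 0
    for word in tokens:
        if word in keyed_vectors:          # skip out-of-vocabulary words
            feature_vector += keyed_vectors[word]
            n_words += 1
    if n_words == 0:                        # no known words: return the zero vector
        return feature_vector
    return feature_vector / n_words

# Example (hypothetical): doc_vec = document_vector(["great", "movie"], w2v_sg.wv)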

5 Implementation and Datasets

In this chapter we will present the technical aspects of the implementation used for our experiments, as well as a brief description of the datasets.

5.1 Implementation

The implementation of the required functionality was based on a variety of tools available in Python. Python was the obvious choice for this project, due to the wide variety of useful libraries for data science, all of them being state of the art in their respective categories. The implementation of the core functionality is based on the scikit-learn [17] Python library, as it includes implementations of the most commonly-used procedures and algorithms (e.g. decomposition, vectorization, classification, evaluation). In addition, we took advantage of the sklearn API [5], which allowed us to create a completely automated analysis pipeline.
The text vectors are created using the Scikit-Learn implementations of the Tf-Idf and Count (Bag of Words) vectorizers. In the case of pre-trained word embeddings, we opted for trying the two most popular tools right now: the Word2Vec implementation of the Gensim Python package [18] (which also provided many text pre-processing functions), as well as the implementation provided in Facebook's FastText library [10].
Regarding the tuning options, since we wanted to get a clear view of the baseline, we chose to use the classifiers with the default tunings (e.g. no regularization for Logistic Regression, no extra trees or depth for Random Forests, etc.). In the case of the word embeddings, we followed a similar procedure, using a context window of size w = 5. Words that occurred less than 5 times in the corpus were ignored, and high-frequency words were randomly downsampled, with the standard sample = 0.001 threshold.
The preprocessing was also minimal, as we decided, after the initial experiments, to just use stemming and stopword removal, using the Gensim preprocessing functions and the NLTK stopword list [12]. This significantly increased performance.
All the visualizations were created using the scikit-plot library [15], which is a wrapper over Matplotlib [8].

5.2 Dimension Reduction

An interesting issue that arose was the selection of the decomposition algorithm. Some early choices included manifold algorithms, e.g. t-SNE and MDS (multidimensional scaling). However, according to van der Maaten et al.⁴, algorithms like the abovementioned try to solve specific problems; mainly, the crowding problem that appears in two-dimensional and three-dimensional visualizations of very high-dimensional datasets. In addition, t-SNE is a non-parametric algorithm (it does not learn an explicit function that maps data from the input space), so it cannot be used with new data, making it unusable in business cases [21].
In the end we opted for using PCA (Principal Component Analysis), one of the most popular algorithms for dimensionality reduction [9]. In terms of dimensionality reduction, it can be formulated as the problem of finding the m orthonormal directions minimizing the representation error.
The implementation used was TruncatedSVD, found in Scikit-Learn [5]. It is much more performant than PCA, as it can easily work with sparse matrices, which are the main representation of text documents.

5.3 Datasets

It was very important to investigate the different outcomes in a variety of datasets. The datasets should be distinct, but also share some characteristics, to evaluate the differences and make meaningful comparisons. For this reason, we chose the following:
• a news dataset with 7,600 items and 4 classes;
• a binary dataset of 25,000 IMDB movie reviews; and
• a dataset with descriptions and ontologies from DBPedia, with 70,000 items.
The texts should be considered short, as we avoided using big datasets (like scientific publications) due to computational limitations. However, the accumulated datasets provide a clear look at standard corpora that are extensively used in both industry and academia (e.g. sentiment, news). You can take a look at the information summary for the datasets in Table 1. All the datasets are curated from Zhang et al., for the purposes of their own research [26].

Table 1. General dataset information

Corpus     Size     Labels   Avg. tokens
news       7,600    4        61
reviews    25,000   2        130
dbpedia    70,000   14       76
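To make the distributional baseline concrete, the sketch below chains the pieces described above (count-based vectorization, TruncatedSVD for dimension reduction, a default-settings classifier) into one Scikit-Learn pipeline. The toy texts, labels and the choice of two components are illustrative assumptions; the experiments compare component counts from 10 to 500.

# Sketch: distributional baseline pipeline (vectorize -> reduce -> classify).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

texts = ["a great and moving film", "a dull, boring movie", "an excellent story"]
labels = [1, 0, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),            # sparse Tf-Idf document-term matrix
    ("svd", TruncatedSVD(n_components=2)),   # works directly on sparse matrices
    ("clf", RandomForestClassifier()),
])
pipeline.fit(texts, labels)
print(pipeline.predict(["a boring film"]))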

⁴ https://lvdmaaten.github.io/tsne/
⁵ https://goo.gl/PAK8mX

Model LR NB SVC RF Model LR NB SVC RF w2v-cbow 0 0 0 0 w2v-cbow 0 0 0 0 w2v-sg 0 0 0 1 w2v-sg 0 0 0 1 ft-cbow 0 0 0 0 ft-cbow 0 0 0 0 ft-sg 3 6 3 5 ft-sg 3 6 3 5 bow 1 0 0 0 bow 0 0 0 0 tf-idf 2 0 3 0 tf-idf 3 0 3 0

6 Experiments and results

In our experiments, we conducted a variety of tests in order to evaluate the performance of word embeddings in comparison to distributional models (bag of words, tf-idf). Our research questions are mainly summarized as follows:
• can the usage of word embeddings outperform distributional models; and
• how do word embeddings, trained in different dimensions, perform against distributional models of the same dimensions?
To get answers, we created the following procedure:
• select a dataset;
• train word embeddings on the dataset, for a specific amount of dimensions, on all the available options (fasttext and word2vec, both in their cbow and skip-gram iterations);
• compare the results with the results of the distributional models, after using PCA for dimension reduction;
• compare the results to the baseline (distributional models without any dimension reduction).
The training of the word embeddings happened in a variety of dimensions, from 10 up to 500, in order to investigate whether the results would be acceptable. In addition, we randomly selected a few combinations and applied 10-fold cross-validation, by hand, to determine that the results do not overfit, and the results were positive.

Figure 5. Accuracy in Naive Bayes, for the News dataset.

We will now present the results for each of our datasets. We will be using a combination of table results and visualizations for multiclass datasets. More specifically, we will be using the dominance matrix, a contingency matrix-style table that shows the overall, accumulated performance of each algorithm, for each vectorization model. A thorough representation of the retrieved results can be found in the Appendix.
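As an illustration of the dominance-matrix idea, the sketch below counts, for every (dimension, classifier) pair, which vectorization model obtained the best score. The nested-dictionary layout and the scores are hypothetical; the thesis does not describe its exact bookkeeping code.

# Sketch: assembling a dominance matrix from per-dimension scores.
from collections import defaultdict

scores = {                                   # hypothetical accuracy results
    10:  {"ft-sg": {"LR": 0.83, "NB": 0.83}, "tf-idf": {"LR": 0.65, "NB": 0.65}},
    300: {"ft-sg": {"LR": 0.82, "NB": 0.82}, "tf-idf": {"LR": 0.65, "NB": 0.65}},
}

dominance = defaultdict(lambda: defaultdict(int))
for dim, per_model in scores.items():
    for clf in next(iter(per_model.values())):           # classifier names
        winner = max(per_model, key=lambda m: per_model[m][clf])
        dominance[winner][clf] += 1                       # one "win" per cell

print({model: dict(counts) for model, counts in dominance.items()})
# e.g. {'ft-sg': {'LR': 2, 'NB': 2}}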

6.1 Dataset: News Articles

6.1.1 Metrics

This was the smallest dataset of all, with 7,600 articles and 4 categories. Seeing the results of the dominance matrix in Table 2, it is clear that the Skip-Gram model, specifically the one from FastText, outperforms all the other models almost every time, with the exception of Tf-Idf on SVC. Similar results can be found when investigating Precision, Recall, and F1 score (see Table 3 for the F1 dominance matrix).
The most crucial finding here is, of course, that word embeddings seem to outperform the baseline (BoW/Tf-Idf without decomposition), even in a very low dimensional space (for text documents at least), e.g. 10 or even 30 dimensions. Specifically, Random Forest and Naive Bayes have very important gains, for a fraction of the original dimensions, making this model cheaper and far more effective. The plots (Figure 5 and Figure 6) show that from 10 to 500 dimensions, word embeddings averaging is constantly higher than the best of the two distributional methods, Tf-Idf.

Figure 6. Accuracy in Random Forest, for the News dataset.

6.1.2 Visual Metrics

Due to the fact that we have a multiclass dataset, it is quite interesting to investigate the confusion matrices and search for patterns. In this case, we chose as an example to investigate the results of Logistic Regression. It is very interesting that not only does the use of word embeddings (in this case Skip-Gram) improve the Tf-Idf results, but also that the semantic meaning is kept after using the averaging vectorization method, as the error rate in classes that easily collide (in this case Business and Tech.) has not significantly (if at all) declined. Figure 7 and Figure 8 present those findings.

Model      LR   NB   SVC   RF
w2v-cbow   0    0    0     0
w2v-sg     0    0    0     1
ft-cbow    0    0    0     0
ft-sg      3    6    3     5
bow        1    0    0     0
tf-idf     2    0    3     0

Figure 7. Tf-Idf results: Business and Tech are mixed (News dataset).

Figure 9. Accuracy in Naive Bayes (IMDB Reviews dataset).

Figure 8. Skip-Gram results: overall improvement, but not on the classes that are semantically close (News dataset).

Figure 10. Accuracy in Random Forest (IMDB Reviews dataset).

6.2 Dataset: IMDB Movie Reviews

This was a medium-sized, binary dataset, with 25,000 movie reviews. Seeing the results of the dominance matrix in Table 4, it is not as clear as in the previous dataset exactly which model is the best. Tf-Idf provides a formidable baseline that cannot be surpassed in some cases. However, all in all, word embeddings seem to be quite successful here as well, especially Word2Vec embeddings, which did not perform as well before, in comparison to FastText.
Once again, word embeddings outperform the baseline, starting from as low as 30 dimensions, for Random Forest and Naive Bayes. Curiously, Word2Vec seems to be much more performant than before, even outperforming FastText in certain cases, as seen in the figures.
It is important to distinguish the results. Although word embeddings immediately return better results than the baseline (no decomposition), the decomposition itself boosts the results of BoW and Tf-Idf. So, compared to the baseline, as well as to the dataset after the dimension reduction, word embeddings perform similarly well (with a slight loss), and outperform the distributional methods after around 300 dimensions.

6.3 Dataset: DBPedia Ontologies

The last dataset that we will use is the description of certain DBPedia ontologies. These are short texts, of about a paragraph in length, that could describe anything, e.g. animals, plants, companies and villages. It is quite interesting to see what happens in such a diverse but also extended dataset. FastText dominated in this dataset, especially the Skip-Gram algorithm, although CBoW was also successful in some cases.

Table 5. Accuracy dominance matrix (DBPedia Ontologies)

Model      LR   NB   SVC   RF
w2v-cbow   0    0    0     0
w2v-sg     0    0    0     0
ft-cbow    0    0    0     2
ft-sg      6    6    6     4
bow        0    0    0     0
tf-idf     0    0    0     0

6.3.1 Confusion Matrix Comparison

Since we have a significant amount of classes here, it makes sense to investigate whether our previous hypothesis — that word embeddings improve the overall performance, but not in cases of semantic similarities — could appear here as well. For this purpose, we chose Naive Bayes at 30 dimensions, where Skip-Gram achieves a 0.876 accuracy rate and Tf-Idf is somewhat lower, at 0.815. The confusion matrices, since they are pretty big, can be found in the appendix.
The results were actually quite interesting. Word embeddings increased the overall performance, but in certain cases, where the semantic similarity was big, the performance worsened significantly. The classes affected were:
• Animal got a higher confusion percentage with Plant (both are life-forms); and
• Artist got confused with Album and WrittenWork (art).
It is quite clear that the averaging of the vectors creates some geometrical properties that are being used during the training of the classifier.

6.4 Time Measurements

The time measurements were quite expected, according to our knowledge. For both implementations, Skip-Gram is slower to train than CBoW, which is already known. It seems, though, that FastText is significantly slower when using a larger number of dimensions (e.g. around 100) in comparison to Word2Vec, something that needs to be taken into consideration.
Another important note here is that for a larger number of dimensions, Tf-Idf is not significantly faster, so one could argue that training a word embedding model is the better choice, as it has the advantage that it can be retrained regularly, without having to use the whole dataset. You can take a look at the time measurements in Table 6.

7 Conclusion and Future Work

The purpose of this thesis was to investigate and evaluate whether or not word embeddings can successfully be used in text classification. For the purpose of our experiments we used most of the important tools that are available to data scientists right now, including:
• the Gensim implementation of Word2Vec;
• the newest word embedding software from Facebook, FastText;
• some curated and balanced datasets.
Our results varied, depending on the dataset, the preprocessing and the classifier. However, we were able to define some basic patterns that seemed to appear constantly.

7.1 Evaluation of Word Embeddings

There is no question that Skip-Gram is much better than CBoW in their respective libraries, which seems to be in agreement with recent research from Levy and Goldberg [11]. It is also somewhat slower, and can be significantly slower depending on the architecture, especially in FastText. However, it seems to be worth it, with gains from 2-5% and up to 20% in certain classifiers.
In addition, although FastText is better (significantly so, in some cases), it is not that much better most of the time. The increased training time that is required for FastText, even with its multithreaded approach, makes Word2Vec the better choice, for all intents and purposes.

7.2 Evaluation of Classifiers

Generally speaking, word embeddings seem to have huge gains in Bayesian algorithms, especially in lower dimensions. You can find the results in the Appendix. They also give a significant boost to Random Forests, even above the distributional baseline.
Logistic Regression and SVM are not that impressive, as word embeddings mostly underperform or perform minimally above the baseline. However, they provide steadily good results in all kinds of datasets, with a small amount of dimensions (even 10 to 30 dimensions seem enough to get an acceptable accuracy).
A very interesting discovery, after looking into the multiclass confusion matrices, was that even if the accuracy is improved in general, there is a chance that for certain, semantically similar classes, it will worsen. This is most probably due to the fact that, since we have a predetermined number of dimensions, the averaging method that we use "clusters" the documents, as similar words provide a similar average.

Table 6. Time measurements for the training/vectorization of 2 datasets.

Dim.   Training time (sec.) for 25,000 docs             Training time (sec.) for 70,000 docs
       FT-CBOW   FT-SG    W2V-CBOW  W2V-SG              FT-CBOW   FT-SG    W2V-CBOW  W2V-SG
10     19.51     28.14    11.04     34.79               14.1      27.32    7.22      23.79
30     25.45     39.81    15.28     52.5                21.52     29.59    10.68     27.57
50     37.85     46.71    14.52     44.12               24.7      34.1     9.84      29.42
100    51.35     73.17    16.04     40.4                35.7      43.81    9.73      28.14
300    119.02    170.62   23.19     72.69               92.2      115.93   13.03     52.7
500    225.76    274.02   30.99     104.39              153.27    194      16.41     62.95

       BoW       TF-IDF                                 BoW       TF-IDF
10     3.84      4.89                                   2.43      2.26
30     4.53      4.59                                   2.49      2.6
50     5.6       5.68                                   2.28      2.38
100    7.9       8.2                                    2.6       2.4
300    15.49     15.88                                  2.3       3.3
500    22.98     22.64                                  2.3       3.3

7.3 Dataset Sizes and Dimensions

It seems that the usage of word embeddings creates an early maximum that does not yield much improvement as the dimensions increase. This means that we can get acceptable results with a few dimensions, e.g. 10, but the returns are diminishing as we train the models with increasing dimensionality. Sometimes the performance even worsens.
On the other hand, it seems that the size of the dataset plays a role, but not that important a one. Although the highest result that we got using word embeddings was clearly in the large dataset (95.9%⁵), the small dataset achieved a very satisfactory 85.1% with just 100 dimensions, which is clearly important.

7.4 Future Work

The aforementioned results open the road for some exciting new research opportunities. Apart from the logical next step, which is to use larger, more diverse datasets and different word embedding tools (e.g. GloVe), we would like to see what other ways of using the word vectors exist, and how successful they would be for the same issues.
One idea is to use a Tf-Idf weighting scheme, instead of the simple averaging of word embeddings. Next, we should investigate the usage or not of certain neural network options, like CNNs and RNNs, which have been shown to provide excellent results, without having to deal with the vectorization issue that we had.
Last, right now the topic of "something2vec" is quite hot, and new research appears all the time, some of it quite specific, e.g. tweet2vec [7]. It would be interesting to see if domain-specific problems could be solved with domain-specific corpora, without having to implement and train a domain- and problem-specific neural network.

References

[1] Charu C. Aggarwal and Cheng Xiang Zhai. 2012. Mining Text Data. Springer Publishing Company, Incorporated.
[2] N. S. Altman. 1992. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. The American Statistician 46, 3 (1992), 175-185. https://doi.org/10.2307/2685209
[3] Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1999. When Is "Nearest Neighbor" Meaningful?. In Proceedings of the 7th International Conference on Database Theory (ICDT '99). Springer-Verlag, London, UK, 217-235. http://dl.acm.org/citation.cfm?id=645503.656271
[4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. CoRR abs/1607.04606 (2016). http://arxiv.org/abs/1607.04606
[5] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. 2013. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning. 108-122.
[6] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391-407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
[7] Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W. Cohen. 2016. Tweet2Vec: Character-Based Distributed Representations for Social Media. CoRR abs/1605.03481 (2016). http://arxiv.org/abs/1605.03481
[8] J. D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9, 3 (2007), 90-95. https://doi.org/10.1109/MCSE.2007.55
[9] I.T. Jolliffe. 1986. Principal Component Analysis. Springer Verlag.
[10] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification. CoRR abs/1607.01759 (2016). http://arxiv.org/abs/1607.01759
[11] Omer Levy and Yoav Goldberg. 2014. Neural Word Embedding As Implicit Matrix Factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS '14). MIT Press, Cambridge, MA, USA, 2177-2185. http://dl.acm.org/citation.cfm?id=2969033.2969070
[12] Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1 (ETMTNLP '02). Association for Computational Linguistics, Stroudsburg, PA, USA, 63-70. https://doi.org/10.3115/1118108.1118117
[13] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
[14] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. Curran Associates, Inc., 3111-3119.
[15] Reiichiro Nakano. 2017. reiinakano/scikit-plot: 0.2.6. Zenodo. https://github.com/reiinakano/scikit-plot
[16] Charles E. Osgood. 1952. The nature and measurement of meaning. Psychological Bulletin 49, 3 (1952), 197-237.
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825-2830.
[18] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45-50. http://is.muni.cz/publication/884893/en.
[19] G. Salton, A. Wong, and C. S. Yang. 1975. A Vector Space Model for Automatic Indexing. Commun. ACM 18, 11 (Nov. 1975), 613-620. https://doi.org/10.1145/361219.361220
[20] Mark W. Schmidt, Nicolas Le Roux, and Francis R. Bach. 2013. Minimizing Finite Sums with the Stochastic Average Gradient. CoRR abs/1309.2388 (2013). http://dblp.uni-trier.de/db/journals/corr/corr1309.html#SchmidtRB13
[21] L.J.P. van der Maaten and G.E. Hinton. 2008. Visualizing High-Dimensional Data Using t-SNE. (2008).
[22] S. H. Walker and D. B. Duncan. 1967. Estimation of the probability of an event as a function of several independent variables. Biometrika 54, 1 (June 1967), 167-179. http://view.ncbi.nlm.nih.gov/pubmed/6049533
[23] Wikipedia. 2017. Bag-of-words model - Wikipedia. https://en.wikipedia.org/wiki/Bag-of-words_model [Online; accessed 14-June-2017].
[24] Wikipedia. 2017. Decision tree learning - Wikipedia. https://en.wikipedia.org/wiki/Decision_tree_learning [Online; accessed 11-June-2017].
[25] Harry Zhang. 2004. The Optimality of Naive Bayes. In Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2004).
[26] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. CoRR abs/1509.01626 (2015). http://arxiv.org/abs/1509.01626

Appendix: Tables and Figures

Table 7. IMDB results for Logistic Regression and Naive Bayes

Dim.  Model      Logistic Regression                          Naive Bayes
                 Accuracy  Prec(ma)  Rec(ma)   F1(ma)         Accuracy  Prec(ma)  Rec(ma)   F1(ma)
10    BoW        0.72247   0.71921   0.72247   0.71237        0.64617   0.69402   0.64617   0.65195
10    TF-IDF     0.69953   0.70551   0.69953   0.69201        0.65973   0.68739   0.65973   0.66107
10    w2v-cbow   0.76921   0.76638   0.76921   0.76387        0.73669   0.74867   0.73669   0.73614
10    ft-cbow    0.85764   0.85655   0.85764   0.85523        0.83226   0.83288   0.83226   0.83058
10    w2v-sg     0.75299   0.74731   0.75299   0.74659        0.73461   0.73752   0.73461   0.73086
10    ft-sg      0.85219   0.85058   0.85219   0.84928        0.83104   0.83267   0.83104   0.82915
30    BoW        0.87716   0.87625   0.87716   0.87645        0.78426   0.7936    0.78426   0.78453
30    TF-IDF     0.8876    0.88826   0.8876    0.88748        0.8159    0.83375   0.8159    0.81969
30    w2v-cbow   0.88067   0.87998   0.88067   0.88009        0.8014    0.81374   0.8014    0.80222
30    ft-cbow    0.92      0.91961   0.92      0.91954        0.87303   0.87717   0.87303   0.87289
30    w2v-sg     0.8929    0.89218   0.8929    0.89175        0.83607   0.84094   0.83607   0.83466
30    ft-sg      0.9299    0.92952   0.9299    0.92943        0.87677   0.87855   0.87677   0.8756
50    BoW        0.9025    0.90204   0.9025    0.90204        0.78346   0.79723   0.78346   0.78526
50    TF-IDF     0.9117    0.91166   0.9117    0.91152        0.83966   0.84439   0.83966   0.83968
50    w2v-cbow   0.89941   0.8988    0.89941   0.89891        0.81869   0.82892   0.81869   0.81916
50    ft-cbow    0.93111   0.93073   0.93111   0.93079        0.88037   0.88373   0.88037   0.88017
50    w2v-sg     0.91789   0.91727   0.91789   0.91726        0.84797   0.8525    0.84797   0.84722
50    ft-sg      0.94051   0.94017   0.94051   0.94018        0.88491   0.88717   0.88491   0.88433
100   BoW        0.91904   0.91871   0.91904   0.91873        0.75256   0.77877   0.75256   0.75203
100   TF-IDF     0.92704   0.92706   0.92704   0.92692        0.85651   0.85867   0.85651   0.8565
100   w2v-cbow   0.91206   0.91157   0.91206   0.91171        0.82267   0.83178   0.82267   0.823
100   ft-cbow    0.93971   0.93941   0.93971   0.93946        0.88206   0.88615   0.88206   0.88187
100   w2v-sg     0.93619   0.93581   0.93619   0.93585        0.85811   0.86491   0.85811   0.85811
100   ft-sg      0.95023   0.94998   0.95023   0.95           0.8903    0.89255   0.8903    0.88986
300   BoW        0.94297   0.94291   0.94297   0.94288        0.65029   0.72622   0.65029   0.65413
300   TF-IDF     0.9434    0.94316   0.9434    0.94313        0.85133   0.85289   0.85133   0.85115
300   w2v-cbow   0.91463   0.9142    0.91463   0.91432        0.82376   0.83341   0.82376   0.82452
300   ft-cbow    0.94111   0.94081   0.94111   0.94086        0.88256   0.88647   0.88256   0.88244
300   w2v-sg     0.94746   0.94742   0.94746   0.94736        0.85966   0.86561   0.85966   0.85951
300   ft-sg      0.95623   0.95604   0.95623   0.95605        0.88873   0.89201   0.88873   0.88849
500   BoW        0.94957   0.94966   0.94957   0.94951        0.57629   0.68072   0.57629   0.58711
500   TF-IDF     0.95311   0.95307   0.95311   0.953          0.83304   0.83601   0.83304   0.83332
500   w2v-cbow   0.91207   0.91161   0.91207   0.91172        0.81413   0.82522   0.81413   0.81483
500   ft-cbow    0.94037   0.94008   0.94037   0.94012        0.88261   0.8865    0.88261   0.8824
500   w2v-sg     0.94213   0.94187   0.94213   0.94187        0.8607    0.8678    0.8607    0.86051
500   ft-sg      0.95554   0.95535   0.95554   0.95537        0.88696   0.89086   0.88696   0.88667

Table 8. Naive Bayes and Random Forest for the News dataset

Dim.  Model      Naive Bayes                                   Random Forest
                 Accuracy  Prec(ma)  Rec(ma)   F1(ma)         Accuracy  Prec(ma)  Rec(ma)   F1(ma)
MAX   BoW        0.78289   0.79344   0.78289   0.78458        0.81184   0.81052   0.81184   0.81042
MAX   TF-IDF     0.76711   0.76786   0.76711   0.76735        0.77237   0.77078   0.77237   0.77068
10    BoW        0.52592   0.56142   0.52592   0.51306        0.70592   0.70324   0.70592   0.70372
10    TF-IDF     0.65421   0.67704   0.65421   0.64795        0.78829   0.78778   0.78829   0.78717
10    w2v-cbow   0.65461   0.66344   0.65461   0.64392        0.72132   0.71794   0.72132   0.7184
10    ft-cbow    0.62118   0.63319   0.62118   0.60097        0.72184   0.71953   0.72184   0.71964
10    w2v-sg     0.79474   0.79415   0.79474   0.79285        0.82224   0.82145   0.82224   0.8216
10    ft-sg      0.83105   0.82955   0.83105   0.8297         0.83474   0.83411   0.83474   0.83428
30    BoW        0.58592   0.63953   0.58592   0.58087        0.74184   0.74067   0.74184   0.7398
30    TF-IDF     0.67053   0.71837   0.67053   0.67194        0.80513   0.80449   0.80513   0.80412
30    w2v-cbow   0.64013   0.64387   0.64013   0.62944        0.72763   0.72489   0.72763   0.72537
30    ft-cbow    0.56066   0.58333   0.56066   0.53501        0.72474   0.72249   0.72474   0.72238
30    w2v-sg     0.80921   0.80758   0.80921   0.80752        0.83079   0.83022   0.83079   0.8299
30    ft-sg      0.82947   0.82791   0.82947   0.82818        0.83329   0.83265   0.83329   0.83264
50    BoW        0.59592   0.64861   0.59592   0.59413        0.73513   0.7345    0.73513   0.73349
50    TF-IDF     0.66461   0.70563   0.66461   0.66555        0.80605   0.80578   0.80605   0.80539
50    w2v-cbow   0.63579   0.6439    0.63579   0.6274         0.73197   0.72933   0.73197   0.72949
50    ft-cbow    0.54829   0.56786   0.54829   0.52126        0.73289   0.72995   0.73289   0.73017
50    w2v-sg     0.80697   0.80546   0.80697   0.80511        0.83237   0.83155   0.83237   0.83175
50    ft-sg      0.83171   0.83032   0.83171   0.83063        0.83789   0.83732   0.83789   0.83745
100   BoW        0.59974   0.65004   0.59974   0.59964        0.72013   0.71969   0.72013   0.71822
100   TF-IDF     0.65342   0.69429   0.65342   0.65385        0.79908   0.79898   0.79908   0.7981
100   w2v-cbow   0.60592   0.61194   0.60592   0.59694        0.72566   0.72273   0.72566   0.72311
100   ft-cbow    0.52355   0.55938   0.52355   0.50034        0.73382   0.73119   0.73382   0.73151
100   w2v-sg     0.80974   0.80825   0.80974   0.80779        0.83105   0.83031   0.83105   0.8304
100   ft-sg      0.82145   0.81997   0.82145   0.82003        0.83803   0.83731   0.83803   0.83751
300   BoW        0.59211   0.6394    0.59211   0.59415        0.67658   0.67531   0.67658   0.6735
300   TF-IDF     0.65408   0.68661   0.65408   0.65355        0.77447   0.77336   0.77447   0.77293
300   w2v-cbow   0.56039   0.57191   0.56039   0.54825        0.71789   0.71508   0.71789   0.71582
300   ft-cbow    0.465     0.48974   0.465     0.4411         0.72329   0.721     0.72329   0.7
300   w2v-sg     0.80671   0.80528   0.80671   0.80474        0.83737   0.83692   0.83737   0.83679
300   ft-sg      0.81921   0.81744   0.81921   0.81772        0.83211   0.83169   0.83211   0.83172
500   BoW        0.58132   0.62546   0.58132   0.58298        0.65184   0.65211   0.65184   0.64961
500   TF-IDF     0.65697   0.68577   0.65697   0.65601        0.75276   0.75242   0.75276   0.75127
500   w2v-cbow   0.53421   0.54084   0.53421   0.51946        0.70974   0.70661   0.70974   0.70696
500   ft-cbow    0.44487   0.47991   0.44487   0.4292         0.70618   0.70378   0.70618   0.70375
500   w2v-sg     0.80566   0.80431   0.80566   0.80373        0.82803   0.82726   0.82803   0.8275
500   ft-sg      0.81289   0.81128   0.81289   0.81116        0.83197   0.83156   0.83197   0.83163

Figure 11. Tf-Idf in Naive Bayes (DBPedia dataset).

Figure 12. SG in Naive Bayes (DBPedia dataset).

Table 9. DBPedia dataset results for Naive Bayes and Linear SVC

Dim.  Model      Naive Bayes                                   Linear SVC
                 Accuracy  Prec(ma)  Rec(ma)   F1(ma)         Accuracy  Prec(ma)  Rec(ma)   F1(ma)
10    BoW        0.64617   0.69402   0.64617   0.65195        0.71024   0.70619   0.71024   0.68798
10    TF-IDF     0.65973   0.68739   0.65973   0.66107        0.69941   0.69368   0.69941   0.67798
10    w2v-cbow   0.73669   0.74867   0.73669   0.73614        0.7614    0.76026   0.7614    0.75214
10    ft-cbow    0.83226   0.83288   0.83226   0.83058        0.85403   0.85332   0.85403   0.85002
10    w2v-sg     0.73461   0.73752   0.73461   0.73086        0.74541   0.73856   0.74541   0.73158
10    ft-sg      0.83104   0.83267   0.83104   0.82915        0.84886   0.84765   0.84886   0.84389
30    BoW        0.78426   0.7936    0.78426   0.78453        0.87717   0.87578   0.87717   0.87598
30    TF-IDF     0.8159    0.83375   0.8159    0.81969        0.89477   0.89415   0.89477   0.89411
30    w2v-cbow   0.8014    0.81374   0.8014    0.80222        0.87943   0.8786    0.87943   0.87861
30    ft-cbow    0.87303   0.87717   0.87303   0.87289        0.91984   0.91941   0.91984   0.91926
30    w2v-sg     0.83607   0.84094   0.83607   0.83466        0.89226   0.89174   0.89226   0.89078
30    ft-sg      0.87677   0.87855   0.87677   0.8756         0.92979   0.9294    0.92979   0.92925
50    BoW        0.78346   0.79723   0.78346   0.78526        0.9021    0.9014    0.9021    0.90146
50    TF-IDF     0.83966   0.84439   0.83966   0.83968        0.91859   0.91809   0.91859   0.91818
50    w2v-cbow   0.81869   0.82892   0.81869   0.81916        0.89914   0.89841   0.89914   0.89845
50    ft-cbow    0.88037   0.88373   0.88037   0.88017        0.93119   0.93077   0.93119   0.9308
50    w2v-sg     0.84797   0.8525    0.84797   0.84722        0.91869   0.91804   0.91869   0.91799
50    ft-sg      0.88491   0.88717   0.88491   0.88433        0.94073   0.94036   0.94073   0.94036
100   BoW        0.75256   0.77877   0.75256   0.75203        0.91883   0.91842   0.91883   0.91843
100   TF-IDF     0.85651   0.85867   0.85651   0.8565         0.93333   0.93308   0.93333   0.9331
100   w2v-cbow   0.82267   0.83178   0.82267   0.823          0.91961   0.91918   0.91961   0.91929
100   ft-cbow    0.88206   0.88615   0.88206   0.88187        0.94219   0.9419    0.94219   0.94191
100   w2v-sg     0.85811   0.86491   0.85811   0.85811        0.93916   0.93881   0.93916   0.93882
100   ft-sg      0.8903    0.89255   0.8903    0.88986        0.95119   0.95092   0.95119   0.95095
300   BoW        0.65029   0.72622   0.65029   0.65413        0.9429    0.94288   0.9429    0.94282
300   TF-IDF     0.85133   0.85289   0.85133   0.85115        0.95066   0.95049   0.95066   0.95047
300   w2v-cbow   0.82376   0.83341   0.82376   0.82452        0.92766   0.92732   0.92766   0.92738
300   ft-cbow    0.88256   0.88647   0.88256   0.88244        0.94704   0.94679   0.94704   0.9468
300   w2v-sg     0.85966   0.86561   0.85966   0.85951        0.95364   0.9535    0.95364   0.95351
300   ft-sg      0.88873   0.89201   0.88873   0.88849        0.95844   0.95827   0.95844   0.95828
500   BoW        0.57629   0.68072   0.57629   0.58711        0.94881   0.94882   0.94881   0.94876
500   TF-IDF     0.83304   0.83601   0.83304   0.83332        0.95066   0.95047   0.95066   0.95044
500   w2v-cbow   0.81413   0.82522   0.81413   0.81483        0.92597   0.92559   0.92597   0.92567
500   ft-cbow    0.88261   0.8865    0.88261   0.8824         0.94699   0.94675   0.94699   0.94675
500   w2v-sg     0.8607    0.8678    0.8607    0.86051        0.95866   0.95852   0.95866   0.95853
500   ft-sg      0.88696   0.89086   0.88696   0.88667        0.9592    0.95902   0.9592    0.95905
