Evaluation of vector embedding models in clustering of text documents

Tomasz Walkowiak
University of Science and Technology, Faculty of Electronics
Wybrzeze Wyspianskiego 27, Wroclaw 50-370, Poland
[email protected]

Mateusz Gniewkowski
University of Science and Technology, Faculty of Computer Science and Management
Wybrzeze Wyspianskiego 27, Wroclaw 50-370, Poland
[email protected]

Abstract

The paper presents an evaluation of word embedding models in clustering of texts in the Polish language. The authors verified six different embedding models, starting from the widely used word2vec, across fastText with character n-gram embeddings, to the deep learning-based ELMo and BERT. Moreover, four standardisation methods, three distance measures and four clustering methods were evaluated. The analysis was performed on two corpora of texts in Polish classified into subjects. The Adjusted Mutual Information (AMI) metric was used to verify the quality of clustering results. The performed experiments show that the Skipgram model with character n-gram embeddings, built on the KGR10 corpus and provided by Clarin-PL, outperforms other publicly available models for Polish. Moreover, the presented results suggest that the Yeo–Johnson transformation for document vector standardisation and Agglomerative Clustering with a cosine distance should be used for grouping text documents.

1 Introduction

The number of digital repositories of texts enlarges each year. The variety of tools for natural language processing and the quality of their performance are growing steadily. That opens possibilities for automatic categorisation of text documents in terms of subject areas in any digital collection of documents. It is also an important problem for researchers from different areas of the humanities and social science (Eder et al., 2017).

Commonly used methods rely on representing documents with feature vectors and using a clustering algorithm (Hastie et al., 2009) to assign documents to groups. The classical feature vectors are based on the bag-of-words technique (Harris, 1954). Components of these vectors represent (weighted) frequencies of occurrences of words/terms in individual documents. The contemporary state-of-the-art technique is word2vec (Le and Mikolov, 2014), where individual words are represented by high-dimensional feature vectors trained on large text corpora. This technique is constantly being improved, as demonstrated by the most recent propositions of algorithms like ELMo (Peters et al., 2018) or BERT (Devlin et al., 2018). Choosing the most useful clustering algorithm is not a trivial task, since there is a large number of them; to mention only the most popular ones: K-means (Jain, 2009), Agglomerative Hierarchical Clustering (Day and Edelsbrunner, 1984) and Spectral Clustering (Ng et al., 2002). Moreover, the results of clustering are strongly dependent on the chosen distance measure and the method used for input data standardisation.

The entire workflow described above involves many factors which can influence the results of text exploration. It is difficult to control them all, which might lead to unpredictable outcomes of the experiment. It becomes especially challenging for texts in an inflected language such as Polish.

The main aim of this research is the evaluation of clustering accuracy on documents in Polish, using publicly available word2vec models. We conducted our experiments to answer the following research question: what is the best method (i.e. the word2vec model, standardisation method, distance metric and clustering algorithm) for subject grouping of texts in Polish? The experiments were performed on two corpora with texts assigned to subject groups. We analysed the quality of results (defined by the AMI metric (Romano et al., 2016)) as a function of method options.
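Both kinds of representation reduce a document to a single numeric vector. As a minimal illustration of the embedding-based variant (the averaging scheme detailed later in the paper), the sketch below uses a toy three-dimensional embedding table with hypothetical Polish tokens; the real models evaluated here are 300 to 1024 dimensional:

```python
import numpy as np

# Toy embedding table: hypothetical tokens, 3 dimensions
# (the real models in this paper use 300-1024 dimensions).
embeddings = {
    "kot":  np.array([0.9, 0.1, 0.0]),
    "pies": np.array([0.8, 0.2, 0.1]),
    "sejm": np.array([0.0, 0.1, 0.9]),
}

def doc_vector(tokens, emb):
    """Represent a document as the average of its in-vocabulary word vectors."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:  # no known words: fall back to a zero vector
        return np.zeros(len(next(iter(emb.values()))))
    return np.mean(vecs, axis=0)

v = doc_vector(["kot", "pies", "nieznane"], embeddings)  # unknown token skipped
```

A fastText-style model would instead back off to character n-grams for the unknown token; skipping it, as above, is only a placeholder behaviour.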

Proceedings of Recent Advances in Natural Language Processing, pages 1304–1311, Varna, Bulgaria, Sep 2–4, 2019. https://doi.org/10.26615/978-954-452-056-4_149

Some works studied the quality of word2vec models for Polish (Piasecki et al., 2018; Mykowiecka et al., 2017; Kocon and Gawor, 2019), but they focus on single words, not on an application of word embeddings to represent whole documents for a clustering purpose.

The paper is structured as follows. In Section 2, we describe in detail word embedding techniques and list the models examined in this work. Next, we provide technicalities about the methods we compared, and finally in Section 4 we present the test corpora used in the study as well as the results of the comparative study.

2 Word Embeddings

2.1 Techniques of Word Embedding

Word embedding is an approach to text analysis based on the assumption that individual words can be represented by high-dimensional feature vectors. It is based on the hypothesis that relationships (distances) between vector representations of words can be related to semantic similarities of words. The models are built on large text corpora by observing co-occurrences of words in similar contexts. One of the most popular techniques, word2vec, is based on neural networks (Le and Mikolov, 2014). The authors proposed two approaches: CBOW and Skipgram. In the first one, the aim is to predict a word based on context words. The Skipgram model does the opposite task: it predicts context words from a given word. In the classical word2vec (Le and Mikolov, 2014) technique, each word (form from the text) is represented by a distinct vector, which might be a problem for a language with a large vocabulary and rich inflexion like Polish. The first solution to this problem was to build models based on word lemmas (Mykowiecka et al., 2017); however, such a technique requires a morphological analysis of all texts in a training corpus. Next, in (Bojanowski et al., 2017) the authors extended the Skipgram model by building a vector representation of character n-grams and constructing the word representation as the sum of the character n-gram embeddings (for n-grams appearing in the word). This can decrease the model size and allows generating word embeddings for words not seen in the training corpus. The next step forward was the introduction (Grave et al., 2018) of an extension of the original CBOW model (Le and Mikolov, 2014) with position weights and subword information (character n-grams).

The newest approaches are inspired by deep-learning algorithms. In the recently introduced ELMo (Peters et al., 2018), word embeddings are defined by the internal states of a deep bidirectional LSTM language model (biLSTM), which is trained on a large corpus. What is important, ELMo looks at the whole sentence before assigning an embedding to each word in it. Therefore, the embeddings are sentence aware and could solve the problem of polysemous words (words with multiple meanings). Another approach similar to ELMo is BERT (Devlin et al., 2018). It is also a bidirectional representation, but it is jointly built on both the left and the right context. The available BERT models[1] are multilingual and pre-trained on two unsupervised tasks: masked language modelling and next sentence prediction.

Word embeddings can be simply used to generate feature vectors for document clustering by averaging the vector representations of the individual words occurring in a document. This approach is known as doc2vec and is used, for example, in the fastText (Joulin et al., 2017) algorithm. In the case of BERT, we get sentence embeddings, but the approach used for word models can be repeated here as well (as an average of sentence embeddings).

2.2 Available Models for Polish

There are two groups working on publicly available word embedding models for Polish: IPI PAN[2] and Clarin-PL[3]. The IPI PAN provides[4] a set of more than 100 CBOW and Skipgram models generated from data consisting of the National Corpus of Polish (NKJP) (Przepiorkowski et al., 2012) and Wikipedia (Wiki). Some of them are generated only for lemmas, others for words from texts (forms). For the tests we have selected the Skipgram model (i.e. nkjp+wiki-forms-all-300-skipg-hs-50). The model was generated by the gensim tool[5]. It assigns a distinct vector to each word.

The Clarin-PL provides[6] 16 models generated by the fastText software[7] on a corpus larger than in the previous case (Kocon and Gawor, 2019). They are joint models of words and character n-grams, able to produce vectors for unknown words (Bojanowski et al., 2017). For the tests, we have selected two Skipgram models based on forms (KGR10) and on lemmas (KGR10 lemma). The second group of sources of word2vec models for Polish are the web pages of word embedding tools like fastText, ELMo and BERT. They were trained on Polish Common Crawl and Wikipedia. However, the BERT model was trained on many languages in parallel. The details of the used models are summarised in Table 1.

Table 1: Used embedding models

name        | method            | features                  | tool     | size | address
IPIPAN      | CBOW              | forms                     | gensim   | 300  | dsmodels.nlp.ipipan.waw.pl
KGR10       | Skipgram          | forms, character n-grams  | fastText | 300  | hdl.handle.net/11321/606
KGR10 lemma | Skipgram          | lemmas, character n-grams | fastText | 300  | hdl.handle.net/11321/606
fastText    | CBOW[8]           | orths, character 5-grams  | fastText | 300  | fasttext.cc
elmo        | ELMo              | forms                     | ELMo     | 1024 | vectors.nlpl.eu/repository/11/167.zip
bert        | BERT multilingual | forms, sentences          | BERT     | 768  | github.com/google-research/bert

[1] https://github.com/google-research/bert
[2] https://ipipan.waw.pl/en/
[3] https://clarin-pl.eu/en/home-page/
[4] http://dsmodels.nlp.ipipan.waw.pl
[5] https://radimrehurek.com/gensim/
[6] http://hdl.handle.net/11321/606
[7] https://fasttext.cc/
[8] CBOW with position weights

3 Methods

3.1 Metrics and Clustering

A distance measure needs to distinguish contrasting samples. Sometimes that contrast is well defined, and we know what kind of behaviour we require from the chosen function. In natural language processing, a commonly used function is the cosine distance, as, for example, it does not distinguish documents (described as vectors of the most frequently occurring words in the corpus) that have a linear dependence between features. It also works well in sparse high-dimensional spaces and is less noisy than the Euclidean distance (Kriegel H-P., 2012). The models we want to compare have different properties, so the relevant distance function is even less obvious. In our work we decided to focus on the cosine, Bray–Curtis and Euclidean distances. We also tried a few different variations of them, but the results were not significantly different.

In order to group the data, we decided to use the following methods (mind that two of them use only the Euclidean distance, or the Lk norm in general, because otherwise the algorithms may stop converging):

1. The K-means (Jain, 2009) algorithm is a classic method that assigns labels to the data based on the distance to the nearest centroid. Centroids are moved iteratively until all clusters stabilise.

2. Agglomerative hierarchical clustering (AC) (Day and Edelsbrunner, 1984) is a method that iteratively joins subgroups based on a linkage criterion. In this paper, we present results for average linkage clustering.

3. Spectral clustering (SC) (Ng et al., 2002) is based on the Laplacian matrix of the similarity graph and its eigenvectors. The least significant eigenvectors create a new, lower-dimensional space that is used with a K-means algorithm.

4. Expectation–maximisation (EM) (Bilmes et al., 1998) is used to estimate parameters in statistical models, a Gaussian mixture model in our example. The Gaussian mixture model assumes that the data is generated from a finite number of Gaussian distributions.

3.2 Standardisation

Standardisation of input data is usually necessary, because many machine learning algorithms require features to have a normal distribution and, probably more importantly for clustering algorithms, a similar scale. In our work, we decided to use the following methods:

Table 2: Symbols description

X                    | feature vector (a column of the input matrix)
\bar{X}              | mean value of the feature
X_{\min}, X_{\max}   | minimal/maximal value of the feature
x_i, \hat{x}_i       | original/transformed value of the feature for the i-th sample
\sigma               | standard deviation of the feature
\lambda              | power parameter, estimated through maximum likelihood

1. Min–Max scaling, the most popular way of scaling data. Returned values are in the range from 0 to 1:

\hat{x}_i = \frac{x_i - X_{\min}}{X_{\max} - X_{\min}}

2. Z-score normalisation, one of the classic methods of standardisation. It results in a distribution with a standard deviation equal to 1:

\hat{x}_i = \frac{x_i - \bar{X}}{\sigma}

3. Yeo–Johnson transformation, a member of the power transform family that allows negative input values (Yeo and Johnson, 2000):

\hat{x}_i = \begin{cases}
\frac{(x_i+1)^{\lambda}-1}{\lambda},             & \lambda \neq 0,\; x_i \geq 0 \\
\log(x_i+1),                                     & \lambda = 0,\; x_i \geq 0 \\
-\frac{(-x_i+1)^{2-\lambda}-1}{2-\lambda},       & \lambda \neq 2,\; x_i < 0 \\
-\log(-x_i+1),                                   & \lambda = 2,\; x_i < 0
\end{cases}
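The Yeo–Johnson case can be transcribed directly from the piecewise definition; below is a numpy sketch with the power parameter lambda supplied by the caller (in practice lambda is fitted by maximum likelihood, as e.g. scikit-learn's PowerTransformer does):

```python
import numpy as np

def yeo_johnson(x, lam):
    """Piecewise Yeo-Johnson transform for a fixed power parameter lam."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # Non-negative branch: ((x+1)^lam - 1) / lam, or log(x+1) when lam == 0.
    if lam != 0:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log(x[pos] + 1.0)
    # Negative branch: -((-x+1)^(2-lam) - 1) / (2-lam), or -log(-x+1) when lam == 2.
    if lam != 2:
        out[~pos] = -((-x[~pos] + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log(-x[~pos] + 1.0)
    return out
```

Setting lambda = 1 reproduces the input exactly on both branches, which is a convenient sanity check of the implementation.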

3.3 Quality Metrics

Evaluation of clustering quality may be performed in two different ways: with external knowledge of sample membership or without it. The first way is usually preferable if we already have labelled data but supervised learning is not an option; for example, when we know that the clustering problem we want to solve concerns data similar to data we have already labelled, which is the case in this work. We compare how different vector representations of documents, each with an assigned label, can be clustered.

There are plenty of clustering quality measures with different interpretations, like purity, V-measure or the Rand Index (Amigo et al., 2009). In our work, we decided to use measures corrected for chance, where the result does not increase with the number of clusters for randomly chosen labels. The two most common such metrics are the Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI) (Vinh et al., 2010). According to (Romano et al., 2016), AMI is better suited to our problem, as document types are often unbalanced: it gives more weight to clustering solutions with purer small groups than to minor mistakes in bigger ones.

Table 3: Symbols description

X, Y              | sets of classes/clusters
H                 | entropy
MI                | mutual information
NMI               | normalised mutual information
AMI               | adjusted mutual information
x_i, y_j          | i-th/j-th element of X/Y (a class or a cluster)
P(x_i), P(y_j)    | probability of a document being in the i-th class / j-th cluster
P(x_i \cap y_j)   | probability of a document being in both x_i and y_j
E(MI)             | expected value of MI

The adjusted mutual information score is one of the information-theoretically based measures. It is based on mutual information (MI), which comes naturally from entropy:

H(X) = \sum_i P(x_i) \log \frac{1}{P(x_i)}

MI(X, Y) = \sum_i \sum_j P(x_i \cap y_j) \log \frac{P(x_i \cap y_j)}{P(x_i) P(y_j)}

The problem with mutual information is that its maximum is reached not only when the labels from one set (clusters) match perfectly those from the other (classes), but also when they are further subdivided. The simple solution is to normalise MI by the mean entropy of X and Y:

NMI(X, Y) = \frac{MI(X, Y)}{(H(X) + H(Y))/2}

Normalised mutual information can be further improved by subtracting the expected value of MI from the numerator and the denominator:

AMI(X, Y) = \frac{MI(X, Y) - E(MI)}{(H(X) + H(Y))/2 - E(MI)}

This is what "corrected for chance" means. The general form of AMI was proposed in (Hubert and Arabie, 1985).

4 Experiments

4.1 Data Sets

We performed tests on two collections of text documents in Polish: Press and Reviews. The first corpus (Press) comprises Polish press news. It is a complete, high-quality and well-defined data set. The texts were assigned by a press agency to five subject categories. All the subject groups are well separable from each other and each group contains a reasonably large number of members. In the study (Walkowiak and Malak, 2018), the authors reported 95.5% accuracy achieved on this data set in supervised classification. There are ca. 6,500 documents in total in this corpus.

The second corpus (Reviews) consists of reviews of scientific works from 21 different science areas. The accuracy achieved on this data set by fastText (Joulin et al., 2017) in supervised classification was 90.7% (after a 2:1 division into training and testing parts). There are ca. 10,500 documents in this corpus.

4.2 Idea

The goal of our experiments was to find the best performing word embedding model in a clustering problem. In order to do that, first, we checked how standardisation affects the results and picked one of the methods for use in further tests. Then we compared several models using different clustering approaches, with and without standardisation. In order to generate feature vectors for documents (doc2vec), we averaged the word/sentence embeddings for every text in the dataset.

4.3 Choosing the Standardisation Method

We performed our first experiment as follows: having the documents d_i \in D represented as doc2vec vectors from the KGR10 lemma model, we performed several tests to evaluate the quality measure (AMI) of multiple clustering algorithms with different distance functions. The results are given in Figure 1 and Figure 2. It can be observed that for EM, K-means and SC, standardisation does not significantly improve the results. What is more, for the Euclidean distance, data scaling may blur the distances between points and worsen the quality of the solution. On the other hand, using standardisation methods with the Agglomerative Clustering (AC) algorithm improves the obtained results. This is not surprising, as the linkage method strongly depends on variance, especially when using a cosine distance. On average (measured as the average height of the AMI score), the best method of standardisation turned out to be the Yeo–Johnson transformation, so we used it in the subsequent experiments.

4.4 Model Comparison

In order to compare how the chosen model affects quality, we performed multiple tests, similar to those previously conducted. They evaluated the quality measure for the doc2vec representations generated from the models described in Section 2. The results of clustering the original data can be observed in Figures 3 and 5, alongside the results for vectors standardised with the Yeo–Johnson transformation in Figures 4 and 6. It can be noticed that standardisation has a minor influence on Spectral Clustering (SC): it either slightly improves or does not deteriorate the results. The only exception is the Euclidean distance, where standardisation can blur the distances and therefore worsen the score. The effect of standardisation is clearly visible for Agglomerative Clustering (AC), which, after standardisation and with a cosine or a Bray–Curtis distance, works best. Using the standardised version of the K-means algorithm with any of the proposed models is rather not advisable, since standardisation does not necessarily increase the score; however, this classic approach is usable for the given problem, especially as it strongly depends on centroids, which usually are quite useful. We expected that standardisation would have a positive influence on the Gaussian mixture model (EM), which assumes that the data is generated from some number of Gaussian distributions, so standardisation should make the data more Gaussian-like. This is probably the case for those models that have a higher score in Figures 4 and 6 than in Figures 3 and 5, but it is not a rule.
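The entropy and mutual-information quantities behind the AMI metric of Section 3.3 can be sketched over empirical label distributions; the expected-MI correction is omitted here for brevity (a full AMI would typically come from a library routine such as scikit-learn's adjusted_mutual_info_score):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(X) = sum_i P(x_i) log(1 / P(x_i)) over the empirical distribution."""
    n = len(labels)
    return -sum(c / n * np.log(c / n) for c in Counter(labels).values())

def mutual_info(xs, ys):
    """MI(X,Y) = sum_ij P(x_i & y_j) log(P(x_i & y_j) / (P(x_i) P(y_j)))."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * np.log((c / n) / (px[a] / n * py[b] / n))
               for (a, b), c in Counter(zip(xs, ys)).items())

def nmi(xs, ys):
    """MI normalised by the mean entropy of classes and clusters."""
    return mutual_info(xs, ys) / ((entropy(xs) + entropy(ys)) / 2)
```

A perfect clustering scores 1 even under permuted cluster ids, and independent labels score 0; unlike AMI, however, this uncorrected NMI still drifts upward when many small random clusters are produced.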

[Figure 1: Standardisation - PRESS. Bar chart of AMI scores (0 to 0.6) for AC, EM, K-means and SC with Bray–Curtis, cosine and Euclidean distances, on original, Z-score, Min–Max and Yeo–Johnson standardised vectors.]

[Figure 2: Standardisation - REVIEWS. Bar chart of AMI scores (0 to 0.6) for AC, EM, K-means and SC with Bray–Curtis, cosine and Euclidean distances, on original, Z-score, Min–Max and Yeo–Johnson standardised vectors.]

5 Conclusion

As a conclusion, we would like to recommend using the Yeo–Johnson transformation to standardise doc2vec embeddings and, if the task is to group documents, Agglomerative Clustering (AC) with a cosine distance. For researchers who deal with Polish datasets, we strongly recommend using the KGR10 lemma or KGR10 word2vec models (Kocon and Gawor, 2019)[9]. The first of them gives better results, but the second one (only slightly worse) is much faster to use, since it does not require a time-consuming lemmatisation of the texts. We are now working on implementing the best selected workflow as a part of WebSty (Eder et al., 2017)[10], an online tool aimed at researchers in the humanities and social science working with texts in Polish.

Although ELMo and BERT perform very well in many NLP tasks, our results show otherwise for this one. Two factors can be responsible. First, we used already trained models downloaded from the addresses shown in Table 1. As they might not be of the best quality for the Polish language, we are currently training our own model to verify this hypothesis. The second reason may be that both BERT and ELMo simply do not work well for the discussed problem; it is hard to find any article dealing with the document clustering problem using those methods.

We plan to test other methods of composing document vectors, i.e. representing documents by several concatenated vectors. We want to test two approaches. One is based on a division of documents into parts and generating a doc2vec vector for each part. The second is based on clustering the word embeddings into a predefined number of groups and using the centroids as elements of the final document vectors.

The work was funded by the Polish Ministry of Science and Higher Education within the CLARIN-PL Research Infrastructure.

[9] http://hdl.handle.net/11321/606
[10] http://ws.clarin-pl.eu/websty.shtml
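The second document-vector composition strategy mentioned in the conclusion can be sketched speculatively (this is not the authors' implementation): cluster a document's word vectors with a toy k-means and concatenate the resulting centroids. A real version would additionally need a canonical ordering of centroids so that vectors are comparable across documents:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(vectors, k, iters=20):
    """Minimal k-means: random init, Euclidean assignment, mean update."""
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):  # keep the old centroid if a cluster empties
                centroids[j] = vectors[assign == j].mean(axis=0)
    return centroids

def doc_vector_centroids(word_vecs, k):
    """Concatenate k centroids of the document's word vectors (k * dim features).
    Centroid order is arbitrary here; a usable version must fix it, e.g. by
    mapping each document's clusters onto a shared global codebook."""
    return kmeans(word_vecs, k).ravel()

word_vecs = rng.normal(size=(10, 4))            # ten toy 4-d "word embeddings"
doc_vec = doc_vector_centroids(word_vecs, k=3)  # 3 centroids * 4 dims = 12 features
```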

[Figure 3: Model Comparison - PRESS. Bar chart of AMI scores (0 to 0.6) for AC, EM, K-means and SC with Bray–Curtis, cosine and Euclidean distances, for the IPIPAN, KGR10, KGR10 lemma, fastText, elmo and bert models.]

[Figure 4: Model Comparison - PRESS (standardised with Yeo–Johnson transformation). Same layout as Figure 3.]

[Figure 5: Model Comparison - REVIEWS. Same layout as Figure 3.]

[Figure 6: Model Comparison - REVIEWS (standardised with Yeo–Johnson transformation). Same layout as Figure 3.]

References

Enrique Amigo, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. 2009. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 12(4):461–486.

Jeff A. Bilmes et al. 1998. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. International Computer Science Institute 4(510):126.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146.

William H. E. Day and Herbert Edelsbrunner. 1984. Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification 1(1):7–24. https://doi.org/10.1007/BF01890115.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Maciej Eder, Maciej Piasecki, and Tomasz Walkowiak. 2017. An open stylometric system based on multilevel text analysis. Cognitive Studies — Etudes cognitives 17. https://doi.org/10.11649/cs.1430.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Zellig S. Harris. 1954. Distributional structure. Word 10(2-3):146–162.

Trevor J. Hastie, Robert John Tibshirani, and Jerome H. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York.

Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of Classification 2(1):193–218. https://doi.org/10.1007/BF01908075.

Anil K. Jain. 2009. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31(8):651–666. https://doi.org/10.1016/j.patrec.2009.09.011.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, pages 427–431. http://aclweb.org/anthology/E17-2068.

Jan Kocon and Michal Gawor. 2019. Evaluating KGR10 Polish word embeddings in the recognition of temporal expressions using BiLSTM-CRF. CoRR abs/1904.04055. http://arxiv.org/abs/1904.04055.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

A. Mykowiecka, M. Marciniak, and P. Rychlik. 2017. Testing word embeddings for Polish. Cognitive Studies — Etudes cognitives 17. https://doi.org/10.11649/cs.1468.

Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. 2002. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

Maciej Piasecki, Gabriela Czachor, Arkadiusz Janz, Dominik Kaszewski, and Pawel Kedzia. 2018. Wordnet-based evaluation of large distributional models for Polish. In Proceedings of the 9th Global Wordnet Conference (GWC 2018). Global WordNet Association.

A. Przepiorkowski, M. Banko, R. L. Gorski, and B. Lewandowska-Tomaszczyk, editors. 2012. Narodowy Korpus Jezyka Polskiego. Wydawnictwo Naukowe PWN, Warszawa.

Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. 2016. Adjusting for chance clustering comparison measures. The Journal of Machine Learning Research 17(1):4635–4666.

E. Schubert, A. Zimek, and H.-P. Kriegel. 2012. A survey on unsupervised outlier detection. Statistical Analysis and Data Mining, pages 363–387.

Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research 11(Oct):2837–2854.

Tomasz Walkowiak and Piotr Malak. 2018. Polish texts topic classification evaluation. In Proceedings of the 10th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART. INSTICC, SciTePress, pages 515–522. https://doi.org/10.5220/0006601605150522.

In-Kwon Yeo and Richard A. Johnson. 2000. A new family of power transformations to improve normality or symmetry. Biometrika 87(4):954–959.