Human Language Technologies – The Baltic Perspective
K. Muischnek and K. Müürisep (Eds.)
© 2018 The authors and IOS Press. This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
doi:10.3233/978-1-61499-912-6-9

Topic Interpretation Using Wordnet

Eduard BARBU 1, Heili ORAV and Kadri VARE
Institute of Computer Science, Tartu, Estonia

Abstract. This is a preliminary study in topic interpretation for the Estonian language. We empirically estimate the best number of topics to compute for a 185-million-word newspaper corpus. To assess the difficulty of topic interpretation, the topics are independently labeled by two annotators and translated into English. The Estonian Wordnet and the Princeton Wordnet are then used to compute the word pairs in the topics that have high taxonomic similarity.

Keywords. topic models, semantic networks, Wordnet, multilingual topic interpretation

1. Introduction

Topic models are a class of unsupervised algorithms that discover themes in a collection of documents. In Natural Language Processing and Information Retrieval, topic models are used for browsing large document collections, but their use is not restricted to text documents: they have been employed in the automatic labeling of images with text concepts [1] and in bioinformatics [2], among others. Researchers have devoted significant effort to building algorithms for discovering topics. Among the most used algorithms are Probabilistic Latent Semantic Analysis (PLSA) [3], Latent Dirichlet Allocation (LDA) [4] and many LDA extensions, like Pachinko allocation [5]. Yet less effort has been put into the interpretation of the topics when they are computed on large collections of text documents. The most relevant study addressing this problem is [6]. The authors made use of crowd-sourcing through Amazon Mechanical Turk to assess topic coherence. The conclusion of the study is that the traditional metrics for measuring topic coherence correlate negatively with the coherence as perceived by humans. In another study [7] the authors stated that the interpretation of LDA-generated topics is hard and that the characterization of topics by the top-k most relevant words should be supplemented by richer descriptions and visualizations to make the topic meaning intelligible. In [8] the authors developed a new method for ranking the words within a topic. They use Pointwise Mutual Information scores between the top word pairs in the topics, computed on Wikipedia, Google and MEDLINE, to enhance topic coherence as judged by humans. There is a large body of studies discussing topic models for the English language; however, far fewer studies are available for other languages.

The work presented in this paper is a preliminary study of topic interpretation for the Estonian language. Our contribution is threefold. First, to the best of our knowledge, this is the first study of topic interpretation for the Estonian language: we evaluate the agreement in labeling the automatically generated topics, thus trying to answer the question "How hard is topic interpretation?". Second, and more important, we would like to know what proportion of the word pairs in the topics can be "explained" by network similarity measures. Given an automatically generated topic, we would like to know which word pairs are similar in a network structure such as Wordnet. The nature of the word similarity in the topics is statistical, whereas the Wordnet path similarity is better understood, being defined in terms of semantic relations. Third, we translate the topics into English and look at topic interpretation from a bilingual perspective, using the Estonian Wordnet (EstWN) and the Princeton WordNet (PWN).

The rest of the paper is organized as follows. The next section describes the topic computation, topic labeling and topic translation. Section 3 presents the similarity measures used to compute the word-pair similarities. In Section 4 the main results are shown and discussed. We end with the conclusions, outlining at the same time improvements and new research directions.

We would like to make a short note on terminology. Topics are best understood as word distributions. Yet the fundamental unit of a Wordnet is the word sense (e.g., the word worker is represented by four word senses in PWN). In this paper we have tried to strike a balance between precision and readability; that is why we sometimes use the term word where we should have used word sense. We trust that the reader will make the distinction.

1 Corresponding Author: Eduard Barbu, Institute of Computer Science, Tartu, Estonia; E-mail: [email protected].

2. Topic Computation

We used the generative probabilistic LDA algorithm for topic computation. The idea of this algorithm is that a document is constructed by drawing words from a collection of topics. The task of LDA is to reconstruct the topics when we observe the documents. The algorithm makes two assumptions. The first is that there exists a topic distribution, that is, some topics are more probable than others. The second is that each topic is equivalent to a word distribution, that is, within a topic some words are more prominent (have a higher probability) than others.

We have trained an LDA topic model on the Estonian newspaper corpus, which is the largest part of the Estonian Reference Corpus [9]. It consists of different daily and weekly Estonian newspapers and magazines available on the web from the period 1990 to 2007. The corpus contains 705,259 documents and circa 185 million words, and it is lemmatized and part-of-speech tagged. The newspaper corpus has been filtered, preserving the content words only (e.g., we have eliminated the conjunctions, prepositions, postpositions, etc.). The LDA algorithm implemented in the software package MALLET [10] has been used to compute the topics. Several runs of the algorithm (with 100, 150, and 200 topics, respectively) have been performed, and for each topic the 50 most representative words have been selected. Two native Estonian speakers have inspected the results and chose the setting with 200 topics as the best one. In the selected setting the topics are more easily understood, and the perceived topic coherence is better than for the other settings. The selection of 200 topics is a bit surprising if we consider that most topic computation settings use 150 topics or fewer.
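The topics themselves were computed with MALLET; purely as an illustration of the setup described above (200 topics, top 50 words per topic), a roughly equivalent model can be trained with gensim's LdaModel. This is a hedged sketch, not the paper's actual pipeline; the docs variable is a hypothetical placeholder for the filtered, lemmatized newspaper corpus.

    from gensim import corpora, models

    # Hypothetical placeholder: lemmatized documents, content words only.
    docs = [["president", "valitsus", "seadus"],
            ["male", "kasparov", "partii", "suurmeister"]]

    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]

    # Train with 200 topics, the setting the annotators judged best.
    lda = models.LdaModel(bow, num_topics=200, id2word=dictionary)

    # The 50 most representative words of a topic.
    print(lda.show_topic(0, topn=50))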

The same annotators have independently labeled each topic. When the labels attached to the topics are conceptually close, the annotators fully agree. When the semantics of the labels overlap partially, the annotators partially agree. Finally, when the semantics of the assigned labels differ, the annotators disagree. After discussion the annotators resolved some cases of disagreement and assigned the final topic label. The topics and the final topic labels have been translated into English.

Table 1 shows a fragment of the computed topics and illustrates the main agreement cases. The first column shows a (partial) topic, computed by running the LDA on the newspaper corpus. The second column represents the computed topic translated into English. The column called TL is the final topic label assigned after discussion, the next two columns (TL FA, TL SA) show the topic labels assigned by each of the two annotators, and the last column (Agr) displays the level of agreement between the annotators. Note that when a topic cannot be labeled, the topic words are not translated and the final topic label is NL (Not Labeled). When an annotator does not know what label to assign, s/he assigns DN (Don't Know).

Table 1. Topic fragment examples, topic labeling and topic translation

Topic (Estonian): ehlvest, partii, suurmeister, male, kasparov, maletaja, valeri, keres
Translation (English): ehlvest, round, grandmaster, chess, kasparov, chess player, valeri, keres
TL: chess | TL FA: male | TL SA: male | Agr: yes

Topic (Estonian): riigikogu, küsimus, seadus, täna, kolleeg, istung, härra, eelnõu, palu
Translation (English): parliament, question, law, today, colleague, sir, draft, ask
TL: Parliament | TL FA: riigikogu | TL SA: poliitika | Agr: partial

Topic (Estonian): emü, euroliit, eüt
Translation (English): emü, regulation, article, current union, eüt, commission, take, europe, appendix
TL: EU | TL FA: euroliit | TL SA: DN | Agr: no

Topic (Estonian): putin, aprill, nõukogu, pank, president, juuni, mai, vensel, opmann, palk
Translation (English): (not translated)
TL: NL | TL FA: – | TL SA: – | Agr: –

The partial disagreement was usually due to the fact that the annotators used near-synonyms or hypernyms as topic labels, i.e., one annotator was more specific than the other. When discussing the labels, the annotators tried to choose the more specific one. Some topics were difficult to label with a specific label since they were composed of words with broader meanings, for example "politics" and "foreign affairs" or "finance" and "trading". Topics with words that have a more concrete meaning, for example "wine", "chess" and "tennis", were clearer. Each topic contained 50 words; in most cases the majority of the 50 words were considered relevant and were also translated into English. Since we discarded the part-of-speech tags, there were problems with deciding whether to translate words as nouns or as verbs: premiere - esietendus (n), esietendu(ma) (v), or study - õping (n), õppi(ma) (v), etc. Also, there were quite a few proper names which in Estonian can also be common nouns, but here knowledge of the Estonian background was useful, and most of the proper names were a clear indication of what the topic was about.

Table 2. The results of topic labeling and translation

Topics                        200
Unlabeled topics              45
Unique topic labels           124
Topics where annotators agree 98
Topics where annotators disagree 16
Topics with partial agreement 43
Cohen's kappa                 0.86

2.1. Topic Labeling Agreement

The results of the topic word translation into English and the topic labeling are given in Table 2. The annotators did not label approximately 22 percent of all topics, meaning that these topics were not coherent enough to be interpretable. The annotators assigned the same label to the topics that could be labeled in 68 percent of the cases. It is customary to compute the level of agreement using a Kappa statistic that takes into consideration the possibility that the annotators agree by chance. The annotator agreement given by Cohen's kappa coefficient [11] is 0.86. According to a common interpretation [12], this level of agreement is almost perfect. There is no scientific consensus on how to interpret Kappa statistics on a scale; nevertheless, in all interpretations a coefficient higher than 0.85 reflects very good agreement. 124 topics have distinct labels. Among the labels that repeat multiple times, the most prominent ones are "politics" and "sports", with 5 mentions each. This is not surprising given that the corpus contains newspaper articles; one assumes that these two categories are well represented in any newspaper.
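As a minimal sketch of the agreement computation, Cohen's kappa can be obtained with scikit-learn; the two label lists below are hypothetical placeholders for the per-topic labels assigned by the two annotators.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical per-topic labels from the two annotators.
    first_annotator = ["politics", "sports", "chess", "DN", "finance"]
    second_annotator = ["politics", "sports", "chess", "DN", "trading"]

    # Chance-corrected agreement between the two label sequences.
    kappa = cohen_kappa_score(first_annotator, second_annotator)
    print(round(kappa, 2))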

3. Similarity Measures

The semantic similarity between words is either computed using statistical measures, as in LDA, or defined on the taxonomic part of a semantic network. When the semantic similarity is computed using statistical measures, we do not know the type of semantic relationship holding between the words. This is because all statistical measures rely on word co-occurrence or syntactic information to infer semantic similarity.

The advantage of the semantic similarity computed using semantic networks is that it is based on the well-defined IS-A relation. Take for example the word pair (rectangle, rhombus). In a semantic network that models the domain of geometry, both concepts are subclasses of the more general concept parallelogram. This means that rectangle and rhombus inherit all the properties of the concept parallelogram: they are both geometrical figures and have two pairs of parallel sides. The semantic similarity measure literature is rich, with applications spanning different fields, from natural language processing to bioinformatics. In particular, there are many semantic similarity and relatedness metrics that use the taxonomic Wordnet structure [13].
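The (rectangle, rhombus) example can be checked directly on the PWN noun taxonomy with nltk; this sketch is illustrative and not part of the study's pipeline.

    from nltk.corpus import wordnet as wn

    rectangle = wn.synset('rectangle.n.01')
    rhombus = wn.synset('rhombus.n.01')

    # Both are hyponyms of parallelogram.n.01, their lowest common subsumer.
    print(rectangle.lowest_common_hypernyms(rhombus))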

3.1. Estonian Wordnet

The compilation of EstWN started as part of the EuroWordNet (EWN) project [14]. Both resources follow PWN's basic principles [15]: words are grouped into synonym sets (synsets), and the relations between synsets determine the Wordnet structure. While Wordnets for languages other than English follow the blueprint of PWN, there are always changes. For example, EstWN has more semantic relations: in addition to the IS-A relation for nouns, there are also meronymy, role, causation, fuzzynymy, antonymy and near-synonymy relations. EstWN has some inconsistencies in its taxonomies, since it has been built over 20 years by different people. Today, EstWN has more than 86,000 synsets. For this study the 2016 version of EstWN is used.

3.2. Semantic Similarity Measures

For this study, we have chosen three measures that exploit the semantic hierarchy of nouns and that of verbs (a small code sketch after equation (3) illustrates all three). 1. Path similarity. This is the most basic similarity measure and returns the inverse of one plus the length of the shortest path connecting the concepts in the semantic network graph. Thus, the path similarity returns a number between 0, meaning no similarity, and 1, meaning that the two concepts are identical. The formula for computing the path similarity between the concepts c1 and c2 is given in equation 1.

\[ \mathrm{sim_{path}}(c_1,c_2) = \frac{1}{1 + \mathrm{shortest\_path}(c_1,c_2)} \tag{1} \]

2. Leacock-Chodorow similarity [16]. This measure is a refinement of the path similarity and takes into consideration the depth of the concepts in the taxonomy. The shortest path between the concepts in the taxonomy is computed, and it is scaled by the maximum depth (max_depth) of that taxonomy. The intuition behind this measure is that concepts deeper in the hierarchy should be more similar. The formula for computing this measure for the same concepts c1 and c2 is given in equation 2.

\[ \mathrm{sim_{LCH}}(c_1,c_2) = -\log\frac{1 + \mathrm{shortest\_path}(c_1,c_2)}{2 \cdot \mathrm{max\_depth}} \tag{2} \]

3. Wu-Palmer similarity [17]. This measure computes the similarity using the depths of the two concepts (c1 and c2) and the depth of their lowest common subsumer (lcs). The Wu-Palmer formula is given in equation 3.

\[ \mathrm{sim_{WP}}(c_1,c_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{lcs}(c_1,c_2))}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)} \tag{3} \]
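All three measures are available in nltk's WordNet interface, which the study uses for PWN. As a hedged sketch, for the cohyponyms rectangle.n.01 and rhombus.n.01 the three scores can be obtained as follows.

    from nltk.corpus import wordnet as wn

    c1 = wn.synset('rectangle.n.01')
    c2 = wn.synset('rhombus.n.01')

    # Cohyponyms: the shortest path has length 2, so equation (1) gives 1/3.
    print(c1.path_similarity(c2))  # equation (1)
    print(c1.lch_similarity(c2))   # equation (2)
    print(c1.wup_similarity(c2))   # equation (3)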

4. Results and Discussion

The procedure for computing the similarity between the words in the topics makes use of EstWN and PWN through the software packages estnltk [18] and nltk, respectively. For the time being we have considered only the noun hierarchy of EstWN: during the corpus filtering process the part-of-speech tags were removed and the infinitive forms of the verbs were not explicit, which did not allow us to use the EstWN verb hierarchy for the evaluation. The procedure for computing the Wordnet similarity between the word senses in the computed topics is the following:

1. Topic filtering. The words in each topic are projected onto the Wordnet noun and verb hierarchies. The output of this phase is a new list of topics that contain only the words that have noun or verb synsets.
2. Word-sense pair computation. For each filtered topic the word-sense pairs are generated. To better understand the word-sense pair generation, we take two words (Israel, Palestine) from the topic labeled "secret service operations". In PWN both words have two senses. The first sense of the word "Israel" designates the actual state of Israel and the second sense refers to ancient Israel. The same considerations apply to the two senses of the word "Palestine". Four word-sense pairs are generated: (Israel[n.01], Palestine[n.01]), (Israel[n.01], Palestine[n.02]), (Israel[n.02], Palestine[n.01]), (Israel[n.02], Palestine[n.02]).
3. Pair similarity score generation. For each word-sense pair the three similarity measures are computed. The word-sense pairs are then ordered according to the similarity score, and the word-sense pair with the highest score is selected as the most probable pair. In the case presented above, the pair (Israel[n.01], Palestine[n.01]) is selected.
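A minimal sketch of steps 2 and 3 above for the (Israel, Palestine) example, using nltk's PWN interface; the code is illustrative, and the paper's actual implementation may differ.

    from itertools import product
    from nltk.corpus import wordnet as wn

    # Step 2: generate all noun word-sense pairs for the two topic words.
    pairs = product(wn.synsets('israel', pos=wn.NOUN),
                    wn.synsets('palestine', pos=wn.NOUN))

    # Step 3: score each pair and keep the most similar one.
    scored = [(s1, s2, s1.path_similarity(s2)) for s1, s2 in pairs]
    best = max(scored, key=lambda p: p[2])
    print(best)  # expected top pair: (israel.n.01, palestine.n.01)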

The results are presented in Table 3. The "Unique words" row counts the number of unique words in the Estonian topics and in their English translation. As explained before, there are more Estonian words than English ones because some words cannot be properly translated into English. The "Words in Wordnet as nouns" and "Words in Wordnet as verbs" rows give the number of words after topic filtering that are nouns, respectively verbs, in the two Wordnets. Because EstWN is smaller than PWN, we expected fewer nouns from the computed topics to be in EstWN than in PWN. In contrast, because the topics were computed on an Estonian corpus, one would expect that terms specific to Estonian culture would not be found in PWN. Many nouns in the topics are named entities, which EstWN does not include, whereas PWN includes named entities ranging from companies (e.g., Apple and Microsoft) to people (e.g., Immanuel Kant and Vladimir Putin). 61 percent of the nouns in the English translation of the topics are in PWN, but only 36 percent of the nouns in the topics are in EstWN. Thus, the size and the breadth of the entities included in PWN prevail over the cultural factors. The "Generated pairs" row lists the number of pairs containing words in PWN and EstWN. The number of word senses per filtered noun in PWN is 3.29, whereas the same measure for EstWN is approximately one point lower: 2.33. This is due to the fact that EstWN is still work in progress and lacks some words and senses. In PWN the verbs are semantically richer than the nouns: the number of senses per filtered verb is 5.22.

Table 3. The Wordnet statistics of the topics

Statistic                     | English | Estonian
Unique words                  | 3659    | 4479
Words in Wordnet as nouns     | 2242    | 1623
Words in Wordnet as verbs     | 902     | –
Generated pairs               | 67634   | 35524
Noun word-senses              | 7383    | 3795
Verb word-senses              | 4715    | –
Noun word-senses scored pairs | 53299   | 338
Verb word-senses scored pairs | 10670   | –
Top noun pairs                | 5827    | 337
Top verb pairs                | 4492    | –
Top topics                    | 125     | 100

The "Noun word-senses scored pairs" row and the corresponding row for verbs, "Verb word-senses scored pairs", give the number of scored word-sense pairs for nouns (and verbs, respectively) over all topics. A top pair is a word-sense pair with a path similarity score of at least 0.25, which means that a distance of at most 3 separates the word senses in the Wordnet graph. Cohyponyms have a path similarity equal to 0.33. For example, the word-sense pair (bush.n.04, bush.n.06), corresponding to George Bush the son and the father, both presidents of the USA, has a path similarity of 0.33. The path similarity between a concept and its hyponyms is 0.5.

Unfortunately, not all relevant word-sense pairs are captured in this way, because the Wordnet taxonomies contain mistakes. Take for example the pair (Arab, Jew); figure 1 shows a mistake in PWN. The Arabs and the Jews are Semitic peoples because they speak Semitic languages, but PWN classifies only the former as Semitic. The real path similarity between these word senses should have been 0.33, but because the actual score is less than the threshold we fixed, the pair is missed. In EstWN there is no path at all between the synsets corresponding to Arab (araablane) and Jew (iisraellane).

A manual estimation of the mapping precision of the top pairs has been made. For the nouns the precision is very high: in EstWN it is over 90 percent, and in PWN it is over 80 percent. The precision of the verb mapping, though, is low: less than 35 percent of the top verb pairs are correctly mapped onto PWN.

In the ideal case, when all word-sense pairs are correct, reflecting the topic coherence, 15 percent of the word-sense pairs in the topics can be explained by PWN similarity. The percentage of word-sense pairs for nouns in EstWN is very low: 0.1 percent. This is a consequence of the fact that the density of EstWN is much lower than the density of PWN. In PWN there is a balanced number of top pairs that are nouns and top pairs that are verbs. A top topic is a topic that has at least one top pair. For the English translation of the topics, 83 percent are top topics, whereas the proportion is lower for the original Estonian topics: 66 percent.
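The cohyponym score and the top-pair threshold discussed above can be checked with nltk; the sense numbers below follow the paper's PWN version and may differ in other WordNet releases.

    from nltk.corpus import wordnet as wn

    son = wn.synset('bush.n.04')     # sense ids as reported in the paper
    father = wn.synset('bush.n.06')

    # Cohyponyms: path similarity 0.33, above the 0.25 top-pair threshold.
    print(son.path_similarity(father))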

Figure 1. The word senses Arab[n.01] and Jew[n.01] are not on the same level in PWN

The other two similarity measures yield the same top pairs as the path similarity, except that the pair order sometimes differs.

5. Conclusions and Future Work

In this preliminary study we have explored some questions related to the interpretation of topics computed on an Estonian newspaper corpus. The first question regards the possibility of consistently labeling and translating the topics. The second question asks what proportion of the word pairs in the topics have a high similarity score in a semantic network.

Some topics can be easily and accurately labeled; however, other topics cannot be labeled because they are not coherent or are open to different interpretations. Of the 200 topics labeled independently by two annotators, 22 percent were not labeled. When the topics received a label, the two annotators assigned the same label 68 percent of the time. The translation of the topics is difficult because there are cases when the context provided is insufficient to understand the meaning of a word.

In general, the word mapping onto the noun taxonomies of PWN and EstWN is very accurate: over 80 percent for PWN and over 90 percent for EstWN. The mapping precision of the verb pairs in English is low: less than 35 percent of the top verb pairs are correctly mapped onto PWN. We have found that approximately 15 percent of the English noun and verb pairs in the topics, and just 0.1 percent of the Estonian noun pairs, can be interpreted using the respective Wordnets.

In the future we would like to compute better topics by taking into consideration the parts of speech of the words, thus being able to compute statistics for verbs in Estonian as well. Moreover, we would like to compute topic correlations (e.g., using Pachinko Allocation) and employ automatic methods to re-score the top-k topic words.

Acknowledgments

This study was supported by the Estonian Ministry of Education and Research (IUT20- 56).

References

[1] David M. Blei and Michael I. Jordan. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '03, pages 127–134, New York, NY, USA, 2003. ACM.
[2] Lin Liu, Lin Tang, Wen Dong, Shaowen Yao, and Wei Zhou. An overview of topic modeling and its current applications in bioinformatics. SpringerPlus, 5(1):1608, Sep 2016.
[3] Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2):177–196, January 2001.
[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] Wei Li and Andrew McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 577–584, New York, NY, USA, 2006. ACM.
[6] Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L. Boyd-Graber, and David M. Blei. Reading tea leaves: How humans interpret topic models. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 288–296. Curran Associates, Inc., 2009.
[7] Daniel Ramage, Evan Rosen, Jason Chuang, Christopher D. Manning, and Daniel A. McFarland. Topic modeling for the social sciences. In Workshop on Applications for Topic Models, NIPS, 2009.
[8] David Newman, Youn Noh, Edmund Talley, Sarvnaz Karimi, and Timothy Baldwin. Evaluating topic models for digital libraries. In Proceedings of the 10th Annual Joint Conference on Digital Libraries, JCDL '10, pages 215–224, New York, NY, USA, 2010. ACM.
[9] Heiki-Jaan Kaalep, Kadri Muischnek, Kristel Uiboaed, and Kaarel Veskis. The Estonian Reference Corpus: Its composition and morphology-aware user interface. In Human Language Technologies – The Baltic Perspective: Proceedings of the Fourth International Conference Baltic HLT 2010, pages 143–146, Amsterdam, The Netherlands, 2010. IOS Press.
[10] Andrew Kachites McCallum. MALLET: A machine learning for language toolkit. Technical report, 2002. http://mallet.cs.umass.edu.
[11] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.
[12] J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977.
[13] L. Meng, R. Huang, and J. Gu. A review of semantic similarity measures in WordNet. International Journal of Hybrid Information Technology, 6(1), 2013.
[14] Piek Vossen, editor. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Norwell, MA, USA, 1998.
[15] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235–244, 1990.
[16] Claudia Leacock and Martin Chodorow. Combining local context and WordNet similarity for word sense identification. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database, pages 265–283. MIT Press, 1998.
[17] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, ACL '94, pages 133–138, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics.
[18] Siim Orasmaa, Timo Petmanson, Alexander Tkachenko, Sven Laur, and Heiki-Jaan Kaalep. EstNLTK - NLP toolkit for Estonian. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May 2016. European Language Resources Association (ELRA).