Human Language Technologies – The Baltic Perspective
K. Muischnek and K. Müürisep (Eds.)
© 2018 The authors and IOS Press. This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
doi:10.3233/978-1-61499-912-6-9

Topic Interpretation Using Wordnet

Eduard BARBU 1, Heili ORAV and Kadri VARE
Institute of Computer Science, Tartu, Estonia

Abstract. This is a preliminary study in topic interpretation for the Estonian language. We empirically estimate the best number of topics to compute for a 185-million-word newspaper corpus. To assess the difficulty of topic interpretation, the topics are independently labeled by two annotators and translated into English. The Estonian Wordnet and the Princeton Wordnet are then used to compute the word pairs in the topics that have high taxonomic similarity.

Keywords. topic models, semantic networks, Wordnet, multilingual topic interpretation

1. Introduction

Topic models are a class of unsupervised algorithms that discover themes in a collection of documents. In Natural Language Processing and Information Retrieval, topic models are used for browsing large document collections, but their use is not restricted to text documents: they have been employed in the automatic labeling of images with text concepts [1] and in bioinformatics [2], among others. Researchers have devoted significant effort to building algorithms for discovering topics. Among the most used algorithms are Probabilistic Latent Semantic Analysis (PLSA) [3], Latent Dirichlet Allocation (LDA) [4] and many LDA extensions, like Pachinko allocation [5]. Yet less effort has been put into the interpretation of the topics when they are computed on large collections of text documents. The most relevant study addressing this problem is [6]. The authors made use of crowd-sourcing through Amazon Mechanical Turk to assess topic coherence. The conclusion of the study is that the traditional metrics for measuring topic coherence correlate negatively with the coherence as perceived by humans. In another study [7] the authors stated that the interpretation of LDA-generated topics is hard and that the characterization of topics by the top-k most relevant words should be supplemented by richer descriptions and visualizations to make the topic meaning intelligible. In [8] the authors developed a new method for ranking the words within a topic. They use Pointwise Mutual Information scores between the top word pairs in the topics, computed on Wikipedia, Google and MEDLINE, to enhance topic coherence as judged by humans. There is a large body of studies discussing topic models for the English language; however, far fewer studies are available for other languages.

The work presented in this paper is a preliminary study of topic interpretation for the Estonian language. Our contribution is threefold. First, to the best of our knowledge, this is the first study of topic interpretation for the Estonian language: we evaluate the agreement in labeling the automatically generated topics, thus trying to answer the question "How hard is topic interpretation?". Second, and more important, we would like to know what proportion of the word pairs in the topics can be "explained" by network similarity measures. Given an automatically generated topic, we would like to know which word pairs are similar in a network structure such as Wordnet. The nature of the word similarity in the topics is statistical, whereas the Wordnet path similarity is better understood, being defined in terms of semantic relations. Third, we translate the topics into English and look at topic interpretation from a bilingual perspective, using the Estonian Wordnet (EstWN) and the Princeton WordNet (PWN).

The rest of the paper is organized as follows. The next section describes the topic computation, topic labeling and topic translation. Section 3 presents the similarity measures used to compute the word-pair similarities. In Section 4 the main results are shown and discussed. We end with the conclusions, outlining at the same time improvements and new research directions.

We would like to make a short note on terminology. Topics are best understood as word distributions. Yet the fundamental unit of a Wordnet is the word sense (e.g., the word worker is represented by four word senses in PWN). In this paper we have tried to strike a balance between precision and readability; that is why we sometimes use the term word where we should have used word sense. We trust that the reader will make the distinction.

1 Corresponding Author: Eduard Barbu, Institute of Computer Science, Tartu, Estonia; E-mail: [email protected].

2. Topic Computation

We used the generative probabilistic LDA algorithm for topic computation. The idea of this algorithm is that a document is constructed by drawing words from a collection of topics. The task of LDA is to reconstruct the topics when we observe the documents. The algorithm makes two assumptions. The first is that there exists a topic distribution, that is, some topics are more probable than others. The second is that each topic is equivalent to a word distribution, that is, within a topic some words are more prominent (have a higher probability) than others.

We have trained an LDA topic model on the Estonian newspaper corpus, which is the largest part of the Estonian Reference Corpus [9]. It consists of different daily and weekly Estonian newspapers and magazines available on the web from the period 1990 to 2007. The corpus contains 705,259 documents and circa 185 million words, and it is lemmatized and part-of-speech tagged. The newspaper corpus has been filtered, preserving the content words only (e.g., we have eliminated the conjunctions, prepositions, postpositions, etc.). The LDA algorithm implemented in the software package MALLET [10] has been used to compute the topics. Several runs of the algorithm (with 100, 150, and 200 topics, respectively) have been performed, and for each topic the 50 most representative words have been selected. Two native Estonian speakers have inspected the results and chose the setting with 200 topics as the best one. In the selected setting the topics are more easily understood, and the perceived topic coherence is better than for the other settings. The selection of 200 topics is a bit surprising if we consider that most topic computation settings use 150 topics or fewer.
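The topics themselves were computed with MALLET; purely as an illustration of the setup described above (200 topics, top 50 words per topic), a roughly equivalent model can be trained with gensim's LdaModel. This is a hedged sketch, not the paper's actual pipeline; the docs variable is a hypothetical placeholder for the filtered, lemmatized newspaper corpus.

    from gensim import corpora, models

    # Hypothetical placeholder: lemmatized documents, content words only.
    docs = [["president", "valitsus", "seadus"],
            ["male", "kasparov", "partii", "suurmeister"]]

    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]

    # Train with 200 topics, the setting the annotators judged best.
    lda = models.LdaModel(bow, num_topics=200, id2word=dictionary)

    # The 50 most representative words of a topic.
    print(lda.show_topic(0, topn=50))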

The same annotators have independently labeled each topic. When the labels attached to the topics are conceptually close, the annotators fully agree. When the semantics of the labels overlap partially, the annotators partially agree. Finally, when the semantics of the assigned labels differ, the annotators disagree. After discussion the annotators resolved some cases of disagreement and assigned the final topic label. The topics and the final topic labels have been translated into English.

Table 1 shows a fragment of the computed topics and illustrates the main agreement cases. The first column shows a (partial) topic, computed by running the LDA on the newspaper corpus. The second column represents the computed topic translated into English. The column called TL is the final topic label assigned after discussion, the next two columns (TL FA, TL SA) show the topic labels assigned by each of the two annotators, and the last column (Agr) displays the level of agreement between the annotators. Note that when a topic cannot be labeled, the topic words are not translated and the final topic label is NL (Not Labeled). When an annotator does not know what label to assign, s/he assigns DN (Don't Know).

Table 1. Topic fragment examples, topic labeling and topic translation

Topic (Estonian): ehlvest, partii, suurmeister, male, kasparov, maletaja, valeri, keres
Translation (English): ehlvest, round, grandmaster, chess, kasparov, chess player, valeri, keres
TL: chess | TL FA: male | TL SA: male | Agr: yes

Topic (Estonian): riigikogu, küsimus, seadus, täna, kolleeg, istung, härra, eelnõu, palu
Translation (English): parliament, question, law, today, colleague, sir, draft, ask
TL: Parliament | TL FA: riigikogu | TL SA: poliitika | Agr: partial

Topic (Estonian): emü, euroliit, eüt
Translation (English): emü, regulation, article, current union, eüt, commission, take, europe, appendix
TL: EU | TL FA: euroliit | TL SA: DN | Agr: no

Topic (Estonian): putin, aprill, nõukogu, pank, president, juuni, mai, vensel, opmann, palk
Translation (English): (not translated)
TL: NL | TL FA: – | TL SA: – | Agr: –

The partial disagreement was usually due to the fact that the annotators used near-synonyms or hypernyms as topic labels, i.e., one annotator was more specific than the other. When discussing the labels, the annotators tried to choose the more specific one. Some topics were difficult to label with a specific label since they were composed of words with broader meanings, for example "politics" and "foreign affairs" or "finance" and "trading". Topics with words that have a more concrete meaning, for example "wine", "chess" and "tennis", were clearer. Each topic contained 50 words; in most cases the majority of the 50 words were considered relevant and were also translated into English. Since we discarded the part-of-speech tags, there were problems with deciding whether to translate words as nouns or as verbs: premiere - esietendus (n), esietendu(ma) (v), or study - õping (n), õppi(ma) (v), etc. Also, there were quite a few proper names which in Estonian can also be common nouns, but here knowledge of the Estonian background was useful, and most of the proper names were a clear indication of what the topic was about.

Table 2. The results of topic labeling and translation

Topics                        200
Unlabeled topics              45
Unique topic labels           124
Topics where annotators agree 98
Topics where annotators disagree 16
Topics with partial agreement 43
Cohen's kappa                 0.86

2.1. Topic Labeling Agreement

The results of the topic word translation into English and the topic labeling are given in Table 2. The annotators did not label approximately 22 percent of all topics, meaning that these topics were not coherent enough to be interpretable. The annotators assigned the same label to the topics that could be labeled in 68 percent of the cases. It is customary to compute the level of agreement using a Kappa statistic that takes into consideration the possibility that the annotators agree by chance. The annotator agreement given by Cohen's kappa coefficient [11] is 0.86. According to a common interpretation [12], this level of agreement is almost perfect. There is no scientific consensus on how to interpret Kappa statistics on a scale; nevertheless, in all interpretations a coefficient higher than 0.85 reflects very good agreement. 124 topics have distinct labels. Among the labels that repeat multiple times, the most prominent ones are "politics" and "sports", with 5 mentions each. This is not surprising given that the corpus contains newspaper articles; one assumes that these two categories are well represented in any newspaper.
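As a minimal sketch of the agreement computation, Cohen's kappa can be obtained with scikit-learn; the two label lists below are hypothetical placeholders for the per-topic labels assigned by the two annotators.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical per-topic labels from the two annotators.
    first_annotator = ["politics", "sports", "chess", "DN", "finance"]
    second_annotator = ["politics", "sports", "chess", "DN", "trading"]

    # Chance-corrected agreement between the two label sequences.
    kappa = cohen_kappa_score(first_annotator, second_annotator)
    print(round(kappa, 2))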

3. Similarity Measures

The semantic similarity between words is either computed using statistical measures, as in LDA, or defined on the taxonomic part of a semantic network. When the semantic similarity is computed using statistical measures, we do not know the type of semantic relationship holding between the words. This is because all statistical measures rely on word co-occurrence or syntactic information to infer semantic similarity.

The advantage of the semantic similarity computed using semantic networks is that it is based on the well-defined IS-A relation. Take for example the word pair (rectangle, rhombus). In a semantic network that models the domain of geometry, both concepts are subclasses of the more general concept parallelogram. This means that rectangle and rhombus inherit all the properties of the concept parallelogram: they are both geometrical figures and have two pairs of parallel sides. The semantic similarity measure literature is rich, with applications spanning different fields, from natural language processing to bioinformatics. In particular, there are many semantic similarity and relatedness metrics that use the taxonomic Wordnet structure [13].
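The (rectangle, rhombus) example can be checked directly on the PWN noun taxonomy with nltk; this sketch is illustrative and not part of the study's pipeline.

    from nltk.corpus import wordnet as wn

    rectangle = wn.synset('rectangle.n.01')
    rhombus = wn.synset('rhombus.n.01')

    # Both are hyponyms of parallelogram.n.01, their lowest common subsumer.
    print(rectangle.lowest_common_hypernyms(rhombus))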

3.1. Estonian Wordnet

The compilation of EstWN started as part of the EuroWordNet (EWN) project [14]. Both resources follow PWN's basic principles [15]: words are grouped into synonym sets (synsets), and the relations between synsets determine the Wordnet structure. While Wordnets for languages other than English follow the blueprint of PWN, there are always changes. For example, EstWN has more semantic relations: in addition to the IS-A relation for nouns, there are also meronymy, role, causation, fuzzynymy, antonymy and near-synonymy relations. EstWN has some inconsistencies in its taxonomies, since it has been built over 20 years by different people. Today, EstWN has more than 86,000 synsets. For this study the 2016 version of EstWN is used.

3.2. Semantic Similarity Measures

For this study, we have chosen three measures that exploit the semantic hierarchy of nouns and that of verbs (a small code sketch after equation (3) illustrates all three). 1. Path similarity. This is the most basic similarity measure and returns the inverse of one plus the length of the shortest path connecting the concepts in the semantic network graph. Thus, the path similarity returns a number between 0, meaning no similarity, and 1, meaning that the two concepts are identical. The formula for computing the path similarity between the concepts c1 and c2 is given in equation 1.

\[ \mathrm{sim_{path}}(c_1,c_2) = \frac{1}{1 + \mathrm{shortest\_path}(c_1,c_2)} \tag{1} \]

2. Leacock-Chodorow similarity [16]. This measure is a refinement of the path similarity and takes into consideration the depth of the concepts in the taxonomy. The shortest path between the concepts in the taxonomy is computed, and it is scaled by the maximum depth (max_depth) of that taxonomy. The intuition behind this measure is that concepts deeper in the hierarchy should be more similar. The formula for computing this measure for the same concepts c1 and c2 is given in equation 2.

\[ \mathrm{sim_{LCH}}(c_1,c_2) = -\log\frac{1 + \mathrm{shortest\_path}(c_1,c_2)}{2 \cdot \mathrm{max\_depth}} \tag{2} \]

3. Wu-Palmer similarity [17]. This measure computes the similarity using the depths of the two concepts (c1 and c2) and the depth of their lowest common subsumer (lcs). The Wu-Palmer formula is given in equation 3.

\[ \mathrm{sim_{WP}}(c_1,c_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{lcs}(c_1,c_2))}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)} \tag{3} \]
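All three measures are available in nltk's WordNet interface, which the study uses for PWN. As a hedged sketch, for the cohyponyms rectangle.n.01 and rhombus.n.01 the three scores can be obtained as follows.

    from nltk.corpus import wordnet as wn

    c1 = wn.synset('rectangle.n.01')
    c2 = wn.synset('rhombus.n.01')

    # Cohyponyms: the shortest path has length 2, so equation (1) gives 1/3.
    print(c1.path_similarity(c2))  # equation (1)
    print(c1.lch_similarity(c2))   # equation (2)
    print(c1.wup_similarity(c2))   # equation (3)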

4. Results and Discussion

The procedure for computing the similarity between the words in the topics makes use of EstWN and PWN through the software packages estnltk [18] and nltk, respectively. For the time being we have considered only the noun hierarchy of EstWN: during the corpus filtering process the part-of-speech tags were removed and the infinitive forms of the verbs were not explicit, which did not allow us to use the EstWN verb hierarchy for the evaluation. The procedure for computing the Wordnet similarity between the word senses in the computed topics is the following:

1. Topic filtering. The words in each topic are projected onto the Wordnet noun and verb hierarchies. The output of this phase is a new list of topics that contain only the words that have noun or verb synsets.
2. Word-sense pair computation. For each filtered topic the word-sense pairs are generated. To better understand the word-sense pair generation, we take two words (Israel, Palestine) from the topic labeled "secret service operations". In PWN both words have two senses. The first sense of the word "Israel" designates the actual state of Israel and the second sense refers to ancient Israel. The same considerations apply to the two senses of the word "Palestine". Four word-sense pairs are generated: (Israel[n.01], Palestine[n.01]), (Israel[n.01], Palestine[n.02]), (Israel[n.02], Palestine[n.01]), (Israel[n.02], Palestine[n.02]).
3. Pair similarity score generation. For each word-sense pair the three similarity measures are computed. The word-sense pairs are then ordered according to the similarity score, and the word-sense pair with the highest score is selected as the most probable pair. In the case presented above, the pair (Israel[n.01], Palestine[n.01]) is selected.
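A minimal sketch of steps 2 and 3 above for the (Israel, Palestine) example, using nltk's PWN interface; the code is illustrative, and the paper's actual implementation may differ.

    from itertools import product
    from nltk.corpus import wordnet as wn

    # Step 2: generate all noun word-sense pairs for the two topic words.
    pairs = product(wn.synsets('israel', pos=wn.NOUN),
                    wn.synsets('palestine', pos=wn.NOUN))

    # Step 3: score each pair and keep the most similar one.
    scored = [(s1, s2, s1.path_similarity(s2)) for s1, s2 in pairs]
    best = max(scored, key=lambda p: p[2])
    print(best)  # expected top pair: (israel.n.01, palestine.n.01)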

The results are presented in Table 3. The "Unique words" row counts the number of unique words in the Estonian topics and in their English translation. As explained before, there are more Estonian words than English ones because some words cannot be properly translated into English. The "Words in Wordnet as nouns" and "Words in Wordnet as verbs" rows give the number of words after topic filtering that are nouns, respectively verbs, in the two Wordnets. Because EstWN is smaller than PWN, we expected fewer nouns from the computed topics to be in EstWN than in PWN. In contrast, because the topics were computed on an Estonian corpus, one would expect that terms specific to Estonian culture would not be found in PWN. Many nouns in the topics are named entities, which EstWN does not include, whereas PWN includes named entities ranging from companies (e.g., Apple and Microsoft) to people (e.g., Immanuel Kant and Vladimir Putin). 61 percent of the nouns in the English translation of the topics are in PWN, but only 36 percent of the nouns in the topics are in EstWN. Thus, the size and the breadth of the entities included in PWN prevail over the cultural factors. The "Generated pairs" row lists the number of pairs containing words in PWN and EstWN. The number of word senses per filtered noun in PWN is 3.29, whereas the same measure for EstWN is approximately one point lower: 2.33. This is due to the fact that EstWN is still work in progress and lacks some words and senses. In PWN the verbs are semantically richer than the nouns: the number of senses per filtered verb is 5.22.

Table 3. The Wordnet statistics of the topics

Statistic                     | English | Estonian
Unique words                  | 3659    | 4479
Words in Wordnet as nouns     | 2242    | 1623
Words in Wordnet as verbs     | 902     | –
Generated pairs               | 67634   | 35524
Noun word-senses              | 7383    | 3795
Verb word-senses              | 4715    | –
Noun word-senses scored pairs | 53299   | 338
Verb word-senses scored pairs | 10670   | –
Top noun pairs                | 5827    | 337
Top verb pairs                | 4492    | –
Top topics                    | 125     | 100

The "Noun word-senses scored pairs" row and the corresponding row for verbs, "Verb word-senses scored pairs", give the number of scored word-sense pairs for nouns (and verbs, respectively) over all topics. A top pair is a word-sense pair with a path similarity score of at least 0.25, which means that a distance of at most 3 separates the word senses in the Wordnet graph. Cohyponyms have a path similarity equal to 0.33. For example, the word-sense pair (bush.n.04, bush.n.06), corresponding to George Bush the son and the father, both presidents of the USA, has a path similarity of 0.33. The path similarity between a concept and its hyponyms is 0.5.

Unfortunately, not all relevant word-sense pairs are captured in this way, because the Wordnet taxonomies contain mistakes. Take for example the pair (Arab, Jew); figure 1 shows a mistake in PWN. The Arabs and the Jews are Semitic peoples because they speak Semitic languages, but PWN classifies only the former as Semitic. The real path similarity between these word senses should have been 0.33, but because the actual score is less than the threshold we fixed, the pair is missed. In EstWN there is no path at all between the synsets corresponding to Arab (araablane) and Jew (iisraellane).

A manual estimation of the mapping precision of the top pairs has been made. For the nouns the precision is very high: in EstWN it is over 90 percent, and in PWN it is over 80 percent. The precision of the verb mapping, though, is low: less than 35 percent of the top verb pairs are correctly mapped onto PWN.

In the ideal case, when all word-sense pairs are correct, reflecting the topic coherence, 15 percent of the word-sense pairs in the topics can be explained by PWN similarity. The percentage of word-sense pairs for nouns in EstWN is very low: 0.1 percent. This is a consequence of the fact that the density of EstWN is much lower than the density of PWN. In PWN there is a balanced number of top pairs that are nouns and top pairs that are verbs. A top topic is a topic that has at least one top pair. For the English translation of the topics, 83 percent are top topics, whereas the proportion is lower for the original Estonian topics: 66 percent.
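The cohyponym score and the top-pair threshold discussed above can be checked with nltk; the sense numbers below follow the paper's PWN version and may differ in other WordNet releases.

    from nltk.corpus import wordnet as wn

    son = wn.synset('bush.n.04')     # sense ids as reported in the paper
    father = wn.synset('bush.n.06')

    # Cohyponyms: path similarity 0.33, above the 0.25 top-pair threshold.
    print(son.path_similarity(father))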

Figure 1. The word senses Arab[n.01] and Jew[n.01] are not on the same level in PWN

The other two similarity measures yield the same top pairs as the path similarity, except that the pair order sometimes differs.

5. Conclusions and Future Work

In this preliminary study we have explored some questions related to the interpretation of topics computed on an Estonian newspaper corpus. The first question regards the possibility of consistently labeling and translating the topics. The second question asks what proportion of the word pairs in the topics have a high similarity score in a semantic network.

Some topics can be easily and accurately labeled; however, other topics cannot be labeled because they are not coherent or are open to different interpretations. Of the 200 topics labeled independently by two annotators, 22 percent were not labeled. When the topics received a label, the two annotators assigned the same label 68 percent of the time. The translation of the topics is difficult because there are cases when the context provided is insufficient to understand the meaning of a word.

In general, the word mapping onto the noun taxonomies of PWN and EstWN is very accurate: over 80 percent for PWN and over 90 percent for EstWN. The mapping precision of the verb pairs in English is low: less than 35 percent of the top verb pairs are correctly mapped onto PWN. We have found that approximately 15 percent of the English noun and verb pairs in the topics, and just 0.1 percent of the Estonian noun pairs, can be interpreted using the respective Wordnets.

In the future we would like to compute better topics by taking into consideration the parts of speech of the words, thus being able to compute statistics for verbs in Estonian as well. Moreover, we would like to compute topic correlations (e.g., using Pachinko Allocation) and employ automatic methods to re-score the top-k topic words.

Acknowledgments

This study was supported by the Estonian Ministry of Education and Research (IUT20- 56).

References

[1] David M. Blei and Michael I. Jordan. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '03, pages 127–134, New York, NY, USA, 2003. ACM.
[2] Lin Liu, Lin Tang, Wen Dong, Shaowen Yao, and Wei Zhou. An overview of topic modeling and its current applications in bioinformatics. SpringerPlus, 5(1):1608, Sep 2016.
[3] Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2):177–196, January 2001.
[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] Wei Li and Andrew McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 577–584, New York, NY, USA, 2006. ACM.
[6] Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L. Boyd-Graber, and David M. Blei. Reading tea leaves: How humans interpret topic models. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 288–296. Curran Associates, Inc., 2009.
[7] Daniel Ramage, Evan Rosen, Jason Chuang, Christopher D. Manning, and Daniel A. McFarland. Topic modeling for the social sciences. In Workshop on Applications for Topic Models, NIPS, 2009.
[8] David Newman, Youn Noh, Edmund Talley, Sarvnaz Karimi, and Timothy Baldwin. Evaluating topic models for digital libraries. In Proceedings of the 10th Annual Joint Conference on Digital Libraries, JCDL '10, pages 215–224, New York, NY, USA, 2010. ACM.
[9] Heiki-Jaan Kaalep, Kadri Muischnek, Kristel Uiboaed, and Kaarel Veskis. The Estonian Reference Corpus: Its composition and morphology-aware user interface. In Human Language Technologies – The Baltic Perspective: Proceedings of the Fourth International Conference Baltic HLT 2010, pages 143–146, Amsterdam, The Netherlands, 2010. IOS Press.
[10] Andrew Kachites McCallum. MALLET: A machine learning for language toolkit. Technical report, 2002. http://mallet.cs.umass.edu.
[11] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.
[12] J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977.
[13] L. Meng, R. Huang, and J. Gu. A review of semantic similarity measures in WordNet. International Journal of Hybrid Information Technology, 6(1), 2013.
[14] Piek Vossen, editor. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Norwell, MA, USA, 1998.
[15] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235–244, 1990.
[16] Claudia Leacock and Martin Chodorow. Combining local context and WordNet similarity for word sense identification. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database, pages 265–283. MIT Press, 1998.
[17] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, ACL '94, pages 133–138, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics.
[18] Siim Orasmaa, Timo Petmanson, Alexander Tkachenko, Sven Laur, and Heiki-Jaan Kaalep. EstNLTK - NLP toolkit for Estonian. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May 2016. European Language Resources Association (ELRA).