Topic Models: Accounting Component Structure of Bigrams

Michael Nokel, Lomonosov Moscow State University, Russian Federation, [email protected]
Natalia Loukachevitch, Lomonosov Moscow State University, Russian Federation, louk [email protected]

Abstract

The paper describes the results of an empirical study of integrating collocations and similarities between them and unigrams into topic models. First of all, we propose a novel algorithm, PLSA-SIM, that is a modification of the original PLSA algorithm. It incorporates bigrams and maintains relationships between unigrams and bigrams based on their component structure. Then we analyze a variety of word association measures in order to integrate top-ranked bigrams into topic models. All experiments were conducted on four text collections of different domains and languages. The experiments distinguish a subgroup of tested measures that produce top-ranked bigrams which, when integrated into the PLSA-SIM algorithm, demonstrate significant improvement of topic model quality for all collections.

1 Introduction

Topic modeling is one of the latest applications of machine learning techniques to natural language processing. Topic models identify which topics relate to each document and which words form each topic. Each topic is defined as a multinomial distribution over terms, and each document is defined as a multinomial distribution over topics (Blei et al., 2003). Topic models have achieved noticeable success in various areas such as information retrieval (Wei and Croft, 2006), including such applications as multi-document summarization (Wang et al., 2009), text clustering and categorization (Zhou et al., 2009), and other natural language processing tasks such as word sense disambiguation (Boyd-Graber et al., 2007) and machine translation (Eidelman et al., 2012). Among the most well-known models are Latent Dirichlet Allocation (LDA) (Blei et al., 2003), which is based on the Dirichlet prior distribution, and Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999), which is not connected with any parametric prior distribution.

One of the main drawbacks of topic models is that they utilize the "bag-of-words" model, which discards word order and is based on the word independence assumption. There are numerous studies in which the integration of collocations, n-grams, idioms and multi-word terms into topic models is investigated. However, it often leads to a decrease in model quality due to the increasing size of the vocabulary or to a serious complication of the model (Wallach, 2006; Griffiths et al., 2007; Wang et al., 2007).

The paper proposes a novel approach taking into account bigram collocations and the relationship between them and unigrams in topic models (such as citizen – citizen of country – citizen of union – European citizen – state citizen; categorization – document categorization – term categorization – text categorization). This allows us to create a novel method of integrating bigram collocations into topic models that does not consider bigrams as "black boxes", but maintains the relationship between unigrams and bigrams based on their component structure. The proposed algorithm leads to significant improvement of topic model quality measured in perplexity and topic coherence (Newman et al., 2010) without complicating the model.

All experiments were carried out using the PLSA algorithm and its modifications on four corpora of different domains and languages: the English part of the Europarl parallel corpus, the English part of the JRC-Acquis parallel corpus, the ACL Anthology Reference corpus, and Russian banking magazines.

The rest of the paper is organized as follows. In Section 2 we focus on related work. Section 3 proposes the novel algorithm PLSA-SIM that incorporates bigrams and similarities between them and unigrams into topic models. Section 4 describes the datasets used in the experiments, all preprocessing steps, and the metrics used to evaluate quality. In Section 5 we perform an extensive analysis of a variety of measures for integrating top-ranked bigrams into topic models. In the last section we draw conclusions.

2 Related Work

The idea of using collocations in topic models is not a novel one. Nowadays there are two kinds of methods proposed to deal with this problem: creation of a unified probabilistic model, and preliminary extraction of collocations and n-grams with further integration into topic models.

Most studies belong to the first kind of methods. The first movement beyond the "bag-of-words" assumption was made by Wallach (2006), where the Bigram Topic Model was presented. In this model word probabilities are conditioned on the immediately preceding word. The LDA Collocation Model (Griffiths et al., 2007) extends the Bigram Topic Model by introducing a new set of variables, thereby giving the flexibility to generate both unigrams and bigrams. Wang et al. (2007) proposed the Topical N-Gram Model, which adds a layer of complexity to allow the formation of bigrams to be determined by the context. Hu et al. (2008) proposed the Topical Word-Character Model, challenging the assumption that the topic of an n-gram is determined by the topics of the composite words within the collocation; this model is mainly suitable for the Chinese language. Johnson (2010) established a connection between LDA and Probabilistic Context-Free Grammars and proposed two probabilistic models combining insights from LDA and Adaptor Grammars to integrate collocations and proper names into the topic model.

While all these models have a theoretically elegant background, they are very complex and hard to compute on real datasets. For example, the Bigram Topic Model has W²T parameters, compared to WT for LDA and WT + DT for PLSA, where W is the size of the vocabulary, D is the number of documents, and T is the number of topics. Therefore such models are mostly of theoretical interest.

The algorithm proposed in Lau et al. (2013) belongs to the second type of methods that use collocations in topic models. The authors extract bigram collocations via the t-test and replace separate units by top-ranked bigrams at the preprocessing step. They use two metrics of topic quality, perplexity and topic coherence (Newman et al., 2010), and conclude that incorporating bigram collocations into topics results in worsening perplexity and improving topic coherence. Our current work also belongs to the second type of methods and differs from previous papers such as Lau et al. (2013) in that our approach does not consider bigrams as "black boxes", but maintains information about the inner structure of bigrams and the relationships between bigrams and component unigrams, which leads to improvement in both metrics: perplexity and topic coherence.

The idea of utilizing prior natural language knowledge in topic models is not a novel one either. Andrzejewski et al. (2009) incorporated domain-specific knowledge by Must-Link and Cannot-Link primitives represented by a novel Dirichlet Forest prior. These primitives control whether two words tend to be generated by the same or separate topics. However, this method can result in an exponential growth in the encoding of Cannot-Link primitives and thus has difficulty in processing a large number of constraints (Liu, 2012). Another method of incorporating such knowledge is presented in Zhai et al. (2010), where a semi-supervised EM-algorithm was proposed to group expressions into some user-specified categories. To provide a better initialization for the EM-algorithm, the method employs prior knowledge that expressions sharing words and synonyms are likely to belong to the same group. Our current work differs from these in that we incorporate similarity links between unigrams and bigrams into the topic model in a very natural way, by counting their co-occurrences in documents. The proposed approach does not increase the complexity of the original PLSA algorithm.

3 PLSA-SIM algorithm

As mentioned above, original topic models utilize the "bag-of-words" assumption, which implies word independence. Bigrams are usually added to topic models as "black boxes" without any ties to other words: they are added to the vocabulary as single tokens, and in each document containing any of the added bigrams the frequencies of the unigram components are decreased by the frequencies of the bigrams (Lau et al., 2013). Thus the "bag-of-words" assumption holds.

However, there are many similar unigrams and bigrams that share the same lemmas in documents (e.g., correction – correction of word – error correction – spelling correction; rail – rail infrastructure – rail transport – use of rail). We should note that such bigrams do not only have identical words: many of them also maintain semantic and thematic similarity. At the same time, other bigrams with the same words (e.g., idioms) can have significant semantic differences. To take into account these different situations, we hypothesized that similar bigrams sharing the same unigram components should often belong to the same topics if they often co-occur within the same texts.

To verify this hypothesis we precompute sets of similar unigrams and bigrams sharing the same lemmas and propose the novel PLSA-SIM algorithm, which is a modification of the original PLSA algorithm. We rely on the description found in Vorontsov and Potapenko (2014) and use the following notation (further in the paper we use the word "term" when speaking about both unigrams and bigrams):

• D – the collection of documents;
• T – the set of inferred topics;
• W – the vocabulary (the set of unique terms found in the collection D);
• Φ = {φ_wt = p(w|t)} – the distribution of terms w over topics t;
• Θ = {θ_td = p(t|d)} – the distribution of topics t over documents d;
• S = {S_w} – the sets of similar terms (S_w is the set of terms similar to w, that is, S_w = {w} ∪ {wv} ∪ {vw}, where w is the lemmatized unigram, while wv and vw are lemmatized bigrams containing w as a component; see the sketch after this list);
• n_dw, n_ds – the numbers of occurrences of the terms w, s in the document d;
• n̂_wt – the estimate of the frequency of the term w in the topic t;
• n̂_td – the estimate of the frequency of the topic t in the document d;
• n̂_t – the estimate of the frequency of the topic t in the text collection D;
• n_d – the number of words in the document d.
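To make the construction of the similarity sets concrete, the following is a minimal Python sketch of how S_w could be precomputed (an illustration under the stated assumptions, not the authors' code; build_similarity_sets and the toy data are hypothetical). Documents are assumed to be already lemmatized, with extracted bigrams given as tuples of component lemmas:

    def build_similarity_sets(unigrams, bigrams):
        # S_w groups the lemmatized unigram w with every lemmatized
        # bigram that contains w as a component: S_w = {w} U {wv} U {vw}
        sets = {w: {w} for w in unigrams}
        for bigram in bigrams:
            for w in bigram:
                if w in sets:
                    sets[w].add(" ".join(bigram))
        return sets

    unigrams = {"correction", "rail"}
    bigrams = [("error", "correction"), ("spelling", "correction"),
               ("rail", "transport"), ("use", "rail")]
    print(build_similarity_sets(unigrams, bigrams))
    # {'correction': {'correction', 'error correction', 'spelling correction'},
    #  'rail': {'rail', 'rail transport', 'use rail'}}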
The pseudocode of the PLSA-SIM algorithm is presented in Algorithm 1. The only modifications of the original algorithm concern lines 6 and 9, where we introduce the auxiliary variable f_dw, which takes into account the pre-computed sets of similar terms. Thus, the weight of such terms is increased within each document.

Algorithm 1: PLSA-SIM algorithm: PLSA with similar terms
Input: collection of documents D, number of topics |T|, initial distributions Θ and Φ, sets of similar terms S
Output: distributions Θ and Φ
1   while not meet the stop criterion do
2       for d ∈ D, w ∈ W, t ∈ T do
3           n̂_wt = 0, n̂_td = 0, n̂_t = 0, n_d = |d|
4       for d ∈ D, w ∈ W do
5           Z = Σ_{t∈T} φ_wt θ_td
6           f_dw = n_dw + Σ_{s∈S_w} n_ds
7           for t ∈ T do
8               if φ_wt θ_td > 0 then
9                   δ = f_dw φ_wt θ_td / Z
10                  n̂_wt = n̂_wt + δ, n̂_td = n̂_td + δ, n̂_t = n̂_t + δ
11      for w ∈ W, t ∈ T do
12          φ_wt = n̂_wt / n̂_t
13      for d ∈ D, t ∈ T do
14          θ_td = n̂_td / n_d

So, if similar unigrams and bigrams co-occur within the same document, we try to carry them to the same topics; we consider such terms to have semantic and thematic similarities. However, if unigrams and bigrams from the same set S_w do not co-occur within the same document, we make no modifications to the original PLSA algorithm; we consider such terms to have semantic differences.
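To make Algorithm 1 concrete, here is a minimal pure-Python sketch of the modified EM iterations (an illustration, not the authors' implementation; docs is assumed to be a list of bags of term counts and sim_sets the precomputed sets S_w; since S_w contains w itself, w is excluded from the boosting sum here to avoid double counting):

    import random

    def plsa_sim(docs, vocab, n_topics, sim_sets, n_iter=100):
        # docs: list of {term: count}; sim_sets: term -> set of similar terms (S_w)
        rnd = random.Random(0)
        # initial phi need not be normalized; the first M-step normalizes it
        phi = {w: [rnd.random() + 1e-3 for _ in range(n_topics)] for w in vocab}
        theta = [[1.0 / n_topics] * n_topics for _ in docs]
        for _ in range(n_iter):
            n_wt = {w: [0.0] * n_topics for w in vocab}  # frequency estimates, reset each pass
            n_td = [[0.0] * n_topics for _ in docs]
            n_t = [0.0] * n_topics
            for d, doc in enumerate(docs):
                for w, n_dw in doc.items():
                    z = sum(phi[w][t] * theta[d][t] for t in range(n_topics))
                    if z == 0.0:
                        continue
                    # line 6 of Algorithm 1: add counts of similar terms
                    # that co-occur in the same document
                    f_dw = n_dw + sum(doc.get(s, 0) for s in sim_sets.get(w, ()) if s != w)
                    for t in range(n_topics):
                        if phi[w][t] * theta[d][t] > 0.0:
                            delta = f_dw * phi[w][t] * theta[d][t] / z  # line 9
                            n_wt[w][t] += delta
                            n_td[d][t] += delta
                            n_t[t] += delta
            for w in vocab:                    # lines 11-12: re-estimate phi
                for t in range(n_topics):
                    phi[w][t] = n_wt[w][t] / n_t[t] if n_t[t] > 0.0 else 0.0
            for d, doc in enumerate(docs):     # lines 13-14: re-estimate theta
                n_d = sum(doc.values())
                for t in range(n_topics):
                    theta[d][t] = n_td[d][t] / n_d
        return phi, theta

If no similar term of w occurs in the document, f_dw reduces to n_dw and the update coincides with ordinary PLSA, which is why the modification adds no asymptotic complexity.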

4 Datasets and Evaluation

4.1 Datasets and Preprocessing

In our experiments we used English and Russian text collections obtained from different sources:

• For the English part of our study we took three different collections:
  – the Europarl multilingual parallel corpus, extracted from the proceedings of the European Parliament (http://www.statmt.org/europarl). The English part includes almost 54 mln. words and 9672 documents;
  – the JRC-Acquis multilingual parallel corpus. It represents selected texts of the EU legislation written between the 1950s and 2005 (http://ipsc.jrc.ec.europa.eu/index.php?id=198). The English part contains almost 45 mln. words and 23545 documents;
  – the ACL Anthology Reference Corpus. It contains scholarly publications about Computational Linguistics (http://acl-arc.comp.nus.edu.sg/). The corpus includes almost 42 mln. words and 10921 documents.
• For the Russian part of our study we took 10422 Russian articles from several economics-oriented magazines such as Auditor, RBC, Banking Magazine, etc. These documents contain almost 18.5 mln. words.

At the preprocessing step documents were processed by morphological analyzers. For the English corpora we used the Stanford CoreNLP tools (http://nlp.stanford.edu/software/corenlp.shtml), while for the Russian corpus we used our own morphological analyzer. All words were lemmatized. We consider only adjectives, nouns, verbs and adverbs, since function words do not play a significant role in forming topics. Besides, we excluded words occurring less than five times in the whole text collection.

In addition, we extracted all bigrams of the forms Noun + Noun, Adjective + Noun and Noun + of + Noun for all English collections, and Noun + Noun in Genitive and Adjective + Noun for the Russian collection. We consider only such bigrams since topics are mainly identified by nouns and noun groups (Wang et al., 2007).

4.2 Evaluation Framework

As for the quality of the inferred topics, we consider four different intrinsic measures. The first measure is perplexity, since it is the standard criterion of topic model quality (Daud et al., 2010):

    Perplexity(D) = exp(−(1/n) Σ_{d∈D} Σ_{w∈d} n_dw ln p(w|d)),   (1)

where n is the number of all considered words in the collection, D is the set of documents in the collection, n_dw is the number of occurrences of the word w in the document d, and p(w|d) is the probability of the word w appearing in the document d.

The less the value of perplexity, the better the model predicts the words w in the documents D. Although there were numerous studies arguing that perplexity is not suited to topic model evaluation (Chang et al., 2009; Newman et al., 2010), it is still commonly used for comparing different topic models. Since it is well known that perplexity computed on the training collection is susceptible to over-fitting and can give optimistically low values (Blei et al., 2003), we use the standard method of computing hold-out perplexity described in Asuncion et al. (2009). In our experiments we split the collections randomly into training sets D, on which the models are trained, and validation sets D′, on which hold-out perplexity is computed.
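As a concrete reading of Eq. (1), a minimal sketch follows (an illustration, not the authors' code; the callable p_w_given_d is a hypothetical stand-in that should return Σ_t φ_wt θ_td for PLSA, with Θ re-estimated on the held-out documents while Φ stays fixed, per Asuncion et al. (2009)):

    import math

    def perplexity(docs, p_w_given_d):
        # Eq. (1): exp(-(1/n) * sum over d, w of n_dw * ln p(w|d));
        # assumes p_w_given_d(w, d) > 0 for every observed word
        log_likelihood, n = 0.0, 0
        for d, doc in enumerate(docs):   # doc: {term: count}
            for w, n_dw in doc.items():
                log_likelihood += n_dw * math.log(p_w_given_d(w, d))
                n += n_dw
        return math.exp(-log_likelihood / n)

Lower values indicate that the model predicts the held-out words better.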

Another method of evaluating topic model quality is using expert opinions. We provided annotators with inferred topics from the same text collections and instructed them to decide whether each topic was to some extent coherent, meaningful and interpretable. The indicator of topic usefulness is the ease with which one could think of a short label to describe the topic (Newman et al., 2010). In Table 1 we present an incoherent topic that cannot be given any label and a coherent one with the label given by experts.

Top words from topic                              | Label
have, also, commission, state, more, however      | –
vessel, fishing, fishery, community, catch, board | fishing

Table 1: Examples of incoherent and coherent topics

Since involving experts is time-consuming and expensive, there have been several attempts to propose a method for automatic evaluation of topic model quality that would go beyond perplexity and would be correlated with expert opinions. The formulation of such a problem is very complicated, since experts can quite strongly disagree with each other. However, it was recently shown that it is possible to evaluate topic coherence automatically using word semantics, with precision almost coinciding with that of experts (Newman et al., 2010; Mimno et al., 2011). The proposed metric measures the interpretability of topics based on human judgement (Newman et al., 2010). As topics are usually presented to users via their top-N topic terms, topic coherence evaluates whether these top terms correspond to the topic or not. Newman et al. (2010) proposed an automated variation of the coherence score based on pointwise mutual information (TC-PMI):

    TC-PMI(t) = Σ_{j=2..10} Σ_{i=1..j−1} log( P(w_j, w_i) / (P(w_j) P(w_i)) ),   (2)

where (w_1, w_2, ..., w_10) are the top-10 terms in a topic, P(w_i) and P(w_j) are the probabilities of the unigrams w_i and w_j respectively, while P(w_j, w_i) is the probability of the bigram (w_j, w_i). The final measure of topic coherence is calculated by averaging the TC-PMI(t) measure over all topics t.

This score has been shown to demonstrate high correlation with human judgement (Newman et al., 2010). The metric considers only the top-10 words in each topic, since they usually provide enough information to form the subject of the topic and its distinguishing features from other topics. Topic coherence is becoming more widely used to evaluate topic model quality along with perplexity. For example, Stevens et al. (2012) showed that this metric is strongly correlated with expert estimates, and Andrzejewski and Buttler (2011) used it for evaluating topic model quality.

Following the approach proposed by Mimno et al. (2011), we compute the probabilities by dividing the number of documents where the unigram or bigram occurred by the number of documents in the collection. To avoid optimistically high values we use an external corpus for this purpose, namely the Russian and English Wikipedia.
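A minimal sketch of Eq. (2) under these document-frequency estimates follows (an illustration, not the authors' code; doc_sets is a hypothetical mapping from a term to the set of ids of external, e.g. Wikipedia, documents containing it, and the small eps guards against log 0 for pairs that never co-occur, a practical tweak that is not part of Eq. (2)):

    import math
    from itertools import combinations

    def tc_pmi(top_terms, doc_sets, n_docs, eps=1e-12):
        # Eq. (2) for one topic, summed over all pairs of its top-10 terms
        score = 0.0
        for w_i, w_j in combinations(top_terms[:10], 2):
            p_i = len(doc_sets[w_i]) / n_docs
            p_j = len(doc_sets[w_j]) / n_docs
            p_ij = len(doc_sets[w_i] & doc_sets[w_j]) / n_docs
            score += math.log((p_ij + eps) / (p_i * p_j + eps))
        return score

    # the final coherence of a model is the average of tc_pmi over all topics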

We should note that we do not consider another variation of topic coherence, based on log conditional probability (TC-LCP) and also proposed by Mimno et al. (2011), since it was shown in Lau et al. (2013) that it works significantly worse than TC-PMI.

We should also note that, by incorporating the knowledge of similar unigrams and bigrams into topic models, the proposed algorithm encourages such terms to appear among the top-10 terms of the inferred topics. Therefore, we increase the TC-PMI metric unintentionally, since such terms are likely to co-occur within the same documents. So we decided to also use a modification of this metric that considers not the top-10 terms in topics but the top-10 non-similar terms (this metric will further be called TC-PMI-nSIM).

5 Integrating bigrams into topic models

To compare the proposed algorithm with the original one, we extracted all bigrams found in each document of the collections. For ranking bigrams we utilized Term Frequency (TF) or one of the following 19 word association measures (an illustrative sketch of the first of these measures follows the list):

1. Mutual Information (MI) (Church and Hanks, 1990);
2. Augmented MI (Zhang et al., 2008);
3. Normalized MI (Bouma, 2009);
4. True MI (Deane, 2005);
5. Cubic MI (Daille, 1995);
6. Symmetric Conditional Probability (Lopes and Silva, 1999);
7. Dice Coefficient (DC) (Smadja et al., 1996);
8. Modified DC (Kitamura and Matsumoto, 1996);
9. Lexical Cohesion (Park et al., 2002);
10. Gravity Count (Daudaravičius and Marcinkevičienė, 2003);
11. Simple Matching Coefficient (Daille, 1995);
12. Kulczinsky Coefficient (Daille, 1995);
13. Ochiai Coefficient (Daille, 1995);
14. Yule Coefficient (Daille, 1995);
15. Jaccard Coefficient (Jaccard, 1901);
16. T-Score;
17. Z-Score;
18. Chi Square;
19. Loglikelihood Ratio (Dunning, 1993).

Following the results of Lau et al. (2013), we decided to integrate the top-1000 bigrams into all topic models under consideration. We should note that in all experiments described in the paper we fixed both the number of topics and the number of iterations of the algorithms to 100.

We conducted experiments with all 20 aforementioned measures on all four text collections in order to compare the quality of the original PLSA algorithm, PLSA with the top-1000 bigrams added as "black boxes", and the PLSA-SIM algorithm with the same top-1000 bigrams. According to the results of the experiments, we have revealed two groups of measures.

The first group contains MI, Augmented MI, Normalized MI, DC, Chi Square, Symmetric Conditional Probability, Simple Matching Coefficient, Kulczinsky Coefficient, Yule Coefficient, Ochiai Coefficient, Jaccard Coefficient, Z-Score, and Loglikelihood Ratio. We got nearly the same levels of perplexity and topic coherence when top bigrams ranked by these measures were integrated into all tested topic models. This is explained by the fact that these measures rank up very special, non-typical and low-frequency bigrams. In Table 2 we present the results of integrating the top-1000 bigrams ranked by MI for all text collections.

Corpus   | Model               | Perplexity | TC-PMI | TC-PMI-nSIM
Banking  | PLSA                | 1724.2     | 86.1   | 86.1
Banking  | PLSA + bigrams      | 1714.1     | 84.2   | 84.2
Banking  | PLSA-SIM + bigrams  | 1715.4     | 84.1   | 84.1
Europarl | PLSA                | 1594.3     | 53.2   | 53.2
Europarl | PLSA + bigrams      | 1584.6     | 55     | 55
Europarl | PLSA-SIM + bigrams  | 1591.3     | 55.2   | 55.2
JRC      | PLSA                | 812.1      | 67     | 67
JRC      | PLSA + bigrams      | 815.4      | 66.3   | 66.3
JRC      | PLSA-SIM + bigrams  | 815.6      | 66.4   | 66.4
ACL      | PLSA                | 2134.7     | 74.8   | 74.8
ACL      | PLSA + bigrams      | 2138.1     | 75.5   | 75.5
ACL      | PLSA-SIM + bigrams  | 2144.8     | 75.8   | 75.8

Table 2: Results of integrating top-1000 bigrams ranked by MI into topic models

The second group includes TF, Cubic MI, True MI, Modified DC, T-Score, Lexical Cohesion and Gravity Count. We got worsened perplexity and improved topic coherence when top bigrams ranked by these measures were integrated into the PLSA algorithm as "black boxes". But when they were used in PLSA-SIM topic models, it led to significant improvement of all metrics under consideration. This is explained by the fact that these measures rank up highly frequent, typical bigrams. In Table 3 we present the results of integrating the top-1000 bigrams ranked by TF for all text collections.

Corpus   | Model               | Perplexity | TC-PMI | TC-PMI-nSIM
Banking  | PLSA                | 1724.2     | 86.1   | 86.1
Banking  | PLSA + bigrams      | 2251.8     | 98.8   | 98.8
Banking  | PLSA-SIM + bigrams  | 1450.6     | 156.5  | 102.6
Europarl | PLSA                | 1594.3     | 53.2   | 53.2
Europarl | PLSA + bigrams      | 1993.5     | 57.3   | 57.3
Europarl | PLSA-SIM + bigrams  | 1431.6     | 127.7  | 84.7
JRC      | PLSA                | 812.1      | 67     | 67
JRC      | PLSA + bigrams      | 1038.9     | 72     | 72
JRC      | PLSA-SIM + bigrams  | 743.7      | 108.4  | 76.9
ACL      | PLSA                | 2134.7     | 74.8   | 74.8
ACL      | PLSA + bigrams      | 2619.3     | 73.7   | 73.7
ACL      | PLSA-SIM + bigrams  | 1806.4     | 152.7  | 87.8

Table 3: Results of integrating top-1000 bigrams ranked by TF into topic models

So, we succeeded in achieving better quality for both languages using the proposed algorithm and the second group of measures.

For the expert evaluation of topic model quality we invited two linguistic experts and gave them topics inferred by the original PLSA algorithm and by the proposed PLSA-SIM algorithm with the top-1000 bigrams ranked by TF (term frequency). The task was to classify the given topics into two classes: whether they can be given a subject name (we further mark such topics as '+') or not (we further mark such topics as '–'). In Table 4 we present results for all text collections except the ACL Anthology Reference Corpus, because a correct markup of the latter requires advanced knowledge of computational linguistics.

Corpus   | Model               | Expert 1 + | Expert 1 – | Expert 2 + | Expert 2 –
Banking  | PLSA                | 93         | 7          | 92         | 8
Banking  | PLSA + bigrams      | 92         | 8          | 95         | 5
Banking  | PLSA-SIM + bigrams  | 95         | 5          | 97         | 3
JRC      | PLSA                | 92         | 8          | 90         | 10
JRC      | PLSA + bigrams      | 94         | 6          | 97         | 3
JRC      | PLSA-SIM + bigrams  | 97         | 3          | 100        | 0
Europarl | PLSA                | 97         | 3          | 99         | 1
Europarl | PLSA + bigrams      | 95         | 5          | 99         | 1
Europarl | PLSA-SIM + bigrams  | 98         | 2          | 100        | 0

Table 4: Results of expert markup of topics

As we can see, in the case of the PLSA-SIM algorithm with the top-1000 bigrams ranked by TF, the number of inferred topics for which labels can be given increases for all text collections. It is also worth noting that adding bigrams as "black boxes" does not increase the number of such inferred topics. This result also confirms that the proposed algorithm improves the quality of topic models.

In Table 5 we present the top-5 words from one random topic for each corpus for the original PLSA and PLSA-SIM algorithms with the top-1000 bigrams ranked by TF. Within each text collection we present topics discussing the same subject.

Banking – PLSA: Banking, Bank, Sector, Financial, System
Banking – PLSA-SIM: Financial system, Financial, Financial market, Financial sector, Financial institute
Europarl – PLSA: Financial, Crisis, Have, European, Market
Europarl – PLSA-SIM: Economic crisis, Financial, European crisis, European economy, Time of crisis
JRC-Acquis – PLSA: Transport, Road, Nuclear, Vehicle, Material
JRC-Acquis – PLSA-SIM: Transport service, Transport, Road transport, Transport sector, Air transport
ACL – PLSA: Tag, Word, Corpus, Tagger, Tagging
ACL – PLSA-SIM: Tag set, Tag, Tag sequence, Unknown word, Speech tag

Table 5: Top-5 words from topics inferred by PLSA and PLSA-SIM algorithms

We should note that we used only intrinsic measures of topic model quality in this paper. In the future we would like to test the improved topic models in such applications of information retrieval as text clustering and categorization.

6 Conclusion

The paper presents experiments on integrating bigrams and similarities between them and unigrams into topic models. First, we propose the novel algorithm PLSA-SIM, which incorporates similar unigrams and bigrams into topic models and maintains relationships between bigrams and their unigram components. The experiments, conducted on the English parts of the Europarl and JRC-Acquis parallel corpora, the ACL Anthology Reference corpus and Russian banking articles, distinguished two groups of measures ranking bigrams. The first group produces top bigrams which, whether added to topic models as "black boxes" or not, result in nearly the same quality of inferred topics. However, the second group produces top bigrams which, if added to the proposed PLSA-SIM algorithm, result in significant improvement in all metrics under consideration.

Acknowledgements

This work is partially supported by RFBR grant N14-07-00383.

References

David Andrzejewski, Xiaojin Zhu, and Mark Craven. 2009. Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors. Proceedings of the 26th Annual International Conference on Machine Learning: 25–32.

David Andrzejewski and David Buttler. 2011. Latent Topic Feedback for Information Retrieval. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 600–608.

Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. 2009. On Smoothing and Inference for Topic Models. Proceedings of the 25th International Conference on Uncertainty in Artificial Intelligence: 27–34.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, volume 3: 993–1022.

Gerlof Bouma. 2009. Normalized (Pointwise) Mutual Information. Proceedings of the Biennial GSCL Conference: 31–40.

Jordan Boyd-Graber, David M. Blei, and Xiaojin Zhu. 2007. A Topic Model for Word Sense Disambiguation. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning: 1024–1033.

Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. 2009. Reading Tea Leaves: How Humans Interpret Topic Models. Proceedings of the 24th Annual Conference on Neural Information Processing Systems: 288–296.

Kenneth Ward Church and Patrick Hanks. 1990. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, volume 16: 22–29.

Beatrice Daille. 1995. Combined Approach for Terminology Extraction: Lexical Statistics and Linguistic Filtering. PhD Dissertation, University of Paris, Paris.

Ali Daud, Juanzi Li, Lizhu Zhou, and Faqir Muhammad. 2010. Knowledge discovery through directed probabilistic topic models: a survey. Frontiers of Computer Science in China, 4(2): 280–301.

Vidas Daudaravičius and Rūta Marcinkevičienė. 2003. Gravity Counts for the Boundaries of Collocations. International Journal of Corpus Linguistics, 9(2): 321–348.

Paul Deane. 2005. A Nonparametric Method for Extraction of Candidate Phrasal Terms. Proceedings of the 43rd Annual Meeting of the ACL: 605–613.

Ted Dunning. 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1): 61–74.

Vladimir Eidelman, Jordan Boyd-Graber, and Philip Resnik. 2012. Topic Models for Dynamic Translation Model Adaptation. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, volume 2: 115–119.

Thomas L. Griffiths, Mark Steyvers, and Joshua B. Tenenbaum. 2007. Topics in Semantic Representation. Psychological Review, 114(2): 211–244.

Thomas Hofmann. 1999. Probabilistic Latent Semantic Indexing. Proceedings of the 22nd Annual International SIGIR Conference on Research and Development in Information Retrieval: 50–57.

Wei Hu, Nobuyuki Shimizu, Hiroshi Nakagawa, and Huanye Sheng. 2008. Modeling Chinese Documents with Topical Word-Character Models. Proceedings of the 22nd International Conference on Computational Linguistics: 345–352.

Paul Jaccard. 1901. Distribution de la flore alpine dans le Bassin des Dranses et dans quelques regions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37(140): 241–272.

Mark Johnson. 2010. PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names. Proceedings of the 48th Annual Meeting of the ACL: 1148–1157.

Mihoko Kitamura and Yuji Matsumoto. 1996. Automatic Extraction of Word Sequence Correspondences in Parallel Corpora. Proceedings of the 4th Annual Workshop on Very Large Corpora: 79–87.

Jey Han Lau, Timothy Baldwin, and David Newman. 2013. On Collocations and Topic Models. ACM Transactions on Speech and Language Processing, 10(3): 1–14.

Bing Liu. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers.

Jose Gabriel Pereira Lopes and Joaquim Ferreira da Silva. 1999. A Local Maxima Method and a Fair Dispersion Normalization for Extracting Multiword Units. Proceedings of the 6th Meeting on the Mathematics of Language: 369–381.

David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing Semantic Coherence in Topic Models. Proceedings of EMNLP'11: 262–272.

David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic Evaluation of Topic Coherence. Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics: 100–108.

Youngja Park, Roy J. Byrd, and Branimir K. Boguraev. 2002. Automatic Glossary Extraction: Beyond Terminology Identification. Proceedings of the 19th International Conference on Computational Linguistics: 1–7.

Frank Smadja, Kathleen R. McKeown, and Vasileios Hatzivassiloglou. 1996. Translating Collocations for Bilingual Lexicons: A Statistical Approach. Computational Linguistics, 22(1): 1–38.

Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler. 2012. Exploring Topic Coherence over Many Models and Many Topics. Proceedings of EMNLP-CoNLL'12: 952–961.

Konstantin V. Vorontsov and Anna A. Potapenko. 2014. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. Proceedings of AIST'2014, Springer, volume CCIS 439: 28–45.

Hanna M. Wallach. 2006. Topic Modeling: Beyond Bag-of-Words. Proceedings of the 23rd International Conference on Machine Learning: 977–984.

Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong. 2009. Multi-Document Summarization using Sentence-based Topic Models. Proceedings of the ACL-IJCNLP 2009 Conference Short Papers: 297–300.

Xuerui Wang, Andrew McCallum, and Xing Wei. 2007. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 2007 Seventh IEEE International Conference on Data Mining: 697–702.

Xing Wei and W. Bruce Croft. 2006. LDA-Based Document Models for Ad-hoc Retrieval. Proceedings of the 29th International Conference on Research and Development in Information Retrieval: 178–185.

Zhongwu Zhai, Bing Liu, Hua Xu, and Peifa Jia. 2010. Grouping Product Features Using Semi-Supervised Learning with Soft-Constraints. Proceedings of the 23rd International Conference on Computational Linguistics: 1272–1280.

Wen Zhang, Taketoshi Yoshida, Tu Bao Ho, and Xijin Tang. 2008. Augmented Mutual Information for Multi-Word Term Extraction. International Journal of Innovative Computing, Information and Control, 8(2): 543–554.

Shibin Zhou, Kan Li, and Yushu Liu. 2009. Text Categorization Based on Topic Model. International Journal of Computational Intelligence Systems, 2(4): 398–409.
