
Topic Models: Accounting Component Structure of Bigrams

Michael Nokel
Lomonosov Moscow State University, Russian Federation
[email protected]

Natalia Loukachevitch
Lomonosov Moscow State University, Russian Federation
louk [email protected]

Abstract

The paper describes the results of an empirical study of integrating bigram collocations, and similarities between them and unigrams, into topic models. First of all, we propose a novel algorithm, PLSA-SIM, that is a modification of the original PLSA algorithm. It incorporates bigrams and maintains relationships between unigrams and bigrams based on their component structure. Then we analyze a variety of word association measures in order to integrate top-ranked bigrams into topic models. All experiments were conducted on four text collections of different domains and languages. The experiments distinguish a subgroup of the tested measures whose top-ranked bigrams, when integrated into the PLSA-SIM algorithm, yield a significant improvement in topic model quality for all collections.

1 Introduction

Topic modeling is one of the recent applications of machine learning techniques to natural language processing. Topic models identify which topics relate to each document and which words form each topic. Each topic is defined as a multinomial distribution over terms, and each document is defined as a multinomial distribution over topics (Blei et al., 2003). Topic models have achieved noticeable success in various areas such as information retrieval (Wei and Croft, 2006), including applications such as multi-document summarization (Wang et al., 2009) and text clustering and categorization (Zhou et al., 2009), and other natural language processing tasks such as word sense disambiguation (Boyd-Graber et al., 2007) and machine translation (Eidelman et al., 2012). Among the most well-known models are Latent Dirichlet Allocation (LDA) (Blei et al., 2003), which is based on a Dirichlet prior distribution, and Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999), which is not connected with any parametric prior distribution.
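As a compact reminder (not part of the paper's own exposition), the factorization underlying PLSA, and with Dirichlet priors also LDA, can be written with the topic-word distributions Φ = (φ_wt) and document-topic distributions Θ = (θ_td) that reappear later in Algorithm 1:

    \[
      p(w \mid d) = \sum_{t \in T} p(w \mid t)\, p(t \mid d)
                  = \sum_{t \in T} \phi_{wt}\, \theta_{td}
    \]

PLSA fits Φ and Θ by maximizing the likelihood of the observed document-word counts with the EM algorithm, whereas LDA additionally places Dirichlet priors on both distributions.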
One of the main drawbacks of topic models is that they rely on the "bag-of-words" model, which discards word order and is based on the word independence assumption. There are numerous studies in which the integration of collocations, n-grams, idioms and multi-word terms into topic models is investigated. However, such integration often leads to a decrease in model quality due to the increasing size of the vocabulary or to a serious complication of the model (Wallach, 2006; Griffiths et al., 2007; Wang et al., 2007).

This paper proposes a novel approach that takes into account bigram collocations and the relationships between them and unigrams in topic models (such as citizen – citizen of country – citizen of union – European citizen – state citizen; categorization – document categorization – term categorization – text categorization). This allows us to create a novel method of integrating bigram collocations into topic models that does not treat bigrams as "black boxes", but maintains the relationship between unigrams and bigrams based on their component structure. The proposed algorithm leads to a significant improvement in topic model quality, measured by perplexity and topic coherence (Newman et al., 2010), without complicating the model.

All experiments were carried out using the PLSA algorithm and its modifications on four corpora of different domains and languages: the English part of the Europarl parallel corpus, the English part of the JRC-Acquis parallel corpus, the ACL Anthology Reference corpus, and Russian banking magazines.

The rest of the paper is organized as follows. In Section 2 we focus on related work. Section 3 proposes the novel algorithm PLSA-SIM that incorporates bigrams and similarities between them and unigrams into topic models. Section 4 describes the datasets used in the experiments, all preprocessing steps, and the metrics used to evaluate quality. In Section 5 we perform an extensive analysis of a variety of measures for integrating top-ranked bigrams into topic models. In the last section we draw conclusions.

2 Related Work

The idea of using collocations in topic models is not a novel one. Two kinds of methods have been proposed to deal with this problem: creation of a unified probabilistic model, and preliminary extraction of collocations and n-grams with their further integration into topic models.

Most studies belong to the first kind of methods. The first step beyond the "bag-of-words" assumption was made by Wallach (2006), where the Bigram Topic Model was presented. In this model word probabilities are conditioned on the immediately preceding word. The LDA Collocation Model (Griffiths et al., 2007) extends the Bigram Topic Model by introducing a new set of variables and thereby giving the flexibility to generate both unigrams and bigrams. Wang et al. (2007) proposed the Topical N-Gram Model, which adds a layer of complexity to allow the formation of bigrams to be determined by the context. Hu et al. (2008) proposed the Topical Word-Character Model, challenging the assumption that the topic of an n-gram is determined by the topics of the composite words within the collocation; this model is mainly suitable for the Chinese language. Johnson (2010) established a connection between LDA and Probabilistic Context-Free Grammars and proposed two probabilistic models combining insights from LDA and Adaptor Grammars to integrate collocations and proper names into the topic model.

While all these models have a theoretically elegant background, they are very complex and hard to compute on real datasets. For example, the Bigram Topic Model has W²T parameters, compared to WT for LDA and WT + DT for PLSA, where W is the size of the vocabulary, D is the number of documents, and T is the number of topics. Therefore such models are mostly of theoretical interest.

The algorithm proposed in Lau et al. (2013) belongs to the second type of methods, which use collocations in topic models. The authors extract bigram collocations via the t-test and replace separate units by top-ranked bigrams at the preprocessing step. They use two metrics of topic quality, perplexity and topic coherence (Newman et al., 2010), and conclude that incorporating bigram collocations into topics results in worse perplexity and better topic coherence.
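To illustrate this second, preprocessing-based family of methods, the sketch below ranks candidate bigrams with the t-score commonly used for collocation extraction. It is not the implementation of Lau et al. (2013); the function name, tokenized input format, and cutoff are assumptions made for exposition only.

    from collections import Counter

    def top_bigrams_by_t_score(tokenized_docs, top_n=1000):
        """Rank candidate bigrams with the t-score collocation measure.

        Uses the standard approximation
        t = (f(w1,w2) - f(w1)*f(w2)/N) / sqrt(f(w1,w2)),
        where N is the total number of tokens in the corpus.
        """
        unigram_freq = Counter()
        bigram_freq = Counter()
        for doc in tokenized_docs:
            unigram_freq.update(doc)
            bigram_freq.update(zip(doc, doc[1:]))

        n_tokens = sum(unigram_freq.values())
        scored = []
        for (w1, w2), f12 in bigram_freq.items():
            expected = unigram_freq[w1] * unigram_freq[w2] / n_tokens
            t_score = (f12 - expected) / (f12 ** 0.5)
            scored.append(((w1, w2), t_score))

        scored.sort(key=lambda item: item[1], reverse=True)
        return [bigram for bigram, _ in scored[:top_n]]

In the preprocessing scheme of Lau et al. (2013), the selected bigrams are then treated as single vocabulary tokens that replace their component unigrams; the approach developed below instead keeps both the bigrams and their components and links them explicitly.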
Our current work also belongs to the second type of methods, and it differs from previous papers such as Lau et al. (2013) in that our approach does not treat bigrams as "black boxes", but maintains information about the inner structure of bigrams and the relationships between bigrams and their component unigrams, which leads to improvement in both metrics: perplexity and topic coherence.

The idea of utilizing prior natural language knowledge in topic models is not a novel one either. Andrzejewski et al. (2009) incorporated domain-specific knowledge through Must-Link and Cannot-Link primitives represented by a novel Dirichlet Forest prior. These primitives control whether two words tend to be generated by the same or by separate topics. However, this method can result in an exponential growth in the encoding of Cannot-Link primitives and thus has difficulty in processing a large number of constraints (Liu, 2012). Another method of incorporating such knowledge is presented in Zhai (2010), where a semi-supervised EM-algorithm was proposed to group expressions into user-specified categories. To provide a better initialization for the EM-algorithm, the method employs the prior knowledge that expressions sharing words and synonyms are likely to belong to the same group. Our current work differs from these approaches in that we incorporate similarity links between unigrams and bigrams into the topic model in a very natural way, by counting their co-occurrences in documents. The proposed approach does not increase the complexity of the original PLSA algorithm.

3 PLSA-SIM algorithm

As mentioned above, original topic models rely on the "bag-of-words" assumption, which implies word independence, and bigrams are usually added to topic models as "black boxes" without any ties to other words. That is, bigrams are added to the vocabulary as single tokens, and in each document containing any of the added bigrams the frequencies of the unigram components are decreased by the frequencies of the bigrams (Lau et al., 2013). Thus the "bag-of-words" assumption holds.

However, documents contain many similar unigrams and bigrams that share the same lemmas (e.g., correction – correction of word – error correction – spelling correction; rail – rail infrastructure – rail transport – use of rail). We should note that such bigrams do not only have identical words: many of them also maintain semantic and thematic similarity. At the same time, other bigrams with the same words (e.g., idioms) can have significant semantic differences.
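To make the notion of "similar terms" concrete, the sketch below groups each unigram with the bigrams that contain it as a lemmatized component. This is only an illustrative reading of the sets S passed to PLSA-SIM: the function and data layout are assumptions, and the additional document co-occurrence condition discussed in the text is not applied here.

    from collections import defaultdict

    def build_similar_term_sets(unigrams, bigrams):
        """Group every unigram with the accepted bigrams that share one of its lemmas.

        `unigrams` is an iterable of lemmatized unigram terms, and `bigrams` an
        iterable of lemmatized bigram terms such as "error correction".  The result
        maps each term to the set of vocabulary terms considered similar to it.
        """
        similar = defaultdict(set)
        unigram_set = set(unigrams)
        for bigram in bigrams:
            for component in bigram.split():
                if component in unigram_set:
                    # Link the bigram and its component unigram in both directions.
                    similar[component].add(bigram)
                    similar[bigram].add(component)
        return similar

For the examples above, similar["correction"] would collect "error correction" and "spelling correction", and each of those bigrams would point back to "correction".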
To take into account these different situations, we hypothesized that similar bigrams sharing the same unigram components should often belong to the same topics if they often co-occur within the same texts. The resulting procedure is presented as Algorithm 1. The modifications of the original algorithm concern lines 6 and 9, where we introduce the auxiliary variable f_dw, which takes into account the pre-computed sets of similar terms. Thus, the weight of such terms is increased within each document.

Algorithm 1: PLSA-SIM algorithm: PLSA with similar terms
Input: collection of documents D, number of topics |T|, initial distributions Θ and Φ, sets of similar terms S
Output: distributions Θ and Φ
1  while the stop criterion is not met do
2      for d ∈ D, w ∈ W, t ∈ T do
3          n̂_wt = 0, n̂_td = 0, n̂_t = 0, n_d = |d|
4      for d ∈ D, w ∈ W do
           ...
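Since the listing is cut off above line 5, the following sketch shows only one plausible realization of the described modification, not the authors' exact update: a PLSA-style EM pass in which each raw count n_dw is replaced by an f_dw that adds the in-document counts of w's similar terms. The definition f_dw = n_dw + Σ_{s ∈ S(w), s ∈ d} n_ds, the function name, and the data layout are all assumptions.

    import numpy as np

    def plsa_sim_step(n_dw, similar, phi, theta):
        """One EM pass of a PLSA-style update with similarity-boosted counts.

        n_dw: dict mapping (doc_id, word_id) -> raw frequency
        similar: dict mapping word_id -> set of similar word_ids
        phi: |W| x |T| topic-word matrix; theta: |T| x |D| document-topic matrix
        """
        W, T = phi.shape
        D = theta.shape[1]
        n_wt = np.zeros((W, T))
        n_td = np.zeros((T, D))

        for (d, w), freq in n_dw.items():
            # Assumed reading of f_dw: boost the count of w by the in-document
            # counts of its similar terms.
            f_dw = freq + sum(n_dw.get((d, s), 0) for s in similar.get(w, ()))
            # E-step: posterior over topics for the pair (d, w).
            p_tdw = phi[w, :] * theta[:, d]
            total = p_tdw.sum()
            if total > 0:
                p_tdw /= total
                n_wt[w, :] += f_dw * p_tdw
                n_td[:, d] += f_dw * p_tdw

        # M-step: renormalize the accumulated counts into distributions.
        phi_new = n_wt / np.maximum(n_wt.sum(axis=0, keepdims=True), 1e-12)
        theta_new = n_td / np.maximum(n_td.sum(axis=0, keepdims=True), 1e-12)
        return phi_new, theta_new

In an actual run this pass would be iterated until the stop criterion of line 1 in Algorithm 1 is met, for example until held-out perplexity stops improving.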