Topic Modeling Based on Keywords and Context
Johannes Schneider, University of Liechtenstein
Michail Vlachos, IBM Research, Zurich
February 6, 2018
arXiv:1710.02650v2 [cs.CL] 3 Feb 2018

Abstract

Current topic models often suffer from discovering topics not matching human intuition, unnatural switching of topics within documents, and high computational demands. We address these shortcomings by proposing a topic model and an inference algorithm based on automatically identifying characteristic keywords for topics. Keywords influence the topic assignments of nearby words. Our algorithm learns (key)word-topic scores and self-regulates the number of topics. The inference is simple and easily parallelizable. A qualitative analysis yields comparable results to those of state-of-the-art models, but with different strengths and weaknesses. Quantitative analysis using eight datasets shows gains regarding classification accuracy, PMI score, computational performance, and consistency of topic assignments within documents, while most often using fewer topics.

1 Introduction

Topic modeling deals with the extraction of latent information, i.e., topics, from a collection of text documents. Classical approaches, such as probabilistic latent semantic indexing (PLSA) [12] and, more so, its generalization, Latent Dirichlet Allocation (LDA) [5], enjoy widespread popularity despite a few shortcomings. They neglect word order within documents, i.e., documents are not treated as a sequence of words but rather as a set or 'bag' of words. This might be one reason for unnatural word-topic assignments within documents, where topics change after almost every word, as shown in our evaluation. Many attempts have been made to remedy the 'bag-of-words' assumption (e.g., [9, 10, 25, 28]), but the improvement usually comes at a price, e.g., a strong increase in computational demands and often in model complexity.

The second shortcoming is an unnatural assignment of word-topic probabilities. Better solutions to the LDA model (in terms of its loss function) might result in worse human interpretability of topics [6]. For PLSA (and LDA) the frequency of a word within a topic heavily influences its probability (within the topic), whereas the frequencies of the word in other topics have a lesser impact. Thus, a common word that occurs equally frequently across all topics might have a large probability in each topic. This makes sense for models based on document "generation", because frequent words should be generated more often. But using our rationale that a likely word should also be typical for a topic, high scores (or probabilities) should only be assigned to words that occur in a few topics. Although there are techniques that provide a weighing of words, they either do not fully cover the above rationale and perform static weighing, e.g., TF-IDF, or they have high computational demands but yield limited improvements [14].

Third, a topic modeling technique should discover an appropriate number of topics for a corpus and a specific task. Hierarchical models based on LDA [24, 19] limit the number of topics automatically, but increase computational costs. PLSA and LDA do not tend to self-regulate the number of topics. They return as many topics as specified by a parameter. We proclaim that this parameter should be seen as an upper bound. Choosing too large a number of topics leads to overfitting. Furthermore, it is unclear how to choose the number of topics. Manual investigation of topics also benefits from self-regulation, i.e., a reduction of the number of topics to consider.

Other concerns of topic models are computational performance and implementation complexity. We provide a novel, holistic model and inference algorithm that addresses all of the above-mentioned shortcomings of existing models, at least partially, by introducing a novel topic model based on the following rationale:

i) For each word in each topic, we compute a keyword score stating how likely a word belongs to the topic. Roughly speaking, the score depends on how common the word is within the topic and how common it is within other topics. Classical generative models compute a word-topic probability distribution that states how likely a word is generated given a topic. In those models, the probability of a word given a topic depends less strongly (only implicitly) on the frequency of the word in other topics.

ii) We assume that the topics of a word in a sequence of words are heavily influenced by words in the same sequence. Therefore, words near a word with a high keyword score might be assigned to the topic of that word, even if they on their own are unlikely for the topic. Thus, the order of words has an impact, in contrast to bag-of-words models.

iii) The topic-document distribution depends on a simple and highly efficient summation of the keyword scores of the words within a document. Like LDA does, it has a prior for topic-document distributions.

iv) An intrinsic property of our model is to limit the number of distinct topics. Redundant topics are removed explicitly.

After modeling the above ideas and describing an algorithm for inference, we evaluate it on several datasets, showing improvements compared with three other methods on the PMI score, classification accuracy, performance, and a new metric denoted by topic-change likelihood. This metric gives the probability that two consecutive words have different topics. From a human perspective, it seems natural that multiple consecutive words in a document should very often belong to the same topic. Models such as LDA tend to assign consecutive words to different topics.
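To make the topic-change likelihood metric concrete, the following small Python sketch computes our reading of it, namely the fraction of consecutive word pairs in a document that are assigned different topics. It is an illustration only, not code from the paper; the function name and the example topic assignments are our own.

    def topic_change_likelihood(topic_assignments):
        """Fraction of consecutive word pairs whose assigned topics differ."""
        if len(topic_assignments) < 2:
            return 0.0
        changes = sum(1 for a, b in zip(topic_assignments, topic_assignments[1:]) if a != b)
        return changes / (len(topic_assignments) - 1)

    # Example: 2 of the 4 consecutive pairs switch topics -> 0.5
    print(topic_change_likelihood([0, 0, 1, 1, 2]))

A low value indicates that nearby words tend to share a topic, which is the behavior the authors argue matches human intuition.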
2 Topic Keyword Model

Our topic keyword model (TKM) uses ideas from the aspect model [13], which defines a joint probability over D × W (for notation see Table 1). The standard aspect model assumes conditional independence of words and documents given a topic:

    p(d, w) := p(d) · p(w|d)
    p(w|d) := Σ_t p(w|t) · p(t|d)                                          (1)
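As a point of reference for Equation 1, the sketch below evaluates the standard aspect model's word probability, assuming the parameters are already given as arrays. The array names (p_w_t, p_t_d) and their layout are our own illustration, not notation from the paper.

    import numpy as np

    def aspect_word_prob(p_w_t, p_t_d, d, w):
        """Equation 1: p(w|d) = sum over topics t of p(w|t) * p(t|d).

        p_w_t: array (num_topics, vocab_size), row t holds p(.|t)
        p_t_d: array (num_docs, num_topics), row d holds p(.|d)
        """
        return float(np.dot(p_w_t[:, w], p_t_d[d]))

Note that this value is identical for every occurrence of w in d; the position-dependent Equations 2–4 below are exactly what TKM changes relative to this baseline.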
Our model (Equations 2–4) maintains the core ideas of the aspect model, but we take the context, i.e., the position i of a word, into account and use a keyword score f(w, t). Taking the context of a word into account implies that if a word occurs multiple times in the same document but with different nearby words, each occurrence of the word might have a different probability. To express context, we include the index i, denoting the position of the i-th word in the document (see Equation 2). The distribution p(d) is proportional to |d|. We simply use a uniform distribution. We also assume a uniform distribution for p(i|d), as we do not consider any position in the document as more likely (or important) than any other. In principle, one could, for example, give higher priority to initially occurring words that might correspond to a summarizing abstract of a text.

Model Equations:

    p(d, w, i) := p(d) · p(i|d) · p(w_i = w | d, i)                                      (2)

    p(w|d, i) := max_{t, j ∈ R_i} { (f(w, t) + f(w_{i+j}, t)) · p(t|d) },
        with R_i := [max(0, i − L), min(|d| − 1, i + L)]                                 (3)

    p(t|d) := ( Σ_{i ∈ [0, |d|−1]} f(w_i, t) )^α  /  Σ_t ( Σ_{i ∈ [0, |d|−1]} f(w_i, t) )^α    (4)

To compute the probability of a word at a certain position in a document (Equation 3), we use latent variables, i.e., topics, as done in the aspect model (Equation 1). Instead of the generative probability p(w|t) that word w occurs in topic t, we use a keyword score f(w, t). A keyword for a topic should be somewhat frequent and also characteristic for that topic only. The keyword scoring function f computes a keyword score for a particular word and topic that might depend on multiple parameters such as p(w|t), p(t|w), and p(t), whereas the generative probability p(w|t) is based only on the relative number of occurrences of a word in a topic. We shall discuss such functions in Section 3. We use the idea that a word with a high keyword score for a topic might "enforce" the topic onto a nearby word, even if that word is just weakly associated with the topic. For a word w_i at position i in the document, all words within L words to the left and right, i.e., words w_{i+j} with j ∈ [−L, L], could contain a word (with a high keyword score) that determines the topic assignment of w_i. To account for boundary conditions such as the beginning and end of a document d, we use j ∈ [max(0, i − L), min(|d| − 1, i + L)]. There are multiple options for how a nearby keyword might impact the topic of a word. The addition of the scores f(w_i, t) and f(w_{i+j}, t) exhibits a linear behavior that is suitable for modeling the assumption that one word might determine the topic of a nearby word, even if the other word is only weakly associated with the topic. Generative models following the bag-of-words model imply the use of multiplication of probabilities, i.e., keyword scores, which does not capture our modeling assumption: a word that is weakly associated with a topic, i.e., has a score close to zero, would have a low score for the topic even in the presence of strong keywords for the topic. Furthermore, each occurrence of a word in a document is assumed to stem from exactly one topic, which is expressed by taking the maximum in Equation 3. We compute p(t|d) dependent on the keyword scores f(w, t) of the words in the document d (Equation 4).
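The sketch below shows one way Equations 3 and 4 can be evaluated once a keyword-score matrix f is available; the scoring function itself is only introduced in Section 3 and is not part of this excerpt. The variable names, the array layout of f, and the reading of R_i as boundary-clipped absolute positions are our assumptions, not the authors' reference implementation.

    import numpy as np

    def topic_doc_dist(f, doc, alpha):
        """Equation 4: p(t|d) from powered sums of keyword scores over the document.

        f: array (num_topics, vocab_size) of keyword scores f(w, t), assumed given.
        doc: sequence of word ids w_0 .. w_{|d|-1}; alpha: prior-like exponent.
        """
        sums = f[:, doc].sum(axis=1) ** alpha      # (sum_i f(w_i, t))^alpha for each topic t
        return sums / sums.sum()                   # normalize over all topics

    def word_prob_and_topic(f, doc, i, p_t_d, L):
        """Equation 3: p(w|d, i) as a max over topics t and nearby positions.

        p_t_d: p(t|d) as returned by topic_doc_dist; L: context window radius.
        Returns the probability and the maximizing topic.
        """
        lo, hi = max(0, i - L), min(len(doc) - 1, i + L)
        best_prob, best_topic = 0.0, None
        for t in range(f.shape[0]):
            for k in range(lo, hi + 1):            # nearby word positions, clipped at boundaries
                prob = (f[t, doc[i]] + f[t, doc[k]]) * p_t_d[t]
                if prob > best_prob:
                    best_prob, best_topic = prob, t
        return best_prob, best_topic

Because Equation 3 takes a maximum, the maximizing topic returned by word_prob_and_topic is also the single topic assigned to that word occurrence, reflecting the assumption that each occurrence stems from exactly one topic.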