Topic Modeling based on Keywords and Context

Johannes Schneider, University of Liechtenstein
Michail Vlachos, IBM Research, Zurich
February 6, 2018

Abstract

Current topic models often suffer from discovering topics not matching human intuition, unnatural switching of topics within documents and high computational demands. We address these shortcomings by proposing a model and an inference algorithm based on automatically identifying characteristic keywords for topics. Keywords influence the topic assignments of nearby words. Our algorithm learns (key)word-topic scores and self-regulates the number of topics. The inference is simple and easily parallelizable. A qualitative analysis yields comparable results to those of state-of-the-art models, but with different strengths and weaknesses. Quantitative analysis using eight datasets shows gains regarding classification accuracy, PMI score, computational performance, and consistency of topic assignments within documents, while most often using fewer topics.

1 Introduction

Topic modeling deals with the extraction of latent information, i.e., topics, from a collection of text documents. Classical approaches, such as probabilistic latent semantic indexing (PLSA) [12] and, even more so, its generalization, Latent Dirichlet Allocation (LDA) [5], enjoy widespread popularity despite a few shortcomings. They neglect word order within documents, i.e., documents are not treated as a sequence of words but rather as a set or 'bag' of words. This might be one reason for unnatural word-topic assignments within documents, where topics change after almost every word, as shown in our evaluation. Many attempts have been made to remedy the 'bag-of-words' assumption (e.g., [9, 10, 25, 28]), but the improvement usually comes at a price, e.g., a strong increase in computational demands and often in model complexity.

The second shortcoming is an unnatural assignment of word-topic probabilities. Better solutions to the LDA model (in terms of its loss function) might result in worse human interpretability of topics [6]. For PLSA (and LDA) the frequency of a word within a topic heavily influences its probability (within the topic), whereas the frequencies of the word in other topics have a lesser impact. Thus, a common word that occurs equally frequently across all topics might have a large probability in each topic. This makes sense for models based on document "generation", because frequent words should be generated more often. But under our rationale that a likely word should also be typical for a topic, high scores (or probabilities) should only be assigned to words that occur in a few topics. Although there are techniques that provide a weighing of words, they either do not fully cover the above rationale and perform static weighing, e.g., TF-IDF, or they have high computational demands but yield limited improvements [14].

Third, a topic modeling technique should discover an appropriate number of topics for a corpus and a specific task. Hierarchical models based on LDA [24, 19] limit the number of topics automatically, but increase computational costs. PLSA and LDA do not tend to self-regulate the number of topics: they return as many topics as specified by a parameter. We proclaim that this parameter should be seen as an upper bound. Choosing too large a number of topics leads to overfitting. Furthermore, it is unclear how to choose the number of topics. Manual investigation of topics also benefits from self-regulation, i.e., a reduction of the number of topics to consider.

Other concerns of topic models are computational performance and implementation complexity. We provide a novel, holistic model and inference algorithm that addresses all of the above mentioned shortcomings of existing models – at least partially – by introducing a novel topic model based on the following rationale:

i) For each word in each topic, we compute a keyword score stating how likely the word belongs to the topic. Roughly speaking, the score depends on how common the word is within the topic and how common it is within other topics. Classical generative models compute a word-topic probability distribution that states how likely a word is generated given a topic.

In those models, the probability of a word given a topic depends less strongly (only implicitly) on the frequency of the word in other topics.

ii) We assume that the topics of a word in a sequence of words are heavily influenced by the words in the same sequence. Therefore, words near a word with a high keyword score might be assigned to the topic of that word, even if they on their own are unlikely for the topic. Thus, the order of words has an impact, in contrast to bag-of-words models.

iii) The topic-document distribution depends on a simple and highly efficient summation of the keyword scores of the words within a document. Like LDA, it has a prior for topic-document distributions.

iv) An intrinsic property of our model is to limit the number of distinct topics. Redundant topics are removed explicitly.

After modeling the above ideas and describing an algorithm for inference, we evaluate it on several datasets, showing improvements compared with three other methods on the PMI score, classification accuracy, performance and a new metric denoted by topic-change likelihood. This metric gives the probability that two consecutive words have different topics. From a human perspective, it seems natural that multiple consecutive words in a document should very often belong to the same topic. Models such as LDA tend to assign consecutive words to different topics.

2 Topic Keyword Model

Our topic keyword model (TKM) uses ideas from the aspect model [13], which defines a joint probability distribution on D × W (for notation see Table 1). The standard aspect model assumes conditional independence of words and documents given a topic:

p(d, w) := p(d) · p(w|d)
p(w|d) := Σ_t p(w|t) · p(t|d)    (1)

Our model (Equations 2–4) maintains the core ideas of the aspect model, but we account for context by taking the position i of a word into account and we use a keyword score f(w, t). Taking the context of a word into account implies that if a word occurs multiple times in the same document but with different nearby words, each occurrence of the word might have a different probability. To express context, we include the index i, denoting the position of the i-th word in the document (see Equation 2). The distribution p(d) could be chosen proportional to |d|; we simply use a uniform distribution. We also assume a uniform distribution for p(i|d), as we do not consider any position in the document as more likely (or important) than any other. In principle, one could, for example, give higher priority to initially occurring words that might correspond to a summarizing abstract of a text.

Model equations:

p(d, w, i) := p(d) · p(i|d) · p(w_i = w | d, i)    (2)
p(w|d, i) := max_{t, j∈R_i} {(f(w, t) + f(w_j, t)) · p(t|d)},  with R_i := [max(0, i − L), min(|d| − 1, i + L)]    (3)
p(t|d) := (Σ_{i∈[0,|d|−1]} f(w_i, t))^α / Σ_{t'∈T} (Σ_{i∈[0,|d|−1]} f(w_i, t'))^α    (4)

To compute the probability of a word at a certain position in a document (Equation 3) we use latent variables, i.e., topics, as done in the aspect model (Equation 1). Instead of the generative probability p(w|t) that word w occurs in topic t, we use a keyword score f(w, t). A keyword for a topic should be somewhat frequent and also characteristic for that topic only. The keyword scoring function f computes a score for a particular word and topic that might depend on multiple parameters such as p(w|t), p(t|w) and p(t), whereas the generative probability p(w|t) is only based on the relative number of occurrences of a word in a topic. We shall discuss such functions in Section 3. We use the idea that a word with a high keyword score for a topic might "enforce" the topic onto a nearby word, even if that word is only weakly associated with the topic. For a word w_i at position i in the document, all words within L words to the left and right, i.e., words w_{i+j} with j ∈ [−L, L], could contain a word (with high keyword score) that determines the topic assignment of w_i. To account for boundary conditions, such as the beginning and end of a document d, we use the clipped window R_i = [max(0, i − L), min(|d| − 1, i + L)]. There are multiple options for how a nearby keyword might impact the topic of a word. The addition of the scores f(w_i, t) and f(w_j, t) exhibits a linear behavior that is suitable for modeling the assumption that one word might determine the topic of a nearby word, even if the other word is only weakly associated with the topic. Generative models following the bag-of-words model imply the use of multiplication of probabilities, i.e., of keyword scores, which does not capture our modeling assumption: a word that is weakly associated with a topic, i.e., has a score close to zero, would have a low score for the topic even in the presence of strong keywords for the topic. Furthermore, each occurrence of a word in a document is assumed to stem from exactly one topic, which is expressed by taking the maximum in Equation 3.

We compute p(t|d) dependent on the keyword scores f(w, t) of the words in the document d (Equation 4). We model the idea of looking for keywords in a document and aggregating their scores, i.e., f(w, t). The parameter α impacts the number of topics per document: the larger α, the more concentrated the topic-document distribution, i.e., the fewer topics per document. Essentially, these equations allow us to derive an algorithm for inference that is efficient, while avoiding overfitting and allowing us to model a prior on topic concentration.
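To make Equation 4 concrete, here is a small illustrative sketch (ours, not from the paper); it assumes the keyword scores are given as a NumPy array f of shape (|W|, k) and a document as a list of word indices.

import numpy as np

def topic_document_distribution(doc, f, alpha=2.5):
    """Equation 4: p(t|d) from alpha-exponentiated sums of keyword scores.

    doc   -- list of word indices w_0, ..., w_{|d|-1}
    f     -- array of shape (num_words, num_topics) holding f(w, t)
    alpha -- topic prior; a larger alpha concentrates p(t|d) on fewer topics
    """
    scores = f[doc].sum(axis=0) ** alpha   # (sum_i f(w_i, t))^alpha per topic
    return scores / scores.sum()           # normalize over topics

# toy example: 4 vocabulary words, 2 topics
f = np.array([[2.0, 0.1], [1.5, 0.2], [0.1, 1.0], [0.3, 0.4]])
print(topic_document_distribution([0, 1, 3], f))   # concentrated on topic 0

Raising the summed scores to the power α plays the role of the concentration prior discussed above.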

3 Modeling Keywords

Here we state how to compute the keyword score f(w, t) given a word and a topic. Alternative options are discussed in the Related Work section. A word obtains a high keyword score for a topic if it is assigned often to the topic and the relative number of assignments to the topic is high compared with other topics. The first aspect relates to the frequency n(w, t) of the word within the topic. The second captures how characteristic a keyword is for the topic relative to others, i.e., p(t|w). If the topic distribution p(t|w) of a word is uniform, the word is not characteristic of any topic. If it is highly concentrated, then it is. This can be captured using the inverse of the entropy H(w):

H(w) := −Σ_t p(t|w) · log p(t|w)    (5)

The entropy H(w) is maximized for a uniform distribution, i.e., p(t|w) = 1/|T|, giving H(w) = log |T|, where |T| is the number of (current) topics, i.e., initially k. Thus, 1/H(w) is a measure that increases the more concentrated the assignment of a word to topics is. However, if all occurrences of a word are assigned to a single topic, the entropy is zero and its inverse infinite. Therefore, we add one in the denominator, i.e., we use 1/(1 + H(w)). Furthermore, if the number of occurrences n(w) of a word in the entire corpus is smaller than the number of topics |T|, then the word's entropy can be at most log n(w) < log |T|. Ignoring this results in a high keyword score for a rare word even if each occurrence is assigned to a different topic. Thus, we ensure that rare words are not preferred too much by using the factor log(min(|T|, n(w) + 1)), where the addition of one ensures non-zero weights for words that occur once. An optional weight parameter δ allows more or less emphasis to be put on the concentration (relative to the frequency within a topic). Overall, our concentration score is:

con(w) := ( log(min(|T|, n(w) + 1)) / (1 + H(w)) )^δ    (6)
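For illustration, the concentration score of Equations 5 and 6 can be computed from the assignment counts n(w, t) as follows (our own sketch; the small epsilon guards against log(0) and empty counts and is not part of the model):

import numpy as np

def concentration_scores(n_wt, delta=1.5, eps=1e-12):
    """Equations 5-6: entropy-based concentration score con(w).

    n_wt  -- array (num_words, num_topics) with assignment counts n(w, t)
    delta -- weight of concentration relative to frequency
    """
    num_topics = n_wt.shape[1]
    n_w = n_wt.sum(axis=1)                                  # n(w): occurrences of each word
    p_t_given_w = n_wt / np.maximum(n_w[:, None], eps)      # p(t|w)
    h_w = -np.sum(p_t_given_w * np.log(p_t_given_w + eps), axis=1)   # H(w), Equation 5
    rare_cap = np.log(np.minimum(num_topics, n_w + 1))      # log(min(|T|, n(w) + 1))
    return (rare_cap / (1.0 + h_w)) ** delta                # con(w), Equation 6

n_wt = np.array([[10.0, 0.0, 0.0],    # word concentrated on one topic -> high con(w)
                 [ 4.0, 3.0, 3.0]])   # word spread over topics        -> low con(w)
print(concentration_scores(n_wt))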
a word n(w) in the entire corpus are fewer than the The general idea of EM is to perform two steps. In number of topics |T |, then the word’s entropy can be the E-step latent variables are estimated, i.e., the at most log n(w) < log |T |. Ignoring this results in a probability p(t|w, i, d) of a topic given word w and high keyword score for a rare word even if each occur- position i in document d. In the M-step the loss rence is assigned to a different topic. Thus, we ensure function is maximized with respect to the parameters that rare words are not preferred too much by using using the topic distribution p(t|w, i, d). In our model the factor log min(|T |, n(w) + 1), where the addition we assume that a word at some fixed position in a of one is to ensure non-zero weights for words that document can only have one topic as expressed in occur once. An optional weight parameter δ allows Equation (3). Therefore, the topic t(w, i, d) is simply more or less emphasis to be put on the concentration the most likely topic of that word in that context, i.e., (relative to the frequency within a topic). Overall, our adjusting Equation (3) accordingly we get: concentration score is: t(w, i, d) := argmax{(f(w, t) + f(wi+j , t)) · p(t|d)|j ∈ R }  log(min(|T |, n(w) + 1) δ i con(w) := (6) t 1 + H(w) 1 t = t p(t|w, i, d) = w,i,d The second aspect of keyword scores relates to the 0 tw,i,d 6= t frequency of the word within a topic. Mathematically (8) speaking, the frequency of a word might be estimated

This differs from PLSA and LDA, where each word within a document is assigned a distribution, typically with non-zero probabilities for all topics. In the M-step, we want to optimize the parameters. Analogously to Equations (9.30) and (9.31) in [3], we define the function Q(Θ, Θ^old) for the complete-data log-likelihood depending on the parameters Θ:

Θ^new = argmax_Θ Q(Θ, Θ^old)
with Q(Θ, Θ^old) := Σ_{d,i,t} p(t|D, Θ^old) · log p(D, t|Θ)    (9)

The optimization problem in Equation (9) might be tackled using various methods, e.g., Lagrange multipliers. Unfortunately, simple analytical solutions based on these approaches are intractable given the complexity of the model equations (2–4). However, one might also look at the inference of the parameters p(w|t), p(t) and p(w) differently. Assume that we are given the assignments of words w to topics t for a collection of documents D, i.e., n(w, t). Our frequentist inference approach uses the empirical distribution: the probability of a word given a topic equals the number of assignments of that word to the topic divided by the total number of word assignments to the topic. Under mild assumptions the maximum likelihood distribution equals the empirical distribution (see, e.g., Section 9.2.2 in [2]):

p(w|t) := n(w, t) / Σ_{w'} n(w', t)    (10)

Note that n(w, t) is computable by summing over the assignments from the E-step, i.e., p(t|w, i, d), because each word within a context is assigned to one topic only (Equation 8):

n(w, t) := Σ_{i,d} p(t|w_i, i, d)    (11)

To compute p(t|w) we use Bayes' law to obtain p(t|w) = p(w|t) · p(t)/p(w). Therefore, the only free parameters we need to estimate are p(t) and p(w), i.e., k + |W| values. We have p(t) = Σ_w n(w, t) / Σ_{w,t} n(w, t) and p(w) = Σ_t n(w, t) / Σ_{w,t} n(w, t), thus:

p(t|w) = p(w|t) · p(t) / p(w)
       = (n(w, t) / Σ_{w'} n(w', t)) · (Σ_{w'} n(w', t) / Σ_{w',t'} n(w', t')) / (Σ_{t'} n(w, t') / Σ_{w',t'} n(w', t'))
       = n(w, t) / Σ_{t'} n(w, t')

We also need to compute the keyword score f(w, t). We use that Σ_{d∈D} |d| = Σ_{w,t} n(w, t), since each word in each document is assigned to one topic:

f(w, t) ∝ log(1 + p(w, t) · Σ_{d∈D} |d| + β) · con(w)
        ∝ log(1 + p(w|t) · p(t) · Σ_{w,t} n(w, t) + β) · con(w)
        ∝ log(1 + (n(w, t) / Σ_{w'} n(w', t)) · (Σ_{w'} n(w', t) / Σ_{w',t'} n(w', t')) · Σ_{w',t'} n(w', t') + β) · con(w)
        ∝ log(1 + n(w, t) + β) · con(w)
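In code, the M-step therefore reduces to a few array operations on the counts. The following sketch (ours, under the simplifications derived above) computes the normalized keyword score f(w, t) ∝ log(1 + n(w, t) + β) · con(w) and the human-oriented score f_hu(w, t) ∝ n(w, t) · con(w) of Equation 7, given a vector con(w) as in Equation 6:

import numpy as np

def keyword_scores(n_wt, con_w, beta=0.05):
    """Simplified Equation 7: keyword scores from assignment counts.

    n_wt  -- array (num_words, num_topics) of counts n(w, t)
    con_w -- array (num_words,) of concentration scores con(w)
    beta  -- word prior: each word is assumed to occur at least beta times per topic
    """
    f = np.log(1.0 + n_wt + beta) * con_w[:, None]
    f /= f.sum(axis=0, keepdims=True)                       # normalize per topic
    f_hu = n_wt * con_w[:, None]
    f_hu /= np.maximum(f_hu.sum(axis=0, keepdims=True), 1e-12)
    return f, f_hu

n_wt = np.array([[10.0, 0.0], [4.0, 3.0], [0.0, 6.0]])
con_w = np.array([1.2, 0.4, 1.1])
f, f_hu = keyword_scores(n_wt, con_w)
print(f.round(3))
print(f_hu.round(3))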

5 Self-Regulation of Topics

We only keep word-topic distributions that are significantly different from each other. Redundant topics can be removed either during inference (as done in Algorithm 1) or after inference. To measure the difference between two word-topic distributions we use the symmetrized Kullback–Leibler divergence:

KL(t_i, t_j) := Σ_w p(w|t_i) · log( p(w|t_i) / p(w|t_j) )
SKL(t_i, t_j) := KL(t_i, t_j) + KL(t_j, t_i)    (12)

The set DT of indexes of (significantly) distinct word-topic distributions is such that for any two topics i, j ∈ DT ⊆ T, it holds that SKL(t_i, t_j) ≥ γ.
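As an illustration of the redundancy check (our sketch; the paper does not prescribe a particular order, it only requires that the kept topics are pairwise distinct), one can greedily keep a topic only if its symmetrized KL divergence to every previously kept topic is at least γ:

import numpy as np

def distinct_topics(p_w_given_t, gamma=0.25, eps=1e-12):
    """Equation 12: indexes DT of topics with pairwise SKL of at least gamma.

    p_w_given_t -- array (num_words, num_topics); column t holds p(w|t)
    """
    def skl(p, q):
        p, q = p + eps, q + eps
        return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

    kept = []
    for t in range(p_w_given_t.shape[1]):
        if all(skl(p_w_given_t[:, t], p_w_given_t[:, s]) >= gamma for s in kept):
            kept.append(t)
    return kept

In Algorithm 1, T is shrunk to this set DT after every iteration (line 16).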

Symbol        Meaning
D             corpus, all documents
d             document from D
|d|           number of words in d
W             set of unique words in D
w             word
w_i           i-th word in a document d
k             (maximal) number of topics
T             set (of indexes) of topics, T ⊆ [0, k − 1]
DT            set (of indexes) of distinct topics in T
t             topic t from T
t(w, i, d)    topic of word w, assuming the i-th word w_i = w in d
L             length of the sliding window to one side
α, β          topic and word prior
δ             weight for word concentration
n(w, t)       number of assignments of word w to topic t
n(w)          number of occurrences of word w in D

Table 1: Notation

Algorithm 1 TKM(docs D, nTopics k, Priors α, β)
 1: δ := 1.5; L := 7; p(t|d) := 1/k; T := [1, k]
 2: p(w|t) := 1/|W| + noise
 3: while p(w, t) "not converged" do
 4:   n(w, t) := 0
 5:   for d ∈ D do
 6:     for i = 0 to |d| − 1 do
 7:       R_i := [max(0, i − L), min(|d| − 1, i + L)]
 8:       t(w_i, i, d) := argmax_t {(f(w_i, t) + f(w_j, t)) · p(t|d) | j ∈ R_i}
 9:       n(w_i, t(w_i, i, d)) := n(w_i, t(w_i, i, d)) + 1
10:     end for
11:   end for
12:   p(t|w) := n(w, t) / Σ_{t'} n(w, t')
13:   H(w) := −Σ_t p(t|w) · log(p(t|w))
14:   con(w) := (log(min(|T|, n(w) + 1)) / (1 + H(w)))^δ
15:   f(w, t) := log(1 + n(w, t) + β) · con(w) / Σ_{w'} log(1 + n(w', t) + β) · con(w')
16:   T := DT   {keep only distinct topics (Section 5)}
17:   p(t|d) := (Σ_{i∈[0,|d|−1]} f(w_i, t))^α / Σ_{t'∈T} (Σ_{i∈[0,|d|−1]} f(w_i, t'))^α
18: end while
19: f_hu(w, t) := n(w, t) · con(w) / Σ_{w'} n(w', t) · con(w')
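The inner loop of Algorithm 1 (lines 4–11), i.e., the hard E-step with the sliding window and the count update, might look as follows in Python; this is a condensed illustration of ours, not the released implementation.

import numpy as np

def e_step(docs, f, p_t_given_d, L=7):
    """Assign each word occurrence to one topic (Equation 8) and count assignments.

    docs        -- list of documents, each a list of word indices
    f           -- keyword scores, array (num_words, num_topics)
    p_t_given_d -- array (num_docs, num_topics) with p(t|d)
    L           -- half-width of the sliding window
    """
    n_wt = np.zeros_like(f)
    for d, doc in enumerate(docs):
        scores = f[doc]                                   # (|d|, num_topics)
        for i, w in enumerate(doc):
            lo, hi = max(0, i - L), min(len(doc) - 1, i + L)
            # (f(w_i, t) + f(w_j, t)) * p(t|d), maximized jointly over j in R_i and t
            window = (scores[i] + scores[lo:hi + 1]) * p_t_given_d[d]
            t_best = np.unravel_index(np.argmax(window), window.shape)[1]
            n_wt[w, t_best] += 1.0                        # line 9 of Algorithm 1
    return n_wt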

6 Evaluation

We assessed the tendency to self-regulate the number of topics (Experiment 1), comparing with LDA and HDP [26]. We evaluated several metrics, such as classification accuracy, PMI score and computation time (Experiment 2), using LDA, BTM [28] and WNTM [30]. Finally, topics were assessed qualitatively for one dataset (Experiment 3).

Dataset          Docs    Unique Words  Avg. Words  Classes
BookReviews      179541  22211         33          8
WikiBig          52024   137155        346         11
Ohsumed          23166   23068         99          23
20Newsgroups     18625   37150         122         20
Reuters21578     9091    11098         69          65
CategoryReviews  5809    6596          60          6
WebKB4Uni        4022    7670          136         4
BrownCorpus      500     16514         1006        15

Table 2: Datasets

6.1 Algorithms, Datasets and Setup:

We compared an implementation of Algorithm 1 (available on GitHub: https://github.com/JohnTailor/tkm) and LDA using a collapsed Gibbs sampler [8] in Python, BTM made available by the authors as a C++ library, and the authors' WNTM Java implementation. For HDP we used the Python Gensim library. For all algorithms, we used the same convergence criterion, i.e., computation stopped once word-topic distributions no longer changed significantly. For all algorithms, we ran experiments with different parameters for α and β. We chose the best configuration focusing on classification accuracy, i.e., α = 5/k and β = 0.04 for LDA, α = 50/k and β = 0.02 for BTM, α = 50/k and β = 0.05 for WNTM and, finally, α = 2.5 and β = 0.05 for TKM. For HDP we just used the parameters as in [26]. To remove redundant topics (Section 5) we used γ := 0.25.

The datasets in Table 2 are public and most have already been used for text classification. For the distinct review datasets (from Amazon) we predicted either the product, i.e., the book, based on a review, or the product category. The Wiki benchmarks are based on Wikipedia categories. We performed standard preprocessing, e.g., stemming, stopword removal and removal of words that occurred only once in the entire corpus. All experiments ran on a server with a 64-bit Ubuntu system, 100 GB RAM and 24 AMD cores operating at 2.6 GHz.

6.2 Experiments:

Experiment 1: We empirically analyzed the convergence of the number of distinct topics |DT| (see Section 5) depending on the upper bound k on the number of topics for LDA, HDP and TKM.

Experiment 2: We compared various metrics for LDA, TKM, WNTM and BTM using k = 100. The classification accuracy was measured using a random forest with 100 trees, with 60% of all data for training and 40% for testing. The time to compute the word-topic and topic-document distributions on the training data, the time to compute the topic-document distribution on the test data, the number of distinct topics |DT| (Equation 12) and the PMI score as proposed in [21] were also compared.
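The classification protocol can be reproduced roughly as follows (a sketch assuming scikit-learn; the exact feature extraction and handling of held-out documents in the original experiments may differ): documents are represented by their topic distributions p(t|d) and classified by a random forest with 100 trees on a 60/40 split.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def classification_accuracy(doc_topic, labels, seed=0):
    """doc_topic -- array (num_docs, num_topics) of p(t|d); labels -- one class per document."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        doc_topic, labels, train_size=0.6, random_state=seed)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(x_tr, y_tr)
    return accuracy_score(y_te, clf.predict(x_te))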

PMI measures the co-occurrence of words within topics relative to their co-occurrence within documents (or sequences of words) of a large external corpus, i.e., we used an English Wikipedia dump with about 4 million documents as in [21]. For each pair of words w_i, w_j we calculated the fraction of documents in which both occur:

p(w_i, w_j) := |{d | w_i, w_j ∈ d, d ∈ D}| / |D|
p(w_i) := |{d | w_i ∈ d, d ∈ D}| / |D|    (13)
PMI(w_i, w_j) := log( p(w_i, w_j) / (p(w_i) · p(w_j)) )

The PMI score for a topic is the median of the PMI scores of all pairs of the ten highest-ranked words according to the word-topic distribution, i.e., for TKM the distribution is p_hu as in Equation 7. The PMI score is the mean of the PMI scores of all individual topics. PMI has been reported to correlate better with human judgment than other measures such as perplexity [21].
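A simplified sketch of the coherence computation (ours; in the experiments the document frequencies come from the external Wikipedia corpus, which is abstracted here as a list of word sets):

import itertools
import numpy as np

def pmi_score(topics_top10, ref_docs):
    """Median PMI over all pairs of a topic's ten top words, averaged over topics.

    topics_top10 -- list of topics, each a list of its 10 highest-ranked words
    ref_docs     -- reference corpus, each document given as a set of words
    """
    n = len(ref_docs)
    def p(*words):   # fraction of reference documents containing all given words
        return sum(all(w in d for w in words) for d in ref_docs) / n

    per_topic = []
    for words in topics_top10:
        pmis = [np.log(p(wi, wj) / (p(wi) * p(wj)))
                for wi, wj in itertools.combinations(words, 2)
                if p(wi, wj) > 0]
        per_topic.append(np.median(pmis))
    return float(np.mean(per_topic))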

We introduce a novel measure that captures the consistency of the assignments of words to topics within documents, as motivated at the end of Section 1. It is given by the probability that two consecutive words stem from different topics. As LDA computes a distribution across all topics, we choose the most likely topic (see Equation 8). The topic change probability ToC for a corpus D is defined as

ToC := ( Σ_{d∈D} Σ_i I[t(i, d) ≠ t(i+1, d)] ) / Σ_{d∈D} |d|    (14)

The indicator variable I is one if two neighboring words have different topics and zero otherwise.
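Equation 14 translates directly into code; the sketch below (illustrative) expects one list of hard topic assignments t(i, d) per document.

def topic_change_probability(assignments):
    """Equation 14: fraction of consecutive word pairs whose topics differ.

    assignments -- list of documents, each a list of topic indices t(i, d)
    """
    changes = sum(sum(a != b for a, b in zip(doc, doc[1:])) for doc in assignments)
    total = sum(len(doc) for doc in assignments)
    return changes / total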
Experiment 3: The topics using k = 20 discovered by LDA and TKM for the 20 Newsgroups dataset were assessed qualitatively.

[Figure 1 omitted: number of distinct topics |DT| versus the upper bound k for TKM, LDA and HDP.]
Figure 1: Distinct topics given an upper bound.

6.3 Results:

Experiment 1: As shown in Figure 1, the number of (significantly) distinct topics |DT| (see Section 5) of TKM converges when the maximal number of topics k increases, whereas for LDA and HDP the number of extracted topics depends (much less) on the given upper bound.² An apparent disadvantage of HDP is that the variance in reduction seems more extreme than for TKM. HDP with κ = 0.6 yields only about 30 topics for some datasets and 500 for others, even though the datasets vary much less in terms of distinct words, documents and human-defined number of topics. The fact that topics are typically investigated manually and humans often assign fewer than 100 categories for each dataset (Table 2) suggests that discovering more than a few hundred topics seems less suitable. HDP's convergence seems to depend more on the number of documents; that of TKM more on the number of unique words. The number of topics |DT| weakly depends on α – more so for TKM. LDA showed no significant changes, even when lowering α to 1/500 of the recommended value of 50/k [8]. Limiting the number of discovered topics, i.e., not returning very similar topics, is an intrinsic property of TKM. A word influences the topic of nearby words directly and those of more distant words indirectly. Words with high keyword scores for a topic tend to pull other words to the same topic.

² Figure 1 was created using small offsets on the x-axis for the sake of clarity. For visibility we only show four datasets in the figure.

Experiment 2: Overall, TKM outperforms the other models on all metrics, as shown in Table 3. We found that parameter tuning, in particular of the word prior β, is essential for all methods to achieve good classification results. For classification, TKM dominates or matches LDA and WNTM except for the Ohsumed dataset. The accuracy for Ohsumed is low for all techniques. It seems that there are too many terms that are shared across different categories, which seems to impact LDA the least. WNTM and LDA gave similar results, which is in line with the authors' own report [30]. For WNTM one might expect fewer topic changes, since it artificially creates documents based on word contexts. However, the grouping of contexts does not seem sufficient to overcome the tendency of the topic inference mechanism, i.e., LDA, to group frequent words together. Overall, BTM seems somewhat better than LDA and WNTM but worse than TKM for classification, except for the BookReviews dataset, which is characterized by relatively short documents that seem to be better handled by BTM, which was designed for short texts. All methods can be made to do better on specific datasets, but at the price of classifying worse on others.

PMI assesses co-occurrence, and TKM shows the best results. TKM's word-topic distribution p_hu (Equation 7) is based on weighing words by frequency and concentration. Thus, words that are frequently assigned to one topic but also to others might still get a low score. In turn, the other models focus on frequency only. It is intuitive that words that are frequent within a few topics only co-occur with higher likelihood and, thus, have a larger PMI score than words that are frequent in many topics.

The fact that TKM tends to learn less complex models, i.e., discovers fewer topics, has been investigated in depth in Experiment 1. Less complex models are preferable when the models perform well. Indeed, TKM achieves higher classification accuracy with fewer topics. The topic change probability (Equation 14) is significantly lower for TKM, indicating more consistent assignments of words to topics.

Dataset         | Class. Accuracy     | PMI                 | Distinct Topics     | Topic Change Prob.  | Training Time [min] | Inference Time [min]
                | TKM  BTM  LDA  WNTM | TKM  BTM  LDA  WNTM | TKM  BTM  LDA  WNTM | TKM  BTM  LDA  WNTM | TKM  BTM  LDA  WNTM | TKM  BTM  LDA  WNTM
20Newsgroups    | .79  .78  .76  .76  | 1.6  1.1  1.3  1.5  |  99   99  100  100  | .1   .7   .6   .7   | 3.7   66   10   22  | .6   2.9  2.0  3
Reuters21578    | .9   .87  .83  .86  | 1.4  1.3  1.3  1.3  |  95  100  100  100  | .1   .6   .6   .7   | 1.1   17  2.3    5  | .3   .9   .6   .9
WebKB4Uni       | .81  .81  .81  .79  | 1.4  1.3  1.3  1.4  |  93  100  100  100  | .3   .8   .7   .8   |  .8   16  1.9   16  | .1   .6   .3   .6
WikiBig         | .95  .93  .93  .93  | 2.2  1.3  1.5  1.3  | 100  100  100  100  | .1   .6   .5   .6   |  20  516   71  120  | 2.0  19   14   19
BrownCorpus     | .45  .42  .3   .42  | 1.4  1.2  1.4  1.3  |  92  100  100  100  | .4   .9   .8   .9   | 1.0   15  2.1   28  | .0   .6   .2   .7
Ohsumed         | .24  .23  .36  .33  | 2.3  2.1  2.4  2.3  |  99  100  100  100  | .1   .7   .7   .7   | 2.9   66  9.1   51  | .7   3.0  2.0  2.8
CategoryReviews | .9   .88  .87  .85  | 1.5  1.4  1.4  1.5  |  90  100  100  100  | .2   .8   .8   .8   |  .6  9.6  1.3   11  | .1   .4   .3   .4
BookReviews     | .78  .79  .75  .78  | 1.4  1.3  1.3  1.4  |  73  100  100  100  | .2   .7   .7   .8   | 8.1  146   25   81  | 5.0  11   14   10

Table 3: Comparison of TKM, BTM, LDA and WNTM. 'Bold' is better at a 99.5% significance level.

TKM tends to assign the same topic to a sequence of words if one word in the sequence has a high keyword score for that topic. This is often the case, i.e., TKM switches topics only every 5 to 8 words. LDA and BTM switch topics after almost every other word. LDA, BTM and WNTM have a parameter that influences the number of topics per document. However, as expected for a bag-of-words model, even if a document covers few topics, the change probability tends to be high. Tuning parameters has little impact, i.e., we used α = 5/k to reduce the number of topics (compared to the recommended value of 50/k [8]). We found that enforcing fewer topics per document (lower α) results in worse classification results, but still does not reduce the number of topic changes by much.

The computational performance is clearly best for TKM. The speed-up ranges from a minimum factor of 2–3 over all methods for both training and inference up to a factor of 15–30 over BTM and WNTM for training. During training in our baseline implementation, the complexity per iteration is O(L · k · Σ_{d∈D} |d|). For LDA [8] it is only O(k · Σ_{d∈D} |d|). The stopping criterion is the same for both algorithms. TKM requires fewer iterations and its computations are simple, e.g., TKM does not sample from distributions as LDA does. Inference time of document-topic distributions is faster for TKM, as we only process a document once, whereas LDA potentially requires multiple iterations. For BTM a matrix containing all biterms is needed. This alone makes computation fairly expensive.

Experiment 3: Table 5 shows the highest ranked words per topic of LDA and TKM, i.e., using p_hu(w|t). Using cosine similarity, topics of both methods were matched in Table 5. The categories of the 20 Newsgroups dataset are shown in Table 4. We noted considerable variation of topics for LDA and TKM across executions of the algorithms but overall found qualitatively comparable results. Both methods tend to find some expected topics as suggested in Table 4 but miss or mix others. Overall, LDA tends to give high probabilities to general words more often, e.g., 'need', 'problem', 'go', 'com', whereas TKM prefers rather specific words, e.g., 'gif', 'duke', 'Laurentian', 'BMW', 'HIV'. TKM tends to mix topics using indicative words of different topics. LDA tends to find topics that are uninformative owing to the generality of the words. The first 14 topics in Table 5 are rather similar for both methods, except Topic 12 for TKM, which makes limited sense. Topic 3 of LDA mixes autos and motorcycles. For LDA, topics 15 to 20 seem to make limited sense, as most words are common and non-characteristic of any topic. For TKM, topics 15 to 20 seem to be slightly better, i.e., they tend to be more of a mix of topics with one topic often dominating. For example, topic 16 has elements of religion and sexuality, topic 18 of hockey, topic 19 of atheism and windows.x, and topic 20 of autos.

7 Related Work

Hofmann [12] introduced probabilistic latent semantic analysis (PLSA) as an improvement over latent semantic analysis. Its generalization LDA [5] adds priors with hyperparameters to sample from distributions of topics and words. LDA has been extended and varied in many ways, e.g., [16, 20, 7, 4, 28, 29, 24]. Whereas our model has little in common with PLSA and LDA except its rooting in the aspect model, extensions and modifications of LDA and PLSA mostly did not touch upon key generative aspects such as how document-topic distributions are determined. Most extensions, with the exception of [9, 1, 22, 10, 25, 27, 28], rely on the bag-of-words assumption. In contrast, we assume that a word influences the topics of nearby words. In [9] each sentence is assigned to one topic using a Markov chain to model topic-to-topic transitions after sentences. Multiple works have used bigrams for latent topic models, e.g., [1, 22, 10, 25]. [1, 22, 10] use a bigram model to improve PLSA for speech recognition. For a bigram (w_i, w_j), [1, 22] both multiply probabilities containing the conditionals w_i|w_j. [10] models p(t|w_i, w_j) ∝ p(w_i|t) · p(w_j|t). They use distanced n-grams, i.e., for a fixed distance d between two words they estimate a probability distribution p(w_i|t) using all word pairs at distance d from the corpus.

comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, misc.forsale, talk.politics.misc, talk.politics.guns, talk.politics.mideast, talk.religion.misc, alt.atheism, soc.religion.christian

Table 4: The 20 Newsgroups categories

#  | TKM topic (top words)                                       | LDA topic (top words)                                      | Sim
1  | ax giz tax myer presid think chz go pl ms                   | ax giz chz gk pl fij uy fyn ah ei                          | 1.0
2  | armenian turkish armenia turk turkei azerbaijan azeri       | armenian turkish muslim turkei turk armenia peopl war      | 0.78
3  | com bike motorcycl dod bmw ride ca write biker articl       | car com bike write articl edu engin ride drive new         | 0.65
4  | chip clipper encryption wire privaci nsa kei escrow         | kei us com chip clipper secur encryption would system      | 0.64
5  | game stephanopoulo playoff pitch espn score pitcher pt      | game team plai player year edu ca win hockei season        | 0.64
6  | monitor simm mhz mac card centri duo us edu modem           | card drive us edu mac driver system work disk problem      | 0.64
7  | gun firearm atf weapon stratu fire handgun bd amend edu     | gun com edu would fire write articl peopl koresh fbi       | 0.64
8  | church christ scriptur bibl koresh faith sin god cathol     | god christian jesu bibl church us christ sai sin love      | 0.64
9  | medic cancer drug patient doctor hiv health newslett us     | medic studi diseas patient effect drug doctor food         | 0.63
10 | edu israel isra arab palestinian articl write adl uci right | israel state isra peopl arab edu war write jew             | 0.59
11 | space drive scsi orbit shuttl disk nasa id mission hst      | space nasa gov launch orbit earth would mission moon edu   | 0.55
12 | peopl us would batteri like jew right know make year        | presid would state us mr think go year monei tax           | 0.49
13 | imag jpeg gif format file graphic xv color pixel viewer     | imag us window graphic avail system server softwar data    | 0.47
14 | window mous font server microsoft driver client xterm us    | file us window program imag entri format jpeg need         | 0.44
15 | oil kuwait ac write rushdi edu islam uk entri contest       | edu write articl com know uiuc cc would anyone cs          | 0.33
16 | homosexu rutger cramer christian sexual gai msg food edu    | edu peopl sai write would think com articl moral us        | 0.33
17 | insur kei detector radar duke de phone system edu ripem     | book post list mail new edu inform send address email      | 0.24
18 | team player jesu plai water roger edu laurentian hockei     | us would like write edu articl good com look get           | 0.23
19 | moral atheist god widget atheism openwindow belief exist    | go like get would time know us sai peopl think             | 0.14
20 | car brake candida yeast engin militia vitamin steer us      | com edu need power work help ca mous anyone                | 0.12

Table 5: Topics by TKM and LDA for the 20Newsgroups dataset for k = 20

N-gram statistics and latent topic variables have been combined in [25] and later work, e.g., [27, 28]. A key underlying modeling assumption of [25] is inferring the probability of one word given its predecessor using smoothed bigram estimators. The sparsity of short texts was the motivation in [28] to use biterms, yielding the BTM model. In the BTM model the probability of a biterm equals p(w_i, w_j) = Σ_t p(t) · p(w_i|t) · p(w_j|t). In [28] an increase of the time complexity by about a factor of 3 is reported, together with improvements otherwise. We also consider all biterms within a window and therefore also compare against [28]. Aside from that, there are few similarities.

Relatively little work has been conducted on limiting the number of topics. Hierarchical topic models [24, 19, 26] do not require the specification of the number of topics but come with increased complexity. The hierarchical extension of LDA (HDP) [26], for example, has an upper bound on the number of topics and several parameters that control the number of discovered topics. For TKM, no additional parameters control the number of topics, though the parameter settings influence the number of topics, in particular the upper bound on the number of topics.

The word network topic model (WNTM) [30] is also made for short texts. For each word it creates a document by combining the contexts of the word in the original corpus. Then it uses LDA for topic inference on the created documents, and a further step is needed to get topic distributions of the original corpus.

Keyword extraction often relies on co-occurrence data, POS tagging [23] and external sources or TF-IDF [11]. Typically, these methods first extract keyphrases and then rank them. For example, [23] first splits words into phrases using sentence delimiters and stop words. They compute keywords using the ratio of the frequency of a word and its degree, i.e., the number of words that occur within a specific distance of any of its occurrences within a phrase. We do not use any of the typical preprocessing, e.g., POS tagging or phrase extraction, though this might be beneficial. We also tested the metric of [23] and found that it gave overall slightly worse results. Topic modeling and keyword extraction are related. For example, [17] extracts keywords using topic-word distributions obtained from LDA. Keyword extraction and clustering are also related: [18] uses clustering as a preprocessing step to obtain keyword candidates stemming from the medoids of these clusters. [18, 17] stick to the concept that keywords are extracted based on a preprocessing phase. In contrast, we perform a dynamic, iterative approach, in which words are assigned to topics and the distribution of word assignments across topics determines the keyword score. [14] uses term weighting to enhance topic modeling, i.e., LDA. Their keyword score corresponds to the variance of the word-topic distribution.

They do not state a thorough derivation of their Gibbs sampler, i.e., an explicit integration of Equation (4) in [14] to obtain their Equations (6) and (7). Their classification performance reported on the 20 Newsgroups dataset for their best algorithm is significantly lower than the performance we observed for LDA (as well as that of our algorithms). Word embeddings and topic modeling have also been combined [7, 15]. Our model might also benefit from word embeddings.

8 Conclusions

We presented a novel topic model based on keywords. Both the model and its inference are simple and performant. Our experiments report improvements over existing techniques across a number of metrics, suggesting that the proposed technique aligns more closely with human intuition about topics and keywords.

References

[1] M. Bahrani and H. Sameti. A new bigram-PLSA language model for speech recognition. EURASIP Journal on Applied Signal Processing, 2010, 2010.
[2] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[4] D. M. Blei, T. L. Griffiths, and M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM (JACM), 57(2):7, 2010.
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
[6] J. Chang, J. L. Boyd-Graber, S. Gerrish, C. Wang, and D. M. Blei. Reading tea leaves: How humans interpret topic models. In Conf. on Neural Information Processing Systems (NIPS), pages 1–9, 2009.
[7] R. Das, M. Zaheer, and C. Dyer. Gaussian LDA for topic models with word embeddings. In Proc. of the Assoc. for Computational Linguistics (ACL), 2015.
[8] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.
[9] A. Gruber, Y. Weiss, and M. Rosen-Zvi. Hidden topic Markov models. In AISTATS, pages 163–170, 2007.
[10] M. A. Haidar and D. O'Shaughnessy. PLSA enhanced with a long-distance bigram language model for speech recognition. In European Signal Processing Conference (EUSIPCO), pages 1–5, 2013.
[11] K. S. Hasan and V. Ng. Automatic keyphrase extraction: A survey of the state of the art. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics, pages 1262–1273, 2014.
[12] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of Research and Development in Information Retrieval (SIGIR), pages 50–57, 1999.
[13] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177–196, 2001.
[14] S. Lee, J. Kim, and S.-H. Myaeng. An extension of topic models for text classification: A term weighting approach. In Conf. on Big Data and Smart Computing (BigComp), pages 217–224, 2015.
[15] S. Li, T.-S. Chua, J. Zhu, and C. Miao. Generative topic embedding: A continuous representation of documents. In Proc. of the Assoc. for Computational Linguistics (ACL), 2016.
[16] W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proc. of the Int. Conference on Machine Learning, pages 577–584, 2006.
[17] Z. Liu, W. Huang, Y. Zheng, and M. Sun. Automatic keyphrase extraction via topic decomposition. In Conference on Empirical Methods in Natural Language Processing, pages 366–376, 2010.
[18] Z. Liu, P. Li, Y. Zheng, and M. Sun. Clustering to find exemplar terms for keyphrase extraction. In Conference on Empirical Methods in Natural Language Processing, pages 257–266, 2009.
[19] D. Mimno, W. Li, and A. McCallum. Mixtures of hierarchical topics with Pachinko allocation. In Proc. of the Int. Conference on Machine Learning, pages 633–640, 2007.
[20] D. Newman, E. V. Bonilla, and W. Buntine. Improving topic coherence with regularized topic models. In Advances in Neural Information Processing Systems, pages 496–504, 2011.
[21] D. Newman, J. H. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In Conf. of the North American Chapter of the Assoc. for Computational Linguistics, pages 100–108, 2010.
[22] J. Nie, R. Li, D. Luo, and X. Wu. Refine bigram PLSA model by assigning latent topics unevenly. In Workshop on Automatic Speech Recognition and Understanding, pages 141–146, 2007.
[23] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, pages 1–20, 2010.
[24] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, pages 1385–1392, 2005.
[25] H. M. Wallach. Topic modeling: Beyond bag-of-words. In Proc. of the Int. Conference on Machine Learning, pages 977–984, 2006.
[26] C. Wang, J. Paisley, and D. Blei. Online variational inference for the hierarchical Dirichlet process. In Proc. of the Int. Conference on Artificial Intelligence and Statistics, pages 752–760, 2011.
[27] X. Wang, A. McCallum, and X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proc. of the Int. Conference on Data Mining (ICDM), pages 697–702, 2007.
[28] X. Yan, J. Guo, Y. Lan, and X. Cheng. A biterm topic model for short texts. In Proc. of the International Conference on World Wide Web, pages 1445–1456, 2013.
[29] G. Yang, D. Wen, N.-S. Chen, E. Sutinen, et al. A novel contextual topic model for multi-document summarization. Expert Systems with Applications, 42(3):1340–1352, 2015.
[30] Y. Zuo, J. Zhao, and K. Xu. Word network topic model: A simple but general solution for short and imbalanced texts. Knowledge and Information Systems, 48(2):379–398, 2016.
