An N-gram Topic Model for Time-Stamped Documents

Shoaib Jameel and Wai Lam

The Chinese University of Hong Kong

Outline

Introduction and Motivation

  - The Bag-of-Words (BoW) assumption
  - Temporal nature of data

Related Work
  - Temporal Topic Models
  - N-gram Topic Models

Overview of our model
  - Background
    - Topics Over Time (TOT) Model (proposed earlier)
    - Our proposed n-gram model

Empirical Evaluation

Conclusions and Future Directions

The ‘popular’ Bag-of-Words Assumption

Many works in the topic modeling literature assume exchangeability among the words. As a result, they generate ambiguous words in topics. For example, consider a few topics obtained from the NIPS collection using the Latent Dirichlet Allocation (LDA) model:

Example

Topic 1        Topic 2    Topic 3        Topic 4    Topic 5
architecture   order      connectionist  potential  prior
recurrent      first      role           membrane   bayesian
network        second     binding        current    data
module         analysis   structures     synaptic   evidence
modules        small      distributed    dendritic  experts

The problem with the LDA model: words in topics are not insightful.

The problem with the bag-of-words assumption

1. The logical structure of the document is lost. For example, we do not know whether "the cat saw a dog or a dog saw a cat".
2. Computational models cannot exploit the extra word-order information inherent in the text, which affects performance.
3. The usefulness of maintaining word order has also been illustrated in Information Retrieval, Computational Linguistics, and many other fields.

Why capture topics over time?

1. We know that data evolves over time.
2. What people talk about today may not be what they talk about tomorrow or a year from now.

[Figure: trending topics across Year-2010, Year-2011, and Year-2012, e.g., Burj Khalifa, Wikipedia, Gaza Strip, Volcano, N.Z. Earthquake, Sachin Tendulkar, Manila Hostage, Osama bin Laden, China, Iraq War, Higgs Boson, Apple Inc.]

3. Models such as LDA do not capture such temporal characteristics in the data.

Related Work: Temporal Topic Models

Models with a discrete-time assumption

- Blei et al. (David M. Blei and John D. Lafferty, 2006) - Dynamic Topic Models - assume that the topics in one year depend on the topics of the previous year.
- Knights et al. (Knights, D., Mozer, M., and Nicolov, N., 2009) - Compound Topic Model - train a topic model on the most recent K months of data.

The problem here: one needs to select an appropriate time-slice value manually. Which time slice should be chosen: day, month, year, etc.?


Continuous Time Topic Models

- Kawamae (Noriaki Kawamae, 2011) - Trend Analysis Model - has a probability distribution over temporal words and topics, and a continuous distribution over time.
- Nodelman et al. (Uri Nodelman, Christian R. Shelton, and Daphne Koller, 2002) - Continuous Time Bayesian Networks - build a graph where each variable is a node whose value changes over time.

The problem with the above models: all assume exchangeability and thus lose important collocation information inherent in the documents.

Related Work: N-gram Topic Models

1. Wallach's (Hanna M. Wallach, 2006) Bigram Topic Model: maintains word order during the topic generation process, but generates only bigrams in topics.
2. Griffiths et al. (Griffiths, T. L., Steyvers, M., and Tenenbaum, J. B., 2007) - LDA Collocation Model: introduced binary random variables which decide whether to generate a unigram or a bigram.
3. Wang et al. (Wang, X., McCallum, A., and Wei, X., 2007) - Topical N-gram Model: extends the LDA Collocation Model and gives a topic assignment to every word in a phrase.

The problem with the above models: they cannot capture the temporal dynamics in the data.


Background: Topics Over Time (TOT) (Wang et al., 2006)

1. Our model extends this model.
2. TOT assumes word and topic exchangeability.

Generative Process: Topics Over Time Model (TOT)

1. Draw T multinomials φ_z from a Dirichlet prior β, one for each topic z.
2. For each document d, draw a multinomial θ^(d) from a Dirichlet prior α; then for each word w_i in document d:
   1. Draw a topic z_i^(d) from Multinomial(θ^(d)).
   2. Draw a word w_i^(d) from Multinomial(φ_{z_i^(d)}).
   3. Draw a time-stamp t_i^(d) from Beta(Ω_{z_i^(d)}).
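To make this generative story concrete, here is a minimal, illustrative sketch in Python; the corpus sizes, vocabulary size, and hyperparameter values are assumptions made up for the example, not settings from the paper.

```python
# Minimal sketch of the TOT generative story (illustrative only; sizes and
# hyperparameter values below are made-up assumptions).
import numpy as np

rng = np.random.default_rng(0)

T, W, D, N_d = 3, 8, 5, 20          # topics, vocabulary size, documents, words per document
alpha, beta = 0.5, 0.1              # symmetric Dirichlet hyperparameters
Omega = np.array([[2.0, 5.0],       # Beta(Omega_z1, Omega_z2) per topic: an "early" topic,
                  [5.0, 5.0],       # a mid-period topic,
                  [5.0, 2.0]])      # and a "late" topic

phi = rng.dirichlet(np.full(W, beta), size=T)        # step 1: word distribution per topic

corpus = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))         # step 2: topic mixture per document
    doc = []
    for _ in range(N_d):
        z = rng.choice(T, p=theta)                   # 2.1 draw a topic
        w = rng.choice(W, p=phi[z])                  # 2.2 draw a word from that topic
        t = rng.beta(Omega[z, 0], Omega[z, 1])       # 2.3 draw a normalized time-stamp
        doc.append((w, z, t))
    corpus.append(doc)
```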

[Fig. 1: Graphical model of TOT]

Topics Over Time Model (TOT)

1. The model assumes a continuous distribution over time associated with each topic.
2. Topics are responsible for generating both the observed time-stamps and the words.
3. The model does not capture the sequence of state changes with a Markov assumption.

Topics Over Time Model (TOT): Posterior Inference

1. In Gibbs sampling, compute the conditional:

$$P\left(z_i^{(d)} \mid \mathbf{w}, \mathbf{t}, \mathbf{z}_{\neg i}^{(d)}, \alpha, \beta, \Omega\right) \quad (1)$$

2. We can thus write the updating equation as:

$$P\left(z_i^{(d)} \mid \mathbf{w}, \mathbf{t}, \mathbf{z}_{\neg i}^{(d)}, \alpha, \beta, \Omega\right) \propto \left(m_{d z_i^{(d)}} + \alpha_{z_i^{(d)}} - 1\right) \times \frac{n_{z_i^{(d)} w_i^{(d)}} + \beta_{w_i^{(d)}} - 1}{\sum_{v=1}^{W}\left(n_{z_i^{(d)} v} + \beta_v\right) - 1} \times \frac{\left(t_i^{(d)}\right)^{\Omega_{z_i^{(d)} 1} - 1}\left(1 - t_i^{(d)}\right)^{\Omega_{z_i^{(d)} 2} - 1}}{B\!\left(\Omega_{z_i^{(d)} 1}, \Omega_{z_i^{(d)} 2}\right)} \quad (2)$$
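As a sketch of how this update could be evaluated for a single token, assume count matrices `n_zw` (topic-word) and `m_dz` (document-topic) from which the current token has already been decremented; the names and structure here are illustrative, not the authors' implementation.

```python
# Illustrative single-token collapsed Gibbs step for TOT (Equation 2).
import numpy as np
from scipy.stats import beta as beta_dist

def sample_topic(d, w, t, n_zw, m_dz, alpha, beta_hp, Omega, rng):
    T, W = n_zw.shape
    # document-topic term: m_dz + alpha (counts already exclude the current token)
    doc_term = m_dz[d] + alpha
    # topic-word term: (n_zw + beta) / (sum_v n_zv + W*beta)
    word_term = (n_zw[:, w] + beta_hp) / (n_zw.sum(axis=1) + W * beta_hp)
    # time-stamp term: Beta density of t under each topic's (Omega_z1, Omega_z2)
    time_term = beta_dist.pdf(t, Omega[:, 0], Omega[:, 1])
    p = doc_term * word_term * time_term
    return rng.choice(T, p=p / p.sum())
```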

Our Model: N-gram Topics Over Time Model

1. The model assumes a continuous distribution over time associated with each topic.
2. Topics are responsible for generating both the observed time-stamps and the words.
3. The model does not capture the sequence of state changes with a Markov assumption.
4. It maintains the order of words during the topic generation process.
5. It generates words as unigrams, bigrams, etc. in topics.
6. It results in more interpretable topics.

Graphical Model: N-gram Topics Over Time Model

[Fig. 1: Our model. Graphical model showing, for each document, the per-word topic z_i, bigram indicator x_i, word w_i, and time-stamp t_i, together with the parameters θ, φ, ψ, σ, Ω and hyperparameters α, β, γ, δ.]

Generative Process: N-gram Topics Over Time Model

Draw Discrete(φ_z) from Dirichlet(β) for each topic z;
Draw Bernoulli(ψ_zw) from Beta(γ) for each topic z and each word w;
Draw Discrete(σ_zw) from Dirichlet(δ) for each topic z and each word w;
For every document d, draw Discrete(θ^(d)) from Dirichlet(α);
foreach word w_i^(d) in document d do
    Draw x_i^(d) from Bernoulli(ψ_{z_{i-1}^(d) w_{i-1}^(d)});
    Draw z_i^(d) from Discrete(θ^(d));
    Draw w_i^(d) from Discrete(σ_{z_i^(d) w_{i-1}^(d)}) if x_i^(d) = 1;
    otherwise, draw w_i^(d) from Discrete(φ_{z_i^(d)});
    Draw a time-stamp t_i^(d) from Beta(Ω_{z_i^(d)});
end
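A minimal, illustrative sketch of this generative story follows; the sizes and hyperparameter values are assumptions for the example, not the paper's settings.

```python
# Minimal sketch of the proposed n-gram Topics Over Time generative story.
import numpy as np

rng = np.random.default_rng(1)
T, W, D, N_d = 3, 10, 4, 25
alpha, beta, gamma, delta = 0.5, 0.1, (1.0, 1.0), 0.1
Omega = rng.uniform(1.0, 6.0, size=(T, 2))                   # Beta time parameters per topic

phi   = rng.dirichlet(np.full(W, beta), size=T)              # unigram dist. per topic
psi   = rng.beta(gamma[0], gamma[1], size=(T, W))            # P(bigram status | prev topic, prev word)
sigma = rng.dirichlet(np.full(W, delta), size=(T, W))        # bigram dist. per (topic, prev word)

corpus = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))                 # topic mixture per document
    doc, z_prev, w_prev = [], None, None
    for i in range(N_d):
        # x = 1 means the word forms an n-gram with the previous word;
        # the first word of a document is treated as a unigram here (an assumption).
        x = 0 if i == 0 else rng.binomial(1, psi[z_prev, w_prev])
        z = rng.choice(T, p=theta)
        w = rng.choice(W, p=sigma[z, w_prev] if x == 1 else phi[z])
        t = rng.beta(Omega[z, 0], Omega[z, 1])               # normalized time-stamp
        doc.append((w, z, x, t))
        z_prev, w_prev = z, w
    corpus.append(doc)
```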

Posterior Inference: Collapsed Gibbs Sampling

$$P\left(z_i^{(d)}, x_i^{(d)} \mid \mathbf{w}, \mathbf{t}, \mathbf{x}_{\neg i}^{(d)}, \mathbf{z}_{\neg i}^{(d)}, \alpha, \beta, \gamma, \delta, \Omega\right) \propto \left(\gamma_{x_i^{(d)}} + p_{z_{i-1}^{(d)} w_{i-1}^{(d)} x_i^{(d)}} - 1\right) \times \left(\alpha_{z_i^{(d)}} + q_{d z_i^{(d)}} - 1\right)$$

$$\times \begin{cases} \dfrac{\beta_{w_i^{(d)}} + n_{z_i^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W}\left(\beta_v + n_{z_i^{(d)} v}\right) - 1} & \text{if } x_i^{(d)} = 0 \\[2ex] \dfrac{\delta_{w_i^{(d)}} + m_{z_i^{(d)} w_{i-1}^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W}\left(\delta_v + m_{z_i^{(d)} w_{i-1}^{(d)} v}\right) - 1} & \text{if } x_i^{(d)} = 1 \end{cases} \times \frac{\left(t_i^{(d)}\right)^{\Omega_{z_i^{(d)} 1} - 1}\left(1 - t_i^{(d)}\right)^{\Omega_{z_i^{(d)} 2} - 1}}{B\!\left(\Omega_{z_i^{(d)} 1}, \Omega_{z_i^{(d)} 2}\right)} \quad (3)$$

Posterior Estimates

$$\hat{\theta}_z^{(d)} = \frac{\alpha_z + q_{dz}}{\sum_{t=1}^{T}\left(\alpha_t + q_{dt}\right)} \quad (4) \qquad \hat{\phi}_{zw} = \frac{\beta_w + n_{zw}}{\sum_{v=1}^{W}\left(\beta_v + n_{zv}\right)} \quad (5) \qquad \hat{\psi}_{zwk} = \frac{\gamma_k + p_{zwk}}{\sum_{k=0}^{1}\left(\gamma_k + p_{zwk}\right)} \quad (6)$$

$$\hat{\sigma}_{zwv} = \frac{\delta_v + m_{zwv}}{\sum_{v=1}^{W}\left(\delta_v + m_{zwv}\right)} \quad (7) \qquad \hat{\Omega}_{z1} = \bar{t}_z\left(\frac{\bar{t}_z\left(1 - \bar{t}_z\right)}{s_z^2} - 1\right) \quad (8) \qquad \hat{\Omega}_{z2} = \left(1 - \bar{t}_z\right)\left(\frac{\bar{t}_z\left(1 - \bar{t}_z\right)}{s_z^2} - 1\right) \quad (9)$$

where $\bar{t}_z$ and $s_z^2$ denote the sample mean and variance of the time-stamps currently assigned to topic z.
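For instance, the method-of-moments updates in Equations 8 and 9 amount to the following sketch, where `timestamps` is assumed to hold the normalized time-stamps currently assigned to topic z.

```python
# Method-of-moments estimate of a topic's Beta parameters (Equations 8 and 9).
import numpy as np

def beta_method_of_moments(timestamps):
    t_bar = timestamps.mean()                  # sample mean \bar{t}_z
    s2 = timestamps.var() + 1e-12              # sample variance s_z^2 (tiny constant guards against zero)
    common = t_bar * (1.0 - t_bar) / s2 - 1.0
    omega_z1 = t_bar * common                  # Equation 8
    omega_z2 = (1.0 - t_bar) * common          # Equation 9
    return omega_z1, omega_z2
```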

Inference Algorithm

Input: γ, δ, α, T, β, Corpus, MaxIteration
Output: Topic assignments for all the n-gram words, with temporal information

Initialization: randomly initialize the n-gram topic assignments for all words;
Zero all count variables;
for iteration ← 1 to MaxIteration do
    for d ← 1 to D do
        for w ← 1 to N_d, following the word order, do
            Draw z_w^(d), x_w^(d) as defined in Equation 3;
            if x_w^(d) = 0 then
                Update n_zw;
            else
                Update m_zw;
            end
            Update q_dz, p_zw;
        end
    end
    for z ← 1 to T do
        Update Ω_z by the method of moments as in Equations 8 and 9;
    end
end
Compute the posterior estimates of θ, φ, ψ, σ as defined in Equations 4, 5, 6, 7;

Empirical Evaluation: Data Sets

We have conducted experiments on two datasets:
1. U.S. Presidential State-of-the-Union speeches[1] from 1790 to 2002.
2. NIPS conference papers: the original raw NIPS dataset[2] contains 17 years of conference papers; we supplemented it with additional raw NIPS documents[3], giving 19 years of papers in total.

Preprocessing

1. Removed stopwords.
2. Did not perform word stemming.

[1] http://infomotions.com/etexts/gutenberg/dirs/etext04/suall11.txt
[2] http://www.cs.nyu.edu/~roweis/data.html
[3] http://ai.stanford.edu/~gal/Data/NIPS/

Qualitative Results

[Figure: Histograms of the "Mexican War" topic over time (1800-2000), Our Model (left) vs. TOT (right).]

Our Model (top phrases): 1. east bank, 2. american coins, 3. mexican flag, 4. separate independent, 5. american commonwealth, 6. mexican population, 7. texan troops, 8. military, 9. general herrera, 10. foreign coin, 11. military usurper, 12. mexican treasury, 13. invaded texas, 14. veteran troops

TOT (top words): 1. mexico, 2. texas, 3. war, 4. mexican, 5. united, 6. country, 7. government, 8. territory, 9. army, 10. peace, 11. act, 12. policy, 13. foreign, 14. citizens

Qualitative Results: Topic changes over time

[Figure: Histograms of the "Panama Canal" topic over time (1800-2000), Our Model (left) vs. TOT (right).]

Our Model (top phrases): 1. panama canal, 2. isthmian canal, 3. isthmus panama, 4. republic panama, 5. united states government, 6. united states, 7. state panama, 8. united states senate, 9. french canal company, 10. caribbean sea, 11. panama canal bonds, 12. panama, 13. american control, 14. canal

TOT (top words): 1. government, 2. cuba, 3. islands, 4. international, 5. powers, 6. gold, 7. action, 8. spanish, 9. island, 10. act, 11. commission, 12. officers, 13. spain, 14. rico

Qualitative Results: Topic changes over time - TOT

NIPS-1987: cells, cell, model, response, firing, algorithms, layer, neurons, parameters, figure
NIPS-1988: network, learning, input, units, training, activity, number, hidden, node, networks
NIPS-1995: data, model, algorithm, method, algorithm, models, input, distribution, stimulus, set
NIPS-1996: function, data, set, distribution, output, models, problem, probability, information, networks
NIPS-2004: algorithm, state, learning, time, probability, step, neural, kernel, policy, sequence
NIPS-2005: learning, data, set, training, model, test, action, weights, classification, class

Figure: Top ten probable words from the posterior inference in NIPS, year-wise (TOT).


Qualitative Results: Topic changes over time - Our Model

NIPS-1987: orientation map, firing threshold, time delay, neural state, low conduction safety, correlogram peak, centric models, long channel, synaptic chip, frog sciatic nerve
NIPS-1988: neural networks, hidden units, hidden layer, neural network, training set, mit press, hidden unit, learning algorithm, output units, output layer
NIPS-1995: linear algebra, input signals, gaussian filters, optical flow, model matching, resistive line, input signal, analog vlsi, depth map, temporal precision
NIPS-1996: probability vector, relevant documents, continuous embedding, doubly stochastic matrix, probability vectors, binding energy, energy costs, variability index, learning bayesian, polynomial time
NIPS-2004: optimal policy, build stack, reinforcement learning, nash equilibrium, suit stack, synthetic items, compressed map, reward function, td networks, intrinsic reward
NIPS-2005: kernel cca, empirical risk, training sample, data clustering, random selection, gaussian regression, online hypothesis, linear separators, covariance operator, line algorithm

Figure: Top ten probable phrases from the posterior inference in NIPS, year-wise (Our Model).

Qualitative Results: Topic changes over time

[Figure: Histograms of the topic over NIPS years (1990-2005) for both models.]

Our Model: 1. hidden unit, 2. neural net, 3. input layer, 4. recurrent network, 5. hidden layers, 6. learning algorithms, 7. error signals, 8. recurrent connections, 9. training pattern, 10. recurrent cascade

TOT: 1. state, 2. time, 3. sequence, 4. states, 5. model, 6. sequences, 7. recurrent, 8. models, 9. markov, 10. transition

Figure: A topic related to "recurrent NNs" comprising n-gram words obtained from both models. The histograms depict how the topics are distributed over time; they are fitted with Beta probability density functions.

Quantitative Results: Predicting the decade on the State-of-the-Union dataset

1. We compute time-stamp prediction performance.
2. We learn a model on a subset of the data randomly sampled from the collection.
3. Given a new document, we compute the likelihood of each candidate decade and predict the most likely one (see the sketch below).
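One plausible way to implement the prediction step is sketched below, under the assumption that each candidate decade is scored through the per-topic Beta time distributions weighted by the document's inferred topic mixture; this is an illustration, not necessarily the authors' exact procedure.

```python
# Illustrative decade prediction: score each candidate decade's normalized
# midpoint under the trained per-topic Beta time distributions, weighted by
# the new document's inferred topic mixture, and pick the highest-scoring one.
import numpy as np
from scipy.stats import beta as beta_dist

def predict_decade(theta_d, Omega, decade_midpoints):
    """theta_d: inferred topic mixture of the new document (length T);
    Omega: per-topic Beta parameters, shape (T, 2);
    decade_midpoints: candidate decade midpoints, already scaled to [0, 1]."""
    scores = [(theta_d * beta_dist.pdf(t, Omega[:, 0], Omega[:, 1])).sum()
              for t in decade_midpoints]
    return decade_midpoints[int(np.argmax(scores))]
```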

            L1 Error   E(L1)   Accuracy
Our Model   1.60       1.65    0.25
TOT         1.95       1.99    0.20

Table: Results of decade prediction on the State-of-the-Union speeches dataset.

Conclusions and Future Work

1. We have presented an n-gram topic model which captures both temporal structure and n-gram words in time-stamped documents.
2. Topics found by our model are more interpretable, with better qualitative and quantitative performance on two publicly available datasets.
3. We have derived a collapsed Gibbs sampler for faster posterior inference.
4. An advantage of our model is that it does away with ambiguities that might appear among the words in topics.

Future Work: explore non-parametric methods for n-gram topics over time.

References

David M. Blei and John D. Lafferty. 2006. Dynamic topic models. In Proc. of ICML, 113-120.
Knights, D., Mozer, M., and Nicolov, N. 2009. Detecting topic drift with compound topic models. In Proc. of ICWSM.
Noriaki Kawamae. 2011. Trend analysis model: Trend consists of temporal words, topics, and timestamps. In Proc. of WSDM, 317-326.
Hanna M. Wallach. 2006. Topic modeling: beyond bag-of-words. In Proc. of ICML, 977-984.
Griffiths, T. L., Steyvers, M., and Tenenbaum, J. B. 2007. Topics in semantic representation. Psychological Review, 114(2), 211.
Wang, X., McCallum, A., and Wei, X. 2007. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proc. of ICDM, 697-702.
Xuerui Wang and Andrew McCallum. 2006. Topics over time: a non-Markov continuous-time model of topical trends. In Proc. of KDD, 424-433.
