A Nonparametric N-Gram Topic Model with Interpretable Latent Topics
Shoaib Jameel and Wai Lam
The Chinese University of Hong Kong

One-line summary of the work: we will see how maintaining the order of words in a document improves the qualitative and quantitative results of a nonparametric topic model.

Shoaib Jameel and Wai Lam, AIRS-2013, Singapore

Introduction and Motivation

Popular nonparametric topic models such as Hierarchical Dirichlet Processes (HDP) assume a bag-of-words paradigm.

They thus lose important collocation information in the document.

Example: they cannot capture the phrase "neural network" as a single unit in a topic.

Parametric n-gram topic models also exist, but they require the user to supply the number of topics.

The bag-of-words assumption also makes the discovered latent topics less interpretable.
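To make the limitation concrete, here is a minimal illustration (not from the paper) of how a bag-of-words representation discards word order, so a true collocation such as "neural network" becomes indistinguishable from chance co-occurrence:

```python
from collections import Counter

def bag_of_words(tokens):
    """A bag-of-words model keeps only term counts and discards word order."""
    return Counter(tokens)

doc_a = "neural network training".split()
doc_b = "network training neural".split()

# Both documents map to the same representation, so the bigram
# "neural network" cannot be recovered from the counts alone.
assert bag_of_words(doc_a) == bag_of_words(doc_b)
```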

Related Work

Our work extends the Hierarchical Dirichlet Processes (HDP) model (Teh et al., JASA-2006).

Goldwater et al. (ACL-2006) presented nonparametric word segmentation models in which the order of words in the document is maintained. Deane (ACL-2005) presented a nonparametric approach to extract phrasal terms.

Parametric n-gram topic models such as the Bigram Topic Model (Wallach, ICML-2006), the LDA-Collocation Model (Griffiths et al., Psych. Rev.-2007), and the Topical N-gram Model (Wang et al., ICDM-2007) all need the number of topics to be specified by the user.

Background: HDP Model

The Hierarchical Dirichlet Processes (HDP) model, when used as a topic model, discovers the latent topics in a collection. The model does not require the number of topics to be specified by the user.

The model can be regarded as a nonparametric version of the Latent Dirichlet Allocation (LDA) (Blei et al., JMLR-2003)

Generative Process (shown alongside the graphical model in plate notation):

    G_0 | γ, H   ∼ DP(γ, H)
    G_d | α, G_0 ∼ DP(α, G_0)
    z_di | G_d   ∼ G_d
    w_di | z_di  ∼ Multinomial(z_di)

γ and α are the concentration parameters. The base probability measure H provides the prior distribution for the factors (topics) z_di. In the plate diagram, w_di is repeated over the N_d words of each document and over the D documents.
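The generative process above can be sketched in code with a truncated stick-breaking construction of the DPs. This is a minimal illustrative sketch, not the paper's implementation; the vocabulary size V, truncation level T, and hyperparameter values are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(concentration, truncation):
    """Truncated stick-breaking construction of DP weights (GEM distribution)."""
    betas = rng.beta(1.0, concentration, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

V, T = 50, 30                      # vocabulary size and truncation level (illustrative)
gamma, alpha, eta = 1.0, 1.0, 0.1  # concentration and topic-smoothing hyperparameters

global_weights = stick_breaking(gamma, T)        # weights of G_0 ~ DP(gamma, H)
topics = rng.dirichlet(np.full(V, eta), size=T)  # topic atoms drawn from H = Dirichlet(eta)

def generate_document(n_words):
    # G_d ~ DP(alpha, G_0): with a discrete truncated G_0, the document-level
    # weights are Dirichlet(alpha * global_weights); a tiny floor keeps the
    # Dirichlet parameters strictly positive.
    doc_weights = rng.dirichlet(alpha * global_weights + 1e-6)
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=doc_weights)           # z_di ~ G_d selects a topic atom
        words.append(rng.choice(V, p=topics[z]))   # w_di drawn from topic z's word distribution
    return words
```

Because the number of effective topics is controlled by the stick-breaking weights rather than fixed in advance, growing the truncation level T does not change the model, only the quality of the approximation.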

N-gram Hierarchical Dirichlet Processes Model

Our Model: NHDP

This is our proposed model, called NHDP. We introduce a set of binary indicator variables between words in sequence in the HDP model.

The binary variables are set to 1 if the words form a bigram, else they are 0.

If the words in sequence form a bigram, then the words are generated solely based on the previous word.

Unigram words are generated from the topic.

This framework helps us capture the word order in the document.

N-gram Hierarchical Dirichlet Processes Model

Generative Process (shown alongside the graphical model in plate notation, with bigram-indicator variables x_di, switch parameters ψ, and bigram distributions σ):

    G_0 | γ, H        ∼ DP(γ, H)
    G_d | α, G_0      ∼ DP(α, G_0)
    z_di | G_d        ∼ G_d
    x_di | w_{d,i-1}  ∼ Bernoulli(ψ_{w_{d,i-1}})
    if x_di = 1 then
        w_di | w_{d,i-1} ∼ Multinomial(σ_{w_{d,i-1}})
    else
        w_di | z_di ∼ F(z_di)
    end

As one can see, our model can capture word dependencies in the text data.
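The bigram switch in the generative process above can be sketched as follows. This is an illustrative sketch only: a fixed number of topics K stands in for the DP-distributed topics, and all distributions and hyperparameters are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
V, K = 20, 5   # illustrative vocabulary size; fixed K stands in for the DP prior

topics = rng.dirichlet(np.full(V, 0.1), size=K)  # F(z): per-topic word distributions
sigma = rng.dirichlet(np.full(V, 0.1), size=V)   # sigma_v: next-word distribution after word v
psi = rng.beta(1.0, 1.0, size=V)                 # psi_v: probability of forming a bigram after v

def generate_words(n_words, doc_topic_weights):
    words = [int(rng.choice(V))]   # first word has no predecessor; drawn uniformly here
    flags = [0]
    for _ in range(n_words - 1):
        prev = words[-1]
        x = int(rng.random() < psi[prev])   # x_di ~ Bernoulli(psi_{w_{d,i-1}})
        if x == 1:
            # bigram: the word is generated solely from the previous word
            w = int(rng.choice(V, p=sigma[prev]))
        else:
            # unigram: the word is generated from a topic
            z = int(rng.choice(K, p=doc_topic_weights))
            w = int(rng.choice(V, p=topics[z]))
        words.append(w)
        flags.append(x)
    return words, flags

theta = rng.dirichlet(np.ones(K))   # a document's topic proportions
words, flags = generate_words(30, theta)
```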

Posterior Inference using Gibbs Sampling

First Condition: x_di = 0. Under this condition, sampling is the same as in the HDP model.

Second Condition: x_di = 1. Probability of a topic in a document:

    P(k_dt = k | t, k^{¬dt}) ∝ m_{·k}^{¬dt} f_k^{¬w_dt}(w_dt)   if k is already used
                               γ f_{k̂}^{¬w_dt}(w_dt)            if k = k̂ (a new topic)

    f_k^{¬w_dt}(w_dt) = [Γ(n_{··k}^{¬w_dt} + Vη) / Γ(n_{··k}^{¬w_dt} + n_{w_dt} + Vη)]
                        × ∏_ϑ [Γ(n_{··k}^{¬w_dt,ϑ} + n_{w_dt}^ϑ + η) / Γ(n_{··k}^{¬w_dt,ϑ} + η)]

Notation:
- k_dt is the topic index variable for each table t in document d.
- w denotes (w_di : ∀d, i), w_dt denotes (w_di : ∀i with t_di = t), t denotes (t_di : ∀d, i), k denotes (k_dt : ∀d, t), and x denotes (x_di : ∀d, i).
- A superscript such as ¬dt or ¬di means that the variables corresponding to that index are removed from the set or from the calculation of the count.
- V is the vocabulary size.
- n_{··k}^{¬w_di} is the number of words belonging to topic k in the corpus whose x_di = 0, excluding w_di.
- n_{w_dt} is the total number of words at table t whose x_di = 0.

Experimental Results
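The table-level topic update above can be sketched numerically. This is a minimal sketch under simplifying assumptions (counts are passed in as plain dictionaries, and the bookkeeping that removes the table's own words is assumed to have been done by the caller); it is not the paper's implementation.

```python
from math import lgamma, exp

def log_f(topic_counts_k, table_counts, V, eta):
    """log f_k^{-w_dt}(w_dt): Dirichlet-multinomial likelihood of the table's
    words under topic k, given corpus counts with the table's words removed.
    topic_counts_k: term -> n_{..k} count for topic k; table_counts: term -> count at table t."""
    n_k = sum(topic_counts_k.values())
    n_t = sum(table_counts.values())
    val = lgamma(n_k + V * eta) - lgamma(n_k + n_t + V * eta)
    for term, c in table_counts.items():
        base = topic_counts_k.get(term, 0)
        val += lgamma(base + c + eta) - lgamma(base + eta)
    return val

def topic_posterior(table_counts, topic_counts, m, gamma, V, eta):
    """Normalized P(k_dt = k | ...): proportional to m_{.k} * f_k for each
    existing topic, and to gamma * f_khat for a brand-new topic (key "new")."""
    scores = {k: m[k] * exp(log_f(topic_counts[k], table_counts, V, eta))
              for k in topic_counts}
    scores["new"] = gamma * exp(log_f({}, table_counts, V, eta))
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}

# Toy example: topic 0 has often generated term "a", so a table holding "a" prefers it.
p = topic_posterior({"a": 2}, {0: {"a": 3}, 1: {"b": 3}}, {0: 2, 1: 1},
                    gamma=1.0, V=2, eta=0.5)
```

Working in log space with `lgamma` avoids the overflow that direct Gamma-function ratios would cause at realistic count magnitudes.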

Qualitative Results (top topical terms per dataset):

NIPS dataset   HDP: patterns, cortex, activity, activation, pattern
               NHDP: neurons, neural network, neural networks, simulations
AP dataset     HDP: sports, television, british, cbs, miss
               NHDP: week, summer olympics, broadcast, world news tonight
Comp dataset   HDP: color, usenet, ins, ftp, bit
               NHDP: windows, sun microsystems, anonymous ftp, usenet, comp

Quantitative Results: Perplexity Analysis (AP Dataset)

[Figure: bar charts of held-out perplexity for HDP, NHDP, LDACOL, and Bi-NHDP at 30%, 50%, 70%, and 90% training data; lower perplexity is better.]

Conclusion and References

We proposed an n-gram nonparametric topic model which discovers more interpretable latent topics.
Our model introduces a new set of binary random variables into the HDP model.
Our model extends the posterior inference scheme of the HDP model.
Results demonstrate that our model outperforms state-of-the-art models.

References
- Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. JMLR, 3, 993-1022.
- Wallach, H. M. (2006). Topic modeling: Beyond bag-of-words. In Proc. of ICML, pp. 977-984.
- Griffiths, T. L., Steyvers, M., and Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114(2), 211.
- Wang, X., McCallum, A., and Wei, X. (2007). Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proc. of ICDM, pp. 697-702.
- Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101, 1566-1581.
- Deane, P. (2005). A nonparametric method for extraction of candidate phrasal terms. In Proc. of ACL, pp. 605-613.
- Goldwater, S., Griffiths, T. L., and Johnson, M. (2006). Contextual dependencies in unsupervised word segmentation. In Proc. of ACL, pp. 673-680.