
Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval

Xuerui Wang, Andrew McCallum, Xing Wei
University of Massachusetts
140 Governors Dr, Amherst, MA 01003
{xuerui, mccallum, xwei}@cs.umass.edu

Abstract

Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. However, word order and phrases are often critical to capturing the meaning of text in many tasks. This paper presents topical n-grams, a topic model that discovers topics as well as topical phrases. The probabilistic model generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, and then sampling the word from a topic-specific unigram or bigram distribution. Thus our model can treat "white house" as a phrase with special meaning in the 'politics' topic, but not in the 'real estate' topic. Successive bigrams form longer phrases. We present experimental results showing meaningful phrases and more interpretable topics on the NIPS data, and improved information retrieval performance on a TREC collection.

1 Introduction

Although the bag-of-words assumption is prevalent in document classification and topic models, the great majority of natural language processing methods represent word order, including n-gram language models for speech recognition, finite-state models for information extraction, and context-free grammars for parsing. Word order is not only important for syntax, but also for lexical meaning. A collocation is a phrase whose meaning goes beyond that of its individual words. For example, the phrase "white house" carries a special meaning beyond the appearance of its individual words, whereas "yellow house" does not. Note, however, that whether or not a phrase is a collocation may depend on the topic context. In the context of a document about real estate, "white house" may not be a collocation.

N-gram phrases are fundamentally important in many areas of natural language processing and text mining, including information retrieval. In general, a phrase as a whole carries more information than the sum of its individual components, so phrases are much more important than individual words in determining the topics of collections. Most topic models, such as latent Dirichlet allocation (LDA) [2], nevertheless assume that words are generated independently of each other, i.e., under the bag-of-words assumption. Introducing phrases could be useful in many contexts, but the extra complexity it brings has led these topic models to ignore phrases entirely. It is true that models built on the bag-of-words assumption have enjoyed great success and attracted interest from researchers with many different backgrounds. We believe, however, that a topic model that considers phrases would be decidedly more useful in certain applications.

Assume that we conduct topic analysis on a large collection of research papers. The acknowledgment sections of research papers have a distinctive vocabulary. Not surprisingly, we would end up with a particular topic on acknowledgments (or funding agencies), since many papers have an acknowledgment section that is not tightly coupled with the content of the paper. One might therefore expect to find words such as "thank", "support" and "grant" in a single topic. One might be very confused, however, to find words like "health" and "science" in the same topic, unless they are presented in context: "National Institutes of Health" and "National Science Foundation".

Phrases often have specialized meaning, but not always. For instance, "neural networks" is considered a phrase because of its frequent use as a fixed expression, yet it denotes two distinct concepts: biological neural networks in neuroscience and artificial neural networks in modern usage. Without consulting the context in which the term is located, it is hard to determine its actual meaning. In many situations, topic is very useful for accurately capturing the meaning. Furthermore, topic can play a role in phrase discovery. Consider learning English: a beginner usually has difficulty telling "strong tea" from "powerful tea" [15], both of which are grammatically correct. The topic associated with "tea" might help to discover the misuse of "powerful".

In this paper, we propose a new topical n-gram (TNG) model that automatically determines unigram words and phrases based on context and assigns a mixture of topics to both individual words and n-gram phrases. The ability to form phrases only where appropriate is unique to our model, distinguishing it from the traditional collocation discovery methods discussed in Section 3, where a discovered phrase is always treated as a collocation regardless of context (which could lead us to incorrectly conclude that "white house" remains a phrase in a document about real estate). Thus, TNG is not only a topic model that uses phrases; it can also help linguists discover meaningful phrases in the right context, in a completely probabilistic manner. We show examples of extracted phrases and more interpretable topics on the NIPS data, and, in a text mining application, we present better information retrieval performance on an ad-hoc retrieval task over a TREC collection.
SYMBOL     DESCRIPTION
T          number of topics
D          number of documents
W          number of unique words
N_d        number of word tokens in document d
z_i^(d)    the topic associated with the ith token in document d
x_i^(d)    the bigram status between the (i-1)th token and the ith token in document d
w_i^(d)    the ith token in document d
θ^(d)      the multinomial (Discrete) distribution of topics w.r.t. document d
φ_z        the multinomial (Discrete) unigram distribution of words w.r.t. topic z
ψ_v        in Figure 1(b), the binomial (Bernoulli) distribution of status variables w.r.t. previous word v
ψ_zv       in Figure 1(c), the binomial (Bernoulli) distribution of status variables w.r.t. previous topic z / word v
σ_zv       in Figure 1(a) and (c), the multinomial (Discrete) bigram distribution of words w.r.t. topic z / word v
σ_v        in Figure 1(b), the multinomial (Discrete) bigram distribution of words w.r.t. previous word v
α          Dirichlet prior of θ
β          Dirichlet prior of φ
γ          Dirichlet prior of ψ
δ          Dirichlet prior of σ

Table 1. Notation used in this paper

2 N-gram based Topic Models

Before presenting our topical n-gram model, we first describe two related n-gram models. The notation used in this paper is listed in Table 1, and the graphical models are shown in Figure 1. For simplicity, all the models discussed in this section make the first-order Markov assumption, that is, they are actually bigram models. However, all of the models can capture higher-order n-grams (n > 2) by concatenating consecutive bigrams.

[Figure 1. Three n-gram based topic models: (a) the bigram topic model; (b) the LDA collocation model; (c) the topical n-gram model.]

2.1 Bigram Topic Model (BTM)

Recently, Wallach developed a bigram topic model [22] on the basis of the hierarchical Dirichlet language model [14], by incorporating the concept of topic into bigram models. This model is one solution to the "neural network" example in Section 1. We assume a dummy word w_0 exists at the beginning of each document. The graphical model presentation of this model is shown in Figure 1(a). Its generative process can be described as follows:

1. draw Discrete distributions σ_zw from a Dirichlet prior δ for each topic z and each word w;

2. for each document d, draw a Discrete distribution θ^(d) from a Dirichlet prior α; then for each word w_i^(d) in document d:

   (a) draw z_i^(d) from Discrete θ^(d); and

   (b) draw w_i^(d) from Discrete σ_{z_i^(d) w_{i-1}^(d)}.
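To make this generative process concrete, the following minimal simulation sketch (ours, not from the paper) draws a toy corpus from the BTM process above; the sizes, priors, and the dummy word w_0 = 0 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, W, D, N_d = 5, 100, 3, 20           # toy sizes: topics, vocabulary, documents, tokens per document
alpha, delta = 1.0, 0.01               # symmetric Dirichlet priors for theta and sigma

# Step 1: a bigram distribution sigma_{zw} over next words for every (topic, previous word) pair.
sigma = rng.dirichlet(np.full(W, delta), size=(T, W))

corpus = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))       # Step 2: topic distribution theta^(d)
    words, prev = [], 0                            # dummy word w_0 at the start of the document
    for i in range(N_d):
        z = rng.choice(T, p=theta)                 # (a) draw topic z_i from theta^(d)
        w = rng.choice(W, p=sigma[z, prev])        # (b) draw w_i from sigma_{z_i, w_{i-1}}
        words.append(w)
        prev = w
    corpus.append(words)

print(corpus[0])
```

Note that every word is conditioned on its predecessor, i.e., BTM always generates "bigrams"; the models below relax this.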
2.2 LDA Collocation Model (LDACOL)

Starting from the LDA topic model, the LDA collocation model [20] (not yet published) introduces a new set of random variables x for bigram status (x_i = 1: w_{i-1} and w_i form a bigram; x_i = 0: they do not) that denote whether a bigram can be formed with the previous token, in addition to the two sets of random variables z and w in LDA. Thus, it has the power to decide whether to generate a bigram or a unigram. In this respect it is more realistic than the bigram topic model, which always generates bigrams; after all, unigrams are the major components of a document. We assume the status variable x_1 is observed, and only a unigram is allowed at the beginning of a document. If we want to put more constraints into the model (e.g., no bigram is allowed across a sentence or paragraph boundary; only a unigram can be considered for the next word after a stopword is removed; etc.), we can assume that the corresponding status variables are observed as well. This model's graphical model presentation is shown in Figure 1(b). The generative process of the LDA collocation model is described as follows:

1. draw Discrete distributions φ_z from a Dirichlet prior β for each topic z;

2. draw Bernoulli distributions ψ_w from a Beta prior γ for each word w;

3. draw Discrete distributions σ_w from a Dirichlet prior δ for each word w;

4. for each document d, draw a Discrete distribution θ^(d) from a Dirichlet prior α; then for each word w_i^(d) in document d:

   (a) draw x_i^(d) from Bernoulli ψ_{w_{i-1}^(d)};

   (b) draw z_i^(d) from Discrete θ^(d); and

   (c) draw w_i^(d) from Discrete σ_{w_{i-1}^(d)} if x_i^(d) = 1; else draw w_i^(d) from Discrete φ_{z_i^(d)}.

Note that in the LDA collocation model, bigrams do not have topics, since the second term of a bigram is generated from a distribution σ_v conditioned only on the previous word v.
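The unigram/bigram switch can be seen in a small sketch of the LDACOL word-generation loop (again our own illustration, with assumed toy sizes): the status variable x_i, drawn per previous word, selects either the topic-specific unigram distribution φ_{z_i} or the previous-word bigram distribution σ_{w_{i-1}}, which carries no topic.

```python
import numpy as np

rng = np.random.default_rng(1)
T, W, N_d = 5, 100, 20                              # toy sizes for one document
alpha, beta, gamma, delta = 1.0, 0.01, 0.1, 0.01

phi   = rng.dirichlet(np.full(W, beta), size=T)     # unigram distributions phi_z per topic
psi   = rng.beta(gamma, gamma, size=W)              # P(x_i = 1 | previous word v), one Bernoulli per word
sigma = rng.dirichlet(np.full(W, delta), size=W)    # bigram distributions sigma_v per previous word
theta = rng.dirichlet(np.full(T, alpha))            # topic proportions theta^(d) for this document

tokens, prev = [], 0                                # dummy word w_0; x_1 is observed to be 0
for i in range(N_d):
    x = 0 if i == 0 else rng.binomial(1, psi[prev])     # (a) bigram status w.r.t. the previous word
    z = rng.choice(T, p=theta)                          # (b) topic of the current token
    if x == 1:
        w = rng.choice(W, p=sigma[prev])                # (c) bigram: word depends only on the previous word
    else:
        w = rng.choice(W, p=phi[z])                     #     unigram: word depends on its topic
    tokens.append((w, z, x))
    prev = w

print(tokens[:5])
```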

2.3 Topical N-gram Model (TNG)

The topical n-gram model (TNG) is not a pure addition of the bigram topic model and the LDA collocation model. Like the bigram topic model, it can solve the problem associated with the "neural network" example; like the LDA collocation model, it can automatically determine whether a composition of two terms is indeed a bigram. However, as with the other collocation discovery methods discussed in Section 3, a discovered bigram is always a bigram in the LDA collocation model, no matter what the context is.

One of the key contributions of our model is that it can decide whether to form a bigram for the same two consecutive word tokens depending on their nearby context (i.e., co-occurrences). Thus, in addition, our model is a perfect solution to the "white house" example in Section 1. As in the LDA collocation model, we may assume that some x's are observed, for the same reasons discussed in Section 2.2. The graphical model presentation of this model is shown in Figure 1(c). Its generative process can be described as follows:

1. draw Discrete distributions φ_z from a Dirichlet prior β for each topic z;

2. draw Bernoulli distributions ψ_zw from a Beta prior γ for each topic z and each word w;

3. draw Discrete distributions σ_zw from a Dirichlet prior δ for each topic z and each word w;

4. for each document d, draw a Discrete distribution θ^(d) from a Dirichlet prior α; then for each word w_i^(d) in document d:

   (a) draw x_i^(d) from Bernoulli ψ_{z_{i-1}^(d) w_{i-1}^(d)};

   (b) draw z_i^(d) from Discrete θ^(d); and

   (c) draw w_i^(d) from Discrete σ_{z_i^(d) w_{i-1}^(d)} if x_i^(d) = 1; else draw w_i^(d) from Discrete φ_{z_i^(d)}.

Note that our model is a more powerful generalization of both BTM and LDACOL: BTM (obtained by setting all x's to 1) and LDACOL (obtained by conditioning σ on the previous word only) are special cases of TNG.

Before discussing the inference problem of our model, let us pause for a brief interlude on the topic consistency of the terms in a bigram. As shown above, the topic assignments of the two terms in a bigram are not required to be identical. We can take the topic of the first or last word token, or the most common topic in the phrase, as the topic of the phrase. In this paper, we use the topic of the last term as the topic of the phrase for simplicity, since long noun phrases do sometimes have components indicative of different topics, and the last noun is usually the "head noun". Alternatively, we could enforce consistency in the model with ease, by simply adding two more sets of arrows (z_{i-1} → z_i and x_i → z_i) and substituting Step 4(b) in the above generative process with "draw z_i^(d) from Discrete θ^(d) if x_i^(d) = 0; else let z_i^(d) = z_{i-1}^(d)". In this way, a word has the option to inherit the topic assignment of its previous word when they form a bigram phrase. In our experiments, however, the first choice yields better performance, so from now on we focus on the model shown in Figure 1(c).

Finally, we want to point out that the topical n-gram model is not only a new framework for distilling n-gram phrases depending on nearby context, but also a more sensible topic model than those that use word co-occurrences alone.

In state-of-the-art hierarchical Bayesian models such as latent Dirichlet allocation, exact inference over the hidden topic variables is typically intractable due to the large number of latent variables and parameters in the models. Approximate inference techniques such as variational methods [12], Gibbs sampling [1] and expectation propagation [17] have been developed to address this issue. We use Gibbs sampling to conduct approximate inference in this paper. To reduce the uncertainty introduced by θ, φ, ψ, and σ, we can integrate them out without trouble because of the conjugate prior setting in our model. Starting from the joint distribution P(w, z, x | α, β, γ, δ), we can work out the conditional probabilities P(z_i^(d), x_i^(d) | z_{-i}^(d), x_{-i}^(d), w, α, β, γ, δ) conveniently using Bayes' rule, where z_{-i}^(d) denotes the topic assignments of all word tokens except w_i^(d), and x_{-i}^(d) represents the bigram status of all tokens except w_i^(d). (As shown in Appendix A, one could further calculate P(z_i^(d) | ...) and P(x_i^(d) | ...) separately, as in a traditional Gibbs sampling procedure.) During Gibbs sampling, we draw the topic assignment z_i^(d) and the bigram status x_i^(d) iteratively for each word token w_i^(d) (when x_i^(d) is observed, only z_i^(d) needs to be drawn) according to the following conditional probability distribution:

P(z_i^{(d)}, x_i^{(d)} \mid z_{-i}^{(d)}, x_{-i}^{(d)}, \mathbf{w}, \alpha, \beta, \gamma, \delta)
\propto \big(\gamma_{x_i^{(d)}} + p_{z_{i-1}^{(d)} w_{i-1}^{(d)} x_i^{(d)}} - 1\big)\big(\alpha_{z_i^{(d)}} + q_{d z_i^{(d)}} - 1\big)
\times
\begin{cases}
\dfrac{\beta_{w_i^{(d)}} + n_{z_i^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} (\beta_v + n_{z_i^{(d)} v}) - 1} & \text{if } x_i^{(d)} = 0 \\[2ex]
\dfrac{\delta_{w_i^{(d)}} + m_{z_i^{(d)} w_{i-1}^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} (\delta_v + m_{z_i^{(d)} w_{i-1}^{(d)} v}) - 1} & \text{if } x_i^{(d)} = 1
\end{cases}

where n_zw is the number of times word w has been assigned to topic z as a unigram, m_zwv is the number of times word v has been assigned to topic z as the second term of a bigram given the previous word w, p_zwk is the number of times the status variable x equals k (0 or 1) given the previous word w and the previous word's topic z, and q_dz is the number of times a word in document d has been assigned to topic z. All counts here include the assignment of the token being visited. Details of the Gibbs sampling derivation are provided in Appendix A.

Simple manipulations give us the posterior estimates of θ, φ, ψ, and σ as follows:

\hat{\theta}_z^{(d)} = \frac{\alpha_z + q_{dz}}{\sum_{t=1}^{T} (\alpha_t + q_{dt})}, \quad
\hat{\phi}_{zw} = \frac{\beta_w + n_{zw}}{\sum_{v=1}^{W} (\beta_v + n_{zv})}, \quad
\hat{\psi}_{zwk} = \frac{\gamma_k + p_{zwk}}{\sum_{k'=0}^{1} (\gamma_{k'} + p_{zwk'})}, \quad
\hat{\sigma}_{zwv} = \frac{\delta_v + m_{zwv}}{\sum_{v'=1}^{W} (\delta_{v'} + m_{zwv'})}. \qquad (1)
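The conditional above maps almost directly onto code. The following sketch is our own illustration, not the authors' implementation; it assumes count arrays n, m, p, q with the meanings just defined and symmetric priors, and it assumes the counts already exclude the token being resampled (equivalent to the "-1" terms in the formula, where counts include the visited token).

```python
import numpy as np

def sample_z_x(rng, w, w_prev, z_prev, d, n, m, p, q, alpha, beta, gamma, delta):
    """One collapsed Gibbs draw of (z_i, x_i) for a single token.

    n[z, w]         count of word w as a unigram in topic z
    m[z, w_prev, w] count of word w as the 2nd term of a bigram after w_prev in topic z
    p[z, w_prev, k] count of status x = k after previous word w_prev whose topic was z
    q[d, z]         count of tokens in document d assigned to topic z
    (all counts are assumed to exclude the token being resampled)
    """
    T, W = n.shape
    prob = np.zeros((T, 2))
    for z in range(T):
        doc_term = alpha + q[d, z]
        # x = 0: unigram, word drawn from the topic-specific unigram distribution
        prob[z, 0] = (gamma + p[z_prev, w_prev, 0]) * doc_term * \
                     (beta + n[z, w]) / (W * beta + n[z].sum())
        # x = 1: bigram, word drawn from the topic/previous-word bigram distribution
        prob[z, 1] = (gamma + p[z_prev, w_prev, 1]) * doc_term * \
                     (delta + m[z, w_prev, w]) / (W * delta + m[z, w_prev].sum())
    prob /= prob.sum()
    flat = rng.choice(T * 2, p=prob.ravel())
    return flat // 2, flat % 2                      # new (z_i, x_i)
```

In a full sampler, one would decrement the counts for token i, draw a new (z_i, x_i) with a function like this, and increment the counts again; the posterior estimates in Equation (1) are then just these counts plus their priors, normalized.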
As discussed for the bigram topic model [22], one could certainly infer the values of the hyperparameters in TNG using a Gibbs EM algorithm [1]. For many applications, topic models are sensitive to hyperparameters, and it is important to get their values right. In the particular experiments discussed in this paper, however, we find that sensitivity to hyperparameters is not a big concern. For simplicity, and for feasibility on our gigabyte TREC retrieval tasks, we skip the inference of hyperparameters and instead use reported empirical values that yield salient results.

3 Related Work

Collocation has long been studied by lexicographers and linguists in various ways. Traditional collocation discovery methods range from frequency, to variance, to hypothesis testing, to mutual information. The simplest method is counting; a small amount of linguistic knowledge (a part-of-speech filter) combined with frequency [13] can discover surprisingly meaningful phrases. Variance-based collocation discovery [19] considers collocations in a more flexible way than fixed phrases; however, high frequency and low variance can be accidental. Hypothesis testing can be used to assess whether two words occur together more often than by chance, and many statistical tests have been explored, for example the t-test [5], the χ² test [4], and the likelihood ratio test [7]. More recently, information-theoretically motivated methods for collocation discovery use mutual information [6, 11].
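As a point of contrast with the model-based approach, the information-theoretic scores above can be computed directly from corpus counts. The sketch below (our illustration, on made-up tokens) ranks adjacent word pairs by pointwise mutual information; the frequency-filter and hypothesis-testing variants differ only in the score function.

```python
from collections import Counter
from math import log

def pmi_collocations(tokens, min_count=5):
    """Rank adjacent word pairs by (approximate) pointwise mutual information."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue                                 # rare pairs give unreliable PMI estimates
        scores[(w1, w2)] = log(c * n / (unigrams[w1] * unigrams[w2]))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = ("the white house said the budget of the white house grew " * 20).split()
print(pmi_collocations(tokens)[:3])
```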
The hierarchical Dirichlet language model [14] is closely related to the bigram topic model [22]; its probabilistic view of smoothing in language models shows how to take advantage of a bigram model in a Bayesian way.

The main stream of topic modeling has also gradually gained a probabilistic flavor over the past decade. One of the most popular topic models, latent Dirichlet allocation (LDA), which makes the bag-of-words assumption, has made a big impact in the fields of natural language processing, statistical machine learning and text mining. The three models we discussed in Section 2 all contain an LDA component that is responsible for the topic part.

In our view, the HMMLDA model [10] is the first attack on word dependency in the topic modeling framework. The authors present HMMLDA as a generative composite model that takes care of both short-range syntactic dependencies and long-range semantic dependencies between words; its syntactic part is a hidden Markov model and its semantic component is a topic model (LDA). Interesting results based on this model are shown on tasks such as part-of-speech tagging and document classification.

4 Experimental Results

We apply the topical n-gram model to the NIPS proceedings dataset, which consists of the full text of the 13 years of proceedings of the Neural Information Processing Systems (NIPS) conferences from 1987 to 1999. In addition to downcasing and removing stopwords and numbers, we also removed words appearing fewer than five times in the corpus, many of them produced by OCR errors. Two-letter words (primarily coming from equations) were removed, except for "ML", "AI", "KL", "BP", "EM" and "IR". The dataset contains 1,740 research papers, 13,649 unique words, and 2,301,375 word tokens in total.
Topics found from a 50-topic run of the topical n-gram model on the NIPS dataset (10,000 Gibbs sampling iterations, with symmetric priors α = 1, β = 0.01, γ = 0.1, and δ = 0.01) are shown in Table 2 as anecdotal evidence, together with the closest corresponding topics (by KL divergence) found by LDA.

The "Reinforcement Learning" topic provides an extremely salient summary of the corresponding research area. The LDA topic assembles many common words used in reinforcement learning, but its word list contains quite a few generic words (such as "function", "dynamic", "decision") that are common and highly probable in many other topics as well. In TNG, these generic words are associated with other words to form n-gram phrases (such as "markov decision process") that are highly probable only in reinforcement learning. More importantly, by forming n-gram phrases, the unigram word list produced by TNG is also cleaner. For example, because of the prevalence of generic words in LDA, highly related words (such as "q-learning" and "goal") are not ranked high enough to appear in the top-20 word list; in contrast, they are ranked very high in TNG's unigram word list.

In the other three topics (Table 2), we find similar phenomena. For example, in "Human Receptive System", some generic words (such as "field", "receptive") are actually components of the popular phrases in this area, as shown in the TNG model. "system" is ranked high in LDA but is almost meaningless, and it does not appear in the top word lists of TNG; some closely related words (such as "spatial"), ranked very high in TNG, are absent from LDA's top word list. In "Speech Recognition", the dominating generic words (such as "context", "based", "set", "probabilities", "database") make the LDA topic less understandable than even TNG's unigram word list alone.

In many situations, a crucially related word might not be mentioned often enough to be clearly captured by LDA, yet it becomes very salient as a phrase due to the relatively stronger co-occurrence pattern in the extremely sparse setting for phrases. The "Support Vector Machines" topic provides such an example: we can imagine that "kkt" is mentioned no more than a few times in a typical NIPS paper, and it probably appears only as part of the phrase "kkt conditions". TNG successfully captures it as a highly probable phrase in the SVM topic.

As discussed before, higher-order n-grams (n > 2) can be approximately modeled by concatenating consecutive bigrams in the TNG model, as shown in Table 2 (e.g., "markov decision process", "hidden markov model" and "support vector machines").

To numerically evaluate the topical n-gram model, we could have used standard measures such as perplexity and document classification accuracy. However, to convincingly illustrate the power of the TNG model at a larger, more realistic scale, we instead apply it to a much larger standard text mining task: we employ the TNG model within the language modeling framework to conduct ad-hoc retrieval on gigabyte TREC collections.

Reinforcement Learning / Human Receptive System
LDA | n-gram (2+) | n-gram (1) | LDA | n-gram (2+) | n-gram (1)
state reinforcement learning action motion receptive field motion learning optimal policy policy visual spatial frequency spatial policy dynamic programming reinforcement field temporal frequency visual action optimal control states position visual motion receptive reinforcement function approximator actions figure motion energy response states prioritized sweeping function direction tuning curves direction time finite-state controller optimal fields horizontal cells cells optimal learning system learning eye motion detection figure actions reinforcement learning rl reward location preferred direction stimulus function function approximators control retina visual processing velocity algorithm markov decision problems agent receptive area mt contrast reward markov decision processes q-learning velocity visual cortex tuning step local search goal vision light intensity moving dynamic state-action pair space moving directional selectivity model control markov decision process step system high contrast temporal sutton belief states environment flow motion detectors responses rl stochastic policy system edge spatial phase orientation decision action selection problem center moving stimuli light algorithms upright position steps light decision strategy stimuli agent reinforcement learning methods transition local visual stimuli cell

Speech Recognition / Support Vector Machines
LDA | n-gram (2+) | n-gram (1) | LDA | n-gram (2+) | n-gram (1)
recognition speech recognition speech kernel support vectors kernel system training data word linear test error training word neural network training vector support vector machines support face error rates system support training error margin context neural net recognition set feature space svm hidden markov model hmm nonlinear training examples solution hmm feature vectors speaker data decision function kernels based continuous speech performance algorithm cost functions regularization frame training procedure phoneme space test inputs adaboost segmentation continuous speech recognition acoustic pca kkt conditions test training gamma filter words function leave-one-out procedure data characters hidden control context problem soft margin generalization set speech production systems margin bayesian transduction examples probabilities neural nets frame vectors training patterns cost features input representation trained solution training points convex faces output layers sequence training maximum margin algorithm words training algorithm phonetic svm strictly convex working frames test set speakers kernels regularization operators feature database speech frames mlp matrix base classifiers sv mlp speaker dependent hybrid machines convex optimization functions

Table 2. The four topics from a 50-topic run of TNG on 13 years of NIPS research papers, with their closest counterparts from LDA. The title above the word lists of each topic is our own summary of the topic. To better illustrate the difference between TNG and LDA, we list the n-grams (n > 1) and unigrams separately for TNG. Each topic is shown with its 20 highest-probability words. The TNG model produces a clearer word list for each topic by associating many generic words (such as "set", "field", "function", etc.) with other words to form n-gram phrases.

4.1 Ad-hoc Retrieval

Traditional information retrieval (IR) models usually represent text with bags of words, assuming that words occur independently, which is not exactly appropriate for natural language. To address this problem, researchers have been working on capturing word dependencies. There are mainly two types of dependencies being studied and shown to be effective: 1) topical (semantic) dependency, also called long-distance dependency. Two words are considered dependent when their meanings are related and they co-occur often, such as "fruit" and "apple". Among models capturing semantic dependency, the LDA-based document models [23] are state-of-the-art. For IR applications, a major advantage of topic models (document expansion), compared to online query expansion in pseudo-relevance feedback, is that they can be trained offline and are thus more efficient in handling a new query; 2) phrase dependency, also called short-distance dependency. As reported in the literature, retrieval performance can be boosted if the similarity between a user query and a document is calculated over common phrases instead of common words [9, 8, 21, 18]. Most research on phrases in information retrieval has employed an independent collocation discovery module, e.g., using the methods described in Section 3; in this way, a phrase can be indexed exactly as an ordinary word.

The topical n-gram model automatically and simultaneously takes care of both semantic co-occurrences and phrases.
Also, it does not need a separate module for phrase discovery: everything can be seamlessly integrated into the language modeling framework, which is one of the most popular statistically principled approaches to IR. In this section, we illustrate the difference between the TNG and LDA models in IR experiments, and compare the IR performance of the three models of Figure 1 on a TREC collection introduced below.

The SJMN dataset, taken from TREC, with the standard queries 51-150 taken from the title field of TREC topics, covers materials from the San Jose Mercury News in 1991. All text is downcased and only alphabetic characters are kept. Stop words in both the queries and the documents are removed, according to a common stop word list in the Bow toolkit [16]. If any two consecutive tokens were originally separated by a stopword, no bigram is allowed to be formed. In total, the SJMN dataset we use contains 90,257 documents, 150,714 unique words, and 21,156,378 tokens, which is an order of magnitude larger than the NIPS dataset. Relevance judgments are taken from the judged pool of top retrieved documents by various participating retrieval systems from previous TREC conferences.

The number of topics is set to 100 for all models, with 10,000 Gibbs sampling iterations and the same hyperparameter setting as for the NIPS dataset (symmetric priors α = 1, β = 0.01, γ = 0.1, and δ = 0.01). Here, we aim to beat the state-of-the-art model [23] rather than the state-of-the-art results in TREC retrieval, which require significant non-modeling effort to achieve.

4.2 Difference between Topical N-grams and LDA in IR Applications

From both LDA and TNG, a word distribution for each document can be calculated, which can thus be viewed as a document model. With these distributions, the likelihood of generating a query can be computed to rank documents, which is the basic idea of the query likelihood (QL) model in IR. When the two models are directly applied to ad-hoc retrieval, the TNG model performs significantly better than the LDA model under the Wilcoxon test at the 95% level. Of the 4,881 relevant documents for all queries, LDA retrieves 2,257 while TNG retrieves 2,450, 8.55% more. The average precision for TNG is 0.0709, 61.96% higher than its LDA counterpart (0.0438). Although these results are not state-of-the-art IR performance, we claim that, used alone, TNG represents a document better than LDA. The average precisions of both models are very low because corpus-level topics may be too coarse to be used as the only representation in IR [3, 23]; significant improvements can be achieved through a combination with the basic query likelihood model.

In the query likelihood model, each document is scored by the likelihood of its model generating a query Q, P_LM(Q|d). Let the query Q = (q_1, q_2, ..., q_{L_Q}). Under the bag-of-words assumption, P_LM(Q|d) = \prod_{i=1}^{L_Q} P(q_i|d), which is often specified by the document model with Dirichlet smoothing [24],

P_{LM}(q \mid d) = \frac{N_d}{N_d + \mu} P_{ML}(q \mid d) + \Big(1 - \frac{N_d}{N_d + \mu}\Big) P_{ML}(q \mid coll),

where N_d is the length of document d, P_ML(q|d) and P_ML(q|coll) are the maximum likelihood (ML) estimates of query term q being generated in document d and in the entire collection, respectively, and µ is the Dirichlet smoothing prior (in our reported experiments we used a fixed value µ = 1000, as in [23]).

To calculate the query likelihood from the TNG model within the language modeling framework, we need to sum over the topic variable and the bigram status variable for each token in the query token sequence. Given the posterior estimates θ̂, φ̂, ψ̂ and σ̂ (Equation 1), the query likelihood of query Q given document d, P_TNG(Q|d), can be calculated as

P_{TNG}(Q \mid d) = \prod_{i=1}^{L_Q} P_{TNG}(q_i \mid q_{i-1}, d),

where a dummy q_0 is assumed at the beginning of every query for convenience of presentation, and

P_{TNG}(q_i \mid q_{i-1}, d) = \sum_{z_i=1}^{T} \big( P(x_i = 0 \mid \hat{\psi}_{q_{i-1}}) P(q_i \mid \hat{\phi}_{z_i}) + P(x_i = 1 \mid \hat{\psi}_{q_{i-1}}) P(q_i \mid \hat{\sigma}_{z_i q_{i-1}}) \big) P(z_i \mid \hat{\theta}^{(d)}),

and

P(x_i \mid \hat{\psi}_{q_{i-1}}) = \sum_{z_{i-1}=1}^{T} P(x_i \mid \hat{\psi}_{z_{i-1} q_{i-1}}) P(z_{i-1} \mid \hat{\theta}^{(d)}).

Due to stopping and punctuation removal, we may simply set P(x_i = 0 | ψ̂_{q_{i-1}}) = 1 and P(x_i = 1 | ψ̂_{q_{i-1}}) = 0 at the corresponding positions in a query. Note that in the above calculation the bag-of-words assumption is no longer made.

Similar to the method in [23], we can combine the query likelihood from the basic language model and the likelihood from the TNG model in various ways. One can combine them at the query level, i.e.,

P(Q \mid d) = \lambda P_{LM}(Q \mid d) + (1 - \lambda) P_{TNG}(Q \mid d),

where λ is a weighting factor between the two likelihoods. Alternatively, under the first-order Markov assumption, P(Q|d) = P(q_1|d) \prod_{i=2}^{L_Q} P(q_i|q_{i-1}, d), and one can combine the query likelihoods at the query term level (used in this paper), that is,

P(q_i \mid q_{i-1}, d) = \lambda P_{LM}(q_i \mid d) + (1 - \lambda) P_{TNG}(q_i \mid q_{i-1}, d).

To illustrate the difference between TNG and LDA in IR applications, we select a few of the 100 queries that clearly contain phrase(s), and another few that contain no phrases after stopping and punctuation removal, and compare the IR performance (average precision) on them, as shown in Table 3. (The results reported in [23] are slightly better because they also used stemming.)

No.  Query                                 LDA     TNG     Change
053  Leveraged Buyouts                     0.2141  0.3665  71.20%
097  Fiber Optics Applications             0.1376  0.2321  68.64%
108  Japanese Protectionist Measures       0.1163  0.1686  44.94%
111  Nuclear Proliferation                 0.2353  0.4952  110.48%
064  Hostage-Taking                        0.4265  0.4458  4.52%
125  Anti-smoking Actions by Government    0.3118  0.4535  45.47%
145  Influence of the "Pro-Israel Lobby"   0.2900  0.2753  -5.07%
148  Conflict in the Horn of Africa        0.1990  0.2788  40.12%

Table 3. Comparison of LDA and TNG on TREC retrieval performance (average precision) for eight queries. The top four queries clearly contain phrase(s), and TNG achieves much better performance on them. The bottom four queries do not contain common phrase(s) after preprocessing (stopping and punctuation removal); surprisingly, TNG still outperforms LDA on some of these queries.
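For concreteness, the term-level combination described in Section 4.2 can be sketched as follows. This is our own illustration, not the authors' code: p_tng stands for a callable returning P_TNG(q_i | q_{i-1}, d) computed from the posterior estimates for the document, and the term-frequency dictionaries, collection statistics and µ = 1000 follow the Dirichlet-smoothing setup described above.

```python
import math

def combined_score(query, doc_tf, doc_len, coll_tf, coll_len, p_tng, lam=0.8, mu=1000.0):
    """Rank score: sum_i log( lam * P_LM(q_i|d) + (1 - lam) * P_TNG(q_i|q_{i-1}, d) )."""
    smoothing = doc_len / (doc_len + mu)              # Dirichlet smoothing weight
    score, prev = 0.0, None
    for q in query:
        p_lm = smoothing * doc_tf.get(q, 0) / doc_len + \
               (1.0 - smoothing) * coll_tf.get(q, 0) / coll_len
        score += math.log(lam * p_lm + (1.0 - lam) * p_tng(q, prev) + 1e-300)  # guard against log(0)
        prev = q
    return score

# toy usage with a trivial stand-in for the TNG term model (uniform over a 1,000-word vocabulary)
print(combined_score(["nuclear", "proliferation"],
                     {"nuclear": 3, "proliferation": 1}, 200,
                     {"nuclear": 50, "proliferation": 10}, 1_000_000,
                     p_tng=lambda q, prev: 1.0 / 1000))
```

Documents are then ranked by this score for each query.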

4.3 Comparison of BTM, LDACOL and TNG on TREC Ad-hoc Retrieval

In this section, we compare the IR performance of the three n-gram based topic models on the SJMN dataset, as shown in Table 4. (The running times of our C implementation on a dual-processor Opteron are 11.5, 17, and 22.5 hours for the three models, respectively.) For a fair comparison, the weighting factor λ (reported in Table 4) is chosen independently to obtain the best performance from each model. Under the Wilcoxon test with 95% confidence, TNG significantly outperforms BTM and LDACOL on this standard retrieval task.

Space limitations prevent us from presenting the results for all queries, but it is interesting to see that different models are good at quite different queries. For some queries (such as No. 117 and No. 138), TNG and BTM perform similarly, and better than LDACOL; for some other queries (such as No. 110 and No. 150), TNG and LDACOL perform similarly, and better than BTM. There are also queries (such as No. 061 and No. 130) for which TNG performs better than both BTM and LDACOL. We believe this is clear empirical evidence that our TNG model is more generic and powerful than BTM and LDACOL.

As an example, we analyze the performance of the TNG model on query No. 061. Inspecting the phrase "Iran-Contra" contained in the query, we find that it has been primarily assigned to two topics (politics and economy) in TNG. This increases the bigram likelihood of documents emphasizing the relevant topic (such as "SJMN91-06263203") and thus helps promote these documents to higher ranks. As a special case of TNG, LDACOL is unable to capture this and yields inferior performance.

It is true that for certain queries (such as No. 069 and No. 146) TNG performs worse than BTM and LDACOL, but we note that all models perform badly on these queries and the differences are more likely due to randomness.
No.  Query                                                   TNG     BTM     Change    LDACOL  Change
061  Israeli Role in Iran-Contra Affair                      0.1635  0.1104  -32.47%   0.1316  -19.49%
069  Attempts to Revive the SALT II Treaty                   0.0026  0.0071  172.34%   0.0058  124.56%
110  Black Resistance Against the South African Government   0.4940  0.3948  -20.08%   0.4883  -1.16%
117  Capacity of the U.S. Cellular Telephone Network         0.2801  0.3059  9.21%     0.1999  -28.65%
130  Jewish Emigration and U.S.-USSR Relations               0.2087  0.1746  -16.33%   0.1765  -15.45%
138  Iranian Support for Lebanese Hostage-takers             0.4398  0.4429  0.69%     0.3528  -19.80%
146  Negotiating an End to the Nicaraguan Civil War          0.0346  0.0682  97.41%    0.0866  150.43%
150  U.S. Political Campaign Financing                       0.2672  0.2323  -13.08%   0.2688  0.59%
All Queries                                                  0.2122  0.1996  -5.94%*   0.2107  -0.73%*

Table 4. Comparison of the bigram topic model (λ = 0.7), the LDA collocation model (λ = 0.9) and the topical n-gram model (λ = 0.8) on TREC retrieval performance (average precision). * indicates statistically significant differences in performance with 95% confidence according to the Wilcoxon test. Overall, TNG performs significantly better than the other two models.
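The significance claims in Tables 3 and 4 rest on a paired test over per-query average precision. A minimal sketch of such a check (our illustration, on hypothetical per-query scores) using SciPy's Wilcoxon signed-rank test:

```python
from scipy.stats import wilcoxon

# hypothetical per-query average precision for two systems, paired by query
ap_tng = [0.16, 0.00, 0.49, 0.28, 0.21, 0.44, 0.03, 0.27]
ap_btm = [0.11, 0.01, 0.39, 0.31, 0.17, 0.41, 0.07, 0.23]

stat, p_value = wilcoxon(ap_tng, ap_btm)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.3f}")  # significant at the 95% level if p < 0.05
```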

5 Conclusions

In this paper, we have presented the topical n-gram model. The TNG model automatically determines whether to form an n-gram (and further assigns it a topic) based on the surrounding context. Examples of topics found by TNG are more interpretable than their LDA counterparts. We also demonstrate how TNG can improve retrieval performance in standard ad-hoc retrieval tasks on TREC collections over its two special-case n-gram based topic models.

Unlike some traditional phrase discovery methods, the TNG model provides a systematic way to model (topical) phrases and can be seamlessly integrated with many probabilistic frameworks for various tasks such as phrase discovery, ad-hoc retrieval, machine translation, speech recognition and statistical parsing.

Evaluating n-gram based topic models is a big challenge. As reported in [22], bigram topic models have so far been shown to be effective only on hundreds of documents, and we have not seen a formal evaluation of the unpublished LDA collocation model. To the best of our knowledge, this paper presents the first application of all three n-gram based topic models to gigabyte collections, and a novel way to integrate n-gram based topic models into the language modeling framework for information retrieval tasks.

Appendix

A Gibbs Sampling Derivation for the Topical N-grams Model

We begin with the joint distribution P(w, x, z | α, β, γ, δ). We can take advantage of conjugate priors to simplify the integrals. All symbols are defined in Section 2.

P(\mathbf{w}, \mathbf{z}, \mathbf{x} \mid \alpha, \beta, \gamma, \delta)
= \iiint \prod_{d=1}^{D} \prod_{i=1}^{N_d} \Big( P(w_i^{(d)} \mid x_i^{(d)}, \phi_{z_i^{(d)}}, \sigma_{z_i^{(d)} w_{i-1}^{(d)}}) \, P(x_i^{(d)} \mid \psi_{z_{i-1}^{(d)} w_{i-1}^{(d)}}) \Big)
  \prod_{z=1}^{T} \prod_{v=1}^{W} p(\sigma_{zv} \mid \delta) \, p(\psi_{zv} \mid \gamma) \, d\Sigma \, d\Psi
  \int \prod_{z=1}^{T} p(\phi_z \mid \beta) \, d\Phi
  \int \prod_{d=1}^{D} \Big( \prod_{i=1}^{N_d} P(z_i^{(d)} \mid \theta_d) \Big) p(\theta_d \mid \alpha) \, d\Theta

= \int \prod_{z=1}^{T} \Bigg( \frac{\Gamma(\sum_{v=1}^{W} \beta_v)}{\prod_{v=1}^{W} \Gamma(\beta_v)} \prod_{v=1}^{W} \phi_{zv}^{\,n_{zv} + \beta_v - 1} \Bigg) d\Phi
  \times \int \prod_{z=1}^{T} \prod_{w=1}^{W} \Bigg( \frac{\Gamma(\sum_{v=1}^{W} \delta_v)}{\prod_{v=1}^{W} \Gamma(\delta_v)} \prod_{v=1}^{W} \sigma_{zwv}^{\,m_{zwv} + \delta_v - 1} \Bigg) d\Sigma
  \times \int \prod_{z=1}^{T} \prod_{w=1}^{W} \Bigg( \frac{\Gamma(\sum_{k=0}^{1} \gamma_k)}{\prod_{k=0}^{1} \Gamma(\gamma_k)} \prod_{k=0}^{1} \psi_{zwk}^{\,p_{zwk} + \gamma_k - 1} \Bigg) d\Psi
  \times \int \prod_{d=1}^{D} \Bigg( \frac{\Gamma(\sum_{z=1}^{T} \alpha_z)}{\prod_{z=1}^{T} \Gamma(\alpha_z)} \prod_{z=1}^{T} \theta_{dz}^{\,q_{dz} + \alpha_z - 1} \Bigg) d\Theta

\propto \prod_{z=1}^{T} \frac{\prod_{v=1}^{W} \Gamma(n_{zv} + \beta_v)}{\Gamma\big(\sum_{v=1}^{W} (n_{zv} + \beta_v)\big)}
  \prod_{z=1}^{T} \prod_{w=1}^{W} \frac{\prod_{v=1}^{W} \Gamma(m_{zwv} + \delta_v)}{\Gamma\big(\sum_{v=1}^{W} (m_{zwv} + \delta_v)\big)}
  \prod_{z=1}^{T} \prod_{w=1}^{W} \frac{\prod_{k=0}^{1} \Gamma(p_{zwk} + \gamma_k)}{\Gamma\big(\sum_{k=0}^{1} (p_{zwk} + \gamma_k)\big)}
  \prod_{d=1}^{D} \frac{\prod_{z=1}^{T} \Gamma(q_{dz} + \alpha_z)}{\Gamma\big(\sum_{z=1}^{T} (q_{dz} + \alpha_z)\big)}

Using the chain rule and Γ(α) = (α - 1)Γ(α - 1), we can obtain the conditional probability conveniently:

P(z_i^{(d)}, x_i^{(d)} \mid \mathbf{w}, z_{-i}^{(d)}, x_{-i}^{(d)}, \alpha, \beta, \gamma, \delta)
= \frac{P(w_i^{(d)}, z_i^{(d)}, x_i^{(d)} \mid w_{-i}^{(d)}, z_{-i}^{(d)}, x_{-i}^{(d)}, \alpha, \beta, \gamma, \delta)}{P(w_i^{(d)} \mid w_{-i}^{(d)}, z_{-i}^{(d)}, x_{-i}^{(d)}, \alpha, \beta, \gamma, \delta)}
\propto \big(\gamma_{x_i^{(d)}} + p_{z_{i-1}^{(d)} w_{i-1}^{(d)} x_i^{(d)}} - 1\big)\big(\alpha_{z_i^{(d)}} + q_{d z_i^{(d)}} - 1\big)
\times
\begin{cases}
\dfrac{\beta_{w_i^{(d)}} + n_{z_i^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} (\beta_v + n_{z_i^{(d)} v}) - 1} & \text{if } x_i^{(d)} = 0 \\[2ex]
\dfrac{\delta_{w_i^{(d)}} + m_{z_i^{(d)} w_{i-1}^{(d)} w_i^{(d)}} - 1}{\sum_{v=1}^{W} (\delta_v + m_{z_i^{(d)} w_{i-1}^{(d)} v}) - 1} & \text{if } x_i^{(d)} = 1
\end{cases}

Or equivalently,

P(z_i^{(d)} \mid \mathbf{w}, z_{-i}^{(d)}, \mathbf{x}, \alpha, \beta, \gamma, \delta)
\propto \big(\alpha_{z_i^{(d)}} + q_{d z_i^{(d)}} - 1\big) \times \{\text{the same case expression as above}\}

And,

P(x_i^{(d)} \mid \mathbf{w}, \mathbf{z}, x_{-i}^{(d)}, \alpha, \beta, \gamma, \delta)
\propto \big(\gamma_{x_i^{(d)}} + p_{z_{i-1}^{(d)} w_{i-1}^{(d)} x_i^{(d)}} - 1\big) \times \{\text{the same case expression as above}\}
References

[1] C. Andrieu, N. de Freitas, A. Doucet, and M. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5-43, 2003.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[3] C. Chemudugunta, P. Smyth, and M. Steyvers. Modeling general and specific aspects of documents with a probabilistic topic model. In Advances in Neural Information Processing Systems 19, pages 241-248, 2007.
[4] K. Church and W. Gale. Concordances for parallel text. In Proceedings of the Seventh Annual Conference of the UW Centre for the New OED and Text Research, pages 40-62, 1991.
[5] K. Church and P. Hanks. Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL), pages 76-83, 1989.
[6] K. W. Church, W. Gale, P. Hanks, and D. Hindle. Using statistics in lexical analysis. In Lexical Acquisition: Using On-line Resources to Build a Lexicon, pages 115-164. Lawrence Erlbaum, 1991.
[7] T. E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.
[8] D. A. Evans, K. Ginther-Webster, M. Hart, R. G. Lefferts, and I. A. Monarch. Automatic indexing using selective NLP and first-order thesauri. In Proceedings of Intelligent Multimedia Information Retrieval Systems and Management (RIAO'91), pages 624-643, 1991.
[9] J. Fagan. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science, 40(2):115-139, 1989.
[10] T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum. Integrating topics and syntax. In Advances in Neural Information Processing Systems 17, 2005.
[11] J. Hodges, S. Yie, R. Reighart, and L. Boggess. An automated system that assists in the generation of document indexes. Natural Language Engineering, 2(2):137-160, 1996.
[12] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In Proceedings of the NATO Advanced Study Institute on Learning in Graphical Models, pages 105-161, 1998.
[13] J. S. Justeson and S. M. Katz. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1:9-27, 1995.
[14] D. J. C. MacKay and L. Peto. A hierarchical Dirichlet language model. Natural Language Engineering, 1(3):1-19, 1994.
[15] C. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.
[16] A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
[17] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, 2002.
[18] M. Mitra, C. Buckley, A. Singhal, and C. Cardie. An analysis of statistical and syntactic phrases. In Proceedings of RIAO-97, 5th International Conference, pages 200-214, Montreal, CA, 1997.
[19] F. Smadja. Retrieving collocations from text: Xtract. Computational Linguistics, 19:143-177, 1993.
[20] M. Steyvers and T. Griffiths. Matlab topic modeling toolbox 1.3. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm, 2005.
[21] T. Strzalkowski. Natural language information retrieval. Information Processing and Management, 31(3):397-417, 1995.
[22] H. Wallach. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[23] X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006.
[24] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179-214, 2004.