Categorical Distributions in Natural Language Processing Version 0.1
MURAWAKI Yugo
12 May 2016
Categorical distribution
Suppose a random variable x takes one of K values. x is generated according to a categorical distribution Cat(θ), where, for example, θ = (0.1, 0.6, 0.3) over the three values R, Y, G.
In many task settings, we do not know the true θ and need to infer it from observed data x = (x_1, ⋯, x_N). Once we infer θ, we often want to predict a new variable x′.
NOTE: In Bayesian settings, θ is usually integrated out and x′ is predicted directly from x.
Categorical distributions are a building block of natural language models
- N-gram language model (predicting the next word)
- POS tagging based on a hidden Markov model (HMM)
- Probabilistic context-free grammar (PCFG)
- Topic model (latent Dirichlet allocation, LDA)
Example: HMM-based POS tagging
BOS DT NN VBZ VBN EOS
the sun has risen
Let K be the number of POS tags and V be the vocabulary size (ignore BOS and EOS for simplicity).
The transition probabilities can be computed using K categorical distributions (θ^TRANS_DT, θ^TRANS_NN, ⋯), each with dimension K:

θ^TRANS_DT = (0.21, 0.27, 0.09, ⋯) over NN, NNS, ADJ, ⋯

Similarly, the emission probabilities can be computed using K categorical distributions (θ^EMIT_DT, θ^EMIT_NN, ⋯), each with dimension V:

θ^EMIT_NN = (0.012, 0.002, 0.005, ⋯) over sun, rose, cat, ⋯

Outline
- Categorical and multinomial distributions
- Conjugacy and the posterior predictive distribution
- LDA (Latent Dirichlet allocation) as an application
- Gibbs sampling for inference
Categorical distribution: 1 observation
Suppose θ is known.
The probability of generating a random variable x ∈ {1, ⋯, K} is

p(x = k | θ) = θ_k,

where 0 ≤ θ_k ≤ 1 and ∑_{k=1}^K θ_k = 1.
For example, if θ = (0.1, 0.6, 0.3), then p(x = 2|θ) = θ2 = 0.6.
Categorical distribution: N observations
The probability of generating a sequence of random variables with length N, x = (x1, ··· , xN), is
p(x | θ) = ∏_{i=1}^N θ_{x_i} = ∏_{k=1}^K θ_k^{n_k},

where n_k is the number of times value k is observed in x (∑_{k=1}^K n_k = N).
NOTE: p(x|θ) does not depend on the ordering of x but only on the number of times each value is observed in the sequence (the sufficient statistics).
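To make the note concrete, here is a small sketch (the function names are illustrative, not from the slides) computing p(x|θ) both token by token and from the counts alone:

```python
import math
from collections import Counter

def seq_likelihood(xs, theta):
    """p(x | theta) as a product over individual observations."""
    return math.prod(theta[x] for x in xs)

def count_likelihood(xs, theta):
    """The same probability computed from the sufficient statistics n_k."""
    return math.prod(theta[k] ** n for k, n in Counter(xs).items())

theta = {"R": 0.1, "Y": 0.6, "G": 0.3}
xs = ["Y", "R", "Y", "G"]
# Both routes give 0.6 * 0.1 * 0.6 * 0.3 = 0.0108, and any
# reordering of xs leaves the value unchanged.
```

Because only the counts matter, the two functions agree for every permutation of xs.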
Multinomial distribution: the probability of counts
Replacing x = (x1, ··· , xN) with the counts n = (n1, ··· , nK), we obtain the multinomial distribution:
Multi(n | θ) = (N choose n_1, ⋯, n_K) ∏_{k=1}^K θ_k^{n_k}
            = N! / (n_1! ⋯ n_K!) ∏_{k=1}^K θ_k^{n_k}
            = Γ(N + 1) / ∏_{k=1}^K Γ(n_k + 1) × ∏_{k=1}^K θ_k^{n_k}.
Note Γ(n) = (n − 1)!. The second factor is the same as p(x|θ). The first factor is the number of distinct sequences x that map to the counts n.
NOTE: In NLP, the categorical distribution is often called a multinomial distribution.
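The count probability can be evaluated with log-gamma exactly as in the Γ form above; a minimal sketch with illustrative names:

```python
import math

def multinomial_pmf(counts, theta):
    """Multi(n | theta): the multinomial coefficient times prod theta_k^{n_k},
    with the coefficient computed as Gamma(N+1) / prod Gamma(n_k + 1)."""
    N = sum(counts)
    log_coef = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)
    log_prob = sum(n * math.log(t) for n, t in zip(counts, theta) if n > 0)
    return math.exp(log_coef + log_prob)

# n = (1, 2, 1) over (R, Y, G) with theta = (0.1, 0.6, 0.3):
# the coefficient is 4!/(1! 2! 1!) = 12, so the probability is
# 12 * 0.1 * 0.6^2 * 0.3 = 0.1296.
p = multinomial_pmf([1, 2, 1], [0.1, 0.6, 0.3])
```

Working in log space avoids overflow of the factorials for large N.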
Estimating θ from N observations
If θ is unknown, we may want to infer it from x.
First, the likelihood function is defined as follows:
L(θ; x) = Cat(x|θ).
In maximum likelihood estimation (MLE), we estimate θ^ML that satisfies:

θ^ML = argmax_θ L(θ; x) = argmax_θ Cat(x | θ).

It has the following analytical solution (use the method of Lagrange multipliers):

θ^ML_k = n_k / N.
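A minimal sketch of the ML estimate (the names are illustrative):

```python
from collections import Counter

def mle_theta(xs, values):
    """theta_k^ML = n_k / N for each value k."""
    counts = Counter(xs)
    return {k: counts[k] / len(xs) for k in values}

theta_ml = mle_theta(["R", "R", "Y", "R", "G"], ["R", "Y", "G"])
# {"R": 0.6, "Y": 0.2, "G": 0.2}
```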
A problem with MLE
If n_k = 0, then θ^ML_k = 0 (the zero-frequency problem).
Would more observed data fix the problem? Not necessarily. Natural language symbols usually follow a power law (cf. Zipf's law), so there are always low-frequency symbols.
Bayes’ theorem
Now we introduce p(θ), a prior distribution over θ. A prior distribution is an (arbitrary) distribution that expresses our beliefs about θ.
Bayes’ theorem:

p(θ | x) = p(x | θ) p(θ) / p(x) ∝ p(x | θ) p(θ).

p(θ | x) is the posterior distribution, p(x | θ) is the likelihood, and p(θ) is the prior distribution. p(θ | x) can be interpreted as the distribution of θ after observing the data.
A prior for categorical distributions
For a categorical distribution, we usually set the Dirichlet distribution Dir(θ | α) as its prior because it has nice analytical properties:

p(θ | x, α) ∝ Cat(x | θ) Dir(θ | α),

where

Dir(θ | α) = Γ(A) / (Γ(α_1) ⋯ Γ(α_K)) ∏_{k=1}^K θ_k^{α_k − 1},

α = (α_1, ⋯, α_K), α_k > 0, and A = ∑_{k=1}^K α_k. α is a parameter given a priori (a hyperparameter). We usually set α_i = α_j (a symmetric prior). α_k can be a non-integer real number (e.g. 1.5).
Note that the Dirichlet distribution resembles a categorical distribution:

Cat(x | θ) = ∏_{k=1}^K θ_k^{n_k}.
Posterior distribution
Posterior distribution p(θ|x, α) ∝ Cat(x|θ)Dir(θ|α) is (proof omitted):
p(θ | x, α) = Γ(N + A) / (Γ(n_1 + α_1) ⋯ Γ(n_K + α_K)) ∏_{k=1}^K θ_k^{n_k + α_k − 1}
            = Dir(θ | n + α).
The posterior has the same form as the prior. The parameter α is replaced with n + α. This property is called conjugacy.
Maximum a posteriori (MAP) estimation
Maximum a posteriori (MAP) estimation employs θMAP that maximizes the posterior probability:
θ^MAP = argmax_θ p(θ | x) = argmax_θ p(x | θ) p(θ).
For the Dirichlet-categorical posterior, θ^MAP = argmax_θ Dir(θ | n + α), and the solution is

θ^MAP_k = (n_k + α_k − 1) / ∑_{k′=1}^K (n_{k′} + α_{k′} − 1) = (n_k + α_k − 1) / (N + A − K).
Compare it with the ML estimate θ^ML_k = n_k / N. The additional term α_k − 1 is used to smooth the estimate (additive smoothing).
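A sketch of the MAP estimate, showing how the pseudo-counts avoid the zero-frequency problem (the names and the choice α_k = 2 are illustrative):

```python
from collections import Counter

def map_theta(xs, alphas):
    """theta_k^MAP = (n_k + alpha_k - 1) / (N + A - K)."""
    counts = Counter(xs)
    N, K, A = len(xs), len(alphas), sum(alphas.values())
    return {k: (counts[k] + a - 1) / (N + A - K) for k, a in alphas.items()}

# G is never observed, yet alpha_k = 2 keeps theta_G^MAP above zero:
theta_map = map_theta(["R", "R", "Y", "R", "R"], {"R": 2.0, "Y": 2.0, "G": 2.0})
# {"R": 0.625, "Y": 0.25, "G": 0.125}
```

With α_k = 1 for all k, the MAP estimate reduces to the ML estimate.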
Marginal likelihood / posterior predictive distribution
Instead of using one estimate of θ, we now consider all possible values of θ. To do so, we integrate out θ to obtain the marginal likelihood:

p(x) = ∫ p(x | θ) p(θ) dθ.

The probability of generating a new variable x′ given x (the predictive probability) is:

p(x′ | x) = ∫ p(x′ | θ) p(θ | x) dθ = ∫ p(x′ | θ) p(x | θ) p(θ) / p(x) dθ.
Thanks to conjugacy, we have analytical solutions for Dirichlet-categorical (next slide).
Marginal likelihood for Dirichlet-categorical
p(x | α) = ∫ Cat(x | θ) Dir(θ | α) dθ
         = ∫ ∏_{k=1}^K θ_k^{n_k} × Γ(A) / (Γ(α_1) ⋯ Γ(α_K)) ∏_{k=1}^K θ_k^{α_k − 1} dθ
         = Γ(A) / (Γ(α_1) ⋯ Γ(α_K)) ∫ ∏_{k=1}^K θ_k^{n_k + α_k − 1} dθ
         = Γ(A) / ∏_{k=1}^K Γ(α_k) × ∏_{k=1}^K Γ(n_k + α_k) / Γ(N + A)
         = Γ(A) / Γ(N + A) ∏_{k=1}^K Γ(n_k + α_k) / Γ(α_k).
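The closed form is easy to evaluate stably in log space; a sketch with illustrative names:

```python
import math
from collections import Counter

def log_marginal(xs, alphas):
    """log p(x | alpha) = log Gamma(A) - log Gamma(N + A)
       + sum_k [log Gamma(n_k + alpha_k) - log Gamma(alpha_k)]."""
    counts = Counter(xs)
    A, N = sum(alphas.values()), len(xs)
    out = math.lgamma(A) - math.lgamma(N + A)
    for k, a in alphas.items():
        out += math.lgamma(counts[k] + a) - math.lgamma(a)
    return out

# x = (R, R, Y) with alpha = (1, 1, 1):
# Gamma(3)/Gamma(6) * Gamma(3) * Gamma(2) * Gamma(1) = (2/120) * 2 = 1/30.
p = math.exp(log_marginal(["R", "R", "Y"], {"R": 1.0, "Y": 1.0, "G": 1.0}))
```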
Posterior predictive for Dirichlet-categorical
The probability of generating x′ after observing x is
p(x′ = k′ | x, α) = p(x′ = k′, x | α) / p(x | α)
= [Γ(A) / Γ(N + 1 + A) ∏_{k=1}^K Γ(n_k + I(k = k′) + α_k) / Γ(α_k)] / [Γ(A) / Γ(N + A) ∏_{k=1}^K Γ(n_k + α_k) / Γ(α_k)]
= Γ(N + A) / Γ(N + 1 + A) × Γ(n_{k′} + 1 + α_{k′}) / Γ(n_{k′} + α_{k′})
= (n_{k′} + α_{k′}) / (N + A),

where I(statement) is an indicator function: it gives 1 if the statement is true and 0 otherwise.
In analogy with nk (the count of observations), αk is called a pseudo-count.
Sequential updates of x
Let x_{1:N} = (x_1, ⋯, x_N). p(x_{1:N} | α) is the product of the probabilities of generating x one by one.
p(x_{1:N} | α) = p(x_N | x_{1:N−1}, α) p(x_{1:N−1} | α)
= (n_{x_N}(x_{1:N−1}) + α_{x_N}) / (N − 1 + A) × p(x_{1:N−1} | α)
= (n_{x_N}(x_{1:N−1}) + α_{x_N}) / (N − 1 + A) × p(x_{N−1} | x_{1:N−2}, α) p(x_{1:N−2} | α)
= (n_{x_N}(x_{1:N−1}) + α_{x_N}) / (N − 1 + A) × (n_{x_{N−1}}(x_{1:N−2}) + α_{x_{N−1}}) / (N − 2 + A) × p(x_{1:N−2} | α)
= ∏_{k=1}^K ∏_{i=1}^{n_k(x_{1:N})} (i − 1 + α_k) / ∏_{i=1}^N (i − 1 + A)
= Γ(A) / Γ(N + A) ∏_{k=1}^K Γ(n_k(x_{1:N}) + α_k) / Γ(α_k),

where n_k(x) is the number of times k appears in x.
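A sketch verifying that the one-by-one product matches the closed form, and that the value is order-independent (the names and the particular α are illustrative):

```python
import math
from collections import Counter

def sequential_prob(xs, alphas):
    """Generate x one by one: step i multiplies in
       (count of x_i so far + alpha_{x_i}) / (i - 1 + A)."""
    A = sum(alphas.values())
    counts = Counter()
    p = 1.0
    for i, x in enumerate(xs):
        p *= (counts[x] + alphas[x]) / (i + A)
        counts[x] += 1
    return p

def closed_form_prob(xs, alphas):
    """Gamma(A)/Gamma(N+A) * prod_k Gamma(n_k + alpha_k)/Gamma(alpha_k)."""
    counts = Counter(xs)
    A, N = sum(alphas.values()), len(xs)
    log_p = math.lgamma(A) - math.lgamma(N + A)
    for k, a in alphas.items():
        log_p += math.lgamma(counts[k] + a) - math.lgamma(a)
    return math.exp(log_p)

alphas = {"R": 0.5, "Y": 1.5, "G": 1.0}
xs = ["Y", "R", "Y", "Y", "G"]
# sequential_prob agrees with the closed form, for xs in any order.
```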
Exchangeability
As stated earlier, p(x|θ) does not depend on the ordering of x but only on the number of times each value is observed in the sequence (the sufficient statistics). Let us create x′ by swapping an arbitrary pair of variables x_i and x_j in x. Since the ordering does not matter, the following holds:
p(x′|α) = p(x|α)
This property is called exchangeability.
As we will see later, exchangeability makes inference easy.
Example: Sequential updates of x
Suppose a random variable x takes one of three values: R, Y, G. The probability of generating the sequence x = (R, R, Y) is

p(x | α) = (0 + α_R) / (0 + A) × (1 + α_R) / (1 + A) × (0 + α_Y) / (2 + A).

Verify that reordering does not affect the probability.
Generative story
For Dirichlet-categorical, θ is first generated from a Dirichlet distribution with parameter α and then each random variable xn is generated from a categorical distribution with parameter θ. This process can be summarized as follows:
θ | α ∼ Dir(α)
x_n | θ ∼ Cat(θ)
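This two-step story can be simulated directly; a sketch using the standard library's Gamma-based Dirichlet draw and weighted choice (the names are illustrative):

```python
import random

def draw_dirichlet(alphas, rng):
    """theta ~ Dir(alpha), via normalized Gamma(alpha_k, 1) draws."""
    gs = {k: rng.gammavariate(a, 1.0) for k, a in alphas.items()}
    total = sum(gs.values())
    return {k: g / total for k, g in gs.items()}

rng = random.Random(0)
# theta | alpha ~ Dir(alpha)
theta = draw_dirichlet({"R": 1.0, "Y": 1.0, "G": 1.0}, rng)
# x_n | theta ~ Cat(theta), drawn with the standard library's weighted choice
xs = rng.choices(list(theta), weights=list(theta.values()), k=10)
```

Normalizing independent Gamma(α_k, 1) draws is the standard way to sample from a Dirichlet distribution.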
Application: LDA (Latent Dirichlet allocation)
NOTE: We use the modified model by Griffiths et al. (2004), not the original one by Blei et al. (2002).
We are given D documents. Document i contains N_i words, and w_{i,j} is the j-th word in document i. LDA assumes that each document is a mixture of K topics. Document i is associated with a mixing proportion θ_i. Each topic k has a corresponding vocabulary distribution Cat(φ_k). Each word w_{i,j} is tied with a latent variable z_{i,j} = k′. z_{i,j} is generated from Cat(θ_i), and then w_{i,j} is generated from Cat(φ_{k′}).
Application: LDA (Latent Dirichlet allocation)
The generative story can be represented by a directed graph called a graphical model.
Each arrow indicates a dependency between variables. Each plate represents repetition, with the number of repetitions shown in the corner. The shaded node represents an observed variable.
Application: LDA (Latent Dirichlet allocation)
The generative story is as follows:

1. For each topic k ∈ {1, ⋯, K}, draw a vocabulary distribution φ_k ∼ Dir(β).
2. For each document i ∈ {1, ⋯, D}, draw a topic distribution θ_i ∼ Dir(α).
3. For each word j ∈ {1, ⋯, N_i} in document i,
   1. draw a topic assignment z_{i,j} ∼ Cat(θ_i), and
   2. draw a word w_{i,j} ∼ Cat(φ_{z_{i,j}}).

Let W be the set of all observed words in the collection of documents and Z be the corresponding latent topic assignments. Given W, we want to infer Z.
Application: LDA (Latent Dirichlet allocation)
We begin with the joint probability:

p(W, Z, θ, φ | α, β) = ∏_{k=1}^K p(φ_k | β) × ∏_{i=1}^D p(θ_i | α) ∏_{j=1}^{N_i} p(z_{i,j} | θ_i) p(w_{i,j} | φ_{z_{i,j}}).

θ_i and φ_k can be integrated out to obtain the marginal likelihood:

p(W, Z | α, β) = ∫∫ p(W, Z, θ, φ | α, β) dθ dφ
= ∏_{i=1}^D [ Γ(α_(·)) / Γ(n^(·)_{i,(·)} + α_(·)) × ∏_{k=1}^K Γ(n^k_{i,(·)} + α_k) / Γ(α_k) ]
× ∏_{k=1}^K [ Γ(β_(·)) / Γ(n^k_{(·),(·)} + β_(·)) × ∏_v Γ(n^k_{(·),v} + β_v) / Γ(β_v) ],

where n^k_{i,v} is the number of times word type v in document i is tied with topic k, and (·) denotes summation over the corresponding index. We can see that LDA is simply a composite of D + K Dirichlet-categorical distributions.
Application: LDA (Latent Dirichlet allocation)
Now consider sequential updates of topic assignments and words. It is easy to verify that exchangeability holds.
Doc ID | Topic    | Word     | Prob.
1      | Politics | Obama    | (0 + α_Politics) / (0 + α_(·)) × (0 + β_Obama) / (0 + β_(·))
1      | Economy  | FRB      | (0 + α_Economy) / (1 + α_(·)) × (0 + β_FRB) / (0 + β_(·))
1      | Politics | election | (1 + α_Politics) / (2 + α_(·)) × (0 + β_election) / (1 + β_(·))
1      | Politics | Obama    | (2 + α_Politics) / (3 + α_(·)) × (1 + β_Obama) / (2 + β_(·))
2      | Economy  | Yellen   | (0 + α_Economy) / (0 + α_(·)) × (0 + β_Yellen) / (1 + β_(·))
2      | Economy  | FRB      | (1 + α_Economy) / (1 + α_(·)) × (1 + β_FRB) / (2 + β_(·))
Application: LDA (Latent Dirichlet allocation)
We want to infer Z that satisfies
argmax_Z p(Z | W, α, β) = argmax_Z p(W, Z | α, β)
The number of possible assignments is K^{|Z|}. The search space is too large for exhaustive search.
We resort to approximate inference using random walks. Specifically, we often use Gibbs sampling as a theoretically grounded algorithm.
Sampling
Sampling is the procedure of obtaining a sample from a probability distribution. It is typically implemented on top of a uniform random number generator r (0 ≤ r < 1).
Suppose x is a sample from θ = (0.1, 0.6, 0.3). Using r, we can create a sample x as follows:

x = 1 if 0 ≤ r < 0.1,
x = 2 if 0.1 ≤ r < 0.7, and
x = 3 if 0.7 ≤ r < 1.
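The threshold rule above can be written as a short inverse-CDF routine (the names are illustrative; in practice r comes from the uniform random number generator):

```python
def sample_with_r(theta, r):
    """Map a uniform r in [0, 1) to a value via cumulative thresholds."""
    cum = 0.0
    for value, p in theta:
        cum += p
        if r < cum:
            return value
    return value  # guard against floating-point round-off

theta = [(1, 0.1), (2, 0.6), (3, 0.3)]
# r = 0.05 falls in [0, 0.1)   -> 1
# r = 0.30 falls in [0.1, 0.7) -> 2
# r = 0.95 falls in [0.7, 1)   -> 3
```

Calling sample_with_r(theta, random.random()) repeatedly yields samples distributed according to θ.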
Sampling from Dirichlet-categorical posterior predictive
Recall that the posterior predictive distribution of Dirichlet-categorical is

p(x′ = k′ | x, α) = (n_{k′}(x) + α_{k′}) / (n_(·)(x) + α_(·)).

Let x = (R, R, Y, R, R) and α = (α_R, α_Y, α_G) = (1, 1, 1). Then
p(x′ = R | x, α) = (4 + 1) / (5 + 3) = 0.625,
p(x′ = Y | x, α) = (1 + 1) / (5 + 3) = 0.25,
p(x′ = G | x, α) = (0 + 1) / (5 + 3) = 0.125.

Draw a sample from θ = (0.625, 0.25, 0.125), and that is the sample we want to obtain.
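A sketch computing the three predictive probabilities above (the names are illustrative):

```python
from collections import Counter

def posterior_predictive(xs, alphas):
    """p(x' = k | x, alpha) = (n_k + alpha_k) / (N + A)."""
    counts = Counter(xs)
    N, A = len(xs), sum(alphas.values())
    return {k: (counts[k] + a) / (N + A) for k, a in alphas.items()}

probs = posterior_predictive(["R", "R", "Y", "R", "R"], {"R": 1, "Y": 1, "G": 1})
# {"R": 0.625, "Y": 0.25, "G": 0.125}, matching the worked example
```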
Gibbs sampling
LDA is a composite of Dirichlet-categorical distributions p(Z | W, α, β). Directly sampling from this complex distribution is difficult. However, it is easy to draw a sample from p(z_i | W, Z^{−i}, α, β), where Z^{−i} = Z \ {z_i} and i is now a corpus-wide index. Gibbs sampling is a method of obtaining samples of Z by repeatedly sampling from p(z_i | W, Z^{−i}, α, β). The algorithmic overview of Gibbs sampling is as follows:

1. Initialize Z.
2. For each iteration τ = 1, ⋯, T:
   - randomly select i,
   - sample z′_i from p(z_i | W, Z^{−i}, α, β), and
   - update Z by replacing z_i with z′_i.

After a sufficient number of iterations, Z can be seen as a sample from p(Z | W, α, β).
Exchangeability again
Gibbs sampling is applicable if we can easily draw a sample from p(z_i | Z^{−i}).
Due to exchangeability, the probability of observing the sequence (z_1, ⋯, z_{i−1}, z_i, z_{i+1}, ⋯, z_N) is the same as the probability of observing the sequence (z_1, ⋯, z_{i−1}, z_{i+1}, ⋯, z_N, z_i).
This means that p(z_i | Z^{−i}) is the posterior predictive distribution of z_i after observing (z_1, ⋯, z_{i−1}, z_{i+1}, ⋯, z_N), the sequence excluding z_i.
Gibbs sampling for LDA 1/2
p(z_i = k′ | Z^{−i}, W, α, β)
∝ p(z_i = k′, Z^{−i}, W | α, β)
= ∏_{j=1}^D [ Γ(α_(·)) / Γ(n^(·)_{j,(·)}(Z^{−i}) + I(w_i ∈ W_j) + α_(·)) × ∏_{k=1}^K Γ(n^k_{j,(·)}(Z^{−i}) + I(k = k′ & w_i ∈ W_j) + α_k) / Γ(α_k) ]
× ∏_{k=1}^K [ Γ(β_(·)) / Γ(n^k_{(·),(·)}(Z^{−i}) + I(k = k′) + β_(·)) × ∏_v Γ(n^k_{(·),v}(Z^{−i}) + I(k = k′ & w_i = v) + β_v) / Γ(β_v) ].
W_j is the sequence of words in the j-th document. Most terms are irrelevant to z_i and can be dropped.

Gibbs sampling for LDA 2/2
p(z_i = k′ | Z^{−i}, W, α, β)
∝ Γ(n^(·)_{j′,(·)}(Z^{−i}) + α_(·)) / Γ(n^(·)_{j′,(·)}(Z^{−i}) + 1 + α_(·)) × Γ(n^{k′}_{j′,(·)}(Z^{−i}) + 1 + α_{k′}) / Γ(n^{k′}_{j′,(·)}(Z^{−i}) + α_{k′})
× Γ(n^{k′}_{(·),(·)}(Z^{−i}) + β_(·)) / Γ(n^{k′}_{(·),(·)}(Z^{−i}) + 1 + β_(·)) × Γ(n^{k′}_{(·),v′}(Z^{−i}) + 1 + β_{v′}) / Γ(n^{k′}_{(·),v′}(Z^{−i}) + β_{v′})
= (n^{k′}_{j′,(·)}(Z^{−i}) + α_{k′}) / (n^(·)_{j′,(·)}(Z^{−i}) + α_(·)) × (n^{k′}_{(·),v′}(Z^{−i}) + β_{v′}) / (n^{k′}_{(·),(·)}(Z^{−i}) + β_(·)),

where v′ = w_i and j′ is the index of the document w_i belongs to. The first term is the posterior predictive probability of adding a topic assignment with value k′ to document j′ (its denominator can be dropped because it is constant with respect to z_i). The second term is the posterior predictive probability of drawing word type v′ from the vocabulary distribution of topic k′.

Intuitive explanation of Gibbs sampling for LDA
p(z_i = k′ | Z^{−i}, W, α, β) ∝ (n^{k′}_{j′,(·)}(Z^{−i}) + α_{k′}) / (n^(·)_{j′,(·)}(Z^{−i}) + α_(·)) × (n^{k′}_{(·),v′}(Z^{−i}) + β_{v′}) / (n^{k′}_{(·),(·)}(Z^{−i}) + β_(·))
To make the first term larger, we need to select a k′ that appears often in document j′. To make the second term larger, we need to select a k′ from which w_i = v′ is often drawn. z_i reflects a balance between the two terms. After repeating the procedure, Z will generally come to satisfy both conditions.
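The whole procedure fits in a short collapsed Gibbs sampler. This is a minimal sketch, assuming symmetric scalar hyperparameters α and β (so β_(·) = Vβ) and a systematic sweep over all tokens per iteration instead of random selection of i; all names and the toy corpus are illustrative:

```python
import random
from collections import defaultdict

def gibbs_lda(docs, K, alpha, beta, V, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric priors:
    p(z_i = k | Z^{-i}, W) ∝ (n_jk + alpha) * (n_kv + beta) / (n_k + V*beta)."""
    rng = random.Random(seed)
    n_jk = defaultdict(int)  # topic counts within each document
    n_kv = defaultdict(int)  # word-type counts within each topic
    n_k = defaultdict(int)   # total word count of each topic
    z = []                   # one topic assignment per token, corpus-wide
    for j, doc in enumerate(docs):       # random initialization of Z
        for w in doc:
            k = rng.randrange(K)
            z.append(k)
            n_jk[j, k] += 1; n_kv[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        i = 0
        for j, doc in enumerate(docs):
            for w in doc:
                k = z[i]
                # remove token i to obtain the Z^{-i} statistics
                n_jk[j, k] -= 1; n_kv[k, w] -= 1; n_k[k] -= 1
                # posterior predictive weight of each candidate topic
                weights = [(n_jk[j, t] + alpha) * (n_kv[t, w] + beta)
                           / (n_k[t] + V * beta) for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[i] = k
                n_jk[j, k] += 1; n_kv[k, w] += 1; n_k[k] += 1
                i += 1
    return z

docs = [["obama", "election", "obama"], ["frb", "yellen", "frb"]]
z = gibbs_lda(docs, K=2, alpha=0.1, beta=0.1, V=4, iters=30)
```

The document-side denominator is dropped, as noted above, because it is constant with respect to z_i; the topic-side denominator n_k + Vβ must be kept because it varies with the candidate topic.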