
Categorical Distributions in Natural Language Processing Version 0.1

MURAWAKI Yugo

12 May 2016


Suppose x takes one of K values. x is generated according to categorical distribution Cat(θ), where

θ = (0.1, 0.6, 0.3), corresponding to the three values R, Y, G.

In many task settings, we do not know the true θ and need to infer it from observed data x = (x1, ··· , xN). Once we infer θ, we often want to predict new variable x′.

NOTE: In Bayesian settings, θ is usually integrated out and x′ is predicted directly from x.

Categorical distributions are a building block of natural language models

- N-gram language model (predicting the next word)
- POS tagging based on a hidden Markov model (HMM)
- Probabilistic context-free grammar (PCFG)
- Topic model (Latent Dirichlet Allocation, LDA)

Example: HMM-based POS tagging

Tags:  BOS  DT   NN   VBZ  VBN    EOS
Words:      the  sun  has  risen

Let K be the number of POS tags and V be the vocabulary size (ignore BOS and EOS for simplicity).

The transition probabilities can be computed using K categorical distributions (θ^TRANS_DT, θ^TRANS_NN, ···), each with dimension K.

θ^TRANS_DT = (0.21, 0.27, 0.09, ···), corresponding to NN, NNS, ADJ, ···

Similarly, the emission probabilities can be computed using K categorical distributions (θ^EMIT_DT, θ^EMIT_NN, ···), each with dimension V.

θ^EMIT_NN = (0.012, 0.002, 0.005, ···), corresponding to sun, rose, cat, ···

Outline

- Categorical and multinomial distributions
- Conjugacy and posterior predictive distribution
- LDA (Latent Dirichlet Allocation) as an application for inference

Categorical distribution: 1 observation

Suppose θ is known.

The probability of generating a random variable x ∈ {1, ··· , K} is

p(x = k|θ) = θ_k, where 0 ≤ θ_k ≤ 1 and ∑_{k=1}^{K} θ_k = 1.

For example, if θ = (0.1, 0.6, 0.3), then p(x = 2|θ) = θ2 = 0.6.
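As a minimal sketch (not part of the original slides), evaluating this probability amounts to indexing into θ; the helper name cat_pmf is made up for illustration:

```python
# Minimal sketch: evaluating Cat(theta) for one observation.
theta = [0.1, 0.6, 0.3]  # theta_1, theta_2, theta_3

def cat_pmf(k, theta):
    """p(x = k | theta) for k in {1, ..., K}, 1-indexed as in the slides."""
    return theta[k - 1]

print(cat_pmf(2, theta))  # 0.6
```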

Categorical distribution: N observations

The probability of generating a sequence of random variables with length N, x = (x1, ··· , xN), is

p(x|θ) = ∏_{i=1}^{N} θ_{x_i} = ∏_{k=1}^{K} θ_k^{n_k},

where n_k is the number of times value k is observed in x (∑_{k=1}^{K} n_k = N).

NOTE: p(x|θ) does not depend on the ordering of x but only on the number of times each value is observed in the sequence (sufficient statistics).

Multinomial distribution: the probability of counts

Replacing x = (x1, ··· , xN) with the counts n = (n1, ··· , nK), we obtain the multinomial distribution:

Multi(n|θ) = (N choose n_1 ··· n_K) ∏_{k=1}^{K} θ_k^{n_k}
           = N! / (n_1! ··· n_K!) ∏_{k=1}^{K} θ_k^{n_k}
           = Γ(N + 1) / ∏_{k=1}^{K} Γ(n_k + 1) · ∏_{k=1}^{K} θ_k^{n_k}.

Note Γ(n) = (n − 1)!. The second factor is the same as p(x|θ). The first factor is the number of distinct sequences x that map to the counts n.

NOTE: In NLP, the categorical distribution is often called a multinomial distribution.
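A small sketch of the multinomial probability, assuming nothing beyond the formula above; it uses the log-gamma function from Python's standard library to mirror the Γ form:

```python
import math

def multinomial_pmf(n, theta):
    """Multi(n|theta) = Gamma(N+1) / prod_k Gamma(n_k+1) * prod_k theta_k^{n_k}."""
    N = sum(n)
    log_coef = math.lgamma(N + 1) - sum(math.lgamma(n_k + 1) for n_k in n)
    log_prob = sum(n_k * math.log(t_k) for n_k, t_k in zip(n, theta) if n_k > 0)
    return math.exp(log_coef + log_prob)

# e.g. x = (R, R, Y) gives counts n = (2, 1, 0); the coefficient counts the 3 orderings.
print(multinomial_pmf([2, 1, 0], [0.1, 0.6, 0.3]))  # 3 * 0.1^2 * 0.6 = 0.018
```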

Estimating θ from N observations

If θ is unknown, we may want to infer it from x.

First, the likelihood is defined as follows:

L(θ; x) = Cat(x|θ).

In maximum likelihood estimation (MLE), we estimate θ^ML that satisfies:

θ^ML = argmax_θ L(θ; x) = argmax_θ Cat(x|θ).

It has the following analytical solution (obtained with the method of Lagrange multipliers):

θ^ML_k = n_k / N.
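As a quick illustrative sketch of the MLE solution (the data here are made up):

```python
from collections import Counter

def mle_theta(x, K):
    """theta^ML_k = n_k / N for observations x over the values {1, ..., K}."""
    counts = Counter(x)
    N = len(x)
    return [counts[k] / N for k in range(1, K + 1)]

x = [2, 2, 1, 3, 2]       # N = 5 observations
print(mle_theta(x, K=3))  # [0.2, 0.6, 0.2]
```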

A problem with MLE

If n_k = 0, then θ^ML_k = 0 (the zero-frequency problem).

Would larger observed data fix the problem? Not necessarily. Natural language symbols usually follow a power law (cf. Zipf’s law). There are always low-frequency symbols.

Bayes’ theorem

Now we introduce p(θ), a prior distribution over θ. A prior distribution is an (arbitrary) distribution that expresses our beliefs about θ.

Bayes’ theorem:

p(θ|x) = p(x|θ)p(θ) / p(x) ∝ p(x|θ)p(θ).

p(θ|x) is the posterior distribution, p(x|θ) is the likelihood, and p(θ) is the prior distribution. p(θ|x) can be interpreted as the distribution of θ after observing the data.

A prior for categorical distributions

For a categorical distribution, we usually use the Dirichlet distribution Dir(θ|α) as its prior because it has nice analytical properties.

p(θ|x, α) ∝ Cat(x|θ)Dir(θ|α),

where

Dir(θ|α) = Γ(A) / (Γ(α_1) ··· Γ(α_K)) ∏_{k=1}^{K} θ_k^{α_k − 1},

α = (α_1, ··· , α_K), α_k > 0, and A = ∑_{k=1}^{K} α_k. α is a parameter given a priori (a hyperparameter). We usually set α_i = α_j (a symmetric prior). α_k can be a real number (e.g., 1.5).

Note that the Dirichlet distribution resembles a categorical distribution:

Cat(x|θ) = ∏_{k=1}^{K} θ_k^{n_k}.
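A minimal sketch of drawing θ from a Dirichlet prior, assuming NumPy is available (not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.5, 1.5, 1.5])  # symmetric hyperparameter; alpha_k need not be an integer

theta = rng.dirichlet(alpha)       # one draw of theta ~ Dir(alpha)
print(theta, theta.sum())          # the components lie in [0, 1] and sum to 1
```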

Posterior distribution

Posterior distribution p(θ|x, α) ∝ Cat(x|θ)Dir(θ|α) is (proof omitted):

p(θ|x, α) = Γ(N + A) / (Γ(n_1 + α_1) ··· Γ(n_K + α_K)) ∏_{k=1}^{K} θ_k^{n_k + α_k − 1}
          = Dir(θ|n + α).

The posterior has the same form as the prior. The parameter α is replaced with n + α. This property is called conjugacy.
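Conjugacy makes the posterior update a one-line count addition; a tiny sketch with made-up counts:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])  # prior Dir(alpha)
n = np.array([4, 1, 0])            # counts observed in x

posterior_alpha = n + alpha        # conjugacy: the posterior is Dir(n + alpha)
print(posterior_alpha)             # [5. 2. 1.]
```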

Maximum a posteriori (MAP) estimation

Maximum a posteriori (MAP) estimation employs the θ^MAP that maximizes the posterior:

θ^MAP = argmax_θ p(θ|x)
      = argmax_θ p(x|θ)p(θ).

For the Dirichlet-categorical case, θ^MAP = argmax_θ Dir(θ|n + α), and the solution is

θ^MAP_k = (n_k + α_k − 1) / ∑_{k=1}^{K} (n_k + α_k − 1) = (n_k + α_k − 1) / (N + A − K).

Compare it with the ML estimate θ^ML_k = n_k / N. The additional term α_k − 1 is used to smooth the estimate.
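A short sketch of the MAP solution under the assumptions above (α = 2 here is an arbitrary choice that makes the smoothing visible):

```python
import numpy as np

def map_theta(n, alpha):
    """theta^MAP_k = (n_k + alpha_k - 1) / (N + A - K) for a Dir(alpha) prior."""
    n = np.asarray(n, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (n + alpha - 1) / (n.sum() + alpha.sum() - len(n))

# n_3 = 0 no longer yields a zero estimate.
print(map_theta([4, 1, 0], [2.0, 2.0, 2.0]))  # [0.625 0.25 0.125]
```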

Marginal likelihood / posterior predictive distribution

Instead of using one estimate of θ, we now consider all possible values of θ. To do so, we integrate out θ to obtain the marginal likelihood:

p(x) = ∫ p(x|θ)p(θ) dθ.

The probability of generating a new variable x′ given x (the predictive probability) is:

p(x′|x) = ∫ p(x′|θ)p(θ|x) dθ = ∫ p(x′|θ) p(x|θ)p(θ) / p(x) dθ.

Thanks to conjugacy, we have analytical solutions for Dirichlet-categorical (next slide).

Marginal likelihood for Dirichlet-categorical

p(x|α) = ∫ Cat(x|θ)Dir(θ|α) dθ
       = ∫ (∏_{k=1}^{K} θ_k^{n_k}) (Γ(A) / (Γ(α_1) ··· Γ(α_K)) ∏_{k=1}^{K} θ_k^{α_k − 1}) dθ
       = Γ(A) / (Γ(α_1) ··· Γ(α_K)) ∫ ∏_{k=1}^{K} θ_k^{n_k + α_k − 1} dθ
       = Γ(A) / ∏_{k=1}^{K} Γ(α_k) · ∏_{k=1}^{K} Γ(n_k + α_k) / Γ(N + A)
       = Γ(A) / Γ(N + A) ∏_{k=1}^{K} Γ(n_k + α_k) / Γ(α_k)
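A sketch of this closed-form marginal likelihood, computed in log space with the standard-library log-gamma function (the counts and α are made up):

```python
import math

def log_marginal_likelihood(n, alpha):
    """log p(x|alpha) = log G(A) - log G(N+A) + sum_k [log G(n_k+alpha_k) - log G(alpha_k)]."""
    N, A = sum(n), sum(alpha)
    return (math.lgamma(A) - math.lgamma(N + A)
            + sum(math.lgamma(n_k + a_k) - math.lgamma(a_k) for n_k, a_k in zip(n, alpha)))

# Counts for x = (R, R, Y, R, R) with alpha = (1, 1, 1).
print(math.exp(log_marginal_likelihood([4, 1, 0], [1.0, 1.0, 1.0])))
```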

Posterior predictive for Dirichlet-categorical

The probability of generating x′ after observing x is

p(x′ = k′|x, α) = p(x′ = k′, x|α) / p(x|α)
 = [Γ(A) / Γ(N + 1 + A) ∏_{k=1}^{K} Γ(n_k + I(k = k′) + α_k) / Γ(α_k)] / [Γ(A) / Γ(N + A) ∏_{k=1}^{K} Γ(n_k + α_k) / Γ(α_k)]
 = Γ(N + A) / Γ(N + 1 + A) · Γ(n_{k′} + 1 + α_{k′}) / Γ(n_{k′} + α_{k′})
 = (n_{k′} + α_{k′}) / (N + A),

where I(statement) is an indicator function: it gives 1 if the statement is true and 0 otherwise.

In analogy with nk (the count of observations), αk is called a pseudo-count.
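The pseudo-count view leads to a very small implementation; a sketch (the counts reuse the R, Y, G example that appears later in the slides):

```python
def posterior_predictive(n, alpha):
    """p(x' = k | x, alpha) = (n_k + alpha_k) / (N + A); alpha_k acts as a pseudo-count."""
    N, A = sum(n), sum(alpha)
    return [(n_k + a_k) / (N + A) for n_k, a_k in zip(n, alpha)]

# x = (R, R, Y, R, R) -> n = (4, 1, 0), with alpha = (1, 1, 1):
print(posterior_predictive([4, 1, 0], [1.0, 1.0, 1.0]))  # [0.625, 0.25, 0.125]
```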

Sequential updates of x

Let x_{1:N} = (x_1, ··· , x_N). p(x_{1:N}|α) is the product of the probabilities of generating the elements of x one by one.

p(x_{1:N}|α) = p(x_N|x_{1:N−1}, α) p(x_{1:N−1}|α)
 = (n_{x_N}(x_{1:N−1}) + α_{x_N}) / (N − 1 + A) · p(x_{1:N−1}|α)
 = (n_{x_N}(x_{1:N−1}) + α_{x_N}) / (N − 1 + A) · p(x_{N−1}|x_{1:N−2}, α) p(x_{1:N−2}|α)
 = (n_{x_N}(x_{1:N−1}) + α_{x_N}) / (N − 1 + A) · (n_{x_{N−1}}(x_{1:N−2}) + α_{x_{N−1}}) / (N − 2 + A) · p(x_{1:N−2}|α)
 = ···
 = ∏_{k=1}^{K} ∏_{i=1}^{n_k(x_{1:N})} (i − 1 + α_k) / ∏_{i=1}^{N} (i − 1 + A)
 = Γ(A) / Γ(N + A) ∏_{k=1}^{K} Γ(n_k(x_{1:N}) + α_k) / Γ(α_k),

where n_k(x) is the number of times k appears in x.
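A numerical sketch checking that the chain-rule product of predictive probabilities matches the closed-form marginal likelihood (the sequence and α are made up):

```python
import math

def sequential_prob(x, alpha):
    """Chain-rule product of the predictive probabilities p(x_i | x_{1:i-1}, alpha)."""
    counts = {k: 0 for k in alpha}
    prob, A = 1.0, sum(alpha.values())
    for i, x_i in enumerate(x):           # i items have been observed so far
        prob *= (counts[x_i] + alpha[x_i]) / (i + A)
        counts[x_i] += 1
    return prob

def closed_form(x, alpha):
    """Gamma(A)/Gamma(N+A) * prod_k Gamma(n_k + alpha_k)/Gamma(alpha_k)."""
    A, N = sum(alpha.values()), len(x)
    log_p = math.lgamma(A) - math.lgamma(N + A)
    for k, a_k in alpha.items():
        log_p += math.lgamma(x.count(k) + a_k) - math.lgamma(a_k)
    return math.exp(log_p)

x, alpha = ["R", "R", "Y", "R", "R"], {"R": 1.0, "Y": 1.0, "G": 1.0}
print(sequential_prob(x, alpha), closed_form(x, alpha))  # the two values agree
```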

Exchangeability

As stated earlier, p(x|θ) does not depend on the ordering of x but only on the number of times each value is observed in the sequence (sufficient statistics). Let us create x′ by swapping an arbitrary pair of variables x_i, x_j in x. Since the ordering does not matter, the following holds:

p(x′|α) = p(x|α)

This property is called exchangeability.

As we will see later, exchangeability makes inference easy.

Example: Sequential updates of x

Suppose random variable x takes one of three values: R, Y, G. The probability of generating the sequence x = (R, R, Y) is

p(x|α) = (0 + α_R) / (0 + A) · (1 + α_R) / (1 + A) · (0 + α_Y) / (2 + A).

Verify that reordering does not affect the probability.
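A tiny standalone check of this example: every ordering of (R, R, Y) gets the same probability (here with α = (1, 1, 1), an arbitrary choice):

```python
import itertools

alpha = {"R": 1.0, "Y": 1.0, "G": 1.0}
A = sum(alpha.values())

def seq_prob(x):
    """p(x|alpha) as the product of sequential predictive probabilities."""
    counts, prob = {k: 0 for k in alpha}, 1.0
    for i, x_i in enumerate(x):
        prob *= (counts[x_i] + alpha[x_i]) / (i + A)
        counts[x_i] += 1
    return prob

# All distinct orderings of (R, R, Y) have probability 1/30.
print({perm: seq_prob(perm) for perm in set(itertools.permutations(("R", "R", "Y")))})
```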

Generative story

For Dirichlet-categorical, θ is first generated from a Dirichlet distribution with parameter α, and then each random variable x_n is generated from a categorical distribution with parameter θ. This process can be summarized as follows:

θ|α ∼ Dir(α)

x_n|θ ∼ Cat(θ)
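A minimal sketch of this two-step generative story with NumPy (the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, alpha = 3, 10, np.array([1.0, 1.0, 1.0])

theta = rng.dirichlet(alpha)            # theta | alpha ~ Dir(alpha)
x = rng.choice(K, size=N, p=theta) + 1  # x_n | theta ~ Cat(theta), values 1..K
print(theta, x)
```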

Application: LDA (Latent Dirichlet allocation)

NOTE: We use the modified model by Griffiths et al. (2004), not the original one by Blei et al. (2002).

We are given D documents. Document i contains N_i words, and w_{i,j} is the j-th word in document i. LDA assumes that each document is a mixture of K topics. Document i is associated with a mixing proportion θ_i. Each topic k has a corresponding vocabulary distribution Cat(φ_k). Each word w_{i,j} is tied with a latent variable z_{i,j} = k′. z_{i,j} is generated from Cat(θ_i), and then w_{i,j} is generated from Cat(φ_{k′}).

Application: LDA (Latent Dirichlet allocation)

The generative story can be represented by a directed graph called a graphical model.

Each arrow indicates a dependency between variables. Each plate represents repetition, with the number of repetitions shown in the corner. The shaded node represents an observed variable.

Application: LDA (Latent Dirichlet allocation)

The generative story is as follows:

1. For each topic k ∈ {1, ··· , K}, draw a vocabulary distribution φ_k ∼ Dir(β).
2. For each document i ∈ {1, ··· , D}, draw a topic distribution θ_i ∼ Dir(α).
3. For each word j ∈ {1, ··· , N_i} in document i,
   1. draw a topic assignment z_{i,j} ∼ Cat(θ_i), and
   2. draw a word w_{i,j} ∼ Cat(φ_{z_{i,j}}).

Let W be the set of all observed words in the collection of documents and Z be the corresponding latent topic assignments. Given W, we want to infer Z.
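A toy sampler that follows this generative story, assuming NumPy; all sizes, hyperparameters, and the seed are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, V = 3, 2, 5                        # documents, topics, vocabulary size
N = [4, 6, 5]                            # words per document
alpha, beta = np.full(K, 0.5), np.full(V, 0.1)

phi = rng.dirichlet(beta, size=K)        # 1. vocabulary distribution phi_k per topic
theta = rng.dirichlet(alpha, size=D)     # 2. topic distribution theta_i per document
Z, W = [], []
for i in range(D):                       # 3. generate the words of document i
    z_i = rng.choice(K, size=N[i], p=theta[i])          # 3.1 topic assignments z_{i,j}
    w_i = [int(rng.choice(V, p=phi[z])) for z in z_i]   # 3.2 words w_{i,j} from the chosen topics
    Z.append(z_i.tolist())
    W.append(w_i)
print(Z, W)
```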

Application: LDA (Latent Dirichlet allocation)

We begin with the joint probability:

p(W, Z, θ, φ|α, β) = ∏_{k=1}^{K} p(φ_k|β) × ∏_{i=1}^{D} ( p(θ_i|α) ∏_{j=1}^{N_i} p(z_{i,j}|θ_i) p(w_{i,j}|φ_{z_{i,j}}) ).

θ_i and φ_k can be integrated out to obtain the marginal likelihood:

p(W, Z|α, β) = ∫∫ p(W, Z, θ, φ|α, β) dθ dφ
 = ∏_{i=1}^{D} ( Γ(α_{(·)}) / Γ(n^{(·)}_{i,(·)} + α_{(·)}) ∏_{k=1}^{K} Γ(n^{k}_{i,(·)} + α_k) / Γ(α_k) )
 × ∏_{k=1}^{K} ( Γ(β_{(·)}) / Γ(n^{k}_{(·),(·)} + β_{(·)}) ∏_{v} Γ(n^{k}_{(·),v} + β_v) / Γ(β_v) ),

where n^{k}_{i,v} is the number of times word type v in document i is tied with topic k, and (·) denotes summation over the corresponding index. We can see that LDA is simply a composite of D + K Dirichlet-categorical distributions.

Application: LDA (Latent Dirichlet allocation)

Now consider sequential updates of topic assignments and words. It is easy to verify that exchangeability holds.

Doc ID  Topic     Word      Prob.
1       Politics  Obama     (0 + α_Politics)/(0 + α_(·)) × (0 + β_Obama)/(0 + β_(·))
1       Economy   FRB       (0 + α_Economy)/(1 + α_(·)) × (0 + β_FRB)/(0 + β_(·))
1       Politics  election  (1 + α_Politics)/(2 + α_(·)) × (0 + β_election)/(1 + β_(·))
1       Politics  Obama     (2 + α_Politics)/(3 + α_(·)) × (1 + β_Obama)/(2 + β_(·))
2       Economy   Yellen    (0 + α_Economy)/(0 + α_(·)) × (0 + β_Yellen)/(1 + β_(·))
2       Economy   FRB       (1 + α_Economy)/(1 + α_(·)) × (1 + β_FRB)/(2 + β_(·))

Application: LDA (Latent Dirichlet allocation)

We want to infer Z that satisfies

argmax_Z p(Z|W, α, β) = argmax_Z p(W, Z|α, β).

The number of possible assignments is K^|Z|. The search space is too large for exhaustive search.

We resort to approximate inference based on random walks. Specifically, we often use Gibbs sampling as a theoretically grounded algorithm.

Sampling

Sampling is the procedure of obtaining a sample from a distribution. It is typically implemented on top of a uniform random number r (0 ≤ r < 1) returned by the rand function.

Suppose x is a sample from θ = (0.1, 0.6, 0.3). Using r, we can create a sample x as follows:

x = 1 if 0 ≤ r < 0.1,
x = 2 if 0.1 ≤ r < 0.7, and
x = 3 if 0.7 ≤ r < 1.
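A sketch of this inverse-CDF construction in code (the function name is made up):

```python
import random

def sample_categorical(theta):
    """Draw a sample from Cat(theta) by comparing r with the cumulative sums of theta."""
    r, cumulative = random.random(), 0.0   # r is uniform in [0, 1)
    for k, theta_k in enumerate(theta, start=1):
        cumulative += theta_k
        if r < cumulative:
            return k
    return len(theta)                      # guard against floating-point round-off

print(sample_categorical([0.1, 0.6, 0.3]))  # 1, 2, or 3
```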

Sampling from Dirichlet-categorical posterior predictive

Recall that the posterior predictive distribution of Dirichlet-categorical is

p(x′ = k′|x, α) = (n_{k′}(x) + α_{k′}) / (n_{(·)}(x) + α_{(·)}).

Let x = (R, R, Y, R, R) and α = (α_R, α_Y, α_G) = (1, 1, 1). Then

p(x′ = R|x, α) = (4 + 1) / (5 + 3) = 0.625
p(x′ = Y|x, α) = (1 + 1) / (5 + 3) = 0.25
p(x′ = G|x, α) = (0 + 1) / (5 + 3) = 0.125.

Draw a sample from θ = (0.625, 0.25, 0.125), and that is the sample we want to obtain.

Gibbs sampling

LDA is a composite of Dirichlet-categorical distributions p(Z|W, α, β). Directly sampling from this complex distribution is difficult. However, it is easy to draw a sample from p(z_i|W, Z^{−i}, α, β), where Z^{−i} = Z \ {z_i} and i is now a corpus-wide index.

Gibbs sampling is a method of obtaining samples of Z by repeatedly sampling from p(z_i|W, Z^{−i}, α, β). The algorithmic overview of Gibbs sampling is as follows (see the sketch below):

1. Initialize Z.
2. For each iteration τ = 1, ..., T:
   - randomly select i,
   - sample z′_i from p(z_i|W, Z^{−i}, α, β), and
   - update Z by replacing z_i with z′_i.

After a sufficient number of iterations, Z can be seen as a sample from p(Z|W, α, β).
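A structural sketch of this loop (not the LDA-specific sampler; the `conditional` argument is a stand-in for p(z_i|W, Z^{−i}, α, β), and the uniform stub is only there to make the sketch runnable):

```python
import random

def gibbs(W, K, conditional, T=100, seed=0):
    """Generic Gibbs loop: repeatedly resample one z_i from its conditional distribution."""
    rng = random.Random(seed)
    Z = [rng.randrange(K) for _ in W]      # 1. initialize Z
    for _ in range(T):                     # 2. for each iteration
        i = rng.randrange(len(Z))          #    randomly select i
        probs = conditional(i, Z, W)       #    p(z_i | W, Z^{-i}) over the K values
        r, cum = rng.random(), 0.0
        for k, p in enumerate(probs):      #    sample z'_i by inverting the CDF
            cum += p
            if r < cum:
                Z[i] = k                   #    update Z by replacing z_i with z'_i
                break
    return Z

# Uniform placeholder conditional; the LDA conditional is derived on the next slides.
print(gibbs(W=list(range(10)), K=3, conditional=lambda i, Z, W: [1 / 3] * 3))
```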

Exchangeability again

Gibbs sampling is applicable if we can easily draw a sample from p(z_i|Z^{−i}).

Due to exchangeability, the probability of observing the sequence (z_1, ..., z_{i−1}, z_i, z_{i+1}, ..., z_N) is the same as the probability of observing the sequence (z_1, ..., z_{i−1}, z_{i+1}, ..., z_N, z_i).

This means that p(z_i|Z^{−i}) is the posterior predictive distribution of z_i after observing (z_1, ..., z_{i−1}, z_{i+1}, ..., z_N), the sequence excluding z_i.

Gibbs sampling for LDA 1/2

p(z_i = k′|Z^{−i}, W, α, β)
∝ p(z_i = k′, Z^{−i}, W|α, β)
= ∏_{j=1}^{D} ( Γ(α_{(·)}) / Γ(n^{(·)}_{j,(·)}(Z^{−i}) + I(w_i ∈ W_j) + α_{(·)}) ∏_{k=1}^{K} Γ(n^{k}_{j,(·)}(Z^{−i}) + I(k = k′ & w_i ∈ W_j) + α_k) / Γ(α_k) )
× ∏_{k=1}^{K} ( Γ(β_{(·)}) / Γ(n^{k}_{(·),(·)}(Z^{−i}) + I(k = k′) + β_{(·)}) ∏_{v} Γ(n^{k}_{(·),v}(Z^{−i}) + I(k = k′ & w_i = v) + β_v) / Γ(β_v) ).

W_j is the sequence of words in the j-th document. Most terms are irrelevant to z_i and can be dropped.

Gibbs sampling for LDA 2/2

p(z_i = k′|Z^{−i}, W, α, β)
∝ Γ(n^{(·)}_{j′,(·)}(Z^{−i}) + α_{(·)}) / Γ(n^{(·)}_{j′,(·)}(Z^{−i}) + 1 + α_{(·)}) × Γ(n^{k′}_{j′,(·)}(Z^{−i}) + 1 + α_{k′}) / Γ(n^{k′}_{j′,(·)}(Z^{−i}) + α_{k′})
× Γ(n^{k′}_{(·),(·)}(Z^{−i}) + β_{(·)}) / Γ(n^{k′}_{(·),(·)}(Z^{−i}) + 1 + β_{(·)}) × Γ(n^{k′}_{(·),v′}(Z^{−i}) + 1 + β_{v′}) / Γ(n^{k′}_{(·),v′}(Z^{−i}) + β_{v′})
= (n^{k′}_{j′,(·)}(Z^{−i}) + α_{k′}) / (n^{(·)}_{j′,(·)}(Z^{−i}) + α_{(·)}) × (n^{k′}_{(·),v′}(Z^{−i}) + β_{v′}) / (n^{k′}_{(·),(·)}(Z^{−i}) + β_{(·)}),

where v′ = w_i, and j′ is the index of the document w_i belongs to. The first term is the posterior predictive probability of adding a topic assignment with value k′ to document j′ (its denominator can be dropped because it is constant with respect to z_i). The second term is the posterior predictive probability of drawing word type v′ from the vocabulary distribution of topic k′.

Intuitive explanation of Gibbs sampling for LDA

p(z_i = k′|Z^{−i}, W, α, β) ∝ (n^{k′}_{j′,(·)}(Z^{−i}) + α_{k′}) / (n^{(·)}_{j′,(·)}(Z^{−i}) + α_{(·)}) × (n^{k′}_{(·),v′}(Z^{−i}) + β_{v′}) / (n^{k′}_{(·),(·)}(Z^{−i}) + β_{(·)})

To make the first term larger, we need to select a k′ that appears often in document j′. To make the second term larger, we need to select a k′ from which w_i = v′ is often drawn. z_i reflects a balance between the first and second terms. After repeating the procedure, Z will generally satisfy these conditions.
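Putting the pieces together, the following is a compact sketch of collapsed Gibbs sampling for LDA based on the conditional above; it assumes NumPy and symmetric scalar α and β, and the toy corpus, sizes, and seed are made up:

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha, beta, T=200, seed=0):
    """Collapsed Gibbs sampling for LDA:
    p(z_i = k) ∝ (n^k_doc + alpha) * (n^k_word + beta) / (n^k_all + V * beta)."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initialization of Z
    n_dk = np.zeros((len(docs), K))                        # topic counts per document
    n_kv = np.zeros((K, V))                                # word counts per topic
    n_k = np.zeros(K)                                      # total words per topic
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d, k] += 1; n_kv[k, w] += 1; n_k[k] += 1
    for _ in range(T):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k = z[d][j]                                # remove z_i to obtain Z^{-i}
                n_dk[d, k] -= 1; n_kv[k, w] -= 1; n_k[k] -= 1
                p = (n_dk[d] + alpha) * (n_kv[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())           # sample z'_i from the conditional
                z[d][j] = k                                # put it back and update the counts
                n_dk[d, k] += 1; n_kv[k, w] += 1; n_k[k] += 1
    return [zd.tolist() for zd in z]

docs = [[0, 1, 0, 2], [3, 4, 3, 3], [0, 2, 1, 0], [4, 3, 4]]  # word ids in a toy corpus
print(lda_gibbs(docs, K=2, V=5, alpha=0.5, beta=0.1))
```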
