
10-708: Probabilistic Graphical Models 10-708, Spring 2016

Lecture 19: Indian Buffet Process

Lecturer: Matthew Gormley Scribes: Kai-Wen Liang, Han Lu

1 Review

1.1 Chinese Restaurant Process

In probability theory, the Chinese restaurant process is a discrete-time stochastic process, analogous to seating customers at an infinite number of tables in a Chinese restaurant. Assume that each customer enters and sits down at a table. The way they choose tables follows the process below:

• The first customer sits at the first unoccupied table.
• Each subsequent customer chooses a table according to the following probabilities:

P(kth occupied table) ∝ n_k
P(next unoccupied table) ∝ α

In the end, we have the number of people sitting at each table. This corresponds to a distribution over clusterings, where customer = index and table = cluster. Although the CRP allows a potentially infinite number of clusters, the expected number of clusters given n customers is O(α log(n)). The seating probabilities also exhibit a rich-get-richer effect on clusters. As α goes to 0, the number of clusters goes to 1, while as α goes to +∞, the number of clusters goes to n.
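The seating process above is easy to simulate. The following Python snippet is an illustrative sketch (not part of the original notes); the function name and the empirical check against the O(α log n) growth are our own additions.

    import numpy as np

    def sample_crp(n, alpha, rng):
        """Simulate table assignments z_1..z_n from a CRP with concentration alpha."""
        counts = []                          # counts[k] = n_k, customers at table k
        z = []
        for i in range(n):
            probs = np.array(counts + [alpha], dtype=float)
            probs /= probs.sum()             # kth table prop. to n_k, new table prop. to alpha
            k = rng.choice(len(probs), p=probs)
            if k == len(counts):             # next unoccupied table
                counts.append(1)
            else:                            # kth occupied table
                counts[k] += 1
            z.append(k)
        return z, counts

    rng = np.random.default_rng(0)
    z, counts = sample_crp(n=1000, alpha=2.0, rng=rng)
    print(len(counts), 2.0 * np.log(1000))   # observed number of tables vs. alpha*log(n) scale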

1.2 CRP Mixture Model

Here we denote z_1, z_2, . . . , z_n as a sequence of indices drawn from a Chinese restaurant process, where n is the number of customers. For each table/cluster we also draw a distribution θ^*_k from a base distribution H. Although the CRP allows an infinite number of tables/clusters, the maximum number of clusters/tables is the number of customers (i.e. n). Finally, for each customer i with cluster index z_i, we draw an observation x_i from p(x_i | θ^*_{z_i}). In the Chinese restaurant story, z_i is the table assignment of the ith customer, θ^*_k is the table-specific distribution over dishes, and x_i is the dish that the ith customer orders from that table-specific distribution. The next thing we want to address is the inference problem (i.e. computing the distribution of z and θ given the observations x). Because of the exchangeability of the CRP, Gibbs sampling is convenient for inference: for each observation, we can remove the customer/dish from the restaurant and resample as if they were the last to enter. Here we describe three Gibbs samplers for the CRP mixture model.

• Algorithm 1 (uncollapsed)

  – Markov chain state: per-customer parameters θ_1, θ_2, . . . , θ_n
  – For i = 1, . . . , n: draw θ_i ∼ p(θ_i | θ_{−i}, x)

• Algorithm 2 (uncollapsed)

  – Markov chain state: per-customer cluster indices z_1, . . . , z_n and per-cluster parameters θ^*_1, . . . , θ^*_K
  – For i = 1, . . . , n: draw z_i ∼ p(z_i | z_{−i}, x, θ^*)
  – Set K = number of clusters in z
  – For k = 1, . . . , K: draw θ^*_k ∼ p(θ^*_k | {x_i : z_i = k})

• Algorithm 3 (collapsed)

  – Markov chain state: per-customer cluster indices z_1, . . . , z_n
  – For i = 1, . . . , n: draw z_i ∼ p(z_i | z_{−i}, x)

For Algorithm 1, customers i and j belong to the same cluster exactly when θ_i = θ_j. For Algorithm 2, since it is uncollapsed, it is hard to draw a new z_i (one that opens a new cluster) under the conditional distribution.
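As a concrete illustration of Algorithm 3, here is a minimal sketch of one collapsed Gibbs sweep, assuming a conjugate model whose predictive likelihood p(x_i | {x_j : z_j = k}) is supplied as a function. The Gaussian predictive used in the demo (unit-variance likelihood, standard normal prior on the cluster mean) and all names are our own assumptions, not from the notes.

    import numpy as np

    def normal_predictive_loglik(xi, members):
        """log p(x_i | members) for a unit-variance Gaussian likelihood with a
        standard normal prior on the cluster mean (a simple conjugate choice)."""
        m = len(members)
        post_var = 1.0 / (m + 1.0)
        post_mean = sum(members) * post_var
        pred_var = 1.0 + post_var
        return -0.5 * (np.log(2 * np.pi * pred_var) + (xi - post_mean) ** 2 / pred_var)

    def gibbs_sweep(x, z, alpha, predictive_loglik, rng):
        """One sweep of the collapsed Gibbs sampler (Algorithm 3) for a CRP mixture."""
        n = len(x)
        for i in range(n):
            z[i] = -1                                   # remove customer i from the restaurant
            clusters = sorted(set(c for c in z if c >= 0))
            log_w = []
            for k in clusters:                          # existing cluster: prop. to n_{-i,k} * p(x_i | cluster)
                members = [x[j] for j in range(n) if z[j] == k]
                log_w.append(np.log(len(members)) + predictive_loglik(x[i], members))
            log_w.append(np.log(alpha) + predictive_loglik(x[i], []))   # new cluster: prop. to alpha
            log_w = np.array(log_w)
            w = np.exp(log_w - log_w.max())
            w /= w.sum()
            choice = rng.choice(len(w), p=w)
            z[i] = clusters[choice] if choice < len(clusters) else max(clusters, default=-1) + 1
        return z

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-3, 1, 50), rng.normal(3, 1, 50)])
    z = [0] * len(x)
    for _ in range(20):
        z = gibbs_sweep(x, z, alpha=1.0, predictive_loglik=normal_predictive_loglik, rng=rng)
    print(len(set(z)), "clusters")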

1.3 Dirichlet Process

• Parameters of a DP:

  – Base distribution H, a probability distribution over Θ
  – Strength parameter α > 0

• We say G ∼ DP(α, H) if for any partition A_1 ∪ A_2 ∪ · · · ∪ A_K = Θ we have:

  (G(A_1), . . . , G(A_K)) ∼ Dirichlet(αH(A_1), . . . , αH(A_K))

The above definition says that the DP is a distribution over probability measures such that marginals on finite partitions are Dirichlet distributed. Given the definition of the Dirichlet process above, we have the following properties:

• Base distribution is the mean of the DP: E[G(A)] = H(A) for any A ⊂ Θ

• Strength parameter is like an inverse variance: V[G(A)] = H(A)(1 − H(A)) / (α + 1)
• Samples from a DP are discrete distributions (the stick-breaking construction of G ∼ DP(α, H) makes this clear)

• The posterior distribution of G ∼ DP(α, H) given samples θ_1, . . . , θ_n from G is again a DP:

  G | θ_1, . . . , θ_n ∼ DP(α + n, (α/(α + n)) H + (n/(α + n)) · (1/n) Σ_{i=1}^{n} δ_{θ_i})

1.4 Stick Breaking Construction

The stick-breaking construction provides a constructive definition of the Dirichlet process, as follows:

• Start with a stick of length 1 and break off a fraction β_1 of it; the length of the broken piece is π_1.

• Recursively break the rest of the stick to obtain β_2, β_3, . . . and π_2, π_3, . . .

where β_k ∼ Beta(1, α) and π_k = β_k ∏_{l=1}^{k−1} (1 − β_l). We also draw θ^*_k from the base distribution H. Then G = Σ_{k=1}^{∞} π_k δ_{θ^*_k} ∼ DP(α, H).
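A minimal sketch of a truncated stick-breaking draw from a DP, assuming a standard normal base distribution H; the truncation level and the names are our own choices.

    import numpy as np

    def stick_breaking(alpha, truncation, rng):
        """Weights and atoms of a truncated stick-breaking draw G ~ DP(alpha, H), with H = N(0, 1)."""
        betas = rng.beta(1.0, alpha, size=truncation)            # beta_k ~ Beta(1, alpha)
        remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
        pi = betas * remaining                                   # pi_k = beta_k * prod_{l<k} (1 - beta_l)
        atoms = rng.normal(0.0, 1.0, size=truncation)            # theta*_k ~ H
        return pi, atoms

    rng = np.random.default_rng(0)
    pi, atoms = stick_breaking(alpha=2.0, truncation=1000, rng=rng)
    print(pi.sum())    # close to 1 for a large enough truncation; G is discrete with atoms theta*_k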

2 Indian Buffet Process

2.1 Motivation

There are several latent feature models that are familiar to us, for example factor analysis, probabilistic PCA, cooperative vector quantization, and sparse PCA. The applications are varied; one example is as follows: we have images, each containing some set of objects, and we want a binary vector per image in which a one indicates that the corresponding object is present in the image and a zero that it is not. Latent feature models let us assign each data instance to multiple classes, whereas a mixture model assigns each data instance to exactly one class. Another example is the Netflix challenge, where we have sparse data on users' preferences and want to find movies to recommend to them. These models can also allow an infinite number of features, so that we do not need to specify the number of features beforehand.

The formal description of latent feature models is as follows: let x_i be the ith data instance and f_i be its features. Define X = [x_1^T, x_2^T, . . . , x_N^T]^T to be the matrix of data instances and F = [f_1^T, f_2^T, . . . , f_N^T]^T to be the matrix of features. The model is then specified by the joint distribution p(X, F), and by placing a prior over the features we factorize it as p(X, F) = p(X|F) p(F). We can further decompose the feature matrix F into a sparse binary matrix Z and a value matrix V. That is, for a real-valued matrix F, we have F = Z ⊗ V,

where ⊗ is the elementwise product, z_ij ∈ {0, 1}, and v_ij ∈ R. A small example of this decomposition is sketched below.
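The following numpy snippet is an illustrative stand-in for the example figure in the original notes; the numbers are arbitrary.

    import numpy as np

    Z = np.array([[1, 0, 1, 0],                    # which features are ON for each object
                  [0, 1, 0, 0],
                  [1, 1, 0, 1]])
    V = np.array([[ 0.5, -1.2,  2.0,  0.3],        # feature values (possibly dense)
                  [ 1.1,  0.7, -0.4,  2.2],
                  [-0.9,  1.5,  0.8, -1.3]])
    F = Z * V                                      # elementwise product: F = Z (x) V
    print(F)                                       # F is sparse wherever Z is zero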

The reason that this is a powerful idea is that as the number of features K, which is the number of columns here, goes to infinity, we do not need to represent the entire matrix V (even if it is dense), as long as the matrix Z is appropriately sparse. Therefore, the model becomes

p(X,F ) = p(X|F )p(Z)p(V ).

The main topic of this lecture, the Indian Buffet Process, provides a way to specify p(Z) when the number of features K is infinite. Before moving to infinite latent feature models, we first review the basics of finite feature models.

2.2 Finite Feature Model

The first example is the Beta-Bernoulli model. We encountered this model before when we were talking about LDA. Here we restate the coin-flipping story: from a hyperparameter α, we sample a weighted coin for each column k, and for each row we flip that coin to get a head or a tail.

Here we make things more formal. For each column we sample a feature probability π_k, and for each row we sample an ON/OFF value based on π_k (a small sampling sketch follows the list below). That is,

• For each feature k ∈ {1, . . . , K}:

  – π_k ∼ Beta(α/K, 1), where α > 0
  – For each object i ∈ {1, . . . , N}:

    ∗ z_ik ∼ Bernoulli(π_k)
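Here is a minimal sketch of sampling Z from this finite model, together with a Monte Carlo check of the expected number of ON entries, Nα/(1 + α/K), derived later in this section; the function name is our own.

    import numpy as np

    def sample_finite_Z(N, K, alpha, rng):
        """Sample an N x K binary feature matrix from the finite Beta-Bernoulli model."""
        pi = rng.beta(alpha / K, 1.0, size=K)         # pi_k ~ Beta(alpha/K, 1)
        Z = (rng.random((N, K)) < pi).astype(int)     # z_ik ~ Bernoulli(pi_k)
        return Z

    rng = np.random.default_rng(0)
    N, K, alpha = 100, 1000, 5.0
    avg_ones = np.mean([sample_finite_Z(N, K, alpha, rng).sum() for _ in range(50)])
    print(avg_ones, N * alpha / (1 + alpha / K))      # empirical vs. N*alpha/(1 + alpha/K)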

The graphical representation is a plate diagram with π_k in a plate over the K features and z_ik in a nested plate over the N objects, with α as a hyperparameter. This gives us the probability of z_ik given π_k and α.

Because the Beta distribution is the conjugate prior of the Bernoulli distribution (this is the special case of the Dirichlet being the conjugate prior of the Multinomial distribution, where the dimension decreases to 2), we

can analytically marginalize out the feature parameters πk. The probability of just the matrix Z can be written as the product of the marginal probability of each column, that is,

P(Z) = ∏_{k=1}^{K} ∫ ( ∏_{i=1}^{N} P(z_ik | π_k) ) p(π_k) dπ_k
     = ∏_{k=1}^{K} (α/K) Γ(m_k + α/K) Γ(N − m_k + 1) / Γ(N + 1 + α/K),

where m_k = Σ_{i=1}^{N} z_ik is the number of ON entries in column k, and Γ is the Gamma function. The question that we are interested in is the expected number of non-zero elements in the matrix Z. To answer this question, we first recall that

if X ∼ Beta(r, s), then E[X] = r / (r + s)
if Y ∼ Bernoulli(p), then E[Y] = p

Since z_ik ∼ Bernoulli(π_k) and π_k ∼ Beta(α/K, 1), we have

E[z_ik] = (α/K) / (1 + α/K),

and we can calculate the expected number of ON elements as

" N K # X X Nα E[1T Z1] = E z = ik 1 + α i=1 k=1 K

This value is upper-bounded by Nα. If we take K → ∞, the value simply goes to Nα, which means that this particular model guarantees sparsity even if we have an infinite set of features. However, a problem is that when K → ∞, P(Z) also goes to 0, because the factor α/K in each column's term goes to 0. This is not a property we want, since every individual matrix Z then has probability 0.

To tackle this problem, we first have to recognize that the features are not identifiable, which means that the order of the features does not matter to the model. To understand the concept of "not identifiable", recall that in a topic model the topics recovered across different runs may appear in different orders: which topic corresponds to which index k does not matter. In a latent feature model, likewise, there is no difference between feature k = 13 and feature k = 27. With this in mind, we can convert the matrix to left-ordered form (lof). Define the history of feature k to be the integer whose binary representation is given by the column:

h_k = Σ_{i=1}^{N} 2^{N−i} z_ik.

In other words, the history of feature k is just column k read from top to bottom as a binary number, with the first row as the most significant bit.

With history at hand, we define lof(Z) to be Z with its columns sorted left-to-right in decreasing order of their histories. A small sketch of this computation is given below.
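The following snippet, a stand-in for the figures in the original notes, computes the history of each column and sorts Z into left-ordered form; the helper names are our own.

    import numpy as np

    def histories(Z):
        """History h_k = sum_i 2^(N-i) z_ik: each column read as a binary number."""
        N = Z.shape[0]
        powers = 2 ** np.arange(N - 1, -1, -1)     # 2^(N-1), ..., 2^0
        return Z.T @ powers

    def lof(Z):
        """Left-ordered form: columns sorted left-to-right by decreasing history."""
        order = np.argsort(-histories(Z), kind="stable")
        return Z[:, order]

    Z = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 0]])
    print(histories(Z))    # [3 5 6]
    print(lof(Z))          # columns reordered so histories are 6, 5, 3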

We define the equivalence class [Z] = {Z′ : lof(Z′) = lof(Z)}, which is the collection of all matrices Z′ that have the same lof. By doing some counting, we can find the cardinality of [Z] to be K! / ∏_{h=0}^{2^N−1} K_h!, where K_h is the number of features with history h; this is the number of matrices that have the same lof. Now, instead of calculating the probability of a particular matrix, p(Z), we calculate the probability of the collection of matrices, p([Z]). That is,

lim_{K→∞} p([Z]) = lim_{K→∞} ( K! / ∏_{h=0}^{2^N−1} K_h! ) p(Z)
                 = ( α^{K_+} / ∏_{h=1}^{2^N−1} K_h! ) · exp{−α H_N} · ∏_{k=1}^{K_+} (N − m_k)! (m_k − 1)! / N! ,

where K_+ is the number of features with non-zero history, and H_N = Σ_{j=1}^{N} 1/j is the Nth harmonic number. By doing the algebra, we can see that this probability no longer goes to 0 in the limit.
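As a sanity check on the formula above, here is a small helper (ours, not from the notes) that evaluates log P([Z]) for a binary matrix Z, treating its non-zero columns as the K_+ features.

    import numpy as np
    from math import lgamma, log
    from collections import Counter

    def log_p_lof(Z, alpha):
        """log P([Z]) = K_+ log(alpha) - sum_h log K_h! - alpha H_N
                        + sum_k log[(N - m_k)! (m_k - 1)! / N!]"""
        N = Z.shape[0]
        nonzero = Z.sum(axis=0) > 0
        m = Z.sum(axis=0)[nonzero]                       # column counts m_k
        K_plus = len(m)
        H_N = sum(1.0 / j for j in range(1, N + 1))      # N-th harmonic number
        powers = 2 ** np.arange(N - 1, -1, -1)
        K_h = Counter((Z.T @ powers)[nonzero])           # number of features per history h
        log_Kh_fact = sum(lgamma(c + 1) for c in K_h.values())
        col_terms = sum(lgamma(N - mk + 1) + lgamma(mk) - lgamma(N + 1) for mk in m)
        return K_plus * log(alpha) - log_Kh_fact - alpha * H_N + col_terms

    Z = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0]])
    print(log_p_lof(Z, alpha=1.0))

Now we have enough background to move on to the Indian Buffet Process.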

2.3 The Indian Buffet Process

Imagine an Indian restaurant with a wonderful buffet containing an infinite number of dishes. Each customer walks in, takes some dishes, and sits down once they have enough food on their plate. The rule by which they select dishes is as follows (a simulation sketch follows the list):

• 1st customer: Starts at the left and selects a Poisson(α) number of dishes.
• ith customer:

  – Samples previously sampled dishes according to their popularity (i.e. takes dish k with probability m_k / i, where m_k is the number of previous customers who tried dish k)
  – Selects a Poisson(α/i) number of new dishes
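An illustrative Python sketch of this generative process (ours, not from the notes); it returns the binary customer-by-dish matrix Z.

    import numpy as np

    def sample_ibp(N, alpha, rng):
        """Simulate the Indian Buffet Process for N customers with parameter alpha."""
        rows = []                                      # rows[i] = dishes taken by customer i+1
        counts = []                                    # counts[k] = m_k, customers who tried dish k
        for i in range(1, N + 1):
            row = []
            for k in range(len(counts)):
                take = int(rng.random() < counts[k] / i)   # existing dish k with prob m_k / i
                counts[k] += take
                row.append(take)
            new = rng.poisson(alpha / i)               # Poisson(alpha/i) brand-new dishes
            row.extend([1] * new)
            counts.extend([1] * new)
            rows.append(row)
        Z = np.zeros((N, len(counts)), dtype=int)      # earlier customers have 0 for later dishes
        for i, row in enumerate(rows):
            Z[i, :len(row)] = row
        return Z

    rng = np.random.default_rng(0)
    Z = sample_ibp(N=10, alpha=2.0, rng=rng)
    print(Z.shape, Z.sum(axis=1))    # the number of ones in each row is Poisson(alpha) distributed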

A draw from this process can be recorded as a binary matrix Z with customers as rows and dishes as columns. The problem is that the process is not exchangeable: which dishes are sampled as "new" depends on the customer order. The way to fix this is to modify the way the ith customer selects dishes:

• Makes a single decision for all dishes with the same history h (i.e. if there are K_h dishes with history h, each sampled by the same m_h previous customers, then samples a Binomial(K_h, m_h/i) number of them, starting at the left)

• Selects a Poisson(α/i) number of new dishes

This is equivalent to generating the matrix directly as a lof matrix. Therefore, we have fixed the problem and can calculate the probability p([Z]). Next we construct a Gibbs sampler for the Indian Buffet Process. Specifically, we consider a "prior only" sampler of p(Z | α). For finite K, we have

P(z_ik = 1 | z_{−i,k}) = ∫_0^1 P(z_ik | π_k) p(π_k | z_{−i,k}) dπ_k
                       = (m_{−i,k} + α/K) / (N + α/K),

where z_{−i,k} is the kth column excluding row i, and m_{−i,k} is the number of rows other than i with feature k ON. For infinite K, since the Indian Buffet Process is exchangeable, we can sample just as in the CRP, choosing an ordering in which the ith customer was the last to enter. For any k such that m_{−i,k} > 0, resample

P(z_ik = 1 | z_{−i,k}) = m_{−i,k} / N,

then draw a Poisson(α/N) number of new dishes for customer i.
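A minimal sketch of this "prior only" Gibbs sweep for the infinite model (in a real model the resampling probabilities would also include likelihood terms); the code and names are our own.

    import numpy as np

    def prior_only_gibbs_sweep(Z, alpha, rng):
        """One Gibbs sweep targeting the IBP prior p(Z | alpha); Z is an N x K binary array."""
        N = Z.shape[0]
        for i in range(N):
            m_minus = Z.sum(axis=0) - Z[i]                      # m_{-i,k} for every column
            keep = m_minus > 0
            Z, m_minus = Z[:, keep], m_minus[keep]              # drop dishes only customer i had
            Z[i] = (rng.random(Z.shape[1]) < m_minus / N).astype(int)   # P(z_ik = 1) = m_{-i,k}/N
            new = rng.poisson(alpha / N)                        # Poisson(alpha/N) new dishes
            if new > 0:
                new_cols = np.zeros((N, new), dtype=int)
                new_cols[i] = 1
                Z = np.hstack([Z, new_cols])
        return Z

    rng = np.random.default_rng(0)
    Z = np.zeros((5, 0), dtype=int)
    for _ in range(100):
        Z = prior_only_gibbs_sweep(Z, alpha=2.0, rng=rng)
    print(Z.shape[1], Z.sum())    # number of active features and total number of ones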

There are some properties of the Indian Buffet Process that should be noted:

• It is infinitely exchangeable.
• The number of ones in each row is Poisson(α).
• The expected total number of ones is αN.
• The number of nonzero columns grows as O(α log N).
• It has a stick-breaking representation.
• It can be interpreted in terms of the Beta-Bernoulli process.

Finally, posterior inference can be done using several different methods, such as Gibbs sampling, conjugate samplers, etc. Previous literature has reported using this model for graph structures, protein complexes, etc.