
10-708: Probabilistic Graphical Models 10-708, Spring 2016

Lecture 19: Indian Buffet Process

Lecturer: Matthew Gormley Scribes: Kai-Wen Liang, Han Lu

1 Review

1.1 Chinese Restaurant Process

In probability theory, the Chinese restaurant process is a discrete-time stochastic process, analogous to seating customers at an infinite number of tables in a Chinese restaurant. Assume that each customer enters and sits down at a table. The way they choose tables follows the process below:

• The first customer sits at the first unoccupied table.
• Each subsequent customer chooses a table according to the following probabilities:

P(kth occupied table) ∝ n_k
P(next unoccupied table) ∝ α

In the end, we have the number of people sitting at each table. This corresponds to a distribution over clusterings, where customer = index and table = cluster. Although the CRP allows a potentially infinite number of clusters, the expected number of clusters given n customers is O(α log(n)). The seating probabilities also exhibit a rich-get-richer effect on clusters. As α goes to 0, the number of clusters goes to 1, while as α goes to +∞, the number of clusters goes to n.
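The seating process above is easy to simulate. The following Python snippet is an illustrative sketch (not part of the original notes); the function name and the empirical check against the O(α log n) growth are our own additions.

    import numpy as np

    def sample_crp(n, alpha, rng):
        """Simulate table assignments z_1..z_n from a CRP with concentration alpha."""
        counts = []                          # counts[k] = n_k, customers at table k
        z = []
        for i in range(n):
            probs = np.array(counts + [alpha], dtype=float)
            probs /= probs.sum()             # kth table prop. to n_k, new table prop. to alpha
            k = rng.choice(len(probs), p=probs)
            if k == len(counts):             # next unoccupied table
                counts.append(1)
            else:                            # kth occupied table
                counts[k] += 1
            z.append(k)
        return z, counts

    rng = np.random.default_rng(0)
    z, counts = sample_crp(n=1000, alpha=2.0, rng=rng)
    print(len(counts), 2.0 * np.log(1000))   # observed number of tables vs. alpha*log(n) scale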

1.2 CRP Mixture Model

Here we denote z_1, z_2, . . . , z_n as a sequence of indices drawn from a Chinese restaurant process, where n is the number of customers. For each table/cluster we also draw a distribution θ^*_k from a base distribution H. Although the CRP allows an infinite number of tables/clusters, the maximum number of clusters/tables is the number of customers (i.e. n). Finally, for each customer i with cluster index z_i, we draw an observation x_i from p(x_i | θ^*_{z_i}). In the Chinese restaurant story, z_i is the table assignment of the ith customer, θ^*_k is the table-specific distribution over dishes, and x_i is the dish that the ith customer orders from that table-specific distribution. The next thing we want to address is the inference problem (i.e. computing the distribution of z and θ given the observations x). Because of the exchangeability of the CRP, Gibbs sampling is convenient for inference: for each observation, we can remove the customer/dish from the restaurant and resample as if they were the last to enter. Here we describe three Gibbs samplers for the CRP mixture model.

• Algorithm 1 (uncollapsed)

  – Markov chain state: per-customer parameters θ_1, θ_2, . . . , θ_n
  – For i = 1, . . . , n: draw θ_i ∼ p(θ_i | θ_{−i}, x)

• Algorithm 2 (uncollapsed)

  – Markov chain state: per-customer cluster indices z_1, . . . , z_n and per-cluster parameters θ^*_1, . . . , θ^*_K
  – For i = 1, . . . , n: draw z_i ∼ p(z_i | z_{−i}, x, θ^*)
  – Set K = number of clusters in z
  – For k = 1, . . . , K: draw θ^*_k ∼ p(θ^*_k | {x_i : z_i = k})

• Algorithm 3 (collapsed)

  – Markov chain state: per-customer cluster indices z_1, . . . , z_n
  – For i = 1, . . . , n: draw z_i ∼ p(z_i | z_{−i}, x)

For Algorithm 1, customers i and j belong to the same cluster exactly when θ_i = θ_j. For Algorithm 2, since it is uncollapsed, it is hard to draw a new z_i (one that opens a new cluster) under the conditional distribution.
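As a concrete illustration of Algorithm 3, here is a minimal sketch of one collapsed Gibbs sweep, assuming a conjugate model whose predictive likelihood p(x_i | {x_j : z_j = k}) is supplied as a function. The Gaussian predictive used in the demo (unit-variance likelihood, standard normal prior on the cluster mean) and all names are our own assumptions, not from the notes.

    import numpy as np

    def normal_predictive_loglik(xi, members):
        """log p(x_i | members) for a unit-variance Gaussian likelihood with a
        standard normal prior on the cluster mean (a simple conjugate choice)."""
        m = len(members)
        post_var = 1.0 / (m + 1.0)
        post_mean = sum(members) * post_var
        pred_var = 1.0 + post_var
        return -0.5 * (np.log(2 * np.pi * pred_var) + (xi - post_mean) ** 2 / pred_var)

    def gibbs_sweep(x, z, alpha, predictive_loglik, rng):
        """One sweep of the collapsed Gibbs sampler (Algorithm 3) for a CRP mixture."""
        n = len(x)
        for i in range(n):
            z[i] = -1                                   # remove customer i from the restaurant
            clusters = sorted(set(c for c in z if c >= 0))
            log_w = []
            for k in clusters:                          # existing cluster: prop. to n_{-i,k} * p(x_i | cluster)
                members = [x[j] for j in range(n) if z[j] == k]
                log_w.append(np.log(len(members)) + predictive_loglik(x[i], members))
            log_w.append(np.log(alpha) + predictive_loglik(x[i], []))   # new cluster: prop. to alpha
            log_w = np.array(log_w)
            w = np.exp(log_w - log_w.max())
            w /= w.sum()
            choice = rng.choice(len(w), p=w)
            z[i] = clusters[choice] if choice < len(clusters) else max(clusters, default=-1) + 1
        return z

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-3, 1, 50), rng.normal(3, 1, 50)])
    z = [0] * len(x)
    for _ in range(20):
        z = gibbs_sweep(x, z, alpha=1.0, predictive_loglik=normal_predictive_loglik, rng=rng)
    print(len(set(z)), "clusters")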

1.3 Dirichlet Process

• Parameters of a DP:

  – Base distribution H, a probability distribution over Θ
  – Strength parameter α > 0

• We say G ∼ DP(α, H) if for any partition A_1 ∪ A_2 ∪ · · · ∪ A_K = Θ we have:

  (G(A_1), . . . , G(A_K)) ∼ Dirichlet(αH(A_1), . . . , αH(A_K))

The above definition says that the DP is a distribution over probability measures such that marginals on finite partitions are Dirichlet distributed. Given the definition of the Dirichlet process above, we have the following properties:

• Base distribution is the mean of the DP: E[G(A)] = H(A) for any A ⊂ Θ

• Strength parameter is like an inverse variance: V[G(A)] = H(A)(1 − H(A)) / (α + 1)
• Samples from a DP are discrete distributions (the stick-breaking construction of G ∼ DP(α, H) makes this clear)

• The posterior distribution of G ∼ DP(α, H) given samples θ_1, . . . , θ_n from G is again a DP:

  G | θ_1, . . . , θ_n ∼ DP(α + n, (α/(α + n)) H + (n/(α + n)) · (1/n) Σ_{i=1}^{n} δ_{θ_i})

1.4 Stick Breaking Construction

The stick-breaking construction provides a constructive definition of the Dirichlet process, as follows:

• Start with a stick of length 1 and break off a fraction β_1 of it; the length of the broken piece is π_1.

• Recursively break the rest of the stick to obtain β_2, β_3, . . . and π_2, π_3, . . .

where β_k ∼ Beta(1, α) and π_k = β_k ∏_{l=1}^{k−1} (1 − β_l). We also draw θ^*_k from the base distribution H. Then G = Σ_{k=1}^{∞} π_k δ_{θ^*_k} ∼ DP(α, H).
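A minimal sketch of a truncated stick-breaking draw from a DP, assuming a standard normal base distribution H; the truncation level and the names are our own choices.

    import numpy as np

    def stick_breaking(alpha, truncation, rng):
        """Weights and atoms of a truncated stick-breaking draw G ~ DP(alpha, H), with H = N(0, 1)."""
        betas = rng.beta(1.0, alpha, size=truncation)            # beta_k ~ Beta(1, alpha)
        remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
        pi = betas * remaining                                   # pi_k = beta_k * prod_{l<k} (1 - beta_l)
        atoms = rng.normal(0.0, 1.0, size=truncation)            # theta*_k ~ H
        return pi, atoms

    rng = np.random.default_rng(0)
    pi, atoms = stick_breaking(alpha=2.0, truncation=1000, rng=rng)
    print(pi.sum())    # close to 1 for a large enough truncation; G is discrete with atoms theta*_k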

2 Indian Buffet Process

2.1 Motivation

There are several latent feature models that are familiar to us, for example factor analysis, probabilistic PCA, cooperative vector quantization, and sparse PCA. The applications are varied; one example is as follows: we have images, each containing some set of objects, and we want a binary vector per image in which a one indicates that the corresponding object is present in the image and a zero that it is not. Latent feature models let us assign each data instance to multiple classes, whereas a mixture model assigns each data instance to exactly one class. Another example is the Netflix challenge, where we have sparse data on users' preferences and want to find movies to recommend to them. These models can also allow an infinite number of features, so that we do not need to specify the number of features beforehand.

The formal description of latent feature models is as follows: let x_i be the ith data instance and f_i be its features. Define X = [x_1^T, x_2^T, . . . , x_N^T]^T to be the matrix of data instances and F = [f_1^T, f_2^T, . . . , f_N^T]^T to be the matrix of features. The model is then specified by the joint distribution p(X, F), and by placing a prior over the features we factorize it as p(X, F) = p(X|F) p(F). We can further decompose the feature matrix F into a sparse binary matrix Z and a value matrix V. That is, for a real-valued matrix F, we have F = Z ⊗ V,

where ⊗ is the elementwise product, z_ij ∈ {0, 1}, and v_ij ∈ R. A small example of this decomposition is sketched below.
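The following numpy snippet is an illustrative stand-in for the example figure in the original notes; the numbers are arbitrary.

    import numpy as np

    Z = np.array([[1, 0, 1, 0],                    # which features are ON for each object
                  [0, 1, 0, 0],
                  [1, 1, 0, 1]])
    V = np.array([[ 0.5, -1.2,  2.0,  0.3],        # feature values (possibly dense)
                  [ 1.1,  0.7, -0.4,  2.2],
                  [-0.9,  1.5,  0.8, -1.3]])
    F = Z * V                                      # elementwise product: F = Z (x) V
    print(F)                                       # F is sparse wherever Z is zero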

The reason that this is a powerful idea is that as the number of features K, which is the number of columns here, goes to infinity, we do not need to represent the entire matrix V (even if it is dense), as long as the matrix Z is appropriately sparse. Therefore, the model becomes

p(X,F ) = p(X|F )p(Z)p(V ).

The main topic of this lecture, the Indian Buffet Process, provides a way to specify p(Z) when the number of features K is infinite. Before moving to infinite latent feature models, we first review the basics of finite feature models.

2.2 Finite Feature Model

The first example is the Beta-Bernoulli model. We encountered this model before when we were talking about LDA. Here we restate the coin-flipping story: from a hyperparameter α, we sample a weighted coin for each column k, and for each row we flip that coin to get a head or a tail.

Here we make things more formal. For each column we sample a feature probability π_k, and for each row we sample an ON/OFF value based on π_k (a small sampling sketch follows the list below). That is,

• For each feature k ∈ {1, . . . , K}:

  – π_k ∼ Beta(α/K, 1), where α > 0
  – For each object i ∈ {1, . . . , N}:

    ∗ z_ik ∼ Bernoulli(π_k)
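Here is a minimal sketch of sampling Z from this finite model, together with a Monte Carlo check of the expected number of ON entries, Nα/(1 + α/K), derived later in this section; the function name is our own.

    import numpy as np

    def sample_finite_Z(N, K, alpha, rng):
        """Sample an N x K binary feature matrix from the finite Beta-Bernoulli model."""
        pi = rng.beta(alpha / K, 1.0, size=K)         # pi_k ~ Beta(alpha/K, 1)
        Z = (rng.random((N, K)) < pi).astype(int)     # z_ik ~ Bernoulli(pi_k)
        return Z

    rng = np.random.default_rng(0)
    N, K, alpha = 100, 1000, 5.0
    avg_ones = np.mean([sample_finite_Z(N, K, alpha, rng).sum() for _ in range(50)])
    print(avg_ones, N * alpha / (1 + alpha / K))      # empirical vs. N*alpha/(1 + alpha/K)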

The graphical representation is a plate diagram with π_k in a plate over the K features and z_ik in a nested plate over the N objects, with α as a hyperparameter. This gives us the probability of z_ik given π_k and α.

Because the Beta distribution is the conjugate prior of the Bernoulli distribution (this is the special case of the Dirichlet being the conjugate prior of the Multinomial distribution, where the dimension decreases to 2), we

can analytically marginalize out the feature parameters πk. The probability of just the matrix Z can be written as the product of the marginal probability of each column, that is,

P(Z) = ∏_{k=1}^{K} ∫ ( ∏_{i=1}^{N} P(z_ik | π_k) ) p(π_k) dπ_k
     = ∏_{k=1}^{K} (α/K) Γ(m_k + α/K) Γ(N − m_k + 1) / Γ(N + 1 + α/K),

where m_k = Σ_{i=1}^{N} z_ik is the number of ON entries in column k, and Γ is the Gamma function. The question that we are interested in is the expected number of non-zero elements in the matrix Z. To answer this question, we first recall that

if X ∼ Beta(r, s), then E[X] = r / (r + s)
if Y ∼ Bernoulli(p), then E[Y] = p

Since z_ik ∼ Bernoulli(π_k) and π_k ∼ Beta(α/K, 1), we have

E[z_ik] = (α/K) / (1 + α/K),

and we can calculate the expected number of ON elements as

" N K # X X Nα E[1T Z1] = E z = ik 1 + α i=1 k=1 K

This value is upper-bounded by Nα. If we take K → ∞, the value simply goes to Nα, which means that this particular model guarantees sparsity even if we have an infinite set of features. However, a problem is that when K → ∞, P(Z) also goes to 0, because the factor α/K in each column's term goes to 0. This is not a property we want, since every individual matrix Z then has probability 0.

To tackle this problem, we first have to recognize that the features are not identifiable, which means that the order of the features does not matter to the model. To understand the concept of "not identifiable", recall that in a topic model the topics recovered across different runs may appear in different orders: which topic corresponds to which index k does not matter. In a latent feature model, likewise, there is no difference between feature k = 13 and feature k = 27. With this in mind, we can convert the matrix to left-ordered form (lof). Define the history of feature k to be the integer whose binary representation is given by the column:

h_k = Σ_{i=1}^{N} 2^{N−i} z_ik.

In other words, the history of feature k is just column k read from top to bottom as a binary number, with the first row as the most significant bit.

With history at hand, we define lof(Z) to be Z with its columns sorted left-to-right in decreasing order of their histories. A small sketch of this computation is given below.
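The following snippet, a stand-in for the figures in the original notes, computes the history of each column and sorts Z into left-ordered form; the helper names are our own.

    import numpy as np

    def histories(Z):
        """History h_k = sum_i 2^(N-i) z_ik: each column read as a binary number."""
        N = Z.shape[0]
        powers = 2 ** np.arange(N - 1, -1, -1)     # 2^(N-1), ..., 2^0
        return Z.T @ powers

    def lof(Z):
        """Left-ordered form: columns sorted left-to-right by decreasing history."""
        order = np.argsort(-histories(Z), kind="stable")
        return Z[:, order]

    Z = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 0]])
    print(histories(Z))    # [3 5 6]
    print(lof(Z))          # columns reordered so histories are 6, 5, 3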

We define the equivalence class [Z] = {Z′ : lof(Z′) = lof(Z)}, which is the collection of all matrices Z′ that have the same lof. By doing some counting, we can find the cardinality of [Z] to be K! / ∏_{h=0}^{2^N−1} K_h!, where K_h is the number of features with history h; this is the number of matrices that have the same lof. Now, instead of calculating the probability of a particular matrix, p(Z), we calculate the probability of the collection of matrices, p([Z]). That is,

lim_{K→∞} p([Z]) = lim_{K→∞} ( K! / ∏_{h=0}^{2^N−1} K_h! ) p(Z)
                 = ( α^{K_+} / ∏_{h=1}^{2^N−1} K_h! ) · exp{−α H_N} · ∏_{k=1}^{K_+} (N − m_k)! (m_k − 1)! / N! ,

where K_+ is the number of features with non-zero history, and H_N = Σ_{j=1}^{N} 1/j is the Nth harmonic number. By doing the algebra, we can see that this probability no longer goes to 0 in the limit.
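As a sanity check on the formula above, here is a small helper (ours, not from the notes) that evaluates log P([Z]) for a binary matrix Z, treating its non-zero columns as the K_+ features.

    import numpy as np
    from math import lgamma, log
    from collections import Counter

    def log_p_lof(Z, alpha):
        """log P([Z]) = K_+ log(alpha) - sum_h log K_h! - alpha H_N
                        + sum_k log[(N - m_k)! (m_k - 1)! / N!]"""
        N = Z.shape[0]
        nonzero = Z.sum(axis=0) > 0
        m = Z.sum(axis=0)[nonzero]                       # column counts m_k
        K_plus = len(m)
        H_N = sum(1.0 / j for j in range(1, N + 1))      # N-th harmonic number
        powers = 2 ** np.arange(N - 1, -1, -1)
        K_h = Counter((Z.T @ powers)[nonzero])           # number of features per history h
        log_Kh_fact = sum(lgamma(c + 1) for c in K_h.values())
        col_terms = sum(lgamma(N - mk + 1) + lgamma(mk) - lgamma(N + 1) for mk in m)
        return K_plus * log(alpha) - log_Kh_fact - alpha * H_N + col_terms

    Z = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0]])
    print(log_p_lof(Z, alpha=1.0))

Now we have enough background to move on to the Indian Buffet Process.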

2.3 The Indian Buffet Process

Imagine an Indian restaurant with a wonderful buffet containing an infinite number of dishes. Each customer walks in, takes some dishes, and sits down once they have enough food on their plate. The rule by which they select dishes is as follows (a simulation sketch follows the list):

• 1st customer: Starts at the left and selects a Poisson(α) number of dishes.
• ith customer:

  – Samples previously sampled dishes according to their popularity (i.e. takes dish k with probability m_k / i, where m_k is the number of previous customers who tried dish k)
  – Selects a Poisson(α/i) number of new dishes
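An illustrative Python sketch of this generative process (ours, not from the notes); it returns the binary customer-by-dish matrix Z.

    import numpy as np

    def sample_ibp(N, alpha, rng):
        """Simulate the Indian Buffet Process for N customers with parameter alpha."""
        rows = []                                      # rows[i] = dishes taken by customer i+1
        counts = []                                    # counts[k] = m_k, customers who tried dish k
        for i in range(1, N + 1):
            row = []
            for k in range(len(counts)):
                take = int(rng.random() < counts[k] / i)   # existing dish k with prob m_k / i
                counts[k] += take
                row.append(take)
            new = rng.poisson(alpha / i)               # Poisson(alpha/i) brand-new dishes
            row.extend([1] * new)
            counts.extend([1] * new)
            rows.append(row)
        Z = np.zeros((N, len(counts)), dtype=int)      # earlier customers have 0 for later dishes
        for i, row in enumerate(rows):
            Z[i, :len(row)] = row
        return Z

    rng = np.random.default_rng(0)
    Z = sample_ibp(N=10, alpha=2.0, rng=rng)
    print(Z.shape, Z.sum(axis=1))    # the number of ones in each row is Poisson(alpha) distributed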

A draw from this process can be recorded as a binary matrix Z with customers as rows and dishes as columns. The problem is that the process is not exchangeable: which dishes are sampled as "new" depends on the customer order. The way to fix this is to modify the way the ith customer selects dishes:

• Makes a single decision for all dishes with the same history h (i.e. if there are K_h dishes with history h, each sampled by the same m_h previous customers, then samples a Binomial(K_h, m_h/i) number of them, starting at the left)

• Selects a Poisson(α/i) number of new dishes

This is equivalent to generating the matrix directly as a lof matrix. Therefore, we have fixed the problem and can calculate the probability p([Z]). Next we construct a Gibbs sampler for the Indian Buffet Process. Specifically, we consider a "prior only" sampler of p(Z | α). For finite K, we have

P(z_ik = 1 | z_{−i,k}) = ∫_0^1 P(z_ik | π_k) p(π_k | z_{−i,k}) dπ_k
                       = (m_{−i,k} + α/K) / (N + α/K),

where z_{−i,k} is the kth column excluding row i, and m_{−i,k} is the number of rows other than i with feature k ON. For infinite K, since the Indian Buffet Process is exchangeable, we can sample just as in the CRP, choosing an ordering in which the ith customer was the last to enter. For any k such that m_{−i,k} > 0, resample

P(z_ik = 1 | z_{−i,k}) = m_{−i,k} / N,

then draw a Poisson(α/N) number of new dishes for customer i.
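A minimal sketch of this "prior only" Gibbs sweep for the infinite model (in a real model the resampling probabilities would also include likelihood terms); the code and names are our own.

    import numpy as np

    def prior_only_gibbs_sweep(Z, alpha, rng):
        """One Gibbs sweep targeting the IBP prior p(Z | alpha); Z is an N x K binary array."""
        N = Z.shape[0]
        for i in range(N):
            m_minus = Z.sum(axis=0) - Z[i]                      # m_{-i,k} for every column
            keep = m_minus > 0
            Z, m_minus = Z[:, keep], m_minus[keep]              # drop dishes only customer i had
            Z[i] = (rng.random(Z.shape[1]) < m_minus / N).astype(int)   # P(z_ik = 1) = m_{-i,k}/N
            new = rng.poisson(alpha / N)                        # Poisson(alpha/N) new dishes
            if new > 0:
                new_cols = np.zeros((N, new), dtype=int)
                new_cols[i] = 1
                Z = np.hstack([Z, new_cols])
        return Z

    rng = np.random.default_rng(0)
    Z = np.zeros((5, 0), dtype=int)
    for _ in range(100):
        Z = prior_only_gibbs_sweep(Z, alpha=2.0, rng=rng)
    print(Z.shape[1], Z.sum())    # number of active features and total number of ones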

There are some properties of the Indian Buffet Process that should be noted:

• It is infinitely exchangeable.
• The number of ones in each row is Poisson(α).
• The expected total number of ones is αN.
• The number of nonzero columns grows as O(α log N).
• It has a stick-breaking representation.
• It can be interpreted in terms of the Beta-Bernoulli process.

Finally, posterior inference can be done using several different methods, such as Gibbs sampling, conjugate samplers, etc. Previous literature has reported using this model for graph structures, protein complexes, etc.