
Stick-Breaking Beta Processes and the Poisson Process

John Paisley¹  David M. Blei³  Michael I. Jordan¹,²
¹Department of EECS, ²Department of Statistics, UC Berkeley; ³Computer Science Department, Princeton University

Abstract

We show that the stick-breaking construction of the beta process due to Paisley et al. (2010) can be obtained from the characterization of the beta process as a Poisson process. Specifically, we show that the mean measure of the underlying Poisson process is equal to that of the beta process. We use this underlying representation to derive error bounds on truncated beta processes that are tighter than those in the literature. We also develop a new MCMC inference algorithm for beta processes, based in part on our new Poisson process construction.

1 Introduction

The beta process is a Bayesian nonparametric prior for sparse collections of binary features (Thibaux & Jordan, 2007). When the beta process is marginalized out, one obtains the Indian buffet process (IBP) (Griffiths & Ghahramani, 2006). Many applications of this circle of ideas—including focused topic distributions (Williamson et al., 2010), featural representations of multiple time series (Fox et al., 2010) and dictionary learning for image processing (Zhou et al., 2011)—are motivated from the IBP representation. However, as in the case of the Dirichlet process, where the Chinese restaurant process provides the marginalized representation, it can be useful to develop inference methods that use the underlying beta process. A step in this direction was provided by Teh et al. (2007), who derived a stick-breaking construction for the special case of the beta process that marginalizes to the one-parameter IBP. Recently, a stick-breaking construction of the full beta process was derived by Paisley et al. (2010). The derivation relied on a limiting process involving finite matrices, similar to the limiting process used to derive the IBP.

However, the beta process also has an underlying Poisson process (Jordan, 2010; Thibaux & Jordan, 2007), with a mean measure ν (as discussed in detail in Section 2.1). Therefore, the process presented in Paisley et al. (2010) must also be a Poisson process with this same mean measure. Showing this equivalence would provide a direct proof of Paisley et al. (2010) using the well-studied Poisson process machinery (Kingman, 1993).

In this paper we present such a derivation (Section 3). In addition, we derive truncation error bounds that are tighter than those in the literature (Section 4.1) (Doshi-Velez et al., 2009; Paisley et al., 2011). The Poisson process framework also provides an immediate proof of the extension of the construction to beta processes with a varying concentration parameter and an infinite base measure (Section 4.2), which does not follow immediately from the derivation in Paisley et al. (2010). In Section 5, we present a new MCMC algorithm for stick-breaking beta processes that uses the Poisson process to yield a more efficient sampler than that presented in Paisley et al. (2010).

2 The Beta Process

In this section, we review the beta process and its marginalized representation. We discuss the link between the beta process and the Poisson process, defining the underlying Lévy measure of the beta process. We then review the stick-breaking construction of the beta process, and give an equivalent representation of the generative process that will help us derive its Lévy measure.

A draw from a beta process is (with probability one) a countably infinite collection of weighted atoms in a space Ω, with weights that lie in the interval [0, 1] (Hjort, 1990). Two parameters govern the distribution on these weights, a concentration parameter α > 0 and a finite base measure µ, with µ(Ω) = γ.¹ Since such a draw is an atomic measure, we can write it as H = Σ_{ij} π_{ij} δ_{θ_{ij}}, where the two index values follow from Paisley et al. (2010), and we write H ∼ BP(α, µ).

¹ In Section 4.2 we discuss a generalization of this definition that is more in line with the definition given by Hjort (1990).

Figure 1: (left) A Poisson process Π on [a, b] × [0, 1] with mean measure ν = µ × λ, where λ(dπ) = α π^{−1} (1 − π)^{α−1} dπ and µ([a, b]) < ∞. The set A contains a Poisson distributed number of atoms with parameter ∫_A µ(dθ) λ(dπ). (right) The beta process constructed from Π. The first dimension corresponds to location, and the second dimension to weight.
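As a concrete illustration (ours, not from the paper), the following Python sketch simulates the finitely many atoms of the Poisson process in Figure 1 whose weights exceed a small cutoff eps; a cutoff is needed because λ places infinite mass near zero. The number of retained atoms is Poisson with parameter γ ∫_eps^1 λ(dπ), and their weights are i.i.d. from the normalized restriction of λ, mirroring the counting-measure description in the caption. The uniform base measure, parameter values, and all variable names are illustrative choices.

    import numpy as np
    from scipy import integrate

    # Illustrative sketch: simulate the atoms of Figure 1 with weight >= eps.
    # Mean measure: nu = mu x lambda, lambda(dpi) = alpha*pi^{-1}(1-pi)^{alpha-1} dpi,
    # mu uniform on [a, b] with total mass gamma.
    rng = np.random.default_rng(0)
    alpha, gamma, a, b, eps = 2.0, 3.0, 0.0, 1.0, 1e-3

    levy_density = lambda p: alpha * p ** (-1.0) * (1.0 - p) ** (alpha - 1.0)

    # Total mass of the restricted mean measure nu([a,b] x [eps,1]).
    mass, _ = integrate.quad(levy_density, eps, 1.0)
    n_atoms = rng.poisson(gamma * mass)

    # Locations are i.i.d. from mu / gamma (uniform on [a, b] here).
    theta = rng.uniform(a, b, size=n_atoms)

    # Weights are i.i.d. from the normalized restriction of lambda to [eps, 1];
    # we invert its CDF numerically on a grid.
    grid = np.linspace(eps, 1.0, 10_000)
    cdf = np.cumsum(levy_density(grid))
    cdf /= cdf[-1]
    pi = np.interp(rng.uniform(size=n_atoms), cdf, grid)

    H_atoms = list(zip(theta, pi))   # the pairs (theta, pi) of Figure 1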

Contrary to the Dirichlet process, which provides a probability measure, the total measure H(Ω) ≠ 1 with probability one. Instead, beta processes are useful as parameters for a Bernoulli process. We write the Bernoulli process X as X = Σ_{ij} z_{ij} δ_{θ_{ij}}, where z_{ij} ∼ Bernoulli(π_{ij}), and denote this as X ∼ BeP(H). Thibaux & Jordan (2007) show that marginalizing over H yields the Indian buffet process (IBP) of Griffiths & Ghahramani (2006).

The IBP clearly shows the featural clustering property of the beta process, and is specified as follows: To generate a sample X_{n+1} from an IBP conditioned on the previous n samples, draw

X_{n+1} | X_{1:n} ∼ BeP( (1/(α + n)) Σ_{m=1}^n X_m + (α/(α + n)) µ ).

This says that, for each θ_{ij} with at least one value of X_m(θ_{ij}) equal to one, the value of X_{n+1}(θ_{ij}) is equal to one with probability (1/(α + n)) Σ_m X_m(θ_{ij}). After sampling these locations, a Poisson(αµ(Ω)/(α + n)) distributed number of new locations θ_{i′j′} are introduced with corresponding X_{n+1}(θ_{i′j′}) set equal to one. From this representation one can show that X_m(Ω) has a Poisson(µ(Ω)) distribution, and the number of unique observed atoms in the process X_{1:n} is Poisson distributed with parameter Σ_{m=1}^n αµ(Ω)/(α + m − 1) (Thibaux & Jordan, 2007).

2.1 The beta process as a Poisson process

An informative perspective of the beta process is as a completely random measure, a construction based on the Poisson process (Jordan, 2010). We illustrate this in Figure 1 using an example where Ω = [a, b] and µ(A) = (γ/(b − a)) Leb(A), with Leb(·) the Lebesgue measure. The right figure shows a draw from the beta process. The left figure shows the underlying Poisson process, Π = {(θ, π)}.

In this example, a Poisson process generates points in the space [a, b] × [0, 1]. It is completely characterized by its mean measure, ν(dθ, dπ) (Kingman, 1993; Cinlar, 2011). For any subset A ⊂ [a, b] × [0, 1], the random counting measure N(A) equals the number of points from Π contained in A. The distribution of N(A) is Poisson with parameter ν(A). Moreover, for all pairwise disjoint sets A_1, ..., A_n, the random variables N(A_1), ..., N(A_n) are independent, and therefore N is completely random.

In the case of the beta process, the mean measure of the underlying Poisson process is

ν(dθ, dπ) = α π^{−1} (1 − π)^{α−1} dπ µ(dθ).   (1)

We refer to λ(dπ) = α π^{−1} (1 − π)^{α−1} dπ as the Lévy measure of the process, and µ as its base measure. Our goal in Section 3 will be to show that the following construction is also a Poisson process with mean measure equal to (1), and is thus a beta process.

2.2 Stick-breaking for the beta process

Paisley et al. (2010) presented a method for explicitly constructing beta processes based on the notion of stick-breaking, a general method for obtaining discrete probability measures (Ishwaran & James, 2001). Stick-breaking plays an important role in Bayesian nonparametrics, thanks largely to a seminal derivation of a stick-breaking representation for the Dirichlet process by Sethuraman (1994). In the case of the beta process, Paisley et al. (2010) presented the following representation:

H = Σ_{i=1}^∞ Σ_{j=1}^{C_i} V_{ij}^{(i)} ∏_{l=1}^{i−1} (1 − V_{ij}^{(l)}) δ_{θ_{ij}},   (2)

C_i ∼iid Poisson(γ),  V_{ij}^{(l)} ∼iid Beta(1, α),  θ_{ij} ∼iid (1/γ) µ,

where, as previously mentioned, α > 0 and µ is a non-atomic finite base measure with µ(Ω) = γ.
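The representation in (2) is straightforward to simulate. Below is a minimal Python sketch (our illustration, not code from the paper) that draws a truncated version of H round by round and then draws one Bernoulli process X ∼ BeP(H); the uniform base measure and the truncation level R are assumptions made only for this example.

    import numpy as np

    # Minimal sketch of the stick-breaking construction in (2), truncated
    # after R rounds, followed by one Bernoulli process draw X ~ BeP(H).
    rng = np.random.default_rng(1)
    alpha, gamma, R = 2.0, 3.0, 50

    atoms, weights = [], []
    for i in range(1, R + 1):
        C_i = rng.poisson(gamma)                   # number of atoms in round i
        for _ in range(C_i):
            V = rng.beta(1.0, alpha, size=i)       # i breaks from one stick
            pi_ij = V[-1] * np.prod(1.0 - V[:-1])  # keep the i-th break as the weight
            atoms.append(rng.uniform())            # theta_ij ~ mu / gamma (uniform here)
            weights.append(pi_ij)

    weights = np.array(weights)
    # One Bernoulli process draw: z_ij ~ Bernoulli(pi_ij) at each atom of H.
    X = rng.binomial(1, weights)
    print(len(atoms), X.sum())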

Figure 2: An illustration of the stick-breaking construction of the beta process by round index i for i ≤ 5. Given a space Ω with measure µ, for each index i a Poisson(µ(Ω)) distributed number of atoms θ are drawn i.i.d. from the probability measure µ/µ(Ω). To atom θ_{ij}, a corresponding weight π_{ij} is attached that is the ith break drawn independently from a Beta(1, α) stick-breaking process. A beta process is H = Σ_{ij} π_{ij} δ_{θ_{ij}}.

This construction sequentially incorporates into H a Poisson-distributed number of atoms drawn i.i.d. from µ/γ, with each round in this sequence indexed by i. The atoms receive weights in [0, 1], drawn independently according to a stick-breaking construction—an atom in round i throws away the first i − 1 breaks from its stick, and keeps the ith break as its weight. We illustrate this in Figure 2.

We use an equivalent definition of H that reduces the total number of random variables by reducing the product ∏_{j<i} (1 − V_j) to a function of a single random variable. Let

H = Σ_{j=1}^{C_1} V_{1j} δ_{θ_{1j}} + Σ_{i=2}^∞ Σ_{j=1}^{C_i} V_{ij} e^{−T_{ij}} δ_{θ_{ij}},

C_i ∼iid Poisson(γ),  V_{ij} ∼iid Beta(1, α),  T_{ij} ∼ind Gamma(i − 1, α),  θ_{ij} ∼iid (1/γ) µ.   (3)

Starting from a finite approximation of the beta process, Paisley et al. (2010) showed that (2) must be a beta process by making use of the stick-breaking construction of a beta random variable (Sethuraman, 1994), and then finding the limiting case; a similar limiting-case derivation was given for the Indian buffet process (Griffiths & Ghahramani, 2006). We next show that (2) can be derived directly from the characterization of the beta process as a Poisson process. This verifies the construction, and also leads to new properties of the beta process.

3 Stick-breaking from the Poisson Process

We now prove that (2) is a beta process with parameter α > 0 and base measure µ by showing that it has an underlying Poisson process with mean measure (1).² We first state two basic lemmas regarding Poisson processes (Kingman, 1993). We then use these lemmas to show that the construction of H in (3) has an underlying Poisson process representation, followed by the proof.

² A similar result has recently been presented by Broderick et al. (2012); however, their approach differs from ours in its mathematical underpinnings. Specifically we use a decomposition of the beta process into a countably infinite collection of Poisson processes, which leads directly to the applications that we pursue in subsequent sections. By contrast, the proof in Broderick et al. (2012) does not take this route, and their focus is on power-law generalizations of the beta process.

3.1 Representing H as a Poisson process

The first lemma concerns the marking of points in a Poisson process with i.i.d. random variables. The second lemma concerns the superposition of independent Poisson processes. Theorem 1 uses these two lemmas to show that the construction in (3) has an underlying Poisson process.

Lemma 1 (marking property) Let Π* be a Poisson process on Ω with mean measure µ, and mark each point θ ∈ Π* independently with a random variable π drawn from a probability measure λ on [0, 1]. Then {(θ, π)} is a Poisson process on Ω × [0, 1] with mean measure µ × λ.

Lemma 2 (superposition property) Let Π_1, Π_2, ... be a countable collection of independent Poisson processes on Ω × [0, 1]. Let Π_i have mean measure ν_i. Then the superposition Π = ∪_{i=1}^∞ Π_i is a Poisson process with mean measure ν = Σ_{i=1}^∞ ν_i.

Theorem 1 The construction of H given in (3) has an underlying Poisson process.

Proof. This is an application of Lemmas 1 and 2; in this proof we fix some notation for what follows. Let π_{1j} := V_{1j} and π_{ij} := V_{ij} exp{−T_{ij}} for i > 1. Let H_i := Σ_{j=1}^{C_i} π_{ij} δ_{θ_{ij}} and therefore H = Σ_{i=1}^∞ H_i.
Noting that C_i ∼ Poisson(µ(Ω)), for each H_i the set of atoms {θ_{ij}} forms a Poisson process Π* on Ω with mean measure µ. Each θ_{ij} is marked with a π_{ij} ∈ [0, 1] that has some probability measure λ_i (to be defined later). By Lemma 1, each H_i has an underlying Poisson process Π_i = {(θ_{ij}, π_{ij})} on Ω × [0, 1] with mean measure µ × λ_i. It follows that H has an underlying Π = ∪_{i=1}^∞ Π_i, which is a superposition of a countable collection of independent Poisson processes, and is therefore a Poisson process by Lemma 2. □

3.2 Calculating the mean measure of H

We've shown that H has an underlying Poisson process; it remains to calculate its mean measure. We define the mean measure of Π_i to be ν_i = µ × λ_i, and by Lemma 2 the mean measure of Π is ν = Σ_{i=1}^∞ ν_i = µ × Σ_{i=1}^∞ λ_i. We next show that ν(dθ, dπ) = α π^{−1} (1 − π)^{α−1} dπ µ(dθ), which will establish the result stated in the following theorem.

Theorem 2 The construction defined in (2) is of a beta process with parameter α > 0 and finite base measure µ.

Proof. To show that the mean measure of Π is equal to (1), we first calculate each ν_i and then take their summation. We split this calculation into two groups, Π_1 and Π_i for i > 1, since the distribution of π_{ij} (as defined in the proof of Theorem 1) requires different calculations for these two groups. We use the definition of H in (3) to calculate these distributions of π_{ij}.

Case i = 1. The first round of atoms and their corresponding weights, H_1 = Σ_{j=1}^{C_1} π_{1j} δ_{θ_{1j}} with π_{1j} := V_{1j}, has an underlying Poisson process Π_1 = {(θ_{1j}, π_{1j})} with mean measure ν_1 = µ × λ_1 (Lemma 1). It follows that

λ_1(dπ) = α (1 − π)^{α−1} dπ.   (4)

We write λ_i(dπ) = f_i(π|α) dπ. For example, the density above is f_1 = α(1 − π)^{α−1}. We next focus on calculating the density f_i for i > 1.

Case i > 1. Each H_i has an underlying Poisson process Π_i = {(θ_{ij}, π_{ij})} with mean measure µ × λ_i, where λ_i determines the distribution of π_{ij} (Lemma 1). As with i = 1, we write this measure as λ_i(dπ) = f_i(π|α) dπ, where f_i(π|α) is the density of π_{ij}, i.e., of the ith break from a Beta(1, α) stick-breaking process. This density plays a significant role in the truncation bounds and MCMC sampler derived in the following sections; we next focus on its derivation.

Recall that π_{ij} := V_{ij} exp{−T_{ij}}, where V_{ij} ∼ Beta(1, α) and T_{ij} ∼ Gamma(i − 1, α). First, let W_{ij} := exp{−T_{ij}}. Then by a change of variables,

p_W(w|i, α) = (α^{i−1}/(i − 2)!) w^{α−1} (− ln w)^{i−2}.

Using the product distribution formula for two random variables (Rohatgi, 1976), the density of π_{ij} = V_{ij} W_{ij} is

f_i(π|α) = ∫_π^1 w^{−1} p_V(π/w|α) p_W(w|i, α) dw   (5)
         = (α^i/(i − 2)!) ∫_π^1 w^{α−2} (ln(1/w))^{i−2} (1 − π/w)^{α−1} dw.

Though this integral does not have a closed-form solution for a single Lévy measure λ_i, we show next that the sum over these measures does have a closed-form solution.

The Lévy measure of H. Using the values of f_i derived above, we can calculate the mean measure of the Poisson process underlying (2). As discussed, the measure ν can be decomposed as follows,

ν(dθ, dπ) = Σ_{i=1}^∞ (µ × λ_i)(dθ, dπ) = µ(dθ) dπ Σ_{i=1}^∞ f_i(π|α).

By showing that Σ_{i=1}^∞ f_i(π|α) = α π^{−1} (1 − π)^{α−1}, we complete the proof; we refer to the appendix for the details of this calculation. □

4 Some Properties of the Beta Process

We have shown that the stick-breaking construction defined in (2) has an underlying Poisson process with mean measure ν(dθ, dπ) = α π^{−1} (1 − π)^{α−1} dπ µ(dθ), and is therefore a beta process. Representing the stick-breaking construction as a superposition of a countably infinite collection of independent Poisson processes is also useful for further characterizing the beta process. For example, we can use this representation to analyze truncation properties. We can also easily extend the construction in (2) to cases such as that considered in Hjort (1990), where α is a function of θ and µ is an infinite measure.

4.1 Truncated beta processes

Truncated beta processes arise in the variational inference setting (Doshi-Velez et al., 2009; Paisley et al., 2011; Jordan et al., 1999). Poisson process representations are useful for characterizing the part of the beta process that is being thrown away in the truncation. Consider a beta process truncated after round R, defined as H^{(R)} = Σ_{i=1}^R H_i. The part being discarded, H − H^{(R)}, has an underlying Poisson process with mean measure

ν_R^+(dθ, dπ) := Σ_{i=R+1}^∞ ν_i(dθ, dπ) = µ(dθ) × Σ_{i=R+1}^∞ λ_i(dπ),   (6)

and a corresponding counting measure N_R^+(dθ, dπ). This measure contains information about the missing atoms.³

³ For example, the number of missing atoms having weight π ≥ ε is Poisson distributed with parameter ν_R^+(Ω, [ε, 1]).
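To make footnote 3 concrete, the following Monte Carlo sketch in Python (ours, not from the paper) estimates the Poisson rate ν_R^+(Ω, [ε, 1]) of discarded atoms with weight at least ε by simulating stick-breaking weights directly rather than integrating the densities f_i; the truncation of the infinite sum at I_max and all parameter values are assumptions made only for illustration.

    import numpy as np

    # Rough Monte Carlo estimate of nu_R^+(Omega, [eps, 1]) =
    #   gamma * sum_{i > R} P(i-th stick-break >= eps),
    # truncating the sum at a round I_max where the breaks are already tiny.
    rng = np.random.default_rng(2)
    alpha, gamma, R, eps = 2.0, 3.0, 5, 0.01
    I_max, n_mc = 100, 10_000

    V = rng.beta(1.0, alpha, size=(n_mc, I_max))
    # breaks[:, i-1] = V_i * prod_{l < i} (1 - V_l)
    prefix = np.cumprod(np.hstack([np.ones((n_mc, 1)), 1.0 - V[:, :-1]]), axis=1)
    breaks = V * prefix
    rate = gamma * (breaks[:, R:] >= eps).mean(axis=0).sum()
    print("Poisson rate of missing atoms with weight >= eps:", rate)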


Figure 3: Examples of the error bound. (left) The bounds for α = 3, γ = 4 and M = 500. The previous bound appears in Paisley et al. (2011). (center and right) Contour plots of the L1 distance between the Theorem 3 bound and the Corollary bound, presented as functions of α and γ for (center) M = 100, (right) M = 500. The L1 distance for the left plot is 0.46. The Corollary bound becomes tighter as α and γ increase, and as M decreases.

For truncated beta processes, a measure of closeness to the true beta process is helpful when selecting truncation levels. To this end, let data Y_n ∼ f(X_n, φ_n), where X_n is a Bernoulli process taking either H or H^{(R)} as parameters, and φ_n is a set of additional parameters (which could be globally shared). Let Y = (Y_1, ..., Y_M). One measure of closeness is the L1 distance between the marginal density of Y under the beta process, m_∞(Y), and the process truncated at round R, m_R(Y). This measure originated with work on truncated Dirichlet processes in Ishwaran & James (2000, 2001); in Doshi-Velez et al. (2009), it was extended to the beta process.

After slight modification to account for truncating rounds rather than atoms, the result in Doshi-Velez et al. (2009) implies that

(1/4) ∫ |m_R(Y) − m_∞(Y)| dY ≤ P(∃(i, j), i > R, 1 ≤ n ≤ M : X_n(θ_{ij}) = 1),   (7)

with a similar proof as in Ishwaran & James (2000). This says that 1/4 times the L1 distance between m_R and m_∞ is less than one minus the probability that, in M Bernoulli processes with parameter H ∼ BP(α, µ), there is no X_n(θ_{ij}) = 1 for a θ_{ij} ∈ H with i > R. In Doshi-Velez et al. (2009) and Paisley et al. (2011), this bound was loosened. Using the Poisson process representation of H, we can give an exact form of this bound. To do so, we use the following lemma, which is similar to Lemma 1, but accounts for markings that are not independent of the atom.

Lemma 3 Let (θ, π) form a Poisson process on Ω × [0, 1] with mean measure ν_R^+. Mark each (θ, π) with a random variable U in a finite space S with transition probability kernel Q(π, ·). Then (θ, π, U) forms a Poisson process on Ω × [0, 1] × S with mean measure ν_R^+(dθ, dπ) Q(π, U).

This leads to Theorem 3.

Theorem 3 Let X_{1:M} ∼iid BeP(H) with H ∼ BP(α, µ) constructed as in (2). For a truncation value R, let E be the event that there exists an index (i, j) with i > R such that X_n(θ_{ij}) = 1. Then the bound in (7) equals

P(E) = 1 − exp{ −∫_{(0,1]} ν_R^+(Ω, dπ) (1 − (1 − π)^M) }.

Proof. Let U ∈ {0, 1}^M. By Lemma 3, the set {(θ, π, U)} constructed from rounds R + 1 and higher is a Poisson process on Ω × [0, 1] × {0, 1}^M with mean measure ν_R^+(dθ, dπ) Q(π, U) and a corresponding counting measure N_R^+(dθ, dπ, U), where Q(π, ·) is a transition probability measure on the space {0, 1}^M. Let A = {0, 1}^M \ 0, where 0 is the zero vector. Then Q(π, A) is the probability of this set with respect to a Bernoulli process with parameter π, and therefore Q(π, A) = 1 − (1 − π)^M. The probability P(E) = 1 − P(E^c), which is equal to 1 − P(N_R^+(Ω, [0, 1], A) = 0). The theorem follows since N_R^+(Ω, [0, 1], A) is a Poisson-distributed random variable with parameter ∫_{(0,1]} ν_R^+(Ω, dπ) Q(π, A). □⁴

⁴ We give a second proof using simple functions in the appendix. One can use approximating simple functions to give an arbitrarily close approximation of Theorem 3. Furthermore, since ν_R^+ = ν_{R−1}^+ − ν_R and ν_0^+ = ν, performing a sweep of truncation values requires approximating only one additional integral for each increment of R.

Using the Poisson process, we can give an analytical bound that is tighter than that in Paisley et al. (2011).

Corollary 1 Given the set-up in Theorem 3, an upper bound on P(E) is

P(E) ≤ 1 − exp{ −γM (α/(1 + α))^R }.

Proof. We give the proof in the appendix.
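For a sense of scale, a short Python sketch (ours) evaluates the closed-form Corollary 1 bound and finds the smallest truncation level R that drives it below a chosen tolerance; the tolerance and parameter settings are illustrative.

    import math

    # Corollary 1: P(E) <= 1 - exp(-gamma * M * (alpha / (1 + alpha))**R).
    alpha, gamma, M, tol = 3.0, 4.0, 500, 0.1

    def corollary_bound(R):
        return 1.0 - math.exp(-gamma * M * (alpha / (1.0 + alpha)) ** R)

    R = 1
    while corollary_bound(R) > tol:   # smallest R with bound below tol
        R += 1
    print(R, corollary_bound(R))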

The bound in Paisley et al. (2011) has 2M rather than M. We observe that the term in the exponential equals the negative of M ∫_0^1 π ν_R^+(Ω, dπ), which is the expected number of missing ones in M truncated Bernoulli process observations. Figure 3 shows an example of these bounds.

4.2 Beta processes with infinite µ and varying α

The Poisson process allows for the construction to be extended to the more general definition of the beta process given by Hjort (1990). In this definition, the value of α(θ) is a function of θ, rather than a constant, and the base measure µ may be infinite, but σ-finite.⁵ Using Poisson processes, the extension of (2) to this setting is straightforward. We note that this is not immediate from the limiting case derivation presented in Paisley et al. (2010).

⁵ That is, the total measure µ(Ω) = ∞, but there is a measurable partition (E_k) of Ω with each µ(E_k) < ∞.

For a partition (E_k) of Ω with µ(E_k) < ∞, we treat each set E_k as a separate Poisson process with mean measure

ν_{E_k}(dθ, dπ) = µ(dθ) λ(θ, dπ),  θ ∈ E_k,
               = α(θ) π^{−1} (1 − π)^{α(θ)−1} dπ µ(dθ).

The transition probability kernel λ follows from the continuous version of Lemma 3. By superposition, we have the overall beta process. Modifying (2) gives the following construction: For each set E_k construct a separate H_{E_k}. In each round of (2), incorporate Poisson(µ(E_k)) new atoms θ_{ij}^{(k)} ∈ E_k drawn i.i.d. from µ/µ(E_k). For atom θ_{ij}^{(k)}, draw a weight π_{ij}^{(k)} using the ith break from a Beta(1, α(θ_{ij}^{(k)})) stick-breaking process. The complete beta process is the union of these local beta processes.

5 MCMC Inference

We derive a new MCMC inference algorithm for beta processes that incorporates ideas from the stick-breaking construction and Poisson process. In the algorithm, we re-index atoms to take one index value k, and let d_k indicate the Poisson process of the kth atom under consideration (i.e., θ_k ∈ H_{d_k}). For calculation of the likelihood, given M Bernoulli process draws, we denote the sufficient statistics m_{1,k} = Σ_{n=1}^M X_n(θ_k) and m_{0,k} = M − m_{1,k}. The likelihood and priors are

p(m_{1,k}, m_{0,k} | π) = π^{m_{1,k}} (1 − π)^{m_{0,k}},

f_{d_k^s}(π | w_k^s, α_s) ∝ (w_k^s − π)^{α_s−1}  if d_k^s > 1,
f_{d_k^s}(π | w_k^s, α_s) ∝ (1 − π)^{α_s−1}     if d_k^s = 1.

We use the densities f_1 and f_i, i > 1, derived in (4) and (5) above. Since the numerical integration in (5) is computationally expensive, we sample w as an auxiliary variable. The joint density of π_{ij} and w_{ij}, 0 ≤ π_{ij} ≤ w_{ij}, for θ_{ij} ∈ H_i and i > 1 is

f_i(π_{ij}, w_{ij} | α) ∝ w_{ij}^{−1} (− ln w_{ij})^{i−2} (w_{ij} − π_{ij})^{α−1}.

The density for i = 1 does not depend on w.

5.1 A distribution on observed atoms

Before presenting the MCMC sampler, we derive a quantity that we use in the algorithm. Specifically, for the collection of Poisson processes H_i, we calculate the distribution on the number of atoms θ ∈ H_i for which the Bernoulli process X_n(θ) is equal to one for some 1 ≤ n ≤ M. In this case, we denote the atom as being "observed." This distribution is relevant to inference, since in practice we care most about samples at these locations.

The distribution of this quantity is related to Theorem 3. There, the exponential term gives the probability that this number is zero for all i > R. More generally, under the prior on a single H_i, the number of observed atoms is Poisson distributed with parameter

ξ_i = ∫_0^1 ν_i(Ω, dπ) (1 − (1 − π)^M).   (8)

The sum Σ_{i=1}^∞ ξ_i < ∞ for finite M, meaning a finite number of atoms will be observed with probability one.

Conditioning on there being T observed atoms overall, θ*_{1:T}, we can calculate a distribution on the Poisson process to which atom θ*_k belongs. This is an instance of Poissonization of the multinomial; since for each H_i the distribution on the number of observed atoms is independent and Poisson(ξ_i) distributed, conditioning on T the Poisson process to which atom θ*_k belongs is independent of all other atoms, and identically distributed with P(θ*_k ∈ H_i) ∝ ξ_i.

5.2 The sampling algorithm

We next present the MCMC sampling algorithm. We index samples by s, and define all densities to be zero outside of their support.

Sample π_k. We take several Metropolis-Hastings steps for π_k. Let π_k^s be the value at step s. Let the proposal be π_k^⋆ = π_k^s + ξ_k^s, where ξ_k^s ∼iid N(0, σ_π²). Set π_k^{s+1} = π_k^⋆ with probability

min{ 1, [p(m_{1,k}, m_{0,k} | π_k^⋆) f_{d_k^s}(π_k^⋆ | w_k^s, α_s)] / [p(m_{1,k}, m_{0,k} | π_k^s) f_{d_k^s}(π_k^s | w_k^s, α_s)] },

otherwise set π_k^{s+1} = π_k^s.

Sample w_k. We take several random walk Metropolis-Hastings steps for w_k when d_k > 1. Let w_k^s be the value at step s. Set the proposal w_k^⋆ = w_k^s + ζ_k^s, where ζ_k^s ∼iid N(0, σ_w²), and set

w_k^{s+1} = w_k^⋆ with probability min{ 1, f(w_k^⋆ | π_k^s, d_k^s, α_s) / f(w_k^s | π_k^s, d_k^s, α_s) },

otherwise set w_k^{s+1} = w_k^s.

The value of f is

f(w | π_k^s, d_k^s, α_s) = w^{−1} (− ln w)^{d_k^s − 2} (w − π_k^s)^{α_s − 1}.

When d_k^s = 1, the auxiliary variable w_k^s does not exist, so we don't sample it. If d_k^{s−1} = 1, but d_k^s > 1, we sample w_k^s ∼ Uniform(π_k^s, 1) and take many random walk M-H steps as detailed above.

Sample d_k. We follow the discussion in Section 5.1 to sample d_k^{s+1}. Conditioned on there being T_s observed atoms at step s, the prior on d_k^{s+1} is independent of all other indicators d, and P(d_k^{s+1} = i) ∝ ξ_i^s, where ξ_i^s is given in (8). The likelihood depends on the current value of d_k^s.

Case d_k^s > 1. The likelihood f(π_k^s, w_k^s | d_k^{s+1} = i, α_s) is proportional to

(α_s^i/(i − 2)!) (w_k^s)^{−1} (− ln w_k^s)^{i−2} (w_k^s − π_k^s)^{α_s−1}   if i > 1,
α_s (1 − π_k^s)^{α_s−1}                                                  if i = 1.

Case d_k^s = 1. In this case we must account for the possibility that π_k^s may be greater than the most recent value of w_k, so we marginalize the auxiliary variable w numerically, and compute the likelihood as follows:

(α_s^i/(i − 2)!) ∫_{π_k^s}^1 w^{−1} (− ln w)^{i−2} (w − π_k^s)^{α_s−1} dw   if i > 1,
α_s (1 − π_k^s)^{α_s−1}                                                    if i = 1.

A slice sampler (Neal, 2003) can be used to sample from this infinite-dimensional discrete distribution.

Sample α. We have the option of Gibbs sampling α. For a Gamma(τ_1, τ_2) prior on α, the full conditional of α is a gamma distribution with parameters

τ'_{1,s} = τ_1 + Σ_k d_k^s,   τ'_{2,s} = τ_2 − Σ_k ln(w_k^s − π_k^s).

In this case we set w_k^s = 1 if d_k^s = 1.

Sample γ. We also have the option of Gibbs sampling γ using a Gamma(κ_1, κ_2) prior on γ. As discussed in Section 5.1, let T_s be the number of observed atoms in the model at step s. The full conditional of γ is a gamma distribution with parameters

κ'_{1,s} = κ_1 + T_s,   κ'_{2,s} = κ_2 + Σ_{n=0}^{M−1} α_s/(α_s + n).

This distribution results from the Poisson process, and the fact that the observed and unobserved atoms form a disjoint set, and therefore can be treated as independent Poisson processes. In deriving this update, we use the equality Σ_{i=1}^∞ ξ_i^s/γ_s = Σ_{n=0}^{M−1} α_s/(α_s + n), found by inserting the mean measure (1) into (8).⁶

⁶ The variable γ only enters the algorithm when sampling new atoms. Since we learn the correct number of factors, this indicates that our algorithm is not sensitive to γ. Fixing the concentration parameter α is an option, and is often done for Dirichlet processes.

Sample X. For sampling the Bernoulli process X, we have that p(X | D, H) ∝ p(D | X) p(X | H). The likelihood of data D is independent of H given X and is model-specific, while the prior on X only depends on π.

Sample new atoms. We sample new atoms in addition to the observed atoms. For each i = 1, ..., max(d_{1:T_s}), we "complete" the round by sampling the unobserved atoms. For Poisson process H_i, this number has a Poisson(γ_s − ξ_i^s) distribution. We can sample additional Poisson processes as well according to this distribution. In all cases, the new atoms are i.i.d. µ/γ_s.

5.3 Experimental results

We evaluate the MCMC sampler on synthetic data. We use the beta-Bernoulli process as a matrix factorization prior for a linear-Gaussian model. We generate a data matrix Y = Θ(W ◦ Z) + ε with each W_{kn} ∼ N(0, 1); the binary matrix Z has Pr(Z_{kn} = 1 | H) = π_k, and the columns of Θ are vectorized 4 × 4 patches of various patterns (see Figure 4). To generate H for generating Z, we let π_k be the weight of the kth atom under the stick-breaking construction with parameters α = 1, γ = 2. We place Gamma(1, 1) priors on α and γ for inference. We sampled M = 500 observations, which used a total of 20 factors. Therefore Y ∈ R^{16×500} and Z ∈ {0, 1}^{20×500}.

We ran our MCMC sampler for 10,000 iterations, collecting samples every 25th iteration after a burn-in of 2,000 iterations. For sampling π and w, we took 1,000 random walk steps using a Gaussian with variance 10^{−3}. Inference was relatively fast; sampling all beta process related variables required roughly two seconds per iteration, which is significantly faster than the per-iteration average of 14 seconds for the algorithm presented in Paisley et al. (2010), where Monte Carlo integration was heavily used.

We show results in Figure 4. While we expected to learn a γ around two, and α around one, we note that our algorithm is inaccurate for these values. We believe that this is largely due to our prior on d_k (Section 5.1). The value of d_k significantly impacts the value of α, and conditioning on Σ_{n=1}^M X_n(θ) > 0 gives a prior for d_k that is spread widely across the rounds and allows for much variation. A possible fix for this would be conditioning on the exact value of the number of atoms in a round. This will effectively give a unique prior for each atom, and would require significantly more numerical integrations, leading to a slower algorithm.

Despite the inaccuracy in learning γ and α, the algorithm still found the correct number of factors (initialized at 100), and found the correct underlying sparse structure of the data. This indicates that our MCMC sampler is able to perform the main task of finding a good sparse representation. It appeared that the likelihood of π dominates inference for this value, since we observed that these samples tended to "shadow" the empirical distribution of Z.
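A sketch of this synthetic-data setup in Python (ours, not the authors' code) is given below; the dictionary Θ is drawn at random rather than built from the 4 × 4 patch patterns of Figure 4, and the noise level is a placeholder, so it reproduces the structure of the experiment but not its exact data.

    import numpy as np

    # Y = Theta (W o Z) + noise, with Z_{kn} ~ Bernoulli(pi_k) and pi_k the
    # weight of the k-th atom of a (truncated) stick-breaking beta process draw.
    rng = np.random.default_rng(3)
    alpha, gamma, M = 1.0, 2.0, 500

    pi = []
    for i in range(1, 51):                          # 50 rounds as a truncation
        for _ in range(rng.poisson(gamma)):
            V = rng.beta(1.0, alpha, size=i)
            pi.append(V[-1] * np.prod(1.0 - V[:-1]))
    pi = np.array(pi)
    K = len(pi)

    Theta = rng.normal(size=(16, K))                # stand-in for the 4x4 patch loadings
    W = rng.normal(size=(K, M))                     # W_{kn} ~ N(0, 1)
    Z = rng.binomial(1, pi[:, None], size=(K, M))   # Z_{kn} ~ Bernoulli(pi_k)
    Y = Theta @ (W * Z) + 0.1 * rng.normal(size=(16, M))   # placeholder noise level
    print(Y.shape, Z.sum(axis=1)[:5])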

Figure 4: Results on synthetic data. (left) The top 16 underlying factor loadings for MCMC iteration 10,000. The ground truth patterns are uncovered. (middle) A histogram of the number of factors. The empirical distribution centers on the truth. (right) The kernel smoothed density using the samples of α and γ (see the text for discussion).

6 Conclusion

We have used the Poisson process to prove that the stick-breaking construction presented by Paisley et al. (2010) is a beta process. We then presented several consequences of this representation, including truncation bounds, a more general definition of the construction, and a new MCMC sampler for stick-breaking beta processes. Poisson processes offer flexible representations of Bayesian nonparametric priors; for example, Lin et al. (2010) show how they can be used as a general representation of dependent Dirichlet processes. Representing a beta process as a superposition of a countable collection of Poisson processes may lead to similar generalizations.

Appendix

Proof of Theorem 2 (conclusion) From the text, we have that λ(dπ) = f_1(π|α) dπ + Σ_{i=2}^∞ f_i(π|α) dπ with f_1(π|α) = α(1 − π)^{α−1} and f_i given in Equation 5 for i > 1. The sum of densities is

Σ_{i=2}^∞ f_i(π|α)
 = Σ_{i=2}^∞ (α^i/(i − 2)!) ∫_π^1 w^{α−2} (ln(1/w))^{i−2} (1 − π/w)^{α−1} dw
 = α² ∫_π^1 w^{α−2} (1 − π/w)^{α−1} Σ_{i=2}^∞ [α^{i−2} (ln(1/w))^{i−2} / (i − 2)!] dw
 = α² ∫_π^1 w^{−2} (1 − π/w)^{α−1} dw.   (9)

The second equality is by monotone convergence and Fubini's theorem. This leads to an exponential power series, which simplifies to the third line. The last line equals α(1 − π)^α / π. Adding the result of (9) to α(1 − π)^{α−1} gives Σ_{i=1}^∞ f_i(π|α) = α π^{−1} (1 − π)^{α−1}. Therefore, ν(dθ, dπ) = α π^{−1} (1 − π)^{α−1} dπ µ(dθ), and the proof is complete. □

Alternate proof of Theorem 3 Let the set B_{nk} = ((k−1)/n, k/n] and b_{nk} = (k−1)/n, where n and k ≤ n are positive integers. Approximate the variable π ∈ [0, 1] with the simple function g_n(π) = Σ_{k=1}^n b_{nk} 1_{B_{nk}}(π). We calculate the truncation error term, P(E^c) = E[∏_{i>R,j} (1 − π_{ij})^M], by approximating with g_n, re-framing the problem as a Poisson process with mean and counting measures ν_R^+ and N_R^+(Ω, B), and then taking a limit:

E[∏_{i>R,j} (1 − π_{ij})^M]
 = lim_{n→∞} ∏_{k=2}^n E[(1 − b_{nk})^{M · N_R^+(Ω, B_{nk})}]   (10)
 = exp{ lim_{n→∞} −Σ_{k=2}^n ν_R^+(Ω, B_{nk}) (1 − (1 − b_{nk})^M) }.

For a fixed n, this approach divides the interval [0, 1] into disjoint regions that can be analyzed separately as independent Poisson processes. Each region uses the approximation π ≈ g_n(π), with lim_{n→∞} g_n(π) = π, and N_R^+(Ω, B) counts the number of atoms with weights that fall in the interval B. Since N_R^+ is Poisson distributed with mean ν_R^+, the expectation follows.

Proof of Corollary 1 From the alternate proof of Theorem 3 above, we have P(E) = 1 − E[∏_{i>R,j} (1 − π_{ij})^M] ≤ 1 − E[∏_{i>R,j} (1 − π_{ij})]^M. This second expectation can be calculated as in Theorem 3 with M replaced by one. The resulting integral is analytic. Let q_r be the distribution of the rth break from a Beta(1, α) stick-breaking process. The negative of the term in the exponential of Theorem 3 is

∫_0^1 π ν_R^+(Ω, dπ) = γ Σ_{r=R+1}^∞ E_{q_r}[π].   (11)

Since E_{q_r}[π] = α^{−1} (α/(1 + α))^r, (11) equals γ (α/(1 + α))^R.
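As a sanity check on this calculation, the following Python snippet (ours) numerically integrates the densities f_i from (5), sums them over i, and compares the result with α π^{−1}(1 − π)^{α−1} at a single test point; the truncation of the sum at I_max is an assumption that is harmless because the summands decay factorially.

    import numpy as np
    from scipy import integrate
    from math import factorial, log

    # Numerical check of: f_1(pi) + sum_{i>=2} f_i(pi) = alpha*pi^{-1}*(1-pi)^{alpha-1}.
    alpha, pi_val, I_max = 2.0, 0.3, 60

    def f_i(i, p, alpha):
        if i == 1:
            return alpha * (1.0 - p) ** (alpha - 1.0)
        integrand = lambda w: (w ** (alpha - 2.0)
                               * log(1.0 / w) ** (i - 2)
                               * (1.0 - p / w) ** (alpha - 1.0))
        val, _ = integrate.quad(integrand, p, 1.0)
        return alpha ** i / factorial(i - 2) * val

    lhs = sum(f_i(i, pi_val, alpha) for i in range(1, I_max + 1))
    rhs = alpha / pi_val * (1.0 - pi_val) ** (alpha - 1.0)
    print(lhs, rhs)   # the two numbers agree to several decimal places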

Acknowledgements  John Paisley and Michael I. Jordan are supported by ONR grant number N00014-11-1-0688 under the MURI program. David M. Blei is supported by ONR 175-6343, NSF CAREER 0745520, AFOSR 09NL202, the Alfred P. Sloan foundation, and a grant from Google.

References

Broderick, T., Jordan, M. & Pitman, J. (2012). Beta processes, stick-breaking, and power laws. Bayesian Analysis 7, 1–38.

Cinlar, E. (2011). Probability and Stochastics. Springer.

Doshi-Velez, F., Miller, K., Van Gael, J. & Teh, Y. (2009). Variational inference for the Indian buffet process. In International Conference on Artificial Intelligence and Statistics. Clearwater Beach, FL.

Fox, E., Sudderth, E., Jordan, M. I. & Willsky, A. S. (2010). Sharing features among dynamical systems with beta processes. In Advances in Neural Information Processing. Vancouver, B.C.

Griffiths, T. & Ghahramani, Z. (2006). Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing. Vancouver, B.C.

Hjort, N. (1990). Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics 18, 1259–1294.

Ishwaran, H. & James, L. (2000). Approximate Dirichlet process computing in finite normal mixtures: smoothing and prior information. Journal of Computational and Graphical Statistics 11, 1–26.

Ishwaran, H. & James, L. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96, 161–173.

Jordan, M., Ghahramani, Z., Jaakkola, T. & Saul, L. (1999). An introduction to variational methods for graphical models. Machine Learning 37, 183–233.

Jordan, M. I. (2010). Hierarchical models, nested models and completely random measures. In M.-H. Chen, D. Dey, P. Müller, D. Sun & K. Ye, eds., Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger. Springer, New York.

Kingman, J. (1993). Poisson Processes. Oxford University Press.

Lin, D., Grimson, E. & Fisher, J. (2010). Construction of dependent Dirichlet processes based on Poisson processes. In Advances in Neural Information Processing. Vancouver, B.C.

Neal, R. (2003). Slice sampling. Annals of Statistics 31, 705–767.

Paisley, J., Carin, L. & Blei, D. (2011). Variational inference for stick-breaking beta process priors. In International Conference on Machine Learning. Seattle, WA.

Paisley, J., Zaas, A., Ginsburg, G., Woods, C. & Carin, L. (2010). A stick-breaking construction of the beta process. In International Conference on Machine Learning. Haifa, Israel.

Rohatgi, V. (1976). An Introduction to Probability Theory and Mathematical Statistics. John Wiley & Sons.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica 4, 639–650.

Teh, Y., Gorur, D. & Ghahramani, Z. (2007). Stick-breaking construction for the Indian buffet process. In International Conference on Artificial Intelligence and Statistics. San Juan, Puerto Rico.

Thibaux, R. & Jordan, M. (2007). Hierarchical beta processes and the Indian buffet process. In International Conference on Artificial Intelligence and Statistics. San Juan, Puerto Rico.

Williamson, S., Wang, C., Heller, K. & Blei, D. (2010). The IBP compound Dirichlet process and its application to focused topic modeling. In International Conference on Machine Learning. Haifa, Israel.

Zhou, M., Yang, H., Sapiro, G., Dunson, D. & Carin, L. (2011). Dependent hierarchical beta process for image interpolation and denoising. In International Conference on Artificial Intelligence and Statistics. Fort Lauderdale, FL.