
A Stick-Breaking Construction of the Beta Process

John Paisley1 [email protected]
Aimee Zaas2 [email protected]
Christopher W. Woods2 [email protected]
Geoffrey S. Ginsburg2 [email protected]
Lawrence Carin1 [email protected]
1Department of ECE, 2Duke University Medical Center, Duke University, Durham, NC

Abstract

We present and derive a new stick-breaking construction of the beta process. The construction is closely related to a special case of the stick-breaking construction of the Dirichlet process (Sethuraman, 1994) applied to the beta distribution. We derive an inference procedure that relies on Monte Carlo integration to reduce the number of parameters to be inferred, and present results on synthetic data, the MNIST handwritten digits data set and a time-evolving gene expression data set.

1. Introduction

The Dirichlet process (Ferguson, 1973) is a powerful Bayesian nonparametric prior for mixture models. There are two principal methods for drawing from this infinite-dimensional prior: (i) the Chinese restaurant process (Blackwell & MacQueen, 1973), in which samples are drawn from a marginalized Dirichlet process and implicitly construct the prior; and (ii) the stick-breaking process (Sethuraman, 1994), which is a fully Bayesian construction of the Dirichlet process.

To review, a Dirichlet process, G, can be constructed according to the following stick-breaking process (Sethuraman, 1994; Ishwaran & James, 2001),

    G = Σ_{i=1}^∞ V_i ∏_{j=1}^{i−1} (1 − V_j) δ_{θ_i}
    V_i ~ iid Beta(1, α)
    θ_i ~ iid G_0                                                (1)

This stick-breaking process is so called because proportions, V_i, are sequentially broken from the remaining length, ∏_{j=1}^{i−1}(1 − V_j), of a unit-length stick. This produces a probability (or weight), V_i ∏_{j=1}^{i−1}(1 − V_j), that can be visually represented as one of an infinite number of contiguous sections cut out of a unit-length stick. As i increases, these weights stochastically decrease, since smaller and smaller fractions of the stick remain, and so only a small number of the infinite number of weights have appreciable value. By construction, these weights occur first, which allows for practical implementation of this prior.

Similarly, the beta process (Hjort, 1990) has recently received significant use as a nonparametric prior for latent factor models (Ghahramani et al., 2007; Thibaux & Jordan, 2007). This infinite-dimensional prior can be drawn via marginalization using the Indian buffet process (Griffiths & Ghahramani, 2005), where samples again implicitly construct the prior. However, unlike the Dirichlet process, the fully Bayesian stick-breaking construction of the beta process has yet to be derived (though related methods exist, reviewed in Section 2).

The contribution of this paper is the derivation of a stick-breaking construction of the beta process. We use a little-known property of the constructive definition in (Sethuraman, 1994), which is equally applicable to the beta distribution, itself a two-dimensional Dirichlet distribution. The construction presented here will be seen to result from an infinite collection of these stick-breaking constructions of the beta distribution.

The paper is organized as follows. In Section 2, we review the beta process, the stick-breaking construction of the beta distribution, and related work in this area. In Section 3, we present the stick-breaking construction of the beta process and its derivation. We derive an inference procedure for the construction in Section 4 and present experimental results on synthetic, MNIST digits and gene expression data in Section 5.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

2. The Beta Process

Let H_0 be a continuous measure on the space (Θ, B) and let H_0(Θ) = γ. Also, let α be a positive scalar and define the process H_K as follows,

    H_K = Σ_{k=1}^K π_k δ_{θ_k}
    π_k ~ iid Beta(αγ/K, α(1 − γ/K))
    θ_k ~ iid (1/γ) H_0                                          (2)

then as K → ∞, H_K → H and H is a beta process, which we denote H ~ BP(αH_0). We avoid a complete measure-theoretic definition, since the stick-breaking construction to be presented is derived in reference to the limit of (2). That H is a beta process can be shown in the following way: integrating out π^(K) = (π_1, ..., π_K)^T ∈ (0,1)^K, letting K → ∞ and sampling from this marginalized process produces the two-parameter extension of the Indian buffet process discussed in (Thibaux & Jordan, 2007), which is shown to have the beta process as its underlying de Finetti distribution.

Before deriving the stick-breaking construction of the beta process, we review a property of the beta distribution that will be central to the construction. We also review related work to distinguish the presented construction from other constructions in the literature.

2.1. A Construction of the Beta Distribution

The constructive definition of a Dirichlet prior derived in (Sethuraman, 1994) applies to more than the infinite-dimensional Dirichlet process. In fact, it is applicable to Dirichlet priors of any dimension, of which the beta distribution can be viewed as a special, two-dimensional case.¹ Focusing on this special case, Sethuraman showed that one can sample

    π ~ Beta(a, b)                                               (3)

according to the following stick-breaking construction,

    π = Σ_{i=1}^∞ V_i ∏_{j=1}^{i−1} (1 − V_j) I(Y_i = 1)
    V_i ~ iid Beta(1, a + b)
    Y_i ~ iid Bernoulli(a/(a + b))                               (4)

where I(·) denotes the indicator function. In this construction, weights are drawn according to the standard stick-breaking construction of the DP (Ishwaran & James, 2001), as well as their respective locations, which are independent of the weights and iid among themselves. The major difference is that the set of locations is finite, 0 or 1, which results in more than one term being active in the summation. Space restrictions prohibit an explicit proof of this construction here, but we note that Sethuraman implicitly proves this in the following way: using notation from (Sethuraman, 1994), let the space, X = {0, 1}, and the prior measure, α, be α(1) = a, α(0) = b, and therefore α(X) = a + b. Carrying out the proof in (Sethuraman, 1994) for this particular space and measure yields (4). We note that this α is different from that in (2).

¹We thank Jayaram Sethuraman for his valuable correspondence regarding his constructive definition.

2.2. Related Work

There are three related constructions in the machine learning literature, each of which differs significantly from that presented here. The first construction, proposed by (Teh et al., 2007), is presented specifically for the Indian buffet process (IBP) prior. The generative process from which the IBP and this construction are derived replaces the beta distribution in (2) with Beta(α/K, 1). This small change greatly facilitates this construction, since the parameter 1 in Beta(α/K, 1) allows for a necessary simplification of the beta distribution. This construction does not extend to the two-parameter generalization of the IBP (Ghahramani et al., 2007), which is equivalent in the infinite limit to the marginalized representation in (2).

A second method for drawing directly from the beta process prior has been presented in (Thibaux & Jordan, 2007), and more recently in (Teh & Görür, 2009) as a special case of a more general power-law representation of the IBP. In this representation, no stick-breaking takes place of the form in (1); rather, the weight for each location is simply beta-distributed, as opposed to the usual function of multiple beta-distributed random variables. The derivation relies heavily upon connecting the marginalized process to the fully Bayesian representation, which does not factor into the similar derivation for the DP (Sethuraman, 1994). This of course does not detract from the result, which appears to have a simpler inference procedure than that presented here.

A third representation presented in (Teh & Görür, 2009), based on the inverse Lévy method (Wolpert & Ickstadt, 1998), exists in theory only and does not simplify to an analytic stick-breaking form. See (Damien et al., 1996; Lee & Kim, 2004) for two approximate methods for sampling from the beta process.
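The construction in (4) is easy to check by simulation. The sketch below is our own illustration (function name and truncation level are arbitrary); with a = 2 and b = 3, the sample mean of the constructed π values should approach a/(a + b) = 0.4.

```python
import numpy as np

def beta_via_stick_breaking(a, b, truncation, rng):
    """Sethuraman's construction of pi ~ Beta(a, b), Eq. (4):
    pi = sum_i V_i * prod_{j<i}(1 - V_j) * I(Y_i = 1)."""
    V = rng.beta(1.0, a + b, size=truncation)
    Y = rng.random(truncation) < a / (a + b)   # Y_i ~ Bernoulli(a/(a+b))
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    return np.sum(V * remaining * Y)

rng = np.random.default_rng(1)
samples = np.array([beta_via_stick_breaking(2.0, 3.0, 500, rng)
                    for _ in range(20000)])
print(samples.mean())  # close to a/(a+b) = 0.4
```

The sample variance likewise approaches the Beta(2, 3) variance ab/((a+b)²(a+b+1)) = 0.04, supporting the equality in distribution.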

3. A Stick-Breaking Construction of the Beta Process

We now define and briefly discuss the stick-breaking construction of the beta process, followed by its derivation. Let α and H_0 be defined as in (2). If H is constructed according to the following process,

    H = Σ_{i=1}^∞ Σ_{j=1}^{C_i} V_ij^(i) ∏_{ℓ=1}^{i−1} (1 − V_ij^(ℓ)) δ_{θ_ij}
    C_i ~ iid Poisson(γ)
    V_ij^(ℓ) ~ iid Beta(1, α)
    θ_ij ~ iid (1/γ) H_0                                         (5)

then H ~ BP(αH_0).

Since the first row of (5) may be unclear at first sight, we expand it for the first few values of i below,

    H = Σ_{j=1}^{C_1} V_{1,j}^(1) δ_{θ_{1,j}}
      + Σ_{j=1}^{C_2} V_{2,j}^(2) (1 − V_{2,j}^(1)) δ_{θ_{2,j}}
      + Σ_{j=1}^{C_3} V_{3,j}^(3) (1 − V_{3,j}^(2)) (1 − V_{3,j}^(1)) δ_{θ_{3,j}} + ···   (6)

For each value of i, which we refer to as a "round," there are C_i atoms, where C_i is itself random and drawn from Poisson(γ). Therefore, every atom is defined by two subscripts, (i, j). The mass associated with each atom in round i is equal to the ith break from an atom-specific stick, where the stick-breaking weights follow a Beta(1, α) stick-breaking process (as in (1)). Superscripts are used to index the i random variables that construct the weight on atom θ_ij. Since the number of breaks from the unit-length stick prior to obtaining a weight increases with each level in (6), the weights stochastically decrease as i increases, in a similar manner as in the stick-breaking construction of the Dirichlet process (1).

Since the expectation of the mass on the kth atom drawn overall does not simplify to a compact and transparent form, we omit its presentation here. However, we note the following relationship between α and γ in the construction. As α decreases, weights decay more rapidly as i increases, since smaller fractions of each unit-length stick remain prior to obtaining a weight. As α increases, the weights decay more gradually over several rounds. The expected weight on an atom in round i is equal to α^{i−1}/(1 + α)^i. The number of atoms in each round is controlled by γ.

3.1. Derivation of the Construction

Starting with (2), we now show how Sethuraman's constructive definition of the beta distribution can be used to derive that the infinite limit of (2) has (5) as an alternate representation that is equal in distribution. We begin by observing that, according to (4), each π_k value can be drawn as follows,

    π_k = Σ_{l=1}^∞ V̂_{kl} ∏_{m=1}^{l−1} (1 − V̂_{km}) I(Ŷ_{kl} = 1)
    V̂_{kl} ~ iid Beta(1, α)
    Ŷ_{kl} ~ iid Bernoulli(γ/K)                                  (7)

where the marker ˆ is introduced because V will later be re-indexed values of V̂. We also make the observation that, if the sum is instead taken to K′, and we then let K′ → ∞, then this truncated representation converges to (7).

This suggests the following procedure for constructing the limit of the vector π^(K) in (2). We define the K × K matrices V̂ ∈ (0,1)^{K×K} and Ŷ ∈ {0,1}^{K×K}, where

    V̂_{kl} ~ iid Beta(1, α)
    Ŷ_{kl} ~ iid Bernoulli(γ/K)                                  (8)

for k = 1, ..., K and l = 1, ..., K. The K-truncated weight, π_k, is then constructed "horizontally" by looking at the kth row of V̂ and Ŷ, and where we define that the error of the truncation is assigned to 1 − π_k (i.e., Ŷ_{k,l′} := 0 for the extension l′ > K).

It can be seen from the matrix definitions in (8) and the underlying function of these two matrices, defined for each row as a K-truncated version of (7), that in the limit as K → ∞, this representation converges to the infinite beta process when viewed vertically, and to a construction of the individual beta-distributed random variables when viewed horizontally, each of which occur simultaneously.

Before using these two matrices to derive (5), we derive a probability that will be used in the infinite limit. For a given column, i, of (8), we calculate the probability that, for a particular row, k, there is at least one Ŷ = 1 in the set {Ŷ_{k,1}, ..., Ŷ_{k,i−1}}, in other words, the probability that Σ_{i′=1}^{i−1} Ŷ_{ki′} > 0. This value is

    P( Σ_{i′=1}^{i−1} Ŷ_{ki′} > 0 | γ, K ) = 1 − (1 − γ/K)^{i−1}   (9)

In the limit as K → ∞, this can be shown to converge to zero for all fixed values of i.
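As a numerical illustration of the construction in (5), the sketch below (ours, not part of the paper) simulates the atom weights, truncating to a fixed number of rounds; the truncation is an approximation, since (5) has infinitely many rounds. Because C_i ~ Poisson(γ) and the expected round-i weight α^{i−1}/(1 + α)^i sums to one over i, the average total mass across draws should be near γ.

```python
import numpy as np

def bp_stick_breaking(alpha, gamma, n_rounds, rng):
    """Atom weights of H from Eq. (5), truncated to n_rounds rounds.
    Atom locations theta_ij ~ H0/gamma are omitted; only weights are returned."""
    weights = []
    for i in range(1, n_rounds + 1):
        C_i = rng.poisson(gamma)               # number of atoms in round i
        for _ in range(C_i):
            V = rng.beta(1.0, alpha, size=i)   # i breaks of an atom-specific stick
            weights.append(V[-1] * np.prod(1.0 - V[:-1]))
    return np.array(weights)

rng = np.random.default_rng(2)
masses = [bp_stick_breaking(2.0, 5.0, 50, rng).sum() for _ in range(200)]
print(np.mean(masses))  # near gamma = 5
```

Each atom's weight uses its own stick, so unlike the Dirichlet process the weights need not sum to one; only their expectation is controlled by γ.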

As with the Dirichlet process, the problem with drawing each π_k explicitly in the limit of (2) is that there are an infinite number of them, and any given π_k is equal to zero with probability one. With the representation in (8), this problem appears to have doubled, since there are now an infinite number of random variables to sample in two dimensions, rather than one. However, this is only true when viewed horizontally. When viewed vertically, drawing the values of interest becomes manageable.

First, we observe that, in (8), we only care about the set of indices {(k, l) : Ŷ_{kl} = 1}, since these are the locations which indicate that mass is to be added to their respective π_k values. Therefore, we seek to bypass the drawing of all indices for which Ŷ = 0, and directly draw those indices for which Ŷ = 1.

To do this, we use a property of the binomial distribution. For any column, i, of Ŷ, the number of nonzero locations, Σ_{k=1}^K Ŷ_{ki}, has the Binomial(K, γ/K) distribution. Also, it is well known that

    Poisson(γ) = lim_{K→∞} Binomial(K, γ/K)                      (10)

Therefore, in the limit as K → ∞, the sum of each column (as well as row) of Ŷ produces a random variable with a Poisson(γ) distribution. This suggests the procedure of first drawing the number of nonzero locations for each column, followed by their corresponding indices.

Returning to (8), given the number of nonzero locations in column i, Σ_{k=1}^K Ŷ_{ki} ~ Binomial(K, γ/K), finding the indices of these locations then becomes a process of sampling uniformly from {1, ..., K} without replacement. Moreover, since there is a one-to-one correspondence between these indices and the atoms, θ_1, ..., θ_K ~ iid (1/γ) H_0, which they index, this is equivalent to selecting from the set of atoms, {θ_1, ..., θ_K}, uniformly without replacement.

A third, more conceptual process, which will aid the derivation, is as follows: sample the Σ_{k=1}^K Ŷ_{ki} nonzero indices for column i one at a time. After an index, k′, is obtained, check {Ŷ_{k′,1}, ..., Ŷ_{k′,i−1}} to see whether this index has already been drawn. If it has, add the corresponding mass, V_{k′i} ∏_{l=1}^{i−1}(1 − V_{k′l}), to the tally for π_{k′}. If it has not, draw a new atom, θ_{k′} ~ (1/γ) H_0, and associate the mass with this atom.

The derivation concludes by observing the behavior of this last process as K → ∞. We first reiterate that, in the limit as K → ∞, the count of nonzero locations for each column is independent and identically distributed as Poisson(γ). Therefore, for i = 1, 2, ..., we can draw these numbers, C_i := Σ_{k=1}^∞ Ŷ_{ki}, as

    C_i ~ iid Poisson(γ)                                         (11)

We next need to sample index values uniformly from the positive integers, N. However, we recall from (9) that for all fixed values of i, the probability that the drawn index will have previously seen a one is equal to zero. Therefore, using the conceptual process defined above, we can bypass sampling the index value and directly sample the atom which it indexes. Also, we note that the "without replacement" constraint no longer factors.

The final step is simply a matter of re-indexing. Let the function σ_i(j) map the input j ∈ {1, ..., C_i} to the index of the jth nonzero element drawn in column i, as discussed above. Then the re-indexed random variables are V_ij^(i) := V̂_{σ_i(j),i} and V_ij^(ℓ) := V̂_{σ_i(j),ℓ}, where ℓ < i. We similarly re-index θ_{σ_i(j)} as θ_ij := θ_{σ_i(j)}, letting the double and single subscripts remove ambiguity, and hence no ˆ marker is used. The addition of a subscript/superscript in the two cases above arises from ordering the nonzero locations for each column of (8), i.e., the original index values for the selected rows of each column are being mapped to 1, 2, ... separately for each column in a many-to-one manner. The result of this re-indexing is the process given in (5).

4. Inference for the Stick-Breaking Construction

For inference, we integrate out all stick-breaking random variables, V, using Monte Carlo integration (Gamerman & Lopes, 2006), which significantly reduces the number of random variables to be learned. As a second aid for inference, we introduce the latent round-indicator variable,

    d_k := 1 + Σ_{i=1}^∞ I( Σ_{j=1}^i C_j < k )                  (12)

The equality d_k = i indicates that the kth atom drawn overall occurred in round i. Note that, given {d_k}_{k=1}^∞, we can reconstruct {C_i}_{i=1}^∞. Given these latent indicators, the generative process is rewritten as,

    H | {d_k}_{k=1}^∞ = Σ_{k=1}^∞ V_{k,d_k} ∏_{j=1}^{d_k−1} (1 − V_{kj}) δ_{θ_k}
    V_{kj} ~ iid Beta(1, α)
    θ_k ~ iid (1/γ) H_0                                          (13)

where, for clarity in what follows, we've avoided introducing a third marker (e.g., Ṽ) after this re-indexing.
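The correspondence in (12) between round counts C_i and round indicators d_k can be verified directly in code. The helper names below are our own illustration, not from the paper.

```python
import numpy as np

def rounds_from_counts(C):
    """Round indicators from counts, per Eq. (12):
    d_k = 1 + sum_i I(C_1 + ... + C_i < k)."""
    csum = np.cumsum(C)
    return [1 + int(np.sum(csum < k)) for k in range(1, int(csum[-1]) + 1)]

def counts_from_rounds(d, n_rounds):
    """Reconstruct C_i = #{k : d_k = i} from the indicators."""
    return [d.count(i) for i in range(1, n_rounds + 1)]

d = rounds_from_counts([2, 0, 3, 1])
print(d)                          # [1, 1, 3, 3, 3, 4]
print(counts_from_rounds(d, 4))   # [2, 0, 3, 1]
```

Note that an empty round (C_2 = 0 above) simply means no indicator takes that value, and the two representations remain mutually recoverable, as the text states.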

Data is generated iid from H via a Bernoulli process and takes the form of infinite-dimensional binary vectors, z_n ∈ {0,1}^∞, where

    z_nk ~ Bernoulli( V_{k,d_k} ∏_{j=1}^{d_k−1} (1 − V_{kj}) )   (14)

The sufficient statistics calculated from {z_n}_{n=1}^N are the counts along each dimension, k,

    m_1k = Σ_{n=1}^N I(z_nk = 1),   m_0k = Σ_{n=1}^N I(z_nk = 0)   (15)

4.1. Inference for d_k

With each iteration, we sample the sequence {d_k}_{k=1}^K without using future values from the previous iteration; the value of K is random and equals the number of nonzero m_1k. The probability that the kth atom was observed in round i is proportional to

    p(d_k = i | {d_l}_{l=1}^{k−1}, {z_nk}_{n=1}^N, α, γ) ∝ p({z_nk}_{n=1}^N | d_k = i, α) p(d_k = i | {d_l}_{l=1}^{k−1}, γ)   (16)

Below, we discuss the likelihood and prior terms, followed by an approximation to the posterior.

4.1.1. Likelihood Term

The integral to be solved for integrating out the random variables {V_kj}_{j=1}^i is

    p({z_nk}_{n=1}^N | d_k = i, α) = ∫_{(0,1)^i} f({V_kj}_1^i)^{m_1k} {1 − f({V_kj}_1^i)}^{m_0k} p({V_kj}_1^i | α) dV⃗   (17)

where f(·) is the stick-breaking function used in (14). Though this integral can be analytically solved for integer values of m_0k via the binomial expansion, we have found that the resulting sum of terms leads to computational precision issues for even small sample sizes. Therefore, we use Monte Carlo methods to approximate this integral. For s = 1, ..., S samples, {V_kj^(s)}_{j=1}^i, drawn iid from Beta(1, α), we calculate

    p({z_nk}_{n=1}^N | d_k = i, α) ≈ (1/S) Σ_{s=1}^S f({V_kj^(s)}_{j=1}^i)^{m_1k} {1 − f({V_kj^(s)}_{j=1}^i)}^{m_0k}   (18)

This approximation allows for the use of natural logarithms in calculating the posterior, which was not possible with the analytic solution. Also, to reduce computations, we note that at most two random variables need to be drawn to perform the above stick-breaking, one for the proportion and one for the remaining error; this is detailed in the appendix.

4.1.2. Prior Term

The prior for the sequence of indicators d_1, d_2, ... is the equivalent sequential process for sampling C_1, C_2, ..., where C_i = Σ_k I(d_k = i) ~ Poisson(γ). Let #_{d_{k−1}} = Σ_{j=1}^{k−1} I(d_j = d_{k−1}) and let P_γ(·) denote the Poisson distribution with parameter γ. Then it can be shown that

    p(d_k = d_{k−1} | γ, #_{d_{k−1}}) = P_γ(C > #_{d_{k−1}}) / P_γ(C ≥ #_{d_{k−1}})   (19)

Also, for h = 1, 2, ..., the probability

    p(d_k = d_{k−1} + h | γ, #_{d_{k−1}}) = ( 1 − P_γ(C > #_{d_{k−1}}) / P_γ(C ≥ #_{d_{k−1}}) ) P_γ(C > 0) P_γ(C = 0)^{h−1}   (20)

Since d_k ≮ d_{k−1}, these two terms complete the prior.

4.1.3. Posterior of d_k

For the posterior, the normalizing constant requires integration over h = 0, 1, 2, ..., which is not possible given the proposed sampling method. We therefore propose incrementing h until the resulting truncated probability of the largest value of h falls below a threshold (e.g., 10^−6). We have found that the probabilities tend to decrease rapidly for h > 1.

4.2. Inference for γ

Given d_1, d_2, ..., the values C_1, C_2, ... can be reconstructed and a posterior for γ can be obtained using a conjugate gamma prior. Since the value of d_K may not be the last in the sequence composing C_{d_K}, this value can be "completed" by sampling from the prior, which can additionally serve as proposal factors.

4.3. Inference for α

Using (18), we again integrate out all stick-breaking random variables to calculate the posterior of α,

    p(α | {z_n}_1^N, {d_k}_1^K) ∝ ∏_{k=1}^K p({z_nk}_1^N | α, {d_k}_1^K) p(α)

Since this is not possible for the positive, real-valued α, we approximate this posterior by discretizing the space. Specifically, using the value of α from the previous iteration, α_prev, we perform Monte Carlo integration at the points {α_prev + t∆α}_{t=−T}^T, ensuring that α_prev − T∆α > 0. We use an improper, uniform prior for α, with the resulting probability therefore being the normalized likelihood over the discrete set of selected points. As with sampling d_k, we again extend the limits beyond α_prev ± T∆α, checking that the tails of the resulting probability fall below a threshold.

4.4. Inference for p(z_nk = 1 | α, d_k, Z_prev)

In latent factor models (Griffiths & Ghahramani, 2005), the vectors {z_n}_{n=1}^N are to be learned with the rest of the model parameters. To calculate the posterior of a given binary indicator therefore requires a prior, which we calculate as follows,

    p(z_nk = 1 | α, d_k, Z_prev)
      = ∫_{(0,1)^{d_k}} p(z_nk = 1 | V⃗) p(V⃗ | α, d_k, Z_prev) dV⃗
      = ∫_{(0,1)^{d_k}} p(z_nk = 1 | V⃗) p(Z_prev | V⃗) p(V⃗ | α, d_k) dV⃗ / ∫_{(0,1)^{d_k}} p(Z_prev | V⃗) p(V⃗ | α, d_k) dV⃗   (21)

We again perform Monte Carlo integration as in (18), where the numerator increments the count m_1k of the denominator by one. For computational speed, we treat the previous latent indicators, Z_prev, as a block (Ishwaran & James, 2001), allowing this probability to remain fixed when sampling the new matrix, Z.

5. Experiments

We present experimental results on three data sets: (i) a synthetic data set; (ii) the MNIST handwritten digits data set (digits 3, 5 and 8); and (iii) a time-evolving gene expression data set.

5.1. Synthetic Data

For the synthetic problem, we investigate the ability of the inference procedure in Section 4 to learn the underlying α and γ used in generating H. We use the representation in (2) to generate π^(K) for K = 100,000. This provides a sample of π^(K) that approximates the infinite beta process well for smaller values of α and γ. We then sample {z_n}_{n=1}^{1000} from a Bernoulli process and remove all dimensions, k, for which m_1k = 0. Since the weights in (13) are stochastically decreasing as k increases, while the representation in (2) is exchangeable in k, we reorder the dimensions of {z_n}_{n=1}^{1000} so that m_{1,1} ≥ m_{1,2} ≥ ···. The binary vectors are treated as observed for this problem. We choose to generate data from (2) rather than (5) because it provides some added empirical evidence as to the correctness of the stick-breaking construction.

We present results in Figure 1 for 5,500 trials, where α_true ~ Uniform(1, 10) and γ_true ~ Uniform(1, 10). We see that the inferred α_out and γ_out values center on the true α_true and γ_true, but increase in variance as these values increase. We believe that this is due in part to the reordering of the dimensions, which are not strictly decreasing in (5), though some reordering is necessary because of the nature of the two priors.

Figure 1. Synthetic results for learning α and γ. For each trial of 150 iterations, 10 samples were collected and averaged over the last 50 iterations. The step size ∆α = 0.1. (a) Inferred γ vs true γ (b) Inferred α vs true α (c) A plane, shown as an image, fit using least squares that shows the ℓ1 distance of the inferred (α_out, γ_out) to the true (α_true, γ_true).

5.2. MNIST Handwritten Digits

We consider the digits 3, 5 and 8, using 1000 observations for each digit and projecting into 50 dimensions using PCA. We model the resulting digits matrix, X ∈ R^{50×3000}, with a latent factor model (Griffiths & Ghahramani, 2005; Paisley & Carin, 2009),

    X = Φ(W ◦ Z) + E                                             (22)

where the columns of Z are samples from a Bernoulli process, and the elements of Φ and W are iid Gaussian. The symbol ◦ indicates element-wise multiplication. We infer the variance parameters using inverse-gamma priors, and integrate out the weights, w_n, when sampling z_n. Gibbs sampling is performed for all parameters, except for the variance parameters, where we perform variational inference (Bishop, 2006). We have found that the "inflation" of the variance parameters that results from the variational expectation leads to faster mixing for the latent factor model.

Figure 2 displays the inference results for an initialization of K = 200. The top-left figure shows the number of factors as a function of 10,000 Gibbs iterations, and the top-right figure shows the histogram of these values after 1000 burn-in iterations. For Monte Carlo integration, we use S = 100,000 samples from the stick-breaking prior for sampling d_k and p(z_nk = 1 | α, d_k, Z_prev), and S = 10,000 samples for sampling α, since learning the parameter α requires significantly more overall samples. The average time per iteration was approximately 18 seconds, though this value increases when K increases and vice versa.

In the bottom two rows of Figure 2, we show four example factor loadings (columns of Φ), as well as the probability of each being used by a 3, 5 and 8.
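The synthetic-data generation from (2) can be sketched as below. This is our scaled-down illustration: K and N are much smaller than the paper's 100,000 and 1000, and the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, gamma, K, N = 2.0, 5.0, 10000, 500

# Finite approximation of Eq. (2): pi_k ~ Beta(alpha*gamma/K, alpha*(1 - gamma/K))
pi = rng.beta(alpha * gamma / K, alpha * (1.0 - gamma / K), size=K)

Z = rng.random((N, K)) < pi            # z_nk ~ Bernoulli(pi_k)
m1 = Z.sum(axis=0)
Z = Z[:, m1 > 0]                       # remove dimensions with m_1k = 0
Z = Z[:, np.argsort(-m1[m1 > 0])]      # reorder so m_{1,1} >= m_{1,2} >= ...
print(Z.shape)
```

Since E[π_k] = γ/K, the average number of active features per observation concentrates near γ, which is why the finite representation approximates the beta process well at this scale.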

Figure 2. Results for MNIST digits 3, 5 and 8. Top left: The number of factors as a function of iteration number. Top right: A histogram of the number of factors after 1000 burn-in iterations. Middle row: Several example learned factors. Bottom row: The probability of a digit possessing the factor directly above.
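For reference, the generative side of the model in (22) can be sketched as follows. The dimensions, noise scale, and the stand-in Beta(1, 1) feature probabilities are illustrative choices of ours, not the paper's settings or inferred values.

```python
import numpy as np

rng = np.random.default_rng(7)
D, K, N = 50, 20, 100                   # illustrative sizes, not the paper's

Phi = rng.normal(size=(D, K))           # factor loadings, iid Gaussian
W = rng.normal(size=(K, N))             # weights, iid Gaussian
pi = rng.beta(1.0, 1.0, size=K)         # stand-in for beta process atom weights
Z = rng.random((K, N)) < pi[:, None]    # columns of Z ~ Bernoulli process
E = 0.1 * rng.normal(size=(D, N))       # Gaussian noise

X = Phi @ (W * Z) + E                   # Eq. (22): X = Phi (W o Z) + E
print(X.shape)                          # (50, 100)
```

The binary mask Z switches factors on and off per observation, which is what allows the sampler to infer the effective number of factors rather than fixing it in advance.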

5.3. Time-Evolving Gene Expression Data

We next apply the model discussed in Section 5.2 to data from a viral challenge study (Zaas et al., 2009). In this study, a cohort of 17 healthy volunteers was experimentally infected with the influenza A virus at varying dosages. Blood was taken at intervals between −4 and 120 hours from infection and gene expression values were extracted. Of the 17 patients, 9 ultimately became symptomatic (i.e., became ill), and the goal of the study was to detect this in the gene expression values prior to the initial showing of symptoms. There were a total of 16 time points and 267 gene expression extractions, each including expression values for 12,023 genes. Therefore, the data matrix X ∈ R^{267×12023}.

In Figure 3, we show results for 4000 iterations; each iteration took an average of 2.18 minutes. The top row shows the number of factors as a function of iteration, with 100 initial factors, and histograms of the overall number of factors and the number of factors per observation. In the remaining rows, we show four discriminative factor loading vectors, with the statistics from the 267 values displayed as a function of time. We note that the expression values begin to increase for the symptomatic patients prior to the onset of symptoms around the 45th hour. We list the top genes for each factor, as determined by the magnitude of values in Ŵ for that factor. In addition, the top three genes in terms of the magnitude of the four-dimensional vector comprising these factors are RSAD2, IFI27 and IFI44L; the genes listed here have a significant overlap with those in the literature (Zaas et al., 2009).

Figure 3. Results for time-evolving gene expression data. Top row: (left) Number of factors per iteration (middle) Histogram of the total number of factors after 1000 burn-in iterations (right) Histogram of the number of factors used per observation. Rows 2-5: Discriminative factors and the names of the most important genes associated with each factor (as determined by weight).

As motivated in (Griffiths & Ghahramani, 2005), the values in Z are an alternative to hard clustering, and in this case are useful for group selection. For example, sparse linear classifiers for the model y = Xβ + ε, such as the RVM (Bishop, 2006), are prone to select single correlated genes from X for prediction, setting the others to zero. In (West, 2003), latent factor models were motivated as a dimensionality reduction step prior to learning the classifier y = Φ̂β + ε₂, where the loading matrix replaces X and unlabeled data are inferred transductively. In this case, discriminative factors selected by the model represent groups of genes associated with that factor, as indicated by Z.

6. Conclusion

We have presented a new stick-breaking construction of the beta process. The derivation relies heavily upon the constructive definition of the beta distribution, a special case of (Sethuraman, 1994), which has been exclusively used in its infinite form in the machine learning community. We presented an inference algorithm that uses Monte Carlo integration to eliminate several random variables. Results were presented on synthetic data, the MNIST handwritten digits 3, 5 and 8, and time-evolving gene expression data.

As a final comment, we note that the limit of the representation in (2) reduces to the original IBP when α = 1. Therefore, the stick-breaking process in (5) should be equal in distribution to the process in (Teh et al., 2007) for this parametrization. The proof of this equality is an interesting question for future work.

Acknowledgements

This research was supported by DARPA under the PHD Program, and by the Army Research Office under grant W911NF-08-1-0182.

References

Bishop, C.M. Pattern Recognition and Machine Learning. Springer, New York, 2006.

Blackwell, D. and MacQueen, J.B. Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1:353–355, 1973.

Damien, P., Laud, P.W., and Smith, A.F.M. Implementation of Bayesian nonparametric inference based on beta processes. Scandinavian Journal of Statistics, 23(1):27–36, 1996.

Ferguson, T. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230, 1973.

Gamerman, D. and Lopes, H.F. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Second Edition. Chapman & Hall, 2006.

Ghahramani, Z., Griffiths, T.L., and Sollich, P. Bayesian nonparametric latent feature models. Bayesian Statistics, 2007.

Griffiths, T.L. and Ghahramani, Z. Infinite latent feature models and the Indian buffet process. In NIPS, pp. 475–482, 2005.

Hjort, N.L. Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics, 18(3):1259–1294, 1990.

Ishwaran, H. and James, L.F. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161–173, 2001.

Lee, J. and Kim, Y. A new algorithm to generate beta processes. Computational Statistics & Data Analysis, 47(3):441–453, 2004.

Paisley, J. and Carin, L. Nonparametric factor analysis with beta process priors. In Proc. of the ICML, pp. 777–784, 2009.

Sethuraman, J. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

Teh, Y.W. and Görür, D. Indian buffet processes with power-law behavior. In NIPS, 2009.

Teh, Y.W., Görür, D., and Ghahramani, Z. Stick-breaking construction for the Indian buffet process. In AISTATS, 2007.

Thibaux, R. and Jordan, M.I. Hierarchical beta processes and the Indian buffet process. In AISTATS, 2007.

West, M. Bayesian factor regression models in the "large p, small n" paradigm. Bayesian Statistics, 2003.

Wolpert, R.L. and Ickstadt, K. Simulations of Lévy random fields. Practical Nonparametric and Semiparametric Bayesian Statistics, pp. 227–242, 1998.

Zaas, A., Chen, M., Varkey, J., Veldman, T., Hero, A.O., Lucas, J., Huang, Y., Turner, R., Gilbert, A., Lambkin-Williams, R., Oien, N., Nicholson, B., Kingsmore, S., Carin, L., Woods, C., and Ginsburg, G.S. Gene expression signatures diagnose influenza and other symptomatic respiratory viral infections in humans. Cell Host & Microbe, 6:207–217, 2009.

7. Appendix

Following i − 1 breaks from a Beta(1, α) stick-breaking process, the remaining length of the unit-length stick is ε_i = ∏_{j=1}^{i−1} (1 − V_j). Let S_j := −ln(1 − V_j). Then, since it can be shown that S_j ~ Exponential(α), and therefore Σ_{j=1}^{i−1} S_j ~ Gamma(i − 1, α), the value of ε_i can be calculated using only one random variable,

    ε_i = e^{−T_i}
    T_i ~ Gamma(i − 1, α)

Therefore, to draw V_i ∏_{j=1}^{i−1} (1 − V_j) = ε_i V_i, one can sample V_i ~ Beta(1, α) and ε_i as above.
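The appendix identity can be verified numerically. The sketch below is our own check, comparing the direct product of beta draws against the single-gamma shortcut; note that numpy's gamma sampler is parameterized by scale = 1/rate.

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, i, n = 2.0, 6, 200000

# eps_i = prod_{j<i}(1 - V_j), with V_j ~ Beta(1, alpha): direct simulation
V = rng.beta(1.0, alpha, size=(n, i - 1))
eps_direct = np.prod(1.0 - V, axis=1)

# Shortcut: eps_i = exp(-T_i), T_i ~ Gamma(i-1, rate alpha),
# since each -log(1 - V_j) ~ Exponential(alpha)
eps_gamma = np.exp(-rng.gamma(shape=i - 1, scale=1.0 / alpha, size=n))

# Both means approach E[1 - V]^(i-1) = (alpha/(1+alpha))^(i-1)
print(eps_direct.mean(), eps_gamma.mean())
```

The agreement of the two sample means (and, more generally, the two empirical distributions) reflects that the product of i − 1 beta-distributed stick remainders collapses to one gamma-distributed log-length, which is what makes the two-random-variable trick in Section 4.1.1 possible.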