arXiv:2008.04475v2 [math.ST] 18 Jul 2021

Stick-breaking processes with exchangeable length variables

María F. Gil-Leyva, IIMAS, Universidad Nacional Autónoma de México, CDMX, México [email protected]

Ramsés H. Mena, IIMAS, Universidad Nacional Autónoma de México, CDMX, México [email protected]

Abstract

Our object of study is the general class of stick-breaking processes with exchangeable length variables. These generalize well-known Bayesian non-parametric priors in an unexplored direction. We give conditions to assure the respective species sampling process is proper and the corresponding prior has full support. For a rich sub-class we explain how, by tuning a single [0, 1]-valued parameter, the stochastic ordering of the weights can be modulated, and Dirichlet and Geometric priors can be recovered. A general formula for the distribution of the latent allocation variables is derived and an MCMC algorithm is proposed for density estimation purposes.

Keywords: Dirichlet process, Exchangeable sequence, Geometric process, Species sampling process, Stick-breaking.

1 Introduction

Bayesian non-parametric priors have gained interest mainly due to their great flexibility to adjust to complex data sets, while remaining mathematically tractable. The Dirichlet process (Ferguson; 1973) stands as the canonical and most popular non-parametric prior in the literature. This random probability measure enjoys the property of having exchangeable increments with respect to a suitable measure, and when defined over a Polish space, it has full support with respect to the weak topology. These two characteristics largely explain why the Dirichlet process is such a pliable model. Searching for competitive alternatives to the canonical model, different constructions of random probability measures over Polish spaces have been developed. Some of the most notable are through the normalization of homogeneous completely random measures (Kingman; 1975; Regazzini et al.; 2003; James et al.; 2009; Hjort et al.; 2010), through the prediction rule of exchangeable partitions (Blackwell and MacQueen; 1973; Pitman; 2006), by means of the stick-breaking decomposition of a sequence of weights (Sethuraman; 1994; Ishwaran and James; 2001; Favaro et al.; 2012), and most recently by virtue of latent random subsets of the natural numbers (Walker; 2007; Fuentes-García et al.; 2010a; De Blasi et al.; 2020). All of these, as well as other popular models, fall into the class of species sampling processes (Pitman; 2006; Jang et al.; 2010), which are random probability measures whose atomic decomposition specializes to the form

\[ \mu = \sum_{j \ge 1} w_j \delta_{\xi_j} + \Big(1 - \sum_{j \ge 1} w_j\Big) \mu_0, \tag{1} \]

for some collection of non-negative random variables $W = (w_j)_{j \ge 1}$ satisfying $\sum_{j \ge 1} w_j \le 1$ almost surely, independently of the atoms, $(\xi_j)_{j \ge 1}$, that are independent and identically distributed (i.i.d.) from the diffuse probability measure $\mu_0$, called the base measure of $\mu$.
Indeed, $\mu$ as in (1) has $\mu_0$-exchangeable increments (Kallenberg; 2017), and under some conditions on the weights, $(w_j)_{j \ge 1}$, as characterized by Bissiri and Ongaro (2014), the corresponding prior has full support with respect to the weak topology. In contrast to other constructions, the stick-breaking decomposition is exhaustive in the class of species sampling processes, in the sense that the weights of every random probability measure in this class have a stick-breaking decomposition. For completeness, a proof of this last statement can be found in Theorem A.1 in the appendix. Namely, we can decompose

\[ w_1 = v_1, \qquad w_j = v_j \prod_{i=1}^{j-1} (1 - v_i), \quad j \ge 2, \tag{2} \]

for some sequence $V = (v_i)_{i \ge 1}$, with elements taking values in $[0, 1]$. Hereinafter we write $(w_j)_{j \ge 1} = \mathrm{SB}[(v_i)_{i \ge 1}]$ (or $W = \mathrm{SB}[V]$) whenever (2) holds, and refer to the elements of $V$ as length variables. Most efforts have concentrated on the case where these are mutually independent (Sethuraman; 1994; Pitman; 1996a; Ishwaran and James; 2001; Rodríguez and Dunson; 2011; Rodríguez and Quintana; 2015), while there are only a handful of examples of stick-breaking processes with explicitly dependent length variables (Fuentes-García et al.; 2010a; Favaro et al.; 2012, 2016; Gil-Leyva et al.; 2020). So far, the dependent case has remained somewhat elusive due to the mathematical hurdles to overcome. Our proposal here represents, to some extent, a first general treatment of stick-breaking processes with dependent length variables. Explicitly, we will be focusing on the class with exchangeable length variables. Such stick-breaking processes not only constitute a rich class of priors that maintain mathematical tractability, but also delve into unexplored territory that unifies current theory of stick-breaking processes with independent and dependent length variables. The outline of the paper is as follows. Section 2 presents an overview of exchangeable sequences driven by a species sampling process. In Section 3 we analyze the general case of stick-breaking processes with exchangeable length variables. Section 4 addresses the case where the length variables themselves are driven by another species sampling process. Even this subclass is substantially wide as, in particular, it generalizes Dirichlet and Geometric processes (Fuentes-García et al.; 2010a). An important result characterizing the ordering of the respective weights is given. For illustration purposes, in Section 5 we consider the case where the length variables are driven by a Dirichlet process and other interesting species sampling processes such as the Pitman-Yor process. Finally, in Section 6, we implement and evaluate our models to estimate the density of univariate and bivariate simulated data. The proofs of the main results as well as the MCMC algorithm are deferred to the appendix.

2 Preliminaries

One of the most influential theoretical results for Bayesian statistics is the representation theorem for exchangeable sequences, first proved by de Finetti (1931) and later generalized by Hewitt and Savage (1955). This result states that a sequence, $X = (x_i)_{i \ge 1}$, whose elements take values in a Polish space, $S$, with Borel σ-algebra, $\mathcal{B}_S$, is exchangeable if and only if there exists a random probability measure, $\mu$, over $(S, \mathcal{B}_S)$, such that for every $n \ge 1$ and $B_1, \dots, B_n \in \mathcal{B}_S$, $\mathbb{P}[x_1 \in B_1, \dots, x_n \in B_n \mid \mu] = \prod_{i=1}^{n} \mu(B_i)$. Hence, the elements of $X$ are conditionally i.i.d. given $\mu$, denoted by $\{x_i \mid \mu \overset{iid}{\sim} \mu;\ i \ge 1\}$. The random measure $\mu$ is called the directing random measure of $X$; it is unique almost surely and given by the almost sure limit of the empirical distributions, $\mu = \lim_{n \to \infty} n^{-1} \sum_{i=1}^{n} \delta_{x_i}$. A random variable that encloses important information about an exchangeable sequence is the random partition of $\{1, \dots, n\}$, here denoted by $\Pi(x_{1:n})$, generated by the random equivalence relation $i \sim j$ if and only if $x_i = x_j$. Using the terminology of Pitman (1995), $\Pi(x_{1:n})$ is exchangeable, for every $n \ge 1$, in the sense that for any partition $A = \{A_1, \dots, A_k\}$ of $\{1, \dots, n\}$, $\mathbb{P}[\Pi(x_{1:n}) = A] = \pi(|A_1|, \dots, |A_k|)$, for some symmetric function, $\pi : \bigcup_{k \in \mathbb{N}} \mathbb{N}^k \to [0, 1]$, where $|A_i|$ stands for the cardinality of $A_i$. The function $\pi$ is called the exchangeable partition probability function (EPPF). Another key aspect is that the collection $(\Pi(x_{1:n}))_{n \ge 1}$ is consistent (Pitman; 1995, 2006), meaning that the restriction of $\Pi(x_{1:n})$ to $\{1, \dots, m\}$ equals $\Pi(x_{1:m})$ almost surely, for every $n > m$. This translates to the well known addition rule of the EPPF

\[ \pi(n_1, \dots, n_k) = \pi(n_1, \dots, n_k, 1) + \sum_{j=1}^{k} \pi(n_1, \dots, n_{j-1}, n_j + 1, n_{j+1}, \dots, n_k). \tag{3} \]
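The addition rule (3) is easy to check numerically for a concrete EPPF. The sketch below is only an illustration: it borrows the two-parameter (Pitman-Yor) EPPF that the paper states later as (13), and the function names `rising`, `eppf_two_param` and `addition_rule_gap` are hypothetical helpers introduced here, not part of the paper.

```python
def rising(x, m, a=1.0):
    """Generalized rising factorial (x)_{m, a} = prod_{i=0}^{m-1} (x + i * a)."""
    out = 1.0
    for i in range(m):
        out *= x + i * a
    return out


def eppf_two_param(counts, alpha, beta):
    """Two-parameter EPPF, as in (13), evaluated at the block sizes `counts`."""
    k, n = len(counts), sum(counts)
    num = rising(beta + alpha, k - 1, alpha)
    for nj in counts:
        num *= rising(1.0 - alpha, nj - 1)
    return num / rising(beta + 1.0, n - 1)


def addition_rule_gap(counts, alpha, beta):
    """Left side of the addition rule (3) minus its right side."""
    counts = list(counts)
    # new block of size one
    rhs = eppf_two_param(counts + [1], alpha, beta)
    # or one more element in an existing block
    for j in range(len(counts)):
        bumped = counts[:j] + [counts[j] + 1] + counts[j + 1:]
        rhs += eppf_two_param(bumped, alpha, beta)
    return eppf_two_param(counts, alpha, beta) - rhs
```

For any admissible pair (here, for instance, `alpha = 0.3`, `beta = 1.5`) the gap returned by `addition_rule_gap` should vanish up to floating-point error, and `eppf_two_param([1], alpha, beta)` should equal one.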

In fact, every symmetric function $\pi : \bigcup_{k \in \mathbb{N}} \mathbb{N}^k \to [0, 1]$, with $\pi(1) = 1$ and that satisfies (3), defines the EPPF of a consistent family of exchangeable partitions (Pitman; 1995). In particular, if the directing random measure, $\mu$, is a species sampling process, the base measure, $\mu_0$, together with the EPPF, $\pi$, completely characterize the laws of $\mu$ and $(x_i)_{i \ge 1}$. To be precise, $x_1 \sim \mu_0 = \mathbb{E}[\mu]$, and for every $n \ge 1$, the prediction rule is

\[ \mathbb{P}[x_{n+1} \in \cdot \mid x_1, \dots, x_n] = \sum_{j=1}^{K_n} \frac{\pi(\mathbf{n}^{(j)})}{\pi(\mathbf{n})}\, \delta_{x_j^*} + \frac{\pi(\mathbf{n}^{(K_n+1)})}{\pi(\mathbf{n})}\, \mu_0, \tag{4} \]

where $x_1^*, \dots, x_{K_n}^*$ are the $K_n$ distinct values in $\{x_1, \dots, x_n\}$, $n_j = |\{i \le n : x_i = x_j^*\}|$ for $j \le K_n$, $\mathbf{n} = (n_1, \dots, n_{K_n})$, $\mathbf{n}^{(j)} = (n_1, \dots, n_{j-1}, n_j + 1, n_{j+1}, \dots, n_{K_n})$ and $\mathbf{n}^{(K_n+1)} = (n_1, \dots, n_{K_n}, 1)$. Equivalently, for every $n \ge 1$ and every measurable function $f : S^n \to \mathbb{R}$,

\[ \mathbb{E}[f(x_1, \dots, x_n)] = \sum_{\{A_1, \dots, A_k\}} \left\{ \int f(x_{l_1}, \dots, x_{l_n}) \prod_{j=1}^{k} \prod_{r \in A_j} \mathbf{1}_{\{l_r = j\}}\, \mu_0(dx_1) \cdots \mu_0(dx_k) \right\} \pi(|A_1|, \dots, |A_k|), \tag{5} \]

whenever the integrals on the right side exist, and where the sum ranges over all partitions of $\{1, \dots, n\}$ (the indicators force $l_r = j$ whenever $r \in A_j$). Equality (5) is a generalization of a result by Yamato (1984), and its proof can be found in Theorem C.2 in Appendix C. A very important quantity of a species sampling process is the tie probability of $\mu$,

\[ \rho = \pi(2) = \mathbb{P}[x_1 = x_2] = \mathbb{E}\big[\mathbb{P}[x_1 = x_2 \mid \mu]\big] = \mathbb{E}\Big[\sum_{j \ge 1} w_j^2\Big]. \]

For the case $n = 2$, and using the exchangeability of the sequence, (5) simplifies to

\[ \mathbb{E}[f(x_i, x_j)] = \mathbb{E}[f(x_1, x_2)] = \rho \int f(x, x)\, \mu_0(dx) + (1 - \rho) \int\!\!\int f(x_1, x_2)\, \mu_0(dx_1)\, \mu_0(dx_2), \]

for every $i \ne j$, whenever $\int f(x, x)\, \mu_0(dx)$ and $\int\!\!\int f(x_1, x_2)\, \mu_0(dx_1)\, \mu_0(dx_2)$ exist. Another quantity that can be written in terms of the tie probability is the conditional probability

\[ \mathbb{P}[x_j \in \cdot \mid x_i] = \frac{\pi(2)}{\pi(1)}\, \delta_{x_i} + \frac{\pi(1, 1)}{\pi(1)}\, \mu_0 = \rho\, \delta_{x_i} + (1 - \rho)\, \mu_0, \]

for every $i \ne j$. The last equation follows easily from (4), using the fact that $(x_i, x_j)$ is equal in distribution to $(x_1, x_2)$. Note that when $\rho \approx 0$, $\mathbb{P}[x_j \in \cdot \mid x_i] \approx \mu_0$; in the opposite case, when $\rho \approx 1$, $\mathbb{P}[x_j \in \cdot \mid x_i] \approx \delta_{x_i}$. Indeed, as the tie probability approaches zero, $(x_i)_{i \ge 1}$ converges in distribution to an i.i.d. sequence, and as $\rho \to 1$, $(x_i)_{i \ge 1}$ converges in distribution to a sequence of identical random variables. As we will see, this assertion is essential to prove important convergence properties of stick-breaking processes with exchangeable length variables. An account of exchangeable sequences driven by species sampling processes is provided in Appendix C, along with the proofs of the aforementioned properties.
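As a worked illustration of the case $n = 2$, one may take $f(x, y) = xy$. This is only a sketch under assumptions that are not in the text: $S = \mathbb{R}$, and $\mu_0$ has mean $m$ and finite variance $s^2$ (both symbols are introduced here just for the example). Under these assumptions the tie probability is exactly the correlation between any two distinct elements of the sequence:

```latex
\begin{align*}
\mathbb{E}[x_i x_j]
  &= \rho \int x^2\, \mu_0(dx) + (1-\rho) \Big( \int x\, \mu_0(dx) \Big)^2
   = \rho\,(s^2 + m^2) + (1-\rho)\, m^2
   = m^2 + \rho\, s^2, \\
\mathrm{Cov}(x_i, x_j)
  &= \mathbb{E}[x_i x_j] - m^2 = \rho\, s^2,
  \qquad
  \mathrm{Corr}(x_i, x_j) = \frac{\rho\, s^2}{s^2} = \rho .
\end{align*}
```

This agrees with the limiting behaviour just described: $\rho \approx 0$ gives nearly uncorrelated elements, and $\rho \approx 1$ nearly identical ones.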

3 Stick-breaking processes with exchangeable length variables

Hereinafter we focus on species sampling processes, $\mu$, as in (1), defined over a measurable Polish space, $(S, \mathcal{B}_S)$, with collection of weights $W = \mathrm{SB}[V]$, where $V = (v_i)_{i \ge 1}$ is an exchangeable sequence. In other words, we analyze species sampling processes whose weights' distribution remains invariant under permutations of the length variables. Any random probability measure of this kind will be referred to as an exchangeable stick-breaking process (ESB). The first examples of species sampling processes in this class are well-known models in the literature. In fact, if the length variables are identical, $v_i = v$, for $i \ge 1$, and some $v \sim \nu_0$, where $\nu_0$ is a diffuse probability measure over $([0, 1], \mathcal{B}_{[0,1]})$, then $(v, v, \dots)$ is exchangeable and driven by $\nu = \delta_v$. In this case, where the length variables are fully dependent, the decreasingly ordered Geometric weights, $w_j = v(1 - v)^{j-1}$, for $j \ge 1$, are recovered, so that the corresponding ESB becomes a Geometric process (Fuentes-García et al.; 2010a). In terms of the dependence between length variables, at the other end of the spectrum, we find i.i.d. length variables, $\{v_i \overset{iid}{\sim} \nu_0;\ i \ge 1\}$. This sequence, $(v_i)_{i \ge 1}$, is trivially exchangeable and driven by the deterministic probability measure $\nu = \nu_0$; clearly, the respective ESB recovers a stick-breaking process featuring i.i.d. length variables (see for example Sethuraman; 1994; Ishwaran and James; 2001; Pitman; 1996a). In particular, if $\nu_0$ stands for a Beta distribution with parameters $(1, \theta)$, denoted by $\mathrm{Be}(1, \theta)$, so that $\{v_i \overset{iid}{\sim} \mathrm{Be}(1, \theta);\ i \ge 1\}$, the ESB corresponds to a Dirichlet process with total mass parameter $\theta$. From a Bayesian perspective, especially if the distribution of the species sampling process is to be regarded as a mixing prior, one of the first properties one should analyze is whether the weights sum up to one. In this case the almost surely discrete species sampling process is termed proper.
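The two extreme cases just described can be sketched in a few lines. The code below is a minimal illustration rather than the authors' implementation: the helper names (`stick_breaking`, `geometric_lengths`, `iid_lengths`), the truncation level 50 and the value θ = 2 are all assumptions made for the example.

```python
import random


def stick_breaking(lengths):
    """Map length variables v_1..v_m to weights via w_j = v_j * prod_{i<j} (1 - v_i)."""
    weights, remaining = [], 1.0
    for v in lengths:
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


def geometric_lengths(m, theta, rng):
    """Fully dependent case: a single draw v ~ Be(1, theta), repeated m times."""
    v = rng.betavariate(1.0, theta)
    return [v] * m


def iid_lengths(m, theta, rng):
    """Independent case: m i.i.d. draws from Be(1, theta)."""
    return [rng.betavariate(1.0, theta) for _ in range(m)]


rng = random.Random(0)
w_geo = stick_breaking(geometric_lengths(50, 2.0, rng))  # Geometric weights
w_dp = stick_breaking(iid_lengths(50, 2.0, rng))         # Dirichlet process weights
```

The first list is decreasing by construction, whereas the second need not be; in both cases the truncated weights sum to one minus the length of the stick that remains unbroken.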
Another property of interest is whether a species sampling process has full support. This property assures that if the support of the base measure, $\mu_0$, is $S$, then the weak topological support of the distribution of the species sampling process is the set of all probability measures over $(S, \mathcal{B}_S)$ (see Bissiri and Ongaro; 2014). Dirichlet and Geometric processes are both proper and have full support under minor conditions on $\nu_0$; the following theorem generalizes this result to the complete class of ESBs.

Theorem 3.1. Let $\mu$ be an ESB, with exchangeable length variables, $(v_i)_{i \ge 1}$, driven by the random probability measure $\nu$. Let us denote $\nu_0 = \mathbb{E}[\nu]$.

i) If there exists $\varepsilon > 0$ such that $(0, \varepsilon)$ is contained in the support of $\nu_0$, then $\mu$ has full support. ii) $\mu$ is proper if and only if $\nu(\{0\}) < 1$ almost surely.

In the context of the above theorem, if $0 = \nu_0(\{0\}) = \mathbb{E}[\nu(\{0\})]$, we have that $\nu(\{0\}) = 0$ almost surely. Hence, a sufficient condition to ensure $\sum_{j \ge 1} w_j = 1$ is that 0 is not an atom of $\nu_0$, that is to say, $\mathbb{P}[v_i = 0] = 0$. For example, if $\nu_0 = \mathrm{Be}(a, b)$ for some $a, b > 0$, so that marginally $v_i \sim \mathrm{Be}(a, b)$, then 0 is not an atom of $\nu_0$. Furthermore, the support of a $\mathrm{Be}(a, b)$ distribution is $[0, 1]$, which means that the conditions given in (i) and (ii) of Theorem 3.1 are satisfied and we have the following corollary.

Corollary 3.2. Let $\mu$ be an ESB, with exchangeable length variables, $(v_i)_{i \ge 1}$, such that marginally $v_i \sim \mathrm{Be}(a, b)$. Then, $\mu$ is proper and it has full support.

Notice that Geometric processes with a Beta distributed length variable and stick-breaking processes featuring i.i.d. Beta length variables, including Dirichlet processes, are all particular cases of the species sampling processes in Corollary 3.2. Our next result explains how Geometric and Dirichlet processes can be recovered as limits of other ESBs.

Theorem 3.3. Consider some diffuse probability measures $\mu_0, \mu_0^{(1)}, \mu_0^{(2)}, \dots$, over $(S, \mathcal{B}_S)$, such that as $n \to \infty$, $\mu_0^{(n)}$ converges weakly in distribution to $\mu_0$. For each $n \ge 1$, let $\nu^{(n)}$ be a random probability measure over $([0, 1], \mathcal{B}_{[0,1]})$, with $\nu^{(n)}(\{0\}) < 1$ almost surely, and let $\mu^{(n)}$ be an ESB with base measure $\mu_0^{(n)}$ and exchangeable length variables $V^{(n)} = (v_i^{(n)})_{i \ge 1}$ with directing random measure $\nu^{(n)}$. Denote by $W^{(n)} = \mathrm{SB}[V^{(n)}]$ the weights of $\mu^{(n)}$.

i) If $\nu^{(n)}$ converges weakly in distribution to a deterministic probability measure $\nu_0$ with $\nu_0 \ne \delta_0$, then $\mu^{(n)}$ converges weakly in distribution to a stick-breaking process, $\mu$, with base measure $\mu_0$ and featuring independent length variables $\{v_i \overset{iid}{\sim} \nu_0;\ i \ge 1\}$, as $n \to \infty$. In particular, if $\nu_0 = \mathrm{Be}(1, \theta)$, the limit, $\mu$, is a Dirichlet process with total mass parameter $\theta$, and $W^{(n)}$ converges in distribution to the size-biased permuted weights of $\mu$.

ii) If $\nu^{(n)}$ converges weakly in distribution to $\delta_v$ for some $[0, 1]$-valued random variable $v \sim \nu_0$, where $\nu_0(\{0\}) = 0$, then, as $n \to \infty$, $\mu^{(n)}$ converges weakly in distribution to a Geometric process, $\mu$, with base measure $\mu_0$ and length variable $v \sim \nu_0$, and $W^{(n)}$ converges in distribution to the decreasingly ordered weights of $\mu$.

Some important quantities that contain relevant information about the model are the expectations $\mathbb{E}\big[\prod_{j=1}^{k} v_j^{a_j} (1 - v_j)^{b_j}\big]$, for non-negative integers $(a_j, b_j)_{j=1}^{k}$. For instance, as shown by Pitman (1996a,b), for a proper species sampling process $\mu$, with weights sequence $(w_j)_{j \ge 1}$ taking the form (2), its EPPF is given by

\[ \pi(n_1, \dots, n_k) = \sum_{(j_1, \dots, j_k)} \mathbb{E}\left[\prod_{i=1}^{k} w_{j_i}^{n_i}\right] = \sum_{(j_1, \dots, j_k)} \mathbb{E}\left[\prod_{i=1}^{m} v_i^{a_i} (1 - v_i)^{b_i}\right], \tag{6} \]

where the sum ranges over all $k$-tuples of distinct positive integers $(j_1, \dots, j_k)$, and $m$ and $(a_i, b_i)_{i=1}^{m}$ are functions of $(j_1, \dots, j_k)$, given by $m = \max\{j_1, \dots, j_k\}$, $a_i = \sum_{l=1}^{k} n_l \mathbf{1}_{\{j_l = i\}}$, and $b_i = \sum_{r > i} a_r$. Pitman (1996a) also proved that

\[ \pi'(n_1, \dots, n_k) = \mathbb{E}\left[\prod_{j=1}^{k} w_j^{n_j - 1} \prod_{j=1}^{k-1} \Big(1 - \sum_{i=1}^{j} w_i\Big)\right] = \mathbb{E}\left[\prod_{j=1}^{k} v_j^{n_j - 1} (1 - v_j)^{\sum_{i > j} n_i}\right] \tag{7} \]

is a symmetric function of $(n_1, \dots, n_k)$ if and only if $(w_j)_{j \ge 1}$ is invariant under size-biased permutations, in which case $\pi = \pi'$ is precisely the EPPF corresponding to the model. Evidently, if the distribution of the size-biased permuted weights is not available, computing the EPPF in closed form can be difficult due to the infinite unordered sum in (6). For a handful of stick-breaking processes, despite the lack of invariance under size-biased permutations, (6) can be simplified (e.g. Mena and Walker; 2012; Rodríguez and Quintana; 2015); however, for most general cases the equation in question cannot be reduced to a more tractable expression. A way to mitigate this hurdle is to undertake clustering analysis by means of the so-called latent allocation variables (see Fuentes-García et al.; 2010b, 2019). These random variables contain complete information about the partition structure, and regardless of the ordering of the weights, their finite dimensional distributions can be expressed in terms of expectations of power products of the length variables. Namely, for $\{x_i \mid \mu \overset{iid}{\sim} \mu;\ i \ge 1\}$ where $\mu = \sum_{j \ge 1} w_j \delta_{\xi_j}$ is a proper species sampling process, we define the latent allocation variables $(\mathbf{d}_i)_{i \ge 1}$ through $\mathbf{d}_i = j$ if and only if $x_i = \xi_j$, so that

\[ \Big\{ \mathbf{d}_i \mid W \overset{iid}{\sim} \sum_{j \ge 1} w_j \delta_j;\ i \ge 1 \Big\}. \tag{8} \]

The diffuseness of the base measure implies that $\Pi(x_{1:n})$ is equal almost surely to $\Pi(\mathbf{d}_{1:n})$. Furthermore, the conditional distribution of $(x_1, \dots, x_n)$ given $\Pi(x_{1:n})$ is that of $(x_1, \dots, x_n)$ given $(\mathbf{d}_1, \dots, \mathbf{d}_n)$. From (8) we can easily compute, for every $n \ge 1$ and any positive integers $d_1, \dots, d_n$,

\[ \mathbb{P}[\mathbf{d}_1 = d_1, \dots, \mathbf{d}_n = d_n] = \mathbb{E}\left[\prod_{j=1}^{k} w_j^{r_j}\right] = \mathbb{E}\left[\prod_{j=1}^{k} v_j^{r_j} (1 - v_j)^{t_j}\right], \tag{9} \]

where $k = \max\{d_1, \dots, d_n\}$, $r_j = \sum_{i=1}^{n} \mathbf{1}_{\{d_i = j\}}$ and $t_j = \sum_{i=1}^{n} \mathbf{1}_{\{d_i > j\}} = \sum_{i > j} r_i$. In particular, if the length variables are exchangeable, de Finetti's theorem yields

\[ \mathbb{E}\left[\prod_{j=1}^{k} v_j^{r_j} (1 - v_j)^{t_j}\right] = \int \left\{ \prod_{j=1}^{k} \int v^{r_j} (1 - v)^{t_j}\, \nu(dv) \right\} Q(d\nu), \]

where $Q$ is the law of the directing random measure of $(v_i)_{i \ge 1}$.
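The bookkeeping behind (9) can be made concrete with a short sketch. The code below only illustrates the identity inside the expectation, for fixed values of the length variables; the function names are hypothetical and the allocation vector is stored as a plain Python list of positive integers.

```python
def allocation_counts(d):
    """Return (r, t) with r[j-1] = #{i : d_i = j} and t[j-1] = #{i : d_i > j}, j = 1..max(d)."""
    k = max(d)
    r = [sum(1 for di in d if di == j) for j in range(1, k + 1)]
    t = [sum(1 for di in d if di > j) for j in range(1, k + 1)]
    return r, t


def prob_given_lengths(d, v):
    """prod_j v_j^{r_j} (1 - v_j)^{t_j}: probability of d conditional on the length variables."""
    r, t = allocation_counts(d)
    p = 1.0
    for vj, rj, tj in zip(v, r, t):
        p *= vj ** rj * (1.0 - vj) ** tj
    return p


def prob_given_weights(d, w):
    """prod_i w_{d_i}: the same conditional probability written with the weights."""
    p = 1.0
    for di in d:
        p *= w[di - 1]
    return p
```

For example, with `d = [1, 3, 1, 2]` one gets `r = [2, 1, 1]` and `t = [2, 1, 0]`, and both conditional probabilities coincide whenever `w` is obtained from `v` through (2); taking expectations over the length variables then gives (9).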

4 Exchangeable length variables driven by species sampling process

In this section we will analyze ESBs with length variables driven by another species sampling process, $\nu$. To be precise, we will study general ESBs denoted by $\mu$, with length variables $\{v_i \mid \nu \overset{iid}{\sim} \nu;\ i \ge 1\}$, for some species sampling process, $\nu$, over $([0, 1], \mathcal{B}_{[0,1]})$. We will denote by $\pi_\nu$ the EPPF corresponding to $\nu$, by $\rho_\nu$ its tie probability, and by $\nu_0$ the base measure of $\nu$. Noting that the base measure satisfies $\nu_0 = \mathbb{E}[\nu]$, from the first part of Theorem 3.1 we know that if $(0, \varepsilon)$ is contained in the support of $\nu_0$, for some $0 < \varepsilon < 1$, then $\mu$ has full support. Furthermore, as $\nu$ is a species sampling process, by definition its base measure, $\nu_0$, is required to be diffuse, so that it cannot have an atom at 0; thus $\mu$ must be proper. This gives the following corollary of Theorem 3.1.

Corollary 4.1. Let $\mu$ be an ESB with length variables, $(v_i)_{i \ge 1}$, that are driven by a species sampling process, $\nu$, with base measure $\nu_0$. Then $\mu$ is proper, and if $(0, \varepsilon)$ is contained in the support of $\nu_0$, for some $\varepsilon > 0$, $\mu$ also has full support.

As mentioned at the beginning of Section 3, Dirichlet and Geometric processes are examples of ESBs, as they have exchangeable length variables, $\{v_i \mid \nu \overset{iid}{\sim} \nu;\ i \ge 1\}$, with $\nu = \nu_0 = \mathrm{Be}(1, \theta)$ and $\nu = \delta_v$, respectively. Clearly, Dirichlet and Geometric processes also belong to the sub-class of ESBs driven by species sampling processes. An appealing advantage of restricting the study to this class becomes evident when we specialize Theorem 3.3. In effect, whilst Theorem 3.3 modulates the convergence of ESBs to Geometric and Dirichlet processes by means of random probability measures, the following corollary controls the convergence in terms of the underlying tie probability, $\rho_\nu = \mathbb{P}[v_1 = v_2] \in [0, 1]$.

Corollary 4.2. Consider some diffuse probability measures $\mu_0, \mu_0^{(1)}, \mu_0^{(2)}, \dots$, over $(S, \mathcal{B}_S)$, and some diffuse probability measures, $\nu_0, \nu_0^{(1)}, \nu_0^{(2)}, \dots$, over $([0, 1], \mathcal{B}_{[0,1]})$. Say that as $n \to \infty$, $\mu_0^{(n)}$ converges weakly to $\mu_0$ and $\nu_0^{(n)}$ converges weakly to $\nu_0$. For $n \ge 1$, let $\rho_\nu^{(n)} \in (0, 1)$, and consider the ESB $\mu^{(n)}$ with base measure $\mu_0^{(n)}$, and length variables $\{v_i^{(n)} \mid \nu^{(n)} \overset{iid}{\sim} \nu^{(n)};\ i \ge 1\}$, where $\nu^{(n)}$ is a species sampling process with base measure $\nu_0^{(n)}$ and tie probability $\rho_\nu^{(n)}$. Also consider the stick-breaking weights of $\mu^{(n)}$, $W^{(n)} = \mathrm{SB}[V^{(n)}]$, where $V^{(n)} = (v_i^{(n)})_{i \ge 1}$.

i) If $\rho_\nu^{(n)} \to 0$, as $n \to \infty$, then $\mu^{(n)}$ converges weakly in distribution to a stick-breaking process, $\mu$, with base measure, $\mu_0$, and independent length variables $\{v_i \overset{iid}{\sim} \nu_0;\ i \ge 1\}$. Particularly, if $\nu_0 = \mathrm{Be}(1, \theta)$, $\mu$ is a Dirichlet process with total mass parameter $\theta$, and $W^{(n)}$ converges in distribution to the size-biased permuted weights of $\mu$.

ii) If $\rho_\nu^{(n)} \to 1$, as $n \to \infty$, then $\mu^{(n)}$ converges weakly in distribution to a Geometric process, $\mu$, with base measure $\mu_0$ and length variable $v \sim \nu_0$, and $W^{(n)}$ converges in distribution to the decreasingly ordered weights of $\mu$.

Corollary 4.2 states that by means of ESBs featuring species sampling driven length variables, we can approximate with arbitrary precision the distribution of a Geometric process, by making $\rho_\nu \to 1$. Alternatively, whenever the length variables have $\mathrm{Be}(1, \theta)$ marginals, by letting $\rho_\nu \to 0$, we can arbitrarily approximate Dirichlet priors. Since $\rho_\nu$ ranges over $[0, 1]$, a set with no gaps, our result sets up a continuous bridge between Geometric and Dirichlet processes, and provides a new interpretation of these random probability measures as opposite extreme points of the class of ESBs. Corollary 4.2 also has important consequences when it comes to understanding the ordering of the weights in question. Indeed, as $\rho_\nu \to 1$, the weights become more likely to be decreasingly ordered, and in the special case where the length variables have $\mathrm{Be}(1, \theta)$ marginals, the ordering of the weights ranges from a size-biased order to a decreasing order. To understand better how the stochastic ordering of the weights is affected by $\rho_\nu$ we provide the following result.

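Theorem 4.3. Consider some exchangeable length variables $\{v_i \mid \nu \overset{iid}{\sim} \nu;\ i \ge 1\}$, where $\nu$ is a species sampling process with base measure $\nu_0$, EPPF $\pi_\nu$, and tie probability $\rho_\nu$. Set $(w_j)_{j \ge 1} = \mathrm{SB}[(v_i)_{i \ge 1}]$. Then, for every $j \ge 1$,

a) $\mathbb{P}[w_j \ge w_{j+1}] = \rho_\nu + (1 - \rho_\nu)\, \mathbb{E}[\overrightarrow{\nu_0}(c(v))]$, where $c(v) = 1 \wedge v(1 - v)^{-1}$ for every $v \in [0, 1]$, $v \sim \nu_0$, and $\overrightarrow{\nu_0}$ is the distribution function of $\nu_0$, that is $\overrightarrow{\nu_0}(x) = \nu_0([0, x])$.

b) Letting $c$ and $\overrightarrow{\nu_0}$ be as in (a),

\[ \mathbb{P}[w_j \ge w_{j+1} \mid w_1, \dots, w_j] = \mathbb{P}[v_{j+1} \le c(v_j) \mid v_1, \dots, v_j] = \sum_{i=1}^{K_j} \frac{\pi_\nu(\mathbf{n}^{(i)})}{\pi_\nu(\mathbf{n})}\, \mathbf{1}_{\{v_i^* \le c(v_j)\}} + \frac{\pi_\nu(\mathbf{n}^{(K_j+1)})}{\pi_\nu(\mathbf{n})}\, \overrightarrow{\nu_0}(c(v_j)), \]

where $v_1^*, \dots, v_{K_j}^*$ are the distinct values that $\{v_1, \dots, v_j\}$ exhibits, $\mathbf{n} = (n_1, \dots, n_{K_j})$ is given by $n_i = |\{l \le j : v_l = v_i^*\}|$, $\mathbf{n}^{(i)} = (n_1, \dots, n_{i-1}, n_i + 1, n_{i+1}, \dots, n_{K_j})$ and $\mathbf{n}^{(K_j+1)} = (n_1, \dots, n_{K_j}, 1)$.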

A curious highlight about Theorem 4.3 is that $\mathbb{P}[w_j \ge w_{j+1}]$ does not depend on $j$; in effect, for every $j \ge 1$, $\mathbb{P}[w_j \ge w_{j+1}]$ is simply a linear combination, modulated by $\rho_\nu$, between a quantity determined by $\nu_0$, and one. This means that for a fixed base measure, $\nu_0$, by tuning the tie probability, $\rho_\nu$, we can control how close $\mathbb{P}[w_j \ge w_{j+1}]$ is to $\mathbb{E}[\overrightarrow{\nu_0}(c(v))]$ or to one. Theorem 4.3 (b) computes the conditional probability $\mathbb{P}[w_j \ge w_{j+1} \mid w_1, \dots, w_j]$, which is completely determined by the EPPF, $\pi_\nu$, the base measure $\nu_0$, and the first $j$ length variables. To spell out the corresponding equation, note that given the first $j$ weights, the probability that $w_j \ge w_{j+1}$ is precisely the probability that $v_{j+1} = v_i^*$ for some $v_i^* \le c(v_j)$, or that $v_{j+1}$ takes a new value, say $v_{K_j+1}^*$, which is smaller than $c(v_j)$. The last property we analyze for this type of species sampling processes are the finite dimensional distributions of the latent allocation random variables $\mathbf{d}_1, \mathbf{d}_2, \dots$, as in (9). It is clear that these probabilities can be written in terms of the expectation of power products of the length variables. Equation (5) explains how to compute these expectations when the length variables are driven by a species sampling process. Hence, the following result is a straightforward consequence of (5) and (9).

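Theorem 4.4. Consider the weights sequence, $W = (w_j)_{j \ge 1}$, as in (2), with length variables $\{v_i \mid \nu \overset{iid}{\sim} \nu;\ i \ge 1\}$, for some species sampling process, $\nu$, with base measure $\nu_0$ and EPPF $\pi_\nu$. Then, for $\{\mathbf{d}_i \mid W \overset{iid}{\sim} \sum_{j \ge 1} w_j \delta_j;\ i \ge 1\}$, and every $d_1, \dots, d_n \in \mathbb{N}$,

\[ \mathbb{P}[\mathbf{d}_1 = d_1, \dots, \mathbf{d}_n = d_n] = \sum_{\{A_1, \dots, A_m\}} \pi_\nu(|A_1|, \dots, |A_m|) \times \prod_{j=1}^{m} \left\{ \int_{[0,1]} v^{\sum_{i \in A_j} r_i} (1 - v)^{\sum_{i \in A_j} t_i}\, \nu_0(dv) \right\}, \tag{10} \]

where $r_i = \sum_{l=1}^{n} \mathbf{1}_{\{d_l = i\}}$, $t_i = \sum_{l=1}^{n} \mathbf{1}_{\{d_l > i\}}$, and where the sum ranges over the set of all partitions of $\{1, \dots, k\}$, for $k = \max\{d_1, \dots, d_n\}$.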

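Remark 4.5. If $\nu_0$ denotes a $\mathrm{Be}(a, b)$ distribution, the integrals in Theorem 4.4 are given by

\[ \int_{[0,1]} v^{\sum_{i \in A_j} a_i} (1 - v)^{\sum_{i \in A_j} b_i}\, \nu_0(dv) = \frac{\Gamma(a + b)\, \Gamma\big(a + \sum_{i \in A_j} a_i\big)\, \Gamma\big(b + \sum_{i \in A_j} b_i\big)}{\Gamma(a)\, \Gamma(b)\, \Gamma\big(a + b + \sum_{i \in A_j} (a_i + b_i)\big)}, \]

for any non-negative integers $(a_i, b_i)_{i \ge 1}$.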
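Formula (10), combined with the Beta integrals of Remark 4.5, is a finite sum and can be evaluated by brute force for small $k$. The following is a hedged sketch, not the authors' code: the EPPF is passed in as a user-supplied function of the list of block sizes, the helper names are hypothetical, and the enumeration of set partitions grows like the Bell numbers, so it is only practical when $\max\{d_1, \dots, d_n\}$ is small.

```python
from math import exp, lgamma


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # place `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own
        yield [[first]] + part


def beta_moment(a, b, p, q):
    """E[v^p (1 - v)^q] for v ~ Be(a, b), computed through log-gamma functions."""
    return exp(lgamma(a + b) + lgamma(a + p) + lgamma(b + q)
               - lgamma(a) - lgamma(b) - lgamma(a + b + p + q))


def allocation_probability(d, eppf, a, b):
    """Evaluate (10) for the allocation vector d, with base measure Be(a, b)."""
    k = max(d)
    r = [sum(1 for di in d if di == j) for j in range(1, k + 1)]
    t = [sum(1 for di in d if di > j) for j in range(1, k + 1)]
    total = 0.0
    for blocks in set_partitions(list(range(k))):
        term = eppf([len(block) for block in blocks])
        for block in blocks:
            term *= beta_moment(a, b,
                                sum(r[i] for i in block),
                                sum(t[i] for i in block))
        total += term
    return total
```

A simple sanity check is the case $n = 1$, $d_1 = 1$: only the partition with a single block contributes, and the result is the mean of the base measure, $a/(a + b)$, for any EPPF with $\pi_\nu(1) = 1$.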

5 Examples

5.1 Dirichlet driven stick-breaking processes

5.1.1 Theoretical results

To illustrate the results of Section 4, we first concentrate on exchangeable length variables, $V = (v_i)_{i \ge 1}$, driven by a Dirichlet process $\nu$ with total mass parameter $\beta$ and base measure $\nu_0$. In this case we have that the EPPF, $\pi_\nu$, corresponding to $\nu$ is given by

\[ \mathbb{P}[\Pi(v_{1:k}) = A] = \pi_\nu(|A_1|, \dots, |A_m|) = \frac{\beta^m}{(\beta)_k} \prod_{i=1}^{m} (|A_i| - 1)! \tag{11} \]

for every partition $A = \{A_1, \dots, A_m\}$ of $\{1, \dots, k\}$ (Ewens; 1972; Antoniak; 1974). In particular we can compute the underlying tie probability $\rho_\nu = \pi_\nu(2) = 1/(1 + \beta)$. Motivated by Corollaries 3.2 and 4.2, we will further study the case $\nu_0 = \mathrm{Be}(1, \theta)$. Indeed, this assumption guarantees that the corresponding ESB, $\mu$, is proper and has full support, and as a consequence of Corollary 4.2, it allows us to recover Geometric and Dirichlet processes in the weak limits as $\beta \to 0$ and $\beta \to \infty$, respectively. Any such ESB, $\mu$, will be called a Dirichlet driven stick-breaking process (DSB) with parameters $(\beta, \theta, \mu_0)$, and its weights, $W = \mathrm{SB}[V]$, will be referred to as Dirichlet driven stick-breaking weights (DSBw) with parameters $(\beta, \theta)$. Next, we specialize the results of Section 4 to DSBs.

Corollary 5.1. Let $\mu_0, \mu_0^{(1)}, \mu_0^{(2)}, \dots$ be diffuse probability measures over $(S, \mathcal{B}_S)$ such that $\mu_0^{(n)}$ converges weakly to $\mu_0$, as $n \to \infty$. For each $n \ge 1$ consider $\beta^{(n)}, \theta^{(n)} \in (0, \infty)$, with $\theta^{(n)} \to \theta$ in $(0, \infty)$. Let $\mu^{(n)}$ be a DSB with parameters $(\beta^{(n)}, \theta^{(n)}, \mu_0^{(n)})$, and let $W^{(n)}$ be the corresponding DSBw with parameters $(\beta^{(n)}, \theta^{(n)})$.

i) If $\beta^{(n)} \to \infty$, as $n \to \infty$, then $\mu^{(n)}$ converges weakly in distribution to a Dirichlet process, $\mu^{(\infty)}$, with total mass parameter $\theta$ and base measure $\mu_0$, and $W^{(n)}$ converges in distribution to the size-biased permutation of the weights of $\mu^{(\infty)}$.

ii) If $\beta^{(n)} \to 0$, as $n \to \infty$, then $\mu^{(n)}$ converges weakly in distribution to $\mu^{(0)}$, where $\mu^{(0)}$ stands for a Geometric process with base measure $\mu_0$ and length variable $v^* \sim \mathrm{Be}(1, \theta)$, and $W^{(n)}$ converges in distribution to the decreasingly ordered weights of $\mu^{(0)}$.

Corollary 5.1 follows immediately from Corollary 4.2 by substituting $\rho_\nu^{(n)} = 1/(1 + \beta^{(n)})$. As to the ordering of DSBw's, we have the following corollary of Theorem 4.3.

Corollary 5.2. Fix $\theta > 0$, and for each $\beta > 0$, consider a DSBw, $(w_j)_{j \ge 1}$, with parameters $(\beta, \theta)$. Let ${}_2F_1$ denote the Gauss hypergeometric function. Then, for every $j \ge 1$,

a) $\mathbb{P}[w_j \ge w_{j+1}] = 1 - \dfrac{{}_2F_1(1, 1; \theta + 2; 1/2)\, \beta\theta}{2(\beta + 1)(\theta + 1)}$, for every $\beta > 0$.

b) $\mathbb{P}[w_j \ge w_{j+1} \mid w_1, \dots, w_j] = \dfrac{1}{\beta + j} \Big\{ \sum_{i \in A_j} n_i + \beta \big[ 1 - (1 - c(v_j))^{\theta} \big] \Big\}$, where $c(v) = 1 \wedge v(1 - v)^{-1}$, $v_1^*, \dots, v_{K_j}^*$ are the distinct values that $\{v_1, \dots, v_j\}$ exhibits, $A_j = \{i \le K_j : v_i^* \le c(v_j)\}$ and $n_i = |\{l \le j : v_l = v_i^*\}|$, for every $i \le K_j$.
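Part (a) can be sanity-checked by simulation. The sketch below estimates $\mathbb{P}[w_1 \ge w_2]$ by Monte Carlo and compares it with the closed form for $\theta = 1$ stated in Remark 5.3; it assumes that the pair $(v_1, v_2)$ can be generated by drawing $v_1 \sim \mathrm{Be}(1, \theta)$ and setting $v_2 = v_1$ with probability $\rho_\nu = 1/(1 + \beta)$, which is the prediction rule implied by (4) and (11). The function names and the number of simulations are illustrative choices.

```python
import math
import random


def estimate_order_probability(beta, theta, n_sims, seed=0):
    """Monte Carlo estimate of P[w_1 >= w_2] for Dirichlet driven length variables."""
    rng = random.Random(seed)
    rho = 1.0 / (1.0 + beta)  # tie probability of the directing Dirichlet process
    hits = 0
    for _ in range(n_sims):
        v1 = rng.betavariate(1.0, theta)
        # v2 repeats v1 with probability rho, otherwise it is a fresh draw
        v2 = v1 if rng.random() < rho else rng.betavariate(1.0, theta)
        w1, w2 = v1, v2 * (1.0 - v1)
        hits += w1 >= w2
    return hits / n_sims


def closed_form_theta_one(beta):
    """Value (1 + beta * log 2) / (1 + beta), valid for theta = 1."""
    return (1.0 + beta * math.log(2.0)) / (1.0 + beta)
```

With, say, `beta = 3.0` and `theta = 1.0`, the estimate from `estimate_order_probability` should agree with `closed_form_theta_one(3.0)` up to Monte Carlo error.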

Remark 5.3. In the context of Corollary 5.2, if $\theta = 1$, we even obtain

\[ \mathbb{P}[w_j \ge w_{j+1}] = \frac{1 + \beta \log(2)}{1 + \beta}. \]

Other quantities that simplify nicely for DSBs are the finite dimensional distributions of latent allocation variables. For instance, from (11), using Remark 4.5 and recalling that $\Gamma(x + n)/\Gamma(x) = \prod_{i=0}^{n-1} (x + i) = (x)_n$, Theorem 4.4 specializes as follows.

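Corollary 5.4. Let $W = (w_j)_{j \ge 1}$ be a DSBw with parameters $(\beta, \theta)$, and consider the allocation variables $\{\mathbf{d}_i \mid W \overset{iid}{\sim} \sum_{j \ge 1} w_j \delta_j;\ i \ge 1\}$. Then for any positive integers $d_1, \dots, d_n$,

\[ \mathbb{P}[\mathbf{d}_1 = d_1, \dots, \mathbf{d}_n = d_n] = \sum_{\{A_1, \dots, A_m\}} \frac{(\beta\theta)^m}{(\beta)_k} \prod_{j=1}^{m} \frac{(|A_j| - 1)!\, \big(\sum_{i \in A_j} r_i\big)!}{\big(\theta + \sum_{i \in A_j} t_i\big)_{1 + \sum_{i \in A_j} r_i}}, \tag{12} \]

where $r_i = \sum_{l=1}^{n} \mathbf{1}_{\{d_l = i\}}$, $t_i = \sum_{l=1}^{n} \mathbf{1}_{\{d_l > i\}}$ and the sum ranges over the set of all partitions of $\{1, \dots, k\}$, for $k = \max\{d_1, \dots, d_n\}$.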

5.1.2 Distribution of the number of groups

A random variable that encloses important information about a species sampling process $\mu$ is the number of distinct values $K_n = |\Pi(x_{1:n})|$ that a sample, $\{x_i \mid \mu \overset{iid}{\sim} \mu;\ i \ge 1\}$, exhibits. For certain species sampling processes, the distribution of $K_n$ is analytically available (Lijoi et al.; 2007; De Blasi et al.; 2015). However, characterizing this distribution is generally not an easy task. Nonetheless, whenever one can sample from the finite dimensional distributions of the length variables, $(v_1, \dots, v_m)$, and therefore of the weights, $(w_1, \dots, w_m)$, drawing samples from $K_n$ is relatively simple. First sample $\{u_i \overset{iid}{\sim} \mathrm{Unif}(0, 1);\ i \le n\}$, and $w_1, \dots, w_m$ up to the first index, $m$, that satisfies $\sum_{j=1}^{m} w_j > \max_{k \le n} u_k$. Define $\mathbf{d}_k = i$ if and only if $\sum_{j=1}^{i-1} w_j \le u_k < \sum_{j=1}^{i} w_j$, with the convention that the empty sum equals zero. Finally, note that the number of distinct values in $\{\mathbf{d}_1, \dots, \mathbf{d}_n\}$ is precisely a sample from $K_n$.

Denote by $K_n^{(\beta,\theta)}$ the random variable, $K_n$, corresponding to a DSB with parameters $(\beta, \theta, \mu_0)$; for simplicity, when $\beta = 0$ and $\beta = \infty$ we refer to a Geometric and a Dirichlet process, respectively. In Figure 1, we exhibit the distribution of $K_n^{(\beta,\theta)}$ for different choices of $\beta$ and $\theta$ and for $n = 20$. Here we illustrate the asymptotic results stated in Corollary 4.2 and Corollary 5.1 by means of $K_n^{(\beta,\theta)}$. In fact, we have a graphical representation of how, as $\beta \to 0$ and $\beta \to \infty$, $K_n^{(\beta,\theta)} \to K_n^{(0,\theta)}$ and $K_n^{(\beta,\theta)} \to K_n^{(\infty,\theta)}$, in distribution. We also observe that an increase in $\beta$ (a decrease in $\rho_\nu$) yields a distribution of $K_n^{(\beta,\theta)}$ with a smaller mean and variance. Conversely, decreasing the value of $\beta$ yields a prior distribution of $K_n^{(\beta,\theta)}$ with a larger mean and variance, and a heavier right tail, that is, a less informative one. We also see that, just as it occurs for the Dirichlet process, for DSBs with the value of $\beta$ fixed, increasing the parameter $\theta$ makes the distribution of $K_n^{(\beta,\theta)}$ favour larger values.
This behaviour can be explained by recalling that marginally $v_i \sim \mathrm{Be}(1, \theta)$, meaning that larger values of $\theta$ lead to smaller values of the length variables, which translates to smaller values of the first weights; this in turn leads to more diversity in samples of exchangeable sequences driven by the corresponding DSB.
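The sampling recipe for $K_n$ described in this subsection can be sketched as follows for a DSB. This is a minimal illustration under stated assumptions, not the code used for the figures: the length variables are generated one at a time with the Blackwell-MacQueen prediction rule of the directing Dirichlet process (as implied by (4) and (11)), `beta = 0` and `beta = float("inf")` are treated as the Geometric and Dirichlet limits, and all function names are hypothetical.

```python
import bisect
import random


def next_length(previous, beta, theta, rng):
    """One prediction-rule step for length variables driven by a Dirichlet process."""
    n = len(previous)
    if n > 0 and (beta == 0 or rng.random() < n / (beta + n)):
        return rng.choice(previous)      # repeat one of the earlier values
    return rng.betavariate(1.0, theta)   # fresh draw from the base measure Be(1, theta)


def sample_kn(n, beta, theta, rng):
    """Draw one value of K_n with the uniform-threshold recipe described in the text."""
    u = [rng.random() for _ in range(n)]
    target = max(u)
    lengths, cum, remaining = [], [], 1.0
    # generate weights until their running sum exceeds max_k u_k
    while not cum or cum[-1] <= target:
        v = next_length(lengths, beta, theta, rng)
        lengths.append(v)
        remaining *= 1.0 - v
        cum.append(1.0 - remaining)      # w_1 + ... + w_m = 1 - prod_i (1 - v_i)
    # d_k = i  if and only if  cum[i-2] <= u_k < cum[i-1]
    d = [bisect.bisect_right(cum, uk) + 1 for uk in u]
    return len(set(d))
```

Repeated calls such as `sample_kn(20, 1.0, 1.0, random.Random(1))` give draws from the distribution of $K_{20}$; in the Dirichlet limit with $\theta = 1$ the sample mean should be close to the harmonic number $\sum_{i=1}^{20} 1/i$, which is the known expectation of $K_n$ for a Dirichlet process with unit total mass.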

Figure 1: Frequency polygons of samples of size 10000 from the distribution of $K_{20}$, corresponding to DSBs with parameters $\beta$ in $\{0, 0.5, 1, 10, 100, \infty\}$ (A to F, respectively), and varying $\theta$ in $\{0.5, 1, 3, 6, 10\}$.

Figure 2 provides numerical approximations of the mapping $n \mapsto \mathbb{E}[K_n^{(\beta,\theta)}]$ for $n \in \{1, \dots, 200\}$, and for distinct values of $\beta$ and $\theta$. We can appreciate that for a fixed value of $\beta = (1 - \rho_\nu)/\rho_\nu$, larger values of $\theta$ lead to a faster growth of $\mathbb{E}[K_n^{(\beta,\theta)}]$ as $n$ increases. This is consistent with the analysis in Figure 1, in the sense that larger values of $\theta$ translate to $K_n^{(\beta,\theta)}$ being more prone to take bigger values. Alternatively, if we fix the parameter $\theta$, we see that an increment on $\rho_\nu = 1/(\beta + 1)$ leads to a faster growth of $\mathbb{E}[K_n^{(\beta,\theta)}]$ with respect to $n$. Indeed, for the Dirichlet model ($\rho_\nu = 0$) we see that $\mathbb{E}[K_n^{(\infty,\theta)}]$ is the one with the slowest growth, and at the other extreme, for the Geometric process ($\rho_\nu = 1$), $\mathbb{E}[K_n^{(0,\theta)}]$ exhibits the fastest growth. This analysis suggests that for $0 < \rho_\nu = 1/(\beta + 1) < 1$, the rate at which $\mathbb{E}[K_n^{(\beta,\theta)}]$ increases is bounded by those corresponding to a Geometric and a Dirichlet model with the same value of $\theta$. Namely, it seems sensible to propose the conjecture $\mathbb{E}[K_n^{(\infty,\theta)}] \le \mathbb{E}[K_n^{(\beta_2,\theta)}] \le \mathbb{E}[K_n^{(\beta_1,\theta)}] \le \mathbb{E}[K_n^{(0,\theta)}]$ for real numbers $0 < \beta_1 < \beta_2$.

Figure 2: Numerical approximation of the mapping $n \mapsto E[K_n]$, for $n \in \{1, \ldots, 200\}$, corresponding to DSBs with parameters θ in {0.5, 1, 2.5, 4} (A to D, respectively), and for each fixed value of θ, we vary β in {0, 1/3, 1, 3, ∞} (ρν ∈ {1, 0.75, 0.5, 0.25, 0}, respectively).

5.2 Models beyond Dirichlet driven stick-breaking processes

The generality of Section 4 allows us to construct a great variety of new Bayesian non-parametric priors. In fact, for every known species sampling process, ν, with available tie probability, $\rho_\nu$, the analysis in Section 4 can be carried out, thus leading to a new model. In addition, if the expression for the EPPF, $\pi_\nu$, is manageable, relevant clustering probabilities can be recovered by inserting $\pi_\nu$ into the finite sum in (10). Such is the case of Gibbs-type priors (e.g. De Blasi et al.; 2015), and normalized random measures with independent increments (Regazzini et al.; 2003; James et al.; 2009).

To provide another concrete example, let us consider the case where ν is the two-parameter Pitman-Yor process (Pitman and Yor; 1992; Perman et al.; 1992) with parameters $\alpha \in [0, 1)$ and $\beta > -\alpha$. For this species sampling process the EPPF is

\[ \pi_\nu(n_1, \ldots, n_k) = \frac{(\beta + \alpha)_{k-1\uparrow\alpha} \prod_{j=1}^{k} (1 - \alpha)_{n_j - 1}}{(\beta + 1)_{n-1}}, \tag{13} \]

where $n = \sum_{j=1}^{k} n_j$, $(x)_{m\uparrow a} = \prod_{i=0}^{m-1} (x + ia)$ and $(x)_m = (x)_{m\uparrow 1}$. This implies that the prediction rule for $\{v_i \mid \nu \overset{iid}{\sim} \nu;\ i \ge 1\}$ becomes

\[ P[v_{n+1} \in \cdot \mid v_1, \ldots, v_n] = \sum_{j=1}^{k} \frac{n_j - \alpha}{n + \beta}\, \delta_{v_j^*} + \frac{\beta + k\alpha}{n + \beta}\, \nu_0, \tag{14} \]

where $v_1^*, \ldots, v_k^*$ are the distinct values in $\{v_1, \ldots, v_n\}$, and $n_j = |\{i \le n : v_i = v_j^*\}|$, for every $n \ge 1$. As to the tie probability of ν we get

\[ \rho_\nu = \pi_\nu(2) = \frac{1 - \alpha}{\beta + 1}. \tag{15} \]
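As a sanity check on (13), (14) and (15), the following sketch evaluates the Pitman-Yor EPPF with exact rational arithmetic and recovers the predictive weights as ratios of EPPF values, the form taken by the prediction rule of a species sampling process. The helper names `rising`, `eppf_py` and `predictive_weights` are illustrative choices, not notation from the paper.

```python
from fractions import Fraction


def rising(x, m, a=1):
    """Generalized rising factorial (x)_{m, a} = prod_{i=0}^{m-1} (x + i * a)."""
    out = Fraction(1)
    for i in range(m):
        out *= x + i * a
    return out


def eppf_py(blocks, alpha, beta):
    """EPPF (13) of the Pitman-Yor process evaluated at the block sizes."""
    n, k = sum(blocks), len(blocks)
    num = rising(beta + alpha, k - 1, alpha)
    for nj in blocks:
        num *= rising(1 - alpha, nj - 1)
    return num / rising(beta + 1, n - 1)


def predictive_weights(blocks, alpha, beta):
    """Weights of the observed values and of nu_0, as ratios of EPPF values."""
    base = eppf_py(blocks, alpha, beta)
    old = [eppf_py(blocks[:j] + [blocks[j] + 1] + blocks[j + 1:], alpha, beta) / base
           for j in range(len(blocks))]
    new = eppf_py(blocks + [1], alpha, beta) / base
    return old, new


if __name__ == "__main__":
    a, b = Fraction(1, 4), Fraction(1, 2)
    print("tie probability:", eppf_py([2], a, b))
    print("predictive weights for blocks (2, 1):", predictive_weights([2, 1], a, b))
```

For block sizes (2, 1) the ratios coincide with $(n_j - \alpha)/(n + \beta)$ and $(\beta + k\alpha)/(n + \beta)$, and `eppf_py([2], a, b)` equals $(1 - \alpha)/(\beta + 1)$.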

Inserting (13) into (10), for the case where the base measure is $\nu_0 = \mathrm{Be}(1, \theta)$, we obtain the finite dimensional distributions of the latent allocation variables

\[ P[\mathbf{d}_1 = d_1, \ldots, \mathbf{d}_n = d_n] = \sum_{\{A_1, \ldots, A_m\}} \frac{\theta^m (\beta + \alpha)_{m-1\uparrow\alpha}}{(\beta + 1)_{k-1}} \prod_{j=1}^{m} \frac{(1 - \alpha)_{|A_j| - 1} \left(\sum_{i \in A_j} r_i\right)!}{\left(\theta + \sum_{i \in A_j} t_i\right)_{1 + \sum_{i \in A_j} r_i}}, \]

where $r_i = \sum_{l=1}^{n} 1_{\{d_l = i\}}$, $t_i = \sum_{l=1}^{n} 1_{\{d_l > i\}}$, and where the sum ranges over the set of all partitions of $\{1, \ldots, k\}$, for $k = \max\{d_1, \ldots, d_n\}$. For the corresponding weights, $(w_j)_{j \ge 1}$, substituting (15) into Theorem 4.3 yields

\[ P[w_j \ge w_{j+1}] = \frac{1 - \alpha + (\beta + \alpha)\, E\big[\overrightarrow{\nu_0}(c(v))\big]}{\beta + 1}, \]

for every $j \ge 1$, where $\overrightarrow{\nu_0}$ and $c$ are as stated in the theorem. For the special case $\nu_0 = \mathrm{Be}(1, \theta)$, the probability in question simplifies to

\[ P[w_j \ge w_{j+1}] = 1 - \frac{{}_2F_1(1, 1; \theta + 2; 1/2)\,(\beta + \alpha)\,\theta}{2(\beta + 1)(\theta + 1)}, \]

and if $\theta = 1$ we even get $P[w_j \ge w_{j+1}] = [1 - \alpha + (\beta + \alpha)\log(2)]/[\beta + 1]$. In general, by (15), as $\alpha \to 1$ and $\beta \to \beta' \in [1, \infty)$, or $\beta \to \infty$ and $\alpha \to \alpha' \in [0, 1)$, we get $\rho_\nu \to 0$. Alternatively, as $\alpha \to \alpha' \in [0, 1)$ and $\beta \to -\alpha'$, the tie probability $\rho_\nu \to 1$. Thus, Corollary 4.2 assures that by means of Pitman-Yor driven ESBs we can approximate (weakly in distribution) Dirichlet and Geometric processes. This is not true for all choices of ν. In fact, if ν is a normalized inverse-Gaussian random measure with total mass parameter $\beta > 0$, as proved by Lijoi et al. (2005), its tie probability is $\rho_\nu = 2^{-1}[1 + \beta^2 e^{\beta} E_1(\beta) - \beta]$, where $E_1(\beta) = \int_{\beta}^{\infty} x^{-1} e^{-x}\,dx$ is the exponential integral. Using the inequality

\[ \frac{e^{-\beta}}{2} \log\left(1 + \frac{2}{\beta}\right) < E_1(\beta) < e^{-\beta} \log\left(1 + \frac{1}{\beta}\right), \]

it can be shown that as $\beta \to \infty$, $\rho_\nu \to 0$, and as $\beta \to 0$, $\rho_\nu \to c \le 1/2$. Thus Geometric processes cannot be recovered as weak limits of ESBs with normalized inverse-Gaussian processes as the directing random measure of the length variables.

There are also interesting choices of ν outside Bayesian non-parametric priors. For example, one might consider the species sampling process with finitely many atoms, $\nu = \alpha \sum_{j=1}^{\kappa} p_j \delta_{v_j^*} + (1 - \alpha)\,\nu_0$, for some $\kappa \in \mathbb{N}$ and where $p_j \ge 0$, $\sum_{j=1}^{\kappa} p_j = 1$ and α is an independent random variable taking values in [0, 1]. Depending on the distribution of $(p_j)_{j=1}^{\kappa}$ and α, the EPPF, $\pi_\nu$, could be relatively simple to derive, as well as the tie probability, which can be computed through $\rho_\nu = E[\alpha^2] \sum_{j=1}^{\kappa} E[p_j^2]$.
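The closed-form expression for $P[w_j \ge w_{j+1}]$ given earlier in this subsection for Pitman-Yor driven ESBs with $\nu_0 = \mathrm{Be}(1, \theta)$ can be evaluated numerically with a truncated Gauss series. The sketch below uses hypothetical function names and an assumed truncation at 200 terms, which is ample for the argument 1/2. It compares the general formula with the θ = 1 special case involving log(2).

```python
import math


def hyp2f1_11(c, z, terms=200):
    """Truncated Gauss series for 2F1(1, 1; c; z), valid for |z| < 1."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        # Ratio of consecutive terms: (1 + k)(1 + k) z / ((c + k)(k + 1)).
        term *= (k + 1) * z / (c + k)
    return total


def tie_probability(alpha, beta):
    """Tie probability (15) of the Pitman-Yor process."""
    return (1.0 - alpha) / (beta + 1.0)


def prob_decreasing(alpha, beta, theta):
    """P[w_j >= w_{j+1}] for a Pitman-Yor driven ESB with nu_0 = Be(1, theta)."""
    f = hyp2f1_11(theta + 2.0, 0.5)
    return 1.0 - f * (beta + alpha) * theta / (2.0 * (beta + 1.0) * (theta + 1.0))


if __name__ == "__main__":
    alpha, beta = 0.25, 0.5
    closed_form = (1 - alpha + (beta + alpha) * math.log(2)) / (beta + 1)
    print(tie_probability(alpha, beta), prob_decreasing(alpha, beta, 1.0), closed_form)
```

When $\beta = -\alpha$, so that the tie probability equals one, the function returns 1, in agreement with the decreasing weights of a Geometric process.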

6 Illustrations

In Bayesian non-parametric statistics it is common to model data, $Y = (y_1, \ldots, y_m)$, that features no repetitions, as if sampled from $\{y_k \mid x_k\} \sim G(\cdot \mid x_k)$ independently for $k \ge 1$. Here G denotes a diffuse probability kernel from the Polish space, S, where $x_k$ takes values, into the Polish space, T, where $y_k$ takes values. We further assume that $G(\cdot \mid s)$ has a density for every $s \in S$ and that $(x_1, x_2, \ldots)$ is exchangeable and driven by a proper species sampling process $\mu = \sum_{j \ge 1} w_j \delta_{\xi_j}$. In terms of the law of Y, this is equivalent to modelling the sequence as if it was conditionally i.i.d. sampled from the random mixture

\[ \Phi = \int G(\cdot \mid s)\,\mu(ds) = \sum_{j \ge 1} w_j\, G(\cdot \mid \xi_j). \tag{16} \]

Hereinafter we call Φ an ESB mixture or a DSB mixture whenever µ is an ESB or a DSB. In general, if G(· | sn) converges weakly to G(· | s), as sn → s in S, the mapping

\[ \sum_{j \ge 1} w_j \delta_{s_j} \mapsto \int G(\cdot \mid s)\,\mu(ds) = \sum_{j \ge 1} w_j\, G(\cdot \mid s_j) \]

is continuous with respect to the weak topology (Lemma B.3 in the appendix). This means that analogous convergence results to those in Theorem 3.3 and Corollaries 4.2 and 5.1 hold for ESB and DSB mixtures. In this context, $K_n = |\Pi(x_{1:n})|$ represents the number of mixture components that are significant in the sample, that is, the number of elements in $\{G(\cdot \mid \xi_j)\}_{j \ge 1}$ for which there exists $k \in \{1, \ldots, m\}$ such that $y_k$ was ultimately sampled from $G(\cdot \mid \xi_j)$. Details of an MCMC algorithm for density estimation by means of ESB mixtures are provided in Section E of the appendix.

We designed two experiments. The first consists in estimating the density of univariate data by means of the usual expected a posteriori (EAP) estimator. Here we fit six DSB mixtures, where the parameter $\beta = (1 - \rho_\nu)/\rho_\nu$ is fixed to distinct values. The main objective of this test is to analyze the posterior impact of Corollary 5.1: we expect to observe that when $\rho_\nu = 1/(\beta + 1)$ is small, the posterior estimators behave similarly to those of a Dirichlet prior, and when $\rho_\nu$ is close to one, the results resemble those of a Geometric prior. For the second experiment we work with bivariate data, and focus on estimating its density by means of the EAP and the maximum a posteriori (MAP) estimators. We also estimate the clusters of the data points using the MAP estimator (see Section E of the appendix for details). In this second study, we assign a prior distribution to the underlying tie probability $\rho_\nu$, allowing the model to determine which values of $\rho_\nu$ better suit the dataset, given that the rest of the parameters and hyper-parameters are fixed.

6.1 Results for DSB mixtures with fixed tie probability

For this experiment we simulated observations $(y_k)_{k=1}^{200}$ from a mixture of seven Normal distributions, and estimated the density of the data through six distinct DSB mixtures with parameters $(\beta, \theta, \mu_0)$. For each of the six mixtures we consider a Gaussian kernel with random location and scale parameters, i.e. $G(y \mid \xi_j) = N(y \mid m_j, \tau_j^{-1})$ and $\mu_0(\xi_j) = N(m_j \mid \mu, (\lambda\tau_j)^{-1})\,\mathrm{Ga}(\tau_j \mid a, b)$, where $a = b = 0.5$, $\lambda = 1/100$ and $\mu = n^{-1}\sum_{k=1}^{n} y_k$. Each of the six DSB mixtures features a distinct value of $\beta = (1 - \rho_\nu)/\rho_\nu$ and all share $\theta = 1$.
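To make the kernel and base measure above concrete, the sketch below draws atoms $\xi_j = (m_j, \tau_j)$ from $\mu_0$ and then samples observations from a finite mixture of the Gaussian kernels. It is only an illustration: it assumes that Ga(τ | a, b) is parameterized by shape a and rate b, which the text does not state, and it uses a fixed, finite weight vector in place of the stick-breaking weights. The function names are hypothetical.

```python
import random


def draw_atom(mu, lam, a, b, rng):
    """Draw xi = (m, tau) from mu_0 = N(m | mu, (lam * tau)^-1) Ga(tau | a, b)."""
    # Assumption: b is a rate, so the scale passed to gammavariate is 1 / b.
    tau = rng.gammavariate(a, 1.0 / b)
    m = rng.gauss(mu, (lam * tau) ** -0.5)
    return m, tau


def sample_mixture(weights, atoms, size, rng):
    """Sample from sum_j w_j N(. | m_j, 1 / tau_j) for finitely many weights."""
    picks = rng.choices(atoms, weights=weights, k=size)
    return [rng.gauss(m, tau ** -0.5) for m, tau in picks]


if __name__ == "__main__":
    rng = random.Random(7)
    atoms = [draw_atom(0.0, 1 / 100, 0.5, 0.5, rng) for _ in range(3)]
    print(sample_mixture([0.5, 0.3, 0.2], atoms, 5, rng))
```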

Figure 3: Estimated densities, taking into account 4000 iterations of the Gibbs sampler, skipping 4 iterations, after a burn-in period of 7000, for six DSB mixtures with parameter β ∈ {0, 1/3, 1, 4, 9, ∞} (ρν ∈ {1, 0.75, 0.5, 0.2, 0.1, 0}, respectively).

In Figure 3 we can observe that all models do a good job estimating the density. The Dirichlet process (ρν = 0) and the DSB with ρν = 0.1 struggle more than the other models to differentiate the second and third modes from left to right. This may be due to the initial choice of the parameter θ and the fact that a priori the DSB with ρν = 0.1 behaves similarly to a Dirichlet process. In Figure 3 we can also observe that it is in the high density areas that the estimated density varies slightly from model to model.

Figure 4: Frequency polygons corresponding to the posterior distribution of K200, for the six DSB mixtures with parameter β = 0, 1/3, 1, 4, 9, ∞ (ρν = 1, 0.75, 0.5, 0.2, 0.1, 0, respectively).

Figure 4 illustrates the posterior distribution of Kn, with n = 200, for each of the six DSB mixtures implemented. Here we see that the Dirichlet process and the DSBs with a smaller value of ρν give high probability to numbers close to seven, which is the true number of components of the mixture from which the data was sampled. In contrast, as the parameter ρν approaches one, the models tend to give higher probability to larger values through the posterior distribution of Kn. This means that these models use more components to provide the estimates illustrated in Figure 3. Indeed, since Geometric weights decrease at a constant rate, in order to estimate the size and shape of some components, the model is forced to overlap many small components. If we were interested in clustering the data points, this can be a disadvantage of DSB mixtures with large values of ρν, as it is likely that the number of clusters will be overestimated. However, if we are only interested in density estimation, this feature actually makes the models that behave similarly to Geometric processes more likely to capture subtle changes in the histogram of the data set. Overall we see that the results are consistent with Corollaries 4.2 and 5.1 in the sense that as ρν → 0, the results provided by the DSB mixtures are similar to those given by a Dirichlet prior, and when ρν → 1, they are closer to those provided by a Geometric prior.

6.2 Results for DSB mixtures with random tie probability

For this experiment, we simulated 510 data points from a paw-shaped mixture of seven Normal distributions. Here we fit a Dirichlet mixture with total mass parameter θ, a Geometric mixture with length variable v ∼ Be(1, θ) and a DSB mixture with parameters $(\beta, \theta, \mu_0)$, where $\beta = (1 - \rho_\nu)/\rho_\nu$ and $\rho_\nu \sim \mathrm{Unif}(0, 1)$. For all mixtures we assume a bivariate Gaussian kernel, i.e. $G(y \mid \xi_j) = N_2(y \mid m_j, \Sigma_j)$, and a Normal-inverse-Wishart prior for $\xi_j = (m_j, \Sigma_j)$, so that $\mu_0(\xi_j) = N_2(m_j \mid \mu, \lambda^{-1}\Sigma_j)\, W^{-1}(\Sigma_j \mid P, \nu)$. In all cases the hyper-parameters were fixed to $\theta = 1$, $\mu = n^{-1}\sum_{k=1}^{n} y_k$, $\lambda = 1/100$, $\nu = 2$ and P equal to the identity matrix.

Figure 5: Estimated clusters using the MAP estimator, taking into account 8000 iterations of the Gibbs sampler after a burn-in period of 2000 iterations, according to the Dirichlet prior (A) a DSB prior (B), a Geometric prior (C) and the true model (D).

In Figure 5 we see that while the DSB recovers exactly seven clusters, as there are according to the true model, the Dirichlet process merges two clusters of the true model, and the Geometric model overestimates the number of clusters. This figure illustrates that the generality inherent to DSBs allows one to balance features of Dirichlet and Geometric processes. This is also reflected through the MAP estimators of the density, presented in Figure 6 (and Figure A1 in the appendix).

Figure 6: Estimated densities of the data points using the MAP, taking into account 8000 iterations of the Gibbs sampler after a burn-in period of 2000 iterations, according to the Dirichlet prior (A) a DSB prior (B) and a Geometric prior (C). D shows the true density.

Figure 7: Estimated densities of the data points using the EAP, taking into account each fourth iteration among 8000 iterations of the Gibbs sampler after a burn-in period of 2000, according to the Dirichlet prior (A), a DSB prior (B) and a Geometric prior (C). D shows the true density from which the data points were i.i.d. sampled.

Figure 7 (and Figure A2 in the appendix) exhibits the EAP estimators of the density provided by each model. In comparison to the MAP estimators, these are much smoother. Here we see that all models estimate the density quite well, which is explained by the fact that Gaussian mixtures are in general very flexible models and that all of the species sampling priors considered here have full support.

Figure 8: Frequency polygons corresponding to the posterior distribution of K510, for the Dirichlet, DSB, and Geometric mixtures. The dotted line indicates the true number of mixture components.

In Figure 8 we observe the posterior distribution of Kn (for n = 510) for each of the models. Here we see that through the posterior mode of Kn the DSB recovers the true number of mixture components. The Dirichlet model also assigns a probability larger than zero to the true number of components. Despite this, the posterior mode of Kn for the Dirichlet process is one unit smaller than the true number of components. As to the Geometric process, the posterior distribution of Kn concentrates on significantly larger values than the real number of mixture components.

The last figure we will analyze is Figure 9, which presents the posterior distribution of ρν for the DSB mixture with random tuning parameter. Here we see that the posterior mode is close to 0.25. In particular, this suggests that, for the fixed values of the hyper-parameters we considered, a model more similar to the Dirichlet mixture is preferred over one that approximates a Geometric mixture.

Figure 9: Posterior distribution of the tie probability ρν.

7 Final comments

While we did not address in detail other examples of ESBs outside the case where the length variables are species sampling driven, it is important to highlight that various well known Bayesian non-parametric priors fall into the general framework outlined in Section 3. Indeed, if the length variables are conditionally i.i.d. given ν, where ν denotes a parametric distribution with random parameters, for instance ν = Be(α, θ) and (α, θ) ∼ p(α, θ), then the length variables are exchangeable and the species sampling process, µ, they define is an ESB. Such is the case of a Dirichlet process with a prior on the total mass parameter, a model that has been widely exploited (e.g. Escobar and West; 1995).

Although the present work was mainly motivated by Bayesian non-parametric theory, stick-breaking processes continue to be widely used in other probabilistic frameworks, and the results provided here can easily carry over to such contexts.

Acknowledgements

The first author was supported by a CONACyT PhD scholarship. Both authors gratefully acknowledge the support of PAPIIT Grant IG100221.

A Stick-breaking decomposition

Theorem A.1. Let $W = (w_j)_{j \ge 1}$ be a sequence such that $0 \le w_j \le 1$, for every $j \ge 1$, and $\sum_{j \ge 1} w_j \le 1$ almost surely. Then, there exists a sequence $V = (v_i)_{i \ge 1}$ taking values in [0, 1] such that (2) holds.

Proof. Let $(\Omega, \mathcal{F}, P)$ be the probability space over which $(w_j)_{j \ge 1}$ is defined. Given that we are interested in an almost sure decomposition, we may assume without loss of generality that $\sum_{j \ge 1} w_j \le 1$ and $0 \le w_j \le 1$, for every $j \ge 1$, hold over Ω. Fix $v_1 = w_1$, and for $k \ge 2$, define the event $E_k = \{\omega \in \Omega : \sum_{j=1}^{k-1} w_j(\omega) < 1\}$ and set

\[ v_k = \frac{w_k}{1 - \sum_{j=1}^{k-1} w_j}\, 1_{E_k}. \]

Evidently $v_k$ is measurable as $w_1, \ldots, w_k$ are. Also, since $\sum_{j \ge 1} w_j \le 1$, we have that $w_k(\omega) \le 1 - \sum_{j=1}^{k-1} w_j(\omega)$, for every $\omega \in E_k$, which yields $0 \le v_k \le 1$. This shows $(v_i)_{i \ge 1}$ is a sequence of [0, 1]-valued random variables. Now, for $k < k'$ we have that $E_{k'} \subseteq E_k$. Hence, for every $k \ge 2$,

\[ w_k(\omega) = \frac{w_k(\omega)}{1 - \sum_{j=1}^{k-1} w_j(\omega)} \prod_{j=1}^{k-1} \left( \frac{1 - \sum_{i=1}^{j} w_i(\omega)}{1 - \sum_{i=1}^{j-1} w_i(\omega)} \right) = v_k(\omega) \prod_{j=1}^{k-1} (1 - v_j(\omega)), \]

for all $\omega \in E_k$, and since $\sum_{j \ge 1} w_j \le 1$, $w_k(\omega) = 0 = v_k(\omega)$ for $\omega \in (E_k)^c$. This shows that for every $k \ge 1$, $w_k = v_k \prod_{j=1}^{k-1} (1 - v_j)$, as desired.
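The construction in the proof of Theorem A.1 translates directly into code. The sketch below uses hypothetical function names and maps a finite prefix of weights to length variables and back. It works with exact fractions or with floats, and it sets $v_k = 0$ whenever no mass is left, mirroring the indicator of $E_k$.

```python
from fractions import Fraction


def weights_to_lengths(weights):
    """v_k = w_k / (1 - sum_{j<k} w_j) on E_k, and v_k = 0 otherwise."""
    lengths, used = [], 0
    for w in weights:
        remaining = 1 - used
        lengths.append(w / remaining if remaining > 0 else 0)
        used += w
    return lengths


def lengths_to_weights(lengths):
    """w_k = v_k * prod_{j<k} (1 - v_j)."""
    weights, remaining = [], 1
    for v in lengths:
        weights.append(v * remaining)
        remaining *= 1 - v
    return weights


if __name__ == "__main__":
    w = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8)]
    v = weights_to_lengths(w)
    print(v, lengths_to_weights(v))
```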

B Weak convergence of probability measures

Here we mention some topological details of measure spaces. For a Polish space S, with Borel σ-algebra $\mathcal{B}_S$, we denote by $\mathcal{P}(S)$ the space of all probability measures over $(S, \mathcal{B}_S)$. A well-known metric on $\mathcal{P}(S)$ is the Lévy-Prokhorov metric given by

\[ d_L(\mu, \mu') = \inf\{\varepsilon > 0 : \mu(A) \le \mu'(A^{\varepsilon}) + \varepsilon,\ \mu'(A) \le \mu(A^{\varepsilon}) + \varepsilon,\ \forall A \in \mathcal{B}_S\}, \tag{A1} \]

for any $\mu, \mu' \in \mathcal{P}(S)$, and where $A^{\varepsilon} = \{s \in S : d(s, A) < \varepsilon\}$, $d(s, A) = \inf\{d(a, s) : a \in A\}$ and d is some complete metric on S. For probability measures $\mu, \mu^{(1)}, \mu^{(2)}, \ldots$ it is said that $\mu^{(n)}$ converges weakly to µ, denoted by $\mu^{(n)} \xrightarrow{w} \mu$, whenever $\mu^{(n)}(f) = \int_S f\,d\mu^{(n)} \to \int_S f\,d\mu = \mu(f)$ for every continuous bounded function $f : S \to \mathbb{R}$. The Portmanteau theorem states that this condition is equivalent to $d_L(\mu^{(n)}, \mu) \to 0$, and to $\mu^{(n)}(A) \to \mu(A)$ for every Borel set A such that $\mu(\partial A) = 0$, where $\partial A$ denotes the boundary of A. $\mathcal{P}(S)$, equipped with the topology of weak convergence, is Polish again. Its Borel σ-field, $\mathcal{B}_{\mathcal{P}(S)}$, can equivalently be defined as the σ-algebra generated by all the projection maps $\{\mu \mapsto \mu(B) : B \in \mathcal{B}_S\}$. In this sense the random probability measures $(\mu^{(n)})_{n \ge 1}$ are said to converge weakly, a.s., to µ, whenever $\mu^{(n)}(\omega) \xrightarrow{w} \mu(\omega)$ outside a P-null set. Analogously, we say that $(\mu^{(n)})_{n \ge 1}$ converges weakly, in $L_p$, in probability or in distribution, to µ, denoted by $\mu^{(n)} \xrightarrow{L_p w} \mu$, $\mu^{(n)} \xrightarrow{Pw} \mu$ and $\mu^{(n)} \xrightarrow{dw} \mu$, respectively, whenever $\mu^{(n)}(f) \to \mu(f)$, in the corresponding mode of convergence, for every continuous bounded function $f : S \to \mathbb{R}$. Evidently, $\mu^{(n)} \xrightarrow{w} \mu$ a.s. and $\mu^{(n)} \xrightarrow{L_p w} \mu$ are both sufficient conditions for $\mu^{(n)} \xrightarrow{Pw} \mu$, which in turn implies $\mu^{(n)} \xrightarrow{dw} \mu$. The latter is even equivalent to $\mu^{(n)} \xrightarrow{d} \mu$ (for further details see Parthasarathy; 1967; Billingsley; 1968; Kallenberg; 2017). The following lemmas will be needed for the proofs of some of the main results.

Lemma B.1. Let $\Delta_\infty$ denote the infinite dimensional simplex and consider some Polish space S. The mapping

\[ [(\mu_1, \mu_2, \ldots), (w_1, w_2, \ldots)] \mapsto \sum_{j \ge 1} w_j \mu_j, \]

from $\mathcal{P}(S)^\infty \times \Delta_\infty$ into $\mathcal{P}(S)$ is continuous with respect to the weak and product topologies.

Proof: Let $w = (w_1, w_2, \ldots)$, $\big(w^{(n)} = (w_1^{(n)}, w_2^{(n)}, \ldots)\big)_{n \ge 1}$ be elements of $\Delta_\infty$, and $\mu = (\mu_1, \mu_2, \ldots)$, $\big(\mu^{(n)} = (\mu_1^{(n)}, \mu_2^{(n)}, \ldots)\big)_{n \ge 1}$ be elements of $\mathcal{P}(S)^\infty$, such that $w_j^{(n)} \to w_j$ and $\mu_j^{(n)} \xrightarrow{w} \mu_j$, for every $j \ge 1$. Define $\nu^{(n)} = \sum_{j \ge 1} w_j^{(n)} \mu_j^{(n)}$ and $\nu = \sum_{j \ge 1} w_j \mu_j$. Fix a continuous and bounded function $f : S \to \mathbb{R}$. Then, for $j \ge 1$, $w_j^{(n)} \mu_j^{(n)}(f) \to w_j \mu_j(f)$. Since f is bounded, there exists M such that $|f| \le M$, hence $|w_j^{(n)} \mu_j^{(n)}(f)| \le w_j^{(n)} \mu_j^{(n)}(|f|) \le w_j^{(n)} M$, for every $n \ge 1$ and $j \ge 1$. Evidently, $M w_j^{(n)} \to M w_j$, and $\sum_{j \ge 1} M w_j^{(n)} = M = \sum_{j \ge 1} M w_j$. Hence, by the general Lebesgue dominated convergence theorem, we obtain

\[ \nu^{(n)}(f) = \sum_{j \ge 1} w_j^{(n)} \mu_j^{(n)}(f) \to \sum_{j \ge 1} w_j \mu_j(f) = \nu(f). \]

That is, $\nu^{(n)} \xrightarrow{w} \nu$.

Lemma B.2. Let $\Delta_\infty$ denote the infinite dimensional simplex and consider some Polish space S. The mapping

\[ [(s_1, s_2, \ldots), (w_1, w_2, \ldots)] \mapsto \sum_{j \ge 1} w_j \delta_{s_j}, \]

from $S^\infty \times \Delta_\infty$ into $\mathcal{P}(S)$ is continuous with respect to the product and weak topologies.

Proof: By Lemma B.1 it suffices to check that the mapping $s \mapsto \delta_s$ from S into $\mathcal{P}(S)$ is continuous. So fix $s_n \to s$ in S and fix a continuous and bounded function $f : S \to \mathbb{R}$. Then $\delta_{s_n}(f) = f(s_n) \to f(s) = \delta_s(f)$, that is, $\delta_{s_n} \xrightarrow{w} \delta_s$.

Lemma B.3. Consider two Polish spaces, S and T, and let G be a probability kernel from S into T, such that for every $s_n \to s$ in S, $G(\cdot \mid s_n) \xrightarrow{w} G(\cdot \mid s)$. The mapping

\[ \mu = \sum_{j \ge 1} w_j \delta_{s_j} \mapsto \int G(\cdot \mid s)\,\mu(ds) = \sum_{j \ge 1} w_j\, G(\cdot \mid s_j), \]

from $\mathcal{P}(S)$ into $\mathcal{P}(T)$ is continuous with respect to the product and weak topologies.

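Proof: Consider some discrete probability measures $\big(\mu^{(n)} = \sum_{j \ge 1} w_j^{(n)} \delta_{s_j^{(n)}}\big)_{n \ge 1}$ over $(S, \mathcal{B}_S)$, such that $\mu^{(n)} \xrightarrow{w} \sum_{j \ge 1} w_j \delta_{s_j} = \mu$, and set $\Phi^{(n)} = \sum_{j \ge 1} w_j^{(n)} G(\cdot \mid s_j^{(n)})$ and $\Phi = \sum_{j \ge 1} w_j G(\cdot \mid s_j)$. Let $f : T \to \mathbb{R}$ be a continuous and bounded function and define the function $h : S \to \mathbb{R}$ by

\[ h(s) = \int f(t)\, G(dt \mid s). \]

Evidently h is bounded because f is bounded and $G(\cdot \mid s)$ is a probability measure. Furthermore, as $G(\cdot \mid s_n) \xrightarrow{w} G(\cdot \mid s)$ for every $s_n \to s$ in S, h is also continuous. Thus,

\[ \int f\,d\Phi^{(n)} = \sum_{j \ge 1} w_j^{(n)} h\big(s_j^{(n)}\big) = \int h\,d\mu^{(n)} \to \int h\,d\mu = \sum_{j \ge 1} w_j h(s_j) = \int f\,d\Phi. \]

That is, $\Phi^{(n)} \xrightarrow{w} \Phi$.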

Corollary B.4. Let $\Delta_\infty$ denote the infinite dimensional simplex. Consider a couple of Polish spaces, S and T, and let G be a probability kernel from S into T, such that for every $s_n \to s$ in S, $G(\cdot \mid s_n) \xrightarrow{w} G(\cdot \mid s)$. The mapping

\[ [(s_1, s_2, \ldots), (w_1, w_2, \ldots)] \mapsto \sum_{j \ge 1} w_j\, G(\cdot \mid s_j), \]

from $S^\infty \times \Delta_\infty$ into $\mathcal{P}(T)$ is continuous with respect to the weak topology.

Proof: This is straightforward from Lemmas B.2 and B.3.

C Exchangeable sequences driven by species sampling processes

This section provides an overview of exchangeable sequences driven by species sampling processes. The results presented here are fundamental for the proof of our main results.

Theorem C.1. Let $(x_i)_{i \ge 1}$ be a random sequence, taking values in a Polish space $(S, \mathcal{B}(S))$, and for $n \ge 1$, define $\Pi(x_{1:n})$ as the random partition of [n] generated by the random equivalence relation $i \sim j$ if and only if $x_i = x_j$. Let $\mu_0$ be a diffuse probability measure over $(S, \mathcal{B}(S))$ and let π be an EPPF. The following statements are equivalent.

I. $(x_i)_{i \ge 1}$ is exchangeable and directed by a species sampling process µ as in (1), with base measure $\mu_0$, and whose size-biased pseudo-permuted weights $(\tilde{w}_j)_{j \ge 1}$ satisfy

\[ \pi(n_1, \ldots, n_k) = E\left[ \prod_{j=1}^{k} \tilde{w}_j^{\,n_j - 1} \prod_{j=1}^{k-1} \left( 1 - \sum_{i=1}^{j} \tilde{w}_i \right) \right]. \]

II. $x_1 \sim \mu_0$, and for every $n \ge 1$,

\[ P[x_{n+1} \in \cdot \mid x_1, \ldots, x_n] = \sum_{j=1}^{K_n} \frac{\pi\big(\mathbf{n}^{(j)}\big)}{\pi(\mathbf{n})}\, \delta_{x_j^*} + \frac{\pi\big(\mathbf{n}^{(K_n + 1)}\big)}{\pi(\mathbf{n})}\, \mu_0, \]

where $x_1^*, \ldots, x_{K_n}^*$ are the $K_n$ distinct values in $\{x_1, \ldots, x_n\}$, $\mathbf{n} = (n_1, \ldots, n_{K_n})$, $\mathbf{n}^{(j)} = (n_1, \ldots, n_{j-1}, n_j + 1, n_{j+1}, \ldots, n_{K_n})$ and $\mathbf{n}^{(K_n + 1)} = (n_1, \ldots, n_{K_n}, 1)$, with $n_j = |\{i \le n : x_i = x_j^*\}|$.

III. The law of $(\Pi(x_{1:n}))_{n \ge 1}$ is described by the EPPF π and for every $n \ge 1$ and $B_1, \ldots, B_n \in \mathcal{B}(S)$

\[ P[x_1 \in B_1, \ldots, x_n \in B_n \mid \Pi(x_{1:n})] = \prod_{i=1}^{K_n} \mu_0\Big( \bigcap_{j \in \Pi_i} B_j \Big), \]

where $\Pi_1, \ldots, \Pi_{K_n}$ are the random blocks of $\Pi(x_{1:n})$.

IV. For every $n \ge 1$, and any $x_1, \ldots, x_n \in S$,

\[ P[x_1 \in dx_1, \ldots, x_n \in dx_n] = \pi(n_1, \ldots, n_k) \prod_{j=1}^{k} \mu_0(dx_j^*), \]

where $x_1^*, \ldots, x_k^*$ are the distinct values in $\{x_1, \ldots, x_n\}$, and $n_j = |\{i : x_i = x_j^*\}|$.

First we clarify what we mean by a size-biased pseudo-permutation. For a sequence of weights, $W = (w_j)_{j \ge 1}$, with $w_j \ge 0$ and $\sum_{j \ge 1} w_j \le 1$ almost surely, we call $\tilde{W} = (\tilde{w}_1, \tilde{w}_2, \ldots)$ a size-biased pseudo-permutation of W if

\[ P[\tilde{w}_1 \in \cdot \mid W] = \sum_{j \ge 1} w_j \delta_{w_j} + \Big( 1 - \sum_{j \ge 1} w_j \Big) \delta_0, \]

and for every $i \ge 1$,

\[ P[\tilde{w}_{i+1} \in \cdot \mid W, \tilde{w}_1, \ldots, \tilde{w}_i] = \frac{\sum_{j \ge 1} w_j \delta_{w_j} - \sum_{j=1}^{i} \tilde{w}_j \delta_{\tilde{w}_j} + \big( 1 - \sum_{j \ge 1} w_j \big) \delta_0}{1 - \sum_{j=1}^{i} \tilde{w}_j} \]

if $1 - \sum_{j=1}^{i} \tilde{w}_j > 0$, and $P[\tilde{w}_{i+1} \in \cdot \mid W, \tilde{w}_1, \ldots, \tilde{w}_i] = \delta_0$ otherwise. If the weights sum up to one almost surely, this definition coincides with the notion of size-biased permutation of the weights (Pitman; 1996a). The proof of Theorem C.1 is a consequence of the work by Pitman (1995, 1996a,b). In particular, point III reveals that for $\{x_i \mid \mu \overset{iid}{\sim} \mu;\ i \ge 1\}$ driven by a species sampling process, the random vector $(x_1, \ldots, x_n)$ is distributed as $(x_{l_1}^*, \ldots, x_{l_n}^*)$, with $l_r = j$ if and only if $r \in \Pi_j$. For example, say that for some realization $\Pi(x_{1:6}) = \{\{1, 4, 5\}, \{2, 3\}, \{6\}\}$; then under such an event, $(x_1, \ldots, x_6)$ is distributed as $(x_1^*, x_2^*, x_2^*, x_1^*, x_1^*, x_3^*)$, where $\{x_i^* \overset{iid}{\sim} \mu_0;\ i \ge 1\}$ independently of $\Pi(x_{1:6})$. With this in mind, the proof of the following result is straightforward.
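The definition of a size-biased pseudo-permutation can be illustrated for a finite weight vector. The following sketch uses a hypothetical helper name and assumes finitely many weights whose total does not exceed one. At each step it picks a not yet selected weight with probability proportional to its size, or the value 0 with probability proportional to the defect $1 - \sum_j w_j$.

```python
import random


def size_biased_pseudo_permutation(weights, length, rng):
    """Draw the first `length` terms of a size-biased pseudo-permutation."""
    remaining = list(weights)      # masses not yet selected
    defect = 1.0 - sum(weights)    # mass sitting at delta_0
    out = []
    for _ in range(length):
        total = sum(remaining) + defect  # equals 1 - (sum of selected weights)
        if total <= 0:
            out.append(0.0)              # nothing left: the law is delta_0
            continue
        u, acc, chosen = rng.random() * total, 0.0, None
        for idx, w in enumerate(remaining):
            acc += w
            if u < acc:
                chosen = idx
                break
        if chosen is None:
            out.append(0.0)              # the draw fell in the defect
        else:
            out.append(remaining[chosen])
            remaining[chosen] = 0.0      # remove the selected mass
    return out


if __name__ == "__main__":
    rng = random.Random(5)
    print(size_biased_pseudo_permutation([0.5, 0.25, 0.25], 4, rng))
    print(size_biased_pseudo_permutation([0.5, 0.25], 6, rng))
```

When the weights sum to one, the first draws form an ordinary size-biased permutation and all later draws are zero.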

Theorem C.2. Let $(x_i)_{i \ge 1}$ be an exchangeable sequence driven by a species sampling process, µ, with base measure $\mu_0$ and corresponding EPPF π. Fix $n \ge 1$ and let $f : S^n \to \mathbb{R}$ be a measurable function, then

\[ E[f(x_1, \ldots, x_n)] = \sum_{\{A_1, \ldots, A_k\}} \left\{ \int f(x_{l_1}, \ldots, x_{l_n}) \prod_{j=1}^{k} \prod_{r \in A_j} 1_{\{l_r = j\}}\, \mu_0(dx_1) \cdots \mu_0(dx_k) \right\} \pi(|A_1|, \ldots, |A_k|), \tag{A2} \]

whenever the integrals in the right side exist, and where the sum ranges over all partitions of $\{1, \ldots, n\}$.

Proof. Let $\Pi_1, \ldots, \Pi_{K_n}$ denote the blocks of $\Pi(x_{1:n})$, and consider $\{x_i^* \overset{iid}{\sim} \mu_0;\ i \ge 1\}$ independently. By Theorem C.1 III, and the tower property of conditional expectation,

\[ \begin{aligned} E[f(x_1, \ldots, x_n)] &= E\big[ E[f(x_1, \ldots, x_n) \mid \Pi(x_{1:n})] \big] \\ &= E\left[ f(x_{l_1}^*, \ldots, x_{l_n}^*) \prod_{j=1}^{K_n} \prod_{r \in \Pi_j} 1_{\{l_r = j\}} \right] \\ &= \sum_{\{A_1, \ldots, A_k\}} \left\{ \int f(x_{l_1}, \ldots, x_{l_n}) \prod_{j=1}^{k} \prod_{r \in A_j} 1_{\{l_r = j\}}\, \mu_0(dx_1) \cdots \mu_0(dx_k) \right\} \pi(|A_1|, \ldots, |A_k|), \end{aligned} \]

whenever the integrals in the right side exist, and where the sum ranges over all partitions of $\{1, \ldots, n\}$.
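To illustrate how (A2) can be used, consider a function f that depends on $(x_1, \ldots, x_n)$ only through the partition it induces, such as the number of distinct values. Because $\mu_0$ is diffuse, the inner integral then reduces to the value of f on each partition, and the expectation becomes a finite sum over set partitions weighted by the EPPF. The sketch below enumerates the partitions and, as an assumed example, uses the EPPF of a Dirichlet process with total mass β, i.e. (13) with α = 0. All function names are hypothetical.

```python
from fractions import Fraction


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into an existing block ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new singleton block.
        yield [[first]] + partition


def eppf_dirichlet(sizes, beta):
    """EPPF of a Dirichlet process with total mass beta ((13) with alpha = 0)."""
    n, k = sum(sizes), len(sizes)
    value = Fraction(beta) ** (k - 1)
    for nj in sizes:
        for i in range(1, nj):
            value *= i                 # (1)_{n_j - 1} = (n_j - 1)!
    for i in range(n - 1):
        value /= beta + 1 + i          # (beta + 1)_{n - 1}
    return value


def expectation(f_of_partition, n, eppf):
    """Sum f over all partitions of {1, ..., n}, weighted by the EPPF."""
    return sum(f_of_partition(p) * eppf([len(block) for block in p])
               for p in set_partitions(list(range(1, n + 1))))


if __name__ == "__main__":
    beta = Fraction(3, 2)
    # Expected number of distinct values among x_1, ..., x_5.
    print(expectation(len, 5, lambda sizes: eppf_dirichlet(sizes, beta)))
```

For the Dirichlet process the result agrees with the classical identity $E[K_n] = \sum_{i=0}^{n-1} \beta/(\beta + i)$. The enumeration grows with the Bell numbers, so this direct approach is only practical for small n.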

Theorem C.2 generalizes a result by Yamato (1984), in which (A2) is derived only for symmetric functions and for the special case where µ is a Dirichlet process. Related formulae also appear and are cleverly exploited in Lijoi and Prünster (2009). As mentioned in Section 2, the tie probability, ρ = P[x1 = x2] = π(2), determines important characteristics of a species sampling process. For instance, the following conditional moments are completely determined by the tie probability and the base measure.

Corollary C.3. Let µ be a species sampling process with base measure $\mu_0$ and tie probability ρ. Consider $\{x_1, x_2, \ldots \mid \mu\} \overset{iid}{\sim} \mu$. Then, for every $i \ne j$

a) $E[x_j \mid x_i] = \rho\, x_i + (1 - \rho) E[x_i]$

b) $\mathrm{Var}(x_j \mid x_i) = (1 - \rho)\big\{ \rho\, (x_i - E[x_i])^2 + \mathrm{Var}(x_i) \big\}$

c) Cov(xi, xj) = ρ Var(xi)

d) Corr(xi, xj) = ρ.

Proof. For any measurable function $f : [0, 1] \to \mathbb{R}_+$, and for every $i \ne j$,

\[ E[f(x_j) \mid x_i] = \rho\, f(x_i) + (1 - \rho) \int f(x)\,\mu_0(dx) = \rho\, f(x_i) + (1 - \rho) E[f(x_i)], \]

noting that $x_i \sim \mu_0$. The choice $f(x) = x$ proves (a). To prove (b) note that for $f(x) = x^2$, we obtain $E[x_j^2 \mid x_i] = \rho\, x_i^2 + (1 - \rho) E[x_i^2]$; this together with (a) shows that

\[ \begin{aligned} \mathrm{Var}(x_j \mid x_i) &= \rho\, x_i^2 + (1 - \rho) E[x_i^2] - \big( \rho\, x_i + (1 - \rho) E[x_i] \big)^2 \\ &= (1 - \rho)\big\{ \rho\, (x_i - E[x_i])^2 + \mathrm{Var}(x_i) \big\}. \end{aligned} \]

To prove (c) we first compute, using (a), $E[x_i x_j] = E[x_i E[x_j \mid x_i]] = \rho E[x_i^2] + (1 - \rho) E[x_i]^2$. Thus

\[ \mathrm{Cov}(x_i, x_j) = \rho E[x_i^2] + (1 - \rho) E[x_i]^2 - E[x_i]^2 = \rho\, \mathrm{Var}(x_i). \]

Finally, (d) follows by dividing the last equation by $\mathrm{Var}(x_i) = \sqrt{\mathrm{Var}(x_i)\mathrm{Var}(x_j)}$.
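Part (d) of Corollary C.3 can be checked by simulation. By the identity used at the start of the proof, given $x_1$ the next observation equals $x_1$ with probability ρ and is otherwise a fresh draw from $\mu_0$. The sketch below takes $\mu_0 = \mathrm{Unif}(0, 1)$ purely for illustration and estimates the correlation of the pair. The function names are hypothetical.

```python
import math
import random


def sample_pair(rho, rng):
    """Draw (x_1, x_2): x_2 repeats x_1 with probability rho, else is fresh."""
    x1 = rng.random()  # mu_0 = Unif(0, 1), an illustrative diffuse choice
    x2 = x1 if rng.random() < rho else rng.random()
    return x1, x2


def empirical_corr(rho, size, rng):
    """Sample correlation of `size` simulated pairs."""
    pairs = [sample_pair(rho, rng) for _ in range(size)]
    m1 = sum(x for x, _ in pairs) / size
    m2 = sum(y for _, y in pairs) / size
    cov = sum((x - m1) * (y - m2) for x, y in pairs) / size
    v1 = sum((x - m1) ** 2 for x, _ in pairs) / size
    v2 = sum((y - m2) ** 2 for _, y in pairs) / size
    return cov / math.sqrt(v1 * v2)


if __name__ == "__main__":
    rng = random.Random(0)
    for rho in (0.1, 0.5, 0.9):
        print(rho, empirical_corr(rho, 100000, rng))
```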

Notice that for small values of ρ, E[xj | xi] ≈ E[xj], Var(xj | xi) ≈ Var(xj) and Cov(xi, xj) ≈ 0; alternatively, for values of ρ close to 1, E[xj | xi] ≈ xi, Var(xj | xi) ≈ 0 and Cov(xi, xj) ≈ Var(xi). So (xi)i≥1 behaves very similarly to an i.i.d. sequence whenever ρ is close to 0, and if ρ ≈ 1 the behaviour of (xi)i≥1 is similar to (x, x, ...) with x ∼ µ0. Theorem C.5 below formalizes this intuition. To prove it we use the following lemma, which also allows us to characterize some moments of the species sampling process in terms of its base measure and its tie probability.

Lemma C.4. Let S be a Polish space and consider a species sampling process µ as in (1), with base measure $\mu_0$ and with tie probability ρ. Let $f, g : S \to \mathbb{R}$ be measurable and bounded functions. Let us denote $\mu(f) = \int f(s)\,\mu(ds)$ and analogously for $\mu_0$ and g. Then,

a) $E[\mu(f)] = \mu_0(f)$.

b) $E[\mu(f)^2] = \rho\, \mu_0(f^2) + (1 - \rho)\mu_0(f)^2$.

c) $E[\mu(f)\mu(g)] = \rho\, \mu_0(fg) + (1 - \rho)\mu_0(f)\mu_0(g)$.

Proof. Set $w_0 = 1 - \sum_{j \ge 1} w_j$, so that $\sum_{j \ge 0} w_j = 1$ almost surely, and by a monotone convergence argument we also obtain $\sum_{j \ge 0} E[w_j] = 1$. Note that

\[ \mu(f) = \sum_{j \ge 1} w_j f(\xi_j) + w_0 \mu_0(f), \]

and for any bound of f, M, we have that $|\sum_{j=1}^{n} w_j f(\xi_j)| < M$ almost surely for every $n \ge 1$. Hence, by linearity of the expectation, the Lebesgue dominated convergence theorem, and since $\{\xi_j \overset{iid}{\sim} \mu_0;\ j \ge 1\}$ independently of the weights, we get

\[ E[\mu(f)] = \sum_{j \ge 1} E[w_j] E[f(\xi_j)] + E[w_0]\, \mu_0(f) = \Big( \sum_{j \ge 0} E[w_j] \Big) \mu_0(f) = \mu_0(f). \]

This proves the first part. To prove the second and third parts, first note that by the tower property of conditional expectation and the monotone convergence theorem we get $\rho = \sum_{j \ge 1} E[(w_j)^2]$ and

\[ 1 = E\Big[ \Big( \sum_{j \ge 0} w_j \Big)^2 \Big] = \sum_{j \ge 1} E[w_j^2] + \sum_{i \ne j} E[w_i w_j] + E[w_0^2]. \]

Thus, $1 - \rho = \sum_{i \ne j} E[w_i w_j] + E[w_0^2]$, where $\sum_{i \ne j} a_i a_j$ denotes $\sum_{i \ge 0} \sum_{j \ge 0} a_i a_j 1_{\{i \ne j\}}$. Secondly, since f and g are bounded and µ is a random probability measure we have that $\mu(|f|), \mu(|g|) < \infty$. Then,

\[ \begin{aligned} \mu(f)\mu(g) &= \Big( \sum_{j \ge 1} w_j f(\xi_j) + w_0 \mu_0(f) \Big) \Big( \sum_{j \ge 1} w_j g(\xi_j) + w_0 \mu_0(g) \Big) \\ &= \sum_{j \ge 1} w_j^2 f(\xi_j) g(\xi_j) + \sum_{\substack{i, j \ge 1 \\ i \ne j}} w_i w_j f(\xi_i) g(\xi_j) + \Big( \sum_{j \ge 1} w_0 w_j g(\xi_j) \Big) \mu_0(f) \\ &\quad + \Big( \sum_{j \ge 1} w_0 w_j f(\xi_j) \Big) \mu_0(g) + w_0^2\, \mu_0(f)\mu_0(g). \end{aligned} \]

Now, if M is a bound for f, and N is a bound for g, we have that for every $n \ge 1$, $|\sum_{j=1}^{n} w_j w_0 f(\xi_j)| \le M$, $|\sum_{j=1}^{n} w_j w_0 g(\xi_j)| \le N$, $|\sum_{j=1}^{n} w_j^2 f(\xi_j) g(\xi_j)| \le MN$, and $|\sum_{i=1}^{n} \sum_{j=1}^{n} w_i w_j f(\xi_j) g(\xi_i) 1_{\{i \ne j\}}| \le MN$. Thus, by linearity of the expectation, the Lebesgue dominated convergence theorem, and as $\{\xi_j \overset{iid}{\sim} \mu_0;\ j \ge 1\}$ independently of the weights, we obtain

\[ \begin{aligned} E[\mu(f)\mu(g)] &= \sum_{j \ge 1} E[w_j^2]\, E[f(\xi_j) g(\xi_j)] + \sum_{\substack{i, j \ge 1 \\ i \ne j}} E[w_i w_j]\, E[f(\xi_i)]\, E[g(\xi_j)] + E[w_0^2]\, \mu_0(f)\mu_0(g) \\ &\quad + \Big( \sum_{j \ge 1} E[w_0 w_j]\, E[g(\xi_j)] \Big) \mu_0(f) + \Big( \sum_{j \ge 1} E[w_0 w_j]\, E[f(\xi_j)] \Big) \mu_0(g) \\ &= \sum_{j \ge 1} E[w_j^2]\, \mu_0(fg) + \sum_{i \ne j} E[w_i w_j]\, \mu_0(f)\mu_0(g) + E[w_0^2]\, \mu_0(f)\mu_0(g) \\ &= \rho\, \mu_0(fg) + (1 - \rho)\mu_0(f)\mu_0(g). \end{aligned} \]

This proves the third part of the lemma, and the choice g = f gives the second part.

In the context of Lemma C.4, the particular choices $f = 1_A$ and $g = 1_B$ for some $A, B \in \mathcal{B}_S$ imply $E[\mu(A)] = \mu_0(A)$, $\mathrm{Var}(\mu(A)) = \rho\, \mu_0(A)(1 - \mu_0(A))$ and $\mathrm{Cov}(\mu(A), \mu(B)) = \rho\,(\mu_0(A \cap B) - \mu_0(A)\mu_0(B))$.

Theorem C.5. Consider a Polish space S with Borel σ-algebra $\mathcal{B}_S$. Let $\mu_0, \mu_0^{(1)}, \mu_0^{(2)}, \ldots$ be diffuse probability measures over $(S, \mathcal{B}_S)$, such that $\mu_0^{(n)}$ converges weakly to $\mu_0$ as $n \to \infty$. For $n \ge 1$ let $\rho^{(n)} \in (0, 1)$, and consider $\{x_i^{(n)} \mid \mu^{(n)} \overset{iid}{\sim} \mu^{(n)};\ i \ge 1\}$ where $\mu^{(n)}$ is a species sampling process with base measure $\mu_0^{(n)}$ and tie probability $\rho^{(n)}$.

i) If $\rho^{(n)} \to 0$, as $n \to \infty$, then $\mu^{(n)}$ converges weakly in distribution to $\mu_0$, and $(x_i^{(n)})_{i \ge 1}$ converges in distribution to a sequence of i.i.d. random variables $\{x_i \overset{iid}{\sim} \mu_0;\ i \ge 1\}$.

ii) If $\rho^{(n)} \to 1$, as $n \to \infty$, then $\mu^{(n)}$ converges weakly in distribution to $\delta_x$, where $x \sim \mu_0$, and $(x_i^{(n)})_{i \ge 1}$ converges in distribution to the sequence of identical random variables $(x, x, \ldots)$.

Proof. We may assume without loss of generality that all the species sampling processes are defined on the same probability space. First we prove (i). Let $f : S \to \mathbb{R}$ be a continuous and bounded function. Since f is continuous it is also measurable, and by Lemma C.4 we have that

\[ \begin{aligned} E\big[ \{\mu^{(n)}(f) - \mu_0(f)\}^2 \big] &= E\big[ \{\mu^{(n)}(f)\}^2 \big] - 2 E\big[ \mu^{(n)}(f) \big] \mu_0(f) + \{\mu_0(f)\}^2 \\ &= \rho^{(n)} \mu_0^{(n)}(f^2) + \big( 1 - \rho^{(n)} \big) \{\mu_0^{(n)}(f)\}^2 - 2 \mu_0^{(n)}(f) \mu_0(f) + \{\mu_0(f)\}^2. \end{aligned} \tag{A3} \]

By hypothesis we know that $\mu_0^{(n)} \xrightarrow{w} \mu_0$ and $\rho^{(n)} \to 0$, as $n \to \infty$. By taking limits in (A3), we find that

\[ E\big[ \{\mu^{(n)}(f) - \mu_0(f)\}^2 \big] \to 0, \]

as $n \to \infty$. That is, $\mu^{(n)}(f)$ converges to $\mu_0(f)$ in $L_2$, which implies $\mu^{(n)}(f) \xrightarrow{d} \mu_0(f)$. Since f was chosen arbitrarily, this proves (i) for the species sampling processes. Given that S and $\mathcal{P}(S)$ are Polish, we might construct on some probability space $(\hat{\Omega}, \hat{\mathcal{F}}, \hat{P})$ some exchangeable sequences $\big( \hat{X}^{(n)} = (\hat{x}_i^{(n)})_{i \ge 1} \big)_{n \ge 1}$, such that $\hat{X}^{(n)}$ is directed by a species sampling process, $\hat{\mu}^{(n)}$, with base measure $\mu_0^{(n)}$ and tie probability $\rho^{(n)}$, and where $\hat{\mu}^{(n)}$ converges weakly almost surely to $\mu_0$, as $n \to \infty$. Fix $m \ge 1$ and $B_1, \ldots, B_m \in \mathcal{B}_S$. Since $\mu_0$ is diffuse, $\mu_0(\partial B_i) = 0$, and by the Portmanteau theorem we know $\hat{\mu}^{(n)}(B_i) \to \mu_0(B_i)$ almost surely as $n \to \infty$. This together with the representation theorem for exchangeable sequences imply

\[ \hat{P}\Big[ \bigcap_{i=1}^{m} \big( \hat{x}_i^{(n)} \in B_i \big) \,\Big|\, \hat{\mu}^{(n)} \Big] = \prod_{i=1}^{m} \hat{\mu}^{(n)}(B_i) \to \prod_{i=1}^{m} \mu_0(B_i), \]

almost surely, as $n \to \infty$, and by taking expectations we obtain

\[ \hat{P}\Big[ \bigcap_{i=1}^{m} \big( \hat{x}_i^{(n)} \in B_i \big) \Big] \to \hat{E}\Big[ \prod_{i=1}^{m} \mu_0(B_i) \Big] = \prod_{i=1}^{m} \mu_0(B_i) = \hat{P}\Big[ \bigcap_{i=1}^{m} (\hat{x}_i \in B_i) \Big], \]

where $\hat{X} = (\hat{x}_i)_{i \ge 1}$, with $\{\hat{x}_i \overset{iid}{\sim} \mu_0;\ i \ge 1\}$. Thus $(x_i^{(n)})_{i \ge 1} \overset{d}{=} \hat{X}^{(n)} \xrightarrow{d} \hat{X}$ as $n \to \infty$, and we have proven (i).

To prove (ii) let $w_1^{(n)} \ge w_2^{(n)} \ge \cdots$ be the decreasingly ordered weights of $\mu^{(n)}$ and let $\xi_j^{(n)}$ be the atom corresponding to $w_j^{(n)}$. Let us denote $w_0^{(n)} = 1 - \sum_{j \ge 1} w_j^{(n)}$. Note that, using the monotone convergence theorem, we can write

\[ \rho^{(n)} = E\Big[ P\big[ x_1^{(n)} = x_2^{(n)} \,\big|\, \mu^{(n)} \big] \Big] = \sum_{j \ge 1} E\Big[ \big( w_j^{(n)} \big)^2 \Big] \tag{A4} \]

and by the proof of Lemma C.4 we also know

2 (n) (n) X h (n) (n)i 1 − ρ = E w0 + E wj wi (A5) i6=j

30 for n ≥ 1. Since the weights are decreasing, we must have that for every i ≥ j ≥ 2, h (n) (n)i h (n) (n) i E wi wj ≤ E wi wj−1 , hence

2 X h (n) (n)i X X h (n) (n)i X (n) E wi wj ≥ E wi wj ≥ E wj ≥ 0, (A6) i6=j j≥2 i≥j j≥2

2 P (n) for n ≥ 1. By taking limits, as n → ∞, by (A5) and (A6), we found j≥2 E wj → 0, 2 (n) (n) which together with (A4) proves that E w1 → 1. Since 0 ≤ w1 ≤ 1, and P h (n)i j≥0 E wj = 1, we obtain

h (n)i X h (n)i E w1 → 1 and E wj → 0, (A7) j6=1

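as $n \to \infty$. Seeing that all the corresponding spaces are Polish, and $\mu_0^{(n)} \overset{w}{\to} \mu_0$, we might construct on a probability space $(\hat{\Omega}, \hat{\mathcal{F}}, \hat{\mathbb{P}})$, some independent sequences, $\{\hat{\xi}_j^{(n)} \overset{iid}{\sim} \mu_0^{(n)};\ j \ge 1\}$, and $(\hat{w}_j^{(n)})_{j \ge 1} \overset{d}{=} (w_j^{(n)})_{j \ge 1}$, such that $\hat{\xi}_j^{(n)} \to \hat{\xi}_j \sim \mu_0$, almost surely, as $n \to \infty$, independently for $j \ge 1$. Define $\hat{\mu}^{(n)} = \sum_{j \ge 1} \hat{w}_j^{(n)} \delta_{\hat{\xi}_j^{(n)}} + \hat{w}_0^{(n)} \mu_0^{(n)}$, where $\hat{w}_0^{(n)} = 1 - \sum_{j \ge 1} \hat{w}_j^{(n)}$. Then for any continuous and bounded function, $f$, by Lemma C.4

\[
\begin{aligned}
\mathbb{E}\Big[\big\{\hat{\mu}^{(n)}(f) - \delta_{\hat{\xi}_1}(f)\big\}^2\Big]
&= \mathbb{E}\Big[\big\{\hat{\mu}^{(n)}(f)\big\}^2\Big] - 2\,\mathbb{E}\Big[\hat{\mu}^{(n)}(f)\, f\big(\hat{\xi}_1\big)\Big] + \mathbb{E}\Big[\big\{f\big(\hat{\xi}_1\big)\big\}^2\Big] \\
&= \rho^{(n)} \mu_0^{(n)}\big(f^2\big) + \big(1 - \rho^{(n)}\big) \big\{\mu_0^{(n)}(f)\big\}^2 - 2\,\mathbb{E}\Big[\hat{\mu}^{(n)}(f)\, f\big(\hat{\xi}_1\big)\Big] + \mu_0\big(f^2\big).
\end{aligned} \tag{A8}
\]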

As f is bounded, we can write

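\[
\begin{aligned}
\mathbb{E}\Big[\hat{w}_1^{(n)}\Big] \mathbb{E}\Big[f\big(\hat{\xi}_1^{(n)}\big) f\big(\hat{\xi}_1\big)\Big] - M^2 \sum_{j \ne 1} \mathbb{E}\Big[\hat{w}_j^{(n)}\Big]
&\le \mathbb{E}\Big[\hat{\mu}^{(n)}(f)\, f\big(\hat{\xi}_1\big)\Big] \\
&\le \mathbb{E}\Big[\hat{w}_1^{(n)}\Big] \mathbb{E}\Big[f\big(\hat{\xi}_1^{(n)}\big) f\big(\hat{\xi}_1\big)\Big] + M^2 \sum_{j \ne 1} \mathbb{E}\Big[\hat{w}_j^{(n)}\Big],
\end{aligned}
\]
where $M$ is a bound of $f$. By taking limits as $n \to \infty$ in the last equation and by (A7), we get

\[
\mathbb{E}\Big[\hat{\mu}^{(n)}(f)\, f\big(\hat{\xi}_1\big)\Big] \to \mathbb{E}\Big[\big\{f\big(\hat{\xi}_1\big)\big\}^2\Big] = \mu_0\big(f^2\big).
\]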

Hence, by making n → ∞ in (A8), we obtain

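\[
\mathbb{E}\Big[\big\{\hat{\mu}^{(n)}(f) - \delta_{\hat{\xi}_1}(f)\big\}^2\Big] \to 0.
\]

That is, $\hat{\mu}^{(n)}(f) \to \delta_{\hat{\xi}_1}(f)$ in $L_2$, which implies $\mu^{(n)}(f) \overset{d}{=} \hat{\mu}^{(n)}(f) \overset{d}{\to} \delta_{\hat{\xi}_1}(f)$. As this holds for every continuous and bounded function, $f$, we obtain $\mu^{(n)} \overset{dw}{\to} \delta_{\hat{\xi}_1}$, as $n \to \infty$. Set $\hat{x} = \hat{\xi}_1$. Under analogous arguments as in (i), we may assume without loss of generality that $\hat{\mu}^{(n)}$ converges weakly almost surely to $\delta_{\hat{x}}$ as $n \to \infty$, and consider exchangeable sequences $\big(\hat{X}^{(n)} = (\hat{x}_i^{(n)})_{i \ge 1}\big)_{n \ge 1}$, such that $\hat{X}^{(n)}$ is directed by $\hat{\mu}^{(n)}$. Fix $m \ge 1$ and $B_1, \dots, B_m \in \mathcal{B}_S$. The diffuseness of $\mu_0$ implies that $\hat{x} \notin \partial B_i$ almost surely, so that outside a $\hat{\mathbb{P}}$-null set, $\delta_{\hat{x}}(\partial B_i) = 0$, and using the Portmanteau theorem we obtain $\hat{\mu}^{(n)}(B_i) \to \delta_{\hat{x}}(B_i)$, almost surely, as $n \to \infty$. The representation theorem for exchangeable sequences assures

\[
\hat{\mathbb{P}}\Big[\bigcap_{i=1}^m \big(\hat{x}_i^{(n)} \in B_i\big) \,\Big|\, \hat{\mu}^{(n)}\Big] = \prod_{i=1}^m \hat{\mu}^{(n)}(B_i) \to \prod_{i=1}^m \delta_{\hat{x}}(B_i),
\]
almost surely, as $n \to \infty$, and by taking expectations we get

\[
\hat{\mathbb{P}}\Big[\bigcap_{i=1}^m \big(\hat{x}_i^{(n)} \in B_i\big)\Big] \to \hat{\mathbb{E}}\Big[\prod_{i=1}^m \delta_{\hat{x}}(B_i)\Big] = \hat{\mathbb{P}}\big[\hat{x} \in B_1, \dots, \hat{x} \in B_m\big].
\]

Hence $X^{(n)} \overset{d}{=} \hat{X}^{(n)} \overset{d}{\to} (\hat{x}, \hat{x}, \dots)$ as $n \to \infty$.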

The proof of Theorem C.5 appears in Ghosal and van der Vaart (2017) for the particular case of the Dirichlet process. Note that the elements of any exchangeable sequence are marginally identically distributed, so in terms of their mutual dependence the two extrema are the case where the random variables are i.i.d. and the case where they are identical. These are precisely the two limits of exchangeable sequences driven by a species sampling process, when the tie probability approaches zero or one, respectively.
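As a numerical illustration of (A4), the tie probability can be approximated by averaging $\sum_{j} w_j^2$ over simulated weight sequences. The Python sketch below does this for Dirichlet process weights, built from i.i.d. Be(1, θ) length variables, for which the tie probability is $1/(\theta + 1)$. The function name, the truncation at `n_sticks` sticks and the simulation sizes are illustrative assumptions and not part of the original analysis.

```python
import random


def tie_probability_mc(theta, n_sim, n_sticks, rng):
    """Monte Carlo estimate of E[sum_j w_j^2] for stick-breaking weights
    with i.i.d. Be(1, theta) length variables (hypothetical helper)."""
    total = 0.0
    for _ in range(n_sim):
        remaining, sum_sq = 1.0, 0.0
        for _ in range(n_sticks):
            # Inverse-CDF draw from Be(1, theta): F(x) = 1 - (1 - x)^theta.
            v = 1.0 - rng.random() ** (1.0 / theta)
            w = v * remaining          # w_j = v_j * prod_{i<j} (1 - v_i)
            sum_sq += w * w
            remaining *= 1.0 - v
        total += sum_sq
    return total / n_sim


if __name__ == "__main__":
    theta = 1.0
    estimate = tie_probability_mc(theta, 4000, 100, random.Random(1))
    print(estimate, 1.0 / (theta + 1.0))
```

Small values of θ push the estimate towards one (nearly identical observations), and large values push it towards zero (nearly i.i.d. observations), which are the two extrema discussed above.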

D Proof of main results

This section is dedicated to proving the main results of the article.

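D.1 Proof of Theorem 3.1

Proof. (i) Following the proof of Proposition 7 in Bissiri and Ongaro (2014), it suffices to show that for every $0 < \varepsilon' < 1$, there exists $0 < \delta < \varepsilon'$ such that $\mathbb{P}\big[\bigcap_{i=1}^n (\delta < v_i < \varepsilon')\big] > 0$, for every $n \ge 1$. Fix $0 < \varepsilon' < 1$ and consider $\varepsilon'' = \min\{\varepsilon, \varepsilon'\}$, where $\varepsilon > 0$ is such that $(0, \varepsilon)$ is contained in the support of $\nu_0$. Set $\delta = \varepsilon''/2$. By the representation theorem for exchangeable sequences, Jensen's inequality and the fact that $(\delta, \varepsilon'') \subseteq (0, \varepsilon)$ is contained in the support of $\nu_0$,

\[
\mathbb{P}\Big[\bigcap_{i=1}^n (\delta < v_i < \varepsilon'')\Big] = \mathbb{E}\Big[\prod_{i=1}^n \nu(\delta, \varepsilon'')\Big] = \mathbb{E}\big[\{\nu(\delta, \varepsilon'')\}^n\big] \ge \{\nu_0(\delta, \varepsilon'')\}^n > 0,
\]

for every $n \ge 1$. As $\varepsilon'' \le \varepsilon'$, we conclude $\mathbb{P}\big[\bigcap_{i=1}^n (\delta < v_i < \varepsilon')\big] \ge \mathbb{P}\big[\bigcap_{i=1}^n (\delta < v_i < \varepsilon'')\big] > 0$, for $n \ge 1$.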

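(ii) As explained in Ghosal and van der Vaart (2017), $1 - \sum_{i=1}^j w_i = \prod_{i=1}^j (1 - v_i)$, for every $j \ge 1$. From which it is evident that $\sum_{j \ge 1} w_j = 1$ if and only if $\prod_{i=1}^j (1 - v_i) \to 0$, as $j \to \infty$, almost surely. Since $0 \le \prod_{i=1}^j (1 - v_i) \le 1$, this is equivalent to $\mathbb{E}\big[\prod_{i=1}^j (1 - v_i)\big] \to 0$. As $(v_i)_{i \ge 1}$ is exchangeable and directed by $\nu$, we get $\mathbb{E}\big[\prod_{i=1}^j (1 - v_i)\big] = \mathbb{E}\big[(1 - \mathbb{E}[v_1 \mid \nu])^j\big]$. Now, if $\nu(\{0\}) < 1$ almost surely, then $\mathbb{P}[v_1 > 0 \mid \nu] > 0$, almost surely. Since $v_1$ is non-negative, this shows $\mathbb{E}[v_1 \mid \nu] > 0$ almost surely, hence $\mathbb{E}\big[\prod_{i=1}^j (1 - v_i)\big] = \mathbb{E}\big[(1 - \mathbb{E}[v_1 \mid \nu])^j\big] \to 0$ as $j \to \infty$. Alternatively, if $\mathbb{P}[\nu(\{0\}) = 1] > 0$, then for every $j \ge 1$,

\[
\mathbb{E}\big[(1 - \mathbb{E}[v_1 \mid \nu])^j\big] = \mathbb{E}\big[(1 - \mathbb{E}[v_1 \mid \nu])^j \mathbf{1}_{\{\nu(\{0\}) < 1\}}\big] + \mathbb{P}[\nu(\{0\}) = 1].
\]

Which implies

\[
\lim_{j \to \infty} \mathbb{E}\Big[\prod_{i=1}^j (1 - v_i)\Big] = \lim_{j \to \infty} \mathbb{E}\big[(1 - \mathbb{E}[v_1 \mid \nu])^j\big] \ge \mathbb{P}[\nu(\{0\}) = 1] > 0.
\]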

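The identity $1 - \sum_{i \le j} w_i = \prod_{i \le j} (1 - v_i)$ used in (ii) is easy to check numerically. The following Python sketch does so for exchangeable length variables drawn from a Pólya urn with total mass `beta` and base distribution Be(1, `theta`); this particular choice of directing measure, as well as the helper names, are assumptions made only for illustration.

```python
import random


def stick_breaking(lengths):
    """Return the weights w_i = v_i * prod_{l<i} (1 - v_l) together with
    the leftover mass prod_i (1 - v_i)."""
    weights, remaining = [], 1.0
    for v in lengths:
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining


def polya_urn_lengths(j, beta, theta, rng):
    """Draw v_1, ..., v_j from a Polya urn with mass beta and base Be(1, theta)."""
    values = []
    for i in range(j):
        if rng.random() < beta / (beta + i):
            # Fresh draw from Be(1, theta) by inversion.
            values.append(1.0 - rng.random() ** (1.0 / theta))
        else:
            # Repeat one of the previously drawn values, chosen uniformly.
            values.append(rng.choice(values))
    return values


if __name__ == "__main__":
    rng = random.Random(0)
    weights, leftover = stick_breaking(polya_urn_lengths(50, 2.0, 1.5, rng))
    print(1.0 - sum(weights), leftover)  # the two numbers agree up to rounding
```

When all length variables equal zero, which corresponds to $\nu = \delta_0$, the leftover mass stays at one and the weights never sum to one, in line with the second half of the argument.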
D.2 Proof of Theorem 3.3

Proof. (i) First we see that the sequences of length variables $V^{(n)} = (v_i^{(n)})_{i \ge 1}$ converge in distribution to $V = (v_i)_{i \ge 1}$. Since $[0, 1]$ and $\mathcal{P}([0, 1])$ are Polish, we might construct on some probability space $(\hat{\Omega}, \hat{\mathcal{F}}, \hat{\mathbb{P}})$ some exchangeable sequences $\big(\hat{V}^{(n)} = (\hat{v}_i^{(n)})_{i \ge 1}\big)_{n \ge 1}$, such that $\hat{V}^{(n)}$ is directed by a random probability measure $\hat{\nu}^{(n)} \overset{d}{=} \nu^{(n)}$, and where $\hat{\nu}^{(n)}$ converges weakly almost surely to $\nu_0 \ne \delta_0$, as $n \to \infty$. Fix $m \ge 1$ and $B_1, \dots, B_m \in \mathcal{B}_{[0,1]}$ such that $\nu_0(\partial B_i) = 0$, for every $i \le m$. By the Portmanteau theorem we know $\hat{\nu}^{(n)}(B_i) \to \nu_0(B_i)$ almost surely as $n \to \infty$. This together with the representation theorem for exchangeable sequences imply

\[
\hat{\mathbb{P}}\Big[\bigcap_{i=1}^m \big(\hat{v}_i^{(n)} \in B_i\big) \,\Big|\, \hat{\nu}^{(n)}\Big] = \prod_{i=1}^m \hat{\nu}^{(n)}(B_i) \to \prod_{i=1}^m \nu_0(B_i),
\]
almost surely, as $n \to \infty$, and by taking expectations we obtain

\[
\hat{\mathbb{P}}\Big[\bigcap_{i=1}^m \big(\hat{v}_i^{(n)} \in B_i\big)\Big] \to \prod_{i=1}^m \nu_0(B_i) = \mathbb{P}\Big[\bigcap_{i=1}^m (v_i \in B_i)\Big].
\]
Since each sequence of length variables is countable it is enough to prove the convergence of the finite dimensional distributions, and we get $V^{(n)} \overset{d}{=} \hat{V}^{(n)} \overset{d}{\to} V$ as desired. Note that the mapping

\[
(v_1, v_2, \dots, v_j) \mapsto \Big(v_1, v_2(1 - v_1), \dots, v_j \prod_{i=1}^{j-1} (1 - v_i)\Big)
\]
is continuous with respect to the product topology, thus the weights of $\mu^{(n)}$, $W^{(n)} = SB[V^{(n)}]$, converge in distribution to the weights of $\mu$, $W = SB[V]$. Further, the requirements on $\nu^{(n)}$ and $\nu_0$ assure $W^{(n)}$ and $W$ take values in the infinite dimensional simplex, $\Delta_\infty$. In addition, as the base measures $\mu_0^{(n)}$ converge weakly to $\mu_0$, we also have that the atoms of $\mu^{(n)}$, $\Xi^{(n)} = (\xi_j^{(n)})_{j \ge 1}$ (which are independent of $W^{(n)}$) converge in distribution to the atoms of $\mu$, $\Xi = (\xi_j)_{j \ge 1}$ (which are independent of $W$). Thus $(\Xi^{(n)}, W^{(n)}) \overset{d}{\to} (\Xi, W)$ in $S^\infty \times \Delta_\infty$, and Lemma B.2 yields that $\mu^{(n)}$ converges weakly in distribution to $\mu$. In particular, if $\nu_0$ denotes a Be(1, θ) distribution we get that $\mu$ is a Dirichlet process, and as shown by Pitman (1996a), $W$ is in size-biased order.

(ii) Analogously as in (i) we may construct sequences $\big(\hat{V}^{(n)} = (\hat{v}_i^{(n)})_{i \ge 1}\big)_{n \ge 1}$, such that $\hat{V}^{(n)}$ is directed by a random probability measure $\hat{\nu}^{(n)} \overset{d}{=} \nu^{(n)}$, and where $\hat{\nu}^{(n)}$ converges weakly almost surely to $\delta_{\hat{v}}$, with $\hat{v} \sim \nu_0$, as $n \to \infty$. Fix $m \ge 1$ and $B_1, \dots, B_m \in \mathcal{B}_{[0,1]}$ with $\nu_0(\partial B_i) = 0$, for every $i \le m$. Then we get that $\hat{v} \notin \partial B_i$ almost surely, so that outside a $\hat{\mathbb{P}}$-null set, $\delta_{\hat{v}}(\partial B_i) = 0$, and using the Portmanteau theorem we obtain $\hat{\nu}^{(n)}(B_i) \to \delta_{\hat{v}}(B_i)$, almost surely, as $n \to \infty$. The representation theorem for exchangeable sequences assures

\[
\hat{\mathbb{P}}\Big[\bigcap_{i=1}^m \big(\hat{v}_i^{(n)} \in B_i\big) \,\Big|\, \hat{\nu}^{(n)}\Big] = \prod_{i=1}^m \hat{\nu}^{(n)}(B_i) \to \prod_{i=1}^m \delta_{\hat{v}}(B_i),
\]
almost surely, as $n \to \infty$, and by taking expectations we get

\[
\hat{\mathbb{P}}\Big[\bigcap_{i=1}^m \big(\hat{v}_i^{(n)} \in B_i\big)\Big] \to \hat{\mathbb{E}}\Big[\prod_{i=1}^m \delta_{\hat{v}}(B_i)\Big] = \hat{\mathbb{P}}\big[\hat{v} \in B_1, \dots, \hat{v} \in B_m\big].
\]

Hence $V^{(n)} \overset{d}{=} \hat{V}^{(n)} \overset{d}{\to} (\hat{v}, \hat{v}, \dots) \overset{d}{=} (v, v, \dots)$ as $n \to \infty$. Note that in this case, $W = SB[(v, v, \dots)]$ is in decreasing order. The rest of the proof of (ii) follows identically as in (i).

D.3 Proof of Corollary 4.2

Proof. By Theorem C.5 we know that if $\rho_\nu^{(n)} \to 0$ then $\nu^{(n)}$ converges weakly in distribution to $\nu_0$. Alternatively, if $\rho_\nu^{(n)} \to 1$ we get that $\nu^{(n)}$ converges weakly in distribution to $\delta_v$, where $v \sim \nu_0$. The result then follows from Theorem 3.3.

D.4 Proof of Theorem 4.3

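Proof. Fix $j \ge 1$. As $\nu_0$ is diffuse, $(1 - v_j) > 0$ almost surely, for every $\rho_\nu \in (0, 1)$. Hence, $w_j \ge w_{j+1}$ if and only if $v_j \ge v_{j+1}(1 - v_j)$, or equivalently $v_{j+1} \le c(v_j)$ where $c(v) = 1 \wedge v(1 - v)^{-1}$.

Using the exchangeability of $(v_i)_{i \ge 1}$, we know that under the event $\{v_j \ne v_{j+1}\}$, which occurs with probability $1 - \rho_\nu$, the conditional distribution of $(v_j, v_{j+1})$ is that of $(v^*, v)$, where $v^*$ and $v$ are i.i.d. from $\nu_0$ (see for instance Theorem C.1). Hence we can easily compute

\[
\begin{aligned}
\mathbb{P}[w_j \ge w_{j+1}] &= \mathbb{P}[v_{j+1} \le c(v_j) \mid v_j = v_{j+1}]\, \rho_\nu + \mathbb{P}[v_{j+1} \le c(v_j) \mid v_j \ne v_{j+1}]\, (1 - \rho_\nu) \\
&= \rho_\nu + (1 - \rho_\nu)\, \mathbb{P}[v^* \le c(v)],
\end{aligned} \tag{A9}
\]
where the first conditional probability equals one because $v \le c(v)$ for every $v \in (0, 1)$.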

Noting that $\mathbb{P}[v^* \le c(v)] = \mathbb{E}[\overrightarrow{\nu_0}(c(v))]$, where $\overrightarrow{\nu_0}$ is the distribution function of $v \sim \nu_0$, we get (a). To prove (b) first we show that

\[
\mathbb{P}[w_j \ge w_{j+1} \mid w_1, \dots, w_j] = \mathbb{P}[v_{j+1} \le c(v_j) \mid v_1, \dots, v_j], \tag{A10}
\]
almost surely. It is straightforward from the stick-breaking construction that $(w_1, \dots, w_j)$ is $(v_1, \dots, v_j)$-measurable. Conversely, the proof of Theorem A.1 yields that $(v_1, \dots, v_j)$ is $(w_1, \dots, w_j)$-measurable as well, whenever $0 < v_i < 1$ almost surely for every $i \ge 1$. This is of course the case of ESBs, because the underlying base measure, $\nu_0$, is diffuse. Finally, as explained above, $w_j \ge w_{j+1}$ if and only if $c(v_j) \ge v_{j+1}$, which proves (A10). Now, since $(v_i)_{i \ge 1}$ is sampled from a species sampling process, we already know, from Theorem C.1, how to compute

\[
\mathbb{P}[v_{j+1} \le c(v_j) \mid v_1, \dots, v_j] = \sum_{i=1}^{K_j} \frac{\pi_\nu\big(n^{(i)}\big)}{\pi_\nu(n)} \mathbf{1}_{\{v_i^* \le c(v_j)\}} + \frac{\pi_\nu\big(n^{(K_j + 1)}\big)}{\pi_\nu(n)}\, \overrightarrow{\nu_0}(c(v_j)),
\]
which finishes the proof.

D.5 Proof of Corollary 5.2

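Proof. For $v \sim \mathrm{Be}(1, \theta)$, its distribution function is given by $\overrightarrow{\nu_0}(x) = 1 - (1 - x)^\theta$, hence by substituting the tie probability $\rho_\nu = 1/(\beta + 1)$ in Theorem 4.3, we obtain
\[
\mathbb{P}[w_j \ge w_{j+1}] = 1 - \frac{\beta}{\beta + 1}\, \mathbb{E}\big[(1 - c(v))^\theta\big], \tag{A11}
\]
where $c(v) = 1 \wedge v(1 - v)^{-1}$. Since $v \sim \mathrm{Be}(1, \theta)$, we get

\[
\mathbb{E}\big[(1 - c(v))^\theta\big] = \theta \int_0^{1/2} \Big(1 - \frac{x}{1 - x}\Big)^\theta (1 - x)^{\theta - 1}\, dx = \theta \int_0^{1/2} \frac{(1 - 2x)^\theta}{(1 - x)}\, dx,
\]
and by the change of variables $y = 2x$,
\[
\mathbb{E}\big[(1 - c(v))^\theta\big] = \frac{\theta}{2} \int_0^1 \frac{(1 - y)^\theta}{(1 - y/2)}\, dy = \frac{{}_2F_1(1, 1; \theta + 2; 1/2)\, \theta}{2(\theta + 1)}.
\]
Substituting this quantity into (A11) yields (a). For the proof of (b) we simply have to recall that if $(v_i)_{i \ge 1}$ are exchangeable and driven by a Dirichlet process, $\nu$, with total mass parameter $\beta$ and base measure $\nu_0 = \mathrm{Be}(1, \theta)$, then
\[
\mathbb{P}[v_{j+1} \in \cdot \mid v_1, \dots, v_j] = \frac{1}{\beta + j} \Big\{ \sum_{i=1}^{K_j} n_i \delta_{v_i^*} + \beta \nu_0 \Big\},
\]
where $v_1^*, \dots, v_{K_j}^*$ are the distinct values that $\{v_1, \dots, v_j\}$ exhibits and $n_i = |\{l \le j : v_l = v_i^*\}|$, for every $i \le K_j$. This implies
\[
\mathbb{P}[v_{j+1} \le c(v_j) \mid v_1, \dots, v_j] = \frac{1}{\beta + j} \Big\{ \sum_{i=1}^{K_j} n_i \mathbf{1}_{\{v_i^* \le c(v_j)\}} + \beta\, \overrightarrow{\nu_0}(c(v_j)) \Big\},
\]

and recalling that $\overrightarrow{\nu_0}(x) = 1 - (1 - x)^\theta$ the proof of (b) follows.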
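The closed form in (a) can be checked against simulation. The Python sketch below evaluates ${}_2F_1(1, 1; \theta + 2; 1/2)$ through its power series and compares the resulting value of $\mathbb{P}[w_1 \ge w_2]$ with a Monte Carlo estimate in which $v_2 = v_1$ with probability $1/(\beta + 1)$ and is a fresh Be(1, θ) draw otherwise. The helper names, the series truncation and the simulation size are assumptions made for illustration only.

```python
import math
import random


def hyp2f1_1_1(c, z, terms=200):
    """Power series for 2F1(1, 1; c; z) = sum_n n! z^n / (c)_n, for |z| < 1."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= (n + 1) * z / (c + n)  # ratio of consecutive series terms
    return total


def prob_decreasing_exact(theta, beta):
    """P[w_j >= w_{j+1}] from (A11) with the hypergeometric expression."""
    expectation = theta * hyp2f1_1_1(theta + 2.0, 0.5) / (2.0 * (theta + 1.0))
    return 1.0 - beta / (beta + 1.0) * expectation


def prob_decreasing_mc(theta, beta, n_sim, rng):
    """Monte Carlo estimate of P[w_1 >= w_2] = P[v_1 >= v_2 (1 - v_1)]."""
    hits = 0
    for _ in range(n_sim):
        v1 = 1.0 - rng.random() ** (1.0 / theta)
        tie = rng.random() < 1.0 / (beta + 1.0)
        v2 = v1 if tie else 1.0 - rng.random() ** (1.0 / theta)
        hits += v1 >= v2 * (1.0 - v1)
    return hits / n_sim


if __name__ == "__main__":
    print(prob_decreasing_exact(1.0, 1.0))
    print(prob_decreasing_mc(1.0, 1.0, 100000, random.Random(7)))
```

For $\theta = 1$ the expectation reduces to $1 - \log 2$, so with $\beta = 1$ the exact value is $(1 + \log 2)/2$, which gives a quick sanity check of the series.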

E MCMC implementation for density estimation

Say we model elements in $\{y_1, \dots, y_n\}$ as i.i.d. sampled from the ESB mixture, $\Phi = \sum_{j \ge 1} w_j G(\cdot \mid \xi_j)$, where $G(\cdot \mid s)$ has a density, denoted by the same letter, with respect to a suitable measure, and $\mu = \sum_{j \ge 1} w_j \delta_{\xi_j}$ defines an ESB with exchangeable length variables $V = (v_i)_{i \ge 1}$ driven by a species sampling process $\nu$. Let us denote $W = (w_j)_{j \ge 1} = SB[V]$ and $\Xi = (\xi_j)_{j \ge 1}$. We will also use the notation $p(z)$ and $p(z \mid \gamma)$ to refer to the marginal density (or probability mass function) of $z$, and the conditional density of $z$ given $\gamma$, respectively. Consider the random density
\[
\Phi(y) = p(y \mid W, \Xi) = \sum_{j \ge 1} w_j G(y \mid \xi_j).
\]
For MCMC implementation purposes, and following Walker (2007), this random density can be augmented as
\[
p(y, u \mid W, \Xi) = \sum_{j \ge 1} \mathbf{1}_{\{u < w_j\}} G(y \mid \xi_j),
\]
so that, given $u$, only the components with indices in the finite set $A_u(W) = \{j : u < w_j\}$ contribute, that is

\[
p(y \mid u, W, \Xi) = \frac{1}{|A_u(W)|} \sum_{j \in A_u(W)} G(y \mid \xi_j). \tag{A12}
\]

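This translates the problem of dealing with a mixture that has an infinite number of components into working with one that features a random number of components. From a numerical perspective, the latter is much more manageable. Moreover, considering the latent allocation variable $d$, i.e. $d = j$ iff $y$ is sampled from $G(\cdot \mid \xi_j)$, one can further consider the augmented joint density,
\[
p(y, u, d \mid W, \Xi) = \mathbf{1}_{\{u < w_d\}} G(y \mid \xi_d),
\]
so that, for a sample of size $n$, the complete data likelihood is

\[
L_{\Xi, W}\big((y_k, u_k, d_k)_{k=1}^n\big) = \prod_{k=1}^n \mathbf{1}_{\{u_k < w_{d_k}\}} G(y_k \mid \xi_{d_k}).
\]

The full conditionals required by the Gibbs sampler are then updated as follows.

1. Updating the slice variables, $U = (u_k)_{k=1}^n$:

\[
p(u_k \mid \dots) \propto \mathbf{1}_{\{u_k < w_{d_k}\}}.
\]

Hence, to update $u_k$, we simply sample it from a Unif$(0, w_{d_k})$ distribution.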

2. Updating the kernel parameters, $\Xi = (\xi_j)_{j \ge 1}$:

Under the assumption that the base measure, $\mu_0$, has a density, denoted by the same symbol, with respect to a suitable measure, we get
\[
p(\xi_j \mid \dots) \propto \mu_0(\xi_j) \prod_{k \in D_j} G(y_k \mid \xi_j),
\]

where $D_j = \{k : d_k = j\}$. If $\mu_0$ and $G$ form a conjugate pair, the above is easy to sample from.

3. Updating latent allocation variables, $D = (d_k)_{k=1}^n$:

\[
p(d_k = j \mid \dots) \propto G(y_k \mid \xi_j) \mathbf{1}_{\{u_k < w_j\}}.
\]

4. Updating the length variables, V = (vi)i≥1:

Consistently with the notation of previous sections, let $\pi_\nu$ denote the EPPF corresponding to $\nu$. Let $V_{-j}$ denote the vector of (previously updated) $v_i$'s excluding $v_j$; by Remark E.1 below, $V_{-j}$ is finite at each iteration. Hence, exploiting the exchangeability of $V$, we get

\[
p(v_j \mid V_{-j}) = \sum_{i=1}^k \frac{\pi_\nu\big(n^{(i)}\big)}{\pi_\nu(n)} \delta_{v_i^*}(v_j) + \frac{\pi_\nu\big(n^{(k+1)}\big)}{\pi_\nu(n)} \nu_0(v_j), \tag{A14}
\]

where $v_1^*, \dots, v_k^*$ are the distinct values $V_{-j}$ exhibits, $n = (n_1, \dots, n_k)$, $n^{(i)} = (n_1, \dots, n_{i-1}, n_i + 1, n_{i+1}, \dots, n_k)$ and $n^{(k+1)} = (n_1, \dots, n_k, 1)$, with $n_l = |\{v_i \in V_{-j} : v_i = v_l^*\}|$. This yields that the full conditional of $v_j$ is

\[
p(v_j \mid \dots) \propto \Big[\prod_{k=1}^m \mathbf{1}_{\{u_k < w_{d_k}\}}\Big] \Big[\sum_{i=1}^k \pi_\nu\big(n^{(i)}\big) \delta_{v_i^*}(v_j) + \pi_\nu\big(n^{(k+1)}\big) \nu_0(v_j)\Big].
\]

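After some algebra and recalling $w_{d_k} = v_{d_k} \prod_{i=1}^{d_k - 1} (1 - v_i)$, it can be seen that

\[
\prod_{k=1}^m \mathbf{1}_{\{u_k < w_{d_k}\}} \propto \mathbf{1}_{\{a_j < v_j < b_j\}},
\]
as a function of $v_j$, where
\[
a_j = \max_{k \in A_j} \frac{u_k}{\prod_{i < j} (1 - v_i)}, \qquad b_j = 1 - \max_{k \in B_j} \frac{u_k}{v_{d_k} \prod_{i < d_k,\, i \ne j} (1 - v_i)},
\]
with $A_j = \{k : d_k = j\}$ and $B_j = \{k : d_k > j\}$, and using the convention that $a_j = 0$ if $A_j = \emptyset$ and $b_j = 1$ when $B_j = \emptyset$. Thus,

\[
p(v_j \mid \dots) \propto \sum_{i=1}^k \pi_\nu\big(n^{(i)}\big) \delta_{v_i^*}(v_j) \mathbf{1}_{\{a_j < v_i^* < b_j\}} + \pi_\nu\big(n^{(k+1)}\big) \nu_0(v_j) \mathbf{1}_{\{a_j < v_j < b_j\}}. \tag{A15}
\]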

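Sampling $v_j$ from (A15) simply means that with probability $q_i \propto \pi_\nu\big(n^{(i)}\big) \mathbf{1}_{\{a_j < v_i^* < b_j\}}$ we set $v_j = v_i^*$, for $i \le k$, and with probability

\[
q_0 \propto \pi_\nu\big(n^{(k+1)}\big) \int_{a_j}^{b_j} \nu_0(x)\, dx
\]
we sample $v_j$ from the density $\nu_0(v) \mathbf{1}_{\{a_j < v < b_j\}} \big/ \int_{a_j}^{b_j} \nu_0(x)\, dx$. For DSBs, where $\nu$ is a Dirichlet process with total mass parameter $\beta$ and base measure $\nu_0 = \mathrm{Be}(1, \theta)$, (A14) becomes

\[
p(v_j \mid V_{-j}) = \sum_{i=1}^k \frac{n_i}{|V_{-j}| + \beta} \delta_{v_i^*}(v_j) + \frac{\beta}{|V_{-j}| + \beta} \mathrm{Be}(v_j \mid 1, \theta).
\]
Hence, writing $C_j = \{l \le k : a_j < v_l^* < b_j\}$, we obtain
\[
q_i = \frac{n_i \mathbf{1}_{\{a_j < v_i^* < b_j\}}}{\sum_{l \in C_j} n_l + \beta \big[(1 - a_j)^\theta - (1 - b_j)^\theta\big]},
\]

\[
q_0 = \frac{\beta \big[(1 - a_j)^\theta - (1 - b_j)^\theta\big]}{\sum_{l \in C_j} n_l + \beta \big[(1 - a_j)^\theta - (1 - b_j)^\theta\big]},
\]
and
\[
\frac{\nu_0(v) \mathbf{1}_{\{a_j < v < b_j\}}}{\int_{a_j}^{b_j} \nu_0(x)\, dx} = \frac{\theta (1 - v)^{\theta - 1} \mathbf{1}_{\{a_j < v < b_j\}}}{(1 - a_j)^\theta - (1 - b_j)^\theta}.
\]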

Remark E.1 (For the updating of Ξ, V and X). It is not necessary to sample $v_j$ and $\xi_j$ for infinitely many indexes; it suffices to sample them for $j \le \varphi$, where $\sum_{j=1}^{\varphi} w_j \ge \max_k (1 - u_k)$. Then it is not possible that $w_j > u_k$ for any $k \le m$ and $j > \varphi$, and the updating of the latent allocation variables $d_k$'s can take place.

5. Updating the tie probability for DSBs:

When implementing certain ESB priors such as DSBs, it is of interest to estimate the underlying tie probability $\rho_\nu = \mathbb{P}[v_j = v_l] = 1/(\beta + 1)$. To do this we can assign it a prior distribution. Roughly speaking, this allows the model to choose between DSB mixtures that behave arbitrarily similar to a Dirichlet mixture, to a Geometric mixture, or to some other mixture in between. It is straightforward to check that the full conditionals of $\xi_j$, $d_i$, $u_i$ and $v_j$ remain as described above, conditionally given $\beta = (1 - \rho_\nu)/\rho_\nu$. In this case we are required to update $\rho_\nu$ at each iteration of the Gibbs sampler. Since $\rho_\nu$ only affects directly the length variables, it is easy to see that the full conditional distribution of this random variable is

p(ρν | ...) ∝ p(v1, v2,... | ρν )p(ρν ).

Given that at each iteration of the Gibbs sampler we only sample finitely many length variables, say $v_1, \dots, v_m$, we obtain
\[
p(\rho_\nu \mid \dots) \propto \pi_\nu(n_1, \dots, n_{K_m})\, p(\rho_\nu)
\]

(see Theorem C.1) where $K_m$ is the number of distinct values $\{v_1, \dots, v_m\}$ exhibits, $n_1, \dots, n_{K_m}$ are the frequencies of the distinct values and $\pi_\nu$ is the EPPF of the Dirichlet process. That is,
\[
p(\rho_\nu \mid \dots) \propto \frac{(1 - \rho_\nu)^{K_m - 1} \rho_\nu^{m - K_m}}{\prod_{l=0}^{m-2} (1 + l \rho_\nu)}\, p(\rho_\nu).
\]

Drawing samples from the full conditional of ρν is possible with the aid of a rejection sampling method such as Adaptive Rejection Metropolis Sampling (ARMS) (e.g. Gilks et al.; 1995).
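The updates above can be sketched in a few lines of Python. The block below is only a minimal illustration under several assumptions: allocations are stored as 0-based indices, the weight sequence is already truncated to a finite list, the bounds `a` and `b` of the length-variable update are passed in precomputed, the fresh draw uses the Be(1, θ) mass $(1 - a)^\theta - (1 - b)^\theta$ of the interval $(a, b)$, and all function names are hypothetical. The last function only evaluates the logarithm of the full conditional of $\rho_\nu$ up to an additive constant; it can be handed to any univariate sampler, ARMS being the choice mentioned in the text.

```python
import math
import random


def stick_breaking(lengths):
    """Weights w_j = v_j * prod_{i<j} (1 - v_i) for a finite list of lengths."""
    weights, remaining = [], 1.0
    for v in lengths:
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


def update_slices(weights, alloc, rng):
    """Step 1: u_k ~ Unif(0, w_{d_k})."""
    return [rng.uniform(0.0, weights[d]) for d in alloc]


def truncation_level(weights, slices):
    """Remark E.1: smallest phi with w_1 + ... + w_phi >= max_k (1 - u_k)."""
    target, total = 1.0 - min(slices), 0.0
    for phi, w in enumerate(weights, start=1):
        total += w
        if total >= target:
            return phi
    raise ValueError("weight sequence too short; sample more length variables")


def update_allocations(data, slices, weights, atoms, kernel, rng):
    """Step 3: P[d_k = j | ...] proportional to G(y_k | xi_j) 1{u_k < w_j}."""
    alloc = []
    for y_k, u_k in zip(data, slices):
        active = [j for j, w in enumerate(weights) if u_k < w]
        probs = [kernel(y_k, atoms[j]) for j in active]
        alloc.append(rng.choices(active, weights=probs)[0])
    return alloc


def sample_length_dsb(distinct, counts, a, b, beta, theta, rng):
    """Step 4 for a DSB: reuse an admissible distinct value v_i* in (a, b)
    with mass n_i, or draw a fresh value from Be(1, theta) truncated to (a, b)."""
    new_mass = beta * ((1.0 - a) ** theta - (1.0 - b) ** theta)
    masses = [n if a < v < b else 0.0 for v, n in zip(distinct, counts)]
    masses.append(new_mass)
    i = rng.choices(range(len(masses)), weights=masses)[0]
    if i < len(distinct):
        return distinct[i]
    # Inverse-CDF draw: (1 - v)^theta is uniform on ((1 - b)^theta, (1 - a)^theta).
    s = rng.uniform((1.0 - b) ** theta, (1.0 - a) ** theta)
    return 1.0 - s ** (1.0 / theta)


def log_rho_conditional(rho, lengths, log_prior=lambda r: 0.0):
    """Step 5: log of (1-rho)^(K-1) rho^(m-K) / prod_{l=0}^{m-2} (1 + l rho) p(rho)."""
    if not 0.0 < rho < 1.0:
        return -math.inf
    m, k = len(lengths), len(set(lengths))
    return ((k - 1) * math.log1p(-rho) + (m - k) * math.log(rho)
            - sum(math.log1p(l * rho) for l in range(m - 1)) + log_prior(rho))
```

Keeping each step as a separate function mirrors the structure of the Gibbs sampler; the update of the kernel parameters is omitted because it depends on the chosen pair $(\mu_0, G)$.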

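Posterior estimation

Given the samples, $\big\{(\xi_j^{(m)})_j, (w_j^{(m)})_j, (u_k^{(m)})_k, (d_k^{(m)})_k\big\}_{m=1}^M$, obtained after $M$ iterations of the Gibbs sampler after the burn-in period has elapsed, we can estimate the density of the data at $y$, by means of the expected a posteriori (EAP),

\[
\Phi_{EAP}(y) = \mathbb{E}[\Phi(y) \mid y_1, \dots, y_n] \approx \frac{1}{M} \sum_{m=1}^M \frac{1}{n} \sum_{k=1}^n \frac{1}{\big|A_k^{(m)}\big|} \sum_{j \in A_k^{(m)}} G\big(y \mid \xi_j^{(m)}\big), \tag{A16}
\]

where $A_k^{(m)} = \big\{j : u_k^{(m)} < w_j^{(m)}\big\}$. We can estimate as well the posterior distribution of $K_n$ through
\[
\mathbb{P}[K_n = j \mid y_1, \dots, y_n] \approx \frac{1}{M} \sum_{m=1}^M \mathbf{1}_{\{K_n^{(m)} = j\}}, \tag{A17}
\]
where $K_n^{(m)}$ is the number of distinct values in $(d_k^{(m)})_k$. Alternatively, we can find
\[
\hat{m} = \arg\max_{1 \le m \le M} p\big((\xi_j^{(m)})_j, (w_j^{(m)})_j, (u_k^{(m)})_k, (d_k^{(m)})_k \mid y_1, \dots, y_n\big),
\]
and use the corresponding sample as the maximum a posteriori (MAP) estimate.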

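By means of the MAP we can also estimate the clusters of the data points, $\{y_1, \dots, y_n\}$, via the mixture components

\[
\Big\{ \Big( \frac{1}{n} \sum_{k=1}^n \big|A_k^{(\hat{m})}\big|^{-1} \mathbf{1}_{\{j \in A_k^{(\hat{m})}\}} \Big) G\big(\cdot \mid \xi_j^{(\hat{m})}\big) \Big\}_j,
\]

by defining
\[
c_k = \arg\max_j \Big\{ \Big( \frac{1}{n} \sum_{l=1}^n \big|A_l^{(\hat{m})}\big|^{-1} \mathbf{1}_{\{j \in A_l^{(\hat{m})}\}} \Big) G\big(y_k \mid \xi_j^{(\hat{m})}\big) \Big\},
\]
for every $k \in \{1, \dots, n\}$, and putting $y_i$ and $y_k$ in the same cluster if and only if $c_i = c_k$.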
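The EAP estimator (A16) and the cluster assignment above translate directly into code. The Python sketch below assumes a Gaussian kernel with atoms stored as hypothetical `(mean, sd)` pairs, finite lists of weights and atoms per retained iteration, and that every slice variable lies below at least one weight (which holds after the slice update); the function names are illustrative.

```python
import math


def normal_kernel(y, atom):
    """Gaussian kernel G(y | xi) with xi = (mean, sd); an illustrative choice."""
    mean, sd = atom
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def active_sets(weights, slices):
    """A_k = {j : u_k < w_j} for every slice variable u_k."""
    return [[j for j, w in enumerate(weights) if u < w] for u in slices]


def eap_density(y, draws, kernel=normal_kernel):
    """Estimator (A16); draws is a list of (weights, atoms, slices) tuples."""
    total = 0.0
    for weights, atoms, slices in draws:
        sets = active_sets(weights, slices)
        inner = sum(sum(kernel(y, atoms[j]) for j in a) / len(a) for a in sets)
        total += inner / len(sets)
    return total / len(draws)


def component_weights(weights, slices):
    """(1/n) sum_k |A_k|^{-1} 1{j in A_k} for every component j."""
    sets = active_sets(weights, slices)
    out = [0.0] * len(weights)
    for a in sets:
        for j in a:
            out[j] += 1.0 / (len(a) * len(sets))
    return out


def map_clusters(data, weights, atoms, slices, kernel=normal_kernel):
    """c_k = argmax_j of (component weight j) * G(y_k | xi_j) for one sample."""
    mix = component_weights(weights, slices)
    return [max(range(len(weights)), key=lambda j: mix[j] * kernel(y, atoms[j]))
            for y in data]
```

In `map_clusters` the arguments would be the weights, atoms and slice variables of the $\hat{m}$-th iteration; two observations share a cluster exactly when they receive the same label.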

F Supplemental graphs of Section 6.2

Figure A1: 3D view of the estimated densities using the MAP, taking into account 8000 iterations of the Gibbs sampler after a burn-in period of 2000 iterations, according to the Dirichlet prior (A) a DSB prior (B) and a Geometric prior (C). D shows the histogram of the data.

Figure A2: 3D view of the estimated densities using the EAP, taking into account each fourth iteration among 8000 iterations of the Gibbs sampler after a burn-in period of 2000, according to the Dirichlet prior (A) a DSB prior (B) and a Geometric prior (C). D shows the true density from which the data points were i.i.d. sampled. D shows the histogram of the data.

References

Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to bayesian nonparametric problems, Ann. Stat. 2: 1152–1174.

Billingsley, P. (1968). Convergence of Probability Measures, Wiley series in probability and statistics, John Wiley and Sons Inc.

Bissiri, P. and Ongaro, A. (2014). On the topological support of species sampling priors, Electron. J. Stat. 8(1): 861–882.

Blackwell, D. and MacQueen, J. (1973). Ferguson distributions via Pólya urn schemes, Ann. Stat. 1: 353–355.

De Blasi, P., Favaro, S., Lijoi, A., Mena, R., Prünster, I. and Ruggiero, M. (2015). Are Gibbs-type priors the most natural generalization of the Dirichlet process?, IEEE Trans. Pattern Anal. Mach. Intell. 37(2): 212–229.

De Blasi, P., Martínez, A. F., Mena, R. and Prünster, I. (2020). On the inferential implications of decreasing weights structures in mixture models, Comput. Stat. Data Anal. 147.

de Finetti, B. (1931). Funzione caratteristica di un fenomeno aleatorio, Atti della R. Academia Nazionale dei Lincei, Serie 6. Memorie, Classe di Scienze Fisiche, Mathematice e Naturale 4: 251–299.

Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures, J. Am. Stat. Assoc. 90(430): 577–588.

Ewens, W. (1972). The sampling theory of selectively neutral alleles, Theor. Popul. Biol. 3: 87–112.

Favaro, S., Lijoi, A., Nava, C., Nipoti, B., Prünster, I. and Teh, Y. W. (2016). On the stick-breaking representation for homogeneous NRMIs, Bayesian Anal. 11(3): 697–724.

Favaro, S., Lijoi, A. and Prünster, I. (2012). On the stick-breaking representation of normalized inverse Gaussian priors, Biometrika 99: 663–674.

Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems, Ann. Stat. 1(2): 209–230.

Fuentes-García, R., Mena, R. H. and Walker, S. G. (2010a). A new Bayesian nonparametric mixture model, Commun. Stat. Simul. Comput. 39(4): 669–682.

Fuentes-García, R., Mena, R. H. and Walker, S. G. (2010b). A probability for classification based on the Dirichlet process mixture model, Journal of Classification 27: 389–403.

Fuentes-García, R., Mena, R. H. and Walker, S. G. (2019). Modal posterior clustering motivated by Hopfield's network, Comput. Stat. Data Anal. 137: 92–100.

Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.

Gil-Leyva, M. F., Mena, R. H. and Nicoleris, T. (2020). Beta-Binomial stick-breaking non-parametric prior, Electron. J. Stat. 14: 1479–1507.

Gilks, W., Best, N. and Tan, K. (1995). Adaptive Rejection Metropolis Sampling, Applied Statistics 44: 455–472.

Hewitt, E. and Savage, L. (1955). Symmetric measures on Cartesian products, Trans. Am. Math. Soc. 80: 470–501.

Hjort, N., Holmes, C., Müller, P. and Walker, S. G. (2010). Bayesian Nonparametrics, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.

Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors, J. Am. Stat. Assoc. 96(453): 161–173.

James, L. F., Lijoi, A. and Prünster, I. (2009). Posterior analysis for normalized random measures with independent increments, J. R. Stat. Soc. Series B Stat. Methodol. 36(1): 76–97.

Jang, G. H., Lee, J. and Lee, S. (2010). Posterior consistency of species sampling priors, Stat. Sin. 20: 581–593.

Kallenberg, O. (2017). Random Measures, Theory and Applications, Vol. 77, first edn, Springer.

Kingman, J. F. C. (1975). Random discrete distributions (with discussion), J. R. Stat. Soc. Series B Stat. Methodol. 37(1): 1–22.

Lijoi, A., Mena, R. and Prünster, I. (2005). Hierarchical mixture modelling with normalized inverse Gaussian priors, J. Am. Stat. Assoc. 100(472): 1278–1291.

Lijoi, A., Mena, R. and Prünster, I. (2007). Bayesian non-parametric estimation on the probability of discovering a new species, Biometrika 94(4): 769–786.

Lijoi, A. and Prünster, I. (2009). Distributional properties of means of random probability measures, Stat. Surv. 3: 47–95.

Mena, R. and Walker, S. G. (2012). An eppf from independent sequences of geometric random variables, Statistics and Probability Letters 82: 1059–1066.

Parthasarathy, K. R. (1967). Probability measures on metric spaces., Academic press, New York.

Perman, M., Pitman, J. and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions, Probab. Theory Relat. Fields 92(1): 21–39.

Pitman, J. (1995). Exchangeable and partially exchangeable random partitions, Probab. Theory Relat. Fields 102: 145–158.

Pitman, J. (1996a). Random discrete distributions invariant under size-biased permutation, Adv. Appl. Probab. 28(2): 525–539.

Pitman, J. (1996b). Some developments of the Blackwell-MacQueen urn scheme, in T. F. et al. (ed.), Statistics, Probability and Game Theory; Papers in honor of David Blackwell, Vol. 30 of Lecture Notes-Monograph Series, Institute of Mathematical Statistics, Hayward, California, pp. 245–267.

Pitman, J. (2006). Combinatorial stochastic processes, Vol. 1875 of École d'été de probabilités de Saint-Flour, first edn, Springer-Verlag Berlin Heidelberg, New York.

Pitman, J. and Yor, M. (1992). Arcsine laws and interval partitions derived from a stable subordinator, Proceedings of the London Mathematical Society s3-65(2): 326–356.

Regazzini, E., Lijoi, A. and Prünster, I. (2003). Distributional results for means of normalized random measures with independent increments, Ann. Stat. 31(2): 560–585.

Rodríguez, A. and Dunson, D. B. (2011). Nonparametric Bayesian models through probit stick-breaking processes, Bayesian Anal. 6(1): 145–178.

Rodríguez, A. and Quintana, F. A. (2015). On species sampling sequences induced by residual allocation models, J. Stat. Plan. Inference 157–158: 108–120.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors, Stat. Sin. 4: 639–650.

Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices, Commun. Stat. Simul. Comput. 36(1): 45–54.

Yamato, H. (1984). Expectations of functions of samples from distributions chosen from dirichlet processes., Fac. Sci. Kagoshima Univ. Math. Phys. Chem. 17: 1–8.