
Scalable and Robust Bayesian Inference via the Median Posterior

Stanislav Minsker¹ [email protected]
Sanvesh Srivastava²,³ [email protected]
Lizhen Lin² [email protected]
David B. Dunson² [email protected]
¹Department of Mathematics and ²Department of Statistical Science, Duke University, Durham, NC 27708
³Statistical and Applied Mathematical Sciences Institute, 19 T.W. Alexander Dr., Research Triangle Park, NC 27709

Abstract

Many Bayesian learning methods for massive data benefit from working with small subsets of observations. In particular, significant progress has been made in scalable Bayesian learning via stochastic approximation. However, Bayesian learning methods in distributed computing environments are often problem- or distribution-specific and use ad hoc techniques. We propose a novel general approach to Bayesian inference that is scalable and robust to corruption in the data. Our technique is based on the idea of splitting the data into several non-overlapping subgroups, evaluating the posterior distribution given each independent subgroup, and then combining the results. Our main contribution is the proposed aggregation step, which is based on finding the geometric median of the subset posterior distributions. The presented theoretical and numerical results confirm the advantages of our approach.

1. Introduction

Massive data often require computer clusters for storage and processing. In such cases, each machine in the cluster can only access a subset of the data at a given point. Most learning algorithms designed for distributed computing share a common feature: they efficiently use the data subset available to a single machine and combine the "local" results for "global" learning, while minimizing communication among cluster machines (Smola & Narayanamurthy, 2010). A wide variety of optimization-based approaches are available for distributed learning (Boyd et al., 2011); however, the number of similar Bayesian methods is limited. One of the reasons for this limitation is related to Markov chain Monte Carlo (MCMC), one of the key techniques for approximating the posterior distribution of parameters in Bayesian models. While there are many efficient MCMC techniques for sampling from posterior distributions based on small subsets of the data (called subset posteriors in the sequel), there is no widely accepted and theoretically justified approach for combining the subset posteriors into a single distribution for improved performance. To this end, we propose a new general solution for this problem based on evaluation of the geometric median of a collection of subset posterior distributions. The resulting measure is called the M-posterior ("median posterior").

Modern approaches to scalable Bayesian learning in a distributed setting fall into three major categories. Methods in the first category independently evaluate the likelihood for each data subset across multiple machines and return the likelihoods to a "master" machine, where they are appropriately combined with the prior using conditional independence assumptions of the probabilistic model. These two steps are repeated at every MCMC iteration (Smola & Narayanamurthy, 2010; Agarwal & Duchi, 2012). This approach is problem-specific and involves extensive communication among machines. Methods from the second category use stochastic approximation (SA) and successively learn "noisy" approximations to the full posterior distribution using data in small mini-batches; the accuracy of SA increases as it uses more observations. One subgroup of this category uses sampling-based methods to explore the posterior distribution through modified Hamiltonian or Langevin dynamics (Welling & Teh, 2011; Ahn et al., 2012; Korattikara et al., 2013). Unfortunately, these methods fail to accommodate discrete-valued parameters and multimodality. The other subgroup uses deterministic variational approximations and learns the parameters of the approximated posterior through an optimization-based method (Hoffman et al., 2013; Broderick et al., 2013). Although these approaches often have excellent predictive performance, it is well known that they tend to substantially underestimate posterior uncertainty and lack theoretical guarantees.

Our approach instead falls into a third class of methods, which avoid extensive communication among machines by running independent MCMC chains for each data subset and obtaining draws from the subset posteriors. These subset posteriors can be combined in a variety of ways. Some of the methods simply average draws from each subset (Scott et al., 2013), others use an approximation to the full posterior distribution based on density estimators (Neiswanger et al., 2013) or the Weierstrass transform (Wang & Dunson, 2013). Unlike the method proposed in this work, none of the aforementioned algorithms are provably robust to the presence of outliers; moreover, they have limitations related to the dimension of the parameter.

We propose a different approximation to the full data posterior for each subset, with MCMC used to generate samples from these "noisy" subset posteriors in parallel. As a "de-noising step" that also induces robustness to outliers, we then calculate the geometric median of the subset posteriors, referred to as the M-posterior. By embedding the subset posteriors in a Reproducing Kernel Hilbert Space (RKHS), we facilitate the computation of distances, allowing Weiszfeld's algorithm to be used to approximate the geometric median (Beck & Sabach, 2013). The M-posterior admits strong theoretical guarantees, is provably resistant to the presence of outliers, efficiently uses all of the available observations, and is well-suited for distributed Bayesian learning. Our work was inspired by multivariate median-based techniques for robust estimation developed by Minsker (2013) and Hsu & Sabato (2013) (see also Alon et al., 1996; Lerasle & Oliveira, 2011; Nemirovski & Yudin, 1983, where similar ideas were applied in different frameworks).

2. Preliminaries

We first describe the notion of the geometric median and the method for calculating the distance between probability distributions via embedding them in a Hilbert space. This will be followed by the formal description of our method and the corresponding theoretical guarantees.

2.1. Notation

In what follows, ‖·‖₂ denotes the standard Euclidean distance in R^p and ⟨·,·⟩_{R^p} the associated dot product. Given a totally bounded metric space (Y, d), the packing number M(ε, Y, d) is the maximal number N such that there exist N disjoint d-balls B₁, ..., B_N of radius ε contained in Y, i.e., ∪_{j=1}^N B_j ⊆ Y. Given a metric space (Y, d) and y ∈ Y, δ_y denotes the Dirac measure concentrated at y; in other words, for any Borel-measurable B, δ_y(B) = I{y ∈ B}, where I{·} is the indicator function. Other objects and definitions are introduced in the course of the exposition when such necessity arises.

2.2. Geometric median

The goal of this section is to introduce the geometric median, a generalization of the univariate median to higher dimensions. Let µ be a Borel probability measure on a normed space (Y, ‖·‖). The geometric median x_* of µ is defined as

    x_* := argmin_{y ∈ Y} ∫_Y (‖y − x‖ − ‖x‖) µ(dx).

We use a special case of this definition and assume that µ is the uniform distribution on a collection of m atoms x₁, ..., x_m ∈ Y (which will later correspond to m subset posteriors identified with points in a certain space), so that

    x_* = med_g(x₁, ..., x_m) := argmin_{y ∈ Y} Σ_{j=1}^m ‖y − x_j‖.   (1)

The geometric median exists under rather general conditions, in particular when Y is a Hilbert space. Moreover, in this case x_* ∈ co(x₁, ..., x_m), the convex hull of x₁, ..., x_m (in other words, there exist nonnegative α_j, j = 1 ... m, with Σ_j α_j = 1, such that x_* = Σ_{j=1}^m α_j x_j). An important property of the geometric median is that it transforms a collection of independent and "weakly concentrated" estimators into a single estimator with significantly stronger concentration properties. Given q, α such that 0 < q < α < 1/2, define

    ψ(α, q) := (1 − α) log[(1 − α)/(1 − q)] + α log(α/q).   (2)

The following result follows from Theorem 3.1 of Minsker (2013).

Theorem 2.1. Assume that (H, ⟨·,·⟩) is a Hilbert space and θ₀ ∈ H. Let θ̂₁, ..., θ̂_m ∈ H be a collection of independent random H-valued elements. Let the constants α, q, ν be such that 0 < q < α < 1/2 and 0 ≤ ν < (α − q)/(1 − q). Suppose ε > 0 is such that for all j, 1 ≤ j ≤ ⌊(1 − ν)m⌋ + 1,

    Pr(‖θ̂_j − θ₀‖ > ε) ≤ q.   (3)

Let θ̂_* = med_g(θ̂₁, ..., θ̂_m) be the geometric median of {θ̂₁, ..., θ̂_m}. Then

    Pr(‖θ̂_* − θ₀‖ > C_α ε) ≤ e^{−m(1−ν) ψ((α−ν)/(1−ν), q)},

where C_α = (1 − α) √(1/(1 − 2α)).
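To make definition (1) and the effect described in Theorem 2.1 concrete, the following small sketch (our own illustration, not code from the paper) computes the geometric median of a collection of vectors with a plain Weiszfeld iteration: a single grossly corrupted "estimator" drags the coordinate-wise mean far from the truth but barely moves the geometric median.

import numpy as np

def geometric_median(points, tol=1e-8, max_iter=1000):
    """argmin_y sum_j ||y - x_j||_2 over the rows x_j of `points` (definition (1))."""
    y = points.mean(axis=0)                          # start from the coordinate-wise mean
    for _ in range(max_iter):
        dist = np.maximum(np.linalg.norm(points - y, axis=1), 1e-12)
        w = 1.0 / dist                               # Weiszfeld re-weighting
        y_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

rng = np.random.default_rng(1)
estimators = rng.normal(0.0, 0.1, size=(10, 3))      # m = 10 "weakly concentrated" estimators of 0
estimators[0] = 100.0                                # one estimator is corrupted arbitrarily
print(estimators.mean(axis=0))                       # the mean is dragged far from the truth
print(geometric_median(estimators))                  # the geometric median stays near the origin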

Theorem 2.1 implies that the concentration of the geometric median around the "true" parameter value improves geometrically fast with respect to the number m of independent estimators, while the estimation rate is preserved. The parameter ν allows one to take corrupted observations into account: if the data contain not more than ⌊νm⌋ outliers of arbitrary nature, then at most ⌊νm⌋ estimators among {θ̂₁, ..., θ̂_m} can be affected. The parameter α should be viewed as a fixed quantity and can be set to α = 1/3 for the rest of the paper.

2.3. RKHS and distances between probability measures

The goal of this section is to introduce a special family of distances between probability measures which provides the structure necessary to evaluate the geometric median in the space of measures. Since our goal is to develop computationally efficient techniques, we consider distances that admit accurate numerical approximation.

Assume that (X, ρ) is a separable metric space, and let F = {f : X → R} be a collection of real-valued functions. Given two Borel probability measures P, Q on X, define

    ‖P − Q‖_F := sup_{f ∈ F} |∫_X f(x) d(P − Q)(x)|.   (4)

An important special case arises when F is the unit ball in an RKHS (H, ⟨·,·⟩) with a reproducing kernel k : X × X → R, so that

    F = F_k := {f : X → R, f ∈ H, ‖f‖_H := √⟨f, f⟩ ≤ 1}.   (5)

(We say that k is a kernel if it is a symmetric, positive definite function; it is a reproducing kernel for H if, for any f ∈ H and x ∈ X, ⟨f, k(·, x)⟩ = f(x); see Aronszajn, 1950, for details.) Let P_k := {P : P is a probability measure and ∫_X √(k(x, x)) dP(x) < ∞}, and assume that P, Q ∈ P_k. It follows from Theorem 1 in (Sriperumbudur et al., 2010) that the corresponding distance between the measures P and Q takes the form

    ‖P − Q‖_{F_k} = ‖∫_X k(x, ·) d(P − Q)(x)‖_H.   (6)

Note that when P and Q are discrete measures (say, P = Σ_{j=1}^{N₁} β_j δ_{z_j} and Q = Σ_{j=1}^{N₂} γ_j δ_{y_j}), then

    ‖P − Q‖²_{F_k} = Σ_{i,j=1}^{N₁} β_i β_j k(z_i, z_j) + Σ_{i,j=1}^{N₂} γ_i γ_j k(y_i, y_j) − 2 Σ_{i=1}^{N₁} Σ_{j=1}^{N₂} β_i γ_j k(z_i, y_j).   (7)

The mapping P ↦ ∫_X k(x, ·) dP(x) is thus an embedding of P_k into the Hilbert space H, which can be seen as an application of the "kernel trick" in our setting. The Hilbert space structure allows us to use fast numerical methods to approximate the geometric median.

In this work, we will only consider characteristic kernels, which means that ‖P − Q‖_{F_k} = 0 if and only if P = Q. It follows from Theorem 7 in (Sriperumbudur et al., 2010) that a sufficient condition for k to be characteristic is strict positive definiteness: we say that k is strictly positive definite if it is measurable, bounded, and for all non-zero signed Borel measures µ, ∫∫_{X×X} k(x, y) dµ(x) dµ(y) > 0. When X = R^p, a simple sufficient criterion for the kernel k to be characteristic follows from Theorem 9 in (Sriperumbudur et al., 2010):

Proposition 2.2. Let X = R^p, p ≥ 1. Assume that k(x, y) = φ(x − y) for some bounded, continuous, integrable, positive-definite function φ : R^p → R.
1. Let φ̂ be the Fourier transform of φ. If |φ̂(x)| > 0 for all x ∈ R^p, then k is characteristic.
2. If φ is compactly supported, then k is characteristic.

Remark 2.3.
(a) It is important to mention that in practical applications we (almost) always deal with empirical measures based on a collection of independent samples from the posterior. A natural question is the following: if P and Q are probability distributions on R^p and P_m, Q_n are their empirical versions, what is the size of the error e_{m,n} := |‖P − Q‖_{F_k} − ‖P_m − Q_n‖_{F_k}|? A useful fact is that e_{m,n} often does not depend on p: under weak assumptions on k, e_{m,n} has an upper bound of order m^{−1/2} + n^{−1/2} (see Corollary 12 in Sriperumbudur et al., 2009).
(b) The choice of the kernel determines the "richness" of the space H and, hence, the relative strength of the induced norm ‖·‖_{F_k}. For example, the well-known family of Matérn kernels leads to Sobolev spaces (Rieger & Zwicknagl, 2009). Gaussian kernels often yield good results in applications.

Finally, we recall the definition of the well-known Hellinger distance. Assume that P and Q are probability measures on R^D which are absolutely continuous with respect to the Lebesgue measure with densities p and q respectively. Then

    h(P, Q) := ( (1/2) ∫_{R^D} (√p(x) − √q(x))² dx )^{1/2}

is the Hellinger distance between P and Q.
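For concreteness, formula (7) with equal weights β_i = 1/N₁ and γ_j = 1/N₂ reduces to averages of kernel evaluations. The following sketch (illustrative helper names, Gaussian kernel assumed) computes ‖P_m − Q_n‖²_{F_k} directly from two sets of samples.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows of x and y."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rkhs_distance_sq(z, y, sigma=1.0):
    """Equation (7) with equal weights beta_i = 1/N1 and gamma_j = 1/N2."""
    return (gaussian_kernel(z, z, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(z, y, sigma).mean())

rng = np.random.default_rng(0)
P_samples = rng.normal(0.0, 1.0, size=(500, 2))      # empirical measure P_m
Q_samples = rng.normal(0.5, 1.0, size=(500, 2))      # empirical measure Q_n of a shifted distribution
print(rkhs_distance_sq(P_samples, Q_samples))        # strictly positive since P != Q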

3. Contributions and main results

3.1. Construction of the "robust posterior distribution"

Let {P_θ, θ ∈ Θ} be a family of probability distributions over R^D indexed by Θ. Suppose that, for all θ ∈ Θ, P_θ has a Radon–Nikodym derivative p_θ(·) = dP_θ/dx (·) with respect to the Lebesgue measure on R^D. In what follows, we equip Θ with the "Hellinger metric"

    ρ(θ₁, θ₂) := h(P_{θ₁}, P_{θ₂}),   (8)

and assume that the metric space (Θ, ρ) is separable.

Let X₁, ..., X_n be i.i.d. R^D-valued random vectors defined on some probability space (Ω, B, Pr) with unknown distribution P₀ := P_{θ₀} for some θ₀ ∈ Θ. A usual way to "estimate" P₀ in Bayesian statistics consists in defining a prior distribution Π over Θ (equipped with the Borel σ-algebra induced by ρ), so that Π(Θ) = 1. The posterior distribution given the observations X_n := {X₁, ..., X_n} is a random probability measure on Θ defined by

    Π_n(B | X_n) := ∫_B ∏_{i=1}^n p_θ(X_i) dΠ(θ) / ∫_Θ ∏_{i=1}^n p_θ(X_i) dΠ(θ)

for all Borel measurable sets B ⊆ Θ. It is known (Ghosal et al., 2000) that under rather general assumptions the posterior distribution Π_n "contracts" towards θ₀, meaning that

    Π_n(θ ∈ Θ : ρ(θ, θ₀) ≥ ε_n | X_n) → 0

in probability as n → ∞ for a suitable sequence ε_n → 0. One of the questions that we address can be formulated as follows: what happens if some observations in X_n are corrupted, e.g., if X_n contains outliers of arbitrary nature and magnitude? In this case, the usual posterior distribution might concentrate "far" from the true value θ₀, depending on the amount of corruption in the sample. We show that it is possible to modify existing inference procedures via a simple and computationally efficient scheme that improves robustness of the underlying method.

We proceed with the general description of the proposed algorithm. Let 1 ≤ m ≤ n/2 be an integer, and divide the sample X_n into m disjoint groups G_j, j = 1 ... m, of size |G_j| ≥ ⌊n/m⌋ each: X_n = ∪_{j=1}^m G_j, G_i ∩ G_l = ∅ for i ≠ l. A typical choice of m is m ≃ log n, so that the groups G_j are sufficiently large (however, other choices are possible as well, depending on the concrete practical scenario).

Let Π be a prior distribution over Θ, and let {Π_n^{(j)}(·) := Π_n(· | G_j), j = 1 ... m} be the family of posterior distributions depending on the disjoint subgroups G_j, j = 1 ... m:

    Π_n(B | G_j) := ∫_B ∏_{i ∈ G_j} p_θ(X_i) dΠ(θ) / ∫_Θ ∏_{i ∈ G_j} p_θ(X_i) dΠ(θ).

Define the "median posterior" (or M-posterior) Π̂_{n,g} as

    Π̂_{n,g} := med_g(Π_n^{(1)}, ..., Π_n^{(m)}),   (9)

where the median med_g(·) is evaluated with respect to the norm ‖·‖_{F_k} introduced in (1) and (5). Note that Π̂_{n,g} is always a probability measure: due to the aforementioned properties of the geometric median, there exist α₁ ≥ 0, ..., α_m ≥ 0, Σ_{j=1}^m α_j = 1, such that Π̂_{n,g} = Σ_{j=1}^m α_j Π_n^{(j)}. In practice, small weights α_j are set to 0 for improved performance; see Algorithm 2 for details of the implementation.

While Π̂_{n,g} possesses several nice properties (such as robustness to outliers), in practice it often overestimates the uncertainty about θ₀, especially when the number of groups m is large. To overcome this difficulty, we suggest a modification of our approach where the random measures Π_n^{(j)} (subset posteriors) are replaced by the stochastic approximations Π_{n,m}(· | G_j), j = 1 ... m, of the full posterior distribution. To this end, define the stochastic approximation to the full posterior based on the subsample G_j as

    Π_{n,m}(B | G_j) := ∫_B ( ∏_{i ∈ G_j} p_θ(X_i) )^{⌊n/|G_j|⌋} dΠ(θ) / ∫_Θ ( ∏_{i ∈ G_j} p_θ(X_i) )^{⌊n/|G_j|⌋} dΠ(θ).   (10)

In other words, Π_{n,m}(· | G_j) is obtained as a posterior distribution given that each data point from G_j is observed ⌊n/|G_j|⌋ times. Similarly to Π̂_{n,g}, we set

    Π̂^st_{n,g} := med_g(Π_{n,m}(· | G₁), ..., Π_{n,m}(· | G_m)).   (11)

While each of the Π_{n,m}(· | G_j) might be "unstable", the geometric median Π̂^st_{n,g} of these random measures improves stability and yields smaller credible sets with good coverage properties. The practical performance of Π̂^st_{n,g} is often superior to that of Π̂_{n,g} in our experiments. In all numerical simulations below, we evaluate Π̂^st_{n,g} unless noted otherwise.
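As a toy illustration of the stochastic approximation (10) (our own example, not taken from the paper): for X_i | µ ~ N(µ, 1) with a flat prior, the subset posterior given G_j is N(x̄_j, 1/|G_j|), while raising the subset likelihood to the power ⌊n/|G_j|⌋ gives N(x̄_j, 1/(|G_j| ⌊n/|G_j|⌋)), whose spread matches that of the full-data posterior. A short sketch:

import numpy as np

rng = np.random.default_rng(2)
n, m = 1000, 10
data = rng.normal(0.0, 1.0, n)                       # X_i ~ N(mu, 1) with mu = 0
groups = np.array_split(rng.permutation(data), m)    # disjoint subgroups G_1, ..., G_m

subset_posteriors, stochastic_approximations = [], []
for g in groups:
    size, power = len(g), n // len(g)                # each point "observed" floor(n/|G_j|) times
    subset_posteriors.append((g.mean(), 1.0 / size))                     # N(mean, var) given G_j
    stochastic_approximations.append((g.mean(), 1.0 / (size * power)))   # equation (10)

# The stochastic approximations have variance close to 1/n (the full-posterior
# variance), while the plain subset posteriors are roughly m times more diffuse.
print(subset_posteriors[0], stochastic_approximations[0], 1.0 / n)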
3.2. Convergence of the posterior distribution and applications to robust Bayesian inference

Let k be a characteristic kernel defined on Θ × Θ; k defines a metric on Θ via

    ρ_k(θ₁, θ₂) := ‖k(·, θ₁) − k(·, θ₂)‖_H = ( k(θ₁, θ₁) + k(θ₂, θ₂) − 2k(θ₁, θ₂) )^{1/2},   (12)

where H is the RKHS associated to the kernel k. We will assume that (Θ, ρ_k) is separable.

Assumption 3.1. Let h(P_{θ₁}, P_{θ₂}) be the Hellinger distance between P_{θ₁} and P_{θ₂}. Assume there exist positive constants γ and C̃ such that, for all θ₁, θ₂ ∈ Θ,

    h(P_{θ₁}, P_{θ₂}) ≥ C̃ ρ_k^γ(θ₁, θ₂).

Example 3.2. Let {P_θ, θ ∈ Θ ⊆ R^p} be the exponential family

    dP_θ/dx (x) := p_θ(x) = exp( ⟨T(x), θ⟩_{R^p} − G(θ) + q(x) ),

where ⟨·,·⟩_{R^p} is the standard Euclidean dot product. Then the Hellinger distance can be expressed as h²(P_{θ₁}, P_{θ₂}) = 1 − exp( −(1/2)[ G(θ₁) + G(θ₂) − 2G((θ₁+θ₂)/2) ] ) (Nielsen & Garcia, 2011). If G(θ) is convex and its Hessian D²G(θ) satisfies D²G(θ) ⪰ A uniformly for all θ ∈ Θ and some symmetric positive definite operator A : R^p → R^p, then

    h²(P_{θ₁}, P_{θ₂}) ≥ 1 − exp( −(1/8)(θ₁ − θ₂)^T A (θ₁ − θ₂) ),

hence Assumption 3.1 holds with C̃ = 1/√2 and γ = 1 for the kernel

    k(θ₁, θ₂) := exp( −(1/8)(θ₁ − θ₂)^T A (θ₁ − θ₂) ).

In particular, it implies that for the family {P_θ = N(θ, Σ), θ ∈ R^D} with Σ ≻ 0 and the kernel

    k(θ₁, θ₂) := exp( −(1/8)(θ₁ − θ₂)^T Σ^{−1} (θ₁ − θ₂) ),

Assumption 3.1 holds with C̃ = 1/√2 and γ = 1 (moreover, it holds with equality rather than inequality).

Assume that Θ ⊂ R^p is compact, and let k(·,·) be a kernel defined on R^p × R^p. Suppose that k satisfies the conditions of Proposition 2.2 (in particular, k is characteristic), so that k(x, y) = φ(x − y). Recall that, by Bochner's theorem, there exists a finite nonnegative Borel measure v such that φ(θ) = ∫_{R^p} e^{i⟨x, θ⟩} dv(x).

Proposition 3.3. Assume that ∫_{R^p} ‖x‖²₂ dv(x) < ∞ and that, for all θ₁, θ₂ ∈ Θ and some γ > 0,

    h(P_{θ₁}, P_{θ₂}) ≥ c(Θ) ‖θ₁ − θ₂‖₂^γ.   (13)

Then Assumption 3.1 holds with γ as above and C̃ = C̃(k, c(Θ), γ).

Let δ₀ := δ_{θ₀} be the Dirac measure concentrated at θ₀ ∈ Θ corresponding to the "true" distribution P₀.

Theorem 3.4. Let X_l = {X₁, ..., X_l} be an i.i.d. sample from P₀. Assume that ε_l > 0 and Θ_l ⊂ Θ are such that, for a universal constant K > 0 and some constant C > 0,

1) log M(ε_l, Θ_l, ρ) ≤ l ε_l²,
2) Π(Θ \ Θ_l) ≤ exp( −l ε_l² (C + 4) ),
3) Π( θ : −P₀ log(p_θ/p₀) ≤ ε_l², P₀ log²(p_θ/p₀) ≤ ε_l² ) ≥ exp( −l ε_l² C ),
4) e^{−K l ε_l² / 2} ≤ ε_l.

Moreover, let Assumption 3.1 be satisfied. Then there exists a sufficiently large M = M(C, K, C̃) > 0 such that

    Pr( ‖δ₀ − Π_l(·|X_l)‖_{F_k} ≥ M ε_l^{1/γ} ) ≤ 1/(C l ε_l²) + e^{−K l ε_l² / 2}.   (14)

Note that the right-hand side of (14) may decay very slowly with l. This is where the properties of the geometric median become useful. Combining Theorems 3.4 and 2.1 yields the following inequality for Π̂_{n,g}, which is our main theoretical result.

Corollary 3.5. Let X₁, ..., X_n be an i.i.d. sample from P₀, and assume that Π̂_{n,g} is defined with respect to ‖·‖_{F_k} as in (9) above. Let l := ⌊n/m⌋. Assume that the conditions of Theorem 3.4 hold and, moreover, that ε_l is such that q := 1/(C l ε_l²) + 4 e^{−K l ε_l² / 2} < 1/2. Let α be such that q < α < 1/2. Then

    Pr( ‖δ₀ − Π̂_{n,g}‖_{F_k} ≥ C_α M ε_l^{1/γ} ) ≤ e^{−m ψ(α, q)},   (15)

where C_α = (1 − α) √(1/(1 − 2α)) and M is as in Theorem 3.4.

The case when the sample X_n contains ⌊νm⌋ outliers of arbitrary nature can be handled similarly; this more general bound is readily implied by Theorem 2.1.

For many parametric models (see Section 5 in Ghosal et al. (2000)), (15) holds with ε_l ≃ τ √(m log(n/m) / n) for τ small enough. If m ≃ log n (which is a typical scenario), then ε_l ≃ √(log² n / n). At the same time, if we use the "full posterior" distribution Π_n(·|X_n) (which corresponds to m = 1), the conclusion of Theorem 3.4 is that

    Pr( ‖δ₀ − Π_n(·|X_n)‖_{F_k} ≥ M (log n / n)^{1/(2γ)} ) ≲ (log n)^{−1},

while Corollary 3.5 yields a much stronger bound for Π̂_{n,g}:

    Pr( ‖δ₀ − Π̂_{n,g}‖_{F_k} ≥ C_α M (log² n / n)^{1/(2γ)} ) ≤ r_n^{log n}

for some sequence 0 < r_n < 1 with r_n → 0.

Theoretical guarantees for Π̂^st_{n,g}, the median of the "stochastic approximations" defined in (11), are very similar to the results of Corollary 3.5, but we omit exact formulations here due to space constraints.

4. Numerical Experiments

In this section, we describe a method based on Weiszfeld's algorithm for implementing the M-posterior, and compare the performance of the M-posterior with that of the usual posterior in tasks involving simulated and real data sets.

We start with a short remark discussing the improvement in computational time complexity achieved by the M-posterior.

Given the data set X_n of size n, let t(n) be the running time of the subroutine (e.g., MCMC) that outputs a single observation from the posterior distribution Π_n(·|X_n). Assuming that our goal is to obtain a sample of size N from the (usual) posterior, the total computational complexity is O(N · t(n)). We compare this with the running time needed to obtain a sample of the same size N from the M-posterior, given that the algorithm is running on m machines (m ≪ n) in parallel. In this case, we need to generate O(N/m) samples from each of the m subset posteriors, which is done in time O((N/m) · t(n/m)). According to Theorem 7.1 in (Beck & Sabach, 2013), Weiszfeld's algorithm approximates the M-posterior to degree of accuracy ε in at most O(1/ε) steps, and each of these steps has complexity O(N²) (which follows from (7)), so that the total running time is O((N/m) · t(n/m) + N²/ε). If, for example, t(n) ≃ n^r for some r ≥ 1, then (N/m) · t(n/m) ≃ N n^r / m^{1+r}, which should be compared to the N · n^r required by the standard approach.

Algorithm 1: Evaluating the geometric median of probability distributions via Weiszfeld's algorithm
Input: discrete measures Q₁, ..., Q_m; a kernel k(·,·) : R^p × R^p → R; a threshold ε > 0.
Initialize: set w_j^{(0)} := 1/m, j = 1 ... m, and Q_*^{(0)} := (1/m) Σ_{j=1}^m Q_j.
Repeat, starting from t = 0:
  1. For each j = 1, ..., m, update w_j^{(t+1)} = ‖Q_*^{(t)} − Q_j‖_{F_k}^{−1} / Σ_{i=1}^m ‖Q_*^{(t)} − Q_i‖_{F_k}^{−1} (apply (7) to evaluate ‖Q_*^{(t)} − Q_i‖_{F_k}).
  2. Update Q_*^{(t+1)} = Σ_{j=1}^m w_j^{(t+1)} Q_j.
Until ‖Q_*^{(t+1)} − Q_*^{(t)}‖_{F_k} ≤ ε.
Return: w_* := (w_1^{(t+1)}, ..., w_m^{(t+1)}).

Algorithm 2: Approximating the M-posterior distribution
Input: samples {Z_{j,i}}_{i=1}^{N_j} drawn i.i.d. from Π_{n,m}(·|G_j), j = 1 ... m (see equation (10)).
Do:
  1. Q_j := (1/N_j) Σ_{i=1}^{N_j} δ_{Z_{j,i}}, j = 1 ... m — empirical approximations of Π_{n,m}(·|G_j).
  2. Apply Algorithm 1 to Q₁, ..., Q_m; obtain w_* = (w_{*,1}, ..., w_{*,m}).
  3. For j = 1, ..., m, set w̄_j := w_{*,j} I{w_{*,j} ≥ 1/(2m)}; define ŵ_j^* := w̄_j / Σ_{i=1}^m w̄_i.
Return: Π̂^st_{n,g} := Σ_{i=1}^m ŵ_i^* Q_i.
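A minimal NumPy sketch of Algorithms 1 and 2 is given below (helper names are ours; the paper only provides the pseudocode above). Subset posteriors are represented by equally weighted samples, the norm ‖·‖_{F_k} is evaluated via (7) with a Gaussian kernel, and Weiszfeld's algorithm returns the mixture weights defining the M-posterior.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def weiszfeld_weights(subset_samples, sigma=1.0, tol=1e-6, max_iter=200):
    """Algorithm 1: weights w such that med_g(Q_1, ..., Q_m) = sum_j w_j Q_j."""
    m = len(subset_samples)
    # K[i, j] is the inner product of the kernel mean embeddings of Q_i and Q_j;
    # by (7) it is all we need to evaluate ||Q_* - Q_j||_{F_k} for mixtures Q_*.
    K = np.array([[gaussian_kernel(a, b, sigma).mean() for b in subset_samples]
                  for a in subset_samples])
    w = np.full(m, 1.0 / m)
    for _ in range(max_iter):
        # ||sum_i w_i Q_i - Q_j||_{F_k}^2 = w'Kw + K_jj - 2 (Kw)_j
        dist = np.sqrt(np.maximum(w @ K @ w + np.diag(K) - 2.0 * (K @ w), 1e-20))
        w_new = (1.0 / dist) / (1.0 / dist).sum()
        dw = w_new - w
        if np.sqrt(max(dw @ K @ dw, 0.0)) < tol:     # = ||Q_*^(t+1) - Q_*^(t)||_{F_k}
            return w_new
        w = w_new
    return w

def m_posterior_weights(subset_samples, sigma=1.0):
    """Algorithm 2, steps 2-3: run Weiszfeld, then zero out weights below 1/(2m)."""
    w = weiszfeld_weights(subset_samples, sigma)
    w = np.where(w >= 1.0 / (2 * len(subset_samples)), w, 0.0)
    return w / w.sum()

# Toy usage: ten subset posteriors for a location parameter, one of them corrupted.
rng = np.random.default_rng(0)
subsets = [rng.normal(0.0, 0.1, size=(100, 1)) for _ in range(10)]
subsets[0] = subsets[0] + 50.0                       # an outlying subset posterior
print(m_posterior_weights(subsets).round(3))         # the corrupted subset receives ~zero weight

The quadratic dependence on the number of posterior draws mentioned in the complexity remark appears here as the pairwise kernel evaluations inside K.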
4.1. Simulated data

The first example uses data from a Gaussian distribution with known variance and unknown mean, and demonstrates the effect of the magnitude of an outlier on the posterior distribution of the mean parameter. The second example demonstrates the robustness and scalability of nonparametric regression using the M-posterior in the presence of outliers.

4.1.1. Univariate Gaussian data

The goal of this example is to demonstrate the effect of a large outlier on the posterior distribution of the mean parameter µ. We generated 25 data sets containing 100 observations each. Every sample x_i = (x_{i,1}, ..., x_{i,100}) contains 99 independent observations from the standard Gaussian distribution (x_{i,j} ~ N(0, 1) for i = 1, ..., 25 and j = 1, ..., 99), while the last entry in each sample, x_{i,100}, is an outlier whose value increases linearly in i, namely, x_{i,100} = i · max(|x_{i,1}|, ..., |x_{i,99}|). The index of the outlier is unknown to the algorithm, while the variance of the observations is known. We use a flat (Jeffreys) prior on the mean µ and obtain its posterior distribution, which is also Gaussian with mean Σ_{j=1}^{100} x_{ij} / 100 and variance 1/100. We generate 1000 samples from each posterior distribution Π₁₀₀(·|x_i) for i = 1, ..., 25. Algorithm 2 generates 1000 samples from the M-posterior Π̂^st_{100,g}(·|x_i) for each i = 1, ..., 25: to this end, we set m = 10 and generate 100 samples from every Π_{100,10}(·|G_{j,i}), j = 1, ..., 10, to form the empirical measures Q_{j,i}; here, ∪_{j=1}^{10} G_{j,i} = x_i. We use consensus MCMC (Scott et al., 2013) as a representative scalable MCMC method and compare its performance with that of the M-posterior when the number of data subsets is fixed.

Figure 1 compares the performance of the "consensus posterior", the overall posterior and the M-posterior using the empirical coverage of (1 − α)100% credible intervals (CIs) calculated across 50 replications for α = 0.2, 0.15, 0.10, and 0.05. The empirical coverages of the M-posterior's CIs show robustness to the size of the outlier. On the contrary, the performance of the consensus and overall posteriors deteriorates fairly quickly across all α's, leading to 0% empirical coverage as the outlier strength increases from i = 1 to i = 25.

4.1.2. Gaussian process regression

We use the function f₀(x) = 1 + 3 sin(2πx − π) and simulate 90 (case 1) and 980 (case 2) values of f₀ at equidistant x's in [0, 1] (hereafter x_{1:90} and x_{1:980}), corrupted by Gaussian noise with mean 0 and variance 1. To demonstrate the robustness of the M-posterior in nonparametric regression, we added 10 (case 1) and 20 (case 2) outliers (sampled on uniform grids of the corresponding sizes) to the data sets, such that f₀(x_{91:100}) = 10 max(f₀(x_{1:90})) and f₀(x_{981:1000}) = 10 max(f₀(x_{1:980})).

Figure 1. Effect of the outlier on the empirical coverage of (1 − α)100% credible intervals (CIs) for the overall posterior, the consensus posterior and the M-posterior. The x-axis represents the outlier magnitude; the y-axis represents the fraction of times the CIs include the true mean over 50 replications. The panels show coverage results for α = 0.2, 0.15, 0.10, and 0.05; horizontal lines (in violet) show the theoretical coverage.

Figure 2. Performance of the M-posterior in Gaussian process (GP) regression. The top and bottom rows of panels show simulation results for the M-posterior (in blue) and GP regression (in red); panels are labelled by (number of outliers, total observations), i.e., (10, 100) and (20, 1000). The size of the data increases from column 1 to column 2. The true noiseless curve f₀(x) is in green. The shaded regions around the curves represent the 95% confidence bands obtained over 30 replicated data sets.

The gausspr function in the kernlab R package (Karatzoglou et al., 2004) is used for GP regression. Following the standard convention in GP regression, the noise variance (or "nugget effect") is fixed at 0.01. Using these settings, GP regression with the "standard" posterior (gausspr) obtains an estimator f̂₁ and a 95% confidence band for the values of the regression function at 100 equally spaced grid points y_{1:100} in [0, 1] (note that these locations are different from the observed data). Algorithm 2 performs GP regression with the M-posterior and obtains an estimator f̂₂ described below. The posterior draws across y_{1:100} are obtained in cases 1 and 2 as follows. First, the observations {(x_i, f_i)} are split into m = 10 and 20 subsets (each living on its own uniform grid), respectively, and gausspr estimates the posterior mean µ_j and covariance Σ_j for each data subset, j = 1, ..., m. These estimates correspond to Gaussian distributions Π_j(·|µ_j, Σ_j) that are used to generate 100 posterior draws at y_{1:100} each. These draws are then employed to form the empirical versions of the subset posteriors, and Weiszfeld's algorithm is used to combine them. Next, we obtained 1000 samples from the M-posterior Π_g(·|{(x_i, f_i)}). The median of these 1000 samples at each location on the grid y_{1:100} represents the estimator f̂₂; its 95% confidence band corresponds to the 2.5% and 97.5% quantiles of the 1000 posterior draws across y_{1:100}.

Figure 2 summarizes the results of GP regression with and without the M-posterior across 30 replications. In case 1, GP regression without the M-posterior is extremely sensitive to the outliers, resulting in an estimator f̂₁ that is shifted above the truth and distorted near the x's that are adjacent to the outliers; in turn, this affects the coverage of the 95% confidence bands and results in "bumps" at the locations of the outliers. In contrast, GP regression using the M-posterior produces an estimator f̂₂ that is close to the true curve in both cases; however, in case 1, when the number of data points is small, the 95% bands are unstable.

An attractive property of M-posterior based GP regression is that numerical instability due to matrix inversion can be avoided by working with multiple subsets. We investigated such cases when the number of data points n was greater than 10⁴. Chalupka et al. (2012) compare several low-rank matrix approximation techniques used to avoid matrix inversion in massive-data GP computation. M-posterior based GP computation does not use approximations to obtain the subset posteriors. By increasing the number of subsets m, M-posterior based GP regression is both computationally feasible and numerically stable for cases when n = O(10⁶) and m = O(10³). On the contrary, standard GP regression using the whole data set was intractable for data sizes greater than 10⁴ due to numerical instabilities in matrix inversion. In general, for n data points and m subsets, the computational complexity of GP regression with the M-posterior is O(n(n/m)²); therefore, m > 1 is computationally better than working with the whole data set. By carefully choosing the ratio n/m depending on the available computational resources and n, the M-posterior is a promising approach to GP regression for massive data without low-rank approximations.
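The following sketch outlines the subset-and-combine GP procedure of this section (our own illustration: scikit-learn's GaussianProcessRegressor stands in for the gausspr function, and the kernel bandwidths are arbitrary choices, not values from the paper).

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 90)
y = 1.0 + 3.0 * np.sin(2.0 * np.pi * x - np.pi) + rng.normal(0.0, 1.0, x.size)
x_out = np.linspace(0.0, 1.0, 10)                    # 10 outliers on a uniform grid (case 1)
y_out = np.full(10, 40.0)                            # 10 * max f0(x) = 40
x_all, y_all = np.concatenate([x, x_out]), np.concatenate([y, y_out])

grid = np.linspace(0.0, 1.0, 100)[:, None]           # prediction grid y_{1:100}
m, n_draws = 10, 100
draws = []                                           # posterior draws of f(grid), one block per subset
for idx in np.array_split(rng.permutation(x_all.size), m):
    gp = GaussianProcessRegressor(RBF(0.1) + WhiteKernel(0.01, "fixed"),
                                  normalize_y=True).fit(x_all[idx, None], y_all[idx])
    draws.append(gp.sample_y(grid, n_draws, random_state=0).T)   # shape (n_draws, 100)

def kmean(a, b, s=5.0):
    """Average Gaussian-kernel value between two sets of sampled curves (equation (7))."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * s ** 2)).mean()

# Weiszfeld iteration (Algorithm 1) on the empirical subset posteriors of f(grid).
K = np.array([[kmean(a, b) for b in draws] for a in draws])
w = np.full(m, 1.0 / m)
for _ in range(50):
    dist = np.sqrt(np.maximum(w @ K @ w + np.diag(K) - 2.0 * (K @ w), 1e-20))
    w = (1.0 / dist) / (1.0 / dist).sum()
w = np.where(w >= 1.0 / (2 * m), w, 0.0)
w /= w.sum()                                         # Algorithm 2: truncate small weights, renormalize

# Draw from the M-posterior mixture and summarize it by the pointwise median.
picks = rng.choice(m, size=1000, p=w)
samples = np.stack([draws[j][rng.integers(n_draws)] for j in picks])
f_hat = np.median(samples, axis=0)                   # robust estimate of f0 on the grid
print(f_hat[:5])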
4.2. Real data: PdG hormone levels vs. day of ovulation

The North Carolina Early Pregnancy Study (NCEPS) measured urinary pregnanediol-3-glucuronide (PdG) levels, a progesterone metabolite, in 166 women from the day of ovulation across 41 time points (Baird et al., 1999). These data have two main features that need to be modeled. First, the data contain women in different stages of conception and non-conception ovulation cycles, so the probabilistic model should be flexible and free of any restrictive distributional assumptions; therefore, we choose nonparametric regression of log PdG on the day of ovulation using GP regression. Second, extreme observations are very common in these data due to the nature of the measurements and the diversity of subjects in the study; therefore, we use M-posterior based GP regression as a robust approach that automatically accounts for outliers and possible model misspecification.

The NCEPS data have missing PdG levels for multiple women across multiple time points. We discarded subjects that did not have data for at least half of the time points, which left us with 3810 PdG levels across 41 time points. The size of the data enables the use of the gausspr function for GP regression, and its results are compared against M-posterior based GP regression (similar to Section 4.1.2). The data were divided into 10 subsets. At each stage, 9 of them were used to evaluate the M-posterior while the remaining one served as a test set; the process was repeated 10 times for different test subsets. Our goal is to obtain posterior predictive intervals for log PdG levels given the day of ovulation. Figure 3 shows the posterior predictive distributions obtained via GP regression with and without the M-posterior. Across all folds, the uncertainty quantification based on the M-posterior is much better than its counterpart without the M-posterior. The main reason for the poor performance of "vanilla" GP regression is that the NCEPS data have many data points for each day relative to ovulation, but with many outliers and missing values. Vanilla GP regression does not account for the latter feature of the NCEPS data, thus leading to over-optimistic uncertainty estimates across all folds. The M-posterior automatically accounts for outliers and model misspecification; therefore, it leads to reliable posterior predictive uncertainty quantification across all folds.

Figure 3. Comparison of 95% posterior predictive intervals for GP regression with and without the M-posterior. The x- and y-axes represent the day of ovulation and log PdG levels. Panels show the result of GP regression of log PdG on day of ovulation across 10 folds of the NCEPS data. The mean dependence of log PdG on day of ovulation is shown using solid lines; the dotted lines around the solid curves represent the 95% posterior predictive intervals. The intervals for GP regression without the M-posterior severely underestimate the uncertainty in log PdG levels, while the posterior predictive intervals of the M-posterior appear to be reasonable and the estimated trend of dependence of log PdG on day of ovulation is stable.

5. Discussion

We presented a general approach to scalable and robust Bayesian inference based on the evaluation of the geometric median of subset posterior distributions (the M-posterior). To the best of our knowledge, this is the first technique that is both provably robust and computationally efficient. The key to making inference tractable is to embed the subset posterior distributions in a suitable RKHS and to pose the aggregation problem as convex optimization in this space, which in turn can be solved using Weiszfeld's algorithm, a simple gradient descent-based method. Unlike popular point estimators, the M-posterior distribution can be used for summarizing uncertainty in the parameters of interest. Another advantage of our approach is scalability, so that it can be used for distributed Bayesian learning: first, since it combines the subset posteriors using a gradient-based algorithm, it naturally scales to massive data; second, since Weiszfeld's algorithm only uses samples from the subset posteriors, it can be implemented in a distributed setting via MapReduce/Hadoop.

Several important questions are not included in the present paper and will be addressed in subsequent work. These topics include applications to other types of models; alternative data partition methods and connections to inference based on the usual posterior distribution; and different choices of distances, notions of the median, and related subset posterior aggregation methods. More efficient computational alternatives and extensions of Weiszfeld's algorithm (which is currently used due to its simplicity, stability, and ease of implementation) can be developed for estimating the M-posterior; see (Bose et al., 2003; Cardot et al., 2013; Vardi & Zhang, 2000), among other works. Application of distributed optimization methods such as ADMM (Boyd et al., 2011) is another potentially fruitful direction.

Acknowledgments

The authors were supported by grant R01-ES-017436 from the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (NIH). S. Srivastava was supported by NSF under Grant DMS-1127914 to SAMSI. S. Minsker acknowledges support from NSF grants FODAVA CCF-0808847, DMS-0847388, and ATD-1222567.

References

Agarwal, A. and Duchi, J. C. Distributed delayed stochastic optimization. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pp. 5451–5452. IEEE, 2012.

Ahn, S., Korattikara, A., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of ICML, 2012.

Alon, N., Matias, Y., and Szegedy, M. The space complexity of approximating the frequency moments. In Proceedings of the 28th ACM Symposium on Theory of Computing, pp. 20–29. ACM, 1996.

Aronszajn, N. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

Baird, D. D., Weinberg, C. R., Zhou, H., Kamel, F., McConnaughey, D. R., Kesner, J. S., and Wilcox, A. J. Preimplantation urinary hormone profiles and the probability of conception in healthy women. Fertility and Sterility, 71(1):40–49, 1999.

Beck, A. and Sabach, S. Weiszfeld's method: old and new results. Preprint, 2013.

Bose, P., Maheshwari, A., and Morin, P. Fast approximations for sums of distances, clustering and the Fermat–Weber problem. Computational Geometry, 24(3):135–146, 2003.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. Streaming variational Bayes. In NIPS, pp. 1727–1735, 2013.

Cardot, H., Cénac, P., and Zitt, P.-A. Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm. Bernoulli, 19(1):18–43, 2013.

Chalupka, K., Williams, C. K., and Murray, I. A framework for evaluating approximation methods for Gaussian process regression. arXiv:1205.6326, 2012.

Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. Convergence rates of posterior distributions. Annals of Statistics, 28(2):500–531, 2000.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. JMLR, 14:1303–1347, 2013.

Hsu, D. and Sabato, S. Loss minimization and parameter estimation with heavy tails. arXiv:1307.1827, 2013.

Karatzoglou, A., Smola, A., Hornik, K., and Zeileis, A. kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1–20, 2004.

Korattikara, A., Chen, Y., and Welling, M. Austerity in MCMC land: cutting the Metropolis–Hastings budget. arXiv:1304.5299, 2013.

Lerasle, M. and Oliveira, R. I. Robust empirical mean estimators. arXiv:1112.3914, 2011.

Minsker, S. Geometric median and robust estimation in Banach spaces. arXiv:1308.1334, 2013.

Neiswanger, W., Wang, C., and Xing, E. Asymptotically exact, embarrassingly parallel MCMC. arXiv:1311.4780, 2013.

Nemirovski, A. and Yudin, D. Problem complexity and method efficiency in optimization. 1983.

Nielsen, F. and Garcia, V. Statistical exponential families: a digest with flash cards. arXiv:0911.4863, 2011.

Rieger, C. and Zwicknagl, B. Deterministic error analysis of support vector regression and related regularized kernel methods. JMLR, 10:2115–2132, 2009.

Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. Bayes and Big Data: the consensus Monte Carlo algorithm. In EFaB Bayes 250 workshop, volume 16, 2013.

Smola, A. J. and Narayanamurthy, S. An architecture for parallel topic models. In Very Large Databases (VLDB), 2010.

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. On integral probability metrics, φ-divergences and binary classification. arXiv:0901.2698, 2009.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. Hilbert space embeddings and metrics on probability measures. JMLR, 99:1517–1561, 2010.

Vardi, Y. and Zhang, C.-H. The multivariate L1-median and associated data depth. Proceedings of the National Academy of Sciences, 97(4):1423–1426, 2000.

Wang, X. and Dunson, D. B. Parallel MCMC via Weierstrass sampler. arXiv:1312.4605, 2013.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of ICML, pp. 681–688, 2011.