Bayesian Coreset Construction Via Greedy Iterative Geodesic Ascent

Bayesian Coreset Construction via Greedy Iterative Geodesic Ascent Trevor Campbell 1 Tamara Broderick 1 Abstract streaming variants (Jordan et al., 1999; Wainwright & Jor- Coherent uncertainty quantification is a key dan, 2008; Hoffman et al., 2013; Ranganath et al., 2014; strength of Bayesian methods. But modern algo- Broderick et al., 2013; Campbell & How, 2014; Campbell rithms for approximate Bayesian posterior infer- et al., 2015; Dieng et al., 2017) are both susceptible to find- ence often sacrifice accurate posterior uncertainty ing bad local optima in the variational objective and tend to estimation in the pursuit of scalability. This work either over- or underestimate posterior variance depending shows that previous Bayesian coreset construction on the chosen discrepancy and variational family. algorithms—which build a small, weighted subset Bayesian coresets (Huggins et al., 2016; Campbell & Brod- of the data that approximates the full dataset—are erick, 2017) provide an alternative approach—based on the no exception. We demonstrate that these algo- observation that large datasets often contain redundant data— rithms scale the coreset log-likelihood subopti- in which a small subset of the data of size M minfN; T g mally, resulting in underestimated posterior un- is selected and reweighted such that it preserves the statisti- certainty. To address this shortcoming, we de- cal properties of the full dataset. The coreset can be passed velop greedy iterative geodesic ascent (GIGA), a to a standard MCMC algorithm, providing posterior infer- novel algorithm for Bayesian coreset construction ence with theoretical guarantees at a significantly reduced that scales the coreset log-likelihood optimally. O(M(N + T )) computational cost. But despite their advan- GIGA provides geometric decay in posterior ap- tages, existing Bayesian coreset constructions—like many proximation error as a function of coreset size, other scalable inference methods—tend to underestimate and maintains the fast running time of its prede- posterior variance (Fig.1). This effect is particularly evi- cessors. The paper concludes with validation of dent when the coreset is small, which is the regime we are GIGA on both synthetic and real datasets, demon- interested in for scalable inference. strating that it reduces posterior approximation error by orders of magnitude compared with pre- In this work, we show that existing Bayesian coreset con- vious coreset constructions. structions underestimate posterior uncertainty because they scale the coreset log-likelihood suboptimally in order to remain unbiased (Huggins et al., 2016) or to keep their 1. Introduction weights in a particular constraint polytope (Campbell & Broderick, 2017). The result is an overweighted coreset Bayesian methods provide a wealth of options for principled with too much “artificial data,” and therefore an overly cer- parameter estimation and uncertainty quantification. But tain posterior. Taking this intuition to its limit, we demon- Markov chain Monte Carlo (MCMC) methods (Robert & strate that there exist models for which previous algorithms Casella, 2004; Neal, 2011; Hoffman & Gelman, 2014), the output coresets with arbitrarily large relative posterior ap- gold standard for Bayesian inference, typically have com- proximation error at any coreset size (Proposition 2.1). We arXiv:1802.01737v2 [stat.ML] 28 May 2018 plexity Θ(NT ) for dataset size N and number of samples T address this issue by developing a novel coreset construction and are intractable for modern large-scale datasets. Scalable algorithm, greedy iterative geodesic ascent (GIGA), that op- methods (see (Angelino et al., 2016) for a recent survey), timally scales the coreset log-likelihood to best fit the full on the other hand, often sacrifice the strong guarantees of dataset log-likelihood. GIGA has the same computational MCMC and provide unreliable posterior approximations. complexity as the current state of the art, but its optimal For example, variational methods and their scalable and log-likelihood scaling leads to uniformly bounded relative 1Computer Science and Artificial Intelligence Laboratory, Mas- error for all models, as well as asymptotic exponential error sachusetts Institute of Technology, Cambridge, MA, United States. decay (Theorem 3.1 and Corollary 3.2). The paper con- Correspondence to: Trevor Campbell <[email protected]>. cludes with experimental validation of GIGA on a synthetic th vector approximation problem as well as regression models Proceedings of the 35 International Conference on Machine applied to multiple real and synthetic datasets. Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s). Bayesian Coreset Construction via Greedy Iterative Geodesic Ascent Figure 1. (Left) Gaussian inference for an unknown mean, showing data (black points and likelihood densities), exact posterior (blue), and optimal coreset posterior approximations of size 1 from solving the original coreset construction problem Eq. (3) (red) and the modified problem Eq. (5) (orange). The orange coreset posterior has artificially low uncertainty. The exact and approximate log-posteriors are scaled down (by the same amount) for visualization. (Right) The vector formulation of log-likelihoods using the same color scheme. 2. Bayesian Coresets Solving Eq. (3) exactly is not tractable for large N due to the cardinality constraint; approximation is required. Given 2.1. Background a norm induced by an inner product, and defining In Bayesian statistical modeling, we are given a dataset N N (yn) of N observations, a likelihood p(ynjθ) for each X n=1 σn := kLnk and σ := σn; (4) θ 2 Θ observation given a parameter , and a prior density n=1 π0(θ) on Θ. We assume that the data are conditionally independent given θ. The Bayesian posterior is given by Campbell & Broderick(2017) replace the cardinality constraint in Eq. (3) with a simplex constraint, 1 π(θ) := exp(L(θ))π (θ); (1) 2 Z 0 min kL(w) − Lk N w2R N (5) where the log-likelihood L(θ) is defined by X s.t. w ≥ 0; σnwn = σ: N X n=1 Ln(θ) := log p(yn j θ); L(θ) := Ln(θ); (2) n=1 Eq. (5) can be solved while ensuring kwk0 ≤ M using either importance sampling (IS) or Frank–Wolfe (FW) (Frank and Z is the marginal likelihood. MCMC returns approx- & Wolfe, 1956). Both procedures add one data point to imate samples from the posterior, which can be used to the linear combination L(w) at each iteration; IS chooses construct an empirical approximation to the posterior dis- the new data point i.i.d. with probability σn=σ, while FW tribution. Since each sample requires at least one full like- chooses the point most aligned with the residual error. This lihood evaluation—typically an Θ(N) operation—MCMC difference in how the coreset is built results in different has Θ(NT ) complexity for T posterior samples. convergence behavior: FW exhibits geometric convergence kL(w) − Lk2 = O(νM ) for some 0 < ν < 1, while IS is To reduce the complexity, we can instead run MCMC on a 2 −1 Bayesian coreset (Huggins et al., 2016), a small, weighted limited by the Monte Carlo rate kL(w) − Lk = O(M ) N with high probability (Campbell & Broderick, 2017, Theo- subset of the data. Let w 2 R be the vector of non- rems 4.1, 4.4). For this reason, FW is the preferred method negative weights, with weight wn for data point yn, and PN 1 for coreset construction. kwk0 := n=1 [wn > 0] N. Then we approximate PN the full log-likelihood with L(w; θ) := n=1 wnLn(θ) Eq. (3) is a special case of the sparse vector approxima- and run MCMC with the approximated likelihood. By view- tion problem, which has been studied extensively in past ing the log-likelihood functions Ln(θ); L(θ); L(w; θ) as literature. Convex optimization formulations—e.g. basis vectors Ln; L; L(w) in a normed vector space, Campbell & pursuit (Chen et al., 1999), LASSO (Tibshirani, 1996), the Broderick(2017) pose the problem of constructing a coreset Dantzig selector (Candes` & Tao, 2007), and compressed of size M as cardinality-constrained vector approximation, sensing (Candes` & Tao, 2005; Donoho, 2006; Boche et al., 2015)—are expensive to solve compared to our greedy ap- min kL(w) − Lk2 proach, and often require tuning regularization coefficients N w2R (3) and thresholding to ensure cardinality constraint feasibil- s.t. w ≥ 0; kwk0 ≤ M: ity. Previous greedy iterative algorithms—e.g. (orthogonal) Bayesian Coreset Construction via Greedy Iterative Geodesic Ascent matching pursuit (Mallat & Zhang, 1993; Chen et al., 1989; In contrast, the red coreset approximation obtained by solv- Tropp, 2004), Frank–Wolfe and its variants (Frank & Wolfe, ing Eq. (3) scales its weight to minimize kL(w) − Lk, re- 1956; Guelat´ & Marcotte, 1986; Jaggi, 2013; Lacoste-Julien sulting in a significantly better approximation of posterior & Jaggi, 2015; Locatello et al., 2017), Hilbert space vector uncertainty. Note that we can scale the weight vector w by approximation methods (Barron et al., 2008), kernel herd- any α ≥ 0 without affecting feasibility in the cardinality ing (Chen et al., 2010), and AdaBoost (Freund & Schapire, constraint, i.e., kαwk0 ≤ kwk0. In the following section, 1997)—have sublinear error convergence unless computa- we use this property to develop a greedy coreset construc- tionally expensive correction steps are included. In contrast, tion algorithm that, unlike FW and IS, maintains optimal the algorithm developed in this paper has no correction steps, log-likelihood scaling in each iteration. no tuning parameters, and geometric error convergence. 3. Greedy Iterative Geodesic Ascent (GIGA) 2.2. Posterior Uncertainty Underestimation P In this section, we provide a new algorithm for Bayesian The new n σnwn = σ constraint in Eq. (5) has an unfortu- coreset construction and demonstrate that it yields improved nate practical consequence: both IS and FW must scale the approximation error guarantees proportional to kLk (rather coreset log-likelihood suboptimally—roughly, by σ rather than σ ≥ kLk). We begin in Section 3.1 by solving for the than kLk as they should—in order to maintain feasibility.

Bayesian Coreset Construction Via Greedy Iterative Geodesic Ascent

Facilitating Bayesian Continual Learning by Natural Gradients and Stein Gradients

Covariances, Robustness, and Variational Bayes

2018 Annual Report Alfred P

Martin Hairer: Fields Medalist

Parallel Streaming Wasserstein Barycenters

Quantifying Uncertainty and Robustness at Scale Tamara Broderick ITT Career Development Assistant Professor [email protected]

Ryan Giordano Runjing Liu Nelle Varoquaux Michael I. Jordan

Arxiv:0712.2437V2 [Q-Bio.QM] 15 Dec 2007 Fully) Simpler Set of Interactions

Tamara Broderick Associate Professor Electrical Engineering & Computer Science MIT

Lester Mackey

Tamara Broderick Michael I. Jordan

Tamara Broderick ITT Career Development Assistant Professor, MIT