Bayesian Coreset Construction via Greedy Iterative Geodesic Ascent

Trevor Campbell¹  Tamara Broderick¹

Abstract

Coherent uncertainty quantification is a key strength of Bayesian methods. But modern algorithms for approximate Bayesian posterior inference often sacrifice accurate posterior uncertainty estimation in the pursuit of scalability. This work shows that previous Bayesian coreset construction algorithms—which build a small, weighted subset of the data that approximates the full dataset—are no exception. We demonstrate that these algorithms scale the coreset log-likelihood suboptimally, resulting in underestimated posterior uncertainty. To address this shortcoming, we develop greedy iterative geodesic ascent (GIGA), a novel algorithm for Bayesian coreset construction that scales the coreset log-likelihood optimally. GIGA provides geometric decay in posterior approximation error as a function of coreset size, and maintains the fast running time of its predecessors. The paper concludes with validation of GIGA on both synthetic and real datasets, demonstrating that it reduces posterior approximation error by orders of magnitude compared with previous coreset constructions.

1. Introduction

Bayesian methods provide a wealth of options for principled parameter estimation and uncertainty quantification. But Markov chain Monte Carlo (MCMC) methods (Robert & Casella, 2004; Neal, 2011; Hoffman & Gelman, 2014), the gold standard for Bayesian inference, typically have complexity Θ(NT) for dataset size N and number of samples T, and are intractable for modern large-scale datasets. Scalable methods (see (Angelino et al., 2016) for a recent survey), on the other hand, often sacrifice the strong guarantees of MCMC and provide unreliable posterior approximations. For example, variational methods and their scalable and streaming variants (Jordan et al., 1999; Wainwright & Jordan, 2008; Hoffman et al., 2013; Ranganath et al., 2014; Broderick et al., 2013; Campbell & How, 2014; Campbell et al., 2015; Dieng et al., 2017) are both susceptible to finding bad local optima in the variational objective and tend to either over- or underestimate posterior variance depending on the chosen discrepancy and variational family.

Bayesian coresets (Huggins et al., 2016; Campbell & Broderick, 2017) provide an alternative approach—based on the observation that large datasets often contain redundant data—in which a small subset of the data of size M ≪ min{N, T} is selected and reweighted such that it preserves the statistical properties of the full dataset. The coreset can be passed to a standard MCMC algorithm, providing posterior inference with theoretical guarantees at a significantly reduced O(M(N + T)) computational cost. But despite their advantages, existing Bayesian coreset constructions—like many other scalable inference methods—tend to underestimate posterior variance (Fig. 1). This effect is particularly evident when the coreset is small, which is the regime we are interested in for scalable inference.

In this work, we show that existing Bayesian coreset constructions underestimate posterior uncertainty because they scale the coreset log-likelihood suboptimally in order to remain unbiased (Huggins et al., 2016) or to keep their weights in a particular constraint polytope (Campbell & Broderick, 2017). The result is an overweighted coreset with too much "artificial data," and therefore an overly certain posterior. Taking this intuition to its limit, we demonstrate that there exist models for which previous algorithms output coresets with arbitrarily large relative posterior approximation error at any coreset size (Proposition 2.1). We address this issue by developing a novel coreset construction algorithm, greedy iterative geodesic ascent (GIGA), that optimally scales the coreset log-likelihood to best fit the full dataset log-likelihood. GIGA has the same computational complexity as the current state of the art, but its optimal log-likelihood scaling leads to uniformly bounded relative error for all models, as well as asymptotic exponential error decay (Theorem 3.1 and Corollary 3.2). The paper concludes with experimental validation of GIGA on a synthetic vector approximation problem as well as regression models applied to multiple real and synthetic datasets.

¹Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, United States. Correspondence to: Trevor Campbell.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

arXiv:1802.01737v2 [stat.ML] 28 May 2018
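Before developing GIGA, it helps to see the simplest coreset construction concretely. The uniform subsampling baseline used in the experiments of Section 4 (RND) keeps M of the N log-likelihood terms and upweights each by N/M. A minimal sketch (hypothetical scalar "log-likelihood" values standing in for the vectors L_n; N is tiny so all subsets can be enumerated exactly) checks that this estimator is unbiased for the full sum — its high variance is what the more careful constructions below improve on:

```python
import itertools
import numpy as np

# Hypothetical per-datapoint "log-likelihood" values standing in for the vectors L_n.
rng = np.random.default_rng(0)
L_n = rng.normal(size=4)          # tiny N so we can enumerate all subsets exactly
N, M = len(L_n), 2
full_sum = L_n.sum()

# RND-style coreset: pick M points uniformly without replacement, weight each by N/M.
estimates = [
    (N / M) * sum(L_n[i] for i in subset)
    for subset in itertools.combinations(range(N), M)
]

# Averaging over all equally likely subsets recovers the full sum (unbiasedness).
print(np.mean(estimates), full_sum)
```

Each individual estimate can be far from the full sum; only the average over subsets is exact, which is why RND converges at the slow Monte Carlo rate discussed in Section 2.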

Figure 1. (Left) Gaussian inference for an unknown mean, showing data (black points and likelihood densities), exact posterior (blue), and optimal coreset posterior approximations of size 1 from solving the original coreset construction problem Eq. (3) (red) and the modified problem Eq. (5) (orange). The orange coreset posterior has artificially low uncertainty. The exact and approximate log-posteriors are scaled down (by the same amount) for visualization. (Right) The vector formulation of log-likelihoods using the same color scheme.
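The Fig. 1 setting can be reproduced in a few lines. With a N(0, 1) prior on the mean and unit-variance observations, a coreset placing total weight w on a single point has posterior variance 1/(1 + w); a construction forced to use weight σ/σ_n as in Eq. (5) (equal to N when all σ_n are equal) pins the variance at the full-data value 1/(1 + N) no matter how well the chosen point represents the data, while the free scaling of Eq. (3) can admit a larger, better-calibrated variance. A minimal sketch (illustrative numbers, not the paper's Fisher-norm construction):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
y = rng.normal(size=N)          # y_n ~ N(mu, 1), with prior mu ~ N(0, 1)

# Exact posterior: N(sum(y)/(N+1), 1/(N+1)).
exact_var = 1.0 / (N + 1)

# Size-1 coreset with weight w on a point y_i: posterior N(w*y_i/(w+1), 1/(w+1)).
def coreset_var(w):
    return 1.0 / (w + 1)

# The Eq. (5)-style weight (sigma/sigma_n = N when all sigma_n are equal) forces
# the coreset posterior variance down to the full-data value, despite using one point:
print(coreset_var(N), exact_var)         # identical

# A freely scaled smaller weight, as Eq. (3) permits, yields a larger variance that
# reflects the information lost by dropping the other N-1 points:
print(coreset_var(0.5 * N) > exact_var)  # True
```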

2. Bayesian Coresets

2.1. Background

In Bayesian statistical modeling, we are given a dataset (y_n)_{n=1}^N of N observations, a likelihood p(y_n | θ) for each observation given a parameter θ ∈ Θ, and a prior density π₀(θ) on Θ. We assume that the data are conditionally independent given θ. The Bayesian posterior is given by

    π(θ) := (1/Z) exp(L(θ)) π₀(θ),   (1)

where the log-likelihood L(θ) is defined by

    L_n(θ) := log p(y_n | θ),   L(θ) := Σ_{n=1}^N L_n(θ),   (2)

and Z is the marginal likelihood. MCMC returns approximate samples from the posterior, which can be used to construct an empirical approximation to the posterior distribution. Since each sample requires at least one full likelihood evaluation—typically a Θ(N) operation—MCMC has Θ(NT) complexity for T posterior samples.

To reduce the complexity, we can instead run MCMC on a Bayesian coreset (Huggins et al., 2016), a small, weighted subset of the data. Let w ∈ R^N be the vector of non-negative weights, with weight w_n for data point y_n, and ||w||₀ := Σ_{n=1}^N 1[w_n > 0] ≪ N. Then we approximate the full log-likelihood with L(w, θ) := Σ_{n=1}^N w_n L_n(θ) and run MCMC with the approximated likelihood. By viewing the log-likelihood functions L_n(θ), L(θ), L(w, θ) as vectors L_n, L, L(w) in a normed vector space, Campbell & Broderick (2017) pose the problem of constructing a coreset of size M as cardinality-constrained vector approximation,

    min_{w ∈ R^N} ||L(w) − L||²   s.t.  w ≥ 0,  ||w||₀ ≤ M.   (3)

Solving Eq. (3) exactly is not tractable for large N due to the cardinality constraint; approximation is required. Given a norm induced by an inner product, and defining

    σ_n := ||L_n||   and   σ := Σ_{n=1}^N σ_n,   (4)

Campbell & Broderick (2017) replace the cardinality constraint in Eq. (3) with a simplex constraint,

    min_{w ∈ R^N} ||L(w) − L||²   s.t.  w ≥ 0,  Σ_{n=1}^N σ_n w_n = σ.   (5)

Eq. (5) can be solved while ensuring ||w||₀ ≤ M using either importance sampling (IS) or Frank–Wolfe (FW) (Frank & Wolfe, 1956). Both procedures add one data point to the linear combination L(w) at each iteration; IS chooses the new data point i.i.d. with probability σ_n/σ, while FW chooses the point most aligned with the residual error. This difference in how the coreset is built results in different convergence behavior: FW exhibits geometric convergence ||L(w) − L||² = O(ν^M) for some 0 < ν < 1, while IS is limited by the Monte Carlo rate ||L(w) − L||² = O(M⁻¹) with high probability (Campbell & Broderick, 2017, Theorems 4.1, 4.4). For this reason, FW is the preferred method for coreset construction.

Eq. (3) is a special case of the sparse vector approximation problem, which has been studied extensively in past literature. Convex optimization formulations—e.g. basis pursuit (Chen et al., 1999), LASSO (Tibshirani, 1996), the Dantzig selector (Candès & Tao, 2007), and compressed sensing (Candès & Tao, 2005; Donoho, 2006; Boche et al., 2015)—are expensive to solve compared to our greedy approach, and often require tuning regularization coefficients and thresholding to ensure cardinality constraint feasibility. Previous greedy iterative algorithms—e.g. (orthogonal) matching pursuit (Mallat & Zhang, 1993; Chen et al., 1989; Tropp, 2004), Frank–Wolfe and its variants (Frank & Wolfe, 1956; Guélat & Marcotte, 1986; Jaggi, 2013; Lacoste-Julien & Jaggi, 2015; Locatello et al., 2017), Hilbert space vector approximation methods (Barron et al., 2008), kernel herding (Chen et al., 2010), and AdaBoost (Freund & Schapire, 1997)—have sublinear error convergence unless computationally expensive correction steps are included. In contrast, the algorithm developed in this paper has no correction steps, no tuning parameters, and geometric error convergence.

2.2. Posterior Uncertainty Underestimation

The new Σ_n σ_n w_n = σ constraint in Eq. (5) has an unfortunate practical consequence: both IS and FW must scale the coreset log-likelihood suboptimally—roughly, by σ rather than ||L|| as they should—in order to maintain feasibility. Since σ ≥ ||L||, intuitively the coreset construction algorithms are adding too much "artificial data" via the coreset weights, resulting in an overly certain posterior approximation. It is worth noting that this effect is apparent in the error bounds developed by Campbell & Broderick (2017), which are all proportional to σ rather than ||L|| as one might hope for when approximating L.

Fig. 1 provides intuition in the setting of Gaussian inference for an unknown mean. In this example, we construct the optimal coreset of size 1 for the modified problem in Eq. (5) (orange) and for the original coreset construction problem that we would ideally like to solve in Eq. (3) (red). Compared with the exact posterior (blue), the orange coreset approximation from Eq. (5) has artificially low uncertainty, since it must place weight σ/σ_n on its chosen point. In contrast, the red coreset approximation obtained by solving Eq. (3) scales its weight to minimize ||L(w) − L||, resulting in a significantly better approximation of posterior uncertainty. Building on this intuition, Proposition 2.1 shows that there are problems for which both FW and IS perform arbitrarily poorly for any number of iterations M.

Proposition 2.1. For any M ∈ N, there exists (L_n)_{n=1}^N for which both the FW and IS coresets after M iterations have arbitrarily large error relative to ||L||.¹

Proof. Let L_n = (1/N) 1_n, where 1_n is the indicator for the nth component of R^N, and let L = Σ_{n=1}^N L_n. Then in the 2-norm, σ_n = 1/N, σ = 1, and ||L|| = 1/√N. By symmetry, the optimal w for Eq. (5) satisfying ||w||₀ ≤ M has uniform nonzero weights N/M. Substituting yields

    ||L(w) − L|| / ||L|| = √(N/M − 1),   (6)

which can be made as large as desired by increasing N. The result follows since both FW and IS generate a feasible solution w for Eq. (5) satisfying ||w||₀ ≤ M.

Note that we can scale the weight vector w by any α ≥ 0 without affecting feasibility in the cardinality constraint, i.e., ||αw||₀ ≤ ||w||₀. In the following section, we use this property to develop a greedy coreset construction algorithm that, unlike FW and IS, maintains optimal log-likelihood scaling in each iteration.

¹While the proof uses orthogonal vectors in R^N for simplicity, a similar construction arises from the model θ ∼ N(0, I), x_n ∼ N(θ_n, 1), n ∈ [N] given the norm ||L|| = √(E_π[||∇L||²]).

3. Greedy Iterative Geodesic Ascent (GIGA)

In this section, we provide a new algorithm for Bayesian coreset construction and demonstrate that it yields improved approximation error guarantees proportional to ||L|| (rather than σ ≥ ||L||). We begin in Section 3.1 by solving for the optimal log-likelihood scaling analytically. After solving this "radial optimization problem," we are left with a new optimization problem on the unit hyperspherical manifold. In Sections 3.2 and 3.3, we demonstrate how to solve this new problem by iteratively building the coreset one point at a time. Since this procedure selects the point greedily based on a geodesic alignment criterion, we call it greedy iterative geodesic ascent (GIGA), detailed in Algorithm 1. In Section 3.4 we scale the resulting coreset optimally using the procedure developed in Section 3.1. Finally, in Section 3.5, we prove that Algorithm 1 provides approximation error that is proportional to ||L|| and geometrically decaying in M, as shown by Theorem 3.1.

Theorem 3.1. The output w of Algorithm 1 satisfies

    ||L(w) − L|| ≤ η ||L|| ν_M,   (7)

where ν_M is decreasing and ≤ 1 for all M ∈ N, ν_M = O(ν^M) for some 0 < ν < 1, and η is defined by

    0 ≤ η := √(1 − max_{n ∈ [N]} ⟨L_n/||L_n||, L/||L||⟩²) ≤ 1.   (8)

The particulars of the sequence (ν_M)_{M=1}^∞ are somewhat involved; see Section 3.5 for the detailed development. A straightforward consequence of Theorem 3.1 is that the issue described in Proposition 2.1 has been resolved; the solution L(w) is always scaled optimally, leading to decaying relative error. Corollary 3.2 makes this notion precise.

Corollary 3.2. For any set of vectors (L_n)_{n=1}^N, Algorithm 1 provides a solution to Eq. (3) with error ≤ 1 relative to ||L||.

3.1. Solving the Radial Optimization

We begin again with the problem of coreset construction for a collection of vectors (L_n)_{n=1}^N given by Eq. (3). Without loss of generality, we assume ||L|| > 0 and, for all n ∈ [N], ||L_n|| > 0; if ||L|| = 0 then w = 0 is a trivial optimal solution, and any L_n for which ||L_n|| = 0 can be removed from the coreset without affecting the objective in Eq. (3). Recalling that the weights of a coreset can be scaled by an arbitrary constant α ≥ 0 without affecting feasibility, we rewrite Eq. (3) as

    min_{α ∈ R, w ∈ R^N} ||αL(w) − L||²   s.t.  α ≥ 0,  w ≥ 0,  ||w||₀ ≤ M.   (9)

Taking advantage of the fact that || · || is induced by an inner product, we can expand the objective in Eq. (9) as a quadratic function of α, and analytically solve the optimization in α to yield

    α* = max{0, (||L||/||L(w)||) ⟨L(w)/||L(w)||, L/||L||⟩}.   (10)

In other words, L(w) should be rescaled to have norm ||L||, and then scaled further depending on its directional alignment with L. We substitute this result back into Eq. (9) to find that coreset construction is equivalent to solving

    min_{w ∈ R^N} ||L||² (1 − max{0, ⟨L(w)/||L(w)||, L/||L||⟩}²)   s.t.  w ≥ 0,  ||w||₀ ≤ M.   (11)

Note that by selecting the optimal scaling via Eq. (10), the cost now depends only on the alignment of L(w) and L, irrespective of their norms. Finally, defining the normalized vectors ℓ_n := L_n/||L_n||, ℓ := L/||L||, and ℓ(w) := Σ_{n=1}^N w_n ℓ_n, solving Eq. (11) is equivalent to maximizing the alignment of ℓ and ℓ(w) over ℓ(w) lying on the unit hypersphere:

    max_{w ∈ R^N} ⟨ℓ(w), ℓ⟩   s.t.  w ≥ 0,  ||w||₀ ≤ M,  ||ℓ(w)|| = 1.   (12)

Compare this optimization problem to Eq. (5); Eq. (12) shows that the natural choice of manifold for optimization is the unit hypersphere rather than the simplex. Therefore, in Sections 3.2 and 3.3, we develop a greedy optimization algorithm that operates on the hyperspherical manifold. At each iteration, the algorithm picks the point indexed by n for which the geodesic between ℓ(w) and ℓ_n is most aligned with the geodesic between ℓ(w) and ℓ. Once the point has been added, it reweights the coreset and iterates.

Algorithm 1  GIGA: Greedy Iterative Geodesic Ascent

Require: (L_n)_{n=1}^N, M, ⟨·, ·⟩
    ▹ Normalize vectors and initialize weights to 0
 1: L ← Σ_{n=1}^N L_n
 2: ∀n ∈ [N], ℓ_n ← L_n/||L_n||,  ℓ ← L/||L||
 3: w₀ ← 0,  ℓ(w₀) ← 0
 4: for t ∈ {0, ..., M − 1} do
    ▹ Compute the geodesic direction for each data point
 5:   d_t ← (ℓ − ⟨ℓ, ℓ(w_t)⟩ ℓ(w_t)) / ||ℓ − ⟨ℓ, ℓ(w_t)⟩ ℓ(w_t)||
 6:   ∀n ∈ [N], d_tn ← (ℓ_n − ⟨ℓ_n, ℓ(w_t)⟩ ℓ(w_t)) / ||ℓ_n − ⟨ℓ_n, ℓ(w_t)⟩ ℓ(w_t)||
    ▹ Choose the best geodesic
 7:   n_t ← argmax_{n ∈ [N]} ⟨d_t, d_tn⟩
 8:   ζ₀ ← ⟨ℓ, ℓ_{n_t}⟩,  ζ₁ ← ⟨ℓ, ℓ(w_t)⟩,  ζ₂ ← ⟨ℓ_{n_t}, ℓ(w_t)⟩
    ▹ Compute the step size
 9:   γ_t ← (ζ₀ − ζ₁ζ₂) / ((ζ₀ − ζ₁ζ₂) + (ζ₁ − ζ₀ζ₂))
    ▹ Update the coreset
10:   ℓ(w_{t+1}) ← ((1 − γ_t) ℓ(w_t) + γ_t ℓ_{n_t}) / ||(1 − γ_t) ℓ(w_t) + γ_t ℓ_{n_t}||
11:   w_{t+1} ← ((1 − γ_t) w_t + γ_t 1_{n_t}) / ||(1 − γ_t) ℓ(w_t) + γ_t ℓ_{n_t}||
12: end for
    ▹ Scale the weights optimally
13: ∀n ∈ [N], w_Mn ← w_Mn (||L||/||L_n||) ⟨ℓ(w_M), ℓ⟩
14: return w

3.2. Coreset Initialization

For the remainder of this section, denote the value of the weights at iteration t by w_t ∈ R₊^N, and write the nth component of this vector as w_tn. To initialize the optimization, we select the vector ℓ_n most aligned with ℓ, i.e.

    n₁ = argmax_{n ∈ [N]} ⟨ℓ_n, ℓ⟩,   w₁ = 1_{n₁}.   (13)

This initialization provides two benefits: for all iterations t, there is no point ℓ_n in the hyperspherical cap {y : ||y|| = 1, ⟨y, ℓ⟩ ≥ r} for any r > ⟨ℓ(w₁), ℓ⟩, as well as ⟨ℓ(w₁), ℓ⟩ > 0 by Lemma 3.3. We use these properties to develop the error bound in Theorem 3.1 and guarantee that the iteration step developed below in Section 3.3 is always feasible. Note that Algorithm 1 does not explicitly run this initialization and simply sets w₀ = 0, since the initialization for w₁ is equivalent to the usual iteration step described in Section 3.3 after setting w₀ = 0.

Lemma 3.3. The initialization satisfies

    ⟨ℓ(w₁), ℓ⟩ ≥ ||L||/σ > 0.   (14)

Proof. For any ξ ∈ Δ^{N−1}, we have that

    ⟨ℓ(w₁), ℓ⟩ = max_{n ∈ [N]} ⟨ℓ_n, ℓ⟩ ≥ Σ_n ξ_n ⟨ℓ_n, ℓ⟩.   (15)

So choosing ξ_n = σ_n/σ,

    ⟨ℓ(w₁), ℓ⟩ ≥ ⟨(1/σ) Σ_n L_n, ℓ⟩ = (||L||/σ) ⟨ℓ, ℓ⟩ = ||L||/σ.   (16)

By assumption (see Section 3.1), we have ||L|| > 0.
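Algorithm 1 translates almost line-for-line into array code. The following is a minimal NumPy sketch (finite-dimensional vectors with the Euclidean inner product standing in for the general inner-product space; the function name and interface are our own):

```python
import numpy as np

def giga(Ls, M):
    """Sketch of Algorithm 1 (GIGA) for rows L_n of Ls, run for M iterations.

    Returns nonnegative weights w with at most M nonzeros, optimally scaled
    so that w @ Ls approximates Ls.sum(axis=0).
    """
    L = Ls.sum(axis=0)
    norms = np.linalg.norm(Ls, axis=1)
    ell_n = Ls / norms[:, None]              # normalized vectors l_n (line 2)
    ell = L / np.linalg.norm(L)              # normalized sum l
    w = np.zeros(Ls.shape[0])
    lw = np.zeros_like(ell)                  # current iterate l(w) (line 3)
    for _ in range(M):
        # Geodesic direction from l(w) toward l (line 5); at t=0 this is l itself.
        r = ell - (ell @ lw) * lw
        rnorm = np.linalg.norm(r)
        if rnorm < 1e-14:                    # l(w) = l: nothing left to do
            break
        d = r / rnorm
        # Geodesic directions from l(w) toward each l_n (line 6); 0/0 -> 0.
        Rn = ell_n - np.outer(ell_n @ lw, lw)
        Rn_norm = np.linalg.norm(Rn, axis=1)
        Dn = np.divide(Rn, Rn_norm[:, None], out=np.zeros_like(Rn),
                       where=Rn_norm[:, None] > 0)
        nt = int(np.argmax(Dn @ d))          # best-aligned geodesic (line 7)
        # Closed-form line search (lines 8-9).
        z0, z1, z2 = ell @ ell_n[nt], ell @ lw, ell_n[nt] @ lw
        gamma = (z0 - z1 * z2) / ((z0 - z1 * z2) + (z1 - z0 * z2))
        # Update the coreset (lines 10-11).
        new = (1 - gamma) * lw + gamma * ell_n[nt]
        nn = np.linalg.norm(new)
        lw = new / nn
        e = np.zeros_like(w)
        e[nt] = 1.0
        w = ((1 - gamma) * w + gamma * e) / nn
    # Optimal radial scaling back to the unnormalized vectors (line 13).
    return w * (np.linalg.norm(L) / norms) * (lw @ ell)

# Usage: sparse approximation of a sum of random vectors.
rng = np.random.default_rng(2)
Ls = rng.normal(size=(50, 5))
w = giga(Ls, 10)
print(np.count_nonzero(w), np.linalg.norm(w @ Ls - Ls.sum(axis=0)))
```

Per Corollary 3.2, the scaled output never exceeds relative error 1, and because the line search only ever improves the alignment ⟨ℓ(w_t), ℓ⟩, the error is non-increasing in the number of iterations.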

Figure 2. Fig. 2a shows the greedy geodesic selection procedure, with normalized total log-likelihood ℓ in blue, normalized data log-likelihoods ℓ_n in black, and the current iterate ℓ(w) in orange. Fig. 2b shows the subsequent line search procedure.

3.3. Greedy Point Selection and Reweighting

At each iteration t, we have a set of weights w_t ∈ R₊^N for which ||ℓ(w_t)|| = 1 and ||w_t||₀ ≤ t. The next point to add to the coreset is the one for which the geodesic direction is most aligned with the direction of the geodesic from ℓ(w_t) to ℓ; Fig. 2a illustrates this selection. Precisely, we define

    d_t := (ℓ − ⟨ℓ, ℓ(w_t)⟩ ℓ(w_t)) / ||ℓ − ⟨ℓ, ℓ(w_t)⟩ ℓ(w_t)||,
    ∀n ∈ [N], d_tn := (ℓ_n − ⟨ℓ_n, ℓ(w_t)⟩ ℓ(w_t)) / ||ℓ_n − ⟨ℓ_n, ℓ(w_t)⟩ ℓ(w_t)||,   (17)

and we select the data point at index

    n_t = argmax_{n ∈ [N]} ⟨d_t, d_tn⟩.   (18)

This selection leads to the largest decrease in approximation error, as shown later in Eq. (28). For any vector u with ||u|| = 0, we define u/||u|| := 0 to avoid any issues with Eq. (18). Once the vector ℓ_{n_t} has been selected, the next task is to redistribute the weights between ℓ(w_t) and ℓ_{n_t} to compute w_{t+1}. We perform a search along the geodesic from ℓ(w_t) to ℓ_{n_t} to maximize ⟨ℓ(w_{t+1}), ℓ⟩, as shown in Fig. 2b:

    γ_t = argmax_{γ ∈ [0,1]} ⟨ℓ, (ℓ(w_t) + γ(ℓ_{n_t} − ℓ(w_t))) / ||ℓ(w_t) + γ(ℓ_{n_t} − ℓ(w_t))||⟩.   (19)

Constraining γ ∈ [0, 1] ensures that the resulting w_{t+1} is feasible for Eq. (12). Taking the derivative and setting it to 0 yields the unconstrained optimum of Eq. (19),

    ζ₀ := ⟨ℓ, ℓ_{n_t}⟩,  ζ₁ := ⟨ℓ, ℓ(w_t)⟩,  ζ₂ := ⟨ℓ_{n_t}, ℓ(w_t)⟩,   (20)

    γ_t = (ζ₀ − ζ₁ζ₂) / ((ζ₀ − ζ₁ζ₂) + (ζ₁ − ζ₀ζ₂)).   (21)

Lemma 3.4 shows that γ_t is also the solution to the constrained line search problem. The proof of Lemma 3.4 is based on the fact that we initialized the procedure with the vector most aligned with ℓ (and so ℓ_{n_t} must lie on the "other side" of ℓ from ℓ(w_t), i.e. γ_t ≤ 1), along with the fact that the optimal objective in Eq. (18) is guaranteed to be positive (so moving towards ℓ_{n_t} is guaranteed to improve the objective, i.e. γ_t ≥ 0).

Lemma 3.4. For all t ∈ N, γ_t ∈ [0, 1].

Proof. It is sufficient to show that both ζ₀ − ζ₁ζ₂ and ζ₁ − ζ₀ζ₂ are nonnegative and that at least one is strictly positive. First, examining Eq. (18), note that for any ξ ∈ Δ^{N−1},

    ⟨d_t, d_{t n_t}⟩ = max_{n ∈ [N]} ⟨d_t, d_tn⟩ ≥ Σ_{n=1}^N ξ_n ⟨d_t, d_tn⟩,   (22)

so choosing ξ_n = C ||L_n|| ||ℓ_n − ⟨ℓ_n, ℓ(w_t)⟩ ℓ(w_t)||, where C > 0 is chosen appropriately to normalize ξ, we have that

    ⟨d_t, d_{t n_t}⟩ ≥ C ||L|| ||ℓ − ⟨ℓ, ℓ(w_t)⟩ ℓ(w_t)|| > 0,   (23)

since ||L|| > 0 by assumption (see Section 3.1), and ℓ ≠ ℓ(w_t) since otherwise we could terminate the optimization after iteration t − 1. By expanding the formulae for d_t and d_{t n_t}, we have that for some C′ > 0,

    ζ₀ − ζ₁ζ₂ = C′ ⟨d_t, d_{t n_t}⟩ > 0.   (24)

For the other term, first note that ⟨ℓ, ℓ(w_t)⟩ is a monotonically increasing function in t, since γ = 0 is feasible in Eq. (19). Combined with the choice of initialization for w₁, we have that ζ₁ ≥ ζ₀. Further, since ζ₂ ∈ [−1, 1] and ζ₁ ≥ 0 by its initialization, Eq. (24) implies ζ₁ ≥ ζ₁(−ζ₂) ≥ −ζ₀. Therefore ζ₁ ≥ |ζ₀| ≥ |ζ₀ζ₂| ≥ ζ₀ζ₂, and the proof is complete.

Given the line search step size γ_t, we can now update the weights to compute ℓ(w_{t+1}) and w_{t+1}:

    ℓ(w_{t+1}) = (ℓ(w_t) + γ_t(ℓ_{n_t} − ℓ(w_t))) / ||ℓ(w_t) + γ_t(ℓ_{n_t} − ℓ(w_t))||,   (25)

    w_{t+1} = (w_t + γ_t(1_{n_t} − w_t)) / ||ℓ(w_t) + γ_t(ℓ_{n_t} − ℓ(w_t))||.   (26)

There are two options for practical implementation: either (A) we keep track of only w_t, recomputing and normalizing ℓ(w_t) at each iteration to obtain w_{t+1}; or (B) we keep track of and store both w_t and ℓ(w_t). In theory both are equivalent, but in practice (B) will cause numerical errors to accumulate, making ℓ(w_t) not equal to the sum of the normalized ℓ_n weighted by w_t. However, in terms of vector sum and inner-product operations, (A) incurs an additional cost of O(t) at each iteration t—resulting in a total of O(M²) for M iterations—while (B) has constant cost for each iteration with a total cost of O(M). We found empirically that option (B) is preferable in practice, since the reduced numerical error of (A) does not justify the O(M²) cost.
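The closed-form step size in Eq. (21) can be sanity-checked against a brute-force line search. A small sketch with hand-picked unit vectors, chosen so the algorithm's invariants hold (ℓ(w) better aligned with ℓ than ℓ_{n_t} is, all alignments positive):

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# Current iterate l(w), target l, and selected point l_nt, as unit vectors.
l   = np.array([1.0, 0.0])
lw  = unit(np.array([1.0, 0.2]))
lnt = unit(np.array([1.0, -1.0]))

# Eq. (20)-(21): closed-form line search step.
z0, z1, z2 = l @ lnt, l @ lw, lnt @ lw
gamma = (z0 - z1 * z2) / ((z0 - z1 * z2) + (z1 - z0 * z2))

def objective(g):
    # Eq. (19): alignment of the normalized interpolant with l.
    v = (1 - g) * lw + g * lnt
    return l @ (v / np.linalg.norm(v))

grid = np.linspace(0.0, 1.0, 10001)
print(gamma)    # lies in [0, 1], matching Lemma 3.4
```

Comparing `objective(gamma)` with the objective evaluated on a fine grid over [0, 1] confirms that the closed-form γ attains the constrained maximum.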

3.4. Output

After running M iterations, the algorithm must output a set of weights that correspond to the unnormalized vectors L_n rather than the normalized counterparts ℓ_n used within the algorithm. In addition, the weights need to be rescaled optimally by α* from Eq. (10). The following formula adjusts the weights w_M accordingly to generate the output:

    ∀n ∈ [N], w_Mn ← w_Mn (||L||/||L_n||) ⟨ℓ(w_M), ℓ⟩.   (27)

3.5. Convergence Guarantee

We now prove Theorem 3.1 by bounding the objective of Eq. (11) as a function of the number of iterations M of greedy geodesic ascent. Writing J_t := 1 − ⟨ℓ(w_t), ℓ⟩² for brevity, we have that ||L||² J_t = ||α* L(w_t) − L||² for α* from Eq. (10), so providing an upper bound on J_t provides a relative bound on the coreset error. Substituting the optimal line search value γ_t into the objective of Eq. (19) yields

    J_{t+1} = J_t (1 − ⟨d_t, d_{t n_t}⟩²),   (28)

where d_t, d_tn, and n_t are given in Eqs. (17) and (18). In other words, the cost J_t decreases corresponding to how well aligned the ℓ(w_t)-to-ℓ and ℓ(w_t)-to-ℓ_{n_t} geodesics are. We define for each n ∈ [N] the geodesic direction from ℓ to ℓ_n to be d_∞n := (ℓ_n − ⟨ℓ_n, ℓ⟩ℓ)/||ℓ_n − ⟨ℓ_n, ℓ⟩ℓ||; this slight abuse of notation from Eq. (17) is justified since we expect ℓ(w_t) → ℓ as t → ∞. Then the worst-case alignment is governed by two constants, τ and ε:

    τ⁻¹ := min_{s ∈ R^N} ||s||₁   s.t.  ℓ = Σ_{m=1}^N s_m ℓ_m,  s ≥ 0,   (29)

    ε := min_{s ∈ R^N} max_{n ∈ [N]} ⟨−Σ_m s_m d_∞m, d_∞n⟩   s.t.  ||Σ_m s_m d_∞m|| = 1,  s ≥ 0.   (30)

Both are guaranteed to be positive by Lemma 3.5 below. The constant τ captures how fundamentally hard it is to approximate ℓ using (ℓ_n)_{n=1}^N, and governs the worst-case behavior of the large steps taken in early iterations. In contrast, ε captures the worst-case behavior of the smaller steps in the t → ∞ asymptotic regime when ℓ(w_t) ≈ ℓ. These notions are quantified precisely in Lemma 3.6. The proofs of both Lemmas 3.5 and 3.6 may be found in Appendix B.

Lemma 3.5. The constants τ and ε satisfy

    τ ≥ ||L||/σ > 0   and   ε > 0.   (31)

Lemma 3.6. The geodesic alignment ⟨d_t, d_{t n_t}⟩ satisfies

    ⟨d_t, d_{t n_t}⟩ ≥ τ√J_t ∨ f(J_t),   (32)

where f : [0, 1] → R is defined by

    f(x) := (√(1−x) √(1−β²) ε + √x β) / √(1 − (√(1−x) β − √x √(1−β²) ε)²),   (33)

    β := 0 ∧ min_{n ∈ [N]} { ⟨ℓ_n, ℓ⟩ : ⟨ℓ_n, ℓ⟩ > −1 }.   (34)

The proof of Theorem 3.1 below follows by using the τ√J_t bound from Lemma 3.6 to obtain an O(1/t) bound on J_t, and then combining that result with the f(J_t) bound from Lemma 3.6 to obtain the desired geometric decay.

Proof of Theorem 3.1. Substituting the τ√J_t bound from Lemma 3.6 into Eq. (28) and applying a standard inductive argument (e.g. the proof of Campbell & Broderick (2017, Lemma A.6)) yields

    J_t ≤ B(t) := J₁ / (1 + τ² J₁ (t − 1)).   (35)

Since B(t) → 0 and f(B(t)) → ε > 0 as t → ∞, there exists a t* ∈ N for which t > t* implies f(B(t)) ≥ τ√B(t). Furthermore, since f(x) is monotonically decreasing, we have that f(J_t) ≥ f(B(t)). Using the bound ⟨d_t, d_{t n_t}⟩ ≥ f(B(t)) in Eq. (28) and combining with Eq. (35) yields

    J_t ≤ B(t ∧ t*) Π_{s=t*+1}^t (1 − f²(B(s))).   (36)

Multiplying by ||L||², taking the square root, and noting that η from Eq. (8) is equal to √J₁ gives the final result. The convergence of f(B(t)) → ε as t → ∞ shows that the rate of decay in the theorem statement is ν = √(1 − ε²).

4. Experiments

In this section we evaluate the performance of GIGA coreset construction compared with both uniformly random subsampling and the Frank–Wolfe-based method of Campbell & Broderick (2017). We first test the algorithms on simple synthetic examples, and then test them on logistic and Poisson regression models applied to numerous real and synthetic datasets. Code for these experiments is available at https://github.com/trevorcampbell/bayesian-coresets.

4.1. Synthetic Gaussian Inference

To generate Fig. 1, we generated μ ∼ N(0, 1) followed by (y_n)_{n=1}^{10} ∼ N(μ, 1). We then constructed Bayesian coresets via FW and GIGA using the norm specified in (Campbell & Broderick, 2017, Section 6). Across 1000 replications of this experiment, the median relative error in posterior variance approximation is 3% for GIGA and 48% for FW.

4.2. Synthetic Vector Sum Approximation

In this experiment, we generated 20 independent datasets consisting of 10⁶ 50-dimensional vectors from the multivariate normal distribution with mean 0 and identity covariance. We then constructed coresets for each of the datasets via uniformly random subsampling (RND), Frank–Wolfe (FW), and GIGA. We compared the algorithms on two metrics: reconstruction error, as measured by the 2-norm between L(w) and L; and representation efficiency, as measured by the size of the coreset.

Fig. 3 shows the results of the experiment, with reconstruction error in Fig. 3a and coreset size in Fig. 3b. Across all construction iterations, GIGA provides a 2–4 order-of-magnitude reduction in error as compared with FW, and significantly outperforms RND. The exponential convergence of GIGA is evident. The flat section of FW/GIGA in Fig. 3a for iterations beyond 10² is due to the algorithm reaching the limits of numerical precision. In addition, Fig. 3b shows that GIGA can improve representational efficiency over FW, ceasing to grow the coreset once it reaches a size of 120, while FW continues to add points until it is over twice as large. Note that this experiment is designed to highlight the strengths of GIGA: by the law of large numbers, as N → ∞ the norm of the sum L of the N i.i.d. standard multivariate normal data vectors grows much more slowly than the sum of their norms, which satisfies σ → ∞ a.s., so that ||L||/σ → 0. But Appendix A.1 shows that even in pathological cases, GIGA outperforms FW due to its optimal log-likelihood scaling.

4.3. Bayesian Posterior Approximation

In this experiment, we used GIGA to generate Bayesian coresets for logistic and Poisson regression. For both models, we used the standard multivariate normal prior θ ∼ N(0, I). In the logistic regression setting, we have a set of data points (x_n, y_n)_{n=1}^N each consisting of a feature x_n ∈ R^D and a label y_n ∈ {−1, 1}, and the goal is to predict the label of a new point given its feature. We thus seek to infer the posterior distribution of the parameter θ ∈ R^{D+1} governing the generation of y_n given x_n via

    y_n | x_n, θ ~ (indep) Bern(1 / (1 + e^{−z_nᵀθ})),   z_n := [x_n; 1].   (37)

In the Poisson regression setting, we have a set of data points (x_n, y_n)_{n=1}^N each consisting of a feature x_n ∈ R^D and a count y_n ∈ N, and the goal is to learn a relationship between features x_n and the associated mean count. We thus seek to infer the posterior distribution of the parameter θ ∈ R^{D+1} governing the generation of y_n given x_n via

    y_n | x_n, θ ~ (indep) Poiss(log(1 + e^{z_nᵀθ})),   z_n := [x_n; 1].   (38)

We tested the coreset construction methods for each model on a number of datasets; see Appendix D for references. For logistic regression, the Synthetic dataset consisted of N = 10,000 data points with covariate x_n ∈ R² sampled i.i.d. from N(0, I), and label y_n ∈ {−1, 1} generated from the logistic likelihood with parameter θ = [3, 3, 0]ᵀ. The Phishing dataset consisted of N = 11,055 data points each with D = 68 features. The DS1 dataset consisted of N = 26,733 data points each with D = 10 features.

For Poisson regression, the Synthetic dataset consisted of N = 10,000 data points with covariate x_n ∈ R sampled i.i.d. from N(0, 1), and count y_n ∈ N generated from the Poisson likelihood with θ = [1, 0]ᵀ. The BikeTrips dataset consisted of N = 17,386 data points each with D = 8 features. The AirportDelays dataset consisted of N = 7,580 data points each with D = 15 features.

We constructed coresets for each of the datasets via uniformly random subsampling (RND), Frank–Wolfe (FW), and GIGA using the weighted Fisher information distance,

    ||L_n||² := E_π̂[||∇L_n(θ)||₂²],   (39)

where L_n(θ) is the log-likelihood for data point n, and π̂ is obtained via the Laplace approximation. In order to do so, we approximated all vectors L_n using a 500-dimensional random feature projection (following Campbell & Broderick, 2017). For posterior inference, we used Hamiltonian Monte Carlo (Neal, 2011) with 15 leapfrog steps per sample. We simulated a total of 6,000 steps, with 1,000 warmup steps for step size adaptation with a target acceptance rate of 0.8, and 5,000 posterior sampling steps. Appendix A shows results for similar experiments using random-walk Metropolis–Hastings and the No-U-Turn Sampler (Hoffman & Gelman, 2014). We ran 20 trials of projection / coreset construction / MCMC for each combination of dataset, model, and algorithm. We evaluated the coresets at logarithmically-spaced construction iterations between M = 1 and 1000 by their median posterior Fisher information distance, estimated using samples obtained by running posterior inference on the full dataset in each trial. We compared these results versus computation time as well as coreset size. The latter serves as an implementation-independent measure of total cost, since the cost of running MCMC is much greater than that of running coreset construction and depends linearly on coreset size.

The results of this experiment are shown in Fig. 4. In Fig. 4a and Fig. 4b, the Fisher information distance is normalized by the median distance of RND for comparison. In Fig. 4b, the computation time is normalized by the median time to run MCMC on the full dataset. The suboptimal coreset log-likelihood scaling of FW can be seen in Fig. 4a for small coresets, resulting in similar performance to RND. In contrast, GIGA correctly scales posterior uncertainty across all coreset sizes, resulting in a major (3–4 orders of magnitude) reduction in error. Fig. 4b shows the same results plotted versus total computation time.
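The finite-dimensional vectors used throughout Section 4 come from projecting each L_n under the weighted Fisher inner product of Eq. (39). The following is a Monte Carlo sketch of that idea for the logistic model in Eq. (37) — not the paper's exact random feature construction (see Campbell & Broderick, 2017), and with a fixed Gaussian standing in for the Laplace approximation π̂ and shrunken problem sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, J = 200, 2, 500                      # data size, feature dim, number of MC draws

# Synthetic logistic-regression data, mimicking the Synthetic dataset above.
X = rng.normal(size=(N, D))
theta_true = np.array([3.0, 3.0, 0.0])
Z = np.hstack([X, np.ones((N, 1))])        # z_n = [x_n; 1]
p = 1.0 / (1.0 + np.exp(-Z @ theta_true))
y = np.where(rng.uniform(size=N) < p, 1.0, -1.0)

# Stand-in for the Laplace approximation pi-hat: here simply N(theta_true, I).
thetas = theta_true + rng.normal(size=(J, D + 1))

def grad_Ln(theta):
    # Gradient of log Bern(y_n | logistic(z_n^T theta)) for y_n in {-1, 1}:
    # grad L_n(theta) = y_n * z_n * sigmoid(-y_n * z_n^T theta).
    s = 1.0 / (1.0 + np.exp(y * (Z @ theta)))          # (N,)
    return (y * s)[:, None] * Z                        # (N, D+1)

# Row n stacks grad L_n(theta_j)/sqrt(J) across draws, so Euclidean inner products
# of rows are Monte Carlo estimates of the Fisher inner product E[<grad L_n, grad L_m>].
Ls = np.concatenate([grad_Ln(t) / np.sqrt(J) for t in thetas], axis=1)

# Squared row norms estimate E||grad L_n||^2, i.e. sigma_n^2 from Eqs. (4) and (39).
sigma_n = np.linalg.norm(Ls, axis=1)
print(Ls.shape, sigma_n.min() > 0)
```

The rows of `Ls` can then be fed to any of the vector-space constructions above (RND, FW, or GIGA), with Euclidean operations approximating the function-space inner products.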


Figure 3. Comparison of different coreset constructions on the synthetic R⁵⁰ vector dataset. Fig. 3a shows a comparison of 2-norm error between the coreset L(w) and the true sum L as a function of construction iterations, demonstrating the significant improvement in quality when using GIGA. Fig. 3b shows a similar comparison of coreset size, showing that GIGA produces smaller coresets for the same computational effort. All plots show the median curve over 20 independent trials.

(a) (b)

Figure 4. Comparison of the median Fisher information distance to the true posterior for GIGA, FW, and RND on the logistic and Poisson regression models over 20 random trials. Distances are normalized by the median value of RND for comparison. In Fig. 4b, computation time is normalized by the median value required to run MCMC on the full dataset. GIGA consistently outperforms FW and RND. results plotted versus total computation time. This confirms Like previous algorithms, GIGA is simple to implement, that across a variety of models and datasets, GIGA provides has low computational cost, and has no tuning parameters. significant improvements in posterior error over the state of But in contrast to past work, GIGA scales the coreset log- the art. likelihood optimally, providing significant improvements in the quality of posterior approximation. The theoretical 5. Conclusion guarantees and experimental results presented in this work reflect this improvement. This paper presented greedy iterative geodesic ascent (GIGA), a novel Bayesian coreset construction algorithm. Bayesian Coreset Construction via Greedy Iterative Geodesic Ascent
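As the conclusion notes, GIGA is simple to implement. For illustration, here is a minimal numpy sketch of the greedy geodesic selection loop for finite-dimensional log-likelihood vectors ℓ_n. All names are ours, and a numerical line search stands in for the paper's closed-form step size, so this is a sketch of the idea rather than the reference implementation:

```python
import numpy as np

def giga_sketch(ell, M, grid=200):
    """Greedily build a coreset of at most M points whose weighted sum
    approximates L = sum_n ell_n in direction, then rescale by least squares.
    ell: (N, d) array of nonzero log-likelihood vectors ell_n."""
    N, d = ell.shape
    norms = np.linalg.norm(ell, axis=1)
    u = ell / norms[:, None]                  # normalized atoms
    L = ell.sum(axis=0)
    uL = L / np.linalg.norm(L)                # normalized full sum
    w = np.zeros(N)
    w[int(np.argmax(u @ uL))] = 1.0           # best single atom to start
    for _ in range(M - 1):
        Lw = w @ u
        uw = Lw / np.linalg.norm(Lw)
        dt = uL - (uL @ uw) * uw              # geodesic direction toward uL
        if np.linalg.norm(dt) < 1e-12:
            break
        dtn = u - (u @ uw)[:, None] * uw      # per-atom geodesic directions
        scores = dtn @ dt / np.maximum(
            np.linalg.norm(dtn, axis=1) * np.linalg.norm(dt), 1e-12)
        n = int(np.argmax(scores))            # greedy selection
        # numerical line search along the geodesic (stand-in for the
        # closed-form step in the paper)
        gam = np.linspace(0.0, 1.0, grid)
        cand = np.outer(1 - gam, uw) + np.outer(gam, u[n])
        align = cand @ uL / np.maximum(np.linalg.norm(cand, axis=1), 1e-12)
        g = float(gam[np.argmax(align)])
        w *= 1 - g
        w[n] += g
    # rescale optimally: argmin_a || a * (w @ u) - L ||
    Lw = w @ u
    a = (Lw @ L) / (Lw @ Lw)
    return a * w / norms                      # weights on the original ell_n
```

The returned weight vector has at most M nonzero entries; MCMC then runs on the weighted coreset log-likelihood Σ_n w_n ℓ_n as usual.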

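For concreteness, the per-datapoint log-likelihoods of the logistic and Poisson regression models in Eqs. (37) and (38) translate directly into a few lines of numpy. Function names are ours; the −log(y_n!) constant of the Poisson likelihood is dropped since it does not depend on θ:

```python
import numpy as np

def logistic_loglik(theta, X, y):
    """Eq. (37): log p(y_n | x_n, theta) = -log(1 + exp(-y_n z_n^T theta)),
    with z_n = [x_n; 1] and y_n in {-1, +1}."""
    Z = np.hstack([X, np.ones((X.shape[0], 1))])
    return -np.logaddexp(0.0, -y * (Z @ theta))

def poisson_loglik(theta, X, y):
    """Eq. (38): y_n ~ Poiss(lambda_n) with lambda_n = log(1 + exp(z_n^T theta)).
    The -log(y_n!) term is constant in theta and omitted."""
    Z = np.hstack([X, np.ones((X.shape[0], 1))])
    lam = np.logaddexp(0.0, Z @ theta)
    return y * np.log(lam) - lam
```

Using `np.logaddexp(0, a)` for log(1 + e^a) keeps both likelihoods numerically stable for large |z_nᵀθ|.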
Acknowledgments

This research is supported by an MIT Lincoln Laboratory Advanced Concepts Committee Award, ONR grant N00014-17-1-2072, a Google Faculty Research Award, and an ARO YIP Award.

References

Angelino, Elaine, Johnson, Matthew, and Adams, Ryan. Patterns of scalable Bayesian inference. Foundations and Trends in Machine Learning, 9(1–2):1–129, 2016.
Barron, Andrew, Cohen, Albert, Dahmen, Wolfgang, and DeVore, Ronald. Approximation and learning by greedy algorithms. The Annals of Statistics, 36(1):64–94, 2008.
Boche, Holger, Calderbank, Robert, Kutyniok, Gitta, and Vybíral, Jan. A survey of compressed sensing. In Boche, Holger, Calderbank, Robert, Kutyniok, Gitta, and Vybíral, Jan (eds.), Compressed Sensing and its Applications: MATHEON Workshop 2013. Birkhäuser, 2015.
Broderick, Tamara, Boyd, Nicholas, Wibisono, Andre, Wilson, Ashia, and Jordan, Michael. Streaming variational Bayes. In Advances in Neural Information Processing Systems, 2013.
Campbell, Trevor and Broderick, Tamara. Automated scalable Bayesian inference via Hilbert coresets. arXiv:1710.05053, 2017.
Campbell, Trevor and How, Jonathan. Approximate decentralized Bayesian inference. In Uncertainty in Artificial Intelligence, 2014.
Campbell, Trevor, Straub, Julian, Fisher III, John W., and How, Jonathan. Streaming, distributed variational inference for Bayesian nonparametrics. In Advances in Neural Information Processing Systems, 2015.
Candès, Emmanuel and Tao, Terence. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
Candès, Emmanuel and Tao, Terence. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
Chen, Scott, Donoho, David, and Saunders, Michael. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 1999.
Chen, Sheng, Billings, Stephen, and Luo, Wan. Orthogonal least squares methods and their application to non-linear system identification. International Journal of Control, 50(5):1873–1896, 1989.
Chen, Yutian, Welling, Max, and Smola, Alex. Super-samples from kernel herding. In Uncertainty in Artificial Intelligence, 2010.
Dieng, Adji, Tran, Dustin, Ranganath, Rajesh, Paisley, John, and Blei, David. Variational inference via χ upper bound minimization. In Advances in Neural Information Processing Systems, 2017.
Donoho, David. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
Frank, Marguerite and Wolfe, Philip. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.
Freund, Yoav and Schapire, Robert. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
Guélat, Jacques and Marcotte, Patrice. Some comments on Wolfe's 'away step'. Mathematical Programming, 35:110–119, 1986.
Hoffman, Matthew and Gelman, Andrew. The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1351–1381, 2014.
Hoffman, Matthew, Blei, David, Wang, Chong, and Paisley, John. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.
Huggins, Jonathan, Campbell, Trevor, and Broderick, Tamara. Coresets for Bayesian logistic regression. In Advances in Neural Information Processing Systems, 2016.
Jaggi, Martin. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In International Conference on Machine Learning, 2013.
Jordan, Michael, Ghahramani, Zoubin, Jaakkola, Tommi, and Saul, Lawrence. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.
Lacoste-Julien, Simon and Jaggi, Martin. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, 2015.
Locatello, Francesco, Tschannen, Michael, Rätsch, Gunnar, and Jaggi, Martin. Greedy algorithms for cone constrained optimization with convergence guarantees. In Advances in Neural Information Processing Systems, 2017.
Mallat, Stéphane and Zhang, Zhifeng. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.
Neal, Radford. MCMC using Hamiltonian dynamics. In Brooks, Steve, Gelman, Andrew, Jones, Galin, and Meng, Xiao-Li (eds.), Handbook of Markov chain Monte Carlo, chapter 5. CRC Press, 2011.
Ranganath, Rajesh, Gerrish, Sean, and Blei, David. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, 2014.
Robert, Christian and Casella, George. Monte Carlo Statistical Methods. Springer, 2nd edition, 2004.
Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58(1):267–288, 1996.
Tropp, Joel. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.
Wainwright, Martin and Jordan, Michael. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

A. Additional results

A.1. Orthonormal vectors

In this experiment, we generated a dataset of 5000 unit vectors in R^5000, each aligned with one of the coordinate axes. This dataset is exactly that used in the proof of Proposition 2.1, except that the number of datapoints N is fixed to 5000. We constructed coresets for each of the datasets via uniformly random subsampling (RND), Frank–Wolfe (FW), and GIGA. We compared the algorithms on two metrics: reconstruction error, as measured by the 2-norm between L(w) and L; and representation efficiency, as measured by the size of the coreset. Fig. 5 shows the results of the experiment, with reconstruction error in Fig. 5a and coreset size in Fig. 5b. As expected, for early iterations FW performs about as well as uniformly random subsampling, as these algorithms generate equivalent coresets (up to some reordering of the unit vectors) with high probability. FW only finds a good coreset after all 5000 points in the dataset have been added. Neither of these algorithms correctly scales the coreset; in contrast, GIGA scales its coreset correctly, providing a significant reduction in error.

A.2. Alternate inference algorithms

We reran the same experiment as described in Section 4.3, except we swapped the inference algorithm for random-walk Metropolis–Hastings (RWMH) and the No-U-Turn Sampler (NUTS) (Hoffman & Gelman, 2014). When using RWMH, we simulated a total of 50,000 steps: 25,000 warmup steps including covariance adaptation with a target acceptance rate of 0.234, and 25,000 sampling steps thinned by a factor of 5, yielding 5,000 posterior samples. If the acceptance rate for the latter 25,000 steps was not between 0.15 and 0.7, we reran the procedure. When using NUTS, we simulated a total of 6,000 steps: 1,000 warmup steps including leapfrog step size adaptation with a target acceptance rate of 0.8, and 5,000 sampling steps.

The results for these experiments are shown in Figs. 6 and 7, and generally corroborate the results from the experiments using Hamiltonian Monte Carlo in the main text. One difference when using NUTS is that the performance versus computation time appears to follow an "S"-shaped curve, which is caused by the dynamic path-length adaptation provided by NUTS. Consider the log-likelihood of logistic regression, which has a "nearly linear" region and a "nearly flat" region. When the coreset is small, there are directions in latent space that point along "nearly flat" regions; along these directions, u-turns happen only after long periods of travel. When the coreset reaches a certain size, these "nearly flat" directions are all removed, and u-turns happen more frequently. Thus we expect the computation time as a function of coreset size to initially increase smoothly, then drop quickly, followed by a final smooth increase, in agreement with Fig. 7b.

Figure 5. Comparison of different coreset constructions on the synthetic axis-aligned vector dataset. Fig. 5a shows a comparison of 2-norm error between the coreset L(w) and the true sum L as a function of construction iterations. Fig. 5b shows a similar comparison of coreset size.

Figure 6. Results for the experiment described in Section 4.3 with posterior inference via random-walk Metropolis–Hastings.

B. Technical Results and Proofs

Proof of Lemma 3.5. By setting s_m = ||L_m|| / ||L|| for each m ∈ [N] in Eq. (29), we have that τ ≥ σ/||L|| > 0. Now suppose ε ≤ 0; then there exists some conic combination d of (d_∞m)_{m=1}^N for which ||d|| = 1, ⟨d, ℓ⟩ = 0, and for all m ∈ [N], ⟨−d, d_∞m⟩ ≤ 0. There must exist at least one index n ∈ [N] for which ⟨−d, d_∞n⟩ < 0, since otherwise d is not in the linear span of (d_∞m)_{m=1}^N. This also implies ||d_∞n|| > 0 and hence ||ℓ_n − ⟨ℓ_n, ℓ⟩ℓ|| > 0. Then ⟨−d, Σ_{m=1}^N ξ_m d_∞m⟩ < 0 for any ξ ∈ Δ^{N−1} with ξ_n > 0. But setting ξ_n ∝ σ_n ||ℓ_n − ⟨ℓ_n, ℓ⟩ℓ|| results in Σ_{n=1}^N ξ_n d_∞n = 0, and we have a contradiction. ∎

Proof of Lemma 3.6. We begin with the τ√J_t bound. For any ξ ∈ Δ^{N−1},

    ⟨d_t, d_{tn_t}⟩ = max_{n∈[N]} ⟨d_t, d_tn⟩ ≥ Σ_{n=1}^N ξ_n ⟨d_t, d_tn⟩.    (40)

Suppose that ℓ = Σ_{n=1}^N s_n ℓ_n for some s ∈ R_+^N. Setting ξ_n ∝ s_n ||ℓ_n − ⟨ℓ_n, ℓ(w_t)⟩ℓ(w_t)|| yields

    ⟨d_t, d_{tn_t}⟩ ≥ C^{−1} ||ℓ − ⟨ℓ, ℓ(w_t)⟩ℓ(w_t)||,    (41)
    C := Σ_{n=1}^N s_n ||ℓ_n − ⟨ℓ_n, ℓ(w_t)⟩ℓ(w_t)||.    (42)

Noting that the norms satisfy ||ℓ − ⟨ℓ, ℓ(w_t)⟩ℓ(w_t)|| = √J_t and ||ℓ_n − ⟨ℓ_n, ℓ(w_t)⟩ℓ(w_t)|| ≤ 1, we have

    ⟨d_t, d_{tn_t}⟩ ≥ ||s||_1^{−1} √J_t.    (43)

Maximizing over all valid choices of s yields

    ⟨d_t, d_{tn_t}⟩ ≥ τ √J_t.    (44)

Next, we develop the f(J_t) bound. Note that

    Σ_{n=1}^N w_tn ||ℓ_n − ⟨ℓ_n, ℓ⟩ℓ|| d_∞n = Σ_{n=1}^N w_tn (ℓ_n − ⟨ℓ_n, ℓ⟩ℓ)    (45)
                                           = ℓ(w_t) − ⟨ℓ(w_t), ℓ⟩ℓ,    (46)

so we can express ℓ(w_t) = √J_t d + √(1−J_t) ℓ and d_t = √J_t ℓ − √(1−J_t) d for some vector d that is a conic combination of (d_∞n)_{n=1}^N with ||d|| = 1 and ⟨d, ℓ⟩ = 0. Then by the definition of ε in Eq. (30) and Lemma 3.5, there exists

an n ∈ [N] such that ⟨−d, d_∞n⟩ ≥ ε > 0. Therefore,

    ⟨d_t, d_{tn_t}⟩    (47)
    ≥ ⟨d_t, d_tn⟩    (48)
    = ⟨ √J_t ℓ − √(1−J_t) d, (ℓ_n − ⟨ℓ_n, ℓ(w_t)⟩ℓ(w_t)) / ||ℓ_n − ⟨ℓ_n, ℓ(w_t)⟩ℓ(w_t)|| ⟩    (49)
    = ( √(1−J_t) ⟨−d, ℓ_n⟩ + √J_t ⟨ℓ, ℓ_n⟩ ) / √( 1 − ( √(1−J_t) ⟨ℓ_n, ℓ⟩ + √J_t ⟨ℓ_n, d⟩ )² )    (50)
    = ( √(1−J_t) √(1−⟨ℓ_n, ℓ⟩²) ⟨−d, d_∞n⟩ + √J_t ⟨ℓ, ℓ_n⟩ ) / √( 1 − ( √(1−J_t) ⟨ℓ_n, ℓ⟩ + √J_t √(1−⟨ℓ_n, ℓ⟩²) ⟨d, d_∞n⟩ )² ).    (51)

We view this bound as a function of the two variables ⟨ℓ, ℓ_n⟩ and ⟨−d, d_∞n⟩, and we view the worst-case bound as the minimization over these variables. We further lower-bound by removing the coupling between them. Fixing ⟨−d, d_∞n⟩, the derivative in ⟨ℓ, ℓ_n⟩ is always nonnegative, and note that ⟨ℓ_n, ℓ⟩ > −1 since otherwise ⟨−d, d_∞n⟩ = 0 by the remark after Eq. (18), so setting

    β = 0 ∧ min_{n∈[N]} { ⟨ℓ, ℓ_n⟩ s.t. ⟨ℓ, ℓ_n⟩ > −1 },    (52)

we have

    ⟨d_t, d_{tn_t}⟩ ≥ ( √(1−J_t) √(1−β²) ⟨−d, d_∞n⟩ + √J_t β ) / √( 1 − ( √(1−J_t) β + √J_t √(1−β²) ⟨d, d_∞n⟩ )² ).    (53)–(54)

We add {0} into the minimization since β ≤ 0 guarantees that the derivative of the above with respect to J_t is nonpositive (which we will require in proving the main theorem). For all J_t small enough such that √(1−J_t) √(1−β²) ε + √J_t β ≥ 0, the derivative of the above with respect to ⟨−d, d_∞n⟩ is nonnegative. Therefore, minimizing yields

    ⟨d_t, d_{tn_t}⟩ ≥ ( √(1−J_t) √(1−β²) ε + √J_t β ) / √( 1 − ( √(1−J_t) β − √J_t √(1−β²) ε )² ),    (55)

which holds for any such small enough J_t. But note that we have already proven the ⟨d_t, d_{tn_t}⟩ ≥ τ√J_t bound, which is always nonnegative; so the only time the current bound is "active" is when it is itself nonnegative, i.e. when J_t is small enough. Therefore the bound

    ⟨d_t, d_{tn_t}⟩ ≥ τ√J_t ∨ ( √(1−J_t) √(1−β²) ε + √J_t β ) / √( 1 − ( √(1−J_t) β − √J_t √(1−β²) ε )² )    (56)

holds for all J_t ∈ [0, 1]. ∎

Figure 7. Results for the experiment described in Section 4.3 with posterior inference via NUTS.

C. Cap-tree Search

When choosing the next point to add to the coreset, we need to solve the following maximization with O(N) complexity:

    n_t = arg max_{n∈[N]} ⟨ℓ_n, ℓ − ⟨ℓ, ℓ(w_t)⟩ℓ(w_t)⟩ / √( 1 − ⟨ℓ_n, ℓ(w_t)⟩² ).    (57)

One option to potentially reduce this complexity is to first partition the data in a tree structure, and use the tree structure for faster search. However, we found that in practice (1) the cost of constructing the tree structure outlined below outweighs the benefit of faster search later on, and (2) the computational gains diminish significantly with high-dimensional vectors ℓ_n. We include the details of our proposed cap-tree below, and leave more efficient construction and search as an open problem for future work.

Each node in the tree is a spherical "cap" on the surface of the unit sphere, defined by a central direction ξ, ||ξ|| = 1, and a dot-product bound r ∈ [−1, 1], with the property that all data in child leaves of that node satisfy ⟨ℓ_n, ξ⟩ ≥ r. Then we can upper/lower bound the search objective for such data given ξ and r. If we progress down the tree, keeping track of the best lower bound, we may be able to prune large quantities of data if the upper bound of any node is less than the current best lower bound.

For the lower bound, we evaluate the objective at the vector ℓ_n closest to ξ. For the upper bound, define u := (ℓ − ⟨ℓ, ℓ(w_t)⟩ℓ(w_t)) / ||ℓ − ⟨ℓ, ℓ(w_t)⟩ℓ(w_t)|| and v := ℓ(w_t). Then ||u|| = ||v|| = 1 and ⟨u, v⟩ = 0. The upper bound is

    max_ζ ⟨ζ, u⟩ / √( 1 − ⟨ζ, v⟩² )    s.t.  ⟨ζ, ξ⟩ ≥ r,  ||ζ|| = 1.    (58)

If we write ζ = α_u u + α_v v + Σ_i α_i z_i, where the z_i complete the basis of u and v, and ξ = β_u u + β_v v + Σ_i β_i z_i, then Eq. (58) becomes

    max_{α∈R^d} α_u / √(1 − α_v²)    (59)
    s.t.  α_u β_u + α_v β_v + Σ_i α_i β_i ≥ r,
          α_u² + α_v² + Σ_i α_i² = 1.

Noting that α_i doesn't appear in the objective, we maximize Σ_i α_i β_i to find the equivalent optimization

    max_{α_u, α_v} α_u / √(1 − α_v²)    (60)
    s.t.  α_u β_u + α_v |β_v| + ||β|| √(1 − α_u² − α_v²) ≥ r,    (61)
          α_u² + α_v² ≤ 1,    (62)

where the absolute value on β_v comes from the fact that we can choose the sign of α_v arbitrarily, ensuring the optimum has α_v ≥ 0. Now define

    γ := α_u / √(1 − α_v²),    η := 1 / √(1 − α_v²),    (63)

so that the optimization becomes

    max_{γ, η} γ    (64)
    s.t.  γ β_u + |β_v| √(η² − 1) + ||β|| √(1 − γ²) ≥ r η,    (65)
          γ² ≤ 1,  η ≥ 1.    (66)

Since η is now decoupled from the optimization, we can solve

    max_{η≥1} |β_v| √(η² − 1) − r η    (67)

to make the feasible region in γ as large as possible. If |β_v| > r, we maximize Eq. (67) by sending η → ∞, yielding a maximum of 1 in the original optimization. Otherwise, note that at η = 1 the derivative of the objective is +∞, so we know the constraint η = 1 is not active. Therefore, taking the derivative and setting it to 0 yields

    0 = |β_v| η / √(η² − 1) − r    (68)
    η = √( r² / (r² − |β_v|²) ).    (69)

Substituting back into the original optimization,

    max_γ γ    (70)
    s.t.  γ β_u + ||β|| √(1 − γ²) ≥ √(r² − |β_v|²),    (71)
          γ² ≤ 1.    (72)

If β_u ≥ √(r² − |β_v|²), then γ = 1 is feasible and the optimum is 1. Otherwise, note that at γ = −1 the derivative of the constraint is +∞ and the derivative of the objective is 1, so the constraint γ = −1 is not active. Therefore, we can solve the unconstrained optimization by taking the derivative and setting it to 0, yielding

    γ = ( β_u √(r² − β_v²) + ||β|| √(1 − r²) ) / ( ||β||² + β_u² ).    (73)

Therefore, the upper bound is as follows:

    U = 1                                                             if |β_v| > r
        1                                                             if β_u ≥ √(r² − β_v²)
        ( β_u √(r² − β_v²) + ||β|| √(1 − r²) ) / ( ||β||² + β_u² )    otherwise.    (74)

D. Datasets

The Phishing dataset is available online at https://www.csie.ntu.edu.tw/˜cjlin/libsvmtools/datasets/binary.html. The DS1 dataset is available online at http://komarix.org/ac/ds/. The BikeTrips dataset is available online at http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. The AirportDelays dataset was constructed using flight delay data from http://stat-computing.org/dataexpo/2009/the-data.html and historical weather information from https://www.wunderground.com/history/.
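The case analysis in Eq. (74) reduces to a short function. Below is a numpy sketch with argument names of our choosing: xi is the cap direction ξ, r the dot-product bound, and u, v the orthonormal unit vectors defined above; ||β|| is recovered from ||ξ|| = 1.

```python
import numpy as np

def cap_upper_bound(xi, r, u, v):
    """Upper bound U of Eq. (74) on the search objective over a cap
    {zeta : <zeta, xi> >= r, ||zeta|| = 1}; assumes ||u|| = ||v|| = 1
    and <u, v> = 0."""
    bu, bv = xi @ u, xi @ v
    # ||beta|| = norm of xi's component orthogonal to both u and v
    bnorm = np.sqrt(max(1.0 - bu**2 - bv**2, 0.0))
    if abs(bv) > r:                       # first case of Eq. (74)
        return 1.0
    if bu >= np.sqrt(r**2 - bv**2):       # second case
        return 1.0
    return (bu * np.sqrt(r**2 - bv**2)    # third case, Eq. (73)
            + bnorm * np.sqrt(1.0 - r**2)) / (bnorm**2 + bu**2)
```

As a sanity check of the third branch: for ξ orthogonal to both u and v with r = 1/2, the maximizer is ζ = r ξ + √(1 − r²) u, giving objective √(3)/2, which the formula reproduces.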