Distributed Estimation of the Inverse Hessian by Determinantal Averaging


arXiv:1905.11546v1 [cs.LG] 28 May 2019

Michał Dereziński
Department of Statistics, University of California, Berkeley

Michael W. Mahoney
ICSI and Department of Statistics, University of California, Berkeley

Abstract

In distributed optimization and distributed numerical linear algebra, we often encounter an inversion bias: if we want to compute a quantity that depends on the inverse of a sum of distributed matrices, then the sum of the inverses does not equal the inverse of the sum. An example of this occurs in distributed Newton’s method, where we wish to compute (or implicitly work with) the inverse Hessian multiplied by the gradient. In this case, locally computed estimates are biased, and so taking a uniform average will not recover the correct solution. To address this, we propose determinantal averaging, a new approach for correcting the inversion bias. This approach involves reweighting the local estimates of the Newton’s step proportionally to the determinant of the local Hessian estimate, and then averaging them together to obtain an improved global estimate. This method provides the first known distributed Newton step that is asymptotically consistent, i.e., it recovers the exact step in the limit as the number of distributed partitions grows to infinity. To show this, we develop new expectation identities and moment bounds for the determinant and adjugate of a random matrix. Determinantal averaging can be applied not only to Newton’s method, but to computing any quantity that is a linear transformation of a matrix inverse, e.g., taking a trace of the inverse covariance matrix, which is used in data uncertainty quantification.

1 Introduction

Many problems in machine learning and optimization require that we produce an accurate estimate of a square matrix $H$ (such as the Hessian of a loss function or a sample covariance), while having access to many copies of some unbiased estimator of $H$, i.e., a random matrix $\widehat{H}$ such that $\mathbb{E}[\widehat{H}] = H$. In these cases, taking a uniform average of those independent copies provides a natural strategy for boosting the estimation accuracy, essentially by making use of the law of large numbers: $\frac{1}{m}\sum_{t=1}^m \widehat{H}_t \to H$. For many other problems, however, we are more interested in the inverse (Hessian/covariance) matrix $H^{-1}$, and it is necessary or desirable to work with $\widehat{H}^{-1}$ as the estimator. Here, a naïve averaging approach has certain fundamental limitations (described in more detail below). The basic reason for this is that $\mathbb{E}[\widehat{H}^{-1}] \neq H^{-1}$, i.e., that there is what may be called an inversion bias.

In this paper, we propose a method to address this inversion bias challenge. The method uses a weighted average, where the weights are carefully chosen to compensate for and correct the bias. Our motivation comes from distributed Newton’s method (explained shortly), where combining independent estimates of the inverse Hessian is desired, but our method is more generally applicable, and so we first state our key ideas in a more general context.

Theorem 1 Let $s_i$ be independent random variables and $Z_i$ be fixed square rank-1 matrices. If $\widehat{H} = \sum_i s_i Z_i$ is invertible almost surely, then the inverse of the matrix $H = \mathbb{E}[\widehat{H}]$ can be expressed as:

$$H^{-1} = \frac{\mathbb{E}\big[\det(\widehat{H})\,\widehat{H}^{-1}\big]}{\mathbb{E}\big[\det(\widehat{H})\big]}.$$

To demonstrate the implications of Theorem 1, suppose that our goal is to estimate $F(H^{-1})$ for some linear function $F$. For example, in the case of Newton’s method $F(H^{-1}) = H^{-1}g$, where $g$ is the gradient and $H$ is the Hessian. Another example would be $F(H^{-1}) = \mathrm{tr}(H^{-1})$, where $H$ is the covariance matrix of a dataset and $\mathrm{tr}(\cdot)$ is the matrix trace, which is useful for uncertainty quantification. For these and other cases, consider the following estimation of $F(H^{-1})$, which takes an average of the individual estimates $F(\widehat{H}_t^{-1})$, each weighted by the determinant of $\widehat{H}_t$, i.e.,

$$\text{Determinantal Averaging:}\qquad \hat{F}_m = \frac{\sum_{t=1}^m a_t\, F(\widehat{H}_t^{-1})}{\sum_{t=1}^m a_t}, \qquad a_t = \det(\widehat{H}_t).$$

By applying the law of large numbers (separately to the numerator and the denominator), Theorem 1 easily implies that if $\widehat{H}_1, \dots, \widehat{H}_m$ are i.i.d. copies of $\widehat{H}$ then this determinantal averaging estimator is asymptotically consistent, i.e., $\hat{F}_m \to F(H^{-1})$, almost surely. This determinantal averaging estimator is particularly useful when problem constraints do not allow us to compute $F\big((\frac{1}{m}\sum_t \widehat{H}_t)^{-1}\big)$, e.g., when the matrices are distributed and not easily combined.

To establish finite sample convergence guarantees for estimators obtained via determinantal averaging, we establish the following matrix concentration result. We state it separately since it is technically interesting and since its proof requires novel bounds for the higher moments of the determinant of a random matrix, which are likely to be of independent interest. Below and throughout the paper, $C$ denotes an absolute constant and “$\preceq$” is the Löwner order on positive semi-definite (psd) matrices.

Theorem 2 Let $\widehat{H} = \frac{1}{k}\sum_{i=1}^n b_i Z_i + B$ and $H = \mathbb{E}[\widehat{H}]$, where $B$ is a positive definite $d \times d$ matrix and $b_i$ are i.i.d. $\mathrm{Bernoulli}(\frac{k}{n})$. Moreover, assume that all $Z_i$ are psd, $d \times d$ and rank-1. If $k \ge C\,\frac{\mu d^2}{\eta^2}\log^3\frac{d}{\delta}$ for $\eta \in (0,1)$ and $\mu = \max_i \|Z_i H^{-1}\|/d$, then

$$\Big(1 - \frac{\eta}{\sqrt{m}}\Big)\cdot H^{-1} \;\preceq\; \frac{\sum_{t=1}^m a_t \widehat{H}_t^{-1}}{\sum_{t=1}^m a_t} \;\preceq\; \Big(1 + \frac{\eta}{\sqrt{m}}\Big)\cdot H^{-1} \qquad \text{with probability} \ge 1 - \delta,$$

where $\widehat{H}_1, \dots, \widehat{H}_m \overset{\text{i.i.d.}}{\sim} \widehat{H}$ and $a_t = \det(\widehat{H}_t)$.

1.1 Distributed Newton’s method

To illustrate how determinantal averaging can be useful in the context of distributed optimization, consider the task of batch minimization of a convex loss over vectors $w \in \mathbb{R}^d$, defined as follows:

$$\mathcal{L}(w) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n \ell_i(w^\top x_i) + \frac{\lambda}{2}\|w\|^2, \tag{1}$$

where $\lambda > 0$, and $\ell_i$ are convex, twice differentiable and smooth. Given a vector $w$, Newton’s method dictates that the correct way to move towards the optimum is to perform an update $\tilde{w} = w - p$, with $p = \nabla^{-2}\mathcal{L}(w)\,\nabla\mathcal{L}(w)$, where $\nabla^{-2}\mathcal{L}(w) = (\nabla^2\mathcal{L}(w))^{-1}$ denotes the inverse Hessian of $\mathcal{L}$ at $w$.¹ Here, the Hessian and gradient are:

$$\nabla^2\mathcal{L}(w) = \frac{1}{n}\sum_i \ell_i''(w^\top x_i)\, x_i x_i^\top + \lambda I, \qquad\text{and}\qquad \nabla\mathcal{L}(w) = \frac{1}{n}\sum_i \ell_i'(w^\top x_i)\, x_i + \lambda w.$$

For our distributed Newton application, we study a distributed computational model, where a single machine has access to a subsampled version of $\mathcal{L}$ with sample size parameter $k \ll n$:

$$\widehat{\mathcal{L}}(w) \overset{\text{def}}{=} \frac{1}{k}\sum_{i=1}^n b_i\,\ell_i(w^\top x_i) + \frac{\lambda}{2}\|w\|^2, \qquad\text{where } b_i \sim \mathrm{Bernoulli}(k/n). \tag{2}$$

Note that $\widehat{\mathcal{L}}$ accesses on average $k$ loss components $\ell_i$ ($k$ is the expected local sample size), and moreover, $\mathbb{E}[\widehat{\mathcal{L}}(w)] = \mathcal{L}(w)$ for any $w$. The goal is to compute local estimates of the Newton’s step $p$ in a communication-efficient manner (i.e., by only sending $O(d)$ parameters from/to a single machine), then combine them into a better global estimate. The gradient has size $O(d)$ so it can be computed exactly within this communication budget (e.g., via map-reduce); however, the Hessian has to be approximated locally by each machine. Note that other computational models can be considered, such as those where the global gradient is not computed (and local gradients are used instead).

Under the constraints described above, the most natural strategy is to use directly the Hessian of the locally subsampled loss $\widehat{\mathcal{L}}$ (see, e.g., GIANT [WRKXM18]), resulting in the approximate Newton step $\hat{p} = \nabla^{-2}\widehat{\mathcal{L}}(w)\,\nabla\mathcal{L}(w)$. Suppose that we independently construct $m$ i.i.d. copies of this estimate: $\hat{p}_1, \dots, \hat{p}_m$ (here, $m$ is the number of machines). Then, for sufficiently large $m$, taking a simple average of the estimates will stop converging to $p$ because of the inversion bias: $\frac{1}{m}\sum_{t=1}^m \hat{p}_t \to \mathbb{E}[\hat{p}] \neq p$. Figure 1 shows this by plotting the estimation error (in Euclidean distance) of the averaged Newton step estimators, when the weights are uniform and determinantal (for more details and plots, see Appendix C). The only way to reduce the estimation error beyond a certain point is to increase the local sample size $k$ (thereby reducing the inversion bias), which raises the computational cost per machine. Determinantal averaging corrects the inversion bias so that estimation error can always be decreased by adding more machines without increasing the local sample size.

Figure 1: Newton step estimation error versus number of machines, averaged over 100 runs (shading is standard error) for a libsvm dataset [CL11]. More plots in Appendix C.

From the preceding discussion we can easily show that determinantal averaging leads to an asymptotically consistent estimator. This is a corollary of Theorem 1, as proven in Section 2.
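The contrast between uniform and determinantal weights can be reproduced on a toy problem. The following is a minimal sketch, not the experiment behind Figure 1: it assumes a squared loss $\ell_i(z) = (z - y_i)^2/2$ evaluated at $w = 0$ (so $\ell_i'' = 1$), synthetic Gaussian data, and a hypothetical helper name `newton_step_errors`; all sizes are illustrative choices.

```python
import numpy as np

def newton_step_errors(m, n=8, d=2, k=4, lam=0.5, seed=0):
    """Euclidean errors of uniform vs. determinantal averaging of m local steps.

    Toy setup (an assumption, not the paper's experiment): squared loss
    l_i(z) = (z - y_i)^2 / 2 at w = 0, so every l_i'' equals 1.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    # Exact Hessian, gradient and Newton step of the full loss (1) at w = 0.
    H = X.T @ X / n + lam * np.eye(d)
    g = -X.T @ y / n
    p = np.linalg.solve(H, g)

    # One Bernoulli(k/n) subsample per simulated machine, as in (2).
    b = (rng.random((m, n)) < k / n).astype(float)
    H_hat = np.einsum("ti,ij,il->tjl", b, X, X) / k + lam * np.eye(d)

    # Local Newton steps: local Hessian estimate, exact global gradient.
    rhs = np.tile(g, (m, 1))[:, :, None]
    p_hat = np.linalg.solve(H_hat, rhs)[:, :, 0]
    a = np.linalg.det(H_hat)  # determinantal weights a_t = det(H_t)

    uniform = p_hat.mean(axis=0)
    determinantal = (a[:, None] * p_hat).sum(axis=0) / a.sum()
    return np.linalg.norm(uniform - p), np.linalg.norm(determinantal - p)
```

Calling `newton_step_errors(m)` for increasing `m` should show the uniform error leveling off at the inversion bias while the determinantal error keeps shrinking, which is the qualitative behavior described above.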

¹ Clearly, one would not actually compute the inverse of the Hessian explicitly [XRKM17, YXRKM18]. We describe it this way for simplicity. Our results hold whether or not the inverse operator is computed explicitly.

Corollary 3 Let $\{\widehat{\mathcal{L}}_t\}_{t=1}^{\infty}$ be i.i.d. samples of (2) and define $a_t = \det\big(\nabla^2\widehat{\mathcal{L}}_t(w)\big)$. Then:

$$\frac{\sum_{t=1}^m a_t\,\hat{p}_t}{\sum_{t=1}^m a_t} \;\xrightarrow[m\to\infty]{\text{a.s.}}\; p, \qquad\text{where } \hat{p}_t = \nabla^{-2}\widehat{\mathcal{L}}_t(w)\,\nabla\mathcal{L}(w) \text{ and } p = \nabla^{-2}\mathcal{L}(w)\,\nabla\mathcal{L}(w).$$

The (unnormalized) determinantal weights can be computed locally in the same time as it takes to compute the Newton estimates so they do not add to the overall cost of the procedure. While this result is only an asymptotic statement, it holds with virtually no assumptions on the loss function (other than twice-differentiability) or the expected local sample size $k$. However, with some additional assumptions we will now establish a convergence guarantee with a finite number of machines $m$ by bounding the estimation error for the determinantal averaging estimator of the Newton step.

In the next result, we use Mahalanobis distance, denoted $\|v\|_M = \sqrt{v^\top M v}$, to measure the error of the Newton step estimate (i.e., the deviation from optimum $p$), with $M$ chosen as the Hessian of $\mathcal{L}$. This choice is motivated by standard convergence analysis of Newton’s method, discussed next. This is a corollary of Theorem 2, as explained in Section 3.

Corollary 4 For any $\delta, \eta \in (0,1)$ if expected local sample size satisfies $k \ge C\eta^{-2}\mu d^2\log^3\frac{d}{\delta}$ then

$$\bigg\|\frac{\sum_{t=1}^m a_t\,\hat{p}_t}{\sum_{t=1}^m a_t} - p\bigg\|_{\nabla^2\mathcal{L}(w)} \le \frac{\eta}{\sqrt{m}}\cdot\|p\|_{\nabla^2\mathcal{L}(w)} \qquad\text{with probability} \ge 1 - \delta,$$

where $\mu = \frac{1}{d}\max_i \ell_i''(w^\top x_i)\,\|x_i\|^2_{\nabla^{-2}\mathcal{L}(w)}$, and $a_t$, $\hat{p}_t$ and $p$ are defined as in Corollary 3.

We next establish how this error bound impacts the convergence guarantees offered by Newton’s method. Note that under our assumptions $\mathcal{L}$ is strongly convex so there is a unique minimizer $w^* = \operatorname{argmin}_w \mathcal{L}(w)$. We ask how the distance from optimum, $\|w - w^*\|$, changes after we make an update $\tilde{w} = w - \hat{p}$. For this, we have to assume that the Hessian matrix is $L$-Lipschitz as a function of $w$. After this standard assumption, a classical analysis of the Newton’s method reveals that Corollary 4 implies the following Corollary 6 (proof in Appendix B).

Assumption 5 The Hessian is $L$-Lipschitz: $\|\nabla^2\mathcal{L}(w) - \nabla^2\mathcal{L}(\tilde{w})\| \le L\,\|w - \tilde{w}\|$ for any $w, \tilde{w} \in \mathbb{R}^d$.

Corollary 6 For any $\delta, \eta \in (0,1)$ if expected local sample size satisfies $k \ge C\eta^{-2}\mu d^2\log^3\frac{d}{\delta}$ then under Assumption 5 it holds with probability at least $1 - \delta$ that

$$\|\tilde{w} - w^*\| \le \max\Big\{\frac{\eta}{\sqrt{m}}\sqrt{\kappa}\,\|w - w^*\|,\ \frac{2L}{\lambda_{\min}}\|w - w^*\|^2\Big\} \qquad\text{for } \tilde{w} = w - \frac{\sum_{t=1}^m a_t\,\hat{p}_t}{\sum_{t=1}^m a_t},$$

where $C$, $\mu$, $a_t$ and $\hat{p}_t$ are defined as in Corollaries 3 and 4, while $\kappa$ and $\lambda_{\min}$ are the condition number and smallest eigenvalue of $\nabla^2\mathcal{L}(w)$, respectively.

The bound is a maximum of a linear and a quadratic convergence term. As $m$ goes to infinity and/or $\eta$ goes to 0 the approximation coefficient $\alpha = \frac{\eta}{\sqrt{m}}$ in the linear term disappears and we obtain exact Newton’s method, which exhibits quadratic convergence (at least locally around $w^*$). However, decreasing $\eta$ means increasing $k$ and with it the average computational cost per machine. Thus, to preserve the quadratic convergence while maintaining a computational budget per machine, as the optimization progresses we have to increase the number of machines $m$ while keeping $k$ fixed. This is only possible when we correct for the inversion bias, which is done by determinantal averaging.
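To make the update $\tilde{w} = w - \sum_t a_t\hat{p}_t / \sum_t a_t$ concrete, here is a minimal single-process sketch of the resulting iteration. Everything specific in it is an assumption made for illustration: the logistic loss with labels in $\{-1, +1\}$, the function names, the default values of `lam`, `m`, `k` and `iters`, and the use of `numpy.linalg.slogdet` to rescale the weights (the paper does not prescribe how to compute them). Machines are simulated by a loop.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad(w, X, y, lam):
    """Gradient of (1/n) sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2) ||w||^2."""
    z = y * (X @ w)
    return -(X.T @ (y * sigmoid(-z))) / len(y) + lam * w

def determinantal_newton_step(w, X, y, lam, m, k, rng):
    """Average m simulated local Newton steps with weights det(H_t)."""
    n, d = X.shape
    g = loss_grad(w, X, y, lam)              # exact gradient, O(d) to communicate
    z = y * (X @ w)
    curv = sigmoid(z) * sigmoid(-z)          # l_i''(w^T x_i) for the logistic loss
    steps, logdets = [], []
    for _ in range(m):                       # each pass plays the role of one machine
        b = rng.random(n) < k / n            # Bernoulli(k/n) subsample, as in (2)
        Xb = X[b] * np.sqrt(curv[b])[:, None]
        H_t = Xb.T @ Xb / k + lam * np.eye(d)
        steps.append(np.linalg.solve(H_t, g))
        logdets.append(np.linalg.slogdet(H_t)[1])  # H_t is positive definite
    logdets = np.array(logdets)
    a = np.exp(logdets - logdets.max())      # common rescaling leaves the ratio unchanged
    return (a[:, None] * np.array(steps)).sum(axis=0) / a.sum()

def distributed_newton(X, y, lam=0.1, m=200, k=50, iters=12, seed=0):
    """Run a fixed number of determinantally averaged Newton updates from w = 0."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - determinantal_newton_step(w, X, y, lam, m, k, rng)
    return w
```

Working with log-determinants is a design choice of this sketch: since only the ratio of weighted sums matters, subtracting the largest log-determinant before exponentiating avoids overflow when $d$ is large without changing the estimator.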

1.2 Distributed data uncertainty quantification

Here, we consider another example of when computing a compressed linear representation of the inverse matrix is important. Let $X$ be an $n \times d$ matrix where the rows $x_i^\top$ represent samples drawn from a population for statistical analysis. The sample covariance matrix $\Sigma = \frac{1}{n}X^\top X$ holds the information about the relations between the features. Assuming that $\Sigma$ is invertible, the matrix $\Sigma^{-1}$, also known as the precision matrix, is often used to establish a degree of confidence we have in the data collection [KBCG13]. The diagonal elements of $\Sigma^{-1}$ are particularly useful since they hold the variance information of each individual feature. Thus, efficiently estimating either the entire diagonal, its trace, or some subset of its entries is of practical interest [Ste97, WLK+16, BCF09]. We consider the distributed setting where data is separately stored in batches and each local covariance is modeled as:

$$\widehat{\Sigma} = \frac{1}{k}\sum_{i=1}^n b_i\, x_i x_i^\top, \qquad\text{where } b_i \sim \mathrm{Bernoulli}(k/n).$$

For each of the local covariances $\widehat{\Sigma}_1, \dots, \widehat{\Sigma}_m$, we compute its compressed uncertainty information: $F\big((\widehat{\Sigma}_t + \frac{\eta}{\sqrt{m}}I)^{-1}\big)$, where we added a small amount of ridge to ensure invertibility.² Here, $F(\cdot)$ may for example denote the trace or the vector of diagonal entries. We arrive at the following asymptotically consistent estimator for $F(\Sigma^{-1})$:

$$\hat{F}_m = \frac{\sum_{t=1}^m a_{t,m}\, F\big((\widehat{\Sigma}_t + \frac{\eta}{\sqrt{m}}I)^{-1}\big)}{\sum_{t=1}^m a_{t,m}}, \qquad\text{where } a_{t,m} = \det\Big(\widehat{\Sigma}_t + \frac{\eta}{\sqrt{m}}I\Big).$$

Note that the ridge term $\frac{\eta}{\sqrt{m}}I$ decays to zero as $m$ goes to infinity, which is why $\hat{F}_m \to F(\Sigma^{-1})$. Even though this limit holds for any local sample size $k$, in practice we should choose $k$ sufficiently large so that $\widehat{\Sigma}$ is well-conditioned. In particular, Theorem 2 implies that if $k \ge 2C\eta^{-2}\mu d^2\log^3\frac{d}{\delta}$, where $\mu = \frac{1}{d}\max_i \|x_i\|^2_{\Sigma^{-1}}$, then for $F(\cdot) = \mathrm{tr}(\cdot)$ we have $|\hat{F}_m - \mathrm{tr}(\Sigma^{-1})| \le \frac{\eta}{\sqrt{m}}\cdot\mathrm{tr}(\Sigma^{-1})$ w.p. $1 - \delta$.

² Since the ridge term vanishes as $m$ goes to infinity, we are still estimating the ridge-free quantity $F(\Sigma^{-1})$.

1.3 Related work

Many works have considered averaging strategies for combining distributed estimates, particularly in the context of statistical learning and optimization. This research is particularly important in federated learning [KBRR16, KBY+16], where data are spread out across a large network of devices with small local storage and severely constrained communication bandwidth. Using averaging to combine local estimates has been studied in a number of learning settings [MMS+09, MHM10] as well as for first-order stochastic optimization [ZWLS10, AD11]. For example, [ZDW13] examine the effectiveness of simple uniform averaging of empirical risk minimizers and also propose a bootstrapping technique to reduce the bias.

More recently, distributed averaging methods gained considerable attention in the context of second-order optimization, where the Hessian inversion bias is of direct concern. [SSZ14] propose a distributed approximate Newton-type method (DANE) which under certain assumptions exhibits low bias. This was later extended and improved upon by [ZL15, RKR+16]. The GIANT method of [WRKXM18] most closely follows our setup from Section 1.1, providing non-trivial guarantees for uniform averaging of the Newton step estimates $\hat{p}_t$ (except they use with-replacement uniform sampling, whereas we use without-replacement, but that is typically a negligible difference). A related analysis of this approach is provided in the context of ridge regression by [WGM17]. Finally, [ABH17] propose a different estimate of the Hessian inverse by using the Neumann series expansion, and use it to construct Newton step estimates which exhibit low bias when the Hessian is sufficiently well-conditioned.

Our approach is related to recent developments in determinantal subsampling techniques (e.g. volume sampling), which have been shown to correct the inversion bias in the context of least squares regression [DW17, DWH19]. However, despite recent progress [DW18, DWH18], volume sampling is still far too computationally expensive to be feasible for distributed optimization. Indeed, often uniform sampling is the only practical choice in this context.

With the exception of the expensive volume sampling-based methods, all of the approaches discussed above, even under favorable conditions, use biased estimates of the desired solution (e.g., the exact Newton step). Thus, when the number of machines grows sufficiently large, with fixed local sample size, the averaging no longer provides any improvement. This is in contrast to our determinantal averaging, which converges exactly to the desired solution and requires no expensive subsampling. Therefore, it can scale with an arbitrarily large number of machines.

2 Expectation identities for determinants and adjugates

In this section, we prove Theorem 1 and Corollary 3, establishing that determinantal averaging is asymptotically consistent. To achieve this, we establish a lemma involving two expectation identities. For a square $n \times n$ matrix $A$, we use $\mathrm{adj}(A)$ to denote its adjugate, defined as an $n \times n$ matrix whose $(i,j)$th entry is $(-1)^{i+j}\det(A_{-j,-i})$, where $A_{-j,-i}$ denotes $A$ without $j$th row and $i$th column. The adjugate matrix provides a key connection between the inverse and the determinant because for any invertible matrix $A$, we have $\mathrm{adj}(A) = \det(A)\,A^{-1}$. In the following lemma, we will also use a formula called Sylvester’s theorem, relating the adjugate and the determinant: $\det(A + uv^\top) = \det(A) + v^\top\mathrm{adj}(A)\,u$.
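Both facts about the adjugate are easy to check numerically. The sketch below builds the adjugate directly from the cofactor definition given above and evaluates the two sides of the rank-one determinant identity (also commonly known as the matrix determinant lemma) on a random matrix; the helper name `adjugate` and the test matrix are illustrative assumptions.

```python
import numpy as np

def adjugate(A):
    """Adjugate from cofactors: adj(A)[i, j] = (-1)^(i+j) det(A without row j, column i)."""
    n = A.shape[0]
    adj = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(A, j, axis=0), i, axis=1)
            adj[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return adj

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
u, v = rng.standard_normal(4), rng.standard_normal(4)

# Two sides of det(A + u v^T) = det(A) + v^T adj(A) u.
lhs = np.linalg.det(A + np.outer(u, v))
rhs = np.linalg.det(A) + v @ adjugate(A) @ u
```

Note that the cofactor construction does not require $A$ to be invertible, which is why the adjugate, rather than $\det(A)A^{-1}$, appears in the identity.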

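Lemma 7 For $A = \sum_i s_i Z_i$, where $s_i$ are independently random and $Z_i$ are square and rank-1,

$$\text{(a)}\ \ \mathbb{E}\big[\det(A)\big] = \det\big(\mathbb{E}[A]\big) \qquad\text{and}\qquad \text{(b)}\ \ \mathbb{E}\big[\mathrm{adj}(A)\big] = \mathrm{adj}\big(\mathbb{E}[A]\big).$$

Proof We use induction over the number of components in the sum. If there is only one component, i.e., $A = sZ$, then $\det(A) = 0$ a.s. unless $Z$ is $1 \times 1$, in which case (a) is trivial, and (b) follows similarly. Now, suppose we showed the hypothesis when the number of components is $n$ and let $A = \sum_{i=1}^{n+1} s_i Z_i$. Setting $Z_{n+1} = uv^\top$, we have:

$$\begin{aligned}
\mathbb{E}\big[\det(A)\big] &= \mathbb{E}\Big[\det\Big(\sum_{i=1}^n s_i Z_i + s_{n+1}uv^\top\Big)\Big] \\
\text{(Sylvester’s theorem)}\quad &= \mathbb{E}\Big[\det\Big(\sum_{i=1}^n s_i Z_i\Big) + s_{n+1}\, v^\top \mathrm{adj}\Big(\sum_{i=1}^n s_i Z_i\Big) u\Big] \\
\text{(inductive hypothesis)}\quad &= \det\Big(\mathbb{E}\Big[\sum_{i=1}^n s_i Z_i\Big]\Big) + \mathbb{E}[s_{n+1}]\, v^\top \mathrm{adj}\Big(\mathbb{E}\Big[\sum_{i=1}^n s_i Z_i\Big]\Big) u \\
\text{(Sylvester’s theorem)}\quad &= \det\Big(\mathbb{E}\Big[\sum_{i=1}^n s_i Z_i\Big] + \mathbb{E}[s_{n+1}]\, uv^\top\Big) = \det\big(\mathbb{E}[A]\big),
\end{aligned}$$

showing (a). Finally, (b) follows by applying (a) to each entry $\mathrm{adj}(A)_{ij} = (-1)^{i+j}\det(A_{-j,-i})$.

Similar expectation identities for the determinant have been given before [vdV65, DWH19, Der18]. None of them, however, apply to the random matrix $A$ as defined in Lemma 7, or even to the special case we use for analyzing distributed Newton’s method. Also, our proof method is quite different, and somewhat simpler, than those used in prior work. To our knowledge, the extension of determinantal expectation to the adjugate matrix has not previously been pointed out. We next prove Theorem 1 and Corollary 3 as consequences of Lemma 7.

Proof of Theorem 1 When $A$ is invertible, its adjugate is given by $\mathrm{adj}(A) = \det(A)\,A^{-1}$, so the lemma implies that

$$\mathbb{E}\big[\det(A)\big]\,\big(\mathbb{E}[A]\big)^{-1} = \det\big(\mathbb{E}[A]\big)\,\big(\mathbb{E}[A]\big)^{-1} = \mathrm{adj}\big(\mathbb{E}[A]\big) = \mathbb{E}\big[\mathrm{adj}(A)\big] = \mathbb{E}\big[\det(A)\,A^{-1}\big],$$

from which Theorem 1 follows immediately.

Proof of Corollary 3 The subsampled Hessian matrix used in Corollary 3 can be written as:

$$\nabla^2\widehat{\mathcal{L}}(w) = \frac{1}{k}\sum_i b_i\,\ell_i''(w^\top x_i)\, x_i x_i^\top + \lambda\sum_{i=1}^d e_i e_i^\top \overset{\text{def}}{=} \widehat{H},$$

so, letting $\widehat{H}_t = \nabla^2\widehat{\mathcal{L}}_t(w)$, Corollary 3 follows from Theorem 1 and the law of large numbers:

$$\frac{\sum_{t=1}^m a_t\,\hat{p}_t}{\sum_{t=1}^m a_t} = \frac{\frac{1}{m}\sum_{t=1}^m \det(\widehat{H}_t)\,\widehat{H}_t^{-1}}{\frac{1}{m}\sum_{t=1}^m \det(\widehat{H}_t)}\,\nabla\mathcal{L}(w) \;\xrightarrow[m\to\infty]{}\; \frac{\mathbb{E}\big[\det(\widehat{H})\,\widehat{H}^{-1}\big]}{\mathbb{E}\big[\det(\widehat{H})\big]}\,\nabla\mathcal{L}(w) = \nabla^{-2}\mathcal{L}(w)\,\nabla\mathcal{L}(w),$$

which concludes the proof.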
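For small $n$ the expectations in Lemma 7 and Theorem 1 can be computed exactly by enumerating all $2^n$ outcomes of the Bernoulli variables, which gives a direct sanity check of the identities and of the inversion bias. The sketch below does this for the subsampled matrix $\widehat{H} = \frac{1}{\gamma n}\sum_i b_i x_i x_i^\top + \lambda I$ with $b_i \sim \mathrm{Bernoulli}(\gamma)$; the function name, the Gaussian data and the parameter values are assumptions made for illustration.

```python
import itertools
import numpy as np

def exact_expectations(X, gamma, lam):
    """Enumerate all Bernoulli(gamma) outcomes b in {0,1}^n and return the exact
    E[H_hat], E[H_hat^{-1}], E[det(H_hat) H_hat^{-1}] and E[det(H_hat)] for
    H_hat = (1 / (gamma * n)) * sum_i b_i x_i x_i^T + lam * I."""
    n, d = X.shape
    E_H, E_inv, E_adj = np.zeros((d, d)), np.zeros((d, d)), np.zeros((d, d))
    E_det = 0.0
    for bits in itertools.product([0, 1], repeat=n):
        b = np.array(bits, dtype=float)
        prob = np.prod(np.where(b == 1, gamma, 1 - gamma))
        H_hat = (X.T * b) @ X / (gamma * n) + lam * np.eye(d)
        det, inv = np.linalg.det(H_hat), np.linalg.inv(H_hat)
        E_H += prob * H_hat
        E_inv += prob * inv
        E_adj += prob * det * inv   # det(H_hat) H_hat^{-1} = adj(H_hat)
        E_det += prob * det
    return E_H, E_inv, E_adj, E_det

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
E_H, E_inv, E_adj, E_det = exact_expectations(X, gamma=0.5, lam=0.5)
H_inv = np.linalg.inv(E_H)
```

Here `E_adj / E_det` should agree with `H_inv` up to floating-point error, as Theorem 1 states, whereas the plain expectation `E_inv` differs from `H_inv` by the inversion bias.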

3 Finite-sample convergence analysis

In this section, we prove Theorem 2 and Corollary 4, establishing that determinantal averaging exhibits a 1/√m convergence rate, where m is the number of sampled matrices (or the number of machines in distributed Newton’s method). For this, we need a tool from random matrix theory.

Lemma 8 (Matrix Bernstein [Tro12]) Consider a finite sequence $\{X_i\}$ of independent, random, self-adjoint matrices with dimension $d$ such that $\mathbb{E}[X_i] = 0$ and $\lambda_{\max}(X_i) \le R$ almost surely. If the sequence satisfies $\big\|\sum_i \mathbb{E}[X_i^2]\big\| \le \sigma^2$, then the following inequality holds for all $x \ge 0$:

$$\Pr\Big(\lambda_{\max}\Big(\sum_i X_i\Big) \ge x\Big) \le \begin{cases} d\, e^{-\frac{x^2}{4\sigma^2}} & \text{for } x \le \frac{\sigma^2}{R}; \\[4pt] d\, e^{-\frac{x}{4R}} & \text{for } x \ge \frac{\sigma^2}{R}. \end{cases}$$

The key component of our analysis is bounding the moments of the determinant and adjugate of a certain class of random matrices. This has to be done carefully, because higher moments of the determinant grow more rapidly than, e.g., for a sub-gaussian random variable. For this result, we do not require that the individual components $Z_i$ of matrix $A$ be rank-1, but we impose several additional boundedness assumptions. In the proof below we apply the concentration inequality of Lemma 8 twice: first to the random matrix $A$ itself, and then also to its trace, which allows finer control over the determinant.

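Lemma 9 Let $A = \frac{1}{\gamma}\sum_i b_i Z_i + B$, where $b_i \sim \mathrm{Bernoulli}(\gamma)$ are independent, whereas $Z_i$ and $B$ are $d \times d$ psd matrices such that $\|Z_i\| \le \epsilon$ for all $i$ and $\mathbb{E}[A] = I$. If $\gamma \ge 8\epsilon d\eta^{-2}(p + \ln d)$ for $0 < \eta \le 0.25$ and $p \ge 2$, then

$$\text{(a)}\ \ \mathbb{E}\Big[\big|\det(A) - 1\big|^p\Big]^{\frac{1}{p}} \le 5\eta \qquad\text{and}\qquad \text{(b)}\ \ \mathbb{E}\Big[\big\|\mathrm{adj}(A) - I\big\|^p\Big]^{\frac{1}{p}} \le 9\eta.$$

Proof We start by proving (a). Let $X = \det(A) - 1$ and denote $\mathbf{1}_{[a,b]}$ as the indicator variable of the event that $X \in [a,b]$. Since $\det(A) \ge 0$, we have:

$$\begin{aligned}
\mathbb{E}\big[|X|^p\big] &= \mathbb{E}\big[(-X)^p\cdot\mathbf{1}_{[-1,0]}\big] + \mathbb{E}\big[X^p\cdot\mathbf{1}_{[0,\infty]}\big] \\
&\le \eta^p + \int_\eta^1 p\,x^{p-1}\Pr(-X \ge x)\,dx + \int_0^\infty p\,x^{p-1}\Pr(X \ge x)\,dx.
\end{aligned} \tag{3}$$

Thus it suffices to bound the two integrals. We will start with the first one. Let $X_i = (1 - \frac{b_i}{\gamma})Z_i$. We use the matrix Bernstein inequality to control the extreme eigenvalues of the matrix $I - A = \sum_i X_i$ (note that matrix $B$ cancels out because $I = \mathbb{E}[A] = \sum_i Z_i + B$). To do this, observe that $\|X_i\| \le \epsilon/\gamma$ and, moreover, $\mathbb{E}\big[(1 - \frac{b_i}{\gamma})^2\big] = \frac{1}{\gamma} - 1 \le \frac{1}{\gamma}$, so:

$$\Big\|\sum_i \mathbb{E}[X_i^2]\Big\| = \Big\|\sum_i \mathbb{E}\Big[\Big(1 - \frac{b_i}{\gamma}\Big)^2\Big] Z_i^2\Big\| \le \frac{1}{\gamma}\cdot\Big\|\sum_i Z_i^2\Big\| \le \frac{\epsilon}{\gamma}\cdot\Big\|\sum_i Z_i\Big\| \le \frac{\epsilon}{\gamma}.$$

Thus, applying Lemma 8 we conclude that for any $z \in \big[\frac{\eta}{\sqrt{2d}}, 1\big]$:

$$\Pr\big(\|I - A\| \ge z\big) \le 2d\, e^{-\frac{z^2\gamma}{4\epsilon}} \le 2\, e^{\ln(d) - z^2\frac{2d}{\eta^2}(p + \ln d)} \le 2\, e^{-z^2\frac{2dp}{\eta^2}}. \tag{4}$$

Conditioning on the high-probability event given by (4) leads to the lower bound $\det(A) \ge (1 - z)^d$ which is very loose. To improve on it, we use the following inequality, where $\delta_1, \dots, \delta_d$ denote the eigenvalues of $I - A$:

$$\det(A)\,e^{\mathrm{tr}(I - A)} = \prod_i (1 - \delta_i)\,e^{\delta_i} \ge \prod_i (1 - \delta_i)(1 + \delta_i) = \prod_i (1 - \delta_i^2).$$

Thus we obtain a tighter bound when $\det(A)$ is multiplied by $e^{\mathrm{tr}(I - A)}$, and now it suffices to upper bound the latter. This is a simple application of the scalar Bernstein’s inequality (Lemma 8 with $d = 1$) for the random variables $X_i = \mathrm{tr}(X_i) \le \epsilon/\gamma \le \frac{\eta^2}{8dp}$, which satisfy $\sum_i \mathbb{E}[X_i^2] \le \frac{\epsilon}{\gamma}\,\mathrm{tr}\big(\sum_i Z_i\big) \le \frac{\epsilon d}{\gamma} \le \frac{\eta^2}{8p}$. Thus the scalar Bernstein’s inequality states that

$$\max\Big\{\Pr\big(\mathrm{tr}(A - I) \ge y\big),\ \Pr\big(\mathrm{tr}(A - I) \le -y\big)\Big\} \le \begin{cases} e^{-y^2\frac{2p}{\eta^2}} & \text{for } y \le d; \\[4pt] e^{-y\frac{2dp}{\eta^2}} & \text{for } y \ge d. \end{cases} \tag{5}$$

Setting $y = \frac{x}{2}$ and $z = \sqrt{\frac{x}{2d}}$ and taking a union bound over the appropriate high-probability events given by (4) and (5), we conclude that for any $x \in [\eta, 1]$:

$$\det(A) \ge (1 - z^2)^d \exp\big(\mathrm{tr}(A - I)\big) \ge \Big(1 - \frac{x}{2d}\Big)^d e^{-\frac{x}{2}} \ge 1 - x, \qquad\text{with prob. } 1 - 3e^{-x^2\frac{p}{2\eta^2}}.$$

Thus, for $X = \det(A) - 1$ and $x \in [\eta, 1]$ we obtain that $\Pr(-X \ge x) \le 3e^{-x^2\frac{p}{2\eta^2}}$, and consequently,

$$\begin{aligned}
\int_\eta^1 p\,x^{p-1}\Pr(-X \ge x)\,dx &\le 3p\int_\eta^1 x^{p-1}e^{-x^2\frac{p}{2\eta^2}}\,dx \le 3p\cdot\sqrt{2\pi\eta^2/p}\int_{-\infty}^{\infty} |x|^{p-1}\,\frac{e^{-x^2\frac{p}{2\eta^2}}}{\sqrt{2\pi\eta^2/p}}\,dx \\
&\le 3\sqrt{2\pi\eta^2 p}\cdot\Big(\frac{\eta^2}{p}\cdot p\Big)^{\frac{p-1}{2}} = 3\sqrt{2\pi p}\cdot\eta^p.
\end{aligned}$$

We now move on to bounding the remaining integral from (3). Since determinant is the product of eigenvalues, we have $\det(A) = \det(I + A - I) \le e^{\mathrm{tr}(A - I)}$, so we can use the Bernstein bound of (5) w.r.t. $A - I$. It follows that:

$$\begin{aligned}
\int_0^\infty p\,x^{p-1}\Pr(X \ge x)\,dx &\le \int_0^\infty p\,x^{p-1}\Pr\big(e^{\mathrm{tr}(A - I)} \ge 1 + x\big)\,dx \\
&\le \int_0^{e^d - 1} p\,x^{p-1}e^{-\ln^2(1+x)\frac{2p}{\eta^2}}\,dx + \int_{e^d - 1}^\infty p\,x^{p-1}e^{-\ln(1+x)\frac{2dp}{\eta^2}}\,dx \\
&\le \int_0^{e - 1} p\,x^{p-1}e^{-\ln^2(1+x)\frac{2p}{\eta^2}}\,dx + \int_{e - 1}^\infty p\,x^{p-1}e^{-\ln(1+x)\frac{2p}{\eta^2}}\,dx,
\end{aligned}$$

because $\ln^2(1 + x) \ge \ln(1 + x)$ for $x \ge e - 1$. Note that $\ln^2(1 + x) \ge x^2/4$ for $x \in [0, e - 1]$, so

$$\int_0^{e-1} p\,x^{p-1}e^{-\ln^2(1+x)\frac{2p}{\eta^2}}\,dx \le \int_0^{e-1} p\,x^{p-1}e^{-x^2\frac{p}{2\eta^2}}\,dx \le \sqrt{2\pi p}\cdot\eta^p.$$

In the interval $x \in [e - 1, \infty]$, we have:

$$\begin{aligned}
\int_{e-1}^\infty p\,x^{p-1}e^{-\ln(1+x)\frac{2p}{\eta^2}}\,dx &= p\int_{e-1}^\infty e^{(p-1)\ln(x) - \ln(1+x)\frac{2p}{\eta^2}}\,dx \le p\int_{e-1}^\infty e^{-\ln(1+x)\frac{p}{\eta^2}}\,dx \\
&\le p\int_1^\infty \Big(\frac{1}{1+x}\Big)^{\frac{p}{\eta^2}}dx = \frac{p}{\frac{p}{\eta^2} - 1}\Big(\frac{1}{2}\Big)^{\frac{p}{\eta^2} - 1} \le p\cdot\Big(\frac{1}{2}\Big)^{\frac{p}{\eta^2}} \le p\cdot\eta^{2p},
\end{aligned}$$

where the last inequality follows because $(\frac{1}{2})^{\frac{1}{\eta^2}} \le \eta^2$. Noting that $\eta^{2p} \le \eta^p$ and that $(1 + 4\sqrt{2\pi p} + p)^{\frac{1}{p}} \le 5$ for any $p \ge 2$ concludes the proof of (a). The proof of (b), given in Appendix A, follows similarly as above because for any positive definite $A$ we have $\frac{\det(A)}{\lambda_{\max}(A)}\cdot I \preceq \mathrm{adj}(A) \preceq \frac{\det(A)}{\lambda_{\min}(A)}\cdot I$.

Having obtained bounds on the higher moments, we can now convert them to convergence with high probability for the average of determinants and the adjugates. Since determinant is a scalar variable, this follows by using standard arguments. On the other hand, for the adjugate matrix we require a somewhat less standard matrix extension of the Khintchine/Rosenthal inequalities (see Appendix A).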

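Corollary 10 There is $C > 0$ s.t. for $A$ as in Lemma 9 with all $Z_i$ rank-1 and $\gamma \ge C\epsilon d\eta^{-2}\log^3\frac{d}{\delta}$,

$$\text{(a)}\ \ \Pr\bigg(\Big|\frac{1}{m}\sum_{t=1}^m \det(A_t) - 1\Big| \ge \frac{\eta}{\sqrt{m}}\bigg) \le \delta \qquad\text{and}\qquad \text{(b)}\ \ \Pr\bigg(\Big\|\frac{1}{m}\sum_{t=1}^m \mathrm{adj}(A_t) - I\Big\| \ge \frac{\eta}{\sqrt{m}}\bigg) \le \delta,$$

where $A_1, \dots, A_m$ are independent copies of $A$.

We are ready to show the convergence rate of determinantal averaging, which follows essentially by upper/lower bounding the numerator and denominator separately, using Corollary 10.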

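Proof of Theorem 2 We will apply Corollary 10 to the matrices $A_t = H^{-\frac{1}{2}}\widehat{H}_t H^{-\frac{1}{2}}$. Note that $A_t = \frac{n}{k}\sum_i b_i\tilde{Z}_i + H^{-\frac{1}{2}}BH^{-\frac{1}{2}}$ (the last term equals $\lambda H^{-1}$ when $B = \lambda I$), where each $\tilde{Z}_i = \frac{1}{n}H^{-\frac{1}{2}}Z_i H^{-\frac{1}{2}}$ satisfies $\|\tilde{Z}_i\| \le \mu\cdot d/n$. Therefore, Corollary 10 guarantees that for $\frac{k}{n} \ge C\,\frac{\mu d}{n}\,d\eta^{-2}\log^3\frac{d}{\delta}$, with probability $1 - \delta$ the following average of determinants is concentrated around 1:

$$Z \overset{\text{def}}{=} \frac{1}{m}\sum_t \frac{\det(\widehat{H}_t)}{\det(H)} = \frac{1}{m}\sum_t \det\big(H^{-\frac{1}{2}}\widehat{H}_t H^{-\frac{1}{2}}\big) \in [1 - \alpha, 1 + \alpha] \qquad\text{for } \alpha = \frac{\eta}{\sqrt{m}},$$

along with a corresponding bound for the adjugate matrices. We obtain that with probability $1 - 2\delta$,

$$\begin{aligned}
\bigg\|\frac{\sum_{t=1}^m \mathrm{adj}(A_t)}{\sum_{t=1}^m \det(A_t)} - I\bigg\| &\le \Big\|\frac{1}{m}\sum_t \mathrm{adj}(A_t) - Z\,I\Big\| / Z \\
\text{(Corollary 10a)}\quad &\le \frac{1}{1 - \alpha}\Big\|\frac{1}{m}\sum_t \mathrm{adj}(A_t) - I\Big\| + \frac{\alpha}{1 - \alpha} \\
\text{(Corollary 10b)}\quad &\le \frac{\alpha}{1 - \alpha} + \frac{\alpha}{1 - \alpha}.
\end{aligned}$$

It remains to multiply the above expressions by $H^{-\frac{1}{2}}$ from both sides to recover the desired estimator:

$$\frac{\sum_{t=1}^m \det(\widehat{H}_t)\,\widehat{H}_t^{-1}}{\sum_{t=1}^m \det(\widehat{H}_t)} = H^{-\frac{1}{2}}\,\frac{\sum_{t=1}^m \mathrm{adj}(A_t)}{\sum_{t=1}^m \det(A_t)}\,H^{-\frac{1}{2}} \preceq H^{-\frac{1}{2}}\Big(1 + \frac{2\alpha}{1 - \alpha}\Big)I\,H^{-\frac{1}{2}} = \Big(1 + \frac{2\alpha}{1 - \alpha}\Big)H^{-1},$$

and the lower bound follows identically. Appropriately adjusting the constants concludes the proof.

As an application of the above result, we show how this allows us to bound the estimation error in distributed Newton’s method, when using determinantal averaging.

Proof of Corollary 4 Follows from Theorem 2 by setting $Z_i = \ell_i''(w^\top x_i)\,x_i x_i^\top$ and $B = \lambda I$. Note that the assumptions imply that $\|Z_i\| \le \mu$, so invoking the theorem and denoting $g$ as $\nabla\mathcal{L}(w)$, with probability $1 - \delta$ we have

$$\begin{aligned}
\bigg\|\frac{\sum_{t=1}^m a_t\,\hat{p}_t}{\sum_{t=1}^m a_t} - p\bigg\|_H &= \bigg\|H^{\frac{1}{2}}\bigg(\frac{\sum_{t=1}^m \det(\widehat{H}_t)\,\widehat{H}_t^{-1}}{\sum_{t=1}^m \det(\widehat{H}_t)} - H^{-1}\bigg)H^{\frac{1}{2}}\,H^{-\frac{1}{2}}g\bigg\| \\
&\le \bigg\|H^{\frac{1}{2}}\bigg(\frac{\sum_{t=1}^m \det(\widehat{H}_t)\,\widehat{H}_t^{-1}}{\sum_{t=1}^m \det(\widehat{H}_t)} - H^{-1}\bigg)H^{\frac{1}{2}}\bigg\|\cdot\big\|H^{-\frac{1}{2}}g\big\| \\
\text{(Theorem 2)}\quad &\le \frac{\eta}{\sqrt{m}}\,\big\|H^{\frac{1}{2}}H^{-1}H^{\frac{1}{2}}\big\|\cdot\|p\|_H = \frac{\eta}{\sqrt{m}}\cdot\|p\|_H,
\end{aligned}$$

which completes the proof of the corollary.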

Acknowledgements MWM would like to acknowledge ARO, DARPA, NSF and ONR for providing partial support of this work. Also, MWM and MD thank the NSF for funding via the NSF TRIPODS program. Part of this work was done while MD and MWM were visiting the Simons Institute for the Theory of Computing.

References

[ABH17] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. Journal of Machine Learning Research, 18(116):1–40, 2017.

[AD11] Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 873–881. Curran Associates, Inc., 2011.

[BCF09] C. Bekas, A. Curioni, and I. Fedulova. Low cost high performance uncertainty quantification. In Proceedings of the 2nd Workshop on High Performance Computational Finance, WHPCF ’09, pages 8:1–8:8, New York, NY, USA, 2009. ACM.

[CL11] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

[Der18] Michał Dereziński. Fast determinantal point processes via distortion-free intermediate sampling. CoRR, abs/1811.03717, 2018.

[DW17] Michał Dereziński and Manfred K. Warmuth. Unbiased estimates for linear regression via volume sampling. In Advances in Neural Information Processing Systems 30, pages 3087–3096, Long Beach, CA, USA, December 2017.

[DW18] Michał Dereziński and Manfred K. Warmuth. Reverse iterative volume sampling for linear regression. Journal of Machine Learning Research, 19(23):1–39, 2018.

[DWH18] Michał Dereziński, Manfred K. Warmuth, and Daniel Hsu. Leveraged volume sampling for linear regression. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2510–2519. Curran Associates, Inc., 2018.

[DWH19] Michał Dereziński, Manfred K. Warmuth, and Daniel Hsu. Correcting the bias in least squares regression with volume-rescaled sampling. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.

[GCT12] Alex Gittens, Richard Y. Chen, and Joel A. Tropp. The masked sample covariance estimator: an analysis using matrix concentration inequalities. Information and Inference: A Journal of the IMA, 1(1):2–20, 05 2012.

[KBCG13] V. Kalantzis, C. Bekas, A. Curioni, and E. Gallopoulos. Accelerating data uncertainty quantification by solving linear systems with multiple right-hand sides. Numer. Algorithms, 62(4):637–653, April 2013.

[KBRR16] Jakub Konecný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv e-prints, page arXiv:1610.02527, Oct 2016.

[KBY+16] Jakub Konecný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated Learning: Strategies for Improving Communication Efficiency. arXiv e-prints, page arXiv:1610.05492, Oct 2016.

[MHM10] Ryan McDonald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, pages 456–464, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[MMS+09] Ryan Mcdonald, Mehryar Mohri, Nathan Silberman, Dan Walker, and Gideon S. Mann. Efficient large-scale distributed training of conditional maximum entropy models. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1231–1239. Curran Associates, Inc., 2009.

[NW06] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, NY, USA, second edition, 2006.

[RKR+16] Sashank J. Reddi, Jakub Konecný, Peter Richtárik, Barnabás Póczós, and Alex Smola. AIDE: Fast and Communication Efficient Distributed Optimization. arXiv e-prints, page arXiv:1608.06879, Aug 2016.

[SSZ14] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1000–1008, Beijing, China, 22–24 Jun 2014. PMLR.

[Ste97] Guy V. G. Stevens. On the inverse of the covariance matrix in portfolio analysis. The Journal of Finance, 53:1821–1827, 1997.

[Tro12] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, August 2012.

[vdV65] H. Robert van der Vaart. A note on Wilks’ internal scatter. Ann. Math. Statist., 36(4):1308–1312, 08 1965.

[WGM17] Shusen Wang, Alex Gittens, and Michael W. Mahoney. Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3608–3616, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[WLK+16] Lingfei Wu, Jesse Laeuchli, Vassilis Kalantzis, Andreas Stathopoulos, and Efstratios Gallopoulos. Estimating the trace of the matrix inverse by interpolating from the diagonal of an approximate inverse. Journal of Computational Physics, 326:828–844, 2016.

[WRKXM18] Shusen Wang, Farbod Roosta-Khorasani, Peng Xu, and Michael W. Mahoney. GIANT: Globally improved approximate Newton method for distributed optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2332–2342. Curran Associates, Inc., 2018.

[XRKM17] Peng Xu, Farbod Roosta-Khorasani, and Michael Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. Mathematical Programming, August 2017.

[YXRKM18] Z. Yao, P. Xu, F. Roosta-Khorasani, and M. W. Mahoney. Inexact non-convex Newton-type methods. Technical report, 2018. Preprint: arXiv:1802.06925.

[ZDW13] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res., 14(1):3321–3363, January 2013.

[ZL15] Yuchen Zhang and Xiao Lin. Disco: Distributed optimization for self-concordant empirical loss. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 362–370, Lille, France, 07–09 Jul 2015. PMLR.

[ZWLS10] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. Parallelized stochastic gradient descent. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603. Curran Associates, Inc., 2010.

A Omitted proofs from Section 3

In Section 3, we stated Lemma 9 and proved the first part of it (a moment bound for the determinant). Here, we provide the proof of the second part (a moment bound for the adjugate).

Lemma 11 (Lemma 9b restated) Let $A = \frac{1}{\gamma}\sum_i b_i Z_i + B$, where $b_i \sim \mathrm{Bernoulli}(\gamma)$ are independent, whereas $Z_i$ and $B$ are $d \times d$ psd matrices such that $\|Z_i\| \le \epsilon$ for all $i$ and $E[A] = I$. If $\gamma \ge 8\epsilon d\eta^{-2}(p + \ln d)$ for $0 < \eta \le 0.25$ and $p \ge 2$, then

$$E\big[\|\mathrm{adj}(A) - I\|^p\big]^{1/p} \le 9\eta.$$

Proof Let $\lambda_{\max}$ and $\lambda_{\min}$ denote the largest and smallest eigenvalue of $\mathrm{adj}(A)$. We have

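$$\begin{aligned}
E\big[\|\mathrm{adj}(A) - I\|^p\big] &= \int_0^\infty p x^{p-1} \Pr\big(\|\mathrm{adj}(A) - I\| \ge x\big)\,dx \\
&\le \eta^p + \int_\eta^\infty p x^{p-1} \Big(\Pr(\lambda_{\max} \ge 1 + x) + \Pr(\lambda_{\min} \le 1 - x)\Big)\,dx.
\end{aligned}$$

We will now bound the two probabilities. Let $\delta_{\max}$ and $\delta_{\min}$ denote the largest and smallest eigenvalue of the matrix $A - I$. Recall the following concentration bounds implied by Lemma 8 (see the first part of the proof of Lemma 9):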

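$$\max\Big\{\Pr\big(\mathrm{tr}(A - I) \ge y\big),\ \Pr\big(\mathrm{tr}(A - I) \le -y\big)\Big\} \le
\begin{cases}
e^{-\frac{y^2}{\eta^2} 2p} & \text{for } y \in [0, d]; \\
e^{-\frac{y}{\eta^2} 2dp} & \text{for } y \ge d,
\end{cases} \tag{6}$$

$$\max\Big\{\Pr(\delta_{\max} \ge z),\ \Pr(\delta_{\min} \le -z)\Big\} \le
\begin{cases}
e^{-\frac{z^2}{\eta^2} 2p} & \text{for } z \in \big[0, \frac{\eta}{\sqrt{2d}}\big]; \\
e^{-\frac{z^2}{\eta^2} 2dp} & \text{for } z \in \big[\frac{\eta}{\sqrt{2d}}, 1\big]; \\
e^{-\frac{z}{\eta^2} 2dp} & \text{for } z \ge 1.
\end{cases} \tag{7}$$

From the formula $\mathrm{adj}(A) = \det(A) A^{-1}$ it follows that $\lambda_{\max} \le \frac{\det(A)}{1 + \delta_{\min}} \le \frac{e^{\mathrm{tr}(A - I)}}{1 + \delta_{\min}}$, so we have

$$\begin{aligned}
\Pr(\lambda_{\max} \ge 1 + x) &\le \Pr\Big(\frac{e^{\mathrm{tr}(A - I)}}{1 + \delta_{\min}} \ge 1 + x\Big) \\
&= \Pr\Big(\mathrm{tr}(A - I) + \ln\frac{1}{1 + \delta_{\min}} \ge \ln(1 + x)\Big) \\
&\le \Pr\Big(\mathrm{tr}(A - I) \ge \tfrac{2}{3}\ln(1 + x)\Big) + \Pr\Big(\ln\frac{1}{1 + \delta_{\min}} \ge \tfrac{1}{3}\ln(1 + x)\Big) \\
&= \Pr\Big(\mathrm{tr}(A - I) \ge \tfrac{2}{3}\ln(1 + x)\Big) + \Pr\Big(\delta_{\min} \le \frac{1}{(1 + x)^{1/3}} - 1\Big) \\
&\le
\begin{cases}
e^{-\frac{\ln^2(1+x)}{9\eta^2} 8p} + e^{-\big(1 - (\frac{1}{1+x})^{1/3}\big)^2 \frac{2p}{\eta^2}} \le 2e^{-\frac{x^2}{20\eta^2} p} & \text{for } x \in [0, e - 1], \\
e^{-\frac{\ln(1+x)}{3\eta^2} 4p} + e^{-\frac{1}{16}\frac{2dp}{\eta^2}} \le 2e^{-\frac{\ln(1+x)}{8\eta^2} p} & \text{for } x \in [e - 1, e^d - 1].
\end{cases}
\end{aligned}$$

For $x \ge e^d - 1$, since $\lambda_{\max} \le (1 + \delta_{\max})^d \le e^{d\delta_{\max}}$ and $\ln(1 + x) \ge d$, we have:

$$\Pr(\lambda_{\max} \ge 1 + x) \le \Pr\big(e^{d\delta_{\max}} \ge 1 + x\big) = \Pr\big(\delta_{\max} \ge \ln(1 + x)/d\big) \le e^{-\frac{\ln(1+x)}{\eta^2} 2p}.$$

Next, we use the fact that for $\delta = \max\{|\delta_{\max}|, |\delta_{\min}|\}$ we have:

$$\lambda_{\min} \ge \frac{\det(A)}{1 + \delta_{\max}} \ge \frac{(1 - \delta^2)^d}{(1 + \delta_{\max})\, e^{\mathrm{tr}(I - A)}} \ge (1 - \delta)(1 - d\delta^2)\big(1 - \mathrm{tr}(I - A)\big),$$

so for $x \in [\eta, 1]$ we have:

$$\begin{aligned}
\Pr(\lambda_{\min} \le 1 - x) &\le \Pr(\delta \ge x/3) + \Pr(\delta^2 \ge x/3d) + \Pr\big(\mathrm{tr}(I - A) \ge x/3\big) \\
&\le 2e^{-\frac{x^2}{9\eta^2} 2dp} + 2e^{-\frac{x}{3d\eta^2} 2dp} + e^{-\frac{x^2}{\eta^2} 2p} \le 5e^{-\frac{x^2}{9\eta^2} 2p}.
\end{aligned}$$

Putting everything together we obtain that: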

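$$\begin{aligned}
E\big[\|\mathrm{adj}(A) - I\|^p\big] &\le \eta^p + \int_\eta^{e-1} p x^{p-1}\, 7e^{-\frac{x^2}{20\eta^2} p}\,dx + \int_{e-1}^\infty p x^{p-1}\, 3e^{-\frac{\ln(1+x)}{8\eta^2} p}\,dx \\
&\le \eta^p + 7\sqrt{20\pi p}\,\eta^p + \frac{3p}{\frac{p}{16\eta^2} - 1}\Big(\frac{1}{2}\Big)^{\frac{p}{16\eta^2} - 1} \\
&\le \eta^p + 7\sqrt{20\pi p}\,\eta^p + 6(3\eta)^p \le (9\eta)^p,
\end{aligned}$$

which completes the proof.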
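The tail integral over $[e-1, \infty)$ in the final step can be checked by hand. The following is a sketch of one way to bound it, using only $\eta \le 0.25$ and $p \ge 2$; the shorthand $a = p/(16\eta^2)$ is introduced here for convenience and is not notation from the paper. Note that $\eta \le \frac{1}{4}$ gives $p \le a$, hence $p - \frac{p}{8\eta^2} = p - 2a \le -a$, and $a \ge p \ge 2 > 1$ makes the last integral finite.

$$\begin{aligned}
\int_{e-1}^{\infty} p\,x^{p-1}\, 3\,e^{-\frac{\ln(1+x)}{8\eta^2} p}\,dx
&= 3p \int_{e-1}^{\infty} x^{p-1} (1+x)^{-\frac{p}{8\eta^2}}\,dx \\
&\le 3p \int_{e-1}^{\infty} (1+x)^{p - \frac{p}{8\eta^2}}\,dx && \text{since } x^{p-1} \le (1+x)^{p} \\
&\le 3p \int_{e-1}^{\infty} (1+x)^{-a}\,dx && \text{since } p - 2a \le -a \text{ and } 1 + x \ge e \\
&= \frac{3p}{a-1}\, e^{-(a-1)} \le \frac{3p}{a-1} \Big(\frac{1}{2}\Big)^{a-1}.
\end{aligned}$$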

As a consequence of the moment bounds shown in Lemma 9, we establish convergence with high probability for the average of determinants and the adjugates. For the adjugate matrix, we require a matrix variant of the Khintchine/Rosenthal inequalities.

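Lemma 12 ([GCT12]) Suppose that $p \ge 2$ and $r = \max\{p, 2\log d\}$.³ Consider a finite sequence $\{X_i\}$ of independent, symmetrically random, self-adjoint matrices with dimension $d \times d$. Then,

$$E\Big[\Big\|\sum_i X_i\Big\|^p\Big]^{1/p} \le \sqrt{er}\,\Big\|\Big(\sum_i E[X_i^2]\Big)^{1/2}\Big\| + 2er\, E\Big[\max_i \|X_i\|^p\Big]^{1/p}.$$

Corollary 13 (Corollary 10 restated) There is $C > 0$ s.t. for $A$ as in Lemma 9 with all $Z_i$ rank-1 and $\gamma \ge C\epsilon d\eta^{-2}\log^3\frac{d}{\delta}$,

$$\text{(a)}\ \Pr\Big(\Big|\frac{1}{m}\sum_{t=1}^m \det(A_t) - 1\Big| \ge \frac{\eta}{\sqrt{m}}\Big) \le \delta \quad\text{and}\quad \text{(b)}\ \Pr\Big(\Big\|\frac{1}{m}\sum_{t=1}^m \mathrm{adj}(A_t) - I\Big\| \ge \frac{\eta}{\sqrt{m}}\Big) \le \delta,$$

where $A_1, \dots, A_m$ are independent copies of $A$.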

Proof Applying Lemma 9 to the matrix $A$, for appropriate $C$ and any fixed $p \ge 2$, if $\gamma \ge C\epsilon d\sigma^{-2}(p + \ln d)$, then for any $s \in [2, p]$ we have $E\big[\|\mathrm{adj}(A_t) - I\|^s\big] \le \sigma^s$. With the additional assumption that the $Z_i$'s are rank-1, Theorem 7 implies that $E[\mathrm{adj}(A_t)] = I$, so by a standard symmetrization argument, where $r_t$ denote independent Rademacher random variables,

$$E\Big[\Big\|\frac{1}{m}\sum_{t=1}^m \mathrm{adj}(A_t) - I\Big\|^p\Big]^{1/p} \le 2 \cdot E\Big[\Big\|\sum_t \frac{r_t}{m}\big(\mathrm{adj}(A_t) - I\big)\Big\|^p\Big]^{1/p}.$$

³ In [GCT12] it is assumed that d ≥ 3; however, this assumption is not used anywhere in the proof.

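Applying Lemma 12 to the matrices $X_t = \frac{1}{m} Y_t$, where $Y_t = r_t\big(\mathrm{adj}(A_t) - I\big)$, we obtain that:

$$\begin{aligned}
E\Big[\Big\|\frac{1}{m}\sum_t Y_t\Big\|^p\Big]^{1/p} &\le \sqrt{er} \cdot \Big\|\Big(\sum_t \frac{1}{m^2} E[Y_t^2]\Big)^{1/2}\Big\| + \frac{2er}{m}\, E\Big[\sum_{i=1}^m \|Y_i\|^p\Big]^{1/p} \\
&\le \sqrt{\frac{er}{m}} \cdot \big\|E[Y^2]\big\|^{1/2} + \frac{2er}{m} \cdot \big(m\, E\|Y\|^p\big)^{1/p} \\
&\le \sqrt{\frac{er}{m}}\,\sigma + \frac{2er}{m^{1 - 1/p}}\,\sigma \le C' \cdot \frac{p\sigma}{\sqrt{m}},
\end{aligned}$$

for $p \ge 2\log d$ and $C'$ chosen appropriately. Now Markov's inequality yields:

$$\Pr\Big(\Big\|\frac{1}{m}\sum_t \mathrm{adj}(A_t) - I\Big\| \ge \alpha\Big) \le \alpha^{-p} \cdot E\Big[\Big\|\frac{1}{m}\sum_t \mathrm{adj}(A_t) - I\Big\|^p\Big] \le \Big(\frac{2C'p\sigma}{\alpha\sqrt{m}}\Big)^p.$$

Setting $\alpha = \frac{\eta}{\sqrt{m}}$, $\sigma = \frac{\eta}{4C'p}$ and $p = 2\lceil\max\{\log d, \log\frac{1}{\delta}\}\rceil$, the above bound becomes $(\frac{1}{2})^p \le \delta$ for $k \ge C''\mu d^2\eta^{-2}(\log^3\frac{1}{\delta} + \log^3 d)$. Showing the analogous result for the average of determinants of matrices $A_t$ instead of the adjugates follows identically, except that Lemma 12 can be replaced with the standard scalar Rosenthal's inequality.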
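The unbiasedness used in this proof, $E[\det(A)] = \det(E[A])$ and $E[\mathrm{adj}(A)] = \mathrm{adj}(E[A])$ for rank-1 $Z_i$, can be sanity-checked numerically on a tiny instance. The sketch below is an illustration only: it enumerates all $2^n$ Bernoulli outcomes exactly for a small synthetic problem normalized so that $E[A] = I$. The sizes, the seed, the choice $B \propto S^{-1}$, and the helper names `adjugate` and `exact_expectations` are assumptions made for this example, not part of the paper. For contrast, it also computes $E[A^{-1}]$, which exhibits the inversion bias.

```python
import itertools
import numpy as np

def adjugate(A):
    # Valid for invertible A: adj(A) = det(A) * A^{-1}.
    return np.linalg.det(A) * np.linalg.inv(A)

def exact_expectations(Z, B, gamma):
    """Exact E[det(A)], E[adj(A)], E[A^{-1}] for A = (1/gamma) sum_i b_i Z_i + B."""
    n = len(Z)
    e_det, e_adj, e_inv = 0.0, np.zeros_like(B), np.zeros_like(B)
    for bits in itertools.product([0, 1], repeat=n):
        s = sum(bits)
        prob = gamma ** s * (1 - gamma) ** (n - s)   # probability of this outcome
        A = B + sum(b * Zi for b, Zi in zip(bits, Z)) / gamma
        e_det += prob * np.linalg.det(A)
        e_adj += prob * adjugate(A)
        e_inv += prob * np.linalg.inv(A)
    return e_det, e_adj, e_inv

# Hypothetical small instance: d = 3, n = 6 rank-1 terms, gamma = 0.3.
rng = np.random.default_rng(0)
d, n, gamma = 3, 6, 0.3
z = rng.standard_normal((n, d))
S = z.T @ z + 0.5 * np.eye(d)
evals, evecs = np.linalg.eigh(S)
S_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T

# Whiten so that sum_i Z_i + B = I, i.e. E[A] = I as in Lemma 9.
Z = [np.outer(S_inv_half @ zi, S_inv_half @ zi) for zi in z]   # rank-1 psd
B = 0.5 * S_inv_half @ S_inv_half                              # positive definite

e_det, e_adj, e_inv = exact_expectations(Z, B, gamma)
print("E[det(A)]          =", e_det)
print("||E[adj(A)] - I||  =", np.linalg.norm(e_adj - np.eye(d)))
print("tr E[A^{-1}] vs d  =", np.trace(e_inv), d)
```

The determinant and adjugate expectations match their targets up to floating-point error because, for rank-1 $Z_i$, every minor of $A$ is affine in each $b_i$ separately, whereas the plain inverse is not.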

B Proof of Newton convergence

Here, we provide a proof of Corollary 6, which describes the convergence guarantees for the approximate Newton step obtained via determinantal averaging. It suffices to show the following lemma.

Lemma 14 Let loss $\mathcal{L}$ be defined as in (1) and assume its Hessian is $L$-Lipschitz (Assumption 5). If

$$\|\hat{p} - p^*\|_{\nabla^2\mathcal{L}(w)} \le \alpha\,\|p^*\|_{\nabla^2\mathcal{L}(w)}, \quad\text{where } p^* = \nabla^{-2}\mathcal{L}(w)\,\nabla\mathcal{L}(w),$$

then the approximate Newton step $\tilde{w} = w - \hat{p}$ satisfies:

$$\|\tilde{w} - w^*\| \le \max\Big\{\alpha\sqrt{\kappa}\,\|w - w^*\|,\ \frac{2L}{\sigma_{\min}}\,\|w - w^*\|^2\Big\}, \quad\text{where } w^* = \operatorname*{argmin}_w \mathcal{L}(w),$$

and $\kappa$ and $\sigma_{\min}$ are the condition number and smallest eigenvalue of $\nabla^2\mathcal{L}(w)$, respectively.

Proof The lemma essentially follows via the standard analysis of Newton's method. For the sake of completeness we will outline the proof following [WRKXM18]. Denoting $H = \nabla^2\mathcal{L}(w)$ and $g = \nabla\mathcal{L}(w)$, we define the auxiliary function

$$\phi(p) \overset{\mathrm{def}}{=} p^\top H p - 2p^\top g.$$

By definition of $\phi(p)$ we have $\phi(p^*) = \phi(H^{-1}g) = -\|p^*\|_H^2$. It follows that

$$\begin{aligned}
\phi(\hat{p}) - \phi(p^*) &= \|H^{1/2}\hat{p}\|^2 - 2g^\top H^{-1} H\hat{p} + \|H^{1/2}p^*\|^2 \\
&= \|H^{1/2}(\hat{p} - p^*)\|^2 = \|\hat{p} - p^*\|_H^2 \le \alpha^2\|p^*\|_H^2 = -\alpha^2\phi(p^*).
\end{aligned}$$
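The identity behind this step, $\phi(\hat{p}) - \phi(p^*) = \|\hat{p} - p^*\|_H^2$ together with $\phi(p^*) = -\|p^*\|_H^2$, is easy to confirm numerically. The snippet below is a minimal check on a random positive definite $H$; the dimension, seed, perturbation size, and helper names `phi` and `h_norm_sq` are arbitrary choices made for this illustration.

```python
import numpy as np

def phi(p, H, g):
    # Auxiliary function phi(p) = p^T H p - 2 p^T g.
    return p @ H @ p - 2 * p @ g

def h_norm_sq(v, H):
    # Squared H-norm: ||v||_H^2 = v^T H v.
    return v @ H @ v

rng = np.random.default_rng(0)
d = 4
M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)                 # symmetric positive definite "Hessian"
g = rng.standard_normal(d)              # "gradient"

p_star = np.linalg.solve(H, g)          # exact Newton direction H^{-1} g
p_hat = p_star + 0.1 * rng.standard_normal(d)   # some approximate direction

gap = phi(p_hat, H, g) - phi(p_star, H, g)
print(gap, h_norm_sq(p_hat - p_star, H))        # the two values agree
print(phi(p_star, H, g), -h_norm_sq(p_star, H)) # so do these
```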

We invoke the classical result in local convergence analysis of Newton's method [NW06], using the statement of Lemma 9 in [WRKXM18].

Lemma 15 ([WRKXM18]) Assume Hessian is $L$-Lipschitz and that $\hat{p}$ satisfies $\phi(\hat{p}) \le (1 - \alpha^2)\min_p \phi(p)$. Then $\tilde{w} = w - \hat{p}$ satisfies

$$\|\tilde{w} - w^*\|_H^2 \le L\,\|\tilde{w} - w^*\|\,\|w - w^*\|^2 + \frac{\alpha^2}{1 - \alpha^2}\,\|w - w^*\|_H^2.$$

Lemma 15 immediately implies that one of the following two inequalities holds:

$$\|\tilde{w} - w^*\| \le \frac{2L}{\sigma_{\min}(H)} \cdot \|w - w^*\|^2,$$

$$\|\tilde{w} - w^*\| \le \frac{\alpha}{\sqrt{1 - \alpha^2}}\sqrt{\frac{2\lambda_{\max}(H)}{\lambda_{\min}(H)}} \cdot \|w - w^*\|,$$

which proves Lemma 14. Note that Corollary 6 follows immediately by combining Corollary 4 with Lemma 14.
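To make the step from Lemma 15 to the two inequalities explicit, here is a short sketch of the case analysis. The shorthand $\Delta = w - w^*$ and $\tilde\Delta = \tilde{w} - w^*$ is introduced here and is not the paper's notation; the argument uses only $\lambda_{\min}(H)\|v\|^2 \le \|v\|_H^2 \le \lambda_{\max}(H)\|v\|^2$, with $\sigma_{\min}(H) = \lambda_{\min}(H)$ for a positive definite $H$. The right-hand side of Lemma 15 is a sum of two terms, so it is at most twice the larger one.

$$\begin{aligned}
\text{Case 1: } & L\|\tilde\Delta\|\|\Delta\|^2 \ge \tfrac{\alpha^2}{1-\alpha^2}\|\Delta\|_H^2
&&\Rightarrow\ \lambda_{\min}(H)\|\tilde\Delta\|^2 \le \|\tilde\Delta\|_H^2 \le 2L\|\tilde\Delta\|\|\Delta\|^2
&&\Rightarrow\ \|\tilde\Delta\| \le \frac{2L}{\lambda_{\min}(H)}\|\Delta\|^2, \\
\text{Case 2: } & L\|\tilde\Delta\|\|\Delta\|^2 < \tfrac{\alpha^2}{1-\alpha^2}\|\Delta\|_H^2
&&\Rightarrow\ \lambda_{\min}(H)\|\tilde\Delta\|^2 \le \|\tilde\Delta\|_H^2 \le \frac{2\alpha^2}{1-\alpha^2}\lambda_{\max}(H)\|\Delta\|^2
&&\Rightarrow\ \|\tilde\Delta\| \le \frac{\alpha}{\sqrt{1-\alpha^2}}\sqrt{\frac{2\lambda_{\max}(H)}{\lambda_{\min}(H)}}\,\|\Delta\|.
\end{aligned}$$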

Figure 2: Comparison of the estimation error between determinantal and uniform averaging on four libsvm datasets.

C Experiments

In this section, we experimentally evaluate the estimation error of determinantal averaging for Newton's method (following the setup of Section 1.1), and we compare it against uniform averaging [WRKXM18]. Although there are obvious follow-up directions for empirical and implementational work, here we focus our experiments on demonstrating the existence of the inversion bias with previous methods and how our determinantal averaging solves this problem. We use square loss $\ell_i(w^\top x_i) = (w^\top x_i - y_i)^2$, where $y_i$ are the real-valued labels for a regression problem, and we run the experiments on several benchmark regression datasets from the libsvm repository [CL11]. In this setting, the local Newton estimate computed from the starting vector $w = 0$ is given by:

$$\hat{p} = \Big(\frac{1}{k}\sum_{i=1}^n b_i x_i x_i^\top + \lambda I\Big)^{-1} \frac{1}{n}\sum_{i=1}^n y_i x_i, \quad\text{where } b_i \sim \mathrm{Bernoulli}(k/n).$$

In all of our experiments we set the regularization parameter to $\lambda = \frac{1}{n}$. Let $\hat{p}_1, \dots, \hat{p}_m \overset{\text{i.i.d.}}{\sim} \hat{p}$ be $m$ distributed local estimates and denote $\hat{H}_t$ as the $t$th local Hessian estimate. The two averaging strategies we compare are:

$$\text{determinantal: } \hat{p}_{\det} = \frac{\sum_{t=1}^m \det(\hat{H}_t)\,\hat{p}_t}{\sum_{t=1}^m \det(\hat{H}_t)}, \qquad \text{uniform: } \hat{p}_{\mathrm{uni}} = \frac{1}{m}\sum_{t=1}^m \hat{p}_t.$$

Figure 2 plots the estimation errors $\|\hat{p}_{\det} - p^*\|$ and $\|\hat{p}_{\mathrm{uni}} - p^*\|$, where $p^*$ is the exact Newton step starting from $w = 0$, for datasets abalone, cpusmall, mg and cadata⁴ [CL11] (for convenience, the plot from Figure 1 in Section 1.1 is repeated here). The reported results are averaged over 100 trials, with shading representing standard error. We consistently observe that for a small number of machines $m$ both methods effectively reduce the estimation error; however, after a certain point uniform averaging converges to a biased estimate and the estimation error flattens out. On the other hand, determinantal averaging continues to converge to the optimum as the number of machines keeps growing. We remark that for some datasets determinantal averaging exhibits larger variance than uniform averaging, especially when the local sample size is small. Reducing that variance, for example through some form of additional regularization, is a new direction for future work.
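The two averaging strategies can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used for Figure 2: it runs on synthetic Gaussian data rather than the libsvm datasets, the function names (`local_newton_estimate`, `average_estimates`, `run_demo`) and default sizes are hypothetical, and the determinantal weights are computed from log-determinants via `numpy.linalg.slogdet` for numerical stability (a common factor in the weights cancels in the ratio).

```python
import numpy as np

def local_newton_estimate(X, y, k, lam, rng):
    """One local estimate p_hat and the log-determinant of its local Hessian."""
    n, d = X.shape
    b = rng.random(n) < k / n                 # b_i ~ Bernoulli(k/n)
    Xs = X[b]
    H = Xs.T @ Xs / k + lam * np.eye(d)       # (1/k) sum_i b_i x_i x_i^T + lam * I
    g = X.T @ y / n                           # (1/n) sum_i y_i x_i
    p = np.linalg.solve(H, g)
    _, logdet = np.linalg.slogdet(H)          # H is positive definite for lam > 0
    return p, logdet

def average_estimates(P, logdets):
    """Return (determinantal average, uniform average) of the rows of P."""
    P = np.asarray(P, dtype=float)
    logdets = np.asarray(logdets, dtype=float)
    w = np.exp(logdets - logdets.max())       # rescaled det weights; scale cancels
    p_det = (w[:, None] * P).sum(axis=0) / w.sum()
    p_uni = P.mean(axis=0)
    return p_det, p_uni

def run_demo(n=2000, d=5, k=50, m=500, seed=0):
    """Synthetic stand-in for the experiment; returns both estimation errors."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    lam = 1.0 / n
    # Exact step: uses the full Hessian (1/n) X^T X + lam * I.
    p_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    results = [local_newton_estimate(X, y, k, lam, rng) for _ in range(m)]
    P = np.array([p for p, _ in results])
    logdets = np.array([ld for _, ld in results])
    p_det, p_uni = average_estimates(P, logdets)
    return np.linalg.norm(p_det - p_star), np.linalg.norm(p_uni - p_star)

if __name__ == "__main__":
    err_det, err_uni = run_demo()
    print(f"determinantal error: {err_det:.4f}, uniform error: {err_uni:.4f}")
```

Varying `m` in `run_demo` gives a rough analogue of the curves in Figure 2; the exact numbers depend on the synthetic data and the seed.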

⁴ We expanded features to all degree 2 monomials, and removed redundant ones.