arXiv:1905.04447v1 [cs.DS] 11 May 2019

Solving Empirical Risk Minimization in the Current Matrix Multiplication Time

Yin Tat Lee*        Zhao Song†        Qiuyi Zhang‡

Abstract

Many convex problems in machine learning and computer science share the same form:

    min_x ∑_i f_i(A_i x + b_i),

where f_i are convex functions on R^{n_i} with constant n_i, A_i ∈ R^{n_i×d}, b_i ∈ R^{n_i} and ∑_i n_i = n. This problem generalizes linear programming and includes many problems in empirical risk minimization.

In this paper, we give an algorithm that runs in time

    O*((n^ω + n^{2.5−α/2} + n^{2+1/6}) log(n/δ)),

where ω is the exponent of matrix multiplication, α is the dual exponent of matrix multiplication, and δ is the relative accuracy. Note that the runtime has only a log dependence on the condition numbers or other data dependent parameters and these are captured in δ. For the current bound ω ∼ 2.38 [Vassilevska Williams'12, Le Gall'14] and α ∼ 0.31 [Le Gall, Urrutia'18], our runtime matches the current best for solving a dense least squares regression problem, a special case of the problem we consider. Very recently, [Alman'18] proved that all the current known techniques can not give a better ω below 2.168, which is larger than our 2 + 1/6.

Our result generalizes the very recent result of solving linear programs in the current matrix multiplication time [Cohen, Lee, Song'19] to a more broad class of problems. Our algorithm proposes two concepts which are different from [Cohen, Lee, Song'19]:
  • We give a robust deterministic central path method, whereas the previous one is a stochastic central path which updates weights by a random sparse vector.
  • We propose an efficient data-structure to maintain the central path of interior point methods even when the weights update vector is dense.

* [email protected]. University of Washington & Microsoft Research Redmond. Part of the work done while visiting the Simons Institute for the Theory of Computing at Berkeley.
† [email protected]. University of Washington & UT-Austin. Part of the work done while visiting University of Washington and hosted by Yin Tat Lee, and while visiting the Simons Institute for the Theory of Computing at Berkeley, hosted by Peter L. Bartlett, Nikhil Srivastava, Santosh Vempala, and David P. Woodruff.
‡ [email protected]. University of California, Berkeley. Part of the work done while visiting University of Washington and hosted by Yin Tat Lee.

1 Introduction

The Empirical Risk Minimization (ERM) problem is a fundamental question in statistical machine learning. A huge number of papers have considered this topic [Nes83, Vap92, PJ92, Nes04, BBM05, BB08, NJLS09, MB11, FGRW12, LRSB12, JZ13, Vap13, SSZ13, DB14, DBLJ14, FGKS15, DB16, SLC+17, ZYJ17, ZX17, ZWX+17, GSS17, MS17, NS17, AKK+17, Csi18, JLGJ18], as almost all convex optimization problems in machine learning can be phrased in the ERM framework [SSBD14, Vap92]. While the statistical convergence properties and generalization bounds for ERM are well-understood, a general runtime bound for general ERM is not known, although fast runtime bounds do exist for specific instances [AKPS19].

Examples of applications of ERM include linear regression, LASSO [Tib96], elastic net [ZH05], logistic regression [Cox58, HJLS13], support vector machines [CV95], ℓ_p regression [Cla05, DDH+09, BCLL18, AKPS19], quantile regression [Koe00, KH01, Koe05], AdaBoost [FS97], kernel regression [Nad64, Wat64], and mean-field variational inference [XJR02].

The classical Empirical Risk Minimization problem is defined as

    min_x ∑_{i=1}^m f_i(a_i^⊤ x + b_i)

where f_i : R → R is a convex function, a_i ∈ R^d, and b_i ∈ R, ∀i ∈ [m]. Note that this formulation also captures most standard forms of regularization as well.
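For concreteness (our own illustration; the paper itself only lists these applications), two standard instances of this template are

    least squares:        f_i(u) = u²,              a_i = i-th data point,        b_i = −(i-th label),
    logistic regression:  f_i(u) = log(1 + e^{−u}),  a_i = y_i · (i-th data point), b_i = 0,

and an ℓ_2 regularizer λ‖x‖² can be folded in by appending d extra terms f_{m+j}(u) = λu² with a_{m+j} = e_j and b_{m+j} = 0.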

Letting y_i = a_i^⊤ x + b_i and z_i = f_i(a_i^⊤ x + b_i) allows us to rewrite the original problem in the following sense:

    min_{x,y,z} ∑_{i=1}^m z_i                                               (1)
    s.t.  Ax + b = y,
          (y_i, z_i) ∈ K_i = {(y_i, z_i) : f_i(y_i) ≤ z_i},  ∀i ∈ [m].

We can consider a more general version where the dimension of K_i can be arbitrary, say n_i. Therefore, we come to study the general n-variable form

    min_{x ∈ ∏_{i=1}^m K_i, Ax=b} c^⊤ x

where ∑_{i=1}^m n_i = n. We state our main result for solving the general model.

Theorem 1.1 (Main result, informal version of Theorem C.3). Given a matrix A ∈ R^{d×n}, two vectors b ∈ R^d, c ∈ R^n, and m compact convex sets K_1, K_2, ⋯, K_m. Assume that there are no redundant constraints and n_i = O(1), ∀i ∈ [m]. There is an algorithm (procedure Main in Algorithm 6) that solves

    min_{x ∈ ∏_{i=1}^m K_i, Ax=b} c^⊤ x

up to δ precision and runs in expected time

    Õ((n^{ω+o(1)} + n^{2.5−α/2+o(1)} + n^{2+1/6+o(1)}) · log(n/δ)),

where ω is the exponent of matrix multiplication and α is the dual exponent of matrix multiplication.

For the current values ω ∼ 2.38 [Wil12, LG14] and α ∼ 0.31 [LGU18], the expected time is simply n^{ω+o(1)} · Õ(log(n/δ)).
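As a quick sanity check on the exponents (our arithmetic, not an additional claim): with ω ≈ 2.373 and α ≈ 0.31,

    2.5 − α/2 ≈ 2.5 − 0.155 = 2.345   and   2 + 1/6 ≈ 2.167,

so the maximum of the three exponents is ω and the bound collapses to n^{ω+o(1)} · Õ(log(n/δ)).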

Remark 1.2. More precisely, when n_i is super constant, our running time depends polynomially on max_{i∈[m]} n_i (but the dependence is not exponential).

Also note that our runtime depends on the diameter, but only logarithmically, so the algorithm can be applied to linear programs by imposing an artificial bound on the solution.

1.1 Related Work

First-order algorithms for ERM are well-studied and a long series of accelerated stochastic gradient descent algorithms have been developed and optimized [Nes98, JZ13, XZ14, SSZ14, FGKS15, LMH15, MLF15, AY16, RHS+16, SS16, AH16, SLRB17, MS17, LMH17, LJCJ17, All17b, All17a, All18b, All18a]. However, these rates depend polynomially on the Lipschitz constant of ∇f_i, and in order to achieve a log(1/ǫ) dependence, the runtime will also have to depend on the strong convexity of ∑_i f_i. In this paper, we want to focus on algorithms that depend logarithmically on diameter/smoothness/strong convexity constants, as well as on the error parameter ǫ. Note that gradient descent and a direct application of Newton's method do not belong to this class of algorithms, but, for example, the interior point method and the ellipsoid method do.

Therefore, in order to achieve high-accuracy solutions for the non-smooth and non-strongly convex case, most convex optimization problems will rely on second-order methods, often under the general interior point method (IPM) or some sort of iterative refinement framework. So, we note that our algorithm is thus optimal in this general setting since second-order methods require at least n^ω runtime for general matrix inversion.

Our algorithm applies the interior point method framework to solve ERM. The most general interior point methods require O(√n) iterations of linear system solves [Nes98], giving a naive runtime bound of O(n^{ω+1/2}). Using the inverse maintenance technique [Vai89, CLS19], one can improve the running time for LP to Õ(n^ω). This essentially implies that almost all convex optimization problems can be solved, up to subpolynomial factors, as fast as linear regression or matrix inversion!

The specific case of ℓ_2 regression can be solved in O(n^ω) time since the solution is explicitly given by solving a linear system. In the more general case of ℓ_p regression, [BCLL18] proposed an O_p(n^{|1/2−1/p|})-iteration iterative solver with a naive O(n^ω) system solve at each step. Recently, [AKPS19] improved the runtime to Õ_p(n^{max(ω,7/3)}), which is the current matrix multiplication time since ω > 7/3. However, both these results depend exponentially on p and fail to be impressive for large p. Otherwise, we are unaware of other ERM formulations that have general runtime bounds for obtaining high-accuracy solutions.

Recently several works [AW18a, AW18b, Alm18] have tried to show the limitations of the currently known techniques for improving matrix multiplication time. Alman and Vassilevska Williams [AW18b] proved limitations of using the Galactic method applied to many tensors of interest (including Coppersmith-Winograd tensors [CW87]). More recently, Alman [Alm18] proved that by applying the Universal method on those tensors, we cannot hope to achieve any running time better than n^{2.168}, which is already above our n^{2+1/6}.

2 Overview of Techniques

In this section, we discuss the key ideas in this paper. Generalizing the stochastic sparse update approach of [CLS19] to our setting is a natural first step to speeding up the matrix-vector multipli- cation that is needed in each iteration of the interior point method. In linear programs, maintaining approximate complementary slackness means that we maintain x,s to be close multiplicatively to

the central path under some notion of distance. However, the generalized notion of complementary slackness requires a barrier-dependent notion of distance. Specifically, if φ(x) is a barrier function, then our distance is now defined as our function gradient being small in a norm depending on ∇²φ(x). One key fact of the stochastic sparse update is that the variance introduced does not perturb the approximation too much, which requires understanding the second derivative of the distance function. For our setting, this would require bounding the 4th derivative of φ(x), which may not exist for self-concordant functions. So, the stochastic approach may not work algorithmically (not just in the analysis) if φ(x) is assumed to be simply self-concordant. Even when assumptions on the 4th derivative of φ(x) are made, the analysis becomes significantly more complicated due to the 4th derivative terms.

To avoid these problems, the main contributions of this paper are to 1) introduce a robust version of the central path and 2) exploit the robustness via sketching to apply the desired matrix-vector multiplication fast. More generally, our main observation is that one can generally speed up an iterative method using sketching if the method is robust in a certain sense.

To speed up interior point methods, in Sections 4 and A we give a robust version of the interior point method; in Section B, we give a data structure to maintain the sketch; and in Section C, we show how to combine them together. We provide several basic notations and definitions for numerical linear algebra in Section 3. In Section D, we provide some classical lemmas from the literature of interior point methods. In Section E, we prove some basic properties of the sketching matrix. Now, we first begin with an overview of our robust central path and then proceed with an overview of sketching iterative methods.

2.1 Central Path Method

We consider the following optimization problem

    min_{x ∈ ∏_{i=1}^m K_i, Ax=b} c^⊤ x                                     (2)

where ∏_{i=1}^m K_i is the direct product of m low-dimensional convex sets K_i. We let x_i be the i-th block of x corresponding to K_i. Interior point methods consider the path of solutions to the following optimization problem:

    x(t) = argmin_{Ax=b} c^⊤ x + t ∑_{i=1}^m φ_i(x_i)                       (3)

where φ_i : K_i → R are self-concordant barrier functions. This parameterized path is commonly known as the central path. Many algorithms solve the original problem (2) by following the central path as the path parameter is decreased t → 0. The rate at which we decrease t and subsequently the runtimes of these path-following algorithms are usually governed by the self-concordance properties of the barrier functions we use.
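As a toy illustration of (3) (our own example, not from the paper), consider the one-dimensional program min_{x ≥ 0} c·x with c > 0 and the barrier φ(x) = −log x. The central path is available in closed form:

    x(t) = argmin_{x > 0} { c·x − t log x }  ⟹  c − t/x(t) = 0  ⟹  x(t) = t/c,

so x(t) → 0, the true minimizer, as the path parameter t → 0.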

Definition 2.1. We call a function φ a ν-self-concordant barrier for K if dom φ = K and for any x ∈ dom φ and for any u ∈ R^n

    |D³φ(x)[u, u, u]| ≤ 2‖u‖_x^{3/2}   and   ‖∇φ(x)‖_x^* ≤ √ν,

where ‖v‖_x := ‖v‖_{∇²φ(x)} and ‖v‖_x^* := ‖v‖_{∇²φ(x)^{−1}}, for any vector v.

Remark 2.2. It is known that ν ≥ 1 for any self-concordant barrier function.

Nesterov and Nemirovsky showed that for any open convex set K ⊂ R^n, there is an O(n) self-concordant barrier function [Nes98]. In this paper, the convex set K_i we consider has O(1)

dimension. While Nesterov and Nemirovsky gave formulas for the universal barrier, in practice most ERM problems lend themselves to explicit O(1) self-concordant barriers for the majority of the convex functions people use. For example, for the set {x : ‖x‖ < 1}, we use −log(1 − ‖x‖²); for the set {x : x > 0}, we use −log(x); and so on. That is the reason why we assume the gradient and Hessian can be computed in O(1) time. Therefore, in this paper, we assume a ν_i-self-concordant barrier φ_i is provided and that we can compute ∇φ_i and ∇²φ_i in O(1) time. The main result we will use about self-concordance is that the norm ‖·‖_x is stable when we change x.

Theorem 2.3 (Theorem 4.1.6 in [Nes98]). If φ is a self-concordant barrier and if ‖y − x‖_x < 1, then we have:

    (1 − ‖y − x‖_x)² ∇²φ(x) ⪯ ∇²φ(y) ⪯ 1/(1 − ‖y − x‖_x)² · ∇²φ(x).

In general, we can simply think of φ_i as a function penalizing any point x_i ∉ K_i. It is known how to transform the original problem (2) by adding O(n) many variables and constraints so that

• The minimizer x(t) at t = 1 is explicitly given.
• One can obtain an approximate solution of the original problem using the minimizer at small t in linear time.

For completeness, we show how to do it in Lemma D.2. Therefore, it suffices to study how we can move efficiently from x(1) to x(ǫ) for some tiny ǫ where x(t) is again the minimizer of the problem (3).
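As a concrete instance of the explicit barriers mentioned above (a routine calculation on our part, not a formula stated in the paper), the log-barrier for the unit ball {x : ‖x‖ < 1} has

    φ(x) = −log(1 − ‖x‖²),   ∇φ(x) = 2x/(1 − ‖x‖²),   ∇²φ(x) = 2I/(1 − ‖x‖²) + 4xx^⊤/(1 − ‖x‖²)²,

and for {x : x > 0} the barrier φ(x) = −log x has ∇φ(x) = −1/x and ∇²φ(x) = 1/x². For O(1)-dimensional blocks both are O(1)-time computations, which is exactly what the assumption above requires.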

2.2 Robust Central Path

In the standard interior point method, we use a tight ℓ_2-bound to control how far we can deviate from x(t) during the entirety of the algorithm. Specifically, if we denote by γ_i^t(x_i) the appropriate measure of error (this will be specified later and is often called the Newton decrement) in each block coordinate x_i at path parameter t, then as we let t → 0, the old invariant that we are maintaining is

    Φ_old^t(x) = ∑_{i=1}^m γ_i^t(x_i)² ≤ O(1).

It can be shown that a Newton step in the standard direction will allow us to maintain Φ_old^t to be small even as we decrease t by a multiplicative factor of O(m^{−1/2}) in each iteration, thereby giving a standard O(√m) iteration analysis. Therefore, the standard approach can be seen as trying to remain within a small ℓ_2 neighborhood of the central path by centering with Newton steps after making small decreases in the path parameter t. Note however that if each γ_i can be perturbed by an error that is Ω(m^{−1/2}), Φ_old^t(x) can easily become too large for the potential argument to work. To make our analysis more robust, we introduce a robust version that maintains the soft-max potential:

    Φ_new^t(x) = ∑_{i=1}^m exp(λ γ_i^t(x_i)) ≤ O(m)

for some λ = Θ(log m). The robust central path is simply the region of all x that satisfies our potential inequality. We will specify the right constants later, but we always make λ large enough to ensure

that γ_i ≤ 1 for all x in the robust central path. Now note that an ℓ_∞ perturbation of γ translates into a small multiplicative change in Φ^t, tolerating errors on each γ_i of up to O(1/poly log(n)). However, maintaining Φ_new^t(x) ≤ O(m) is not obvious because the robust central path is a much wider region of x than the typical ℓ_2-neighborhood around the central path. We will show later how to modify the standard Newton direction to maintain Φ_new^t(x) ≤ O(m) as we decrease t. Specifically, we will show that a variant of gradient descent of Φ_new^t in the Hessian norm suffices to provide the correct guarantees.
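The following small numerical sketch (ours; the values of m, λ, and γ are made up for illustration and are not part of the paper's algorithm) shows the point numerically: a uniform ℓ_∞ perturbation of size ε multiplies Φ_new by at most exp(λε), while the ℓ_2 potential has no comparable guarantee.

import numpy as np

m = 1000
lam = np.log(m)                         # λ = Θ(log m), as in the paper
rng = np.random.default_rng(0)
gamma = rng.uniform(0.0, 0.5, m)        # hypothetical per-block errors, γ_i ≤ 1

def phi_new(g):                         # soft-max potential: Σ_i exp(λ γ_i)
    return np.exp(lam * g).sum()

def phi_old(g):                         # classical ℓ2-type potential: Σ_i γ_i²
    return (g ** 2).sum()

eps = 1.0 / np.log(m) ** 2              # an ℓ∞ perturbation of size O(1/polylog(m))
noise = eps * rng.choice([-1.0, 1.0], m)

# Each exp(λ γ_i) term is scaled by at most exp(λ ε), so the ratio below lies in
# [exp(-λ ε), exp(λ ε)]: an ℓ∞ perturbation is only a small multiplicative change.
print(phi_new(gamma + noise) / phi_new(gamma), np.exp(lam * eps))

# The ℓ2 potential has no such guarantee: an adversarial ε-perturbation can shift
# it additively by 2 ε Σ_i γ_i = Ω(ε m), far above the O(1) budget when ε ≫ m^{-1/2}.
print(abs(phi_old(gamma + noise) - phi_old(gamma)), 2 * eps * gamma.sum())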

2.3 Speeding up via Sketching

To motivate our sketching algorithm, we consider an imaginary iterative method

    z^{(k+1)} ← z^{(k)} + P · F(z^{(k)})

where P is some dense matrix and F(z) is some simple formula that can be computed efficiently in linear time. Note that the cost per iteration is dominated by multiplying P with a vector, which takes O(n²) time. To avoid the cost of multiplication, instead of storing the solution explicitly, we store it implicitly by z^{(k)} = P · u^{(k)}. Now, the algorithm becomes

    u^{(k+1)} ← u^{(k)} + F(P · u^{(k)}).

This algorithm is as expensive as the previous one except that we switch the location of P. However, if we know the algorithm is robust under perturbation of the z^{(k)} term in F(z^{(k)}), we can instead do

    u^{(k+1)} ← u^{(k)} + F(R^⊤ R P · u^{(k)})

for some random Gaussian matrix R ∈ R^{b×n}. Note that the matrix RP is fixed throughout the whole algorithm and can be precomputed. Therefore, the cost per iteration decreases from O(n²) to O(nb).

For our problem, we need to make two adjustments. First, we need to sketch the change of z, that is, F(P · u^{(k)}), instead of z^{(k)} directly, because the change of z is smaller and this creates a smaller error. Second, we need to use a fresh random R every iteration to avoid the randomness dependence issue in the proof. For the imaginary iterative process, it becomes

    z̄^{(k+1)} ← z̄^{(k)} + R^{(k)⊤} R^{(k)} P · F(z̄^{(k)}),
    u^{(k+1)} ← u^{(k)} + F(z̄^{(k)}).

After some iterations, z̄^{(k)} becomes too far from z^{(k)} and hence we need to correct the error by setting z̄^{(k)} = P · u^{(k)}, which zeros the error.

Note that the algorithm explicitly maintains the approximate vector z̄ while implicitly maintaining the exact vector z by P · u^{(k)}. This is different from the classical way to sketch the Newton method [PW16, PW17], which is to simply run z^{(k+1)} ← z^{(k)} + R^⊤ R P · F(z^{(k)}) or to use another way to subsample and approximate P. Such a scheme relies on the iterative method to fix the error accumulated in the sketch, while we are actively fixing the error by maintaining both the approximate explicit vector z̄ and the exact implicit vector z.

Without precomputation, the cost of computing R^{(k)} P is in fact higher than that of P · F(z̄^{(k)}): the first involves multiplying multiple vectors with P while the second involves multiplying one vector with P. However, we can precompute [R^{(1)⊤}; R^{(2)⊤}; ⋯; R^{(T)⊤}]^⊤ · P by fast matrix multiplication. This decreases the cost of multiplying one vector with P to n^{ω−1} per vector, which is a huge saving from n². In our algorithm, we end up using only Õ(n) random vectors in total and hence the total cost is still roughly n^ω.
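The following is a minimal numerical sketch (our own toy instance — the matrix P, the map F, the sizes, and the reset schedule are all made up and are not the paper's Algorithm 5/6) of the bookkeeping above: the exact iterate z = P·u is kept implicitly through u, the approximate iterate z̄ is updated through a fresh sketch R^(k) each step, and z̄ is periodically reset to P·u. In the real algorithm the products R^(k)P are precomputed by fast matrix multiplication; here we multiply by P directly for readability.

import numpy as np

rng = np.random.default_rng(0)
n, b, T = 400, 40, 60                     # dimension, sketch size, iterations

# A fixed dense "projection-like" matrix P and a cheap entrywise update F.
M = rng.standard_normal((n, n)) / np.sqrt(n)
P = M @ M.T
P /= np.linalg.norm(P, 2)                 # normalize so the iteration stays bounded
def F(z):                                 # stand-in for the real linear-time formula
    return 0.1 * np.tanh(z) + 0.01

u = np.zeros(n)                           # implicit exact iterate: z = P @ u
zbar = P @ u                              # explicit approximate iterate

for k in range(T):
    R = rng.choice([-1.0, 1.0], size=(b, n)) / np.sqrt(b)   # fresh sketch each step
    step = F(zbar)
    zbar = zbar + R.T @ (R @ (P @ step))  # sketched update of the approximation
    u = u + step                          # exact update, stored implicitly
    if k % 10 == 9:                       # occasionally reset z̄ to the exact P·u
        zbar = P @ u

print(np.linalg.norm(zbar - P @ u))       # drift stays small between resets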

2.4 Maintaining the Sketch

The matrix P we use in interior point methods is of the form

    P = √W A^⊤ (A W A^⊤)^{−1} A √W

where W is some block diagonal matrix. [CLS19] showed one can approximately maintain the matrix P with total cost Õ(n^ω) across all iterations of the interior point method. However, the cost of applying the dense matrix P to a vector z is roughly O(n‖z‖_0), which is O(n²) for dense vectors. Since interior point methods take at least √n iterations in general, this gives a total runtime of O(n^{2.5}). The key idea in [CLS19] is that one can design a stochastic interior point method such that each step only needs to multiply P with a vector of density Õ(√n). This bypasses the n^{2.5} bottleneck.

In this paper, we do not have this issue because we only need to compute R P z, which is much cheaper than P z. We summarize why it suffices to maintain R P throughout the algorithm. In general, for interior point methods, the vector z is roughly a unit vector, and since P is an orthogonal projection, we have ‖P z‖_2 = O(1). One simple insight we have is that if we multiply a random √n × n matrix R with values ±1/√n by P z, we have ‖R P z‖_∞ = Õ(1/√n) (Lemma E.5). Since there are Õ(√n) iterations in the interior point method, the total error is roughly Õ(1) in a correctly reweighed ℓ_∞ norm. In Section A, we show that this is exactly what the interior point method needs for convergence. Furthermore, we note that though each step needs to use a fresh random matrix R_l of size √n × n, the random matrices [R_1^⊤; R_2^⊤; ⋯; R_T^⊤]^⊤ we need can all fit into an Õ(n) × n budget. Therefore, throughout the algorithm, we simply need to maintain the matrix [R_1^⊤; R_2^⊤; ⋯; R_T^⊤]^⊤ P, which can be done with total cost Õ(n^ω) across all iterations using ideas similar to [CLS19].

The only reason the data structure looks complicated is that when the block matrix W changes in different locations in √W A^⊤(AWA^⊤)^{−1}A√W, we need to update the matrix [R_1^⊤; R_2^⊤; ⋯; R_T^⊤]P appropriately. This gives us a few simple cases to handle in the algorithm and in the proof. For the intuition on how to maintain P under W change, see [CLS19, Section 2.2 and 5.1].
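A small numerical check (ours) of the insight quoted above, in the spirit of Lemma E.5: for the projection P = √W A^⊤(AWA^⊤)^{−1}A√W, a roughly unit vector z, and a √n × n sketch R with ±1/√n entries, ‖Pz‖_2 stays O(1) while ‖RPz‖_∞ concentrates around 1/√n. The sizes and the diagonal W below are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(1)
n, d = 1024, 64
r = int(np.sqrt(n))                                  # √n sketch rows

# Projection of the interior-point form P = W^{1/2} A^T (A W A^T)^{-1} A W^{1/2}.
A = rng.standard_normal((d, n))
W = np.diag(rng.uniform(0.5, 2.0, n))                # block-diagonal weights (1x1 blocks here)
Ws = np.sqrt(W)
P = Ws @ A.T @ np.linalg.solve(A @ W @ A.T, A @ Ws)

z = rng.standard_normal(n)
z /= np.linalg.norm(z)                               # roughly a unit vector

R = rng.choice([-1.0, 1.0], size=(r, n)) / np.sqrt(n)
print(np.linalg.norm(P @ z), np.max(np.abs(R @ (P @ z))))   # ≈ O(1) and ≈ O(1/√n)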

2.5 Fast rectangular matrix multiplication

Given two n × n matrices, the time to multiply them is n^{2.81} < n³ by applying Strassen's original algorithm [Str69]. The current best algorithm takes n^ω time where ω < 2.373 [Wil12, LG14]. One natural extension of multiplying two square matrices is multiplying two rectangular matrices: what is the running time of multiplying an n × n^a matrix with an n^a × n matrix? Let α denote the largest upper bound on a such that multiplying two such rectangular matrices takes n^{2+o(1)} time. The quantity α is called the dual exponent of matrix multiplication, and the state-of-the-art result is α = 0.31 [LGU18]. We use a similar idea as [CLS19] to delay the low-rank update when the rank is small.

3 Preliminaries

Given a vector x ∈ R^n and m compact convex sets K_1 ⊂ R^{n_1}, K_2 ⊂ R^{n_2}, ⋯, K_m ⊂ R^{n_m} with ∑_{i=1}^m n_i = n, we use x_i to denote the i-th block of x; then x ∈ ∏_{i=1}^m K_i if x_i ∈ K_i, ∀i ∈ [m].

We say a matrix A is block diagonal, A ∈ ⊕_{i=1}^m R^{n_i×n_i}, if A can be written as A = diag(A_1, A_2, ⋯, A_m) where A_1 ∈ R^{n_1×n_1}, A_2 ∈ R^{n_2×n_2}, ⋯, and A_m ∈ R^{n_m×n_m}. For a matrix A, we use ‖A‖_F to denote its Frobenius norm and ‖A‖_2 to denote its operator norm. There are some trivial facts: ‖AB‖_2 ≤ ‖A‖_2 · ‖B‖_2 and ‖AB‖_F ≤ ‖A‖_F · ‖B‖_2.

For notational convenience, we assume the number of variables n ≥ 10 and that there are no redundant constraints. In particular, this implies that the constraint matrix A is full rank. For a positive integer n, let [n] denote the set {1, 2, ⋯, n}.

For any function f, we define Õ(f) to be f · log^{O(1)}(f). In addition to the O(·) notation, for two functions f, g, we use the shorthand f ≲ g (resp. ≳) to indicate that f ≤ C g (resp. ≥) for some absolute constant C. For any function f, we use dom f to denote the domain of f.

For a vector v, we denote by ‖v‖ the standard Euclidean norm of v, and for a symmetric PSD matrix A, we let ‖v‖_A = (v^⊤ A v)^{1/2}. For a convex function f(x) that is clear from context, we denote ‖v‖_x = ‖v‖_{∇²f(x)} and ‖v‖_x^* = ‖v‖_{∇²f(x)^{−1}}.

4 Robust Central Path

In this section we show how to move efficiently from x(1) to x(ǫ) for some tiny ǫ by staying on a robust version of the central path. Because we are maintaining values that are slightly off-center, we show that our analysis still goes through despite ℓ_∞ perturbations on the order of O(1/poly log(n)).

4.1 Newton Step

To follow the path x(t), we consider the optimality condition of (3):

    s/t + ∇φ(x) = 0,
    Ax = b,

    A^⊤ y + s = c,

where ∇φ(x) = (∇φ_1(x_1), ∇φ_2(x_2), ⋯, ∇φ_m(x_m)). To handle the error incurred in the progress, we consider the perturbed central path

    s/t + ∇φ(x) = μ,
    Ax = b,

    A^⊤ y + s = c,

where μ represents the error between the original central path and our central path. In each iteration we decrease t by a certain factor, which may increase the error term μ. Therefore, we need a step that decreases the norm of μ. The Newton step to move μ to μ + h is given by

    (1/t) δ_s^{ideal} + ∇²φ(x) · δ_x^{ideal} = h,
    A δ_x^{ideal} = 0,
    A^⊤ δ_y^{ideal} + δ_s^{ideal} = 0,

where ∇²φ(x) is a block diagonal matrix whose i-th block is given by ∇²φ_i(x_i). Letting W = (∇²φ(x))^{−1}, we can solve this system:

    δ_y^{ideal} = −t (A W A^⊤)^{−1} A W h,
    δ_s^{ideal} = t A^⊤ (A W A^⊤)^{−1} A W h,
    δ_x^{ideal} = W h − W A^⊤ (A W A^⊤)^{−1} A W h.

We define the projection matrix P ∈ R^{n×n} as follows

    P = W^{1/2} A^⊤ (A W A^⊤)^{−1} A W^{1/2}

and then we rewrite them as

    δ_x^{ideal} = W^{1/2} (I − P) W^{1/2} δ_μ,                              (4)
    δ_s^{ideal} = t W^{−1/2} P W^{1/2} δ_μ.                                 (5)

One standard way to analyze the central path is to measure the error by ‖μ‖_{∇²φ(x)^{−1}} and use the step induced by h = −μ. One can easily prove that if ‖μ‖_{∇²φ(x)^{−1}} < 1/10, one Newton step decreases the norm by a constant factor. Therefore, one can alternate between decreasing t and doing a Newton step to follow the path.
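A short numerical check (ours; random A, a random positive definite stand-in for ∇²φ(x), and arbitrary t and h) that the solved form of the Newton step — equivalently (4) and (5) — satisfies the three equations of the system above.

import numpy as np

rng = np.random.default_rng(2)
n, d, t = 50, 10, 0.3
A = rng.standard_normal((d, n))
H = rng.standard_normal((n, n)); H = H @ H.T + n * np.eye(n)   # stand-in for ∇²φ(x) ≻ 0
W = np.linalg.inv(H)                                           # W = (∇²φ(x))^{-1}
h = rng.standard_normal(n)

S = A @ W @ A.T                                                # A W A^T
delta_y = -t * np.linalg.solve(S, A @ W @ h)
delta_s = t * A.T @ np.linalg.solve(S, A @ W @ h)              # = t W^{-1/2} P W^{1/2} h
delta_x = W @ h - W @ A.T @ np.linalg.solve(S, A @ W @ h)      # = W^{1/2}(I - P) W^{1/2} h

print(np.linalg.norm(A @ delta_x))                             # ≈ 0
print(np.linalg.norm(delta_s / t + H @ delta_x - h))           # ≈ 0
print(np.linalg.norm(A.T @ delta_y + delta_s))                 # ≈ 0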

4.2 Robust Central Path Method

In this section, we develop a central path method that is robust under certain ℓ_∞ perturbations. Due to the ℓ_∞ perturbation, we measure the error μ by a soft-max rather than an ℓ_2-type potential:

Definition 4.1. For each i ∈ [m], let μ_i^t(x,s) ∈ R^{n_i} and γ_i^t(x,s) ∈ R be defined as follows:

    μ_i^t(x,s) = s_i/t + ∇φ_i(x_i),                                         (6)
    γ_i^t(x,s) = ‖μ_i^t(x,s)‖_{∇²φ_i(x_i)^{−1}},                            (7)

and we define the potential function Φ as follows:

    Φ^t(x,s) = ∑_{i=1}^m exp(λ γ_i^t(x,s))

where λ = O(log m).

The robust central path is the region of (x,s) that satisfies Φ^t(x,s) ≤ O(m). To run our convergence argument, we will set λ appropriately so that staying on the robust central path guarantees an ℓ_∞ bound on γ. Then, we will show how to maintain Φ^t(x,s) to be small throughout the algorithm while decreasing t, always staying on the robust central path. This is broken into a two-step analysis: the progress step (decreasing t) and the centering step (moving x, s to decrease γ).

It is important to note that to follow the robust central path, we no longer pick the standard Newton direction by setting h = −μ. To explain how we pick our centering step, suppose we can move μ → μ + h arbitrarily with the only restriction being on the distance ‖h‖_{∇²φ(x)^{−1}} = α. Then, the natural step would be

    h = argmin_{‖h‖_{∇²φ(x)^{−1}} = α} ⟨∇f(μ(x,s)), h⟩

where f(μ) = ∑_{i=1}^m exp(λ‖μ_i‖_{∇²φ_i(x_i)^{−1}}). Note that

    ∇_i f(μ^t(x,s)) = λ exp(λ γ_i^t(x,s)) / γ_i^t(x,s) · ∇²φ_i(x_i)^{−1} μ_i^t(x,s).

Therefore, the solution of the minimization problem is

    h_i^{ideal} = −α · c_i^t(x,s)^{ideal} μ_i^t(x,s) ∈ R^{n_i},

where μ_i^t(x,s) ∈ R^{n_i} is defined in Eq. (6) and c_i^t(x,s) ∈ R is defined as

    c_i^t(x,s)^{ideal} = (exp(λ γ_i^t(x,s)) / γ_i^t(x,s)) / ( ∑_{i=1}^m exp(2λ γ_i^t(x,s)) )^{1/2}.

Eq. (4) and Eq. (5) give the corresponding ideal step on x and s.

Now, we discuss the perturbed version of this algorithm. Instead of using the exact x and s in the formula of h, we use an x̄ which is approximately close to x and an s̄ which is close to s. Precisely, we have

    h_i = −α · c_i^t(x̄, s̄) μ_i^t(x̄, s̄)                                    (8)

where

    c_i^t(x̄, s̄) = (exp(λ γ_i^t(x̄,s̄)) / γ_i^t(x̄,s̄)) / ( ∑_{i=1}^m exp(2λ γ_i^t(x̄,s̄)) )^{1/2}  if γ_i^t(x̄,s̄) ≥ 96√α, and 0 otherwise.   (9)

Note that our definition of c_i^t ensures that c_i^t(x̄,s̄) ≤ 1/(96√α) regardless of the value of γ_i^t(x̄,s̄). This makes sure we do not move too much in any coordinate, and indeed when γ_i^t is small, it is fine to set c_i^t = 0. Furthermore, for the formulas for δ_x and δ_s, we use some matrix Ṽ that is close to (∇²φ(x))^{−1}. Precisely, we have

    δ_x = Ṽ^{1/2} (I − P̃) Ṽ^{1/2} h,                                       (10)
    δ_s = t · Ṽ^{−1/2} P̃ Ṽ^{1/2} h,                                        (11)

where

    P̃ = Ṽ^{1/2} A^⊤ (A Ṽ A^⊤)^{−1} A Ṽ^{1/2}.

Here we give a quick summary of our algorithm; a simplified numerical sketch of one centering step follows the summary. (A more detailed description can be found in Algorithms 5 and 6 in Section C.)

• RobustIPM(A, b, c, φ, δ)
  – λ = 2^{16} log(m), α = 2^{−20} λ^{−2}, κ = 2^{−10} α.
  – δ ← min(1/λ, δ).
  – ν = ∑_{i=1}^m ν_i where ν_i are the self-concordant parameters of φ_i.
  – Modify the convex problem and obtain an initial x and s according to Lemma D.2.
  – t = 1.
  – While t > δ²/(4ν):
    ∗ Find x̄ and s̄ such that ‖x̄_i − x_i‖_{x_i} < α and ‖s̄_i − s_i‖_{x_i}^* < tα for all i.
    ∗ Find Ṽ such that (1 − α)(∇²φ_i(x̄_i))^{−1} ⪯ Ṽ_i ⪯ (1 + α)(∇²φ_i(x̄_i))^{−1} for all i.

    ∗ Compute h_i = −α · c_i^t(x̄, s̄) μ_i^t(x̄, s̄), where

        c_i^t(x̄, s̄) = (exp(λ γ_i^t(x̄,s̄)) / γ_i^t(x̄,s̄)) / ( ∑_{i=1}^m exp(2λ γ_i^t(x̄,s̄)) )^{1/2}  if γ_i^t(x̄, s̄) ≥ 96√α, and 0 otherwise,

      and μ_i^t(x̄, s̄) = s̄_i/t + ∇φ_i(x̄_i) and γ_i^t(x̄, s̄) = ‖μ_i^t(x̄, s̄)‖_{∇²φ_i(x̄_i)^{−1}}.
    ∗ Let P̃ = Ṽ^{1/2} A^⊤ (A Ṽ A^⊤)^{−1} A Ṽ^{1/2}.
    ∗ Compute δ_x = Ṽ^{1/2}(I − P̃)Ṽ^{1/2} h and δ_s = t · Ṽ^{−1/2} P̃ Ṽ^{1/2} h.
    ∗ Move x ← x + δ_x, s ← s + δ_s.
    ∗ t^{new} = (1 − κ/√ν) t.
  – Return an approximate solution of the convex problem according to Lemma D.2.
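Below is a minimal sketch (ours) of how one centering step assembles the direction from (8)-(9) and maps it to (δ_x, δ_s) via (10)-(11). It simplifies the algorithm in several ways that are our assumptions, not the paper's: all blocks are 1-dimensional, x̄ = x and s̄ = s are taken exactly, Ṽ is the exact inverse Hessian, and the sketching and the data structure of Section B are omitted entirely.

import numpy as np

def centering_step(A, x, s, t, grad_phi, hess_phi, lam, alpha):
    """One simplified robust centering step with 1-dimensional blocks.

    grad_phi(x), hess_phi(x): entrywise ∇φ_i(x_i) and ∇²φ_i(x_i) > 0.
    """
    g, Hd = grad_phi(x), hess_phi(x)
    mu = s / t + g                                   # μ_i^t = s_i/t + ∇φ_i(x_i), as in (6)
    gamma = np.abs(mu) / np.sqrt(Hd)                 # γ_i = ‖μ_i‖ in the inverse-Hessian norm, as in (7)
    w = np.exp(lam * gamma)
    c = np.where(gamma >= 96 * np.sqrt(alpha),
                 (w / np.maximum(gamma, 1e-12)) / np.sqrt((w ** 2).sum()),
                 0.0)                                # soft-max weights c_i^t, as in (9)
    h = -alpha * c * mu                              # direction h_i, as in (8)

    V = 1.0 / Hd                                     # Ṽ = (∇²φ)^{-1}, diagonal in this sketch
    S = (A * V) @ A.T                                # A Ṽ A^T
    AVh = A @ (V * h)
    delta_x = V * h - V * (A.T @ np.linalg.solve(S, AVh))   # Ṽ^{1/2}(I - P̃)Ṽ^{1/2} h, as in (10)
    delta_s = t * (A.T @ np.linalg.solve(S, AVh))           # t Ṽ^{-1/2} P̃ Ṽ^{1/2} h, as in (11)
    return x + delta_x, s + delta_s

In the full algorithm, x̄, s̄ and Ṽ are only maintained approximately and every product with P̃ is sketched and carried by the data structure; this sketch keeps everything exact for readability.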

Theorem 4.2 (Robust Interior Point Method). Consider a convex problem min_{Ax=b, x ∈ ∏_{i=1}^m K_i} c^⊤x where the K_i are compact convex sets. For each i ∈ [m], we are given a ν_i-self-concordant barrier function φ_i for K_i. Let ν = ∑_{i=1}^m ν_i. Also, we are given x^{(0)} = argmin_x ∑_{i=1}^m φ_i(x_i). Assume that
1. Diameter of the set: for any x ∈ ∏_{i=1}^m K_i, we have ‖x‖_2 ≤ R.
2. Lipschitz constant of the program: ‖c‖_2 ≤ L.
Then, the algorithm RobustIPM finds a vector x such that

    c^⊤x ≤ min_{Ax=b, x ∈ ∏_{i=1}^m K_i} c^⊤x + LR · δ,

    ‖Ax − b‖_1 ≤ 3δ · ( R ∑_{i,j} |A_{i,j}| + ‖b‖_1 ),
    x ∈ ∏_{i=1}^m K_i,

in O(√ν log² m · log(ν/δ)) iterations.

Proof. Lemma D.2 shows that the initial x and s satisfy

    ‖s + ∇φ(x)‖_x^* ≤ δ ≤ 1/λ,

where the last inequality is due to our step δ ← min(1/λ, δ). This implies that γ_i^1(x,s) = ‖s_i + ∇φ_i(x_i)‖_{x_i}^* ≤ 1/λ and hence Φ^1(x,s) ≤ e·m ≤ 80·m/α for the initial x and s. Applying Lemma A.8 repeatedly, we have that Φ^t(x,s) ≤ 80·m/α during the whole algorithm. In particular, we have this at the end of the algorithm. This implies that

    ‖s_i/t + ∇φ_i(x_i)‖_{x_i}^* ≤ log(80·m/α)/λ ≤ 1

at the end. Therefore, we can apply Lemma D.3 to show that

    ⟨c, x⟩ ≤ ⟨c, x^*⟩ + 4tν ≤ ⟨c, x^*⟩ + δ²,

where we used the stopping condition for t at the end. Note that this guarantee holds for the modified convex program. Since the error is δ², Lemma D.2 shows how to get an approximate solution for the original convex program with error LR·δ.

The number of steps follows from the fact that we decrease t by a (1 − 1/(√ν log² m)) factor every iteration.

References

[AH16] Zeyuan Allen-Zhu and Elad Hazan. Variance Reduction for Faster Non-Convex Opti- mization. In Proceedings of the 33rd International Conference on Machine Learning, ICML ’16, 2016. Full version available at http://arxiv.org/abs/1603.05643.

[AKK+17] Naman Agarwal, Sham Kakade, Rahul Kidambi, Yin Tat Lee, Praneeth Netrapalli, and Aaron Sidford. Leverage score sampling for faster accelerated regression and erm. arXiv preprint arXiv:1711.08426, 2017.

[AKPS19] Deeksha Adil, Rasmus Kyng, Richard Peng, and Sushant Sachdeva. Iterative refinement for ℓp-norm regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1405–1424. SIAM, 2019.

[All17a] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient meth- ods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Com- puting, pages 1200–1205. ACM, 2017.

[All17b] Zeyuan Allen-Zhu. Natasha: Faster Non-Convex Stochastic Optimization via Strongly Non-Convex Parameter. In Proceedings of the 34th International Con- ference on Machine Learning, ICML ’17, 2017. Full version available at http://arxiv.org/abs/1702.00763.

[All18a] Zeyuan Allen-Zhu. Katyusha X: Practical Momentum Method for Stochas- tic Sum-of-Nonconvex Optimization. In Proceedings of the 35th International Conference on Machine Learning, ICML ’18, 2018. Full version available at http://arxiv.org/abs/1802.03866.

[All18b] Zeyuan Allen-Zhu. Natasha 2: Faster Non-Convex Optimization Than SGD. In Pro- ceedings of the 32nd Conference on Neural Information Processing Systems, NIPS ’18, 2018. Full version available at http://arxiv.org/abs/1708.08694.

[Alm18] Josh Alman. Limits on the universal method for matrix multiplication. arXiv preprint arXiv:1812.08731, 2018.

[AW18a] Josh Alman and Virginia Vassilevska Williams. Further limitations of the known ap- proaches for matrix multiplication. In ITCS. arXiv preprint arXiv:1712.07246, 2018.

[AW18b] Josh Alman and Virginia Vassilevska Williams. Limits on all known (and some un- known) approaches to matrix multiplication. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2018.

[AY16] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives. In Proceedings of the 33rd International Conference on Machine Learning, ICML ’16, 2016. Full version available at http://arxiv.org/abs/1506.01972.

[BB08] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in neural information processing systems, pages 161–168, 2008.

[BBM05] Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complex- ities. The Annals of Statistics, 33(4):1497–1537, 2005.

[BCLL18] Sébastien Bubeck, Michael B Cohen, Yin Tat Lee, and Yuanzhi Li. An homotopy method for ℓp regression provably beyond self-concordance and in input-sparsity time. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1130–1137. ACM, 2018.

[CKPS16] Xue Chen, Daniel M Kane, Eric Price, and Zhao Song. Fourier-sparse interpolation without a frequency gap. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 741–750. IEEE, 2016.

[Cla05] Kenneth L Clarkson. Subgradient and sampling algorithms for ℓ1 regression. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms (SODA), pages 257–266, 2005.

[CLS19] Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In STOC. https://arxiv.org/pdf/1810.07896.pdf, 2019.

[Cox58] David R Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society. Series B (Methodological), pages 215–242, 1958.

[Csi18] Dominik Csiba. Data sampling strategies in stochastic algorithms for empirical risk minimization. arXiv preprint arXiv:1804.00437, 2018.

[CT65] James W Cooley and John W Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of computation, 19(90):297–301, 1965.

[CV95] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

[CW87] Don Coppersmith and Shmuel Winograd. Matrix multiplication via arithmetic progressions. In Proceedings of the nineteenth annual ACM symposium on Theory of computing (STOC), pages 1–6. ACM, 1987.

[DB14] Alexandre Défossez and Francis Bach. Constant step size least-mean-square: Bias-variance trade-offs and optimal sampling distributions. arXiv preprint arXiv:1412.0156, 2014.

[DB16] Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.

[DBLJ14] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.

[DDH+09] Anirban Dasgupta, Petros Drineas, Boulos Harb, Ravi Kumar, and Michael W Mahoney. Sampling algorithms and coresets for ℓp regression. SIAM Journal on Computing, 38(5):2060–2078, 2009.

[FGKS15] Roy Frostig, Rong Ge, Sham M Kakade, and Aaron Sidford. Competing with the empirical risk minimizer in a single pass. In Conference on learning theory (COLT), pages 728–763, 2015.

[FGRW12] Vitaly Feldman, Venkatesan Guruswami, Prasad Raghavendra, and Yi Wu. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558–1590, 2012.

[FS97] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.

[GSS17] Alon Gonen and Shai Shalev-Shwartz. Fast rates for empirical risk minimization of strict saddle problems. arXiv preprint arXiv:1701.04271, 2017.

[HIKP12a] Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Nearly optimal sparse Fourier transform. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 563–578. ACM, 2012.

[HIKP12b] Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Simple and practical algorithm for sparse Fourier transform. In Proceedings of the twenty-third annual ACM- SIAM symposium on Discrete Algorithms, pages 1183–1194. SIAM, 2012.

[HJLS13] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied logistic regression, volume 398. John Wiley & Sons, 2013.

[IK14] Piotr Indyk and Michael Kapralov. Sample-optimal fourier sampling in any constant dimension. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Sym- posium on, pages 514–523. IEEE, 2014.

[IKP14] Piotr Indyk, Michael Kapralov, and Eric Price. (Nearly) Sample-optimal sparse Fourier transform. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Dis- crete Algorithms, pages 480–499. SIAM, 2014.

[JLGJ18] Chi Jin, Lydia T Liu, Rong Ge, and Michael I Jordan. On the local minima of the empirical risk. In Advances in Neural Information Processing Systems (NeurIPS), pages 4901–4910, 2018.

[JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315– 323, 2013.

[Kap16] Michael Kapralov. Sparse Fourier transform in any constant dimension with nearly- optimal sample complexity in sublinear time. In Symposium on Theory of Computing Conference, STOC’16, Cambridge, MA, USA, June 19-21, 2016, 2016.

[Kap17] Michael Kapralov. Sample efficient estimation and recovery in sparse fft via isolation on average. In Foundations of Computer Science, 2017. FOCS’17. IEEE 58th Annual IEEE Symposium on. https://arxiv.org/pdf/1708.04544, 2017.

[KH01] Roger Koenker and Kevin F Hallock. Quantile regression. Journal of economic perspec- tives, 15(4):143–156, 2001.

[Koe00] Roger Koenker. Galton, edgeworth, frisch, and prospects for quantile regression in econometrics. Journal of Econometrics, 95(2):347–374, 2000.

[Koe05] Roger Koenker. Quantile Regression. Cambridge University Press, 2005.

[LDFU13] Yichao Lu, Paramveer Dhillon, Dean P Foster, and Lyle Ungar. Faster ridge regression via the subsampled randomized hadamard transform. In Advances in neural information processing systems, pages 369–377, 2013.

[Lee17] Yin Tat Lee. Uniform sampling and inverse maintenance. In Talk at Michael Cohen Memorial Symposium. Available at: https://simons.berkeley.edu/talks/welcome-andbirds-eye-view-michaels-work., 2017.

[LG14] François Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th international symposium on symbolic and algebraic computation(ISSAC), pages 296–303. ACM, 2014.

[LGU18] Francois Le Gall and Florent Urrutia. Improved rectangular matrix multiplication using powers of the coppersmith-winograd tensor. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms(SODA), pages 1029–1046. SIAM, 2018.

[LJCJ17] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum op- timization via scsg methods. In Advances in Neural Information Processing Systems, pages 2348–2358, 2017.

[LM00] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.

[LMH15] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.

[LMH17] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. Catalyst acceleration for first-order convex optimization: from theory to practice. arXiv preprint arXiv:1712.05654, 2017.

[LRSB12] Nicolas Le Roux, Mark W Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2672–2680, 2012.

[MB11] Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.

[MLF15] Zhuang Ma, Yichao Lu, and Dean Foster. Finding linear structure in large datasets with scalable canonical correlation analysis. In International Conference on Machine Learning, pages 169–178, 2015.

[MS17] Tomoya Murata and Taiji Suzuki. Doubly accelerated stochastic variance reduced dual averaging method for regularized empirical risk minimization. In Advances in Neural Information Processing Systems, pages 608–617, 2017.

[Nad64] Elizbar A Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.

[Nes83] Yurii E Nesterov. A method for solving the convex programming problem with conver- gence rate O(1/k2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.

[Nes98] Yurii Nesterov. Introductory lectures on convex programming volume i: Basic course. Lecture notes, 1998.

[Nes04] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2004.

[NJLS09] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on opti- mization, 19(4):1574–1609, 2009.

[NS17] Yurii Nesterov and Sebastian U Stich. Efficiency of the accelerated coordinate de- scent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110–123, 2017.

[NSW19] Vasileios Nakos, Zhao Song, and Zhengyu Wang. (Nearly) sample-optimal sparse Fourier transform in any dimension; RIPless and Filterless. In manuscript, 2019.

[PJ92] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[Pri13] Eric C. Price. Sparse recovery and Fourier sampling. PhD thesis, Massachusetts Institute of Technology, 2013.

[PS15] Eric Price and Zhao Song. A robust sparse Fourier transform in the continuous setting. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 583–600. IEEE, 2015.

[PSW17] Eric Price, Zhao Song, and David P. Woodruff. Fast regression with an ℓ guarantee. In ∞ International Colloquium on Automata, Languages, and Programming (ICALP), 2017.

[PW16] Mert Pilanci and Martin J Wainwright. Iterative hessian sketch: Fast and accurate solution approximation for constrained least-squares. The Journal of Machine Learning Research, 17(1):1842–1879, 2016.

[PW17] Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205– 245, 2017.

[RHS+16] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochas- tic variance reduction for nonconvex optimization. In International conference on ma- chine learning, pages 314–323, 2016.

[SLC+17] Fanhua Shang, Yuanyuan Liu, James Cheng, KW Ng, and Yuichi Yoshida. Vari- ance reduced stochastic gradient descent with sufficient decrease. arXiv preprint arXiv:1703.06807, 2017.

[SLRB17] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

[SS16] Shai Shalev-Shwartz. SDCA without duality, regularization, and individual convexity. In International Conference on Machine Learning, pages 747–754, 2016.

[SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

15 [SSZ13] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013. [SSZ14] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordi- nate ascent for regularized loss minimization. In International Conference on Machine Learning, pages 64–72, 2014. [Str69] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354–356, 1969. [Tib96] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996. [Vai89] Pravin M Vaidya. Speeding-up linear programming using fast matrix multiplication. In FOCS. IEEE, 1989. [Vap92] Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in neural information processing systems, pages 831–838, 1992. [Vap13] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013. [Wat64] Geoffrey S Watson. Smooth regression analysis. Sankhy¯a: The Indian Journal of Statis- tics, Series A, pages 359–372, 1964. [Wil12] Virginia Vassilevska Williams. Multiplying matrices faster than coppersmith-winograd. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing (STOC), pages 887–898. ACM, 2012. [XJR02] Eric P Xing, Michael I Jordan, and Stuart Russell. A generalized mean field algorithm for variational inference in exponential families. In Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence, pages 583–591. Morgan Kaufmann Publishers Inc., 2002. [XZ14] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014. [ZH05] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301– 320, 2005. [ZWX+17] Shun Zheng, Jialei Wang, Fen Xia, Wei Xu, and Tong Zhang. A general distributed dual coordinate optimization framework for regularized loss minimization. The Journal of Machine Learning Research, 18(1):4096–4117, 2017. [ZX17] Yuchen Zhang and Lin Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. The Journal of Machine Learning Research, 18(1):2939– 2980, 2017. [ZYJ17] Lijun Zhang, Tianbao Yang, and Rong Jin. Empirical risk minimization for stochas- tic convex optimization: O(1/n)-and O(1/n2)-type of risk bounds. arXiv preprint arXiv:1702.02030, 2017.

Statement    Section       Parameters
Lemma A.2    Section A.2   μ_i^t(x,s) → μ_i^t(x^new, s^new)
Lemma A.4    Section A.2   γ_i^t(x,x,s) → γ_i^t(x^new, x, s^new)
Lemma A.5    Section A.3   Φ(x,x,s) → Φ(x^new, x, s^new)
Lemma A.6    Section A.4   Φ(x^new, x, s^new) → Φ(x^new, x^new, s^new)
Lemma A.7    Section A.5   Φ^t → Φ^{t^new}
Lemma A.8    Section A.6   Φ^t(x,s) → Φ^{t^new}(x^new, s^new)

Table 1: Bounding the changes of different variables

Appendix

A Robust Central Path

The goal of this section is to analyze the robust central path. We provide an outline in Section A.1. In Section A.2, we bound the changes in μ and γ. In Section A.3, we analyze the changes from (x,x,s) to (x^new, x, s^new). In Section A.4, we analyze the changes from (x^new, x, s^new) to (x^new, x^new, s^new). We bound the changes in t in Section A.5. Finally, we analyze the overall change of the potential function in Section A.6.

A.1 Outline of Analysis

Basically, the main proof is a simple calculation of how Φ^t(x,s) changes during one iteration. It can be compared to the proof of ℓ_∞ potential reduction arguments for the convergence of long-step interior point methods, although the main difficulty arises from the perturbations caused by stepping using x̄, s̄ instead of x, s.

To organize the calculations, we note that the term γ_i^t(x,s) = ‖μ_i^t(x,s)‖_{∇²φ_i(x_i)^{−1}} has two terms involving x, one in the μ term and one in the Hessian. Hence, we separate how different x affect the potential by defining

    γ_i^t(x,z,s) = ‖μ_i^t(x,s)‖_{∇²φ_i(z_i)^{−1}},
    Φ^t(x,z,s) = ∑_{i=1}^m exp(λ γ_i^t(x,z,s)).

One difference between our proof and standard ℓ_2 proofs of interior point methods is that we assume the barrier function is decomposable. We define α_i = ‖δ_{x,i}‖_{x̄_i} as the "step" size of coordinate i. One crucial fact we are using is that the sum of squares of the step sizes is small.

Lemma A.1. Let α denote the parameter in RobustIPM. For all i ∈ [m], let α_i = ‖δ_{x,i}‖_{x̄_i}. Then,

    ∑_{i=1}^m α_i² ≤ 4α².

Proof. Note that

    ∑_{i=1}^m α_i² = ‖δ_x‖_{x̄}² = h^⊤ Ṽ^{1/2}(I − P̃) Ṽ^{1/2} ∇²φ(x̄) Ṽ^{1/2}(I − P̃) Ṽ^{1/2} h.

Since (1 − α)(∇²φ_i(x̄_i))^{−1} ⪯ Ṽ_i ⪯ (1 + α)(∇²φ_i(x̄_i))^{−1}, we have that

    (1 − α)(∇²φ(x̄))^{−1} ⪯ Ṽ ⪯ (1 + α)(∇²φ(x̄))^{−1}.

Using α ≤ 1/10000, we have that

    ∑_{i=1}^m α_i² ≤ 2 h^⊤ Ṽ^{1/2}(I − P̃)(I − P̃)Ṽ^{1/2} h ≤ 2 h^⊤ Ṽ h,

where we used that I − P̃ is an orthogonal projection at the end. Finally, we note that

    h^⊤ Ṽ h ≤ 2 ∑_{i=1}^m ‖h_i‖_{x̄_i}^{*2}
            = 2α² ∑_{i=1}^m c_i^t(x̄, s̄)² ‖μ_i^t(x̄, s̄)‖_{x̄_i}^{*2}
            ≤ 2α² ∑_{i=1}^m ( exp(2λγ_i^t(x̄,s̄)) / γ_i^t(x̄,s̄)² ) / ( ∑_{i=1}^m exp(2λγ_i^t(x̄,s̄)) ) · ‖μ_i^t(x̄,s̄)‖_{x̄_i}^{*2}
            = 2α² ∑_{i=1}^m exp(2λγ_i^t(x̄,s̄)) / ∑_{i=1}^m exp(2λγ_i^t(x̄,s̄))
            = 2α²,

where the second step follows from the definition of h_i (8), the third step follows from the definition of c_i^t (9), and the fourth step follows from the definition of γ_i^t (7). Therefore, putting it all together, we can show

    ∑_{i=1}^m α_i² ≤ 4α².

A.2 Changes in μ and γ

We provide basic lemmas that bound the changes in μ and γ due to the centering steps.

Lemma A.2 (Changes in μ). For all i ∈ [m], let

    μ_i^t(x^new, s^new) = μ_i^t(x,s) + h_i + ǫ_i^{(μ)}.

Then, ‖ǫ_i^{(μ)}‖_{x_i}^* ≤ 10α · α_i.

Proof. Let x^{(u)} = u x^new + (1 − u) x and μ_i^new = μ_i^t(x^new, s^new). The definition of μ (6) shows that

    μ_i^new = μ_i + (1/t) δ_{s,i} + ∇φ_i(x_i^new) − ∇φ_i(x_i)
            = μ_i + (1/t) δ_{s,i} + ∫_0^1 ∇²φ_i(x_i^{(u)}) δ_{x,i} du
            = μ_i + (1/t) δ_{s,i} + ∇²φ_i(x_i) δ_{x,i} + ∫_0^1 ( ∇²φ_i(x_i^{(u)}) − ∇²φ_i(x_i) ) δ_{x,i} du.

By the definition of δ_x and δ_s ((10) and (11)), we have that (1/t) δ_{s,i} + Ṽ_i^{−1} δ_{x,i} = h_i. Hence, we have

    μ_i^new = μ_i + h_i + ǫ_i^{(μ)},

where

    ǫ_i^{(μ)} = ∫_0^1 ( ∇²φ_i(x_i^{(u)}) − ∇²φ_i(x_i) ) δ_{x,i} du + ( ∇²φ_i(x_i) − Ṽ_i^{−1} ) δ_{x,i}.   (12)

To bound ǫ_i^{(μ)}, we note that

x(t) x x(t) x + x x δ + α = α + α 3α k i − ikxi ≤ k i − ikxi k i − ikxi ≤ k x,ikxi i ≤ where the first step follows from triangle inequality, the third step follows from definition of αi (Lemma A.1), and the last step follows from α 2α (Lemma A.1). i ≤ Using α 1 , Theorem 2.3 shows that ≤ 100 7α 2φ (x ) 2φ (x(u)) 2φ (x ) 7α 2φ (x ). − ·∇ i i ∇ i i −∇ i i  ·∇ i i Equivalently, we have

2 (u) 2 2 1 2 (u) 2 2 2 ( φ (x ) φ (x )) ( φ (x ))− ( φ (x ) φ (x )) (7α) φ (x ). ∇ i i −∇ i i · ∇ i i · ∇ i i −∇ i i  ·∇ i i Using this, we have

1 ∗ 1 2 (u) 2 2 (u) 2 ∗ φi(xi ) φi(xi) δx,i du φi(xi ) φi(xi) δx,i du ∇ −∇ ≤ ∇ −∇ xi Z0 xi Z0     7α δx,i x = 7α αi, (13) ≤ k k i · where the last step follows from definition of αi (Lemma A.1). (µ) For the other term in ǫi , we note that

2 1 2 (1 2α) ( φ (x )) V − (1 + 2α) ( φ (x )). − · ∇ i i  i  · ∇ i i Hence, we have e 2 1 ∗ ( φi(xi) Vi− )δx,i 2α δx,i xi = 2α αi. (14) ∇ − xi ≤ k k ·

Combining (12), (13) and (14), we have ‖ǫ_i^{(μ)}‖_{x̄_i}^* ≤ 9α · α_i. Finally, we use the fact that x̄_i and x_i are α-close and hence, again by self-concordance, ‖ǫ_i^{(μ)}‖_{x_i}^* ≤ 10α · α_i.

Before bounding the change of γ, we first prove a helper lemma:

Lemma A.3. For all i ∈ [m], we have

    ‖μ_i^t(x,s) − μ_i^t(x̄, s̄)‖_{x_i}^* ≤ 4α.

Proof. Note that

    ‖μ_i^t(x,s) − μ_i^t(x̄, s̄)‖_{x_i}^* ≤ (1/t)‖s_i − s̄_i‖_{x_i}^* + ‖∇φ_i(x_i) − ∇φ_i(x̄_i)‖_{x_i}^*.

For the first term, we have ‖s_i − s̄_i‖_{x_i}^* ≤ tα. For the second term, let x_i^{(u)} = u x_i + (1 − u) x̄_i. Since x̄_i is close enough to x_i, Theorem 2.3 shows that ∇²φ_i(x_i^{(u)}) ⪯ 2 · ∇²φ_i(x_i). Hence, we have

    ‖∇φ_i(x_i) − ∇φ_i(x̄_i)‖_{x_i}^* = ‖ ∫_0^1 ∇²φ_i(x_i^{(u)}) · (x_i − x̄_i) du ‖_{x_i}^* ≤ 2 ‖x_i − x̄_i‖_{x_i} ≤ 2α.

Hence, we have ‖μ_i^t(x,s) − μ_i^t(x̄, s̄)‖_{x_i}^* ≤ 3α, and using again that x̄_i is close enough to x_i, we get the final result.

Lemma A.4 (Changes in γ). For all i ∈ [m], we have

    γ_i^t(x^new, x, s^new) ≤ (1 − α · c_i^t(x̄, s̄)) γ_i^t(x,x,s) + ǫ_i^{(γ)},

where ǫ_i^{(γ)} ≤ 10α · (α c_i^t(x̄,s̄) + α_i). Furthermore, we have |γ_i^t(x^new, x, s^new) − γ_i^t(x,x,s)| ≤ 3α.

Proof. For the first claim, Lemma A.2 and the definitions of γ (7), h (8) and c (9) show that

    γ_i^t(x^new, x, s^new) = ‖μ_i^t(x,s) + h_i + ǫ_i^{(μ)}‖_{x_i}^*
                           = ‖(1 − α · c_i^t(x̄,s̄)) μ_i^t(x,s) + ǫ_i‖_{x_i}^*

where ǫ_i = α · c_i^t(x̄,s̄)(μ_i^t(x,s) − μ_i^t(x̄,s̄)) + ǫ_i^{(μ)}. From the definition of c_i^t, we have that c_i^t ≤ 1/(96√α) ≤ 1/α and hence 0 ≤ 1 − α · c_i^t(x̄,s̄) ≤ 1. Therefore, we have

    γ_i^t(x^new, x, s^new) ≤ (1 − α · c_i^t(x̄,s̄)) γ_i^t(x,x,s) + ‖ǫ_i‖_{x_i}^*.   (15)

Now, we bound ‖ǫ_i‖_{x_i}^*:

    ‖ǫ_i‖_{x_i}^* ≤ α c_i^t(x̄,s̄) · ‖μ_i^t(x,s) − μ_i^t(x̄,s̄)‖_{x_i}^* + ‖ǫ_i^{(μ)}‖_{x_i}^*
                 ≤ 4α² c_i^t(x̄,s̄) + 10α · α_i,                                    (16)

where we used Lemma A.3 and Lemma A.2 at the end. For the second claim, we have

    γ_i^t(x^new, x, s^new) − γ_i^t(x,x,s) ≤ ‖h_i + ǫ_i^{(μ)}‖_{x_i}^* ≤ 2α + 10α · α_i,

where we used (16) and that ‖h_i‖_{x_i}^* ≤ 2‖h_i‖_{x̄_i}^* ≤ 2α. From Lemma A.1 and that α ≤ 1/10000, we have 10α · α_i ≤ 20α² ≤ α.

A.3 Movement from (x, x, s) to (x^new, x, s^new)

In the previous section, we saw that γ_i is expected to decrease by a factor of α · c_i^t, up to some small perturbations. We show that our potential Φ^t will therefore decrease significantly.

Lemma A.5 (Movement along the first and third parameters). Assume that γ_i^t(x,x,s) ≤ 1 for all i. We have

    Φ^t(x^new, x, s^new) ≤ Φ^t(x,x,s) − (αλ/5) ( ∑_{i=1}^m exp(2λ γ_i^t(x̄, s̄)) )^{1/2} + √m λ · exp(192λ√α).

Note that γ is a function that has three inputs. We use γ(x,s) to denote γ(x,x,s) for simplicity.

Proof. Let Φnew =Φt(xnew,x,snew), Φ=Φt(x,x,s),

    γ_i^{(u)} = u γ_i^t(x^new, x, s^new) + (1 − u) γ_i^t(x,x,s).

m m new λγ(1) λγ(0) λγ(ζ) (1) (0) Φ Φ= (e i e i )= λ e i (γ γ ) − − i − i Xi=1 Xi=1 for some 0 ζ 1. Let v = γ(1) γ(0). Lemma A.4 shows that ≤ ≤ i i − i v α ct(x, s) γt(x,x,s)+ ǫ(γ) = α ct(x, s) γ(0) + ǫ(γ) i ≤− · i · i i − · i · i i and hence m m Φnew Φ − α ct(x, s) γ(0) exp(λγ(ζ))+ ǫ(γ) exp(λγ(ζ)). (17) λ ≤− i · i i i i Xi=1 Xi=1 (0) (ζ) t To bound the first term in (17), we first relate γi , γi and γi (x, s) . Lemma A.4 shows that

γ(0) γ(ζ) γ(0) γ(1) 3α. (18) | i − i | ≤ | i − i |≤ Finally, we have

γt(x, s) γ(0) = γt(x, x, s) γt(x,x,s) i − i i − i t t t t γi (x, x, s) γi (x, x,s) + γi (x, x,s) γi (x,x,s) ≤ − − µt(x,s) µt(x, s) + µt(x,s) µt(x,s) i i ∗xi i x∗i i x∗ i ≤ k − k k k − k k 2 µt(x,s) µt(x, s) + µt(x,s) µt(x,s) i i x∗i i x∗i i x∗ i ≤ k − k k k − k k 2 µt(x,s) µt(x, s) + 2α µt(x,s) i i x∗i i x∗i ≤ k − k k k 8α + 2α = 10α (19) ≤ where the first step follows from definition, the second and third step follows from triangle inequality, t t t t the fourth step follows from µ (x,s) µ (x, s) ∗ 2 µ (x,s) µ (x, s) ∗ , the fifth step follows k i − i kxi ≤ k i − i kxi from self-concordance, the sixth step follows from Lemma A.3 and that µt(x,s) = γt(x,x,s) 1 k i kx∗i i ≤ for all i

21 Using (18) and (19), we have m ct(x, s) γ(0) exp(λγ(ζ)) i · i i i=1 Xm = ct(x, s) γ(0) exp(λγt(x, s) λγt(x, s)+ λγ(0) λγ(0) + λγ(ζ)) i · i i − i i − i i i=1 Xm ct(x, s) γ(0) exp(λγt(x, s) 13λα) ≥ i · i i − i=1 Xm 1 ct(x, s) γ(0) exp(λγt(x, s)) ≥ 2 i · i i i=1 Xm m 1 ct(x, s) γt(x, s) exp(λγt(x, s)) 3α ct(x, s) exp(λγt(x, s)). (20) ≥ 2 i · i i − i i Xi=1 Xi=1 where the third step follows from exp( 13λα) 1/2, and the last step follows from (18). − ≥ For the first term in (20), we have m ct(x, s) γt(x, s) exp(λγt(x, s)) i · i i Xi=1 exp(2λ γt(x, s)) = · i ( m exp(2λγt(x, s)))1/2 γt(x,s) 96√α i=1 i i X≥ m exp(2Pλ γt(x, s)) exp(2λ γt(x, s)) = · i · i ( m exp(2λγt(x, s)))1/2 − ( m exp(2λγt(x, s)))1/2 i=1 i=1 i γt(x,s)<96√α i=1 i X i X m P 1/2 P m exp(192λ √α) exp(2λγt(x, s)) . i m · t · 1/2 ≥ ! − ( i=1 exp(2λγi (x, s))) Xi=1 So, if m exp(2λγt(x, s)) m exp(192λ √Pα), we have i=1 i ≥ · · Pm m 1/2 t t t t ci(x, s) γi (x, s) exp(λγi (x, s)) exp(2λγi (x, s)) √m exp(192λ√α). · ≥ ! − · Xi=1 Xi=1 Note that if m exp(2λγt(x, s)) m exp(192λ √α), this is still true because left hand side is i=1 i ≤ · · lower bounded by 0. For the second term in (20), we have P m exp(λ γt(x, s))/γt(x, s) ct(x, s) exp(λγt(x, s)) = · i i exp(λγt(x, s)) i i ( m exp(2λγt(x, s)))1/2 i i=1 γt(x,s) 96√α i=1 i X i X≥ 1 P exp(2λ γt(x, s)) · i ≤ 96√α ( m exp(2λγt(x, s)))1/2 γt(x,s) 96√α i=1 i i X≥ m P t 1 exp(2λ γi (x, s)) m · t 1/2 ≤ 96√α ( i=1 exp(2λγi (x, s))) Xi=1 m P 1/2 1 t = exp(2λγi (x, s)) . 96√α ! Xi=1

22 1 1 where the second step follows γt(x,s) 96√α , and the third step follows from each term in the i ≤ summation is non-negative. Combining the bounds for both first and second term in (20), we have

m m 1/2 t (0) (ζ) 1 t ci(x, s) γi exp(λγi ) exp(2λγi (x, s)) √m exp(192λ√α) · ≥ 2  ! − ·  Xi=1 Xi=1  m 1/2  3α t exp(2λγi (x, s)) − 96√α ! Xi=1 m 1/2 2 t exp(2λγi (x, s)) √m exp(192λ√α). (21) ≥ 5 ! − · Xi=1 where the last step follows from 1 3α 1 3 = 45 2 . 2 − 96√α ≥ 2 − 96 96 ≥ 5 For the second term in (17), we note that γ(ζ) γt(x, s) 13α 1 by (18) and (19). Hence, | i − i |≤ ≤ 2λ m m ǫ(γ) exp(λγ(ζ)) 2 ǫ(γ) exp(λγt(x, s)). i i ≤ i i Xi=1 Xi=1 Now, we use ǫ(γ) 10α (αct(x, s)+ α ) (Lemma A.4) to get i ≤ · i i m m ǫ(γ) exp(λγ(ζ)) 20α (αct(x, s)+ α ) exp(λγt(x, s)) i i ≤ i i · i Xi=1 Xi=1 m 1/2 m 1/2 t 2 t 20α (αci(x, s)+ αi) exp(2λγi (x, s)) . ≤ ! ! Xi=1 Xi=1 where the last step follows from Cauchy-Schwarz inequality. Note that by using Cauchy-Schwarz,

\[
\left(\sum_{i=1}^m(\alpha c_i^t(\bar x,\bar s)+\alpha_i)^2\right)^{1/2} \le \alpha\left(\sum_{i=1}^m c_i^t(\bar x,\bar s)^2\right)^{1/2} + \left(\sum_{i=1}^m\alpha_i^2\right)^{1/2} \le \alpha\cdot\frac{1}{96\sqrt{\alpha}} + 2\alpha \le \frac{\sqrt{\alpha}}{90},
\]
where we used the definition of $c^t$, Lemma A.1 and $\alpha \le \frac{1}{2^{24}}$. Together, we conclude
\[
\sum_{i=1}^m \epsilon_i^{(\gamma)}\exp(\lambda\gamma_i^{(\zeta)}) \le \frac{1}{5}\alpha\left(\sum_{i=1}^m\exp(2\lambda\gamma_i^t(\bar x,\bar s))\right)^{1/2}. \tag{22}
\]
Combining (21) and (22) into (17) gives

\begin{align*}
\frac{\Phi^{\mathrm{new}} - \Phi}{\lambda} &\le -\frac{2}{5}\alpha\left(\sum_{i=1}^m\exp(2\lambda\gamma_i^t(\bar x,\bar s))\right)^{1/2} + \sqrt{m}\cdot\exp(192\lambda\sqrt{\alpha}) + \frac{1}{5}\alpha\left(\sum_{i=1}^m\exp(2\lambda\gamma_i^t(\bar x,\bar s))\right)^{1/2}\\
&= -\frac{1}{5}\alpha\left(\sum_{i=1}^m\exp(2\lambda\gamma_i^t(\bar x,\bar s))\right)^{1/2} + \sqrt{m}\cdot\exp(192\lambda\sqrt{\alpha}),
\end{align*}
where the last step follows from merging the first term with the third term.

A.4 Movement from $(x^{\mathrm{new}},\bar x,s^{\mathrm{new}})$ to $(x^{\mathrm{new}},\bar x^{\mathrm{new}},s^{\mathrm{new}})$

Next, we must analyze the potential change when we change the second term.

Lemma A.6 (Movement along the second parameter). Assume that $\|\gamma^t(x,\bar x,s)\|_\infty\le 1$. Then we have
\[
\Phi^t(x^{\mathrm{new}},\bar x^{\mathrm{new}},s^{\mathrm{new}}) \le \Phi^t(x^{\mathrm{new}},\bar x,s^{\mathrm{new}}) + 12\alpha\big(\|\gamma^t(x,\bar x,s)\|_\infty+3\alpha\big)\lambda\left(\sum_{i=1}^m\exp(2\lambda\gamma_i^t(x,\bar x,s))\right)^{1/2}.
\]
Proof. We can upper bound $\Phi^t(x^{\mathrm{new}},\bar x^{\mathrm{new}},s^{\mathrm{new}})$ as follows

\[
\Phi^t(x^{\mathrm{new}},\bar x^{\mathrm{new}},s^{\mathrm{new}}) = \sum_{i=1}^m\exp\big(\lambda\gamma_i^t(x^{\mathrm{new}},\bar x^{\mathrm{new}},s^{\mathrm{new}})\big) \le \sum_{i=1}^m\exp\big(\lambda\gamma_i^t(x^{\mathrm{new}},\bar x,s^{\mathrm{new}})(1+2\alpha_i)\big),
\]
where the second step follows from $\gamma_i^t(x^{\mathrm{new}},\bar x^{\mathrm{new}},s^{\mathrm{new}}) \le \gamma_i^t(x^{\mathrm{new}},\bar x,s^{\mathrm{new}})\cdot(1+2\alpha_i)$, which holds by self-concordance (Theorem 2.3) and $\|\bar x_i^{\mathrm{new}}-\bar x_i\|_{\bar x_i}\le 2\alpha_i$.

Now, by Lemma A.4, we note that $\gamma_i^t(x^{\mathrm{new}},\bar x,s^{\mathrm{new}}) \le \gamma_i^t(x,\bar x,s)+3\alpha\le 1+3\alpha$ and that $\alpha\le\frac{1}{100\lambda}$. Hence, by a simple Taylor expansion, we have

\[
\Phi^t(x^{\mathrm{new}},\bar x^{\mathrm{new}},s^{\mathrm{new}}) \le \sum_{i=1}^m\exp\big(\lambda\gamma_i^t(x^{\mathrm{new}},\bar x,s^{\mathrm{new}})\big) + 3\lambda\sum_{i=1}^m\alpha_i\exp\big(\lambda\gamma_i^t(x^{\mathrm{new}},\bar x,s^{\mathrm{new}})\big)\,\gamma_i^t(x^{\mathrm{new}},\bar x,s^{\mathrm{new}}).
\]
Finally, we bound the last term by
\begin{align*}
\sum_{i=1}^m\exp\big(\lambda\gamma_i^t(x^{\mathrm{new}},\bar x,s^{\mathrm{new}})\big)\,\gamma_i^t(x^{\mathrm{new}},\bar x,s^{\mathrm{new}})\,\alpha_i
&\le \sum_{i=1}^m\exp\big(\lambda\gamma_i^t(x,\bar x,s)+3\lambda\alpha\big)\,(\gamma_i^t(x,\bar x,s)+3\alpha)\,\alpha_i\\
&\le 2\big(\|\gamma^t(x,\bar x,s)\|_\infty+3\alpha\big)\sum_{i=1}^m\exp\big(\lambda\gamma_i^t(x,\bar x,s)\big)\,\alpha_i\\
&\le 2\big(\|\gamma^t(x,\bar x,s)\|_\infty+3\alpha\big)\left(\sum_{i=1}^m\exp(2\lambda\gamma_i^t(x,\bar x,s))\right)^{1/2}\left(\sum_{i=1}^m\alpha_i^2\right)^{1/2}\\
&\le 4\alpha\big(\|\gamma^t(x,\bar x,s)\|_\infty+3\alpha\big)\left(\sum_{i=1}^m\exp(2\lambda\gamma_i^t(x,\bar x,s))\right)^{1/2},
\end{align*}
where the first step follows from $\exp\big(\lambda\gamma_i^t(x^{\mathrm{new}},\bar x,s^{\mathrm{new}})\big)\le\exp\big(\lambda\gamma_i^t(x,\bar x,s)+3\lambda\alpha\big)$, the second step follows from $\exp(3\lambda\alpha)\le 2$, the third step follows from the Cauchy-Schwarz inequality, and the last step follows from $\sum_{i=1}^m\alpha_i^2\le 4\alpha^2$.

A.5 Movement of t

Lastly, we analyze the effect of setting $t\to t^{\mathrm{new}}$.

Lemma A.7 (Movement in $t$). For any $x,s$ such that $\gamma_i^t(x,s)\le 1$ for all $i$, let $t^{\mathrm{new}} = \left(1-\frac{\kappa}{\sqrt{\nu}}\right)t$ where $\nu=\sum_{i=1}^m\nu_i$. Then we have
\[
\Phi^{t^{\mathrm{new}}}(x,s) \le \Phi^t(x,s) + 10\kappa\lambda\left(\sum_{i=1}^m\exp(2\lambda\gamma_i^t(x,s))\right)^{1/2}.
\]
Proof. Note that

\begin{align*}
\gamma_i^{t^{\mathrm{new}}}(x,s) &= \left\|\frac{s_i}{t^{\mathrm{new}}} + \nabla\phi_i(x_i)\right\|_{x_i}^* = \left\|\frac{s_i}{t(1-\kappa/\sqrt{\nu})} + \nabla\phi_i(x_i)\right\|_{x_i}^*\\
&\le (1+2\kappa/\sqrt{\nu})\,\gamma_i^t(x,s) + 2(\kappa/\sqrt{\nu})\|\nabla\phi_i(x_i)\|_{x_i}^*\\
&\le (1+2\kappa/\sqrt{\nu})\,\gamma_i^t(x,s) + 3\kappa\sqrt{\nu_i}/\sqrt{\nu}\\
&\le \gamma_i^t(x,s) + 5\kappa\sqrt{\nu_i}/\sqrt{\nu},
\end{align*}
where the first step follows from the definition, the second step follows from $t^{\mathrm{new}}=t(1-\kappa/\sqrt{\nu})$, the second to last step follows from the fact that our barriers are $\nu_i$-self-concordant, and the last step uses $\gamma_i^t(x,s)\le 1$ and $\nu\ge 1$. Using that $5\kappa\le\frac{1}{10\lambda}$ and $\gamma_i^t(x,s)\le 1$, we have by a simple Taylor expansion,
\begin{align*}
\Phi^{t^{\mathrm{new}}}(x,s) &\le \sum_{i=1}^m\exp(\lambda\gamma_i^t(x,s)) + 2\lambda\sum_{i=1}^m\exp(\lambda\gamma_i^t(x,s))\cdot 5\kappa\sqrt{\nu_i/\nu}\\
&= \sum_{i=1}^m\exp(\lambda\gamma_i^t(x,s)) + 10\kappa\lambda\sum_{i=1}^m\exp(\lambda\gamma_i^t(x,s))\sqrt{\nu_i/\nu}\\
&\le \sum_{i=1}^m\exp(\lambda\gamma_i^t(x,s)) + 10\kappa\lambda\left(\sum_{i=1}^m\exp(2\lambda\gamma_i^t(x,s))\right)^{1/2}\left(\sum_{i=1}^m\frac{\nu_i}{\nu}\right)^{1/2}\\
&= \sum_{i=1}^m\exp(\lambda\gamma_i^t(x,s)) + 10\kappa\lambda\left(\sum_{i=1}^m\exp(2\lambda\gamma_i^t(x,s))\right)^{1/2},
\end{align*}
where the third step follows from Cauchy-Schwarz, and the last step follows from $\sum_{i=1}^m\nu_i=\nu$.

A.6 Potential Maintenance

Putting it all together, we can show that our potential $\Phi^t$ can be maintained to be small throughout our algorithm.

Lemma A.8 (Potential Maintenance). If $\Phi^t(x,s)\le 80\frac{m}{\alpha}$, then
\[
\Phi^{t^{\mathrm{new}}}(x^{\mathrm{new}},s^{\mathrm{new}}) \le \left(1-\frac{\alpha\lambda}{40\sqrt{m}}\right)\Phi^t(x,s) + \sqrt{m}\lambda\cdot\exp(192\lambda\sqrt{\alpha}).
\]
In particular, we have $\Phi^{t^{\mathrm{new}}}(x^{\mathrm{new}},s^{\mathrm{new}}) \le 80\frac{m}{\alpha}$.

Proof. Let
\[
\zeta(x,s) = \left(\sum_{i=1}^m\exp(2\lambda\gamma_i^t(x,s))\right)^{1/2}.
\]
By combining our previous lemmas,
\begin{align*}
\Phi^{t^{\mathrm{new}}}(x^{\mathrm{new}},s^{\mathrm{new}}) &\le \Phi^t(x^{\mathrm{new}},s^{\mathrm{new}}) + 10\kappa\lambda\cdot\zeta(x^{\mathrm{new}},s^{\mathrm{new}})\\
&\le \Phi^t(x^{\mathrm{new}},\bar x,s^{\mathrm{new}}) + 12\alpha\lambda\big(\|\gamma^t(x,\bar x,s)\|_\infty+3\alpha\big)\cdot\zeta(x,s) + 10\kappa\lambda\cdot\zeta(x^{\mathrm{new}},s^{\mathrm{new}})\\
&\le \Phi^t(x,\bar x,s) - \frac{\alpha\lambda}{5}\cdot\zeta(\bar x,\bar s) + \sqrt{m}\lambda\exp(192\lambda\sqrt{\alpha})\\
&\quad + 12\alpha\lambda\big(\|\gamma^t(x,\bar x,s)\|_\infty+3\alpha\big)\cdot\zeta(x,s) + 10\kappa\lambda\cdot\zeta(x^{\mathrm{new}},s^{\mathrm{new}}), \tag{23}
\end{align*}
where the first step follows from Lemma A.7, the second step follows from Lemma A.6, and the last step follows from Lemma A.5. We note that in all the lemmas above, we used the fact that $\|\gamma^t\|_\infty\le 1$ (for the different combinations of $x,\bar x,x^{\mathrm{new}},s,\bar s,s^{\mathrm{new}}$), which we will show later.

We can upper bound $\gamma_i^t(x^{\mathrm{new}},s^{\mathrm{new}})$ in the following sense:
\[
\gamma_i^t(x^{\mathrm{new}},s^{\mathrm{new}}) \le \gamma_i^t(x^{\mathrm{new}},\bar x,s^{\mathrm{new}}) + 2\alpha \le \gamma_i^t(x,\bar x,s) + 5\alpha, \tag{24}
\]
where the first step follows from self-concordance and $\gamma_i\le 1$, and the second step follows from Lemma A.4. Hence, since $\zeta$ changes multiplicatively when $\gamma$ changes additively, $\zeta(x^{\mathrm{new}},s^{\mathrm{new}})\le 2\zeta(x,s)$.

Lemma A.3 shows that $\|\mu_i^t(x,s)-\mu_i^t(\bar x,\bar s)\|_{x_i}^*\le 4\alpha$ and hence
\[
\zeta(\bar x,\bar s) \ge \frac{2}{3}\left(\sum_{i=1}^m\exp(2\lambda\gamma_i^t(x,\bar x,\bar s))\right)^{1/2} \ge \frac{2}{3}\left(\sum_{i=1}^m\exp\big(2\lambda\gamma_i^t(x,\bar x,s)-8\alpha\lambda\big)\right)^{1/2} \ge \frac{1}{2}\zeta(x,s). \tag{25}
\]
Combining (24) and (25) into (23) gives
\begin{align*}
\Phi^{t^{\mathrm{new}}}(x^{\mathrm{new}},s^{\mathrm{new}}) &\le \Phi^t(x,s) + \Big(12\alpha\lambda\big(\|\gamma^t(x,s)\|_\infty+3\alpha\big) + 20\kappa\lambda - \frac{\alpha\lambda}{10}\Big)\cdot\zeta(x,s) + \sqrt{m}\lambda\exp(192\lambda\sqrt{\alpha})\\
&\le \Phi^t(x,s) + \Big(12\alpha\lambda\|\gamma^t(x,s)\|_\infty - \frac{\alpha\lambda}{20}\Big)\cdot\zeta(x,s) + \sqrt{m}\lambda\exp(192\lambda\sqrt{\alpha}),
\end{align*}
where the last step follows from $\kappa\le\frac{\alpha}{1000}$ and $\alpha\le\frac{1}{10000}$.

Finally, we need to bound $\|\gamma^t(x,s)\|_\infty$. The bounds for the other $\|\gamma^t\|_\infty$, i.e. for the different combinations of $x,\bar x,x^{\mathrm{new}},s,\bar s,s^{\mathrm{new}}$, are similar. We note that $\Phi^t(x,s)\le 80\frac{m}{\alpha}$ implies that $\|\gamma^t(x,s)\|_\infty\le\frac{\log(80\frac{m}{\alpha})}{\lambda}$. Hence, by our choice of $\lambda$ and $\alpha$, we have that $\lambda\ge 480\log(80\frac{m}{\alpha})$ and hence $12\alpha\lambda\|\gamma^t(x,s)\|_\infty\le\frac{\alpha\lambda}{40}$.

Name          | Type    | Statement  | Algorithm | Input                    | Output
Initialize    | public  | Lemma B.4  | Alg. 1    | A, x, s, W, ǫ_mp, a, b   | ∅
Update        | public  | Lemma B.5  | Alg. 2    | W                        | ∅
FullUpdate    | private | Lemma B.7  | Alg. 3    | W                        | ∅
PartialUpdate | private | Lemma B.6  | Alg. 2    | W                        | ∅
Query         | public  | Lemma B.8  | Alg. 1    | ∅                        | x̄, s̄
MultiplyMove  | public  | Lemma B.11 | Alg. 4    | h, t                     | ∅
Multiply      | private | Lemma B.10 | Alg. 4    | h, t                     | ∅
Move          | private | Lemma B.9  | Alg. 4    | ∅                        | ∅

Table 2: Summary of data structure CentralPathMaintenance

Finally, using $\Phi^t(x,s)\le\sqrt{m}\cdot\zeta(x,s)$, we have
\begin{align*}
\Phi^{t^{\mathrm{new}}}(x^{\mathrm{new}},s^{\mathrm{new}}) &\le \Phi^t(x,s) - \frac{\alpha\lambda}{40}\zeta(x,s) + \sqrt{m}\lambda\exp(192\lambda\sqrt{\alpha})\\
&\le \left(1-\frac{\alpha\lambda}{40\sqrt{m}}\right)\Phi^t(x,s) + \sqrt{m}\lambda\exp(192\lambda\sqrt{\alpha}).
\end{align*}
Since $\lambda\le\frac{1}{400\sqrt{\alpha}}$, we have that $\Phi^t(x,s)\le 80\frac{m}{\alpha}$ implies $\Phi^{t^{\mathrm{new}}}(x^{\mathrm{new}},s^{\mathrm{new}})\le 80\frac{m}{\alpha}$.
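To make the last implication concrete, here is the one-step calculation spelled out; it is only an unrolling of the displayed inequality under the stated condition $\lambda\le\frac{1}{400\sqrt{\alpha}}$:
\[
\Phi^{t^{\mathrm{new}}} \le \left(1-\frac{\alpha\lambda}{40\sqrt m}\right)\cdot\frac{80m}{\alpha} + \sqrt m\,\lambda\,e^{192\lambda\sqrt\alpha}
= \frac{80m}{\alpha} - 2\lambda\sqrt m + \sqrt m\,\lambda\,e^{192\lambda\sqrt\alpha} \le \frac{80m}{\alpha},
\]
since $192\lambda\sqrt\alpha \le 192/400 < \ln 2$ and therefore $e^{192\lambda\sqrt\alpha}\le 2$.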

B Central Path Maintenance

The goal of this section is to present a data structure that performs our centering steps in $\widetilde{O}(n^{\omega-1/2})$ amortized time, and to prove a theoretical guarantee for it. The original idea of inverse maintenance is due to Michael B. Cohen [Lee17]; [CLS19] then used it to obtain a faster running time for solving linear programs. Because a simple matrix-vector product would require $O(n^2)$ time, our speedup comes via a low-rank embedding that provides $\ell_\infty$ guarantees, which is unlike the sparse-vector approach of [CLS19]. In fact, we are unsure whether moving in a sparse direction $h$ can have sufficiently controlled noise to show convergence. Here, we give a stochastic version that is faster for a dense direction $h$.
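To fix ideas, the following is a minimal numerical sketch (hypothetical sizes, $1\times 1$ blocks for simplicity, not the data structure itself) of the object being maintained below: the matrix $\sqrt{W}A^\top(AWA^\top)^{-1}A\sqrt{W}$ for a block-diagonal positive definite $W$. Recomputing it from scratch costs a dense inversion and multiplication, which is exactly the work the amortized analysis avoids.

    import numpy as np

    def maintained_projection(A, w):
        # P(W) = sqrt(W) A^T (A W A^T)^{-1} A sqrt(W), with W = diag(w) here
        sqrt_w = np.sqrt(w)
        middle = np.linalg.inv((A * w) @ A.T)          # (A W A^T)^{-1}, a d x d inverse
        return (sqrt_w[:, None] * A.T) @ middle @ (A * sqrt_w)

    d, n = 5, 20
    rng = np.random.default_rng(0)
    A = rng.standard_normal((d, n))
    w = rng.random(n) + 0.5
    P = maintained_projection(A, w)
    assert np.allclose(P @ P, P, atol=1e-8)            # it is an orthogonal projection
    # recomputing P from scratch costs roughly a dense inverse plus multiplications;
    # Theorem B.1 below maintains it under slowly-changing W at amortized n^(omega-1/2).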

Theorem B.1 (Central path maintenance). Given a full rank matrix $A\in\mathbb{R}^{d\times n}$ with $n\ge d$, a tolerance parameter $0<\epsilon_{mp}<1/4$ and a block diagonal structure $n=\sum_{i=1}^m n_i$. Given any positive number $a$ such that $a\le\alpha$, where $\alpha$ is the dual exponent of matrix multiplication. Given any linear sketch of size $b$, there is a randomized data structure CentralPathMaintenance (in Algorithms 1, 2, 4) that approximately maintains the projection matrices
\[
\sqrt{W}A^\top(AWA^\top)^{-1}A\sqrt{W}
\]
for a positive block diagonal psd matrix $W\in\oplus_i\mathbb{R}^{n_i\times n_i}$; exactly (implicitly) maintains the central path parameters $(x,s)$; and approximately (explicitly) maintains the path parameters, through the following operations:

1. Initialize($W^{(0)},\cdots$): Assume $W^{(0)}\in\otimes_i\mathbb{R}^{n_i\times n_i}$. Initialize all the parameters in $O(n^\omega)$ time.

2. Update($W$): Assume $W\in\oplus_i\mathbb{R}^{n_i\times n_i}$. Output a block diagonal matrix $\widetilde{V}\in\oplus_i\mathbb{R}^{n_i\times n_i}$ such that $(1-\epsilon_{mp})\widetilde{v}_i\preceq w_i\preceq(1+\epsilon_{mp})\widetilde{v}_i$.

3. Query(): Output $(\bar{x},\bar{s})$ such that $\|\bar{x}-x\|_{\widetilde{V}^{-1}}\le\epsilon_{mp}$ and $\|\bar{s}-s\|_{\widetilde{V}}\le t\,\epsilon_{mp}$, where $t$ is the last $t$ used in MultiplyMove, where $\epsilon_{mp}=\alpha\log^2(nT)\frac{n^{1/4}}{\sqrt{b}}$ and the success probability is $1-1/\operatorname{poly}(nT)$. This step takes $O(n)$ time.

4. MultiplyMove($h,t$): It outputs nothing. It implicitly maintains
\[
x \leftarrow x + \widetilde{V}^{1/2}(I-\widetilde{P})\widetilde{V}^{1/2}h, \qquad s \leftarrow s + t\,\widetilde{V}^{-1/2}\widetilde{P}\widetilde{V}^{1/2}h,
\]
where $\widetilde{P}=\widetilde{V}^{1/2}A^\top(A\widetilde{V}A^\top)^{-1}A\widetilde{V}^{1/2}$. It also explicitly maintains $\bar{x},\bar{s}$. Assuming $t$ is decreasing, each call takes $O(nb+n^{a\omega+o(1)}+n^a\|h\|_0+n^{1.5})$ amortized time.

Let $W^{(0)}$ be the initial matrix and $W^{(1)},\cdots,W^{(T)}$ be the (random) update sequence. Assume that there is a sequence of matrices $\widetilde{W}^{(0)},\cdots,\widetilde{W}^{(T)}\in\oplus_{i=1}^m\mathbb{R}^{n_i\times n_i}$ satisfying, for all $k$,
\begin{align*}
\big\|(\widetilde{w}_i^{(k)})^{-1/2}(w_i^{(k)}-\widetilde{w}_i^{(k)})(\widetilde{w}_i^{(k)})^{-1/2}\big\|_F &\le \epsilon_{mp},\\
\sum_{i=1}^m\big\|(\widetilde{w}_i^{(k)})^{-1/2}\big(\mathbb{E}[\widetilde{w}_i^{(k+1)}]-\widetilde{w}_i^{(k)}\big)(\widetilde{w}_i^{(k)})^{-1/2}\big\|_F^2 &\le C_1^2,\\
\sum_{i=1}^m\Big(\mathbb{E}\big[\big\|(\widetilde{w}_i^{(k)})^{-1/2}(\widetilde{w}_i^{(k+1)}-\widetilde{w}_i^{(k)})(\widetilde{w}_i^{(k)})^{-1/2}\big\|_F^2\big]\Big)^2 &\le C_2^2,\\
\big\|(\widetilde{w}_i^{(k)})^{-1/2}(\widetilde{w}_i^{(k+1)}-\widetilde{w}_i^{(k)})(\widetilde{w}_i^{(k)})^{-1/2}\big\|_F &\le \frac14,
\end{align*}
where $\widetilde{w}_i^{(k)}$ is the $i$-th block of $\widetilde{W}^{(k)}$, $\forall i\in[m]$. Then, the amortized expected time per call of Update($w$) is
\[
(C_1/\epsilon_{mp}+C_2/\epsilon_{mp}^2)\cdot(n^{\omega-1/2+o(1)}+n^{2-a/2+o(1)}).
\]
Remark B.2. For our algorithm, we have $C_1=O(1/\log^2 n)$, $C_2=O(1/\log^4 n)$ and $\epsilon_{mp}=O(1/\log^2 n)$. Note that the input $W$ of Update can move a lot; the data structure works as long as $W$ stays close to some $\widetilde{W}$ that is slowly moving. In our application, our $\widetilde{W}$ satisfies the $C_1,C_2$ conditions deterministically; we keep the expectation version for possible future applications.

B.1 Proof of Theorem B.1

We follow the proof sketch of [CLS19]. The proof contains four parts: 1) the definition of the matrices X and Y; 2) the sorting assumption on the diagonal blocks; 3) the definition of the potential function; 4) rewriting the potential function and bounding it.

Definition of matrices X and Y . Let us consider the k-th round of the algorithm. For all (k) i [m], matrix y Rni ni is constructed based on procedure Update (Algorithm 2) : ∈ i ∈ × w(k+1) y(k) = i I. i (k) − vi and π is a permutation such that y(k) y(k) . k π(i)kF ≥ k π(i+1)kF (k) (k) (k) For the purpose of analysis : for all i [m], we define x , x and y Rni ni as follows: ∈ i i i ∈ × w(k) w(k+1) w(k+1) x(k) = i I, y(k) = i I, x(k+1) = i I, i (k) − i (k) − i (k+1) − vi vi vi

Algorithm 1 Central Path Maintenance Data Structure - Initialize, Query, Move
1: datastructure CentralPathMaintenance  ⊲ Theorem B.1
2:
3: private : members
4:   W ∈ ⊗_{i∈[m]} R^{n_i×n_i}  ⊲ Target matrix, W is ǫ_w-close to the true W
5:   V, Ṽ ∈ ⊗_{i∈[m]} R^{n_i×n_i}  ⊲ Approximate matrices
6:   A ∈ R^{d×n}  ⊲ Constraint matrix
7:   M ∈ R^{n×n}  ⊲ Approximate projection matrix
8:   ǫ_mp ∈ (0, 1/4)  ⊲ Tolerance
9:   a ∈ (0, α]  ⊲ Batch size for Update (n^a)
10:  b ∈ Z_+  ⊲ Sketch size of one sketching matrix
11:  R ∈ R^{n^{1+o(1)}×n}  ⊲ A list of sketching matrices
12:  Q ∈ R^{b×n}  ⊲ Sketched matrices
13:  u_1 ∈ R^n, F ∈ R^{n×n}, u_2 ∈ R^n  ⊲ Implicit representation of x, x = u_1 + F·u_2
14:  u_3 ∈ R^n, G ∈ R^{n×n}, u_4 ∈ R^n  ⊲ Implicit representation of s, s = u_3 + G·u_4
15:  x̄, s̄ ∈ R^n  ⊲ Central path parameters, maintained explicitly
16:  l ∈ Z_+  ⊲ Randomness counter, R_l ∈ R^{b×n}
17:  t_pre ∈ R_+  ⊲ Tracking the changes of t
18: end members
19:
20: public : procedure Initialize(A, x, s, W, ǫ_mp, a, b)  ⊲ Lemma B.4
21:   ⊲ parameters that never change after initialization
22:   A ← A, a ← a, b ← b, ǫ_mp ← ǫ_mp
23:   ⊲ parameters that still change after initialization
24:   W ← W, V ← W, Ṽ ← V
25:   Choose R_l ∈ R^{b×n} to be a sketching matrix, ∀l ∈ [√n]  ⊲ Lemma E.5
26:   R ← [R_1^⊤, R_2^⊤, ···]^⊤  ⊲ Batch them into one matrix R
27:   M ← A^⊤(AṼA^⊤)^{-1}A, Q ← R√Ṽ M  ⊲ Initialize projection matrices
28:   u_1 ← x, u_2 ← 0, u_3 ← s, u_4 ← 0  ⊲ Initialize x and s
29:   x̄ ← x, s̄ ← s
30:   l ← 1
31: end procedure
32:
33: public : procedure Query()  ⊲ Lemma B.8
34:   return (x̄, s̄)
35: end procedure
36:
37: end datastructure

(k) wi (k) 1/2 (k) (k) 1/2 where (k) denotes (vi )− wi (vi )− . vi (k) (k) It is not hard to observe the difference between xi and yi is that w is changing. We call it “w (k) (k+1) move”. Similarly, the difference between yi and xi is that v is changing. We call it “v move”. For each i, we define βi as follows

(k) 1/2 (k+1) (k) (k) 1/2 β = (w )− (E[w ] w )(w )− , i k i i − i i kF

29 Algorithm 2 Central Path Maintenance Data Structure - Update and PartialUpdate 1: datastructure CentralPathMaintenance ⊲ Theorem B.1 2: new new 3: public : procedure Update(W ) ⊲ Lemma B.5, W is close to W new 1/2 1/2 4: y v− wnewv− 1, i [m] i ← i i i − ∀ ∈ 5: r the number of indices i such that y ǫ ← k ikF ≥ mp 6: if r

m β2 C2. i ≤ 1 Xi=1 Assume sorting for diagonal blocks. Without loss of generality, we can assume the diagonal (k) (k) (k) (k) m Rni ni blocks of matrix x i=1 × are sorted such that xi F xi+1 F . In[CLS19], xi is a ∈ ⊕ k k ≥ k k (k) scalar. They sorted the sequence based on absolute value. In our situation, xi is a matrix. We sort the sequence based on Frobenius norm. Let τ permutation such that x(k+1) x(k+1) . k τ(i) kF ≥ k τ(i+1)kF Let π denote the permutation such that y(k) y(k) . k π(i)kF ≥ k π(i+1)kF

Definition of Potential function. We define three functions g, ψ and Φk here. The definition of ψ is different from [CLS19], since we need to handle matrix. The definitions of g and Φk are the same as [CLS19].

30 Algorithm 3 Central Path Maintenance Data Structure - Full Update 1: datastructure CentralPathMaintenance ⊲ Theorem B.1 2: new 3: private : procedure FullUpdate(W ) ⊲ Lemma B.7 1/2 1/2 4: y v− wnewv− 1, i [m] i ← i i i − ∀ ∈ 5: r the number of indices i such that y ǫ ← k ikF ≥ mp 6: Let π : [m] [m] be a sorting permutation such that y y → k π(i)kF ≥ k π(i+1)kF 7: while 1.5 r

For the completeness, we still provide a definition of g. Let g be defined as

a a n− , if i √n or t tpre/2 ⊲ Variance is large enough ≥ 28: x u + F u , s u + F u ← 1 2 ← 3 4 29: Initialize(A, x, s, W,ǫmp, a, b) ⊲ Algorithm 1 30: else 31: x x + δ , s s + δ ⊲ Update x, s ← x ← s 32: end if 33: return (x, s) e e 34: end procedure 35: 36: end datastructure where x denotes the Frobenius norm of square matrix x, and let L = max D ψ[h]/ H , k kF 1 x x k kF L = max D2ψ[h, h]/ H 2 where h is the vectorization of matrix H. 2 x x k kF For the completeness, we define the potential at the k-th round by

m (k) Φk = gi ψ(x ) · τk(i) Xi=1

32 (k) (k) where τk(i) is the permutation such that x F x F . (Note that in [CLS19] F k τk(i)k ≥ k τk(i+1)k k · k should be .) | · | Rewriting the potential, and bounding it. Following the ideas in [CLS19], we can rewrite Φ Φ into two terms: the first term is w move, and the second term is v move. For the k+1 − k completeness, we still provide a proof.

m Φ Φ = g ψ(x(k+1)) ψ(x(k)) k+1 − k i · τ(i) − i i=1   Xm m = g ψ(y(k) ) ψ(x(k)) g ψ(y(k) ) ψ(x(k+1)) i · π(i) − i − i · π(i) − τ(i) i=1 i=1 X  W move  X  V move  Using Lemma B.12, we can bound| the first{z term. Using} Lemma| B.15, we{z can bound} the second term.

B.2 Initialization time, update time, query time, move time, multiply time

Remark B.3. In terms of implementing this data structure, we only need three operations: Initialize, Update, and Query. However, in order to make the proof easier to follow, we split Update into several operations: FullUpdate, PartialUpdate, Multiply and Move. We give a list of operations in Table 2.

Lemma B.4 (Initialization). The initialization time of data-structure CentralPathMaintenance (Algorithm 1) is $O(n^{\omega+o(1)})$.

Proof. The running time is mainly dominated by two parts: the first part is computing $A^\top(AVA^\top)^{-1}A$, which takes $O(n^2 d^{\omega-2})$ time. The second part is computing $R\sqrt{V}M$, which takes $O(n^{\omega+o(1)})$ time.

Lemma B.5 (Update time). The update time of data-structure CentralPathMaintenance (Algorithm 2) is $O(r g_r n^{2+o(1)})$, where $r$ is the number of indices we updated in $V$.

Proof. It follows trivially from combining Lemma B.6 and Lemma B.7.

Lemma B.6 (Partial Update time). The partial update time of data-structure CentralPathMaintenance (Algorithm 2) is $O(n^{1+a})$.

Proof. We first analyze the running time of the $F$ update; the update equation of $F$ in the algorithm is
\[
F^{\mathrm{new}} \leftarrow F + \big((\widetilde{V}^{\mathrm{new}})^{1/2} - \widetilde{V}^{1/2}\big)M, \qquad F \leftarrow F^{\mathrm{new}},
\]
which can be implemented as
\[
F \leftarrow F + \big((\widetilde{V}^{\mathrm{new}})^{1/2} - \widetilde{V}^{1/2}\big)M,
\]
where we only need to change $n^a$ row-blocks of $F$. It takes $O(n^{1+a})$ time. The update time of $G$ is analyzed similarly.

Next we analyze the update time of $u_1$; the update equation of $u_1$ is
\[
u_1 \leftarrow u_1 + (F - F^{\mathrm{new}})u_2.
\]
Note that the difference between $F$ and $F^{\mathrm{new}}$ is only $n^a$ row-blocks, thus it takes $n^{1+a}$ time to update.

Finally, we analyze the update time of $\bar{x}$. Let $\widehat{S}$ denote the blocks where $\widetilde{V}$ and $\widetilde{V}^{\mathrm{new}}$ are different.
\[
\bar{x}_{\widehat{S}} \leftarrow (u_1)_{\widehat{S}} + (Fu_2)_{\widehat{S}}
\]
This can also be done in $n^{1+a}$ time, since $\widehat{S}$ contains only $n^a$ blocks. Therefore, the overall running time is $O(n^{1+a})$.
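A minimal numpy sketch of the row-block bookkeeping above (hypothetical names and sizes, diagonal blocks of size one; not the actual data structure): only the $n^a$ changed row-blocks of $F$ are rewritten, and the explicit copy $\bar{x}$ is refreshed only on those blocks, so the work is proportional to (number of changed rows) times $n$.

    import numpy as np

    def partial_update(F, u1, u2, x_bar, M, sqrtV_new, sqrtV_old, changed_rows):
        """Rewrite only the changed row-blocks of F, then patch u1 and x_bar.

        changed_rows: indices of the O(n^a) coordinates whose block of V changed.
        Cost: O(|changed_rows| * n) instead of O(n^2).
        """
        S = np.asarray(changed_rows)
        F_new_rows = F[S] + (sqrtV_new[S, None] - sqrtV_old[S, None]) * M[S]
        # u1 absorbs the change so that u1 + F u2 stays invariant (cf. Lemma B.20)
        u1[S] += (F[S] - F_new_rows) @ u2
        F[S] = F_new_rows
        # refresh the explicit approximation only on the changed blocks
        x_bar[S] = u1[S] + F[S] @ u2
        return F, u1, x_bar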

Lemma B.7 (Full Update time). The full update time of data-structure CentralPathMaintenance (Algorithm 3) is $O(r g_r n^{2+o(1)})$, where $r$ is the number of indices we updated in $V$.

Proof. The update equation we use for $Q$ is
\[
Q^{\mathrm{new}} \leftarrow Q + R\cdot(\Gamma\cdot M^{\mathrm{new}}) + R\cdot\sqrt{V}\cdot(M^{\mathrm{new}} - M).
\]
It can be rewritten as
\[
Q^{\mathrm{new}} \leftarrow Q + R\cdot(\Gamma\cdot M^{\mathrm{new}}) + R\cdot\sqrt{V}\cdot\big(-M_{*,S}\cdot(\Delta_{S,S}^{-1} + M_{S,S})^{-1}\cdot(M_{*,S})^\top\big).
\]
The running time of computing the second term is that of multiplying an $n\times r$ matrix by an $r\times n$ matrix. The running time of computing the third term is also dominated by multiplying an $n\times r$ matrix by an $r\times n$ matrix. Thus the running time of processing the $Q$ update is the same as that of processing the $M$ update. The running time of the other parts is dominated by the time of updating $M$ and $Q$. Therefore, the rest of the proof is almost the same as Lemma 5.4 in [CLS19]; we omit it here.
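The identity behind the rewriting above is the standard block low-rank inverse update: when only the blocks indexed by $S$ of $V$ change, $M = A^\top(AVA^\top)^{-1}A$ changes by a rank-$|S|$ correction $-M_{*,S}(\Delta_{S,S}^{-1}+M_{S,S})^{-1}(M_{*,S})^\top$. The following numpy sketch (hypothetical sizes, diagonal $V$ for simplicity) checks this identity numerically; it is an illustration of the algebra, not the data-structure code.

    import numpy as np

    def M_of(A, V):
        # M(V) = A^T (A V A^T)^{-1} A, with V given by its diagonal
        AV = A * V
        return A.T @ np.linalg.inv(AV @ A.T) @ A

    d, n, r = 6, 30, 4
    rng = np.random.default_rng(0)
    A = rng.standard_normal((d, n))
    V = rng.random(n) + 0.5
    S = rng.choice(n, size=r, replace=False)

    V_new = V.copy()
    V_new[S] += rng.random(r) + 0.1      # change only r coordinates (blocks) of V
    Delta = np.diag(V_new[S] - V[S])     # Delta_{S,S}

    M = M_of(A, V)
    # low-rank correction: M_new = M - M[:,S] (Delta^{-1} + M[S,S])^{-1} M[:,S]^T
    correction = M[:, S] @ np.linalg.inv(np.linalg.inv(Delta) + M[np.ix_(S, S)]) @ M[:, S].T
    assert np.allclose(M_of(A, V_new), M - correction, atol=1e-6)

This is the same Woodbury/Schur-complement algebra that [CLS19] exploits; the point of the lemma is that the cost is governed by the rank $r$ of the change rather than by $n$.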

Lemma B.8 (Query time). The query time of data-structure CentralPathMaintenance (Al- gorithm 1) is O(n) time.

Proof. This takes only O(n) time, since we stored x and s.

Lemma B.9 (Move time). The move time of data-structure CentralPathMaintenance (Algorithm 4) is $O(n^{\omega+o(1)})$ in the worst case, and is $O(n^{\omega-1/2+o(1)})$ amortized cost per iteration.

Proof. In one case, it takes only $O(n)$ time. In the other case, the running time is dominated by Initialize, which takes $n^{\omega+o(1)}$ time by Lemma B.4.

Lemma B.10 (Multiply time). The multiply time of data-structure CentralPathMaintenance (Algorithm 4) is $O(nb + n^{1+a+o(1)})$ for a dense vector $h$ (i.e., $\|h\|_0 = n$), and is $O(nb + n^{a\omega+o(1)} + n^a\|h\|_0)$ for a sparse vector $h$.

Proof. We first analyze the running time of computing vector δm, the equation is

1 1 δm ((∆S,S)− + MS,S)− (MS, )⊤ V h ← e e e e · e ∗  p  e e 34 where ∆= V V . Let r = ni = O(r) where r is the number of blocks are different in V and − i Se V . ∈ P It containse e several parts:e e 1. Computing M ( V h) Rr takes O(r) h . S⊤ · ∈ e k k0 e 1 1 O(r) O(r) 2. Computing (∆− +pM )− R × that is the inverse of a O(r) O(r) matrix takes S,S S,S e e f e e e e e ∈ e × O(rω+o(1)) time. e e e 1 3. Computing matrix-vector multiplication between O(r) O(r) matrix ((∆S,S + MS,S)− ) and × e e e e e 2 O(r) 1 vector ((MS, )⊤ V h) takes O(r ) time. × e ∗ e e e Thus, the running timep of computing δm is e f e e O(r h + rω+o(1) + r2)= O(r h + rω+o(1)). k k0 k k0

Next, we want to analyzee the updatee equatione of δx e e

δx V h (Rl⊤ ((Ql + Rl ∆M) V he)) (Rl⊤ ((Q + RlΓM ) δm)) ← − · · · − · l,Se Se ·  p p  where Γ=e V e √V has O(r) non-zeroe blocks. e e − It is clearp that the running time is dominated by the second term in the equation. We only focus on thate term.e 1. Computing R Q V h takes O(bn) time, because Q ,R Rb n. l⊤ l l l ∈ × 2. Computing Rl⊤Rpl ∆M V h takes O(bn + br + r h 0) time. The reason is, computing e k k ∆M V h takes r h 0 ptime, computingp Rl ( ∆M V h) takes br, then finally computing Rl⊤ k k e e · e e · p(Rl ∆pM V h) takes nb. p p e e e e pLast, thep updatee equation of u1, u2, u3, u4 only takes the O(n) time.e Finally,e wee note that r O(na) due to the guarantee of FullUpdate and PartialUpdate. ≤ Thus, overall the running time of the Multiply is

O(r h + rω+o(1) + r2 + br + nb)= O(r h + rω+o(1) + nb) k k0 k k0 ω+o(1) = O(r h 0 + r + nb) e e e e ek k e = O(na h + naω+o(1) + nb) k k0 where the first step follows from br nb and r2 rω+o(1), and the second step follows from ≤ ≤ r = O(r), and the last step follows from r = O(na). If h is the dense vector, then thee overall timee is e e O(nb + n1+a + naω+o(1)).

Based on Lemma 5.5 in [CLS19], we know that aω 1+ a. Thus, it becomes O(nb + n1+a+o(1)) ≤ time. If h is a sparse vector, then the overall time is

O(nb + na h + naω+o(1)). k k0
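As a concrete illustration of the $\delta_m$ computation above, the following sketch applies the implicitly corrected matrix to a vector without ever forming the updated $M$: one solve with the small $r\times r$ matrix $\Delta_{S,S}^{-1}+M_{S,S}$ replaces a dense $n\times n$ multiplication. Names and sizes are hypothetical; this is an algebraic illustration only.

    import numpy as np

    def multiply_corrected(M, S, Delta, v):
        """Compute (M - M[:,S](Delta^{-1}+M[S,S])^{-1}M[:,S]^T) v with one small solve.

        M: n x n, S: r changed indices, Delta: r x r block of the change, v: length-n vector.
        Cost: O(n r + r^omega) instead of O(n^2) for a dense updated matrix.
        """
        small = np.linalg.inv(Delta) + M[np.ix_(S, S)]           # r x r
        delta_m = M[:, S] @ np.linalg.solve(small, M[:, S].T @ v)
        return M @ v - delta_m

In the data structure the resulting vector is additionally hit by the sketch $R_l$, which is where the $nb$ term in Lemma B.10 comes from.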

Lemma B.11 (MultiplyMove). The running time of MultiplyMove (Algorithm B.11) is the Multiply time plus Move time.

35 B.3 Bounding W move The goal of this section is to analyze the movement of W . [CLS19] provided a scalar version of W move, here we provide a matrix version. Lemma B.12 (W move, matrix version of Lemma 5.7 in [CLS19]). m (k) (k) a/2 ω 5/2 g E ψ(y ) ψ(x ) = O(C + C /ǫ ) log n (n− + n − ). i · π(i) − π(i) 1 2 mp · · i=1 X h i p Proof. In scalar version, [CLS19] used absolute ( ) to measure each x(k). In matrix version, we use |·| i Frobenius norm ( ) to measure each x(k). Let I [m] be the set of indices such that x(k) 1. kkF i ⊆ k i kF ≤ We separate the term into two : m (k) (k) (k) (k) (k) (k) g E[ψ(y ) ψ(x )] = g −1 E[ψ(y ) ψ(x )] + g −1 E[ψ(y ) ψ(x )]. i · π(i) − π(i) π (i) · i − i π (i) · i − i i=1 i I i Ic X X∈ X∈ Case 1. Let us consider the terms from I. (k) (k) (k) Let vec(yi ) denote the vectorization of matrix yi . Similarly, vec(xi ) denotes the vectoriza- (k) tion of xi . Mean value theorem shows that

(k) (k) (k) (k) (k) 1 (k) (k) (k) (k) ψ(y ) ψ(x )= ψ′(x ),y x + vec(y x )⊤ψ′′(ζ)vec(y x ) i − i h i i − i i 2 i − i i − i (k) (k) (k) L2 (k) (k) 2 ψ′(x ),y x + y x ≤ h i i − i i 2 k i − i kF (k) (k) 1/2 (k+1) (k) (k) 1/2 = ψ′(x ), (v )− (w w )(v )− h i i i − i i i L2 (k) 1/2 (k+1) (k) (k) 1/2 2 + (v )− (w w )(v )− 2 k i i − i i kF where the second step follows from definition of L2 (see Part 4 of Lemma B.17). Taking conditional expectation given w(k) on both sides

(k) (k) (k) (k) 1/2 (k+1) (k) (k) 1/2 E[ψ(y ) ψ(x )] ψ′(x ), (v )− (E[w ] w )(v )− i − i ≤ h i i i − i i i L2 (k) 1/2 (k+1) (k) (k) 1/2 2 + E[ (v )− (w w )(v )− ] 2 k i i − i i kF (k) 1/2 (k+1) (k) (k) 1/2 L (v )− (E[w ] w )(v )− ≤ 1k i i − i i kF L2 (k) 1/2 (k+1) (k) (k) 1/2 2 + E[ (v )− (w w )(v )− ] 2 k i i − i i kF (k) 1/2 (k) 1/2 2 (k) 1/2 (k+1) (k) (k) 1/2 L (v )− (w ) (w )− (E[w ] w )(w )− ≤ 1k i i k · k i i − i i kF L2 (k) 1/2 (k) 1/2 4 (k) 1/2 (k+1) (k) (k) 1/2 2 + (v )− (w ) E[ (w )− (w w )(w )− ] 2 k i i k · k i i − i i kF (k) 1/2 (k) 1/2 2 L2 (k) 1/2 (k) 1/2 4 = L (v )− (w ) β + (v )− (w ) γ (27) 1k i i k · i 2 k i i k · i where the second step follows from definition of L2 (see Part 4 of Lemma B.17), the third step follows from AB A B , and the last step follows from defining β and γ as follows: k kF ≤ k kF · k k i i (k) 1/2 E (k+1) (k) (k) 1/2 βi = (wi )− ( [wi ] wi )(wi )− − F 2 E (k) 1/2 (k+1) (k) (k) 1/2 γi = (wi )− (wi wi )(wi )− . − F h i

36 E (k) (k) To upper bound i I gπ−1(i) [ψ(yi ) ψ(xi )], we need to bound the following two terms, ∈ − P 2 4 (k) 1/2 (k) 1/2 L2 (k) 1/2 (k) 1/2 g −1 L (v )− (w ) β , and g −1 (v )− (w ) γ . (28) π (i) 1 i i i π (i) 2 i i i i I i I X∈ X∈ For the first term (which is related to β) in Eq. (28), we have

1/2 2 (k) 1/2 (k) 1/2 2 (k) 1/2 (k) 1/2 2 2 g −1 L (v )− (w ) β g −1 L (v )− (w ) β π (i) 1k i i k i ≤ π (i) 1k i i k i i I i I i I ! X∈ X∈   X∈ n 1/2 2 2 O(L1) gi C1 ≤ · ! Xi=1 = O(C L g ). (29) 1 1k k2 where the first step follows from Cauchy-Schwarz inequality, the second step follows from ni = O(1) and (v(k)) 1/2(w(k))1/2 2 = O(1). k i − i k For the second term (which is related to γ) in Eq. (28), we have

m L2 (k) 1/2 (k) 1/2 4 g −1 (v )− (w ) n γ O(L ) g γ = O(C L g ). (30) π (i) 2 k i i k i i ≤ 2 · i · i 2 2k k2 i I i=1 X∈ X

Putting Eq. (27), Eq. (29) and Eq. (30) together, and using several facts L1 = O(1), L2 = O(1/ǫ ) (from part 4 of Lemma B.17) and g √log n O(n a/2 + nω 5/2) (from Lemma B.13) mp k k2 ≤ · − − gives us

k) (k) a/2 ω 5/2 g −1 E[ψ(y ) ψ(x )] O(C + C /ǫ ) log n (n− + n − ). π (i) · i − i ≤ 1 2 mp · · i I X∈ p (Note that, the above Equation is the same as [CLS19].) Case 2. Let us consider the terms from Ic. For each i Ic, we know x(k) 1. We observe that ψ(x) is constant for x 2 (2ǫ )2, ∈ k i kF ≥ k kF ≥ mp where ǫ 1/4. If y(k) 1/2, then ψ(y(k)) ψ(x(k)) = 0. Therefore, we only need to focus mp ≤ k i kF ≥ i − i on the i Ic such that y(k) < 1/2. ∈ k i kF For each i Ic with y(k) < 1/2, we have ∈ k i kF 1 < y(k) x(k) 2 k i − i kF (k) 1/2 (k+1) (k) (k) 1/2 = (v )− (w w )(v )− k i i − i i kF (k) 1/2 (k) 1/2 (k) 1/2 (k+1) (k) (k) 1/2 (k) 1/2 (k) 1/2 = (v )− (w ) (w )− (w w )(w )− (w ) (v )− k i i · i i − i i · i i kF (k) 1/2 (k) 1/2 (k) 1/2 (k+1) (k) (k) 1/2 (k) 1/2 (k) 1/2 (v )− (w ) (w )− (w w )(w )− (w ) (v )− ≤ k i i k · k i i − i i kF · k i i k (k) 1/2 (k) (k) 1/2 (k) 1/2 (k+1) (k) (k) 1/2 = (v )− w (v )− (w )− (w w )(w )− k i i i k · k i i − i i kF 3 (k) 1/2 (k+1) (k) (k) 1/2 (w )− (w w )(w )− (31) ≤ 2k i i − i i kF

(k+1) (k) wi where the last step follows from yi F = (k) I F 1/2. k k k vi − k ≤

37 It is obvious that Eq. (31) implies

(k) 1/2 (k+1) (k) (k) 1/2 (w )− (w w )(w )− > 1/3 > 1/4. k i i − i i kF But this is impossible, since we assume it is 1/4. ≤ Thus, we have

(k) (k) g −1 E[ψ(y ) ψ(x )] = 0. π (i) · i − i i Ic X∈

We state a Lemma that was proved in previous work [CLS19].

Lemma B.13 (Lemma 5.8 in [CLS19]).

n 1/2 2 a/2 ω 5/2 g log n O(n− + n − ) i ≤ · i=1 ! X p B.4 Bounding V move In previous work, [CLS19] only handled the movement of V in scalar version. Here, the goal of is to understand the movement of V in matrix version. We start to give some definitions about block diagonal matrices.

(k) (k) (k) (k+1) Rni ni Definition B.14. We define block diagonal matrices X , Y , X and Y i [m] × as ⊗ ∈ follows

w(k) w(k+1) w(k+1) w(k+1) x(k) = i I, y(k) = i I, x(k+1) = i I, y(k) = i I. i (k) − i (k) − i (k+1) − i (k) − vi vi vi vi

Let ǫw denote the error between W and W

1/2 1/2 W − (W W )W − ǫ . k i i − i i kF ≤ w Lemma B.15 (V move, matrix version of Lemma 5.9 in [CLS19]). We have,

n g ψ(y(k) ) ψ(x(k+1)) Ω(ǫ r g / log n). i · π(i) − τ(i) ≥ mp k rk Xi=1   Proof. To prove the Lemma, similarly as [CLS19], we will split the proof into two cases. Before getting into the details of each case, let us first understand several simple facts which are useful in the later proof. Note that from the definition of the algorithm, we only change the block if y(k) is larger than the error between w and w . Hence, all the changes only decreases the norm, k i kF i i namely ψ(y(k)) ψ(x(k+1)) for all i. So is their sorted version ψ(y(k) ) ψ(x )(k+1) for all i. i ≥ i π(i) ≥ τ(i)

38 Case 1. The procedure exits the while loop when 1.5r n. k ≥ Let u denote the largest u such that y(k) ǫ . ∗ k π(u)kF ≥ mp If u∗ = rk, we have that (k) y F ǫmp ǫmp/100. k π(rk)k ≥ ≥ If u = r , using the condition of the loop, we have that ∗ 6 k ∗ (k) log1.5 rk log1.5 u (k) y F (1 1/ log n) − y ∗ F k π(rk)k ≥ − · k π(u )k (1 1/ log n)log1.5 n ǫ ≥ − · mp ǫ /100. ≥ mp where the last step follows from n 4. (k+1) ≥ Recall the definition of xτ(i) . We can lower bound the LHS in the Lemma statement in the following sense,

n 2n/3 g ψ(y(k) ) ψ(x(k+1)) g ψ(y(k) ) ψ(x(k+1)) i · π(i) − τ(i) ≥ i · π(i) − τ(i) Xi=1   i=Xn/3+1   2n/3 g (Ω(ǫ ) O(ǫ )) ≥ i · mp − w i=Xn/3+1 2n/3 g Ω(ǫ ) ≥ i · mp i=Xn/3+1

= Ω(rkgrk ǫmp).

(k) (k) (k) where the second step follows from y F y F (1 O(ǫw)) y F ǫmp/200 for all k π(i)k ≥ k π(rk)k ≥ − k π(rk)k ≥ i< 2n/3.

(k) (k) Case 2. The procedure exits the while loop when 1.5rk

(k) y F ǫmp/100. k π(rk)k ≥ Using Part 3 of Lemma B.17 and the following fact

y(k) < min ǫ , y(k) (1 1/ log n) , k π(1.5r)kF mp k π(r)kF · −   we can show that

ψ(y(k) ) ψ(y(k) )=Ω(ǫ / log n). (32) π(1.5r) − π(r) mp (k) (k) (k) (k) Now the question is, how to relax ψ(yπ(1.5r)) to ψ(yπ(1.5r)) and how to relax ψ(yπ(r)) to ψ(yπ(r)) Note that y(k) x(k+1) for all i. Hence, we have ψ(y(k) ) ψ(x(k+1)) for all i. k i kF ≥ k i kF π(i) ≥ τ(i) Recall the definition of y, y, π and π,

w(k+1) w(k+1) y(k) = i I, y(k) = i I. i (k) − i (k) − vi vi

39 and π and π denote the permutations such that y(k) y(k) and y(k) y(k) . k π(i)kF ≥ k π(i+1)kF k π(i)kF ≥ k π(i+1)kF Using Fact B.16 and = Θ(1) when the matrix has constant dimension k · k2 k · kF y(k) y(k) O(ǫ ). k π(i) − π(i)kF ≤ w where ǫw is the error between W and W . Next, i, we have ∀ ψ(y(k) )= ψ(y(k) ) O(ǫ ǫ ) (33) π(i) π(i) ± w mp Next, we note that all the blocks the algorithm updated must lies in the range i = 1, , 3rk 1. · · · 2 − After the update, the error of rk of these block becomes so small that its rank will much higher 3rk than rk. Hence, rk/2 of the unchanged blocks in the range i = 1, , 2 will move earlier in the (k+1) · · · 3 (k) rank. Therefore, the rk/2-th element in x must be larger than the 2 rk-th element in y . In (k+1) (k) short, we have ψ(x ) ψ(y ) for all i rk/2. τ(i) ≤ π(1.5rk) ≥ Putting it all together, we have

n g ψ(y(k) ) ψ(x(k+1)) i · π(i) − τ(i) Xi=1   rk g ψ(y(k) ) ψ(x(k+1)) ≥ i · π(i) − τ(i) i=Xrk/2   rk (k) (k+1) (k+1) (k) gi ψ(y ) ψ(y ) by ψ(x ) ψ(y ), i rk/2 ≥ · π(i) − π(1.5rk) τ(i) ≤ π(1.5rk) ∀ ≥ i=Xrk/2   rk (k) (k+1) gi ψ(y ) ψ(y ) O(ǫwǫmp) by (33) ≥ · π(i) − π(1.5rk) − i=Xrk/2   rk (k) (k+1) (k) (k) gi ψ(y ) ψ(y ) O(ǫwǫmp) by ψ(y ) ψ(y ), i [rk/2, rk] ≥ · π(rk) − π(1.5rk) − π(i) ≥ π(rk) ∀ ∈ i=Xrk/2   r k ǫ g Ω( mp ) O(ǫ ǫ ) by (32) ≥ rk · log n − w mp i=Xrk/2   r k ǫ g Ω( mp ) by ǫ < O(1/ log n) ≥ rk · log n w i=Xrk/2

= Ω (ǫmprkgrk / log n) . Therefore, we complete the proof.

Fact B.16. Given two length n positive vectors a, b. Let a be sorted such that a a . Let π i ≥ i+1 denote the permutation such that b b . If for all i [n], a b ǫa . Then for all π(i) ≥ π(i+1) ∈ | i − i| ≤ i i [n], a b ǫa . ∈ | i − π(i)|≤ i Proof. Case 1. π(i)= i. This is trivially true. Case 2. π(i) < i. We have

b b (1 ǫ)a π(i) ≥ i ≥ − i

40 Since π(i) < i, we know that there exists a j > i such that π(j) <π(i). Then we have

b b (1 + ǫ)a (1 + ǫ)a π(i) ≤ π(j) ≤ j ≤ i Combining the above two inequalities, we have (1 ǫ)a b (1 + ǫ)a . − i ≤ π(i) ≤ i Case 3. π(i) > i. We have

b b (1 + ǫ)a π(i) ≤ i ≤ i Since π > i, we know that there exists j < i such that π(j) >π(i). Then we have

b b (1 ǫ)a (1 ǫ)a . π(i) ≥ π(j) ≥ − j ≥ − i Combining the above two inequalities gives us (1 ǫ)a b (1 + ǫ)a . − i ≤ π(i) ≤ i Therefore, putting all the three cases together completes the proof.
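Fact B.16 is easy to check numerically; the following sketch (hypothetical sizes) draws a multiplicatively perturbed sequence and verifies that pairing the $i$-th largest entry of $a$ with the $i$-th largest entry of $b$ preserves the relative error, exactly as the fact states.

    import numpy as np

    rng = np.random.default_rng(7)
    n, eps = 200, 0.05
    a = np.sort(rng.random(n) + 0.5)[::-1]          # a_1 >= a_2 >= ... (descending)
    b = a * (1 + eps * rng.uniform(-1, 1, size=n))  # |a_i - b_i| <= eps * a_i
    pi = np.argsort(-b)                             # b_{pi(1)} >= b_{pi(2)} >= ...
    assert np.all(np.abs(a - b[pi]) <= eps * a + 1e-12)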

B.5 Potential function ψ

[CLS19] used a scalar version of the potential function. Here, we generalize it to the matrix version.

Lemma B.17 (Matrix version of Lemma 5.10 in [CLS19]). The function $\psi:$ square matrix $\to\mathbb{R}$ (defined in Eq. (26)) satisfies the following properties:
1. Symmetric ($\psi(x)=\psi(-x)$) and $\psi(0)=0$;
2. If $\|x\|_F\ge\|y\|_F$, then $\psi(x)\ge\psi(y)$;
3. $|f'(x)|=\Omega(1/\epsilon_{mp})$, $\forall x\in[(0.01\epsilon_{mp})^2,\epsilon_{mp}^2]$;
4. $L_1\stackrel{\mathrm{def}}{=}\max_x \frac{D_x\psi[H]}{\|H\|_F}=2$ and $L_2\stackrel{\mathrm{def}}{=}\max_x \frac{D_x^2\psi[H,H]}{\|H\|_F^2}=10/\epsilon_{mp}$;
5. $\psi(x)$ is constant for $\|x\|_F\ge 2\epsilon_{mp}$.

Proof. Let $f:\mathbb{R}_+\to\mathbb{R}$ be defined as
\[
f(x) = \begin{cases} \frac{x^2}{2\epsilon_{mp}^3}, & x\in[0,\epsilon_{mp}^2];\\[2pt] \epsilon_{mp}-\frac{(4\epsilon_{mp}^2-x)^2}{18\epsilon_{mp}^3}, & x\in(\epsilon_{mp}^2,4\epsilon_{mp}^2];\\[2pt] \epsilon_{mp}, & x\in(4\epsilon_{mp}^2,+\infty). \end{cases}
\]
We can see that
\[
f'(x) = \begin{cases} \frac{x}{\epsilon_{mp}^3}, & x\in[0,\epsilon_{mp}^2];\\[2pt] \frac{4\epsilon_{mp}^2-x}{9\epsilon_{mp}^3}, & x\in(\epsilon_{mp}^2,4\epsilon_{mp}^2];\\[2pt] 0, & x\in(4\epsilon_{mp}^2,+\infty); \end{cases}
\qquad\text{and}\qquad
f''(x) = \begin{cases} \frac{1}{\epsilon_{mp}^3}, & x\in[0,\epsilon_{mp}^2];\\[2pt] -\frac{1}{9\epsilon_{mp}^3}, & x\in(\epsilon_{mp}^2,4\epsilon_{mp}^2];\\[2pt] 0, & |x|\in(4\epsilon_{mp}^2,+\infty). \end{cases}
\]
It implies that $\max_x|f'(x)|\le\frac{1}{\epsilon_{mp}}$ and $\max_x|f''(x)|\le\frac{1}{\epsilon_{mp}^3}$. Let $\psi(x)=f(\|X\|_F^2)$.

Proof of Parts 1, 2 and 5. These follow in a standard way from the definition of $\psi$.
Proof of Part 3. This follows directly from the definition of the scalar function $f$.
Proof of Part 4. By the chain rule, we have

2 D ψ[h] = 2f ′( X ) tr[XH] x k kF · 2 2 2 2 2 D ψ[h, h] = 2f ′′( X ) (tr[XH]) + 2f ′( x ) tr[H ] x k kF · k kF ·

41 where x is the vectorization of matrix X and h is the vectorization of matrix H. We can upper bound

2 2 D ψ[h] 2 f ′( X ) tr[XH] 2 f ′( X ) X H | x |≤ | k kF | · | |≤ | k kF | · k kF · k kF Then, we have

3 3 X F /ǫmp 1, X F [0,ǫmp] 2 k 2k ≤2 k k ∈ f ′( X ) X F = (4ǫ X ) X /9ǫ 2/3, X (ǫ , 2ǫ ] | k kF | · k k  mp − k kF k kF mp ≤ k kF ∈ mp mp 0, X (2ǫ , + ) k kF ∈ mp ∞ It implies that D ψ[h] 2 H , x. | x |≤ k kF ∀ By case analysis, we have

1 2 2 2 2 2 ǫ3 X F 4/ǫmp, X F [0, 4ǫmp] f ′′( X ) X mp k k ≤ k k ∈ F F 2 | k k | · k k ≤ (0, X (4ǫmp, + ) k kF ∈ ∞ We can also upper bound

2 2 2 2 2 D ψ[h, h] 2 f ′′( X ) (tr[XH]) + 2 f ′( X ) tr[H ] | x |≤ | k kF | · | k kF | · 2 2 2 2 2 f ′′( X ) ( X H ) + 2 f ′( X ) H ≤ | k kF | · k kF k kF | k kF | · k kF 4 2 1 2 2 H F + 2 H F ≤ · ǫmp k k · ǫmp k k 10 2 = H F . ǫmp k k
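The following is a small numerical sketch of the soft potential above (with a hypothetical value of $\epsilon_{mp}$): it implements $f$ and $\psi(X)=f(\|X\|_F^2)$ and spot-checks the properties in Lemma B.17 — monotonicity in $\|X\|_F$, the plateau for $\|X\|_F\ge 2\epsilon_{mp}$, and the derivative bound $|f'|\le 1/\epsilon_{mp}$.

    import numpy as np

    eps = 0.05  # hypothetical epsilon_mp

    def f(x):
        # soft ramp used in Lemma B.17: quadratic, then concave, then flat
        if x <= eps**2:
            return x**2 / (2 * eps**3)
        if x <= 4 * eps**2:
            return eps - (4 * eps**2 - x)**2 / (18 * eps**3)
        return eps

    def psi(X):
        return f(np.linalg.norm(X, 'fro')**2)

    # spot-checks of the stated properties
    X_small = 0.3 * eps * np.eye(2)
    X_big = 3 * eps * np.eye(2)
    assert psi(X_small) <= psi(X_big)                  # monotone in ||X||_F
    assert psi(X_big) == psi(10 * X_big) == eps        # constant once ||X||_F >= 2 eps
    xs = np.linspace(1e-6, 5 * eps**2, 2000)
    fp = np.gradient([f(x) for x in xs], xs)           # numerical f'
    assert np.max(np.abs(fp)) <= 1 / eps * 1.01        # |f'| <= 1/eps (up to discretization)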

B.6 $x$ and $\bar{x}$ are close

Lemma B.18 ($x$ and $\bar{x}$ are close in terms of $\widetilde{V}^{-1}$). With probability $1-\delta$ over the randomness of the sketching matrix $R\in\mathbb{R}^{b\times n}$, we have
\[
\|x_i - \bar{x}_i\|_{\widetilde{V}_i^{-1}} \le \epsilon_x,
\]
where $\epsilon_x = O\big(\alpha\log^2(n/\delta)\cdot\frac{n^{1/4}}{\sqrt{b}}\big)$ and $b$ is the size of the sketching matrix.

Proof. Recall the definition of δx and δx, we have

1/2 1/2 1/2 1/2 1/2 1/2 δ δ = V (I R⊤RP )V h V (I P )V h = V (P R⊤RP )V h x,i − x,i i − e − i − i − For iteratione t, the definitione shoulde bee e e e e e e e (t) (t) (t) 1/2 (t) (t) (t) (t) (t) 1/2 δ δ = (V ) (P (R )⊤R P )(V ) h. x,i − x,i i −

For any i, let k bee the current iteration,e eki be the last whene wee changed the Vi. Then, we have that

k e x(k) x(k) = δ(t) δ(t) i − i x,i − x,i tX=ki e 42 (ki) (ki) (t) because we have xi = xi (guaranteed by our algorithm). Since Vi did not change during iteration ki to k for the block i. (However, the whole other parts of matrix V could change). We consider e e k ⊤ k (k) (k) (k) 1 (k) (k) (t) (t) (k) 1 (t) (t) (x x )⊤ (V )− (x x )= δ δ (V )− δ δ i − i · i · i − i  x,i − x,i · i ·  x,i − x,i tX=ki tX=ki e e e e 2  k    (t) (t) (t) (t) 1/2 (t) = (I (R )⊤R )P (V ) h . − i t=ki X   2 e e We consider block i and a coordinate j block i. We define random vector X Rni as follows: ∈ t ∈ (t) (t) (t) (t) 1/2 (t) Xt = (I R ⊤R )P (V ) h . − i   Let (Xt)j denote the j-th coordinate of Xt, for eachej e[ni]. ∈ By Lemma E.5 in Section E, we have for each t, 1 E[X ] = 0, and E[(X )2]= (P (t)(V (t))1/2h(t)) 2 t t j b k ik2 and with probability 1 δ, e e − (t) (t) 1/2 (t) log(n/δ) (Xt)j (P (V ) h )i 2 := M. | | ≤ k k √b Now, we apply Bernstein inequalitye (Lemmae E.3),

τ 2/2 Pr (X ) >τ exp t j ≤ − E[(X )2]+ Mτ/3 " t # t t j ! X √ P Choosing τ = 103 T log2(n/δ) (P (t)(V (t))1/2h(t)) √b · k ik2 √T e e Pr (X ) > 103 log2(n/δ) (P (t)(V (t))1/2h(t)) t j √ · k ik2 " t b # X 6 T 4 (t) (t) 1/2 (t) 2 10 log (n/δ) e(P e(V ) h )i /2 exp b · k k2 T (t) (t) 1/2 (t) 2 3 √T 3 (t) (t) 1/2 (t) 2 ≤ − (P (V ) h )i + 10 log (n/δ) (P (V ) h )i /3! b k k2 b e e k k2 exp( 100 log(n/δ)) ≤ − e e e e Now, taking a union, we have

k (t) (t) (t) (t) 1/2 (t) √T 2 (t) (t) 1/2 (t) (I (R )⊤R )P (V ) h = O log (n/δ) (P (V ) h )i − i √b 2 t=ki ! X   2 e e e e √T O log2(n/δ)α ≤ √b ! where we use that (P (t)(V (t))1/2h(t)) ((V (t))1/2h(t)) = O(α), n = O(1). k ik2 ≤ k ik2 i Finally, we use the fact that the algorithm reset x = x, s = s in less than √n iterations. e e e

43 B.7 s and s are close Lemma B.19 (s and s are close). With probability 1 δ over the randomness of sketching matrix − R Rb n, we have ∈ × 1 t− si si ǫs, k − kVei ≤ 1/4 ǫ = O(α log2(n/δ) n , and b is the size of sketching matrix. s · √b

Proof. Recall the definition of δs, δs, we have

1/2 1/2 δ δ = tV − (R⊤R I)P V h es,i − s,i i − The rest of the proof is identicale to Lemma B.18e except we usee alsoe the fact we make s = s whenever our t changed by a constant factor. We omitted the details here.

B.8 Data structure is maintaining (x, s) implicitly over all the iterations

Lemma B.20. Over all the iterations, u1 + F u2 is always maintaining x implicitly, u3 + Gu4 is always maintaining s implicitly. Proof. We only focus on the PartialUpdate. The FullUpdate is trivial, we ignore the proof. For x. Note that M is not changing. Let’s assume that u1 + F u2 = x, we want to show that new new new new u1 + F u2 = x . which is equivalent to prove unew + F newunew (u + F u )= δ 1 2 − 1 2 x Let ∆u = unew u be the change of u over iteration t, then 1 1 − 1 1 ∆u = V newh + (F F new)u 1 − 2 Let ∆u = unew u be the change of u over iteration t, then 2 2 − 2 2 e ∆u = (V new)1/2h + 1 (∆ 1 + M ) 1M (V new)1/2h. 2 S − S,S − S⊤ − e S,e Se e e e By definition of δx at iteratione t, we have e e

new new new new 1 1 new δx = V h V M V h V MS(∆− + MS,S)− (MS)⊤ V h . − − e S,e Se e e e p p p p  We can computee e e e e e unew + F newunew (u + F u ) 1 2 − 1 2 = ∆u + (F newunew F u ) 1 2 − 2 = V newh + (F F new)u + (F newunew F u ) − 2 2 − 2 new new new = V h + F (u u2) e 2 − new new = V h + F ∆u2 e new new new new1 1 1 new = V h F V h + F S (∆− + MS,S)− (MS)⊤ V h e − e S,e Se e e e = δ p p ex e e e

44 where we used F new = V newM in the last step. For s. p We have e 1 1 Gnew = M, G = M V new V Let ∆u = unew u be the change ofpu over iterationpt, then 3 3 − 3 3e e ∆u = (G Gnew)u 3 − 4 Let ∆u = unew u be the change of u over iteration t, then 4 4 − 4 4 ∆u = t ∆u 4 · 2

By definition of δs in iteration t,

1 new 1 1 1 new δs = M V (th) MS(∆− + MS,S)− (MS)⊤ V (th) V new − V new e S,e Se e e e ! p p p e p e e We can computee e new new new new new (u3 + G u4 ) (u3 + Gu4)=∆u3 + (G u4 Gu4) − new −new new = (G G )u4 + (G u4 Gu4) new− new − = G (u4 u4) new − = G t∆u2

= δs where the last step follows by definition of ∆u2.

45 C Combining Robust Central Path with Data Structure

The goal of this section is to combine Section A and Section B.

Notation Choice of Parameter Statement Comment 2 C1 Θ(1/ log n) Lem. C.1, Thm. B.1 ℓ2 accuracy of W sequence 4 C2 Θ(1/ log n) Lem. C.1, Thm. B.1 ℓ4 accuracy of W sequence 2 ǫmp Θ(1/ log n) RobustIPM Alg in Sec. A accuracy for data structure T Θ(√n log2 n log(n/δ)) Thm. 4.2 #iterations α Θ(1/ log2 n) RobustIPM Alg in Sec. A step size in Hessian norm b Θ(√n log6(nT ) Lem. B.18, Lem. B.19, Lem. C.2 sketch size 3 ǫx Θ(1/ log n) Lem. B.18 accuracy of x (respect to x) 3 ǫs Θ(1/ log n) Lem. B.19 accuracy of s (respect to s) 3 ǫw Θ(1/ log n) Lem. C.2 accuracy of W (respect to W ) a min(2/3, αm) αm is the dual exponent of MM batch size

Table 3: Summary of parameters

C.1 Guarantee for W matrices Lemma C.1 (Guarantee of a sequence of W ). Let xnew = x + δ . Let W new = ( 2φ(xnew)) 1 and x ∇ − W = ( 2φ(x)) 1. Then we have ∇ − m 2 1/2 new 1/2 2 wi− (wi wi)wi− C1 , − F ≤ i=1 Xm 4 1/2 new 1/2 2 wi− (wi wi)wi− C2 , − F ≤ i=1 X 1/2 new 1/2 1 wi− (wi wi)wi− . − F ≤ 4

2 where C2 = Θ(α ) and C1 = Θ(α) . Proof. For each i [m], we have ∈ 2 1/2 new 1/2 Wi− (Wi Wi)Wi− − F 2 1/2 new 1 /2 = n W − (W W )W − i i i − i i 2 2 1/2 2 new 1 2 1 2 1/2 = n ( φ(x )) ( φ(x )− φ(x )− )( φ(x )) i ∇ i ∇ i −∇ i ∇ i 2 2 1 2 1/2 2 1 2 1/2 new 1 ( φ(xi)) φ(xi)− ( φ(xi)) 2 2 ≤ (1 xi xi φ(xi)) − ! · ∇ ∇ ∇ − k − k∇ 2 1 = ni new 1 2 2 (1 xi xi φ(xi)) − ! − k − k∇ new 2 100ni xi xi 2φ(x ), ≤ k − k∇ i

46 where the second step follows by Theorem 2.3. In our problem, we assume that ni = O(1). It remains to bound

new 2 2 2 2 xi xi 2φ(x ) = δx,i 2φ(x ) . δx,i xi = αi k − k∇ i k k∇ i k k where the last step follows from definition α = δ . i k x,ikxi Then, we have

m m new 2 2 2 xi xi 2φ(x ) O(αi ) O(α ). k − k∇ i ≤ ≤ Xi=1 Xi=1 where the last step follows by Lemma A.1.

Lemma C.2 (Accuracy of $\bar{W}$). Let $x$ and $\bar{x}$ be the vectors maintained by the data-structure StochasticProjectionMaintenance. Let $W = (\nabla^2\phi(x))^{-1}$ and $\bar{W} = (\nabla^2\phi(\bar{x}))^{-1}$. Then we have
\[
\|w_i^{-1/2}(w_i - \bar{w}_i)w_i^{-1/2}\|_F \le \epsilon_w,
\]
where $\epsilon_w = O\big(\alpha\log^2(nT)\cdot\frac{n^{1/4}}{\sqrt{b}}\big)$ and $b$ is the size of the sketching matrix.

Proof. By a similar calculation, we have

1/2 1/2 − − 2 wi (wi wi)wi F = O(1) xi xi φ(xi). k − k · k − k∇ Then, using Lemma B.18 with δ = 1/T

√ 1/4 2 n 2 xi xi φ(xi) O α log (nT ) . k − k∇ ≤ · √b !

C.2 Main result

The goal of this section is to prove our main result.

Theorem C.3 (Main result, formal version of Theorem 1.1). Consider a convex problem
\[
\min_{Ax=b,\; x\in\prod_{i=1}^m K_i} c^\top x
\]
where the $K_i$ are compact convex sets. For each $i\in[m]$, we are given a $\nu_i$-self-concordant barrier function $\phi_i$ for $K_i$. Also, we are given $x^{(0)} = \arg\min_x \sum_i \phi_i(x_i)$. Assume that:
1. Diameter of the set: for any $x\in\prod_{i=1}^m K_i$, we have $\|x\|_2\le R$.
2. Lipschitz constant of the program: $\|c\|_2\le L$.

Algorithm 5 Robust Central Path
1: procedure CentralPathStep(x̄, s̄, t, λ, α)
2:   for i = 1 → m do  ⊲ Figure out direction h
3:     µ_i^t ← s̄_i/t + ∇φ_i(x̄_i)  ⊲ According to Eq. (6)
4:     γ_i^t ← ‖µ_i^t‖_{∇²φ_i(x̄_i)^{-1}}  ⊲ According to Eq. (7)
5:     c_i^t ← exp(λγ_i^t)/γ_i^t / (Σ_{i=1}^m exp(2λγ_i^t))^{1/2} if γ_i^t ≥ 96√α, and c_i^t ← 0 otherwise  ⊲ According to Eq. (9)
6:     h_i ← −α · c_i^t · µ_i^t  ⊲ According to Eq. (8)
7:   end for
8:   W ← (∇²φ(x̄))^{-1}  ⊲ Computing block-diagonal matrix W
9:   return h, W
10: end procedure
11:
12: procedure RobustCentralPath(mp, t, λ, α)  ⊲ Lemma A.8
13:   ⊲ Standing at (x, s) implicitly via data-structure
14:   ⊲ Standing at (x̄, s̄) explicitly via data-structure
15:   (x̄, s̄) ← mp.Query()  ⊲ Algorithm 1, Lemma B.8
16:
17:   h, W ← CentralPathStep(x̄, s̄, t, λ, α)
18:
19:   mp.Update(W)  ⊲ Algorithm 2, Lemma B.5
20:   mp.MultiplyMove(h, t)  ⊲ Algorithm 4, Lemma B.10, Lemma B.9
21:   ⊲ x ← x + δ_x, s ← s + δ_s, achieved by data-structure implicitly
22:   ⊲ x̄ ← x̄ + δ_x̄, s̄ ← s̄ + δ_s̄, achieved by data-structure explicitly
23:   ⊲ If x̄ is far from x, then x̄ ← x
24: end procedure

Algorithm 6 Our main algorithm (More detailed version of RobustIPM in Section 4)
1: procedure Main(A, b, c, φ, δ)  ⊲ Theorem 1.1, Theorem C.3
2:   λ ← 2^16 log(m), α ← 2^{-20}λ^{-2}, κ ← 2^{-10}α
3:   δ ← min(1/λ, δ)  ⊲ Choose the target accuracy
4:   a ← min(2/3, α_m)  ⊲ Choose the batch size
5:   b_sketch ← 2^{10} · √ν log^6(n/δ) log log(1/δ)  ⊲ Choose the size of the sketching matrix
6:   Modify the ERM(A, b, c, φ) and obtain an initial x and s
7:   CentralPathMaintenance mp  ⊲ Algorithm 1, Theorem B.1
8:   mp.Initialize(A, x, s, α, a, b_sketch)  ⊲ Algorithm 1, Lemma B.4
9:   ν ← Σ_{i=1}^m ν_i  ⊲ ν_i are the self-concordance parameters of φ_i
10:  t ← 1
11:  while t > δ²/(4ν) do
12:    t^new ← (1 − κ/√ν)t
13:    RobustCentralPath(mp, t, λ, α)  ⊲ Algorithm 5
14:    t ← t^new
15:  end while
16:  Return an approximate solution of the original ERM according to Section D
17: end procedure

Then, the algorithm Main finds a vector $x$ such that
\[
c^\top x \le \min_{Ax=b,\; x\in\prod_{i=1}^m K_i} c^\top x + LR\cdot\delta,
\]
\[
\|Ax-b\|_1 \le 3\delta\cdot\Big(R\sum_{i,j}|A_{i,j}| + \|b\|_1\Big),
\qquad
x\in\prod_{i=1}^m K_i,
\]
in time
\[
\widetilde{O}\big((n^{\omega+o(1)} + n^{2.5-\alpha/2+o(1)} + n^{2+1/6+o(1)})\cdot\log(n/\delta)\big),
\]
where $\omega$ is the exponent of matrix multiplication [Wil12, LG14], and $\alpha$ is the dual exponent of matrix multiplication [LGU18].

Proof. The number of iterations is

O(√ν log2(m) log(ν/δ)) = O(√n log2(n) log(n/δ)).

For each iteration, the amortized cost per iteration is

1+a 1.5 2 ω 1/2+o(1) 2 a/2+o(1) ω 1/2+o(1) O(nb + n + n )+ O(C /ǫ + C /ǫ ) (n − + n − )+ O(n − ) 1 mp 2 mp · 1+a 1.5 2 ω 1/2+o(1) 2 a/2+o(1) ω 1/2+o(1) = O(nb + n + n )+ O(α + α ) (n − + n − )+ O(n − ) · 1+a 1.5 4 ω 1/2+o(1) 2 a/2+o(1) ω 1/2+o(1) = O(nb + n + n )+ O(1/ log n) (n − + n − )+ O(n − ) · 1.5+o(1) 6 1+a+o(1) ω 1/2+o(1) 2 a/2+o(1) = O(n log log(1/δ)+ n )+ O(n − + n − ). where the last step follows from choice of b (see Table 3). Finally, we have

\begin{align*}
\text{total time} &= \#\text{iterations}\cdot\text{cost per iteration}\\
&= O\big(\sqrt{n}\log^2 n\log(n/\delta)\big)\cdot O\big(n^{1.5+o(1)}\log\log(1/\delta) + n^{1+a+o(1)} + n^{\omega-1/2+o(1)} + n^{2-a/2+o(1)}\big)\\
&= O\big(n^{1.5+a+o(1)} + n^{\omega+o(1)} + n^{2.5-a/2+o(1)}\big)\cdot\log(n/\delta)\cdot\log\log(1/\delta)\\
&= O\big(n^{2+1/6+o(1)} + n^{\omega+o(1)} + n^{2.5-\alpha_m/2+o(1)}\big)\cdot\log(n/\delta)\cdot\log\log(1/\delta),
\end{align*}
where we pick $a=\min(2/3,\alpha_m)$ and $\alpha_m$ is the dual exponent of matrix multiplication [LGU18]. Thus, we complete the proof.
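For concreteness, the choice $a=\min(2/3,\alpha_m)$ balances the two $a$-dependent exponents above; in the case $\alpha_m\ge 2/3$ (so $a=2/3$):
\[
1.5 + a = 1.5 + \tfrac{2}{3} = 2 + \tfrac{1}{6}, \qquad 2.5 - \tfrac{a}{2} = 2.5 - \tfrac{1}{3} = 2 + \tfrac{1}{6},
\]
so both terms become $n^{2+1/6+o(1)}$; when $\alpha_m < 2/3$, the term $n^{2.5-\alpha_m/2+o(1)}$ dominates instead.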

Corollary C.4 (Empirical risk minimization). Given convex functions $f_i(y):\mathbb{R}\to\mathbb{R}$. Suppose the solution $x^*\in\mathbb{R}^d$ lies in the $\ell_\infty$-Ball$(0,R)$. Suppose each $f_i$ is $L$-Lipschitz in the region $\{y : |y|\le 4\sqrt{n}\cdot M\cdot R\}$. Given a matrix $A\in\mathbb{R}^{d\times n}$ with $\|A\|\le M$ such that $A$ has no redundant constraints, and a vector $b\in\mathbb{R}^d$ with $\|b\|_2\le M\cdot R$. We can find $x\in\mathbb{R}^d$ such that
\[
\sum_{i=1}^n f_i(a_i^\top x + b_i) \le \min_{x\in\mathbb{R}^d}\sum_{i=1}^n f_i(a_i^\top x + b_i) + \delta M R
\]
in time
\[
\widetilde{O}\big((n^{\omega+o(1)} + n^{2.5-\alpha/2+o(1)} + n^{2+1/6+o(1)})\cdot\log(n/\delta)\big),
\]
where $\omega$ is the exponent of matrix multiplication [Wil12, LG14], and $\alpha$ is the dual exponent of matrix multiplication [LGU18].

Proof. It follows from applying Theorem C.3 to convex program (1) with the extra constraint that $x^*$ lies in the $\ell_\infty$-Ball$(0,R)$. Note that in program (1), $n_i = 2$. Thus $m = O(n)$.

D Initial Point and Termination Condition

We first need some result about self concordance. Lemma D.1 (Theorem 4.1.7, Lemma 4.2.4 in [Nes98]). Let φ be any ν-self-concordant barrier. Then, for any x,y domφ, we have ∈ φ(x),y x ν, h∇ − i≤ y x 2 φ(y) φ(x),y x k − kx . h∇ −∇ − i≥ 1+ y x k − kx n Let x = argmin φ(x). For any x R such that x x ∗ 1, we have that x domφ. ∗ x ∈ k − ∗kx ≤ ∈ x∗ y ∗ ν + 2√ν. k − kx ≤ Lemma D.2. Consider a convex problem minAx=b,x Qm K c⊤x where Ki are compact convex set. ∈ i=1 i For each i [m], we are given a νi-self concordant barrier function φi for Ki. Also, we are given (0) ∈ x = argminx i φi(xi). Assume that m 1. DiameterP of the set: For any x Ki, we have that x 2 R. ∈ i=1 k k ≤ 2. Lipschitz constant of the program:Qc L. k k2 ≤ For any δ > 0, the modified program min m c x with Ax=b,x Q K R+ ⊤ ∈ i=1 i× δ c A = [A b Ax(0)], b = b, and c = LR · | − 1   satisfies the following: (0) δ x LR c 1. x = , y = 0d and s = · are feasible primal dual vectors with s+ φ(x) ∗ δ 1 1 k ∇ kx ≤     where φ(x)= m φ (x ) log(x ). i=1 i i − m+1 m R 2 2. For any x suchP that Ax = b, x i=1 Ki + and c⊤x minAx=b,x Qm K R c⊤x + δ , ∈ × ≤ i=1 i + the vector x (x is the first n coordinates of x) is an approximate solution∈ to× the original 1:n 1:n Q convex program in the following sense

c⊤x1:n min c⊤x + LR δ, ≤ Ax=b,x Qm K · ∈ i=1 i

Ax b 3δ R A + b , k 1:n − k1 ≤ ·  | i,j| k k1 Xi,j m   x K . 1:n ∈ i Yi=1 50 Proof. For the first result, straightforward calculations show that (x, y, s) are feasible. To compute s + φ(x) , note that k ∇ kx∗ δ s + φ(x) ∗x = c 2φ(x(0))−1 . k ∇ k kLR · k∇ Rn (0) m Lemma D.1 shows that x such that x x x(0) 1, we have that x i=1 Ki because (0) ∈ k − k ≤2 (0) ∈ (0) x = argminx i φi(xi). Hence, for any v such that v⊤ φ(x )v 1, we have that x v m ∇ ≤ Q ± ∈ K and hence x(0) v R. This implies v R for any v 2φ(x(0))v 1. Hence, i=1 i P k ± k2 ≤ k k2 ≤ ⊤∇ ≤ ( 2φ(x(0))) 1 R2 I. Hence, we have Q∇ −  · δ δ s + φ(x) ∗x = c 2φ(x(0))−1 c 2 δ. k ∇ k kLR · k∇ ≤ kL · k ≤

For the second result, we let OPT = min m c x and OPT = min m c x. Ax=b,x Q Ki ⊤ Ax=b,x Q K R+ ⊤ ∈ i=1 ∈ i=1 i× x For any feasible x in the original problem, x = is a feasible in the modified problem. There- 0 fore, we have that   δ δ OPT c⊤x = OPT. ≤ LR · LR · x Given a feasible x with additive error δ2. Write x = 1:n for some τ 0. We can compute τ ≥   c x which is δ c x + τ. Then, we have ⊤ LR · ⊤ 1:n

δ 2 δ 2 c⊤x + τ OPT + δ OPT + δ . (34) LR · 1:n ≤ ≤ LR · Hence, we can upper bound the OPT of the transformed program as follows:

LR δ LR δ 2 c⊤x = c⊤x OPT + δ = OPT + LR δ, 1:n δ · LR 1:n ≤ δ LR · ·   where the second step follows by (34). For the feasibility, we have that τ δ c x + δ OPT + δ2 δ + δ + δ because ≤ − LR · ⊤ 1:n LR · ≤ OPT = minAx=b,x 0 c⊤x LR and that c⊤x1:n LR. The constraint in the new polytope shows ≥ ≤ ≤ that Ax + (b Ax(0))τ = b. 1:n − Rewriting it, we have Ax b = (Ax(0) b)τ and hence 1:n − − Ax b Ax(0) b τ. k 1:n − k1 ≤ k − k1 ·

Lemma D.3. Let φ (x ) be a ν -self-concordant barrier. Suppose we have si + φ (x )= µ for all i i i t ∇ i i i i [m], A y + s = c and Ax = b. Suppose that µ 1 for all i, we have that ∈ ⊤ k ikx,i∗ ≤

c, x c, x∗ + 4tν h i ≤ h i m m where x∗ = argminAx=b,x Q Ki c⊤x and ν = i=1 νi. ∈ i=1 P

51 Proof. Let xα = (1 α)x+αx∗ for some α to be chosen. By Lemma D.1, we have that φ(xα),x∗ xα −ν h∇ − i≤ ν. Hence, we have 1 α φ(xα),x∗ x . Hence, we have − ≥ h∇ − i να φ(x ),x x 1 α ≥ h∇ α α − i − s = φ(x ) φ(x),x x + µ ,x x h∇ α −∇ α − i − t α − m 2 D E xα,i xi xi 1 k − k + µ,xα x c A⊤y,xα x ≥ 1+ xα,i xi xi h − i− t − − i=1 k − k D E Xm m α2 x x 2 α k i∗ − ikxi α µi x∗i xi∗ xi xi c, x∗ x . ≥ 1+ α xi∗ xi xi − k k k − k − t h − i Xi=1 k − k Xi=1 where we used Lemma D.1 on the second first, Axα = Ax on the second inequality. Hence, we have m m c, x c, x ν α x x 2 ∗ k i∗ − ikxi h i h i + + µi x∗i xi∗ xi xi . t ≤ t 1 α k k k − k − 1+ α xi∗ xi xi − Xi=1 Xi=1 k − k Using µ 1 for all i, we have k ikx∗i ≤ m c, x c, x ν x x c, x ν m h i h ∗i + + k i∗ − ikxi h ∗i + + . t ≤ t 1 α 1+ α xi∗ xi xi ≤ t 1 α α − Xi=1 k − k − Setting α = 1 , we have c, x c, x + 2t(ν + m) c, x + 4tν because the self-concordance ν 2 h i ≤ h ∗i ≤ h ∗i i is always larger than 1.

E Basic Properties of Subsampled Hadamard Transform Matrix

This section provides some standard calculations about sketching matrices; they can be found in previous literature [PSW17]. Usually, the reason for using the subsampled randomized Hadamard/Fourier transform [LDFU13] is that multiplying the matrix with $k$ vectors only takes $kn\log n$ time. Unfortunately, in our application, the best way to optimize the running time is to use matrix multiplication directly (without doing any FFT [CT65], or more fancy sparse Fourier transforms [HIKP12b, HIKP12a, Pri13, IKP14, IK14, PS15, CKPS16, Kap16, Kap17, NSW19]). In order to have an easy analysis, we still use a subsampled randomized Hadamard/Fourier matrix.

E.1 Concentration inequalities

We first state a useful concentration lemma.

Lemma E.1 (Lemma 1 on page 1325 of [LM00]). Let $X\sim\mathcal{X}_k^2$ be a chi-squared distributed random variable with $k$ degrees of freedom, where each component has zero mean and $\sigma^2$ variance. Then
\[
\Pr[X - k\sigma^2 \ge (2\sqrt{kt} + 2t)\sigma^2] \le \exp(-t), \qquad
\Pr[k\sigma^2 - X \ge 2\sqrt{kt}\,\sigma^2] \le \exp(-t).
\]

Lemma E.2 (Khintchine's Inequality). Let $\sigma_1,\cdots,\sigma_n$ be i.i.d. sign random variables, and let $z_1,\cdots,z_n$ be real numbers. Then there are constants $C, C' > 0$ so that
\[
\Pr\Big[\Big|\sum_{i=1}^n z_i\sigma_i\Big| \ge Ct\|z\|_2\Big] \le \exp(-C't^2).
\]

Lemma E.3 (Bernstein Inequality). Let $X_1,\cdots,X_n$ be independent zero-mean random variables. Suppose that $|X_i|\le M$ almost surely for all $i$. Then, for all positive $t$,
\[
\Pr\Big[\sum_{i=1}^n X_i > t\Big] \le \exp\Big(-\frac{t^2/2}{\sum_{j=1}^n \mathbb{E}[X_j^2] + Mt/3}\Big).
\]

E.2 Properties obtained by random projection

Remark E.4. The Subsampled Randomized Hadamard Transform [LDFU13] can be defined as $R = SH_n\Sigma\in\mathbb{R}^{b\times n}$, where $\Sigma$ is an $n\times n$ diagonal matrix with i.i.d. diagonal entries $\Sigma_{i,i}$ in which $\Sigma_{i,i}=1$ with probability $1/2$ and $\Sigma_{i,i}=-1$ with probability $1/2$; $H_n$ refers to the Hadamard matrix of size $n$, which we assume is a power of $2$; and the $b\times n$ matrix $S$ samples $b$ coordinates of an $n$-dimensional vector uniformly at random. If we replace the definition of the sketching matrix in Lemma E.5 by the Subsampled Randomized Hadamard Transform and let $\bar{R} = SH_n$, then the same proof will go through.
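As a quick empirical illustration of the guarantees that Lemma E.5 below establishes (a sketch with hypothetical sizes, using the random-sign version $R=\bar{R}\Sigma$ with i.i.d. $\pm 1/\sqrt{b}$ entries rather than the Hadamard-based one): $R^\top Rh$ is an unbiased estimate of $h$, and its per-coordinate fluctuation shrinks like $\|h\|_2/\sqrt{b}$.

    import numpy as np

    rng = np.random.default_rng(0)
    n, b, trials = 256, 64, 2000
    h = rng.standard_normal(n)

    errs = np.zeros(n)
    mean_est = np.zeros(n)
    for _ in range(trials):
        Rbar = rng.choice([-1.0, 1.0], size=(b, n)) / np.sqrt(b)
        Sigma = rng.choice([-1.0, 1.0], size=n)
        R = Rbar * Sigma          # R = Rbar @ diag(Sigma)
        est = R.T @ (R @ h)       # sketch-and-unsketch estimate of h
        mean_est += est / trials
        errs += (est - h) ** 2 / trials

    # E[R^T R h] = h (unbiasedness), per-coordinate variance is about ||h||^2 / b
    assert np.linalg.norm(mean_est - h) / np.linalg.norm(h) < 0.1
    assert errs.mean() < 3 * np.linalg.norm(h) ** 2 / b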

Lemma E.5 (Expectation, variance, absolute guarantees for sketching a fixed vector). Let $h\in\mathbb{R}^n$ be a fixed vector. Let $\bar{R}\in\mathbb{R}^{b\times n}$ denote a random matrix where each entry is i.i.d. sampled to be $+1/\sqrt{b}$ with probability $1/2$ and $-1/\sqrt{b}$ with probability $1/2$. Let $\Sigma\in\mathbb{R}^{n\times n}$ denote a diagonal matrix where each entry is $1$ with probability $1/2$ and $-1$ with probability $1/2$. Let $R = \bar{R}\Sigma$. Then we have
\[
\mathbb{E}[R^\top Rh] = h, \qquad \mathbb{E}[(R^\top Rh)_i^2] \le h_i^2 + \frac{1}{b}\|h\|_2^2, \qquad \Pr\Big[|(R^\top Rh)_i - h_i| > \frac{\log(n/\delta)}{\sqrt{b}}\|h\|_2\Big] \le \delta.
\]
Proof. Let $R_{i,j}$ denote the entry in the $i$-th row and $j$-th column of the matrix $R\in\mathbb{R}^{b\times n}$, and let $R_{*,i}\in\mathbb{R}^b$ denote the $i$-th column of $R$. We first show the expectation,

E[(R⊤Rh)i]= E[ R,R ,ih⊤ ] h ∗ i b n = E R R h  j,l j,i l Xj=1 Xl=1  b  b = E R2 h + E R R h  j,i i  j,l j,i l j=1 j=1 l [n] i X X ∈X\     = hi + 0

= hi


2 2 E[(R⊤Rhi) ]= E[ R,R ,ih⊤ ] h ∗ i 2 b n = E R R h  j,l j,i l  Xj=1 Xl=1 2  b b   = E R2 h + R R h  j,i i j,l j,i l  j=1 j=1 l [n] i X X ∈X\ 2  b b  b 2 2 = E R h + 2 E R ′ h R R h  j,i i   j ,i i j,l j,i l j=1 j′=1 j=1 l [n] i X X X ∈X\ 2  b     + E R R h  j,l j,i l  j=1 l [n] i X ∈X\    = C1 + C2 + C3, where the last step follows from defining those terms to be C1,C2 and C3. For the term C1, we have

2 b b 2 2 2 4 2 2 2 1 1 2 C = h E R = h E R + R R ′ = h b + b(b 1) = h 1 i  j,i  i  j,i j,i j ,i i · b2 − · b2 i j=1 j=1 j′=j   X X X6      For the second term C2,

C2 = 0.

For the third term C3,

2 b C = E R R h 3  j,l j,i l  j=1 l [n] i X ∈X\ b   b 2 2 2 = E R R h + E R R h R ′ ′ R ′ h ′  j,l j,i l   j,l j,i l j ,l j ,i l  j=1 l [n] i j=1 l [n] i j′ [b] j l′ [n] i l X ∈X\ X ∈X\ X∈ \ ∈X\ \ b    1 1 1 = h2 + 0 h 2 d d l ≤ dk k2 j=1 l [n] i X ∈X\ Therefore, we have

2 2 1 2 E[(R⊤Rh) ] C + C + C h + h . i ≤ 1 2 3 ≤ i b k k2 Third, we prove the worst case bound with high probability. We can write (R Rh) h as ⊤ i − i


(R⊤Rh)i hi = R,R ,ih⊤ hi − h ∗ i− b n = R R h h j,l · j,i · l − i Xj=1 Xl=1 b b = R2 h h + R R h j,i i − i j,l j,i · l j=1 j=1 l [n] i X X ∈X\ b = R R h by R2 = 1/b j,l j,i · l j,i j=1 l [n] i X ∈X\ = hl R ,l,R ,i h ∗ ∗ i l [n] i ∈X\ = hl σlR ,l,σiR ,i by R ,l = σlR ,l · h ∗ ∗ i ∗ ∗ l [n] i ∈X\ First, we apply Khintchine’s inequality, we have

1/2 2 2 2 Pr hl σl R ,l,σiR ,i Ct hl ( R ,l,σiR ,i ) exp( C′t )  · · h ∗ ∗ i ≥  h ∗ ∗ i   ≤ − l [n] i l [n] i ∈X\ ∈X\       and choose t = log(n/δ). For each l = i, using [LDFU13] we have 6 p log(n/δ) Pr R ,l, R ,i δ/n. "|h ∗ ∗ i| ≥ p √b # ≤ Taking a union bound over all l [n] i, we have ∈ \ log(n/δ) (R⊤Rh)i hi h 2 | − | ≤ k k √b with probability 1 δ. − Acknowledgments

The authors would like to thank Haotian Jiang, Swati Padmanabhan, Ruoqi Shen, Zhengyu Wang, Xin Yang and Peilin Zhong for useful discussions.
