arXiv:1912.06252v2 [math.OC] 13 May 2021

A scaling-invariant algorithm for linear programming whose running time depends only on the constraint matrix∗

Daniel Dadush¹, Sophie Huiberts¹, Bento Natura², and László A. Végh²

¹ Centrum Wiskunde & Informatica, The Netherlands
² Department of Mathematics, London School of Economics and Political Science, United Kingdom

Abstract

Following the breakthrough work of Tardos (Oper. Res. ’86) in the bit-complexity model, Vavasis and Ye (Math. Prog. ’96) gave the first exact algorithm for linear programming in the real model of computation with running time depending only on the constraint matrix. For solving a linear program (LP) $\max\, c^\top x,\ Ax = b,\ x \ge 0,\ A \in \mathbb{R}^{m\times n}$, Vavasis and Ye developed a primal-dual interior point method using a ‘layered least squares’ (LLS) step, and showed that $O(n^{3.5}\log(\bar\chi_A + n))$ iterations suffice to solve (LP) exactly, where $\bar\chi_A$ is a condition measure controlling the size of solutions to linear systems related to $A$.

Monteiro and Tsuchiya (SIAM J. Optim. ’03), noting that the central path is invariant under rescalings of the columns of $A$ and $c$, asked whether there exists an LP algorithm depending instead on the measure $\bar\chi^*_A$, defined as the minimum $\bar\chi_{AD}$ value achievable by a column rescaling $AD$ of $A$, and gave strong evidence that this should be the case. We resolve this open question affirmatively.

Our first main contribution is an $O(m^2n^2 + n^3)$ time algorithm which works on the linear matroid of $A$ to compute a nearly optimal diagonal rescaling $D$ satisfying $\bar\chi_{AD} \le n(\bar\chi^*_A)^3$. This algorithm also allows us to approximate the value of $\bar\chi_A$ up to a factor $n(\bar\chi^*_A)^2$. This result is in (surprising) contrast to that of Tunçel (Math. Prog. ’99), who showed NP-hardness for approximating $\bar\chi_A$ to within $2^{\operatorname{poly}(\operatorname{rank}(A))}$. The key insight for our algorithm is to work with ratios $g_i/g_j$ of circuits of $A$, i.e., minimal linear dependencies $Ag = 0$, which allow us to approximate the value of $\bar\chi^*_A$ by a maximum geometric mean cycle computation in what we call the ‘circuit ratio digraph’ of $A$.

While this resolves Monteiro and Tsuchiya’s question by appropriate preprocessing, it falls short of providing either a truly scaling invariant algorithm or an improvement upon the base LLS analysis. In this vein, as our second main contribution we develop a scaling invariant LLS algorithm, which uses and dynamically maintains improving estimates of the circuit ratio digraph, together with a refined potential function based analysis for LLS algorithms in general. With this analysis, we derive an improved $O(n^{2.5}\log n \log(\bar\chi^*_A + n))$ iteration bound for optimally solving (LP) using our algorithm. The same argument also yields a factor $n/\log n$ improvement on the iteration complexity bound of the original Vavasis-Ye algorithm.

∗ This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme: DD and SH from grant agreement no. 805241-QIP, BN and LAV from grant agreements no. 757481-ScaleOpt. A preliminary version of this paper has appeared in the proceedings of the 52nd Annual ACM Symposium on Theory of Computing (STOC) [DHNV20].

1 Introduction

The linear programming (LP) problem in primal-dual form is to solve

$$
\begin{array}{llcll}
\min & c^\top x & \qquad & \max & y^\top b \\
 & Ax = b & & & A^\top y + s = c \\
 & x \ge 0, & & & s \ge 0,
\end{array}
\tag{LP}
$$

where $A \in \mathbb{R}^{m\times n}$, $\operatorname{rank}(A) = m \le n$, $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$ are given in the input, and $x, s \in \mathbb{R}^n$, $y \in \mathbb{R}^m$ are the variables. We consider the program in $x$ to be the primal problem and the program in $(y, s)$ to be the dual problem.

Khachiyan [Kha79] used the ellipsoid method to give the first polynomial time LP algorithm in the bit-complexity model, that is, polynomial in the bit description length of $(A, b, c)$. An outstanding open question is the existence of a strongly polynomial algorithm for LP, listed by Smale as one of the most prominent mathematical challenges for the 21st century [Sma98]. Such an algorithm amounts to solving LP using $\operatorname{poly}(n, m)$ basic arithmetic operations in the real model of computation.¹ Known strongly polynomially solvable LP problem classes include: feasibility for two variable per inequality systems [Meg83], the minimum-cost circulation problem [Tar85], the maximum generalized flow problem [Vég17, OV17], and discounted Markov decision problems [Ye05, Ye11].

Towards this goal, the principal line of attack has been to develop LP algorithms whose running time is bounded in terms of natural condition measures. Such condition measures attempt to finely measure the “intrinsic complexity” of the LP. An important line of work in this area has been to parametrize LPs by the “niceness” of their solutions (e.g. the depth of the most interior point), where relevant examples include the Goffin measure [Gof80] for conic systems, Renegar’s distance to ill-posedness for general LPs [Ren94, Ren95], and bounded ratios between the nonzero entries in basic feasible solutions [Chu14, KM13].

Parametrizing by the constraint matrix. A second line of research, and the main focus of this work, concerns the complexity of the constraint matrix $A$. The first breakthrough in this area was given by Tardos [Tar86], who showed that if $A$ has integer entries and all square submatrices of $A$ have determinant at most $\Delta$ in absolute value, then (LP) can be solved in $\operatorname{poly}(n, m, \log \Delta)$ arithmetic operations, independent of the encoding length of the vectors $b$ and $c$. This is achieved by finding the exact solutions to $O(nm)$ rounded LPs derived from the original LP, with the right hand side vector and cost function being integers of absolute value bounded in terms of $n$ and $\Delta$. From $m$ such rounded problem instances, one can infer, via proximity results, that a constraint $x_i = 0$ must be valid for every optimal solution. The process continues by induction until the optimal primal face is identified.

¹ In the bit-complexity model, a further requirement is that the algorithm must be in PSPACE.

Path-following methods and the Vavasis–Ye algorithm. In a seminal work, Vavasis and Ye [VY96] introduced a new type of interior-point method that optimally solves (LP) within $O(n^{3.5}\log(\bar\chi_A + n))$ iterations, where the condition number $\bar\chi_A$ controls the size of solutions to certain linear systems related to the kernel of $A$ (see Section 2 for the formal definition).

Before detailing the Vavasis–Ye (henceforth VY) algorithm, we recall the basics of path following interior-point methods. If both the primal and dual problems in (LP) are strictly feasible, the central path for (LP) is the curve $((x(\mu), y(\mu), s(\mu)) : \mu > 0)$ defined by

$$
\begin{aligned}
x_i(\mu)\, s_i(\mu) &= \mu, \quad \forall i \in [n], \\
A x(\mu) &= b, \quad x(\mu) > 0, \\
A^\top y(\mu) + s(\mu) &= c, \quad s(\mu) > 0,
\end{aligned}
\tag{CP}
$$

which converges to complementary optimal primal and dual solutions $(x^*, y^*, s^*)$ as $\mu \to 0$, recalling that the optimality gap at time $\mu$ is exactly $x(\mu)^\top s(\mu) = n\mu$. We thus refer to $\mu$ as the normalized duality gap. Methods that “follow the path” generate iterates that stay in a certain neighborhood around it while trying to achieve rapid multiplicative progress w.r.t. $\mu$, where given $(x, y, s)$ ‘close’ to the path, we define the effective $\mu$ as $\mu(x, y, s) = \sum_{i=1}^n x_i s_i / n$. Given a target parameter $\mu'$ and starting point close to the path at parameter $\mu$, standard path following methods [Gon92] can compute a point at parameter below $\mu'$ in at most $O(\sqrt{n}\,\log(\mu/\mu'))$ iterations, and hence the quantity $\log(\mu/\mu')$ can be usefully interpreted as the length of the corresponding segment of the central path.
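The identity $x(\mu)^\top s(\mu) = n\mu$ for the optimality gap follows in one line from the feasibility conditions and the centrality condition in (CP). The short derivation below is a sketch added for convenience; it uses only the symbols already defined.

```latex
\begin{align*}
c^\top x(\mu) - b^\top y(\mu)
  &= \bigl(A^\top y(\mu) + s(\mu)\bigr)^\top x(\mu) - \bigl(A x(\mu)\bigr)^\top y(\mu) \\
  &= s(\mu)^\top x(\mu)
   = \sum_{i=1}^{n} x_i(\mu)\, s_i(\mu)
   = n\mu .
\end{align*}
```

The same computation with an arbitrary feasible pair $(x, y, s)$ gives the gap $x^\top s$, which is why $\sum_i x_i s_i / n$ is a natural definition of the effective $\mu$ near the path.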

Crossover events and layered least squares steps. At a very high level, Vavasis and Ye show that the central path can be decomposed into at most $\binom{n}{2}$ short but curved segments, possibly joined by long (a priori unbounded) but very straight segments. At the end of each curved segment, they show that a new ordering relation $x_i(\mu) > x_j(\mu)$, called a ‘crossover event’, is implicitly learned, where this relation did not hold at the start of the segment, but will hold at every point from the end of the segment onwards. These $\binom{n}{2}$ relations give a combinatorial way to measure progress along the central path. In contrast to Tardos’s algorithm, where the main progress is setting variables to zero explicitly, the variables participating in crossover events cannot be identified; only their existence is shown.

At a technical level, the VY-algorithm is a variant of the Mizuno–Todd–Ye [MTY93] predictor-corrector method (MTY P-C). In predictor-corrector methods, corrector steps bring an iterate closer to the path, i.e., improve centrality, and predictor steps “shoot down” the path, i.e., reduce $\mu$ without losing too much centrality. Vavasis and Ye’s main algorithmic innovation was the introduction of a new predictor step, called the ‘layered least squares’ (LLS) step, which crucially allowed them to cross each aforementioned “straight” segment of the central path in a single step, recalling that these straight segments may be arbitrarily long. To traverse the short and curved segments of the path, the standard predictor step, known as affine scaling (AS), in fact suffices.

To compute the LLS direction, the variables are decomposed into ‘layers’ $J_1 \cup J_2 \cup \ldots \cup J_p = [n]$. The goal of such a decomposition is to eventually learn a refinement of the optimal partition of the variables $B^* \cup N^* = [n]$, where $B^* := \{i \in [n] : x^*_i > 0\}$ and $N^* := \{i \in [n] : s^*_i > 0\}$ for the limit optimal solution $(x^*, y^*, s^*)$.

The primal affine scaling direction can be equivalently described by solving a weighted least squares problem in $\operatorname{Ker}(A)$, with respect to a weighting defined according to the current iterate. The primal LLS direction is obtained by solving a series of weighted least squares problems, starting with focusing only on the final layer $J_p$. This solution is gradually extended to the higher layers (i.e., layers with lower indices). The dual directions have analogous interpretations, with the solutions on the layers obtained in the opposite direction, starting with $J_1$. If we use the two-level layering $J_1 = B^*$, $J_2 = N^*$, and are sufficiently close to the limit $(x^*, y^*, s^*)$ of the central path, then the LLS step reaches an exact optimal solution in a single step. We note that standard AS steps generically never find an exact optimal solution, and thus some form of “LLS rounding” in the final iteration is always necessary to achieve finite termination with an exact optimal solution.

Of course, guessing $B^*$ and $N^*$ correctly is just as hard as solving (LP). Still, if we work with “good” layerings, these will reveal new information about the “optimal order” of the variables, where $B^*$ is placed on higher layers than $N^*$. The crossover events correspond to swapping two wrongly ordered variables into the correct ordering. Namely, a variable $i \in B^*$ and $j \in N^*$ are currently ordered on the same layer, or $j$ is in a higher layer than $i$. After the crossover event, $i$ will always be placed on a higher layer than $j$.

Computing good layerings and the $\bar\chi_A$ condition measure. Given the above discussion, the obvious question is how to come up with “good” layerings. The philosophy behind LLS can be stated as saying that if modifying a set of variables $x_I$ barely affects the variables in $x_{[n]\setminus I}$ (recalling that movement is constrained to $\Delta x \in \operatorname{Ker}(A)$), then one should optimize over $x_I$ without regard to the effect on $x_{[n]\setminus I}$; hence $x_I$ should be placed on lower layers.

VY’s strategy for computing such layerings was to directly use the size of the coordinates of the current iterate $x$ (where $(x, y, s)$ is a point near the central path). In particular, assuming $x_1 \ge x_2 \ge \ldots \ge x_n$, the layering $J_1 \cup J_2 \cup \ldots \cup J_p = [n]$ corresponds to consecutive intervals constructed in decreasing order of $x_i$ values. The break between $J_i$ and $J_{i+1}$ occurs if the gap $x_r/x_{r+1} > g$, where $r$ is the rightmost element of $J_i$ and $g > 0$ is a threshold parameter. Thus, the expectation is that if $x_i > g x_j$, then a small multiplicative change to $x_j$, subject to moving in $\operatorname{Ker}(A)$, should induce a small multiplicative change to $x_i$. By proximity to the central path, the dual ordering is reversed as mentioned above.

The threshold $g$ for which this was justified in VY was a function of the $\bar\chi_A$ condition measure. We now provide a convenient definition, which immediately yields this justification (see Proposition 2.4). Letting $W = \operatorname{Ker}(A)$ and $\pi_I(W) = \{x_I : x \in W\}$, we define $\bar\chi_A := \bar\chi_W$ as the minimum number $M \ge 1$ such that for any $\emptyset \ne I \subseteq [n]$ and $z \in \pi_I(W)$, there exists $y \in W$ with $y_I = z$ and $\|y\| \le M\|z\|$. Thus, a change of $\varepsilon$ in variables in $I$ can be lifted to a change of at most $\bar\chi_A \varepsilon$ in variables in $[n] \setminus I$. Crucially, $\bar\chi$ is a “self-dual” quantity. That is, $\bar\chi_W = \bar\chi_{W^\perp}$, where $W^\perp = \operatorname{range}(A^\top)$ is the movement subspace for the dual problem, justifying the reversed layering for the dual (see Section 2 for more details).
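The gap-based VY layering described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `vy_layering` is hypothetical, and the threshold `g` is passed in directly rather than derived from $\bar\chi_A$ as in [VY96].

```python
def vy_layering(x, g):
    """Split indices into layers J_1, ..., J_p by multiplicative gaps in x.

    x : positive coordinates of the current primal iterate.
    g : gap threshold (assumed given; in VY it depends on chi-bar).
    Returns a list of layers, each a list of 0-based indices, with the
    largest coordinates in the first layer.
    """
    # Sort indices by decreasing x value (stable for ties).
    order = sorted(range(len(x)), key=lambda i: -x[i])
    layers = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if x[prev] > g * x[cur]:
            # Gap x_r / x_{r+1} exceeds g: start a new layer.
            layers.append([cur])
        else:
            layers[-1].append(cur)
    return layers


print(vy_layering([100.0, 90.0, 1.0, 0.9, 0.001], g=10.0))
```

Because the rule compares raw coordinate sizes, rescaling the columns of $A$ (and hence the coordinates of $x$) can change the output, which is the lack of scaling invariance discussed next.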

The question of scale invariance and $\bar\chi^*_A$. While the VY layering procedure is powerful, its properties are somewhat mismatched with those of the central path. In particular, variable ordering information has no intrinsic meaning on the central path, as the path itself is scaling invariant. Namely, the central path point $(x(\mu), y(\mu), s(\mu))$ w.r.t. the problem instance $(A, b, c)$ is in bijective correspondence with the central path point $(D^{-1}x(\mu), y(\mu), Ds(\mu))$ w.r.t. the problem instance $(AD, b, Dc)$ for any positive diagonal matrix $D$. The standard path following algorithms are also scaling invariant in this sense.

This led Monteiro and Tsuchiya [MT03] to ask whether a scaling invariant LLS algorithm exists. They noted that any such algorithm would then depend on the potentially much smaller parameter

$$
\bar\chi^*_A := \inf_{D} \bar\chi_{AD}, \tag{1}
$$

where the infimum is taken over the set of $n \times n$ positive diagonal matrices. Thus, Monteiro and Tsuchiya’s question can be rephrased as whether there exists an exact LP algorithm with running time $\operatorname{poly}(n, m, \log \bar\chi^*_A)$.

Substantial progress on this question was made in the followup works [MT05, LMT09]. The paper [MT05] showed that the MTY predictor-corrector algorithm [MTY93] can get from $\mu^0 > 0$ to $\eta > 0$ on the central path in

$$
O\Bigl(n^{3.5} \log \bar\chi^*_A + \min\bigl\{ n^2 \log\log(\mu^0/\eta),\ \log(\mu^0/\eta) \bigr\}\Bigr)
$$

iterations. This is attained by showing that the standard AS steps are reasonably close to the LLS steps. This proximity can be used to show that the AS steps can traverse the “curved” parts of the central path in the same iteration complexity bound as the VY algorithm. Moreover, on the “straight” parts of the path, the rate of progress amplifies geometrically, thus attaining a log log convergence on these parts. Subsequently, [LMT09] developed an affine invariant trust region step, which traverses the full path in $O(n^{3.5} \log(\bar\chi^*_A + n))$ iterations. However, the running time of each iteration is weakly polynomial in $b$ and $c$. The question of developing an LP algorithm with complexity bound $\operatorname{poly}(n, m, \log \bar\chi^*_A)$ thus remained open.

A related open problem to the above is whether it is possible to compute a near-optimal rescaling $D$ for program (1). This would give an alternate pathway to the desired LP algorithm by simply preprocessing the matrix $A$. The related question of approximating $\bar\chi_A$ was already studied by Tunçel [Tun99], who showed NP-hardness for approximating $\bar\chi_A$ to within a $2^{\operatorname{poly}(\operatorname{rank}(A))}$ factor. Taken at face value, this may seem to suggest that approximating the rescaling $D$ should be hard.

A further open question is whether Vavasis and Ye’s cross-over analysis can be improved. Ye in [Ye06] showed that the iteration complexity can be reduced to $O(n^{2.5} \log(\bar\chi_A + n))$ for feasibility problems and further to $O(n^{1.5} \log(\bar\chi_A + n))$ for homogeneous systems, though the $O(n^{3.5} \log(\bar\chi_A + n))$ bound for optimization has not been improved since [VY96].
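The scaling invariance of the central path mentioned at the start of this discussion can be checked directly against (CP). The sketch below assumes the rescaled instance has constraint matrix $AD$, right-hand side $b$ and cost $Dc$, and that the rescaled point is $(D^{-1}x(\mu), y(\mu), Ds(\mu))$ with $y$ unchanged; these are the only choices for which all three conditions type-check.

```latex
\begin{align*}
(AD)\bigl(D^{-1}x(\mu)\bigr) &= A x(\mu) = b, \\
(AD)^\top y(\mu) + D s(\mu) &= D\bigl(A^\top y(\mu) + s(\mu)\bigr) = Dc, \\
\bigl(D^{-1}x(\mu)\bigr)_i \bigl(D s(\mu)\bigr)_i &= x_i(\mu)\, s_i(\mu) = \mu
  \qquad \forall i \in [n].
\end{align*}
```

Positivity of $D^{-1}x(\mu)$ and $Ds(\mu)$ is immediate since $D$ is a positive diagonal matrix, so the rescaled point lies on the central path of the rescaled instance at the same parameter $\mu$.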

1.1 Our contributions

In this work, we resolve all of the above questions in the affirmative. We detail our contributions below.

1. Finding an approximately optimal rescaling. As our first contribution, we give an $O(m^2n^2 + n^3)$ time algorithm that works on the linear matroid of $A$ to compute a diagonal rescaling matrix $D$ which achieves $\bar\chi_{AD} \le n(\bar\chi^*_A)^3$, given any $m \times n$ matrix $A$. Furthermore, this same algorithm allows us to approximate $\bar\chi_A$ to within a factor $n(\bar\chi^*_A)^2$. The algorithm bypasses Tunçel’s hardness result by allowing the approximation factor to depend on $A$ itself, namely on $\bar\chi^*_A$. This gives a simple first answer to Monteiro and Tsuchiya’s question: by applying the Vavasis-Ye algorithm directly on the preprocessed $A$ matrix, we may solve any LP with constraint matrix $A$ using $O(n^{3.5} \log(\bar\chi^*_A + n))$ iterations. Note that the approximation factor $n(\bar\chi^*_A)^2$ increases the runtime only by a constant factor.

To achieve this result, we work with the circuits of $A$, where a circuit $C \subseteq [n]$ corresponds to an inclusion-wise minimal set of linearly dependent columns. With each circuit, we can associate a vector $g^C \in \operatorname{Ker}(A)$ with $\operatorname{supp}(g^C) = C$ that is unique up to scaling. By the ‘circuit ratio’ $\kappa_{ij}$ associated with the pair of nodes $(i, j)$, we mean the largest ratio $|g^C_j / g^C_i|$ taken over every circuit $C$ of $A$ such that $i, j \in C$. As our first observation, we show that the maximum of all circuit ratios, which we call the ‘circuit imbalance measure’, in fact characterizes $\bar\chi_A$ up to a factor $n$. This measure was first studied by Vavasis [Vav94], who showed that it lower bounds $\bar\chi_A$, though, as far as we are aware, our upper bound is new. The circuit ratios of each pair $(i, j)$ induce a weighted directed graph we call the ‘circuit ratio digraph’ of $A$. From here, our main result is that $\bar\chi^*_A$ is up to a factor $n$ equal to the maximum geometric mean cycle in the circuit ratio digraph. Our algorithm populates the circuit ratio digraph with approximations of the $\kappa_{ij}$ ratios for each $i, j \in [n]$ using standard techniques from matroid theory, and then computes a rescaling by solving the dual of the maximum geometric mean ratio cycle on the ‘approximate circuit ratio digraph’.
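To make the definitions of circuits and circuit ratios concrete, the following sketch enumerates all circuits of a tiny matrix by brute force and tabulates $\kappa_{ij}$. It is an exponential-time illustration only and is not the paper's matroid-based algorithm; the function names `circuits` and `circuit_ratios` and the tolerance `tol` are assumptions made for this example.

```python
import itertools
import numpy as np


def circuits(A, tol=1e-9):
    """Brute-force list of (C, g): C a circuit of A, g its kernel vector on C."""
    m, n = A.shape
    found = []
    # A circuit has at most rank(A) + 1 <= m + 1 elements.
    for size in range(1, min(n, m + 1) + 1):
        for C in itertools.combinations(range(n), size):
            sub = A[:, list(C)]
            # Dependent with a one-dimensional kernel ...
            if np.linalg.matrix_rank(sub, tol=tol) != size - 1:
                continue
            g = np.linalg.svd(sub)[2][-1]  # spans Ker(sub)
            # ... and minimal iff the kernel vector has full support on C.
            if np.all(np.abs(g) > tol):
                found.append((C, g))
    return found


def circuit_ratios(A, tol=1e-9):
    """kappa[i, j] = max |g_j / g_i| over circuits containing i and j (0 if none)."""
    n = A.shape[1]
    kappa = np.zeros((n, n))
    for C, g in circuits(A, tol):
        for a, i in enumerate(C):
            for b, j in enumerate(C):
                if i != j:
                    kappa[i, j] = max(kappa[i, j], abs(g[b] / g[a]))
    return kappa


A = np.array([[1.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0, 2.0]])
kappa = circuit_ratios(A)
print(np.round(kappa, 6))
print("circuit imbalance:", round(float(kappa.max()), 6))
```

In this example every three columns form a circuit, and the entries of `kappa` are the edge weights of the circuit ratio digraph; its largest entry is the circuit imbalance measure.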

2. Scaling invariant LLS algorithm. While the above yields an LP algorithm with $\operatorname{poly}(n, m, \log \bar\chi^*_A)$ running time, it does not satisfactorily address Monteiro and Tsuchiya’s question for a scaling invariant algorithm. As our second contribution, we use the circuit ratio digraph directly to give a natural scaling invariant LLS layering algorithm together with a scaling invariant crossover analysis.

At a conceptual level, we show that the circuit ratios give a scale invariant way to measure whether ‘$x_i > x_j$’ and enable a natural layering algorithm. Assume for now that the circuit imbalance value $\kappa_{ij}$ is known for every pair $(i, j)$. Given the circuit ratio graph induced by the $\kappa_{ij}$’s and given a primal point $x$ near the path, our layering algorithm can be described as follows. We first rescale the variables so that $x$ becomes the all ones vector, which rescales $\kappa_{ij}$ to $\kappa_{ij} x_i / x_j$. We then restrict the graph to its edges of length $\kappa_{ij} x_i / x_j \ge 1/\operatorname{poly}(n)$ (the long edges of the rescaled circuit ratio graph) and let the layering $J_1 \cup J_2 \cup \ldots \cup J_p$ be a topological ordering of its strongly connected components (SCC) with edges going from left to right. Intuitively, variables that “affect each other” should be in the same layer, which motivates the SCC definition.

We note that our layering algorithm does not in fact have access to the true circuit ratios $\kappa_{ij}$, as these are NP-hard to compute. Getting a good enough initial estimate for our purposes however is easy: we let $\hat\kappa_{ij}$ be the ratio corresponding to an arbitrary circuit containing $i$ and $j$. This already turns out to be within a factor $(\bar\chi^*_A)^2$ from the true value $\kappa_{ij}$, which we recall is the maximum over all such circuits. Our layering algorithm in fact learns better circuit ratio estimates if the “lifting costs” of our SCC layering, i.e., how much it costs to lift changes from lower layer variables to higher layers (as in the definition of $\bar\chi_A$), are larger than we expected them to be based on the previous estimates.

We develop a scaling-invariant analogue of cross-over events as follows. Before the crossover event, $\operatorname{poly}(n)(\bar\chi^*_A)^n > \kappa_{ij} x_i / x_j$, and after the crossover event, $\operatorname{poly}(n)(\bar\chi^*_A)^n < \kappa_{ij} x_i / x_j$ for all further central path points. Our analysis relies on $\bar\chi^*_A$ in only a minimalistic way, and does not require an estimate on the value of $\bar\chi^*_A$. Namely, it is only used to show that if $i, j \in J_q$, for a layer $q \in [p]$, then the rescaled circuit ratio $\kappa_{ij} x_i / x_j$ is in the range $(\operatorname{poly}(n)\bar\chi^*_A)^{\pm O(|J_q|)}$. The argument to show this crucially utilizes the maximum geometric mean cycle characterization. Furthermore, unlike prior analyses [VY96, MT03], our definition of a “good” layering (i.e., ‘balanced’ layerings, see Section 3.5) is completely independent of $\bar\chi^*_A$.

3. Improved potential analysis. As our third contribution, we improve the Vavasis–Ye crossover analysis using a new and simple potential function based approach. When applied to our new LLS algorithm, we derive an $O(n^{2.5} \log n \log(\bar\chi^*_A + n))$ iteration bound for path following, improving the polynomial term by an $\Omega(n / \log n)$ factor compared to the VY analysis.

Our potential function can be seen as a fine-grained version of the crossover events as described above. In case of such a crossover event, it is guaranteed that in every subsequent iteration, $i$ is in a layer before $j$. We analyze less radical changes instead: an “event” parametrized by $\tau$ means that $i$ and $j$ are currently together on a layer of size $\le \tau$, and after the event, $i$ is on a layer before $j$, or if they are together on the same layer, then this layer must have size $\ge 2\tau$. For every LLS step, we can find a parameter $\tau$ such that an event of this type happens concurrently for at least $\tau - 1$ pairs within the next $O(\sqrt{n}\,\tau \log(\bar\chi^*_A + n))$ iterations.

Our improved analysis is also applicable to the original VY-algorithm. Let us now comment on the relation between the VY-algorithm and our new algorithm. The VY-algorithm starts a new layer once $x_{\pi(i)} > g x_{\pi(i+1)}$ between two consecutive variables, where the permutation $\pi$ is a non-increasing order of the $x_i$ variables, and $g = \operatorname{poly}(n)\bar\chi$. Setting the initial ‘estimates’ $\hat\kappa_{ij} = \bar\chi$ for a suitable polynomial, our algorithm runs the same way as the VY algorithm. Using these estimates, the layering procedure becomes much simpler: there is no need to verify ‘balancedness’ as in our algorithm.

However, using estimates $\hat\kappa_{ij} = \bar\chi$ has drawbacks. Most importantly, it does not give a lower bound on the true circuit ratio $\kappa_{ij}$; to the contrary, $g$ will be an upper bound. In effect, this causes VY’s layers to be “much larger” than ours, and for this reason, the connection to $\bar\chi^*_A$ is lost. Nevertheless, our potential function analysis can still be adapted to the VY-algorithm to obtain the same $\Omega(n / \log n)$ improvement on the iteration complexity bound; see Section 4.1 for more details.
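The SCC-based layering from the second contribution above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's full procedure: the circuit ratio estimates `kappa` are taken as given, `threshold` is a hypothetical stand-in for the $1/\operatorname{poly}(n)$ cutoff, and the balancedness checks and estimate updates are omitted. Strongly connected components are found with Tarjan's algorithm.

```python
def scc_layering(kappa, x, threshold):
    """Layers = SCCs of the 'long edge' graph, in topological order.

    kappa[i][j] : estimate of the circuit ratio for the pair (i, j)
                  (assumed given; 0 if no circuit contains both).
    x           : positive primal point near the central path.
    threshold   : keep edge (i, j) iff kappa[i][j] * x[i] / x[j] >= threshold.
    """
    n = len(x)
    adj = [[j for j in range(n)
            if j != i and kappa[i][j] * x[i] / x[j] >= threshold]
           for i in range(n)]

    index, low = {}, {}
    stack, on_stack, comps = [], set(), []

    def visit(v):  # recursive Tarjan; fine for small n
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in adj[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            comps.append(sorted(comp))

    for v in range(n):
        if v not in index:
            visit(v)
    # Tarjan emits sink components first; reverse so edges go left to right.
    return comps[::-1]


ones = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(scc_layering(ones, [100.0, 90.0, 1.0], threshold=0.5))
```

Since the edge weights $\kappa_{ij} x_i / x_j$ are unchanged when the columns of $A$ and the coordinates of $x$ are rescaled together, the resulting layers do not depend on the scaling, in contrast to the gap-based VY rule.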

1.2 Related work

Since the seminal works of Karmarkar [Kar84] and Renegar [Ren88], there has been a tremendous amount of work on speeding up and improving interior-point methods. In contrast to the present work, the focus of these works has mostly been to improve the complexity of approximately solving LPs. Progress has taken many forms, such as the development of novel barrier methods, such as Vaidya’s volumetric barrier [Vai89], the recent entropic barrier of Bubeck and Eldan [BE14], and the weighted log-barrier of Lee and Sidford [LS14, LS19], together with new path following techniques, such as the predictor-corrector framework [Meh92, MTY93], as well as advances in fast linear system solving [ST04, LS15]. For this last line, there has been substantial progress in improving IPM by amortizing the cost of the iterative updates, and working with approximate computations, see e.g. [Ren88, Vai89] for classical results. Recently, Cohen, Lee and Song [CLS19] developed a new inverse maintenance scheme to get a randomized $\tilde O(n^{\omega} \log(1/\varepsilon))$-time algorithm for $\varepsilon$-approximate LP, which was derandomized by van den Brand [vdB20]; here $\omega \approx 2.37$ is the matrix multiplication exponent. A very recent result by van den Brand et al. [vdBTLSS20] obtained a randomized $\tilde O(nm + m^3)$ algorithm. For special classes of LP such as network flow and matching problems, even faster algorithms have been obtained using, among other techniques, fast Laplacian solvers, see e.g. [DS08, Mad13, vdBLN+20, vdBLL+21]. Given the progress above, we believe it to be an interesting problem to understand to what extent these new numerical techniques can be applied to speed up LLS computations, though we expect that such computations will require very high precision. We note that no attempt has been made in the present work to optimize the complexity of the linear algebra.

Subsequent to this work, [DNV20] extended Tardos’s framework to the real model of computation, showing that $\operatorname{poly}(n, m, \log \bar\chi_A)$ running time can be achieved using approximate solvers in a black box manner. Combined with [vdB20], one obtains a deterministic $O(mn^{\omega+1} \log(n) \log(\bar\chi_A + n))$ LP algorithm; using the initial rescaling subroutine from this paper, the dependence can be improved to $\bar\chi^*_A$; together with the preprocessing time, this amounts to $O(m^2n^2 + mn^{\omega+1} \log(n) \log(\bar\chi^*_A + n))$. A weaker extension of Tardos’s framework to the real model of computation was previously given by Ho and Tunçel [HT02].

With regard to LLS algorithms, the original VY-algorithm required explicit knowledge of $\bar\chi_A$ to implement their layering algorithm. The paper [MMT98] showed that this could be avoided by computing all LLS steps associated with $n$ candidate partitions and picking the best one. In particular, they showed that all such LLS steps can be computed in $O(m^2 n)$ time. In [MT03], an alternate approach was presented to compute an LLS partition directly from the coefficients of the AS step. We note that these methods crucially rely on the variable ordering, and hence are not scaling invariant. Kitahara and Tsuchiya [KT13] gave a 2-layer LLS step which achieves a running time depending only on $\bar\chi^*_A$ and the right-hand side $b$, but with no dependence on the objective, assuming the primal feasible region is bounded.

A series of papers have studied the central path from a differential geometry perspective. Monteiro and Tsuchiya [MT08] showed that a curvature integral of the central path, first introduced by Sonnevend, Stoer, and Zhao [SSZ91], is in fact upper bounded by $O(n^{3.5} \log(\bar\chi^*_A + n))$. This has been extended to SDP and symmetric cone programming [KOT14], and also studied in the context of information geometry [KOT13].

Circuits have appeared in several papers on linear and integer optimization (see [DLKS19] and references within). The idea of using circuits within the context of LP algorithms also appears in [DLHL15]. They develop an augmentation framework for LP (as well as ILP) and show that a simplex-like algorithm which takes steps according to the “best circuit” direction achieves linear convergence, though these steps are hard to compute.

Our algorithm makes progress towards strongly polynomial solvability of LP, by improving the dependence $\operatorname{poly}(n, m, \log \bar\chi_A)$ to $\operatorname{poly}(n, m, \log \bar\chi^*_A)$. However, in a remarkable recent paper, Allamigeon et al. [ABGJ18] have shown, using tools from tropical geometry, that path-following methods for the standard logarithmic barrier cannot be strongly polynomial. In particular, they give a parametrized family of instances, where, for sufficiently large parameter values, any sequence of iterations following the central path must be of exponential length; thus, $\bar\chi^*_A$ will be doubly exponential.

1.3 Organization

The rest of the paper is organized as follows. We conclude this section by introducing some notation. Section 2 discusses our results on the circuit imbalance measure. It starts with Section 2.1 on the necessary background on the condition measures $\bar\chi_A$ and $\bar\chi^*_A$. Section 2.2 introduces the circuit imbalance measure, and formulates and explains all main results of Section 2. The proofs are given in the rest of the sections: basic properties in Section 2.3, the min-max characterization in Section 2.4, the circuit finding algorithm in Section 2.5, and the algorithms for approximating $\bar\chi^*_A$ and $\bar\chi_A$ in Section 2.6.

In Section 3, we develop our scaling invariant interior-point method. Interior-point preliminaries are given in Section 3.1. Section 3.2 introduces the affine scaling and layered-least-squares directions, and proves some basic properties. Section 3.3 provides a detailed overview of the high level ideas and a roadmap to the analysis. Section 3.4 further develops the theory of LLS directions and introduces partition lifting scores. Section 3.5 gives our scaling invariant layering procedure, and our overall algorithm is given in Section 3.6.

In Section 4, we give the potential function proof for the improved iteration bound, relying on technical lemmas. The full proof of these lemmas is deferred to Section 6; however, Section 4 provides the high-level ideas to each proof. Section 4.1 shows that our argument also leads to a factor $\Omega(n / \log n)$ improvement in the iteration complexity bound of the VY-algorithm. In Section 5, we prove the technical properties of our LLS step, including its proximity to AS and step length estimates. Finally, in Section 7, we discuss the initialization of our interior-point method.

Besides reading the paper linearly, we suggest two other possible strategies for navigating the paper. Readers mainly interested in the circuit imbalance measure and its approximation may focus only on Section 2; this part can be understood without any familiarity with interior point methods. Other readers, who wish to focus mainly on our interior point algorithm, may read Section 2 only up to Section 2.2; this includes all concepts and statements necessary for the algorithm.

1.4 Notation

Our notation will largely follow [MT03, MT05]. We let $\mathbb{R}_{++}$ denote the set of positive reals, and $\mathbb{R}_{+}$ the set of nonnegative reals. For $n \in \mathbb{N}$, we let $[n] = \{1, 2, \ldots, n\}$. Let $e^i \in \mathbb{R}^n$ denote the $i$th unit vector, and $e \in \mathbb{R}^n$ the all 1s vector. For a vector $x \in \mathbb{R}^n$, we let $\operatorname{Diag}(x) \in \mathbb{R}^{n\times n}$ denote the diagonal matrix with $x$ on the diagonal. We let $\mathbf{D}$ denote the set of all positive $n \times n$ diagonal matrices. For $x, y \in \mathbb{R}^n$, we use the notation $xy \in \mathbb{R}^n$ to denote $xy = \operatorname{Diag}(x)y = (x_i y_i)_{i\in[n]}$. The inner product of the two vectors is denoted as $x^\top y$. For $p \in \mathbb{Q}$, we also use the notation $x^p$ to denote the vector $(x_i^p)_{i\in[n]}$. Similarly, for $x, y \in \mathbb{R}^n$, we let $x/y$ denote the vector $(x_i/y_i)_{i\in[n]}$. We denote the support of a vector $x \in \mathbb{R}^n$ by $\operatorname{supp}(x) = \{i \in [n] : x_i \ne 0\}$.

For an index subset $I \subseteq [n]$, we use $\pi_I : \mathbb{R}^n \to \mathbb{R}^I$ for the coordinate projection. That is, $\pi_I(x) = x_I$, and for a subset $S \subseteq \mathbb{R}^n$, $\pi_I(S) = \{x_I : x \in S\}$. We let $\mathbb{R}^n_I = \{x \in \mathbb{R}^n : x_{[n]\setminus I} = 0\}$.

For a matrix $B \in \mathbb{R}^{n\times k}$, $I \subset [n]$ and $J \subset [k]$ we let $B_{I,J}$ denote the submatrix of $B$ restricted to the set of rows in $I$ and columns in $J$. We also use $B_{I,\bullet} = B_{I,[k]}$ and $B_J = B_{\bullet,J} = B_{[n],J}$. We let $B^\dagger \in \mathbb{R}^{k\times n}$ denote the pseudo-inverse of $B$.

We let $\operatorname{Ker}(A) \subseteq \mathbb{R}^n$ denote the kernel of the matrix $A \in \mathbb{R}^{m\times n}$. Throughout, we assume that the matrix $A$ in (LP) has full row rank, and that $n \ge 3$.

We use the real model of computation, allowing basic arithmetic operations $+$, $-$, $\times$, $/$, comparisons, and square root computations. Exact square root computations could be omitted by using approximate square roots; we assume exact computations for simplicity.

Subspace formulation. Throughout the paper, we let $W = \mathrm{Ker}(A) \subseteq \mathbb{R}^n$ denote the kernel of the matrix $A$. Using this notation, (LP) can be written in the form

$$\begin{aligned} \min\ & c^\top x & \qquad \max\ & d^\top (c - s) \\ & x \in W + d & & s \in W^\perp + c \\ & x \geq 0, & & s \geq 0, \end{aligned} \tag{2}$$

where $d \in \mathbb{R}^n$ satisfies $Ad = b$.

2 Finding an approximately optimal rescaling

2.1 The condition number $\bar\chi$

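The condition number $\bar\chi_A$ is defined as

$$\bar\chi_A = \sup\left\{ \left\| A^\top (A D A^\top)^{-1} A D \right\| : D \in \mathbf{D} \right\} = \sup\left\{ \frac{\|A^\top y\|}{\|p\|} : y \text{ minimizes } \left\| D^{1/2}(A^\top y - p) \right\| \text{ for some } 0 \neq p \in \mathbb{R}^n \text{ and } D \in \mathbf{D} \right\}. \tag{3}$$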

This condition number was first studied by Dikin [Dik67], Stewart [Ste89], and Todd [Tod90], among others, and plays a key role in the analysis of the Vavasis–Ye interior point method [VY96]. There is an extensive literature on the properties and applications of $\bar\chi_A$, as well as its relations to other condition numbers. We refer the reader to the papers [HT02, MT03, VY96] for further results and references.

It is important to note that $\bar\chi_A$ only depends on the subspace $W = \mathrm{Ker}(A)$. Hence, we can also write $\bar\chi_W$ for a subspace $W \subseteq \mathbb{R}^n$, defined to be equal to $\bar\chi_A$ for some matrix $A \in \mathbb{R}^{k \times n}$ with $W = \mathrm{Ker}(A)$. We will use the notations $\bar\chi_A$ and $\bar\chi_W$ interchangeably. The next proposition summarizes some important known properties of $\bar\chi_A$.

Proposition 2.1. Let $A \in \mathbb{R}^{m \times n}$ with full row rank and $W = \mathrm{Ker}(A)$.

(i) If the entries of $A$ are all integers, then $\bar\chi_A$ is bounded by $2^{O(L_A)}$, where $L_A$ is the input bit length of $A$.

(ii) $\bar\chi_A = \max\{\|B^{-1}A\| : B \text{ non-singular } m \times m\text{-submatrix of } A\}$.

(iii) Let the columns of $B \in \mathbb{R}^{n \times (n-m)}$ form an orthonormal basis of $W$. Then

$$\bar\chi_W = \max\left\{ \left\| B B_{I,\bullet}^\dagger \right\| : \emptyset \neq I \subset [n] \right\}.$$

(iv) $\bar\chi_W = \bar\chi_{W^\perp}$.

Proof. Part (i) was proved in [VY96, Lemma 24]. For part (ii), see [TTY01, Theorem 1] and [VY96, Lemma 3]. In part (iii), the direction $\geq$ was proved in [Ste89], and the direction $\leq$ in [O'L90]. The duality statement (iv) was shown in [GL97].

In Proposition 3.8, we will also give another proof of (iv). We now define the lifting map, a key operation in this paper, and explain its connection to $\bar\chi_A$.

Definition 2.2. Let us define the lifting map $L_I^W : \pi_I(W) \to W$ by

$$L_I^W(p) = \arg\min\{\|z\| : z_I = p,\ z \in W\}.$$

Note that $L_I^W$ is the unique linear map from $\pi_I(W)$ to $W$ such that $(L_I^W(p))_I = p$ and $L_I^W(p)$ is orthogonal to $W \cap \mathbb{R}^n_{[n] \setminus I}$.

Lemma 2.3. Let $W \subseteq \mathbb{R}^n$ be an $(n-m)$-dimensional linear subspace. Let the columns of $B \in \mathbb{R}^{n \times (n-m)}$ denote an orthonormal basis of $W$. Then, viewing $L_I^W$ as a matrix in $\mathbb{R}^{n \times |I|}$,

$$L_I^W = B B_{I,\bullet}^\dagger.$$

Proof. If $p \in \pi_I(W)$, then $p = B_{I,\bullet}\, y$ for some $y \in \mathbb{R}^{n-m}$. By the well-known property of the pseudo-inverse we get $B_{I,\bullet}^\dagger p = \arg\min_{p = B_{I,\bullet} y} \|y\|$. This solution satisfies $\pi_I(B B_{I,\bullet}^\dagger p) = p$ and $B B_{I,\bullet}^\dagger p \in W$. Since the columns of $B$ form an orthonormal basis of $W$, we have $\|B B_{I,\bullet}^\dagger p\| = \|B_{I,\bullet}^\dagger p\|$. Consequently, $B B_{I,\bullet}^\dagger p$ is the minimum-norm point with the above properties.

The above lemma and Proposition 2.1(iii) yield the following characterization. This will be the most suitable characterization of $\bar\chi_W$ for our purposes.

Proposition 2.4. For a linear subspace $W \subseteq \mathbb{R}^n$,

$$\bar\chi_W = \max\left\{ \|L_I^W\| : I \subseteq [n],\ I \neq \emptyset \right\}.$$

The following notation will be convenient for our algorithm. For a subspace $W \subseteq \mathbb{R}^n$ and an index set $I \subseteq [n]$, if $\pi_I(W) \neq \{0\}$ then we define the lifting score

$$\ell^W(I) := \sqrt{\|L_I^W\|^2 - 1}. \tag{4}$$

Otherwise, we define $\ell^W(I) = 0$. This means that for any $z \in \pi_I(W)$ and $x = L_I^W(z)$, $\|x_{[n] \setminus I}\| \leq \ell^W(I)\, \|z\|$.
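As a small illustration of the lifting map and the lifting score (a worked example added for orientation, not taken from the original text), consider the line $W = \mathrm{span}\{(1, t)\} \subseteq \mathbb{R}^2$ with $t \neq 0$, and $I = \{1\}$. Every $z \in W$ with $z_1 = p$ equals $(p, tp)$, so the minimizer in the definition of the lifting map is forced:

$$L^W_{\{1\}}(p) = (p,\, tp), \qquad \|L^W_{\{1\}}\| = \sqrt{1 + t^2}, \qquad \ell^W(\{1\}) = \sqrt{(1 + t^2) - 1} = |t|.$$

By the same computation with the roles of the coordinates exchanged, $\ell^W(\{2\}) = 1/|t|$, and for $I = \{1, 2\}$ the lifting map is the identity on $W$, with norm 1. Taking the maximum over the three index sets gives

$$\max_{\emptyset \neq I \subseteq \{1,2\}} \|L_I^W\| = \sqrt{1 + \max\{t^2,\, t^{-2}\}}.$$

The bound $\|x_{[n] \setminus I}\| \leq \ell^W(I)\|z\|$ holds here with equality, since $|tp| = |t| \cdot |p|$.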

The condition number $\bar\chi^*$. For every $D \in \mathbf{D}$, we can consider the condition number $\bar\chi_{WD} = \bar\chi_{AD^{-1}}$. We let

$$\bar\chi^*_W = \bar\chi^*_A = \inf\{\bar\chi_{WD} : D \in \mathbf{D}\}$$

denote the best possible value of $\bar\chi$ that can be attained by rescaling the coordinates of $W$. The main result of this section is the following theorem.

Theorem 2.5 (Proof in Section 2.6). There is an $O(n^2 m^2 + n^3)$ time algorithm that for any matrix $A \in \mathbb{R}^{m \times n}$ computes an estimate $\xi$ of $\bar\chi_W$ such that

$$\xi \leq \bar\chi_W \leq n (\bar\chi^*_W)^2\, \xi$$

and a $D \in \mathbf{D}$ such that

$$\bar\chi^*_W \leq \bar\chi_{WD} \leq n (\bar\chi^*_W)^3.$$

2.2 The circuit imbalance measure

The key tool in proving Theorem 2.5 is to study a more combinatorial condition number, the circuit imbalance measure, which turns out to give a good proxy to $\bar\chi_A$.

Definition 2.6. For a linear subspace $W \subseteq \mathbb{R}^n$ and a matrix $A$ such that $W = \mathrm{Ker}(A)$, a circuit is an inclusion-wise minimal dependent set of columns of $A$. Equivalently, a circuit is a set $C \subseteq [n]$ such that $W \cap \mathbb{R}^n_C$ is one-dimensional and that no strict subset of $C$ has this property. The set of circuits of $W$ is denoted $\mathcal{C}_W$.

Note that circuits defined above are the same as the circuits in the linear matroid associated with $A$. Every circuit $C \in \mathcal{C}_W$ can be associated with a vector $g^C \in W$ such that $\mathrm{supp}(g^C) = C$; this vector is unique up to scalar multiplication.

Definition 2.7. For a circuit $C \in \mathcal{C}_W$ and $i, j \in C$, we let

$$\kappa^W_{ij}(C) = \frac{|g^C_j|}{|g^C_i|}. \tag{5}$$

Note that since $g^C$ is unique up to scalar multiplication, this is independent of the choice of $g^C$. For any $i, j \in [n]$, we define the circuit ratio as the maximum of $\kappa^W_{ij}(C)$ over all choices of the circuit $C$:

$$\kappa^W_{ij} = \max\left\{ \kappa^W_{ij}(C) : C \in \mathcal{C}_W,\ i, j \in C \right\}. \tag{6}$$

By convention we set $\kappa^W_{ij} = 0$ if there is no circuit supporting $i$ and $j$. Further, we define the circuit imbalance measure as

$$\kappa_W = \max\left\{ \kappa^W_{ij} : i, j \in [n] \right\}.$$

Minimizing over all coordinate rescalings, we define

$$\kappa^*_W = \min\{\kappa_{WD} : D \in \mathbf{D}\}.$$

We omit the index $W$ whenever it is clear from context. Further, for a vector $d \in \mathbb{R}^n_{++}$, we write $\kappa^d_{ij} = \kappa^{\mathrm{Diag}(d)W}_{ij}$ and $\kappa^d = \kappa^d_W = \kappa_{\mathrm{Diag}(d)W}$.
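The definitions of circuits, circuit ratios and the circuit imbalance measure can be checked by brute force on tiny instances. The following is a minimal sketch, assuming $W = \mathrm{Ker}(A)$ is given by a small integer matrix and using exact rational arithmetic; the function names `kernel_basis` and `circuit_ratios` are hypothetical helpers introduced here, indices are 0-based, and the enumeration over all column subsets is exponential in $n$, so it is an illustration and not the method of this paper.

```python
from fractions import Fraction
from itertools import combinations


def kernel_basis(rows, cols):
    """Exact basis of {x : A[:, cols] x = 0} via Gauss-Jordan elimination."""
    M = [[Fraction(row[c]) for c in cols] for row in rows]
    k, pivots, r = len(cols), [], 0
    for c in range(k):
        if r == len(M):
            break  # all rows used: remaining columns are free
        p = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if p is None:
            continue  # no pivot in this column: it is a free column
        M[r], M[p] = M[p], M[r]
        piv = M[r][c]
        M[r] = [v / piv for v in M[r]]
        for i in range(len(M)):
            if i != r and M[i][c] != 0:
                f = M[i][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    basis = []
    for f in (c for c in range(k) if c not in pivots):
        x = [Fraction(0)] * k
        x[f] = Fraction(1)
        for row, pc in zip(M, pivots):
            x[pc] = -row[f]
        basis.append(x)
    return basis


def circuit_ratios(rows):
    """Matrix (kappa_ij) of circuit ratios of W = Ker(A), with A given by rows."""
    n = len(rows[0])
    kappa = [[Fraction(0)] * n for _ in range(n)]
    for size in range(1, n + 1):
        for C in combinations(range(n), size):
            basis = kernel_basis(rows, C)
            # C is a circuit iff the kernel restricted to C is one-dimensional
            # and spanned by a vector with full support on C.
            if len(basis) == 1 and all(v != 0 for v in basis[0]):
                g = basis[0]
                for a, i in enumerate(C):
                    for b, j in enumerate(C):
                        kappa[i][j] = max(kappa[i][j], abs(g[b] / g[a]))
    return kappa


# W = span((0,1,1,5), (1,0,5,1)) written as Ker(A); both generators satisfy A x = 0.
A = [[-5, -1, 1, 0],
     [-1, -5, 0, 1]]
kappa = circuit_ratios(A)
kappa_W = max(max(row) for row in kappa)
```

In this instance the four circuits are the supports of $(0,1,1,5)$, $(1,0,5,1)$, $(1,-5,0,-24)$ and $(-5,1,-24,0)$, so the ratio between the third and fourth coordinates is 5 in both directions, while the largest ratio overall is 24.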

We want to remark that a priori it is not clear that $\kappa^*_W$ is well-defined. Theorem 2.12 will show that the minimum of $\{\kappa_{WD} : D \in \mathbf{D}\}$ is indeed attained.

We next formulate the main statements on the circuit imbalance measure; proofs will be given in the subsequent subsections. Crucially, we show that the circuit imbalance $\kappa_W$ is a good proxy to the condition number $\bar\chi_W$. The lower bound was already proven in [Vav94], and the upper bound is from [DNV20]. A slightly weaker upper bound $\sqrt{1 + (n\kappa_W)^2}$ was previously given in the conference version of this paper [DHNV20].

Theorem 2.8 (Proof in Section 2.3). For a linear subspace $W \subseteq \mathbb{R}^n$,

$$\sqrt{1 + (\kappa_W)^2} \leq \bar\chi_W \leq n \kappa_W.$$

We now overview some basic properties of $\kappa_W$. Proposition 2.4 asserts that $\bar\chi_W$ is the maximum $\ell_2 \to \ell_2$ operator norm of the mappings $L_I^W$ over $I \subseteq [n]$. In [DNV20], it was shown that $\kappa_W$ is in contrast the maximum $\ell_1 \to \ell_\infty$ operator norm of the same mappings; this easily implies the upper bound $\bar\chi_W \leq n \kappa_W$.

Proposition 2.9 ([DNV20]). For a linear subspace $W \subseteq \mathbb{R}^n$,

$$\kappa_W = \max\left\{ \frac{\|L_I^W(p)\|_\infty}{\|p\|_1} : I \subseteq [n],\ I \neq \emptyset,\ p \in \pi_I(W) \setminus \{0\} \right\}.$$

Similarly to $\bar\chi_W$, $\kappa_W$ is self-dual; moreover, this holds for all individual $\kappa^W_{ij}$ values.

Lemma 2.10 (Proof in Section 2.3). For any subspace $W \subseteq \mathbb{R}^n$ and $i, j \in [n]$, $\kappa^W_{ij} = \kappa^{W^\perp}_{ji}$.

The next lemma provides a subroutine that efficiently yields upper bounds on $\ell^W(I)$ or lower bounds on some circuit imbalance values. Recall the definition of the lifting score $\ell^W(I)$ from (4).

Lemma 2.11 (Proof in Section 2.3). There exists a subroutine Verify-Lift($W, I, \theta$) that, given a linear subspace $W \subseteq \mathbb{R}^n$, an index set $I \subseteq [n]$, and a threshold $\theta \in \mathbb{R}_{++}$, either returns the answer ‘pass’, verifying $\ell^W(I) \leq \theta$, or returns the answer ‘fail’, and a pair $i \in I$, $j \in [n] \setminus I$ such that $\theta/n \leq \kappa^W_{ij}$. The running time can be bounded as $O(n(n-m)^2)$.

The proofs of the above statements are given in Section 2.3.

A min-max theorem. We next provide a combinatorial min-max characterization of $\kappa^*_W$. Consider the circuit ratio digraph $G = ([n], E)$ on the node set $[n]$ where $(i, j) \in E$ if $\kappa_{ij} > 0$, that is, there exists a circuit $C \in \mathcal{C}_W$ with $i, j \in C$. We will refer to $\kappa_{ij} = \kappa^W_{ij}$ as the weight of the edge $(i, j)$. (Note that $(i, j) \in E$ if and only if $(j, i) \in E$, but the weight of these two edges can be different.)

Let $H$ be a cycle in $G$, that is, a sequence of indices $i_1, i_2, \ldots, i_k, i_{k+1} = i_1$. We use $|H| = k$ to denote the length of the cycle. (In our terminology, ‘cycles’ always refer to objects in $G$, whereas ‘circuits’ refer to the minimum supports in $\mathrm{Ker}(A)$.)

We use the notation $\kappa(H) = \kappa_W(H) = \prod_{j=1}^{k} \kappa^W_{i_j i_{j+1}}$. For a vector $d \in \mathbb{R}^n_{++}$, we denote $\kappa^d_W(H) = \kappa_{\mathrm{Diag}(d)W}(H)$. A simple but important observation is that such a rescaling does not change the value associated with the cycle, that is,

$$\kappa^d_W(H) = \kappa_W(H) \quad \forall d \in \mathbb{R}^n_{++} \text{ for any cycle } H \text{ in } G. \tag{7}$$

Theorem 2.12 (Proof in Section 2.4). For a subspace $W \subset \mathbb{R}^n$, we have

$$\kappa^*_W = \min_{d > 0} \kappa^d_W = \max\left\{ \kappa_W(H)^{1/|H|} : H \text{ is a cycle in } G \right\}.$$

The proof relies on the following formulation:

$$\begin{aligned} \kappa^*_W = \min\ & t \\ & \kappa_{ij}\, d_j / d_i \leq t \quad \forall (i, j) \in E \\ & d > 0. \end{aligned}$$

Taking logarithms, we can rewrite this problem as

$$\begin{aligned} \min\ & s \\ & \log \kappa_{ij} + z_j - z_i \leq s \quad \forall (i, j) \in E \\ & z \in \mathbb{R}^n. \end{aligned}$$

This is the dual of the minimum-mean cycle problem with weights $\log \kappa_{ij}$, and can be solved in polynomial time (see e.g. [AMO93, Theorem 5.8]).

Whereas this formulation verifies Theorem 2.12, it does not give a polynomial-time algorithm to compute $\kappa^*_W$. The caveat is that the values $\kappa^W_{ij}$ are typically not available; in fact, approximating them up to a factor $2^{O(m)}$ is NP-hard, as follows from the work of Tunçel [Tun99].

Nevertheless, the following corollary of Theorem 2.12 shows that any arbitrary circuit containing $i$ and $j$ yields a $(\kappa^*)^2$ approximation to $\kappa_{ij}$.

Corollary 2.13 (Proof in Section 2.4). Let us be given a linear subspace $W \subseteq \mathbb{R}^n$ and $i, j \in [n]$, $i \neq j$, and a circuit $C \in \mathcal{C}_W$ with $i, j \in C$. Let $g \in W$ be the corresponding vector with $\mathrm{supp}(g) = C$. Then,

$$\frac{\kappa^W_{ij}}{(\kappa^*_W)^2} \leq \frac{|g_j|}{|g_i|} \leq \kappa^W_{ij}.$$

The above statements are shown in Section 2.4. In Section 2.5, we use techniques from matroid theory and linear algebra to efficiently identify a circuit for any pair of variables that are contained in the same circuit. A matroid is non-separable if the circuit hypergraph is connected; precise definitions and background will be described in Section 2.5.

Theorem 2.14 (Proof in Section 2.5). Given $A \in \mathbb{R}^{m \times n}$, there exists an $O(n^2 m^2)$ time algorithm Find-Circuits($A$) that obtains a decomposition of $\mathcal{M}(A)$ to a direct sum of non-separable linear matroids, and returns a family $\hat{\mathcal{C}}$ of circuits such that if $i$ and $j$ are in the same non-separable component, then there exists a circuit in $\hat{\mathcal{C}}$ containing both $i$ and $j$. Further, for each $i \neq j$ in the same component, the algorithm returns a value $\hat\kappa_{ij}$ as the maximum of $|g_j / g_i|$ such that $g \in W$, $\mathrm{supp}(g) = C$ for some $C \in \hat{\mathcal{C}}$ containing $i$ and $j$. For these values, $\hat\kappa_{ij} \leq \kappa_{ij} \leq (\kappa^*)^2 \hat\kappa_{ij}$.

Finally, in Section 2.6, we combine the above results to prove Theorem 2.5 on approximating $\bar\chi^*_W$ and $\kappa^*_W$.

Section 2.5 contains an interesting additional statement, namely that the logarithms of the circuit ratios satisfy the triangle inequality. This will also be useful in the analysis of the LLS algorithm. The proof uses similar arguments as the proof of Theorem 2.14.

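Lemma 2.15 (Proof in Section 2.5).

(i) For any distinct $i, j, k$ in the same connected component of $\mathcal{C}_W$, and any $g^C$ with $i, j \in C$, $C \in \mathcal{C}_W$, there exist circuits $C_1, C_2 \in \mathcal{C}_W$, $i, k \in C_1$, $j, k \in C_2$ such that $|g^C_j / g^C_i| = |g^{C_2}_j / g^{C_2}_k| \cdot |g^{C_1}_k / g^{C_1}_i|$.

(ii) For any distinct $i, j, k$ in the same connected component of $\mathcal{C}_W$, $\kappa_{ij} \leq \kappa_{ik} \cdot \kappa_{kj}$.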

2.3 Basic properties of $\kappa_W$

Theorem 2.8 (Restatement). For a linear subspace $W \subseteq \mathbb{R}^n$,

$$\sqrt{1 + (\kappa_W)^2} \leq \bar\chi_W \leq n \kappa_W.$$

Proof. For the first inequality, let $C \in \mathcal{C}_W$ be the circuit and $i \neq j \in C$ such that $|g_j / g_i| = \kappa_W$ for the corresponding solution $g = g^C$. Let us use the characterization of $\bar\chi_W$ in Proposition 2.4. Let $I = ([n] \setminus C) \cup \{i\}$, and $p = g_i e^i$, that is, the vector with $p_i = g_i$ and $p_k = 0$ for $k \neq i$. Then, the unique vector $z \in W$ such that $z_I = p$ is $z = g$. Therefore,

$$\bar\chi_W \geq \min_{z \in W,\ z_I = p} \frac{\|z\|}{\|p\|} = \frac{\|g\|}{|g_i|} \geq \frac{\sqrt{|g_i|^2 + |g_j|^2}}{|g_i|} = \sqrt{1 + \kappa_W^2}.$$

The second inequality is immediate from Proposition 2.4 and Proposition 2.9, and the inequalities between $\ell_1$, $\ell_2$, and $\ell_\infty$ norms. The proof of the slightly weaker $\bar\chi_W \leq \sqrt{1 + (n\kappa_W)^2}$ follows from Lemma 2.11.

The next lemma will be needed to prove Lemma 2.11 and also to analyze the LLS algorithm. Let us say that the vector $y \in \mathbb{R}^n$ is sign-consistent with $x \in \mathbb{R}^n$ if $x_i y_i \geq 0$ for all $i \in [n]$ and $x_i = 0$ implies $y_i = 0$ for all $i \in [n]$.

Lemma 2.16. For $i \in I \subset [n]$ with $e^i_I \in \pi_I(W)$, let $z = L_I^W(e^i_I)$. Then for any $j \in \mathrm{supp}(z)$ we have $\kappa^W_{ij} \geq |z_j|$.

Proof. We consider the cone $F \subset W$ of vectors sign-consistent with $z$. The faces of $F$ are bounded by inequalities of the form $z_k y_k \geq 0$ or $y_k = 0$. The edges (rays) of $F$ are of the form $\{\alpha g : \alpha \geq 0\}$ with $\mathrm{supp}(g) \in \mathcal{C}_W$. It is easy to see from the Minkowski-Weyl theorem that $z$ can be written as

$$z = \sum_{k=1}^{h} g^k,$$

where $h \leq n$, $C_1, C_2, \ldots, C_h \in \mathcal{C}_W$ are circuits, and the vectors $g^1, g^2, \ldots, g^h \in W$ are sign-consistent with $z$ and $\mathrm{supp}(g^k) = C_k$ for all $k \in [h]$. Note that $i \in C_k$ for all $k \in [h]$, as otherwise, $z' = z - g^k$ would also satisfy $z'_I = e^i_I$, but $\|z'\| < \|z\|$ due to $g^k$ being sign-consistent with $z$, a contradiction to the definition of $z$.

At least one $k \in [h]$ contributes at least as much to $|z_j| = \frac{\sum_{k=1}^{h} |g^k_j|}{\sum_{k=1}^{h} |g^k_i|}$ as the average. Hence we find $\kappa^W_{ij} \geq |g^k_j / g^k_i| \geq |z_j|$.

Lemma 2.11 (Restatement). There exists a subroutine Verify-Lift($W, I, \theta$) that, given a linear subspace $W \subseteq \mathbb{R}^n$, an index set $I \subseteq [n]$, and a threshold $\theta \in \mathbb{R}_{++}$, either returns the answer ‘pass’, verifying $\ell^W(I) \leq \theta$, or returns the answer ‘fail’, and a pair $i \in I$, $j \in [n] \setminus I$ such that $\theta/n \leq \kappa^W_{ij}$. The running time can be bounded as $O(n(n-m)^2)$.

Proof. Take any minimal $I' \subset I$ such that $\dim(\pi_{I'}(W)) = \dim(\pi_I(W))$. Then we know that $\pi_{I'}(W) = \mathbb{R}^{I'}$ and for $p \in \pi_I(W)$ we can compute $L_I^W(p) = L_{I'}^W(p_{I'})$. Let $B \in \mathbb{R}^{([n] \setminus I) \times I'}$ be the matrix sending any $q \in \pi_{I'}(W)$ to the corresponding vector $(L_{I'}^W(q))_{[n] \setminus I}$. The column $B_i$ can be computed as $(L_{I'}^W(e^i_{I'}))_{[n] \setminus I}$ for $e^i_{I'} \in \mathbb{R}^{I'}$. We have $\|L_I^W(p)\|^2 = \|p\|^2 + \|(L_{I'}^W(p_{I'}))_{[n] \setminus I}\|^2 \leq \|p\|^2 + \|B\|^2 \|p_{I'}\|^2$ for any $p \in \pi_I(W)$, and so $\ell^W(I) = \sqrt{\|L_I^W\|^2 - 1} \leq \|B\|$. We upper bound the operator norm by the Frobenius norm as $\|B\| \leq \|B\|_F = \sqrt{\sum_{ji} B_{ji}^2} \leq n \max_{ji} |B_{ji}|$. By Lemma 2.16 it follows that $|B_{ji}| = |(L_{I'}^W(e^i_{I'}))_j| \leq \kappa^W_{ij}$. The algorithm returns the answer ‘pass’ if $n \max_{ji} |B_{ji}| \leq \theta$ and ‘fail’ otherwise.

To implement the algorithm, we first need to select a minimal $I' \subset I$ such that $\dim(\pi_{I'}(W)) = \dim(\pi_I(W))$. This can be found by computing a matrix $M \in \mathbb{R}^{(n-m) \times n}$ such that $\mathrm{range}(M^\top) = W$, and selecting a maximal number of linearly independent columns of $M_I$. Then, we compute the matrix $B \in \mathbb{R}^{([n] \setminus I) \times I'}$ that implements the transformation $[L_{I'}^W]_{[n] \setminus I} : \pi_{I'}(W) \to \pi_{[n] \setminus I}(W)$. The algorithm returns the pair $(i, j)$ corresponding to the entry maximizing $|B_{ji}|$. The running time analysis will be given in the proof of Lemma 3.15, together with an amortized analysis of a sequence of calls to the subroutine.

Remark 2.17. We note that the algorithm Verify-Lift does not need to compute the circuit as in Lemma 2.16. The following observation will be important in the analysis: the algorithm may return the answer ‘fail’ even if $\ell^W(I) \leq \theta$, since it only tests the upper bound $n \max_{ji} |B_{ji}|$ against $\theta$.

Lemma 2.10 (Restatement). For any subspace $W \subseteq \mathbb{R}^n$ and $i, j \in [n]$, $\kappa^W_{ij} = \kappa^{W^\perp}_{ji}$.

Proof. Choose a circuit $C \in \mathcal{C}_W$ and corresponding circuit solution $g := g^C \in W \cap \mathbb{R}^n_C$ such that $\kappa^W_{ij} = \kappa^W_{ij}(C) = |g_j / g_i|$. We will construct a circuit solution in $W^\perp$ that certifies $\kappa^{W^\perp}_{ji} \geq \kappa^W_{ij}$.

Define $h \in \mathbb{R}^C$ by $h_i = g_j$, $h_j = -g_i$ and $h_k = 0$ for all $k \in C \setminus \{i, j\}$. Then, $h$ is orthogonal to $g_C$ by construction, and hence $h \in (\pi_C(W \cap \mathbb{R}^n_C))^\perp = \pi_C(W^\perp)$. Furthermore, we have $\mathrm{supp}(h) \in \mathcal{C}_{\pi_C(W^\perp)}$ since $h \in \mathbb{R}^C$ is a support minimal vector orthogonal to $g_C$.

Take any vector $\bar h \in W^\perp$ satisfying $\bar h_C = h$ that is support minimal subject to these constraints. We claim that $\mathrm{supp}(\bar h) \in \mathcal{C}_{W^\perp}$. Assume not, then there exists a non-zero $v \in W^\perp$ with $\mathrm{supp}(v) \subset \mathrm{supp}(\bar h)$. Since $\mathrm{supp}(\pi_C(v)) \subseteq \mathrm{supp}(\pi_C(\bar h)) = \mathrm{supp}(h)$, we must have either $v_C = 0$ or $v_C = sh$ for $s \neq 0$. If $v_C = 0$, then $\bar h - \alpha v$ is also in $W^\perp$ satisfying $\pi_C(\bar h - \alpha v) = h$ for all $\alpha \in \mathbb{R}$, and since $v \neq 0$ we can choose $\alpha$ such that $\bar h - \alpha v$ has smaller support than $\bar h$, a contradiction. If $s \neq 0$ then $v/s \in W^\perp$ satisfies $\pi_C(v/s) = h$ and has smaller support than $\bar h$, again a contradiction.

By the above construction, we have

$$\kappa^{W^\perp}_{ji} \geq \left| \frac{\bar h_i}{\bar h_j} \right| = \left| \frac{h_i}{h_j} \right| = \left| \frac{g_j}{g_i} \right| = \kappa^W_{ij}.$$

By swapping the role of $W$ and $W^\perp$ and $i$ and $j$, we obtain $\kappa^W_{ij} \geq \kappa^{W^\perp}_{ji}$. The statement follows.

2.4 A min-max theorem on $\kappa^*_W$

The proof of the characterization of $\kappa^*_W$ follows.

Theorem 2.12 (Restatement). For a subspace $W \subset \mathbb{R}^n$, we have

$$\kappa^*_W = \min_{d > 0} \kappa^d_W = \max\left\{ \kappa_W(H)^{1/|H|} : H \text{ is a cycle in } G \right\}.$$

Proof. For the direction $\kappa_W(H)^{1/|H|} \leq \kappa^*_W$ we use (7). Let $d > 0$ be a scaling and $H$ a cycle. We have $\kappa^d_{ij} \leq \kappa^d_W$ for every $i, j \in [n]$, and hence $\kappa_W(H) = \kappa^d_W(H) \leq (\kappa^d_W)^{|H|}$. Since this inequality holds for every $d > 0$, it follows that $\kappa_W(H) \leq (\kappa^*_W)^{|H|}$.

For the reverse direction, consider the following optimization problem.

$$\begin{aligned} \min\ & t \\ & \kappa_{ij}\, d_j / d_i \leq t \quad \forall (i, j) \in E \\ & d > 0. \end{aligned} \tag{8}$$

For any feasible solution $(d, t)$ and $\lambda > 0$, we get another feasible solution $(\lambda d, t)$ with the same objective value. As such, we can strengthen the condition $d > 0$ to $d \geq 1$ without changing the objective value. This makes it clear that the optimum value is achieved by a feasible solution. Any rescaling $d > 0$ provides a feasible solution with objective value $\kappa^d$, which means that the optimal value $t^*$ of (8) is $t^* = \kappa^*$. Moreover, with the variable substitution $z_i = \log d_i$, $s = \log t$, (8) can be written as a linear program:

$$\begin{aligned} \min\ & s \\ & \log \kappa_{ij} + z_j - z_i \leq s \quad \forall (i, j) \in E \\ & z \in \mathbb{R}^n. \end{aligned} \tag{9}$$

This is the dual of a minimum-mean cycle problem with respect to the cost function $\log(\kappa_{ij})$. Therefore, an optimal solution corresponds to the cycle maximizing $\sum_{ij \in H} \log \kappa_{ij} / |H|$, or in other words, maximizing $\kappa(H)^{1/|H|}$.
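The cycle side of Theorem 2.12 can be evaluated numerically once edge weights are available. The following is a minimal sketch, assuming the weights are given as a dense nonnegative matrix with `kappa[i][j] > 0` exactly on the edges of the digraph; it uses a Karp-style dynamic program on the costs $-\log \kappa_{ij}$ (our choice for illustration, the text only refers to a standard dynamic programming algorithm), and the function name `max_geometric_mean_cycle` is hypothetical. Self-loops are ignored.

```python
import math


def max_geometric_mean_cycle(kappa):
    """Return max over cycles H of kappa(H)^(1/|H|), or None if G is acyclic."""
    n = len(kappa)
    # Minimizing the mean of -log(kappa) maximizes the geometric mean of kappa.
    edges = [(i, j, -math.log(kappa[i][j]))
             for i in range(n) for j in range(n)
             if i != j and kappa[i][j] > 0]
    inf = float("inf")
    # F[k][v]: minimum cost of a walk with exactly k edges ending at v,
    # where the walk may start at any node (F[0][v] = 0).
    F = [[0.0] * n] + [[inf] * n for _ in range(n)]
    for k in range(1, n + 1):
        for i, j, w in edges:
            if F[k - 1][i] + w < F[k][j]:
                F[k][j] = F[k - 1][i] + w
    best = inf
    for v in range(n):
        if F[n][v] == inf:
            continue  # no walk with n edges ends at v
        worst = max((F[n][v] - F[k][v]) / (n - k)
                    for k in range(n) if F[k][v] < inf)
        best = min(best, worst)
    return None if best == inf else math.exp(-best)


# Two nodes with weights 2 and 8: the only cycle has geometric mean sqrt(16) = 4.
two_cycle = max_geometric_mean_cycle([[0, 2], [8, 0]])
# A directed triangle with weights 2, 4, 8: geometric mean 64^(1/3) = 4.
triangle = max_geometric_mean_cycle([[0, 2, 0], [0, 0, 4], [8, 0, 0]])
```

The dynamic program runs in $O(n \cdot |E|)$ time, which is $O(n^3)$ for a dense digraph. Since it works with floating-point logarithms, the result is only accurate up to rounding.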

The following example shows that $\kappa^* \leq \bar\chi^*$ can be arbitrarily big.

Example 2.18. Take $W = \mathrm{span}((0, 1, 1, M), (1, 0, M, 1))$, where $M > 0$. Then $\{2, 3, 4\}$ and $\{1, 3, 4\}$ are circuits with $\kappa^W_{34}(\{2, 3, 4\}) = M$ and $\kappa^W_{43}(\{1, 3, 4\}) = M$. Hence, by Theorem 2.12, we see that $\kappa^* \geq M$.

Corollary 2.13 (Restatement). Let us be given a linear subspace $W \subseteq \mathbb{R}^n$ and $i, j \in [n]$, $i \neq j$, and a circuit $C \in \mathcal{C}_W$ with $i, j \in C$. Let $g \in W$ be the corresponding vector with $\mathrm{supp}(g) = C$. Then,

$$\frac{\kappa^W_{ij}}{(\kappa^*_W)^2} \leq \frac{|g_j|}{|g_i|} \leq \kappa^W_{ij}.$$

Proof. The second inequality follows by definition. For the first inequality, note that the same circuit $C$ yields $|g_i / g_j| \leq \kappa^W_{ji}(C) \leq \kappa^W_{ji}$. Therefore, $|g_j / g_i| \geq 1/\kappa^W_{ji}$. From Theorem 2.12 we see that $\kappa^W_{ij} \kappa^W_{ji} \leq (\kappa^*_W)^2$, giving $1/\kappa^W_{ji} \geq \kappa^W_{ij} / (\kappa^*_W)^2$, completing the proof.

2.5 Finding circuits: a detour in matroid theory

We next prove Theorem 2.14, showing how to efficiently obtain a family $\hat{\mathcal{C}} \subseteq \mathcal{C}_W$ such that for any $i, j \in [n]$, $\hat{\mathcal{C}}$ includes a circuit containing both $i$ and $j$, provided there exists such a circuit.

We need some simple concepts and results from matroid theory. We refer the reader to [Sch03, Chapter 39] or [Fra11, Chapter 5] for definitions and background. Let $\mathcal{M} = ([n], \mathcal{I})$ be a matroid on ground set $[n]$ with independent sets $\mathcal{I} \subseteq 2^{[n]}$. The rank $\mathrm{rk}(S)$ of a set $S \subseteq [n]$ is the maximum size of an independent set contained in $S$. The maximal independent sets are called bases. All bases have the same cardinality $\mathrm{rk}([n])$.

For the matrix $A \in \mathbb{R}^{m \times n}$, we will work with the linear matroid $\mathcal{M}(A) = ([n], \mathcal{I}(A))$, where a subset $I \subseteq [n]$ is independent if the columns $\{A_i : i \in I\}$ are linearly independent. Note that $\mathrm{rk}([n]) = m$ under the assumption that $A$ has full row rank.

The circuits of the matroid are the inclusion-wise minimal non-independent sets. Let $I \in \mathcal{I}$ be an independent set, and $i \in [n] \setminus I$ such that $I \cup \{i\} \notin \mathcal{I}$. Then, there exists a unique circuit $C(I, i) \subseteq I \cup \{i\}$ that is called the fundamental circuit of $i$ with respect to $I$. Note that $i \in C(I, i)$.

The matroid $\mathcal{M}$ is separable, if the ground set $[n]$ can be partitioned to two nonempty subsets $[n] = S \cup T$ such that $I \in \mathcal{I}$ if and only if $I \cap S, I \cap T \in \mathcal{I}$. In this case, the matroid is the direct sum of its restrictions to $S$ and $T$. In particular, every circuit is fully contained in $S$ or in $T$.

For the linear matroid $\mathcal{M}(A)$, separability means that $\mathrm{Ker}(A) = \mathrm{Ker}(A_S) \oplus \mathrm{Ker}(A_T)$. In this case, solving (LP) can be decomposed into two subproblems, restricted to the columns in $A_S$ and in $A_T$, and $\kappa_A = \max\{\kappa_{A_S}, \kappa_{A_T}\}$.

Hence, we can focus on non-separable matroids. The following characterization is well-known, see e.g. [Fra11, Theorems 5.2.5, 5.2.7–5.2.9]. For a hypergraph $H = ([n], \mathcal{E})$, we define the underlying graph $H_G = ([n], E)$ such that $(i, j) \in E$ if there is a hyperedge $S \in \mathcal{E}$ with $i, j \in S$. That is, we add a clique corresponding to each hyperedge. The hypergraph is called connected if the underlying graph $H_G = ([n], E)$ is connected.

Proposition 2.19. For a matroid $\mathcal{M} = ([n], \mathcal{I})$, the following are equivalent:

(i) $\mathcal{M}$ is non-separable.

(ii) The hypergraph of the circuits is connected.

(iii) For any base $B$ of $\mathcal{M}$, the hypergraph formed by the fundamental circuits $\mathcal{C}^B = \{C(B, i) : i \in [n] \setminus B\}$ is connected.

(iv) For any $i, j \in [n]$, there exists a circuit containing $i$ and $j$.

Proof. The implications (i) $\Leftrightarrow$ (ii), (iii) $\Rightarrow$ (ii), and (iv) $\Rightarrow$ (ii) are immediate from the definitions.

For the implication (ii) $\Rightarrow$ (iii), assume for a contradiction that the hypergraph of the fundamental circuits with respect to $B$ is not connected. This means that we can partition $[n] = S \cup T$ such that for each $i \in S$, $C(B, i) \subseteq S$, and for each $i \in T$, $C(B, i) \subseteq T$. Consequently, $\mathrm{rk}(S) = |B \cap S|$, $\mathrm{rk}(T) = |B \cap T|$, and therefore $\mathrm{rk}([n]) = \mathrm{rk}(S) + \mathrm{rk}(T)$. It is easy to see that this property is equivalent to separability to $S$ and $T$; see e.g. [Fra11, Theorem 5.2.7] for a proof.

Finally, for the implication (ii) $\Rightarrow$ (iv), consider the undirected graph $([n], E)$ where $(i, j) \in E$ if there is a circuit containing both $i$ and $j$. This graph is transitive according to [Fra11, Theorem 5.2.5]: if $(i, j), (j, k) \in E$, then also $(i, k) \in E$. Consequently, whenever $([n], E)$ is connected, it must be a complete graph.

We now give a different proof of (iii) $\Rightarrow$ (iv) that will be convenient for our algorithmic purposes.
First, we need a simple lemma that is commonly used in matroid optimization, see e.g. [Fra11, Lemma 13.1.11] or [Sch03, Theorem 39.13].

Lemma 2.20. Let $I \in \mathcal{I}$ be an independent set of a matroid $\mathcal{M} = ([n], \mathcal{I})$, and $U = \{u_1, u_2, \ldots, u_\ell\} \subseteq I$, $V = \{v_1, v_2, \ldots, v_\ell\} \subseteq [n] \setminus I$ such that $I \cup \{v_i\}$ is dependent for each $i \in [\ell]$. Further, assume that for each $t \in [\ell]$, $u_t \in C(I, v_t)$ and $u_t \notin C(I, v_h)$ for all $h < t$. Then $(I \setminus U) \cup V \in \mathcal{I}$.

Proof of Lemma 2.21. Note that $S = (B \setminus U) \cup V \notin \mathcal{I}$ since $|S| > |B|$ and $B$ is a basis. For any $i \in [\ell + 1]$, we can use Lemma 2.20 to show that $S \setminus \{v_i\} = (B \setminus U) \cup (V \setminus \{v_i\}) \in \mathcal{I}$ (and thus, is a basis). To see this, we apply Lemma 2.20 for the ordered sets $V' = \{v_1, \ldots, v_{i-1}, v_{\ell+1}, v_\ell, \ldots, v_{i+1}\}$ and $U' = \{u_1, \ldots, u_{i-1}, u_\ell, u_{\ell-1}, \ldots, u_i\}$.

Consequently, every circuit in $S$ must contain the entire set $V$. The uniqueness of the circuit in $S$ follows by the well-known circuit axiom asserting that if $C, C' \in \mathcal{C}$, $C \neq C'$ and $v \in C \cap C'$, then there exists a circuit $C'' \in \mathcal{C}$ such that $C'' \subseteq (C \cup C') \setminus \{v\}$, contradicting the claim that every circuit in $S$ contains the entire set $V$.

We are ready to describe the algorithm that will be used to obtain lower bounds on all κij values.

Theorem 2.14 (Restatement). Given $A \in \mathbb{R}^{m \times n}$, there exists an $O(n^2 m^2)$ time algorithm Find-Circuits($A$) that obtains a decomposition of $\mathcal{M}(A)$ to a direct sum of non-separable linear matroids, and returns a family $\hat{\mathcal{C}}$ of circuits such that if $i$ and $j$ are in the same non-separable component, then there exists a circuit in $\hat{\mathcal{C}}$ containing both $i$ and $j$. Further, for each $i \neq j$ in the same component, the algorithm returns a value $\hat\kappa_{ij}$ as the maximum of $|g_j / g_i|$ such that $g \in W$, $\mathrm{supp}(g) = C$ for some $C \in \hat{\mathcal{C}}$ containing $i$ and $j$. For these values, $\hat\kappa_{ij} \leq \kappa_{ij} \leq (\kappa^*)^2 \hat\kappa_{ij}$.

Proof. Once we have found the set of circuits $\hat{\mathcal{C}}$, and computed $\hat\kappa_{ij}$ as in the statement, the inequalities $\hat\kappa_{ij} \leq \kappa_{ij} \leq (\kappa^*)^2 \hat\kappa_{ij}$ follow easily. The first inequality is by the definition of $\kappa_{ij}$, and the second inequality is from Corollary 2.13.

We now turn to the computation of $\hat{\mathcal{C}}$. We first obtain a basis $B \subseteq [n]$ of the linear matroid $\mathcal{M}(A)$ via Gauss-Jordan elimination in time $O(nm^2)$. Recall the assumption that $A$ has full row-rank. Let us assume that $B = [m]$ is the set of first $m$ indices. The elimination transforms $A$ to the form $A = (I_m \mid H)$, where $H \in \mathbb{R}^{m \times (n-m)}$ corresponds to the non-basis elements. In this form, the fundamental circuit $C(B, i)$ is the support of the $i$th column of $A$ together with $i$ for every $m + 1 \leq i \leq n$. We let $\mathcal{C}^B$ denote the set of all these fundamental circuits.

We construct an undirected graph $G = (B, E)$ as follows. For each $i \in [n] \setminus B$, we add a clique between the nodes in $C(B, i) \setminus \{i\}$. This graph can be constructed in $O(nm^2)$ time. The connected components of $G$ correspond to the connected components of $\mathcal{C}^B$ restricted to $B$. Thus, due to the equivalence shown in Proposition 2.19 we can obtain the decomposition by identifying the connected components of $G$. For the rest of the proof, we assume that the entire hypergraph is connected; connectivity can be checked in $O(m^2)$ time.

We initialize $\hat{\mathcal{C}}$ as $\mathcal{C}^B$. We will then check all pairs $i, j \in [n]$, $i \neq j$. If no circuit $C \in \hat{\mathcal{C}}$ exists with $i, j \in C$, then we will add such a circuit to $\hat{\mathcal{C}}$ as follows.

Assume first $i, j \in [n] \setminus B$. We can find a shortest path in $G$ between the sets $C(B, i) \setminus \{i\}$ and $C(B, j) \setminus \{j\}$ in time $O(m^2)$. This can be represented by the sequences of points $V = \{v_1, v_2, \ldots, v_{\ell+1}\} \subseteq [n] \setminus B$, $v_1 = i$, $v_{\ell+1} = j$, and $U = \{u_1, u_2, \ldots, u_\ell\} \subseteq B$ as in Lemma 2.21. According to the lemma, $S = (B \setminus U) \cup V$ contains a unique circuit $C$ that contains all $v_t$'s, including $i$ and $j$.

We now show how this circuit can be identified in $O(m)$ time, along with the vector $g^C$. Let $A_S$ be the submatrix corresponding to the columns in $S$. Since $g = g^C$ is unique up to scaling, we can set $g_{v_1} = 1$. Note that for each $t \in [\ell]$, the row of $A_S$ corresponding to $u_t$ contains only two nonzero entries: $A_{u_t v_t}$ and $A_{u_t v_{t+1}}$. Thus, the value $g_{v_1} = 1$ can be propagated to assigning unique values to $g_{v_2}, g_{v_3}, \ldots, g_{v_{\ell+1}}$. Once these values are set, there is a unique extension of $g$ to the indices $t \in B \cap S$ in the basis. Thus, we have identified $g$ as the unique element of $\mathrm{Ker}(A_S)$ up to scaling. The circuit $C$ is obtained as $\mathrm{supp}(g)$. Clearly, the above procedure can be implemented in $O(m)$ time.

The argument easily extends to finding circuits for the case $\{i, j\} \cap B \neq \emptyset$. If $i \in B$, then for any choice of $V = \{v_1, v_2, \ldots, v_{\ell+1}\}$ and $U = \{u_1, u_2, \ldots, u_\ell\}$ as in Lemma 2.21 such that $i \in C(B, v_1)$ and $i \notin C(B, v_t)$ for $t > 1$, the unique circuit in $(B \setminus U) \cup V$ also contains $i$. This follows from Lemma 2.20 by taking $V' = \{v_{\ell+1}, v_\ell, \ldots, v_1\}$ and $U' = \{u_\ell, \ldots, u_1, i\}$, which proves that $S \setminus \{i\} = (B \setminus U') \cup V' \in \mathcal{I}$. Similarly, if $j \in B$ with $j \in C(B, v_{\ell+1})$ and $j \notin C(B, v_t)$ for $t < \ell + 1$, taking $V'' = V$ and $U'' = \{u_1, \ldots, u_\ell, j\}$ gives $S \setminus \{j\} \in \mathcal{I}$.

The bottleneck for the running time is finding the shortest paths for the $n(n-1)$ pairs, in time $O(m^2)$ each.
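The first steps of the proof (reading off the fundamental circuits from the reduced form and decomposing into non-separable components) can be sketched as follows. This is a minimal illustration under the assumption that $A$ has already been brought to the form $(I_m \mid H)$ and that only $H$ is passed in; indices are 0-based, so the basis is $\{0, \ldots, m-1\}$, and the function names `fundamental_circuits` and `components` are hypothetical. It uses a union-find structure over all of $[n]$ instead of the clique graph on $B$, which yields the same components.

```python
def fundamental_circuits(H):
    """Map each non-basis index m + c to its fundamental circuit C(B, m + c)."""
    m, k = len(H), len(H[0])
    circuits = {}
    for c in range(k):
        i = m + c
        # Support of the column of H within the basis rows, together with i.
        circuits[i] = sorted([r for r in range(m) if H[r][c] != 0] + [i])
    return circuits


def components(n, circuits):
    """Connected components of the hypergraph of fundamental circuits."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, circuit in circuits.items():
        for v in circuit:
            parent[find(v)] = find(i)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


# A = (I_3 | H) with n = 5: column 3 depends on basis elements 0 and 1,
# column 4 depends on basis element 2 only.
H = [[1, 0],
     [1, 0],
     [0, 1]]
circuits = fundamental_circuits(H)
parts = components(5, circuits)
```

Here the matroid splits into the two components `[0, 1, 3]` and `[2, 4]`, so by Proposition 2.19 no circuit contains, say, both 0 and 2. The sketch stops at the decomposition; it does not implement the shortest-path step that produces circuits for the remaining pairs.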

The triangle inequality. An interesting additional fact about the circuit ratio graph is that the logarithms of the weights satisfy the triangle inequality. The proof uses similar arguments as the proof of Theorem 2.14 above.

Lemma 2.15 (Restatement).

(i) For any distinct $i, j, k$ in the same connected component of $\mathcal{C}_W$, and any $g^C$ with $i, j \in C$, $C \in \mathcal{C}_W$, there exist circuits $C_1, C_2 \in \mathcal{C}_W$, $i, k \in C_1$, $j, k \in C_2$ such that $|g^C_j / g^C_i| = |g^{C_2}_j / g^{C_2}_k| \cdot |g^{C_1}_k / g^{C_1}_i|$.

(ii) For any distinct $i, j, k$ in the same connected component of $\mathcal{C}_W$, $\kappa_{ij} \leq \kappa_{ik} \cdot \kappa_{kj}$.

Proof. Note that part (ii) immediately follows from part (i) when taking $C \in \mathcal{C}_W$ such that $\kappa_{ij}(C) = \kappa_{ij}$. We now prove part (i).

Let $A \in \mathbb{R}^{m \times n}$ be a full-rank matrix with $W = \mathrm{Ker}(A)$. If $C = \{i, j\}$, then the columns $A_i, A_j$ are linearly dependent. Writing $A_i = \lambda A_j$, we have $\lambda = -g^C_j / g^C_i$. Let $h$ be any circuit solution with $i, k \in \mathrm{supp}(h)$, and hence $j \notin \mathrm{supp}(h)$. By assumption, the vector $h' = h - h_i e^i + \lambda h_i e^j$ will satisfy $Ah' = 0$ and have $i \notin \mathrm{supp}(h')$, $j, k \in \mathrm{supp}(h')$. We know that $h'$ is a circuit solution, because any circuit $C' \subset \mathrm{supp}(h')$ could, by the above process in reverse, be used to produce a kernel solution with strictly smaller support than $h$, contradicting the assumption that $h$ is a circuit solution. Now we have $|h'_j / h'_k| \cdot |h_k / h_i| = |h'_j / h_i| = |\lambda|$ by construction. Thus, $h$ and $h'$ are the circuit solutions we are looking for.

Now assume $C \neq \{i, j\}$. If $k \in C$, the statement is trivially true with $C = C_1 = C_2$, so assume $k \notin C$. Pick $l \in C$, $l \notin \{i, j\}$ and set $B = C \setminus \{l\}$. Assume without loss of generality that $B \subseteq [m]$ and apply row operations to $A$ such that $A_{B,B} = I_{B \times B}$ is an identity submatrix and $A_{[n] \setminus B, B} = 0$. Then the column $A_l$ has support given by $B$, for otherwise $g^C$ could not be in the kernel. The given circuit solution satisfies $g^C_t = -A_{t,l}\, g^C_l$ for all $t \in B$, and in particular $g^C_j / g^C_i = A_{j,l} / A_{i,l}$.

Take any minimal set $J \subset [n]$ such that $k \in J$ and the linear matroid induced by $C \cup J$ is non-separable. We show that we can uniquely lift any vector $x \in \mathrm{Ker}(A_{B, C \cup \{k\}})$ to a vector $x' \in \mathrm{Ker}(A_{C \cup J})$, $x'_{C \cup \{k\}} = x$. Since this lift will send circuit solutions to circuit solutions by uniqueness, it suffices to find our desired circuits as solutions to the smaller linear system.

Consider a circuit solution $h \in W$ such that $\{l, k\} \subset \mathrm{supp}(h) \subset C \cup J$, which exists by Proposition 2.19(iv). By Proposition 2.19(ii), the set $C \cup \mathrm{supp}(h)$ induces a non-separable matroid, which means that by minimality of $J$ we have $J \subset \mathrm{supp}(h)$. This $h$ gives rise to the non-zero vector $h_J \in \mathrm{Ker}(A_{[n] \setminus B, J})$ since $Ah = 0$ and $A_{[n] \setminus B, C} = 0$. Suppose that $\dim(\mathrm{Ker}(A_{[n] \setminus B, J})) \geq 2$, then there would exist some vector $y \in \mathrm{Ker}(A_{[n] \setminus B, J})$ with $k \in \mathrm{supp}(y) \subsetneq J$. This vector could be uniquely lifted to a vector $\bar y \in \mathrm{Ker}(A)$ with $\mathrm{supp}(\bar y) \subsetneq B \cup J$, just knowing that $A_{B,B} = I_{B \times B}$ and $A_{[n] \setminus B, B} = 0$. From this we deduce that there would exist a circuit solution $\bar h \in \mathrm{Ker}(A)$ with $k \in \mathrm{supp}(\bar h)$ and $C \cap \mathrm{supp}(\bar h) \neq \emptyset$ and $J \not\subseteq \mathrm{supp}(\bar h)$, contradicting the minimality of $J$. As such, we know that $\dim(\mathrm{Ker}(A_{[n] \setminus B, J})) = 1$.

This provides us with clear linear relations between any two entries in $J$ of any vector in $\mathrm{Ker}(A_{[n] \setminus B, J})$, implying that we can apply row operations to $A$ such that $A_{B,J}$ has non-zero entries only in the column $A_{B,\{k\}}$. Note that these row operations leave $A_C$ unchanged because $A_{[n] \setminus B, C} = 0$. From this, we can see that any element in $\mathrm{Ker}(A_{B, C \cup \{k\}})$ can be uniquely lifted to an element in $\mathrm{Ker}(A_{C \cup J})$. Hence we can focus on $\mathrm{Ker}(A_{B, C \cup \{k\}})$.

If $A_{i,k} = A_{j,k} = 0$, then any $x \in \mathrm{Ker}(A_{B, C \cup \{k\}})$ satisfies $x_i + A_{i,l} x_l = x_j + A_{j,l} x_l = 0$ and, in particular, any circuit $l, k \in \bar C \subset C \cup \{k\}$ contains $\{i, j\} \subset \bar C$ and fulfills $|g^C_j / g^C_i| = |A_{j,l} / A_{i,l}| = |g^{\bar C}_j / g^{\bar C}_i| = |g^{\bar C}_j / g^{\bar C}_k| \cdot |g^{\bar C}_k / g^{\bar C}_i|$. Choosing $C_1 = C_2 = \bar C$ concludes the case.

Otherwise, we know that $A_{i,k} \neq 0$ or $A_{j,k} \neq 0$. Any circuit in $\mathrm{Ker}(A_{\{i,j\},\{i,j,l,k\}})$ can be lifted uniquely to an element in $\mathrm{Ker}(A_{B, C \cup \{k\}})$ since $A_{B,B}$ is an identity matrix and we can set the entries of $B \setminus \{i, j\}$ individually to satisfy the equalities. Note that this lifted vector is a circuit as well, again by uniqueness of the lift. Hence we may restrict our attention to the matrix $A_{\{i,j\},\{i,j,l,k\}}$. If the columns $A_{\{i,j\},k}$, $A_{\{i,j\},l}$ are linearly dependent, then any circuit solution to $A_{\{i,j\},\{i,j,l\}}\, x = 0$, $x_l \neq 0$, such as $g^C_{\{i,j,l\}}$, is easily transformed into a circuit solution to $A_{\{i,j\},\{i,j,k\}}\, x = 0$, $x_k \neq 0$ and we are done.

If $A_{\{i,j\},k}$, $A_{\{i,j\},l}$ are independent, we can write

$$A_{\{i,j\},\{i,j,l,k\}} = \begin{pmatrix} 1 & 0 & a & c \\ 0 & 1 & b & d \end{pmatrix},$$

where $g^C_j / g^C_i = b/a$. For $\alpha = ad - bc$, which is non-zero since $\alpha = \det \begin{pmatrix} a & c \\ b & d \end{pmatrix} \neq 0$ by the independence assumption, we can check that $(\alpha, 0, -d, b)$ and $(0, \alpha, c, -a)$ are the circuits we are looking for.

2.6 Approximating $\bar\chi$ and $\bar\chi^*$

Equipped with Theorem 2.12 and Theorem 2.14, we are ready to prove Theorem 2.5. Recall that we defined $\kappa^d_{ij} := \kappa^{\mathrm{Diag}(d)W}_{ij} = \kappa_{ij}\, d_j / d_i$ when $d > 0$. We can similarly define $\hat\kappa^d_{ij} := \hat\kappa_{ij}\, d_j / d_i$, and $\hat\kappa^d_{ij}$ approximates $\kappa^d_{ij}$ just as in Theorem 2.14.

Theorem 2.5 (Restatement). There is an $O(n^2 m^2 + n^3)$ time algorithm that for any matrix $A \in \mathbb{R}^{m \times n}$ computes an estimate $\xi$ of $\bar\chi_W$ such that

$$\xi \leq \bar\chi_W \leq n (\bar\chi^*_W)^2\, \xi$$

and a $D \in \mathbf{D}$ such that

$$\bar\chi^*_W \leq \bar\chi_{WD} \leq n (\bar\chi^*_W)^3.$$

Proof. Let us run the algorithm Find-Circuits($A$) described in Theorem 2.14 to obtain the values $\hat\kappa_{ij}$ such that $\hat\kappa_{ij} \leq \kappa_{ij} \leq (\kappa^*_W)^2 \hat\kappa_{ij}$. We let $G = ([n], E)$ be the circuit ratio digraph, that is, $(i, j) \in E$ if $\kappa_{ij} > 0$.

To show the first statement on approximating $\bar\chi$, we simply set $\xi = \max_{(i,j) \in E} \hat\kappa_{ij}$. Then,

$$\xi \leq \kappa_W \leq \bar\chi_W \leq n \kappa_W \leq n (\kappa^*_W)^2\, \xi \leq n (\bar\chi^*_W)^2\, \xi$$

follows by Theorem 2.8.

For the second statement on finding a nearly optimal rescaling for $\bar\chi^*_W$, we consider the following optimization problem, which is an approximate version of (8) from Theorem 2.12.

\[ \begin{aligned} \min\ & t \\ & \hat\kappa_{ij}\,d_j/d_i \le t \quad \forall (i,j) \in E \\ & d > 0. \end{aligned} \tag{10} \]

Let $\hat d$ be an optimal solution to (10) with value $\hat t$. We will prove that $\kappa^{\hat d} \le (\kappa^*_W)^3$.

First, observe that $\kappa^{\hat d}_{ij} = \kappa_{ij}\hat d_j/\hat d_i \le (\kappa^*_W)^2\hat\kappa_{ij}\hat d_j/\hat d_i \le (\kappa^*_W)^2\,\hat t$ for any $(i,j) \in E$. Now, let $d^* > 0$ be such that $\kappa^{d^*} = \kappa^*_W$. The vector $d^*$ is a feasible solution to (10), and so $\hat t \le \max_{i\neq j}\hat\kappa_{ij}d^*_j/d^*_i \le \max_{i\neq j}\kappa_{ij}d^*_j/d^*_i = \kappa^{d^*}$. Hence we find that $\hat d$ gives a rescaling with
\[ \bar\chi_{W\hat D} \le n\kappa^{\hat d} \le n(\kappa^*_W)^3 \le n(\bar\chi^*_W)^3 , \]
where we again used Theorem 2.8.

We can obtain the optimal value $\hat t$ of (10) by solving the corresponding maximum-mean cycle problem (see Theorem 2.12). It is easy to develop a multiplicative version of the standard dynamic programming algorithm of the classical minimum-mean cycle problem (see e.g. [AMO93, Theorem 5.8]) that allows finding the optimum to (10) directly, in the same $O(n^3)$ time.

It is left to find the labels $d_i > 0$, $i \in [n]$, such that $\hat\kappa_{ij}d_j/d_i \le \hat t$ for all $(i,j) \in E$. We define the following weighted directed graph. We associate the weight $w_{ij} = \log\hat t - \log\hat\kappa_{ij}$ with every $(i,j) \in E$, and add an extra source vertex $r$ with edges $(r,i)$ of weight $w_{ri} = 0$ for all $i \in [n]$. By the choice of $\hat t$, this graph does not contain any negative weight directed cycles. We can compute the shortest paths from $r$ to all nodes in $O(n^3)$ using the Bellman-Ford algorithm; let $\sigma_i$ be the shortest path label for $i$. We then set $d_i = \exp(\sigma_i)$. One can avoid computing logarithms by using a multiplicative variant of the Bellman-Ford algorithm instead.

The running time of the whole algorithm will be bounded by $O(n^2m^2 + n^3)$. The running time is dominated by the $O(n^2m^2)$ complexity of Finding-Circuits($A$) and the $O(n^3)$ complexity of solving the minimum-mean cycle problem and shortest path computation.
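The two steps of this proof (the optimal value of (10) via a mean-cycle computation, then the labels via Bellman-Ford) can be illustrated by the following minimal sketch. It is not the paper's implementation: the function name `optimal_rescaling`, the dictionary input format, and the use of logarithms with a Karp-style table (rather than the multiplicative variants mentioned above) are assumptions made for illustration, and floating-point rounding means the constraints of (10) hold only up to a small tolerance.

```python
import math

def optimal_rescaling(n, kappa_hat):
    """Sketch for problem (10); names and input format are hypothetical.

    kappa_hat maps edges (i, j) with i, j in range(n) to positive estimates.
    Returns (t_hat, d) with kappa_hat[i, j] * d[j] / d[i] <= t_hat (up to rounding).
    """
    edges = [(i, j, math.log(k)) for (i, j), k in kappa_hat.items()]
    neg_inf = float("-inf")
    # D[k][v] = maximum log-weight of a walk with exactly k edges ending at v.
    D = [[0.0] * n] + [[neg_inf] * n for _ in range(n)]
    for k in range(1, n + 1):
        for i, j, w in edges:
            if D[k - 1][i] > neg_inf:
                D[k][j] = max(D[k][j], D[k - 1][i] + w)
    # Karp-style formula for the maximum mean cycle (in log space).
    best = neg_inf
    for v in range(n):
        if D[n][v] == neg_inf:
            continue
        worst = min((D[n][v] - D[k][v]) / (n - k)
                    for k in range(n) if D[k][v] > neg_inf)
        best = max(best, worst)
    if best == neg_inf:
        raise ValueError("the graph has no directed cycle")
    t_hat = math.exp(best)
    # Bellman-Ford from a virtual source r with zero-weight edges (r, i):
    # edge weights are w_ij = log(t_hat) - log(kappa_hat_ij).
    sigma = [0.0] * n
    for _ in range(n):
        for i, j, w in edges:
            sigma[j] = min(sigma[j], sigma[i] + best - w)
    d = [math.exp(s) for s in sigma]
    return t_hat, d
```

For instance, with two indices and estimates 4 and 1 on the two opposite edges, the only cycle has geometric mean 2, and the labels returned balance both rescaled ratios to that value.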

3 A scaling-invariant layered least squares interior-point algorithm

3.1 Preliminaries on interior-point methods

In this section, we introduce the standard definitions, concepts and results from the interior-point literature that will be required for our algorithm. We consider an LP problem in the form (LP), or equivalently, in the subspace form (2) for $W = \operatorname{Ker}(A)$. We let

\[ \mathcal{P}^{++} = \{x \in \mathbb{R}^n : Ax = b,\ x > 0\}, \qquad \mathcal{D}^{++} = \{(y,s) \in \mathbb{R}^{m+n} : A^\top y + s = c,\ s > 0\}. \]

Recall the central path defined in (CP), with $w(\mu) = (x(\mu), y(\mu), s(\mu))$ denoting the central path point corresponding to $\mu > 0$. We let $w^* = (x^*, y^*, s^*)$ denote the primal and dual optimal solutions to (LP) that correspond to the limit of the central path for $\mu \to 0$.

For a point $w = (x,y,s) \in \mathcal{P}^{++}\times\mathcal{D}^{++}$, the normalized duality gap is $\mu(w) = x^\top s/n$.

The $\ell_2$-neighborhood of the central path with opening $\beta > 0$ is the set

\[ \mathcal{N}(\beta) = \left\{ w \in \mathcal{P}^{++}\times\mathcal{D}^{++} : \left\| \frac{xs}{\mu(w)} - e \right\| \le \beta \right\}. \]

Throughout the paper, we will assume $\beta$ is chosen from $(0, 1/4]$; in Algorithm 2 we use the value $\beta = 1/8$. The following proposition gives a bound on the distance between $w$ and $w(\mu)$ if $w \in \mathcal{N}(\beta)$. See e.g. [Gon92, Lemma 5.4], [MT03, Proposition 2.1].
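As a small illustration of these definitions, the sketch below computes the normalized duality gap and the centrality measure $\|xs/\mu(w) - e\|$ for given positive vectors. The function names are hypothetical, and the sketch assumes that primal and dual feasibility ($Ax = b$, $A^\top y + s = c$) has been checked separately; it only tests the centrality condition.

```python
import math

def duality_gap(x, s):
    """Normalized duality gap mu(w) = x^T s / n."""
    return sum(xi * si for xi, si in zip(x, s)) / len(x)

def centrality(x, s):
    """Euclidean norm of xs / mu(w) - e (componentwise product xs)."""
    mu = duality_gap(x, s)
    return math.sqrt(sum((xi * si / mu - 1.0) ** 2 for xi, si in zip(x, s)))

def in_neighborhood(x, s, beta):
    """Centrality test for N(beta); feasibility of (x, s) is assumed, not checked."""
    return centrality(x, s) <= beta
```

A point with all products $x_is_i$ equal lies on the central path and has centrality zero, so it passes the test for every opening $\beta > 0$.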

Proposition 3.1. Let $w = (x,y,s) \in \mathcal{N}(\beta)$ for $\beta \in (0, 1/4]$ and $\mu = \mu(w)$, and consider the central path point $w(\mu) = (x(\mu), y(\mu), s(\mu))$. For each $i \in [n]$,
\[ \frac{x_i}{1+2\beta} \le \frac{1-2\beta}{1-\beta}\cdot x_i \le x_i(\mu) \le \frac{x_i}{1-\beta}, \quad\text{and} \]
\[ \frac{s_i}{1+2\beta} \le \frac{1-2\beta}{1-\beta}\cdot s_i \le s_i(\mu) \le \frac{s_i}{1-\beta}. \]

We will often use the following proposition, which is immediate from the definition of $\mathcal{N}(\beta)$.

Proposition 3.2. Let $w = (x,y,s) \in \mathcal{N}(\beta)$ for $\beta \in (0, 1/4]$, and $\mu = \mu(w)$. Then for each $i \in [n]$
\[ (1-\beta)\sqrt{\mu} \le \sqrt{s_ix_i} \le (1+\beta)\sqrt{\mu}. \]

Proof. By definition of $\mathcal{N}(\beta)$ we have for all $i \in [n]$ that $\left|\frac{x_is_i}{\mu} - 1\right| \le \left\|\frac{xs}{\mu} - e\right\| \le \beta$ and so $(1-\beta)\mu \le x_is_i \le (1+\beta)\mu$. Taking roots gives the results.

A key property of the central path is "near monotonicity", formulated in the following lemma, see [VY96, Lemma 16].

Lemma 3.3. Let $w = (x,y,s)$ be a central path point for $\mu$ and $w' = (x',y',s')$ be a central path point for $\mu' \le \mu$. Then $\|x'/x + s'/s\|_\infty \le n$. Further, for the optimal solution $w^* = (x^*,y^*,s^*)$ corresponding to the central path limit $\mu \to 0$, we have $\|x^*/x\|_1 + \|s^*/s\|_1 = n$.

Proof. We show that $\|x'/x\|_1 + \|s'/s\|_1 \le 2n$ for any feasible primal $x'$ and dual $(y',s')$ such that $(x')^\top s' \le x^\top s = n\mu$; this implies the first statement with the weaker bound $2n$. For the stronger bound $\|x'/x + s'/s\|_\infty \le n$, see the proof of [VY96, Lemma 16]. Since $x - x' \in W$ and $s - s' \in W^\perp$, we have $(x-x')^\top(s-s') = 0$. This can be rewritten as $x^\top s' + (x')^\top s = x^\top s + (x')^\top s'$. By our assumption on $x'$ and $s'$, the right hand side is bounded by $2n\mu$. Dividing by $\mu$, and noting that $x_is_i = \mu$ for all $i \in [n]$, we obtain
\[ \left\|\frac{x'}{x}\right\|_1 + \left\|\frac{s'}{s}\right\|_1 = \sum_{i=1}^n \left(\frac{x'_i}{x_i} + \frac{s'_i}{s_i}\right) \le 2n . \]

The second statement follows by applying this to central path points $(x',y',s')$ with parameter $\mu'$, and taking the limit $\mu' \to 0$.

3.2 The affine scaling and layered-least-squares steps

Given $w = (x,y,s) \in \mathcal{P}^{++}\times\mathcal{D}^{++}$, the search directions commonly used in interior-point methods are obtained as the solution $(\Delta x, \Delta y, \Delta s)$ to the following linear system for some $\sigma \in [0,1]$.
\[ A\Delta x = 0 \tag{11} \]
\[ A^\top\Delta y + \Delta s = 0 \tag{12} \]
\[ s\Delta x + x\Delta s = \sigma\mu e - xs \tag{13} \]

Predictor-corrector methods, such as the Mizuno-Todd-Ye Predictor-Corrector (MTY P-C) algorithm [MTY93], alternate between two types of steps. In predictor steps, we use $\sigma = 0$. This direction is also called the affine scaling direction, and will be denoted as $\Delta w^a = (\Delta x^a, \Delta y^a, \Delta s^a)$ throughout. In corrector steps, we use $\sigma = 1$. This gives the centrality direction, denoted as $\Delta w^c = (\Delta x^c, \Delta y^c, \Delta s^c)$.
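To make the system (11)-(13) concrete, here is a minimal sketch that solves it in the special case of a single equality constraint ($m = 1$), where the normal equations reduce to one scalar division. The restriction to $m = 1$, the function name `newton_direction`, and the list-based representation are assumptions for illustration only; a general implementation would solve an $m\times m$ linear system instead.

```python
def newton_direction(a, x, s, sigma):
    """Solve (11)-(13) when A consists of the single row `a` (hypothetical helper).

    Eliminating ds = -a * dy and dx = (r - x * ds) / s, with r = sigma*mu*e - xs,
    the condition a^T dx = 0 gives a scalar equation for dy.
    """
    n = len(x)
    mu = sum(xi * si for xi, si in zip(x, s)) / n
    r = [sigma * mu - xi * si for xi, si in zip(x, s)]
    dy = -sum(ai * ri / si for ai, ri, si in zip(a, r, s)) / \
        sum(ai * ai * xi / si for ai, xi, si in zip(a, x, s))
    ds = [-ai * dy for ai in a]                                    # equation (12)
    dx = [(ri - xi * dsi) / si for ri, xi, dsi, si in zip(r, x, ds, s)]  # equation (13)
    return dx, dy, ds
```

Calling it with `sigma=0` gives the affine scaling (predictor) direction, and with `sigma=1` the centrality (corrector) direction.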

In the predictor steps, we make progress along the central path. Given the search direction on the current iterate $w = (x,y,s) \in \mathcal{N}(\beta)$, the step-length is chosen maximal such that we remain in $\mathcal{N}(2\beta)$, i.e.
\[ \alpha^a := \sup\{\alpha \in [0,1] : \forall\alpha' \in [0,\alpha] : w + \alpha'\Delta w^a \in \mathcal{N}(2\beta)\}. \]
Thus, we obtain a point $w^+ = w + \alpha^a\Delta w^a \in \mathcal{N}(2\beta)$. The corrector step finds a next iterate $w^c = w^+ + \Delta w^c$, where $\Delta w^c$ is the centrality direction computed at $w^+$. The next proposition summarizes well-known properties, see e.g. [Ye97, Section 4.5.1].

Proposition 3.4. Let $w = (x,y,s) \in \mathcal{N}(\beta)$ for $\beta \in (0, 1/4]$.

(i) For the affine scaling step, we have $\mu(w^+) = (1-\alpha^a)\mu(w)$.

(ii) The affine scaling step-length is
\[ \alpha^a \ge \max\left\{\frac{\beta}{\sqrt{n}},\ 1 - \frac{\|\Delta x^a\Delta s^a\|}{\beta\mu(w)}\right\}. \]

(iii) For $w^+ \in \mathcal{N}(2\beta)$, and $w^c = w^+ + \Delta w^c$, we have $\mu(w^c) = \mu(w^+)$ and $w^c \in \mathcal{N}(\beta)$.

(iv) After a sequence of $O(\sqrt{n}\,t)$ predictor and corrector steps, we obtain an iterate $w' = (x',y',s') \in \mathcal{N}(\beta)$ such that $\mu(w') \le \mu(w)/2^t$.

Minimum norm viewpoint and residuals

For any point $w = (x,y,s) \in \mathcal{P}^{++}\times\mathcal{D}^{++}$ we define
\[ \delta = \delta(w) = s^{1/2}x^{-1/2} \in \mathbb{R}^n . \tag{14} \]
With this notation, we can write (13) for $\sigma = 0$ in the form
\[ \delta\Delta x + \delta^{-1}\Delta s = -s^{1/2}x^{1/2} . \tag{15} \]

Note that for a point $w(\mu) = (x(\mu),y(\mu),s(\mu))$ on the central path, we have $\delta_i(w(\mu)) = s_i(\mu)/\sqrt{\mu} = \sqrt{\mu}/x_i(\mu)$ for all $i \in [n]$. From Proposition 3.1, we see that if $w \in \mathcal{N}(\beta)$, and $\mu = \mu(w)$, then for each $i \in [n]$,
\[ \sqrt{1-2\beta}\cdot\delta_i(w(\mu)) \le \delta_i(w) \le \frac{1}{\sqrt{1-2\beta}}\cdot\delta_i(w(\mu)) . \tag{16} \]

The matrix $\mathrm{Diag}(\delta(w))$ will be often used for rescaling in the algorithm. That is, for the current iterate $w = (x,y,s)$ in the interior-point method, we will perform projections in the space $\mathrm{Diag}(\delta(w))W$. To simplify notation, for $\delta = \delta(w)$, we use $L^\delta_I$ and $\kappa^\delta_{ij}$ as shorthands for $L^{\mathrm{Diag}(\delta)W}_I$ and $\kappa^{\mathrm{Diag}(\delta)W}_{ij}$. The subspace $W = \operatorname{Ker}(A)$ will be fixed throughout.

It is easy to see from the optimality conditions that the components of the affine scaling direction $\Delta w^a = (\Delta x^a, \Delta y^a, \Delta s^a)$ are the optimal solutions of the following minimum-norm problems.
\[ \begin{aligned} \Delta x^a &= \arg\min_{\Delta x \in \mathbb{R}^n}\{\|\delta(x+\Delta x)\|^2 : A\Delta x = 0\} \\ (\Delta y^a, \Delta s^a) &= \arg\min_{(\Delta y,\Delta s) \in \mathbb{R}^m\times\mathbb{R}^n}\{\|\delta^{-1}(s+\Delta s)\|^2 : A^\top\Delta y + \Delta s = 0\} \end{aligned} \tag{17} \]

Following [MT05], for a search direction $\Delta w = (\Delta x, \Delta y, \Delta s)$, we define the residuals as
\[ Rx := \frac{\delta(x+\Delta x)}{\sqrt{\mu}}, \qquad Rs := \frac{\delta^{-1}(s+\Delta s)}{\sqrt{\mu}} . \tag{18} \]

We let $Rx^a$ and $Rs^a$ denote the residuals for the affine scaling direction $\Delta w^a$. Hence, the primal affine scaling direction $\Delta x^a$ is the one that minimizes the $\ell_2$-norm of the primal residual $Rx^a$, and the dual affine scaling direction $(\Delta y^a, \Delta s^a)$ minimizes the $\ell_2$-norm of the dual residual $Rs^a$. The next lemma summarizes simple properties of the residuals, see [MT05].

Lemma 3.5. For $\beta \in (0, 1/4]$ such that $w = (x,y,s) \in \mathcal{N}(\beta)$ and the affine scaling direction $\Delta w^a = (\Delta x^a, \Delta y^a, \Delta s^a)$, we have

(i)
\[ Rx^aRs^a = \frac{\Delta x^a\Delta s^a}{\mu}, \qquad Rx^a + Rs^a = \frac{x^{1/2}s^{1/2}}{\sqrt{\mu}} , \tag{19} \]

(ii) $\|Rx^a\|^2 + \|Rs^a\|^2 = n$,

(iii) We have $\|Rx^a\|, \|Rs^a\| \le \sqrt{n}$, and for each $i \in [n]$, $\max\{Rx^a_i, Rs^a_i\} \ge \frac{1}{2}(1-\beta)$.

(iv)
\[ Rx^a = -\frac{1}{\sqrt{\mu}}\delta^{-1}\Delta s^a, \qquad Rs^a = -\frac{1}{\sqrt{\mu}}\delta\Delta x^a . \]

Proof. Parts (i) and (iv) are immediate from the definitions and from (11)-(13) and (15). In part (ii), we use part (i) and $(Rx^a)^\top Rs^a = 0$. In part (iii), the first statement follows by part (ii), and the second statement follows from (i) and Proposition 3.2.

For a subset $I \subset [n]$, we define
\[ \varepsilon^a_I(w) := \max_{i\in I}\min\{|Rx^a_i|, |Rs^a_i|\}, \quad\text{and}\quad \varepsilon^a(w) := \varepsilon^a_{[n]}(w) . \tag{20} \]
The next claim shows that for the affine scaling direction, a small $\varepsilon^a(w)$ yields a long step; see [MT05, Lemma 2.5].

Lemma 3.6. Let $w = (x,y,s) \in \mathcal{N}(\beta)$ for $\beta \in (0, 1/4]$. Then for the affine scaling step, we have
\[ \frac{\mu(w + \alpha^a\Delta w^a)}{\mu(w)} \le \min\left\{1 - \frac{\beta}{\sqrt{n}},\ \frac{\sqrt{n}\,\varepsilon^a(w)}{\beta}\right\}. \]

Proof. Let $\varepsilon := \varepsilon^a(w)$. From Lemma 3.5(i), we get $\|\Delta x^a\Delta s^a\|/\mu = \|Rx^aRs^a\|$. We can bound $\|Rx^aRs^a\| \le \varepsilon(\|Rx^a\| + \|Rs^a\|) \le \varepsilon\sqrt{n}$, where the latter inequality follows by Lemma 3.5(iii). From Proposition 3.4(ii), we get $\alpha^a \ge \max\{\beta/\sqrt{n},\ 1 - \sqrt{n}\varepsilon/\beta\}$. The claim follows by part (i) of the same proposition.
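The residuals (18), the quantity $\varepsilon^a(w)$ from (20), and the bound of Lemma 3.6 can be checked numerically on a toy instance. The sketch below again assumes a single-row constraint matrix so that the affine scaling direction has a closed form; the function names are hypothetical and the code is meant as an illustration of the definitions, not as the paper's implementation.

```python
import math

def affine_residuals(a, x, s):
    """Residuals Rx^a, Rs^a of (18) for A = single row `a` (illustrative assumption)."""
    n = len(x)
    mu = sum(xi * si for xi, si in zip(x, s)) / n
    # Affine scaling direction: system (11)-(13) with sigma = 0.
    r = [-xi * si for xi, si in zip(x, s)]
    dy = -sum(ai * ri / si for ai, ri, si in zip(a, r, s)) / \
        sum(ai * ai * xi / si for ai, xi, si in zip(a, x, s))
    ds = [-ai * dy for ai in a]
    dx = [(ri - xi * dsi) / si for ri, xi, dsi, si in zip(r, x, ds, s)]
    delta = [math.sqrt(si / xi) for xi, si in zip(x, s)]          # definition (14)
    root_mu = math.sqrt(mu)
    rx = [d * (xi + dxi) / root_mu for d, xi, dxi in zip(delta, x, dx)]
    rs = [(si + dsi) / (d * root_mu) for d, si, dsi in zip(delta, s, ds)]
    return rx, rs

def eps_affine(rx, rs):
    """epsilon^a(w) = max_i min(|Rx_i|, |Rs_i|), as in (20)."""
    return max(min(abs(p), abs(q)) for p, q in zip(rx, rs))

def lemma_3_6_bound(n, eps, beta):
    """Upper bound of Lemma 3.6 on mu(w + alpha^a dw^a) / mu(w)."""
    return min(1.0 - beta / math.sqrt(n), math.sqrt(n) * eps / beta)
```

On any positive pair $(x, s)$ the computed residuals satisfy the identities of Lemma 3.5(i) and (ii) up to rounding, which gives a quick sanity check of the definitions.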

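3.2.1 The layered-least-squares direction

Let $\mathcal{J} = (J_1, J_2, \dots, J_p)$ be an ordered partition of $[n]$.² For $k \in [p]$, we use the notations $J_{<k} := J_1\cup\dots\cup J_{k-1}$, $J_{>k} := J_{k+1}\cup\dots\cup J_p$, and similarly $J_{\le k}$ and $J_{\ge k}$. We will also refer to the sets $J_k$ as layers, and $\mathcal{J}$ as a layering. Layers with lower indices will be referred to as 'higher' layers.

Given $w = (x,y,s) \in \mathcal{P}^{++}\times\mathcal{D}^{++}$, and the layering $\mathcal{J}$, the layered-least-squares (LLS) direction is defined as follows. For the primal direction, we proceed backwards, with $k = p, p-1, \dots, 1$. Assume the components on the lower layers $\Delta x^{ll}_{J_{>k}}$ have already been determined. We define the components in $J_k$ as the coordinate projection $\Delta x^{ll}_{J_k} = \pi_{J_k}(X_k)$, where the affine subspace $X_k$ is defined as the set of minimizers
\[ X_k := \arg\min_{\Delta x\in\mathbb{R}^n}\left\{\|\delta_{J_k}(x_{J_k} + \Delta x_{J_k})\|^2 : A\Delta x = 0,\ \Delta x_{J_{>k}} = \Delta x^{ll}_{J_{>k}}\right\}. \tag{21} \]

The dual direction $\Delta s^{ll}$ is determined in the forward order of the layers $k = 1, 2, \dots, p$. Assume we already fixed the components $\Delta s^{ll}_{J_{<k}}$ on the higher layers. The components in $J_k$ are then the coordinate projection $\Delta s^{ll}_{J_k} = \pi_{J_k}(S_k)$ of the set of minimizers
\[ S_k := \arg\min\left\{\|\delta^{-1}_{J_k}(s_{J_k} + \Delta s_{J_k})\|^2 : \Delta y \in \mathbb{R}^m,\ A^\top\Delta y + \Delta s = 0,\ \Delta s_{J_{<k}} = \Delta s^{ll}_{J_{<k}}\right\}. \tag{22} \]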

3.3 Overview of ideas and techniques

A key technique in the analysis of layered least-squares algorithms [VY96, MT03, LMT09] is to argue about variables that have 'converged'. According to Proposition 3.1 and Lemma 3.3, for any iterate $w = (x,y,s) \in \mathcal{N}(\beta)$ and the limit optimal solution $w^* = (x^*,y^*,s^*)$, the bounds $x^*_i \le O(n)x_i$ and $s^*_i \le O(n)s_i$ hold. We informally say that $x_i$ (or $s_i$) has converged, if $x_i \le O(n)x^*_i$ ($s_i \le O(n)s^*_i$) hold for the current iterate or for any earlier iterate. Thus, the value of $x_i$ (or $s_i$) remains within a multiplicative factor $O(n^2)$ for the rest of the algorithm. Note that if $\mu > \mu'$ and $x_i$ has converged at $\mu$, then $\frac{s_i(\mu')/s_i(\mu)}{\mu'/\mu} \in \left[\frac{1}{O(n^2)}, O(n^2)\right]$; thus, $s_i$ keeps "shooting down" with the central path parameter.

Converged variables in the affine scaling algorithm

Let us start by showing that at any point of the algorithm, at least one primal or dual variable has converged. Suppose for simplicity that our current iterate is exactly on the central path, i.e., that $xs = \mu e$. This assumption will be maintained throughout this overview. In this case, the residuals can be simply written as $Rx^a = (x+\Delta x^a)/x$, $Rs^a = (s+\Delta s^a)/s$. Recall from (17) that the affine scaling direction corresponds to minimizing the residuals $Rx^a$ and $Rs^a$. From this choice, we see that
\[ \left\|\frac{x^*}{x}\right\| \ge \left\|\frac{x+\Delta x^a}{x}\right\|, \qquad \left\|\frac{s^*}{s}\right\| \ge \left\|\frac{s+\Delta s^a}{s}\right\| . \tag{23} \]

² In contrast to how ordered partitions were defined in [MT05], we use the term ordered only to the $p$-tuple $(J_1,\dots,J_p)$, which is to be viewed independently of $\delta$.

We have $\|Rx^a\|^2 + \|Rs^a\|^2 = n$ by Lemma 3.5(ii). Let us assume $\|Rx^a\|^2 \ge n/2$; thus, there exists an $i \in [n]$ such that $x^*_i \ge x_i/\sqrt{2}$. In other words, just by looking at the residuals, we get the guarantee that a primal or a dual variable has already converged. Based on the value of the residuals, we can guarantee this to be a primal or a dual variable, but cannot identify which particular $x_i$ or $s_i$ this might be.

For $\|Rx^a\|^2 \ge n/2$, a primal variable has already converged before performing the predictor and corrector steps. We now show that even if $\|Rx^a\|$ is small, a primal variable will have converged after a single iteration. From (23), we see that there is an index $i$ with $x^*_i/x_i \ge \|Rx^a\|/\sqrt{n}$. Furthermore, Proposition 3.4(ii) and Lemma 3.5 imply that $1-\alpha \le \|Rx^a\|\cdot\|Rs^a\|/\beta \le \sqrt{n}\|Rx^a\|/\beta$, since $\|Rs^a\| \le \sqrt{n}$. The predictor step moves to $x^+ := x + \alpha\Delta x^a = (1-\alpha)x + \alpha(x+\Delta x^a)$. Hence, $x^+ \le \left(\frac{\sqrt{n}\|Rx^a\|}{\beta} + \|Rx^a\|\right)x$. Putting the two inequalities together, we learn that $x^+_i \le O(n)x^*_i$ for some $i \in [n]$. Since $w^+ = (x^+,y^+,s^+) \in \mathcal{N}(2\beta)$, Proposition 3.1 implies that $x_i$ will have converged after this iteration. An analogous argument proves that some $s_j$ will also have converged after the iteration. We again emphasize that the argument only shows the existence of converged variables but we cannot identify them in general.

Measuring combinatorial progress

Tying the above together, we find that after a single affine scaling step, at least one primal variable $x_i$ and at least one dual variable $s_j$ has converged. This means that for any $\mu' < \mu$, $\frac{x_i(\mu')/x_j(\mu')}{x_i(\mu)/x_j(\mu)} \in \left[\frac{\mu}{O(n^4)\mu'}, \frac{O(n^4)\mu}{\mu'}\right]$; thus, the ratio of these variables keeps asymptotically increasing.

The $x_i/x_j$ ratios serve as the main progress measure in the Vavasis–Ye algorithm. If $x_i/x_j$ is between $1/(\mathrm{poly}(n)\bar\chi)$ and $\mathrm{poly}(n)\bar\chi$ before the affine scaling step for the pair of converged variables $x_i$ and $s_j$, then after $\mathrm{poly}(n)\log\bar\chi$ iterations, the $x_i/x_j$ ratio must leave this interval and never return. Thus, we obtain a 'crossover-event' that cannot again occur for the same pair of variables. In the affine scaling algorithm, there is no guarantee that $x_i/x_j$ falls in such a bounded interval for the converging variables $x_i$ and $s_j$; in particular, we may obtain the same pairs of converged variables after each step. The main purpose of layered-least-squares methods is to proactively force that in every certain number of iterations, some 'bounded' $x_i/x_j$ ratios become 'large' and remain so for the rest of the algorithm.

In our approach, the first main insight is to focus on the scaling invariant quantities $\kappa^W_{ij}x_i/x_j$ instead. For simplicity's sake, we first present the algorithm with the assumption that all values $\kappa^W_{ij}$ are known. We will then explain how this assumption can be removed by using gradually improving estimates on the values.

The combinatorial progress will be observed in the 'long edge graph'. For a primal-dual feasible point $w = (x,y,s)$ and $\sigma = 1/O(n^6)$, this is defined as $G_{w,\sigma} = ([n], E_{w,\sigma})$ with edges $(i,j)$ such that $\kappa^W_{ij}x_i/x_j \ge \sigma$. Observe that for any $i,j \in [n]$, at least one of $(i,j)$ and $(j,i)$ is a long edge: this follows since for any circuit $C$ with $i,j \in C$, we get lower bounds $|g^C_j/g^C_i| \le \kappa^W_{ij}$ and $|g^C_i/g^C_j| \le \kappa^W_{ji}$.

Intuitively, our algorithm will enforce the following two types of events. The analysis in Section 4 is based on a potential function analysis capturing roughly the same progress.

• For an iterate $w$ and a value $\mu > 0$, we have $i,j \in [n]$ in a strongly connected component in $G_{w,\sigma}$ of size $\le \tau$, and for any iterate $w'$ with $\mu(w') > \mu$, if $i,j$ are in a strongly connected component of $G_{w',\sigma}$ then this component has size $\ge 2\tau$.

• For an iterate $w$ and a value $\mu > 0$, we have $(i,j) \notin E_{w,\sigma}$, and for any iterate $w'$ with $\mu(w') > \mu$ we have $(i,j) \in E_{w',\sigma}$.

At most $O(n^2\log n)$ such events can happen overall, so if we can prove that on average an event will happen every $O(\sqrt{n}\log(\bar\chi^*_A + n))$ iterations or the algorithm terminates, then we have the desired convergence bound of $O(n^{2.5}\log(n)\log(\bar\chi^*_A + n))$ iterations.

[Figure 1: three panels, one per iteration. Each panel shows the primal variables ($x$, $x^*$) and the dual variables ($s$, $s^*$) for the indices $i$, $j$ over the layers $J_1$, $J_2$, $J_3$.]

Figure 1: Top-down we have a chart of primal/dual variables and the estimated subgraph of the circuit ratio digraph (Definition 3.11) for three different iterations: 1) All variables are far away from their optimal values. 2) On $J_1$ there is a primal variable ($i$) and dual variable ($j$) that have converged, i.e. $x_i$ is close to $x^*_i$ and $s_j$ is close to $s^*_j$. 3) $j$ moves to layer $J_2$ due to a change in the underlying subgraph of the circuit ratio digraph.

Converged variables cause combinatorial progress

We now show that combinatorial progress as above must happen in the affine scaling step in the case when the graph $G_{w,\sigma}$ is strongly connected. As noted above, for the pair of converged variables $x_i$ and $s_j$ after the affine scaling step, $x_i/x_j$, and thus $\kappa^W_{ij}x_i/x_j$, will asymptotically increase by a factor 2 in every $O(\sqrt{n})$ iterations.

By the strong connectivity assumption, there is a directed path in the long edge graph from $i$ to $j$ of length at most $n-1$. Each edge has length at least $\sigma$, and by the cycle characterization (Theorem 2.12) we know that $(\kappa^W_{ji}x_j/x_i)\cdot\sigma^{n-1} \le (\kappa^*_W)^n$. As such, $\kappa^W_{ji}x_j/x_i \le (\kappa^*_W)^n/\sigma^{n-1}$. Since $\kappa^W_{ij}\kappa^W_{ji} \le (\kappa^*_W)^2$ by the same theorem, we obtain the lower bound $\kappa^W_{ij}x_i/x_j \ge \sigma^{n-1}(\kappa^*_W)^{-n+2}$.

This means that after $O(\sqrt{n}\log((\kappa^*_W/\sigma)^n)) = O(n^{1.5}\log(\kappa^*_W + n))$ affine scaling steps, the weight of the edge $(i,j)$ will be more than $(\kappa^*_W/\sigma)^{4n}$. There can never again be a length $n$ or shorter path from $j$ to $i$ in the long edge graph, for otherwise the resulting cycle would violate Theorem 2.12. Moreover, by the triangle inequality (Lemma 2.15), any other $k \neq i,j$ will have either $(i,k)$ or $(k,j)$ of length at least $(\kappa^*_W/\sigma)^{2n}$, similarly causing a pair of variables to never again be in the same connected component. As such, we took $O(n^{1.5}\log(\kappa^*_W + n))$ affine scaling steps and in that time at least $n-1$ combinatorial progress events have occurred.

The layered least squares step

Similarly to the Vavasis–Ye algorithm [VY96] and subsequent literature, our algorithm is a predictor-corrector method using layered least squares (LLS) steps as in Section 3.2.1 for certain predictor iterations. Our algorithm (Algorithm 2) uses LLS steps only sometimes, and most steps are the simpler affine scaling steps; but for simplicity of this overview, we can assume every predictor iteration uses an LLS step.

We define the ordered partition $\mathcal{J} = (J_1, J_2, \dots, J_p)$ corresponding to the strongly connected components in topological ordering. Recalling that either $(i,j)$ or $(j,i)$ is a long edge for every pair $i,j \in [n]$, this order is unique and such that there is a complete directed graph of long edges from every $J_k$ to $J_{k'}$ for $1 \le k < k' \le p$.

The first important property of the LLS step is that it is very close to the affine scaling step.
In Section 3.4.1, we introduce the partition lifting cost $\ell^W(\mathcal{J}) = \max_{2\le k\le p}\ell^W(J_{\ge k})$ as the cost of lifting from lower to higher layers; we let $\ell^{1/x}(\mathcal{J})$ be a shorthand for $\ell^{\mathrm{Diag}(1/x)W}(\mathcal{J})$. Note that this same rescaling is used for the affine scaling step in (17), since $\delta = \sqrt{\mu}/x$ if $w$ is on the central path. In Lemma 3.10(ii), we show that for a small partition lifting cost, the LLS residuals will remain near the affine scaling residuals. Namely,
\[ \|Rx^{ll} - Rx^a\|,\ \|Rs^{ll} - Rs^a\| \le 6n^{3/2}\ell^{1/x}(\mathcal{J}) . \]
Recall that the LLS residuals can be written as $Rx^{ll} = (x+\Delta x^{ll})/x$, $Rs^{ll} = (s+\Delta s^{ll})/s$ for a point on the central path. For $\mathcal{J}$ defined as above, Lemma 2.11 yields $\ell^{1/x}(\mathcal{J}) \le n\max_{i\in J_{>k},\,j\in J_{\le k},\,k\in[p]}\kappa^W_{ij}x_i/x_j$. This will be sufficiently small as this maximum is taken over 'short' edges (not in $E_{w,\sigma}$).

A second, crucial property of the LLS step is that it "splits" our LP into $p$ separate LPs that have "negligible" interaction. Namely, the direction $(\Delta x^{ll}_{J_k}, \Delta s^{ll}_{J_k})$ will be very close to the affine scaling step obtained in the problem restricted to the subspace $W_{\mathcal{J},k} = \{x_{J_k} : x \in W,\ x_{J_{>k}} = 0\}$ (Lemma 3.10(i)).

Since each component $J_k$ is strongly connected in the long edge graph $G_{w,\sigma}$, if there is at least one primal $x_i$ and dual $s_j$ in $J_k$ that have converged after the LLS step, we can use the above argument to show combinatorial progress regarding the $\kappa^W_{ij}x_i/x_j$ value (Lemma 4.3).

Exploiting the proximity between the LLS and affine scaling steps, Lemma 3.10(iv) gives a lower bound on the step size $\alpha \ge 1 - \frac{3\sqrt{n}}{\beta}\max_{i\in[n]}\min\{|Rx^{ll}_i|, |Rs^{ll}_i|\}$. Let $J_k$ be the component where $\min\{\|Rx^{ll}_{J_k}\|, \|Rs^{ll}_{J_k}\|\}$ is the largest. Hence, the step size $\alpha$ can be lower bounded in terms of $\min\{\|Rx^{ll}_{J_k}\|, \|Rs^{ll}_{J_k}\|\}$.

The analysis now distinguishes two cases. Let $w^+ = w + \alpha\Delta w^{ll}$ be the point obtained by the predictor LLS step. If the corresponding partition lifting cost $\ell^{1/x^+}(\mathcal{J})$ is still small, then an argument similar to the one that showed the convergence of primal and dual variables in the affine scaling step will imply that after the LLS step, at least one $x_i$ and one $s_j$ will have converged for $i,j \in J_k$. Thus, in this case we obtain the combinatorial progress (Lemma 4.4).

The remaining case is when $\ell^{1/x^+}(\mathcal{J})$ becomes large. In Lemma 4.5, we show that in this case a new edge will enter the long edge graph, corresponding to the second combinatorial event listed previously. Intuitively, in this case one layer "crashes" into another.

Refined estimates on circuit imbalances

In the above overview, we assumed the circuit imbalance values $\kappa^W_{ij}$ are given, and thus the graph $G_{w,\sigma}$ is available. Whereas these quantities are difficult to compute, we can naturally work with lower estimates. For each $i,j \in [n]$ that are contained in a circuit together, we start with the lower bound $\hat\kappa^W_{ij} = |g^C_j/g^C_i|$ obtained for an arbitrary circuit $C$ with $i,j \in C$. We use the graph $\hat G_{w,\sigma} = ([n], \hat E_{w,\sigma})$ corresponding to these estimates. Clearly, $\hat E_{w,\sigma} \subseteq E_{w,\sigma}$, but some long edges may be missing. We determine the partition $\mathcal{J}$ of the strongly connected components of $\hat G_{w,\sigma}$ and estimate the partition lifting cost $\ell^{1/x}(\mathcal{J})$. If this is below the desired bound, the argument works correctly. Otherwise, we can identify a pair $i,j$ responsible for this failure. Namely, we find a circuit $C$ with $i,j \in C$ such that $\hat\kappa^W_{ij} < |g^C_j/g^C_i|$. In this case, we update our estimate, and recompute the partition; this is described in Algorithm 1. At each LLS step, the number of updates is bounded by $n$, since every update leads to a decrease in the number of partition classes. This finishes the overview of the algorithm.

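3.4 A linear system viewpoint of layered least squares

We now continue with the detailed exposition of our algorithm. We present an equivalent definition of the LLS step introduced in Section 3.2.1, generalizing the linear system (12)–(13). We use the subspace notation. With this notation, (12)–(13) for the affine scaling direction can be written as
\[ s\Delta x^a + x\Delta s^a = -xs, \quad \Delta x^a \in W, \quad\text{and}\quad \Delta s^a \in W^\perp , \tag{24} \]
which is further equivalent to $\delta\Delta x^a + \delta^{-1}\Delta s^a = -x^{1/2}s^{1/2}$.

Given the layering $\mathcal{J}$ and $w = (x,y,s)$, for each $k \in [p]$ we define the subspaces
\[ W_{\mathcal{J},k} := \{x_{J_k} : x \in W,\ x_{J_{>k}} = 0\} \quad\text{and}\quad W^\perp_{\mathcal{J},k} := \{x_{J_k} : x \in W^\perp,\ x_{J_{<k}} = 0\}. \]
The primal LLS step $\Delta x^{ll}$ is the unique solution to
\[ \delta\Delta x^{ll} + \delta^{-1}\Delta s = -x^{1/2}s^{1/2}, \quad \Delta x^{ll} \in W, \quad\text{and}\quad \Delta s \in W^\perp_{\mathcal{J},1}\oplus\dots\oplus W^\perp_{\mathcal{J},p} , \tag{25} \]
and the dual LLS step $\Delta s^{ll}$ is the unique solution to
\[ \delta\Delta x + \delta^{-1}\Delta s^{ll} = -x^{1/2}s^{1/2}, \quad \Delta x \in W_{\mathcal{J},1}\oplus\dots\oplus W_{\mathcal{J},p}, \quad\text{and}\quad \Delta s^{ll} \in W^\perp . \tag{26} \]
It is important to note that $\Delta s$ in (25) may be different from $\Delta s^{ll}$, and $\Delta x$ in (26) may be different from $\Delta x^{ll}$. In fact, $\Delta s^{ll} = \Delta s$ and $\Delta x^{ll} = \Delta x$ can only be the case for the affine scaling step.

The following lemma proves that the above linear systems are indeed uniquely solved by the LLS step.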

Lemma 3.7. For $t \in \mathbb{R}^n$, $W \subseteq \mathbb{R}^n$, $\delta \in \mathbb{R}^n_{++}$, and $\mathcal{J} = (J_1, J_2, \dots, J_p)$, let $w = \mathrm{LLS}^{W,\delta}_{\mathcal{J}}(t)$ be defined by
\[ \delta w + \delta^{-1}v = \delta t, \quad w \in W, \quad v \in W^\perp_{\mathcal{J},1}\oplus\dots\oplus W^\perp_{\mathcal{J},p} . \]
Then $\mathrm{LLS}^{W,\delta}_{\mathcal{J}}(t)$ is well-defined and
\[ \|\delta_{J_k}(t_{J_k} - w_{J_k})\| = \min\left\{\|\delta_{J_k}(t_{J_k} - z_{J_k})\| : z \in W,\ z_{J_{>k}} = w_{J_{>k}}\right\} \]
for every $k \in [p]$.

In the notation of the above lemma we have, for ordered partitions $\mathcal{J} = (J_1, J_2, \dots, J_p)$, $\bar{\mathcal{J}} = (J_p, J_{p-1}, \dots, J_1)$, and $(x,y,s) \in \mathcal{P}^{++}\times\mathcal{D}^{++}$ with $\delta = s^{1/2}x^{-1/2}$, that $\Delta x^{ll} = \mathrm{LLS}^{W,\delta}_{\mathcal{J}}(-x)$ and $\Delta s^{ll} = \mathrm{LLS}^{W^\perp,\delta^{-1}}_{\bar{\mathcal{J}}}(-s)$.

Proof of Lemma 3.7. We first prove the equality $W\cap(W^\perp_{\mathcal{J},1}\oplus\dots\oplus W^\perp_{\mathcal{J},p}) = \{0\}$, and by a similar argument we have $W^\perp\cap(W_{\mathcal{J},1}\oplus\dots\oplus W_{\mathcal{J},p}) = \{0\}$. By duality, this last equality tells us that
\[ \left(W^\perp\cap(W_{\mathcal{J},1}\oplus\dots\oplus W_{\mathcal{J},p})\right)^\perp = W + (W^\perp_{\mathcal{J},1}\oplus\dots\oplus W^\perp_{\mathcal{J},p}) = \mathbb{R}^n . \]
Thus, the linear decomposition defining $\mathrm{LLS}^{W,\delta}_{\mathcal{J}}(t)$ has a solution and its solution is unique.

Suppose $y \in W\cap(W^\perp_{\mathcal{J},1}\oplus\dots\oplus W^\perp_{\mathcal{J},p})$. We prove $y_{J_k} = 0$ by induction on $k$, starting at $k = p$. The induction hypothesis is that $y_{J_{>k}} = 0$, which is an empty requirement when $k = p$. The hypothesis $y_{J_{>k}} = 0$ together with the assumption $y \in W$ is equivalent to $y \in W\cap\mathbb{R}^n_{J_{\le k}}$, and implies $y_{J_k} \in \pi_{J_k}(W\cap\mathbb{R}^n_{J_{\le k}}) =: W_{\mathcal{J},k}$. Since we also have $y_{J_k} \in W^\perp_{\mathcal{J},k}$ by assumption, which is the orthogonal complement of $W_{\mathcal{J},k}$, we must have $y_{J_k} = 0$. Hence, by induction $y = 0$. This finishes the proof that $\mathrm{LLS}^{W,\delta}_{\mathcal{J}}(t)$ is well-defined.

Next we prove that $w$ is a minimizer of $\min\{\|\delta_{J_k}(t_{J_k} - z_{J_k})\| : z \in W,\ z_{J_{>k}} = w_{J_{>k}}\}$. The optimality condition is for $\delta_{J_k}(t_{J_k} - z_{J_k})$ to be orthogonal to $\delta_{J_k}u$ for any $u \in W_{\mathcal{J},k}$. By the LLS equation, we have $\delta_{J_k}(t_{J_k} - w_{J_k}) = \delta^{-1}_{J_k}v_{J_k}$, where $v_{J_k} \in W^\perp_{\mathcal{J},k}$. Noting then that $\langle\delta_{J_k}u, \delta^{-1}_{J_k}v_{J_k}\rangle = \langle u_{J_k}, v_{J_k}\rangle = 0$ for $u \in W_{\mathcal{J},k}$, the optimality condition follows immediately.

With these tools, we can prove that the lifting costs are self-dual. This explains the reverse order in the dual vs primal LLS step and justifies our attention on the lifting cost in a self-dual algorithm. The next proposition generalizes the result of [GL97].

Proposition 3.8 (Proof on p. 41). For a linear subspace $W \subseteq \mathbb{R}^n$ and index set $I \subseteq [n]$ with $J = [n]\setminus I$,
\[ \|L^W_I\| \le \max\{1, \|L^{W^\perp}_J\|\} . \]
In particular, $\ell^W(I) = \ell^{W^\perp}(J)$.

We defer the proof to Section 5. Note that this proposition also implies Proposition 2.1(iv).

3.4.1 Partition lifting scores

A key insight is that if the layering $\mathcal{J}$ is "well-separated", then we indeed have $x\Delta s^{ll} + s\Delta x^{ll} \approx -xs$, that is, the LLS direction is close to the affine scaling direction. This will be shown in Lemma 3.10. The notion of "well-separatedness" can be formalized as follows. Recall the definition of the lifting score (4). The lifting score of the layering $\mathcal{J} = (J_1, J_2, \dots, J_p)$ of $[n]$ with respect to $W$ is defined as
\[ \ell^W(\mathcal{J}) := \max_{2\le k\le p}\ell^W(J_{\ge k}) . \]
For $\delta \in \mathbb{R}^n_{++}$, we use $\ell^{W,\delta}(I) := \ell^{\mathrm{Diag}(\delta)W}(I)$ and $\ell^{W,\delta}(\mathcal{J}) := \ell^{\mathrm{Diag}(\delta)W}(\mathcal{J})$. When the context is clear, we omit $W$ and write $\ell^\delta(I) := \ell^{W,\delta}(I)$ and $\ell^\delta(\mathcal{J}) := \ell^{W,\delta}(\mathcal{J})$.

The following important duality claim asserts that the lifting score of a layering equals the lifting score of the reverse layering in the orthogonal complement subspace. It is an immediate consequence of Proposition 3.8.

Lemma 3.9. Let $W \subseteq \mathbb{R}^n$ be a linear subspace, $\delta \in \mathbb{R}^n_{++}$. For an ordered partition $\mathcal{J} = (J_1, J_2, \dots, J_p)$, let $\bar{\mathcal{J}} = (J_p, J_{p-1}, \dots, J_1)$ denote the reverse ordered partition. Then, we have
\[ \ell^{W,\delta}(\mathcal{J}) = \ell^{W^\perp,\delta^{-1}}(\bar{\mathcal{J}}) . \]

Proof. Let $U = \mathrm{Diag}(\delta)W$. Note that $U^\perp = \mathrm{Diag}(\delta^{-1})W^\perp$. Then by Proposition 3.8, for $2 \le k \le p$, we have that
\[ \ell^{W,\delta}(J_{\ge k}) = \ell^U(J_{\ge k}) = \ell^{U^\perp}(J_{\le k-1}) = \ell^{U^\perp}(\bar J_{\ge p-k+2}) = \ell^{W^\perp,\delta^{-1}}(\bar J_{\ge p-k+2}) . \]
In particular, $\ell^{W,\delta}(\mathcal{J}) = \ell^{W^\perp,\delta^{-1}}(\bar{\mathcal{J}})$, as needed.

The next lemma summarizes key properties of the LLS steps, assuming the partition has a small lifting score. We show that if $\ell^\delta(\mathcal{J})$ is sufficiently small, then on the one hand, the LLS step will be very close to the affine scaling step, and on the other hand, on each layer $k \in [p]$, it will be very close to the affine scaling step restricted to this layer for the subspace $W_{\mathcal{J},k}$. The proof is deferred to Section 5.

Lemma 3.10 (Proof on p. 45). Let $w = (x,y,s) \in \mathcal{N}(\beta)$ for $\beta \in (0, 1/4]$, let $\mu = \mu(w)$ and $\delta = \delta(w)$. Let $\mathcal{J} = (J_1, \dots, J_p)$ be a layering with $\ell^\delta(\mathcal{J}) \le \beta/(32n^2)$, and let $\Delta w^{ll} = (\Delta x^{ll}, \Delta y^{ll}, \Delta s^{ll})$ denote the LLS direction for the layering $\mathcal{J}$. Then the following properties hold.

(i) We have
\[ \|\delta_{J_k}\Delta x^{ll}_{J_k} + \delta^{-1}_{J_k}\Delta s^{ll}_{J_k} + x^{1/2}_{J_k}s^{1/2}_{J_k}\| \le 6n\,\ell^\delta(\mathcal{J})\sqrt{\mu}, \quad \forall k \in [p], \text{ and} \tag{27} \]
\[ \|\delta\Delta x^{ll} + \delta^{-1}\Delta s^{ll} + x^{1/2}s^{1/2}\| \le 6n^{3/2}\ell^\delta(\mathcal{J})\sqrt{\mu} . \tag{28} \]

(ii) For the affine scaling direction $\Delta w^a = (\Delta x^a, \Delta y^a, \Delta s^a)$,
\[ \|Rx^{ll} - Rx^a\|,\ \|Rs^{ll} - Rs^a\| \le 6n^{3/2}\ell^\delta(\mathcal{J}) . \]

(iii) For the residuals of the LLS steps we have $\|Rx^{ll}\|, \|Rs^{ll}\| \le \sqrt{2n}$. For each $i \in [n]$, $\max\{|Rx^{ll}_i|, |Rs^{ll}_i|\} \ge \frac{1}{2} - \frac{3}{4}\beta$.

(iv) Let $\varepsilon^{ll}(w) = \max_{i\in[n]}\min\{|Rx^{ll}_i|, |Rs^{ll}_i|\}$, and define the step length as
\[ \alpha := \sup\{\alpha' \in [0,1] : \forall\bar\alpha \in [0,\alpha'] : w + \bar\alpha\Delta w^{ll} \in \mathcal{N}(2\beta)\}. \]
We obtain the following bounds on the progress in the LLS step:
\[ \mu(w + \alpha\Delta w^{ll}) = (1-\alpha)\mu, \quad\text{and}\quad \alpha \ge 1 - \frac{3\sqrt{n}\,\varepsilon^{ll}(w)}{\beta} . \]

(v) We have $\varepsilon^{ll}(w) = 0$ if and only if $\alpha = 1$. These are further equivalent to $w + \Delta w^{ll} = (x+\Delta x^{ll}, y+\Delta y^{ll}, s+\Delta s^{ll})$ being an optimal solution to (LP).

3.5 The layering procedure

Our algorithm performs LLS steps on a layering with a low lifting score. A further requirement is that within each layer, the circuit imbalances $\kappa^\delta_{ij}$ defined in (6) are suitably bounded. The rescaling here is with respect to $\delta = \delta(w)$ for the current iterate $w = (x,y,s)$. To define the precise requirement on the layering, we first introduce an auxiliary graph. Throughout we use the parameter
\[ \gamma := \frac{\beta}{2^{10}n^5} . \tag{29} \]

The auxiliary graph

For a vector $\delta \in \mathbb{R}^n_{++}$ and $\sigma > 0$, we define the directed graph $G_{\delta,\sigma} = ([n], E_{\delta,\sigma})$ such that $(i,j) \in E_{\delta,\sigma}$ if $\kappa^\delta_{ij} \ge \sigma$. This is a subgraph of the circuit ratio digraph studied in Section 2, including only the edges where the circuit ratio is at least the threshold $\sigma$. Note that we do not have direct access to this graph, as we cannot efficiently compute the values $\kappa^\delta_{ij}$.

At the beginning of the entire algorithm, we run the subroutine Find-Circuits($A$) as in Theorem 2.14, where $W = \operatorname{Ker}(A)$. We assume the matroid $\mathcal{M}(A)$ is non-separable. For a separable matroid, we can solve the subproblems of our LP on the components separately. Thus, for each $i \neq j$, $i,j \in [n]$, we obtain an estimate $\hat\kappa_{ij} \le \kappa_{ij}$. These estimates will be gradually improved throughout the algorithm.

Note that $\kappa^\delta_{ij} = \kappa_{ij}\delta_j/\delta_i$ and $\hat\kappa^\delta_{ij} = \hat\kappa_{ij}\delta_j/\delta_i$. If $\hat\kappa^\delta_{ij} \ge \sigma$, then we are guaranteed $(i,j) \in E_{\delta,\sigma}$.

Definition 3.11. Define $\hat G_{\delta,\sigma} = ([n], \hat E_{\delta,\sigma})$ to be the directed graph with edges $(i,j)$ such that $\hat\kappa^\delta_{ij} \ge \sigma$; clearly, $\hat G_{\delta,\sigma}$ is a subgraph of $G_{\delta,\sigma}$.

Lemma 3.12. Let $\delta \in \mathbb{R}^n_{++}$. For every $i \neq j$, $i,j \in [n]$, $\hat\kappa^\delta_{ij}\cdot\hat\kappa^\delta_{ji} \ge 1$. Consequently, for any $0 < \sigma \le 1$, at least one of $(i,j) \in \hat E_{\delta,\sigma}$ or $(j,i) \in \hat E_{\delta,\sigma}$.

Proof. We show that this property holds at the initialization. Since the estimates can only increase, it remains true throughout the algorithm. Recall the definition of $\hat\kappa_{ij}$ from Theorem 2.14. This is defined as the maximum of $|g_j/g_i|$ such that $g \in W$, $\operatorname{supp}(g) = C$ for some $C \in \hat{\mathcal{C}}$ containing $i$ and $j$. For the same vector $g$, we get $\hat\kappa_{ji} \ge |g_i/g_j|$. Consequently, $\hat\kappa_{ij}\cdot\hat\kappa_{ji} \ge 1$, and also $\hat\kappa^\delta_{ij}\cdot\hat\kappa^\delta_{ji} \ge 1$. The second claim follows by the assumption $\sigma \le 1$.

Balanced layerings

We are ready to define the requirements on the layering in the algorithm. In the algorithm, $\delta = \delta(w)$ will correspond to the scaling of the current iterate $w = (x,y,s)$.
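The estimated auxiliary graph of Definition 3.11 and the ordering of its strongly connected components can be sketched as follows. This is an illustrative helper, not the paper's subroutine: the function names and the dense list-of-lists input for $\hat\kappa$ are assumptions, and the component computation uses the standard Kosaraju two-pass search (swapped in here for concreteness), whose second pass discovers the components in a topological order of the condensation.

```python
def estimated_graph(kappa_hat, delta, sigma):
    """Edges (i, j) with kappa_hat[i][j] * delta[j] / delta[i] >= sigma (hypothetical helper)."""
    n = len(delta)
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and kappa_hat[i][j] * delta[j] / delta[i] >= sigma}

def ordered_components(n, edges):
    """Strongly connected components, listed so that edges between distinct
    components point from earlier to later components (Kosaraju's algorithm)."""
    out = [[] for _ in range(n)]
    inc = [[] for _ in range(n)]
    for i, j in edges:
        out[i].append(j)
        inc[j].append(i)
    seen = [False] * n
    order = []

    def visit(u):
        # First pass: record vertices by increasing finishing time.
        seen[u] = True
        for v in out[u]:
            if not seen[v]:
                visit(v)
        order.append(u)

    for u in range(n):
        if not seen[u]:
            visit(u)
    comp = [-1] * n
    components = []

    def collect(u, label):
        # Second pass: search along reversed edges.
        comp[u] = label
        components[label].append(u)
        for v in inc[u]:
            if comp[v] == -1:
                collect(v, label)

    for u in reversed(order):
        if comp[u] == -1:
            components.append([])
            collect(u, len(components) - 1)
    return components
```

When the estimates satisfy $\hat\kappa_{ij}\hat\kappa_{ji} \ge 1$ and $\sigma \le 1$, every pair of indices is joined by at least one edge, as in Lemma 3.12, so the returned order of the components is unique.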

Definition 3.13. Let $\delta \in \mathbb{R}^n_{++}$. The layering $\mathcal{J} = (J_1, J_2, \dots, J_p)$ of $[n]$ is $\delta$-balanced if

(i) $\ell^\delta(\mathcal{J}) \le \gamma$, and

(ii) $J_k$ is strongly connected in $G_{\delta,\gamma/n}$ for all $k \in [p]$.

The following lemma shows that within each layer, the $\kappa^\delta_{ij}$ values are within a bounded range. This will play an important role in our potential analysis.

Lemma 3.14. Let $0 < \sigma < 1$ and $t > 0$, and $i,j \in [n]$, $i \neq j$.

(i) If the graph $G_{\delta,\sigma}$ contains a directed path of at most $t-1$ edges from $j$ to $i$, then
\[ \kappa^\delta_{ij} < \left(\frac{\kappa^*}{\sigma}\right)^t . \]

(ii) If $G_{\delta,\sigma}$ contains a directed path of at most $t-1$ edges from $i$ to $j$, then
\[ \kappa^\delta_{ij} > \left(\frac{\sigma}{\kappa^*}\right)^t . \]

Proof. For part (i), let $j = i_1, i_2, \dots, i_h = i$ be a path in $G_{\delta,\sigma}$ in $J$ from $j$ to $i$ with $h \le t$. That is, $\kappa^\delta_{i_\ell i_{\ell+1}} \ge \sigma$ for each $\ell \in [h]$. Theorem 2.12 yields
\[ (\kappa^*)^t \ge \kappa^\delta_{ij}\cdot\sigma^{h-1} > \kappa^\delta_{ij}\cdot\sigma^t , \]
since $h \le t$ and $\sigma < 1$. Part (ii) follows using part (i) for $j$ and $i$, and that $\kappa^\delta_{ij}\cdot\kappa^\delta_{ji} \ge 1$ according to Lemma 3.12.

Description of the layering subroutine. Consider an iterate $w = (x,y,s) \in \mathcal{N}(\beta)$ of the algorithm with $\delta = \delta(w)$. The subroutine Layering(δ, κ̂), described in Algorithm 1, constructs a δ-balanced layering. We recall that the approximated auxiliary graph $\hat G_{\delta,\gamma/n}$ with respect to $\hat\kappa$ is as in Definition 3.11.

Algorithm 1: Layering(δ, κ̂)
Input: $\delta \in \mathbb{R}^n_{++}$ and $\hat\kappa \in \mathbb{R}^E_{++}$.
Output: δ-balanced layering $\mathcal{J} = (J_1, \ldots, J_p)$ and updated values $\hat\kappa \in \mathbb{R}^E_{++}$.
  Compute the strongly connected components $C_1, C_2, \ldots, C_\ell$ of $\hat G_{\delta,\gamma/n}$, listed in the ordering imposed by $\hat G_{\delta,\gamma/n}$;
  $\bar E \leftarrow \hat E_{\delta,\gamma/n}$;
  for $k = 2, \ldots, \ell$ do
    Call Verify-Lift(Diag(δ)W, $C_{\ge k}$, γ) that answers 'pass' or 'fail';
    if the answer is 'fail' then
      Let $i \in C_{\ge k}$, $j \in C$

We now give an overview of the subroutine Layering(δ, κ̂). We start by computing the strongly connected components (SCCs) of the directed graph $\hat G_{\delta,\gamma/n}$. The edges of this graph are obtained using the current estimates $\hat\kappa^\delta_{ij}$. According to Lemma 3.12, we have $(i,j) \in \hat E_{\delta,\gamma/n}$ or $(j,i) \in \hat E_{\delta,\gamma/n}$ for every $i, j \in [n]$, $i \ne j$. Hence, there is a linear ordering of the components $C_1, C_2, \ldots, C_\ell$ such that $(u,v) \in \hat E_{\delta,\gamma/n}$ whenever $u \in C_i$, $v \in C_j$, and $i < j$. We call this the ordering imposed by $\hat G_{\delta,\gamma/n}$.

Next, for each $k = 2, \ldots, \ell$, we use the subroutine Verify-Lift(Diag(δ)W, $C_{\ge k}$, γ) described in Lemma 2.11. If the subroutine returns 'pass', then we conclude $\ell^\delta(C_{\ge k}) \le \gamma$, and proceed to the next layer. If the answer is 'fail', then the subroutine returns as certificates $i \in C_{\ge k}$, $j \in C$

$C_\ell$ between those containing $i$ and $j$ will be merged into a single strongly connected component. To see this, recall that if $i' \in C_\ell$ and $j' \in C_{\ell'}$ for $\ell < \ell'$, then $(i', j') \in \hat E_{\delta,\gamma/n}$ according to Lemma 3.12.

Finally, we compute the strongly connected components of $([n], \bar E)$. We let $J_1, J_2, \ldots, J_p$ denote their unique acyclic order, and return these layers.

Lemma 3.15. The subroutine Layering(δ, κ̂) returns a δ-balanced layering in $O(nm^2 + n^2)$ time.

The difficult part of the proof is showing the running time bound. We note that the weaker bound $O(n^2m^2)$ can be obtained by a simpler argument.

Proof. We first verify that the output layering is indeed δ-balanced. For property (i) of Definition 3.13, note that each $J_q$ component is the union of some of the $C_k$'s. In particular, for every $q \in [p]$, the set $J_{\ge q} = C_{\ge k}$ for some $k \in [\ell]$. Assume now $\ell^\delta(C_{\ge k}) > \gamma$. At step $k$ of the main cycle, the subroutine Verify-Lift returned the answer 'fail', and a new edge $(i,j) \in \bar E$ was added with $i \in C_{\ge k}$, $j \in C$

SCC layering. Our task is to compute the matrices $B_k$, $2 \le k \le \ell$, needed in the calls to Verify-Lift on $W, C_{\ge k}$, $2 \le k \le \ell$, in total $O(n(n-m)^2)$ time. We achieve this in three steps working with the basis matrix $M$ as above. Firstly, by applying column operations to M, we compute sets Ik Ck and Dk = [ I k ] [ Ik by using the columns of M . We then let B = M and decrement k. To justify correctness, •,Dk k C

3.6 The overall algorithm

Algorithm 2 presents the overall algorithm LP-Solve(A, b, c, $w^0$). We assume that an initial feasible solution $w^0 = (x^0, y^0, s^0) \in \mathcal{N}(\beta)$ is given. We address this in Section 7, by adapting the extended system used in [VY96]. We note that this subroutine requires an upper bound on $\bar\chi^*$. Since computing $\bar\chi^*$ is hard, we can implement it by a doubling search on $\log \bar\chi^*$, as explained in Section 7. Other than for initialization, the algorithm does not require an estimate on $\bar\chi^*$.

The algorithm starts with the subroutine Find-Circuits(A) as in Theorem 2.14. The iterations are similar to the MTY Predictor-Corrector algorithm [MTY93]. The main difference is that certain affine scaling steps are replaced by LLS steps. In every predictor step, we compute the affine scaling direction, and consider the quantity $\varepsilon^a(w) = \max_{i \in [n]} \min\{|Rx^a_i|, |Rs^a_i|\}$. If this is at least the threshold $10n^{3/2}\gamma$, then we perform the affine scaling step. However, in case $\varepsilon^a(w) < 10n^{3/2}\gamma$, we use the LLS direction instead. In each such iteration, we call the subroutine Layering(δ, κ̂) (Algorithm 1) to compute the layers, and we compute the LLS step for this layering.

Another important difference is that the algorithm does not require a final rounding step. It terminates with the exact optimal solution $w^*$ once a predictor step is able to perform a full step with $\alpha = 1$.
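The choice between the two predictor directions depends only on the affine scaling residuals and the threshold $10n^{3/2}\gamma$. A minimal Python sketch of that test follows; the function names are hypothetical, the residual vectors are assumed to be given as plain lists, and the numerical values are made up for illustration.

```python
def epsilon(rx, rs):
    """max over i of min(|Rx_i|, |Rs_i|) for two residual vectors of equal length."""
    return max(min(abs(a), abs(b)) for a, b in zip(rx, rs))


def choose_predictor_direction(rx_a, rs_a, gamma):
    """Return 'LLS' if eps^a(w) < 10 * n^{3/2} * gamma, and 'affine' otherwise."""
    n = len(rx_a)
    threshold = 10 * n ** 1.5 * gamma
    return "LLS" if epsilon(rx_a, rs_a) < threshold else "affine"


# Illustrative residuals with n = 2 and gamma = 0.001 (threshold about 0.028).
small = choose_predictor_direction([0.9, 0.01], [0.02, 0.95], gamma=0.001)
large = choose_predictor_direction([0.9, 0.5], [0.4, 0.95], gamma=0.001)
```

In the first call every index has one small residual, so the LLS direction is selected; in the second call both residuals of an index are large, so the affine scaling step is kept.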

Theorem 3.16. For given $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$, and an initial feasible solution $w^0 = (x^0, y^0, s^0) \in \mathcal{N}(1/8)$, Algorithm 2 finds an optimal solution to (LP) in $O(n^{2.5} \log n \log(\bar\chi^*_A + n))$ iterations.

Remark 3.17. Whereas using LLS steps enables us to give a strong bound on the total number of iterations, finding LLS directions has a significant computational overhead as compared to finding affine scaling directions. The layering $\mathcal{J}$ can be computed in time $O(nm^2)$ (Lemma 3.15), and the

Algorithm 2: LP-Solve(A, b, c, $w^0$)
Input: $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$, and an initial feasible solution $w^0 = (x^0, y^0, s^0) \in \mathcal{N}(1/8)$ to (LP).
Output: Optimal solution $w^* = (x^*, y^*, s^*)$ to (LP).
  Call Find-Circuits(A) to obtain the lower bounds $\hat\kappa_{ij}$ for each $i, j \in [n]$, $i \ne j$;
  $k \leftarrow 0$, $\alpha \leftarrow 0$;
  repeat
    /* Predictor step */
    Compute affine scaling direction $\Delta w^a = (\Delta x^a, \Delta y^a, \Delta s^a)$ for $w$;
    if $\varepsilon^a(w) < 10n^{3/2}\gamma$ then   // Recall $\varepsilon^a(w)$ defined in (20)
      $\delta \leftarrow (s^k)^{1/2}(x^k)^{-1/2}$;
      $(\mathcal{J}, \hat\kappa) \leftarrow$ Layering(δ, κ̂);
      Compute Layered Least Squares direction $\Delta w^{ll} = (\Delta x^{ll}, \Delta y^{ll}, \Delta s^{ll})$ for the layering $\mathcal{J}$ and $w$;
      $\Delta w \leftarrow \Delta w^{ll}$;
    else
      $\Delta w \leftarrow \Delta w^a$;
    $\alpha \leftarrow \sup\{\alpha' \in [0,1] : \forall \bar\alpha \in [0, \alpha'] : w + \bar\alpha \Delta w \in \mathcal{N}(1/4)\}$;
    $w' \leftarrow w^k + \alpha \Delta w$;
    /* Corrector step */
    Compute centrality direction $\Delta w^c = (\Delta x^c, \Delta y^c, \Delta s^c)$ for $w'$;
    $w^{k+1} \leftarrow w' + \Delta w^c$;
    $k \leftarrow k + 1$;
  until $\mu(w^k) = 0$;
  return $w^k = (x^k, y^k, s^k)$.

LLS steps also require $O(nm^2)$ time, see [VY96, MMT98]. This is in contrast to the computational cost $O(n^\omega)$ of an affine scaling direction. Here $\omega < 2.373$ is the matrix multiplication constant [VW12].

We now sketch a possible approach to amortize the computational cost of the LLS steps over the sequence of affine scaling steps. It was shown in [MT05] that for the MTY P-C algorithm, the "bad" scenario between two crossover events amounts to a series of affine scaling steps where the progress in $\mu$ increases exponentially from every iteration to the next. This corresponds to the term $O(\min\{n^2 \log\log(\mu_0/\eta), \log(\mu_0/\eta)\})$ in their running time analysis. Roughly speaking, such a sequence of affine scaling steps indicates that an LLS step is necessary. Hence, we could observe these accelerating sequences of affine scaling steps, and perform an LLS step after we see a sequence of length $O(\log n)$. The progress made by these affine scaling steps offsets the cost of computing the LLS direction.

4 The potential function and the overall analysis

Let $\mu > 0$ and $\delta(\mu) = s(\mu)^{1/2} x(\mu)^{-1/2} = \sqrt{\mu}/x(\mu) = s(\mu)/\sqrt{\mu}$ correspond to the point on the central path and recall the definition of $\gamma$ in (29). For $i, j \in [n]$, $i \ne j$, we define
$$\varrho^\mu(i,j) := \frac{\log \kappa^{\delta(\mu)}_{ij}}{\log(4n\kappa^*_A/\gamma)} ,$$

and the main potentials in the algorithm as

$$\Psi^\mu(i,j) := \max\Bigl\{1, \min\Bigl\{2n, \inf_{0 < \mu' < \mu} \varrho^{\mu'}(i,j)\Bigr\}\Bigr\} \quad\text{and}\quad \Psi(\mu) := \sum_{i,j \in [n],\, i \ne j} \log \Psi^\mu(i,j) .$$

The quantity $\Psi^\mu(i,j)$ is motivated by the bounds in Lemma 3.14. This statement together with (16) imply Lemma 4.1.

Lemma 4.1. Let $w = (x,y,s) \in \mathcal{N}(\beta)$ for $\beta \in (0, 1/4]$, let $\mu = \mu(w)$, and $\delta = \delta(w)$. Let $i, j \in [n]$, $i \ne j$.
1. If $G_{\delta,\gamma/(4n)}$ contains a path from $j$ to $i$ of at most $t-1$ edges, then $\varrho^\mu(i,j) < t$.
2. If $G_{\delta,\gamma/(4n)}$ contains a path from $i$ to $j$ of at most $t-1$ edges, then $\varrho^\mu(i,j) > -t$.
3. If $\Psi^\mu(i,j) \ge t$, then $i$ and $j$ cannot be together on a layer of size at most $t$, and $j$ cannot be on a layer preceding the layer containing $i$ in any $\delta(w')$-balanced layering, where $w' = (x', y', s') \in \mathcal{N}(\beta)$ with $\mu(w') < \mu$.

Our potentials $\Psi^\mu(i,j)$ can be seen as fine-grained analogues of the crossover events analyzed in [VY96, MT03, MT05]. Roughly speaking, a crossover event corresponds to $\Psi^\mu(i,j)$ increasing above $n$, meaning that $i$ and $j$ cannot be contained in the same layer after the normalized duality gap decreases below $\mu$.

In what follows, we formulate four important lemmas crucial for the proof of Theorem 3.16. For the lemmas, we only highlight some key ideas here, and defer the full proofs to Section 6.

For a triple $w \in \mathcal{N}(\beta)$, $\Delta w^{ll}$ refers to the LLS direction found in the algorithm, and $Rx^{ll}$ and $Rs^{ll}$ denote the residuals as in (18). For a subset $I \subset [n]$ recall the definition
$$\varepsilon^{ll}_I(w) := \max_{i \in I} \min\{|Rx^{ll}_i|, |Rs^{ll}_i|\} .$$
We introduce another important quantity $\xi$ for the analysis:

$$\xi^{ll}_I(w) := \min\{\|Rx^{ll}_I\|, \|Rs^{ll}_I\|\}$$
for a subset $I \subset [n]$. For a layering $\mathcal{J} = (J_1, J_2, \ldots, J_p)$, we let
$$\xi^{ll}_{\mathcal{J}}(w) = \max_{k \in [p]} \xi^{ll}_{J_k}(w) .$$
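The two residual measures differ in where the minimum is taken: $\varepsilon^{ll}_I$ takes it coordinate by coordinate, whereas $\xi^{ll}_I$ compares the two norms over the whole index set. The following Python sketch illustrates the difference; it assumes 0-based indices, residuals given as plain lists, and uses hypothetical function names and sample values.

```python
import math


def eps_ll(rx, rs, index_set):
    """max over i in index_set of min(|Rx_i|, |Rs_i|)."""
    return max(min(abs(rx[i]), abs(rs[i])) for i in index_set)


def restricted_norm(v, index_set):
    """Euclidean norm of v restricted to index_set."""
    return math.sqrt(sum(v[i] ** 2 for i in index_set))


def xi_ll(rx, rs, index_set):
    """min of the two restricted residual norms on index_set."""
    return min(restricted_norm(rx, index_set), restricted_norm(rs, index_set))


def xi_layering(rx, rs, layers):
    """max over the layers J_k of xi_ll on J_k."""
    return max(xi_ll(rx, rs, layer) for layer in layers)


# Illustrative residuals on three indices, split into two layers.
rx = [1.0, 0.0, 0.6]
rs = [0.0, 1.0, 0.8]
layers = [[0], [1, 2]]
value = xi_layering(rx, rs, layers)
```

On the second layer the primal norm is about 0.6 and the dual norm about 1.28, so `xi_ll` returns the primal one; the first layer contributes 0, and `value` is therefore about 0.6.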

The key idea of the analysis is to extract information about the optimal solution $w^* = (x^*, y^*, s^*)$ from the LLS direction. The first main lemma shows that if $\|Rx^{ll}_{J_q}\|$ is large on some layer $J_q$, then for at least one index $i \in J_q$, $x^*_i/x_i \ge 1/\mathrm{poly}(n)$, i.e., the variable $x_i$ has "converged". The analogous statement holds on the dual side for $\|Rs^{ll}_{J_q}\|$ and an index $j \in J_q$.

Lemma 4.2 (Proof on p. 47). Let $w = (x,y,s) \in \mathcal{N}(\beta)$ for $\beta \in (0, 1/8]$ and let $w^* = (x^*, y^*, s^*)$ be the optimal solution corresponding to $\mu^* = 0$ on the central path. Let further $\mathcal{J} = (J_1, \ldots, J_p)$ be a $\delta(w)$-balanced layering (Definition 3.13), and let $\Delta w^{ll} = (\Delta x^{ll}, \Delta y^{ll}, \Delta s^{ll})$ be the corresponding LLS direction. Then the following statement holds for every $q \in [p]$:
(i) There exists $i \in J_q$ such that

$$x^*_i \ge \frac{2x_i}{3\sqrt{n}} \cdot \bigl(\|Rx^{ll}_{J_q}\| - 2\gamma n\bigr) . \tag{30}$$

(ii) There exists $j \in J_q$ such that

$$s^*_j \ge \frac{2s_j}{3\sqrt{n}} \cdot \bigl(\|Rs^{ll}_{J_q}\| - 2\gamma n\bigr) . \tag{31}$$

We outline the main idea of the proof of part (i); part (ii) follows analogously using the duality of the lifting scores (Lemma 3.9). On layer $q$, the LLS step minimizes $\|\delta_{J_q}(x_{J_q} + \Delta x_{J_q})\|$, subject to $\Delta x_{J_{>q}} = \Delta x^{ll}_{J_{>q}}$ and subject to existence of ∆xJq) γ due to $\delta(w)$-balancedness, we can show the existence of a point $z \in W + x^*$ such that $\|\delta_{J_q}(z_{J_q} - x^*_{J_q})\|$ is small, and $z_{J_{>q}} = x_{J_{>q}} + \Delta x^{ll}_{J_{>q}}$. By the choice of $\Delta x^{ll}_{J_q}$, we have $\|\delta_{J_q} z_{J_q}\| \ge \|\delta_{J_q}(x_{J_q} + \Delta x^{ll}_{J_q})\| = \sqrt{\mu}\,\|Rx^{ll}_{J_q}\|$. Therefore, $\|\delta_{J_q} x^*_{J_q}\|/\sqrt{\mu}$ cannot be much smaller than $\|Rx^{ll}_{J_q}\|$. Noting that $\delta_{J_q} x^*_{J_q}/\sqrt{\mu} \approx x^*_{J_q}/x_{J_q}$, we obtain a lower bound on $x^*_i/x_i$ for some $i \in J_q$.

We emphasize that the lemma only shows the existence of such indices $i$ and $j$, but does not provide an efficient algorithm to identify them. It is also useful to note that for any $i \in [n]$, $\max\{|Rx^{ll}_i|, |Rs^{ll}_i|\} \ge \frac{1}{2} - \frac{3}{4}\beta$ according to Lemma 3.10(iii). Thus, for each $q \in [p]$, we obtain a positive lower bound in either case (i) or case (ii).

The next lemma allows us to argue that the potential function $\Psi^{\cdot}(\cdot,\cdot)$ increases for multiple pairs of variables, if we have strong lower bounds on both $x^*_i$ and $s^*_j$ for some $i, j \in [n]$, along with a lower and upper bound on $\varrho^\mu(i,j)$.

Lemma 4.3 (Proof on p. 48). Let $w = (x,y,s) \in \mathcal{N}(2\beta)$ for $\beta \in (0, 1/8]$, let $\mu = \mu(w)$ and $\delta = \delta(w)$. Let $i, j \in [n]$ and $2 \le \tau \le n$ such that for the optimal solution $w^* = (x^*, y^*, s^*)$, we have $x^*_i \ge \beta x_i/(2^{10} n^{5.5})$ and $s^*_j \ge \beta s_j/(2^{10} n^{5.5})$, and assume $\varrho^\mu(i,j) \ge -\tau$. After $O(\beta^{-1}\sqrt{n}\,\tau \log(\bar\chi^* + n))$ further iterations the duality gap $\mu'$ fulfills $\Psi^{\mu'}(i,j) \ge 2\tau$, and for every $\ell \in [n] \setminus \{i,j\}$, either $\Psi^{\mu'}(i,\ell) \ge 2\tau$, or $\Psi^{\mu'}(\ell,j) \ge 2\tau$.

We note that $i$ and $j$ as in the lemma are necessarily different, since $i = j$ would imply $0 = x^*_i s^*_i \ge \beta^2 \mu/(2^{20} n^{11}) > 0$.

Let us illustrate the idea of the proof of $\Psi^{\mu'}(i,j) \ge 2\tau$. For $i$ and $j$ as in the lemma, and for a central path element $w' = w(\mu')$ for $\mu' < \mu$, we have $x'_i \ge x^*_i/n \ge \beta x_i/(2^{10} n^{6.5})$ and $s'_j \ge s^*_j/n \ge \beta s_j/(2^{10} n^{6.5})$ by the near-monotonicity of the central path (Lemma 3.3). Note that
$$\kappa^{\delta'}_{ij} = \kappa_{ij} \cdot \frac{\delta'_j}{\delta'_i} = \kappa_{ij} \cdot \frac{x'_i s'_j}{\mu'} \ge \kappa_{ij} \cdot \frac{\beta^2 x_i s_j}{2^{20} n^{13} \mu'} \ge \frac{\beta^2 (1-\beta)}{2^{20} n^{13}} \cdot \kappa^\delta_{ij} \cdot \frac{\mu}{\mu'} ,$$

where the last inequality uses Proposition 3.2. Consequently, as $\mu'$ sufficiently decreases, $\kappa^{\delta'}_{ij}$ will become much larger than $\kappa^\delta_{ij}$. The claim on $\ell \in [n] \setminus \{i,j\}$ can be shown by using the triangle inequality $\kappa_{ik} \cdot \kappa_{kj} \ge \kappa_{ij}$ shown in Lemma 2.15.

Assume now $\xi^{ll}_{J_q}(w) \ge 4\gamma n$ for some $q \in [p]$ in the LLS step. Then, Lemma 4.2 guarantees the existence of $i, j \in J_q$ such that $x^*_i/x_i,\, s^*_j/s_j \ge \frac{4}{3\sqrt{n}}\gamma n > \beta/(2^{10} n^{5.5})$. Further, Lemma 4.1 gives $\varrho^\mu(i,j) \ge -|J_q|$. Hence, Lemma 4.3 is applicable for $i$ and $j$ with $\tau = |J_q|$.

The overall potential argument in the proof of Theorem 3.16 uses Lemma 4.3 in three cases: $\xi^{ll}_{\mathcal{J}}(w) \ge 4\gamma n$ (Lemma 4.2 is applicable as above); $\xi^{ll}_{\mathcal{J}}(w) < 4\gamma n$ and $\ell^{\delta^+}(\mathcal{J}) \le 4\gamma n$ (Lemma 4.4); and $\xi^{ll}_{\mathcal{J}}(w) < 4\gamma n$ and $\ell^{\delta^+}(\mathcal{J}) > 4\gamma n$ (Lemma 4.5). Here, $\delta^+$ refers to the value of $\delta$ after the LLS step. Note that $\delta^+ > 0$ is well-defined, unless the algorithm terminated with an optimal solution.
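The three cases are determined by two numbers compared against the same threshold $4\gamma n$. As a small illustration, the case split can be written as below; the function name and argument names are hypothetical, and the two inputs are assumed to be already computed (the paper does not compute them at affine scaling iterations, where the cases are used only in the analysis).

```python
def classify_iteration(xi, lift_score_after, gamma, n):
    """Return 'I', 'II' or 'III' following the three cases of the potential argument.

    xi               -- value of xi^ll_J(w) for the current layering
    lift_score_after -- value of ell^{delta+}(J) after the LLS step
    """
    threshold = 4 * gamma * n
    if xi >= threshold:
        return "I"      # Lemma 4.2 applies directly
    if lift_score_after <= threshold:
        return "II"     # handled via Lemma 4.4
    return "III"        # handled via Lemma 4.5


# Illustrative values with gamma = 0.01 and n = 5, so the threshold is 0.2.
cases = [classify_iteration(0.3, 0.0, 0.01, 5),
         classify_iteration(0.1, 0.2, 0.01, 5),
         classify_iteration(0.1, 0.5, 0.01, 5)]
```

The three sample calls fall into Cases I, II and III respectively; note that the boundary value of the lifting score belongs to Case II.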

To prove these lemmas, we need to study how the layers "move" during the LLS step. We let $B = \{t \in [n] : |Rs^{ll}_t| < 4\gamma n\}$ and $N = \{t \in [n] : |Rx^{ll}_t| < 4\gamma n\}$. The assumption $\xi^{ll}_{\mathcal{J}}(w) < 4\gamma n$ means that for each layer $J_k$, either $J_k \subseteq B$ or $J_k \subseteq N$; we accordingly refer to $B$-layers and $N$-layers.

Lemma 4.4 (Proof on p. 50). Let $w = (x,y,s) \in \mathcal{N}(\beta)$ for $\beta \in (0, 1/8]$, and let $\mathcal{J} = (J_1, \ldots, J_p)$ be a $\delta(w)$-balanced partition. Assume that $\xi^{ll}_{\mathcal{J}}(w) < 4\gamma n$, and let $w^+ = (x^+, y^+, s^+) \in \mathcal{N}(2\beta)$ be the next iterate obtained by the LLS step with $\mu^+ = \mu(w^+)$ and assume $\mu^+ > 0$. Let $q \in [p]$ such that $\xi^{ll}_{\mathcal{J}}(w) = \xi^{ll}_{J_q}(w)$. If $\ell^{\delta^+}(\mathcal{J}) \le 4\gamma n$, then there exist $i, j \in J_q$ such that $x^*_i \ge \beta x^+_i/(16n^{3/2})$ and $s^*_j \ge \beta s^+_j/(16n^{3/2})$. Further, for any $\ell, \ell' \in J_q$, we have $\varrho^{\mu^+}(\ell, \ell') \ge -|J_q|$.

For the proof sketch, without loss of generality, let $\xi^{ll}_{\mathcal{J}} = \xi^{ll}_{J_q} = \|Rx^{ll}_{J_q}\|$, that is, $J_q$ is an $N$-layer. The case $\xi^{ll}_{J_q} = \|Rs^{ll}_{J_q}\|$ can be treated analogously. Since the residuals $\|Rx^{ll}_{J_q}\|$ and $\|Rs^{ll}_{J_q}\|$ cannot be both small, Lemma 4.2 readily provides a $j \in J_q$ such that $s^*_j/s_j \ge 1/(6\sqrt{n})$. Using Lemma 3.3 and Proposition 3.1, $s^*_j/s^+_j = s^*_j/s_j \cdot s_j/s^+_j > (1-\beta)/(6(1+4\beta)n^{3/2}) > \beta/(16n^{3/2})$.

The key ideas of showing the existence of an $i \in J_q$ such that $x^*_i \ge x^+_i/(16n^{3/2})$ are the following. With $\approx$, $\lesssim$ and $\gtrsim$, we write equalities and inequalities that hold up to small polynomial factors. First, we show that (i) $\|\delta_{J_q} x^+_{J_q}\| \lesssim \mu^+/\sqrt{\mu}$, and then, that (ii) $\|\delta_{J_q} x^*_{J_q}\| \gtrsim \mu^+/\sqrt{\mu}$. If we can show (i) and (ii) as above, we obtain that $\|\delta_{J_q} x^*_{J_q}\| \gtrsim \|\delta_{J_q} x^+_{J_q}\|$, and thus, $x^*_i \gtrsim x^+_i$ for some $i \in J_q$.

Let us now sketch the first step. By the assumption $J_q \subset N$, one can show $x^+_{J_q}/x_{J_q} \approx \mu^+/\mu$, and therefore
$$\|\delta_{J_q} x^+_{J_q}\| \approx \frac{\mu^+}{\mu}\,\|\delta_{J_q} x_{J_q}\| \approx \frac{\mu^+}{\mu}\sqrt{\mu} = \frac{\mu^+}{\sqrt{\mu}} .$$

The second part of the proof, namely, lower bounding $\|\delta_{J_q} x^*_{J_q}\|$, is more difficult. Here, we only sketch it for the special case when $J_q = [n]$. That is, we have a single layer only; in particular, the LLS step is the same as the affine scaling step $\Delta x^{ll} = \Delta x^a$. The general case of multiple layers follows by making use of Lemma 3.10, i.e. exploiting that for a sufficiently small $\ell^\delta(\mathcal{J})$, the LLS step is close to the affine scaling step.

Hence, assume that $\Delta x^{ll} = \Delta x^a$. Using the equivalent definition of the affine scaling step (17) as a minimum-norm point, we have $\|\delta x^*\| \ge \|\delta(x + \Delta x^{ll})\| = \sqrt{\mu}\,\|Rx^{ll}\| = \sqrt{\mu}\,\xi^{ll}_{\mathcal{J}}$. From Lemma 3.6, $\mu^+/\mu \le \sqrt{n}\,\varepsilon^a(w)/\beta \le \sqrt{n}\,\xi^{ll}_{\mathcal{J}}/\beta$. Thus, we see that $\|\delta x^*\| \ge \beta\mu^+/\sqrt{n\mu}$.

The final statement on lower bounding $\varrho^{\mu^+}(\ell, \ell') \ge -|J_q|$ for any $\ell, \ell' \in J_q$ follows by showing that $\delta^+_\ell/\delta^+_{\ell'}$ remains close to $\delta_\ell/\delta_{\ell'}$, and hence the values of $\kappa^{\mu^+}(\ell, \ell')$ and $\kappa^{\mu}(\ell, \ell')$ are sufficiently close for indices on the same layer (Lemma 6.1).

Lemma 4.5 (Proof on p. 52). Let $w = (x,y,s) \in \mathcal{N}(\beta)$ for $\beta \in (0, 1/8]$, and let $\mathcal{J} = (J_1, \ldots, J_p)$ be a $\delta(w)$-balanced partition. Assume that $\xi^{ll}_{\mathcal{J}}(w) < 4\gamma n$, and let $w^+ = (x^+, y^+, s^+) \in \mathcal{N}(2\beta)$ be the next iterate obtained by the LLS step with $\mu^+ = \mu(w^+)$ and assume $\mu^+ > 0$. If $\ell^{\delta^+}(\mathcal{J}) > 4\gamma n$, then there exist two layers $J_q$ and $J_r$ and $i \in J_q$ and $j \in J_r$ such that $x^*_i \ge x^+_i/(8n^{3/2})$, and $s^*_j \ge s^+_j/(8n^{3/2})$. Further, $\varrho^{\mu^+}(i,j) \ge -|J_q \cup J_r|$, and for all $\ell, \ell' \in J_q \cup J_r$, $\ell \ne \ell'$ we have $\Psi^\mu(\ell, \ell') \le |J_q \cup J_r|$.

Consider now any $\ell \in J_k \subseteq B$. Then, since $Rx^{ll}_\ell$ is very close to 1, $x^+_\ell \approx x_\ell$; on the other hand $s^+_\ell$ will "shoot down" close to the small value $Rs^{ll}_\ell \cdot s_\ell$. Conversely, for $\ell \in J_k \subseteq N$, $s^+_\ell \approx s_\ell$, and $x^+_\ell$ will "shoot down" to a small value.

The key step of the analysis is showing that the increase in $\ell^{\delta^+}(\mathcal{J})$ can be attributed to an $N$-layer $J_r$ "crashing into" a $B$-layer $J_q$. That is, we show the existence of an edge $(i', j') \in$

$E_{\delta^+,\gamma/(4n)}$ for $i' \in J_q$ and $j' \in J_r$, where $r < q$ and $J_q \subseteq B$, $J_r \subseteq N$. This can be achieved by analyzing the matrix $B$ used in the subroutine Verify-Lift.

For the layers $J_q$ and $J_r$, we can use Lemma 4.2 to show that there exists an $i \in J_q$ where $x^*_i/x_i$ is lower bounded, and there exists a $j \in J_r$ where $s^*_j/s_j$ is lower bounded. The lower bound on $\varrho^{\mu^+}(i,j)$ and the upper bounds on the $\Psi^\mu(\ell, \ell')$ values can be shown by tracking the changes between the $\kappa^{\delta}(\ell, \ell')$ and $\kappa^{\delta^+}(\ell, \ell')$ values, and applying Lemma 4.1 both at $w$ and at $w^+$.

Proof of Theorem 3.16. We analyze the overall potential function $\Psi(\mu)$. By definition, $0 \le \Psi(\mu) \le n(n-1)(\log_2 n + 1)$, and if $\mu' < \mu$ then $\Psi(\mu') \ge \Psi(\mu)$. By the iteration at $\mu$ we mean the iteration where the normalized duality gap of the current iterate is $\mu$.

If $\mu^+ = 0$ at the end of an iteration, the algorithm terminates with an optimal solution. Recall from Lemma 3.10(v) that this happens if and only if $\varepsilon^{ll}(w) = 0$ at a certain iteration.

From now on, assume that $\mu^+ > 0$. We distinguish three cases at each iteration. These cases are well-defined even at iterations where affine scaling steps are used. At such iterations, $\xi^{ll}_{\mathcal{J}}(w)$ still refers to the LLS residuals, even if these have not been computed by the algorithm. (Case I) $\xi^{ll}_{\mathcal{J}}(w) \ge 4\gamma n$; (Case II) $\xi^{ll}_{\mathcal{J}}(w) < 4\gamma n$ and $\ell^{\delta^+}(\mathcal{J}) \le 4\gamma n$; and (Case III) $\xi^{ll}_{\mathcal{J}}(w) < 4\gamma n$ and $\ell^{\delta^+}(\mathcal{J}) > 4\gamma n$.

Recall that the algorithm uses an LLS direction instead of the affine scaling direction whenever $\varepsilon^a(w) < 10n^{3/2}\gamma$. Consider now the case when an affine scaling direction is used, that is, $\varepsilon^a(w) \ge 10n^{3/2}\gamma$. According to Lemma 3.10(ii), $\|Rx^{ll} - Rx^a\|, \|Rs^{ll} - Rs^a\| \le 6n^{3/2}\gamma$. This implies that $\xi^{ll}_{\mathcal{J}}(w) \ge 4n^{3/2}\gamma \ge 4n\gamma$. Therefore, in cases II and III, an LLS step will be performed.

Starting with any given iteration, in each case we will identify a set $J \subseteq [n]$ of indices with $|J| > 1$, and start a phase of $O(\sqrt{n}\,|J| \log(\bar\chi^* + n))$ iterations (that can be either affine scaling or LLS steps). In each phase, we will guarantee that $\Psi$ increases by at least $|J| - 1$. As we can partition the union of all iterations into disjoint phases, this yields the bound $O(n^{2.5} \log n \log(\bar\chi^* + n))$ on the total number of iterations.

We now consider each of the cases. We always let $\mu$ denote the normalized duality gap at the current iteration, and we let $q \in [p]$ be the layer such that $\xi^{ll}_{\mathcal{J}}(w) = \xi^{ll}_{J_q}(w)$.
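For readers who want the counting made explicit, one way to spell out the arithmetic is the following. The phase index $h$, the sets $J^{(h)}$ and the constant $c$ hidden in the $O(\cdot)$ are notation introduced only for this sketch; the argument uses just the monotonicity and the upper bound of $\Psi$ stated at the start of the proof, and $|J^{(h)}| \ge 2$.

```latex
% Phase h uses at most c*sqrt(n)*|J^{(h)}|*log(chi^* + n) iterations and raises
% Psi by at least |J^{(h)}| - 1 >= |J^{(h)}|/2. Since Psi is monotone, starts at
% a nonnegative value and never exceeds n(n-1)(log_2 n + 1), the increases of
% the disjoint phases add up to at most that bound.
\sum_{h} \bigl|J^{(h)}\bigr|
  \;\le\; 2 \sum_{h} \bigl(|J^{(h)}| - 1\bigr)
  \;\le\; 2\,n(n-1)(\log_2 n + 1),
\qquad\text{hence}\qquad
\#\text{iterations}
  \;\le\; c\,\sqrt{n}\,\log(\bar\chi^* + n) \sum_{h} \bigl|J^{(h)}\bigr|
  \;=\; O\!\bigl(n^{2.5}\log n\,\log(\bar\chi^* + n)\bigr).
```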

Case I: $\xi^{ll}_{\mathcal{J}}(w) \ge 4\gamma n$. Lemma 4.2 guarantees the existence of $i, j \in J_q$ such that $x^*_i/x_i,\, s^*_j/s_j \ge 4\gamma n/(3\sqrt{n}) > 1/(2^{10} n^{5.5})$. Further, according to Lemma 4.1, $\varrho^\mu(i,j) \ge -|J_q|$. Thus, Lemma 4.3 is applicable for $J = J_q$. The phase starting at $\mu$ comprises $O(\sqrt{n}\,|J_q| \log(\bar\chi^* + n))$ iterations, after which we get a normalized duality gap $\mu'$ such that $\Psi^{\mu'}(i,j) \ge 2|J_q|$, and for each $\ell \in [n] \setminus \{i,j\}$, either $\Psi^{\mu'}(i,\ell) \ge 2|J_q|$, or $\Psi^{\mu'}(\ell,j) \ge 2|J_q|$.

We can take advantage of these bounds for indices $\ell \in J_q$. Again by Lemma 4.1, for any $\ell, \ell' \in J_q$, we have $\Psi^\mu(\ell, \ell') \le \varrho^\mu(\ell, \ell') \le |J_q|$. Thus, there are at least $|J_q| - 1$ pairs of indices $(\ell, \ell')$ for which $\Psi^\mu(\ell, \ell')$ increases by at least $|J_q|$ between iterations at $\mu$ and $\mu'$.

We note that this analysis works regardless of whether an LLS step or an affine scaling step was performed in the iteration at $\mu$.

Case II: $\xi^{ll}_{\mathcal{J}}(w) < 4\gamma n$ and $\ell^{\delta^+}(\mathcal{J}) \le 4\gamma n$. As explained above, in this case we perform an LLS step in the iteration at $\mu$, and we let $w^+$ denote the iterate obtained by the LLS step. For $J = J_q$, Lemma 4.4 guarantees the existence of $i, j \in J_q$ such that $x^*_i/x^+_i,\, s^*_j/s^+_j > \beta/(16n^{3/2})$, and further, $\varrho^{\mu^+}(i,j) > -|J_q|$. We can therefore apply Lemma 4.3. The phase starting at $\mu$ includes the LLS step leading to $\mu^+$ (and the subsequent centering step), and the additional $O(\sqrt{n}\,|J_q| \log(\bar\chi^* + n))$ iterations ($\beta$ is a fixed constant in Algorithm 2) as in Lemma 4.3. As in Case I, we get the desired potential increase compared to the potentials at $\mu$ in layer $J_q$.

Case III: $\xi^{ll}_{\mathcal{J}}(w) < 4\gamma n$ and $\ell^{\delta^+}(\mathcal{J}) > 4\gamma n$. Again, the iteration at $\mu$ will use an LLS step. We apply Lemma 4.5, and set $J = J_q \cup J_r$ as in the lemma. The argument is the same as in Case II, using that Lemma 4.5 explicitly states that $\Psi^\mu(\ell, \ell') \le |J|$ for any $\ell, \ell' \in J$, $\ell \ne \ell'$.

4.1 The iteration complexity bound for the Vavasis-Ye algorithm

We now show that the potential analysis described above also gives an improved bound $O(n^{2.5} \log n \log(\bar\chi_A + n))$ for the original VY algorithm [VY96].

We recall the VY layering step. Order the variables via $\pi$ such that $\delta_{\pi(1)} \le \delta_{\pi(2)} \le \ldots \le \delta_{\pi(n)}$. The layers will be consecutive sets in the ordering; a new layer starts with $\pi(i+1)$ each time $\delta_{\pi(i+1)} > g\,\delta_{\pi(i)}$, for a parameter $g = \mathrm{poly}(n)\bar\chi$.

As outlined in the Introduction, the VY algorithm can be seen as a special implementation of our algorithm by setting $\hat\kappa_{ij} = g\gamma/n$. With these edge weights, we have that $\hat\kappa^\delta_{ij} \ge \gamma/n$ precisely if $g\delta_j \ge \delta_i$.³ With these edge weights, it is easy to see that our Layering(δ, κ̂) subroutine finds the exact same components as VY. Moreover, the layers will be the initial strongly connected components $C_i$ of $\hat G_{\delta,\gamma/n}$: due to the choice of $g$, this partition is automatically δ-balanced. There is no need to call Verify-Lift.

The essential difference compared to our algorithm is that the values $\hat\kappa_{ij} = g\gamma/n$ are not lower bounds on $\kappa_{ij}$ as we require, but upper bounds instead. This is convenient to simplify the construction of the layering. On the negative side, the strongly connected components of $\hat G_{\delta,\gamma/n}$ may no longer be strongly connected in $G_{\delta,\gamma/n}$. Hence, we cannot use Lemma 4.1, and consequently, Lemma 4.3 does not hold.

Still, the $\hat\kappa_{ij}$ bounds are overestimating $\kappa_{ij}$ by at most a factor $\mathrm{poly}(n)\bar\chi$. Therefore, the strongly connected components of $\hat G_{\delta,\gamma/n}$ are strongly connected in $G_{\delta,\sigma}$ for some $\sigma = 1/(\mathrm{poly}(n)\bar\chi)$. Hence, the entire argument described in this section is applicable to the VY algorithm, with a different potential function defined with $\bar\chi$ instead of $\bar\chi^*$. This is the reason why the iteration bound in Lemma 4.3, and therefore in Theorem 3.16, also changes to $\bar\chi$ dependency.

It is worth noting that due to the overestimation of the $\kappa_{ij}$ values, the VY algorithm uses a coarser layering than our algorithm. Our algorithm splits up the VY layers into smaller parts so that $\ell^\delta(\mathcal{J})$ remains small, but within each part, the gaps between the variables are bounded as a function of $\bar\chi^*_A$ instead of $\bar\chi_A$.
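The VY layering rule recalled above (sort by $\delta$, cut wherever consecutive values differ by more than a factor $g$) is simple enough to sketch directly. The Python below is an illustration only: indices are 0-based, the function name is hypothetical, and the layers are returned in order of increasing $\delta$, which is a choice made for this sketch.

```python
def vy_layering(delta, g):
    """Group indices into layers of consecutive delta values.

    Indices are sorted by delta; a new layer starts at an index whose delta
    exceeds g times the delta of its predecessor in the sorted order.
    """
    order = sorted(range(len(delta)), key=lambda i: delta[i])
    if not order:
        return []
    layers = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if delta[cur] > g * delta[prev]:
            layers.append([cur])      # gap larger than g: start a new layer
        else:
            layers[-1].append(cur)    # otherwise stay in the current layer
    return layers


# Illustrative scaling vector with two large gaps, using g = 10.
layers = vy_layering([1.0, 150.0, 2.0, 1e5, 100.0], g=10)
```

For this input the sorted values are 1, 2, 100, 150, 1e5, and the gaps from 2 to 100 and from 150 to 1e5 exceed the factor 10, so three layers are produced: `[[0, 2], [4, 1], [3]]`.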

5 Properties of the layered least square step

This section is dedicated to the proofs of Proposition 3.8 on the duality of lifting scores and Lemma 3.10 on properties of LLS steps.

Proposition 3.8 (Restatement). For a linear subspace $W \subseteq \mathbb{R}^n$ and index set $I \subseteq [n]$ with $J = [n] \setminus I$,
$$\|L^W_I\| \le \max\{1, \|L^{W^\perp}_J\|\} .$$
In particular, $\ell^W(I) = \ell^{W^\perp}(J)$.

³ For simplicity, in the Introduction we used $g x_i \ge x_j$ instead, which is almost the same in the proximity of the central path.

Proof. We first treat the case where $\pi_I(W) = \{0\}$ or $\pi_J(W^\perp) = \{0\}$. If $\pi_I(W) = \{0\}$ then $\|L^W_I\| = \ell^W(I) = 0$. Furthermore, in this case $\mathbb{R}^I = \pi_I(W)^\perp = \pi_I(W^\perp \cap \mathbb{R}^n_I)$, and thus $\pi_{\mathbb{R}^n_J}(W^\perp) \subseteq W^\perp$. In particular, $\|L^{W^\perp}_J\| \le 1$ and $\ell^{W^\perp}(J) = 0$. Symmetrically, if $\pi_J(W^\perp) = \{0\}$ then $\|L^{W^\perp}_J\| = \ell^{W^\perp}(J) = 0$, $\|L^W_I\| \le 1$ and $\ell^W(I) = 0$.

We now restrict our attention to the case where both $\pi_I(W), \pi_J(W^\perp) \ne \{0\}$. Under this assumption, we show that $\|L^W_I\| = \|L^{W^\perp}_J\|$ and thus that $\ell^W(I) = \ell^{W^\perp}(J)$. Note that by non-emptiness, we clearly have that $\|L^W_I\|, \|L^{W^\perp}_J\| \ge 1$.

We formulate a more general claim. Let $\{0\} \ne U, V \subset \mathbb{R}^n$ be linear subspaces such that $U + V = \mathbb{R}^n$ and $U \cap V = \{0\}$. Note that for the orthogonal complements in $\mathbb{R}^n$, we also have $\{0\} \ne U^\perp, V^\perp$, $U^\perp + V^\perp = \mathbb{R}^n$ and $U^\perp \cap V^\perp = \{0\}$.

Claim 5.1. Let $\{0\} \ne U, V \subset \mathbb{R}^n$ be linear subspaces such that $U + V = \mathbb{R}^n$ and $U \cap V = \{0\}$. Thus, for $z \in \mathbb{R}^n$, there are unique decompositions $z = u + v$ with $u \in U$, $v \in V$ and $z = u' + v'$ with $u' \in U^\perp$ and $v' \in V^\perp$. Let $T : \mathbb{R}^n \to V$ be the map sending $Tz = v$. Let $T' : \mathbb{R}^n \to V^\perp$ be the map sending $T'z = v'$. Then, $\|T\| = \|T'\|$.

Proof. To prove the statement, we claim that it suffices to show that if $\|T\| > 1$ then $\|T'\| \ge \|T\|$. To prove sufficiency, note that by symmetry, we also get that if $\|T'\| > 1$ then $\|T\| \ge \|T'\|$. Note that $V, V^\perp \ne \{0\}$ by assumption, and $Tz = z$ for $z \in V$, $T'z = z$ for $z \in V^\perp$. Thus, we always have $\|T\|, \|T'\| \ge 1$, and therefore the equality $\|T\| = \|T'\|$ must hold in all cases. We now assume $\|T\| > 1$ and show $\|T'\| \ge \|T\|$.

Representing $T$ as an $n \times n$ matrix, we write $T = \sum_{i=1}^k \sigma_i v_i u_i^\top$ using a singular value decomposition with $\sigma_1 \ge \cdots \ge \sigma_k > 0$. As such, $v_1, \ldots, v_k$ is an orthonormal basis of $V$, since $\mathrm{range}(T) = V$, and $u_1, \ldots, u_k$ is an orthonormal basis of $U^\perp$, since $\mathrm{Ker}(T) = U$, noting that we have restricted to the singular vectors associated with positive singular values. By assumption, we have that $\|T\| = \|Tu_1\| = \sigma_1 > 1$. The proof is complete by showing that

$$\|T'(v_1 - u_1/\sigma_1)\| \ge \sigma_1 \|v_1 - u_1/\sigma_1\| , \tag{32}$$
and that $\|v_1 - u_1/\sigma_1\| > 0$, since then the vector $v_1 - u_1/\sigma_1$ will certify that $\|T'\| \ge \sigma_1$.

The map $T$ is a linear projection with $T^2 = T$. Hence $\langle u_i, v_i \rangle = \sigma_i^{-1}$ and $\langle u_i, v_j \rangle = 0$ for all $i \ne j$.

We show that $v_1 - \sigma_1^{-1}u_1$ can be decomposed as $v_1 - \sigma_1 u_1 + (\sigma_1 - \sigma_1^{-1})u_1$ such that $v_1 - \sigma_1 u_1 \in V^\perp$ and $(\sigma_1 - \sigma_1^{-1})u_1 \in U^\perp$. Therefore, $T'(v_1 - \sigma_1^{-1}u_1) = v_1 - \sigma_1 u_1$.

The containment $(\sigma_1 - \sigma_1^{-1})u_1 \in U^\perp$ is immediate. To show $v_1 - \sigma_1 u_1 \in V^\perp$, we need that $\langle v_1 - \sigma_1 u_1, v_i \rangle = 0$ for all $i \in [k]$. For $i \ge 2$, this is true since $\langle u_i, v_j \rangle = 0$ and $\langle v_i, v_j \rangle = 0$. For $i = 1$, we have $\langle v_1 - \sigma_1 u_1, v_1 \rangle = 0$ since $\|v_1\| = 1$ and $\langle u_1, v_1 \rangle = \sigma_1^{-1}$. Consequently, $T'(v_1 - \sigma_1^{-1}u_1) = v_1 - \sigma_1 u_1$.

We compute $\|v_1 - \sigma_1^{-1}u_1\| = \sqrt{1 - \sigma_1^{-2}} > 0$, since $\sigma_1 > 1$, and $\|v_1 - \sigma_1 u_1\| = \sqrt{\sigma_1^2 - 1}$. This verifies (32), and thus $\|T'\| \ge \sigma_1 = \|T\|$.

To prove the lemma, we define $\mathcal{J} = (J, I)$, $U = W^\perp_{\mathcal{J},1} \oplus W^\perp_{\mathcal{J},2}$ and $V = W$ and let $T : \mathbb{R}^n \to V$ and $T' : \mathbb{R}^n \to V^\perp$ be as in Claim 5.1. By assumption, $\{0\} \ne \pi_I(W) \Rightarrow \{0\} \ne V$ and $\{0\} \ne \pi_J(W^\perp) = W^\perp_{\mathcal{J},1} \Rightarrow \{0\} \ne U$. Applying Lemma 3.7, $U, V$ satisfy the conditions of Claim 5.1 and $T = \mathrm{LLS}^{W,1}_{\mathcal{J}}$. In particular, $\|T'\| = \|T\|$. Using the fact that $U^\perp = W_{\mathcal{J},1} \oplus W_{\mathcal{J},2}$ and $V^\perp = W^\perp$, we similarly get that $T' = \mathrm{LLS}^{W^\perp,1}_{\bar{\mathcal{J}}}$, where $\bar{\mathcal{J}} = (I, J)$. By (21) we have, for any $t \in \pi_{\mathbb{R}^n_I}(W)$, that $Tt = \mathrm{LLS}^{W,1}_{\mathcal{J}}(t) = L^W_I(t_I)$. Thus $\|T\| \ge \|L^W_I\| \ge 1$.

To finish the proof of the lemma from the claim, we show that $\|T\| \le \|L^W_I\|$. By a symmetric argument we get $\|T'\| = \|L^{W^\perp}_J\|$.

If $x \in \mathbb{R}^n_J$, then $Tx \in W \cap \mathbb{R}^n_J$ because any $s \in W^\perp_{\mathcal{J},2}$, $t \in \pi_I(W)$ with $s + t = 0$ must have $s = t = 0$ since $W^\perp_{\mathcal{J},2}$ is orthogonal to $\pi_I(W)$. But $W \cap \mathbb{R}^n_J$ and $W^\perp_{\mathcal{J},1}$ are orthogonal, so $\|Tx\| \le \|x\|$ because $x = Tx + (x - Tx)$ is an orthogonal decomposition.

If $y \in \mathbb{R}^n_I$, then $y_J = 0$ and hence $(Ty)_J = (Ty - y)_J$. Since $(Ty - y)_J \in W^\perp_{\mathcal{J},1} = \pi_J(W \cap \mathbb{R}^n_J)^\perp$, we see that $Ty \in (W \cap \mathbb{R}^n_J)^\perp$. As such, for any $x \in \mathbb{R}^n_J$, $y \in \mathbb{R}^n_I$, we see that $x \perp y$ and $Tx \perp Ty$. For $x, y \ne 0$, we thus have that
$$\frac{\|T(x+y)\|^2}{\|x+y\|^2} = \frac{\|T(x)\|^2 + \|T(y)\|^2}{\|x\|^2 + \|y\|^2} \le \max\left\{\frac{\|T(x)\|^2}{\|x\|^2}, \frac{\|T(y)\|^2}{\|y\|^2}\right\} \le \max\left\{1, \frac{\|T(y)\|^2}{\|y\|^2}\right\} .$$
Since $\|L^W_I\| \ge 1$, we must have that $\|Tt\|/\|t\|$ is maximized by some $t \in \mathbb{R}^n_I$. From $\mathrm{Ker}(T) = U$ it is clear that $\|Tt\|/\|t\|$ is maximized by some $t \in U^\perp$. Now, $U^\perp \cap \mathbb{R}^n_I = \pi_{\mathbb{R}^n_I}(W)$, so any $t$ maximizing $\|Tt\|/\|t\|$ satisfies $Tt = L^W_I(t_I)$. Therefore, $\|L^W_I\| \ge \|T\|$.

Our next goal is to show Lemma 3.10: for a layering with small enough $\ell^\delta(\mathcal{J})$, the LLS step approximately satisfies (13), that is, $\delta\Delta x^{ll} + \delta^{-1}\Delta s^{ll} \approx -x^{1/2}s^{1/2}$. This also enables us to derive bounds on the norm of the residuals and on the step-length. We start by proving a few auxiliary technical claims. The next simple lemma allows us to take advantage of low lifting scores in the layering.

Lemma 5.2. Let $u, v \in \mathbb{R}^n$ be two vectors such that $u - v \in W$. Let $I \subseteq [n]$, and $\delta \in \mathbb{R}^n_{++}$. Then there exists a vector $u' \in W + u$ satisfying $u'_I = v_I$ and
$$\|\delta_{[n] \setminus I}(u'_{[n] \setminus I} - u_{[n] \setminus I})\| \le \ell^\delta(I)\,\|\delta_I(u_I - v_I)\| .$$

Proof. We let
$$u' := u + \delta^{-1} L^\delta_I(\delta_I(v_I - u_I)) .$$
The claim follows by the definition of the lifting score $\ell^\delta(I)$.

The next lemma will be the key tool to prove Lemma 3.10. It is helpful to recall the characterization of the LLS step in Section 3.4.

Lemma 5.3. Let $w = (x,y,s) \in \mathcal{N}(\beta)$ for $\beta \in (0, 1/4]$, let $\mu = \mu(w)$ and $\delta = \delta(w)$. Let $\mathcal{J} = (J_1, \ldots, J_p)$ be a $\delta(w)$-balanced layering, and let $\Delta w^{ll} = (\Delta x^{ll}, \Delta y^{ll}, \Delta s^{ll})$ denote the corresponding LLS direction. Let $\Delta x \in \bigoplus_{k=1}^p W_{\mathcal{J},k}$ and $\Delta s \in \bigoplus_{k=1}^p W^\perp_{\mathcal{J},k}$ as in (25) and (26), that is
$$\delta\Delta x^{ll} + \delta^{-1}\Delta s + x^{1/2}s^{1/2} = 0 , \tag{33}$$
$$\delta\Delta x + \delta^{-1}\Delta s^{ll} + x^{1/2}s^{1/2} = 0 . \tag{34}$$

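Then, there exist vectors $\Delta\bar x \in \bigoplus_{k=1}^p W_{\mathcal{J},k}$ and $\Delta\bar s \in \bigoplus_{k=1}^p W^\perp_{\mathcal{J},k}$ such that
$$\|\delta_{J_k}(\Delta\bar x_{J_k} - \Delta x^{ll}_{J_k})\| \le 2n\,\ell^\delta(\mathcal{J})\sqrt{\mu} \quad \forall k \in [p] \quad\text{and} \tag{35}$$
$$\|\delta^{-1}_{J_k}(\Delta\bar s_{J_k} - \Delta s^{ll}_{J_k})\| \le 2n\,\ell^\delta(\mathcal{J})\sqrt{\mu} \quad \forall k \in [p] . \tag{36}$$

Proof. Throughout, we use the shorthand notation $\lambda = \ell^\delta(\mathcal{J})$. We construct $\Delta\bar x$; one can obtain $\Delta\bar s$, using that the reverse layering has lifting score $\lambda$ in $W^\perp \mathrm{Diag}(\delta^{-1})$ according to Lemma 3.9.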

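We proceed by induction, constructing $\Delta\bar x_{J_k} \in W_{\mathcal{J},k}$ for $k = p, p-1, \ldots, 1$. This will be given as $\Delta\bar x_{J_k} = \Delta x^{(k)}_{J_k}$ for a vector $\Delta x^{(k)} \in W$ such that $\Delta x^{(k)}_{J_{>k}} = 0$. We prove the inductive hypothesis
$$\|\delta_{J_{\le k}}(\Delta x^{(k)}_{J_{\le k}} - \Delta x^{ll}_{J_{\le k}})\| \le 2\lambda\sqrt{\mu} \sum_{q=k+1}^{p} \sqrt{|J_q|} . \tag{37}$$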

Note that (35) follows by restricting the norm on the LHS to $J_k$ and since the sum on the RHS is $\le n$.

For $k = p$, the RHS is 0. We simply set $\Delta x^{(p)} = \Delta x^{ll}$, that is, $\Delta\bar x_{J_p} = \Delta x^{ll}_{J_p}$, trivially satisfying the hypothesis. Consider now $k$

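$$\|\delta_{J_{k+1}}\Delta\bar x_{J_{k+1}} + \delta^{-1}_{J_{k+1}}\Delta s_{J_{k+1}}\| \le \|x^{1/2}_{J_{k+1}} s^{1/2}_{J_{k+1}}\| + \|\delta_{J_{k+1}}(\Delta\bar x_{J_{k+1}} - \Delta x^{ll}_{J_{k+1}})\|$$
$$\le \|x^{1/2}_{J_{k+1}} s^{1/2}_{J_{k+1}}\| + 2\lambda\sqrt{\mu} \sum_{q=k+2}^{p} \sqrt{|J_q|} \le \sqrt{(1+\beta)\mu|J_{k+1}|} + 2n\lambda\sqrt{\mu} < 2\sqrt{\mu|J_{k+1}|} ,$$
using also that $w \in \mathcal{N}(\beta)$, Proposition 3.2, and the assumptions $\beta \le 1/4$, $\lambda \le \beta/(32n^2)$. Note that $\Delta\bar x_{J_{k+1}} \in W_{\mathcal{J},k}$ and $\Delta s_{J_{k+1}} \in W^\perp_{\mathcal{J},k}$ are orthogonal vectors. The above inequality therefore implies
$$\|\delta_{J_{k+1}}\Delta\bar x_{J_{k+1}}\| \le 2\sqrt{\mu|J_{k+1}|} .$$
Let us now use Lemma 5.2 to obtain $\Delta x^{(k)}$ for $u = \Delta x^{(k+1)}$, $v = 0$, and $I = J_{>k}$. That is, we get $\Delta x^{(k)}_{J_{>k}} = 0$, $\Delta x^{(k)} \in W$, and
$$\|\delta_{J_{\le k}}(\Delta x^{(k)}_{J_{\le k}} - \Delta x^{(k+1)}_{J_{\le k}})\| \le \lambda\,\|\delta_{J_{>k}}\Delta x^{(k+1)}_{J_{>k}}\| = \lambda\,\|\delta_{J_{k+1}}\Delta\bar x_{J_{k+1}}\| \le 2\lambda\sqrt{\mu|J_{k+1}|} .$$
By the triangle inequality and the induction hypothesis (37) for $k+1$,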

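$$\|\delta_{J_{\le k}}(\Delta x^{(k)}_{J_{\le k}} - \Delta x^{ll}_{J_{\le k}})\| \le \|\delta_{J_{\le k}}(\Delta x^{(k)}_{J_{\le k}} - \Delta x^{(k+1)}_{J_{\le k}})\| + \|\delta_{J_{\le k}}(\Delta x^{(k+1)}_{J_{\le k}} - \Delta x^{ll}_{J_{\le k}})\| \le 2\lambda\sqrt{\mu|J_{k+1}|} + 2\lambda\sqrt{\mu} \sum_{q=k+2}^{p} \sqrt{|J_q|} ,$$
yielding the induction hypothesis for $k$.

Lemma 3.10 (Restatement). Let $w = (x,y,s) \in \mathcal{N}(\beta)$ for $\beta \in (0, 1/4]$, let $\mu = \mu(w)$ and $\delta = \delta(w)$. Let $\mathcal{J} = (J_1, \ldots, J_p)$ be a layering with $\ell^\delta(\mathcal{J}) \le \beta/(32n^2)$, and let $\Delta w^{ll} = (\Delta x^{ll}, \Delta y^{ll}, \Delta s^{ll})$ denote the LLS direction for the layering $\mathcal{J}$. Then the following properties hold.
(i) We have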

We obtain the following bounds on the progress in the LLS step:

$$\mu(w + \alpha\Delta w^{ll}) = (1-\alpha)\mu , \quad\text{and}\quad \alpha \ge 1 - \frac{3\sqrt{n}\,\varepsilon^{ll}(w)}{\beta} .$$

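(v) We have $\varepsilon^{ll}(w) = 0$ if and only if $\alpha = 1$. These are further equivalent to $w + \Delta w^{ll} = (x + \Delta x^{ll}, y + \Delta y^{ll}, s + \Delta s^{ll})$ being an optimal solution to (LP).

Proof. Again, we use $\lambda = \ell^\delta(\mathcal{J})$.

Part (i). Clearly, (27) implies (28). To show (27), we use Lemma 5.3 to obtain $\Delta\bar x$ and $\Delta\bar s$ as in (35) and (36). We will also use $\Delta x \in \bigoplus_{k=1}^p W_{\mathcal{J},k}$ and $\Delta s \in \bigoplus_{k=1}^p W^\perp_{\mathcal{J},k}$ as in (33) and (34). Select any layer $k \in [p]$. From (33), we get that
$$\|\delta_{J_k}\Delta\bar x_{J_k} + \delta^{-1}_{J_k}\Delta s_{J_k} + x^{1/2}_{J_k} s^{1/2}_{J_k}\| = \|\delta_{J_k}(\Delta\bar x_{J_k} - \Delta x^{ll}_{J_k})\| \le 2n\lambda\sqrt{\mu} . \tag{38}$$
Similarly, from (34), we see that
$$\|\delta^{-1}_{J_k}\Delta\bar s_{J_k} + \delta_{J_k}\Delta x_{J_k} + x^{1/2}_{J_k} s^{1/2}_{J_k}\| = \|\delta^{-1}_{J_k}(\Delta\bar s_{J_k} - \Delta s^{ll}_{J_k})\| \le 2n\lambda\sqrt{\mu} .$$
From the above inequalities, we see that
$$\|\delta_{J_k}(\Delta\bar x_{J_k} - \Delta x_{J_k}) + \delta^{-1}_{J_k}(\Delta s_{J_k} - \Delta\bar s_{J_k})\| \le 4n\lambda\sqrt{\mu} .$$
Since $\delta_{J_k}(\Delta\bar x_{J_k} - \Delta x_{J_k})$ and $\delta^{-1}_{J_k}(\Delta s_{J_k} - \Delta\bar s_{J_k})$ are orthogonal vectors, we have
$$\|\delta_{J_k}(\Delta\bar x_{J_k} - \Delta x_{J_k})\| ,\ \|\delta^{-1}_{J_k}(\Delta s_{J_k} - \Delta\bar s_{J_k})\| \le 4n\lambda\sqrt{\mu} .$$
Together with (35), this yields $\|\delta_{J_k}(\Delta x^{ll}_{J_k} - \Delta x_{J_k})\| \le 6n\lambda\sqrt{\mu}$. Combined with (26), we get
$$\|\delta_{J_k}\Delta x^{ll}_{J_k} + \delta^{-1}_{J_k}\Delta s^{ll}_{J_k} + x^{1/2}_{J_k} s^{1/2}_{J_k}\| = \|\delta_{J_k}(\Delta x^{ll}_{J_k} - \Delta x_{J_k})\| \le 6n\lambda\sqrt{\mu} ,$$
thus, (27) follows.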

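Part (ii). Recall from Lemma 3.5(i) that $\sqrt{\mu}\,Rx^{a} + \sqrt{\mu}\,Rs^{a} = x^{1/2}s^{1/2}$. From part (i), we can similarly see that

$$\|\sqrt{\mu}\,Rx^{ll} + \sqrt{\mu}\,Rs^{ll} - x^{1/2}s^{1/2}\| \le 6n^{3/2}\lambda\sqrt{\mu}.$$

From these, we get

$$\|(Rx^{ll} - Rx^{a}) + (Rs^{ll} - Rs^{a})\| \le 6n^{3/2}\lambda.$$

The claim follows since $Rx^{ll} - Rx^{a} \in \mathrm{Diag}(\delta)W$ and $Rs^{ll} - Rs^{a} \in \mathrm{Diag}(\delta^{-1})W^{\perp}$ are orthogonal vectors.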

Part (iii). Both bounds follow from the previous part and Lemma 3.5(iii), using the assumption $\ell^{\delta}(\mathcal{J}) \le \beta/(32n^2)$.

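Part (iv). Let $w^{+} = w + \alpha\Delta w^{ll}$. We need to find the largest value $\alpha > 0$ such that $w^{+} \in \mathcal{N}(2\beta)$. To begin, we first show that the normalized duality gap satisfies $\mu(w^{+}) = (1-\alpha)\mu$ for any $\alpha \in \mathbb{R}$. For this purpose, we use the decomposition:

$$(x + \alpha\Delta x^{ll})(s + \alpha\Delta s^{ll}) = (1-\alpha)xs + \alpha(x + \Delta x^{ll})(s + \Delta s^{ll}) - \alpha(1-\alpha)\Delta x^{ll}\Delta s^{ll}. \tag{39}$$

Recall from Part (i) that there exist $\Delta x \in \bigoplus_{k=1}^{p} W_{\mathcal{J},k}$ and $\Delta s \in \bigoplus_{k=1}^{p} W^{\perp}_{\mathcal{J},k}$ as in (33) and (34) such that $\delta\Delta x^{ll} + \delta^{-1}\Delta s = -\delta x$ and $\delta\Delta x + \delta^{-1}\Delta s^{ll} = -\delta^{-1}s$. In particular, $x + \Delta x^{ll} = -\delta^{-2}\Delta s$ and $s + \Delta s^{ll} = -\delta^{2}\Delta x$. Noting that $\Delta x^{ll} \perp \Delta s^{ll}$ and $\Delta x \perp \Delta s$, taking the average of the coordinates on both sides of (39), we get that

$$\mu(w + \alpha\Delta w^{ll}) = (1-\alpha)\mu(w) + \alpha\langle x + \Delta x^{ll}, s + \Delta s^{ll}\rangle/n - \alpha(1-\alpha)\langle\Delta x^{ll}, \Delta s^{ll}\rangle/n = (1-\alpha)\mu(w) + \alpha\langle\delta^{-2}\Delta s, \delta^{2}\Delta x\rangle/n = (1-\alpha)\mu(w), \tag{40}$$

as needed. Let $\varepsilon := \varepsilon^{ll}(w)$. To obtain the desired lower bound on the step-length, given (40) it suffices to show that for all $0 \le \alpha < 1 - \frac{3\sqrt{n}\varepsilon}{\beta}$ we have

$$\left\|\frac{(x + \alpha\Delta x^{ll})(s + \alpha\Delta s^{ll})}{(1-\alpha)\mu} - e\right\| \le 2\beta. \tag{41}$$

We will need a bound on the product of the LLS residuals: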

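$$\left\|Rx^{ll}Rs^{ll} - \frac{1}{\mu}\Delta x^{ll}\Delta s^{ll}\right\| = \left\|\frac{x^{1/2}s^{1/2}}{\sqrt{\mu}}\cdot\frac{\delta\Delta x^{ll} + \delta^{-1}\Delta s^{ll} + x^{1/2}s^{1/2}}{\sqrt{\mu}}\right\| \le 6(1+2\beta)n^{3/2}\lambda \le \frac{\beta}{4}, \tag{42}$$

using Proposition 3.1, part (i), and the assumptions $\lambda \le \beta/(32n^2)$, $\beta \le 1/4$. Another useful bound will be

$$\|Rx^{ll}Rs^{ll}\|^{2} = \sum_{i\in[n]} |Rx^{ll}_i|^{2}|Rs^{ll}_i|^{2} \le \varepsilon^{2}\sum_{i\in[n]}\max\{|Rx^{ll}_i|^{2}, |Rs^{ll}_i|^{2}\} \le \varepsilon^{2}(\|Rx^{ll}\|^{2} + \|Rs^{ll}\|^{2}) \le 2n\varepsilon^{2}. \tag{43}$$

The last inequality uses part (ii). We are ready to get a bound as in (41). Dividing (39) by $(1-\alpha)\mu$ and using $(x+\Delta x^{ll})(s+\Delta s^{ll}) = \mu Rx^{ll}Rs^{ll}$, we obtain for $0 \le \alpha < 1$

$$\left\|\frac{(x + \alpha\Delta x^{ll})(s + \alpha\Delta s^{ll})}{(1-\alpha)\mu} - e\right\| \le \beta + \left\|\frac{\alpha^{2}}{(1-\alpha)\mu}(x + \Delta x^{ll})(s + \Delta s^{ll}) + \frac{\alpha}{\mu}(xs + x\Delta s^{ll} + s\Delta x^{ll})\right\|$$

$$\le \beta + \frac{\alpha}{1-\alpha}\|Rx^{ll}Rs^{ll}\| + \alpha\left\|Rx^{ll}Rs^{ll} - \frac{1}{\mu}\Delta x^{ll}\Delta s^{ll}\right\| \le \beta + \frac{\sqrt{2n}\,\varepsilon}{1-\alpha} + \frac{\beta}{4} \le \frac{5}{4}\beta + \frac{\sqrt{2n}\,\varepsilon}{1-\alpha}.$$

This value is at most $2\beta$ whenever $2\sqrt{n}\varepsilon/(1-\alpha) \le (3/4)\beta$, which is implied by $\alpha < 1 - \frac{3\sqrt{n}\varepsilon}{\beta}$, as needed.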

Part (v). From part (iv), it is immediate that $\varepsilon^{ll}(w) = 0$ implies $\alpha = 1$. If $\alpha = 1$, we have that $w + \Delta w^{ll}$ is the limit of (strictly) feasible solutions to (LP) and thus is also a feasible solution. Optimality of $w + \Delta w^{ll}$ now follows from Part (iv), since $\alpha = 1$ implies $\mu(w + \Delta w^{ll}) = 0$.

The remaining implication is that if $w + \Delta w^{ll}$ is optimal, then $\varepsilon^{ll}(w) = 0$. Recall that $Rx^{ll}_i = \delta_i(x_i + \Delta x^{ll}_i)/\sqrt{\mu}$ and $Rs^{ll}_i = \delta_i^{-1}(s_i + \Delta s^{ll}_i)/\sqrt{\mu}$. The optimality of $w + \Delta w^{ll}$ means that for each $i \in [n]$, either $x_i + \Delta x^{ll}_i = 0$ or $s_i + \Delta s^{ll}_i = 0$. Therefore, $\varepsilon^{ll}(w) = 0$.
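The decomposition (39) used in Part (iv) is a coordinate-wise algebraic identity, so it can be sanity-checked numerically. The following is a minimal sketch in Python, assuming arbitrary random vectors in place of an actual LLS direction; the function names and the test data are illustrative and not part of the paper.

```python
import random

def step_product(x, s, dx, ds, alpha):
    """Left-hand side of (39): (x + alpha*dx)(s + alpha*ds), coordinate-wise."""
    return [(xi + alpha * dxi) * (si + alpha * dsi)
            for xi, si, dxi, dsi in zip(x, s, dx, ds)]

def decomposition(x, s, dx, ds, alpha):
    """Right-hand side of (39), coordinate-wise."""
    return [(1 - alpha) * xi * si
            + alpha * (xi + dxi) * (si + dsi)
            - alpha * (1 - alpha) * dxi * dsi
            for xi, si, dxi, dsi in zip(x, s, dx, ds)]

# Illustrative data only: the identity holds for any vectors and any alpha.
random.seed(0)
n = 5
x = [random.uniform(0.5, 2.0) for _ in range(n)]
s = [random.uniform(0.5, 2.0) for _ in range(n)]
dx = [random.uniform(-1.0, 1.0) for _ in range(n)]
ds = [random.uniform(-1.0, 1.0) for _ in range(n)]
alpha = 0.7

max_gap = max(abs(a - b) for a, b in zip(step_product(x, s, dx, ds, alpha),
                                         decomposition(x, s, dx, ds, alpha)))
```

The identity alone does not give (40); that step additionally needs the orthogonality relations $\Delta x^{ll} \perp \Delta s^{ll}$ and $\Delta x \perp \Delta s$, which the sketch does not model.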

6 Proofs of the main lemmas for the potential analysis

Lemma 4.2 (Restatement). Let $w = (x,y,s) \in \mathcal{N}(\beta)$ for $\beta \in (0,1/8]$ and let $w^{*} = (x^{*},y^{*},s^{*})$ be the optimal solution corresponding to $\mu^{*} = 0$ on the central path. Let further $\mathcal{J} = (J_1,\dots,J_p)$ be a $\delta(w)$-balanced layering (Definition 3.13), and let $\Delta w^{ll} = (\Delta x^{ll}, \Delta y^{ll}, \Delta s^{ll})$ be the corresponding LLS direction. Then the following statement holds for every $q \in [p]$:

(i) There exists $i \in J_q$ such that

$$x_i^{*} \ge \frac{2x_i}{3\sqrt{n}}\cdot(\|Rx^{ll}_{J_q}\| - 2\gamma n). \tag{30}$$

(ii) There exists $j \in J_q$ such that

$$s_j^{*} \ge \frac{2s_j}{3\sqrt{n}}\cdot(\|Rs^{ll}_{J_q}\| - 2\gamma n). \tag{31}$$

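Proof of Lemma 4.2. We prove part (i); part (ii) follows analogously using Lemma 3.9. Let $z$ be a vector fulfilling the statement of Lemma 5.2 for $u = x^{*}$, $v = x + \Delta x^{ll}$, and $I = J_{>q}$. Then $z \in W + d$, $z_{J_{>q}} = x_{J_{>q}} + \Delta x^{ll}_{J_{>q}}$, and by $\ell^{\delta}(\mathcal{J}) \le \gamma$,

$$\|\delta_{J_{\le q}}(x^{*}_{J_{\le q}} - z_{J_{\le q}})\| \le \gamma\|\delta_{J_{>q}}(x^{*}_{J_{>q}} - (x_{J_{>q}} + \Delta x^{ll}_{J_{>q}}))\|.$$

Restricting to the components in $J_q$, and dividing by $\sqrt{\mu}$, we get

$$\frac{\|\delta_{J_q}(x^{*}_{J_q} - z_{J_q})\|}{\sqrt{\mu}} \le \frac{\gamma\|\delta_{J_{>q}}(x^{*}_{J_{>q}} - (x_{J_{>q}} + \Delta x^{ll}_{J_{>q}}))\|}{\sqrt{\mu}} \le \frac{\gamma\|\delta_{J_{>q}}x^{*}_{J_{>q}}\|}{\sqrt{\mu}} + \gamma\|Rx^{ll}_{J_{>q}}\|. \tag{44}$$

Since $w \in \mathcal{N}(\beta)$, from Proposition 3.1 and (16) we see that for $i \in [n]$

$$\frac{\delta_i}{\sqrt{\mu}} \le \frac{1}{\sqrt{1-2\beta}}\cdot\frac{\delta_i(w(\mu))}{\sqrt{\mu}} = \frac{1}{\sqrt{1-2\beta}}\cdot\frac{1}{x_i(\mu)},$$

and therefore

$$\frac{\|\delta_{J_{>q}}x^{*}_{J_{>q}}\|}{\sqrt{\mu}} \le \frac{1}{\sqrt{1-2\beta}}\|x(\mu)^{-1}_{J_{>q}}x^{*}_{J_{>q}}\| \le \frac{1}{\sqrt{1-2\beta}}\cdot\|x(\mu)^{-1}_{J_{>q}}x^{*}_{J_{>q}}\|_{1} \le \frac{n}{\sqrt{1-2\beta}},$$

where the last inequality follows by Lemma 3.3. Using the above bounds with (44), along with $\|Rx^{ll}_{J_{>q}}\| \le \|Rx^{ll}\| \le \sqrt{2n}$ from Lemma 3.10(iii), we get

$$\frac{\|\delta_{J_q}z_{J_q}\|}{\sqrt{\mu}} \le \frac{\|\delta_{J_q}x^{*}_{J_q}\|}{\sqrt{\mu}} + \frac{\gamma n}{\sqrt{1-2\beta}} + \gamma\sqrt{2n} \le \frac{\|\delta_{J_q}x^{*}_{J_q}\|}{\sqrt{\mu}} + 2\gamma n,$$

using that $\beta \le 1/8$ and $n \ge 3$. Note that $z$ is a feasible solution to the least-squares problem which is optimally solved by $x^{ll}_{J_q}$ for layer $J_q$, and so

$$\|Rx^{ll}_{J_q}\| \le \frac{\|\delta_{J_q}z_{J_q}\|}{\sqrt{\mu}}.$$

It follows that

$$\frac{\|\delta_{J_q}x^{*}_{J_q}\|}{\sqrt{\mu}} \ge \|Rx^{ll}_{J_q}\| - 2\gamma n.$$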

Let us pick $i = \arg\max_{t\in J_q}|\delta_t x_t^{*}|$. Using Proposition 3.2,

$$\frac{x_i^{*}}{x_i} \ge \frac{1}{1+\beta}\cdot\frac{\delta_i x_i^{*}}{\sqrt{\mu}} \ge \frac{\|Rx^{ll}_{J_q}\| - 2\gamma n}{(1+\beta)\sqrt{n}} \ge \frac{2}{3\sqrt{n}}\cdot(\|Rx^{ll}_{J_q}\| - 2\gamma n),$$

completing the proof.

Lemma 4.3 (Restatement). Let $w = (x,y,s) \in \mathcal{N}(2\beta)$ for $\beta \in (0,1/8]$, let $\mu = \mu(w)$ and $\delta = \delta(w)$. Let $i,j \in [n]$ and $2 \le \tau \le n$ such that for the optimal solution $w^{*} = (x^{*},y^{*},s^{*})$, we have $x_i^{*} \ge \beta x_i/(2^{10}n^{5.5})$ and $s_j^{*} \ge \beta s_j/(2^{10}n^{5.5})$, and assume $\varrho^{\mu}(i,j) \ge -\tau$. After $O(\beta^{-1}\sqrt{n}\,\tau\log(\bar\chi^{*} + n))$ further iterations the duality gap $\mu'$ fulfills $\Psi^{\mu'}(i,j) \ge 2\tau$, and for every $\ell \in [n]\setminus\{i,j\}$, either $\Psi^{\mu'}(i,\ell) \ge 2\tau$, or $\Psi^{\mu'}(\ell,j) \ge 2\tau$.

Proof of Lemma 4.3. Let us select a value $\mu'$ such that

$$\log\mu - \log\mu' \ge 5\tau\log\frac{4n\kappa^{*}}{\gamma} + 31\log n + 44 - 4\log\beta.$$

The normalized duality gap decreases to such a value within $O(\beta^{-1}\sqrt{n}\cdot\tau\log(\bar\chi^{*} + n))$ iterations, recalling that $\log(\bar\chi^{*} + n) = \Theta(\log(\kappa^{*} + n))$. The step-lengths for the affine scaling and LLS steps are stated in Proposition 3.4 and Lemma 3.10(iv). Whenever the algorithm chooses an LLS step, $\varepsilon^{a}(w) < 10n^{3/2}\gamma$. Thus, the progress in $\mu$ will be at least as good as (in fact, much better than) the $(1 - \beta/\sqrt{n})$ guarantee we use for the affine scaling step in Proposition 3.4.

Let $w' = (x',y',s')$ be the central path element corresponding to $\mu'$, and let $\delta' = \delta(w')$. From now on we use the shorthand notation

$$\Theta := \log\frac{4n\kappa^{*}}{\gamma}.$$

We first show that

$$\Theta\varrho^{\mu'}(i,j) \ge 4\Theta\tau + 18\log n + 22\log 2 - 2\log\beta \tag{45}$$

for $\mu'$, and therefore, $\Theta\Psi^{\mu'}(i,j) \ge \min(2\Theta n,\ 4\Theta\tau + 18\log n + 22\log 2 - 2\log\beta) \ge 2\Theta\tau$ as $\tau \le n$. Recalling the definition $\kappa^{\delta}_{ij} = \kappa_{ij}\delta_j/\delta_i$, we see that according to Proposition 3.2,

$$\kappa^{\delta}_{ij} \le \frac{\kappa_{ij}}{(1-\beta)^{2}}\cdot\frac{x_i s_j}{\mu}, \quad\text{and}\quad \kappa^{\delta'}_{ij} = \kappa_{ij}\cdot\frac{x_i' s_j'}{\mu'}.$$

Thus,

$$\Theta\varrho^{\mu'}(i,j) \ge \Theta\varrho^{\mu}(i,j) + \log\mu - \log\mu' + 2\log(1-\beta) + \log x_i' - \log x_i + \log s_j' - \log s_j$$

$$\ge \Theta\varrho^{\mu}(i,j) + 5\Theta\tau + 31\log n + 44 - 4\log\beta + 2\log(1-\beta) + \log x_i' - \log x_i + \log s_j' - \log s_j.$$

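Using the near-monotonicity of the central path (Lemma 3.3), we have $x_i' \ge x_i^{*}/n$ and $s_j' \ge s_j^{*}/n$. Together with our assumptions $x_i^{*} \ge \beta x_i/(2^{10}n^{5.5})$ and $s_j^{*} \ge \beta s_j/(2^{10}n^{5.5})$, we see that

$$\log x_i' - \log x_i + \log s_j' - \log s_j \ge -13\log n - 20\log 2 + 2\log\beta.$$

Using the assumption $\varrho^{\mu}(i,j) \ge -\tau$ of the lemma, we can establish (45) as $\beta \le 1/8$.

Next, consider any $\ell \in [n]\setminus\{i,j\}$. From the triangle inequality Lemma 2.15(ii) it follows that $\kappa^{\delta'}_{ij} \le \kappa^{\delta'}_{i\ell}\cdot\kappa^{\delta'}_{\ell j}$, which gives $\varrho^{\mu'}(i,\ell) + \varrho^{\mu'}(\ell,j) \ge \varrho^{\mu'}(i,j)$. We therefore get

$$\max\{\Theta\varrho^{\mu'}(i,\ell),\ \Theta\varrho^{\mu'}(\ell,j)\} \ge \frac{1}{2}\Theta\varrho^{\mu'}(i,j) \overset{(45)}{\ge} 2\Theta\tau + 9\log n + 11\log 2 - \log\beta.$$

We next show that if $\Theta\varrho^{\mu'}(i,\ell) \ge 2\Theta\tau + 9\log n + 11\log 2 - \log\beta$, then $\Psi^{\mu'}(i,\ell) \ge 2\tau$. The case $\Theta\varrho^{\mu'}(\ell,j) \ge 2\Theta\tau + 9\log n + 11\log 2 - \log\beta$ follows analogously.

Consider any $0 < \bar\mu < \mu'$ with the corresponding central path point $\bar{w} = (\bar{x},\bar{y},\bar{s})$. The proof is complete by showing $\Theta\varrho^{\bar\mu}(i,\ell) \ge \Theta\varrho^{\mu'}(i,\ell) - 9\log n - 11\log 2 + \log\beta$. Recall that for central path elements, we have $\kappa^{\delta'}_{ij} = \kappa_{ij}x_i'/x_j'$, and $\kappa^{\bar\delta}_{ij} = \kappa_{ij}\bar{x}_i/\bar{x}_j$. Therefore

$$\Theta\varrho^{\bar\mu}(i,j) = \Theta\varrho^{\mu'}(i,j) + \log\bar{x}_i - \log x_i' - \log\bar{x}_j + \log x_j'.$$

Using Proposition 3.1, Lemma 3.3 and the assumption $x_i^{*} \ge \beta x_i/(2^{10}n^{5.5})$, we have $\bar{x}_j \le nx_j'$ and

$$\bar{x}_i \ge \frac{x_i^{*}}{n} \ge \frac{\beta x_i}{2^{10}n^{6.5}} \ge \frac{\beta(1-\beta)x_i'}{2^{10}n^{7.5}} \ge \frac{\beta x_i'}{2^{11}n^{7.5}}.$$

Using these bounds, we get

$$\Theta\varrho^{\bar\mu}(i,j) \ge \Theta\varrho^{\mu'}(i,j) - 9\log n - 11\log 2 + \log\beta,$$

completing the proof.

It remains to prove Lemma 4.4 and Lemma 4.5, addressing the more difficult case $\xi^{ll}_{\mathcal{J}} < 4\gamma n$. It is useful to decompose the variables into two sets. We let

$$B := \{t \in [n] : |Rs^{ll}_t| < 4\gamma n\}, \quad\text{and}\quad N := \{t \in [n] : |Rx^{ll}_t| < 4\gamma n\}. \tag{46}$$

The assumption $\xi^{ll}_{\mathcal{J}} < 4\gamma n$ implies that for every layer $J_k$, either $J_k \subseteq B$ or $J_k \subseteq N$. The next two lemmas describe the relations between $\delta$ and $\delta^{+}$.

Lemma 6.1. Let $w \in \mathcal{N}(\beta)$ for $\beta \in (0,1/8]$, and assume $\ell^{\delta}(\mathcal{J}) \le \gamma$ and $\varepsilon^{ll}(w) < 4\gamma n$. For the next iterate $w^{+} = (x^{+},y^{+},s^{+}) \in \mathcal{N}(2\beta)$, we have

(i) For $i \in B$,

$$\frac{1}{2}\cdot\sqrt{\frac{\mu^{+}}{\mu}} \le \frac{\delta_i^{+}}{\delta_i} \le 2\cdot\sqrt{\frac{\mu^{+}}{\mu}} \quad\text{and}\quad \delta_i^{-1}s_i^{+} \le \frac{3\mu^{+}}{\sqrt{\mu}}.$$

(ii) For $i \in N$,

$$\frac{1}{2}\cdot\sqrt{\frac{\mu}{\mu^{+}}} \le \frac{\delta_i^{+}}{\delta_i} \le 2\cdot\sqrt{\frac{\mu}{\mu^{+}}} \quad\text{and}\quad \delta_i x_i^{+} \le \frac{3\mu^{+}}{\sqrt{\mu}}.$$

(iii) If $i,j \in B$ or $i,j \in N$, then

$$\frac{1}{4} \le \frac{\kappa^{\delta}_{ij}}{\kappa^{\delta^{+}}_{ij}} = \frac{\delta_i^{+}\delta_j}{\delta_i\delta_j^{+}} \le 4.$$

(iv) If $i \in N$ and $j \in B$, then

$$\frac{\kappa^{\delta}_{ij}}{\kappa^{\delta^{+}}_{ij}} \ge 4n^{3.5}.$$

Proof. Part (i). By Lemma 3.10(i), we see that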

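$$\|\delta_B\Delta x^{ll}_B\|_{\infty} \le \|\delta_B\Delta x^{ll}_B + \delta_B^{-1}\Delta s^{ll}_B + x_B^{1/2}s_B^{1/2}\|_{\infty} + \|\delta_B^{-1}(\Delta s^{ll}_B + s_B)\|_{\infty}$$

$$= \|\delta_B\Delta x^{ll}_B + \delta_B^{-1}\Delta s^{ll}_B + x_B^{1/2}s_B^{1/2}\|_{\infty} + \sqrt{\mu}\,\|Rs^{ll}_B\|_{\infty} \le \sqrt{\mu}\,(6n\ell^{\delta}(\mathcal{J}) + 4n\gamma) \le 10n\gamma\sqrt{\mu} \le \sqrt{\mu}/64,$$

by the assumption on $\ell^{\delta}(\mathcal{J})$ and the definition of $B$.

By construction of the LLS step, $|x_i^{+} - x_i| = \alpha^{+}|\Delta x^{ll}_i| \le |\Delta x^{ll}_i|$, recalling that $0 \le \alpha^{+} \le 1$. Using the bound derived above, for $i \in B$ we get

$$\left|\frac{x_i^{+}}{x_i} - 1\right| \le \frac{|\Delta x^{ll}_i|}{x_i} = \frac{|\delta_i\Delta x^{ll}_i|}{\delta_i x_i} \le \frac{\sqrt{\mu}}{64\,\delta_i x_i} \le \frac{1}{32},$$

where the last inequality follows from Proposition 3.2. As

$$\frac{\delta_i^{+}}{\delta_i} = \sqrt{\frac{x_i^{+}s_i^{+}}{x_i s_i}}\cdot\frac{x_i}{x_i^{+}} \quad\text{and}\quad \sqrt{\frac{1-2\beta}{1+\beta}}\sqrt{\frac{\mu^{+}}{\mu}} \le \sqrt{\frac{x_i^{+}s_i^{+}}{x_i s_i}} \le \sqrt{\frac{1+2\beta}{1-\beta}}\sqrt{\frac{\mu^{+}}{\mu}}$$

by Proposition 3.2, the claimed bounds follow with $\beta \le 1/8$.

To get the upper bound on $\delta_i^{-1}s_i^{+}$, again with Proposition 3.2

$$\delta_i^{-1}s_i^{+} = \frac{\delta_i^{+}}{\delta_i\delta_i^{+}}\,s_i^{+} = \frac{\delta_i^{+}}{\delta_i}\cdot\sqrt{x_i^{+}s_i^{+}} \le 2\sqrt{\frac{\mu^{+}}{\mu}}\cdot\sqrt{(1+2\beta)\mu^{+}} \le \frac{3\mu^{+}}{\sqrt{\mu}}.$$

Part (ii). Analogously to (i).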

Part (iii). Immediate from parts (i) and (ii).

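Part (iv). Follows by parts (i) and (ii), and by the lower bound on $\mu/\mu^{+}$ obtained from Lemma 3.10(iv) as follows:

$$\frac{\kappa^{\delta}_{ij}}{\kappa^{\delta^{+}}_{ij}} = \frac{\delta_i^{+}\delta_j}{\delta_i\delta_j^{+}} \ge \frac{\mu}{4\mu^{+}} = \frac{1}{4(1-\alpha^{+})} \ge \frac{\beta}{12\sqrt{n}\,\varepsilon^{ll}(w)} \ge 4n^{3.5}.$$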

Lemma 4.4 (Restatement). Let $w = (x,y,s) \in \mathcal{N}(\beta)$ for $\beta \in (0,1/8]$, and let $\mathcal{J} = (J_1,\dots,J_p)$ be a $\delta(w)$-balanced partition. Assume that $\xi^{ll}_{\mathcal{J}}(w) < 4\gamma n$, and let $w^{+} = (x^{+},y^{+},s^{+}) \in \mathcal{N}(2\beta)$ be the next iterate obtained by the LLS step with $\mu^{+} = \mu(w^{+})$ and assume $\mu^{+} > 0$. Let $q \in [p]$ such that $\xi^{ll}_{\mathcal{J}}(w) = \xi^{ll}_{J_q}(w)$. If $\ell^{\delta^{+}}(\mathcal{J}) \le 4\gamma n$, then there exist $i,j \in J_q$ such that $x_i^{*} \ge \beta x_i^{+}/(16n^{3/2})$ and $s_j^{*} \ge \beta s_j^{+}/(16n^{3/2})$. Further, for any $\ell,\ell' \in J_q$, we have $\varrho^{\mu^{+}}(\ell,\ell') \ge -|J_q|$.

Proof of Lemma 4.4. Without loss of generality, let $\xi^{ll}_{\mathcal{J}} = \xi^{ll}_{J_q} = \|Rx^{ll}_{J_q}\|$ for a layer $q$ with $J_q \subseteq N$. The case $\xi^{ll}_{J_q} = \|Rs^{ll}_{J_q}\|$ and $J_q \subseteq B$ can be treated analogously.

By Lemma 3.10(iii), $\|Rs^{ll}_{J_q}\| \ge \frac{1}{2} - \frac{3}{4}\beta > \frac{1}{4} + 2n\gamma$, and therefore Lemma 4.2 provides a $j \in J_q$ such that $s_j^{*}/s_j \ge 1/(6\sqrt{n})$. Using Lemma 3.3 and Proposition 3.1 we find that $s_j^{+}/s_j \le 2n$ and so $s_j^{*}/s_j^{+} = s_j^{*}/s_j\cdot s_j/s_j^{+} \ge 1/(12n^{3/2}) > 1/(16n^{3/2})$.

The final statement $\varrho^{\mu^{+}}(\ell,\ell') \ge -|J_q|$ for any $\ell,\ell' \in J_q$ is also straightforward. From Lemma 6.1(iii) and the strong connectivity of $J_q$ in $G_{\delta,\gamma/n}$, we obtain that $J_q$ is strongly connected in $G_{\delta^{+},\gamma/(4n)}$. Hence, $\varrho^{\mu^{+}}(\ell,\ell') \ge -|J_q|$ follows by Lemma 4.1.

The rest of the proof is dedicated to showing the existence of an $i \in J_q$ such that $x_i^{*} \ge \beta x_i^{+}/(16n^{3/2})$. For this purpose, we will prove the following claim.

Claim 6.2. $\|\delta_{J_q}x^{*}_{J_q}\| \ge \dfrac{\beta\mu^{+}}{8\sqrt{n\mu}}$.

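In order to prove Claim 6.2, we define

$$z := (\delta^{+})^{-1}L^{\delta^{+}}_{J_{>q}}\big(\delta^{+}_{J_{>q}}(x^{*}_{J_{>q}} - x^{+}_{J_{>q}})\big) \quad\text{and}\quad w := x^{*} - x^{+} - z,$$

as in Lemma 5.2. By construction, $w \in W$ and $w_{J_{>q}} = 0$. Thus, $w_{J_q} \in W_{\mathcal{J},q}$ as defined in Section 3.4. Using the triangle inequality, we get

$$\|\delta_{J_q}x^{*}_{J_q}\| \ge \|\delta_{J_q}(x^{+}_{J_q} + w_{J_q})\| - \|\delta_{J_q}z_{J_q}\|. \tag{47}$$

We bound the two terms separately, starting with an upper bound on $\|\delta_{J_q}z_{J_q}\|$. Since $\ell^{\delta^{+}}(\mathcal{J}) \le 4\gamma n$, we have with Lemma 5.2 that

$$\|\delta^{+}_{J_q}z_{J_q}\| \le \ell^{\delta^{+}}(\mathcal{J})\,\|\delta^{+}_{J_{>q}}(x^{*}_{J_{>q}} - x^{+}_{J_{>q}})\| \le 4n\gamma\,\|\delta^{+}_{J_{>q}}(x^{*}_{J_{>q}} - x^{+}_{J_{>q}})\| = 4n\gamma\left\|\delta^{+}_{J_{>q}}x^{+}_{J_{>q}}\left(\frac{x^{*}_{J_{>q}}}{x^{+}_{J_{>q}}} - \mathbb{1}\right)\right\|$$

$$\le 4n\gamma\left(\|\delta^{+}x^{+}\|_{\infty}\cdot\left\|\frac{x^{*}}{x^{+}}\right\|_{1} + \sqrt{n\mu^{+}}\right) \le 4n\gamma\left(\frac{3}{2}\sqrt{\mu^{+}}\cdot\frac{4}{3}n + \sqrt{n\mu^{+}}\right) \le 16n^{2}\sqrt{\mu^{+}}\,\gamma, \tag{48}$$

where the penultimate inequality follows by Proposition 3.1, Proposition 3.2 and Lemma 3.3. We can use this and Lemma 6.1(ii) to obtain

$$\|\delta_{J_q}z_{J_q}\| \le \|\delta_{J_q}/\delta^{+}_{J_q}\|_{\infty}\cdot\|\delta^{+}_{J_q}z_{J_q}\| \le \frac{32n^{2}\gamma\mu^{+}}{\sqrt{\mu}} \le \frac{\beta\mu^{+}}{32n^{3}\sqrt{\mu}}, \tag{49}$$

using the definition of $\gamma$. The first RHS term in (47) will be bounded as follows.

Claim 6.3. $\|\delta_{J_q}(x^{+}_{J_q} + w_{J_q})\| \ge \frac{1}{2}\sqrt{\mu}\,\xi^{ll}_{\mathcal{J}}$.

Proof of Claim 6.3. We recall the characterization (25) of the LLS step $\Delta x^{ll} \in W$. Namely, there exists $\Delta s \in W^{\perp}_{\mathcal{J},1}\oplus\cdots\oplus W^{\perp}_{\mathcal{J},q}$ that is the unique solution to $\delta^{-1}\Delta s + \delta\Delta x^{ll} = -\delta x$. From the above, note that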

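$$\|\delta_{J_q}^{-1}\Delta s_{J_q}\| = \|\delta_{J_q}(x_{J_q} + \Delta x^{ll}_{J_q})\| = \sqrt{\mu}\,\|Rx^{ll}_{J_q}\| = \sqrt{\mu}\,\xi^{ll}_{\mathcal{J}}.$$

From the Cauchy-Schwarz inequality,

$$\|\delta_{J_q}^{-1}\Delta s_{J_q}\|\cdot\|\delta_{J_q}(x^{+}_{J_q} + w_{J_q})\| \ge \left|\left\langle\delta_{J_q}^{-1}\Delta s_{J_q},\ \delta_{J_q}(x^{+}_{J_q} + w_{J_q})\right\rangle\right| = \left|\left\langle\delta_{J_q}^{-1}\Delta s_{J_q},\ \delta_{J_q}x^{+}_{J_q}\right\rangle\right|. \tag{50}$$

Here, we used that $\Delta s_{J_q} \in W^{\perp}_{\mathcal{J},q}$ and $w_{J_q} \in W_{\mathcal{J},q}$. Note that

$$x^{+} = x + \alpha\Delta x^{ll} = x + \Delta x^{ll} - (1-\alpha)\Delta x^{ll} = -\delta^{-2}\Delta s - (1-\alpha)\Delta x^{ll}.$$

Therefore,

$$\left|\left\langle\delta_{J_q}^{-1}\Delta s_{J_q},\ \delta_{J_q}x^{+}_{J_q}\right\rangle\right| = \left|\left\langle\delta_{J_q}^{-1}\Delta s_{J_q},\ -\delta_{J_q}^{-1}\Delta s_{J_q} - (1-\alpha)\delta_{J_q}\Delta x^{ll}_{J_q}\right\rangle\right| \ge \|\delta_{J_q}^{-1}\Delta s_{J_q}\|^{2} - (1-\alpha)\left|\left\langle\delta_{J_q}^{-1}\Delta s_{J_q},\ \delta_{J_q}\Delta x^{ll}_{J_q}\right\rangle\right|.$$

By Lemma 5.3, there exists $\Delta\bar{x} \in W_{\mathcal{J},1}\oplus\cdots\oplus W_{\mathcal{J},p}$ such that $\|\delta_{J_q}(\Delta x^{ll}_{J_q} - \Delta\bar{x}_{J_q})\| \le 2n\ell^{\delta}(\mathcal{J})\sqrt{\mu}$. Therefore, using the orthogonality of $\Delta s_{J_q}$ and $\Delta\bar{x}_{J_q}$, we get that

$$\left|\left\langle\delta_{J_q}^{-1}\Delta s_{J_q},\ \delta_{J_q}\Delta x^{ll}_{J_q}\right\rangle\right| = \left|\left\langle\delta_{J_q}^{-1}\Delta s_{J_q},\ \delta_{J_q}(\Delta x^{ll}_{J_q} - \Delta\bar{x}_{J_q})\right\rangle\right| \le 2n\ell^{\delta}(\mathcal{J})\sqrt{\mu}\cdot\|\delta_{J_q}^{-1}\Delta s_{J_q}\|.$$

From the above inequalities, we see that

$$\|\delta_{J_q}(x^{+}_{J_q} + w_{J_q})\| \ge \|\delta_{J_q}^{-1}\Delta s_{J_q}\| - 2(1-\alpha)n\ell^{\delta}(\mathcal{J})\sqrt{\mu} = \sqrt{\mu}\,\xi^{ll}_{\mathcal{J}} - 2(1-\alpha)n\ell^{\delta}(\mathcal{J})\sqrt{\mu}.$$

It remains to show $(1-\alpha)n\ell^{\delta}(\mathcal{J}) \le \xi^{ll}_{\mathcal{J}}/4$. From Lemma 3.10(iv), we obtain

$$(1-\alpha)n\ell^{\delta}(\mathcal{J}) \le 3n^{3/2}\ell^{\delta}(\mathcal{J})\,\xi^{ll}_{\mathcal{J}}\,\beta^{-1},$$

using $\xi^{ll}_{\mathcal{J}} \ge \varepsilon^{ll}$. The claim now follows by the assumption $\ell^{\delta}(\mathcal{J}) \le \gamma$, and the choice of $\gamma$.

Proof of Claim 6.2. Using Lemma 3.10(iv),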

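$$\mu^{+} \le \frac{3\sqrt{n}\,\xi^{ll}_{\mathcal{J}}\,\mu}{\beta},$$

implying $\|\delta_{J_q}(x^{+}_{J_q} + w_{J_q})\| \ge \beta\mu^{+}/(6\sqrt{n\mu})$ by Claim 6.3. Now the claim follows using (47) and (49).

By Lemma 6.1(ii), we see that

$$\|\delta_{J_q}x^{+}_{J_q}\| \le \sqrt{n}\,\|\delta_{J_q}x^{+}_{J_q}\|_{\infty} \le \frac{3\sqrt{n}\,\mu^{+}}{\sqrt{\mu}}.$$

Thus, the lemma follows immediately from Claim 6.2: for at least one $i \in J_q$, we must have

$$\frac{x_i^{*}}{x_i^{+}} \ge \frac{\|\delta_{J_q}x^{*}_{J_q}\|}{\|\delta_{J_q}x^{+}_{J_q}\|} \ge \frac{\beta}{24n} \ge \frac{\beta}{16n^{3/2}}.$$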

Lemma 4.5 (Restatement). Let $w = (x,y,s) \in \mathcal{N}(\beta)$ for $\beta \in (0,1/8]$, and let $\mathcal{J} = (J_1,\dots,J_p)$ be a $\delta(w)$-balanced partition. Assume that $\xi^{ll}_{\mathcal{J}}(w) < 4\gamma n$, and let $w^{+} = (x^{+},y^{+},s^{+}) \in \mathcal{N}(2\beta)$ be the next iterate obtained by the LLS step with $\mu^{+} = \mu(w^{+})$ and assume $\mu^{+} > 0$. If $\ell^{\delta^{+}}(\mathcal{J}) > 4\gamma n$, then there exist two layers $J_q$ and $J_r$ and $i \in J_q$ and $j \in J_r$ such that $x_i^{*} \ge x_i^{+}/(8n^{3/2})$, and $s_j^{*} \ge s_j^{+}/(8n^{3/2})$. Further, $\varrho^{\mu^{+}}(i,j) \ge -|J_q\cup J_r|$, and for all $\ell,\ell' \in J_q\cup J_r$, $\ell \ne \ell'$ we have $\Psi^{\mu}(\ell,\ell') \le |J_q\cup J_r|$.

Proof of Lemma 4.5. Recall the sets $B$ and $N$ defined in (46). The key is to show the existence of an edge

$$(i',j') \in E_{\delta^{+},\gamma/(4n)} \quad\text{such that}\quad i' \in J_q \subseteq B,\ j' \in J_r \subseteq N,\ r < q. \tag{51}$$

Before proving the existence of such $i'$ and $j'$, we show how the rest of the statements follow. The existence of $i \in J_q$ and $j \in J_r$ such that $x_i^{*} \ge x_i^{+}/(8n^{3/2})$ and $s_j^{*} \ge s_j^{+}/(8n^{3/2})$ follows immediately from Lemma 4.2. The other statements are that $\varrho^{\mu^{+}}(i,j) \ge -|J_q\cup J_r|$, and for each $\ell,\ell' \in J_q\cup J_r$, $\ell \ne \ell'$, $\Psi^{\mu}(\ell,\ell') \le |J_q\cup J_r|$. According to Lemma 4.1, the latter is true (even with the stronger bound $\max\{|J_q|,|J_r|\}$) whenever $\ell,\ell' \in J_q$, or $\ell,\ell' \in J_r$, or if $\ell \in J_q$ and $\ell' \in J_r$. It is left to show the lower bound on $\varrho^{\mu^{+}}(i,j)$ and $\Psi^{\mu}(\ell,\ell') \le |J_q\cup J_r|$ for $\ell' \in J_q$ and $\ell \in J_r$.

From Lemma 6.1(iii), we have that if $\ell,\ell' \in J_q \subseteq B$ or $\ell,\ell' \in J_r \subseteq N$, then $\kappa^{\delta}_{\ell\ell'}/4 \le \kappa^{\delta^{+}}_{\ell\ell'}$. Hence, the strong connectivity of $J_r$ and $J_q$ in $G_{\delta,\gamma}$ implies the strong connectivity of these sets in $G_{\delta^{+},\gamma/(4n)}$. Together with the edge $(i',j')$, we see that every $\ell' \in J_q$ can reach every $\ell \in J_r$ on a directed path of length $\le |J_q\cup J_r| - 1$ in $G_{\delta^{+},\gamma/(4n)}$. Applying Lemma 4.1 for this setting, we obtain $\Psi^{\mu}(\ell,\ell') \le \varrho^{\mu^{+}}(\ell,\ell') \le |J_q\cup J_r|$ for all such pairs, and also $\varrho^{\mu^{+}}(i,j) \ge -|J_q\cup J_r|$.

The rest of the proof is dedicated to showing the existence of $i'$ and $j'$ as in (51). We let $k \in [p]$ such that $\ell^{\delta^{+}}(J_{\ge k}) = \ell^{\delta^{+}}(\mathcal{J}) > 4n\gamma$. To simplify the notation, we let $I = J_{\ge k}$.

When constructing $\mathcal{J}$ in Layering($\delta$, $\hat\kappa$), the subroutine Verify-Lift($\mathrm{Diag}(\delta)W$, $I$, $\gamma$) was called for the set $I = J_{\ge k}$, with the answer 'pass'. Besides $\ell^{\delta}(I) \le \gamma$, this guaranteed the stronger property that $\max_{ji}|B_{ji}| \le \gamma$ for the matrix $B$ implementing the lift (see Remark 2.17).

Let us recall how this matrix $B$ was obtained. The subroutine starts by finding a minimal $I' \subset I$ such that $\dim(\pi_{I'}(W)) = \dim(\pi_I(W))$. Recall that $\pi_{I'}(W) = \mathbb{R}^{I'}$ and $L^{\delta}_I(p) = L^{\delta}_{I'}(p_{I'})$ for every $p \in \pi_I(\mathrm{Diag}(\delta)W)$.

Consider the optimal lifting $L^{\delta}_I : \pi_I(\mathrm{Diag}(\delta)W) \to \mathrm{Diag}(\delta)W$. We defined $B \in \mathbb{R}^{([n]\setminus I)\times I'}$ as the matrix sending any $q \in \pi_{I'}(\mathrm{Diag}(\delta)W)$ to the corresponding vector $[L^{\delta}_{I'}(q)]_{[n]\setminus I}$. The column $B_i$ can be computed as $[L^{\delta}_{I'}(e^{i})]_{[n]\setminus I}$ for $e^{i} \in \mathbb{R}^{I'}$.

We consider the transformation

$$\bar{B} := \mathrm{Diag}(\delta^{+}_{[n]\setminus I}\,\delta^{-1}_{[n]\setminus I})\,B\,\mathrm{Diag}\big((\delta^{+}_{I'})^{-1}\delta_{I'}\big).$$

This maps $\pi_{I'}(\mathrm{Diag}(\delta^{+})W) \to \pi_{[n]\setminus I}(\mathrm{Diag}(\delta^{+})W)$.

Let $z \in \pi_I(\mathrm{Diag}(\delta^{+})W)$ be the singular vector corresponding to the maximum singular value of $L^{\delta^{+}}_I$, namely, $\|[L^{\delta^{+}}_I(z)]_{[n]\setminus I}\| > 4n\gamma\|z\|$. Let us normalize $z$ such that $\|z_{I'}\| = 1$. Thus,

$$\|[L^{\delta^{+}}_{I'}(z_{I'})]_{[n]\setminus I}\| > 4n\gamma.$$

Let us now apply $\bar{B}$ to $z_{I'} \in \pi_{I'}(\mathrm{Diag}(\delta^{+})W)$. Since $L^{\delta^{+}}_I$ is the minimum-norm lift operator, we see that

$$\|\bar{B}z_{I'}\| \ge \|[L^{\delta^{+}}_{I'}(z_{I'})]_{[n]\setminus I}\| > 4n\gamma.$$

We can upper bound the operator norm by the Frobenius norm $\|\bar{B}\| \le \|\bar{B}\|_F = \sqrt{\sum_{ji}\bar{B}_{ji}^{2}} \le n\max_{ji}|\bar{B}_{ji}|$, and therefore

$$\max_{ji}|\bar{B}_{ji}| > 4\gamma.$$

Let us fix $i' \in I'$ and $j' \in [n]\setminus I$ as the indices giving the maximum value of $|\bar{B}_{ji}|$. Note that $\bar{B}_{j'i'} = B_{j'i'}\,\delta^{+}_{j'}\delta_{i'}/(\delta^{+}_{i'}\delta_{j'})$.

Let us now use Lemma 2.16 for the pair $i',j'$, the matrix $B$ and the subspace $\mathrm{Diag}(\delta)W$. Noting that $B_{j'i'} = [L^{\delta}_{I'}(e^{i'})]_{j'}$, we obtain $\kappa^{\delta}_{i'j'} \ge |B_{j'i'}|$. Now,

$$\kappa^{\delta^{+}}_{i'j'} = \kappa^{\delta}_{i'j'}\cdot\frac{\delta^{+}_{j'}\delta_{i'}}{\delta^{+}_{i'}\delta_{j'}} \ge |B_{j'i'}|\cdot\frac{\delta^{+}_{j'}\delta_{i'}}{\delta^{+}_{i'}\delta_{j'}} = |\bar{B}_{j'i'}| > 4\gamma. \tag{52}$$

The next claim finishes the proof.

Claim 6.4. For $i'$ and $j'$ selected as above, (51) holds.

Proof. $(i',j') \in E_{\delta^{+},\gamma/(4n)}$ holds by (52). From the above, we have

$$|B_{j'i'}| > 4\gamma\cdot\frac{\delta^{+}_{i'}\delta_{j'}}{\delta_{i'}\delta^{+}_{j'}}.$$

According to Remark 2.17, $|B_{j'i'}| \le \gamma$ follows since Verify-Lift($\mathrm{Diag}(\delta)W$, $I$, $\gamma$) returned with 'pass'. We thus have

$$\frac{\delta^{+}_{i'}\delta_{j'}}{\delta_{i'}\delta^{+}_{j'}} < \frac{1}{4}.$$

Lemma 6.1 excludes the scenarios $i',j' \in N$, $i',j' \in B$, and $i' \in N$, $j' \in B$, leaving $i' \in B$ and $j' \in N$ as the only possibility. Therefore, $i' \in J_q \subseteq B$ and $j' \in J_r \subseteq N$. We have $r < q$ since $i' \in I = J_{\ge k}$ and $j' \in [n]\setminus I = J_{<k}$.

7 Initialization

Our main algorithm (Algorithm 2 in Section 3.6) requires an initial solution $w^{0} = (x^{0},y^{0},s^{0}) \in \mathcal{N}(\beta)$. In this section, we remove this assumption by adapting the initialization method of [VY96] to our setting. We use the "big-M method", a standard initialization approach for path-following interior point methods that introduces an auxiliary system whose optimal solutions map back to the optimal solutions of the original system. The primal-dual system we consider is

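$$
\begin{array}{rlcrl}
\min & c^\top x + M e^\top \underline{x} & \qquad & \max & y^\top b + 2M e^\top z \\
 & Ax - A\underline{x} = b & & & A^\top y + z + s = c \\
 & x + \bar{x} = 2Me & & & z + \bar{s} = 0 \\
 & x, \bar{x}, \underline{x} \ge 0 & & & -A^\top y + \underline{s} = Me \\
 & & & & s, \bar{s}, \underline{s} \ge 0.
\end{array}
\tag{Init-LP}
$$

The constraint matrix used in this system is

$$\hat{A} = \begin{pmatrix} A & -A & 0 \\ I & 0 & I \end{pmatrix}.$$

The next lemma asserts that the $\bar\chi$ condition number of $\hat{A}$ is not much bigger than that of $A$ of the original system (LP).

Lemma 7.1 ([VY96, Lemma 23]). $\bar\chi_{\hat{A}} \le 3\sqrt{2}(\bar\chi_A + 1)$.

We extend this bound for $\bar\chi^{*}$.

Lemma 7.2. $\bar\chi^{*}_{\hat{A}} \le 3\sqrt{2}(\bar\chi^{*}_A + 1)$.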

Proof. Let $D \in \mathbf{D}_n$ and let $\hat{D} \in \mathbf{D}_{3n}$ be the matrix consisting of three copies of $D$, i.e.

$$\hat{D} = \begin{pmatrix} D & 0 & 0 \\ 0 & D & 0 \\ 0 & 0 & D \end{pmatrix}.$$

Then

$$\hat{A}\hat{D} = \begin{pmatrix} AD & -AD & 0 \\ D & 0 & D \end{pmatrix}.$$

Row-scaling does not change $\bar\chi$ as the kernel of the matrix remains unchanged. Thus, we can rescale the last $n$ rows of $\hat{A}\hat{D}$ to the identity matrix, i.e. multiplying by the block-diagonal matrix $(I, D^{-1})$ from the left hand side. We observe that

$$\bar\chi_{\hat{A}\hat{D}} = \bar\chi\begin{pmatrix} AD & -AD & 0 \\ I & 0 & I \end{pmatrix} \le 3\sqrt{2}(\bar\chi_{AD} + 1),$$

where the inequality follows from Lemma 7.1. The lemma now readily follows as

$$\bar\chi^{*}_{\hat{A}} = \inf\{\bar\chi_{\hat{A}D} : D \in \mathbf{D}_{3n}\} \le \inf\{3\sqrt{2}(\bar\chi_{AD} + 1) : D \in \mathbf{D}_n\} = 3\sqrt{2}(\bar\chi^{*}_A + 1).$$

We show next that the optimal solutions of the original system are preserved for sufficiently large $M$. We let $d$ be the min-norm solution to $Ax = b$, i.e., $d = A^\top(AA^\top)^{-1}b$.

Proposition 7.3. Assume both primal and dual of (LP) are feasible, and $M > \max\{(\bar\chi_A + 1)\|c\|,\ \bar\chi_A\|d\|\}$. Every optimal solution $(x,y,s)$ to (LP) can be extended to an optimal solution $(x,\underline{x},\bar{x},y,z,s,\underline{s},\bar{s})$ to (Init-LP); and conversely, from every optimal solution $(x,\underline{x},\bar{x},y,z,s,\underline{s},\bar{s})$ to (Init-LP), we obtain an optimal solution $(x,y,s)$ by deleting the auxiliary variables.

Proof. If system (LP) is feasible, it admits a basic optimal solution $(x^{*},y^{*},s^{*})$ with basis $B$ such that $A_Bx^{*}_B = b$, $x^{*} \ge 0$, $A_B^\top y^{*} = c_B$ and $A^\top y^{*} \le c$. Using Proposition 2.1(ii) we see that

$$\|x^{*}_B\| = \|A_B^{-1}b\| = \|A_B^{-1}Ad\| \le \bar\chi_A\|d\| < M, \tag{53}$$

and using that $\|A\| = \|A^\top\|$ we observe

$$\|A^\top y^{*}\| = \|A^\top A_B^{-\top}c_B\| \le \|A^\top A_B^{-\top}\|\,\|c\| = \|A_B^{-1}A\|\,\|c\| \le \bar\chi_A\|c\| < M. \tag{54}$$

We can extend this solution to a solution of system (Init-LP) via setting $\bar{x}^{*} = 2Me - x^{*}$, $\underline{x}^{*} = 0$, $z^{*} = \bar{s}^{*} = 0$ and $\underline{s}^{*} = Me + A^\top y^{*}$. Observe that $\bar{x}^{*} > 0$ and $\underline{s}^{*} > 0$ by (53) and (54). Furthermore observe that by complementary slackness this extended solution for (Init-LP) is an optimal solution. The property that $\underline{s}^{*} > 0$ immediately tells us that $\underline{x}$ vanishes for all optimal solutions of (Init-LP) and thus all optimal solutions of (LP) coincide with the optimal solutions of (Init-LP), with the auxiliary variables removed.

The next lemma is from [MT03, Lemma 4.4]. Recall that $w = (x,y,s) \in \mathcal{N}(\beta)$ if $\|xs/\mu(w) - e\| \le \beta$.

Lemma 7.4. Let $w = (x,y,s) \in \mathcal{P}^{++}\times\mathcal{D}^{++}$, and let $\nu > 0$. Assume that $\|xs/\nu - e\| \le \tau$. Then $(1 - \tau/\sqrt{n})\nu \le \mu(w) \le (1 + \tau/\sqrt{n})\nu$ and $w \in \mathcal{N}(\tau/(1-\tau))$.
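The numerical content of Lemma 7.4 can be illustrated on a small example. The sketch below, in Python, assumes only the positive vectors $x$ and $s$ and a reference value $\nu$; it does not model primal or dual feasibility, and the function names and sample data are illustrative rather than taken from the paper.

```python
import math

def duality_gap(x, s):
    """Normalized duality gap mu(w) = <x, s> / n."""
    return sum(xi * si for xi, si in zip(x, s)) / len(x)

def centrality(x, s, nu):
    """Euclidean norm of xs/nu - e (coordinate-wise product)."""
    return math.sqrt(sum((xi * si / nu - 1.0) ** 2 for xi, si in zip(x, s)))

def check_lemma_7_4(x, s, nu, tol=1e-12):
    """Check both conclusions of Lemma 7.4 for the given positive vectors."""
    n = len(x)
    tau = centrality(x, s, nu)      # proximity measured against nu
    mu = duality_gap(x, s)
    gap_ok = ((1 - tau / math.sqrt(n)) * nu <= mu + tol
              and mu <= (1 + tau / math.sqrt(n)) * nu + tol)
    beta = centrality(x, s, mu)     # proximity measured against mu(w)
    nbhd_ok = tau < 1 and beta <= tau / (1 - tau) + tol
    return tau, mu, gap_ok, nbhd_ok

# Illustrative data: xs = (1.1, 0.9, 1.0), so tau = sqrt(0.02) for nu = 1.
tau, mu, gap_ok, nbhd_ok = check_lemma_7_4([1.0, 2.0, 0.5], [1.1, 0.45, 2.0], 1.0)
```

In this example $\mu(w)$ coincides with $\nu$ up to rounding, so the neighborhood bound holds with room to spare; the lemma is most useful when $\nu$ is only an estimate of $\mu(w)$.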

The new system has the advantage that we can easily initialize the system with a feasible solution in close proximity to the central path:

Proposition 7.5. We can initialize system (Init-LP) close to the central path with initial solution $w^{0} = (x^{0},y^{0},s^{0}) \in \mathcal{N}(1/8)$ and parameter $\mu(w^{0}) \approx M^{2}$ if $M > 15\max\{(\bar\chi_A + 1)\|c\|,\ \bar\chi_A\|d\|\}$.

Proof. The initialization follows along the lines of [VY96, Section 10]. We let $d$ be as above, and set

$$\bar{x}^{0} = Me,\quad x^{0} = Me,\quad \underline{x}^{0} = Me - d,$$
$$y^{0} = 0,\quad z^{0} = -Me,$$
$$\bar{s}^{0} = Me,\quad s^{0} = Me + c,\quad \underline{s}^{0} = Me.$$

This is a feasible primal-dual solution to system (Init-LP) with parameter

$$\mu^{0} = (3n)^{-1}\big(\langle x^{0},s^{0}\rangle + \langle\underline{x}^{0},\underline{s}^{0}\rangle + \langle\bar{x}^{0},\bar{s}^{0}\rangle\big) = (3n)^{-1}(3nM^{2} + Mc^\top e - Md^\top e) \approx M^{2}.$$

We see that

$$\left\|\frac{1}{M^{2}}\begin{pmatrix}\bar{x}^{0}\bar{s}^{0}\\ x^{0}s^{0}\\ \underline{x}^{0}\underline{s}^{0}\end{pmatrix} - e\right\|^{2} = M^{-2}\|c\|^{2} + M^{-2}\|d\|^{2} \le \frac{1}{9^{2}\bar\chi_A^{2}} \le \frac{1}{9^{2}}.$$

With Lemma 7.4 we conclude that $w^{0} = (x^{0},y^{0},s^{0}) \in \mathcal{N}\big(\frac{1/9}{1-1/9}\big) = \mathcal{N}(1/8)$.

Detecting infeasibility To use the extended system (Init-LP), we still need to assume that both the primal and dual programs in (LP) are feasible. For arbitrary instances, we first need to check if this is the case, or conclude that the primal or the dual (or both) are infeasible.

This can be done by employing a two-phase method. The first phase decides feasibility by running (Init-LP) with data $(A,b,0)$ and $M > \bar\chi_A\|d\|$. The objective value of the optimal primal-dual pair is 0 if and only if (LP) has a feasible solution. If the optimal primal/dual solution $(x^{*},\underline{x}^{*},\bar{x}^{*},y^{*},s^{*},\underline{s}^{*},\bar{s}^{*})$ has positive objective value, we can extract an infeasibility certificate in the following way. By the characterization of $\bar\chi_A$ as in Proposition 2.1(ii), there exists an optimal solution $x'$ with $\bar{x}' > 0$, and so by strong duality, $\bar{s}^{*} = 0$. From the dual, we conclude that $z = 0$, and therefore $A^\top y^{*} \le A^\top y^{*} + s^{*} + z = c = 0$. On the other hand, by assumption the objective value of the dual is positive, and so $y^\top b \ge y^\top b + 2Me^\top z > 0$.

Feasibility of the dual of (LP) can be decided by running (Init-LP) on data $(A,0,c)$ and $M > (\bar\chi_A + 1)\|c\|$ with the same argumentation: either the objective value of the dual is 0 and therefore the dual optimal solution $(y^{*},\underline{s}^{*},s^{*},\bar{s}^{*})$ corresponds to a feasible dual solution of (LP), or the objective value is negative and we extract a dual infeasibility certificate in the following way. By assumption $c^\top x \le c^\top x + Me^\top\underline{x} < 0$. Furthermore, there exists a basic optimal solution to the dual of (Init-LP) with $\underline{s} > 0$ and therefore $\underline{x}^{*} = 0$ for the optimal primal solution $(x^{*},\underline{x}^{*},\bar{x}^{*})$. So, we have $Ax^{*} = b = 0$, together with $c^\top x^{*} < 0$ yielding the certificate.
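The starting point from the proof of Proposition 7.5 is simple enough to construct and verify directly. The following Python sketch builds it for a tiny instance and checks the equality constraints of (Init-LP) as well as the proximity measure $\|M^{-2}(\bar{x}^{0}\bar{s}^{0}; x^{0}s^{0}; \underline{x}^{0}\underline{s}^{0}) - e\|$. The helper names, the sample data, and the value $M = 100$ are assumptions made for illustration; in particular $M$ is not derived from $\bar\chi_A$ here.

```python
import math

def solve(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    sol = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][k] * sol[k] for k in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol

def min_norm_solution(A, b):
    """d = A^T (A A^T)^{-1} b, assuming A has full row rank."""
    m, n = len(A), len(A[0])
    AAt = [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(m)]
           for i in range(m)]
    u = solve(AAt, b)
    return [sum(A[i][k] * u[i] for i in range(m)) for k in range(n)]

def initial_point(A, b, c, M):
    """Starting point of Proposition 7.5 (x_low, s_low are the underlined variables)."""
    n = len(c)
    d = min_norm_solution(A, b)
    return {"x": [M] * n, "x_low": [M - di for di in d], "x_bar": [M] * n,
            "y": [0.0] * len(A), "z": [-M] * n,
            "s": [M + ci for ci in c], "s_low": [M] * n, "s_bar": [M] * n}

def max_residual(A, b, c, M, p):
    """Largest violation of the five equality blocks of (Init-LP)."""
    m, n = len(A), len(c)
    Aty = [sum(A[i][k] * p["y"][i] for i in range(m)) for k in range(n)]
    res = [sum(A[i][k] * (p["x"][k] - p["x_low"][k]) for k in range(n)) - b[i]
           for i in range(m)]
    res += [p["x"][k] + p["x_bar"][k] - 2 * M for k in range(n)]
    res += [Aty[k] + p["z"][k] + p["s"][k] - c[k] for k in range(n)]
    res += [p["z"][k] + p["s_bar"][k] for k in range(n)]
    res += [-Aty[k] + p["s_low"][k] - M for k in range(n)]
    return max(abs(r) for r in res)

def proximity(p, M):
    """Norm of (products / M^2 - e) over all 3n complementary pairs."""
    pairs = list(zip(p["x_bar"], p["s_bar"])) + list(zip(p["x"], p["s"])) \
        + list(zip(p["x_low"], p["s_low"]))
    return math.sqrt(sum((u * v / M ** 2 - 1.0) ** 2 for u, v in pairs))

# Illustrative instance; M = 100 is an arbitrary choice for the demonstration.
A = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
b = [2.0, 2.0]
c = [1.0, 0.0, 1.0]
M = 100.0
point = initial_point(A, b, c, M)
residual = max_residual(A, b, c, M, point)
prox = proximity(point, M)
```

For this instance the proximity equals $\sqrt{\|c\|^{2} + \|d\|^{2}}/M$, matching the identity in the proof, and it stays below $1/9$, which is the hypothesis Lemma 7.4 needs with $\nu = M^{2}$.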

Finding the right value of M While Algorithm 2 does not require any estimate on $\bar\chi^{*}$ or $\bar\chi$, the initialization needs to set $M \ge \max\{(\bar\chi_A + 1)\|c\|,\ \bar\chi_A\|d\|\}$ as in Proposition 7.3. A straightforward guessing approach (attributed to J. Renegar in [VY96]) starts with a constant guess, say $\bar\chi_A = 100$, constructs the extended system, and runs the algorithm. In case the optimal solution to the extended system does not map to an optimal solution of (LP), we restart with $\bar\chi_A = 100^{2}$ and try again; we continue squaring the guess until an optimal solution is found. This would still require a series of $\log\log\bar\chi_A$ guesses, and thus result in a dependence on $\bar\chi_A$ in the running time. However, if we initially rescale our system using the near-optimal rescaling of Theorem 2.5, then we can turn the dependence from $\bar\chi_A$ to $\bar\chi^{*}_A$. The overall iteration complexity remains $O(n^{2.5}\log n\log(\bar\chi^{*}_A + n))$, since the running time for the final guess on $\bar\chi^{*}_A$ dominates the total running time of all previous computations due to the repeated squaring.

An alternative approach, which does not rescale the system, is to use Theorem 2.5 to approximate $\bar\chi_A$. In this case we repeatedly square a guess of $\bar\chi^{*}$ instead of $\bar\chi$, which takes $O(\log\log\bar\chi^{*}_A)$ iterations until our guess corresponds to a valid upper bound for $\bar\chi_A$.

Note that either guessing technique can handle bad guesses gracefully. For the first phase, if neither a feasible solution to (LP) is returned nor a Farkas certificate can be extracted, we have proof that the guess was too low by the above paragraph. Similarly, in phase two, when feasibility was decided in the affirmative for primal and dual, an optimal solution to (Init-LP) that corresponds to an infeasible solution to (LP) serves as a certificate that another squaring of the guess is necessary.
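The repeated-squaring scheme described above can be sketched as a short driver loop. This is a minimal Python illustration under stated assumptions: `run_with_guess` is a hypothetical callback standing in for "build (Init-LP) from the current guess, run the algorithm, and return a solution, or `None` if the outcome certifies that the guess was too low"; it is not an interface defined in the paper.

```python
def solve_with_guessing(run_with_guess, initial_guess=100.0, max_rounds=64):
    """Square the guess for the condition number until a run succeeds.

    run_with_guess(guess) is a hypothetical callback: it returns a result
    object on success and None when the guess is certified to be too low.
    """
    guess = initial_guess
    for rounds in range(1, max_rounds + 1):
        result = run_with_guess(guess)
        if result is not None:
            return result, guess, rounds
        guess = guess * guess  # repeated squaring: 100, 100^2, 100^4, ...
    raise RuntimeError("no valid guess found within max_rounds")

# Toy stand-in for the solver: succeeds once the guess exceeds a hidden threshold.
def toy_run(guess, hidden_threshold=1e7):
    return "optimal" if guess >= hidden_threshold else None

result, final_guess, rounds = solve_with_guessing(toy_run)
```

Because the guess is squared each round, its logarithm doubles, so the number of rounds grows only doubly logarithmically in the true value, and the cost of the last round dominates the total, as noted in the text.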

References

[ABGJ18] Xavier Allamigeon, Pascal Benchimol, Stéphane Gaubert, and Michael Joswig. Log-barrier interior point methods are not strongly polynomial. SIAM Journal on Applied Algebra and Geometry, 2(1):140–178, 2018. 8 [AMO93] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice-Hall, Inc., 1993. 13, 21 [BE14] Sébastien Bubeck and Ronen Eldan. The entropic barrier: a simple and optimal universal self-concordant barrier. arXiv preprint arXiv:1412.1587, 2014. 7 [Chu14] Sergei Chubanov. A polynomial algorithm for linear optimization which is strongly polynomial under certain conditions on optimal solutions. http://www.optimization-online.org/DB_HTML/2014/12/4710.html, 2014. 2 [CLS19] Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 938–942, 2019. 7 [DHNV20] Daniel Dadush, Sophie Huiberts, Bento Natura, and LászlóA Végh. A scaling- invariant algorithm for linear programming whose running time depends only on the constraint matrix. In Proceedings of the 52nd Annual ACM Symposium on Theory of Computing (STOC), pages 761–774, 2020. 1, 11 [Dik67] II Dikin. Iterative solution of problems of linear and quadratic programming. Doklady Akademii Nauk, 174(4):747–748, 1967. 9 [DLHL15] Jesús A De Loera, Raymond Hemmecke, and Jon Lee. On augmentation algorithms for linear and integer-linear programming: From edmonds–karp to bland and beyond. SIAM Journal on Optimization, 25(4):2494–2511, 2015. 8

57 [DLKS19] Jesús A De Loera, Sean Kafer, and Laura Sanità. Pivot rules for circuit- augmentation algorithms in linear optimization. arXiv preprint arXiv:1909.12863, 2019. 7 [DNV20] Daniel Dadush, Bento Natura, and LászlóA. Végh. Revisiting Tardos’s framework for linear programming: Faster exact solutions using approximate solvers. In Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science, 2020. (to appear). 7, 11, 12 [DS08] Samuel I Daitch and Daniel A Spielman. Faster approximate lossy generalized flow via interior point algorithms. In Proceedings of the 40th annual ACM symposium on Theory of Computing, pages 451–460, 2008. 7 [Fra11] András Frank. Connections in Combinatorial Optimization. Number 38 in Oxford Lecture Series in Mathematics and its Applications. Oxford University Press, 2011. 16, 17 [GL97] Clovis C. Gonzaga and Hugo J. Lara. A note on properties of condition numbers. Linear Algebra and its Applications, 261(1):269 – 273, 1997. 10, 30 [Gof80] Jean-Louis Goffin. The relaxation method for solving systems of linear inequalities. Mathematics of Operations Research, 5(3):388–414, 1980. 2 [Gon92] Clovis C Gonzaga. Path-following methods for linear programming. SIAM review, 34(2):167–224, 1992. 3, 21 [HT02] Jackie CK Ho and Levent Tun¸cel. Reconciliation of various complexity and condition measures for linear programming problems and a generalization of Tardos’ theorem. In Foundations of Computational Mathematics, pages 93–147. World Scientific, 2002. 7, 9 [Kar84] Narendra Karmarkar. A new polynomial-time algorithm for linear programming. In Proceedings of the 16th Annual ACM Symposium on Theory of Computing (STOC), pages 302–311, 1984. 7 [Kha79] Leonid G Khachiyan. A polynomial algorithm in linear programming. In Doklady Academii Nauk SSSR, volume 244, pages 1093–1096, 1979. 2 [KM13] Tomonari Kitahara and Shinji Mizuno. A bound for the number of different basic solutions generated by the simplex method. 
Mathematical Programming, 137(1- 2):579–586, 2013. 2 [KOT13] Satoshi Kakihara, Atsumi Ohara Ohara, and Takashi Tsuchiya. Information geometry and interior-point algorithms in semidefinite programs and symmetric cone programs. Journal of Optimization Theory and Applications, 157:749–780, 2013. 7 [KOT14] Satoshi Kakihara, Atsumi Ohara Ohara, and Takashi Tsuchiya. Curvature inte- grals and iteration complexities in SDP and symmetric cone programs. Computa- tional Optimization and Applications, 57:623–665, 2014. 7 [KT13] Tomonari Kitahara and Takashi Tsuchiya. A simple variant of the Mizuno–Todd– Ye predictor-corrector algorithm and its objective-function-free complexity. SIAM Journal on Optimization, 23(3):1890–1903, 2013. 7

[LMT09] Guanghui Lan, Renato D. C. Monteiro, and Takashi Tsuchiya. A polynomial predictor-corrector trust-region algorithm for linear programming. SIAM Journal on Optimization, 19(4):1918–1946, 2009. 4, 5, 25

[LS14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in Õ(√rank) iterations and faster algorithms for maximum flow. In Proceedings of the 55th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 424–433, 2014. 7

[LS15] Yin Tat Lee and Aaron Sidford. Efficient inverse maintenance and faster algorithms for linear programming. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 230–249, 2015. 7

[LS19] Yin Tat Lee and Aaron Sidford. Solving linear programs with Õ(√rank) linear system solves. arXiv preprint 1910.08033, 2019. 7

[Mad13] Aleksander Mądry. Navigating central path with electrical flows: From flows to matchings, and back. In Proceedings of the 54th IEEE Annual Symposium on Foundations of Computer Science, pages 253–262. IEEE, 2013. 7

[Meg83] Nimrod Megiddo. Towards a genuinely polynomial algorithm for linear programming. SIAM Journal on Computing, 12(2):347–353, 1983. 2

[Meh92] Sanjay Mehrotra. On the implementation of a primal-dual interior point method. SIAM Journal on Optimization, 2(4):575–601, 1992. 7

[MMT98] Nimrod Megiddo, Shinji Mizuno, and Takashi Tsuchiya. A modified layered-step interior-point algorithm for linear programming. Mathematical Programming, 82(3):339–355, 1998. 7, 36

[MT03] Renato D. C. Monteiro and Takashi Tsuchiya. A variant of the Vavasis-Ye layered-step interior-point algorithm for linear programming. SIAM Journal on Optimization, 13(4):1054–1079, 2003. 4, 6, 7, 8, 9, 21, 25, 37, 55

[MT05] Renato D. C. Monteiro and Takashi Tsuchiya. A new iteration-complexity bound for the MTY predictor-corrector algorithm. SIAM Journal on Optimization, 15(2):319–347, 2005. 4, 8, 24, 25, 36, 37

[MT08] Renato D. C. Monteiro and Takashi Tsuchiya. A strong bound on the integral of the central path curvature and its relationship with the iteration-complexity of primal-dual path-following LP algorithms. Mathematical Programming, 115(1):105–149, 2008. 7

[MTY93] Shinji Mizuno, Michael Todd, and Yinyu Ye. On adaptive-step primal-dual interior-point algorithms for linear programming. Mathematics of Operations Research, 18:964–981, 1993. 3, 4, 7, 22, 35

[O’L90] Dianne P. O’Leary. On bounds for scaled projections and pseudoinverses. Linear Algebra and its Applications, 132:115–117, April 1990. 10

[OV17] Neil Olver and László A. Végh. A simpler and faster strongly polynomial algorithm for generalized flow maximization. In Proceedings of the Forty-Ninth Annual ACM Symposium on Theory of Computing (STOC), pages 100–111, 2017. 2

[Ren88] James Renegar. A polynomial-time algorithm, based on Newton’s method, for linear programming. Mathematical Programming, 40(1-3):59–93, 1988. 7

[Ren94] James Renegar. Is it possible to know a problem instance is ill-posed?: some foundations for a general theory of condition numbers. Journal of Complexity, 10(1):1–56, 1994. 2

[Ren95] James Renegar. Incorporating condition measures into the complexity theory of linear programming. SIAM Journal on Optimization, 5(3):506–524, 1995. 2

[Sch03] Alexander Schrijver. Combinatorial Optimization – Polyhedra and Efficiency. Springer, 2003. 16, 17

[Sma98] Steve Smale. Mathematical problems for the next century. The Mathematical Intelligencer, 20:7–15, 1998. 2

[SSZ91] György Sonnevend, Josef Stoer, and Gongyun Zhao. On the complexity of following the central path of linear programs by linear extrapolation II. Mathematical Programming, 52(1-3):527–553, 1991. 7

[ST04] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), 2004. 7

[Ste89] G. W. Stewart. On scaled projections and pseudoinverses. Linear Algebra and its Applications, 112:189–193, 1989. 9, 10

[Tar85] Éva Tardos. A strongly polynomial minimum cost circulation algorithm. Combinatorica, 5(3):247–255, Sep 1985. 2

[Tar86] Éva Tardos. A strongly polynomial algorithm to solve combinatorial linear programs. Operations Research, pages 250–256, 1986. 2

[Tod90] Michael J. Todd. A Dantzig–Wolfe-like variant of Karmarkar’s interior-point linear programming algorithm. Operations Research, 38(6):1006–1018, 1990. 9

[TTY01] Michael J. Todd, Levent Tunçel, and Yinyu Ye. Characterizations, bounds, and probabilistic analysis of two complexity measures for linear programming problems. Mathematical Programming, 90(1):59–69, Mar 2001. 10

[Tun99] Levent Tunçel. Approximating the complexity measure of Vavasis-Ye algorithm is NP-hard. Mathematical Programming, 86(1):219–223, Sep 1999. 5, 13

[Vai89] Pravin M. Vaidya. Speeding-up linear programming using fast matrix multiplication. In Proceedings of the 30th IEEE Annual Symposium on Foundations of Computer Science, pages 332–337, 1989. 7

[Vav94] Stephen A. Vavasis. Stable numerical algorithms for equilibrium systems. SIAM Journal on Matrix Analysis and Applications, 15(4):1108–1131, 1994. 5, 11

[vdB20] Jan van den Brand. A deterministic linear program solver in current matrix multiplication time. In Proceedings of the Symposium on Discrete Algorithms (SODA), 2020. 7

[vdBLL+21] Jan van den Brand, Yang P. Liu, Yin Tat Lee, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, and Di Wang. Minimum cost flows, MDPs, and L1-regression in nearly linear time for dense instances. In STOC (to appear), 2021. 7

[vdBLN+20] Jan van den Brand, Yin Tat Lee, Danupon Nanongkai, Richard Peng, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, and Di Wang. Bipartite matching in nearly-linear time on moderately dense graphs. In IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 919–930, 2020. 7

[vdBTLSS20] Jan van den Brand, Yin Tat Lee, Aaron Sidford, and Zhao Song. Solving tall dense linear programs in nearly linear time. In Proceedings of the 52nd Annual ACM Symposium on Theory of Computing (STOC), pages 775–788, 2020. 7

[Vég17] László A. Végh. A strongly polynomial algorithm for generalized flow maximization. Mathematics of Operations Research, 42(2):179–211, 2017. 2

[VW12] Virginia Vassilevska Williams. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing, pages 887–898, 2012. 36

[VY96] Stephen A. Vavasis and Yinyu Ye. A primal-dual interior point method whose running time depends only on the constraint matrix. Mathematical Programming, 74(1):79–120, 1996. 2, 5, 6, 9, 10, 22, 25, 28, 35, 36, 37, 41, 54, 56

[Ye97] Yinyu Ye. Interior-Point Algorithms: Theory and Analysis. John Wiley and Sons, New York, 1997. 23

[Ye05] Yinyu Ye. A new complexity result on solving the Markov decision problem. Mathematics of Operations Research, 30(3):733–749, 2005. 2

[Ye06] Yinyu Ye. Improved complexity results on solving real-number linear feasibility problems. Mathematical Programming, 106(2):339–363, April 2006. 5

[Ye11] Yinyu Ye. The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4):593–603, 2011. 2