Signature Redacted

Faster Algorithms for Matrix Scaling and Balancing via Convex Optimization

by Dimitrios Tsipras

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Master of Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2017

Signature redacted

A u th or ...... Department of Electrical Enginee1 g~dn Computer Science May 19, 2017 Signature redacted C ertified by ...... Aleksander Mqdry Assistant Professor Thesis Supervisor Signature redacted A ccepted by ...... Professor /esli&AlKolodziejski MASSACHUSETTS INSTITUTE Chair, Department Committee on Graduate Theses OF TECHNOLOGY

[JUN 232017 LIBRARIES ARCHIVES 2 Faster Algorithms for Matrix Scaling and Balancing via Convex Optimization by Dimitrios Tsipras

Submitted to the Department of Electrical Engineering and Computer Science on May 19, 2017, in partial fulfillment of the requirements for the degree of Master of Science

Abstract In this thesis, we study matrix scaling and balancing, which are fundamental problems in scientific computing, with a long line of work on them that dates back to the 1960s. We provide algorithms for both these problems that, ignoring logarithmic factors involving the dimension of the input matrix and the size of its entries, both run in time 6(m log K log 2 (1/E)) where e is the amount of error we are willing to tolerate. Here, K represents the ratio between the largest and the smallest entries of the optimal scalings. This implies that our algorithms run in nearly-linear time whenever K is quasi-polynomial, which includes, in particular, the case of strictly positive matrices. We complement our results by providing a separate algorithm that uses an interior-point method and runs in time O(m 3/2(log log K + log(1/E))), which becomes O(m 3 / 2 log(1/)) for the case of matrix balancing and the doubly-stochastic variant of matrix scaling. In order to establish these results, we develop a new second-order optimization framework that enables us to treat both problems in a unified and principled manner. This framework identifies a certain generalization of linear system solving which we can use to efficiently minimize a broad class of functions, which we call second-order robust. We then show that in the context of the specific functions capturing matrix scaling and balancing, we can leverage and generalize the work on Laplacian system solving to make the algorithms obtained via this framework very efficient. This thesis is based on joint work with Michael B. Cohen, Aleksander Mqdry, and Adrian Vladu.

Thesis Supervisor: Aleksander Mqdry Title: Assistant Professor

3 4 Acknowledgments

First of all, I would like to thank my advisor Aleksander Mqdry for all his support and inspiration during the duration of this thesis. His unending energy and enthusiasm motivated me to learn this field and actually do successful research in it. I am excited for our research progress on the years to come. Also, I would like to thank my student collaborators Michael B. Cohen and Adrian Vladu, who apart from being part of this work, helped me discover everything I know about the field of algorithms via convex optimization. Moreover, I cannot emphasize how helpful the whole ToC group at MIT has been in this whole effort. Being constantly around knowledgeable and enthustiastic people, kept mentally pushing me to move forward and learn more and more. Last but not least, I would like to deeply thank my friends and family for being close to me and supporting me through my whole life and academic career. None of this would have been possible without them.

5 6 Contents

1 Introduction 11

1.1 Previous Work ...... 12

1.2 Our Contributions ...... 14

1.3 O ur A pproach ...... 15

1.4 Outline of the Thesis ...... 17

1.5 Prelim inaries ...... 17

2 Minimizing Functions Satisfying Second-Order Robustness With Re- spect to f,, 21

3 Matrix Scaling and Balancing 25

3.1 M atrix Scaling ...... 26

3.1.1 Matrix Scaling via Convex Optimization ...... 27

3.1.2 Regularization for Solving via Box-Constrained Newton Method 29

3.1.3 Bounding the Magnitude of the Optimal and Approximately Optimal Scalings for Doubly Stochastic Scaling ...... 30

3.2 M atrix Balancing ...... 31

3.2.1 Reducing Matrix Balancing to Convex Optimization ...... 32

3.2.2 Regularization for Solving via Box-Constrained Newton Method 33

3.2.3 Bounding the Condition Number of the Optimal Balancing . . 34

3.3 Tolerance of Finite Precision ...... 35

3.4 Matrix Scaling and Balancing as Nonlinear Flow Problems ...... 36

7 4 Implementing an O(log n)-Oracle in Nearly Linear Time 39

4.1 Sparse Decompositions ...... 40

4.2 Constructing Approximate Mappings ...... 44

5 Matrix Scaling and Balancing with Exponential Cone Programming 51

5.1 Setting Up the Interior Point Method, and Bounding the Number of S tep s ...... 53

5.2 Implementing an Iteration of the Interior Point Method ...... 56

5.3 Error Tolerance ...... 58

5.4 Putting Everything Together ...... 59

A Deferred Proofs 67 A.1 Proof of Theorem 2.0.4 ...... 67 A.2 Proof of Lemma 3.1.7 ...... 69 A.3 Proof of Lemma 3.1.10 ...... 70 A.4 Proof of Theorems 3.1.5 and 3.1.6 ...... 71 A.5 Proof of Lemma 3.1.13 ...... 71 A.6 Proof of Lemma 3.2.6 ...... 72 A.7 Proof of Theorems 3.2.4 and 3.2.5 ...... 73 A.8 Proof of Lemma 3.2.10 ...... 74 A.9 Proof of Lemma 5.3.2 ...... 74 A.10 Proof of Lemma 5.2.3 ...... 76

8 List of Figures

4.2.1 Implementation of an -- approximate mapping ...... 44 4.2.2 Optimizing a vertex sparsifier chain ...... 47

9 I

10 Chapter 1

Introduction

Matrix balancing and scaling are problems of fundamental importance in scientific computing, as well as in statistics, operations research, image reconstruction, and engineering. The literature on these problems [38, 40, 17, 19, 11, 22, 43, 47, 46, 52, 41,

3, 14, 201 is truly extensive and dates back to 1960s. In general, both these problems boil down to finding diagonal scalings for a matrix such that the rescaled matrix gains some favorable structure. They both are a key primitive in most mainstream numerical software packages (MATLAB, R, LAPACK, EISPACK) [35, 34, 42, 16, 2].

More specifically, in the matrix scaling problem, we are given a nonnegative matrix A, and our goal is to find diagonal matrices X, Y such that the matrix XAY has prescribed row and column sums. The most common instance of this problem is the one where we want to scale the matrix to doubly stochastic - in other words, we want to make all rows and columns sum to 1. This procedure, also known as as Sinkhorn-

Knopp scaling [47], has been repeatedly used since as early as 1937 in a number of diverse areas, such as forecasting traffic through telephone lines [26], engineering [3], statistics [13, 461, machine learning [8], and even computational complexity [32, 18].

A standard application for scaling is preconditioning linear system solving. Given a linear system Ax = b, one can produce a solution by computing Y(XAY)-'Xb, since applying the inverse of XAY is more numerically stable procedure than directly applying the inverse of A [52]. Another example, which commonly occurs in statistics, is iterative proportional fitting, studied ever since 1912 [541, which is used for

11 standardizing cross-tabulations. Even more interestingly, matrix scaling turned out to have surprising connections to fundamental problems in the theory of computation. Notably, in [321, it is observed that scaling can be used to approximate the permanent of any nonnegative matrix within a multiplicative factor of e'. Furthermore, deciding whether the permanent of a bipartite graph's adjacency matrix is 0 or at least 1 is equivalent to deciding whether the graph contains a perfect matching. The scaling based method can, as a matter of fact, be used to compute maximum matchings in bipartite graphs, which is a classical and intensely studied problem in graph algorithms [12, 15, 33]. For more history and information on this problem, we refer the reader to Idel's extensive survey [20], or [44] for a list of applications. Now, in the matrix balancing problem, we are, again, given a nonnegative matrix A, and our goal here is to find a diagonal matrix D such that the matrix DAD- is balanced, that is the sum of every row is equal to the sum of the corresponding column. Multiplying by D and D-1 corresponds to scaling each row by some factor, and dividing the corresponding column by the inverse of this factor. This procedure has been introduced by Osborne [38] with the purpose of preconditioning matrices in order to increase the accuracy of eigenvalue computation, since the balancing operation does not change the eigenvalues of the matrix. The initial algorithm was based on a simple iterative method, and was followed by a long series of improvements and extensions. Notably, Osborne [381 studied a version of this problem where rows and columns, are required to be equal in t 2 norm. It turns out, however, that the f, version we study is equivalent, as it was shown in [39] that matrix balancing in any fixed i, norm, p > 1 reduces to the f, version of the problem (which is the variant of the problem we consider here).

1.1 Previous Work

The early methods used for solving these problems - Osborne's iteration for balancing, and the RAS method for scaling - are simple iterative algorithms. However, merely the task of analyzing their convergence turned out to be a major challenge. Significant

12 effort has gone into understanding their convergence [4, 45, 39, 32], and providing better analyses or better iterative methods has been a key problem within this realm.

The major shortcoming of the methods obtained so far in this context is their very large running time. For matrix scaling, Kalantari and Kachiyan [21] obtained an algorithm that finds an 8-approximate solution and runs in time 0(n4 log(1/e)). This algorithm used the ellipsoid method, but [21] also proposed, but did not rigorously analyze, a formulation for interior point methods which they expected to run in 3 time 6(m 3 log(1/8)). Then, Nemirovsky and Rothblum [36] analyzed an interior point method which runs in time O(m4 log(1/e)). Finally, Samorodnitsky, Linial, and Wigderson [32] gave a strongly polynomial 0(n7 ) time algorithm.

For the case of matrix balancing, Parlett and Reinsch [40] provided an iterative method based on Osborne's iteration, without proving convergence. Then, Grad [17] proved that Osborne's iteration converges in the limit. The first polynomial time bound was obtained by Kalantari, Khachiyan, and Shokoufandeh [22], who gave an algorithm with running time 6(n, log(1/e)).

Another important strain of work in this domain was focused on the related t variant of the balancing problem. Schneider and Schneider [43] gave a non-iterative algorithm running in time 0(n), improved to O(mn + n2) by Young, Tarjan, and Orlin [53]. More recently, Schulman and Sinclair [45] provided an analysis of the classical Osborne-Parlett-Reinsch obtaining a running time of 0(n2m), and gave a version of it with running time O(n3 log(1/6)).

Finally, if one considers a relaxed regime in which the running time is allowed to depend polynomially - instead of logarithmically - on the (inverse of the) desired accuracy of the solution, the current state-of-the-art is given by Samorodnitsky, Linial, and Wigderson [32], who obtain 0(n3 e-2) running time for scaling. In the case of balancing, recently, Ostrovsky, Rabani, and Yousefi [391 made a significant progress by obtaining running times of 6(m + ne-2) and 6(n Me-1).

13 1.2 Our Contributions

We provide algorithms for both matrix scaling and balancing.

For matrix scaling, we establish an algorithm that runs in time

S(mlog(r,(U*) + r,(V*)) log 2 S) where U* and V* are the optimal scalings, r,(.) is the maximum ratio between the entries of its argument, SA is the sum of the entries in the input matrix, and 6 is the measure of the target error of the scaling.

For matrix balancing, we establish a running time of

S(m log ,(D*) log2 1) where WA is the ratio of the sum of the entries to the minimum nonzero entry, D* is the optimal balancing, r,(-) has the same meaning as above, and e is the measure of the target balancing error, normalized by the sum of the entries in the input matrix.

Notably, our running times depend logarithmically on both the target accuracy and the magnitude of the entries in the optimal balancing or scaling. This implies that if the optimal solution has quasi-polynomially bounded entries, our algorithms run in nearly linear time O(mlog(1/e)) (ignoring logarithmic factors involving the entries of the input matrix). This includes, for instance, the case when input matrix has all its entries positive or, in case of matrix balancing, if there just exists a single row/column pair with all positive entries.

One should note though that, in general, K can be very large. Because of that, we complement our results with alternative algorithms for such poorly conditioned instances. These algorithms are based on interior point methods, with appropriately chosen barriers, commonly used in exponential programming [1]. We show that the linear system solves required by the interior point method every iteration can be reduced via Schur complementing to approximately solving a Laplacian system, which can be done in nearly linear time using any standard Laplacian solver [50, 24, 25, 23,

14 7, 27, 28]. This yields a running time of

0M(m/2 log WA for balancing, and

((m3/2 log W + log log((U*) + K(V*))

for scaling, where WA is the ratio between the largest and smallest nonzero entry of A, and K(-) is the maximum ratio between the entries of the argument. One can also show that the i-dependent term can be ignored when the matrix scaling instance corresponds to the doubly-stochastic case.

1.3 Our Approach

We approach the problem of scaling and balancing problems by applying a continuous optimization lense. More precisely, we solve both matrix scaling and balancing problems by casting them as tasks of minimizing certain convex function. In fact, in the case of balancing, that function is directly inspired by the one used in [39]; for scaling, it is function derived from the one used in [21]. (Although both these functions require certain regularization to yield optimal performance.) Since our goal is to obtain logarithmic - instead of polynomial - dependence on the (inverse of the) desired accuracy e, it would be tempting to use well-known tools for convex programing, such as ellipsoid method or interior point method. However, these methods are, a priori, computationally expensive. This motivates us to look for different, more direct approaches. As a first step, we introduce a technique for minimizing functions that are second- order robust with respect to f,, or in other words, functions whose Hessians do not change too much within an &o ball. Therefore, if - after rescaling by the Hessian at the current point - we were somehow able to perform projected gradient descent steps within this region, we would find the minimizer of the function within that region in

15 a very small number of iterations. However, projecting onto an t, ball after rescaling is computationally expensive, so such an approach seems doomed to failure.

We show, however, that if we are able to implement a specific primitive, called k-oracle, which approximately minimizes a quadratic within a region that is within a factor of k larger than the target fo, ball, a simple iterative method will produce the global optimum in a small number of iterations. More precisely, assuming we can implement this oracle, we show how to minimize a convex function f that is second-order robust with respect to o to within E additive error from optimum in

0 ((kRc + 1)log f(xo)

iterations, where xO is the starting point, x* is the minimizer of f, and R, is the c radius of the level set of xO.

Now, since the number of iterations required by box-constrained Newton is very mild, the entire difficulty in applying this method shifts to efficiently implementing a k-oracle. We show that for functions whose Hessian is symmetric diagonally dominant, with nonzero off-diagonal entries, or SDD in short1 , we can implement a

O(log n)-oracle in nearly linear time in Hessian sparsity.

We achieve this by considering the Laplacian solver of Lee, Peng and Spielman [291, and carefully lifting the solutions produced in coarser (and smaller) approximations of the underlying matrix to solutions for the initial matrix without allowing them to exceed the boundaries of a O(log n) radius f, ball.

Finally, once the whole continuous optimization framework is setup, applying it to the matrix scaling and balancing problem is fairly straight forward and only a matter of applying the right regularization and combining it with the efficient implementation of the O(log n)-oracle.

1which is essentially a Laplacian matrix plus a nonnegative diagonal

16 1.4 Outline of the Thesis

The rest of the paper is organized as follows. First, we introduce relevant notation and concepts in Section 1.5. Then, in Chapter 2 we introduce a class of convex functions we call second order robust with respect to l. For these, we develop a specific optimization primitive called box-constrained Newton method.

We describe how we can apply the primitive from Chapter 2 to matrix balancing and scaling in Chapter 3, by reducing these problem to a convex function minimization with favorable structure. In order to complete our algorithm, in Chapter 4, we show how to efficiently implement an iteration of the box constrained Newton in the special case where the Hessian of the function is SDD. In Chapter 5 we provide a different approach for balancing and scaling based on interior point methods. Supplementary proofs and technical details are presented in the Appendix.

1.5 Preliminaries

Vectors We let 0, i E R" denote the all zeros and all ones vectors, respectively. When it is clear from the context, we apply scalar operations to vectors with the interpretation that they are applied coordinate-wise.

Matrices We write matrices in bold. We use I to denote the identity matrix, and 0 to denote the zero matrix. Given a matrix A, we denote its number of nonzero entries by nnz(A). When it is clear from the context, we use m to denote the the number of nonzeros; similarly, we use n to denote the dimension of the ambient space.

We denote by SA the sum of entries of A, by fA the minimum nonzero entry of

A, and by WA the ratio between these quantities. We use supp(A) to denote the set of pairs of indices (i, j) corresponding to the nonzero entries of A. Given a matrix

A, we define rA = Al to be the vector consisting of row sums, and cA = AT 1 to be the vector consisting of column sums.

17 Positive Semidefinite Ordering and Approximation For symmetric matrices A, B E R"' we use A - B to represent the fact that that xTAx < xTBx, for all x. A symmetric matrix A E R"' is positive semidefinite (PSD) if A >, 0. We use

4, -, -< in a similar fashion. For vectors x, we define the norm I|xIIA = v/xTAx. Given two PSD matrices A and B, and a parameter a > 0, we use A ~ B to denote the fact that e-c -B - A - el -B.

Laplacian and SDD matrices We will crucially rely on a family of matrices with very special structure, namely symmetric diagonally dominant (SDD) matrices. These are matrices A, that symmetric and moreover each diagonal entry is larger than the sum of absolute values of the row, that is for every i

isi

A special case of SDD matrices are Laplacian matrices, which have negative off- diagonal entries and the sum of each row is required to be zero. The structure of such matrices can be exploited to solve linear systems in nearly linear time as shown in

[50, 24, 25, 23, 7, 27,28].

Diagonals For x E RI we denote by D(x) E R"'X the diagonal matrix where ID(x)ii = xi. Given a nonnegative diagonal matrix D, we use K(D) to denote the ratio between its largest and smallest entry. We will overload notation and, for any matrix

A E RfX"l, use D(A) to denote the main diagonal of A, that is (lD(A))jj = As and (D(A))ij = 0 for i # j.

Gradients and Hessians Given a function f we denote by Vf(x) its gradient at x, and by V 2f(x) its Hessian at x. When it is clear from the context, we also use H, for the Hessian.

Block Matrices As part of our algorithms, we will consider partitioning the coordinates of vectors into sets of indices F and C. When we compute the quadratic

18 form of a matrix with these vectors, we need to be able to reason about how values in each component interact with the rest of the vector. For that reason it is convenient to denote the block form notation for a matrix A as:

A[Fc] A = A[FF] A[C,F] A[c,C]

Schur Complements For a matrix A E R"'f and a partition of its indices (F, C), the Schur complement of F in A is defined as

Sc(A, F) A[c,c] - A[C,F]A[FF]A[F,C].

The exact use of Schur complements will become clear in Chapter 4,5. These are objects that naturally arise during Gaussian elimination for the solution of linear systems. By pivoting out variables F the remaining system to solve for variables of C is exactly the Schur complement of F in A.

19 20 Chapter 2

Minimizing Functions Satisfying Second-Order Robustness With Respect to fo0

The central element of our approach is developing an efficient second-order method based minimization framework for a suitably chosen class of functions. To motivate the choice of this class, recall that second-order methods for function minimization are iterative in nature, and they boil down to repeated minimizing the local quadratic approximation of the function around the current point. Consequently, in order to obtain meaningful guarantees about the progress made such methods, one needs to have that this local quadratic approximation constitutes a good approximation of the function not only at the current point but also in a reasonably large neighborhood of that point. The most natural way to obtain such a guarantee is to ensure that the Hessian of the function (which is the basis of our local quadratic approximations) does not change by more than a constant factor in that neighborhood. As a result, the functions we are interested in optimizing in this paper are the ones that satisfy that property in an t ball around the current point. This is formalized in the following definition.

Definition 2.0.1 (Second-Order Robust w.r.t. f,). We say that a convex function

21 f : R -+ R is second-order robust (SOR) with respect to f, if, for any x, y E R' such that lix - y11" < 1,

V 2 f(x) ~2 V2 f(y), or equivalently, -V2f(x) 4 V2 2 V2

Note that the exact choice of the constant 2 in the exponent of the above approximation guarantee is somewhat arbitrary. If a function satisfies SOR condition for some other positive constant in that bound, it can be made to satisfy the above bound via simple rescaling of our space (which can be viewed as rescaling of that function).

Now, the above definition suggests a natural framework for optimizing such functions. Namely, in every iteration, optimize a local quadratic approximation of the function within a unit f, ball around the current point. As we will see shortly, this approach can be rigorously analyzed. In particular, our key technical result is that if we apply this approach to an SOR function whose Hessians has additionally a special structure, i.e., those for which the Hessian is, essentially, a symmetric diagonally dominant (SDD) matrix, we can implement every iteration in time nearly linear in the number of nonzero entries of the Hessian. This leads to running time bounds captured by the following theorem.

Theorem 2.0.2 (Minimizing Second-Order Robust Functions w.r.t o). Let f

R' -+ R be a second-order robust (SOR) function with respect to , such that its

Hessian is symmetric diagonally dominant (SDD) with nonpositive off-diagonals, and

has m, nonzero entries. Given a starting point x 0 E R' we can compute a point x,

such that f(x) - f(x*) < E, in time

SmR log fXo) - f(x*)

where R. = supx:f(x)f(xo) |x - x*|| is the {x diameter of the corresponding level set

of f.

Note that the bounds provided by the above theorem are, in a sense, the best

22 possible for any kind of approach that relies on repeated minimization of a local approximation of a function in an Et-ball neighborhood. In particular, as each step can make a progress of at most 1 in ft&-norm towards the optimal solution, one would expect the total number of steps to be Q(R,). It turns out that the above theorem is all we need to establish our results for scaling and balancing problems (except the ones relying on the interior point method). That is, these results can be cast as applying the above theorem to an appropriate SOR function. We provide all the details in Section 3. Now, the first step to proving the above Theorem 2.0.2 is to view each iteration of our iterative minimization procedure as a call to a certain oracle problem.

Definition 2.0.3. We say that a procedure 0 is a k-oracle for a class of matrices M, if on input (A, b), where A C M C R"'X, and b E R', returns a vector i satisfying

1. |iIc < k, and

2. jiTA + bTi 1minjzj 11 i (QzTAz+bTz).

Observe that minimizing the function _jTAz + bT without any constraints on i corresponds to solving a linear system Ai = b. So, one can view the k-oracle problem as a certain generalization of linear system solving. Specifically, it is a task in which we aim to find a point in the L,-ball of diameter k around the origin that is closest (in a certain sense) to the solution to that linear system. If b is sufficiently small, the k-oracle problem corresponds directly to solving that system. One can view the parameter k as the measure of the "quality" of our k-oracle. The smaller it is, the faster convergence the overall procedure will have. Importantly, however, the value of k impacts only the convergence and not the quality of the final solution. The following theorem makes this relationship precise.

Theorem 2.0.4. Let f : D -+ R' be a function that is second-order robust with respect to E. Let 0 be a k-oracle for {V 2f (x) : x E D}, along with an initial point x0 E D and an accuracy parameterE. Let R, = supx:,(x)f(X0) lix - x*| , where x* is a minimizer of f. Then one can produce a solution XT satisfying f(xT) - f(x*) E

23 using (f(XO) f( 0 ((kRo + 1) log calls to 0.

We present the proof of this theorem in Section A.1 of the Appendix.

24 Chapter 3

Matrix Scaling and Balancing

Having developed our main optimization primitives, we can develop efficient algorithms for matrix scaling and matrix balancing. Our approach is essentially the same for both problems, and differs only in technical details.

At the high level, we will construct convex functions with optima corresponding to exact scalings/balancings of the matrix. Moreover, the gradient of these functions will be directly related to the amount of error each coordinate is responsible for. This will allow us to prove that approximately optimal points correspond to E-balancings/e- scalings. The fact that that these functions are second-order robust with respect to f.. makes it sufficient to apply the optimization method of Section 2. But to get a good running time, we require two more ingredients.

Firstly, proving runtime bounds for this method requires an upper bound on the

E, radius of the level set of the initial point, i.e. the R. parameter defined in Theo- rem 2.0.4. Depending on the structure of the matrix, there are several different bounds that one can prove, depending only on parameters of the original problem. However, the most interesting case is when we are promised that the exact scaling/balancing of the matrix is "small" (in the sense that the ratio between factors is, say, polynomial).

In that case, we can regularize the function to turn this promise into a guarantee for the size of the level set without sacrificing too much accuracy. Moreover, using a simple doubling trick the algorithm does not require knowledge of such a parameter, and it will only appear as a factor in the runtime of the algorithm.

25 Secondly, we need an efficient implementation of a good k-oracle, that the method needs in order to perform its iterations. It turns out that in the case of functions that capture matrix scaling and balancing, we can implement a O(log n)-oracle which runs in nearly linear time in the sparsity of the input. For the remainder of this section, we define the convex functions that we need optimize, show how to regularize them, and prove bounds on the corresponding R" parameters. We will describe and analyze the implementation of a O(log n)-oracle in Section 4.

3.1 Matrix Scaling

We now formally define the scaling problem, along with the notion of E-scaling.

Definition 3.1.1 (Matrix Scaling). Let A C R "" be a nonnegative matrix and r, c E R" be vectors such that E"1 ri = E ci, and |Ir|oo, I|cL,|, 1. We say that two nonnegative diagonal matrices X and Y (r, c)-scale A if the matrix M = XAY satisfies M1 = r and MT1 = c, i.e. row i sum to ri and column j sum to c3 for every i, j.

Definition 3.1.2 (E-(r, c) scaling). Given nonnegative A and positive diagonal matrices X, Y, we say that (X, Y) is an e-(r, c) scaling (or 6-scaling, when r and c are clear from the context) for matrix A if the matrix M = XAY satisfies

||rM - r 11 + 11cM - c1 < E.

Definition 3.1.3 (Scalable and Almost-Scalable Matrices). A nonnegative matrix A, is called (r, c)-scalable, if there exist X and Y that (r, c)-scale A. It is called almost

(r, c)-scalable if for every e > 0, there exist X, and Ye that e-(r, c) scale A.

'In literature we also encounter this problem for non-square matrices; however solving squares is sufficient, since given A E R"XC, we can reduce to this instance by scaling the square matrix Or, AT. The upper bound on r and c is harmless, since for larger values we can always shrink all of A, r, c and e by the same factor in order to enforce this constraint.

26 There are well-known necessary and sufficient conditions about the scalability of A stated in the following lemma.

Lemma 3.1.4 ([321). A nonnegative matrix A is exactly (r, c)-scalable iff for every

zero minor Z x L of A,

1. ZEEZ: ri > Ej L C3 .

2. Equality in 1 holds iff Z' x L' is also a zero minor.

A nonnegative matrix A is almost (r, c)-scalable iff condition 1 holds.

We will cast matrix scaling as a convex optimization problem and show that applying the method from section 2 yields a good approximate scaling.

Theorem 3.1.5. Let A be a matrix, that has an (r, c) scaling (U*, V*). Then, we can compute an e-(r, c) scaling of A in time

5 (mlog(K(U*) + K(V*)) log 2 (SAE-1))

This implies that if U* and V* are, say, quasi-polynomially bounded, we can find an approximate scaling in nearly linear time. If fact, we can generalize this statement to obtain a similar result for the case of approximate scalings. This is made precise in Theorem 3.1.6.

3.1.1 Matrix Scaling via Convex Optimization

We will compute an (approximate) scaling of the matrix by (approximately) minimizing the function

f(x,y) = Aiie'-i - rixi - E cjy . (3.1.1) 1 i,j!n \ in 1 j

The technical statement we need to establish is summarized in the following theorem.

27 Theorem 3.1.6. Suppose that there exist a point z* = (x*, y*) for which f(z*)E -f* e2 /(3n) and ||z*|| B. Then we can compute an E-(r, c) scaling of A in time

0 (mBlog 2 (SAE 1 ))

First we will establish that approximate optimality of f implies an approximate scaling of the matrix.

Lemma 3.1.7. Let A be an e-scalable matrix. Let f* = inf(,y) f(x, y). Then, a pair of vectors (x, y) satisfying f(x, y) - f* < E2/3n, for 0 < E < 1, yields an E-(r, c) scaling of A: M = ID(exp(x)) -A -D(exp(y)) .

Note that we compare the value of f(x, y) to its infimum, as for the case of almost scalable matrices it is possible that this value is attained only to the limit. To prove this lemma, we first look at the first and second order derivatives of f.

Lemma 3.1.8. Let M be the matrix obtained by scaling A with vectors (x, y), i.e.

M = D(exp(x)) - A - D(exp(y)). The gradient and Hessian of f satisfy the identities:

Vf(x, y) = - , L-cM j L-CJ

V2f(XY)= [D(rM) -M L-MT D(CM)

We can observe that any (x, y) for which Vf(x, y) = 0 yield diagonal matrices that exactly scale A. It is also worth noticing, that while the gradient depends on the target row and column sums, the Hessian does not. By proving that a large gradient in e2 norm implies the non-optimality of f we can prove Lemma 3.1.7. The details are technical and are presented in Section A.2 of the appendix.

28 3.1.2 Regularization for Solving via Box-Constrained Newton Method

It is straightforwards to verify that the function we are minimizing (defined in Equa- tion 3.1.1), satisfies the correct requirements for us to apply the tools of Section 2.

Lemma 3.1.9. The function f is convex, second-order robust with respect to f c, and its Hessian is SDD.

Proof. By observing the Hessian of the function, we can see that it is a Laplacian matrix. Therefore, it is positive semi-definite thus f is convex. To prove that it is second-order robust, we notice that adding some z with IzII 1 to the current scaling corresponds to multiplying each row and column by some factor between 1/e and e.

2 By writing down the quadratic form of the Hessian, vTV f(x)v = Ei Mi,(v, - Vj)2 we observe that each Mij will only be multiplied by some factor between 1/e2 and e2 , proving that

- V2f(x) = Vf(x + z) 4 e2 V2f(x), concluding the proof.

Even if we know that there is some point approximately optimizing f close to our initial point, we cannot directly apply Theorem 2.0.4, since we need to upper bound the radius of the entire level set. This can be achieved by adding a regularizing term to f. This term will have small impact to the additive error we can achieve, but ensure that the entire level set is within some i ball around our initial point. The properties of the regularized function can be verified directly. We state the proof in Section A.3 of the Appendix for the sake of completeness.

Lemma 3.1.10. Suppose that there exist a point z* = (x*, y,*) for which f(z*) - f* < E2 /(3n) and ||zl,| B. Then the regularization of f defined as

E2 ( f (x, y) = f(x, y) + 3662eB yz(exi + -xi) + (e - eY4) (3.1.2) satisfies the following properties

29 1. f is second-order robust with respect to x and has a SDD Hessian,

2. f(x) < f(x), and there is a point 3* such that f(Y*) : f* +

3. for all z' such that f(z') < f(0), ||z'I|K = O(B log(nsA6 1 )).

Theorems 3.1.5 and 3.1.6 follow from applying Theorem 2.0.2 on the regularized function defined in Equation 3.1.2, and combining with the guarantee from Lemma 3.1.7 and Lemma 3.1.10. The complete proof is presented in Section A.4 of the Appendix. We note that we don't need to know a bound on B a priori. We can simply run our algorithm more than one times, doubling B at each iteration without increasing our runtime by more than a factor of two.

3.1.3 Bounding the Magnitude of the Optimal and Approxi- mately Optimal Scalings for Doubly Stochastic Scaling

In order to provide bounds for the magnitude of the scaling factors that only depend on the parameters of the initial problem we refer to the following lemmas from [211 for the case of double stochastic (i.e. (1,1)) scaling.

Lemma 3.1.11 (Lemma 1 of [21]). If A is strictly positive, then it can be scaled to

doubly stochastic by diagonal matrices U, V with log(,(U) + K(V)) O(log(1/eA)).

Lemma 3.1.12 (Corollary 1 of [21]). If A is scalable, then it can be scaled to doubly stochastic by diagonals diagonal matrices U, V with log(r,(U) + K(V)) < O(n log(1/fA)).

For almost scalable matrices, there can be arbitrarily good solutions, using arbitrarily large scaling factors. To prove bounds on the runtime of finding an approximate doubly-stochastic matrix, we will have to explicitly demonstrate an vector that approximately minimizes function f while having small f, norm.

Lemma 3.1.13. If A is almost-doubly-stochastic scalable, then there exist points

(x, y) such that f (x, y) - f* < E2 /3n, such that 11(x, y)|, O(n log(nwA/E)).

The proof of the lemma is presented in Section A.5.

30 3.2 Matrix Balancing

Our approach for the balancing problem is identical to that for the scaling problem, with minor technical differences. The formal definition of the problem and the notion of approximation we are considering follow.

Definition 3.2.1 (Matrix Balancing). Let A be a square nonnegative matrix. We say that A is balanced if the sum of each row is equal to the sum of the corresponding

collumn, i.e. rA = CA. We say that a nonnegative diagonal matrix D balances A if the matrix M = DAD- 1 is balanced.

Definition 3.2.2 (E-Balanced Matrix [22]). We say that a nonnegative matrix M E

R"'f is E-balanced if

IrM - CM112 _ VX =1((rM)i - (cM)i)2 <6. Z 1 i,j~n MjE~~~ i

Observe that this definition is invariant to a global scaling of all the entries of the matrix by some factor. There is a very simple condition that characterizes the set of matrices that can be balanced

Lemma 3.2.3 ([22]). A nonnegative matrix A E R x" can be balanced if and only if the graph with adjacency matrix A is strongly connected.

In the case where the graph is not strongly connected, the matrix can have its rows and columns rearranged to be written as a lower triangular block matrix, with strongly connected diagonal blocks. The reason no scaling exists is that off diagonal elements will always create imbalances. However, by setting all of these entries to zero, we are implicitly scaling each block globally by a large amount. Therefore, since the case of matrices that cannot be balanced is both easy to detect and easy to fix, for the rest of the paper we consider only matrices that can be balanced, and therefore represent strongly connected graphs. We can now state our main theorem for this section, which follows our initial discussion.

31 Theorem 3.2.4. Let A be a matrix that can be balanced by the diagonal matrix D*. Then, we can compute an E-approximate balancing of A in time

0(m log r,(D*) log 2 (wAE- 1 ))

This immediately implies that if D* is, say, quasi-polynomially conditioned, we can find an approximate balancing in nearly linear time. Again, we can generalize this result to hold for approximate balancings. We make this statement precise in Theorem 3.2.5.

3.2.1 Reducing Matrix Balancing to Convex Optimization

Similarly to the scaling problem, we consider (a slightly different) convex function as defined in [22], f(x) = Aijexi-i, (3.2.1) 1

Theorem 3.2.5. Suppose that there exists a point x such that f(x) f* + E2+A/24, and lx||,K < B. Then, we can compute an c-approximate balancing of A in time

O(mBlog2 (WA )-1).

We will follow the same steps as the scaling approach. First, we prove that small additive error in function optimization implies an approximate balancing for A.

Lemma 3.2.6. Consider a matrix A and the correspondingfunction f. Any vector x satisfying f(x) - f* E2 A/8 yields an E-approximate balancing of A:

M = D(exp(x)) -A - ID(exp(-x)) .

32 Proving the lemma requires computing the first and second order derivatives of f.

Lemma 3.2.7. Let M be the matrix obtained by balancing A with the vector x, which corresponds to M = D(exp(x)).A.D(exp(-x)). The gradient and Hessian of f satisfy the identities:

Vf(x) = rM - cm,

V 2 f(x) = D(rm + cm) - (M + MT).

Intuitively, since the gradient is 0 precisely when the corresponding point produces an exact balancing, a small gradient should imply a good approximate balancing. This guides the proof of Lemma 3.2.6. We will prove that a large gradient corresponds to being able to significantly decrease the function value, thus contradicting the approximate optimality of the point. The technical details are presented in Section A.6.

3.2.2 Regularization for Solving via Box-Constrained Newton Method

We observe that the function f defined in Equation 3.2.1 satisfies all the conditions required to efficiently minimize it using the method we described in Section 2.

Lemma 3.2.8. The function f is convex, second-order robust with respect to , and its Hessian is SDD.

The method we described in Section 2 depends on a promise concerning the point we initialize it with. Recall that in order to apply Theorem 2.0.2 we require an upper bound on the size of the f. box containing the level set of the initial point. In order to provide good bounds, we regularize f. The description and effect of this regularization are captured by the following lemma.

Lemma 3.2.9. Suppose that there exists a point x such that f(x) <; f* + E 2 A/24,

33 and 1|x||. < B. Then, the regularization of f is defined as

f(x) = f(x) + n + e6x) (3.2.2) 48neBZei i=1 and satisfies the following properties:

1. f is second-order robust with respect to f. and has a SDD Hessian,

2. f(x) < f(x), and if Y* is the minimizer of f, then f(3*) f(x*) + E2 A/24,

3. for all y such that f(y) < f(0), 11y - x*|' = O(Blog(nwAe-1 )).

The details of the lemma are identical to Lemma 3.1.10, and we therefore omit the proof. In particular, this lemma implies that approximately optimizing the regularized function will still produce an approximately balanced matrix. Theorem 3.2.4 then follows from applying Theorem 2.0.2 on the regularized function defined in Lemma 3.2.9, along with the error guarantee from Lemma 3.2.6. The technical details are in Section A.7 of the Appendix. Similarly to the scaling section, we don't need to know the bound on B a priori. Just trying increasing value of B (doubling at every iteration) suffices without increasing our runtime.

3.2.3 Bounding the Condition Number of the Optimal Bal- ancing

As we saw above, the running time given by Theorem 3.2.4 depends logarithmically on r,(D*), where D* is the matrix that achieves the optimal balancing. While, in general ,(D*) can be exponentially large (and therefore we might be better off running the interior point method described in Section 5), tighter connectivity of the graph implies better bounds:

Lemma 3.2.10. Let A E R"*' be a nonnegative matrix. Suppose that the graph with adjacency matrix A is strongly connected, and every vertex can reach every other vertex within at most k hops. Then the matrix D* that perfectly balances A has log s;(D*) = O(k log WA).

34 The proof of the lemma is in Section A.8 of the Appendix.

This shows the extreme cases for the magnitude of the log-condition number.

Corollary 3.2.11. If A is a balanceable matrix, and D* perfectly balances it, then

log r,(D*) = O(n log WA). If A is strictly positive, then log ,(D*) = O(log WA).

3.3 Tolerance of Finite Precision

The exposition of the analysis so far is under the assumption of exact arithmetic. However, our algorithms do in fact tolerate finite fixed-point precision on the scale of the natural parameters of the problem (n, E, sA and WA). It is therefore sufficient to use a number of bits that is logarithic in the input parameters of the problem.

Between iterations, we store a fixed-point representation of the variables xi. These are, by construction, bounded by the parameter R,. It is important (at least if using fixed-point rather than floating-point) that we are storing the xi rather than the actual scalings exi.

When iterating, we first determine the post-scaling elements of the matrix. These can also simply be stored in fixed-point-i.e. up to additive error. Note that this rounding could completely eliminate very small entries of the matrix. This representation then gives us the gradient and Hessian of the problem up to additive error. To make the additive error polynomially small only a logarithmic number of bits are needed, because the entries of the scaled matrix can never be more than polynomial (in the natural parameters mentioned). This follows from the fact that the objective function, which includes the sum of all the entries, cannot increase.

Finally, there is an polynomially small absolute lower bound on the eigenvalues of the Hessian, simply from the regularizer itself. This ensures that additive error to the gradient and Hessian can only affect the function value improvement by a polynomially larger amount, and ensures the stability of the k-oracle algorithm. Thus polynomially small error is sufficient, requiring only logarithmically many bits.

35 3.4 Matrix Scaling and Balancing as Nonlinear Flow Problems

We show that both matrix scaling and balancing can be phrased as instances of a more general problem, which can be seen as finding a potential-induced flow that routes a fixed demand (see [10] for a comprehensive treatment of nonlinear networks). To see that, given a weighted directed graph G = (V, E, w) let us define the edge-vertex incidence matrix B being an n x m matrix with rows indexed by edges and columns indexed by vertices such that

B if e = (V,u) E E,

Bv,e= 1 if e = (u,v) E E,

0 otherwise.

Using this matrix we define the nonlinear operator L as follows.

Definition 3.4.1. Let G = (V, E, w) be a directed graph with vertex-edge incidence matrix B, and let W = D(w). We define the operator L associated with G as

L(x) = BT W exp(Bx) . (3.4.1)

This can be seen as a nonlinear generalization of the Laplacian operator, which is a linear operator defined as L = BTWB. There is extensive literature on solving Laplacian linear systems [50, 24, 25, 23, 7, 27, 28]. We argue that our framework can be used to solve systems of the form

L(x) = d. (3.4.2)

This can be seen as finding vertex potentials x which induce a flow vector f:

fuv = wuv -exu-XV , for all (u, v) E E,

36 such that f routes a given demand d. As it turns, out the solution to this system is the minimizer of a function similar to those defined in Equations 3.1.1 and 3.2.1. More precisely:

Lemma 3.4.2. Let G = (V, E, w) be a directed graph with nonnegative weights, let A be its adjacency matrix, and let L be the operator associated with G, as defined as in Equation 3.4. 1. Consider the function f defined as

f(x) = E Aijexi~x - E dixi . (3.4.3) 1

Then f has a minimizer x if and only if it is the solution to the system L(x) = d.

The proof follows directly from writing optimality conditions for f, noting that the condition that Vf(x) = 0 is equivalent to L(x) = d. Similarly to Theorem 3.1.6 and Theorem 3.2.5, we can provide conditions on function value error to bound the error lid - 1(x) 112. Also, in order to obtain a good running time, we require regularizing f in a manner similar to the regularization applied in Lemmas 3.1.10 and 3.2.9. Note that for scaling and balancing, since we require problem specific error guarantees, the regularization needs to customized accordingly. Finally, we state without proof that balancing and scaling are instances of solving L(x) = d.

Observation 3.4.3. Let A E R"*' be a balanceable nonnegative matrix. Let L be the nonlinear operator associated with the graph with adjacency matrix A. Then the solution x to 1(x) = 0 yields a balancing ID(exp(x)).

Observation 3.4.4. Let A E R' ,"n be a (r, c)-scalable nonnegative matrix. Let 12 0 A be the nonlinear operator associated with the graph with adjacency matrix . 0 0 Then the solution z = (x, Y)T, with x, y E R n to 'C(Z) = (r, -C)T yields a (r, c)-scaling (ID(exp (x)), ]D (exp (y)).

37 38 Chapter 4

Implementing an O(log n)-Oracle in Nearly Linear Time

In Section 3 we reduced the balancing and scaling problems to the approximate minimization of second-order robust functions with respect to the f. norm. All that is left to have a complete algorithm, we need a fast procedure to implement a k-oracle as in

Definition 2.0.3. Namely, show how to construct an O(log n)-oracle for the problem,

min xTMX + (b, x) , (4.0.1) where M is an SDD matrix. For this section, whenever we say that a matrix is SDD we will also imply that the off-diagonal entries are nonpositive.

One possible approach, is to use standard convex optimization reductions to turn this problem into the minimization of the maximum of an f, norm and anf2 norm subject to linear constraints. This problem can be solve in time O(mni1/ 3) using the multiplicative weights framework as applied in [6, 5]. The resulting algorithm for implementing the k-oracle would take time O(m + n4/3 ), by taking advantage of spectral sparsification algorithms [48, 31, 30]. Instead, we will come up with a faster algorithm.

39 4.1 Sparse Decompositions

Our approach, based on the Lee-Peng-Spielman solver [29], is to identify large sets of vertices where the problem is "easy" to solve and then deal with the rest of the graph

(reduced in size) recursively. The particular notion of "easy" we are going to use, is that of strong diagonal dominance.

Definition 4.1.1. A matrix M is ce-strongly diagonally dominant (o-SDD), if for all

i54

The reason that such matrices enable us to solve the corresponding problems fast is that they can be well-approximated by a diagonal matrix.

Lemma 4.1.2. Every ae-SDD matrix M, with diagonalID(M), satisfies

1 - 1 D(M) M 1+ 1 D(M). 1+a) + a

Proof. This follows from the fact that

-1 0 0 1 1 0 0 -1 1 0 0 1

Applying this to each off-diagonal entry, the off-diagonal part of the matrix will be bounded between diagonal matrices; by the oz-SDD property these can be bounded

by 1iaD(M). 0

In our context, problems in the form of Equation 4.0.1, where M is an a-SDD matrix for some a > Q(1), can be turned into well conditioned quadratic minimization problems for which we can apply standard linearly convergent algorithms. For a more detailed description and analysis of such algorithms can be found in [37].

Lemma 4.1.3. There is an algorithm FASTSOLVE, that given an Q(1)-SDD matrix

40 M, and E > 0, returns a point Y, such that ||Y||oo < 2, and

xT My + (a, b) < (1 - E) min xTMX + (x, b) in time O(mlog(1/c)), where m is the number of nonzero entries of M.

Proof. By Lemma 4.1.2, there is some diagonal matrix D such that

B - lDn4M4 1+ t D. 1+ a )1 +a)

By applying the transformation x = D -1/2 z and the problem becomes

min h(z) = zT D-1/ 2 MD- 1/ 2z + (D- 1/2b, x). IzI 2D'/ 2

We will apply proximal gradient descent, defined by the sequence z(O) = 0 and

- z(t)|| 2 z(t+) = arg 11z 1 2 z) + (I + Ia zi n2D/ { (Vh(z(t)), 2

Computing z(t+1) from z(t) corresponds to computing

z' z=t) - 1 + a Vh(z(t)) = P) - 1 + a D-1/2MD1/2Z(t), 2+a2 and projecting it to the space IzJ 5 2D1 /2 , by simply trunctating any coordinates exceeding the bounds. We can clearly implement each iteration in linear time in the number of nonzeros of M. Since the condition number of the function is at most (1 + 2/a) = 0(1), such a step will imply that

1 (h(z(t h(z(t)) - h(zt+1)) )) - h(z*), 0(1) and thus inductively,

h(z t)) - h(z*) ( I - 0(1) (h(z(o)) - h(z*)).

41 Therefore, after O(log(1/e)) steps we will have a point with h(z(t))-h(z*) E(h(z(0))- h(z*)). The fact that h(O) = 0 concludes the proof. 1

An even simpler case is when the matrix is of size 1, in which case the problem can be exactly solved in constant time:

Lemma 4.1.4. There is an algorithm TRIVIALSOLVE, that given a 1 by 1 matrix M returns an x optimizing xTMx + (b, x) over the interval [-1,1].

Proof. By convexity, there must exist an optimal x that is either one of the two endpoints of the interval, or the unique global optimum of the function over the line. One may simply check all candidates and return the best value. 0

A key insight of [29] is that one can find Q(1)-SDD submatrices of M of size 0(n). We denote such a subset by F and V \ F by C. To ensure that solving the problem for xF will not interfere with our solution xc we map a solution xc supported only on coordinates of C to a solution xc through a linear mapping P. If P were the energy minimizing extension of voltages on C to voltages on V,

(P^c)F MFF]M[F,C]$C, we would have that xF and xc are M-orthogonal, since xFMPC = 0. Then, optimizing over ic involves the quadratic PTMP which is exactly equal to M[c,c] -

M[C,FIM[1F]M[F,C] = Sc(M, F). Applying this proccess recursively leads to the notion of vertex sparsifier chains that we will heavily rely on.

Definition 4.1.5 (Definition 5.7 of [291). For any SDD matrix M(0), a vertex sparsifier chain of M(0) with parameters ac 4 and 1/2 > ej > 0, is a sequence of matrices and subsets (M(1), ... , M(d); F1 ,... , Fd_1) such that

1. M( ~ M(i )

2. M(+ 1 [)~iSc(Mg)y, Fia)

3. M (,F) is cei-strongly diagonally dominant, and

42 4. M(d) has size 1.

Note that this last requirement is slightly different from [291: we require the chain to end with size 1, rather than just being constant. However, the chain construction from [29] immediately extends to this requirement (they presumably proposed stop- ping early because it is a simple optimization that would likely be valuable in any implementation). To be able to reason about the approximation guarantees of the chain as a whole we will use an error-quantifying definition.

Definition 4.1.6 (Definition 5.9 of [29]). An e-vertex sparsifier chain of an SDD matrix M(0 ) of work W, is a vertex sparsifier chain of M(0 ) with parameters ai 4 and 1/2 > Ei > 0 that satisfies

1. 2 Eid i<,

2. Zd-1mi log. E1 < W, where mi is the number of nonzeros in LM.

Finally, the construction of such chains, as well as their error guarantees have been already analyzed in [29] and can be used in a black-box manner.

Theorem 4.1.7 (Theorem 5.10 of [29]). Every SDD matrix M of dimension n has a 6-vertex sparsifier chain of work O(n) and d < O(log n), for any constant 0 < 6 < 1. Such a chain can be constructed in time, 0(m).

We note that Theorem 4.1.7 was stated for 6 = 1, but it is straightforward to modify to proof without changing the work or the length of the chain by more than a constant factor. Since we cannot exactly compute the energy minizing mapping P, we will define an approximate mapping that suffices for our purposes.

Definition 4.1.8. A linear mapping P is an E-approximate voltage extension from C to V according to L if for any ic E RIci

1. II(P - P)2cIIM EIIPcIIM,

43 APPROXMAPPING(M, C, E)sc

1 1. T <- log(1+a)

2. x( - j c0 1 XF <- -E- M[FC]XC, E is external degree matrix Ei= Mij - EjE MI 3. Fort+- 1,...,T: x(t) x(t-1)

4F - D(M[FF]) lM[FFUc]x(tl)

4. return X(T)

Figure 4.2.1: Implementation of an E-approximate mapping

2. P is the identity on coordinates in C

3. the coordinates of PJc are convex combination of the coordinates of ,c and 0. where P is the energy minimizing extension.

4.2 Constructing Approximate Mappings

We will construct such a mapping through a simple averaging scheme. First we set the voltage of every vertex in F to be the weighted average of its neighbors in C. Then at every step we replace its voltage by the weighted average of all its neighbors. (Here, excess diagonal is treated as an edge to a vertex with voltage 0.) We do so for O(log(l/E)) iterations. It is easy to see that all steps of the algorithm are linear maps, and we can therefore also implement its transpose.

Lemma 4.2.1. For any SDD matrix M, given an Q(1)-SDD subset F and some e > 0 one can apply an E-approximate voltage extension mapping in time O(m log(1/E)).

Proof. The linearity of the mapping, properties 2 and 3, as well as the runtime claimed hold inductively by the construction of the mapping. We will now prove property

44 1, namely, if P is the true energy minimizing mapping and P is our approximate

0 mapping, for any i E RI C, II(P - P)icIIM EllPZcIlM-

First, we will bound the error of x('). We define v() = x(t) - Plc, which is 0 outside F by construction. We will use the notation wij = -Mij-i.e. the edge weight between i and j, and wj0 = M - Eh IMij . Here, wio accounts for "excess diagonal" of M, which we treat as an edge to a "virtual vertex" 0. We will also define

(ic)0 = 0. Now, we have

||P cI2 > wij((P^c)i - (--C)X iEF,jECU{0} (V(O))2

Wij i iEF (jECUI 01

1+a (0)

Rearranging gives I|v(0)1|D(M) 1 IP^cIIM-

Next, we note that v(t) = ID(M)- 1 (I - M)[FF]v(t-1 ). This follows from the fact that

-(M[FF M[F,Fuc](P^C) since M(Pxc) is 0 on F. Applying Lemma 4.1.2, we get

11 12(I - M)[FF]ID(M)-12 < ||D(M)-1 1 -

This implies that |Iv(t)D(M) 12+. v(t-1)IIID(M).

By induction, we have 1+ 1a |vI(t)|ID(M) 5 ( + at |IPicIIM-

45 Finally, using Lemma 4.1.2 again, we get

1 +2 IV"'t)IDM) ~ 1 at IIPX'CIIM. (1+ a)'

Given such a mapping, we can uniquely express any voltage vector as x = xc + XF,

where xC = P.c (i.e. it is in the image of P) and XF is supported on F. By the convex

combination property of P, we have Ixc I < II ^cIIoo 5 11x Io; since XF = x - xC, by

the triangle inequality we have IIXFIoIo < 2I1xII00 . The domain lixiKC) 5 1 is therefore

contained in ll|cllao 5 1, IIxFIo < 2. Moreover, any point for which II.cl|Ko < k,

IIXFI o 2, corresponds to a point x = Pfc + XF with ljxIKco k + 2.

Having expressed all of the components of our approach, stating the algorithm is

simple. Given the decomposition of the problem the vertex sparsifier chain provides,

we will solve the smallest problem and then iteratively combine it with the solution of

the submatrices along the chain. The algorithm is formally described in Figure 4.2.2,

and the main claim in Theorem 4.2.3.

To facilitate the analysis we first state the following decoupling lemma.

Lemma 4.2.2. Consider an SDD matrix M, a partition of its columns (F, C) and

some 0 < e < 1/2. Let P be an E-approximate voltage extension from C to (F, C) as

define in Definition 4.1.8. Then

(Plc + XF)TM(PC + XF) (1 -+6 + E2) (pC)TM(Pic)+ (1 + E)XTFMXF

(1 - e + E2)(pic)TM(Pxc) +(Plc (1 - e)XT + XF)TM(PxCMXF- + XF) E

Proof. Since P is the true energy minimizing extension, we know that M(PJc), will

be zero on the coordinates of F. Since (P - P) c + XF is supported on F (property

46 O PTIMIZEC HAIN((MW,... , M(d); F,... ,Fdl1; Eo, . . b)

1. bW) <- b/e'o

2. For i -1 ... , d - 1

(a) p(z) < APPROXMAPPING(M( ), F, Ei)

(b) b ._ (P(i))T b(i)/(eei (1 + Ej + E?))

3. x(d) +- TRIVIALSOLVE(M(d), b(d)) 4. For i +- d - 1, ... , 1

(a) xoi <_ N(i)x(i+1) (b) Xo <- FASTSOLVE(MF, b(,) F )

Figure 4.2.2: Optimizing a vertex sparsifier chain

2 of Definition 4.1.8), we can expand

(P.c+xF)TM(PC +xF) =

= (P^c + (P - P>Cc + xF)TM(PC + (P - P) c + XF)

= (P:c)TM(Pic) + ((P - P)Tc + XF)M((P - P)sc + XF)

= ((P c)TM(P.c) + XMXF + 2 - P)iC)TMXF + ((P - P) T M((p - P) C) -

The first property of P in Definition 4.1.8, implies that

((P - P)ic) T M((P - P)ic) e2 (Pxc) T M(Pxc).

We can upper bound the contribution of the cross-term as

1- 2((P - P)iC)TMXF ! -( - P)ic)TM((P - P5c) + (XPXFMXF

<(PsC)TM(P.C) + EXMXF-

47 Similarly, we can lower bound the contribution of the cross-term as

--( - P)dC)TM((P - P>Xc) - 2((PeXFMXF - P)ic)TMxF E

> -6(P-c)TM(P.c) - EXMXF-

Combining these inequalities and rearranging terms concludes the proof. E

We can now use this lemma to prove that by decoupling the problem and using approximate Schur complements does not reduce the quality of the solution by more than a constant.

Theorem 4.2.3. Algorithm OPTIMIZECHAIN implements a O(log n)-oracle, and runs in time 0(m).

Proof. By construction and the triangle inequality we have that

d IIx(0 |o IIx(d)11 + | ')I 1 1 +2d < O(log n).

We will define the following functions to reason about our approximation guarantees:

Hi(x) xTM(i)x + (b(0), x)

Hi(x) = (1 + ei + FQ)xT Sc(M )X,Fi)x + (b(0), 3()x)

Hi'(x) - eei(1 + ej + E )xT1M(i+l)x + (b(0), p(wx)

Gi(x) = (1+ Ei)xTM2,F]X + (0,x)

We can now derive the following facts about these functions

1. directly applying Lemma 4.2.2, for any ^C, XF,

HiTPc + XF) H,(ic) + Gi(xF),

2. using the fact that any x with lixi|K 1 can be written as Psc + XF for

48 |:cjj : 1 and |IxFIK 2, and Lemma 4.2.2,

1i- Ei min H|(.c) + min Gi(xF) min Hi(x), IkIc I10l 1 IIXF 110052 S+ Ej) IX o11 _:1

3. by the definition of the vertex sparsifier chain, M(+1 ~ Sc(M(i), F), implying for any x Hi'(x) :! Hj'(x),

4. again from the fact that M(+ ~ Sc(M('), F),

min H '(x) e- 2 Ei min Hj(x), lIX|1100 IX110oo 1

5. by the definition of 0),

H'(x) = (1 + 6i)(1 + ej + Ei)H+ 1(x).

We are going to combine these facts to show that for any i E [1, d],

4 5 Hi(x(')) e. es min Hi(x). 11X1100<1

We procceed by induction from d to 1. Recall that ej 1/2. For the case of i = d it trivially holds by the guarantee of TRIVIALSOLVE:

Hd(x(d)) < min Hd(x) 11X1100o1

49 Assuming that it holds for any j > i, by the guarantees of FASTSOLVE:

Hi(x(2)) < H(x('+1 )) + Gi(x2)

H'(x+1)) +Gi(x4)

=]t(1 H+ 1(x(+l)) + Gi(xf) (I (+ Ei + Ei)

eZdI-1 -5E e Smin(1+ i + E2) jIIxII.< H+ 1(x) + (1 - Ei) min G11Gi(x)(2

< e =i+1 -5 min H7'(x) + e-2,j min Gi(x) IlxIlo 1 IIll 2 < e + -5Eje-2Ei mm H(x) + e- 2ei mn G(x) IIXIloo<1 IIXIloo 2

< e =1 -5 Ee-2eiHi min H((x)

5 - eZ- - Ejin H i(x).

Finally, we can similarly argue about Ho(x(1)): it is at most e'oHi(x(1)) while

3 min||xil. 1 Hl(x) < e- EO minlx1j| ii Ho(x). By the guarantees of the vertex sparsifier chain, we know that $-1 Ej < 6 for some constant 6 of our choice. By choosing the right constant we can ensure that our multiplicative error is less than 1/2. In order to bound the runtime we notice that for every i we need to compute

APPROXMAPPING and FASTSOLVE which both take time O(mi log(1/Ei)). By the bound on the work W of the chain we get that applying the chain takes time O(n), while constructing it takes time 0(m).

50 Chapter 5

Matrix Scaling and Balancing with Exponential Cone Programming

In this section, we switch to a completely different approach, and present an algorithm based on interior point methods, a broad class of methods for solving problems in convex optimization. While the results form previous sections are great in the regime where, roughly speaking, the magnitude of the entries in the scaling matrices are not too large (quasipolynomial, for instance), in general they can become exponentially large. For such cases, interior point methods are an all-purpose tool which mitigates these bad dependencies. Although interior point methods would seem like a natural option for the problems of matrix scaling and balancing, standard formulations require solving linear systems involving various rescalings of the input matrix. However, it is not clear whether these can be solved faster than matrix multiplication time. Instead, it turns out that a somewhat nonstandard formulation requires solving linear systems for more structured matrices. Particularly, we will see that these matrices admit a decomposition involving only matrices that are easy to invert (triangular matrices, solvable by back substitution, and SDD matrices which can be tackled via a standard Laplacian solver). [22] also provide a formulation that is similar to ours, but do not show how to implement the iterations, and do not rigorous bounds on how many such iterations are required.

51 Notably, a similar observation was made by Daitch and Spielman [9], in the case of interior point methods applied to flow problems on graphs. The main result of this section is the following.

Theorem 5.0.4. Given a nonnegative matrix A E R nx, one can:

1. compute an e-balancing in time

O(m3 / 21 1o(WAE71 )),

2. if there exists an (r,c)-scaling (U*,V*) with objective value difference at most

, compute a e-(r, c)-scaling in time

O(m1/ 2 (log(s 6-)+log log max{ 11 ,IK lVl*l})).

This is as a matter of fact a consequence of the fact that a specific class of functions, which capture both balancing and scaling, can be minimized efficiently. We capture this result in the following Theorem.

Theorem 5.0.5. Let A E R nx" be a nonnegative matrix with m nonzero entries, let f be the function f (x) = Aijexi-rx - (d, x), (Fl) (i,j)Esupp(A)

and let Bx be a positive real number. There exists an algorithm which, for any E > 0, finds a vector x such that f(x) - f(x*(Bx)) E (where x*(Bx) is the optimum of f over the region |lx||. Bx) in time

O(m3/2 log 2+ Bx + SA + d )

Using this result, one can then conclude the proof of Theorem 5.0.4.

Proof of Theorem 5.0.4. For the balancing objective, we first decompose the nonzero entries of A into strongly connected components. For each component, we will call

52 with d = 0. From Corollary 3.2.11 we have that I1x*I1, = O(nlog wA). Plugging this in, we obtain a total running time of 6(m3/2 log(WA6 1 )). For the c-(r, c)-scaling objective, we set d = (c, r)T, and run the interior point method on the matrix [ ]. Also, using the promise that the logs of the entries A 0 of (r, c) scaling exists within the f, ball of radius log max{ IIU* III V *II.}, we obtain a total running time of O(m 3/2(lo --1) + log log max{|U* 11, IIV* 1K})). Using Lemm 3.1.7, this yields the conclusion.

We prove Theorem 5 by showing that an interior point method defined and analyzed by Nesterov [371 can be efficiently implemented. In order to do so, we require two components. The former involves providing a formulation for minimizing the function in F1 for which the interior point method can produce an iterate that is close in value to optimum within a small number of iterations. The latter involves showing how to efficiently implement these iterations. Generally they involve solving a linear system; in our case, we show that such iterations can be executed by solving an SDD linear system to constant accuracy.

5.1 Setting Up the Interior Point Method, and Bound- ing the Number of Steps

In order to apply an interior point method, we first reformulate the problem in F1 in an equivalent form.

Lemma 5.1.1 (Equivalent Formulation). Let B, be a promise on the magnitude of the entries in the optimal solution of Fl:

|X*|I < Bx.

Also, let U = (sA +|dIJ1Bx).

53 Then the objective

min (f, t) - (d, x) where S = {(t, x) E R' x R' : Aijexi-xi < tij < 3U , for all (i, j) E supp(A), (t,x)ES

-Bx 5 xi B, for all 1 < i < n}, (F2) has an identical value and solution to Fl.

Proof. Given the promise, the bounds on x are redundant. This is also the case with the upper bounds on tij, since setting x = 0 and tj3 = Aij yields a solution of value SA. Therefore, setting t*' = Ai3exi-"x, the value of (I, t*) must be at most

SA + (d, x*) 5 sA + IdII1Bx < U. El

The second step is to replace the hard constraints with appropriate barrier functions, whose value blows up when approaching the boundary of the feasible set S

(see [1, 37] for more details). More precisely, we consider the barrier functions

#ij(t, x) = -log(log t - (log A) + xi - xj)) (5.1.1) for all (i, j) E supp(A), and

4(t, x) = - log(3U - ti1 ) - E (log(Bx - xi) + log(xi + Bx)). (5.1.2) (ij)Esupp(A) 1

The former blow up when t approaches exp(log(Aij + xi - xj)) from above, and are standard in exponential cone programming [1]. The barrier 4 handles all the other inequality constraints. Very importantly, all these barrier functions are well behaved, in the sense that they satisfy a required property called self-concordance. Since this property defines the number of iterations the method needs to execute, we highlight it below.

Fact 5.1.2. The function (t, x) = 0(t, x) + E(ij)esupp(A) #i5(t, x) is an O(m)-self- concordant barrierfor the set S defined in F2.

54 With the barrier function set up, the method has to solve a sequence of subprob- lems of the form

min f,(t, x) where f,(t, x) = p ((I, t) - (d, x)) + (t, x) (5.1.3) t'x while increasing p until it becomes sufficiently large that the solution we produce is close to the optimum of the initial constrained problem. What is essential here is the number of iterations of the method, which depends mostly on the quality of the barrier function, and little on in initialization and accuracy to which we want to solve. More precisely, we apply the following theorem which follows from [37], Theorems 4.2.9 and 4.2.11.

Theorem 5.1.3. Given an initial point v in the strict interior of D with a v-self- concordant barrier , the problem minVED cTv can be solved to within E additive error in 0 (xvlog 2 + IIV (v)|v2 (vo)-1 + VIcIV2(vo))-1 iterations, where v0 is the minimizer over D of (v).

In what follows we bound the quantities involved in the above statement. In order to have a bound on the number of iterations required for our cone program, we require lower bounding V 2 (vo). In order to do so, we lower bound the Hessian everywhere.

Lemma 5.1.4. The Hessian V 2 is lower bounded everywhere by the diagonal matrix with -17 on t variables and 2 on on x variables. X

2 2 2 2 Proof. Since V = V2'> + V J , V /, and by Vcalculation we see that V is diagonal, and

[V 20(t, x)]tijto = (3U - )2 9U2 1 1 2 [V2i(t, X)]( 2 +x 2 B12 ( BX -xi )2 {BX + X )2 B2

55 Therefore we have that this gives a lower bound on V2', and thus on H. 0

We also show how to pick the initial point for our particular problem, which turns out to be a trivial task, since the only requirement is that it lies in the strict interior of D. The more challenging part is upper bounding the V at that point in Hessian inverse norm.

Lemma 5.1.5. The point v = (t,x), where tij = 2U, x = 0, belongs to the strict interior of S. Furthermore, log IIV (v) IV(V)-1 = 0(log(2 + m + Bx)).

Proof. First we verify that the point belongs to the strict interior. As we set x = 0, no constraint on x is tight. 3U > tij = 2U > 0. For the second part, we may simply bound the contribution from each term of the barrier to each entry of the gradient. The t entries end up bounded by at most O~7), while the x entries end up bounded by O(m) + Or), providing the claimed bound.

5.2 Implementing an Iteration of the Interior Point Method

The steps mentioned in the statement of Theorem 5 consist only of standard Newton steps, i.e. minimizing a second order locall approximation of the function fU(x). 2 These steps are generally expensive, since they involve applying the inverse of V f to a vector. In our case, fortunately, we are able to exploit the structure of f in order to do this in nearly linear time in the sparsity of V 2f,'. Below we give a precise statement concerning our ability to solve linear systems involving the Hessian matrix.

Theorem 5.2.1. For any e > 0, and any H = V 2 f,(v) = V2(v), where v = (t, x) is a point in the strict interior of the feasible region S (see F2), and any vector b C R m +n, one can, with high probability, compute in 6(m log E-1) time a vector y

such that ||Y - y*||H EIIx 1H, where y* is the solution to Hy* = b.

56 In order to achieve this result, we leverage the power of Laplacian solvers. From the algorithmic point of view, the crucial property of the Laplacian is that it is symmetric and diagonally dominant. This enables us to use fast approximate solvers for symmetric and diagonally dominant linear systems. Namely, there is a long line of work [50, 24, 25, 23, 7, 27, 28] that builds on an earlier work of Vaidya [51] and Spielman and Teng [49], that designed an SDD linear system solver. We employ as a' black box the following theorem, which follows from [251, and constructs an operator that approximates M+.

Theorem 5.2.2. For any E > 0, and any SDD matrix M E R"n> with m nonzero entries, and any vector b in the image of M, one can, with high probability, compute in O(mloge-1) time a vector x such that lx - x*||M elx*IIm, where x* is the solution of Mx* = b. Furthermore, a given choice of random bits produces a correct result for all b simultaneously, and makes x linear in b.

The result follows using this tool, and the following structural lemma, whose proof can be found in Appendix A.10.

Lemma 5.2.3. The Hessian V 2 has a factorization

H = USU T , where S is SDD, U is lower triangular, and each of then has O(m) nonzero entries. Furthermore, this factorization can be computed in O(m) time.

With this in hand, proving Theorem 5.2.1 is immediate. First note that, given 5, we can actually choose a -1 satisfying the needed properties: a linear-operator based graph Laplacian solver, such as [25]. Having access to the linear operator S', we consider the error produced by applying the operator U-lS(UT)- 1 :

1 T 1 1 1 1 1 I1(U T )-' 5-U-lb1 - H- blH = l|(U ) (- - S )U-'bH = ll(S - S- )U-lbls 1 1

57 5.3 Error Tolerance

While the classical analysis of Newton's method used for iterations of interior point methods assumes exact computations, in our case the Laplacian solver we employ adds some error. We quickly show that this error does not hurt us, and as a matter of fact it is sufficient to solve these systems to constant accuracy. First, we require understanding the guarantees of Newton's method, and its requirements.

Fact 5.3.1 (Progress via Newton Steps). Let v be a point in the interior of the feasible region such that 1

Then, applying one step of the interior point method consists of producing a new iterate

'= v-H;-Vf4(v) which provably satisfies IIVf,(v')IIH-,1

In order to make progress it is sufficient that |Vf,(v') IIH-,1 6.

We can easily show that applying the inverse matrix with the solver guarantees we give in Theorem 5.2.1 is sufficient in order to make progress.

Lemma 5.3.2. Let v be a point in the interior of the feasible region such that

1 flVf,4(v)IIH-1 <

Letting v" = v - A such that

I|A - H;-'Vf,(v)IHv If IIHV1, for e < 0.1, we get that 1

Since the proof is rather standard, we defer it to Appendix A.9.

58 5.4 Putting Everything Together

We can combine the results from this section in order to provide a proof for Theorem 5.

Proof of Theorem 5. By combining Theorem 5.1.3, along with Fact 5.1.2, Lemma 5.1.4 and Lemma 5.1.5, we see that we can approximately minimize the function defined in Equation F1 by performing

o (V/;nlog (2 m+ + B,+ + |d|1 E E iterations of the interior point method referenced in Theorem 5.1.3. From Theorem 5.2.1 and the iteration accuracy required by Fact 5.3.1 and Lemma 5.3.2 we see that each iteration of the interior point method referenced in Theorem 5 can be implemented in time 6(m). This yields the conclusion. 11

59 60 Bibliography

[1] Aharon Ben-Tal and Arkadi Nemirovski. Lectures on modern convex optimization: analysis, algorithms, and engineering applications. SIAM, 2001. [2] Susan Blackford. Balancing and conditioning. http: //www.netlib. org/ lapack/lug/node94.html.

[3] David T Brown. A note on approximations to discrete probability distributions. Information and Control, 2(4):386-392, 1959. [4] Tzu-Yi Chen and James W Demmel. Balancing sparse matrices for computing eigenvalues. Linear algebra and its applications, 309(1-3):261-287, 2000. [5] Hui Han Chin, Aleksander Madry, Gary L. Miller, and Richard Peng. Runtime guarantees for regression problems. In Innovations in Theoretical Computer Sci- ence, ITCS '13, Berkeley, CA, USA, January 9-12, 2013, pages 269-282, 2013.

[6] Paul Christiano, Jonathan A. Kelner, Aleksander Madry, Daniel A. Spielman, and Shang-Hua Teng. Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs. In STOC'11: Proceedings of the 43rd ACM Symposium on Theory of Computing, pages 273-282, 2011. [7] Michael B. Cohen, Rasmus Kyng, Gary L. Miller, Jakub W. Pachocki, Richard Peng, Anup B. Rao, and Shen Chen Xu. Solving sdd linear systems in nearly m log1/ 2 n time. In STOC'14: Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 343-352, 2014. [8] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292-2300, 2013. [91 Samuel I. Daitch and Daniel A. Spielman. Faster approximate lossy generalized flow via interior point algorithms. In STOC '08: Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 451-460, 2008. [10] RJ Duffin. Nonlinear networks. iia. Bulletin of the American Mathematical Society, 53(10):963-971, 1947. [11] B Curtis Eaves, Alan J Hoffman, Uriel G Rothblum, and Hans Schneider. Line-sum-symmetric scalings of square nonnegative matrices. In Mathemati- cal Programming Essays in Honor of George B. Dantzig Part II, pages 124-141. Springer, 1985.

61 [12] Jack Edmonds. Maximum matching and a polyhedron with 0, 1-vertices. Journal of Research of the National Bureau of Standards B, 69(125-130):55-56, 1965.

[13] Stephen E Fienberg. The Analysis of Cross-ClassifiedCategorical Data. Springer, 1976.

[14] Shmuel Friedland, Chi-Kwong Li, and Hans Schneider. Additive decomposition of nonnegative matrices with applications to permanents and scalingt. Linear and Multilinear Algebra, 23(1):63-78, 1988.

[15] Harold N Gabow and Robert E Tarjan. Faster scaling algorithms for network problems. SIAM Journal on Computing, 18(5):1013-1036, 1989.

[16] Vincent Goulet, Christophe Dutang, Martin Maechler, David Firth, Marina Shapira, and Michael Stadelmann. Package 'expm'. https: //cran. r-proj ect. org/web/packages/expm/expm.pdf.

[17] Janez Grad. Matrix balancing. The Computer Journal, 14(3):280-284, 1971.

[18] Leonid Gurvits. Classical deterministic complexity of edmonds' problem and quantum entanglement. In STOC'03: Proceedings of the 35th Annual ACM Symposium on Theory of Computing, pages 10-19, 2003.

[19] Darald J Hartfiel. Concerning diagonal similarity of irreducible matrices. Pro- ceedings of the American Mathematical Society, 30(3):419-425, 1971.

[20] Martin Idel. A review of matrix scaling and sinkhorn's normal form for matrices and positive maps. arXiv preprint arXiv:1609.06349, 2016.

[21] Bahman Kalantari and Leonid Khachiyan. On the complexity of nonnegative- matrix scaling. Linear Algebra and its applications, 240:87-103, 1996.

[22] Bahman Kalantari, Leonid Khachiyan, and A Shokoufandeh. On the complexity of matrix balancing. SIAM Journal on Matrix Analysis and Applications, 18(2):450-463, 1997.

[23] Jonathan A. Kelner, Lorenzo Orecchia, Aaron Sidford, and Zeyuan Allen Zhu. A simple, combinatorial algorithm for solving SDD systems in nearly-linear time. In STOC'13: Proceedings of the 45th Annual ACM Symposium on the Theory of Computing, pages 911-920, 2013.

[24] loannis Koutis, Gary L. Miller, and Richard Peng. Approaching optimality for solving SDD systems. In FOCS'10: Proceedings of the 51st Annual IEEE Sym- posium on Foundations of Computer Science, pages 235-244, 2010.

[251 Ioannis Koutis, Gary L. Miller, and Richard Peng. A nearly m log n-time solver for SDD linear systems. In FOCS'11: Proceedings of the 52nd Annual IEEE Symposium on Foundations of Computer Science, pages 590-598, 2011.

62 [261 R Telefoonverkeersrekening Kruithof. De ingenieur 52. E15-E25, 1937. 1271 Rasmus Kyng, Yin Tat Lee, Richard Peng, Sushant Sachdeva, and Daniel A. Spielman. Sparsified cholesky and multigrid solvers for connection laplacians. In STOC'16: Proceedings of the 48th Annual ACM Symposium on Theory of Computing, 2016.

[28] Rasmus Kyng and Sushant Sachdeva. Approximate gaussian elimination for laplacians: Fast, sparse, and simple. In FOCS'16: Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, 2016.

[29] Yin Tat Lee, Richard Peng, and Daniel A Spielman. Sparsified cholesky solvers for sdd linear systems. arXiv preprint arXiv:1506.08204, 2015.

130] Yin Tat Lee and He Sun. An SDP-based algorithm for linear-sized spectral sparsification. In STOC'1 7: Proceedings of the 49th Annual ACM Symposium on the Theory of Computing, pages 250-269, 2015.

[31] Yin Tat Lee and He Sun. Constructing linear-sized spectral sparsification in almost-linear time. In STOC'17: Proceedings of the 56th Annual IEEE Sympo- sium on Foundations of Computer Science, 2017.

[32] Nathan Linial, Alex Samorodnitsky, and Avi Wigderson. A deterministic strongly polynomial algorithm for matrix scaling and approximate permanents. In STOC'98: Proceedings of the 30th Annual ACM Symposium on the Theory of Computing, pages 644-652, 1998. to [331 Aleksander Madry. Navigating central path with electrical flows: From flows matchings, and back. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 253-262. IEEE, 2013.

[34] MathWorks. balance - diagonal scaling to improve eigenvalue accuracy. https: //www . mathworks. com/help/matlab/ref /balance . html.

[351 MathWorks. eig - eigenvalues and eigenvectors. https : //www . mathworks . com/ help/matlab/ref/eig.html.

[361 Arkadi Nemirovski and Uriel Rothblum. On complexity of matrix scaling. Linear Algebra and its Applications, 302:435-460, 1999.

[37] Yu Nesterov. Introductory lectures on convex programming volume i: Basic course. 1998.

[38] E. E. Osborne. On pre-conditioning of matrices. Journal of the ACM, 7(4):338- 345, 1960.

[39] Rafail Ostrovsky, Yuval Rabani, and Arman Yousefi. Matrix balancing in Lp norms: Bounding the convergence rate of osborne's iteration. In SODA '17: Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 154-169, 2017.

63 M

[40] B. N. Parlett and C. Reinsch. Balancing a matrix for calculation of eigenvalues and eigenvectors. Numer. Math., 13(4):293-304, 1969.

[41] TES Raghavan. On pairs of multidimensional matrices. Linear Algebra and its Applications, 62:263-268, 1984.

[42] RDocumentation. Balance a square matrix via lapack's dgebal. https: //www. rdocumentation. org/packages/expm/versions/0.99-1. 1/topics/balance.

[43] Hans Schneider and Michael H Schneider. Max-balancing weighted directed graphs and matrix scaling. Mathematics of Operations Research, 16(1):208-222, 1991.

[44] Michael H Schneider and Stavros A Zenios. A comparative study of algorithms for matrix balancing. Operations research, 38(3):439-455, 1990.

[45] Leonard J Schulman and Alistair Sinclair. Analysis of a classical matrix preconditioning algorithm. In STOC'15: Proceedings of the 47th Annual ACM on Symposium on Theory of Computing, pages 831-840, 2015.

[46] Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The annals of mathematical statistics, 35(2):876-879, 1964.

[47] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343-348, 1967.

[48] Daniel A Spielman and Nikhil Srivastava. Graph sparsification by effective re- sistances. SIAM Journal on Computing, 40(6):1913-1926, 2011.

[49] Daniel A. Spielman and Shang-Hua Teng. Solving sparse, symmetric, diagonally- dominant linear systems in time 0(m1 3 1 ). In FOCS'03: Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, pages 416-427, 2003.

[50] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC'04: Proceedings of the 36th Annual ACM Symposium on the Theory of Computing, pages 81-90, 2004.

[51] Pravin M. Vaidya. Solving linear equations with symmetric diagonally dominant matrices by constructing good preconditioners. Unpublished manuscript, UIUC 1990. A talk based on the manuscript was presented at the IMA Workshop on Graph Theory and Sparse Matrix Computation, October 1991, Mineapolis.

[52] James Hardy Wilkinson. Rounding errors in algebraic processes. Courier Cor- poration, 1994.

[53] Neal E Young, Robert E Tarjan, and James B Orlin. Faster parametric shortest path and minimum-balance algorithms. Networks, 21(2):205-221, 1991.

64 [541 G Udny Yule. On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, 75(6):579-652, 1912.

65 66 Appendix A

Deferred Proofs

A.1 Proof of Theorem 2.0.4

Proof of Theorem 2.0.4. The iteration we are going to implement is

Xi+1 = Xi + . ( V2f(xi), IVf(x) (A.1.1)

Since f is second-order robust with respect to Lo by definition (see Definition 2.0.1), we know that within an t ball centered at x the function f is lower and upper bounded by fL and fu, respectively, where:

T 2 (x')= f (X) + (Vf (X),x -,_X) + .X) V f (X) (X' _X

( - x) T V 2 - x) + f(x)(x' - x). fu(x') = f(x) + (Vf(x), x' 2

Also, define XL and xU to be the minimizers of fL and fu, respectively, over the f.. ball of radius k1 centered at x, i.e.

XL = argmin||z-x||1

Next, we see how much f decreases when we move from x to x' = x+ 1A, where A is obtained via the oracle call ( V2f(xi), (Vf(x)). We know from Definition 2.0.3

67 that

1/ 2 2 1 (Vf(x), A)+ 2k2z~e2AT V f(x)A < (Vf(x), xu - x) + 2(XU - X)V2f(X)(XU - X) (A.1.2) as the function the oracle is approximately minimizing is precisely that. Expanding fu (x + A) we have

fu(x') - fu(x) = fu (x+ S- fu(x)

k (Vf (x), A) + eATV2f(X)A

1 -(XU TV2 ((Vf(x), XU - x) + - x) f (X) (XU - x)) I (fu(XU) - fu(x)). 2

Since f(x) = fu(x), we see that

f ) - f(x') > fu(x) - fu(x') > (fU(x) - fu(xu)) . (A.1.3)

Also, we have that

fu(xu) - fu(x) fu (X + XL X)- fU(X) T XL XL e4 X) 2 -4 X) + 1 4 V f (X) =V KvX) XL- 2e ( 1I 2 (XL x )TV f(x) (XL- x)) ((Vf (x),XL - x) + 2

= -(fL(xL) - fL(X)) -

Combining this with Equation A.1.3 gives

f(X) - f(x') 4 U(fL(x) - fL(xU)) - (A.1.4)

Finally, we show that this amount of progress is comparable to that achievable by making a large step towards x*. More precisely, we have from the R,, condition

68 that 1kx - x*IIo < R,. Thus, letting i = x + max(kRoo,1)(X* - x), we have that the Ik -x*I .. Therefore, fL(k) fL(XL), since XL was a minimizer of fL over box of radius I around x. Also, since fL lower bounds f over this box,

fL (XL ) :5 fL ( ^) !: f(i). (A. 1.5)

Combining Equations A.1.4 and A.1.5, we see that

1 1 1 f(x) - f(x') 2f4 2e4k 2e4 max(kRo, 1)

where the last inequality follows from convexity. This implies that at every iteration f(x) - f(x*) is decreased by a factor of (1 - Q(1/(kR, + 1))), implying that after

T = 0 (kRo + 1) - log f(X0) f(X*)

iterations, we have that f(XT) - f(x*) -.

A.2 Proof of Lemma 3.1.7

Proof of Lemma 3.1.7. Suppose that, without loss of generality, the ith row of M has a very large violation of the scaling constraint: letting y := (rm)i - ri we have >yE /v'2.

In order to improve the solution, we can make an update to the corresponding coordinate of x which makes the largest possible improvement in function value. More precisely by setting x' = xi + 6, and x = xj whenever j =/ i, we have that

f (x) - f(x') = (rm)i(1 - e') + rj6.

Optimizing for the largest possible decrease, we set 6 = log(ri/(rm)i) which shows

69 that we can decrease f by

(rM)i - ri - ri log 1+ (rmX- ri =ri - log (I + ri (ri rij

Since we have -y/ri > -1, we can lower bound the improvement by

- > ri -( E , f(x) f(x') 4 ri 2n whenever y/ri K 1.62, and by

> , - f(x') ri - = f(x) 3 3 3 V5 whenever -y/ri > 1.62. Since by assumption lr|K K 1, this change improves function value by at least min{e 2 /(2n), E/(3V2)}, which contradicts the fact that f(x)-f* < E 2 /3n. Therefore all rows and columns are within E/V2H away from being correctly scaled. Hence this is a E-(r, c) scaling.

A.3 Proof of Lemma 3.1.10

Proof. The proof is similar to the one for Lemma 3.2.9. The first point holds by the same argument. For the third part, all (x, y) for which f(x, y) f(O, 0) must satisfy:

2 2 E2 2 2 36n eB (exi + 6i) f(0, 0) + 36n eB 4 = SA + 9neB' and similarly for y. Therefore

lxil 5 log 36n2 BSA + 4) = O(B log(nsA6-)) , and similarly for yi.

70 The second part follows by the nonnegativity of the regularizer and the observation

2 that f(z*) f(z*) + E2/(36 2 eB) . 4eB f(z*) + E /(9n). By the third property we know that the level set in bounded and thus f attains its minimum, and that minimum can only be better that x*, which concludes the proof. l

A.4 Proof of Theorems 3.1.5 and 3.1.6

Proof of Theorem 3.1.6. By Lemma 3.1.7 and Lemma 3.1.10 we get that in order to obtain a 2E-approximate scaling, it is sufficient to minimize f up to E2/(2n) additive error. Furthermore, from Lemma 3.1.10 we get that the R, bound required for 1 Theorem 2.0.2 is R, = O(Blog(nsAe- )). Finally, since f(0) = SA, we see that, initializing at (xO, Yo) = (0, 0), the total running time of the method is upper bounded by

b (mB log 2 (sAE6 1 ))

Proof of Theorem 3.1.5. We can directly prove this theorem by applying Theorem 3.1.6 to the optimal solution promised. Since z* exactly (r, c)-scales A, we know that it must be a minimizer of f and thus f(z*) = f* Moreover, by definition we have the bound, Iz*11 = B < log(r,(U*) + r,(V*)), which concludes that proof. El

A.5 Proof of Lemma 3.1.13

Proof. By Lemma 3.1.4, we know that any almost scalable matrix can be written as a block lower triangular matrix, whose diagonal blocks are exactly scalable. By Lemma 3.1.12, every such block can be scaled to doubly stochastic, using factors with a ratio at most O(ni log(1/fA)), where ni is the number of vertices in block i. The infimum of the function value is exactly the sum of the function values for the diagonal block problems, since the contribution of the entries below the diagonal can be made arbitrarily close to 0. We observe that it suffices to ensure that the

71 contribution of each such edge is at most E2/3n , since then the total contribution will be at most e2/3n which is the additive error we can tolerate. Scaling the off- diagonal entries can be done in a very simple way. For any block, we can scale all the columns down by a fixed amount and all the columns up by the same amount. This will not affect the contribution of the block's entries to the function and will only decrease the contribution of all the off-diagonal blocks in the same columns.

By choosing the ratio between any two consecutive blocks to be log(n3 SA/ 2 ), we can ensure that the entries contained in the interesection of the rows and columns of these blocks contribute less than e 2/3n 3 each. That ratio between any two factors of this new scaling is at most

o n log(nSA/c)+ rij lo9(1/tA) O(n log(nwA/E))-

A.6 Proof of Lemma 3.2.6

Proof of Lemma 3.2.6. First we observe that since the Hessian of f is SDD, it is spectrally upper bounded by two times its diagonal and therefore by the identity 2 2 matrix multiplied by twice the trace, that is V f(x) - 2 -tr(V f(x)) .I. Since, by construction, tr(V2 f(x)) = EZ(rm + cM)i = 2f(x), we have that

V 2 f(x) = 4f(x)I.

Therefore for any y with f(y) f(x), we have

2 f(y) = f(x) + (Vf(x), y - X) + (y - X)T j V f (x + t(y - X))(y X)tdt ~.1

< f (x) + (Vf (x), y - x) + 211x - y2f ().

72 It is straightforward to reason that

f, = inf f(y) f(x) + min {(Vf(x), y - x) + 211x - y12f(x)} = f (x) 2f(x)' Y y2 8f (x) and thus, IjVf(x)12 < 8(f(x) - f*) f(x) - f(x)

Finally, we lower bound f(x). Since the matrix can be balanced, its corresponding graph is strongly connected. Therefore it contains a cycle, and thus some edge (i, j) satisfies exi-'i > 1. Hence f(x) A > EA . Plugging in this lower bound, we get that (X) * < 2eA E2 8

Hence 2 IIVf(x)11 2 E * f(x) f which is equivalent to the fact that D(exp(x)) yields an E-balancing for A. We note that a similar bound also follows from [39], using a different argument. 0

A.7 Proof of Theorems 3.2.4 and 3.2.5

Proof of Theorem 3.2.5. By Lemma 3.2.6 and Lemma 3.2.9 we get that optimizing f up to an additive error of E2 Amin/24, suffices to get an E-balancing of the matrix. Furthermore, from Lemma 3.2.9 we see that the R, bound required for Theo- rem 2.0.2 is R, = O(Blog(nwAe-')). Finally, using the fact that f(0) = sA, we see that, initializing at xO = 0, the total running time of the method is

O(mB log2 (WA--1 ))

Proof of Theorem 3.2.4. Having proved Theorem 3.2.5, this theorem is a simple corollary. Consider x* to be the vector such that D* = D(exp(x*)). That implies that

73 Vf(x*) = 0 and therefore (by the convexity of f) x* is a minimizer of f implying that f(x*) = f*. Moreover, B = maxi Ilog D Ij = O(log r,(D*)), which concludes that proof by applying Theorem 3.2.5. El

A.8 Proof of Lemma 3.2.10

Proof. Consider the optimal solution x to the optimization problem described in

Equation 3.2.1, for which we know that D* = D(exp(x)) via Lemma 3.2.6. Since this is a minimizer, we know that

Aijex-ex =f(x) f(0) = sA. (ij)Esupp(A)

Therefore, for any (i, j) E supp(A), one has that

Xi - x 5 1og(sA/Aij) log WA.

Since there is a directed path of length at most k from any vertex to any other, we get that

log K(D*) = max xi - min xj = O(klog WA).

A.9 Proof of Lemma 5.3.2

Proof. The first part is a standard property of Newton's method applied to self- concordant functions. We refer the reader to [1] for details.

What we want to prove is that Newton's is robust to errors in the solution to the linear system involving the Hessian. Indeed, first we see that the Hessian at v" approximates the one at v'. To simplify notation, we write H, = V2 f(v). Since f, is

74 self-concordant, we have that

1 HVI -(1 - iv' - v"iIHVI Hva 924 4 ( - li-|v' - V"||H, 2 (A.9.1) and similarly

2 1 He - (1 - liv - v'IIHer,) 4 He Ho.( - Vi 2 (A.9.2) the latter of which can be written equivalently as

1 H (1 f)H (1 - IlVfA(v)IIH-1) 2 so 9 16 (A.9.3) 16 9

The error guarantee on v" equivalently gives us that

liv' - V" lHv EIV - VIIHv (A.9.4)

so combining with A.9.3 we obtain that

4 /1 4 4 -I~ I()I - (A.9.5) |V' - V"||1HII ! 3i|v' - v"H 3< -H= 3E -f()lHl 3

Also, since for any z

Vf,(z + w) = Vf,(z) + Hz4tww -dt 10

we get, by applying triangle inequality and A.9.2, that

1 I|Vf,(z + w)IIH-1 < II7fA(z)IHg1 + 1 Hz IHzwllH1. (A.9.6)

75 Therefore, using A.9.1 and A.9.6 where we substitute v' for z:

IIVfY(v")IIH-, IIVf(V")I- 1 -Hv 1 v ~~- -V"IIH,/

Hv'(v' - v")IIH-1) 1 H -1 +V 1 -(i') / -~ c' 1 |V' -V"I|H,, 1 ~ Iv' - V"IIH-1 1 E/31 (8 1 - e/3) I - E/3 1 K

A.10 Proof of Lemma 5.2.3

Proof. First we note that the nonzero submatrix of V2qij is (where rows/columns correspond to xi, xj, tj, in this order):

a -a -0/ty

V,2gg(xi Ixj Iti ) = -a an a/t+V

L-a/t' such that

1 and 0 = ce + ya+ 1 (log tij - (log A + xi -Xj))2

Furthermore, this submatrix can be factored, by Schur complementing the last row and column, as:

-2(1-) 0 -a a 2(1- ) 01 1 0 0 ,72 Oj(Xi, 0) Xj , trj) = V0 1 0Z -2(1-) a2(1 - 0 1 0, 0 0 1 0 0 #/t 3J -I a and thus one can easily notice that the Schur complement is SDD.

76 Furthermore, since 4 is a standard logarithmic barrier, its Hessian is a diagonal matrix with nonnegative entries. Therefore, we can split the contribution of the diagonal matrix V 2 4'(x, t) into pieces Dij which contains nonzeroes only at xi, xj and tij. In other words, we can write H = E(i)Esupp(A)(V 2 05i(x, t) + Dij). Since 2 the Schur complement of tij of the matrix V 0(x, t) is SDD, we also have that the Schur complement of tij of the matrix V 2q(x, t) + Dij is SDD, so the matrix can also be factored similarly to the factoring above, and all of these factorizations can be computed in overall O(m + n) = O(m) time. Finally, since each of these factorizations is computed by Schur complementing a unique tij, which is nonzero in a single matrix, we see that the Schur complement of the block H(t) of the matrix H is equal to the sum of Schur complements of the block tij of the'matrices V 2 0,j (x, t) + Di. This holds similarly, for the corresponding lower and upper diagonal matrices, which yields the desired factorization simply by summing up. I