Cargèse 2015
Nisheeth K. Vishnoi

Preface

These notes are intended for the attendees of the Sixth Cargèse Workshop on Combinatorial Optimization to accompany the lectures delivered by the author in the Fall of 2015. These notes draw heavily from a set of surveys/lecture notes written by the author on the underlying techniques for fast algorithms (see [Vis13, SV14, Vis14]) and are intended for an audience familiar with the basics. Many thanks to the many people who have contributed to these notes and to my understanding. Please feel free to email your comments and suggestions.

Nisheeth K. Vishnoi
September 2015, EPFL
nisheeth.vishnoi@epfl.ch

Contents

1 Solving Equations via the Conjugate Gradient Method
  1.1 A Gradient Descent Based Linear Solver
  1.2 The Conjugate Gradient Method
  1.3 Matrices with Clustered Eigenvalues
2 Preconditioning for Laplacian Systems
  2.1 Preconditioning
  2.2 Combinatorial Preconditioning via Trees
  2.3 An $\tilde{O}(m^{4/3})$-Time Laplacian Solver
3 Newton's Method and the Interior Point Method
  3.1 Newton's Method and its Quadratic Convergence
  3.2 Constrained Convex Optimization via Barriers
  3.3 Interior Point Method for Linear Programming
  3.4 Self-Concordant Barrier Functions
  3.5 Appendix: The Dikin Ellipsoid
  3.6 Appendix: The Length of Newton Step
4 Volumetric Barrier, Universal Barrier and John Ellipsoids
  4.1 Restating the Interior Point Method for LPs
  4.2 Vaidya's Volumetric Barrier
  4.3 Appendix: The Universal Barrier
  4.4 Appendix: Calculating the Gradient and the Hessian of the Volumetric Barrier
  4.5 Appendix: John Ellipsoid
References

1 Solving Linear Equations via the Conjugate Gradient Method

We discuss the problem of solving a system of linear equations. We first present the Conjugate Gradient method, which works when the corresponding matrix is positive definite. We then give a simple proof of the rate of convergence of this method using low-degree polynomial approximations to $x^s$.[1]

Given a matrix $A \in \mathbb{R}^{n \times n}$ and a vector $v \in \mathbb{R}^n$, our goal is to find a vector $x \in \mathbb{R}^n$ such that $Ax = v$. The exact solution $x^\star \stackrel{\mathrm{def}}{=} A^{-1}v$ can be computed by Gaussian elimination, but the fastest known implementation requires the same time as matrix multiplication (currently about $O(n^{2.373})$). For many applications, the number of non-zero entries in $A$ (denoted by $m$), or its sparsity, is much smaller than $n^2$ and, ideally, we would like linear solvers which run in time $\tilde{O}(m)$,[2] roughly the time it takes to multiply a vector with $A$. While we are far from this goal for general matrices, iterative methods, based on techniques such as gradient descent or the Conjugate Gradient method, reduce the problem of solving a system of linear equations to the computation of a small number of matrix-vector products with the matrix $A$ when $A$ is symmetric and positive definite (PD). These methods often produce only approximate solutions. However, these approximate solutions suffice for most applications. While the running time of the gradient descent-based method varies linearly with the condition number of $A$, that of the Conjugate Gradient method depends on the square root of the condition number; the quadratic saving occurs precisely because there exist $\sqrt{s}$-degree polynomials approximating $x^s$.

[1] This is almost identical to Chapter 9 from Sachdeva and Vishnoi [SV14].
[2] The $\tilde{O}$ notation hides polynomial factors in $\log n$.

The guarantees of the Conjugate Gradient method are summarized in the following theorem:

Theorem 1.1. Given an $n \times n$ symmetric matrix $A \succ 0$ and a vector $v \in \mathbb{R}^n$, the Conjugate Gradient method can find a vector $x$ such that $\|x - A^{-1}v\|_A \leq \delta \|A^{-1}v\|_A$ in time $O((t_A + n) \cdot \sqrt{\kappa(A)} \log 1/\delta)$, where $t_A$ is the time required to multiply $A$ with a given vector, and $\kappa(A)$ is the condition number of $A$.
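Here $\|z\|_A$ denotes $\sqrt{z^\top A z}$, and $\kappa(A) = \lambda_1/\lambda_n$ is the ratio of the largest to the smallest eigenvalue of $A$; both quantities reappear in the analysis below. As a minimal sanity-check sketch (the function names are illustrative, not from the notes), the following Python helpers compute these quantities and test the guarantee of Theorem 1.1 for a given candidate solution on a small dense PD matrix:

```python
import numpy as np

def a_norm(z, A):
    # ||z||_A = sqrt(z^T A z), the norm in which Theorem 1.1 measures error.
    return float(np.sqrt(z @ (A @ z)))

def condition_number(A):
    # kappa(A) = lambda_1 / lambda_n for a symmetric positive definite A.
    eigs = np.linalg.eigvalsh(A)          # eigenvalues in ascending order
    return float(eigs[-1] / eigs[0])

def satisfies_guarantee(x, A, v, delta):
    # Checks ||x - A^{-1} v||_A <= delta * ||A^{-1} v||_A via an exact solve,
    # so it is only meant as a sanity check on small examples.
    x_star = np.linalg.solve(A, v)
    return a_norm(x - x_star, A) <= delta * a_norm(x_star, A)
```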
1.1 A Gradient Descent Based Linear Solver

The gradient descent method is a general method to solve convex programs; here we only focus on its application to linear systems. The PD assumption on $A$ allows us to formulate the problem of solving $Ax = v$ as a convex programming problem: for the function

$$f_A(x) \stackrel{\mathrm{def}}{=} \|x - x^\star\|_A^2 = (x - x^\star)^\top A (x - x^\star) = x^\top A x - 2x^\top v + (x^\star)^\top A x^\star,$$

find the vector $x$ that minimizes $f_A(x)$. Since $A$ is symmetric and PD, this is a convex function and has a unique minimizer $x = x^\star$.

When minimizing $f_A$, each iteration of the gradient descent method is as follows: start from the current estimate of $x^\star$, say $x_t$, and move along the direction of maximum rate of decrease of the function $f_A$, i.e., against its gradient, to the point that minimizes the function along this line. We can explicitly compute the gradient of $f_A$ to be $\nabla f_A(x) = 2A(x - x^\star) = 2(Ax - v)$. Thus, we have

$$x_{t+1} = x_t - \alpha_t \nabla f_A(x_t) = x_t - 2\alpha_t (A x_t - v)$$

for some $\alpha_t$. Define the residual $r_t \stackrel{\mathrm{def}}{=} v - A x_t$. Thus, we have

$$f_A(x_{t+1}) = f_A(x_t + 2\alpha_t r_t) = (x_t - x^\star + 2\alpha_t r_t)^\top A (x_t - x^\star + 2\alpha_t r_t)$$
$$= (x_t - x^\star)^\top A (x_t - x^\star) + 4\alpha_t (x_t - x^\star)^\top A r_t + 4\alpha_t^2\, r_t^\top A r_t$$
$$= (x_t - x^\star)^\top A (x_t - x^\star) - 4\alpha_t\, r_t^\top r_t + 4\alpha_t^2\, r_t^\top A r_t,$$

where the last equality uses $A(x_t - x^\star) = A x_t - v = -r_t$. This is a quadratic function in $\alpha_t$, and we can analytically compute the $\alpha_t$ that minimizes $f_A$: it is $\frac{1}{2} \cdot \frac{r_t^\top r_t}{r_t^\top A r_t}$. Substituting this value of $\alpha_t$, and using $x^\star - x_t = A^{-1} r_t$, we obtain

$$\|x_{t+1} - x^\star\|_A^2 = \|x_t - x^\star\|_A^2 - \frac{(r_t^\top r_t)^2}{r_t^\top A r_t} = \|x_t - x^\star\|_A^2 \left(1 - \frac{r_t^\top r_t}{r_t^\top A r_t} \cdot \frac{r_t^\top r_t}{r_t^\top A^{-1} r_t}\right).$$

Now we relate the rate of convergence to the optimal solution to the condition number of $A$. Towards this, note that for any $z$, we have

$$z^\top A z \leq \lambda_1\, z^\top z \quad \text{and} \quad z^\top A^{-1} z \leq \lambda_n^{-1}\, z^\top z,$$

where $\lambda_1$ and $\lambda_n$ are the largest and the smallest eigenvalues of $A$, respectively. Thus,

$$\|x_{t+1} - x^\star\|_A^2 \leq (1 - \kappa^{-1}) \|x_t - x^\star\|_A^2,$$

where $\kappa \stackrel{\mathrm{def}}{=} \kappa(A) = \lambda_1/\lambda_n$ is the condition number of $A$. Hence, assuming we start with $x_0 = 0$, we can find an $x_t$ such that $\|x_t - x^\star\|_A \leq \delta \|x^\star\|_A$ in approximately $\kappa \log 1/\delta$ iterations, with the cost of each iteration dominated by $O(1)$ multiplications of the matrix $A$ with a given vector (and $O(1)$ dot product computations). Thus, this gradient descent-based method allows us to compute a $\delta$-approximate solution to $x^\star$ in time $O((t_A + n)\,\kappa \log 1/\delta)$.
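The method just described is only a few lines long. The following is a minimal Python sketch of it (the function name, stopping rule, and example matrix are illustrative choices, not part of the notes); it uses the exact line-search step $\alpha_t = \frac{1}{2}\, r_t^\top r_t / r_t^\top A r_t$ and one multiplication of $A$ with a vector per iteration:

```python
import numpy as np

def gradient_descent_solve(A, v, max_iters=10000, tol=1e-8):
    """Gradient descent for Ax = v with A symmetric positive definite.

    With the exact line search alpha_t = (r_t^T r_t) / (2 r_t^T A r_t), the
    update is x_{t+1} = x_t + (r_t^T r_t / r_t^T A r_t) r_t.  Each iteration
    costs one multiplication of A with a vector plus O(n) extra work; the
    residual r_t = v - A x_t is maintained incrementally.
    """
    x = np.zeros_like(v, dtype=float)
    r = v.astype(float)                    # r_0 = v - A x_0 = v since x_0 = 0
    for _ in range(max_iters):
        if np.linalg.norm(r) <= tol:
            break
        Ar = A @ r                         # the single matrix-vector product
        step = (r @ r) / (r @ Ar)          # equals 2 * alpha_t
        x = x + step * r
        r = r - step * Ar                  # r_{t+1} = v - A x_{t+1}
    return x

# Illustrative usage on a small, well-conditioned PD matrix.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((200, 100))
    A = B.T @ B / 200 + np.eye(100)        # symmetric PD with modest kappa(A)
    v = rng.standard_normal(100)
    x = gradient_descent_solve(A, v)
    print(np.linalg.norm(A @ x - v))       # small residual, roughly <= tol
```

On this well-conditioned example the loop terminates after a number of iterations on the order of $\kappa(A) \log(1/\delta)$, consistent with the bound above.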
1.2 The Conjugate Gradient Method

Observe that at any step $t$ of the gradient descent method, we have $x_{t+1} \in \mathrm{Span}\{x_t, A x_t, v\}$. Hence, for $x_0 = 0$, it follows by induction that for any positive integer $k$,

$$x_k \in \mathrm{Span}\{v, Av, \ldots, A^k v\}.$$

The running time of $k$ iterations of the gradient descent-based method is dominated by the time required to compute a basis for this subspace. However, the vector $x_k$ it produces may not be the vector from this subspace that minimizes $f_A$. The essence of the Conjugate Gradient method, on the other hand, is that it finds the vector in this subspace that minimizes $f_A$, in essentially the same amount of time required by $k$ iterations of the gradient descent-based method.

We must address two important questions about the Conjugate Gradient method: (1) Can the best vector be computed efficiently? and (2) What is the approximation guarantee achieved after $k$ iterations? We show that the best vector can be found efficiently, and prove, using low-degree polynomial approximations to $x^k$, that the Conjugate Gradient method achieves a quadratic improvement over the gradient descent-based method in terms of its dependence on the condition number of $A$.

Finding the best vector efficiently. Let $\{v_0, \ldots, v_k\}$ be a basis for $\mathcal{K} \stackrel{\mathrm{def}}{=} \mathrm{Span}\{v, Av, \ldots, A^k v\}$ (called the Krylov subspace of order $k$). Hence, any vector in this subspace can be written as $\sum_{i=0}^{k} \alpha_i v_i$. Our objective then becomes

$$\Big\|x^\star - \sum_i \alpha_i v_i\Big\|_A^2 = \Big(\sum_i \alpha_i v_i\Big)^\top A \Big(\sum_i \alpha_i v_i\Big) - 2\Big(\sum_i \alpha_i v_i\Big)^\top v + \|x^\star\|_A^2.$$

Solving this optimization problem for the $\alpha_i$ requires matrix inversion, the very problem we set out to mitigate. The crucial observation is that if the $v_i$ are $A$-orthogonal, i.e., $v_i^\top A v_j = 0$ for $i \neq j$, then all the cross-terms disappear. Thus,

$$\Big\|x^\star - \sum_i \alpha_i v_i\Big\|_A^2 = \sum_i \big(\alpha_i^2\, v_i^\top A v_i - 2\alpha_i\, v_i^\top v\big) + \|x^\star\|_A^2,$$

and, as in the gradient descent-based method, we can analytically compute the set of values $\alpha_i$ that minimize the objective: they are given by $\alpha_i = \frac{v_i^\top v}{v_i^\top A v_i}$.

Hence, if we can construct an $A$-orthogonal basis $\{v_0, \ldots, v_k\}$ for $\mathcal{K}$ efficiently, we do at least as well as the gradient descent-based method. If we start with an arbitrary set of vectors and try to $A$-orthogonalize them via the Gram-Schmidt process (with inner products taken with respect to $A$), we need to compute $k^2$ inner products and, hence, for large $k$, this is no more efficient than the gradient descent-based method. An efficient construction of such a basis is one of the key ideas here. We proceed iteratively, starting with $v_0 = v$. At the $i$-th iteration, we compute $A v_{i-1}$ and $A$-orthogonalize it with respect to $v_0, \ldots, v_{i-1}$ to obtain $v_i$. It is trivial to see that the vectors $v_0, \ldots, v_k$ are $A$-orthogonal.
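As a minimal sketch of this construction (in Python; the function name and tolerance are illustrative, and no attempt is made at efficiency), one can build the $A$-orthogonal basis exactly as described, by $A$-orthogonalizing $A v_{i-1}$ against all previously constructed basis vectors, and then read off the minimizer of $f_A$ over the Krylov subspace from the decoupled coefficients $\alpha_i = v_i^\top v / v_i^\top A v_i$:

```python
import numpy as np

def cg_via_krylov_basis(A, v, k):
    """Minimize f_A over Span{v, Av, ..., A^k v} using an A-orthogonal basis.

    v_0 = v, and v_i is obtained by A-orthogonalizing A v_{i-1} against all
    previously constructed basis vectors (explicit Gram-Schmidt in the
    A-inner product).  This mirrors the construction in the text and is not
    optimized: it performs about k^2 inner products.
    """
    basis = []
    w = v.astype(float)
    for i in range(k + 1):
        if i > 0:
            w = A @ basis[-1]
        for u in basis:
            # Subtract the A-projection of w onto u: (<w,u>_A / <u,u>_A) * u.
            w = w - ((u @ (A @ w)) / (u @ (A @ u))) * u
        if np.linalg.norm(w) <= 1e-12 * np.linalg.norm(v):
            break                          # the Krylov subspace stopped growing
        basis.append(w)
    # With an A-orthogonal basis the optimal coefficients decouple:
    #   alpha_i = (v_i^T v) / (v_i^T A v_i).
    x = np.zeros_like(v, dtype=float)
    for u in basis:
        x = x + ((u @ v) / (u @ (A @ u))) * u
    return x
```

As written, this sketch does not yet deliver the promised speed-up: the efficiency of the actual Conjugate Gradient method comes from the fact that, in exact arithmetic, $A v_{i-1}$ is already $A$-orthogonal to $v_0, \ldots, v_{i-3}$, so only the two most recent basis vectors require a correction at each iteration.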