Optimal Approximation of Doubly Stochastic Matrices
Nikitas Rontsis                        Paul J. Goulart
Department of Engineering Science      Department of Engineering Science
University of Oxford, OX1 3PJ, UK      University of Oxford, OX1 3PJ, UK

Abstract

We consider the least-squares approximation of a matrix $C$ in the set of doubly stochastic matrices with the same sparsity pattern as $C$. Our approach is based on applying the well-known Alternating Direction Method of Multipliers (ADMM) to a reformulation of the original problem. Our resulting algorithm requires an initial Cholesky factorization of a positive definite matrix that has the same sparsity pattern as $C + I$, followed by simple iterations whose complexity is linear in the number of nonzeros in $C$, thus ensuring excellent scalability and speed. We demonstrate the advantages of our approach in a series of experiments on problems with up to 82 million nonzeros; these include normalizing large-scale matrices arising from the 3D structure of the human genome, clustering applications, and the SuiteSparse matrix library. Overall, our experiments illustrate the outstanding scalability of our algorithm; matrices with millions of nonzeros can be approximated in a few seconds on modest desktop computing hardware.

1 INTRODUCTION

Consider the following optimization problem

$$
\begin{array}{ll}
\text{minimize} & \tfrac{1}{2}\|X - C\|^2 \\
\text{subject to} & X \text{ nonnegative} \\
& X_{i,j} = 0 \;\;\forall\, (i,j) \text{ with } C_{i,j} = 0 \\
& X\mathbf{1} = \mathbf{1}, \quad X^T \mathbf{1} = \mathbf{1},
\end{array}
\tag{P}
$$

which approximates the symmetric real $n \times n$ matrix $C$ in the Frobenius norm by a doubly stochastic matrix $X \in \mathbb{R}^{n \times n}$, i.e., a matrix with nonnegative elements whose columns and rows sum to one, that has the same sparsity pattern as $C$. Problem P is a matrix nearness problem, i.e., a problem of finding a matrix with certain properties that is close to some given matrix; see Higham (1989) for a survey on matrix nearness problems.

Adjusting a matrix so that it becomes doubly stochastic is relevant in many fields, e.g., for preconditioning linear systems (Knight, 2008), as a normalization tool in spectral clustering (Zass and Shashua, 2007), optimal transport (Cuturi, 2013), and image filtering (Milanfar, 2013), or as a tool to estimate a doubly stochastic matrix from incomplete or approximate data, used e.g. in longitudinal studies in the life sciences (Diggle et al., 2002) or to analyze the 3D structure of the human genome (Rao et al., 2014).

A related and widely used approach is to search for a diagonal matrix $D$ such that $DCD$ is doubly stochastic. This is commonly referred to as the matrix balancing problem, or Sinkhorn's algorithm (Knight, 2008). Such a scaling matrix $D$ exists and is unique whenever $C$ has total support. Perhaps surprisingly, when $C$ has only nonnegative elements, the matrix balancing problem can be considered as a matrix nearness problem. This is because $DCD$ has been shown to minimize the relative entropy measure (Idel, 2016, Observation 3.19), (Benamou et al., 2015), i.e., it is the solution of the following convex problem

$$
\begin{array}{ll}
\text{minimize} & \sum_{i,j} X_{i,j} \log(X_{i,j}/C_{i,j}) \\
\text{subject to} & X \text{ doubly stochastic} \\
& X_{i,j} = 0 \;\;\forall\, (i,j) \text{ with } C_{i,j} = 0,
\end{array}
\tag{1}
$$

where we define $0 \cdot \log 0 = 0$, $0 \cdot \log(0/0) = 0$, and $1 \cdot \log(1/0) = \infty$. Note, however, that the relative entropy is not a proper distance metric since it is not symmetric, does not satisfy the triangle inequality, and can take the value $\infty$.
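For concreteness, the following is a minimal sketch of the matrix balancing iteration discussed above (the classical Sinkhorn alternation; the helper name `sinkhorn_balance` is ours, and this is an illustration rather than code from the paper). For a symmetric $C$ with total support, the two scaling vectors converge to a common scaling $D$ such that $DCD$ is doubly stochastic.

```python
import numpy as np

def sinkhorn_balance(C, max_iter=1000, tol=1e-10):
    """Alternately rescale rows and columns of a nonnegative matrix C
    until diag(r) @ C @ diag(c) has unit row and column sums."""
    C = np.asarray(C, dtype=float)
    r = np.ones(C.shape[0])
    c = np.ones(C.shape[1])
    for _ in range(max_iter):
        r = 1.0 / (C @ c)              # enforce unit row sums
        c = 1.0 / (C.T @ r)            # enforce unit column sums
        X = C * np.outer(r, c)
        if max(abs(X.sum(0) - 1).max(), abs(X.sum(1) - 1).max()) < tol:
            break
    return r, c

rng = np.random.default_rng(0)
C = rng.random((5, 5)) + 0.1           # strictly positive, so total support holds
C = (C + C.T) / 2                      # symmetric, as in the setting above
r, c = sinkhorn_balance(C)
X = C * np.outer(r, c)
print(X.sum(axis=0), X.sum(axis=1))    # both approximately the all-ones vector
```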
The matrix balancing problem can be solved by iterative methods with remarkable scalability. Standard and simple iterative methods exist that exhibit linear per-iteration complexity w.r.t. the number of nonzeros in $C$ and a linear convergence rate (Knight, 2008), (Idel, 2016). More recent algorithms can exhibit a super-linear convergence rate and increased robustness in many practical situations (Knight and Ruiz, 2013).

The aim of this paper is to show that the direct minimization of a least-squares objective in doubly stochastic approximation, which has a long history dating back to the influential paper of Deming and Stephan (1940)¹, can also be performed efficiently. This gives practitioners a new solution to the important problem of doubly stochastic matrix approximation, which might prove useful in cases where the relative entropy metric is not suitable for their problem.

¹Deming and Stephan (1940) consider the weighted least-squares cost $\tfrac{1}{2}\sum_{i,j} [X_{i,j} - C_{i,j}]^2 / C_{i,j}$, which we treat in §3.2.

The approach we present is also flexible in the sense that it can handle other interesting cases, for example, where $C$ is asymmetric or rectangular, where $\|\cdot\|$ is a weighted Frobenius norm, and where $X\mathbf{1}$ and $X^T\mathbf{1}$ are required to equal an arbitrary, given vector. We discuss these generalizations in §3.2.

Related work. Zass and Shashua (2007) consider the problem P in the case when $C$ is fully dense. They suggest an alternating projections algorithm that has linear per-iteration complexity. The approach of Zass and Shashua (2007) resembles the results of §3.2 but, as we will see in the experimental section, it is not guaranteed to converge to an optimizer.

The paper is organized as follows. In Section 2, we introduce a series of reformulations of Problem P, resulting in a problem that is much easier to solve. In Section 3, we suggest a solution method that reveals a particular structure in the problem. Finally, in Section 4, we present a series of numerical results that highlight the scalability of the approach.

Notation used. The symbol $\otimes$ denotes the Kronecker product, $\operatorname{vec}(\cdot)$ the operator that stacks a matrix into a vector in a column-wise fashion, and $\operatorname{mat}(\cdot)$ the inverse operator to $\operatorname{vec}(\cdot)$ (see (Golub and Van Loan, 2013, §12.3.4) for a rigorous definition). Given two matrices, or vectors, with equal dimensions, $\circ$ denotes the Hadamard (element-wise) product. $\|\cdot\|$ denotes the 2-norm of a vector and the Frobenius norm of a matrix, while $\operatorname{card}(\cdot)$ denotes the number of nonzero elements in a vector or a matrix. Finally, $\epsilon$ denotes the machine precision.

2 MODELING P EFFICIENTLY

In this section, we present a reformulation of the doubly stochastic approximation problem P suitable for solving large-scale problems. One of the difficulties with the original formulation P is that it has $n^2$ variables and $2n^2 + 2n$ constraints. Attempting to solve P with an off-the-shelf QP solver, such as Gurobi (Gurobi Optimization LLC, 2018), can result in out-of-memory issues merely from representing the problem data, even for medium-sized problems.

In order to avoid this issue, we will perform a series of reformulations that eliminate variables from P. The final problem, i.e., P₂, will have significantly fewer variables and constraints while maintaining a remarkable degree of structure, which will be revealed and exploited in the next section.

We first take the obvious step of eliminating all variables in P that are constrained to be zero, as prescribed by the constraint

$$X_{i,j} = 0 \text{ for all } (i,j) \text{ such that } C_{i,j} = 0. \tag{2}$$

Indeed, consider $\operatorname{vec}(\cdot)$, the operator that stacks a matrix into a vector in a column-wise fashion, and $H$ an $n_{nz} \times n^2$ matrix, where $n_{nz} := \operatorname{card}(C)$, that selects all the nonzero elements of $\operatorname{vec}(C)$. Note that $H$ depends on the sparsity pattern $S$ of $C$, defined as the $0$-$1$ $n \times n$ matrix

$$S_{i,j} := \begin{cases} 0 & C_{i,j} = 0 \\ 1 & \text{otherwise}, \end{cases} \tag{3}$$

but we will not denote this dependence explicitly.
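Because $H$ is just a 0-1 selection matrix, it is cheap to construct explicitly in sparse format. The sketch below is our own illustration of this construction (the helper name `selection_matrix` is not from the paper); it builds $H$ from the sparsity pattern of $C$ and checks the selection property $H^T H = \operatorname{diag}(\operatorname{vec}(S))$ used in the sequel.

```python
import numpy as np
import scipy.sparse as sp

def selection_matrix(C):
    """H with one unit entry per row, picking out the nonzero positions
    of vec(C), where vec() stacks columns (Fortran/column-major order)."""
    idx = np.flatnonzero(C.flatten(order="F"))   # nonzero positions in vec(C)
    nnz = idx.size
    return sp.csr_matrix((np.ones(nnz), (np.arange(nnz), idx)),
                         shape=(nnz, C.size))

C = np.array([[0.5, 0.0],
              [0.2, 1.0]])
H = selection_matrix(C)
c = H @ C.flatten(order="F")                     # the vector c of eq. (4) below
S = (C != 0).astype(float)                       # sparsity pattern, eq. (3)
# Selection property: H^T H = diag(vec(S)), so H^T recovers vec(X) from x
# for any X sharing the sparsity pattern S.
assert np.allclose((H.T @ H).toarray(), np.diag(S.flatten(order="F")))
print(c)                                         # [0.5 0.2 1. ]
```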
We can now isolate the nonzero variables contained in $X$ and $C$ in $n_{nz}$-dimensional vectors defined as

$$x := H \operatorname{vec}(X), \qquad c := H \operatorname{vec}(C). \tag{4}$$

Note that $x$ is simply a re-writing of $X$ in $\mathbb{R}^{n_{nz}}$, since for any $X \in \mathcal{S} := \{X \in \mathbb{R}^{n \times n} \mid X_{i,j} = 0 \text{ for all } C_{i,j} = 0\}$ we have

$$H^T H \operatorname{vec}(X) = \operatorname{vec}(X) \iff H^T x = \operatorname{vec}(X),$$

due to the fact that $H^T H = \operatorname{diag}(\operatorname{vec}(S))$. Thus every $x$ defines a unique $X \in \mathcal{S}$ and vice versa.

We can now describe the constraints of P, i.e., $X \geq 0$, $X\mathbf{1}_n = \mathbf{1}_n$, and $X^T\mathbf{1}_n = \mathbf{1}_n$, in the "$x$-space". Obviously, $X \geq 0$ trivially maps to $x \geq 0$. Furthermore, recalling the standard Kronecker product identity

$$\operatorname{vec}(LMN) = (N^T \otimes L) \operatorname{vec}(M) \tag{5}$$

for matrices of compatible dimensions, the constraints $X\mathbf{1}_n = \mathbf{1}_n$ and $X^T\mathbf{1}_n = \mathbf{1}_n$ of P can be rewritten in an equivalent vectorized form as

$$\begin{bmatrix} \mathbf{1}_n^T \otimes I_n \\ I_n \otimes \mathbf{1}_n^T \end{bmatrix} \operatorname{vec}(X) = \mathbf{1}_{2n} \tag{6}$$

or, by noting that $H^T x = \operatorname{vec}(X)$, as

$$\begin{bmatrix} \mathbf{1}_n^T \otimes I_n \\ I_n \otimes \mathbf{1}_n^T \end{bmatrix} H^T x = \mathbf{1}_{2n}. \tag{7}$$

Thus P can be rewritten in the "$x$-space" as

$$
\begin{array}{ll}
\text{minimize} & \tfrac{1}{2}\|x - c\|^2 \\
\text{subject to} & x \geq 0 \\
& \begin{bmatrix} \mathbf{1}_n^T \otimes I_n \\ I_n \otimes \mathbf{1}_n^T \end{bmatrix} H^T x = \mathbf{1}_{2n}.
\end{array}
\tag{P₁}
$$

As in the previous definitions, define $H_u$ as the matrix that stacks all the nonzeros of $C_u$ in a column-wise fashion, which is used to extract the nonzero elements of $X_u$ and $C_u$, i.e., the scaled nonzero elements of the upper triangular part of $X$, into the vectors

$$x_u := H_u \operatorname{vec}(X_u), \qquad c_u := H_u \operatorname{vec}(C_u). \tag{10}$$

We can now write down our reduced optimization problem. Note that, although it might not be directly evident, the following problem possesses a remarkable degree of internal structure that is exploited in the suggested solution algorithm of the next section.
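As a sanity check of the $x$-space reformulation P₁ above, the following sketch assembles the constraint matrix of eq. (7) and solves a tiny instance with a generic solver. SciPy's SLSQP is used here purely as a brute-force stand-in and is not the structured ADMM method this paper develops; `selection_matrix` is the hypothetical helper from the earlier sketch.

```python
# Brute-force check of the x-space problem P1 on a tiny instance.
import numpy as np
import scipy.sparse as sp
from scipy.optimize import minimize

n = 4
rng = np.random.default_rng(1)
C = rng.random((n, n)) * (rng.random((n, n)) < 0.6)  # random sparse-ish matrix
C = (C + C.T) / 2 + np.eye(n)   # symmetric; positive diagonal => X = I feasible

H = selection_matrix(C)
c = H @ C.flatten(order="F")
onesT = np.ones((1, n))                     # the row vector 1_n^T
I = sp.identity(n)
A = sp.vstack([sp.kron(onesT, I),           # row-sum block, eq. (6)
               sp.kron(I, onesT)]) @ H.T    # constraint matrix of eq. (7)

res = minimize(lambda x: 0.5 * np.sum((x - c) ** 2),    # objective of P1
               x0=c, jac=lambda x: x - c, method="SLSQP",
               bounds=[(0.0, None)] * c.size,            # x >= 0
               constraints={"type": "eq",
                            "fun": lambda x: A @ x - np.ones(2 * n)})

X = (H.T @ res.x).reshape((n, n), order="F")             # X = mat(H^T x)
print(np.round(X.sum(axis=0), 6))   # column sums, approximately all ones
print(np.round(X.sum(axis=1), 6))   # row sums, approximately all ones
```

This dense generic solve scales poorly; the point of the reformulations above, and of the ADMM method in the next section, is precisely to exploit the structure that such an off-the-shelf approach ignores.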