
Optimal Approximation of Doubly Stochastic Matrices

Nikitas Rontsis
Department of Engineering Science
University of Oxford, OX1 3PJ, UK

Paul J. Goulart
Department of Engineering Science
University of Oxford, OX1 3PJ, UK

Abstract

We consider the least-squares approximation of a matrix C in the set of doubly stochastic matrices with the same sparsity pattern as C. Our approach is based on applying the well-known Alternating Direction Method of Multipliers (ADMM) to a reformulation of the original problem. Our resulting algorithm requires an initial Cholesky factorization of a positive definite matrix that has the same sparsity pattern as C + I, followed by simple iterations whose complexity is linear in the number of nonzeros in C, thus ensuring excellent scalability and speed. We demonstrate the advantages of our approach in a series of experiments on problems with up to 82 million nonzeros; these include normalizing large scale matrices arising from the 3D structure of the human genome, clustering applications, and the SuiteSparse matrix library. Overall, our experiments illustrate the outstanding scalability of our algorithm; matrices with millions of nonzeros can be approximated in a few seconds on modest desktop computing hardware.

1 INTRODUCTION

Consider the following optimization problem

    minimize    (1/2)‖X − C‖²
    subject to  X nonnegative                              (P)
                X_{i,j} = 0 ∀ (i,j) with C_{i,j} = 0
                X1 = 1,  X^T 1 = 1,

which approximates the symmetric real n × n matrix C in the Frobenius norm by a doubly stochastic matrix X ∈ R^{n×n}, i.e., a matrix with nonnegative elements whose columns and rows sum to one, that has the same sparsity pattern as C. Problem P is a matrix nearness problem, i.e., a problem of finding a matrix with certain properties that is close to some given matrix; see Higham (1989) for a survey on matrix nearness problems.

Adjusting a matrix so that it becomes doubly stochastic is relevant in many fields, e.g., for preconditioning linear systems (Knight, 2008), as a normalization tool in spectral clustering (Zass and Shashua, 2007), in optimal transport (Cuturi, 2013) and image filtering (Milanfar, 2013), or as a tool to estimate a doubly stochastic matrix from incomplete or approximate data, used e.g. in longitudinal studies in the life sciences (Diggle et al., 2002) or to analyze the 3D structure of the human genome (Rao et al., 2014).

A related and widely used approach is to search for a diagonal matrix D such that DCD is doubly stochastic. This is commonly referred to as the matrix balancing problem, or Sinkhorn's algorithm (Knight, 2008). Such a matrix D exists and is unique whenever C has total support. Perhaps surprisingly, when C has only nonnegative elements, the matrix balancing problem can itself be considered a matrix nearness problem. This is because DCD has been shown to minimize the relative entropy measure (Idel, 2016, Observation 3.19), (Benamou et al., 2015), i.e., it is the solution of the following convex problem

    minimize    Σ_{i,j} X_{i,j} log(X_{i,j}/C_{i,j})
    subject to  X doubly stochastic                        (1)
                X_{i,j} = 0 ∀ (i,j) with C_{i,j} = 0,

where we define 0·log(0) = 0, 0·log(0/0) = 0, and 1·log(1/0) = ∞. Note, however, that the relative entropy is not a proper distance metric since it is not symmetric, does not satisfy the triangle inequality, and can take the value ∞.
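The matrix balancing problem referenced above admits a very simple iterative solution. The following is a minimal sketch in Julia (the language of our released implementation); the function name, tolerance, and stopping rule are illustrative only, and this is the classical Sinkhorn–Knopp scheme, not the method proposed in this paper. It assumes C is nonnegative with total support, so that the iteration converges; for symmetric C the balanced matrix coincides with the DCD form above.

    using LinearAlgebra

    # Alternately rescale rows and columns of C until D1*C*D2 is
    # (approximately) doubly stochastic.
    function sinkhorn_knopp(C; tol = 1e-8, maxiter = 10_000)
        n = size(C, 1)
        r = ones(n); c = ones(n)               # diagonals of D1 and D2
        for _ in 1:maxiter
            r = 1 ./ (C * c)                   # make the rows sum to one
            c = 1 ./ (C' * r)                  # make the columns sum to one
            # stop when the row sums of D1*C*D2 are close to one
            maximum(abs.(r .* (C * c) .- 1)) < tol && break
        end
        return Diagonal(r) * C * Diagonal(c)
    end

    C = [0.0 9.0 9.0; 9.0 1.0 0.0; 9.0 0.0 9.0]
    X = sinkhorn_knopp(C)                      # rows and columns sum to ≈ 1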

The matrix balancing problem can be solved by iterative methods with remarkable scalability. Standard and simple iterative methods exist that exhibit linear per-iteration complexity w.r.t. the number of nonzeros in C and a linear convergence rate (Knight, 2008), (Idel, 2016). More recent algorithms can exhibit a super-linear convergence rate and increased robustness in many practical situations (Knight and Ruiz, 2013).

The aim of this paper is to show that the direct minimization of a least squares objective in doubly stochastic approximation, which has a long history dating back to the influential paper of Deming and Stephan (1940)¹, can also be performed efficiently. This gives practitioners a new solution to a very important problem of doubly stochastic matrix approximation, which might prove useful for cases where the relative entropy metric is not suitable to their problem.

The approach we present is also flexible in the sense that it can handle other interesting cases, for example, where C is asymmetric or rectangular, where ‖·‖ is a weighted Frobenius norm, and where X1 and X^T 1 are required to sum to an arbitrary, given vector. We discuss these generalizations in §3.2.

Related work  Zass and Shashua (2007) consider the problem P in the case when C is fully dense. They suggest an alternating projections algorithm that has linear per-iteration complexity. The approach of Zass and Shashua (2007) resembles the results of §3.2, but as we will see in the experimental section, it is not guaranteed to converge to an optimizer.

The paper is organized as follows: In Section 2, we introduce a series of reformulations of problem P, resulting in a problem that is much easier to solve. In Section 3, we suggest a solution method that exploits a particular structure revealed in the problem. Finally, in Section 4, we present a series of numerical results that highlight the scalability of the approach.

Notation used  The symbol ⊗ denotes the Kronecker product, vec(·) the operator that stacks a matrix into a vector in a column-wise fashion, and mat(·) the inverse operator to vec(·) (see (Golub and Van Loan, 2013, §12.3.4) for a rigorous definition). Given two matrices, or vectors, of equal dimensions, ∘ denotes the Hadamard (element-wise) product. ‖·‖ denotes the 2-norm of a vector and the Frobenius norm of a matrix, while card(·) denotes the number of nonzero elements of a vector or a matrix. Finally, ε_p denotes the machine precision.

¹Deming and Stephan (1940) consider the weighted least squares cost (1/2) Σ_{i,j} [X_{i,j} − C_{i,j}]²/C_{i,j}, which we treat in §3.2.

2 MODELING P EFFICIENTLY

In this section, we present a reformulation of the doubly stochastic approximation problem P suitable for solving large-scale problems. One of the difficulties with the original formulation, P, is that it has n² variables and 2n² + 2n constraints. Attempting to solve P with an off-the-shelf QP solver, such as Gurobi (Gurobi Optimization LLC, 2018), can result in out-of-memory issues just from representing the problem's data, even for medium-sized problems.

In order to avoid this issue, we will perform a series of reformulations that eliminate variables from P. The final problem, P2, will have significantly fewer variables and constraints while maintaining a remarkable degree of structure, which will be revealed and exploited in the next section.

We first take the obvious step of eliminating all variables in P that are constrained to be zero, as prescribed by the constraint

    X_{i,j} = 0 for all (i,j) such that C_{i,j} = 0.    (2)

Indeed, consider vec(·), the operator that stacks a matrix into a vector in a column-wise fashion, and H an nnz × n² matrix, where nnz := card(C), that selects all the nonzero elements of vec(C). Note that H depends on the sparsity pattern S of C, defined as the 0−1 n × n matrix

    S_{i,j} := 0 if C_{i,j} = 0, and 1 otherwise,    (3)

but we will not denote this dependence explicitly. We can now isolate the nonzero variables contained in X and C in nnz-dimensional vectors defined as

    x := H vec(X),    c := H vec(C).    (4)

Note that x is simply a re-writing of X in R^{nnz}, since for any X ∈ S := {X ∈ R^{n×n} | X_{i,j} = 0 for all C_{i,j} = 0} we have

    H^T H vec(X) = vec(X)  ⇔  H^T x = vec(X),

due to the fact that H^T H = diag(vec(S)). Thus every x defines a unique X ∈ S and vice versa.

We can now describe the constraints of P, i.e., X ≥ 0, X1_n = 1_n and X^T 1_n = 1_n, in the "x-space". Obviously X ≥ 0 trivially maps to x ≥ 0. Furthermore, recalling the standard identity

    vec(LMN) = (N^T ⊗ L) vec(M)    (5)

for matrices of compatible dimensions, the constraints X1_n = 1_n and X^T 1_n = 1_n of P can be rewritten in an equivalent vectorized form as

    [1_n^T ⊗ I_n; I_n ⊗ 1_n^T] vec(X) = 1_{2n},    (6)

or, by noting that H^T x = vec(X), as

    [1_n^T ⊗ I_n; I_n ⊗ 1_n^T] H^T x = 1_{2n}.    (7)

Thus P can be rewritten in the "x-space" as

    minimize    (1/2)‖x − c‖²
    subject to  x ≥ 0                                      (P1)
                [1_n^T ⊗ I_n; I_n ⊗ 1_n^T] H^T x = 1_{2n}.
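To make the reduction concrete, the following sketch (in Julia; the variable names are ours and this is not taken from the released implementation) builds the selector H of (4) from the sparsity pattern of a small random C and forms the constraint matrix of P1:

    using LinearAlgebra, SparseArrays

    n = 4
    C = sprand(n, n, 0.5); C = C + C'       # a random symmetric sparse C
    idx = findall(!iszero, vec(C))          # positions of nonzeros in vec(C)
    m = length(idx)                         # m = nnz = card(C)
    H = sparse(1:m, idx, ones(m), m, n^2)   # H selects the nonzeros of vec(C)

    c = H * vec(C)                          # c := H vec(C), as in (4)
    # Equality constraint matrix of P1: [1' ⊗ I; I ⊗ 1'] H'
    E = [kron(ones(1, n), I(n)); kron(I(n), ones(1, n))] * H'
    # Any doubly stochastic X with the pattern of C satisfies
    # E * (H * vec(X)) == ones(2n), which is (7).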

A further reduction of the variables and constraints of P1 can be achieved by exploiting the symmetry of C. To this end, note that when C is symmetric the optimal doubly stochastic approximation will also be symmetric, according to the following proposition:

Proposition 2.1. If C is symmetric, then the optimal solution X* of P is also symmetric.

Proof. Assume the contrary, so that X* is optimal but asymmetric. Then the matrix X*^T is a feasible solution for P, since it remains element-wise nonnegative when its row and column sums are exchanged, and it has an identical objective value since C is assumed symmetric. Then define X̃ as the convex combination

    X̃ := (1/2)(X* + X*^T),

which is also a feasible point for P. Since the objective function is strictly convex (at least on the subset of elements of X not constrained to be zero), the objective function evaluated at X̃ will be strictly lower than at both X* and X*^T, contradicting the optimality of X*. □

Note that the above proof, like that of the following Theorem 2.2, is a simple algebraic calculation.

It follows that restricting the feasible set of P to symmetric matrices does not affect its solution. We will exploit this by eliminating all the variables embedded in X that are below its main diagonal. Define an upper triangular matrix X_u consisting of scaled elements of X such that X = X_u + X_u^T, and likewise for C, i.e.

    X_u := U ∘ X,    C_u := U ∘ C,    (8)

where U is the upper triangular matrix

    U := [1/2   1  ···   1
                ⋱        ⋮
                  1/2    1
          0            1/2].    (9)

As in the previous definitions, define H_u as the matrix that stacks all the nonzeros of C_u into a vector in a column-wise fashion, which is used to extract the nonzero elements of X_u and C_u, i.e., the scaled nonzero elements of the upper triangular parts of X and C, into the vectors

    x_u := H_u vec(X_u),    c_u := H_u vec(C_u).    (10)

We can now write down our reduced optimization problem. Note that, although it might not be directly evident, the following problem possesses a remarkable degree of internal structure that is exploited in the suggested solution algorithm of the next section.

Theorem 2.2. Solving P for a symmetric C is equivalent to solving

    minimize    (1/2)‖p ∘ (x_u − c_u)‖²
    subject to  x_u ≥ 0                                    (P2)
                A x_u = 1_n,

where

    p := H_u vec( [2   √2  ···  √2
                        ⋱       ⋮
                            2  √2
                   0           2 ] ),

    A_1 := 1_n^T ⊗ I_n,    A_2 := I_n ⊗ 1_n^T,

and A := (A_1 + A_2) H_u^T, in the sense that P is feasible iff P2 is, and the optimizer X* of P can be constructed from the optimizer x_u* of P2 using (10) and (8).

Proof. We will first show that every feasible x_u of P2 defines a feasible X for P, where vec(X_u) := H_u^T x_u and X := X_u + X_u^T. Similarly to S, define S_u ∈ R^{n×n} as the 0−1 matrix that represents the sparsity pattern of the upper triangular part of C, i.e.

    S_u{i,j} := 0 if C_{i,j} = 0 or i > j, and 1 otherwise.

The equality of the objective values can be shown as follows:

    ‖p ∘ (x_u − c_u)‖ = ‖(√2 1_{n×n} + (2 − √2) I_n) ∘ (X_u − C_u)‖
                      = ‖(√2 1_{n×n} + (2 − √2) I_n) ∘ U ∘ (X − C)‖    (11)
                      = ‖(√2 1_{n×n} + (1 − √2) I_n) ∘ S_u ∘ (X − C)‖
                      = ‖S ∘ (X − C)‖ = ‖X − C‖.

Furthermore, similarly to (5)–(7), we have

    X1_n = (X_u + X_u^T) 1_n
         = (1_n^T ⊗ I_n + I_n ⊗ 1_n^T) vec(X_u)    (12)
         = (1_n^T ⊗ I_n + I_n ⊗ 1_n^T) H_u^T x_u = A x_u,

resulting in X1_n = 1_n due to the feasibility of x_u for P2. Due to the symmetry of X we also get X^T 1_n = 1_n. Finally, X is nonnegative by construction and has sparsity pattern S. Therefore X is feasible for P.

Likewise, following (11)–(12) in reverse order, we can show that every symmetric feasible matrix X of P defines an x_u := H_u vec(X_u), where X_u := U ∘ X, that is feasible for P2 and has identical objective value. Since only symmetric optimizers exist for P (Proposition 2.1), this concludes the proof. □

Unlike P, which has n² variables and 2n² + 2n constraints, P2 has approximately nnz/2 variables and n constraints. Furthermore, it possesses a specific internal structure that we exploit in the solution algorithm presented in §3.

3 SOLUTION METHOD

In this section, we describe how the reduced problem P2 can be solved with ADMM. We begin with a brief introduction to the ADMM algorithm in the general setting, which follows (Boyd et al., 2011), and then describe how ADMM can be applied efficiently to P2.

Several optimization problems, including reformulations of P2 (Stellato et al., 2017), are concerned with the minimization of a function q that can be decomposed into two parts q = f + g such that optimizing f or g independently is tractable. Ideally, if f and g operate on disjoint variables, i.e., if q(χ, ψ) = f(χ) + g(ψ), then q can also be optimized efficiently by merely minimizing over χ and ψ independently. However, it is often the case that there is some coupling between χ and ψ, which we assume to be of the form Aχ + Bψ = d, resulting in the following optimization problem

    minimize    f(χ) + g(ψ)
    subject to  Aχ + Bψ = d,    (13)

where χ, ψ denote the decision variables, f, g are proper lower-semicontinuous convex functions, and A, B, d are matrices and a vector of appropriate dimensions.

The Alternating Direction Method of Multipliers (ADMM) is a first-order (i.e., "gradient-based") algorithm that solves (13) while exploiting the assumption that f and g can be easily optimized independently. Indeed, ADMM iterates as follows:

    χ^{k+1} = argmin_χ L_ρ(χ, ψ^k, y^k)
    ψ^{k+1} = argmin_ψ L_ρ(χ^{k+1}, ψ, y^k)
    y^{k+1} = y^k + ρ(Aχ^{k+1} + Bψ^{k+1} − d),

where

    L_ρ(χ, ψ, y) := f(χ) + g(ψ) + y^T(Aχ + Bψ − d) + (ρ/2)‖Aχ + Bψ − d‖²

is the augmented Lagrangian of (13) and ρ is a positive penalty parameter.

Recalling that P2 is a Quadratic Program (QP), we note that solving QPs with ADMM has been widely studied in the literature (Stellato et al., 2017), (Garstka et al., 2019), (O'Donoghue et al., 2016). We will follow the approach of (Stellato et al., 2017), which can solve P2 by applying ADMM to the following splitting

    minimize    f(x̃, z̃) + g(x, z)
    subject to  (x̃, z̃) = (x, z),    (14)

where f and g are defined as

    f(x̃, z̃) := (1/2) x̃^T P x̃ − (P c_u)^T x̃ + I_{Ax̃=z̃}(x̃, z̃),
    g(x, z) := I_{x≥0}(x) + I_{z=1_n}(z),

P := diag(p ∘ p), n̄nz is the number of nonzeros in the upper triangular part of C, and I_{Ax̃=z̃}, I_{x≥0}, I_{z=1_n} denote the indicator functions of the sets {(x̃, z̃) ∈ R^{n̄nz} × R^n | Ax̃ = z̃}, {x ∈ R^{n̄nz} | x ≥ 0} and {z ∈ R^n | z = 1_n} respectively.

Applying ADMM to problem (14) results in Algorithm 1², where Π_+ denotes the projection of a vector onto the nonnegative orthant and Π_1(x) := 1 for any vector x. Refer to (Stellato et al., 2017, §3) for details.

Algorithm 1: Solving P2 with ADMM
1  given initial values x⁰, z⁰, y⁰, w⁰ and parameters ρ > 0, σ > 0, and α ∈ (0, 2);
2  repeat
3    (x̃^{k+1}, z̃^{k+1}) ← solution of the linear system
       [P + σI_{n̄nz}  ρA^T; ρA  −ρI_n] [x̃^{k+1}; z̃^{k+1}] = [σx^k − w^k + A^T(ρz^k − y^k) + P c_u; 0];
4    x^{k+1} ← Π_+(α x̃^{k+1} + (1 − α) x^k);
5    z^{k+1} ← Π_1(α z̃^{k+1} + (1 − α) z^k + ρ^{−1} y^k);
6    w^{k+1} ← w^k + σ(α x̃^{k+1} + (1 − α) x^k − x^{k+1});
7    y^{k+1} ← y^k + ρ(α z̃^{k+1} + (1 − α) z^k − z^{k+1});
8  until termination condition is satisfied;

²Algorithm 1 includes two extensions over the simple ADMM algorithm discussed in §3: it uses over-relaxation, a commonly used variation that can increase the speed of convergence (Eckstein and Bertsekas, 1992, Figure 2), and two different step sizes ρ, σ for the updates of the Lagrange multipliers w and y respectively. The parameters ρ, σ and α are chosen according to (Stellato et al., 2017), wherein exhaustive numerical testing was done to identify the best choices of these parameters when solving QPs across a wide variety of problem structures.
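The following is a minimal dense-algebra sketch of Algorithm 1 in Julia (illustrative only: it factorizes the KKT matrix of line 3 directly, runs a fixed number of iterations in place of the termination criterion of §3.1, and does not exploit Fact 3.1 below or sparsity, unlike the released implementation; the default parameter values are ours):

    using LinearAlgebra

    # Solve:  minimize (1/2)‖p ∘ (x − c)‖²  s.t.  x ≥ 0,  A x = 1.
    function admm_p2(p, c, A; ρ = 0.1, σ = 1e-6, α = 1.6, iters = 5000)
        m, nv = size(A)
        P = Diagonal(p .^ 2)
        x = zeros(nv); z = ones(m); w = zeros(nv); y = zeros(m)
        # KKT matrix of line 3, factorized once outside the loop
        K = factorize([Matrix(P) + σ*I  ρ*A'; ρ*A  -ρ*Matrix(I, m, m)])
        for _ in 1:iters
            s = K \ [σ*x - w + A'*(ρ*z - y) + P*c; zeros(m)]  # line 3
            xt, zt = s[1:nv], s[nv+1:end]
            xnew = max.(α*xt .+ (1 - α)*x, 0)                 # line 4: Π_+
            znew = ones(m)                                    # line 5: Π_1
            w += σ*(α*xt .+ (1 - α)*x .- xnew)                # line 6
            y += ρ*(α*zt .+ (1 - α)*z .- znew)                # line 7
            x, z = xnew, znew
        end
        return x
    end

Here p, c and A are the quantities of Theorem 2.2; the released implementation instead solves the reduced system of Fact 3.1 below with a sparse Cholesky factorization or Conjugate Gradient.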

The most computationally intensive operation of Algorithm 1 is the solution of the (n̄nz + n) × (n̄nz + n) linear system in line 3. We will show that its solution can be obtained by solving instead a reduced n × n linear system.

Fact 3.1. Consider the following linear system

    [P + σI_{n̄nz}  ρA^T; ρA  −ρI_n] [x; z] = [r; 0],    (15)

where x, r ∈ R^{n̄nz}, z ∈ R^n and A ∈ R^{n×n̄nz}. Its solution can be obtained as follows:

1. Obtain z by solving the following n × n positive definite linear system

    (ρA(P + σI_{n̄nz})^{−1}A^T + I_n) z = A(P + σI_{n̄nz})^{−1} r.    (16)

2. Obtain x as x = (P + σI_{n̄nz})^{−1}(r − ρA^T z).

Proof. The first block row gives (P + σI_{n̄nz})x + ρA^T z = r ⇔ x = (P + σI_{n̄nz})^{−1}(r − ρA^T z). Eliminating the variable x from (15) results in (16). □

Thus solving (15) can be reduced to solving a linear system with coefficient matrix

    ρA(P + σI_{n̄nz})^{−1}A^T + I_n.    (17)

Fortunately, the reduced matrix (17), which is equal to ρ A diag(H_u vec(Ū)) A^T + I_n with Ū the upper triangular matrix

    Ū := [(4+σ)^{−1}  (2+σ)^{−1}  ···  (2+σ)^{−1}
                      ⋱                ⋮
                         (4+σ)^{−1}  (2+σ)^{−1}
          0                          (4+σ)^{−1}],

turns out to be positive definite with a sparsity pattern matching that of C + I:

Theorem 3.2. The following relation holds

    A diag(H_u vec(D)) A^T = S ∘ (D + D^T) + diag((S ∘ (D + D^T)) 1)

for any upper triangular n × n matrix D.

Proof. The proof is in the supplementary material. □

The solution of the reduced linear system (16) can be obtained given an initial Cholesky factorization of ρA(P + σI_{n̄nz})^{−1}A^T + I_n, or even with a factorization-free algorithm, e.g., Conjugate Gradient, which primarily consists of repeated matrix-vector multiplications with ρA(P + σI_{n̄nz})^{−1}A^T + I_n. The fact that the linear system to be solved has the sparsity pattern of S + I can be particularly beneficial in cases where a fill-in reducing permutation is already known for the matrix under approximation C (and thus for S), since the same permutation can be used before the factorization of ρA(P + σI_{n̄nz})^{−1}A^T + I_n, resulting in reduced fill-in and thus significant speedup.

3.1 Convergence and Feasibility

We terminate Algorithm 1 when the primal and dual residuals of P2,

    r_prim := ‖A x_u − 1_n‖_∞,
    r_dual := ‖P(x_u − c_u) + A^T y + w‖_∞,

become smaller than some acceptable tolerance. Algorithm 1 is guaranteed to converge to the solution of P2 whenever P2, or equivalently P, is feasible.

We next establish conditions that characterize the feasibility of P:

Lemma 3.3. Problem P is feasible if and only if there exists a set of indices I = {(i_1, j_1), ..., (i_n, j_n)} corresponding to exactly one nonzero element from each row and column of C.

Proof. Regarding the "if" part, the n × n matrix

    X_{i,j} := 1 if (i,j) ∈ I, and 0 otherwise    (18)

is a feasible point for P. Regarding the "only if" part, since P has a feasible point X, according to (Perfect and Mirsky, 1965, Theorem 1) there exists a set of indices I = {(i_1, j_1), ..., (i_n, j_n)} corresponding to exactly one nonzero element from each row and column of X. Due to the sparsity constraint of P, the same argument also holds for C. □

Note that Lemma 3.3 and (Perfect and Mirsky, 1965, Theorem 1) imply that P is feasible whenever the matrix balancing problem is feasible, i.e. when the matrix C has total support (Knight and Ruiz, 2013).

3.2 Special cases and generalizations

The case where C is almost or fully dense: In the special case where C is fully dense we have S = 1_n 1_n^T, and Theorem 3.2 gives

    ρA(P + σI_{n̄nz})^{−1}A^T + I_n = αI_n + β 1_{n×n},

where

    α := σρ/((2 + σ/2)(2 + σ)) + nρ/(2 + σ) + 1    and    β := ρ/(2 + σ).

Using the Sherman–Morrison formula, we can explicitly calculate the inverse of the coefficient matrix of (16) as

    (ρA(P + σI_{n̄nz})^{−1}A^T + I_n)^{−1} = (1/α)(I_n − β 1_n 1_n^T / (α + βn)).    (19)

We can then solve (15) and perform ADMM on P2 without the need to perform an initial matrix factorization.
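For the fully dense case, applying (19) requires only vector operations. A minimal sketch in Julia (illustrative; the function name is ours):

    using LinearAlgebra

    # Apply (19): solve (ρA(P+σI)⁻¹Aᵀ + Iₙ) z = v when S = 1ₙ1ₙᵀ,
    # using the explicit Sherman-Morrison inverse (1/α)(I − β11ᵀ/(α+βn)).
    function solve_dense_reduced(v::AbstractVector, ρ, σ)
        n = length(v)
        α = σ*ρ / ((2 + σ/2)*(2 + σ)) + n*ρ / (2 + σ) + 1
        β = ρ / (2 + σ)
        return (v .- β * sum(v) / (α + β*n)) ./ α
    end

Combined with step 2 of Fact 3.1, line 3 of Algorithm 1 then reduces to a handful of matrix-vector products, with no factorization required.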

This approach can also be extended to cases where C has a relatively small number of zero elements. Indeed, by avoiding the elimination of the zero variables of P we get the following variant of P2:

    minimize    (1/2)‖p ∘ (x_u − c_u)‖²
    subject to  x_u ≥ 0
                x_{u,i} = 0 ∀ i with c_{u,i} = 0           (P3)
                A x_u = 1_n,

where p, A, x_u, and c_u are defined according to Section 2, but with H_u an (n+1)n/2 × n² matrix that extracts all the upper triangular elements of a vectorized n × n matrix. P3 can then be solved with Algorithm 1, with the following two changes. First, we replace Π_+ (line 4 of Algorithm 1) with Π_S, where S := {x ∈ R^{(n²+n)/2} | x ≥ 0 and x_i = 0 for all i such that c_{u,i} = 0}. Secondly, the linear system of line 3 of Algorithm 1 is solved trivially using Fact 3.1 and (19).

Solving variants of P with Algorithm 1: Algorithm 1 can be easily adjusted to the case where ‖·‖ in the objective of P is a weighted Frobenius norm, i.e. ‖X‖ = ‖W ∘ X‖_F where W is a given symmetric matrix. The only thing that has to change is the definition of p, and thus of P := diag(p ∘ p), to

    p := H_u vec( [2   √2  ···  √2
                        ⋱       ⋮
                            2  √2
                   0           2 ] ∘ W ).

Theorem 3.2 can then be used to solve the linear system of Algorithm 1 (line 3) efficiently. Similarly, we can allow for general constraints X1 = r and X^T 1 = r in P, where r is a given nonnegative vector, by simply changing Π_1 in line 5 of Algorithm 1 to Π_r(x) := r. Finally, non-square or asymmetric matrices C can be handled by applying our method to P for the symmetric matrix

    [0    C
     C^T  0].

4 NUMERICAL EXPERIMENTS

In this section we present numerical results of Algorithm 1 on a range of matrix normalization problems. We provide a Julia implementation of the Algorithm, along with code that generates all the results of this section, at:

github.com/oxfordcontrol/DoublyStochastic.jl

4.1 Normalizing Hi-C Contact Matrices of the Human Genome in 3D Space

We first present results on the application of our method to real-world contact matrices describing the 3D structure of the human genome, starting with a description of the nature of these matrices. The human genome has an end-to-end length on the order of meters when unfolded, but fits inside the cell nucleus, which has dimensions on the order of micrometers, implying that the genome is heavily folded in 3D space. The 3D structure of the genome can be examined by breaking the genome into a number of pieces and measuring how many contacts exist between each pair of pieces in 3D space (Rao et al., 2014). This produces Hi-C contact matrices, where the term Hi-C describes the particular experimental procedure used.

The process is, however, subject to errors and experimental constraints. To alleviate these issues, the contact matrix is normalized so that all its rows and columns sum to the same value. The matrix balancing approach is often the method of choice for this task (Rao et al., 2014, Supplemental Material II.b), but other methods have also been suggested in the literature (Yaffe and Tanay, 2011).

We will show that our approach can also be used for this task, even for contact matrices containing hundreds of millions of nonzero entries as in Rao et al. (2014). In particular, we consider normalization of the contact matrix of the 7th chromosome of the GM12878 cell³, thus replicating (Rao et al., 2014, Figure 1.C). Since the genome consists of sequences of the bases adenine, guanine, cytosine and thymine, it is typical to measure the length of each genome piece by the average number of bases it contains. The total range of the contact matrices is 0 to 160 mega-bases (Mb). We consider two discretization lengths, 1 kilobase (Kb) and 5Kb, which result in contact matrices of 151 and 82 million nonzeros respectively. Figure 1 provides detailed views of the contact matrices, spanning the range [137.2–137.8Mb] for the contact map at 5Kb resolution and [137.55–137.75Mb] for the one at 1Kb resolution. These regions were chosen to highlight interesting regions of the contact matrix as they appear in (Rao et al., 2014, Figure 1.C, rightmost column, two bottom subfigures).

Figure 1: Details of Hi-C contact matrices for the 7th chromosome of the GM12878 cell, corresponding to (Rao et al., 2014, Figure 1.C, rightmost column, two bottom subfigures). The top row shows the area [137.2–137.8Mb]² for the contact matrix of 5Kb resolution; the bottom row shows the area [137.55–137.75Mb]² for the contact matrix of 1Kb resolution. Areas representing zero contacts are depicted in black, while areas with a high number of contacts are shown in yellow. The total area of the contact matrices is [0–160Mb]², thus the areas depicted are zoomed 71 and 640 thousand times in the top and bottom figures respectively. The leftmost column shows the original contact matrices. Rao et al. (2014) normalize the n × n contact matrix C via the matrix balancing method of (Knight and Ruiz, 2013) so that its columns and rows sum to Σ_{i,j} C_{i,j}/n; the resulting normalized matrices are depicted in the rightmost column. The middle column depicts the results of Algorithm 1 for normalizing the matrices so that their columns and rows sum to Σ_{i,j} C_{i,j}/n. The Conjugate Gradient method is used to solve the linear system (16), since using a Cholesky factorization resulted in memory issues. A tolerance of 10⁻³ is used for the termination of our Algorithm. The normalization of the 5Kb and 1Kb contact matrices takes 701 and 3136 seconds respectively with a single-threaded implementation on an Intel Gold 5120 with 192GB memory.

Our approach produces considerably different normalized contact matrices than the matrix balancing approach. In particular, our approach results in contact matrices that have increased sparsity and higher contrast. This is unlike the matrix balancing approach, which results in a normalized matrix that has exactly the same nonzeros as the original matrix. Although further investigation is necessary to evaluate the suitability of the approach on Hi-C data, the results indicate that our method can be used to normalize very large Hi-C datasets, leading to promising visual results.

³The data corresponding to the thresholding criterion MAPQ ≥ 30 were used (Rao et al., 2014, Supplemental Material IIa.4), obtained from the GM12878 "combined" intrachromosomal tarball at ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63525

4.2 Spectral clustering problems

Figure 2: Comparison of the primal convergence (i.e. feasibility) of Algorithm 1 vs. the approach (21) of Zass and Shashua (2007) on the Spambase dataset, with (22) as the affinity criterion and σ = 100.

Next, we present results of running our Algorithm on correlation matrices arising from spectral clustering (Zass and Shashua, 2007). In spectral clustering one is given a set of points {x_i ∈ R^d} to be arranged into l clusters. To this end, an affinity matrix C is generated, with each entry C_{i,j} representing a measure of the pairwise similarity between points i and j. This matrix is then normalized and used by later stages of the clustering procedure. Zass and Shashua (2007) suggested that the normalization

    minimize    (1/2)‖X − C‖²
    subject to  X nonnegative                              (20)
                X1_n = 1_n,  X^T 1_n = 1_n

leads to superior clustering performance in various different test cases. Note that solving (20) is equivalent to solving P for C + ε_p 1_{n×n}. Note also that the formulation (20) of Zass and Shashua (2007) does not exploit sparsity in C. Nevertheless, Zass and Shashua (2007) suggested that the following iterative scheme can be used to solve (20):

    X̃^k = X^k + n^{−2}(1_n^T X^k 1_n + n) 1_{n×n} − n^{−1}(X^k 1_{n×n} + 1_{n×n} X^k)    (21a)
    X^{k+1} = Π_+(X̃^k).    (21b)

The first step in (21) minimizes the objective of P subject to the equality constraints, while the second projects the iterate onto the nonnegative orthant. However, this approach is not guaranteed to solve P to optimality. For example, in the simple case of

    C = (1/10) [1 9 9; 9 1 0; 9 0 9],

(21) converges to a suboptimal point X̄ with ‖X̄ − X*‖_∞ = 0.07, where X* is the optimizer. Nevertheless, it appears that, in general, (21) converges to a feasible point. However, even convergence to a feasible point can be much slower than with our approach, as demonstrated in Figure 2. A minimal sketch of (21) on this example follows.
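The sketch below (in Julia; the function name is ours) implements (21) and reproduces the suboptimal behavior on the example above:

    using LinearAlgebra

    # Alternating scheme (21) of Zass and Shashua (2007)
    function zass_shashua(C::Matrix; iters = 10_000)
        n = size(C, 1)
        X = copy(C)
        J = ones(n, n)
        for _ in 1:iters
            # (21a): project onto the affine constraints X1 = 1, X'1 = 1
            X = X .+ (sum(X) + n) / n^2 .- (X*J + J*X) ./ n
            # (21b): project onto the nonnegative orthant
            X = max.(X, 0)
        end
        return X
    end

    C = [1 9 9; 9 1 0; 9 0 9] / 10
    Xbar = zass_shashua(C)   # feasible, but ≈ 0.07 away (in ∞-norm) from the optimizer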

Besides guaranteed convergence to the optimizer of P, our approach can also handle sparsity in the affinity matrices. To demonstrate how exploiting sparsity can lead to significant speedups, we consider the Spambase dataset⁴ (considered in Zass and Shashua (2007)) with an RBF kernel as the affinity criterion

    C_{i,j}(σ) = e^{−‖x_i − x_j‖²/σ²},    (22)

where σ is a parameter that is typically tuned to achieve the best clustering performance. Note that, due to the exponential form of C(σ), some values will be very small. Therefore, we truncate to zero all entries with value less than 10⁻⁷.

The runtimes of applying our Algorithm to this dataset are listed in Table 1 and compared to the timings achieved with Gurobi (note that we use P1 for Gurobi, as we consider P1 to be a "standard" reformulation of P). We observe that exploiting sparsity can lead to significant speedups as compared to treating the affinity matrix as fully dense, even if we follow the approach of §3.2. At the same time, the optimizers of P for the affinity matrices C(σ) considered in Table 1 appear to coincide with those for the fully dense C(σ) + ε_p 1_{n×n}.

Table 1: Normalizing correlation matrices with Algorithm 1 for spectral clustering on Spambase. A tolerance of 10⁻⁴ is used for the termination of our Algorithm. Timings are expressed in seconds and compared against Gurobi with its default settings (on P1) and against solving the fully dense C(σ) + ε_p 1_{n×n} with the approach of §3.2, where C(σ) is the original affinity matrix. Hardware used: Intel i7-5557U CPU @ 3.10GHz, 8GB memory.

    σ                1.0         5.0         10.0        20.0
    nnz              3.9 × 10⁴   1.8 × 10⁶   4.0 × 10⁶   7.3 × 10⁶
    t_admm           1.1 × 10⁻¹  5.7         1.5 × 10¹   2.6 × 10¹
    t_gurobi         3.9 × 10⁻¹  4.6 × 10¹   1.1 × 10²   2.0 × 10²
    t_admm (dense)   7.7 × 10²   6.9 × 10²   6.4 × 10²   7.2 × 10²

⁴archive.ics.uci.edu/ml/datasets/spambase
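For reference, the truncated affinity matrix (22) can be assembled directly in sparse form. A minimal sketch in Julia (illustrative; names are ours, and a practical implementation would avoid the dense double loop over all pairs):

    using LinearAlgebra, SparseArrays

    # Build the RBF affinity (22), dropping entries below the 1e-7
    # cutoff so that the resulting matrix is sparse.
    function rbf_affinity(points::AbstractVector, σ; cutoff = 1e-7)
        n = length(points)
        rows, cols, vals = Int[], Int[], Float64[]
        for i in 1:n, j in 1:n
            v = exp(-norm(points[i] - points[j])^2 / σ^2)
            if v ≥ cutoff
                push!(rows, i); push!(cols, j); push!(vals, v)
            end
        end
        return sparse(rows, cols, vals, n, n)
    end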

4.3 Matrices from the SuiteSparse Collection

Finally, we consider all Undirected Weighted Graph matrices with less than 50 million nonzeros contained in the SuiteSparse collection⁵; 69 matrices meet these criteria. We use Algorithm 1 to normalize every matrix C so that all of its columns and rows sum to max_{i,j}(C_{i,j}). All of the problems, except two, have nonnegative entries. For these two exceptions, we change the negative entries to their absolute values, as we are unaware of practical cases where C is expected to have negative entries.

A detailed comparison of our results with Gurobi is presented in the Supplementary Material. Figure 3 shows the timings achieved by our method, and the speedups relative to Gurobi (on P1), as a function of each problem's nonzeros.

Figure 3: Results of Algorithm 1 for matrices from the SuiteSparse collection. Dots represent the timing of our Algorithm (with 10⁻⁴ tolerance), while squares represent the speedup achieved over Gurobi with its default settings (on P1). Hardware used: a single thread running on an Intel Gold 5120 with 192GB of memory.

⁵Available at sparse.tamu.edu

5 CONCLUSIONS

In this paper we have shown that approximating doubly stochastic matrices, under a Frobenius distance metric, can be performed efficiently, even for very large, sparse matrices. We believe that our approach will complement very popular existing methods, such as the matrix balancing or Sinkhorn's algorithm, that solve the same problem under a different "distance" metric, thus giving practitioners more freedom in choosing the most suitable objective for this widely used approximation problem.

Acknowledgments  This work was supported by the EPSRC AIMS CDT grant EP/L015987/1 and Schlumberger. We thank Giovanni Fantuzzi for valuable discussions.

References

J. D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300. Curran Associates, Inc., 2013.

W. E. Deming and F. F. Stephan. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. The Annals of Mathematical Statistics, 11(4):427–444, 1940.

P. Diggle, P. Heagerty, K. Y. Liang, and S. L. Zeger. Analysis of Longitudinal Data. Oxford Statistical Science Series 25. 2nd edition, 2002.

J. Eckstein and D. P. Bertsekas. On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1):293–318, 1992.

M. Garstka, M. Cannon, and P. J. Goulart. COSMO: A conic operator splitting method for convex conic problems. arXiv 1901.10887, 2019.

G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 4th edition, 2013.

Gurobi Optimization LLC. Gurobi Optimizer Reference Manual, 2018. URL http://www.gurobi.com.

N. J. Higham. Matrix nearness problems and applications. Applications of Matrix Theory, 1989.

M. Idel. A review of matrix scaling and Sinkhorn's normal form for matrices and positive maps. arXiv 1609.06349, 2016.

P. Knight. The Sinkhorn–Knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, 2008.

P. A. Knight and D. Ruiz. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis, 33(3):1029–1047, 2013.

P. Milanfar. A tour of modern image filtering: new insights and methods, both practical and theoretical. IEEE Signal Processing Magazine, 30(1):106–128, 2013.

B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, 2016.

H. Perfect and L. Mirsky. The distribution of positive elements in doubly-stochastic matrices. Journal of the London Mathematical Society, s1-40(1):689–698, 1965.

S. Rao, M. Huntley, N. Durand, E. Stamenova, I. Bochkov, J. Robinson, A. Sanborn, I. Machol, A. Omer, E. Lander, and E. Aiden. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159(7):1665–1680, 2014.

B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. OSQP: An operator splitting solver for quadratic programs. arXiv e-prints, 2017.

E. Yaffe and A. Tanay. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nature Genetics, 43(11):1059, 2011.

R. Zass and A. Shashua. Doubly stochastic normalization for spectral clustering. In Advances in Neural Information Processing Systems 19, pages 1569–1576. MIT Press, 2007.