arXiv:1511.07694v1 [math.NA] 24 Nov 2015

Geometrical inverse preconditioning for symmetric positive definite matrices

Jean-Paul Chehab∗    Marcos Raydan†

November 18, 2015

∗ LAMFA, UMR CNRS 7352, Université de Picardie Jules Verne, 33 rue Saint Leu, 80039 Amiens, France ([email protected]).
† Departamento de Cómputo Científico y Estadística, Universidad Simón Bolívar, Ap. 89000, Caracas 1080-A, Venezuela. Supported by CNRS (Unit FR3399 ARC Mathématiques, Université de Picardie Jules Verne, Amiens, France).

Abstract

We focus on inverse preconditioners based on minimizing $F(X) = 1 - \cos(XA, I)$, where $XA$ is the preconditioned matrix and $A$ is symmetric and positive definite. We present and analyze gradient-type methods to minimize $F(X)$ on a suitable compact set. For that we use the geometrical properties of the non-polyhedral cone of symmetric and positive definite matrices, and also the special properties of $F(X)$ on the feasible set. Preliminary and encouraging numerical results on dense and sparse approximations are included.

Key words: Preconditioning, cones of matrices, gradient method, minimal residual method.

1 Introduction

Algebraic inverse preconditioning plays a key role in a wide variety of applications that involve the solution of large and sparse linear systems of equations; see, e.g., [4, 6, 7, 9, 14, 22, 23]. For a given square matrix $A$, there exist several proposals for constructing sparse inverse approximations which are based on optimization techniques, mainly based on minimizing the Frobenius norm of the residual $(I - XA)$ over a set of matrices with a certain sparsity pattern; see, e.g., [2, 7, 8, 9, 10, 12, 13, 18, 19]. However, we must remark that when $A$ is symmetric and positive definite, minimizing the Frobenius norm of the residual will in general produce an inverse preconditioner which is neither symmetric nor positive definite; see, e.g., [2].

There is currently a growing interest, and understanding, in the rich geometrical structure of the non-polyhedral cone of symmetric and positive semidefinite matrices ($PSD$); see, e.g., [1, 5, 12, 16, 17, 19, 24]. In this work, we focus on minimizing the positive-scaling-invariant function $F(X) = 1 - \cos(XA, I)$, instead of minimizing the Frobenius norm of the residual. Our approach takes advantage of the geometrical properties of the $PSD$ cone, and also of the special properties of $F(X)$ on a suitable compact set, to introduce specialized gradient-type methods for which we analyze their convergence properties.

The rest of the document is organized as follows. In Section 2, we develop and analyze two different gradient-type iterative schemes for finding inverse approximations based on minimizing $F(X)$, including sparse versions.
In Section 3, we present numerical results on some well-known test matrices to illustrate the behavior and properties of the introduced gradient-type methods. Finally, in Section 4 we present some concluding remarks.

2 Gradient-type iterative methods

Let us recall that the cosine between two $n \times n$ real matrices $A$ and $B$ is defined as
$$\cos(A,B) = \frac{\langle A, B \rangle}{\|A\|_F \, \|B\|_F}, \qquad (1)$$
where $\langle A, B \rangle = \operatorname{trace}(B^T A)$ is the Frobenius inner product in the space of matrices and $\|\cdot\|_F$ is the associated Frobenius norm. By the Cauchy-Schwarz inequality it follows that $|\cos(A,B)| \le 1$, and the equality is attained if and only if $A = \gamma B$ for some nonzero real number $\gamma$.

To compute the inverse of a given symmetric and positive definite matrix $A$ we consider the function
$$F(X) = 1 - \cos(XA, I) \ge 0, \qquad (2)$$
for which the minimum value zero is reached at $X = \xi A^{-1}$, for any positive real number $\xi$.

Let us recall that any positive semidefinite matrix $B$ has nonnegative diagonal entries and so $\operatorname{trace}(B) \ge 0$. Hence, if $XA$ is symmetric, we need to impose that $\langle XA, I \rangle = \operatorname{trace}(XA) \ge 0$ as a necessary condition for $XA$ to be in the $PSD$ cone; see [5, 17, 24]. Therefore, in order to impose uniqueness in the $PSD$ cone, we consider the constrained minimization problem
$$\min_{X \in S \cap T} F(X), \qquad (3)$$
where $S = \{ X \in \mathbb{R}^{n \times n} \mid \|XA\|_F = \sqrt{n} \}$ and $T = \{ X \in \mathbb{R}^{n \times n} \mid \operatorname{trace}(XA) \ge 0 \}$. Notice that $S \cap T$ is a closed and bounded set, and so problem (3) is well-posed.

Remark 2.1 For any $\beta > 0$, $F(\beta X) = F(X)$, and so the function $F$ is invariant under positive scaling.
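For concreteness, the quantities in (1) and (2) can be evaluated with a few Frobenius inner products. The short MATLAB sketch below is our own illustration, not code from the paper; the names cosmat and Fcos are ours.

```matlab
% Frobenius inner product, matrix cosine (1), and merit function (2).
% Illustrative sketch only; "cosmat" and "Fcos" are names introduced here.
cosmat = @(A,B) trace(B'*A) / (norm(A,'fro') * norm(B,'fro'));
Fcos   = @(X,A) 1 - cosmat(X*A, eye(size(A)));

% Small usage example: F vanishes at any positive multiple of inv(A).
n = 5;
A = gallery('minij', n);          % symmetric positive definite test matrix
X = 3 * inv(A);                   % xi = 3
fprintf('F(3*inv(A)) = %g\n', Fcos(X, A));   % should be (numerically) zero
```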

The derivative of $F(X)$, denoted by $\nabla F(X)$, plays an important role in our work.

Lemma 2.1
$$\nabla F(X) = \frac{1}{\|I\|_F \, \|XA\|_F} \left( \frac{\langle XA, I \rangle}{\|XA\|_F^2}\, XA - I \right) A.$$

Proof. For fixed matrices $X$ and $Y$, we consider the function $\varphi(t) = F(X + tY)$. It is well-known that $\varphi'(0) = \langle \nabla F(X), Y \rangle$. We have
$$F(X + tY) = 1 - \frac{\langle XA, I \rangle + t\,\langle YA, I \rangle}{\|I\|_F \, \|XA\|_F \sqrt{1 + 2t\,\dfrac{\langle XA, YA \rangle}{\|XA\|_F^2} + t^2\,\dfrac{\|YA\|_F^2}{\|XA\|_F^2}}},$$
and we obtain after differentiating $\varphi(t)$ and some algebraic manipulations

$$\varphi'(0) = \left\langle \frac{1}{\|I\|_F \, \|XA\|_F} \left( \frac{\langle XA, I \rangle}{\|XA\|_F^2}\, XA - I \right) A,\; Y \right\rangle,$$
and the result is established.

Theorem 2.1 Problem (3) possesses the unique solution $X = A^{-1}$.

Proof. Notice that $\nabla F(X) = 0$ for $X \in S$, if and only if $X = \beta A^{-1}$ for $\beta = \pm 1$. Now, $F(-A^{-1}) = 2$ and so $X = -A^{-1}$ is the global maximizer of the function $F$ on $S$, but $-A^{-1} \notin T$; whereas $F(A^{-1}) = 0$ and $A^{-1} \in T$. Therefore, $X = A^{-1}$ is the unique feasible solution of (3).

Before discussing different numerical schemes for solving problem (3), we need a couple of technical lemmas.

Lemma 2.2 If $X \in S$ and $XA = AX$, then $\langle \nabla F(X), X \rangle = 0$.

Proof. Since $X \in S$ then $\|XA\|_F^2 = n$, and we have
$$\nabla F(X) = \frac{1}{n} \left( \frac{\langle XA, I \rangle}{n}\, XA - I \right) A,$$
hence
$$\langle \nabla F(X), X \rangle = \frac{1}{n}\,\frac{\langle XA, I \rangle}{n}\, \langle XAA, X \rangle - \frac{1}{n}\, \langle A, X \rangle.$$
But $\langle XAA, X \rangle = \|XA\|_F^2 = n$, so
$$\langle \nabla F(X), X \rangle = \frac{\langle XA, I \rangle}{n} - \frac{1}{n}\, \langle A, X \rangle = 0,$$
since $\langle A, X \rangle = \langle AX, I \rangle = \langle XA, I \rangle$.

Lemma 2.3 If $X \in S$, then
$$\frac{\sqrt{n}}{\|A\|_F} \le \|X\|_F \le \sqrt{n}\, \|A^{-1}\|_F.$$

Proof. For every $X$ we have $X = (XA)A^{-1}$, and so $\|X\|_F = \|(XA)A^{-1}\|_F \le \|XA\|_F\, \|A^{-1}\|_F = \sqrt{n}\, \|A^{-1}\|_F$. On the other hand, since $X \in S$, $\sqrt{n} = \|XA\|_F \le \|X\|_F\, \|A\|_F$, and hence
$$\|X\|_F \ge \frac{\sqrt{n}}{\|A\|_F},$$
and the result is established.
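The gradient formula in Lemma 2.1 is easy to verify numerically. The sketch below is ours, not part of the paper (gradF and Ffun are names we introduce); it compares the directional derivative $\langle \nabla F(X), Y \rangle$ with a centered finite difference of $\varphi(t) = F(X + tY)$ for an arbitrary direction $Y$:

```matlab
% Finite-difference check of the gradient formula in Lemma 2.1 (illustrative).
n = 6;
A = gallery('lehmer', n);                          % symmetric positive definite
gradF = @(X) (1/(sqrt(n)*norm(X*A,'fro'))) * ...
        ((trace(X*A)/norm(X*A,'fro')^2)*(X*A) - eye(n)) * A;
Ffun  = @(X) 1 - trace(X*A) / (sqrt(n)*norm(X*A,'fro'));

X = (sqrt(n)/norm(A,'fro'))*eye(n);                % a point in S
Y = randn(n); Y = (Y + Y')/2;                      % arbitrary symmetric direction
t = 1e-6;
fd = (Ffun(X + t*Y) - Ffun(X - t*Y)) / (2*t);      % centered difference of phi(t)
fprintf('directional derivative: %g   finite difference: %g\n', ...
        trace(Y'*gradF(X)), fd);
```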

2.1 The negative gradient direction

For the numerical solution of (3), we start by considering the classical gradient iterations that, from an initial guess $X^{(0)}$, are given by

$$X^{(k+1)} = X^{(k)} - \alpha_k \nabla F(X^{(k)}),$$
where $\alpha_k > 0$ is a suitable step length. A standard approach is to use the optimal choice, i.e., the positive step length that (exactly) minimizes the function $F(X)$ along the negative gradient direction. We present a closed formula for the optimal choice of step length in a more general setting, assuming that the iterative method is given by:

$$X^{(k+1)} = X^{(k)} + \alpha_k D_k,$$
where $D_k$ is a search direction in the space of matrices.

Lemma 2.4 The optimal step length $\alpha_k$, that optimizes $F(X^{(k)} + \alpha D_k)$, is given by

$$\alpha_k = \frac{\langle X^{(k)}A, I \rangle \langle X^{(k)}A, D_kA \rangle - n\,\langle D_kA, I \rangle}{\langle D_kA, I \rangle \langle X^{(k)}A, D_kA \rangle - \langle X^{(k)}A, I \rangle \langle D_kA, D_kA \rangle}.$$

Proof. Consider the auxiliary function in one variable

$$\psi(\alpha) = F(X^{(k)} + \alpha D_k) = 1 - \frac{\langle X^{(k)}A, I \rangle + \alpha\,\langle D_kA, I \rangle}{\sqrt{n}\, \|X^{(k)}A + \alpha D_kA\|_F}.$$
Differentiating $\psi(\alpha)$, using that $\langle X^{(k)}A, X^{(k)}A \rangle = n$, and also that
$$\frac{\partial}{\partial \alpha} \|X^{(k)}A + \alpha D_kA\|_F = \frac{\langle X^{(k)}A, D_kA \rangle + \alpha\,\langle D_kA, D_kA \rangle}{\|X^{(k)}A + \alpha D_kA\|_F},$$
and then forcing $\psi'(\alpha) = 0$, the result is obtained after some algebraic manipulations.

Remark 2.2 For our first approach, $D_k = -\nabla F(X^{(k)})$, and so for the optimal gradient method (also known as Cauchy method or steepest descent method) the step length is given by
$$\alpha_k = \frac{n\,\langle \nabla F(X^{(k)})A, I \rangle - \langle X^{(k)}A, I \rangle \langle X^{(k)}A, \nabla F(X^{(k)})A \rangle}{\langle \nabla F(X^{(k)})A, I \rangle \langle X^{(k)}A, \nabla F(X^{(k)})A \rangle - \langle X^{(k)}A, I \rangle \|\nabla F(X^{(k)})A\|_F^2}. \qquad (4)$$
Notice that if we use instead $D_k = \nabla F(X^{(k)})$, the obtained $\alpha_k$ which also forces $\psi'(\alpha_k) = 0$ is given by (4) but with a negative sign. Therefore, to guarantee that $\alpha_k > 0$ minimizes $F$ along the negative gradient direction to approximate $A^{-1}$, instead of maximizing $F$ along the gradient direction to approximate $-A^{-1}$, we will choose the step length $\alpha_k$ as the absolute value of the expression in (4).

Since $\|I\|_F = \sqrt{n}$, the gradient iterations can be written as
$$X^{(k+1)} = X^{(k)} - \frac{\alpha_k}{\sqrt{n}\,\|X^{(k)}A\|_F} \left( \frac{\langle X^{(k)}A, I \rangle}{\|X^{(k)}A\|_F^2}\, X^{(k)}A - I \right) A,$$
which can be further simplified by imposing the condition for uniqueness $\|X^{(k)}A\|_F = \sqrt{n}$. In that case we set
$$Z^{(k+1)} = X^{(k)} - \frac{\alpha_k}{n} \left( \frac{\langle X^{(k)}A, I \rangle}{n}\, X^{(k)}A - I \right) A, \qquad (5)$$
and then we multiply the matrix $Z^{(k+1)}$ by the factor $\sqrt{n}/\|Z^{(k+1)}A\|_F$ to guarantee that $X^{(k+1)} \in S$, i.e., such that $\|X^{(k+1)}A\|_F = \sqrt{n}$.

Concerning the condition that the sequence $\{X^{(k)}\}$ remains in $T$, in our next result we establish that if the step length $\alpha_k$ remains uniformly bounded from above, then $\operatorname{trace}(X^{(k)}A) > 0$ for all $k$.

Lemma 2.5 Assume that $\operatorname{trace}(X^{(0)}A) > 0$ and that $0 < \alpha_k \le \dfrac{n^{3/2}}{\|A\|_F^2}$. Then $\operatorname{trace}(X^{(k)}A) = \langle X^{(k)}A, I \rangle > 0$ for all $k$.

Proof. We proceed by induction. Let us assume that $w_k := \operatorname{trace}(X^{(k)}A) = \langle X^{(k)}A, I \rangle > 0$. It follows that
$$Z^{(k+1)}A = X^{(k)}A - \frac{\alpha_k}{n} \left( \frac{w_k}{n}\, X^{(k)}A - I \right) A^2,$$
and so
$$\operatorname{trace}(Z^{(k+1)}A) = w_k - \frac{\alpha_k}{n} \left( \frac{w_k}{n}\, \operatorname{trace}(X^{(k)}A^3) - \operatorname{trace}(A^2) \right).$$
Now, since $\operatorname{trace}(A^2) = \langle A, A \rangle = \|A\|_F^2$ and
$$\operatorname{trace}(X^{(k)}A^3) \le \|X^{(k)}A\|_F\, \|A\|_F^2 = \sqrt{n}\, \|A\|_F^2,$$
we obtain that
$$\operatorname{trace}(Z^{(k+1)}A) \ge \left(1 - \frac{\alpha_k}{n^2}\,\sqrt{n}\, \|A\|_F^2 \right) w_k + \frac{\alpha_k}{n}\, \|A\|_F^2.$$
Since $0 < \alpha_k \le \dfrac{n^{3/2}}{\|A\|_F^2}$, then $\left(1 - \dfrac{\alpha_k}{n^2}\,\sqrt{n}\, \|A\|_F^2\right) \ge 0$, and, since $\dfrac{\alpha_k}{n}\, \|A\|_F^2 > 0$, we conclude that $\operatorname{trace}(Z^{(k+1)}A) > 0$.

Since $X^{(k+1)}$ is obtained as a positive scaling factor of $Z^{(k+1)}$, then $w_{k+1} > 0$ and the result is established.

Now, for some given matrices $A$, we cannot guarantee that the step length computed as the absolute value of (4) will satisfy $\alpha_k \le n^{3/2}/\|A\|_F^2$ for all $k$. Therefore, if $\operatorname{trace}(X^{(k+1)}A) = \langle X^{(k+1)}A, I \rangle < 0$ then we will set in our algorithm $X^{(k+1)} = -X^{(k+1)}$ to guarantee that $\operatorname{trace}(X^{(k+1)}A) \ge 0$, and hence that the cosine between $X^{(k+1)}A$ and $I$ is nonnegative, which is a necessary condition to guarantee that $X^{(k+1)}$ remains in the $PSD$ cone; see, e.g., [5, 17, 24]. We now present our steepest descent gradient algorithm, which will be referred to as the CauchyCos Algorithm.

Algorithm 1: CauchyCos (Steepest descent approach on $F(X) = 1 - \cos(XA, I)$)

1: Given $X_0 \in PSD$
2: for $k = 0, 1, \dots$, until a stopping criterion is satisfied, do
3:   Set $w_k = \langle X^{(k)}A, I \rangle$
4:   Set $\nabla F(X^{(k)}) = \dfrac{1}{n} \left( \dfrac{w_k}{n}\, X^{(k)}A - I \right) A$
5:   Set $\alpha_k = \dfrac{n\,\langle \nabla F(X^{(k)})A, I \rangle - w_k\,\langle X^{(k)}A, \nabla F(X^{(k)})A \rangle}{\langle \nabla F(X^{(k)})A, I \rangle \langle X^{(k)}A, \nabla F(X^{(k)})A \rangle - w_k\, \|\nabla F(X^{(k)})A\|_F^2}$
6:   Set $Z^{(k+1)} = X^{(k)} - \alpha_k \nabla F(X^{(k)})$
7:   Set $X^{(k+1)} = s\sqrt{n}\, \dfrac{Z^{(k+1)}}{\|Z^{(k+1)}A\|_F}$, where $s = 1$ if $\operatorname{trace}(Z^{(k+1)}A) > 0$, $s = -1$ otherwise
8: end for

We note that if we start from $X^{(0)}$ such that $\|X^{(0)}A\|_F = \sqrt{n}$ then by construction $\|X^{(k)}A\|_F = \sqrt{n}$, for all $k \ge 0$; for example, $X^{(0)} = (\sqrt{n}/\|A\|_F)\, I$ is a convenient choice. For that initial guess, $\operatorname{trace}(X^{(0)}A) = \langle X^{(0)}A, I \rangle > 0$ and again by construction all the iterates will remain in the $PSD$ cone. Notice also that, at each iteration, we need to compute the three matrix-matrix products: $X^{(k)}A$, $\left(\frac{w_k}{n}\, X^{(k)}A - I\right)A$, and $\nabla F(X^{(k)})A$, which for dense matrices require $n^3$ floating point operations (flops) each. Each of the remaining calculations (inner products and Frobenius norms) is obtained with $n$ column-oriented inner products that require $n$ flops each. Summing up, in the dense case, the computational cost of each iteration of the CauchyCos Algorithm is $3n^3 + O(n^2)$ flops. In Section 2.5, we will discuss a sparse version of the CauchyCos Algorithm and its computational cost.
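A minimal dense MATLAB sketch of Algorithm 1 is given below. It is our own illustration, not the authors' code (the function name cauchycos and the tolerance-based stopping test are ours): step 5 evaluates formula (4) and, as discussed in Remark 2.2, its absolute value is used as the step length.

```matlab
function X = cauchycos(A, tol, maxit)
% CauchyCos (Algorithm 1): steepest descent on F(X) = 1 - cos(XA,I).
% Illustrative dense implementation; not the authors' original code.
n  = size(A,1);
X  = (sqrt(n)/norm(A,'fro'))*eye(n);            % X^{(0)} in S, trace(X^{(0)}A) > 0
ip = @(U,V) sum(sum(U.*V));                     % Frobenius inner product <U,V>
for k = 1:maxit
    XA = X*A;
    wk = trace(XA);
    if 1 - wk/(sqrt(n)*norm(XA,'fro')) <= tol, break; end
    G  = (1/n)*((wk/n)*XA - eye(n))*A;          % gradient of F at X^{(k)}
    GA = G*A;
    alpha = (n*trace(GA) - wk*ip(XA,GA)) / ...
            (trace(GA)*ip(XA,GA) - wk*norm(GA,'fro')^2);
    alpha = abs(alpha);                         % Remark 2.2: keep a positive step
    Z  = X - alpha*G;                           % step 6
    ZA = Z*A;
    s  = sign(trace(ZA)); if s == 0, s = 1; end
    X  = s*sqrt(n)*Z/norm(ZA,'fro');            % step 7: scale back onto S (and T)
end
end
```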

2.2 Convergence properties of the CauchyCos Algorithm

We start by establishing the commutativity of all iterates with the matrix $A$.

Lemma 2.6 If $X^{(0)}A = AX^{(0)}$, then $X^{(k)}A = AX^{(k)}$, for all $k \ge 0$ in the CauchyCos Algorithm.

Proof. We proceed by induction. Assume that $X^{(k)}A = AX^{(k)}$. It follows that

$$\begin{aligned}
AZ^{(k+1)} &= AX^{(k)} - \frac{\alpha_k}{n} \left( \frac{\langle X^{(k)}A, I \rangle}{n}\, AX^{(k)}A - A \right) A \\
&= X^{(k)}A - \frac{\alpha_k}{n} \left( \frac{\langle X^{(k)}A, I \rangle}{n}\, AX^{(k)} - I \right) AA \\
&= \left( X^{(k)} - \frac{\alpha_k}{n} \left( \frac{\langle X^{(k)}A, I \rangle}{n}\, X^{(k)}A - I \right) A \right) A \\
&= Z^{(k+1)}A,
\end{aligned}$$
and since $Z^{(k+1)}$ and $X^{(k+1)}$ differ only by a scaling factor, then $AX^{(k+1)} = X^{(k+1)}A$. Hence, since $X^{(0)}A = AX^{(0)}$, the result holds for all $k$.

It is worth noticing that using Lemma 2.6 and (5), it follows by simple calculations that

$Z^{(k)}$ as well as $X^{(k)}$ are symmetric matrices for all $k$. In turn, if $X^{(0)}A = AX^{(0)}$, this clearly implies, using Lemma 2.6, that $X^{(k)}A$ is also a symmetric matrix for all $k$.

Our next result establishes that the sequences generated by the CauchyCos Algorithm are uniformly bounded away from zero, and hence the algorithm is well-defined.

Lemma 2.7 If $X^{(0)}A = AX^{(0)}$, then the sequences $\{X^{(k)}\}$, $\{Z^{(k)}\}$, and $\{Z^{(k)}A\}$ generated by the CauchyCos Algorithm are uniformly bounded away from zero.

Proof. Using Lemmas 2.2 and 2.6 we have that

$$\langle Z^{(k+1)}, X^{(k)} \rangle = \|X^{(k)}\|_F^2 - \alpha_k \langle \nabla F(X^{(k)}), X^{(k)} \rangle = \|X^{(k)}\|_F^2,$$
which combined with the Cauchy-Schwarz inequality and Lemma 2.3 implies that
$$\|Z^{(k+1)}\|_F \ge \|X^{(k)}\|_F \ge \frac{\sqrt{n}}{\|A\|_F} > 0,$$
for all $k$. Moreover, since $A$ is nonsingular then

$$\|Z^{(k+1)}A\|_F \ge \frac{\|Z^{(k+1)}\|_F}{\|A^{-1}\|_F} \ge \frac{\sqrt{n}}{\|A\|_F \, \|A^{-1}\|_F} > 0$$
is bounded away from zero for all $k$.

Theorem 2.2 The sequence $\{X^{(k)}\}$ generated by the CauchyCos Algorithm converges to $A^{-1}$.

Proof. The sequence $\{X^{(k)}\} \subset S \cap T$, which is a closed and bounded set, therefore there exist limit points in $S \cap T$. Let $\widehat{X}$ be a limit point of $\{X^{(k)}\}$, and let $\{X^{(k_j)}\}$ be a subsequence that converges to $\widehat{X}$. Let us suppose, by way of contradiction, that $\nabla F(\widehat{X}) \ne 0$.

In that case, the negative gradient, $-\nabla F(\widehat{X}) \ne 0$, is a descent direction for the function $F$ at $\widehat{X}$. Hence, there exists $\hat{\alpha} > 0$ such that
$$\delta = F(\widehat{X}) - F(\widehat{X} - \hat{\alpha}\,\nabla F(\widehat{X})) > 0.$$
Consider now an auxiliary function $\theta : \mathbb{R}^{n \times n} \to \mathbb{R}$ given by
$$\theta(X) = F(X) - F(X - \hat{\alpha}\,\nabla F(X)).$$
Clearly, $\theta$ is a continuous function, and then $\theta(X^{(k_j)})$ converges to $\theta(\widehat{X}) = \delta$. Therefore, for all $k_j$ sufficiently large,
$$F(X^{(k_j)}) - F(X^{(k_j)} - \hat{\alpha}\,\nabla F(X^{(k_j)})) = \theta(X^{(k_j)}) \ge \delta/2.$$

Now, since $\alpha_{k_j}$ was obtained using Lemma 2.4 as the exact optimal step length along the negative gradient direction, then using Remark 2.1 it follows that

$$F(X^{(k_j+1)}) = F(Z^{(k_j+1)}) = F(X^{(k_j)} - \alpha_{k_j}\nabla F(X^{(k_j)})) \le F(X^{(k_j)} - \hat{\alpha}\,\nabla F(X^{(k_j)})) \le F(X^{(k_j)}) - \frac{\delta}{2},$$

and thus,
$$F(X^{(k_j)}) - F(X^{(k_j+1)}) \ge \frac{\delta}{2}, \qquad (6)$$
for all $k_j$ sufficiently large. On the other hand, since $F$ is continuous, $F(X^{(k_j)})$ converges to $F(\widehat{X})$. However, the whole sequence $\{F(X^{(k)})\}$ generated by the CauchyCos Algorithm is decreasing, and so $F(X^{(k)})$ converges to $F(\widehat{X})$, and since $F$ is bounded below then for $k_j$ large enough

$$F(X^{(k_j)}) - F(X^{(k_j+1)}) \to 0,$$
which contradicts (6). Consequently, $\nabla F(\widehat{X}) = 0$.

Now, using Lemma 2.1, it follows that $\nabla F(\widehat{X}) = 0$ implies $\widehat{X} = A^{-1}$. Hence, the subsequence $\{X^{(k_j)}\}$ converges to $A^{-1}$. Nevertheless, as we argued before, the whole sequence $F(X^{(k)})$ converges to $F(A^{-1}) = 0$, and by continuity the whole sequence $\{X^{(k)}\}$ converges to $A^{-1}$.

Remark 2.3 The optimal choice of step length $\alpha_k$, as it usually happens when combined with the negative gradient direction (see e.g., [3, 21]), produces an orthogonality between consecutive gradient directions, that in our setting becomes $\langle \nabla F(Z^{(k+1)}), \nabla F(X^{(k)}) \rangle = 0$. Indeed, $\alpha_k$ minimizes $\psi(\alpha) = F(X^{(k)} - \alpha \nabla F(X^{(k)}))$, which means that
$$0 = \psi'(\alpha_k) = -\langle \nabla F(X^{(k)} - \alpha_k \nabla F(X^{(k)})), \nabla F(X^{(k)}) \rangle = -\langle \nabla F(Z^{(k+1)}), \nabla F(X^{(k)}) \rangle.$$
This orthogonality is responsible for the well-known zig-zagging behavior of the optimal gradient method, which in some cases induces a very slow convergence.

2.3 A simplified search direction

To avoid the zig-zagging trajectory of the optimal gradient iterates, we now consider a different search direction:

$$\widehat{D}_k \equiv \widehat{D}(X^{(k)}) = -\frac{1}{n} \left( \frac{\langle X^{(k)}A, I \rangle}{n}\, X^{(k)}A - I \right), \qquad (7)$$
to move from $X^{(k)} \in S \cap T$ to the next iterate. Notice that $\widehat{D}_k A = -\nabla F(X^{(k)})$ and so $\widehat{D}_k$ can be viewed as a simplified version of the search direction used in the classical steepest descent method. Notice also that $\widehat{D}_k$ resembles the residual direction $(X^{(k)}A - I)$ used in the minimal residual iterative method (MinRes) for minimizing $\|I - XA\|_F$ in the least-squares sense; see e.g., [9, 22]. Nevertheless, the scaling factors in (7) differ from the scaling factors in the classical residual direction at $X^{(k)}$.

For solving (3), we now present a variation of the CauchyCos Algorithm, which will be referred to as the MinCos Algorithm, which from a given initial guess $X_0$ produces a sequence of iterates using the search direction $\widehat{D}_k$, while remaining in the compact set $S \cap T$. This new algorithm consists of simply replacing $-\nabla F(X^{(k)})$ in the CauchyCos Algorithm by $\widehat{D}_k$.

Algorithm 2: MinCos (simplified gradient approach on $F(X) = 1 - \cos(XA, I)$)

1: Given $X_0 \in PSD$
2: for $k = 0, 1, \dots$, until a stopping criterion is satisfied, do
3:   Set $w_k = \langle X^{(k)}A, I \rangle$
4:   Set $\widehat{D}_k = -\dfrac{1}{n} \left( \dfrac{w_k}{n}\, X^{(k)}A - I \right)$
5:   Set $\alpha_k = \dfrac{n\,\langle \widehat{D}_kA, I \rangle - w_k\,\langle X^{(k)}A, \widehat{D}_kA \rangle}{\langle \widehat{D}_kA, I \rangle \langle X^{(k)}A, \widehat{D}_kA \rangle - w_k\, \|\widehat{D}_kA\|_F^2}$
6:   Set $Z^{(k+1)} = X^{(k)} + \alpha_k \widehat{D}_k$
7:   Set $X^{(k+1)} = s\sqrt{n}\, \dfrac{Z^{(k+1)}}{\|Z^{(k+1)}A\|_F}$, where $s = 1$ if $\operatorname{trace}(Z^{(k+1)}A) > 0$, $s = -1$ otherwise
8: end for

As before, we note that if we start from $X^{(0)} = (\sqrt{n}/\|A\|_F)\, I$ then by construction $\|X^{(k)}A\|_F = \sqrt{n}$, for all $k \ge 0$. For that initial guess, $\operatorname{trace}(X^{(0)}A) = \langle X^{(0)}A, I \rangle > 0$ and again by construction all the iterates remain in the $PSD$ cone. Notice also that, at each iteration, we now need to compute the two matrix-matrix products: $X^{(k)}A$ and $\widehat{D}_kA$, which for dense matrices require $n^3$ flops each. Each of the remaining calculations (inner products and Frobenius norms) is obtained with $n$ column-oriented inner products that require $n$ flops each. Summing up, in the dense case, the computational cost of each iteration of the MinCos Algorithm is $2n^3 + O(n^2)$ flops. In Section 2.5, we will discuss a sparse version of the MinCos Algorithm and its computational cost.
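The corresponding dense MATLAB sketch of Algorithm 2 follows. Again, this is our own illustration rather than the authors' code (the name mincos and the tolerance-based stopping test are ours), and the absolute value of the computed step is taken, mirroring the sign safeguard of step 7.

```matlab
function X = mincos(A, tol, maxit)
% MinCos (Algorithm 2): simplified gradient approach on F(X) = 1 - cos(XA,I).
% Illustrative dense implementation; not the authors' original code.
n  = size(A,1);
X  = (sqrt(n)/norm(A,'fro'))*eye(n);            % X^{(0)} in S, trace(X^{(0)}A) > 0
ip = @(U,V) sum(sum(U.*V));                     % Frobenius inner product <U,V>
for k = 1:maxit
    XA = X*A;
    wk = trace(XA);
    if 1 - wk/(sqrt(n)*norm(XA,'fro')) <= tol, break; end
    D  = -(1/n)*((wk/n)*XA - eye(n));           % search direction (7)
    DA = D*A;                                   % note: D*A = -grad F(X)
    alpha = (n*trace(DA) - wk*ip(XA,DA)) / ...
            (trace(DA)*ip(XA,DA) - wk*norm(DA,'fro')^2);
    alpha = abs(alpha);                         % keep a positive step length
    Z  = X + alpha*D;                           % step 6
    ZA = Z*A;
    s  = sign(trace(ZA)); if s == 0, s = 1; end
    X  = s*sqrt(n)*Z/norm(ZA,'fro');            % step 7: scale back onto S (and T)
end
end
```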

2.4 Convergence properties of the MinCos Algorithm

We start by noticing that, unless we are at the solution, the search direction $\widehat{D}_k$ is a descent direction.

Lemma 2.8 If $X \in S \cap T$ and $\nabla F(X) \ne 0$, the search direction $\widehat{D}(X)$ is a descent direction for the function $F$ at $X$.

Proof. We need to establish that, for a given $X \in S \cap T$, $\langle \widehat{D}(X), \nabla F(X) \rangle < 0$. Since $A^{-1}$ is symmetric and positive definite, then it has a unique square root which is also symmetric and positive definite. This particular square root will be denoted as $A^{-1/2}$. Therefore, since $\widehat{D}(X)A = -\nabla F(X)$, and using that $\operatorname{trace}(E_1 E_2) = \operatorname{trace}(E_2 E_1)$, for given square matrices $E_1$ and $E_2$, it follows that
$$\langle \widehat{D}(X), \nabla F(X) \rangle = \langle \widehat{D}(X)AA^{-1}, \nabla F(X) \rangle = -\langle \nabla F(X)A^{-1}, \nabla F(X) \rangle = -\langle \nabla F(X)A^{-1/2}, \nabla F(X)A^{-1/2} \rangle = -\|\nabla F(X)A^{-1/2}\|_F^2 < 0.$$

Remark 2.4 The step length in the MinCos Algorithm is obtained using the search direction $\widehat{D}_k$ in Lemma 2.4. Notice that if we use $-\widehat{D}_k$ instead of $\widehat{D}_k$, the obtained $\alpha_k$ which also forces $\psi'(\alpha_k) = 0$ is the one given by Lemma 2.4 but with a negative sign. Therefore, as in the CauchyCos Algorithm, to guarantee that $\alpha_k > 0$ minimizes $F$ along the descent direction $\widehat{D}_k$ to approximate $A^{-1}$, instead of maximizing $F$ along the ascent direction $-\widehat{D}_k$ to approximate $-A^{-1}$, we choose the step length $\alpha_k$ as the absolute value of the expression in Lemma 2.4.

We now establish the commutativity of all iterates with the matrix $A$.

Lemma 2.9 If $X^{(0)}A = AX^{(0)}$, then $X^{(k)}A = AX^{(k)}$, for all $k \ge 0$ in the MinCos Algorithm.

Proof. We proceed by induction. Assume that $X^{(k)}A = AX^{(k)}$. We have that

Z(k+1) A1/2 Z(k+1)A1/2 X(k)A1/2 . (8) k kF k kF ≥ k kF ≥ k kF Since X(k) S, then √n = X(k)A X(k)A1/2 A1/2 , which combined with (8) ∈ k kF ≤ k kF k kF implies that (k+1) √n Z F > 0 k k ≥ A1/2 2 k kF 10 is bounded away from zero for all k. Moreover, since A is nonsingular then

Z(k+1) √n Z(k+1)A F > 0 F k −1 k 1/2 2 −1 k k ≥ A F ≥ A A k k k kF k kF is also bounded away from zero for all k.

Theorem 2.3 The sequence X(k) generated by the MinCos Algorithm converges to A−1. { } Proof. From Lemma 2.8 the search direction D(X) is a descent direction for F at X, unless F (X) = 0. Therefore, since α in the MinCos Algorithm is obtained as the exact ∇ k minimizer of F along the direction D(Xk) for allb k, the proof is obtained repeating the same arguments shown in the proof of Theorem 2.2, simply replacing F (Y ) by D(Y ) −∇ for all possible instances Y .

2.5 Sparse versions We now discuss how to dynamically impose sparsity in the sequence of iterates X(k) { } generated by either the CauchyCos Algorithm or the MinCos Algorithm, to reduce their required storage and computational cost. A possible way of accomplishing this task is to prescribe a sparsity pattern beforehand, which is usually related to the sparsity pattern of the original matrix A, and then impose it at every iteration; see e.g., [6, 13, 18, 19]. At this point, we would like to mention that although there exist some special applications for which the involved matrices are large and dense [11, 15], frequently in real applications the involved matrices are large and sparse. However, in general the inverse of a is dense anyway. Moreover, with very few exceptions, it is not possible to know a priori the location of the large or the small entries of the inverse. Consequently, it is very difficult in general to prescribe a priori a nonzero sparsity pattern for the approximate inverse. As a consequence, to force sparsity in our gradient related algorithms, we use instead a numerical dropping strategy to each column (or row) independently, using a threshold tolerance, combined with a fixed bound on the maximum number of nonzero elements to be kept at each column (or row) to limit the fill-in. This combined strategy will be fully described in our numerical results section. In the CauchyCos and MinCos Algorithms, the dropping strategy must be applied to the matrix Zk+1 right after it is obtained at Step 6, and before computing Xk+1 at Step 7. That way, X will remain sparse at all iterations, and we guarantee that X S T . k+1 k+1 ∈ ∩ The new Steps 7 and 8, in the sparse versions of both algorithms, are given by 7 : Apply numerical dropping to Z(k+1) with a maximum number of nonzero entries (k+1) 8 : Set X(k+1) = s√n Z , where s =1 if trace(Z(k+1)A) > 0, s = 1 else Z(k+1)A − k kF Notice that, since all the involved matrices are symmetric, the matrix-matrix products required in both algorithms can be performed using sparse-sparse mode column-oriented inner products; see, e.g., [9]. The remaining calculations (inner products and Frobenius

11 norms), required to obtain the step length, must be also computed using sparse-sparse mode. Using this approach, which takes advantage of the imposed sparsity, the computa- tional cost and the required storage of both algorithms are drastically reduced. Moreover, using the column oriented approach both algorithms have a potential for parallelization.

3 Numerical Results

We present some numerical results to illustrate the properties of our gradient-type algo- rithms for obtaining inverse approximations. All computations are performed in MATLAB using double . For a given matrix A, the merit function Φ(X) = 1 I XA 2 has been widely 2k − kF used for computing approximate inverse preconditioners; see; e.g., [2, 7, 8, 9, 10, 12, 13, 19]. In that case, the properties of the Frobenius norm permit in a natural way the use of parallel computing. Moreover, the minimization of Φ(X) can also be accomplished imposing a column-wise numerical dropping strategy leading to a sparse approximation of A−1. Therefore, when possible, it is natural to compare the CauchyCos and the MinCos Algorithms applied to the angle-related merit function F (X) with the optimal Cauchy method applied to Φ(X) (referred from now on as the CauchyFro method), and also to the Minimal Residual (MinRes) method applied to Φ(X); see, e.g., [2, 9]. The gradient of Φ(X) is given by Φ(X) = AT (I XA), and so the iterations of ∇ − − the CauchyFro method, from the same initial guess X(0) = (√n/ A )I used by MinCos k kF and CauchyCos, can be written as (k+1) (k) X = X + αkGk, (9) (k) where Gk = Φ(X ) and the step length αk > 0 is obtained as the global minimizer (k) −∇ of Φ(X + αGk) along the direction Gk, as follows R , AG α = h k ki , (10) k AG , AG h k ki where R = I AX(k) is the residual matrix at X(k). The iterations of the MinRes method k − can be obtained replacing Gk by the residual matrix Rk in (9) and (10); see [9] for details. We need to remark that in the dense case, the CauchyFro method needs to compute two matrix-matrix products per iteration, whereas the MinRes method by using the recursion R = R α AR needs one matrix-matrix product per iteration. k+1 k − k k For our experiments we consider the following test matrices in the P SD cone: from the Matlab gallery: Poisson, Lehmer, Wathen, Moler, and miij. Notice that the • Poisson matrix, referred in Matlab as (Poisson, N) is the N 2 N 2 finite differences 2D × discretization matrix of the negative Laplacian on ]0, 1[2 with homogeneous Dirichlet boundary conditions. Poisson 3D (that depends on the parameter N), is the N 3 N 3 finite differences 3D • × discretization matrix of the negative Laplacian on the unit cube with homogeneous Dirichlet boundary conditions.

12 from the Matrix Market (http://math.nist.gov/MatrixMarket/): nos1, nos2, • nos5, and nos6.

In Table 1 we report the considered test matrices with their size, sparsity properties, and 2-norm condition number κ(A). Notice that the Wathen matrices have random entries so we cannot report their spectral properties. Moreover, Wathen (N) is a sparse n n × matrix with n = 3N 2 + 4N + 1. In general the inverse of all the considered matrices are dense, except the inverse of the Lehmer matrix which is tridiagonal.

Matrix A Size (n n) κ(A) A × Poisson (50) n=2500 1.05e+03 sparse Poisson (100) n=1000 6.01e+03 sparse Poisson (150) n=22500 1.34e+04 sparse Poisson (200) n=400000 2.38e+04 sparse Poisson 3D (10) n=1000 79.13 sparse Poisson 3D (15) n=3375 171.66 sparse Poisson 3D (30) n=27000 388.81 sparse Poisson 3D (50) n=125000 1.05e+03 sparse Lehmer (100) n=100 1.03e+04 dense Lehmer (200) n=200 4.2e+04 dense Lehmer (300) n=300 9.5e+04 dense Lehmer (400) n=400 1.7e+05 dense Lehmer (500) n=500 2.6e+05 dense minij (20) n=20 677.62 dense minij (30) n=30 1.5e+03 dense minij (50) n=50 4.13e+03 dense minij (100) n=100 1.63e+04 dense minij (200) n=200 6.51e+04 dense moler (100) n=100 3.84e+16 dense moler (200) n=200 3.55e+16 dense moler (300) n=300 3.55e+16 dense moler (500) n=500 3.55e+16 dense moler (1000) n=1000 3.55e+16 dense nos1 n=237 2.53e+07 sparse nos2 n=957 6.34e+09 sparse nos5 n=468 2.91e+04 sparse nos6 n=675 8.0e+07 sparse

Table 1: Considered test matrices and their characteristics.

13 3.1 Approximation to the inverse with no dropping strategy To add understanding to the properties of the new CauchyCos and MinCos Algorithms, we start by testing their behavior, as well as the behavior of CauchyFro and MinRes, without imposing sparsity. Since the goal is to compute an approximation to A−1, it is not necessary to carry on the iterations up to a very small tolerance parameter ǫ, and we choose ǫ = 0.01 for our experiments. For all methods, we stop the iterations when min F (X(k)), Φ(X(k)) ǫ. { }≤ Matrix CauchyCos CauchyFro MinRes MinCos Poisson 2D (n=50) 88 132 7 6 Poisson 2D (n=70) 7 6 Poisson 2D (n=100) 7 7 Poisson 2D (n=200) 7 7 Poisson 3D (n=10) 9 12 3 2 Poisson 3D (n=15) 10 14 3 2 Poisson 3D (n=30) 3 3 Poisson 3D (n=50) 3 3 Lehmer (n=10) 888 1141 21 15 Lehmer (n=20) 9987 49901 123 51 Lehmer (n=30) 355 109 Lehmer (n=40) 645 190 Lehmer (n=50) 987 293 Lehmer (n=50) 1399 423 Lehmer (n=100) 3905 1178 Lehmer (n=200) 16189 4684 Minij (n=20) 31271 63459 209 45 Minij (n=30) 153456 629787 553 102 Minij (n=50) 1565 307 Minij (n=100) 6771 1259 Minij (n=200) 26961 5057 Moler (n=100) 7 83 3 3 Moler (n=200) 77 15243 19 12 Moler (n=300) 105 22 Moler (n=500) 381 48 Moler (n=1000) 1297 152 Wathen (n=10) 10751 17729 68 57 Wathen (n=20) 495 1112 22 16 Wathen (n=30) 24 17 Wathen (n=50) 20 15

Table 2: Number of iterations required for all considered methods when ǫ = 0.01.

Table 2 shows the number of required iterations by the four considered algorithms when applied to some of the test functions, and for different values of n. No information

14 in some of the entries of the table indicates that the corresponding method requires an excessive amount of iterations as compared with the MinRes and MinCos Algorithms. We can observe that CauchyFro and CauchyCos are not competitive with MinRes and MinCos, except for very few cases and for very small dimensions. Among the Cauchy- type methods, CauchyCos requires less iterations than CauchyFro, and in several cases the difference is significant. The MinCos and MinRes Algorithms were able to accomplish the required tolerance using a reasonable amount of iterations, except for the Lehmer(n) and minij(n) matrices for larger values of n, which are the most difficult ones in our list of test matrices. The MinCos Algorithm clearly outperforms the MinRes Algorithm, except for the Poisson 2D (n) and Poisson 3D (n) for which both methods require the same number of iterations. For the more difficult matrices and specially for larger values of n, MinCos reduces in the average the number of iterations with respect to MinRes by a factor of 4.

Merit function=1−cos Merit function=1−cos 0 0 10 10 Cauchy cos MinRes Cauchy fro MinCos

−1 10

−1 10

−2 10

−2 −3 10 10 0 500 1000 1500 0 5 10 15 20 25 iterations iterations

Merit function=Frobenius norm Merit function=Frobenius norm 2 2 10 10 Cauchy cos MinRes Cauchy fro MinCos

1 1 10 10

0 0 10 10 0 500 1000 1500 0 5 10 15 20 25 iterations iterations

(a) CauchyFro versus CauchyCos (b) MinRes versus MinCos

Figure 1: Convergence history for CauchyFro and CauchyCos (left), and MinRes and MinCos (right) for two merit functions: F (X) (up) and Φ(X) (down), when applied to the Wathen matrix for n = 20 and ǫ = 0.01.

In Figure 1 we show the (semilog) convergence history for the four considered methods and for both merit functions: F (X) and Φ(X), when applied to the Wathen matrix for n = 20 and ǫ = 0.01. Once again, we can observe that CauchyFro and CauchyCos are not competitive with MinRes and MinCos, and that MinCos outperforms MinRes. Moreover,

15 we observe in this case that the function F (X) is a better merit function than Φ(X) in the sense that it indicates with fewer iterations that a given iterate is sufficiently close to the inverse matrix. The same good behavior of the merit function F (X) has been observed in all our experiments. Based on these preliminary results, we will only report the behavior of MinRes and MinCos for the forthcoming numerical experiments.

3.2 Sparse approximation to the inverse We now build sparse approximations by applying the dropping strategy, described in Section 2.5, which is based on a threshold tolerance with a limited fill-in (lfil) on the matrix Z(k+1), at each iteration, right before the scaling step to guarantee that the iterate X(k+1) S T . We define thr as the percentage of coefficients less than the maximum ∈ ∩ value of the modulus of all the coefficients in a column. To be precise, for each i-th column we select at most lfil off-diagonal coefficients among the ones that are larger in magnitude than thr (Z(k+1)) , where (Z(k+1)) represents the i-th column of Z(k+1). × k ik∞ i We begin by comparing MinRes and MinCos when we apply the numerical dropping strategy. For both algorithms we use the column-oriented sparse calculations described in Section 2.5; see also [9].

(k) (k) Matrix Method κ(X A)/κ(A) [λmin, λmax] of (X A) Iter % fill-in nos1 (lfil = 10) MinCos 0.0835 [2.44e-06,2.3272] 20 3.71 nos1(lfil = 10) MinRes [-98.66,5.40] nos6 (lfil = 10) MinCos 0.4218 [5.07e-06,3.1039] 20 0.45 nos6 (lfil = 20) MinCos 0.2003 [8.51e-06,3.0702] 20 0.82 nos6(lfil = 10) MinRes [ -0.7351,2.6001] nos6(lfil = 20) MinRes [ -0.2256,2.2467] nos5(lfil = 5) MinCos 0.068 [0.002,1.36] 10 1.18 nos5(lfil = 10) MinCos 0.0755 [00.0024,1.3103] 10 2.47 nos5(lfil = 5) MinRes [-20.31,2.16] nos5(lfil = 10) MinRes 0.1669 [0.0021,1.7868] 10 2.36 nos2(lfil = 5) MinCos 0.1289 [5.2e-09,2.73] 10 0.52 nos2(lfil = 10) MinCos 0.0891 [7.95e-09,2.2873] 10 0.80 nos2(lfil = 20) MinCos 0.0700 [9.7e-09,1.9718] 10 1.14 nos2(lfil = 5) MinRes [ 0.3326, 2.4869] − nos2(lfil = 10) MinRes 0.0970 [4.21e-09,1.5414] 10 0.93 nos2(lfil = 20) MinRes 0.0861 [4.21e-09,1.1638] 10 1.14

Table 3: Performance of MinRes and MinCos when applied to the Matrix Market matrices nos1, nos2, nos5, and nos6, for ǫ = 0.01, thr = 0.01, and different values of lfil.

Table 3 shows the performance of MinRes and MinCos when applied to the matrices nos1, nos2, nos5, and nos6, for ǫ = 0.01, thr = 0.01, and several values of lfil. We report the iteration k (Iter) at which the method was stopped, the interval [λmin, λmax] of

16 (X(k)A), the quotient κ(X(k)A)/κ(A), and the percentage of fill-in (% fill-in) at the final matrix X(k). We observe that, when imposing the dropping strategy to obtain sparsity, MinRes fails to produce an acceptable preconditioner. Indeed, as it has been already observed (see [2, 9]) quite frequently MinRes produces an indefinite approximation to the inverse of a sparse matrix in the P SD cone. We also observe that, in all cases, the MinCos method produces a sparse symmetric and positive definite preconditioner with relatively few iterations and a low level of fill-in. Moreover, with the exception of the matrix nos6, the MinCos method produces a preconditioned matrix (X(k)A) whose condition number is reduced by a factor of approximately 10 with respect to the condition number of A. In some cases, MinRes was capable of producing a sparse symmetric and positive definite preconditioner, but in those cases the MinCos produced a better preconditioner in the sense that it exhibits a better reduction of the condition number, and also a better eigenvalues distribution. Based on these results, for the remaining experiments we only report the behavior of the MinCos Algorithm. Table 4 shows the performance of the MinCos Algorithm when applied to the Wathen matrix for different values of n and a maximum of 20 iterations. For this numerical experiment we fix ǫ = 0.01, thr = 0.04, and lfil = 20. For the particular case of the Wathen matrix when n = 50, we show in Figure 2 the (semilog) convergence history of the norm of the residual when solving a linear system with a random right hand side vector, using the Conjugate Gradient (CG) method without preconditioning, and also using the preconditioner generated by the MinCos Algorithm after 20 iterations, fixing ǫ = 0.01, thr = 0.04, and lfil = 20. We also report in Figure 3 the eigenvalues distribution of A and of X(k)A, at k = 20, for the same experiment with the Wathen matrix and n = 50. Notice that the eigenvalues of A are distributed in the interval [0, 350], whereas the eigenvalues of X(k)A are located in the interval [0.03, 1.4] (see Table 4). Even better, we can observe that most of the eigenvalues are in the interval [0.3, 1.4], and very few of them are in the interval [0.03, 0.3], which clearly accounts for the good behavior of the preconditioned CG method (see Figure 2).

(k) (k) Matrix A κ(X A)/κ(A) [λmin, λmax] of (X A) iter % fil-in wathen (30) 0.0447 [0.0109, 1.3889] 20 0.73 wathen (50) 0.0461 [0.0366, 1.4012] 20 0.27 wathen (70) 0.0457 [0.0086, 1.3894] 20 0.14 wathen (100) 0.0467 [0.0289, 1.4121] 20 6.8436e-02

Table 4: Performance of MinCos applied to the Wathen matrix for different values of n and a maximum of 20 iterations, when ǫ = 0.01, thr = 0.04, and lfil = 20.

Tables 5, 6, and 7 show the performance of the MinCos Algorithm when applied to the Poisson 2D, the Poisson 3D, and the Lehmer matrices, respectively, for different values of n, and different values of the maximum number of iterations, ǫ, thr, and lfil. We can observe that, for the Poisson 2D and 3D matrices, the MinCos Algorithm produces a sparse symmetric and positive definite preconditioner with very few iterations, a low level of fill-in, and a significant reduction of the condition number.

17 2 10 Precond CG CG 0 10

−2 10

−4 10 Residual

−6 10

−8 10

−10 10 0 50 100 150 200 250 300 iterations

Figure 2: Convergence history of the CG method applied to a linear system with the Wathen matrix, for n = 50, 20 iterations, ǫ = 0.01, thr = 0.01, and lfil = 20, using the preconditioned generated by the MinCos Algorithm and without preconditioning.

Eigenvalues of the preconditioned matrix computed with cos MF 1

0.8

0.6

0.4

0.2

0

−0.2

−0.4

−0.6

−0.8

−1 0 0.5 1 1.5

Eigenvalues of A 1

0.8

0.6

0.4

0.2

0

−0.2

−0.4

−0.6

−0.8

−1 0 50 100 150 200 250 300 350

Figure 3: Eigenvalues distribution of A (down) and of X(k)A (up) after 20 iterations of the MinCos Algorithm when applied to the Wathen matrix for n = 50, ǫ = 0.01, thr = 0.01, and lfil = 20.

For the Lehmer matrix, which is one of the most difficult considered matrices, we observe in Table 7 that the MinCos Algorithm produces a symmetric and positive definite preconditioner with a significant reduction of the condition number, but after 40 iterations

18 (k) (k) Matrix A κ(X A)/κ(A) [λmin, λmax] of (X A) iter % fil-in Poisson 2D (50) 0.1361 [0.0138, 1.2961] 6 1.65 Poisson 2D (100) 0.1249 [0.0039, 1.1452] 7 0.41 Poisson 2D (150) 0.1248 [0.0017, 1.1459] 7 0.18 Poisson 2D (200) 0.1246 [9.78e-04,1.1484] 7 0.10

Table 5: Performance of MinCos applied to the Poisson 2D matrix, for different values of n and a maximum of 20 iterations, when ǫ = 0.01, thr = 0.04, and lfil = 40.

(k) (k) Matrix A κ(X A)/κ(A) [λmin, λmax] of (X A) iter % fil-in Poisson 3D (10) 0.3393 [0.1161, 1.4410] 2 2.09 Poisson 3D(15) 0.3357 [0.0561, 1.4639] 2 0.66

Table 6: Performance of MinCos applied to the Poisson 3D matrix, for different values of n and a maximum of 20 iterations, when ǫ = 0.01, thr = 0.01, and lfil = 40

(k) (k) Matrix A κ(X A)/κ(A) [λmin, λmax] of (X A) iter % fil-in Lehmer (100) 0.0150 [0.0223, 3.4270] 40 37.04 Lehmer (200) 0.0180 [0.0069, 5.0768] 40 38.34

Table 7: Performance of MinCos applied to the Lehmer matrix, for different values of n and a maximum of 40 iterations, when ǫ = 0.01, thr = 0.06, and lfil = 100. and fixing lfil = 100, for which the preconditioner accepts a high level of fill-in. If we impose a low level of fill-in, by reducing the value of lfil, MinCos still produces a symmetric and positive definite matrix, but the reduction of the condition number is not significant.

4 Final remarks

We have introduced and analyzed two gradient-type optimization schemes to build sparse inverse preconditioners for symmetric positive definite matrices. For that we have proposed the novel objective function F (X) = 1 cos(XA,I), which is invariant under positive − scaling and has some special properties that are clearly related to the geometry of the P SD cone. One of the new schemes, the CauchyCos Algorithm, is closely related to the classical steepest descent method, and as a consequence it shows in most cases a very slow convergence. The second new scheme, denoted as the MinCos Algorithm, shows a much faster performance and competes favorably with well-known methods. Based on our numerical results, by choosing properly the numerical dropping parameters, the MinCos Algorithm produces a sparse inverse preconditioner in the P SD cone for which a significant reduction of the condition number is observed, while keeping a low level of fill-in.

19 References

[1] R. Andreani, M. Raydan and P. Tarazaga [2013], On the geometrical structure of symmetric matrices, Linear Algebra and its Applications, 438, 1201–1214.

[2] M. Benzi and M. Tuma [1999], A comparative study of sparse approximate inverse preconditioners, Applied Numerical Mathematics, 30, 305–340

[3] D. P. Bertsekas [1999], Nonlinear Programming, Athena Scientific, Boston.

[4] J.-P. Chehab [2007], Matrix differential equations and inverse preconditioners, Com- putational and Applied Mathematics, 26, 95–128.

[5] J.P. Chehab and M. Raydan [2008], Geometrical properties of the Frobenius condition number for positive definite matrices, Linear Algebra and its Applications, 429, 2089– 2097.

[6] K. Chen [2001], An analysis of sparse approximate inverse preconditioners for bound- ary integral equations, SIAM J. Matrix Anal. Appl., 22, 1058-1078.

[7] K. Chen [2005], Matrix Preconditioning Techniques and Applications, Cambridge Uni- versity Press.

[8] E. Chow and Y. Saad [1997], Approximate inverse techniques for block-partitioned matrices, SIAM J. Sci. Comput., 18, 1657–1675.

[9] E. Chow and Y. Saad [1998], Approximate inverse preconditioners via sparse-sparse iterations, SIAM Journal on Scientific Computing, 19, 995–1023.

[10] J. D. F. Cosgrove, J. C. Daz, and A. Griewank [1992], Approximate inverse precon- ditioning for sparse linear systems, Internat. J. Comput. Math., 44, 91–110.

[11] K. Forsman, W. Gropp, L. Kettunen, D. Levine, and J. Salonen [1995], Solution of dense systems of linear equations arising from integral equation formulations, Anten- nas and Propagation Magazine, 37, 96–100.

[12] L. Gonz´alez [2006], Orthogonal projections of the identity: spectral analysis and applications to approximate inverse preconditioning, SIAM Review, 48, 66–75.

[13] N. I. M. Gould and J. A. Scott [1998], Sparse approximate-inverse preconditioners using norm-minimization techniques, SIAM J. Sci. Comput., 19, 605–625.

[14] P. H. Guillaume, A. Huard, and C. Le Calvez [2002], A block constant approximate inverse for preconditioning large linear systems, SIAM J. Matrix Anal. Appl., 24, 822–851.

[15] J. Helsing [2006], Approximate inverse preconditioners for some large dense random electrostatic interaction matrices, BIT Numerical Mathematics, 46, 307–323.

[16] R. D. Hill and S. R. Waters [1987], On the cone of positive semidefinite matrices, Linear Algebra and its Applications, 90, 81–88.

[17] A. N. Iusem and A. Seeger [2005], On pairs of vectors achieving the maximal angle of a convex cone, Math. Program., Ser. B, 104, 501-523.

[18] L. Yu. Kolotilina and A. Yu. Yeremin [1993], Factorized sparse approximate inverse preconditioning I. Theory, SIAM J. Matrix Anal. Appl., 14, 45–58.

[19] G. Montero, L. Gonz´alez, E. Fl´orez, M. D. Garc´ıa, and A. Su´arez [2002], Approximate inverse computation using Frobenius inner product, Numerical Linear Algebra with Applications, 9, 239–247.

[20] M. Raydan and B. Svaiter [2002], Relaxed steepest descent and Cauchy-Barzilai- Borwein method, Computational Optimization and Applications, 21, 155–167.

[21] Ademir A. Ribeiro and Elizabeth W. Karas [2014], Otimiza¸c˜ao Cont´ınua: aspectos te´oricos e computacionais, Cengage Learning Editora.

[22] Y. Saad [2010], Iterative Methods for Sparse Linear Systems, SIAM, 2nd edition, Philadelphia.

[23] A. M. Sajo-Castelli, M. A. Fortes, and M. Raydan [2014], Preconditioned conjugate gradient method for finding minimal energy surfaces on Powell-Sabin triangulations, Journal of Computational and Applied Mathematics, 268, 34–55.

[24] P. Tarazaga [1990], Eigenvalue estimates for symmetric matrices, Linear Algebra and its Applications, 135, 171-179.
