<<

University of Southern Denmark

A -algebraic algorithm for the Riemannian logarithm on the Stiefel under the canonical metric

Zimmermann, Ralf

Published in: SIAM Journal on Matrix Analysis and Applications

DOI: 10.1137/16M1074485

Publication date: 2017

Document version: Accepted manuscript

Citation for pulished version (APA): Zimmermann, R. (2017). A matrix-algebraic algorithm for the Riemannian logarithm on the Stiefel manifold under the canonical metric. SIAM Journal on Matrix Analysis and Applications, 38(2), 322-342. https://doi.org/10.1137/16M1074485

Go to publication entry in University of Southern Denmark's Research Portal

Terms of use This work is brought to you by the University of Southern Denmark. Unless otherwise specified it has been shared according to the terms for self-archiving. If no other license is stated, these terms apply:

• You may download this work for personal use only. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying this open access version If you believe that this document breaches copyright please contact us providing details and we will investigate your claim. Please direct all enquiries to [email protected]

Download date: 25. Sep. 2021 1 A MATRIX-ALGEBRAIC ALGORITHM FOR THE RIEMANNIAN 2 LOGARITHM ON THE STIEFEL MANIFOLD UNDER THE 3 CANONICAL METRIC

4 RALF ZIMMERMANN∗

5 Abstract. We derive a numerical algorithm for evaluating the Riemannian logarithm on the 6 Stiefel manifold with respect to the canonical metric. In contrast to the optimization-based approach 7 known from the literature, we work from a purely matrix-algebraic perspective. Moreover, we prove 8 that the algorithm converges locally and exhibits a linear rate of convergence.

9 Key words. Stiefel manifold, Riemannian logarithm, Riemannian exponential, Dynkin series, Goldberg series, Baker-Campbell-Hausdorff series10

11 AMS subject classifications. 15A16, 15B10, 15B57, 33B30, 33F05, 53-04, 65F60

12 1. Introduction. Consider an arbitrary Riemannian manifold M. Geodesics 13 on M are locally shortest curves that are parametrized by the arc length. Because 14 they satisfy an initial value problem, they are uniquely determined by specifying a

15 starting point p0 ∈ M and a starting velocity ∆ ∈ Tp0 M from the tangent space at 16 p0. Geodesics give rise to the Riemannian exponential function that maps a tangent

17 vector ∆ ∈ Tp0 M to the endpoint C(1) of a geodesic path C : [0, 1] → M starting at ˙ 18 C(0) = p0 ∈ M with velocity ∆ = C(0) ∈ Tp0 M. It thus depends on the base point 19 p0 and is denoted by

20 (1) Expp0 : Tp0 M → M, Expp0 (∆) := C(1).

21 The Riemannian exponential is a local diffeomorphism, [13, §5]. This means that 22 it is locally invertible and that its inverse, called the Riemannian logarithm is also 23 differentiable. Moreover, the exponential is radially isometric, i.e., the Riemannian

24 distance between the starting point p0 and the endpoint p1 := Expp0 (∆) on M is

25 the same as the length of the velocity vector ∆ of the geodesic t 7→ Expp0 (t∆) when

26 measured on the tangent space Tp0 M,[13, Lem. 5.10 & Cor. 6.11]. In this way, the 27 exponential mapping gives a local parametrization from the (flat, Euclidean) tangent 28 space to the (possibly curved) manifold. This is also referred to as to representing 29 the manifold in normal coordinates [12, §III.8]. 30 The Riemannian exponential and logarithm are important both from the theo- 31 retical perspective as well as in practical applications. The latter fact holds true in 32 particular, when M is a matrix manifold [2]. Examples range from data analysis and 33 signal processing [7, 17,3, 18] over computer vision [4, 14] to adaptive model reduction 34 and subspace interpolation [5] and, more generally speaking, optimization techniques 35 on [6,1,2]. This list is far from being exhaustive. 36 Original contribution. In the work at hand, we present a matrix-algebraic deriva- 37 tion of an algorithm for computing the Riemannian logarithm on the Stiefel manifold. 38 The matrix-algebraic perspective allows us to prove local linear convergence. The 39 approach is based on an iterative inversion of the closed formula for the associated 40 Riemannian exponential that has been derived in [6, §2.4.2]. Our main tools are 41 Dynkin’s explicit Baker-Campbell-Hausdorff formula [19] and Goldberg’s exponential

∗Department of Mathematics and Computer Science, University of Southern Denmark (SDU) Odense, ([email protected]) 1

This manuscript is for review purposes only. 2 RALF ZIMMERMANN

42 series [8], both of which represent a solution Z to the matrix equation

43 expm(Z(X,Y )) = expm(X) expm(Y )(⇔ Z(logm(V ), logM (W )) = logm(VW )) ,

44 where V = expm(X),W = expm(Y ) and expm, logm are the standard matrix expo- 45 nential and matrix logarithm [11, §10, §11]. As an aside, we improve Thompson’s 46 norm bound from [20] on kZ(X,Y )k for the Goldberg series by a factor of 2, where 47 k · k is any submultiplicative matrix norm. 48 The Stiefel log algorithm can be implemented in O(10) lines of (commented) 49 MATLAB [15] code, which we include in Appendix C.1. 50 Comparison with previous work. To the best of our knowledge, up to now, the 51 only algorithm for evaluating the Stiefel logarithm appeared in Q. Rentmeesters’ thesis 52 [18, Alg. 4, p. 91]. This algorithm is based on a Riemannian optimization problem. It 53 turns out that this approach and the ansatz that is pursued here, though very different 54 in their course of action, lead to essentially the same numerical scheme. Rentmeesters 55 observes numerically a linear rate of convergence [18, p.83, p.100]. Proving linear con- 56 vergence for [18, Alg. 4, p. 91] would require estimates on the Hessian, see [18, §5.2.1], 57 [2, Thm. 4.5.6]. In contrast, the derivation presented here uses only elementary ma- 58 trix algebra and the convergence proof given here formally avoids the requirements 59 of computing/estimating step sizes, gradients and Hessians that are inherent to an- 60 alyzing the convergence of optimization approaches. In fact, the convergence proof 61 applies to [18, Alg. 4, p. 91] and yields the linear convergence of this optimization 62 approach when using a fixed unit step size, but only on a sufficiently small domain. 63 The thesis [18] was published under a two-years access embargo and the fundamentals 64 of the work at hand were developed independently before [18] was accessible. 65 Transition to the complex case. 
The basic geometric concepts of the Stiefel man- 66 ifold, the algorithm for the Riemannian log mapping developed here and its conver- 67 gence proof carry over to complex matrices, where orthogonal matrices have to be 68 replaced with unitary matrices and skew-symmetric matrices with skew-Hermitian 69 matrices and so forth, see also [6, §2.1]. The thus adjusted log mapping algorithm 70 was also confirmed numerically to work in the complex case. 71 Organization. Background information on the Stiefel manifold are reviewed in 72 Section 2. The new derivation for the Stiefel log algorithm is in Section 3, convergence 73 analysis is performed in Section 4, experimental results are in Section 5, and the 74 conclusions follow in Section 6. p×p 75 Notational specifics. The (p × p)-identity matrix is denoted by Ip ∈ R . If the 76 dimension is clear, we will simply write I. The (p × p)-, i.e., the set 77 of all square orthogonal matrices is denoted by

p×p T T 78 Op×p = {Φ ∈ R |Φ Φ = ΦΦ = Ip}.

79 The standard matrix exponential and matrix logarithm are denoted by

∞ ∞ X Xj X Xj 80 exp (X) := , log (I + X) := (−1)j+1 . m j! m j j=0 j=1

81 We use the symbols ExpSt, LogSt for the Riemannian counterparts on the Stiefel 82 manifold. 83 When we employ the qr-decomposition of a rectangular matrix A ∈ Rn×p, we 84 implicitly assume that n ≥ p and refer to the ‘economy size’ qr-decomposition A = 85 QR, with Q ∈ Rn×p, R ∈ Rp×p.

This manuscript is for review purposes only. A MATRIX-ALGEBRAIC STIEFEL LOGARITHM 3

86 2. The Stiefel manifold in numerical representation. This section reviews 87 the essential aspects of the numerical treatment of Stiefel manifolds, where we rely 88 heavily on the excellent references [2,6]. The Stiefel manifold is the compact homo- 89 geneous matrix manifold of all column-orthogonal rectangular matrices

n×p T 90 St(n, p) := {U ∈ R | U U = Ip}.

91 The tangent space TU St(n, p) at a point U ∈ St(n, p) can be thought of as the space 92 of velocity vectors of differentiable curves on St(n, p) passing through U:

93 TU St(n, p) = {C˙(to)|C :(t0 − , t0 + ) → St(n, p), C(t0) = U}.

94 For any matrix representative U ∈ St(n, p), the tangent space of St(n, p) at U is 95 represented by  n×p T T n×p 96 TU St(n, p) = ∆ ∈ R | U ∆ = −∆ U ⊂ R .

97 Every tangent vector ∆ ∈ TU St(n, p) may be written as T p×p n×p 98 (2) ∆ = UA + (I − UU )T,A ∈ R skew,T ∈ R arbitrary. 1 99 The dimension of both TU St(n, p) and St(n, p) is np − p(p + 1). 2   ˜ T 1 T ˜ 100 Each tangent space carries an inner product h∆, ∆iU = tr ∆ (I − 2 UU )∆ p 101 with corresponding norm k∆kU = h∆, ∆iU . This is called the canonical met- 102 ric on TU St(n, p). It is derived from the quotient space representation St(n, p) = 103 On×n/O(n−p)×(n−p) that identifies two square orthogonal matrices in On×n as the 104 same point on St(n, p), if their first p columns coincide [6, §2.4]. Endowing each 105 tangent space with this metric (that varies differentiably in U) turns St(n, p) into a 106 Riemannian manifold. 107 We now turn to the Riemannian exponential (1) but for M = St(n, p). An 108 efficient algorithm for evaluating the Stiefel exponential was derived in [6, §2.4.2]. 109 The algorithm starts with decomposing an input tangent vector ∆ ∈ TU St(n, p) into 110 its horizontal and vertical components with respect to the base point U,

T T T (qr of (I−UU )∆) 111 ∆ = UU ∆ + (I − UU )∆ = UA + QERE.

112 Because ∆ is tangent, A ∈ Rp×p is skew. Then the matrix exponential is invoked to 113 compute

   T    M A −RE Ip 114 (3) := expm . NE RE 0 0

115 The final output is1 ˜ St 116 (4) U := ExpU (∆) = UM + QENE ∈ St(n, p).

117 (A MATLAB function for the Stiefel exponential is in the supplement in Section S4.) 118 The matrix exponential in (3) is related with the solution of the initial value problem 119 that defines a geodesic on St(n, p), see [6, §2.4.2] for details. It turns out that the 120 main obstacle in computing the inverse of the Stiefel exponential and thus the Stiefel 121 logarithm is inverting (3), i.e. finding A, RE given M,NE, compare to [18, eq. (5.21)].

1 The index in QE ,RE ,NE is used to emphasize that these matrices stem from the Stiefel expo- nential as opposed to the closely related matrices Q, R, N that will appear in the procedure for the Stiefel logarithm.

This manuscript is for review purposes only. 4 RALF ZIMMERMANN

122 3. Derivation of the Stiefel log algorithm. Let U, U˜ ∈ St(n, p) and assume ˜ St 123 that U is contained in a neighborhood D of U such that ExpU is a diffeomorphism 124 from a neighborhood of 0 ∈ TU St(n, p) onto D. The central objective is to find St ˜ 125 ∆ ∈ TU St(n, p) such that ExpU (∆) = U. 126 Because of (4), we know that U˜ allows for a representation U˜ = UM + QENE. p×p n×p 127 Hence, we have to determine the unknown matrices M,NE ∈ R , QE ∈ R , T T T 128 which feature the following properties: QEU = 0 and M M + NE NE = Ip. (Note 129 that by (3), M and NE are the left upper and lower p×p blocks of a 2p×2p orthogonal 130 matrix.) We directly obtain

T T 131 M = U U,Q˜ ENE = (I − UU )U.˜

132 We compute candidates for QE,NE via a qr-decomposition qr 133 QN = (I − UU T )U,Q˜ ∈ St(n, p).

134 The set of all orthogonal matrices with M,N as an upper diagonal and lower 135 off-diagonal block is parametrized via        MX X X0 136 | = Φ, Φ ∈ Op×p , NY Y Y0

T T T 137 where (X0 ,Y0 ) is a specific orthogonal completion, computed, say, via the Gram- 138 Schmidt process. 139 Thus, the objective is reduced to solving the following nonlinear matrix equation        MX0 Ip 0 0 140 (5) 0 = 0 Ip logm , Φ ∈ Op×p. NY0 0 Φ Ip

     T  MX0 Ip 0 A −B 141 Writing logm = , this means finding a rotation NY0 0 Φ BC 142 Φ such that C = 0. 143 The first result is that solving (5) indeed leads to the Riemannian logarithm on 144 the Stiefel manifold. 145 Theorem 1. Let U, U˜ ∈ St(n, p) and assume that U˜ is contained in a neigh- St 146 borhood D of U such that ExpU is a diffeomorphism from a neighborhood of 0 ∈ 147 TU St(n, p) onto D. 148 Let M, QE,NE, Q, N, X0,Y0 as introduced in the above setting. Suppose that 149 Φ ∈ Op×p solves (5), i.e.,

   T  MX0Φ A −B 150 logm = . NY0Φ B 0

St ˜ St ˜ 151 Define ∆ := UA + QB ∈ TU St(n, p). Then ExpU (∆) = U, i.e., ∆ = LogU (U). 152 Proof. By construction, it holds QN = (I − UU T )U˜ and hence

153 (6) U T Q = 0, (I − UU T )Q = Q, QQT U˜ = QQT (I − UU T )U˜ = (I − UU T )U.˜

154 Now, we apply the Stiefel exponential (4) to ∆ = UA+QB. This gives (I −UU T )∆ = 155 QB and qr T 2 156 QERE = QB ⇔ RE = ΨB, where Ψ := (QEQ) ∈ Op×p.

2 The matrices QE and Q differ by an orthogonal rotation but span the same subspace.

This manuscript is for review purposes only. A MATRIX-ALGEBRAIC STIEFEL LOGARITHM 5

157 With U T ∆ = A, we obtain

   T    M A −RE Ip 158 := expm NE RE 0 0    T      I 0 A −B I 0 Ip 159 = exp 0 Ψ m B 0 0 ΨT 0         I 0 MX0Φ Ip M 160 = = . 161 0 Ψ NY0Φ 0 ΨN

T 162 Keeping in mind that QEΨ = QEQEQ = Q, this leads to an output of

St ˜ 163 ExpU (∆) = UM + QENE = UM + QEΨN = UM + QN = U. St ˜ 164 Thus, ∆ is a valid tangent vector in TU St(n, p) such that ExpU (∆) = U ∈ St(n, p). St ˜ 165 From abstract differential geometry, we know that LogU (U) ∈ TU St(n, p) is the St St ˜ ˜ 166 unique tangent with ExpU (LogU (U)) = U. We arrive at the claim St ˜ 167 ∆ = LogU (U).

168 Having established Theorem 1, we now focus on solving (5). Let

   T  MX0 A0 −B0 169 V0 := , logm(V0) := , NY0 B0 C0     Ip 0 0 0 170 W0 := , logm(W0) = . 171 0 Φ0 0 logm(Φ0)

172 Up to terms of first order, it holds logm(V0W0) = logm(V0) + logm(W0). Hence, the 173 choice 174 Φ0 = expm(−C0) 175 gives an approximate solution to (5). We define

     T  MX0 Ip 0 A1 −B1 176 (9) V1 := , logm(V1) := NY0 0 Φ0 B1 C1

177 and iterate. This is the essential idea of Algorithm 1 for the Riemannian logarithm.3 178 In Section 4 we make use of the Baker-Campbell-Hausdorff formula [19, §1.3, p. 179 22] that corrects for the misfit in the approximative matrix relation logm(VW ) ≈ 180 logm(V ) + logm(W ) for two non-commuting matrices V,W in order to show that the 181 above procedure leads to 182 kCk+1k2 < αkCkk2

183 for all k ∈ N0 and a constant α < 1 and is thus convergent. 184 Since the Riemannian exponential is a local diffeomorphism, we have to postulate 185 a suitable bound on the distance between the input matrices U and U˜. Suppose that T T 186 kU − U˜k2 < . Recalling the definitions M = U U˜ and (I − UU )U˜ = QN, this 187 gives the following bounds for the horizontal and the vertical component of U − U˜ 188 with respect to the subspace spanned by U:

T T 189 kUU (U − U˜)k2 = kIp − Mk2 < , k(I − UU )(U − U˜)k2 = kQNk2 = kNk2 < .

3This is the same algorithm as [18, Alg. 4, p. 91] that Rentmeesters obtains from his geometrical perspective when a fixed unit step length is employed and when [18, §5.3] is taken into account.

This manuscript is for review purposes only. 6 RALF ZIMMERMANN

190 However, it turns out that for the convergence proof, estimates on the norms 191 of X0, Y0 and Y0 − Ip are also required. By the CS-decomposition of orthonormal 192 matrices [9, Thm 2.6.3, p. 78], the diagonal blocks M and Y0 share the same singular 193 values and so do the off-diagonal blocks N,X0. Hence, kNk2 = kX0k2 < . Let T T 194 D1ΣR1 be the SVD of M and D2ΣR2 be the SVD of Y0. An estimate for the 195 singular values of M can be obtained as follows:

2 2 T T 2 2 196 (10)  > kNk2 = kN Nk2 = kI − M Mk2 = kI − Σ k2 = max(1 − σk), σk

197 where we have used that σ1 = kMk2 ≤ 1. Now, we replace the Y0 that has been T T 198 obtained via, say, Gram-Schmidt by Y0R2D2 = D2ΣD2 (and, correspondingly, X0 by T 199 X0R2D2 ). Essentially, this is the orthogonal Procrustes method, [9, §12.4.1, p.601], 200 applied to minΨ∈O kI − Y0Ψk2.This operation preserves the orthogonality of V0 =   p×p MX0 T 201 , but the new Y0 is symmetric with eigenvalue decomposition Y0 = D2ΣD2 . NY0 202 This gives

2 2 203 kY0 − Ipk2 = kΣ − Ipk2 = max |1 − σk| < max(1 − σk) <  . σk σk

204 In summary, if kU − U˜k2 <  and if we start the iterations indicated by (9) with 205 the Procrustes orthogonal completion X0,Y0 rather than the standard Gram-Schmidt 206 process, we obtain Algorithm 1 with the starting conditions

2 207 (11) kIp − Mk2 < , kNk2 = kX0k2 < , kY0 − Ipk2 <  .

Algorithm 1 Stiefel logarithm, iterative procedure Input: base point U ∈ St(n, p) and U˜ ∈ St(n, p) ‘close’ to base point, τ > 0 conver- gence threshold 1: M := U T U˜ ∈ Rp×p 2: QN := U˜ − UM ∈ Rn×p {(thin) qr-decomp. of normal component of U˜}   MX0 3: V0 := ∈ O2p×2p {orthogonal completion and Procrustes} NY0 4: for k = 0, 1, 2,... do  T  Ak −Bk 5: := logm(Vk) {matrix log, Ak,Ck skew} Bk Ck 6: if kCkk2 ≤ τ then 7: break 8: end if 9: Φk = exp (−Ck) {matrix exp, Φk orthogonal} m   Ip 0 10: Vk+1 := VkWk, where Wk := {update} 0 Φk 11: end for St ˜ Output: ∆ := LogU (U) = UAk + QBk ∈ TU St(n, p)

208 209 Computational costs. W.l.o.g. suppose that n ≥ p. In fact the most important 210 case in practical applications is n  p. Because of the matrix product in step 1 and the 211 qr-decomposition in step 2 of Algorithm 1, the preparatory steps 1–3 require O(np2) 212 FLOPS. The dominating costs in the iterative loop, steps 5–10, are the evaluation of

This manuscript is for review purposes only. A MATRIX-ALGEBRAIC STIEFEL LOGARITHM 7

213 the matrix logarithm for a 2p-by-2p orthogonal matrix and the matrix exponential 214 for a p-by-p skew-symmetric matrix in every iteration, both of which can be achieved 215 efficiently via the Schur decomposition. The costs are O(p3), see [9, Alg. 7.5.2]. 216 A MATLAB function for Algorithm 1 is in Appendix C.1.

217 4. Convergence proof. In this section, we establish the convergence of Algo- 218 rithm 1 under suitable conditions. We state the main result as Theorem 2; the proof is 219 subdivided into the auxiliary results Lemma 3, and Lemma 4 as well as Lemma 5 that 220 appears in Appendix A. An essential requirement is that the point U˜ ∈ St(n, p) that 221 is to be mapped to the tangent space TU St(n, p) is sufficiently close to the base point 222 U ∈ St(n, p) in the sense that kU − U˜k2 < . Throughout, we will make extensive use 223 of Dynkin’s explicit BCH formula [19, §1.3, p. 22].

224 Theorem 2. Let U, U˜ ∈ St(n, p). Assume that kU − U˜k2 < . Let (Vk)k be the 225 sequence of orthogonal matrices generated by Algorithm 1. 226 If  < 0.0912, then Algorithm 1 converges to a limit matrix V∞ := limk→∞ Vk 227 such that  T   T  A∞ −B∞ A∞ −B∞ 228 logm(V∞) := = . B∞ C∞ B∞ 0

229 Given a numerical convergence threshold τ > 0, see Algorithm 1, line 7, the algorithm log(kC0k2)−log(τ) 230 requires at most k = d log(2) e − 1 iteration steps to meet the convergence 231 criterion under the above conditions. 232 Remark 1. Algorithm 1 generates a sequence of orthonormal matrices     Ip 0 Ip 0 233 (12) Vk+1 = VkWk = V0(W0W1 ...Wk) = V0 ... ∈ O2p×2p. 0 Φ0 0 Φk

234 The proof of Theorem 2 will show that limk→∞ Wk = I2p, see (23). Therefore, the 235 sequence of orthogonal products Φ0 ... Φk converges to a limit Φ∞ for k → ∞. The 236 limit Φ∞ solves (5). However, it is not required to actually form Φ∞. 237 In pursuit of the proof of Theorem 2, we first show that if the norm of the matrix 238 logarithm of the orthogonal matrix Vk produced by Algorithm 1 at iteration k is 239 sufficiently small, then the norm of the lower p-by-p diagonal block of the matrix 240 logarithm of the next iterate Vk+1 is strictly decreasing by a constant factor.

241 Lemma 3. Let U, U˜ ∈ St(n, p). Let (Vk)k ⊂ O2p×2p be the sequence of orthogonal 242 matrices generated by Algorithm 1. Suppose that at stage k, it holds

 T  √ Ak −Bk 1 243 (13) k logm(Vk)k2 := k k2 < ( 5 − 1). Bk Ck 2

 T  Ak+1 −Bk+1 244 Then logm(Vk+1) := features a lower (p×p)-diagonal block of norm Bk+1 Ck+1

1 245 kC k < αkC k , 0 < α < . k+1 2 k 2 2

   T  MXk Ak −Bk 246 Proof. Given Vk = = expm , Algorithm 1 computes NYk Bk Ck 247 the next iterate Vk+1 via 248 Vk+1 := VkWk,

This manuscript is for review purposes only. 8 RALF ZIMMERMANN   Ip 0 249 where Wk := . For brevity, we introduce the notation LV := 0 expm(−Ck) 250 logm(V ) for the matrix logarithm. Recall that [V,W ] := VW − WV denotes the 251 commutator or Lie-bracket of the matrices V,W . From Dynkin’s formula for the 252 Baker-Campbell-Hausdorff series, see [19, §1.3, p. 22], we obtain

253 LVk+1 = logm(VkWk) 1 254 = L + L + [L ,L ] Vk Wk 2 Vk Wk 1 255 + L , [L ,L ] + L , [L ,L ] 12 Vk Vk Wk Wk Wk Vk ∞ 1   X 256 − L , L , [L ,L ] + z (L ,L ), 24 Wk Vk Vk Wk l Vk Wk 257 l=5 P∞ 258 where l=5 zl(LVk ,LWk ) =: h.o.t.(5) are the terms of fifth order and higher in the 259 series. In the case at hand, it holds

 T     T  Ak −Bk 0 0 Ak −Bk 260 LVk + LWk = + = , Bk Ck 0 −Ck Bk 0  T  0 Bk Ck 261 [LVk ,LWk ] = . 262 CkBk 0

263 (Note that the basic idea in designing Algorithm 1 was exactly to choose Wk such 264 that the lower diagonal block in the BCH-series cancels in the first order terms.) 265 The third and fourth order terms are  T T T 2  1 −2Bk CkBk AkBk Ck − 2Bk Ck 266 2 T T , and 12 2Ck Bk − CkBkAk BkBk Ck + CkBkBk  T 3 T 2  1 0 −Bk Ck + AkBk Ck 267 3 2 T 2 2 T , respectively. 268 24 −Ck Bk + Ck BkAk BkBk Ck − Ck BkBk

269 Therefore, the series expansion for the lower diagonal block in logm(Vk+1) starts with 270 the terms of third order: 1 1 271 kC k = k (B BT C + C B BT ) − (B BT C2 − C2B BT ) + h.o.t.(5)k k+1 2 12 k k k k k k 24 k k k k k k 2   1 2 1 2 kh.o.t.(5)k2 272 (17a) ≤ kBkk2 + kBkk2kCkk2 + kCkk2. 273 6 12 kCkk2

274 We tackle the higher order terms via Lemma√ 5 from the appendix. The lemma applies 275 1 because kCkk2 = kLWk k2 ≤ kLVk k2 < 2 ( 5 − 1) < 1. In this setting, it gives ∞ ∞ X X l−1 276 kh.o.t.(5)k2 ≤ kzl(LVk ,LWk )k2 < kLVk k kLWk k2, l=5 l=5

277 since each of the “letters” LVk ,LWk appears at least once in every “word” that con-

278 tributes to zk(LVk ,LWk ), see Appendix A and [20, 16, 21].

279 Writing s := kLVk k2 and substituting in (17a) leads to

∞ ! 1 1 X 280 (18) kC k < s2 + s3 + sl kC k =: αkC k . k+1 2 6 12 k 2 k 2 l=4

This manuscript is for review purposes only. A MATRIX-ALGEBRAIC STIEFEL LOGARITHM 9

P∞ l 1 281 The proof is complete, if we can show that α < 1. Note that l=4 s = 1−s − 1 − s − 282 s2 − s3. As a consequence

s2 5 11 283 α < 1 ⇔ − s2 − s3 < 1. 1 − s 6 12

2 284 An obvious bound on the size of s is obtained via observing that s < 1, if s < √ 1−s 1 1 285 2 ( 5 − 1) ≈ 0.618. The corresponding α is 0.4653 < 2 . A sharper bound can be 286 obtained via solving the associated quartic equation. This shows that the inequality 287 even holds for all s < 0.7111.

288 In√ order to make use of Lemma 3, we establish conditions such that k logm(Vk)k2 < 1 289 2 ( 5 − 1) holds throughout the iterations of Algorithm 1. 290 This is the goal of the the next lemma. It relies on the auxiliary results Propo- 291 sition 6, Proposition 7 and Lemma 8 from Appendix B. Proposition 6 shows that 292 k expm(C) − Ik2 < kCk2 for C skew-symmetric; Proposition 7 establishes a bound in 293 the opposite direction: if V is orthogonal such that kV −Ik2 < r, then k logm(V )k2 < q r2 r2 294 r 1 − 4 /(1 − 2 ). Finally, Lemma 8 shows that kV0 − Ik2 < 2 for the first iterate 295 V0 of Algorithm 1, provided that kU − U˜k2 < .

296 Lemma 4. Let U, U˜ ∈ St(n, p) with kU − U˜k2 < . Let (Vk)k ⊂ O2p×2p be the   MXk 297 sequence of orthogonal matrices generated by Algorithm 1, where Vk = . NYk √ 1−2 2˜ 2 298 Let ˜ = 2 1−22 and ˆ := (e − 1) +  +  . If 0 <  is small enough such that q 2 1− ˆ √ 299 4 1 ˆ ˆ2 < 2 ( 5 − 1), then 1− 2  T  √ Ak −Bk 1 300 k logm(Vk)k2 = k k2 < ( 5 − 1) for all k. Bk Ck 2 √ 1 301 Proof. Let δ0 := 2 ( 5 − 1). By Lemma 8 from Appendix B, it holds √ 1 − 2 302 k log (V )k < 2 = ˜ (< δ ). m 0 2 1 − 22 0

 T  A0 −B0 303 In particular, ˜ > k k2 ≥ kC0k2. By Algorithm 1,Φ0 = expm(−C0), B0 C0 304 where Φ0 is orthogonal. By Proposition 6 from Appendix B

305 kΦ0 − Ik2 ≤ kC0k2 < .˜

306 Writing V1 = I + (V1 − I) =: I + E1, this leads to the estimate   M − IX0Φ0 307 kE1k2 = k k2 NY0Φ0 − I       M − I 0 0 0 0 X0Φ0 308 = k + + k2 0 Y0(Φ0 − I) 0 Y0 − I N 0

309 ≤ max{kM − Ik2, kY0(Φ0 − I)k2} + kY0 − Ik2 + max{kNk2, kX0Φ0k2} 2 2 310 ≤ max{, kY0k2k(Φ0 − I)k2} +  +  ≤ ˜+  +  2˜ 2 312311 ≤ (e − 1) +  +  =, ˆ

This manuscript is for review purposes only. 10 RALF ZIMMERMANN

313 where we have used (11) and the fact that kY0k2 ≤ 1, see (33a), (33b). By Lemma 8, q ˆ2 ˆ2 314 k logm(V1)k2 < ˆ 1 − 4 /(1 − 2 ) < δ0. Thus, the claim holds for k = 0, 1. 1 ˜ 315 Lemma 3 applies to k logm(V0)k2 and leads to kC1k2 < 2 kC0k2 < 2 for the lower 316 diagonal block C1 of the next iterate logm(V1). Therefore, by using Proposition 6 317 once more, we see that ˜ 318 kΦ − Ik ≤ kC k < . 1 2 1 2 2

319 By induction, we obtain Vk = I + (Vk − I) =: I + Ek with   MX0Φˆ k−1 320 kEkk2 = k − Ik2 NY0Φˆ k−1       M − I 0 0 0 0 X0Φˆ k−1 321 = k + + k2 0 Y0(Φˆ k−1 − I) 0 Y0 − I N 0

322 ≤ max{kM − Ik2, kY0(Φˆ k−1 − I)k2} + kY0 − Ik2 + max{kNk2, kX0Φˆ k−1k2} ˆ 2 324323 (20a) ≤ max{, kY0k2kΦk−1 − Ik2} +  + .

325 where Φˆ k−1 = Φ0 ··· Φk−1. 326 We can estimate kΦˆ k−1 − Ik2 as follows: By the induction hypothesis, we assume 327 that we have checked that k logm(Vj)k2 < δ0 for j = 0, . . . , k − 1. Hence, Lemma 3 1 1 ˜ 328 ensures that kCjk2 < 2 kCj−1k2 < . . . < 2j kC0k2 < 2j for the lower diagonal block ˜ 329 of logm(Vj), j = 0, . . . , k − 1. As above, this gives kΦj − Ik2 ≤ kCjk2 < 2j . We thus ˜ 330 may write Φj = I + (Φj − I) =: I + Γj with kΓjk2 =: gj < 2j . This gives

331 (21) kΦˆ k−1 − Ik2 = k(I + Γ1) ··· (I + Γk−1) − Ik2 ≤ (1 + g1) ··· (1 + gk−1) − 1.

332 It holds k−1  k−1 k−1 ∞ Y X X X ˜ 333 ln (1 + g ) = ln(1 + g ) ≤ g ≤ = 2˜.  j  j j 2j j=0 j=0 j=0 j=0

334 Using this estimate in (21) gives

2˜ 335 kΦˆ k−1 − Ik2 < e − 1

336 and we finally arrive at

2˜ 2 337 kEkk2 ≤ (e − 1) +  +  =. ˆ

338 Recalling (20a), we have Vk = I + Ek with kEkk2 < ˆ at every iteration k. By q ˆ2 1− 4 339 Lemma 8, k logm(Vk)k2 < ˆ ˆ2 and we see that the postulate on the size of  is 1− 2 340 such that k logm(Vk)k2 < δ0. Thus Lemma 3 indeed applies at iteration k, which 341 closes the induction. 1 q 2 1− ˆ √   2 342 4 √ 1 Remark: The inequality ˆ ˆ2 < δ0 holds precisely for ˆ < 2 1 − 2 =: 1− 2 1+δ0 2˜ 2 343 ˆ0. A further calculations shows that if  < 0.0912, then ˆ = (e − 1) +  +  < ˆ0, 344 i.e., the conditions of Lemma 4 hold, for all  < 0.0912. 345 With the tools established above at hand, we are now in a position to prove 346 Theorem 2.

This manuscript is for review purposes only. A MATRIX-ALGEBRAIC STIEFEL LOGARITHM 11

347 Proof (Theorem 2). Let (Vk)k∈N0 be the sequence of orthogonal matrices gener- 348 ated by Algorithm 1. By Lemma 3 and Lemma 4, it holds  T  √ Ak −Bk 1 k+1 349 (22) k logm Vkk2 := k k2 < ( 5 − 1), kCk+1k2 < α kC0k2 Bk Ck 2

1 350 for all k ≥ 0, where 0 < α < 2 . From this equation and the continuity of the matrix 351 exponential, we obtain     Ip 0 Ip 0 352 (23) lim Wk = lim = . k→∞ k→∞ 0 expm(−Ck) 0 Ip 353 The convergence result is now an immediate consequence of Algorithm 1, step 10. 354 The upper bound on the iteration count required for numerical convergence is also 355 obvious from (22).

356 5. Examples and experimental results. In this section, we discuss a special 357 case that can be treated analytically. Following, we present numerical results on the 358 performance of Algorithm 1.

359 5.1. A special case. Here, we consider the special situation, where the two 360 points U, U˜ ∈ St(n, p) are such that their columns span the same subspace.4 Hence, T 361 there exists an orthogonal matrix M ∈ Op×p such that U˜ = UM = UM +(I −UU )0. M 0  362 In this case, Algorithm 1 produces the initial matrices V0 = and Φ0 = 0 Y0   −1 Ip 0 363 expm(− logm(Y0)) = Y0 . Note that the corresponding W0 = −1 com- 0 Y0 364 mutes with V0. Thus, we have the reduced BCH formula logm(V0W0) = logm(V0) + log (M) 0 365 log (W ) = m , i.e., Algorithm 1 converges after a single iteration and m 0 0 0 366 gives St 367 (24) LogU (UM) = U logm(M). 368 (Of course, it is also straight forward to show this directly without invoking Algo- iϕ1 iϕp 369 rithm 1.) Let σ(M) = {e , . . . , e } be the spectrum of M ∈ Op×p and suppose that 370 M is such that none of its eigenvalues is on the negative real axis, i.e., ϕj ∈ (−π, π). 371 Then, the maximal Riemannian distance between two points U and UM is bounded 372 by p 373 dist(U, UM) = hU logm(M),U logm(M)iU 1 p 1 1  2 1 X  2 374 = tr(log (M)T log (M)) = ϕ2 . 2 m m 2 j 375 j=1 376 As a consequence  p p  2 π, p even, 377 dist(U, UM) < q p−1  2 π, p odd. 378 The latter fact holds, because the eigenvalues of M come in complex conjugate pairs. iϕj 379 Hence, if p is odd, there is at least one real eigenvalue λj = e and because ϕj ∈ 380 (−π, π), there is at least one zero argument ϕj = 0. Related is [6, eq. (2.15)].

4We may alternatively express this by saying that [U] := colspan(U) and [U˜] := colspan(U˜) are the same points on the Grassmann manifold [U] = [U˜] ∈ Gr(n, p).

This manuscript is for review purposes only. 12 RALF ZIMMERMANN

381 5.2. Numerical performance. First, we try to mimic the experiments featured 382 in [18, §5.4]. Fig. 5.5 (lower left) of the aforementioned reference shows the average 383 iteration count when applying the optimization-based Stiefel logarithm to matrices 384 within a Riemannian annulus of inner radius 0.4π and outer radius 0.44π around T 385 (Ip, 0) ∈ St(n, p) for dimensions of n = 10, p = 2. Convergence is detected, if −7 386 kCkkF < τ = 10 , where Ck is the same as in Algorithm 1. ([18, Alg. 4, p. 91] 387 uses τ 2 < 10−14). Since [18, §5.4] does not list the precise input data, we create 388 comparable data randomly. To this end, we fix an arbitrary point U ∈ St(10, 2) and 389 create artificially but randomly another point U˜ ∈ St(10, 2) such that the Riemannian 390 distance from U to U˜ is exactly 0.44π. For full comparability, we replace the 2-norm 391 in Algorithm 1, line 7 with the Frobenius norm. We average over 1000 random 392 experiments and arrive at an average iteration count of k¯ = 7.83. A MATLAB script 393 that performs the required computations is available in Section S2. When the distance 394 of U and U˜ is lowered to 0.4π, the average iteration count drops to a value of k¯ = 6.92. 395 As a second experiment, we now return to the 2-norm and lower the convergence −13 396 threshold to kCkk2 < τ = 10 in the convergence criterion of Algorithm 1. We cre- 397 ate randomly points U, U˜ ∈ St(n, p) that are also a Riemannian distance of 0.44π away 398 from each other, where we consider various different dimensions (n, p), see Table 1. St ˜ 399 We apply Algorithm 1 to compute ∆ = LogU (U).

Table 1 −13 Convergence of Algorithm 1 for random data to an accuracy of kCkk2 ≤ 10 .   ˜ ˜ St ˜ (n, p) dist U, U kU − Uk2 iters. k∆ − LogU (U)k2 time (10,2) 0.44π 1.0179 16 8.7903e-15 0.01s (10,2) 0.89π 1.7117 95 4.1934e-13 0.06s (1,000, 200) 0.44π 0.1616 5 1.5119e-14 0.7s (1,000, 200) 0.89π 0.3256 7 1.7272e-14 0.8s (1,000, 900) 0.44π 0.1234 4 9.6999e-14 16.1s (1,000, 900) 0.89π 0.2491 5 7.9052e-14 21.0s (100,000, 500) 0.44π 0.0875 4 5.9857e-14 13.1s (100,000, 500) 0.89π 0.1768 5 6.1041e-14 14.0s

400 Figure 1 shows the associated convergence histories. The associated computation 401 times5 are listed in Table 1. As can be seen from the figure and the table, Algorithm 1 402 converges slowest (in terms of the iteration count) in the case of St(10, 2). Note that in 403 this case, the constant kU − U˜k2 that played a major role in the convergence analysis 404 of Algorithm 1 is largest. Moreover, we observe that the algorithm converges in all 405 test cases even though in only one of the experiments the theoretical convergence 406 guarantee kU0 − U˜k2 < 0.09 is satisfied, so that the theoretical bound derived here 407 can probably be improved. Table 1 suggests that the impact of the size of kU − U˜k2 408 on the iteration count is more direct than that of the actual Riemannian distance. 409 We repeat the exercise with random data U, U˜ ∈ St(n, p) that are a distance of 410 0.89π apart, which is the lower bound for the injectivity radius on the Stiefel manifold 411 given in [18, eq. (5.14)]. In the case of St(10, 2), we hit a pair U, U˜, 412 where the associated valuekU − U˜k2 is so large that the conditions of Theorem 2 413 and Lemma 3, Lemma 4 do not hold. In fact, we have k logm(V0)k2 = 3.141 for the

⁵As measured on a Dell desktop computer endowed with six processor cores of type Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz.

This manuscript is for review purposes only. A MATRIX-ALGEBRAIC STIEFEL LOGARITHM 13

Fig. 1. Convergence of Algorithm 1 for random data $U, \tilde U \in St(n, p)$ for various $n$ and $p$ (convergence graphs of $\|C_k\|_2$ vs. iteration $k$ for St(10,2), St(1000,200), St(1000,900), St(100000,500)). Convergence accuracy is set to $\|C_k\|_2 \le 10^{-13}$. Left: convergence graphs for $\mathrm{dist}(U, \tilde U) = 0.44\pi$; right: for $\mathrm{dist}(U, \tilde U) = 0.89\pi$.

starting point of Algorithm 1 in this case, which is close to $\pi$. Yet, the algorithm converges, albeit very slowly; see Table 1, second row, and Figure 1, right side. In all of the other cases, Algorithm 1 converges in well under ten iterations, even for the larger test cases.

A MATLAB script that performs the required computations is available in Section S2.

5.3. Dependence of the convergence on the Riemannian and the Euclidean distance. In this section, we examine the convergence of Algorithm 1 depending on the Riemannian distance $\mathrm{dist}(U, \tilde U)$ and the distance $\|U - \tilde U\|_2$ in the Euclidean operator 2-norm. To this end, we create a random point $U \in St(n, p)$ with MATLAB by computing the thin QR decomposition of an $(n \times p)$ matrix with entries sampled uniformly from $(0, 1)$. Likewise, we create a random tangent vector $\tilde\Delta \in T_U St(n, p)$ by choosing randomly a skew-symmetric matrix $A = \tilde A - \tilde A^T \in \mathbb{R}^{p \times p}$ and a matrix $T \in \mathbb{R}^{n \times p}$, where the entries of $\tilde A$ and $T$ are again uniformly sampled from $(0, 1)$, and setting $\tilde\Delta = UA + (I - UU^T)T$. We normalize $\tilde\Delta$ according to the canonical metric, $\Delta = \tilde\Delta/\sqrt{\langle\tilde\Delta, \tilde\Delta\rangle_U}$, see Section 2. In this way, we obtain for every $t \in [0, \pi)$ a point $\tilde U = U(t)$ that is a Riemannian distance of $\mathrm{dist}(U, U(t)) = \|t\Delta\|_U = t$ away from $U$. We discretize the interval $[0.1, 0.9\pi)$ by 100 equidistant points $\{t_k \mid k = 1, \ldots, 100\}$ and compute
• the number of iterations until convergence when computing $\mathrm{Log}^{St}_U(U(t_k))$ with Algorithm 1 for $k = 1, \ldots, 100$;
• the distance in spectral norm $\|U - U(t_k)\|_2$, $k = 1, \ldots, 100$;
• the norm of the matrix logarithm of the first iterate, $\|\mathrm{logm}(V_0)\|_2$, from Algorithm 1, step 3.
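The random-data recipe of this paragraph can be written out directly. In the sketch below, the expression $\langle\Delta, \Delta\rangle_U = \mathrm{tr}(\Delta^T(I - \tfrac{1}{2}UU^T)\Delta)$ for the canonical inner product is the one recalled in Section 2; the NumPy transcription and variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 10
U, _ = np.linalg.qr(rng.random((n, p)))      # random point on St(n,p)
A_t = rng.random((p, p))
A = A_t - A_t.T                              # random skew-symmetric matrix
T = rng.random((n, p))
Delta = U @ A + (T - U @ (U.T @ T))          # tangent vector at U
# canonical norm: <D,D>_U = tr(D^T (I - 0.5*U*U^T) D)
can = np.trace(Delta.T @ (Delta - 0.5 * U @ (U.T @ Delta)))
Delta /= np.sqrt(can)                        # now <Delta,Delta>_U = 1
```

After the normalization, $t\Delta$ has canonical norm $t$, so $U(t) = \mathrm{Exp}^{St}_U(t\Delta)$ sits at Riemannian distance $t$ from $U$.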


The results are displayed in Figures 2, 3 and 4 for dimensions of St(10,000, 400), St(100, 10) and St(4, 2), respectively. In all cases, the convergence threshold was set to $\|C_k\|_2 < \tau = 10^{-13}$. The algorithm converged in all cases where $\|\mathrm{logm}(V_0)\|_2 < \pi$ and produced a tangent vector $\Delta(t_k) := \mathrm{Log}^{St}_U(U(t_k))$ of accuracy $\|\Delta(t_k) - t_k\Delta\|_2 < 10^{-13}$. A MATLAB script that performs the required computations is available in Section S3. In the case of St(4, 2), the algorithm starts to fail for $t_k \approx \frac{\pi}{2}$, where $\|\mathrm{logm}(V_0)\|_2$ jumps


Fig. 2. Convergence of Algorithm 1 for $U, \tilde U = U(t_k) = \mathrm{Exp}^{St}_U(t_k\Delta) \in St(n, p)$, where $\Delta$ is a random tangent vector of canonical norm 1 and $n = 10,000$, $p = 400$. Convergence accuracy is set to $\|C_k\|_2 \le 10^{-13}$. Left: number of iterations until convergence vs. $\mathrm{dist}(U, \tilde U)$; middle: $\|U - \tilde U\|_2$ vs. $\mathrm{dist}(U, \tilde U)$; right: $\|\mathrm{logm}(V_0)\|_2$ vs. $\mathrm{dist}(U, \tilde U)$.

to a value of $\pi$. This indicates that $V_0$ features (up to numerical errors) an eigenvalue $\lambda = -1$, so that the standard principal matrix logarithm is no longer well defined. In all the experiments that were conducted, this behavior was observed only for small values of $p < 8$, while no random data set was ever produced for which Algorithm 1 failed for $t < 0.9\pi$ and $p > 10$. The figures suggest that for small column numbers $p$, the ratio between the Riemannian distance $\mathrm{dist}(U, \tilde U)$ and the spectral distance $\|U - \tilde U\|_2$ is smaller than in higher dimensions. Moreover, for smaller $p$, it seems to be more likely to hit a random tangent direction along which Algorithm 1 fails early than for higher $p$. This may partly be explained by the star-shaped nature of the domain of injectivity of the Riemannian exponential, [13, Lemma 5.7], and the richer variety of directions in higher dimensions.

From these observations, it is tempting to conjecture that Algorithm 1 will converge whenever $\|\mathrm{logm}(V_0)\|_2 < \pi$. However, these results are based on a limited notion of randomness, and a more thorough examination of the numerical behavior of Algorithm 1 is required to obtain conclusive results, which is beyond the scope of this work. Note that the domain of convergence of Algorithm 1 is related to the injectivity radius of $St(n, p)$, but it does not have to be the same. In Section S1 of the supplement, we state an explicit example in $St(4, 2)$, where Algorithm 1 produces a first iterate $V_0$ with $\lambda = -1$ for an input pair $U, \tilde U \in St(4, 2)$ with $\mathrm{dist}(U, \tilde U) = \frac{\pi}{2}$,



Fig. 3. Same as Figure 2, but for n = 100, p = 10.


Fig. 4. Same as Figure 2, but for n = 4, p = 2.

while the injectivity radius is estimated to be $\approx 0.71\pi$ in [18, §5]. An analytical investigation in $St(4, 2)$ might be possible and may shed more light on the precise value of the Stiefel manifold's injectivity radius.
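The role of the eigenvalue $\lambda = -1$ is the same as for a plain $2 \times 2$ rotation block: the principal matrix logarithm recovers the rotation angle only while it stays below $\pi$, and $\|\mathrm{logm}(\cdot)\|_2$ approaches $\pi$ as an eigenvalue pair approaches $-1$. A minimal illustration of this boundary behavior (not taken from the paper):

```python
import numpy as np
from scipy.linalg import logm

def rot(theta):
    """2x2 rotation; its eigenvalues are exp(+-i*theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# well inside the principal branch: the rotation angle is recovered
angle = np.linalg.norm(logm(rot(3.0)), 2)          # ~3.0 < pi
# close to the branch point lambda = -1 the norm approaches pi
near = np.linalg.norm(logm(rot(np.pi - 1e-3)), 2)  # ~pi - 1e-3
```

At $\theta = \pi$ itself, the matrix has the eigenvalue $-1$ and the principal logarithm is no longer well defined, mirroring the failure mode observed for St(4, 2).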

6. Conclusions and outlook. We have presented a matrix-algebraic derivation of an algorithm for evaluating the Riemannian logarithm $\mathrm{Log}^{St}_U(\tilde U)$ on the Stiefel


manifold. In contrast to [18, Alg. 4, p. 91], the construction here is not based on an optimization procedure but on an iterative solution of a non-linear matrix equation. Yet, it turns out that both approaches lead to essentially the same numerical scheme. More precisely, our Algorithm 1 coincides with [18, Alg. 4, p. 91] when a unit step size is employed in the optimization scheme associated with the latter method. Apart from its comparative simplicity, a key benefit is that our matrix-algebraic approach allows for a convergence analysis that requires estimates on neither gradients nor Hessians, and we are able to prove that the convergence rate of Algorithm 1 is at least linear. This, in turn, proves the local linear convergence of [18, Alg. 4, p. 91] when a unit step size is used. The algorithm shows very promising performance in numerical experiments, even when the dimensions $n$, $p$ become large.

So far, we have carried out a theoretical local convergence analysis. Open questions to be tackled in the future include estimates on how large the convergence domain of Algorithm 1 is in terms of the Riemannian distance $\mathrm{dist}(U, \tilde U)$ of the input points. This is related to the question of determining the injectivity radius of the Stiefel manifold. Estimates on the injectivity radius are featured in [18, §5.2.1].

Appendix A. A sharper majorizing series for Goldberg's exponential series. As an alternative to Dynkin's BCH formula of nested commutators, Goldberg has shown in [8] that the solution to the exponential equation

$\mathrm{expm}(X)\,\mathrm{expm}(Y) = \mathrm{expm}(Z)$

can be written as a formal series

(26)  $Z = X + Y + \sum_{k=2}^{\infty} z_k(X, Y), \qquad z_k(X, Y) = \sum_{w,\,|w|=k} g_w\, w.$

Each term $z_k(X, Y)$ in (26) is the sum over all words of length $k$ in the alphabet $\{X, Y\}$. For example, $YXYX^2$ and $X^2YXY^2$ are such words of length 5 and 6 and thus contribute to $z_5(X, Y)$ and $z_6(X, Y)$, respectively. The coefficients are rational numbers $g_w \in \mathbb{Q}$, called Goldberg coefficients.

Thompson [20] has shown that the series converges provided that $\|X\|, \|Y\| \le \mu < 1$ for any submultiplicative norm $\|\cdot\|$. More precisely, his result is that $\|z_k(X, Y)\| = \|\sum_{w,|w|=k} g_w w\| \le 2\mu^k$ for $k \ge 2$, see also [16, eq. 2]. In the next lemma, we improve this bound by cutting the factor 2.

Lemma 5. Let $\|X\|, \|Y\| \le \mu < 1$. The Goldberg series is majorized by

$\|Z\| < \|X\| + \|Y\| + \sum_{k=2}^{\infty} \mu^k.$
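Before turning to the proof, the majorization can be sanity-checked numerically: for $\|X\|_2, \|Y\|_2 \le \mu < 1$, the tail sums to $\mu^2/(1-\mu)$. An illustrative check in the spectral norm (the test matrices are arbitrary, not from the paper):

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(1)
mu = 0.3
X = rng.standard_normal((4, 4)); X *= mu / np.linalg.norm(X, 2)
Y = rng.standard_normal((4, 4)); Y *= mu / np.linalg.norm(Y, 2)
Z = logm(expm(X) @ expm(Y))                 # the BCH solution Z
# Lemma 5 bound: ||X|| + ||Y|| + sum_{k>=2} mu^k = ||X|| + ||Y|| + mu^2/(1-mu)
bound = np.linalg.norm(X, 2) + np.linalg.norm(Y, 2) + mu**2 / (1 - mu)
```

For small $\mu$ the bound is typically quite loose, since the leading correction $z_2 = \frac{1}{2}[X, Y]$ is of order $\mu^2$ but usually far from its worst case.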

Proof. One ingredient of Thompson's proof is the following basic estimate on binomial terms:

(27)  $m \binom{m-1}{\lfloor \frac{m-1}{2} \rfloor} \ge 2^{m-1}.$

Here, $\lfloor x \rfloor$ denotes the largest integer smaller than or equal to $x$. Thompson's argument is that $2^{m-1} = (1+1)^{m-1} = \sum_{l=0}^{m-1} \binom{m-1}{l}$ and that $\binom{m-1}{\lfloor \frac{m-1}{2} \rfloor}$ is the largest out of the $m$ terms in the binomial sum. (It appears twice if $m-1$ is odd.) In the following,


we prefer to write this term using the ceil operator as $m\binom{m-1}{\lfloor\frac{m-1}{2}\rfloor} = m\binom{m-1}{\lceil\frac{m-1}{2}\rceil}$, because in this way, the same index $m-1$ appears in the upper and lower entry of the binomial coefficient.

For larger $m$, the inequality (27) can in fact be improved by a factor of 2:

(28)  Claim: $m\binom{m-1}{\lceil\frac{m-1}{2}\rceil} > 2^m$ for all $m \ge 7$.

For $m = 7$, we have $7\binom{6}{3} = 7 \cdot 20 = 140 > 128 = 2^7$; for $m = 8$, the inequality evaluates to $280 > 256 = 2^8$. To prove the claim, we proceed by induction.

Case 1: "$m$ even". In this case, $\lceil\frac{m}{2}\rceil = \frac{m}{2} = \lceil\frac{m-1}{2}\rceil$ and

(29a)  $(m+1)\binom{m}{\lceil\frac{m}{2}\rceil} = (m+1)\left[\binom{m-1}{\frac{m}{2}-1} + \binom{m-1}{\frac{m}{2}}\right] = 2(m+1)\binom{m-1}{\lceil\frac{m-1}{2}\rceil} > \frac{2(m+1)}{m}\,2^m > 2^{m+1},$

where we have used the symmetry in the Pascal triangle ($m-1$ is odd) and the induction hypothesis to arrive at (29a).

Case 2: "$m$ odd". In this case, $\lceil\frac{m}{2}\rceil = \frac{m+1}{2}$ and

(30a)  $(m+1)\binom{m}{\lceil\frac{m}{2}\rceil} = (m+1)\left[\binom{m-1}{\frac{m+1}{2}-1} + \binom{m-1}{\frac{m+1}{2}}\right] = (m+1)\left[\binom{m-1}{\lceil\frac{m-1}{2}\rceil} + \binom{m-1}{\lceil\frac{m}{2}\rceil}\right].$

Note that $\binom{m-1}{\lceil\frac{m}{2}\rceil}$ is the second-to-largest term in the binomial expansion of $(1+1)^{m-1}$. Moreover, since $m-1$ is even, the relation to the largest term is

$\binom{m-1}{\lceil\frac{m}{2}\rceil} = \frac{m-1}{m+1}\binom{m-1}{\lceil\frac{m-1}{2}\rceil}.$

Substituting in (30a) and applying the induction hypothesis gives

$(m+1)\binom{m}{\lceil\frac{m}{2}\rceil} > (m+1)\left[\frac{2^m}{m} + \frac{m-1}{m+1}\,\frac{2^m}{m}\right] = \left[\frac{m+1}{m} + \frac{m-1}{m}\right]2^m = \frac{2m}{m}\,2^m = 2^{m+1}.$

Using (28) rather than (27) in Thompson's original proof leads to the improved bound of $\|z_k(X, Y)\| \le \mu^k$ for $k \ge 7$.

We tackle the terms involving words of lengths $k = 2, 3, \ldots, 6$ manually. The reference [21] lists explicit expressions for the summands in the Goldberg BCH series up to $z_8$. The first three of them read

$z_2(X, Y) = \frac{1}{2}(XY - YX) \;\Rightarrow\; \|z_2(X, Y)\| \le \frac{2}{2}\mu^2.$ ✓

$z_3(X, Y) = \frac{1}{12}\left(X^2Y - 2XYX + XY^2 + YX^2 - 2YXY + Y^2X\right) \;\Rightarrow\; \|z_3(X, Y)\| \le \frac{8}{12}\mu^3.$ ✓

$z_4(X, Y) = \frac{1}{24}\left(X^2Y^2 - 2XYXY + 2YXYX - Y^2X^2\right) \;\Rightarrow\; \|z_4(X, Y)\| \le \frac{6}{24}\mu^4.$ ✓


The expressions for $z_5(X, Y)$ and $z_6(X, Y)$ are too cumbersome to be restated here. However, for our purposes, a very rough counting argument is sufficient: The expression for $z_5(X, Y)$ features 30 length-5 words with non-zero Goldberg coefficient, and the largest Goldberg coefficient is $\frac{1}{30}$. Hence, $\|z_5(X, Y)\| = \|\sum_{w,|w|=5} g_w w\| < \frac{30}{30}\mu^5$. ✓ (A more careful consideration reveals $\|z_5(X, Y)\| \le \frac{176}{720}\mu^5$.)

The expression for $z_6(X, Y)$ features 28 length-6 words with non-zero Goldberg coefficient, and the largest Goldberg coefficient is $\frac{1}{60}$. Hence, $\|z_6(X, Y)\| = \|\sum_{w,|w|=6} g_w w\| \le \frac{28}{60}\mu^6$. ✓
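The central combinatorial ingredient, Claim (28), is also easy to check directly for a large range of $m$ (a quick sketch using only the standard library):

```python
from math import comb, ceil

# Claim (28): m * C(m-1, ceil((m-1)/2)) > 2**m for all m >= 7
def claim_28(m):
    return m * comb(m - 1, ceil((m - 1) / 2)) > 2 ** m

checked = all(claim_28(m) for m in range(7, 500))
# the claim genuinely needs m >= 7: for m = 6, 6*C(5,3) = 60 < 64 = 2**6
```

Python's exact integer arithmetic makes this check reliable for arbitrarily large $m$, in contrast to a floating-point evaluation of the binomial coefficients.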

Appendix B. Norm bound for logm(V₀).

Proposition 6. Let $C \in \mathbb{R}^{p \times p}$ be skew-symmetric with $\|C\|_2 < \pi$. Then

$\|\mathrm{expm}(C) - I\|_2 < \|C\|_2.$

Proof. Since $C$ is skew-symmetric, it features an EVD $C = Q\Lambda Q^H$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p) = \mathrm{diag}(i\varphi_1, \ldots, i\varphi_p)$, where $\varphi_j \in (-\pi, \pi)$ and $\max_j |i\varphi_j| = \|C\|_2$. Therefore, $\mathrm{expm}(C) = Q\,\mathrm{expm}(\Lambda)\,Q^H$ with $\mathrm{expm}(\Lambda) = \mathrm{diag}(e^{i\varphi_1}, \ldots, e^{i\varphi_p})$ and

$\|\mathrm{expm}(C) - I\|_2 = \max_j |e^{i\varphi_j} - 1| < \max_j |\varphi_j| = \|C\|_2.$

(The latter estimate may also be deduced from Figure 5.)
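Proposition 6 is cheap to validate numerically. In the sketch below, the skew-symmetric test matrix is random and rescaled so that $\|C\|_2 = 0.9\pi < \pi$ (our own choice of test data):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
p = 6
B = rng.standard_normal((p, p))
C = B - B.T                                   # skew-symmetric
C *= 0.9 * np.pi / np.linalg.norm(C, 2)       # ||C||_2 = 0.9*pi < pi
lhs = np.linalg.norm(expm(C) - np.eye(p), 2)  # equals max_j |e^{i*phi_j} - 1|
rhs = np.linalg.norm(C, 2)
```

For $\|C\|_2 = 0.9\pi$, the left-hand side equals $2\sin(0.45\pi) \approx 1.975$, well below $0.9\pi \approx 2.827$.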

Proposition 7. Let $V \in O^{n \times n}$ be such that $\|V - I\|_2 < r < 1$. Then

$\|\mathrm{logm}(V)\|_2 < r\,\frac{\sqrt{1 - \frac{r^2}{4}}}{1 - \frac{r^2}{2}}.$

 q 2  q 2   r 1 − r 1 − r ys 4 4 564 |α| < arctan = arctan < r . x  r2  r2 s 1 − 2 1 − 2

565 As a consequence, q 1 − r2 H 4 k logm(V )k2 = kQ logm(Λ)Q k2 = max | ln(λ)| = max |iα| < r 2 . λ∈σ(V ) λ=eiα∈σ(V ) r 566 1 − 2

567 Lemma 8. Let U, U˜ ∈ St(n, p) with kU − U˜k2 < . Let M,N,X0,Y0 and V0 :=   MX0 568 ∈ O2p×2p be as constructed in the first steps of Algorithm 1. NY0 569 Then √ 1 − 2 570 (32) k log (V )k < 2 . m 0 2 1 − 22



Fig. 5. Geometrical illustration of Proposition 7 in the complex plane.

571 Proof. Because V0 is orthogonal,   H MX0 572 (33a) 1 = kV0k2 ≥ ν(V0) = max |w w| kwk2=1 NY0     H MX0 0 573 (33b) ≥ max |(0, v ) | = kY0k2, 574 kvk2=1 NY0 v

575 where ν(V0) denotes the numerical radius of V0, see [10, eq. 1.21, p. 21]. Likewise, 576 kMk2 ≤ 1 so that the singular values of M and Y0 range between 0 and 1. Moreover, 577 by the Procrustes preprocessing outlined at the end of Section 3,

2 578 kNk2 = kX0k2 < , kM − Ik2 < , kY0 − Ipk2 <  ,

579 see (11). Combining these facts, we obtain V = I + (V − I) = I + E, where       M − IX0 M − I 0 0 X0 580 kEk2 = k k2 ≤ k + k2 NY0 − I 0 Y0 − I N 0 2 582581 ≤ max{kM − Ik2, kY0 − Ik2} + max{kNk2, kX0k2} < max{,  } +  = 2.

583 Applying Proposition 7 to V = I + E proves the claim.

584 Appendix C. MATLAB code.

585 C.1. Algorithm 1. 586 % 587 function [Delta, k, conv_hist, norm_logV0] = ... 588 Stiefel_Log_supp(U0, U1, tau) 589 %------590 %@author: Ralf Zimmermann, IMADA, SDU Odense 591 % 592 % Input arguments 593 % U0, U1 : points on St(n,p)

This manuscript is for review purposes only. 20 RALF ZIMMERMANN

594 % tau : convergence threshold 595 % Output arguments 596 % Delta : Log^{St}_U0(U1), 597 % i.e. tangent vector such that Exp^St_U0(Delta) = U1 598 % k : iteration count upon convergence 599 % supplementary output 600 % conv_hist : convergence history 601 % norm_logV0 : norm of matrix log of first iterate V0 602 %------603 % get dimensions 604 [n,p] = size(U0); 605 % store convergence history 606 conv_hist = [0]; 607 608 % step 1 609 M = U0’*U1; 610 % step 2 611 [Q,N] = qr(U1 - U0*M,0); % thin qr of normal component of U1 612 % step 3 613 [V, ~] = qr([M;N]); % orthogonal completion 614 615 % "Procrustes preprocessing" 616 [D,S,R] = svd(V(p+1:2*p,p+1:2*p)); 617 V(:,p+1:2*p) = V(:,p+1:2*p)*(R*D’); 618 V = [[M;N], V(:,p+1:2*p)]; % |M X0| 619 % now, V = |N Y0| 620 % just for the record 621 norm_logV0 = norm(logm(V),2); 622 623 % step 4: FOR-Loop 624 for k = 1:10000 625 % step 5 626 [LV, exitflag] = logm(V); 627 % standard matrix logarithm 628 % |Ak -Bk’| 629 % now, LV = |Bk Ck | 630 C = LV(p+1:2*p, p+1:2*p); % lower (pxp)-diagonal block 631 % steps 6 - 8: convergence check 632 normC = norm(C, 2); 633 conv_hist(k) = normC; 634 if normC

This manuscript is for review purposes only. A MATRIX-ALGEBRAIC STIEFEL LOGARITHM 21

644 % prepare output |A -B’| 645 % upon convergence, we have logm(V) = |B 0 | = LV 646 % A = LV(1:p,1:p); B = LV(p+1:2*p, 1:p) 647 % Delta = U0*A+Q*B 648 Delta = U0*LV(1:p,1:p) + Q*LV(p+1:2*p, 1:p); 649 return; 650 end 651 Note: The performance of this method may be enhanced by computing expm, logm 652 via a Schur decomposition.

REFERENCES653

654 [1] P.-A. Absil, R. Mahony, and R. Sepulchre, Riemannian geometry of Grassmann mani- 655 folds with a view on algorithmic computation, Acta Applicandae Mathematica, 80 (2004), 656 pp. 199–220, doi:10.1023/B:ACAP.0000013855.14971.91, http://dx.doi.org/10.1023/B% 3AACAP.0000013855.14971.91.657 658 [2] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds, 659 Princeton University Press, Princeton, New Jersey, 2008, http://press.princeton.edu/titles/ 8586.html.660 661 [3] L. Balzano, R. Nowak, and B. Recht, Online identification and tracking of subspaces from 662 highly incomplete information, in Proceedings of Allerton, September 2010. Available at http://arxiv.org/abs/1006.4046.663 664 [4] E. Begelfor and M. Werman, Affine invariance revisited, 2012 IEEE Conference on 665 Computer Vision and Pattern Recognition, 2 (2006), pp. 2087–2094, doi:http://doi. ieeecomputersociety.org/10.1109/CVPR.2006.50.666 667 [5] P. Benner, S. Gugercin, and K. Willcox, A survey of projection-based model re- 668 duction methods for parametric dynamical systems, SIAM Review, 57 (2015), 669 pp. 483–531, doi:10.1137/130932715, http://dx.doi.org/10.1137/130932715, arXiv:http://dx.doi.org/10.1137/130932715.670 671 [6] A. Edelman, T. A. Arias, and S. T. Smith, The geometry of algorithms with orthogonality 672 constraints, SIAM Journal on Matrix Analysis and Application, 20 (1999), pp. 303–353, 673 doi:10.1137/S0895479895290954, http://dx.doi.org/10.1137/S0895479895290954. 674 [7] K. Gallivan, A. Srivastava, X. Liu, and P. Van Dooren, Efficient algorithms for inferences 675 on Grassmann manifolds, in Statistical Signal Processing, 2003 IEEE Workshop on, 2003, pp. 315–318, doi:10.1109/SSP.2003.1289408.676 677 [8] K. Goldberg, The formal power series for log exey, Duke Math. J., 23 (1956), pp. 13–21, 678 doi:10.1215/S0012-7094-56-02302-X, http://dx.doi.org/10.1215/S0012-7094-56-02302-X. 679 [9] G. H. Golub and C. F. 
Van Loan, Matrix Computations, The John Hopkins University Press, Baltimore – London, 3 ed., 1996.680 681 [10] A. Greenbaum, Iterative Methods for solving linear systems, vol. 17 of Frontiers in applied 682 mathematics, SIAM Society for Industrial and Applied Mathematics, Philadelphia, 1997. 683 [11] N. J. Higham, Functions of Matrices: Theory and Computation, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008.684 685 [12] S. Kobayashi and K. Nomizu, Foundations of Differential Geometry, vol. I of Interscience 686 Tracts in Pure and Applied Mathematics no. 15, John Wiley & Sons, New York – London – Sidney, 1963.687 688 [13] J. Lee, Riemannian Manifolds: an Introduction to Curvature, Springer Verlag, New York – Berlin – Heidelberg, 1997.689 690 [14] Y. Man Lui, Advances in matrix manifolds for computer vision, Image and Vision Computing, 691 30 (2012), pp. 380–388, doi:http://dx.doi.org/10.1016/j.imavis.2011.08.002, http://www. 692 sciencedirect.com/science/article/pii/S0262885611000692. 693 [15] MATLAB, version 7.10.0 (R2010a), The MathWorks Inc., Natick, Massachusetts, 2010. 694 [16] M. Newman, S. Wasin, and R. C. Thompson, Convergence domains for the Campbell-Baker- 695 Hausdorff formula, Linear and Multilinear Algebra, 24 (1989), pp. 301–310. 696 [17] I. U. Rahman, I. Drori, V. C. Stodden, D. L. Donoho, and P. Schroder¨ , Multiscale 697 representations for manifold-valued data, SIAM J. Mult. Model. Simul., 4 (2005), pp. 1201– 1232.698 699 [18] Q. Rentmeesters, Algorithms for data fitting on some common homogeneous spaces, PhD 700 thesis, Universit´eCatholique de Louvain, Louvain, Belgium, July 2013/2015, http://hdl.

This manuscript is for review purposes only. 22 RALF ZIMMERMANN

handle.net/2078.1/132587.701 702 [19] W. Rossmann, Lie Groups: An Introduction Through Linear Groups, Oxford graduate 703 texts in mathematics, Oxford University Press, 2006, https://books.google.de/books?id= bAjulQ65W-UC.704 705 [20] R. C. Thompson, Convergence proof for Goldberg’s exponential series, Linear Algebra and its Applications, 121 (1989), pp. 3–7.706 707 [21] A. Van-Brunt and M. Visser, Simplifying the Reinsch algorithm for the Baker-Campbell- 708 Hausdorff series. arXiv:1501.05034, 2015, http://arxiv.org/abs/1501.05034.

This manuscript is for review purposes only.