HESSIAN OF THE RIEMANNIAN SQUARED DISTANCE ON CONNECTED LOCALLY SYMMETRIC SPACES WITH APPLICATIONS

Ricardo Ferreira*, João Xavier*

[email protected], [email protected]
Instituto de Sistemas e Robótica (ISR), Instituto Superior Técnico
Av. Rovisco Pais, 1049-001 Lisbon, Portugal

Abstract: Many optimization problems formulated on Riemannian manifolds involve their intrinsic Riemannian squared distance function. A notorious and important example is the centroid computation of a given finite constellation of points. In such setups, the implementation of the fast intrinsic Newton optimization scheme requires the availability of the Hessian of this intrinsic function. Here, a method for obtaining the Hessian of this function is presented for connected locally-symmetric spaces on which certain operations, e.g. the exponential, logarithm and curvature maps, are easily carried out. In particular, naturally reductive homogeneous spaces provide the needed information, hence applications will be shown from this set. We illustrate the application of this theoretical result in two engineering scenarios: (i) centroid computation and (ii) maximum a posteriori (MAP) estimation. Our results confirm the quadratic convergence rate of the intrinsic Newton method derived herein.

Keywords: Riemannian manifolds, Hessian of intrinsic distance, Intrinsic Newton method, Centroid computation, MAP estimation

1. INTRODUCTION AND MOTIVATION

Due to its quadratic convergence rate near the solution, Newton's method has for a long time been the method of choice for optimization problems, especially when high precision is required. This is true in many fields, from engineering to numerical analysis, where it is used extensively to obtain many digits of precision. Intrinsic quasi-Newton and Newton algorithms for optimization problems on smooth manifolds have been discussed (among others) in (Gabay, 1982), (Edelman et al., 1998) and (Manton, 2002). Applications can be found in robotics (Belta and Kumar, 2002), signal processing (Manton, 2005), image processing (Helmke et al., 2004), etc.

Implementation of the intrinsic Newton scheme relies on the availability of the intrinsic Hessian of the function to be optimized. In this paper, our interest lies in those optimization problems involving the intrinsic squared Riemannian distance. As a special case, this subsumes the class of optimization problems known as centroid computation.

Many applications of centroid computation exist in the literature. For example, (Moakher, 2002) mentions centroid computation in SO(3) for the study of plate tectonics and sequence-dependent continuum modeling of DNA. In these, experimental observations are obtained with a significant amount of noise that needs to be smoothed. Positive definite symmetric matrices are used as covariance matrices in (Pennec et al., 2004) for the statistical characterization of deformations and the encoding of principal diffusion directions in Diffusion Tensor Imaging (DTI), expanding the range of applications to medicine. Computation of centers of mass also finds applications in analyzing shapes in medical imaging, see (Fletcher et al., 2004). It is also a mandatory step when considering the extension of the K-means algorithm to manifolds.

Most of these approaches rely on gradient methods for computing the intrinsic Riemannian centroid, with a few exceptions: Hüper and Manton in (Hüper and Manton, 2005) developed a Newton method for the special orthogonal group, and (Absil et al., 2004) introduced a Newton method applicable to Grassmann manifolds which operates on an approximation of the intrinsic cost function, yielding the intrinsic centroid.

1.1 Contribution

We present and discuss a method for computing the intrinsic Hessian of the Riemannian squared distance function for locally symmetric Riemannian manifolds. This builds on our previous work in (Ferreira et al., 2006). In this paper we simplify the description of the intrinsic Hessian by providing a concise matrix expression in each tangent space. This entails an appreciable simplification with respect to (Ferreira et al., 2006): e.g. the need for parallel translation of tangent vectors is dropped, thus relaxing the amount of differential-geometric machinery initially required.

Adding to the already described cases of the embedded sphere Sⁿ, the special orthogonal group SO(n) and the symmetric positive definite matrices Sym⁺(n), the method will also be shown to work for the special Euclidean group SE(n) and the Grassmann manifold G(n,p). Notice that the real projective plane Pⁿ is a particular case of the Grassmann manifold (Pⁿ ≅ G(n+1, 1)). As an example of other applications, a simple example of MAP estimation is described and results are shown.

1.2 Paper Organization

The paper is structured as follows:

• Section 2 provides a short review of Newton's method on Riemannian manifolds and also presents the main result of this article: the matrix version of the Hessian of the Riemannian squared distance function in a given tangent space. This enhances our previous result in (Ferreira et al., 2006) and describes it in an implementation-friendly format.
• To construct the Hessian, several differential-geometric operations on the manifold are needed, e.g. computation of the curvature tensor. Section 3 mentions a method for obtaining the needed computational tools for a class of commonly used manifolds: naturally reductive homogeneous spaces (all the manifolds mentioned earlier fall under this category; here SE(n) is seen as SO(n) × ℝⁿ). Further, when dealing with constant sectional curvature manifolds, we show how to decrease the computational complexity of the algorithm.
• Section 4 illustrates the application of the theory to the problem of intrinsic centroid computation by a Newton method. A pseudo-code implementation of the algorithm is given. Results for the special Euclidean group, the Grassmann manifold and the real projective plane are presented. Another application, concerning MAP position estimation in the context of robot navigation, is shown as well.
• Finally, some conclusions are drawn and directions for future work are delineated in Section 5.

2. HESSIAN OF THE RIEMANNIAN SQUARED DISTANCE FUNCTION

2.1 Review of Newton's Method in Riemannian Manifolds

Let q_k ∈ M henceforth designate the kth iterate of an optimization method formulated on a Riemannian manifold M. Newton's method on a manifold is essentially the same as in ℝⁿ (see (Edelman et al., 1998), (Manton, 2002) and (Hüper and Trumpf, 2004) for some generalizations). It generates a search direction d_k ∈ T_{q_k}M as the solution of the linear system

\[ H \cdot d_k = -\operatorname{grad} f(q_k), \qquad (1) \]

where H is the bilinear Hessian tensor of the smooth cost function f : M → ℝ evaluated at q_k and grad f(q_k) ∈ T_{q_k}M is its gradient. Some care is needed though, since the Hessian and the gradient are not as simple to find as in ℝⁿ, but are in fact given as the solutions of

\[ (df)_q X_q = \langle \operatorname{grad} f(q), X_q \rangle \quad \text{and} \quad \operatorname{Hess} f(q)(X_q, Y_q) = \langle \nabla_{X_q} \operatorname{grad} f, Y_q \rangle, \]

where q ∈ M and X_q, Y_q ∈ T_qM are any tangent vectors. Here (df)_q denotes the differential of the function f at the point q, ∇ denotes the Levi-Civita connection of the manifold, and ⟨·,·⟩ denotes the inner product on T_qM.

Once a Newton direction has been obtained, it should be checked whether it is a descent direction (its inner product with the gradient vector should be negative). If so, the update equation q_{k+1} = exp_{q_k}(α_k d_k) can be used to obtain a better estimate. Here α_k is a step size, given for example by Armijo's rule, and exp denotes the Riemannian exponential map. If the inner product is non-negative, the safe negative gradient direction should be used instead.
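For concreteness, once the Hessian matrix H and the gradient coordinates g of f at q_k are available in some orthonormal basis of T_{q_k}M, the Newton direction of equation (1) and the descent-direction safeguard amount to a few lines of linear algebra. The following is a minimal NumPy sketch; the function name is ours, not part of the paper.

```python
import numpy as np

def newton_direction(H, g):
    """Solve equation (1), H d = -g, for the Newton direction.

    H : (n, n) Hessian matrix of f at q_k in an orthonormal basis of T_{q_k} M.
    g : (n,) coordinates of grad f(q_k) in the same basis.
    """
    d = np.linalg.solve(H, -g)
    if d @ g >= 0:  # not a descent direction: fall back to the negative gradient
        d = -g
    return d
```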

Although the gradient is usually easy to compute, the determination of the Hessian is more involved. The remainder of this section describes a method to calculate it on particular Riemannian manifolds (connected, locally-symmetric ones where the curvature tensor and the Riemannian logarithm map are known).

2.2 Hessian in Matrix Form

This section starts with a slightly upgraded version of the main result presented in (Ferreira et al., 2006), stating:

Theorem 1. Consider M to be a connected locally-symmetric n-dimensional Riemannian manifold with curvature endomorphism R. Let B_ε(p) be a geodesic ball centered at p ∈ M and r_p : B_ε(p) → ℝ the function returning the intrinsic (geodesic) distance to p. Let γ : [0, r] → B_ε(p) denote the unit speed geodesic connecting p to a point q ∈ B_ε(p), where r = r_p(q), and let γ̇_q ≡ γ̇(r) be its velocity vector at q. Define the function k_p : B_ε(p) → ℝ, k_p(x) = ½ r_p(x)², and consider any X_q, Y_q ∈ T_qM. Then

\[ \operatorname{Hess}(k_p)_q(X_q, Y_q) = \langle X_q^{\parallel}, Y_q^{\parallel} \rangle + \sum_{i=1}^{n} \operatorname{ctg}_{\lambda_i}(r)\, \langle X_q^{\perp}, E_{i_q} \rangle \langle Y_q^{\perp}, E_{i_q} \rangle, \]

where {E_{i_q}} ⊂ T_qM is an orthonormal basis which diagonalizes the linear operator ℛ : T_qM → T_qM, ℛ(X_q) = R(X_q, γ̇_q)γ̇_q, with eigenvalues λ_i, meaning ℛ(E_{i_q}) = λ_i E_{i_q}. Also,

\[ \operatorname{ctg}_{\lambda}(t) = \begin{cases} \sqrt{\lambda}\, t / \tan(\sqrt{\lambda}\, t) & \lambda > 0 \\ 1 & \lambda = 0 \\ \sqrt{-\lambda}\, t / \tanh(\sqrt{-\lambda}\, t) & \lambda < 0 \end{cases}. \]

Here the ∥ and ⊥ signs denote the parallel and orthogonal components of a vector with respect to the velocity vector of γ, i.e. X_q = X_q^∥ + X_q^⊥, ⟨X_q^∥, X_q^⊥⟩ = 0 and ⟨X_q^⊥, γ̇_q⟩ = 0.

The improvement relative to (Ferreira et al., 2006), which might go unnoticed at first sight, is the point of the manifold M at which the self-adjoint operator ℛ is diagonalized, namely the point q. Previously, the diagonalization took place at p ∈ M and the result was then parallel-transported to q (where it is needed). Here, parallel translation is no longer required and, better yet, this allows the formula to be written in matrix notation, as described next. Note also that the theorems are formulated intrinsically: nowhere is an embedded characterization of the manifold assumed. This means that the presented work is independent of the representation chosen for the manifold M, which can range from simple cartesian products of submanifolds of ℝ to quotient manifolds or any other abstract construction.

Theorem 2. Under the same assumptions as above, consider an orthonormal basis {F_{i_q}} ⊂ T_qM. If X_q ∈ T_qM is a vector, let the notation X̂ denote the column vector describing the decomposition of X_q with respect to the basis {F_{i_q}}, i.e. [X̂]_i = ⟨X_q, F_{i_q}⟩. Let R_k be the matrix with entries [R_k]_{ij} = ⟨F_{i_q}, ℛ(F_{j_q})⟩ and consider the eigenvalue decomposition R_k = EΛEᵀ, where λ_i denotes the ith diagonal element of Λ. Then the Hessian matrix (a representation of the bilinear Hessian tensor on the finite-dimensional tangent space with respect to the fixed basis) is given by

\[ H_k = E \Sigma E^{T}, \qquad (2) \]

where Σ is diagonal with elements σ_i = ctg_{λ_i}(r). Hence Hess(k_p)_q(X_q, Y_q) = X̂ᵀ H_k Ŷ.

This result follows from Theorem 1 by decomposing each vector with respect to the considered basis and noting that, due to the symmetries of the curvature endomorphism, the operator ℛ has a null eigenvalue associated with the eigenvector γ̇_q.
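In code, Theorem 2 reduces the Hessian computation to one symmetric eigenvalue decomposition per point. Below is a small NumPy sketch under our own naming (ctg, hessian_matrix); the matrix R_k is assumed to have been assembled beforehand from the curvature endomorphism in an orthonormal basis of T_qM.

```python
import numpy as np

def ctg(lam, t):
    """The scalar function ctg_lambda(t) from Theorem 1."""
    if lam > 0.0:
        s = np.sqrt(lam) * t
        return s / np.tan(s)
    if lam < 0.0:
        s = np.sqrt(-lam) * t
        return s / np.tanh(s)
    return 1.0

def hessian_matrix(R_k, r):
    """Hessian matrix H_k of k_p at q, equation (2).

    R_k : (n, n) symmetric matrix of the operator X -> R(X, gdot) gdot
          expressed in an orthonormal basis {F_i} of T_q M.
    r   : geodesic distance r_p(q).
    """
    lam, E = np.linalg.eigh(R_k)                 # R_k = E diag(lam) E^T
    Sigma = np.diag([ctg(l, r) for l in lam])    # sigma_i = ctg_{lambda_i}(r)
    return E @ Sigma @ E.T                       # H_k = E Sigma E^T
```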
3. IMPLEMENTATION

3.1 Naturally Reductive Homogeneous Spaces

The theory of naturally reductive homogeneous spaces, henceforth denoted by NRHS, provides a practical way of obtaining the needed curvature endomorphism R and the Riemannian exponential and logarithm maps for those cases where the manifold M can be written as a coset space G/H of a group G acting on M transitively and by isometries, with isotropy subgroup H. The situation is particularly favorable from the computational viewpoint if G can be taken as SO(n) or GL(n), given the easy formula (matrix exponential) for its one-parameter subgroups. If certain additional properties are verified (e.g. the existence of a Lie subspace as defined, for example, in (O'Neill, 1983)), all of the required maps can be found from the corresponding maps in G. For a description of these spaces see, for example, (O'Neill, 1983). Many manifolds in engineering can be described by this construction; in particular, all the manifolds mentioned are NRHS (Grassmann, sphere, special orthogonal group, special Euclidean group, positive definite matrices and the projective plane). Note though that the NRHS set is not a subset of the connected locally symmetric manifolds, hence care should be taken to ensure that the particular manifold on which optimization is to be performed verifies these conditions. For example, the Stiefel manifold (the set of k-dimensional orthogonal frames in ℝⁿ) is an NRHS space but is not locally symmetric, except in the case k = 1, which results in the sphere. Note that when k = n the Stiefel manifold is not connected (it is actually O(n)).

3.2 Special Considerations

Spaces with constant sectional curvature deserve special mention. In these spaces, it is easily seen that the eigenvalues of the operator ℛ are all equal to the constant value of the curvature, except for one, which is 0 and is associated with the eigenvector γ̇_q. Hence, if λ is the value of the curvature, the matrices can be decomposed as follows:

\[ E = \begin{bmatrix} \dot{\gamma}_q & \dot{\gamma}_q^{\perp} \end{bmatrix}, \qquad \Lambda = \begin{bmatrix} 0 & 0 \\ 0 & \lambda I \end{bmatrix}, \]

where γ̇_q^⊥ is an orthonormal complement of γ̇_q. Thus the Hessian matrix becomes

\[ H_k = \dot{\gamma}_q \dot{\gamma}_q^{T} + \operatorname{ctg}_{\lambda}(r)\, \dot{\gamma}_q^{\perp} \dot{\gamma}_q^{\perp T} = \dot{\gamma}_q \dot{\gamma}_q^{T} + \operatorname{ctg}_{\lambda}(r)\,(I - \dot{\gamma}_q \dot{\gamma}_q^{T}) = \operatorname{ctg}_{\lambda}(r)\, I + (1 - \operatorname{ctg}_{\lambda}(r))\, \dot{\gamma}_q \dot{\gamma}_q^{T}. \qquad (3) \]

This removes the need for the numerical computation of the eigenvalue decomposition.
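In the constant-curvature case, equation (3) can therefore be implemented directly, with no eigenvalue decomposition. A self-contained sketch, again with our own naming:

```python
import numpy as np

def hessian_constant_curvature(gdot, lam, r):
    """Hessian matrix of k_p at q for constant sectional curvature lam, equation (3)."""
    if lam > 0.0:
        s = np.sqrt(lam) * r
        c = s / np.tan(s)
    elif lam < 0.0:
        s = np.sqrt(-lam) * r
        c = s / np.tanh(s)
    else:
        c = 1.0                                # ctg_0(r) = 1
    return c * np.eye(gdot.size) + (1.0 - c) * np.outer(gdot, gdot)
```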

All manifolds with an NRHS structure (and possibly others where isometries are known) allow for another, sometimes important, optimization. Suppose that there is a privileged point o ∈ M where computations are 'cheaper' to carry out. Since there is a group G acting transitively on M by isometries, an element g_o ∈ G can be found such that o = g_o · q, where q ∈ M is the current Newton estimate. Applying this isometry to the whole constellation yields {p_i′ : p_i′ = g_o · p_i}, i.e., a new constellation 'centered' at o. The result is that all optimization steps can be carried out at o, simply by pre-applying g_o and, after the Newton iteration, applying g_o⁻¹ to the new estimate. Besides simplifying the computation of the Riemannian curvature, logarithm and exponential maps, this also allows for a tangent basis {F_{i_o}} to be fixed for all iterations.

4. RESULTS

4.1 Centroid Computation

Let M be a connected locally-symmetric Riemannian manifold and X = {p₁, ..., p_L} ⊂ M a constellation of L points. Let r : M × M → ℝ be the function that returns the intrinsic distance between any two points on the manifold, and define a cost function C_X : M → ℝ as

\[ C_X(q) = \frac{1}{2} \sum_{l=1}^{L} r(p_l, q)^2 = \sum_{l=1}^{L} k_{p_l}(q), \qquad (4) \]

where the functions k_{p_l} : M → ℝ consider the distance to each point individually and are defined as k_{p_l}(q) = ½ r(p_l, q)². The Fréchet mean set of the constellation is defined as the set of solutions of the optimization problem m_f(X) = argmin_{q∈M} C_X(q). Each element of the set m_f(X) will be called a centroid of X. Note that, depending on the manifold M, a generic constellation might have more than one centroid (for example, if the sphere is considered with a constellation consisting of two antipodal points, all the equator points are centroids). The set of points at which the function (4) attains a local minimum is called the Karcher mean set and is denoted by m_k(X). The objective will be to find a centroid for the given constellation (which in the applications of interest should be unique), but the possibility of convergence to a local minimum is not dealt with. If the points of the constellation are close enough to each other, it is known that the set m_f(X) has a single element, and so the centroid is unique, as stated in (Manton, 2004) and (Karcher, 1977).

Using the linearity of the gradient and Hessian operators (meaning in particular that if f, g : M → ℝ then Hess(f + g) = Hess f + Hess g and grad(f + g) = grad f + grad g), the cost function in equation (4) allows for the decomposition

\[ \operatorname{grad} C_X(q) = \sum_{l=1}^{L} \operatorname{grad} k_{p_l}(q) = -\sum_{l=1}^{L} \log_q(p_l), \qquad \operatorname{Hess} C_X(q) = \sum_{l=1}^{L} \operatorname{Hess} k_{p_l}(q), \qquad (5) \]

where the fact that the gradient of the squared Riemannian distance function is the negative of the Riemannian log map is used (as stated in (Lee, 1997) as a corollary of Gauss's lemma).

Here is the outline of the Newton method as applied to this problem (a concrete instantiation for the sphere is sketched after the outline):

Input: Constellation X = {p₁, ..., p_L} ⊂ M
Output: Karcher mean q ∈ m_k(X)
Initialization: Choose q₀ ∈ M and a tolerance ε > 0. Set k ← 0.
Loop:
* Apply the initial isometry f to the constellation, taking the current estimate q_k to o.
• Compute the intrinsic gradient g_k = grad C_X(q_k) ∈ T_{q_k}M by equation (5).
• If |g_k| ≤ ε, set q = q_k and return.
• Compute the Hessian matrix H = Σ_{l=1}^{L} H_l, where each H_l is given by equation (2) or (3).
• Compute the Newton direction d_k as the solution of the system H d_k = −g_k.
• If ⟨d_k, g_k⟩ ≥ 0, set d_k = −g_k.
• Apply the Armijo rule (a popular line-search algorithm, see for example (Bertsekas, 1999)) to obtain α_k ≈ argmin_{α≥0} C_X(exp_{q_k}(α d_k)).
• Set q_{k+1} = exp_{q_k}(α_k d_k). Please note that, due to finite precision limitations, after a few iterations the result should be enforced to lie on the manifold.
* If the initial isometry was applied, set q_{k+1} ← f⁻¹(q_{k+1}).
• Set k ← k + 1 and re-run the loop.

The steps marked with an asterisk are optional computational optimizations.
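For illustration, the following is a minimal NumPy sketch of this loop for the unit sphere S^{n−1}, a constant curvature space (λ = 1) where equation (3) applies and the exponential and logarithm maps are available in closed form. The function names are ours; the Armijo step is replaced by a unit Newton step to keep the sketch short, and the points are assumed to lie in a common geodesic ball so that the mean is unique.

```python
import numpy as np

def sphere_log(q, p):
    """Riemannian log map on the unit sphere: the tangent vector at q pointing to p."""
    c = np.clip(q @ p, -1.0, 1.0)
    theta = np.arccos(c)                       # geodesic distance r(q, p)
    if theta < 1e-12:
        return np.zeros_like(q)
    u = p - c * q                              # direction of the geodesic at q
    return theta * u / np.linalg.norm(u)

def sphere_exp(q, v):
    """Riemannian exponential map on the unit sphere."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return q
    return np.cos(t) * q + np.sin(t) * v / t

def sphere_centroid_newton(points, q0, tol=1e-12, max_iter=50):
    """Karcher mean on S^{n-1} by the Newton outline above (curvature lam = 1)."""
    q = q0 / np.linalg.norm(q0)
    n = q.size
    for _ in range(max_iter):
        P = np.eye(n) - np.outer(q, q)         # projector onto T_q S^{n-1}
        g = np.zeros(n)
        H = np.outer(q, q)                     # pads the normal direction -> H invertible
        for p in points:
            v = sphere_log(q, p)               # grad k_p(q) = -log_q(p)
            r = np.linalg.norm(v)
            g -= v
            if r < 1e-12:
                H += P                         # ctg_1(0) = 1
                continue
            gdot = v / r                       # sign is irrelevant in gdot gdot^T
            c = r / np.tan(r)                  # ctg_1(r)
            H += c * P + (1.0 - c) * np.outer(gdot, gdot)   # equation (3)
        if np.linalg.norm(g) <= tol:
            break
        d = np.linalg.solve(H, -g)             # Newton direction, equation (1)
        if d @ g >= 0:                         # safeguard: fall back to -gradient
            d = -g
        d -= (d @ q) * q                       # numerical re-tangentialization
        q = sphere_exp(q, d)                   # unit step; Armijo omitted for brevity
        q /= np.linalg.norm(q)                 # enforce the manifold constraint
    return q
```

A call such as sphere_centroid_newton([p1, p2, p3], p1), with unit-norm arrays p_i, would return the Karcher mean of the three points.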


4.2 The SE(3) ⊂ M(3,4) Manifold

An example on the special Euclidean group SE(3) (seen as a Riemannian submanifold of M(3,4)) with a constellation of 5 points is shown in figure 1. The results, shown in a logarithmic scale, clearly show the quadratic convergence rate of Newton's method and the almost perfectly linear convergence rate of the gradient method. Note the plateau at a precision of 10⁻¹⁵, resulting from numeric precision limitations.

Fig. 1. Intrinsic centroid computation on SE(3) with a constellation of 5 points.

4.3 The G(6,3) ≅ SO(6)/(SO(3)×SO(3)) Manifold

The results for G(6,3), the Grassmann manifold (here represented as a coset space of SO(n) as in (Edelman et al., 1998)) of 3-dimensional subspaces in ℝ⁶, are shown in figure 2. Again the quadratic convergence rate of Newton's method is clear. Note that this class of manifolds does not admit a canonic embedding in ℝⁿ, thus showing that the presented results are not bound to embeddable manifolds.

Fig. 2. Intrinsic centroid computation on G(6,3) with a constellation of 5 points.

4.4 The P⁵ ≅ G(6,1) Manifold

The method is also valid for the projective space, which is a special case of the Grassmann manifold. Results are presented in figure 3 for the manifold P⁵.

Fig. 3. P⁵ with a constellation of 25 points.

4.5 Robot Navigating in ℝⁿ

As a simple application, consider a robot moving freely in ℝⁿ. Its state may be represented as a point T ∈ SE(n), which acts on points in the world reference frame, taking them to the local reference frame.

To keep the experiment simple, assume also that there are some known landmarks in the world, {x₁, ..., x_k} ⊂ ℝⁿ, which the robot observes. Hence, in the local reference frame, the robot observes the points T xᵢ. Assuming that the robot is known to be in position T₀ with a certain uniform uncertainty, it is possible to build a prior knowledge probability density function as

\[ p(T) = K_1 \, e^{-\frac{1}{2} d(T, T_0)^2 / \sigma^2}, \]

where K₁ is a normalizing constant and σ encodes the level of uncertainty.

Notice that all directions are treated equally, which is usually not the case, but to keep the example simple assume that this description is useful. Assume also that the robot's sensor is not perfect and that the observations obey the following Gaussian probability distribution:

\[ p(y_i \mid T) = K_2 \, e^{-(y_i - T x_i)^{T} R^{-1} (y_i - T x_i)}, \]

where, again, K₂ is a normalizing constant and R is a matrix encoding the uncertainty of the sensor. With these assumptions, and assuming the observations are independent, the MAP estimator of the robot's position is given by

\[ T^{*} = \underset{T \in SE(n)}{\arg\max}\; p(T \mid y_1, y_2, \ldots, y_k), \]

which is equivalent to

\[ T^{*} = \underset{T \in SE(n)}{\arg\max} \sum_{i=1}^{k} \log p(y_i \mid T) + \log p(T) = \underset{T \in SE(n)}{\arg\max}\; -\sum_{i=1}^{k} (y_i - T x_i)^{T} R^{-1} (y_i - T x_i) \;-\; \tfrac{1}{2}\, d(T, T_0)^2 / \sigma^2. \]

This is an optimization problem on SE(n). The gradient of each term is readily available, and the Hessian of the first terms can be obtained using standard techniques. The result presented in this paper allows for the Hessian of the last term to be obtained as well, thus allowing a Newton algorithm to be implemented.
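To make the cost concrete, here is a hedged NumPy sketch of the negative log-posterior for this problem, with T stored as the n×(n+1) matrix [R | t] and SE(n) given the product metric of SO(n) and ℝⁿ, as in the SE(3) example above; the helper names and the metric scaling convention are our assumptions, not part of the paper.

```python
import numpy as np
from scipy.linalg import logm

def dist_sq_se_n(T, T0):
    """Squared intrinsic distance on SE(n) under the SO(n) x R^n product metric."""
    Rm, t = T[:, :-1], T[:, -1]
    R0, t0 = T0[:, :-1], T0[:, -1]
    S = np.real(logm(R0.T @ Rm))        # skew-symmetric log of the relative rotation
    # 0.5 * ||S||_F^2 is one common scaling convention for the SO(n) metric.
    return 0.5 * np.sum(S * S) + np.sum((t - t0) ** 2)

def map_cost(T, T0, xs, ys, sigma, Rcov):
    """Negative log-posterior (up to constants): minimizing this over SE(n) yields T*."""
    Rinv = np.linalg.inv(Rcov)
    data = 0.0
    for x, y in zip(xs, ys):
        e = y - (T[:, :-1] @ x + T[:, -1])  # residual y_i - T x_i
        data += e @ Rinv @ e
    return data + 0.5 * dist_sq_se_n(T, T0) / sigma ** 2
```

The gradient and Hessian of the data terms follow by standard calculus, while equation (2) supplies the Hessian of the prior term.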

Figure 4 shows the results when both the gradient and Newton methods are applied to 5 observations (using σ = 1 and R = I). The gradient method (both with and without the Armijo step selection rule) is clearly outperformed, since the Newton method takes only 5 iterations to hit the objective at the given precision.

Fig. 4. Distance to the optimum MAP estimate for 5 observations, σ = 1 and R = I.

5. CONCLUSION

A simple formula for the Hessian matrix of the squared distance function on a connected locally-symmetric manifold was presented. The range of manifolds for which the method has been proven to work comprises the important cases of SO(n), Sym⁺(n), Sⁿ, SE(n), G(n,p) and Pⁿ. A simple example of MAP position estimation was presented, illustrating how to use the result on problems other than centroid computation. Possible future work involves approximating the general Hessian expression (2) by a constant sectional curvature approximation: although the quadratic convergence rate should be lost, a super-linear convergence rate might be possible without the need for an eigenvalue decomposition. At this stage, an implementation of a K-means clustering algorithm should be trivial as well.

REFERENCES

Absil, P., R. Mahony and R. Sepulchre (2004). Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Appl. Math. 80(2), 199–220.
Belta, Calin and Vijay Kumar (2002). An SVD-based projection method for interpolation on SE(3). IEEE Transactions on Robotics and Automation 18(3), 334–345.
Bertsekas, Dimitri P. (1999). Nonlinear Programming, 2nd Edition. Athena Scientific.
Edelman, Alan, Tomas A. Arias and Steven T. Smith (1998). The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20(2), 303–353.
Ferreira, R., J. Xavier, J. Costeira and V. Barroso (2006). Newton method for Riemannian centroid computation in naturally reductive homogeneous spaces. International Conference on Acoustics, Speech, and Signal Processing (ICASSP'06).
Fletcher, P., C. Lu, S. Pizer and S. Joshi (2004). Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging 23(8), 995–1005.
Gabay, D. (1982). Minimizing a differentiable function over a differential manifold. J. Optim. Theory Appl.
Helmke, Uwe, Knut Hüper, Pei Lee and John Moore (2004). Essential matrix estimation via Newton-type methods. 16th International Symposium on Mathematical Theory of Networks and Systems (MTNS), Leuven.
Hüper, K. and J. Manton (2005). The Karcher mean of points on the orthogonal group. Preprint.
Hüper, Knut and Jochen Trumpf (2004). Newton-like methods for numerical optimization on manifolds. Asilomar Conference on Signals, Systems and Computers, pp. 136–139.
Karcher, H. (1977). Riemannian center of mass and mollifier smoothing. Comm. Pure Appl. Math. 30, 509–541.
Lee, John M. (1997). Riemannian Manifolds: An Introduction to Curvature. Springer.
Manton, Jonathan H. (2002). Optimisation algorithms exploiting unitary constraints. IEEE Transactions on Signal Processing 50(3), 635–650.
Manton, Jonathan H. (2004). A globally convergent numerical algorithm for computing the centre of mass on compact Lie groups. In: Eighth International Conference on Control, Automation, Robotics and Vision. Kunming, China.
Manton, Jonathan H. (2005). A centroid (Karcher mean) approach to the joint approximate diagonalisation problem: The real symmetric case. Digital Signal Processing.
Moakher, Maher (2002). Means and averaging in the group of rotations. SIAM J. Matrix Anal. Appl. 24(1), 1–16.
O'Neill, Barrett (1983). Semi-Riemannian Geometry. Academic Press.
Pennec, Xavier, Pierre Fillard and Nicholas Ayache (2004). A Riemannian framework for tensor computing. Research Report 5255, INRIA. To appear in the Int. Journal of Computer Vision.