
arXiv:1802.02628v1 [math.OC] 7 Feb 2018

Manifold Optimization Over the Set of Doubly Stochastic Matrices: A Second-Order Geometry

Ahmed Douik, Student Member, IEEE, and Babak Hassibi, Member, IEEE

Ahmed Douik and Babak Hassibi are with the Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125 USA (e-mail: {ahmed.douik, hassibi}@caltech.edu).

Abstract—Convex optimization is a well-established research area with applications in almost all fields. Over the decades, multiple approaches have been proposed to solve convex programs. The development of interior-point methods allowed solving a more general set of convex programs known as semi-definite programs and second-order cone programs. However, it has been established that these methods are excessively slow for high dimensions, i.e., they suffer from the curse of dimensionality. On the other hand, optimization algorithms on manifolds have shown great ability in finding solutions to nonconvex problems in reasonable time. This paper is interested in solving a subset of convex optimization problems using a different approach. The main idea behind Riemannian optimization is to view a constrained optimization problem as an unconstrained one over a restricted search space. The paper introduces three manifolds to solve convex programs under particular box constraints. The manifolds, called the doubly stochastic, the symmetric, and the definite multinomial manifolds, generalize the simplex, also known as the multinomial manifold. The proposed manifolds and algorithms are well-adapted to solving convex programs in which the variable of interest is a multidimensional probability distribution function. Theoretical analysis and simulation results testify the efficiency of the proposed method over state of the art methods. In particular, they reveal that the proposed framework outperforms conventional generic and specialized solvers, especially in high dimensions.

Index Terms—Riemannian manifolds, symmetric doubly stochastic matrices, positive matrices, convex optimization.

I. INTRODUCTION

Numerical optimization is the foundation of various engineering and computational sciences. Consider a mapping f from a subset D of R^n to R. The goal of the optimization algorithms is to find an extreme point x* ∈ D such that f(x*) ≤ f(y) for all points y in the neighborhood of x*. Unconstrained Euclidean optimization refers to the setup in which the domain of the objective function is the whole space, i.e., D = R^n. On the other hand, constrained Euclidean optimization denotes the optimization problem in which the search set is constrained, i.e., D ⊂ R^n. The traditional optimization schemes are identified with the word Euclidean, in contrast with the Riemannian algorithms in the rest of the paper.

Convex optimization is a special case of constrained optimization in which both the search set and the objective function are convex. Historically initiated with the study of least-squares and linear programming problems, convex optimization plays a crucial role in optimization algorithms thanks to the desirable convergence property it exhibits. The development of interior-point methods allowed solving a more general set of convex programs known as second-order cone programs and semi-definite programs. A summary of convex optimization methods and their performance analysis can be found in the seminal book [1].

Riemannian optimization exploits the fact that the search space can be identified with a lower-dimensional manifold that is embedded in a higher-dimensional Euclidean space and takes advantage of its underlying geometric structure. The constrained Euclidean optimization problem, which can be excessively slow to solve in high dimensions, is reformulated as an unconstrained optimization over a restricted search space, a.k.a., a Riemannian manifold, that can be solved using advanced tools, e.g., [2].

This paper introduces a framework for solving optimization problems in which the optimization variable is a doubly stochastic matrix. Such an optimization framework is particularly interesting for clustering applications [4]–[7], e.g., problems in which one wishes to recover the structure of a graph by minimizing a predefined cost function under the constraint that the variable is a doubly stochastic matrix. This work provides a unified framework to carry such optimization.

A. State of the Art

Thanks to the aforementioned low-dimension feature, optimization over Riemannian manifolds is expected to perform more efficiently [3]. Therefore, a large body of literature is dedicated to adapting traditional Euclidean optimization methods and their convergence properties to Riemannian manifolds.

Optimization algorithms on Riemannian manifolds appeared in the optimization literature as early as the 1970's with the work of Luenberger [8], wherein the standard Newton's method has been adapted to optimization problems on manifolds. A decade later, Gabay [9] introduces the steepest descent and the quasi-Newton algorithm on embedded submanifolds of R^n. The work investigates the global and local convergence properties of both methods. The analysis of both the steepest descent and the Newton method is extended in [10], [11] to Riemannian manifolds. By using the exact line search, the authors concluded the convergence of their proposed algorithms. The assumption is relaxed in [12], wherein the author provides convergence guarantees for the steepest descent with Armijo step-size control and a convergence rate for Newton's method.

The above-mentioned works substitute the concept of the line search in Euclidean algorithms by a search along a geodesic, which generalizes the idea of finding and searching along a straight line. While the idea is natural and intuitive, it might not be practical. Indeed, the expression of the geodesic requires computing the exponential map, which may be as complicated as solving the original optimization problem [13]. To overcome the problem, the authors in [14] suggest approximating the exponential map up to a given order, called a retraction,
and show quadratic convergence for Newton's method under such setup. The work initiated more sophisticated optimization algorithms such as the trust region methods [3], [15]–[18]. Analysis of the convergence of first and second order methods on Riemannian manifolds, e.g., gradient and conjugate gradient descent, Newton's method, and trust region methods, using general retractions is summarized in [13].

Thanks to the theoretical convergence guarantees mentioned above, the optimization algorithms on Riemannian manifolds are gradually gaining momentum in the optimization field [3]. Several successful algorithms have been proposed to solve non-convex problems, e.g., low-rank matrix completion [19]–[21], online learning [22], clustering [23], [24], and tensor decomposition [25]. It is worth mentioning that these works modify the optimization algorithm by using a general connection instead of the genuine parallel vector transport to move from a tangent space to the other while computing the (approximate) Hessian. Such an approach conserves the global convergence of the quasi-Newton scheme but no longer ensures its superlinear convergence behavior [26].

Despite the advantages cited above, the use of optimization algorithms on manifolds is relatively limited. This is mainly due to the lack of a systematic mechanism to turn a constrained optimization problem into an optimization over a manifold, provided that the search space forms a manifold, e.g., convex optimization. Such reformulation, usually requiring some level of understanding of differential geometry and Riemannian manifolds, is prohibitively complex for regular use. This paper addresses the problem by introducing new manifolds that allow solving a non-negligible class of optimization problems in which the variable of interest can be identified with a multidimensional probability distribution function.

B. Contributions

In [25], in a context of tensor decomposition, the authors propose a framework to optimize functions in which the variables are stochastic matrices. This paper proposes extending the results to a more general class of manifolds by proposing a framework for solving a subset of convex programs including those in which the optimization variable represents a doubly stochastic and possibly symmetric and/or definite multidimensional probability distribution function. To this end, the paper introduces three manifolds which generalize the multinomial manifold. While the multinomial manifold allows representing only stochastic matrices, the proposed ones characterize doubly stochastic, symmetric, and definite arrays, respectively. Therefore, the proposed framework allows solving a subset of convex programs. To the best of the authors' knowledge, the proposed manifolds have not been introduced or studied in the literature.

The first part of the manuscript introduces all relevant concepts of the Riemannian geometry and provides insights on the optimization algorithms on such manifolds. In an effort to make the content of this document accessible to a larger audience, it does not assume any prerequisite on differential geometry. As a result, the definitions, concepts, and results in this paper are tailored to the manifolds of interest and may not be applicable to abstract manifolds.

The paper investigates the first and second order Riemannian geometry of the proposed manifolds endowed with the Fisher information metric, which guarantees that the manifolds have a differentiable structure. For each manifold, the tangent space, Riemannian gradient, Hessian, and retraction are derived. With the aforementioned expressions, the manuscript formulates first and second order optimization algorithms and characterizes their complexity. Simulation results are provided to further illustrate the efficiency of the proposed method against state of the art algorithms.

The rest of the manuscript is organized as follows: Section II introduces the optimization algorithms on manifolds and lists the problems of interest in this paper. In Section III, the doubly stochastic manifold is introduced and its first and second order geometry derived. Section IV iterates a similar study for a particular case of doubly stochastic matrices known as the symmetric manifold. The study is extended to the definite symmetric manifold in Section V. Section VI suggests first and second order algorithms and analyzes their complexity. Finally, before concluding in Section VIII, the simulation results are plotted and discussed in Section VII.

II. OPTIMIZATION ON RIEMANNIAN MANIFOLDS

This section introduces the numerical optimization methods on smooth matrix manifolds. The first part introduces the Riemannian manifold notations and operations. The second part extends the first and second order Euclidean optimization algorithms to the Riemannian manifolds and introduces the necessary machinery. Finally, the problems of interest in this paper are provided and the different manifolds identified.

A. Manifold Notation and Operations

The study of optimization algorithms on smooth manifolds has attracted significant attention in the previous years. However, such studies require some level of knowledge of differential geometry. In this paper, only smooth embedded matrix manifolds are considered. Hence, the definitions and theorems may not apply to abstract manifolds. In addition, the authors opted for a coordinate-free analysis, omitting the chart and the differentiable structure of the manifold. For an introduction to differential geometry, abstract manifolds, and Riemannian manifolds, we refer the readers to the references [27]–[29], respectively.

An embedded matrix manifold M is a smooth subset of a vector space E included in the set of matrices R^{n×m}. The set E is called the ambient or the embedding space. By smooth subset, we mean that M can be mapped by a bijective function, i.e., a chart, to an open subset of R^d, where d is called the dimension of the manifold. The dimension d can be thought of as the degree of freedom of the manifold. In particular, a vector space E is a manifold.

In the same line of thought of approximating a function locally by its derivative, a manifold M of dimension d can be approximated locally at a point X by a d-dimensional vector space T_X M generated by taking derivatives of all smooth curves going through X. Formally, let γ(t) : I ⊂ R −→ M be a curve on M with γ(0) = X. Define the derivative of γ(t) at zero as follows:
    γ'(0) = lim_{t→0} (γ(t) − γ(0)) / t.   (1)
The space generated by all γ'(0) represents a vector space T_X M called the tangent space of M at X. Figure 1 shows an example of a two-dimensional tangent space generated by a couple of curves. The tangent space plays a primordial role in the optimization algorithms over manifolds in the same way as the derivative of a function plays an important role in Euclidean optimization. The union of all tangent spaces is referred to as the tangent bundle of M, i.e.,:
    T M = ∪_{X ∈ M} T_X M.   (2)

[Fig. 1. Tangent space of a 2-dimensional manifold embedded in R^3. The tangent space T_X M is computed by taking derivatives of the curves going through X at the origin.]

As shown previously, the notion of tangent space generalizes the notion of directional derivative. However, to optimize functions, one needs the notion of directions and lengths, which can be achieved by endowing each tangent space T_X M with a bilinear, symmetric positive form ⟨., .⟩_X, i.e., an inner product. Let g : T M × T M −→ R be a smoothly varying bilinear form such that its restriction on each tangent space is the previously defined inner product. In other words:
    g(ξ_X, η_X) = ⟨ξ_X, η_X⟩_X, ∀ ξ_X, η_X ∈ T_X M.   (3)
Such a metric, known as the Riemannian metric, turns the manifold into a Riemannian manifold. Any manifold (in this paper) admits at least a Riemannian metric. Lengths of tangent vectors are naturally induced from the inner product. The norm on the tangent space T_X M is denoted by ||.||_X and defined by:
    ||ξ_X||_X = sqrt(⟨ξ_X, ξ_X⟩_X), ∀ ξ_X ∈ T_X M.   (4)
Both the ambient space and the tangent space being vector spaces, one can define the orthogonal projection Π_X : E −→ T_X M verifying Π_X ∘ Π_X = Π_X. The projection is said to be orthogonal with respect to the restriction of the Riemannian metric to the tangent space, i.e., Π_X is orthogonal in the ⟨., .⟩_X sense.

B. First and Second Order Algorithms

The general idea behind unconstrained Euclidean numerical optimization methods is to start with an initial point X^0 and to iteratively update it according to certain predefined rules in order to obtain a sequence {X^t} which converges to a local minimizer of the objective function. A typical update strategy is the following:
    X^{t+1} = X^t + α^t p^t,   (5)
where α^t is the step size and p^t the search direction. Let Grad f(X) be the Euclidean gradient^2 of the objective function defined as the unique vector satisfying:
    ⟨Grad f(X), ξ⟩ = Df(X)[ξ], ∀ ξ ∈ E,   (6)
where ⟨., .⟩ is the inner product on the vector space E and Df(X)[ξ] is the directional derivative of f given by:
    Df(X)[ξ] = lim_{t→0} (f(X + tξ) − f(X)) / t.   (7)
In order to obtain a descent direction, i.e., f(X^{t+1}) < f(X^t) for a small enough step size α^t, the search direction p^t is chosen in the half space spanned by −Grad f(X). In other words, the following expression holds:
    ⟨Grad f(X^t), p^t⟩ < 0.   (8)
In particular, the choices of the search direction satisfying
    p^t = −Grad f(X^t) / ||Grad f(X^t)||   (9)
    Hess f(X^t)[p^t] = −Grad f(X^t)   (10)
yield the celebrated steepest descent (9) and the Newton's method (10), wherein Hess f(X)[ξ] is the Euclidean Hessian^3 of f at X defined as an operator from E to E satisfying:
1) ⟨Hess f(X)[ξ], ξ⟩ = D^2 f(X)[ξ, ξ] = D(Df(X)[ξ])[ξ],
2) ⟨Hess f(X)[ξ], η⟩ = ⟨ξ, Hess f(X)[η]⟩, ∀ ξ, η ∈ E.
After choosing the search direction, the step size α^t is chosen so as to satisfy the Wolfe conditions for some constants c_1 ∈ (0, 1) and c_2 ∈ (c_1, 1), i.e.,
1) The Armijo condition:
    f(X^t + α^t p^t) − f(X^t) ≤ c_1 α^t ⟨Grad f(X^t), p^t⟩   (11)
2) The curvature condition:
    ⟨Grad f(X^t + α^t p^t), p^t⟩ ≥ c_2 ⟨Grad f(X^t), p^t⟩.   (12)
The Riemannian version of the steepest descent, called the line-search algorithm, follows a similar logic as the Euclidean one. The search direction is obtained with respect to the Riemannian gradient, which is defined in a similar manner as the Euclidean one with the exception that it uses the Riemannian geometry, i.e.,:

Definition 1. The Riemannian gradient of f at X, denoted by grad f(X), of a manifold M is defined as the unique vector in T_X M that satisfies:
    ⟨grad f(X), ξ_X⟩_X = Df(X)[ξ_X], ∀ ξ_X ∈ T_X M.   (14)

^2 The expression of the Euclidean gradient (denoted by Grad) is explicitly given to show the analogy with the Riemannian gradient (denoted by grad). The nabla symbol ∇ is not used in the context of gradient as it is reserved for the Riemannian connection. Similar notations are used for the Hessian.
^3 The Euclidean Hessian is seen as an operator to show the connection with the Riemannian Hessian. One can show that the proposed definition matches the "usual" second order derivative matrix for ξ = I.
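To make the Euclidean update rule (5) and the Armijo condition (11) concrete, the following Python sketch implements a plain steepest descent with backtracking line search. It is an illustration only, not code from the paper; the quadratic test objective, the constant c1, and the shrinking factor are arbitrary choices.

```python
import numpy as np

def armijo_steepest_descent(f, grad_f, x0, c1=1e-4, shrink=0.5, tol=1e-8, max_iter=500):
    """Euclidean steepest descent X^{t+1} = X^t + alpha^t p^t with p^t = -Grad f / ||Grad f||,
    cf. (5) and (9), where alpha^t is found by backtracking until the Armijo condition (11) holds."""
    x = x0.copy()
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        p = -g / np.linalg.norm(g)           # normalized descent direction, satisfies (8)
        alpha, slope = 1.0, np.vdot(g, p)    # slope = <Grad f(x), p> < 0
        while f(x + alpha * p) - f(x) > c1 * alpha * slope:
            alpha *= shrink                  # backtrack until (11) is satisfied
            if alpha < 1e-20:
                break
        x = x + alpha * p
    return x

# Toy usage: minimize f(X) = 0.5 * ||X - A||_F^2 over R^{5x5}.
A = np.random.rand(5, 5)
f = lambda X: 0.5 * np.linalg.norm(X - A) ** 2
grad_f = lambda X: X - A
X_star = armijo_steepest_descent(f, grad_f, np.zeros_like(A))
```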

Algorithm 1 Line-Search Method on Riemannian Manifold
Require: Manifold M, function f, and retraction R.
1: Initialize X ∈ M.
2: while ||grad f(X)||_X ≥ ǫ do
3:    Choose a search direction ξ_X ∈ T_X M such that:
         ⟨grad f(X), ξ_X⟩_X < 0.   (13)
4:    Compute the Armijo step size α.
5:    Retract X = R_X(α ξ_X).
6: end while
7: Output X.

After choosing the search direction as mandated by (8), the step size is selected according to Wolfe's conditions similar to the ones in (11) and (12). A more general definition of a descent direction, known as gradient related sequence, and the Riemannian Armijo step expression can be found in [13].

While the update step X^{t+1} = X^t + α^t p^t is trivial in the Euclidean optimization thanks to its vector space structure, it might result in a point X^{t+1} outside of the manifold. Moving in a given direction of a tangent space while staying on the manifold is realized by the concept of retraction. The ideal retraction is the exponential map Exp_X as it maps a tangent vector ξ_X ∈ T_X M to a point along the geodesic curve (straight line on the manifold) that goes through X in the direction of ξ_X. However, computing the geodesic curves is challenging and may be more difficult than the original optimization problem. Luckily, one can use a first-order retraction (called simply retraction in this paper) without compromising the convergence properties of the algorithms. A first-order retraction is defined as follows:

Definition 2. A retraction on a manifold M is a smooth mapping R from the tangent bundle T M onto M. For all X ∈ M, the restriction of R to T_X M, called R_X, satisfies the following properties:
• Centering: R_X(0) = X.
• Local rigidity: The curve γ_{ξ_X}(τ) = R_X(τ ξ_X) satisfies dγ_{ξ_X}(τ)/dτ |_{τ=0} = ξ_X, ∀ ξ_X ∈ T_X M.
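As a quick numerical illustration of Definition 2, the sketch below checks the centering and local rigidity properties for the classical rescaling retraction on the unit sphere, the two-dimensional manifold used in Figure 2. The sphere and this particular retraction are illustrative choices and are not objects introduced by the paper.

```python
import numpy as np

def sphere_retraction(x, xi):
    """A classical retraction on the unit sphere: rescale x + xi back to unit norm."""
    y = x + xi
    return y / np.linalg.norm(y)

rng = np.random.default_rng(0)
x = rng.normal(size=3); x /= np.linalg.norm(x)     # point on the sphere
xi = rng.normal(size=3); xi -= (xi @ x) * x        # tangent vector: <xi, x> = 0

# Centering: R_x(0) = x.
assert np.allclose(sphere_retraction(x, np.zeros(3)), x)

# Local rigidity: d/dtau R_x(tau * xi) at tau = 0 equals xi (finite-difference check).
tau = 1e-6
speed = (sphere_retraction(x, tau * xi) - sphere_retraction(x, -tau * xi)) / (2 * tau)
assert np.allclose(speed, xi, atol=1e-6)
```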

For some predefined Armijo step size, the procedure above is guaranteed to converge for all retractions [13]. The generalization of the steepest descent to the Riemannian manifold is obtained by finding the search direction that satisfies a similar equation as in the Euclidean scenario (9) using the Riemannian gradient. The update is then retracted to the manifold. The steps of the line-search method are summarized in Algorithm 1, and an illustration of an iteration of the algorithm is given in Figure 2.

[Fig. 2. The update step for the two-dimensional sphere embedded in R^3. The update direction ξ_{X^t} and step length α^t are computed in the tangent space T_{X^t} M. The point X^t + α^t ξ_{X^t} lies outside the manifold and needs to be retracted to obtain the update X^{t+1}. The update is not located on the geodesic γ(t) due to the use of a retraction instead of the exponential map.]

Generalizing the Newton's method to the Riemannian setting requires computing the Riemannian Hessian operator, which requires taking a directional derivative of a vector field. As the vector fields belong to different tangent spaces, one needs the notion of connection ∇ that generalizes the notion of directional derivative of a vector field. The notion of connection is intimately related to the notion of vector transport, which allows moving from a tangent space to the other as shown in Figure 3. The definition of a connection is given below:

[Fig. 3. An illustration of a vector transport T on a two-dimensional manifold embedded in R^3 that connects the tangent space of X with tangent vector ξ_X with the one of its retraction R_X(ξ_X). A connection ∇ can be obtained from the speed at the origin of the inverse of the vector transport T^{-1}.]

Definition 3. An affine connection ∇ is a mapping from T M × T M to T M that associates to each (η, ξ) the tangent vector ∇_η ξ satisfying, for all smooth f, g : M −→ R and a, b ∈ R:
• ∇_{f(η)+g(χ)} ξ = f(∇_η ξ) + g(∇_χ ξ)
• ∇_η(aξ + bϕ) = a ∇_η ξ + b ∇_η ϕ
• ∇_η(f ξ) = η(f) ξ + f(∇_η ξ),
wherein a vector field ξ acts on a function f by derivation, i.e., ξ(f) = D(f)[ξ], also noted ξf in the literature.

On a Riemannian manifold, the Levi-Civita connection is the canonical choice as it preserves the Riemannian metric. The connection is computed as follows:

Algorithm 2 Newton's Method on Riemannian Manifold
Require: Manifold M, function f, retraction R, and affine connection ∇.
1: Initialize X ∈ M.
2: while ||grad f(X)||_X ≥ ǫ do
3:    Find the descent direction ξ_X ∈ T_X M such that:
         hess f(X)[ξ_X] = −grad f(X),   (17)
      wherein hess f(X)[ξ_X] = ∇_{ξ_X} grad f(X).
4:    Retract X = R_X(ξ_X).
5: end while
6: Output X.

Definition 4. The Levi-Civita connection is the unique affine connection on M with the Riemannian metric ⟨., .⟩ that satisfies, for all η, ξ, χ ∈ T M:
1) ∇_η ξ − ∇_ξ η = [η, ξ]
2) χ⟨η, ξ⟩ = ⟨∇_χ η, ξ⟩ + ⟨η, ∇_χ ξ⟩,
where [ξ, η] is the Lie bracket, i.e., a function from the set of smooth functions to itself defined by [ξ, η]g = ξ(η(g)) − η(ξ(g)).

For the manifolds of interest in this paper, the Lie bracket can be written as the implicit directional differentiation [ξ, η] = D(η)[ξ] − D(ξ)[η]. The expression of the Levi-Civita connection can be computed using the Koszul formula:
    2⟨∇_χ η, ξ⟩ = χ⟨η, ξ⟩ + η⟨ξ, χ⟩ − ξ⟨χ, η⟩ − ⟨χ, [η, ξ]⟩ + ⟨η, [ξ, χ]⟩ + ⟨ξ, [χ, η]⟩.   (15)
Note that connections, and in particular the Levi-Civita connection, are defined for all vector fields on M. However, for the purpose of this paper, only the tangent bundle is of interest. With the above notion of connection, the Riemannian Hessian can be written as:

Definition 5. The Riemannian Hessian of f at X, denoted by hess f(X), of a manifold M is a mapping from T_X M into itself defined by:
    hess f(X)[ξ_X] = ∇_{ξ_X} grad f(X), ∀ ξ_X ∈ T_X M,   (16)
where grad f(X) is the Riemannian gradient and ∇ is the Riemannian connection on M.

It can readily be verified that the Riemannian Hessian verifies a similar property as the Euclidean one, i.e., for all ξ_X, η_X ∈ T_X M, we have
    ⟨hess f(X)[ξ_X], η_X⟩_X = ⟨ξ_X, hess f(X)[η_X]⟩_X.

Remark 1. The name of Riemannian gradient and Hessian is due to the fact that the function f can be approximated in a neighborhood of X by the following:
    f(X + δ_X) = f(X) + ⟨grad f(X), Exp_X^{-1}(δ_X)⟩_X + (1/2)⟨hess f(X)[Exp_X^{-1}(δ_X)], Exp_X^{-1}(δ_X)⟩_X.   (18)

Using the above definitions, the generalization of Newton's method to Riemannian optimization is done by replacing both the Euclidean gradient and Hessian by their Riemannian counterparts in (10). Hence, the search direction is the tangent vector ξ_X that satisfies hess f(X)[ξ_X] = −grad f(X). The update is found by retracting the tangent vector to the manifold. The steps of the algorithm are illustrated in Algorithm 2.

C. Problems of Interest

As shown in the previous section, computing the Riemannian gradient and Hessian for a given function over some manifold M allows the design of efficient algorithms that exploit the geometrical structure of the problem. The paper's main contribution is to propose a framework for solving a subset of convex programs including those in which the optimization variable represents a doubly stochastic and possibly symmetric and/or definite multidimensional probability distribution function.

In particular, the paper derives the relationship between the Euclidean gradient and Hessian and their Riemannian counterpart for the manifolds of doubly stochastic matrices, symmetric stochastic matrices, and symmetric positive stochastic matrices. In other words, for a convex function f : R^{n×m} −→ R, the paper proposes solving the following problem:
    min f(X)   (19a)
    s.t. X_ij > 0, ∀ 1 ≤ i ≤ n, 1 ≤ j ≤ m,   (19b)
    Σ_{j=1}^{m} X_ij = 1, ∀ 1 ≤ i ≤ n,   (19c)
    Σ_{i=1}^{n} X_ij = 1, ∀ 1 ≤ j ≤ m,   (19d)
    X = X^T,   (19e)
    X ≻ 0,   (19f)
wherein constraints (19b)-(19c) produce a stochastic matrix, (19b)-(19d) a doubly stochastic one, (19b)-(19e) a symmetric stochastic one, and (19b)-(19f) a definite symmetric stochastic one. While the first scenario is studied in [25], the next sections study each problem, respectively. Let 1 be the all-ones vector and define the multinomial, doubly stochastic multinomial, symmetric multinomial, and definite multinomial manifolds, respectively, as follows:
    P^m_n = {X ∈ R^{n×m} | X_ij > 0, X1 = 1}
    DP_n = {X ∈ R^{n×n} | X_ij > 0, X1 = 1, X^T 1 = 1}
    SP_n = {X ∈ R^{n×n} | X_ij > 0, X1 = 1, X = X^T}
    SP+_n = {X ∈ R^{n×n} | X_ij > 0, X1 = 1, X = X^T, X ≻ 0}.
For all the above manifolds, the paper uses the Fisher information as the Riemannian metric g whose restriction on T_X M is defined by:
    g(ξ_X, η_X) = ⟨ξ_X, η_X⟩_X = Tr((ξ_X ⊘ X)(η_X)^T) = Σ_{i=1}^{n} Σ_{j=1}^{m} (ξ_X)_ij (η_X)_ij / X_ij, ∀ ξ_X, η_X ∈ T_X M.   (20)
Endowing the multinomial with the Fisher information as Riemannian metric gives the manifold a differential structure that is invariant over the choice of coordinate system. More information about the Fisher information metric and its use in information geometry can be found in [30]. Using the manifold definitions above, the optimization problems can be reformulated over the manifolds as:
    min_{X ∈ P^m_n} f(X),   min_{X ∈ DP_n} f(X),   min_{X ∈ SP_n} f(X),   min_{X ∈ SP+_n} f(X).
In the rest of the paper, the notation A ⊘ B refers to the Hadamard, i.e., element-wise, division of A by B. Similarly, the symbol ⊙ denotes the Hadamard product.
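The four search spaces defined above translate directly into simple feasibility tests. The following Python helpers are a minimal illustration (the tolerance is an arbitrary choice) of which constraints of (19) each manifold enforces.

```python
import numpy as np

def in_multinomial(X, tol=1e-9):
    """X in P^m_n: strictly positive entries and unit row sums, cf. (19b)-(19c)."""
    return X.min() > 0 and np.allclose(X.sum(axis=1), 1, atol=tol)

def in_doubly_stochastic(X, tol=1e-9):
    """X in DP_n: additionally unit column sums, cf. (19d)."""
    return in_multinomial(X, tol) and np.allclose(X.sum(axis=0), 1, atol=tol)

def in_symmetric_stochastic(X, tol=1e-9):
    """X in SP_n: stochastic and symmetric, cf. (19e)."""
    return in_multinomial(X, tol) and np.allclose(X, X.T, atol=tol)

def in_definite_stochastic(X, tol=1e-9):
    """X in SP+_n: symmetric stochastic and positive definite, cf. (19f)."""
    return in_symmetric_stochastic(X, tol) and np.linalg.eigvalsh(X).min() > 0
```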

III. THE DOUBLY STOCHASTIC MULTINOMIAL MANIFOLD

This section studies the structure of the doubly stochastic manifold DP_n and provides the expressions of the necessary ingredients to design Riemannian optimization algorithms over the manifold.

A. Manifold Geometry

The set of doubly stochastic matrices is the set of square matrices with positive entries such that each column and row sums to 1. It can easily be shown that only a square matrix can verify such property. As a consequence of the Birkhoff-von Neumann theorem, DP_n is an embedded manifold of R^{n×n}. A short proof of the Birkhoff-von Neumann theorem using elementary geometry concepts can be found in [31]. The dimension of DP_n is (n−1)^2, which can be seen from the fact that the manifold is generated from 2n−1 linearly independent equations specifying that the rows and columns all sum to one. The dimension of the manifold would be clearer after deriving the tangent space, which is a linear space with the same dimension as the manifold.

Let X ∈ DP_n be a point on the manifold; the tangent space T_X DP_n is given by the following preposition.

Preposition 1. The tangent space T_X DP_n is defined by:
    T_X DP_n = {Z ∈ R^{n×n} | Z1 = 0, Z^T 1 = 0},   (21)
wherein 0 is the all-zeros vector.

Proof. The technique of computing the tangent space of the manifolds of interest in this paper can be found in Appendix A. The complete proof of the expression of the tangent space of doubly stochastic matrices is located in the first subsection of Appendix A.

From the expression of the tangent space, it is clear that the equations Z1 = 0 and Z^T 1 = 0 yield only 2n−1 linearly independent constraints, as the last column constraint can be written as the sum of all rows, using the fact that the previous (n−1) columns sum to zero. Let Π_X : R^{n×n} −→ T_X DP_n be the orthogonal, in the ⟨., .⟩_X sense, projection of the ambient space onto the tangent space. The expression of such operator is given in the upcoming theorem:

Theorem 1. The orthogonal projection Π_X has the following expression:
    Π_X(Z) = Z − (α1^T + 1β^T) ⊙ X,   (22)
wherein the vectors α and β are obtained through the following equations:
    α = (I − XX^T)^†(Z − XZ^T)1,   (23)
    β = Z^T 1 − X^T α,   (24)
with Y^† being the left-pseudo inverse that satisfies Y^†Y = I.

Proof. Techniques for computing the orthogonal projection on the tangent space for the manifolds of interest in this paper can be found in Appendix B. The projection on the tangent space of the doubly stochastic matrices can be found in the first subsection of Appendix B.

The projection Π_X is of great interest as it would allow in the next subsection to relate the Riemannian gradient and Hessian to their Euclidean equivalent.

Remark 2. The above theorem gives separate expressions of α and β for ease of notation in the upcoming computation of the Hessian. However, such expressions require squaring the matrix X, i.e., XX^T, which might not be numerically stable. For implementation purposes, the vectors α and β are obtained as one of the solutions (typically the left-pseudo inverse) to the linear system:
    [ I    X ] [α]   [ Z1    ]
    [ X^T  I ] [β] = [ Z^T 1 ].   (25)

B. Riemannian Gradient and Retraction Computation

This subsection first derives the relationship between the Riemannian gradient and its Euclidean counterpart for the manifold of interest. The equation relating these two quantities is first derived in [25] for the multinomial manifold, but no proof is provided therein. For completeness purposes, we provide the lemma with its proof in this manuscript.

Lemma 1. The Riemannian gradient grad f(X) can be obtained from the Euclidean gradient Grad f(X) using the identity:
    grad f(X) = Π_X(Grad f(X) ⊙ X).   (26)

Proof. As shown in Section II, the Riemannian gradient is by definition the unique element of T_X DP_n that is related to the directional derivative through the Riemannian metric as follows:
    ⟨grad f(X), ξ_X⟩_X = Df(X)[ξ_X], ∀ ξ_X ∈ T_X DP_n.   (27)
Since the Riemannian gradient is unique, finding an element of the tangent space that verifies the equality for all tangent vectors is sufficient to conclude that it is the Riemannian gradient. Now note that the Euclidean gradient can be written as a function of the directional derivative using the usual scalar product as:
    ⟨Grad f(X), ξ⟩ = Df(X)[ξ], ∀ ξ ∈ R^{n×n}.   (28)
In particular, by restricting the above equation to T_X DP_n ⊂ R^{n×n} and converting the usual inner product to the Riemannian one, we can write:
    ⟨Grad f(X), ξ_X⟩ = ⟨Grad f(X) ⊙ X, ξ_X⟩_X = Df(X)[ξ_X], ∀ ξ_X ∈ T_X DP_n.   (29)
Finally, projecting the scaled Euclidean gradient onto the tangent space and its orthogonal complement, i.e., Grad f(X) ⊙ X = Π_X(Grad f(X) ⊙ X) + Π_X^⊥(Grad f(X) ⊙ X), yields
    ⟨Grad f(X) ⊙ X, ξ_X⟩_X = ⟨Π_X(Grad f(X) ⊙ X), ξ_X⟩_X,

6 wherein, by definition of the projection on the orthogonal be the projection onto the set of doubly stochastic matri- complement of the tangent space, the following holds ces obtained using the Sinkhorn-Knopp algorithm [32]. The ⊥ ΠX(Grad f(X) X), ξX X =0 (30) proposed retraction, using the element-wise exponential of a h ⊙ i The element ΠX(Grad f(X) X) being a tangent vector that matrix exp(.), is given in the following lemma satisfy (27), we conclude that:⊙ Lemma 2. The mapping R : DPn DPn whose X X X T −→ grad f( ) = ΠX(Grad f( ) ) (31) restriction RX to XDPn is given by: ⊙ T RX(ξX)= (X exp(ξX X)) , (34) P ⊙ ⊘ Note that the result of Lemma 1 depends solely on the is a retraction on the doubly stochastic multinomial manifold for all ξX DPn. expression of the Riemannian metric and thus is applicable ∈ T to the three manifolds of interest in this paper. Combining Proof. To show that the operator represents a well-defined re- the expression of the orthogonal projection with the one of traction, one needs to demonstrate that the centering and local Lemma 1, we conclude that the Riemannian gradient has the rigidity conditions are satisfied. The fact that the mapping is following expression: obtained from the smoothness of the projection onto the set of grad f(X)= γ (α1T + 11T γ 1αT X) X doubly stochastic matrices which is provided in Appendix C. − − ⊙ α = (I XXT )†(γ XγT )1 The centering property is straightforward, i.e.,: − − RX(0)= (X exp(0)) = (X)= X, (35) γ = Grad f(X) X. (32) P ⊙ P ⊙ wherein the last inequality is obtained from the fact that X is For numerical stability, the term α is computed in a similar fashion as the procedure described in Remark 2 wherein Z is a . replaced by . To prove the local rigidity condition, one needs to study the γ X X As shown in Section II, one needs only to define a retraction perturbation of ( ) around a “small” perturbation ∂ in the PDP X from the tangent bundle to the manifold instead of the com- tangent space n wherein small refers to the fact that + X Rn×n T plex exponential map to take advantage of the optimization ∂ . First note from the Sinkhorn-Knopp algorithm ∈ X D XD X algorithms on the Riemannian manifolds. Among all possible that ( ) = 1 2. However, since is already doubly P D D I retractions, one needs to derive one that have low-complexity stochastic, then 1 = 2 = . The first order approximation of the can be written as: in order to obtain efficient optimization algorithms. Therefore, P (X + ∂X) = (D + ∂D )(X + ∂X)(D + ∂D ) the canonical choice is to exploit the linear structure of the P 1 1 2 2 embedding space in order to derive a retraction that does not D1XD2 + D1∂XD2 + ∂D1XD2 + D1X∂D2 require a projection on the manifold. Such canonical retraction ≈ X + ∂X + ∂D1X + X∂D2 (36) is given in the following theorem: ≈ Since (X + ∂X) and X are doubly stochastic and ∂X is in P Theorem 2. The mapping R : DPn DPn whose the tangent space, then we obtain: T −→ restriction RX to XDPn is given by: (X + ∂X)1 = (X + ∂X + ∂D1X + X∂D2)1 T X P ⇒ RX(ξX)= + ξX, (33) ∂D1X1 + X∂D21 = ∂D11 + X∂D21 = 0 (37) T T represents a well-defined retraction on the doubly stochastic Similarly, by post multiplying by 1 , we obtain 1 ∂D1X + T T multinomial manifold provided that ξX is in the neighborhood 1 ∂D2 = 0 . For easy of notation, let ∂D11 = ∂d1, i.e., 0 X of X, i.e., ij > (ξX)ij , 1 i, j n. ∂d1 is the vector created from the diagonal entries of ∂D1 − ≤ ≤ D Proof. The proof of this theorem relies on the fact that the and the same for ∂ 2. 
Combining both equations above, the manifold of interest is an embedded manifold of an Euclidean perturbation on the diagonal matrices satisfy the condition: I X d 0 space. For such manifold, one needs to find a matrix decom- ∂ 1 XT I d = 0 (38) position with desirable dimension and smoothness properties. ∂ 2  d      The relevant theorem and techniques for computing the canon- ∂ 1 In other words, d is the null space of the above matrix ical retraction on embedded manifold are given in Appendix C. ∂ 2   1 The proof of Theorem 2 is accomplished by extending the which is generated by 1 from the previous analysis. As a Sinkhorn’s theorem [32] and can be found in the first section d d −1  D X X D 0 of Appendix C. result, ∂ 1 = ∂ 2 = c which gives ∂ 1 + ∂ 2 = . Therefore, (X− + ∂X) X + ∂X. Now, consider the curve P ≈ The performance of the above retraction are satisfactory γξX (τ)= RX(τξX). The derivative of the curve at the origin as long as the optimal solution X does not have vanishing can be written as: X dγξX (τ) γξX (τ) γξX (0) entries, i.e., some ij that approach 0. In such situations, the = lim − update procedure results in tiny steps which compromises the dτ τ=0 τ→0 τ X X X convergence speed of the optimization algorithms. Although ( exp(τξX )) = lim P ⊙ ⊘ − (39) the projection on the set of doubly stochastic matrices is τ→0 τ difficult [33], this paper proposes a highly efficient retraction A first order approximation of the exponential allows to that take advantage of the structure of both the manifold and express the first term in the denominator as: its tangent space. Define the set of entry-wise positive matrices (X exp(τξX X)) = (X + τξX)= X + τξX (40) n×n n×n n×n P ⊙ ⊘ P R = X R Xij > 0 and let : R DPn wherein the last equality is obtained from the previous analy- { ∈ | } P −→ 7 sis. Plugging the expression in the limit expression shows the space generated by X are written without the subscript. The local rigidity condition, i.e.,: left hand side gives:

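For reference, the following Python sketch implements the tangent-space projection of Theorem 1 and the gradient conversion of Lemma 1 for the doubly stochastic multinomial manifold. As suggested in Remark 2, the coefficients α and β are obtained from the linear system (25) (here via a least-squares solve) rather than from the pseudo-inverse expressions (23)-(24); the sketch assumes X is doubly stochastic with strictly positive entries.

```python
import numpy as np

def project_tangent_dp(X, Z):
    """Orthogonal projection onto T_X DP_n (Theorem 1):
    Pi_X(Z) = Z - (alpha 1^T + 1 beta^T) .* X, with (alpha, beta) solving (25)."""
    n = X.shape[0]
    one = np.ones(n)
    A = np.block([[np.eye(n), X], [X.T, np.eye(n)]])
    b = np.concatenate([Z @ one, Z.T @ one])
    sol = np.linalg.lstsq(A, b, rcond=None)[0]   # the system is rank deficient by one
    alpha, beta = sol[:n], sol[n:]
    return Z - (np.outer(alpha, one) + np.outer(one, beta)) * X

def riemannian_grad_dp(X, egrad):
    """Euclidean-to-Riemannian gradient conversion of Lemma 1: grad f = Pi_X(Grad f .* X)."""
    return project_tangent_dp(X, egrad * X)
```

One can verify that the output of project_tangent_dp has zero row and column sums, as required by the tangent space characterization (21).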
dγξX (τ) χ η, ξ X = D( η, ξ X)[χ] = ξX (41) h i h i dτ τ=0 n n ηij ξij Therefore, RX(ξX) is a retraction on the doubly stochastic = D [χij ] Xij multinomial manifold. i=1 j=1 ! X X n ηij ξij 1 = D (ξij )+ D (ηij )+ ηij ξij D [χij ] Note that the above retraction resembles the one proposed Xij Xij Xij i,j=1 !! in [25] (without proof) for the trivial case of the multinomial X n n manifold. The projection on the set of doubly stochastic ηij ξij χij = Dχη, ξ X + η, Dχξ X h i h i − X2 matrices is more involved as shown in the lemma above. i=1 j=1 ij Further, note that the retraction does not require the tangent X X = χη, ξ X + η, χξ X. (44) vector ξX to be in the neighborhood of X as the one derived h∇ i h ∇ i in Theorem 2. However, it is more expensive to compute as it requires projecting the update onto the manifold. Recall that the Euclidean Hessian Hess f(X)[ξX] = D(Grad f(X))[ξX] is defined as the directional derivative of Remark 3. The local rigidity condition of the retraction in the Euclidean gradient. Using the results above, the Rieman- (39) is particularly interesting as it shows that the canonical nian Hessian can be written as a function of the Euclidean retraction X + ξX is the first order approximation of the gradient and Hessian as follows: retraction (X exp(ξX X)) around the origin. P ⊙ ⊘ Theorem 3. The Riemannian Hessian hess f(X)[ξX] can be obtained from the Euclidean gradient Grad f(X) and the Euclidean Hessian Hess f(X)[ξX] using the identity: C. Connection and Riemannian Hessian Computation 1 hess f(X)[ξX] = ΠX δ˙ (δ ξX) X As shown in Section II, the computation of the Riemannian − 2 ⊙ ⊘ ! Hessian requires the derivation of the Levi-Civita connection α = ǫ(γ XγT )1 ηX ξX. Using the result of [13], the Levi-Civita connection − T 1 XT ∇of a submanifold of the Euclidean space Rn×n can be ob- β = γ α M − tained by projecting the Levi-Civita ηX ξX of the embedding γ = Grad f(X) X ∇ T ⊙ T space onto the manifold, i.e., ηX ξX = ΠX( ηX ξX). From δ = γ (α1 + 1β ) X ∇ ∇ n×n the Koszul formula (15), the connection ηX ξX on R − ⊙ ∇ ǫ = (I XXT )† solely depends on the Riemannian metric. In other words, − X T X T 1 the connection ηX ξX on the embedding space is the same α˙ = ǫ˙(γ γ )+ ǫ(γ ˙ ξXγ γ˙ ) ∇ − − − for all the considered manifolds in this paper. For manifolds ˙ T 1 T XT β =γ ˙ ξXα α˙  endowed with the Fisher information as metric, the Levi-Civita − − γ˙ = Hess f(X)[ξX] X + Grad f(X) ξX connection on Rn×n is given in [25] as follows: ⊙ ⊙ T T T T δ˙ =γ ˙ (α ˙ 1 + 1β˙ ) X (α1 + 1β ) ξX Proposition 1. The Levi-Civita connection on the Euclidean − T T ⊙ − ⊙ ǫ˙ = ǫ(Xξ + ξXX )ǫ (45) space Rn×n endowed with the Fisher information is given by: X 1 X ηX ξX = D(ξX)[ηX] (ηX ξX) (42) ∇ − 2 ⊙ ⊘ Proof. The useful results to compute the Riemannian Hessian Proof. The Levi-Civita connection is computed in [25] using for the manifold of interest in this paper can be found in the Koszul formula. For completeness, this short proof shows Appendix D. The expression of the Riemannian Hessian for that the connection do satisfy the conditions proposed in the doubly stochastic multinomial manifold can be found in Definition 4. Since the embedding space is a Euclidean space, the first subsection of the appendix. the Lie bracket can be written as directional derivatives as follows: IV. 
THE SYMMETRIC MULTINOMIAL MANIFOLD [ηX, ξX]= D(ξX)[ηX] D(ηX)[ξX] Whereas the doubly stochastic multinomial manifold is − 1 regarded as an embedded manifold of the vector space of = D(ξX)[ηX] (ηX ξX) X matrices Rn×n, the symmetric and positive multinomial mani- − 2 ⊙ ⊘ folds are seen as embedded manifolds of the set of symmetric 1 D(ηX)[ξX] (ηX ξX) X matrices. In other words, the embedding Euclidean space is − − 2 ⊙ ⊘ ! the space of symmetric matrices n defined as: X Rn×nS X XT = ηX ξX ξX ηX (43) n = = (46) ∇ − ∇ S ∈ The second property is obtained by direct computation of the Such choice of the ambient space allows to reduce the  2 n(n+1) right and left hand sides in Definition 4. For simplicity of ambient dimension from n to 2 and thus enables the notation, all the tangent vectors χX, ηX, ξX in the tangent the simplification of the projection operators. As a result,

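A possible implementation of the Sinkhorn-Knopp projection P and of the retraction of Lemma 2, R_X(ξ_X) = P(X ⊙ exp(ξ_X ⊘ X)), is sketched below in Python. The iteration cap and tolerance of the projection are illustrative choices; together with the gradient sketch above, these callables provide the ingredients of a first order method on DP_n.

```python
import numpy as np

def sinkhorn_projection(A, max_iter=1000, tol=1e-10):
    """Map an entry-wise positive matrix to a doubly stochastic one by alternately
    normalizing rows and columns (Sinkhorn-Knopp algorithm, [32])."""
    B = A.copy()
    for _ in range(max_iter):
        B = B / B.sum(axis=1, keepdims=True)   # unit row sums
        B = B / B.sum(axis=0, keepdims=True)   # unit column sums
        if (np.abs(B.sum(axis=1) - 1).max() < tol and
                np.abs(B.sum(axis=0) - 1).max() < tol):
            break
    return B

def retract_dp(X, xi):
    """Retraction of Lemma 2 on DP_n: R_X(xi) = P(X .* exp(xi ./ X))."""
    return sinkhorn_projection(X * np.exp(xi / X))
```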
8 the expression of Riemannian gradient and Hessian can be considering that the manifold is embedded in an Euclidean computed more efficiently. subspace. Techniques for computing the retraction on embed- ded manifold is given in Appendix C. Note that the result of A. Manifold Geometry, Gradient, and Retraction the doubly stochastic multinomial is not directly applicable as n×n the embedding space is different ( n instead of R ). The Let X SPn be a point on the manifold, the tangent space ∈ problem is solved by using the DADS theorem [34] instead of XSPn is given by the following preposition. T the Sinkhorn’s one [32]. The complete proof of the corollary Preposition 2. The tangent space XSPn is defined by: is given in the second subsection of Appendix C. T XSPn = Z n Z1 = 0 . (47) T ∈ S 

Proof. The technique of computing the tangent space of The canonical retraction suffers from the same limitation manifold of interest in this paper can be found in Appendix A. as the one discussed in the previous section. Indeed, the The complete proof of the expression of the tangent space performance of the optimization algorithm heavily depend on of symmetric stochastic matrices is located in the second whether the optimal solution has vanishing entries or not. This subsection of Appendix A. section shows that the retraction proposed in Lemma 2 is Let ΠX : n XSPn be the orthogonal, in the ., . X a valid retraction on the set of symmetric double stochastic S −→ T h i sens, projection of the ambient space onto the tangent space. matrices. However, instead of the Sinkhorn-Knopp algorithm Note that the ambient space for the symmetric multinomial [32], this part uses the DAD algorithm [34] to project the n×n T SPn is the set of symmetric matrices n and not the set of all retracted vector. Let = X R X > 0, X = X S n ij matrices Rn×n as in Section III. The following theorem gives represent the set of symmetric,S element-wise∈ positive matrices.  the expression of the projection operator: The projection onto the set of symmetric doubly stochastic + matrices is denoted by the operator : n SPn. The Theorem 4. The orthogonal projection ΠX operator onto the P S −→ tangent set has the following expression retraction is given in the following corollary. T T ΠX(Z)= Z (α1 + 1α ) X, (48) Corollary 2. The mapping R : SPn SPn whose − ⊙ T −→ wherein the vector α is computed as: restriction RX to XSPn is given by: I X −1Z1 T + α = ( + ) . (49) RX(ξX)= (X exp(ξX X)) , (52) P ⊙ ⊘ is a retraction on the symmetric doubly stochastic multinomial manifold for all ξX SPn. Proof. Techniques for computing the orthogonal projection on ∈ T the tangent space for the manifolds of interest in this paper Proof. The proof of this corollary is straightforward. Indeed, can be found in Appendix B. The projection on the tangent after showing that the range space of RX(ξX) is the set space of the doubly stochastic matrices can be found in the symmetric element-wise matrices , the proof concerning second subsection of Appendix B and its derivation from n the centering and local rigidity of theS retraction are similar to the projection of the doubly stochastic manifold in the third the one in Lemma 2. subsection. Using the result of Lemma 1 and using the expression of the projection onto the tangent space, the Rienmannian gradient can be efficiently computed as: T T grad f(X)= γ (α1 + 1α ) X B. Connection and Riemannian Hessian Computation − ⊙ α = (I + X)−1γ1 γ = (Grad f(X) X), (50) As discussed earlier, the Levi-Civita connection solely de- ⊙ where γ is a simple sum that can be computed efficiently. pends on the Riemannian metric. Therefore, the symmetric Similar to the result for the doubly stochastic multinomial stochastic multinomial manifold shares the same retraction on manifold, the canonical retraction on the symmetric multino- the embedding space4 as the doubly stochastic multinomial mial manifold can be efficiently computed as shown in the manifold. The Riemannian Hessian can be obtained by dif- following corollary. ferentiating the Riemnanian gradient using the projection of the Levi-Civita connection onto the manifold as shown in the Corollary 1. The mapping R : SPn SPn whose T −→ below corollary: restriction RX to XSPn is given by: T X X X X R (ξ )= + ξ , (51) Corollary 3. 
The Riemannian Hessian hess f(X)[ξX] can represents a well-defined retraction on the symmetric multino- be obtained from the Euclidean gradient Grad f(X) and the mial manifold provided that ξX is in the neighborhood of 0X, i.e., Xij > (ξX) , 1 i, j n. − ij ≤ ≤ Proof. The proof of this corollary follows similar steps like 4Even though the embedding spaces are not the same for both manifolds, the one for the doubly stochastic multinomial manifold by one can easily show that the expression of the connection is invariant.

9 Euclidean Hessian Hess f(X)[ξX] using the identity: exponential to retract the tangent vectors as shown in the 1 following theorem. hess f(X)[ξX] = ΠX δ˙ (δ ξX) X − 2 ⊙ ⊘ ! SP+ SP+ Theorem 5. Define the map RX from X n to n by: I X −1 1 T α = ( + ) γ 1 1 −ωXξX RX(ξX)= X + I e , (55) δ = γ (α1T + 1αT ) X ωX − ωX Y 5 − ⊙ wherein e the of matrix Y and ωX is a γ = Grad f(X) X scalar that ensures: −1⊙ −1 −1 α˙ = (I + X) γ˙ (I + X) ξX(I + X) γ 1 − RX(ξX)ij > 0, 1 i, j n (56) T T T T ≤ ≤ δ˙ =γ ˙ (α ˙ 1 + 1α˙ ) X (α1 + 1α ) ξX 0 − ⊙ − ⊙ RX(ξX) , (57) X X X + ≻ γ˙ = Hess f( )[ξX] + Grad f( ) ξX (53) for all ξX XSP in the neighborhood of 0X, i.e., ⊙ ⊙ ∈ T n ξX F ǫ for some ǫ > 0. Then, there exists a sequence || || ≤ + of scalars ωX + such that the mapping R : SP { }X∈SPn T n −→ SP+, whose restriction RX to XSP+, is a retraction on the V. EXTENSIONTOTHE DEFINITE SYMMETRIC n T n MULTINOMIAL MANIFOLD definite symmetric multinomial manifold. The definite symmetric stochastic multinomial manifold is Proof. Unlike the previous retractions that rely on the Eu- defined as the subset of the symmetric stochastic multinomial clidean structure of the embedding space, this retraction is manifold wherein the matrix of interest is positive definite. obtained by direct computation of the properties of the re- Similar to the condition Xij > 0, the strict condition X 0, ≻ traction given in Section II. The organization of the proof is i.e., full-rank matrix, ensures that the manifold has a differen- the following: First, assuming the existence of ωX, we show tiable structure. that the range of the mapping RX is included in the definite The positive-definiteness constraint is a difficult one to symmetric multinomial manifold. Afterward, we demonstrate retract. In order to produce highly efficient algorithms, one that the operator satisfies the centering and local rigidity usually needs a re-parameterization of the manifold and to conditions. Therefore, the operator represents a retraction. regard the new structure as a quotient manifold, e.g., a Finally, showing the existence of the scalar ωX for an arbitrary Grassmann manifold. However, this falls outside the scope of X SP+ n concludes the proof. this paper and is left for future investigation. This part extends ∈ the previous study and regards the manifold as an embedded Recall that the matrix exponential of a symmetric real T matrix ξX with eigenvalue decomposition ξX = UΛU is manifold of n for which two retractions are proposed. S given by eξX = U exp(Λ)UT , where exp(Λ) is the usual element-wise exponential of the element on the diagonal and A. Manifold Geometry zeros elsewhere. From the derivation of the tangent space SP+ The manifold geometry of the definite symmetric stochastic of the definite symmetric multinomial manifold X n , we T multinomial manifold is similar to the one of the previous have ξX1 = 0. Therefore, ξX has an eigenvalue of 0 cor- manifold. Indeed, the strictly positive constraint being an responding to the eigenvector 1. As stated by the definition inequality one, the dimension of the manifold is unchanged. of the matrix exponential, the eigenvalue are exponentiated X SP+ and the eigenvectors are unchanged. Therefore, eξX (and thus Therefore, let n be a point on the manifold, the SP∈ + e−ωXξX ) has an eigenvalue of exp(0) = 1 corresponding to tangent space X n of the definite symmetric stochastic ωXξX multinomial manifoldT is similar to tangent space of the sym- the eigenvector 1, i.e., e− 1 = 1. 
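For the symmetric multinomial manifold, the Euclidean-to-Riemannian gradient conversion (50) reduces to a single linear solve with I + X. A minimal Python sketch, assuming X is a symmetric doubly stochastic matrix with strictly positive entries (so that I + X is invertible), is given below.

```python
import numpy as np

def riemannian_grad_sym(X, egrad):
    """Riemannian gradient on the symmetric multinomial manifold, cf. (50):
    gamma = Grad f(X) .* X,  alpha = (I + X)^{-1} gamma 1,
    grad f(X) = gamma - (alpha 1^T + 1 alpha^T) .* X."""
    n = X.shape[0]
    gamma = egrad * X
    alpha = np.linalg.solve(np.eye(n) + X, gamma @ np.ones(n))
    return gamma - (np.outer(alpha, np.ones(n)) + np.outer(np.ones(n), alpha)) * X
```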
First note from the first metric stochastic multinomial manifold, i.e.,: condition on ωX, all entries are positive. Now, computing the SP+ Z Z1 0 rows summation gives: X n = n = . (54) T ∈ S 1 1 −ωXξX As a result, the expression of the Riemannian gradient and RX(ξX)1 = X1 + I1 e 1  ωX − ωX Hessian are the same as the one presented in the previous 1 1 section. Hence, one only needs to design a retraction to derive = 1 + 1 1 optimization algorithms on the manifold. ωX − ωX = 1. (58) Hence RX(ξX) is stochastic. Besides, all matrices are sym- B. Retraction on the Cone of Positive Definite Matrices metric which concludes that the matrix is doubly stochastic. As shown in the previous subsection, the geometry of the Finally, the second condition on ωX ensure the definiteness of definite symmetric multinomial manifold is similar to the the matrix which concludes that RX(ξX) SP+. ∈ n symmetric multinomial manifold one. Therefore, one can ex- The centering property can be easily checked by evaluating tend the canonical retraction proposed in the previous section. + the retraction RX at the zero-element 0X of XSP . Indeed, However, even tough the retraction looks similar to the one T n proposed in the previous section, its implementation is more problematic as it includes a condition on the eigenvalues. Hence, the section proposes another retraction that exploits 5Not to be confused with the exponential map Exp of the Riemannian the definite structure of the manifold and uses the matrix geometry or with the element-wise exponential exp.

10 we obtain: decreasing to 0, then there exists M0 such that m M0, 1 1 −ωX0X ∀ ≥ RX(0X)= X + I e the following is true: n ωX − ωX m 2 ′ (64) m 2 (1 exp( ωX ǫ)) ǫ ǫ X 1 I 1 I X (ωX ) − − ≤ ≤ = + = . (59) ′ ωX − ωX Combining the above results, we find out that ǫ > 0, M0 ∀ ∃ The speed of the rigidity curve γξX (τ) = RX(τξX) at the such that m M0 the following holds: origin is given by: X ∀ X≥ ′ SP+ m(ξX) F <ǫ , ξX X n with ξX F ǫ e−ωXτξX || − || ∀ ∈ T || || ≤ dγξx (τ) 1 d = dτ τ=0 −ωX dτ τ=0 Finally, as stated earlier, combining the uniform conver- 1 d exp( ωXτΛ) T + = U − U gence and the fact that n is an open subset of n allows −ωX dτ τ=0 S S us to conclude the existence of ωX such that both conditions U UT SP+ = Λ exp( ωXτΛ) (56) and (57) are satisfied for all tangent vectors ξX X n − τ=0 ∈ T with ξX ǫ. T T F = UΛUU exp( ωXτΛ) U || || ≤ − τ=0 Remark 4. The retraction developed in Theorem 5 can be −ωXτξX = ξXe = ξX (60) applied to the symmetric multinomial manifold in which the τ=0 scalar ωX is chosen in such fashion so as to satisfy condition Therefore, we conclude that RX(ξX ) is a well-defined retrac- tion. (56) independently of condition (57). However, the retraction is not valid for the doubly stochastic multinomial manifold. Finally, the existence of the weight ωX is ensured by the fact + Indeed, the tangent vectors of the doubly stochastic multino- that n is an open subset of n. Consider a positive sequence mS∞ S mial manifold being not necessarily symmetric, nothing can ωX m=1 decreasing to 0 and construct the function {X } ∞ be claimed about the exponential of the vector ξX. m m=1 as follows: { } 1 1 m X X I e−ωX ξX While the retraction proposed in Theorem 5 is superior to m(ξX)= + m m . (61) ωX − ωX the canonical one, it still suffers from the scaling of ξX. In X ∞ We aim to show that m m=1 uniformly converges to the other words, if the optimal solution has vanishing entries or + { +} constant X . Since is an open set, then there exists vanishing eigenvalues, the retraction results in tiny updates ∈ Sn Sn an index m0 above which, i.e., m m0 the sequence which compromise the convergence speed of the proposed X + SP+∀ ≥ m(ξX) n , ξX X n with ξX F ǫ. Hence algorithms. As stated at the beginning of the section, the ∈ S ∀ ∈ T m || || ≤ ωX can be chosen to be any ωX for m m0. The uniform problem can be solved by re-parameterizing the manifold convergence of the series is given in the≥ following lemma which falls outside the scope of this paper.
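The retraction of Theorem 5 can be sketched with a matrix exponential and a simple search over the scalar ω_X: starting from an arbitrary value, ω is shrunk until the positivity condition (56) and the definiteness condition (57) hold, which the theorem guarantees is possible for tangent vectors that are small enough. The starting value and shrinking factor below are illustrative choices, not prescriptions from the paper.

```python
import numpy as np
from scipy.linalg import expm

def retract_definite(X, xi, omega0=1.0, shrink=0.5, max_tries=60):
    """Retraction of Theorem 5 on the definite symmetric multinomial manifold:
    R_X(xi) = X + (1/omega) I - (1/omega) expm(-omega * xi).
    Assumes xi is a symmetric tangent vector (xi @ 1 = 0); omega is shrunk until the
    result is entry-wise positive (56) and positive definite (57)."""
    n = X.shape[0]
    omega = omega0
    for _ in range(max_tries):
        R = X + (np.eye(n) - expm(-omega * xi)) / omega
        if R.min() > 0 and np.linalg.eigvalsh(R).min() > 0:
            return R
        omega *= shrink
    raise ValueError("no suitable omega found; xi may be too large")
```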

Lemma 3. The uniform convergence of the function series VI. OPTIMIZATION ALGORITHMS AND COMPLEXITY X ∞ is satisfied as ′ , such that m m=1 ǫ > 0 M0 m M0 This section analysis the complexity of the steepest de- the{ following} holds: ∀ ∃ ∀ ≥ scent, summarized in Algorithm 1, and the Newton’s method, ′ + Xm(ξX) X F <ǫ , ξX XSP with ξX F ǫ || − || ∀ ∈ T n || || ≤ summed up in Algorithm 2, algorithms on the proposed Proof. Note that the condition over all tangent vectors can be Riemannian Manifolds. While this section only presents the replaced by the following condition (up to an abuse of notation simplest first and second order algorithms, the simulation with the ǫ′): section uses the more sophisticated conjugate gradient and ′ + trust regions methods as first and second order algorithms to Xm(ξX) X F <ǫ , ξX XSP with ξX F ǫ || − || ∀ ∈ T n || || ≤ obtain the curves. The complexity analysis of these algorithms X X 2 ′ sup m(ξX) F <ǫ is similar to the one presented herein as it is entirely governed ⇔ + ξX∈TXSPn || − || ||ξX||F ≤ǫ by the complexity of computing the Riemannian gradient 2 ′ and Hessian. Table I summarizes the first and second order max Xm(ξX) X <ǫ , (62) + F ⇔ ξX∈TXSPn || − || complexity for each manifold. ||ξX||F ≤ǫ wherein the last equivalence is obtained from the fact that the A. Gradient Descent Complexity search space is closed. The last expression allows us to work with an upper bound of the distance. Indeed, the distance can The complexity of computing the gradient (32) of the be bound by: doubly stochastic multinomial manifold can be decomposed m into the complexity of computing γ, α, and grad f(X). The 2 1 −ω ξX 2 X (ξX) X = I e X m F (ωm)2 F term γ is a simple Hamadard product that can be computed in || − || X || − || 2 n n operations. The term α is obtained by solving the system of 1 m 2 equations in (25) which takes (2/3)(2n)3 when using an LU = m 2 (1 exp( ωX λi)) (ωX ) − − X i=1 factorization. Finally, the expression of grad f( ) requires a n X couple of additions and an hadamard product which can be m 2 (63) m 2 (1 exp( ωX ǫ)) , 2 ≤ (ωX ) − − done in 3n operations. Finally, the complexity of computing with the last inequality being obtained from ξX F ǫ the retraction can be decomposed into the complexity of com- || ||m ≤∞ ⇒ λi ǫ, 1 i n. Now using the fact that ωX is puting the update vector and the complexity of the projection. ≤ ≤ ≤ { }m=1 11 TABLE I COMPLEXITYOFTHESTEEPESTDESCENTAND NEWTON’SMETHODALGORITHMSFORTHEPROPOSEDMANIFOLDS. Manifold Steepest descent algorithm Newton’s method algorithm 3 2 3 2 Doubly Stochastic Multinomial DPn (16/3)n +7n + log(n)√n 32/3n + 15n + log(n)√n 3 2 3 2 Symmetric Multinomial SPn (1/3)n +2n +2n + log(n)√n n +8n + 17/2n + log(n)√n SP+ 3 2 3 2 Definite Symmetric Multinomial n n +3n +3n 4/3n + 13/2n +7n

The update vector is a Hadamard product and division that can be computed in 3n^2 operations. The complexity of projecting a matrix A onto the set of doubly stochastic matrices [35] with accuracy ǫ is given by:
O((1/ǫ + log(n)) √n V/v),   (65)
wherein V = max(A) and v = min(A). Therefore, the total complexity of an iteration of the gradient descent algorithm on the doubly stochastic multinomial manifold is (16/3)n^3 + 7n^2 + log(n)√n.

The complexity of the symmetric stochastic manifold can be obtained in a similar manner. Due to the symmetry, the term γ only requires n(n+1)/2 operations. The term α is the solution to an n × n system of equations, which can be solved in (1/3)n^3. Similarly, grad f(X) requires 3/2 n(n+1) operations. Therefore, the total complexity can be written as:
(1/3)n^3 + 2n^2 + 2n + log(n)√n.   (66)
The retraction on the cone of positive matrices requires n^3 + 2n^2, which gives a total complexity of the algorithm for the positive symmetric multinomial manifold of n^3 + 3n^2 + 3n.

B. Newton's Method Complexity

Newton's method requires computing the Riemannian gradient and Hessian and solving an n × n linear system. However, from the expressions of the Riemannian Hessian, one can note that the complexity of computing the Riemannian gradient is included in the one of the Riemannian Hessian.

For the doubly stochastic manifold, the complexity of computing the Riemannian Hessian is controlled by the complexity of the projection and the inversions. The projection onto the tangent space requires solving an n × n system and a couple of additions and a Hadamard product. The total cost of the operation is 2/3(2n)^3 + 3n^2. The ǫ and ǫ̇ terms are inversions and matrix products that require 4n^3. The other terms combined require 9n^2 operations. The retraction costs 3n^2 + log(n)√n and solving for the search direction requires 2/3(2n)^3, which gives a total complexity of:
32/3 n^3 + 15n^2 + log(n)√n.   (67)
A similar analysis as the one above allows to conclude that the total complexity of a second order method on the symmetric and positive doubly stochastic manifolds requires, respectively, the following number of operations:
n^3 + 8n^2 + 17/2 n + log(n)√n   (68)
4/3 n^3 + 13/2 n^2 + 7n.   (69)

VII. SIMULATION RESULTS

This section attests the performance of the proposed framework in efficiently solving optimization problems in which the optimization variable is a doubly stochastic matrix. The experiments are carried out using Matlab on an Intel Xeon Processor E5-1650 v4 (15M Cache, 3.60 GHz) computer with 32GB 2.4 GHz DDR4 RAM. The optimization is performed using the Matlab toolbox "Manopt" [36] and the conjugate gradient (denoted by CG) and trust region (denoted by TR) algorithms.

The section is divided into three subsections. The first subsection tests the performance of the proposed solution against the popular generic solver "CVX" [37] for each of the manifolds. The subsection further shows the convergence of each manifold in reaching the same solution by means of regularization. The second subsection solves a convex clustering problem [4] and testifies to the efficiency of the proposed algorithm against a generic solver. Finally, the last subsection shows that the proposed framework outperforms a specialized algorithm [5] in finding the solution of a non-convex clustering problem.

A. Performance of the Proposed Manifolds

This section solves the following optimization problem:
min_{X ∈ M} ||A − X||_F^2,   (70)
wherein the manifold M is the doubly stochastic, symmetric, and definite stochastic multinomial manifold, respectively. For each of the experiments, the matrix A is generated by A = M + N with M ∈ M belonging to the manifold of interest and N a zero-mean white Gaussian noise.

The optimization problem is first solved using the toolbox CVX to obtain the optimal solution X* with a predefined precision. The proposed algorithms are iterated until the desired optimal solution X* is reached with the same precision, and the total execution time is displayed in the corresponding table.

TABLE II
PERFORMANCE OF THE DOUBLY STOCHASTIC MULTINOMIAL MANIFOLD AGAINST THE PROBLEM DIMENSION.
n         | 60    | 70    | 80    | 90     | 100
CVX DP_n  | 32.04 | 60.19 | 97.69 | 152.32 | 267.45
CG DP_n   | 0.80  | 0.89  | 1.42  | 1.69   | 2.16
TR DP_n   | 7.69  | 11.03 | 18.24 | 22.60  | 24.17

TABLE III
PERFORMANCE OF THE SYMMETRIC STOCHASTIC MULTINOMIAL MANIFOLD AGAINST THE PROBLEM DIMENSION.
n         | 60   | 70    | 80    | 90    | 100
CVX SP_n  | 8.66 | 16.08 | 26.12 | 39.95 | 59.57
CG SP_n   | 0.18 | 0.31  | 0.34  | 0.44  | 0.53
TR SP_n   | 0.61 | 1.15  | 1.43  | 2.07  | 2.50
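For concreteness, the following is a minimal sketch of the data generation used in experiment (70), assuming the Sinkhorn-Knopp scaling referenced in [32], [35] to produce a doubly stochastic matrix M and an assumed noise level sigma. It is written in Python for illustration and is not the authors' Matlab/Manopt code.

```python
import numpy as np

def sinkhorn(A, n_iter=1000, tol=1e-10):
    """Alternately normalize rows and columns of a positive matrix (Sinkhorn-Knopp,
    the scaling procedure referenced in [32], [35]) until it is doubly stochastic."""
    M = A.copy()
    for _ in range(n_iter):
        M /= M.sum(axis=1, keepdims=True)   # row normalization
        M /= M.sum(axis=0, keepdims=True)   # column normalization
        if np.allclose(M.sum(axis=1), 1.0, atol=tol):
            break
    return M

rng = np.random.default_rng(1)
n, sigma = 100, 0.1                          # problem size and noise level (assumed values)

M = sinkhorn(rng.uniform(0.1, 1.0, (n, n)))  # a point on the doubly stochastic manifold
A = M + sigma * rng.standard_normal((n, n))  # noisy observation used in problem (70)
```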

TABLE IV
PERFORMANCE OF THE DEFINITE SYMMETRIC STOCHASTIC MULTINOMIAL MANIFOLD AGAINST THE PROBLEM DIMENSION.
n           | 60   | 70    | 80    | 90    | 100
CVX SP_n^+  | 5.43 | 11.39 | 22.84 | 33.77 | 53.79
CG SP_n^+   | 0.17 | 0.18  | 0.21  | 0.28  | 0.34
TR SP_n^+   | 0.58 | 0.78  | 0.93  | 1.32  | 1.90

Table II illustrates the performance of the proposed method in denoising a doubly stochastic matrix against the problem dimension. The table reveals a significant gain in the simulation time, ranging from 39 to 123 fold for the first order method and from 4 to 11 fold for the second order algorithm, as compared with the generic solver. The gain in performance can be explained by the fact that the proposed method uses the geometry of the problem efficiently, unlike generic solvers which convert the problem into a standard form and solve it using standard methods. The second order method performs poorly as compared with the first order method due to the fact that the expression of the Riemannian Hessian is complex to compute.

Table III shows the simulation time of the symmetric doubly stochastic multinomial manifold against the problem size. One can note that the gain is more important than the one in Table II. Indeed, as shown in the complexity analysis section, the symmetric manifold enjoys a large dimension reduction as compared with the doubly stochastic one, which makes the required ingredients easier to compute. One can also note that the computation of the Riemannian Hessian of the symmetric stochastic manifold is more efficient than that of the doubly stochastic manifold, which is reflected in a better performance against the conjugate gradient algorithm.

Table IV displays similar performance for the positive symmetric doubly stochastic multinomial manifold. The proposed definite multinomial manifold efficiently finds the solution. This is due to the fact that the optimal solution does not have vanishing entries or eigenvalues, as pointed out in Section V, which makes the retraction efficient. Such condition being not fulfilled in the upcoming couple of subsections, the performance of the positive definite manifold is omitted and a relaxed version using the symmetric manifold and regularization is presented.

The rest of the subsection confirms the linear and quadratic convergence rate behaviors of the proposed methods by plotting the norm of the gradient against the iteration number for each of the manifolds and algorithms. For each of the manifolds, an optimization problem is set up using regularizers in order to reach the optimal solution of optimization problem (70) with M = SP_n^+. The definition of the regularized objective functions is delayed to the next subsection for the more interesting clustering problem. Since the complexity of each step depends on the manifold, optimization problem, and used algorithm, nothing is concluded in this subsection about the efficiency of the algorithms in reaching the same solution.

Fig. 4. Convergence rate of the conjugate gradient algorithm on the various proposed manifolds against the iteration number for a high dimension system n = 1000 (norm of the gradient versus number of iterations).

Figure 4 plots the convergence rate of the first order method using the doubly stochastic, symmetric, and positive manifolds. The figure clearly shows that the conjugate gradient algorithm exhibits a linear convergence rate behavior similar to the one of unconstrained optimization. This is mainly due to the fact that the Riemannian optimization approach converts a constrained optimization into an unconstrained one over a constrained set.

Fig. 5. Convergence rate of the trust region method on the various proposed manifolds against the iteration number for a high dimension system n = 1000 (norm of the gradient versus number of iterations).

Figure 5 shows that the trust region method has a super linear, i.e., quadratic, convergence behavior with respect to the iteration number. The figure particularly shows that the quadratic rate is achieved only after a number of iterations, which can be explained by the fact that our implementation uses a general retraction instead of the optimal (and complex) exponential map.

B. Similarity Clustering via Convex Programming

This section suggests using the proposed framework to solve the convex clustering problem [4]. Given an entry-wise non-negative similarity matrix A between n data points, the goal is to cluster these data points into r clusters. Similar to [4], the matrix is generated by assuming a noisy stochastic block model with 3 blocks, a connection probability of 0.7 intra-cluster and 0.2 inter-cluster, and a noise variance of 0.2.
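The sketch below shows one plausible instantiation of the block model described above. The per-cluster size, the Bernoulli sampling of the adjacency, and the clipping to non-negative entries are assumptions made here for illustration; the paper does not spell out these details.

```python
import numpy as np

rng = np.random.default_rng(2)
r, size = 3, 30                        # 3 blocks; the per-cluster size is an assumption
n = r * size
labels = np.repeat(np.arange(r), size)

# Connection probabilities of the block model described above, plus noise variance.
p_in, p_out, noise_var = 0.7, 0.2, 0.2
P = np.where(labels[:, None] == labels[None, :], p_in, p_out)

# Symmetric 0/1 adjacency sampled from the block model, with zero diagonal.
B = (rng.uniform(size=(n, n)) < P).astype(float)
B = np.triu(B, 1); B = B + B.T

# Entry-wise non-negative, symmetric similarity matrix with additive Gaussian noise.
A = np.clip(B + np.sqrt(noise_var) * rng.standard_normal((n, n)), 0.0, None)
A = (A + A.T) / 2
```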

Under the above conditions, the reference guarantees the recovery of the clusters by solving the following optimization problem:
min_{X ∈ SP_n, X ⪰ 0} ||A − X||_F^2 + λ Tr(X),   (71)
wherein λ is the regularizer parameter whose expression is derived in [4]. The optimal solution to the above problem is (up to a permutation) a block diagonal matrix of rank equal to the number of clusters. Due to such rank deficiency of the optimal solution, the definite positive manifold cannot be used to solve the above problem. Therefore, we reformulate the problem on SP_n by adding the adequate regularizers as below:
min_{X ∈ SP_n} ||A − X||_F^2 + λ Tr(X) + ρ(||X||_* − Tr(X)),   (72)
wherein ρ is the regularization parameter. The expression of such a regularizer can be obtained by expressing the Lagrangian of the original problem and deriving the expression of the Lagrangian multiplier. However, this falls outside the scope of this paper. Clearly, the expression ||X||_* − Tr(X) = Σ_{i=1}^n |λ_i| − λ_i is positive and equal to zero if and only if all the eigenvalues are positive, which concludes that X is positive. Similarly, the problem can be reformulated on DP_n as follows:
min_{X ∈ DP_n} f(X) + ρ(||X||_* − Tr(X)) + µ ||X − X^T||_F^2,   (73)
where f(X) is the original objective function in (71), regularized with ρ and µ to promote positiveness and symmetry.

Fig. 6. Performance of the doubly stochastic and symmetric multinomial manifolds in solving the convex clustering problem in terms of running time (plotted against the dimension of the problem) and objective cost (71), reported below.
n         | 30    | 60     | 90     | 120     | 150
CVX SP_n  | 90.53 | 383.69 | 881.49 | 1590.41 | 2499.86
CG SP_n   | 90.53 | 383.67 | 881.44 | 1590.33 | 2499.74
TR SP_n   | 90.77 | 384.02 | 881.98 | 1590.90 | 2500.36
CG DP_n   | 90.53 | 383.66 | 881.46 | 1590.32 | 2499.76
TR DP_n   | 90.87 | 384.15 | 881.92 | 1590.58 | 2500.06

Figure 6 plots the running time required to solve the convex clustering problem and shows the achieved original cost (71) for the different optimization methods. Clearly, the proposed Riemannian optimization algorithms largely outperform the standard approach, with gains ranging from 15 to 665 fold for the first order methods. The precision of the algorithms is satisfactory as they achieve the CVX precision in almost all experiments. Also note that using the symmetric multinomial manifold produces better results. This can be explained by the fact that not only is the objective function (72) simpler than (73), but also the manifold contains fewer degrees of freedom, which makes the projections more efficient.

C. Clustering by Low-Rank Doubly Stochastic Matrix Decomposition

This last part of the simulations tests the performance of the proposed method for clustering by low-rank doubly stochastic matrix decomposition in the setting proposed in [5]. Given a similarity matrix as in the previous section, the authors in the above reference claim that a suitable objective function to determine the cluster structure is the following non-convex cost:
min_{X ∈ SP_n, X ⪰ 0}  − Σ_{i,j} A_{ij} log( Σ_k X_{ik} X_{jk} / Σ_v X_{vk} ) − (α − 1) Σ_{i,j} log(X_{ij}).
The authors propose a specialized algorithm, known as "Relaxed MM", to efficiently solve the problem above. This section suggests solving the above problem using the positive and the symmetric multinomial manifolds (with the proper regularization as shown in the previous subsection). In order to reach the same solution, all algorithms are initialized with the same value. The objective function achieved by the algorithm of [5] is taken as a reference, and the other algorithms stop as soon as their cost drops below such value.

Fig. 7. Performance of the positive and symmetric multinomial manifolds in solving the non-convex clustering problem against the Relaxed MM algorithm (running time against the dimension of the problem).

Figure 7 illustrates the running time of the different algorithms in order to reach the same solution.
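The spectral regularizer ||X||_* − Tr(X) used in (72)-(73) is straightforward to evaluate from the eigenvalues of the symmetric part of X. The short sketch below shows one way to do so; the function names and the placeholder parameters lam_reg and rho are ours and are not taken from the paper.

```python
import numpy as np

def psd_penalty(X):
    """||X||_* - Tr(X) = sum_i (|lambda_i| - lambda_i): zero iff X has no negative eigenvalues.
    The symmetric part is used so that the eigenvalues are real."""
    lam = np.linalg.eigvalsh((X + X.T) / 2)
    return np.sum(np.abs(lam) - lam)

def objective_72(X, A, lam_reg, rho):
    """Regularized cost of problem (72); lam_reg and rho are placeholder parameter values."""
    return np.linalg.norm(A - X, 'fro') ** 2 + lam_reg * np.trace(X) + rho * psd_penalty(X)
```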
The plot reveals that the proposed framework is highly efficient in high dimensions, with a significant gain over the specialized algorithm. The performance of the first order method is noticeably better than that of the second order one. This can be explained by the complexity of deriving the Riemannian Hessian. In practical implementations, one would use an approximation of the Hessian in a similar manner as quasi-Newton methods, e.g., BHHH, BFGS.

Finally, one can note that the symmetric multinomial performs better than the positive one, which can be explained by the fact that the optimal solution has vanishing eigenvalues, which make the retraction on the cone of positive matrices non-efficient.

VIII. CONCLUSION

This paper proposes a Riemannian geometry-based framework to solve a subset of convex (and non-convex) optimization problems in which the variable of interest represents a multidimensional probability distribution function. The optimization problems are reformulated from constrained optimizations into unconstrained ones over a restricted search space. The fundamental philosophy of the Riemannian optimization is to take advantage of the low-dimension manifold in which the solution lives and to use efficient unconstrained optimization algorithms, while ensuring that each update remains feasible. The geometrical structure of the doubly stochastic, symmetric stochastic, and definite multinomial manifolds is studied and efficient first and second order optimization algorithms are proposed. Simulation results reveal that the proposed approach outperforms conventional generic and specialized solvers in efficiently solving the problem in high dimensions.

APPENDIX A
COMPUTATION OF THE TANGENT SPACE

This section computes the tangent spaces of the different manifolds of interest in this paper. Recall that the tangent space T_x M of the manifold M at x is the d-dimensional vector space generated by the derivatives of all curves going through x. Therefore, the computation of such tangent space requires considering a generic smooth curve γ(t) : R → M such that γ(0) = x and γ(t) ∈ M for some t in the neighborhood of 0. The evaluation of the derivative of such parametric curves at the origin generates a vector space D such that D ⊆ T_x M. Two approaches can be used to show the converse. If the dimension of the manifold is known a priori and matches the dimension of D, then T_x M = D. Such an approach is referred to as dimension count. The second and more direct method is to consider each element d ∈ D and construct a smooth curve γ(t) for some t ∈ I ⊂ R such that γ(0) = x ∈ M and γ′(0) = d. For illustration purposes, the first and second subsections use the former and latter approaches, respectively.

A. Proof of Proposition 1

Recall the definition of the doubly stochastic manifold DP_n = {X ∈ R^{n×n} | X_{ij} > 0, X1 = 1, X^T 1 = 1}. Consider X ∈ DP_n and let X(t) be a smooth curve such that X(0) = X. Since X(t) ∈ DP_n for some t in the neighborhood of the origin, the curve satisfies:
X(t)1 = 1 ⇒ Ẋ(t)1 = 0   (A.1)
(X(t))^T 1 = 1 ⇒ (Ẋ(t))^T 1 = 0.   (A.2)
Differentiating both equations above concludes that the tangent space is a subset of
T_X DP_n ⊆ {Z ∈ R^{n×n} | Z1 = 0, Z^T 1 = 0}.   (A.3)
From the Birkhoff-von Neumann theorem, the number of degrees of freedom of doubly stochastic matrices is (n−1)^2. Similarly, one can note that the above space is generated by 2n−1 independent linear equations (the sum of the last column can be written as the difference of the sum of all rows and the sum of all columns except the last one). In other words, the dimension of the space is n^2 − (2n−1) = (n−1)^2. Therefore, a dimension count argument concludes that the tangent space has the above expression.

B. Proof of Proposition 2

The symmetric multinomial manifold has the following expression SP_n = {X ∈ R^{n×n} | X_{ij} > 0, X1 = 1, X = X^T}. Therefore, a smooth curve X(t) that goes through a point X ∈ SP_n satisfies:
X(t) = (X(t))^T ⇒ Ẋ(t) = (Ẋ(t))^T   (A.4)
X(t)1 = 1 ⇒ Ẋ(t)1 = 0,   (A.5)
which concludes that the tangent space T_X SP_n is included in the set {Z ∈ S_n | Z1 = 0}. Now consider Z in the above set and the smooth curve γ(t) = X + tZ. Clearly, γ(t) = (γ(t))^T for all t ∈ R. Furthermore, we have:
γ(t)1 = X1 + tZ1 = X1 = 1.   (A.6)
Finally, since X_{ij} > 0 defines an open set, there exists an interval I ⊂ R such that γ(t)_{ij} > 0. Finally, it is clear that γ(0) = X and γ′(0) = Z, which concludes that:
T_X SP_n = {Z ∈ S_n | Z1 = 0}.   (A.7)

APPENDIX B
ORTHOGONAL PROJECTION ONTO THE TANGENT SPACE

This section describes the general procedure to obtain the orthogonal projection onto the tangent spaces of the manifolds of interest in this paper. For a manifold M embedded in a vector space E, the orthogonal projection Π_x(z) projects a point z ∈ E onto the tangent space T_x M for some x ∈ M. The term orthogonal herein refers to the fact that the difference z − Π_x(z) is orthogonal to the space T_x M for the inner product ⟨., .⟩_x. Therefore, the first step in deriving the expression of the orthogonal projection onto the tangent space T_x M is to determine its orthogonal complement T_x^⊥ M defined as:
T_x^⊥ M = {z^⊥ ∈ E | ⟨z^⊥, z⟩_x = 0, ∀ z ∈ T_x M}.   (B.1)
As the dimension of T_x^⊥ M can be written as dim(E) − dim(T_x M), a typical method for deriving the expression of the orthogonal complement is to find a generating family for the space and check its dimension. Now, let Π_x^⊥(z) be the orthogonal projection onto the orthogonal complement T_x^⊥ M; each point in the ambient space can be decomposed as:
z = Π_x(z) + Π_x^⊥(z),  ∀ z ∈ E, ∀ x ∈ M.   (B.2)
Using the expressions of both the tangent set and its complement, the above equation allows deriving the expressions of both projections simultaneously. The next subsections compute the orthogonal projection on the set of doubly stochastic and symmetric multinomial manifolds using the described method.
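As a small numerical sanity check of the dimension count argument of Appendix A (not part of the original derivation), the linear constraints of (A.3) can be stacked as a 2n x n^2 matrix whose rank is 2n−1, so that its null space has dimension (n−1)^2. The sketch below assumes row-major vectorization of Z; the value n = 6 is arbitrary.

```python
import numpy as np

n = 6
I, ones_row = np.eye(n), np.ones((1, n))

# Linear constraints of (A.3) on vec(Z) (row-major): row sums = 0 and column sums = 0.
L = np.vstack([np.kron(I, ones_row),      # Z @ 1 = 0   (row sums)
               np.kron(ones_row, I)])     # Z.T @ 1 = 0 (column sums)

rank = np.linalg.matrix_rank(L)
print(rank, 2 * n - 1)                    # only 2n - 1 of the 2n equations are independent
print(n**2 - rank, (n - 1)**2)            # dimension of the tangent space, matching (n-1)^2
```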

A. Proof of Theorem 1

As stated earlier, as a first step in deriving the expression of the orthogonal projection onto the tangent space, one needs to derive the expression of the orthogonal complement, which is given in the following lemma.

Lemma 4. The orthogonal complement of the tangent space of the doubly stochastic multinomial has the following expression:
T_X^⊥ DP_n = {Z^⊥ ∈ R^{n×n} | Z^⊥ = (α1^T + 1β^T) ⊙ X}   (B.3)
for some α, β ∈ R^n.

Proof. As stated in the introduction of the section, the computation of the orthogonal complement of the tangent space requires deriving a basis for the space and counting the dimension. Let Z^⊥ ∈ T_X^⊥ DP_n and Z ∈ T_X DP_n; the inner product can be written as:
⟨Z^⊥, Z⟩_X = Tr((Z^⊥ ⊘ X) Z^T) = Tr((α1^T + 1β^T) Z^T) = α^T Z1 + β^T Z^T 1.   (B.4)
But Z1 = Z^T 1 = 0 by definition of the tangent space. Therefore, we have ⟨Z, Z^⊥⟩_X = 0, ∀ Z ∈ T_X DP_n. Finally, one can note that the dimension of the set is 2n−1, which is the correct dimension for T_X^⊥ DP_n. Therefore, the derived set is the orthogonal complement of the tangent space.

Let Z ∈ R^{n×n} be a vector in the ambient space and X ∈ DP_n. The expressions of the projections are obtained using the following decomposition:
Z = Π_X(Z) + Π_X^⊥(Z)
Z1 = Π_X(Z)1 + Π_X^⊥(Z)1.   (B.5)
However, by definition of the tangent space, the first term in the right hand side of the above equation vanishes. Similarly, from Lemma 4, the second term can be replaced by its expression (α1^T + 1β^T) ⊙ X. Therefore, the first equation implies
Z1 = ((α1^T + 1β^T) ⊙ X)1
Σ_{j=1}^n Z_{ij} = Σ_{j=1}^n (α_i + β_j) X_{ij},  1 ≤ i ≤ n
Σ_{j=1}^n Z_{ij} = α_i + Σ_{j=1}^n β_j X_{ij},  1 ≤ i ≤ n
Z1 = α + Xβ.   (B.6)
A similar argument allows to conclude that Z^T 1 = X^T α + β. Grouping the equations above gives the following system of equations:
[Z1 ; Z^T 1] = [I X ; X^T I] [α ; β].   (B.7)
Even though the matrix [I X ; X^T I] is rank deficient due to the presence of the null vector [1 ; −1], the system admits infinitely many solutions. Indeed, from the orthogonal complement identification of the range and null space of a matrix A, i.e., R(A) = N^⊥(A), it is sufficient to show that the vector of interest is orthogonal to the null space of the matrix of interest as follows:
[Z1 ; Z^T 1]^T [1 ; −1] = 1^T Z^T 1 − 1^T Z1 = 1^T Z1 − 1^T Z1 = 0.   (B.8)
A particular solution to the system is to solve for β as a function of α and then solve for α, which gives the following solution:
α = (I − XX^T)^† (Z − XZ^T) 1   (B.9)
β = Z^T 1 − X^T α.   (B.10)
Finally, rearranging the terms in (B.5) allows to conclude that the orthogonal projection onto the tangent space has the following expression:
Π_X(Z) = Z − (α1^T + 1β^T) ⊙ X,   (B.11)
wherein α and β are obtained according to (B.9)-(B.10) or, more generally, (B.7).

B. Proof of Theorem 4

The proof of this theorem is closely related to the proof of Theorem 1. Indeed, the next section shows that the orthogonal projection on the symmetric doubly stochastic multinomial manifold is a special case of the projection on the doubly stochastic multinomial manifold. This section provides a more direct proof that does not rely on the previously derived result. The expression of the orthogonal complement of the tangent space is given in the following lemma.

Lemma 5. The orthogonal complement of the tangent space of the symmetric multinomial can be represented by the following set:
T_X^⊥ SP_n = {Z^⊥ ∈ S_n | Z^⊥ = (α1^T + 1α^T) ⊙ X}   (B.12)
for some α ∈ R^n.

Proof. The proof of this lemma uses similar steps as the one of Lemma 4 and thus is omitted herein.

Let Z ∈ R^{n×n} be a vector in the ambient space and X ∈ SP_n. The decomposition of Z gives the following:
Z = Π_X(Z) + Π_X^⊥(Z)
Z1 = Π_X(Z)1 + Π_X^⊥(Z)1
Z1 = ((α1^T + 1α^T) ⊙ X)1
Z1 = α + Xα = (I + X)α
α = (I + X)^{-1} Z1,   (B.13)
wherein the steps of the computation are obtained in a similar fashion as the ones in (B.6). Therefore, the orthogonal projection on the tangent set of the symmetric doubly stochastic multinomial manifold is given by:
Π_X(Z) = Z − (α1^T + 1α^T) ⊙ X,   (B.14)
with α being derived in (B.13).

C. Relationship Between the Orthogonal Projections

One can note that the projection onto the tangent space of the symmetric multinomial is a special case of the projection onto the tangent space of the doubly stochastic manifold when both the point on the manifold X and the ambient vector Z are symmetric. In other words, the projection can be written as
Π_X(Z) = Z − (α1^T + 1β^T) ⊙ X,   (B.15)
with
α = (I − XX^T)^† (Z − XZ^T) 1   (B.16)
β = (Z^T − (X^{-1} − X)^† (Z − XZ^T)) 1,   (B.17)
and the additional identities X = X^T and Z = Z^T. Using the economic eigenvalue decomposition of the symmetric matrix X = UΛU^T, the vector α can be expressed as:
α = (I − XX^T)^† (Z − XZ^T) 1
  = (I − XX^T)^† Z1 − (I − XX^T)^† X Z^T 1
  = (I − X^2)^† Z1 − (X^{-1} − X)^† Z1
  = U [(I − Λ^2)^{-1} − (Λ^{-1} − Λ)^{-1}] U^T Z1.   (B.18)
The inner matrix is a diagonal one with diagonal entries equal to:
1/(1 − λ^2) − 1/(1/λ − λ) = 1/(1 − λ^2) − λ/(1 − λ^2) = 1/(1 + λ).   (B.19)
Therefore, (I − Λ^2)^{-1} − (Λ^{-1} − Λ)^{-1} = (I + Λ)^{-1}, which gives the final expression of α as:
α = (I + X)^{-1} Z1.   (B.20)
Finally, the expression of β is given by:
β = Z^T 1 − X^T α = (I − X(I + X)^{-1}) Z1 = U [I − Λ(I + Λ)^{-1}] U^T Z1,   (B.21)
with the inner matrix equal to:
1 − λ/(1 + λ) = 1/(1 + λ)  ⇒  (I − X(I + X)^{-1}) = (I + X)^{-1}.
Therefore, we conclude that β = (I + X)^{-1} Z1 = α, which is in accordance with the result derived in Theorem 4.

Remark 5. Note that the link between both expressions can be obtained more easily by assuming that (I − XX^T) is invertible. Indeed, for example, the expression of α can be easily computed as:
α = (I − XX^T)^{-1} (Z − XZ^T) 1
  = (I − XX)^{-1} (Z − XZ) 1
  = (I + X)^{-1} (I − X)^{-1} (I − X) Z1
  = (I + X)^{-1} Z1.   (B.22)
However, due to the eigenvalue at 1 resulting from X1 = 1, such a proof is not valid and we need to use the pseudo-inverse as shown in the section above.

APPENDIX C
RETRACTION ON EMBEDDED MANIFOLDS

This section exploits the vector space structure of the embedding space to design efficient, i.e., low-complexity, retractions on the manifolds of interest in this paper. The construction of the retraction relies on the following theorem, whose proof can be found in [13].

Theorem 6. Let M be an embedded manifold of the Euclidean space E and let N be an abstract manifold such that dim(M) + dim(N) = dim(E). Assume that there is a diffeomorphism
φ : M × N* → E
(F, G) ↦ φ(F, G),   (C.1)
where N* is an open subset of N, with a neutral element I ∈ N* satisfying
φ(F, I) = F,  ∀ F ∈ M.   (C.2)
Under the above assumption, the mapping
R_x : T_x M → M
ξ_x ↦ R_x(ξ_x) = π_1(φ^{-1}(x + ξ_x)),   (C.3)
where π_1 : M × N* → M : (F, G) ↦ F is the projection onto the first component, defines a retraction on the manifold M for all x ∈ M and ξ_x in the neighborhood of 0_x.

The upcoming sections take advantage of the matrix decomposition to design a mapping φ. Interestingly, the inverse of the map φ turns out to be straightforward to compute, even though the projection on the doubly stochastic matrices space is challenging.

A. Proof of Theorem 2

This subsection uses the Sinkhorn theorem [32] to derive an expression for the mapping φ. The Sinkhorn theorem states:

Theorem 7. Let A ∈ R^{n×n} be an element-wise positive matrix. There exist two strictly positive diagonal matrices D_1 and D_2 such that D_1 A D_2 is doubly stochastic.

Due to the invariance of the above theorem to the scaling of D_1 and D_2, the rest of the paper assumes that (D_1)_{11} = 1 without loss of generality. Define the φ mapping as follows:
φ : DP_n × R^{2n−1} → R^{n×n}
(A, [d_1 ; d_2]) ↦ diag(1, d_1) A diag(d_2).   (C.4)
Note that R^{2n−1} is an open subset of R^{2n−1} and thus is a manifold by definition. Similarly, R^{n×n} is an open subset of R^{n×n}. Finally, dim(DP_n) + dim(R^{2n−1}) = (n−1)^2 + 2n − 1 = n^2 = dim(R^{n×n}). Also, the all-one element 1 of R^{2n−1} satisfies φ(A, 1) = A. Clearly, the mapping φ is smooth by the smoothness of the matrix product. The existence of the inverse map is guaranteed by the Sinkhorn theorem. Such an inverse map is obtained through the Sinkhorn algorithm [32], which scales the rows and columns of the matrix, i.e., the inverse map is smooth. Finally, we conclude that φ represents a diffeomorphism.

Using the result of Theorem 6, we conclude that π_1(φ^{-1}(X + ξ_X)) is a valid retraction for ξ_X in the neighborhood of 0_X, i.e., for X + ξ_X with positive entries, which can explicitly be written as X_{ij} > −(ξ_X)_{ij}, 1 ≤ i, j ≤ n. Using the properties of the manifold and its tangent space, the inverse map reduces to the identity. Indeed, it holds true that:
(X + ξ_X)1 = X1 + ξ_X 1 = 1 + 0 = 1   (C.5)
(X + ξ_X)^T 1 = X^T 1 + ξ_X^T 1 = 1 + 0 = 1.   (C.6)
Therefore, the canonical retraction is defined by R_X(ξ_X) = X + ξ_X.

B. Proof of Corollary 1

The proof of this corollary follows similar steps as the one of Theorem 2. However, instead of using the Sinkhorn theorem to find the adequate matrix decomposition, we use its extension to the symmetric case known as the DAD theorem [34], given below:

Theorem 8. Let A ∈ S_n be a symmetric, element-wise positive matrix. There exists a strictly positive diagonal matrix D such that DAD is symmetric doubly stochastic.

With the theorem above, define the map φ as follows:
φ : SP_n × R^n → S_n
(A, d) ↦ diag(d) A diag(d).   (C.7)
Similar to the previous proof, S_n is an open subset of the vector space S_n and R^n is a manifold. The dimension of the left hand side gives:
dim(SP_n) + dim(R^n) = n(n−1)/2 + n = n(n+1)/2 = dim(S_n).   (C.8)
Using similar techniques as the ones in Theorem 2, we can conclude that φ is a diffeomorphism whose inverse is ensured by the DAD algorithm. Finally, one can note that the projection onto the set of symmetric doubly stochastic matrices leaves X + ξ_X unchanged for ξ_X in the neighborhood of 0_X.

APPENDIX D
RIEMANNIAN HESSIAN COMPUTATION

Recall that the Riemannian Hessian is related to the Riemannian connection and Riemannian gradient through the following equation:
hess f(X)[ξ_X] = ∇_{ξ_X} grad f(X),  ∀ ξ_X ∈ T_X M.   (D.1)
Furthermore, the connection ∇_{η_X} ξ_X on the submanifold is given by the Levi-Civita connection ∇̄_{η_X} ξ_X on R^{n×n} through ∇_{η_X} ξ_X = Π_X(∇̄_{η_X} ξ_X). Substituting in the expression of the Riemannian Hessian yields:
hess f(X)[ξ_X] = Π_X(D(grad f(X))[ξ_X]) − (1/2) Π_X((grad f(X) ⊙ ξ_X) ⊘ X)
              = Π_X(D(grad f(X))[ξ_X]) − (1/2) Π_X((Π_X(Grad f(X) ⊙ X) ⊙ ξ_X) ⊘ X).   (D.2)
Apart from the term D(grad f(X))[ξ_X], all the other terms in the above equation are available. Therefore, one only needs to derive the expression of D(grad f(X))[ξ_X] to obtain the mapping from the Euclidean gradient and Hessian to their Riemannian counterparts. For ease of notation, the section uses the short notation ḟ[ξ] to denote the directional derivative D(f)[ξ] (also denoted by ∇_ξ f in the Riemannian geometry community). The computation of the directional derivative of the Riemannian gradient uses the result of the following proposition:

Proposition 2. Let f and g be two matrix functions, i.e., f, g : R^{n×n} → R^{n×n}. The directional derivatives of the Hadamard product f ⊙ g and the matrix product fg are given by:
D(f ⊙ g)[ξ] = ḟ[ξ] ⊙ g + f ⊙ ġ[ξ]   (D.3)
D(fg)[ξ] = ḟ[ξ] g + f ġ[ξ].   (D.4)

Proof. The matrix identities and differentiation rules, including the above identities, are summarized in the reference [38].

The next subsections derive such directional derivatives for the doubly stochastic and the symmetric multinomial manifolds in order to obtain the final expression of the Riemannian Hessian. In both subsections, let γ denote Grad f(X) ⊙ X.

A. Proof of Theorem 3

Recall that the projection on the doubly stochastic multinomial manifold is given by:
Π_X(Z) = Z − (α1^T + 1β^T) ⊙ X
α = (I − XX^T)^† (Z − XZ^T) 1
β = Z^T 1 − X^T α.   (D.5)
Therefore, the directional derivative can be expressed as:
D(grad f(X))[ξ_X] = D(Π_X(γ))[ξ_X]
                  = D(γ − (α1^T + 1β^T) ⊙ X)[ξ_X]
                  = D(γ)[ξ_X] − D((α1^T + 1β^T) ⊙ X)[ξ_X]
                  = γ̇[ξ_X] − (α̇[ξ_X]1^T + 1β̇[ξ_X]^T) ⊙ X − (α1^T + 1β^T) ⊙ ξ_X,   (D.6)
with
• γ̇[ξ_X] = D(γ)[ξ_X], which can be expressed as:
γ̇[ξ_X] = D(Grad f(X))[ξ_X] ⊙ X + Grad f(X) ⊙ ξ_X = Hess f(X)[ξ_X] ⊙ X + Grad f(X) ⊙ ξ_X.
• α̇[ξ_X] = D(α)[ξ_X], which can be computed as follows:
α̇[ξ_X] = D((I − XX^T)^† (γ − Xγ^T) 1)[ξ_X]
        = D((I − XX^T)^†)[ξ_X] (γ − Xγ^T) 1 + (I − XX^T)^† (γ̇[ξ_X] − ξ_X γ^T − X γ̇[ξ_X]^T) 1,   (D.7)
with the term D((I − XX^T)^†)[ξ_X] being derived below.
• β̇[ξ_X] = D(β)[ξ_X], which can be computed as follows:
β̇[ξ_X] = D(γ^T 1 − X^T α)[ξ_X] = γ̇[ξ_X]^T 1 − ξ_X^T α − X^T α̇[ξ_X].   (D.8)
In order to compute D((I − XX^T)^†)[ξ_X], first introduce the following lemma:

Lemma 6. Let A be an n × n matrix with a left pseudo-inverse A^†. The left pseudo-inverse of (A + BC) is given by:
(A + BC)^† = A^† − A^†B(I + CA^†B)^†CA^†.   (D.9)

Proof. The above identity is similar to the Kailath variant of the Sherman-Morrison-Woodbury formula [39] for an invertible matrix A. The proof is given by a simple left multiplication as follows:
(A + BC)^†(A + BC) = A^†A − A^†B(I + CA^†B)^†CA^†A + A^†BC − A^†B(I + CA^†B)^†CA^†BC   (D.10)
                   = I − A^†B((I + CA^†B)^†(I + CA^†B)C − C) = I.

Using the identity above, the pseudo-inverse of the perturbed (I − XX^T)^† along ξ_X is given by:
(I − (X + tξ_X)(X + tξ_X)^T)^† = A^†[ξ_X] = A^† + tA^†(I − tCA^†)^†CA^†,   (D.11)
wherein A = I − XX^T, B = −t, and C = Xξ_X^T + ξ_X X^T + tξ_X ξ_X^T in the above inversion lemma. Therefore, the directional derivative can be obtained by:
D((I − XX^T)^†)[ξ_X] = lim_{t→0} (A^†[ξ_X] − A^†)/t
                     = lim_{t→0} tA^†(I − tCA^†)^†CA^† / t
                     = lim_{t→0} A^†(I − tCA^†)^†CA^†
                     = A^† (lim_{t→0} C) A^†   (D.12)
                     = (I − XX^T)^† (Xξ_X^T + ξ_X X^T) (I − XX^T)^†.

B. Proof of Corollary 3

The proof of this corollary follows similar steps as the one of Theorem 3. Recall that the projection on the symmetric multinomial manifold is given by:
Π_X(Z) = Z − (α1^T + 1α^T) ⊙ X
α = (I + X)^{-1} Z1.   (D.13)
Therefore, using a technique similar to Theorem 3, the directional derivative can be expressed as:
D(grad f(X))[ξ_X] = γ̇[ξ_X] − (α̇[ξ_X]1^T + 1α̇[ξ_X]^T) ⊙ X − (α1^T + 1α^T) ⊙ ξ_X,   (D.14)
with γ̇[ξ_X] = Hess f(X)[ξ_X] ⊙ X + Grad f(X) ⊙ ξ_X. The computation of the directional derivative of α requires differentiating (I + X)^{-1} = A^{-1}. Using the Kailath variant of the Sherman-Morrison-Woodbury formula [39], the inverse is given by:
(I + X + tξ_X)^{-1} = A^{-1} − A^{-1} t(I + tξ_X A^{-1})^{-1} ξ_X A^{-1}.   (D.15)
Therefore, the directional derivative can be expressed as:
D((I + X)^{-1})[ξ_X] = lim_{t→0} (A^{-1}[ξ_X] − A^{-1})/t
                     = lim_{t→0} −A^{-1} t(I + tξ_X A^{-1})^{-1} ξ_X A^{-1} / t
                     = −(I + X)^{-1} ξ_X (I + X)^{-1}.   (D.16)
Hence, we obtain:
α̇[ξ_X] = ((I + X)^{-1} γ̇[ξ_X] − (I + X)^{-1} ξ_X (I + X)^{-1} γ) 1.

REFERENCES

[1] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004.
[2] J. Nocedal and S. J. Wright, Numerical Optimization, ser. Springer Series in Operations Research and Financial Engineering. New York: Springer, second edition, 2006.
[3] W. Ring and B. Wirth, "Optimization methods on Riemannian manifolds and their application to shape space," SIAM Journal on Optimization, vol. 22, no. 2, pp. 596-627, 2012.
[4] R. K. Vinayak and B. Hassibi, "Similarity clustering in the presence of outliers: Exact recovery via convex program," in 2016 IEEE International Symposium on Information Theory (ISIT' 2016), July 2016, pp. 91-95.
[5] Z. Yang and E. Oja, "Clustering by low-rank doubly stochastic matrix decomposition," in Proceedings of the 29th International Conference on Machine Learning (ICML'12). USA: Omnipress, 2012, pp. 707-714. [Online]. Available: http://dl.acm.org/citation.cfm?id=3042573.3042666
[6] X. Wang, F. Nie, and H. Huang, "Structured doubly stochastic matrix for graph based clustering: Structured doubly stochastic matrix," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), pp. 1245-1254.
[7] R. Zass and A. Shashua, "Doubly stochastic normalization for spectral clustering," in NIPS, 2006, pp. 1569-1576.
[8] D. G. Luenberger, "The gradient projection method along geodesics," Management Science, vol. 18, no. 11, pp. 620-631, 1972.
[9] D. Gabay, "Minimizing a differentiable function over a differential manifold," Journal of Optimization Theory and Applications, vol. 37, no. 2, pp. 177-219, 1982.
[10] S. T. Smith, Geometric Optimization Methods for Adaptive Filtering. Cambridge, MA, USA: Harvard University, UMI Order No. GAX93-31032.
[11] C. Udriste, Convex Functions and Optimization Methods on Riemannian Manifolds. Springer Science & Business Media, 1994, vol. 297.
[12] Y. Yang, "Globally convergent optimization algorithms on Riemannian manifolds: Uniform framework for unconstrained and constrained optimization," Journal of Optimization Theory and Applications, vol. 132, no. 2, pp. 245-265, 2007.
[13] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. Princeton, NJ: Princeton University Press, 2008.
[14] K. Huper and J. Trumpf, "Newton-like methods for numerical optimization on manifolds," in Proc. of the 38th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, vol. 1, Nov. 2004, pp. 136-139.
[15] C. G. Baker, P.-A. Absil, and K. A. Gallivan, "An implicit trust-region method on Riemannian manifolds," IMA J. Numer. Anal., vol. 28, no. 4, pp. 665-689, 2008.
[16] C. Baker, Riemannian Manifold Trust-Region Methods with Applications to Eigenproblems, 2008.
[17] P.-A. Absil, C. G. Baker, and K. A. Gallivan, "Trust-region methods on Riemannian manifolds," Foundations of Computational Mathematics, vol. 7, no. 3, pp. 303-330, 2007.
[18] P. Absil, C. G. Baker, and K. A. Gallivan, "Trust-region methods on Riemannian manifolds," in Proceedings of the 16th International Symposium on Mathematical Theory of Networks and Systems (MTNS' 04), Leuven, Belgium.
[19] N. Boumal and P.-A. Absil, "RTRMC: A Riemannian trust-region method for low-rank matrix completion," in Advances in Neural Information Processing Systems, 2011, pp. 406-414.
[20] B. Vandereycken, "Low-rank matrix completion by Riemannian optimization," SIAM Journal on Optimization, vol. 23, no. 2, pp. 1214-1236, 2013.
[21] L. Cambier and P.-A. Absil, "Robust low-rank matrix completion via Riemannian optimization," to appear in SIAM Journal on Scientific Computing, 2015.
[22] U. Shalit, D. Weinshall, and G. Chechik, "Online learning in the embedded manifold of low-rank matrices," Journal of Machine Learning Research, vol. 13, no. 1, pp. 429-458, Feb. 2012.
[23] R. Inokuchi and S. Miyamoto, "c-means clustering on the multinomial manifold," in Proceedings of the 4th International Conference on Modeling Decisions for Artificial Intelligence (MDAI '07), Kitakyushu, Japan, 2007, pp. 261-268.
[24] J. Lafferty and G. Lebanon, "Diffusion kernels on statistical manifolds," Journal of Machine Learning Research, vol. 6, pp. 129-163, Dec. 2005.
[25] Y. Sun, J. Gao, X. Hong, B. Mishra, and B. Yin, "Heterogeneous tensor decomposition for clustering via manifold optimization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 476-489, March 2016.
[26] H. Ji, Optimization Approaches on Smooth Manifolds, 2007.
[27] M. Berger and B. Gostiaux, Differential Geometry: Manifolds, Curves, and Surfaces, ser. Graduate Texts in Mathematics. Springer-Verlag, 1988.
[28] J. Lee, Introduction to Topological Manifolds, ser. Graduate Texts in Mathematics. Springer New York, 2010.
[29] P. Petersen, Riemannian Geometry, ser. Graduate Texts in Mathematics. Springer, 1998.
[30] S. Amari and H. Nagaoka, Methods of Information Geometry, ser. Translations of Mathematical Monographs. American Mathematical Society, 2007.
[31] G. Hurlbert, "A short proof of the Birkhoff-von Neumann Theorem," preprint (unpublished), 2008.

[32] R. Sinkhorn, "A relationship between arbitrary positive matrices and doubly stochastic matrices," The Annals of Mathematical Statistics, vol. 35, no. 2, pp. 876-879, 1964.
[33] R. Sinkhorn and P. Knopp, "Concerning nonnegative matrices and doubly stochastic matrices," Pacific Journal of Mathematics, vol. 21, no. 2, pp. 343-348, 1967.
[34] J. Csima and B. Datta, "The DAD theorem for symmetric non-negative matrices," Journal of Combinatorial Theory, Series A, vol. 12, no. 1, pp. 147-152, 1972.
[35] M. Idel, "A review of matrix scaling and Sinkhorn's normal form for matrices and positive maps," ArXiv preprint. [Online]. Available: https://arxiv.org/abs/1609.06349
[36] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre, "Manopt, a Matlab toolbox for optimization on manifolds," Journal of Machine Learning Research, vol. 15, pp. 1455-1459, 2014. [Online]. Available: http://www.manopt.org
[37] M. Grant and S. Boyd, "CVX: Matlab software for disciplined convex programming, version 2.1," http://cvxr.com/cvx, Mar. 2014.
[38] K. B. Petersen and M. S. Pedersen, "The matrix cookbook," Tech. Rep., 2006.
[39] C. M. Bishop, Neural Networks for Pattern Recognition. New York, NY, USA: Oxford University Press, Inc., 1995.
