École polytechnique fédérale de Lausanne

Master Project

Nonsmooth Riemannian optimization for completion and eigenvalue problems

Author: Francesco Nobili
Supervisor: Prof. Daniel Kressner

June 23, 2017

Contents

1 Introduction

2 Geometrical background for Riemannian optimization
  2.1 First order geometry
  2.2 Riemannian geometry
    2.2.1 Riemannian steepest descent
  2.3 Distances on manifolds and geodesics
  2.4 The exponential map and retractions
    2.4.1 Convergence Analysis
  2.5 Vector Transport
    2.5.1 Riemannian conjugate gradients

3 Geometric CG for Matrix Completion
  3.1 Different formulations
  3.2 The manifold M_k
  3.3 The proposed method
    3.3.1 Implementation aspects
  3.4 Error Analysis
    3.4.1 Numerical simulations

4 Nonsmooth Riemannian Optimization
  4.1 Overview
    4.1.1 Subdifferential for Convex functions
    4.1.2 Generalized gradients for Lipschitz functions
  4.2 Riemannian subgradients
    4.2.1 Convex and Lipschitz maps
    4.2.2 Approximating the subdifferential
    4.2.3 Implementation aspects
  4.3 ε-subdifferential method
  4.4 An example of optimization on S^{n-1}
    4.4.1 Numerical Results

5 Sparse Rayleigh quotient minimization
  5.1 The eigenvalue problem
  5.2 Sparse eigenvectors
    5.2.1 Localized spectral projectors
    5.2.2 Generating process
  5.3 Weighted Rayleigh quotient sparse minimization
    5.3.1 The manifold St_p^n
    5.3.2 Sparse eigenvector on S^{n-1}
    5.3.3 Sparse eigenvectors on St_p^n
  5.4 Nonsmooth Matrix Completion
    5.4.1 Further work directions

Bibliography

Chapter 1

Introduction

In numerical analysis, optimization problems arise from very natural tasks, such as solving a linear system or an eigenvalue problem. In this Thesis, we focus on constrained optimization problems where the constraints identify a submanifold embedded in a Euclidean space. In such cases, assuming a Riemannian structure on the manifold M allows us to switch to an unconstrained optimization problem whose active set becomes the whole set M. Smooth Riemannian optimization has been investigated extensively by researchers in recent years; the most important results are gathered, for instance, in the monograph [1]. The Riemannian approach is helpful when optimizing with constraints that are difficult to deal with directly. Consider, for example, cost functions defined on the matrix manifolds

\[
\mathcal{M}_k := \{X \in \mathbb{R}^{m\times n} : \operatorname{rank}(X) = k\},
\qquad
\mathrm{St}_p^n := \{X \in \mathbb{R}^{n\times p} : X^\top X = I_p\}. \tag{1.0.1}
\]
Nevertheless, a whole theoretical apparatus coming from Riemannian geometry is needed in order to design an optimization process for a map f : M → R. One can also ask how hard it is to optimize convex or Lipschitz maps defined on a Riemannian manifold. Recently, nonsmooth optimization on Riemannian manifolds has been investigated in [29, 13, 17, 19]. In this Thesis we deal closely with the sets (1.0.1), as they are the natural setting for many issues arising in linear algebra. We consider smooth and nonsmooth formulations of two main problems and propose algorithms and numerical simulations.

Matrix Completion The matrix completion problem is a numerical linear algebra task that consists in completing, in a unique way, a partially observed matrix. Suppose we observe a matrix A on a subset of its entries. The goal is to fill in the unknown entries so as to recover uniquely the original matrix A. The challenge lies in the size of the set of observed entries; remarkably, a whole matrix can be recovered from just a small portion of the original data.
Formally, let A be in R^{m×n} and suppose that we observe A on a subset Ω of the complete set of entries {1, ..., m} × {1, ..., n}. We denote the cardinality of this set by |Ω|. It is convenient to define the orthogonal projector P_Ω onto the set of indices Ω as follows:
\[
P_\Omega : \mathbb{R}^{m\times n} \to \mathbb{R}^{m\times n},
\qquad
(P_\Omega(X))_{i,j} =
\begin{cases}
X_{i,j}, & \text{if } (i,j) \in \Omega,\\
0, & \text{otherwise}.
\end{cases}
\]

We denote by A_Ω = P_Ω(A) the known entries of the matrix A. The action of P_Ω is schematized in Figure 1.1. The matrix completion problem consists in finding a matrix

Figure 1.1: Matrix completion problem: the action of the projection operator P_Ω. In black, the entries set to zero.

X satisfying P_Ω(X) = A_Ω. In this Thesis, we assume an a priori knowledge of the solution's rank and we consider, on M_k,
\[
\min_{X\in\mathcal{M}_k} \tfrac{1}{2}\,\|P_\Omega(X - A)\|_F^2.
\]
This formulation is robust, but unfortunately it cannot deal with localized types of noise. This is often the case in applications, and one prefers to include convex penalties in the cost function to deal with the presence of outliers. For this purpose, we investigate a generalization proposed in [19, 17] in the nonsmooth Riemannian setting for the completion of matrices.
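As an illustration, the projector P_Ω and the completion cost above take only a few lines of MatLab. This is a minimal sketch under our own assumptions: the index set Ω is stored as a logical mask Omega of the same size as A, and all variable names are ours, not from the thesis.

```matlab
% Minimal sketch: matrix completion cost on a rank-k candidate X.
% Assumptions (ours): Omega is an m-by-n logical mask of observed entries,
% A is the (partially) known data matrix.
m = 100; n = 100; k = 5;
A = rand(m,k)*rand(k,n);                     % low-rank test matrix
Omega = rand(m,n) < 0.3;                     % observe ~30% of the entries

P_Omega = @(X) X .* Omega;                   % orthogonal projector onto Omega
f = @(X) 0.5*norm(P_Omega(X - A),'fro')^2;   % completion cost

X0 = rand(m,k)*rand(k,n);                    % some rank-k candidate
fprintf('f(X0) = %.4e\n', f(X0));
```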

The eigenvalue problem The eigenvalue problem is the cornerstone of every linear algebra course and, despite its age, it still occupies a central position in ongoing numerical research. For a good introduction we refer to the monograph [16]. Formally, given a matrix A ∈ R^{n×n}, we are interested in the eigenpairs (λ, v) satisfying

Av = λv.

In the 20th century, iterative methods to approximate an eigenpair were proposed for the first time. The most basic are the power method and inverse iteration, for the largest

and smallest eigenvalue, respectively. To approximate multiple eigenvectors at the same time, we can rely instead on subspace iteration. For a more efficient approach, one prefers a Krylov subspace method, building the set

\[
\mathcal{K}_r(A, b) = \operatorname{span}\{b, Ab, A^2 b, \dots, A^{r-1} b\},
\]
and then performing a Ritz extraction on a smaller matrix to obtain approximate eigenpairs of A. In recent years, the growth in computational capacity has challenged researchers to design algorithms for ever larger eigenvalue problems. In this Thesis, we approach the eigenvalue problem by means of the Riemannian optimization of the Rayleigh quotient

\[
\rho_A(X) = \operatorname{trace}\big(X^\top A X\,(X^\top X)^{-1}\big). \tag{1.0.2}
\]

We show how the local minima of this map are related to the eigenvectors of a symmetric matrix A. As a warm up, we consider a Riemannian steepest descent approach on the unit sphere S^{n-1} to approximate the smallest eigenpair (λ_min, v). Finally, we investigate the eigenvalue problem for structured symmetric matrices A that admit a localized basis of eigenvectors. The goal is to propose a penalized formulation of (1.0.2),

\[
\min_{X \in \mathrm{St}_p^n} \operatorname{trace}(X^\top A X) + \lambda\,\|\operatorname{vec}(X)\|_{\ell_1},
\]
to seek a sparse basis of p eigenvectors with a nonsmooth Riemannian approach on the manifold St_p^n.

Outline of the Thesis The Thesis is structured as follows. In Chapter 2, we recall the most important tools in differential and Riemannian geometry. In Chapter 3, we propose a Riemannian CG method with a special emphasis on the matrix completion problem. Several numerical results are discussed for the completion of different low-rank matrices, and an application in computational imaging is proposed. In Chapter 4, we discuss the generalization of certain notions, coming from convex analysis, to the Riemannian setting. A numerical approach based on a subdifferential set is proposed. Finally, in Chapter 5, nonsmooth strategies are applied to the eigenvalue problem and numerical results are shown. We conclude the Thesis with some nonsmooth formulations for the matrix completion problem. All the numerical outputs of this work are obtained in MatLab.

Chapter 2

Geometrical background for Riemannian optimization

In this Chapter, we describe the properties of a manifold M ⊂ R^n in order to understand a Riemannian geometric descent method. We will not recall the basic ingredients of differential geometry, such as charts, atlases or smooth maps, except for the purpose of fixing notation with the reader. On the other hand, we introduce the basic theory of differential and Riemannian geometry, such as tangent spaces, the Riemannian metric, geodesics and retractions. Throughout this part, we will denote by M, N general manifolds, by (U, ψ) a general chart and by f : M → R a smooth map in the sense of manifolds. Moreover, we recall that an embedded submanifold M ⊂ N is an immersed manifold for which the inclusion map M → N is a topological embedding, i.e. the subspace topology of M coincides with the manifold topology.

2.1 First order geometry

The aim of this section is to introduce the ingredients of differential geometry needed to properly set the basis for a simple Riemannian descent optimization method. In unconstrained optimization, a function is minimized step by step by looking for suitable line searches containing the successive approximations; on manifolds, due to the lack of a linear structure, descent directions live outside the active set. The natural environment in which to generalize these concepts is the tangent space. One way to introduce the tangent space of a manifold M at a point x is by means of smooth curves γ : R → M. Let F_x(M) be the set of smooth maps f : M → R defined around x. We define the tangent vector to the curve γ as the mapping γ̇(0) from F_x(M) to R,

\[
\dot\gamma(0)f := \frac{d\, f(\gamma(t))}{dt}\bigg|_{t=0}, \qquad \text{for every } f \in \mathcal{F}_x(\mathcal{M}).
\]

We are ready to define a tangent vector to a manifold.

Definition 2.1.1. We define a tangent vector at the point x, denoted by ξ_x, as a mapping F_x(M) → R such that there exists a curve γ on M with γ(0) = x satisfying
\[
\xi_x f := \dot\gamma(0) f = \frac{d\, f(\gamma(t))}{dt}\bigg|_{t=0}.
\]
The collection of tangent vectors is called the tangent space at the point x, denoted by T_xM.

In the notation ξ_x we stress the fact that x represents the "foot" of the vector and γ is a curve through x that realizes the tangent vector. For a sketch of the tangent space geometry, see Figure 2.1.

Figure 2.1: Tangent space T_xM and a tangent vector ξ_x realized by a curve γ defined on M.

It can be shown that the mapping γ̇(0) is entirely characterized by its values in a neighborhood of x. Moreover, T_xM is a vector space, as the next proposition shows.

Proposition 2.1.2. Let M be a d-dimensional manifold and x ∈ M. Then, TxM is a vector space.

Proof. Consider ξx, ηx ∈ TxM. For scalars a, b ∈ R we define the map Fx(M) → R

\[
(a\xi_x + b\eta_x)f := a(\xi_x f) + b(\eta_x f).
\]
We need to show that (aξ_x + bη_x) ∈ T_xM. Indeed, let x be in U ⊂ R^d for a chart (U, ψ), and let γ_1, γ_2 be two curves through x realizing ξ_x and η_x, i.e. γ̇_1(0) = ξ_x and γ̇_2(0) = η_x. We consider the path
\[
\gamma(t) := \psi^{-1}\big(a\,\psi(\gamma_1(t)) + b\,\psi(\gamma_2(t))\big).
\]
Then γ(0) = x and
\[
\begin{aligned}
\frac{d\, f(\gamma(t))}{dt}\bigg|_{t=0}
&= \nabla(f\circ\psi^{-1})\,\frac{d}{dt}\big(a\,\psi\circ\gamma_1 + b\,\psi\circ\gamma_2\big)(0)\\
&= a\,\nabla(f\circ\psi^{-1})\,\frac{d}{dt}(\psi\circ\gamma_1)(0) + b\,\nabla(f\circ\psi^{-1})\,\frac{d}{dt}(\psi\circ\gamma_2)(0)\\
&= a\,\frac{d\, f(\gamma_1(t))}{dt}\bigg|_{t=0} + b\,\frac{d\, f(\gamma_2(t))}{dt}\bigg|_{t=0}\\
&= a\,\xi_x f + b\,\eta_x f,
\end{aligned}
\]

where, by noticing that
\[
f\circ\psi^{-1} : \mathbb{R}^d \to \mathbb{R}, \qquad
a\,\psi\circ\gamma_1 : \mathbb{R} \to \mathbb{R}^d, \qquad
b\,\psi\circ\gamma_2 : \mathbb{R} \to \mathbb{R}^d,
\]
we applied in the second equality the chain rule for the composition f(γ(t)) = (f ∘ ψ^{-1}) ∘ (aψ ∘ γ_1 + bψ ∘ γ_2), and in the third equality the chain rule backwards for f ∘ γ_1 and f ∘ γ_2. This concludes the proof, since we have exhibited a path γ satisfying
\[
(a\xi_x + b\eta_x)f = \frac{d\, f(\gamma(t))}{dt}\bigg|_{t=0}.
\]

An equivalent definition of tangent vectors can be introduced with the concept of derivations. See [21] for a rigorous theoretical treatment. In the submanifold case, we can practically work with tangent vectors as the next example shows.

Example 2.1.1 (Tangent space for S^{n-1}). The unit sphere is the embedded submanifold of R^n formed by unit vectors, denoted by S^{n-1} := {x ∈ R^n : x^⊤x = 1}. The aim of this example is to characterize the tangent spaces of the unit sphere as linear subspaces of R^n. Let x_0 be in S^{n-1} and let x(t) be a curve on S^{n-1} through x_0, i.e. x(t)^⊤x(t) = 1 for every t and x(0) = x_0. Differentiating, we get ẋ(t)^⊤x(t) = 0 and, in particular, ẋ(0)^⊤x_0 = 0. Defining Z := {z ∈ R^n : z^⊤x_0 = 0}, we thus conclude that T_{x_0}S^{n-1} ⊂ Z. Conversely, if z ∈ Z, we define x(t) := (x_0 + tz)/‖x_0 + tz‖, which belongs to S^{n-1} for every t and is such that ẋ(0) = z. Finally, we have the equality

\[
T_{x_0} S^{n-1} = \{z \in \mathbb{R}^n : z^\top x_0 = 0\}.
\]

Next, we introduce the concept of differential for a smooth map. Let F : M → N be a smooth mapping between two manifolds. Let ξ_x be in T_xM; then (dF(x)[ξ])f := ξ(f ∘ F) defines a mapping F_{F(x)}(N) → R, which is a tangent vector, i.e. dF(x)[ξ] ∈ T_{F(x)}N. The differential of F is then defined as follows.

Definition 2.1.3. Let F : M → N be a smooth map between two manifolds. We define the differential of F at the point x ∈ M, denoted by dF(x), as the mapping
\[
dF(x) : T_x\mathcal{M} \to T_{F(x)}\mathcal{N}, \qquad \xi \mapsto dF(x)[\xi].
\]
Finally, we define the tangent bundle as the set of all tangent vectors,
\[
T\mathcal{M} := \bigcup_{x\in\mathcal{M}} T_x\mathcal{M}.
\]
The tangent bundle is the natural environment for vector fields. They can be pictured as a collection of tangent arrows attached to each point of the manifold. More rigorously, a vector field ξ is a smooth function ξ : M → TM that assigns to each point x ∈ M a tangent vector ξ_x ∈ T_xM. In other words, a vector field is a section of the tangent bundle of the manifold.

Example 2.1.2 (An example from optimization). The Euclidean steepest descent method needs, at each iteration, the steepest direction in order to compute the successive approximation. In R^n, for a function f : R^n → R, we define the directional derivative of f at the point x in the direction ξ as the limit
\[
Df(x)[\xi] := \lim_{t\to 0} \frac{f(x + t\xi) - f(x)}{t}. \tag{2.1.1}
\]

A simple application of the chain rule produces Df(x)[ξ] = ⟨∇f(x), ξ⟩_{R^n}. The steepest (normalized) ascent direction then becomes
\[
v := \operatorname*{argmax}_{\|\xi\|=1} Df(x)[\xi]
= \operatorname*{argmax}_{\|\xi\|=1} \|\nabla f(x)\|\,\|\xi\|\cos(\theta)
= \operatorname*{argmax}_{\|\xi\|=1} \cos(\theta),
\]
θ being the angle between ∇f(x) and ξ. The maximum is attained for θ = 0, i.e. when the two vectors are collinear and v = ∇f(x)/‖∇f(x)‖. The steepest (normalized) descent direction is obtained by imposing ξ = −∇f(x)/‖∇f(x)‖. Suppose now we want to seek the minimum of f with an iterative approach. We can produce a sequence of iterates by moving, for a suitable step length, along steepest descent directions. Let x_i be an approximation of the minimizer of f. We would naturally consider the following update process:

\[
\alpha_i \leftarrow \operatorname*{argmin}_{\alpha} f(x_i + \alpha\xi), \qquad
x_{i+1} \leftarrow x_i + \alpha_i \xi.
\]

The straight curve x_i + tξ is a line search in the steepest descent direction ξ obtained from the gradient. Our intent is to investigate a geometric descent method for smooth maps defined on a manifold. A first remark concerns (2.1.1): without a linear structure on the manifold, the sum x + tξ is defined only as an element of the (affine) tangent space, on which f might not be defined. Moreover, the Euclidean gradient is strictly linked to the metric structure of R^n, and this suggests endowing the tangent space with an additional structure, the Riemannian metric.

2.2 Riemannian geometry

In this section, we generalize the concept of steepest descent to smooth maps defined on M. As pointed out in Example 2.1.2, we miss a notion of metric on T_xM to properly define a gradient vector. To fix this, we equip the manifold with a smoothly varying scalar product g_x : T_xM × T_xM → R. We call g the Riemannian metric and the couple (M, g) a Riemannian manifold, provided x ↦ g_x(ξ_x, η_x) is smooth for an arbitrary couple of vector fields ξ, η. The Riemannian metric turns the tangent space into a normed vector space by defining
\[
\|\xi_x\|_{g_x} := \sqrt{g_x(\xi_x,\xi_x)}.
\]

From now on, we shall write g(ξ, ζ) and ‖ξ‖_g for tangent vectors ξ_x, ζ_x ∈ T_xM, to avoid a heavy notation for the metric and tangent vectors. We use interchangeably g(·, ·) = ⟨·, ·⟩_g for the scalar product; it will be clear whether we are dealing with tangent vectors or vector fields. Naturally, we define the Riemannian gradient grad f(x) ∈ T_xM as the unique tangent vector satisfying

\[
\langle \operatorname{grad} f(x), \xi\rangle_g = df(x)[\xi], \qquad \text{for all } \xi \in T_x\mathcal{M}.
\]
It is worthwhile to treat more specifically the case of an embedded submanifold M ⊂ R^n, where the Riemannian gradient admits an easy representation. The matrix case M ⊂ R^{n×m} is included in the discussion thanks to the identification vec : R^{n×m} → R^{nm}, which stacks the columns of a matrix one on top of another. The couple (M, g), where g is obtained by restricting the Euclidean inner product to the tangent bundle, is a Riemannian submanifold whose metric g is constant w.r.t. the foot x ∈ M. In this situation we will use ⟨·, ·⟩_g when the metric g does not vary with x. We call the normal space of M at the point x the space
\[
N_x\mathcal{M} = (T_x\mathcal{M})^{\perp} := \{\xi \in \mathbb{R}^n : g(\eta, \xi) = 0 \ \ \forall \eta \in T_x\mathcal{M}\}.
\]
In numerical linear algebra optimization problems, the tangent and normal space are often nested in R^n, where the cost function is still defined. We can rely on the Hilbert projection theorem (proved, for instance, in [26]) to characterize the components of a vector w.r.t. a linear subspace.

Theorem 2.2.1. Let G ⊂ H be a closed and convex set of a Hilbert space H. Then, for every x ∈ H, there exists a unique element x̂ ∈ G, called the projection, satisfying

\[
\|x - \hat x\|_{\mathcal{H}} = \inf_{g\in G} \|x - g\|_{\mathcal{H}}.
\]
Moreover, we denote by
\[
P_G : \mathcal{H} \to G, \qquad x \mapsto P_G(x) := \operatorname*{argmin}_{g\in G} \|x - g\|_{\mathcal{H}},
\]
the metric projector over the set G.

Remark. Let G ⊂ H be a closed linear subspace of a Hilbert space H. Then the result of Theorem 2.2.1 still holds. The Euclidean gradient of a smooth map f then admits the following decomposition,

\[
\nabla f(x) = P_{T_x\mathcal{M}}(\nabla f(x)) + P_{N_x\mathcal{M}}(\nabla f(x)),
\]

where P_{T_xM}(·) and P_{N_xM}(·) are the metric projectors onto the tangent and the normal space, respectively. It can be proved that the Riemannian gradient is exactly the tangent component,

\[
\operatorname{grad} f(x) = P_{T_x\mathcal{M}}(\nabla f(x)). \tag{2.2.1}
\]
When optimizing on an embedded manifold, this result is exploited efficiently, as we will see in several cases. If the Euclidean gradient of f is known and the tangent space admits a known structure, then the Riemannian gradient can be computed by projection.
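For instance, for the Rayleigh quotient f(x) = x^⊤Ax on the unit sphere (treated later in Example 2.4.1), formula (2.2.1) can be coded directly, since the tangent projector of Example 2.1.1 is explicit. The following is a minimal sketch with variable names of our own choosing.

```matlab
% Minimal sketch: Riemannian gradient by projection, formula (2.2.1),
% for f(x) = x'*A*x on the unit sphere S^{n-1}.
n = 6;
A = randn(n); A = (A + A')/2;        % symmetric test matrix
x = randn(n,1); x = x/norm(x);       % point on the sphere

egrad = 2*A*x;                       % Euclidean gradient of x'*A*x
rgrad = egrad - (x'*egrad)*x;        % projection onto T_x S^{n-1}
fprintf('rgrad''*x = %.2e (tangency check)\n', rgrad'*x);
```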

2.2.1 Riemannian steepest descent

We are now ready to define the Riemannian steepest (normalized) descent direction for f at the point x ∈ M as the tangent vector ξ_x ∈ T_xM,

\[
\xi_x = -\frac{\operatorname{grad} f(x)}{\|\operatorname{grad} f(x)\|_{g_x}}. \tag{2.2.2}
\]

This vector provides a direction along which f locally decreases but, as we will discuss later, conjugate directions are preferable for optimization purposes.

2.3 Distances on manifolds and geodesics

The Riemannian metric g induces a norm on the tangent spaces that can be exploited to measure the length of curves γ : [0, 1] → M on the manifold via the arc length integral
\[
L(\gamma) := \int_0^1 \|\dot\gamma(t)\|_g \, dt,
\]
where the norm may vary with the point γ(t). This allows us to equip the Riemannian manifold with a notion of distance as follows: we denote Γ := {γ : [0, 1] → M, γ smooth} and we define the Riemannian distance as the minimal arc length over all possible curves,

\[
d_g : (\mathcal{M}, g)\times(\mathcal{M}, g) \to \mathbb{R}, \qquad (x, y) \mapsto d_g(x, y) := \inf_{\gamma\in\Gamma} L(\gamma).
\]

It is reasonable to ask whether the infimum is attained by a curve on the manifold, at least locally for the reasonably close points we are interested in in optimization. In Euclidean spaces, straight lines minimize the distance between two points; we are now interested in generalizing the concept of straight lines, i.e. curves with zero acceleration, to Riemannian manifolds, where they are called geodesics. In a general framework, to define a geodesic we need to introduce the further structure of an affine connection to properly define the acceleration vector γ̈(0). For the Riemannian submanifolds that we encounter in this Thesis, we follow instead a simpler approach. A vector field along a curve γ is a smooth map ξ : I → R^n that assigns to each point γ(t) ∈ M a tangent vector ξ(t) ∈ T_{γ(t)}M, for every t ∈ I. In general, the velocity field ξ'(t), where (·)' stands for the usual derivative in R^n, is not necessarily part of the tangent space T_{γ(t)}M. Its component on the tangent space is called the covariant derivative, denoted by ∇ξ(t) := P_{T_{γ(t)}M} ξ'(t). Notice that γ'(t) is indeed a vector field along γ, i.e. γ'(t) ∈ T_{γ(t)}M, hence it is possible to evaluate its covariant derivative ∇γ'(t) = P_{T_{γ(t)}M} γ''(t). This yields the following definition.

Definition 2.3.1. Let M ⊂ R^n be a smooth Riemannian submanifold and I ⊂ R an interval. A smooth curve γ : I → M is called a geodesic if ∇γ'(t) = 0 for all t ∈ I.

Example 2.3.1 (Geodesics on S^{n-1}). We consider the Riemannian manifold (S^{n-1}, g), where g is the constant Riemannian metric
\[
g : T_x S^{n-1} \times T_x S^{n-1} \to \mathbb{R}, \qquad (\xi, \eta) \mapsto g(\xi, \eta) = \xi^\top\eta,
\]
for every x ∈ S^{n-1}. The Riemannian metric is simply the restriction of the Euclidean inner product to the unit sphere. The geodesics on S^{n-1} are curves between points with minimal arc length, also called great circles. A geodesic is uniquely prescribed in terms of the initial value γ(0) ∈ S^{n-1} and its velocity γ̇(0) ∈ T_{γ(0)}S^{n-1} as
\[
\gamma(t) = \cos(\|\dot\gamma(0)\| t)\,\gamma(0) + \sin(\|\dot\gamma(0)\| t)\,\frac{\dot\gamma(0)}{\|\dot\gamma(0)\|}.
\]
By direct manipulation, it can be proved that γ(t)^⊤γ(t) = 1 for every t, so that it is indeed a curve on the sphere, but also that its covariant derivative is zero, as the following computation shows:
\[
\begin{aligned}
\nabla\gamma'(t) &= (I - \gamma(t)\gamma(t)^\top)\,\gamma''(t)\\
&= -\|\dot\gamma(0)\|^2 (I - \gamma(t)\gamma(t)^\top)\,\gamma(t)\\
&= -\|\dot\gamma(0)\|^2 (\gamma(t) - \gamma(t)) = 0,
\end{aligned}
\]

where in the second equality we used the fact that γ''(t) = −‖γ̇(0)‖² γ(t).
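The closed form above is easy to check numerically. The following MatLab sketch (function handle names are ours) evaluates the geodesic and verifies that it stays on the sphere.

```matlab
% Minimal sketch: geodesic on S^{n-1} emanating from x with velocity xi,
% following the closed form of Example 2.3.1.
sphere_geodesic = @(x, xi, t) cos(norm(xi)*t)*x + sin(norm(xi)*t)*xi/norm(xi);

n  = 4;
x  = randn(n,1); x = x/norm(x);         % base point on the sphere
xi = randn(n,1); xi = xi - (x'*xi)*x;   % tangent vector at x
t  = 0.7;
gt = sphere_geodesic(x, xi, t);
fprintf('norm(gamma(t)) = %.6f (should be 1)\n', norm(gt));
```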

2.4 The exponential map and retractions

The sphere S^{n-1} is a good example of a complete manifold, where geodesics are defined on the manifold for every time t ∈ R. In general, one cannot hope that the solution of the second order differential equation involved in Definition 2.3.1 exists for every t. Thanks to the Picard–Lindelöf theorem (see, e.g., [28]), it can be shown that for a given tangent vector ξ_x ∈ T_xM there exists an interval I_{x,ξ} containing 0 where a unique geodesic γ(t) : I_{x,ξ} → M with γ(0) = x and γ̇(0) = ξ_x is defined. Moreover, one can always choose I_{x,ξ} ⊂ R such that it also contains 1. This translates into the existence of a sufficiently small neighborhood U_x ⊂ T_xM for which the exponential map

\[
\exp_x : \mathcal{U}_x \subset T_x\mathcal{M} \to \mathcal{M}, \qquad \xi \mapsto \exp_x(\xi) = \gamma(1)
\]
is well defined. The role of the exponential map in Riemannian optimization is to identify line searches on M when a descent direction is provided. Suppose we have found a descent direction ξ_x according to (2.2.2): to move on M in this direction we can consider the path exp_x(tξ). Unfortunately, even when the differential equation involved in Definition 2.3.1 is completely understood, computing geodesics may not be feasible within an iterative optimization process. Vandereycken has shown in his PhD thesis [30, Propositions 3.32-3.25] that the geodesics of the low-rank manifold S_+^{n,k} := {YY^⊤ : Y ∈ R^{n×p}, rank(Y) = k} of fixed-rank positive-semidefinite matrices can be computed from a differential system that is not numerically well conditioned. He still proposes a numerical approach, but one prefers instead to work with retractions, which are a first order approximation of exp_x(·). From the optimization point of view, a computationally friendly tangent-to-manifold map is preferable.

Figure 2.2: Retraction mapping R_x : T_xM → M.

Definition 2.4.1. A retraction on a manifold M is a smooth mapping R : T M → M with the following properties. Let Rx be the restriction of R to TxM:

(i) R_x(0_x) = x, where 0_x is the zero element of T_xM.

(ii) dRx(0x) = idTxM, where idTxM is the identity mapping on TxM.

The condition (ii) is called local rigidity, and it states that dR_x behaves locally, around the origin of T_xM, as the identity map id_{T_xM}. Working with retractions, we can consider the modified update process that generates successive approximations in a meaningful way:

\[
\begin{aligned}
\alpha_i &\leftarrow \operatorname*{argmin}_{\alpha} f(R_{x_i}(\alpha\,\xi_{x_i})),\\
x_{i+1} &\leftarrow R_{x_i}(\alpha_i\,\xi_{x_i}).
\end{aligned} \tag{2.4.1}
\]

For a sketch of the retraction map, see Figure 2.2. The line search R_{x_i}(tξ_{x_i}) is perfectly well defined as a curve on the manifold along which the cost function is minimized in the direction of steepest descent. As R is called at each step i, we would like it to be computationally friendly. To this end, the metric projection can be exploited to define a retraction; we report the result of [2, Proposition 5].

Proposition 2.4.2 (Projective retraction). Let M ⊂ E be a closed manifold embedded in a vector space E. Let PM : E → M be the metric projector (Theorem 2.2.1) and G : T M → E be the smooth mapping (x, u) 7→ x + u. Then

\[
P_{\mathcal{M}} \circ G : T\mathcal{M} \to \mathcal{M}, \qquad (x, u) \mapsto P_{\mathcal{M}}(x + u)
\]
is a retraction.

The update rule (2.4.1), combined with the retraction mapping, gives us a Riemannian steepest descent method that involves a nonlinear optimization step at each iteration. It is convenient to consider the linearized step

\[
t_i \leftarrow \operatorname*{argmin}_{t} f(x_i + t\,\xi_{x_i}),
\]

which is effective when the function f admits an extension, for example to R^n. However, to be sure to move along a descent direction, it is better to consider t_i as an "initial guess" while ensuring a suitable step α_i on the line search with the Armijo backtracking rule:

\[
\text{Find the smallest integer } m \ge 0 \ \text{ s.t. } \quad
f(x_i) - f\big(R_{x_i}(\delta^m t_i \eta_i)\big) \ge -\sigma\,\langle \eta_i, \delta^m t_i \eta_i\rangle_g, \tag{2.4.2}
\]

where σ ∈ (0, 1) and δ ∈ (0, 1) have to be set, and the Armijo step size is α_i = δ^m t_i.
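The backtracking rule (2.4.2) is a short loop in MatLab. The following sketch instantiates it on the unit sphere for the Rayleigh quotient; the parameter values σ = 10^{-4} and δ = 0.5 follow Algorithm 1 of Chapter 3, and all function handles and variable names are our own illustration, not from the thesis.

```matlab
% Minimal sketch of the Armijo backtracking rule (2.4.2) on S^{n-1}.
n = 6; A = randn(n); A = (A + A')/2;
f    = @(x) x'*A*x;                          % Rayleigh quotient
retr = @(x, xi) (x + xi)/norm(x + xi);       % projective retraction on the sphere

x   = randn(n,1); x = x/norm(x);
g   = 2*A*x; eta = -(g - (x'*g)*x);          % steepest descent direction
t   = 1;                                     % initial guess for the step
sigma = 1e-4; delta = 0.5;

m = 0; x_new = retr(x, t*eta);
while f(x) - f(x_new) < -sigma*(eta'*(delta^m*t*eta)) && m < 50
    m = m + 1;                               % shrink the step until (2.4.2) holds
    x_new = retr(x, delta^m*t*eta);
end
alpha = delta^m * t;                         % accepted Armijo step size
```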

2.4.1 Convergence Analysis

We are now ready to provide a convergence result for the iterates {x_i} produced by the derived retraction-based update rule:

\[
\begin{aligned}
\eta_i &\leftarrow -\operatorname{grad} f(x_i)\\
t_i &\leftarrow \operatorname*{argmin}_{t} f(x_i + t\eta_i)\\
\alpha_i &\leftarrow \text{Armijo backtracking}\\
x_{i+1} &\leftarrow R_{x_i}(\alpha_i \eta_i).
\end{aligned} \tag{2.4.3}
\]

We notice that the sequence of tangent vectors η_i is such that ⟨grad f(x_i), η_i⟩_g < 0 for the trivial choice η_i = −grad f(x_i). This is a particular case of a more general gradient-related property of sequences of directions. Obviously, the topology of the manifold will play an important role in characterizing accumulation points for a smooth map f. We follow [1, Chapter 4] for this part. A manifold can be considered equipped with the topology induced by a maximal atlas. The open sets in the atlas topology are the sets V ⊂ M such that for every x ∈ V there exists a chart (U, ψ) with x ∈ U ⊂ V. We can characterize accumulation points for sequences in the manifold according to this topology, but we give instead an equivalent definition of convergence in the coordinate domain.

Definition 2.4.3. A sequence {x_i}_{i=0}^{∞} ⊂ M is said to be convergent if there exist a chart (U, ψ), an integer I such that x_i ∈ U for i ≥ I, and a point x_* ∈ M so that ψ(x_i) converges to ψ(x_*).

The uniqueness of the limit is not to be taken for granted: we require the manifold to be at least Hausdorff for a meaningful convergence analysis (and this will always be the case for embedded submanifolds of R^n). Finally, we conclude with the convergence result.

Theorem 2.4.4. Let {x_i}_{i=0}^{∞} be an infinite sequence produced by the update rule (2.4.3). Then every accumulation point is a critical point of f. Moreover, if M is compact, we have
\[
\lim_{i\to\infty}\|\operatorname{grad} f(x_i)\|_g = 0.
\]
We conclude the section with an illustrative example of convergence on a compact manifold.

Example 2.4.1 (Smooth optimization on S^{n-1}). We consider the Riemannian manifold (S^{n-1}, g) as in Example 2.3.1. According to Proposition 2.4.2, a retraction mapping is given by ξ ↦ R_x(ξ) = (x + ξ)/‖x + ξ‖ for every ξ ∈ T_xS^{n-1}, where the projection onto the sphere translates into a normalization step. Given a positive-semidefinite matrix A ∈ R^{n×n}, we seek the minimum eigenvalue λ_min and the corresponding eigenvector v. Inverse iteration and the Lanczos algorithm can already deal with this problem, but we consider in this Thesis the following Riemannian optimization of the Rayleigh quotient,
\[
\min_{x\in S^{n-1}} x^\top A x,
\]
to approximate the eigenpair (λ_min, v). It is known that the Rayleigh quotient is minimized by the eigenvector associated with the smallest eigenvalue; we postpone this theoretical discussion to Section 5.1. In Figure 2.3, we apply the update rule (2.4.3) to approximate the smallest eigenpair of a randomly generated s.d.p. 3-by-3 matrix.

Figure 2.3: Rayleigh quotient Riemannian optimization: iterates produced by the steepest descent method.

We can see the iterates x_i converging to the eigenvector v computed with the eig command in MatLab and, at the same time, we remark that the quantity x_i^⊤ A x_i converges to λ_min.
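For completeness, the whole update rule (2.4.3) for this example fits in a few lines of MatLab. This is a minimal sketch under our own naming; the exact line-search guess below follows from the quadratic form of f along x + tη, and the Armijo constants are the ones used later in Algorithm 1.

```matlab
% Minimal sketch of the Riemannian steepest descent (2.4.3) for the
% Rayleigh quotient on S^{n-1}.
n = 3;
A = randn(n); A = A'*A;                      % random s.p.d. test matrix
x = randn(n,1); x = x/norm(x);
retr = @(y, xi) (y + xi)/norm(y + xi);

for i = 1:500
    g   = 2*A*x;
    eta = -(g - (x'*g)*x);                   % Riemannian steepest descent direction
    if norm(eta) <= 1e-8, break; end
    t = -(x'*A*eta)/(eta'*A*eta);            % exact minimizer of f(x + t*eta) along the line
    % Armijo backtracking as in (2.4.2), sigma = 1e-4, delta = 0.5
    m = 0; xnew = retr(x, t*eta);
    while x'*A*x - xnew'*A*xnew < -1e-4*(eta'*(0.5^m*t*eta)) && m < 50
        m = m + 1;
        xnew = retr(x, 0.5^m*t*eta);
    end
    x = xnew;
end
[V, D] = eig(A);
fprintf('x''Ax = %.6f, lambda_min = %.6f\n', x'*A*x, min(diag(D)));
```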

2.5 Vector Transport

Finally, we would like to consider directions different from the steepest descent, as in the conjugate gradient method (CG) in R^n. We need to properly design conjugate directions on a manifold, with special consideration for embedded matrix manifolds. The key role is played by the vector transport mapping: to linearly combine the steepest descent with previous tangent vectors, we need to transport information from one tangent space to another. We will discuss in this section only the case of embedded submanifolds of the Euclidean space R^n. We consider results on vector transport from [1], while we refer to [21] for a complete and rigorous discussion. For an embedded submanifold M ⊂ R^n and x, y ∈ M, translating and then projecting

a tangent vector ξ_x ∈ T_xM to T_yM yields the following vector transport mapping,
\[
\mathcal{T}_{x\to y} : T_x\mathcal{M} \to T_y\mathcal{M}, \qquad \xi_x \mapsto P_{T_y\mathcal{M}}(\xi_x). \tag{2.5.1}
\]

For an intuitive sketch of the vector transport action, see Figure 2.4.

Figure 2.4: Vector transport mapping T_{x→y} : T_xM → T_yM.

Example 2.5.1 (Parallel transport along geodesics). On certain manifolds, such as the unit sphere, one has at one's disposal parallel transport maps along geodesics. On S^{n-1}, we can exploit the geodesics derived in Example 2.3.1 to transport tangent vectors. Let x be in S^{n-1} and ξ_x ∈ T_xM. We consider the map

\[
\mathcal{T}_{x\to\gamma(t)}(\xi) := \Big(I_n + (\cos(\|\dot\gamma(0)\| t) - 1)\,u u^\top - \sin(\|\dot\gamma(0)\| t)\, x u^\top\Big)\,\xi_x,
\]

where γ(t) ∈ M for t ∈ I with γ(0) = x and u = γ̇(0)/‖γ̇(0)‖. The map T_{x→γ(t)}(ξ) transports ξ_x to the tangent spaces T_{γ(t)}M in a parallel manner.
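The formula can be checked numerically: a transported vector must remain tangent at the transported foot γ(t). The sketch below uses anonymous functions of our own naming.

```matlab
% Minimal sketch: parallel transport along a sphere geodesic (Example 2.5.1).
n  = 5;
x  = randn(n,1);  x  = x/norm(x);
dg = randn(n,1);  dg = dg - (x'*dg)*x;        % geodesic velocity gamma_dot(0)
xi = randn(n,1);  xi = xi - (x'*xi)*x;        % tangent vector to transport
u  = dg/norm(dg); s = norm(dg);

transport = @(t) (eye(n) + (cos(s*t)-1)*(u*u') - sin(s*t)*(x*u'))*xi;
geodesic  = @(t) cos(s*t)*x + sin(s*t)*u;

t  = 0.4;
y  = geodesic(t);
zt = transport(t);
fprintf('zt''*y = %.2e (transported vector stays tangent)\n', zt'*y);
```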

2.5.1 Riemannian conjugate gradients

Suppose we have already obtained the iterates x_{i-1}, x_i and a conjugate direction η_{i-1}. We can obtain the next conjugate direction by linearly combining the Riemannian gradient ξ_i = grad f(x_i) with the transported tangent vector T_{x_{i-1}→x_i}(η_{i-1}). For conjugate directions, a possible factor for the linear combination is the Polak–Ribière+ update,
\[
\beta_i = \max\left\{0,\ \frac{\langle \xi_i,\ \xi_i - \mathcal{T}_{x_{i-1}\to x_i}(\xi_{i-1})\rangle_g}{\|\xi_{i-1}\|_g^2}\right\},
\]

presented in [1] and adapted to the Riemannian setting. We thus have the Riemannian conjugate gradient update rule:

\[
\begin{aligned}
\eta_i &\leftarrow -\xi_i + \beta_i\,\mathcal{T}_{x_{i-1}\to x_i}(\eta_{i-1})\\
t_i &\leftarrow \operatorname*{argmin}_{t} f(x_i + t\eta_i)\\
\alpha_i &\leftarrow \text{Armijo backtracking}\\
x_{i+1} &\leftarrow R_{x_i}(\alpha_i\eta_i).
\end{aligned} \tag{2.5.2}
\]

Chapter 3

Geometric CG for Matrix Completion

In this Chapter we face more closely the matrix completion problem presented in the Introduction. After a first review of the literature, to understand which assumptions lead to uniqueness of the completion, we move on to the Riemannian setting. The Riemannian geometry of M_k, the set of fixed rank-k matrices, is then described in order to propose a Geometric CG method. We conclude the Chapter with an exhaustive error analysis.

3.1 Different formulations

Let A be in R^{m×n} and let P_Ω be the orthogonal projector
\[
P_\Omega : \mathbb{R}^{m\times n} \to \mathbb{R}^{m\times n},
\qquad
(P_\Omega(X))_{i,j} =
\begin{cases}
X_{i,j}, & \text{if } (i,j) \in \Omega,\\
0, & \text{otherwise},
\end{cases}
\]
over a subset Ω of the complete set of entries {1, ..., m} × {1, ..., n}. We denote by A_Ω the observations P_Ω(A). The matrix completion problem consists in finding a matrix X from the equation
\[
P_\Omega(X) = A_\Omega. \tag{3.1.1}
\]
Clearly, there exists an infinite number of matrices X that agree with A on Ω, thus the problem is ill posed without further assumptions. We translate this in terms of the kernel of P_Ω(·) by noticing that the matrix system is equivalent to finding X s.t. P_Ω(X − A) = 0. The mapping is not injective and the smaller |Ω| is, the larger the kernel Ker(P_Ω) becomes. Moreover, equation (3.1.1) can be turned into a vector system by noticing that
\[
y = \operatorname{vec}(A_\Omega) = \operatorname{vec}(P_\Omega(X)) = \operatorname{vec}(P \odot X), \tag{3.1.2}
\]
where P = P_Ω(1), with 1 being the m-by-n matrix of ones, ⊙ is the Hadamard product, and y ∈ R^{mn} is given by the linearization vec(·) of the observations. Finally, defining the operator
\[
\mathcal{A} : \mathbb{R}^{m\times n} \to \mathbb{R}^{mn}, \qquad X \mapsto \mathcal{A}(X) = \operatorname{vec}(P \odot X),
\]
we turn the matrix completion problem into a linear system A(X) = y for which A can have a very large kernel. In analogy with compressive sensing, a sparsity assumption leads instead to a well-posed problem. In the matrix case, sparsity translates into a low-rank structure. We thus consider the problem of finding a matrix X with lowest rank that agrees with A on the set Ω:

\[
\begin{aligned}
&\min_{X\in\mathbb{R}^{m\times n}} \operatorname{rank}(X),\\
&\text{subject to } P_\Omega(X) = P_\Omega(A),
\end{aligned} \tag{M0}
\]
where we avoid using the operator A in the constraint to stress the case of the matrix completion problem. A log-det heuristic approach has been proposed in [12] to face numerically a rank minimization problem with convex constraints. In general, a rank minimization problem is NP-hard. Problem (M0) is equivalent to a minimization problem with a sparse solution, since it seeks a minimizer with the sparsest vector of singular values. In the same spirit as compressive sensing (see, e.g., [14, Section 4.6]), one can look at the convex relaxation

\[
\begin{aligned}
&\min_{X\in\mathbb{R}^{m\times n}} \|X\|_*,\\
&\text{subject to } P_\Omega(X) = P_\Omega(A),
\end{aligned} \tag{M1}
\]
where, setting k = min{m, n}, we consider the singular value decomposition (SVD) of a matrix X = UΣV^⊤, with U ∈ R^{m×k}, V ∈ R^{n×k} having orthonormal columns and Σ = diag(σ_1, ..., σ_k), to define the (convex) nuclear norm of a matrix, ‖X‖_* := Σ_{i=1}^k σ_i. Notice that
\[
\sqrt{X^\top X} = \sqrt{(U\Sigma V^\top)^\top (U\Sigma V^\top)} = \sqrt{V\Sigma^2 V^\top} = V\Sigma V^\top.
\]

Since the spectrum of Σ does not change under the similarity transformation by V, we have the equivalence ‖X‖_* = trace(√(X^⊤X)). Different approaches based on nuclear norm minimization have been considered, for instance, in [9, 7]. The relaxation is meaningful whenever the minimal rank solution can be recovered via (M1). Indeed, problems (M0) and (M1) are equivalent if a suitable null rank property of A is satisfied, as the next theorem shows.

Theorem 3.1.1. Let A : C^{n_1×n_2} → C^m be a linear mapping and consider the problem

\[
\begin{aligned}
&\min_{Z\in\mathbb{C}^{n_1\times n_2}} \|Z\|_*,\\
&\text{subject to } \mathcal{A}(Z) = y.
\end{aligned} \tag{3.1.3}
\]

Set n = min{n1, n2}. We have the equivalence

(i) Every matrix X ∈ C^{n_1×n_2} with rank(X) ≤ k is the unique solution of (3.1.3) with y := A(X).

(ii) For every M ∈ Ker(A) \ {0}, with singular values σ_1(M) ≥ ... ≥ σ_n(M) ≥ 0, we have
\[
\sum_{i=1}^{k} \sigma_i(M) < \sum_{i=k+1}^{n} \sigma_i(M). \tag{null rank property}
\]

Proof. To avoid confusion among the singular values involved, we adopt in this proof the notation σ_i(B) for the i-th singular value of a matrix B. We start with (i) ⇒ (ii). Consider M ∈ Ker(A) \ {0} and its SVD M = UΣV^⊤ with Σ = diag(σ_1(M), ..., σ_n(M)). Then we set

\[
\begin{aligned}
M_1 &= U\operatorname{diag}(\sigma_1(M), \dots, \sigma_k(M), 0, \dots, 0)\,V^\top,\\
M_2 &= U\operatorname{diag}(0, \dots, 0, -\sigma_{k+1}(M), \dots, -\sigma_n(M))\,V^\top.
\end{aligned}
\]

We have the following facts

• 0 = A(M) = A(M1) − A(M2) ⇒ A(M1) = A(M2),

• rank(M1) ≤ k by construction.

By hypothesis, M_1 is the unique minimizer of (3.1.3) with y := A(M_1). By uniqueness, every other solution must have a strictly larger nuclear norm. In particular, M_2 is a different solution of the same constraint, so it satisfies ‖M_1‖_* < ‖M_2‖_*, which translates into

\[
\sum_{i=1}^{k} \sigma_i(M) < \sum_{i=k+1}^{n} \sigma_i(M).
\]

Conversely, for (ii) ⇒ (i), assume every M ∈ Ker(A) \ {0} has singular values satisfying the null rank property of order k. Consider X ∈ C^{n_1×n_2} with rank(X) ≤ k and a matrix Z ≠ X satisfying A(X) = A(Z). Set M = X − Z ∈ Ker(A) \ {0}. By [14, Lemma A.18] we have
\[
\|Z\|_* = \sum_{i=1}^{n} \sigma_i(X - M) \ge \sum_{i=1}^{n} |\sigma_i(X) - \sigma_i(M)|.
\]

For every 1 ≤ i ≤ k, |σ_i(X) − σ_i(M)| ≥ σ_i(X) − σ_i(M) holds, while for k + 1 ≤ i ≤ n we have |σ_i(X) − σ_i(M)| = σ_i(M), because rank(X) ≤ k by assumption. Finally,

\[
\|Z\|_* \ge \sum_{i=1}^{k} \sigma_i(X) - \sum_{i=1}^{k} \sigma_i(M) + \sum_{i=k+1}^{n} \sigma_i(M)
> \sum_{i=1}^{k} \sigma_i(X) = \|X\|_*,
\]
where the strict inequality is due to the hypothesis, which makes the difference of the last two sums positive. Since Z is arbitrary, the claim is proved.

This theorem gives a necessary and sufficient condition for recovering exactly the lowest rank solution of the matrix completion problem via nuclear norm minimization. In [24], an exhaustive overview of Problem (M1) is presented in the more general form of nuclear norm minimization with linear constraints. Moreover, a semidefinite programming approach is presented and an interior point numerical method is discussed. Indeed, (M1) is equivalent to the positive semidefinite formulation

\[
\begin{aligned}
&\min_{X, W_1, W_2} \ \tfrac{1}{2}\big[\operatorname{trace}(W_1) + \operatorname{trace}(W_2)\big],\\
&\text{subject to } P_\Omega(X) = P_\Omega(A),\\
&\qquad\qquad
\begin{bmatrix} W_1 & X \\ X^\top & W_2 \end{bmatrix} \succeq 0,
\end{aligned} \tag{PSD}
\]

where X ∈ R^{m×n}, W_1 ∈ R^{m×m}, W_2 ∈ R^{n×n}. To see this, it suffices to notice that a solution X that minimizes the nuclear norm also satisfies
\[
\|X\|_* = \tfrac{1}{2}\Big[\operatorname{trace}\big(\sqrt{XX^\top}\big) + \operatorname{trace}\big(\sqrt{X^\top X}\big)\Big],
\qquad
\begin{bmatrix} \sqrt{XX^\top} & X \\ X^\top & \sqrt{X^\top X} \end{bmatrix} \succeq 0.
\]

The second property is obvious by exploiting the SVD X = UΣV^⊤ as follows:
\[
\begin{bmatrix} \sqrt{XX^\top} & X \\ X^\top & \sqrt{X^\top X} \end{bmatrix}
= \begin{bmatrix} U \\ V \end{bmatrix} \Sigma \begin{bmatrix} U \\ V \end{bmatrix}^\top \succeq 0.
\]
Unfortunately, checking null space or rank properties for a linear operator A is never easy, and thus one prefers to include a random sampling process in P_Ω that satisfies a restricted isometry property (RIP) with high probability. A RIP analysis has been presented in [24] for nuclear norm minimization. Alternatively, in [9, Theorem 1.1] it is stated that, if the matrix A is generated and sampled according to a suitable model, then with high probability the convex problem achieves exact recovery from the knowledge A_Ω of the entries. However, (M1) cannot deal with the presence of noise, and it is then convenient to relax the condition P_Ω(X) = P_Ω(A), which is no longer satisfied, and try instead to minimize the difference P_Ω(X) − P_Ω(A). We thus seek a different optimization approach, assuming an a priori knowledge of the solution's rank and exploiting the Riemannian geometry of the admissible set. The optimization method that we are going to consider attempts to solve the matrix completion problem in a more robust form. Set k = rank(A); then we have the robust formulation

\[
\begin{aligned}
&\min_{X} \ f(X) := \tfrac{1}{2}\|P_\Omega(X - A)\|_F^2,\\
&\text{subject to } X \in \mathcal{M}_k := \{X \in \mathbb{R}^{m\times n} \ \text{s.t.} \ \operatorname{rank}(X) = k\}.
\end{aligned} \tag{MC}
\]

A possible way to face this problem numerically is a retraction-based Riemannian optimization method which exploits the Riemannian structure of the manifold M_k. A Geometric CG method can be designed efficiently for problem (MC) by exploiting the geometrical properties of the embedded manifold M_k ⊂ R^{m×n}. We start with the geometry of the manifold M_k before going through the algorithm and an exhaustive error analysis obtained in MatLab as part of this Thesis work.

3.2 The manifold Mk

The manifold of fixed rank matrices admits an equivalent definition based on the singular value decomposition (SVD) that will be deeply exploited in the implementation. Writing X = UΣV^⊤, we have

\[
\mathcal{M}_k = \{U\Sigma V^\top : U \in \mathrm{St}_k^m,\ V \in \mathrm{St}_k^n,\ \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\ \sigma_1 \ge \dots \ge \sigma_k > 0\},
\]

where St_k^n is the Stiefel manifold of n-by-k real matrices with orthonormal columns and diag(σ_i) is the diagonal matrix with the σ_i on the diagonal. We will deal more closely with the Stiefel manifold in Section 5.3.1; for now we only use its definition to properly describe M_k. In what follows, we will refer interchangeably to an element X of M_k and its SVD X = UΣV^⊤. The first result regarding the manifold concerns the tangent space, which is essential for numerical optimization purposes. This result is obtained with the help of the general theory of embedded submanifolds in differential geometry in [31, Proposition 2.1].

Theorem 3.2.1. The set M_k is a smooth submanifold of dimension (m + n − k)k embedded in R^{m×n}. The tangent space at X ∈ M_k, denoted by T_X M_k, is

\[
\begin{aligned}
T_X \mathcal{M}_k &= \begin{bmatrix} U & U_\perp \end{bmatrix}
\begin{bmatrix} \mathbb{R}^{k\times k} & \mathbb{R}^{k\times(n-k)} \\ \mathbb{R}^{(m-k)\times k} & 0_{(m-k)\times(n-k)} \end{bmatrix}
\begin{bmatrix} V & V_\perp \end{bmatrix}^\top \\
&= \{U M V^\top + U_p V^\top + U V_p^\top : M \in \mathbb{R}^{k\times k},\ U_p \in \mathbb{R}^{m\times k},\ U_p^\top U = 0,\ V_p \in \mathbb{R}^{n\times k},\ V_p^\top V = 0\}.
\end{aligned} \tag{3.2.1}
\]

We will consider from now on (M_k, g) as a Riemannian submanifold of R^{m×n}, with the Riemannian metric inherited from the Euclidean scalar product for matrices, ⟨A, B⟩ = trace(A^⊤B) for A, B ∈ R^{m×n}. The first remark about this theorem concerns the projection operator onto the tangent space at X. Defining P_U := UU^⊤ and P_U^⊥ := (I − P_U), we have
\[
P_{T_X\mathcal{M}_k} : \mathbb{R}^{m\times n} \to T_X\mathcal{M}_k, \qquad
Z \mapsto P_U Z P_V + P_U^{\perp} Z P_V + P_U Z P_V^{\perp},
\]
where U, V are the orthonormal factors in the SVD of X. The projection operator is used at every iteration to compute the steepest descent direction as the projection of the Euclidean gradient ∇f(X), which is computed w.r.t. the Euclidean basis as follows. Let E_{i,j} be the matrix whose only nonzero entry is a 1 in position (i, j). For (i, j) ∈ Ω, we have
\[
\langle \nabla f(X), E_{i,j}\rangle_F
= \frac{1}{2}\,\frac{d}{dt}\Big|_{t=0} \big\langle P_\Omega(X + tE_{i,j} - A),\, P_\Omega(X + tE_{i,j} - A)\big\rangle_F
\]

\[
= \big\langle P_\Omega(X - A) + tE_{i,j},\, E_{i,j}\big\rangle_F\Big|_{t=0}
= \langle P_\Omega(X - A), E_{i,j}\rangle_F,
\]

where in the first equality we used ⟨∇f(X), E_{i,j}⟩_F = d/dt f(X + tE_{i,j})|_{t=0} and the fact that f can be expressed as a scalar product. For (i, j) ∉ Ω, the equation continues to hold trivially, since the derivative vanishes. We thus conclude that ∇(½‖P_Ω(X − A)‖_F²) = P_Ω(X − A), since the equality holds for every E_{i,j}, and these span the whole of R^{m×n}. Combining this with (2.2.1) and (2.2.2), we have the following important formula for the steepest (normalized) descent direction ξ_X at X ∈ M_k:

\[
\xi_X = -\frac{P_{T_X\mathcal{M}_k}(P_\Omega(X - A))}{\|P_{T_X\mathcal{M}_k}(P_\Omega(X - A))\|_F}.
\]

Moreover, according to Proposition 2.4.2, a possible retraction for the manifold M_k is
\[
R_X : T_X\mathcal{M}_k \to \mathcal{M}_k, \qquad \xi \mapsto \operatorname*{argmin}_{Z\in\mathcal{M}_k} \|X + \xi - Z\|_F,
\]

since the metric projector minimizes the Frobenius norm. The minimizer can be computed in closed form thanks to the Eckart–Young theorem [11] as follows:
\[
R_X(\xi) = \sum_{i=1}^{k} \sigma_i u_i v_i^\top, \tag{3.2.2}
\]
where σ_i are the singular values of X + ξ and u_i, v_i are the corresponding left and right singular vectors, respectively. We denote the k-truncated SVD of a matrix X by [U, Σ, V] = svd_k(X). For a quick glance, we summarize all the geometrical tools involved in Table 3.1.

|                              | Manifold M_k                                                | Total space R^{m×n}          |
| Cost function                | f(X) = ½‖P_Ω(X − A)‖_F², rank(X) = k                        | ½‖P_Ω(X − A)‖_F²             |
| Metric                       | induced metric                                              | ⟨X, Y⟩ = trace(X^⊤Y)         |
| Tangent space T_X M_k        | UMV^⊤ + U_pV^⊤ + UV_p^⊤, with X = UΣV^⊤                     | R^{m×n}                      |
| Projection P_{T_X M_k}(Z)    | P_U Z P_V + P_U^⊥ Z P_V + P_U Z P_V^⊥                       | —                            |
| Gradient                     | grad f(X) = P_{T_X M_k}(∇f(X))                              | ∇f(X) = P_Ω(X − A)           |
| Retraction                   | R_X(ξ) = svd_k(X + ξ)                                       | —                            |
| Vector transport T_{X→Y}     | P_{T_Y M_k}(ξ_X), ξ_X ∈ T_X M_k                             | translation                  |

Table 3.1: Cost function for the matrix completion problem on the fixed rank matrix manifold.
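The tangent-space projection of Table 3.1 never requires forming the projectors P_U, P_V explicitly. The following MatLab sketch (our own naming, working directly with the factors U, V of X) builds the components M, U_p, V_p of the projected matrix and checks idempotence.

```matlab
% Minimal sketch: projection of an ambient matrix Z onto T_X M_k, with
% X represented by orthonormal factors U, V (notation as in Table 3.1).
m = 50; n = 40; k = 4;
[U,~] = qr(randn(m,k), 0);           % orthonormal factors of some X in M_k
[V,~] = qr(randn(n,k), 0);
Z = randn(m,n);                      % ambient matrix, e.g. a Euclidean gradient

M  = U'*Z*V;                         % k-by-k block
Up = Z*V - U*M;                      % satisfies Up'*U = 0 by construction
Vp = Z'*U - V*M';                    % satisfies Vp'*V = 0 by construction
PZ = U*M*V' + Up*V' + U*Vp';         % P_{T_X M_k}(Z)

% sanity check: projecting twice changes nothing (up to round-off)
M2 = U'*PZ*V; Up2 = PZ*V - U*M2; Vp2 = PZ'*U - V*M2';
PZ2 = U*M2*V' + Up2*V' + U*Vp2';
fprintf('||PZ - P(PZ)||_F = %.2e\n', norm(PZ - PZ2, 'fro'));
```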

3.3 The proposed method

In the same spirit as the update rule (2.5.2), we write in Algorithm 1 the pseudocode of the Riemannian conjugate gradient method for the matrix completion problem. The algorithm is presented in [31], to which we refer for the implementation details. The geometry of the embedded submanifold M_k ⊂ R^{m×n} is completely understood by the results presented in Section 3.2, and the notions of Riemannian gradient and vector transport can also be computed numerically.

Algorithm 1 Geometric CG for Matrix Completion (MC)
Input: smooth f : M_k → R, initial approximations X_0, X_1 ∈ M_k, gradient tolerance τ > 0, tangent vector η_0 = 0.
 1: for i = 1, 2, ... do
 2:   ξ_i ← grad f(X_i)                                   ▷ Riemannian gradient
 3:   if ‖ξ_i‖ ≤ τ then
 4:     Exit.
 5:   end if
 6:   δ ← ξ_i − T_{X_{i-1}→X_i}(ξ_{i-1})
 7:   β_i ← max{0, ⟨ξ_i, δ⟩ / ‖ξ_{i-1}‖²}
 8:   η_i ← −ξ_i + β_i T_{X_{i-1}→X_i}(η_{i-1})           ▷ CG direction by PR+
 9:   t_i ← argmin_t f(X_i + tη_i)                        ▷ initial guess
10:   Find the smallest integer m ≥ 0 s.t.
      f(X_i) − f(R_{X_i}(0.5^m t_i η_i)) ≥ −0.0001 ⟨η_i, 0.5^m t_i η_i⟩   ▷ Armijo rule
11:   X_{i+1} ← R_{X_i}(0.5^m t_i η_i)
12: end for
Output: X_i

Nevertheless, line 9 has not been discussed yet. In general, computing the initial guess t_* can be nontrivial for arbitrary nonlinear cost functions. In the matrix completion problem we deal with a smooth

quadratic cost function, and thus line 9 reads
\[
\min_t f(X + t\eta) = \min_t \tfrac{1}{2}\,\|P_\Omega(X) + tP_\Omega(\eta) - P_\Omega(A)\|_F^2.
\]
The solution is obtained by setting the t-derivative to zero,

\[
0 = \frac{d f(X + t\eta)}{dt} = t\,\langle P_\Omega(\eta), P_\Omega(\eta)\rangle + \langle P_\Omega(\eta), P_\Omega(X - A)\rangle,
\]
which yields the initial guess

\[
t_* = \frac{\langle P_\Omega(\eta), P_\Omega(A - X)\rangle}{\langle P_\Omega(\eta), P_\Omega(\eta)\rangle}.
\]
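In MatLab the exact guess t_* is a one-liner. The sketch below assumes the data A, the mask Omega, the current iterate X and a tangent direction eta are already in the workspace (all names are ours, for illustration only).

```matlab
% Minimal sketch: exact initial guess t_* of the line search.
PO     = @(Z) Z .* Omega;                    % projector onto the observed set
r      = PO(A - X);                          % residual on Omega
peta   = PO(eta);
t_star = (peta(:)'*r(:)) / (peta(:)'*peta(:));
```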

3.3.1 Implementation aspects

Working on a matrix manifold, dimensions and computational costs grow rapidly with the matrix size. We have fully analyzed and described the mappings involved in Algorithm 1 from a theoretical point of view. We now discuss some implementation details that can have a big impact on running time and stability.

Non-linear CG In order to improve robustness, a variant of the PR+ rule is adopted. Whenever two consecutive conjugate directions are almost collinear, we prefer to restart from the steepest descent at the successive iteration. A further if-check is then used in MatLab as follows:

δ ← ξ_i − T_{X_{i-1}→X_i}(ξ_{i-1})
θ ← ⟨ξ_i, δ⟩ / ‖ξ_{i-1}‖²
if θ ≥ 0.1, β_i ← 0                            ▷ collinear directions: restart
else β_i ← max{0, ⟨ξ_i, δ⟩ / ‖ξ_{i-1}‖²}
η_i ← −ξ_i + β_i T_{X_{i-1}→X_i}(η_{i-1})

Retraction A full SVD has a cubic cost that becomes prohibitively expensive over the CG iterations. Even for small m, n, the retraction is needed at every step and it is convenient to improve its performance. An alternative way of computing the k-truncated SVD, designed in [31], exploits the recurrent form of the tangent vectors: X + ξ can be written as
\[
X + \xi = \begin{bmatrix} U & U_p \end{bmatrix}
\begin{bmatrix} \Sigma + M & I_k \\ I_k & 0 \end{bmatrix}
\begin{bmatrix} V & V_p \end{bmatrix}^\top,
\]

where X = UΣV^⊤ ∈ M_k and ξ = UMV^⊤ + U_pV^⊤ + UV_p^⊤ ∈ T_X M_k. We want to exploit this factorization to compute the SVD of X + ξ more efficiently. First, we compute the reduced QR decompositions of U_p and V_p, denoted Q_uR_u = U_p and Q_vR_v = V_p. Observe that it holds
\[
X + \xi = \begin{bmatrix} U & Q_u \end{bmatrix}
\underbrace{\begin{bmatrix} \Sigma + M & R_v^\top \\ R_u & 0 \end{bmatrix}}_{=:S}
\begin{bmatrix} V & Q_v \end{bmatrix}^\top.
\]

Then, by computing [U_s, Σ_s, V_s] = svd_k(S), we finally have the retraction on M_k,

\[
R_X(\xi) = \Big(\begin{bmatrix} U & Q_u \end{bmatrix} U_s\Big)\, \Sigma_s\, \Big(\begin{bmatrix} V & Q_v \end{bmatrix} V_s\Big)^\top.
\]

The economy-sized QR decompositions and the SVD of the small matrix S reduce the computational effort to the order O((m + n)k² + k³).
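The construction above can be sketched as a small MatLab function; the code below follows the factorization just described but only in spirit of [31], and function and variable names are our own.

```matlab
% Minimal sketch of the economy-sized retraction: given the SVD factors
% (U,S0,V) of X and a tangent vector (M,Up,Vp), return the factors of
% R_X(xi) without ever forming an m-by-n matrix.
function [U1, S1, V1] = retract_lowrank(U, S0, V, M, Up, Vp, k)
    [Qu, Ru] = qr(Up, 0);                    % economy-size QR of Up
    [Qv, Rv] = qr(Vp, 0);                    % economy-size QR of Vp
    S = [S0 + M, Rv'; Ru, zeros(k)];         % small 2k-by-2k core matrix
    [Us, Ss, Vs] = svd(S);                   % full SVD of the small core
    U1 = [U, Qu]*Us(:,1:k);                  % k-truncated factors of X + xi
    S1 = Ss(1:k,1:k);
    V1 = [V, Qv]*Vs(:,1:k);
end
```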

3.4 Error Analysis

The rank-k matrix A is randomly generated in MatLab as follows: L = rand(m,k); R = rand(n,k); A = L*R';. Before analyzing the behavior of the algorithm, we summarise the roles played by the rank and the set Ω in the problem.

• As a low-rank matrix problem, we keep k relatively small w.r.t the matrix size in the numerical simulations.

• The interesting case is when the number |Ω| is relatively small, meaning that we are able to recover (or complete) a matrix from little information about P_Ω(A). We control this by defining the oversampling factor OS := |Ω| / (k(2n − k)), where the denominator is the number of degrees of freedom of a rank-k matrix.

To measure the distance between the approximation and the original matrix, we look at the behaviour of two quantities:

\[
\text{Relative Error} = \frac{\|A - X\|_F}{\|A\|_F}, \tag{3.4.1}
\]
\[
\text{Relative Residual} = \frac{\|P_\Omega(X - A)\|_F}{\|P_\Omega(A)\|_F}. \tag{3.4.2}
\]
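Both measures are direct to evaluate in MatLab; the lines below assume (our assumption, not stated in the thesis) that A, the approximation X and the logical mask Omega are in the workspace.

```matlab
% Minimal sketch: the accuracy measures (3.4.1)-(3.4.2).
rel_err = norm(A - X, 'fro') / norm(A, 'fro');
rel_res = norm((X - A).*Omega, 'fro') / norm(A.*Omega, 'fro');
fprintf('relative error %.2e, relative residual %.2e\n', rel_err, rel_res);
```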

3.4.1 Numerical simulations

We propose an exhaustive error analysis through five different numerical simulations, and we look at the behaviour of the quantities (3.4.1)-(3.4.2) to measure the accuracy of the matrix completion.

First Simulation The first simulation consists in looping over the size of the matrices (square, for simplicity). The exit condition is ‖grad f(X)‖_2 ≤ 10^{-7}. Manipulating the rank k and the cardinality |Ω|, we keep the OS factor approximately fixed at 3.5. In Figure 3.1 we report the results.


Figure 3.1: Geometric CG for different-size matrix completion problems (MC). Top left: 2-norm behaviour of the Riemannian gradient. Top right: relative error. Bottom: relative residual.

We can see that the geometric method requires a small number of iterations to converge according to the tolerance set on the gradient. Moreover, a high accuracy is reached, but smaller values of n require fewer iterations.

Second Simulation The second simulation consists in approximating the matrix completion problem for a set of 100-by-100 rank-10 random matrices with different OS factors. Intuitively, the larger the quantity OS is, the easier A is to recover. Theoretically,

one needs at least OS > 1 to uniquely complete any rank-k matrix. Practically, we show in Figure 3.2 that OS > 2 is a suitable request.


Figure 3.2: Geometric CG for different OS factors. Top left: 2-norm behaviour of the Riemannian gradient. Top right: relative error. Bottom: relative residual.

We see that for OS < 2 the relative error does not decay, or it is not reliable. The gradients, in this case, oscillate very fast; we will discuss this in the next simulation. Moreover, the plots confirm the intuition: there is a monotonic decrease in the number of iterations needed as OS increases, while for OS < 2 the CG iterations can reach convergence to a matrix that does not complete the original A.

Third Simulation We remark that in the previous routines we did not take into account the randomness of the generating process for A and of the sampling set Ω. The goal of this simulation is to verify that, whenever |Ω| ≈ n log(n), the completion is not always achieved. Twenty different problems are generated and we report the results in Figure 3.3.


Figure 3.3: Geometric CG for 20 different matrices with |Ω| ≈ n log(n).

The CG method manages to decrease (even if oscillating) the Riemannian gradient of the cost function f(·), but its decay is affected by the condition |Ω| ≈ n log(n). Nevertheless, only a few matrices are actually recovered by the method: the relative error does not decay for some attempts.

Noisy Simulation This simulation focuses on noisy matrix completion, i.e. when the data A is observed in the presence of noise (and this will always be the case in applications). The setting is the same as in the previous simulations, but the presence of noise translates into a perturbation A^(ε), obtained by adding to the data Gaussian noise with a sufficiently small standard deviation ε, e.g. 10^{-3}. We consider a perturbation of A with the Gaussian noise matrix N obtained in MatLab as N = randn(m,n).

34 Finally, to have an a priori estimate of the errors as in [31], we define

\[
A^{(\varepsilon)} = A + \varepsilon\, \frac{\|A_\Omega\|_F}{\|N_\Omega\|_F}\, N.
\]
We cannot rely on the residual or the relative error as exit conditions: these indices would not reach convergence due to the noise. However, the conjugate gradient method seeks a solution that approximates the perturbed data, so we can expect the Riemannian gradient norm to converge. The numerical experiments in Figure 3.4 confirm this expectation.
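The perturbed data of this simulation can be generated as follows; this is a minimal sketch with our own variable names, and the sampling density is only chosen so that OS is roughly 3.

```matlab
% Minimal sketch: building the perturbed data A^(eps) of the noisy simulation.
m = 1000; n = 1000; k = 30; eps_noise = 1e-3;
L = rand(m,k); R = rand(n,k); A = L*R';      % rank-k data as in Section 3.4
Omega = rand(m,n) < 3*k*(2*n-k)/(m*n);       % sampling mask, roughly OS = 3
N = randn(m,n);                              % Gaussian noise matrix

A_eps = A + eps_noise * norm(A.*Omega,'fro')/norm(N.*Omega,'fro') * N;
```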


Figure 3.4: Noisy matrix completion problem with a rank-30 matrix A ∈ R^{1000×1000}, OS ≈ 3 and standard deviation ε = 10^{-3}.

Image Completion A further simulation consists in the completion of a matrix arising from an application in computational imaging. Suppose that we have a low-rank gray-scale image heavily corrupted on more than half of the overall pixels. After the corruption, the intensity of those pixels is completely insignificant, and we may think of them as unknown entries, as in the completion problem. An effective way to measure the quality of the reconstruction is the peak signal-to-noise ratio (PSNR), a positive index that tells us the accuracy reached. The PSNR can be seen as a function of the reference image X and the image Y we want to compare with it, as follows:
\[
\mathrm{PSNR}(X, Y) = 20\,\log_{10}\!\left(\frac{\mathrm{MAX}(X)}{\sqrt{\mathrm{MSE}(X, Y)}}\right),
\]

with
\[
\mathrm{MSE}(X, Y) = \frac{1}{MN}\sum_{i=0}^{N-1}\sum_{j=0}^{M-1}\big[X(i, j) - Y(i, j)\big]^2
\]
being the mean square error of two images X, Y of M × N pixels, and MAX(X) the maximum pixel value of the image X. The index is measured in decibels (dB): the smaller the MSE is, the higher the PSNR, so for better accuracy we want to increase the PSNR value. For our simulations, we expect a value between 20 dB and 40 dB. A second important formula is the structural similarity (SSIM) index, which tells us the similarity between the original and the reconstructed image. It is defined as

\[
\mathrm{SSIM}(X, Y) = \frac{(2\mu_X\mu_Y + c_1)(2\sigma_{XY} + c_2)}{(\mu_X^2 + \mu_Y^2 + c_1)(\sigma_X^2 + \sigma_Y^2 + c_2)},
\]

where µ_(·) is the average, σ_(·)² is the variance and σ_(·,·) is the covariance operator, while c_1 and c_2 are two constants that stabilize the division by the denominator. The SSIM index takes values between 0 and 1; in our simulations, we would like to keep it above 0.4 at least. The results are reported in Figure 3.5.

Figure 3.5: Corrupted image completion of Lena.png (512 × 512) with ≈ 50% corrupted pixels. On the right the corrupted image, on the left the reconstruction, with PSNR = 25.06, SSIM = 0.644.

The quality is significantly improved as the corrupted pixels (approximately 50%) are replaced in the completion of the 512 × 512 Lena.png image. The numerical rank is not known a priori, but the lack of structure in the Lena picture translates into a small magnitude for many singular values. The Geometric CG method is thus performed in M_100 to obtain an accuracy of PSNR = 25.06 and SSIM = 0.664.
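The PSNR formula above translates directly into MatLab; the sketch below uses our own function handles and random stand-in images, not the actual Lena data.

```matlab
% Minimal sketch: PSNR of a reconstructed gray-scale image, following the
% formulas above (X is the reference, Y the reconstruction).
mse_fun  = @(X, Y) mean((X(:) - Y(:)).^2);
psnr_fun = @(X, Y) 20*log10(max(X(:)) / sqrt(mse_fun(X, Y)));

X = rand(512);                     % stand-in for the reference image
Y = X + 0.05*randn(512);           % stand-in for a reconstruction
fprintf('PSNR = %.2f dB\n', psnr_fun(X, Y));
```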

Chapter 4

Nonsmooth Riemannian Optimization

Smooth optimization turns out not to be effective when dealing with non-Gaussian types of noise. Smooth quadratic norms, such as the Frobenius norm, tend to spread the noise among all the other entries when optimized. A further application regards nonsmooth sparse optimization, where ℓ_1 penalties are present in the cost function to increase the numerical sparsity of the minimizer. Moreover, Riemannian optimization is powerful in turning a rank-constrained problem into an unconstrained Riemannian formulation. These facts suggest exploring the nonsmooth framework, to take into account convex or Lipschitz cost maps defined on Riemannian manifolds. In this Chapter, we investigate a subdifferential Riemannian theory that has been developed recently in [17] and [19]. Thanks to some examples in R^n, we show the ideas that have brought the authors to generalize many concepts of convex analysis to Riemannian manifolds. Finally, we discuss a subdifferential approach to optimize convex and Lipschitz maps.

4.1 Overview

Nonsmooth optimization is characterized by nonsmooth cost functions, typically convex or Lipschitz, that attain the minima of interest at non-differentiable points. The challenge lies in the fact that we cannot rely on single steepest descent directions, since the slopes may have nonsmooth angles around local minima. In the Euclidean setting, nonsmooth optimization has been investigated extensively in statistics and data science. In this Thesis, we are interested in replacing gradients with subdifferential sets, which provide descent directions at nondifferentiable minimum points.

37 4.1.1 Subdifferential for Convex functions

Subdifferentials are part of the differential theory in convex analysis introduced, for instance, by Rockafellar in [25, Part V]. The first generalization proposed is about directional derivatives for convex functions.

Definition 4.1.1. Let f be convex and let domf ⊂ R^n be its domain. We define the directional derivative of f in the direction v at the point x ∈ domf as the limit

\[
f'(x; v) := \lim_{\delta\searrow 0} \frac{f(x + \delta v) - f(x)}{\delta},
\]
when it exists.

Notice that the increment δ is positive in the limit, contrary to the classical directional derivative. Next, we generalize the concept of gradients.

Definition 4.1.2 (Subdifferential). We call a vector g ∈ R^n a subgradient for a convex function f : R^n → R at the point x_0 if it satisfies

\[
f(x) \ge f(x_0) + \langle g, x - x_0\rangle, \qquad \text{for every } x \in \operatorname{dom} f.
\]

We define the subdifferential at x_0, denoted by ∂f(x_0), as the collection of all subgradients.

The first important result concerning the subdifferential at minimizing points x_* ∈ R^n tells us that a point is a minimizer of a function f if and only if 0 is a subgradient.

Proposition 4.1.3. Let f : R^n → R be convex. Then f(x_*) = min_{x∈R^n} f(x) if and only if 0 ∈ ∂f(x_*).

Proof. The result is clear by rewriting
\[
f(x) \ge f(x_*) = f(x_*) + \langle 0, x - x_*\rangle, \qquad \text{for every } x \in \operatorname{dom} f.
\]

From the definition of subgradient, we have the statement.

A useful way to characterize the subdifferential is as the support function of f'(x_0; ·). We report this result in the next theorem [22, Theorem 3.1.14].

Theorem 4.1.4. Let f : R^n → R be a closed and convex function. Then, for every x_0 ∈ (domf)° and v ∈ R^n, we have

\[
f'(x_0; v) = \max\{\langle g, v\rangle : g \in \partial f(x_0)\}.
\]

Remark. Notice that f(x_0 + δv) − f(x_0) ≥ ⟨g, x_0 + δv − x_0⟩ for every g ∈ ∂f(x_0), by definition of subgradient. Hence, for every δ > 0 we have
\[
\frac{f(x_0 + \delta v) - f(x_0)}{\delta} \ge \langle g, v\rangle.
\]
Thus, taking the limit for δ ↘ 0, we have
\[
f'(x_0; v) \ge \langle g, v\rangle \qquad \text{for every } g \in \partial f(x_0).
\]

This inequality provides an upper bound for the quantities ⟨g, v⟩ with g ∈ ∂f(x_0), whenever f'(x_0; v) is finite.

The theorem can be exploited to derive trivial subdifferentials at differentiable points. Suppose f is a closed and convex function, differentiable in the interior of its domain. Fix an arbitrary v ∈ R^n; then for every x ∈ (domf)° and g ∈ ∂f(x) we have ⟨∇f(x), v⟩ = f'(x; v) ≥ ⟨g, v⟩, where the last inequality is due to Theorem 4.1.4. Thanks to the existence of the gradient, we can also consider −v in the directional derivative and claim the existence of the limit f'(x; −v), to derive

\[
\langle \nabla f(x), -v\rangle = f'(x; -v) \ge \langle g, -v\rangle.
\]

Finally, combining the inequalities, we have ⟨∇f(x), v⟩ = ⟨g, v⟩ for an arbitrary g ∈ ∂f(x). Letting now v vary among the Euclidean basis vectors e_k, we obtain the equality

∂f(x) = {∇f(x)}. ◦ √ ◦ The assumption√ x ∈ (domf) is crucial, as the function f(x) = − x shows: 0 ∈/ (domf) but ∂(− 0) = ∅. Example 4.1.1 (Modulus). The model problem for this class of functions is,

min_{x∈R^n} ||x||_{ℓ1}.

In R, the ℓ1-norm reduces to the modulus |x|. In this example, we analyze the subdifferential ∂|0|. An arbitrary subgradient g ∈ ∂|0| satisfies

|x| ≥ 0 + gx, for every x ∈ R.

Clearly this is possible if and only if g ∈ [−1, 1]; hence we obtain ∂|0| = [−1, 1]. On the other hand, if x_0 ≠ 0, the subdifferential reduces to the single element x_0/|x_0|, i.e. its gradient. Geometrically, we can see in Figure 4.1 that each subgradient can be seen as the angular coefficient of a line lying below the graph of |x|.

Figure 4.1: Subdifferential ∂|0| = [−1, 1] for the convex modulus.

Subgradients for convex functions have been exploited to design a subdifferential method presented, for instance, by Nesterov in [22, Chapter 3] and in Boyd's lecture notes [6] at Stanford. They provide convergence results for the update rule

x^{k+1} ← x^k − α_k g^k,   g^k ∈ ∂f(x^k),
f_best^{k+1} = min{f(x^{k+1}), f(x^k), ...},

when, for instance, the sequence α_k is s.t. Σ_k α_k = +∞ and Σ_k α_k^2 < +∞. The choice of the k-th subgradient is completely arbitrary, but we remark here that convergence requires uniform boundedness of the g^k (which is the case for Lipschitz functions).
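As an illustration (a minimal sketch in MATLAB, not taken from [22] or [6]), the update rule applied to the model problem f(x) = ||x||_{ℓ1} reads:

    n = 10;
    x = randn(n, 1);
    fbest = norm(x, 1);
    for k = 1:500
        g = sign(x);                  % an arbitrary subgradient of ||.||_1 at x
        alpha = 1/k;                  % steps with sum = inf and sum of squares < inf
        x = x - alpha*g;
        fbest = min(fbest, norm(x, 1));
    end
    fbest                             % decreases towards the minimum value 0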

4.1.2 Generalized gradients for Lipschitz functions

A generalization of gradients for Lipschitz functions is due to Clarke, one of Rockafellar's students. In [10], Clarke exploits the almost everywhere existence of ∇f(x) for Lipschitz functions f to define the generalized gradient of f at x. We call a locally Lipschitz function f : R^n → R a mapping s.t. for every x ∈ R^n there exist a ball B with x ∈ B and a constant K > 0 satisfying

|f(y) − f(z)| ≤ K||y − z||, for every y, z ∈ B.

Before characterizing generalized gradients, we recall that the convex hull of a set S, denoted by conv(S), is the smallest convex set containing S.

Definition 4.1.5. Let f : R^n → R be a locally Lipschitz function. We define the generalized gradient of f at x, denoted ∂f(x), as the convex hull

∂f(x) := conv{ lim_{i→+∞} ∇f(x + h_i) : h_i → 0 }.

We remark that arbitrary sequences {h_i}_{i=1}^{+∞} ⊂ R^n s.t. the x + h_i are differentiability points of f exist if f is Lipschitz. If f is convex, by [25, Theorem 25.6], ∂f(x) is equivalent to the set of subgradients in Definition 4.1.2. Moreover, the proper directional derivative for the class of locally Lipschitz functions is the following.

Definition 4.1.6. Let f : R^n → R be a locally Lipschitz function. We define the generalized directional derivative of f at x in the direction v, denoted by f°(x; v), as the limit

f°(x; v) = limsup_{h→0, δ↘0} (f(x + h + δv) − f(x + h))/δ.

In the same spirit as Theorem 4.1.4, we have analogously f°(x; v) = max{⟨g, v⟩ : g ∈ ∂f(x)}, whose proof is more technical because of the lim sup in the directional derivative. Finally, Goldstein proposed in [15] a further generalization of gradient directions for the class of Lipschitz functions that has proved fruitful also numerically. He managed to build a differential theory for the class of Lipschitz functions, exploiting again the almost everywhere existence of the gradients ∇f(x). He then considered as subdifferential set an ε-dependent convex hull of gradients at points whose distance is controlled by the ε parameter. In this Thesis, as we will see when dealing with nonsmooth maps defined on Riemannian manifolds, we will be closer to Goldstein's definition of ε-subdifferential to seek descent directions. For a Lipschitz function f at a point x we have the definition:

Definition 4.1.7. Let f : R^n → R be a locally Lipschitz function and B̂(x, ε) = {y ∈ B(x, ε) : ∇f(y) exists}. Let {θ_k}_{k=1}^{+∞} ⊂ R be a sequence converging downward to 0. We define the ε-generalized gradient at x as the set

∂_ε f(x) := conv( ∩_{k=1}^∞ {∇f(y) : y ∈ B̂(x, ε + θ_k)} ).

Remark. The object ∂_ε f(x) is well-defined. Indeed, we define the sets

T(θ_k) := {∇f(y) : y ∈ B̂(x, ε + θ_k)},

which are non-empty and nested, i.e. T(θ_{k+1}) ⊂ T(θ_k). Thus, the intersection ∩_{k=1}^∞ T(θ_k) is non-empty. Moreover, it can be proved that ∂_ε f(x) is also compact and convex.

4.2 Riemannian subgradients

In this section we are going to generalize the concepts we have introduced so far to the Riemannian setting. Nonsmooth analysis on Riemannian manifolds is theoretically developed, for instance, in [29] and [13]. Firstly, we are going to discuss what nonsmooth maps defined on Riemannian manifolds are. In what follows, Riemannian norms and metrics will often appear and it is convenient to lighten the notation: it will be clear from the context whether || · || and ⟨·, ·⟩ are Euclidean or Riemannian forms.

4.2.1 Convex and Lipschitz maps

The definition of convex functions requires the linear structure of R^n, which is missing for maps defined on manifolds. The natural idea to discuss convexity on Riemannian manifolds is by means of geodesics. Convexity is discussed in [29] in the more general framework of non-complete manifolds, while we follow the approach in [13] for an easier discussion. The completeness is to be understood w.r.t. the Riemannian distance. In this case, the Hopf-Rinow theorem [20] ensures that the exponential mapping is well defined over the whole tangent space.

Definition 4.2.1 (Convexity via geodesics). Let (M, g) be a complete and connected Riemannian manifold. We say that f : M → R is convex if f ◦ γ : R → R satisfies

f(γ(ta + (1 − t)b)) ≤ tf(γ(a)) + (1 − t)f(γ(b)),

for every geodesic γ and a, b ∈ R, t ∈ [0, 1]. In other words, we require the map to be convex in the usual sense when composed with any complete geodesic. Subgradients can be defined in the same fashion, as follows.

Definition 4.2.2 (Subgradient of convex maps). Let (M, g) be a connected and complete Riemannian manifold, f be a convex map and x ∈ M. We say that s ∈ T_xM is a subgradient for f at x if for every geodesic γ with γ(0) = x we have

f(γ(t)) ≥ f(x) + t⟨s, γ̇(0)⟩, for every t ≥ 0.

Moreover, we denote by ∂f(x) the set of all subgradients.

On the other hand, the Lipschitz property of a map requires only the presence of metric structure in the definition. A Riemannian manifold equipped with its Riemannian distance dg is indeed a metric space. A Lipschitz map satisfies

|f(x) − f(y)| ≤ K d_g(x, y) for every x, y ∈ M, where K > 0 is the Lipschitz constant. Furthermore, we say that f is locally Lipschitz if, for every x ∈ M, there exists a neighborhood on which f is Lipschitz. It is better at this point to drop the geodesic completeness of the Riemannian manifold. We know that the exponential map is strictly connected to the solution of a second order ordinary differential equation. In practice, for general Riemannian manifolds, we expect this solution to be unique only in neighborhoods. For x ∈ M, this translates into the existence of an injectivity radius i_M(x), that is, the supremum of the radii r s.t. B(0_x, r) ⊂ T_xM and exp_x(·) is a diffeomorphism from B(0_x, r) to B(x, r) ⊂ M. We introduce next the concept of directional derivatives and subdifferentials as in [17].

Definition 4.2.3 (Clarke generalized directional derivative). Let f : M → R be a locally Lipschitz map. Let φ_x : U_x → T_xM be an exponential chart at x ∈ M. Given another point y ∈ U_x, consider σ_{y,v}(t) := φ_y^{-1}(tw), a geodesic through y with initial velocity w, where φ_y is an exponential chart around y satisfying d(φ_x ∘ φ_y^{-1})(0_y)(w) = v. We define the generalized directional derivative of f at x in the direction v ∈ T_xM, denoted by f°(x; v), as

f°(x; v) := limsup_{y→x, t↘0} (f(σ_{y,v}(t)) − f(y))/t.

The subdifferential set, again denoted ∂f(x), can then be introduced as the subset of T_xM whose support function is f°(x; ·). Alternatively, an equivalent definition is

∂f(x) = conv{ lim_{i→∞} grad f(x_i) : {x_i} ⊂ Ω_f, x_i → x },

where Ω_f is a (dense) subset of M on which the Riemannian gradient exists.

Definition 4.2.4 (ε-subdifferential). Let f : M → R be a locally Lipschitz map and θ_k a scalar sequence converging downward to zero. For each ε > 0 s.t. ε + θ_k < i_M(x) for every k, the ε-subdifferential is defined as

∂_ε f(x) = conv( ∩_{k=1}^∞ {dexp_x^{-1}(y)(grad f(y)) : y ∈ B̂(x, ε + θ_k)} ),

where B̂(x, ε + θ_k) = B(x, ε + θ_k) ∩ Ω_f. This definition requires some remarks to be understood as a generalization of Goldstein's ε-subdifferential. Firstly, we stress the fact that the gradients taken around a point x belong to different tangent spaces T_yM. On R^n there is no need to bring back these directions, because of the flat structure of the vector space. This job is achieved by the map dexp_x^{-1}(y) : T_yM → T_xM. Secondly, we shall select ε small enough so that f is Lipschitz on B(x, ε) and exp_x(·) is a diffeomorphism from B(0_x, ε) to B(x, ε). Finally, we are interested in the presence of descent directions in the subdifferential set. For this purpose, we give a notion of descent direction in the nonsmooth optimization framework.

Definition 4.2.5 (Descent direction). Let x ∈ M, w ∈ T_xM and f : M → R be locally Lipschitz. We call a descent direction for f at the point x the tangent vector g = −w/||w||, if there exists α > 0 satisfying

f(exp_x(tg)) − f(x) ≤ −t||w||, ∀t ∈ (0, α).

In other words, we require the function to decrease uniformly in the direction g along the geodesic exp_x(tg). We now consider the problem of existence of such descent directions in the ε-subdifferential set. The answer is addressed by the following theorem [19, Theorem 3.11].

Theorem 4.2.6. Assume δ, ε > 0 are s.t. 0 ∉ ∂_ε f(x) for every x ∈ B(x̄, δ). Fix x ∈ B(x̄, δ) and consider the element of ∂_ε f(x) satisfying

w_0 := argmin_{v ∈ ∂_ε f(x)} ||v||.

Set g_0 = −w_0/||w_0||. Then g_0 affords a uniform decrease of f over B(x, ε) in the sense of Definition 4.2.5, i.e.

f(exp_x(εg_0)) − f(x) ≤ −ε||w_0||.

Remark. The theorem actually provides an optimization approach to extract a descent direction from ∂_ε f(x). Unfortunately the subdifferential set is not easy to characterize, so we are going to see next how to replace it with a suitable approximation.

4.2.2 Approximating the subdifferential

In this section, we present a practical approach to work with descent directions in the set ∂_ε f(x) to optimize nonsmooth maps defined on Riemannian manifolds. We follow closely the results presented in [17]. For general Lipschitz maps f, it could be hard to give an explicit description of the ε-subdifferential set. Yet, for numerical purposes, we are going to see that it suffices to work with a partial description of ∂_ε f(x). It is sufficient to gradually enlarge an approximation W_k := {v_1, v_2, ..., v_k} ⊂ ∂_ε f(x) until we detect the presence of a descent direction in conv(W_k). According to Theorem 4.2.6, we consider the rule

w_k ← argmin_{v ∈ conv(W_k)} ||v||,
g_k ← −w_k/||w_k||.

Now if

f(exp_x(εg_k)) − f(x) ≤ −cε||w_k||,   c ∈ (0, 1),

then we say that W_k is a suitable approximation of ∂_ε f(x); otherwise we enlarge the convex hull by adding another gradient v_{k+1} ∈ ∂_ε f(x)\conv(W_k). The enlargement process requires some theoretical backup. We follow the discussion in [17, Section 3.3]. First, the starting vector v_1 can be chosen from ∂f(y), with y in a neighborhood of x, according to the following lemma.

Lemma 4.2.7. For every y ∈ B(x, ε),

dexp_x^{-1}(y)(∂f(y)) ⊂ ∂_ε f(x).

Then, for successive gradients wk we have:

Lemma 4.2.8. Set W_k = {v_1, ..., v_k} ⊂ ∂_ε f(x) s.t. 0 ∉ conv(W_k) and

w_k = argmin_{v ∈ conv(W_k)} ||v||.

If f(exp_x(εg_k)) − f(x) > −cε||w_k||, where g_k = −w_k/||w_k||, then there exist θ_0 ∈ (0, ε] and v̄_{k+1} ∈ ∂f(exp_x(θ_0 g_k)) s.t.

⟨dexp_x^{-1}(exp_x(θ_0 g_k))(v̄_{k+1}), g_k⟩ ≥ −c||w_k||,

and v_{k+1} = dexp_x^{-1}(exp_x(θ_0 g_k))(v̄_{k+1}) ∉ conv(W_k). The combination of the two lemmas produces the strategy summarized in Algorithm 2 and Algorithm 3.

Algorithm 2 Descent direction Algorithm; g_k = Descent(f, x, δ, c, ε).
Input: f : M → R locally Lipschitz, x ∈ M, δ, c, ε ∈ (0, 1).
1: Let g_0 ∈ T_xM s.t. ||g_0|| = 1.
2: if grad f(exp_x(εg_0)) exists then
3:   v = dexp_x^{-1}(exp_x(εg_0))(grad f(exp_x(εg_0)))
4: else
5:   Select v ∈ dexp_x^{-1}(exp_x(εg_0))(∂f(exp_x(εg_0)))
6: end if
7: W_1 = {v}
8: for k = 1, 2, ... do
9:   w_k ← argmin_{v ∈ conv(W_k)} ||v||        ▷ Extraction step
10:  if ||w_k|| ≤ δ then
11:    Exit.
12:  else
13:    g_k ← −w_k/||w_k||
14:  end if
15:  if f(exp_x(εg_k)) − f(x) ≤ −cε||w_k|| then
16:    Exit.                                   ▷ g_k is a descent direction
17:  end if
18:  v_{k+1} = Increasing(f, x, g_k, 0, ε)
19:  W_{k+1} = W_k ∪ {v_{k+1}}
20: end for

4.2.3 Implementation aspects

We discuss in this part some implementation aspects, to explain how the first two algorithms are implemented in practice. As for the Geometric Conjugate Gradient, the embedded structure of the manifold M ⊂ R^n plays an important role in handling all the abstract concepts involved. Still, it is not clear so far how to select arbitrary subgradients v ∈ ∂f(y).

Algorithm 3 h-increasing Algorithm; v = Increasing(f, x, g, a, b).
Input: f : M → R locally Lipschitz, x ∈ M, g = −w/||w|| ∈ T_xM and a, b ∈ R.
Set h(t) = f(exp_x(tg)) − f(x) + ct||w|| and t = b.
1: repeat
2:   Select v ∈ ∂f(exp_x(tg))
3:   if ⟨v, dexp_x(tg)(g)⟩ < c||w|| then
4:     t = (a + b)/2
5:     if h(b) > h(t) then
6:       a = t
7:     else
8:       b = t
9:     end if
10:  end if
11: until ⟨v, dexp_x(tg)(g)⟩ ≥ c||w||

For this issue, we remark that we are interested in cost functions involving nonsmooth || · ||_{ℓ1} norms. The almost everywhere differentiability can be exploited to choose grad f(y) ∈ ∂f(y) for almost every y ∈ M. Then we bring back the new vector to the current tangent space T_xM. This transport action can be properly accomplished by a vector transport map instead of dexp_x^{-1}(y)(·), provided they act sufficiently close. More formally, for a vector transport T_{x←y} we require, for every x ∈ M and l > 0, the existence of δ := δ(x, l) > 0 s.t.

||dexp_x^{-1}(y) − T_{x←y}|| ≤ l, provided d_g(x, y) ≤ δ.

The second issue concerns the minimization step over the convex hull of W_k. Let W_k denote also the matrix whose columns are the vectors v_i, i = 1, 2, ..., k, produced by Algorithm 2. We consider the following alternative definition for the convex hull of a set containing a finite number of vectors.

Definition 4.2.9. Let S be a finite collection of vectors s_i ∈ R^n. We define the convex hull of S, denoted by conv(S), as the set

conv(S) = { Σ_{i=1}^{|S|} α_i s_i : s_i ∈ S, Σ_{i=1}^{|S|} α_i = 1 and α ≥ 0 }.

To optimize a quadratic norm over a convex hull we have the equivalences

min_{v∈conv(W_k)} ||v|| ⇔ min_{v∈conv(W_k)} ||v||^2 ⇔ min_{α·1=1, α≥0} Σ_{i,j=1}^k α_i α_j v_i^T v_j.

Denoting by W_{i,j} = v_i^T v_j for i, j = 1, 2, ..., k, we finally have

min_{α·1=1, α≥0} α^T W α,   (4.2.1)

that is a quadratic program with linear constraints. Such a minimization can be easily handled by an interior point method; in MatLab, we rely on the quadprog(·) function for this part. Nevertheless, such a solver only reaches an approximation of the minimizer for some prescribed tolerance, and it is not convenient to slow down the method by lowering the tolerance. In view of Riemannian optimization, this vector becomes the velocity of a geodesic through the current iterate and it is crucial that it lies accurately in the tangent space. An easy trick (that has shown to work on S^{n−1}) is to project the numerical approximation of (4.2.1) onto the tangent space, so that a geodesic with such velocity is well defined numerically on the manifold. Otherwise, a retraction mapping based on Proposition 2.4.2 could be used instead to move along linesearches on the manifold. In this Thesis, we are also interested in handling the case of matrix manifolds. To optimize over a convex hull of matrices, we now show that a vectorization step leads back to the case we have just discussed. Rearranging the notation as usual with matrices, if now the tangent vectors are matrices V_i ∈ R^{m×n}, i = 1, ..., k, we denote their collection by W_k = {V_1, ..., V_k}. We consider the equivalence

min_{V∈conv(W_k)} ||V||_F^2 ⇔ min_{V∈conv(W_k)} ||vec(V)||^2

and we define W_{i,j} = trace(V_i^T V_j), the symmetric matrix whose entries are all the possible scalar (Frobenius) products, or simply W_{i,j} = vec(V_i)^T vec(V_j). The matrix minimization then becomes

min_{α·1=1, α≥0} α^T W α,   (4.2.2)

looking for a solution v = Σ_{i=1}^k α_i vec(V_i) ∈ R^{nm} and then finally considering V ∈ R^{m×n} s.t. v = vec(V).
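As a sketch of this step (own variable names; quadprog belongs to MATLAB's Optimization Toolbox), the minimum-norm element of the convex hull of a few stored vectors can be computed as follows:

    V = randn(5, 3);                  % columns play the role of the collected gradients v_i
    G = V'*V;                         % Gram matrix W with W(i,j) = v_i'*v_j
    k = size(V, 2);
    Aeq = ones(1, k); beq = 1;        % sum(alpha) = 1
    lb = zeros(k, 1);                 % alpha >= 0
    alpha = quadprog(2*G, zeros(k,1), [], [], Aeq, beq, lb, []);
    w = V*alpha;                      % shortest vector in conv{v_1, ..., v_k}

The factor 2 accounts for quadprog minimizing (1/2) α^T H α, so that H = 2G yields exactly (4.2.1).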

4.3 ε-subdifferential method

We are now ready to discuss a minimization approach based on the extraction procedure for descent directions of Algorithm 2. The idea is to successively decrease the ε parameter while approaching a stationary point, seeking descent directions among gradients in closer and closer neighborhoods. We summarize the strategy in Algorithm 4. We remark that Line 10 could actually be improved. The ε_i steplength is sufficient to accomplish a uniform decrease of the cost function but, as shown in [17], a further linesearch could be performed both to avoid too small steps on the manifold and to accomplish a sufficiently large step towards the minimum. The linesearch is based on a generalization of the rule presented in [23, pp. 60-62] to satisfy the Wolfe conditions for nonlinear cost functions.

Algorithm 4 A minimization Algorithm; x = Min(f, x_0, ε_1, δ_1, c_1, tol).
Input: f : M → R locally Lipschitz, x_0 starting point and c_1, ε_1, δ_1 ∈ (0, 1).
1: x = x_0.
2: for i = 1, 2, ... do
3:   x_1 = x
4:   for k = 1, 2, ... do
5:     g_k = Descent(f, x_k, δ_i, c_1, ε_i) with g_k ← −w_k/||w_k||   ▷ Descent direction
6:     if ||w_k|| ≤ tol then
7:       x = x_k, Exit.
8:     end if
9:     if f(exp_{x_k}(ε_i g_k)) − f(x_k) ≤ −c_1 ε_i ||w_k|| then
10:      x_{k+1} = exp_{x_k}(ε_i g_k)                                  ▷ Uniform decrease
11:    else
12:      Set ε_{i+1} < ε_i, δ_{i+1} < δ_i and x = x_k                  ▷ Shrink ∂_{ε_i} f(x)
13:      Break.
14:    end if
15:  end for
16: end for

4.4 An example of optimization on S^{n−1}

We consider as a first simulation the following convex optimization problem:

min_{x∈S^{n−1}} ||Qx||_{ℓ1},   (4.4.1)

with Q ∈ R^{m×n} of the form

Q = [ 1  0
      0  Q̃ ],

where Q̃ ∈ R^{(m−1)×(n−1)} is a real orthonormal matrix. Problem (4.4.1) admits the sparse local minimum x = [1, 0, ..., 0]^T, which we seek with the Riemannian nonsmooth approach.

Consider f(x) = ||Qx||_{ℓ1} on the unit sphere manifold. The Riemannian gradient, when it exists, admits the formula grad f(x) = (I − xx^T)Q^T sign(Qx). As we pointed out, descent directions can be obtained from convex hulls of Riemannian gradients, exploiting the almost everywhere differentiability of the convex map. Moreover, these tangent vectors need to be transported to the current iterate: we rely on the parallel transport map along geodesics, obtained as follows. Assume that we want to transport a tangent vector ξ_y ∈ T_yM to T_xM along a geodesic linking y to x. In the next proposition we show how to obtain a geodesic on the sphere prescribing two points of passage.

Proposition 4.4.1. Consider x, y ∈ S^{n−1} and the geodesic σ(t) = exp_y(tv), where v is given by

v = ( d_g(x, y) / ||(I_n − yy^T)(x − y)|| ) (I_n − yy^T)(x − y),

and I_n is the n-by-n identity matrix. Then σ(1) = x.

Proof. On S^{n−1} the exponential mapping is a homeomorphism (away from the antipodal point of y) and its inverse at y is given by

exp_y^{-1} : S^{n−1} → T_y S^{n−1},   x ↦ exp_y^{-1}(x) = (θ/sin(θ)) (x − y cos(θ)),

where θ = arccos(x^T y) is the angle between x and y. We shall consider v = exp_y^{-1}(x). The tangent vector v ∈ T_yM is thus obtained as a lift of the point x, as sketched in Figure 4.2. Thus we have

Figure 4.2: Two-dimensional sketch of the geodesic σ(t) connecting y to x on the unit sphere, with v = exp_y^{-1}(x) ∈ T_y S^{n−1}.

v = exp_y^{-1}(x) = (θ/sin(θ)) (x − y cos(θ))
  = ( d_g(x, y) / √(1 − cos^2(arccos(x^T y))) ) (x − yy^T x)
  = ( d_g(x, y) / √(1 − cos^2(arccos(x^T y))) ) (I_n − yy^T)(x − y).

Moreover, ||(I_n − yy^T)(y − x)||^2 = ||y cos(θ) − x||^2 = x^T x − 2 cos(θ) x^T y + cos^2(θ) y^T y = 1 − cos^2(arccos(x^T y)). Finally, rewriting the denominator, we get

v = exp_y^{-1}(x) = ( d_g(x, y) / ||(I_n − yy^T)(y − x)|| ) (I_n − yy^T)(x − y),

and σ(t) = exp_y(t exp_y^{-1}(x)). Clearly σ(1) = x, and the proof is concluded.

According to Example 2.5.1, we have explicitly the parallel (backward) transport

T_{x←y}(ξ_y) = (I_n + (cos(||v||) − 1) uu^T − sin(||v||) y u^T) ξ_y,

where u = v/||v||. We summarize in Table 4.1 the tools derived in this section.

                          S^{n−1}                                        Total space R^n
Cost function f(x)        ||Qx||_{ℓ1}, x^T x = 1                         ||Qx||_{ℓ1}
Metric                    induced metric                                 ⟨x, y⟩ = x^T y
Tangent space             T_x S^{n−1} = {z ∈ R^n : z^T x = 0}            R^n
Projection                P_{T_x S^{n−1}}(z) = (I − xx^T)z               \
Gradient                  grad f(x) = (I − xx^T)∇f(x)                    ∇f(x) = Q^T sign(Qx)
Retraction                R_x(ξ) = exp_x(ξ)                              \
Transport along geodesics T_{x→γ(t)}(ξ), ξ ∈ T_{γ(0)} S^{n−1}            translation

Table 4.1: ℓ1 sparse optimization on the unit sphere.
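A minimal MATLAB sketch of the log map from Proposition 4.4.1 and of the parallel transport above (own variable names, chosen for illustration):

    n = 5;
    y = randn(n,1); y = y/norm(y);
    x = randn(n,1); x = x/norm(x);

    theta = acos(x'*y);                      % Riemannian distance d_g(x, y)
    w = (eye(n) - y*y')*(x - y);             % unnormalized direction in T_y S^{n-1}
    v = (theta/norm(w))*w;                   % v = exp_y^{-1}(x)
    u = v/norm(v);

    expy = @(t) y*cos(t*norm(v)) + u*sin(t*norm(v));
    norm(expy(1) - x)                        % ~0: the geodesic reaches x at t = 1

    transp = @(xi) xi + (cos(norm(v)) - 1)*u*(u'*xi) - sin(norm(v))*y*(u'*xi);
    xi = (eye(n) - y*y')*randn(n,1);         % a tangent vector at y
    abs(x'*transp(xi))                       % ~0: the transported vector lies in T_x S^{n-1}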

Figure 4.3: On the left: Riemannian distance in logscale of the approximation x^k to e_1 ∈ S^{49} for Problem (4.4.1). On the right: ε updates for the subdifferential ∂_ε f(x).

4.4.1 Numerical Results

We consider a 50-dimensional linear subspace of R^500, whose orthonormal basis is given by the matrix Q ∈ R^{500×50} as in Problem (4.4.1) with n = 50 and m = 500, and we seek the sparsest vector-solution e_1 ∈ S^{n−1}. We report the results in Figure 4.3. In the left plot, we can see how the method recovers the sparse solution e_1 ∈ S^{49} with high accuracy, as the Riemannian distance decays very fast towards machine precision. On the right, it is interesting to see how the ε parameter behaves during the iterations: approaching the solution, the subdifferential method requires shrinking the set ∂_ε f(x) around the minimum to move along linesearches and find new descent directions.
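For reference, the test problem can be set up in a few lines (a sketch; the sizes match the experiment above):

    m = 500; n = 50;
    Qtilde = orth(randn(m-1, n-1));          % random (m-1)-by-(n-1) orthonormal matrix
    Q = blkdiag(1, Qtilde);                  % Q = [1 0; 0 Qtilde] as in (4.4.1)
    e1 = [1; zeros(n-1, 1)];                 % the sparse local minimizer on S^{n-1}
    norm(Q*e1, 1)                            % = 1, the value of the cost at e_1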

Chapter 5

Sparse Rayleigh quotient minimization

In this Chapter we are going to discuss a Riemannian optimization approach to approximate a collection of p eigenvectors of a symmetric matrix A. In numerical linear algebra, subspace iterations and Arnoldi-type methods can already deal with the eigenvalue problem. Nevertheless, we are going to see that band matrices admit a sparse basis of eigenvectors. We apply the nonsmooth Riemannian approach to a penalized Rayleigh quotient formulation to achieve an even sparser approximate eigenbasis. We finally conclude the chapter with numerical results.

5.1 The eigenvalue problem

In this part of the Thesis we focus on the eigenvalue problem for a symmetric matrix A ∈ R^{n×n}, seeking an A-invariant p-dimensional subspace X ⊂ R^n spanned by eigenvectors of A and satisfying

AX = XN,   (5.1.1)

where X ∈ R^{n×p} with span(X) = X and N ∈ R^{p×p}. The spectral theorem for self-adjoint linear and compact operators ensures that R^n admits an orthonormal decomposition w.r.t. the eigenvectors of A. Moreover, we are going to deal with a diagonal matrix N with the corresponding p real eigenvalues on its diagonal. We now derive the theoretical backup to understand how eigenvectors are related to the minimizers of the Rayleigh quotient. Part of the techniques involved are taken from [32] and, respecting the complex operations, the discussion can be easily generalized to Hermitian complex matrices.

Definition 5.1.1. Let A be in R^{n×n} and X ∈ R^{n×p} with rank(X) = p. Then

ρ_A(X) := trace(X^T A X (X^T X)^{-1})

is the Rayleigh quotient of A at X.

Remark. For p = 1, Definition 5.1.1 reduces to the function

ρ_A(x) = (x^T A x)/(x^T x),

for every x ∈ R^n \ {0}. We will see that it is useful to discuss minimizers and maximizers of ρ_A(·) in the case p = 1.

Theorem 5.1.2. Let A ∈ R^{n×n} be symmetric with eigenvalues λ_1 ≤ ... ≤ λ_n. Then

λ_1 = min_{x∈R^n\{0}} ρ_A(x),   λ_n = max_{x∈R^n\{0}} ρ_A(x).

Proof. If A is symmetric, let {v_i}_{i=1}^n be an orthonormal basis of eigenvectors, so that every x ∈ R^n can be expanded on this basis as x = Σ_{i=1}^n α_i v_i for scalars α_i ∈ R. Then ρ_A(x) reads

ρ_A(x) = ( Σ_{j=1}^n α_j v_j )^T A ( Σ_{i=1}^n α_i v_i ) / ( Σ_{i,j=1}^n α_i α_j v_i^T v_j )
       = ( Σ_{i,j=1}^n α_i α_j λ_i δ_{i,j} ) / ( Σ_{i,j=1}^n α_i α_j δ_{i,j} )            (5.1.2)
       = ( Σ_{i=1}^n α_i^2 λ_i ) / ( Σ_{i=1}^n α_i^2 ),

where δ_{i,j} is the Kronecker delta. Trivially, λ_1 ≤ ρ_A(x) ≤ λ_n for every x ∈ R^n \ {0}. Moreover, we have the equalities ρ_A(v_1) = λ_1 and ρ_A(v_n) = λ_n.
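A quick numerical sanity check of Theorem 5.1.2 (a sketch on an arbitrary random symmetric matrix):

    n = 6; A = randn(n); A = (A + A')/2;
    lams = sort(eig(A));
    x = randn(n, 1);
    rho = (x'*A*x)/(x'*x);
    [lams(1) <= rho, rho <= lams(end)]       % both 1: lambda_1 <= rho_A(x) <= lambda_n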

Remark. The fact that A is symmetric is crucial to exploit the decomposition of R^n w.r.t. an orthonormal basis of eigenvectors. In general, we cannot hope the theorem to hold for non-symmetric matrices. For this, we give the following counterexample. Consider

A = [ 0 1
      0 0 ],

which is not symmetric. The only eigenvalue is λ = 0, while

max_{||x||=1} ρ_A(x) = 1/2,

attained for x = (1/√2) [1, 1]^T.

A variant of this theorem is obtained when x is constrained to be orthogonal to the first p − 1 eigenvectors.

Lemma 5.1.3. Let A ∈ R^{n×n} be symmetric with eigenvalues λ_1 ≤ ... ≤ λ_n. Set V_{p−1} := {v_1, ..., v_{p−1}} for 2 ≤ p ≤ n. Then

λ_p = min_{x ∈ V_{p−1}^⊥, x≠0} ρ_A(x).

Proof. If x ∈ V_{p−1}^⊥, then it admits a decomposition of the kind x = Σ_{i=p}^n α_i v_i and (5.1.2) reduces to

ρ_A(x) = ( Σ_{i=p}^n α_i^2 λ_i ) / ( Σ_{i=p}^n α_i^2 ).

In the same way, we conclude that the minimum is attained at vp and ρA(vp) = λp.

Lemma 5.1.3 can be extended to gain the same results for interior eigenvalues without the knowledge of the previous p − 1 eigenvectors.

Theorem 5.1.4 (Courant-Fischer min-max theorem). Let A ∈ R^{n×n} be symmetric with eigenvalues λ_1 ≤ ... ≤ λ_n. Then

λ_p = min_{X ⊂ R^n subspace, dim(X)=p} max_{x∈X, x≠0} ρ_A(x) = min_{X∈R^{n×p}, rank(X)=p} max_{y≠0} ρ_A(Xy).

Proof. The second equality is obvious by changing coordinate system. For the first equality, we set

Vp−1 = {v1, ..., vp−1}.

For every p-dimensional subspace X, there is x ∈ X ∩ V_{p−1}^⊥ with x^T x = 1. Indeed, dim(V_{p−1}^⊥) = n − p + 1 and the intersection with a p-dimensional subspace is at least 1-dimensional. From Lemma 5.1.3, for such an x it follows that ρ_A(x) ≥ λ_p. The equality is obtained by considering X = span{v_1, ..., v_p} and x = v_p.

We have the following Corollary.

Corollary 5.1.5 (Monotonicity principle). Let A ∈ R^{n×n} be symmetric with eigenvalues λ_1 ≤ ... ≤ λ_n and let U ∈ R^{n×p} be orthonormal, i.e. U^T U = I_p. Then the eigenvalues µ_1 ≤ ... ≤ µ_p of the symmetric matrix M := U^T A U satisfy

λk ≤ µk, for every 1 ≤ k ≤ p.

Proof. Let U = span(U). By applying Theorem 5.1.4, we have

λ_k = min_{X ⊂ R^n subspace, dim(X)=k} max_{x∈X, x≠0} ρ_A(x)

    ≤ min_{X ⊂ U subspace, dim(X)=k} max_{x∈X, x≠0} ρ_A(x)

    = min_{X∈R^{p×k}, rank(X)=k} max_{y∈R^k, y≠0} ρ_A(UXy),

where in the last equality we have used again Theorem 5.1.4 and the fact that minimizing over the k-dimensional subspaces of U is equivalent to minimizing over full-rank X ∈ R^{p×k}, with the argument UX spanning such subspaces. Finally, we notice that

ρ_A(UXy) = (UXy)^T A (UXy) / ((UXy)^T (UXy)) = (Xy)^T M (Xy) / ((Xy)^T (Xy)) = ρ_M(Xy)

and also

min_{X∈R^{p×k}, rank(X)=k} max_{y∈R^k, y≠0} ρ_M(Xy) = µ_k,

which concludes the proof.

We now state and prove the result for the trace minimization of the Rayleigh quotient over the set of orthonormal n-by-p matrices.

Theorem 5.1.6 (Ky-Fan theorem). Let A ∈ R^{n×n} be symmetric with eigenvalues λ_1 ≤ ... ≤ λ_n. Then

λ_1 + ... + λ_p = min_{X∈R^{n×p}, X^T X = I_p} trace(X^T A X).

Proof. Let X be an arbitrary orthonormal n-by-p matrix and let µ_1 ≤ ... ≤ µ_p be the eigenvalues of X^T A X. By the monotonicity principle,

λ1 + ... + λp ≤ µ1 + ... + µp.

Again, the equality holds when X is an orthonormal basis for the smallest p eigenvectors, i.e. X = [v_1, ..., v_p].

The Rayleigh quotient ρ_A(X) is defined for arbitrary full-rank matrices X, but in this Thesis we focus on orthonormal n-by-p matrices, since they form an embedded submanifold called the Stiefel manifold. We are thus interested in optimizing modified cost functions obtained from

ρ_A(X) = trace(X^T A X), with X^T X = I_p.
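A short numerical check of Theorem 5.1.6 (a sketch; sizes chosen arbitrarily):

    n = 8; p = 3; A = randn(n); A = (A + A')/2;
    [V, D] = eig(A);
    [lams, idx] = sort(diag(D));
    Xstar = V(:, idx(1:p));               % orthonormal basis of the p smallest eigenvectors
    sum(lams(1:p))                        % lambda_1 + ... + lambda_p
    trace(Xstar'*A*Xstar)                 % attains the minimum of trace(X'AX) over X'X = I_p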

5.2 Sparse eigenvectors

We now discuss the eigenvalue problem and an intrinsic "hidden" property for the case of band matrices. Suppose we would like to approximate a p-dimensional basis of eigenvectors of a matrix A obtained as follows: A=rand(n); A=A+A'; A=hess(A);. We will get a better insight on this generating process later in the section, but it is already clear that we are generating a symmetric tridiagonal matrix. What if we now extract, with the help of the eig(·) command, the basis V ∈ R^{n×p} of the p smallest eigenvectors and plot the spectral projector P = VV^T? In Figure 5.1 we show the magnitude of the spectral projector's entries P_{i,j} for a symmetric tridiagonal matrix A ∈ R^{100×100}, with V a 100-by-40 orthonormal basis of eigenvectors.

Figure 5.1: P_{i,j} and log(|P_{i,j}|) on the top and bottom, respectively, for the entries of the spectral projector P = VV^T ∈ R^{100×100} relative to a symmetric tridiagonal matrix. In the bottom log-plot, the entries below 10^{−10} are set to zero.

In the top plot, we can see the localization properties of P_{i,j} around the diagonal while, in the bottom one, the rapid decay of the off-diagonal entries.
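A minimal sketch of this experiment (sizes as in Figure 5.1, own variable names):

    n = 100; p = 40;
    A = rand(n); A = A + A'; A = hess(A);    % symmetric tridiagonal matrix
    [V, D] = eig(A);
    [~, idx] = sort(diag(D));
    Vp = V(:, idx(1:p));                     % basis of the p smallest eigenvectors
    P = Vp*Vp';                              % spectral projector
    surf(log10(abs(P) + 1e-16));             % visualize the off-diagonal decay of |P_ij|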

5.2.1 Localized spectral projectors

Decay properties for functions of band matrices have been investigated in [4, 5]. In these works, the authors have shown how the structure of a band matrix is numerically preserved when a suitable matrix function is applied. If, for example, we consider a tridiagonal matrix A, then the matrix exponential exp(A) has strong localization properties. The goal of this section is to show how the sparsity of a basis of eigenvectors is linked to this decay theory, by rewriting the spectral projector as a matrix function of a tridiagonal symmetric matrix. We follow the discussion in [3, Section 4.1.3]. For simplicity we start from p = 1. Let v be an eigenvector of A; then the spectral projector associated to v is

 2  v1 v1v2 . . . v1vn v v v2 . . . v v  >  2 1 2 2 n P = vv =   ,  . . .. .   . . . .  2 vnv1 vnv2 . . . vn where here, differently from Section 5.1, we denote by v1 the i-th component of a vector v. Let us now assume that v is sparse numerically, i.e. its "mass" is localized on a few entries. Then, the spectral projector P is numerically sparse since most of the entries vivj are going to be close to zero. As in the Figure 5.1, the biggest magnitudes tend to be localized in the diagonal. We then showed that P is localized if and only if v is. Consider now 1 ≤ p ≤ n, and let Vp be an orthonormal basis w.r.t the p smallest eigenvectors: in this case, due to a numerical cancellation issue, the localization of the > eigenvector is only a sufficient condition for the localization of P = VpVp . Indeed, if p = n, we denote by V the of the spectral decomposition of A. Notice > that VV = In (maximal localization), while V can be strongly delocalized. It is then reasonable to ask p  n. Most importantly, we remark that if P is delocalized, we may seek a more sparse basis U (as localized as possible)

> > > > P = VpVp = VpΘΘ Vp = UU ,U = VpΘ, for another unitary transformation Θ. The spectral projector associated to the smallest p eigenvectors can be considered as a regular matrix function φ(·) satisfying ( 1, if 1 ≤ i ≤ p, φ(λi) = 0, otherwise.

Then it holds

φ(A) = φ(VDV^T) = V φ(D) V^T = V_p V_p^T.

Evidently, we relied on the spectral decomposition D = V^T A V of the symmetric tridiagonal matrix A and on the fact that a matrix function acts as a polynomial: if A admits a diagonal unitary decomposition, φ(·) acts directly on the diagonal as an analytic function that interpolates part of the spectrum. For more on matrix functions we refer to [18]. In conclusion, we can exploit the decay behaviour of φ(A) = V_p V_p^T to infer that A admits a localized basis of eigenvectors V_p. This theoretical excursus is the basis for the numerically sparse approach we are going to present next.

5.2.2 Generating process

Our goal is to design a numerical strategy, involving Riemannian optimization of the Rayleigh quotient, to reach an even sparser approximation of a p-dimensional eigenspace. Nevertheless, we need a way to generate such matrices while controlling the spectrum at the same time, in order to obtain meaningful numerical results. We are interested in working with tridiagonal symmetric matrices that have a well-separated spectrum around the p-th eigenvalue and a controlled gap l ∈ R. In MatLab, we directly build their spectral decomposition by hand as follows. Let D = diag(λ_1, ..., λ_n) so that λ_1 ≤ ... ≤ λ_p < λ_{p+1} ≤ ... ≤ λ_n and |λ_n − λ_1| ≤ l. We then consider an orthonormal matrix V ∈ R^{n×n} to obtain

B = VDV^T.

Clearly, B is an n-by-n real symmetric matrix whose spectrum matches exactly the chosen eigenvalues. To obtain a symmetric tridiagonal matrix A from B with the same spectrum, we consider the Hessenberg reduction by means of Householder transformations. The latter works as follows: let u ∈ R^n and consider the Householder reflection

P := I_n − (2/(u^T u)) uu^T.

Then P reflects vectors across the hyperplane span(u)^⊥. As explained in [16], by setting

u = x ± ||x||e_1, we have

P x = (I_n − (2/(u^T u)) uu^T) x = ∓||x||e_1.

To reduce a matrix to Hessenberg form, we apply on the left and on the right an (n−1)-dimensional Householder reflection P_1 to the (n−1)×(n−1) submatrix B_1 of B, as

[ 1  0   ] [ ×  ... ] [ 1  0   ]^{-1}
[ 0  P_1 ] [ ...  B_1] [ 0  P_1 ]          (5.2.1)

Clearly, symmetry is preserved and we thus obtain a matrix of the form

[ ×  ×  0  ...  0
  ×  ×  ×  ...  ×
  0  ×
  ...
  0  ×             ].

Iterating Householder reflections on smaller and smaller submatrices yields the desired tridiagonal matrix A with the same spectrum as D. In MatLab, the matrix A is obtained by typing

d=sort(l*rand(n,1)-l/2); d(p+1)=d(p)+0.2;
D=diag(d); V=orth(rand(n));                        (5.2.2)
B=V*D*V'; A=hess(B);

where in the first line we generate the spectrum with the desired gap l and separate the p smallest eigenvalues from the others. This rule generalizes the generating process introduced at the beginning of the section, allowing us to control the spectrum and run coherent and meaningful numerical tests.
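A quick check (a sketch with arbitrary n, p, l) that the construction (5.2.2) indeed yields a numerically tridiagonal matrix with the prescribed spectrum:

    n = 100; p = 10; l = 10;
    d = sort(l*rand(n,1) - l/2); d(p+1) = d(p) + 0.2; D = diag(d);
    V = orth(rand(n)); B = V*D*V'; A = hess(B);
    norm(A - triu(tril(A,1),-1), 'fro')    % ~1e-15: A is numerically tridiagonal
    norm(sort(eig(A)) - sort(d))           % ~1e-13: same spectrum as D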

5.3 Weighted Rayleigh quotient sparse minimization

We are interested in approximating the solution of a weighted formulation of the Rayleigh quotient, to find an even more sparsified basis X ∈ R^{n×p} for p eigenvectors of A as in (5.2.2). To do this, we naturally set the problem in the Riemannian framework of the Stiefel manifold St_p^n := {X ∈ R^{n×p} : X^T X = I_p}. We consider the ℓ1-weighted formulation

min_{X∈St_p^n} f(X) := trace(X^T A X) + λ||vec(X)||_{ℓ1},   (WRQ)

where λ ∈ R is the weight parameter. While the smooth part tends to be minimized by the collection of the p smallest eigenvectors, the presence of the ℓ1-norm penalizes the entries and, depending on the λ parameter, allows us to approximate the basis of eigenvectors while sparsifying it at the same time. We would like to investigate the applicability of the ε-subdifferential method to the nonsmooth cost function involved.

Remark. The nonsmooth part of the cost function is simply ||vec(X)||_{ℓ1} := Σ_{i,j} |X_{i,j}|. In MatLab, we shall use the command norm(X(:),1) to compute this quantity. Moreover, its Euclidean gradient is

∇||vec(X)||_{ℓ1} = sign(X).

58 n 5.3.1 The manifold Stp As for the sphere and the manifold of fixed rank matrices, we seek for the Stiefel manifold a analytic expression of maps such as exponential, geodesics, projections and vector transport. Unfortunately, contrary to the unit-sphere, geodesics and exponential mappings are too expensive to compute even if given in closed form. We are going to consider a more computational friendly retraction. n The tangent space at point X ∈ Stp is the set

n n×p > > TX Stp = {Z ∈ R : X Z + Z X = 0},

n > (n−k)×k or equivalently, TX Stp = {XΩ + X⊥K :Ω = Ω,K ∈ R }. An effective retraction is the map RX (ξ) = qf(X + ξ), n with ξ ∈ TX Stp and qf(·) is the q factor in the QR decomposition of a matrix. The n×p n projection of a matrix Z ∈ R onto the tangent space TX Stp is obtained by

n > > > PTX Stp (Z) = Z − Xsym(X Z) = (I − XX )Z + Xskew(X Z), (5.3.1) where 1 1 sym(M) := (M + M >), skew(M) := (M − M >) 2 2 denote the symmetric and the skew-symmetric part of a matrix M. We now derive the Riemannian gradient for the cost function (WRQ). Firstly, we notice that g(X) := > > trace(X AX) = trace(Φ(X)), where Φ(X) = X AX. Thus, ∇X trace(Φ(X)) = ∇Φ(X)trace(Φ(X))∇X Φ(X) = 2AX. The Riemannian gradient for the weighted Rayleigh quotient with symmetric matrix A is thus

grad f(X) = (I − XX^T)(2AX + λ sign(X)) + λ X skew(X^T sign(X)),

obtained according to the projection (5.3.1). We summarize all the derived tools in Table 5.1. We stress that the optimization on S^{n−1} is also described in the table, as the Stiefel manifold reduces to the unit sphere when p = 1. Nevertheless, on the sphere it is preferable to retract and parallel translate tangent vectors via geodesics.
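A minimal MATLAB sketch of these tools (own naming; A symmetric and the weight lambda given), assembling the cost, the Euclidean and Riemannian gradients, and the QR retraction of Table 5.1:

    n = 100; p = 5; lambda = 1e-3;
    A = randn(n); A = (A + A')/2;
    X = orth(randn(n, p));                             % a point on St_p^n

    f     = @(X) trace(X'*A*X) + lambda*norm(X(:), 1); % cost in (WRQ)
    egrad = @(X) 2*A*X + lambda*sign(X);               % Euclidean gradient
    proj  = @(X, Z) Z - X*((X'*Z + Z'*X)/2);           % projection (5.3.1)
    rgrad = proj(X, egrad(X));                         % Riemannian gradient at X

    xi = proj(X, randn(n, p));                         % a tangent vector at X
    [Q, ~] = qr(X + xi, 0); Xnew = Q;                  % retraction R_X(xi) = qf(X + xi)
    norm(Xnew'*Xnew - eye(p), 'fro')                   % ~0: Xnew stays on the manifold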

5.3.2 Sparse eigenvectors on S^{n−1}

Problem (WRQ) reduces to a Riemannian optimization on the sphere when p = 1. The implementation of this case comes for free, applying the machinery discussed in Section 4.4 to the functions

f(x) = x^T A x + λ||x||_{ℓ1},
grad f(x) = (I_n − xx^T)(2Ax + λ sign(x)).

                    St_p^n                                              Total space R^{n×p}
Cost function f(X)  trace(X^T A X) + λ||vec(X)||_{ℓ1}                   trace(X^T A X) + λ||vec(X)||_{ℓ1}
Metric              induced metric                                      ⟨X, Y⟩ = trace(X^T Y)
Tangent space       T_X St_p^n = {Z ∈ R^{n×p} : sym(X^T Z) = 0}         R^{n×p}
Projection          P_{T_X St_p^n}(Z) = Z − X sym(X^T Z)                \
Gradient            grad f(X) = P_{T_X St_p^n}(∇f(X))                   ∇f(X) = 2AX + λ sign(X)
Retraction          R_X(ξ) = qf(X + ξ)                                  \
Vector transport    T_{X→Y}(ξ_X) = P_{T_Y St_p^n}(ξ_X), ξ_X ∈ T_X St_p^n   translation

Table 5.1: Weighted Rayleigh function on the Stiefel manifold.

Consider now a matrix A ∈ R^{n×n}, with n = 1000 and spectral gap l = 10, obtained in MatLab according to (5.2.2). We apply the ε-subdifferential method to the weighted minimization for different values of λ. We remark that the unit sphere is equipped with an efficient exponential mapping, so we can closely follow the steps of the algorithms presented. In Figure 5.2, we measure the quality of the approximation with the Riemannian distance d_g in logscale. We denote by v the eigenvector associated with the smallest eigenvalue of A and by x_i the iterates produced by the algorithm. Intuitively, the smaller λ is, the closer the final approximation is to the eigenvector v of A.

Figure 5.2: Riemannian distance in logscale between the sparse approximations x_i and the eigenvector v on S^{999}, for different weight parameters λ (from 10^{−2} down to 10^{−7}).

The plot confirms the intuition, but we are now interested in the sparsity of the approximation: for which λ do we have a good compromise between accuracy and sparsity? How does the sparsity behave w.r.t. n?

To measure the sparsity of a matrix A ∈ R^{m×n}, we define the index

nnz(A) := #{A_{i,j} : |A_{i,j}| ≥ tol},

where tol is a threshold below which an entry of A is considered as zero. In Figure 5.3 we address these questions. Firstly, we notice in the right plot that the accuracy does not change significantly w.r.t. n, but rather w.r.t. λ.
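In MatLab terms, the index above and the sparsity fraction used in the plots can be written as one-line helpers (a sketch, own naming):

    nnz_tol  = @(A, tol) nnz(abs(A) >= tol);           % entries above the threshold
    sparsity = @(A, tol) nnz_tol(A, tol) / numel(A);   % fraction of "nonzero" entries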

Figure 5.3: On the left: reached accuracies for different-size approximations x(n) ∈ R^n, varying the λ weight. On the right: sparsity of x(n) compared to the eigenvector v obtained by the eig(·) command in MatLab.

We denote by x(n) ∈ R^n the n-size approximation achieved by the ε-subdifferential method. In the left plot, we compare instead the numerical sparsity nnz(x(n))/n to the one computed by the eig(·) command in MatLab, again for different λ. A tolerance tol = 10^{−4} is fixed to consider an entry as zero in the sparsity analysis. There is no significant gain in sparsity and, even for larger λ (e.g. 10^{−3}), the graph lies below the sparsity of the exact eigenvectors v(n). Nevertheless, the goal is to work with multiple eigenvectors at the same time on the Stiefel manifold, where even a small gain in sparsity for each vector can be meaningful.

5.3.3 Sparse eigenvectors on St_p^n

Since in this part we need to compare subspaces that contain multiple directions spanning eigenspaces, we introduce a general notion of angles between subspaces. Indeed, even if we get an orthonormal matrix as output of the numerical approach, we would like to measure numerically how close it is to being A-invariant and to the span of the smallest p eigenvectors. To do so, we define a notion of angles between subspaces.

Definition 5.3.1. Let X, Y ∈ R^{n×p} have orthonormal columns and denote by X, Y the subspaces spanned by the columns of X and Y. We define the i-th canonical angle between X and Y as the quantity

θ_i(X, Y) := arccos σ_i,   i = 1, ..., p,

where 0 ≤ σ_1 ≤ ... ≤ σ_p are the singular values of X^T Y.

The definition does not change for complex subspaces (up to considering the complex adjoint), but we prefer to remain in the real case in order to stay close to the Stiefel manifold.

Remark. A few remarks about this definition:

(i) θ_1 is the largest canonical angle, since the arccos(·) function is monotonically decreasing; it coincides with the angle between vectors in the one-dimensional case, i.e. p = 1 and θ(x, y) = arccos(x^T y / (||x|| ||y||)).

(ii) We have the geometric characterization

θ_1(X, Y) = max_{x∈X, x≠0} min_{y∈Y, y≠0} θ(x, y).

It follows that θ_1(X, Y) < π/2 if and only if X ∩ Y^⊥ = {0}.

(iii) In MatLab, it suffices to call the function subspace(X,Y) to measure the biggest canonical angle.
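A short sketch of Definition 5.3.1 in MATLAB, checked against the built-in subspace(·,·):

    n = 50; p = 5;
    X = orth(randn(n, p)); Y = orth(randn(n, p));
    s = svd(X'*Y);                          % singular values of X'Y, lying in [0, 1]
    theta = acos(min(max(s, 0), 1));        % canonical angles (clipping guards roundoff)
    [max(theta), subspace(X, Y)]            % both give the largest canonical angle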

We now apply the ε-subdifferential method to problem (WRQ) for the case of a tridiagonal symmetric matrix A ∈ R^{300×300}. Unfortunately, on the Stiefel manifold computing retractions via the exponential mapping is computationally impractical; we thus apply the method with the retraction and vector transport discussed in Table 5.1. We seek a sparsified approximation of the smallest p = 40 eigenvectors, as a minimizer belonging to St_40^300. The method only needs to evaluate the cost function and its Riemannian gradient at several points during the iterations: it is convenient to store A in sparse format to speed up matrix multiplications. Let X_i be the iterates of the subspaces produced, and V the subspace spanned by the p smallest eigenvectors of A. We indicate by X_i, V ∈ St_p^n the two orthonormal bases of these subspaces, respectively. In Figure 5.4 we can see the behaviour of the iterates produced by the nonsmooth Riemannian approach for different values of the weight. The plots confirm the intuition that, for lower values of the λ parameter, the subspaces closely approach the span of the smallest p eigenvectors, as the biggest canonical angle reaches smaller and smaller values. Moreover, the cost function gets significantly close to the Rayleigh quotient ρ_A(V).


Figure 5.4: Biggest canonical angles between the approximated subspaces and the p-smallest eigenspace on the top. Values f(X_i) converging to the Rayleigh quotient on the bottom.

Now that we have established the effectiveness of the nonsmooth approach through a numerical analysis of the weight parameter, we investigate how sparse (numerically) the output X_i is. Intuitively, the λ parameter, if not too small, penalizes the magnitude of the entries of X_i, as their 1-norm contributes to the cost function. We consider a set of matrices with separated spectrum, as in Section 5.2, with increasing sizes n. We denote by X(n), V(n) the approximation matrix and the exact eigenvectors, respectively, and by X(n), V(n) the corresponding spanned subspaces. The goal is to show that for larger n the sparsity nnz(X(n))/(np) decreases while still keeping a good level of accuracy, expressed again in terms of the biggest canonical angle. We fix p = 40 and the gap |λ_min − λ_max| = 10, and we propose in Figure 5.5 the

same simulation performed for S^{n−1}.

Figure 5.5: On the top: reached accuracies for the different-size approximations X(n) of V(n) ∈ St_40^n, varying the λ weight. On the bottom: sparsity of X(n) compared to the eigenbasis V(n) obtained by the eig(·) command in MatLab.

The accuracy, expressed in terms of the biggest canonical angle, is kept approximately fixed for each λ while varying n. Moreover, the reached accuracy is in line with the error observed on the unit sphere. Regarding the sparsity of the basis X(n), we see that a big improvement has been achieved compared to the eig(·) command; for each n, the recovered eigenbasis X(n) is sensibly sparser numerically (tol = 10^{−4}). A key strategy was used here: the approximation X(n) obtained for one λ is used as the initial guess for the successive call of the method with a smaller λ. This is helpful in keeping the matrices sparse and in decreasing the number of iterations needed for convergence. On the other hand, it is not particularly effective in gaining accuracy.

5.4 Nonsmooth Matrix Completion

In [8], a Riemannian method is proposed to face the nonsmooth regularized formulation of the matrix completion problem

min_{X∈M_k} f(X) := ||vec(P_Ω(X − A))||_{ℓ1} + µ||P_Ω̄(X)||_F^2,   (MC-R)

where Ω̄ := {(i, j) : 1 ≤ i ≤ m, 1 ≤ j ≤ n} \ Ω. The regularization term µ||P_Ω̄(X)||_F^2 penalizes the entries on the complementary set Ω̄, while the distance P_Ω(X − A) is minimized in the ℓ1 norm. One prefers to introduce convex 1-norms to take into account non-Gaussian types of noise. The authors propose a Riemannian approach based on the following smoothing technique. Select δ > 0 and consider

Σ_{(i,j)∈Ω} √(δ^2 + (X_{i,j} − A_{i,j})^2),

which is a smooth approximation of ||vec(P_Ω(X − A))||_{ℓ1} when δ is small; for a sketch see Figure 5.6.

Figure 5.6: Smoothing technique: the approximation √(δ^2 + x^2) of the function |x|.

The goal is then to perform a Geometric CG for the optimization problem

min_{X∈M_k} f_δ(X) := Σ_{(i,j)∈Ω} √(δ^2 + (X_{i,j} − A_{i,j})^2) + µ||P_Ω̄(X)||_F^2,   (5.4.1)

for decreasing values of δ. Notice that the Euclidean gradient of f_δ(X) is given entrywise by

(∇f_δ(X))_{i,j} = (X_{i,j} − A_{i,j}) / √(δ^2 + (X_{i,j} − A_{i,j})^2) if (i, j) ∈ Ω,   and 2µX_{i,j} otherwise.

The Riemannian gradient is then obtained by projection according to formula (2.2.1). One can finally consider the optimization Algorithm 5. The algorithm produces a sequence of objective values {f^l}_{l=0}^∞ that can be proved to be decreasing (see [8, Theorem 1]). The optimization method is shown by the authors to perform well in the cases of exact and noisy completion.

Algorithm 5 Smoothing Completion for Problem (MC-R)
Input: f_δ : M_k → R smooth, X_0 ∈ M_k, δ^0 > 0, 0 < θ < 1, µ > 0 and tol > 0.
1: f^0 ← ∞
2: for l = 0, 1, ... do
3:   Solve with Geometric CG
       min_{X∈M_k} Σ_{(i,j)∈Ω} √((δ^l)^2 + (X_{i,j} − A_{i,j})^2) + µ||P_Ω̄(X)||_F^2
     and set X_{l+1} to its solution.
4:   f^{l+1} ← f_{δ^l}(X_{l+1})
5:   ε ← |f^l − f^{l+1}|
6:   if ε ≤ tol then
7:     Exit.
8:   end if
9:   δ^{l+1} ← θδ^l
10: end for
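A minimal sketch (own setup: random data, arbitrary sizes) of the smoothed cost f_δ of (5.4.1) and its Euclidean gradient, with Ω stored as a logical mask:

    m = 20; n = 15; mu = 1e-3; delta = 1e-2;
    A = randn(m, n); Omega = rand(m, n) < 0.3; X = randn(m, n);

    fdelta = sum(sqrt(delta^2 + (X(Omega) - A(Omega)).^2)) + mu*sum(X(~Omega).^2);
    G = zeros(m, n);
    G(Omega)  = (X(Omega) - A(Omega)) ./ sqrt(delta^2 + (X(Omega) - A(Omega)).^2);
    G(~Omega) = 2*mu*X(~Omega);   % Euclidean gradient; the Riemannian one follows by projection onto T_X M_k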

5.4.1 Further work directions

A natural follow-up of this Thesis work is to investigate the ε-subdifferential method for the convex formulation of matrix completion, instead of using smoothing techniques. We have already seen that convex ℓ1 penalties can be efficiently handled by the subdifferential. In addition to the regularized Problem (MC-R), one can simply consider

min_{X∈M_k} ||vec(P_Ω(X − A))||_{ℓ1}.   (MC1)

Obviously, if the matrix completion problem is well posed in M_k, there exists a unique global minimizer X ∈ M_k s.t. ||vec(P_Ω(X − A))||_{ℓ1} = 0. Nevertheless, the implementation of the ε-subdifferential method on M_k requires further research and coding work in MatLab. One needs to explore how well the SVD retraction and the vector transport by metric projection behave for the nonsmooth approach on M_k. If too much error accumulates in the ε-subdifferential method, we cannot hope to reach a suitable accuracy in the approximations. Moreover, further coding work is needed to be able to work with high-dimensional matrix manifolds. The quadratic program involved in (4.2.2) can be replaced by a much more effective choice: the minimizer lies on a facet of the convex hull lying in a d-dimensional tangent space, and the program can be replaced by an efficient projection if we are able to identify the right facet without building the whole convex hull. Finally, one can investigate the applicability of the nonsmooth method in the tensor case, in the same spirit as the PhD thesis [27], where Riemannian optimization has been extended to the tensor completion problem.

Bibliography

[1] P. A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2008.

[2] P.-A. Absil and J. Malick. Projection-like retractions on matrix manifolds. SIAM J. Optim., 2012.

[3] M. Benzi, D. Bini, D. Kressner, H. Munthe-Kaas, and C. Van Loan. Exploiting Hidden Structure in Matrix Computations: Algorithms and Applications: Cetraro, Italy 2015. Springer, 2017.

[4] M. Benzi and P. Boito. Decay properties for functions of matrices over C*-algebras. Linear Algebra Appl., 2014.

[5] M. Benzi, P. Boito, and N. Razouk. Decay properties of spectral projectors with applications to electronic structure. SIAM Rev., 2013.

[6] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.

[7] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 2010.

[8] L. Cambier and P.A. Absil. Robust low-rank matrix completion by Riemannian optimization. SIAM J. Sci. Comput., 2016.

[9] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 2009.

[10] F. H. Clarke. Generalized gradients and applications. Trans. Amer. Math. Soc., 1975.

[11] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1936.

[12] M. Fazel, H. Hindi, and S. P. Boyd. Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In Proceedings of the 2003 American Control Conference. IEEE, 2003.

[13] O. P. Ferreira and P. R. Oliveira. Subgradient algorithm on Riemannian manifolds. J. Optim. Theory Appl., 1998.

[14] S. Foucart and H. Rauhut. A mathematical introduction to compressive sensing. Applied and Numerical Harmonic Analysis. Birkhäuser/Springer, 2013.

[15] A. A. Goldstein. Optimization of Lipschitz continuous functions. Math. Program- ming, 1977.

[16] G. H. Golub and C. F. Van Loan. Matrix computations. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, fourth edition, 2013.

[17] P. Grohs and S. Hosseini. ε-subgradient algorithms for locally Lipschitz functions on Riemannian manifolds. Adv. Comput. Math., 2016.

[18] N. J. Higham. Functions of matrices. Society for Industrial and Applied Mathematics (SIAM), 2008.

[19] S. Hosseini and M. R. Pouryayevali. Generalized gradients and characterization of epi-Lipschitz sets in Riemannian manifolds. Nonlinear Anal., 2011.

[20] S. Kobayashi and K. Nomizu. Foundations of differential geometry. Wiley Classics Library. John Wiley & Sons, 1996.

[21] J. M. Lee. Introduction to smooth manifolds. Graduate Texts in Mathematics. Springer, 2013.

[22] Y. Nesterov. Introductory lectures on convex optimization. Applied Optimization. Kluwer Academic Publishers, 2004.

[23] J. Nocedal and S. J. Wright. Numerical optimization. Springer, second edition, 2006.

[24] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev., 2010.

[25] R. T. Rockafellar. Convex analysis. Princeton Landmarks in Mathematics. Prince- ton University Press, 1997.

[26] B. P. Rynne and M. A. Youngson. Linear functional analysis. Springer Undergraduate Mathematics Series. Springer-Verlag London, second edition, 2008.

[27] M. M. Steinlechner. Riemannian Optimization for Solving High-Dimensional Problems with Low-Rank Tensor Structure. PhD thesis, EPFL, 2016.

[28] G. Teschl. Ordinary differential equations and dynamical systems. Graduate Studies in Mathematics. American Mathematical Society, 2012.

[29] C. Udrişte. Convex functions and optimization methods on Riemannian manifolds. Mathematics and its Applications. Kluwer Academic Publishers Group, 1994.

[30] B. Vandereycken. Riemannian and multilevel optimization for rank-constrained matrix problems with applications to Lyapunov equations. PhD thesis, Katholieke Universiteit Leuven, 2010.

[31] B. Vandereycken. Low-rank matrix completion by Riemannian optimization. SIAM J. Optim., 2013.

[32] J. H. Wilkinson, editor. The Algebraic Eigenvalue Problem. Oxford University Press, Inc., 1988.
