
RIEMANNIAN OPTIMIZATION FOR HIGH-DIMENSIONAL TENSOR COMPLETION

MICHAEL STEINLECHNER∗

Abstract. Tensor completion aims to reconstruct a high-dimensional data set where the vast majority of entries is missing. The assumption of a low-rank structure in the underlying original data allows us to cast the completion problem into an optimization problem restricted to the manifold of fixed-rank tensors. Elements of this smooth embedded submanifold can be efficiently represented in the tensor train (TT) or matrix product states (MPS) format with a storage complexity scaling linearly with the number of dimensions. We present a nonlinear conjugate gradient scheme within the framework of Riemannian optimization which exploits this favorable scaling. Numerical experiments and comparison to existing methods show the effectiveness of our approach for the approximation of multivariate functions. Finally, we show that our algorithm can obtain competitive reconstructions from uniform random sampling of few entries compared to adaptive sampling techniques such as cross-approximation.

Key words. Tensor completion, data recovery, high-dimensional problems, low-rank approximation, tensor train format, matrix product states, Riemannian optimization, curse of dimensionality

AMS subject classifications. 15A69, 65K05, 65F99, 58C05

1. Introduction. In this paper, we consider the completion of a d-dimensional tensor A ∈ R^{n_1×n_2×···×n_d} in which the vast majority of entries are unknown. Depending on the source of the data set represented by A, assuming an underlying low-rank structure is a sensible choice. In particular, this is the case for discretizations of functions that are well-approximated by a short sum of separable functions. Low-rank models have been shown to be effective for various applications, such as electronic structure calculations and approximate solutions of stochastic and parametric PDEs; see [19] for an overview. With this assumption, the tensor completion problem can be cast into the optimization problem

\[
  \min_{X}\; f(X) := \tfrac12 \| P_\Omega X - P_\Omega A \|^2
  \quad\text{subject to}\quad
  X \in \mathcal{M}_r := \bigl\{ X \in \mathbb{R}^{n_1\times n_2\times\cdots\times n_d} \mid \operatorname{rank}_{\mathrm{TT}}(X) = r \bigr\},
  \tag{1.1}
\]
where rank_TT(X) denotes the tensor train rank of the tensor X, a certain generalization of the matrix rank to the tensor case; see Section 3 for details. We denote by P_Ω the projection onto the sampling set Ω ⊂ {1, 2, ..., n_1} × ··· × {1, 2, ..., n_d} corresponding to the indices of the known entries of A,

\[
  P_\Omega X :=
  \begin{cases}
    X(i_1, i_2, \ldots, i_d) & \text{if } (i_1, i_2, \ldots, i_d) \in \Omega,\\
    0 & \text{otherwise.}
  \end{cases}
  \tag{1.2}
\]
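To make the sampling operator concrete, the following Python sketch applies P_Ω to a small dense tensor and evaluates the cost function of (1.1). It is a toy illustration under our own naming (sample, cost), not part of the paper's MATLAB/TTeMPS implementation.

    import numpy as np

    def sample(X, Omega):
        """Apply the sampling operator P_Omega of (1.2) to a dense tensor X.

        Omega is an (|Omega| x d) integer array of multi-indices; the result is
        the vector of entries of X at those indices."""
        return X[tuple(Omega.T)]

    def cost(X, Omega, A_Omega):
        """Cost function f(X) = 0.5 * ||P_Omega X - P_Omega A||^2 from (1.1)."""
        residual = sample(X, Omega) - A_Omega
        return 0.5 * np.dot(residual, residual)

    # Tiny example: a 3-dimensional tensor with five known entries.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4, 4))
    Omega = rng.integers(0, 4, size=(5, 3))
    A_Omega = sample(A, Omega)
    print(cost(np.zeros((4, 4, 4)), Omega, A_Omega))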

In the two-dimensional case d = 2, tensor completion (1.1) reduces to the extensively studied matrix completion problem, for which convergence theory and efficient algorithms have been derived in recent years; see [31] for an overview.

∗MATHICSE-ANCHP, Section de Mathématiques, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland. E-mail: michael.steinlechner@epfl.ch. This work has been supported by the SNSF research project Riemannian optimization for solving high-dimensional problems with low-rank tensor techniques within the SNSF ProDoc Efficient Numerical Methods for Partial Differential Equations. Parts of this work have been conducted during a research stay at PACM, Princeton University, with a mobility grant from the SNSF.

The number of entries in the optimization variable X ∈ R^{n_1×n_2×···×n_d} scales exponentially with d, the so-called curse of dimensionality. Hence, the naive approach of reshaping the given data A into a large matrix and applying existing algorithms for matrix completion is computationally infeasible for all but very small problem sizes.

Assuming low-rank structure in the different tensor formats along with their notion of tensor rank allows us to reduce the degrees of freedom significantly. For an overview of the most common choices of tensor formats, we refer to the literature survey by Grasedyck et al. [19]. The most popular choice, the Tucker tensor format, reduces the storage complexity from O(n^d) to O(dnr + r^d) and thus yields good results for not too high-dimensional problems, say d ≤ 5. Algorithms developed for the Tucker format range from alternating minimization [27, 28] and convex optimization, using various generalizations of the nuclear norm [27, 28, 15, 41, 42, 43], to Riemannian optimization on the embedded submanifold of tensors with fixed Tucker rank [26].

If higher-dimensional problems are considered, the Tucker format can be replaced by the hierarchical Tucker (HT) or, alternatively, the tensor train / matrix product states (TT/MPS) format. In the TT format, for example, the number of unknowns in a tensor with fixed rank r scales like O(dnr^2), which potentially allows for the efficient treatment of problems with a much higher number of dimensions d than the Tucker format. This paper is a generalization of the Riemannian approach presented in [48, 26] to a higher-dimensional setting using this TT format. We obtain an algorithm called RTTC that scales linearly in the number of dimensions d, in the tensor size n and in the size of the sampling set |Ω|, and scales polynomially in the tensor train rank r. This scaling allows us to apply the algorithm to very high-dimensional problems. For the HT format, Da Silva and Herrmann [10, 11] have derived HTOpt, a closely related Riemannian approach. By exploiting the simpler structure of the TT format, the algorithm presented in this paper exhibits a better asymptotic complexity of O(d|Ω|r^2) instead of O(d|Ω|r^3) for HTOpt. Furthermore, our implementation is specifically tuned for very small sampling set sizes. In the TT format, Grasedyck et al. [17] presented an alternating direction method with similar computational complexity of O(d|Ω|r^2). We will compare these two methods to our algorithm and to a simple alternating least squares approach.

Note that for tensor completion, the index set Ω of given entries is fixed. If we relax this constraint and instead try to reconstruct the original tensor A from as few entries as possible, black-box approximation techniques such as cross-approximation are viable alternatives. In the matrix case, cross approximation was first developed for the approximation of integral operators [6, 7, 8] and as the so-called skeleton decomposition [16, 45]. Extensions to high-dimensional tensors were developed by Oseledets et al. [36, 39] for the TT format and by Ballani et al. [5, 4] for the HT format. In these techniques, the sampling points are chosen adaptively during the runtime of the algorithm. In this paper, we will compare them to our proposed algorithm and show in numerical experiments that choosing randomly distributed sampling points beforehand is competitive in certain applications.
We will first state in Section 2 the final Riemannian optimization algorithm to give an overview of the necessary ingredients and quantities. Then we will briefly recapitulate the TT format and its geometric properties in Section 3, and explain each individual part of the optimization algorithm in detail in Section 4. We will focus on the efficient computation of each part and give a detailed analysis of the involved computational complexity to successfully approach the tensor completion problem (1.1). Finally, we apply our algorithm to several test cases and compare its performance to other existing methods in Section 5.

2. A Riemannian nonlinear conjugate gradient scheme. To solve the tensor completion problem (1.1), we will use a nonlinear conjugate gradient scheme restricted to M_r, the set of TT tensors with fixed rank. As we will discuss in Section 3, this set forms a smooth embedded submanifold of R^{n_1×n_2×···×n_d} and we can apply a Riemannian version [1, Sec. 8.3] of the usual nonlinear CG algorithm. The main difference is the fact that we need to make sure that every iterate X_k of the optimization scheme stays on the manifold M_r and that we use the Riemannian gradient ξ_k, which points in the direction of greatest increase within the tangent space T_{X_k} M_r.

In the first step, we move from the starting point X_0 ∈ M_r in the direction of the negative Riemannian gradient η_0 = −ξ_0. This yields a new iterate which is, in general, not an element of M_r anymore. Hence, we will need the concept of a retraction R which maps the new iterate X_0 + α_0 η_0 back to the manifold. Starting from the second step, we begin with the conjugate gradient scheme. For this, we need to compare the current gradient ξ_k ∈ T_{X_k} M_r with the previous search direction η_{k−1} ∈ T_{X_{k−1}} M_r, and hence need to bring η_{k−1} into the current tangent space using the vector transport T_{X_{k−1}→X_k}.

Algorithm 1 RTTC: Riemannian CG for Tensor Train Completion, see also [26]

Input: Initial guess X_0 ∈ M_r.
  ξ_0 ← grad f(X_0)                          % compute Riemannian gradient (Sec. 4.2)
  η_0 ← −ξ_0                                 % first step is steepest descent
  α_0 ← argmin_α f(X_0 + α η_0)              % step size by linearized line search (Sec. 4.5)
  X_1 ← R(X_0, α_0 η_0)                      % obtain next iterate by retraction (Sec. 4.3)
  for k = 1, 2, ... do
    ξ_k ← grad f(X_k)                        % compute Riemannian gradient
    η_k ← −ξ_k + β_k T_{X_{k−1}→X_k} η_{k−1} % conjugate direction by updating rule (Sec. 4.4)
    α_k ← argmin_α f(X_k + α η_k)            % step size by linearized line search
    X_{k+1} ← R(X_k, α_k η_k)                % obtain next iterate by retraction
  end for

The resulting optimization scheme, which we call RTTC (Riemannian Tensor Train Completion), is given by Algorithm 1. Note that the basic structure does not depend on the representation of X as a TT tensor and is exactly the same as in the Tucker case [26]. We will derive and explain each part of the algorithm in the following Section 4: the computation of the Riemannian gradient in Sec. 4.2, the line search procedure in Sec. 4.5, the retraction in Sec. 4.3, the direction updating rule and vector transport in Sec. 4.4 and the stopping criteria in Sec. 4.6.

3. The geometry of tensors in the TT format. In this section, we will give a brief overview of the TT representation and fix the notation for the rest of this paper. The notation follows the presentation in [25]. For a more detailed description of the TT format, we refer to the original paper by I. Oseledets [34]. We will review the manifold structure of TT tensors with fixed rank and present different representations of elements in the tangent space. This will then allow us to derive our optimization scheme in Section 4.

3.1. The TT representation of tensors. A d-dimensional tensor X ∈ R^{n_1×n_2×···×n_d} can be reshaped into a matrix

\[
  X^{<\mu>} \in \mathbb{R}^{(n_1 n_2\cdots n_\mu)\times(n_{\mu+1}\cdots n_d)},
\]
the so-called µth unfolding of X; see [34] for a formal definition of the ordering of the indices. This allows us to define the tensor train rank rank_TT as the following (d + 1)-tuple of ranks,

\[
  \operatorname{rank}_{\mathrm{TT}}(X) = r = (r_0, r_1, \ldots, r_d) := \bigl(1, \operatorname{rank}(X^{<1>}), \ldots, \operatorname{rank}(X^{<d-1>}), 1\bigr),
\]
where we define r_0 = r_d = 1. Given these ranks (r_0, ..., r_d), it is possible to write every entry of the d-dimensional tensor X ∈ R^{n_1×n_2×···×n_d} as a product of d matrices,
\[
  X(i_1, i_2, \ldots, i_d) = U_1(i_1) U_2(i_2) \cdots U_d(i_d), \tag{3.1}
\]
where each U_µ(i_µ) is a matrix of size r_{µ−1} × r_µ for i_µ = 1, 2, ..., n_µ. The slices U_µ(i_µ), i_µ = 1, 2, ..., n_µ, can be combined into third order tensors U_µ of size r_{µ−1} × n_µ × r_µ, the cores of the TT format. In terms of these core tensors, equation (3.1) can also be written as

\[
  X(i_1, \ldots, i_d) = \sum_{k_1=1}^{r_1} \cdots \sum_{k_{d-1}=1}^{r_{d-1}} U_1(1, i_1, k_1)\, U_2(k_1, i_2, k_2) \cdots U_d(k_{d-1}, i_d, 1).
\]
To be able to exploit the product structure of (3.1), we now focus on ways to manipulate individual cores. We define the left and the right unfoldings,
\[
  U_\mu^{\mathrm L} \in \mathbb{R}^{r_{\mu-1} n_\mu \times r_\mu}, \qquad U_\mu^{\mathrm R} \in \mathbb{R}^{r_{\mu-1} \times n_\mu r_\mu},
\]
obtained by reshaping U_µ ∈ R^{r_{µ−1}×n_µ×r_µ} into matrices of size r_{µ−1}n_µ × r_µ and r_{µ−1} × n_µ r_µ, respectively. Additionally, we can split the tensor X into left and right parts, the interface matrices

\[
  X_{\le\mu}(i_1, \ldots, i_\mu) = U_1(i_1) U_2(i_2) \cdots U_\mu(i_\mu), \qquad X_{\le\mu} \in \mathbb{R}^{n_1 n_2\cdots n_\mu \times r_\mu},
\]

\[
  X_{\ge\mu}(i_\mu, \ldots, i_d) = \bigl[ U_\mu(i_\mu) U_{\mu+1}(i_{\mu+1}) \cdots U_d(i_d) \bigr]^T, \qquad X_{\ge\mu} \in \mathbb{R}^{n_\mu n_{\mu+1}\cdots n_d \times r_{\mu-1}}.
\]
Figure 3.1 depicts a TT tensor of order 6. The interface matrices can be constructed recursively:
\[
  X_{\le\mu} = (I_{n_\mu} \otimes X_{\le\mu-1})\, U_\mu^{\mathrm L},
  \qquad
  X_{\ge\mu} = \bigl( X_{\ge\mu+1} \otimes I_{n_\mu} \bigr) (U_\mu^{\mathrm R})^T,
  \tag{3.2}
\]
where I_m denotes the m × m identity matrix and ⊗ the Kronecker product. The following formula connects the µth unfolding of a tensor with its interface matrices:

\[
  X^{<\mu>} = X_{\le\mu} X_{\ge\mu+1}^T.
\]
We call X µ-orthogonal if
\[
  (U_\nu^{\mathrm L})^T U_\nu^{\mathrm L} = I_{r_\nu}, \text{ and hence } X_{\le\nu}^T X_{\le\nu} = I_{r_\nu}, \quad \text{for all } \nu = 1, \ldots, \mu-1,
\]
\[
  U_\nu^{\mathrm R} (U_\nu^{\mathrm R})^T = I_{r_{\nu-1}}, \text{ and hence } X_{\ge\nu}^T X_{\ge\nu} = I_{r_{\nu-1}}, \quad \text{for all } \nu = \mu+1, \ldots, d.
\]
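As an illustration of the product representation (3.1), the following Python sketch evaluates a single entry of a TT tensor stored as a list of cores of shape (r_{µ−1}, n_µ, r_µ). The function name tt_entry is ours; this is a minimal sketch, not the TTeMPS implementation.

    import numpy as np

    def tt_entry(cores, idx):
        """Evaluate X(i_1,...,i_d) = U_1(i_1) U_2(i_2) ... U_d(i_d), cf. (3.1).

        cores[mu] has shape (r_{mu-1}, n_mu, r_mu); since r_0 = r_d = 1 the final
        product is a 1x1 matrix whose single entry is returned."""
        result = np.ones((1, 1))
        for core, i in zip(cores, idx):
            result = result @ core[:, i, :]   # multiply by the i_mu-th slice
        return result[0, 0]

    # Example: a random TT tensor with d = 4, n = 5 and ranks (1, 3, 3, 3, 1).
    rng = np.random.default_rng(0)
    ranks = [1, 3, 3, 3, 1]
    cores = [rng.standard_normal((ranks[mu], 5, ranks[mu + 1])) for mu in range(4)]
    print(tt_entry(cores, (0, 1, 2, 3)))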


Fig. 3.1: Representation of a TT tensor of order d = 6. As r0 = rd = 1, they are usually omitted. The two interface matrices X≤3 and X≥5 are shown. See also [25].

The tensor X is called left-orthogonal if µ = d and right-orthogonal if µ = 1. Orthogonalizing a given TT tensor can be done efficiently using recursive QR decompositions, as we have for the left-orthogonal part:

\[
  X_{\le\mu} = Q_{\le\mu} R_\mu, \quad\text{with}\quad Q_{\le\mu} = (I_{n_\mu} \otimes Q_{\le\mu-1})\, Q_\mu^{\mathrm L} \quad\text{and}\quad Q_{\le 0} = 1.
\]
Thus, starting from the left, we compute a QR decomposition of the first core, U_1^L = Q_1^L R_1. The orthogonal factor is kept as the new, left-orthogonal first core and the non-orthogonal part is shifted to the second core. Then, a QR decomposition of this updated second core is performed, and so on and so forth. A similar relation holds for the right-orthogonal part; see [34] for more details. Moving from a µ-orthogonal tensor to a (µ + 1)-orthogonal one only requires a QR decomposition of U_µ^L. An important consequence that we will use in Section 3.3 is that changing the orthogonalization of a tensor only affects its internal representation and does not change the tensor itself.

As an inner product of two TT tensors we take the standard Euclidean inner product of the vectorizations vec(X), vec(Y) ∈ R^{n_1 n_2···n_d}:
\[
  \langle X, Y\rangle = \langle \operatorname{vec}(X), \operatorname{vec}(Y)\rangle = \operatorname{trace}\bigl((X^{<\mu>})^T Y^{<\mu>}\bigr) \quad\text{for any } \mu \in \{1, \ldots, d\}. \tag{3.3}
\]
The norm of a TT tensor is induced by the inner product. Note that if X is µ-orthogonal, then we obtain

\[
  \|X\| = \sqrt{\langle X, X\rangle} = \|U_\mu\| = \|U_\mu^{\mathrm L}\|_F = \|\operatorname{vec}(U_\mu)\|,
\]
so we only have to compute the norm of the µth core.

The recursive relation (3.2) allows us to extend the singular value decomposition to TT tensors, the TT-SVD [34] procedure. Let X be d-orthogonal. By taking the SVD of U_d^R = QSV^T and splitting the product to the adjacent cores,
\[
  U_d^{\mathrm R} \leftarrow V^T, \qquad U_{d-1}^{\mathrm L} \leftarrow U_{d-1}^{\mathrm L}\, Q S,
\]
the tensor is now (d − 1)-orthogonalized. By keeping only s singular values, the rank r_{d−1} between U_{d−1} and U_d is reduced to s. Repeating this procedure down to core U_1 allows us to truncate the rank r of X to any prescribed value r̃, with r̃_µ ≤ r_µ for all µ = 1, ..., d. Similar to the best-approximation property of the matrix SVD, the truncated X̃ fulfills the following inequality:

\[
  \|X - \widetilde X\| \le \sqrt{d-1}\, \|X - P_{\mathcal M_{\tilde r}}(X)\|, \tag{3.4}
\]
with P_{M_{r̃}}(X) being the best approximation to X within the set of rank-r̃ TT tensors; see [34, Cor. 2.4] for a proof.
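The following Python sketch illustrates the two sweeps just described: a left-to-right QR orthogonalization followed by a right-to-left SVD truncation. It is a simplified sketch of the TT-SVD rounding procedure of [34] under our own conventions (cores stored as arrays of shape (r_{µ−1}, n_µ, r_µ), function name tt_round), not the TTeMPS code.

    import numpy as np

    def tt_round(cores, max_rank):
        """Truncate a TT tensor, given as a list of cores, to TT-rank <= max_rank."""
        cores = [c.copy() for c in cores]
        d = len(cores)
        # Left-to-right sweep: make cores 1,...,d-1 left-orthogonal via QR.
        for mu in range(d - 1):
            r0, n, r1 = cores[mu].shape
            q, r = np.linalg.qr(cores[mu].reshape(r0 * n, r1))
            cores[mu] = q.reshape(r0, n, q.shape[1])
            cores[mu + 1] = np.einsum('ij,jkl->ikl', r, cores[mu + 1])
        # Right-to-left sweep: truncate each rank with an SVD, shifting the
        # non-orthogonal factor to the neighboring core.
        for mu in range(d - 1, 0, -1):
            r0, n, r1 = cores[mu].shape
            u, s, vt = np.linalg.svd(cores[mu].reshape(r0, n * r1), full_matrices=False)
            k = min(max_rank, s.size)
            cores[mu] = vt[:k, :].reshape(k, n, r1)
            cores[mu - 1] = np.einsum('ijk,kl->ijl', cores[mu - 1], u[:, :k] * s[:k])
        return cores

    # Example: round a random rank-5 TT tensor of order 4 down to rank 3.
    rng = np.random.default_rng(0)
    cores = [rng.standard_normal((r0, 6, r1)) for r0, r1 in zip([1, 5, 5, 5], [5, 5, 5, 1])]
    rounded = tt_round(cores, 3)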

3.2. Manifold of TT tensors of fixed rank. It was shown by Holtz et al. [22] that the set

\[
  \mathcal M_r = \bigl\{ X \in \mathbb{R}^{n_1\times\cdots\times n_d} \mid \operatorname{rank}_{\mathrm{TT}}(X) = r \bigr\}
\]
forms a smooth embedded submanifold of R^{n_1×···×n_d} of dimension

\[
  \dim \mathcal M_r = \sum_{\mu=1}^{d} r_{\mu-1} n_\mu r_\mu - \sum_{\mu=1}^{d-1} r_\mu^2. \tag{3.5}
\]
The inner product (3.3) induces a Euclidean metric on the embedding space R^{n_1×···×n_d}. When equipped with this metric, M_r is turned into a Riemannian manifold. An analogous result is also available for the HT format [46].
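As a quick sanity check of (3.5), a small Python helper (name ours) computing the manifold dimension for given mode sizes n = (n_1, ..., n_d) and ranks r = (r_0, ..., r_d):

    def tt_manifold_dim(n, r):
        """Dimension of M_r according to (3.5)."""
        d = len(n)
        return (sum(r[mu] * n[mu] * r[mu + 1] for mu in range(d))
                - sum(r[mu] ** 2 for mu in range(1, d)))

    # Example: d = 10, n = 100, ranks (1, 10, ..., 10, 1) give 81100 degrees of freedom.
    print(tt_manifold_dim([100] * 10, [1] + [10] * 9 + [1]))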

3.3. The tangent space T_X M_r. Assume X ∈ R^{n_1×···×n_d} is represented in the TT format as

\[
  X(i_1, \ldots, i_d) = U_1(i_1) U_2(i_2) \cdots U_d(i_d)
\]
with left-orthogonal cores U_µ, (U_µ^L)^T U_µ^L = I_{r_µ}, for all µ = 1, ..., d − 1. By the product rule, any element of the tangent space T_X M_r can be uniquely represented in the following way:

\[
  \begin{aligned}
  Y(i_1, \ldots, i_d) &= \delta U_1(i_1)\, U_2(i_2) \cdots U_d(i_d) \\
  &\quad+ U_1(i_1)\, \delta U_2(i_2) \cdots U_d(i_d) \\
  &\quad+ \cdots \\
  &\quad+ U_1(i_1)\, U_2(i_2) \cdots \delta U_d(i_d),
  \end{aligned}
  \tag{3.6}
\]
with the gauge condition (U_µ^L)^T δU_µ^L = 0 for all µ = 1, ..., d − 1. There is no gauge condition for the last core. This representation has been introduced by Holtz et al. [22].

For the orthogonal projection P_{T_X M_r}: R^{n_1×···×n_d} → T_X M_r of an arbitrary tensor Z ∈ R^{n_1×···×n_d}, Lubich et al. [30] derived the following explicit expression¹ for the first order variations δU_µ:

\[
  \delta U_\mu^{\mathrm L} = \bigl(I_{n_\mu r_{\mu-1}} - P_\mu\bigr)\bigl(I_{n_\mu} \otimes X_{\le\mu-1}\bigr)^T Z^{<\mu>} X_{\ge\mu+1} \bigl(X_{\ge\mu+1}^T X_{\ge\mu+1}\bigr)^{-1}, \tag{3.7}
\]
where P_µ denotes the orthogonal projection onto the range of U_µ^L. This expression is relatively straightforward to implement using tensor contractions along the modes 1, ..., µ−1, µ+1, ..., d. Note that the inverse (X_{≥µ+1}^T X_{≥µ+1})^{−1} potentially leads to numerical instabilities or is not even well-defined: for example, X_{≥µ+1}^T X_{≥µ+1} is singular in case the right unfoldings U_{µ+1}^R, ..., U_d^R do not have full rank.

To circumvent these numerical instabilities, we will work with a different parametrization of the tangent space first proposed in [24]: observe that in the representation (3.6), each individual summand is a TT tensor which can be orthogonalized individually without changing the overall tangent tensor. Hence, to obtain the µth summand, we can µ-orthogonalize X,

\[
  X(i_1, \ldots, i_d) = U_1(i_1) \cdots U_{\mu-1}(i_{\mu-1})\, \widetilde U_\mu(i_\mu)\, V_{\mu+1}(i_{\mu+1}) \cdots V_d(i_d),
\]
where the cores denoted by the letter U are left-orthogonal, V are right-orthogonal and Ũ_µ is neither left- nor right-orthogonal. Then, the resulting first-order variation is given by
\[
  U_1(i_1) \cdots U_{\mu-1}(i_{\mu-1})\, \delta\widetilde U_\mu(i_\mu)\, V_{\mu+1}(i_{\mu+1}) \cdots V_d(i_d). \tag{3.8}
\]
The first order variations δŨ_µ still fulfill the gauge condition (U_µ^L)^T δŨ_µ^L = 0, µ = 1, ..., d − 1. Because of the right-orthogonalization of X_{≥µ+1}, the projection (3.7) simplifies to

¹ Note that the expression is stated for its left unfolding δU_µ^L. From this, δU_µ can be directly obtained by reshaping. This notation will be used several times throughout this manuscript.

\[
  \delta\widetilde U_\mu^{\mathrm L} = \bigl(I_{n_\mu r_{\mu-1}} - P_\mu\bigr)\bigl(I_{n_\mu} \otimes X_{\le\mu-1}\bigr)^T Z^{<\mu>} X_{\ge\mu+1}, \tag{3.9}
\]
because now X_{≥µ+1}^T X_{≥µ+1} is the identity matrix. Hence, we are able to circumvent the explicit inverse. Due to the individual orthogonalizations of the summands, a tangent tensor can then be written in the form

\[
  \begin{aligned}
  Y(i_1, \ldots, i_d) &= \delta\widetilde U_1(i_1)\, V_2(i_2) \cdots V_d(i_d) \\
  &\quad+ U_1(i_1)\, \delta\widetilde U_2(i_2) \cdots V_d(i_d) \\
  &\quad+ \cdots \\
  &\quad+ U_1(i_1)\, U_2(i_2) \cdots \delta\widetilde U_d(i_d).
  \end{aligned}
  \tag{3.10}
\]
Making use of the structure of the sum (3.10), we observe that a tangent tensor in T_X M_r can also be represented as a TT tensor:

Remark 1. Let Y ∈ T_X M_r be given in the representation (3.10). Then, it can be identified with a TT tensor of TT-rank at most (1, 2r_1, 2r_2, ..., 2r_{d−1}, 1):

\[
  Y(i_1, \ldots, i_d) = W_1(i_1) W_2(i_2) \cdots W_d(i_d), \tag{3.11}
\]
with cores given for µ = 2, ..., d − 1 by

\[
  W_1(i_1) = \begin{bmatrix} \delta\widetilde U_1(i_1) & U_1(i_1) \end{bmatrix},
  \qquad
  W_\mu(i_\mu) = \begin{bmatrix} V_\mu(i_\mu) & 0 \\ \delta\widetilde U_\mu(i_\mu) & U_\mu(i_\mu) \end{bmatrix},
  \qquad
  W_d(i_d) = \begin{bmatrix} V_d(i_d) \\ \delta\widetilde U_d(i_d) \end{bmatrix}.
\]
This identity can be easily verified by multiplying out the matrix product. The resulting cores W_µ are of size 2r_{µ−1} × n_µ × 2r_µ for µ = 2, ..., d − 1.

This result allows us to efficiently handle elements of the tangent space in the same way as elements in the manifold, being able to directly reuse all implemented operations.

4. A Riemannian nonlinear conjugate gradient scheme. With the basic properties of elements in M_r and T_X M_r discussed, we can now describe all necessary steps of RTTC in detail. For each part, we will also mention its computational complexity. To simplify the presentation, we will state the operation count in terms of r := max_µ r_µ and n := max_µ n_µ.

4.1. Evaluation of the cost function and sampling of TT tensors. In the tensor completion problem (1.1), the cost function is given by f(X) = ||P_Ω X − P_Ω A||²/2. As the data samples P_Ω A are supplied as input, evaluating the cost function reduces to evaluating the entries of the sparse tensor P_Ω X.

Sampling of TT tensors. We can compute the application of the sampling operator P_Ω, see (1.2) for its definition, to a TT tensor X by direct evaluation of the matrix product for each sampling point:

\[
  X(i_1, i_2, \ldots, i_d) = U_1(i_1) U_2(i_2) \cdots U_d(i_d).
\]
Thus, for each sampling point, we need to compute d − 1 matrix-vector multiplications. Hence, evaluating the tensor at |Ω| sampling points involves O(|Ω|(d − 1)r²) operations.

Sampling of TT tangent tensors. Writing the tangent tensor in the form (3.11), we can reuse the sampling routine for TT tensors with an acceptable amount of extra cost: the rank is doubled, hence the number of operations is quadrupled.

4.2. Riemannian gradient. To obtain the descent direction in our optimization scheme, we need the Riemannian gradient of the objective function f(X) = ½||P_Ω X − P_Ω A||².

Proposition 4.1 ([1, Chap. 3.6]). Let f: R^{n_1×···×n_d} → R be a cost function with Euclidean gradient ∇f_X at point X ∈ M_r. Then the Riemannian gradient of f: M_r → R is given by grad f(X) = P_{T_X M_r}(∇f_X).

Hence, the Riemannian gradient is obtained by projecting the Euclidean gradient ∇f = P_Ω X − P_Ω A onto the tangent space:

\[
  \operatorname{grad} f(X) = P_{T_X \mathcal M_r}(P_\Omega X - P_\Omega A). \tag{4.1}
\]
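For illustration, a Python sketch that evaluates the Euclidean gradient P_Ω X − P_Ω A only at the sampled indices, exploiting the product structure (3.1); function and variable names are ours, and the loop is unoptimized compared to the compiled routines described in Section 5.

    import numpy as np

    def euclidean_gradient_on_omega(cores, Omega, A_Omega):
        """Entries of the Euclidean gradient P_Omega X - P_Omega A on Omega.

        cores: TT cores of X, cores[mu] of shape (r_{mu-1}, n_mu, r_mu);
        Omega: (|Omega| x d) array of sampled multi-indices;
        A_Omega: corresponding known entries of A."""
        values = np.empty(len(Omega))
        for s, idx in enumerate(Omega):
            v = np.ones((1, 1))
            for core, i in zip(cores, idx):
                v = v @ core[:, i, :]      # d - 1 small matrix-vector products
            values[s] = v[0, 0]
        return values - A_Omega

    # Example with random cores and 20 random sampling points.
    rng = np.random.default_rng(1)
    cores = [rng.standard_normal((1, 5, 3)), rng.standard_normal((3, 5, 3)),
             rng.standard_normal((3, 5, 1))]
    Omega = rng.integers(0, 5, size=(20, 3))
    grad_vals = euclidean_gradient_on_omega(cores, Omega, rng.standard_normal(20))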

As the Euclidean gradient Z = P_Ω X − P_Ω A ∈ R^{n_1×···×n_d} is a very large but sparse tensor, the projection onto the tangent space has to be handled with care. When successively calculating the projection (3.9), the intermediate results are, in general, dense already after the first step. For example, the intermediate result Z^{<µ>} X_{≥µ+1} is a huge dense matrix of size R^{n_1 n_2···n_µ×r_µ}, whereas the resulting δŨ_µ^L is only of size R^{r_{µ−1}n_µ×r_µ}. Let us restate the equations of the first order variations δŨ_µ:
\[
  \delta\widetilde U_\mu^{\mathrm L} = \bigl(I_{r_{\mu-1} n_\mu} - P_\mu\bigr)\bigl(I_{n_\mu} \otimes X_{\le\mu-1}\bigr)^T Z^{<\mu>} X_{\ge\mu+1}, \quad\text{for } \mu = 1, \ldots, d-1,
\]
\[
  \delta\widetilde U_d^{\mathrm L} = \bigl(I_{n_d} \otimes X_{\le d-1}\bigr)^T Z^{<d>}.
\]

The projection (I_{r_{µ−1}n_µ} − P_µ) is a straightforward matrix multiplication using P_µ = Q_µ Q_µ^T, where Q_µ is an orthogonal basis for the range of U_µ^L obtained from a QR decomposition, and can be done separately at the end. Instead, we focus on C_µ := (I_{n_µ} ⊗ X_{≤µ−1})^T Z^{<µ>} X_{≥µ+1}. Let Θ_{i_µ} ⊂ Ω be the set of sampling points (j_1, ..., j_d) where the µth index j_µ coincides with i_µ:

\[
  \Theta_{i_\mu} = \bigl\{ (j_1, \ldots, j_d) \in \Omega \mid j_\mu = i_\mu \bigr\} \subset \Omega. \tag{4.2}
\]

The iµth slice of Cµ is given by a sum over Θiµ , as the other entries do not contribute:

\[
  C_\mu(i_\mu) = \sum_{(j_1,\ldots,j_d)\in\Theta_{i_\mu}} X_{\le\mu-1}(j_1, \ldots, j_{\mu-1})^T\, Z(j_1, \ldots, j_d)\, X_{\ge\mu+1}(j_{\mu+1}, \ldots, j_d),
\]
which can be reformulated into
\[
  C_\mu(i_\mu) = \sum_{(j_1,\ldots,j_d)\in\Theta_{i_\mu}} Z(j_1, \ldots, j_d)\, \bigl[U_1(j_1)\cdots U_{\mu-1}(j_{\mu-1})\bigr]^T \bigl[V_{\mu+1}(j_{\mu+1})\cdots V_d(j_d)\bigr]^T.
\]

The algorithmic implementation for all first order variations δŨ_µ, µ = 1, ..., d, is shown in Algorithm 2, with reuse of intermediate matrix products. For each entry, we have to perform d − 1 small matrix-vector products, for a total cost of O(d|Ω|r²). The projection onto the orthogonal complement at the end adds another O(dnr³) operations.

Algorithm 2 Computing the Riemannian gradient

Input: Left-orth. cores U_µ and right-orth. cores V_µ of X ∈ M_r, Euclid. gradient Z
Output: First order variations δŨ_µ for µ = 1, ..., d.
  % Initialize cores
  for µ = 1, ..., d do
    C_µ ← zeros(r_{µ−1}, n_µ, r_µ)
  end for

  for (j_1, j_2, ..., j_d) ∈ Ω do
    % Precompute left matrix products
    UL{1} ← U_1(j_1)
    for µ = 2, ..., d − 1 do
      UL{µ} ← UL{µ−1} U_µ(j_µ)
    end for

    VR ← V_d(j_d)
    % Calculate the cores beginning from the right
    C_d(j_d) ← C_d(j_d) + Z(j_1, j_2, ..., j_d) · UL{d−1}^T
    for µ = d − 1, ..., 2 do
      C_µ(j_µ) ← C_µ(j_µ) + Z(j_1, j_2, ..., j_d) · UL{µ−1}^T VR^T
      VR ← V_µ(j_µ) VR
    end for
    C_1(j_1) ← C_1(j_1) + Z(j_1, j_2, ..., j_d) · VR^T
  end for
  δŨ_d ← C_d

  % Project cores 1, ..., d − 1 onto the orthogonal complement of the range of U_µ^L
  for µ = 1, ..., d − 1 do
    [Q_µ, R_µ] = qr(U_µ^L)
    δŨ_µ^L ← C_µ^L − Q_µ Q_µ^T C_µ^L
  end for
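A compact Python sketch of Algorithm 2 is given below. It follows the per-sample accumulation of the cores C_µ and the final gauge projection, but is an unoptimized illustration under our own naming; the paper's implementation performs these loops in compiled Mex routines (see Section 5).

    import numpy as np

    def riemannian_gradient_cores(U, V, Omega, Z_Omega):
        """First order variations of the Riemannian gradient, cf. Algorithm 2.

        U: left-orthogonal cores of X, U[mu] of shape (r_{mu-1}, n_mu, r_mu);
        V: right-orthogonal cores of the same tensor X;
        Omega: (|Omega| x d) array of sampled multi-indices;
        Z_Omega: entries of the Euclidean gradient P_Omega X - P_Omega A on Omega."""
        d = len(U)
        C = [np.zeros_like(U[mu]) for mu in range(d)]
        for idx, z in zip(Omega, Z_Omega):
            # left products U_1(j_1)...U_mu(j_mu), each of shape (1, r_mu)
            UL = [U[0][:, idx[0], :]]
            for mu in range(1, d - 1):
                UL.append(UL[-1] @ U[mu][:, idx[mu], :])
            VR = V[d - 1][:, idx[d - 1], :]            # shape (r_{d-1}, 1)
            C[d - 1][:, idx[d - 1], :] += z * UL[d - 2].T
            for mu in range(d - 2, 0, -1):
                C[mu][:, idx[mu], :] += z * (UL[mu - 1].T @ VR.T)
                VR = V[mu][:, idx[mu], :] @ VR
            C[0][:, idx[0], :] += z * VR.T
        # project cores 1,...,d-1 onto the orthogonal complement of range(U_mu^L)
        dU = [c.copy() for c in C]
        for mu in range(d - 1):
            r0, n, r1 = U[mu].shape
            Q, _ = np.linalg.qr(U[mu].reshape(r0 * n, r1))
            CL = C[mu].reshape(r0 * n, r1)
            dU[mu] = (CL - Q @ (Q.T @ CL)).reshape(r0, n, r1)
        return dU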

4.3. Retraction. Taking one step from the current iterate X_k ∈ M_r in the search direction ξ ∈ T_{X_k} M_r yields a new point in the tangent space which is, in general, not an element of M_r anymore. In Riemannian geometry, the well-known exponential map provides a local diffeomorphism between the neighborhood of X_k and the neighborhood around the origin of T_{X_k} M_r. Assuming not-too-large step sizes, the exponential map thus allows us to map the new point X_k + ξ back onto the manifold. While this map is given in a simple form for some manifolds, such as the sphere, it is computationally infeasible for our manifold M_r. However, as proposed in [1], a first order approximation of this map, a so-called retraction, is sufficient to ensure convergence of gradient-based optimization schemes.

Definition 4.2 (Retraction, [2, Def. 1]). Let M be a smooth submanifold of R^{n_1×···×n_d}. Let 0_x denote the zero element of T_x M. A mapping R from the tangent bundle T M into M is said to be a retraction on M around x ∈ M if there exists a neighborhood U of (x, 0_x) in T M such that the following properties hold:
(a) We have U ⊆ dom(R) and the restriction R: U → M is smooth.
(b) R(y, 0_y) = y for all (y, 0_y) ∈ U.
(c) With the canonical identification T_{0_x} T_x M ≃ T_x M, R satisfies the local rigidity condition:

\[
  \mathrm D R(x, \cdot)(0_x) = \operatorname{id}_{T_x\mathcal M} \quad\text{for all } (x, 0_x) \in \mathcal U,
\]

where id_{T_x M} denotes the identity mapping on T_x M.

For matrices, multiple variants of retractions into the low-rank manifold have been discussed in a recent survey paper by Absil and Oseledets [3]. For M_r, a retraction maps the new point to a rank-r tensor. A computationally efficient method is given by the TT-SVD procedure, which satisfies the quasi-best approximation property (3.4). This is very similar to the Tucker case discussed in [26], where it was shown that the HOSVD (which possesses an analogous quasi-best approximation property) fulfills indeed all necessary properties of a retraction map.

Proposition 4.3. Let P_r^{TT}: R^{n_1×···×n_d} → M_r denote the projection operator performing rank truncation using the TT-SVD. Then,

\[
  R: T\mathcal M_r \to \mathcal M_r, \quad (X, \xi) \mapsto P_r^{\mathrm{TT}}(X + \xi) \tag{4.3}
\]
defines a retraction on M_r around X according to Def. 4.2.

Proof. (a) There exists a small neighborhood U ⊂ R^{n_1×···×n_d} of X ∈ M_r such that the TT-SVD operator P_r^{TT}: U → M_r is smooth, as it basically only consists of unfoldings and sequential applications of locally smooth SVDs, moving from one core to the next. Thus, the retraction (4.3) is locally smooth around (X, 0_X), as it can be written as R(X, ξ) = P_r^{TT} ∘ p(X, ξ) with the smooth function p: T M_r → R^{n_1×···×n_d}, p(X, ξ) = X + ξ.
(b) We have the obvious identity P_r^{TT}(X) = X for all X ∈ M_r.
(c) We use the quasi-optimality (3.4). We have that ||(X + tξ) − P_{M_r}(X + tξ)|| = O(t²) for t → 0, as T_X M_r is a first order approximation of M_r around ξ. Hence,
\[
  \|(X + t\xi) - R(X, t\xi)\| \le \sqrt{d-1}\, \|(X + t\xi) - P_{\mathcal M_r}(X + t\xi)\| = O(t^2).
\]
Therefore, R(X, tξ) = (X + tξ) + O(t²) and (d/dt) R(X, tξ)|_{t=0} = ξ, which is equivalent to DR(X, ·)(0_X) = id_{T_X M_r}.

To efficiently apply the TT-SVD procedure, we exploit the representation (3.10) and notice that the new point, obtained by a step Y ∈ T_X M_r of length α from the point X ∈ M_r, can be written as
\[
  X(i_1, \ldots, i_d) + \alpha Y(i_1, \ldots, i_d) =
  \begin{bmatrix} \alpha\,\delta\widetilde U_1(i_1) & U_1(i_1) \end{bmatrix}
  \begin{bmatrix} V_2(i_2) & 0 \\ \alpha\,\delta\widetilde U_2(i_2) & U_2(i_2) \end{bmatrix}
  \cdots
  \begin{bmatrix} V_{d-1}(i_{d-1}) & 0 \\ \alpha\,\delta\widetilde U_{d-1}(i_{d-1}) & U_{d-1}(i_{d-1}) \end{bmatrix}
  \begin{bmatrix} V_d(i_d) \\ U_d(i_d) + \alpha\,\delta\widetilde U_d(i_d) \end{bmatrix}.
\]
Hence, retracting back to the rank-r manifold M_r is equivalent to truncating a rank-2r TT tensor to rank r, for a total cost of approximately O(dnr³). We refer to [34, Section 3] for a detailed description of the TT-SVD procedure and its computational complexity. Furthermore, the orthogonality relations between U_µ and δŨ_µ can be used to save some work when reorthogonalizing.
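In code, this retraction amounts to assembling the rank-2r cores above and rounding back to rank r. A possible Python sketch (names ours; it reuses the tt_round helper from the sketch in Section 3.1 and assumes U, V and dU share the same rank profile):

    import numpy as np

    def retract(U, V, dU, alpha, max_rank):
        """Retraction R(X, alpha*Y): build the rank-2r cores of X + alpha*Y, then truncate.

        U, V: left-/right-orthogonal cores of X; dU: first order variations of Y."""
        d = len(U)
        W = [np.concatenate([alpha * dU[0], U[0]], axis=2)]      # [alpha*dU_1, U_1]
        for mu in range(1, d - 1):
            top = np.concatenate([V[mu], np.zeros_like(U[mu])], axis=2)
            bottom = np.concatenate([alpha * dU[mu], U[mu]], axis=2)
            W.append(np.concatenate([top, bottom], axis=0))      # [[V, 0], [alpha*dU, U]]
        W.append(np.concatenate([V[d - 1], U[d - 1] + alpha * dU[d - 1]], axis=0))
        return tt_round(W, max_rank)   # TT-SVD truncation back to rank r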

4.4. Search direction update and vector transport. In the nonlinear conjugate gradient scheme, the new search direction η_k is computed as a linear combination of the current gradient ξ_k and the previous direction η_{k−1} scaled by a factor β_k,

\[
  \eta_k = -\xi_k + \beta_k\, \mathcal T_{X_{k-1}\to X_k}\, \eta_{k-1},
\]
where T_{X_{k−1}→X_k} is the vector transport which maps elements from T_{X_{k−1}} M_r to T_{X_k} M_r; see [1] for a formal definition. The orthogonal projection onto the tangent space, P_{T_{X_{k+1}} M_r}, yields a vector transport, as M_r is an embedded submanifold of R^{n_1×···×n_d}, see [1, Sec. 8.1.3]. To compute the projection of a certain tangent tensor Y onto T_X M_r, we make use of the identity (3.11), reducing it to the projection of a TT tensor of rank 2r onto T_X M_r. Let Y thus be represented as
\[
  Y(i_1, \ldots, i_d) = W_1(i_1) W_2(i_2) \cdots W_d(i_d).
\]
Then, we have to compute the first order variations
\[
  \begin{aligned}
  \delta\widetilde U_\mu^{\mathrm L} &= \bigl(I_{n_\mu r_{\mu-1}} - P_\mu\bigr)\bigl(I_{n_\mu} \otimes X_{\le\mu-1}\bigr)^T Y^{<\mu>} X_{\ge\mu+1} \\
  &= \bigl(I_{n_\mu r_{\mu-1}} - P_\mu\bigr)\bigl(I_{n_\mu} \otimes X_{\le\mu-1}\bigr)^T Y_{\le\mu}\, Y_{\ge\mu+1}^T X_{\ge\mu+1},
  \end{aligned}
\]
which, for each µ, can be efficiently evaluated as a tensor contraction along dimensions 1, ..., d except mode µ. This can be checked using the recursive relations (3.2),
\[
  \begin{aligned}
  Y_{\ge\mu+1}^T X_{\ge\mu+1} &= W_{\mu+1}^{\mathrm R}\bigl(Y_{\ge\mu+2} \otimes I_{n_{\mu+1}}\bigr)^T \bigl(X_{\ge\mu+2} \otimes I_{n_{\mu+1}}\bigr)\bigl(V_{\mu+1}^{\mathrm R}\bigr)^T \\
  &= W_{\mu+1}^{\mathrm R}\bigl(Y_{\ge\mu+2}^T X_{\ge\mu+2} \otimes I_{n_{\mu+1}}\bigr)\bigl(V_{\mu+1}^{\mathrm R}\bigr)^T
  \end{aligned}
\]
and Y_{≥d}^T X_{≥d} = W_d^R (V_d^R)^T. An analogous relation holds for X_{≤µ−1}^T Y_{≤µ−1}. As these quantities are shared between all δŨ_µ, we first precompute Y_{≥µ}^T X_{≥µ} for all µ = 2, ..., d and X_{≤µ}^T Y_{≤µ} for µ = 1, ..., d − 1, using O(dr³n) operations. Combining this with the orthogonal projection (I_{n_µ r_{µ−1}} − P_µ) yields a total amount of O(dr³n) flops.

Many options exist for the choice of β_k within the framework of nonlinear conjugate gradient methods. Here, we choose the Fletcher-Reeves update,

\[
  \beta_k = \frac{\|\xi_k\|^2}{\|\xi_{k-1}\|^2},
\]
as the computation of the norm of a tangent tensor ξ ∈ T_X M_r is particularly simple when using the representation (3.10). Due to the µ-orthogonality of each term in the sum, we have that ||ξ||² = Σ_{µ=1}^{d} ||δŨ_µ||². Hence we only need about 2dnr² operations to compute β_k. In our experiments, we observed that choosing a different scheme, such as Polak-Ribière+, has almost no influence on the convergence of RTTC.

4.5. Line search procedure. Determining a good step size α in RTTC is crucial to ensure fast convergence of the method. The optimal choice would be the minimizer of the objective function along the curve determined by the current search direction η_k: α_k = argmin_α f(R(X_k + αη_k)). As described in [48, 26], this nonlinear optimization problem can be linearized and then solved explicitly by dropping the retraction R and thereby only minimizing within the tangent space,

\[
  \alpha_k \approx \operatorname*{argmin}_\alpha f(X_k + \alpha\eta_k) = \frac{\langle P_\Omega\,\eta_k,\; P_\Omega (A - X_k)\rangle}{\langle P_\Omega\,\eta_k,\; P_\Omega\,\eta_k\rangle}.
\]

This estimate can be calculated in O(|Ω|(d − 1)r²) operations. The accuracy of this approximation depends on the curvature of the manifold at the current iterate, but in all our experiments, we have never observed a situation where this estimate was not good enough. To make sure that the iterates fulfill the Wolfe conditions, a simple Armijo backtracking scheme can be added, see e.g. [33].
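A minimal Python sketch of this linearized step size computation, assuming the sampled values of η_k and of the residual A − X_k on Ω are already available as vectors (variable and function names are ours):

    import numpy as np

    def linearized_step_size(eta_Omega, residual_Omega):
        """alpha = <P_Omega eta, P_Omega(A - X)> / <P_Omega eta, P_Omega eta>."""
        return np.dot(eta_Omega, residual_Omega) / np.dot(eta_Omega, eta_Omega)

    # Example with random sampled values.
    rng = np.random.default_rng(0)
    alpha = linearized_step_size(rng.standard_normal(1000), rng.standard_normal(1000))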

4.6. Stopping criteria. During the course of the optimization procedure, we monitor the current relative error on the sampling set,

\[
  \varepsilon_\Omega(X_k) = \frac{\|P_\Omega A - P_\Omega X_k\|}{\|P_\Omega A\|}.
\]
This requires almost no extra computations, as the Euclidean gradient ∇f = P_Ω X_k − P_Ω A is already calculated in each step. When the sampling rate is very low or the estimated rank r is chosen larger than the true rank of the underlying data, one can often observe overfitting. In these cases, the residual error does converge to zero as expected, but the obtained reconstruction is a bad approximation of the original data. This is to be expected, as the degrees of freedom in the model exceed the information available. While this intrinsic problem cannot be completely avoided, it is helpful to be able to detect these cases of overfitting. Then, we can act accordingly and either reduce the degrees of freedom of the model (by choosing a lower estimated rank r) or increase the number of sampling points, if possible. Hence, we also measure the relative error on a control set Ω_C, Ω_C ∩ Ω = ∅:

\[
  \varepsilon_{\Omega_C}(X_k) = \frac{\|P_{\Omega_C} A - P_{\Omega_C} X_k\|}{\|P_{\Omega_C} A\|}.
\]
This control set is either given or can be obtained by random subsampling of the given data. Numerical experiments show that it usually suffices to choose only a few control samples, say, |Ω_C| = 100. RTTC is run until either the prescribed tolerance is attained for ε_Ω(X_k) and ε_{Ω_C}(X_k) or the maximum number of iterations is reached. Furthermore, to detect stagnation in the optimization procedure, we check if the

relative change in εΩ and εΩC between two iterates is below a certain threshold δ:

\[
  \frac{|\varepsilon_\Omega(X_k) - \varepsilon_\Omega(X_{k+1})|}{|\varepsilon_\Omega(X_k)|} < \delta,
  \qquad
  \frac{|\varepsilon_{\Omega_C}(X_k) - \varepsilon_{\Omega_C}(X_{k+1})|}{|\varepsilon_{\Omega_C}(X_k)|} < \delta. \tag{4.4}
\]
In practice, we choose δ ≈ 10^{−4}, but adjustments can be necessary depending on the underlying data.
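For concreteness, two small Python helpers implementing the relative error and the stagnation check (4.4); names and the default δ are illustrative, not part of the paper's code.

    import numpy as np

    def relative_error(A_vals, X_vals):
        """Relative error on a sampling or control set, cf. Section 4.6."""
        return np.linalg.norm(A_vals - X_vals) / np.linalg.norm(A_vals)

    def stagnated(eps_prev, eps_curr, delta=1e-4):
        """Stagnation check (4.4): relative change of the error below delta."""
        return abs(eps_prev - eps_curr) / abs(eps_prev) < delta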

4.7. Scaling of the algorithm: computational complexity. If we sum up the computational complexity needed for one iteration of RTTC, we arrive at a total cost of

\[
  O(dnr^3 + d|\Omega| r^2) \tag{4.5}
\]
operations. Note that for reconstruction, the number of samples |Ω| should be O(dnr²), as it needs to be at least as large as the dimension of the manifold (3.5). Therefore, in realistic cases, the algorithm will have a complexity of O(d²nr⁴).

4.8. Convergence of the algorithm. The convergence analysis of RTTC follows the same steps as the analysis employed in the matrix [48] and the Tucker tensor case [26]. First, to fulfill the assumptions of standard convergence results for Riemannian line search methods, we have to safeguard the heuristic linearized line search with an Armijo-type backtracking scheme. Then, we can apply [1, Theorem 4.3.1] to obtain

Proposition 4.4. Let (X_k)_{k∈N} be an infinite sequence of iterates generated by RTTC with backtracking line search. Then, every accumulation point X_∗ of (X_k) is a critical point of the cost function f and hence satisfies grad f(X_∗) = 0, that is,

\[
  P_{T_{X_*}\mathcal M_r}(P_\Omega X_*) = P_{T_{X_*}\mathcal M_r}(P_\Omega A).
\]
This result shows us that in the tangent space, reconstruction is achieved for all limit points of RTTC. Unfortunately, as M_r is not closed, X_∗ is not necessarily an element of M_r anymore, and hence not a valid solution of problem (1.1). To enforce this, we regularize the cost function in such a way that the iterates X_k stay inside a compact subset of M_r. Analogous to [26], we define the modified cost function g as
\[
  g: \mathcal M_r \to \mathbb{R}, \quad X \mapsto f(X) + \lambda^2 \sum_{\mu=1}^{d} \Bigl( \|X^{<\mu>}\|_F^2 + \|(X^{<\mu>})^\dagger\|_F^2 \Bigr), \quad \lambda > 0, \tag{4.6}
\]
where (X^{<µ>})† denotes the pseudo-inverse of X^{<µ>} and λ is the regularization parameter. With this modification regularizing the singular values of the matricizations of X, we obtain the following result:

Proposition 4.5 (cf. [26, Prop. 3.2]). Let (X_k)_{k∈N} be an infinite sequence of iterates generated by Algorithm 1 but with the modified cost function g defined in (4.6). Then lim_{k→∞} ||grad g(X_k)|| = 0.

Proof. The cost function is guaranteed to decrease due to the line search, g(X_k) ≤ g(X_0) =: C_0², which yields λ² Σ_{µ=1}^{d} ||X_k^{<µ>}||_F² ≤ C_0² and λ² Σ_{µ=1}^{d} ||(X_k^{<µ>})†||_F² ≤ C_0². From this, we can deduce upper and lower bounds for the largest and smallest singular values of the matricizations,

\[
  \sigma_{\max}(X_k^{<\mu>}) \le \|X_k^{<\mu>}\| \le C_0 \lambda^{-1},
  \qquad
  \sigma_{\min}^{-1}(X_k^{<\mu>}) \le \|(X_k^{<\mu>})^\dagger\| \le C_0 \lambda^{-1}.
\]
Therefore, all iterates are guaranteed to stay inside the compact set

\[
  B := \bigl\{ X \in \mathcal M_r \;\big|\; \sigma_{\max}(X^{<\mu>}) \le C_0\lambda^{-1},\; \sigma_{\min}(X^{<\mu>}) \ge C_0^{-1}\lambda,\; \mu = 1, \ldots, d \bigr\}.
\]

Now, as B is compact, the sequence (X_k) has an accumulation point inside B and we can use Proposition 4.4 to obtain lim_{k→∞} ||grad g(X_k)|| = 0.

Note that the regularization parameter λ can be chosen arbitrarily small. If the core unfoldings are always of full rank during the optimization procedure (even when λ → 0), then the accumulation points X_∗ are guaranteed to stay inside M_r and grad f(X_∗) = P_{T_{X_∗}M_r}(P_Ω X_∗ − P_Ω A) → 0 as λ → 0. In this case, optimizing the modified cost function (4.6) is equivalent to optimizing the original cost function.

Asymptotic convergence rates are, unfortunately, not available for Riemannian conjugate gradient schemes. For steepest descent, linear convergence can be shown, with a constant depending on the eigenvalues of the Riemannian Hessian at the critical point; see [1, Theorem 4.5.6].

A very different approach which avoids the need for regularization has been discussed in a recent work by Schneider and Uschmajew [40], extending Riemannian optimization techniques to the algebraic variety M_{≤r} := {X ∈ R^{n×n} | rank(X) ≤ r},

the closure of M_r. Convergence results for the gradient scheme proposed therein follow from the Łojasiewicz inequality. Extension of this work to the tensor train format remains an open question.

4.9. Adaptive rank adjustment. There is an obvious disadvantage in the way the tensor completion problem (1.1) is defined: the rank of the underlying problem has to be known a priori. For high-dimensional TT tensors, this problem is much more severe than in the matrix case, as we not only have one unknown rank, but a whole rank tuple r = (r_0, r_1, ..., r_d). For real-world data, this information is usually not available and the ranks can be very different from each other; see e.g. [34, Table 3.2] for a common distribution of the ranks.

Instead, we can require that the solution should have rank smaller than a certain prescribed maximal rank. This maximal rank r_max is determined by the computational resources available and not necessarily by the underlying data. As the rank has a strong influence on the complexity of the algorithm, see Section 4.7, this motivates the following rank-adaptive procedure. First, we run RTTC with r^{(1)} = (1, 1, ..., 1) to obtain a result X_{(1)}. Then, we add a rank-1 correction term R and run RTTC again with starting guess X_{(1)} + R and r^{(2)} = rank(X_{(1)} + R) to obtain X_{(2)}. An optimal rank correction would be given by the minimizer of
\[
  \min_{R \in \mathbb{R}^{n_1\times\cdots\times n_d},\; \operatorname{rank}(R) = (1,\ldots,1)} f(X_{(1)} + R),
\]
which is not computationally feasible to compute exactly. Instead, we can choose R to be a rank-1 approximation of the negative Euclidean gradient −∇f(X_{(1)}), so that we perform one step of steepest descent to obtain X_{(2)}; see [47] for a discussion of such greedy rank-1 updates in the context of matrix completion.

In our case, the high dimensionality makes a global update very costly. Instead, we can perform a local update to only increase one rank index at a time, cycling through each r_µ, µ = 1, ..., d − 1. To modify the rank r_µ, we set
\[
  U_\mu^{\mathrm L} \leftarrow \begin{bmatrix} U_\mu^{\mathrm L} & R_\mu \end{bmatrix},
  \qquad
  U_{\mu+1}^{\mathrm R} \leftarrow \begin{bmatrix} U_{\mu+1}^{\mathrm R} \\ R_{\mu+1} \end{bmatrix}. \tag{4.7}
\]
When we choose R_µ ∈ R^{r_{µ−1}n_µ×1} and R_{µ+1} ∈ R^{1×n_{µ+1}r_{µ+1}} from a local projection of the approximated gradient, we obtain an AMEn-like procedure [13]. While very successful in the case of linear systems, we have found that choosing R_µ and R_{µ+1} as random vectors with small magnitude, say, 10^{−8}, performs just as well.

The resulting rank-adaptive algorithm is shown in Algorithm 3. This procedure is repeated until either the prescribed residual tolerance is fulfilled or r = r_max is reached. As the main goal of the first steps is to steer the iterates into the right direction, only few steps of RTTC are required at each level. We make use of the stagnation criterion (4.4) with a crude tolerance of, say, δ = 0.01. With this choice, RTTC usually needs less than 10-15 iteration steps at each level. At the final rank, a finer tolerance is used to allow for more iterations. In the worst-case scenario, this leads to (d − 1) r_max executions of RTTC. To limit the occurrences of unnecessary rank increases, it can be helpful to revert the rank in the current mode if the increase did not improve on ε_{Ω_C}. To detect such cases, we measure the relative change from ε_{Ω_C}(X) to ε_{Ω_C}(X_new). If it did improve, we always accept the step; additionally, the parameter ρ ≥ 0 allows for slight increases of the error in one rank-increasing step, with the hope that it will decrease in later steps. In the numerical experiments, we chose ρ = 1.

Algorithm 3 Adaptive rank adjustment

Input: Sampled data P_Ω A, maximal rank r_max, acceptance parameter ρ
Output: Completed tensor X of TT-rank r, r_µ ≤ r_max,µ.
  X ← random tensor with r := rank_TT(X) = (1, 1, ..., 1)
  X ← Result of RTTC using X as starting guess.
  for k = 2, ..., r_max do
    for µ = 1, ..., d − 1 do
      if r_µ ≥ r_max,µ then
        µth rank is maximal. Skip.
      else
        X_new ← Increase µth rank of X to (1, r_1, ..., r_µ + 1, ..., r_{d−1}, 1) using (4.7)
        X_new ← Result of Algorithm 1 using X_new as starting guess.
        if (ε_{Ω_C}(X_new) − ε_{Ω_C}(X)) > ρ ε_{Ω_C}(X) then
          Revert step.
        else
          X ← X_new    % accept step
        end if
      end if
    end for
  end for
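The local rank increase (4.7) itself is a cheap padding of two neighboring cores. A possible Python sketch with random perturbation vectors of small magnitude, as suggested above (function name ours, not part of TTeMPS):

    import numpy as np

    def increase_rank(cores, mu, eps=1e-8, rng=None):
        """Increase the TT rank r_mu between cores mu and mu+1 by one, cf. (4.7)."""
        rng = np.random.default_rng() if rng is None else rng
        r0, n0, _ = cores[mu].shape
        # append a random column R_mu to the left unfolding of core mu ...
        col = eps * rng.standard_normal((r0, n0, 1))
        cores[mu] = np.concatenate([cores[mu], col], axis=2)
        # ... and a random row R_{mu+1} to the right unfolding of core mu+1
        row = eps * rng.standard_normal((1,) + cores[mu + 1].shape[1:])
        cores[mu + 1] = np.concatenate([cores[mu + 1], row], axis=0)
        return cores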

As a side remark, we mention that a very different approach to introduce rank-adaptivity could try to replace the retraction step by a rounding procedure which does not truncate to a fixed rank, but adjusts the rank according to some prescribed accuracy. As the step X_k + αη_k has rank up to 2r, see Remark 1, this would make it possible to double the rank in each step. We have performed a few tests in this direction, but could not achieve good performance.

5. Numerical experiments. In the following section we investigate the performance of the proposed tensor completion algorithm and compare it to existing methods, namely HTOpt [10, 11], ADF [17] and a simple alternating optimization scheme, see Section 5.2. To quantify the reconstruction quality of the algorithms, we measure the relative error on a test set Γ different from Ω and Ω_C. All sampling sets are created by randomly sampling each index i_µ from a uniform distribution on {1, ..., n_µ}. We sample more entries than necessary, such that after removing duplicates we end up with a sampling set of the desired size.

As mentioned in the introduction, cross approximation techniques can be used if the sampling set Ω is not prescribed, but can be chosen freely. This is often the case when dealing with the approximation of high-dimensional functions and solutions of parameter-dependent PDEs. To show the effectiveness of tensor completion in these applications, we compare RTTC to state-of-the-art cross-approximation algorithms based on the TT [14] and HT format [4, 20].

We have implemented RTTC in Matlab based on the TTeMPS toolbox, see http://anchp.epfl.ch/TTeMPS. As shown in Section 4, the calculation of the cost function, the gradient and the linearized line search involve operations on sparse tensors. To obtain an efficient algorithm, we have implemented these crucial steps in C using the Mex-function capabilities of Matlab with direct calls to BLAS routines wherever possible.

For example, when evaluating the cost function, see Section 4.1, permuting the storage of the cores of size r_{µ−1} × n_µ × r_µ to arrays of size r_{µ−1} × r_µ × n_µ before starting the sampling procedure allows us to continuously align the matrix blocks in memory for fast calls to the BLAS routine DGEMV. While the computational complexity, Section 4.7, suggests that most parts of the algorithm are equally expensive, the majority of the wall-time is spent on the calculation of the gradient, see Section 4.2, as it involves |Ω| operations on single indices. Inevitably, this leads to unaligned memory accesses which are a bottleneck on today's computers. This can be partly avoided by an embarrassingly parallel distribution of the loop over all indices in the sampling set Ω, but this is out of the scope of this article. All timings have been conducted on an Intel Xeon E31225, 3.10GHz quad-core processor with 8 GB of memory, running Matlab version 2012b.

5.1. Scaling of the algorithm. As a first experiment, we check the computational complexity of RTTC derived in Section 4.7. In Figure 5.1, we show three plots corresponding to the scaling with respect to (left) the tensor size n, (middle) the number of dimensions d and (right) the tensor rank r. In all cases, the original data is a TT tensor of known rank r whose cores consist of entries uniformly, randomly chosen from [0, 1]. In the first case (left), we have chosen d = 10, r = (1, 10, ..., 10, 1), and mode size n ∈ {20, 40, ..., 200}. In (middle), we have set the mode size to n = 100, r = (1, 10, ..., 10, 1), and dimensions ranging from d = 3 to d = 20. Finally, for (right) we set d = 10, n = 100 and r = (1, r, ..., r, 1) with r ∈ {2, ..., 25}. For each configuration, we run 10 steps of RTTC with 5 repetitions. The sampling set size |Ω| is chosen such that it scales like O(nr²). The lines are linear fits for n and d and a quartic fit for the tensor rank r. We observe that RTTC indeed scales linearly with n and d and to fourth order in r (due to the scaling of |Ω|), in accordance with (4.5). Note that if we scaled |Ω| like O(dnr²), then the timing would depend quadratically on d and not linearly.

[Figure 5.1: three panels plotting runtime (Time [s]) against mode size n, number of dimensions d, and tensor rank r.]

Fig. 5.1: Scaling of RTTC with respect to tensor size n, number of dimensions d and tensor rank r. In the rightmost figure, in log-log scale, the line corresponds to a fourth order dependence on the tensor rank. See Section 5.1 for details.

5.2. Comparison to an alternating linear scheme. To evaluate the performance of our proposed algorithm, we have implemented a simple alternating linear scheme (ALS) algorithm to approach the tensor completion problem (1.1). This approach was suggested by Grasedyck, Kluge and Krämer [18]. In the TT format, algorithms based on ALS-type optimization are a simple but surprisingly effective way to solve an optimization problem. In each step, all cores but one are kept fixed and the optimization problem is reduced to a small optimization problem for a single core. The core is replaced by its locally optimal solution and fixed, while the optimization scheme moves to the neighboring core. This approach has been successfully used to solve linear systems and eigenvalue problems in the TT format, see e.g. [12, 21, 38].

[Figure 5.2: relative error on Γ for ALS and RTTC with r = 4, 8, 12, plotted against the number of iterations (left) and time (right).]

Fig. 5.2: Comparison between the convergence of RTTC and ALS. The underlying tensor is a random TT tensor of size n = 50, d = 10 and known rank r ∈ {4, 8, 12}. Left: relative error with respect to the number of iterations. One iteration of ALS corresponds to one half-sweep. Right: relative error with respect to time.

For the tensor completion problem (1.1), one single core optimization step of the ALS scheme is given by

\[
  \begin{aligned}
  \widetilde U_\mu &= \operatorname*{argmin}_{V}\; \frac12 \sum_{i\in\Omega} \Bigl( A(i_1,\ldots,i_d) - U_1(i_1)\cdots U_{\mu-1}(i_{\mu-1})\, V(i_\mu)\, U_{\mu+1}(i_{\mu+1}) \cdots U_d(i_d) \Bigr)^2 \\
  &= \operatorname*{argmin}_{V}\; \frac12 \sum_{i\in\Omega} \Bigl( A(i_1,\ldots,i_d) - X_{\le\mu-1}(i_1,\ldots,i_{\mu-1})\, V(i_\mu)\, X_{\ge\mu+1}(i_{\mu+1},\ldots,i_d)^T \Bigr)^2.
  \end{aligned}
\]
Cycling through all cores U_µ for µ = 1, ..., d multiple times until convergence yields the full ALS completion scheme. To solve one core optimization step efficiently, we look at the i_µth slice of the new core Ũ_µ, making use of notation (4.2):
\[
  \begin{aligned}
  \widetilde U_\mu(i_\mu) &= \operatorname*{argmin}_{V}\; \frac12 \sum_{j\in\Theta_{i_\mu}} \Bigl( A(j_1,\ldots,j_d) - X_{\le\mu-1}(j_1,\ldots,j_{\mu-1})\, V(j_\mu)\, X_{\ge\mu+1}(j_{\mu+1},\ldots,j_d)^T \Bigr)^2 \\
  &= \operatorname*{argmin}_{V}\; \frac12 \sum_{j\in\Theta_{i_\mu}} \Bigl[ A(j_1,\ldots,j_d) - \bigl( X_{\ge\mu+1}(j_{\mu+1},\ldots,j_d) \otimes X_{\le\mu-1}(j_1,\ldots,j_{\mu-1}) \bigr) \operatorname{vec}(V(j_\mu)) \Bigr]^2,
  \end{aligned}
\]
which is a linear least squares problem for vec(V(i_µ)) with a system matrix of size |Θ_{i_µ}| × r_{µ−1} r_µ. As Σ_{i_µ=1}^{n_µ} |Θ_{i_µ}| = |Ω|, solving the least squares problems to obtain the new core Ũ_µ can be performed in O(|Ω| r⁴) operations. Updating each core

µ = 1, ..., d once (one so-called half-sweep of the ALS procedure) then results in a total cost of O(d|Ω|r⁴) operations. Compared to the complexity of RTTC, this ALS procedure involves the fourth power of the rank instead of the square. In Figure 5.2, we compare the convergence of the two approaches as a function of iterations and time when reconstructing a random TT tensor of size n = 50, d = 10 and known rank r chosen from {4, 8, 12}. While the iteration count is similar, the ALS approach takes around an order of magnitude more time to converge, slowing down considerably as the rank of the underlying data increases. To make sure that the implementation is similarly optimized as RTTC, the setup of the least squares system matrices is performed in optimized Mex-functions, written in C.
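A possible Python sketch of one slice update of this ALS scheme, illustrating only the least squares step (the paper's implementation assembles these systems in compiled Mex routines; names are ours):

    import numpy as np

    def als_update_slice(X_left, X_right, A_vals):
        """Solve the slice least squares problem for V(i_mu).

        X_left:  rows X_{<=mu-1}(j_1,...,j_{mu-1}) for all j in Theta_{i_mu}, shape (m, r_left);
        X_right: rows X_{>=mu+1}(j_{mu+1},...,j_d) for the same samples, shape (m, r_right);
        A_vals:  corresponding entries A(j_1,...,j_d), shape (m,)."""
        m, r_left = X_left.shape
        r_right = X_right.shape[1]
        # row s of the system matrix is the Kronecker product X_right[s] (x) X_left[s]
        system = np.einsum('si,sj->sij', X_right, X_left).reshape(m, r_right * r_left)
        v, *_ = np.linalg.lstsq(system, A_vals, rcond=None)
        # v is the column-major vectorization of the slice V(i_mu)
        return v.reshape(r_right, r_left).T      # shape (r_left, r_right)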

[Figure 5.3: (a) phase plots for d = 5 (left: RTTC, right: ALS); (b) phase plots for d = 10 (left: RTTC, right: ALS). Axes: tensor size n versus size of the sampling set |Ω|.]

Fig. 5.3: Phase plots for RTTC and ALS for (a) d = 5 and (b) d = 10, showing the ability to recover the original data as a function of the mode size n and the size of the sampling set Ω. The underlying tensor is a random TT tensor of known rank r = (1, 3,..., 3, 1). White means convergent in all 5 runs, black means convergent in no runs. The white dashed curve corresponds to 20n log(n) for d = 5 and 80n log(n) for d = 10.

5.3. Reconstruction of random tensors. In Figure 5.3, phase plots are depicted, showing the ability of RTTC and the ALS procedure to recover a tensor as a function of the number of known values of the original tensor. The results are shown for two different numbers of dimensions, d = 5 and d = 10. We run both algorithms 5

times for each combination of tensor size n and sampling set size |Ω|. The brightness represents how many times the iteration converged, where white means 5 out of 5 converged and black corresponds to no convergence. We call the iteration convergent if the relative error on the test set Γ is smaller than 10^{−4} after at most 250 iterations. Note that in almost all convergent cases, the algorithms have reached this prescribed accuracy within less than 100 steps. The original data A is constructed as a TT tensor of rank r = (1, 3, ..., 3, 1) with the entries of each core being uniform random samples from [0, 1].

The question of how many samples are required to recover the underlying data has been a very active topic in the last years. For the matrix case (d = 2), it has been shown that around O(nr log(n)) samples are necessary [9, 23], where n is the matrix size and r the rank of the underlying data. Up to this date, these results could not be extended to the tensor case; see e.g. [37] for a discussion. Some approaches, such as [32], were able to prove that O(r^{⌊d/2⌋} n^{⌈d/2⌉}) samples suffice, but numerical experiments, such as in [25] for d = 3, clearly suggest a much better scaling behaviour similar to the matrix case. In Figure 5.3 we have included a white dashed curve corresponding to 20n log(n) for d = 5 and 80n log(n) for d = 10.

5.4. Interpolation of high-dimensional functions. In the next section, we will investigate the application of tensor completion to discretizations of high-dimensional functions. The random tensors considered in the previous Section 5.3 are constructed to be of a certain well-defined rank. For discretizations of function data, we can, in general, only hope for a sufficiently fast decay of the singular values in each matricization, resulting in good approximability by a low-rank tensor. As the rank is not known a priori, we will employ the rank-increasing strategy presented in Section 4.9, Algorithm 3, for all following experiments, both for RTTC and ALS.

5.4.1. Comparison to ADF. We compare the accuracy and timings of RTTC and the ALS procedure to an alternating direction fitting (ADF) algorithm by Grasedyck et al. [17] for different values of the maximal rank. We reconstruct the setup in [17, Section 4.2] for RTTC and the ALS scheme to compare against the therein stated results. The original data tensor of dimension d = 8 and mode sizes n = 20 is constructed by
\[
  A(i_1, i_2, \ldots, i_d) = \frac{1}{\sqrt{\sum_{\mu=1}^{d} i_\mu^2}}.
\]
Both sampling set Ω and test set Γ are chosen to be of size |Ω| = |Γ| = 10dnr². The results are shown in Table 5.1 for different values of the maximal rank r := max_µ r_µ. We can see that the reconstruction quality is similar for all algorithms, but the ADF scheme, also implemented in Matlab, suffers from exceedingly long computation times. A large part of these timing differences may be attributed to the fact that it does not use optimized Mex routines for the sampling. Therefore, we focus mostly on the achieved reconstruction error and only include the timings as a reference. For the ALS procedure, the quartic instead of quadratic scaling of the complexity with respect to the rank r becomes noticeable.

5.4.2. Comparison to HTOpt. We now compare the reconstruction performance to a Riemannian optimization scheme on the manifold of tensors in the hierarchical Tucker format, HTOpt [10, 11].

        |           ADF            |          RTTC            |           ALS
    r   | ε_Γ(X_end)      time     | ε_Γ(X_end)      time     | ε_Γ(X_end)      time
    2   | 1.94 · 10^-2    14 s     | 6.42 · 10^-3    11.1 s   | 7.55 · 10^-3    5.90 s
    3   | 2.05 · 10^-3    1.3 min  | 5.60 · 10^-3    25.4 s   | 5.63 · 10^-3    19.1 s
    4   | 1.72 · 10^-3    24 min   | 1.73 · 10^-3    45.7 s   | 1.50 · 10^-3    46.5 s
    5   | 1.49 · 10^-3    39 min   | 1.53 · 10^-3    1.3 min  | 9.49 · 10^-4    1.7 min
    6   | 2.25 · 10^-3    1.7 h    | 7.95 · 10^-4    2.1 min  | 1.09 · 10^-3    3.3 min
    7   | 8.53 · 10^-4    2.6 h    | 3.49 · 10^-4    3.3 min  | 3.92 · 10^-4    6.4 min

Table 5.1: Comparison of the reconstruction accuracy of RTTC to an ADF completion algorithm [17] for different values of the maximal rank. The underlying data is a discretized function resulting in a tensor of dimension d = 8 and mode size n = 20, see Section 5.4.1. The results and timings for ADF are taken from [17].

              |        HTOpt          |          ADF          |         RTTC          |          ALS
    |Ω|/n^d   | ε_Γ(X_end)    time    | ε_Γ(X_end)    time    | ε_Γ(X_end)    time    | ε_Γ(X_end)    time
    0.001     | 1.15 · 10^0   28.4 s  | 6.74 · 10^-2  53.0 s  | 9.17 · 10^-2  4.99 s  | —             —
    0.005     | 1.67 · 10^0   28.8 s  | 6.07 · 10^-3  3.2 m   | 6.79 · 10^-3  5.11 s  | 2.19 · 10^-2  3.87 s
    0.01      | 1.99 · 10^0   29.7 s  | 3.08 · 10^-2  6.4 m   | 1.19 · 10^-3  5.72 s  | 1.59 · 10^-2  4.69 s
    0.05      | 3.68 · 10^-3  29.6 s  | 1.95 · 10^-5  18 m    | 2.13 · 10^-4  6.42 s  | 5.64 · 10^-4  6.94 s
    0.1       | 1.99 · 10^-3  29.3 s  | 6.50 · 10^-5  27 m    | 2.29 · 10^-5  7.58 s  | 2.09 · 10^-5  9.95 s

Table 5.2: Comparison of the reconstruction accuracy of RTTC to ALS, ADF and a Riemannian optimization scheme using the Hierarchical Tucker format (HTOpt) for different sizes of the sampling set. The underlying data is a discretized function resulting in a tensor of dimension d = 4 and mode size n = 20, see Section 5.4.2.

This algorithm is conceptually very similar to RTTC and the Riemannian tensor completion algorithm in the Tucker format, GeomCG [26], but uses a different tensor format and a Gauss-Newton instead of a conjugate gradient scheme. We use the most recent HTOpt version 1.1, which is available from the authors. This implementation is optimized for very high sampling rates, where up to 50% of the original data is given. Motivated by this setting, the involved matrices and tensors cannot be considered sparse anymore and are treated by dense linear algebra. On the other hand, this approach limits the applicability of the algorithm to relatively low-dimensional tensors, say d = 4, 5. To make a fair comparison, we have adapted the example synthetic.m in HTOpt, which also aims to reconstruct a tensor originating from the discretization of a high-dimensional function, with the parameters left at their default values. As this code works with a different tensor format, prescribed tensor ranks are in general not directly comparable. To circumvent this, we have chosen d = 4, as in this case the TT-rank r = (1, r_1, r_2, r_3, 1) directly corresponds to an HT dimension tree with internal rank r_2 and leaf ranks r_1 and r_3. We discretize the function
\[
  f: [0, 1]^4 \to \mathbb{R}, \qquad f(x) = \exp(-\|x\|),
\]
using n = 20 equally spaced discretization points on [0, 1] in each mode. The maximum rank is set to r = (1, 5, 5, 5, 1).

In Table 5.2 we present the results for different sizes of the sampling set Ω and the corresponding relative error on the test set Γ, |Γ| = 100. Sampling and test sets are chosen identically for all algorithms. Since this example is not covered in [17], we now

use the reference implementation of ADF provided by the authors. We observe that the proposed RTTC is the fastest and yields better reconstruction than HTOpt. For sampling ratios 0.001, 0.005, 0.01, HTOpt fails to recover the original data within the specified maximum number of iterations, 250. For the smallest sampling ratio 0.001, the ALS procedure does not have enough samples available to solve the least squares problems. Altogether, RTTC, ALS and ADF perform very similarly when comparing the reconstruction quality. We assume that the reason for the worse performance of HTOpt is the missing rank-adaptivity. For high sampling ratios, the use of a rank adaptation scheme such as the one described in Section 4.9 is less important, as overfitting is very unlikely to occur.

5.4.3. Parameter-dependent PDE: cookie problem. As an example of a parameter-dependent PDE, we investigate the cookie problem [44, 4]: consider the heat equation on the unit square D = [0, 1]² with d = m² disks D_{s,t} of radius ρ = 1/(4m+2) and midpoints (ρ(4s − 1), ρ(4t − 1)) for s, t = 1, ..., m. A graphical depiction of this setup is shown in Figure 5.4 for the cases m = 3 and m = 4. The heat conductivities of the different disks D_{s,t} are described by the coefficient vector p = (p_1, ..., p_d), yielding the piecewise constant diffusion coefficient

\[
  a(x,p) :=
  \begin{cases}
    p_\mu, & \text{if } x \in D_{s,t}, \quad \mu = m(t-1) + s,\\
    1, & \text{otherwise},
  \end{cases}
\]

with each pµ ∈ [1/2, 2]. The temperature u(x, p) at position x ∈ D for a certain choice of parameters p ∈ [1/2, 2]^d is then described by the diffusion equation


Fig. 5.4: Graphical depiction of the cookie problem for values m = 3 and m = 4 resulting in 9 and 16 cookies, respectively.

\[
  \begin{aligned}
    -\operatorname{div}\bigl(a(x,p)\,\nabla u(x,p)\bigr) &= 1, & x &\in D,\\
    u(x,p) &= 0, & x &\in \partial D.
  \end{aligned}
  \tag{5.1}
\]

Assume now that we are interested in the average temperature u(p) : [1/2, 2]^d → R over the domain D,

\[
  u(p) := \int_{[0,1]^2} u(x,p)\,\mathrm{d}x.
\]

Following the setup of [4], we discretize the parameter space [1/2, 2]^d by spectral collocation, using n = 10 Chebyshev points for each mode µ = 1, . . . , d.
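For reference, the geometry and the collocation grid described above can be summarized in a few lines of Python. The following sketch is our own illustration of the formulas, not the code used in the paper; in particular, the choice of Chebyshev nodes of the first kind is an assumption, as the precise node family is not specified here.

    import numpy as np

    def cookie_coefficient(x, p, m):
        """Piecewise constant coefficient a(x, p): returns p[mu-1] if x lies in the
        disk D_{s,t} with mu = m*(t-1) + s, and 1 otherwise."""
        rho = 1.0 / (4 * m + 2)
        for t in range(1, m + 1):
            for s in range(1, m + 1):
                center = np.array([rho * (4 * s - 1), rho * (4 * t - 1)])
                if np.linalg.norm(np.asarray(x, dtype=float) - center) <= rho:
                    return p[m * (t - 1) + s - 1]
        return 1.0

    def chebyshev_points(n, a=0.5, b=2.0):
        """n Chebyshev nodes (first kind, an assumption) mapped from [-1, 1] to [a, b]."""
        k = np.arange(n)
        return 0.5 * (a + b) + 0.5 * (b - a) * np.cos((2 * k + 1) * np.pi / (2 * n))

    m, n = 3, 10                    # 9 cookies, n = 10 collocation points per mode
    grid_1d = chebyshev_points(n)   # the same one-dimensional grid for every mode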

          TT-Cross                HT-Cross (from [4])      RTTC
 Tol.     εΓ(X)         Eval.    εΓ(X)         Eval.      εΓ(X)         |Ω|
 10^-3    8.35 · 10^-4  11681    3.27 · 10^-4   1548      9.93 · 10^-5   1548
 10^-4    2.21 · 10^-5  14631    1.03 · 10^-4   2784      8.30 · 10^-6   2784
 10^-5    1.05 · 10^-5  36291    1.48 · 10^-5   3224      6.26 · 10^-6   3224
 10^-6    1.00 · 10^-6  42561    2.74 · 10^-6   5338      6.50 · 10^-7   5338
 10^-7    1.31 · 10^-7  77731    1.88 · 10^-7   9475      1.64 · 10^-7   9475

(a) 9 cookies (m = 3)

          TT-Cross                HT-Cross (from [4])      RTTC
 Tol.     εΓ(X)         Eval.    εΓ(X)         Eval.      εΓ(X)         |Ω|
 10^-3    8.17 · 10^-4   22951   3.98 · 10^-4   2959      2.84 · 10^-4   2959
 10^-4    3.93 · 10^-5   68121   2.81 · 10^-4   5261      2.10 · 10^-5   5261
 10^-5    9.97 · 10^-6   79961   1.27 · 10^-5   8320      1.07 · 10^-5   8320
 10^-6    1.89 · 10^-6  216391   3.75 · 10^-6  12736      1.89 · 10^-6  12736
 10^-7        —             —    3.12 · 10^-7  26010      7.12 · 10^-7  26010

(b) 16 cookies (m = 4)

Table 5.3: Reconstruction results for the cookie problem using cross-approximation and tensor completion. The last entry in Table (b) for TT-Cross has been omitted due to the excessively high number of function evaluations required. The results for HT-Cross have been taken from the corresponding paper [4].

Hence, calculating the average of the solution of (5.1) for each collocation point in the parameter space would result in a solution tensor A of size 10^d. As this represents an infeasible amount of work, we instead try to approximate the solution tensor by a low-rank approximation X. Cross-approximation techniques are a common tool for tackling such problems. In Table 5.3 we compare the results obtained by a recent algorithm in the hierarchical Tucker format [4], denoted by HT-Cross, with the most recent cross-approximation technique in the TT format [39, 14], denoted by TT-Cross: the routine amen_cross from the TT-Toolbox [35]. The results for HT-Cross are taken from the corresponding paper [4]. The specified tolerance is shown along with the relative reconstruction error on a test set Γ of size 100 and the number of sampling points. Both algorithms are able to construct a low-rank approximation X up to the prescribed accuracy, but TT-Cross needs about 10 times more sampling points. This could be due to an additional search along the modes of the solution tensor A, as the mode size in our example is only n = 10. At each sampling point, the PDE (5.1) is solved for the corresponding parameter set p = (p1, . . . , pd) using the FEniCS finite element package [29], version 1.4.0.

To compare the performance of these two cross-approximation algorithms with our tensor completion approach, we complete A using the same number of sampling points as the HT-Cross algorithm, but distributed uniformly at random. The test set Γ is the same as for TT-Cross. RTTC is able to recover A with similar accuracy. We observe that, for this problem, an adaptive choice of sampling points does not seem to perform better than a random sampling approach combined with tensor completion. The tensor completion approach has the added advantage that all sampling points are determined a priori. Thus, the expensive evaluations at the sampling points can easily be distributed to multiple computers in an embarrassingly parallel way before the completion procedure is started. In adaptive cross-approximation, the necessary sampling points are only determined at run time, which prevents effective parallelization.
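To illustrate this point, a minimal Python sketch of such a pre-computation is given below. It is our own illustration and not part of the paper's experiments; in particular, solve_cookie_pde is a hypothetical placeholder for a PDE solver of (5.1) that returns the average temperature for a single parameter vector.

    from multiprocessing import Pool

    def solve_cookie_pde(p):
        # Hypothetical placeholder: solve (5.1) for the parameter vector p, e.g. with
        # FEniCS, and return the average temperature u(p) over the unit square.
        raise NotImplementedError("plug in the actual PDE solver here")

    def evaluate_samples(parameter_points, n_workers=8):
        """Evaluate all a-priori chosen sampling points independently in parallel."""
        with Pool(n_workers) as pool:
            return pool.map(solve_cookie_pde, parameter_points)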

6. Conclusion. In this paper, we have derived a tensor completion algorithm called RTTC which performs Riemannian optimization on the manifold of TT tensors of fixed rank. We have shown that our algorithm is very competitive with, and significantly faster than, other existing tensor completion algorithms. For function-related tensors, a rank-increasing scheme was shown to be essential for recovering the original data from very few samples. In our test case, tensor completion with randomly chosen samples yielded reconstruction errors similar to those of an adaptive cross-approximation approach with the same number of samples.

Theoretical lower bounds for the recoverability of low-rank tensors from few samples remain a challenging open question.

Acknowledgments. The author would like to thank Jonas Ballani, Daniel Kressner and Bart Vandereycken for helpful discussions on this paper, and Melanie Kluge and Sebastian Krämer for providing an implementation of the ADF algorithm.

REFERENCES

[1] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds, Princeton University Press, Princeton, NJ, 2008.
[2] P.-A. Absil and J. Malick, Projection-like retractions on matrix manifolds, SIAM J. Control Optim., 22 (2012), pp. 135–158.
[3] P.-A. Absil and I. Oseledets, Low-rank retractions: a survey and new results, Computational Optimization and Applications, (2014).
[4] J. Ballani and L. Grasedyck, Hierarchical tensor approximation of output quantities of parameter-dependent PDEs, tech. report, ANCHP, EPF Lausanne, Switzerland, March 2014.
[5] J. Ballani, L. Grasedyck, and M. Kluge, Black box approximation of tensors in hierarchical Tucker format, Linear Algebra Appl., 438 (2013), pp. 639–657.
[6] M. Bebendorf, Approximation of boundary element matrices, Numer. Math., 86 (2000), pp. 565–589.
[7] M. Bebendorf and S. Rjasanow, Adaptive low-rank approximation of collocation matrices, Computing, 70 (2003), pp. 1–24.
[8] S. Börm and L. Grasedyck, Hybrid cross approximation of integral operators, Numer. Math., 101 (2005), pp. 221–249.
[9] E. J. Candès and T. Tao, The power of convex relaxation: Near-optimal matrix completion, IEEE Trans. Inform. Theory, 56 (2009), pp. 2053–2080.
[10] C. Da Silva and F. J. Herrmann, Hierarchical Tucker tensor optimization – applications to tensor completion, SAMPTA, 2013.
[11] C. Da Silva and F. J. Herrmann, Hierarchical Tucker tensor optimization – applications to tensor completion, Linear Algebra Appl., 481 (2015), pp. 131–173.
[12] S. V. Dolgov and I. V. Oseledets, Solution of linear systems and matrix inversion in the TT-format, SIAM J. Sci. Comput., 34 (2012), pp. A2718–A2739.
[13] S. V. Dolgov and D. V. Savostyanov, Alternating minimal energy methods for linear systems in higher dimensions, SIAM J. Sci. Comput., 36 (2014), pp. A2248–A2271.
[14] S. V. Dolgov and D. V. Savostyanov, Alternating minimal energy methods for linear systems in higher dimensions, SIAM J. Sci. Comput., 36 (2014), pp. A2248–A2271.
[15] S. Gandy, B. Recht, and I. Yamada, Tensor completion and low-n-rank tensor recovery via convex optimization, Inverse Problems, 27 (2011), p. 025010.
[16] S. A. Goreinov, E. E. Tyrtyshnikov, and N. L. Zamarashkin, A theory of pseudoskeleton approximations, Linear Algebra Appl., 261 (1997), pp. 1–21.
[17] L. Grasedyck, M. Kluge, and S. Krämer, Alternating directions fitting (ADF) of hierarchical low rank tensors, DFG SPP 1324 Preprint 149, 2013.
[18] L. Grasedyck, M. Kluge, and S. Krämer, Personal communication, 2015.
[19] L. Grasedyck, D. Kressner, and C. Tobler, A literature survey of low-rank tensor approximation techniques, GAMM-Mitt., 36 (2013), pp. 53–78.
[20] L. Grasedyck, R. Kriemann, C. Löbbert, A. Nägel, G. Wittum, and K. Xylouris, Parallel tensor sampling in the hierarchical Tucker format, Comput. Vis. Sci., 17 (2015), pp. 67–78.
[21] S. Holtz, T. Rohwedder, and R. Schneider, The alternating linear scheme for tensor optimization in the tensor train format, SIAM J. Sci. Comput., 34 (2012), pp. A683–A713.
[22] S. Holtz, T. Rohwedder, and R. Schneider, On manifolds of tensors of fixed TT-rank, Numer. Math., 120 (2012), pp. 701–731.
[23] R. H. Keshavan, A. Montanari, and S. Oh, Matrix completion from noisy entries, JMLR, 11 (2010), pp. 2057–2078.
[24] B. N. Khoromskij, I. V. Oseledets, and R. Schneider, Efficient time-stepping scheme for dynamics on TT-manifolds, Tech. Report 24, MPI MIS Leipzig, 2012.
[25] D. Kressner, M. Steinlechner, and A. Uschmajew, Low-rank tensor methods with subspace correction for symmetric eigenvalue problems, SIAM J. Sci. Comput., 36 (2014), pp. A2346–A2368.
[26] D. Kressner, M. Steinlechner, and B. Vandereycken, Low-rank tensor completion by Riemannian optimization, BIT, 54 (2014), pp. 447–468.
[27] J. Liu, P. Musialski, P. Wonka, and J. Ye, Tensor completion for estimating missing values in visual data, in Computer Vision, 2009 IEEE 12th International Conference on, 2009, pp. 2114–2121.
[28] Y. Liu and F. Shang, An efficient matrix factorization method for tensor completion, IEEE Signal Processing Letters, 20 (2013), pp. 307–310.
[29] A. Logg, K.-A. Mardal, and G. N. Wells, Automated Solution of Differential Equations by the Finite Element Method, vol. 84 of Lecture Notes in Computational Science and Engineering, Springer-Verlag, Berlin, 2012.
[30] C. Lubich, I. Oseledets, and B. Vandereycken, Time integration of tensor trains, arXiv preprint 1407.2042, 2014.
[31] Y. Ma, J. Wright, A. Ganesh, Z. Zhou, K. Min, S. Rao, Z. Lin, Y. Peng, M. Chen, L. Wu, E. Candès, and X. Li, Low-rank matrix recovery and completion via convex optimization. Survey website, http://perception.csl.illinois.edu/matrix-rank/. Accessed: 22 April 2013.
[32] C. Mu, B. Huang, J. Wright, and D. Goldfarb, Square deal: Lower bounds and improved relaxations for tensor recovery, in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 73–81.
[33] J. Nocedal and S. J. Wright, Numerical Optimization, Springer Series in Operations Research, Springer, 2nd ed., 2006.
[34] I. V. Oseledets, Tensor-train decomposition, SIAM J. Sci. Comput., 33 (2011), pp. 2295–2317.
[35] I. V. Oseledets, S. Dolgov, D. Savostyanov, and V. Kazeev, TT-Toolbox Version 2.3, 2014. Available at https://github.com/oseledets/TT-Toolbox.
[36] I. V. Oseledets and E. E. Tyrtyshnikov, TT-cross approximation for multidimensional arrays, Linear Algebra Appl., 432 (2010), pp. 70–88.
[37] H. Rauhut, R. Schneider, and Z. Stojanac, Tensor completion in hierarchical tensor representations, arXiv preprint 1404.3905, 2014.
[38] T. Rohwedder and A. Uschmajew, On local convergence of alternating schemes for optimization of convex problems in the tensor train format, SIAM J. Numer. Anal., 51 (2013), pp. 1134–1162.
[39] D. V. Savostyanov and I. V. Oseledets, Fast adaptive interpolation of multi-dimensional arrays in tensor train format, in Proceedings of the 7th International Workshop on Multidimensional Systems (nDS), IEEE, 2011.
[40] R. Schneider and A. Uschmajew, Convergence results for projected line-search methods on varieties of low-rank matrices via Łojasiewicz inequality, SIAM J. Optim., 25 (2015), pp. 622–646.
[41] M. Signoretto, L. De Lathauwer, and J. A. K. Suykens, Nuclear norms for tensors and their use for convex multilinear estimation, Tech. Report 10-186, K. U. Leuven, 2010.
[42] M. Signoretto, Q. Tran Dinh, L. De Lathauwer, and J. A. K. Suykens, Learning with tensors: a framework based on convex optimization and spectral regularization, Machine Learning, 94 (2014), pp. 303–351.
[43] M. Signoretto, R. Van de Plas, B. De Moor, and J. A. K. Suykens, Tensor versus matrix completion: A comparison with application to spectral data, IEEE Signal Processing Letters, 18 (2011), pp. 403–406.
[44] C. Tobler, Low-rank Tensor Methods for Linear Systems and Eigenvalue Problems, PhD thesis, ETH Zurich, Switzerland, 2012.
[45] E. E. Tyrtyshnikov, Incomplete cross approximation in the mosaic-skeleton method, Computing, 64 (2000), pp. 367–380. International GAMM-Workshop on Multigrid Methods (Bonn, 1998).
[46] A. Uschmajew and B. Vandereycken, The geometry of algorithms using hierarchical tensors, Linear Algebra Appl., 439 (2013), pp. 133–166.
[47] A. Uschmajew and B. Vandereycken, Greedy rank updates combined with Riemannian descent methods for low-rank optimization, in Sampling Theory and Applications (SampTA), 2015 International Conference on, IEEE, 2015, pp. 420–424.
[48] B. Vandereycken, Low-rank matrix completion by Riemannian optimization, SIAM J. Optim., 23 (2013), pp. 1214–1236.