Tensors: An Adaptive Approximation Algorithm, Convergence in Direction, and Connectedness Properties

A dissertation presented to the faculty of the College of Arts and Sciences of Ohio University

In partial fulfillment of the requirements for the degree Doctor of Philosophy

Nathaniel J. McClatchey May 2018

© 2018 Nathaniel J. McClatchey. All Rights Reserved.

This dissertation titled Tensors: An Adaptive Approximation Algorithm, Convergence in Direction, and Connectedness Properties

by NATHANIEL J. MCCLATCHEY

has been approved for the Department of Mathematics and the College of Arts and Sciences by

Martin J. Mohlenkamp Associate Professor of Mathematics

Robert Frank
Dean, College of Arts and Sciences

Abstract

MCCLATCHEY, NATHANIEL J., Ph.D., May 2018, Mathematics
Tensors: An Adaptive Approximation Algorithm, Convergence in Direction, and Connectedness Properties (135 pp.)
Director of Dissertation: Martin J. Mohlenkamp

This dissertation addresses several problems related to low-rank approximation of tensors. Low-rank approximation of tensors is plagued by slow convergence of the sequences produced by popular algorithms such as Alternating Least Squares (ALS), by ill-posed approximation problems which cause divergent sequences, and by poor understanding of the nature of low-rank tensors. Though ALS may produce slowly-converging sequences, ALS remains popular due to its simplicity, its robustness, and the low computational cost for each iteration. I apply insights from Successive Over-Relaxation (SOR) to ALS, and develop a novel adaptive method based on the resulting Successive Over-Relaxed Least Squares (SOR-ALS) method. Numerical experiments indicate that the adaptive method converges more rapidly than does the original ALS algorithm in almost all cases. Moreover, the adaptive method is as robust as ALS, is only slightly more complicated than ALS, and each iteration requires little computation beyond that of an iteration of ALS. Divergent sequences in tensor approximation may be studied by examining their images under some map. In particular, such sequences may be re-scaled so that they become bounded, provided that the objective function is altered correspondingly. I examine the behavior of sequences produced when optimizing bounded multivariate rational functions. The resulting theorems provide insight into the behavior of certain divergent sequences. Finally, to improve understanding of the nature of low-rank tensors, I examine connectedness properties of spaces of low-rank tensors. I demonstrate that spaces of unit tensors of bounded rank are path-connected if the space of unit vectors in at least one of the factor spaces is path-connected, and that spaces of unit separable tensors are simply-connected if the unit vectors are simply-connected in every factor space. Moreover, I partially address simple connectedness for unit tensors of higher rank.

Dedication

To my family, whose love and support brought me to this point:

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 1418787. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation. I am grateful to Dr. Mohlenkamp, without whose guidance this dissertation would not have been possible. That you always had a word of advice when I needed it, despite my research touching three unique topics, is nothing short of amazing. I am grateful also to the other faculty of Ohio University — particularly Dr. Just, whose timely and thorough feedback has been a blessing.

Table of Contents

Abstract
Dedication
Acknowledgments
List of Figures
List of Acronyms

1 Introduction

2 Background
  2.1 Tensors
  2.2 Tensor formats
    2.2.1 Coordinate format
    2.2.2 Separated representation
      2.2.2.1 Difficulties inherent in the separated representation
    2.2.3 Tucker format
    2.2.4 Tensor Train format
  2.3 Algorithms for separated representations
    2.3.1 Block Coordinate Descent
      2.3.1.1 Alternating Least Squares
    2.3.2 Line Search
    2.3.3 Rank adjustment
  2.4 Algorithms for separable approximation
  2.5 Łojasiewicz inequality and related results

3 Nonlinear SOR and Tensor Approximation
  3.1 Introduction
  3.2 SOR-ALS
  3.3 Local Q-linear convergence of SOR-ALS
    3.3.1 Uschmajew's proof of the local linear convergence of ALS
    3.3.2 Linearization of nonlinear SOR
    3.3.3 Linear SOR and the energy seminorm
    3.3.4 Local linear convergence of SOR-ALS
  3.4 Numerical results
    3.4.1 Modeling terminal behavior
    3.4.2 An adaptive algorithm
  3.5 Concluding remarks

4 On Convergence to Essential Singularities
  4.1 Introduction
  4.2 On the assumption of a Łojasiewicz inequality
    4.2.1 A Łojasiewicz-like inequality holds on cones
    4.2.2 A Łojasiewicz inequality for sequences approaching singularities
    4.2.3 New convergence theorems
  4.3 Algorithms, examples, and implications
    4.3.1 Example from figure
    4.3.2 Convergence in direction
    4.3.3 Implications for tensor approximation
  4.4 Concluding remarks

5 Connectedness Properties of Low-rank Unit Tensors
  5.1 Preliminaries
  5.2 Separable tensors
    5.2.1 The tensor product is continuous
    5.2.2 Path-connectedness
    5.2.3 Simple connectedness
    5.2.4 Miscellany
  5.3 Sums of separable tensors
    5.3.1 Path-connectedness
    5.3.2 Simple connectedness
  5.4 Concluding remarks

References

List of Figures

3.1 f(v + ω arg min_{z∈V(j)} f(z + v)) as a function of ω
3.2 Example rate of convergence (q_ω) of SOR-ALS relative to rate of convergence (q_1) of ALS. If ln(q_ω/q_1) < 0, then SOR-ALS outperforms ALS at that value of ω.
3.3 Speedup of SOR-ALS compared to speed of ALS. Rank of target equals rank of approximation.
3.4 Speedup of SOR-ALS compared to speed of ALS. Rank of target equals rank of approximation.
3.5 Optimal ω compared to speed of ALS.
3.6 Speedup of Algorithm 8 compared to speed of ALS. Rank of target equals rank of approximation.
3.7 An example comparing ALS and Algorithm 8.
4.1 Line Search along the gradient maximizes f(x, y) = −xy / ((x² + y²)(1 + x² + y²)) from an initial estimate of (x_0, y_0) = (2, −0.1).
4.2 Illustration of Lemma 4.2.8.
5.1 Refining the partition {s_n}_{n=1}^{k_1}.
5.2 Improving a spline by refining subdivisions.
5.3 Hierarchy of sets used in proof of Theorem 5.3.7.

List of Acronyms

ALS Alternating Least Squares

BCD Block Coordinate Descent

CP Canonical Polyadic

dGN damped Gauss-Newton

GD Gradient Descent

GS Gauss-Seidel

MBI Maximum Block Improvement

NCG Nonlinear Conjugate Gradient

NSOR Nonlinear Successive Over-Relaxation

PNCG Nonlinearly Preconditioned Nonlinear Conjugate Gradient

RALS Randomized Alternating Least Squares

SEY Schmidt-Eckhart-Young

SOR Successive Over-Relaxation

SOR-ALS Successive Over-Relaxed Least Squares

SVD Singular Value Decomposition

1 Introduction

Representation and analysis of multidimensional data has been the subject of renewed interest in recent years. Consider a d-dimensional multi-index array A. Each tuple of indexes i_1, ..., i_d has a corresponding element A_{i_1,...,i_d} in the array. The computational and storage complexity of such an array depends exponentially on the number of dimensions d. For example, if each index i_j ranges from 1 to 10, then A would have 10^d entries. To reduce the computational and storage complexity of such an array, one may decompose it into a sum of separable arrays

A_{i_1,...,i_d} = Σ_{l=1}^{r} Π_{j=1}^{d} v_j^l(i_j) ,

which may be represented more concisely using the tensor product

A = Σ_{l=1}^{r} ⊗_{j=1}^{d} v_j^l .
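For illustration, the following NumPy sketch (array sizes, rank, and factor values chosen arbitrarily; not taken from the dissertation) builds a three-dimensional array from its separated factors:

```python
import numpy as np

# Build A[i1,i2,i3] = sum_l v_1^l(i1) * v_2^l(i2) * v_3^l(i3) from r separable summands.
r, n1, n2, n3 = 4, 10, 10, 10
rng = np.random.default_rng(0)
factors = [rng.standard_normal((n, r)) for n in (n1, n2, n3)]  # column l of factors[j] is v_{j+1}^l

A = np.einsum('il,jl,kl->ijk', *factors)
print(A.shape)                           # (10, 10, 10): 1000 entries ...
print(sum(f.size for f in factors))      # ... represented by only 120 stored parameters
```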

This decomposition, the separated representation, is known by many names including the Canonical Polyadic (CP) decomposition, CANDECOMP, PARAFAC, and tensor rank decomposition. The separated representation has found applications in data analysis [KB09, CMP+15, Bro97, LRA93, Har70, BGM09, LR92, CC70] and high-dimensional numerical methods [BAE16, BMP08, SK16, BM05, BD15, BD16, Kho12, LMS14, LMS11, Nou10b, Nou10a, PCA10, Val14, HK06, ?], among others [BAEH11]. The representation is not yet fully understood, however. If a tensor cannot be exactly represented using a given number of summands, then there is rarely a single optimal representation [VNVM14] even if one ignores rearrangement of summands and re-scaling of factors. If an optimal representation does exist, finding it may be NP-hard [HL13].

Over the years, many iterative optimization methods have been applied to the tensor approximation problem, including Newton's method [EH12], damped Gauss-Newton (dGN) methods [PTC13], variations of Nonlinear Conjugate Gradient (NCG) [DSW15], and a number of other line-search methods [BAEH11, RCH08, CHQ11]. The Block Coordinate Descent (BCD) class of algorithms has also remained popular for tensor approximation, because each block may be updated using robust linear methods at a low cost. The efficiency of each iteration may allow BCD methods to produce an approximation with less computation than would a method that requires fewer iterations. Such BCD methods include ALS, Maximum Block Improvement (MBI) [LT92, LUZ15], Randomized Alternating Least Squares (RALS) [RT14], and more exotic variants [TPC16, RDB16].

Despite the variety of algorithms that have been proposed for tensor approximation, creating an approximation often requires significant amounts of computation. Algorithms with a low cost for each iteration often require many iterations to create an acceptable approximation, while algorithms that require fewer iterations require additional calculation at each iteration. The problem is complicated by the existence of "swamps" [MB94] — transient periods of slow convergence that are not yet fully understood, but that may be related to the "shape" of spaces of low-rank tensors [Paa00]. Of the many open questions related to tensors, three will be addressed in this dissertation.

1. How can tensors be more efficiently approximated?

2. If no optimal representation exists, would a given approximation algorithm approach a single approximation, or cycle among several approximations of similar quality?

3. Can a tensor be continuously altered to produce every other tensor with the same number of summands? More generally, what is the general "shape" of a space of tensors with a specified number of summands?

I reduce the total computation required to create tensor approximations by improving the popular ALS algorithm. I also partially remedy the lack of theoretical knowledge by improving understanding of both tensors and optimization methods, providing insight into the "shape" of tensors and into the behavior one might expect if no optimal representation exists. In Chapter 2, I collect definitions and results relevant to the dissertation.

In Chapter 3, I develop an algorithm that provides better performance than does the popular ALS algorithm. ALS is simple to implement, numerically stable, and has a low computational cost for each iteration. I apply the insights of the Nonlinear Successive Over-Relaxation (NSOR) method [You50, Had00, VMP03] to ALS; specifically, I linearly extrapolate from the result of each substep of ALS. The resulting algorithm, SOR-ALS, often requires fewer iterations than would ALS when creating an approximation of similar accuracy. Because each substep of SOR-ALS is a linear extrapolation of a substep of ALS, the SOR-ALS algorithm is simple to implement, is numerically stable, and has a low computational cost for each iteration. Moreover, I demonstrate that SOR-ALS converges linearly once it is sufficiently close to the final approximation, and that if the extrapolation parameter ω is in (0, 2), then SOR-ALS produces a sequence with decreasing error. Unfortunately, the performance of the SOR-ALS method is sensitive to choice of the extrapolation parameter ω. To remedy the problem of sensitivity, I create an adaptive algorithm that selects ω based on a measure of the effectiveness of previous iterations. The cost of each iteration of the adaptive algorithm is comparable to that of ALS, but numerical experiments reveal that the adaptive algorithm often requires significantly fewer iterations than would ALS. The improvement is most significant when ALS would perform poorly, which renders the adaptive algorithm useful in exactly those situations that require an improved algorithm. This addresses, but does not close, question 1.

In Chapter 4, I analyze convergence of sequences to essential singularities of rational functions. Proofs of convergence for sequences produced by optimization algorithms may demonstrate convergence directly, or may employ a more general convergence theorem to simplify the problem. One such general convergence theorem, first introduced by Absil et al. in 2005 [AMA05], has demonstrated utility in the study of optimization methods [Usc15, SU15, LUZ15, BK86, BK85]. However, the theorem requires the objective function to be real-analytic. If one wishes to examine whether a divergent sequence would converge if examined in a projective space, the corresponding redefinition of the objective function is rarely continuous, much less real-analytic. I extend the work of Absil et al. to cover certain discontinuous functions relevant to tensor approximation. Though my results require a technical condition that has proven difficult to demonstrate for tensors, the result gives insight into the behavior of the divergent sequences produced in tensor approximation. In particular, Chapter 4 indicates that approximating tensors using the Gradient Descent (GD) method will produce sequences that converge, with high probability, if considered in a projective space. This partially answers question 2.

In Chapter 5, I examine the connectedness properties of spaces of unit tensors of bounded rank. One may ask whether a tensor approximation method may produce a reasonable approximation while taking small steps, even if it happens to be poorly initialized. When answering this question, one must consider that many tensor approximation algorithms, including ALS, may display undesired behavior if any iteration should become the zero tensor. To demonstrate that large steps are unnecessary, it would be sufficient to show that any initial approximation may be transformed continuously into any other tensor of the same rank without becoming zero; this is equivalent to demonstrating that spaces of unit tensors of bounded rank are path-connected. I characterize the path-connectedness of spaces of unit real and complex tensors. In all save a trivial case, such spaces are path-connected. This demonstrates that an iterative tensor approximation algorithm can, in principle, reach any tensor of a given rank from an initialization with the same rank while taking arbitrarily small steps, which answers the first half of question 3. Another connectedness property, simple connectedness, is of less practical interest than is path connectedness, but a positive or negative answer would give insight into the "shape" of spaces of tensors. A space is simply-connected if every continuous loop may be continuously contracted to a point; intuitively, a simply-connected space has no "holes". For spaces of unit tensors of bounded rank, I reduce the problem of determining simple connectedness to an easier problem. I demonstrate that the trivial indeterminacies inherent in tensor decomposition do not preclude simple connectedness. I determine sufficient conditions under which spaces of separable unit tensors are simply-connected. This partially addresses the second half of question 3.

2 Background

In this chapter, I discuss background and literature related to the problem of approximating tensors. I concentrate on results relevant to low-rank approximation and separated representations, and include papers on other tensor formats and more general multivariate optimization for context.

2.1 Tensors

Perhaps the most intuitive multidimensional objects that can be used to represent multidimensional data are multidimensional arrays.

Definition 2.1.1. A multidimensional real or complex array A ∈ K^{n_1 × n_2 × ··· × n_d}, where K = R or K = C, is indexed by multi-index tuples. For any d indices i_1 ∈ {1, ..., n_1}, ..., i_d ∈ {1, ..., n_d}, there exists a corresponding element of the multidimensional array, denoted by A_{i_1,...,i_d}.

Multidimensional arrays exactly correspond to finite grid-aligned multidimensional data. For this reason, they are commonly used in the tensor-approximation literature, particularly in the study of algorithms.

This notion may be generalized to objects of infinite size. For a given field K and a set S, define the free vector space on S by

F(S) = { f | f : S → K and |{s | f(s) ≠ 0}| < ∞ } .

In simpler terms, the free vector space F(S) is the vector space consisting of all finite linear combinations of the elements of S.

Suppose U and V are vector spaces over the field K. Define the equivalence relation ∼ as follows: For all u_1, u_2 ∈ U, all v_1, v_2 ∈ V and all k ∈ K,

(u_1, v_1) + (u_2, v_1) ∼ (u_1 + u_2, v_1) ,

(u_1, v_1) + (u_1, v_2) ∼ (u_1, v_1 + v_2) ,

(k u_1, v_1) ∼ (u_1, k v_1) ∼ k (u_1, v_1) .

Definition 2.1.2. For vector spaces U and V over a field K, the tensor product U ⊗ V is the quotient F(U × V)/∼. For any [w_1], [w_2] ∈ U ⊗ V and k ∈ K,

[w_1] + [w_2] = [w_1 + w_2] ,

k [w_1] = [k w_1] .

Finally, the tensor product of two vectors u ∈ U and v ∈ V is the equivalence class

u ⊗ v = [(u, v)] .

If U and V are inner product spaces, one may define the induced inner product ⟨·, ·⟩ : (U ⊗ V) × (U ⊗ V) → K by

⟨ Σ_{j=1}^{r_1} (u_1^j ⊗ u_2^j) , Σ_{l=1}^{r_2} (v_1^l ⊗ v_2^l) ⟩ = Σ_{j=1}^{r_1} Σ_{l=1}^{r_2} Π_{i=1}^{2} ⟨ u_i^j , v_i^l ⟩ , (2.1)

so U ⊗ V is an inner product space. Tensors are sufficiently structured to permit deep examination of their multidimensional structure and general enough that results found have the potential to be widely applicable in theoretical work. Tensors may also be used to represent multidimensional real or complex arrays, so the study of tensors is motivated by practical applications. Moreover, the tensor product naturally leads to a useful representation for tensors.
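As a small numerical check of (2.1) in the separable case (assuming real vectors and the standard dot product), the induced inner product of two separable tensors factors into a product of vector inner products:

```python
import numpy as np

# Check <u1 (x) u2, v1 (x) v2> = <u1, v1> <u2, v2> on randomly chosen vectors.
rng = np.random.default_rng(1)
u1, u2, v1, v2 = (rng.standard_normal(5) for _ in range(4))

lhs = np.sum(np.outer(u1, u2) * np.outer(v1, v2))   # inner product of the 2-way arrays
rhs = np.dot(u1, v1) * np.dot(u2, v2)
assert np.isclose(lhs, rhs)
```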

Definition 2.1.3. A tensor T is separable if it is a tensor product of vectors, T = ⊗_{i=1}^{d} v_i.

Separability is useful for computing; the space and time required for operations on the factors v_i is often significantly less than that required for operations on a more general tensor or multidimensional array. Unfortunately, randomly generated or real-world tensors are rarely separable. To mitigate this problem, one may instead consider sums of separable tensors. By definition, any tensor may be represented as a sum of finitely many separable tensors. The number of separable tensors that must be added to represent a given tensor is known as the rank of the tensor [Lan11].

Definition 2.1.4. The rank of a tensor T, rank(T), is the minimal number of separable tensors such that T is their sum. Formally,

rank(T) = inf { r | r ∈ N, v_1^1, ..., v_d^r vectors, T = Σ_{l=1}^{r} ⊗_{i=1}^{d} v_i^l } .

Though determining the rank of a given tensor is NP-hard [HL13], one can determine generic or typical ranks for tensors of a given size [CtBDLC09] and use that to estimate the rank of a tensor.

2.2 Tensor formats

Practical study of tensors is concerned primarily with tensor formats, methods of

representing multidimensional arrays A ∈ R^{n_1 × n_2 × ··· × n_d} with space not exponentially dependent on the number of dimensions d, and algorithms which produce tensors in these formats. These formats may be unified as multilinear maps [EHK], but this unification is impractical for most purposes; more readily-analyzed families of such multilinear maps are used instead.

The naive format stores one value A_{i_1,i_2,...,i_d} for each element of a multidimensional array, and requires storage space exponentially dependent on the number of dimensions. Though not suited to practical applications, the naive format provides an upper bound on the space required to store a tensor. A format requiring as much space as the naive format usually does not provide any advantage over the simpler naive format, so exceeding this bound is typically impractical.

2.2.1 Coordinate format

The coordinate format [BK08] generalizes well-known sparse representations of matrices as sets T^coord of location-value tuples, assuming that all elements not specified are 0. Thus

A_{i_1,i_2,...,i_d} = α if (i_1, i_2, ..., i_d, α) ∈ T^coord, and A_{i_1,i_2,...,i_d} = 0 otherwise.

The coordinate format efficiently and exactly represents sparse tensors, for which most elements may be omitted; it is thus well-suited to archiving recorded data, for which the number of samples is expected to be small relative to the number of potential tuples of indexes. The format is also easy to implement in software, with most performance concerns resolved by sorting the recorded elements by index. However, the coordinate format cannot efficiently represent a dense tensor, for which few elements could be omitted, and is thus unsuitable for data extrapolation and similar tasks.
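A minimal sketch of this format for a 3-way array, using a dictionary keyed by index tuples (the class and method names are illustrative only, not from any library):

```python
class CoordTensor:
    """Coordinate-format sparse array: unspecified entries are 0."""
    def __init__(self, shape):
        self.shape = shape
        self.data = {}                       # maps (i1, ..., id) to a nonzero value

    def set(self, index, value):
        if value != 0:
            self.data[index] = value
        else:
            self.data.pop(index, None)       # storing 0 just removes the entry

    def get(self, index):
        return self.data.get(index, 0.0)

T = CoordTensor((1000, 1000, 1000))
T.set((3, 141, 592), 2.5)                    # only recorded samples consume memory
print(T.get((3, 141, 592)), T.get((0, 0, 0)))   # 2.5 0.0
```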

2.2.2 Separated representation

The separated representation, also known as CP decomposition, tensor-rank decomposition, CANDECOMP, and PARAFAC, represents a tensor as a sum of separable tensors

A = Σ_{l=1}^{r} ⊗_{j=1}^{d} v_j^l

with each v_j^l ∈ R^{n_j} [KB09]. This is most easily understood by examining how it represents each element of the array:

A_{i_1,i_2,...,i_d} = Σ_{l=1}^{r} Π_{j=1}^{d} v_j^l(i_j) .

The format is useful for computations, because of its simplicity and elegant support for symmetry. The format may not, however, be the best choice for tensor storage. The format imposes no requirement of orthogonality, so it is prone to destructive cancellation of summands [MB94], a source of numerical error. The typical ranks published in [CtBDLC09] indicate that the minimal typical rank of a space of tensors tends to grow exponentially with the number of dimensions d, so exact representations in the format may require as much space as does the naive format.

2.2.2.1 Difficulties inherent in the separated representation

Low-rank approximation — approximating tensors using low-rank separated representations — is attractive because of the simple structure of the representation, but the analysis of algorithms which create these approximations is complicated by difficulties inherent to separated representations. It has been shown that many problems related to tensor fitting, particularly tensor rank, are NP-hard [HL13]. Though iterative methods circumvent this by solving problems approximately, the results are useful guides for research, indicating that knowledge of certain quantities cannot be assumed in a practical algorithm. Determining the rank of a tensor is NP-hard, as is the creation of an optimal separable approximation. The spectral norm

‖T‖_{2,...,2} = sup_{v_1,...,v_d} ⟨ T, ⊗_{i=1}^{d} v_i ⟩ / Π_{i=1}^{d} ‖v_i‖

in particular should be avoided, as even approximating the spectral norm is NP-hard. In addition to the complexity of creating approximations, the problem of creating an optimal fixed-rank approximation of a tensor may be ill-posed. More precisely, if d ≥ 3, then there exist targets T which may be approximated arbitrarily well, but not exactly, by tensors of rank 2 ≤ r < rank(T). This notion is best described by the border rank of a given tensor. Border rank describes the minimal rank required to approximate a tensor to within an arbitrarily low error tolerance [Lan11].

Definition 2.2.1. The border-rank of a tensor T is the minimal number of separable tensors such that T may be approximated arbitrarily well by their sum. Formally,

border-rank(T) = max { min { rank(X) : X ∈ B_ε(T) } : ε > 0 } .

If border-rank(T) < rank(T), there exist ranks r, with border-rank(T) ≤ r < rank(T), such that T may be represented arbitrarily well, but not exactly, by a tensor of rank r. Though the tensors with a given border rank may sometimes be described [BL14], such descriptions are complicated even in the case of border-rank(T) = 2. The lack of a unique, optimal approximation of rank r can prevent the existence of certain generalizations of the two-dimensional Singular Value Decomposition (SVD). Specifically, it may prevent the existence of a decomposition which, when truncated to the first r terms, provides the best rank-r approximation of the decomposed tensor — that is, a Schmidt-Eckhart-Young (SEY) decomposition. For non-trivial tensors of dimension higher than 2, generic tensors fail to admit an SEY decomposition [VNVM14]. This non-existence can be illustrated by example [Moh13]. The rank-3 Laplacian tensor

∇_3 = e_2 ⊗ e_1 ⊗ e_1 + e_1 ⊗ e_2 ⊗ e_1 + e_1 ⊗ e_1 ⊗ e_2 ,

where e_1 = [1, 0]^T ∈ R^2 and e_2 = [0, 1]^T ∈ R^2, cannot be represented as a sum of two separable tensors, yet it is the limit of a sequence of rank-2 tensors

∇_3 = lim_{t→0} ( ⊗_{i=1}^{3} (e_1 + t e_2) − ⊗_{i=1}^{3} e_1 ) / t .
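This limit can be checked numerically; the following sketch (step sizes chosen arbitrarily) forms the rank-2 difference quotient and measures its distance to ∇_3:

```python
import numpy as np

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

def outer3(a, b, c):
    return np.einsum('i,j,k->ijk', a, b, c)

laplacian = outer3(e2, e1, e1) + outer3(e1, e2, e1) + outer3(e1, e1, e2)

for t in (1e-1, 1e-3, 1e-5):
    # Difference of two separable tensors, so rank at most 2, yet it approaches the rank-3 target.
    rank2 = (outer3(e1 + t * e2, e1 + t * e2, e1 + t * e2) - outer3(e1, e1, e1)) / t
    print(t, np.linalg.norm(rank2 - laplacian))   # error shrinks roughly like O(t)
```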

Thus a tensor need not admit an optimal approximation with a given rank. Even if an optimal approximation exists, it may not be unique. Re-ordering of summands provides trivial alternate representations, and some tensors may even admit decompositions with multiple non-trivial representations. [Moh13] provides an example

in the form of a trigonometric identity; for any {c_i}_{i=1}^{d} ⊂ R such that

[i ≠ l] =⇒ [sin(c_i − c_l) ≠ 0],

sin( Σ_{i=1}^{d} x_i ) = Σ_{l=1}^{d} sin(x_l) Π_{i=1, i≠l}^{d} sin(x_i + c_i − c_l) / sin(c_i − c_l) .

Non-trivially ambiguous decompositions complicate proofs of convergence because optimization methods can, in principle, travel along a path of ambiguous decompositions rather than converging to a single optimal representation. Because non-trivially ambiguous decompositions are not yet well characterized, proofs of convergence often impose conditions that prevent consideration of such decompositions. When creating a sequence of rank-r separated representations approaching a tensor T, with border-rank(T) ≤ r < rank(T), the parameters of the sequence may diverge even if the sequence itself converges. This difficulty is not restricted to pathological cases. In

the set of complex multidimensional arrays C^{n × ··· × n} (with d factors), for most choices of n and d, this difficulty affects almost all target arrays [VNVM14]. For theoretical purposes, divergent sequences of parameters complicate the task of proving convergence of a sequence of tensors. Analysis of convergence in the literature has avoided this problem by examining compact subsets of sets of low-rank tensors and by modifying the fitting problem. However, fitting algorithms may create sequences of parameterizations that, despite possessing diverging parameters, represent a convergent sequence of tensors. The convergence of these sequences has not yet been studied. For practical purposes, the divergence of parameters introduces significant numerical error. Numerical error, the error introduced when approximating an object such as a real

number using a finite representation, is modeled as a relative perturbation |x_approx − x| / |x|. This relative perturbation is usually bounded above by some representation-dependent ε > 0. If an object is represented as the image of parameters under some map, a small relative perturbation of the parameters may correspond to a large relative perturbation of the image. The condition number κ is defined by

κ = inf_{ε>0} sup_{‖δx‖≤ε} ( ‖δy‖ / ‖y‖ ) ( ‖x‖ / ‖δx‖ ) ,

where x denotes a given input, y denotes the corresponding output, δx denotes a perturbation of the input, and δy denotes the corresponding perturbation of the output. The condition number κ provides a bound on the error introduced by a perturbation in the input, such as that induced by storing something using a finite representation.

The condition number of a separated representation {T_l}_{l=1}^{r} is

κ = ( Σ_{l=1}^{r} ‖T_l‖ ) / ‖ Σ_{l=1}^{r} T_l ‖ , (2.2)

which bounds the magnitude of relative perturbations in tensor space as a proportion of the magnitude of relative perturbations of the summands T_1, ..., T_r. This expression for the condition number is not everywhere differentiable, which would complicate its use as the objective function for an optimization problem. The similar, but analytic, expression

( Σ_{l=1}^{r} ‖T_l‖² ) / ‖ Σ_{l=1}^{r} T_l ‖² may be substituted. By using this latter expression as a penalty term in a minimization problem, the condition number may be controlled, though not optimized. This penalty is usually applied in the form of Tikhonov regularization (see e.g. [Usc15]),

which modifies the objective function from ‖ Σ_{l=1}^{r} T_l − T_target ‖² to

‖ Σ_{l=1}^{r} T_l − T_target ‖² + λ Σ_{l=1}^{r} ‖T_l‖² . (2.3)

If an optimal approximation does exist, it inherits the differentiability class of the approximated tensor [Usc11]. This applies also to optimal regularized approximations.
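A short sketch of (2.2) and (2.3) for 3-way separated representations, assuming Frobenius norms throughout (function names and test sizes are illustrative):

```python
import numpy as np

def summands(factors):
    """Separable summands T_l of a 3-way separated representation (A, B, C)."""
    A, B, C = factors
    return [np.einsum('i,j,k->ijk', A[:, l], B[:, l], C[:, l]) for l in range(A.shape[1])]

def condition_number(factors):
    terms = summands(factors)                                   # Eq. (2.2)
    return sum(np.linalg.norm(t) for t in terms) / np.linalg.norm(sum(terms))

def regularized_error(factors, target, lam):
    terms = summands(factors)                                   # Eq. (2.3)
    return (np.linalg.norm(sum(terms) - target) ** 2
            + lam * sum(np.linalg.norm(t) ** 2 for t in terms))

rng = np.random.default_rng(4)
factors = [rng.standard_normal((n, 3)) for n in (5, 6, 7)]
print(condition_number(factors))        # always at least 1; large values signal cancellation
```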

2.2.3 Tucker format

The Tucker format represents a tensor as a small "core" tensor C ∈ R^{r_1 × ··· × r_d},

transformed on each direction by orthogonal matrices {U^j}_{j=1}^{d} as follows [KB09, GQ14]:

A_{i_1,i_2,...,i_d} = Σ_{l_1=1}^{r_1} ··· Σ_{l_d=1}^{r_d} C_{l_1,...,l_d} Π_{j=1}^{d} U^j_{l_j, i_j} .

The requirement that the matrices U^j be orthogonal ensures, by a compactness argument, that every target tensor admits a best approximation in Tucker format with given rank r_1, ..., r_d. The Tucker format imitates the orthogonality and existence properties of the singular value decomposition, but does not reveal the rank of the represented tensor. Despite this, approximations in Tucker format are not prone to numerical error, so the format may be a good choice for tensor storage.
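For illustration, the following sketch reconstructs a 3-way array from a Tucker representation; the factors U^j are generated here by a QR factorization purely as a convenient way to obtain orthonormal rows (sizes are arbitrary):

```python
import numpy as np

r1, r2, r3 = 2, 3, 4
n1, n2, n3 = 10, 20, 30
rng = np.random.default_rng(2)
core = rng.standard_normal((r1, r2, r3))
# Each U^j is r_j x n_j with orthonormal rows, matching the indexing U^j[l_j, i_j].
U1 = np.linalg.qr(rng.standard_normal((n1, r1)))[0].T
U2 = np.linalg.qr(rng.standard_normal((n2, r2)))[0].T
U3 = np.linalg.qr(rng.standard_normal((n3, r3)))[0].T

# A[i1,i2,i3] = sum_{l1,l2,l3} C[l1,l2,l3] U1[l1,i1] U2[l2,i2] U3[l3,i3]
A = np.einsum('abc,ai,bj,ck->ijk', core, U1, U2, U3)
print(A.shape)   # (10, 20, 30) reconstructed from a 2 x 3 x 4 core
```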

2.2.4 Tensor Train format

More recently, a tensor train format (see e.g. [KSU14]) has been developed, which

represents tensors of arbitrary dimension as a list of 3-dimensional tensors {U_j}_{j=1}^{d}. Let the

matrix U_j(n) be the n-th "slice" of U_j; then the tensor A represented in TT format is given by

A_{i_1,i_2,...,i_d} = Π_{j=1}^{d} U_j(i_j) .

The tensor train format and its generalization, the hierarchical tensor format [BG14], are well-suited to tensor storage because they decouple the ranks used for each dimension. The hierarchical tensor format requires space not greater than that required for the separated representation. Moreover, for every tensor and selected rank, there exists a best approximation in the tensor train format. Despite this, the format is more complicated than the other formats, and does not elegantly support symmetry.
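A minimal sketch of evaluating a single entry from a tensor train representation, assuming (as is common, though not stated above) that each U_j is stored as a 3-way array of shape (r_{j−1}, n_j, r_j) with boundary ranks r_0 = r_d = 1:

```python
import numpy as np

def tt_entry(cores, index):
    """A[i1,...,id] as the ordered product of the matrix slices U_j(i_j)."""
    result = np.eye(1)
    for U, i in zip(cores, index):
        result = result @ U[:, i, :]
    return result[0, 0]

rng = np.random.default_rng(3)
shapes = [(1, 5, 3), (3, 6, 4), (4, 7, 1)]      # d = 3, TT ranks (3, 4)
cores = [rng.standard_normal(s) for s in shapes]
print(tt_entry(cores, (0, 2, 4)))
```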

2.3 Algorithms for separated representations

Many algorithms have been developed for the creation of approximations of tensors in various formats. Some methods specifically address some aspect of tensor approximation, such as sampling of the target tensor [EGH09] or creation of good initial approximations [NW14]. The majority of tensor approximation algorithms, however, are specializations of more generally-applicable iterative optimization algorithms.

2.3.1 Block Coordinate Descent

BCD describes a family of hill-climbing algorithms which optimize multivariate functions by iteratively optimizing over subsets (blocks) of variables. The efficiency of BCD algorithms depends strongly on the choice of blocks. For tensor optimization, selecting for the j-th block every variable corresponding to the j-th direction of the tensor approximation causes the objective function to be quadratic on each block; optimizing a single block is then a linear least-squares problem. Because of the efficiency of a block update, many variations of BCD have been proposed for tensor approximation. Cyclic BCD, Maximum Block Improvement, and Randomized BCD each prescribe different rules for selecting the order in which to visit the blocks. Variants of the BCD method have also been developed. One may constrain the difference between successive approximations [LKN13], select blocks that do not correspond to directions of the tensor [TPC16], or use random projection to reduce the condition number of the problem [RDB16]. Using a second-order approximation of the objective function, as in [Nes12], does not provide any benefit when approximating tensors.

Algorithm 1, the Maximum Block Improvement algorithm, is deterministic and behaves in a manner similar to gradient descent by selecting the block with steepest descent; this similarity was exploited to prove convergence [LUZ15] for rank-1 approximations (Theorem 2.5.4) and, under a weak additional assumption, linear convergence (Theorem 2.5.5). MBI may not be ideal, however, because the cost of a single iteration of MBI is equivalent to the cost of a full cycle of Cyclic BCD; the MBI algorithm may thus converge more slowly, with respect to computational cost, than would Cyclic BCD.

Algorithm 2, the Randomized Block Coordinate Descent algorithm, selects blocks randomly — typically using a discrete uniform distribution. It is by its nature stochastic, and cannot guarantee convergence in all cases. Despite this, the algorithm provides many of the advantages of MBI, with positive probability, while retaining the lower cost of Cyclic BCD. Further, recent research [RT14] has provided probabilistic bounds on the number of iterations required for the algorithm to reach a given proximity to the target, though these bounds assume that the objective function being minimized is both smooth and convex. This assumption is violated by tensor approximation problems.

Algorithm 1: Maximum Block Improvement (MBI)

Given blocks of parameters v_1, ..., v_d and a function f to minimize,
repeat
    for j = 1 to d do
        u_j ← arg min_x f(v_1, ..., v_{j−1}, x, v_{j+1}, ..., v_d)
    end for
    k ← arg min_{j=1,...,d} f(v_1, ..., v_{j−1}, u_j, v_{j+1}, ..., v_d)
    v_k ← u_k
until v_1, ..., v_d converge

Algorithm 2: Randomized Block Coordinate Descent (RBCD)

Given blocks of parameters v_1, ..., v_d and a function f to minimize,
repeat
    Select j randomly from 1, ..., d
    v_j ← arg min_x f(v_1, ..., v_{j−1}, x, v_{j+1}, ..., v_d)
until v_1, ..., v_d converge

Algorithm 3, the Cyclic Block Coordinate Descent algorithm, selects blocks according to a fixed pattern. Unlike MBI, this pattern does not depend on the location of the current iteration. This provides simple deterministic behavior, which may be more easily analyzed than that of other variants of BCD. Cyclic BCD applied to tensor optimization, with the aforementioned choice of blocks, is the ALS algorithm.

Algorithm 3: Cyclic Block Coordinate Descent (CBCD)

Given blocks of parameters v_1, ..., v_d and a function f to minimize,
repeat
    for j = 1 to d do
        v_j ← arg min_x f(v_1, ..., v_{j−1}, x, v_{j+1}, ..., v_d)
    end for
until v_1, ..., v_d converge

2.3.1.1 Alternating Least Squares

The ALS algorithm [HPJ+98, ADK11, LKN13, MB94, Usc12, EHK] derives its effectiveness from the observation that the map M_j defined by

M_j : (x^1, ..., x^r) ↦ Σ_{l=1}^{r} ( ⊗_{i=1}^{j−1} v_i^l ) ⊗ x^l ⊗ ( ⊗_{i=j+1}^{d} v_i^l )

is a linear map. If the objective function f is chosen to be the squared error of the approximation, ‖T_approx − T_target‖², then the subproblem

arg min_x f(v_1, ..., v_{j−1}, x, v_{j+1}, ..., v_d)

from CBCD is a linear least-squares problem. Because of this, optimization over a block of variables may be performed using fast, numerically stable algorithms for solving linear systems.

Algorithm 4: ALS for Separated Representations

Given vectors v_1^1, ..., v_d^r and a target tensor T_target,
repeat
    for j = 1 to d do
        Define a linear map M_j by
            M_j : (x^1, ..., x^r) ↦ Σ_{l=1}^{r} ( ⊗_{i=1}^{j−1} v_i^l ) ⊗ x^l ⊗ ( ⊗_{i=j+1}^{d} v_i^l )
        v_j^1, ..., v_j^r ← (M_j^* M_j)^{−1} M_j^* T_target
    end for
until v_1^1, ..., v_d^r converge
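For illustration, a minimal 3-way NumPy sketch in the spirit of Algorithm 4 (not the dissertation's implementation): each block update solves the normal equations of the corresponding linear least-squares subproblem, whose Gram matrix is the Hadamard product of the Gram matrices of the fixed factors (assumed nonsingular here).

```python
import numpy as np

def als_sweep(T, A, B, C):
    """One full ALS iteration for T[i,j,k] ~ sum_l A[i,l] B[j,l] C[k,l]."""
    G = (B.T @ B) * (C.T @ C)                                        # Gram matrix of M_1
    A = np.linalg.solve(G, np.einsum('ijk,jl,kl->il', T, B, C).T).T
    G = (A.T @ A) * (C.T @ C)
    B = np.linalg.solve(G, np.einsum('ijk,il,kl->jl', T, A, C).T).T
    G = (A.T @ A) * (B.T @ B)
    C = np.linalg.solve(G, np.einsum('ijk,il,jl->kl', T, A, B).T).T
    return A, B, C

def cp_als(T, r, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((n, r)) for n in T.shape)
    for _ in range(iters):
        A, B, C = als_sweep(T, A, B, C)
    return A, B, C

# Example: fit a random rank-2 target with a rank-2 approximation.
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.standard_normal((n, 2)) for n in (8, 9, 10))
T = np.einsum('il,jl,kl->ijk', A0, B0, C0)
A, B, C = cp_als(T, 2)
print(np.linalg.norm(np.einsum('il,jl,kl->ijk', A, B, C) - T))       # residual, typically small
```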

Algorithm 4 is simple and has low cost for each iteration. However, the sequences produced by ALS may fail to converge due to the nature of the problem, as noted in Section 2.2.2.1. Further, the algorithm may encounter "swamps" [MB94], regions or parts of the sequence which converge slowly, followed by more rapid convergence. These swamps are not yet fully understood, but it has been suggested that destructive cancellation [MB94] or the shape of the set of low-rank tensors [Paa00] may contribute to swamps. When ALS does converge, however, it usually converges at least linearly [Usc12].

2.3.2 Line Search

Line Search describes a family of algorithms which optimize over 1-dimensional affine subspaces of parameters, chosen such that the current approximation is contained within the subspace. Though this allows 1-dimensional optimization methods to be employed, such methods typically require evaluating the objective function multiple times. For tensor approximation, evaluating the objective function is a non-trivial task. Additionally, the rate and order of convergence of this family of algorithms depends strongly on the affine subspace chosen.

Algorithm 5: Line Search

Given parameters v and a function f to minimize,
repeat
    Select a non-zero vector d
    t ← arg min_{s∈R} f(v + s d)   (computed approximately by a 1-dimensional optimization procedure)
    v ← v + t d
until v converges
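A minimal sketch of Algorithm 5 in which the 1-dimensional minimization is replaced, purely for illustration, by a coarse grid search along the chosen direction (the grid and the quadratic test function are arbitrary choices):

```python
import numpy as np

def line_search_step(f, v, d, steps=np.linspace(-2.0, 2.0, 401)):
    """Move v to the best sampled point along direction d."""
    values = [f(v + s * d) for s in steps]
    return v + steps[int(np.argmin(values))] * d

# Example on a quadratic bowl, searching along the negative gradient (the GD choice of d).
f = lambda v: float(v @ np.diag([1.0, 10.0]) @ v)
grad = lambda v: 2.0 * np.diag([1.0, 10.0]) @ v
v = np.array([3.0, 1.0])
for _ in range(20):
    v = line_search_step(f, v, -grad(v))
print(v, f(v))   # approaches the minimizer at the origin
```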

The GD method selects as its search direction d = −∇f(v), where ∇f is the gradient of the objective function [Cur44]. The algorithm is simple, and selecting the search direction has computational cost comparable to an iteration of ALS. However, GD often requires many more iterations than does ALS, so the total computational cost of GD often greatly exceeds that of ALS. One may combine line search with Newton's method by selecting d = −(H f(v))^{−1}(∇f(v)) as the search direction, where H f(v) is the Hessian matrix of f at v [EH12]. The algorithm provides locally quadratic convergence, under some weak assumptions, but selecting the search direction may be prohibitively costly due to the size of the Hessian matrix. Even constructing the Hessian matrix is costly, though more sophisticated methods may reduce the cost of construction [KT10]. Approximations of Newton's method, such as the Gauss-Newton method [PTC13, ADK11], reduce the computational cost of each iteration but remain costly when applied to tensor approximation.

NCG, a more sophisticated method, provides better convergence than does GD without explicitly computing the Hessian matrix. Though this method can converge quadratically [Coh72], the conditions required to prove quadratic convergence are not satisfied by the tensor approximation problem. Moreover, the convergence of NCG is hampered significantly by the presence of numerical error, which renders it ill-suited to tensor approximation. A variant of NCG that is better-suited to tensor approximation, Nonlinearly Preconditioned Nonlinear Conjugate Gradient (PNCG), was introduced in [DSW15]. One can also use the results of another algorithm, such as ALS, to select a search direction. This is employed by the Enhanced Line Search [RCH08] method, among others [CHQ11].

2.3.3 Rank adjustment

BCD and Line Search algorithms iteratively improve an approximation of a tensor, but they do not change the rank of the approximation. If the rank of the initial approximation is too small, the aforementioned iterative algorithms cannot introduce additional summands. If the rank of the initial approximation is unnecessarily large, the aforementioned algorithms cannot discard summands. To address these shortcomings, one may start with a high-rank approximation and penalize the rank of the approximation, as in [GQ14]. Alternatively, one may explicitly reduce the rank of an approximation by removing some of its summands or increase the rank of an approximation by adding separable summands to it. When reducing the rank of an approximation, one must select which separable summands to remove. Matrix rank-reduction methods may be used to efficiently determine which summands ought to be removed [BBB15].

When increasing the rank of an approximation, one may do so by adding separable summands that approximate the residual (T_target − T_approx). Repeatedly adding the best separable approximation of the residual will create a sequence that converges to the target [ACF10], but this may increase the rank of the approximation beyond what is necessary. It is often preferable to use an iterative method to improve the approximation after adding each summand, as in [EHRS12, GNL14].

2.4 Algorithms for separable approximation

Separable (rank-1) approximation of a tensor is equivalent to approximation of the largest eigenvalue of a tensor and a corresponding eigenvector [WC14]. For this reason, algorithms have been developed specifically for creation of separable approximations of tensors. In some cases, these algorithms may be generalized to creation of non-separable approximations of tensors. Among such algorithms are generalizations of the power method for matrices. Each paper addressing a tensor power method contains a definition of tensor-vector multiplication, for which the notation varies widely. This multiplication corresponds to the notion of a complementary inner product [Moh13], which may be more intuitive than other definitions of such multiplication. Define the complementary inner product of two multidimensional arrays

⟨·, ·⟩_{\ j} : R^{n_1 × ··· × n_d} × R^{n_1 × ··· × n_d} → R^{n_j} by

( ⟨A, B⟩_{\ j} )_{i_j} = Σ_{i_1=1}^{n_1} ··· Σ_{i_{j−1}=1}^{n_{j−1}} Σ_{i_{j+1}=1}^{n_{j+1}} ··· Σ_{i_d=1}^{n_d} A_{i_1,...,i_d} B_{i_1,...,i_d} .

The Higher-Order Power Method (HOPM) [WC14] generalizes the power method by cyclically updating each "direction" of the approximation. Though developed as a generalization of the power method, the HOPM is equivalent to a specialization of ALS for separable approximation.

Algorithm 6: Higher-Order Power Method (HOPM)

Given λ ∈ R, v_1 ∈ R^{n_1}, ..., v_d ∈ R^{n_d} and a target tensor T_target,
repeat
    for j = 1 to d do
        u^j ← ⟨ T_target, (⊗_{i=1}^{j−1} v_i) ⊗ 1 ⊗ (⊗_{i=j+1}^{d} v_i) ⟩_{\ j}
        λ ← ‖u^j‖
        v_j ← u^j / ‖u^j‖
    end for
until v_1, ..., v_d converge
return λ ⊗_{i=1}^{d} v_i
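For illustration, a minimal 3-way sketch of Algorithm 6, with the complementary inner products written as einsum contractions (the separable test target and iteration count are arbitrary choices):

```python
import numpy as np

def hopm(T, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    v1, v2, v3 = (rng.standard_normal(n) for n in T.shape)
    v1, v2, v3 = v1 / np.linalg.norm(v1), v2 / np.linalg.norm(v2), v3 / np.linalg.norm(v3)
    lam = 0.0
    for _ in range(iters):
        u = np.einsum('ijk,j,k->i', T, v2, v3)     # <T, 1 (x) v2 (x) v3>_{\1}
        lam, v1 = np.linalg.norm(u), u / np.linalg.norm(u)
        u = np.einsum('ijk,i,k->j', T, v1, v3)
        lam, v2 = np.linalg.norm(u), u / np.linalg.norm(u)
        u = np.einsum('ijk,i,j->k', T, v1, v2)
        lam, v3 = np.linalg.norm(u), u / np.linalg.norm(u)
    return lam, (v1, v2, v3)

w = np.array([3.0, 4.0]) / 5.0                     # unit vector
T = 7.0 * np.einsum('i,j,k->ijk', w, w, w)         # separable target with "eigenvalue" 7
lam, _ = hopm(T)
print(lam)                                         # approximately 7.0
```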

The HOPM converges with unknown, but presumed linear, rate [WC14, Usc15]. Two variants of the HOPM have been developed for approximation of symmetric tensors. The Symmetric Higher-Order Power Method (S-HOPM) creates symmetric tensor approximations by simultaneously updating all "directions" of the approximation. Kolda et al. note that S-HOPM may fail to converge and introduce a Shifted Symmetric Higher-Order Power Method (SS-HOPM) [KM11], of which S-HOPM is a special case. SS-HOPM converges if the target tensor admits at most finitely many eigenvectors and the shifting parameter α is sufficiently large. However, large values of α should be expected to cause slow convergence, as α acts as a damping parameter.

2.5 Łojasiewicz inequality and related results

In recent years, the Łojasiewicz gradient inequality has become an invaluable tool for proving convergence of iterative optimization algorithms.

Theorem 2.5.1 (Łojasiewicz gradient inequality). Let f be a real-analytic function on a

neighborhood of x^* in R^n. Then there are constants c > 0 and θ ∈ (0, 1/2] such that

| f(x) − f(x^*) |^{1−θ} ≤ c ‖∇f(x)‖ (2.4)

for any x in some neighborhood of x^*.
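For example, for f(x) = ‖x‖² on R^n and x^* = 0, inequality (2.4) holds with θ = 1/2 and c = 1/2: since ∇f(x) = 2x, one has | f(x) − f(x^*) |^{1−1/2} = ‖x‖ = (1/2) ‖∇f(x)‖.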

Though Stanisław Łojasiewicz stated in [Łoj65] that θ ∈ (0, 1), it is noted in [SU15] that this may be strengthened to θ ∈ (0, 1/2] without altering the proof. Absil, Mahony, and Andrews proved in [AMA05] that a sequence which satisfies certain descent conditions and has a cluster point will necessarily converge to that cluster point. For ease of reference, I state the result here, in the style of Uschmajew [Usc15]:

Theorem 2.5.2. Let f : R^n → R be a real-analytic function, and let (x_k) ⊂ R^n be a sequence of vectors satisfying

f(x_k) − f(x_{k+1}) ≥ σ ‖∇f(x_k)‖ ‖x_{k+1} − x_k‖ , (A1)

for sufficiently large k and some σ > 0. If the implication

f(x_k) = f(x_{k+1}) =⇒ x_k = x_{k+1} (A2)

holds, then a cluster point x^* of the sequence (x_k) must be its limit.

This result was later extended in [SU15, Usc15] to estimate the order of convergence of the sequence, under an additional assumption.

Theorem 2.5.3. Under the conditions of Theorem 2.5.2, if there exists some κ > 0 such that for sufficiently large k,

‖x_{k+1} − x_k‖ ≥ κ ‖∇f(x_k)‖ , (A3)

then ∇f(x_k) → 0 and the convergence rate may be estimated as

‖x^* − x_k‖ ≲ q^k if θ = 1/2 (for some 0 < q < 1), and ‖x^* − x_k‖ ≲ k^{−θ/(1−2θ)} if 0 < θ < 1/2,

where θ is such that (2.4) holds.

The objective functions used in tensor applications are typically analytic, so if a compactness argument can be applied to guarantee existence of a cluster point, Theorems 2.5.2 and 2.5.3 may significantly reduce the difficulty of constructing a proof of convergence. [LUZ15] applied these results to the MBI algorithm to show that, under certain more easily proven assumptions, the sequences produced by MBI converge. The paper proves that those assumptions hold for the case of rank-1 approximations, but does not address approximations of higher ranks. We provide the relevant theorems below, for convenience.

Theorem 2.5.4 ([LUZ15, p. 223]). If the initial approximation is non-zero, the sequence of iterates of an MBI method applied to the rank-one approximation problem converges to a critical point of the problem.

Theorem 2.5.5 ([LUZ15, p. 226]). Let F_* = λ_* x_*^1 ⊗ · · · ⊗ x_*^d be a best rank-one approximation of F = F_* + E. Then, if ‖E‖ < √( d / (3d − 2) ) ‖F‖, the MBI method is linearly convergent in a neighborhood of x_*.

[SU15] showed that conditions (A1), (A2), and (A3) hold for a variant of the line-search algorithm. More relevant to this dissertation is the work of [Usc15], which shows that conditions (A1), (A2), and (A3) hold for the sequences produced by the ALS algorithm for rank-1 tensors and also for the sequences produced by the ALS algorithm with Tikhonov regularization, regardless of the rank of the approximation. Because Tikhonov regularization permits a compactness argument, this shows that ALS with Tikhonov regularization converges. The assumptions required by the paper for non-regularized ALS are not yet proven to hold, however, and do not hold for divergent sequences.

3 Nonlinear Successive Over-Relaxation and Tensor Approximation

3.1 Introduction

ALS — one of the most popular iterative algorithms for low-rank tensor approximation — is simple to implement, numerically stable, and has a low computational cost for each iteration. ALS can also require many iterations to converge, which mitigates these advantages. To achieve better performance, I apply the nonlinear SOR method to the problem of low-rank tensor approximation. The resulting SOR-ALS method can outperform ALS, but is sensitive to the choice of a tuning parameter. I then propose an adaptive method based on SOR-ALS which provides better performance than does ALS while remaining simple to implement, being numerically stable, and having a low computational cost per iteration. The low-rank tensor approximation problem attempts to represent a target tensor T using a sum of separable tensors. For convenience, this sum of separable tensors is formulated as the image of a map τ from parameters in X = (R^{n_1} × · · · × R^{n_d})^r to tensors.

To define τ, first partition x ∈ X as x = (v_1^1, ..., v_d^r), where v_1^1, ..., v_1^r ∈ R^{n_1}, ..., v_d^1, ..., v_d^r ∈ R^{n_d}. Then τ(x) = τ(v_1^1, ..., v_d^r) = Σ_{l=1}^{r} ⊗_{i=1}^{d} v_i^l. The low-rank tensor approximation problem seeks parameters x ∈ X such that the "error" ‖τ(x) − T‖ is minimized. This is often formulated as minimization of the real-analytic function f : X → R defined by

f : x ↦ ‖τ(x) − T‖² . (3.1)

Minimization of (3.1) is a non-trivial problem. Even in the simplest case — in which r = 1 — direct minimization of f is NP-hard [HL13]. To mitigate this, minimizers of (3.1) are approximated using iterative methods. If r > 1, then minimization of (3.1) can be ill-posed; for a treatment of the ill-posed case, see Chapter 4. Iterative methods may repeatedly construct and minimize a local approximation of (3.1), as in Newton's method and trust-region methods, or they may minimize (3.1) over affine subspaces of its domain. The latter approach can be more effective in the tensor approximation problem than it would be for more general minimization problems. If one

holds constant all blocks except those in one "direction" j, that is all except v_j^1, ..., v_j^r, then (3.1) becomes quadratic and may be minimized by methods of linear algebra. The ALS method exploits this to update the block of variables for each direction j = 1, ..., d in turn. The process of updating a single block of variables will be denoted as a microstep of ALS. ALS has a number of advantages over other iterative methods that have ensured its continuing popularity. In particular, efficient, numerically stable methods exist for solving systems of linear equations, or equivalently for minimizing quadratic forms. Using these to update each block of variables ensures that each iteration of ALS is cheap and numerically stable. Despite the low cost for each iteration of ALS, ALS may require many iterations to produce an acceptable approximation. This is due in part to the linear convergence that ALS exhibits, which may be quite slow. Additionally, ALS may exhibit periods of extremely slow progress unrelated to the linear convergence on the tail, which are colloquially known as "swamps" [MB94]. Various methods of enhancing the performance of ALS have been suggested. Most relevant to this chapter are extrapolation methods. The line search in [CHQ11] determines the direction along which a single iteration of ALS would move, then searches along a line in that direction. Enhanced Line Search [RCH08] increases the dimension of the space searched, which may reduce the total number of iterations required, but at the expense of making each iteration more costly. The additional search operations required by these methods increase the complexity of implementing the method and significantly increase the cost of each iteration. A blind search in the direction travelled by ALS was mentioned — but not studied — in [Bro98, p. 95]. Like line search, the method extrapolates from the result of an iteration of ALS, but the amount of extrapolation is determined a priori. Though the cost of each iteration is comparable to that of ALS, the method can fail to produce a sequence with monotonically decreasing error. To properly examine the performance of tensor approximation methods, I distinguish between two forms of linear convergence — Q-linear convergence and R-linear

convergence. Suppose a sequence {x^(k)}_{k∈N} converges to x_* under a metric m.

Definition 3.1.1. The sequence {x^(k)}_{k∈N} exhibits Q-linear convergence if and only if there exists some q ∈ (0, 1) such that lim_{k→∞} m(x^(k+1), x_*) / m(x^(k), x_*) = q.

Definition 3.1.2. The sequence {x^(k)}_{k∈N} exhibits R-linear convergence if and only if the sequence of "errors" {m(x^(k), x_*)}_{k∈N} is dominated by a sequence {ε^(k)}_{k∈N} ⊂ (0, ∞) that converges Q-linearly to zero.

Rather than examining sequences, one may wish to examine the method itself. If a sequence is produced by repeated application of an affine map and if that map is a contraction, then the sequence converges Q-linearly.

3.2 SOR-ALS

The ALS method may be considered as an application of the nonlinear block Gauss-Seidel (GS) method to tensor approximation. The nonlinear block GS method iteratively minimizes an objective function f : R^n → R by minimizing over blocks of variables. That is, the space R^n is partitioned into blocks of variables; each iteration of nonlinear block GS holds constant the variables in all save one block, and minimizes the objective function over the free variables. Various improvements have been made to the linear GS method. In particular, the SOR method — which linearly extrapolates after each block minimization — often provides faster convergence than does GS. Such statements of faster convergence are rarely supported by theory, however. Though much work has been dedicated to demonstrating superiority of SOR over GS, the resulting proofs require strict assumptions about the system of equations. I define below Algorithm 7, a slight generalization of Block Nonlinear SOR. For generality, Algorithm 7 optimizes not over blocks of variables, but over linear subspaces

V(j) of R^n. Though this generalization is not necessary for examination of the basic ALS method, it unifies several variants created by visiting blocks of variables repeatedly or in varying orders. To create the Block Nonlinear SOR method, one can define the spaces V(j) to be composed of those vectors for which every entry is zero if it is outside of the j-th block of some partition of R^n.

Algorithm 7: Nonlinear SOR

Let V(j) be subspaces of R^n, and let ω ∈ R. Select an initial guess x^(1) ∈ R^n.
for k = 1 to ∞ do
    v^0 ← x^(k)
    for i = 1 to d do
        v^i ← v^{i−1} + ω arg min_{z∈V(i)} f(z + v^{i−1})
    end for
    x^(k+1) ← v^d
end for

Suppose Algorithm 7 is applied to minimize (3.1). If each space V(j) is selected to be

those tuples (v_1^1, ..., v_d^r) such that only v_j^1, ..., v_j^r are non-zero, then the block optimization step arg min_{z∈V(j)} f(z + v) holds fixed all variables except v_j^1, ..., v_j^r. That is, arg min_{z∈V(j)} f(z + v) is a linear least-squares problem. This is the SOR-ALS algorithm. Note that each microstep of SOR-ALS is a linear extrapolation of the corresponding ALS microstep. Consequently, the cost of an iteration of SOR-ALS — with any ω — is only negligibly greater than that of an iteration of ALS. Note also that if ω = 1, then SOR-ALS is exactly ALS. Before Algorithm 7 can be used, the parameter ω must be selected. One typically

desires that ω be selected such that the image { f(x^(k)) }_{k∈N} is a monotonic sequence. In the general case, where nothing is assumed about the objective function f, one can select ω using line search at each microstep. For SOR-ALS, however, a cheaper solution exists. Each microstep — that is, each block update — extrapolates linearly from the minimizer of a quadratic function, as in Fig. 3.1. One can thus control the rate of descent of SOR-ALS, as a fraction of the rate of descent of ALS at the same point, by bounding ω as follows:

Figure 3.1: f(v + ω arg min_{z∈V(j)} f(z + v)) as a function of ω.

Lemma 3.2.1. Let V be an inner product space on a subfield F ⊂ C, let U be a linear subspace of V, let u, v ∈ U and w ∈ V, and let θ ∈ R ∩ F. If v minimizes ‖v − w‖ on U, and θ ∈ [0, 2], then ‖(1 − θ)u + θv − w‖ ≤ ‖u − w‖.

Proof: The expression ‖(1 − θ)u + θv − w‖² is a quadratic polynomial on θ, and may be expanded as ‖u − w‖² + 2θ Re(⟨u − w, v − u⟩) + θ²‖v − u‖².

Because (1 − θ)u + θv ∈ U, the quadratic is minimized when θ = 1, by definition of v. Note that θ = 0 =⇒ ‖(1 − θ)u + θv − w‖² = ‖u − w‖². One may construct the vertex form of the quadratic using this point, the location of the minimum, and the leading coefficient:

‖(1 − θ)u + θv − w‖² = ‖u − w‖² − ‖v − u‖² + (θ − 1)²‖v − u‖²

= ‖u − w‖² + ((θ − 1)² − 1)‖v − u‖² .

If θ ∈ [0, 2], then (θ − 1)² − 1 ≤ 0, so ‖(1 − θ)u + θv − w‖ ≤ ‖u − w‖. □

If one holds constant all v_i^l except for v_j^1, ..., v_j^r, then Σ_{l=1}^{r} ⊗_{i=1}^{d} v_i^l is linear on v_j^1, ..., v_j^r. As a consequence, each microstep of SOR-ALS satisfies the conditions of Lemma 3.2.1. That is, if one selects ω ∈ [0, 2], then SOR-ALS produces a sequence

{x^(k)}_{k∈N} such that its image { f(x^(k)) }_{k∈N} is monotonic. A discussion on selection of ω is found in Section 3.4.
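For illustration, a single SOR-ALS microstep for one direction of a 3-way tensor might be sketched as follows, building on the same normal-equations block update used for ALS above (variable names are illustrative, not from the dissertation):

```python
import numpy as np

def sor_als_microstep_mode1(T, A, B, C, omega):
    """Compute the ALS minimizer for the first direction, then extrapolate by omega."""
    G = (B.T @ B) * (C.T @ C)                                       # Gram matrix of M_1
    A_als = np.linalg.solve(G, np.einsum('ijk,jl,kl->il', T, B, C).T).T
    # omega = 1 recovers the ALS microstep; by Lemma 3.2.1, any omega in [0, 2]
    # does not increase the approximation error.
    return A + omega * (A_als - A)
```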

3.3 Local Q-linear convergence of SOR-ALS

The analysis of any low-rank tensor approximation algorithm is complicated by the scaling and summation indeterminacies inherent in the low-rank approximation problem.

To be precise, one may multiply the factors of any separable tensor ⊗_{i=1}^{d} v_i by scalars λ_1, ..., λ_d ∈ K with 1 = Π_{i=1}^{d} λ_i to create an equivalent tensor ⊗_{i=1}^{d} λ_i v_i = ⊗_{i=1}^{d} v_i. Tensor addition is commutative, so one may rearrange the summands of any sum of separable tensors without altering the sum. Moreover, certain tensors of rank 2 or greater may admit additional equivalent decompositions not formed by rearrangement of summands [Moh13]. The scaling indeterminacy alone is enough to ensure that the Hessian matrix of ‖X − T_target‖² is singular. The difficulties stemming from the scaling indeterminacy may be mitigated by equalizing the norms of the factors of the summands. This does not, however, prevent the Hessian from being singular. Non-trivial equivalent decompositions, such as those in [Moh13], can increase the multiplicity of the zero eigenvalue of the Hessian matrix. Despite this, Uschmajew showed that if one equalizes the norms of the factors of each summand after each step, then ALS exhibits local Q-linear convergence [Usc12] under the energy seminorm. It is natural to ask whether SOR-ALS exhibits the same. Unfortunately, it is not trivial to extend Uschmajew's analysis of ALS to SOR-ALS. Uschmajew's analysis depends

1. on a linearization of nonlinear GS [BH03, Lemma 2] and

2. on linear GS being a contraction in the energy seminorm [LWXZ06, LWXZ08].

If Uschmajew’s proof is to be extended to SOR-ALS, both Items 1 and 2 must be replaced with equivalent statements for SOR. I must replace Item 1 with a linearization of nonlinear SOR. Unfortunately, though the literature contains various linearizations of nonlinear SOR, such results require conditions not satisfied by the low-rank approximation problem; see e.g. [VMP03]. Moreover, the method used to linearize nonlinear GS in [BH03, Lemma 2] cannot easily be extended to SOR. In Section 3.3.2, I will linearize the nonlinear SOR method under conditions satisfied by the low-rank tensor approximation problem. Item 2 is better addressed in the literature, but a perfect replacement is again missing.

The R-linear convergence of linear SOR in the $\ell_2$ norm was stated as a remark in [LT92].

One can use this to establish that linear SOR converges R-linearly in the energy seminorm, and that the composition of some finite number of steps of linear SOR is a contraction in the energy seminorm. Such statements would suffice to establish R-linear convergence of SOR-ALS, but are weaker than Item 2. For completeness, I will prove in Section 3.3.3 the stronger statement that linear SOR is a contraction in the energy seminorm, which replaces Item 2. In Section 3.3.1 I sketch Uschmajew’s proof from [Usc12]. In Section 3.3.4, I show how the results in Sections 3.3.2 and 3.3.3 are used to extend it to a proof of local Q-linear convergence of SOR-ALS.

3.3.1 Uschmajew’s proof of the local linear convergence of ALS

Uschmajew demonstrated that ALS will converge linearly if started sufficiently close to the target, under a technical assumption. This section, Section 3.3.1, contains a sketch of Uschmajew’s proof, with notation altered to match the rest of this chapter, and with certain results stated as lemmas. Uschmajew averted the scaling indeterminacy by equilibriating the factors of each summand of the approximation after each step. The equilibriation is performed by means of a function R : X → X. The function R will affect each separable summand identically.

Let $R_1 : \mathbb{R}^{n_1} \times \cdots \times \mathbb{R}^{n_d} \to \mathbb{R}^{n_1} \times \cdots \times \mathbb{R}^{n_d}$ be the map equilibriating the factors of a separable tensor. This map is defined by
$$(v_1, \ldots, v_d) \xmapsto{R_1} \begin{cases} \left( \dfrac{\delta}{\|v_1\|} v_1, \ldots, \dfrac{\delta}{\|v_d\|} v_d \right) & \text{if } \delta \neq 0 \\ (0, \ldots, 0) & \text{if } \delta = 0 \end{cases} , \qquad \delta = \prod_{i=1}^{d} \|v_i\|^{1/d} .$$

The map R for equilibriation of rank-r tensors may then be expressed as an application of $R_1$ to all separable summands:
$$(v_1^1, \ldots, v_d^r) \xmapsto{R} \left( R_1(v_1^1, \ldots, v_d^1), \ldots, R_1(v_1^r, \ldots, v_d^r) \right) .$$

Note that R is a projection that changes parameterizations of tensors, but not the tensors themselves. Specifically, R ◦ R ≡ R, and τ ◦ R ≡ τ. A parameterization x will be said to be equilibriated if x = Rx.

For the remainder of this section, $x_*$ will denote a local minimizer of f, and will be assumed to be equilibriated — that is, $R(x_*) = x_*$. Moreover, $S_1$ will denote the map corresponding to a full step of ALS. Finally, denote by Equilibriated ALS the method of iteratively applying the map $R \circ S_1$; that is, $x^{(k+1)} = (R \circ S_1)(x^{(k)})$.

To address the summation indeterminacy, Uschmajew employed an assumption, specifically [Usc12, Assumption 1], which generalizes to tensors of arbitrary degree — that is, tensors in $\mathbb{R}^{n_1 \times \cdots \times n_d}$ — as follows:

Assumption 3.3.1. The rank of $f''(x_*)$ equals $r\,(1 - d + \sum_{i=1}^{d} n_i)$.

If Assumption 3.3.1 holds, then the nullspace of $f''(x_*)$ contains only those vectors suggested by the scaling indeterminacy. That is, Assumption 3.3.1 ensures that an equilibriated local minimum is unique within an open neighborhood of itself, and that $\ker(f''(x_*)) = \ker(R'(x_*))$; see [Usc12].

Assumption 3.3.1 also has implications for a certain block representation of $f''(x_*)$. In this dissertation, references to “blocks” of $f''(x_*)$ refer to those blocks created by partitioning the Hessian matrix $f''(x_*)$ into blocks corresponding to tensor directions, unless explicitly stated otherwise.

Lemma 3.3.2 (Adapted from [Usc12]). If Assumption 3.3.1 holds, then the block diagonal of $f''(x_*)$ is positive definite.

Proof: If one selects a vector h of parameters such that exactly one block is non-zero — that is, $h = (\ldots, 0, v_i^1, \ldots, v_i^r, 0, \ldots) \neq 0$ — then $R(h + x_*) \neq x_*$. Because $h \notin \ker(R'(x_*))$, Assumption 3.3.1 guarantees that $\langle f''(x_*)h, h\rangle > 0$. Thus $f''(x_*)$ has a positive-definite block diagonal. □

Lemma 3.3.3 (From [Usc12]). Assume that Assumption 3.3.1 holds. Then the ALS operator $S_1$ is well-defined and continuously differentiable in some neighborhood of $x_*$. Moreover, $x_*$ is a fixed point of $S_1$ and we have
$$S_1'(x_*) = -(D + L)^{-1} L^* ,$$
where L is the strict lower block triangular part of $f''(x_*)$, and D is the block diagonal part of $f''(x_*)$. In a possibly smaller neighborhood the composition $R \circ S_1$ is well-defined and continuously differentiable.

Proof: Each microstep of ALS is a quadratic minimization problem, the parameters of which depend smoothly on the starting point x. The Hessian matrix of the minimized quadratic is a diagonal block of $f''(x)$. Lemma 3.3.2 and continuity of $f''$ ensure that this is nonsingular if x is sufficiently close to $x_*$, so the minimization has a unique solution that depends smoothly on x.

The point $x_*$ is a fixed point of each microstep, so within some neighborhood of $x_*$ all d microsteps may be taken consecutively with unique solutions. The map $S_1$ is then well-defined and smooth on this neighborhood.

Because $S_1$ is smooth and $S_1(x_*)$ is in the open set on which R is smooth, $R \circ S_1$ is smooth on some neighborhood of $x_*$.

That $S_1'(x_*) = -(D + L)^{-1}L^*$ is established in [BH03, Lemma 2].¹ □

The energy seminorm $|\cdot|_{f''(x_*)}$ — defined by $x \mapsto \langle x, f''(x_*)x\rangle$ — may be well-suited to analysis of optimization methods. However, a seminorm may not be suitable for statements of convergence. For the low-rank tensor approximation problem, one may employ the energy seminorm and R to construct a vector norm well-suited to the problem. Define the norm $|\cdot|_*$ by
$$|x|_*^2 = \|(I - R'(x_*))x\|^2 + |x|_{f''(x_*)}^2 .$$

¹ Alternatively, that $S_1'(x_*) = -(D + L)^{-1}L^*$ follows from Corollary 3.3.8 with ω = 1.

Near $x_*$, convergence in $|\cdot|_*$ may be shown by examining the energy norm of $f''(x_*)$, as is stated by the following lemma.

Lemma 3.3.4 (Adapted from [Usc12]). Let $s : X \to X$ be continuously differentiable with a fixed point at $x_*$. Then $|(R \circ s)'(x_*)h|_* \le |s'(x_*)|_{f''(x_*)}\, |h|_*$ for all $h \in X$.

Proof: By definition, $f(R(x_* + h)) = f(x_* + h)$ for all h, and thus $(f \circ R)''(x_*) = f''(x_*)$. Because $f'(R(x_*)) = f'(x_*) = 0$, it holds that
$$(f \circ R)''(x_*) = R'(x_*)^T (f'' \circ R)(x_*)\, R'(x_*) = R'(x_*)^T f''(x_*)\, R'(x_*) .$$
It follows that
$$|R'(x_*)h|_{f''(x_*)}^2 = \langle f''(x_*)R'(x_*)h,\, R'(x_*)h\rangle = \langle f''(x_*)h,\, h\rangle = |h|_{f''(x_*)}^2 .$$
By [Usc12, Lemma 3.1], $(I - R'(x_*))R'(x_*) = 0$. Thus, for all $h \in X$ it holds that
$$|(R \circ s)'(x_*)h|_* = |R'(x_*)s'(x_*)h|_* = |s'(x_*)h|_{f''(x_*)} \le |s'(x_*)|_{f''(x_*)}\, |h|_* . \qquad \Box$$



The local linear convergence of an iterative method may be established if the Jacobian matrices of the method at its fixed points are contractions in some vector norm. Lemma 3.3.4 establishes that the performance of an equilibriated method as measured by the $|\cdot|_*$ norm may be bounded by the performance of the base method as measured by the energy seminorm $|\cdot|_{f''(x_*)}$.

Theorem 3.3.5 (From [Usc12]). The sequence $\{x^{(k)}\}_{k\in\mathbb{N}}$ produced by Equilibriated ALS exhibits Q-linear convergence in some neighborhood of a point $x_*$ for which Assumption 3.3.1 holds. Further, the asymptotic rate of convergence is $|S_1'(x_*)|_{f''(x_*)} < 1$ for the sequence of parameters $\{x^{(k)}\}_{k\in\mathbb{N}}$.

Proof: By Lemma 3.3.3, $S_1$ is well-defined and continuous in a neighborhood of $x_*$. By Lemma 3.3.4, every $h \in X$ has
$$|(R \circ S_1)'(x_*)h|_* \le |S_1'(x_*)|_{f''(x_*)}\, |h|_* .$$
Moreover, Lemma 3.3.2 ensures that $f''(x_*)$ is positive semi-definite with positive definite block diagonal. Under this condition, $|S_1'(x_*)|_{f''(x_*)} < 1$, as is established in [LWXZ06, Theorem 3.2].² □

In practice, $|S_1'(x_*)|_{f''(x_*)}$ is rarely known. One may estimate it, however, by examining the sequence of parameters $\{x^{(k)}\}_{k\in\mathbb{N}}$, or by examining the sequence of tensors $\{\tau(x^{(k)})\}_{k\in\mathbb{N}}$. The latter also converges at least linearly, as is established by the following lemma.

Lemma 3.3.6 (Adapted from [Usc12]). If a sequence $\{x^{(k)}\}_{k\in\mathbb{N}} \subset X$ converges at least linearly to $x_*$ with rate at most q, then $\{\tau(x^{(k)})\}_{k\in\mathbb{N}}$ converges at least linearly to $\tau(x_*)$ with rate at most q.

Proof: The function τ is Lipschitz-continuous on any compact subset of X, so
$$\limsup_{k\to\infty} \|\tau(x^{(k)}) - \tau(x_*)\|^{1/k} \le \limsup_{k\to\infty} |x^{(k)} - x_*|_*^{1/k} \le q . \qquad \Box$$

3.3.2 Linearization of nonlinear SOR

The Jacobian of a full step of ALS is identical to that of a full step of GS [BH03]. In this section, I show the same for the microsteps of ALS and GS. It is then trivial to extend the microstep result to a statement that the Jacobians of the full steps of SOR-ALS and SOR are also identical. Given an inner product space X and a linear subspace $V \subset X$, we say that a linear map $A : X \to X$ is positive-definite with respect to V if and only if $\langle v, Av\rangle > 0$ for all $v \in V \setminus \{0\}$.

² Alternatively, that $|S_1'(x_*)|_{f''(x_*)} < 1$ follows from Corollary 3.3.12 with ω = 1.

Theorem 3.3.7. Let $f : \mathbb{R}^n \to \mathbb{R}$ be thrice continuously differentiable, let V be a linear subspace of $\mathbb{R}^n$, and let $\alpha, \gamma : \mathbb{R}^n \to \mathbb{R}^n$ be the maps defined by
$$x \xmapsto{\alpha} x + \arg\min_{v \in V} |x + v|_{f''(x_*)} , \qquad x \xmapsto{\gamma} x + \arg\min_{v \in V} f(x + v) ,$$
where $x_*$ is a fixed point of the nonlinear GS method and $f''(x_*)$ is the Hessian matrix of f at $x_*$.

If the Hessian matrix $f''(x_*)$ is positive semi-definite and is positive-definite with respect to V, and if for each $x \in \mathbb{R}^n$ the restricted function $f|_{\{x+v \,\mid\, v\in V\}}$ admits a unique local minimum at γ(x), then the Jacobian matrix $J\gamma(x_*)$ of γ is identical to $J\alpha(0)$:
$$J\gamma(x_*) = J\alpha(0) .$$

Proof: By the condition that $x_*$ must be a fixed point of the nonlinear GS method, $\nabla f(x_*) = 0$. To simplify the proof of the Theorem, assume without loss of generality that $x_* = 0$ and that $f(0) = 0$. Under these assumptions, the Taylor expansion of f at 0 is
$$f(x) = x^* A x + R_2(x) ,$$
where $A = \tfrac{1}{2} f''(0)$ and $R_2$ is the remainder of the Taylor series. Note that $R_2$ is continuous, by the assumption that f is thrice continuously differentiable.

Because A is positive definite with respect to V, $f(0) = \min_{u \in U} f(u)$ for some open neighborhood $U \subset V$ of 0. From this and the assumption that $f|_{\{x+v \,\mid\, v\in V\}}$ has a unique minimum, it follows that $\gamma(0) = 0$. Using the definition of the derivative to examine the Jacobian matrix $J\gamma(0)$,
$$J\gamma(0)\, x = \lim_{t\to 0} \frac{\gamma(tx) - \gamma(0)}{t} = \lim_{t\to 0} \frac{\gamma(tx)}{t} .$$
The techniques of matrix calculus allow formulation of α as a linear map, so $J\alpha(0)\, x = \alpha(x)$. To demonstrate that the Theorem holds, it suffices to show that
$$\lim_{t\to 0} \frac{\gamma(tx)}{t} = \alpha(x) \tag{3.2}$$

for all nonzero $x \in \mathbb{R}^n$. We will do so by bounding the remainder $t^{-1}\gamma(tx) - \alpha(x)$ in terms of t and of a bound K on $R_2$. As $t \to 0$, the remainder must then also approach 0.

Before we examine γ, we first simplify the objective function f. For any $v \in V$,
$$f(t\alpha(x) + tv) = |t|^2 (\alpha(x) + v)^* A (\alpha(x) + v) + R_2(t\alpha(x) + tv) = |t|^2 \alpha(x)^* A\alpha(x) + 2|t|^2 \operatorname{Re}(v^* A\alpha(x)) + |t|^2 v^* A v + R_2(t\alpha(x) + tv) .$$
Because α(x) minimizes $|\cdot|_A$ over $x + V$, we have that $\operatorname{Re}(v^* A\alpha(x)) = 0$ for all $v \in V$, for otherwise the inequality $|\alpha(x) + v|_A < |\alpha(x)|_A$ would hold for some small $v \in V$. Using this, the objective function may be reduced to three terms:
$$f(t\alpha(x) + tv) = |t|^2 \alpha(x)^* A\alpha(x) + |t|^2 v^* A v + R_2(t\alpha(x) + tv) . \tag{3.3}$$

With the objective function f now reduced to only its most important terms, we turn our attention to $t^{-1}\gamma(tx)$. Note that α has been defined such that $\alpha(x) - x \in V$, so
$$t^{-1}\gamma(tx) = x + \arg\min_{v\in V} f(tx + tv) = \alpha(x) + \arg\min_{v\in V} f(t\alpha(x) + tv) = \alpha(x) + \arg\min_{v\in V} \left( v^* A v + |t|^{-2} R_2(t\alpha(x) + tv) \right) .$$
Though our analysis will ultimately depend on the nature of the term $v^* A v$, it must first be shown that the term $|t|^{-2} R_2(t\alpha(x) + tv)$ does not preclude (3.2). We will use continuity of the remainder term $R_2$ and restrictions on v and t to bound $R_2(t\alpha(x) + tv)$. Select some ε > 0. The set
$$\{\, \alpha(x) + v \mid v \in V \text{ and } \|\alpha(x) + v\| \le \epsilon + \|\alpha(x)\| \,\}$$
is compact and $R_2$ is continuous on this set, so there exists K > 0 such that if $\|v\| \le \epsilon$ and $|t| \le 1$, then
$$|R_2(t\alpha(x) + tv)| \le |t|^3 K . \tag{3.4}$$

Denote by $\sigma_{\min}$ the infimum of $w^* A w$ over unit vectors $w \in V$. Note that $\sigma_{\min} > 0$ because A is positive definite with respect to V. Let
$$\delta = \min\{\, \epsilon^2 \sigma_{\min} / 3K ,\ 1 \,\} .$$
If 0 < |t| < δ and $v \in V$ with $\|v\| = \epsilon$, the following inequality — which combines (3.3) and (3.4) — holds:
$$\frac{f(t\alpha(x) + tv)}{|t|^2} \ge \epsilon^2\sigma_{\min} - |t|^{-2}|R_2(t\alpha(x) + tv)| + \alpha(x)^* A\alpha(x) \ge \frac{2\epsilon^2\sigma_{\min}}{3} + \frac{f(t\alpha(x))}{|t|^2} - |t|^{-2}|R_2(t\alpha(x))| \ge \frac{\epsilon^2\sigma_{\min}}{3} + \frac{f(t\alpha(x))}{|t|^2} . \tag{3.5}$$

The set $\{\, v \in V \mid \|v\| \le \epsilon \,\}$ is compact, so the continuous map defined by $v \mapsto f(t\alpha(x) + tv)$ is minimized over the set at some point $v_t$. By definition, $f(t\alpha(x) + tv_t) \le f(t\alpha(x))$. If $\|v_t\|$ were equal to ε, this would contradict (3.5), so $\|v_t\| < \epsilon$. Because this minimizer is contained in an open subset of V, the point $t(\alpha(x) + v_t)$ is a local minimizer of f over $\{\, tx + v \mid v \in V \,\}$. By the assumptions of the Theorem, this is then the only such minimizer — that is, $\gamma(tx) = t(\alpha(x) + v_t)$. From this, it follows that
$$\|t^{-1}\gamma(tx) - \alpha(x)\| \le \epsilon \tag{3.6}$$
for all 0 < |t| < δ.

Suppose that $v = t^{-1}\gamma(tx) - \alpha(x)$ and 0 < |t| < δ. Inequality (3.6) implies that (3.4) holds; it then follows that $|t|^{-2} R_2(\gamma(tx)) \ge -|t| K$. Combining this with (3.3) gives
$$|t|^{-2} f(\gamma(tx)) \ge \alpha(x)^* A\alpha(x) + (t^{-1}\gamma(tx) - \alpha(x))^* A (t^{-1}\gamma(tx) - \alpha(x)) - |t| K \ge \alpha(x)^* A\alpha(x) + \sigma_{\min}\|t^{-1}\gamma(tx) - \alpha(x)\|^2 - |t| K . \tag{3.7}$$

Suppose that v = 0. Combining (3.3) and (3.4) gives
$$|t|^{-2} f(t\alpha(x)) \le \alpha(x)^* A\alpha(x) + |t| K . \tag{3.8}$$
Finally, γ is defined such that $f(t\alpha(x)) \ge f(\gamma(tx))$. Combining this with (3.8) and (3.7) gives
$$0 \le |t|^{-2}\left( f(t\alpha(x)) - f(\gamma(tx)) \right) \le 2K|t| - \sigma_{\min}\|t^{-1}\gamma(tx) - \alpha(x)\|^2 ,$$
which then implies that
$$\sigma_{\min}\|t^{-1}\gamma(tx) - \alpha(x)\|^2 \le 2K|t| .$$
Because $\sigma_{\min} > 0$, it follows that $t^{-1}\gamma(tx) \to \alpha(x)$ as $t \to 0$. □

Theorem 3.3.7 establishes that the microsteps of the nonlinear GS method may be linearized in a straightforward manner – they correspond to the microsteps of the linear GS method – provided that each microstep has an unambiguous solution. Though this was previously known for the full steps of nonlinear GS [BH03], that result is difficult to extend. By linearizing the microsteps individually, however, the result may be extrapolated to form a linearization of the microsteps of nonlinear SOR. This in turn may be employed to linearize a full step of nonlinear SOR.

Corollary 3.3.8. Let f, $x_*$, and $Hf(x_*)$ fulfill the conditions of Theorem 3.3.7, and let $\omega \in \mathbb{R}$. Letting $V_1, \ldots, V_m \subset \mathbb{R}^n$ be linear subspaces corresponding to blocks $1, \ldots, m$, and defining for $i = 1, \ldots, m$ maps $\alpha_i, \gamma_i : \mathbb{R}^n \to \mathbb{R}^n$ by
$$x \xmapsto{\alpha_i} x + \omega \arg\min_{v \in V_i} |x + v|_{Hf(x_*)} , \qquad x \xmapsto{\gamma_i} x + \omega \arg\min_{v \in V_i} f(x + v) ,$$
we have that the maps $\alpha = \alpha_m \circ \cdots \circ \alpha_1$ and $\gamma = \gamma_m \circ \cdots \circ \gamma_1$ satisfy
$$J\gamma(x_*) = J\alpha(0) .$$

Proof: Follows immediately from Theorem 3.3.7, linearity of derivatives, and the chain rule. □

3.3.3 Linear SOR and the energy seminorm

The literature contains many results on the convergence of SOR, but I have been unable to find a statement that SOR is a contraction in the energy seminorm when used to approximate the solution of a singular system. The SOR method has been shown to converge in the $\ell_2$ norm even if applied to a singular system [LT92]. This can be used to show that SOR would reduce the energy seminorm in finitely many steps, but it would be a weaker result than that shown in this section. Alternatively, arguments that GS is a contraction in the energy seminorm, such as that in [LWXZ08], could potentially be adapted to SOR; this would be far more complicated than proving the result directly.

Consider a linear system Ax = b, where $A \in \mathbb{K}^{n\times n}$ is a positive semi-definite Hermitian matrix and $b \in \mathbb{K}^n$ is a constant vector. Let $V^{(1)}, \ldots, V^{(m)}$ be linear subspaces of $\mathbb{K}^n$, and let $P^{(j)}$ be an orthogonal projector onto $V^{(j)}$. Define the set $Z_{j,v} \subset \mathbb{K}^n$ of those minimizers of $|\cdot|_A$ that can be found by adding to v a vector in $V^{(j)}$ as
$$Z_{j,v} = \arg\min_{z \in V^{(j)} + v} |z|_A .$$

Each $z \in Z_{j,v}$ minimizes $|\cdot|_A$ over the affine subspace $V^{(j)} + v$ of $\mathbb{K}^n$. If $|Z_{j,v}| > 1$, there is not a unique minimizer. This ambiguity may be resolved by selecting the $z \in Z_{j,v}$ with $\|v - z\|$ minimized. For each $j = 1, \ldots, m$, define matrices
$$Q_\omega^{(j)} = I_n - \omega \left( P^{(j)} A P^{(j)} \right)^\dagger P^{(j)} A . \tag{3.9}$$

This is created as $Q_\omega^{(j)} v = v + \omega x$, where x is the solution of $\frac{d}{dx} |P^{(j)} x + v|_A^2 = 0$ that minimizes $\|x\|$. This ensures that $\|(I - P^{(j)})x\| = 0$, so $x = P^{(j)} x \in V^{(j)}$, and that $x + v \in Z_{j,v}$. Then
$$Q_\omega^{(j)} v = (1 - \omega)v + \omega \arg\min_{z \in Z_{j,v}} \|z - v\| .$$

That is, $Q_1^{(j)} v$ minimizes $|\cdot|_A$ over the affine subspace $V^{(j)} + v$ of $\mathbb{K}^n$. Moreover, $Q_\omega^{(j)} v = v$ if v is a minimizer of $|\cdot|_A$ over $V^{(j)} + v$. Finally, define the product
$$Q_\omega = Q_\omega^{(m)} \cdots Q_\omega^{(2)} Q_\omega^{(1)} . \tag{3.10}$$

As a special case, (3.10) determines the convergence of SOR. In particular, if one selects a partition of $\mathbb{K}^n$ into m blocks, then each subspace $V^{(j)}$ may be selected as the set of vectors which are zero except in the j-th block of $\mathbb{K}^n$. If also A has a nonsingular block diagonal, then each step of SOR is described by the first-order difference equation
$$v^{k+1} = Q_\omega v^k + c ,$$
where $c \in \mathbb{K}^n$ is some constant vector.

Lemma 3.3.9. Let $\mathbb{K} = \mathbb{R}$ or $\mathbb{C}$ and let $A : \mathbb{K}^n \to \mathbb{K}^n$ be a Hermitian and positive semi-definite matrix. If $0 < \omega < 2$ and $\operatorname{span}\left(\bigcup_{j=1}^m V^{(j)}\right) = \mathbb{K}^n$, then for all $x \in \mathbb{K}^n$, the following hold:
$$|Q_\omega x|_A \le |x|_A , \quad \text{and} \quad |Q_\omega x|_A = |x|_A \iff |x|_A = 0 .$$

Proof: By definition as a minimizer of $|\cdot|_A$, it holds that $|Q_1^{(j)} x|_A \le |x|_A$ for all $x \in \mathbb{K}^n$. To determine that this inequality is strict for certain x, consider the gradient of the squared energy seminorm $|\cdot|_A^2$. If there exists $v \in V^{(j)}$ such that $\langle v, \nabla |x|_A^2 \rangle \neq 0$, then $|Q_1^{(j)} x|_A < |x|_A$. If no such v exists, then x minimizes $|\cdot|_A$ over $V^{(j)} + x$, so $Q_1^{(j)} x = x$. To establish these inequalities for $\omega \neq 1$, use the characterization
$$Q_\omega^{(j)} = (1 - \omega) I_n + \omega Q_1^{(j)} .$$
For real ω, the expression $|Q_\omega^{(j)} x|_A^2$ is quadratic in ω:
$$|Q_\omega^{(j)} x|_A^2 = (Q_\omega^{(j)} x)^* A\, Q_\omega^{(j)} x = x^* \big((1-\omega)I_n + \omega Q_1^{(j)}\big)^* A \big((1-\omega)I_n + \omega Q_1^{(j)}\big) x = |1-\omega|^2\, x^* A x + 2(\omega - |\omega|^2)\operatorname{Re}\big(x^* Q_1^{(j)*} A x\big) + |\omega|^2\, x^* Q_1^{(j)*} A Q_1^{(j)} x .$$
This is minimized at ω = 1 by definition of $Q_1^{(j)}$; moreover, $Q_0^{(j)} x = x$, so $|Q_0^{(j)} x|_A = |x|_A$. Therefore, if 0 < ω < 2, then $|Q_\omega^{(j)} x|_A \le |x|_A$, with equality if and only if $|Q_1^{(j)} x|_A^2 = |x|_A^2$.

I now consider the full step $Q_\omega = Q_\omega^{(m)} \cdots Q_\omega^{(2)} Q_\omega^{(1)}$.

That $|Q_\omega x|_A \le |x|_A$ follows trivially from the corresponding microstep inequality $|Q_\omega^{(j)} x|_A \le |x|_A$. Because $|\cdot|_A$ is non-negative,
$$|Q_\omega x|_A = |x|_A \impliedby |x|_A = 0 .$$
It remains to be shown that $|Q_\omega x|_A = |x|_A \implies |x|_A = 0$. I show this by proving its contrapositive. If $|x|_A \neq 0$, then $\nabla |x|_A^2 \neq 0$. Let $i \in \mathbb{N}$ be the least natural number for which there exists $v \in V^{(i)}$ with $\langle v, \nabla |x|_A^2 \rangle \neq 0$. As was shown earlier, this implies $|Q_\omega^{(i)} x|_A < |x|_A$, and thus that $|Q_\omega^{(m)} \cdots Q_\omega^{(i)} x|_A < |x|_A$. For $j < i$, however, no such vector $v \in V^{(j)}$ exists, so $Q_\omega^{(i-1)} \cdots Q_\omega^{(2)} Q_\omega^{(1)} x = x$. Thus $|x|_A \neq 0 \implies |Q_\omega x|_A < |x|_A$. This completes the proof. □

Remark 3.3.10. Lemma 3.3.9 requires only that each block be visited at least once per step. That is, it holds not only for cyclic block updates, but also under an almost-cyclic update rule.

Lemma 3.3.11. Let $\mathbb{K} = \mathbb{R}$ or $\mathbb{C}$, and let $A, Q : \mathbb{K}^n \to \mathbb{K}^n$ be matrices, with A Hermitian and positive semi-definite. If $|Qx|_A \le |x|_A$ for all $x \in \mathbb{K}^n$, and $|Qx|_A = |x|_A \iff |x|_A = 0$, then
$$|Q|_A < 1 .$$

Proof: Because A is Hermitian, $\operatorname{im}(A) = \ker(A)^\perp$. It follows that every $x \in \mathbb{K}^n$ admits an orthogonal decomposition $x = u_0 + u_+$, with $u_0 \in \ker(A)$ and $u_+ \in \operatorname{im}(A)$. Because $|Qx|_A \le |x|_A$ for every $x \in \mathbb{K}^n$, we have
$$0 \le u_0^* Q^* A Q u_0 \le u_0^* A u_0 = 0 .$$
Because A is Hermitian and positive semi-definite, it admits a factorization $A = L L^*$. From this it follows that $L^* Q u_0 = 0$, for otherwise $u_0^* Q^* A Q u_0$ would be an inner product of a non-zero vector with itself. As a consequence, $A Q u_0 = 0$. Expanding $|Qx|_A^2 = x^* Q^* A Q x$ gives
$$u_0^* Q^* A Q u_0 + u_+^* Q^* A Q u_0 + u_0^* Q^* A Q u_+ + u_+^* Q^* A Q u_+ ,$$
which simplifies to $u_+^* Q^* A Q u_+ = |Q u_+|_A^2$. That is, $|Qx|_A^2 = |Q u_+|_A^2$. Then
$$|Q|_A^2 = \sup_{z \in \mathbb{K}^n,\ |z|_A \le 1} |Qz|_A^2 = \sup_{z \in \operatorname{im}(A),\ |z|_A \le 1} |Qz|_A^2 < 1 .$$
The final inequality follows from compactness of the set $\{\, z \in \operatorname{im}(A) \mid |z|_A \le 1 \,\}$. □

Corollary 3.3.12. Let $\mathbb{K} = \mathbb{R}$ or $\mathbb{C}$, and let $A : \mathbb{K}^n \to \mathbb{K}^n$ be a Hermitian and positive semi-definite matrix. If $0 < \omega < 2$ and $\mathbb{K}^n = \operatorname{span}\left(\bigcup_{j=1}^m V^{(j)}\right)$, then
$$|Q_\omega|_A < 1 .$$

Proof: Corollary 3.3.12 follows immediately from Lemmas 3.3.9 and 3.3.11. □

Thus SOR is a contraction under the energy seminorm.
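The contraction can be observed numerically. The sketch below (assuming a small randomly generated singular positive semi-definite matrix and a two-block coordinate partition; all names and sizes are illustrative, not from the dissertation's experiments) builds the microstep matrices of (3.9), composes them as in (3.10), and estimates $|Q_\omega|_A$ by sampling vectors in the image of A.

```python
import numpy as np

rng = np.random.default_rng(0)
n, blocks = 6, [range(0, 3), range(3, 6)]
B = rng.standard_normal((4, n))
A = B.T @ B                        # Hermitian, positive semi-definite, rank 4 < n

def Q_micro(j, omega):
    # Q_omega^(j) = I - omega * (P A P)^+ P A, P the projector onto block j; see (3.9).
    P = np.zeros((n, n))
    P[np.ix_(blocks[j], blocks[j])] = np.eye(len(blocks[j]))
    return np.eye(n) - omega * np.linalg.pinv(P @ A @ P) @ (P @ A)

def energy(x):
    return float(np.sqrt(x @ A @ x))

omega = 1.6
Q = Q_micro(1, omega) @ Q_micro(0, omega)    # one full sweep, as in (3.10)

ratios = []
for _ in range(2000):
    x = A @ rng.standard_normal(n)           # restrict to im(A)
    if energy(x) > 1e-12:
        ratios.append(energy(Q @ x) / energy(x))
print("estimated |Q_omega|_A =", max(ratios))  # observed to be strictly below 1
```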

3.3.4 Local linear convergence of SOR-ALS

I now replace the results upon which Uschmajew’s result was built. The replacements, collected in Sections 3.3.2 and 3.3.3, provide for SOR statements equivalent to those made about GS. As such, extending Uschmajew’s results to hold also for SOR-ALS requires little additional effort.

Lemma 3.3.13. Lemma 3.3.3 holds also for SOR-ALS. Specifically, if Assumption 3.3.1 holds, then the SOR-ALS operator $S_\omega$ is well-defined and continuously differentiable in some neighborhood of $x_*$. Moreover, $x_*$ is a fixed point of $S_\omega$ and we have
$$S_\omega'(x_*) = (D + \omega L)^{-1}\left((1 - \omega)D - \omega L^*\right) ,$$
where L is the strict lower block triangular part of $f''(x_*)$, and D is the block diagonal part of $f''(x_*)$. In a possibly smaller neighborhood the composition $R \circ S_\omega$ is well-defined and continuously differentiable.

Proof: Each microstep of SOR-ALS is a linear extrapolation of the microstep of ALS. As such, the fixed points $x_*$ of SOR-ALS are exactly those of ALS. Moreover, each microstep of SOR-ALS possesses a unique solution which depends smoothly on the input. Thus, there is some neighborhood of $x_*$ on which $S_\omega$ and $R \circ S_\omega$ are well-defined and smooth.

That $S_\omega'(x_*) = (D + \omega L)^{-1}((1 - \omega)D - \omega L^*)$ follows from Corollary 3.3.8. □

Theorem 3.3.14. Theorem 3.3.5 holds also for SOR-ALS. Specifically, the sequence $\{x^{(k)}\}_{k\in\mathbb{N}}$ produced by Equilibriated SOR-ALS exhibits Q-linear convergence in some neighborhood of a point $x_*$ for which Assumption 3.3.1 holds. Further, the asymptotic rate of convergence is $|S_\omega'(x_*)|_{f''(x_*)} < 1$ for the sequence of parameters $\{x^{(k)}\}_{k\in\mathbb{N}}$.

Proof: One need only replace $S_1$ by $S_\omega$ and replace [LWXZ06, Theorem 3.2] by Corollary 3.3.12 in the proof of Theorem 3.3.5. □

Thus SOR-ALS exhibits local linear convergence, and the rate of convergence may be bounded by examining the rate of convergence of SOR with system matrix $f''(x_*)$.
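As a rough illustration of how the rate depends on ω (a sketch under assumed block sizes, with a synthetic singular positive semi-definite matrix standing in for $f''(x_*)$; this is not the dissertation's experimental code), one can form the Jacobian of Lemma 3.3.13 directly and estimate its energy seminorm over a grid of ω:

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [3, 3, 4]                        # assumed block sizes, one per tensor direction
n = sum(sizes)
G = rng.standard_normal((n - 2, n))
H = G.T @ G                              # PSD and singular, like f''(x*)

# Block splitting H = L + D + L^* with respect to `sizes`.
edges = np.cumsum([0] + sizes)
D = np.zeros_like(H); L = np.zeros_like(H)
for i in range(len(sizes)):
    for j in range(len(sizes)):
        blk = H[edges[i]:edges[i+1], edges[j]:edges[j+1]]
        if i == j: D[edges[i]:edges[i+1], edges[j]:edges[j+1]] = blk
        if i >  j: L[edges[i]:edges[i+1], edges[j]:edges[j+1]] = blk

def energy_norm(M, samples=2000):
    # Estimate |M|_H over im(H) by random sampling.
    best = 0.0
    for _ in range(samples):
        x = H @ rng.standard_normal(n)
        hx = np.sqrt(x @ H @ x)
        if hx > 1e-12:
            best = max(best, np.sqrt((M @ x) @ H @ (M @ x)) / hx)
    return best

for omega in np.linspace(0.2, 1.9, 9):
    # S_omega'(x*) = (D + omega L)^{-1} ((1 - omega) D - omega L^*), from Lemma 3.3.13.
    S = np.linalg.solve(D + omega * L, (1 - omega) * D - omega * L.T)
    print(f"omega = {omega:.2f}   estimated rate |S'|_H = {energy_norm(S):.3f}")
```

Scanning such a curve is, in effect, what the adaptive rule of Section 3.4 approximates without ever forming $f''(x_*)$.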

3.4 Numerical results

In this Section, I examine the effects of the tuning parameter ω on the behavior of SOR-ALS. I then employ these observations to construct an adaptive method based on SOR-ALS. Lemma 3.3.13 and Theorem 3.3.14 establish that SOR-ALS converges linearly, under Assumption 3.3.1, and that the asymptotic rate of convergence is that of linear SOR with system matrix $f''(x_*)$. If one were to determine an optimal value for ω for linear SOR, then that value of ω would be optimal also for SOR-ALS.

Unfortunately, though much effort has been devoted to examining the rate of convergence of SOR [You50, XZ06] and optimal values of ω [Rei66, EV93, Had00, YG09], such results typically require that the system matrix fulfill strict conditions. In the context of low-rank tensor approximation, the matrix $f''(x_*)$ is singular, which precludes application of the majority of such results. In practice, one may select an arbitrary value in (0, 2) for ω, or attempt to estimate an optimal value for the parameter. Several aspects of the behavior of the SOR-ALS method ought to be addressed in a practical implementation of the method.

1. If ω > 1, the first few iterations of the SOR-ALS method may travel in a suboptimal direction, before reversing course.

2. The parameter ω significantly affects the performance of the method, as illustrated in Fig. 3.2.

Figure 3.2: Example rate of convergence ($q_\omega$) of SOR-ALS relative to the rate of convergence ($q_1$) of ALS, plotted as $\ln(q_\omega/q_1)$ against ω. If $\ln(q_\omega/q_1) < 0$, then SOR-ALS outperforms ALS at that value of ω.

Of these, Item 1 is most easily addressed. Performing a single iteration of ALS before employing SOR-ALS greatly increases the likelihood that SOR-ALS will travel in an appropriate direction, and thus mitigates Item 1.

Item 2 is more difficult to address. In theory, a near-optimal value can be selected for ω by examining $f''(x_*)$. In practice, even if $f''(x_*)$ were known, selecting ω would require optimization of a non-smooth function, as is illustrated in Fig. 3.2. To select a useful value for ω, I model its effects on behavior near a local minimizer $x_*$. Near $x_*$, the convergence of SOR-ALS is at least linear — see Theorem 3.3.5 and Corollary 3.3.8 — so one may examine the rate of linear convergence. The model can then be used to project near-optimal values for ω.

3.4.1 Modeling terminal behavior

To determine useful values of ω for use during the final approach to the target, I estimate the rate of linear convergence $q_\omega$ — defined as in Definition 3.1.1 — for sequences produced by the SOR-ALS algorithm with parameter ω. To simplify the experimental process, I examine behavior in the case that the rank of the target equals the rank of the approximation. I generate samples of the behavior of SOR-ALS as follows:

1. Generate a target T of rank $r \in \{1, \ldots, 16\}$ and order $d \in \{3, \ldots, 7\}$ and an initial approximation $X^{(0)}$ of equal rank. Select the initial approximation such that its parameterization $x^{(0)}$ is a perturbation of that of the corresponding target; this requires that the approximation’s rank be equal to the target’s rank.

2. Condition $x^{(0)}$ by performing a single microstep of ALS.

3. For each value of ω to be tested,

(a) Generate $x^{(1)}, \ldots, x^{(n)}$ by performing n iterations of SOR-ALS, with parameter ω.

(b) Estimate $q_\omega$ by using the ratio between the error $f(x^{(n)})$ and some previous error $f(x^{(m)})$ to approximate the ratio between successive error measurements. That is,
$$q_\omega \approx \left( \frac{f(x^{(n)})}{f(x^{(m)})} \right)^{1/(n-m)} .$$

4. Select the optimal value of ω for that target/approximation pair:
$$\omega_{\text{best}} = \arg\min_\omega q_\omega .$$

Ideally, n should be large, so that $q_\omega$ would closely approximate the asymptotic rate of linear convergence; I use n = 192 unless machine precision is reached, as explained below. In practice, using a large value of n may not be desired, as numerical error may prevent the error from decreasing below a certain point. If one were to continue to iterate after reaching machine precision, the result would not accurately reflect the rate at which error decreases on the approach to the target. To avoid this, I end any experiment at iteration k if $f(x^{(k)}) < 10^{-12}$ and adjust the calculation of $q_\omega$ accordingly; that is, I set n = k for that experiment.

The selection of m is arbitrary, provided that m < n. If m ≫ 0, then $q_\omega$ will be less affected by early iterations. Conversely, if m ≪ n, then $q_\omega$ will better reflect the average performance of SOR-ALS, rather than the apparent performance of a small number of iterations. I use m = 2.

I visualize the results in terms of the speed of SOR-ALS, which I define to be $-\ln(q_\omega)$, rather than its rate of convergence. Though the rate of convergence is well-known, the speed of the algorithm lends itself well to intuitive interpretation. In particular, if the speed is doubled, the time required for the algorithm to reach any desired level of error is halved. In absolute terms, if the speed of an algorithm is 1, then it requires approximately 7 iterations to reduce error by a factor of 1000.

One may also consider the speedup provided by using SOR-ALS instead of using ALS. This quantity, $\dfrac{-\ln(q_\omega)}{-\ln(q_1)}$, also lends itself well to intuitive interpretation. If the speedup is 2 for a particular target, initial approximation, and value of ω, then using that value of ω will cause SOR-ALS to reach any desired level of error in half the time required for ALS to do the same.
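These quantities can be computed directly from a recorded error sequence. The helper below is an illustrative transcription of the measurements just described; the synthetic error sequences and their rates are assumed for the example, not measured data.

```python
import numpy as np

def rate_estimate(errors, m=2):
    """errors[k] = f(x^(k)); uses q ~ (f(x^(n)) / f(x^(m)))^(1/(n-m)), as in step 3(b)."""
    n = len(errors) - 1
    return (errors[n] / errors[m]) ** (1.0 / (n - m))

def speed(q):
    return -np.log(q)

# Example with synthetic linearly converging error sequences (rates assumed).
f_als     = [0.5 * 0.90 ** k for k in range(50)]   # hypothetical ALS errors
f_sor_als = [0.5 * 0.75 ** k for k in range(50)]   # hypothetical SOR-ALS errors
q1, qw = rate_estimate(f_als), rate_estimate(f_sor_als)
print("speed of ALS     :", speed(q1))
print("speed of SOR-ALS :", speed(qw))
print("speedup          :", speed(qw) / speed(q1))
```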


Figure 3.3: Speedup of SOR-ALS compared to speed of ALS. Rank of target equals rank of approximation.

The results are shown in Fig. 3.3, and on a log scale in Fig. 3.4. Note that one sample, with ALS speed $1.6 \cdot 10^{-3}$, is omitted from the figure. In the figures, I graph the speedup provided by SOR-ALS against the speed of ALS for ω = 1.33, for ω = 1.66, and for ω selected optimally for each target/initial approximation pair. I note

1. that the speedup provided by SOR-ALS appears to be described — if in broad strokes — by a function of ω and the speed of ALS,

2. that for any ω > 1, two distinct behaviors exist depending on the speed of ALS,

3. that those cases in which SOR-ALS provides little or no improvement over ALS are those for which ALS already converges rapidly,


Figure 3.4: Speedup of SOR-ALS compared to speed of ALS. Rank of target equals rank of approximation.

4. that the point at which behavior changes (for a particular value of ω) appears easily predicted, and

5. that selecting a near-optimal value of ω will often more than double the speed of ALS.

If a constant value were selected for ω, then I expect that SOR-ALS would be outperformed by ALS in certain cases. However, as noted in Item 3, such cases are those in which ALS already converges rapidly. If one selects ω = 1.33, for example, then Fig. 3.4 suggests that SOR-ALS typically outperforms ALS unless the speed of ALS exceeds 2. That is, SOR-ALS outperforms ALS unless ALS requires fewer than 7 iterations to reduce error by a factor of $10^6$.

Item 1 hints that one might be able to select a good value of ω based on the speed of ALS. Indeed, graphing optimal values of ω — determined numerically — against the speed of ALS, as in Fig. 3.5, hints that one should select ω as a function of the speed of ALS. This is not surprising; if the problem is sufficiently well behaved, the optimal parameter for linear SOR is a function of the rate of convergence of the Jacobi method; see e.g. [You50]. The suggested function is not well-approximated by clean formulas such as arctan, so I approximate the function manually as
$$\omega_{\text{best}} \approx 1 + \begin{cases} 1 / \left( 1 + 2.95\,(-\ln(q_1))^{0.75} \right) & \text{if } q_1 \ge 0.273 , \\[4pt] 0.23\, e^{-(\ln(-\ln(q_1)))^{1.8}} & \text{if } q_1 < 0.273 . \end{cases} \tag{3.11}$$
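For reference, a direct transcription of (3.11) as reconstructed above (the second branch is written with the exponent applied to $\ln(-\ln q_1)$, which is how I read the garbled formula; the constants are those stated in the text):

```python
import math

def omega_best(q1):
    # q1 is the observed ALS rate of convergence, 0 < q1 < 1.
    speed = -math.log(q1)            # speed of ALS, -ln(q_1)
    if q1 >= 0.273:
        return 1.0 + 1.0 / (1.0 + 2.95 * speed ** 0.75)
    return 1.0 + 0.23 * math.exp(-(math.log(speed)) ** 1.8)

# For a slowly converging ALS run (q1 close to 1) the rule pushes omega toward 2;
# for a rapidly converging run it stays near 1.
print(omega_best(0.99), omega_best(0.5))
```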

Figure 3.5: Optimal ω compared to speed of ALS; experimentally determined optima and the approximation (3.11).

3.4.2 An adaptive algorithm

I now use the knowledge from Section 3.4.1 to construct an adaptive algorithm. This algorithm will periodically sample the rate of convergence of ALS and use (3.11) to select a parameter ω for use in SOR-ALS.

Unlike in Section 3.4.1, one cannot assume that the target and approximation are of equal rank, as the target’s rank is a priori unknown. Consequently, $f(x_*)$ cannot be assumed to be 0, and thus the ratio $f(x^{(k+1)})/f(x^{(k)})$ is meaningless. The rate of convergence of ALS, $q_1$, will instead be calculated using a ratio of deltas — that is, using
$$\frac{f(x^{(k+2)}) - f(x^{(k+1)})}{f(x^{(k+1)}) - f(x^{(k)})} .$$

Algorithm 8: Adaptive SOR-ALS
  Select an initial guess $x^{(0)} \in \mathbb{R}^n$ and some integer n > 3.
  k ← 0
  repeat
    for i = 1, 2, 3 do
      $x^{(k+i)} \leftarrow S_1(x^{(k+i-1)})$
    end for
    Estimate $q_1 \approx \dfrac{f(x^{(k+3)}) - f(x^{(k+2)})}{f(x^{(k+2)}) - f(x^{(k+1)})}$
    Calculate $\omega_{\text{best}}$ using (3.11).
    for i = 4 to n do
      $x^{(k+i)} \leftarrow S_{\omega_{\text{best}}}(x^{(k+i-1)})$
    end for
    k ← k + n
  until $f(x^{(k)})$ converges

The resulting algorithm, Algorithm 8, is neither significantly more complicated than ALS nor significantly more computationally intensive, yet it provides a notable increase in performance. The cost of calculating $q_1$ is negligible; one may reuse the inner products employed in the calculation of $x^{(k)}$, $x^{(k+1)}$, and $x^{(k+2)}$ to cheaply compute $f(x^{(k)})$, $f(x^{(k+1)})$, and $f(x^{(k+2)})$. Moreover, the cost of an iteration of SOR-ALS is exactly that of the combination of an iteration of ALS and a linear extrapolation of the parameters.
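A compact sketch of the outer loop of Algorithm 8 follows. Here `sor_als_sweep(x, omega)` and `f(x)` are hypothetical names assumed to be supplied by the surrounding implementation (a full SOR-ALS iteration with omega = 1 giving plain ALS, and the squared error), the tolerance-based stopping test stands in for "until $f(x^{(k)})$ converges", and `omega_best` is the helper sketched after (3.11).

```python
def adaptive_sor_als(x, f, sor_als_sweep, n=64, max_outer=100, tol=1e-13):
    for _ in range(max_outer):
        errs = [f(x)]
        for _ in range(3):                     # three plain ALS iterations
            x = sor_als_sweep(x, 1.0)
            errs.append(f(x))
        # Ratio of deltas: usable even when f(x*) is unknown and nonzero.
        denom = errs[-2] - errs[-3]
        q1 = (errs[-1] - errs[-2]) / denom if denom != 0 else 1.0
        omega = omega_best(min(max(q1, 1e-12), 1 - 1e-12))
        for _ in range(n - 3):                 # remaining iterations of the block
            x = sor_als_sweep(x, omega)
        if f(x) < tol:
            break
    return x
```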

I demonstrate the effectiveness of Algorithm 8 by comparing its performance to that of ALS. In Fig. 3.6, I apply the experimental method from Section 3.4.1 to compare the terminal behavior of Algorithm 8 to that of ALS. As expected, Algorithm 8 provides better terminal performance than does ALS in almost all cases.

Figure 3.6: Speedup of Algorithm 8 compared to speed of ALS. Rank of target equals rank of approximation.

In Fig. 3.7, I compare the performance of Algorithm 8 and ALS when used to

approximate a dense tensor in $\mathbb{R}^{3\times 3\times 4}$ using a rank 5 approximation. In the example, ω is recalculated every 64 iterations; that is, n = 64. Because the cost for each iteration of Algorithm 8 is only negligibly more than that of an iteration of ALS, I compare the two methods in terms of iterations, rather than CPU time. Clearly, Algorithm 8 can provide performance significantly better than that of ALS.

Figure 3.7: An example comparing ALS and Algorithm 8: squared error $f(x^{(k)})$ versus iteration k.

3.5 Concluding remarks

In this Chapter, I have examined the behavior of Nonlinear SOR when applied to the tensor approximation problem. Though extrapolation of ALS is not new, the application of Nonlinear SOR to the tensor approximation problem had not previously been studied. I have demonstrated that Uschmajew’s analysis of the local convergence of ALS may be applied also to SOR-ALS. In addition to the results related to SOR-ALS, I have addressed shortcomings of SOR-ALS by constructing Algorithm 8, an adaptive method that intelligently selects the extrapolation parameter ω. Algorithm 8 retains the advantages of ALS; it is simple to implement, numerically stable, and has a low computational cost for each iteration. Moreover, Algorithm 8 outperforms ALS, particularly in those cases where ALS would perform poorly.

4 On Convergence to Essential Singularities

4.1 Introduction

Historically, a variety of approaches have been employed to examine the convergence of sequences produced by iterative optimization methods. Some examine local convergence — whether small perturbations of a solution are corrected, and at what rate. Such results are typically proved directly, as in the case of the Newton-Raphson method, or by comparison to a method with known behavior [Coh72, Usc12]. Others attempt to show global convergence — that a particular method will produce a convergent sequence. Though global convergence can be proved directly in some cases [EHK, SYX16], more general approaches exist. Since 1971, the Wolfe conditions [Wol71] have been employed to ensure that the gradient of the objective function — evaluated at the points of a sequence — tends to zero. More recently, seminal work by Absil et al. [AMA05] employed the Łojasiewicz gradient inequality to provide a stronger result. For convenience, I state the gradient form of Stanisław Łojasiewicz’s theorem [Łoj59, Łoj93] below, incorporating an improvement from [SU15].

Theorem 4.1.1 (Łojasiewicz gradient inequality). Let f be a real-analytic function on a neighborhood of $x_*$ in $\mathbb{R}^n$. Then there are constants c > 0 and θ ∈ (0, 1/2] such that
$$|f(x) - f(x_*)|^{1-\theta} \le c\, \|\nabla f(x)\| \tag{4.1}$$
for any x in some neighborhood of $x_*$.

Though the constant θ can be determined in some cases [Brz15] and estimated in others [Gwo99, Ole13], it is in general a priori unknown.

The result of Absil et al. [AMA05] states that a sequence $\{x_k\}_{k\in\mathbb{N}} \subset \mathbb{R}^n$ converges if it has a cluster point and if there is some real-analytic function $f : \mathbb{R}^n \to \mathbb{R}$ such that the “descent conditions” (A1–A2) hold. The first descent condition, that
$$f(x_k) - f(x_{k+1}) \ge \sigma\, \|\nabla f(x_k)\|\, \|x_{k+1} - x_k\| \tag{A1}$$
for some σ > 0 and all sufficiently large k, ensures that the sequence of f-values decreases sufficiently quickly. The second descent condition prevents cycles by requiring that

$$f(x_k) = f(x_{k+1}) \implies x_k = x_{k+1} . \tag{A2}$$

Moreover, Uschmajew, et al. proved that if also there exists some κ > 0 such that for all sufficiently large k,
$$\|x_{k+1} - x_k\| \ge \kappa\, \|\nabla f(x_k)\| , \tag{A3}$$
then the rate of convergence of the sequence may be bounded [SU15, Usc15].

In recent years, the results of Absil, Uschmajew, et al. [AMA05, SU15, Usc15] have been of interest in the field of tensor approximation. Given a target tensor T, one may attempt to find the tensor of a fixed rank r which best approximates the target. This problem is often described as minimization of the “error” $\|T - \tau(x)\|$, where $x \in \mathbb{R}^n$ is a tuple of parameters, and τ is a multilinear map from parameters to tensors of rank r. If r = 1, there exist parameters x that minimize error, and several methods, including ALS, are known to produce convergent sequences [Usc15, LUZ15]. If r ≥ 2, then the approximation τ(x) is a sum of two or more separable tensors $X_1 + \cdots + X_r$. Because no orthogonality constraint is imposed on $X_1, \ldots, X_r$, there exist problems for which error cannot be minimized [BM05, VNVM14]. For these ill-posed problems, if $\|T - \tau(x_k)\|$ is to approach its infimum, the sequence of parameters $\{x_k\}_{k\in\mathbb{N}}$ must diverge. As yet, little intuition exists for the behavior of these sequences; in particular, it is not known whether the summands $X_1, \ldots, X_r$ maintain some steady configuration, changing little except in scale, or whether they cycle among different configurations. To gain intuition, the divergent case may be converted to a projective space by rescaling; one may then ask whether the normalized sequence $\{x_k/\|x_k\|\}_{k\in\mathbb{N}}$ converges.

When this re-scaling is performed, however, one must alter the objective function such that it is optimized by a unit vector, as in $x \mapsto \min_{\lambda\in\mathbb{R}} f(\lambda x)$. In the tensor approximation problem, this results in a bounded multivariate rational function. Targets for which the best approximation problem is ill-posed correspond to singularities of this function; some of these singularities are known to be essential — that is, neither removable nor unbounded.

Figure 4.1: Line search along the gradient maximizes $f(x, y) = \dfrac{-xy}{(x^2+y^2)(1+x^2+y^2)}$ from an initial estimate of $(x_0, y_0) = (2, -0.1)$.
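An experiment of the kind shown in Fig. 4.1 can be reproduced with a few lines of code. This is an illustrative sketch only: the finite-difference gradient, the sampled line search, and the iteration counts are arbitrary choices, not the procedure used to generate the figure.

```python
import numpy as np

def f(p):
    x, y = p
    return -x * y / ((x**2 + y**2) * (1 + x**2 + y**2))

def grad(p, h=1e-7):
    # Central-difference gradient; adequate for illustration.
    return np.array([(f(p + h*e) - f(p - h*e)) / (2*h) for e in np.eye(2)])

p = np.array([2.0, -0.1])
for k in range(1, 51):
    d = grad(p)
    d /= np.linalg.norm(d)
    # Crude "exact" line search: sample step lengths along the gradient direction.
    p = max((p + t * d for t in np.linspace(0.0, 1.0, 4001)), key=f)
    if k % 10 == 0:
        print(f"k={k:3d}  f={f(p):.4f}  |x_k|={np.linalg.norm(p):.4f}  "
              f"x_k/|x_k|={p / np.linalg.norm(p)}")
```

The iterates approach the essential singularity at the origin while $f(x_k)$ climbs toward $\sup f = 1/2$, and the normalized iterates $x_k/\|x_k\|$ settle toward a fixed direction.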

Wherever a function fails to be continuous, the Łojasiewicz gradient inequality cannot apply. This does not prevent optimization methods from producing convergent sequences, as is illustrated in Fig. 4.1, but it does preclude the use of the theorems of Absil, Uschmajew, et al. Fortunately, those theorems may be strengthened to require only that the Łojasiewicz gradient inequality hold on points in the sequence. The conditions required by the following theorems will follow from the main results

of this paper. Specifically, Theorems 4.2.6 and 4.2.9 will establish (Ł*) for sequences such as that illustrated in Fig. 4.1.

Theorem 4.1.2. Let $U \subset \mathbb{R}^n$ be an open set, let $f : U \to \mathbb{R}$ be a differentiable function, and let $\{x_k\}_{k=1}^\infty \subset U$ be a sequence of vectors satisfying Assumptions (A1) and (A2). If a cluster point $x^*$ of the sequence $\{x_k\}_{k=1}^\infty$ admits an open neighborhood $V \subset \mathbb{R}^n$ and constants θ ∈ (0, 1/2] and $c \in \mathbb{R}$ such that for all k,
$$x_k \in V \implies \left| f(x_k) - \lim_{j\to\infty} f(x_j) \right|^{1-\theta} \le c\, \|\nabla f(x_k)\| \tag{Ł*}$$
then $x^*$ must be the limit of the sequence $\{x_k\}_{k=1}^\infty$.

Theorem 4.1.3. Under the conditions of Theorem 4.1.2, if also Assumption (A3) holds, then $\nabla f(x_k) \to 0$ and
$$\|x^* - x_k\| = \begin{cases} O(q^k) & \text{if } \theta = \tfrac{1}{2} \ \text{(for some } 0 < q < 1\text{)}, \\[4pt] O\!\left(k^{-\theta/(1-2\theta)}\right) & \text{if } 0 < \theta < \tfrac{1}{2} , \end{cases}$$
where θ is such that (Ł*) holds.

Proof of Theorems 4.1.2 and 4.1.3: Assumption (Ł*) implies that $\lim_{k\to\infty} f(x_k)$ exists. The gradient of f is finite where defined, so assumption (Ł*) requires that $\lim_{k\to\infty} f(x_k)$ be finite. A proof of the original theorems of Absil, Uschmajew, et al. is provided in [SU15, p. 644]. Because only three changes must be made to the original proof, I quote Uschmajew’s proof here as indented block quotes.³ Between block quotes, I supply the alterations which extend Uschmajew’s proof to a proof of Theorems 4.1.2 and 4.1.3. Let $f_k = f(x_k)$, and let $g_k = \|\nabla f(x_k)\|$.

We can assume that $g_k > 0$ for all k since otherwise the sequence becomes stationary and there is nothing to prove.

There will also be no loss of generality to assume that (A1) and (A2) hold for all k and that $\lim_{k\to\infty} f_k = 0$. Then $0 \le f_k$ for all k and the Łojasiewicz gradient inequality at $x^*$ reads as
$$f_k^{1-\theta} \le \Lambda\, g_k \tag{A.1}$$
whenever $x_k \in U$. The set U contains an open ball $B_\delta(x^*)$ for some δ > 0.

³ To simplify reading, I have changed Uschmajew’s notation to match mine.

Let $\epsilon \in (0, \delta]$, and assume $\|x_k - x^*\| < \delta$. Then, by (A.1) and (A1),
$$\|x_k - x_{k+1}\| \le \frac{\Lambda}{\sigma} f_k^{\theta-1} (f_k - f_{k+1}) .$$
Using the fact that for $\varphi \in [f_{k+1}, f_k]$ there holds $f_k^{\theta-1} \le \varphi^{\theta-1} \le f_{k+1}^{\theta-1}$, we can estimate
$$f_k^{\theta-1}(f_k - f_{k+1}) \le \int_{f_{k+1}}^{f_k} \varphi^{\theta-1}\, d\varphi = \frac{1}{\theta}\left( f_k^\theta - f_{k+1}^\theta \right)$$
and thus obtain
$$\|x_k - x_{k+1}\| \le \frac{\Lambda}{\sigma\theta}\left( f_k^\theta - f_{k+1}^\theta \right) .$$
More generally, let $\|x_j - x^*\| < \epsilon \le \delta$ for $k \le j < m$; we get by this argument that
$$\|x_m - x_k\| \le \sum_{j=k}^{m} \|x_{j+1} - x_j\| \le \frac{\Lambda}{\sigma\theta} \sum_{j=k}^{m} \left( f_j^\theta - f_{j+1}^\theta \right) \le \frac{\Lambda}{\sigma\theta} f_k^\theta . \tag{A.2}$$

Since $x^*$ is an accumulation point, and because $f_k \to 0$, we can pick k so large that
$$\|x_k - x^*\| < \frac{\epsilon}{2} \quad \text{and} \quad \frac{\Lambda}{\sigma\theta} f_k^\theta < \frac{\epsilon}{2} .$$
Then (A.2) inductively implies $\|x_m - x^*\| < \epsilon$ for all m > k. This proves that $x^*$ is the limit point of the sequence, and, by (A3), $g_k \to 0$.

To estimate the convergence rate, let $r_k = \sum_{j=k}^{\infty} \|x_{j+1} - x_j\|$. Then $\|x_k - x^*\| \le r_k$, so it suffices to estimate the latter. By (A.2), (A.1), and (A3),

there exists $k_0 \ge 1$ such that for $k \ge k_0$ it holds that
$$r_k^{\frac{1-\theta}{\theta}} \le \left(\frac{\Lambda}{\sigma\theta}\right)^{\frac{1-\theta}{\theta}} \frac{\Lambda}{\kappa}\, \|x_{k+1} - x_k\| = \left(\frac{\Lambda}{\sigma\theta}\right)^{\frac{1-\theta}{\theta}} \frac{\Lambda}{\kappa}\, (r_k - r_{k+1}) ,$$
that is,
$$r_{k+1} \le r_k - \nu\, r_k^{\frac{1-\theta}{\theta}} \tag{A.3}$$
with $\nu = \left(\frac{\Lambda}{\sigma\theta}\right)^{\frac{\theta-1}{\theta}} \frac{\kappa}{\Lambda}$. Now, if θ = 1/2, we get from (A.3) that ν ∈ (0, 1), and
$$r_k \le r_{k_0} (1 - \nu)^{k - k_0}$$
for $k \ge k_0$. The case 0 < θ < 1/2 is more delicate. We follow Levitt: put $p = \frac{\theta}{1-2\theta}$, $C \ge \max\{ (\nu/p)^{-p},\ r_{k_0} k_0^{p} \}$, and $s_k = C k^{-p}$; then $s_{k_0} \ge r_{k_0}$, and
$$s_{k+1} = s_k \left(1 + k^{-1}\right)^{-p} \ge s_k \left(1 - p k^{-1}\right) = s_k - \frac{p}{C^{1/p}}\, s_k^{\frac{p+1}{p}} \ge s_k - \nu\, s_k^{\frac{1-\theta}{\theta}}$$
(the first inequality holding by convexity of $x^{-p}$). Using induction, it now follows from (A.3) that $r_k \le s_k$ for all $k \ge k_0$, which finishes the proof. □



Remark 4.1.4. Theorems 4.1.2 and 4.1.3 are strictly stronger than the results of Absil, Uschmajew, et al. in that weaker hypotheses allow the same conclusions. Specifically, if f is analytic on a neighborhood of $x^*$, (Ł*) follows from Theorem 4.1.1.

My Theorems 4.1.2 and 4.1.3 do not require continuity of the objective function, but do require a weaker form (Ł*) of the Łojasiewicz gradient inequality. Beyond Theorems 4.1.2 and 4.1.3, the contributions of this paper are four-fold.

• In Section 4.2.1, I contribute a version of the Łojasiewicz gradient inequality that may be employed at essential singularities of bounded multivariate rational functions. This inequality holds not on a neighborhood of a singularity, but instead on open sets which have boundaries containing the singularity.

• Section 4.2.2 establishes that my Łojasiewicz gradient inequality holds on the tail of the sequence, under a technical condition. Specifically, if the limiting behavior of f near the singularity does not depend continuously on the direction of approach, then the sequence must avoid the “unsafe” directions at which discontinuities occur. I show that the set of “unsafe” directions of approach is closed and has Lebesgue measure zero.

• Section 4.2.3 combines these results to prove that if this and (A1–A3) hold, then the sequence $\{x_{k+1}\}_{k\in\mathbb{N}}$ converges.

• Finally, Section 4.3 provides additional tools for examination of convergence in direction and briefly discusses the implications for tensor approximation.

This partially closes a gap in the theory of tensor approximation, and provides a general tool for analysis of sequences.

4.2 On the assumption (Ł*)

Before I begin my analysis of (Ł*), I note two obstacles and the means by which I circumvent them. First, I must assume that f has some structure strict enough that its behavior may be analyzed near a singularity x∗, yet not so strict as to require that f be continuous at x∗.

Throughout the paper, I will consider multivariate rational functions on Rn, with the assumption that the domain of such functions is defined implicitly to be all points in Rn at which division by zero does not occur. More formally, I define a multivariate rational function as follows:

Definition 4.2.1. A function r is a multivariate rational function on $\mathbb{R}^n$ if and only if there exist multivariate polynomials $p, q : \mathbb{R}^n \to \mathbb{R}$ such that $r = \frac{p}{q}$.

Second, I cannot expect the Łojasiewicz gradient inequality to hold on open neighborhoods of $x^*$. As part of my analysis of (Ł*), I find sets on which
$$|f(x) - K|^{1-\theta} \le c\, \|\nabla f(x)\| \tag{4.2}$$
holds. If $x^*$ were a removable singularity of f, one might be able to establish (4.2) on an open neighborhood of $x^*$. For an essential singularity, however, (4.2) can fail to hold on open neighborhoods of $x^*$.

To illustrate this, consider the function defined by $(x, y) \xmapsto{f} \frac{y^2 - x^2}{x^2 + y^2}$ and the sequence $e_1, \tfrac{1}{2}e_2, \tfrac{1}{4}e_1, \tfrac{1}{8}e_2, \cdots \subset \mathbb{R}^2$. Clearly, $x_k \to 0$. Moreover, for all $k \in \mathbb{N}$, it holds that $\nabla f(x_k) = 0$ and $f(x_k) = (-1)^k$. If (4.2) were to hold on any open neighborhood of 0, it would imply that $f(x_k) = K$ for all sufficiently large k, which contradicts the calculation above. Rather than examine open neighborhoods of singularities, I establish that (4.2) holds on certain open sets adjacent to singularities of bounded multivariate rational functions. I then provide conditions under which sequences $\{x_k\}_{k\in\mathbb{N}}$ remain within those open sets.
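A short numerical check of this example (illustrative only; the gradient is approximated by finite differences, so it is zero only up to discretization error):

```python
import numpy as np

def f(p):
    x, y = p
    return (y**2 - x**2) / (x**2 + y**2)

def grad(p, h=1e-8):
    return np.array([(f(p + h*e) - f(p - h*e)) / (2*h) for e in np.eye(2)])

e1, e2 = np.eye(2)
seq = [e1 / 2**k if k % 2 == 0 else e2 / 2**k for k in range(8)]
for k, p in enumerate(seq):
    # f alternates between -1 and +1 while the gradient (numerically) vanishes,
    # so no single constant K can satisfy (4.2) on a neighborhood of the origin.
    print(k, "f =", round(f(p), 6), " |grad f| =", round(np.linalg.norm(grad(p)), 6))
```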

4.2.1 A Łojasiewicz-like inequality holds on cones

Bounded multivariate rational functions may admit essential singularities, such as that illustrated in Fig. 4.1. Bounded univariate rational functions instead exhibit only removable singularities. In this section, I parameterize multivariate functions in terms of directions and distance; that is, as images of lines under a multivariate function. Lemmas 4.2.2 and 4.2.4 establish that these parameterizations are themselves analytic, if not rational. The parameterization is employed in Theorem 4.2.6 to establish a Łojasiewicz inequality on cones near singularities of the original function.

Lemma 4.2.2. Suppose f is a multivariate rational function on $\mathbb{R}^m \times \mathbb{R}$. If
$$f_x(t) = \lim_{s\to t} f(x, s)$$
is defined for some $x \in \mathbb{R}^m$ and all $t \in \mathbb{R}$, then there exists an open set $O \subset \mathbb{R}^m$ of full measure and multivariate rational functions $c_n : O \to \mathbb{R}$ defined everywhere on O such that, for every $x \in O$ there exists $\rho_x > 0$ such that
$$|t| < \rho_x \implies f_x(t) = \sum_{n=0}^{\infty} c_n(x)\, t^n .$$

Proof: Suppose $f_x(t) = \lim_{s\to t} f(x, s)$ is defined for all $t \in \mathbb{R}$. By fixing $x \in \mathbb{R}^m$, we reduce $f(x, t)$ to a univariate rational function with respect to t, so its continuous extension $f_x(t)$ is a univariate rational function. Because $f_x(t)$ is defined everywhere, it is everywhere real-analytic. Let
$$f(x, t) = \frac{f_{\text{numer}}(x, t)}{f_{\text{denom}}(x, t)} ,$$
where $f_{\text{numer}}$ and $f_{\text{denom}}$ are multivariate polynomials. $f_x(t)$ is analytic, so the coefficients $c_n$ of its Taylor expansion around 0 exist, and for $n = 0, \ldots, \infty$,
$$c_n(x) = \frac{1}{n!} \lim_{t\to 0} \frac{d^n}{dt^n} \frac{f_{\text{numer}}(x, t)}{f_{\text{denom}}(x, t)} .$$
Because the function $f_x$ is analytic for each x, there exist $\rho_x$ such that $|t| < \rho_x$ implies $f_x(t) = \sum_{n=0}^{\infty} c_n(x) t^n$. Note that the neighborhood of convergence depends on the direction parameter x.

We now examine properties of the coefficients $c_n(x)$. By repeated application of the quotient rule, and omitting the numerators because they are not required for the proof, we obtain an expression of the form

$$c_n(x) = \frac{1}{n!} \lim_{t\to 0} \frac{\cdots}{\left( f_{\text{denom}}(x, t) \right)^{2^n}} ,$$
which is a limit of a rational function. These limits exist by analyticity of $f_x(t)$, so they may be evaluated by repeated application of L’Hospital’s rule.

Let $f_n(x)$ denote the coefficients of the Taylor expansion, with respect to t, of $f_{\text{denom}}(x, t)$. Note that these are multivariate polynomials with respect to x. Let
$$n_{\min} = \min\{\, n \in \mathbb{Z} : n \ge 0 \text{ and } \exists\, x \text{ s.t. } f_n(x) \neq 0 \,\} ,$$
and define

$$O = \{\, x : f_{n_{\min}}(x) \neq 0 \,\} . \tag{4.3}$$

It must be shown that $n_{\min}$ is well-defined. The function $f_x$ is assumed to be defined for some $x \in \mathbb{R}^m$, so $f_{\text{denom}}$ is nonzero at some (x, t). From this and analyticity of $f_{\text{denom}}$, it follows that $f_n(x)$ is non-zero for some n. This establishes that $n_{\min}$ is well-defined. A multivariate real or complex polynomial is either identically zero or non-zero almost-everywhere with respect to Lebesgue measure. Further, the set on which such a polynomial is equal to a given constant is closed. Thus O is an open set of full measure.

For every $x \in O$, the coefficients $c_n(x)$ may be evaluated by applying L’Hospital’s rule exactly $2^n n_{\min}$ times. This implies that on O, every $c_n$ is a multivariate rational function with respect to x, and is defined everywhere on O. □

It is well known that the trailing coefficients of the Taylor series expansion of a univariate rational function satisfy a linear recurrence relation. For completeness, I provide a proof of this here:

Proposition 4.2.3. Let $\mathbb{K} = \mathbb{R}$ or $\mathbb{C}$, let $f, g : \mathbb{K} \to \mathbb{K}$ be polynomials of degree $d_f$ and $d_g$ given as $f(x) = \sum_{i=0}^{d_f} f_i x^i$ and $g(x) = \sum_{j=0}^{d_g} g_j x^j$, and let $\{c_n\}_{n=0}^\infty \subset \mathbb{K}$ be such that $\left(\frac{f}{g}\right)(x) = \sum_{n=0}^{\infty} c_n x^n$ on some open $U \subset \mathbb{K}$. Then for all $n > \max\{d_f, d_g\}$, the coefficients $c_n$ are given by
$$c_n = - \sum_{j=1}^{d_g} \frac{g_j}{g_0}\, c_{n-j} .$$

Proof: For all $x \in U$, we have $\frac{f(x)}{g(x)} = \sum_{n=0}^{\infty} c_n x^n$. Then, multiplying by g(x),
$$f(x) = g(x) \sum_{n=0}^{\infty} c_n x^n \implies \sum_{i=0}^{d_f} f_i x^i = \sum_{j=0}^{d_g} g_j x^j \sum_{n=0}^{\infty} c_n x^n = \sum_{n=0}^{\infty} x^n \left( \sum_{j=0}^{\min\{n, d_g\}} c_{n-j}\, g_j \right) .$$
Matching terms gives the equation $0 = \sum_{j=0}^{d_g} c_{n-j} g_j$ for all $n > \max\{d_f, d_g\}$, and solving for $c_n$ completes the proof. □
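A quick numerical check of Proposition 4.2.3 (the example polynomials are arbitrary choices made for illustration):

```python
import numpy as np

f = np.array([1.0, 2.0])            # f(x) = 1 + 2x
g = np.array([1.0, -1.0, 0.5])      # g(x) = 1 - x + 0.5 x^2, with g_0 != 0
N = 12

# Direct coefficients of f/g via long division of power series.
c = np.zeros(N)
for n in range(N):
    fn = f[n] if n < len(f) else 0.0
    c[n] = (fn - sum(g[j] * c[n - j] for j in range(1, min(n, len(g) - 1) + 1))) / g[0]

# Recurrence from Proposition 4.2.3, valid for n > max(deg f, deg g).
for n in range(max(len(f), len(g)), N):
    c_rec = -sum(g[j] / g[0] * c[n - j] for j in range(1, len(g)))
    assert abs(c_rec - c[n]) < 1e-12
print("recurrence verified; c_0..c_5 =", c[:6])
```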

I have shown that the Taylor series expansion in Lemma 4.2.2 converges within some radius of convergence ρx, but this radius of convergence might depend on x. Before I can

use Lemma 4.2.2 for its intended purpose, I must show that ρx may be bounded away from

0. I will do this by showing that ρx depends continuously on x so that I may later apply a compactness argument.

Lemma 4.2.4. If the maps $c_n$ are defined as in Lemma 4.2.2, then there exists a continuous map $\rho : O \to (0, \infty)$ and a corresponding set
$$U_\rho = \{\, (x, t) \mid x \in O,\ t \in (-\rho(x), \rho(x)) \,\}$$
on which $\sum_{n=0}^{\infty} |c_n(x) t^n|$ converges uniformly.

Proof: By Proposition 4.2.3, there exist $N, r \in \mathbb{N}$ and matrices $C_x \in \mathbb{R}^{r\times r}$ such that for all n > N,
$$\begin{pmatrix} c_{n+1}(x) \\ \vdots \\ c_{n+r}(x) \end{pmatrix} = C_x \begin{pmatrix} c_n(x) \\ \vdots \\ c_{n+r-1}(x) \end{pmatrix} .$$
Further, the recurrence matrix $C_x$ and its ∞-norm $\|C_x\|_\infty$ depend continuously on the choice of $x \in O$. We may thus bound the coefficients $c_n$ by a geometric sequence and control the convergence of the terms $c_n t^n$ by choice of t. Select $N \in \mathbb{N}$ such that the recurrence relation holds for all $n \ge N - r$. Note that N may be chosen identically for all $x \in O$. For any 0 < k < 1, selecting t such that
$$|t| \le \frac{k}{\max\{1, \|C_x\|_\infty\}} \tag{4.4}$$
guarantees that for $n \ge N$, we may bound $|c_n t^n|$ by
$$|c_n(x) t^n| \le \left( \max_{j=0,\ldots,r} |c_{N-j}(x)| \right) \|C_x\|_\infty^{\,n-N} \left( \frac{k}{\max\{1, \|C_x\|_\infty\}} \right)^{n} \le \left( \max_{j=0,\ldots,r} |c_{N-j}(x)| \right) k^n .$$

Then
$$\sum_{n=N}^{\infty} |c_n(x) t^n| \le \frac{k^N}{1 - k} \left( \max_{j=0,\ldots,r} |c_{N-j}(x)| \right) .$$
This may be bounded arbitrarily by choice of k. In particular, for a given bound ε > 0, the above bound shows that $\sum_{n=N}^{\infty} |c_n t^n| \le \epsilon$ if
$$\left( \max_{j=0,\ldots,r} |c_{N-j}(x)| \right) k^N + k - \epsilon = 0 .$$
This equation has a solution $k = k_{x,\epsilon} \in (0, 1)$ which depends continuously on x and ε. Fix ε > 0, and define ρ by
$$x \xmapsto{\rho} \frac{k_{x,\epsilon}}{2 \max\{1, \|C_x\|_\infty\}} .$$
Note that this map is continuous with respect to x, and that
$$|t| \le 2\rho(x) \implies \sum_{n=N}^{\infty} |c_n(x) t^n| \le \epsilon .$$
Then, for $|t| \le \rho(x)$ and $n \ge N$, the terms $|c_n(x) t^n|$ are bounded by
$$|c_n(x) t^n| \le \frac{\epsilon}{2^n} .$$
The series $\sum_{n=N}^{\infty} \frac{\epsilon}{2^n}$ converges, so $\sum_{n=N}^{\infty} |c_n(x) t^n|$ converges uniformly for $(x, t) \in U_\rho$. □

I now define $O_{x^*}$, the largest set of directions from which $x^*$ can be approached while ensuring that the parameterized function $(x, t) \mapsto f(tx + x^*)$ behaves consistently.

Definition 4.2.5. Given a bounded multivariate rational function f on $\mathbb{R}^n$ and a point $x^* \in \mathbb{R}^n$, and denoting by $f_n(x)$ the n-th Maclaurin series coefficient, with respect to t, of the denominator of the rational function defined by $(x, t) \mapsto f(tx + x^*)$, I define
$$O_{x^*} = \{\, x : f_{n_{\min}}(x) \neq 0 \,\} ,$$
where $n_{\min} = \min\{\, n \in \mathbb{Z}_{\ge 0} \mid \exists\, x : f_n(x) \neq 0 \,\}$.

Definition 4.2.5 will be used to describe those directions of approach near which the function f is sufficiently well-behaved. Later sections will demonstrate that if a sequence admits a subsequence that approaches $x^*$ along a direction in $O_{x^*}$, then $x^*$ is the limit of that sequence. That is, approaching $x^*$ from almost any direction will trap the sequence at that cluster point.

Note that $O_{x^*}$ satisfies the conditions of the open set specified by Lemma 4.2.2 for the function $f_x(t) = \lim_{s\to t} f(sx + x^*)$, as it is identical to the set constructed in the proof of Lemma 4.2.2. Note also that for any $t \in \mathbb{R} \setminus \{0\}$ and $x \in O_{x^*}$, I have $tx \in O_{x^*}$. I am now prepared to show that a generalized Łojasiewicz gradient inequality (4.2) holds near the singularities of bounded multivariate rational functions.

Theorem 4.2.6. Suppose E is a bounded multivariate rational function on $\mathbb{R}^m$. For any point $p \in \mathbb{R}^m$ and any direction $d \in O_p \setminus \{0\}$, where $O_p$ is as in Definition 4.2.5, there exist an open set $U_{\bowtie} \subset \mathbb{R}^m$ and constants $\theta \in (0, \tfrac{1}{2}]$ and k > 0 such that for all $x \in \operatorname{domain}(E) \cap U_{\bowtie}$,
$$\left| E(x) - \lim_{s\to 0} E(p + sd) \right|^{1-\theta} \le k\, \|\nabla E(x)\| . \tag{4.5}$$
Further, $U_{\bowtie}$ may be chosen so that

1. there exists some $s \in (0, 1]$ for which $p + sd \in U_{\bowtie}$, and

2. for all $y \in U_{\bowtie}$ and $t \in [-1, 0) \cup (0, 1]$, the set $U_{\bowtie}$ contains $p + t(y - p)$.

Proof: If E is nowhere defined, the theorem holds vacuously. We assume that E is somewhere defined.

Consider lines in parameter space, beginning at p and with direction $d \in \mathbb{R}^m$:
$$p + td , \quad t \in \mathbb{R} .$$
$E(p + td)$ is a univariate bounded real rational function with respect to t. As such, it either exists nowhere or the limit $\lim_{s\to t} E(p + sd)$ exists for all $t \in \mathbb{R}$. If the limit exists, it is a univariate real rational function defined everywhere on $\mathbb{R}$, and thus it is everywhere real-analytic. By Lemma 4.2.2, there exists some open subset $O \subset \mathbb{R}^m$ of full measure on which there are defined rational functions $c_n : O \to \mathbb{R}$ such that for all $y \in O$,

$$\lim_{s\to t} E(p + sy) = \sum_{n=0}^{\infty} c_n(y)\, t^n$$
if $|t| < \rho_y$. This may be selected such that $O = O_p$, as in Definition 4.2.5. By Lemma 4.2.4, select $\rho_y$ to depend continuously on y. Select a reference direction $d \in O$. We may assume, by rescaling, that $\|d\| = 1$. The parameter space is finite-dimensional and O is open, so there exist a compact set K and an open neighborhood U of d such that
$$d \in U \subset K \subset O .$$

$\rho_x$ is continuous on K and on U. Let $M = \min_{x\in K} \rho_x$. Because K is compact, M is well-defined. Further, $\rho_x > 0$ by definition, so M > 0. Define
$$U_d = \left\{ \begin{pmatrix} y \\ t \end{pmatrix} : y \in U \text{ and } |t| < M \right\} . \tag{4.6}$$
Define the extended function F by
$$F(y, t) = \lim_{s\to t} E(p + sy) = \sum_{n=0}^{\infty} c_n(y)\, t^n . \tag{4.7}$$

Within $U_d$, (4.7) is a convergent sum of analytic functions, and thus (4.7) is itself a real-analytic function. By Theorem 4.1.1 there exist an open ball $V \subset U_d$ centered at $\begin{pmatrix} d \\ 0 \end{pmatrix}$ and constants $\theta \in (0, \tfrac{1}{2}]$ and k > 0 such that for all $\begin{pmatrix} y \\ t \end{pmatrix} \in V$,
$$|F(y, t) - F(d, 0)|^{1-\theta} \le \frac{k}{2}\, \|\nabla F(y, t)\| . \tag{4.8}$$

Define the open set $V_* \subset V$ by
$$V_* = \left\{ \begin{pmatrix} y \\ t \end{pmatrix} \in V : \|y - d\| < \frac{1}{2} \text{ and } 0 < |t| < \frac{1}{2} \right\} .$$
From the definition of V as an open ball centered at $\begin{pmatrix} d \\ 0 \end{pmatrix}$, we conclude that if $\begin{pmatrix} y \\ t \end{pmatrix} \in V_*$ then $\begin{pmatrix} y \\ -t \end{pmatrix} \in V_*$.

For any $\begin{pmatrix} y \\ t \end{pmatrix} \in V_*$, if $E(p + ty)$ exists then its gradient exists, and the gradient of F is characterized as
$$\nabla F(y, t) = \begin{pmatrix} t\, \nabla E(p + ty) \\ \langle y, \nabla E(p + ty) \rangle \end{pmatrix} ,$$
where $\nabla E(p + ty)$ is the gradient of E at the point $p + ty$. Decompose $\nabla F(y, t)$ as $\nabla F(y, t) = a + b$, where
$$a = \begin{pmatrix} 0 \\ \langle y, \nabla E(p + ty) \rangle \end{pmatrix} , \quad \text{and} \quad b = t \begin{pmatrix} \nabla E(p + ty) \\ 0 \end{pmatrix} .$$

By the Cauchy–Schwarz inequality and the definition of $\|\cdot\|$,
$$\|a\| \le \|y\|\, \|\nabla E(p + ty)\| , \quad \text{and} \quad \|b\| = |t|\, \|\nabla E(p + ty)\| .$$

Using the triangle inequality, relate $\|\nabla F(y, t)\|$ and $\|\nabla E(p + ty)\|$ by
$$\|\nabla F(y, t)\| \le \|a\| + \|b\| \le \|y\|\, \|\nabla E(p + ty)\| + |t|\, \|\nabla E(p + ty)\| = (\|y\| + |t|)\, \|\nabla E(p + ty)\| .$$

Note that by definition of $V_*$, $|t| < \tfrac{1}{2}$ and $\|y\| < \|d\| + \tfrac{1}{2} = \tfrac{3}{2}$, so
$$\|\nabla F(y, t)\| \le 2\, \|\nabla E(p + ty)\| .$$

We now establish (4.5) for points $p + ty$ using (4.8):
$$\left| E(p + ty) - \lim_{s\to 0} E(p + sd) \right|^{1-\theta} = |F(y, t) - F(d, 0)|^{1-\theta} \le \frac{k}{2}\, \|\nabla F(y, t)\| \le k\, \|\nabla E(p + ty)\| . \tag{4.9}$$
Because this holds, with identical constants, for all $\begin{pmatrix} y \\ t \end{pmatrix} \in V_*$, we may establish (4.5) on a subset of $\mathbb{R}^m$. Define the cone $U_{\bowtie}$ by
$$U_{\bowtie} = \left\{\, p + ty \ \middle|\ \begin{pmatrix} y \\ t \end{pmatrix} \in V_* \,\right\} \subset \mathbb{R}^m .$$
Note that $U_{\bowtie}$ is open and satisfies the requirements of the theorem. By (4.9), for any $x \in U_{\bowtie}$,
$$\left| E(x) - \lim_{s\to 0} E(p + sd) \right|^{1-\theta} \le k\, \|\nabla E(x)\| .$$
This completes the proof. □

Though I have shown that a generalized Łojasiewicz gradient inequality holds in cones near the discontinuities of bounded rational functions, it is important to note that these cones are not neighborhoods of the discontinuities. Before this theorem may be used to show convergence, it must be shown that a sequence approaching a discontinuity remains within a single cone, or within a finite union of cones.

4.2.2 A Łojasiewicz inequality for sequences approaching singularities

This section proves conditions under which sequences remain within sets such as those produced by Theorem 4.2.6. This is broken into Lemmas 4.2.7 and 4.2.8 and Theorem 4.2.9. Lemma 4.2.8 shows that a continuous function on a product space may be used as a funnel, guiding a sequence into an open subset of one of its factor 81 spaces, such as a set provided by Theorem 4.2.6. If one set provided by Theorem 4.2.6 is insufficient to capture the full behavior of a sequence, Lemma 4.2.7 allows one to form the union of multiple such sets. Finally, Theorem 4.2.9 constructs a set on which the generalized Łojasiewicz gradient inequality holds, and provides conditions under which a sequence will be funneled into the constructed set.

n Sm Lemma 4.2.7. Let U1,..., Um be subsets of R , and let the function f : i=1 Ui → R be ∈ ∈ 1 bounded and differentiable. If there exist constants L R, θ1, . . . , θm (0, 2 ], and

c1,..., cm ∈ R such that, for all x ∈ Ui,

1−θi | f (x) − L| ≤ cik∇ f (x)k ,

∈ 1 ∈ ∈ Sm then there exist θ∗ (0, 2 ] and c∗ R such that, for all x i=1 Ui,

1−θ∗ | f (x) − L| ≤ c∗k∇ f (x)k .

Proof: Let M = sup Sm | f (x) − L|. Because f is bounded, M < ∞. Let θ∗ = mini θi. x∈ i=1 Ui

Then, for any x ∈ Ui,

1−θ∗ θi−θ∗ 1−θi θi−θ∗ | f (x) − L| ≤ M | f (x) − L| ≤ M cik∇ f (x)k .

θi−θ∗ Selecting c∗ = maxi(M ci) completes the proof. 

∞ In the proof of Theorem 4.2.9, I will parameterize elements of a sequence {xk}k=1 as

pairs of directions xk/kxkk and distances kxkk. Using this parameterization, I must show

that if kxkk is sufficiently small, then the direction xk/kxkk must be within a specified set. That argument is greatly simplified by Lemma 4.2.8, which states formally an intuitive property of continuous functions and direct products of compact sets illustrated by Fig. 4.2.

Lemma 4.2.8. Given metric spaces X1, X2, compact sets K1 ⊂ X1 and K2 ⊂ X2, and a

continuous function f : X1 × X2 → R, if f |K1×K2 > 0, then there exists  > 0 such that,

for each x1 ∈ X1 and x2 ∈ X2, at least one of the statements 82

   K1 × K2  f (x1, x2) ≥ 

Figure 4.2: Illustration of Lemma 4.2.8

1.f (x1, x2) < ,

2. d(x1, K1) < ,

3. d(x2, K2) < 

is false.

Proof: Note that the product topology on X1 × X2 is induced by the metric defined by

d∞((x1, y1), (x2, y2)) = max{d(x1, x2), d(y1, y2)} .

Because f is continuous, each (x, y) ∈ K1 × K2 admits an open neighborhood

x,y Nq,r = { (z, w) | d(x, z) < q and d(y, w) < r }

x,y | x,y × such that f Nq,r > 0. These sets Nq,r form an open cover of K1 K2. By Tychonoff’s × { xi,yi }n theorem, K1 K2 is compact, so the cover admits a finite subcover Nqi,ri i=1. The closures of the sets in this subcover are compact, by the Heine-Borel property, so their union is also compact.

C × \ Sn xi,yi 1 C × The set N = (X1 X2) ( i=1 Nqi,ri ) is closed, so 1 = 2 d∞(N , K1 K2) is positive. Define

K1 = {(x1, x2) ∈ X1 × X2| d∞((x1, x2), K1 × K2) ≤ 1} ,

⊂ Sn xi,yi { xi,yi }n and note that K1 i=1 Nqi,ri . As a closed subset of the union of the closures of Nqi,ri i=1, the set K1 is compact. Thus f attains a lower bound 2 > 0 on K1 . Define  = min{1, 2}. 83

Given a point (x1, x2) ∈ X1 × X2, the statements d(x1, K1) <  and d(x2, K2) < 

together imply d∞((x1, x2), K1 × K2) < 1. This in turn implies (x1, x2) ∈ K1 , so

f (x1, x2) ≥ 2 ≥ . 

With the lemmas above, I am able to formalize a statement that certain sequences are funneled into cones on which the Łojasiewicz inequality holds. Though I have a statement that the Łojasiewicz inequality holds on cones adjacent to essential singularities, infinitely many such cones may be required to cover a deleted neighborhood of a given point. Lemma 4.2.7 can be applied only to finitely many cones, and these cones must have an identical limiting value L. I can ensure a finite union by imposing a condition which induces a compact set covered by cones, such as condition (4.10) below.

Theorem 4.2.9. Let f be a bounded multivariate rational function on Rn, and let

∞ n ∞ ∗ {xk}k=1 ⊂ R be a sequence such that { f (xk)}k=1 is monotonic. Suppose that x is a cluster ∞ ∗ point of the sequence {xk}k=1 and there exists a closed set V ⊂ Ox such that for all sufficiently large k, x − x∗ k ∈ ∗ V . (4.10) kxk − x k ∗ Then there exist an open neighborhood U of x and constants θ ∈ (0, 1/2] and c ∈ R such

that if k is sufficiently large and if xk ∈ U, then

1−θ

f (xk) − lim f (xi) ≤ ck∇ f (xk)k . i→∞

Proof: If f (x∗) is defined, then f is real-analytic in an open neighborhood of x∗, and the conclusion of the theorem follows directly from the Łojasiewicz gradient inequality.

∗ ∗ Assume instead that f (x ) is undefined. Then xk , x for all k ∈ N. ∞ The sequence { f (xk)}k=1 is monotonic and bounded, so it admits a limit

L = lim f (xk) . k→∞ 84

Without loss of generality, we may assume that x∗ = 0 and that every u ∈ V has kuk = 1.

Define F(x, t) = lims→t f (sx) and note that it is analytic on Ox∗ × R.

Let K ⊂ Ox∗ be defined by   n K = V ∩ x ∈ R lim inf |F(z, 0) − L| = 0 . z→x

As an intersection of a compact set and a closed set, K is compact. To show that K is { }∞ → ∗ non-empty, take a subsequence xki i=1 such that xki x , and note that   ∞ ∗ ∗ xki /kxki k, kxki k i=1 admits a cluster point (u , 0), for some u ∈ V. Because F is ∗ ∗ continuous, F(u , 0) = limi→∞ F(xki /kxki k, kxki k) = L, and thus u ∈ K. Further, we may use continuity of F to calculate the image F[K × {0}] = {L}. Using Theorem 4.2.6, create collections

n  1 i Uu ⊂ , Θu ∈ 0, , and Cu ∈ open R 2 R

such that, for every u ∈ K,

1−Θu y ∈ Uu =⇒ | f (y) − L| ≤ Cuk∇ f (y)k .

By their definition, each Uu admits an open subset Vu ⊂ Uu and a

1,u ∈ (0, 1] such that

y 0 < |t| ≤  ∧ y ∈ V =⇒ t ∈ V , (4.11) 1,u u kyk u

and such that there exists t > 0 such that tu ∈ Vu. For each u ∈ K, let   0 2 1,u V = y y ∈ Vu . Note that kuk = 1 by assumption on V, and thus that ∈ Vu; it u 1,u 2 0 0 follows that u ∈ Vu. The collection {Vu}u∈K is then an open cover of K, and thus admits a {V0 }m { }m ⊂ finite subcover ui i=1, where ui i=1 K. Let

1 = min 1,u , i=1,...,m i [m U = Uui . i=1 85

By Lemma 4.2.7, there exist θ ∈ (0, 1/2] and c ∈ R such that

y ∈ U =⇒ | f (y) − L|1−θ ≤ ck∇ f (y)k .

Sn Note that 1K = { 1u | u ∈ K } ⊂ i=1 Vui ⊂ U, and that the distance between a compact set and a closed set is positive if the sets are disjoint. Thus, we may define

d ( K, n \U) δ = 1 R > 0 . (4.12) 1

Recalling that u ∈ K =⇒ kuk = 1, and applying (4.11) and (4.12) gives that if kxkk < 1

and d(xk/kxkk, K) < δ, then xk ∈ U.

Let Vδ = { u ∈ V | kuk = 1 ∧ d(u, K) ≥ δ}. By (4.10), ! xk xk < Vδ ⇐⇒ d , K < δ . kxkk kxkk

Vδ is compact, and |F − L| > 0 on Vδ × {0}. By Lemma 4.2.8, there exists 2 > 0

 xk  xk such that kxkk < 2 and F , kxkk − L < 2 together imply Vδ. Note that kxkk kxkk <

 xk  F , kxkk = f (xk). kxkk

 xk  If k is large enough that | f (xk) − L| < 2 and if kxkk < min{1, 2}, then d , K < δ, kxkk n ∗ and xk ∈ U. Selecting U = {v ∈ R | kv − x k < min{1, 2}} completes the proof. 

With the possible exception of (4.10), the conditions of Theorem 4.2.9 impose no great burden. Any sequence produced by a hill-climbing algorithm will necessarily have

∞ monotonic { f (xk)}k=1.

Proposition 4.2.10. The set Ox∗ defined in Definition 4.2.5 contains every cluster point of

∗ ∞ ∗ n xk−x o xk−x the sequence ∗ if and only if there exists a closed set V ⊂ Ox∗ such that ∗ ∈ V kxk−x k k=1 kxk−x k for all sufficiently large k.

Proof: If the set V exists, then the set { u ∈ V | kuk = 1 } is compact, so the “if” direction holds. 86

∞  − ∗  ∞ xki x If no such set V exists, then some subsequence {ui}i=1 = kx −x∗k must approach ki i=1 n the compact set K = { u ∈ R | kuk = 1 }\Ox∗ . That is d(ui, K) → 0. This may be shown by

contradiction; if d(ui, K) 6→ 0, a set of points -distant from K would fulfill the requirements of V, which would contradict the the assumption that no such set exists.

Define a sequence of closest points in the complement of Ox∗ by selecting, for each i, a point

ki ∈ { k ∈ K | d(ui, k) = d(ui, K) }

∞ ∞ The sequence {ki}i=1 admits a cluster point, and d(ui, ki) → 0, so the sequence {ui}i=1 also ∗ ∗ admits a cluster point u ∈ K. By definition of K, u < Ox∗ , so the “only if” direction

holds. 

4.2.3 New convergence theorems

In the previous sections, I established that a generalized Łojasiewicz gradient inequality holds on cones and established conditions under which sequences are funneled into those cones. I now combine this with Theorems 4.1.2 and 4.1.3 to establish conditions under which sequences converge to essential singularities of bounded multivariate rational functions.

∞ n I say that a sequence {xk}k=1 ⊂ R satisfies Assumption (A4) if

• the set Ox∗ in Definition 4.2.5 contains every cluster point of the sequence ( ) x − x∗ k . (A4) kx − x∗k k k∈N

Theorem 4.2.11. Let f be a bounded multivariate rational function on Rn, and let

∞ n ∗ {xk}k=1 ⊂ R be a sequence of vectors with a cluster point x . If Assumptions (A1), (A2), ∗ ∞ and (A4) hold on the tail of the sequence, then x is the limit of {xk}k=1. 87

Proof: Assumption (A1) guarantees that the sequence { f (xk)}k∈N is decreasing. Assumption (A4) and Proposition 4.2.10 then fulfill the conditions of Theorem 4.2.9. This fulfills condition (Ł*) of Theorem 4.1.2. The conditions of Theorem 4.1.2 are thus satisfied, so the conclusions of

Theorem 4.2.11 hold. 

The proof of Theorem 4.2.11 merely uses Theorem 4.2.9 to show that the conditions of Theorem 4.1.2 hold, so the same argument provides the following specialization of Theorem 4.1.3.

Theorem 4.2.12. Under the conditions of Theorem 4.2.11, if Assumption (A3) holds, then

∇ f (xk) → 0 and the convergence rate may be estimated as   O(qk) if θ = 1 (for some 0 < q < 1), ∗  2 kx − xkk =   −θ 1  1−2θ O(k ) if 0 < θ < 2

where θ is such that (Ł*) holds.

Proof: Follows immediately from the proof of Theorem 4.2.11 and from the conditions of Theorem 4.1.3. 

These theorems can be used to show convergence of sequences to cluster points not in a function’s domain, but both theorems rely on assumption (A4), which may be difficult to verify a priori.

4.3 Algorithms, examples, and implications

Uchmajew et al. have shown that (A1), (A2), and (A3) hold for the sequences produced by various optimization algorithms [Usc15, SU15, LUZ15]. The reader should note, however, that these results may rely on additional restrictions on the sequence or on the objective function, which may preclude application to sequences approaching singularities of an objective function. 88

Notably, in the analysis of the Gradient-Related Projection Method with Line-Search (GRPMLS) from [SU15], the assumption (A0) required by Corollary 2.9 in [SU15] need only hold on the sequence itself, and thus (A1) and (A2) hold even if the sequence diverges or has a cluster point not in the domain of the objective function. Unfortunately, the additional conditions that Uschmajew used to prove that (A3) holds, in particular that ∇ f must be Lipschitz continuous on a neighborhood of x∗, do not typically hold if x∗ is an essential singularity. If the gradient-related projection method with line-search is applied to a bounded multivariate rational function then the sequence produced satisfies (A1) and (A2). One must still guarantee the existence of cluster points, and show that (A4) holds.

4.3.1 Example: Figure 4.1

Figure 4.1 illustrates a sequence that maximizes the rational function defined by 7→ −xy (x, y) (x2+y2)(1+x2+y2) . I will employ the results of this paper to establish its convergence. More precisely, I will examine the equivalent problem of minimizing

xy f (x, y) = (x2+y2)(1+x2+y2) . The sequence is produced by the method of [SU15], and thus satisfies (A1) and (A2). Any cluster point of the sequence will occur either in the domain of f or at its singularity — that is, at 0. To guarantee convergence in the latter case, I establish that

n Ox∗ = R \{0}. As established in Definition 4.2.5, I may examine the Maclaurin series coefficients of the denominator of (x, y, t) 7→ f (tx, ty). The zeroth and first Maclaurin series coefficients are identically 0. The second Maclaurin series coefficient is d2 2 2 2 2 2 2 | 2 2 n \{ } f2(x, y) = dt2 t (x + y )(1 + t (x + y )) t=0 = 2(x + y ), which is nonzero on R 0 . Thus n Ox∗ = R \{0}, and (A4) holds regardless of the direction from which the sequence approaches. 89

Noting that the degree of the denominator (x2 + y2)(1 + x2 + y2) exceeds that of the numerator xy and that the sequence of images f (xk) = f (xk, yk) is negative and decreasing ensures that kxkk remains bounded. Thus a cluster point exists. This cluster point must be the limit of the sequence. If it is in the domain of f , convergence would be guaranteed by the results of [SU15]. Otherwise, convergence would be guaranteed by Theorem 4.2.11. In this example, it is also trivial to establish that the sequence converges to 0. If the sequence were to converge to some x∗ in the domain of f , then [SU15, Corrollary 2.11]

and continuity would provide ∇ f (x∗) = 0. Calculation gives ∇ f (x, y) , 0, so this case does not occur. Though this trivial example could be solved using other methods, the results of this paper provide more interesting results when a function admits multiple singularities. As P j an example of this, the function (x, y) 7→ i=1 wi f (x − ui, y − vi), for constants wi, ui, vi ∈ R ∗ and i = 1,..., j, admits multiple singularities xi = (ui, vi), each of which has

n O ∗ = \{0}. Any sequence produced by GRPMLS would be precluded from cycling xi R among the singularities by the results of this paper.

4.3.2 Convergence in direction

If a sequence admits a cluster point, one may ask whether this cluster point is the limit of the sequence. If a sequence does not admit a cluster point, one may instead ask whether there exists a related convergent sequence that approximates a solution of a corresponding problem. In this section I examine a class of problems for which such related convergent sequences may be produced. If a sequence fails to converge because it is unbounded, one may optimize instead a homogeneous function of degree 0, such as that defined by

f¯(x) = min f (αx) . (4.13) α∈R 90

A sequence that optimizes f¯ may be paired with a sequence of scalar constants to describe a sequence that optimizes the original objective function f . In many cases, this is significantly more difficult to analyze than the original objective function f , in part because f¯ may introduce discontinuities. For some notable problems, such as low-rank approximation of tensors, f¯ is a bounded multivariate rational function, and is thus within the scope of this paper. ¯ Optimizing f may still produce a divergent sequence {xi}i∈N. To address this, consider ¯ instead the normalized sequence {xi/kxik}i∈N, which also optimizes f . This normalized sequence must admit a cluster point, by compactness of the unit ball in a

finite-dimensional Banach space. I will show that if Assumption (A1) holds on {xi}i∈N, it holds also on the normalized sequence {xi/kxik}i∈N.

Lemma 4.3.1. Let V be a , and let u, v ∈ V be vectors such that kuk = 1. Then v u − ≤ 2ku − vk . (4.14) kvk Proof: Let k = kvk, let w = v/k, and let x = Re(hu, wi).

(4.14) ⇐⇒

1 k − k2 − k − k2 ≥ ⇐⇒ 2 (4 u kw u w ) 0 2k2 − 4kx + 1 + x ≥ 0 ⇐⇒

2(k − x)2 + 1 + x − 2x2 ≥ 0 (4.15)

−1 ≤ ≤ − 2 ≥ If 2 x 1, then 1 + x 2x 0, and (4.15) holds. If x ≤ 0, then (k − x)2 ≥ x2 by the choice of k, and (4.15) holds if 1 + x ≥ 0.

By the Cauchy-Schwarz inequality, x ∈ [−1, 1], so (4.15) holds. 

Proposition 4.3.2. Suppose that f : Rn → R has the property that f (x) = f (cx) for all c ∈ R. Suppose also that {xi}i∈N is a sequence such that Assumption (A1) holds. Then

Assumption (A1) holds also for the sequence {xi/kxik}i∈N. 91

Proof: For clarity, let ui = xi/kxik. By Assumption (A1), there is some σ > 0 such that

f (xi) − f (xi+1) ≥ σk∇ f (xi)k kxi − xi+1k .

It is easily verified, by definitions of the derivative and f , that

∇ f (ui) = kxik∇ f (xi) .

Further, by Lemma 4.3.1,

xi+1 kxi − xi+1k kui − ui+1k ≤ 2 ui − = 2 . kxik kxik

Thus, σ k∇ f (u )k ku − u k 2 i i i+1 σ = k∇ f (x )k kx k ku − u k 2 i i i i+1

≤ σk∇ f (xi)k kxi − xi+1k

≤ f (xi) − f (xi+1)

= f (ui) − f (ui+1) ,

So Assumption (A1) holds for the sequence {ui}i∈N. 

4.3.3 Implications for tensor approximation

Creation of low-rank approximations of tensors, specifically when approximating a tensor using two or more separable tensors, is prone to produce divergent sequences of parameters. I make no effort to establish convergence of such a sequence. Instead, I consider the convergence of the sequence of normalized parameters, as in Section 4.3.2.

Let τ(x) be a multilinear map from parameters x ∈ Rn to tensors, and let T be the tensor one wishes to approximate. If one selects the usual objective function

f (x) = kτ(x) − T k2, one obtains a polynomial function on Rn. If instead one selects 2 ˆ x hτ(x),T i x − T f ( ) = kτ(x)k2 τ( ) , one obtains a bounded multivariate rational function with 92 denominator kτ(x)k4. Moreover, fˆ is a homogeneous function of degree 0, and ˆ ˆ f (x) = minα∈R f (αx) wherever f is defined. dm k k4 A full analysis of the Maclaurin series coefficients dtm τ(x + td) would be beyond the scope of this paper. One can, however, draw conclusions from established properties

n of the set Ox∗ . In particular, R \Ox∗ is of Lebesgue measure 0, so Ox∗ covers almost every direction of approach. If one uses the GRPMLS method from [SU15] to optimize fˆ, and normalizes the resulting parameters, then it is likely that the resulting sequence of parameters will converge. That is, it is likely that the approximation’s summands will form a stable configuration, changing little except in scale.

4.4 Concluding remarks

I have shown that a generalized Łojasiewicz inequality holds on sets adjacent to essential singularities of bounded multivariate rational functions. With this result, I have provided sufficient conditions under which a sequence will converge to a singularity of the objective function. Finally, I have shown that these results may be employed in the study of unbounded sequences by converting to a projective space. The results employed in this paper could potentially be extended to sequences for which Assumption (A4) does not hold. In particular, I expect that if a sequence is

well-described by some smooth curve, that is xi = γ(1/i) for some smooth γ with γ(0) = x∗, then a generalized Łojasiewicz inequality should hold on that sequence. This would, however, be of little practical utility due to the difficulty of showing that an algorithm produces sequences well-described by smooth curves. 93 5 Connectedness Properties of Low-rank Unit Tensors

In this chapter, I examine certain aspects of the topology of low-rank unit tensors that are relevant to the problem of low-rank tensor approximation. In particular, I consider connectedness properties of the topology induced by the natural inner product on a tensor product of Hilbert spaces. If a topological space is connected, an iterative method may in principle reach any point by taking arbitrarily small steps. If the space has the stronger path-connectedness property, a method may reach any point by a continuous motion – that is, by following a continuous path. I examine also the property of simple connectedness. Though this is of less practical interest than path-connectedness, a positive or negative answer would give insight into the “shape” of sets of low-rank unit tensors. The space of low-rank unit tensors is of practical interest because it shares the same connectedness properties as does the space of low-rank non-zero tensors. Study of the latter space is motivated by three observations. First, certain practical problems – such as the problem of low-rank approximation of a tensor – are typically formulated as optimization of functions on tensors of restricted rank. Second, popular approxmation algorithms such as ALS may exhibit undesired behavior4 if one of the summands of a low-rank tensor should become zero during the process of generating a decomposition. Third, if an iterative method were to generate an approximation that is zero due to destructive cancellation, then the behavior of the method may be dominated by numerical error. Together, these motivate the study of the space of low-rank unit tensors. In addition to practical concerns, determining connectedness properties of the space of low-rank unit tensors is non-trivial. If one could ignore restrictions on the rank and norm, the connectedness properties of the space would be reduced to a solved problem.

4 If a summand becomes zero at one iteration of a method, it may remain zero for all future iterations unless care is taken to detect this and re-initialize the summand. This could prevent a method from accurately decomposing a tensor of rank r, even if the method were initialized with an approximation of sufficient rank. 94

1. Failing to restrict the norm produces a star domain, which is contractible and thus simply-connected.

2. Restricting only the norm of the tensors considered produces a space that is homeomorphic to an n-sphere. It is known that the n-sphere is simply-connected if and only if n > 1.

In both of these cases, the connectedness properties are well known. If one restricts both rank and norm, however, the connectedness properties become more difficult to analyze. Section 5.1 provides formal definitions of various terms and introduces notation for the space of low-rank unit tensors. It also includes several lemmas not directly related to tensors. Section 5.2 establishes that a space of separable unit tensors inherits path-connectedness from any of its factors — that is, at least one must be path-connected. Similarly, the space of separable unit tensors inherits simple connectedness from all of its factors — that is, they must all be simply-connected. Section 5.3 extends the path-connectedness result from Section 5.2 to a statement that unit tensors of rank at most r inherit path-connectedness from their factors. As a corollary, such spaces are path-connected if their factors are vector spaces over C, or if their factors are vector spaces over R and at least one factor has dimension 2 or greater. Finally, it partially addresses simple connectedness of tensors of rank r. More precisely, it reduces the problem of showing that all loops may be contracted to that of showing the existence of connecting paths which may be contracted.

5.1 Preliminaries

Before I delve into the details of the topology of low-rank unit tensors, matters of definitions and notation must be addressed. For convenience, I state formal definitions of connected, path-connected, and simply-connected spaces. 95

Definition 5.1.1. A topological space X is connected if and only if there are no two disjoint non-empty open sets U, V ⊂ X such that X = U ∪ V.

In principle, one can reach any point in a connected space from any other by taking arbitrarily small steps. However, this does not guarantee that one may reach the other point by a continuous motion. This latter property is formalized as path-connectedness.

Definition 5.1.2. A topological space X is path-connected if and only if for every x, y ∈ X

there exists a continuous map f : [0, 1] → X such that f (0) = x and f (1) = y.

Note that a path-connected space must necessarily be connected. In this chapter, I will address only the stronger property.

Definition 5.1.3. A topological space X is simply-connected if and only if it is path-connected and every continuous function f : S 1 → X can be extended to a

2 1 2 continuous function F : D → X such that F|S 1 ≡ f . Here, S is the circle and D is the two-dimensional unit disk.

Simple connectedness gives insight into the shape of the space. In particular, a simply-connected space has no “holes” around which a loop may be wrapped.

Definition 5.1.4. Given topological spaces X and Y, two functions φ0, φ1 : X → Y are homotopic if they are continuous and there exists a , a continuous map

ψ : X × [0, 1] → Y such that ψ(x, 0) = φ0(x) and ψ(x, 1) = φ1(x) for all x ∈ X.

Because the topological spaces considered in this chapter are metric spaces, I will use a shorthand notation for open balls in each space.

Definition 5.1.5. Given a metric space X, with metric m : X × X → R, define the open ball with radius r > 0 centered at p ∈ X by BX(p, r) = {x ∈ X : m(x, p) < r}.

Throughout this chapter, I will refer to sets of unit tensors of rank at most r. For brevity, I will denote sets of tensors of rank at most r as follows: 96

d Definition 5.1.6. Given vector spaces V1,..., Vd, let S(r, {Vi}i=1) be the set of tensors of  Nd  S { }d Pr l l ∈ rank bounded by r, defined formally as (r, Vi i=1) = l=1 i=1 vi vi Vi .

I define also the corresponding sets of unit tensors as follows.

d Definition 5.1.7. Given vector spaces V1,..., Vd, let S1(r, {Vi}i=1) be the set of unit tensors n o d d of rank bounded by r, defined formally as S1(r, {Vi}i=1) = T ∈ S(r, {Vi}i=1) kT k = 1 .

Finally, several lemmas employed in this chapter are not directly related to low-rank tensors. The following lemma describes a property of complex numbers that will be used in Lemma 5.2.6.

1 Lemma 5.1.8. Let k ∈ [1, ∞), and let x, y ∈ C. If k ≤ |x| ≤ |y| ≤ 1, then x y − ≤ k |x − y| . |x| |y| Proof: Multiplication by a unit preserves modulus, so assume, by

|y| multiplication of both x and y by y , that y = |y|. Let x = a + bi and y = c, with a, b, c ∈ R. Then !2 x y 2 a b2 − = √ − 1 + |x| |y| 2 2 a2 + b2 a + b √ 2(a2 + b2) − 2a a2 + b2 = a2 + b2 √ ≤ 2k2(a2 + b2) − 2k2a a2 + b2

= k2(a2 + b2) + k2|x|2 − 2k2a|x|

≤ k2(a2 + b2) + k2c2 − 2k2ac

= k2(a − c)2 + k2b2

= k2 |a + bi − c|2

= (k |x − y|)2 .

Here, the second inequality follows from the assumption that c = |y| ≥ |x|, and thus that c ≥ |a|.  97

In several of the more complicated proofs, I must define functions piecewise on the unit interval. The following lemma will guide the subdivision of the interval:

Lemma 5.1.9. Let K = [0, 1], and let {Uα}α∈A be an open cover of K. Then there exist t1,..., tm ∈ K such that 0 = t1 < tn < tn+1 < tm = 1 and [tn, tn+1] ⊂ Uα for some αn ∈ A.

Proof: Each Uα is a countable union of disjoint intervals Iα,n open in K. It suffices to show that the lemma holds for the collection {Iα,n}α∈A,n∈N. The space K is compact under the usual topology, so there exists a finite subcover

k1 {I j} j=1 of {Iα,n}α∈A,n∈N. By possible removal of intervals from this cover, assume that there there exist no i, j such that Ii ⊂ I j. For each j, let a j, b j denote the endpoints of I j; note

k2 Sk1 that a j < b j. Denote by {sn}n=0 an enumeration of {0, 1} ∪ j=1{a j, b j} such that sn < sn+1.

Note that s0 = 0 and sk2 = 1.

t2n = sn = a j2n+1 t2n+1 t2n+2 = sn+1 = b j2n

I j2n I j2n+1

k1 Figure 5.1: Refining the partition {sn}n=1.

{ }k2 ⊂ The partition sn n=1 does not ensure that [sn, sn+1] I jn for some jn = 1,..., k1, so I

2k2 will refine it. Define {tn}n=0 by   s if n is even  n/2 tn =   s(n−1)/2+s(n+1)/2  2 if n is odd.

I now establish that every n has [tn, tn+1] ⊂ I jn , for some jn ∈ 1,..., k1. Each interval I j

k1 covers (a j, b j), and there must exist h, i such that an ∈ Ih and bn ∈ Ii, for otherwise {I j} j=1

would fail to cover K. For any c1 ∈ (a j, bh) and c2 ∈ (ai, b j),

[a j, c1] ⊂ Uh , [c1, c2] ⊂ U j and [c2, b j] ⊂ Ui . 98

That is, one need only select a partition consisting of the endpoints of the intervals, and a

2k2 point in each interval of overlap. The partition {tn}n=0 satisfies this property, as is

illustrated in Fig. 5.1. Each interval I j is a subset of some Uα, so the lemma holds. 

As part of a later proof, I will need to alter a piecewise-continuous path in a vector space such that it becomes continuous. I will be able to assume that the path may be made continuous by multiplying each point in the path by a scalar.

Lemma 5.1.10. Let V be an inner product space over a field K = R or C. If

m 1. {In}n=1 is a finite collection of closed intervals such that In ⊂ [0, 1] and Sm n=1 In = [0, 1], and such that the interiors of the intervals are pairwise disjoint,

2. ψ, β : [0, 1] → V are functions such that hψ(x), β(x)i , 0 for all x ∈ [0, 1], and

3. λn : In → K are functions such that |λn(x)| = 1 for every x ∈ In and such that each

λnψ|In is continuous on In, then there exists a function λ : [0, 1] → K such that φ = λψ is continuous and |λ(x)| = 1 for every x ∈ [0, 1]. Further, this function may be selected such that if β(0) = β(1) and

ψ(0) = kψ(1) for some k ∈ K, then φ(0) = φ(1)

Proof: For each n, denote by [an, bn] the interval In. No generality is lost in assuming that

m the collection {In}n=1 is sorted. Similarly, discarding any In and λn for which an = bn does

not affect the validity of the proof. For the remainder of the proof, I assume that bn ≤ an+1 for every n = 1,..., m − 1 and that an < bn for every n. This implies bn = an+1. Define

λ (x) c = n , x hψ(x), β(x)i t−an ! bn−an can |can | cbn ϕn(t) = λn(t)ψ(t) . |can | can |cbn | 99

As a continuous scalar multiple of λnψ|In , the map ϕn is continuous on In. At an and at bn, scalars cancel giving

|hψ(an), β(an)i| |hψ(bn), β(bn)i| ϕn(an) = ψ(an) and ϕn(bn) = ψ(bn) . (5.1) hψ(an), β(an)i hψ(bn), β(bn)i

Note that the values ϕn attains at the endpoints an and bn are independent of λn.

It follows that ϕn(bn) = ϕn+1(bn) = ϕn+1(an+1), so the functions can be joined into a single continuous function. Define the map φ : [0, 1] → V \{0} by   φ ϕn(t) if t ∈ (an, bn] t 7−→  .  ϕ1(0) if t = 0

The function φ is continuous. Because φ is constructed from ψ by unit scalar

multiplication, there exists λ : [0, 1] → K with |λ(t)| = 1 such that φ = λψ. Suppose that β(0) = β(1) and ψ(0) = kψ(1) for some k ∈ K. Substituting these into (5.1) gives

|hψ(0), β(0)i| |hkψ(1), β(1)i| |hψ(1), β(1)i| ϕ (0) = ψ(0) = kψ(1) = ψ(1) = ϕ (1) . 1 hψ(0), β(0)i hkψ(1), β(1)i hψ(1), β(1)i m

That φ(0) = φ(1) then follows from its definition. 

Finally, I must demonstrate that any path in an inner-product space admits a path orthogonal to it, provided that the inner-product space is of dimension 2 or higher.

Lemma 5.1.11. Let V be an inner product space on K = R or C, with dimension d ≥ 2. Any continuous φ : [0, 1] → V \{0} admits a continuous φ⊥ : [0, 1] → V with kφ⊥(t)k = kφ(t)k and φ⊥(t) ⊥ φ(t) for all t ∈ [0, 1]. Moreover, if φ(0) = φ(1), then φ⊥(0) = φ⊥(1).

Proof: Let e1, e2 ∈ V be orthonormal vectors. The subset W = span(e1, e2) \{0} ⊂ V is ρ path-connected. Define ρ : V → W by x 7−→ hx, e2ie1 − hx, e1ie2. Let { ∈ |h i|2 |h i|2 1 k k } U = v V : v, e1 + v, e2 < 2 v . Note that U is open. The inverse image 100

−1 φ [U] is open in [0, 1], and thus is a countable union of disjoint intervals In open in [0, 1].

For each n, let αn : In → W be a path connecting the vectors u and v, which are defined as follows:

1. If an, the left endpoint of In, is contained in In, then u = e1. Otherwise, u = ρ(φ(an)).

2. If bn, the right endpoint of In, is contained in In, then v = e1. Otherwise,

v = ρ(φ(bn)).

The map ψ : [0, 1] → V defined by   hα (t),φ(t)i α (t) − φ(t) n if t ∈ I ψ  n kφ(t)k2 n t 7−→   ρ(φ(t)) otherwise

⊥ 7−→ kφ(t)k is continuous, nonzero, and orthogonal to φ on [0, 1]. Defining φ by t kψ(t)k ψ(t) completes the proof. 

5.2 Separable tensors

The separable case is relatively straightforward. A separable tensor may be considered the image of a tuple of vectors [v1,..., vd] under the map defined by Nd Nd Nd 7−→ [v1,..., vd] i=1 vi. This map is not one-to-one, as i=1 vi = i=1 λivi, provided Qd that i=1 λi = 1. Despite this scaling indeterminacy, one may address problems of connectedness of separable tensors by selecting appropriate factorizations, demonstrating the desired property in the factor spaces, and mapping back to the space of separable tensors.

5.2.1 The tensor product is continuous

The following result provides a tight bound on the size of perturbation a tensor displays when its factors are perturbed. Further, it establishes that ⊗ is uniformly continuous on unit vectors. Though Lemma 5.2.1 is not vital for later proofs, uniform 101

continuity allows more elegant arguments, and the tight bound allows precise formulation of certain inequalities.

Lemma 5.2.1. Suppose V1,..., Vd are vector spaces equipped with k · k1,...,

k · kd. Let x1, y1 ∈ V1,..., xd, yd ∈ Vd. If there exists  ≥ 0 such that for all k, it holds that kyk−xkkk ≤ , then kxkkk Nd  Nd  − k=1 yk k=1 xk ≤ (1 + )d − 1 . (5.2) Nd

k=1 xk Further, this bound is tight.

Proof: For each n = 1,..., d, let

On On An = xk and Bn = yk . k=1 k=1 The proof proceeds by induction on n with the induction hypothesis

n kBn − Ank/kAnk ≤ (1 + ) − 1 . (5.3)

Base case If n = 1, then (5.3) holds trivially because

kB1 − A1k = ky1 − x1k1 ≤ kx1k1 = kA1k . 102

Inductive step If the induction hypothesis (5.3) holds at n, then

kBn+1 − An+1k/kAn+1k

= kBn ⊗ yn+1 − An ⊗ xn+1k/kAn+1k

≤ (kxn+1 − yn+1kn+1kBnk + kxn+1kn+1kBn − Ank)/kAn+1k

≤ kxn+1kn+1(kBnk + kBn − Ank)/kAn+1k

≤ kxn+1kn+1(kAnk + (1 + )kBn − Ank)/kAn+1k

=  + (1 + )kBn − Ank/kAnk .   Xn−1  k =  1 + (1 + ) (1 + )  k=0 Xn =  (1 + )k . k=0 ! 1 − (1 + )n =  1 − (1 + )

= (1 + )n − 1 .

Thus if (5.3) holds at n, it holds at n + 1.

By induction, (5.3) holds at n = d, and thus (5.2) holds.

To see that the bound is tight, consider yk = (1 + )xk for all k = 1 ... d. 

The tensor product is known to be continuous in the topology induced by the inner product [Hac14]. For completeness, I prove the same here, with the aid of Lemma 5.2.1. Recall that a function f : X → Y, where X and Y are topological spaces, is continuous if and only if for every open U ⊂ Y, the preimage f −1[U] is open in X.

d d Corollary 5.2.2. The map φ : ×i=1Vi → S(1, {Vi}i=1) defined by

d φ O [x1,..., xd] 7−→ xi i=1 is continuous. 103

d −1 Proof: Let U ⊂ S(1, {Vi}i=1) be an open subset. Then for any x = [x1,..., xd] ∈ φ [U], there exists  > 0 such that BS { }d (φ(x), ) ⊂ U. (1, Vi i=1) I separate the problem into two cases:

Case 1 Suppose xi = 0 for some i.

Then I need only control the norm of the tensor product. Let M = maxi kxik. For each i = 1,..., d, let   B (0, /(M + 1)d) if x = 0  Vi i U =  i   BVi (0, M + 1) if xi , 0

nNd o Then ui : ui ∈ Ui ⊂ BS { }d (φ(x), ). i=1 (1, Vi i=1)

Case 2 Suppose otherwise, that xi , 0 for all i.

For each i = 1,..., d, let

  1/d  Ui = BVi xi, kxik (1 + /kφ(x)k) − 1 .

kxi−uik 1/d By definition, any ui ∈ Ui satisfies < (1 + /kφ(x)k) − 1. kxik

Nd kφ(x)− i=1 uik  Lemma 5.2.1 gives kφ(x)k < kφ(x)k . That is,

nNd o ui : ui ∈ Ui ⊂ BS { }d (φ(x), ) . i=1 (1, Vi i=1)

−1 I have established that x ∈ φ [U] admits an open neighborhood U1 × · · · × Ud such that

−1 the image φ[U1 × · · · × Ud] is a subset of U. It follows that the preimage φ [U] is open.

By definition of continuity, φ is continuous. 

5.2.2 Path-connectedness

The tensor product has been shown to be continuous. Thus, if one selects paths

within the factor spaces V1,..., Vd, then their product will also be a path. That is, if each 104

Vi \{0} were path connected then one would only need to select factorizations of each endpoint and paths between the factors of each endpoint. Though this argument does not require that the endpoints be factored in any specific way, a careful factorization can strengthen the argument, as in Lemma 5.2.3.

Lemma 5.2.3. Let Vi be inner product spaces over K. If at least one Vi \{0} is

d path-connected, then S1(1, {Vi}i=1) is path-connected. Moreover, if every Vi \{0} is d d path-connected then the path connecting ⊗i=1ui to ⊗i=1vi may be constructed as a

pointwise tensor product of paths connecting each ui to vi.

Proof: Suppose at least one Vi \{0} is path-connected; I assume, without loss of

generality, that V1 \{0} is path-connected.

d Let X, Y ∈ S1(1, {Vi}i=1). Then for some vectors ui, vi ∈ Vi, i = 1,..., d,

Od Od X = ui and Y = vi . i=1 i=1

For each i for which Vi \{0} fails to be path-connected, multiply both ui and u1 by −1 Nd h i if Re( ui, vi ) < 0. Note that this does not alter i=1 ui. Define a curve

φi(t) : [0, 1] → Vi \{0} by

φi (1 − t)u + tv t 7−→ i i . k(1 − t)ui + tvik

This is well-defined, because for every t ∈ [0, 1), Re(hui, (1 − t)ui + tvii) > 0, so

(1 − t)ui + tvi , 0.

For each i with Vi \{0} path-connected, there exists a path φi(t) : [0, 1] → Vi \{0} connecting ui to vi.

d Let φ : [0, 1] → S1(1, {Vi}i=1) be defined by

d φ O t 7−→ φi(t) . i=1

Clearly, φ(0) = X, and φ(1) = Y. By Corollary 5.2.2, φ is continuous.  105

5.2.3 Simple connectedness

d I show that S1(1, {Vi}i=1) is simply-connected if the sets of non-zero vectors in its factor spaces Vi \{0} are simply-connected. Unlike the argument that demonstrates path-connectedness, the argument that demonstrates simple connectedness requires that factorizations be carefully selected. Even if each Vi \{0} were simply-connected, one would need to show that continuous loops of tensors may be factored continuously before this could be employed. The following pair of lemmas show that similar separable tensors have similar factorizations, up to scalar multiplication. Though this is weaker than the existence of a continuous inverse, it will serve much the same purpose. Note that for every Nd X ∈ S { }d X 1(1, Vi i=1), there exist representations = i=1 xi such that for all i = 1,..., d, it holds that kxik = 1. h  ∈ 1 Lemma 5.2.4. Let Vi be inner-product spaces over K, and  0, 2 be given. Let Nd Nd Y Z ∈ S { }d Y Z , 1(1, Vi i=1), with representations = i=1 yi and = i=1 zi such that

ky1k = kz1k = 1.

If kY − Zk ≤ , then there exists λ ∈ K such that |λ| = 1 and ky1 − λz1k ≤ 2.

Proof: I seek a contradiction. Suppose the lemma does not hold; specifically, suppose

there exist Y and Z satisfying the conditions of the lemma, but all λ ∈ K with |λ| = 1 have

the property that ky1 − λz1k > 2. The quantity |hy1, z1i| is then bounded from above, since

2 2 ky1 − λz1k > 4

2   2 2 2 ⇐⇒ ky1k − 2Re λ¯hy1, z1i + |λ| kz1k > 4

  2 ⇐⇒ 1 − 2Re λ¯hy1, z1i + 1 > 4

2   ⇐⇒ 1 − 2 > Re λ¯hy1, z1i

2 =⇒ 1 − 2 > |hy1, z1i| . (5.4) 106

0 0 Let y = y1 − cz1, where c = hy1, z1i. Note that hy , z1i = 0. This implies Nd Nd Nd h 0 ⊗ Zi h 0 ⊗ ⊗ i y i=2 yi, = y i=2 yi, z1 i=2 yi = 0. Thus,

Nd Nd 2 kY − Zk2 0 ⊗ ⊗ − Z = y i=2 yi + cz1 i=2 yi Nd 2 Nd 2 0 ⊗ ⊗ − Z = y i=2 yi + cz1 i=2 yi Nd 2 ≥ 0 ⊗ y i=2 yi = ky0k2

0 By (5.4) and orthogonality of y and z1,

0 2 2 2 2  22  2 4 2 ky k = ky1k − |c| kz1k > 1 − 1 − 2 = 4  −  ≥  .

Then kY − Zk > , which contradicts the assumption that kY − Zk ≤  and thus establishes the lemma. 

Lemma 5.2.4 shows that factorizations must be “similar”, up to scalar multiplication, but only for a single factor. This shortcoming is addressed by the following lemma, which provides a similar result for all factors simultaneously.

d Lemma 5.2.5. Let Vi be inner-product spaces over K. Let Y, Z ∈ S1(1, {Vi}i=1), with Nd Nd Y Z k k k k representations = i=1 yi and = i=1 zi such that yi = zi = 1 for all i = 1,..., d. For every  ≥ 0, there exists δ ≥ 0 such that, if kY − Zk ≤ δ, then there exist

λ1, . . . , λd ∈ K such that, for every i = 1,..., d, |λi| = 1 and kyi − λizik ≤ . Further, if  > 0 then one may select δ > 0.

1/d − 1 { } Proof: Let γ = (/2 + 1) 1, and let δ = 2 min 1, , γ .

Assume that kY − Zk ≤ δ. Using Lemma 5.2.4, select λ1, . . . , λd−1 such that

kyi − λizik ≤ γ for each i = 1,..., d − 1. 107

Nd Qd−1 −1 | | Z Let λd = i=1 λi . Then λd = 1 and = i=1 λizi . By Lemma 5.2.1,   Od−1   d  Y −  λizi ⊗ yd ≤ (1 + γ) − 1 ≤ ,   2 i=1 and by the triangle inequality,

 −   −  Od 1  Od 1  kY − Zk ≥  λ z  ⊗ (y − λ z ) − Y −  λ z  ⊗ y  i i d d d  i i d i=1 i=1  ≥ ky − λ z k − d d d 2 kY − Zk ≤ k − k ≤  ≤ If δ, then yd λdzd δ + 2 . 

Lemma 5.2.5 states that factorizations are similar up to scalar multiplication, but this does not establish a continuous factorization. Fortunately, the structure provided by an inner product can be used to select appropriate scalars for each factor.

Lemma 5.2.6. Let X by a topological space, let ψi : X → Vi \{0}, for i = 1,..., d be Nd ∈ T 7→ T maps, let v V1, and let be the map defined by t i=1 ψi(t). If is continuous and |h i| ≥ 1 ∈ if v, ψ1(t) 2 for all t X, then the map φ defined by

φ |hψ1(t), vi| t 7−→ ψ1(t) hψ1(t), vi is continuous.

−1 Proof: Let t ∈ X and  > 0. I must show that the preimage φ [BV1 (φ(t), )] contains an open neighborhood of t. By Lemma 5.2.4 and continuity of T , there exists an open ∈ k − k 1 ∈ neighborhood U of t such that each s U has ψ1(t) λψ1(s) < 3  for some λ K with |λ| = 1. ∈ |h i − h i| 1 Let s U. By the Cauchy-Schwartz inequality, ψ1(t), v λ ψ1(s), v < 3 .

Lemma 5.1.8 implies that hψ1(t),vi − λ hψ1(s),vi < 2 . Inversion of unit complex numbers is |hψ1(t),vi| |hψ1(s),vi| 3 conjugation, a unitary linear map, so |hψ1(t),vi| − λ−1 |hψ1(s),vi| < 2 . hψ1(t),vi hψ1(s),vi 3 108

Let w = λ−1 |hψ1(s),vi| hψ1(t),vi . Then |1 − w| < 2 , and hψ1(s),vi |hψ1(t),vi| 3

|hψ1(t), vi| |hψ1(s), vi| kφ(t) − φ(s)k = ψ1(t) − ψ1(s) hψ1(t), vi hψ1(s), vi

= kψ1(t) − λwψ1(s)k

≤ kψ1(t) − λψ1(s)k + kλψ1(s) − λwψ1(s)k

= kψ1(t) − λψ1(s)k + |λ||1 − w|kψ1(s)k

<  .

⊂ −1 Thus U φ [BV1 (φ(t), )]. It follows that φ is continuous on X. 

The proof of Theorem 5.2.8 will show that one factor of a loop or path may be selected to be continuous. The following lemma demonstrates that the product of all the other factors must then also be continuous. That is, the product of the remaining factors is itself a continuous tensor path. Repeating the argument of Theorem 5.2.8 will again reduce the number of discontinuous factors, until none remain.

Lemma 5.2.7. Let U, V be vector spaces equipped with a seminorm, and let u : [0, 1] → U and v : [0, 1] → V be mappings such that ku(s)k = kv(s)k = 1 for all s ∈ [0, 1]. If u is continuous, but v has a discontinuity at t, then u ⊗ v : [0, 1] → U ⊗ V has a discontinuity at t.

Proof: For some  > 0, for all δ > 0, there exists s ∈ (t − δ, t + δ) such that kv(t) − v(s)k ≥ . If δ is such that s ∈ (t − δ, t + δ) =⇒ ku(t) − u(s)k < /2, then

ku(t) ⊗ v(t) − u(s) ⊗ v(s)k ≥ ku(s) ⊗ v(t) − u(s) ⊗ v(s)k

− ku(t) ⊗ v(t) − u(s) ⊗ v(t)k

= kv(t) − v(s)k − ku(t) − u(s)k

> /2 .

Thus u ⊗ v has a discontinuity at t.  109

Lemma 5.2.6 provides a method of enforcing continuity and a condition which the method requires. In particular, enforcing continuity on a path using the method of Lemma 5.2.6 requires selecting a vector to which no part of the path is orthogonal. Though this is not possible for certain paths, one may subdivide a path until the method of Lemma 5.2.6 becomes viable. Lemma 5.1.10 modifies the continuous segments and connects them into a single continuous function. All that remains is to properly subdivide the path.

Theorem 5.2.8. Let Vi be subsets, closed under scalar multiplication, of inner product

d spaces over K, and let X : [0, 1] → S1(1, {Vi}i=1) be a continuous map. Then there exist

continuous maps φi : [0, 1] → Vi \{0} such that

Nd X ∈ 1. (t) = i=1 φi(t) for all t [0, 1], and

2. if X(0) = X(1) then φi(0) = φi(1) for all i = 1,..., d.

Proof: To construct the maps φi, I first show that a factor may be made continuous by multiplying by a scalar with norm 1, then apply Lemma 5.2.7 and induction to show this holds for all factors simultaneously. The case d = 1 is trivial so I assume d > 1.

d Let X : [0, 1] → S1(1, {Vi}i=1) be a continuous map. By definition, there exist maps Nd → \{ } X ∈ ψi : [0, 1] Vi 0 such that (t) = i=1 ψi(t) for all t [0, 1]. Assume, without loss of generality, that kψi(t)k = 1 for all t ∈ [0, 1].

For each α ∈ {v ∈ V1 : kvk = 1} = A, let

 d  O 1 U =  v : v ∈ V , kv k = 1, |hv , αi| >  . α  i i i i 1 2  i=1 

d Note that each Uα is open in S1(1, {Vi}i=1). −1 X is continuous, so each preimage X [Uα] is open. Lemma 5.1.9 provides

m0 Sm {tn}n=1 ⊂ [0, 1] such that X([tn, tn+1]) ⊂ Uα for some α, and such that k=1[tn, tn+1] = [0, 1]. 0 For each n < m , there exists αn such that X([tn, tn+1]) ⊂ Uαn . If X(0) = X(1), select 110

0 0 0 αm = α1. If X(0) , X(1), there exists αm such that X(tm ) ∈ Uαm0 . By construction, |h i| 1 ∈ ψ1(t), αn > 2 for all t [tn, tn+1]. These intervals fulfill condition1 of Lemma 5.1.10.

Lemma 5.2.6 states that the maps ϕn :[tn, tn+1] → V1 defined by

ϕn |hψ1(t), αni| t 7−→ ψ1(t) hψ1(t), αni

|hψ1(t),αni| are continuous. Thus, selecting λn = fulfills condition3 of Lemma 5.1.10. hψ1(t),αni Defining β : [0, 1] → V by x 7−→ α fulfills condition2 – the final remaining 1 (min{n:X(x)∈Uαn })

condition – of Lemma 5.1.10. By Lemma 5.1.10, there exists a function λ1 : [0, 1] → K with |λ1(t)| = 1 such that φ1 = λψ is a continuous function. Further,

[X(0) = X(1)] =⇒ [β(0) = β(1)] =⇒ [φ1(0) = φ1(1)] , so the factorization is loop-preserving. I show by induction that the remaining factors are continuous.  Nd  X ⊗ −1 = φ1 λ1 i=2 ψi , so by Lemma 5.2.7, the map  Nd  −1 → S { }d λ1 i=2 ψi : [0, 1] 1(1, Vi i=2) is continuous. By the argument above, one may construct continuous maps λ2 and φ2 = λ2ψ2. Then  Nd  −1 −1 → S { }d λ1 λ2 i=3 ψi : [0, 1] 1(1, Vi i=3) is continuous. In the same manner, construct Qd−1 −1 continuous maps φ3, . . . , φd−1 and λ3, . . . , λd−1. Let λd = i=1 λi and let φd = λdψd. By Nd X X construction, = i=1 φi. Lemma 5.2.7 then states that φd is continuous, for otherwise would be discontinuous.

If X(0) = X(1), then φi(0) = φi(1) for i = 1,..., d − 1. If also φd(0) , φd(1), then

X(0) , X(1) Thus, by contradiction, [X(0) = X(1)] =⇒ [φd(0) = φd(1)]. 

I am, at last, able to prove the that separable tensors are simply-connected. As with the proof of path-connectedness of separable tensors, the proof reduces the problem to one on the factor spaces. 111

Theorem 5.2.9. Let Vi be subsets, closed under scalar multiplication, of inner product

d spaces over K. If each Vi \{0} is simply-connected, then S1(1, {Vi}i=1) is simply-connected.

Proof: Let S 1 = {x ∈ R2 : kxk = 1} be the 1-sphere, and let D2 = {x ∈ R2 : kxk ≤ 1} be the

1 d corresponding closed disk. Let X : S → S1(1, {Vi}i=1) be a continuous map. The compact interval I1, with endpoints identified, is homeomorphic to S 1, so I may apply

1 Theorem 5.2.8 to X. By Theorem 5.2.8, there exist continuous maps φi : S → Vi \{0} Nd X \{ } such that = i=1 φi. Because for each i, Vi 0 is simply-connected, there exist 2 0 2 d continuous ψi : D → Vi \{0} such that ψi|S 1 = φi. Define X : D → S1(1, {Vi}i=1) as 0 Nd 0 0 X X | 1 X X = i=1 ψi. Note that S = , and that is continuous by Corollary 5.2.2. 

5.2.4 Miscellany

Thus far, I have worked with separable unit tensors. In later sections, I will work with sums of separable tensors, without restriction on the norms of the summands. To simplify later proofs, I wish to assume that all summands are either non-zero or identically zero. The following lemma enables this assumption by showing that, if a path of separable tensors is not always zero, it is homotopic to a path which is never zero.

Lemma 5.2.10. Let Vi be inner product spaces over K such that every Vi \{0} is

d path-connected, and let  > 0. Let also T : [0, 1] → S(1, {Vi}i=1) be a continuous map such that kT (0)k ≥  and kT (1)k ≥ . Then there exists some continuous map

2 d φ : [0, 1] → S(1, {Vi}i=1) such that for every s, t ∈ [0, 1],

φ(t, 0) = T (t) ,

φ(0, s) = T (0) ,

φ(1, s) = T (1) ,

kφ(t, 1)k ≥  , and

kT (t) − φ(t, s)k ≤ (3d − 1) . 112

Note that the map φ above is a homotopy.

Proof: For any 0 < δ ≤ , let Uδ = {t : T < δ}. Then Uδ is open under the usual topology

δ S δ δ of the real line, so it is a countable union of disjoint open intervals U = i(ai , bi ). By δ δ δ d δ δ δ Lemma 5.2.3, there exist paths φi :[ai , bi ] → S(1, {Vi}i=1) such that φi (ai ) = T (ai ), that δ δ δ δ φi (bi ) = T (bi ), and that for all t, kφi (t)k = δ. Define   δ δ δ φi (t) if t ∈ (ai , bi ) φδ : [0, 1] → S(1, {V }d ) by t 7→  . i i=1  T (t) otherwise

Note that every φδ is continuous. For any choice of δ, the paths φδ and φδ/2 both admit continuous factorizations, by Theorem 5.2.8. By Lemma 5.2.4, the factorizations of φδ and of φδ/2 are identical up to unit scalar multiplication on [0, 1] \ Uδ. I may then multiply the factors φδ/2 by unit scalars to force them to be equal to those of φδ on [0, 1] \ Uδ. By Lemma 5.2.7 these unit scalars may be chosen to be continuous on t. Using Lemma 5.2.3, one may select the connecting paths on Uδ such that their factorizations are paths connecting the factorizations of the endpoints of the intervals. Thus, there is no generality lost in assuming that the continuous factorizations of φδ and of φδ/2 are identical on [0, 1] \ Uδ.

1− j Further, by induction on δ we may assume that for all j ∈ N, the maps φ2 δ are identical on [0, 1] \ Uδ. Denote these factorizations by

d δ O δ φ = ψi . i=1

Note that on Uδ,

δ δ/2 δ δ/2 kψi − ψi k ≤ 2 max{kψi k, kψi k} ≤ 2δ .

δ δ/2 Note also that ψi |[0,1]\Uδ = ψi |[0,1]\Uδ . 113

2 d Define φ : [0, 1] → S(1, {Vi}i=1) by   1− j − j  Nd j 2  j 2  − j 1− j φ  i (2 s − 1)ψi (t) + (2 − 2 s)ψi (t) if s ∈ (2 , 2 ] (t, s) 7→  =1   T (t) if s = 0

By Lemma 5.2.1, φ is continuous and fulfills the requirements of Lemma 5.2.10. 

Though Lemma 5.2.10 was stated for separable tensors, it is not difficult to apply it to

d tensors of higher rank. Suppose I have a loop in S1(r, {Vi}i=1) with continuous summands. 1 I observe that by selecting  < r(3d−1) , the homotopies generated by Lemma 5.2.10 for the d summands cannot sum to 0. By re-scaling these maps, I see that any loop in S1(r, {Vi}i=1) d with continuous summands is homotopic to a loop in S1(r, {Vi}i=1) for which all summands are continuous and either everywhere non-zero or everywhere zero.

5.3 Sums of separable tensors

Spaces of sums of separable tensors are more difficult to analyze than are spaces of separable tensors. As was established in Lemma 5.2.5, factorizations of a separable tensor are equivalent up to scalar multiplication. Similarly, a sum of separable tensors may have its summands rearranged to create an equivalent decomposition. In certain cases, other trivially equivalent decompositions may also exist. Unfortunately, not all equivalent decompositions may be described trivially. This summation indeterminacy complicates

d the study of continuous functions on S1(r, {Vi}i=1), and is not yet fully understood.

5.3.1 Path-connectedness

I now provide sufficient conditions for the set of unit tensors of rank at most r to be path-connected, and provide neccessary and sufficient conditions if the factor spaces Vi are vector spaces over R or over C. In particular, a space of unit tensors of bounded rank is path-connected if at least one of its factor spaces — with 0 removed — is 114

path-connected. If the field is C, the space is path connected. If the field is R, at least one factor space must have dimension exceeding 1.

Theorem 5.3.1. Let Vi be inner product spaces over K. If at least one Vi \{0} is

d path-connected, then S1(r, {Vi}i=1) is path-connected.

Proof: By Lemma 5.2.3, it suffices to show that each unit tensor is connected by a path to a separable unit tensor. If r = 1, the tensor is already separable, so we consider here only r > 1.

d 1 r d Let X ∈ S1(r, {Vi}i=1). Then, for some X ,..., X ∈ S(1, {Vi}i=1), Xr X = Xl . l=1 Without loss of generality, assume that i < j =⇒ kXik ≥ kX jk.

1 Pr l −1 1 d Case 1: If X = λ l=2 X for some λ ∈ K, then (1 + λ )X = X, so X ∈ S1(1, {Vi}i=1).

1 Pr l 1 Case 2: Suppose that for all λ ∈ K, X , λ l=2 X . Then the line connecting X and X does not pass through 0. Define maps

r d 1 X l Y : [0, 1] → S(r, {Vi}i=1) by t 7→ X + (1 − t) X and l=2 Y(t) Y : [0, 1] → S (r, {V }d ) by t 7→ . 1 1 i i=1 kY(t)k

d Trivially, these are continuous. Y1(0) = X and Y1(1) ∈ S1(1, {Vi}i=1), so this completes the proof.



Remark 5.3.2. If every Vi \{0} is path-connected then Lemma 5.2.10 may be applied to create a non-zero path that is the sum of r non-zero separable paths.

Theorem 5.3.1 provides necessary and sufficient conditions for path-connectedness of spaces of unit real and complex tensors. 115

d Corollary 5.3.3. Let Vi be inner product spaces over C. S1(r, {Vi}i=1) is path-connected.

Proof: Every factor space Vi \{0} is path-connected. 

d Corollary 5.3.4. Let Vi be inner product spaces over R. S1(r, {Vi}i=1) is path-connected if

and only if at least one Vi has dimension greater than 1.

d Proof: If every Vi has dimension 1, then S1(r, {Vi}i=1) is homeomorphic to R \{0}, which

is not path-connected. If some Vi has dimension greater than 1, then Vi \{0} is path-connected. 

5.3.2 Simple connectedness

One consequence of simple connectedness is that if a space of unit tensors is simply-connected, then every loop admits a contraction. This includes paths between two decompositions of the same tensor. I reduce the problem of simple connectedness to that of finding contractible paths between equivalent decompositions of tensors. Unlike in the separable case, a path or loop of tensors need not admit a continuous

3 decomposition. As an example, consider ` : [0, 1] → S(2, {Vi}i=1) \{0} defined by     3   3   ` O 1 O | cos(πt)| t 7−→ (2 + t)   + (2 − t)       i=1 0 i=1  sin(πt) 

The map ` has `(0) = `(1), and is thus a continuous loop, but the summands do not share this property. For t ∈ (0, 1), the rank of `(t) is exactly 2. The decomposition is unique5 in this case [DL13, Theorem 1.5], so any decomposition of ` into a sum of separable paths

`1 + `2 will have {k`1(t)k, k`2(t)k} = {2 + t, 2 − t}. It follows that max{k`1(t)k, k`2(t)k} → 2 as t → 0, but max{k`1(t)k, k`2(t)k} → 3 as t → 1. That is, `1 and `2 cannot be continuous loops.

5 Up to rearrangement of the summands and the scaling indeterminacy. 116

Paths of unit tensors must, however, be homotopic to a path with a piecewise-continuous decomposition, as will be shown in Theorem 5.3.6 below. To prove this, I will construct a sequence of splines that approaches the target path. Linearly interpolating between these splines will create a homotopy, but I must demonstrate that convex combinations of parameters do not exhibit pathological behavior. In particular, the interpolation in parameter space must not cause its image in tensor space to depart significantly from the target path. Lemma 5.3.5 provides conditions precluding this.

l d Lemma 5.3.5. For all  > 0, r ∈ N, and separable tensors T ∈ S(1, {Vi}i=1), there exist Nd l T l T l − T 1 some δ > 0 and factorizations i=1 vi = such that if < δ for every

l = 1,..., r and if t1,..., tr ∈ [0, 1], then   Od Xr  Xr  t vl − T 1 <  if t = 1 .  l i l i=1 l=1 l=1

Nd T 1 1 Proof: Fix a factorization = i=1 vi of the first separable tensor. Select γ > 0 such that (1 + γ)d − 1 ≤ . By Lemma 5.2.5, δ may be selected such that there exist factorizations of T 2,..., T r with the property that, for every l = 2,..., r and i = 1,..., d,

l 1 kvi − vi k < γ .

Pr From the assumption that l=1 tl = 1, it follows that

Xr Xr Xr t vl − v1 = t (vl − v1) ≤ t (vl − v1) < γ . l i i l i i l i i l=1 l=1 l=1 By Lemma 5.2.1,   Od Xr   t vl − T 1 <  .  l i i=1 l=1 

Theorem 5.3.6. Let Vi be subsets, closed under scalar multiplication, of inner product

d spaces over K, and let X : [0, 1] → S1(r, {Vi}i=1) be a continuous map. Then there exist 117

l d Pr l piecewise-continuous maps φ : [0, 1] → S(1, {Vi}i=1) such that X and l=1 φ are d homotopic in S1(r, {Vi}i=1).

Proof: The proof will construct linear splines, and linear interpolations thereof, to approximate $X$. These are convex combinations of the parameters of 2 and 4 tensors, respectively. As in the proof of Theorem 5.2.8, the space of tensors will be covered by a collection of open sets, and compactness of $[0,1]$ will be used to select a finite collection covering the target path. Using Lemma 5.3.5, for every $\epsilon > 0$ and separable tensor $T$, define an open ball $U_{\epsilon,T} = B_{S(1,\{V_i\}_{i=1}^d)}(T, \delta)$, where $\delta$ is chosen such that, for any four separable tensors $T^1, \ldots, T^4 \in U_{\epsilon,T}$, there exist factorizations $\bigotimes_{i=1}^d v_i^l$ for which
$$ \left\| \bigotimes_{i=1}^d \left( \sum_{l=1}^4 t_l v_i^l \right) - T \right\| < \epsilon \quad \text{if } \sum_{l=1}^4 t_l = 1 , $$
provided that $t_1, \ldots, t_4 \in [0,1]$. Fix $0 < \epsilon < \frac{1}{2}$ and $r \in \mathbb{N}$. Define the sum of sets of vectors to be the Minkowski sum $U + V = \{ u + v : u \in U, v \in V \}$. For each $n \in \mathbb{N}$, the collection
$$ O_n = \left\{ \sum_{l=1}^r U_{\frac{2^{-n}\epsilon}{r},\, T_l} : T_1, \ldots, T_r \in S(1, \{V_i\}_{i=1}^d) \right\} $$
is an open cover of $S(r, \{V_i\}_{i=1}^d)$. $X$ is continuous, so each preimage $X^{-1}[U]$ is open for any $U \in O_n$. Lemma 5.1.9 provides $\{s_{n,k}\}_{k=1}^m \subset [0,1]$ such that $X([s_{n,k}, s_{n,k+1}]) \subset \sum_{l=1}^r U_{\frac{2^{-n}\epsilon}{r},\, T_{l,k}}$ for some separable tensors $T_{l,k}$, and $\bigcup_{k=1}^m [s_{n,k}, s_{n,k+1}] = [0,1]$.

I now construct sets $A_n$ of points at which the pieces of the $n$-th spline begin and end. To ensure that I can continuously connect the $n$-th spline with the $(n+1)$-th spline using convex combinations of parameters, I require that $A_n \subset A_{n+1}$. Let $A_0 = \emptyset$. For $A_n$, $n \ge 1$, recursively define $A_n = A_{n-1} \cup \{s_{n,k} : k = 1, \ldots, m\}$. By construction, then, $|A_n| < \infty$ for all $n \in \mathbb{N}$.

Denote by $\{t_k^n\}_{k=1}^{m'}$ an enumeration of $A_n$ such that for all $k$, $t_k^n < t_{k+1}^n$. For each $k$, there exist separable tensors $T_l$ such that
$$ X([t_k^n, t_{k+1}^n]) \subset \sum_{l=1}^r U_{\frac{2^{-n}\epsilon}{r},\, T_l} . $$
By definition of $U_{\epsilon,T}$ and Lemma 5.3.5, there exist decompositions
$$ X(t_k^n) = \sum_{l=1}^r \bigotimes_{i=1}^d u_{i,n}^l \quad \text{and} \quad X(t_{k+1}^n) = \sum_{l=1}^r \bigotimes_{i=1}^d v_{i,n}^l , $$
and continuous maps $\phi_{n,k}^l : [t_k^n, t_{k+1}^n] \to S(1, \{V_i\}_{i=1}^d)$ defined by
$$ x \mapsto \bigotimes_{i=1}^d \left( u_{i,n}^l + \frac{x - t_k^n}{t_{k+1}^n - t_k^n} \left( v_{i,n}^l - u_{i,n}^l \right) \right) , $$
such that $\sum_{l=1}^r \phi_{n,k}^l|_{A_n} = X|_{A_n}$ and $\left\| \sum_{l=1}^r \phi_{n,k}^l(t) - X(t) \right\| \le 2^{1-n}\epsilon$ for all $t \in [t_k^n, t_{k+1}^n]$. Using these, define piecewise-continuous maps $\phi_n^l : [0,1] \to S(1, \{V_i\}_{i=1}^d)$ by
$$ \phi_n^l : t \mapsto \begin{cases} \phi_{n,1}^l(t) & \text{if } t = 0 \\ \phi_{n,k}^l(t) & \text{if } t_k^n < t \le t_{k+1}^n \end{cases} $$

such that $\sum_{l=1}^r \phi_n^l|_{A_n} = X|_{A_n}$ and $\left\| \sum_{l=1}^r \phi_n^l(t) - X(t) \right\| \le 2^{1-n}\epsilon$ for all $t \in [0,1]$. Each successive approximation $\sum_{l=1}^r \phi_n^l$ better approximates $X$, but the summands $\phi_n^l$ contain additional discontinuities. Figure 5.2 illustrates $\sum_{l=1}^r \phi_n^l$. I now have continuous functions $\epsilon$-close to $X$, with piecewise-continuous summands. I now construct homotopies between these discrete splines by interpolating linearly between the spline $\phi_n^l$, which is at most $2^{1-n}\epsilon$-distant from $X$, and $\phi_{n+1}^l$, which is at most half as distant from $X$.

Define maps $\psi_n^l : [0,1]^2 \to S(1, \{V_i\}_{i=1}^d)$ by
$$ (s, t) \mapsto (1-t)\,\phi_n^l(s) + t\,\phi_{n+1}^l(s) . $$

Figure 5.2: Improving a spline by refining subdivisions. (a) $\sum_{l} \phi_n^l$; (b) $\sum_{l} \phi_{n+1}^l$.

By construction, these have the following properties:

$$ \psi_n^l(s, 0) = \phi_n^l(s) , $$
$$ \psi_n^l(s, 1) = \phi_{n+1}^l(s) , $$
$$ \left\| \sum_{l=1}^r \left( \psi_n^l(s, t) - \psi_n^l(s, 1) \right) \right\| \le 2^{1-n}\epsilon , \quad \text{and} \tag{5.5} $$
$$ \sum_{l=1}^r \psi_n^l \ \text{is continuous.} $$

Let $X = \Phi^1 + \cdots + \Phi^r$ be a decomposition of $X$ into separable, but not necessarily continuous, maps. That is, $\Phi^1, \ldots, \Phi^r : [0,1] \to S(1, \{V_i\}_{i=1}^d)$. Define maps $\psi^l : [0,1]^2 \to S(1, \{V_i\}_{i=1}^d)$, and ensure that $\sum_{l=1}^r \psi^l(s, t) \to X(s)$ as $t \to 0$, by connecting the linear interpolations as follows:
$$ (s, t) \mapsto \begin{cases} \Phi^l(s) & \text{if } t = 0 \\ \psi_n^l(s, 2 - 2^n t) & \text{if } 2^{-n} < t \le 2^{1-n} . \end{cases} $$
Though each $\psi^l$ is discontinuous, the sum $\sum_{l=1}^r \psi^l$ is continuous, and thus is a homotopy between $X$ and $\sum_{l=1}^r \phi_1^l$. By (5.5), $\left\| \sum_{l=1}^r \psi^l(s, t) - X(s) \right\| \le 2\epsilon$, so $\left\| \sum_{l=1}^r \psi^l(s, t) \right\| \ge 1 - 2\epsilon > 0$. Because $\sum_{l=1}^r \psi^l$ never vanishes, I may re-scale it to create a homotopy in $S_1(r, \{V_i\}_{i=1}^d)$.
Finally, note that each $\phi_1^l$ can be discontinuous only at the points $t_k^1$. Defining $\phi^l = \phi_1^l$ completes the proof. $\square$
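The spline construction in the preceding proof can be mimicked numerically. The sketch below (illustrative only; dimensions and data are arbitrary, and the helper names are mine) interpolates the factors of two rank-$r$ decompositions between consecutive knots, so each piece stays separable for every argument while the sum matches $X$ at the knots, in the spirit of the maps $\phi_{n,k}^l$.

```python
import numpy as np

# Sketch of the separable splines phi_{n,k}^l: between knots t_k and t_{k+1}, each
# summand is interpolated factor-by-factor, so every piece is a separable tensor
# and the knot values of the sum are matched exactly.

rng = np.random.default_rng(1)
d, r, n = 3, 2, 4

def rank_one(factors):
    return np.einsum('i,j,k->ijk', *factors)

# Decompositions of X(t_k) and X(t_{k+1}); u[l][i], v[l][i] play the roles of u^l_i, v^l_i.
u = [[rng.standard_normal(n) for _ in range(d)] for _ in range(r)]
v = [[ui + 0.05 * rng.standard_normal(n) for ui in ul] for ul in u]

t_k, t_k1 = 0.25, 0.5

def phi(l, x):
    """Linear interpolation of the factors of summand l between the two knots."""
    s = (x - t_k) / (t_k1 - t_k)
    return rank_one([u[l][i] + s * (v[l][i] - u[l][i]) for i in range(d)])

# The sum of the pieces agrees with the given decompositions at both knots.
assert np.allclose(sum(phi(l, t_k) for l in range(r)),
                   sum(rank_one(u[l]) for l in range(r)))
assert np.allclose(sum(phi(l, t_k1) for l in range(r)),
                   sum(rank_one(v[l]) for l in range(r)))
```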

I have established that loops of tensors admit piecewise-continuous decompositions. Theorem 5.3.7 below will establish that a loop with a continuous decomposition is homotopic to a separable loop, and is thus contractible.

Theorem 5.3.7. Let $V_i$ be inner product spaces over $\mathbb{K}$. If each $V_i$ has dimension exceeding 1, then any loop $X : S^1 \to S_1(r, \{V_i\}_{i=1}^d)$ defined as a sum of tensor products of loops $\phi_i^l : S^1 \to V_i \setminus \{0\}$,
$$ \theta \mapsto \sum_{l=1}^r \bigotimes_{i=1}^d \phi_i^l(\theta) , $$
is homotopic to some separable loop $Y : S^1 \to S_1(1, \{V_i\}_{i=1}^d)$.

Proof: By Lemma 5.2.10, I assume without loss of generality that $\|\phi_i^l(\theta)\| > 0$ for all $i$, $l$, and all $\theta \in S^1$. For brevity, let $X^l(\theta) = \bigotimes_{i=1}^d \phi_i^l(\theta)$, and let $X^{\setminus 1} = X - X^1 = \sum_{l=2}^r X^l$. For the same purpose, define $\bigotimes_{i \ne j} u_i \otimes_j u_j$ to be $\bigotimes_{i=1}^n u_i$.
Let $A = \{ \omega \in S^1 : \|X^1(\omega)\| < \|X^{\setminus 1}(\omega)\| \}$ and let $B = \{ \omega \in S^1 : \exists \lambda \le 0 \text{ s.t. } \lambda X^1(\omega) = X(\omega) \}$. Note that the set $B$ includes only those points that would have $X^1(\omega) + \alpha X^{\setminus 1}(\omega) = 0$ for some $\alpha \in [0,1]$. It follows that $B \subset A$, for if there were some $\omega \in B$ with $\omega \notin A$, then $X(\omega)$ would be $0$, which would contradict its definition. By continuity of $X^1$, $X$, and $\|\cdot\|$, $A$ is open and $B$ is closed.
Let $C_\epsilon = \{ \omega \in A : |\langle X^1(\omega), X^{\setminus 1}(\omega) \rangle| > (1-\epsilon)\|X^1(\omega)\| \, \|X^{\setminus 1}(\omega)\| \}$ for each $\epsilon > 0$. Note that $B \subset C_\epsilon$, and that $C_\epsilon$ is open. If there exists $\epsilon \in (0, \frac{1}{2})$ such that $C_\epsilon \ne S^1$, then let $C = C_\epsilon$. If no such $\epsilon$ exists, then there is nothing to prove, as $X(\omega) \in S_1(1, \{V_i\}_{i=1}^d)$ for all $\omega \in S^1$. For the remainder of the proof, I will assume $C \ne S^1$.

I now define a homotopy $\psi : S^1 \times [0,1] \to S(1, \{V_i\}_{i=1}^d) \setminus \{0\}$. This homotopy is complicated, so I will define its behavior on various subsets of $S^1 \times [0,1]$ separately. For each $\omega \in S^1 \setminus C$ and $t \in [0,1]$ define
$$ \psi : (\omega, t) \mapsto X^1(\omega) . $$

Figure 5.3: Hierarchy of sets used in proof of Theorem 5.3.7: $S^1 \supset A \supset C \supset B$, with subintervals $(a_n, d_{n,j})$, $(c_{n,k}, d_n)$, etc.

The open set $C$ is a countable union of disjoint open intervals $C = \bigcup_n (a_n, b_n)$. If such an interval $(a_n, b_n)$ is disjoint from $B$, then for all $\omega \in (a_n, b_n)$ and $t \in [0,1]$, define
$$ \psi : (\omega, t) \mapsto X^1(\omega) . $$

Otherwise, $(a_n, b_n) \setminus B$ is itself a countable union of disjoint open intervals $(c_{n,1}, d_{n,1})$, $(c_{n,2}, d_{n,2}), \ldots$. By a compactness argument, there exist $j, k$ such that $c_{n,j} = a_n$ and $d_{n,k} = b_n$, for otherwise $B$ would contain elements arbitrarily close to $S^1 \setminus C$.

Let $\phi_i^\perp : S^1 \to V_i$ be loops such that $\phi_i^\perp(\omega) \perp \phi_i^1(\omega)$ and $\|\phi_i^\perp(\omega)\| = \|\phi_i^1(\omega)\|$ for all $\omega \in S^1$. That such loops exist is ensured by Lemma 5.1.11. Note that for any $\omega \in C$ and $\lambda \le 0$ I have $\bigotimes_{i=1}^d \phi_i^\perp(\omega) \ne \lambda X^{\setminus 1}(\omega)$. Let also $\beta_i : S^1 \times \mathbb{R}^2 \to V_i$ be defined by
$$ \beta_i : (\omega, t, u) \mapsto \begin{cases} \phi_i^1(\omega) & \text{if } t < 0 \\ \cos(t\pi/2)\,\phi_i^1(\omega) + \sin(t\pi/2)\,\phi_i^\perp(\omega) & \text{if } 0 \le t \le u \\ \cos(u\pi/2)\,\phi_i^1(\omega) + \sin(u\pi/2)\,\phi_i^\perp(\omega) & \text{if } u < t . \end{cases} $$
Note that $\|\beta_i(\omega, t, u)\| = \|\phi_i^1(\omega)\|$. For any $\omega \in C$ I have $\omega \in A$ and thus that $\|\psi(\omega, t)\| = \|X^1(\omega)\| \implies \psi(\omega, t) \ne -X^{\setminus 1}(\omega)$.

For each $\omega \in [d_{n,j}, c_{n,k}]$ and $t \in [0,1]$ define
$$ \psi : (\omega, t) \mapsto \bigotimes_{i=1}^d \beta_i(\omega, t, 1) . $$

I now examine the interval $(a_n, d_{n,j})$. The definition of $\psi$ on the interval $(c_{n,k}, b_n)$ follows a symmetric argument.

I say that $u$ and $v$ are congruent if $u = \lambda v$ for some $\lambda \in \mathbb{K}$. By Lemma 5.2.5 and the definition of the tensor product, two separable tensors are congruent if and only if their factors are congruent.

Case 1: Suppose that $X^1$ is congruent to $X^{\setminus 1}$ on $(a_n, d_{n,j})$.
For each $\omega \in (a_n, d_{n,j})$ and $t \in [0,1]$ define
$$ \psi : (\omega, t) \mapsto \bigotimes_{i=1}^d \beta_i\!\left(\omega, \; t\,\frac{\omega - a_n}{d_{n,j} - a_n}, \; t\right) . $$

Case 2: Suppose $X^{\setminus 1}$ fails to be separable at some $x \in (a_n, d_{n,j})$.
By continuity of $X^{\setminus 1}$ and closedness of the set of separable tensors, there exists an open subinterval $(y, z) \subset (a_n, d_{n,j})$ on which $X^{\setminus 1}$ is non-separable. For each $\omega \in (a_n, d_{n,j})$ and $t \in [0,1]$ define
$$ \psi : (\omega, t) \mapsto \bigotimes_{i=1}^d \beta_i\!\left(\omega, \; t\,\frac{\omega - y}{z - y}, \; t\right) . $$
Case 3: Suppose that $X^{\setminus 1}$ is separable on $(a_n, d_{n,j})$ and that there is some $x \in (a_n, d_{n,j})$ such that $X^1(x)$ is not congruent to $X^{\setminus 1}(x)$.

On $(a_n, d_{n,j})$, $X^{\setminus 1}$ admits a factorization $X^{\setminus 1}(\omega) = \bigotimes_{i=1}^d \alpha_i(\omega)$, where each $\alpha_i$ is a continuous map from $(a_n, d_{n,j})$ to $V_i$. Therefore, there must exist some $j = 1, \ldots, d$ such that $\phi_j^1(x)$ is not congruent to $\alpha_j(x)$. Further, by continuity, there exists an open subinterval $(y, z) \subset (a_n, d_{n,j})$ on which $\phi_j^1$ is not congruent to $\alpha_j$.
For each $\omega \in (a_n, d_{n,j})$ and $t \in [0,1]$ define
$$ \psi : (\omega, t) \mapsto \bigotimes_{i \ne j} \beta_i\!\left(\omega, \; 2t\,\frac{\omega - y}{z - y}, \; t\right) \otimes_j \beta_j\!\left(\omega, \; 2t\,\frac{\omega - \frac{y+z}{2}}{z - y}, \; t\right) . $$
The map $\psi$ is continuous, and for all $\omega \in S^1$, $t \in [0,1]$, and $-1 \le \lambda \le 0$,
$$ \psi(\omega, 0) = X^1(\omega) , $$
$$ \psi(\omega, 1) \ne \lambda X^{\setminus 1}(\omega) , \quad \text{and} $$
$$ \psi(\omega, t) \ne -X^{\setminus 1}(\omega) . $$

Thus, $\psi$ is a homotopy, and the map defined by
$$ (\omega, t) \mapsto \frac{\psi(\omega, t) + X^{\setminus 1}(\omega)}{\|\psi(\omega, t) + X^{\setminus 1}(\omega)\|} $$
is a homotopy in $S_1(r, \{V_i\}_{i=1}^d)$. Apply this homotopy, then continuously scale $X^{\setminus 1}$ to $0$ using multiplication by a non-negative real factor to complete the proof. $\square$
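For intuition, the rotation $\beta_i$ can be realized concretely when each $V_i = \mathbb{R}^2$, where a quarter-turn supplies an orthogonal loop $\phi_i^\perp$ of the kind whose existence Lemma 5.1.11 guarantees in general. The NumPy sketch below (illustrative only, with an arbitrary example loop, and assuming $u \ge 0$) implements the three-case definition of $\beta_i$ and checks that it preserves the norm of $\phi_i^1(\omega)$, as claimed in the proof.

```python
import numpy as np

# Sketch of beta_i for V_i = R^2: a quarter-turn gives the orthogonal loop phi_perp,
# and beta rotates phi1(omega) toward phi_perp(omega) by an angle clamped to [0, u*pi/2].

def phi_perp(v):
    """Quarter-turn of a vector in R^2: orthogonal to v, same norm."""
    return np.array([-v[1], v[0]])

def beta(phi1, omega, t, u):
    """beta_i(omega, t, u): unrotated for t < 0, angle t*pi/2 for 0 <= t <= u,
    frozen at angle u*pi/2 for t > u (u >= 0 assumed)."""
    v = phi1(omega)
    w = phi_perp(v)
    s = min(max(t, 0.0), u)
    return np.cos(s * np.pi / 2) * v + np.sin(s * np.pi / 2) * w

# Example loop phi1 : S^1 -> R^2 \ {0}
phi1 = lambda omega: np.array([2.0 + np.cos(omega), np.sin(omega)])

omega = 0.7
for t in (-0.5, 0.3, 1.0):
    b = beta(phi1, omega, t, 0.8)
    # The norm of beta equals the norm of phi1(omega) for every t.
    assert np.isclose(np.linalg.norm(b), np.linalg.norm(phi1(omega)))
```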

I have now established both that tensor loops are homotopic to those with piecewise continuous decompositions and that tensor loops with continuous decompositions are homotopic to separable loops. To connect these statements, I must make the following assumption:

Assumption 5.3.8. Every pair of equivalent decompositions
$$ \sum_{l=1}^r X^l = \sum_{l=1}^r Y^l $$
admits a loop $P : [0,1] \to S_1(r, \{V_i\}_{i=1}^d)$ defined as the sum of tensor products of paths $\psi^l : [0,1] \to S(1, \{V_i\}_{i=1}^d)$,
$$ P : t \mapsto \sum_{l=1}^r \psi^l(t) , $$
such that $\psi^l(0) = X^l$ and $\psi^l(1) = Y^l$, and $P$ may be contracted to a point in $S_1(r, \{V_i\}_{i=1}^d)$ while keeping $P(0)$ and $P(1)$ fixed.

Whether Assumption 5.3.8 holds in general remains an open problem, though I address one case here. Any decomposition with more than one summand can have its summands reordered to create an equivalent decomposition. For Assumption 5.3.8 to be reasonable, it must not be violated by this trivial indeterminacy.

Lemma 5.3.9. Let $X = \sum_{l=1}^r X^l \in S_1(r, \{V_i\}_{i=1}^d)$, where $X^1, \ldots, X^r \in S(1, \{V_i\}_{i=1}^d)$. Reordering the summands $X^1, \ldots, X^r$ does not violate Assumption 5.3.8.

Proof: Select $n = 1, \ldots, r$ such that $X^n$ is non-zero. As was shown in the proof of Theorem 5.3.1, there are paths $\rho^l : [0,1] \to S(1, \{V_i\}_{i=1}^d)$ connecting each $X^l$ to $\frac{X^n}{r\|X^n\|}$, such that $\sum_{l=1}^r \rho^l$ is a path in $S_1(r, \{V_i\}_{i=1}^d)$. Let $i_1, \ldots, i_r$ be a permutation of $1, \ldots, r$. Because $\sum_{l=1}^r \rho^{i_l}$ and $\sum_{l=1}^r \rho^l$ are identical, there is a homotopy $\phi : [0,1]^2 \to S_1(r, \{V_i\}_{i=1}^d)$ defined by
$$ (s, t) \mapsto \begin{cases} \sum_{l=1}^r \rho^l(2s(1-t)) & \text{if } s \le \frac{1}{2} \\ \sum_{l=1}^r \rho^{i_l}((2-2s)(1-t)) & \text{if } s > \frac{1}{2} . \end{cases} $$
Note that $\phi(s, 0)$ is a sum of separable paths defined by
$$ \psi^l : s \mapsto \begin{cases} \rho^l(2s) & \text{if } s \le \frac{1}{2} \\ \rho^{i_l}(2-2s) & \text{if } s > \frac{1}{2} , \end{cases} $$
that $X^{i_l} = \psi^{i_l}(0) = \psi^l(1)$, and that $\phi(0, t) = \phi(1, t) = \phi(s, 1) = X$. $\square$

Assumption 5.3.8 is a necessary condition for rank-$r$ unit tensors to be simply-connected. If each $V_i \setminus \{0\}$ is simply-connected, Assumption 5.3.8 is also sufficient, as will be shown in Corollary 5.3.11. Assumption 5.3.8 is weaker than a statement of simple connectedness, in that one need only show that a contractible path exists, rather than that every path is contractible. As an example, consider topological spaces each composed of the decompositions of a tensor $T$; if every such space is path-connected, then Assumption 5.3.8 holds.

Lemma 5.3.10. Let $V_i$ be inner product spaces over $\mathbb{K}$. Let $X : S^1 \to S_1(r, \{V_i\}_{i=1}^d)$ be a tensor-space loop defined by
$$ X : \theta \mapsto \sum_{l=1}^r \phi^l(\theta) , $$
where each $\phi^l : S^1 \to S(1, \{V_i\}_{i=1}^d)$ has at most finitely many discontinuities. If Assumption 5.3.8 holds, then $X$ is homotopic to a loop $Y : S^1 \to S_1(r, \{V_i\}_{i=1}^d)$ defined by
$$ Y : \theta \mapsto \sum_{l=1}^r \rho^l(\theta) , $$
where each $\rho^l : S^1 \to S(1, \{V_i\}_{i=1}^d)$ is continuous.

Proof: If $\phi^1, \ldots, \phi^r$ admit no discontinuities, there is nothing to prove, as this case reduces to Theorem 5.3.7. Otherwise, assume without loss of generality that one of $\phi^1, \ldots, \phi^r$ admits a discontinuity at $\theta = 0$. Let $X^l = \lim_{\omega \to 0^-} \phi^l(\omega)$ and let $Y^l = \lim_{\omega \to 0^+} \phi^l(\omega)$. By Assumption 5.3.8, there exists a homotopy $\Psi$ taking $P$ to $X(0)$. Apply the homotopy $h$ defined, for each $t \in [0,1]$ and $\omega \in [0, 2\pi)$, by
$$ h : (\omega, t) \mapsto \begin{cases} X\!\left(\frac{2\pi}{2\pi - t}(\omega - t)\right) & \text{if } \omega \ge t \\ \Psi\!\left(\frac{\omega}{t}, 1 - t\right) & \text{if } \omega < t . \end{cases} $$
Note that $h(\omega, 1)$ may be decomposed as
$$ h(\omega, 1) = \begin{cases} \sum_{l=1}^r \phi^l\!\left(\frac{2\pi}{2\pi - 1}(\omega - 1)\right) & \text{if } \omega \ge 1 \\ \sum_{l=1}^r \psi^l(\omega) & \text{if } \omega < 1 , \end{cases} $$
and thus its decomposition has fewer discontinuities than does that of $X$. Repeating this process for each discontinuity completes the proof. $\square$

Corollary 5.3.11. Let $V_i$ be inner product spaces over $\mathbb{K}$. If each $V_i \setminus \{0\}$ is simply-connected, and Assumption 5.3.8 holds, then $S_1(r, \{V_i\}_{i=1}^d)$ is simply-connected for any $r \in \mathbb{N}$.

Proof: Let $X$ be any loop in $S_1(r, \{V_i\}_{i=1}^d)$. By Theorem 5.3.6, $X$ is homotopic to some loop that satisfies the conditions of Lemma 5.3.10. Thus, $X$ is homotopic to some loop that satisfies the conditions of Theorem 5.3.7. Theorem 5.2.9 completes the proof. $\square$

5.4 Concluding remarks

In this chapter, I have established that spaces of low-rank unit tensors are path-connected and simply-connected, the latter requiring either that the tensors be separable or that Assumption 5.3.8 hold. My work ensures that there is always a path that an iterative method could follow without encountering degenerate cases. Though whether Assumption 5.3.8 holds remains an open problem, the fact that it suffices for simple connectedness provides insight into the nature of low-rank unit tensors. In particular, the only obstacles that might prevent contraction of a loop within a space of low-rank unit tensors are those tensors which admit non-trivial equivalent decompositions.

References

[ACF10] A. Ammar, F. Chinesta, and A. Falcó. On the convergence of a greedy rank-one update algorithm for a class of linear systems. Arch. Comput. Methods Eng., 17(4):473–486, 2010. doi:10.1007/s11831-010-9048-z

[ADK11] Evrim Acar, Daniel M. Dunlavy, and Tamara G. Kolda. A scalable optimization approach for fitting canonical tensor decompositions. Journal of Chemometrics, 25(2):67–86, FEB 2011. doi:10.1002/cem.1335

[AMA05] P.-A. Absil, R. Mahony, and B. Andrews. Convergence of the iterates of descent methods for analytic cost functions. SIAM J. Optim., 16(2):531–547, 2005. doi:10.1137/040605266

[BAE16] Karl-Heinz Boehm, Alexander A. Auer, and Mike Espig. Tensor representation techniques for full configuration interaction: A Fock space approach using the canonical product format. Journal of Chemical Physics, 144(24), JUN 28 2016. doi:10.1063/1.4953665

[BAEH11] Udo Benedikt, Alexander A. Auer, Mike Espig, and Wolfgang Hackbusch. Tensor decomposition in post-Hartree-Fock methods. I. Two-electron integrals and MP2. Journal of Chemical Physics, 134(5), FEB 7 2011. doi:10.1063/1.3514201

[BBB15] David J. Biagioni, Daniel Beylkin, and Gregory Beylkin. Randomized interpolative decomposition of separated representations. J. Comput. Phys., 281:116–134, 2015. doi:10.1016/j.jcp.2014.10.009

[BD15] Markus Bachmayr and Wolfgang Dahmen. Adaptive near-optimal rank tensor approximation for high-dimensional operator equations. Found. Comput. Math., 15(4):839–898, 2015. doi:10.1007/s10208-013-9187-3

[BD16] M. Bachmayr and W. Dahmen. Adaptive low-rank methods for problems on Sobolev spaces with error control in L2. ESAIM Math. Model. Numer. Anal., 50(4):1107–1136, 2016. doi:10.1051/m2an/2015071

[BG14] Jonas Ballani and Lars Grasedyck. Tree adaptive approximation in the hierarchical tensor format. SIAM J. Sci. Comput., 36(4):A1415–A1431, 2014. doi:10.1137/130926328

[BGM09] Gregory Beylkin, Jochen Garcke, and Martin J. Mohlenkamp. Multivariate regression and machine learning with sums of separable functions. SIAM J. Sci. Comput., 31(3):1840–1857, 2009. doi:10.1137/070710524

[BH03] James C. Bezdek and Richard J. Hathaway. Convergence of alternating optimization. Neural, Parallel Sci. Comput., 11(4):351–368, December 2003. URL: http://dl.acm.org/citation.cfm?id=964885.964886

[BK08] Brett W. Bader and Tamara G. Kolda. Efficient MATLAB computations with sparse and factored tensors. SIAM J. Sci. Comput., 30(1):205–231, 2007/08. doi:10.1137/060676489

[BK85] M. E. Brewster and R. Kannan. Global convergence of nonlinear successive overrelaxation via linear theory. Computing, 34(1):73–79, 1985. doi:10.1007/BF02242174

[BK86] M. E. Brewster and R. Kannan. A computational process for choosing the relaxation parameter in nonlinear SOR. Computing, 37(1):19–29, 1986. doi:10.1007/BF02252731

[BL14] Jarosław Buczyński and J. M. Landsberg. On the third secant variety. J. Algebraic Combin., 40(2):475–502, 2014. doi:10.1007/s10801-013-0495-0

[BM05] Gregory Beylkin and Martin J. Mohlenkamp. Algorithms for numerical analysis in high dimensions. SIAM Journal on Scientific Computing, 26(6):2133–2159, 2005. doi:10.1137/040604959

[BMP08] G. Beylkin, M. J. Mohlenkamp, and F. Pérez. Approximating a wavefunction as an unconstrained sum of Slater determinants. J. Math. Phys., 49(3):032107, 2008. doi:10.1063/1.2873123

[Bro97] Rasmus Bro. PARAFAC. Tutorial & applications. Chemometrics Intell. Lab. Syst., 38:149–171, 1997. doi:10.1016/S0169-7439(97)00032-4

[Bro98] Rasmus Bro. Multi-way Analysis in the Food Industry. PhD thesis, Royal Veterinary and Agricultural University, 1998.

[Brz15] Szymon Brzostowski. The Łojasiewicz exponent of semiquasihomogeneous singularities. Bulletin of the London Mathematical Society, 47(5):848–852, 2015. doi:10.1112/blms/bdv058

[CC70] J. D. Carroll and J. J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika, 35:283–320, 1970. doi:10.1007/BF02310791

[CHQ11] Yannan Chen, Deren Han, and Liqun Qi. New ALS methods with extrapolating search directions and optimal step size for complex-valued tensor decompositions. IEEE Trans. Signal Process., 59(12):5888–5898, 2011. doi:10.1109/TSP.2011.2164911

[CMP+15] Andrzej Cichocki, Danilo P. Mandic, Anh Huy Phan, Cesar F. Caiafa, Guoxu Zhou, Qibin Zhao, and Lieven De Lathauwer. Tensor decompositions for signal processing applications. IEEE Signal Processing Magazine, 32(2):145–163, MAR 2015. doi:10.1109/MSP.2013.2297439

[Coh72] Arthur I. Cohen. Rate of convergence of several conjugate gradient algorithms. SIAM J. Numer. Anal., 9(2):248–259, 1972.

[CtBDLC09] P. Comon, J. M. F. ten Berge, L. De Lathauwer, and J. Castaing. Generic and typical ranks of multi-way arrays. Linear Algebra Appl., 430(11-12):2997–3007, 2009. doi:10.1016/j.laa.2009.01.014

[Cur44] Haskell B. Curry. The method of steepest descent for non-linear minimization problems. Quart. Appl. Math., 2:258–261, 1944. doi:10.1090/qam/10667

[DL13] Ignat Domanov and Lieven De Lathauwer. On the uniqueness of the canonical polyadic decomposition of third-order tensors—Part I: Basic results and uniqueness of one factor matrix. SIAM Journal on Matrix Analysis and Applications, 34(3):855–875, 2013. doi:10.1137/120877234

[DSW15] Hans De Sterck and Manda Winlaw. A nonlinearly preconditioned conjugate gradient algorithm for rank-R canonical tensor approximation. Numer. Linear Algebra Appl., 22(3):410–432, 2015. doi:10.1002/nla.1963

[EGH09] Mike Espig, Lars Grasedyck, and Wolfgang Hackbusch. Black box low tensor-rank approximation using fiber-crosses. Constr. Approx., 30(3):557–597, 2009. doi:10.1007/s00365-009-9076-9

[EH12] Mike Espig and Wolfgang Hackbusch. A regularized Newton method for the efficient approximation of tensors represented in the canonical tensor format. Numer. Math., 122(3):489–525, 2012. doi:10.1007/s00211-012-0465-9

[EHK] Mike Espig, Wolfgang Hackbusch, and Aram Khachatryan. On the convergence of alternating least squares optimisation in tensor format representations. preprint. arXiv:1506.00062.

[EHRS12] Mike Espig, Wolfgang Hackbusch, Thorsten Rohwedder, and Reinhold Schneider. Variational calculus with sums of elementary tensors of fixed rank. Numer. Math., 122(3):469–488, 2012. doi:10.1007/s00211-012-0464-x

[EV93] M. Eiermann and R. S. Varga. Is the optimal ω best for the SOR iteration method? Linear Algebra and its Applications, 182:257–277, 1993. doi:10.1016/0024-3795(93)90503-G

[GNL14] L. Giraldi, A. Nouy, and G. Legrain. Low-rank approximate inverse for preconditioning tensor-structured linear systems. SIAM J. Sci. Comput., 36(4):A1850–A1870, 2014. doi:10.1137/130918137

[GQ14] Donald Goldfarb and Zhiwei Qin. Robust low-rank tensor recovery: models and algorithms. SIAM J. Matrix Anal. Appl., 35(1):225–253, 2014. doi:10.1137/130905010

[Gwo99] J. Gwoździewicz. The Łojasiewicz exponent of an analytic function at an isolated zero. Commentarii Mathematici Helvetici, 74(3):364–375, 1999. doi:10.1007/s000140050094

[Hac14] Wolfgang Hackbusch. Numerical tensor calculus. Acta Numer., 23:651–742, 2014. doi:10.1017/S0962492914000087

[Had00] A. Hadjidimos. Successive overrelaxation (SOR) and related methods. Journal of Computational and Applied Mathematics, 123(1–2):177–199, 2000. Numerical Analysis 2000. Vol. III: Linear Algebra. doi:10.1016/S0377-0427(00)00403-9

[Har70] R. A. Harshman. Foundations of the PARAFAC procedure: Model and conditions for an “explanatory” multi-mode factor analysis. Working Papers in Phonetics 16, UCLA, 1970. URL: http://www.psychology.uwo.ca/faculty/harshman/wpppfac0.pdf

[HK06] W. Hackbusch and B. N. Khoromskij. Low-rank Kronecker-product approximation to multi-dimensional nonlocal operators. I. Separable approximation of multi-variate functions. Computing, 76(3-4):177–202, 2006. doi:10.1007/s00607-005-0144-0

[HL13] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. J. ACM, 60(6):Art. 45, 39, 2013. doi:10.1145/2512329

[HPJ+98] P. K. Hopke, P. Paatero, H. Jia, R. T. Ross, and R. A. Harshman. Three-way (PARAFAC) factor analysis: examination and comparison of alternative computational methods as applied to ill-conditioned data. Chemometrics and Intelligent Laboratory Systems, 43(1-2):25–42, SEP 28 1998. doi:10.1016/S0169-7439(98)00077-X

[KB09] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009. doi:10.1137/07070111X

[Kho12] Boris N. Khoromskij. Tensor-structured numerical methods in scientific computing: Survey on recent advances. Chemometrics Intell. Lab. Syst., 110(1):1–19, JAN 15 2012. doi:10.1016/j.chemolab.2011.09.001

[KM11] Tamara G. Kolda and Jackson R. Mayo. Shifted power method for computing tensor eigenpairs. SIAM J. Matrix Anal. Appl., 32(4):1095–1124, 2011. doi:10.1137/100801482

[KSU14] Daniel Kressner, Michael Steinlechner, and André Uschmajew. Low-rank tensor methods with subspace correction for symmetric eigenvalue problems. SIAM J. Sci. Comput., 36(5):A2346–A2368, 2014. doi:10.1137/130949919

[KT10] V. A. Kazeev and E. E. Tyrtyshnikov. Structure of the Hessian matrix and an economical implementation of Newton’s method in the problem of canonical approximation of tensors. Computational Mathematics and Mathematical Physics, 50(6):927–945, JUN 2010. doi:10.1134/S0965542510060011

[Lan11] J.M. Landsberg. Tensors: Geometry and Applications. Graduate studies in mathematics. American Mathematical Society, 2011.

[LKN13] Na Li, Stefan Kindermann, and Carmeliza Navasca. Some convergence results on the regularized alternating least-squares method for tensor decomposition. Linear Algebra Appl., 438(2):796–812, 2013. doi:10.1016/j.laa.2011.12.002

[LMS11] Flavia Lanzara, Vladimir Maz’ya, and Gunther Schmidt. On the fast computation of high dimensional volume potentials. Math. Comp., 80(274):887–904, 2011. doi:10.1090/S0025-5718-2010-02425-1

[LMS14] F. Lanzara, V. Maz’ya, and G. Schmidt. Fast cubature of volume potentials over rectangular domains by approximate approximations. Appl. Comput. Harmon. Anal., 36(1):167–182, 2014. doi:10.1016/j.acha.2013.06.003

[Łoj59] S. Łojasiewicz. Sur le problème de la division. Studia Mathematica, 18(1):87–136, 1959. URL: http://eudml.org/doc/216949

[Łoj65] Stanisław Łojasiewicz. Ensembles semi-analytiques. preprint IHES, 1965. URL: https://perso.univ-rennes1.fr/michel.coste/Lojasiewicz.pdf

[Łoj93] Stanislas Łojasiewicz. Sur la géométrie semi- et sous-analytique. Annales de l’institut Fourier, 43(5):1575–1595, 1993. URL: http://eudml.org/doc/75048

[LR92] Sue Leurgans and Robert T. Ross. Multilinear models: applications in spectroscopy. Statist. Sci., 7(3):289–319, 1992. With comments by Jan de Leeuw, Pieter M. Kroonenberg and Donald S. Burdick and a rejoinder by the authors. URL: http://links.jstor.org/sici?sici=0883-4237(199208)7:3<289:MMAIS>2.0.CO;2-1&origin=MSN

[LRA93] S. E. Leurgans, R. T. Ross, and R. B. Abel. A decomposition for three-way arrays. SIAM J. Matrix Anal. Appl., 14(4):1064–1083, 1993. doi:10.1137/0614071

[LT92] Z. Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl., 72(1):7–35, 1992. doi:10.1007/BF00939948

[LUZ15] Zhening Li, André Uschmajew, and Shuzhong Zhang. On convergence of the maximum block improvement method. SIAM J. Optim., 25(1):210–233, 2015. doi:10.1137/130939110

[LWXZ06] YoungJu Lee, Jinbiao Wu, Jinchao Xu, and Ludmil Zikatanov. On the convergence of iterative methods for semidefinite linear systems. SIAM Journal on Matrix Analysis and Applications, 28(3):634–641, 2006. doi:10.1137/050644197

[LWXZ08] Young-Ju Lee, Jinbiao Wu, Jinchao Xu, and Ludmil Zikatanov. A sharp convergence estimate for the method of subspace corrections for singular systems of equations. Mathematics of Computation, 77(262):831–850, 2008. doi:10.1090/S0025-5718-07-02052-2

[MB94] B. C. Mitchell and D. S. Burdick. Slowly converging PARAFAC sequences: Swamps and two-factor degeneracies. J. Chemometrics, 8:155–168, 1994. doi:10.1002/cem.1180080207

[Moh13] Martin J. Mohlenkamp. Musings on multilinear fitting. Linear Algebra and its Applications, 438(2):834–852, 2013. Tensors and Multilinear Algebra. doi:10.1016/j.laa.2011.04.019

[Nes12] Yu. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim., 22(2):341–362, 2012. doi:10.1137/100802001

[Nou10a] Anthony Nouy. A priori model reduction through proper generalized decomposition for solving time-dependent partial differential equations. Comput. Methods Appl. Mech. Engrg., 199(23-24):1603–1626, 2010. doi:10.1016/j.cma.2010.01.009

[Nou10b] Anthony Nouy. Proper generalized decompositions and separated representations for the numerical solution of high dimensional stochastic problems. Arch. Comput. Methods Eng., 17(4):403–434, 2010. doi:10.1007/s11831-010-9054-1

[NW14] Jiawang Nie and Li Wang. Semidefinite relaxations for best rank-1 tensor approximations. SIAM J. Matrix Anal. Appl., 35(3):1155–1179, 2014. doi:10.1137/130935112

[Ole13] Grzegorz Oleksik. The Łojasiewicz exponent of nondegenerate surface singularities. Acta Mathematica Hungarica, 138(1):179–199, 2013. doi:10.1007/s10474-012-0285-5

[Paa00] Pentti Paatero. Construction and analysis of degenerate PARAFAC models. Journal of Chemometrics, 14(3):285–299, 2000. doi:10.1002/1099-128X(200005/06)14:3<285::AID-CEM584>3.0.CO;2-1

[PCA10] E. Pruliere, F. Chinesta, and A. Ammar. On the deterministic solution of multidimensional parametric models using the proper generalized decomposition. Math. Comput. Simulation, 81(4):791–810, 2010. doi:10.1016/j.matcom.2010.07.015

[PTC13] Anh-Huy Phan, Petr Tichavský, and Andrzej Cichocki. Low complexity damped Gauss-Newton algorithms for CANDECOMP/PARAFAC. SIAM J. Matrix Anal. Appl., 34(1):126–147, 2013. doi:10.1137/100808034

[RCH08] Myriam Rajih, Pierre Comon, and Richard A. Harshman. Enhanced line search: a novel method to accelerate PARAFAC. SIAM J. Matrix Anal. Appl., 30(3):1128–1147, 2008. doi:10.1137/06065577

[RDB16] Matthew J. Reynolds, Alireza Doostan, and Gregory Beylkin. Randomized alternating least squares for canonical tensor decompositions: application to a PDE with random data. SIAM J. Sci. Comput., 38(5):A2634–A2664, 2016. doi:10.1137/15M1042802

[Rei66] J. K. Reid. A method for finding the optimum successive over-relaxation parameter. The Computer Journal, 9(2):200–204, 1966. doi:10.1093/comjnl/9.2.200

[RT14] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program., 144(1-2, Ser. A):1–38, 2014. doi:10.1007/s10107-012-0614-z

[SK16] Yifei Sun and Mrinal Kumar. Uncertainty propagation in orbital mechanics via tensor decomposition. Celestial Mech. Dynam. Astronom., 124(3):269–294, 2016. doi:10.1007/s10569-015-9662-z

[SU15] Reinhold Schneider and André Uschmajew. Convergence results for projected line-search methods on varieties of low-rank matrices via Łojasiewicz inequality. SIAM J. Optim., 25(1):622–646, 2015. doi:10.1137/140957822

[SYX16] Zhanwen Shi, Guanyu Yang, and Yunhai Xiao. A limited memory BFGS algorithm for non-convex minimization with applications in matrix largest eigenvalue problem. Math. Methods Oper. Res., 83(2):243–264, 2016. doi:10.1007/s00186-015-0527-8

[TPC16] Petr Tichavský, Anh-Huy Phan, and Andrzej Cichocki. Partitioned alternating least squares technique for canonical polyadic tensor decomposition. IEEE Signal Process. Lett., 23(7):993–997, JUL 2016. doi:10.1109/LSP.2016.2577383

[Usc11] André Uschmajew. Regularity of tensor product approximations to square integrable functions. Constr. Approx., 34(3):371–391, 2011. doi:10.1007/s00365-010-9125-4

[Usc12] André Uschmajew. Local convergence of the alternating least squares algorithm for canonical tensor approximation. SIAM J. Matrix Anal. Appl., 33(2):639–652, 2012. doi:10.1137/110843587

[Usc15] André Uschmajew. A new convergence proof for the higher-order power method and generalizations. Pac. J. Optim., 11(2):309–321, 2015. arXiv:1407.4586.

[Val14] AbdoulAhad Validi. Low-rank separated representation surrogates of high-dimensional stochastic functions: application in Bayesian inference. J. Comput. Phys., 260:37–53, 2014. doi:10.1016/j.jcp.2013.12.024

[VMP03] M.N. Vrahatis, G.D. Magoulas, and V.P. Plagianakos. From linear to nonlinear iterative methods. Applied Numerical Mathematics, 45(1):59–77, apr 2003. doi:10.1016/s0168-9274(02)00235-0

[VNVM14] N. Vannieuwenhoven, J. Nicaise, R. Vandebril, and K. Meerbergen. On generic nonexistence of the Schmidt-Eckart-Young decomposition for complex tensors. SIAM J. Matrix Anal. Appl., 35(3):886–903, 2014. doi:10.1137/130926171

[WC14] Liqi Wang and Moody T. Chu. On the global convergence of the alternating least squares method for rank-one approximation to generic tensors. SIAM J. Matrix Anal. Appl., 35(3):1058–1072, 2014. doi:10.1137/130938207

[Wol71] Philip Wolfe. Convergence conditions for ascent methods. ii: Some corrections. SIAM Review, 13(2):185–188, 1971. doi:10.1137/1013035

[XZ06] Shuhuang Xiang and Shenglei Zhang. A convergence analysis of block accelerated over-relaxation iterative methods for weak block H-matrices to partition π. Linear Algebra and its Applications, 418(1):20–32, oct 2006. doi:10.1016/j.laa.2006.01.013

[YG09] Shiming Yang and Matthias K. Gobbert. The optimal relaxation parameter for the SOR method applied to the Poisson equation in any space dimensions. Applied Mathematics Letters, 22(3):325–331, 2009. doi:10.1016/j.aml.2008.03.028

[You50] David M. Young. Iterative methods for solving partial difference equations of elliptic type. PhD thesis, Harvard University, Cambridge, MA, May 1950.
