TENSOR RANK DECOMPOSITIONS VIA THE PSEUDO-MOMENT METHOD

A Dissertation Presented to the Faculty of the Graduate School

of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by Jonathan Shi December 2019 © 2019 Jonathan Shi ALL RIGHTS RESERVED TENSOR RANK DECOMPOSITIONS VIA THE PSEUDO-MOMENT METHOD Jonathan Shi, Ph.D. Cornell University 2019

Over a series of four articles and an introduction, a “method of pseudo-moments” is developed which gives polynomial-time tensor rank decompositions for a variety of tensor component models. Algorithms given fall into two general classes: those falling back on convex optimization, which develop the theory of polynomial-time algorithms, as well as those constructed through spectral and polynomial methods, which illustrate the possibility of realizing the aforementioned polynomial-time algorithm ideas in runtimes practical for real- life inputs. Tensor component models covered include those featuring random, worst-case, generic, or smoothed inputs, as well as cases featuring overcomplete and undercomplete rank. All models are capable of tolerating substantial noise in tensor inputs, measured in spectral norm. BIOGRAPHICAL SKETCH

Jonathan Shi was raised in Seattle, Washington, and received a Bachelors of Science from the University of Washington in 2013 in the major concentrations of Computer Science, Mathematics, and Physics. He has been working at Cornell University under the guidance of Professor David Steurer and will soon join Bocconi University in Milan as a Research Fellow.

iii This dissertation is dedicated to my mother and father, Richard C-J Shi and Tracey H. Luo, who made this work possible with their perseverence, and in the memory of Mary Kaitlynne Richardson, whose kindness would have driven her to greatness.

iv ACKNOWLEDGEMENTS

I am in gratitude for the support and mentorship of my advisor David Steurer, who oversaw my development of skill and knowledge in this thesis topic from the starting scraps I had, as well as Professor Robert Kleinberg in his role as a Director of Graduate Studies, committee member, and all-around pretty cool person. I would like to ensure the acknowledgement of those in the student community who worked (in part in student organizations) to create a welcoming, supportive, and inclusive community, as well as those who labor to keep the Computer Science Field at Cornell running. I would not have been able to complete this work without the aid of Eve Abrams, LCSW-R, Robert Mendola, MD, Clint Wattenberg, MS RD, and Edward Koppel, MD in improving my health and directing me to needed resources. I acknowledge support from a Cornell University Fellowship as well as the NSF via my advisor’s NSF CAREER Grant CCF-1350196 over the duration of my graduate program.

v CONTENTS

Biographical Sketch...... iii Dedication...... iv Acknowledgements...... v Contents...... vi List of Tables...... x List of Figures...... xi

1 Introduction1 1.0.1 Results stated...... 3 1.0.2 Overview of methods...... 7 1.0.3 Frontier of what is possible...... 13 1.0.4 Organization...... 15

2 Spiked tensor model 17 2.1 Introduction...... 17 2.1.1 Results...... 20 2.1.2 Techniques...... 23 2.1.3 Related Work...... 27 2.2 Preliminaries...... 29 2.2.1 Notation...... 29 2.2.2 Polynomials and Matrices...... 30 2.2.3 The Sum of Squares (SoS) Algorithm...... 31 2.3 Certifying Bounds on Random Polynomials...... 32 2.4 Polynomial-Time Recovery via Sum of Squares...... 35 2.4.1 Semi-Random Tensor PCA...... 39 2.5 Linear Time Recovery via Further Relaxation...... 40 2.5.1 The Spectral SoS Relaxation...... 41 Í ⊗ 2.5.2 Recovery via the i Ti Ti Spectral SoS Solution...... 44 2.5.3 Nearly-Linear-Time Recovery via Tensor Unfolding and Spectral SoS...... 49 2.5.4 Fast Recovery in the Semi-Random Model...... 52 2.5.5 Fast Recovery with Symmetric Noise...... 54 2.5.6 Numerical Simulations...... 58 2.6 Lower Bounds...... 59 2.6.1 Polynomials, Vectors, Matrices, and Symmetries, Redux. 62 2.6.2 Formal Statement of the Lower Bound...... 65 2.6.3 In-depth Preliminaries for Pseudo-Expectation Symmetries 68 2.6.4 Construction of Initial Pseudo-Distributions...... 71 2.6.5 Getting to the Unit Sphere...... 75 2.6.6 Repairing Almost-Pseudo-Distributions...... 79 2.6.7 Putting Everything Together...... 80 2.7 Higher-Order Tensors...... 87

vi 3 Spectral methods for the random overcomplete model 90 3.1 Introduction...... 90 3.1.1 Planted Sparse Vector in Random Linear Subspace.... 92 3.1.2 Overcomplete Tensor Decomposition...... 94 3.1.3 Tensor Principal Component Analysis...... 97 3.1.4 Related Work...... 98 3.2 Techniques...... 100 3.2.1 Planted Sparse Vector in Random Linear Subspace.... 104 3.2.2 Overcomplete Tensor Decomposition...... 107 3.2.3 Tensor Principal Component Analysis...... 111 3.3 Preliminaries...... 113 3.4 Planted Sparse Vector in Random Linear Subspace...... 115 3.4.1 Algorithm Succeeds on Good Basis...... 119 3.5 Overcomplete Tensor Decomposition...... 122 3.5.1 Proof of Theorem 3.5.3...... 125 3.5.2 Discussion of Full Algorithm...... 129 3.6 Tensor principal component analysis...... 132 3.6.1 Spiked tensor model...... 132 3.6.2 Linear-time algorithm...... 133

4 Polynomial lifts 136 4.1 Introduction...... 136 4.1.1 Results for tensor decomposition...... 140 4.1.2 Applications of tensor decomposition...... 146 4.1.3 Polynomial optimization with few global optima...... 148 4.2 Techniques...... 149 4.2.1 Rounding pseudo-distributions by matrix diagonalization 150 4.2.2 Overcomplete fourth-order tensor...... 154 4.2.3 Random overcomplete third-order tensor...... 157 4.3 Preliminaries...... 158 4.3.1 Pseudo-distributions...... 160 4.3.2 Sum of squares proofs...... 162 4.3.3 Matrix constraints and sum-of-squares proofs...... 164 4.4 Rounding pseudo-distributions...... 165 4.4.1 Rounding by matrix diagonalization...... 165 4.4.2 Improving accuracy of a found solution...... 169 4.5 Decomposition with sum-of-squares...... 171 4.5.1 General algorithm for tensor decomposition...... 172 4.5.2 Tensors with orthogonal components...... 175 4.5.3 Tensors with separated components...... 176 4.6 Spectral norms and tensor operations...... 179 4.6.1 Spectral norms and pseudo-distributions...... 179 4.6.2 Spectral norm of random contraction...... 181 4.7 Decomposition of random overcomplete 3-tensors...... 185

vii 4.8 Robust decomposition of overcomplete 4-tensors...... 189 4.8.1 Noiseless case...... 193 4.8.2 Noisy case...... 195 4.8.3 under smooth analysis...... 199 4.9 Tensor decomposition with general components...... 201 4.9.1 Improved rounding of pseudo-distributions...... 202 4.9.2 Finding all components...... 208 4.10 Fast orthogonal tensor decomposition without sum-of-squares.. 211

5 Overcomplete generic decomposition spectrally 216 5.1 Introduction...... 216 5.1.1 Our Results...... 221 5.1.2 Related works...... 224 5.2 Overview of algorithm...... 230 5.3 Preliminaries...... 235 5.4 Tools for analysis and implementation...... 238 5.4.1 Robustness and spectral perturbation...... 238 5.4.2 Efficient implementation and runtime analysis...... 239 5.5 Lifting...... 241 5.5.1 Algebraic identifiability argument...... 245 5.5.2 Robustness arguments...... 247 5.6 Rounding...... 250 5.6.1 Recovering candidate whitened and squared components 251 5.6.2 Extracting components from the whitened squares.... 255 5.6.3 Testing candidate components...... 257 5.6.4 Putting things together...... 259 5.6.5 Cleaning...... 261 5.7 Combining lift and round for final algorithm...... 262 5.8 Condition number of random tensors...... 266 5.8.1 Notation...... 272 5.8.2 Fourth Moment Identities...... 273 5.8.3 Matrix Product Identities...... 274 5.8.4 Naive Spectral Norm Estimate...... 277 5.8.5 Off-Diagonal Second Moment Estimates...... 278 5.8.6 Matrix Decoupling...... 280 5.8.7 Putting It Together...... 281 5.8.8 Omitted Proofs...... 283

Bibliography 286

A Spiked tensor model 294 A.1 Pseudo-Distribution Facts...... 294 A.2 Concentration bounds...... 297 A.2.1 Elementary Review...... 297

viii Í ⊗ A.2.2 Concentration for i Ai Ai and Related Ensembles... 298 A.2.3 Concentration for Spectral SoS Analyses...... 302 A.2.4 Concentration for Lower Bounds...... 303 A.3 Spectral methods for the random overcomplete model...... 308 A.3.1 ...... 308 A.3.2 Concentration tools...... 312 A.4 Concentration bounds for planted sparse vector in random linear subspace...... 320 A.5 Concentration bounds for overcomplete tensor decomposition.. 324 A.6 Concentration bounds for tensor principal component analysis.. 336

ix LIST OF TABLES

3.1 Comparison of algorithms for the planted sparse vector problem with ambient dimension n, subspace dimension d, and relative sparsity ε...... 94 3.2 Comparison of decomposition algorithms for overcomplete 3- tensors of rank n in dimension d...... 96 3.3 Comparison of principal component analysis algorithms for 3- tensors in dimension d and with signal-to-noise ratio τ...... 98

5.1 A comparison of tensor decomposition algorithms for rank-n 4-tensors in ’d 4. Here ω denotes the constant. A ( )⊗ robustness bound E 6 η refers to the requirement that a d2 d2 k k × reshaping of the error tensor E have spectral norm at most η. Some of the algorithms’ guarantees involve a tradeoff between robustness, runtime, and assumptions; where this is the case, we have chosen one representative setting of parameters. See ?? for details. Above, “random” indicates that the algorithm assumes a1,..., an are independent unit vectors (or Gaussians) and “algebraic” indicates that the algorithm assumes that the vectors avoid an algebraic set of measure 0...... 225

x LIST OF FIGURES

1.1 An example of the shortcomings of purely linear methods. Multicomponent phenomena whose components occur in non- orthogonal directions cannot be recovered by principle component analysis...... 2

2.1 Numerical simulation of Algorithm 2.5.1 (“Nearly-optimal spectral SoS” implemented with matrix power method), and two implementations of Algorithm 2.5.7 (“Accelerated power method”/“Nearly-linear tensor unfolding” and “Naive power method”/“Naive tensor unfolding”. Simulations were run in Julia on a Dell Optiplex 7010 running Ubuntu 12.04 with two Intel Core i7 3770 processors at 3.40 ghz and 16GB of RAM. Plots created with Gadfly. Error bars denote 95% confidence intervals. Matrix-vector multiply experiments were conducted with n  200. Reported matrix-vector multiply counts are the average of 50 independent trials. Reported times are in cpu-seconds and are the average of 10 independent trials. Note that both axes in the right-hand plot are log scaled...... 59

xi CHAPTER 1 INTRODUCTION

Linear algebraic tools are central in modern data analysis and , machine learning, and scientific computing applications. Matrices are excellent models for linear relationships between two vectors, or, in a closely related way, for bilinear or quadratic functions of vectors. “Principle Component Analysis” is in essence a synonym for eigenvalue decomposition on such linear relationships, isolating a set of “most important” directions in a particular set of data, effectively modeling the data as a linearly additive superposition of independent contributions from those orthogonal “most important” directions.

This linear toolset permanently reshaped the landscape of the methods we use to understand the world. But while linear relationships have proven potent as models of the world, phenomena we seek to understand may be more complicated

Chapter 1. A natural extension of linear models would be those captured by low-degree polynomials over vectors. In general, these low-degree polynomials are represented by tensors, and they are capable of modeling document topics

[10, 7], an analytic notion of sparsity [17], and statistical mixtures of signals [46, 29, 55], among other phenomena [1, 22, 45]. A well performing and robust toolkit for tensors may again reshape our empirical methods, just as matrix operations are foundational for many of today’s methods. One might even say "tensors are the new matrices" [74].

In recent years, the fields of algorithms and complexity theory have discov- ered a uniquely powerful general approach—the sum-of-squares meta-algorithm [71, 64, 61, 54]—which has achieved new qualitative improvements in many families of problems involving higher-order relationships between variables or

1 Figure 1.1: An example of the shortcomings of purely linear methods. Multicomponent phenom- ena whose components occur in non-orthogonal directions cannot be recovered by principle component analysis. approximation problems involving some degree of insensitivity—or robustness— to small perturbations in the problem input [21, 14].

The sum-of-squares method is based on the concept of sum-of-squares proofs—or

Positivstellensatz proofs [73]—a proof system that derives polynomial inequalities that are true over the real numbers, allowing the use of addition, multiplication, and composition of inequalities, along with the simple fact that every square polynomial is non-negative. This proof system is tractable in the sense that for any fixed degree d, any proof that contains only polynomials of degree at most d can be found if it exists, or a counterexample found if such a proof does not exist.

Said counterexample would take the form of a degree-d pseudo-distribution—an object which may be represented as a generalization of a probability density function, in which not all probabilities are necessarily non-negative, but for which it is certifiably true that the expectation of any square polynomial up to degree d is non-negative. Such an object, if it violates a given polynomial inequality

2 in expectation, ensures that there is no degree-d sum-of-squares proof for that inequality.

This sum-of-squares method boasts features making it particularly compelling for application to data-driven tensor problems: first and most essentially, it is a framework built fundamentally around polynomials, and certifying facts about polynomials—suiting for symmetric tensors as they are representations of polynomials. Secondly, the success of sum-of-squares in robustly handling perturbations in input bodes well for its application to data.

The remainder of this document will cover recent progress in this program of applying sum-of-squares to tensor problems: most specifically tensor rank decomposition, a stronger higher-order analog to matrix low-rank approximations or principle component analysis.

1.0.1 Results stated

We will briefly review some notation in order to state the algorithmic results presented in this thesis.

A k-tensor may be conceived of as a multilinear map from k vectors in ’d to ’. Special cases include linear forms (or vectors) as 1-tensors and bilinear forms (or matrices) as 2-tensors. In general a k-tensor may be represented as a k-dimensional array of real numbers. Symmetric k-tensors are those that are invariant with respect to interchange of their k parameters, and may additionally be concieved of as representations of degree-k polynomials by restricting the input to k copies of the same vector. A rank-1 symmetric k-tensor represents a scalar multiple of the kth power of some linear functional v over ’d, and may

3 k be expressed as cv⊗ : cv ⊗ v ⊗ ... ⊗ v, where ⊗ is the tensor product and c ∈ ’. The symmetric rank (henceforth just “rank”) of a symmetric tensor T is the minimum number r such that there are r rank-one symmetric tensors Ti  Ír satisfying T i1 Ti.

In the following results, we will allow for an error term E and assume the input tensor T is close to a low-rank tensor, up to some small error

n  Õ k + T ai⊗ E. i0

We assume kai k  1 for all i for convenience and because the most difficult case will be when all components are of equal magnitude for the orthogonal, random, and separated components models. We may quantify the size of the error kEk in different ways: usually in this document we will focus on the spectral error kEk : kEkspectral, defined as the spectral norm of one of the squarest matrix

k 2 k 2 reshaping of E, as a db / c × dd / e matrix, column-indexed by a tuple of the first bk/2c tensor indices of E and row-indexed by a tuple of the remaining dk/2e tensor indices. The reason for the focus on the spectral error is that (a symmetricized version of) this form of error arises as cross-terms of data components with each other or with independent noise, especially when mixed or noisy observations are tensor powered as moment estimators, for example in dictionary learning applications [17, 58, 68].

For this summary, “recover a component” will mean that we will find, with probability at least Ω(1), among the output of the algorithm a unit vector whose absolute value of the dot product with that normalized component (the absolute cosine similarity between those two vectors) is at least some constant. From there, one may attempt locally converging optimization techniques to refine the recovered components. Alternatively, “precisely recover a component’ will

4 refer to finding a vector in the output whose absolute cosine similarity with the component is at least 1 − O(kbEk), again with probability at least Ω(1). And “recover a component up to error f ” will mean that the absolute cosine similarity

is at least 1 − f with the same probability bound.

Theorem 1.0.1 (Single-spike model with Gaussian noise). Suppose we are given the  Ín k + ∈ ’d  tensor T i0 ai⊗ E with ai and n 1. Suppose E is sampled according to a distribution where each entry is an independent Gaussian with mean 0 and variance

1 ετ− (asymmetric noise), or where the entry E i,j,k is an independent Gaussian with ( ) 1 mean 0 and variance ετ− when i 6 j 6 k or set equal to E min i,j,k ,median i,j,k ,max i,j,k ( ( ) ( ) ( )) otherwise (symmetric noise). Then we may recover the only component of T

( 1 k 4 ( ) 1 4) • using semidefinite programming, up to error O τ− d− / log d − / in polynomial time in both the symmetric and asymmetric cases.

• using a spectral approach on matrix polynomials, up to error 1 k 4 1 4 2 k+1 2 O(τ− d− / log(d)− / ) in time O˜ (d d( )/ e) in the asymmetric case.

( 1 k 4) • using a spectral approach on matrix polynomials, up to error O τ− d− / in time O˜ (dk+1) in both the symmetric and asymmetric cases.

 Ín k + Theorem 1.0.2 (Orthogonal model). Suppose we have T i0 ai⊗ E as described

above, with ai arbitrary mutually orthogonal unit vectors and kEk < 1. Then we may recover the components of T

• precisely, using semidefinite programming, in polynomial time.

+ • precisely, when k  3, using a spectral algorithm, in time O˜ (d1 ω), where ω is the matrix multiplication time exponent.

5  Ín k + Theorem 1.0.3 (Random components model). Suppose we have T i0 ai⊗ E as described above, with each ai a random unit vector. Then we may recover the components of T

 ( / 3 2) + (k k) • using semidefinite programming, when k 3, up to error poly n d / O E , in polynomial time.

 ( / 4 3) + (k k) • using spectral methods, when k 3, up to error poly n d / O E , in time O˜ (d4).

 Ín k + Theorem 1.0.4 (Separated model). Suppose we have T i0 ai⊗ E as described k Í T k |h i| above, with ai unit vectors satisfying ai ai 6 σ and maxi,j ai , a j 6 ρ for some ∈ ( ) ( 1+log σ ) σ > 1 and ρ 0, 1 held constant. Suppose also that k > O log rho . Then for any η ∈ (0, 1), we may recover the components of T using semidefinite programming, up to error O(η + kEk), in polynomial time.

 Ín k + Theorem 1.0.5 (Generic/smoothed model). Suppose we have T i0 ai⊗ E as described above, with k  4. Then when n  d2, there are polynomials

(let P be one of them) in a1,..., an, such that when a particular condition number (rational function of eigenvalues) κ of P is not 0 (which is true almost everywhere) and kbEk 6 κ,

• we may recover the components of T, using semidefinite programming, precisely, in polynomial time.

• we may recover 0.99 of the components of T using spectral methods, up to error 2 1 8 2 3 O(kEk/κ ) / , in time O˜ (n d ).

We also provide a lower bound in hardness for the single-spike model with

Gaussian noise, showing that our algorithms are near-optimal for it among those

6 that are based on rounding a degree-4 sum-of-squares relaxation. This is notable for being among the first examples of sum-of-squares lower bounds for degree larger than 2 among unsupervised learning problems.

Theorem 1.0.6 (Lower bound on tensor decomposition with Gaussian noise). Let us suppose the same setting as in Theorem 1.0.1, with k  3 and asymmetric noise.

3 4 1 4 There is a τ 6 O(d / /log(d) / ) such that with high probability, the sum-of-squares relaxation for the maximum likelihood estimator of a1 has at least one solution that is formally independent of a1 (in that the solution to the relaxation may be expressed as solely a function of the noise tensor E and not of a1). Then we may recover the only component of T.

1.0.2 Overview of methods

The general framework of this body of work is that of using the method of moments with spectral techniques to compute roundings of pseudo-distributions.

The method of moments is a family of methods to derive the parameters of an observed distribution within a family of distributions, given the moments of the observed distribution. Since pseudo-distributions are defined by their pseudo- moments up to a specific degree, it is natural to apply the method of moments to a pseudo-distribution. And as pseudo-distributions are indistinguishable from actual distributions in their moments up to that specific degree, any proof of validity within the method of moments will carry over to pseudo-moments as well, as long as that proof makes no reference to moments higher than are valid in this pseudo-distribution.

We break this approach up into two broad steps: that of lifting’, or generating

7 a relaxed solution concept (such as a pseudo-distribution), and that of rounding, or extracting actual solutions from the relaxed solution concepts. Because the rounding step informs what kind of lifting is possible, we begin with the rounding.

Rounding:

Rounding in the method of moments is a special case of a combining algorithm: an algorithm that combines a probability distribution over solutions into a single (perhaps less optimal) solution. Specifically, the method of moments may be used as a combining algorithm when the parameters of the distribution are themselves the desired solutions. The technique of turning combining algorithms into rounding algorithms for the sum-of-squares hierarchy was first pioneered by Barak, Kelner, and Steurer in 2014 [16].

The methods used in the body of this thesis to extract solutions from pseudo-

d k moments reduce to linear algebraic techniques. By reshaping a tensor in (’ )⊗

d 2 d 2 to a matrix in (’b / c×d / e) in the various possible ways, we enable the use of powerful tools from the spectral theory of matrices.

One of the specific instantiations of these tools is by the combination of Gaussian rounding (also known as sampling from a Gaussian) and reweighing, which was also introduced by Barak, Kelner, and Steurer in 2014 [16]. We describe these techniques in two different languages: first in the language of probability distributions and then in the language of moment tensors. Gaussian rounding may be described as sampling from a Gaussian distribution with a matching to the pseudo-distribution. When working with pseudo-distributions of degree larger than 2, this will generally occur after a

8 series of random linear reweighings, taking inspiration from a reweighing by some polynomial function p in actual distributions wherein [x  a] becomes proportional to [x  a]p(a), and realized for general pseudo-distributions as changing ¾˜ q(x) to ¾˜ q(x)p(x) for any polynomial q. In a random linear reweighing, p is taken as a randomly generated linear function with Gaussian coefficients.

Equivalently in the tensor view, a Gaussian rounding takes a degree-2 pseudo- distribution (so given by solely a mean vector and a covariance matrix) and takes its PSD square root of the pseudo-covariance matrix and multiplies it by a random unit Gaussian vector, and adds the result to the pseudo-mean (generally the step of taking the PSD square root is unnecessary and therefore sometimes omitted). The sequence of random reweighings is then the process of multiplying a random Gaussian vector into each of the other tensor modes of the degree-k pseudo-distribution realized as a k-tensor over ’ ⊕ ’d (where ⊕ denotes the direct sum) until only two modes remain and Gaussian rounding may be applied.

The second major rounding technique substitutes for Gaussian rounding by computing the eigendecomposition of a pseudo-covariance matrix, this technique being developed in Chapter 4. This takes inspiration from the algorithm known as Jennrich’s simultaneous diagonalization [46, 29], which was previously used to recover components from undercomplete tensors. When an eigendecomposition is applied on a noise-free orthogonal tensor after it has been reweighed down to a 2-tensor (a matrix), then the resulting eigenvectors must correspond to the original tensor components. When there is some noise, repeating this operation enough times randomly results in a large enough eigengap to withstand the spectral perturbation. These two facts are the same in Jennrich’s algorithm and

9 in ours, though our technique is simplified by the assumption that we have a certificate of orthogonality which allows for easy verification of a hypothetical solution vector.

In all of the algorithms, the way in which the method of moments is shown to be robust to the noise term in the input tensor is to find a certification function (which is a low-degree polynomial) of the input, such that the desired (signal) components to recover are certified to exist in the input if the function evaluates to over some chosen threshold. At the same time, we must show that the noise term by itself fails to achieve certification by the same function (via a sum-of-squares proof), and that the function is “strongly subadditive” in that the cross-terms that arise when multiplying the sum of the noise and the signal are with high probability too small to push the noise over the threshold. This certification, combined with the strong duality between pseudo-distributions and sum-of-squares proofs, then guarantees that the solved pseudo-distribution is predominately composed of information about the signal terms.

Lifting: generating relaxed solution concepts

In the simplest settings, the lifting step consists simply of generating consistent hypothetical higher pseudo-moments to match the distribution whose lower moments are given in the input. This is an advantage because higher moments may be easier to use: there are more of them, and their linear algebraic representations are easier to handle as rank-one components become farther apart in the higher tensor space, and the increased number of dimensions in their matrix reshapings enable more components to be recovered. Meanwhile the lower moments are much easier to empirically estimate.

10 In the overcomplete models (including the random, separated, and generic/smoothed models), there is also a concept of “lifting” via an “orthogo- nalizing map” P(x) which is a low-degree vector-valued polynomial, such that when applied to the original components to be recovered, P(a1),..., P(an) form an orthonormal set of vectors in some lifted vector space. This orthogonalizing map can be understood as a sum-of-squares proof of the existence of only a small number of solutions to the original problem, as at most dk vectors can be mutually orthogonal after being mapped by P.

Finally, we develop a technique to generate “pseudo-pseudo-moments” from the algorithm input without the use of convex optimization techniques, making this approach much faster and demonstrating potential for use in practice. These “pseudo-pseudo-moments” are not quite pseudo-moments in that they don’t necessarily satisfy the full set of symmetry or positive-semidefiniteness constraints, but enough constraints are satisfied that the rounding method used can succeed anyway. The pseudo-pseudo-moments are generated from simple matrix polynomials and spectral operations applied to the input, generally taking not much longer than computing a singular value decomposition of the input, and the recipes to generate them are inspired by examining dual objects in the sum-of-squares proofs of the rounding algorithms. The proofs that these solution concepts suffice generally go through eigenvector perturbation theorems, as used in perturbation theory in quantum mechanics.

This technique of pseudo-pseudo-moments is pushed far in the development of a spectral algorithm for the overcomplete generic/smoothed components model. A new algebraic identity is given which yields a new identifiability argu- ment, showing that the inferred order-6 pseudo-moments uniquely determine the

11 original components of the input, based on a lower level of pseudo-moments than what was previously (established as “FOOBI” requiring order-8 pseudo-moments [55]) possible. As well, a method of truncating “excess spectral mass” in various reshapings of the pseudo-moments was adapted to this setting from another work focusing on the undercomplete case [68].

Lower bound

Finally, the lower bound Theorem 1.0.6 presented in Chapter 2 states that a degree-4 sum-of-squares program is unable to distinguish a single-spike model with Gaussian noise from pure Gaussian noise model at high enough levels of noise. This lower bound is proved by direct construction of a degree-4 pseudo- distribution that with high probability purports to demonstrate the existence of a signal spike in a random input with no actual signal term. The construction is by a fine analysis of the structure of degree-4 pseudo-distributions that maximize isotropically generated degree-3 polynomials, under the symmetry and positive- semidefiniteness constraints that must be followed by pseudo-distributions. Outer products are added to an initial pseudo-distribution construction to satisfy the positive-semidefiniteness constraints, then the transformations of those outer products under 4-symmetry operators are added to satisfy symmetry constraints, along with an analysis of their spectral effect on the positive-semidefiniteness constraints, and so on. These techniques ultimately go on to contribute to the technique of pseudo-calibration which has succeeded in demonstrating sum-of- squares lower bounds at much higher degrees [52].

12 1.0.3 Frontier of what is possible

What is doable via existing techniques. Generally, the results stated for k  4 are generalizable to all even orders, and the results stated for k  3 generalize to all orders, although surgury on “spurious subspaces” may be required to apply them to even orders.

Some of the algorithms are easily extended to non-unit vectors, with perhaps some loss in performance depending on a condition number: the main exception to this being the spectral algorithm for the generic/smoothed model and the random components model (although the results for the random components model should apply just as well to standard Gaussian vectors).

New innovations needed. One barrier repeatedly encountered in the progress of this work was the lack of a notion of a “higher-order whitening” as a primitive algebraic technique. In several places, it was necessary to assume isotropy of the components in ’d, to make a lifting procedure work, while it also would

d k have been useful to have isometry of the components in (’ )⊗ , for the pur- poses of successfully rounding the lifted tensor. It is in general not possible to achieve both of these at the same time, although the lifting technique useds for the generic/smoothed model may be considered a limited form of a possibly more general “higher-order whitening”. Such a concept could then be applied to generic/smoothed odd-ordered tensors, or to more exact recovery of the components in the spectral algorithm for generic/smoothed tensors.

More pointedly, currently there is an absense of any sort of general approach for overcomplete third-order tensors whose components fall outside of being

13 simultaneously (1) pairwise nearly orthogonal and (2) isotropically distributed. One difficulty here is that the span of any reshaping of an order-3 tensor inevitably has dimension at most d. This suggests looking to the technique of approximate message passing (AMP), which implicitly uses larger tensors as maps, to try to extract higher-dimensional polynomial maps from the input that provide new information about the components [81].

To be able to handle an important class of models realistically, it would also be important to weaken the “niceness” constraint on dictionary learning via optimization techniques [17, 58, 68], as this is a strong limitation on the distributional robustness of the technique.

Another robustness improvement important to general applicability would be the ability to recover non-unit component in the spectral algorithm for the generic/smoothed model, since in distributions in the wild, we should not expect contributions from different sources to all be of the same magnitude.

More specfically, the rounding algorithm described herein has a better chance of recovering the smaller components: recovering (and subtracting out) the large components first would be much less fragile and more akin to how Principle

Component Analysis is used today in the linear case.

It would also be interesting to apply some of these techniques in order to round solution concepts in discrete domains, e.g. in constraint satisfaction problems. Also of interest would be the derivation of tight sample size bounds, and perhaps improvements in running speed in practice by avoiding explicit construction of the full tensor (which takes O(mdk) time with m samples).

14 1.0.4 Organization

The body of this document will present the results above in detail in roughly the chronological order in which they were discovered.

Chapter 2 covers the algorithms and the lower-bound for the single-spike model as stated by Theorem 1.0.1 and Theorem 1.0.6, including the first use of “pseudo-pseudo-moments”. It shares material with a paper presented in

Conference on Learning Theory 2015 by Sam Hopkins, David Steurer, and myself [50].

Chapter 3 covers a variety of pseudo-pseudo-moments algorithms, including for the overcomplete tensor decomposition problem as stated in Theorem 1.0.3, as well as for a different problem of finding planted sparse vectors in a random subspace, and shares material with a paper presented in Symposium for the Theory of Computation 2016 by Sam Hopkins, Tselil Schramm, David Steurer, and myself [49].

Chapter 4 covers the development of lifting via an orthogonal map as well as rounding by eigendecomposition of pseudo-covariance, gives the semidefinite programming algorithms for Theorem 1.0.2, Theorem 1.0.3, Theorem 1.0.4, and

Section 1.0.4, as well as the spectral algorithm for Theorem 1.0.2, and shares material with a paper presented in Foundations of Computer Science by Tengyu Ma, David Steurer, and myself [58].

Chapter 5 covers the development of pseudo-pseudo-moments techniques in a generic overcomplete case, giving the spectral algorithm for , and shares material with a paper presented in Conference on Learning Theory 2019 by Sam Hopkins, Tselil Schramm, and myself [48].

15 Each chapter will vary slightly in the notation used and so a preliminaries section is included for each chapter.

16 CHAPTER 2 SPIKED TENSOR MODEL

2.1 Introduction

Principal component analysis (pca), the process of identifying a direction of largest possible variance from a matrix of pairwise correlations, is among the most basic tools for data analysis in a wide range of disciplines. In recent years, variants of pca have been proposed that promise to give better statistical guarantees for many applications. These variants include restricting directions to the nonnegative orthant ( factorization) or to directions that are sparse linear combinations of a fixed basis (sparse pca). Often we have access to not only pairwise but also higher-order correlations. In this case, an analog of pca is to find a direction with largest possible third moment or other higher-order moment

(higher-order pca or tensor pca).

All of these variants of pca share that the underlying optimization problem is

NP-hard for general instances (often even if we allow approximation), whereas vanilla pca boils down to an efficient eigenvector computation for the input matrix. However, these hardness result are not predictive in statistical settings where inputs are drawn from particular families of distributions, where efficent algorithm can often achieve much stronger guarantees than for general instances. Understanding the power and limitations of efficient algorithms for statistical models of NP-hard optimization problems is typically very challenging: it is not clear what kind of algorithms can exploit the additional structure afforded by statistical instances, but, at the same time, there are very few tools for reasoning about the computational complexity of statistical / average-case problems. (See

17 [23] and [18] for discussions about the computational complexity of statistical models for sparse pca and random constraint satisfaction problems.)

We study a statistical model for the tensor principal component analysis problem introduced by [60] through the lens of a meta-algorithm called the sum-of-squares method, based on semidefinite programming. This method can capture a wide range of algorithmic techniques including linear programming and spectral algorithms. We show that this method can exploit the structure of statistical tensor pca instances in non-trivial ways and achieves guarantees that improve over the previous ones. On the other hand, we show that those guarantees are nearly tight if we restrict the complexity of the sum-of-squares meta-algorithm at a particular level. This result rules out better guarantees for a fairly wide range of potential algorithms. Finally, we develop techniques to turn algorithms based on the sum-of-squares meta-algorithm into algorithms that are truely efficient (and even easy to implement).

Montanari and Richard propose the following statistical model1 for tensor pca.

Problem 2.1.1 (Spiked Tensor Model for tensor pca, Asymmetric). Given an

3 n input tensor T  τ · v⊗ + A, where v ∈ ’ is an arbitrary unit vector, τ > 0 is the signal-to-noise ratio, and A is a random noise tensor with iid standard Gaussian entries, recover the signal v approximately.

√ Montanari and Richard show that when τ 6 o( n) Problem 3.6.1 becomes √ information-theoretically unsovlable, while for τ > ω( n) the maximum likeli- hood estimator (MLE) recovers v0 with hv, v0i > 1 − o(1).

The maximum-likelihood-estimator (MLE) problem for Problem 3.6.1 is an

1Montanari and Richard use a different normalization for the signal-to-noise ratio. Using their notation, β τ √n. ≈ /

18  7→ Í instance of the following meta-problem for k 3 and f : x ijk Tijk xi x j xk [60].

Problem 2.1.2. Given a homogeneous, degree-k function f on ’n, find a unit vector v ∈ ’n so as to maximize f (v) approximately.

For k  2, this problem is just an eigenvector computation. Already for k  3, it is NP-hard. Our algorithms proceed by relaxing Problem 2.1.2 to a convex problem. The latter can be solved either exactly or approximately (as will be the case of our faster algorithms). Under the Gaussian assumption on the noise

3 4 1 4 in Problem 3.6.1, we show that for τ > ω(n / log(n) / ) the relaxation does not substantially change the global optimum.

Noise Symmetry. Montanari and Richard actually consider two variants of this model. The first we have already described. In the second, the noise is

3 symmetrized, (to match the symmetry of potential signal tensors v⊗ ).

Problem 2.1.3 (Spiked Tensor Model for tensor pca, Symmetric). Given an input

3 n tensor T  τ · v⊗ + A, where v ∈ ’ is an arbitrary unit vector, τ > 0 is the signal-to-noise ratio, and A is a random symmetric noise tensor—that is,  Aijk Aπ i π j π k for any permutation π—with otherwise iid standard Gaussian ( ) ( ) ( ) entries, recover the signal v approximately.

It turns out that for our algorithms based on the sum-of-squares method, this kind of symmetrization is already built-in. Hence there is no difference between

Problem 3.6.1 and Problem 2.1.3 for those algorithms. For our faster algorithms, such symmetrization is not built in. Nonetheless, we show that a variant of our nearly-linear-time algorithm for Problem 3.6.1 also solves Problem 2.1.3 with matching guarantees.

19 2.1.1 Results

Sum-of-squares relaxation. We consider the degree-4 sum-of-squares relax- ation for the MLE problem. (See Section 4.2 for a brief discussion about sum-of- squares. All necessary definitions are in Section 4.3. See [21] for more detailed discussion.) Note that the planted vector v has objective value (1 − o(1))τ for the √ MLE problem with high probability (assuming τ  Ω( n) which will always be the case for us).

Theorem 2.1.4. There exists a polynomial-time algorithm based on the degree-4 sum- of-squares relaxation for the MLE problem that given an instance of Problem 3.6.1 or

3 4 1 4 Problem 2.1.3 with τ > n / (log n) / /ε outputs a unit vector v0 with hv, v0i > 1−O(ε)

10 with probability 1−O(n− ) over the randomness in the input. Furthermore, the algorithm works by rounding any solution to the relaxation with objective value at least (1 − o(1))τ.

Finally, the algorithm also certifies that all unit vectors bounded away from v0 have objective value significantly smaller than τ for the MLE problem Problem 2.1.2.

We complement the above algorithmic result by the following lower bound.

3 4 1 4 Theorem 2.1.5 (Informal Version). There is τ : Ž → ’ with τ 6 O(n / /log(n) / ) so that when T is an instance of Problem 3.6.1 with signal-to-noise ratio τ, with probability

10 1 − O(n− ), there exists a solution to the degree-4 sum-of-squares relaxation for the MLE problem with objective value at least τ that does not depend on the planted vector v. In particular, no algorithm can reliably recover from this solution a vector v0 that is significantly correlated with v.

Faster algorithms. We interpret a tensor-unfolding algorithm studied by Monta- nari and Richard as a spectral relaxation of the degree-4 sum-of-squares program

20 for the MLE problem. This interpretation leads to an analysis that gives better guarantees in terms of signal-to-noise ratio τ and also informs a more efficient implementation based on shifted matrix power iteration.

Theorem 2.1.6. There exists an algorithm with running time O˜ (n3), which is linear in the size of the input, that given an instance of Problem 3.6.1 or Problem 2.1.3 with

3 4 10 τ > n / /ε outputs with probability 1−O(n− ) a unit vector v0 with hv, v0i > 1−O(ε).

We remark that unlike the previous polynomial-time algorithm this linear time algorithm does not come with a certification guarantee. In Section 2.4.1, we show that small adversarial perturbations can cause this algorithm to fail, whereas the previous algorithm is robust against such perturbations. We also devise an algorithm with the certification property and running time O˜ (n4) (which is subquadratic in the size n3 of the input).

Theorem 2.1.7. There exists an algorithm with running time O˜ (n4) (for inputs of

3 3 4 1 4 size n ) that given an instance of Problem 3.6.1 with τ > n / (log n) / /ε for some

10 ε, outputs with probability 1 − O(n− ) a unit vector v0 with hv, v0i > 1 − O(ε) and certifies that all vectors bounded away from v0 have MLE objective value significantly less than τ.

Higher-order tensors. Our algorithmic results also extend in a straightforward way to tensors of order higher than 3. (See Section 2.7 for some details.) For simplicity we give some of these results only for the higher-order analogue of Problem 3.6.1; we conjecture however that all our results for Problem 2.1.3 generalize in similar fashion.

n k 4 1 4 Theorem 2.1.8. Let k be an odd integer, v0 ∈ ’ a unit vector, τ > n / log(n) / /ε, and A an order-k tensor with independent unit Gaussian entries. Let T(x)  τ ·

21 k hv0, xi + A(x).

1. There is a polynomial-time algorithm, based on semidefinite programming, which

k on input T(x)  τ· hv0, xi +A(x) returns a unit vector v with hv0, vi > 1−O(ε)

10 with probability 1 − O(n− ) over random choice of A.

2. There is a polynomial-time algorithm, based on semidefinite programming, which on

k k k 4 1 4 input T(x)  τ·hv0, xi +A(x) certifies that T(x) 6 τ·hv, xi +O(n / log(n) / )

10 for some unit v with probability 1 − O(n− ) over random choice of A. This guarantees in particular that v is close to a maximum likelihood estimator for the · k + problem of recovering the signal v0 from the input τ v0⊗ A. 3. By solving the semidefinite relaxation approximately, both algorithms can be

1+1 k k implemented in time O˜ (m / ), where m  n is the input size.

2 For even k, the above all hold, except now we recover v with hv0, vi > 1 − O(ε), and the algorithms can be implemented in nearly linear time.

Remark 2.1.9. When A is a symmetric noise tensor (the higher-order analogue of Problem 2.1.3), (1–2) above hold. We conjecture that (3) does as well.

The last theorem, the higher-order generalization of Theorem 2.1.6, almost completely resolves a conjecture of Montanari and Richard regarding tensor unfolding algorithms for odd k. We are able to prove their conjectured signal-to- noise ratio τ for an algorithm that works mainly by using an unfolding of the input tensor, but our algorithm includes an extra random-rotation step to handle sparse signals. We conjecture but cannot prove that the necessity of this step is an artifact of the analysis.

n k 4 Theorem 2.1.10. Let k be an odd integer, v0 ∈ ’ a unit vector, τ > n / /ε, and A an order-k tensor with independent unit Gaussian entries. There is a nearly-linear-time

22 10 algorithm, based on tensor unfolding, which, with probability 1 − O(n− ) over random

2 choice of A, recovers a vector v with hv, v0i > 1 − O(ε). This continues to hold when A is replaced by a symmetric noise tensor (the higher-order analogue of Problem 2.1.3).

2.1.2 Techniques

We arrive at our results via an analysis of Problem 2.1.2 for the function ( )  Í f x ijk Tijk xi x j xk, where T is an instance of Problem 3.6.1. The function f de-  + ( )  ·h i3 ( )  Í composes as f 1 h for a signal 1 x τ v, x and noise h x ijk aijk xi x j x j where {aijk } are iid standard Gaussians. The signal 1 is maximized at x  v, where it takes the value τ. The noise part, h, is with high probability at most √ √ O˜ ( n) over the unit sphere. We have insisted that τ be much greater than n, so f has a unique global maximum, dominated by the signal 1. The main problem is to find it.

To maximize 1, we apply the Sum of Squares meta-algorithm (SoS). SoS provides a hierarchy of strong convex relaxations of Problem 2.1.2. Using convex duality, we can recast the optimization problem as one of efficiently certifying the upper bound on h which shows that optima of 1 are dominated by the signal. SoS efficiently finds boundedness certificates for h of the form

2 2 c − h(x)  s1(x) + ··· + sk(x)

2 where “” denotes equality in the ring ’[x]/(kxk − 1) and where s1,..., sk have bounded degree, when such certificates exist. (The polynomials {si } and {tj } certify that h(x) 6 c. Otherwise c − h(x) would be negative, but this is impossible by the nonnegativity of squared polynomials.)

Our main technical contribution is an almost-complete characterization of

23 certificates like these for such degree-3 random polynomials h when the poly- nomials {si } have degree at most four. In particular, we show that with high

3 4 probability in the random case a degree-4 certificate exists for c  O˜ (n / ), and that with high probability, no significantly better degree-four certificate exists.

Algorithms. We apply this characterization in three ways to obtain three differ- ent algorithms. The first application is a polynomial-time based on semidefinite

3 4 programming algorithm that maximizes f when τ > Ω˜ (n / ) (and thus solves

3 4 TPCA in the spiked tensor model for τ > Ω˜ (n / ).) This first algorithm involves solving a large semidefinite program associated to the SoS relaxation. As a second application of this characterization, we avoid solving the semidefinite program. Instead, we give an algorithm running in time O˜ (n4) which quickly constructs only a small portion of an almost-optimal SoS boundedness certificate; in the random case this turns out to be enough to find the signal v and certify the boundedness of 1. (Note that this running time is only a factor of n polylog n greater than the input size n3.)

Finally, we analyze a third algorithm for TPCA which simply computes the highest singular vector of a matrix unfolding of the input tensor. This algorithm was considered in depth by Montanari and Richard, who fully characterized its behavior in the case of even-order tensors (corresponding to k  4, 6, 8, ··· in Problem 2.1.2). They conjectured that this algorithm successfully recovers the signal v at the signal-to-noise ratio τ of Theorem 2.1.7 for Problem 3.6.1 and Problem 2.1.3. Up to an extra random rotations step before the tensor unfolding in the case that the input comes from Problem 2.1.3 (and up to logarithmic factors in τ) we confirm their conjecture. We observe that their algorithm can be viewed as a method of rounding a non-optimal solution to the SoS relaxation to find

24 the signal. We show, also, that for k  4, the degree-4 SoS relaxation does no better than the simpler tensor unfolding algorithm as far as signal-to-noise ratio is concerned. However, for odd-order tensors this unfolding algorithm does not certify its own success in the way our other algorithms do.

Lower Bounds. In Theorem 2.1.5, we show that degree-4 SoS cannot certify ( )  Í that the noise polynomial A x ijk aijk xi x j xk for aijk iid standard Gaussians 3 4 satisfies A(x) 6 o(n / ).

To show that SoS certificates do not exist we construct a corresponding dual object. Here the dual object is a degree-4 pseudo-expectation: a linear map

¾˜ : ’[x]64 → ’ pretending to give the expected value of polynomials of degree at most 4 under some distribution on the unit sphere. “Pretending” here means that, just like an actual distribution, ¾˜ p(x)2 > 0 for any p of degree at most 4. In other words, ¾˜ is positive semidefinite on degree 4 polynomials. While for any √ actual distribution over the unit sphere ¾ A(x) 6 O˜ ( n), we will give ¾˜ so that

3 4 ¾˜ A(x) > Ω˜ (n / ).

3 4 To ensure that ¾˜ A(x) > Ω˜ (n / ), for monomials xi x j xk of degree 3 we take

3 4 ¾˜ ≈ n / xi x j xk A,A aijk. For polynomials p of degree at most 2 it turns out to be h i enough to set ¾˜ p(x) ≈ ¾µ p(x) where ¾µ denotes the expectation under the uniform distribution on the unit sphere.

Having guessed these degree 1, 2 and 3 pseudo-moments, we need to define

¾˜ xi x j xk x` so that ¾˜ is PSD. Representing ¾˜ as a large , the Schur complement criterion for PSDness can be viewed as a method for turning candidate degree 1–3 moments (which here lie on upper-left and off-diagonal

n2 n2 blocks) into a candidate matrix M ∈ ’ × of degree-4 pseudo-expectation

25 values which, if used to fill out the degee-4 part of ¾˜ , would make it PSD.

We would thus like to set ¾˜ xi x j xk x`  M[(i, j), (k, l)]. Unfortunately, these candidate degree-4 moments M[(i, j), (k, l)] do not satisfy commuta- tivity; that is, we might have M[(i, j), (k, l)] , M[(i, k), (j, `)] (for exam- ple). But a valid pseudo-expectation must satisfy ¾˜ xi x j xk x`  ¾˜ xi xk x j x`.  To fix this, we average out the noncommutativity by setting ½e xi x j xk x` 1 Í M[(π(i), π(j)), (π(k), π(`))], where S4 is the symmetric group on 4 4 π 4 |S | ∈S elements.

This ensures that the candidate degree-4 pseudo-expectation ½e satisfiefs commutativity, but it introduces a new problem. While the matrix M from the Schur complement was guaranteed to be PSD and even to make ¾˜ PSD when used as its degree-4 part, some of the permutations π·M given by (π·M)[(i, j), (k, `)]  M[(π(i), π(j)), (π(k), π(`))] need not even be PSD themselves. This means that, while ½e avoids having large negative eigenvalues (since it is correlated with M from Schur complement), it will have some small negative eigenvalues; i.e.

½e p(x)2 < 0 for some p.

For each permutation π · M we track the most negative eigenvalue λmin(π · M) using matrix concentration inequalities. After averaging the permutations together to form ½e and adding this to ¾˜ to give a linear functional ¾˜ + ½e on polynomials of degree at most 4, our final task is to remove these small negative eigenvalues. For this we mix ¾˜ + ½e with µ, the uniform distribution on the unit sphere. Since ¾µ has eigenvalues bounded away from zero, our final pseudo-expectation

0 def ¾˜ p(x)  ε · ¾˜ p(x) + ε · ½e p(x) + (1 − ε)· ¾µ p(x) | {z } | {z } | {z } degree 1-3 pseudo-expectations degree 4 pseudo-expectations fix negative eigenvalues

26 is PSD for ε small enough. Having tracked the magnitude of the negative eigenvalues of ½e, we are able to show that ε here can be taken large enough to 3 4 get ¾˜ 0 A(x)  Ω˜ (n / ), which will prove Theorem 2.1.5.

2.1.3 Related Work

There is a vast literature on tensor analogues of linear algebra problems—too vast to attempt any survey here. Tensor methods for machine learning, in particular for learning latent variable models, have garnered recent attention, e.g., with works of Anandkumar et al. [10, 5]. These approaches generally involve decomposing a tensor which captures some aggregate statistics of input data into rank-one components. A recent series of papers analyzes the tensor power method, a direct analogue of the matrix power method, as a way to find rank-one components of random-case tensors [11,7].

Another recent line of work applies the Sum of Squares (a.k.a. Lasserre or Lasserre/Parrilo) hierarchy of convex relaxations to learning problems. See the survey of Barak and Steurer for references and discussion of these relaxations [21].

Barak, Kelner, and Steurer show how to use SoS to efficiently find sparse vectors planted in random linear subspaces, and the same authors give an algorithm for dictionary learning with strong provable statistical guarantees [16, 15]. These algorithms, too, proceed by decomposition of an underlying random tensor; they exploit the strong (in many cases, the strongest-known) algorithmic guarantees offered by SoS for this problem in a variety of average-case settings.

Concurrently and independently of us, and also inspired by the recently- discovered applicability of tensor methods to machine learning, Barak and

27 Moitra use SoS techniques formally related to ours to address the tensor prediction problem: given a low-rank tensor (perhaps measured with noise) only a subset of whose entries are revealed, predict the rest of the tensor entries [19]. They work with worst-case noise and study the number of revealed entries necessary for the SoS hierarchy to successfully predict the tensor. By constrast, in our setting, the entire tensor is revealed, and we study the signal-to-noise threshold necessary for SoS to recover its principal component under distributional assumptions on the noise that allow us to avoid worst-case hardness behavior.

Since Barak and Moitra work in a setting where few tensor entries are revealed, they are able to use algorithmic techniques and lower bounds from the study of sparse random constraint satisfaction problems (CSPs), in particular random 3XOR [41, 34, 33, 32]. The tensors we study are much denser. In spite of the density (and even though our setting is real-valued), our algorithmic techniques are related to the same spectral refutations of random CSPs. However our lower bound techniques do not seem to be related to the proof-complexity techniques that go into sum-of-squares lower bound results for random CSPs.

The analysis of tractable tensor decomposition in the rank one plus noise model that we consider here (the spiked tensor model) was initiated by Montanari and Richard, whose work inspired the current paper [60]. They analyze a number of natural algorithms and find that tensor unfolding algorithms, which use the spectrum of a matrix unfolding of the input tensor, are most robust to noise. Here we consider more powerful convex relaxations, and in the process we tighten Montanari and Richard’s analysis of tensor unfolding in the case of odd-order tensors.

Related to our lower bound, Montanari, Reichman, and Zeitouni prove strong

28 impossibility results for the problem of detecting rank-one perturbations of Gaussian matrices and tensors using any eigenvalue of the matrix or unfolded tensor; they are able to characterize the precise threshold below which the entire spectrum of a perturbed noise matrix or unfolded tensor becomes indistinguish- able from pure noise [?]. Our lower bounds are much coarser, applying only to the objective value of a relaxation (the analogue of just the top eigenvalue of an unfolding), but they apply to all degree-4 SoS-based algorithms, which are a priori a major generalization of spectral methods.

2.2 Preliminaries

2.2.1 Notation

We use x  (x1,..., xn) to denote a vector of indeterminates. The letters u, v, w are generally reserved for real vectors. The letters α, β are reserved for multi- indices; that is, for tuples (i1,..., ik) of indices. For f , 1 : Ž → ’ we write f - 1 for f  O(1) and f % 1 for f  Ω(1). We write f  O˜ (1) if f (n) 6 1(n)·polylog n, and f  Ω˜ (1) if f > 1(n)/polylog n.

We employ the usual Loewner (a.k.a. positive semi-definite) ordering  on Hermitian matrices.

We will be heavily concerned with tensors and matrix flattenings thereof. In general, boldface capital letters T denote tensors and ordinary capital letters denote matrices A. We adopt the convention that unless otherwise noted for a tensor T the matrix T is the squarest-possible unfolding of T. If T has even order

29 k 2 k 2 k 2 k 2 k then T has dimensions n / × n / . For odd k it has dimensions nb / c × nd / e. All tensors, matrices, vectors, and scalars in this paper are real.

We use h·, ·i to denote the usual entrywise inner product of vectors, matrices, and tensors. For a vector v, we use kvk to denote its `2 norm. For a matrix A, we use kAk to denote its operator norm (also known as the spectral or `2-to-`2 norm).

k For a k-tensor T, we write T(v) for hv⊗ , Ti. Thus, T(x) is a homogeneous real polynomial of degree k.

We use Sk to denote the symmetric group on k elements. For a k-tensor T

π and π ∈ Sk, we denote by T the k-tensor with indices permuted according to π, π  ∈ S so that Tα Tπ 1 α . A tensor T is symmetric if for all π k it is the case that − ( ) Tπ  T. (Such tensors are sometimes called “supersymmetric.”)

For clarity, most of our presentation focuses on 3-tensors. For an n × n

3-tensor T, we use Ti to denote its n × n matrix slices along the first mode, i.e., ( ) Ti j,k  Ti,j,k.

We often say that an sequence {En }n Ž of events occurs with high probability, ∈ 10 c which for us means that (En fails)  O(n− ). (Any other n− would do, with appropriate modifications of constants elsewhere.)

2.2.2 Polynomials and Matrices

Let ’[x]6d be the vector space of polynomials with real coefficients in variables x  (x1,..., xn), of degree at most d. We can represent a homogeneous even-

30 d 2 d 2 degree polynomial p ∈ ’[x]d by an n / × n / matrix: a matrix M is a matrix

d 2 d 2 representation for p if p(x)  hx⊗ / , Mx⊗ / i. If p has a matrix representation   Í ( )2 M 0, then p i pi x for some polynomials pi.

2.2.3 The Sum of Squares (SoS) Algorithm

Definition 2.2.1. Let L : ’[x]6d → ’ be a linear functional on polynomials of degree at most d for some d even. Suppose that

• L 1  1.

L ( )2 ∈ [ ] • p x > 0 for all p ’ x 6d 2. /

Then L is a degree-d pseudo-expectation. We often use the suggestive notation ¾˜ for such a functional, and think of ¾˜ p(x) as giving the expectation of the polynomial p(x) under a pseudo-distribution over {x}.

For p ∈ ’[x]6d we say that the pseudo-distribution {x} (or, equivalently, the functional ¾˜ ) satisfies {p(x)  0} if ¾˜ p(x)q(x)  0 for all q(x) such that p(x)q(x) ∈ ’[x]6d.

Pseudo-distributions were first introduced in [14] and are surveyed in [21].

We employ the standard result that, up to negligible issues of numerical accuracy, if there exists a degree-d pseudo-distribution satisfying constraints

O d {p0(x)  0,..., pm(x)  0}, then it can be found in time n ( ) by solving a O d semidefinite program of size n ( ). (See [21] for references.)

31 2.3 Certifying Bounds on Random Polynomials

Let f ∈ ’[x]d be a homogeneous degree-d polynomial. When d is even, f has

d 2 d 2 square matrix representations of dimension n / × n / . The maximal eigenvalue of a matrix representation M of f provides a natural certifiable upper bound on ( ) max v 1 f v , as k k h i d 2 d 2 w, Mw f (v)  hv⊗ / , Mv⊗ / i 6 max  kMk . d 2 h i w ’n / w, w ∈ When f (x)  A(x) for an even-order tensor A with independent random entries, the quality of this certificate is well characterized by random matrix theory. In the case where the entries of A are standard Gaussians, for instance, kMk  k + T k ˜ ( d 4) ( ) A A 6 O n / with high probability, thus certifying that max v 1 f v 6 k k d 4 O˜ (n / ).

A similar story applies to f of odd degree with random coefficients, but with a catch: the certificates are not as good. For example, we expect a degree-3 random polynomial to be a smaller and simpler object than one of degree-4, ( ) and so we should be able to certify a tighter upper bound on max v 1 f v . k k The matrix representations of f are now rectangular n2 × n matrices whose top ( ) singular values are certifiable upper bounds on max v 1 f v . But in random k k matrix theory, this maximum singular value depends (to a first approximation) only on the longer dimension n2, which is the same here as in the degree-4 case.

Again when f (x)  A(x), this time where A is an order-3 tensor of independent p standard Gaussian entries, kMk  kAAT k > Ω˜ (n), so that this method cannot ( ) ˜ ( ) certify better than max v 1 f v 6 O n . Thus, the natural spectral certificates k k are unable to exploit the decrease in degree from 4 to 3 to improve the certified bounds.

32 To better exploit the benefits of square matrices, we bound the maxima of degree-3 homogeneous f by a degree-4 polynomial. In the case that f is ( )  1 h ∇ ( )i multi-linear, we have the polynomial identity f x 3 x, f x . Using Cauchy- ( ) 1 k kk∇ ( )k Schwarz, we then get f x 6 3 x f x . This inequality suggests using the degree-4 polynomial k∇ f (x)k2 as a bound on f . Note that local optima of f on the sphere occur where ∇ f (v) ∝ v, and so this bound is tight at local maxima.

Given a random homogeneous f , we will associate a degree-4 polynomial related to k∇ f k2 and show that this polynomial yields the best possible degree-4 ( ) SoS-certifiable bound on max v 1 f v . k k

Definition 2.3.1. Let f ∈ ’[x]3 be a homogeneous degree-3 polynomial with indeterminates x  (x1,..., xn). Suppose A1,..., An are matrices such that  Í h i f i xi x, Ai x . We say that f is λ-bounded if there are matrices A1,..., An k k4 Í ⊗  2 · as above and a matrix representation M of x so that i Ai Ai λ M.

We observe that for f multi-linear in the coordinates xi of x, up to a constant factor we may take the matrices Ai to be matrix representations of ∂i f , so that Í ⊗ k∇ k2 i Ai Ai is a matrix representation of the polynomial f . This choice of Ai may not, however, yield the optimal spectral bound λ2.

The following theorem is the reason for our definition of λ-boundedness.

∈ [ ] ( ) Theorem 2.3.2. Let f ’ x 3 be λ-bounded. Then max v 1 f v 6 λ, and the k k degree-4 SoS algorithm certifies this. In particular, every degree-4 pseudo-distribution {x} over ’n satisfies 3 4 ¾˜ f 6 λ · ¾˜ kxk4 / .

Proof. By Cauchy–Schwarz for pseudo-expectations, the pseudo-distribution   ¾˜ k k22 ¾˜ k k4 ¾˜ Í h i2 ¾˜ Í 2 · Í h i2 satisfies x 6 x and i xi x, Ai x 6 i xi i x, Ai x .

33 Therefore,

Õ ¾˜ f  ¾˜ xi · hx, Ai xi i 1 2 1 2  Õ 2 /  Õ 2 / 6 ¾˜ x · ¾˜ hx, Ai xi i i i 1 2 21 2  2 Õ  2  /  ¾˜ kxk / · ¾˜ hx⊗ , Ai ⊗ Ai x⊗ i i 41 4 2 2 2 1 2 6 ¾˜ kxk / · ¾˜ hx⊗ , λ · Mx⊗ i /

3 4  λ · ¾˜ kxk4 / .

(Í ⊗ )  2 · The last inequality also uses the premise i Ai Ai λ M for some matrix rep- k k4  2 · − (Í ⊗ )  resentation M of x , in the following way. Since M0 : λ M i Ai Ai 2 2 0, the polynomial hx⊗ , M0x⊗ i is a sum of squared polynomials. Thus,

2 2 ¾˜ hx⊗ , M0x⊗ i > 0 and the desired inequality follows. 

We now state the degree-3 case of a general λ-boundedness fact for homo- geneous polynomials with random coefficients. The SoS-certifiable bound for a random degree-3 polynomial this provides is the backbone of our SoS algorithm for tensor PCA in the spiked tensor model.

Theorem 2.3.3. Let A be a 3-tensor with independent entries from N(0, 1). Then A(x)

3 4 1 4 is λ-bounded with λ  O(n / log(n) / ), with high probability.

The full statement and proof of this theorem, generalized to arbitrary-degree homogeneous polynomials, may be found as Theorem A.2.5; we prove the statement above as a corollary in Section A.2. Here provide a proof sketch.

Proof sketch. We first note that the matrix slices Ai of A satisfy A(x)  Í h i Í ⊗ i xi x, Ai x . Using the matrix Bernstein inequality, we show that i Ai − Í ⊗  ( 3 2( )1 2)· Ai ¾ i Ai Ai O n / log n / Id with high probability. At the same time,

34 1 ¾ Í ⊗ a straightforward computation shows that n i Ai Ai is a matrix represen- k k4 Í ⊗  2 · tation of x . Since Id is as well, we get that i Ai Ai λ M , where M is k k4 Í ⊗ some matrix representation of x which combines Id and ¾ i Ai Ai, and 3 4 1 4 λ  O(n / (log n) / ). 

Corollary 2.3.4. Let A be a 3-tensor with independent entries from N(0, 1). Then, ( ) with high probability, the degree-4 SoS algorithm certifies that max v 1 A v 6 k k 3 4 1 4 O(n / (log n) / ). Furthermore, also with high probability, every pseudo-distribution {x} over ’n satisfies

3 4 1 4 4 3 4 ¾˜ A(x) 6 O(n / (log n) / )(¾˜ kxk ) / .

Proof. Immediate by combining Theorem 2.3.3 with Theorem 2.3.2. 

2.4 Polynomial-Time Recovery via Sum of Squares

Here we give our first algorithm for tensor PCA: we analyze the quality of the natural SoS relaxation of tensor PCA using our previous discussion of boundedness certificates for random polynomials, and we show how to round this relaxation. We discuss also the robustness of the SoS-based algorithm to some amount of additional worst-case noise in the input. For now, to obtain a solution to the SoS relaxation we will solve a large semidefinite program. Thus, the algorithm discussed here is not yet enough to prove Theorem 2.1.7 and Corollary 2.1.7: the running time, while still polynomial, is somewhat greater than O˜ (n4).

35 Tensor PCA with Semidefinite Programming

 · 3 + ∈ ’n Input: T τ v0⊗ A, where v and A is some order-3 tensor. n Goal: Find v ∈ ’ with |hv, v0i| > 1 − o(1).

Algorithm 2.4.1 (Recovery). Using semidefinite programming, find the degree-4 pseudo-distribution {x} satisfying {kxk2  1} which maximizes ¾˜ T(x). Output ¾˜ x/k ¾˜ xk.

Algorithm 2.4.2 (Certification). Run Algorithm 2.4.1 to obtain v. Using semidefinite programming, find the degree-4 pseudo-distribution {x} satisfying {kxk  1}

3 3 3 4 1 4 which maximizes ¾˜ T(x) − τ · hv, xi . If ¾˜ T(x) − τ · hv, xi 6 O(n / log(n) / ), output certify. Otherwise, output fail.

The following theorem characterizes the success of Algorithm 2.4.1 and Algo- rithm 2.4.2

 · 3 + Theorem 2.4.3 (Formal version of Theorem 2.1.4). Let T τ v0⊗ A, where n 3 4 1 4 v0 ∈ ’ and A has independent entries from N(0, 1). Let τ % n / log(n) / /ε. Then 3 1 Í π with high probability over random choice of A, on input T or T0 : τ·v⊗ + A , 0 3 π 3 |S | ∈S Algorithm 2.4.1 outputs a vector v with hv, v0i > 1 − O(ε). In other words, for this τ, Algorithm 2.4.1 solves both Problem 3.6.1 and Problem 2.1.3.

n For any unit v0 ∈ ’ and A, if Algorithm 2.4.2 outputs certify then T(x) 6

3 3 4 1 4 τ·hv, xi +O(n / log(n) / ). For A as described in either Problem 3.6.1 or Problem 2.1.3

3 4 1 4 and τ % n / log(n) / /ε, Algorithm 2.4.2 outputs certify with high probability.

The analysis has two parts. We show that

1. if there exists a sufficiently good upper bound on A(x) (or in the case of the

π symmetric noise input, on A (x) for every π ∈ S3) which is degree-4 SoS

36 certifiable, then the vector recovered by the algorithm will be very close to v, and that

2. in the case of A with independent entries from N(0, 1), such a bound exists with high probability.

Conveniently, Item 2 is precisely the content of Corollary 2.3.4. The following lemma expresses Item 1.

4 3 4 Lemma 2.4.4. Suppose A(x) ∈ ’[x]3 is such that | ¾˜ A(x)| 6 ετ ·(¾˜ kxk ) / for any { } · 3 + degree-4 pseudo-distribution x . Then on input τ v0⊗ A, Algorithm 2.4.1 outputs a unit vector v with hv, v0i > 1 − O(ε).

Proof. Algorithm 2.4.1 outputs v  ¾˜ x/k ¾˜ xk for the pseudo-distribution that it finds, so we’d like to show hv0, ¾˜ x/k ¾˜ xki > 1 − O(ε). By pseudo-Cauchy- Schwarz (Lemma A.1.2), k ¾˜ xk2 6 ¾˜ kxk2  1, so it will suffice to prove just that hv0, ¾˜ xi > 1 − O(ε).

3 If ¾˜ hv0, xi > 1 − O(ε), then by Lemma A.1.5 (and linearity of pseudo- expectation) we would have

hv0, ¾˜ xi  ¾˜ hv0, xi > 1 − O(2ε)  1 − O(ε)

3 So it suffices to show that ¾˜ hv0, xi is close to 1.

Recall that Algorithm 2.4.1 finds a pseudo-distribution that maximizes ¾˜ T(x). ¾˜ ( ) ¾˜ h 3 3i ¾˜ ( ) We split T x into the signal v0⊗ , x⊗ and noise A x components and use our hypothesized SoS upper bound on the noise.

¾˜ ( )  ·(¾˜ h 3 3i) + ¾˜ ( ) ·(¾˜ h 3 3i) + T x τ v0⊗ , x⊗ A x 6 τ v0⊗ , x⊗ ετ .

37 h 3 3i h i3 Rewriting v0⊗ , x⊗ as v0, x , we obtain

1 ¾˜ hv , xi3 > · ¾˜ T(x) − ε . 0 τ

Finally, there exists a pseudo-distribution that achieves ¾˜ T(x) > τ − ετ.

Indeed, the trivial distribution giving probability 1 to v0 is such a pseudo- distribution:

T(v0)  τ + A(v0) > τ − ετ.

Putting it together,

1 (1 − ε)τ ¾˜ hv , xi3 > · ¾˜ T(x) − ε > − ε  1 − O(ε) .  0 τ τ

Proof of Theorem 2.4.3. We first address Algorithm 2.4.1. Let τ, T, T0 be as in the theorem statement. By Lemma 2.4.4, it will be enough to show that with high

4 3 4 probability every degree-4 pseudo-distribution {x} has ¾˜ A(x) 6 ε0τ ·(¾˜ kxk ) / and 1 ¾˜ Aπ(x) 6 ε τ ·(¾˜ kxk4)3 4 for some ε  Θ(ε). By Corollary 2.3.4 and our 3 0 / 0 S assumptions on τ this happens for each permutation Aπ individually with high

π probability, so a union bound over A for π ∈ S3 completes the proof.

Turning to Algorithm 2.4.2, the simple fact that SoS only certifies true upper bounds implies that the algorithm is never wrong when it outputs certify. It is not hard to see that whenever Algorithm 2.4.1 has succeeded in recovering v because ¾˜ A(x) is bounded, which as above happens with high probability, Algorithm 2.4.2 will output certify. 

38 2.4.1 Semi-Random Tensor PCA

We discuss here a modified TPCA model, which will illustrate the qualitative differences between the new tensor PCA algorithms we propose in this paper and previously-known algorithms. The model is semi-random and semi-adversarial. Such models are often used in average-case complexity theory to distinguish between algorithms which work by solving robust maximum-likelihood-style problems and those which work by exploiting some more fragile property of a particular choice of input distribution.

 · 3 + Problem 2.4.5 (Tensor PCA in the Semi-Random Model). Let T τ v0⊗ A, n n n where v0 ∈ ’ and A has independent entries from N(0, 1). Let Q ∈ ’ × with

1 4 k Id −Qk 6 O(n− / ), chosen adversarially depending on T. Let T0 be the 3-tensor whose n2 × n matrix flattening is TQ. (That is, each row of T has been multiplied by a matrix which is close to identity.) On input T0, recover v.

Here we show that Algorithm 2.4.1 succeeds in recovering v in the semi- random model.

Theorem 2.4.6. Let T0 be the semi-random-model tensor PCA input, with τ >

3 4 1 4 n / log(n) / /ε. With high probability over randomness in T0, Algorithm 2.4.1 outputs a vector v with hv, v0i > 1 − O(ε).

 ( − · 3) Proof. By Lemma 2.4.4, it will suffice to show that B : T0 τ v0⊗ has 4 3 4 ¾˜ B(x) 6 ε0τ ·(¾˜ kxk ) / for any degree-4 pseudo-distribution {x}, for some

ε0  Θ(ε). We rewrite B as

T B  (A + τ · v0(v0 ⊗ v0) )(Q − Id) + A

39 where A has independent entries from N(0, 1). Let {x} be a degree-4 pseudo-

2 T distribution. Let f (x)  hx⊗ , (A + τ · v0(v0 ⊗ v0) )(Q − Id)xi. By Corollary 2.3.4,

3 4 1 4 4 3 4 ¾˜ B(x)  ¾˜ f (x) + O(n / log(n) / )(¾˜ kxk ) / with high probability. By triangle inequality and sub-multiplicativity of the operator norm, we get that with high probability

3 4 k(A + τ · v0(v0 ⊗ v0))(Q − Id)k 6 (kAk + τ)kQ − Id k 6 O(n / ) , where we have also used Lemma A.2.4 to bound kAk 6 O(n) with high probability and our assumptions on τ and kQ − Id k. By an argument similar to that in the proof of Theorem 2.3.2 (which may be found in Lemma A.1.6), this yields

3 4 4 3 4 ¾˜ f (x) 6 O(n / )(¾˜ kxk ) / as desired. 

2.5 Linear Time Recovery via Further Relaxation

We now attack the problem of speeding up the algorithm from the preceding section. We would like to avoid solving a large semidefinite program to optimality: our goal is to instead use much faster linear-algebraic computations—in particular, we will recover the tensor PCA signal vector by performing a single singular vector computation on a relatively small matrix. This will complete the proofs of Theorem 2.1.7 and Theorem 2.1.6, yielding the desired running time.

Our SoS algorithm in the preceding section turned on the existence of the Í ⊗ λ-boundedness certificate i Ai Ai, where Ai are the slices of a random tensor  · 3 + A. Let T τ v0⊗ A be the spiked-tensor input to tensor PCA. We could look Í ⊗ ( ) at the matrix i Ti Ti as a candidate λ-boundedness certificate for T x . The Í ⊗ spectrum of this matrix must not admit the spectral bound that i Ai Ai does, because T(x) is not globally bounded: it has a large global maximum near the

40 signal v. This maximum plants a single large singular value in the spectrum of Í ⊗ i Ti Ti. The associated singular vector is readily decoded to recover the signal.

Before stating and analyzing this fast linear-algebraic algorithm, we situate it more firmly in the SoS framework. In the following, we discuss spectral SoS, a convex relaxation of Problem 2.1.2 obtained by weakening the full-power SoS Í ⊗ relaxation. We show that the spectrum of the aforementioned i Ti Ti can be viewed as approximately solving the spectral SoS relaxation. This gives the fast, certifying algorithm of Theorem 2.1.7. We also interpret the tensor unfolding algorithm given by Montanari and Richard for TPCA in the spiked tensor model as giving a more subtle approximate solution to the spectral SoS relaxation. We prove a conjecture by those authors that the algorithm successfully recovers the TPCA signal at the same signal-to-noise ratio as our other algorithms, up to a small pre-processing step in the algorithm; this proves Theorem 2.1.6[ 60]. This last algorithm, however, succeeds for somewhat different reasons than the others, and we will show that it consequently fails to certify its own success and that it is not robust to a certain kind of semi-adversarial choice of noise.

2.5.1 The Spectral SoS Relaxation

The SoS Algorithm: Matrix View

To obtain spectral SoS, the convex relaxation of Problem 2.1.2 which we will be able to (approximately) solve quickly in the random case, we first need to return to the full-strength SoS relaxation and examine it from a more linear-algebraic standpoint.

41 We have seen in Section 2.2.2 that a homogeneous p ∈ ’[x]2d may be represented as an nd × nd matrix whose entries correspond to coefficients of p.A

2 d 2 similar fact is true for non-homogeneous p. Let #tuples(d)  1+ n + n + ··· + n / .

6d 2 0 2 d 2 Let x⊗ / : (x⊗ , x, x⊗ ,..., x⊗ / ). Then p ∈ ’[x]6d can be represented as an #tuples(d) × #tuples(d) matrix; we say a matrix M of these dimensions is a

6 d 2 6 d 2 matrix representation of p if hx ⊗ / , Mx ⊗ / i  p(x). For this section, we let

Mp denote the set of all such matrix representation of p.

A degree-d pseudo-distribution {x} can similarly be represented as an

#tuples d #tuples d ’ ( )× ( ) matrix. We say that M is a matrix representation for {x} if M[α, β]  ¾˜ xαxβ wheneverα and β are multi-indices with |α|, |β| 6 d.

{ } ∈ Formulated this way, if M x is the matrix representation of x and Mp { } M ∈ [ ] ( ) h i p for some p ’ x 62d, then ¾˜ p x  M x , Mp . In this sense, pseudo- { } distributions and polynomials, each represented as matrices, are dual under the trace inner product on matrices.

We are interested in optimization of polynomials over the sphere, and we have been looking at pseudo-distribution {x} satisfying {kxk2 − 1  0}. From this matrix point of view, the polynomial kxk2 − 1 corresponds to a vector

#tuples d T w ∈ ’ ( ) (in particular, the vector w so that ww is a matrix representation of (kxk2 − 1)2), and a degree-4 pseudo-distribution {x} satisfies {kxk2 − 1  0} if ∈ and only if w ker M x . { }

A polynomial may have many matrix representations, but a pseudo- distribution has just one: a matrix representation of a pseudo-distribution must obey strong symmetry conditions in order to assign the same pseudo- expectation to every representation of the same polynomial. We will have much

42 more to say about constructing matrices satisfying these symmetry conditions when we state and prove our lower bounds, but here we will in fact profit from relaxing these symmetry constraints.

∈ [ ] Let p ’ x 62d. In the matrix view, the SoS relaxation of the problem ( ) max x 21 p x is the following convex program. k k

max min hM, Mpi . (2.5.1) M:w ker M Mp p M∈ 0 ∈M M, 1 1 h M i It may not be immediately obvious why this program optimizes only over M which are matrix representations of pseudo-distributions. If, however, some M h i  −∞ does not obey the requisite symmetries, then minMp p M, Mp , since the ∈M asymmetry may be exploited by careful choice of Mp ∈ Mp. Thus, at optimality this program yields M which is the matrix representation of a pseudo-distribution

{x} satisfying {kxk2 − 1  0}.

Relaxing to the Degree-4 Dual

We now formulate spectral SoS. In our analysis of full-power SoS for tensor PCA we have primarily considered pseudo-expectations of homogeneous degree-4 polynomials; our first step in further relaxing SoS is to project from ’[x]64 to ’[x]4.

n2 n2 #tuples 2 #tuples 2 Thus, now our matrices M, M0 will be in ’ × rather than ’ ( )× ( ). The projection of the constraint on the kernel in the non-homogeneous case implies Tr M  1 in the homogeneous case. The projected program is

max min hM, M0i . Tr M1 Mp p M 0 ∈M  We modify this a bit to make explicit that the relaxation is allowed to add and subtract arbitrary matrix representations of the zero polynomial; in particular

43 − ∈ M M x 4 Id for any M x 4 x 4 . This program is the same as the one which k k k k k k precedes it.

h − · i + max min M, Mp c M x 4 c . (2.5.2) Tr M1 Mp p k k M 0 ∈M M 4 4  x ∈M x k kc ’ k k ∈

By weak duality, we can interchange the min and the max in (2.5.2) to obtain the dual program:

h − · i h − · i + max min M, Mp c M x 4 6 min max M, Mp c M x 4 c Tr M1 Mp p k k Mp p Tr M1 k k M 0 ∈M ∈M M 0 M 4 4 M 4 4  x ∈M x x ∈M x  k kc ’ k k k kc ’ k k ∈ ∈ (2.5.3)

 h T − · i + min max vv , Mp c M x 4 c Mp p v 1 k k ∈M k k M 4 4 x ∈M x k kc ’ k k ∈ (2.5.4)

( ) We call this dual program the spectral SoS relaxation of max x 1 p x . If k k  Í h i N( ) p i x, Ai x for A with independent entries from 0, 1 , the spectral SoS relaxation achieves the same bound as our analysis of the full-strength SoS

3 2 1 2 relaxation: for such p, the spectral SoS relaxation is at most O(n / log(n) / ) with high probability. The reason is exactly the same as in our analysis of the Í ⊗ full-strength SoS relaxation: the matrix i Ai Ai, whose spectrum we used before to bound the full-strength SoS relaxation, is still a feasible dual solution.

Í ⊗ 2.5.2 Recovery via the i Ti Ti Spectral SoS Solution

 · 3 + Let T τ v0⊗ A be the spiked-tensor input to tensor PCA. We know from our initial characterization of SoS proofs of boundedness for degree-3 polynomials

44 ( )  ( ⊗ )T(Í ⊗ )( ⊗ ) that the polynomial T0 x : x x i Ti Ti x x gives SoS-certifiable upper bounds on T(x) on the unit sphere. We consider the spectral SoS relaxation of ( ) max x 1 T0 x , k k k − · k + min MT x c M x 4 c . MT x T x ( ) k k ( )∈M ( ) M 4 4 x ∈M x k kc ’ k k ∈ ∈ M Our goal now is to guess a good M0 T x . We will take as our dual-feasible ( ) Í ⊗ − Í ⊗ solution the top singular vector of i Ti Ti ¾ i Ai Ai. This is dual feasible  h 2 ( Í ⊗ ) 2i  k k4 with c n, since routine calculation gives x⊗ , ¾ i Ai Ai x⊗ x . This Í ⊗ top singular vector, which differentiates the spectrum of i Ti Ti from that of Í ⊗ ( ) i Ai Ai, is exactly the manifestation of the signal v0 which differentiates T x from A(x). The following algorithm and analysis captures this.

Í ⊗ Recovery and Certification with i Ti Ti

 · 3 + ∈ ’n Input: T τ v0⊗ A, where v0 and A is a 3-tensor. n Goal: Find v ∈ ’ with |hv, v0i| > 1 − o(1).

Algorithm 2.5.1 (Recovery). Compute the top (left or right) singular vector v0 of  Í ⊗ − Í ⊗ × M : i Ti Ti ¾ i Ai Ai. Reshape v0 into an n n matrix V0. Compute the top singular vector v of V0. Output v/kvk.

3 Algorithm 2.5.2 (Certification). Run Algorithm 2.5.1 to obtain v. Let S : T − v⊗ . Compute the top singular value λ of

Õ Õ Si ⊗ Si − ¾ Ai ⊗ Ai . i i

3 2 1 2 If λ 6 O(n / log(n) / ), output certify. Otherwise, output fail.

The following theorem describes the behavior of Algorithm 2.5.1 and Algo- rithm 2.5.2 and gives a proof of Theorem 2.1.7 and Corollary 2.1.7.

45  · 3 + Theorem 2.5.3 (Formal version of Theorem 2.1.7). Let T τ v0⊗ A, where n v0 ∈ ’ and A has independent entries from N(0, 1). In other words, we are given an

3 4 1 4 instance of Problem 3.6.1. Let τ > n / log(n) / /ε. Then:

2 — With high probability, Algorithm 2.5.1 returns v with hv, v0i > 1 − O(ε).

3 3 4 1 4 — If Algorithm 2.5.2 outputs certify then T(x) 6 τ · hv, xi + O(n / log(n) / ) (regardless of the distribution of A). If A is distributed as above, then Algorithm 2.5.2 outputs certify with high probability.

— Both Algorithm 2.5.1 and Algorithm 2.5.2 can be implemented in time

O(n4 log(1/ε)).

The argument that Algorithm 2.5.1 recovers a good vector in the spiked tensor model comes in three parts: we show that under appropriate regularity Í ⊗ − ⊗ conditions on the noise A that i Ti Ti ¾ Ai Ai has a good singular vector, then that with high probability in the spiked tensor model those regularity conditions hold, and finally that the good singular vector can be used to recover the signal.

 · 3 + k Í ⊗ −¾ Í ⊗ Lemma 2.5.4. Let T τ v0⊗ A be an input tensor. Suppose i Ai Ai i Ai k 2 k Í ( ) k Ai 6 ετ and that i v0 i Ai 6 ετ. Then the top (left or right) singular vector v0 2 of M has hv0, v0 ⊗ v0i > 1 − O(ε).

 · 3 + N( ) Lemma 2.5.5. Let T τ v0⊗ A. Suppose A has independent entries from 0, 1 . k Í ⊗ − Í ⊗ k ( 3 2 ( )1 2) Then with high probability we have i Ai Ai ¾ i Ai Ai 6 O n / log n / √ k Í ( ) k ( ) and i v0 i Ai 6 O n .

n n2 Lemma 2.5.6. Let v0 ∈ ’ and v0 ∈ ’ be unit vectors so that hv0, v0 ⊗v0i > 1−O(ε).

Then the top right singular vector v of the n × n matrix folding V0 of v0 satisfies hv, v0i > 1 − O(ε).

46 A similar fact to Lemma 2.5.6 appears in [60].

The proofs of Lemma 2.5.4 and Lemma 2.5.6 follow here. The proof of Lemma 2.5.5 uses only standard concentration of measure arguments; we defer it to Section A.2.

Proof of Lemma 2.5.4. We expand M as follows. Õ  2 ·( 3) ⊗ ( 3) + · (( 3) ⊗ + ⊗ ( 3) ) + ⊗ − ¾ ⊗ M τ v0⊗ i v0⊗ i τ v0⊗ i Ai Ai v0⊗ i Ai Ai Ai Ai i Õ Õ  2 ·( ⊗ )( ⊗ )T + · T ⊗ ( ) + · ( ) ⊗ T + ⊗ − ¾ ⊗ τ v0 v0 v0 v0 τ v0v0 v0 i Ai τ v0 i Ai v0v0 Ai Ai Ai Ai . i i k Í ⊗ By assumption, the noise term is bounded in operator norm: we have i Ai − ¾ Í ⊗ k 2 k · T ⊗ Ai i Ai Ai 6 ετ . Similarly, by assumption the cross-term has τ v0v0 Í ( ) k 2 i v0 i Ai 6 ετ .

·Õ (( 3) ⊗ + ⊗( 3) )  ·Õ ( ) ( T ⊗ + ⊗ T) τ Pu⊥ v0⊗ i Ai Ai v0⊗ i Pu⊥ τ v0 i Pu⊥ v0v0 Ai Ai v0v0 Pu⊥ . i i All in all, by triangle inequality,

Õ Õ · T ⊗ ( ) + · ( ) ⊗ T + ⊗ − ¾ ⊗ ( 2) τ v0v0 v0 i Ai τ v0 i Ai v0v0 Ai Ai Ai Ai 6 O ετ . i i Again by triangle inequality,

T 2 2 kMk > (v0 ⊗ v0) M(v0 ⊗ v0)  τ − O(ετ ) .

Let u, w be the top left and right singular vectors of M. We have

T 2 2 2 2 u Mw  τ · hu, v0 ⊗ v0ihw, v0 ⊗ v0i + O(ετ ) > τ − O(ετ ) , so rearranging gives the result. 

Proof of Lemma 2.5.6. Let v0, v0, V0, v, be as in the lemma statement. We know v T is the maximizer of max w , w 1 w V0w0. By assumption, k k k 0k T  h ⊗ i − ( ) v0 V0v0 v0, v0 v0 > 1 O ε .

47 Thus, the top singular value of V0 is at least 1 − O(ε), and since kv0k is a unit vector, the Frobenius norm of V0 is 1 and so all the rest of the singular values are

O(ε). Expressing v0 in the right singular basis of V0 and examining the norm of

V0v0 completes the proof. 

Proof of Theorem 2.5.3. The first claim, that Algorithm 2.5.1 returns a good vector, follows from the previous three lemmas, Lemma 2.5.4, Lemma 2.5.5, Lemma 2.5.6. Í ⊗ − Í ⊗ The next, for Algorithm 2.5.2, follows from noting that i Si Si ¾ i Ai Ai is a feasible solution to the spectral SoS dual. For the claimed runtime, since we are working with matrices of size n4, it will be enough to show that the top Í ⊗ − Í ⊗ singular vector of M and the top singular value of i Si Si ¾ i Ai Ai can be recovered with O(poly log(n)) matrix-vector multiplies.

In the first case, we start by observing that it is enough to find a vector w which has hw, v0i > 1 − ε, where v0 is a top singular vector of M. Let λ1, λ2 be the top two singular values of M. The analysis of the algorithm already showed that λ1/λ2 > Ω(1/ε). Standard analysis of the matrix power method now yields that O(log(1/ε)) iterations will suffice.

Í ⊗ − Í ⊗ We finally turn to the top singular value of i Si Si ¾ i Ai Ai. Here the matrix may not have a spectral gap, but all we need to do is ensure that

3 2 1 2 the top singular value is no more than O(n / log(n) / ). We may assume that

3 2 1 2 some singular value is greater than O(n / log(n) / ). If all of them are, then a single matrix-vector multiply initialized with a random vector will discover this. Otherwise, there is a constant spectral gap, so a standard analysis of matrix power method says that within O(log n) iterations a singular value greater than

3 2 1 2 O(n / log(n) / ) will be found, if it exists. 

48 2.5.3 Nearly-Linear-Time Recovery via Tensor Unfolding and

Spectral SoS

 · 3 + ∈ ’n On input T τ v0⊗ A, where as usual v0 and A has independent entries from N(0, 1), Montanari and Richard’s Tensor Unfolding algorithm computes the top singular vector u of the squarest-possible flattening of T into a matrix.

2 It then extracts v with hv, v0i > 1 − o(1) from u with a second singular vector computation.

Recovery with TTT, a.k.a. Tensor Unfolding

 · 3 + ∈ ’n Input: T τ v0⊗ A, where v0 and A is a 3-tensor. n Goal: Find v ∈ ’ with |hv, v0i| > 1 − o(1).

Algorithm 2.5.7 (Recovery). Compute the top eigenvector v of M : TTT. Output v.

2 We show that this algorithm successfully recovers a vector v with hv, v0i >

3 4 1 − O(ε) when τ > n / /ε. Montanari and Richard conjectured this but were only able to show it when τ > n. We also show how to implement the algorithm in time O˜ (n3), that is to say, in time nearly-linear in the input size.

Despite its a priori simplicity, the analysis of Algorithm 2.5.7 is more subtle than for any of our other algorithms. This would not be true for even-order tensors, for which the square matrix unfolding tensor has one singular value asymptotically larger than all the rest, and indeed the corresponding singular vector is well-correlated with v0. However, in the case of odd-order tensors the unfolding has no spectral gap. Instead, the signal v0 has some second-order effect

49 on the spectrum of the matrix unfolding, which is enough to recover it.

We first situate this algorithm in the SoS framework. In the previous section Í ⊗ − Í ⊗ we examined the feasible solution i Ti Ti ¾ i Ai Ai to the spectral SoS ( ) relaxation of max x 1 T x . The tensor unfolding algorithm works by examining k k the top singular vector of the flattening T of T, which is the top eigenvector of the n × n matrix M  TTT, which in turn has the same spectrum as the n2 × n2 matrix

TTT. The latter is also a feasible dual solution to the spectral SoS relaxation of ( ) ( ) max x 1 T x . However, the bound it provides on max x 1 T x is much worse k k k k Í ⊗ than that given by i Ti Ti. The latter, as we saw in the preceding section, gives 3 4 1 4 the bound O(n / log(n) / ). The former, by contrast, gives only O(n), which is the operator norm of a random n2 × n matrix (see Lemma A.2.4). This n versus

3 4 n / is the same as the gap between Montanari and Richard’s conjectured bound and what they were able to prove.

3 4 Theorem 2.5.8. For an instance of Problem 3.6.1 with τ > n / /ε, with high prob-

2 ability Algorithm 2.5.7 recovers a vector v with hv, v0i > 1 − O(ε). Furthermore, Algorithm 2.5.7 can be implemented in time O˜ (n3).

 · 3 + ∈ ’n Lemma 2.5.9. Let T τ v0⊗ A where v0 is a unit vector, so an instance of T Problem 3.6.1. Suppose A satisfies A A  C · Idn n +E for some C > 0 and E with × 2 T kEk 6 ετ and that kA (v0 ⊗ v0)k 6 ετ. Let u be the top left singular vector of the

2 matrix T. Then hv0, ui > 1 − O(ε).

Proof. The vector u is the top eigenvector of the n × n matrix TTT, which is also the top eigenvector of M : TTT − C · Id. We expand:

T  T  2 · T + · ( ⊗ )T + · T( ⊗ ) T +  u Mu u τ v0v0 τ v0 v0 v0 A τ A v0 v0 v0 E u  2 · h i2 + T  · ( ⊗ )T + · T( ⊗ ) T +  τ u, v0 u τ v0 v0 v0 A τ A v0 v0 v0 E u

50 2 2 2 6 τ hu, v0i + O(ετ ) .

T T  2 − ( 2) Again by triangle inequality, u Mu > v0 Mv τ O ετ . So rearranging we 2 get hu, v0i > 1 − O(ε) as desired. 

The following lemma is a consequence of standard matrix concentration inequalities; we defer its proof to Section A.2, Lemma A.2.10.

n Lemma 2.5.10. Let A have independent entries from N(0, 1). Let v0 ∈ ’ be a unit vector. With high probability, the matrix A satisfies AT A  n2 · Id +E for some E with

3 2 T p kEk 6 O(n / ) and kA (v0 ⊗ v0)k 6 O( n log n).

The final component of a proof of Theorem 2.5.8 is to show how it can be implemented in time O˜ (n3). Since M factors as TTT, a matrix-vector multiply by M can be implemented in time O(n3). Unfortunately, M does not have an adequate eigenvalue gap to make matrix power method efficient. As we know from Lemma 2.5.10, suppressing εs and constants, M has eigenvalues in the √ 2 3 2 range n ± n / . Thus, the eigenvalue gap of M is at most 1  O(1 + 1/ n). For

1 2 δ any number k of matrix-vector multiplies with k 6 n / − , the eigenvalue gap √ n1 2 δ will become at most (1 + 1/ n) / − , which is subconstant. To get around this problem, we employ a standard trick to improve spectral gaps of matrices close to C · Id: remove C · Id.

Lemma 2.5.11. Under the assumptions of Theorem 2.5.8, Algorithm 2.5.7 can be implemented in time O˜ (n3) (which is linear in the input size, n3).

Proof. Note that the top eigenvector of M is the same as that of M − n2 · Id. The latter matrix, by the same analysis as in Lemma 2.5.9, is given by

− 2 ·  2 · T + M n Id τ v0v0 M0

51 2 2 where kM0k  O(ετ ). Note also that a matrix-vector multiply by M − n · Id can still be done in time O(n3). Thus, M − n2 · Id has eigenvalue gap Ω(1/ε), which is enough so that the whole algorithm runs in time O˜ (n3). 

Proof of Theorem 2.5.8. Immediate from Lemma 2.5.9, Lemma 2.5.10, and Lemma 2.5.11. 

2.5.4 Fast Recovery in the Semi-Random Model

There is a qualitative difference between the aggregate matrix statistics needed by our certifying algorithms (Algorithm 2.4.1, Algorithm 2.4.2, Algorithm 2.5.1,

Algorithm 2.5.2) and those needed by rounding the tensor unfolding solution spectral SoS Algorithm 2.5.7. In a precise sense, the needs of the latter are greater. The former algorithms rely only on first-order behavior of the spectra of a tensor unfolding, while the latter relies on second-order spectral behavior. Since it uses second-order properties of the randomness, Algorithm 2.5.7 fails in the semi-random model.

 · 3 + ∈ ’n Theorem 2.5.12. Let T τ v0⊗ A, where v0 is a unit vector and A has 7 8 independent entries from N(0, 1). There is τ  Ω(n / ) so that with high probability

1 4 there is an adversarial choice of Q with kQ − Id k 6 O(n− / ) so that the matrix (TQ)TTQ  n2 · Id. In particular, for such τ, Algorithm 2.5.7 cannot recover the signal v0.

T 1 2 Proof. Let M be the n × n matrix M : T T. Let Q  n · M− / . It is clear that

T 2 1 4 (TQ) TQ  n Id. It suffices to show that kQ − Id k 6 n / with high probability.

52 We expand the matrix M as

 2 · T + · ( ⊗ )T + · T( ⊗ ) T + T M τ v0v0 τ v0 v0 v0 A τ A v0 v0 v0 A A .

T 2 3 2 T By Lemma 2.5.10, A A  n · Id +E for some E with kEk 6 O(n / ) and kA (v0 ⊗ p v0)k 6 O( n log n), both with high probability. Thus, the eigenvalues of M all

2 1+3 4 lie in the range n ± n / . The eigenvalues of Q in turn lie in the range n 1 1   . ( 2 ± ( 1+3 4))1 2 ( ± ( 1 4))1 2 ± ( 1 4) n O n / / 1 O n− / / 1 O n / Finally, the eigenvalues of Q − Id lie in the range 1 − 1  ±O(n 1 4), so we 1 O n1 4 − / ± ( / ) are done. 

The argument that that Algorithm 2.5.1 and Algorithm 2.5.2 still succeed in the semi-random model is routine; for completeness we discuss here the necessary changes to the proof of Theorem 2.5.3. The non-probabilistic certification claims made in Theorem 2.5.3 are independent of the input model, so we show that Algorithm 2.5.1 still finds the signal with high probability and that Algorithm 2.5.2 still fails only with only a small probability.

1 4 3 4 1 4 Theorem 2.5.13. In the semi-random model, ε > n− / and τ > n / log(n) / /ε, with

2 high probability, Algorithm 2.5.1 returns v with hv, v0i > 1−O(ε) and Algorithm 2.5.2 outputs certify.

Proof. We discuss the necessary modifications to the proof of Theorem 2.5.3.

1 4 Since ε > n− / , we have that k(Q − Id)v0k 6 O(ε). It suffices then to show that the probabilistic bounds in Lemma 2.5.5 hold with A replaced by AQ. Note that this means each Ai becomes AiQ. By assumption, kQ ⊗ Q − Id ⊗ Id k 6 O(ε), k Í ⊗  Í ⊗ k so the probabilistic bound on i Ai Ai ¾ i Ai Ai carries over to the Í ( ) semi-random setting. A similar argument holds for i v0 i AiQ, which is enough to complete the proof. 

53 2.5.5 Fast Recovery with Symmetric Noise

We suppose now that A is a symmetric Gaussian noise tensor; that is, that A is π ∈ S the average of A0 over all π 3, for some order-3 tensor A0 with iid standard Gaussian entries.

It was conjectured by Montanari and Richard [60] that the tensor unfolding  3 + technique can recover the signal vector v0 in the single-spike model T τv0⊗ A 3 4 with signal-to-noise ratio τ > Ω˜ (n / ) under both asymmetric and symmetric noise.

Our previous techniques fail in this symmetric noise scenario due to lack of independence between the entries of the noise tensor. However, we sidestep that issue here by restricting our attention to an asymmetric block of the input tensor.

The resulting algorithm is not precisely identical to the tensor unfolding algorithm investigated by Montanari and Richard, but is based on tensor unfolding with only superficial modifications.

54 Fast Recovery under Symmetric Noise

 · 3 + ∈ ’n Input: T τ v0⊗ A, where v0 and A is a 3-tensor. n Goal: Find v ∈ ’ with |hv, v0i| > 1 − o(1).

Algorithm 2.5.14 (Recovery). Take X, Y, Z a random partition of [n], and R a

n random rotation of ’ . Let PX, PY, and PZ be the diagonal projectors onto the

3 coordinates indicated by X, Y, and Z. Let U : R⊗ T, so that we have the matrix unfolding U : (R ⊗ R)TRT Using the matrix power method, compute the top singular vectors vX, vY, and vZ respectively of the matrices

T 2 MX : PXU (PY ⊗ PZ)UPX − n /9 · Id

T 2 MY : PYU (PZ ⊗ PX)UPY − n /9 · Id

T 2 MZ : PZU (PX ⊗ PY)UPZ − n /9 · Id .

1 Output the normalization of R− (vX + vY + vZ). Remark 2.5.15 (Implementation of Algorithm 2.5.14 in nearly-linear time.). It is possible to implement each iteration of the matrix power method in Algo- rithm 2.5.14 in linear time. We focus on multiplying a vector by MX in linear time; the other cases follow similarly.

T T T 2 We can expand MX  PX RT (R ⊗ R) (PY ⊗ PZ)(R ⊗ R)TR PX − n /9 · Id.

T It is simple enough to multiply an n-dimensional vector by PX, R, R , T, and Id in linear time. Furthermore multiplying an n2-dimensional vector by TT is also a simple linear time operation. The trickier part lies in multiplying an

2 2 2 T n -dimensional vector, say v, by the n -by-n matrix (R ⊗ R) (PY ⊗ PZ)(R ⊗ R).

To accomplish this, we simply reflatten our tensors. Let V be the n-by-n

T T T matrix flattening of v. Then we compute the matrix R PY R · V · R PZ R, and return its flattening back into an n2-dimensional vector, and this will be equal

55 T to (R ⊗ R) (PY ⊗ PZ)(R ⊗ R) v. This equivalence follows by taking the singular  Í T  Í ⊗ value decomposition V i λi ui wi , and noting that v i λi ui wi.

Lemma 2.5.16. Given a unit vector u ∈ ’n, a random rotation R over ’n, and a projection P to an m-dimensional subspace, with high probability

k k2 − / (p / 2 ) PRu m n 6 O m n log m .

Proof. Let γ be a random variable distributed as the norm of a vector in ’n with entries independently drawn from N(0, 1/n). Then because Gaussian vectors are rotationally invariant and Ru is a random unit vector, the coordinates of γRu are independent and Gaussian in any orthogonal basis.

So γ2kPRuk2 is the sum of the squares of m independent variables drawn p from N(0, 1/n). By a Bernstein inequality, γ2kPRuk2 − m/n 6 O( m/n2 log m) p with high probability. Also by a Bernstein inequality, γ2 − 1 < O( 1/n log n) with high probability. 

3 4 Theorem 2.5.17. For τ > n / /ε, with high probability, Algorithm 2.5.14 recovers a vector v with hv, v0i > 1 − O(ε) when A is a symmetric Gaussian noise tensor (as in √ Problem 2.1.3) and ε > log(n)/ n.

Furthermore the matrix power iteration steps in Algorithm 2.5.14 each converge within O˜ (− log(ε)) steps, so that the algorithm overall runs in almost linear time

O˜ (n3 log(1/ε)).

Proof. Name the projections UX : (PY ⊗ PZ)UPX, UY : (PZ ⊗ PX)UPY, and

UZ : (PX ⊗ PY)UPZ.

3 First off, U  τ(Rv0)⊗ + A0 where A0 is a symmetric Gaussian tensor (distributed identically to A). This follows by noting that multiplication by

56 3 3 π 3 π R⊗ commutes with permutation of indices, so that (R⊗ B)  R⊗ B , where  Í π we let B be the asymmetric Gaussian tensor so that A π 3 B . Then ∈S  3 Í π  Í ( 3 )π A0 R⊗ π 3 B π 3 R⊗ B . This is identically distributed with A, as ∈S ∈S follows from the rotational symmetry of B.

T Thus UX  τ(PY ⊗ PZ)(R ⊗ R)(v0 ⊗ v0)(PX Rv0) + (PY ⊗ PZ)A0PX, and

+ 2/ ·  T MX n 9 Id UXUX

2 2 2 T  τ kPY Rv0k kPZRv0k (PX Rv0)(PX Rv0) (2.5.5)

T T + τ(PX Rv0)(v0 ⊗ v0) (R ⊗ R) (PY ⊗ PZ)A0PX (2.5.6)

T T + τPXA0 (PY ⊗ PZ)(R ⊗ R)(v0 ⊗ v0)(PX Rv0) (2.5.7)

T + PXA0 (PY ⊗ PZ)A0PX . (2.5.8)

k k2 − 1 Let S refer to Expression 2.5.5. By Lemma 2.5.16, PRv0 3 < p O( 1/n log n) with high probability for P ∈ {PX , PY , PZ }. Hence S  ( 1 ± (p / )) 2( )( )T k k  ( 1 ± (p / )) 2 9 O 1 n log n τ PX Rv0 PX Rv0 and S 27 O 1 n log n τ .

Let C refer to Expression 2.5.6 so that Expression 2.5.7 is CT. Let also

A00  (PY ⊗ PZ)A0PX. Note that, once the identically-zero rows and columns of A00 are removed, A00 is a matrix of iid standard Gaussian entries. Finally, let v00  PY Rv0 ⊗ PZRv0. By some substitution and by noting that kPX Rk 6 1, we

T 2 have that kCk 6 τ kv0v00 A00k. Hence by Lemma A.2.10, kCk 6 O(ετ ).

T Let N refer to Expression 2.5.8. Note that N  A00 A00. Therefore by

2 3 2 Lemma 2.5.10, kN − n /9 · Id k 6 O(n / ).

2 2 Thus MX  S + C + (N − n /9 · Id), so that kMX − Sk 6 O(ετ ). Since S is rank-one and has kSk > Ω(τ2), we conclude that matrix power iteration converges in O˜ (− log ε) steps.

57 2 The recovered eigenvector vX satisfies hvX , MX vXi > Ω(τ ) and hvX , (MX − ) i ( 2) h i  ( 1 ± ( + p / )) 2 S vX 6 O ετ and therefore vX , SvX 27 O ε 1 n log n τ . Substi- 1 tuting in the expression for S, we conclude that hPX Rv0, vXi  ( ± O(ε + √3 p 1/n log n)).

The analyses for vY and vZ follow in the same way. Hence

hvX + vY + vZ , Rv0i  hvX , PX Rv0i + hvY , PY Rv0i + hvZ , PZRv0i √ p > 3 − O(ε + 1/n log n) .

At the same time, since vX, vY, and vZ are each orthogonal to each other, √ 1 kvX + vY + vZ k  3. Hence with the output vector being v : R− (vX + vY + vZ)/kvX + vY + vZ k, we have

1 p hv, v0i  hRv, Rv0i  hvX + vY + vZ , Rv0i > 1 − O(ε + 1/n log n) . √3



2.5.6 Numerical Simulations

We report now the results of some simple numerical simulations of the algorithms from this section. In particular, we show that the asymptotic running time differences among Algorithm 2.5.1, Algorithm 2.5.7 implemented naïvely, and the linear-time implementation of Algorithm 2.5.7 are apparent at reasonable values of n, e.g. n  200.

Specifics of our experiments are given in Figure 2.1. We find pronounced differences between all three algorithms. The naïve implementation of Algo- rithm 2.5.7 is markedly slower than the linear implementation, as measured

58 10 1.5 2

5 2 1.0

0 2 key key nearly-linear tensor unfolding 0.5 accel. power method naive tensor unfolding naive power method nearly-optimal spectral SoS -5 2 correlation with signal with correlation 0.0 -10 2 time to achieve correlation 0.9 with input (sec) input with 0.9 correlation achieve to time -15 -0.5 2 4 5 6 7 8 9 10 0 5 10 15 20 2 2 2 2 2 2 2

matrix-vector multiplies instance size

Figure 2.1: Numerical simulation of Algorithm 2.5.1 (“Nearly-optimal spectral SoS” implemented with matrix power method), and two implementations of Algorithm 2.5.7 (“Accelerated power method”/“Nearly-linear tensor unfolding” and “Naive power method”/“Naive tensor unfolding”. Simulations were run in Julia on a Dell Optiplex 7010 running Ubuntu 12.04 with two Intel Core i7 3770 processors at 3.40 ghz and 16GB of RAM. Plots created with Gadfly. Error bars denote 95% confidence intervals. Matrix-vector multiply experiments were conducted with n  200. Reported matrix-vector multiply counts are the average of 50 independent trials. Reported times are in cpu-seconds and are the average of 10 independent trials. Note that both axes in the right-hand plot are log scaled.

either by number of matrix-vector multiplies or processor time. Algorithm 2.5.1

suffers greatly from the need to construct an n2 × n2 matrix; although we do not count the time to construct this matrix against its reported running time, the memory requirements are so punishing that we were unable to collect data

beyond n  100 for this algorithm.

2.6 Lower Bounds

We will now prove lower bounds on the performance of degree-4 SoS on random

instances of the degree-4 and degree-3 homogeneous polynomial maximization problems. As an application, we show that our analysis of degree-4 for Tensor PCA is tight up to a small logarithmic factor in the signal-to-noise ratio.

59 Theorem 2.6.1 (Part one of formal version of Theorem 2.1.5). There is τ  Ω(n) and a function η : A 7→ {x} mapping 4-tensors to degree-4 pseudo-distributions satisfying

2 {kxk  1} so that for every unit vector v0, if A has unit Gaussian entries, then, with high ˜ · h i4 + ( ) probability over random choice of A, the pseudo-expectation ¾x η A τ v0, x A x ∼ ( ) 4 is maximal up to constant factors among ¾˜ τ · hv0, yi + A(y) over all degree-4 pseudo- distributions {y} satisfying {kyk2  1}.

Theorem 2.6.2 (Part two of formal version of Theorem 2.1.5). There is τ 

3 4 1 4 Ω(n / /(log n) / ) and a function η : A 7→ {x} mapping 3-tensors to degree-4 pseudo-

2 distributions satisfying {kxk  1} so that for every unit vector v0, if A has unit Gaussian entries, then, with high probability over random choice of A, the pseudo- ˜ · h i3 + ( ) expectation ¾x η A τ v0, x A x is maximal up to logarithmic factors among ∼ ( ) 3 2 ¾˜ τ · hv0, yi + A(y) over all degree-4 pseudo-distributions {y} satisfying {k yk  1}.

The existence of the maps η depending only on the random part A of the 3 + tensor PCA input v0⊗ A formalizes the claim from Theorem 2.1.5 that no algorithm can reliably recover v0 from the pseudo-distribution η(A).

Additionally, the lower-bound construction holds for the symmetric noise model also: the input tensor A is symmetrized whereever it occurs in the construction, so it does not matter if it had already been symmetrized beforehand.

The rest of this section is devoted to proving these theorems, which we eventually accomplish in Section 2.6.2.

Discussion and Outline of Proof

Given a random 3-tensor A, we will take the degree-3 pseudo-moments of our ( ) ˜ ( ) η A to be εA, for some small ε, so that ¾x η A A x is large. The main question ∼ ( )

60 is how to give degree-4 pseudo-moments to go with this. We will construct these

T from AA and its permutations as a 4-tensor under the action of S4.

We have already seen that a spectral upper bound on one of these permutations, Í ⊗ i Ai Ai, provides a performance guarantee for degree-4 SoS optimization of degree-3 polynomials. It is not a coincidence that this SoS lower bound depends on the negative eigenvalues of the permutations of AAT. Running the argument for the upper bound in reverse, a pseudo-distribution {x} satisfying {k k2  } ¾˜ ( ) x 2 1 and with A x large must (by pseudo-Cauchy-Schwarz) also ˜ h 2 (Í ⊗ ) 2i T have ¾ x⊗ , i Ai Ai x⊗ large. The permutations of AA are all matrix h 2 (Í ⊗ ) 2i ˜ ( ) representations of that same polynomial, x⊗ , i Ai Ai x⊗ . Hence ¾ A x will be large only if the matrix representation of the pseudo-distribution {x} is well correlated with the permutations of AAT. Since this matrix representation will also need to be positive-semidefinite, control on the spectra of permutations of AAT is therefore the key to our approach.

The general outline of the proof will be as follows:

1. Construct a pseudo-distribution that is well correlated with the permuta- tions of AAT and gives a large value to ¾˜ A(x), but which is not on the unit sphere.

2. Use a procedure modifying the first and second degree moments of the pseudo-distribution to force it onto a sphere, at the cost of violating the

2 condition that ¾˜ p(X) > 0 for all p ∈ ’[x]62, then rescale so it lives on the unit sphere. Thus, we end up with an object that is no longer a valid pseudo-distribution but a more general linear functional L on polynomials.

3. Quantitatively bound the failure of L to be a pseudo-distribution, and

repair it by statistically mixing the almost-pseudo-distribution with a small

61 amount of the uniform distribution over the sphere. Show that ¾˜ A(x) is still large for this new pseudo-distribution over the unit sphere.

But before we can state a formal version of our theorem, we will need a few facts about polynomials, pseudo-distributions, matrices, vectors, and how they are related by symmetries under actions of permutation groups.

2.6.1 Polynomials, Vectors, Matrices, and Symmetries, Redux

Here we further develop the matrix view of SoS presented in Section 2.5.1.

We will need to use general linear functionals L : ’[x]64 → ’ on polynomials as an intermediate step between matrices and pseudo-distributions. Like pseudo- distributions, each such linear-functional L has a unique matrix representation

M satisfying certain maximal symmetry constraints. The matrix M is positive- L L semidefinite if and only if L p(x)2 > 0 for every p. If L satisfies this and L 1  1, then L is a pseudo-expectation, and M is the matrix representation of the L corresponding pseudo-distribution.

Matrices for Linear Functionals and Maximal Symmetry

#tuples d #tuples d Let L : ’[x]6d → ’. L can be represented as an n ( ) × n ( ) matrix indexed by all d0-tuples over [n] with d0 6 d/2. For tuples α, β, this matrix M is L given by def M [α, β]  L xαxβ . L

For a linear functional L : ’[x]6d → ’, a polynomial p(x) ∈ ’[x]6d, and a matrix representation Mp for p we thus have hM , Mpi  L p(x). L

62 A polynomial in ’[x]6d may have many matrix representations, while for us, a linear functional L has just one: the matrix M . This is because in our L definition we have required that M obey the constraints L

α β α β M [α, β]  M [α0, β0] when x x  x 0 x 0 . L L in order that they assign consistent values to each representation of the same polynomial. We call such matrices maximally symmetric (following Doherty and

Wehner [30]).

We have particular interest in the maximally-symmetric version of the . The degree-d symmetrized identity matrix Idsym is the unique maximally symmetric matrix so that

h d 2 sym d 2i  k kd x⊗ / , Id x⊗ / x 2 . (2.6.1)

The degree d will always be clear from context.

k kd In addition to being a matrix representation of the polynomial x 2 , the max- imally symmetric matrix Idsym also serves a dual purpose as a linear functional.

We will often be concerned with the expectation operator ¾µ for the uniform distribution over the n-sphere, and indeed for every polynomial p(x) with matrix representation Mp, µ 1 sym ¾ p(x)  hId , Mpi , n2 + 2n and so Idsym/(n2 + 2n) is the unique matrix representation of ¾µ.

The Monomial-Indexed (i.e. Symmetric) Subspace

[ ] We will also require vector representations of polynomials. We note that ’ x 6d 2 / #tuples d has a canonical embedding into ’ ( ) as the subspace given by the following

63 family of constraints, expressed in the basis of d0-tuples for d0 6 d/2:

[ ] '{ ∈ #tuples d  } ’ x 6d 2 p ’ ( ) such that pα pα if α0 is a permutation of α . / 0

We let Π be the projector to this subspace. For any maximally-symmetric M we have ΠMΠ  M, but the reverse implication is not true (for readers familiar with quantum information: any M which has M  ΠMΠ is Bose-symmetric, but may not be PPT-symmetric; maximally symmetric matrices are both. See [30] for further discussion.)

[ ] If we restrict attention to the embedding this induces of ’ x d 2 (i.e. the / d 2 homogeneous degree-d/2 polynomials) into ’n / , the resulting subspace is

d 2 n sometimes called the symmetric subspace and in other works is denoted by ∨ / ’ .

d 2 We sometimes abuse notation and let Π be the projector from ’n / to the canonical [ ] embedding of ’ x d 2. /

Maximally-Symmetric Matrices from Tensors

The group Sd acts on the set of d-tensors (canonically flattened to matrices

n d 2 n d 2 n d 2 n d 2 ’ b / c × d / e ) by permutation of indices. To any such flattened M ∈ ’ b / c × d / e , we associate a family of maximally-symmetric matrices symM given by ( ) def Õ symM  t π · M for all t > 0 . π ∈Sd That is, symM represents all scaled averages of M over different possible flat- tenings of its corresponding d-tensor. The following conditions on a matrix M are thus equivalent: (1) M ∈ symM, (2) M is maximally symmetric, (3) a tensor that flattens to M is invariant under the index-permutation action of Sd, and (4) M may be considered as a linear functional on the space of homogeneous

64 polynomials ’[x]d. When we construct maximally-symmetric matrices from un-symmetric ones, the choice of t is somewhat subtle and will be important in not being too wasteful in intermediate steps of our construction.

There is a more complex group action characterizing maximally-symmetric ’#tuples d #tuples d S matrices in ( )× ( ), which projects to the action of d0 under the #tuples d #tuples d nd0 2 nd0 2 projection of ’ ( )× ( ) to ’ / × / . We will never have to work explicitly with this full symmetry group; instead we will be able to construct linear

#tuples d #tuples d functionals on ’[x]6d (i.e. maximally symmetric matrices in ’ ( )× ( )) by symmetrizing each degree (i.e. each d0 6 d) more or less separately.

2.6.2 Formal Statement of the Lower Bound

We will warm up with the degree-4 lower bound, which is conceptually somewhat simpler.

Theorem 2.6.3 (Degree-4 Lower Bound, General Version). Let A be a 4-tensor and let λ > 0 be a function of n. Suppose the following conditions hold:

Í π — A is significantly correlated with π 4 A . ∈S h Í πi Ω( 4) A, π 4 A > n . ∈S — Permutations have lower-bounded spectrum. ∈ S 2 × 2 1 ( π + ( π)T) π For every π 4, the Hermitian n n unfolding 2 A A of A has no eigenvalues smaller than −λ2.

— Using A as 4th pseudo-moments does not imply that kxk4 is too large.

sym π 2 3 2 For every π ∈ S4, we have hId , A i 6 O(λ n / )

65 — Using A for 4th pseudo-moments does not imply first and second degree moments are too large.

Let L : ’[x]4 → ’ be the linear functional given by the matrix representation  1 Í π M : 2 2 π 4 A . Let L λ n ∈S

def L k k2 δ2 max x 2 xi x j i,j

def 2 2 δ0  max L kxk x 2 i 2 i

3 2 + 2 ( ) Then n / δ02 n δ2 6 O 1 .

{ } {k k2  } ¾˜ ( ) Then there is a degree-4 pseudo-distribution x satisfying x 2 1 so that A x > Ω(n2/λ2) + Θ(¾µ A(x)).

The degree-3 version of our lower bound requires bounds on the spectra of the flattenings not just of the 3-tensor A itself but also of the flattenings of an h 2 (Í ⊗ ) 2i associated 4-tensor, which represents the polynomial x⊗ , i Ai Ai x⊗ .

Theorem 2.6.4 (Degree-3 Lower Bound, General Version). Let A be a 3-tensor and let λ > 0 be a function of n. Suppose the following conditions hold:

Í π — A is significantly correlated with π 3 A . ∈S h Í πi Ω( 3) A, π 3 A > n . ∈S — Permutations have lower-bounded spectrum.

For every π ∈ S3, we have

1 1 −2λ2·Π Id Π  Π(σ·Aπ(Aπ)T+σ2·Aπ(Aπ)T)Π+ Π(σ·Aπ(Aπ)T+σ2·Aπ(Aπ)T)TΠ . 2 2

— Using AAT for 4th moments does not imply kxk4 is too large.

sym π π T 2 2 For every π ∈ S3, we have hId , A (A ) i 6 O(λ n )

66 — Using A and AAT for 3rd and 4th moments do not imply first and second degree moments are too large.

Let π ∈ S3. Let L : ’[x]4 → ’ be the linear functional given by the matrix  1 Í · T representation M : 2 2 π 4 π0 AA . Let L λ n 0∈S

def 1 δ  max hId , Aπi 1 3 2 n n i i λn / × def L k k2 δ2 max x 2 xi x j i,j

def 2 2 1 4 δ0  max L kxk x − L kxk 2 i 2 i n 2

+ 3 2 + 2 ( ) Then nδ1 n / δ02 n δ2 6 O 1 .

{ } {k k2  } Then there is a degree-4 pseudo-distribution x satisfying x 2 1 so that  n3 2  ¾˜ A(x) > Ω / + Θ(¾µ A(x)) . λ

Proof of Theorem 2.6.2

We prove the degree-3 corollary; the degree-4 case is almost identical using Theorem 2.6.3 and Lemma A.2.12 in place of their degree-3 counterparts.

Proof. Let A be a 3-tensor. If A satisfies the conditions of Theorem 2.6.4 with

3 4 1 4 λ  O(n / log(n) / ), we let η(A) be the pseudo-distribution described there, with

 n3 2  ¾˜ A(x) > Ω / + Θ(¾µ A(x)) x η A λ ∼ ( ) If A does not satisfy the regularity conditions, we let η(A) be the uniform distribution on the unit sphere. If A has unit Gaussian entries, then Lemma A.2.11 says that the regularity conditions are satisfied with this choice of λ with high

67 √ √ probability. The operator norm of A is at most O( n), so ¾µ A(x)  O( n) (all with high probability) [76]. We have chosen λ and τ so that when the conditions of Theorem 2.6.4 and the bound on ¾µ A(x), obtain,

 3 4  3 n / ¾˜ τ · hv0, xi + A(x) > Ω . x η A log(n)1 4 ∼ ( ) / On the other hand, our arguments on degree-4 SoS certificates for random polynomials say with high probability every degree-4 pseudo-distribution {y}

2 3 3 4 1 4 satisfying {kyk  1} has ¾˜ τ · hv, yi + A(y) 6 O(n / log(n) / ). Thus, {x} is nearly optimal and we are done. 

2.6.3 In-depth Preliminaries for Pseudo-Expectation Symme-

tries

This section gives the preliminaries we will need to construct maximally- symmetric matrices (a.k.a. functionals L : ’[x]64 → ’) in what follows. For a n2 n2 non-maximally-symmetric M ∈ ’ × under the action of S4 by permutation of indices, the subgroup C3 < S4 represents all the significant permutations whose spectra may differ from one another in a nontrivial way. The lemmas that follow will make this more precise. For concreteness, we take C3  hσi with σ  (234), but any other choice of 3-cycle would lead to a merely syntactic change in the proof.

Lemma 2.6.5. Let D8 < S4 be given by D8  h(12), (34), (13)(24)i. Let C3 

2 {(), σ, σ }  hσi, where () denotes the identity in S4. Then {1h : 1 ∈ D8, h ∈ C3}  S4.

Proof. The proof is routine; we provide it here for completeness. Note that C3 is a subgroup of order 3 in the alternating group A4. This alternating group

68 can be decomposed as A4  K4 ·C3, where K4  h(12)(34), (13)(24)i is a normal subgroup of A4. We can also decompose S4  C2 ·A4 where C2  h(12)i and A4 is a normal subgroup of S4. Finally, D8  C2 ·K4 so by associativity,

S4  C2 ·A4  C2 ·K4 ·C3  D8 ·C3. 

This lemma has two useful corollaries:

Corollary 2.6.6. For any subset S ⊆ S4, we have {1hs : 1 ∈ D8, h ∈ C3, s ∈ S}  S4.

n2 n2 Corollary 2.6.7. Let M ∈ ’ × . Let the matrix M0 be given by

def 1 2  1 2 T M0  Π M + σ · M + σ · M Π + Π M + σ · M + σ · M Π . 2 2

Then M0 ∈ symM.

+ · + 2 ·  Í · Proof. Observe first that M σ M σ M π 3 π M. For arbitrary ∈C ∈ ’n2 n2 1 Π Π + 1 Π TΠ  1 Í · N × , we show that 2 N 2 N 8 π 8 π N. First, conjugation ∈D by Π corresponds to averaging M over the group h(12), (34)i generated by interchange of indices in row and column indexing pairs, individually. At the same time, N + NT is the average of M over the matrix transposition permutation group h(13)(24)i. All together,

1 Õ Õ 1 Õ M0  (1h)· M  π · M 8 8 1 8 h 3 π 4 ∈D ∈C ∈S and so M0 ∈ symM. 

We make an useful observation about the nontrivial permutations of M, in the special case that M  AAT for some 3-tensor A.

n2 n Lemma 2.6.8. Let A be a 3-tensor and let A ∈ ’ × be its flattening, where the first and third modes lie on the longer axis and the third mode lies on the shorter axis. Let Ai

69 be the n × n matrix slices of A along the first mode, so that

© A1 ª ­ ® ­ ® ­ A2 ® A  ­ ® . ­ . ® ­ . ® ­ ® ­ ® An « ¬ 2 2 Let P : ’n → ’n be the orthogonal linear operator so that [Px](i, j)  x(j, i). Then ! · T  Õ ⊗ 2 · T  Õ ⊗ T σ AA Ai Ai P and σ AA Ai Ai . i i

T[( ) ( )]  Í (Í ⊗ Proof. We observe that AA j1, j2 , j3, j4 i Aij1 j2 Aij3 j4 and that i Ai )[( ) ( )]  Í Ai j1, j2 , j3, j4 i Aij1 j3 Aij2 j4 . Multiplication by P on the right has [(Í ⊗ the effect of switching the order of the second indexing pair, so i Ai ) ][( ) ( )]  Í · T  Ai P j1, j2 , j3, j4 i Aij1 j4 Aij2 j3 . From this it is easy to see that σ AA ( )· T  (Í ⊗ ) 234 AA i Ai Ai P.

Similarly, we have that

( 2 T)[( ) ( )]  (( )· T)[( ) ( )]  Õ σ AA j1, j2 , j3, j4 243 AA j1, j2 , j3, j4 Aij1 j3 Aij4 j2 , k

2 · T  Í ⊗ T from which we see that σ AA i Ai Ai . 

Permutations of the Identity Matrix. The nontrivial permutations of Idn2 n2 × are:

Id[(j, k), (j0, k0)]  δ(j, k)δ(j0, k0)

σ · Id[(j, k), (j0, k0)]  δ(j, j0)δ(k, k0)

2 σ · Id[(j, k), (j0, k0)]  δ(j, k0)δ(j0, k) .

70 2 Since (Id +σ · Id +σ · Id) is invariant under the action of D8, we have (Id +σ · Id +σ2 · Id) ∈ symM; up to scaling this matrix is the same as Idsym defined in (2.6.1). We record the following observations:

— Id, σ · Id, and σ2 · Id are all symmetric matrices.

— Up to scaling, Id +σ2 Id projects to identity on the canonical embedding of

’[x]2.

— The matrix σ · Id is rank-1, positive-semidefinite, and has Π(σ · Id)Π  σ · Id.

— The scaling [1/(n2 + 2n)](Id +σ · Id +σ2 · Id) is equal to a linear functional

µ ¾ : ’[x]4 → ’ giving the expectation under the uniform distribution over

n 1 the unit sphere S − .

2.6.4 Construction of Initial Pseudo-Distributions

We begin by discussing how to create an initial guess at a pseudo-distribution whose third moments are highly correlated with the polynomial A(x). This initial guess will be a valid pseudo-distribution, but will fail to be on the unit sphere, and so will require some repairing later on. For now, the method of creating this initial pseudo-distribution involves using a combination of symmetrization techniques to ensure that the matrices we construct are well defined as linear functionals over polynomials, and spectral techniques to establish positive-semidefiniteness of these matrices.

71 Extending Pseudo-Distributions to Degree Four

In this section we discuss a construction that takes a linear functional L :

’[x]63 → ’over degree-3 polynomials and yields a degree-4 pseudo-distribution {x}. We begin by reminding the reader of the Schur complement criterion for positive-semidefiniteness of block matrices.

Theorem 2.6.9. Let M be the following block matrix.

BCT def © ª M ­ ® CD « ¬ 1 T where B  0 and is full rank. Then M  0 if and only if D  CB− C .

Suppose we are given a linear functional L : ’[x]63 → ’ with L 1  1.

Let L |1 be L restricted to ’[x]1 and similarly for L |2 and L |3. We define the following matrices:

∈ ’n 1 L | — M 1 × is the matrix representation of 1. L | ∈ ’n n L | — M 2 × is the matrix representation of 2. L | ∈ ’n2 n L | — M 3 × is the matrix representation of 3. L | ∈ ’n2 1 — V 2 × is the vector flattening of M 2 . L | L |

#tuples 2 #tuples 2 Consider the block matrix M ∈ ’ ( )× ( ) given by

1 MT VT © 1 2 ª def ­ L | L | ® M  ­ M M MT ® , ­ 1 2 3 ® ­ L | L | L | ® ­ ® V 2 M 3 D « L | L | ¬

72 n2 n2 with D ∈ ’ × yet to be chosen. By taking

1 MT   1 B  © L | ª C  , V 2 M 3 ­ ® L | L | M 1 M 2 « L | L | ¬ we see by the Schur complement criterion that M is positive-semidefinite so

1 T long as D  CB− C . However, not any choice of D will yield M maximally symmetric, which is necessary for M to define a pseudo-expectation operator ¾˜ .

We would ideally take D to be the spectrally-least maximally-symmetric

1 T matrix so that D  CB− C . But this object might not be well defined, so we instead take the following substitute.

Definition 2.6.10. Let L, B, C as be as above. The symmetric Schur complement D ∈ 1 T Í ·( 1 T) Í ·( 1 T)  symCB− C is t π 4 π CB− C for the least t so that t π 4 π CB− C ∈S ∈S 1 T CB− C . We denote by ¾˜ L the linear functional ¾˜ L : ’[x]64 → ’ whose matrix representation is M with this choice of D, and note that ¾˜ L is a valid degree-4 pseudo-expectation.

Example 2.6.11 (Recovery of Degree-4 Uniform Moments from Symmetric Schur

µ Complement). Let L : ’[x]63 → ’ be given by L p(x) : ¾ p(x). We show that

µ 1 T 2 ¾˜ L  ¾ . In this case it is straightforward to compute that CB− C  σ · Id/n . t Π( + · + 2 · )Π  1 Π( · )Π Our task is to pick t > 0 minimal so that n2 Id σ Id σ Id n2 σ Id .

We know that Π(σ · Id)Π  σ · Id. Furthermore, Π Id Π  Π(σ2 · Id)Π, and

#tuples 4 both are the identity on the canonically-embedded subspace ’[x]2 in ’ ( ). We have previously observed that σ · Id is rank-one and positive-semidefinite, so

#tuples 4 T let w ∈ ’ ( ) be such that ww  σ · Id.

T( + · + 2 · )  k k2 + k k4  + 2 T( · We compute w Id σ Id σ Id w 2 w 2 w 2 2n n and w σ )  k k4  2  2/( 2 + ) Id w w 2 n . Thus t n n 2n is the minimizer. By a previous observation, this yields ¾µ.

73 To prove our lower bound, we will generalize the above example to the case

µ that we start with an operator L : ’[x]63 → ’ which does not match ¾ on degree-3 polynomials.

Symmetries at Degree Three

We intend on using the symmetric Schur complement to construct a pseudo- distribution from some L : ’[x]63 → ’ for which L A(x) is large. A good such L L x x x Í Aπ i j k will have i j k correlated with π 3 ijk for all (or many) indices , , . ∈S That is, it should be correlated with the coefficient of the monomial xi x j xk in ( ) L  Í π A x . However, if we do this directly by setting xi x j xk π Aijk, it becomes technically inconvenient to control the spectrum of the resulting symmetric Schur complement. To this avoid, we discuss how to utilize a decomposition of M 3 L | into nicer matrices if such a decomposition exists.

1 1 k Lemma 2.6.12. Let L : ’[x]63 → ’, and suppose that M  (M + ··· + M ) 3 k 3 3 L | L | L | 1 k n2 n for some M ,..., M ∈ ’ × . Let D1,..., Dk be the respective symmetric Schur 3 3 L | L | complements of the family of matrices

   1 MT VT  © 1 2 ª ­ L | L | ® ­ M M (Mi )T ® . ­ 1 2 3 ® ­ L | L | L | ® ­ i ®  V M •   2 3  « L | L | ¬16i6k Then the matrix k 1 MT VT © 1 2 ª def 1Õ ­ L | L | ® M  ­ M M (Mi )T ® ­ 1 2 3 ® k ­ L | L | L | ® i1 ­ i ® V M Di 2 3 « L | L | ¬ is positive-semidefinite and maximally symmetric. Therefore it defines a valid pseudo- expectation ¾˜ L. (This is a slight abuse of notation, since the pseudo-expectation defined

74 here in general differs from the one in Definition 2.6.10.)

Proof. Each matrix in the sum defining M is positive-semidefinite, so M  0. Ík Each Di is maximally symmetric and therefore so is i1 Di. We know that Ík i M   M is maximally-symmetric, so it follows that M is the matrix 3 i 1 3 L | L | representation of a valid pseudo-expectation. 

2.6.5 Getting to the Unit Sphere

Our next tool takes a pseudo-distribution ¾˜ that is slightly off the unit sphere, and corrects it to give a linear functional L : ’[x]64 → ’ that lies on the unit sphere.

We will also characterize how badly the resulting linear functional deviates

2 from the nonnegativity condition (L p(x) > 0 for p ∈ ’[x]62) required to be a pseudo-distribution

Definition 2.6.13. Let L : ’[x]6d → ’. We define

2 def L p(x) λ L  min min µ ( )2 p ’ x 6d 2 ¾ p x ∈ [ ] / where ¾µ p(x)2 is the expectation of p(x)2 when x is distributed according to the uniform distribution on the unit sphere.

Since ¾µ p(x)2 > 0 for all p, we have L p(x)2 > 0 for all p if and only if

λmin L > 0. Thus L on the unit sphere is a pseudo-distribution if and only if

L 1  1 and λmin L > 0.

Lemma 2.6.14. Let ¾˜ : ’[x]64 → ’ be a valid pseudodistribution. Suppose that:

75  ¾˜ k k4 1. c : x 2 > 1. ¾˜ 2. is close to lying on the sphere, in the sense that there are δ1, δ2, δ02 > 0 so that:

| 1 ¾˜ k k2 − L | (a) c x 2 xi 0 xi 6 δ1 for all i. | 1 ¾˜ k k2 − L | (b) c x 2 xi x j 0 xi x j 6 δ2 for all i , j. | 1 ¾˜ k k2 2 − L 2| (c) c x 2 xi 0 xi 6 δ02 for all i.

Let L : ’[x]64 → ’ be as follows on homogeneous p:

 ˜  ¾ 1 if deg p 0  def  L p(x)  1 ¾˜ p(x) if deg p  3, 4  c   1 ¾˜ ( )k k2   c p x x 2 if deg p 1, 2 .  L L ( )(k k2 − )  ( ) ∈ ’[ ] L Then satisfies p x x 2 1 0 for all p x x 62 and has λmin > − c 1 − ( ) − ( 3 2) − ( 2) −c O n δ1 O n / δ02 O n δ2.

L ( )(k k2 − )  ∈ ’[ ] Proof. It is easy to check that p x x 2 1 0 for all p x 62 by expanding the definition of L.

Let the linear functional L0 : ’[x]64 → ’ be defined over homogeneous polynomials p as

 c if deg p  0   L ( ) def  0 p x ¾˜ p(x) if deg p  3, 4   ¾˜ ( )k k2   p x x 2 if deg p 1, 2 . 

Note that L0 p(x)  c L p(x) for all p ∈ ’[x]64. Thus λmin L > λmin L0/c, and the kernel of L0 is identical to the kernel of L.

76 (k k2 − ) L L  In particular, since x 2 1 is in the kernel of 0, either λmin 0 0 or 2 L0 p(x) λmin L0  min . 2 µ ( )2 p ’ x 62,p x 1 ¾ p x ∈ [ ] ⊥(k k2 − ) Here p ⊥ (kxk2 −1) means that the polynomials p and kxk2 −1 are perpendictular ( )  + Í + Í in the coefficient basis. That is, if p x p0 i pi xi ij pij xi x j, this means Í  K ii pii p0. The equality holds because any linear functional on polynomials with (kxk2 − 1) in its kernel satisfies K(p(x) + α(kxk2 − 1))2  K p(x)2 for every

µ α. The functionals L0 and ¾ in particular both satisfy this.

Let ∆ : L0 − ¾˜ , and note that ∆ is nonzero only when evaluated on the degree-1 or -2 parts of polynomials. It will be sufficient to bound ∆, since assuming λmin L0 , 0, ∆p(x)2 + ¾˜ p(x)2 λmin L0  min 2 µ ( )2 p ’ x 62,p x 1 ¾ p x ∈ [ ] ⊥(k k2 − ) ∆p(x)2 > min . 2 µ ( )2 p ’ x 62,p x 1 ¾ p x ∈ [ ] ⊥(k k2 − )

∈ [ ] ( )  + Í + Let p ’ x 62. We expand p in the monomial basis: p x p0 i pi xi Í i,j pij xi x j. Then

!2 ! 2 ( )2  2+ Õ + Õ + Õ + Õ ©Õ ª+©Õ ª p x p0 2p0 pi xi 2p0 pij xi x j pi xi 2 pi xi ­ pij xi x j® ­ pij xi x j® . i ij i i ij ij « ¬ « ¬ An easy calculation gives !2 µ 2 2 2p0 Õ 1 Õ 2 1 Õ Õ 2 Õ 2 ¾ p(x)  p + pii + p + © pii + p + p ª . 0 n n i n2 + 2n ­ ij ii® i i i ij i « ¬ ⊥ (k k2 − )  Í The condition p x 2 1 yields p0 i pii. Substituting into the above, we obtain the sum of squares

2 2p 1 Õ 1 Õ Õ ¾µ p(x)2  p2 + 0 + p2 + ©p2 + p2 + p2 ª . 0 n n i n2 + 2n ­ 0 ij ii® i ij i « ¬

77 Without loss of generality we assume ¾µ p(x)2  1, so now it is enough just to

2 bound ∆p(x) . We have assumed that |∆xi | 6 cδ1 and |∆xi x j | 6 cδ2 for i , j and |∆ 2| ∆  − ∆ ( )  xi 6 cδ02. We also know 1 c 1 and p x 0 when p is a homogeneous degree-3 or -4 polynomial. So we expand

Õ Õ Õ ∆ ( )2  2( − ) + ∆ + ∆ + ∆ p x p0 c 1 2p0 pi xi 2p0 pij xi x j pi p j xi x j i ij i,j and note that this is maximized in absolute value when all the signs line up:

!2 Õ Õ Õ Õ Õ |∆ ( )2| 2( − )+ | | | |+ | | © | | + | |ª+ | | + 2 p x 6 p0 c 1 2cδ1 p0 pi 2 p0 ­cδ2 pij cδ02 pii ® cδ2 pi cδ02 pi . i i,j i i i « ¬

2  ∈ [ ] Í 2 ( − ) We start with the second term. If p0 α for α 0, 1 , then i pi 6 n 1 α by our assumption that ¾µ p(x)2  1. This means that s | | Õ | | Õ 2 p ( − ) ( ) 2cδ1 p0 pi 6 2cδ1 αn pi 6 2cδ1n α 1 α 6 O n cδ1 , i i

2 where we have used Cauchy-Schwarz and the fact max06α61 α(1 − α)  (1/2) . The other terms are all similar:

2( − ) − p0 c 1 6 c 1 Õ s Õ | | | | 2 2 ( 2)p ( − ) ( 2) 2 p0 cδ2 pij 6 2cδ2 αn pij 6 2cδ2O n α 1 α 6 O n cδ2 i,j ij s | | Õ | | Õ 2 ( 3 2) 2 p0 cδ02 pii 6 2cδ02 αn pii 6 O n / cδ02 i i !2 Õ | | Õ 2 ( 2) cδ2 pi 6 cδ2n pi 6 O n cδ2 i i Õ 2 ( ) cδ02 pi 6 O n cδ02 , i where in each case we have used Cauchy-Schwarz and our assumption ¾µ p(x)2 

1.

78 Putting it all together, we get

∆ −( − ) − ( ) − ( 3 2) − ( 2) λmin > c 1 O n cδ1 O n / cδ02 O n cδ2 . 

2.6.6 Repairing Almost-Pseudo-Distributions

Our last tool takes a linear functional L : ’[x]6d that is “almost” a pseudo- distribution over the unit sphere, in the precise sense that all conditions for being a pseudo-distribution over the sphere are satisfied except that λmin L  −ε. The tool transforms it into a bona fide pseudo-distribution at a slight cost to its evaluations at various polynomials.

Lemma 2.6.15. Let L : ’[x]6d → ’ and suppose that

— L 1  1

L ( )(k k2 − ) ∈ [ ] — p x x 1  0 for all p ’ x 6d 2. −

— λmin L  −ε.

Then the operator ¾˜ : ’[x]6d → ’ given by

def 1 ¾˜ p(x)  (L p(x) + ε ¾µ p(x)) 1 + ε is a valid pseudo-expectation satisfying {kxk2  1}.

¾˜ ¾˜ ¾˜ (k k2 − )2  Proof. It will suffice to check that λmin > 0 and that has x 2 1 0 and

¾˜ 1  1. For the first, let p ∈ ’[x]>2. We have

¾˜ p(x)2  1  ¾0 p(x)2 + ε ¾µ p(x)2   1   > (−ε + ε) > 0 . ¾µ p(x)2 1 + ε ¾µ p(x)2 1 + ε

Hence, λmin ¾˜ > 0.

79 It is straightforward to check the conditions that ¾˜ 1  1 and that ¾˜ satisfies {kxk2 − 1  0}, since ¾˜ is a convex combination of linear functionals that already satisfy these linear constraints. 

2.6.7 Putting Everything Together

We are ready to prove Theorem 2.6.3 and Theorem 2.6.4. The proof of Theo- rem 2.6.3 is somewhat simpler and contains many of the ideas of the proof of

Theorem 2.6.4, so we start there.

The Degree-4 Lower Bound

Proof of Theorem 2.6.3. We begin by constructing a degree-4 pseudo-expectation 0 ¾˜ : ’[x]64 → ’ whose degree-4 moments are biased towards A(x) but which {k k2 −  } does not yet satisfy x 2 1 0 .

Let L : ’[x]64 → ’ be the functional whose matrix representation when L | [ ] →  1 Í π restricted to 4 : ’ x 4 ’ is given by M 2 A , and which is 0 4 4 n π 4 L | |S | ∈S on polynomials of degree at most 3.

0 0 Let ¾˜ : ¾µ +ε L, where ε is a parameter to be chosen soon so that ¾˜ p(x)2 >

0 for all p ∈ ’[x]62. Let p ∈ ’[x]62. We expand p in the monomial basis as ( )  + Í + Í p x p0 i pi xi ij pij xi x j. Then

1 Õ ¾µ p(x)2 > p2 . n2 ij ij

π By our assumption on negative eigenvalues of A for all π ∈ S4, we know that L ( )2 λ2 Í 2 / 2 ¾˜ 0  ¾˜ µ + L/ 2 p x > −n2 ij pij. So if we choose ε 6 1 λ , the operator λ

80 0 will be a valid pseudo-expectation. Moreover ¾˜ is well correlated with A, since it was obtained by maximizing the amount of L, which is simply the ¾˜ 0 k k4 (maximally-symmetric) dual of A. However the calculation of x 2 shows that this pseudo-expectation is not on the unit sphere, though it is close. Let c refer to

0 1 1 Õ c : ¾˜ kxk4  ¾µ kxk4+ L kxk4  + h sym Aπi  +O(n 1 2) 2 2 2 2 1 2 2 Id , 1 − / . λ |S4|n λ π 4 ∈S

0 We would like to use Lemma 2.6.14 together with ¾˜ to obtain some L1 ’[ ] → ’ k k2 − L1 : x 64 with x 2 1 in its kernel and bounded λmin while still maintaining a high correlation with A. For this we need ξ1, ξ2, ξ20 so that

0 0 1 ¾˜ kxk2x − ¾˜ x i — c 2 i i 6 ξ1 for all . 0 0 1 ¾˜ kxk2x x − ¾˜ x x i j — c 2 i j i j 6 ξ2 for all , . 0 0 1 ¾˜ kxk2x2 − ¾˜ x2 i — c 2 i i 6 ξ20 for all .

0 Since ¾˜ p(x)  0 for all homogeneous odd-degree p, we may take ξ1  0. For

ξ2, we have that when i , j,

1 0 2 0 1 2 ¾˜ kxk xi x j − ¾˜ xi x j  L kxk xi x j 6 δ2 , c 2 cλ2 2 where we recall δ2 and δ02 defined in the theorem statement. Finally, for ξ20 , we have

1 0 2 2 0 2 1 2 2 1 µ 2 2 µ 2 c 1 ¾˜ kxk x − ¾˜ x 6 L kxk x + ¾ kxk x − ¾ x 6 δ0 + − . c 2 i i cλ2 2 i c 2 i i 2 cn

L1 ’[ ] → ’ k k2 − Thus, Lemma 2.6.14 yields : x 64 with x 2 1 in its kernel in the L1 ( )(k k2 − )  ∈ ’[ ]  sense that p x x 2 1 0 for all p x 62. If we take ξ2 δ2 and  + c 1 L1 − c 1 − 2 − 3 2( + c 1 )  − ( ) ξ20 δ02 cn− , then λmin > −c n δ2 n / δ02 cn− O 1 . Furthermore, L1 ( )  1 L ( )  Θ( 1 L ( )) A x cλ2 A x λ2 A x .

81 So by Lemma 2.6.15, there is a degree-4 pseudo-expectation ¾˜ satisfying {k k2  } x 2 1 so that

 1  ¾˜ A(x)  Θ L A(x) + Θ(¾µ A(x)) λ2 ! 1 Õ  Θ hA Aπi + Θ(¾µ A(x)) 2 2 , |S4|n λ π 4 ∈S  n2  > Ω + Θ(¾µ A(x)) .  λ2

The Degree-3 Lower Bound

Now we turn to the proof of Theorem 2.6.4.

Proof of Theorem 2.6.4. Let A be a 3-tensor. Let ε > 0 be a parameter to be chosen later. We begin with the following linear functional L : ’[x]63 → ’. For any monomial xα (where α is a multi-index of degree at most 3),

 ¾µ xα xα def  if deg 6 2 L xα  .  ε Í π α   3 2 π 3 Aα if deg x 3  n / ∈S The functional L contains our current best guess at the degree 1 and 2 moments of a pseudo-distribution whose degree-3 moments are ε-correlated with A(x).

The next step is to use symmetric Schur complement to extend L to a degree-4 pseudo-expectation. Note that M 3 decomposes as L | Õ  Π π M 3 A L | π 3 ∈S where, as a reminder, Aπ is the n2 × n flattening of Aπ and Π is the projector to

n2 the canonical embedding of ’[x]2 into ’ . So, using Lemma 2.6.12, we want to

82 find the symmetric Schur complements of the following family of matrices (with notation matching the statement of Lemma 2.6.12):

   1 MT VT  © 1 2 ª ­ L | L | ® ­ M M ε (ΠAπ)T ® . ­ 1 2 n3 2 ® ­ L | L | / ® ­ ε Π π • ®  V 2 3 2 A   n /  L | π 3 « ¬ ∈S π Since we have the same assumptions on A for all π ∈ S3, without loss of generality we analyze just the case that π is the identity permutation, in which case Aπ  A.

Since L matches the degree-one and degree-two moments of the uniform  distribution on the unit sphere, we have M 1 0, the n-dimensional zero vector, L |  1 ∈ ’n2 2 and M 2 n Idn n. Let w be the n -dimensional vector flattening of L | × T  · Idn n. We observe that ww σ Id is one of the permutations of Idn2 n2 . Taking × × B and C as follows,

1 0    © ª  ε B C w 3 2 A , ­ 1 ® n / 0 Idn n « n × ¬ we compute that 2 1 T 1 ε T CB− C  (σ · Id) + ΠAA Π . n2 n2 Symmetrizing the Id portion and the AAT portion of this matrix separately, we see that the symmetric Schur complement that we are looking for is the   ∈ 1 ( · ) + ε2 T spectrally-least M sym n2 σ Id n2 AA so that

t  ε2  M  3 Idsym + Π(AAT + σ · AAT + σ2 · AAT)Π + Π(AAT + σ · AAT + σ2 · AAT)TΠ n2 2 1 ε2  (σ · Id) + ΠAATΠ . n2 n2

Here we have used Corollary 2.6.7 and Corollary 2.6.6 to express a general ( ε2 Π TΠ) Π T · T 2 · T element of sym n2 AA in terms of , AA , σ AA , and σ AA .

83 Any spectrally small M satisfying the above suffices for us. Taking t  1, canceling some terms, and making the substitution 3 Idsym −σ · Id  2Π Id Π, we see that it is enough to have

ε2 ε2 −2 Π Id Π  Π(σ · AAT + σ2 · AAT)Π + Π(σ · AAT + σ2 · AAT)TΠ , 2 2 which by the premises of the theorem holds for ε  1/λ. Pushing through our symmetrized Schur complement rule with our decomposition of M 3 L | 0 (Lemma 2.6.12), this ε yields a valid degree-4 pseudo-expectation ¾˜ : ’[x]64 → 0 0 ’. From our choice of parameters, we see that ¾˜ |4, the degree-4 part of ¾˜ , is ¾˜ 0 |  n2+2n ¾µ + L L ’[ ] → ’ given by 4 n2 , where : x 4 is as defined in the theorem 0 statement. Furthermore, ¾˜ p(x)  ¾µ p(x) for p with deg p 6 2.

¾˜ 0 k k4 We would like to know how big x 2 is. We have 0  1  1 c : ¾˜ kxk4  1 + ¾µ kxk4 + L kxk4  1 + + L kxk4 . 2 n 2 2 n 2

We have assumed that hIdsym, AATi 6 O(λ2n2). Since Idsym is maximally h sym Í · Ti  h sym |S | Ti symmetric, we have Id , π 4 π AA Id , 4 AA and so ∈S 1 1 Õ L k k4  h sym i  Θ(h sym · Ti) ( ) x 2 Id , M 4 Id , π AA 6 O 1 . λ2n2 L | n2λ2 π 4 ∈S

h Í πi Finally, our assumptions on A, π 3 A yield ∈S 0 ε Õ  n3 2  ¾˜ A(x)  hA, Aπi > Ω / . 3 2 n / λ π 3 ∈S

We have established the following lemma.

Lemma 2.6.16. Under the assumptions of Theorem 2.6.4 there is a degree-4 pseudo- 0 expectation operator ¾˜ so that

 ¾˜ 0 k k4  + ( ) — c : x 2 1 O 1 .

84 0 3 2 — ¾˜ A(x) > Ω(n / /λ).

0 µ — ¾˜ p(x)  ¾ p(x) for all p ∈ ’[x]62.

¾˜ 0 |  ( + 1 ) ¾µ | + L — 4 1 n 4 . 

0 Now we would like feed ¾˜ into Lemma 2.6.14 to get a linear functional L1 ’[ ] → ’ k k2 − : x 64 with x 2 1 in its kernel (equivalently, which satisfies {k k4 −  } x 2 1 0 ), but in order to do that we need to find ξ1, ξ2, ξ20 so that

0 0 1 ¾˜ kxk2x − ¾˜ x i — c 2 i i 6 ξ1 for all . 0 0 1 ¾˜ kxk2x x − ¾˜ x x i j — c 2 i j i j 6 ξ2 for all , . 0 0 1 ¾˜ kxk2x2 − ¾˜ x2 i — c 2 i i 6 ξ20 for all .

0 0 For ξ1, we note that for every i, ¾˜ xi  0 since ¾˜ matches the uniform distribution 0 0 0 1 ¾˜ kxk2x − ¾˜ x  1 ¾˜ kxk2x on degree one and two polynomials. Thus, c 2 i i c 2 i .

0 We know that M 0 , the matrix representation of the degree-3 part of ¾˜ , is ¾˜ 3 | 1 ˜ 0 k k2 3 2 A. Expanding ¾ x xi with matrix representations, we get 3 n λ 2 |S | / 0 1 Õ 1 ¾˜ kxk2x  hId , A i 6 δ , c 2 i |S | 3 2 n n i 1 3 cn / λ × π 3 ∈S where δ1 is as defined in the theorem statement.

L Now for ξ2 and ξ20 . Let be the operator in the theorem statement. By the 0 definition of ¾˜ , we get

0  1   ¾˜ | 6 1 + ¾µ | + L . 4 n 4

In particular, for i , j, 0 1 ¾˜ 0 k k2 − ¾˜  1 L k k2 c x 2 xi x j xi x j c x 2 xi x j 6 δ2 .

85 For i  j,   0 µ 1 1 ¾˜ kxk2x2 − ¾˜ x2  1 L kxk2x2 + 1 + ¾µ kxk2x2 − c ¾µ x2 c 2 i i c 2 i n 2 i i   1 1  1 L kxk2x2 − 1 L kxk4 + 1 L kxk4 + 1 + ¾µ kxk4 − c ¾µ x2 c 2 i n 2 n 2 n n 2 i

 1 L k k2 2 − 1 L k k4 + 1 ¾˜ 0 k k4 − ¾µ 2 c x 2 xi n x 2 n x 2 c xi

 1 L k k2 2 − 1 L k k4 c x 2 xi n x 2

6 δ02 .

    ¾˜ 0 k k4  + ( ) Thus, we can take ξ1 δ1, ξ2 δ2, ξ20 δ02, and c x 2 1 O 1 , and apply Lemma 2.6.14 to conclude that

L1 − c 1 − ( ) − ( 3 2) − ( 2)  − ( ) λmin > −c O n ξ1 O n / ξ20 O n ξ2 O 1 .

The functional L1 loses a constant factor in the value assigned to A(x) as compared 0 to ¾˜ : 0 ¾˜ A(x)  n3 2  L1 A(x)  > Ω / . c λ

Now using Lemma 2.6.15, we can correct the negative eigenvalue of L1 to get a pseudo-expectation def ¾˜  Θ(1)L1 +Θ(1) ¾µ .

¾˜ {k k2  } By Lemma 2.6.15, the pseudo-expectation satisfies x 2 1 . Finally, to complete the proof, we have:

 n3 2  ¾˜ A(x)  Ω / + Θ(1) ¾µ A(x) .  λ

86 2.7 Higher-Order Tensors

We have heretofore restricted ourselves to the case k  3 in our algorithms for the sake of readability. In this section we state versions of our main results for general k and indicate how the proofs from the 3-tensor case may be generalized to handle arbitrary k. Our policy is to continue to treat k as constant with respect to n, hiding multiplicative losses in k in our asymptotic notation.

The case of general odd k may be reduced to k  3 by a standard trick, which we describe here for completeness. Given A an order-k tensor, consider the

β polynomial A(x) and make the variable substitution yβ  x for each multi-index

β with |β|  (k + 1)/2. This yields a degree-3 polynomial A0(x, y) to which the analysis in Section 2.3 and Section 2.4 applies almost unchanged, now using pseudo-distributions {x, y} satisfying {kxk2  1, kyk2  1}. In the analysis of tensor PCA, this change of variables should be conducted after the input is split into signal and noise parts, in order to preserve the analysis of the

k second half of the rounding argument (to get from ¾˜ hv0, xi to ¾˜ hv0, xi), which then requires only syntactic modifications to Lemma A.1.5. The only other non-syntactic difference is the need to generalize the λ-boundedness results for random polynomials to handle tensors whose dimensions are not all equal; this is already done in Theorem A.2.5.

For even k, the degree-k SoS approach does not improve on the tensor unfolding algorithms of Montanari and Richard [60]. Indeed, by performing a

β similar variable substitution, yβ  x for all |β|  k/2, the SoS algorithm reduces exactly to the eigenvalue/eigenvector computation from tensor unfolding. If we

β perform instead the substitution yβ  x for |β|  k/2 − 1, it becomes possible

87 to extract v0 directly from the degree-2 pseudo-moments of an (approximately) optimal degree-4 pseudo-distribution, rather than performing an extra step to k 2 ⊗ / recover v0 from v well-correlated with v0 . Either approach recovers v0 only up to sign, since the input is unchanged under the transformation v0 7→ −v0.

We now state analogues of all our results for general k. Except for the above noted differences from the k  3 case, the proofs are all easy transformations of the proofs of their degree-3 counterparts.

n k 4 1 4 Theorem 2.7.1. Let k be an odd integer, v0 ∈ ’ a unit vector, τ % n / log(n) / /ε, and A an order-k tensor with independent unit Gaussian entries.

1. There is an algorithm, based on semidefinite programming, which on input

k T(x)  τ · hv0, xi + A(x) returns a unit vector v with hv0, vi > 1 − ε with high probability over random choice of A.

2. There is an algorithm, based on semidefinite programming, which on input

k k k 4 1 4 T(x)  τ · hv0, xi + A(x) certifies that T(x) 6 τ · hv, xi + O(n / log(n) / ) for some unit v with high probability over random choice of A. This guarantees in particular that v is close to a maximum likelihood estimator for the problem of · k + recovering the signal v0 from the input τ v0⊗ A.

3. By solving the semidefinite relaxation approximately, both algorithms can be

1+1 k k implemented in time O˜ (m / ), where m  n is the input size.

2 For even k, the above all hold, except now we recover v with hv0, vi > 1 − ε, and the algorithms can be implemented in nearly-linear time.

The next theorem partially resolves a conjecture of Montanari and Richard regarding tensor unfolding algorithms for odd k. We are able to prove their

88 conjectured signal-to-noise ratio τ, but under an asymmetric noise model. They conjecture that the following holds when A is symmetric with unit Gaussian entries.

n k 4 Theorem 2.7.2. Let k be an odd integer, v0 ∈ ’ a unit vector, τ % n / /ε, and A an order-k tensor with independent unit Gaussian entries. There is a nearly-linear-time algorithm, based on tensor unfolding, which, with high probability over random choice of

2 A, recovers a vector v with hv, v0i > 1 − ε.

89 CHAPTER 3 SPECTRAL METHODS FOR THE RANDOM OVERCOMPLETE MODEL

3.1 Introduction

The sum-of-squares (SoS) method (also known as the Lasserre hierarchy) [71, 64, 61, 54] is a powerful, semidefinite-programming based meta-algorithm that applies to a wide-range of optimization problems. The method has been studied extensively for moderate-size polynomial optimization problems that arise for example in control theory and in the context of approximation algorithms for combinatorial optimization problems, especially constraint satisfaction and graph partitioning (see e.g. the survey [21]). For the latter, the SoS method captures and generalizes the best known approximation algorithms based on linear programming (LP), semidefinite programming (SDP), or spectral methods, and in many cases the SoS method is the most promising approach to obtain algorithms with better guarantees—especially in the context of Khot’s Unique Games Conjecture [14].

A sequence of recent works applies the sum-of-squares method to basic problems that arise in unsupervised machine learning: in particular, recovering sparse vectors in linear subspaces and decomposing tensors in a robust way [16, 17, 50, ?, 39]. For a wide range of parameters of these problems, SoS achieves significantly stronger guarantees than other methods, in polynomial or quasi-polynomial time.

Like other LP and SDP hierarchies, the sum-of-squares method comes with a degree parameter d ∈ Ž that allows for trading off running time and solution

90 quality. This trade-off is appealing because for applications the additional utility of better solutions could vastly outweigh additional computational costs. Unfortunately, the computational cost grows rather steeply in terms of the

O d parameter d: the running time is n ( ) where n is the number of variables (usually comparable to the instance size). Further, even when the SDP has size polynomial in the input (when d  O(1)), solving the underlying semidefinite programs is prohibitively slow for large instances.

In this work, we introduce spectral algorithms for planted sparse vector, tensor decomposition, and tensor principal components analysis (PCA) that exploit the same high-degree information as the corresponding sum-of-squares algorithms without relying on semidefinite programming, and achieve the same (or close to the same) guarantees. The resulting algorithms are quite simple (a couple of lines of matlab code) and have considerably faster running times—quasi-linear or close to linear in the input size.

A surprising implication of our work is that for some problems, spectral algorithms can exploit information from larger values of the parameter d without

O d spending time n ( ). For example, our algorithm for the planted sparse vector problem runs in nearly-linear time in the input size, even though it uses properties that the sum-of-squares method can only use for degree parameter d > 4. (In particular, the guarantees that the algorithm achieves are strictly stronger than the guarantees that SoS achieves for values of d < 4.)

The initial successes of SoS in the machine learning setting gave hope that techniques developed in the theory of approximation algorithms, specifically the techniques of hierarchies of convex relaxations and rounding convex relaxations, could broadly impact the practice of machine learning. This hope was dampened

91 by the fact that in general, algorithms that rely on solving large semidefinite programs are too slow to be practical for the large-scale problems that arise in machine learning. Our work brings this hope back into focus by demonstrating for the first time that with some care SoS algorithms can be made practical for large-scale problems.

In the following subsections we describe each of the problems that we consider, the prior best-known guarantee via the SoS hierarchy, and our results.

3.1.1 Planted Sparse Vector in Random Linear Subspace

The problem of finding a sparse vector planted in a random linear subspace was introduced by Spielman, Wang, and Wright as a way of learning sparse dictionaries [72]. Subsequent works have found further applications and begun studying the problem in its own right [?, 16, 65]. In this problem, we are given a basis for a d-dimensional linear subspace of ’n that is random except for one planted sparse direction, and the goal is to recover this sparse direction. The computational challenge is to solve this problem even when the planted vector is only mildly sparse (a constant fraction of non-zero coordinates) and the subspace

Ω 1 dimension is large compared to the ambient dimension (d > n ( )).

Several kinds of algorithms have been proposed for this problem based on linear programming (LP), basic semidefinite programming (SDP), sum-of-squares, and non-convex gradient descent (alternating directions method).

An inherent limitation of simpler convex methods (LP and basic SDP) [72, 25] is that they require the relative sparsity of the planted vector to be polynomial in

92 √ the subspace dimension (less than n/ d non-zero coordinates).

Sum-of-squares and non-convex methods do not share this limitation. They can recover planted vectors with constant relative sparsity even if the subspace

1 2 has polynomial dimension (up to dimension O(n / ) for sum-of-squares [16] and

1 4 up to O(n / ) for non-convex methods [65]).

We state the problem formally:

Problem 3.1.1. Planted sparse vector problem with ambient dimension n ∈ Ž, subspace dimension d 6 n, sparsity ε > 0, and accuracy η > 0. Given an arbitrary

n orthogonal basis of a subspace spanned by vectors v0, v1,..., vd 1 ∈ ’ , where v0 − is a vector with at most εn non-zero entries and v1,..., vd 1 are vectors sampled − independently at random from the standard Gaussian distribution on ’n, output

n 2 a unit vector v ∈ ’ that has correlation hv, v0i > 1 − η with the sparse vector v0.

Our results. Our algorithm runs in nearly linear time in the input size, and matches the best-known guarantees up to a polylogarithmic factor in the subspace dimension [16].

Theorem 3.1.2. Planted sparse vector in nearly-linear time. There exists an algorithm that, for every sparsity ε > 0, ambient dimension n, and subspace dimension d √ O 1 with d 6 n/(log n) ( ), solves the planted sparse vector problem with high probability

1 4 for some accuracy η 6 O(ε / ) + on (1). The running time of the algorithm is O˜ (nd). →∞

We give a technical overview of the proof in Section 4.2, and a full proof in Section 3.4.

93 Table 3.1: Comparison of algorithms for the planted sparse vector problem with ambient dimen- sion n, subspace dimension d, and relative sparsity ε.

Reference Technique Runtime Largest d Largest√ ε ( / ) Demanet, Hand [?] linear programming poly any√ Ω 1 d Barak, Kelner, Steurer [16] SoS, general SDP poly Ω( n) Ω(1) ˜ ( 2 5) ( 1 4) ( ) Qu, Sun, Wright [65] alternating minimization O n d Ω n√/ Ω 1 this work SoS, partial traces O˜ (nd) Ω˜ ( n) Ω(1)

Previous work also showed how to recover the planted sparse vector exactly. The task of going from an approximate solution to an exact one is a special case of standard compressed sensing (see e.g. [16]).

3.1.2 Overcomplete Tensor Decomposition

Tensors naturally represent multilinear relationships in data. Algorithms for tensor decompositions have long been studied as a tool for data analysis across a wide-range of disciplines (see the early work of Harshman [?] and the survey [53]).

While the problem is NP-hard in the worst-case [?, ?], algorithms for special cases of tensor decomposition have recently led to new provable algorithmic results for several unsupervised learning problems [10, 24, 42, 9] including independent component analysis, learning mixtures of Gaussians [37], Latent Dirichlet topic modeling [4] and dictionary learning [17]. Some earlier algorithms can also be reinterpreted in terms of tensor decomposition [?, ?, ?].

A key algorithmic challenge for tensor decompositions is overcompleteness, when the number of components is larger than their dimension (i.e., the compo- nents are linearly dependent). Most algorithms that work in this regime require tensors of order 4 or higher [55, 24]. For example, the FOOBI algorithm of [55] can recover up to Ω(d2) components given an order-4 tensor in dimension d under

94 mild algebraic independence assumptions for the components—satisfied with high probability by random components. For overcomplete 3-tensors, which arise in many applications of tensor decompositions, such a result remains elusive.

Researchers have therefore turned to investigate average-case versions of the problem, when the components of the overcomplete 3-tensor are random: Given a 3-tensor T ∈ ’d3 of the form

Õn T  ai ⊗ ai ⊗ ai , i1 where a1,..., an are random unit or Gaussian vectors, the goal is to approximately recover the components a1,..., an.

Algorithms based on tensor power iteration—a gradient-descent approach for tensor decomposition—solve this problem in polynomial time when n 6 C · d for any constant C > 1 (the running time is exponential in C)[12]. Tensor power iteration also admits local convergence analyses for up to n 6 Ω˜ (d1.5) components [12, 7]. Unfortunately, these analyses do not give polynomial-time algorithms because it is not known how to efficiently obtain the kind of initializations assumed by the analyses.

Recently, Ge and Ma [39] were able to show that a tensor-decomposition algorithm [17] based on sum-of-squares solves the above problem for n 6 Ω˜ (d1.5)

O log n in quasi-polynomial time n ( ). The key ingredient of their elegant analysis is a subtle spectral concentration bound for a particular degree-4 matrix-valued polynomial associated with the decomposition problem of random overcomplete 3-tensors.

We state the problem formally:

95 Table 3.2: Comparison of decomposition algorithms for overcomplete 3-tensors of rank n in dimension d.

Reference Technique Runtime Largest n Components Anandkumar et al. [12]a tensor power iteration poly C · d incoherent O log n Ω˜ ( 3 2) N( 1 ) Ge, Ma [39] SoS, general SDP n ( ) d / 0, d Idd b ˜ ( 1+ω) Ω˜ ( 4 3) N( 1 ) this work SoS, partial traces O nd d / 0, d Idd a The analysis shows that for every constant C > 1, the running time is polynomial for n 6 C · d components, assuming that the components also satisfy other random-like properties besides incoherence. b Here, ω 6 2.3729 is the constant so that d × d matrices can be multiplied in O(dω) arithmetic operations.

Problem 3.1.3. Random tensor decomposition with dimension d, rank n, and

d accuracy η. Let a1,..., an ∈ ’ be independently sampled vectors from the N( 1 ) ∈( ’d) 3  Ín 3 Gaussian distribution 0, d Idd , and let T ⊗ be the 3-tensor T i1 ai⊗ .

Single component: Given T sampled as above, find a unit vector b that

has correlation maxi hai , bi > 1 − η with one of the vectors ai.

All components: Given T sampled as above, find a set of unit vectors

{b1,..., bn } such that hai , bii > 1 − η for every i ∈ [n].

Our results. We give the first polynomial-time algorithm for decomposing random overcomplete 3-tensors with up to ω(d) components. Our algorithms

4 3 works as long as the number of components satisfies n 6 Ω˜ (d / ), which comes close to the bound Ω˜ (d1.5) achieved by the aforementioned quasi-polynomial algorithm of Ge and Ma. For the single-component version of the problem, our algorithm runs in time close to linear in the input size.

Theorem 3.1.4. Fast random tensor decomposition. There exist randomized

4 3 O 1 algorithms that, for every dimension d and rank n with d 6 n 6 d / /(log n) ( ), solve the random tensor decomposition problem with probability 1 − o(1) for some accuracy

3 4 1 2 η 6 O˜ (n /d ) / . The running time for the single-component version of the problem is

96 + O˜ (d1 ω), where dω 6 d2.3279 is the time to multiply two d-by-d matrices. The running + time for the all-components version of the problem is O˜ (n · d1 ω).

We give a technical overview of the proof in Section 4.2, and a full proof in

Section 3.5.

We remark that the above algorithm only requires access to the input tensor with some fixed inverse polynomial accuracy because each of its four steps amplifies errors by at most a polynomial factor (see Algorithm 3.5.9). In this sense, the algorithm is robust.

3.1.3 Tensor Principal Component Analysis

The problem of tensor principal component analysis is similar to the tensor decomposition problem. However, here the focus is not on the number of components in the tensor, but about recovery in the presence of a large amount

3 n of random noise. We are given as input a tensor τ · v⊗ + A, where v ∈ ’ is a unit vector and the entries of A are chosen iid from N(0, 1). This spiked tensor model was introduced by Montanari and Richard [67], who also obtained the first algorithms to solve the model with provable statistical guarantees. The spiked tensor model was subsequently addressed by a subset of the present authors [50], who applied the SoS approach to improve the signal-to-noise ratio required for recovery from odd-order tensors.

We state the problem formally:

Problem 3.1.5. Tensor principal components analysis with signal-to-noise ratio

d 3 3 τ and accuracy η. Let T ∈ (’ )⊗ be a tensor so that T  τ · v⊗ + A, where A is

97 Table 3.3: Comparison of principal component analysis algorithms for 3-tensors in dimension d and with signal-to-noise ratio τ.

Reference Technique Runtime Smallest τ

Montanari, Richard [67] spectral O˜ (d3) n

3 3 4 Hopkins, Shi, Steurer [50] SoS, spectral O˜ (d ) O(n / ) 3 3 4 this work SoS, partial traces O(d ) O˜ (n / ) a tensor with independent standard gaussian entries and v ∈ ’d is a unit vector.

d Given T, recover a unit vector v0 ∈ ’ such that hv0, vi > 1 − η.

Our results. For this problem, our improvements over the previous results are more modest—we achieve signal-to-noise guarantees matching [50], but with an algorithm that runs in linear time rather than near-linear time (time O(d3) rather than O(d3 polylog d), for an input of size d3).

Theorem 3.1.6. Tensor principal component analysis in linear time. There is an algorithm which solves the tensor principal component analysis problem with

3 4 1 1 2 accuracy η > 0 whenever the signal-to-noise ratio satisfies τ > O(n / · η− · log / n). Furthermore, the algorithm runs in time O(d3).

Though for tensor PCA our improvement over previous work is modest, we include the results here as this problem is a pedagogically poignant illustration of our techniques. We give a technical overview of the proof in Section 4.2, and a full proof in the full version.

3.1.4 Related Work

Foremost, this work builds upon the SoS algorithms of [16, 17, 39, 50]. In each of these previous works, a machine learning decision problem is solved using an

98 SDP relaxation for SoS. In these works, the SDP value is large in the yes case and small in the no case, and the SDP value can be bounded using the spectrum of a specific matrix. This was implicit in [16, 17], and in [50] it was used to obtain a fast algorithm as well. In our work, we design spectral algorithms which use smaller matrices, inspired by the SoS certificates in previous works, to solve these machine-learning problems much faster, with almost matching guarantees.

A key idea in our work is that given a large matrix with information encoded in its spectral gap, one can often efficiently “compress” the matrix to a much smaller one without losing that information. This is particularly true for problems with planted solutions. Thus, we are able to improve running time by replacing

O d k k an n ( )-sized SDP with an eigenvector computation for an n × n matrix, for some k < d.

The idea of speeding up LP and SDP hierarchies for specific problems has been investigated in a series of previous works [27, 20, 44], which shows that with respect to local analyses of the sum-of-squares algorithm it is sometimes possible

O d O d O 1 to improve the running time from n ( ) to 2 ( ) · n ( ). However, the scopes and strategies of these works are completely different from ours. First, the notion of local analysis from these works does not apply to the problems considered here. Second, these works employ the ellipsoid method with a separation oracle inspired by rounding algorithms, whereas we reduce the problem to ordinary eigenvector computation.

It would also be interesting to see if our methods can be used to speed up some of the other recent successful applications of SoS to machine-learning type problems, such as [?], or the application of [16] to tensor decomposition with components that are well-separated (rather than random). Finally, we

99 would be remiss not to mention that SoS lower bounds exist for several of these problems, specifically for tensor principal components analysis, tensor prediction, and sparse PCA [50, ?, ?]. The lower bounds in the SoS framework are a good indication that we cannot expect spectral algorithms achieving better guarantees.

3.2 Techniques

Sum-of-squares method (for polynomial optimization over the sphere). The problems we consider are connected to optimization problems of the following form: Given a homogeneous n-variate real polynomial f of constant degree, find a unit vector x ∈ ’n so as to maximize f (x). The sum-of-squares method allows us to efficiently compute upper bounds on the maximum value of such a polynomial f over the unit sphere.

For the case that k  deg( f ) is even, the most basic upper bound of this kind is the largest eigenvalue of a matrix representation of f .A matrix representation of a polynomial f is a symmetric matrix M with rows and columns indexed by monomials of degree k/2 so that f (x) can be written as the quadratic form

k 2 k 2 k 2 f (x)  hx⊗ / , Mx⊗ / i, where x⊗ / is the k/2-fold tensor power of x. The largest eigenvalue of a matrix representation M is an upper bound on the value of f (x) over all unit vectors x ∈ ’n because

( )  h k 2 k 2i ( ) · k k 2k2  ( ) f x x⊗ / , Mx⊗ / 6 λmax M x⊗ / 2 λmax M .

The sum-of-squares methods improves on this basic spectral bound systemat- ically by associating a large family of polynomials (potentially of degree higher than deg( f )) with the input polynomial f and computing the best possible spectral

100 bound within this family of polynomials. Concretely, the sum-of-squares method with degree parameter d applied to a polynomial f with deg( f ) 6 d considers { + ( − k k2)· 1 | (1) − } ⊆ ’[ ] the affine subspace of polynomials f 1 x 2 deg 6 d 2 x and minimizes λmax(M) among all matrix representations 1 M of polynomials in this space.2 The problem of searching through this affine linear space of polynomials and their matrix representations and finding the one of smallest maximum eigenvalue can be solved using semidefinite programming.

Our approach for faster algorithms based on SoS algorithms is to construct specific matrices (polynomials) in this affine linear space, then compute their top eigenvectors. By designing our matrices carefully, we ensure that our algorithms have access to the same higher degree information that the sum-of-squares algorithm can access, and this information affords an advantage over the basic spectral methods for these problems. At the same time, our algorithms avoid searching for the best polynomial and matrix representation, which gives us faster running times since we avoid semidefinite programming. This approach is well suited to average-case problems where the choice of input is not adversarial; in particular it is applicable to machine learning problems where noise and inputs are assumed to be random.

Compressing matrices with partial traces. A serious limitation of the above approach is that the representation of a degree-d, n-variate polynomial requires size roughly nd. Hence, even avoiding the use of semidefinite programming,

1Earlier we defined matrix representations only for homogeneous polynomials of even degree. In general, a matrix representation of a polynomial 1 is a symmetric matrix M with rows and   6` 6` columns indexed by monomials of degree at most ` deg 1 2 such that 1 x x⊗ , Mx ⊗ 6`  0 1 ` ( )/ + ( ) h i (as a polynomial identity), where x⊗ x⊗ , x⊗ ,..., x⊗ √` 1 is the vector of all monomials 6`  ( )/ of degree at most `. Note that x⊗ 1 for all x with x 1. 2The name of the method stemsk fromk the fact that thisk k last step is equivalent to finding the minimum number λ such that the space contains a polynomial of the form λ 12 + + 12 , − ( 1 ··· t ) where 11,..., 1t are polynomials of degree at most d 2. /

101 improving upon running time O(nd) requires additional ideas.

In each of the problems that we consider, we have a large matrix (suggested by a SoS algorithm) with a “signal” planted in some amount of “noise”. We show that in some situations, this large matrix can be compressed significantly without loss in the signal by applying partial trace operations. In these situations, the partial trace yields a smaller matrix with the same signal-to-noise ratio as the large matrix suggested by the SoS algorithm, even in situations when lower degree sum-of-squares approaches are known to fail (as for the planted sparse vector and tensor PCA problems).3

d2 d2 d d The partial trace Tr’d : ’ × → ’ × is the linear operator that satisfies d d Tr’d A ⊗ B  (Tr A)· B for all A, B ∈ ’ × . To see how the partial trace can be used to compress large matrices to smaller ones with little loss, consider the following

d2 d2 problem: Given a matrix M ∈ ’ × of the form M  τ ·(v ⊗ v)(v ⊗ v)> + A ⊗ B d d d for some unit vector v ∈ ’ and matrices A, B ∈ ’ × , we wish to recover the vector v. (This is a simplified version of the planted problems that we consider in this paper, where τ ·(v ⊗ v)(v ⊗ v)> is the signal and A ⊗ B plays the role of noise.)

It is straightforward to see that the matrix A ⊗ B has spectral norm kA ⊗ Bk  kAk · kBk, and so when τ  kAkkBk, the matrix M has a noticeable spectral gap, and the top eigenvector of M will be close to v ⊗ v. If | Tr A| ≈ kAk, the matrix

Tr’d M  τ · vv> + Tr(A)· B has a matching spectral gap, and we can still recover v, but now we only need to compute the top eigenvector of a d × d (as opposed to

3For both problems we use matrices with dimensions corresponding to degree-2 SoS programs. An argument of Spielman et al. ([72], Theorem 9) shows that degree-2 sum-of-squares can only find sparse vectors with sparsity k 6 O˜ √n , wherease we achieve sparsity as large as k  Θ n . For tensor PCA, the degree-2 SoS program( ) cannot even express the objective function. ( )

102 d2 × d2) matrix.4

If A is a Wigner matrix (e.g. a symmetric matrix with iid ±1 entries), then √ both Tr(A), kAk ≈ n, and the above condition is indeed met. In our average case/machine learning settings the “noise” component is not as simple as A ⊗ B with A a Wigner matrix. Nonetheless, we are able to ensure that the noise displays a similar behavior under partial trace operations. In some cases, this requires additional algorithmic steps, such as random projection in the case of tensor decomposition, or centering the matrix eigenvalue distribution in the case of the planted sparse vector.

It is an interesting question if there are general theorems describing the behavior of spectral norms under partial trace operations. In the current work, we compute the partial traces explicitly and estimate their norms directly. Indeed, our analyses boil down to concentrations bounds for special matrix polynomials. A general theory for the concentration of matrix polynomials is a notorious open problem (see [?]).

Partial trace operations have previously been applied for rounding SoS relaxations. Specifically, the operation of reweighing and conditioning, used in rounding algorithms for sum-of-squares such as [20, 66, 16, 17, ?], corresponds to applying a partial trace operation to the moments matrix returned by the sum-of-squares relaxation.

We now give a technical overview of our algorithmic approach for each problem, and some broad strokes of the analysis for each case. Our most

4In some of our applications, the matrix M is only represented implicitly and has size super- linear in the size of the input, but nevertheless we can compute the top eigenvector of the partial trace Tr’d M in nearly-linear time.

103 substantial improvements in runtime are for the planted sparse vector and overcomplete tensor decomposition problems (Section 3.2.1 and Section 3.2.2 respectively). Our algorithm for tensor PCA is the simplest application of our techniques, and it may be instructive to skip ahead and read about tensor PCA first (Section 3.2.3).

3.2.1 Planted Sparse Vector in Random Linear Subspace

Recall that in this problem we are given a linear subspace U (represented by

d some basis) that is spanned by a k-sparse unit vector v0 ∈ ’ and random unit

d vectors v1,..., vd 1 ∈ ’ . The goal is to recover the vector v0 approximately. −

n d Background and SoS analysis. Let A ∈ ’ × be a matrix whose columns form an orthonormal basis for U. Our starting point is the polynomial f (x)  √ k k4  Ín ( )4  Ax 4 i1 Ax i . Previous work showed that for d n the maximizer of this polynomial over the sphere corresponds to a vector close to v0 and that degree-4 sum-of-squares is able to capture this fact [14, 16]. Indeed, typical ’n k k4 ≈ / random vectors v in satisfy v 4 1 n whereas our planted vector satisfies k k4 /  / v0 4 > 1 k 1 n, and this degree-4 information is leveraged by the SoS algorithms.

 Ín ( ) 2 The polynomial f has a convenient matrix representation M i1 ai a>i ⊗ , where a1,..., an are the rows of the A. It turns out that the eigenvalues of this matrix indeed give information about the planted sparse

d vector v0. In particular, the vector x0 ∈ ’ with Ax0  v0 witnesses that M has / 2 an eigenvalue of at least 1 k because M’s quadratic form with the vector x0⊗ h 2 2i  k k4 / satisfies x0⊗ , Mx0⊗ v0 4 > 1 k. If we let M0 be the corresponding matrix

104 for the subspace U without the planted sparse vector, M0 turns out to have only eigenvalues of at most O(1/n) up to a single spurious eigenvalue with eigenvector far from any vector of the form x ⊗ x [14].

It follows that in order to distinguish between a random subspace with a planted sparse vector (yes case) and a completely random subspace (no case), it is enough to compute the second-largest eigenvalue of a d2-by-d2 matrix

(representing the 4-norm polynomial over the subspace as above). This decision version of the problem, while strictly speaking easier than the search version above, is at the heart of the matter: one can show that the large eigenvalue for the yes case corresponds to an eigenvector which encodes the coefficients of the sparse planted vector in the basis.

Improvements. The best running time we can hope for with this basic √ approach is O(d4) (the size of the matrix). Since we are interested in d 6 O( n), the resulting running time O(nd2) would be subquadratic but still super-linear in the input size n · d (for representing a d-dimensional subspace of ’n). To speed things up, we use the partial trace approach outlined above. We will first apply the approach naively (obtaining a reasonable bound), and then show that a small modification to the matrix before the partial trace allows us to achieve even smaller signal-to-noise ratios.

≈ 1 ( ) 2 + In the planted case, we may approximate M k x0x0> ⊗ Z, where x0 is the vector of coefficients of v0 in the basis representation given by A (so that Ax0  v0), and Z is the noise matrix. Since kx0k  1, the partial trace operation preserves ( ) 2 ( ) 2  the projector x0x0> ⊗ in the sense that Tr’d x0x0> ⊗ x0x0>. Hence, with our heuristic approximation for M above, we could show that the top eigenvector of

Tr’d M is close to x0 by showing that the spectral norm bound kTr’d Zk 6 o(1/k).

105  Ín ( ) ⊗ ( ) The partial trace of our matrix M i1 ai a>i ai a>i is easy to compute directly, n   Õk k2 · N Tr’d M ai 2 ai a>i . i1 In the yes case (random subspace with planted sparse vector), a direct computation shows that

h i ≈ d · + n k k4 d + n  λyes > x0, Nx0 n 1 d v0 4 > n 1 dk .

Hence, a natural approach to distinguish between the yes case and no case (completely random subspace) is to upper bound the spectral norm of N in the no case.

In order to simplify the bound on the spectral norm of N in the no case, suppose that the columns of A are iid samples from the Gaussian distribution N( 1 ) 0, d Id (rather than an orthogonal basis for the random subspace)–Lemma 3.4.6 establishes that this simplification is legitimate. In this simplified setup, the {k k2 · } matrix N in the no case is the sum of n iid matrices ai ai a>i , and we can p upper bound its spectral norm λno by d/n ·(1 + O( d/n)) using standard matrix concentration bounds. Hence, using the spectral norm of N, we will be able to distinguish between the yes case and the no case as long as

p d/n  n/(dk) ⇒ λno  λyes .

2 1 3 For linear sparsity k  ε · n, this inequality is true so long as d  (n/ε ) / , which √ is somewhat worse than the bound n bound on the dimension that we are aiming for.

 Í ( ) Recall that Tr B i λi B for a symmetric matrix B. As discussed above, the partial trace approach works best when the noise behaves as the tensor of two

Wigner matrices, in that there are cancellations when summing the eigenvalues

106 ( ) ⊗ ( ) of the noise. In our case, the noise terms ai a>i ai a>i do not have this property,  k k2 ≈ / as in fact Tr ai a>i ai d n. Thus, in order to improve the dimension bound, we will center the eigenvalue distribution of the noise part of the matrix. This will cause it to behave more like a Wigner matrix, in that the spectral norm of the noise will not increase after a partial trace. Consider the partial trace of a matrix of the form − · ⊗ Õ M α Id ai a>i , i for some constant α > 0. The partial trace of this matrix is

n  Õ(k k2 − )· N0 ai 2 α ai a>i . i1

We choose the constant α ≈ d/n such that our matrix N0 has expectation 0 in the no case, when the subspace is completely random. In the yes case, the

Rayleigh quotient of N0 at x0 simply shifts as compared to N, and we have h i ≈ k k4 / λyes > x0, N0x0 v0 4 > 1 k (see Lemma 3.4.5 and sublemmas). On the other hand, in the no case, this centering operation causes significant cancellations in the eigenvalues of the partial trace matrix (instead of just shifting the eigenvalues). √ 3 2 In the no case, N0 has spectral norm λno 6 O(d/n / ) for d  n (using standard matrix concentration bounds; again see Lemma 3.4.5 and sublemmas). Therefore, the spectral norm of the matrix N0 allows us to distinguish between the yes and √ 3 2 no case as long as d/n /  1/k, which is satisfied as long as k  n and d  n. We give the full formal argument in Section 3.4.

3.2.2 Overcomplete Tensor Decomposition

 Ín 3 ∈ ’d3 Recall that in this problem we are given a 3-tensor T of the form T i1 ai⊗ , ∈ ’d N( 1 ) where a1,..., an are independent random vectors from 0, d Id . The goal

107 is to find a unit vector a ∈ ’d that is highly correlated with one5 of the vectors a1,..., an.

Background. The starting point of our algorithm is the polynomial f (x)  Ín h i3  1.5 i1 ai , x . It turns out that for n d the (approximate) maximizers of this polynomial are close to the components a1,..., an, in the sense that f (x) ≈ 1 if and h i2 ≈ only if maxi n ai , x 1. Indeed, Ge and Ma [39] show that the sum-of-squares ∈[ ] method already captures this fact at degree 12, which implies a quasipolynomial time algorithm for this tensor decomposition problem via a general rounding result of Barak, Kelner, and Steurer [17].

The simplest approach to this problem is to consider the tensor representation  Í 3 of the polynomial T i n ai⊗ , and flatten it, hoping the singular vectors of the ∈[ ] flattening are correlated with the ai. However, this approach is doomed to failure for two reasons: firstly, the simple flattenings of T are d2 × d matrices, and since  2 n d the ai⊗ collide in the column space, so that it is impossible to determine { 2} Span ai⊗ . Secondly, even for n 6 d, because the ai are random vectors, their norms concentrate very closely about 1. This makes it difficult to distinguish any one particular ai even when the span is computable.

Improvements. We will try to circumvent both of these issues by going to Í 4 higher dimensions. Suppose, for example, that we had access to i n ai⊗ .6 ∈[ ] { 2} The eigenvectors of the flattenings of this matrix are all within Spani n ai⊗ , ∈[ ] addressing our first issue, leaving us only with the trouble of extracting individual 2 Í 6 ai⊗ from their span. If furthermore we had access to i n ai⊗ , we could perform ∈[ ]

5We can then approximately recover all the components a1,..., an by running independent trials of our randomized algorithm repeatedly on the same input. 6As the problem is defined, we assume that we do not have access to this input, and in many machine learning applications this is a valid assumption, as gathering the data necessary to generate the 4th order input tensor requires a prohibitively large number of samples.

108 (Φ ⊗ ⊗ ) Í 6 Φ ∈ ’d d a partial random projection Id Id i n ai⊗ where × is a matrix ∈[ ] with independent Gaussian entires, and then taking a partial trace, we end up with  Õ  Õ (Φ ⊗ ⊗ ) 6  hΦ 2i 4 Tr’d Id Id ai⊗ , ai⊗ ai⊗ . i n i n ∈[ ] ∈[ ] With reasonable probability (for exposition’s sake, say with probability 1/n10), Φ 2 hΦ 2i hΦ 2i is closer to some ai⊗ than to all of the others so that , ai⊗ > 100 , a⊗j for ∈ [ ] 2 all j n , and then ai⊗ is distinguishable from the other vectors in the span of our matrix, taking care of the second issue . As we show, a much smaller gap is sufficient to distinguish the top ai from the other a j, and so the higher-probability event that Φ is only slightly closer to ai suffices (allowing us to recover all vectors at an additional runtime cost of a factor of O˜ (n)). This discussion ignores the presence of a single spurious large eigenvector, which we address in the technical sections.

Í 6 Of course, we do not have access to the higher-order tensor i n ai⊗ . Instead, ∈[ ] we can obtain a noisy version of this tensor. Our approach considers the following matrix representation of the polynomial f 2,

Õ 3 3  ⊗ ( ) ⊗ ( ) ∈ ’d d M ai a>j ai a>i a j a>j × . i,j

Alternatively, we can view this matrix as a particular flattening of the Kronecker-

2 squared tensor T⊗ . It is instructive to decompose M  Mdiag + Mcross into  Í ( ) 3  Í ⊗ its diagonal terms Mdiag i ai a>i ⊗ and its cross terms Mcross i,j ai a>j ( ) ⊗ ( ) ai a>i a j a>j . The algorithm described above is already successful for Mdiag; we need only control the eigenvalues of the partial trace of the “noise” component,

Mcross. The main technical work will be to show that k Tr’d Mdiagk is small. In fact, we will have to choose Φ from a somewhat different distribution—observing ( ⊗ ⊗ )  Í h i · ( ⊗ )( ⊗ ) that Tr’d Φ Id Id i,j ai , Φa j ai a j ai a j >, we will sample Φ so that

109 hai , Φaii  hai , Φa ji. We give a more detailed overview of this algorithm in the full version, explaining in more detail our choice of Φ and justifying heuristically the boundedness of the spectral norm of the noise.

Connection to SoS analysis. To explain how the above algorithm is a speedup of SoS, we give an overview of the SoS algorithm of [39, 17]. There, the degree-t

SoS SDP program is used to obtain an order-t tensor χt (or a pseudodistribution). Í t Informally speaking, we can understand χt as a proxy for i n ai⊗ , so that ∈[ ]  Í t + χt i n ai⊗ N, where N is a noise tensor. While the precise form of N ∈[ ] is unclear, we know that N must obey a set of constraints imposed by the SoS hierarchy at degree t. For a formal discussion of pseudodistributions, see [17].

Í t To extract a single component ai from the tensor i n ai⊗ , there are many ∈[ ] algorithms which would work (for example, the algorithm we described for Mdiag above). However, any algorithm extracting an ai from χt must be robust to the noise tensor N. For this it turns out the following algorithm will do: suppose we Í t  ( ) 1 1 have the tensor i n ai⊗ , taking t O log n . Sample 1,..., log n 2 random ∈[ ] ( )−  Í (Î h1 i) · unit vectors, and compute the matrix M i 16 j6log n 2 j , ai ai a>i . If we ( )− are lucky enough, there is some ai so that every 1j is a bit closer to ai than any  + k k  k k other ai0, and M ai a>i E for some E 1. The proof that E is small can be made so simple that it applies also to the SDP-produced proxy tensor χlog n, and so this algorithm is robust to the noise N. This last step is very general and can handle tensors whose components ai are less restricted than the random vectors we consider, and also more overcomplete, handling tensors of rank up to n  Ω˜ (d1.5).7

Our subquadratic-time algorithm can be viewed as a low-degree, spectral

7It is an interesting open question whether taking t  O log n is really necessary, or whether this heavy computational requirement is simply an artifact( of the) SoS proof.

110 analogue of the [17] SoS algorithm. However, rather than relying on an SDP to Í t produce an object close to i n ai⊗ , we manufacture one ourselves by taking ∈[ ] the Kronecker square of our input tensor. We explicitly know the form of the

2 Í 6 deviation of T⊗ from i n ai⊗ , unlike in [17], where the deviation of the SDP ∈[ ] Í t certificate χt from i n ai⊗ is poorly understood. We are thus able to control ∈[ ] this deviation (or “noise”) in a less computationally intensive way, by cleverly designing a partial trace operation which decreases the spectral norm of the deviation. Since the tensor handled by the algorithm is much smaller—order 6 rather than order log n—this provides the desired speedup.

3.2.3 Tensor Principal Component Analysis

3 d Recall that in this problem we are given a tensor T  τ · v⊗ + A, where v ∈ ’ is a unit vector, A has iid entries from N(0, 1), and τ > 0 is the signal-to-noise ratio. The aim is to recover v approximately.

Background and SoS analysis. A previous application of SoS techniques to this problem discussed several SoS or spectral algorithms, including one that runs in quasi-linear time [50]. Here we apply the partial trace method to a subquadratic spectral SoS algorithm discussed in [50] to achieve nearly the same signal-to-noise guarantee in only linear time.

3 3 Our starting point is the polynomial T(x)  τ · hv, xi + hx⊗ , Ai. The √ maximizer of T(x) over the sphere is close to the vector v so long as τ  n [67]. In [50], it was shown that degree-4 SoS maximizing this polynomial can recover

3 4 v with a signal-to-noise ratio of at least Ω˜ (n / ), since there exists a suitable SoS

3 bound on the noise term hx⊗ , Ai.

111 Specifically, let Ai be the ith slice of A, so that hx, Ai xi is the quadratic form Í ( ) | ( )− ·h i3| j,k Aijk x j xk. Then there is a SoS proof that T x is bounded by T x τ v, x 6 ( )1 2 · k k ( ) ( )  Í h i2 f x / x , where f x is the degree-4 polynomial f x i x, Ai x . The ( )  h 2 (Í ⊗ polynomial f has a convenient matrix representation: f x x⊗ , i Ai 2 Ai)x⊗ i: since this matrix is a sum of iid random matrices Ai ⊗ Ai, a matrix Chernoff bound shows that this matrix spectrally concentrates to its expectation. Í ⊗ So with high probability one can show that the eigenvalues of i Ai Ai are at 3 2 1 2 most ≈ d / log(d) / (except for a single spurious eigenvector), and it follows

3 4 1 4 that degree-4 SoS solves tensor PCA so long as τ  d / log(d) / .

This leads the authors to consider a slight modification of f (x), given by ( )  Í h i2 1 x i x, Ti x , where Ti is the ith slice of T. Like T, the function 1 also contains information about v, and the SoS bound on the noise term in T carries over as an analogous bound on the noise in 1. In particular, expanding Ti ⊗ Ti and ignoring some negligible cross-terms yields

Õ 2 Õ Ti ⊗ Ti ≈ τ ·(v ⊗ v)(v ⊗ v)> + Ai ⊗ Ai . i i

Using v ⊗ v as a test vector, the quadratic form of the latter matrix can be made at

2 3 2 1 2 least τ − O(d / log(d) / ). Together with the boundedness of the eigenvalues of Í ⊗  3 4 ( )1 4 i Ai Ai this shows that when τ d / log d / there is a spectral algorithm Í ⊗ 2 × 2 to recover v. Since the matrix i Ti Ti is d d , computing the top eigenvector requires O˜ (d4 log n) time, and by comparison to the input size d3 the algorithm runs in subquadratic time.

Improvements. In this work we speed this up to a linear time algorithm via the partial trace approach. As we have seen, the heart of the matter is to show 2 ·( ⊗ )( ⊗ ) + Í ⊗ that taking the partial trace of τ v v v v > i Ai Ai does not increase

112 the spectral noise. That is, we require that

Õ ⊗  Õ ( )· ( 3 2 ( )1 2) Tr’d Ai Ai Tr Ai Ai 6 O d / log d / . i i

The Ai have iid Guassian entries, and so as in the case of Wigner matrices, it is roughly true that | Tr(Ai)| ≈ kAi k. Thus the situation is very similar to our toy example of the application of partial traces in Section 4.2.

Í ⊗ Í ( )· Heuristically, because i n Ai Ai and i n Tr Ai Ai are random matrices, ∈[ ] ∈[ ] we expect that their eigenvalues are all of roughly the same magnitude. This means that their spectral norm should be close to their Frobenius norm divided by the square root of the dimension, since for a matrix M with eigenvalues q k k  Í 2 λ1, . . . , λn, M F i n λi . By estimating the sum of the squared entries, we ∈[ ] Í ( )· Í ⊗ expect that the Frobenious norm of i Tr Ai Ai is less than that of i Ai Ai √ by a factor of d after the partial trace, while the dimension decreases by a factor of d, and so assuming that the eigenvalues are all of the same order, a typical eigenvalue should remain unchanged. We formalize these heuristic calculations using standard matrix concentration arguments in the full version.

3.3 Preliminaries

Linear algebra. We will work in the real vector spaces given by ’n. A vector of indeterminates may be denoted x  (x1,..., xn), although we may sometimes switch to parenthetical notation for indexing, i.e. x  (x(1),..., x(n)) when subscripts are already in use. We denote by [n] the set of all valid indices for a

n vector in ’ . Let ei be the ith canonical basis vector so that ei(i)  1 and ei(j)  0 for j , i.

113 For a vectors space V, we may denote by L(V) the space of linear operators from V to V. The space orthogonal to a vector v is denoted v⊥.

1 For a matrix M, we use M− to denote its inverse or its Moore-Penrose pseudoinverse; which one it is will be clear from context. For M PSD, we write

1 2 1 2 2 1 M− / for the unique PSD matrix with (M− / )  M− .

Norms and inner products. We denote the usual entrywise inner product by h· ·i h i  Í ∈ n ∈ n , , so that u, v i n ui vi for u, v ’ . The `p-norm of a vector v ’ is ∈[ ] k k  (Í p)1 p k k given by v p i n vi / , with v denoting the `2-norm by default. The ∈[ ] matrix norm used throughout the paper will be the operator / spectral norm, k k  k k  k k/k k denoted by M M op : maxx,0 Mx x .

n n n Tensor manipulation. Boldface variables will reserved for tensors T ∈ ’ × × , of which we consider only order-3 tensors. We denote by T(x, y, z) the multilinear ∈ n ( )  Í function in x, y, z ’ such that T x, y, z i,j,k n Ti,j,k xi yj zk, applying x, y, ∈[ ] and z to the first, second, and third modes of the tensor T respectively. If the arguments are matrices P, Q, and R instead, this lifts T(P, Q, R) to the unique multilinear tensor-valued function such that [T(P, Q, R)](x, y, z)  T(Px, Q y, Rz) for all vectors x, y, z.

Tensors may be flattened to matrices in the multilinear way such that for

n n n n2 n every u ∈ ’ × and v ∈ ’ , the tensor u ⊗ v flattens to the matrix uv> ∈ ’ × with u considered as a vector. There are 3 different ways to flatten a 3-tensor T, corresponding to the 3 modes of T. Flattening may be understood as reinterpreting the indices of a tensor when the tensor is expressed as an 3-dimensional array of

3 numbers. The expression v⊗ refers to v ⊗ v ⊗ v for a vector v.

114 Probability and asymptotic bounds. We will often refer to collections of in- dependent and identically distributed (or iid) random variables. The Gaussian distribution with mean µ and variance σ2 is denoted N(µ, σ2). Sometimes we state that an event happens with overwhelming probability. This means that

ω 1 its probability is at least 1 − n− ( ). A function is O˜ (1(n)) if it is O(1(n)) up to polylogarithmic factors.

3.4 Planted Sparse Vector in Random Linear Subspace

In this section we give a nearly-linear-time algorithm to recover a sparse vector planted in a random subspace.

∈ ’n k k4 1 Problem 3.4.1. Let v0 be a unit vector such that v0 4 > εn . Let n 1 v1,..., vd 1 ∈ ’ be iid from N(0, Idn). Let w0,..., wd 1 be an orthogo- − n − nal basis for Span{v0,..., vd 1}. Given: w0,..., wd 1 Find: a vector v such − − 2 that hv, v0i > 1 − o(1).

115 Sparse Vector Recovery in Nearly-Linear Time

Algorithm 3.4.2. Input: w0,..., wd 1 − as in Problem 3.4.1. Goal: Find v

2 with hvˆ, v0i > 1 − o(1).

• Compute leverage scores 2 2 ka1k ,..., kan k , where ai is the ith row of the n × d matrix    S : w0 ··· wd 1 . − • Compute the top eigenvector u of the matrix

def Õ (k k2 − d )· A ai 2 n ai a>i . i n ∈[ ]

• Output Su.

Remark 3.4.3. Implementation of Algorithm 3.4.2 in nearly-linear time. The

2 2 leverage scores ka1k ,..., kan k are clearly computable in time O(nd). In the course of proving correctness of the algorithm we will show that A has constant spectral gap, so by a standard analysis O(log d) matrix-vector multiplies suffice to recover its top eigenvector. A single matrix-vector multiply Ax requires  (k k2 − d )h i ( ) computing ci : ai n ai , x for each i (in time O nd ) and summing Í ( ) i n ci xi (in time O nd ). Finally, computing Su requires summing d vectors of ∈[ ] dimension n, again taking time O(nd).

The following establishes our algorithm’s correctness.

∈ ’n k k4 / Theorem 3.4.4. Let v0 be a unit vector with v0 4 > 1 εn. Let

116 ∈ ’n N( 1 ) v1,..., vd 1 be iid from 0, n Idn . Let w0,..., wd 1 be an orthogonal basis for − −   { } ×  Span v0,..., vd 1 . Let ai be the i-th row of the n d matrix S : w0 ··· wd 1 . − −

1 2 When d 6 n / /polylog(n), for any sparsity ε > 0, w.ov.p. the top eigenvector u of Ín (k k2 − d )· h i2 − ( 1 4) − ( ) i1 ai n ai a>i has Su, v0 > 1 O ε / o 1 .

We have little control over the basis vectors the algorithm is given. However, there is a particularly nice (albeit non-orthogonal) basis for the subspace which exposes the underlying randomness. Suppose that we are given the basis vectors v0,..., vd, where v0 is the sparse vector normalized so that kv0k  1, and 1 v1,..., vd 1 are iid samples from N(0, Idn). The following lemma shows that if − n the algorithm had been handed this good representation of the basis rather than an arbitrary orthogonal one, its output would be the correlated to the vector of coefficients giving of the planted sparse vector (in this case the standard basis vector e1).

n n Lemma 3.4.5. Let v0 ∈ ’ be a unit vector. Let v1,..., vd 1 ∈ ’ be iid from  −  N( 1 ) ×  ··· 0, n Id . Let ai be the ith row of the n d matrix S : v0 vd 1 . Then there − 1 2 is a universal constant ε∗ > 0 so that for any ε 6 ε∗, so long as d 6 n / /polylog(n), w.ov.p. n Õ(k k2 − d )·  k k4 · + ai n ai a>i v0 4 e1e1> M , i1 k k (k k3 · 1 4 + k k2 · 1 2 + where e1 is the first standard basis vector and M 6 O v0 4 n− / v0 4 n− / 3 4 1 kv0k4 · n− / + n− ).

The second ingredient we need is that the algorithm is robust to exchanging this good basis for an arbitrary orthogonal basis.

n 4 1 n Lemma 3.4.6. Let v0 ∈ ’ have kv0k > . Let v1,..., vd 1 ∈ ’ be iid from 4 εn − 1 N(0, Idn). Let w0,..., wd 1 be an orthogonal basis for Span{v0,..., vd 1}. Let ai n − −

117   ×  ··· be the ith row of the n d matrix S : v0 vd 1 . Let a0i be the ith row of   − ×  ···  Í ∈ ’d d the n d matrix S0 : w0 wd 1 . Let A : i ai a>i . Let Q × be the − 1 2 1 2 so that SA− /  S0Q, which exists since SA− / is orthogonal, and  1 2 1 2/ ( ) which has the effect that a0i QA− / ai. Then when d 6 n / polylog n , w.ov.p. ! n n Õ(k k2 − d )· − Õ(k k2 − d )· a0i n a0i a0>i Q ai n ai a>i Q> i1 i1  1  6 O + o(kvk4) . n 4

Last, we will need the following fact, which follows from standard concentra- tion.

n d 1 Lemma 3.4.7. Let v ∈ ’ be a unit vector. Let b1,..., bn ∈ ’ − be iid 1 d from N(0, Idd 1). Let ai ∈ ’ be given by ai : (v(i) bi). Then w.ov.p. n − k Ín − k ˜ ( / )1 2  ( ) i1 ai a>i Idd 6 O d n / . In particular, when d o n , this implies that w.ov.p. k(Ín ) 1 − k ˜ ( / )1 2 k(Ín ) 1 2 − k ˜ ( / )1 2 i1 ai a>i − Idd 6 O d n / and i1 ai a>i − / Idd 6 O d n / .

We reserve the proofs of Lemma 3.4.6 and Lemma 3.4.7 for the full version. We are ready to prove Theorem 3.4.4.

   of Theorem 3.4.4. Let b1,..., bn be the rows of the matrix S0 : v0 ··· vd 1 . −  Í 1 2 Let B i bi b>i . Note that S0B− / has columns which are an orthogonal basis d d 1 2 for Span{w0,..., wd 1}. Let Q ∈ ’ × be the rotation so that S0B− /  SQ. −

 Ín (k k2 − By Lemma 3.4.5 and Lemma 3.4.6, we can write the matrix A i1 ai 2 d )· n ai a>i as  k k4 · + A v0 4 Qe1e1>Q> M , where w.ov.p.

k k (k k3 · 1 4) + (k k2 · 1 2) M 6 O v0 4 n− / O v0 4 n− /

118 + (k k · 3 4) + ( 1) + (k k4) O v0 4 n− / O n− o v 4 .

k k4 ( ) 1 We have assumed that v0 4 > εn − , and so since A is an almost-rank-one 2 1 4 matrix (Lemma A.3.3), the top eigenvector u of A has hu, Qe1i > 1 − O(ε / ), so

2 1 4 that hSu, SQe1i > 1 − O(ε / ) by column-orthogonality of S.

1 2 1 2 At the same time, SQe1  S0B− / e1, and by Lemma 3.4.7, kB− / − Id k 6

1 2 2 2 O˜ (d/n) / w.ov.p., so that hSu, S0e1i > hSu, SQe1i − o(1). Finally, S0e1  v0 by

2 1 4 definition, so hSu, v0i > 1 − O(ε / ) − o(1). 

3.4.1 Algorithm Succeeds on Good Basis

We now prove Lemma 3.4.5. We decompose the matrix in question into a k k4 Í(k k2 − contribution from v0 4 and the rest: explicitly, the decomposition is ai 2 d )·  Í ( )2 · + Í(k k2 − d · ) n ai a>i v i ai a>i bi 2 n ai a>i . This first lemma handles the k k4 contribution from v0 4.

n d 1 Lemma 3.4.8. Let v ∈ ’ be a unit vector. Let b1,..., bn ∈ ’ − be random vectors iid

1 d 1 2 from N(0, · Idd 1). Let ai  (v(i) bi) ∈ ’ . Suppose d 6 n / /polylog(n). Then n − n Õ ( )2 ·  k k4 · + v i ai a>i v 4 e1e1> M0 , i1

k k (k k3 1 4 + k k2 1 2) where M0 6 O v 4 n− / v 4 n− / w.ov.p.. of Lemma 3.4.8. We first show an operator-norm bound on the principal submatrix Ín ( )2 · i1 v i bi b>i using a truncated matrix Bernstein inequality Proposition A.3.7. First, the expected operator norm of each summand is bounded:     2 2 2 d 2 d ¾ v(i) kbi k 6 (max v(j) )· O 6 kvk · O . 2 j n 4 n

119 The operator norms are bounded by constant-degree polynomials in Gaussian variables, so Lemma A.3.8 applies to truncate their tails in preparation for application of a Bernstein bound. We just have to calculate the variance of the sum, which is at most

n   Õ 4 2 4 d ¾ v(i) kbi k · bi b>  kvk · O . 2 i 4 n2 i1

2 ¾ Ín ( )2 · v · The expectation i1 v i bi b>i is k nk Id. Applying a matrix Bernstein bound (Proposition A.3.7) to the deviation from expectation, we get that w.ov.p.,

n ! Õ 1  d  v(i)2 · b b − · Id 6 kvk2 · O˜ i >i n 4 n i1 (k k2 1 2) 6 O v 4 n− /

1 2 for appropriate choice of d 6 n− / /polylog(n). Hence, by triangle inequality, k Ín ( )2 · k k k2 1 2 i1 v i bi b>i 6 v 4 n− / w.ov.p..

Using a Cauchy-Schwarz-style inequality (Lemma A.3.1) we now show that the bound on this principal submatrix is essentially enough to obtain the lemma.

d Let pi , qi ∈ ’ be given by

0 def ( )· ( ) ···  def ( )· © ª p>i v0 i v0 i 0 0 qi v0 i ­ ® . bi « ¬ Then n n Õ ( )2 ·  k k4 + Õ + + v i bi b>i v 4 pi q>i qi p>i qi q>i . i1 i1 Ín  Ín ( )2 · We have already bounded i1 qi q>i i1 v i bi b>i . At the same time, k Ín k  k k4 i1 pi p>i v 4. By Lemma A.3.1, then,

n Õ + (k k3 1 4) pi q>i qi p>i 6 O v 4 n− / i1 w.ov.p.. Finally, applying the triangle inquality gives the desired result. 

120 Our second lemma controls the contribution from the random part of the leverage scores.

n d 1 Lemma 3.4.9. Let v ∈ ’ be a unit vector. Let b1,..., bn ∈ ’ − be random vectors iid

1 d 1 2 from N(0, · Idd 1). Let ai  (v(i) bi) ∈ ’ . Suppose d 6 n / /polylog(n). Then n − w.ov.p.

n Õ(k k2 − d )· k k2 · ( 3 4) bi 2 n ai a>i 6 v 4 O n− / i1 1 1 + kvk4 · O(n− ) + O(n− ) .

Ín (k k2 − d )· (sketch). Like in the proof of Lemma 3.4.8, i1 bi 2 n ai a>i decomposes into a convenient block structure; we will bound each block separately.

n Õ(k k2 − d )· bi 2 n ai a>i i1 n v(i)2 v(i)· b  Õ(k k2 − d )· © >i ª bi 2 n ­ ® . (3.4.1) i1 v(i)· bi bi b> « i ¬ In each block we can apply a (truncated) Bernstein inequality. The details are deferred to the full version. 

We are now ready to prove Lemma 3.4.5

k k2  ( )2 + k k2 of Lemma 3.4.5. We decompose ai 2 v0 i bi 2 and use Lemma 3.4.8 and Lemma 3.4.9.

n Õ(k k2 − d )· ai 2 n ai a>i i1 n ! n !  Õ ( )2 · + Õ(k k2 − d )· v0 i ai a>i bi 2 n ai a>i i1 i1  k k4 · + v0 4 e1e1> M ,

121 where

k k (k k3 · 1 4 + k k2 · 1 2) + (k k · 1 + 1) M 6 O v0 4 n− / v0 4 n− / O v0 4 n− n− .

k k4 ( ) 1 k k4/k k 1 Since v0 > εn − , we get v0 M > 1 4 , completing the proof.  4 4 ε /

3.5 Overcomplete Tensor Decomposition

In this section, we give a polynomial-time algorithm for the following problem

4 3 when n 6 d / /(polylog d):

 Ín ⊗ ⊗ ∈ d Problem 3.5.1. Given an order-3 tensor T i1 ai ai ai, where a1,..., an ’ N( 1 ) ∈ ’n are iid vectors sampled from 0, d Id , find vectors b1,..., bn such that for all i ∈ [n],

hai , bii > 1 − o(1) .

We give an algorithm that solves this problem, so long as the overcompleteness

4 3 of the input tensor is bounded such that n  d / /polylog d.

 Ín ⊗ ⊗ ∼ N( 1 ) Theorem 3.5.2. Given as input the tensor T i1 ai ai ai where ai 0, d Idd 4 3 1+ω with d 6 n 6 d / /polylog d,8 there is an algorithm which may run in time O˜ (nd ) or O˜ (nd3.257), where dω is the time to multiply two d × d matrices, which with probability 1 − o(1) over the input T and the randomness of the algorithm finds unit

d vectors b1,..., bn ∈ ’ such that for all i ∈ [n],

 3 2  n / hai , bii > 1 − O˜ . d2 8The lower bound d 6 n on n, is a matter of technical convenience, avoiding separate concentration analyses and arithmetic in the undercomplete n < d and overcomplete n > d settings. Indeed, our algorithm still works in the undercomplete( setting) (tensor decomposition( is) easier in the undercomplete setting than the overcomplete one), but here other algorithms based on local search also work [12].

122 3 2 2 We remark that this accuracy can be improved from 1 − O˜ (n / /d ) to an arbi- trarily good using existing local search methods with local convergence guarantees—we discuss the details in the full version.

Í 6 As discussed in Section 4.2, to decompose the tensor i ai⊗ (note we do not actually have access to this input!) there is a very simple tensor decomposition 1 ∈ ’d2 Í h1 2i( ) 2 algorithm: sample a random and compute the matrix i , ai⊗ ai a>i ⊗ . O ε ( ) 2+ With probability roughly n− ( ) this matrix has (up to scaling) the form ai a>i ⊗ E for some kEk 6 1 − ε, and this is enough to recover ai.

Í 6 Í ( ⊗ ) 3 However, instead of i ai⊗ , we have only i,j ai a j ⊗ . Unfortunately, running the same algorithm on the latter input will not succeed. To see why,  Í h ⊗ i( ⊗ ) 2 |h ⊗ i| ≈ consider the extra terms E0 : i,j 1, ai a j ai a j ⊗ . Since 1, ai a j 1, 2 it is straightforward to see that kE0kF ≈ n. Since the rank of E0 is clearly d , even if we are lucky and all the eigenvalues have similar magnitudes, still a typical ≈ /  Í 6 eigenvalue will be n d 1, swallowing the i ai⊗ term.

Í ( ⊗ ) 3 A convenient feature separating the signal terms i ai ai ⊗ from the Í ( ⊗ ) 3 crossterms i,j ai a j ⊗ is that the crossterms are not within the span of the ai ⊗ai. Although we cannot algorithmically access Span{ai ⊗ai }, we have access to  Í ( ⊗ ) something almost as good: the unfolded input tensor, T i n ai ai ai >. The ∈[ ] rows of this matrix lie in Span{ai ⊗ ai }, and so for i , j, kT(ai ⊗ ai)k  kT(ai ⊗ a j)k. √ In fact, careful computation reveals that kT(ai ⊗ ai)k > Ω˜ ( n/d)kT(ai ⊗ a j)k.

Í h ⊗ i( ⊗ ) 2 Í h ( ⊗ The idea now is to replace i,j 1, ai a j ai a j ⊗ with i,j 1, T ai 2 a j)i(ai ⊗ a j)⊗ , now with 1 ∼ N(0, Idd). As before, we are hoping that there h ( ⊗ )i  h ( ⊗ )i is i0 so that 1, T ai0 ai0 maxj,i0 1, T a j a j . But now we also require k Í h1 ( ⊗ )i( ⊗ )( ⊗ ) k  h1 ( ⊗ )i ≈ k ( ⊗ )k i,j , T ai a j ai a j ai a j > , T ai0 ai0 T ai ai . If we

123 are lucky and all the eigenvalues of this cross-term matrix have roughly the same magnitude (indeed, we will be lucky in this way), then we can estimate heuristically that

Õ h1, T(ai ⊗ a j)i(ai ⊗ a j)(ai ⊗ a j)>

i,j

1 Õ ≈ h1, T(ai ⊗ a j)i(ai ⊗ a j)(ai ⊗ a j)> d i,j F

1 √n Õ 6 · |h1, T(ai ⊗ ai )i| (ai ⊗ a j)(ai ⊗ a j)> d d 0 0 i,j F 3 2 n / |h1 ( ⊗ )i| 6 d2 , T ai0 ai0 ,

3 2 2 4 3 suggesting our algorithm will succed when n /  d , which is to say n  d / . The following theorem, which formalizes the intuition above, is at the heart of our tensor decomposition algorithm.

N( 1 ) Theorem 3.5.3. Let a1,..., an be independent random vectors from 0, d Idd with 4 3 d 6 n 6 d / /(polylog d) and let 1 be a random vector from N(0, Idd). Let Σ : √ ¾ ( ) 2  ·(Σ+)1 2  Í ( ⊗ ) x 0,Idd xx> ⊗ and let R : 2 / . Let T i n ai ai ai >. Define the ∼N( ) ∈[ ] d2 d2 matrix M ∈ ’ × ,

Õ M  h1, T(ai ⊗ a j)i · (ai ⊗ a j)(ai ⊗ a j)> . i,j n ∈[ ] √ With probability 1 − o(1) over the choice of a1,..., an, for every polylog d/ d < ε < 1,

d2 the spectral gap of RMR is at least λ2/λ1 6 1 − O(ε) and the top eigenvector u ∈ ’

O ε of RMR satisfies, with probability Ω˜ (1/n ( )) over the choice of 1,

 3 2  2 2 4 n / maxhRu, ai ⊗ aii / kuk · kai k > 1 − O˜ . i n εd2 ∈[ ] √ Moreover,with probability 1−o(1) over the choice of a1,..., an, for every polylog d/ d <

124 1+O ε ε < 1 there are events E1,..., En so that 1 Ei > Ω˜ (1/n ( )) for all i ∈ [n] and when

 3 2  h ⊗ i2/k k2 · k k4 − ˜ n / Ei occurs, Ru, ai ai u ai > 1 O εd2 .

We will eventually set ε  1/log n, which gives us a spectral algorithm for ( − ˜ ( / 3 2)) 2 recovering a vector 1 O n d / -correlated to some ai⊗ . Once we have a vector 2 correlated with each ai⊗ , obtaining vectors close to the ai is straightforward. We will begin by proving this theorem, and give algorithmic details in section Section 3.5.2.

In Section 3.5.1 we prove Theorem 3.5.3 using two core facts: the Gaussian vector 1 is closer to some ai than to any other with good probability, and the Í h ( ⊗ )i( ⊗ )( ⊗ ) noise term i,j 1, T ai a j ai a j ai a j > is bounded in spectral norm.

3.5.1 Proof of Theorem 3.5.3

The strategy to prove Theorem 3.5.3 is to decompose the matrix M into two parts  +  Í h ( ⊗ )i · M Mdiag Mcross, one formed by diagonal terms Mdiag i n 1, T ai ai ∈[ ] ( ⊗ )( ⊗ )  Í h ( ⊗ )i · ai ai ai ai > and one formed by cross terms Mcross i,j 1, T ai a j

(ai ⊗ a j)(ai ⊗ a j)>. We use the fact that the top eigenvector Mdiag is likely to be 2 correlated with one of the vectors a⊗j , and the fact that Mdiag has a noticeable spectral gap. The following propositions characterize the spectra of Mdiag and

Mcross, and are proven in the full version. √ 2 + 1 2 Proposition 3.5.4. Spectral gap of diagonal terms. Let R  2·((¾(xx>)⊗ ) ) / ∼ N( ) N( 1 ) for x 0, Idd . Let a1,..., an be independent random vectors from 0, d Idd 2 Ω 1 with d 6 n 6 d − ( ) and let 1 ∼ N(0, Idd) be independent of all the others. Let  Í ( ⊗ )  Í h1 2i · ( ) 2 T : i n ai ai ai >. Suppose Mdiag i n , Tai⊗ ai a>i ⊗ . Let also v j be ∈[ ] ∈[ ]

125 2 2 such that v j v>  h1, Ta⊗ i · (a j a>)⊗ . Then, with probability 1 − o(1) over a1,..., an, j √j j for each ε > polylog d/ d and each j ∈ [n], the event

n def − · Ej,ε RMdiagR ε Rv j v>j R  √  o − − ˜ / · 6 RMdiagR ε O n d Rv j v>j R

1+O ε has probability at least Ω˜ (1/n ( )) over the choice of 1.

4 3 Second, we show that when n  d / the spectral norm of Mcross is negligible compared to this spectral gap.

Proposition 3.5.5. Bound on crossterms. Let a1,..., an be independent random N( 1 ) 1 N( )  vectors from 0, d Idd , and let be a random vector from 0, Idd . Let T : Í ( ⊗ )  Í h1 ( ⊗ )i ⊗ i n ai ai ai >. Let Mcross : i,j n , T ai a j ai a>i a j a>j . Suppose n > d. ∈[ ] ∈[ ] Then with w.ov.p.,  1 2 n3 / kM k 6 O˜ . cross d4

Using these two propositions we will conclude that the top eigenvector of

2 RMR is likely to be correlated with one of the vectors a⊗j . We also need two simple concentration bounds; we defer the proof to the full version.

N( 1 ) Lemma 3.5.6. Let a1,..., an be independently sampled vectors from 0, d Idd , and N( )  Í ( ⊗ ) let 1 be sampled from 0, Idd . Let T i ai ai ai >. Then with overwhelming probability, for every j ∈ [n], √   4 n h1, T(a j ⊗ a j)i − h1, a jika j k 6 O˜ . d

Fact 3.5.7. Let x, y ∼ N(0, 1 Id). With overwhelming probability, 1 − kxk2 6 √ d O˜ (1/ d) and hx, yi2  O˜ (1/d).

126 As a last technical tool we will need a simple claim about the fourth moment matrix of the multivariate Gaussian: √  ( ) 2  ( +)1 2 k k  Fact 3.5.8. Let Σ ¾x 0,Id xx> ⊗ and let R 2 Σ / . Then R 1, and ∼N( d) for any v ∈ ’d, k ( ⊗ )k2  − 1  · k k4 R v v 2 1 d+2 v .

We are prepared prove Theorem 3.5.3.

4 3 of Theorem 3.5.3 (Sketch). Let d 6 n 6 d / /(polylog d) for some polylog d to be N( 1 ) chosen later. Let a1,..., an be independent random vectors from 0, d Idd and let 1 ∼ N(0, Idd) be independent of the others. Let

Õ  h1 ( ⊗ )i · ( ) 2 Mdiag : , T ai ai ai a>i ⊗ , i n ∈[ ] Õ  h1 ( ⊗ )i · ⊗ Mcross : , T ai a j ai a>i a j a>j . i,j n ∈[ ]

Note that M : Mdiag + Mcross.

Proposition 3.5.5 implies that

n o k k ˜ ( 3 2/ 2) − ω 1  Mcross 6 O n / d > 1 d− ( ) . (3.5.1) √  ( ) 2  ·( +)1 2 Recall that Σ ¾x 0,Id xx> ⊗ and R 2 Σ / . By Proposition 3.5.4, ∼N( d) with probability 1 − o(1) over the choice of a1,..., an, each of the following events √ 1+O ε Ej,ε for j ∈ [n] and ε > polylog(d)/ d has probability at least Ω˜ (1/n ( )) over the choice of 1:

  E0 R M − h1 Ta 2i(a a ) 2 R j,ε : diag ε , ⊗j j >j ⊗ n1 2 6 kRM Rk − (ε − O˜ ( / ))|h1, Ta 2i|k Ra 2k2 . diag d j ⊗ j ⊗

127 And therefore together with (3.5.1), the events

  − h1 2i( ) 2 E∗j,ε : R M ε , Ta⊗j a j a>j ⊗ R n1 2 6 kR · M · Rk − (ε − O˜ ( / )|h1, Ta 2i|kRa 2k2 d j ⊗ j ⊗ 3 2 2 + O˜ (n / /d ) occur under the same stipulations (as M  Mdiag + Mcross and kR · Mcross · Rk 6

2 kRk kMcrossk 6 kMcrossk by Fact 3.5.8).

By standard reasoning about the top eigenvector of a matrix with a spectral

d2 gap (Lemma A.3.3), the event E∗ j,ε implies that the top eigenvector u ∈ ’ of R · M · R satisfies

* 2 +2 √ Ra⊗j O˜ ( n/d) O˜ (n3 2/d2) u, > 1 − − / . k 2k k 2k2 k 2k2|h1 2i| Ra⊗j ε Ra⊗j ε Ra⊗j , Ta⊗j

Applying Fact 3.5.8, Fact 3.5.7, Lemma 3.5.6, and standard concentration argu-

ω 1 ments to simplify the expression, we have that with probability 1 − n− ( ), √  n   n3 2  > 1 − O˜ − O˜ / . εd εd2

(The calculation is performed in more detail in the full version). A union bound now gives the desired conclusion.

Finally, we use concentration arguments and our knowledge of the correlation

2 between u and Rai⊗ to upper bound the second eigenvector

3 2 2 λ2(RMR) 6 1 − O˜ (ε) + O˜ (n / /εd ) , with details given in the full version of the paper. Therefore,

λ2(RMR)/λ1(RMR) 6 1 − O(ε), completing the proof. 

128 Spectral Tensor Decomposition (One Attempt)

This is the main subroutine of our algorithm—we will run it O˜ (n) times to recover all of the components Ín a1,...,⊗ ⊗an. Algorithm 3.5.9. Input: T i1 ai ai ai. Goal: Recover ai for some i ∈ [n]. ∈ d2 d • Compute the matrix unfolding T ’ × of T. Then compute a 3- d2 d2 d2 tensor S ∈ ’ × × by starting with the 6-tensor T ⊗ T, permuting indices, and flattening to a 3-tensor. Apply T in one mode of S to d d2 d2 obtain M ∈ ’ ⊗ ⊗ : n Õ 2 Õ 3 T  ai(ai ⊗ ai)> , S  T⊗  (ai ⊗ a j)⊗ , i n i,j1 ∈[ ]  ( )  Õ ( ⊗ ) ⊗ ( ⊗ ) ⊗ ( ⊗ ) M S T, Idd2 , Idd2 T ai a j ai a j ai a j . i,j n ∈[ ] • Sample a vector 1 ∈ ’d with iid standard gaussian entries. Evaluate d2 d2 M in its first mode in the direction of 1 to obtain M ∈ ’ × :  ( )  Õ h ( ⊗ )i · ( ⊗ )( ⊗ ) M : M 1, Idd2 , Idd2 1, T ai a j ai a j ai a j > . i,j n ∈[ ] √ def [( ) 2] ∼ N( ) def ·( +)1 2 • Let Σ ¾ aa> ⊗ for a 0, Idd . Let R 2 Σ / . Com- 2 pute the top eigenvector u ∈ ’d of RMR, and reshape Ru to a d d matrix U ∈ ’ × . • For each of the signings of the top 2 unit left (or right) singular ± ± Í h ± i3 − ( ) vectors u1, u2 of U, check if i n ai , u j > 1 c n, d where 3 2 ∈[ ] c(n, d)  Θ(n/d / ) is an appropriate threshold. If so, output ±u j. Otherwise output nothing.

3.5.2 Discussion of Full Algorithm

In this subsection we discuss the full details of our tensor decomposition algorithm. As above, the algorithm constructs a random matrix from the input tensor, then computes and post-processes its top eigenvector.

Theorem 3.5.3 is almost enough for the correctness of Algorithm 3.5.9, prov-

2 ing that RMR’s top eigenvector is correlated with some ai⊗ with reasonable probability. We need a few more ingredients to prove Theorem 3.5.2—we must

129 2 show that finding a vector u correlated with some ai⊗ is sufficient for finding a vector close to ai. This can be done by standard eigenvector computations on a reshaping of u to an n × n matrix. The details are left for the full version.

Finally, we must prove a bound on the runtime of the algorithm, which we do by implementing each of the steps as a sequence of carefully-chosen matrix multiplications.

+ Lemma 3.5.10. Algorithm 3.5.9 can be implemented in time O˜ (d1 ω) 6 O˜ (d3.3729), where dω is the runtime for multiplying two d × d matrices.

Proof. To run the algorithm, we only require access to power iteration using the matrix RMR. We first give a fast implementation for power iteration with the matrix M, and handle the multiplications with R separately.

d2 Consider a vector v ∈ ’ , and a random vector 1 ∼ N(0, Idd), and let

d d V, G ∈ ’ × be the reshapings of v and 1T respectively into matrices. Call

Tv  T(Idd , V, G) (V and G applied in the second and third modes of T), and call

2 Tv the reshaping of Tv into a d × d matrix. We have that

Õ Tv  ai(Vai ⊗ Gai)> . i n ∈[ ] We show that the matrix-vector multiply Mv can be computed as a flattening of the following product:

 © Õ ( ⊗ ) ª © Õ ( ⊗ ) ª TvT> ­ ai Vai Gai >® ­ a j a j a>j ® i n j n « ∈[ ] ¬ « ∈[ ] ¬  Õ h i · h i · a j , Vai a j , Gai ai a>j i,j n ∈[ ] Õ  h ⊗ i · h1 ⊗ i · ai a j , v T, ai a j ai a>j . i,j n ∈[ ]

130 d2 Flattening TvT> from a d × d matrix to a vector vTT ∈ ’ , we have that

Õ vTT  h1T, ai ⊗ a ji · hai ⊗ a j , vi · ai ⊗ a j  Mv . i,j n ∈[ ]

So we have that Mv is a flattening of the product TvT>, which we will compute as a proxy for computing Mv via direct multiplication.

Computing Tv  T(Id, V, G) can be done with two matrix multiplication operations, both times multiplying a d2 ×d matrix with a d ×d matrix. Computing

2 2 TvT> is a multiplication of a d × d matrix by a d × d matrix. Both these steps + may be done in time O(d1 ω), by regarding the d × d2 matrices as block matrices with blocks of size d × d. The asymptotically fastest known algorithm for matrix multiplication gives a time of O(d3.3729) [35].

2 Now, to compute the matrix-vector multiply RMRu for any vector u ∈ ’d , + we may first compute v  Ru, perform the operation Mv in time O(d1 ω) as described above, and then again multiply by R. The matrix R is sparse: it has

O(d) entries per row (see the full version for a complete description of R), so computing Ru requires time O(d3).

Performing the update RMRv a total of O(log2 n) times is sufficient for convergence, as we have that with reasonable probability, the spectral gap ( )/ ( ) − ( 1 ) λ2 RMR λ1 RMR 6 1 O log n , as a result of applying Theorem 3.5.3 with  ( 1 ) the choice of ε O log n .

Í h i3 ( 3) Finally, checking the value of i ai , x requires O d operations, and we do so a constant number of times, once for each of the signings of the top 2 left (or right) singular vectors of U. 

131 3.6 Tensor principal component analysis

The Tensor PCA problem in the spiked tensor model is similar to the setting of tensor decomposition, but here the goal is to recover a single large component with all smaller components of the tensor regarded as random noise.

Problem 3.6.1 (Tensor PCA in the Order-3 Spiked Tensor Model). Given an input

3 n tensor T  τ · v⊗ + A, where v ∈ ’ is an arbitrary unit vector, τ > 0 is the signal-to-noise ratio, and A is a random noise tensor with iid standard Gaussian entries, recover the signal v approximately.

Using the partial trace method, we give the first linear-time algorithm for

3 4 this problem that recovers v for signal-to-noise ratio τ  O(n / /poly log n). In addition, the algorithm requires only O(n2) auxiliary space (compared to the input size of n3) and uses only one non-adaptive pass over the input.

3.6.1 Spiked tensor model

This spiked tensor model (for general order-k tensors) was introduced by Mon- tanari and Richard [67], who also obtained the first algorithms to solve the model with provable statistical guarantees. Subsequently, the SoS approach was applied to the model to improve the signal-to-noise ratio required for odd- order tensors [50]; for 3-tensors reducing the requirement from τ  Ω(n) to

3 4 1 4 τ  Ω(n / log(n) / ).

Using the linear-algebraic objects involved in the analysis of the SoS relaxation, the previous work has also described algorithms with guarantees similar to those

132 of the SoS SDP relaxation, while requiring only nearly subquadratic or linear time [50].

The algorithm here improves on the previous results by use of the partial trace method, simplifying the analysis and improving the runtime by a factor of log n.

3.6.2 Linear-time algorithm

Linear-Time Algorithm for Tensor PCA

3 Algorithm 3.6.2. Input: T  τ · v⊗ + A. Goal: Recover v0 with hv, v0i > 1 − o(1).

 Í ⊗ ∈ n n • Compute the partial trace M : Tr’n i Ti Ti ’ × , where Ti are the first-mode slices of T.

• Output the top eigenvector v0 of M.

3 4 1 2 Theorem 3.6.3. When A has iid standard Gaussian entries and τ > Cn / log(n) / /ε for some constant C, Algorithm 3.6.2 recovers v0 with hv, v0i > 1 − O(ε) with high probability over A.

Theorem 3.6.4. Algorithm 3.6.2 can be implemented in linear time and sublinear space.

These theorems are proved by routine matrix concentration results, showing that in the partial trace matrix, the signal dominates the noise.

To implement the algorithm in linear time it is enough to show that this (sublinear-sized) matrix has constant spectral gap; then a standard application of the matrix power method computes the top eigenvector.

133 Lemma 3.6.5. For any v, with high probability over A, the following occur:

Õ ( )· ( 3 2 2 ) Tr Ai Ai 6 O n / log n i

√ Õ ( )· ( ) v i Ai 6 O n log n i

√ Õ ( ) ( )· ( ) Tr Ai v i vv> 6 O n log n . i

The proof may be found in Appendix A.6.

Í ⊗ Proof of Theorem 3.6.3. We expand the partial trace Tr’n i Ti Ti. Õ Õ Tr’n Ti ⊗ Ti  Tr(Ti)· Ti i i Õ  Tr(τ · v(i)vv> + Ai)·(τ · v(i)vv> + Ai) i Õ 2  (τv(i)kvk + Tr(Ai)) · (τ · v(i)vv> + Ai) i ! 2 Õ Õ Õ  τ vv> + τ v(i)· Ai + Tr(Ai)v(i)vv> + Tr(Ai)· Ai . i i i Applying Lemma 3.6.5 and the triangle inequality, we see that !

Õ ( )· + Õ ( ) ( ) + Õ ( )· ( 3 2 ) τ v i Ai Tr Ai v i vv> Tr Ai Ai 6 O n / log n i i i

3 4p with high probability. Thus, for appropriate choice of τ  Ω(n / (log n)/ε), the Í ⊗ matrix Tr’n i Ti Ti is close to rank one, and the result follows by standard manipulations. 

Proof of Theorem 3.6.4. Carrying over the expansion of the partial trace from above  ( 3 4p( )/ ) Í ⊗ and setting τ O n / log n ε , the matrix Tr’n i Ti Ti has a spectral gap ratio equal to Ω(1/ε) and so the matrix power method finds the top eigenvector

134 in O(log(n/ε)) iterations. This matrix has dimension n × n, so a single iteration takes O(n2) time, which is sublinear in the input size n3. Finally, to construct Í ⊗ Tr’n i Ti Ti we use

Õ Õ Tr’n Ti ⊗ Ti  Tr(Ti)· Ti i i and note that to construct the right-hand side it is enough to examine each entry of T just O(1) times and perform O(n3) additions. At no point do we need to store more than O(n2) matrix entries at the same time. 

135 CHAPTER 4 POLYNOMIAL LIFTS

4.1 Introduction

Tensors are arrays of (real) numbers with multiple indices—generalizing matrices (two indices) and vectors (one index) in a natural way. They arise in many different contexts, e.g., moments of multivariate distributions, higher-order derivatives of multivariable functions, and coefficients of multivariate polynomials. An important ongoing research effort aims to extend algorithmic techniques for vectors and matrices to more general tensors. A key challenge is that many tractable matrix computations (like rank and spectral norm) become NP-hard in the tensor setting (even for just three indices) [?, ?]. However, recent work gives evidence that it is possible to avoid this computational intractability and develop provably efficient algorithms, especially for low-rank tensor decompositions, by making suitable assumptions about the input and allowing for approximations [12, 7, 39, 50, ?]. These algorithms lead to the best known provable guarantees for a wide range of unsupervised learning problems [10, 24, 42, 9], including learning mixtures of Gaussians [37], Latent Dirichlet topic modeling [4], and dictionary learning [17]. Low-rank tensor decompositions are useful for these learning problems because they are often unique up to permuting the factors—in contrast, low-rank matrix factorizations are unique only up to unitary transformation. In fact, as far as we are aware, in all natural situations where finding low-rank tensor decompositions is tractable, the decompositions are also unique.

We consider the following (symmetric) version of the tensor decomposition

d problem: Let a1,..., an ∈ ’ be d-dimensional unit vectors. We are given (ap-

136 proximate) access to the first k moments M1,..., Mk of the uniform distribution over a1,..., an, that is,

n M  1 Õ t ∈ { } t n ai⊗ for t 1,..., k . (4.1.1) i1

The goal is to approximately recover the vectors a1,..., an. What conditions on the vectors a1,..., an and the number of moments k allow us to efficiently and robustly solve this problem?

A classical algorithm based on (simultaneous) matrix diagonalization [46, 56, attributed to Jennrich] shows that whenever the vectors a1,..., an are linearly independent, k  3 moments suffice to recover the vectors in polynomial time. (This algorithm is also robust against polynomially small errors in the input moment tensors [6, 42, 24].) Therefore an important remaining algorithmic challenge for tensor decomposition is the overcomplete case, when the number of vectors (significantly) exceeds their dimension. Several recent works studied this case with different assumptions on the vectors and the number of moments. In this work, we give a unified algorithmic framework for overcomplete tensor decomposition that achieves—and in many cases surpasses—the previous best guarantees for polynomial-time algorithms.

In particular, some decompositions that previously required quasi-polynomial time to find are reduced to polynomial time in our framework, including the case of general tensors with order logarithmically large in its overcompleteness

3 2 O 1 n/d [17] and random order-3 tensors with rank n 6 d / /log ( )(d) [39]. Iterative methods may also achieve fast local convergence guarantees for incoherent

3 2 order-3 tensors with rank o(d / ), which become global convergence guarantees under no more than constant overcompleteness [10]. In the smoothed analysis model, where each vector of the desired decomposition is assumed to have

137 been randomly perturbed by an inverse polynomial amount, polynomial-time decomposition was achieved for order-5 tensors of rank up to d2/2 [24]. Our framework extends this result to order-4 tensors, for which the corresponding analysis was previously unknown for any superconstant overcompleteness.

The starting point of our work is a new analysis of the aforementioned matrix diagonalization algorithm that works for the case when a1,..., an are linearly independent. A key ingredient of our analysis is a powerful and by now standard concentration bound for Gaussian matrix series [63, 77]. An important feature of our analysis is that it is captured by the sum-of-squares (SoS) proof system in a robust way. This fact allows us to use Jennrich’s algorithm as a rounding procedure for sum-of-squares relaxations of tensor decomposition, which is the key idea behind improving previous quasi-polynomial time algorithms based on these relaxations [17, 39].

The main advantage that sum-of-squares relaxations afford for tensor de- composition is that they allow us to efficiently hallucinate faithful higher-degree moments for a distribution given only its lower-degree moments. We can now run classical tensor decomposition algorithms like Jennrich’s on these halluci- nated higher-degree moments (akin to rounding). The goal is to show that those algorithms work as well as they would on the true higher moments. What is challenging about it is that the analysis of Jennrich’s algorithm relies on small spectral gaps that are difficult to reason about in the sum-of-squares setting. (Previous sum-of-squares based methods for tensor decomposition also followed this outline but used simpler, more robust rounding algorithms which required quasi-polynomial time.)

To this end, we view solutions to sum-of-squares relaxations as pseudo-

138 distributions, which generalize classical probability distributions in a way that takes computational efficiency into account.1 More concretely, pseudo-distributions are indistinguishable from actual distributions with respect to tests captured by

a restricted system of proofs, called sum-of-squares proofs.

An interesting feature of how we use pseudo-distributions is that our re- laxations search for pseudo-distributions of large entropy (via an appropriate

surrogate). This objective is surprising, because when we consider convex relax- ations of NP-hard search problems, the intended solutions typically correspond to atomic distributions which have entropy 0. Here, high entropy in the pseudo-

distribution allows us to ensure that rounding results in a useful solution. This appears to be related to the way in which many randomized rounding procedures use maximum-entropy distributions [?], but differs in that the aforementioned

rounding procedures focus on the entropy of the rounding process rather than the entropy (surrogate) of the solution to the convex relaxation. A measure of “entropy” has also been directly ascribed to pseudo-distributions previously [?],

and the principle of maximum entropy has been applied to pseudo-distributions as well [?], but these have previously occurred separately, and our application is

the first to encode a surrogate notion of entropy directly into the sum-of-squares proof system.

Our work also takes inspiration from a recent work that uses sum-of-squares techniques to design fast spectral algorithms for a range of problems includ-

ing tensor decomposition [?]. Their algorithm also proceeds by constructing surrogates for higher moments and applying a classical tensor decomposition algorithm on these surrogates. The difference is that the surrogates in [?] are

1In particular, the set of constant-degree moments of n-variate pseudo-distributions admits O 1 an n ( )-time separation oracle based on computing eigenvectors.

139 explicitly constructed as low-degree polynomial of the input tensor, whereas our surrogates are computed by sum-of-squares relaxations. The explicit surrogates of [?] allow for a direct (but involved) analysis through concentration bounds for matrix polynomials. In our case, a direct analysis is not possible because we have very little control over the surrogates computed by sum-of-squares relaxations. Therefore, the challenge for us is to understand to what extent classical tensor decomposition algorithms are compatible with the sum-of-squares proof system. Our analysis ends up being less technically involved compared to [?] (using the language of pseudo-distributions and sum-of-squares proofs).

4.1.1 Results for tensor decomposition

d Let {a1,..., an } ⊆ ’ be a set of unit vectors. We study the task of approximately recovering this set of vectors given (noisy) access to its first k moments (4.1.1).

We organize this overview of our results based on different kinds of assumptions imposed on the set {a1,..., an } and the order of tensor/moments that we have access to. All of our algorithms are randomized and may fail with some small probability over their internal randomness, say probability at most 0.01. (Standard arguments allow us to amplify this probability at the cost of a small increase in running time.)

Orthogonal vectors. This scenario often captures the case of general linearly independent vectors because knowledge of the second moments of a1,..., an allows us to orthonormalize the vectors (this process is sometimes called “whiten- ing”). Many efficient algorithms are known in this case. Our contribution here

d 3 is in improving the error tolerance. For a symmetric 3-tensor E ∈ (’ )⊗ , we

140 k k 2 use E 1 , 2,3 to denote the spectral norm of E as a d-by-d matrix (using the { } { } first mode of E to index rows and the last two modes of E to index the columns). √ k k This norm is at most d times the injective norm E 1 , 2 , 3 (the maximum of { } { } { } hE, x ⊗ y ⊗ zi over all unit vectors x, y, z ∈ ’d). The previous best error tolerance  − Ín 3 for this problem required the error tensor E T i1 ai⊗ to have injective k k  / k k  norm E 1 , 2 , 3 1 d. Our algorithm requires only E 1 , 2,3 1, which is { } { } { } √ { } { } k k  / satisfied in particular when E 1 , 2 , 3 1 d. { } { } { }

Theorem 4.1.1. There exists a polynomial-time algorithm that given a symmetric 3-

d 3 d tensor T ∈ (’ )⊗ outputs a set of vectors {a0 ,..., a0 } ⊆ ’ such that for every 1 n0 d orthonormal set {a1,..., an } ⊆ ’ , the Hausdorff distance2 between the two sets is at most

n { }  2 ( )· − Õ 3 distH a1,..., an , a10 ,..., a0n 6 O 1 T ai⊗ . (4.1.2) 0 i1 1 , 2,3 { } { }

k −Ín 3k / Under the additional assumption T i1 a⊗ 1 , 2,3 6 1 log d, the running i { } { } + time of the algorithm can be improved to O(d1 ω) 6 d3.33 using fast matrix multiplication, where ω is the number such that two n × n matrices can be multiplied together in time nω (See Theorem 4.10.2).

k·k It is also possible to replace the spectral norm 1 , 2,3 in the above theorem { } { } statement by constant-degree sum-of-squares relaxations of the injective norm of 3-tensors. (See Remark 4.5.3 for details. ) If the error E has Gaussian N( 2 · 3) · 3 4( )O 1 distribution 0, σ Idd⊗ , then this norm is w.h.p. bounded by σ d / log d ( ) k·k ( · ) [50], whereas the norm 1 , 2,3 has magnitude Ω σ d . We prove Theorem 4.1.1 { } { } in Section 4.5.2.

2The Hausdorff distance distH X, Y between two finite sets X and Y measures the length of the ( ) largest gap between the two sets. Formally, distH X, Y is the maximum of maxx X miny Y x y ( ) ∈ ∈ k − k and maxy Y minx X x y . ∈ ∈ k − k

141 Random vectors. We consider the case that a1,..., an are chosen independently at random from the unit sphere of ’d. For n 6 d, this case is roughly equivalent to the case of orthonormal vectors. Thus, we are interested in the “overcomplete” case n  d, when the rank is larger than the dimension. Previous work found the

3 2 O 1 decomposition in quasi-polynomial time when n 6 d / /log ( ) d [39], or in time

4 3 O 1 subquadratic in the input size when n 6 d / /log ( ) d [?]. Our polynomial-time

4 3 3 2 algorithm therefore is an improvement when n is between d / and d / (up to logarithmic factors).

Theorem 4.1.2. There exists a polynomial-time algorithm A such that with probability

ω 1 d 1 − d− ( ) over the choice of random unit vectors a1,..., an ∈ ’ , every symmetric

d 3 3-tensor T ∈ (’ )⊗ satisfies

 2  Ω 1 n  ( ) { } n ( ) + − Õ 3 distH A T , a1,..., an 6 O T ai⊗ . (4.1.3) d1.5 i1 1 , 2,3 { } { }

k·k Again it is possible to replace the spectral norm 1 , 2,3 in the above theorem { } { } statement by constant-degree sum-of-squares relaxations of the injective norm of 3-tensors, which as mentioned before give better bounds for Gaussian error tensors. We prove Theorem 4.1.2 in Section 4.7.

Smoothed vectors. Next, we consider a more general setup where the vectors

d a1,..., an ∈ ’ are smoothed, i.e., randomly perturbed. This scenario is sig- nificantly more general than random vectors. Again we are interested in the overcomplete case n  d. The previous best work [24] showed that the fifth

2 moment of smoothed vectors a1,..., an with n 6 d /2 is enough to approximately recover the vectors even in the presence of a polynomial amount of error. For fourth moments of smoothed vectors, no such result was known even for lower overcompleteness, say n  d1.01.

142 We give an interpretation of the 4-tensor decomposition algorithm FOOBI3 [55] as a special case of a sum-of-squares based decomposition algorithm. We show that the sum-of-squares based algorithm works in the smoothed setting even in the presence of a polynomial amount of error. We define a condition

d number κ(·) for sets of vectors a1,..., an ∈ ’ (a polynomial in the condition { 2 | ∈ [ ]} number of two matrices, one with columns ai⊗ i n and one with columns

{ai ⊗ (ai ⊗ a j − a j ⊗ ai) ⊗ a j | i , j ∈ [n]}). First, we show that the algorithm can tolerate error  1/κ which could be independent of the dimension. Concretely, our algorithm will output a set of vectors aˆ1,..., aˆn which will be close to

{a1,..., an } up to permutations and sign flip with a relative error that scales linearly in the relative error of the input and the condition number κ. Second, we show that for smoothed vectors this condition number is at least inverse polynomial with probability exponentially close to 1.

Theorem 4.1.3. There exists a polynomial-time algorithm such that for every symmetric

d 4 d 4-tensor T ∈ (’ )⊗ and every set {a1,..., an } ⊆ ’ of vectors not necessarily unit [ ] → [ ] { } length, there exists a permutation π : n n so that the output a10 ,..., a0n of the algorithm on input T satisfies

n 4 ai − a0 T − Í a π i i1 i⊗ 1,2 , 3,4 max ( ) 6 O(1)· { } { } · κ(a ,..., an) , (4.1.4) k k  T 1 i n ai Ín ( 2)( 2) ∈[ ] σn i1 ai⊗ ai⊗ where σn(A) refers to the nth singular value of the matrix A, here the smallest non-zero singular value.

d We say that a distribution over vectors a1,..., an ∈ ’ is γ-smoothed if  0 + · 1 0 0 1 1 ai ai γ i, where a1,..., an are fixed vectors and 1,..., n are independent N( 1 ) Gaussian vectors from 0, d Idd . 3The FOOBI algorithm is known to work for overcomplete 4-tensors when there is no error in the input. Researchers [24] asked if this algorithm tolerates a polynomial amount of error. Our work answers this question affirmatively for a variant of FOOBI (based on sum-of-squares).

143 Theorem 4.1.4. Let ε > 0 and n, d ∈ Ž with n 6 d2/10. Then, for any γ-smoothed

d distribution over vectors a1,..., an in ’ ,

n o Ω 1  κ(a1,..., an) 6 poly(d, γ) > 1 − exp(−d ( )) .

The above theorems together imply a polynomial-time algorithm for approxi- mately decomposing overcomplete smoothed 4-tensors even if the input error is polynomially large. The error probability of the algorithm is exponentially small over the choice of the smoothing. It is an interesting open problem to extend this result to overcomplete smoothed 3-tensors, even for lower overcompleteness n  d1.01. Theorem 4.1.3 and Theorem 4.1.4 are proved in Section 4.8.

Separated unit vectors. In the scenario, when inner products among the vectors

d a1,..., an ∈ ’ are bounded by ρ < 1 in absolute value, the previous best decomposition algorithm shows that moments of order (log n)/log ρ suffice [69]. Our algorithm requires moments of higher order (by a factor logarithmic in the desired accuracy) but in return tolerates up to constant spectral error. This increased error tolerance also allows us to apply this result for dictionary learning with up to constant sparsity (see Section 4.1.2).

Theorem 4.1.5. There exists an algorithm A with polynomial running time (in the size of its input) such that for all η, ρ ∈ (0, 1) and σ > 1, for every set of unit { } ⊆ d kÍn T k |h i| vectors a1,..., an ’ with i1 ai ai 6 σ and maxi,j ai , a j 6 ρ, when the  +  ∈ (’d) k 1 log σ · ( / ) algorithm is given a symmetric k-tensor T ⊗ with k > O log ρ log 1 η , d then its output A(T) is a set of vectors {a0 ,..., a0 } ⊆ ’ such that 1 n0

 2  n  { 2 2} { 2 2} + − Õ k distH a0⊗ ,..., a0⊗n , a⊗ ,..., an⊗ 6 O η T ai⊗ . 1 1 i1 1,..., k 2 , k 2 +1,...,k { b / c} {b / c } (4.1.5)

144 We also show that a simple spectral algorithm with running time close to dk (the size of the input) achieves similar guarantees (see Remark 4.10.3). However, the error tolerance of this algorithm is in terms of an unbalanced spectral norm: k − Ín k k T i1 a⊗ 1,...,k 3 , k 3+1,...,k (the spectral norm of the tensor viewed as a i { / } { / } k 3 2k 3 d / -by-d / matrix). This norm is always larger than the balanced spectral norm in the theorem statement. In particular, for dictionary learning applications, this norm is larger than 1, which renders the guarantee of the simpler spectral algorithm vacuous in this case. We prove Theorem 4.1.5 in Section 4.5.3.

General unit vectors. In this scenario, the number of moments that our algo-

Í T rithm requires is constant as long as i ai ai has constant spectral norm and the desired accuracy is constant.

Theorem 4.1.6. There exists an algorithm A (see Algorithm 4) with polynomial running time (in the size of its input) such that for all ε ∈ (0, 1), σ > 1, for every set of unit vectors { } ⊆ d kÍn T k ∈ ( d) 2k a1,..., an ’ with i1 ai ai 6 σ and every symmetric 2k-tensor T ’ ⊗

k ( / )O 1 · ( ) T − Í a 2k / with > 1 ε ( ) log σ and i i⊗ 1,...,k , k+1,...,2k 6 1 3, we have { } { }  2 ( ) { 2 2} ( ) distH A T , a1⊗ ,..., an⊗ 6 O ε .

The previous best algorithm for this problem required tensors of order

O log σ εO 1 +log n (log σ)/ε and had running time d (( )/ ( ) ) [17, Theorem 4.3]. We require the same order of the tensor and the runtime is improved to be polynomial in the

poly log σ ε size of the inputs (that is, d (( )/ )).

We also remark that a bit surprisingly we can handle 1/3 error in spectral norm, and this is possible partly due to the choice of working with high order tensors. As a sanity check, we note that information-theoretically the components are

145 2k identifiable: under the assumptions, the only vectors u that satisfy hT, u⊗ i > 1/3 are those vectors close to one of the ai’s. We also note that the rounding algorithm of the sum-of-squares relaxation of this simple inefficient test requires a bit new idea beyond what we used previously. Here the difficulty is to make the runtime

poly log σ ε poly σ ε d (( )/ ) instead of d ( / ). See Section 4.9 for details.

Spectral algorithms without sum-of-squares. Finally, using a similar rounding technique directly on a orthogonal tensor (without using sum-of-squares and the pseudo-moment), we also obtain a fast and robust algorithm for orthogonal tensor decomposition. See Section 4.10 for details.

4.1.2 Applications of tensor decomposition

Tensor decomposition has a wide range of applications. We focus here on learning sparse dictionaries, which is an example of the more general phenomenon of using tensor decomposition to learn latent variable models. Here, we obtain the

first polynomial-time algorithms that work in the overcomplete regime up to constant sparsity.

Dictionary learning is an important problem in multiple areas, ranging from computational neuroscience [?, ?, ?], machine learning [?, ?], to computer vision and image processing [?, ?, ?]. The general goal is to find a good basis for given data. More formally, in the dictionary learning problem, also known as sparse coding, we are given samples of a random vector y ∈ ’n, of the form y  Ax

n m where A is some unknown matrix in ’ × , called dictionary, and x is sampled from an unknown distribution over sparse vectors. The goal is to approximately

146 recover the dictionary A.

We consider the same class of distributions over sparse vectors {x} as [17], which as discussed in [17] admits a wide-range of non-product distributions over sparse vectors. (The case of product distributions reduces to the significantly easier problem of independent component analysis.) We say that {x} is (k, τ)-nice

k k 2 k 2 ¾  ∈ [ ] ¾ / / ∈ [ ] ¾ α  if xi 1 for every i m , xi x j 6 τ for all i , j m , and x 0 for every non-square degree-k monomial xα. Here, τ is a measure of the relative sparsity of the vectors {x}.

We give an algorithm that for nice distributions solves the dictionary learning problem in polynomial time when the desired accuracy is constant, the over- completeness of the dictionary is constant (measured by the spectral norm kAk), and the sparsity parameter τ is a sufficiently small constant (depending only on the desired accuracy and kAk). The previous best algorithm [17] requires quasi-polynomial time in this setup (but works in polynomial-time for polynomial

Ω 1 sparsity τ 6 n− ( )).

Theorem 4.1.7. There exists an algorithm R parameterized by σ > 1, η ∈ (0, 1), such

n m that for every dictionary A ∈ ’ × with kAk 6 σ and every (k, τ)-nice distribution m O k {x} over ’ with k > k(η, σ)  O((log σ)/η) and τ 6 τ(k)  k− ( ), the algorithm O k {  } O k given n ( ) samples from y Ax outputs in time n ( ) vectors a10 ,..., a0m that are 1 2 O(η) / -close to the columns of A.

Since previous work [17] provides a black box reduction from dictionary learning to tensor decomposition, the theorem above follows from Theorem 4.1.6.

Our Theorem 4.1.5 implies a dictionary learning algorithm with better parameters for the case that the columns of A are separated.

147 4.1.3 Polynomial optimization with few global optima

Underlying our algorithms for the tensor decomposition is an algorithm for solving general systems of polynomial constraints with the property that the total number of different solutions is small and that there exists a short certificate for that fact in form of a sum-of-squares proof.

Let A be a system of polynomial constraints over real variables x  (x1,..., xd)

` and let P : ’d → ’d be a polynomial map of degree at most `—for example,

` d P(x)  x⊗ . We say that solutions a1,..., an ∈ ’ to A are unique under the map P if the vectors P(a1),..., P(an) are orthonormal up to error 0.01 (in spectral norm) and every solution a to A satisfies P(a) ≈ P(ai) for some i ∈ [n]. We encode this property algebraically by requiring that the constraints in A imply Ín h ( ) ( )i4 · k ( )k4 the constraint i1 P ai , P x > 0.99 P x . We say that the solutions a1,..., an are `-certifiably unique if in addition this implication has a degree-` sum-of-squares proof.

The following theorem shows that if polynomial constraints have certifiably unique solutions (under a given map P), then we can find them efficiently (under the map P).

Theorem 4.1.8 (Informal statement of Theorem 4.5.2). Given a system of polynomial constraints A and a polynomial map P such that there exists `-certifiably unique solutions

O ` a1,..., an for A, we can find in time d ( ) vectors 0.1-close to P(a1),..., P(an) in Hausdorff distance.

148 4.2 Techniques

Here is the basic idea behind using sum-of-squares for tensor decomposition:

d Let a1,..., an ∈ ’ be unit vectors and suppose we have access to their first three moments M1, M2, M3 as in (4.1.1). Since the task of recovering a1,..., an is easier the more moments we know, we would make a lot of progress if we could compute higher moments of a1,..., an, say the fourth moment M4.A natural approach toward that goal is to compute a probability distribution D over

d the sphere of ’ such that D matches the moments of a1,..., ak that we know, M 2 M 3 M i.e., ¾D u u  1, ¾D u u⊗  2, ¾D u u⊗  3, and then use the fourth ( ) ( ) ( ) 4 moment ¾D u⊗ as an estimate for M4.

There are two issues with this approach: (1) computing such a distribution D is intractable and (2) even if we could compute such a distribution it is not clear if its fourth moment will be close to the fourth moments M4 we are interested in.

We address issue (1) by relaxing D to be a pseudo-distribution (solution to sum-of-squares relaxations). Then, we can match the given moments efficiently.

Issue (2) is related to the uniqueness of the tensor decomposition, which relies on properties of the vectors a1,..., an. Here, the general strategy is to first prove that this uniqueness holds for actual distributions and then transfer the uniqueness proof to the sum-of-squares proof system, which would imply that uniqueness also holds for pseudo-distributions.

In subsection 4.2.1 below, we demonstrate our key rounding idea on the

(nearly) orthogonal tensor decomposition problem. Then in subsection 4.2.2 we discuss the high level insight for the robust 4th-order tensor decomposition

149 algorithm and in subsection 4.2.3 the techniques for random 3rd-order tensor decomposition.

4.2.1 Rounding pseudo-distributions by matrix diagonalization

Our main departure from previous tensor decomposition algorithms based on sum-of-squares [17, 39] lies in rounding: the procedure to extract an actual solution from a pseudo-distribution over solutions. The previous algorithms rounded a pseudo-distribution D by directly using the first moments (or the mean) ¾D u u, ( ) which requires D to concentrate strongly around the desired solution. Our approach here instead uses Jennrich’s (simultaneous) matrix diagonalization [46, 56], to extract the desired solution as a singular vector of a matrix of the form h i T ¾D u 1, u uu , for a random vector 1.4 This permits us to impose much weaker ( ) conditions on D.

For the rest of this subsection, we assume that we have an actual distribution

d D that is supported on vectors close to some orthonormal basis a1,..., ad of ’ , and we will design a rounding algorithm that extracts the vectors ai from the low-degree moments of D. This is a much simpler task than rounding from a pseudo-distribution, though it captures most of the essential difficulties. Since pseudo-distributions behave similarly to actual distributions on the low-degree moments, the techniques involved in rounding from actual distributions will turn out to be easily generalizable to the case of pseudo-distributions.

Let D be a distribution over the unit sphere in ’d. Suppose that this

4In previous treatments of simultaneous diagonalization, multiple matrices would be used for noise tolerance—increasing the confidence in the solution when more than one matrix agrees on a particular singular vector. This is unnecessary in our setting, since as we’ll see, the SoS framework itself suffices to certify the correctness of a solution.

150 distribution is supported on vectors close to some orthonormal basis a1,..., ad of ’d, in the sense that the distribution satisfies the constraint

( d ) Õ 3 hai , ui > 1 − ε . (4.2.1) i1 D u ( ) { h i − } Íd h i3 (This constraint implies maxi d ai , u > 1 ε D u because i1 ai , u 6 ∈[ ] ( ) h i maxi d ai , u by orthonormality.) The analysis of [17] shows that reweighing ∈[ ] 2k the distribution D by a function of the form u 7→ h1, ui for 1 ∼ N(0, Idd) and some k 6 O(log d) creates, with significant probability, a distribution D0 such that for one of the basis vectors ai, almost all of the probability mass of D0 is on vectors close to ai, in the sense that

2k max ¾ hai , ui > 1 − O(ε) , where D0(u) ∝ h1, ui D(u) . i d D u ∈[ ] 0( )

In this case, we can extract a vector close to one of the vectors ai by computing the mean ¾D u u of the reweighted distribution. This rounding procedure takes 0( ) quasi-polynomial time because it requires access to logarithmic-degree moments of the original pseudo-distribution D.

To avoid this quasi-polynomial running time, our strategy is to instead modify the original distribution D in order to create a small bias in one of the directions ai such that a modified moment matrix of D has a one-dimensional eigenspace close to ai. (This kind of modification is much less drastic than the kind of modification in previous works. Indeed, reweighing a distribution such that it concentrates around a particular vector seems to require logarithmic degree.)

Concretely, we will study the spectrum of matrices of the following form, for

1 ∼ N(0, Idd):

T M1  ¾ h1, ui · uu . D u ( )

151 Our goal is to show that with good probability, M1 has a one-dimensional eigenspace close to one of the vectors ai.

However, this is not actually true for a naïve distribution: although we have encoded the basis vectors ai into the distribution D by means of constraint (4.2.1), we cannot yet conclude that the eigenspaces of M1 have anything to do with them. We can understand this as the error allowed in (4.2.1) being highly under- constrained. For example, the distribution could be a uniform mixture of vectors of the form ai + εw for some fixed vector w, which causes w to become by far the most significant contribution to the spectrum of M1. More generally, an arbitrary spectrally small error could still completely displace all of the eigenspaces of M1.

An interpretation of this situation is that we have permitted D itself to contain a large amount of information that we do not actually possess. Constraint (4.2.1) is consistent with a wide range of possible solutions, yet in the pathological example above, the distribution does not at all reflect this uncertainty, instead settling arbitrarily on some particular biased solution: it is this bias that disrupts the usefulness of the rounding procedure.

A similar situation has previously arisen in strategies for rounding convex relaxations—specifically, when the variables of the relaxations were interpreted as the marginals of some probability distribution over solutions, then actual solutions were constructed by sampling from that distribution. In that context, a workaround was to sample those solutions from the maximum-entropy dis- tributions consistent with those marginals [?], to ensure that the distribution faithfully reflected the ignorance inherent in the relaxation solution rather than incorporating arbitrary information. Our situation differs in that it is the solution to the convex relaxation itself which is misbehaving, rather than some aspect of

152 the rounding process, but the same approach carries over here as well.

Therefore, suppose that D satisfies the maximum-entropy constraint k T k / ¾D u uu 6 1 n. This essentially enforces D to be a uniform distribution over ( ) vectors close to a1,..., an. For the sake of demonstration, we assume that D is a uniform distribution over a1,..., an. Moreover, since our algorithm is invariant under linear transformations, we may assume that the components a1,..., an

d are the standard basis vectors e1,..., en ∈ ’ . We first decompose M1 along the coordinate 11,

 1 · + 1  1 − 1 · M1 1 Me1 M10 , where 0 1 e1 .

Note that under our simplified assumption for D, by simple algebraic ma-  T  T nipulation we have Me1 ¾D u u1uu e1e1 . Moreover, by definition, ( ) 11 and 10 are independent. It turns out that the entropy constraint implies ¾ k k p · / 10 M10 . log d 1 n (using concentration bounds for Gaussian matrix series 1p [63]). Therefore, if we condition on the event 11 > η− log d, we have that  1 T + 1 T M1 1e1e1 M10 consists of two parts: a rank-1 single part 1e1e1 with with 1p eigenvalue larger than η− log d, and a noise part which has spectral norm at p most . log d. Hence, by the eigenvector perturbation theorem we have that the

1 2 top eigenvector is O(η / )-close to e1 as desired.

1p Taking η  0.1, we see with 1/poly(d) probability the event 11 > η− log d will happen, and therefore by repeating this procedure poly(d) times, we obtain a

1 2 vector that is O(η / )-close to e1. We can find other vectors similarly by repeating the process (in a slightly more delicate way), and the accuracy can also be boosted

(see Sections 5.6 and 4.5 for details).

153 4.2.2 Overcomplete fourth-order tensor

In this section, we give a high-level description of a robust sum-of-squares version of the tensor decomposition algorithm FOOBI [55]. For simplicity of the demonstration, we first work with the noiseless case where we are given a tensor

d 4 T ∈ (’ )⊗ of the form n  Õ 4 T ai⊗ . (4.2.2) i1

We will first review the key step of FOOBI algorithm and then show how to convert it into a sum-of-squares algorithm that will naturally be robust to noise.

To begin with, we observe that by viewing T as a d2 × d2 matrix of rank

2 n, we can easily find the span of the ai⊗ ’s by low-rank matrix factorization. However, since the low rank matrix factorization is only unique up to unitary

2 transformation, we are not able to recover the ai⊗ ’s from the subspace that they 2 live in. The key observation of [55] is that the ai⊗ ’s are actually the only “rank-1” vectors in the span, under a mild algebraic independence condition. Here, a d2-dimensional vector is called “rank-1” if it is a tensor product of two vectors of dimension d.

Lemma 4.2.1 ([55]). Suppose the following set of vectors is linearly independent, n o 2 ⊗ 2 − ( ⊗ ) 2 ai⊗ a⊗j ai a j ⊗ i , j . (4.2.3)

2 2 2 Then every vector x⊗ in the linear span of a1⊗ ,..., an⊗ is a multiple of one of the vectors 2 ai⊗ .

This observation leads to the algorithm FOOBI, which essentially looks for

2 rank-1 vectors in the span of ai⊗ ’s. The main drawback is that it uses simultaneous diagonalization as a sub-procedure, which is unlikely to tolerate noise better

154 than inverse polynomial in d, and in fact no noise tolerance guarantee has been explicitly shown for it before.

Our approach starts with rephrasing the original proof of Lemma 4.2.1 into the following SoS proof (which only uses polynomial inequalities that can be proved by SoS).

2  Ín · 2 Proof of Lemma 4.2.1. Let α1, . . . , αn be multipliers such that x⊗ i1 αi ai⊗ .5 Then, these multipliers satisfy the following quadratic equations:

4 Õ 2 2 x⊗  αi α j · a⊗ ⊗ a⊗ , i,j i j

4 Õ 2 x⊗  αi α j ·(ai ⊗ a j)⊗ . i,j

Together, the two equations imply that

Õ  2 2 2 0  αi α j · a⊗ ⊗ a⊗ − (ai ⊗ a j)⊗ . i,j i j

2 ⊗ 2 − ( ⊗ ) 2 By assumption, the vectors ai⊗ a⊗j ai a j ⊗ are linearly independent for Í 2 2  i , j. Therefore, from the equation above, we conclude i,j αi α j 0, meaning that at most one of αi can be non-zero. Furthermore this argument is a SoS proof,

D D since for any matrix A ∈ ’ × with linearly independent columns and any vector ∈ [ ]D k k2 1 k k2 polynomial v ’ x , the inequality v 6 2 Av can be proved by SoS σmin A ( ) (here σmin(A) denotes the least singular value of matrix A). So choosing A to be 2 ⊗ 2 − ( ⊗ ) 2 the matrix with columns ai⊗ a⊗j ai a j ⊗ for i , j and v to be the vector k k4 − k k4  with entries αi α j, we find by SoS proof that α 4 α 2 0. 

2 When there is noise present, we cannot find the true subspace of the ai⊗ ’s and instead we only have an approximation, denoted by V, of that subspace. We will

2 Ín 2 5technically, α1, . . . , αn are polynomials in x so that x   αi a⊗ holds ⊗ i 1 · i

155 modify the proof above by starting with a polynomial inequality

2 2 2 2 k IdV x⊗ k > (1 − δ)kx⊗ k , (4.2.4)

2 which constrains x⊗ to be close to the estimated subspace V (where δ is a small number that depends on error and condition number). Then an extension of the proof of Lemma 4.2.1 will show that equation (4.2.4) implies (via a SoS proof) that for some small enough δ,

Õ 2 2 ( ) αi α j 6 o 1 . (4.2.5) i,j

2 2 Note that α  Kx⊗ is a linear transformation of x⊗ , and furthermore K is 2 the pseudo-inverse of the matrix with columns ai⊗ . Moreover, if we assume for a moment that α has 2-norm 1 (which is not true in general), then the equation above further implies that

n Õh 2i4  k k4 − ( ) Ki , x⊗ α 4 > 1 o 1 , (4.2.6) i1

d2 where Ki ∈ ’ is the i-th row of K. This effectively gives us access to the 4-tensor Í 4 2 i Ki⊗ (which has ambient dimension d when flattened into a matrix), since equation (4.2.6) is anyway the constraint that would have been used by the SoS Í 4 algorithm if given the tensor i Ki⊗ as input. Note that because the Ki are not necessarily (close to) orthogonal, we cannot apply the SoS orthogonal tensor decomposition algorithm directly. However, since we are working with a 4-tensor

2 whose matrix flattening has higher dimension d , we can whiten Ki effectively in the SoS framework and then use the orthogonal SoS tensor decomposition algorithm to find the Ki’s, which will in turn yield the ai’s.

Many details were omitted in the heuristic argument above (for example, we assumed α to have norm 1). The full argument follows in Section 4.8.

156 4.2.3 Random overcomplete third-order tensor

In the random overcomplete setting, the input tensor is of the form n  Õ 3 + T ai⊗ E , i1 where each ai is drawn uniformly at random from the Euclidean unit sphere, we 1.5/( )O 1 k k have d < n 6 d log d ( ), and E is some noise tensor such that E 1 , 2,3 < ε { } { } or alternatively such that a constant-degree sum-of-squares relaxation of the injective norm of E is at most ε.

Our original rounding approach depends on the target vectors ai being orthonormal or nearly so. But when n  d in this overcomplete setting, orthonormality fails badly: the vectors ai are not even linearly independent.

We circumvent this problem by embedding the vectors ai in a larger ambient  2  2 space—specifically by taking the tensor powers a10 a1⊗ ,..., a0n an⊗ . Now the vectors a10 ,..., a0n are linearly independent (with probability 1) and actually close to orthonormal with high probability. Therefore, if we had access to the order-6 Í( ) 3  Í 6 tensor a0i ⊗ ai⊗ , then we could (almost) apply our rounding method to recover the vectors a0i.

The key here will be to use the sum-of-squares method to generate a pseudo- distribution over the unit sphere having T as its third-order moments tensor, and then to extract from it the set of order-6 pseudo-moments estimating the Í 6 moment tensor i ai⊗ . This pseudo-distribution would obey the constraint {( ⊗ ⊗ )T − } {Í h i3 − } u u u T > 1 ε , which implies the constraint i ai , u > 1 ε , saying, informally, that our pseudo-distribution is close to the actual uniform distribution

2 over {ai }. Substituting v  u⊗ , we obtain an implied pseudo-distribution in v { } which therefore ought to be close to the uniform distribution over a0i , and we

157 should therefore be able to round the order-3 pseudo-moments of v to recover { } a0i .

Í ( )( )T Only two preconditions need to be checked: first that i a0i a0i is not too large in spectral norm, and second that our pseudo-distribution in v satisfies the {Í h i3 − ( )} constraint i a0i , v > 1 O ε . The first precondition is true (except for a spurious eigenspace which can harmlessly be projected away) and is essentially equivalent to a line of matrix concentration arguments previously made in [?]. The second precondition follows from a line of constant-degree sum-of-squares proofs, notably extending arguments made in [39] stating that the constraints {Í h i3 − k k2  } i ai , u > 1 ε, u 1 imply with constant-degree sum-of-squares proofs {Í h ik − ( ) − ˜ ( / 3 2)} that i ai , u > 1 O ε O n d / for some higher powers k. The rigorous verification of these conditions is detailed in Section 4.7.

4.3 Preliminaries

Unless explicitly stated otherwise, O(·)-notation hides absolute multiplicative constants. Concretely, every occurrence of O(x) is a placeholder for some function f (x) that satisfies x ∈ ’. | f (x)| 6 C|x| for some absolute constant ∀ C > 0. Similarly, Ω(x) is a placeholder for a function 1(x) that satisfies x ∈ ∀ ’. |1(x)| > |x|/C for some absolute constant C > 0.

+ For a matrix A, let A denote the Moore-Penrose pseudo-inverse of A. For a

1 2 symmetric positive semidefinite matrix B, let B / denote the square root of B, i.e. the unique symmetric positive-semidefinite matrix L such that L2  B.

The Kronecker product of two matrices A and B is denoted by A ⊗ B. A useful

158 identity is that (A ⊗ B)(C ⊗ D)  (AC)⊗(BD) whenever the matrix multiplications are defined. The norm k · k denotes the Euclidean norm for vectors and the spectral norm for matrices.

∈ ( d) k d  Í ⊗ · · · ⊗ Let T ’ ⊗ be a k-tensor over ’ such that T i ,...,i Ti1 ik ei1 eik , 1 k ··· d where e1,..., ed is the standard basis of ’ . We say T is symmetric if the entries ( ) Ti1,...,ik are invariant under permuting the indices. The k index positions of T are called modes. The injective norm kTkinj is the maximum value of

d hT, x1 ⊗ · · · ⊗ xki over all vectors x1,..., xk ∈ ’ with kx1k  ···  kxk k  1.A useful class of multilinear operations on tensors has the form T 7→ (A1 ⊗· · ·⊗Ak)T, where A1,..., Ak are matrices with d columns. (This notation is the same as the Kronecker product notation for matrices, that is, (A1 ⊗ · · · ⊗ Ak)T  Í ( ) ⊗ · · · ⊗ ( ) i ,...,i Ti1 ik A1ei1 Ak eik .) If some of the matrices Ai are row vectors, 1 k ··· and the others are the identity matrix, then the corresponding operation is called

d 3 tensor contraction. For example, for a third-order tensor T ∈ (’ )⊗ and a vector 1 ∈ ’d, we call (Id ⊗ Id ⊗1T)T the contraction of the third mode of T with 1.

(Some authors use the notation T(Id, Id, 1) to denote this operation.)

For a bipartition A, B of the index set [k] of T, we let kTkA,B denote the spectral norm of the matrix unfolding TA,B of T with rows indexed by the indices in A and columns indexed by indices in B. Concretely,

k k  Õ · T A,B max Ti1 ik xiA yiB , d A d B x ’ ⊗| | , y ’ ⊗| | ··· ∈( ) ∈( ) i1,...,ik x 61, y 61 k k k k  ···  ···  Here, iA ia1 ia A and iB ib1 ib B are multi-indices, where A | | | | { }  { }  k k a1,..., a A and B b1,..., b B . For k 2, T 1 , 2 is the spectral norm of T | | | | { } { }  k k viewed as a d-by-d matrix. For k 3, T 1,2 , 3 is the spectral norm of T viewed { } { } as a d2-by-d matrix with rows indexed by the first two modes of T and columns k k indexed by the last index of T. For symmetric 3-tensors, all norms T 1,2 , 3 , { } { }

159 k k k k T 1,3 , 2 , and T 2,3 , 1 are the same. { } { } { } { }

4.3.1 Pseudo-distributions

Pseudo-distributions generalize probability distributions in a way that allows us to optimize efficiently over moments of pseudo-distributions. We represent a discrete probability distribution D over ’n by its probability mass function D : ’n → ’ such that D(x) is the probability of x under the distribution for every ∈ n Í ( )  x ’ . This function is nonnegative point-wise and satisfies x supp D D x 1. ∈ ( ) For pseudo-distributions we relax the nonnegative requirement and only require that the function passes a set of simple nonnegativity tests.

A degree-d pseudo-distribution over ’n is a finitely6 supported function D : ’n → Í ( )  Í ( ) ( )2 ’ such that x supp D D x 1 and x supp D D x f x > 0 for every function ∈ ( ) ∈ ( ) f : ’n → ’ of degree at most d/2. We define the pseudo-expectation of a (possibly vector-valued or matrix-valued) function f with respect to D as

def Õ ¾˜ D f  D(x) f (x) . x supp D ∈ ( ) In order to emphasize which variable is bound by the pseudo-expectation, we ( ) ( ) write ¾˜ D x f x . (This notation is useful if f x is a more complicated expression ( ) involving several variables.)

Note that a degree-∞ pseudo-distribution D satisfies D(x) > 0 for all x ∈ ’n. Therefore, D is an actual probability distribution (with finite support). The pseudo-expectation ¾˜ D f  ¾D f of a function f is its expected value under the distribution D. 6We restrict these functions to be finitely supported in order to avoid integrals and measurability issues. It turns out to be without loss of generality in our context.

160 Our algorithms will not work with pseudo-distributions (as finitely-supported functions on ’n) directly. Instead the algorithms will work with moment tensors ( ) d ¾˜ D x 1, x1,..., xn ⊗ of pseudo-distributions and the associated linear functional ( ) 7→ ( ) p ¾˜ D x p x on polynomials p of degree at most d. ( )

Unlike actual probability distribution, pseudo-distributions admit general, efficient optimization algorithms. In particular, the set of low-degree moments of pseudo-distributions has an efficient separation oracle.

O d Theorem 4.3.1 ([71, 64, 54]). For n, d ∈ Ž, the following set admits an n ( )-time weak separation oracle (in the sense of [43]),   d n ¾˜ (1, x1,..., xn)⊗ degree-d pseudo-distribution D over ’ . D x ( )

This theorem, together with the equivalence of separation and optimization

[43] allows us to solve a wide range of optimization and feasibility problems over pseudo-distributions efficiently.

The following definition captures what kind of linear constraints are induced on a pseudo-distribution over ’n by a system of polynomial constraints over ’n.

Definition 4.3.2. Let D be a degree-d pseudo-distribution over ’n. For a system of polynomial constraints A  { f1 > 0,..., fm > 0} with deg( fi) 6 ` for every i, we say that D satisfies the polynomial constraints A at degree `, denoted D |` A, if ˜ Î  ⊆ [ ] ¾D i S fi h > 0 for every S m and every sum-of-squares polynomial h on ∈ ’n with |S|` + deg h 6 d.

This is a relaxation (to pseudo-distributions) of the statement that the prob- ability mass of a true distribution contains only solutions to A. Indeed, if an

161 actual distribution D is supported on the solutions to A, then D satisfies D |` A regardless of the value of `.

We say that D satisfies A (without further specifying the degree) if D |` A  A for ` max f >0 deg f . We say that a system of polynomial constraints in { }⊆A variables x is explicitly bounded if it contains a constraint of the form {kxk2 6 M}. The following theorem follows from Theorem 4.3.1 and [43]. We give a proof in

Appendix ?? for completeness.

O d Theorem 4.3.3. There exists a (n + |A|) ( )-time algorithm that, given any explicitly bounded and satisfiable system A of polynomial constraints in n variables, outputs (up to arbitrary accuracy) a degree-d pseudo-distribution that satisfies A.

4.3.2 Sum of squares proofs

Let x  (x1,..., xn) be a tuple of indeterminates. Let ’[x] be the set of poly- nomials in these indeterminates with real coefficients. A polynomial p is a  2 + ··· + 2 sum-of-squares if there are polynomials q1,..., qr such that p q1 qr . Let f1,..., fr and 1 be multivariate polynomials in ’[x].A sum-of-squares proof that the constraints { f1 > 0,..., fr > 0} imply the constraint {1 > 0} consists of ( ) [ ] sum-of-squares polynomials pS S n in ’ x such that ⊆[ ] Õ Ö 1  pS · fi . S n i S ⊆[ ] ∈ ⊆ [ ] ( ·Î ) We say that this proof has degree ` if every set S n satisfies deg pS i S fi 6 ` ∈  Î (in particular, this would imply pS 0 for every set S such that deg i S fi > `). ∈ If there exists a degree-` sum-of-squares proof that { f1 > 0,..., fr > 0} implies {1 > 0}, we write

{ f1 > 0,..., fr > 0} `` {1 > 0} .

162 In order to emphasize the indeterminates for the proofs, we sometimes write

{ f1(x) > 0,..., fr(x) > 0} `x,` {1(x) > 0} .

Sum-of-squares proofs obey the following inference rules, for all polynomials f , 1 : ’n → ’ and F : ’n → ’m , G : ’n → ’k , H : ’p → ’n,

A `` { f > 0, 1 > 0} A `` { f > 0}, A `` {1 > 0} , 0 , A ` { + 1 } A ` + { · 1 } ` f > 0 ` `0 f > 0 (addition and multiplication)

A ` B, B ` C ` `0 , (transitivity) A `` ` C · 0 {F > 0} `` {G > 0} { ( )) } { ( ) } . (substitution) F H > 0 `` deg H G H > 0 · ( )

Sum-of-squares proofs are sound and complete for polynomial constraints over pseudo-distributions, in the sense that sum-of-squares proofs allow us to reason about what kind of polynomial constraints are satisfied by a pseudo-distribution. We defer the proofs of the following lemmas to Appendix ??.

Lemma 4.3.4 (Soundness). If D |` A for a pseudo-distribution D and there exists a sum-of-squares proof A `` B, then D |` ` B. 0 · 0

Lemma 4.3.5 (Completeness). Suppose d > `0 > `, and A is a collection of polynomial A ` { 2 + ··· + 2 } constraints with degree at most `, and x1 xn 6 B for some finite B. Let

{1 > 0} be a polynomial constraint with degree `0. If every degree-d pseudo-distribution | A | {1 } D that satisfies D ` also satisfies D `0 > 0 , then for every ε > 0, there is a sum-of-squares proof A `d {1 > −ε}. 7 7The completeness claim stated here does not match the strength of the corresponding soundness claim. This reflects an impreciseness in how we count the degrees of intermediate sum-of-squares proofs (in particular our degree accounting is not tight under proof composition), and does not reflect than the power of the proofs themselves.

163 4.3.3 Matrix constraints and sum-of-squares proofs

In sections 5.6 and 4.9, we still state positive-semidefiniteness constraints on matrices, which will be implied by sum-of-squares proofs. We define notation to express what it means for a matrix constraint to be implied by sum-of-squares. While the duality between proof systems and convex relaxations also holds in the matrix case [?], and it is possible to give a full treatment of matrix constraints in sum-of-squares, here we give an abridged and simplified treatment sufficient for our purposes.

Definition 4.3.6. Let A be a set of polynomial constraints in indeterminant x, and M is a symmetric p × p matrix with entries in ’[x]. Then we write A ``

{M  0} if there exists a set of polynomials q1(x),... qm(x) and a set of vectors ( ) ( ) [ ] A { } v1 x ,..., vm x of p-dimension with entries in ’ x such that ``i qi > 0 where `i + 2 deg(vi) 6 ` for every i, and

m Õ T M  qi(x)vi(x)vi(x) . (4.3.1) i1

The proof that sum-of-squares is sound for these matrix constraints is very similar to the analogous proof of Lemma 4.3.4 (see Appendix ??).

Lemma 4.3.7. Let D be a pseudo-distribution of degree d and d > ``0. Suppose | A A `  ¾˜ [ ]  D ` , and `0 M 0. Then M 0.

We now give some basic properties of these matrix sum-of-squares proofs.

Lemma 4.3.8. Suppose A, B0 are symmetric matrix polynomials such that ` {A  0, B  0}. Then ` {A ⊗ B  0}.

164  Ín ( ) ( ) ( )T  Ím ( ) ( ) ( )T Proof. Express A i1 qi x ui x ui x and B i1 ri x vi x vi x . Then

n m Õ Õ    T A ⊗ B  qi(x)rj(x) ui(x) ⊗ v j(x) ui(x) ⊗ v j(x) .  i1 j1

Lemma 4.3.9. Suppose A, B, A0, B0 are symmetric matrix polynomials such that `

{A  0, B0  0, A  A0, B  B0}. Then ` {A ⊗ B  A0 ⊗ B0, B ⊗ A  B0 ⊗ A0}.

Proof. By Lemma 4.3.8, we have ` A ⊗ (B − B0)  0 and ` (A − A0) ⊗ B0  0. Adding the two equations we complete the proof. We may also take the tensor powers in the other order. 

T 2 Lemma 4.3.10. Let u  [u1,..., ud] be an indeterminate. Then ` {uu  kuk ·Idd }.

Proof. The conclusion follows from the following explicit decomposition

2 T Õ T ` kuk Id −uu  (ui ej − u j ei)(ui ej − u j ei)  0  16i

4.4 Rounding pseudo-distributions

4.4.1 Rounding by matrix diagonalization

The following theorem analyzes a form of Jennrich’s algorithm for tensor de- composition through matrix diagonalization, when applied to the moments of a pseudo-distribution. We show that if the pseudo-distribution D(u) has

k good correlation with some vector a⊗ , then with good chance a simple random contraction of the (k + 2)-th moments of the pseudo-distribution will return a matrix with top eigenvector close to a.

165 Theorem 4.4.1 below is the key ingredient toward a polynomial-time algorithm. It states that in order for Jennrich’s approach to successfully extract a solution in polynomial time, the correlation of the desired solution with the (k + 2)-th moments of the pseudo-distribution only needs to be large compared to the spectral norm of the covariance matrix of the pseudo-distribution. This covariance matrix can be made as small as O(1/n) in spectral norm in many situations, including—as a toy example—when D is a uniform distribution over n orthogonal unit vectors. Therefore in this sense the condition (4.4.1) below is a fairly weak requirement, which is key to the polynomial-time algorithm in Section 4.5.1.

Theorem 4.4.1. Let k ∈ Ž be even and ε ∈ (0, 1). Let D be a degree-O(k) pseudo- d {k k2 } ∈ d distribution over ’ that satisfies u 6 1 D u , let a ’ be a unit vector. Suppose ( ) that   h ik+2 1 · T ¾˜ a, u > Ω √ ¾˜ D u uu . (4.4.1) D u ( ) ( ) ε k / O k 1 ∼ N( k) Then, with probability at least 1 d ( ) over the choice of 0, Idd⊗ , the top ? ? 2 eigenvector u of the following matrix M1 satisfies ha, u i > 1 − O(ε),

k T M1 : ¾˜ h1, u⊗ i · uu . (4.4.2) D u ( )

As before, we decompose M1 into two parts, with M k and M1 defined in a⊗ 0 analogy with M1.

k k k M1  h1, a i · M k + M1 where 1  1 − h1, a i · a . (4.4.3) ⊗ a⊗ 0 0 ⊗ ⊗

Our proof of Theorem 4.4.1 consists of two propositions: one about the good part M k and one about the noise part M1 . The first proposition shows that a⊗ 0 T M k is close to a multiple of aa in spectral norm (which means that the top a⊗ eigenvector of M k is close to a). a⊗

166 h ik+2 Proposition 4.4.2. In the setting of Theorem 4.4.1, for t  ¾˜ D u a, u , ( )

T M k − t · aa 6 O (ε) · t . (4.4.4) a⊗

The second proposition shows that M10 has small spectral norm in expectation.

k k Proposition 4.4.3. In the setting of Theorem 4.4.1, let 10  1 − h1, a⊗ i · a⊗ . Then, h ik+2 for t  ¾˜ D u a, u , ( )

¾ ( 2 )1 2 · M10 6 O ε k log d / t . 10

Before proving the above propositions, we demonstrate how they allow us to prove Theorem 4.4.1.

O k Proof of Theorem 4.4.1. We are to show that with probability 1/d ( ) over the choice

T of the Gaussian vector 1, there exists s ∈ ’ such that s · M1 − aa 6 O(ε). By Davis-Kahan Theorem (see Theorem ??), this implies the conclusion of h ik+2 ( )1 2 Theorem 4.4.1. Let t  ¾˜ D u a, u . For a parameter τ  Ω k log d / , we ( ) k bound the spectral norm conditioned on the event h1, a⊗ i > τ,

h 1 T k i ¾ M1 − aa h1, a⊗ i > τ 1 1,a k t h ⊗ i· h i 1 − T + 1 h ki 6 M k aa ¾ k M1 1, a⊗ > τ (by (4.4.3)) t a⊗ 1 1,a t 0 h ⊗ i· 1 T 1 k 6 M k − aa + · ¾ M1 (by independence of h1, a i and 1 ) t a⊗ τ t 0 ⊗ 0 · 10 ( ) + 1 · ( 2 )1 2 6 O ε τ O ε k log d / (by Proposition 4.4.2 and 4.4.3) 6 O(ε) . (4.4.5)

k By Markov’s inequality, it follows that conditioned on h1, a⊗ i > τ, the event

1 T M1 − aa 6 O(ε) has probability at least Ω(1). The theorem follows 1,a k t h ⊗ i· k O k because the event h1, a⊗ i > τ has probability at least d− ( ). 

167 T Proof of Proposition 4.4.2. We are to bound the spectral norm of M k − t · aa for a⊗ h ik+2 k T k T t  ¾˜ D u a, u . Let α  ¾˜ D u uu . Let Id1  aa be the projector onto the ( ) ( ) subspace spanned by a and let Id 1  Id − Id1 be the projector on the orthogonal − T complement. By our choice of t, we have Id M k Id  t · aa . 1 a⊗ 1

 − · k kT Since Id 1 Id1 0, we can upper bound the spectral norm of Ma k t a⊗ a⊗ , − ⊗ k − · k k ( − · ) k + k k + k k Ma k t Id1 6 Id1 Ma k t Id1 Id1 Id 1 Ma k Id 1 2 Id1 Ma k Id 1 ⊗ ⊗ − ⊗ − ⊗ − k k + k k 6 Id 1 Ma k Id 1 2 Id1 Ma k Id 1 − ⊗ − ⊗ −

(because Id M k Id  t · Id ) 1 a⊗ 1 1 k k + k k1 2 · k k1 2 6 Id 1 Ma k Id 1 2 Id1 Ma k Id1 / Id 1 Ma k Id 1 / − ⊗ − ⊗ − ⊗ −

(because M k  0) a⊗ √ k k + · k k1 2 6 Id 1 Ma k Id 1 2 α Id 1 Ma k Id 1 / . (4.4.6) − ⊗ − − ⊗ −

It remains to bound the spectral norm of Id 1 Ma k Id 1, − ⊗ −

k k  ˜ h ik · T Id 1 Ma k Id 1 ¾ a, u Id 1 uu Id 1 − ⊗ − D u − − ( ) 6 ¾˜ ha, uik ·(1 − ha, ui2) D u ( ) T 2 2 (because ` Id 1 uu Id 1  (kuk − ha1, ui ) Id) − − 2 2 ¾˜ h i ` + k 2 ·( − 2) 2 6 k 2 a, u (using k 2 x − 1 x 6 k 2 ; see below) − D u − ( ) 2 6 · α (4.4.7) k − 2 k 2 ·( − 2) 2 ∈ ’ Basic calculus shows that the inequality x − 1 x 6 k 2 holds for all x . − Since it is a true univariate polynomial inequality in x, it has a sum-of-squares proof with degree no larger than the degree of the involved polynomials, which is k + 2 in our case.

Combining (4.4.6) and (4.4.7), yields as desired that

T  1  M k − t · aa 6 O · α 6 O (ε) · t , a⊗ √k

168 h i where the second step uses the condition of Theorem 4.4.1 on t  ¾˜ D u a, u ( ) k T k and α  ¾˜ D u uu .  ( )

h ki · T Proof of Proposition 4.4.3. The matrix M1  ¾˜ D u 10, u⊗ uu , whose spectral 0 ( ) norm we are to bound, is a random contraction of the third-order tensor T  ⊗ ⊗ ( k) ¾˜ D u u u u⊗ . Corollary 4.6.6 gives the following bound on the expected ( ) norm of a random contraction in terms of spectral norms of two matrix unfoldings of T—which turn out to be the same in our case due to the symmetry of T.

¾k k ( )1 2 · {k k k k }  ( )1 2 · ¾˜ k T M10 6 O log d / max T 1 , 23 , T 2 , 13 O log d / u⊗ u . 10 { } { } { } { } D u ( ) (4.4.8) Theorem 4.6.1 shows that for any pseudo-distribution D that satisfies {kuk2 6 1},

k T T ¾˜ u⊗ u 6 ¾˜ uu . (4.4.9) D u D u ( ) ( ) The statement of the lemma follows by combining the previous bounds (4.4.8) and (4.4.9),

¾k k ( )1 2· ¾˜ k T ( )1 2· ¾˜ T ( 2 )1 2· M10 6 O log d / u⊗ u 6 O log d / uu 6 O ε k log d / t , 10 D u D u ( ) ( ) h ik+2 using condition (4.4.1) of Theorem 4.4.1 which yields t  ¾˜ D u a, u > √ ( )  1 T Ω (ε k)− k¾˜ D uu k. 

4.4.2 Improving accuracy of a found solution

We need one more technical ingredient before analyzing our main algorithm.

Previously, the run-time of the sum-of-squares algorithm in [17] (on which our algorithm is based) depended exponentially on the accuracy parameter 1/ε, and we give here a simple boosting technique that allows us to remove this dependency and achieve polynomially small error.

169 Here we have a set of nearly isotropic vectors a1,..., an. We give a sum- Ín h i4 of-squares proof that if i1 ai , u is only ε off from its maximum possible value, and if u has constant correlation with some ai, then u must in fact be

(1−O(ε))-correlated with ai. Intuitively, the former constraint forces D to roughly be a mixture distribution over vectors that are close to a1,..., an, and the latter one forces it to actually only be close to ai. We then briefly show how this proof implies an algorithm to boost the accuracy when we already know a vector b that is 0.01-close to a solution, by solving for a pseudo-distribution with the added constraint {hb, ui2 > 0.9}.

d Theorem 4.4.4. Let ε > 0 be smaller than some constant. Let a1,..., an ∈ ’ be unit kÍn T k + vectors such that i1 ai ai 6 1 ε. Define the following systems of constraints, for each j ∈ [n] or unit vector b ∈ ’d:

n n 2 Õ 4 2 1 o A j : kuk 6 1, hai , ui > 1 − ε, ha j , ui > i1 2 D u ( ) n n 2 Õ 4 2 o Bb : kuk 6 1, hai , ui > 1 − ε, hb, ui > 0.9 . i1 D u ( ) 2 2 Then A j `4 {ha j , ui > 1 − 10ε} for all j ∈ [n], and also Bb ` A j and hai , bi > 0.8 for some j ∈ [n].

Proof. We have the following sum-of-squares proof:

n Õ 4 A `u,4 1 − ε 6 hai , ui i1 2 4 Õ 2 6 ha j , ui + hai , ui (by adding only square terms) i,j 4 22 6 ha j , ui + 1 + ε − ha j , ui Ín h i2 ( + )k k2 (using `u i1 ai , u 6 1 ε u ) h i2 + 1 +  + − h i2 6 a j , u 2 ε 1 ε a j , u 2 (since ` 1/2 6 ha j , ui 6 1)

170 1 −  h i2 + 1 + 6 2 ε a j , u 2 2ε , (4.4.10)

2 which means that A `u,4 ha j , ui > 1 − 10ε for ε > 0 small enough.

To show that Bb ` Ai for some i, it is enough to show that if Bb is consistent

(i.e. there exists a pseudo-distribution satisfying Bb), then there exists i ∈ [n]

2 2 such that hai , bi > 0.8, because it implies {hai , ui > 1/2} by triangle inequality.

2 For the sake of contradiction, assume that hai , bi < 0.8 for all i ∈ [n]. Then, by

2 triangle inequality (see Lemma ??), Bb ` { i ∈ [n]. hai , ui 6 0.99} which when ∀ T 2 combined with ai ai 6 1 + ε using substitution, contradicts the assumption B {Ín h i4 − } that b ` i1 ai , u > 1 ε for small enough ε > 0. 

d | Corollary 4.4.5. Let D be a degree-` pseudo-distribution over ’ such that D ` 4 / Bb, with Bb as defined in Theorem 4.4.4. Then, there exists i ∈ [n] such that 2 − 2 2 ( ) h i2 ¾˜ D u u⊗ a⊗ 6 O ε and ai , b > 0.8. ( ) i

2 Proof. By Theorem 4.4.4, Bb `4 {hai , ui > 1 − 10ε} for some i. It follows by Lemma 4.3.4 that 2   ¾˜ 2 − 2 − ¾˜ 2 2  − ¾˜ h i2 u⊗ ai⊗ 6 2 2 u⊗ , ai⊗ 2 2 ai , u 6 20ε .  D u D u D u ( ) ( ) ( )

4.5 Decomposition with sum-of-squares

In this section, we give a generic sum-of-squares algorithm (Algorithm1 and

Theorem 4.5.2) that will be used for various different settings in the following subsections (Section 4.5.2 for orthogonal tensors, Section 4.5.3 for tensors with separated components), and in the section 4.7 for random 3-tensor and Section 4.8 for robust FOOBI.

171 4.5.1 General algorithm for tensor decomposition

In this section, we provide a general sum-of-squares tensor decomposition that serve as the main building block for sections later. We will need the following lemma, which appears in [17, Proof of Lemma 6.1].

d Lemma 4.5.1. Let ε ∈ (0, 1) and {a1,..., an } be a set of unit vectors in ’ with kÍn T k + ∈ i1 ai ai 6 1 ε. Then, for all even integers k Ž, there exists a sum-of-squares proof that

( n ) ( n ) k k2 Õh i4 − Õh ik+2 − ( ) u 6 1, ai , u > 1 ε `u, k+2 ai , u > 1 O kε . (4.5.1) i1 i1

Our main algorithm below finds the solutions to a system of polynomial constraints A, when given a “hint” in the form of a polynomial transformation of formal variables P(·). Roughly P should be an “orthogonalizing” map so that if a1,..., an are the desired solutions to the constraints A, then P(a1),..., P(an) are k Ín ( ) ( )T k + nearly an orthonormal basis, or more precisely i0 P ai P ai 6 1 ε while 2 kP(ai)k > 1 − ε for all i. We then only require that a sum-of-squares proof exists certifying that the solutions to A after being mapped by P are actually close to ( ) ( ) A {Ín h ( ) ( )i4 − } P a1 ,... P an ; more precisely, that `` i1 P ai , P u > 1 ε u for some `. The existence of this sum-of-squares certificate then allows us to recover the solutions P(ai) up to O(ε) accuracy by solving for pseudo-distributions and then rounding them.

We later show how Algorithm 1 can be applied to a variety of tensor rank decomposition problems by the design of an appropriate orthogonalizing trans- form P. For example, in Section 4.7 P(·) orthogonalizes an overcomplete tensor by lifting the variables to a higher-dimensional space, and P(·) serves as a whitening transformation on a far-from-orthogonal tensor in Section 4.8.

172 The main technical difficulty in this analysis was in making the run-time polynomial (as opposed to quasi-polynomial in [17]) for the nearly-orthogonal case where P is the identity transform.

O ` Theorem 4.5.2. For every ` ∈ Ž, there exists an n ( )-time algorithm (see Algorithm 1) with the following property: Let ε > 0 be smaller than some constant. Let d, d0 ∈ Ž be d d d numbers. Let P : ’ → ’ 0 be a polynomial with deg P 6 `. Let {a1,..., an } ⊆ ’ be

d a set of vectors such that b1  P(a1),..., bn  P(an) ∈ ’ 0 all have norm at least 1 − ε kÍn T k + A and i1 bi bi 6 1 ε. Let be a system of polynomial inequalities in variables u  (u1,..., ud) such that the vectors a1,..., an satisfy A and

( n ) Õ 4 4 A `u,` hbi , P(u)i > (1 − ε) kP(u)k . (4.5.2) i1

A { } ⊆ ’d0 Then, the algorithm on input and P outputs a set of unit vectors b10 ,..., b0n such that

   2 2 ( ) 2 ( ) 2 ( )1 2 distH b1⊗ ,..., bn⊗ , b10 ⊗ ,..., b0n ⊗ 6 O ε / .

Proof of Theorem 4.5.2. We analyze Algorithm 1. By Corollary 4.4.5, if there exists a pseudo-distribution D0(u) that satisfies constraints (4.5.5), then the top ( ) ( )T ( )1 2 eigenvector of ¾˜ D u P u P u is O ε / -close to one of the vectors b1,..., bn. 0( ) {h ( ) i } The fact that we add in step 4, the constraint P u , b0i 6 0.1 also implies by

Corollary 4.4.5 that in some iteration i, we can never find a vector b0i that is close to one vector b0j from a previous iteration j < i. Therefore, it remains to show that in each of the n iterations with high probability we can find a pseudo-distribution

D0(u) that satisfies (4.5.5).

Consider a particular iteration i0 ∈ [n] of Algorithm 1. We may assume that the vectors b0 ,..., b0 are close to b1,..., bi 1. First we claim that there 1 i0 1 0 − −

173 Algorithm 1 General tensor decomposition algorithm Parameters: numbers ε > 0, n, ` ∈ Ž. Given: system A of polynomial inequalities over ’d and polynomial P : ’d → ’d0. ∈ ’d0 Find: vectors b10 ,..., b0n . Algorithm:

• For i from 1 to n, do the following: 1. Compute a degree-(k + 2)` pseudo-distribution D(u) over ’d, with k  O(1), that satisfies the constraints

A ∪ {1 + ε > kP(u)k2 > 1 − ε} + ( ) ( )T 1 ε ¾˜ D u P u P u 6 . (4.5.3) ( ) n − i + 1

1 T 2. Choose standard Gaussian vectors 1( ) ,..., 1( ) ∼ N(0, Id k ) and d0 O 1 ( ) T  d ( ) and compute the top eigenvectors of the following matrices for all t ∈ [T]:

t k T d d ¾˜ h1( ) , P(u)⊗ i · P(u)P(u) ∈ ’ 0× 0 . (4.5.4) D u ( ) 3. Check if for one of the normalized top eigenvectors b? computed in the previous step, there exists a degree-4` pseudo-distribution D0(u) that satisfies the constraints

A ∪ 1 + ε > kP(u)k2 > 1 − ε, hb?, P(u)i2 > 0.99 . (4.5.5)

b ¾˜ P(u)P(u)T 4. Set 0i to be the top eigenvector of the matrix D0 u and add A {h ( ) i2 } ( ) to the constraint P u , b0i 6 0.01 . exists a pseudo-distribution satisfying conditions (4.5.3) in step 1, including the additional constraints added to A in previous iterations. Indeed, the uniform distribution over vectors ai ,..., an satisfies all of those conditions. By assumption, A {Ín h ( )i4 − } we have a sum-of-squares proof `u,` i1 bi , P u > 1 ε . Lemma 4.5.1 A {Ín h ( )ik − ( )} then implies `u, k+2 ` i1 bi , P u > 1 O kε for an absolute constant ( ) parameter k to be determined later. Since A includes the added constraints {h ( )i2 h ( )i2 } Ín T 2 + ( ) b1, P u 6 0.1,..., bi0 1, P u 6 0.1 , it follows by i bi bi 6 1 O ε − 1

174 A ` {Íi0 1h ( )ik ( )k 2·( + ( ))} and substitution that i−1 bi , P u 6 0.1 − 1 O ε , here choosing k ( )k 2 ·( + ( )) A {Ín h ( )ik } so that 0.1 − 1 O ε 6 0.001. Therefore, ` k+2 ` ii bi , P u > 0.99 ( ) 0 ˜ Ín h ( )ik ( + ) and so ¾D u ii bi , P u > 0.99 for any degree- k 2 ` pseudo-distribution ( ) 0 D that satisfies constraints (4.5.3). In particular, by averaging, there exists an

? index i ∈ {i0,..., n} such that

k 0.99 T ¾˜ hbi? , P(u)i > > 0.9 · ¾˜ P(u)P(u) . D u n − i0 + 1 D u ( ) ( ) By Theorem 4.4.1, for each of the matrices (4.5.4) in step 2, its top eigenvector

O 1 is 0.001-close to bi? with probability at least d− ( ). Therefore, we find at least

Ω 1 one of those vectors with probability no smaller than 1 − d ( ). In this case, a pseudo-distribution D0(u) as required in step 3 exists, as an atomic distribution supported only on bi? is an example that satisfies the conditions. 

4.5.2 Tensors with orthogonal components

We apply Theorem 4.5.2 to orthogonal tensors with noise.

Theorem (Restatement of Theorem 4.1.1). There exists a polynomial-time algorithm

d 3 d that given a symmetric 3-tensor T ∈ (’ )⊗ outputs a set of vectors {a0 ,..., a0 } ⊆ ’ 1 n0 d such that for every orthonormal set {a1,..., an } ⊆ ’ , the Hausdorff distance between the two sets is at most

n { }  2 ( )· − Õ 3 distH a1,..., an , a10 ,..., a0n 6 O 1 T ai⊗ . (4.5.6) 0 i1 1 , 2,3 { } { }

 3 Proof. We feed Algorithm1 with the inputs P(u)  u and A  hT, u⊗ i > 1 − ε  k k  − Í 3 where ε E 1 , 2,3 and E T i a⊗ . We have { } { } i n Õ 3 3 3 A `4 hai , ui  hT, u⊗ i − hE, u⊗ i i1

175 3  hT, u⊗ i − ε

> 1 − 2ε .

Here at the second line we used that

h 3i k k ` E, u⊗ 6 E 1 , 2,3 6 ε . (4.5.7) { } { } We verify that A satisfies the requirement (4.5.2),

n n ! n ! Õ 4 Õ 4 Õ 2 A ` hai , ui > hai , ui hai , ui (using orthonormality) i1 i1 i1 n !2 Õ 3 > hai , ui (Cauchy-Schwarz: Lemma ??) i1 > 1 − 4ε .

Therefore calling Algorithm1, we can recover aˆi which is, up to sign flip, close to

1 2 ai with error O(ε / ). We determine the sign by finding the τ ∈ {−1, +1} such h ˆ 3i − ˆ that T, τai⊗ > 1 ε and set the output a0i to τai. 

Remark 4.5.3. Note that in the proof of Theorem 4.1.1, the conclusion of equa- tion (4.5.7) is the only thing we used about the error term E. Therefore, define the following SoS relaxation of the injective norm: h i k k  k k2 h 3i E SoS inf u 6 1 ` E, u⊗ 6 c . c ’ ∈ Then we can replace the right hand side of equation (4.5.6) by O(1)· n T − Í a 3 i1 i⊗ SoS.

4.5.3 Tensors with separated components

The following lemma shows that for separated vectors the sum of higher-order outer products has spectral norm that decrease exponentially with the tensor

176 order.

d Lemma 4.5.4. Let a1,..., an be unit vectors in ’ . Then, for every k ∈ Ž, n k n +   Õ T k 1 Õ T ai ai ⊗( ) 6 1 + max |hai , a ji| · ai ai . i,j i1 i1

+  Í T k 1 ∈ ( d) k+1 Proof. Let A i ai ai ⊗ . For a unit vector x ’ ⊗ we’ll bound the quadratic form xTAx.

First, without loss of generality we can assume that x is in the subspace V { k+1} spanned by ai⊗ i. This is because if x had a component y orthogonal to V, T k+1  ∈ [ ]  then y ai⊗ 0 for all i n by definition, so that Ay 0 and y can make no nonzero contribution to the quadratic form above.

1 2 + Also let W  (A / ) so that W is a whitening transform and WAW is a  Í k+1 Í 2  k k2  projector onto V. Then suppose x i ciWai⊗ , so that i ci x 1. Then

T  Õ ( k+1)T k+1 x Ax ci c j ai⊗ WAWa⊗j ij  Õ 2 + Õ h ik+1 ci ci c j ai , a j i i,j   k Õ 2 + |h i| Õ h i 6 ci max ai , a j ci c j ai , a j i,j i i,j   k n Õ T 6 1 + max |hai , a ji| ai ai , i,j i1  Í T  ( 1 2)+ where in the last step we let A0 i ai ai and W0 A0 / , and apply the Í h i  Í T  T k k  inequality i,j ci c j ai , a j ij ci c j ai W0A0W0a j x0 A0x0 6 A0 , where x0 Í i ciW0ai is a unit vector. 

d d k k 2 Lemma 4.5.5. Let a ∈ ’ and b ∈ (’ )⊗ be unit vectors such that ha⊗ , bi > 1 − ε.

k 1 Let B be the reshaping of the vector b into a d-by-d − matrix. Then the top left singular

d 2 vector a0 ∈ ’ of B satisfies ha0, ai > 1 − O(ε).

177 k Proof. Let c be the top right singular vector of B. Then, ha0⊗c, bi > ha⊗ , bi > 1−ε.

1 2 k 1 Therefore, ka0 ⊗ c − bk 6 O(ε) / . By triangle inequality, ka0 ⊗ c − a ⊗ a⊗ − k 6

1 2 k 1 O(ε) / , which means that as desired |ha, a0i| > ha0 ⊗ c, a ⊗ a⊗( − )i > 1−O(ε). 

Theorem (Restatement of Theorem 4.1.5). There exists an algorithm A with poly- nomial running time (in the size of its input) such that for all η, ρ ∈ (0, 1) and { } ⊆ d kÍn T k σ > 1, for every set of unit vectors a1,..., an ’ with i1 ai ai 6 σ and |h i| ∈ ( d) k maxi,j ai , a j 6 ρ, when the algorithm is given a symmetric k-tensor T ’ ⊗ with +  1 log σ  d k > O · log(1/η), then its output A(T) is a set of vectors {a0 ,..., a0 } ⊆ ’ log ρ 1 n0 such that

 2  n  { 2 2} { 2 2} + − Õ k distH a0⊗ ,..., a0⊗n , a⊗ ,..., an⊗ 6 O η T ai⊗ . 1 1 i1 1,..., k 2 , k 2 +1,...,k { b / c} {b / c } (4.5.8)

 − Í k Proof. We use Algorithm 1 from Theorem 4.5.2. Let E T i ai⊗ . We may k k assume that E 1,..., k 2 , k 2 +1,...,k 6 η, since otherwise the theorem follows { b / c} {b / c }  k k from the case when η E 1,..., k 2 , k 2 +1,...,k . Let P be the polynomial map { b / c} {b / c } k 4 P(x)  x⊗d / e and let A be the system of polynomial inequalities

k 2 A  {hT, u⊗ i > 1 − η, kuk  1} . (4.5.9)

 ( ) k k in variables u u1,..., ud . Since E 1,..., k 2 , k 2 +1,...,k 6 η, all of the vectors { b / c} {b / c } a1,..., an satisfy A. Let b1,..., bn be the unit vectors bi  P(ai). By Lemma 4.5.4 kÍ T k + k 4 + and the condition on k, these vectors satisfy i bi bi 6 1 ρd / e σ 6 1 η. Then, we have the following sum-of-squares proof

n n n A Õh ( )i4 Õh i4 k 4 Õh ik h ki − h ki `u,k bi , P u  ai , u d / e > ai , u  T, u⊗ E, u⊗ i1 i1 i1 (4.5.10)

− − k k − > 1 η T 1,..., k 2 , k 2 +1,...,k > 1 2η . (4.5.11) { b / c} {b / c }

178 It follows that A and P satisfy the conditions of Theorem 4.5.2. Thus, Algorithm 1 A on input and P recovers vectors b10 ,..., b0n with Hausdorff distance at most √ O( η) from b1,..., bn. By Lemma 4.5.5, the top left singular vectors of the √ k 4 1 ( ) d-by-dd / e− matrix reshapenings of b10 ,..., b0n are O η -close to the vectors a1,..., an up to sign. (If k is odd, then we may determine the signs of the ai by h ki − ( ) h ki − + ( ) checking if T, a0i ⊗ > 1 O η or T, a0i ⊗ 6 1 O η for each output vector a0i.) 

4.6 Spectral norms and tensor operations

In this section, we provide several bounds regarding the spectral norms of moments of the lifted vectors, and the spectral norm of random contraction of a tensor, which are crucial in our analysis in previous sections. We suggest readers who are more interested in applications of the algorithms jump to Section 4.7 and 4.8.

4.6.1 Spectral norms and pseudo-distributions

Theorem 4.6.1. Let D be a degree-4(p + q) pseudo-distribution over ’d that satisfies {k k2 } ∈ u 6 1 D u . Then, for all p, q Ž, ( )

p q T T ¾˜ u⊗ u⊗ 6 ¾˜ uu . (4.6.1) D u D u ( ) ( )

The theorem follows by combining Lemma 4.6.2 and Lemma 4.6.4 proved below. Lemma 4.6.2 reduces Theorem 4.6.1 to the case when p  q.

179 Lemma 4.6.2. Let D be a degree-4(p + q) pseudo-distribution over ’d that satisfies {k k2 } ∈ u 6 1 D u . Then, for all p, q Ž, ( ) 2 p q T p  p T q  q T ¾˜ u⊗ u⊗ 6 ¾˜ u⊗ u⊗ · ¾˜ u⊗ u⊗ . (4.6.2) D u D u D u ( ) ( ) ( )

d p d q Proof. For all unit vectors x ∈ (’ )⊗ and y ∈ (’ )⊗   p q T p q hx, ¾˜ u⊗ u⊗ yi  ¾˜ hx, u⊗ ihu⊗ , yi D u D u ( ) ( )  1 2  1 2 p 2 / q 2 / 6 ¾˜ hx, u⊗ i · ¾˜ hu⊗ , yi D u D u ( ) ( ) (Cauchy–Schwarz for pseudo-expectations) 1 2 1 2 p  p T / q  q T / 6 ¾˜ u⊗ u⊗ · ¾˜ u⊗ u⊗ . (4.6.3) D u D u ( ) ( ) The lemma follows from this bound by choosing x and y as the top left and right

p q T singular vectors of the matrix ¾˜ D u u⊗ (u⊗ ) .  ( )

Towards proving Theorem 4.6.1 for the case of p  q, we first establish the following lemma which says that tensoring with vector with norm less 1 won’t increase the spectral norm.

Lemma 4.6.3. Let 1(u, v) be a polynomial in indeterminates u, v. Let D be a degree-4 d {k k2 ( ) } pseudo-distribution over ’ that satisfies u 6 1, 1 u, v > 0 D u,v . Then, for all ( ) p ∈ Ž,

T T ¾˜ 1(u, v) (u ⊗ v)(u ⊗ v) 6 ¾˜ 1(u, v)vv . (4.6.4) D u,v D v ( ) ( )

Proof. We have the sum-of-squares proof that

 T ` 1(u, v) Id ⊗vv> − (u ⊗ v)(u ⊗ v)

 T  1(u, v) Id −uu> ⊗ vv

180 2 2  1(u, v) 1 − kuk Id ⊗vv> + 1(u, v)(kuk Id −uu>) ⊗ vv>

2 2  0 (by 1 − kuk > 0 and ` kuk Id −uu>  0 (Lemma 4.3.10))

Therefore, we obtain that

h Ti ¾˜ [1(u, v) Id ⊗vv>] − ¾˜ 1(u, v) (u ⊗ v)(u ⊗ v)  0 D u,v D u,v ( ) ( ) The desired inequality follows,

T T ¾˜ 1(u, v) (u ⊗ v)(u ⊗ v) 6 ¾˜ Id ⊗1(u, v)vv>  ¾˜ 1(u, v)vv D u,v D u,v D v ( ) ( ) ( ) 

The following statement follows straightforward from the Lemma 4.6.3 by induction on p.

Lemma 4.6.4. Let D be a degree-4p pseudo-distribution over ’d that satisfies {kuk2 6 } ∈ 1 D u . Then, for all p Ž, ( ) p  T T ¾˜ uu ⊗ 6 ¾˜ uu . (4.6.5) D u D u ( ) ( )

4.6.2 Spectral norm of random contraction

The following theorem shows that a random contraction of a 3-tensor has p spectral norm at most O( log d) factor larger than the spectral norm of its matrix unfoldings.

p q r Theorem 4.6.5. Let T ∈ ’ ⊗ ’ ⊗ ’ be an order-3 tensor. Let 1 ∈ N(0, Idr). Then for any t > 0,   ⊗ ⊗ T · k k k k ( + )· t2 2  Id Id 1 T > t max T 1 , 2,3 , T 2 , 1,3 6 2 p q e− / , 1 1 , 2 { } { } { } { } { } { } (4.6.6)

181 and consequently,8

  ⊗ ⊗ T ( ( + ))1 2 · k k k k ¾ Id Id 1 T 6 O log p q / max T 1 , 2,3 , T 2 , 1,3 . 1 1 , 2 { } { } { } { } { } { } (4.6.7)

T Proof. Let Ti denote the ith third-mode slice of T so that Ti  Id ⊗ Id ⊗ei T reshaped as a p-by-q matrix. Note that when regarded as a p-by-q matrix, the

T contraction Id ⊗ Id ⊗1 T is a Gaussian matrix series with coefficients T1,..., Tr, so that r T Õ Id ⊗ Id ⊗1 T  1iTi , 1 , 2 i1 { } { } where 11,..., 1r are independent standard Gaussians with 1i  h1, eii. Therefore, by concentration of Gaussian matrix series [63, Theorem 1] (also see [77, Corollary 4.2]), we have

 T t2 2  k Id ⊗ Id ⊗1 Tk > tσ 6 2(p + q)e− / ,

  Í T Í T 1 2 where σ max i TiTi , i Ti Ti / .

For U and V sets of indices, let TU,V denote the matrix unfolding of T with rows indexed by U and columns indexed by V, so that kTkU,V  kTU,V k. We Í T  ( )T( ) Í T  ( )T( ) claim that i TiTi T 1 , 2,3 T 1 , 2,3 and i Ti Ti T 2 , 1,3 T 2 , 1,3 , { } { } { } { } { } { } { } { } which completes the proof. These identities are forced by the observations that both of these objects are matrix quantities that are quadratic in T, with the first object being a sum over the 2nd and 3rd indices of the two copies of T, and the second object being a sum over the 1st and 3rd indices. 

The following corollary of Theorem 4.6.5 handles a larger class of random contractions. 8For large enough p and q, the constant hidden in the big-Oh notation below is at most 2

182 Corollary 4.6.6. Let T ∈ ’p ⊗ ’q ⊗ ’r be an order-3 tensor. Let 1 ∼ N(0, Σ) with covariance matrix Σ satisfying 0  Σ  Idr. Then for any t > 0,   ⊗ ⊗ T · k k k k ( + )· t2 2  Id Id 1 T > t max T 1 , 2,3 , T 2 , 1,3 6 4 p q e− / . 1 1 , 2 { } { } { } { } { } { } (4.6.8)

Proof. We reduce to the case Σ  Idp and apply Theorem 4.6.5. Concretely, let 10 

1 + h and 100  1 − h where h is a random variable with distribution N(0, Idp −Σ) that is independent of 1. By this construction, 10 and 100 both have marginal N( ) 1  1 (1 + 1 ) distribution 0, Idp , and 2 0 00 . Therefore we can invoke Theorem 4.6.5  {k k k k } for random variables 10 and 100. Letting σ max T 1 , 2,3 , T 2 , 1,3 , using { } { } { } { } the union bound and the triangle inequality, we have that

n T o  Id ⊗ Id ⊗1 T > tσ 1 1 , 2 { } { }    T   Id ⊗ Id ⊗(10 + 100) T > 2tσ 1,h 1 , 2  { } { }   T  T 6  Id ⊗ Id ⊗(10) T + Id ⊗ Id ⊗(100) T > 2tσ 1,h 1 , 2 1 , 2  { } { }   { } { }   T  T 6  Id ⊗ Id ⊗(10) T > tσ +  Id ⊗ Id ⊗(100) T > tσ 1,h 1 , 2 1,h 1 , 2 { } { } { } { } t2 2 6 4(p + q)· e− / , where the second line uses the triangle inequality, the third line uses the union bound, and the fourth line uses Theorem 4.6.5 applied to 10 and 100. 

Corollary 4.6.6 and Theorem 4.6.1 together imply the following theorem..

Theorem 4.6.7. Let k ∈ Ž and D be a degree-(4k + 10) pseudo-distribution over ’d {k k2 } ∼ N( ) that satisfies u 6 1 D u . Let 1 0, Σ be a Gaussian vector with covariance ( ) Σ  k Idd⊗ . Then,

k T p T ¾ ¾˜ h1, u⊗ i · uu - k log d ¾˜ uu . (4.6.9) 1 0,Id D u D u ∼N( d) ( ) ( )

183 We can apply Corollary 4.6.6 repeatedly to obtain a bound for random contraction over a larger number of modes.

p q r1 rs Theorem 4.6.8. Let T ∈ ’ × × ×···× be an order-(s + 2) tensor, and 11 ∼

N(0, Σ1),..., 1s ∼ N(0, Σs) be independent Gaussian random variables with covariance  ∈ [ ]  { + } Σi Idri for each i r . Let r¯ maxi s ri 2 . Then for any t > 0, ∈[ ]   T T s  s 1 t2 2  Id ⊗ Id ⊗11 ⊗ ... ⊗ 1s T > t · max kTkS,Sc 6 4(p+q) r¯ − e− / . 1 1 , 2 S s :1 S,2

Proof. We prove by induction on s. The base case is exactly Corollary 4.6.6. For s > 2, suppose we have proved the (s − 1)-case.

T T Let T0  Id ⊗ Id ⊗ Id ⊗12 · · · ⊗ 1s T be an order-3 tensor. Then we have that T T T Id ⊗ Id ⊗11 ⊗ ... ⊗ 1s T  Id ⊗ Id ⊗11 T0 .

Then using Corollary 4.6.6 on T0 and 11, and then taking the expectation over

12,..., 1s, we have   ⊗ ⊗ T ⊗ ⊗ T · k k k k ( + ) t2 2  Id Id 11 ... 1s T > t max T0 1 , 2,3 , T0 2 , 1,3 6 4 p q e− / . 11,...,1s 1 , 2 { } { } { } { } { } { } (4.6.11)

We view T0 as an order-(s + 1) tensor by merging the 2nd and 3rd modes, that T T is, Id ⊗(Id ⊗ Id) ⊗ 12 · · · ⊗ 1s T, and then apply the inductive hypothesis. We obtain   k k s 1 · k k ( + ) s 2 · t2 2  T0 1 , 2,3 > t − max T S,Sc 6 4 p qr1 r¯ − e− / . 12,...,1s { } { } S s :1 S, 2,3 S ⊂[ ] ∈ { }∩ ∅ (4.6.12)

184 Similarly, we have that   k k s 1 · k k ( + ) s 2 · t2 2  T0 1,3 , 2 > t − max T S,Sc 6 4 pr1 q r¯ − e− / . 12,...,1s { } { } S s : 1,3 S,2

Using equations (4.6.11), (4.6.12), (4.6.13), and applying union bound we obtain   T T s  s 1 t2 2  Id ⊗ Id ⊗11 ⊗ · · · ⊗ 1s T > t · max kTkS,Sc 6 4(p+q) r¯ − e− / , 1 1 , 2 S s :1 S,2



4.7 Decomposition of random overcomplete 3-tensors

In this section, we assume that we are given a random 3rd order overcomplete symmetric tensor T of the following form

n Õ 3 T  a⊗ + E , (4.7.1) i1 i

1.5 O 1 where n 6 d /(log d) ( ), the vectors ai are drawn independently at random k k from the Euclidean unit sphere, and the error tensor E satisfies E 1 , 2,3 6 ε. { } { }

d 2 Let Idsym be the projection to the symmetric subspace of (’ )⊗ (the span of 2 d 1 Íd 2 d 2 all x⊗ for x ∈ ’ ), and let Φ   e⊗ ∈ (’ )⊗ . Let Idsym be the projection √d i 1 i 0 to the subspace orthogonal to Φ:

 −ΦΦT Idsym0 Idsym . (4.7.2)

ω 1 Theorem (Restatement of Theorem 4.1.2). With probability 1 − d− ( ) over the choice

d d 3 of random unit vectors a1,..., an ∈ ’ , when given a symmetric 3-tensor T ∈ (’ )⊗

185 Algorithm 2 Polynomial-time algorithm for random overcomplete 3-tensor decomposition d 3 Input: Number ε > 0 and n ∈ Ž and symmetric tensor T ∈ (’ )⊗ of the form (4.7.1). d Find: aˆ1,..., aˆn ∈ ’ . Algorithm: 1. Call Algorithm 1 with

 3 2 A  hT, u⊗ i > 1 − ε, kuk  1 , (4.7.3) ( )  2 P u Idsym0 u⊗ , (4.7.4)

where Idsym is defined in (4.7.2). Suppose the outputs of Algorithm 1 are ˆ ˆ 0 b1,..., bn. ˆ 2. Let aˆi be τi the top eigenvector of the matrix reshaping of bi, where τi ∈ {1, −1} is chosen so that Taˆ1 > 0.

d as input, the output aˆ1,..., aˆn ∈ ’ of Algorithm 2 satisfies

 2  Ω 1 n  { ˆ ˆ } { } n ( ) + − Õ 3 distH a1,..., an , a1,..., an 6 O T ai⊗ . d1.5 i1 1 , 2,3 { } { } (4.7.5)

Theorem 4.1.2 follows immediately from Theorem 4.5.2 and the following proposition:

ω 1 Proposition 4.7.1. With probability 1 − d− ( ) over the choice of random unit vectors a1,..., an, the parameters P(·) and A defined in Algorithm 2 satisfy the requirements of  ( )  2 Theorem 4.5.2. In particular, let ci P ai Idsym0 ai⊗ . Then

n Õ T + ci ci 6 1 δ , (4.7.6) i1 √ where δ  O˜ ( n/d + n/d1.5), and

n Õ 4   4 A ` hci , P(u)i > 1 − O(ε + 1/d) kP(u)k . (4.7.7) i1

186 We first show that by a simple extension of [39, Theorem 4.2 and Lemma A h i8 ≈ 2 8], implies that the sum of the terms ai , u is large. Note that ci ai⊗ and 2 P(u) ≈ u⊗ , and therefore this is already fairly close to our target inequality (4.7.7).

Lemma 4.7.2 (Simple extension of [39, Theorem 4.2 and Lemma 8]). With proba-

ω 1 bility 1 − d− ( ) over the choice of random unit vectors ai,

( n ) ( n ) Õ 3 2 Õ 8 A ` hai , ui > 1 − ε, kuk  1 ` hai , ui > 1 − O(ε) − δ . (4.7.8) i1 i1

 ( / 3 2) where δ Oe n d / .

Proof. Using the proof of [39, Theorem 4.2] (specifically Lemma 3 and Claim 1), and the proof of Lemma 5 (specifically equation (11) and equation (15)) we have9

( n n ) Õ 4 Õ 6 A ` hu, aii > 1 − O(ε) − δ, and hu, aii > 1 − O(ε) − δ . (4.7.9) i1 i1

O 1 3 2 where δ  O(n log ( ) d/d / ). Then we extend the proof using the same idea to higher powers:

!2 *" # +2 2 n n n Õh i6  Õh i5 Õh i5 ` u, ai u, ai ai , u 6 u, ai ai i1 i1 i1 n Õ 10 Õ 5 5  hu, aii + hu, aii hu, a ji hai , a ji i1 i,j n n ! n ! Õ 10 Õ 4 Õ 4 6 hu, aii + hu, aii hu, aii max |hai , a ji| , i,j i1 i1 i1 (4.7.10) where the first line uses the Cauchy-Schwarz inequality (Lemma ??) and the last line uses the fact that D(u) satisfies the constraint −1 6 hu, aii 6 1. By [39,

d 9Technically [39] only proved the case when the vectors ai are uniform over 1 √d , though the proofs work for the uniform distribution over the unit sphere as well. {± / }

187 Lemma 2], we have that n Õ 4 A ` hu, aii 6 1 + δ . (4.7.11) i1 Combining the equation above, equation (4.7.10), equation (4.7.9), and the fact √ that with high probability hai , a ji 6 Oe(1/ d), we obtain n Õ 10 A ` hu, aii > 1 − O(ε) − δ . (4.7.12) i1

2 Therefore, using the fact that hu, aii 6 1, we complete the proof. 

d Lemma 4.7.3 (Rephrasing of [?, Lemma 5.9]). Let a1,..., an ∈ ’ be independent ran-

1.5 O 1 dom vectors drawn uniformly from the Euclidean unit sphere with 1 6 n 6 d /log ( ) d,  2 and let C be the matrix with columns ci Idsym0 ai⊗ . Then

T k C C − Idn k 6 δ , (4.7.13) √ where δ  O˜ ( n/d + n/d1.5).

Though [?, Lemma 5.9] assumes n > d, its proof can also handle n 6 d if the √ error bound is relaxed to O˜ ( n/d). See specifically the end of the first paragraph of its proof. Also while [?, Lemma 5.9] assumes Gaussian random vectors, its proof reduces to the case on the unit sphere. Therefore we omit the proof of

Lemma 4.7.3.

Finally we prove Proposition 4.7.1.

Proof of Proposition 4.7.1. Equation (4.7.6) follows from Lemma 4.7.3. To prove

2 equation (4.7.3), we essentially just replace ai⊗ in equation (4.7.8) by ci and bound the approximation error. We have

Õn Õn Õn A ` h 2 i4  h 2 2i4  h 2 2i − hΦ 2i4 Idsym0 u⊗ , ci u⊗ , Idsym0 ai⊗ u⊗ , ai⊗ , ai⊗ i1 i1 i1

188 n  Õ h 2 2i − / 4 u⊗ , ai⊗ 1 d i1 n Õ 8 > (1 − 1/d) hu, aii − O(1/d) i1 3 2 > 1 − O(ε) − O(1/d) − O˜ (n/d / ) . (4.7.14) where the second last step uses ` (x − y2)4 > (1 − y2)x4 − O(y2), and the last step uses equation (4.7.8). 

4.8 Robust decomposition of overcomplete 4-tensors

In this section we provide a sum-of-squares version of the FOOBI algorithm [55].  Ín 4 FOOBI yields the rank decomposition of a 4th order tensor T i1 ai⊗ under { 2 ⊗ 2 − ( ⊗ ) 2} the mild condition that the set ai⊗ a⊗j ai a j ⊗ i,j is linearly independent. FOOBI has not been formally shown to be robust to noise, though it’s believed to tolerate spectral noise with magnitude up to some inverse polynomial of dimension. In contrast, the noise tolerance of our sum-of-squares version depends only on the condition number of certain matrices, and not directly on the dimension.

In Section 4.8.3 we additionally show, under a smoothed analysis model where each component ai of the input tensor is randomly perturbed, that the relevant condition numbers are never smaller than some inverse polynomial of the dimension, with high probability over the random perturbations.

Throughout this section we will work with an input tensor T of the form

n  Õ 4 + T ai⊗ E i1

189 where n 6 d2 and E is a symmetric noise tensor with bounded spectral norm k k E 1,2 , 3,4 . { } { }

For a matrix M, we use σmax(M), σmin(M) to denote its largest and smallest singular values respectively, and σk(M) to denote its kth largest singular value.

∈ ’d2 n 2  Let A × be the matrix with columns ai⊗ for i 1,..., n. The guarantees of our algorithm will depend on the following 4th order condition number of A:

∈ ’d2 n 2  Definition 4.8.1. For a full rank matrix A × with columns ai⊗ for i 1,..., n, let κ(A) defined as

( )  1.5 ( )/ 1.5( ) + 2.5 ( )/( 2 ( ) 0.5( )) κ A σmax Q σn Q σmax Q σmin B σn Q . (4.8.1)

 T ∈ ’d4 n n 1  2 ⊗ 2 − where Q AA and B × ( − ) is the matrix with columns bi,j ai⊗ a⊗j 2 (ai ⊗ a j)⊗ for every i , j.

d 4 Theorem 4.8.2 (Restatement of Theorem 4.1.3). Let δ > 0. Let T ∈ (’ )⊗ be a { } ⊆ ’d  − Ín 4 symmetric 4-tensor and a1,..., an be a set of vectors. Define E T i1 ai⊗ 2 k k ( T) and define A as the matrix with columns ai⊗ . If E 1,2 , 3,4 6 δ σn AA and { } { } Algorithm 3 outputs {aˆ1,..., aˆn } on input T, then there exists a permutation π : [n] → [n] so that for every i ∈ [n],

− ˆ ( ) k k ai aπ i 6 O δ κ A ai . (4.8.2) ( )

Let Qe be the best rank-n approximation of the d2 × d2 matrix reshaping of T, and let Se be the column space of Qe. These two objects serve as our initial best-guess approximations of Q  AAT and the subspace S spanned by { 2 2} a1⊗ ,..., an⊗ (also the column space of Q), which we do not have access to. We ∈ ’d4 n n 1  2 ⊗ 2 − ( ⊗ ) 2 define B × ( − ) as the matrix with columns bi,j ai⊗ a⊗j ai a j ⊗ for i , j.

190 Algorithm 3 Sum-of-Squares FOOBI for robust overcomplete 4-tensor decompo- sition d 4 Input: Number δ > 0 and symmetric tensor T ∈ (’ )⊗ . d Find: aˆ1,..., aˆn ∈ ’ . Algorithm: 1. Compute the best rank-n approximation10 Qe of the d2 ×d2 matrix reshaping of T. Let Sebe the column span of Qe. 2. Run Algorithm 1 with inputs P(·) and A set to

( )  ( +)1 2 2 P x Qe / x⊗ , (4.8.3) A  k x 2k2 ( − ) kxk4 IdSe ⊗ > 1 3δ x . (4.8.4)

Suppose the algorithm outputs cˆ1,..., cˆn. T 3. Output aˆ1,..., aˆn such that for each i ∈ [n], the matrix aˆi aˆi is the best 1 2 ˆ rank-1 approximation of the matrix reshaping of Qe / ci.

One of the core techniques in the analysis will be to use (the following rephrased version of) Davis and Kahan’s “sin θ” Theorem, which bounds the principle angle between the column spaces of two matrices that are spectrally close to each other.

Theorem 4.8.3 (Direct consequence of Davis-Kahan Theorem [?]). Suppose symmet- ∈ D D ∈ D D k − k ( ) ric PSD matrices Q ’ × and Qe ’ × of rank n 6 D satisfy Q Qe 6 δ σn Q . 1 Let S and Sebe the column spaces of Q and Qe respectively, and assume δ 6 2 . Then we have

def (S S)  k − k  k − k /( − ) sin , e IdS IdSe IdS IdSe IdS IdSe 6 δ 1 δ . (4.8.5)

Consequently, k − k O( ) IdS IdSe 6 δ . (4.8.6)

Theorem 4.8.2 follows from the analysis of Algorithm 1 as given in The- orem 4.5.2, as long as we can verify its two conditions, which we restate in

191 the following two propositions. While Proposition 4.8.4 follows quickly from Theorem 4.8.3, we prove Proposition 4.8.5 over the next three subsections.

Proposition 4.8.4. Let P(x) and A be as defined in Algorithm 3. Then each vector a1,..., an satisfies A.

Proof. k a 2k k a 2k − k( − ) a 2k ( − By Theorem 4.8.3, IdSe i⊗ > IdS i⊗ IdS IdSe i⊗ > 1 4 2δ) kai k . 

Proposition 4.8.5. Let P(x) and A be as defined in Algorithm 3. Then

n Õ 4 4 A `8 hP(ai), P(x)i > (1 − τ) kP(x)k , i1 ( 2 ( )/ 2 ( )) + ( ( )/ ( )) where τ 6 O δ σmax Q σmin B O δ σmax Q σn Q .

Proof of Theorem 4.8.2. By Theorem 4.5.2 along with Proposition 4.8.4 and Propo- sition 4.8.5, step 2 in Algorithm 3 must yield vectors cˆ1,..., cˆn that are re- ( ) ( ) ( ) ( 2 ( )/ 2 ( )) + spectively O τ -close to P a1 ,..., P an , where τ 6 O δ σmax Q σmin B

O(δ σmax(Q)/σn(Q)). Then

k 2 − 1 2 ˆ k k 2 − 1 2 ( )k + k 1 2( ( ) − ˆ )k ai⊗ Qe / ci 6 ai⊗ Qe / P ai Qe / P ai ci

2 2 1 2 ka − a k + / (Q)· O( ) 6 i⊗ IdSe i⊗ σmax τ 2 1 2 k( − )a k + / (Q)· O( ) 6 IdS IdSe i⊗ σmax τ ( 1 2 ( )) + 1 2 ( )· ( ) 6 O δ σmax/ Q σmax/ Q O τ 1 2 ( )· ( ) 6 σmax/ Q O τ 1 2 ( )/ 1 2( )· ( ) · k 2k 6 σmax/ Q σn/ Q O τ ai⊗  ( ( )) k 2k O δ κ A ai⊗ .

1 2 ˆ Therefore taking the best rank-1 approximation of the matrix reshaping of Qe / ci gives an O(δ κ(A))-approximation of ai. 

192 4.8.1 Noiseless case

 Í 4  We first prove Proposition 4.8.5 in the noiseless case, when T ai⊗ and Qe Q and Se  S. In this scenario, we find that the left-hand side of the conclusion of Proposition 4.8.5 becomes

n n Õh ( ) ( )i4  Õ ( +)1 2 2 ( +)1 2 2 4 P ai , P x Q / ai⊗ , Q / x⊗ i1 i1 n h i4  Õ ( 2)T + 2 ai⊗ Q x⊗ i1  T + 2 4 A Q x⊗ 4 .

k ( )k4 k( +)1 2 2k4 The term P x 2 on the right becomes Q / x⊗ 2. Thus Proposition 4.8.5 becomes

Proposition 4.8.6 (Noiseless Proposition 4.8.5). Let

A  k 2k2 ( − ) k k4 0 IdS x⊗ 2 > 1 c δ x 2 x , (4.8.7) for some constant c > 0. Then

4 4 A ` T + 2 ( − ) ( +)1 2 2 0 8 A Q x⊗ 4 > 1 τ Q / x⊗ 2 ,

( 2 ( )/ 2 ( )) where τ 6 O δ σmax Q σmin B , where c is treated as a constant in the big-O notation.

2 2 Proof. We write x⊗ as a linear combination of the vectors ai⊗ plus some term orthogonal to S.

` 2  2 + 2 x⊗ IdS x⊗ IdS⊥ x⊗ " n #  Õ 2 + 2 αi ai⊗ IdS⊥ x⊗ , i1

193 + 2 where α  A x⊗ is a n-dimensional vector with polynomial entries. Since  T T + 2  k( +)1 2 2k4  k k4 Q AA , it follows that A Q x⊗ α and Q / x⊗ 2 α 2, so that it will k k4 ( − ) k k4 suffice to show α 4 > 1 τ α 2.

4 We consider x⊗ :

" n # " n # 4  2 ⊗ 2  Õ 2 ⊗ Õ 2 + ` x⊗ x⊗ x⊗ αi ai⊗ αi ai⊗ ζ i1 i1    Õ 2 2   αi α j a⊗ ⊗ a⊗  + ζ , (4.8.8)  i j  i,j n   ∈[ ]   ( x 2 + x 2) 2 − ( x 2) 2 A ` k k2 with the error term ζ IdS ⊗ IdS⊥ ⊗ ⊗ IdS ⊗ ⊗ , so that 0 ζ 2 6 O( ) kxk8 A ` k x 2k2 O( ) kxk4 A δ 2, since 0 IdS⊥ ⊗ 2 6 δ 2 from the definition of 0.

4 Since x⊗ is invariant with respect to permutation of its tensor modes, we can also write it as " n # 4 Õ 2 ` x⊗  αi α j (ai ⊗ a j)⊗ + ζ0 , (4.8.9) i1 A ` k k2 ( ) k k8 where similarly 0 ζ0 2 6 O δ x 2.

Therefore, taking the difference of constraints (4.8.8) and (4.8.9) and recalling  2 ⊗ 2 − ( ⊗ ) 2 the definition of B being the matrix with columns bi,j ai⊗ a⊗j ai a j ⊗ , we obtain 2

Õ 2 8 A0 ` αi α j bi,j  kζ0 − ζk 6 O(δ) kxk , 2 2 i,j 2 k k2 2 ( ) k k2 so that therefore since Bv 2 > σmin B v 2 for all vectors v, 2

Õ 2 2 2 Õ 8 A0 ` α α · σ (B) 6 αi α j bij 6 O(δ) kxk i j min 2 i,j i,j 2  2 ( ) ( ) 2 + 2 ( ) 2 ( ) k k4 6 O δ σmax Q x⊗ Q x⊗ 6 O δ σmax Q α 2 .

194 Hence, substituting in the above inequality,

A k k4  k k4 − Õ 2 2 ( − ( 2 ( )/ 2 ( ))) k k4 0 ` α 4 α 2 αi α j > 1 O δ σmax Q σmin B α 2 .  i,j

4.8.2 Noisy case

At this point we’ve proved a version of Proposition 4.8.5 in the special case where there is no noise. In order to handle noise we need to show two things: first that the noisy set of polynomial constraints A used in Algorithm 3 implies the noiseless version A0 from Proposition 4.8.6, and second that the desired conclusion of A in Proposition 4.8.5 follows from its noiseless counterpart in Proposition 4.8.6.

The first step follows immediately from Theorem 4.8.3:

Lemma 4.8.7. Let A be defined as in Algorithm 3 and A0 be defined as in Proposi- tion 4.8.6. Then A ` A0.

Proof. k x 2k2  k x 2k2 − (x 2)T( − ) x 2 ( − By Theorem 4.8.3, IdS ⊗ 2 IdSe ⊗ 2 ⊗ IdSe IdS ⊗ > 1 ( )) k k4 O δ x 2. 

For the second step, we have by Proposition 4.8.6 and Lemma 4.8.7 a statement of the form 4 4 A ` T + 2 ( − ) ( +)1 2 2 8 A Q x⊗ 4 > 1 τ0 Q / x⊗ 2 , but to prove Proposition 4.8.5 we need a statement of the form (after expanding out the function P)

4 4 A ` T + 2 ( − ) ( +)1 2 2 ` A Qe x⊗ 4 > 1 τ Qe / x⊗ 2 .

195 + Thus what remains is to show that not too much is lost when we approximate Q + with Qe .

∈ D D ∈ D D Lemma 4.8.8. Suppose symmetric PSD matrices Q ’ × and Qe ’ × both of k − k ( ) k ( + − +)k ( ) rank n 6 D satisfy Q Qe 6 δ σn Q . Then Q Q Qe 6 O δ and similarly 1 2 ( +)1 2 − ( +)1 2 ( ) Q / Q / Qe / 6 O δ .

Proof. kQQ+ − QQ+k k − k O( ) S S By Theorem 4.8.3, ee 6 IdS IdSe 6 δ , where and e are the column spaces of Q and Qe respectively. Then by adding and subtracting + a term of QQe , + + + Q(Q − Qe ) + (Q − Qe) Qe 6 O(δ) .

By triangle inequality,

+ + + + Q(Q − Qe ) 6 O(δ) + (Q − Qe) Qe 6 O(δ) + kQ − Qek · kQe k 6 O(δ) .

1 2 ( +)1 2 − ( +)1 2 The analogous result for Q / Q / Qe / is obtained by substituting ( +)1 2 + ( +)1 2 + Q / for Q and Qe / for Qe in the above argument. 

We also show that not too much is lost when approximating a vector with high `4/`2 ratio.

Lemma 4.8.9. For γ, β, τ 6 1/2, let B be the set of polynomial inequalities

B  − k k2 k k2 − k k2 k k2 k − k2 k k2 ∪ k k4 ( − )k k4 β v 2 6 u 2 v 2 6 β v 2 , u v 2 6 γ v 2 v 4 > 1 τ v 2 .

Then we have √ B k k4 ( − − ( + )) k k4 `4 u 4 > 1 τ O γ β u 2 .

k − k2 k k2 ` ( − )2 k k2 Proof of Lemma 4.8.9. We have that u v 2 6 γ v 2 ui vi 6 γ v . B ` k k2 ( + )k k2 B ` ( + )2 k + k2 Moreover, since u 2 6 1 γ v 2 we have that ui vi 6 u v 2 6

196 (k k2) O v 2 . Therefore it follows that √ √ B 2 − 2  ( − )( + ) / ·( + )2 + / ·( − )2 ` vi ui vi ui ui vi 6 γ 2 ui vi 1 γ ui vi √ ( k k2) 6 O γ v 2 , where we used the AM-GM inequality. It follows that for every i, √ B Õ 2 − 2 k k2 − k k2 + 2 − 2 (( + ) k k2) ` u j v j 6 u v vi ui 6 O γ β v 2 . j,i Therefore by two rounds of adding-and-subtracting,

Õ Õ Õ Õ Õ Õ Õ Õ B ` 2 2  2 © 2 − 2ª + 2 © 2 − 2ª + 2 2 ui u j ui ­ u j v j ® vi ­ u j v j ® vi v j i,j i j,i j,i i j,i j,i i,j « √ ¬ « √ ¬ Õ 2 · (( + ) k k2) + Õ 2 · (( + ) k k2) + Õ 2 2 6 ui O γ β v 2 vi O γ β v 2 vi v j i i i,j √  ( + ) (k k2k k2 + k k4) + k k4 − k k4 O γ β u 2 v 2 v 2 v 4 v 2 √ ( + ( + )) k k4 6 τ O γ β v 2 √ ( + ( + )) k k4 6 τ O γ β u 2 .

k k2 − k k2 k k2 Here in the second line we used the axiom that u 2 v 2 6 β v 2, the k k4 ( − )k k4 second-to-last line uses the axiom v 4 > 1 τ v 2, and the last one uses k k2 − k k2 − k k2 k k2 ( − ) 1k k2 u 2 v 2 > β v 2 so that v 2 6 1 β − u 2. Rearranging the final inequality above we obtain the desired result. 

Proof of Proposition 4.8.5. We know by Proposition 4.8.6 and Lemma 4.8.7

4 4 A ` T + 2 ( − ) ( +)1 2 2 8 A Q x⊗ 4 > 1 τ0 Q / x⊗ 2 , (4.8.10)

( 2 ( )/ 2 ( )) where τ0 6 O δ σmax Q σmin B .

By Lemma 4.8.8,

2 2 ` T + 2 − T + 2  T( + − +) 2 A Q x⊗ A Qe x⊗ 2 A Q Qe x⊗ 2

197  + ( + − +) 2 2 A Q Q Qe x⊗ 2 k +k2 · ( 2) · k k4 6 A 2 O δ x 2 ( 2) 1( ) k k4 6 O δ σn− Q x 2 .

+ + Also, using Lemma 4.8.8 after adding and subtracting a term of Q QQe ,

2 T T ` T + 2k2 − k T + 2  ( 2) + T + 2 − ( 2) + T + 2 A Q x⊗ 2 A Qe x⊗ 2 x⊗ Q AA Q x⊗ x⊗ Qe AA Qe x⊗  ( 2)T[ + + − + +] 2 x⊗ Q QQ Qe QQe x⊗  ( 2)T[ + ( + − +) + ( + − +) +] 2 x⊗ Q Q Q Qe Q Qe QQe x⊗ ( 2)T[ ( ) + + ( ) +] 2 6 x⊗ O δ Q O δ Qe x⊗ ( ) 1( ) k k4 6 O δ σn− Q x 2 , and similarly in the other direction. Furthermore,

2 ` T + 2  + 2 2 1 ( ) k k4 A Q x⊗ 2 A x⊗ 2 > σmax− Q x 2 .

Combining the above three inequalities with Lemma 4.8.9 and (4.8.10),

4 4 A ` T + 2 ( − ) T + 2 8 A Qe x⊗ 4 > 1 τ A Qe x⊗ 2 ,

2 ( )/ 2 ( ) + ( )/ ( ) where τ 6 O δ σmax Q σmin B δ σmax Q σn Q .

+ 1 2 Finally, using the fact that A Q / is a whitened matrix and therefore has orthonormal rows and then using triangle inequality with Lemma 4.8.8,

2 2 ` T + 2  + 1 2 · 1 2( +)1 2 ·( +)1 2 2 A Qe x⊗ 2 A Q / Q / Qe / Qe / x⊗ 2  1 2( +)1 2 ·( +)1 2 2 2 Q / Qe / Qe / x⊗ 2 1 2( +)1 2 ·( +)1 2 2 2 − 1 2[( +)1 2 − ( +)1 2]·( +)1 2 2 2 > Q / Q / Qe / x⊗ 2 Q / Q / Qe / Qe / x⊗ 2 ·( +)1 2 2 2 − ( )· ( +)1 2 2 2 > IdS Qe / x⊗ 2 O δ Qe / x⊗ 2 ( − ( )) ( +)1 2 2 2 > 1 O δ Qe / x⊗ 2 .

Combining the above two inequalities we obtain the theorem. 

198 4.8.3 Condition number under smooth analysis

In this section we prove that the condition number κ(A) is at least inverse polynomial under the smooth analysis framework [?]. We work with the same

ρ-perturbation model as introduced by [24]: Each a˜i is generated by adding a ρ Gaussian random variable with covariance matrix d Idd to ai. We are given a Ín 4 symmetric 4th order tensor i1 a˜i⊗ (with noise). Let Ae be the corresponding 2 ( ) matrix with columns a˜i⊗ . We will give an upper bound on κ Ae . Suppose the vectors ai have bounded norm; then σmax(Q) is bounded, and therefore an upper ( ) ( ) ( T) bound on κ Ae follows from establishing lower bounds on σmin eB and σmin AeAe . The lower bound on the latter follows from [24] and therefore we focus on the former.

d2 Theorem 4.8.10. Let n 6 10 and a˜1,..., a˜n be independent ρ-perturbations of ∈ ’d4 n n 1 ˜  2 ⊗ 2 −( ⊗ ) 2 a1,..., an. Let eB × ( − ) be the matrix with columns bij a˜i⊗ a˜⊗j a˜i a˜j ⊗ − (− Ω 1 ) ( ) ( / ) for i , j. Then with probability 1 exp d ( ) , we have σmin eB > poly 1 d, ρ .

We will bound the smallest singular value using the leave-one-out distance defined by [?].

d n Lemma 4.8.11. [?] For matrix A ∈ ’ × with columns Ai , i ∈ [n], let S i be the span − ( )  k( − ) k ( ) of the columns without Ai, and d A mini n Id IdS i Ai . Then σmin A > ∈[ ] − 1 d(A). √n

To bound d(eB) from below, we use [24, Theorem 3.9] as our main tool.

Theorem 4.8.12. [24, Theorem 3.9] Let δ ∈ (0, 1) be a constant and W be an operator n` m ( ) ∈ d from ’ to ’ such that σδn` W > η. Then for any a1,..., a` ’ and their

ρ-perturbations a˜1,..., a˜`,

h ` O 3` i 1 3`  W(a˜1 ⊗ · · · ⊗ a˜`) > ηρ d− ( ) 6 1 − exp(−δd / ) (4.8.11)

199 Towards bounding the least singular value of eB using Theorem 4.8.12, we need to address two issues that don’t exist in [24]. The first one is that Theorem 4.8.12 requires a˜1,..., a˜` to be independent perturbations of a1,..., a`. However, we need to deal with a˜i ⊗a˜i ⊗a˜j ⊗a˜j which is a correlated perturbation of ai ⊗ai ⊗a j ⊗a j. We will use (a simpler version of) the decoupling technique of [?] and focus on a sub-matrix of eB where the noise is un-correlated.

The second difficulty is that the columns of eB are also correlated since each a˜i is used in n columns. Therefore when the leave-one-out distance of eB is under consideration, the column eBij and the subspace of the rest of the columns have correlated randomness, which prevents us from using Theorem 4.8.12 directly.

We will address this issue by projecting eBij into a smaller subspace which is un-correlated with eBij and then apply Theorem 4.8.12.

Proof of Theorem 4.8.10. We partition [d] into 4 disjoint subsets L1, L2, L3, L4 of / × × × size d 4. Let eB0 be the set of rows of eB indexed by L1 L2 L3 L4. That is, the ⊗ ( ⊗ − ⊗ ) ⊗ columns of eB0 are a˜i,L1 a˜i,L2 a˜j,L3 a˜j,L2 a˜i,L3 a˜j,L4 , for i , j, where a˜i,L denotes the restriction of vector ai to the subset L.

 { ( ) ( )} We fix a column eB0ij with i , j. Let V span eB0k` : k, ` , i, j . Clearly V is correlated with eB0ij. We define the following subspace that contains V,  ˆ  ⊗ ⊗ ⊗ V span a˜j,L1 x y a˜i,L4 , (4.8.12) ⊗ ⊗ ⊗ a˜k,L1 a˜k,L2 x y , ⊗ ⊗ ⊗ a˜k,L1 x a˜k,L3 y , ⊗ ⊗ ⊗ x y a˜k,L3 a˜k,L4 ,  d 4 x ⊗ a˜k L ⊗ y ⊗ a˜k L x, y ⊗ ’ / , k < {i, j} , 2 , 4

200 ˆ ˆ Therefore by definition V ⊃ V, and thus V⊥ ⊂ V⊥ where V⊥ denotes the ˆ ⊗ subspace orthogonal to V. Observe that by the definition of V, we have a˜i,L1 ⊗ ⊗ ˆ ˆ a˜i,L2 a˜j,L3 a˜j,L4 is independent from V. Moreover, V has dimension at most d2 + d2 + 4nd2 < d4/2. Then by Theorem 4.8.12 we obtain that with probability

Ω 1 at least 1 − exp(−d ( )),

⊗ ⊗ ⊗ ( / ) IdV a˜i,L1 a˜i,L2 a˜j,L3 a˜j,L4 > poly 1 d, ρ . ˆ ⊥

Consequently,

 ⊗ ⊗ ⊗ ( / ) IdV B0 > IdV B0 IdV a˜i,L1 a˜i,L2 a˜j,L3 a˜j,L4 > poly 1 d, ρ , ⊥ eij ˆ ⊥ eij ˆ ⊥ where the first inequality follows from V ⊂ Vˆ and second one follows from the ⊗ ⊗ ⊗ ˆ fact that a˜i,L1 a˜j,L2 a˜i,L3 a˜j,L4 is orthogonal to the subspace V⊥.

( ) ( / ) Then taking union bound over all i , j, we obtain that d eB0 > poly 1 d, ρ − (− Ω 1 ) ( ) ( / ) occurs with probability 1 exp d ( ) . Therefore σmin eB0 > poly 1 d, ρ which ( ) ( ) ( / ) in turn implies that σmin eB > σmin eB0 > poly 1 d, ρ .



4.9 Tensor decomposition with general components

In this section we prove Theorem 4.1.6 (tensor decomposition with general components). The key ingredient is a scheme for rounding pseudo-distributions

(see Theorem 4.9.1 below) that improves over our previous scheme (Theorem 4.4.1): The improved rounding scheme only requires moments of degree logarithmic in the overcompleteness parameter σ.

201 4.9.1 Improved rounding of pseudo-distributions

Theorem 4.9.1. Let s, σ > 1 and ε ∈ (0, 1). Let D be a degree-s pseudo-distribution over ’d that satisfies the constraint {kuk2 6 1}, and let a be a unit vector in ’d. Suppose that s > O(1/ε)· log(σ/ε) and,

2s+2  T O 1 ¾˜ ha, ui > Ω(1/σ)· ¾˜ uu > d− ( ) . (4.9.1) D u ( )

O s3 Then, with probability at least 1/d ( ) over the choice of independent random ∼ N( ) ? variables 11,..., 1s 0, Idd2 , the top eigenvector u of the following matrix satisfies that ha, u?i2 > 1 − O(ε),

2 2 T M1  ¾˜ h11, u⊗ i · · · h1s , u⊗ i · uu (4.9.2) D u ( )

We start by defining some notations for convenience. Let

2 2 p1(u)  h11, u⊗ i · · · h1s , u⊗ i .

Moreover, Let

 h1 2i 1  1 − 2  h1 i α j j , a⊗ , 0j j α j a⊗ , and β j 0j , u .

h1 2i  h 2 2i + h1 2i  h i2 + Therefore we have that j , u⊗ α j a⊗ , u⊗ 0j , u⊗ α j a, u β j, and ( )  Î ( h i2 + ) it follows that p1 u 16 j6s α j a, u β j .

Theorem 4.9.1 follows from the following proposition and a variant of Wedin’s

Theorem (see Lemma ??).

T Proposition 4.9.2. In the setting of Theorem 4.9.1, let Id 1  Id −aa . Then, with at − O 1 O s3 least (Ω(1/n) − 1/d ( ))· 1/d ( ) probability over randomness of 1, we have

  T   T   2 max ¾˜ p1(u) Id 1 uu Id 1 , ¾˜ p1(u) Id 1 uu Id1 6 ε ¾˜ p1(u)hu, ai . − − − (4.9.3)

202 Proposition above follows from the following two propositions, one of which lowerbounds the RHS of (4.9.3) and the other upperbounds the LHS of (4.9.3).

p Proposition 4.9.3. In the setting of Theorem 4.9.1, let τ  s s log d. Conditioned on the event that α1, . . . , αs > τ, we have with at least Ω(1/n) probability over the choice of 10

 2 2s+2 ¾˜ p1(u)hu, ai > 0.9α1 . . . αs ¾˜ [ha, ui ]

p Proposition 4.9.4. In the setting of Theorem 4.9.1, let τ  s s log d and Id 1  − T Ω 1 Id −aa . Conditioned on the event that α1, . . . , αs > τ, we have with at least 1 − d− ( ) probability over the choice of 10,

  T   T   2s+2 max ¾˜ p1(u) Id 1 uu Id 1 , ¾˜ p1(u) Id 1 uu Id1 6 O(εα1 . . . αs)· ¾˜ ha, ui . − − − (4.9.4)

We first prove Proposition 4.9.3. We need the following three lemmas.

Lemma 4.9.5. Let ε ∈ (0, 1/3) and 0 6 δ 6 ε. Suppose 0 6 κ 6 δα j for every j ∈ [s], then there exists a SoS proof

2 Ö 2 s 2 s 2s+2 `x x (α j x + κ) 6 α1 . . . αs (1 − ε) x + (1 + O(δ)) x j s ∈[ ]

Proof. Since this is a univariate polynomial inequality, it suffice to show that it’s true for every x, which will imply that there is also a SoS proof. For x ∈ ’ such that x2 6 1 − 2ε, we have that

2 Ö 2 2 Ö x (α j x + κ) 6 x (1 − ε)α j (by ε > δ and δα j > κ) j s j s ∈[ ] ∈[ ] s 2 6 (1 − ε) α1 . . . αs x

203 For x ∈ ’ such that x2 > 1 − 2ε > 1/3, we have that

2 Ö 2 2 Ö 2 2 x (α j x + κ) 6 x α j(1 + O(δ))x (by x > 1/3 and δα j > κ) j s j s ∈[ ] ∈[ ] s 2s+2 6 α1 . . . αs(1 + O(δ)) x

Hence we obtain a proof for the nonnegativity of the target polynomial. It is known that every nonnegative univariate polynomial admits a sum-of-squares proof. Therefore the inequality above has a sum-of-squares proof. 

p p Lemma 4.9.6. In the setting of Theorem 4.9.1, let τ  s s log d, κ  O( s log d).

Conditioned on the event that α1, . . . , α j > τ, we have

   2 Ö 2   2s+2 ¾˜ ha, ui (α j ha, ui + κ) 6 α1 . . . αs · O(¾˜ ha, ui ) .    j s   ∈[ ] 

Proof. By Lemma 4.9.5, we have,

   2 Ö 2  s 2 s  2s+2 ¾˜ ha, ui (α j ha, ui + κ) 6 α1 . . . αs (1 − ε) ¾˜ [ha, ui ] + (1 + O(1/s)) ¾˜ ha, ui    j s   ∈[ ]   2s+2 6 α1 . . . αs · O(¾˜ ha, ui ) ( − )s / ¾˜ h i2s+2 1 ¾˜ [ T (by 1 ε 6 ε σ and u, a > σ uu ) 

Proof of Proposition 4.9.3. We have

      2    Ö 2 2  ¾ ¾˜ p1(u)hu, ai | α1, . . . , αs  ¾ ¾˜  (α j ha, ui + β j)hu, ai  | α1, . . . , αs 10 10      16 j6s         Ö 2 2  ¾˜  (α j ha, ui + ¾[β j]) · hu, ai    j s   ∈[ ]  1 1 (by linearity of pseudo-expectation and independence of 10 ,..., s0)

204 2s+2  α1 . . . αs ¾˜ [ha, ui ] (4.9.5)

Moreover, we bound the variance,

h  22 i   2 4  ¾ ¾˜ p1(u)hu, ai | α1, . . . , αs 6 ¾ ¾˜ p1(u) hu, ai | α1, . . . , αs 10 10   Ö  2 2 4  ¾˜  ¾ α j ha, ui + β j · hu, ai   1j  j s   ∈[ ]    Ö   ¾˜  (α2ha, ui4 + 1) · hu, ai4  j  j s   ∈[ ]    Ö  6 ¾˜  (α2ha, ui2 + 1) · hu, ai2  j  j s   ∈[ ]  (by hu, ai2 6 1)

2s+2  α1 . . . αs ¾˜ [ha, ui ] (By Lemma 4.9.6)

Therefore, by Paley-Zygmund inequality, we have that with probability

  2 2s+2  1 2s+2  ¾˜ p1(u)hu, ai > 0.9α1 . . . αs ¾˜ [ha, ui ] | α1, . . . , αs > α1 . . . αs ¾˜ [ha, ui ] > Ω(1/n) , 10 100 which completes the proof. 

Towards proving Proposition 4.9.4, we start with the following Lemma.

Lemma 4.9.7. Let ε > 0, k ∈ Ž, a ∈ ’d with kak  1 and A  {kuk2 6 1}. Then, there exists a matrix sum-of-squares proof,

  A h i2k − ( − )k ·( )( )T  ( ) · h i2k+2 · `u,1 ε a, u 1 ε Id 1 u Id 1 u O ε a, u Id . / − −

Proof. Let r  1/ε. We may assume r is an integer and that ε > 0 is small enough such that (1 − ε)r > 1/3. Then, the univariate polynomial inequality x2k − (1 − ε)k 6 3x2k · x2r holds for all x ∈ ’. (For x2 < 1 − ε, the left-hand

205 side is negative. For x2 > 1 − ε, the right-hand side is at least x2k because 3x2r > x2r/(1 − ε)r > 1.) It follows that there exists a sum-of-squares proof

2k k 2k 2r `x x − (1 − ε) 6 3x · x . (4.9.6)

Similarly, there exists a sum-of-squares proof (see the texts below equation (4.4.7) as well)

2r 2 2 2 `x x (1 − x ) 6 O(1/r)· x  O(ε)· x . (4.9.7)

Therefore,   A h i2k − ( − )k ·( )( )T `u,1 ε a, u 1 ε Id 1 u Id 1 u / − − 2k+2r T  3ha, ui ·(Id 1 u)(Id 1 u) (by (4.9.6)) − − +  3ha, ui2k 2r ·(1 − ha, ui2) Id (because ` vvT  kvk2 Id by Lemma 4.3.10)

+  O(ε) · ha, ui2k 2 · Id . (by (A.2.2))



 T  Proof of Proposition 4.9.4. We only bound ¾˜ p1(u) Id 1 uu Id 1 . The other − − term can be controlled similarly and the detailed proof are left to the readers. Let  Î  Î h 2i  h i2 + αS j S α j and βS j S β j. By the fact that 1j , u⊗ α j a, u β j, we ∈ ∈ have,

T Õ 2 S T p1(u) Id 1 uu Id 1  αSβL ha, ui | | Id 1 uu Id 1 , (4.9.8) − − − − S s ,LSc | {z } ⊂[ ] WS u ( ) where each summand is denoted by W (u). Observe that W can be written as S S

© ª   ­ 0T 0T® 2 S L WS(u)  ­Id ⊗ Id ⊗ 1 ⊗ · · · ⊗ 1 ® · ha, ui | |(Id 1 u) ⊗ (Id 1 u) ⊗ u⊗| | ­ j1 jr ® − − ­ | {z }® j1,...,jr T « { } ¬

206 s Ω 1 Then by Theorem 4.6.8, with probability at least 1 − 2− d− ( ) over the choice of 1 1 10 ,..., s0 we have ,

¾˜ [ ] ( ) L 2 · ¾˜ h i2 S ( ) ⊗ ( ) ⊗ L  WS 6 αSO s log d | |/ max a, u | | Id 1 u Id 1 u u⊗| | J Jc J L +2 :1 J,2

By Lemma 4.9.7 we have that   {k k2 } h i2 S − ( − ) S ·( )( )T  ( )·h i2 S +2· u 6 1 `u,1 ε a, u | | 1 ε | | Id 1 u Id 1 u O ε a, u | | Id . / − −

Therefore taking pseudo-expectation, we obtain that

 2 S   S   2 S +2  ¾˜ ha, ui | |(Id 1 u) ⊗ (Id 1 u)  ¾˜ (1 − ε)| |(Id 1 u) ⊗ (Id 1 u) + O(ε) ¾˜ ha, ui | | Id − − − − (4.9.9)

Then using the fact that

 S   S T S  T ¾˜ (1 − ε)| |(Id 1 u) ⊗ (Id 1 u)  Id 1 ¾˜ (1 − ε)| | uu Id 1 6 (1−ε)| | ¾˜ uu , − − − − and equation (4.9.9), we have

L 2  S  T  2 S +2  ¾˜ [WS] 6 αSO(s log d)| |/ · (1 − ε)| | ¾˜ uu + O(ε) ¾˜ ha, ui | | Id

(4.9.10)

Ω 1 Taking union bound over all subset S, with probability at least 1 − d− ( ), we have equation (4.9.10) holds for every S ⊂ [s]. Taking the sum of equation (4.9.10) over S, we conclude that

 T  Õ ¾˜ p1(u) Id 1 uu Id 1 6 ¾˜ [WS] (by equation (4.9.8)) − − S    T Ö p  2 Ö 2 p  6 ¾˜ uu · ((1 − ε)α j + O( s log d)) + O(ε) ¾˜ ha, ui (α j ha, ui + O( s log d))   j s  j s  ∈[ ]  ∈[ ]  (4.9.11)

207  p  c ( / ) When α j > τ, where τ s s log d and s ε log σ ε where c is a sufficiently  T large absolute constant, using the fact that ¾˜ uu 6 σ/n, we have

 T Ö p s ε ¾˜ uu ((1 − ε)α j + O( s log d)) 6 (1 − ε/2) α1 . . . αs 6 α1 . . . αs . n j s ∈[ ] Regarding the second term on the RHS of (4.9.11), by Lemma 4.9.6, we have that when α j > τ,

   2 Ö 2 p   2s+2 ¾˜ ha, ui (α j ha, ui + O( s log d)) 6 α1 . . . αs · O(¾˜ ha, ui )    j s   ∈[ ]  Therefore, plugging in the two bounds above into equation (4.9.11), we obtain that

 T   2s+2 ¾˜ p1(u) Id 1 uu Id 1 6 O(εα1 . . . αs)· ¾˜ ha, ui . − −



4.9.2 Finding all components

In this section we prove Theorem 4.1.6 (restated below) using iteratively the rounding scheme that is developed in the subsection before.

Theorem (Restatement of Theorem 4.1.6). There exists an algorithm A (see Al- gorithm 4) with polynomial running time (in the size of its input) such that for all ∈ ( ) { } ⊆ d kÍn T k ε 0, 1 , σ > 1, for every set of unit vectors a1,..., an ’ with i1 ai ai 6 σ d 2k O 1 and every symmetric 2k-tensor T ∈ (’ )⊗ with k > (1/ε) ( ) · log(σ) and

T − Í a 2k / i i⊗ 1,...,k , k+1,...,2k 6 1 3, we have { } { }  2 ( ) { 2 2} ( ) distH A T , a1⊗ ,..., an⊗ 6 O ε .

208 Algorithm 4 Tensor decomposition with general components Parameters: numbers ε > 0, n ∈ Ž. Given: 2k-th order tensor T  { ˆ ˆ } ⊂ ’d Find: Set of vectors S a1,..., an0 with n0 6 n.

 −  ( )  ( )1 2 • Let s k 1, ` O s , and η O ε / .

n + o A  kuk2  1 ∪ hT, u2s 2i > 2/3 . (4.9.12)

• For i from 1 to n, do the following: 1. Compute a `-degree pseudo-distribution D(u) over ’d that satisfies the constraints

A T σ and ¾˜ D u uu 6 . (4.9.13) ( ) n − i + 1

O s3 2. Repeat T  d ( ) rounds of the following: ∼ N( ) – Choose standard Gaussian vectors 11,..., 1s 0, Idd2 and compute the top eigenvectors a? of the following matrix,

2 2 T d d ¾˜ h1s , u⊗ i ... h1s , u⊗ i · uu ∈ ’ × . (4.9.14) D u ( ) ? ? – Check if a satisfies A. If yes, let aˆi  aˆ and S ← S ∪ {aˆi }, add to 2 A the constraint {hu, aˆii 6 1 − 5η}, and break the (inner) loop.

3. If no new aˆi is found in the previous step, stop the algorithm.

1 2 Proof of Theorem 4.1.6. We analyze Algorithm 4. Let η  c0ε / where c0 is a large A Í 2s+2 / enough absolute constant. Let 0 be the constraint that i ai , u > 1 3. Then we have A ` A0. We first observe that as long as a vector a satisfies A, then a

1 2 has to be O(ε / )-close to one of the ai’s up to sign flip. This is because ! 2s+2 2s+2 2s Õ 2 2s 1/3 6 hT, u⊗ i − 1/3 6 hu, aii 6 maxhai , ui hai , ui 6 σ maxhai , ui . i i i

That is, we can always check whether a? is what we wanted as in the second bullet of step 2. Therefore, it remains to show that as long as there exists a j that

1 2 is η / -far away (up to sign flip) to the set S, we will find a new vector after Step 2

209 in the next iteration.

2 We assume that after iteration i0, the set W  {a j : i ∈ [i0], ha j , aˆii 6 1 − η} ∀ is not empty. We will show that after iteration i0 + 1, we will find a new vector

1 2 in W up to O(ε) / error. We claim first that in the i0 + 1 iteration there exists a pseudo-distribution D(u) that satisfies (4.9.13). Indeed, this is because the actual uniform distribution over the finite set W satisfies constraint (4.9.13). Here we ∈ [ ] h 2ki hÍ 2k 2ki − / used the fact that for every j n we have T, a⊗j > i ai⊗ , a⊗j 1 3 > k ha j , a ji − 1/3  2/3.

Since constraints (4.9.13) enforce that for every i 6 i0 pseudo-distribution D(u)

2 2 satisfies that hu, aˆii 6 1 − η, and moreover, we have kai − τi aˆi k 6 O(ε) for some

τi ∈ {−1, +1}, by Lemma ??, we conclude that D(u) also satisfies the constraint

2 that hu, aii 6 1−η/2 (here we use the fact that η  c0ε with large enough constant

( ) Íi0 h i2s+2 ( − )2s Íi0 h i2 c0). These implies that D u satisfies that i1 u, ai 6 1 η i1 u, ai . h i ¾˜ Íi0 h i2s+2 ( − )2s ¾˜ h i2 1 · /( − Therefore, we have i1 u, ai 6 1 η u, ai 6 σ− σ n + ) / ¾˜ Í h i2s+2 / i0 1 6 1 3. Thus by constraint (4.9.13) we have i>i0 u, ai > 1 3. ?  2s+2 1 Therefore, there exists i > i0 such that ¾˜ hu, ai?i > . Then by 3 n i0+1 ( − ) O s3 Theorem 4.9.1 we obtain that with 1/d ( ) probability, in each step of the inner ˆ 1 2 loop we can find ai that is O(ε / )-close to ai?, and therefore at the end of the ˆ inner loop with high probability we found a new vector ai0+1 which is close 1 2 O(ε / )-close to ai?.



210 4.10 Fast orthogonal tensor decomposition without sum-of-

squares

In this section, we give an algorithm (see Theorem 4.10.2) with quasi-linear running time (in the size of the input) that finds a component of an orthogonal 3-tensor in the presence of spectral norm error at most 1/log d . The previous best known algorithm for orthogonal 3-tensor is by [10, Theorem 5.1] which takes similar runtime and tolerates 1/d error in injective norm. It is known that for any symmetric tensor E the spectral norm can be bounded by injective norm √ √ k k · k k with multiplicative factor d, that is, E 1 2,3 6 d E 1 2 3 . Therefore, √ { }{ } { }{ }{ } our robustness guarantee is at least d factor better than tensor power method.

The key step of Algorithm is the following simple Theorem that finds a single component. It is in fact an analog of Theorem 4.4.1 without sum-of-squares. Here we analyze the success probability much more carefully for achieving quasi-linear time.

d d 3 Theorem 4.10.1. Let a1,..., an ∈ ’ be orthonormal vectors. Let T ∈ (’ )⊗ be k − Í 3k a symmetric 3-tensor such that T i a⊗ 1 , 2,3 6 τ. Let 1 be a standard d- i { } { } 1+δ O 1 dimensional Gaussian vector. Let δ ∈ [0, 1]. Then, with probability 1/(d (log d) ( )) over the choice of 1, the top eigenvector of the following matrix is O(τ/δ)-close to a1,

T M1 : (Id ⊗ Id ⊗1 )T .

At the same time, the ratio between the top eigenvalue and the second largest eigenvalue in absolute value is at least 1 + δ/3 − O(τ).

211  − Í 3 Proof. Let E T i ai⊗ . Then, Õn  ( ⊗ ⊗1T) + h1 i · 2 M1 Id Id E , ai ai⊗ , (4.10.1) i1 k k Since E is symmetric and E 1 , 2,3 6 τ, Theorem 4.6.5 implies that with { } { } probability at least 1 − 1/d2 over the choice of 1,

T 1 2 (Id ⊗ Id ⊗1 )E 6 2(log d) / τ . (4.10.2) 1 , 2 { } { } p Let t  2 log d. By the fact that h1, a1i,..., h1, ani are independent standard Gaussian variables and standard estimates on their cumulative density function,

1 the following event happens with probability at least + d 1 δ log d O 1 ( )·( ) ( )

h1, a1i > (1 + δ/3)· t and max |h1, aii| 6 t (4.10.3) i 2,...,n ∈{ } Conditioned on the events in (4.10.2) and (4.10.3), we have the following bound − / · · 2 on the spectral norms of M1 and M1 δ 3 t a1⊗ , which implies that the top eigenvector of M1 is O(τ/δ)-close to a1 (by [?, Lemma A.1]),

− − 1 · 2 M1 1 , 2 M1 δt a1⊗ (4.10.4) { } { } 3 1 , 2 { } { } n n Õ T 1 T Õ T 1 2 > h1, aii · ai ai − (h1, a1i − δt)· a1a1 + h1, aii · ai ai − 2(log n) / τ i1 3 i2 (conditioned on event in (4.10.2))

1 1 2 > δt − 2(log d) / τ (conditioned on event in (4.10.3)) 3 1 > (1 − O(τ/δ)) · δt . (4.10.5) 3

The probability that the events of (4.10.2) and (4.10.3) happen simultaneously is at least 1 1 1 − > . 1+δ( )O 1 2 1+δ( )O 1 d log d ( ) d d log d ( ) This bound implies the first part of the theorem. To see the eigengap bound, we first observe that the largest eigenvalue of M1 is at least h1, a1i −

212 ( ⊗ ⊗1T) ( + / ) − ( )1 2 Id Id E 1 , 2 > 1 δ 3 t 2 log d / τ. On the other hand, by eigen- { } { } value interlacing, the second larges eigenvalue of M1 is bounded by the top − h1 i T  ( ⊗ ⊗1T) + Ín h1 i · 2 eigenvalue of M1 , a1 a1a1 Id Id E i2 , ai ai⊗ , which in turn 1 2 is bounded above by 2(log d) / τ + t. Therefore the eigenvalue gap statement p follows by recalling t  2 log d. 

We remark that we can amplify the success probability of the algorithm by running it repeatedly with independent randomness.

3 O 1 Theorem 4.10.2. There exists a randomized algorithm with running time d ·(log d) ( ) ∈ ( d) 3 k − Ín 3k / that given a symmetric 3-tensor T ’ ⊗ such that T i1 a⊗ 1 , 2,3 6 1 log d i { } { } d for some set of orthonormal vector {a1,..., an } ⊆ ’ outputs with probability Ω(1) a vector unit v such that

n k − k2 1 + − Õ 3 min v ai 6 T ai⊗ . i n 2d i1 1 , 2 , 3 ∈[ ] { } { } { } 1+ω O 1 Furthermore, there exists a randomized algorithm with running time d ·(log d) ( ) 6 O(d3.33) that given T as before, with probability at least Ω(1) outputs a set of vec- n {a a } 1 + T − Í a 3 tors 10 ,..., 0n with Hausdorff distance at most 2d i1 i⊗ 1 , 2 , 3 from { } { } { } {a1,..., an }. Here ω is the matrix multiplication exponent.

+ Proof. We may assume that d is larger than some constant. We run d1 ω ·

O 1 O(log d) ( ) iterations of the following procedure which can be carried out in O˜ (d3) time. We will discuss how to speed the algorithm at the end.

T 1. Choose a standard Gaussian vector 1 and compute M1  (Id ⊗ Id ⊗1 )T.

2 2. Run O(log d) iterations of the matrix power method on M1 viewed as a d-by-d matrix (with random initialization) and set u to be the top eigenvector

calculated in this way.

213 3 3. Check that |hu⊗ , Ti| > 0.9.

4. Run O(log log d) iterations of the tensor power method on T starting from

u. Output the final iterate v of the method.

The analysis of the tensor power method [10, Lemma 5.1] shows that whenever the check in step 3 succeeds then the final output v satisfies the desired accuracy guarantee of the theorem. It remains to show that the check in step 3 succeeds

O 1 with probability at least 1/(log d) ( ) over the randomness of the algorithm. (We

O 1 obtain success probability Ω(1) by repeating the algorithm (log d) ( ) times.) We apply Theorem 4.10.1 for δ  O(1/log d) and τ  1/log d such that for every i ∈ [n], the distance guarantee for the top eigenvector of M1 is at most 0.001 and the ratio between first and second eigenvalue is at least 1 + 1/log d. By symmetry, for every index i ∈ [n], the probability that the top eigenvector of

O 1 M1 is 0.001-close to ai is at least 1/d(log d) ( ). Since the vectors a1,..., an are orthonormal these events are disjoint. Therefore, with probability at least

O 1 1/(log d) ( ) over the choice of 1, the top eigenvector of M1 is 0.001-close to one of the vectors a1,..., an. Since the multiplicative gap between the top eigenvalue and the remaining eigenvalues of M1 is at least 1 + 1/log d (by Theorem 4.10.1 for our choice of δ and τ), it follows that with constant probability over the choice of the random initialization of the matrix power method, the second step of the algorithm recovers a vector that is 0.001-close to the top eigenvector of M1. In

3 this case, the resulting vector u satisfies the check |hu⊗ , Ti| > 0.9.

1+ω O 1 O 1 In order to find all components in time d · O(log d) ( ) we run d ·(log d) ( ) independent evaluations of the above algorithm. Note that each run involves multiplication of a d2 × d matrix with a d dimensional vector and therefore in total we are to multiply a d2 × d matrix with d × d matrix. Therefore, using

214 fast matrix multiplication, we can “parallelize” all of the required linear algebra

4 O 1 operations and speedup the running time from d (log d) ( ) to the desired

1+ω O 1 O(d )·(log d) ( ). 

Remark 4.10.3 (Extension to other settings). The same rounding idea in Theo- rem 4.10.1 can be extended to the setting when the components a1,..., an are Í T − close to isotropic in the sense that i ai ai Idd 6 σ. The success probability 1+poly σ will decrease to roughly 1/d ( ), and therefore when σ is at most a constant, the overall runtime will remain polynomial in d.

Suppose a1,..., an are separate vectors as in the setting of Theorem 4.1.5, we 3 k 3 Í  ⊗ / can apply the idea in paragraph above to the 3-tensor i bi⊗ where bi ai and  +  1 log σ · ( / ) k > O log ρ log 1 η is a multiple of 3. By Lemma 4.5.4 and the condition on Í T + k, we have that bi are in nearly isotropic position with i bi bi 6 1 η. Hence, using idea above we have a spectral algorithm without sum-of-squares for this set- ting. As noted before (below Theorem 4.1.5), the error tolerance of this algorithm k − Ín 2k k is in terms of an unbalanced spectral norm: T i1 a⊗ 1,...,2k 3 , 2k 3+1,...,2k , i { / } { / } which limits its application, for example, to dictionary learning.

215 CHAPTER 5 OVERCOMPLETE GENERIC DECOMPOSITION SPECTRALLY

5.1 Introduction

Tensors are higher-order analogues of matrices: multidimensional arrays of numbers. They have broad expressive power: tensors may represent higher-order moments of a probability distribution [10], they are natural representations of cubic, quartic, and higher-degree polynomials [67, 50], and they appear whenever data is multimodal (e.g. in medical studies, where many factors are measured) [1, 22, 45]. Due to these reasons, in recent decades tensors have emerged as fundamental structures in machine learning and signal processing.

d k The notion of rank extends from matrices to tensors: a rank-1 tensor in (’ )⊗

1 k is a tensor that can be written as a tensor product u( ) ⊗ · · · ⊗ u( ) of vectors

1 k d d k u( ),..., u( ) ∈ ’ . Any tensor T ∈ (’ )⊗ can be expressed as a sum of rank-1 tensors, and the rank of T is the minimum number of terms needed in such a sum. As is the case for matrices, we are often interested in tensors of low rank: low-rank structure in tensors often carries interpretable meaning about underlying data sets or probability distributions, and the tensors that arise in many applications are low-rank [10].

Tensor decomposition is the natural inverse problem in the context of tensor

d k rank: given a d-dimensional symmetric k-tensor T ∈ (’ )⊗ of the form

 Õ k + T ai⊗ E, i6n d d k for vectors a1,..., an ∈ ’ and an (optional) error tensor E ∈ (’ )⊗ , we are asked to output vectors b1,..., bn as close as possible to a1,..., an (e.g. minimizing the

216 Euclidean distance kbi −ai k). The goal is to accomplish this with an algorithm that is as efficient as possible, under the mildest-possible assumptions on k,a1,..., an, and E.

While tensor rank decomposition is a generalization of rank decomposition for matrices, decomposition for tensors of order k > 3 differs from the matrix case in several key ways.

1. (Uniqueness) Under mild assumptions on the vectors a1,..., an, tensor decompositions are unique (up to permutations of [n]), while matrix decompositions are often unique only up to unitary transformation.

2. (Overcompleteness) Tensor decompositions often remain unique even when the number of factors n is larger than the ambient dimension d (up

k 1 to n  O(d − )), while a d × d matrix can have only d eigenvectors or 2d singular vectors.

These features make tensor decompositions suitable for many applications where matrix factorizations are insufficient. However, there is another major difference:

3. (Computational Intractability) While many matrix decompositions — eigen- decompositions, singular value decompositions, LU-factorizations, and so on — can be found in polynomial time, tensor decomposition is NP-hard in

general [47].

In spite of the NP-hardness of general tensor decomposition, many special cases turn out to admit polynomial-time algorithms. A classical algorithm, often called Jennrich’s algorithm, recovers the components a1,..., an from T when they

217 are linearly independent (which requires n 6 d) and E  0 using simultaneous diagonalization [46, 29].

More sophisticated algorithms improve on Jennrich’s in their tolerance to overcompleteness (and the resulting lack of ) and robustness to nontrivial error tensors E. The literature now contains a wide variety of techniques for tensor decomposition: the major players are iterative methods

(tensor power iteration, stochastic gradient descent, and alternating minimization), spectral algorithms, and convex programs. Convex programs, and in particular the sum-of-squares semidefinite programming hierarchy (SoS), require the mildest assumptions on k, a1,..., an , E among known polynomial-time algorithms [58]. In pushing the boundaries of what is known to be achievable in polynomial time, SoS-based algorithms have been crucial. However, the running times of these algorithms are large polynomials in the input, making them utterly impractical for applications.

The main contribution of this work is a tensor decomposition algorithm whose robustness to errors and tolerance for overcompleteness are similar to those of the SoS-based algorithms, but with subquadratic running time. Other algorithms with comparable running times either require higher-order tensors,1 are not robust in that they require the error E  0 or E exponentially small, or require linear independence of the components a1,..., an and hence n 6 d.2

Our algorithm is comparatively simple, and can be implemented with a small number of dense matrix and matrix-vector multiplication operations, which are

1Higher-order tensors are costly because they are larger objects, and for learning applications they often require a polynomial increase in sample complexity. 2There are also existing robust algorithms which tolerate some overcompleteness when a1,..., an are assumed to be random; in this paper we study generic a1,..., an, which is a much more challenging setting than random a1,..., an [7, 49].

218 fast not only asymptotically but also in practice.

Concretely, we study tensor decomposition of overcomplete 4-tensors under algebraic nondegeneracy conditions on the tensor components a1,..., an. Alge- braic conditions like ours are the mildest type of assumption on a1,..., an known to lead to polynomial time algorithms – our algorithm can decompose all but a measure-zero set of 4-tensors of rank n  d2, and in particular we make no assumption that the components a1,..., an are random.3

When n  d2, our algorithm approximately recovers a 1 − o(1) fraction of  Í 4 + a1,..., an (up to their signs) from T i6n ai⊗ E, so long as the spectral norm kEk is (significantly) less than the minimum singular value of a certain matrix associated to the {ai }. (In particular, nonsingularity of this matrix is our nondegeneracy condition on a1,..., an.) The algorithm requires time O˜ (n2d3) 6 O(d7), which is subquadratic in the input size d4.

Robustness, Overcompleteness, and Applications to Machine Learning. Ten- sor decomposition is a common primitive in algorithms for statistical inference that leverage the method of moments to learn parameters of latent variable models. Examples of such algorithms exist for independent component analysis / blind source separation [28], dictionary learning [17, 58, 68], overlapping community detection [8, 51], mixtures of Gaussians [38], and more.

In these applications, we receive samples x ∈ ’d from a model distribution

D(ρ) that is a function of parameters ρ. The goal is to estimate ρ using the samples. The method-of-moments strategy is to construct the third- or fourth-order moment

3Although decompositions of 4-th order tensors can remain unique up to n d3, no polynomial- time algorithms are known which successfully decompose tensors of overcompleteness≈ n d2. 

219 ¾ k  ¾ k  Í k tensor x⊗ (k 3, 4) from samples whose expectation x⊗ i6n ai⊗ is a low rank tensor with components a1,..., an, from which the model parameters ρ can

k be deduced.4 Since ¾ x⊗ is estimated using samples, the tensor decomposition k algorithm used to extract a1,..., an from ¾ x⊗ must be robust to error from sampling. The sample complexity of the resulting algorithm depends directly on the magnitude of errors tolerated by the decomposition algorithm. In addition, the greater general error-robustness of our result suggests better tolerance of model misspecification error.

Some model classes give rise to overcomplete tensors; roughly speaking, this occurs when the number of parameters (the size of the description of ρ) far exceeds d2, where d is the ambient dimension. Typically, in such cases, ρ consists of a

d collection of vectors a1,..., an ∈ ’ with n  d. Such overcomplete models are widely used; for example, in the dictionary learning setting, we are given a data set S and are asked to find a sparse representation of S. This is a powerful preprocessing tool, and the resulting representations are more robust to perturbations, but assembling a truly sparse, effective dictionary often requires representing d- dimensional data in a basis with n  d elements [57, 31]. Recent works also relate the problem of learning neural networks with good generalization error to tensor decomposition, showing a connection between overcompleteness and the width of the network [59].5

Using tensor decomposition in such settings requires algorithms with practical running times, error robustness, and tolerance to overcompleteness. The strongest polynomial-time guarantees for overcomplete dictionary learning and similar

4Any constant k, rather than just k  3, 4, may lead to polynomial-time learning algorithms, but the cost is typically gigantic polynomial sample complexity and running time, scaling like dk, to estimate and store a k-th order tensor. 5Strictly speaking, this work shows a reduction from tensor decomposition to learning neural nets, but the connection between width and overcompleteness is direct regardless.

220 models currently rely on overcomplete tensor decomposition via the SoS method [58]; our work is an important step towards giving lightweight, spectral algorithms for such problems.

5.1.1 Our Results

Our contribution is a robust, lightweight spectral algorithm for tensor decom- position in the overcomplete regime. We require that the components satisfy an algebraic non-degeneracy assumption satisfied by all but a measure-0 set of inputs. At a high level, we require that a certain matrix associated with the components of the tensor have full rank. Though the assumption may at first seem complicated, we give it formally here:

Π Definition 5.1.1. Let 2⊥,3 be the projector to the orthogonal complement of the d 3 subspace of (’ )⊗ that is symmetric in its latter two tensor modes. Equivalently, Π  1 ( − ) 2⊥,3 2 Id P2,3 , where P2,3 is the linear operator that interchanges the second d 3 and third modes of (’ )⊗ .

Definition 5.1.2. Let Πimg M denote the projector to the column space of the ( )  ( ) 1 2  ( ) 1 2 matrix M. Equivalently, Πimg M MM> − / M M M>M − / , where ( ) 1 2 (MM>)− / is the whitening transform of M and is equal to the Moore-Penrose

1 2 pseudoinverse of (MM>) / .

d Definition 5.1.3. Vectors a1,..., an ∈ ’ are κ-non-degenerate if the matrix

K(a1,..., an), defined below, has minimum singular value at least κ > 0. If

κ  0, we say that the {ai } are degenerate.

The matrix K(a1,..., an) is given by choosing for each i ∈ [n] a matrix Bi whose

d columns form a basis for the orthogonal complement of ai in ’ , assembling the

221 j 3 × ( − ) ⊗ ⊗ ( ) d n d 1 matrix H whose rows are given by ai ai Bi as

 ⊗ ⊗   a1> a1> B1>      .  H  .     ⊗ ⊗   a>n a>n Bn>    ( )  and letting K a1,..., an Π⊥ Πimg H . 2,3 ( >)

2 We note that when n  d then all but a measure-zero set of unit (a1,..., an) ∈ ’dn satisfy the condition that κ > 0. We expect also that for n  d2, if

d a1,..., an ∈ ’ are independent random unit vectors then κ > Ω(1) – we provide simulations in support of this in ??.6

Some previous works on tensor decomposition under algebraic nondegener- acy assumptions also give smoothed analyses of nondegeneracy, showing that

1 small random perturbations of arbitrary vectors are poly d -well-conditioned (for ( ) differing notions of well-conditioned-ness) [24, 58]. We expect that a similar smoothed analysis is possible for κ-non-degeneracy, though because of the spe- cific form of the matrix K(a1,..., an) it does not follow immediately from known results. We defer this technical challenge to future work.

Given this non-degeneracy condition, we robustly decompose the input tensor ˜ ( n2d3 ) in time O κ , where we have suppressed factors depending on the smallest singular value of a matrix flattening of our tensor.

Theorem (Special case of Theorem 5.7.3). Suppose that d 6 n 6 d2, and that

d a1,..., an ∈ ’ are κ-non-degenerate unit vectors for κ > 0, and suppose that T is their

6Furthermore, standard techniques in random matrix theory prove that when a1,..., an are random then matrices closely related to K a1,..., an are well-conditioned; for instance 1 2 ( 1)2 this holds (roughly speaking) if H1>H − / and H2>H − / are removed. However, inverses and pseudoinverses of random matrices,( ) especially( those) with dependent entries like ours, are infamously challenging to analyze – we leave this challenge to future work. See ?? for details.

222 ∈ (’d) 4  Í 4 + 4-tensor perturbed by noise, T ⊗ such that T i n ai⊗ E, where E is a ∈[ ] k k ε 2 × 2 perturbation such that E 6 log d in its d d reshaping. Suppose further that when 2 × 2 k 1k ( ) k Í ( 3)( 3) k ( ) reshaped to a d d matrix, T− 6 O 1 and that i n ai⊗ ai⊗ > 6 O 1 . ∈[ ]

2 3 1 There exists an algorithm decompose with running time O˜ (n d κ− ), so that for every such T there exists a subset S ⊆ {a1,..., an } of size |S| > 0.99n, such that decompose(T) with high probability returns a set of t  O˜ (n) unit vectors b1,..., bt where every ai ∈ S is close to some b j, and each b j is close to some ai ∈ S:

 ε 1 8  ε 1 8 a ∈ S |hb a i| −O / and j ∈ [t] |hb a i| −O / i , max j , i > 1 2 , , max j , i > 1 2 . j κ ai S κ ∀ ∀ ∈

Furthermore, if a1,..., an are random unit vectors, then with high probability they satisfy the conditions of this theorem with κ  Ω(1).

When n 6 d, our algorithm still obtains nontrivial guarantees (though the runtime asymptotics are dominated by other terms); however in this regime, a combination of the simpler algorithm of [68] and a whitening procedure gives comparable guarantees.

We remark that our full theorem, Theorem 5.7.3, does not pose as many

1 restrictions on the {ai }; we do not generally require that kT− k 6 O(1) or that k Í ( 3)( 3) k ( ) i ai⊗ ai⊗ > 6 O 1 . However, allowing these quantities to depend on d and n affects our runtime and approximation guarantees, and so to simplify presenta- tion we have made these restrictions here; we refer the reader to Theorem 5.7.3 for details.

Furthermore, in the theorem stated above we recover only a 0.99-fraction ( 1 ) of the vectors, and we require the perturbation to have magnitude O log d . This is again a particular choice of parameters in Theorem 5.7.3, which allows

223 for a four-way tradeoff among accuracy, magnitude of perturbation, fraction of components recovered, and runtime. For example, if the perturbation is 1 ˜ ( 2 3 1) poly d in spectral norm, then we may recover all components in time O n d κ− ; ( ) alternatively, if the perturbation has spectral norm η2  Θ(1), then we may

2+O η 3 1 recover an 0.99-fraction of components in time O˜ (n ( )d κ− ) up to accuracy − ( η )1 8 1 O κ2 / . Again, we refer the reader to Theorem 5.7.3 for the full tradeoff.

Finally, a note about our recovery guarantee: we guarantee that every vector returned by the algorithm is close to some component, and furthermore that most components will be close to some vector. It is possible to run a clean-up procedure after our algorithm, in which nearby approximate components b j are clustered to correspond to a specific ai; depending on the proximity of the ai to each other, this may require stronger accuracy guarantees, and so we leave this procedure as an independent step. Our guarantee does not include signs, but this is because the tensor T is an even-order tensor, so the decomposition is only (− ) 4  4 unique up to signings as ai ⊗ ai⊗ .

5.1.2 Related works

The literature on tensor decomposition is broad and varied, and we will not attempt to survey it fully here (see e.g. the survey [53] or the references within

[10, 40] for a fuller picture). We will give an idea of the relationship between our algorithm and others with provable guarantees.

For simplicity we focus on order-4 tensors. Algorithms with provable guaran- tees for tensor decomposition fall broadly into three classes: iterative methods, convex programs, and spectral algorithms. For a brief comparison of our algo-

224 Algorithm Type Rank Robustness Assumptions Runtime 2 O d 3 4 [55] algebraic n 6 d kEk 6 2− ( ) algebraic O˜ (n d ) ( 1.5) k ∞k ( n ) ˜ ( 3) [13] iterative n 6 o d E 6 o d2 random, warm start O nd [40] iterative n 6 O(d2) E  0 random, warm start O˜ (nd4) [58] SDP n 6 d2 kEk 6 0.01 algebraic > nd24 k k ( log log d ) ˜ ( 2+ω) [68] spectral n 6 d E 6 O log d orthogonal O d 2 k k ( 1 ) ˜ ( 2 3) this paper spectral n 6 d E 6 O log d algebraic O n d

Table 5.1: A comparison of tensor decomposition algorithms for rank-n 4-tensors in ’d 4. Here ω denotes the matrix multiplication constant. A robustness ( )⊗ bound E 6 η refers to the requirement that a d2 d2 reshaping of the error tensor Ek havek spectral norm at most η. Some of the× algorithms’ guarantees involve a tradeoff between robustness, runtime, and assumptions; where this is the case, we have chosen one representative setting of parameters. See ?? for details. Above, “random” indicates that the algorithm assumes a1,..., an are independent unit vectors (or Gaussians) and “algebraic” indicates that the algorithm assumes that the vectors avoid an algebraic set of measure 0.

rithm to previous works, we include Table 5.1.

Iterative Methods.. Iterative methods are a class of algorithms that maintain one (or sometimes several) estimated component(s) b, and update the estimate using a variety of update rules. Some popular update rules include tensor power

iteration [10], gradient descent [40], and alternating-minimization [11]. Most of these methods have the advantage that they are fast; the update steps usually run in time linear in the input size, and the number of updates to convergence is

often polylogarithmic in the input size.

The performance of the most popular iterative methods has been well- characterized in some restricted settings; for example, when the components

{ai } are orthogonal or linearly independent [10, 36, 70], or are independently drawn random vectors [13, 40]. Furthermore, many of these analyses require a “warm start,” or an initial estimate b that is more correlated with a component

than a typical random starting point. Few provable guarantees are known for the

225 non-random overcomplete regime, or in the presence of arbitrary perturbations.

Convex Programming.. Convex programs based on the sum-of-squares (SoS) semidefinite programming (SDP) relaxation yield the most general provable guarantees for tensor decomposition. These works broadly follow a method of Í k pseudo-moments: interpreting the input tensor i n ai⊗ as the k-th moment tensor ∈[ ] k d ¾ X⊗ of a distribution X on ’ , this approach uses SoS to generate surrogates (or ¾ 100k  Í 100k pseudo-moments) for higher moment tensors, like X⊗ i n ai⊗ . It is ∈[ ] Í ( 100) k generally easier to extract the components a1,..., an from i n ai⊗ ⊗ than ∈[ ] Í k { 100} from i n ai⊗ , because the vectors ai⊗ have fewer algebraic dependencies ∈[ ] than the vectors {ai }, and are farther apart in Euclidean distance. Of course, ¾ 100k  Í 100k X⊗ i n ai⊗ is not given as input, and even in applications where the ∈[ ] input is negotiable, it may be expensive or impossible to obtain such a high-order tensor. The SoS method uses semidefinite programming to generate a surrogate which is good enough to be used to find the vectors a1,..., an

Work on sum-of-squares relaxations for tensor decomposition began with the quasi-polynomial time algorithm of [17]; this algorithm requires only mild well-conditioned-ness assumptions, but also requires high-order tensors as input, and runs in quasi-polynomial time. This was followed by an analysis showing that, at least in the setting of random a1,..., an, the SoS algorithm can decompose substantially overcomplete tensors of order 3 [39]. This line of work finally concluded with the work of Ma, Shi, and Steurer [58], who give sum-of-squares based polynomial-time algorithms for tensor decomposition in the most general known settings: under mild algebraic assumptions on the components, and in the presence of adversarial noise, so long as the noise tensor has bounded spectral norm in its matrix reshapings.

226 These SoS algorithms have the best known polynomial-time guarantees, but they are formidably slow. The work of [58] uses the degree-8 sum-of-squares relaxation, meaning that to find each of the n components, one must solve an

SDP in Ω(d8) variables. While these results are important in establishing that polynomial-time algorithms exist for these settings, their runtimes are far from efficient.

Spectral algorithms from Sum-of-Squares Analyses.. Inspired by the mild

assumptions needed by SoS algorithms, there has been a line of work that uses the analyses of SoS in order to design more efficient spectral algorithms, which ideally work for similarly-broad classes of tensors.

At a high level, these spectral algorithms use eigendecompositions of specific matrix polynomials to directly construct approximate primal and dual solutions

to the SoS semidefinite programs, thereby obtaining the previously mentioned “surrogate moments” without having to solve an SDP. Since the SoS SDPs are quite powerful, constructing (even approximate) solutions to them directly and

efficiently is a nontrivial endeavor. The resulting matrices are only approximately SDP solutions — in fact, they are often far from satisfying most of the constraints of the SoS SDPs. There is a tradeoff between how well these spectrally constructed

solutions approximate the SoS output and how efficiently the algorithm can be implemented. However, by carefully choosing which constraints to satisfy, these works are able to apply the SDP rounding algorithms to the approximate

spectrally-constructed solutions (often with new analyses) to obtain similar algorithmic guarantees.

The work of [49] was the first to adapt the analysis of SoS for random a1,..., an

227 presented by [39] to obtain spectral algorithms for tensor decomposition, giving subquadratic algorithms for decomposing random overcomplete tensors with

4 3 n 6 O(d / ). As SoS algorithms have developed, so too have their faster spectral counterparts. In particular, [68] adapted some of the SoS arguments presented in [58] to give robust subquadratic algorithms for decomposing orthogonal 4-tensors in the presence of adversarial noise bounded only in spectral norm.

Our result builds on the progress of both [58, 68]. The SoS algorithm of [58] was the first to robustly decompose generic overcomplete tensors in polynomial time. The spectral algorithm of [68] obtains a much faster running time for robust tensor decomposition, but sacrifices overcompleteness. Our work adapts (and improves upon) the SoS analysis of [58] to give a spectral algorithm for the robust and overcomplete regime. Our primary technical contribution is the efficient implementation of the lifting step in the SoS analysis of [58] as an efficient spectral algorithm to generate surrogate 6th order moments ; this is the subject of Section 5.5, and we give an informal description in Section 5.2.

FOOBI. The innovative FOOBI (Fourth-Order Cumulant-Based Blind Identi- fication) algorithm of [55] was the first method with provable guarantees for overcomplete 4-th order tensor decomposition under algebraic nondegeneracy assumptions. Like our algorithm, FOOBI can be seen as a lifting procedure (to an 8-th order tensor) followed by a rounding procedure. The FOOBI lifting procedure inspires ours – although ours runs faster because we lift to a 6-tensor rather than an 8-tensor – but the FOOBI rounding step is quite different, and proceeds via a clever simultaneous diagonalization approach. The advantage our algorithm offers over FOOBI is twofold: first, it provides formal, strong robustness guarantees, and second, it has a faster asymptotic runtime.

228 To the first point: for a litmus test, consider the case that n  d and a1,..., an ∈ ’d  Ín 4 + are orthonormal. On input T i1 ai⊗ E, our algorithm recovers the ai for arbitrary perturbations E so long as they are bounded in spectral norm by kEk 6 1/poly log d.7 We are not aware of any formal analyses of FOOBI when run on tensors with arbitrary perturbations of this form. Precisely what degree of robustness should be expected from this modified FOOBI algorithm is unclear. The authors of [55] do suggest (without analysis) a modification of their algorithm for the setting of nonzero error tensors E, involving an alternating- minimization method for computing an approximate simultaneous diagonalization.

Because the problem of approximate simultaneous diagonalization is non-convex, establishing robustness guarantees for the FOOBI algorithm when augmented with the approximate simultaneous diagonalization step appears to be a nontrivial technical endeavor. We think this is an interesting and potentially challenging open question.

Further, while the running time of FOOBI depends on the specific imple- mentation of its linear-algebraic operations, we are unaware of any technique to implement it in time faster than O˜ (n3d4). In particular, the factor of d4 appears essential to any implementation of FOOBI; it represents the side-length of a d4 × d4 square unfolding of a d-dimensional 8-tensor, which FOOBI employs extensively. By contrast, our algorithm runs in time O˜ (n2d3), which is (up to logarithmic factors) faster by a factor of nd.

7In contrast, most iterative methods, such as power iteration, can only handle perturbations of spectral norm at most E 6 1 poly d . k k / ( )

229 5.2 Overview of algorithm

We begin by describing a simple tensor decomposition algorithm for orthogonal

3-tensors: Gaussian rounding (Jennrich’s algorithm [46]). We then build on that intuition to describe our algorithm.

d Orthogonal, undercomplete tensors.. Suppose that u1,..., ud ∈ ’ are or-  Í 3 thonormal vectors, and that we are given T i d ui⊗ . As a first attempt at ∈[ ] recovering the ui, one might be tempted to choose the first “slice” of T, the ×  Í ( )· d d matrix T1 i ui 1 ui u>i , and compute its singular value decomposition

(SVD). However, if |ui(1)|  |u j(1)| for some i , j ∈ [d], the SVD will not allow us to recover these components. In this setting, Gaussian rounding allows us to exploit the additional mode of T: If we sample 1 ∼ N(0, Idd), then we (1)  Í h1 i · h1 i can take the random flattening T i , ui ui u>i ; because the , ui are independent standard Gaussians, they are distinct with probability 1, and an

SVD will recover the ui exactly. Moreover, this algorithm also solves k-tensor Í k decomposition for orthogonal tensors with k > 4, by treating i d ui⊗ as the ∈[ ] Í k 1 ⊗ ⊗ 3-tensor i d ui⊗ − ui ui. ∈[ ]

Challenges of overcomplete tensors.. In our setting, we have unit vectors { } ⊂ d  Í 4 ai i n ’ with n > d, and T i a⊗ (focusing for now on the unperturbed ∈[ ] i case). Since n > d, the components a1,..., an are not orthogonal: they are not even linearly independent. So, we cannot hope to use Gaussian rounding as a black box. While the vectors a1 ⊗ a1,..., an ⊗ an may be linearly independent, Í ( 2)( 2) the spectral decompositions of the matrix i n ai⊗ ai⊗ > are not necessarily ∈[ ] useful, since its eigenvectors may not be close to any of the vectors ai, and may be

230 unique only up to rotation.

Challenges of perturbations.. Returning momentarily to the orthogonal setting with n 6 d, new challenges arise when the perturbation tensor E is nonzero.  Í 4 + For an orthogonal 4-tensor T i d ui⊗ E, the Gaussian rounding algorithm ∈[ ] Í h1 2i T + × produces the matrix i d , ui⊗ ui ui E1 for some d d matrix E1. The ∈[ ] k k  (Í ( 2)( 2)T)  difficulty is that even if the spectral norm E σmin i d ui⊗ ui⊗ 1, ∈[ ] the matrix E1 sums many slices of the tensor E, and so the spectrum of E1 can Í h1 2i T overwhelm that of i d , ui⊗ ui ui . ∈[ ]

This difficulty is studied in [68], where it is resolved by SoS-inspired prepro- cessing of the tensor T. We borrow many of those ideas in this work.

Algorithmic strategy.. We now give an overview of our algorithm. Algorithm 5 gives a summarized version of the algorithm, with details concerning robustness and fast implementation omitted.

There are two main stages to the algorithm: the first stage is lifting, where the input rank-n 4-tensor over ’d is lifted to a corresponding rank-n 3-tensor

2 over a higher dimensional space ’d ; this creates an opportunity to use Gaussian rounding on the newly-created tensor modes. In the second rounding stage, the components of the lifted tensor are recovered using a strategy similar to Gaussian rounding and then used to find the components of the input.

This parallels the form of the SoS-based overcomplete tensor decomposition algorithm of [58], where both stages rely on SoS semidefinite programming. Our main technical contribution is a spectral implementation of the lifting stage; our spectral implementation of the rounding stage reuses many ideas of [68], adapted

231 round the output of our new lifting stage.

Lifting.. The goal of the lifting stage is to transform the input T  Í ( 2)( 2)T  1 2 i n ai⊗ ai⊗ to an orthogonal 3-tensor. Let W T− / and observe that ∈[ ] ( 2) the whitened vectors W ai⊗ are orthonormal; therefore we will want to use T to Í ( 2) 3 find the orthogonal 3-tensor i n Wai⊗ ⊗ . ∈[ ]

( 3) ( 2) The lifting works by deriving Span a⊗ i n from Span a⊗ i n , where the i ∈[ ] i ∈[ ] ( 3) latter is simply the column space of the input T. By transforming Span ai⊗ using 1 2 ( ( 2)⊗ ) { ( 2)⊗ } W  T− / , we obtain Span W a⊗ a . Since W a⊗ ai i n are orthonormal, i i i ∈[ ] Í ( ( 2) ⊗ )( ( 2) ⊗ the orthogonal projector to their span is in fact equal to i W ai⊗ ai W ai⊗ )T ai , which is only a reshaping and a final multiplication by W away from the Í ( ( 2)) 3 orthogonal tensor i W ai⊗ ⊗ .

( 3) ( 2) The key step is the operation which obtains Span ai⊗ from Span ai⊗ . It rests on an algebraic “identifiability” argument, which establishes that for almost all problem instances (all but an algebraic set of measure 0), the subspace ( 3) ( 2) ⊗ ’d Span ai⊗ is equal to Span ai⊗ intersected with the symmetric subspace ({ ⊗ ⊗ } ) ( ⊗ ) Span x x x x ’d . Since we can compute Span ai ai from the input and ∈ since the symmetric subspace is easy to describe, we are able to perform this lifting step efficiently. The simplest version of the identifiability argument is given in Lemma 5.2.1, and a more robust version that includes a condition number analysis is given in Section 5.5.1.

d 2 Lemma 5.2.1 (Simple Identifiability). Let a1,..., an ∈ ’ with n 6 d . Let S denote ({ 2}) ({ 3}) Span ai⊗ and let T denote Span ai⊗ and assume both have dimension n. Let ⊆ ( d) 3  ({ ⊗ ⊗ } ) sym ’ ⊗ be the linear subspace sym Span x x x x ’d . For each i, let ∈ { } d bi,j j d 1 be an arbitrary orthonormal basis the orthogonal complement of ai in ’ . ∈[ − ]

232 Let also  ⊗ ⊗ − ⊗ ⊗   a1 a1 b1,1 a1 b1,1 a1     .   .    T   K0 :  a ⊗ a ⊗ b − a ⊗ b ⊗ a  ,  i i i,j i i,j i     .   .     ⊗ ⊗ − ⊗ ⊗   an an bn,d 1 an bn,d 1 an   − −  d Then if K0 has full rank n(d − 1), it follows that (S ⊗ ’ ) ∩ sym  T.

⊆ ( ⊗ d) ∩ { ⊗ ⊗ } Proof. To show that T S ’ sym, we simply note that ai ai ai i n ∈[ ] form a basis for T and are also each in both S ⊗ ’d and sym.

To show that (S ⊗ ’d) ∩ sym ⊆ T, we take some y ∈ (S ⊗ ’d) ∩ sym. Since y is symmetric under mode interchange, we express y in two ways as

Õ Õ y  ai ⊗ ai ⊗ ci  ai ⊗ ci ⊗ ai . i i

Then by subtracting these two expressions for y from each other, we find

Õ 0  ai ⊗ (ai ⊗ ci − ci ⊗ ai). i  h i + Í We express ci ai , ci ai j γij bij for some vector γ. Then the symmetric parts cancel out, leaving

Õ 0  γij ai ⊗ (ai ⊗ bij − bij ⊗ ai)  K0γ . ij

Since K0 is full rank by assumption, this is only possible when γ  0. Therefore, ci ∝ ai for all i, so that y ∈ T. 

Remark 5.2.2. Although the condition number from the matrix K0 here is not the same as the one derived from K from Definition 5.1.3, it is off by at most a

1 1 2 multiplicative factor of 2kT− k− / . To see this, K in Definition 5.1.3 is given as K 

233 T T 1 2 Π Π T  Π  Π Π T ( )  2⊥,3 img H , whereas we may write K0 2 2⊥,3H 2 2⊥,3 img H HH / ( ) ( ) ( T)1 2 k 1k 1 k 1k k 1k k 1k2 2K HH / . Therefore, K0− > 2 K− H− . By [58, Lemma 6.3], H− > 1 kT− k.

Robustness.. To ensure that our algorithm is robust to perturbations E, we must argue that the column span of T and T + E are close to each other so long as E is bounded in spectral norm, and furthermore than the lifting operation still produces a subspace V which is close to Span({W(ai ⊗ ai) ⊗ ai }). This is done via careful application of matrix perturbation analysis to the identifiability argument. By operating with W only on third-order vectors and matrices over

d 3 (’ )⊗ , we also avoid incurring factors of the fourth-order operator norm kTk in the condition numbers, instead only incurring a much milder sixth-order penalty k Í 3 3T k ai⊗ ai⊗ . For details, see Section 5.5.2.

Rounding.. If we are given direct access to T in the absence of noise, the rounding stage can be accomplished with Gaussian rounding. However when we allow T to be adversarially perturbed the situation becomes more delicate. Our rounding stage is an adaptation of [68], though some modifications are required for the additional challenges of the overcomplete setting. It recovers the components of an approximation of a 3-tensor with n orthonormal components, √ provided that said approximation is within ε n in Frobenius norm distance. The technique is built around Gaussian rounding, but in order to have this succeed in √ the presence of ε n Frobenius norm noise, the large singular values are truncated from the rectangular matrix reshapings of the 3-tensor: this ensures that the rounding procedure is not entirely dominated by any spectrally large terms in the noise.

234 ≈ 2 After we recover approximations of the orthonormal components bi Wai⊗ , 1 we wish to extract the ai. Naively one could simply apply W− , but this can cause

1 errors in the recovered vectors to blow up by a factor of kW− k. Even when the

1 {ai } are random vectors, kW− k  Ω(poly(d)).8 Instead, we utilize the projector to Span{W(ai ⊗ ai) ⊗ ai } computed in the lifting step: we lift bi, project it into the

2 span to obtain a vector close to W(ai ⊗ ai) ⊗ ai, and reshape it to a d × d matrix whose top right-singular vector is correlated with ai. This extraction-via-lifting

1 step allows us to circumvent a loss of kW− k in the error.

Organization.

The full implementation details and the analysis of our algorithm are given in the following few sections. First, Section 5.4 sets up some primitives for spectral subspace perturbation analysis and linear-algebraic procedures on which we build the full algorithm and its analysis. Then Section 5.5 covers the lifting stage of the algorithm in detail, while Section 5.6 elaborates on the rounding stage. Finally, in Section 5.7 we combine these tools to prove Theorem 5.7.3. The appendices some linear-algebraic tools and simulations strongly suggesting that random tensors with n  d2 components have constant condition number κ.

5.3 Preliminaries

Linear algebra. We use Idd to denote the d × d identity matrix, or just Id if the dimension is clear from context. For any subspace S, we use ΠS to denote

8This is in contrast to W , which is O 1 in the random case. k k ( )

235 Algorithm 5 Sketch of full algorithm, in the absence of noise ∈ (’d) 4  Ín 4 ∈ ’d Input: A 4-tensor T ⊗ , so that T i1 ai⊗ for unit vectors ai .

d2 d2 1. Take the square reshaping T ∈ ’ × of T and compute its whitening 1 2 W  T− / and the projector Π2  WTW to the image of T. ∈ (’d2 ) 3  Í ( 2) 3 2. Lifting: Compute the lifted tensor T0 ⊗ so that T0 i Wai⊗ ⊗ . (See Algorithm 6 for full details).

d (a) Find a basis for the subspace S3  (img T) ⊗ ’ ∩ sym: take S3 to be the top-n eigenspace of (Π2 ⊗ Id)Πsym(Π2 ⊗ Id). Then by Lemma 5.2.1,  ( 3) S3 Span ai⊗ . Π ( ⊗ )  ( 2 ⊗ ) (b) Find the projector 3 to the space W Id S3 Span Wai⊗ ai . { 2 ⊗ } (c) Compute the orthogonal 3-tensor: since Wai⊗ ai is an orthonormal basis, Õ 2 2 T Π3  (Wa⊗ ⊗ ai)(Wa⊗ ⊗ ai) . i i i Π Í ( 2) ⊗ ( 2) ⊗ ( 2) Therefore, reshape 3 as i Wai⊗ Wai⊗ ai⊗ and multiply W into the third mode to obtain T0.

3. Rounding: Use Gaussian rounding to find the components ai. (In the pres- ence of noise, this step becomes substantially more delicate; see Algorithms 7 to9). ∼ N( ) (a) Compute a random flattening of T0 by contracting with 1 0, Idd2 (1)  Í h1 ( 2)i · ( 2)( 2) along the first mode, T0 i , Wai⊗ Wai⊗ Wai⊗ > (b) Perform an SVD on T0(1) to recover the eigenvectors ( 2) ( 2) Wa1⊗ ,..., Wan⊗ . 1 2 2 (c) Apply W− to each eigenvector to obtain the ai⊗ , and re-shape ai⊗ to a matrix and compute its eigenvector to obtain ai. the projector to that subspace. For M a matrix, img(M) refers to the image, or columnspace, of M.

1 We will, in a slight abuse of notation, use M− to denote the Moore-Penrose pseudo-inverse of M. Except where explicitly specified, this will never be assumed 1  to be equal to the proper inverse, so that, e.g., in general MM− Πimg M , Id ( ) 1 1 1 and (AB)− , B− A− .

236 m n 1 2 For a matrix B ∈ ’ × , we will use the whitening matrix W  (BB)− / , which maps the columns of B to an orthonormal basis for img(B), so that ( )( )  WB WB > Πimg B . ( )

d 3 We denote by sym ⊆ (’ )⊗ the linear subspace

 ({ ⊗ ⊗ } ) sym Span x x x x ’d . ∈

( ) (|{ }| ) 1 { }  { } Note that Πsym i,j,k ; i ,j ,k is i, j, k ! − when i, j, k i0, j0, k0 and zero ( ) ( 0 0 0) otherwise.

d k Tensor manipulations. When working with tensors T ∈ (’ )⊗ , we will some- times reshape the tensors to lower-order tensors or matrices; in this case, if

S1,..., Sm are a partition of k, then T S1,...,Sm is the tensor given by identifying ( ) k the modes in each Si into a single mode. For S ⊂ [d] , we will also sometimes use the notation T(S) to refer to the entry of T indexed by S.

A useful property of matrix reshapings is that u ⊗ v reshapes into the outer product uvT. Linearity allows us to generalize this so, e.g., the reshaping of

n n m m n m q (U⊗V)M for U ∈ ’ × and V ∈ ’ × and M ∈ ’( ⊗ )× is equal to UM0(V ⊗Idq), n m q where M0 ∈ ’ ×( ⊗ ) is the reshaping of M. Since reshapings can be easily done and undone by exchanging indices, these identities will sometimes allow more efficient computation of matrix products over tensor spaces.

We will on occasion use a · as a placeholder in a partially applied multiple-

∂ 1 argument function: for instance f (·, y)  limh 0 ( f (·, y + h) − f (·, y)). ∂y → h

237 5.4 Tools for analysis and implementation

In this section, we briefly introduce some tools which we will use often in our analysis.

5.4.1 Robustness and spectral perturbation

A key tool in our analysis of the robustness of Algorithm 5 comes from the theory of the perturbation of eigenvalues and eigenvectors.

The lemma below combines the Davis-Kahan sin-Θ theorem with Weyl’s inequality to characterize how top eigenspaces are affected by spectral perturba- tion.

D D Theorem 5.4.1 (Perturbation of top eigenspace). Suppose Q ∈ ’ × is a symmetric ∈ D D matrix with eigenvalues λ1 > λ2 > ... > λD. Suppose also Qe ’ × is a symmetric matrix with kQ − Qek 6 ε. Let S and Sebe the spaces generated by the top n eigenvectors of Q and Qe respectively. Then,

def ε (S S)  kΠ − Π Π k  kΠ − Π Π k sin , e S Se S Se S Se 6 . (5.4.1) λn − λn+1 − 2ε

Consequently, 2ε kΠ − Π k S Se 6 . (5.4.2) λn − λn+1 − 2ε

Proof. We first prove the theorem assuming that Q and Qe are symmetric. By − Weyl’s inequality for matrices [82], the nth eigenvalue of Qe is at least λn 2ε. By Davis and Kahan’s sin-Θ theorem [26], since the top-n eigenvalues of Qe are all at least λn − 2ε and the lower-than-n eigenvalues of Q are all at most λn+1, the

238 k − k/( − − ) sine of the angle between S and Seis at most Q Qe λn λn+1 2ε . The final kΠ − Π k bound on S Se follows by triangle inequality. 

5.4.2 Efficient implementation and runtime analysis

It is not immediately obvious how to implement Algorithm 5 in time O˜ (n2d3), since there are steps that require we multiply or eigendecompose d3 × d3 matrices, which if done naively might take up to Ω(d9) time.

To accelerate our runtime, we must take advantage of the fact that our matrices have additional structure. We exploit the fact that in certain reshapings our tensors have low-rank representations. This allows us to perform matrix multiplication and eigendecomposition (via power iteration) efficiently, and obtain a runtime that is depends on the rank rather than on the dimension.

For example, the following lemma, based upon a result of [3], captures our eigendecomposition strategy in a general sense.

Lemma 5.4.2 (Implicit gapped eigendecomposition). Suppose a symmetric matrix ∈ d d  Í T M ’ × has an eigendecomposition M j λ j v j v j , and that Mx may be computed d within t time steps for x ∈ ’ . Then v1,..., vn and λ1, . . . , λn may be computed in

1 2 3 time O˜ (min(n(t + nd)δ− / , d )), where δ  (λn − λn+1)/λn. The dependence on the desired precision is polylogarithmic.

1 2 Proof. The n(t + nd)δ− / runtime is attained by LazySVD in [3, Corollary 4.3]. While LazySVD’s runtime depends on nnz(M) where nnz denotes the number of non-zero elements in the matrix, in the non-stochastic setting nnz(M) is used only as a bound on the time cost of multiplying a vector by M, so in our case we

239 may substitute O(t) instead.

The d3 time is attained by iterated squaring of M: in this case, all runtime dependence on condition numbers is polylogarithmic. 

The following lemma lists some primitives for operations with the tensor

d2 3 d 6 T0 ∈ (’ )⊗ in Algorithm 5, by interpreting it as a 6-tensor in (’ )⊗ and using a low-rank factorization of the square reshaping of that 6-tensor.

d 2 3 Lemma 5.4.3 (Implicit tensors). For a tensor T ∈ (’[ ] )⊗ , suppose that the matrix ∈ d 3 d 3  T ’[ ] ×[ ] given by T i,i ,j , k,k ,j T i,i , j,j , k,k has a rank-n decomposition ( 0 ) ( 0 0) ( 0) ( 0) ( 0) T d3 n 2 T  UV with U, V ∈ ’ × and n 6 d . Such a rank decomposition provides an implicit representation of the tensor T. This implicit representation supports:

d 2 T T Tensor contraction: For vectors x, y ∈ ’[ ] , the computation of (x ⊗ y ⊗ Id)T or (xT ⊗ Id ⊗yT)T or (Id ⊗xT ⊗ yT)T in time O(nd3) to obtain an output vector in

2 ’d .

∈ d2 d4 Spectral truncation: For R ’ × equal to one of the two matrix reshapings T 1,2 3 { }{ } 61 or T 2,3 1 of T, an approximation to the tensor T , defined as T after all larger- { }{ } than-1 singular values in its reshaping R are truncated down to 1. Specifically,

letting ρk be the kth largest singular value of R for k 6 O(n), this returns an

61 implicit representation of a tensor T0 such that kT0 − T kF 6 (1 + δ)ρk kTkF and

the reshaping of T0 corresponding to R has largest singular value no more than

1+(1+ δ)ρk. The representation of T0 also supports the tensor contraction, spectral truncation, and implicit matrix multiplication operations, with no more than a

2 3 3 2 1 2 constant factor increase in runtime. This takes time O˜ (n d + k(nd + kd )δ− / ).

d 2 d 2 Implicit matrix multiplication: For a matrix R ∈ ’[ ] ×[ ] with rank at most O(n), an implicit representation of the tensor (RT ⊗ Id ⊗ Id)T or (Id ⊗ Id ⊗RT)T, in time

240 O(nd4). This output also supports the tensor contraction, spectral truncation, and implicit matrix multiplication operations, with no more than a constant factor

increase in runtime. Multiplication into the second mode (Id ⊗RT ⊗ Id)T may also be implicitly represented, but without support for the spectral truncation operation.

The implementation of these implicit tensor operations consists solely of tensor reshapings, singular value decompositions, and matrix multiplication. However, the details get involved and lengthy, and so we defer their exposition to ??.

5.5 Lifting

This section presents Algorithm 6, which lifts a well-conditioned 4-tensor T of

2 d 3 rank at most d in (’ )⊗ to T0, an orthogonalized version of the 6-tensor in

d2 3 the same components in (’ )⊗ ; that is, we obtain an orthogonal 3-tensor T0 whose components correspond to the orthogonalized Kronecker squares of the components of T. Section 5.5.1 presents the identifiability argument giving robust algebraic non-degeneracy conditions under which the algorithm succeeds.

Although we assume that the tensor components ai are unit vectors, through- out this section we will keep track of factors of kai k so as to better elucidate the scaling and dimensional analysis.

The following two lemmas will argue that the algorithm is correct, and that it is fast. First, Lemma 5.5.1 states that the output of Algorithm 6 is an orthogonal

3-tensor whose components are W(ai ⊗ ai), where the ai are the components of the original 4-tensor and W is the whitening matrix for the ai ⊗ ai. Furthermore,

241 Algorithm 6 Function lift(T, n) d 4 2 Input: T ∈ (’ )⊗ , n ∈ Ž with n 6 d . 1. Use Lemma 5.4.2 to find the top-n eigenvalues and corresponding eigenvec- tors of the square matrix reshaping of T, and call the eigendecomposition T 1 2 T T T  QΛQ . This also yields W  QΛ− / Q and ΠS  QQ . 2. Use Lemma 5.4.2 again to find the top-n eigendecomposition of ( )  ΠS ’d ΠsymΠS ’d , implementing multiplication by ΠS ’d as ΠS ’d v , ,i ⊗T ⊗ ⊗ ⊗ (· · ) T QQ v , ,i and implementing Πsym as a . Call the result RΣR (· · )  T and take ΠS3 RR .  ( ⊗ ) ( ⊗ ) 3. Find a basis B0 for the columnspace of M3 W Id ΠS3 W Id . Implement this as ( )  1 2 T B0 , ,i ; QΛ− / Q R , ,i ; . (· · ) · (· · ) · 4. Use Gram-Schmidt orthogonalization to find an orthonormalization B of T B0. Call the projection operator to this basis Π3  BB . d2 3 T 5. Instantiate an implicit tensor in (’ )⊗ with Lemma 5.4.3, using BB as the 3 3 1 SVD of its underlying d × d reshaping. Output this as (Id ⊗W− ⊗ Id)T0, 1 meaning a tensor which, when W− is multiplied into its second mode, becomes equal to T0.

1 d2 3 Output: (Id ⊗W− ⊗ Id)T0 ∈ (’ )⊗ , implicitly as specified by Lemma 5.4.3, and d3 d3 Π3 ∈ ’ × . if the error in the input is small in spectral norm compared to some condition numbers, the Frobenius norm error in the output robustly remains within a small √ constant of n.

The main work of the lemma is deferred to Lemma 5.5.5 in Section 5.5.2, which repeatedly applies Davis and Kahan’s sin-Θ theorem (Theorem 5.4.1) to say that the top eigenspaces of various matrices in the algorithm are relatively unperturbed in spectral norm by small spectral norm error in the matrices. After √ that, we simply bound the Frobenius norm error of a rank-n matrix by 2 n times its spectral norm error, and reason that Frobenius norms are unchanged by tensor reshapings.

242 d Lemma 5.5.1 (Correctness of lift). Let a1,..., an ∈ ’ and suppose that T  Í 4 + k k 2 1 2 / i n ai⊗ E satifies E12;34 6 ε σn µ− κ for some ε < 1 63, where σn is the nth ∈[ ] Í 2 2T Í k k 2 3 3T eigenvalue of i n ai⊗ ai⊗ and µ is the operator norm of ai − ai⊗ ai⊗ and κ is ∈[ ] 1 the condition number from Lemma 5.5.3. Then the outputs (Id ⊗W− ⊗ Id)T0 and Π3 of lift(T, n) in Algorithm 6 satisfy

√ − Õ k k 2( ( ⊗ )) 3 1 2 T0 ai − W ai ai ⊗ 6 126 ε σn− / n i F and

Π − Π 2 6 63 ε . 3 Span(Wa⊗ a ) i ⊗ i

Proof. By Lemma 5.5.5, the ΠS3 computed in step 2 as the projector to the top-n 1 Π d Π Π d kΠ − Π 3 k eigenspace of S ’ sym S ’ satisfies S3 Span a⊗ 6 18 εσn µ− , and ⊗ ⊗ ( i ) subsequently, the Π3 computed in steps 3 and 4 as the projector to (W ⊗ Id)S3

kΠ − Π 2 k satisfies 3 Span Wa⊗ a 6 63 ε. ( i ⊗ i)

Since the rank of the error is at most 2n, the Frobenius norm error is at √ {k k 1 2 ⊗ } most 126 ε n, and since ai − Wai⊗ ai is an orthonormal set of vectors, the ( 2 ⊗ ) projector to Span Wai⊗ ai is just the sum of the self-outer-products of vectors in that set, so √ Õ 2 2 2 T Π3 − kai k− (Wa⊗ ⊗ ai)(Wa⊗ ⊗ ai) 6 126 ε n . i i F

3 3 d2 3 Reshaping the d × d matrix Π3 into a tensor in (’ )⊗ does not change the Frobenius norm error, and finally, multiplying in the last factor of W may k k  1 2 k − Í k k 2( ( ⊗ contribute a factor of W σn− / , so that in the end, T0 i ai − W ai √ )) 3k 1 2 ai ⊗ F 6 126 ε σn− / n. 

The next lemma states that the running time is O˜ (n2d3) multiplied by some condition numbers. We assume that asympotically faster matrix multiplications

243 and pseudo-inversions are not used, so that, for instance, squaring a d × d matrix takes time Θ(d3).

d Lemma 5.5.2 (Running time of lift). Let a1,..., an ∈ ’ and suppose that T  Í 4 + i n ai⊗ E satisfies the conditions stated in Lemma 5.5.1. Let σn be the nth eigenvalue ∈[ ] Í 2 2T ( ) of i n ai⊗ ai⊗ and κ the condition number from Lemma 5.5.3. Then lift T, n in ∈[ ] ˜ ( 4 1 2 + 2 3 1) Algorithm 6 runs in time O nd σn− / n d κ− , and the efficient implementation steps are correct.

Proof. Step 1 of lift invokes Lemma 5.4.2 on a d2 × d2 matrix T, recovering n  ˜ (( 4 + 2 2) 1 2) dimensions with a spectral gap of δ σn. This requires time O nd n d σn− / .

Step 2 again invokes Lemma 5.4.2, this time on a d3 × d3 matrix

ΠS ’d ΠsymΠS ’d , recovering n dimensions with a spectral gap of at least ⊗ ⊗ 2 − ∈ ( 2) ( 3) κ 2ε Ω κ . Multiplying by ΠS ’d may be done in time O nd due ⊗ ( )  T ( d) 3 to its expression as ΠS ’d v , ,i QQ v , ,i , since the third mode of ’ ⊗ ⊗ (· · ) (· · )  T ⊗ is unaffected by ΠS ’d QQ Id, and this is a concatenation of d different ⊗ 2 matrix-vector multiplies that take O(nd ) time each. Multiplying by Πsym takes

3 O(d ) time, since the (i, j, k)th row of Πsym has at most 6 nonzero entries cor- responding to the different permutations of (i, j, k). Thus the overall time to ( 3) multiply a vector by ΠS ’d ΠsymΠS ’d is O nd , so that Lemma 5.4.2 gives a ⊗ ⊗ 2 3 2 3 1 runtime of O˜ ((n d + n d )κ− ) for this step.

Step 3 is a concatenation of d different matrix products, each of which involves 2 × × 2 1 2 T multiplying a d n matrix R , ,i ; by a n d matrix Λ / Q and then multiplying (· · ) · the resulting n × n matrix by a d2 × n matrix Q. Each product thus takes O(n2d2) time, and since there are d of them the entire step takes O(n2d3) time. The result

1 2 T is equal to (QΛ / Q ⊗ Idd)R  (W ⊗ Id)R, whose columns form a basis for the

T columnspace of M3  (W ⊗ Id)RΣR (W ⊗ Id).

244 3 Step 4 applies Gram-Schmidt orthonormalization on n vectors in ’d , taking O˜ (n2d3) time. And step 5 takes constant time. Therefore, lift takes time ˜ ( 4 1 2 + 2 3 1) O nd σn− / n d κ− . 

5.5.1 Algebraic identifiability argument

The main lemma in this section gives a more careful analysis of the algebraic identifiability argument from Lemma 5.2.1, in order to obtain a quantitative condition number bound.

d 2 Lemma 5.5.3 (Main Identifiability Lemma). Let a1,..., an ∈ ’ with n 6 d . Let S ({ 2}) ({ 3}) denote Span ai⊗ and let S3 denote Span ai⊗ and assume both have dimension n. { } d For each i, let bi,j j d 1 be an arbitrary orthonormal basis for vectors in ’ orthogonal ∈[ − ] to ai, and let  ⊗ ⊗   a1 a1 b1,1     .   .    T   H :  a ⊗ a ⊗ b  .  i i i,j     .   .     ⊗ ⊗   an an bn,d 1   −  T 1 2 Let R  (HH )− / H be a column-wise orthonormalization of H, and let K  1 ( − ) 2 Id P2,3 R, where P2,3 is the that exchanges the 2nd and 3rd d 3 modes of (’ )⊗ . Then if κ  σmin(K) is non-zero (so that K is full rank),

d (S ⊗ ’ ) ∩ sym  S3 , and furthermore, k − k − 2 ΠS ’d ΠsymΠS ’d ΠS3 6 1 κ . ⊗ ⊗

245  (Í 2 2T) 1 2 ( 2 ⊗ ) Proof. Let W i ai⊗ ai⊗ − / and let T denote the columnspace of W Id H. 2 d The columns of W H form a basis for the subspace of S ⊗ ’ orthogonal to S3

2 3 since each column of W H is orthogonal to every ai⊗ . Therefore,

 + ΠS ’d ΠS3 ΠT . ⊗

 Multiplying this with Πsym and itself and then applying the identities ΠsymΠS3    ΠS3 Πsym ΠS3 and ΠS3 ΠT ΠTΠS3 0,

 + ΠS ’d ΠsymΠS ’d ΠS3 ΠTΠsymΠT . ⊗ ⊗

Therefore, k − k k k2 ΠS ’d ΠsymΠS ’d ΠS3 6 ΠsymΠT . ⊗ ⊗ 2 2 We would thus like to show that kΠsymΠT k 6 1 − κ .

2 2 2 Since kΠsymΠT k  maxy T kΠsym y0k /k y0k  1 − 0∈ 2 2 miny T k(Id −Πsym)y0k /k y0k , it is enough to show that 0∈ miny T k(Id −Πsym)y0k/ky0k > κ. By Lemma 5.5.4, that is implied by 0∈ k(Id −Πsym)yk/kyk > κ for y ∈ img(H).

Since Πsym  Π2,3 where Π2,3 is the projector to the space invariant under (’d) 3 Π  1 ( + ) interchange of the second and third modes of ⊗ and 2,3 2 Id P2,3 , we k( −Π ) k/k k k 1 ( − ) k/k k ∈ ( ) see that Id sym y y > 2 Id P2,3 y y for y img H . Since the 1 columns of R are an orthonormal basis for img(H), for x  R− y spanning all of ’n we have

k(Id −P2,3)yk k(Id −P2,3)Rxk k(Id −P2,3)Rxk kKxk    . 2k yk 2kRxk 2kxk kxk

The expression on the right is the definition of κ. Therefore, k(Id −Πsym)yk/kyk > k 1 ( − ) k/k k  2 Id P2,3 y y κ. 

246 { } Lemma 5.5.4. For each i, let bi,j j d 1 be an arbitrary orthonormal basis for vectors ∈[ − ] d in ’ orthogonal to ai, and let

 ⊗ ⊗   a1 a1 b1,1     .   .    T   H :  a ⊗ a ⊗ b  .  i i i,j     .   .     ⊗ ⊗   an an bn,d 1   −  2 Let H0  (W ⊗ Id)H. If k(Id −Πsym)yk > tk yk for all y ∈ img(H), then k(Id −Πsym)y0k > tk y0k for all y0 ∈ img(H0).

 ( 2)  ( 3) ⊆ ( ) ⊆ Proof. Let S Span ai⊗ and S3 Span ai⊗ sym. Observe that img H0 d S ⊗ ’  img(H) + S3. Therefore, for every y0 ∈ img(H0) there will be some y ∈ img(H) and some z ∈ S3 such that y0  y + z. Also, since S3 ⊥ img(H0), we have z ⊥ y0, and therefore k yk  ky0 − zk > k y0k.

So if the premise of the lemma holds and k(Id −Πsym)yk > tkyk for all y ∈ img(H), it will also be the case that k(Id −Πsym)y0k  k(Id −Πsym)(y + z)k > tkyk > tky0k. 

5.5.2 Robustness arguments

The main lemma of this section gives all of the spectral eigenspace perturbation arguments needed to argue the correctness and robustness of Algorithm 6. Here we essentially repeatedly apply Davis and Kahan’s sin-Θ theorem (Theorem 5.4.1) through a sequence of linear algebraic transformations, along with triangle inequality and some adding-and-subtracting, to argue that the desired top

247 eigenspace remains stable against the spectral-norm errors melded in at each step.

 Í 2 2T Lemma 5.5.5 (Subspace perturbation for lift). Let T i n ai⊗ ai⊗ and let Te be ∈[ ] k − k 2 1 2 / a matrix with T Te 6 ε σn µ− κ for some ε < 1 63, where σn is the nth eigenvalue Í k k 2 3 3T of T and µ is the operator norm of ai − ai⊗ ai⊗ and κ is the condition number  ({ 2})  ( )  ( ) from Lemma 5.5.3. Let S Span ai⊗ img T and let Se img Te . Also let  ({ 3}) S3 Span ai⊗ . Then

k Π Π Π  − Π k 1 topn S ’d sym S ’d S3 6 18 ε σn µ− , e⊗ e⊗  where topn denotes the top-n eigenspace. Furthermore, letting Se3 ( )  1 2  1 2 topn ΠS ’d ΠsymΠS ’d and W T− / and We Te− / , we have e⊗ e⊗ kΠ − Π k W Id S3 W Id S3 6 63 ε . ( e ⊗ )e ( ⊗ )

  Proof. For brevity, let Π ΠS ’d and let Πe ΠS ’d . We write ⊗ e⊗ −  −  + −  ΠΠe sym Πe ΠΠsym Π Πe Π Πsym Πe ΠΠsym Πe Π .

kT −Tk 2 1 2 kΠ−Πk  kΠ −Π k 1 2 Since e 6 ε σn µ− κ , by Theorem 5.4.1, e Se S 6 3εσn µ− κ . Since projectors don’t increase spectral norm, we conclude

k − k 1 2 ΠΠe sym Πe ΠΠsym Π 6 6εσn µ− κ .

 + Furthermore, by Lemma 5.5.3, ΠΠsym Π ΠS3 Z, where Z is a symmet- 2 ric matrix with kZk 6 1 − κ whose columnspace is orthogonal to S3 since  ΠS3 ΠS ’d ΠsymΠS ’d ΠS3 . Therefore, ⊗ ⊗ kΠΠ Π − (Π + )k 1 2 e syme S3 Z 6 6εσn µ− κ .

( + ) ( + ) The top-n eigenspace of ΠS3 Z is S3 and the nth and n 1 th eigenvalues of ( + ) 2 ΠS3 Z differ by at least κ . So by Theorem 5.4.1,

k (ΠΠ Π) − Π k 1 topn e syme S3 6 18εσn µ− .

248 k k2  1 Multiplying by W multiplies this error by at most a factor of W σn− , so that

1 k(W ⊗ Id)Π (W ⊗ Id) − (W ⊗ Id)Π (W ⊗ Id)k 6 18 εµ− . Se3 S3

And k(W ⊗ Id)Π (W ⊗ Id) − (W ⊗ Id)Π (W ⊗ Id)k 6 3ε since Π (W ⊗ Id) has e Se3 e Se3 Se3 1 2 a spectral norm at most σn− / , so that

1 k(W ⊗ Id)Π (W ⊗ Id) − (W ⊗ Id)Π (W ⊗ Id)k 6 21 εµ− . e Se3 e S3

( ⊗ ) ( ⊗ ) 1 By Lemma 5.5.6, the smallest eigenvalue of W Id ΠS3 W Id is at least µ− . kΠ − Π k Therefore, by Theorem 5.4.1, W Id S3 W Id S3 6 63 ε.  ( ⊗ )e ( ⊗ )

The following utility lemma is used to reduce the impact of condition numbers on the algorithm. It shows that when multiplying a third-order tensor in the span 3  1 2  (Í 2 2T) 1 2 of ai⊗ by the second-order whitener W T− / i ai⊗ ai⊗ − / , the penalty to the error may be expressed in terms of a sixth-order condition number – the  Í k k 2 3 3T spectral norm of U ai − ai⊗ ai⊗ – instead of the fourth-order one given by T.

Í 2 2T The reason this is important is that i ai⊗ ai⊗ suffers from spurious directions: 2 directions v ∈ ’⊗ in which Tv may be very large, but v is not close to any of the ai ⊗ ai, or in fact any rank-1 2-tensor at all. For example, for n random Gaussian  ⊗ vectors, the spurious direction is given by Φ ¾1 Ž 0,1 1 1, which will have ∼ ( ) kTΦk ≈ n/d.

 Í k k 2 3 3T The sixth-order object U ai − ai⊗ ai⊗ does not suffer with this problem for n up to O˜ (n2), due to cancellation with the odd number of modes. For 3  k ( ⊗ )k ≈ / 2 ∈ d instance, U ¾1 Ž 0,1 1⊗ 0 and U Φ u n d for all unit u ’ and U ∼ ( ) generated from random Gaussian vectors.

249 d 2 Lemma 5.5.6 (Sixth-order condition numbers). Let a1,..., an ∈ ’ with n 6 d .  (Í 2 2T) 1 2 Í k k 2 3 3T Let W i ai⊗ ai⊗ − / have rank n. Let U be the matrix ai − ai⊗ ai⊗ . Then ∈ ( 3) for a vector v Span ai⊗ , the following hold:

1 1 2 k(W ⊗ Id)vk 6 kU− k / kvk ,

1 2 k(W ⊗ Id)vk > kUk− / kvk .

 Í k k 1 3 Proof. Let v µi ai − ai⊗ . Then

k( ⊗ ) k2  Õ k k 1k k 1h ( ⊗ ) ( ⊗ )ih i  Õ 2 W Id v µi µj ai − a j − W a j a j , W ai ai a j , ai µi , using the fact that {W(ai ⊗ ai)}i is an orthonormal set of vectors. 

5.6 Rounding

In this section, we show how to “round” the lifted tensor to extract the components. That is, assuming we are given the tensor

 Õ ( 2) 3 + T Wai⊗ ⊗ E i n ∈[ ] √ where E is a tensor of Frobenius norm at most ε n, we show how to find the components ai.

d Lemma 5.6.1. Suppose a1,..., an ∈ ’ are unit vectors satisfying the identifiability as- sumption from Lemma 5.5.1, and suppose we are given an implicit rank-n representation √  Í ( 2) 3+ ∈ (’d2 ) 3 k k of the tensor T i Wai⊗ ⊗ E ⊗ , where E F 6 ε n, and an implicit rank-n Π kΠ −Í (( 2)⊗ )(( 2)⊗ ) k 1 representation of a matrix 3 such that 3 i Wai⊗ ai Wai⊗ ai > 6 ε < 2 .

Then for any β, δ ∈ (0, 1) so that βδ  Ω(ε) and δ  Ω(ε), there is a randomized ( 1 1+O β 3) ˜ ( 2 3) algorithm that with high probability in time O β n ( )d with O n d preprocessing

250 time recovers a unit vector u such that for some i ∈ [n],

 1 8 2 ε / hai , ui > 1 − kW k · O , β

 1 8 k k ε / so long as W β < C for a universal constant C.

Further, there is an integer m > (1 − δ)n so that repeating the above algorithm O˜ (n)  1 8 h i2 − k k · ε / times recovers unit vectors u1,..., um so that ui , ai > 1 W O δβ for all  1 8 ∈ [ ] k k ε / i m (up to re-indexing), again so long as W δβ < C, and with a total runtime ˜ ( 1 2+O β 3) of O β n ( )d .

We will prove this theorem in four steps. First, in Section 5.6.1 we will show how to recover vectors that are (with reasonable probability) correlated with the

2 whitened Kronecker squares of the components, Wai⊗ . In Section 5.6.2, we’ll 2 give an algorithm that given a vector close to the whitened square Wai⊗ , recovers a vector close to the component ai. In Section 5.6.3, we give an algorithm that ∈ d { } tests if a vector a ’ is close to one of the components ai i n . In these first ∈[ ] three sections, we omit runtime details; in Section 5.6.4 we put the arguments together and address runtime details as well.

5.6.1 Recovering candidate whitened and squared components

Here, we give an algorithm for recovering components that have constant

2 correlation with the Wai⊗ . In this subsection, our result applies in generality to d2 arbitrary orthonormal vectors b1,..., bn ∈ ’ . The algorithm and its analysis follow almost directly from [68]; for completeness we re-state the important lemmas here, and detail what little adaptation is necessary.

251 Algorithm 7 Rounding to a whitened component Function round(T, β, ε): d2 3 Input: a tensor T ∈ (’ )⊗ , a spectral gap bound β, and an error tolerance ε. 1. Decrease the spectral norm of the error term in rectangular reshapings: 4 × 2 (a) compute T0, the projection of T 1,2 3 to O, the set of d d matrices with spectral norm at most 1 { }{ } √ T61 T O n (b) compute , the projection of 01,3 2 to (may be done up to ε Frobenius norm error). { }{ } 61 { } ∼ N( ) 2. Compute a random flattening of T along the 1 mode: for 1 0, Idd2 , compute Õ T(1)  1i · T(i, ·, ·). i d2 ∈[ ]

3. Recover candidate component vectors: compute uL(1) and uR(1), the top (1) ( 1 ) left- and right-singular vectors of T using O β log d steps of power iteration.

Output: the candidate components uL(1) and uR(1).

∈ ’d2  Í 3+ Lemma 5.6.2. Suppose that b1,..., bn are orthonormal. Then if T i n bi⊗ √ ∈[ ] k k  Ω( ) Ω( ε ) E for a tensor E with E F 6 ε n and δ ε , δ 6 β < 1, repeating steps 2 O β & 3 of Algorithm 7 O˜ (n ( )) times will with high probability recover a unit vector u h i2 − ε ∈ [ ] such that u, bi > 1 δβ for some i n . Furthermore, repeating steps 2 & 3 of 1+O β Algorithm 7 O˜ (n ( )) times will with high probability recover m > (1 − δ)· n unit ∈ [ ] h i2 − ε vectors u1,..., um such that for each ui there exists j n so that ui , b j > 1 δβ .9

The proof follows from two lemmas:

Lemma 5.6.3. The tensor T^{≤1} computed in step 1 of Algorithm 7 remains close to S = Σ_i b_i^⊗3 in Frobenius norm, ‖T^{≤1} − S‖_F ≤ ε√n, and furthermore

‖T^{≤1}_{{1,2}{3}}‖ ≤ 1 and ‖T^{≤1}_{{1,3}{2}}‖ ≤ 1.

⁹In particular, if we choose δ = ε·log log n, we will recover all but ε·n·log log n of the b_i in Õ(n) repetitions.

The proof of Lemma 5.6.3 is identical to the proof of [68, Lemma 4.5], and uses the fact that distances decrease under projection to convex sets to control the error, and the fact that the truncation operation is equivalent to multiplication by a contractive matrix to argue that T^{≤1} has bounded norm in both reshapings.

Lemma 5.6.4. Suppose that in spectral norm ‖T_{{1,2}{3}}‖, ‖T_{{1,3}{2}}‖ ≤ 1, and also that ‖T − Σ_i b_i^⊗3‖_F ≤ ε√n. Let T(g) be the random flattening of T produced in step 2 of Algorithm 7, and let u_L(g) and u_R(g) be the top left- and right-singular vectors of T(g) respectively. Then there is a universal constant C such that for any δ ≥ C·ε and Ω(ε/δ) ≤ β < 1, for a 1 − δ fraction of j ∈ [n],

ℙ_{g∼N(0,Id)} [ ⟨u_L(g), b_j⟩² ≥ 1 − ε/(δβ)  or  ⟨u_R(g), b_j⟩² ≥ 1 − ε/(δβ) ] ≥ Ω̃(n^{−1−O(β)}),

and further when this event occurs the ratio of the first and second singular values of T(g) is lower bounded by β: σ₁(T(g))/σ₂(T(g)) ≥ 1 + β.

Proof. By assumption, T^{≤1} = S + E for S = Σ_{i∈[n]} b_i^⊗3, and E is a tensor of Frobenius norm at most ε√n and spectral norms ‖E‖_{{1,2}{3}} ≤ 1 and ‖E‖_{{1,3}{2}} ≤ 1. For g ∼ N(0, Id_{d²}), we have

T(g) = S(g) + E(g) = (Σ_{k∈[n]} ⟨g, b_k⟩ · b_k b_k^⊤) + (Σ_{i∈[d²]} g_i · E_i),

where we use E_i = E(i, ·, ·) to refer to the d² × d² matrix given by taking the {2}, {3} flattening of E restricted to coordinate i in mode 1.

The proof of the lemma is now identical to that of [68, Lemma 4.6 and Lemma

4.7]. There are two primary differences: the first is that in [68] the tensor has four modes, and our tensor effectively has 3 modes. This difference is negligible, since in [68], two of the four modes are always identified anyway.

The second difference is that we choose parameters differently. We take the parameter β appearing in [68, Lemma 4.6] so that β = Ω(ε/δ);¹⁰ this is to emphasize that for small ε = 1/n, one can recover all m = n of the components. Because the proof is otherwise the same, we merely sketch an overview here.

The first term in isolation is a random flattening of an orthogonal tensor, and so with probability 1 the eigenvectors of the first term are precisely the bk. The second term, which is the flattening of the noise term, introduces complications; however, the combination of the spectral norm bound and the Frobenius norm bound on E is enough to argue (using a matrix Bernstein inequality, Markov’s inequality and the orthogonality of the bi) that the random flattening of E cannot have spectral norm larger than ε/δ in more than 1 − δ of the bk’s directions.

To finish the proof, we perform a large deviation analysis on the coefficients ⟨g, b_k⟩, lower bounding the probability that for the 1 − δ fraction of the b_k that are not too aligned with the spectrum of E, there is a sufficiently large gap between ⟨g, b_k⟩ and the ⟨g, b_i⟩ for i ≠ k, so that b_k is correlated with the top singular vectors of T(g).¹¹ The bound on the ratio of the singular values comes from [68, Lemma 4.7] as well. □

Proof of Lemma 5.6.2. The proof simply follows by applying Lemma 5.6.3, then Lemma 5.6.4. □

¹⁰We comment that the parameter c appearing in the statement of [68, Lemma 4.6] is larger than √2; this is necessary for the application of [68, Lemma 4.7], and is not clear from the lemma statement but is implicit in the proof.

¹¹We note that to obtain correlation 1 − ε/(δβ), one must directly use the proof of [68, Lemma 4.7], rather than the statement of the lemma (which has assumed that 2ε(1 + β)/(βδ) ≤ 0.01, and replaced the expression 1 − 2ε(1 + β)/(βδ) with the lower bound 0.99).

5.6.2 Extracting components from the whitened squares

We now present the following simple algorithm, which recovers a vector close to a_i given a vector close to W(a_i^⊗2). For convenience we will again work with generic orthonormal vectors b_i in place of the W(a_i^⊗2), and we will assume we have access to the matrix Π₃ (the approximate projector to Span{(W a_i^⊗2) ⊗ a_i}) computed in Algorithm 6.

Algorithm 8 Extracting the component from the whitened square

Function extract(u, Π₃):
Input: a unit vector u ∈ (ℝ^d)^⊗2 such that ⟨u, b_i⟩² ≥ 1 − θ for some i ∈ [n], and a projector Π₃ ∈ ℝ^{d³×d³} such that ‖Π₃ − Σ_i b_i b_i^⊤ ⊗ a_i a_i^⊤‖ ≤ ε.

1. Compute the matrix M = Π₃(uu^⊤ ⊗ Id).
2. Compute the top left-singular vector v of M.
3. Taking the reshaping V = v_{{3}{1,2}}, let a = Vu.

Output: the vector a ∈ ℝ^d
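A dense-matrix sketch of Algorithm 8 follows, under the simplifying assumption that Π₃ is given explicitly as a d³ × d³ array (the dissertation works with an implicit rank-n representation instead); all names are illustrative.

import numpy as np

def extract(u, Pi3, d):
    """Algorithm 8: given u in R^{d^2} close to some b_i and an approximate
    projector Pi3 ~ sum_i (b_i b_i^T) kron (a_i a_i^T), recover a vector close to a_i."""
    M = Pi3 @ np.kron(np.outer(u, u), np.eye(d))   # step 1: M = Pi3 (u u^T kron Id)
    v = np.linalg.svd(M)[0][:, 0]                  # step 2: top left-singular vector
    # Step 3: reshape v (indexed by (b-coordinates, a-coordinates)) so the a-mode
    # becomes the rows, giving a d x d^2 matrix V, and output a = V u.
    V = v.reshape(d * d, d).T
    a = V @ u
    return a / np.linalg.norm(a)

# Usage on a tiny synthetic instance (d = 3, n = 2) with an exact Pi3.
rng = np.random.default_rng(1)
d, n = 3, 2
A = rng.standard_normal((d, n))
A /= np.linalg.norm(A, axis=0)
B, _ = np.linalg.qr(rng.standard_normal((d * d, n)))    # stand-ins for the b_i
Pi3 = sum(np.kron(np.outer(B[:, i], B[:, i]), np.outer(A[:, i], A[:, i])) for i in range(n))
a_hat = extract(B[:, 0], Pi3, d)
print(abs(a_hat @ A[:, 0]))   # ~1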

Lemma 5.6.5. Suppose b_1, ..., b_n ∈ ℝ^{d²} are orthonormal vectors and a_1, ..., a_n ∈ ℝ^d, and Π₃ ∈ ℝ^{d³×d³} is such that ‖Π₃ − Σ_i b_i b_i^⊤ ⊗ a_i a_i^⊤‖ ≤ ε. Then if u ∈ ℝ^{d²} is a unit vector with ⟨u, b_i⟩² ≥ 1 − θ for θ < 1/10, then the output a ∈ ℝ^d of Algorithm 8 on u has the property that |⟨a, a_i⟩| ≥ 1 − 4θ^{1/4} − 4√ε.

Proof. Let P₃ = Σ_i b_i b_i^⊤ ⊗ a_i a_i^⊤. By assumption we can write the approximate projector Π₃ = P₃ + E, for a matrix E of spectral norm ‖E‖ ≤ ε. Based on these expressions we can re-express the product,

M = Π₃(uu^⊤ ⊗ Id) = P₃(uu^⊤ ⊗ Id) + E(uu^⊤ ⊗ Id).

By assumption, the second term is a matrix of spectral norm at most ‖E‖ ≤ ε.

We now consider the first term. If u = c·b_i + w, then for the first term we have

P₃(uu^⊤ ⊗ Id) = P₃(c²·b_i b_i^⊤ ⊗ Id) + P₃((c·b_i w^⊤ + c·w b_i^⊤ + ww^⊤) ⊗ Id).

The second term is again a matrix of spectral norm at most 3c·‖w‖ = 3c·√(1 − c²). The first term can be further simplified as

P₃(c²·b_i b_i^⊤ ⊗ Id) = c²·Σ_j ⟨b_j, b_i⟩ b_j b_i^⊤ ⊗ a_j a_j^⊤ = c²·b_i b_i^⊤ ⊗ a_i a_i^⊤,

by the orthogonality of the b_i. This is a rank-1 matrix with singular value c². Therefore, M = c²(b_i ⊗ a_i)(b_i ⊗ a_i)^⊤ + Ẽ, where ‖Ẽ‖ ≤ ε + 3c√(1 − c²). It follows from Lemma 5.6.6 that if v is the top unit left-singular vector of M, then ⟨v, b_i ⊗ a_i⟩² ≥ 1 − (2/c²)‖Ẽ‖.

Now, in step 3 we re-shape v to a d × d² matrix V of Frobenius norm 1; because v is a unit vector, we have that V = a_i b_i^⊤ + Ṽ for Ṽ of spectral norm ‖Ṽ‖ ≤ ‖Ṽ‖_F ≤ √((2/c²)‖Ẽ‖). Therefore,

Vu = (a_i b_i^⊤)(c·b_i + w) + Ṽu = (c + ⟨w, b_i⟩)·a_i + Ṽu,

and the latter vector has norm at most ‖Ṽ‖, and ⟨w, b_i⟩ ≤ ‖w‖ ≤ √(1 − c²). Finally, substituting c = √(1 − θ) and using our bounds on ‖Ẽ‖ and ‖Ṽ‖ and some algebraic simplifications, the conclusion follows. □

Lemma 5.6.6. Suppose that M = uv^⊤ + E for u ∈ ℝ^d, v ∈ ℝ^k unit vectors and E ∈ ℝ^{d×k} a matrix of spectral norm ‖E‖ ≤ ε. Then if x, y are the top left- and right-singular vectors of M, |⟨x, u⟩|, |⟨y, v⟩| ≥ 1 − 2ε.

Proof. Let M = Σ_i σ_i x_i y_i^⊤ be the singular value decomposition of M, with σ₁ ≥ ··· ≥ σ_d. We have that

1 − ε ≤ u^⊤ M v ≤ σ₁.

On the other hand, if ⟨x₁, u⟩ = α ≤ 1 and ⟨y₁, v⟩ = β ≤ 1,

σ₁ = x₁^⊤ M y₁ ≤ αβ + ε.

Therefore,

|α|, |β| ≥ αβ ≥ 1 − 2ε,

and thus min{|α|, |β|} ≥ 1 − 2ε. □
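A quick numerical sanity check of Lemma 5.6.6 (rank-one signal plus bounded noise), with freely chosen dimensions, verifies that the top singular vectors of uv^⊤ + E retain at least the claimed 1 − 2ε overlap:

import numpy as np

rng = np.random.default_rng(2)
d, k, eps = 40, 30, 0.05
u = rng.standard_normal(d); u /= np.linalg.norm(u)
v = rng.standard_normal(k); v /= np.linalg.norm(v)
E = rng.standard_normal((d, k))
E *= eps / np.linalg.norm(E, 2)          # force spectral norm exactly eps
M = np.outer(u, v) + E
X, s, Yt = np.linalg.svd(M)
x, y = X[:, 0], Yt[0, :]
print(abs(x @ u), abs(y @ v), 1 - 2 * eps)   # both overlaps exceed 1 - 2*eps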

5.6.3 Testing candidate components

The following algorithm allows us to test whether a candidate component u is close to some component ai.

Algorithm 9 Testing component membership

Function test(û, θ, Π₃):
Input: a unit vector û and the correlation parameter θ. Also Π₃, an approximate projector to Span{(W a_i^⊗2) ⊗ a_i}.

1. Compute ρ = (W û^⊗2) ⊗ û.
2. If ‖Π₃ ρ‖₂² < (1 − θ)‖ρ‖₂², return false. Otherwise, return true.
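A sketch of the membership test with W and Π₃ passed in as dense arrays (our own simplification of the implicit representation); the helper names are ours, and the small usage example fixes a_i to the standard basis vectors so that the W a_i^⊗2 are exactly orthonormal.

import numpy as np

def test(u_hat, theta, W, Pi3):
    """Algorithm 9: accept u_hat iff (W (u_hat kron u_hat)) kron u_hat keeps at
    least a (1 - theta) fraction of its squared norm under Pi3."""
    rho = np.kron(W @ np.kron(u_hat, u_hat), u_hat)
    return np.linalg.norm(Pi3 @ rho) ** 2 >= (1 - theta) * np.linalg.norm(rho) ** 2

rng = np.random.default_rng(3)
d = n = 4
A = np.eye(d)
T2 = sum(np.outer(np.kron(A[:, i], A[:, i]), np.kron(A[:, i], A[:, i])) for i in range(n))
evals, evecs = np.linalg.eigh(T2)
W = evecs @ np.diag([s ** -0.5 if s > 1e-9 else 0.0 for s in evals]) @ evecs.T
cols = [np.kron(W @ np.kron(A[:, i], A[:, i]), A[:, i]) for i in range(n)]
Pi3 = sum(np.outer(c, c) for c in cols)
good, bad = A[:, 0], rng.standard_normal(d)
bad /= np.linalg.norm(bad)
print(test(good, 0.3, W, Pi3), test(bad, 0.3, W, Pi3))   # expect True, (likely) False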

Lemma 5.6.7. Let P₃ be the projector to Span{(W a_i^⊗2) ⊗ a_i}, and suppose that we have Π₃ such that ‖Π₃ − P₃‖ ≤ ε < 1/2. Then if Algorithm 9 is run on a vector û such that ⟨û, a_i⟩² ≤ 1 − θ − 2ε for all i ∈ [n], then Algorithm 9 returns false.

Conversely, if Algorithm 9 is run on a unit vector û with ⟨û, a_i⟩² ≥ 1 − ((θ − ε)/(10‖W‖))² ≥ 1 − 1/10 for some i ∈ [n], then Algorithm 9 returns true.

Proof. By assumption, we can write Π₃ = P₃ + E for P₃ the projector to Span{(W a_i^⊗2) ⊗ a_i} and E a matrix of spectral norm at most ε. From this, we have

Π₃ ((W û^⊗2) ⊗ û) = P₃ ((W û^⊗2) ⊗ û) + E ((W û^⊗2) ⊗ û),   (5.6.1)

and ‖E((W û^⊗2) ⊗ û)‖ ≤ ε‖W û^⊗2‖. Now, we can write W û^⊗2 = Σ_i c_i W(a_i ⊗ a_i) + e, where e is orthogonal to Span{W a_i^⊗2}, and we can further write

(W û^⊗2) ⊗ û = Σ_{i,j} c_i γ_j^{(i)} · (W a_i^⊗2) ⊗ b_j^{(i)} + Σ_i c_i γ_i · (W a_i^⊗2) ⊗ a_i + e ⊗ û,

where {b_j^{(i)}}_{j≠i} is an orthogonal basis for the orthogonal complement of a_i in ℝ^d. By definition, P₃ ((W a_i^⊗2) ⊗ b_j^{(i)}) = 0, as this is orthogonal to every vector in Span{(W a_i^⊗2) ⊗ a_i}. Therefore,

P₃ ((W û^⊗2) ⊗ û) = Σ_i c_i γ_i · (W a_i^⊗2) ⊗ a_i.

Now, if ⟨û, a_i⟩² ≤ τ for all i ∈ [n], then γ_i² ≤ τ for all i ∈ [n]. It thus follows that ‖P₃ ((W û^⊗2) ⊗ û)‖₂² ≤ max_i γ_i² · Σ_j c_j² ≤ τ · ‖W û^⊗2‖₂². Combining this with Eq. (5.6.1), we have that

‖Π₃ ((W û^⊗2) ⊗ û)‖₂² ≤ (√τ + ε)² · ‖W û^⊗2‖₂² ≤ (τ + 2ε)‖W û^⊗2‖₂²,

for ε < 1/2. It follows that if ⟨û, a_i⟩² ≤ τ = 1 − θ − 2ε for all i ∈ [n], then the algorithm returns false.

Conversely, if without loss of generality û = ζ·a₁ + ê for ê ∈ ℝ^d orthogonal to a₁, then W û^⊗2 = ζ²·W a₁^⊗2 + W e′ with ‖W e′‖₂² ≤ (1 − ζ²)·‖W‖². Measuring the correlation of W û^⊗2 with W a₁^⊗2, we have that c₁ ≥ ζ². Also γ₁ ≥ ζ, which implies

P₃ ((W û^⊗2) ⊗ û) = ζ³·(W a₁^⊗2) ⊗ a₁ + ẽ,

where ẽ is a leftover term with ‖ẽ‖ ≤ √(1 − ζ²)·‖W û^⊗2‖ + ζ²√(1 − ζ²) (where we have used the PSDness of W). Combining this with Eq. (5.6.1),

Π₃ ((W û^⊗2) ⊗ û) = ζ³ (W a₁^⊗2) ⊗ a₁ + ẽ + E ((W û^⊗2) ⊗ û).

For convenience let ρ̂ = ẽ + E((W û^⊗2) ⊗ û); from our previous observations, we have ‖ρ̂‖ ≤ (ε + √(1 − ζ²))‖W û^⊗2‖ + ζ²√(1 − ζ²).

Now, if ⟨û, a₁⟩² = ζ² ≥ 1 − η, we have that

(1 − η) − √η‖W‖ ≤ ‖W û^⊗2‖ ≤ (1 − η) + √η‖W‖.

From this,

‖Π₃ ((W û^⊗2) ⊗ û)‖ ≥ ζ³ − ‖ρ̂‖ ≥ (1 − η)^{3/2} − (ε + √η)‖W û^⊗2‖ − (1 − η)√η
 ≥ √(1 − η)·(‖W û^⊗2‖ − √η‖W‖) − (ε + √η)‖W û^⊗2‖ − (1 − η)√η
 ≥ (1 − ε − 2√η)‖W û^⊗2‖ − 2√η‖W‖
 ≥ (1 − ε − 5√η‖W‖)‖W û^⊗2‖,

where we have used that η < 1/10. Thus, if η < ((θ − ε)/(10‖W‖))², Algorithm 9 does not return false. □

5.6.4 Putting things together

Finally, we prove Lemma 5.6.1.

Proof of Lemma 5.6.1. By the assumptions of the theorem, we have access to an implicit rank-n representation of T = Σ_{i∈[n]} (W(a_i^⊗2))^⊗3 + E ∈ (ℝ^{d²})^⊗3, where W = (Σ_{i∈[n]} (a_i^⊗2)(a_i^⊗2)^⊤)^{−1/2}, and where ‖E‖_F ≤ ε√n. For convenience we denote b_i = W(a_i^⊗2). Note that the b_i are orthonormal vectors in ℝ^{d²}. We also have implicit access to a rank-n representation of Π₃, where ‖Π₃ − Σ_i (b_i ⊗ a_i)(b_i ⊗ a_i)^⊤‖ ≤ ε.

We first run step 1 of Algorithm 7 to produce the tensor which we will round.

Then, for ℓ = Õ(n^{1+O(β)}) independent iterations, we run steps 2 & 3 of Algorithm 7 to produce candidate whitened squares u_1, ..., u_ℓ, then run Algorithm 8 on the u_i to produce candidate components û_i, and finally run Algorithm 9 to check if û_i is close to a_j for some j ∈ [n].

We show that step 1 of Algorithm 7 takes time Õ(n²d³). Since T is at most ε√n in Frobenius norm away from a tensor that is a rank-n projector in both rectangular reshapings T_{{1,2},{3}} and T_{{2,3},{1}}, the (2n)-th singular values in either reshaping must be at most ε: otherwise the error term would have over n singular values more than ε and therefore Frobenius norm more than ε√n. Also ‖T‖_F = √n because it is a rank-n projector in its square matrix reshaping. Therefore, by Lemma 5.4.3, step 1 requires time Õ(n²d³ + n(nd³ + nd²)) to return an ε√n-approximation in Frobenius norm to the projected matrix.¹² Note that this step only needs to be carried out once regardless of how many times the algorithm is invoked for a specific input T, so the Õ(n²d³) runtime is incurred as a preprocessing cost.

Then, again by Lemma 5.4.3, steps 2 & 3 require time Õ((1/β)·nd³), since the ratio of the first and second singular values of the matrix is 1 + Ω(β), and since O((1/β) log d) steps of power iteration with T(g) can be implemented by choosing the random direction g ∼ N(0, Id_{d²}), the starting direction v₁ ∈ ℝ^{d²}, and then computing v_{t+1} = (Id ⊗ g^⊤ ⊗ v_t) T^{≤1}, where T^{≤1} is the truncated tensor.

Thus, if we choose β, δ satisfying the requirements of Lemma 5.6.2, after Õ(n^{O(β)}) iterations of steps 2 & 3 we will recover a vector u ∈ ℝ^{d²} such that ⟨u, b_i⟩² ≥ 1 − 3ε/β, and after Õ(n^{1+O(β)}) iterations of steps 2 & 3 we will recover vectors u_{t_1}, ..., u_{t_m} so that ⟨u_{t_i}, b_i⟩² ≥ 1 − 3ε/(βδ) for m ≥ (1 − δ)n of the i ∈ [n].

¹²Some of the lemmas we apply, out of concerns for compatibility with [68], assume that the maximum singular value of T^{≤1} is at most 1. Though one could re-do the previous analysis with minimal consequences under the assumption that the spectral norm is at most 1 + ε, for brevity we note that we may instead multiply the whole tensor by 1/(1 + ε), and because the tensor has Frobenius norm at most (1 + 3ε)√n, this costs at most 4ε√n additional Frobenius norm error.

Next, applying Lemma 5.6.5 to each of the good candidate vectors obtained in Algorithm 7, Algorithm 8 will give us candidate components û_{t_1}, ..., û_{t_m} so that ⟨û_{t_i}, a_i⟩² ≥ 1 − 4√ε − 4(3ε/(δβ))^{1/4}. Since Π₃ has rank n, we write it as UU^⊤ for U ∈ ℝ^{d³×n}. Then we may reshape (uu^⊤ ⊗ Id)Π₃ as uu^⊤ U′(U ⊗ Id), where U′ is the d² × nd reshaping of U. Multiplying u^⊤ through takes O(nd³) time, and then reshaping the result back yields (uu^⊤ ⊗ Id)Π₃. Therefore, by Lemma 5.4.2, each invocation of Algorithm 8 requires Õ(nd³) operations.

Finally, from Lemma 5.6.7, we know that if we run Algorithm 9 with θ = 10‖W‖·(2ε^{1/4} + 2(3ε/(δβ))^{1/8}) + 2ε, we will reject any û such that ⟨û, a_i⟩² ≤ 1 − θ − 2ε for all i ∈ [n], and will keep all of the good outputs of Algorithm 8. Each iteration of Algorithm 9 requires time O(d⁴ + nd³ + d³), since we form the vector (W û^⊗2) ⊗ û, then multiply with the rank-n matrix Π₃, and ultimately compute a norm.

This completes the proof. 

5.6.5 Cleaning

Lemma 5.6.8. Suppose there is a set of indices J with |J|  m and let A ⊆ {ai | i ∈ J}.

k Let S k  Span(a⊗ | a ∈ A) for any k and let Π k be the projector to S k . Let A A A  ( 2)  ( 3) S Span ai⊗ and S3 Span ai⊗ .

Suppose Π is a projector such that kΠ−Π 2 k 6 δ and kΠ(Id −ΠS)k 6 ε2. Suppose A Π kΠ − Π k Π  [(Π ⊗ )Π (Π ⊗ )] 3 is a projector such that 3 S3 6 ε3. Let 0 topm Id 3 Id .

Then kΠ0 − Π 3 k 6 ?. A

261 h ( 2)i − As a consequence, if we have access to unit vectors ui such that ui , We ai⊗ > 1 γ, h ( 2)i − k Í T − Π k we obtain vi such that vi , W ai⊗ > 1 ? and furthermore, vi vi 2 6 ?. A

{ ⊗ }  Í ⊗ Proof. Since ai ai is linearly independent, take bi j αij a j a j so that

Õ T Õ Õ T Π − Π 2  E + βi bi bi  E + βi αij αik(a j ⊗ a j)(ak ⊗ ak) , A i j,k J ∈ where E has rank up to 2m and kEk 6 2ε2.



5.7 Combining lift and round for final algorithm

In this section we describe and analyze our final tensor decomposition algorithm, proving our main theorem.

Algorithm 10 Main algorithm for overcomplete 4-tensor decomposition

Function decompose(T):
Input: a tensor T ∈ (ℝ^d)^⊗4, numbers β, δ, ε ∈ (0, 1), numbers σ, κ₀ ∈ ℝ_{>0}, and n ≤ d².

1. Run lift(T, n) from Algorithm 6 to obtain an implicit tensor T′ and an implicit matrix Π₃, using σ, κ₀ as upper bounds on the condition numbers σ_n, κ.

2. Run the algorithm specified by Lemma 5.6.1 on input (T′, Π₃, ε, β, δ) with independent randomness t = Õ(n) times, to obtain vectors u_1, ..., u_t.

Output: u_1, ..., u_t
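At the level of control flow, Algorithm 10 is a lift followed by repeated rounding; the sketch below records only that structure. Here lift and round_once stand in for Algorithm 6 and the Lemma 5.6.1 rounding routine, which are not reproduced; these names and signatures are ours, not the dissertation's.

import numpy as np

def decompose(T, n, beta, delta, eps, sigma, kappa0, lift, round_once, rng):
    """Skeleton of Algorithm 10.  `lift` should return the implicit lifted tensor
    T' and the approximate projector Pi3 (Algorithm 6); `round_once` should run
    one pass of the Lemma 5.6.1 rounding and return a candidate unit vector in
    R^d, or None if the candidate fails the membership test (Algorithm 9)."""
    T_lift, Pi3 = lift(T, n, sigma, kappa0)
    t = int(np.ceil(n * np.log(n + 1))) + 1          # t = O~(n) repetitions
    candidates = []
    for _ in range(t):
        u = round_once(T_lift, Pi3, eps, beta, delta, rng)
        if u is not None:
            candidates.append(u / np.linalg.norm(u))
    return candidates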

Definition 5.7.1 (Signed Hausdorff distance). For sets of vectors a_1, ..., a_n ∈ ℝ^d and b_1, ..., b_m ∈ ℝ^d, we define the signed Hausdorff distance to be the maximum of the following two quantities: (1) max_{i∈[n]} min_{j∈[m], σ∈{±1}} ‖a_i − σ b_j‖ and (2) max_{i∈[m]} min_{j∈[n], σ∈{±1}} ‖b_i − σ a_j‖.
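Definition 5.7.1 translates directly into code; the following helper (name ours) computes the signed Hausdorff distance between two collections of vectors given as matrix columns.

import numpy as np

def signed_hausdorff(A, B):
    """Signed Hausdorff distance between the columns of A (d x n) and B (d x m):
    for each column of one set, the distance to the nearest column of the other
    set up to sign, maximized and then symmetrized over the two sets."""
    def one_sided(X, Y):
        dists = [min(min(np.linalg.norm(x - s * y) for s in (+1.0, -1.0))
                     for y in Y.T) for x in X.T]
        return max(dists)
    return max(one_sided(A, B), one_sided(B, A))

# Example: b_i = -a_i gives distance 0; a small perturbation gives a small distance.
rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3)); A /= np.linalg.norm(A, axis=0)
print(signed_hausdorff(A, -A))                                        # 0.0
print(signed_hausdorff(A, A + 0.01 * rng.standard_normal(A.shape)))   # small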

Definition 5.7.2 (Condition number of a_1, ..., a_n). Let a_1, ..., a_n ∈ ℝ^d. Let {b_{i,j}}_{j∈[d−1]} be an arbitrary orthonormal basis for the orthogonal complement of a_i in ℝ^d. Let H be the d³ × n(d − 1) matrix whose columns are the vectors a_i ⊗ a_i ⊗ b_{i,j} for i ∈ [n], j ∈ [d − 1]:

H := [ a₁ ⊗ a₁ ⊗ b_{1,1} | ··· | a_i ⊗ a_i ⊗ b_{i,j} | ··· | a_n ⊗ a_n ⊗ b_{n,d−1} ].

Let R = (HH^⊤)^{−1/2} H be a column-wise orthonormalization of H, and let K = ½(Id − P_{2,3}) R, where P_{2,3} is the permutation matrix that exchanges the 2nd and 3rd modes of (ℝ^d)^⊗3. The condition number κ of a_1, ..., a_n is the minimum singular value of K.
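For small d and n the condition number can be computed directly. The sketch below (function names ours) builds H column by column, realizes the column-wise orthonormalization (HH^⊤)^{−1/2}H via the polar factor of the SVD (a pseudo-inverse square root, since HH^⊤ is rank-deficient), applies ½(Id − P_{2,3}), and returns the minimum singular value.

import numpy as np

def swap_modes_2_3(d):
    """The d^3 x d^3 permutation matrix P_{2,3} with (P x)_{ijk} = x_{ikj}."""
    P = np.zeros((d ** 3, d ** 3))
    for i in range(d):
        for j in range(d):
            for k in range(d):
                P[(i * d + j) * d + k, (i * d + k) * d + j] = 1.0
    return P

def condition_number(A):
    """kappa(a_1, ..., a_n) per Definition 5.7.2, for unit columns a_i of A (d x n)."""
    d, n = A.shape
    cols = []
    for i in range(n):
        a = A[:, i]
        # Orthonormal basis b_{i,1}, ..., b_{i,d-1} for the complement of a_i.
        Bi = np.linalg.svd(np.eye(d) - np.outer(a, a))[0][:, : d - 1]
        for j in range(d - 1):
            cols.append(np.kron(np.kron(a, a), Bi[:, j]))
    H = np.column_stack(cols)                      # d^3 x n(d-1)
    U, _, Vt = np.linalg.svd(H, full_matrices=False)
    R = U @ Vt                                     # (H H^T)^{-1/2} H via the polar factor
    K = 0.5 * (np.eye(d ** 3) - swap_modes_2_3(d)) @ R
    return np.linalg.svd(K, compute_uv=False)[-1]

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 3)); A /= np.linalg.norm(A, axis=0)
print(condition_number(A))   # a value in (0, 1]; Corollary 5.8.3 lower bounds it for random a_i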

Theorem 5.7.3. For every d, n ∈ ℕ and ε, β, δ ∈ (0, 1) and σ, κ₀ ∈ ℝ_{>0} there is a randomized algorithm decompose_{d,n,ε,β,δ,σ,κ₀}(T) with the following guarantees. For every set of unit vectors a_1, ..., a_n ∈ ℝ^d and every E ∈ (ℝ^d)^⊗4 such that

1. the operator norm of the square matrix flattening of E satisfies ‖E_{12;34}‖₂ / (σ_n^7 μ^{−1} κ) ≤ ε,

2. κ = κ(a_1, ..., a_n) ≥ κ₀,

3. σ_n ≥ σ,

where

1. σ_n is the n-th singular value of the matrix Σ_{i≤n} (a_i^⊗2)(a_i^⊗2)^⊤,

2. μ is the operator norm of Σ_{i≤n} (a_i^⊗3)(a_i^⊗3)^⊤, and

3. κ is the condition number of a_1, ..., a_n as in Definition 5.7.2,

there is a subset S ⊆ {a_1, ..., a_n} of size |S| ≥ (1 − δ)n such that given input T = Σ_{i≤n} a_i^⊗4 + E the algorithm produces a set B = {b_1, ..., b_t} of t = Õ(n) vectors which with probability at least 0.99 over the randomness in the algorithm has

signed-Hausdorff-distance(S, B) ≤ O((ε/(δβ))^{1/16}).

Furthermore, the algorithm decompose_{d,n,ε,β,δ,σ,κ₀} runs in time

Õ( nd⁴/(√σ·κ₀) + n²d³ + n^{2+O(β)}·d³/β ).

We record some intuitive explanations of the parameters in Theorem 5.7.3.

• σ, κ0 are bounds on the minimum singular values of matrices associated

to a1,..., an, used to determine the necessary precision of linear-algebraic

manipulations performed by the algorithm. Decreasing σ, κ0 yields an algorithm tolerating less well-conditioned tensors, at the expense of running time and/or accuracy guarantees.

• δ determines what fraction of the vectors a1,..., an the algorithm is allowed to fail to return. By decreasing δ the algorithm recovers a larger fraction

of a1,..., an, at the cost of increasing running time and/or decreasing per-vector accuracy.

• β determines the per-vector accuracy of the algorithm. Increasing β im- proves the accuracy of the algorithm, but with exponential cost in the running time.

• ε governs the magnitude of allowable noise E. Increasing ε yields a more noise-tolerant algorithm, at the expense of the accuracy of recovered vectors.

We record the following corollary, which follows from Theorem 5.7.3 by choosing parameters appropriately.

Corollary 5.7.4. For every n, d ∈ ℕ and σ > 0 (independent of n, d) there is an algorithm with the following guarantees. The algorithm takes input T = Σ_{i≤n} a_i^⊗4 + E, and so long as

1. κ(a_1, ..., a_n) ≥ σ,

2. the minimum nonzero eigenvalue of Σ_{i≤n} (a_i^⊗2)(a_i^⊗2)^⊤ is at least σ,

3. ‖Σ_{i≤n} (a_i^⊗3)(a_i^⊗3)^⊤‖ ≤ 1/σ, and

4. ‖E_{12;34}‖ ≤ poly(σ)/(log n)^{O(1)},

with high probability the algorithm recovers Õ(n) vectors b_1, ..., b_t such that there is a set S ⊆ {a_1, ..., a_n} with |S| ≥ (1 − o(1))n such that the signed Hausdorff distance from S to {b_1, ..., b_t} is o(1), in time Õ(n²d³/poly(σ)).

Furthermore, hypotheses (2), (3) hold for random unit vectors a_1, ..., a_n with σ = 0.1 so long as n ≤ d²/(log n)^{O(1)}, and experiments in Appendix ?? strongly suggest that (1) does as well.

Proof of Theorem 5.7.3. Let W = (Σ_{i≤n} (a_i^⊗2)(a_i^⊗2)^⊤)^{−1/2}. By Lemma 5.5.1, the implicit tensor T′ and matrix Π₃ returned by lift satisfy

‖T′ − Σ_{i≤n} (W(a_i ⊗ a_i))^⊗3‖_F ≤ O(ε σ_n^{9/2} √n)

and

‖Π₃ − Π_{Span(W a_i^⊗2 ⊗ a_i)}‖ ≤ O(ε σ_n^4).

So, by Lemma 5.6.1, with high probability there is a subset S ⊆ {a_1, ..., a_n} of size m ≥ (1 − δ)n such that for each a_i ∈ S there is u_j among the vectors u_1, ..., u_t returned by the rounding algorithm with

⟨a_i, u_j⟩² ≥ 1 − O( (1/δ^{1/8}) · (1/β^{1/8}) · ε^{1/8} · ‖W‖ · √σ_n ) = 1 − O((ε/(δβ))^{1/8}),

where the equality follows because ‖W‖ = σ_n^{−1/2}. Furthermore, each of the vectors u_1, ..., u_t is similarly close to some a_i ∈ S. This proves the claimed upper bound on the Hausdorff distance.

The running time follows from putting together Lemma 5.5.2 and the running time bounds of Lemma 5.6.1. 

5.8 Condition number of random tensors

Definition 5.8.1. Let Π_{2,3} : (ℝ^d)^⊗3 → (ℝ^d)^⊗3 be the orthogonal projector to the subspace Span(x ⊗ y ⊗ y | x, y ∈ ℝ^d) that is invariant under interchange of the second and third tensor modes. Let Π_{1,2} be defined similarly. Let Π⊥_{2,3} = Id − Π_{2,3} and Π⊥_{1,2} = Id − Π_{1,2}.

Note that Π_{2,3} = ½(Id + P_{2,3}), where P_{2,3} is the orthogonal operator that interchanges the second and third modes, and Π⊥_{2,3} = ½(Id − P_{2,3}). This follows from the Projection Formula in representation theory, whereby for any finite group G of linear operators, (1/|G|) Σ_{g∈G} g is equal to the projection to the common invariant subspace of G.
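The observation Π_{2,3} = ½(Id + P_{2,3}) is easy to confirm numerically; the snippet below (helper name ours) builds P_{2,3} for a small d and checks that ½(Id + P_{2,3}) is idempotent and fixes vectors of the form x ⊗ y ⊗ y.

import numpy as np

def swap_modes_2_3(d):
    """Permutation matrix on (R^d)^{x3} exchanging the second and third modes."""
    P = np.zeros((d ** 3, d ** 3))
    for i in range(d):
        for j in range(d):
            for k in range(d):
                P[(i * d + j) * d + k, (i * d + k) * d + j] = 1.0
    return P

d = 4
P23 = swap_modes_2_3(d)
Pi = 0.5 * (np.eye(d ** 3) + P23)
assert np.allclose(Pi @ Pi, Pi)                     # Pi_{2,3} is a projector
rng = np.random.default_rng(6)
x, y = rng.standard_normal(d), rng.standard_normal(d)
v = np.kron(x, np.kron(y, y))
assert np.allclose(Pi @ v, v)                       # x tensor y tensor y is fixed
print("Pi_{2,3} checks pass")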

Lemma 5.8.2 (Condition number of basic swap matrix). Let a_1, ..., a_n be independent random d-dimensional unit vectors. Let B_i ∈ ℝ^{d×(d−1)} be a random basis for the orthogonal complement of a_i in ℝ^d. Let P ∈ ℝ^{d³×d³} be the permutation matrix which swaps second and third modes of (ℝ^d)^⊗3. Let

A = 𝔼_a (a ⊗ a ⊗ Id)(a ⊗ a ⊗ Id)^⊤.

Let R ∈ ℝ^{d³×n(d−1)} have n blocks of dimensions d³ × (d − 1), where the i-th block is

R_i = A^{−1/2}(a_i ⊗ a_i ⊗ B_i) − P A^{−1/2}(a_i ⊗ a_i ⊗ B_i),

where we abuse notation and denote the PSD square root of the pseudoinverse of A by A^{−1/2}. Then there is a function d′(d) = Θ(d²) such that 𝔼 ‖R^⊤R − d′(d)·Id‖ ≤ O(log d)² · max(d√n, n, d^{3/2}). In particular, if d ≪ n ≤ d²,

𝔼 ‖(1/d′(d))·R^⊤R − Id‖ ≤ O(n (log d)²/d²).

Corollary 5.8.3. Let a_1, ..., a_n be independent random d-dimensional unit vectors with d ≪ n. Then with probability 1 − o(1), the condition number κ of a_1, ..., a_n as defined in Lemma 5.5.3 is at least √(1/4 − 1/(4√2) − O(1/d)) − Õ(n/d²), the matrix T = Σ_i a_i^⊗2 (a_i^⊗2)^⊤ has n-th eigenvalue σ ≥ 1 − Õ(n/d²) − O(1/d), and the matrix U = Σ_i a_i^⊗3 (a_i^⊗3)^⊤ has spectral norm 1 + Õ(n/d²).

Therefore decompose_{d,n,ε,β,δ,σ,κ₀}, when run with error ε, recovers (1 − σ)n components a_i with signed Hausdorff distance O(ε/(δβ))^{1/16} in time Õ(nd⁴ + n²d³ + n^{2+O(β)}d³/β).

To prove the corollary, we will need an elementary fact.

Fact 5.8.4. Let a_1, ..., a_n be independent random unit vectors in ℝ^d. Let U = Σ_{i=1}^n a_i^⊗3 (a_i^⊗3)^⊤. With probability 1 − o(1), ‖U − Π_{img(U)}‖ ≤ Õ(n/d²) if n ≫ d.

The proof may be found in Section 5.8.8.

Fact 5.8.5. Let a_1, ..., a_n be independent random unit vectors in ℝ^d. Let Σ̂ = 𝔼_a (a ⊗ a)(a ⊗ a)^⊤ for a drawn from the uniform distribution over unit vectors in ℝ^d. Then

‖2d^{−2}·Σ̂^{−1/2} T Σ̂^{−1/2} − Π_{Σ̂^{−1/2} img(T)}‖ ≤ Õ(n/d²).

Proof. The fact follows by a slight modification of [49, Lemma 5.9]. While Lemma 5.9 in [49] applies to Gaussian random vectors, not uniform random unit vectors,

and gives a bound of Õ(n/d^{3/2}), with changes it holds for unit vectors and with a bound of Õ(n/d²). □

267 Fact 5.8.6. Suppose A and B are both subspaces of dimension n. Suppose θ ∈ [0, π/2).

Then ‖Π_A − Π_B Π_A‖ ≤ sin θ if and only if for every x ∈ A there is a y ∈ B so that ⟨x, y⟩ ≥ cos θ · ‖x‖ ‖y‖.

Furthermore, when this holds, since ‖Π_A − Π_B Π_A‖ = ‖Π_B − Π_A Π_B‖ by symmetry, the triangle inequality yields ‖Π_A − Π_B‖ ≤ 2 sin θ.

T Proof. Consider the product R  ΠAΠB. Let R  UΣV be its singular value decomposition, with ui and vi its ith left- and right-singular vectors and σi its ith singular value. We show as an intermediate step that σn is at least cos θ if and only if for every x ∈ A there is a y ∈ B so that hx, yi > kxk kyk cos θ.

T In one direction, suppose σn is at least cos θ. Then take y  R x. We see

T T T 2 that hx, yi  x ΠAΠB y  x Ry  kR xk . Since x ∈ A and dim A  n and img(R) ⊆ A and singular values are non-negative, if σn > 0 then the first n Í left-singular vectors of R must span A. Therefore, decomposing x  αi ui, T Í T Í we must have αi  0 for i > n. So, R x  αiVΣU ui  αi σi vi, and k T k2  k Í k2  Í 2 2k k2 Í 2 2  k k2 2 R x αi σi vi αi σi vi > αi σn x σn. So hx, yi kRTxk2 kRTxk   > σn > cos θ . kxk kyk kxk kRTxk kxk

In the other direction, suppose σn < cos θ. Then take x  un and for Í any y ∈ B, decompose y  βi vi where again βi  0 for all i > n. So

T T T T hx, yi  x ΠAΠB y  (R x) y  σn vn y  σn βn. Since βn 6 kyk, for any y we must have hx, yi σn βn  6 σ < cos θ . kxk kyk kyk n

Finally, we show that kΠA − ΠBΠA k 6 sin θ if and only if σn is at least cos θ.

This follows from the Pythagorean theorem, as ΠA  (ΠBΠA) + (ΠA − ΠBΠA) and

268 T 2 also (ΠBΠA) (ΠA −ΠBΠA)  0. Thus for any vector v, we see k(ΠA −ΠBΠA)vk 

2 2 2 T 2 kΠAvk − kΠBΠAvk  kΠAvk − kR ΠAvk . 

Fact 5.8.7. Let Φ = Σ_i e_i ⊗ e_i for e_i the elementary basis vectors in ℝ^d. If x ∈ Φ ⊗ ℝ^d, then

‖Π_{2,3} x‖ = (1/√2)·√(1 + 1/d)·‖x‖.

Proof. Since x ∈ Φ ⊗ ’d, we may write it as x  Φ ⊗ u for some u ∈ ’d. We directly calculate:

kΠ k2  k 1 ( + ) k2 2,3x 2 Id P2,3 x  1 k k2 + 1 k k2 + 1 h i 4 x 4 P2,3x 2 x, P2,3x  1 k k2 + 1 hΦ ⊗ (Φ ⊗ )i 2 x 2 u, P2,3 u  1 k k2 + 1 hÕ ⊗ ⊗ Õ ⊗ ⊗ i 2 x 2 ei ei u, ej u ej  1 k k2 + 1 ÕÕh ih ih i 2 x 2 ei , ej ei , u ej , u  1 k k2 + 1 Õh i2 2 x 2 ei , u  1 k k2 + 1 k k2 2 x 2 u  1 k k2( + 1 ) 2 x 1 d .



Fact 5.8.8. Let Φ = Σ_i e_i ⊗ e_i for e_i the elementary basis vectors in ℝ^d. Let u ∈ sym₂ ℝ^d ⊗ ℝ^d and decompose u = x + y with x ∈ Φ ⊗ ℝ^d and y ⊥ x. Then if ‖Π⊥_{2,3} y‖ ≥ (1/√2)‖y‖, it holds that

‖Π⊥_{2,3} u‖² ≥ (1/4 − 1/(4√2) − O(1/d))·‖u‖².

1 Proof. Let P3,2,1  (P1,2,3)− be the orthogonal linear operator permuting the tensor modes, so that the 3-cycle (3 2 1) replaces the third mode with the second, the second mode with the first, and the first mode with the third again. By a

269 d unitary similarity transform conjugating by P3,2,1, since x ∈ Φ ⊗ ’ , we have

kΠ1,2P2,3xk  k(P1,2,3Π1,2P3,2,1)(P1,2,3P2,3)xk

 kΠ2,3P1,2xk

 kΠ2,3xk p  1 1 + 1 kxk , (5.8.1) √2 d where the last step uses ?? 5.8.7. We write, since Π1,2x  x,

kΠ Π k  1 kΠ ( − )k 1,2 2⊥,3x 2 1,2 x P2,3x  1 k − Π k 2 x 1,2P2,3x 1 k k + 1 kΠ k 6 2 x 2 1,2P2,3x p  1 kxk + 1 1 + 1 kxk 2 2√2 d    1 + 1 + O( 1 ) kxk , (5.8.2) 2 2√2 d where the second-to-last step substitutes in (5.8.1).

 + kΠ k2 ( 1 − ( 1 ))k k2 Therefore, since u x y and ?? 5.8.7 implies 2⊥,3x > 2 O d x and kΠ k2 1 k k2 also by assumption 2⊥,3 y > 2 y ,

kΠ k2  kΠ k2 + kΠ k2 + h Π i 2⊥,3u 2⊥,3x 2⊥,3 y y, 2⊥,3x  ( 1 − ( 1 ))k k2 + 1 k k2 + h Π i 2 O d x 2 y y, 2⊥,3x .

∈ ⊗ d  Since y sym2 ’ and therefore Π1,2 y y, Cauchy-Schwarz implies 1 1 hy, Π⊥ xi > −kΠ1,2Π⊥ xkk yk and then by (5.8.2), this is at least −( + + 2,3 2,3 2 2√2 ( 1 ))k k O d x . We substitute this in and then apply Young’s inequality:

2 1 1  2 1 2  1 1 1  kΠ⊥ uk > − O( ) kxk + kyk − + + O( ) kxk kyk 2,3 2 d 2 2 2√2 d   > 1 − O( 1 ) kxk2 + 1 kyk2 − 1 + 1 + O( 1 ) ( 1 kxk2 + 1 kyk2) 2 d 2 2 2√2 d 2 2     > 1 − 1 − O( 1 ) kxk2 + 1 − 1 − O( 1 ) kyk2 4 4√2 d 4 4√2 d

270    1 − 1 − O( 1 ) kuk2 . 4 4√2 d



Fact 5.8.9. Let u ∈ sym₂ ℝ^d ⊗ ℝ^d. If

‖Π⊥_{2,3} A^{−1/2} u‖ / ‖A^{−1/2} u‖ ≥ 1 − μ

for μ ≤ 1 − 1/√2 − 2√2/√d, then

‖Π⊥_{2,3} u‖ / ‖u‖ ≥ √(1/4 − 1/(4√2) − O(1/d)).

 +  ( 1 ΦΦT ⊗ ) ⊥ k k k k Proof. Write u x y where x d Id u and y x. If x > 2 y , then by triangle inequality and ?? 5.8.7,

kΠ⊥ uk kΠ⊥ xk − kΠ⊥ yk 2,3 > 2,3 2,3 kuk kuk kΠ⊥ xk − k yk > 2,3 kuk   1 − O( 1 ) kxk − 1 kxk √2 d 2 > kuk   1 − 1 − O( 1 ) kxk √2 2 d > kxk + kyk   1 − 1 − O( 1 ) kxk √2 2 d > 3 k k 2 x q > 1 − 1 − O( 1 ) . 4 4√2 d

Thus for the remainder of the argument, we assume kxk 6 2kyk. √ 1 2  + k 1 2 k2  2k k2 + k k2 2k k2 Then A− / u d1 y d x and A− / u d1 y d x > d1 y , so by triangle inequality,

1 1 2 √d kΠ⊥ yk > d− kΠ⊥ A− / uk − kΠ⊥ xk 2,3 1 2,3 d1 2,3

271 1 1 2 √d > d− kΠ⊥ A− / uk − kxk 1 2,3 d1 1 1 2 √d > d− kΠ⊥ A− / uk − 2 kyk 1 2,3 d1 1 1 2 √d > d− (1 − µ)kA− / uk − 40 kyk 1 d1 √ > (1 − µ)k yk − 2 d kyk d1  √  > 1 − µ − 2 2 kyk . √d

√ Therefore, the lemma follows by ?? 5.8.8, as long as 1 − µ − 2 2 > 1 .  √d √2

Proof of Corollary 5.8.3. To lower bound κ, we need a lower bound on the least singular value of Π⊥_{2,3}(HH^⊤)^{−1/2}H, where H is the matrix with columnwise blocks a_i ⊗ a_i ⊗ B_i. By Lemma 5.8.2, with probability 1 − o(1), it holds that (√2/d₁)·Π⊥_{2,3} A^{−1/2} H has all singular values within 1 ± Õ(n/d²). This means that for all u ∈ img(H), we have ‖Π⊥_{2,3} A^{−1/2} u‖ / ‖A^{−1/2} u‖ ≥ 1 − Õ(n/d²). Therefore, by Fact 5.8.9, ‖Π⊥_{2,3} u‖/‖u‖ ≥ √(1/4 − 1/(4√2) − O(1/d)) − Õ(n/d²). This inequality holding for all u ∈ img(H) is equivalent to κ, the smallest singular value of Π⊥_{2,3}(HH^⊤)^{−1/2}H, being at least √(1/4 − 1/(4√2) − O(1/d)) − Õ(n/d²). □



The remainder of this section is devoted to the proof of Lemma 5.8.2. At a high level, this proof follows the strategy laid out in [80] to prove Theorem

5.62 there, but the random matrix we need to control is much more complicated than is handled there.

5.8.1 Notation

Throughout we use the following notation.

1. d, n ∈ ℕ are natural numbers.

2. d₁ = √((d² + 2d)/2) = Θ(d) and d₂ = d^{−1/2}(√((d + 2)/2) − 1) = Θ(1).

3. a_1, ..., a_n ∈ ℝ^d are iid random unit vectors.

4. B_i ∈ ℝ^{d×(d−1)} for i ≤ n is a matrix whose columns form a random orthonormal basis for the orthogonal complement of a_i in ℝ^d. (Chosen independently from a_j, B_j for j ≠ i.)

5. Σ = 𝔼_a(aa^⊤ ⊗ aa^⊤) ∈ ℝ^{d²×d²} is the 4th moment matrix of a random d-dimensional unit vector.

6. A = 𝔼_a(aa^⊤ ⊗ aa^⊤ ⊗ Id) = Σ ⊗ Id ∈ ℝ^{d²×d²×d²} is Σ "lifted" to a 6-tensor.

7. Φ ∈ ℝ^{d²} is the vector Σ_{i≤d} e_i^⊗2, where e_i is the i-th standard basis vector.

8. Π_sym ∈ ℝ^{d²×d²} is the projector to the symmetric subspace of ℝ^{d²} (i.e. the span of vectors x^⊗2 for x ∈ ℝ^d).

9. P ∈ ℝ^{d³×d³} is the permutation matrix which swaps second and third tensor modes. Concretely, (Px)_{ijk} = x_{ikj} for i, j, k ∈ [d].

10. S_i, for i ≤ n, is the d³ × (d − 1) matrix given by S_i = A^{−1/2}(a_i ⊗ a_i ⊗ B_i).

11. R_i, for i ≤ n, is the d³ × (d − 1) matrix given by R_i = S_i − P S_i.

12. R_T, for any T ⊆ [n], is the d³ × |T|(d − 1) matrix with |T| blocks of columns, given by {R_i}_{i∈T}.

13. R = R_{[n]} ∈ ℝ^{d³×n(d−1)} contains all blocks of columns R_i.

5.8.2 Fourth Moment Identities

Fact 5.8.10. (𝔼 aa^⊤ ⊗ aa^⊤)^{−1/2} = d₁ Π_sym − d₂ ΦΦ^⊤,

and for any unit vector x and matrix X,

A^{−1/2}(x ⊗ x ⊗ X) = [d₁(x ⊗ x) − d₂Φ] ⊗ X.

Proof. The first statement follows from Fact C.4 in [49]. For the second, notice that A^{−1/2} = (𝔼 aa^⊤ ⊗ aa^⊤)^{−1/2} ⊗ Id, so

A^{−1/2} = [ √((d² + 2d)/2)·Π_sym + (1/√d)(1 − √((d + 2)/2))·ΦΦ^⊤ ] ⊗ Id.

So we can expand A^{−1/2}(x ⊗ x ⊗ X) as

[ √((d² + 2d)/2)·Π_sym(x ⊗ x) + (1/√d)(1 − √((d + 2)/2))·ΦΦ^⊤(x ⊗ x) ] ⊗ X.

Since Π_sym(x ⊗ x) = (x ⊗ x) and Φ^⊤(x ⊗ x) = ‖x‖² = 1, this simplifies to

[ √((d² + 2d)/2)·(x ⊗ x) + (1/√d)(1 − √((d + 2)/2))·Φ ] ⊗ X. □
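The first identity in Fact 5.8.10 can be checked numerically. The closed form used below for Σ = 𝔼 aa^⊤ ⊗ aa^⊤, namely Σ = (2Π_sym + ΦΦ^⊤)/(d(d + 2)), is the standard fourth-moment formula for a uniform unit vector (the content of the cited [49, Fact C.4], restated here as an assumption rather than quoted); the sketch compares the pseudo-inverse square root of Σ against d₁Π_sym − d₂ΦΦ^⊤.

import numpy as np

d = 4
d1 = np.sqrt((d ** 2 + 2 * d) / 2)
d2 = d ** -0.5 * (np.sqrt((d + 2) / 2) - 1)

Phi = np.eye(d).reshape(d * d)                       # Phi = sum_i e_i kron e_i
Swap = np.zeros((d * d, d * d))
for i in range(d):
    for j in range(d):
        Swap[i * d + j, j * d + i] = 1.0
Pi_sym = 0.5 * (np.eye(d * d) + Swap)

# Assumed closed form for Sigma = E[a a^T kron a a^T], a uniform on the sphere.
Sigma = (2 * Pi_sym + np.outer(Phi, Phi)) / (d * (d + 2))

# Pseudo-inverse square root of Sigma (rank-deficient: only the symmetric
# subspace is in its image).
evals, evecs = np.linalg.eigh(Sigma)
inv_sqrt = evecs @ np.diag([s ** -0.5 if s > 1e-10 else 0.0 for s in evals]) @ evecs.T

print(np.allclose(inv_sqrt, d1 * Pi_sym - d2 * np.outer(Phi, Phi)))   # True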

5.8.3 Matrix Product Identities

Fact 5.8.11. S_i = (d₁(a_i ⊗ a_i) − d₂Φ) ⊗ B_i.

Proof. Follows from the definition of S_i and Fact 5.8.10. □

Fact 5.8.12. S_i^⊤ S_j = (d₁²⟨a_i, a_j⟩² − 2d₁d₂ + d₂²d)·B_i^⊤ B_j.

Proof. Expanding via Fact 5.8.11,

S_i^⊤ S_j = [(d₁(a_i ⊗ a_i) − d₂Φ) ⊗ B_i]^⊤ [(d₁(a_j ⊗ a_j) − d₂Φ) ⊗ B_j] = (d₁²⟨a_i, a_j⟩² − 2d₁d₂ + d₂²d)·B_i^⊤ B_j.

Here we have used that Φ^⊤(a_i ⊗ a_i) = ‖a_i‖² = 1 and that Φ^⊤Φ = ‖Φ‖² = d. □

Fact 5.8.13. P = P^⊤ = P^{−1}, and hence P² = Id.

Proof. Exercise. □
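Since the proof is left as an exercise, here is a two-line numerical confirmation for a small d (the helper matches the one sketched earlier and is repeated for self-containedness).

import numpy as np

def swap_modes_2_3(d):
    P = np.zeros((d ** 3, d ** 3))
    for i in range(d):
        for j in range(d):
            for k in range(d):
                P[(i * d + j) * d + k, (i * d + k) * d + j] = 1.0   # (P x)_{ijk} = x_{ikj}
    return P

P = swap_modes_2_3(3)
print(np.allclose(P, P.T), np.allclose(P @ P, np.eye(27)))   # True True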

Fact 5.8.14. For any matrices B, B′ ∈ ℝ^{d×m}, we have (Φ ⊗ B)^⊤ P (Φ ⊗ B′) = B^⊤B′.

Proof. The (s, t)-th entry of (Φ ⊗ B)^⊤ P (Φ ⊗ B′) is given by Σ_{u,v,w≤d} Φ_{uv} B_{sw} Φ_{uw} B′_{tv}.

The only nonzero terms come from u = v = w, because otherwise Φ_{uv}Φ_{uw} = 0. So this simplifies to Σ_{u≤d} B_{su} B′_{tu} = (B^⊤B′)_{st}. □

Fact 5.8.15. For any vectors a, a′ ∈ ℝ^d and matrices B, B′ ∈ ℝ^{d×(d−1)}, we have

(a ⊗ a ⊗ B)^⊤ P (a′ ⊗ a′ ⊗ B′) = ⟨a, a′⟩ (B^⊤a′)([B′]^⊤a)^⊤.

Proof. Since P does not touch the first mode of (ℝ^d)^⊗3, it is enough to compute (a ⊗ B)^⊤ P′ (a′ ⊗ B′), where P′ is the mode-swap matrix for (ℝ^d)^⊗2. The (s, t)-th entry of this matrix is given by

Σ_{u,v≤d} a_u B_{sv} a′_v B′_{tu} = (Σ_{u≤d} a_u B′_{tu})(Σ_{v≤d} a′_v B_{sv}). □

Fact 5.8.16. For any vector a ∈ ℝ^d and matrices B, B′ ∈ ℝ^{d×m}, we have (a ⊗ a ⊗ B)^⊤ P (Φ ⊗ B′) = (B^⊤a)([B′]^⊤a)^⊤.

Proof. The (s, t)-th entry of the product is given by

Σ_{u,v,w≤d} a_u a_v B_{sw} Φ_{uw} B′_{tv} = Σ_{u,v≤d} a_u a_v B_{su} B′_{tv} = (B^⊤a)_s ([B′]^⊤a)_t. □

Fact 5.8.17. S_i^⊤ P S_j = d₁²⟨a_i, a_j⟩(B_i^⊤a_j)(B_j^⊤a_i)^⊤ + d₂² B_i^⊤B_j.

Proof. Expanding S_i, S_j using Fact 5.8.11, we obtain

S_i^⊤ P S_j = [d₁(a_i ⊗ a_i) ⊗ B_i]^⊤ P [d₁(a_j ⊗ a_j) ⊗ B_j]
 − [d₁(a_i ⊗ a_i) ⊗ B_i]^⊤ P [d₂Φ ⊗ B_j]
 − [d₂Φ ⊗ B_i]^⊤ P [d₁(a_j ⊗ a_j) ⊗ B_j]
 + [d₂Φ ⊗ B_i]^⊤ P [d₂Φ ⊗ B_j].

Simplifying the terms individually using Fact 5.8.14, Fact 5.8.15, and Fact 5.8.16,

(a_i ⊗ a_i ⊗ B_i)^⊤ P (a_j ⊗ a_j ⊗ B_j) = ⟨a_i, a_j⟩(B_i^⊤a_j)(B_j^⊤a_i)^⊤,
(Φ ⊗ B_i)^⊤ P (Φ ⊗ B_j) = B_i^⊤B_j,
(a_i ⊗ a_i ⊗ B_i)^⊤ P (Φ ⊗ B_j) = 0,
(Φ ⊗ B_i)^⊤ P (a_j ⊗ a_j ⊗ B_j) = 0,

where the last two equalities follow from B_i^⊤a_i = 0 and B_j^⊤a_j = 0. □

Fact 5.8.18. R_i^⊤ R_j = 2(d₁²⟨a_i, a_j⟩² − 2d₁d₂ + d₂²(d − 1))·B_i^⊤B_j − 2d₁²⟨a_i, a_j⟩(B_i^⊤a_j)(B_j^⊤a_i)^⊤, and in particular, R_i^⊤ R_i = 2(d₁² − 2d₁d₂ + d₂²(d − 1))·Id.

Proof. By the definition of R_i, R_j we expand

R_i^⊤ R_j = (S_i − P S_i)^⊤ (S_j − P S_j).

We can expand the product and use Fact 5.8.13 to get

R_i^⊤ R_j = 2 S_i^⊤ S_j − 2 S_i^⊤ P S_j.

Applying Fact 5.8.12 and Fact 5.8.17, we get

R_i^⊤ R_j = 2(d₁²⟨a_i, a_j⟩² − 2d₁d₂ + d₂²d)·B_i^⊤B_j − 2d₁²⟨a_i, a_j⟩(B_i^⊤a_j)(B_j^⊤a_i)^⊤ − 2d₂²·B_i^⊤B_j.

Simplifying finishes the proof. □
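Facts 5.8.11–5.8.18 compose into the diagonal-block identity R_i^⊤R_i = 2(d₁² − 2d₁d₂ + d₂²(d − 1))·Id, which is easy to confirm numerically using the explicit form of S_i from Fact 5.8.11; all names in the sketch below are ours.

import numpy as np

def swap_modes_2_3(d):
    P = np.zeros((d ** 3, d ** 3))
    for i in range(d):
        for j in range(d):
            for k in range(d):
                P[(i * d + j) * d + k, (i * d + k) * d + j] = 1.0
    return P

rng = np.random.default_rng(7)
d = 4
d1 = np.sqrt((d ** 2 + 2 * d) / 2)
d2 = d ** -0.5 * (np.sqrt((d + 2) / 2) - 1)
Phi = np.eye(d).reshape(d * d)

a = rng.standard_normal(d); a /= np.linalg.norm(a)
B = np.linalg.svd(np.eye(d) - np.outer(a, a))[0][:, : d - 1]   # basis of a's complement

# S_i = (d1 (a kron a) - d2 Phi) kron B_i  (Fact 5.8.11), R_i = S_i - P S_i.
S = np.kron((d1 * np.kron(a, a) - d2 * Phi)[:, None], B)
R = S - swap_modes_2_3(d) @ S

target = 2 * (d1 ** 2 - 2 * d1 * d2 + d2 ** 2 * (d - 1)) * np.eye(d - 1)
print(np.allclose(R.T @ R, target))   # True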

5.8.4 Naive Spectral Norm Estimate

We will need an upper bound on the spectral norm of the matrix R, which we obtain by a matrix Chernoff bound. To prove that, we need spectral bounds on certain second moments.

Fact 5.8.19. For all i 6 n, k ¾ k SiS>i 6 1 .

Proof. Expanding by the definition of Si,

¾  1 2 ¾( ⊗ ⊗ ) 1 2  1 2(¾ ⊗ ⊗ ) 1 2  SiS>i A− / ai a>i ai a>i Bi Bi> A− / A− / ai a>i ai a>i Id A− / Id .

 −  since Bi Bi> Id ai a>i Id. The last equality uses the definition of A. 

Fact 5.8.20. For all i 6 n, k ¾ k Ri R>i 6 4 .

Proof. Follows from ?? 5.8.19 and the definition of Ri, and the fact that kPk 6 1, by the manipulations

k ¾( )k  k ¾ − − + k k ¾ k Ri R>i SiS>i PSiS>i SiS>i P PSiS>i P 6 4 SiS>i 6 4 .



Fact 5.8.21. k ¾ RR>k 6 O(n)

¾  ¾ Í Proof. Since RR> i6n Ri R>i , this follows from the triangle inequality and ?? 5.8.20. 

k ¾ k ( 2 − + 2( − )) ( 2) Fact 5.8.22. R>R 6 2 d1 2d1d2 d2 d 1 6 O d

277 Proof. R>R is a block matrix with ij-th block being R>i R j. If i , j we have ¾  R>i R j 0, since R>i R j is independent of R j and has expectation zero. At the  ( 2 − + 2( − )) ¾  ( 2 − + 2( − same time R>i Ri 2 d1 2d1d2 d2 d 1 Id. So, R>R 2 d1 2d1d2 d2 d 1)) Id.  √ Fact 5.8.23. ¾ kRk 6 O(log d · max(d, n)).

Proof. First of all, note that ¾ R  0, because ¾(ai ⊗ ai ⊗ Bi)  ¾a ai ⊗ ai ⊗

(¾[Bi | ai])  0, since each column of Bi is a random unit vector in the orthogonal complement of ai.

Also, note that with probability 1, each of the column blocks Ri has kRi k 6

1 2 O(d), because kai ⊗ ai ⊗ Bi k 6 1 and kA− / k 6 O(d).

The matrix R is a sum of independent random matrices. To apply Matrix

Chernoff, we need the bounds on its second moments, provided by ?? 5.8.21 and ?? 5.8.22. The proof is concluded by applying Matrix Chernoff. 

5.8.5 Off-Diagonal Second Moment Estimates

We will eventually need to bound the norms of some off-diagonal blocks of the matrix R>R. We prove some useful inequalities for that effort now.

Fact 5.8.24. Let X, Y be real matrices. Then (X − Y)(X − Y)^⊤ ⪯ 2(XX^⊤ + YY^⊤).

Proof. By expanding, (X − Y)(X − Y)^⊤ = XX^⊤ + YY^⊤ − XY^⊤ − YX^⊤. Since (X + Y)(X + Y)^⊤ ⪰ 0, we obtain XX^⊤ + YY^⊤ ⪰ −XY^⊤ − YX^⊤, finishing the proof. □

278 Fact 5.8.25. For any S ⊆ [n] and fixed ai for i < S, we have k ¾R R>RSR>R k 6 S S S S (| | · k k2) O S RS .

¾  ¾  Proof. We know from ?? 5.8.20 that Ri R>i 4 Id. Hence RSR>S Í ¾  | | i S Ri R>i 4 S Id. To prove the final bound we push the expectation ∈ inside the matrix product:

¾  (¾ )  | | ·  · | | · k k2 · R>RSR>S RS R> RSR>S RS 4 S R>RS 4 S RS Id . RS S S RS S 

Fact 5.8.26. For any S ⊆ [n] and fixed ai for i < S, we have k ¾R R>R R>RS k 6 S S S S (| | · 2) + ( 3 · k Í k) O S d O d i S ai a>i . ∈

Proof. Consider the j, k-th block of ¾R R>R R>RS, which expands to S S S S Õ ¾ R>j Ri R>i Rk . R j ,Rk i S ∈ Since ¾ R j  ¾ Rk  0, unless j  k the whole expression vanishes in expectation.  Í ¾ Consider the case j k. Here we have the matrix i S R>j Ri R>i R j. We expand ∈ the matrix R>j Ri according to ?? 5.8.18 to get

 ( 2h i2 − + 2( − )) − 2h i( )( ) R>j Ri 2 d1 ai , a j 2d1d2 d2 d 1 B>j Bi 2d1 ai , a j B>j ai Bi>a j > | {z } | {z } def def  Xij  Yij

We will need the following two spectral bounds.

k ¾ ( 2h i2 − + 2( − ))2 k ( 2) 1. 4 d1 ai , a j 2d1d2 d2 d 1 B>j Bi Bi>Bj 6 O d .  ¾ ( 2h i2 − We note that B>j Bi Bi>Bj Id, so it is enough to bound 4 d1 ai , a j + 2( − ))2 | | 2( − ) ( ) 2 ( 2) 2d1d2 d2 d 1 . By definition, 2d1d2 , d2 d 1 6 O d and d1 6 O d , so we have

¾ ( 2h i2 − + 2( − ))2 ( 4)·¾(h i2 + ( / ))2 ( 2) 4 d1 ai , a j 2d1d2 d2 d 1 6 O d ai , a j O 1 d 6 O d .

279 4h i2( )( ) ( )( )  4h i2 2. 4d1 ai , a j B>j ai Bi>a j > Bi>a j B>j ai > 4d1 ai , a j B>j ai a>i Bj .

( ) ( )  k k2 We note that Bi>a j > Bi>a j Bi>a j 6 1. So,

4h i2( )( ) ( )( )  4h i2 4d1 ai , a j B>j ai Bi>a j > Bi>a j B>j ai > 4d1 ai , a j B>j ai a>i Bj .

Í ¾ We return to bounding i S R>j Ri R>i R j, where we recall that each Ri in ∈ the sum is fixed and the expectation is over R j.

By ?? 5.8.24,

Õ Õ ¾  ¾ + ¾ R>j Ri R>i R j 2 Xij Xij> 2 YijYij> i S i S ∈ ∈ Õ Õ  k ¾ k · + 2 Xij Xij> Id 2 YijYij> by triangle inequality i S i S ∈ ∈  (| | · 2)· + Õ O S d Id 2 YijYij> by (1) above i S ∈ Õ  (| | · 2)· + ¾ 4h i2 O S d Id 2 Bj 4d1 ai , a j ai a>i Bj by (2) above i S ∈ Õ  (| | · 2)· + ( 4)· ¾h i2 · k k · k k O S d Id O d1 ai , a j ai a>i Id by Bj 6 1 i S ∈ Õ  (| | · 2)· + ( 3) · ·k k · ¾h i2 ( / ) O S d Id O d ai a>i Id by ai , a j 6 O 1 d i S ∈



5.8.6 Matrix Decoupling

Fact 5.8.27 (Block Matrix Decoupling, similar to Lemma 5.63 of [80]). Let R be an

N × nm random matrix, consisting of n blocks Ri of dimension N × m. Suppose that  ⊆ [ ] ∈ ’N n S the blocks satisfy R>i Ri Id. For a subset S n , let RS × | | matrix consisting

280 of only the blocks in S. Let T ⊆ [n] be uniformly random. Then

¾ k − k ¾ k k R>R Id 6 4 RT R>n T . R R,T [ ]\

Proof. Following the argument of Vershynin [80], we note that

2 kR>R − Id k  | sup kRxk − 1| x 1 k k nm and that for x  (x1,..., xn) ∈ ’ ,

k k2  Õ + Õ  + Õ Rx x>i R>i Ri xi x>i R>i R j x j 1 x>i R>i R j x j i n i,j n i,j n ∈[ ] ∈[ ] ∈[ ]  since Ri R>i Id. By scalar decoupling (see Vershynin, Lemma 5.60 [80]), for a uniformly random subset T ⊆ [n], Õ Õ x>R>R j x j  4 ¾ x>R>R j x j . i i T i i i,j n i T,j n T ∈[ ] ∈ ∈[ ]\ So,

2 ¾ kR>R − Id k  ¾ | sup kRxk − 1| R R x 1 k k Õ  ¾ ¾ sup 4 x>i R>i R j x j R x 1 T k k i T,j n T ∈ ∈[ ]\ Õ ¾ 6 4 sup x>i R>i R j x j R,T  x 1 i T,j n T k k ∈ ∈[ ]\  k k 4 ¾ RT>R n T R,T [ ]\ where the inequality above follows by Jensen’s inequality. 

5.8.7 Putting It Together

We are ready to prove Lemma 5.8.2.

281 n d 1 n d 1 2 Proof of Lemma 5.8.2. We will show that R>R ∈ ’ ( − )× ( − ) is close to Θ(d )· Id. ( ) The i, j -th block of R>R is given by R>i R j. Let us first consider the diagonal blocks, R>i Ri.

Using ?? 5.8.18 and the bounds d1  Θ(d), d2  Θ(1), we obtain,

 Θ( )· Ri R>i d Id .

Next, we bound the norm of the off-diagonal part of the matrix. Let ’diag be the matrix equal to R> only on diagonal blocks and zero elsewhere. We will use matrix decoupling, ?? 5.8.27, to bound ¾ kR>R − ’diagk. We find that for a uniformly random subset T ⊆ [n],

k − k k k ¾ R>R Rdiag 6 4 ¾ RT>R R R,T T

Now fix T ⊆ [n] and for i ∈ T fix unit vectors ai and orthonormal bases Bi ∈ to obtain the Ri for i T. We regard the matrix RT>RT as a sum of independent matrices, one for each i ∈ T. The i-th such matrix Vi has |T| blocks; the j-th block is R>j Ri.

k k ( ) · k k First we note that Vi 6 O d RT with probability one over Ri, since kRi k 6 O(d) with probability 1, by ?? 5.8.10.

Next, we note the bounds on the variance of RT>RT afforded by ?? 5.8.25 and ?? 5.8.26. We conclude that by Matrix Bernstein,

Õ ¾ k k · ( (p| | · k k p| | · 3 2k k1 2 ) RT>RT 6 log d O max T RT , T d, d / ai a>i / , d . RT i T ∈ So,

1 2 p 3 2 Õ 1 2 ¾ kR>R−Rdiagk 6 O(log d)·© ¾ |T| / · kR k + d ¾ |T| + d / ¾ k ai a>k / + dª R ­T,R T T,R i ® i T « ∈ ¬

282 √ √ ¾ | |1 2k k · ( · ( )) Using ?? 5.8.23, we have T / RT 6 n O log d max d, n . By standard ¾ k Í k · ( ( / )) matrix concentration, i T 6 log d O max 1, n d . So putting it together, ∈ √ 2 3 2 ¾ kR>R − Rdiagk 6 O(log d) · max(d n, n, d / ) . R 

5.8.8 Omitted Proofs

We turn now to the proof of ?? 5.8.4. The strategy is much the same as the proof of Lemma 5.8.2, which is in turn an adaptation of an argument due to Vershynin for concentration of matrices with independent columns [80].

We will use the following simple matrix decoupling inequality, with B being

3 the random matrix having columns ai⊗ .

Lemma 5.8.28 (Matrix decoupling, Lemma 5.63 in [80]). Let B be an N × n random matrix whose columns B_j satisfy ‖B_j‖ = 1. For any T ⊆ [n], let B_T be the restriction of B to the columns in T. Let T be a uniformly random set of columns. Then

𝔼_B ‖B^⊤B − Id‖ ≤ 4 𝔼_T 𝔼_B ‖(B_T)^⊤ B_{[n]\T}‖.

3 ⊆ [ ] Proof of ?? 5.8.4. Let B have columns ai⊗ . Fix T n . We will bound k( ) k ¾ BT >B n T , with the goal of applying Lemma 5.8.28. [ ]\

∈ h i3 For i T, the i-th row of B>B n T has entries a j , ai for j < T. Let T [ ]\ us temporarily fix ai for i < T; then these rows become independent due to independence of {a j }j T. ∈

| | We think of the matrix B>B n T as consisting of a sum of T independent T [ ]\ matrices where only the i-th row of the i-th matrix Mi is nonzero, and it consists of

283 ( ) entries Mij  B>B n T ij. We are going to apply the matrix Bernstein inequality T [ ]\  Í to the sum BT>B n T i T Mi. [ ]\ ∈

To do so, we need to compute the variance of the sum: we need to bound ! Õ Õ 2  ¾ ¾ σ max Mi Mi>, Mi>Mi . i T i T ∈ ∈

(Here the expectation is over ai for i ∈ T; we are conditioning on ai for i < T.) For the former, consider that Mi Mi> has just one nonzero entry, Õ ¾( )  ¾ h i6 (|[ ]\ |/ 3) Mi Mi> i,i ai , a j 6 O n T d . i O n T d . ∈

¾ Í h i3 Next we bound i T Mi>Mi. Let r be a random vector with entries a, ai ∈ ¾ Í  | | · ¾ for i < T and a a random unit vector. Then i T Mi>Mi T rr>. We may ∈ compute that

3 3 C 1 3 (¾ rr>)ij  ¾ha, aii ha, a ji  hai , a ji + · O(hai , a ji ) d3 d3 where C is a universal constant. (This may be seen by comparison to the case that a is replaced by a standard Gaussian and using Wick’s theorem on moments of  (Í h i6/ 6)1 2  k k a multivariate Gaussian.) Letting C1 i,j

C2, |T|C1)).

By applying Matrix Bernstein, for each t > 0 we obtain

Õ p ¾ · (k k ) ( )· (|[ ]\ |/ 3 | | · 3 · 2) Mi 1 Mi 6 t 6 O log d max n T d , T d− C2, C1, t i T ∈ p 3 where again the expectation is over only ai for i ∈ T. Choosing t  O˜ ( n/d ), by

ω 1 standard scalar concentration we obtain that (kMi k > t) 6 (dn)− ( ). Hence by

284 k Í ( − (k k ))k ( ) ω 1 Cauchy-Schwarz we find ¾ i T Mi 1 1 Mi 6 t 6 dn − ( ), so all in all, ∈

Õ p ¾ ( )O 1 · (|[ ]\ |/ 3 | | · 3 · | | / 3) Mi 6 log nd ( ) max n T d , T d− C2, T C1, n d . i T ∈

Finally, we have to bound

p (|[ ]\ |/ 3 | | · 3 · | | / 3) ¾ ¾ max n T d , T d− C2, T C1, n d T ai ,i

k k which is an upper bound on ¾ B>B n T . By Cauchy-Schwarz, we may upper T [ ]\ bound this by

3 1 2 3 1 2 1 2 3 1 2 (¾ |[n]\ T|/d ) / + (¾ |T| · d− · C2) / + (¾ |T|C1) / + (n/d ) / .

By standard matrix concentration, ¾ C2 6 O(1 + n/d). Clearly ¾ |[n]\ T| and

2 4.5 ¾ |T| 6 O(n). Finally, by straightforward computation, ¾ C1 6 n /d .

All together, applying Lemma 5.8.28, we have obtained ¾ kB>B − Id k 6

2 O˜ (n/d ) for n  d. Since B>B has the same eigenvalues as BB>  U, we are done. 

BIBLIOGRAPHY

[1] Evrim Acar, Canan Aykut-Bingol, Haluk Bingol, Rasmus Bro, and Bülent Yener. Multiway analysis of epilepsy tensors. Bioinformatics, 23(13):i10–i18, 2007.1, 216

[2] Rudolf Ahlswede and Andreas J. Winter. Strong converse for identification via quantum channels. IEEE Trans. Information Theory, 48(3):569–579, 2002. 315

[3] Zeyuan Allen-Zhu and Yuanzhi Li. Lazysvd: Even faster svd decomposition yet without agonizing pain. In Advances in Neural Information Processing Systems, pages 974–982, 2016. 239

[4] Anima Anandkumar, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, and Yi-Kai Liu. A spectral algorithm for latent dirichlet allocation. Algorithmica, 72(1):193–214, 2015. 94, 136

[5] Anima Anandkumar, Rong Ge, Daniel J. Hsu, and Sham M. Kakade. A tensor spectral approach to learning mixed membership community models. CoRR, abs/1302.2684, 2013. 27

[6] Anima Anandkumar, Rong Ge, Daniel J. Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models (A survey for ALT). In ALT, volume 9355 of Lecture Notes in Computer Science, pages 19–38. Springer, 2015. 137

[7] Anima Anandkumar, Rong Ge, and Majid Janzamin. Analyzing tensor power method dynamics: Applications to learning overcomplete latent variable models. CoRR, abs/1411.1488, 2014.1, 27, 95, 136, 218

[8] Animashree Anandkumar, Rong Ge, Daniel J. Hsu, and Sham Kakade. A tensor spectral approach to learning mixed membership community models. In COLT, volume 30 of JMLR Workshop and Conference Proceedings, pages 867–881. JMLR.org, 2013. 219

[9] Animashree Anandkumar, Rong Ge, Daniel J. Hsu, and Sham M. Kakade. A tensor approach to learning mixed membership community models. Journal of Machine Learning Research, 15(1):2239–2312, 2014. 94, 136

[10] Animashree Anandkumar, Rong Ge, Daniel J. Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models.

286 Journal of Machine Learning Research, 15(1):2773–2832, 2014.1, 27, 94, 136, 137, 211, 214, 216, 224, 225

[11] Animashree Anandkumar, Rong Ge, and Majid Janzamin. Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. CoRR, abs/1402.5180, 2014. 27, 225

[12] Animashree Anandkumar, Rong Ge, and Majid Janzamin. Learning overcom- plete latent variable models through tensor methods. In COLT, volume 40 of JMLR Workshop and Conference Proceedings, pages 36–112. JMLR.org, 2015. 95, 96, 122, 136

[13] Animashree Anandkumar, Rong Ge, and Majid Janzamin. Analyzing tensor power method dynamics in overcomplete regime. Journal of Machine Learning Research, 18:22:1–22:40, 2017. 225

[14] Boaz Barak, Fernando G. S. L. Brandão, Aram Wettroth Harrow, Jonathan A. Kelner, David Steurer, and Yuan Zhou. Hypercontractivity, sum-of-squares proofs, and their applications. In STOC, pages 307–326. ACM, 2012.2, 31, 90, 104, 105, 294

[15] Boaz Barak, Jonathan A. Kelner, and David Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. CoRR, abs/1407.1543, 2014. 27

[16] Boaz Barak, Jonathan A. Kelner, and David Steurer. Rounding sum-of- squares relaxations. In STOC, pages 31–40. ACM, 2014.8, 27, 90, 92, 93, 94, 98, 99, 103, 104, 294, 295, 308

[17] Boaz Barak, Jonathan A. Kelner, and David Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. In STOC, pages 143–151. ACM, 2015.1,4, 14, 90, 94, 95, 98, 99, 103, 108, 110, 111, 136, 137, 138, 145, 147, 150, 151, 169, 172, 173, 219, 226

[18] Boaz Barak, Guy Kindler, and David Steurer. On the optimality of semidefi- nite relaxations for average-case and generalized constraint satisfaction. In ITCS, pages 197–214. ACM, 2013. 18

[19] Boaz Barak and Ankur Moitra. Tensor prediction, rademacher complexity and random 3-xor. CoRR, abs/1501.06521, 2015. 28

[20] Boaz Barak, Prasad Raghavendra, and David Steurer. Rounding semidefinite

287 programming hierarchies via global correlation. In FOCS, pages 472–481. IEEE Computer Society, 2011. 99, 103

[21] Boaz Barak and David Steurer. Sum-of-squares proofs and the quest toward optimal algorithms. CoRR, abs/1404.5236, 2014.2, 20, 27, 31, 90

[22] C.F. Beckmann and S.M. Smith. Tensorial extensions of independent com- ponent analysis for multisubject fmri analysis. NeuroImage, 25(1):294 – 311, 2005.1, 216

[23] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In COLT, volume 30 of JMLR Workshop and Conference Proceedings, pages 1046–1066. JMLR.org, 2013. 18

[24] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vija- yaraghavan. Smoothed analysis of tensor decompositions. In STOC, pages 594–603. ACM, 2014. 94, 136, 137, 138, 142, 143, 199, 200, 222

[25] Alexandre d’Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. In NIPS, pages 41–48, 2004. 92

[26] Chandler Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal., 7:1–46, 1970. 238

[27] Wenceslas Fernandez de la Vega and Claire Kenyon-Mathieu. Linear programming relaxations of maxcut. In SODA, pages 53–61. SIAM, 2007. 99

[28] Lieven De Lathauwer, Josphine Castaing, and Jean-Franois Cardoso. Fourth- order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55(6):2965–2973, 2007. 219

[29] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. Blind source separation by simultaneous third-order tensor diagonalization. In European Signal Processing Conference, 1996. EUSIPCO 1996. 8th, pages 1–4. IEEE, 1996. 1,9, 218

[30] Andrew C Doherty and Stephanie Wehner. Convergence of sdp hier- archies for polynomial optimization on the hypersphere. arXiv preprint arXiv:1210.5048, 2012. 63, 64

288 [31] Michael Elad. Sparse and Redundant Representations: From Theory to Ap- plications in Signal and Image Processing. Springer Publishing Company, Incorporated, 1st edition, 2010. 220

[32] Uriel Feige, Jeong Han Kim, and Eran Ofek. Witnesses for non-satisfiability of dense random 3cnf formulas. In FOCS, pages 497–508. IEEE Computer Society, 2006. 28

[33] Uriel Feige and Eran Ofek. Easily refutable subformulas of large random 3cnf formulas. Theory of Computing, 3(1):25–43, 2007. 28

[34] Joel Friedman, Andreas Goerdt, and Michael Krivelevich. Recognizing more unsatisfiable random k-sat instances efficiently. SIAM J. Comput., 35(2):408–430, 2005. 28

[35] François Le Gall. Powers of tensors and fast matrix multiplication. In ISSAC, pages 296–303. ACM, 2014. 131

[36] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, volume 40 of JMLR Workshop and Conference Proceedings, pages 797–842. JMLR.org, 2015. 225

[37] Rong Ge, Qingqing Huang, and Sham M. Kakade. Learning mixtures of gaussians in high dimensions. In STOC, pages 761–770. ACM, 2015. 94, 136

[38] Rong Ge, Qingqing Huang, and Sham M. Kakade. Learning mixtures of Gaussians in high dimensions [extended abstract]. In STOC’15—Proceedings of the 2015 ACM Symposium on Theory of Computing, pages 761–770. ACM, New York, 2015. 219

[39] Rong Ge and Tengyu Ma. Decomposing overcomplete 3rd order tensors using sum-of-squares algorithms. In APPROX-RANDOM, volume 40 of LIPIcs, pages 829–849. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2015. 90, 95, 96, 98, 108, 110, 136, 137, 138, 142, 150, 158, 187, 188, 226, 228

[40] Rong Ge and Tengyu Ma. On the optimization landscape of tensor decompo- sitions. In Advances in Neural Information Processing Systems, pages 3653–3663, 2017. 224, 225

289 [41] Andreas Goerdt and Michael Krivelevich. Efficient recognition of random unsatisfiable k-sat instances by spectral methods. In STACS 2001, 18th Annual Symposium on Theoretical Aspects of Computer Science, Dresden, Germany, February 15-17, 2001, Proceedings, pages 294–304, 2001. 28

[42] Navin Goyal, Santosh Vempala, and Ying Xiao. Fourier PCA and robust tensor decomposition. In STOC, pages 584–593. ACM, 2014. 94, 136, 137

[43] Martin Grötschel, László Lovász, and Alexander Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2):169–197, 1981. 161, 162

[44] Venkatesan Guruswami and Ali Kemal Sinop. Faster SDP hierarchy solvers for local rounding algorithms. In FOCS, pages 197–206. IEEE Computer Society, 2012. 99

[45] Wu Hai-Long, Shibukawa Masami, and Oguma Koichi. An alternating trilinear decomposition algorithm with application to calibration of HPLC- DAD for simultaneous determination of overlapped chlorinated aromatic hydrocarbons. Journal of Chemometrics, 12(1):1–26.1, 216

[46] Richard A Harshman. Foundations of the parafac procedure: Models and conditions for an" explanatory" multi-modal factor analysis. 1970.1,9, 137, 150, 218, 230

[47] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are np-hard. J. ACM, 60(6):45:1–45:39, 2013. 217

[48] Samuel B Hopkins, Tselil Schramm, and Jonathan Shi. A robust spectral algorithm for overcomplete tensor decomposition. In Conference on Learning Theory, pages 1683–1722, 2019. 15

[49] Samuel B. Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In STOC, pages 178–191. ACM, 2016. 15, 218, 227, 267, 274

[50] Samuel B. Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sum-of-square proofs. In COLT, volume 40 of JMLR Workshop and Conference Proceedings, pages 956–1006. JMLR.org, 2015. 15, 90, 97, 98, 99, 100, 111, 132, 133, 136, 141, 216

290 [51] Samuel B Hopkins and David Steurer. Efficient bayesian estimation from few samples: community detection and related problems. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on, pages 379–390. IEEE, 2017. 219

[52] Samuel Brink Klevit Hopkins. Statistical Inference and the Sum of Squares Method. PhD thesis, Cornell University, 2018. 12

[53] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009. 94, 224

[54] Jean B. Lasserre. Global optimization with polynomials and the problem of moments. SIAM J. Optim., 11(3):796–817, 2000/01.1, 90, 161

[55] Lieven De Lathauwer, Joséphine Castaing, and Jean-François Cardoso. Fourth-order cumulant-based blind identification of underdetermined mix- tures. IEEE Trans. Signal Processing, 55(6-2):2965–2973, 2007.1, 12, 94, 143, 154, 189, 225, 228, 229

[56] S.E. Leurgans, R.T. Ross, and R.B. Abel. A decomposition for three-way arrays. SIAM J. Matrix Anal. Appl., 14(4):1064–1083, 1993. 137, 150

[57] Michael S. Lewicki and Terrence J. Sejnowski. Learning overcomplete representations. Neural Comput., 12(2):337–365, February 2000. 220

[58] Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor de- compositions with sum-of-squares. In FOCS, pages 438–446. IEEE Computer Society, 2016.4, 14, 15, 218, 219, 221, 222, 225, 226, 227, 228, 231, 234

[59] Marco Mondelli and Andrea Montanari. On the connection between learning two-layers neural networks and tensor decomposition. CoRR, abs/1802.07301, 2018. 220

[60] Andrea Montanari and Emile Richard. A statistical model for tensor pca. In Advances in Neural Information Processing Systems, pages 2897–2905, 2014. 18, 19, 28, 41, 47, 54, 87

[61] Yurii Nesterov. Squared functional systems and optimization problems. In High performance optimization, volume 33 of Appl. Optim., pages 405–440. Kluwer Acad. Publ., Dordrecht, 2000.1, 90

291 [62] Ryan O’Donnell and Yuan Zhou. Approximability and proof complexity. In SODA, pages 1537–1556. SIAM, 2013. 295

[63] Roberto I. Oliveira. Sums of random Hermitian matrices and an inequality by Rudelson. Electron. Commun. Probab., 15:203–212, 2010. 138, 153, 182

[64] Pablo A Parrilo. Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization. PhD thesis, California Institute of Technology, 2000.1, 90, 161

[65] Qing Qu, Ju Sun, and John Wright. Finding a sparse vector in a subspace: Linear sparsity using alternating directions. In NIPS, pages 3401–3409, 2014. 92, 93, 94

[66] Prasad Raghavendra and Ning Tan. Approximating csps with global cardinality constraints using SDP hierarchies. In SODA, pages 373–387. SIAM, 2012. 103

[67] Emile Richard and Andrea Montanari. A statistical model for tensor PCA. In NIPS, pages 2897–2905, 2014. 97, 98, 111, 132, 216

[68] Tselil Schramm and David Steurer. Fast and robust tensor decomposition with applications to dictionary learning. In COLT, volume 65 of Proceedings of Machine Learning Research, pages 1760–1793. PMLR, 2017.4, 12, 14, 219, 223, 225, 228, 231, 234, 251, 253, 254, 260

[69] Tselil Schramm and Benjamin Weitz. Low-rank matrix completion with adversarial missing entries. CoRR, abs/1506.03137, 2015. 144

[70] Vatsal Sharan and Gregory Valiant. Orthogonalized ALS: A theoretically principled tensor decomposition algorithm for practical use. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 3095–3104. PMLR, 2017. 225

[71] N. Z. Shor. An approach to obtaining global extrema in polynomial problems of mathematical programming. Kibernetika (Kiev), (5):102–106, 136, 1987.1, 90, 161

[72] Daniel A. Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. In COLT, volume 23 of JMLR Proceedings, pages 37.1–37.18. JMLR.org, 2012. 92, 102

292 [73] Gilbert Stengle. Complexity estimates for the schmüdgen positivstellensatz. Journal of Complexity, 12(2):167 – 174, 1996.2

[74] David Steurer. Tensor decompositions, sum-of-squares proofs, and spectral algorithms. In Quarterly Theory Workshop, Northwestern University, Illinois, May 2016.1

[75] Terence Tao. Topics in random matrix theory, volume 132. American Mathe- matical Soc., 2012. 297

[76] Ryota Tomioka and Taiji Suzuki. Spectral norm of random tensors. arXiv preprint arXiv:1407.1870, 2014. 68

[77] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012. 138, 182, 297, 298, 301, 315

[78] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. CoRR, abs/1011.3027, 2010. 320, 321, 323, 334

[79] Roman Vershynin. Introduction to the non-asymtotic analysis of random matrices. pages 210–268, 2011. 297, 298, 300

[80] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed sensing, pages 210–268. Cambridge Univ. Press, Cambridge, 2012. 272, 280, 281, 283

[81] Alexander S Wein, Ahmed El Alaoui, and Cristopher Moore. The kikuchi hierarchy and tensor pca. arXiv preprint arXiv:1904.03858, 2019. 14

[82] Hermann Weyl. Das asymptotische verteilungsgesetz der eigenwerte linearer partieller differentialgleichungen (mit einer anwendung auf die theorie der hohlraumstrahlung). Mathematische Annalen, 71(4):441–479, Dec 1912. 238

293 APPENDIX A SPIKED TENSOR MODEL

A.1 Pseudo-Distribution Facts

Lemma A.1.1 (Quadric Sampling). Let {x} be a pseudo-distribution over ’n of degree d > 2. Then there is an actual distribution {y} over ’n so that for any polynomial p of degree at most 2, ¾[p(y)]  ¾˜ [p(x)]. Furthermore, {y} can be sampled from in time poly n.

Lemma A.1.2 (Pseudo-Cauchy-Schwarz, Function Version, [14]). Let x, y be vector- valued polynomials. Then

1 hx, yi  (kxk2 + kyk2). 2

See [16] for the cleanest proof.

Lemma A.1.3 (Pseudo-Cauchy-Schwarz, Powered Function Version). Let x, y be vector-valued polynomials and d > 0 an integer. Then

1 hx, yid  (kxk2d + kyk2d). 2

d d d Proof. Note that hx, yi  hx⊗ , y⊗ i and apply Lemma A.1.2. 

Yet another version of pseudo-Cauchy-Schwarz will be useful:

Lemma A.1.4 (Pseudo-Cauchy-Schwarz, Multiplicative Function Version, [14]).

Let {x, y} be a degree d pseudo-distribution over a pair of vectors, d > 2. Then q q ¾˜ [hx, yi] 6 ¾˜ [kxk2] ¾˜ [kyk2].

294 Again, see [16] for the cleanest proof.

3 We will need the following inequality relating ¾˜ hx, v0i and ¾˜ hx, v0i when

3 ¾˜ hx, v0i is large.

Lemma A.1.5. Let {x} be a degree-4 pseudo-distribution satisfying {kxk2  1}, and

n 3 let v0 ∈ ’ be a unit vector. Suppose that ¾˜ hx, v0i > 1 − ε for some ε > 0. Then

¾˜ hx, v0i > 1 − 2ε.

Proof. Let p(u) be the univariate polynomial p(u)  1 − 2u3 + u. It is easy to check that p(u) > 0 for u ∈ [−1, 1]. It follows from classical results about univariate polynomials that p(u) then can be written as

p(u)  s0(u) + s1(u)(1 + u) + s2(u)(1 − u)

for some SoS polynomials s0, s1, s2 of degrees at most 2. (See [62], fact 3.2 for a precise statement and attributions.)

Now we consider

¾˜ p(hx, v0i) > ¾˜ [s1(hx, v0i)(1 + hx, v0i)] + ¾˜ [s2(hx, v0i)(1 − hx, v0i)] .

h i  1 (k k2 + ) h i  We have by Lemma A.1.2 that x, v0 2 x 1 and also that x, v0 − 1 (k k2 + ) (h i) 2 x 1 . Multiplying the latter SoS relation by the SoS polynomial s1 x, v0 and the former by s2(hx, v0i), we get that

¾˜ [s1(hx, v0i)(1 + hx, v0i)]  ¾˜ [s1(hx, v0i)] + ¾˜ [s1(hx, v0i)hx, v0i] 1 > ¾˜ [s (hx, v i)] − ¾˜ [s (hx, v i)(kxk2 + 1)] 1 0 2 1 0

> ¾˜ [s1(hx, v0i)] − ¾˜ [s1(hx, v0i)]

> 0 ,

295 where in the second-to-last step we have used the assumption that {x} satisfies {kxk2  1}. A similar analysis yields

¾˜ [s2(hx, v0i)(1 − hx, v0i)] > 0 .

3 All together, this means that ¾˜ p(hx, v0i) > 0. Expanding, we get ¾˜ [1 − 2hx, v0i + hx, v0i] > 0. Rearranging yields

3 ¾˜ hx, v0i > 2 ¾˜ hx, v0i − 1 > 2(1 − ε) − 1 > 1 − 2ε . 

We will need a bound on the pseudo-expectation of a degree-3 polynomial in terms of the operator norm of its coefficient matrix.

n2 n Lemma A.1.6. Let {x} be a degree-4 pseudo-distribution. Let M ∈ ’ × . Then

2 4 3 4 ¾˜ hx⊗ , Mxi 6 kMk(¾˜ kxk ) / .

Proof. We begin by expanding in the monomial basis and using pseudo-Cauchy- Schwarz:

˜ h 2 i  ˜ Õ ¾ x⊗ , Mx ¾ M j,k ,i xi x j xk ( ) ijk  ˜ Õ Õ ¾ xi M j,k ,i x j xk ( ) i jk 2 1 2   / ( ˜ k k2)1 2  ˜ Õ Õ  6 ¾ x / ¾ © M j,k ,i xi x jª   ­ ( ) ®   i jk   « ¬  2 1 2   / ( ˜ k k4)1 4  ˜ Õ Õ  6 ¾ x / ¾ © M j,k ,i xi x jª   ­ ( ) ®   i jk   « ¬ 

 2 T Í Í We observe that MM is a matrix representation of i jk M j,k ,i xi x j . We ( )

296 know MMT  kMk2 Id, so

2 ˜ Õ Õ k k2 ˜ k k4 ¾ © M j,k ,i xi x jª 6 M ¾ x . ­ ( ) ® i jk « ¬ 2 4 3 4 Putting it together, we get ¾˜ hx⊗ , Mxi 6 kMk(¾˜ kxk ) / as desired. 

A.2 Concentration bounds

A.2.1 Elementary Random Matrix Review

We will be extensively concerned with various real random matrices. A great deal is known about natural classes of such matrices; see the excellent book of

Tao [75] and the notes by Vershynin and Tropp [79, 77].

Our presentation here follows Vershynin’s [79]. Let X be a real random k k 1 2(¾| |p)1 p { } variable. The subgaussian norm X ψ2 of X is supp>1 p− / X / . Let a n k k { } be a distribution on ’ . The subgaussian norm a ψ2 of a is the maximal k k  kh ik subgaussian norm of the one-dimensional marginals: a ψ2 sup u 1 a, u ψ2 . k k { } k k  ( ) A family of random variables Xn n Ž is subgaussian if Xn ψ2 O 1 . The ∈ reader may easily check that an n-dimensional vector of independent standard Gaussians or independent ±1 variables is subgaussian.

It will be convenient to use the following standard result on the concentration of empirical covariance matrices. This statement is borrowed from [79], Corollary 5.50.

Lemma A.2.1. Consider a sub-gaussian distribution {a} in ’m with covariance matrix Σ ∈ ( ) ∼ { } ( / )2 k 1 Í T − , and let δ 0, 1 , t > 1. If a1,..., aN a with N > C t δ m then N ai ai

297 Σk 6 δ with probability at least 1 − 2 exp(−t2m). Here C  C(K) depends only on the  k k sub-gaussian norm K a ψ2 of a random vector taken from this distribution.

We will also need the matrix Bernstein inequality. This statement is borrowed from Theorem 1.6.2 of Tropp [77].

Theorem A.2.2 (Matrix Bernstein). Let S1,..., Sm be independent square random matrices with dimension n. Assume that each matrix has bounded deviation from its k − k  Í mean: Si ¾ Si 6 R for all i. Form the sum Z i Si and introduce a variance parameter

σ2  max{k ¾(Z − ¾ Z)(Z − ¾ Z)T k, k ¾(Z − ¾ Z)T(Z − ¾ Z)k} .

Then  t2/2  {kZ − ¾ Zk > t} 6 2n exp for all t > 0 . σ2 + Rt/3

We will need bounds on the operator norm of random square rectangular matrices, both of which are special cases of Theorem 5.39 in [79].

Lemma A.2.3. Let A be an n × n matrix with independent entries from N(0, 1). Then √ ω 1 with probability 1 − n− ( ), the operator norm kAk satisfies kAk 6 O( n).

Lemma A.2.4. Let A be an n2 × n matrix with independent entries from N(0, 1). Then

ω 1 with probability 1 − n− ( ), the operator norm kAk satisfies kAk 6 O(n).

Í ⊗ A.2.2 Concentration for i Ai Ai and Related Ensembles

Our first concentration theorem provides control over the nontrivial permutations

T of the matrix AA under the action of S4 for a tensor A with independent entries.

298 Theorem A.2.5. Let c ∈ {1, 2} and d > 1 an integer. Let A1,..., Anc be iid random

nd nd matrices in {±1} × or with independent entries from N(0, 1). Then, with probability

100 1 − O(n− ),

√ Õ 2d+c 2 1 2 Ai ⊗ Ai − ¾ Ai ⊗ Ai - dn( )/ ·(log n) / .

i nc ∈[ ] and √ Õ T T 2d+c 2 1 2 Ai ⊗ A − ¾ Ai ⊗ A - dn( )/ ·(log n) / . i i i nc ∈[ ]

We can prove Theorem 2.3.3 as a corollary of the above.

Proof of Theorem 2.3.3. Let A have iid Gaussian entries. We claim that ¾ A ⊗ A is a matrix representation of kxk4. To see this, we compute

2 2 2 hx⊗ , ¾(A ⊗ A)x⊗ i  ¾hx, Axi Õ  ¾ AijAkl xi x j xk xl i,j,k,l  Õ 2 2 xi x j ij

 kxk4 .

Now by Theorem A.2.5, we know that for Ai the slices of the tensor A from the statement of Theorem 2.3.3,

Õ 2 Ai ⊗ Ai  n ¾ A ⊗ A + λ · Id i

3 4 1 4 for λ  O(n / log(n) / ). Since n  O(λ) and both Id and ¾ A ⊗ A are matrix representations of kxk4, we are done. 

Now we prove Theorem A.2.5. We will prove only the statement about Í ⊗ Í ⊗ T i Ai Ai, as the case of i Ai Ai is similar.

299 Let A1,..., Anc be as in Theorem A.2.5. We first need to get a handle on their norms individually, for which we need the following lemma.

nd nd Lemma A.2.6. Let A be a random matrix in {±1} × or with independent entries

d 2 from N(0, 1). For all t > 1, the probability of the event {kAk > tn / } is at most t2nd K 2− / for some absolute constant K.

Proof. The subgaussian norm of the rows of A is constant and they are identically and isotropically distributed. Hence Theorem 5.39 of [79] applies to give the result. 

d 2 Since the norms of the matrices A1,..., Anc are concentrated around n / (by Lemma A.2.6), it will be enough to prove Theorem A.2.5 after truncating the

c matrices A1,..., An . For t > 1, define iid random matrices A01,..., A0nc such that

 d 2 Ai if kAi k 6 tn / , def  A0i  0 otherwise 

for some t to be chosen later. Lemma A.2.6 allows us to show that the random ⊗ ⊗ matrices Ai Ai and A0i A0i have almost the same expectation. For the remainder of this section, let K be the absolute constant from Lemma A.2.6.

∈ [ c] ⊗ ⊗ Lemma A.2.7. For every i n and all t > 1, the expectations of Ai Ai and A0i A0i satisfy ¾[ ⊗ ] − ¾[ ⊗ ] ( )· tnd K Ai Ai A0i A0i 6 O 1 2− / .

 k k d 2 Proof. Using Jensen’s inequality and that Ai A0i unless Ai > tn / , we have k ¾ ⊗ − ⊗ k ¾ k ⊗ − ⊗ k Ai Ai A0i A0i 6 Ai Ai A0i A0i Jensen’s inequality

300 ¹ √  ∞ (k k )  k k d 2 Ai > s ds since Ai A0i unless Ai > tn / tnd 2 ¹ / ∞ s K 6 2− / ds by Lemma A.2.6 d 2 tn /

Õ∞ tnd 2 K i K 6 2− / / · 2− / discretizing the intergral i0 tnd 2 K  O(2− / / ) as desired. 

 ⊗ − ¾[ ⊗ ] Lemma A.2.8. Let B10 ,..., B0nc be i.i.d. matrices such that B0i A0i A0i A0i A0i . 2 c 2 Then for every C > 1 with C 6 3t n / ,

  − 2   Õ 2d+c 2 2d C  B0 > C · n( )/ 6 2n · exp . i 6t4  i nc   ∈[ ] 

 2 d {k k } Proof. For R 2t n , the random matrices B10 ,..., B0nc satisfy B0i 6 R with probability 1. Therefore, by the Bernstein bound for non-symmetric matrices [77, Theorem 1.6],

 nc   − 2/  Õ 2d s 2  B0 > s 6 2n · exp , i1 i σ2 + Rs/3 2  {kÍ ¾ ( ) k kÍ ¾( ) k} c · 2  · 2d+c 2 where σ max i B0i B0i > , i B0i >B0i 6 n R . For s C n( )/ , the probability is bounded by + n  − 2 · 2d c /  n Õ o 2d C n( ) 2  B0 > s 6 2n · exp . i1 i 4 · 2d+c + 2 · 4d+c 2/ 4t n 2t C n( )/ 3 2 4d+c 2 4 2d+c Since our parameters satisfy t C · n( )/ /3 6 t n( ), this probability is bounded by n − 2  n Õ o 2d C  B0 > s 6 2n · exp .  i1 i 6t4

At this point, we have all components of the proof of Theorem A.2.5.

Í ⊗ Proof of Theorem A.2.5 for i Ai Ai (other case is similar). By Lemma A.2.8, ( ) − 2  Õ Õ 2d+c 2 2d C  A0 ⊗ A0 − ¾[A0 ⊗ A0] > C · n( )/ 6 2n · exp . i i i i i i Kt4

301 At the same time, by Lemma A.2.6 and a union bound,

n o 2 d    − c · t n K A1 A01,..., An A0nc > 1 n 2− / .

By Lemma A.2.7 and triangle inequality,

Õ Õ c tnd K ¾[Ai ⊗ Ai] − ¾[A0 ⊗ A0] 6 n · 2− / . i i i i

Together, these bounds imply ( )

Õ Õ 2d+c 2 c tnd K  Ai ⊗ Ai − ¾[Ai ⊗ Ai] > C · n( )/ + n · 2− / i i − 2  2d C c t2nd K 6 2n · exp + n · 2− / . Kt4 p We choose t  1 and C  100 2Kd log n and assume that n is large enough   · 2d+c 2 c · tnd K 2d · C2 c · t2nd K so that C n( )/ > n 2− / and 2n exp −Kt4 > n 2− / . Then the probability satisfies ( )

Õ Õ 2d+c 2p 100  Ai ⊗ Ai − ¾[Ai ⊗ Ai] > 20n( )/ 2Kd log n 6 4n− .  i i

A.2.3 Concentration for Spectral SoS Analyses

 · 3 + Lemma A.2.9 (Restatement of Lemma 2.5.5). Let T τ v0⊗ A. Suppose A 100 has independent entries from N(0, 1). Then with probability 1 − O(n− ) we have √ k Í ⊗ − Í ⊗ k ( 3 2 ( )1 2) k Í ( ) k ( ) i Ai Ai ¾ i Ai Ai 6 O n / log n / and i v0 i Ai 6 O n .

Proof. The first claim is immediate from Theorem A.2.5. For the second, we Í ( ) note that since v0 is a unit vector, the matrix i v0 i Ai has independent entries √ N( ) k Í ( ) k ( ) from 0, 1 . Thus, by Lemma A.2.3, i v0 i Ai 6 O n with probability 100 1 − O(n− ), as desired. 

302 Lemma A.2.10 (Restatement of Lemma 2.5.10 for General Odd k). Let A be a

n k-tensor with k an odd integer, with independent entries from N(0, 1). Let v0 ∈ ’ k+1 2 × k 1 2 k be a unit vector, and let V be the n( )/ n( − )/ unfolding of v0⊗ . Let A be the k+1 2 k 1 2 100 n( )/ × n( − )/ unfolding of A. Then with probability 1 − O(n− ), the matrix A

T k+1 2 k 2 T satisfies A A  n( )/ I + E for some E with kEk 6 O(n / log(n)) and kA V k 6

k 1 4 1 2 O(n( − )/ log(n) / ).

√ k+1 2 Proof. With δ  O(1/ n) and t  1, our parameters will satisfy n( )/ >

2 k 1 2 (t/δ) n( − )/ . Hence, by Lemma A.2.1,

  T T k+1 2 Õ T k+1 2 k+1 2 1 k 2 kEk  kA A −n( )/ Ik  aαa − n( )/ · Id 6 n( )/ ·O √  O(n / ) α n α  k+1 2 | | ( )/

k+1 2 100 with probability at least 1 − 2 exp(−n( )/ ) > 1 − O(n− ).

It remains to bound kATV k. Note that V  uwT for fixed unit vectors

k 1 2 k+1 2 T T T u ∈ ’( − )/ and w ∈ ’( )/ . So kA V k 6 kA uk. But A u is distributed

n T p 100 according to N(0, 1) and so kA uk 6 O( n log n) with probability 1 − n− by standard arguments. 

A.2.4 Concentration for Lower Bounds

The next theorems collects the concentration results necessary to apply our lower bounds Theorem 2.6.3 and Theorem 2.6.4 to random polynomials.

Lemma A.2.11. Let A be a random 3-tensor with unit Gaussian entries. For a real parameter λ, let L : ’[x]4 → ’ be the linear operator whose matrix representation M L  1 Í · T  ( 3 2/ ( )1 2) is given by M : 2 2 π 4 π AA . There is λ O n / log n / so that with L n λ ∈S

303 50 probability 1 − O(n− ) the following events all occur for every π ∈ S3.

−2λ2 · Π Id Π 1  Π σ · Aπ(Aπ)T + σ2 · Aπ(Aπ)T + (σ · Aπ(Aπ)T)T + (σ2 · Aπ(Aπ)T)T  Π 2 (A.2.1)

Õ hA, Aπi  Ω(n3) (A.2.2)

π 3 ∈S hIdsym, Aπ(Aπ)Ti  O(n3) (A.2.3)   1 n max hId , Aπi  O(1) (A.2.4) 3 2 n n i i λn / ×   2 2 n max L kxk xi x j  O(1) (A.2.5) i,j !

3 2 L k k2 2 − 1 L k k4  ( / ) n / max x xi n x O 1 n (A.2.6) i

Proof. For (A.2.1), from Theorem A.2.5, Lemma 2.6.8, the observation that multi- plication by an orthogonal operator cannot increase the operator norm, a union bound over all π, and the triangle inequality, it follows that:

σ · Aπ(Aπ)T − ¾[σ · Aπ(Aπ)T] + σ2 · Aπ(Aπ)T − ¾[σ2 · Aπ(Aπ)T] 6 2λ2.

100 with probability 1 − n− . By the definition of the operator norm and another application of triangle inequality, this implies

−4λ2 Id  σ · Aπ(Aπ)T + σ2 · Aπ(Aπ)T + (σ · Aπ(Aπ)T)T + (σ2 · Aπ(Aπ)T)T

− ¾[σ · Aπ(Aπ)T] − ¾[σ2 · Aπ(Aπ)T] − ¾[(σ · Aπ(Aπ)T)T] − ¾[(σ2 · Aπ(Aπ)T)T] .

We note that ¾[σ · Aπ(Aπ)T]  σ · Id and ¾[σ2 · Aπ(Aπ)T]  σ2 · Id, and the same for their transposes, and that Π(σ · Id +σ2 · Id)Π  0. So, dividing by 2 and projecting onto the Π subspace:

−2λ2 · Π Id Π

304 1  Π σ · Aπ(Aπ)T + σ2 · Aπ(Aπ)T + (σ · Aπ(Aπ)T)T + (σ2 · Aπ(Aπ)T)T  Π . 2

We turn to (A.2.2). By a Chernoff bound, hA, Ai  Ω(n3) with probability

100 1 − n− . Let π ∈ S3 be a nontrivial permutation. To each multi-index α with

|α|  3 we associate its orbit Oα under hπi. If α has three distinct indices, then |O | Í A Aπ X α > 1 and β α β β is a random variable α with the following properties: ∈O

| | ( ) − ω 1 • Xα < O log n with probability 1 n− ( ). − • Xα and Xα are identically distributed.

Next, we observe that we can decompose

h πi  Õ π  + Õ A, A AαAα R Xα , α 3 α | | O where R is the sum over multi-indices α with repeated indices, and therefore

2 100 has |R|  O˜ (n ) with probability 1 − n− . By a standard Chernoff bound, | Í |  ( 2) − ( 100) α Xα O n with probability 1 O n− . By a union bound over all π, O 100 we get that with probability 1 − O(n− ), Õ hA, Aπi  n3 − O(n2)  Ω(n3) ,

π 3 ∈S establishing (A.2.2).

π Next up is (A.2.3). Because A are identically distributed for all π ∈ S3 we assume without loss of generality that Aπ  A. The matrix Idsym has O(n2)

T ω 1 nonzero entries. Any individual entry of AA is with probability 1 − n− ( ) at

sym T 3 100 most O(n). So hId , AA i  O(n ) with probability 1 − O(n− ).

Next, (A.2.4). As before, we assume without loss of generality that π is the h i  Í trivial permutation. For fixed 1 6 i 6 n, we have Idn n , Ai j Aij j, which ×

305 √ is a sum of n independent unit Gaussians, so |hIdn n , Aii| 6 O( n log n) with × ω 1 probability 1−n− ( ). By a union bound over i this also holds for maxi |hIdn n , Aii|. × 100 Thus with probability 1 − O(n− ),   1 O˜ (1) n max hId , A i 6 . 3 2 n n i i n / λ × λ

Last up are (A.2.5) and (A.2.6). Since we will do a union bound later, we fix

n2 2 i, j 6 n. Let w ∈ ’ be the matrix flattening of Idn n. We expand L kxk xi x j as ×

2 1 T T T 2 T L kxk xi x j  (w Π(AA + σ · AA + σ · AA )Π(ei ⊗ ej) n2O(λ2) T T T 2 T T + w Π(AA + σ · AA + σ · AA ) Π(ei ⊗ ej)) .

Π   Π( ⊗ )  1 ( ⊗ + ⊗ ) We have w w and we let eij : ei ej 2 ei ej ej ei . So using Lemma 2.6.8,

2 2 2 T T T 2 T n O(λ ) L kxk xi x j  w (AA + σ · AA + σ · AA )eij

T T T 2 T T + w (AA + σ · AA + σ · AA ) eij

T  T  w AA eij 1 Õ 1 Õ + A e ⊗ A e + A e ⊗ A e 2 k j k i 2 k i k j k k 1 Õ 1 Õ + AT e ⊗ AT e + AT e ⊗ AT e 2 k j k i 2 k i k j k k 1 Õ 1 Õ + A e ⊗ AT e + A e ⊗ AT e 2 k i k j 2 k j k i k k 1 Õ 1 Õ  + AT e ⊗ A e + AT e ⊗ A e . 2 k i k j 2 k j k i k k T For i , j, each term w (Ak ej ⊗ Ak ei) (or similar, with various transposes) is the sum of n independent products of pairs of independent unit Gaussians, so by

ω 1 a Chernoff bound followed by a union bound, with probability 1 − n− ( ) all √ of them are O( n log n). There are O(n) such terms, for an upper bound of

3 2 O(n / (log n)) on the contribution from the tensored parts.

306 T Í At the same time, w A is a sum k akk of n rows of A and Aeij is the average of two rows of A; since i , j these rows are independent from wT A. Writing T T  1 Í h + i this out, w AA eij 2 k akk , aij a ji . Again by a standard Chernoff and 3 2 union bound argument this is in absolute value at most O(n / (log n)) with

ω 1 ω 1 probability 1 − n− ( ). In sum, when i , j, with probability at least 1 − n− ( ), we

2 2 get | L kxk xi x j |  O(1/n log n). After a union bound, the maximum over all i, j is O(1/n2). This concludes (A.2.5).

 Í h ⊗ i  Í h i2 2 In the i j case, since k w, Ak ei Ak ei j,k ej , Ak ei is a sum of n | Í h ⊗ i − independent square Gaussians, by a Bernstein inequality, k w, Ak ei Ak ei 2 1 2 ω 1 n | 6 O(n log / n) with probability 1 − n− ( ). The same holds for the other T T  | ( 2) L k k2 2 − tensored terms, and for w AA eii, so when i j we get that O λ x xi 1 2 ω 1 5| 6 O((log / n)/n) with probability 1 − n− ( ). Summing over all i, we find | ( 2) L k k4 − | ( 1 2 ) ( 2)| L k k2 2 − 1 L k k4| that O λ x 5n 6 O log / n , so that O λ x xi n x 6 1 2 ω 1 O((log / n)/n) with probability 1 − n− ( ). A union bound over i completes the argument. 

Lemma A.2.12. Let A be a random 4-tensor with unit Gaussian entries. There

2 is λ  O(n) so that when L : ’[x]4 → ’ is the linear operator whose matrix  1 Í π − ( 50) representation M is given by M : 2 2 π 4 A , with probability 1 O n− the L L n λ ∈S following events all occur for every π ∈ S4.

1 −λ2  (Aπ + (Aπ)T) (A.2.7) 2 Õ hA, Aπi  Ω(n4) (A.2.8)

π 4 ∈S √ hIdsym, Aπi  O(λ2 n) (A.2.9)

2 2 n max | L kxk xi x j |  O(1) (A.2.10) i,j

3 2 2 2 n / max | L kxk x |  O(1) . (A.2.11) i i

307 1 π +( π)T 2 × 2 Proof. For (A.2.7), we note that 2 A A is an n n matrix with unit Gaussian 1 k π + ( π)T k ( )  ( ) entries. Thus, by Lemma A.2.3, we have 2 A A 6 O n O λ . For (A.2.8) only syntactic changes are needed from the proof of (A.2.3). For (A.2.9), we observe that hIdsym, Aπi is a sum of O(n2) independent Gaussians, so is √ 2 100 O(n log n) 6 O(λ n) with probability 1 − O(n− ). We turn finally to (A.2.10) and (A.2.11). Unlike in the degree 3 case, there is nothing special here about the diagonal so we will able to bound these cases together. Fix i, j 6 n. We expand L kxk2x x 1 Í wT Aπ(e ⊗ e ) Aπ(e ⊗ e ) i j as n2 2 π 4 i j . The vector i j is a vector of unit λ ∈S √ T π ω 1 Gaussians, so w A (ei ⊗ ej)  O( n log n) with probability 1 − n− ( ). Thus, also

ω 1 2 2 with probability 1 − n− ( ), we get n maxi,j | L kxk xi x j |  O(1), which proves both (A.2.10) and (A.2.11). 

A.3 Spectral methods for the random overcomplete model

A.3.1 Linear algebra

Here we provide some lemmas in linear algebra.

This first lemma is closely related to the sos Cauchy-Schwarz from [16], and the proof is essentially the same.

d d Lemma A.3.1 (PSD Cauchy-Schwarz). Let M ∈ ’ × , M  0 and symmetric. Let d p1,..., pn , q1,..., qn ∈ ’ . Then n n n h Õ i h Õ i1 2h Õ i1 2 M, pi q>i 6 M, pi p>i / M, qi q>i / . i1 i1 i1

Í In applications, we will have i pi qi as a single block of a larger block matrix Í Í containing also the blocks i pi p>i and i qi q>i .

308 Proof. We first claim that

n n n Õ 1 Õ 1 Õ hM, p q>i 6 hM, p p>i + hM, q q>i . i i 2 i i 2 i i i1 i1 i1 To see this, just note that the right-hand side minus the left is exactly

Õn Õ hM, (pi − qi)(pi − qi)>i  (pi − qi)>M(pi − qi) > 0 . i1 i The lemma follows now be applying this inequality to

pi qi p0  q0  .  i h Ín i1 2 i h Ín i1 2 M, i1 pi p>i / M, i1 qi q>i / Lemma A.3.2 (Operator Norm Cauchy-Schwarz for Sums). Let

A1,..., Am , B1,..., Bm be real random matrices. Then 1 2 1 2 Õ Õ / Õ / ¾ ¾ ¾ Ai Bi 6 A>i Ai Bi>Bi . i i i

Proof. We have for any unit x, y,

Õ Õ x> ¾ Ai Bi x  ¾hAi x, Bi yi i i Õ 6 ¾ kAi xkkBi yk i Õ 2 1 2 2 1 2 6 (¾ kAi xk ) / (¾ kBi xk ) / i s s Õ 2 Õ 2 6 ¾ kAi xk ¾ kBi yk i i s Õ s Õ  ¾ ¾ x> A>i Ai x y> Bi>Bi y i i 1 2 1 2 Õ / Õ / ¾ ¾ 6 A>i Ai Bi>Bi . i i where the nontrivial inequalities follow from Cauchy-Schwarz for expectations, vectors and scalars, respectively. 

309 The followng lemma allows to argue about the top eigenvector of matrices with spectral gap.

Lemma A.3.3 (Top eigenvector of gapped matrices). Let M be a symmetric r-by-r matrix and let u, v be a vectors in ’r with kuk  1. Suppose u is a top singular vector of M so that |hu, Mui|  kMk and v satisfies for some ε > 0,

2 kM − vv>k 6 kMk − ε · kvk

Then, hu, vi2 > ε · kvk2.

Proof. We lower bound the quadratic form of M − vv> evaluated at u by

2 2 |hu, (M − vv>)ui| > |hu, Mui| − hu, vi  kMk − hu, vi .

At the same time, this quadratic form evaluated at u is upper bounded by kMk − ε · kvk2. It follows that hu, vi2 > ε · kvk2 as desired.



2 The following lemma states that a vector in ’d which is close to a symmetric

2 vector v⊗ , if flattened to a matrix, has top eigenvector correlated with the symmetric vector.

Lemma (Restatement of Lemma ??).

Proof. Let vˆ  v/kvk. Let (σi , ai , bi) be the ith singular value, left and right (unit) singular vectors of U respectively.

Our assumptions imply that

vˆ>Uvˆ  |hMu, vˆ ⊗ vˆi| > c · kuk.

310 Furthermore, we observe that kUkF  kMuk 6 kMk · kuk, and that therefore kUkF 6 kuk. We thus have that,

s Õ Õ 2 2 c · kuk 6 vˆ>Uvˆ  σi · hvˆ, aiihvˆ, bii 6 kuk · hvˆ, aii hvˆ, bii ,

i d i d ∈[ ] ∈[ ] where to obtain the last inequality we have used Cauchy-Schwarz and our bound on kUkF. We may thus conclude that

2 Õ 2 2 2 Õ 2 2 c 6 hvˆ, aii hvˆ, bii 6 maxhai , vˆi · hbi , vˆi  maxhai , vˆi , (A.3.1) i d i d i d ∈[ ] i d ∈[ ] ∈[ ] ∈[ ] where we have used the fact that the left singular values of U are orthonormal.

The argument is symmetric in the bi.

Furthermore, we have that

2

2 2 2 Õ Õ 2 2 Õ 2 Õ 2 2 c ·kuk 6 vˆ>Uvˆ  σi · hvˆ, aiihvˆ, bii 6 © σ hvˆ, aii ª·© hvˆ, bii ª  σ hvˆ, aii , ­ i ® ­ ® i i d i d i d i d ∈[ ] « ∈[ ] ¬ « ∈[ ] ¬ ∈[ ] where we have applied Cauchy-Schwarz and the orthonormality of the bi. In particular, Õ 2h ˆ i2 2k k2 2k k2 σi v, ai > c u > c U F . i d ∈[ ] ∈ [ ] 2 2k k2 On the other hand, let S be the set of i d for which σi 6 αc U F. By substitution,

Õ 2h ˆ i2 2k k2 Õh ˆ i2 2k k2 σi v, ai 6 αc U F v, ai 6 αc U F , i S i S ∈ ∈ where we have used the fact that the right singular vectors are orthonormal. The last two inequalities imply that S , [d]. Letting T  [d]\ S, it follows from subtraction that

( − ) 2k k2 Õ 2h ˆ i2 h ˆ i2 Õ 2  h ˆ i2k k2 1 α c U F 6 σ v, ai 6 max v, ai σ max v, ai U F , i i T i i T i T ∈ i T ∈ ∈ ∈

311 2 2 so that maxi T hvˆ, aii > (1 − α)c . Finally, ∈

| | · 2k k2 | | · 2 Õ 2  k k2 T αc U F 6 T min σ 6 σ U F , i T i i ∈ i d ∈[ ] | | b 1 c b 1 c so that T 6 αc2 . Thus, one of the top αc2 right singular vectors a has p correlation |hvˆ, ai| > (1 − α)c. The same proof holds for the b.

2 1 ( + ) ( − ) 2 1 Furthermore, if c > 2 1 η for some η > 0, and 1 α c > 2 , then by h ˆ i2 h ˆ i2 ˆ (A.3.1) it must be that maxi T v, ai  maxi d v, ai , as v cannot have square ∈ ∈[ ] 1  η correlation larger than 2 with more than one left singular vector. Taking α 1+η guarantees this. The conclusion follows. 

A.3.2 Concentration tools

We require a number of tools from the literature on concentration of measure.

For scalar-valued polynomials of Gaussians

We need the some concentration bounds for certain polynomials of Gaussian random variables.

The following lemma gives standard bounds on the tails of a standard gaussian variable—somewhat more precisely than other bounds in this paper. Though there are ample sources, we repeat the proof here for reference.

Lemma A.3.4. Let X ∼ N(0, 1). Then for t > 0,

2 e t 2  (X > t) 6 √− / , t 2π

312 and 2 e t 2 1 1   (X > t) > √− / · − . 2π t t3

Proof. To show the first statement, we apply an integration trick, ¹ 1 ∞ x2 2  (X > t)  √ e− / dx 2π t ¹ 1 ∞ x x2 2 6 √ e− / dx 2π t t 2 e t 2  √− / , t 2π x where in the third step we have used the fact that t 6 x for t > x. For the second statement, we integrate by parts and repeat the trick, ¹ 1 ∞ x2 2  (X > t)  √ e− / dx 2π t ¹ 1 ∞ 1 x2 2  √ · xe− / dx 2π t x   ¹ 1 1 2 ∞ 1 ∞ 1 2  √ − e x 2· − √ · e x 2dx − / 2 − / 2π x t 2π t x   ¹ 1 1 2 ∞ 1 ∞ x 2 √ − e x 2· − √ · e x 2dx > − / 3 − / 2π x t 2π t t   1 1 1 t2 2  √ − e− / . 2π t t3 This concludes the proof. 

The following is a small modification of Theorem 6.7 from [?] which follows from Remark 6.8 in the same.

Lemma A.3.5. For each ` > 1 there is a universal constant c` > 0 such that for every f a degree-` polynomial of standard Gaussian random variables X1,..., Xm and t > 2,

c t2 ` (| f (X)| > t ¾ | f (X)|) 6 e− ` / .

2 1 2 The same holds (with a different constant c`) if ¾ | f (x)| is replaced by (¾ f (x) ) / .

313 In our concentration results, we will need to calculate the expectations of multivariate Gaussian polynomials, many of which share a common form. Below we give an expression for these expectations.

Fact A.3.6. Let x be a d-dimensional vector with independent identically distributed gaussian entries with variance σ2. Let u be a fixed unit vector. Then setting X  (kxk2 − c)p kxk2m xxT, and setting U  (kxk2 − c)p kxk2m uuT, we have   Õ p k k 2 p+m k+1 ¾[X]  © (−1) c (d + 2)···(d + 2p + 2m − 2k)σ ( − )ª · Id, ­ k ® 06k6p « ¬ and   Õ p k k 2 p+m k T ¾[U]  © (−1) c d(d + 2)···(d + 2p + 2m − 2k − 2)σ ( − )ª · uu ­ k ® 06k6p « ¬

Proof.

¾[ ]  ¾[(k k2 − )p k k2m 2]· X x c x x1 Id p+m k    −  Õ p  Õ   Id · (−1)k ck ¾ © x2ª x2 k ­ i ® 1 06k6p  ` d  « ∈[ ] ¬    p+m k Í 2 − Since i d xi is symmetric in x1,..., xd, we have ∈[ ] p+m k+1    −  1 Õ p  Õ   Id · (−1)k ck ¾ © x2ª  d k ­ i ®  06k6p  i d  « ∈[ ] ¬  We have reduced the computation to a question of the moments of a Chi-squared variable with d degrees of freedom. Using these moments,   1 Õ p k k 2 p+m k+1  Id · (−1) c d(d + 2)···(d + 2p + 2m − 2k)σ ( − ) d k 06k6p   Õ p k k 2 p+m k+1  Id · © (−1) c (d + 2)···(d + 2p + 2m − 2k)σ ( − )ª . ­ k ® 06k6p « ¬ A similar computation yields the result about ¾[U]. 

314 For matrix-valued random variables

On several occasions we will need to apply a Matrix-Bernstein-like theorem to a sum of matrices with an unfortunate tail. To this end, we prove a “truncated

Matrix Bernstein Inequality.” Our proof uses an standard matrix Bernstein inequality as a black box. The study of inequalities of this variety—on tails of sums of independent matrix-valued random variables— was initiated by

Ahlswede and Winter [2]. The excellent survey of Tropp [77] provides many results of this kind.

In applications of the following the operator norms of the summands

X1,..., Xn have well-behaved tails and so the truncation is a routine formality. Two corollaries following the proposition and its proof capture truncation for all the matrices we encounter in the present work.

d1 d2 Proposition A.3.7 (Truncated Matrix Bernstein). Let X1,..., Xn ∈ ’ × be independent random matrices, and suppose that

h i k − [ ]k ∈ [ ]  Xi ¾ Xi op > β 6 p for all i n .

Furthermore, suppose that for each Xi,

  ¾[Xi] − ¾[Xi1 kXi kop < β ] 6 q.

Denote     2  Õ  T   T  Õ  T  T  σ  max ¾ Xi X − ¾ [Xi] ¾ X , ¾ X Xi − ¾ [Xi] ¾ [Xi] . i i i  i n i n   op op   ∈[ ] ∈[ ]   Í Then for X i n Xi, we have ∈[ ]  2    −(t − nq)  kX − ¾[X]kop > t 6 n · p + (d1 + d2)· exp . 2(σ2 + β(t − nq)/3)

315 Proof. For simplicity we start by centering the variables Xi. Let X˜i  Xi − ¾ Xi ˜  Í ˜ and X i n Xi The proof proceeds by a straightforward application of the ∈[ ] noncommutative Bernstein’s Inequality. We define variables Y1,..., Yn, which are the truncated counterparts of the X˜ is in the following sense:

 ˜ k ˜ k Xi Xi op < β, Yi   0 otherwise.   Í Define Y i n Yi. We claim that ∈[ ]

Õ Õ ¾ T − ¾[ ] ¾[ ]T ¾ ˜ ˜ T 2 YiYi Yi Yi 6 Xi Xi 6 σ and (A.3.2) i op i op

Õ Õ ¾ T − ¾[ ]T ¾[ ] ¾ ˜ T ˜ 2 Yi Yi Yi Yi 6 Xi Xi 6 σ , (A.3.3) i op i op which, together with the fact that kYi k 6 β almost surely, will allow us to apply the non-commutative Bernstein’s inequality to Y. To see (A.3.2)((A.3.3) is similar), ¾ T we expand YiYi as   T h i T ¾ YiY   X˜ i < β ¾ X˜ i X˜ X˜ i < β . i op i op ¾  ˜ ˜ T  Additionally expanding Xi Xi as      T  h i T h i T ¾ X˜ i X˜   X˜ i < β ¾ X˜ i X˜ X˜ i < β + X˜ i > β ¾ X˜ i X˜ X˜ i > β , i op i op op i op ¾[ ˜ ˜ T | ˜ ] ¾[ T]  ¾[ T] we note that Xi Xi Xi op > β is PSD. Thus, YiYi Xi Xi . But

¾[Y YT] Í ¾[Y YT] by definition i i is still PSD (and hence i i i op is given by the ¾[ T] maximum eigenvalue of YiYi ), so

Õ Õ ¾ T ¾ ˜ ˜ T YiYi 6 Xi Xi . i op i op

T T Also PSD are ¾[Yi] ¾[Yi] and ¾[(Yi − ¾[Yi])(Yi − ¾[Yi]) ]  ¾[ T] − ¾[ ] ¾[ ]T YiYi Yi Yi . By the same reasoning again, then, we get

316

Í ¾ Y YT − ¾[Y ] ¾[Y ]T Í ¾[Y YT] i i i i i op 6 i i i op. Putting this all together gives (A.3.2).

Now we are ready to apply the non-commutative Bernstein’s inequality to Y.

We have

 2    −α /2  kY − ¾[Y]kop > α 6 (d1 + d2)· exp . σ2 + β · α/3

Now, we have

     kX − ¾[X]kop > t   kX − ¾[X]kop > t | X  Y ·  [X  Y]   +  kX − ¾[X]kop > t | X , Y ·  [X , Y] ,   6  kX − ¾[X]kop > t | X  Y + n · p by a union bound over the events {Xi , Yi }. It remains to bound the conditional   probability  kX − ¾[X]kop > t | X  Y . By assumption, k ¾[X] − ¾[Y]kop 6 nq, and so by the triangle inequality,

kX − ¾[X]kop 6 kX − ¾[Y]kop + k ¾[Y] − ¾[X]kop 6 kX − ¾[Y]kop + nq.

Thus,

     kX − ¾[X]kop > t | X  Y 6  kX − ¾[Y]kop + nq > t | X  Y     kY − ¾[Y]kop > t − nq | X  Y .

Putting everything together and setting α  t − nq,

 −(t − nq)2/2  [kX − ¾[X]kop > t] 6 n · p + (d1 + d2)· exp , σ2 + β(t − nq)/3 as desired. 

The following lemma helps achieve the assumptions of Proposition A.3.7 easily for a useful class of thin-tailed random matrices.

317 Lemma A.3.8. Suppose that X is a matrix whose entries are polynomials of constant degree ` in unknowns x, which we evaluate at independent Gaussians. Let f (x) :

T kXkop and 1(x) : kXX kop, and either f is itself a polynomial in x of degree at most 2` or 1 is a polynomial in x of degree at most 4`. Then if β  R · α for q α > min{¾ | f (x)| , ¾ 1(x)} and R  polylog(n),

 log n  kXkop > β 6 n− , (A.3.4) and

  log n ¾ kX · 1{kXkop > β}kop 6 (β + α)n− . (A.3.5)

Proof. We begin with (A.3.4). Either f (x) is a polynomial of degree at most 2`, or

1(x) is a polynomial of degree at most 4` in gaussian variables. We can thus use Lemma A.3.5 to obtain the following bound,

  1 2`   | f (x)| > tα 6 exp −ct /( ) , (A.3.6) where c is a universal constant. Taking t  R  polylog(n) gives us (A.3.4).

We now address (A.3.5). To this end, let p(t) and P(t) be the probability density function and cumulative density function of kXkop, respectively. We apply Jensen’s inequality and instead bound ¹     ∞ k ¾ X1{kXkop > β} k 6 ¾ kXkop1{kXkop > β}  t · 1{t > β}p(t)dt 0 since the indicator is 0 for t 6 β, ¹ ∞  (−t)(−p(t))dt β integrating by parts, ¹ ∞ ∞  −t ·(1 − P(t)) + (1 − P(t))dt β β

318 and using the equality of 1 − P(t) with (kXkop > t) along with (A.3.4), ¹ log n ∞ 6 βn− + (kXkop > t)dt β

Applying the change of variables t  αs so as to apply (A.3.6), ¹ log n ∞  βn− + α (kXkop > αs)ds R ¹ log n ∞ 1 2` 6 βn− + α exp(−cs /( ))ds R

 ( u log n )2` Now applying a change of variables so s c ,

¹ log n 2`  log n + ∞ u · 2` 1 βn− α 1 2` n− 2` u − du cR /( ) c log n ¹ log n + ∞ u 2 6 βn− α 1 2` n− / du , cR /( ) log n where we have used the assumption that ` is constant. We can approximate this by a geometric sum,

log n Õ∞ u 2 6 βn− + α n− / cR1 2` u /( ) log n

log n cR1 2` 2 log n 6 βn− + α · n− /( )/( )

Evaluating at R  polylog n for a sufficiently large polynomial in the log gives us

  log n ¾ kX · 1{kXkop > β}kop 6 (β + α)n− , as desired. 

319 A.4 Concentration bounds for planted sparse vector in random

linear subspace

 Ín ( ) Proof of Lemma 3.4.7. Let c : i1 v i bi. The matrix in question has a nice block structure: n k k2 Õ v 2 c> ai a>  © ª . i ­ Ín ® i1 c  bi b> « i 1 i ¬ 1 The vector c is distributed as N(0, Idd 1) so by standard concentration has n − k k ˜ ( / )1 2 k k2  c 6 O d n / w.ov.p.. By assumption, v 2 1. Thus by triangle inequality w.ov.p. n  1 2 n Õ d / Õ ai a> − Idd 6 O˜ + bi b> − Idd 1 . i n i − i1 i1

By [78, Corollary 5.50] applied to the subgaussian vectors nbi, w.ov.p. n  1 2 Õ d / bi b> − Idd 1 6 O i − n i1 k Ín − k ˜ ( / )1 2 k(Ín ) 1 − and hence i1 ai a>i Idd 6 O d n / w.ov.p.. This implies i1 ai a>i − k ˜ ( / )1 2 k(Ín ) 1 2 − k ˜ ( / )1 2  ( ) Idd 6 O d n / and i1 ai a>i − / Idd 6 O d n / when d o n by the Ín following facts applied to the eigenvalues of i1 ai a>i . For 0 6 ε < 1,

1 1 (1 + ε)−  1 − O(ε) and (1 − ε)−  1 + O(ε) ,

1 2 1 2 (1 + ε)− /  1 − O(ε) and (1 − ε)− /  1 + O(ε) .

( + ) 1  Í k These are proved easily via the identity 1 ε − ∞k1 ε and similar. 

Orthogonal subspace basis

∈ ’d N( 1 ) Lemma A.4.1. Let a1,..., an be independent random vectors from 0, n Id  Ín ∈ ’d with d 6 n and let A i1 ai a>i . Then for every unit vector x , with

320 ω 1 overwhelming probability 1 − d− ( ), √  +  1 2 d n 2 hx, A− xi − kxk 6 O˜ · kxk . n

Proof. Let x ∈ ’d. By scale invariance, we may assume kxk  1.

By standard matrix concentration bounds, the matrix B  Id −A has spectral

1 2 1 1 norm kBk 6 O˜ (d/n) / w.ov.p. [78, Corollary 5.50]. Since A−  (Id −B)−  Í k 1 − − Í k kk ∞k0 B , the spectral norm of A− Id B is at most ∞k2 B (whenever the 1 series converges). Hence, kA− − Id −Bk 6 O˜ (d/n) w.ov.p..

1 2 It follows that it is enough to show that |hx, Bxi| 6 O˜ (1/n) / w.ov.p.. The √ − h i  Ín h · i2 2 random variable n n x, Bx i1 n ai , x is χ -distributed with n degrees √ of freedom. Thus, by standard concentration bounds, n|hx, Bxi| 6 O˜ ( n) w.ov.p. [?].

ω 1 We conclude that with overwhelming probability 1 − d− ( ), √  +  1 2 d n hx, A− xi − kxk 6 |hx, Bxi| + O˜ (d/n) 6 O˜ . n



∈ ’d N( 1 ) Lemma A.4.2. Let a1,..., an be independent random vectors from 0, n Id  Ín ∈ [ ] with d 6 n and let A i1 ai a>i . Then for every index i n , with overwhelming ω 1 probability 1 − d ( ), √  +  1 2 d n 2 ha j , A− a ji − ka j k 6 O˜ · ka j k . n

 Í Proof. Let A j i,j ai a>. By Sherman–Morrison, − i

1 1 1 1 1 1 A−  (A j + a j a>)−  A− − A− a j a>A− − j j + 1 j j j − 1 a>j A−j a j − − −

321 h 1 i  h 1 i − h 1 i2/( + h 1 i) k n − Thus, a j , A− a j a j , A−j a j a j , A−j a j 1 a j , A−j a j . Since n 1 A j − − − − − k  ˜ ( / )1 2 k 1k Id O d n / w.ov.p., we also have A−j 6 2 with overwhelming probability. − Therefore, w.ov.p.,

h 1 i − h 1 i h 1 i2 k k4 ˜ ( / ) · k k2 a j , A− a j a j , A−j a j 6 a j , A−j a j 6 4 a j 6 O d n a j . − − At the same time, by Lemma A.4.1, w.ov.p., √  +  h n 1 i − k k2 ˜ d n · k k2 a j , n 1 A−j a j a j 6 O a j . − − n We conclude that, w.ov.p.,

h 1 i − k k2 h 1 i − h 1 i + h 1 i − n 1 k k2 + 1 k k2 a j , A− a j a j 6 a j , A− a j a j , A−j a j a j , A−j a j n− a j n a j √ − −  d + n  6 O˜ . n 

Lemma A.4.3. Let A be a block matrix where one of the diagonal blocks is the 1 × 1 identity; that is, kvk2 c 1 c  © > ª  © > ª A ­ ® ­ ® . c B c B « ¬ « ¬ for some matrix B and vector c. Let x be a vector which decomposes as x  (x(1) x0) where x(1)  hx, e1i for e1 the first standard basis vector.

Then  B 1cc B 1   B 1cc B 1  hx, A 1xi  hx , B 1 + − > − x i+2x(1)h B 1 + − > − c, x i+(1−c B 1c) 1x(1)2 . − 0 − − 1 0 − − 1 0 > − − 1 c>B− c 1 c>B− c

Proof. By the formula for block matrix inverses,

(1 − c B 1c) 1 cT(B − cc ) 1 1 > − − > − A−  © ª . ­ 1 1 ® (B − cc>)− c (B − cc>)− « ¬ 1 The result follows by Sherman-Morrison applied to (B − cc>)− and the definition of x. 

322 n d 1 Lemma A.4.4. Let v ∈ ’ be a unit vector and let b1,..., bn ∈ ’ − have iid entries N( / ) ∈ ’d  ( ( ) )  Í T from 0, 1 n . Let ai be given by ai : v i bi . Let A : i ai ai . Let ∈ d 1  Í ( ) ∈ [ ] c ’ − be given by c : i v i bi. Then for every index i n , w.ov.p., √  +  1 2 d n 2 hai , A− aii − kai k 6 O˜ · kai k . n

 Í T k 1 − k ˜ ( / )1 2 Proof. Let B : i bi bi . By standard concentration, B− Id 6 O d n / w.ov.p. [78, Corollary 5.50]. At the same time, since v has unit norm, the entries of c are iid samples from N(0, 1/n), and hence nkck2 is χ2-distributed with d k k2 d + ˜ ( ) 1 2 degrees of freedom. Thus w.ov.p. c 6 n O dn − / . Together these imply the following useful estimates, all of which hold w.ov.p.:

 3 2 d d / |c B 1c| 6 kck2kB 1k 6 + O˜ > − − op n n  3 2 d d / kB 1cc B 1k 6 kck2kB 1k2 6 + O˜ − > − op − op n n 1 1  3 2 B− cc>B− d d / 6 + O˜ , − 1 1 c>B− c op n n where the first two use Cauchy-Schwarz and the last follows from the first two.

1 We turn now to the expansion of hai , A− aii offered by Lemma A.4.3,

 B 1cc B 1  ha , A 1a i hb , B 1 + − > − b i (A.4.1) i − i i − − 1 i 1 c>B− c  B 1cc B 1  + 2v(i)h B 1 + − > − c, b i (A.4.2) − − 1 i 1 c>B− c 1 1 2 + (1 − c>B− c)− v(i) . (A.4.3)

Addressing (A.4.1) first, by the above estimates and Lemma A.4.2 applied to

1 hbi , B− bii, √  1 1   +  B− cc>B− d n hb , B 1 + b i − kb k2 6 O˜ · kb k2 i − − 1 i i i 1 c>B− c n

323 w.ov.p.. For (A.4.2), we pull out the important factor of kck and separate v(i) from bi: w.ov.p.,  1 1   1 1  B− cc>B− B− cc>B− c 2v(i)h B 1 + c, b i  2kckv(i)h B 1 + , b i − − 1 i − − 1 k k i 1 c>B− c 1 c>B− c c   1 1   B− cc>B− c 6 kck2 v(i)2 + h B 1 + , b i2 − − 1 k k i 1 c>B− c c  d  6 O˜ (v(i)2 + kb k2) n i  d   O˜ ka k2 , n i where the last inequality follows from our estimates above and Cauchy-Schwarz.

1 Finally, for (A.4.3), since (1 − c>B− c) > 1 − O˜ (d/n) w.ov.p., we have that  d  |(1 − c B 1c) 1v(i)2 − v(i)2| 6 O˜ v(i)2 . > − − n

Putting it all together,  1 1  B− cc>B− ha , A 1a i − ka k2 6 hb , B 1 + b i − kb k2 i − i i i − − 1 i i 1 c>B− c  1 1  B− cc>B− + 2v(i)h B 1 + c, b i − − 1 i 1 c>B− c 1 1 2 2 + |(1 − c>B− c)− v(i) − v(i) | √  d + n  6 O˜ · ka k2 .  n i

A.5 Concentration bounds for overcomplete tensor decomposi-

tion

We require some facts about the concentration of certain scalar-and matrix-valued random variables, which generally follow from standard concentration arguments.

We present proofs here for completeness.

324 The first lemma captures standard facts about random Gaussians.

∈ ’d ∼ N( 1 ) Fact A.5.1. Let a1,..., an be sampled ai 0, d Id .

√ 1. Inner products |hai , a ji| are all ≈ 1/ d:     2 1 ω 1  hai , a ji 6 O˜ i, j ∈ [n], i , j > 1 − n− ( ). d ∀ √ 2. Norms are all about kai k ≈ 1 ± O˜ (1/ d):  √ √  2 ω 1  1 − O˜ (1/ d) 6 kai k 6 1 + O˜ (1/ d) i ∈ [n] > 1 − n− ( ) . 2 ∀

3. Fix a vector v ∈ ’d. Suppose 1 ∈ ’d is a vector with entries identically distributed 1 ∼ N( ) h1 i2 ≈ 2 · k k2 i 0, σ . Then , v σ v 2:   2 2 4 2 2 ω 1  h1, vi − σ · kvk 6 O˜ (σ · kvk ) > 1 − n− ( ) . 2 4

2 Proof of Fact A.5.1. We start with Item 1. Consider the quantity hai , a ji . We calculate the expectation,

 2 Õ   Õ  2  2 1 1 ¾ hai , a ji  ¾ ai(k)ai(`)a j(k)a j(`)  ¾ ai(k) ·¾ a j(k)  d·  . d2 d k,` d k d ∈[ ] ∈[ ] Since this is a degree-4 square polynomial in the entries of ai and a j, we may apply Lemma A.3.5 to conclude that

 1     ha , a i2 > t · 6 exp −O(t1 2) . i j d /

Applying this fact with t  polylog(n) and taking a union bound over pairs i, j ∈ [n] gives us the desired result.

k k2 Next is Item 2. Consider the quantity ai 2. We will apply Lemma A.3.5 in (k k2 − )2 order to obtain a tail bound for the value of the polynomial ai 2 1 . We have  1  ¾ (ka k2 − 1)2  O , i 2 d

325 and now applying Lemma A.3.5 with the square root of this expectation, we have

 2 1  log n  kai k − 1 > O˜ ( ) 6 n− . 2 √d

This gives both bounds for a single ai. The result now follows from taking a union bound over all i.

Moving on to Item 3, we view the expression f (1) : (h1, vi2 − σ2kvk2)2 as a polynomial in the gaussian entries of 1. The degree of f (1) is 4, and ¾[| (1)|]  4 · k k4 f 3σ v 4, and so we may apply Lemma A.3.5 to conclude that

 | (1)| · 4 · k k4 (− 1 2) f > t 3σ v 4 6 exp c4t / , and taking t  polylog(n) the conclusion follows. 

We also use the fact that the covariance matrix of a sum of sufficiently many gaussian outer products concentrates about its expectation.

d Fact A.5.2. Let a1,..., an ∈ ’ be vectors with iid gaussian entries such that ¾ k k2   Ω( ) E Í ai 2 1, and n d . Let be the event that the sum i n ai a>i is ∈[ ] n · close to d Id, that is

 Õ   Ω˜ ( / )· ˜ ( / )·  − ω 1 n d Id 6 ai a>i 6 O n d Id > 1 n− ( ) .  i n   ∈[ ] 

Proof of Fact A.5.2. We apply a truncated matrix bernstein inequality. For conve-  Í  nience, A : i n ai a>i and let Ai : ai a>i be a single summand. To begin, we ∈[ ] calculate the first and second moments of the summands,

1 ¾ [A ]  · Id i d  1  ¾ A A   O · Id . i >i d

326 ¾ [ ]  n · 2( )  n  So we have A d Id and σ A O d .

We now show that each summand is well-approximated by a truncated variable. To calculate the expected norm kAi kop, we observe that Ai is rank-1 and ¾ k k   ¾ k k2  thus Ai op ai 2 1. Applying Lemma A.3.8, we have

 log n  kAi kop > O˜ (1) 6 n− , and also

  log n ¾ kAi kop · 1{kAi kop > O˜ (1)} 6 n− .

Thus, applying the truncated matrix bernstein inequality from Proposi-

 1 2  2  ( n )  ˜ ( )  log n  log n  ˜ n / tion A.3.7 with σ O , β O 1 , p n− , q n− , and t O 1 2 , we d d / have that with overwhelming probability,

n  n1 2  A − · Id 6 O˜ / . 1 2 d op d / 

h1 2i We now show that among the terms of the polynomial , Tai⊗ , those that depend on a j with j , i have small magnitude. This polynomial appears in the proof that Mdiag has a noticeable spectral gap.

Lemma (Restatement of Lemma 3.5.6). Let a1,..., an be independently sampled N( 1 ) 1 N( )  Í ( ⊗ ) vectors from 0, d Idd , and let be sampled from 0, Idd . Let T i ai ai ai >. Then with overwhelming probability, for every j ∈ [n], √   4 n h1, T(a j ⊗ a j)i − h1, a jika j k 6 O˜ . d

Proof. Fixing ai and 1, the terms in the summation are independent, and we may apply a Bernstein inequality. A straightforward calculation shows that the

327 ˜ ( n ) · k1k2k k4 expectation of the sum is 0 and the variance is O d2 ai . Additionally, each summand is a polynomial in Gaussian variables, the square of which has ˜ ( 1 · k1k2k k4) expectation O d2 ai . Thus Lemma A.3.5 allows us to truncate each summand appropriately so as to employ Proposition A.3.7. An appropriate

2 2 choice of logarithmic factors and the concentration of k1k and kai k due to Fact A.5.1 gives the result for each i ∈ [n]. A union bound over each choice of i gives the final result. 

Finally, we prove that a matrix which appears in the expression for Msame has bounded norm w.ov.p.

N( 1 ) 1 ∼ N( ) Lemma A.5.3. Let a1,..., an be independent from 0, d Idd . Let 0, Idd . Fix j ∈ [n]. Then w.ov.p.

Õ 2 2 1 2 h1, aiikai k hai , a ji · ai a> 6 O˜ (n/d ) / . i i n ∈[i,j]

Proof. The proof proceeds by truncated matrix Bernstein, since the summands are independent for fixed 1, a j. For this we need to compute the variance:

2 Õ 2 6 2 Õ 2 σ  ¾h1, aii kai k hai , a ji · ai a> 6 O(1/d)· ¾ ai a> 6 O(1/d)· n/d 6 O(n/d ) . i i i n i n ∈[i,j] ∈[i,j]

The norm of each term in the sum is bounded by a constant-degree polynomial of Gaussians. Straightforward calculations show that in expectation each term is ( 1 h1 i) ( ) O d , ai in norm; w.ov.p. this is O σ . So Lemma A.3.5 applies to establish the hypothesis of truncated Bernstein Proposition A.3.7. In turn, Proposition A.3.7

328 yields that w.ov.p.

Õ 2 2 1 2 h1, aiikai k hai , a ji · ai a> 6 O˜ (σ)  O˜ (n/d ) / . i i n ∈[i,j] 

Proof of Fact A.5.4

Here we prove the following fact.

 ( ) 2  ( ) 2/k k4 Fact A.5.4. Let Σ ¾x 0,Id xx> ⊗ and let Σ˜ ¾x 0,Id xx> ⊗ x . Let ∼N( d) ∼N( d) Φ  Í 2 ∈ ’d2 Π ’d2 i ei⊗ and let sym be the projector to the symmetric subspace of (the 2 d span of vectors of the form x⊗ for x ∈ ’ ). Then

Σ  Π + ΦΦ Σ˜  2 Π + 1 ΦΦ 2 sym > , d2+2d sym d2+2d > , Σ+  1 Π − 1 ΦΦ Σ˜ +  d2+2d Π − d ΦΦ 2 sym 2 d+2 > , 2 sym 2 > . ( )

In particular, √  q   (Σ+)1 2  Π − 1 − 2 ΦΦ k k  R 2 / sym d 1 d+2 > has R 1 and for any v ∈ ’d, k ( ⊗ )k2  − 1  · k k4 R v v 2 1 d+2 v .

We will derive Fact A.5.4 as a corollary of a more general claim about rotationally symmetric distributions.

Lemma A.5.5. Let D be a distribution over ’d which is rotationally symmetric; that is,

2 for any rotation R, x ∼ D is distributed identically to Rx. Let Σ  ¾x (xx>)⊗ , let ∼D

329 Φ  Í 2 ∈ ’d2 Π ’d2 i ei⊗ and let sym be the projector to the symmetric subspace of (the 2 d span of vectors of the form x⊗ for x ∈ ’ ). Then there is a constant r so that

Σ  2r Πsym + r ΦΦ> .

Furthermore, r is given by

 ¾h i2h i2  1 ¾h i4 r x, a x, b 3 x, a where a, b are orthogonal unit vectors.

Proof. First, Σ is symmetric and operates nontrivially only on the symmetric subspace (in other words ker Πsym ⊆ ker Σ). This follows from Σ being an expec- tation over symmetric matrices whose kernels always contain the complement of the symmetric subspace.

ˆ ˆ Let aˆ, b, cˆ, d ∈ ’d be any four orthogonal unit vectors. Let R be any rotation ˆ ˆ of ’d that takes aˆ to −aˆ, but fixes b, cˆ, and d (this rotation exists for d > 5, but a different argument holds for d 6 4) . By rotational symmetry about R, all of these quantities are 0:

ˆ ˆ ¾haˆ, xihb, xihcˆ, xihd, xi  0,

ˆ ˆ ¾haˆ, xihb, xihcˆ, xi2  0, ¾haˆ, xihb, xi3  0.

√ ˆ Furthermore, let Q be a rotation of ’d that takes aˆ to (aˆ + b)/ 2. Then by rotational symmetry about Q,

¾h ˆ i4  ¾h ˆ i4  ¾ 1 h ˆ + ˆ i4  ¾ 1 [h ˆ i4 + hˆ i4 + h ˆ i2hˆ i2] a, x a, Qx 4 a b, x 4 a, x b, x 6 a, x b, x

ˆ Thus, since ¾haˆ, xi4  ¾hb, xi4 by rotational symmetry, we have

ˆ ¾haˆ, xi4  3 ¾haˆ, xi2hb, xi2.

330  ¾h ˆ i2hˆ i2  1 ¾h ˆ i4 So let r : a, x b, x 3 a, x . By rotational symmetry, r is constant ˆ over choice of orthogonal unit vectors aˆ and b.

2 Since Σ operates only on the symmetric subspace, let u ∈ ’d be any unit

d d vector in the symmetric subspace. Such a u unfolds to a symmetric matrix in ’ × ,  Íd ⊗ h i so that it has an eigendecomposition u i1 λi ui ui. Evaluating u, Σu , d Õ 2 2 hu, Σui  ¾ λi λ j hx, uii hx, u ji other terms are 0 by above i,j1 d  Õ 2 + Õ 3r λi r λi λ j i1 i,j d d !2  Õ 2 + Õ 2r λi r λi i1 i1 d !2 2 Õ  2r kuk + r λi Frobenious norm is sum of squared eigenvalues i1 !2 2 Õ  2r kuk + r ui,i trace is sum of eigenvalues i

 2r hu, Πsymui + r hu, ΦΦ>ui , so therefore Σ  2r Πsym + r ΦΦ>. 

2 2 Proof of Fact A.5.4. When x ∼ N(0, Idd), the expectation ¾hx, ai hx, bi  1 is just a product of independent standard Gaussian second moments. Therefore by

Lemma A.5.5, Σ  2 Πsym + ΦΦ>.

To find Σ˜ where x is uniformly distributed on the unit sphere, we compute Õ  ¾ k k4  ¾ 2 2  ¾ 4 + ( 2 − ) ¾ 2 2 1 x xi x j d x1 d d x1x2 i,j ¾ 4  ¾ 2 ¾ 2 2  1 and use the fact that x1 3 x1 (by Lemma A.5.5) to find that x1x2 d2+2d , Σ˜  2 Π + 1 ΦΦ and therefore by Lemma A.5.5, d2+2d sym d2+2d >.

331 + To verify the pseudoinverses, it is enough to check that MM  Πsym for each + matrix M and its claimed pseudoinverse M .

To show that k ( ⊗ )k2  − 1  · k k4 R v v 2 1 d+2 v ,

∈ ’d k ( ⊗ )k2  ( ⊗ ) 2( ⊗ ) for any v , we write R v v 2 v v >R v v and use the substitution 2 + 2 R  2Σ , along with the facts that Πsym(v ⊗ v)  v ⊗ v and hΦ, v ⊗ vi  kvk .



Now we can prove some concentration claims we deferred:

Lemma (Restatement of Lemma ??).

Proof of Lemma ??. We prove the first item:

Õ 2 2 Õ + 2 hu j , R uii  hu j , 2Σ uii i,j i,j Õ  h (Π − 1 ΦΦ ) i2 u j , sym d+2 > ui by Fact A.5.4 i,j  Õ(h i2 − 1 k k2k k2)2 a j , ai d+2 u j ui i,j Õ  O˜ (1/d)2 w.ov.p. by Fact A.5.1 i,j

 O˜ (n/d2) .

And one direction of the second item, using Fact A.5.4 and Fact A.5.1 (the other direction is similar): √ k k2  h 2 i  h (Π + 1 ΦΦ ) i  ( − Θ( / ))k k4  − ˜ ( / ) Ru j u j , R u j u j , sym d+2 > u j 1 1 d a j 1 O 1 d where the last equality holds w.ov.p.. 

332 Proof of Lemma ??

To prove Lemma ?? we will begin by reducing to the case S  [n] via the following.

d Lemma A.5.6. Let v1,..., vn ∈ ’ . Let AS have columns {vi }i S. Let ΠS be the ∈ { } k − k projector to Span vi i S. Suppose there is c > 0 so that A>n A n Idn 6 c. Then ∈ [ ] [ ] ⊆ [ ] k − Π k for every S n , ASA>S S 6 c

k − k Proof. If the hypothesized bound A>n A n Idn 6 c holds then for every [ ] [ ] ⊆ [ ] k − k S n we get A>AS Id S 6 c since A>AS is a principal submatrix of S | | S k − k A>n A n . If A>S AS Id S 6 c, then because ASA>S has the same nonzero [ ] [ ] | | k − Π k eigenvalues as A>S AS, we must have also ASA>S S 6 c. 

It will be convenient to reduce concentration for matrices involving ai ⊗ ai to analogous matrices where the vectors ai ⊗ ai are replaced by isotropic vectors of constant norm. The following lemma shows how to do this.

∼ N( 1 ) Σ˜  ¾ ( ) 2/k k4 Lemma A.5.7. Let a 0, d Idd . Let : x 0,Idd xx> ⊗ x . Then ∼N( ) + 1 2 2 u : (Σ˜ ) / a ⊗ a/kak is an isotropic random vector in the symmetric subspace p Span{y ⊗ y | y ∈ ’d } with kuk  dim Span{y ⊗ y | y ∈ ’d }.

Proof. The vector u is isotropic by definition so we prove the norm claim. Let Φ˜  Φ/kΦk. By Fact A.5.4,

Σ˜ +  d2+2d Π − d ΦΦ 2 sym 2 >

Thus,

k k2  h a a Σ˜ + a a i  d2+2d − d  d2+d  { ⊗ | ∈ ’d } u a⊗ 2 , a⊗ 2 2 2 2 dim Span y y y .  k k k k

333 The last ingredient to finish the spectral bound is a bound on the incoherence

+ 1 2 of independent samples from (Σ˜ ) / .

 ( ⊗ )/k k4 ∼ N( ) Lemma A.5.8. Let Σ˜ ¾a 0,Id aa> aa> a . Let a1,..., an 0, Idd be ∼N( d) + 1 2 2 independent, and let ui  (Σ˜ ) / (ai ⊗ ai)/kai k . Let d0  dim Span{y ⊗ y | y ∈ ’d }  1 ( 2 + ) 2 d d . Then

1 Õ 2 ¾ max hui , u ji 6 O˜ (n) . d0 i j,i

h i2 Σ˜ +  d2+2d Π − d ΦΦ Proof. Expanding ui , u j and using 2 sym 2 >, we get

2  2 a a 2  2 a ,a 2  h i2  d +2d h ai ai j j i − d  d +2d · i j − d ui , u j ⊗ 2 , ⊗ 2 h 2 i 2 2 ai a j 2 2 ai a j 2 k k k k k k k k

h i2/k k2k k2 ( / ) From elementary concentration, ¾ maxi,j ai , a j ai a j 6 O˜ 1 d , so the lemma follows by elementary manipulations. 

We need the following bound on the deviation from expectation of a tall matrix with independent columns.

Theorem A.5.9 (Theorem 5.62 in [78]). Let A be an N × n matrix (N > n) whose √ N columns Aj are independent isotropic random vectors in ’ with kAj k2  N almost surely. Consider the incoherence parameter

def 1 Õ 2 m  ¾ max hAi , Aji . N i n ∈[ ] j,i q ¾ k 1 T − k m log n Then N A A Id 6 C0 N .

We are now prepared to handle the case of S  [n] via spectral concentration for matrices with independent columns, Theorem A.5.9.

Lemma (Restatement of Lemma ??).

334 Proof of Lemma ??. By Lemma A.5.6 it is enough to prove the lemma in the case of S  [n]. For this we will use Theorem A.5.9. Let A be the matrix whose ⊗ columns are given by ai ai, so that P n  P  AA>. Because RAA>R and [ ] A>RRA have the same nonzero eigenvalues, it will be enough to show that √ 2 3 2 kA>R A − Id k 6 O˜ ( n/d) + O˜ (n/d / ) with probability 1 − o(1). (Since n 6 d √ 3 2 we have n/d  O˜ (n/d / ) so this gives the theorem.)

The columns of RA are independent, given by R(ai ⊗ ai). However, they do not quite satisfy the normalization conditions needed for Theorem A.5.9.

2 Let D be the whose i-th diagonal entry is kai k . Let  ( ) 2/k k4 ( +)1 2 1 Σ˜ ¾x 0,Id xx> ⊗ x . Then by Lemma A.5.7 the matrix Σ˜ / D− A ∼N( ) has independent columns from an isotropic distribution with a fixed norm d0. Together with Lemma A.5.8 this is enough to apply Theorem A.5.9 to con- √ ¾ k 1 1Σ˜ + 1 − k ˜ ( / ) clude that d 2 A>D− D− A Id 6 O n d . By Markov’s inequality, ( 0) √ k 1 1Σ˜ + 1 − k ˜ ( / ) − ( ) d 2 A>D− D− A Id 6 O n d with probability 1 o 1 . ( 0)

k 2 − 1 1Σ˜ + 1 k ˜ ( / 3 2) We will show next that A>R A d 2 A>D− D− A 6 O n d / with ( 0) probability 1− o(1); the lemma then follows by triangle inequality. The expression inside the norm expands as

( 2 − 1 1Σ˜ + 1) A> R d 2 D− D− A . ( 0) and so

k 2 − 1 1Σ˜ + 1 k k k2k 2 − 1 1Σ˜ + 1k A>R A d 2 A>D− D− A 6 A R d 2 D− D− ( 0) ( 0) √ By Fact A.5.1, with overwhelming probability kD − Id k 6 O˜ (1/ d). So √ 2 1 + 1 2 + k(1/d0) D− Σ˜ D− − (1/d0) Σ˜ k 6 O˜ (1/ d) w.ov.p.. We recall from Fact A.5.4, √ + 1 2 given that R  2 ·(Σ ) / , that

2  Π − 1 ΦΦ 1 Σ˜ +  d+2 Π − 1 ΦΦ R sym d+2 > and d 2 d+1 sym d+1 > . ( 0)

335 2 2 + This implies that kR − (1/d0) Σ˜ k 6 O(1/d). Finally, by an easy application k k2  k Í ( ) 2k ˜ ( / ) of Proposition A.3.7, A i ai a>i ⊗ 6 O n d w.ov.p.. All together, k 2 − 1 1Σ˜ + 1 k ˜ ( / 3 2) A>R A d 2 A>D− D− A 6 O n d / .  ( 0)

A.6 Concentration bounds for tensor principal component anal-

ysis

For convenience, we restate Lemma 3.6.5 here.

Lemma A.6.1 (Restatement of Lemma 3.6.5). For any v, with high probability over A, the following occur:

Õ ( )· ( 3 2 2 ) Tr Ai Ai 6 O n / log n i

√ Õ ( )· ( ) v i Ai 6 O n log n i

√ Õ ( ) ( )· T ( ) Tr Ai v i vv 6 O n log n . i

Í ( )· Proof of Lemma 3.6.5. We begin with the term i Tr Ai Ai. It is a sum of iid matrices Tr(Ai)· Ai. A routine computation gives ¾ Tr(Ai)· Ai  Id. We will use the truncated matrix Bernstein’s inequality (Proposition A.3.7) to bound k Í ( ) k i Tr Ai Ai .

For notational convenience, let A be distributed like a generic Ai. By a union bound, we have both of the following:

   √   √   k Tr(A)· Ak > tn 6  | Tr(A)| > tn +  kAk > tn    √   √   k Tr(A)· A − Id k > (t + 1)n 6  | Tr(A)| > tn +  kAk > tn .

336 √ c1t Since Tr(A) the sum of iid Gaussians, (| Tr(A)| > tn) 6 e− for some constant c1. Similarly, since the maximum eigenvalue of a matrix with iid entries √ c2t has a subgaussian tail, (kAk > tn) 6 e− for some c2. All together, for some

c3t c3t c3, we get (k Tr(A)· Ak > tn) 6 e− and (k Tr(A)· A − Id k > (t + 1)n) 6 e− .

For a positive parameter β, let 1β be the indicator variable for the event k Tr(A)· Ak 6 β. Then ¹ k ( )· k − k ( )· k  ∞  (k · k ) − (k · k ) ¾ Tr A A ¾ Tr A A 1⠐ Tr A A > s  Tr A A 1β > s s. 0 ¹  (k · k ) + ∞ (k · k ) ⠐ Tr A A > ⠐ Tr A A > s s. β ¹ ∞ c3β n + (k · k ) 6 βe− /  Tr A A > s s. β ¹ ∞  c3β n + (k · k ) βe− /  Tr A A > tn nt. β n ¹ / ∞ c3β n + c3t 6 βe− / ne− t. β n / c3β n n c3β n  βe− / + e− / . c3

Thus, for some β  O(n log n) we may take the parameters p, q of Proposition A.3.7

150 2 to be O(n− ). The only thing that remains is to bound the parameter σ . Since (¾ Tr(A)· A)2  Id, it is enough just to bound k ¾ Tr(A)2AAT k. We use again a union bound: √ √ 2 T 2 1 4 1 4 (k Tr(A) AA k > tn ) 6 (| Tr(A)| > t / n) + (kAk > t / n) .

By a similar argument as before, using the Gaussian tails of Tr A and kAk, we get

2 T 2 c4√t (k Tr(A) AA k > tn ) 6 e− . Then starting out with the triangle inequality,

σ2  kn · ¾ Tr(A)2AAT k

6 n · ¾ k Tr(A)2AAT k ¹  · ∞ ( ( )2 T ) n  Tr A AA > s s. 0

337 ¹  · ∞ ( ( )2 T 2) 2 n  Tr A AA > tn n t. 0 ¹ ∞ · c4√t 2 6 n e− n t. 0 " √ # t n2(c t + ) ∞ 2 4 1 c4√t  n · − e− c2 4 t0 6 O(n3) .

This gives that with high probability,

Õ ( )· ( 3 2 2 ) Tr Ai Ai 6 O n / log n . i

Í ( )· The other matrices are easier. First of all, we note that the matrix i v i Ai has independent standard Gaussian entries, so it is standard that with high √ k Í ( )· k ( ) probability i v i Ai 6 O n log n . Second, we have

Õ T T Õ v(i) Tr(Ai)vv  vv v(i) Tr(Ai). i i

The random variable Tr(Ai) is a centered Gaussian with variance n, and since Í ( ) ( ) v is a unit vector, i v i Tr Ai is also a centered Gaussian with variance n. So with high probability we get

√ T Õ ( ) ( )  Õ ( ) ( ) ( ) vv v i Tr Ai v i Tr Ai 6 O n log n i i by standard estimates. This completes the proof. 

338