TENSOR RANK DECOMPOSITIONS VIA THE PSEUDO-MOMENT METHOD
A Dissertation Presented to the Faculty of the Graduate School
of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
by Jonathan Shi
December 2019

© 2019 Jonathan Shi
ALL RIGHTS RESERVED

TENSOR RANK DECOMPOSITIONS VIA THE PSEUDO-MOMENT METHOD
Jonathan Shi, Ph.D.
Cornell University 2019
Over a series of four articles and an introduction, a “method of pseudo-moments” is developed which gives polynomial-time tensor rank decompositions for a variety of tensor component models. The algorithms given fall into two general classes: those relying on convex optimization, which develop the theory of polynomial-time algorithms, and those constructed through spectral and matrix polynomial methods, which illustrate the possibility of realizing the aforementioned polynomial-time algorithmic ideas in runtimes practical for real-life inputs. Tensor component models covered include those featuring random, worst-case, generic, or smoothed inputs, as well as cases featuring overcomplete and undercomplete rank. All models are capable of tolerating substantial noise in tensor inputs, measured in spectral norm.

BIOGRAPHICAL SKETCH
Jonathan Shi was raised in Seattle, Washington, and received a Bachelor of Science from the University of Washington in 2013 in the major concentrations of Computer Science, Mathematics, and Physics. He has been working at Cornell University under the guidance of Professor David Steurer and will soon join Bocconi University in Milan as a Research Fellow.
This dissertation is dedicated to my mother and father, Richard C-J Shi and Tracey H. Luo, who made this work possible with their perseverance, and in the memory of Mary Kaitlynne Richardson, whose kindness would have driven her to greatness.
ACKNOWLEDGEMENTS
I am grateful for the support and mentorship of my advisor David Steurer, who oversaw my development of skill and knowledge in this thesis topic from the starting scraps I had, as well as Professor Robert Kleinberg in his role as a Director of Graduate Studies, committee member, and all-around pretty cool person. I would like to ensure the acknowledgement of those in the student community who worked (in part in student organizations) to create a welcoming, supportive, and inclusive community, as well as those who labor to keep the Computer Science Field at Cornell running. I would not have been able to complete this work without the aid of Eve Abrams, LCSW-R, Robert Mendola, MD, Clint Wattenberg, MS RD, and Edward Koppel, MD in improving my health and directing me to needed resources. I acknowledge support from a Cornell University Fellowship as well as the NSF via my advisor’s NSF CAREER Grant CCF-1350196 over the duration of my graduate program.
CONTENTS
Biographical Sketch...... iii Dedication...... iv Acknowledgements...... v Contents...... vi List of Tables...... x List of Figures...... xi
1 Introduction 1 1.0.1 Results stated...... 3 1.0.2 Overview of methods...... 7 1.0.3 Frontier of what is possible...... 13 1.0.4 Organization...... 15
2 Spiked tensor model 17 2.1 Introduction...... 17 2.1.1 Results...... 20 2.1.2 Techniques...... 23 2.1.3 Related Work...... 27 2.2 Preliminaries...... 29 2.2.1 Notation...... 29 2.2.2 Polynomials and Matrices...... 30 2.2.3 The Sum of Squares (SoS) Algorithm...... 31 2.3 Certifying Bounds on Random Polynomials...... 32 2.4 Polynomial-Time Recovery via Sum of Squares...... 35 2.4.1 Semi-Random Tensor PCA...... 39 2.5 Linear Time Recovery via Further Relaxation...... 40 2.5.1 The Spectral SoS Relaxation...... 41 2.5.2 Recovery via the ∑_i T_i ⊗ T_i Spectral SoS Solution...... 44 2.5.3 Nearly-Linear-Time Recovery via Tensor Unfolding and Spectral SoS...... 49 2.5.4 Fast Recovery in the Semi-Random Model...... 52 2.5.5 Fast Recovery with Symmetric Noise...... 54 2.5.6 Numerical Simulations...... 58 2.6 Lower Bounds...... 59 2.6.1 Polynomials, Vectors, Matrices, and Symmetries, Redux. 62 2.6.2 Formal Statement of the Lower Bound...... 65 2.6.3 In-depth Preliminaries for Pseudo-Expectation Symmetries 68 2.6.4 Construction of Initial Pseudo-Distributions...... 71 2.6.5 Getting to the Unit Sphere...... 75 2.6.6 Repairing Almost-Pseudo-Distributions...... 79 2.6.7 Putting Everything Together...... 80 2.7 Higher-Order Tensors...... 87
vi 3 Spectral methods for the random overcomplete model 90 3.1 Introduction...... 90 3.1.1 Planted Sparse Vector in Random Linear Subspace.... 92 3.1.2 Overcomplete Tensor Decomposition...... 94 3.1.3 Tensor Principal Component Analysis...... 97 3.1.4 Related Work...... 98 3.2 Techniques...... 100 3.2.1 Planted Sparse Vector in Random Linear Subspace.... 104 3.2.2 Overcomplete Tensor Decomposition...... 107 3.2.3 Tensor Principal Component Analysis...... 111 3.3 Preliminaries...... 113 3.4 Planted Sparse Vector in Random Linear Subspace...... 115 3.4.1 Algorithm Succeeds on Good Basis...... 119 3.5 Overcomplete Tensor Decomposition...... 122 3.5.1 Proof of Theorem 3.5.3...... 125 3.5.2 Discussion of Full Algorithm...... 129 3.6 Tensor principal component analysis...... 132 3.6.1 Spiked tensor model...... 132 3.6.2 Linear-time algorithm...... 133
4 Polynomial lifts 136 4.1 Introduction...... 136 4.1.1 Results for tensor decomposition...... 140 4.1.2 Applications of tensor decomposition...... 146 4.1.3 Polynomial optimization with few global optima...... 148 4.2 Techniques...... 149 4.2.1 Rounding pseudo-distributions by matrix diagonalization 150 4.2.2 Overcomplete fourth-order tensor...... 154 4.2.3 Random overcomplete third-order tensor...... 157 4.3 Preliminaries...... 158 4.3.1 Pseudo-distributions...... 160 4.3.2 Sum of squares proofs...... 162 4.3.3 Matrix constraints and sum-of-squares proofs...... 164 4.4 Rounding pseudo-distributions...... 165 4.4.1 Rounding by matrix diagonalization...... 165 4.4.2 Improving accuracy of a found solution...... 169 4.5 Decomposition with sum-of-squares...... 171 4.5.1 General algorithm for tensor decomposition...... 172 4.5.2 Tensors with orthogonal components...... 175 4.5.3 Tensors with separated components...... 176 4.6 Spectral norms and tensor operations...... 179 4.6.1 Spectral norms and pseudo-distributions...... 179 4.6.2 Spectral norm of random contraction...... 181 4.7 Decomposition of random overcomplete 3-tensors...... 185
vii 4.8 Robust decomposition of overcomplete 4-tensors...... 189 4.8.1 Noiseless case...... 193 4.8.2 Noisy case...... 195 4.8.3 Condition number under smooth analysis...... 199 4.9 Tensor decomposition with general components...... 201 4.9.1 Improved rounding of pseudo-distributions...... 202 4.9.2 Finding all components...... 208 4.10 Fast orthogonal tensor decomposition without sum-of-squares.. 211
5 Overcomplete generic decomposition spectrally 216 5.1 Introduction...... 216 5.1.1 Our Results...... 221 5.1.2 Related works...... 224 5.2 Overview of algorithm...... 230 5.3 Preliminaries...... 235 5.4 Tools for analysis and implementation...... 238 5.4.1 Robustness and spectral perturbation...... 238 5.4.2 Efficient implementation and runtime analysis...... 239 5.5 Lifting...... 241 5.5.1 Algebraic identifiability argument...... 245 5.5.2 Robustness arguments...... 247 5.6 Rounding...... 250 5.6.1 Recovering candidate whitened and squared components 251 5.6.2 Extracting components from the whitened squares.... 255 5.6.3 Testing candidate components...... 257 5.6.4 Putting things together...... 259 5.6.5 Cleaning...... 261 5.7 Combining lift and round for final algorithm...... 262 5.8 Condition number of random tensors...... 266 5.8.1 Notation...... 272 5.8.2 Fourth Moment Identities...... 273 5.8.3 Matrix Product Identities...... 274 5.8.4 Naive Spectral Norm Estimate...... 277 5.8.5 Off-Diagonal Second Moment Estimates...... 278 5.8.6 Matrix Decoupling...... 280 5.8.7 Putting It Together...... 281 5.8.8 Omitted Proofs...... 283
Bibliography 286
A Spiked tensor model 294 A.1 Pseudo-Distribution Facts...... 294 A.2 Concentration bounds...... 297 A.2.1 Elementary Random Matrix Review...... 297
A.2.2 Concentration for ∑_i A_i ⊗ A_i and Related Ensembles... 298 A.2.3 Concentration for Spectral SoS Analyses...... 302 A.2.4 Concentration for Lower Bounds...... 303 A.3 Spectral methods for the random overcomplete model...... 308 A.3.1 Linear algebra...... 308 A.3.2 Concentration tools...... 312 A.4 Concentration bounds for planted sparse vector in random linear subspace...... 320 A.5 Concentration bounds for overcomplete tensor decomposition.. 324 A.6 Concentration bounds for tensor principal component analysis.. 336
LIST OF TABLES
3.1 Comparison of algorithms for the planted sparse vector problem with ambient dimension n, subspace dimension d, and relative sparsity ε...... 94 3.2 Comparison of decomposition algorithms for overcomplete 3- tensors of rank n in dimension d...... 96 3.3 Comparison of principal component analysis algorithms for 3- tensors in dimension d and with signal-to-noise ratio τ...... 98
5.1 A comparison of tensor decomposition algorithms for rank-n 4-tensors in (ℝ^d)^{⊗4}. Here ω denotes the matrix multiplication constant. A robustness bound ‖E‖ ≤ η refers to the requirement that a d² × d² reshaping of the error tensor E have spectral norm at most η. Some of the algorithms’ guarantees involve a tradeoff between robustness, runtime, and assumptions; where this is the case, we have chosen one representative setting of parameters. See ?? for details. Above, “random” indicates that the algorithm assumes a_1, ..., a_n are independent unit vectors (or Gaussians) and “algebraic” indicates that the algorithm assumes that the vectors avoid an algebraic set of measure 0...... 225
LIST OF FIGURES
1.1 An example of the shortcomings of purely linear methods. Multicomponent phenomena whose components occur in non-orthogonal directions cannot be recovered by principal component analysis...... 2
2.1 Numerical simulation of Algorithm 2.5.1 (“Nearly-optimal spectral SoS” implemented with matrix power method), and two implementations of Algorithm 2.5.7 (“Accelerated power method”/“Nearly-linear tensor unfolding” and “Naive power method”/“Naive tensor unfolding”). Simulations were run in Julia on a Dell Optiplex 7010 running Ubuntu 12.04 with two Intel Core i7 3770 processors at 3.40 GHz and 16 GB of RAM. Plots created with Gadfly. Error bars denote 95% confidence intervals. Matrix-vector multiply experiments were conducted with n = 200. Reported matrix-vector multiply counts are the average of 50 independent trials. Reported times are in CPU-seconds and are the average of 10 independent trials. Note that both axes in the right-hand plot are log scaled...... 59
CHAPTER 1
INTRODUCTION
Linear algebraic tools are central in modern data analysis and statistics, machine learning, and scientific computing applications. Matrices are excellent models for linear relationships between two vectors, or, in a closely related way, for bilinear or quadratic functions of vectors. “Principal Component Analysis” is in essence a synonym for eigenvalue decomposition on such linear relationships, isolating a set of “most important” directions in a particular set of data, effectively modeling the data as a linearly additive superposition of independent contributions from those orthogonal “most important” directions.
This linear toolset permanently reshaped the landscape of the methods we use to understand the world. But while linear relationships have proven potent as models of the world, phenomena we seek to understand may be more complicated.
A natural extension of linear models would be those captured by low-degree polynomials over vectors. In general, these low-degree polynomials are represented by tensors, and they are capable of modeling document topics
[10, 7], an analytic notion of sparsity [17], and statistical mixtures of signals [46, 29, 55], among other phenomena [1, 22, 45]. A well performing and robust toolkit for tensors may again reshape our empirical methods, just as matrix operations are foundational for many of today’s methods. One might even say "tensors are the new matrices" [74].
In recent years, the fields of algorithms and complexity theory have discovered a uniquely powerful general approach—the sum-of-squares meta-algorithm [71, 64, 61, 54]—which has achieved new qualitative improvements in many families of problems involving higher-order relationships between variables or approximation problems involving some degree of insensitivity—or robustness—to small perturbations in the problem input [21, 14].
Figure 1.1: An example of the shortcomings of purely linear methods. Multicomponent phenomena whose components occur in non-orthogonal directions cannot be recovered by principal component analysis.
The sum-of-squares method is based on the concept of sum-of-squares proofs—or
Positivstellensatz proofs [73]—a proof system that derives polynomial inequalities that are true over the real numbers, allowing the use of addition, multiplication, and composition of inequalities, along with the simple fact that every square polynomial is non-negative. This proof system is tractable in the sense that for any fixed degree d, any proof that contains only polynomials of degree at most d can be found, in time polynomial in the problem size, if it exists, or a counterexample found if such a proof does not exist.
Said counterexample would take the form of a degree-d pseudo-distribution—an object which may be represented as a generalization of a probability density function, in which not all probabilities are necessarily non-negative, but for which it is certifiably true that the expectation of any square polynomial up to degree d is non-negative. Such an object, if it violates a given polynomial inequality
in expectation, ensures that there is no degree-d sum-of-squares proof for that inequality.
This sum-of-squares method boasts features making it particularly compelling for application to data-driven tensor problems: first and most essentially, it is a framework built fundamentally around polynomials, and certifying facts about polynomials—a good fit for symmetric tensors, since they are representations of polynomials. Secondly, the success of sum-of-squares in robustly handling perturbations in input bodes well for its application to data.
The remainder of this document will cover recent progress in this program of applying sum-of-squares to tensor problems: most specifically tensor rank decomposition, a stronger higher-order analog of matrix low-rank approximation or principal component analysis.
1.0.1 Results stated
We will briefly review some notation in order to state the algorithmic results presented in this thesis.
A k-tensor may be conceived of as a multilinear map from k vectors in ℝ^d to ℝ. Special cases include linear forms (or vectors) as 1-tensors and bilinear forms (or matrices) as 2-tensors. In general a k-tensor may be represented as a k-dimensional array of real numbers. Symmetric k-tensors are those that are invariant with respect to interchange of their k parameters, and may additionally be conceived of as representations of degree-k polynomials by restricting the input to k copies of the same vector. A rank-1 symmetric k-tensor represents a scalar multiple of the kth power of some linear functional v over ℝ^d, and may be expressed as cv^{⊗k} := cv ⊗ v ⊗ ··· ⊗ v, where ⊗ is the tensor product and c ∈ ℝ. The symmetric rank (henceforth just “rank”) of a symmetric tensor T is the minimum number r such that there are r rank-one symmetric tensors T_1, ..., T_r satisfying T = ∑_{i=1}^{r} T_i.
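As a concrete illustration of these definitions, the short numpy sketch below (my own illustration; the variable names are not from the thesis) builds a rank-one symmetric 3-tensor and a symmetric tensor of rank at most r, and checks the symmetry property.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# A rank-1 symmetric 3-tensor: c * v^{(tensor)3} for a unit vector v and a scalar c.
v = rng.normal(size=d)
v /= np.linalg.norm(v)
c = 2.0
rank_one = c * np.einsum('i,j,k->ijk', v, v, v)

# A symmetric tensor of rank at most r is a sum of r such rank-one terms.
r = 3
components = [rng.normal(size=d) for _ in range(r)]
T = sum(np.einsum('i,j,k->ijk', a, a, a) for a in components)

# Symmetry: the tensor is invariant under permuting its index positions.
assert np.allclose(T, T.transpose(1, 0, 2))
assert np.allclose(T, T.transpose(2, 1, 0))
```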
In the following results, we will allow for an error term E and assume the input tensor T is close to a low-rank tensor, up to some small error
T = ∑_{i=1}^{n} a_i^{⊗k} + E.
We assume ‖a_i‖ = 1 for all i for convenience and because the most difficult case will be when all components are of equal magnitude for the orthogonal, random, and separated components models. We may quantify the size of the error ‖E‖ in different ways: usually in this document we will focus on the spectral error ‖E‖ := ‖E‖_spectral, defined as the spectral norm of one of the squarest matrix reshapings of E, as a d^{⌊k/2⌋} × d^{⌈k/2⌉} matrix, column-indexed by a tuple of the first ⌊k/2⌋ tensor indices of E and row-indexed by a tuple of the remaining ⌈k/2⌉ tensor indices. The reason for the focus on the spectral error is that (a symmetrized version of) this form of error arises as cross-terms of data components with each other or with independent noise, especially when mixed or noisy observations are tensor powered as moment estimators, for example in dictionary learning applications [17, 58, 68].
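The spectral error is straightforward to compute in practice. The sketch below (my own illustration, with hypothetical helper names) measures it through the squarest matrix reshaping just described; the norm is unchanged if the roles of rows and columns are swapped.

```python
import numpy as np

def spectral_error(E):
    """Spectral norm of the squarest matrix reshaping of the k-tensor E,
    i.e. of its d^{floor(k/2)} x d^{ceil(k/2)} unfolding."""
    k = E.ndim
    d = E.shape[0]
    M = E.reshape(d ** (k // 2), -1)   # remaining ceil(k/2) modes form the other axis
    return np.linalg.norm(M, ord=2)    # largest singular value

rng = np.random.default_rng(1)
d = 25
E = rng.normal(size=(d, d, d))         # iid standard Gaussian noise 3-tensor
print(spectral_error(E))               # roughly d + sqrt(d) for this scaling
```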
For this summary, “recover a component” will mean that we will find, with probability at least Ω(1), among the output of the algorithm a unit vector whose absolute value of the dot product with that normalized component (the absolute cosine similarity between those two vectors) is at least some constant. From there, one may attempt locally converging optimization techniques to refine the recovered components. Alternatively, “precisely recover a component” will refer to finding a vector in the output whose absolute cosine similarity with the component is at least 1 − O(‖E‖), again with probability at least Ω(1). And “recover a component up to error f ” will mean that the absolute cosine similarity is at least 1 − f with the same probability bound.
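A small helper makes these recovery notions concrete (a minimal sketch; the function names are mine, not the thesis's).

```python
import numpy as np

def abs_cosine_similarity(u, a):
    """|<u, a>| / (||u|| ||a||), the recovery measure used in this summary."""
    return abs(np.dot(u, a)) / (np.linalg.norm(u) * np.linalg.norm(a))

def recovers(outputs, component, threshold):
    """'Recover': threshold is some constant; 'precisely recover':
    threshold = 1 - O(||E||); 'recover up to error f': threshold = 1 - f."""
    return any(abs_cosine_similarity(u, component) >= threshold for u in outputs)
```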
Theorem 1.0.1 (Single-spike model with Gaussian noise). Suppose we are given the tensor T = ∑_{i=1}^{n} a_i^{⊗k} + E with a_i ∈ ℝ^d and n = 1. Suppose E is sampled according to a distribution where each entry is an independent Gaussian with mean 0 and variance ετ^{-1} (asymmetric noise), or where the entry E_{(i,j,k)} is an independent Gaussian with mean 0 and variance ετ^{-1} when i ≤ j ≤ k or set equal to E_{(min(i,j,k), median(i,j,k), max(i,j,k))} otherwise (symmetric noise). Then we may recover the only component of T

• using semidefinite programming, up to error O(τ^{-1} d^{-k/4} log(d)^{-1/4}), in polynomial time, in both the symmetric and asymmetric cases.

• using a spectral approach on matrix polynomials, up to error O(τ^{-1} d^{-k/4} log(d)^{-1/4}), in time Õ(d^{2⌈(k+1)/2⌉}), in the asymmetric case.

• using a spectral approach on matrix polynomials, up to error O(τ^{-1} d^{-k/4}), in time Õ(d^{k+1}), in both the symmetric and asymmetric cases.
Theorem 1.0.2 (Orthogonal model). Suppose we have T = ∑_{i=1}^{n} a_i^{⊗k} + E as described above, with a_i arbitrary mutually orthogonal unit vectors and ‖E‖ < 1. Then we may recover the components of T

• precisely, using semidefinite programming, in polynomial time.

• precisely, when k = 3, using a spectral algorithm, in time Õ(d^{1+ω}), where ω is the matrix multiplication time exponent.
Theorem 1.0.3 (Random components model). Suppose we have T = ∑_{i=1}^{n} a_i^{⊗k} + E as described above, with each a_i a random unit vector. Then we may recover the components of T

• using semidefinite programming, when k = 3, up to error poly(n/d^{3/2}) + O(‖E‖), in polynomial time.

• using spectral methods, when k = 3, up to error poly(n/d^{4/3}) + O(‖E‖), in time Õ(d^4).
Theorem 1.0.4 (Separated model). Suppose we have T = ∑_{i=1}^{n} a_i^{⊗k} + E as described above, with a_i unit vectors satisfying ‖∑_i a_i a_i^T‖ ≤ σ and max_{i≠j} |⟨a_i, a_j⟩| ≤ ρ for some σ ≥ 1 and ρ ∈ (0, 1) held constant. Suppose also that k ≥ O((1 + log σ)/log(1/ρ)). Then for any η ∈ (0, 1), we may recover the components of T using semidefinite programming, up to error O(η + ‖E‖), in polynomial time.
Theorem 1.0.5 (Generic/smoothed model). Suppose we have T = ∑_{i=1}^{n} a_i^{⊗k} + E as described above, with k = 4. Then when n ≤ d², there are symmetric matrix polynomials (let P be one of them) in a_1, ..., a_n, such that when a particular condition number (a rational function of eigenvalues) κ of P is not 0 (which is true almost everywhere) and ‖E‖ ≤ κ,

• we may recover the components of T, using semidefinite programming, precisely, in polynomial time.

• we may recover 0.99 of the components of T using spectral methods, up to error O(‖E‖/κ²)^{1/8}, in time Õ(n² d³).
We also provide a hardness lower bound for the single-spike model with Gaussian noise, showing that our algorithms are near-optimal for it among those that are based on rounding a degree-4 sum-of-squares relaxation. This is notable for being among the first examples of sum-of-squares lower bounds for degree larger than 2 among unsupervised learning problems.
Theorem 1.0.6 (Lower bound on tensor decomposition with Gaussian noise). Let us suppose the same setting as in Theorem 1.0.1, with k = 3 and asymmetric noise. There is a τ ≤ O(d^{3/4}/log(d)^{1/4}) such that with high probability, the sum-of-squares relaxation for the maximum likelihood estimator of a_1 has at least one solution that is formally independent of a_1 (in that the solution to the relaxation may be expressed as solely a function of the noise tensor E and not of a_1).
1.0.2 Overview of methods
The general framework of this body of work is that of using the method of moments with spectral techniques to compute roundings of pseudo-distributions.
The method of moments is a family of methods to derive the parameters of an observed distribution within a family of distributions, given the moments of the observed distribution. Since pseudo-distributions are defined by their pseudo-moments up to a specific degree, it is natural to apply the method of moments to a pseudo-distribution. And as pseudo-distributions are indistinguishable from actual distributions in their moments up to that specific degree, any proof of validity within the method of moments will carry over to pseudo-moments as well, as long as that proof makes no reference to moments higher than are valid in this pseudo-distribution.
We break this approach up into two broad steps: that of lifting, or generating a relaxed solution concept (such as a pseudo-distribution), and that of rounding, or extracting actual solutions from the relaxed solution concepts. Because the rounding step informs what kind of lifting is possible, we begin with the rounding.
Rounding:
Rounding in the method of moments is a special case of a combining algorithm: an algorithm that combines a probability distribution over solutions into a single (perhaps less optimal) solution. Specifically, the method of moments may be used as a combining algorithm when the parameters of the distribution are themselves the desired solutions. The technique of turning combining algorithms into rounding algorithms for the sum-of-squares hierarchy was first pioneered by Barak, Kelner, and Steurer in 2014 [16].
The methods used in the body of this thesis to extract solutions from pseudo-moments reduce to linear algebraic techniques. By reshaping a tensor in (ℝ^d)^{⊗k} to a matrix in ℝ^{d^{⌊k/2⌋} × d^{⌈k/2⌉}} in the various possible ways, we enable the use of powerful tools from the spectral theory of matrices.
One of the specific instantiations of these tools is by the combination of Gaussian rounding (also known as sampling from a Gaussian) and reweighing, which was also introduced by Barak, Kelner, and Steurer in 2014 [16]. We describe these techniques in two different languages: first in the language of probability distributions and then in the language of moment tensors. Gaussian rounding may be described as sampling from a Gaussian distribution with a covariance matrix matching that of the pseudo-distribution. When working with pseudo-distributions of degree larger than 2, this will generally occur after a series of random linear reweighings, taking inspiration from a reweighing by some polynomial function p in actual distributions wherein ℙ[x = a] becomes proportional to ℙ[x = a]p(a), and realized for general pseudo-distributions as changing Ẽ q(x) to Ẽ q(x)p(x) for any polynomial q. In a random linear reweighing, p is taken as a randomly generated linear function with Gaussian coefficients.
Equivalently, in the tensor view, a Gaussian rounding takes a degree-2 pseudo-distribution (so given by solely a mean vector and a covariance matrix), multiplies the PSD square root of the pseudo-covariance matrix by a random unit Gaussian vector, and adds the result to the pseudo-mean (generally the step of taking the PSD square root is unnecessary and therefore sometimes omitted). The sequence of random reweighings is then the process of multiplying a random Gaussian vector into each of the other tensor modes of the degree-k pseudo-distribution realized as a k-tensor over ℝ ⊕ ℝ^d (where ⊕ denotes the direct sum) until only two modes remain and Gaussian rounding may be applied.
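The following is a minimal sketch of this tensor view (my own illustration, not the thesis's implementation): it contracts fresh Gaussian vectors into all but two modes of a symmetric pseudo-moment tensor, then performs Gaussian rounding from the PSD part of the resulting pseudo-covariance. The retry loop for degenerate reweighings is a simplification I add for robustness.

```python
import numpy as np

def gaussian_round(moments, rng, tries=20):
    """Random linear reweighings followed by Gaussian rounding: contract fresh
    Gaussian vectors into all but two modes of a symmetric pseudo-moment tensor,
    then sample a Gaussian whose covariance matches the remaining 2-tensor."""
    for _ in range(tries):
        M = moments
        while M.ndim > 2:
            M = M @ rng.normal(size=M.shape[-1])    # reweigh: contract one mode
        M = (M + M.T) / 2
        w, V = np.linalg.eigh(M)
        w = np.clip(w, 0.0, None)
        if w.max() < 1e-12:
            continue                                # reweighing wiped out the PSD part; redraw
        sqrt_cov = (V * np.sqrt(w)) @ V.T           # PSD square root of the pseudo-covariance
        x = sqrt_cov @ rng.normal(size=M.shape[0])  # Gaussian rounding (pseudo-mean taken as 0)
        return x / np.linalg.norm(x)
    raise RuntimeError("no useful reweighing found")

# Toy input: exact degree-4 moments of the point mass on a unit vector a.
rng = np.random.default_rng(2)
d = 10
a = rng.normal(size=d)
a /= np.linalg.norm(a)
moments4 = np.einsum('i,j,k,l->ijkl', a, a, a, a)
x = gaussian_round(moments4, rng)
print(abs(np.dot(x, a)))                            # essentially 1 for this noiseless input
```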
The second major rounding technique substitutes for Gaussian rounding by computing the eigendecomposition of a pseudo-covariance matrix, this technique being developed in Chapter 4. This takes inspiration from the algorithm known as Jennrich’s simultaneous diagonalization [46, 29], which was previously used to recover components from undercomplete tensors. When an eigendecomposition is applied on a noise-free orthogonal tensor after it has been reweighed down to a 2-tensor (a matrix), then the resulting eigenvectors must correspond to the original tensor components. When there is some noise, repeating this operation enough times randomly results in a large enough eigengap to withstand the spectral perturbation. These two facts are the same in Jennrich’s algorithm and
9 in ours, though our technique is simplified by the assumption that we have a certificate of orthogonality which allows for easy verification of a hypothetical solution vector.
In all of the algorithms, the way in which the method of moments is shown to be robust to the noise term in the input tensor is to find a certification function (which is a low-degree polynomial) of the input, such that the desired (signal) components to recover are certified to exist in the input if the function evaluates to a value over some chosen threshold. At the same time, we must show that the noise term by itself fails to achieve certification by the same function (via a sum-of-squares proof), and that the function is “strongly subadditive” in that the cross-terms that arise when multiplying the sum of the noise and the signal are with high probability too small to push the noise over the threshold. This certification, combined with the strong duality between pseudo-distributions and sum-of-squares proofs, then guarantees that the solved pseudo-distribution is predominantly composed of information about the signal terms.
Lifting: generating relaxed solution concepts
In the simplest settings, the lifting step consists simply of generating consistent hypothetical higher pseudo-moments to match the distribution whose lower moments are given in the input. This is an advantage because higher moments may be easier to use: there are more of them, and their linear algebraic representations are easier to handle as rank-one components become farther apart in the higher tensor space, and the increased number of dimensions in their matrix reshapings enable more components to be recovered. Meanwhile the lower moments are much easier to empirically estimate.
In the overcomplete models (including the random, separated, and generic/smoothed models), there is also a concept of “lifting” via an “orthogonalizing map” P(x), which is a low-degree vector-valued polynomial such that, when applied to the original components to be recovered, P(a_1), ..., P(a_n) form an orthonormal set of vectors in some lifted vector space. This orthogonalizing map can be understood as a sum-of-squares proof of the existence of only a small number of solutions to the original problem, as at most d^k vectors can be mutually orthogonal after being mapped by P.
Finally, we develop a technique to generate “pseudo-pseudo-moments” from the algorithm input without the use of convex optimization techniques, making this approach much faster and demonstrating potential for use in practice. These “pseudo-pseudo-moments” are not quite pseudo-moments in that they don’t necessarily satisfy the full set of symmetry or positive-semidefiniteness constraints, but enough constraints are satisfied that the rounding method used can succeed anyway. The pseudo-pseudo-moments are generated from simple matrix polynomials and spectral operations applied to the input, generally taking not much longer than computing a singular value decomposition of the input, and the recipes to generate them are inspired by examining dual objects in the sum-of-squares proofs of the rounding algorithms. The proofs that these solution concepts suffice generally go through eigenvector perturbation theorems, as used in perturbation theory in quantum mechanics.
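As a toy illustration of this flavor of spectral construction (with my own simplifications, constants, and deflation step; this is not the thesis's algorithm), one can take the matrix polynomial ∑_i T_i ⊗ T_i that Chapter 2 uses for its spectral SoS solution (Section 2.5.2), remove the direction carrying the mean of the noise part, and read a candidate component off the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
tau = 6 * n ** 0.75

# Spiked 3-tensor input: T = tau * v^{(tensor)3} + A with iid Gaussian noise A.
v = rng.normal(size=n)
v /= np.linalg.norm(v)
T = tau * np.einsum('i,j,k->ijk', v, v, v) + rng.normal(size=(n, n, n))

# A simple matrix polynomial of the input: M = sum_i T_i (tensor) T_i  (n^2 x n^2).
M = sum(np.kron(T[i], T[i]) for i in range(n))
M = (M + M.T) / 2

# Deflate the vec(identity) direction, which carries the mean of the noise part.
u = np.eye(n).reshape(-1)
u /= np.linalg.norm(u)
P = np.eye(n * n) - np.outer(u, u)
M = P @ M @ P

# The top eigenvector, reshaped to n x n, should be spectrally close to v v^T;
# its top left singular vector is the candidate component.
w, V = np.linalg.eigh(M)
top = V[:, -1].reshape(n, n)
v_hat = np.linalg.svd(top)[0][:, 0]
print(abs(np.dot(v_hat, v)))   # correlation with the planted component
```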
This technique of pseudo-pseudo-moments is pushed far in the development of a spectral algorithm for the overcomplete generic/smoothed components model. A new algebraic identity is given which yields a new identifiability argument, showing that the inferred order-6 pseudo-moments uniquely determine the original components of the input, based on a lower level of pseudo-moments than was previously possible (the prior approach, “FOOBI,” was established to require order-8 pseudo-moments [55]). As well, a method of truncating “excess spectral mass” in various reshapings of the pseudo-moments was adapted to this setting from another work focusing on the undercomplete case [68].
Lower bound
Finally, the lower bound Theorem 1.0.6 presented in Chapter 2 states that a degree-4 sum-of-squares program is unable to distinguish a single-spike model with Gaussian noise from a pure Gaussian noise model at high enough levels of noise. This lower bound is proved by direct construction of a degree-4 pseudo-distribution that with high probability purports to demonstrate the existence of a signal spike in a random input with no actual signal term. The construction is by a fine analysis of the structure of degree-4 pseudo-distributions that maximize isotropically generated degree-3 polynomials, under the symmetry and positive-semidefiniteness constraints that must be followed by pseudo-distributions. Outer products are added to an initial pseudo-distribution construction to satisfy the positive-semidefiniteness constraints, then the transformations of those outer products under 4-symmetry operators are added to satisfy symmetry constraints, along with an analysis of their spectral effect on the positive-semidefiniteness constraints, and so on. These techniques ultimately go on to contribute to the technique of pseudo-calibration which has succeeded in demonstrating sum-of-squares lower bounds at much higher degrees [52].
12 1.0.3 Frontier of what is possible
What is doable via existing techniques. Generally, the results stated for k = 4 are generalizable to all even orders, and the results stated for k = 3 generalize to all orders, although surgery on “spurious subspaces” may be required to apply them to even orders.
Some of the algorithms are easily extended to non-unit vectors, with perhaps some loss in performance depending on a condition number: the main exception to this being the spectral algorithm for the generic/smoothed model and the random components model (although the results for the random components model should apply just as well to standard Gaussian vectors).
New innovations needed. One barrier repeatedly encountered in the progress of this work was the lack of a notion of a “higher-order whitening” as a primitive algebraic technique. In several places, it was necessary to assume isotropy of the components in ℝ^d to make a lifting procedure work, while it also would have been useful to have isometry of the components in (ℝ^d)^{⊗k} for the purposes of successfully rounding the lifted tensor. It is in general not possible to achieve both of these at the same time, although the lifting technique used for the generic/smoothed model may be considered a limited form of a possibly more general “higher-order whitening.” Such a concept could then be applied to generic/smoothed odd-ordered tensors, or to more exact recovery of the components in the spectral algorithm for generic/smoothed tensors.
More pointedly, there is currently an absence of any sort of general approach for overcomplete third-order tensors whose components fall outside of being simultaneously (1) pairwise nearly orthogonal and (2) isotropically distributed. One difficulty here is that the span of any reshaping of an order-3 tensor inevitably has dimension at most d. This suggests looking to the technique of approximate message passing (AMP), which implicitly uses larger tensors as maps, to try to extract higher-dimensional polynomial maps from the input that provide new information about the components [81].
To be able to handle an important class of models realistically, it would also be important to weaken the “niceness” constraint on dictionary learning via optimization techniques [17, 58, 68], as this is a strong limitation on the distributional robustness of the technique.
Another robustness improvement important to general applicability would be the ability to recover non-unit components in the spectral algorithm for the generic/smoothed model, since for distributions in the wild, we should not expect contributions from different sources to all be of the same magnitude.
More specifically, the rounding algorithm described herein has a better chance of recovering the smaller components: recovering (and subtracting out) the large components first would be much less fragile and more akin to how Principal Component Analysis is used today in the linear case.
It would also be interesting to apply some of these techniques in order to round solution concepts in discrete domains, e.g. in constraint satisfaction problems. Also of interest would be the derivation of tight sample size bounds, and perhaps improvements in running speed in practice by avoiding explicit construction of the full tensor (which takes O(m d^k) time with m samples).
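The point about implicit access can be made concrete: if the tensor of interest is an empirical k-th moment of m samples, its contractions can be evaluated directly from the samples without ever forming the d^k array. The sketch below (my own illustration, with hypothetical names) checks the identity on a small instance.

```python
import numpy as np

def moment_tensor_apply(Y, x, k=3):
    """Contract the empirical k-th moment tensor (1/m) sum_s y_s^{(tensor)k}
    with x in k-1 modes, without forming the tensor.  Costs O(m d) per call
    instead of the O(m d^k) needed to build the tensor explicitly."""
    inner = Y @ x                       # <y_s, x> for every sample, shape (m,)
    return (inner ** (k - 1)) @ Y / Y.shape[0]

rng = np.random.default_rng(4)
m, d = 1000, 8
Y = rng.normal(size=(m, d))
x = rng.normal(size=d)

implicit = moment_tensor_apply(Y, x)

# Explicit check on a small instance: build the 3-tensor and contract it.
T = np.einsum('si,sj,sk->ijk', Y, Y, Y) / m
explicit = np.einsum('ijk,j,k->i', T, x, x)
assert np.allclose(implicit, explicit)
```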
14 1.0.4 Organization
The body of this document will present the results above in detail in roughly the chronological order in which they were discovered.
Chapter 2 covers the algorithms and the lower-bound for the single-spike model as stated by Theorem 1.0.1 and Theorem 1.0.6, including the first use of “pseudo-pseudo-moments”. It shares material with a paper presented in
Conference on Learning Theory 2015 by Sam Hopkins, David Steurer, and myself [50].
Chapter 3 covers a variety of pseudo-pseudo-moments algorithms, including for the overcomplete tensor decomposition problem as stated in Theorem 1.0.3, as well as for a different problem of finding planted sparse vectors in a random subspace, and shares material with a paper presented in the Symposium on Theory of Computing 2016 by Sam Hopkins, Tselil Schramm, David Steurer, and myself [49].
Chapter 4 covers the development of lifting via an orthogonal map as well as rounding by eigendecomposition of pseudo-covariance, gives the semidefinite programming algorithms for Theorem 1.0.2, Theorem 1.0.3, Theorem 1.0.4, and Theorem 1.0.5, as well as the spectral algorithm for Theorem 1.0.2, and shares material with a paper presented in Foundations of Computer Science by Tengyu Ma, David Steurer, and myself [58].
Chapter 5 covers the development of pseudo-pseudo-moments techniques in a generic overcomplete case, giving the spectral algorithm for Theorem 1.0.5, and shares material with a paper presented in Conference on Learning Theory 2019 by Sam Hopkins, Tselil Schramm, and myself [48].
15 Each chapter will vary slightly in the notation used and so a preliminaries section is included for each chapter.
CHAPTER 2
SPIKED TENSOR MODEL
2.1 Introduction
Principal component analysis (pca), the process of identifying a direction of largest possible variance from a matrix of pairwise correlations, is among the most basic tools for data analysis in a wide range of disciplines. In recent years, variants of pca have been proposed that promise to give better statistical guarantees for many applications. These variants include restricting directions to the nonnegative orthant (nonnegative matrix factorization) or to directions that are sparse linear combinations of a fixed basis (sparse pca). Often we have access to not only pairwise but also higher-order correlations. In this case, an analog of pca is to find a direction with largest possible third moment or other higher-order moment
(higher-order pca or tensor pca).
All of these variants of pca share that the underlying optimization problem is
NP-hard for general instances (often even if we allow approximation), whereas vanilla pca boils down to an efficient eigenvector computation for the input matrix. However, these hardness results are not predictive in statistical settings where inputs are drawn from particular families of distributions, where efficient algorithms can often achieve much stronger guarantees than for general instances. Understanding the power and limitations of efficient algorithms for statistical models of NP-hard optimization problems is typically very challenging: it is not clear what kind of algorithms can exploit the additional structure afforded by statistical instances, but, at the same time, there are very few tools for reasoning about the computational complexity of statistical / average-case problems. (See
17 [23] and [18] for discussions about the computational complexity of statistical models for sparse pca and random constraint satisfaction problems.)
We study a statistical model for the tensor principal component analysis problem introduced by [60] through the lens of a meta-algorithm called the sum-of-squares method, based on semidefinite programming. This method can capture a wide range of algorithmic techniques including linear programming and spectral algorithms. We show that this method can exploit the structure of statistical tensor pca instances in non-trivial ways and achieves guarantees that improve over the previous ones. On the other hand, we show that those guarantees are nearly tight if we restrict the complexity of the sum-of-squares meta-algorithm at a particular level. This result rules out better guarantees for a fairly wide range of potential algorithms. Finally, we develop techniques to turn algorithms based on the sum-of-squares meta-algorithm into algorithms that are truly efficient (and even easy to implement).
Montanari and Richard propose the following statistical model1 for tensor pca.
Problem 2.1.1 (Spiked Tensor Model for tensor pca, Asymmetric). Given an input tensor T = τ · v^{⊗3} + A, where v ∈ ℝ^n is an arbitrary unit vector, τ ≥ 0 is the signal-to-noise ratio, and A is a random noise tensor with iid standard Gaussian entries, recover the signal v approximately.
Montanari and Richard show that when τ ≤ o(√n), Problem 3.6.1 becomes information-theoretically unsolvable, while for τ ≥ ω(√n) the maximum likelihood estimator (MLE) recovers v′ with ⟨v, v′⟩ ≥ 1 − o(1).
The maximum-likelihood-estimator (MLE) problem for Problem 3.6.1 is an instance of the following meta-problem for k = 3 and f : x ↦ ∑_{ijk} T_{ijk} x_i x_j x_k [60].
¹Montanari and Richard use a different normalization for the signal-to-noise ratio. Using their notation, β ≈ τ/√n.
Problem 2.1.2. Given a homogeneous, degree-k function f on ℝ^n, find a unit vector v ∈ ℝ^n so as to maximize f(v) approximately.
For k = 2, this problem is just an eigenvector computation. Already for k = 3, it is NP-hard. Our algorithms proceed by relaxing Problem 2.1.2 to a convex problem. The latter can be solved either exactly or approximately (as will be the case for our faster algorithms). Under the Gaussian assumption on the noise in Problem 3.6.1, we show that for τ ≥ ω(n^{3/4} log(n)^{1/4}) the relaxation does not substantially change the global optimum.
Noise Symmetry. Montanari and Richard actually consider two variants of this model. The first we have already described. In the second, the noise is symmetrized (to match the symmetry of potential signal tensors v^{⊗3}).
Problem 2.1.3 (Spiked Tensor Model for tensor pca, Symmetric). Given an input tensor T = τ · v^{⊗3} + A, where v ∈ ℝ^n is an arbitrary unit vector, τ ≥ 0 is the signal-to-noise ratio, and A is a random symmetric noise tensor—that is, A_{ijk} = A_{π(i)π(j)π(k)} for any permutation π—with otherwise iid standard Gaussian entries, recover the signal v approximately.
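A small sampler makes the two noise models of Problem 2.1.1 and Problem 2.1.3 concrete (a sketch of my own; the helper name and the τ in the usage line are arbitrary choices, not values from the thesis).

```python
import numpy as np

def spiked_tensor(n, tau, symmetric=False, rng=None):
    """Sample T = tau * v^{(tensor)3} + A.  With symmetric=True, every noise
    entry equals the value at its sorted index triple, so A is invariant under
    permuting its modes (the symmetric model); otherwise all entries are
    independent (the asymmetric model)."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.normal(size=n)
    v /= np.linalg.norm(v)
    A = rng.normal(size=(n, n, n))
    if symmetric:
        idx = np.sort(np.indices((n, n, n)), axis=0)   # sorted (i, j, k) per entry
        A = A[idx[0], idx[1], idx[2]]
    return tau * np.einsum('i,j,k->ijk', v, v, v) + A, v

T, v = spiked_tensor(50, tau=100.0, symmetric=True, rng=np.random.default_rng(5))
```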
It turns out that for our algorithms based on the sum-of-squares method, this kind of symmetrization is already built-in. Hence there is no difference between
Problem 3.6.1 and Problem 2.1.3 for those algorithms. For our faster algorithms, such symmetrization is not built in. Nonetheless, we show that a variant of our nearly-linear-time algorithm for Problem 3.6.1 also solves Problem 2.1.3 with matching guarantees.
19 2.1.1 Results
Sum-of-squares relaxation. We consider the degree-4 sum-of-squares relaxation for the MLE problem. (See Section 4.2 for a brief discussion about sum-of-squares. All necessary definitions are in Section 4.3. See [21] for more detailed discussion.) Note that the planted vector v has objective value (1 − o(1))τ for the MLE problem with high probability (assuming τ ≥ Ω(√n), which will always be the case for us).
Theorem 2.1.4. There exists a polynomial-time algorithm based on the degree-4 sum-of-squares relaxation for the MLE problem that, given an instance of Problem 3.6.1 or Problem 2.1.3 with τ ≥ n^{3/4}(log n)^{1/4}/ε, outputs a unit vector v′ with ⟨v, v′⟩ ≥ 1 − O(ε) with probability 1 − O(n^{-10}) over the randomness in the input. Furthermore, the algorithm works by rounding any solution to the relaxation with objective value at least (1 − o(1))τ. Finally, the algorithm also certifies that all unit vectors bounded away from v′ have objective value significantly smaller than τ for the MLE problem, Problem 2.1.2.
We complement the above algorithmic result by the following lower bound.
Theorem 2.1.5 (Informal Version). There is τ : ℕ → ℝ with τ ≤ O(n^{3/4}/log(n)^{1/4}) so that when T is an instance of Problem 3.6.1 with signal-to-noise ratio τ, with probability 1 − O(n^{-10}), there exists a solution to the degree-4 sum-of-squares relaxation for the MLE problem with objective value at least τ that does not depend on the planted vector v. In particular, no algorithm can reliably recover from this solution a vector v′ that is significantly correlated with v.
Faster algorithms. We interpret a tensor-unfolding algorithm studied by Montanari and Richard as a spectral relaxation of the degree-4 sum-of-squares program for the MLE problem. This interpretation leads to an analysis that gives better guarantees in terms of signal-to-noise ratio τ and also informs a more efficient implementation based on shifted matrix power iteration.
Theorem 2.1.6. There exists an algorithm with running time Õ(n³), which is linear in the size of the input, that, given an instance of Problem 3.6.1 or Problem 2.1.3 with τ ≥ n^{3/4}/ε, outputs with probability 1 − O(n^{-10}) a unit vector v′ with ⟨v, v′⟩ ≥ 1 − O(ε).
We remark that unlike the previous polynomial-time algorithm, this linear-time algorithm does not come with a certification guarantee. In Section 2.4.1, we show that small adversarial perturbations can cause this algorithm to fail, whereas the previous algorithm is robust against such perturbations. We also devise an algorithm with the certification property and running time Õ(n⁴) (which is subquadratic in the size n³ of the input).
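For intuition, here is a minimal sketch of recovery by tensor unfolding in the spirit of Theorem 2.1.6: it takes the top left singular vector of the n × n² unfolding via plain power iteration, rather than the shifted matrix power iteration developed later in the chapter, and omits certification and the robustness caveats just discussed. The constants and names are mine.

```python
import numpy as np

def unfolding_recovery(T, iters=100, rng=None):
    """Candidate spike from the top left singular vector of the n x n^2
    unfolding of T, computed by power iteration on M M^T."""
    rng = np.random.default_rng() if rng is None else rng
    n = T.shape[0]
    M = T.reshape(n, n * n)
    x = rng.normal(size=n)
    for _ in range(iters):
        x = M @ (M.T @ x)
        x /= np.linalg.norm(x)
    return x

rng = np.random.default_rng(6)
n = 60
tau = 4 * n ** 0.75
v = rng.normal(size=n)
v /= np.linalg.norm(v)
T = tau * np.einsum('i,j,k->ijk', v, v, v) + rng.normal(size=(n, n, n))
print(abs(np.dot(unfolding_recovery(T, rng=rng), v)))   # correlation with the spike
```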
Theorem 2.1.7. There exists an algorithm with running time Õ(n⁴) (for inputs of size n³) that, given an instance of Problem 3.6.1 with τ ≥ n^{3/4}(log n)^{1/4}/ε for some ε, outputs with probability 1 − O(n^{-10}) a unit vector v′ with ⟨v, v′⟩ ≥ 1 − O(ε) and certifies that all vectors bounded away from v′ have MLE objective value significantly less than τ.
Higher-order tensors. Our algorithmic results also extend in a straightforward way to tensors of order higher than 3. (See Section 2.7 for some details.) For simplicity we give some of these results only for the higher-order analogue of Problem 3.6.1; we conjecture however that all our results for Problem 2.1.3 generalize in similar fashion.
Theorem 2.1.8. Let k be an odd integer, v_0 ∈ ℝ^n a unit vector, τ ≥ n^{k/4} log(n)^{1/4}/ε, and A an order-k tensor with independent unit Gaussian entries. Let T(x) = τ · ⟨v_0, x⟩^k + A(x).

1. There is a polynomial-time algorithm, based on semidefinite programming, which on input T(x) = τ · ⟨v_0, x⟩^k + A(x) returns a unit vector v with ⟨v_0, v⟩ ≥ 1 − O(ε) with probability 1 − O(n^{-10}) over random choice of A.

2. There is a polynomial-time algorithm, based on semidefinite programming, which on input T(x) = τ · ⟨v_0, x⟩^k + A(x) certifies that T(x) ≤ τ · ⟨v, x⟩^k + O(n^{k/4} log(n)^{1/4}) for some unit v with probability 1 − O(n^{-10}) over random choice of A. This guarantees in particular that v is close to a maximum likelihood estimator for the problem of recovering the signal v_0 from the input τ · v_0^{⊗k} + A.

3. By solving the semidefinite relaxation approximately, both algorithms can be implemented in time Õ(m^{1+1/k}), where m = n^k is the input size.

For even k, the above all hold, except now we recover v with ⟨v_0, v⟩² ≥ 1 − O(ε), and the algorithms can be implemented in nearly linear time.
Remark 2.1.9. When A is a symmetric noise tensor (the higher-order analogue of Problem 2.1.3), (1–2) above hold. We conjecture that (3) does as well.
The last theorem below, the higher-order generalization of Theorem 2.1.6, almost completely resolves a conjecture of Montanari and Richard regarding tensor unfolding algorithms for odd k. We are able to prove their conjectured signal-to-noise ratio τ for an algorithm that works mainly by using an unfolding of the input tensor, but our algorithm includes an extra random-rotation step to handle sparse signals. We conjecture, but cannot prove, that the necessity of this step is an artifact of the analysis.
Theorem 2.1.10. Let k be an odd integer, v_0 ∈ ℝ^n a unit vector, τ ≥ n^{k/4}/ε, and A an order-k tensor with independent unit Gaussian entries. There is a nearly-linear-time algorithm, based on tensor unfolding, which, with probability 1 − O(n^{-10}) over random choice of A, recovers a vector v with ⟨v, v_0⟩² ≥ 1 − O(ε). This continues to hold when A is replaced by a symmetric noise tensor (the higher-order analogue of Problem 2.1.3).
2.1.2 Techniques
We arrive at our results via an analysis of Problem 2.1.2 for the function f(x) = ∑_{ijk} T_{ijk} x_i x_j x_k, where T is an instance of Problem 3.6.1. The function f decomposes as f = g + h for a signal g(x) := τ · ⟨v, x⟩³ and noise h(x) = ∑_{ijk} a_{ijk} x_i x_j x_k, where {a_{ijk}} are iid standard Gaussians. The signal g is maximized at x = v, where it takes the value τ. The noise part, h, is with high probability at most Õ(√n) over the unit sphere. We have insisted that τ be much greater than √n, so f has a unique global maximum, dominated by the signal g. The main problem is to find it.
To maximize g, we apply the Sum of Squares meta-algorithm (SoS). SoS provides a hierarchy of strong convex relaxations of Problem 2.1.2. Using convex duality, we can recast the optimization problem as one of efficiently certifying the upper bound on h which shows that the optima of f are dominated by the signal g. SoS efficiently finds boundedness certificates for h of the form
c − h(x) ≡ s_1(x)² + ··· + s_k(x)²

where “≡” denotes equality in the ring ℝ[x]/(‖x‖² − 1) and where s_1, ..., s_k have bounded degree, when such certificates exist. (The polynomials {s_i} and {t_j} certify that h(x) ≤ c. Otherwise c − h(x) would be negative, but this is impossible by the nonnegativity of squared polynomials.)
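As a toy illustration of such a certificate (not an example from the thesis), take n = 2 and h(x) = 2x_1 x_2 with c = 1. Then

c − h(x) = 1 − 2x_1 x_2 ≡ x_1² + x_2² − 2x_1 x_2 = (x_1 − x_2)²  (in ℝ[x]/(‖x‖² − 1)),

so the single square s_1(x) = x_1 − x_2 certifies that h(x) ≤ 1 over the unit sphere.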
Our main technical contribution is an almost-complete characterization of certificates like these for such degree-3 random polynomials h when the polynomials {s_i} have degree at most four. In particular, we show that with high probability in the random case a degree-4 certificate exists for c = Õ(n^{3/4}), and that with high probability, no significantly better degree-four certificate exists.
Algorithms. We apply this characterization in three ways to obtain three different algorithms. The first application is a polynomial-time algorithm, based on semidefinite programming, that maximizes f when τ ≥ Ω̃(n^{3/4}) (and thus solves TPCA in the spiked tensor model for τ ≥ Ω̃(n^{3/4})). This first algorithm involves solving a large semidefinite program associated to the SoS relaxation. As a second application of this characterization, we avoid solving the semidefinite program. Instead, we give an algorithm running in time Õ(n⁴) which quickly constructs only a small portion of an almost-optimal SoS boundedness certificate; in the random case this turns out to be enough to find the signal v and certify the boundedness of f. (Note that this running time is only a factor of n polylog n greater than the input size n³.)
Finally, we analyze a third algorithm for TPCA which simply computes the highest singular vector of a matrix unfolding of the input tensor. This algorithm was considered in depth by Montanari and Richard, who fully characterized its behavior in the case of even-order tensors (corresponding to k = 4, 6, 8, ... in Problem 2.1.2). They conjectured that this algorithm successfully recovers the signal v at the signal-to-noise ratio τ of Theorem 2.1.7 for Problem 3.6.1 and Problem 2.1.3. Up to an extra random-rotation step before the tensor unfolding in the case that the input comes from Problem 2.1.3 (and up to logarithmic factors in τ), we confirm their conjecture. We observe that their algorithm can be viewed as a method of rounding a non-optimal solution to the SoS relaxation to find the signal. We show, also, that for k = 4, the degree-4 SoS relaxation does no better than the simpler tensor unfolding algorithm as far as signal-to-noise ratio is concerned. However, for odd-order tensors this unfolding algorithm does not certify its own success in the way our other algorithms do.
Lower Bounds. In Theorem 2.1.5, we show that degree-4 SoS cannot certify that the noise polynomial A(x) = ∑_{ijk} a_{ijk} x_i x_j x_k for a_{ijk} iid standard Gaussians satisfies A(x) ≤ o(n^{3/4}).
To show that SoS certificates do not exist we construct a corresponding dual object. Here the dual object is a degree-4 pseudo-expectation: a linear map Ẽ : ℝ[x]_{≤4} → ℝ pretending to give the expected value of polynomials of degree at most 4 under some distribution on the unit sphere. “Pretending” here means that, just like an actual distribution, Ẽ p(x)² ≥ 0 for any p of degree at most 2. In other words, Ẽ is positive semidefinite on degree-4 polynomials. While for any actual distribution over the unit sphere 𝔼 A(x) ≤ Õ(√n), we will give Ẽ so that Ẽ A(x) ≥ Ω̃(n^{3/4}).
To ensure that Ẽ A(x) ≥ Ω̃(n^{3/4}), for monomials x_i x_j x_k of degree 3 we take Ẽ x_i x_j x_k ≈ (n^{3/4}/⟨A, A⟩) · a_{ijk}. For polynomials p of degree at most 2 it turns out to be enough to set Ẽ p(x) ≈ 𝔼_µ p(x), where 𝔼_µ denotes the expectation under the uniform distribution on the unit sphere.
Having guessed these degree 1, 2 and 3 pseudo-moments, we need to define Ẽ x_i x_j x_k x_ℓ so that Ẽ is PSD. Representing Ẽ as a large block matrix, the Schur complement criterion for PSDness can be viewed as a method for turning candidate degree 1–3 moments (which here lie on upper-left and off-diagonal blocks) into a candidate matrix M ∈ ℝ^{n²×n²} of degree-4 pseudo-expectation values which, if used to fill out the degree-4 part of Ẽ, would make it PSD.
We would thus like to set Ẽ x_i x_j x_k x_ℓ = M[(i, j), (k, ℓ)]. Unfortunately, these candidate degree-4 moments M[(i, j), (k, ℓ)] do not satisfy commutativity; that is, we might have M[(i, j), (k, ℓ)] ≠ M[(i, k), (j, ℓ)] (for example). But a valid pseudo-expectation must satisfy Ẽ x_i x_j x_k x_ℓ = Ẽ x_i x_k x_j x_ℓ. To fix this, we average out the noncommutativity by setting Ē x_i x_j x_k x_ℓ = (1/|S_4|) ∑_{π ∈ S_4} M[(π(i), π(j)), (π(k), π(ℓ))], where S_4 is the symmetric group on 4 elements.
This ensures that the candidate degree-4 pseudo-expectation Ē satisfies commutativity, but it introduces a new problem. While the matrix M from the Schur complement was guaranteed to be PSD and even to make Ẽ PSD when used as its degree-4 part, some of the permuted matrices π·M given by (π·M)[(i, j), (k, ℓ)] = M[(π(i), π(j)), (π(k), π(ℓ))] need not even be PSD themselves. This means that, while Ē avoids having large negative eigenvalues (since it is correlated with the M from the Schur complement), it will have some small negative eigenvalues; i.e., Ē p(x)² < 0 for some p.
For each permuted matrix π·M we track the most negative eigenvalue λ_min(π·M) using matrix concentration inequalities. After averaging the permutations together to form Ē and adding this to Ẽ to give a linear functional Ẽ + Ē on polynomials of degree at most 4, our final task is to remove these small negative eigenvalues. For this we mix Ẽ + Ē with µ, the uniform distribution on the unit sphere. Since 𝔼_µ has eigenvalues bounded away from zero, our final pseudo-expectation

Ẽ′ p(x) := ε · Ẽ p(x) + ε · Ē p(x) + (1 − ε) · 𝔼_µ p(x),

where the first term supplies the degree 1–3 pseudo-expectations, the second the degree-4 pseudo-expectations, and the third fixes the negative eigenvalues, is PSD for ε small enough. Having tracked the magnitude of the negative eigenvalues of Ē, we are able to show that ε here can be taken large enough to get Ẽ′ A(x) ≥ Ω̃(n^{3/4}), which will prove Theorem 2.1.5.
2.1.3 Related Work
There is a vast literature on tensor analogues of linear algebra problems—too vast to attempt any survey here. Tensor methods for machine learning, in particular for learning latent variable models, have garnered recent attention, e.g., with works of Anandkumar et al. [10, 5]. These approaches generally involve decomposing a tensor which captures some aggregate statistics of input data into rank-one components. A recent series of papers analyzes the tensor power method, a direct analogue of the matrix power method, as a way to find rank-one components of random-case tensors [11,7].
Another recent line of work applies the Sum of Squares (a.k.a. Lasserre or Lasserre/Parrilo) hierarchy of convex relaxations to learning problems. See the survey of Barak and Steurer for references and discussion of these relaxations [21].
Barak, Kelner, and Steurer show how to use SoS to efficiently find sparse vectors planted in random linear subspaces, and the same authors give an algorithm for dictionary learning with strong provable statistical guarantees [16, 15]. These algorithms, too, proceed by decomposition of an underlying random tensor; they exploit the strong (in many cases, the strongest-known) algorithmic guarantees offered by SoS for this problem in a variety of average-case settings.
Concurrently and independently of us, and also inspired by the recently-discovered applicability of tensor methods to machine learning, Barak and Moitra use SoS techniques formally related to ours to address the tensor prediction problem: given a low-rank tensor (perhaps measured with noise) only a subset of whose entries are revealed, predict the rest of the tensor entries [19]. They work with worst-case noise and study the number of revealed entries necessary for the SoS hierarchy to successfully predict the tensor. By contrast, in our setting, the entire tensor is revealed, and we study the signal-to-noise threshold necessary for SoS to recover its principal component under distributional assumptions on the noise that allow us to avoid worst-case hardness behavior.
Since Barak and Moitra work in a setting where few tensor entries are revealed, they are able to use algorithmic techniques and lower bounds from the study of sparse random constraint satisfaction problems (CSPs), in particular random 3XOR [41, 34, 33, 32]. The tensors we study are much denser. In spite of the density (and even though our setting is real-valued), our algorithmic techniques are related to the same spectral refutations of random CSPs. However our lower bound techniques do not seem to be related to the proof-complexity techniques that go into sum-of-squares lower bound results for random CSPs.
The analysis of tractable tensor decomposition in the rank one plus noise model that we consider here (the spiked tensor model) was initiated by Montanari and Richard, whose work inspired the current paper [60]. They analyze a number of natural algorithms and find that tensor unfolding algorithms, which use the spectrum of a matrix unfolding of the input tensor, are most robust to noise. Here we consider more powerful convex relaxations, and in the process we tighten Montanari and Richard’s analysis of tensor unfolding in the case of odd-order tensors.
Related to our lower bound, Montanari, Reichman, and Zeitouni prove strong impossibility results for the problem of detecting rank-one perturbations of Gaussian matrices and tensors using any eigenvalue of the matrix or unfolded tensor; they are able to characterize the precise threshold below which the entire spectrum of a perturbed noise matrix or unfolded tensor becomes indistinguishable from pure noise [?]. Our lower bounds are much coarser, applying only to the objective value of a relaxation (the analogue of just the top eigenvalue of an unfolding), but they apply to all degree-4 SoS-based algorithms, which are a priori a major generalization of spectral methods.
2.2 Preliminaries
2.2.1 Notation
We use x (x1,..., xn) to denote a vector of indeterminates. The letters u, v, w are generally reserved for real vectors. The letters α, β are reserved for multi- indices; that is, for tuples (i1,..., ik) of indices. For f , 1 : → we write f - 1 for f O(1) and f % 1 for f Ω(1). We write f O˜ (1) if f (n) 6 1(n)·polylog n, and f Ω˜ (1) if f > 1(n)/polylog n.
We employ the usual Loewner (a.k.a. positive semi-definite) ordering on Hermitian matrices.
We will be heavily concerned with tensors and matrix flattenings thereof. In general, boldface capital letters T denote tensors and ordinary capital letters denote matrices A. We adopt the convention that unless otherwise noted for a tensor T the matrix T is the squarest-possible unfolding of T. If T has even order
29 k 2 k 2 k 2 k 2 k then T has dimensions n / × n / . For odd k it has dimensions nb / c × nd / e. All tensors, matrices, vectors, and scalars in this paper are real.
We use h·, ·i to denote the usual entrywise inner product of vectors, matrices, and tensors. For a vector v, we use kvk to denote its `2 norm. For a matrix A, we use kAk to denote its operator norm (also known as the spectral or `2-to-`2 norm).
k For a k-tensor T, we write T(v) for hv⊗ , Ti. Thus, T(x) is a homogeneous real polynomial of degree k.
We use Sk to denote the symmetric group on k elements. For a k-tensor T
π and π ∈ Sk, we denote by T the k-tensor with indices permuted according to π, π ∈ S so that Tα Tπ 1 α . A tensor T is symmetric if for all π k it is the case that − ( ) Tπ T. (Such tensors are sometimes called “supersymmetric.”)
For clarity, most of our presentation focuses on 3-tensors. For an n × n
3-tensor T, we use Ti to denote its n × n matrix slices along the first mode, i.e., ( ) Ti j,k Ti,j,k.
We often say that an sequence {En }n of events occurs with high probability, ∈ 10 c which for us means that (En fails) O(n− ). (Any other n− would do, with appropriate modifications of constants elsewhere.)
2.2.2 Polynomials and Matrices
Let [x]6d be the vector space of polynomials with real coefficients in variables x (x1,..., xn), of degree at most d. We can represent a homogeneous even-
30 d 2 d 2 degree polynomial p ∈ [x]d by an n / × n / matrix: a matrix M is a matrix
d 2 d 2 representation for p if p(x) hx⊗ / , Mx⊗ / i. If p has a matrix representation Í ( )2 M 0, then p i pi x for some polynomials pi.
2.2.3 The Sum of Squares (SoS) Algorithm
Definition 2.2.1. Let L : [x]6d → be a linear functional on polynomials of degree at most d for some d even. Suppose that
• L 1 1.
L ( )2 ∈ [ ] • p x > 0 for all p x 6d 2. /
Then L is a degree-d pseudo-expectation. We often use the suggestive notation ¾˜ for such a functional, and think of ¾˜ p(x) as giving the expectation of the polynomial p(x) under a pseudo-distribution over {x}.
For p ∈ [x]6d we say that the pseudo-distribution {x} (or, equivalently, the functional ¾˜ ) satisfies {p(x) 0} if ¾˜ p(x)q(x) 0 for all q(x) such that p(x)q(x) ∈ [x]6d.
Pseudo-distributions were first introduced in [14] and are surveyed in [21].
We employ the standard result that, up to negligible issues of numerical accuracy, if there exists a degree-d pseudo-distribution satisfying constraints
O d {p0(x) 0,..., pm(x) 0}, then it can be found in time n ( ) by solving a O d semidefinite program of size n ( ). (See [21] for references.)
31 2.3 Certifying Bounds on Random Polynomials
Let f ∈ [x]d be a homogeneous degree-d polynomial. When d is even, f has
d 2 d 2 square matrix representations of dimension n / × n / . The maximal eigenvalue of a matrix representation M of f provides a natural certifiable upper bound on ( ) max v 1 f v , as k k h i d 2 d 2 w, Mw f (v) hv⊗ / , Mv⊗ / i 6 max kMk . d 2 h i w n / w, w ∈ When f (x) A(x) for an even-order tensor A with independent random entries, the quality of this certificate is well characterized by random matrix theory. In the case where the entries of A are standard Gaussians, for instance, kMk k + T k ˜ ( d 4) ( ) A A 6 O n / with high probability, thus certifying that max v 1 f v 6 k k d 4 O˜ (n / ).
A similar story applies to f of odd degree with random coefficients, but with a catch: the certificates are not as good. For example, we expect a degree-3 random polynomial to be a smaller and simpler object than one of degree-4, ( ) and so we should be able to certify a tighter upper bound on max v 1 f v . k k The matrix representations of f are now rectangular n2 × n matrices whose top ( ) singular values are certifiable upper bounds on max v 1 f v . But in random k k matrix theory, this maximum singular value depends (to a first approximation) only on the longer dimension n2, which is the same here as in the degree-4 case.
Again when f (x) A(x), this time where A is an order-3 tensor of independent p standard Gaussian entries, kMk kAAT k > Ω˜ (n), so that this method cannot ( ) ˜ ( ) certify better than max v 1 f v 6 O n . Thus, the natural spectral certificates k k are unable to exploit the decrease in degree from 4 to 3 to improve the certified bounds.
32 To better exploit the benefits of square matrices, we bound the maxima of degree-3 homogeneous f by a degree-4 polynomial. In the case that f is ( ) 1 h ∇ ( )i multi-linear, we have the polynomial identity f x 3 x, f x . Using Cauchy- ( ) 1 k kk∇ ( )k Schwarz, we then get f x 6 3 x f x . This inequality suggests using the degree-4 polynomial k∇ f (x)k2 as a bound on f . Note that local optima of f on the sphere occur where ∇ f (v) ∝ v, and so this bound is tight at local maxima.
Given a random homogeneous f , we will associate a degree-4 polynomial related to k∇ f k2 and show that this polynomial yields the best possible degree-4 ( ) SoS-certifiable bound on max v 1 f v . k k
Definition 2.3.1. Let f ∈ [x]3 be a homogeneous degree-3 polynomial with indeterminates x (x1,..., xn). Suppose A1,..., An are matrices such that Í h i f i xi x, Ai x . We say that f is λ-bounded if there are matrices A1,..., An k k4 Í ⊗ 2 · as above and a matrix representation M of x so that i Ai Ai λ M.
We observe that for f multi-linear in the coordinates xi of x, up to a constant factor we may take the matrices Ai to be matrix representations of ∂i f , so that Í ⊗ k∇ k2 i Ai Ai is a matrix representation of the polynomial f . This choice of Ai may not, however, yield the optimal spectral bound λ2.
The following theorem is the reason for our definition of λ-boundedness.
∈ [ ] ( ) Theorem 2.3.2. Let f x 3 be λ-bounded. Then max v 1 f v 6 λ, and the k k degree-4 SoS algorithm certifies this. In particular, every degree-4 pseudo-distribution {x} over n satisfies 3 4 ¾˜ f 6 λ · ¾˜ kxk4 / .
Proof. By Cauchy–Schwarz for pseudo-expectations, the pseudo-distribution ¾˜ k k22 ¾˜ k k4 ¾˜ Í h i2 ¾˜ Í 2 · Í h i2 satisfies x 6 x and i xi x, Ai x 6 i xi i x, Ai x .
33 Therefore,
Õ ¾˜ f ¾˜ xi · hx, Ai xi i 1 2 1 2 Õ 2 / Õ 2 / 6 ¾˜ x · ¾˜ hx, Ai xi i i i 1 2 21 2 2 Õ 2 / ¾˜ kxk / · ¾˜ hx⊗ , Ai ⊗ Ai x⊗ i i 41 4 2 2 2 1 2 6 ¾˜ kxk / · ¾˜ hx⊗ , λ · Mx⊗ i /
3 4 λ · ¾˜ kxk4 / .
(Í ⊗ ) 2 · The last inequality also uses the premise i Ai Ai λ M for some matrix rep- k k4 2 · − (Í ⊗ ) resentation M of x , in the following way. Since M0 : λ M i Ai Ai 2 2 0, the polynomial hx⊗ , M0x⊗ i is a sum of squared polynomials. Thus,
2 2 ¾˜ hx⊗ , M0x⊗ i > 0 and the desired inequality follows.
We now state the degree-3 case of a general λ-boundedness fact for homo- geneous polynomials with random coefficients. The SoS-certifiable bound for a random degree-3 polynomial this provides is the backbone of our SoS algorithm for tensor PCA in the spiked tensor model.
Theorem 2.3.3. Let A be a 3-tensor with independent entries from N(0, 1). Then A(x)
3 4 1 4 is λ-bounded with λ O(n / log(n) / ), with high probability.
The full statement and proof of this theorem, generalized to arbitrary-degree homogeneous polynomials, may be found as Theorem A.2.5; we prove the statement above as a corollary in Section A.2. Here provide a proof sketch.
Proof sketch. We first note that the matrix slices Ai of A satisfy A(x) Í h i Í ⊗ i xi x, Ai x . Using the matrix Bernstein inequality, we show that i Ai − Í ⊗ ( 3 2( )1 2)· Ai ¾ i Ai Ai O n / log n / Id with high probability. At the same time,
34 1 ¾ Í ⊗ a straightforward computation shows that n i Ai Ai is a matrix represen- k k4 Í ⊗ 2 · tation of x . Since Id is as well, we get that i Ai Ai λ M , where M is k k4 Í ⊗ some matrix representation of x which combines Id and ¾ i Ai Ai, and 3 4 1 4 λ O(n / (log n) / ).
Corollary 2.3.4. Let A be a 3-tensor with independent entries from N(0, 1). Then, ( ) with high probability, the degree-4 SoS algorithm certifies that max v 1 A v 6 k k 3 4 1 4 O(n / (log n) / ). Furthermore, also with high probability, every pseudo-distribution {x} over n satisfies
3 4 1 4 4 3 4 ¾˜ A(x) 6 O(n / (log n) / )(¾˜ kxk ) / .
Proof. Immediate by combining Theorem 2.3.3 with Theorem 2.3.2.
2.4 Polynomial-Time Recovery via Sum of Squares
Here we give our first algorithm for tensor PCA: we analyze the quality of the natural SoS relaxation of tensor PCA using our previous discussion of boundedness certificates for random polynomials, and we show how to round this relaxation. We discuss also the robustness of the SoS-based algorithm to some amount of additional worst-case noise in the input. For now, to obtain a solution to the SoS relaxation we will solve a large semidefinite program. Thus, the algorithm discussed here is not yet enough to prove Theorem 2.1.7 and Corollary 2.1.7: the running time, while still polynomial, is somewhat greater than O˜ (n4).
35 Tensor PCA with Semidefinite Programming
· 3 + ∈ n Input: T τ v0⊗ A, where v and A is some order-3 tensor. n Goal: Find v ∈ with |hv, v0i| > 1 − o(1).
Algorithm 2.4.1 (Recovery). Using semidefinite programming, find the degree-4 pseudo-distribution {x} satisfying {kxk2 1} which maximizes ¾˜ T(x). Output ¾˜ x/k ¾˜ xk.
Algorithm 2.4.2 (Certification). Run Algorithm 2.4.1 to obtain v. Using semidefinite programming, find the degree-4 pseudo-distribution {x} satisfying {kxk 1}
3 3 3 4 1 4 which maximizes ¾˜ T(x) − τ · hv, xi . If ¾˜ T(x) − τ · hv, xi 6 O(n / log(n) / ), output certify. Otherwise, output fail.
The following theorem characterizes the success of Algorithm 2.4.1 and Algo- rithm 2.4.2
· 3 + Theorem 2.4.3 (Formal version of Theorem 2.1.4). Let T τ v0⊗ A, where n 3 4 1 4 v0 ∈ and A has independent entries from N(0, 1). Let τ % n / log(n) / /ε. Then 3 1 Í π with high probability over random choice of A, on input T or T0 : τ·v⊗ + A , 0 3 π 3 |S | ∈S Algorithm 2.4.1 outputs a vector v with hv, v0i > 1 − O(ε). In other words, for this τ, Algorithm 2.4.1 solves both Problem 3.6.1 and Problem 2.1.3.
n For any unit v0 ∈ and A, if Algorithm 2.4.2 outputs certify then T(x) 6
3 3 4 1 4 τ·hv, xi +O(n / log(n) / ). For A as described in either Problem 3.6.1 or Problem 2.1.3
3 4 1 4 and τ % n / log(n) / /ε, Algorithm 2.4.2 outputs certify with high probability.
The analysis has two parts. We show that
1. if there exists a sufficiently good upper bound on A(x) (or in the case of the
π symmetric noise input, on A (x) for every π ∈ S3) which is degree-4 SoS
36 certifiable, then the vector recovered by the algorithm will be very close to v, and that
2. in the case of A with independent entries from N(0, 1), such a bound exists with high probability.
Conveniently, Item 2 is precisely the content of Corollary 2.3.4. The following lemma expresses Item 1.
4 3 4 Lemma 2.4.4. Suppose A(x) ∈ [x]3 is such that | ¾˜ A(x)| 6 ετ ·(¾˜ kxk ) / for any { } · 3 + degree-4 pseudo-distribution x . Then on input τ v0⊗ A, Algorithm 2.4.1 outputs a unit vector v with hv, v0i > 1 − O(ε).
Proof. Algorithm 2.4.1 outputs v ¾˜ x/k ¾˜ xk for the pseudo-distribution that it finds, so we’d like to show hv0, ¾˜ x/k ¾˜ xki > 1 − O(ε). By pseudo-Cauchy- Schwarz (Lemma A.1.2), k ¾˜ xk2 6 ¾˜ kxk2 1, so it will suffice to prove just that hv0, ¾˜ xi > 1 − O(ε).
3 If ¾˜ hv0, xi > 1 − O(ε), then by Lemma A.1.5 (and linearity of pseudo- expectation) we would have
hv0, ¾˜ xi ¾˜ hv0, xi > 1 − O(2ε) 1 − O(ε)
3 So it suffices to show that ¾˜ hv0, xi is close to 1.
Recall that Algorithm 2.4.1 finds a pseudo-distribution that maximizes ¾˜ T(x). ¾˜ ( ) ¾˜ h 3 3i ¾˜ ( ) We split T x into the signal v0⊗ , x⊗ and noise A x components and use our hypothesized SoS upper bound on the noise.
¾˜ ( ) ·(¾˜ h 3 3i) + ¾˜ ( ) ·(¾˜ h 3 3i) + T x τ v0⊗ , x⊗ A x 6 τ v0⊗ , x⊗ ετ .
37 h 3 3i h i3 Rewriting v0⊗ , x⊗ as v0, x , we obtain
1 ¾˜ hv , xi3 > · ¾˜ T(x) − ε . 0 τ
Finally, there exists a pseudo-distribution that achieves ¾˜ T(x) > τ − ετ.
Indeed, the trivial distribution giving probability 1 to v0 is such a pseudo- distribution:
T(v0) τ + A(v0) > τ − ετ.
Putting it together,
1 (1 − ε)τ ¾˜ hv , xi3 > · ¾˜ T(x) − ε > − ε 1 − O(ε) . 0 τ τ
Proof of Theorem 2.4.3. We first address Algorithm 2.4.1. Let τ, T, T0 be as in the theorem statement. By Lemma 2.4.4, it will be enough to show that with high
4 3 4 probability every degree-4 pseudo-distribution {x} has ¾˜ A(x) 6 ε0τ ·(¾˜ kxk ) / and 1 ¾˜ Aπ(x) 6 ε τ ·(¾˜ kxk4)3 4 for some ε Θ(ε). By Corollary 2.3.4 and our 3 0 / 0 S assumptions on τ this happens for each permutation Aπ individually with high
π probability, so a union bound over A for π ∈ S3 completes the proof.
Turning to Algorithm 2.4.2, the simple fact that SoS only certifies true upper bounds implies that the algorithm is never wrong when it outputs certify. It is not hard to see that whenever Algorithm 2.4.1 has succeeded in recovering v because ¾˜ A(x) is bounded, which as above happens with high probability, Algorithm 2.4.2 will output certify.
38 2.4.1 Semi-Random Tensor PCA
We discuss here a modified TPCA model, which will illustrate the qualitative differences between the new tensor PCA algorithms we propose in this paper and previously-known algorithms. The model is semi-random and semi-adversarial. Such models are often used in average-case complexity theory to distinguish between algorithms which work by solving robust maximum-likelihood-style problems and those which work by exploiting some more fragile property of a particular choice of input distribution.
· 3 + Problem 2.4.5 (Tensor PCA in the Semi-Random Model). Let T τ v0⊗ A, n n n where v0 ∈ and A has independent entries from N(0, 1). Let Q ∈ × with
1 4 k Id −Qk 6 O(n− / ), chosen adversarially depending on T. Let T0 be the 3-tensor whose n2 × n matrix flattening is TQ. (That is, each row of T has been multiplied by a matrix which is close to identity.) On input T0, recover v.
Here we show that Algorithm 2.4.1 succeeds in recovering v in the semi- random model.
Theorem 2.4.6. Let T0 be the semi-random-model tensor PCA input, with τ >
3 4 1 4 n / log(n) / /ε. With high probability over randomness in T0, Algorithm 2.4.1 outputs a vector v with hv, v0i > 1 − O(ε).
( − · 3) Proof. By Lemma 2.4.4, it will suffice to show that B : T0 τ v0⊗ has 4 3 4 ¾˜ B(x) 6 ε0τ ·(¾˜ kxk ) / for any degree-4 pseudo-distribution {x}, for some
ε0 Θ(ε). We rewrite B as
T B (A + τ · v0(v0 ⊗ v0) )(Q − Id) + A
39 where A has independent entries from N(0, 1). Let {x} be a degree-4 pseudo-
2 T distribution. Let f (x) hx⊗ , (A + τ · v0(v0 ⊗ v0) )(Q − Id)xi. By Corollary 2.3.4,
3 4 1 4 4 3 4 ¾˜ B(x) ¾˜ f (x) + O(n / log(n) / )(¾˜ kxk ) / with high probability. By triangle inequality and sub-multiplicativity of the operator norm, we get that with high probability
3 4 k(A + τ · v0(v0 ⊗ v0))(Q − Id)k 6 (kAk + τ)kQ − Id k 6 O(n / ) , where we have also used Lemma A.2.4 to bound kAk 6 O(n) with high probability and our assumptions on τ and kQ − Id k. By an argument similar to that in the proof of Theorem 2.3.2 (which may be found in Lemma A.1.6), this yields
3 4 4 3 4 ¾˜ f (x) 6 O(n / )(¾˜ kxk ) / as desired.
2.5 Linear Time Recovery via Further Relaxation
We now attack the problem of speeding up the algorithm from the preceding section. We would like to avoid solving a large semidefinite program to optimality: our goal is to instead use much faster linear-algebraic computations—in particular, we will recover the tensor PCA signal vector by performing a single singular vector computation on a relatively small matrix. This will complete the proofs of Theorem 2.1.7 and Theorem 2.1.6, yielding the desired running time.
Our SoS algorithm in the preceding section turned on the existence of the Í ⊗ λ-boundedness certificate i Ai Ai, where Ai are the slices of a random tensor · 3 + A. Let T τ v0⊗ A be the spiked-tensor input to tensor PCA. We could look Í ⊗ ( ) at the matrix i Ti Ti as a candidate λ-boundedness certificate for T x . The Í ⊗ spectrum of this matrix must not admit the spectral bound that i Ai Ai does, because T(x) is not globally bounded: it has a large global maximum near the
40 signal v. This maximum plants a single large singular value in the spectrum of Í ⊗ i Ti Ti. The associated singular vector is readily decoded to recover the signal.
Before stating and analyzing this fast linear-algebraic algorithm, we situate it more firmly in the SoS framework. In the following, we discuss spectral SoS, a convex relaxation of Problem 2.1.2 obtained by weakening the full-power SoS Í ⊗ relaxation. We show that the spectrum of the aforementioned i Ti Ti can be viewed as approximately solving the spectral SoS relaxation. This gives the fast, certifying algorithm of Theorem 2.1.7. We also interpret the tensor unfolding algorithm given by Montanari and Richard for TPCA in the spiked tensor model as giving a more subtle approximate solution to the spectral SoS relaxation. We prove a conjecture by those authors that the algorithm successfully recovers the TPCA signal at the same signal-to-noise ratio as our other algorithms, up to a small pre-processing step in the algorithm; this proves Theorem 2.1.6[ 60]. This last algorithm, however, succeeds for somewhat different reasons than the others, and we will show that it consequently fails to certify its own success and that it is not robust to a certain kind of semi-adversarial choice of noise.
2.5.1 The Spectral SoS Relaxation
The SoS Algorithm: Matrix View
To obtain spectral SoS, the convex relaxation of Problem 2.1.2 which we will be able to (approximately) solve quickly in the random case, we first need to return to the full-strength SoS relaxation and examine it from a more linear-algebraic standpoint.
41 We have seen in Section 2.2.2 that a homogeneous p ∈ [x]2d may be represented as an nd × nd matrix whose entries correspond to coefficients of p.A
2 d 2 similar fact is true for non-homogeneous p. Let #tuples(d) 1+ n + n + ··· + n / .
6d 2 0 2 d 2 Let x⊗ / : (x⊗ , x, x⊗ ,..., x⊗ / ). Then p ∈ [x]6d can be represented as an #tuples(d) × #tuples(d) matrix; we say a matrix M of these dimensions is a
6 d 2 6 d 2 matrix representation of p if hx ⊗ / , Mx ⊗ / i p(x). For this section, we let
Mp denote the set of all such matrix representation of p.
A degree-d pseudo-distribution {x} can similarly be represented as an
#tuples d #tuples d ( )× ( ) matrix. We say that M is a matrix representation for {x} if M[α, β] ¾˜ xαxβ wheneverα and β are multi-indices with |α|, |β| 6 d.
{ } ∈ Formulated this way, if M x is the matrix representation of x and Mp { } M ∈ [ ] ( ) h i p for some p x 62d, then ¾˜ p x M x , Mp . In this sense, pseudo- { } distributions and polynomials, each represented as matrices, are dual under the trace inner product on matrices.
We are interested in optimization of polynomials over the sphere, and we have been looking at pseudo-distribution {x} satisfying {kxk2 − 1 0}. From this matrix point of view, the polynomial kxk2 − 1 corresponds to a vector
#tuples d T w ∈ ( ) (in particular, the vector w so that ww is a matrix representation of (kxk2 − 1)2), and a degree-4 pseudo-distribution {x} satisfies {kxk2 − 1 0} if ∈ and only if w ker M x . { }
A polynomial may have many matrix representations, but a pseudo- distribution has just one: a matrix representation of a pseudo-distribution must obey strong symmetry conditions in order to assign the same pseudo- expectation to every representation of the same polynomial. We will have much
42 more to say about constructing matrices satisfying these symmetry conditions when we state and prove our lower bounds, but here we will in fact profit from relaxing these symmetry constraints.
∈ [ ] Let p x 62d. In the matrix view, the SoS relaxation of the problem ( ) max x 21 p x is the following convex program. k k
max min hM, Mpi . (2.5.1) M:w ker M Mp p M∈ 0 ∈M M, 1 1 h M i It may not be immediately obvious why this program optimizes only over M which are matrix representations of pseudo-distributions. If, however, some M h i −∞ does not obey the requisite symmetries, then minMp p M, Mp , since the ∈M asymmetry may be exploited by careful choice of Mp ∈ Mp. Thus, at optimality this program yields M which is the matrix representation of a pseudo-distribution
{x} satisfying {kxk2 − 1 0}.
Relaxing to the Degree-4 Dual
We now formulate spectral SoS. In our analysis of full-power SoS for tensor PCA we have primarily considered pseudo-expectations of homogeneous degree-4 polynomials; our first step in further relaxing SoS is to project from [x]64 to [x]4.
n2 n2 #tuples 2 #tuples 2 Thus, now our matrices M, M0 will be in × rather than ( )× ( ). The projection of the constraint on the kernel in the non-homogeneous case implies Tr M 1 in the homogeneous case. The projected program is
max min hM, M0i . Tr M1 Mp p M 0 ∈M We modify this a bit to make explicit that the relaxation is allowed to add and subtract arbitrary matrix representations of the zero polynomial; in particular
43 − ∈ M M x 4 Id for any M x 4 x 4 . This program is the same as the one which k k k k k k precedes it.
h − · i + max min M, Mp c M x 4 c . (2.5.2) Tr M1 Mp p k k M 0 ∈M M 4 4 x ∈M x k kc k k ∈
By weak duality, we can interchange the min and the max in (2.5.2) to obtain the dual program:
h − · i h − · i + max min M, Mp c M x 4 6 min max M, Mp c M x 4 c Tr M1 Mp p k k Mp p Tr M1 k k M 0 ∈M ∈M M 0 M 4 4 M 4 4 x ∈M x x ∈M x k kc k k k kc k k ∈ ∈ (2.5.3)
h T − · i + min max vv , Mp c M x 4 c Mp p v 1 k k ∈M k k M 4 4 x ∈M x k kc k k ∈ (2.5.4)
( ) We call this dual program the spectral SoS relaxation of max x 1 p x . If k k Í h i N( ) p i x, Ai x for A with independent entries from 0, 1 , the spectral SoS relaxation achieves the same bound as our analysis of the full-strength SoS
3 2 1 2 relaxation: for such p, the spectral SoS relaxation is at most O(n / log(n) / ) with high probability. The reason is exactly the same as in our analysis of the Í ⊗ full-strength SoS relaxation: the matrix i Ai Ai, whose spectrum we used before to bound the full-strength SoS relaxation, is still a feasible dual solution.
Í ⊗ 2.5.2 Recovery via the i Ti Ti Spectral SoS Solution
· 3 + Let T τ v0⊗ A be the spiked-tensor input to tensor PCA. We know from our initial characterization of SoS proofs of boundedness for degree-3 polynomials
44 ( ) ( ⊗ )T(Í ⊗ )( ⊗ ) that the polynomial T0 x : x x i Ti Ti x x gives SoS-certifiable upper bounds on T(x) on the unit sphere. We consider the spectral SoS relaxation of ( ) max x 1 T0 x , k k k − · k + min MT x c M x 4 c . MT x T x ( ) k k ( )∈M ( ) M 4 4 x ∈M x k kc k k ∈ ∈ M Our goal now is to guess a good M0 T x . We will take as our dual-feasible ( ) Í ⊗ − Í ⊗ solution the top singular vector of i Ti Ti ¾ i Ai Ai. This is dual feasible h 2 ( Í ⊗ ) 2i k k4 with c n, since routine calculation gives x⊗ , ¾ i Ai Ai x⊗ x . This Í ⊗ top singular vector, which differentiates the spectrum of i Ti Ti from that of Í ⊗ ( ) i Ai Ai, is exactly the manifestation of the signal v0 which differentiates T x from A(x). The following algorithm and analysis captures this.
Í ⊗ Recovery and Certification with i Ti Ti
· 3 + ∈ n Input: T τ v0⊗ A, where v0 and A is a 3-tensor. n Goal: Find v ∈ with |hv, v0i| > 1 − o(1).
Algorithm 2.5.1 (Recovery). Compute the top (left or right) singular vector v0 of Í ⊗ − Í ⊗ × M : i Ti Ti ¾ i Ai Ai. Reshape v0 into an n n matrix V0. Compute the top singular vector v of V0. Output v/kvk.
3 Algorithm 2.5.2 (Certification). Run Algorithm 2.5.1 to obtain v. Let S : T − v⊗ . Compute the top singular value λ of
Õ Õ Si ⊗ Si − ¾ Ai ⊗ Ai . i i
3 2 1 2 If λ 6 O(n / log(n) / ), output certify. Otherwise, output fail.
The following theorem describes the behavior of Algorithm 2.5.1 and Algo- rithm 2.5.2 and gives a proof of Theorem 2.1.7 and Corollary 2.1.7.
45 · 3 + Theorem 2.5.3 (Formal version of Theorem 2.1.7). Let T τ v0⊗ A, where n v0 ∈ and A has independent entries from N(0, 1). In other words, we are given an
3 4 1 4 instance of Problem 3.6.1. Let τ > n / log(n) / /ε. Then:
2 — With high probability, Algorithm 2.5.1 returns v with hv, v0i > 1 − O(ε).
3 3 4 1 4 — If Algorithm 2.5.2 outputs certify then T(x) 6 τ · hv, xi + O(n / log(n) / ) (regardless of the distribution of A). If A is distributed as above, then Algorithm 2.5.2 outputs certify with high probability.
— Both Algorithm 2.5.1 and Algorithm 2.5.2 can be implemented in time
O(n4 log(1/ε)).
The argument that Algorithm 2.5.1 recovers a good vector in the spiked tensor model comes in three parts: we show that under appropriate regularity Í ⊗ − ⊗ conditions on the noise A that i Ti Ti ¾ Ai Ai has a good singular vector, then that with high probability in the spiked tensor model those regularity conditions hold, and finally that the good singular vector can be used to recover the signal.
· 3 + k Í ⊗ −¾ Í ⊗ Lemma 2.5.4. Let T τ v0⊗ A be an input tensor. Suppose i Ai Ai i Ai k 2 k Í ( ) k Ai 6 ετ and that i v0 i Ai 6 ετ. Then the top (left or right) singular vector v0 2 of M has hv0, v0 ⊗ v0i > 1 − O(ε).
· 3 + N( ) Lemma 2.5.5. Let T τ v0⊗ A. Suppose A has independent entries from 0, 1 . k Í ⊗ − Í ⊗ k ( 3 2 ( )1 2) Then with high probability we have i Ai Ai ¾ i Ai Ai 6 O n / log n / √ k Í ( ) k ( ) and i v0 i Ai 6 O n .
n n2 Lemma 2.5.6. Let v0 ∈ and v0 ∈ be unit vectors so that hv0, v0 ⊗v0i > 1−O(ε).
Then the top right singular vector v of the n × n matrix folding V0 of v0 satisfies hv, v0i > 1 − O(ε).
46 A similar fact to Lemma 2.5.6 appears in [60].
The proofs of Lemma 2.5.4 and Lemma 2.5.6 follow here. The proof of Lemma 2.5.5 uses only standard concentration of measure arguments; we defer it to Section A.2.
Proof of Lemma 2.5.4. We expand M as follows. Õ 2 ·( 3) ⊗ ( 3) + · (( 3) ⊗ + ⊗ ( 3) ) + ⊗ − ¾ ⊗ M τ v0⊗ i v0⊗ i τ v0⊗ i Ai Ai v0⊗ i Ai Ai Ai Ai i Õ Õ 2 ·( ⊗ )( ⊗ )T + · T ⊗ ( ) + · ( ) ⊗ T + ⊗ − ¾ ⊗ τ v0 v0 v0 v0 τ v0v0 v0 i Ai τ v0 i Ai v0v0 Ai Ai Ai Ai . i i k Í ⊗ By assumption, the noise term is bounded in operator norm: we have i Ai − ¾ Í ⊗ k 2 k · T ⊗ Ai i Ai Ai 6 ετ . Similarly, by assumption the cross-term has τ v0v0 Í ( ) k 2 i v0 i Ai 6 ετ .
·Õ (( 3) ⊗ + ⊗( 3) ) ·Õ ( ) ( T ⊗ + ⊗ T) τ Pu⊥ v0⊗ i Ai Ai v0⊗ i Pu⊥ τ v0 i Pu⊥ v0v0 Ai Ai v0v0 Pu⊥ . i i All in all, by triangle inequality,
Õ Õ · T ⊗ ( ) + · ( ) ⊗ T + ⊗ − ¾ ⊗ ( 2) τ v0v0 v0 i Ai τ v0 i Ai v0v0 Ai Ai Ai Ai 6 O ετ . i i Again by triangle inequality,
T 2 2 kMk > (v0 ⊗ v0) M(v0 ⊗ v0) τ − O(ετ ) .
Let u, w be the top left and right singular vectors of M. We have
T 2 2 2 2 u Mw τ · hu, v0 ⊗ v0ihw, v0 ⊗ v0i + O(ετ ) > τ − O(ετ ) , so rearranging gives the result.
Proof of Lemma 2.5.6. Let v0, v0, V0, v, be as in the lemma statement. We know v T is the maximizer of max w , w 1 w V0w0. By assumption, k k k 0k T h ⊗ i − ( ) v0 V0v0 v0, v0 v0 > 1 O ε .
47 Thus, the top singular value of V0 is at least 1 − O(ε), and since kv0k is a unit vector, the Frobenius norm of V0 is 1 and so all the rest of the singular values are
O(ε). Expressing v0 in the right singular basis of V0 and examining the norm of
V0v0 completes the proof.
Proof of Theorem 2.5.3. The first claim, that Algorithm 2.5.1 returns a good vector, follows from the previous three lemmas, Lemma 2.5.4, Lemma 2.5.5, Lemma 2.5.6. Í ⊗ − Í ⊗ The next, for Algorithm 2.5.2, follows from noting that i Si Si ¾ i Ai Ai is a feasible solution to the spectral SoS dual. For the claimed runtime, since we are working with matrices of size n4, it will be enough to show that the top Í ⊗ − Í ⊗ singular vector of M and the top singular value of i Si Si ¾ i Ai Ai can be recovered with O(poly log(n)) matrix-vector multiplies.
In the first case, we start by observing that it is enough to find a vector w which has hw, v0i > 1 − ε, where v0 is a top singular vector of M. Let λ1, λ2 be the top two singular values of M. The analysis of the algorithm already showed that λ1/λ2 > Ω(1/ε). Standard analysis of the matrix power method now yields that O(log(1/ε)) iterations will suffice.
Í ⊗ − Í ⊗ We finally turn to the top singular value of i Si Si ¾ i Ai Ai. Here the matrix may not have a spectral gap, but all we need to do is ensure that
3 2 1 2 the top singular value is no more than O(n / log(n) / ). We may assume that
3 2 1 2 some singular value is greater than O(n / log(n) / ). If all of them are, then a single matrix-vector multiply initialized with a random vector will discover this. Otherwise, there is a constant spectral gap, so a standard analysis of matrix power method says that within O(log n) iterations a singular value greater than
3 2 1 2 O(n / log(n) / ) will be found, if it exists.
48 2.5.3 Nearly-Linear-Time Recovery via Tensor Unfolding and
Spectral SoS
· 3 + ∈ n On input T τ v0⊗ A, where as usual v0 and A has independent entries from N(0, 1), Montanari and Richard’s Tensor Unfolding algorithm computes the top singular vector u of the squarest-possible flattening of T into a matrix.
2 It then extracts v with hv, v0i > 1 − o(1) from u with a second singular vector computation.
Recovery with TTT, a.k.a. Tensor Unfolding
· 3 + ∈ n Input: T τ v0⊗ A, where v0 and A is a 3-tensor. n Goal: Find v ∈ with |hv, v0i| > 1 − o(1).
Algorithm 2.5.7 (Recovery). Compute the top eigenvector v of M : TTT. Output v.
2 We show that this algorithm successfully recovers a vector v with hv, v0i >
3 4 1 − O(ε) when τ > n / /ε. Montanari and Richard conjectured this but were only able to show it when τ > n. We also show how to implement the algorithm in time O˜ (n3), that is to say, in time nearly-linear in the input size.
Despite its a priori simplicity, the analysis of Algorithm 2.5.7 is more subtle than for any of our other algorithms. This would not be true for even-order tensors, for which the square matrix unfolding tensor has one singular value asymptotically larger than all the rest, and indeed the corresponding singular vector is well-correlated with v0. However, in the case of odd-order tensors the unfolding has no spectral gap. Instead, the signal v0 has some second-order effect
49 on the spectrum of the matrix unfolding, which is enough to recover it.
We first situate this algorithm in the SoS framework. In the previous section Í ⊗ − Í ⊗ we examined the feasible solution i Ti Ti ¾ i Ai Ai to the spectral SoS ( ) relaxation of max x 1 T x . The tensor unfolding algorithm works by examining k k the top singular vector of the flattening T of T, which is the top eigenvector of the n × n matrix M TTT, which in turn has the same spectrum as the n2 × n2 matrix
TTT. The latter is also a feasible dual solution to the spectral SoS relaxation of ( ) ( ) max x 1 T x . However, the bound it provides on max x 1 T x is much worse k k k k Í ⊗ than that given by i Ti Ti. The latter, as we saw in the preceding section, gives 3 4 1 4 the bound O(n / log(n) / ). The former, by contrast, gives only O(n), which is the operator norm of a random n2 × n matrix (see Lemma A.2.4). This n versus
3 4 n / is the same as the gap between Montanari and Richard’s conjectured bound and what they were able to prove.
3 4 Theorem 2.5.8. For an instance of Problem 3.6.1 with τ > n / /ε, with high prob-
2 ability Algorithm 2.5.7 recovers a vector v with hv, v0i > 1 − O(ε). Furthermore, Algorithm 2.5.7 can be implemented in time O˜ (n3).
· 3 + ∈ n Lemma 2.5.9. Let T τ v0⊗ A where v0 is a unit vector, so an instance of T Problem 3.6.1. Suppose A satisfies A A C · Idn n +E for some C > 0 and E with × 2 T kEk 6 ετ and that kA (v0 ⊗ v0)k 6 ετ. Let u be the top left singular vector of the
2 matrix T. Then hv0, ui > 1 − O(ε).
Proof. The vector u is the top eigenvector of the n × n matrix TTT, which is also the top eigenvector of M : TTT − C · Id. We expand:
T T 2 · T + · ( ⊗ )T + · T( ⊗ ) T + u Mu u τ v0v0 τ v0 v0 v0 A τ A v0 v0 v0 E u 2 · h i2 + T · ( ⊗ )T + · T( ⊗ ) T + τ u, v0 u τ v0 v0 v0 A τ A v0 v0 v0 E u
50 2 2 2 6 τ hu, v0i + O(ετ ) .
T T 2 − ( 2) Again by triangle inequality, u Mu > v0 Mv τ O ετ . So rearranging we 2 get hu, v0i > 1 − O(ε) as desired.
The following lemma is a consequence of standard matrix concentration inequalities; we defer its proof to Section A.2, Lemma A.2.10.
n Lemma 2.5.10. Let A have independent entries from N(0, 1). Let v0 ∈ be a unit vector. With high probability, the matrix A satisfies AT A n2 · Id +E for some E with
3 2 T p kEk 6 O(n / ) and kA (v0 ⊗ v0)k 6 O( n log n).
The final component of a proof of Theorem 2.5.8 is to show how it can be implemented in time O˜ (n3). Since M factors as TTT, a matrix-vector multiply by M can be implemented in time O(n3). Unfortunately, M does not have an adequate eigenvalue gap to make matrix power method efficient. As we know from Lemma 2.5.10, suppressing εs and constants, M has eigenvalues in the √ 2 3 2 range n ± n / . Thus, the eigenvalue gap of M is at most 1 O(1 + 1/ n). For
1 2 δ any number k of matrix-vector multiplies with k 6 n / − , the eigenvalue gap √ n1 2 δ will become at most (1 + 1/ n) / − , which is subconstant. To get around this problem, we employ a standard trick to improve spectral gaps of matrices close to C · Id: remove C · Id.
Lemma 2.5.11. Under the assumptions of Theorem 2.5.8, Algorithm 2.5.7 can be implemented in time O˜ (n3) (which is linear in the input size, n3).
Proof. Note that the top eigenvector of M is the same as that of M − n2 · Id. The latter matrix, by the same analysis as in Lemma 2.5.9, is given by
− 2 · 2 · T + M n Id τ v0v0 M0
51 2 2 where kM0k O(ετ ). Note also that a matrix-vector multiply by M − n · Id can still be done in time O(n3). Thus, M − n2 · Id has eigenvalue gap Ω(1/ε), which is enough so that the whole algorithm runs in time O˜ (n3).
Proof of Theorem 2.5.8. Immediate from Lemma 2.5.9, Lemma 2.5.10, and Lemma 2.5.11.
2.5.4 Fast Recovery in the Semi-Random Model
There is a qualitative difference between the aggregate matrix statistics needed by our certifying algorithms (Algorithm 2.4.1, Algorithm 2.4.2, Algorithm 2.5.1,
Algorithm 2.5.2) and those needed by rounding the tensor unfolding solution spectral SoS Algorithm 2.5.7. In a precise sense, the needs of the latter are greater. The former algorithms rely only on first-order behavior of the spectra of a tensor unfolding, while the latter relies on second-order spectral behavior. Since it uses second-order properties of the randomness, Algorithm 2.5.7 fails in the semi-random model.
· 3 + ∈ n Theorem 2.5.12. Let T τ v0⊗ A, where v0 is a unit vector and A has 7 8 independent entries from N(0, 1). There is τ Ω(n / ) so that with high probability
1 4 there is an adversarial choice of Q with kQ − Id k 6 O(n− / ) so that the matrix (TQ)TTQ n2 · Id. In particular, for such τ, Algorithm 2.5.7 cannot recover the signal v0.
T 1 2 Proof. Let M be the n × n matrix M : T T. Let Q n · M− / . It is clear that
T 2 1 4 (TQ) TQ n Id. It suffices to show that kQ − Id k 6 n / with high probability.
52 We expand the matrix M as
2 · T + · ( ⊗ )T + · T( ⊗ ) T + T M τ v0v0 τ v0 v0 v0 A τ A v0 v0 v0 A A .
T 2 3 2 T By Lemma 2.5.10, A A n · Id +E for some E with kEk 6 O(n / ) and kA (v0 ⊗ p v0)k 6 O( n log n), both with high probability. Thus, the eigenvalues of M all
2 1+3 4 lie in the range n ± n / . The eigenvalues of Q in turn lie in the range n 1 1 . ( 2 ± ( 1+3 4))1 2 ( ± ( 1 4))1 2 ± ( 1 4) n O n / / 1 O n− / / 1 O n / Finally, the eigenvalues of Q − Id lie in the range 1 − 1 ±O(n 1 4), so we 1 O n1 4 − / ± ( / ) are done.
The argument that that Algorithm 2.5.1 and Algorithm 2.5.2 still succeed in the semi-random model is routine; for completeness we discuss here the necessary changes to the proof of Theorem 2.5.3. The non-probabilistic certification claims made in Theorem 2.5.3 are independent of the input model, so we show that Algorithm 2.5.1 still finds the signal with high probability and that Algorithm 2.5.2 still fails only with only a small probability.
1 4 3 4 1 4 Theorem 2.5.13. In the semi-random model, ε > n− / and τ > n / log(n) / /ε, with
2 high probability, Algorithm 2.5.1 returns v with hv, v0i > 1−O(ε) and Algorithm 2.5.2 outputs certify.
Proof. We discuss the necessary modifications to the proof of Theorem 2.5.3.
1 4 Since ε > n− / , we have that k(Q − Id)v0k 6 O(ε). It suffices then to show that the probabilistic bounds in Lemma 2.5.5 hold with A replaced by AQ. Note that this means each Ai becomes AiQ. By assumption, kQ ⊗ Q − Id ⊗ Id k 6 O(ε), k Í ⊗ Í ⊗ k so the probabilistic bound on i Ai Ai ¾ i Ai Ai carries over to the Í ( ) semi-random setting. A similar argument holds for i v0 i AiQ, which is enough to complete the proof.
53 2.5.5 Fast Recovery with Symmetric Noise
We suppose now that A is a symmetric Gaussian noise tensor; that is, that A is π ∈ S the average of A0 over all π 3, for some order-3 tensor A0 with iid standard Gaussian entries.
It was conjectured by Montanari and Richard [60] that the tensor unfolding 3 + technique can recover the signal vector v0 in the single-spike model T τv0⊗ A 3 4 with signal-to-noise ratio τ > Ω˜ (n / ) under both asymmetric and symmetric noise.
Our previous techniques fail in this symmetric noise scenario due to lack of independence between the entries of the noise tensor. However, we sidestep that issue here by restricting our attention to an asymmetric block of the input tensor.
The resulting algorithm is not precisely identical to the tensor unfolding algorithm investigated by Montanari and Richard, but is based on tensor unfolding with only superficial modifications.
54 Fast Recovery under Symmetric Noise
· 3 + ∈ n Input: T τ v0⊗ A, where v0 and A is a 3-tensor. n Goal: Find v ∈ with |hv, v0i| > 1 − o(1).
Algorithm 2.5.14 (Recovery). Take X, Y, Z a random partition of [n], and R a
n random rotation of . Let PX, PY, and PZ be the diagonal projectors onto the
3 coordinates indicated by X, Y, and Z. Let U : R⊗ T, so that we have the matrix unfolding U : (R ⊗ R)TRT Using the matrix power method, compute the top singular vectors vX, vY, and vZ respectively of the matrices
T 2 MX : PXU (PY ⊗ PZ)UPX − n /9 · Id
T 2 MY : PYU (PZ ⊗ PX)UPY − n /9 · Id
T 2 MZ : PZU (PX ⊗ PY)UPZ − n /9 · Id .
1 Output the normalization of R− (vX + vY + vZ). Remark 2.5.15 (Implementation of Algorithm 2.5.14 in nearly-linear time.). It is possible to implement each iteration of the matrix power method in Algo- rithm 2.5.14 in linear time. We focus on multiplying a vector by MX in linear time; the other cases follow similarly.
T T T 2 We can expand MX PX RT (R ⊗ R) (PY ⊗ PZ)(R ⊗ R)TR PX − n /9 · Id.
T It is simple enough to multiply an n-dimensional vector by PX, R, R , T, and Id in linear time. Furthermore multiplying an n2-dimensional vector by TT is also a simple linear time operation. The trickier part lies in multiplying an
2 2 2 T n -dimensional vector, say v, by the n -by-n matrix (R ⊗ R) (PY ⊗ PZ)(R ⊗ R).
To accomplish this, we simply reflatten our tensors. Let V be the n-by-n
T T T matrix flattening of v. Then we compute the matrix R PY R · V · R PZ R, and return its flattening back into an n2-dimensional vector, and this will be equal
55 T to (R ⊗ R) (PY ⊗ PZ)(R ⊗ R) v. This equivalence follows by taking the singular Í T Í ⊗ value decomposition V i λi ui wi , and noting that v i λi ui wi.
Lemma 2.5.16. Given a unit vector u ∈ n, a random rotation R over n, and a projection P to an m-dimensional subspace, with high probability
k k2 − / (p / 2 ) PRu m n 6 O m n log m .
Proof. Let γ be a random variable distributed as the norm of a vector in n with entries independently drawn from N(0, 1/n). Then because Gaussian vectors are rotationally invariant and Ru is a random unit vector, the coordinates of γRu are independent and Gaussian in any orthogonal basis.
So γ2kPRuk2 is the sum of the squares of m independent variables drawn p from N(0, 1/n). By a Bernstein inequality, γ2kPRuk2 − m/n 6 O( m/n2 log m) p with high probability. Also by a Bernstein inequality, γ2 − 1 < O( 1/n log n) with high probability.
3 4 Theorem 2.5.17. For τ > n / /ε, with high probability, Algorithm 2.5.14 recovers a vector v with hv, v0i > 1 − O(ε) when A is a symmetric Gaussian noise tensor (as in √ Problem 2.1.3) and ε > log(n)/ n.
Furthermore the matrix power iteration steps in Algorithm 2.5.14 each converge within O˜ (− log(ε)) steps, so that the algorithm overall runs in almost linear time
O˜ (n3 log(1/ε)).
Proof. Name the projections UX : (PY ⊗ PZ)UPX, UY : (PZ ⊗ PX)UPY, and
UZ : (PX ⊗ PY)UPZ.
3 First off, U τ(Rv0)⊗ + A0 where A0 is a symmetric Gaussian tensor (distributed identically to A). This follows by noting that multiplication by
56 3 3 π 3 π R⊗ commutes with permutation of indices, so that (R⊗ B) R⊗ B , where Í π we let B be the asymmetric Gaussian tensor so that A π 3 B . Then ∈S 3 Í π Í ( 3 )π A0 R⊗ π 3 B π 3 R⊗ B . This is identically distributed with A, as ∈S ∈S follows from the rotational symmetry of B.
T Thus UX τ(PY ⊗ PZ)(R ⊗ R)(v0 ⊗ v0)(PX Rv0) + (PY ⊗ PZ)A0PX, and
+ 2/ · T MX n 9 Id UXUX
2 2 2 T τ kPY Rv0k kPZRv0k (PX Rv0)(PX Rv0) (2.5.5)
T T + τ(PX Rv0)(v0 ⊗ v0) (R ⊗ R) (PY ⊗ PZ)A0PX (2.5.6)
T T + τPXA0 (PY ⊗ PZ)(R ⊗ R)(v0 ⊗ v0)(PX Rv0) (2.5.7)
T + PXA0 (PY ⊗ PZ)A0PX . (2.5.8)
k k2 − 1 Let S refer to Expression 2.5.5. By Lemma 2.5.16, PRv0 3 < p O( 1/n log n) with high probability for P ∈ {PX , PY , PZ }. Hence S ( 1 ± (p / )) 2( )( )T k k ( 1 ± (p / )) 2 9 O 1 n log n τ PX Rv0 PX Rv0 and S 27 O 1 n log n τ .
Let C refer to Expression 2.5.6 so that Expression 2.5.7 is CT. Let also
A00 (PY ⊗ PZ)A0PX. Note that, once the identically-zero rows and columns of A00 are removed, A00 is a matrix of iid standard Gaussian entries. Finally, let v00 PY Rv0 ⊗ PZRv0. By some substitution and by noting that kPX Rk 6 1, we
T 2 have that kCk 6 τ kv0v00 A00k. Hence by Lemma A.2.10, kCk 6 O(ετ ).
T Let N refer to Expression 2.5.8. Note that N A00 A00. Therefore by
2 3 2 Lemma 2.5.10, kN − n /9 · Id k 6 O(n / ).
2 2 Thus MX S + C + (N − n /9 · Id), so that kMX − Sk 6 O(ετ ). Since S is rank-one and has kSk > Ω(τ2), we conclude that matrix power iteration converges in O˜ (− log ε) steps.
57 2 The recovered eigenvector vX satisfies hvX , MX vXi > Ω(τ ) and hvX , (MX − ) i ( 2) h i ( 1 ± ( + p / )) 2 S vX 6 O ετ and therefore vX , SvX 27 O ε 1 n log n τ . Substi- 1 tuting in the expression for S, we conclude that hPX Rv0, vXi ( ± O(ε + √3 p 1/n log n)).
The analyses for vY and vZ follow in the same way. Hence
hvX + vY + vZ , Rv0i hvX , PX Rv0i + hvY , PY Rv0i + hvZ , PZRv0i √ p > 3 − O(ε + 1/n log n) .
At the same time, since vX, vY, and vZ are each orthogonal to each other, √ 1 kvX + vY + vZ k 3. Hence with the output vector being v : R− (vX + vY + vZ)/kvX + vY + vZ k, we have
1 p hv, v0i hRv, Rv0i hvX + vY + vZ , Rv0i > 1 − O(ε + 1/n log n) . √3
2.5.6 Numerical Simulations
We report now the results of some simple numerical simulations of the algorithms from this section. In particular, we show that the asymptotic running time differences among Algorithm 2.5.1, Algorithm 2.5.7 implemented naïvely, and the linear-time implementation of Algorithm 2.5.7 are apparent at reasonable values of n, e.g. n 200.
Specifics of our experiments are given in Figure 2.1. We find pronounced differences between all three algorithms. The naïve implementation of Algo- rithm 2.5.7 is markedly slower than the linear implementation, as measured
58 10 1.5 2
5 2 1.0
0 2 key key nearly-linear tensor unfolding 0.5 accel. power method naive tensor unfolding naive power method nearly-optimal spectral SoS -5 2 correlation with signal with correlation 0.0 -10 2 time to achieve correlation 0.9 with input (sec) input with 0.9 correlation achieve to time -15 -0.5 2 4 5 6 7 8 9 10 0 5 10 15 20 2 2 2 2 2 2 2
matrix-vector multiplies instance size
Figure 2.1: Numerical simulation of Algorithm 2.5.1 (“Nearly-optimal spectral SoS” implemented with matrix power method), and two implementations of Algorithm 2.5.7 (“Accelerated power method”/“Nearly-linear tensor unfolding” and “Naive power method”/“Naive tensor unfolding”. Simulations were run in Julia on a Dell Optiplex 7010 running Ubuntu 12.04 with two Intel Core i7 3770 processors at 3.40 ghz and 16GB of RAM. Plots created with Gadfly. Error bars denote 95% confidence intervals. Matrix-vector multiply experiments were conducted with n 200. Reported matrix-vector multiply counts are the average of 50 independent trials. Reported times are in cpu-seconds and are the average of 10 independent trials. Note that both axes in the right-hand plot are log scaled.
either by number of matrix-vector multiplies or processor time. Algorithm 2.5.1
suffers greatly from the need to construct an n2 × n2 matrix; although we do not count the time to construct this matrix against its reported running time, the memory requirements are so punishing that we were unable to collect data
beyond n 100 for this algorithm.
2.6 Lower Bounds
We will now prove lower bounds on the performance of degree-4 SoS on random
instances of the degree-4 and degree-3 homogeneous polynomial maximization problems. As an application, we show that our analysis of degree-4 for Tensor PCA is tight up to a small logarithmic factor in the signal-to-noise ratio.
59 Theorem 2.6.1 (Part one of formal version of Theorem 2.1.5). There is τ Ω(n) and a function η : A 7→ {x} mapping 4-tensors to degree-4 pseudo-distributions satisfying
2 {kxk 1} so that for every unit vector v0, if A has unit Gaussian entries, then, with high ˜ · h i4 + ( ) probability over random choice of A, the pseudo-expectation ¾x η A τ v0, x A x ∼ ( ) 4 is maximal up to constant factors among ¾˜ τ · hv0, yi + A(y) over all degree-4 pseudo- distributions {y} satisfying {kyk2 1}.
Theorem 2.6.2 (Part two of formal version of Theorem 2.1.5). There is τ
3 4 1 4 Ω(n / /(log n) / ) and a function η : A 7→ {x} mapping 3-tensors to degree-4 pseudo-
2 distributions satisfying {kxk 1} so that for every unit vector v0, if A has unit Gaussian entries, then, with high probability over random choice of A, the pseudo- ˜ · h i3 + ( ) expectation ¾x η A τ v0, x A x is maximal up to logarithmic factors among ∼ ( ) 3 2 ¾˜ τ · hv0, yi + A(y) over all degree-4 pseudo-distributions {y} satisfying {k yk 1}.
The existence of the maps η depending only on the random part A of the 3 + tensor PCA input v0⊗ A formalizes the claim from Theorem 2.1.5 that no algorithm can reliably recover v0 from the pseudo-distribution η(A).
Additionally, the lower-bound construction holds for the symmetric noise model also: the input tensor A is symmetrized whereever it occurs in the construction, so it does not matter if it had already been symmetrized beforehand.
The rest of this section is devoted to proving these theorems, which we eventually accomplish in Section 2.6.2.
Discussion and Outline of Proof
Given a random 3-tensor A, we will take the degree-3 pseudo-moments of our ( ) ˜ ( ) η A to be εA, for some small ε, so that ¾x η A A x is large. The main question ∼ ( )
60 is how to give degree-4 pseudo-moments to go with this. We will construct these
T from AA and its permutations as a 4-tensor under the action of S4.
We have already seen that a spectral upper bound on one of these permutations, Í ⊗ i Ai Ai, provides a performance guarantee for degree-4 SoS optimization of degree-3 polynomials. It is not a coincidence that this SoS lower bound depends on the negative eigenvalues of the permutations of AAT. Running the argument for the upper bound in reverse, a pseudo-distribution {x} satisfying {k k2 } ¾˜ ( ) x 2 1 and with A x large must (by pseudo-Cauchy-Schwarz) also ˜ h 2 (Í ⊗ ) 2i T have ¾ x⊗ , i Ai Ai x⊗ large. The permutations of AA are all matrix h 2 (Í ⊗ ) 2i ˜ ( ) representations of that same polynomial, x⊗ , i Ai Ai x⊗ . Hence ¾ A x will be large only if the matrix representation of the pseudo-distribution {x} is well correlated with the permutations of AAT. Since this matrix representation will also need to be positive-semidefinite, control on the spectra of permutations of AAT is therefore the key to our approach.
The general outline of the proof will be as follows:
1. Construct a pseudo-distribution that is well correlated with the permuta- tions of AAT and gives a large value to ¾˜ A(x), but which is not on the unit sphere.
2. Use a procedure modifying the first and second degree moments of the pseudo-distribution to force it onto a sphere, at the cost of violating the
2 condition that ¾˜ p(X) > 0 for all p ∈ [x]62, then rescale so it lives on the unit sphere. Thus, we end up with an object that is no longer a valid pseudo-distribution but a more general linear functional L on polynomials.
3. Quantitatively bound the failure of L to be a pseudo-distribution, and
repair it by statistically mixing the almost-pseudo-distribution with a small
61 amount of the uniform distribution over the sphere. Show that ¾˜ A(x) is still large for this new pseudo-distribution over the unit sphere.
But before we can state a formal version of our theorem, we will need a few facts about polynomials, pseudo-distributions, matrices, vectors, and how they are related by symmetries under actions of permutation groups.
2.6.1 Polynomials, Vectors, Matrices, and Symmetries, Redux
Here we further develop the matrix view of SoS presented in Section 2.5.1.
We will need to use general linear functionals L : [x]64 → on polynomials as an intermediate step between matrices and pseudo-distributions. Like pseudo- distributions, each such linear-functional L has a unique matrix representation
M satisfying certain maximal symmetry constraints. The matrix M is positive- L L semidefinite if and only if L p(x)2 > 0 for every p. If L satisfies this and L 1 1, then L is a pseudo-expectation, and M is the matrix representation of the L corresponding pseudo-distribution.
Matrices for Linear Functionals and Maximal Symmetry
#tuples d #tuples d Let L : [x]6d → . L can be represented as an n ( ) × n ( ) matrix indexed by all d0-tuples over [n] with d0 6 d/2. For tuples α, β, this matrix M is L given by def M [α, β] L xαxβ . L
For a linear functional L : [x]6d → , a polynomial p(x) ∈ [x]6d, and a matrix representation Mp for p we thus have hM , Mpi L p(x). L
62 A polynomial in [x]6d may have many matrix representations, while for us, a linear functional L has just one: the matrix M . This is because in our L definition we have required that M obey the constraints L
α β α β M [α, β] M [α0, β0] when x x x 0 x 0 . L L in order that they assign consistent values to each representation of the same polynomial. We call such matrices maximally symmetric (following Doherty and
Wehner [30]).
We have particular interest in the maximally-symmetric version of the identity matrix. The degree-d symmetrized identity matrix Idsym is the unique maximally symmetric matrix so that
h d 2 sym d 2i k kd x⊗ / , Id x⊗ / x 2 . (2.6.1)
The degree d will always be clear from context.
k kd In addition to being a matrix representation of the polynomial x 2 , the max- imally symmetric matrix Idsym also serves a dual purpose as a linear functional.
We will often be concerned with the expectation operator ¾µ for the uniform distribution over the n-sphere, and indeed for every polynomial p(x) with matrix representation Mp, µ 1 sym ¾ p(x) hId , Mpi , n2 + 2n and so Idsym/(n2 + 2n) is the unique matrix representation of ¾µ.
The Monomial-Indexed (i.e. Symmetric) Subspace
[ ] We will also require vector representations of polynomials. We note that x 6d 2 / #tuples d has a canonical embedding into ( ) as the subspace given by the following
63 family of constraints, expressed in the basis of d0-tuples for d0 6 d/2:
[ ] '{ ∈ #tuples d } x 6d 2 p ( ) such that pα pα if α0 is a permutation of α . / 0
We let Π be the projector to this subspace. For any maximally-symmetric M we have ΠMΠ M, but the reverse implication is not true (for readers familiar with quantum information: any M which has M ΠMΠ is Bose-symmetric, but may not be PPT-symmetric; maximally symmetric matrices are both. See [30] for further discussion.)
[ ] If we restrict attention to the embedding this induces of x d 2 (i.e. the / d 2 homogeneous degree-d/2 polynomials) into n / , the resulting subspace is
d 2 n sometimes called the symmetric subspace and in other works is denoted by ∨ / .
d 2 We sometimes abuse notation and let Π be the projector from n / to the canonical [ ] embedding of x d 2. /
Maximally-Symmetric Matrices from Tensors
The group Sd acts on the set of d-tensors (canonically flattened to matrices
n d 2 n d 2 n d 2 n d 2 b / c × d / e ) by permutation of indices. To any such flattened M ∈ b / c × d / e , we associate a family of maximally-symmetric matrices symM given by ( ) def Õ symM t π · M for all t > 0 . π ∈Sd That is, symM represents all scaled averages of M over different possible flat- tenings of its corresponding d-tensor. The following conditions on a matrix M are thus equivalent: (1) M ∈ symM, (2) M is maximally symmetric, (3) a tensor that flattens to M is invariant under the index-permutation action of Sd, and (4) M may be considered as a linear functional on the space of homogeneous
64 polynomials [x]d. When we construct maximally-symmetric matrices from un-symmetric ones, the choice of t is somewhat subtle and will be important in not being too wasteful in intermediate steps of our construction.
There is a more complex group action characterizing maximally-symmetric #tuples d #tuples d S matrices in ( )× ( ), which projects to the action of d0 under the #tuples d #tuples d nd0 2 nd0 2 projection of ( )× ( ) to / × / . We will never have to work explicitly with this full symmetry group; instead we will be able to construct linear
#tuples d #tuples d functionals on [x]6d (i.e. maximally symmetric matrices in ( )× ( )) by symmetrizing each degree (i.e. each d0 6 d) more or less separately.
2.6.2 Formal Statement of the Lower Bound
We will warm up with the degree-4 lower bound, which is conceptually somewhat simpler.
Theorem 2.6.3 (Degree-4 Lower Bound, General Version). Let A be a 4-tensor and let λ > 0 be a function of n. Suppose the following conditions hold:
Í π — A is significantly correlated with π 4 A . ∈S h Í πi Ω( 4) A, π 4 A > n . ∈S — Permutations have lower-bounded spectrum. ∈ S 2 × 2 1 ( π + ( π)T) π For every π 4, the Hermitian n n unfolding 2 A A of A has no eigenvalues smaller than −λ2.
— Using A as 4th pseudo-moments does not imply that kxk4 is too large.
sym π 2 3 2 For every π ∈ S4, we have hId , A i 6 O(λ n / )
65 — Using A for 4th pseudo-moments does not imply first and second degree moments are too large.
Let L : [x]4 → be the linear functional given by the matrix representation 1 Í π M : 2 2 π 4 A . Let L λ n ∈S
def L k k2 δ2 max x 2 xi x j i,j
def 2 2 δ0 max L kxk x 2 i 2 i
3 2 + 2 ( ) Then n / δ02 n δ2 6 O 1 .
{ } {k k2 } ¾˜ ( ) Then there is a degree-4 pseudo-distribution x satisfying x 2 1 so that A x > Ω(n2/λ2) + Θ(¾µ A(x)).
The degree-3 version of our lower bound requires bounds on the spectra of the flattenings not just of the 3-tensor A itself but also of the flattenings of an h 2 (Í ⊗ ) 2i associated 4-tensor, which represents the polynomial x⊗ , i Ai Ai x⊗ .
Theorem 2.6.4 (Degree-3 Lower Bound, General Version). Let A be a 3-tensor and let λ > 0 be a function of n. Suppose the following conditions hold:
Í π — A is significantly correlated with π 3 A . ∈S h Í πi Ω( 3) A, π 3 A > n . ∈S — Permutations have lower-bounded spectrum.
For every π ∈ S3, we have
1 1 −2λ2·Π Id Π Π(σ·Aπ(Aπ)T+σ2·Aπ(Aπ)T)Π+ Π(σ·Aπ(Aπ)T+σ2·Aπ(Aπ)T)TΠ . 2 2
— Using AAT for 4th moments does not imply kxk4 is too large.
sym π π T 2 2 For every π ∈ S3, we have hId , A (A ) i 6 O(λ n )
66 — Using A and AAT for 3rd and 4th moments do not imply first and second degree moments are too large.
Let π ∈ S3. Let L : [x]4 → be the linear functional given by the matrix 1 Í · T representation M : 2 2 π 4 π0 AA . Let L λ n 0∈S
def 1 δ max hId , Aπi 1 3 2 n n i i λn / × def L k k2 δ2 max x 2 xi x j i,j
def 2 2 1 4 δ0 max L kxk x − L kxk 2 i 2 i n 2
+ 3 2 + 2 ( ) Then nδ1 n / δ02 n δ2 6 O 1 .
{ } {k k2 } Then there is a degree-4 pseudo-distribution x satisfying x 2 1 so that n3 2 ¾˜ A(x) > Ω / + Θ(¾µ A(x)) . λ
Proof of Theorem 2.6.2
We prove the degree-3 corollary; the degree-4 case is almost identical using Theorem 2.6.3 and Lemma A.2.12 in place of their degree-3 counterparts.
Proof. Let A be a 3-tensor. If A satisfies the conditions of Theorem 2.6.4 with
3 4 1 4 λ O(n / log(n) / ), we let η(A) be the pseudo-distribution described there, with
n3 2 ¾˜ A(x) > Ω / + Θ(¾µ A(x)) x η A λ ∼ ( ) If A does not satisfy the regularity conditions, we let η(A) be the uniform distribution on the unit sphere. If A has unit Gaussian entries, then Lemma A.2.11 says that the regularity conditions are satisfied with this choice of λ with high
67 √ √ probability. The operator norm of A is at most O( n), so ¾µ A(x) O( n) (all with high probability) [76]. We have chosen λ and τ so that when the conditions of Theorem 2.6.4 and the bound on ¾µ A(x), obtain,
3 4 3 n / ¾˜ τ · hv0, xi + A(x) > Ω . x η A log(n)1 4 ∼ ( ) / On the other hand, our arguments on degree-4 SoS certificates for random polynomials say with high probability every degree-4 pseudo-distribution {y}
2 3 3 4 1 4 satisfying {kyk 1} has ¾˜ τ · hv, yi + A(y) 6 O(n / log(n) / ). Thus, {x} is nearly optimal and we are done.
2.6.3 In-depth Preliminaries for Pseudo-Expectation Symme-
tries
This section gives the preliminaries we will need to construct maximally- symmetric matrices (a.k.a. functionals L : [x]64 → ) in what follows. For a n2 n2 non-maximally-symmetric M ∈ × under the action of S4 by permutation of indices, the subgroup C3 < S4 represents all the significant permutations whose spectra may differ from one another in a nontrivial way. The lemmas that follow will make this more precise. For concreteness, we take C3 hσi with σ (234), but any other choice of 3-cycle would lead to a merely syntactic change in the proof.
Lemma 2.6.5. Let D8 < S4 be given by D8 h(12), (34), (13)(24)i. Let C3
2 {(), σ, σ } hσi, where () denotes the identity in S4. Then {1h : 1 ∈ D8, h ∈ C3} S4.
Proof. The proof is routine; we provide it here for completeness. Note that C3 is a subgroup of order 3 in the alternating group A4. This alternating group
68 can be decomposed as A4 K4 ·C3, where K4 h(12)(34), (13)(24)i is a normal subgroup of A4. We can also decompose S4 C2 ·A4 where C2 h(12)i and A4 is a normal subgroup of S4. Finally, D8 C2 ·K4 so by associativity,
S4 C2 ·A4 C2 ·K4 ·C3 D8 ·C3.
This lemma has two useful corollaries:
Corollary 2.6.6. For any subset S ⊆ S4, we have {1hs : 1 ∈ D8, h ∈ C3, s ∈ S} S4.
n2 n2 Corollary 2.6.7. Let M ∈ × . Let the matrix M0 be given by
def 1 2 1 2 T M0 Π M + σ · M + σ · M Π + Π M + σ · M + σ · M Π . 2 2
Then M0 ∈ symM.
+ · + 2 · Í · Proof. Observe first that M σ M σ M π 3 π M. For arbitrary ∈C ∈ n2 n2 1 Π Π + 1 Π TΠ 1 Í · N × , we show that 2 N 2 N 8 π 8 π N. First, conjugation ∈D by Π corresponds to averaging M over the group h(12), (34)i generated by interchange of indices in row and column indexing pairs, individually. At the same time, N + NT is the average of M over the matrix transposition permutation group h(13)(24)i. All together,
1 Õ Õ 1 Õ M0 (1h)· M π · M 8 8 1 8 h 3 π 4 ∈D ∈C ∈S and so M0 ∈ symM.
We make an useful observation about the nontrivial permutations of M, in the special case that M AAT for some 3-tensor A.
n2 n Lemma 2.6.8. Let A be a 3-tensor and let A ∈ × be its flattening, where the first and third modes lie on the longer axis and the third mode lies on the shorter axis. Let Ai
69 be the n × n matrix slices of A along the first mode, so that
© A1 ª ® ® A2 ® A ® . . ® . ® ® ® An « ¬ 2 2 Let P : n → n be the orthogonal linear operator so that [Px](i, j) x(j, i). Then ! · T Õ ⊗ 2 · T Õ ⊗ T σ AA Ai Ai P and σ AA Ai Ai . i i
T[( ) ( )] Í (Í ⊗ Proof. We observe that AA j1, j2 , j3, j4 i Aij1 j2 Aij3 j4 and that i Ai )[( ) ( )] Í Ai j1, j2 , j3, j4 i Aij1 j3 Aij2 j4 . Multiplication by P on the right has [(Í ⊗ the effect of switching the order of the second indexing pair, so i Ai ) ][( ) ( )] Í · T Ai P j1, j2 , j3, j4 i Aij1 j4 Aij2 j3 . From this it is easy to see that σ AA ( )· T (Í ⊗ ) 234 AA i Ai Ai P.
Similarly, we have that
( 2 T)[( ) ( )] (( )· T)[( ) ( )] Õ σ AA j1, j2 , j3, j4 243 AA j1, j2 , j3, j4 Aij1 j3 Aij4 j2 , k
2 · T Í ⊗ T from which we see that σ AA i Ai Ai .
Permutations of the Identity Matrix. The nontrivial permutations of Idn2 n2 × are:
Id[(j, k), (j0, k0)] δ(j, k)δ(j0, k0)
σ · Id[(j, k), (j0, k0)] δ(j, j0)δ(k, k0)
2 σ · Id[(j, k), (j0, k0)] δ(j, k0)δ(j0, k) .
70 2 Since (Id +σ · Id +σ · Id) is invariant under the action of D8, we have (Id +σ · Id +σ2 · Id) ∈ symM; up to scaling this matrix is the same as Idsym defined in (2.6.1). We record the following observations:
— Id, σ · Id, and σ2 · Id are all symmetric matrices.
— Up to scaling, Id +σ2 Id projects to identity on the canonical embedding of
[x]2.
— The matrix σ · Id is rank-1, positive-semidefinite, and has Π(σ · Id)Π σ · Id.
— The scaling [1/(n2 + 2n)](Id +σ · Id +σ2 · Id) is equal to a linear functional
µ ¾ : [x]4 → giving the expectation under the uniform distribution over
n 1 the unit sphere S − .
2.6.4 Construction of Initial Pseudo-Distributions
We begin by discussing how to create an initial guess at a pseudo-distribution whose third moments are highly correlated with the polynomial A(x). This initial guess will be a valid pseudo-distribution, but will fail to be on the unit sphere, and so will require some repairing later on. For now, the method of creating this initial pseudo-distribution involves using a combination of symmetrization techniques to ensure that the matrices we construct are well defined as linear functionals over polynomials, and spectral techniques to establish positive-semidefiniteness of these matrices.
71 Extending Pseudo-Distributions to Degree Four
In this section we discuss a construction that takes a linear functional L :
[x]63 → over degree-3 polynomials and yields a degree-4 pseudo-distribution {x}. We begin by reminding the reader of the Schur complement criterion for positive-semidefiniteness of block matrices.
Theorem 2.6.9. Let M be the following block matrix.
BCT def © ª M ® CD « ¬ 1 T where B 0 and is full rank. Then M 0 if and only if D CB− C .
Suppose we are given a linear functional L : [x]63 → with L 1 1.
Let L |1 be L restricted to [x]1 and similarly for L |2 and L |3. We define the following matrices:
∈ n 1 L | — M 1 × is the matrix representation of 1. L | ∈ n n L | — M 2 × is the matrix representation of 2. L | ∈ n2 n L | — M 3 × is the matrix representation of 3. L | ∈ n2 1 — V 2 × is the vector flattening of M 2 . L | L |
#tuples 2 #tuples 2 Consider the block matrix M ∈ ( )× ( ) given by
1 MT VT © 1 2 ª def L | L | ® M M M MT ® , 1 2 3 ® L | L | L | ® ® V 2 M 3 D « L | L | ¬
72 n2 n2 with D ∈ × yet to be chosen. By taking
1 MT 1 B © L | ª C , V 2 M 3 ® L | L | M 1 M 2 « L | L | ¬ we see by the Schur complement criterion that M is positive-semidefinite so
1 T long as D CB− C . However, not any choice of D will yield M maximally symmetric, which is necessary for M to define a pseudo-expectation operator ¾˜ .
We would ideally take D to be the spectrally-least maximally-symmetric
1 T matrix so that D CB− C . But this object might not be well defined, so we instead take the following substitute.
Definition 2.6.10. Let L, B, C as be as above. The symmetric Schur complement D ∈ 1 T Í ·( 1 T) Í ·( 1 T) symCB− C is t π 4 π CB− C for the least t so that t π 4 π CB− C ∈S ∈S 1 T CB− C . We denote by ¾˜ L the linear functional ¾˜ L : [x]64 → whose matrix representation is M with this choice of D, and note that ¾˜ L is a valid degree-4 pseudo-expectation.
Example 2.6.11 (Recovery of Degree-4 Uniform Moments from Symmetric Schur
µ Complement). Let L : [x]63 → be given by L p(x) : ¾ p(x). We show that
µ 1 T 2 ¾˜ L ¾ . In this case it is straightforward to compute that CB− C σ · Id/n . t Π( + · + 2 · )Π 1 Π( · )Π Our task is to pick t > 0 minimal so that n2 Id σ Id σ Id n2 σ Id .
We know that Π(σ · Id)Π σ · Id. Furthermore, Π Id Π Π(σ2 · Id)Π, and
#tuples 4 both are the identity on the canonically-embedded subspace [x]2 in ( ). We have previously observed that σ · Id is rank-one and positive-semidefinite, so
#tuples 4 T let w ∈ ( ) be such that ww σ · Id.
T( + · + 2 · ) k k2 + k k4 + 2 T( · We compute w Id σ Id σ Id w 2 w 2 w 2 2n n and w σ ) k k4 2 2/( 2 + ) Id w w 2 n . Thus t n n 2n is the minimizer. By a previous observation, this yields ¾µ.
73 To prove our lower bound, we will generalize the above example to the case
µ that we start with an operator L : [x]63 → which does not match ¾ on degree-3 polynomials.
Symmetries at Degree Three
We intend on using the symmetric Schur complement to construct a pseudo- distribution from some L : [x]63 → for which L A(x) is large. A good such L L x x x Í Aπ i j k will have i j k correlated with π 3 ijk for all (or many) indices , , . ∈S That is, it should be correlated with the coefficient of the monomial xi x j xk in ( ) L Í π A x . However, if we do this directly by setting xi x j xk π Aijk, it becomes technically inconvenient to control the spectrum of the resulting symmetric Schur complement. To this avoid, we discuss how to utilize a decomposition of M 3 L | into nicer matrices if such a decomposition exists.
1 1 k Lemma 2.6.12. Let L : [x]63 → , and suppose that M (M + ··· + M ) 3 k 3 3 L | L | L | 1 k n2 n for some M ,..., M ∈ × . Let D1,..., Dk be the respective symmetric Schur 3 3 L | L | complements of the family of matrices
1 MT VT © 1 2 ª L | L | ® M M (Mi )T ® . 1 2 3 ® L | L | L | ® i ® V M • 2 3 « L | L | ¬16i6k Then the matrix k 1 MT VT © 1 2 ª def 1Õ L | L | ® M M M (Mi )T ® 1 2 3 ® k L | L | L | ® i1 i ® V M Di 2 3 « L | L | ¬ is positive-semidefinite and maximally symmetric. Therefore it defines a valid pseudo- expectation ¾˜ L. (This is a slight abuse of notation, since the pseudo-expectation defined
74 here in general differs from the one in Definition 2.6.10.)
Proof. Each matrix in the sum defining M is positive-semidefinite, so M 0. Ík Each Di is maximally symmetric and therefore so is i1 Di. We know that Ík i M M is maximally-symmetric, so it follows that M is the matrix 3 i 1 3 L | L | representation of a valid pseudo-expectation.
2.6.5 Getting to the Unit Sphere
Our next tool takes a pseudo-distribution ¾˜ that is slightly off the unit sphere, and corrects it to give a linear functional L : [x]64 → that lies on the unit sphere.
We will also characterize how badly the resulting linear functional deviates
2 from the nonnegativity condition (L p(x) > 0 for p ∈ [x]62) required to be a pseudo-distribution
Definition 2.6.13. Let L : [x]6d → . We define
2 def L p(x) λ L min min µ ( )2 p x 6d 2 ¾ p x ∈ [ ] / where ¾µ p(x)2 is the expectation of p(x)2 when x is distributed according to the uniform distribution on the unit sphere.
Since ¾µ p(x)2 > 0 for all p, we have L p(x)2 > 0 for all p if and only if
λmin L > 0. Thus L on the unit sphere is a pseudo-distribution if and only if
L 1 1 and λmin L > 0.
Lemma 2.6.14. Let ¾˜ : [x]64 → be a valid pseudodistribution. Suppose that:
75 ¾˜ k k4 1. c : x 2 > 1. ¾˜ 2. is close to lying on the sphere, in the sense that there are δ1, δ2, δ02 > 0 so that:
| 1 ¾˜ k k2 − L | (a) c x 2 xi 0 xi 6 δ1 for all i. | 1 ¾˜ k k2 − L | (b) c x 2 xi x j 0 xi x j 6 δ2 for all i , j. | 1 ¾˜ k k2 2 − L 2| (c) c x 2 xi 0 xi 6 δ02 for all i.
Let L : [x]64 → be as follows on homogeneous p:
˜ ¾ 1 if deg p 0 def L p(x) 1 ¾˜ p(x) if deg p 3, 4 c 1 ¾˜ ( )k k2 c p x x 2 if deg p 1, 2 . L L ( )(k k2 − ) ( ) ∈ [ ] L Then satisfies p x x 2 1 0 for all p x x 62 and has λmin > − c 1 − ( ) − ( 3 2) − ( 2) −c O n δ1 O n / δ02 O n δ2.
L ( )(k k2 − ) ∈ [ ] Proof. It is easy to check that p x x 2 1 0 for all p x 62 by expanding the definition of L.
Let the linear functional L0 : [x]64 → be defined over homogeneous polynomials p as
c if deg p 0 L ( ) def 0 p x ¾˜ p(x) if deg p 3, 4 ¾˜ ( )k k2 p x x 2 if deg p 1, 2 .
Note that L0 p(x) c L p(x) for all p ∈ [x]64. Thus λmin L > λmin L0/c, and the kernel of L0 is identical to the kernel of L.
76 (k k2 − ) L L In particular, since x 2 1 is in the kernel of 0, either λmin 0 0 or 2 L0 p(x) λmin L0 min . 2 µ ( )2 p x 62,p x 1 ¾ p x ∈ [ ] ⊥(k k2 − ) Here p ⊥ (kxk2 −1) means that the polynomials p and kxk2 −1 are perpendictular ( ) + Í + Í in the coefficient basis. That is, if p x p0 i pi xi ij pij xi x j, this means Í K ii pii p0. The equality holds because any linear functional on polynomials with (kxk2 − 1) in its kernel satisfies K(p(x) + α(kxk2 − 1))2 K p(x)2 for every
µ α. The functionals L0 and ¾ in particular both satisfy this.
Let ∆ : L0 − ¾˜ , and note that ∆ is nonzero only when evaluated on the degree-1 or -2 parts of polynomials. It will be sufficient to bound ∆, since assuming λmin L0 , 0, ∆p(x)2 + ¾˜ p(x)2 λmin L0 min 2 µ ( )2 p x 62,p x 1 ¾ p x ∈ [ ] ⊥(k k2 − ) ∆p(x)2 > min . 2 µ ( )2 p x 62,p x 1 ¾ p x ∈ [ ] ⊥(k k2 − )
∈ [ ] ( ) + Í + Let p x 62. We expand p in the monomial basis: p x p0 i pi xi Í i,j pij xi x j. Then
!2 ! 2 ( )2 2+ Õ + Õ + Õ + Õ ©Õ ª+©Õ ª p x p0 2p0 pi xi 2p0 pij xi x j pi xi 2 pi xi pij xi x j® pij xi x j® . i ij i i ij ij « ¬ « ¬ An easy calculation gives !2 µ 2 2 2p0 Õ 1 Õ 2 1 Õ Õ 2 Õ 2 ¾ p(x) p + pii + p + © pii + p + p ª . 0 n n i n2 + 2n ij ii® i i i ij i « ¬ ⊥ (k k2 − ) Í The condition p x 2 1 yields p0 i pii. Substituting into the above, we obtain the sum of squares
2 2p 1 Õ 1 Õ Õ ¾µ p(x)2 p2 + 0 + p2 + ©p2 + p2 + p2 ª . 0 n n i n2 + 2n 0 ij ii® i ij i « ¬
77 Without loss of generality we assume ¾µ p(x)2 1, so now it is enough just to
2 bound ∆p(x) . We have assumed that |∆xi | 6 cδ1 and |∆xi x j | 6 cδ2 for i , j and |∆ 2| ∆ − ∆ ( ) xi 6 cδ02. We also know 1 c 1 and p x 0 when p is a homogeneous degree-3 or -4 polynomial. So we expand
Õ Õ Õ ∆ ( )2 2( − ) + ∆ + ∆ + ∆ p x p0 c 1 2p0 pi xi 2p0 pij xi x j pi p j xi x j i ij i,j and note that this is maximized in absolute value when all the signs line up:
!2 Õ Õ Õ Õ Õ |∆ ( )2| 2( − )+ | | | |+ | | © | | + | |ª+ | | + 2 p x 6 p0 c 1 2cδ1 p0 pi 2 p0 cδ2 pij cδ02 pii ® cδ2 pi cδ02 pi . i i,j i i i « ¬
2 ∈ [ ] Í 2 ( − ) We start with the second term. If p0 α for α 0, 1 , then i pi 6 n 1 α by our assumption that ¾µ p(x)2 1. This means that s | | Õ | | Õ 2 p ( − ) ( ) 2cδ1 p0 pi 6 2cδ1 αn pi 6 2cδ1n α 1 α 6 O n cδ1 , i i
2 where we have used Cauchy-Schwarz and the fact max06α61 α(1 − α) (1/2) . The other terms are all similar:
2( − ) − p0 c 1 6 c 1 Õ s Õ | | | | 2 2 ( 2)p ( − ) ( 2) 2 p0 cδ2 pij 6 2cδ2 αn pij 6 2cδ2O n α 1 α 6 O n cδ2 i,j ij s | | Õ | | Õ 2 ( 3 2) 2 p0 cδ02 pii 6 2cδ02 αn pii 6 O n / cδ02 i i !2 Õ | | Õ 2 ( 2) cδ2 pi 6 cδ2n pi 6 O n cδ2 i i Õ 2 ( ) cδ02 pi 6 O n cδ02 , i where in each case we have used Cauchy-Schwarz and our assumption ¾µ p(x)2
1.
78 Putting it all together, we get
∆ −( − ) − ( ) − ( 3 2) − ( 2) λmin > c 1 O n cδ1 O n / cδ02 O n cδ2 .
2.6.6 Repairing Almost-Pseudo-Distributions
Our last tool takes a linear functional L : [x]6d that is “almost” a pseudo- distribution over the unit sphere, in the precise sense that all conditions for being a pseudo-distribution over the sphere are satisfied except that λmin L −ε. The tool transforms it into a bona fide pseudo-distribution at a slight cost to its evaluations at various polynomials.
Lemma 2.6.15. Let L : [x]6d → and suppose that
— L 1 1
L ( )(k k2 − ) ∈ [ ] — p x x 1 0 for all p x 6d 2. −
— λmin L −ε.
Then the operator ¾˜ : [x]6d → given by
def 1 ¾˜ p(x) (L p(x) + ε ¾µ p(x)) 1 + ε is a valid pseudo-expectation satisfying {kxk2 1}.
¾˜ ¾˜ ¾˜ (k k2 − )2 Proof. It will suffice to check that λmin > 0 and that has x 2 1 0 and
¾˜ 1 1. For the first, let p ∈ [x]>2. We have
¾˜ p(x)2 1 ¾0 p(x)2 + ε ¾µ p(x)2 1 > (−ε + ε) > 0 . ¾µ p(x)2 1 + ε ¾µ p(x)2 1 + ε
Hence, λmin ¾˜ > 0.
79 It is straightforward to check the conditions that ¾˜ 1 1 and that ¾˜ satisfies {kxk2 − 1 0}, since ¾˜ is a convex combination of linear functionals that already satisfy these linear constraints.
2.6.7 Putting Everything Together
We are ready to prove Theorem 2.6.3 and Theorem 2.6.4. The proof of Theo- rem 2.6.3 is somewhat simpler and contains many of the ideas of the proof of
Theorem 2.6.4, so we start there.
The Degree-4 Lower Bound
Proof of Theorem 2.6.3. We begin by constructing a degree-4 pseudo-expectation 0 ¾˜ : [x]64 → whose degree-4 moments are biased towards A(x) but which {k k2 − } does not yet satisfy x 2 1 0 .
Let L : [x]64 → be the functional whose matrix representation when L | [ ] → 1 Í π restricted to 4 : x 4 is given by M 2 A , and which is 0 4 4 n π 4 L | |S | ∈S on polynomials of degree at most 3.
0 0 Let ¾˜ : ¾µ +ε L, where ε is a parameter to be chosen soon so that ¾˜ p(x)2 >
0 for all p ∈ [x]62. Let p ∈ [x]62. We expand p in the monomial basis as ( ) + Í + Í p x p0 i pi xi ij pij xi x j. Then
1 Õ ¾µ p(x)2 > p2 . n2 ij ij
π By our assumption on negative eigenvalues of A for all π ∈ S4, we know that L ( )2 λ2 Í 2 / 2 ¾˜ 0 ¾˜ µ + L/ 2 p x > −n2 ij pij. So if we choose ε 6 1 λ , the operator λ
80 0 will be a valid pseudo-expectation. Moreover ¾˜ is well correlated with A, since it was obtained by maximizing the amount of L, which is simply the ¾˜ 0 k k4 (maximally-symmetric) dual of A. However the calculation of x 2 shows that this pseudo-expectation is not on the unit sphere, though it is close. Let c refer to
0 1 1 Õ c : ¾˜ kxk4 ¾µ kxk4+ L kxk4 + h sym Aπi +O(n 1 2) 2 2 2 2 1 2 2 Id , 1 − / . λ |S4|n λ π 4 ∈S
0 We would like to use Lemma 2.6.14 together with ¾˜ to obtain some L1 [ ] → k k2 − L1 : x 64 with x 2 1 in its kernel and bounded λmin while still maintaining a high correlation with A. For this we need ξ1, ξ2, ξ20 so that
0 0 1 ¾˜ kxk2x − ¾˜ x i — c 2 i i 6 ξ1 for all . 0 0 1 ¾˜ kxk2x x − ¾˜ x x i j — c 2 i j i j 6 ξ2 for all , . 0 0 1 ¾˜ kxk2x2 − ¾˜ x2 i — c 2 i i 6 ξ20 for all .
0 Since ¾˜ p(x) 0 for all homogeneous odd-degree p, we may take ξ1 0. For
ξ2, we have that when i , j,
1 0 2 0 1 2 ¾˜ kxk xi x j − ¾˜ xi x j L kxk xi x j 6 δ2 , c 2 cλ2 2 where we recall δ2 and δ02 defined in the theorem statement. Finally, for ξ20 , we have
1 0 2 2 0 2 1 2 2 1 µ 2 2 µ 2 c 1 ¾˜ kxk x − ¾˜ x 6 L kxk x + ¾ kxk x − ¾ x 6 δ0 + − . c 2 i i cλ2 2 i c 2 i i 2 cn
L1 [ ] → k k2 − Thus, Lemma 2.6.14 yields : x 64 with x 2 1 in its kernel in the L1 ( )(k k2 − ) ∈ [ ] sense that p x x 2 1 0 for all p x 62. If we take ξ2 δ2 and + c 1 L1 − c 1 − 2 − 3 2( + c 1 ) − ( ) ξ20 δ02 cn− , then λmin > −c n δ2 n / δ02 cn− O 1 . Furthermore, L1 ( ) 1 L ( ) Θ( 1 L ( )) A x cλ2 A x λ2 A x .
81 So by Lemma 2.6.15, there is a degree-4 pseudo-expectation ¾˜ satisfying {k k2 } x 2 1 so that
1 ¾˜ A(x) Θ L A(x) + Θ(¾µ A(x)) λ2 ! 1 Õ Θ hA Aπi + Θ(¾µ A(x)) 2 2 , |S4|n λ π 4 ∈S n2 > Ω + Θ(¾µ A(x)) . λ2
The Degree-3 Lower Bound
Now we turn to the proof of Theorem 2.6.4.
Proof of Theorem 2.6.4. Let A be a 3-tensor. Let ε > 0 be a parameter to be chosen later. We begin with the following linear functional L : [x]63 → . For any monomial xα (where α is a multi-index of degree at most 3),
¾µ xα xα def if deg 6 2 L xα . ε Í π α 3 2 π 3 Aα if deg x 3 n / ∈S The functional L contains our current best guess at the degree 1 and 2 moments of a pseudo-distribution whose degree-3 moments are ε-correlated with A(x).
The next step is to use symmetric Schur complement to extend L to a degree-4 pseudo-expectation. Note that M 3 decomposes as L | Õ Π π M 3 A L | π 3 ∈S where, as a reminder, Aπ is the n2 × n flattening of Aπ and Π is the projector to
n2 the canonical embedding of [x]2 into . So, using Lemma 2.6.12, we want to
82 find the symmetric Schur complements of the following family of matrices (with notation matching the statement of Lemma 2.6.12):
1 MT VT © 1 2 ª L | L | ® M M ε (ΠAπ)T ® . 1 2 n3 2 ® L | L | / ® ε Π π • ® V 2 3 2 A n / L | π 3 « ¬ ∈S π Since we have the same assumptions on A for all π ∈ S3, without loss of generality we analyze just the case that π is the identity permutation, in which case Aπ A.
Since L matches the degree-one and degree-two moments of the uniform distribution on the unit sphere, we have M 1 0, the n-dimensional zero vector, L | 1 ∈ n2 2 and M 2 n Idn n. Let w be the n -dimensional vector flattening of L | × T · Idn n. We observe that ww σ Id is one of the permutations of Idn2 n2 . Taking × × B and C as follows,
1 0 © ª ε B C w 3 2 A , 1 ® n / 0 Idn n « n × ¬ we compute that 2 1 T 1 ε T CB− C (σ · Id) + ΠAA Π . n2 n2 Symmetrizing the Id portion and the AAT portion of this matrix separately, we see that the symmetric Schur complement that we are looking for is the ∈ 1 ( · ) + ε2 T spectrally-least M sym n2 σ Id n2 AA so that
t ε2 M 3 Idsym + Π(AAT + σ · AAT + σ2 · AAT)Π + Π(AAT + σ · AAT + σ2 · AAT)TΠ n2 2 1 ε2 (σ · Id) + ΠAATΠ . n2 n2
Here we have used Corollary 2.6.7 and Corollary 2.6.6 to express a general ( ε2 Π TΠ) Π T · T 2 · T element of sym n2 AA in terms of , AA , σ AA , and σ AA .
83 Any spectrally small M satisfying the above suffices for us. Taking t 1, canceling some terms, and making the substitution 3 Idsym −σ · Id 2Π Id Π, we see that it is enough to have
ε2 ε2 −2 Π Id Π Π(σ · AAT + σ2 · AAT)Π + Π(σ · AAT + σ2 · AAT)TΠ , 2 2 which by the premises of the theorem holds for ε 1/λ. Pushing through our symmetrized Schur complement rule with our decomposition of M 3 L | 0 (Lemma 2.6.12), this ε yields a valid degree-4 pseudo-expectation ¾˜ : [x]64 → 0 0 . From our choice of parameters, we see that ¾˜ |4, the degree-4 part of ¾˜ , is ¾˜ 0 | n2+2n ¾µ + L L [ ] → given by 4 n2 , where : x 4 is as defined in the theorem 0 statement. Furthermore, ¾˜ p(x) ¾µ p(x) for p with deg p 6 2.
¾˜ 0 k k4 We would like to know how big x 2 is. We have 0 1 1 c : ¾˜ kxk4 1 + ¾µ kxk4 + L kxk4 1 + + L kxk4 . 2 n 2 2 n 2
We have assumed that hIdsym, AATi 6 O(λ2n2). Since Idsym is maximally h sym Í · Ti h sym |S | Ti symmetric, we have Id , π 4 π AA Id , 4 AA and so ∈S 1 1 Õ L k k4 h sym i Θ(h sym · Ti) ( ) x 2 Id , M 4 Id , π AA 6 O 1 . λ2n2 L | n2λ2 π 4 ∈S
h Í πi Finally, our assumptions on A, π 3 A yield ∈S 0 ε Õ n3 2 ¾˜ A(x) hA, Aπi > Ω / . 3 2 n / λ π 3 ∈S
We have established the following lemma.
Lemma 2.6.16. Under the assumptions of Theorem 2.6.4 there is a degree-4 pseudo- 0 expectation operator ¾˜ so that
¾˜ 0 k k4 + ( ) — c : x 2 1 O 1 .
84 0 3 2 — ¾˜ A(x) > Ω(n / /λ).
0 µ — ¾˜ p(x) ¾ p(x) for all p ∈ [x]62.
¾˜ 0 | ( + 1 ) ¾µ | + L — 4 1 n 4 .
0 Now we would like feed ¾˜ into Lemma 2.6.14 to get a linear functional L1 [ ] → k k2 − : x 64 with x 2 1 in its kernel (equivalently, which satisfies {k k4 − } x 2 1 0 ), but in order to do that we need to find ξ1, ξ2, ξ20 so that
0 0 1 ¾˜ kxk2x − ¾˜ x i — c 2 i i 6 ξ1 for all . 0 0 1 ¾˜ kxk2x x − ¾˜ x x i j — c 2 i j i j 6 ξ2 for all , . 0 0 1 ¾˜ kxk2x2 − ¾˜ x2 i — c 2 i i 6 ξ20 for all .
0 0 For ξ1, we note that for every i, ¾˜ xi 0 since ¾˜ matches the uniform distribution 0 0 0 1 ¾˜ kxk2x − ¾˜ x 1 ¾˜ kxk2x on degree one and two polynomials. Thus, c 2 i i c 2 i .
0 We know that M 0 , the matrix representation of the degree-3 part of ¾˜ , is ¾˜ 3 | 1 ˜ 0 k k2 3 2 A. Expanding ¾ x xi with matrix representations, we get 3 n λ 2 |S | / 0 1 Õ 1 ¾˜ kxk2x hId , A i 6 δ , c 2 i |S | 3 2 n n i 1 3 cn / λ × π 3 ∈S where δ1 is as defined in the theorem statement.
L Now for ξ2 and ξ20 . Let be the operator in the theorem statement. By the 0 definition of ¾˜ , we get
0 1 ¾˜ | 6 1 + ¾µ | + L . 4 n 4
In particular, for i , j, 0 1 ¾˜ 0 k k2 − ¾˜ 1 L k k2 c x 2 xi x j xi x j c x 2 xi x j 6 δ2 .
85 For i j, 0 µ 1 1 ¾˜ kxk2x2 − ¾˜ x2 1 L kxk2x2 + 1 + ¾µ kxk2x2 − c ¾µ x2 c 2 i i c 2 i n 2 i i 1 1 1 L kxk2x2 − 1 L kxk4 + 1 L kxk4 + 1 + ¾µ kxk4 − c ¾µ x2 c 2 i n 2 n 2 n n 2 i
1 L k k2 2 − 1 L k k4 + 1 ¾˜ 0 k k4 − ¾µ 2 c x 2 xi n x 2 n x 2 c xi
1 L k k2 2 − 1 L k k4 c x 2 xi n x 2
6 δ02 .
¾˜ 0 k k4 + ( ) Thus, we can take ξ1 δ1, ξ2 δ2, ξ20 δ02, and c x 2 1 O 1 , and apply Lemma 2.6.14 to conclude that
L1 − c 1 − ( ) − ( 3 2) − ( 2) − ( ) λmin > −c O n ξ1 O n / ξ20 O n ξ2 O 1 .
The functional L1 loses a constant factor in the value assigned to A(x) as compared 0 to ¾˜ : 0 ¾˜ A(x) n3 2 L1 A(x) > Ω / . c λ
Now using Lemma 2.6.15, we can correct the negative eigenvalue of L1 to get a pseudo-expectation def ¾˜ Θ(1)L1 +Θ(1) ¾µ .
¾˜ {k k2 } By Lemma 2.6.15, the pseudo-expectation satisfies x 2 1 . Finally, to complete the proof, we have:
n3 2 ¾˜ A(x) Ω / + Θ(1) ¾µ A(x) . λ
86 2.7 Higher-Order Tensors
We have heretofore restricted ourselves to the case k 3 in our algorithms for the sake of readability. In this section we state versions of our main results for general k and indicate how the proofs from the 3-tensor case may be generalized to handle arbitrary k. Our policy is to continue to treat k as constant with respect to n, hiding multiplicative losses in k in our asymptotic notation.
The case of general odd k may be reduced to k 3 by a standard trick, which we describe here for completeness. Given A an order-k tensor, consider the
β polynomial A(x) and make the variable substitution yβ x for each multi-index
β with |β| (k + 1)/2. This yields a degree-3 polynomial A0(x, y) to which the analysis in Section 2.3 and Section 2.4 applies almost unchanged, now using pseudo-distributions {x, y} satisfying {kxk2 1, kyk2 1}. In the analysis of tensor PCA, this change of variables should be conducted after the input is split into signal and noise parts, in order to preserve the analysis of the
k second half of the rounding argument (to get from ¾˜ hv0, xi to ¾˜ hv0, xi), which then requires only syntactic modifications to Lemma A.1.5. The only other non-syntactic difference is the need to generalize the λ-boundedness results for random polynomials to handle tensors whose dimensions are not all equal; this is already done in Theorem A.2.5.
For even k, the degree-k SoS approach does not improve on the tensor unfolding algorithms of Montanari and Richard [60]. Indeed, by performing a
β similar variable substitution, yβ x for all |β| k/2, the SoS algorithm reduces exactly to the eigenvalue/eigenvector computation from tensor unfolding. If we
β perform instead the substitution yβ x for |β| k/2 − 1, it becomes possible
87 to extract v0 directly from the degree-2 pseudo-moments of an (approximately) optimal degree-4 pseudo-distribution, rather than performing an extra step to k 2 ⊗ / recover v0 from v well-correlated with v0 . Either approach recovers v0 only up to sign, since the input is unchanged under the transformation v0 7→ −v0.
We now state analogues of all our results for general k. Except for the above noted differences from the k 3 case, the proofs are all easy transformations of the proofs of their degree-3 counterparts.
n k 4 1 4 Theorem 2.7.1. Let k be an odd integer, v0 ∈ a unit vector, τ % n / log(n) / /ε, and A an order-k tensor with independent unit Gaussian entries.
1. There is an algorithm, based on semidefinite programming, which on input
k T(x) τ · hv0, xi + A(x) returns a unit vector v with hv0, vi > 1 − ε with high probability over random choice of A.
2. There is an algorithm, based on semidefinite programming, which on input
k k k 4 1 4 T(x) τ · hv0, xi + A(x) certifies that T(x) 6 τ · hv, xi + O(n / log(n) / ) for some unit v with high probability over random choice of A. This guarantees in particular that v is close to a maximum likelihood estimator for the problem of · k + recovering the signal v0 from the input τ v0⊗ A.
3. By solving the semidefinite relaxation approximately, both algorithms can be
1+1 k k implemented in time O˜ (m / ), where m n is the input size.
2 For even k, the above all hold, except now we recover v with hv0, vi > 1 − ε, and the algorithms can be implemented in nearly-linear time.
The next theorem partially resolves a conjecture of Montanari and Richard regarding tensor unfolding algorithms for odd k. We are able to prove their
88 conjectured signal-to-noise ratio τ, but under an asymmetric noise model. They conjecture that the following holds when A is symmetric with unit Gaussian entries.
n k 4 Theorem 2.7.2. Let k be an odd integer, v0 ∈ a unit vector, τ % n / /ε, and A an order-k tensor with independent unit Gaussian entries. There is a nearly-linear-time algorithm, based on tensor unfolding, which, with high probability over random choice of
2 A, recovers a vector v with hv, v0i > 1 − ε.
89 CHAPTER 3 SPECTRAL METHODS FOR THE RANDOM OVERCOMPLETE MODEL
3.1 Introduction
The sum-of-squares (SoS) method (also known as the Lasserre hierarchy) [71, 64, 61, 54] is a powerful, semidefinite-programming based meta-algorithm that applies to a wide-range of optimization problems. The method has been studied extensively for moderate-size polynomial optimization problems that arise for example in control theory and in the context of approximation algorithms for combinatorial optimization problems, especially constraint satisfaction and graph partitioning (see e.g. the survey [21]). For the latter, the SoS method captures and generalizes the best known approximation algorithms based on linear programming (LP), semidefinite programming (SDP), or spectral methods, and in many cases the SoS method is the most promising approach to obtain algorithms with better guarantees—especially in the context of Khot’s Unique Games Conjecture [14].
A sequence of recent works applies the sum-of-squares method to basic problems that arise in unsupervised machine learning: in particular, recovering sparse vectors in linear subspaces and decomposing tensors in a robust way [16, 17, 50, ?, 39]. For a wide range of parameters of these problems, SoS achieves significantly stronger guarantees than other methods, in polynomial or quasi-polynomial time.
Like other LP and SDP hierarchies, the sum-of-squares method comes with a degree parameter d ∈ that allows for trading off running time and solution
90 quality. This trade-off is appealing because for applications the additional utility of better solutions could vastly outweigh additional computational costs. Unfortunately, the computational cost grows rather steeply in terms of the
O d parameter d: the running time is n ( ) where n is the number of variables (usually comparable to the instance size). Further, even when the SDP has size polynomial in the input (when d O(1)), solving the underlying semidefinite programs is prohibitively slow for large instances.
In this work, we introduce spectral algorithms for planted sparse vector, tensor decomposition, and tensor principal components analysis (PCA) that exploit the same high-degree information as the corresponding sum-of-squares algorithms without relying on semidefinite programming, and achieve the same (or close to the same) guarantees. The resulting algorithms are quite simple (a couple of lines of matlab code) and have considerably faster running times—quasi-linear or close to linear in the input size.
A surprising implication of our work is that for some problems, spectral algorithms can exploit information from larger values of the parameter d without
O d spending time n ( ). For example, our algorithm for the planted sparse vector problem runs in nearly-linear time in the input size, even though it uses properties that the sum-of-squares method can only use for degree parameter d > 4. (In particular, the guarantees that the algorithm achieves are strictly stronger than the guarantees that SoS achieves for values of d < 4.)
The initial successes of SoS in the machine learning setting gave hope that techniques developed in the theory of approximation algorithms, specifically the techniques of hierarchies of convex relaxations and rounding convex relaxations, could broadly impact the practice of machine learning. This hope was dampened
91 by the fact that in general, algorithms that rely on solving large semidefinite programs are too slow to be practical for the large-scale problems that arise in machine learning. Our work brings this hope back into focus by demonstrating for the first time that with some care SoS algorithms can be made practical for large-scale problems.
In the following subsections we describe each of the problems that we consider, the prior best-known guarantee via the SoS hierarchy, and our results.
3.1.1 Planted Sparse Vector in Random Linear Subspace
The problem of finding a sparse vector planted in a random linear subspace was introduced by Spielman, Wang, and Wright as a way of learning sparse dictionaries [72]. Subsequent works have found further applications and begun studying the problem in its own right [?, 16, 65]. In this problem, we are given a basis for a d-dimensional linear subspace of n that is random except for one planted sparse direction, and the goal is to recover this sparse direction. The computational challenge is to solve this problem even when the planted vector is only mildly sparse (a constant fraction of non-zero coordinates) and the subspace
Ω 1 dimension is large compared to the ambient dimension (d > n ( )).
Several kinds of algorithms have been proposed for this problem based on linear programming (LP), basic semidefinite programming (SDP), sum-of-squares, and non-convex gradient descent (alternating directions method).
An inherent limitation of simpler convex methods (LP and basic SDP) [72, 25] is that they require the relative sparsity of the planted vector to be polynomial in
92 √ the subspace dimension (less than n/ d non-zero coordinates).
Sum-of-squares and non-convex methods do not share this limitation. They can recover planted vectors with constant relative sparsity even if the subspace
1 2 has polynomial dimension (up to dimension O(n / ) for sum-of-squares [16] and
1 4 up to O(n / ) for non-convex methods [65]).
We state the problem formally:
Problem 3.1.1. Planted sparse vector problem with ambient dimension n ∈ , subspace dimension d 6 n, sparsity ε > 0, and accuracy η > 0. Given an arbitrary
n orthogonal basis of a subspace spanned by vectors v0, v1,..., vd 1 ∈ , where v0 − is a vector with at most εn non-zero entries and v1,..., vd 1 are vectors sampled − independently at random from the standard Gaussian distribution on n, output
n 2 a unit vector v ∈ that has correlation hv, v0i > 1 − η with the sparse vector v0.
Our results. Our algorithm runs in nearly linear time in the input size, and matches the best-known guarantees up to a polylogarithmic factor in the subspace dimension [16].
Theorem 3.1.2. Planted sparse vector in nearly-linear time. There exists an algorithm that, for every sparsity ε > 0, ambient dimension n, and subspace dimension d √ O 1 with d 6 n/(log n) ( ), solves the planted sparse vector problem with high probability
1 4 for some accuracy η 6 O(ε / ) + on (1). The running time of the algorithm is O˜ (nd). →∞
We give a technical overview of the proof in Section 4.2, and a full proof in Section 3.4.
93 Table 3.1: Comparison of algorithms for the planted sparse vector problem with ambient dimen- sion n, subspace dimension d, and relative sparsity ε.
Reference Technique Runtime Largest d Largest√ ε ( / ) Demanet, Hand [?] linear programming poly any√ Ω 1 d Barak, Kelner, Steurer [16] SoS, general SDP poly Ω( n) Ω(1) ˜ ( 2 5) ( 1 4) ( ) Qu, Sun, Wright [65] alternating minimization O n d Ω n√/ Ω 1 this work SoS, partial traces O˜ (nd) Ω˜ ( n) Ω(1)
Previous work also showed how to recover the planted sparse vector exactly. The task of going from an approximate solution to an exact one is a special case of standard compressed sensing (see e.g. [16]).
3.1.2 Overcomplete Tensor Decomposition
Tensors naturally represent multilinear relationships in data. Algorithms for tensor decompositions have long been studied as a tool for data analysis across a wide-range of disciplines (see the early work of Harshman [?] and the survey [53]).
While the problem is NP-hard in the worst-case [?, ?], algorithms for special cases of tensor decomposition have recently led to new provable algorithmic results for several unsupervised learning problems [10, 24, 42, 9] including independent component analysis, learning mixtures of Gaussians [37], Latent Dirichlet topic modeling [4] and dictionary learning [17]. Some earlier algorithms can also be reinterpreted in terms of tensor decomposition [?, ?, ?].
A key algorithmic challenge for tensor decompositions is overcompleteness, when the number of components is larger than their dimension (i.e., the compo- nents are linearly dependent). Most algorithms that work in this regime require tensors of order 4 or higher [55, 24]. For example, the FOOBI algorithm of [55] can recover up to Ω(d2) components given an order-4 tensor in dimension d under
94 mild algebraic independence assumptions for the components—satisfied with high probability by random components. For overcomplete 3-tensors, which arise in many applications of tensor decompositions, such a result remains elusive.
Researchers have therefore turned to investigate average-case versions of the problem, when the components of the overcomplete 3-tensor are random: Given a 3-tensor T ∈ d3 of the form
Õn T ai ⊗ ai ⊗ ai , i1 where a1,..., an are random unit or Gaussian vectors, the goal is to approximately recover the components a1,..., an.
Algorithms based on tensor power iteration—a gradient-descent approach for tensor decomposition—solve this problem in polynomial time when n 6 C · d for any constant C > 1 (the running time is exponential in C)[12]. Tensor power iteration also admits local convergence analyses for up to n 6 Ω˜ (d1.5) components [12, 7]. Unfortunately, these analyses do not give polynomial-time algorithms because it is not known how to efficiently obtain the kind of initializations assumed by the analyses.
Recently, Ge and Ma [39] were able to show that a tensor-decomposition algorithm [17] based on sum-of-squares solves the above problem for n 6 Ω˜ (d1.5)
O log n in quasi-polynomial time n ( ). The key ingredient of their elegant analysis is a subtle spectral concentration bound for a particular degree-4 matrix-valued polynomial associated with the decomposition problem of random overcomplete 3-tensors.
We state the problem formally:
95 Table 3.2: Comparison of decomposition algorithms for overcomplete 3-tensors of rank n in dimension d.
Reference Technique Runtime Largest n Components Anandkumar et al. [12]a tensor power iteration poly C · d incoherent O log n Ω˜ ( 3 2) N( 1 ) Ge, Ma [39] SoS, general SDP n ( ) d / 0, d Idd b ˜ ( 1+ω) Ω˜ ( 4 3) N( 1 ) this work SoS, partial traces O nd d / 0, d Idd a The analysis shows that for every constant C > 1, the running time is polynomial for n 6 C · d components, assuming that the components also satisfy other random-like properties besides incoherence. b Here, ω 6 2.3729 is the constant so that d × d matrices can be multiplied in O(dω) arithmetic operations.
Problem 3.1.3. Random tensor decomposition with dimension d, rank n, and
d accuracy η. Let a1,..., an ∈ be independently sampled vectors from the N( 1 ) ∈( d) 3 Ín 3 Gaussian distribution 0, d Idd , and let T ⊗ be the 3-tensor T i1 ai⊗ .
Single component: Given T sampled as above, find a unit vector b that
has correlation maxi hai , bi > 1 − η with one of the vectors ai.
All components: Given T sampled as above, find a set of unit vectors
{b1,..., bn } such that hai , bii > 1 − η for every i ∈ [n].
Our results. We give the first polynomial-time algorithm for decomposing random overcomplete 3-tensors with up to ω(d) components. Our algorithms
4 3 works as long as the number of components satisfies n 6 Ω˜ (d / ), which comes close to the bound Ω˜ (d1.5) achieved by the aforementioned quasi-polynomial algorithm of Ge and Ma. For the single-component version of the problem, our algorithm runs in time close to linear in the input size.
Theorem 3.1.4. Fast random tensor decomposition. There exist randomized
4 3 O 1 algorithms that, for every dimension d and rank n with d 6 n 6 d / /(log n) ( ), solve the random tensor decomposition problem with probability 1 − o(1) for some accuracy
3 4 1 2 η 6 O˜ (n /d ) / . The running time for the single-component version of the problem is
96 + O˜ (d1 ω), where dω 6 d2.3279 is the time to multiply two d-by-d matrices. The running + time for the all-components version of the problem is O˜ (n · d1 ω).
We give a technical overview of the proof in Section 4.2, and a full proof in
Section 3.5.
We remark that the above algorithm only requires access to the input tensor with some fixed inverse polynomial accuracy because each of its four steps amplifies errors by at most a polynomial factor (see Algorithm 3.5.9). In this sense, the algorithm is robust.
3.1.3 Tensor Principal Component Analysis
The problem of tensor principal component analysis is similar to the tensor decomposition problem. However, here the focus is not on the number of components in the tensor, but about recovery in the presence of a large amount
3 n of random noise. We are given as input a tensor τ · v⊗ + A, where v ∈ is a unit vector and the entries of A are chosen iid from N(0, 1). This spiked tensor model was introduced by Montanari and Richard [67], who also obtained the first algorithms to solve the model with provable statistical guarantees. The spiked tensor model was subsequently addressed by a subset of the present authors [50], who applied the SoS approach to improve the signal-to-noise ratio required for recovery from odd-order tensors.
We state the problem formally:
Problem 3.1.5. Tensor principal components analysis with signal-to-noise ratio
d 3 3 τ and accuracy η. Let T ∈ ( )⊗ be a tensor so that T τ · v⊗ + A, where A is
97 Table 3.3: Comparison of principal component analysis algorithms for 3-tensors in dimension d and with signal-to-noise ratio τ.
Reference Technique Runtime Smallest τ
Montanari, Richard [67] spectral O˜ (d3) n
3 3 4 Hopkins, Shi, Steurer [50] SoS, spectral O˜ (d ) O(n / ) 3 3 4 this work SoS, partial traces O(d ) O˜ (n / ) a tensor with independent standard gaussian entries and v ∈ d is a unit vector.
d Given T, recover a unit vector v0 ∈ such that hv0, vi > 1 − η.
Our results. For this problem, our improvements over the previous results are more modest—we achieve signal-to-noise guarantees matching [50], but with an algorithm that runs in linear time rather than near-linear time (time O(d3) rather than O(d3 polylog d), for an input of size d3).
Theorem 3.1.6. Tensor principal component analysis in linear time. There is an algorithm which solves the tensor principal component analysis problem with
3 4 1 1 2 accuracy η > 0 whenever the signal-to-noise ratio satisfies τ > O(n / · η− · log / n). Furthermore, the algorithm runs in time O(d3).
Though for tensor PCA our improvement over previous work is modest, we include the results here as this problem is a pedagogically poignant illustration of our techniques. We give a technical overview of the proof in Section 4.2, and a full proof in the full version.
3.1.4 Related Work
Foremost, this work builds upon the SoS algorithms of [16, 17, 39, 50]. In each of these previous works, a machine learning decision problem is solved using an
98 SDP relaxation for SoS. In these works, the SDP value is large in the yes case and small in the no case, and the SDP value can be bounded using the spectrum of a specific matrix. This was implicit in [16, 17], and in [50] it was used to obtain a fast algorithm as well. In our work, we design spectral algorithms which use smaller matrices, inspired by the SoS certificates in previous works, to solve these machine-learning problems much faster, with almost matching guarantees.
A key idea in our work is that given a large matrix with information encoded in its spectral gap, one can often efficiently “compress” the matrix to a much smaller one without losing that information. This is particularly true for problems with planted solutions. Thus, we are able to improve running time by replacing
O d k k an n ( )-sized SDP with an eigenvector computation for an n × n matrix, for some k < d.
The idea of speeding up LP and SDP hierarchies for specific problems has been investigated in a series of previous works [27, 20, 44], which shows that with respect to local analyses of the sum-of-squares algorithm it is sometimes possible
O d O d O 1 to improve the running time from n ( ) to 2 ( ) · n ( ). However, the scopes and strategies of these works are completely different from ours. First, the notion of local analysis from these works does not apply to the problems considered here. Second, these works employ the ellipsoid method with a separation oracle inspired by rounding algorithms, whereas we reduce the problem to ordinary eigenvector computation.
It would also be interesting to see if our methods can be used to speed up some of the other recent successful applications of SoS to machine-learning type problems, such as [?], or the application of [16] to tensor decomposition with components that are well-separated (rather than random). Finally, we
99 would be remiss not to mention that SoS lower bounds exist for several of these problems, specifically for tensor principal components analysis, tensor prediction, and sparse PCA [50, ?, ?]. The lower bounds in the SoS framework are a good indication that we cannot expect spectral algorithms achieving better guarantees.
3.2 Techniques
Sum-of-squares method (for polynomial optimization over the sphere). The problems we consider are connected to optimization problems of the following form: Given a homogeneous n-variate real polynomial f of constant degree, find a unit vector x ∈ n so as to maximize f (x). The sum-of-squares method allows us to efficiently compute upper bounds on the maximum value of such a polynomial f over the unit sphere.
For the case that k deg( f ) is even, the most basic upper bound of this kind is the largest eigenvalue of a matrix representation of f .A matrix representation of a polynomial f is a symmetric matrix M with rows and columns indexed by monomials of degree k/2 so that f (x) can be written as the quadratic form
k 2 k 2 k 2 f (x) hx⊗ / , Mx⊗ / i, where x⊗ / is the k/2-fold tensor power of x. The largest eigenvalue of a matrix representation M is an upper bound on the value of f (x) over all unit vectors x ∈ n because
( ) h k 2 k 2i ( ) · k k 2k2 ( ) f x x⊗ / , Mx⊗ / 6 λmax M x⊗ / 2 λmax M .
The sum-of-squares methods improves on this basic spectral bound systemat- ically by associating a large family of polynomials (potentially of degree higher than deg( f )) with the input polynomial f and computing the best possible spectral
100 bound within this family of polynomials. Concretely, the sum-of-squares method with degree parameter d applied to a polynomial f with deg( f ) 6 d considers { + ( − k k2)· 1 | (1) − } ⊆ [ ] the affine subspace of polynomials f 1 x 2 deg 6 d 2 x and minimizes λmax(M) among all matrix representations 1 M of polynomials in this space.2 The problem of searching through this affine linear space of polynomials and their matrix representations and finding the one of smallest maximum eigenvalue can be solved using semidefinite programming.
Our approach for faster algorithms based on SoS algorithms is to construct specific matrices (polynomials) in this affine linear space, then compute their top eigenvectors. By designing our matrices carefully, we ensure that our algorithms have access to the same higher degree information that the sum-of-squares algorithm can access, and this information affords an advantage over the basic spectral methods for these problems. At the same time, our algorithms avoid searching for the best polynomial and matrix representation, which gives us faster running times since we avoid semidefinite programming. This approach is well suited to average-case problems where the choice of input is not adversarial; in particular it is applicable to machine learning problems where noise and inputs are assumed to be random.
Compressing matrices with partial traces. A serious limitation of the above approach is that the representation of a degree-d, n-variate polynomial requires size roughly nd. Hence, even avoiding the use of semidefinite programming,
1Earlier we defined matrix representations only for homogeneous polynomials of even degree. In general, a matrix representation of a polynomial 1 is a symmetric matrix M with rows and 6` 6` columns indexed by monomials of degree at most ` deg 1 2 such that 1 x x⊗ , Mx ⊗ 6` 0 1 ` ( )/ + ( ) h i (as a polynomial identity), where x⊗ x⊗ , x⊗ ,..., x⊗ √` 1 is the vector of all monomials 6` ( )/ of degree at most `. Note that x⊗ 1 for all x with x 1. 2The name of the method stemsk fromk the fact that thisk k last step is equivalent to finding the minimum number λ such that the space contains a polynomial of the form λ 12 + + 12 , − ( 1 ··· t ) where 11,..., 1t are polynomials of degree at most d 2. /
101 improving upon running time O(nd) requires additional ideas.
In each of the problems that we consider, we have a large matrix (suggested by a SoS algorithm) with a “signal” planted in some amount of “noise”. We show that in some situations, this large matrix can be compressed significantly without loss in the signal by applying partial trace operations. In these situations, the partial trace yields a smaller matrix with the same signal-to-noise ratio as the large matrix suggested by the SoS algorithm, even in situations when lower degree sum-of-squares approaches are known to fail (as for the planted sparse vector and tensor PCA problems).3
d2 d2 d d The partial trace Trd : × → × is the linear operator that satisfies d d Trd A ⊗ B (Tr A)· B for all A, B ∈ × . To see how the partial trace can be used to compress large matrices to smaller ones with little loss, consider the following
d2 d2 problem: Given a matrix M ∈ × of the form M τ ·(v ⊗ v)(v ⊗ v)> + A ⊗ B d d d for some unit vector v ∈ and matrices A, B ∈ × , we wish to recover the vector v. (This is a simplified version of the planted problems that we consider in this paper, where τ ·(v ⊗ v)(v ⊗ v)> is the signal and A ⊗ B plays the role of noise.)
It is straightforward to see that the matrix A ⊗ B has spectral norm kA ⊗ Bk kAk · kBk, and so when τ kAkkBk, the matrix M has a noticeable spectral gap, and the top eigenvector of M will be close to v ⊗ v. If | Tr A| ≈ kAk, the matrix
Trd M τ · vv> + Tr(A)· B has a matching spectral gap, and we can still recover v, but now we only need to compute the top eigenvector of a d × d (as opposed to
3For both problems we use matrices with dimensions corresponding to degree-2 SoS programs. An argument of Spielman et al. ([72], Theorem 9) shows that degree-2 sum-of-squares can only find sparse vectors with sparsity k 6 O˜ √n , wherease we achieve sparsity as large as k Θ n . For tensor PCA, the degree-2 SoS program( ) cannot even express the objective function. ( )
102 d2 × d2) matrix.4
If A is a Wigner matrix (e.g. a symmetric matrix with iid ±1 entries), then √ both Tr(A), kAk ≈ n, and the above condition is indeed met. In our average case/machine learning settings the “noise” component is not as simple as A ⊗ B with A a Wigner matrix. Nonetheless, we are able to ensure that the noise displays a similar behavior under partial trace operations. In some cases, this requires additional algorithmic steps, such as random projection in the case of tensor decomposition, or centering the matrix eigenvalue distribution in the case of the planted sparse vector.
It is an interesting question if there are general theorems describing the behavior of spectral norms under partial trace operations. In the current work, we compute the partial traces explicitly and estimate their norms directly. Indeed, our analyses boil down to concentrations bounds for special matrix polynomials. A general theory for the concentration of matrix polynomials is a notorious open problem (see [?]).
Partial trace operations have previously been applied for rounding SoS relaxations. Specifically, the operation of reweighing and conditioning, used in rounding algorithms for sum-of-squares such as [20, 66, 16, 17, ?], corresponds to applying a partial trace operation to the moments matrix returned by the sum-of-squares relaxation.
We now give a technical overview of our algorithmic approach for each problem, and some broad strokes of the analysis for each case. Our most
4In some of our applications, the matrix M is only represented implicitly and has size super- linear in the size of the input, but nevertheless we can compute the top eigenvector of the partial trace Trd M in nearly-linear time.
103 substantial improvements in runtime are for the planted sparse vector and overcomplete tensor decomposition problems (Section 3.2.1 and Section 3.2.2 respectively). Our algorithm for tensor PCA is the simplest application of our techniques, and it may be instructive to skip ahead and read about tensor PCA first (Section 3.2.3).
3.2.1 Planted Sparse Vector in Random Linear Subspace
Recall that in this problem we are given a linear subspace U (represented by
d some basis) that is spanned by a k-sparse unit vector v0 ∈ and random unit
d vectors v1,..., vd 1 ∈ . The goal is to recover the vector v0 approximately. −
n d Background and SoS analysis. Let A ∈ × be a matrix whose columns form an orthonormal basis for U. Our starting point is the polynomial f (x) √ k k4 Ín ( )4 Ax 4 i1 Ax i . Previous work showed that for d n the maximizer of this polynomial over the sphere corresponds to a vector close to v0 and that degree-4 sum-of-squares is able to capture this fact [14, 16]. Indeed, typical n k k4 ≈ / random vectors v in satisfy v 4 1 n whereas our planted vector satisfies k k4 / / v0 4 > 1 k 1 n, and this degree-4 information is leveraged by the SoS algorithms.
Ín ( ) 2 The polynomial f has a convenient matrix representation M i1 ai a>i ⊗ , where a1,..., an are the rows of the generator matrix A. It turns out that the eigenvalues of this matrix indeed give information about the planted sparse
d vector v0. In particular, the vector x0 ∈ with Ax0 v0 witnesses that M has / 2 an eigenvalue of at least 1 k because M’s quadratic form with the vector x0⊗ h 2 2i k k4 / satisfies x0⊗ , Mx0⊗ v0 4 > 1 k. If we let M0 be the corresponding matrix
104 for the subspace U without the planted sparse vector, M0 turns out to have only eigenvalues of at most O(1/n) up to a single spurious eigenvalue with eigenvector far from any vector of the form x ⊗ x [14].
It follows that in order to distinguish between a random subspace with a planted sparse vector (yes case) and a completely random subspace (no case), it is enough to compute the second-largest eigenvalue of a d2-by-d2 matrix
(representing the 4-norm polynomial over the subspace as above). This decision version of the problem, while strictly speaking easier than the search version above, is at the heart of the matter: one can show that the large eigenvalue for the yes case corresponds to an eigenvector which encodes the coefficients of the sparse planted vector in the basis.
Improvements. The best running time we can hope for with this basic √ approach is O(d4) (the size of the matrix). Since we are interested in d 6 O( n), the resulting running time O(nd2) would be subquadratic but still super-linear in the input size n · d (for representing a d-dimensional subspace of n). To speed things up, we use the partial trace approach outlined above. We will first apply the approach naively (obtaining a reasonable bound), and then show that a small modification to the matrix before the partial trace allows us to achieve even smaller signal-to-noise ratios.
≈ 1 ( ) 2 + In the planted case, we may approximate M k x0x0> ⊗ Z, where x0 is the vector of coefficients of v0 in the basis representation given by A (so that Ax0 v0), and Z is the noise matrix. Since kx0k 1, the partial trace operation preserves ( ) 2 ( ) 2 the projector x0x0> ⊗ in the sense that Trd x0x0> ⊗ x0x0>. Hence, with our heuristic approximation for M above, we could show that the top eigenvector of
Trd M is close to x0 by showing that the spectral norm bound kTrd Zk 6 o(1/k).
105 Ín ( ) ⊗ ( ) The partial trace of our matrix M i1 ai a>i ai a>i is easy to compute directly, n Õk k2 · N Trd M ai 2 ai a>i . i1 In the yes case (random subspace with planted sparse vector), a direct computation shows that
h i ≈ d · + n k k4 d + n λyes > x0, Nx0 n 1 d v0 4 > n 1 dk .
Hence, a natural approach to distinguish between the yes case and no case (completely random subspace) is to upper bound the spectral norm of N in the no case.
In order to simplify the bound on the spectral norm of N in the no case, suppose that the columns of A are iid samples from the Gaussian distribution N( 1 ) 0, d Id (rather than an orthogonal basis for the random subspace)–Lemma 3.4.6 establishes that this simplification is legitimate. In this simplified setup, the {k k2 · } matrix N in the no case is the sum of n iid matrices ai ai a>i , and we can p upper bound its spectral norm λno by d/n ·(1 + O( d/n)) using standard matrix concentration bounds. Hence, using the spectral norm of N, we will be able to distinguish between the yes case and the no case as long as
√(d/n) ≪ n/(dk)  ⟹  λ_no ≪ λ_yes.
For linear sparsity k = ε·n, this inequality holds so long as d ≪ (n/ε^2)^{1/3}, which is somewhat worse than the bound of √n on the dimension that we are aiming for.
Recall that Tr B = Σ_i λ_i(B) for a symmetric matrix B. As discussed above, the partial trace approach works best when the noise behaves as the tensor of two
Wigner matrices, in that there are cancellations when summing the eigenvalues
of the noise. In our case, the noise terms (a_i a_i^⊤) ⊗ (a_i a_i^⊤) do not have this property, as in fact Tr(a_i a_i^⊤) = ‖a_i‖^2 ≈ d/n. Thus, in order to improve the dimension bound, we will center the eigenvalue distribution of the noise part of the matrix. This will cause it to behave more like a Wigner matrix, in that the spectral norm of the noise will not increase after a partial trace. Consider the partial trace of a matrix of the form
M − α · Id ⊗ Σ_i a_i a_i^⊤,
for some constant α > 0. The partial trace of this matrix is
N′ = Σ_{i=1}^n (‖a_i‖_2^2 − α) · a_i a_i^⊤.
We choose the constant α ≈ d/n such that our matrix N′ has expectation 0 in the no case, when the subspace is completely random. In the yes case, the
Rayleigh quotient of N′ at x_0 simply shifts as compared to N, and we have λ_yes ≥ ⟨x_0, N′x_0⟩ ≈ ‖v_0‖_4^4 ≥ 1/k (see Lemma 3.4.5 and sublemmas). On the other hand, in the no case, this centering operation causes significant cancellations in the eigenvalues of the partial trace matrix (instead of just shifting the eigenvalues). In the no case, N′ has spectral norm λ_no ≤ O(√d/n^{3/2}) for d ≪ n (using standard matrix concentration bounds; again see Lemma 3.4.5 and sublemmas). Therefore, the spectral norm of the matrix N′ allows us to distinguish between the yes and no case as long as √d/n^{3/2} ≪ 1/k, which is satisfied as long as k ≪ n and d ≪ n. We give the full formal argument in Section 3.4.
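As a concrete illustration of this centering step, the following small numpy sketch (our own illustrative code, with arbitrary toy sizes) builds N′ = Σ_i (‖a_i‖^2 − d/n)·a_i a_i^⊤ for a planted instance presented in the convenient non-orthogonal basis and for a completely random instance with Gaussian columns (as in the simplification above), and compares the top eigenvalues.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 20000, 40, 200          # ambient dim, subspace dim, sparsity (illustrative choices)

    def centered_matrix(S):
        """Return N' = sum_i (||a_i||^2 - d/n) a_i a_i^T, where a_i are the rows of S."""
        n, d = S.shape
        w = (S ** 2).sum(axis=1) - d / n      # centered leverage scores
        return (S * w[:, None]).T @ S         # sum_i w_i a_i a_i^T

    # "no" case: a completely random d-dimensional subspace of R^n
    S_no = rng.normal(scale=1 / np.sqrt(n), size=(n, d))

    # "yes" case: same, but the first column is replaced by a k-sparse unit vector
    v0 = np.zeros(n); v0[:k] = 1 / np.sqrt(k)
    S_yes = S_no.copy(); S_yes[:, 0] = v0

    lam_yes = np.linalg.eigvalsh(centered_matrix(S_yes))[-1]
    lam_no = np.linalg.eigvalsh(centered_matrix(S_no))[-1]
    print(f"yes-case top eigenvalue ~ {lam_yes:.3e}  (compare ||v0||_4^4 = {1/k:.3e})")
    print(f"no-case  top eigenvalue ~ {lam_no:.3e}")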
3.2.2 Overcomplete Tensor Decomposition
Recall that in this problem we are given a 3-tensor T ∈ (ℝ^d)^{⊗3} of the form T = Σ_{i=1}^n a_i^{⊗3}, where a_1, ..., a_n are independent random vectors from N(0, (1/d) Id_d). The goal
is to find a unit vector a ∈ ℝ^d that is highly correlated with one^5 of the vectors a_1, ..., a_n.
Background. The starting point of our algorithm is the polynomial f(x) = Σ_{i=1}^n ⟨a_i, x⟩^3. It turns out that for n ≪ d^{1.5} the (approximate) maximizers of this polynomial are close to the components a_1, ..., a_n, in the sense that f(x) ≈ 1 if and only if max_{i∈[n]} ⟨a_i, x⟩^2 ≈ 1. Indeed, Ge and Ma [39] show that the sum-of-squares method already captures this fact at degree 12, which implies a quasipolynomial time algorithm for this tensor decomposition problem via a general rounding result of Barak, Kelner, and Steurer [17].
The simplest approach to this problem is to consider the tensor representation of the polynomial, T = Σ_{i∈[n]} a_i^{⊗3}, and flatten it, hoping the singular vectors of the flattening are correlated with the a_i. However, this approach is doomed to failure for two reasons: firstly, the simple flattenings of T are d^2 × d matrices, and since n ≫ d the a_i^{⊗2} collide in the column space, so that it is impossible to determine Span{a_i^{⊗2}}. Secondly, even for n ≤ d, because the a_i are random vectors, their norms concentrate very closely about 1. This makes it difficult to distinguish any one particular a_i even when the span is computable.
Improvements. We will try to circumvent both of these issues by going to higher dimensions. Suppose, for example, that we had access to Σ_{i∈[n]} a_i^{⊗4}.^6 The eigenvectors of the flattenings of this matrix are all within Span_{i∈[n]}{a_i^{⊗2}}, addressing our first issue, leaving us only with the trouble of extracting individual a_i^{⊗2} from their span. If furthermore we had access to Σ_{i∈[n]} a_i^{⊗6}, we could perform
5We can then approximately recover all the components a1,..., an by running independent trials of our randomized algorithm repeatedly on the same input. 6As the problem is defined, we assume that we do not have access to this input, and in many machine learning applications this is a valid assumption, as gathering the data necessary to generate the 4th order input tensor requires a prohibitively large number of samples.
a partial random projection (Φ ⊗ Id ⊗ Id) Σ_{i∈[n]} a_i^{⊗6}, where Φ ∈ ℝ^{d×d} is a matrix with independent Gaussian entries, and then take a partial trace, ending up with
Tr_d (Φ ⊗ Id ⊗ Id) Σ_{i∈[n]} a_i^{⊗6} = Σ_{i∈[n]} ⟨Φ, a_i^{⊗2}⟩ a_i^{⊗4}.
With reasonable probability (for exposition's sake, say with probability 1/n^{10}), Φ is closer to some a_i^{⊗2} than to all of the others, so that ⟨Φ, a_i^{⊗2}⟩ ≥ 100⟨Φ, a_j^{⊗2}⟩ for all j ≠ i, and then a_i^{⊗2} is distinguishable from the other vectors in the span of our matrix, taking care of the second issue. As we show, a much smaller gap is sufficient to distinguish the top a_i from the other a_j, and so the higher-probability event that Φ is only slightly closer to a_i suffices (allowing us to recover all vectors at an additional runtime cost of a factor of Õ(n)). This discussion ignores the presence of a single spurious large eigenvector, which we address in the technical sections.
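The projection-then-partial-trace identity above is easy to check numerically. The following snippet (our own toy verification, with hypothetical tiny sizes) confirms that Tr_d (Φ ⊗ Id ⊗ Id) Σ_i a_i^{⊗6} = Σ_i ⟨Φ, a_i^{⊗2}⟩ a_i^{⊗4} as d^2 × d^2 matrices.

    import numpy as np

    rng = np.random.default_rng(1)
    d, n = 6, 4                                    # tiny, illustrative sizes
    A = rng.normal(scale=1/np.sqrt(d), size=(n, d))
    Phi = rng.normal(size=(d, d))

    # sixth-moment matrix  sum_i (a_i a_i^T)^{kron 3}, flattened to d^3 x d^3
    M6 = sum(np.kron(np.outer(a, a), np.kron(np.outer(a, a), np.outer(a, a))) for a in A)

    # apply Phi to the first mode, then take a partial trace over that mode
    proj = np.kron(Phi, np.eye(d * d)) @ M6
    ptrace = np.einsum('ipiq->pq', proj.reshape(d, d*d, d, d*d))

    # direct formula: sum_i <Phi, a_i a_i^T> (a_i a_i^T)^{kron 2}
    direct = sum((a @ Phi @ a) * np.kron(np.outer(a, a), np.outer(a, a)) for a in A)
    print(np.allclose(ptrace, direct))             # True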
Of course, we do not have access to the higher-order tensor Σ_{i∈[n]} a_i^{⊗6}. Instead, we can obtain a noisy version of this tensor. Our approach considers the following matrix representation of the polynomial f^2,
M = Σ_{i,j} (a_i a_j^⊤) ⊗ (a_i a_i^⊤) ⊗ (a_j a_j^⊤) ∈ ℝ^{d^3 × d^3}.
Alternatively, we can view this matrix as a particular flattening of the Kronecker-
squared tensor T^{⊗2}. It is instructive to decompose M = M_diag + M_cross into its diagonal terms M_diag = Σ_i (a_i a_i^⊤)^{⊗3} and its cross terms M_cross = Σ_{i≠j} (a_i a_j^⊤) ⊗ (a_i a_i^⊤) ⊗ (a_j a_j^⊤). The algorithm described above is already successful for M_diag; we need only control the eigenvalues of the partial trace of the “noise” component,
M_cross. The main technical work will be to show that ‖Tr_d M_cross‖ is small. In fact, we will have to choose Φ from a somewhat different distribution—observing that Tr_d (Φ ⊗ Id ⊗ Id) Σ_{i≠j} (a_i a_j^⊤) ⊗ (a_i a_i^⊤) ⊗ (a_j a_j^⊤) = Σ_{i≠j} ⟨a_i, Φa_j⟩ · (a_i ⊗ a_j)(a_i ⊗ a_j)^⊤, we will sample Φ so that
⟨a_i, Φa_i⟩ ≫ ⟨a_i, Φa_j⟩. We give a more detailed overview of this algorithm in the full version, explaining in more detail our choice of Φ and justifying heuristically the boundedness of the spectral norm of the noise.
Connection to SoS analysis. To explain how the above algorithm is a speedup of SoS, we give an overview of the SoS algorithm of [39, 17]. There, the degree-t
SoS SDP program is used to obtain an order-t tensor χ_t (or a pseudodistribution). Informally speaking, we can understand χ_t as a proxy for Σ_{i∈[n]} a_i^{⊗t}, so that χ_t = Σ_{i∈[n]} a_i^{⊗t} + N, where N is a noise tensor. While the precise form of N is unclear, we know that N must obey a set of constraints imposed by the SoS hierarchy at degree t. For a formal discussion of pseudodistributions, see [17].
To extract a single component a_i from the tensor Σ_{i∈[n]} a_i^{⊗t}, there are many algorithms which would work (for example, the algorithm we described for M_diag above). However, any algorithm extracting an a_i from χ_t must be robust to the noise tensor N. For this it turns out the following algorithm will do: suppose we have the tensor Σ_{i∈[n]} a_i^{⊗t}, taking t = O(log n). Sample g_1, ..., g_{log n − 2} random unit vectors, and compute the matrix M = Σ_i (Π_{1 ≤ j ≤ log n − 2} ⟨g_j, a_i⟩) · a_i a_i^⊤. If we are lucky enough, there is some a_i so that every g_j is a bit closer to a_i than to any other a_{i′}, and M = a_i a_i^⊤ + E for some ‖E‖ ≪ 1. The proof that ‖E‖ is small can be made so simple that it applies also to the SDP-produced proxy tensor χ_{log n}, and so this algorithm is robust to the noise N. This last step is very general and can handle tensors whose components a_i are less restricted than the random vectors we consider, and also more overcomplete, handling tensors of rank up to n = Ω̃(d^{1.5}).^7
Our subquadratic-time algorithm can be viewed as a low-degree, spectral
^7 It is an interesting open question whether taking t = O(log n) is really necessary, or whether this heavy computational requirement is simply an artifact of the SoS proof.
analogue of the [17] SoS algorithm. However, rather than relying on an SDP to produce an object close to Σ_{i∈[n]} a_i^{⊗t}, we manufacture one ourselves by taking the Kronecker square of our input tensor. We explicitly know the form of the
deviation of T^{⊗2} from Σ_{i∈[n]} a_i^{⊗6}, unlike in [17], where the deviation of the SDP certificate χ_t from Σ_{i∈[n]} a_i^{⊗t} is poorly understood. We are thus able to control this deviation (or “noise”) in a less computationally intensive way, by cleverly designing a partial trace operation which decreases the spectral norm of the deviation. Since the tensor handled by the algorithm is much smaller—order 6 rather than order log n—this provides the desired speedup.
3.2.3 Tensor Principal Component Analysis
Recall that in this problem we are given a tensor T = τ · v^{⊗3} + A, where v ∈ ℝ^d is a unit vector, A has iid entries from N(0, 1), and τ > 0 is the signal-to-noise ratio. The aim is to recover v approximately.
Background and SoS analysis. A previous application of SoS techniques to this problem discussed several SoS or spectral algorithms, including one that runs in quasi-linear time [50]. Here we apply the partial trace method to a subquadratic spectral SoS algorithm discussed in [50] to achieve nearly the same signal-to-noise guarantee in only linear time.
Our starting point is the polynomial T(x) = τ · ⟨v, x⟩^3 + ⟨x^{⊗3}, A⟩. The maximizer of T(x) over the sphere is close to the vector v so long as τ ≫ √n [67]. In [50], it was shown that degree-4 SoS maximizing this polynomial can recover
v with a signal-to-noise ratio of at least Ω̃(n^{3/4}), since there exists a suitable SoS
bound on the noise term ⟨x^{⊗3}, A⟩.
Specifically, let A_i be the ith slice of A, so that ⟨x, A_i x⟩ = Σ_{j,k} A_{ijk} x_j x_k. Then there is a SoS proof that T(x) is bounded by |T(x) − τ · ⟨v, x⟩^3| ≤ f(x)^{1/2} · ‖x‖, where f(x) is the degree-4 polynomial f(x) = Σ_i ⟨x, A_i x⟩^2. The polynomial f has a convenient matrix representation: f(x) = ⟨x^{⊗2}, (Σ_i A_i ⊗ A_i) x^{⊗2}⟩; since this matrix is a sum of iid random matrices A_i ⊗ A_i, a matrix Chernoff bound shows that this matrix spectrally concentrates to its expectation. So with high probability one can show that the eigenvalues of Σ_i A_i ⊗ A_i are at most ≈ d^{3/2} log(d)^{1/2} (except for a single spurious eigenvector), and it follows
that degree-4 SoS solves tensor PCA so long as τ ≫ d^{3/4} log(d)^{1/4}.
This leads the authors to consider a slight modification of f(x), given by g(x) = Σ_i ⟨x, T_i x⟩^2, where T_i is the ith slice of T. Like T, the function g also contains information about v, and the SoS bound on the noise term in T carries over as an analogous bound on the noise in g. In particular, expanding T_i ⊗ T_i and ignoring some negligible cross-terms yields
Σ_i T_i ⊗ T_i ≈ τ^2 · (v ⊗ v)(v ⊗ v)^⊤ + Σ_i A_i ⊗ A_i.
Using v ⊗ v as a test vector, the quadratic form of the latter matrix can be made at
least τ^2 − O(d^{3/2} log(d)^{1/2}). Together with the boundedness of the eigenvalues of Σ_i A_i ⊗ A_i, this shows that when τ ≫ d^{3/4} log(d)^{1/4} there is a spectral algorithm to recover v. Since the matrix Σ_i T_i ⊗ T_i is d^2 × d^2, computing the top eigenvector requires Õ(d^4 log n) time, and by comparison to the input size d^3 the algorithm runs in subquadratic time.
Improvements. In this work we speed this up to a linear time algorithm via the partial trace approach. As we have seen, the heart of the matter is to show that taking the partial trace of τ^2 · (v ⊗ v)(v ⊗ v)^⊤ + Σ_i A_i ⊗ A_i does not increase
the spectral noise. That is, we require that
‖Tr_d Σ_i A_i ⊗ A_i‖ = ‖Σ_i Tr(A_i) · A_i‖ ≤ O(d^{3/2} log(d)^{1/2}).
The A_i have iid Gaussian entries, and so as in the case of Wigner matrices, it is roughly true that |Tr(A_i)| ≈ ‖A_i‖. Thus the situation is very similar to our toy example of the application of partial traces in Section 4.2.
Heuristically, because Σ_{i∈[n]} A_i ⊗ A_i and Σ_{i∈[n]} Tr(A_i) · A_i are random matrices, we expect that their eigenvalues are all of roughly the same magnitude. This means that their spectral norm should be close to their Frobenius norm divided by the square root of the dimension, since for a matrix M with eigenvalues λ_1, ..., λ_n, ‖M‖_F = (Σ_{i∈[n]} λ_i^2)^{1/2}. By estimating the sum of the squared entries, we expect that the Frobenius norm of Σ_i Tr(A_i) · A_i is less than that of Σ_i A_i ⊗ A_i by a factor of √d after the partial trace, while the dimension decreases by a factor of d, and so assuming that the eigenvalues are all of the same order, a typical eigenvalue should remain unchanged. We formalize these heuristic calculations using standard matrix concentration arguments in the full version.
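The following quick numerical sanity check (ours, not part of the thesis) illustrates the heuristic: for iid Gaussian slices A_i, the partial trace Σ_i Tr(A_i)·A_i stays at roughly the same d^{3/2} scale as the bulk spectrum of Σ_i A_i ⊗ A_i. The sizes are toy, and the single spurious top direction of the larger matrix is visible in the output.

    import numpy as np

    rng = np.random.default_rng(2)
    d = 30
    A = rng.normal(size=(d, d, d))                         # slices A_1, ..., A_d with iid N(0,1) entries

    big = sum(np.kron(A[i], A[i]) for i in range(d))       # sum_i A_i (x) A_i, a d^2 x d^2 matrix
    small = sum(np.trace(A[i]) * A[i] for i in range(d))   # its partial trace, a d x d matrix

    s_big = np.linalg.svd(big, compute_uv=False)
    s_small = np.linalg.svd(small, compute_uv=False)
    print("top two singular values of sum_i A_i (x) A_i :", s_big[:2])   # first one is the spurious direction
    print("top singular value of sum_i Tr(A_i) A_i      :", s_small[0])
    print("d^{3/2} for scale                            :", d ** 1.5)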
3.3 Preliminaries
Linear algebra. We will work in the real vector spaces ℝ^n. A vector of indeterminates may be denoted x = (x_1, ..., x_n), although we may sometimes switch to parenthetical notation for indexing, i.e. x = (x(1), ..., x(n)), when subscripts are already in use. We denote by [n] the set of all valid indices for a vector in ℝ^n. Let e_i be the ith canonical basis vector, so that e_i(i) = 1 and e_i(j) = 0 for j ≠ i.
For a vector space V, we may denote by L(V) the space of linear operators from V to V. The space orthogonal to a vector v is denoted v^⊥.
For a matrix M, we use M^{−1} to denote its inverse or its Moore–Penrose pseudoinverse; which one it is will be clear from context. For M PSD, we write M^{−1/2} for the unique PSD matrix with (M^{−1/2})^2 = M^{−1}.
Norms and inner products. We denote the usual entrywise inner product by ⟨·, ·⟩, so that ⟨u, v⟩ = Σ_{i∈[n]} u_i v_i for u, v ∈ ℝ^n. The ℓ_p-norm of a vector v is given by ‖v‖_p = (Σ_{i∈[n]} v_i^p)^{1/p}, with ‖v‖ denoting the ℓ_2-norm by default. The matrix norm used throughout the paper will be the operator/spectral norm, denoted by ‖M‖ = ‖M‖_op := max_{x≠0} ‖Mx‖/‖x‖.
Tensor manipulation. Boldface variables will be reserved for tensors T ∈ ℝ^{n×n×n}, of which we consider only order-3 tensors. We denote by T(x, y, z) the multilinear function in x, y, z ∈ ℝ^n such that T(x, y, z) = Σ_{i,j,k∈[n]} T_{i,j,k} x_i y_j z_k, applying x, y, and z to the first, second, and third modes of the tensor T respectively. If the arguments are matrices P, Q, and R instead, this lifts T(P, Q, R) to the unique multilinear tensor-valued function such that [T(P, Q, R)](x, y, z) = T(Px, Qy, Rz) for all vectors x, y, z.
Tensors may be flattened to matrices in the multilinear way such that for every u ∈ ℝ^{n×n} and v ∈ ℝ^n, the tensor u ⊗ v flattens to the matrix uv^⊤ ∈ ℝ^{n^2 × n} with u considered as a vector. There are 3 different ways to flatten a 3-tensor T, corresponding to the 3 modes of T. Flattening may be understood as reinterpreting the indices of a tensor when the tensor is expressed as a 3-dimensional array of numbers. The expression v^{⊗3} refers to v ⊗ v ⊗ v for a vector v.
Probability and asymptotic bounds. We will often refer to collections of independent and identically distributed (or iid) random variables. The Gaussian distribution with mean μ and variance σ^2 is denoted N(μ, σ^2). Sometimes we state that an event happens with overwhelming probability (w.ov.p.). This means that its probability is at least 1 − n^{−ω(1)}. A function is Õ(g(n)) if it is O(g(n)) up to polylogarithmic factors.
3.4 Planted Sparse Vector in Random Linear Subspace
In this section we give a nearly-linear-time algorithm to recover a sparse vector planted in a random subspace.
Problem 3.4.1. Let v_0 ∈ ℝ^n be a unit vector such that ‖v_0‖_4^4 ≥ 1/(εn). Let v_1, ..., v_{d−1} ∈ ℝ^n be iid from N(0, (1/n) Id_n). Let w_0, ..., w_{d−1} be an orthogonal basis for Span{v_0, ..., v_{d−1}}. Given: w_0, ..., w_{d−1}. Find: a vector v such that ⟨v, v_0⟩^2 ≥ 1 − o(1).
Sparse Vector Recovery in Nearly-Linear Time
Algorithm 3.4.2. Input: w_0, ..., w_{d−1} as in Problem 3.4.1. Goal: Find v with ⟨v, v_0⟩^2 ≥ 1 − o(1).
• Compute the leverage scores ‖a_1‖^2, ..., ‖a_n‖^2, where a_i is the ith row of the n × d matrix S := (w_0 ··· w_{d−1}).
• Compute the top eigenvector u of the matrix
A := Σ_{i∈[n]} (‖a_i‖_2^2 − d/n) · a_i a_i^⊤.
• Output Su.
Remark 3.4.3. Implementation of Algorithm 3.4.2 in nearly-linear time. The
The leverage scores ‖a_1‖^2, ..., ‖a_n‖^2 are clearly computable in time O(nd). In the course of proving correctness of the algorithm we will show that A has constant spectral gap, so by a standard analysis O(log d) matrix-vector multiplies suffice to recover its top eigenvector. A single matrix-vector multiply Ax requires computing c_i := (‖a_i‖^2 − d/n)⟨a_i, x⟩ for each i (in time O(nd)) and summing Σ_{i∈[n]} c_i a_i (in time O(nd)). Finally, computing Su requires summing d vectors of dimension n, again taking time O(nd).
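A minimal numpy sketch of this nearly-linear-time implementation follows (our own illustrative code, assuming toy problem sizes and a fixed number of power-method iterations rather than the spectral-gap-based stopping rule).

    import numpy as np

    def recover_sparse_vector(W, iters=200, seed=0):
        """Sketch of Algorithm 3.4.2 (illustrative, not the thesis's code).
        W is an n x d matrix whose columns are a basis of the subspace; every step
        touches W only through O(nd)-time products."""
        n, d = W.shape
        lev = (W ** 2).sum(axis=1) - d / n        # centered leverage scores, O(nd)
        rng = np.random.default_rng(seed)
        x = rng.normal(size=d)
        for _ in range(iters):                    # power method on A = sum_i lev_i a_i a_i^T
            c = lev * (W @ x)                     # c_i = (||a_i||^2 - d/n) <a_i, x>,  O(nd)
            x = W.T @ c                           # A x = sum_i c_i a_i,                O(nd)
            x /= np.linalg.norm(x)
        return W @ x                              # candidate sparse vector S u

    # toy usage: plant a sparse vector in a random subspace and recover it
    rng = np.random.default_rng(1)
    n, d, k = 50000, 50, 200
    v0 = np.zeros(n); v0[:k] = 1 / np.sqrt(k)
    basis = np.column_stack([v0] + [rng.normal(scale=1/np.sqrt(n), size=n) for _ in range(d - 1)])
    Q = np.linalg.qr(basis)[0] @ np.linalg.qr(rng.normal(size=(d, d)))[0]   # rotate to hide v0
    v = recover_sparse_vector(Q)
    print("correlation with planted vector:", abs(v @ v0) / np.linalg.norm(v))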
The following establishes our algorithm’s correctness.
Theorem 3.4.4. Let v_0 ∈ ℝ^n be a unit vector with ‖v_0‖_4^4 ≥ 1/(εn). Let v_1, ..., v_{d−1} ∈ ℝ^n be iid from N(0, (1/n) Id_n). Let w_0, ..., w_{d−1} be an orthogonal basis for Span{v_0, ..., v_{d−1}}. Let a_i be the i-th row of the n × d matrix S := (w_0 ··· w_{d−1}).
When d ≤ n^{1/2}/polylog(n), for any sparsity ε > 0, w.ov.p. the top eigenvector u of Σ_{i=1}^n (‖a_i‖^2 − d/n) · a_i a_i^⊤ has ⟨Su, v_0⟩^2 ≥ 1 − O(ε^{1/4}) − o(1).
We have little control over the basis vectors the algorithm is given. However, there is a particularly nice (albeit non-orthogonal) basis for the subspace which exposes the underlying randomness. Suppose that we are given the basis vectors v_0, ..., v_{d−1}, where v_0 is the sparse vector normalized so that ‖v_0‖ = 1, and v_1, ..., v_{d−1} are iid samples from N(0, (1/n) Id_n). The following lemma shows that if the algorithm had been handed this good representation of the basis rather than an arbitrary orthogonal one, its output would be correlated with the vector of coefficients of the planted sparse vector (in this case the standard basis vector e_1).
Lemma 3.4.5. Let v_0 ∈ ℝ^n be a unit vector. Let v_1, ..., v_{d−1} ∈ ℝ^n be iid from N(0, (1/n) Id). Let a_i be the ith row of the n × d matrix S := (v_0 ··· v_{d−1}). Then there is a universal constant ε* > 0 so that for any ε ≤ ε*, so long as d ≤ n^{1/2}/polylog(n), w.ov.p.
Σ_{i=1}^n (‖a_i‖_2^2 − d/n) · a_i a_i^⊤ = ‖v_0‖_4^4 · e_1 e_1^⊤ + M,
where e_1 is the first standard basis vector and ‖M‖ ≤ O(‖v_0‖_4^3 · n^{−1/4} + ‖v_0‖_4^2 · n^{−1/2} + ‖v_0‖_4 · n^{−3/4} + n^{−1}).
The second ingredient we need is that the algorithm is robust to exchanging this good basis for an arbitrary orthogonal basis.
Lemma 3.4.6. Let v_0 ∈ ℝ^n have ‖v_0‖_4^4 ≥ 1/(εn). Let v_1, ..., v_{d−1} ∈ ℝ^n be iid from N(0, (1/n) Id_n). Let w_0, ..., w_{d−1} be an orthogonal basis for Span{v_0, ..., v_{d−1}}. Let a_i be the ith row of the n × d matrix S := (v_0 ··· v_{d−1}). Let a′_i be the ith row of the n × d matrix S′ := (w_0 ··· w_{d−1}). Let A := Σ_i a_i a_i^⊤. Let Q ∈ ℝ^{d×d} be the orthogonal matrix so that SA^{−1/2} = S′Q, which exists since SA^{−1/2} is orthogonal, and which has the effect that a′_i = QA^{−1/2} a_i. Then when d ≤ n^{1/2}/polylog(n), w.ov.p.
‖Σ_{i=1}^n (‖a′_i‖^2 − d/n) · a′_i a′_i^⊤ − Q (Σ_{i=1}^n (‖a_i‖^2 − d/n) · a_i a_i^⊤) Q^⊤‖ ≤ O(1/n) + o(‖v_0‖_4^4).
Last, we will need the following fact, which follows from standard concentra- tion.
Lemma 3.4.7. Let v ∈ ℝ^n be a unit vector. Let b_1, ..., b_n ∈ ℝ^{d−1} be iid from N(0, (1/n) Id_{d−1}). Let a_i ∈ ℝ^d be given by a_i := (v(i), b_i). Then w.ov.p. ‖Σ_{i=1}^n a_i a_i^⊤ − Id_d‖ ≤ Õ((d/n)^{1/2}). In particular, when d = o(n), this implies that w.ov.p. ‖(Σ_{i=1}^n a_i a_i^⊤)^{−1} − Id_d‖ ≤ Õ((d/n)^{1/2}) and ‖(Σ_{i=1}^n a_i a_i^⊤)^{−1/2} − Id_d‖ ≤ Õ((d/n)^{1/2}).
We reserve the proofs of Lemma 3.4.6 and Lemma 3.4.7 for the full version. We are ready to prove Theorem 3.4.4.
Proof of Theorem 3.4.4. Let b_1, ..., b_n be the rows of the matrix S′ := (v_0 ··· v_{d−1}). Let B = Σ_i b_i b_i^⊤. Note that S′B^{−1/2} has columns which are an orthogonal basis for Span{w_0, ..., w_{d−1}}. Let Q ∈ ℝ^{d×d} be the rotation so that S′B^{−1/2} = SQ.
By Lemma 3.4.5 and Lemma 3.4.6, we can write the matrix A = Σ_{i=1}^n (‖a_i‖_2^2 − d/n) · a_i a_i^⊤ as
A = ‖v_0‖_4^4 · Qe_1 e_1^⊤ Q^⊤ + M,
where w.ov.p.
‖M‖ ≤ O(‖v_0‖_4^3 · n^{−1/4}) + O(‖v_0‖_4^2 · n^{−1/2}) + O(‖v_0‖_4 · n^{−3/4}) + O(n^{−1}) + o(‖v_0‖_4^4).
We have assumed that ‖v_0‖_4^4 ≥ (εn)^{−1}, and so since A is an almost-rank-one matrix (Lemma A.3.3), the top eigenvector u of A has ⟨u, Qe_1⟩^2 ≥ 1 − O(ε^{1/4}), so that ⟨Su, SQe_1⟩^2 ≥ 1 − O(ε^{1/4}) by column-orthogonality of S. At the same time, SQe_1 = S′B^{−1/2}e_1, and by Lemma 3.4.7, ‖B^{−1/2} − Id‖ ≤ Õ((d/n)^{1/2}) w.ov.p., so that ⟨Su, S′e_1⟩^2 ≥ ⟨Su, SQe_1⟩^2 − o(1). Finally, S′e_1 = v_0 by definition, so ⟨Su, v_0⟩^2 ≥ 1 − O(ε^{1/4}) − o(1).
3.4.1 Algorithm Succeeds on Good Basis
We now prove Lemma 3.4.5. We decompose the matrix in question into a contribution from ‖v_0‖_4^4 and the rest: explicitly, the decomposition is Σ_i (‖a_i‖_2^2 − d/n) · a_i a_i^⊤ = Σ_i v(i)^2 · a_i a_i^⊤ + Σ_i (‖b_i‖_2^2 − d/n) · a_i a_i^⊤. This first lemma handles the contribution from ‖v_0‖_4^4.
Lemma 3.4.8. Let v ∈ ℝ^n be a unit vector. Let b_1, ..., b_n ∈ ℝ^{d−1} be random vectors iid from N(0, (1/n) · Id_{d−1}). Let a_i = (v(i), b_i) ∈ ℝ^d. Suppose d ≤ n^{1/2}/polylog(n). Then
Σ_{i=1}^n v(i)^2 · a_i a_i^⊤ = ‖v‖_4^4 · e_1 e_1^⊤ + M′,
where ‖M′‖ ≤ O(‖v‖_4^3 n^{−1/4} + ‖v‖_4^2 n^{−1/2}) w.ov.p..
Proof of Lemma 3.4.8. We first show an operator-norm bound on the principal submatrix Σ_{i=1}^n v(i)^2 · b_i b_i^⊤ using a truncated matrix Bernstein inequality (Proposition A.3.7). First, the expected operator norm of each summand is bounded:
𝔼 v(i)^2 ‖b_i‖_2^2 ≤ (max_{j∈[n]} v(j)^2) · O(d/n) ≤ ‖v‖_4^2 · O(d/n).
The operator norms are bounded by constant-degree polynomials in Gaussian variables, so Lemma A.3.8 applies to truncate their tails in preparation for application of a Bernstein bound. We just have to calculate the variance of the sum, which is at most
Σ_{i=1}^n 𝔼 v(i)^4 ‖b_i‖_2^2 · b_i b_i^⊤ ≼ ‖v‖_4^4 · O(d/n^2) · Id.
The expectation 𝔼 Σ_{i=1}^n v(i)^2 · b_i b_i^⊤ is (‖v‖^2/n) · Id. Applying a matrix Bernstein bound (Proposition A.3.7) to the deviation from expectation, we get that w.ov.p.,
‖Σ_{i=1}^n v(i)^2 · b_i b_i^⊤ − (1/n) · Id‖ ≤ ‖v‖_4^2 · Õ(d/n) ≤ O(‖v‖_4^2 n^{−1/2})
for appropriate choice of d ≤ n^{1/2}/polylog(n). Hence, by the triangle inequality, ‖Σ_{i=1}^n v(i)^2 · b_i b_i^⊤‖ ≤ O(‖v‖_4^2 n^{−1/2}) w.ov.p..
Using a Cauchy-Schwarz-style inequality (Lemma A.3.1) we now show that the bound on this principal submatrix is essentially enough to obtain the lemma.
Let p_i, q_i ∈ ℝ^d be given by
p_i^⊤ := (v_0(i) · v_0(i), 0, ···, 0),  q_i := v_0(i) · (0, b_i).
Then
Σ_{i=1}^n v(i)^2 · a_i a_i^⊤ = ‖v‖_4^4 · e_1 e_1^⊤ + Σ_{i=1}^n (p_i q_i^⊤ + q_i p_i^⊤ + q_i q_i^⊤).
We have already bounded Σ_{i=1}^n q_i q_i^⊤ = Σ_{i=1}^n v(i)^2 · b_i b_i^⊤ (padding b_i with a leading zero). At the same time, ‖Σ_{i=1}^n p_i p_i^⊤‖ = ‖v‖_4^4. By Lemma A.3.1, then,
‖Σ_{i=1}^n (p_i q_i^⊤ + q_i p_i^⊤)‖ ≤ O(‖v‖_4^3 n^{−1/4})
w.ov.p.. Finally, applying the triangle inequality gives the desired result.
Our second lemma controls the contribution from the random part of the leverage scores.
Lemma 3.4.9. Let v ∈ ℝ^n be a unit vector. Let b_1, ..., b_n ∈ ℝ^{d−1} be random vectors iid from N(0, (1/n) · Id_{d−1}). Let a_i = (v(i), b_i) ∈ ℝ^d. Suppose d ≤ n^{1/2}/polylog(n). Then w.ov.p.
‖Σ_{i=1}^n (‖b_i‖_2^2 − d/n) · a_i a_i^⊤‖ ≤ ‖v‖_4^2 · O(n^{−3/4}) + ‖v‖_4 · O(n^{−1}) + O(n^{−1}).
Proof (sketch). As in the proof of Lemma 3.4.8, Σ_{i=1}^n (‖b_i‖_2^2 − d/n) · a_i a_i^⊤ decomposes into a convenient block structure; we will bound each block separately:
Σ_{i=1}^n (‖b_i‖_2^2 − d/n) · a_i a_i^⊤ = Σ_{i=1}^n (‖b_i‖_2^2 − d/n) · [ v(i)^2 , v(i)·b_i^⊤ ; v(i)·b_i , b_i b_i^⊤ ].   (3.4.1)
In each block we can apply a (truncated) Bernstein inequality. The details are deferred to the full version.
We are now ready to prove Lemma 3.4.5.
Proof of Lemma 3.4.5. We decompose ‖a_i‖_2^2 = v_0(i)^2 + ‖b_i‖_2^2 and use Lemma 3.4.8 and Lemma 3.4.9.
Σ_{i=1}^n (‖a_i‖_2^2 − d/n) · a_i a_i^⊤ = (Σ_{i=1}^n v_0(i)^2 · a_i a_i^⊤) + (Σ_{i=1}^n (‖b_i‖_2^2 − d/n) · a_i a_i^⊤) = ‖v_0‖_4^4 · e_1 e_1^⊤ + M,
where
‖M‖ ≤ O(‖v_0‖_4^3 · n^{−1/4} + ‖v_0‖_4^2 · n^{−1/2}) + O(‖v_0‖_4 · n^{−1} + n^{−1}).
Since ‖v_0‖_4^4 ≥ (εn)^{−1}, we get ‖v_0‖_4^4/‖M‖ ≥ 1/ε^{1/4}, completing the proof.
3.5 Overcomplete Tensor Decomposition
In this section, we give a polynomial-time algorithm for the following problem
when n ≤ d^{4/3}/(polylog d):
Problem 3.5.1. Given an order-3 tensor T = Σ_{i=1}^n a_i ⊗ a_i ⊗ a_i ∈ (ℝ^d)^{⊗3}, where a_1, ..., a_n ∈ ℝ^d are iid vectors sampled from N(0, (1/d) Id), find vectors b_1, ..., b_n ∈ ℝ^d such that for all i ∈ [n],
⟨a_i, b_i⟩ ≥ 1 − o(1).
We give an algorithm that solves this problem, so long as the overcompleteness
of the input tensor is bounded such that n ≪ d^{4/3}/polylog d.
Theorem 3.5.2. Given as input the tensor T = Σ_{i=1}^n a_i ⊗ a_i ⊗ a_i where a_i ∼ N(0, (1/d) Id_d) with d ≤ n ≤ d^{4/3}/polylog d,^8 there is an algorithm which may run in time Õ(n d^{1+ω}) or Õ(n d^{3.257}), where d^ω is the time to multiply two d × d matrices, which with probability 1 − o(1) over the input T and the randomness of the algorithm finds unit vectors b_1, ..., b_n ∈ ℝ^d such that for all i ∈ [n],
⟨a_i, b_i⟩ ≥ 1 − Õ(n^{3/2}/d^2).
^8 The lower bound d ≤ n on n is a matter of technical convenience, avoiding separate concentration analyses and arithmetic in the undercomplete (n < d) and overcomplete (n ≥ d) settings. Indeed, our algorithm still works in the undercomplete setting (tensor decomposition is easier in the undercomplete setting than the overcomplete one), but here other algorithms based on local search also work [12].
We remark that this accuracy can be improved from 1 − Õ(n^{3/2}/d^2) to an arbitrarily good precision using existing local search methods with local convergence guarantees—we discuss the details in the full version.
As discussed in Section 4.2, to decompose the tensor Σ_i a_i^{⊗6} (note we do not actually have access to this input!) there is a very simple tensor decomposition algorithm: sample a random g ∈ ℝ^{d^2} and compute the matrix Σ_i ⟨g, a_i^{⊗2}⟩ (a_i a_i^⊤)^{⊗2}. With probability roughly n^{−O(ε)} this matrix has (up to scaling) the form (a_i a_i^⊤)^{⊗2} + E for some ‖E‖ ≤ 1 − ε, and this is enough to recover a_i.
However, instead of Σ_i a_i^{⊗6}, we have only Σ_{i,j} (a_i ⊗ a_j)^{⊗3}. Unfortunately, running the same algorithm on the latter input will not succeed. To see why, consider the extra terms E′ := Σ_{i≠j} ⟨g, a_i ⊗ a_j⟩ (a_i ⊗ a_j)^{⊗2}. Since |⟨g, a_i ⊗ a_j⟩| ≈ 1, it is straightforward to see that ‖E′‖_F ≈ n. Since the rank of E′ is clearly at most d^2, even if we are lucky and all the eigenvalues have similar magnitudes, still a typical eigenvalue will be ≈ n/d ≫ 1, swallowing the Σ_i a_i^{⊗6} term.
A convenient feature separating the signal terms Σ_i (a_i ⊗ a_i)^{⊗3} from the cross terms Σ_{i≠j} (a_i ⊗ a_j)^{⊗3} is that the cross terms are not within the span of the a_i ⊗ a_i. Although we cannot algorithmically access Span{a_i ⊗ a_i}, we have access to something almost as good: the unfolded input tensor, T = Σ_{i∈[n]} a_i (a_i ⊗ a_i)^⊤. The rows of this matrix lie in Span{a_i ⊗ a_i}, and so for i ≠ j, ‖T(a_i ⊗ a_i)‖ ≫ ‖T(a_i ⊗ a_j)‖. In fact, careful computation reveals that ‖T(a_i ⊗ a_i)‖ ≥ Ω̃(d/√n) · ‖T(a_i ⊗ a_j)‖.
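This separation is easy to observe numerically; the following toy check (ours, with small illustrative sizes) compares ‖T(a_i ⊗ a_i)‖ with ‖T(a_i ⊗ a_j)‖ for a random instance.

    import numpy as np

    rng = np.random.default_rng(3)
    d, n = 50, 200                                  # n between d and d^{4/3}, toy sizes
    A = rng.normal(scale=1/np.sqrt(d), size=(n, d))
    T = sum(np.outer(a, np.kron(a, a)) for a in A)  # unfolded tensor, d x d^2

    print("||T(a_1 (x) a_1)|| =", np.linalg.norm(T @ np.kron(A[0], A[0])))
    print("||T(a_1 (x) a_2)|| =", np.linalg.norm(T @ np.kron(A[0], A[1])))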
The idea now is to replace Σ_{i,j} ⟨g, a_i ⊗ a_j⟩ (a_i ⊗ a_j)^{⊗2} with Σ_{i,j} ⟨g, T(a_i ⊗ a_j)⟩ (a_i ⊗ a_j)^{⊗2}, now with g ∼ N(0, Id_d). As before, we are hoping that there is i_0 so that ⟨g, T(a_{i_0} ⊗ a_{i_0})⟩ ≫ max_{j≠i_0} ⟨g, T(a_j ⊗ a_j)⟩. But now we also require ‖Σ_{i≠j} ⟨g, T(a_i ⊗ a_j)⟩ (a_i ⊗ a_j)(a_i ⊗ a_j)^⊤‖ ≪ ⟨g, T(a_{i_0} ⊗ a_{i_0})⟩ ≈ ‖T(a_{i_0} ⊗ a_{i_0})‖. If we
123 are lucky and all the eigenvalues of this cross-term matrix have roughly the same magnitude (indeed, we will be lucky in this way), then we can estimate heuristically that
‖Σ_{i≠j} ⟨g, T(a_i ⊗ a_j)⟩ (a_i ⊗ a_j)(a_i ⊗ a_j)^⊤‖
≈ (1/d) ‖Σ_{i≠j} ⟨g, T(a_i ⊗ a_j)⟩ (a_i ⊗ a_j)(a_i ⊗ a_j)^⊤‖_F
≤ (1/d) · (√n/d) · |⟨g, T(a_{i_0} ⊗ a_{i_0})⟩| · ‖Σ_{i≠j} (a_i ⊗ a_j)(a_i ⊗ a_j)^⊤‖_F
≤ (n^{3/2}/d^2) · |⟨g, T(a_{i_0} ⊗ a_{i_0})⟩|,
suggesting our algorithm will succeed when n^{3/2} ≪ d^2, which is to say n ≪ d^{4/3}. The following theorem, which formalizes the intuition above, is at the heart of our tensor decomposition algorithm.
Theorem 3.5.3. Let a_1, ..., a_n be independent random vectors from N(0, (1/d) Id_d) with d ≤ n ≤ d^{4/3}/(polylog d) and let g be a random vector from N(0, Id_d). Let Σ := 𝔼_{x∼N(0,Id_d)} (xx^⊤)^{⊗2} and let R := √2 · (Σ^+)^{1/2}. Let T = Σ_{i∈[n]} a_i (a_i ⊗ a_i)^⊤. Define the matrix M ∈ ℝ^{d^2 × d^2},
M = Σ_{i,j∈[n]} ⟨g, T(a_i ⊗ a_j)⟩ · (a_i ⊗ a_j)(a_i ⊗ a_j)^⊤.
With probability 1 − o(1) over the choice of a_1, ..., a_n, for every polylog d/√d < ε < 1, the spectral gap of RMR satisfies λ_2/λ_1 ≤ 1 − O(ε) and the top eigenvector u ∈ ℝ^{d^2} of RMR satisfies, with probability Ω̃(1/n^{O(ε)}) over the choice of g,
max_{i∈[n]} ⟨Ru, a_i ⊗ a_i⟩^2 / (‖u‖^2 · ‖a_i‖^4) ≥ 1 − Õ(n^{3/2}/(εd^2)).
Moreover, with probability 1 − o(1) over the choice of a_1, ..., a_n, for every polylog d/√d < ε < 1 there are events E_1, ..., E_n so that ℙ[E_i] ≥ Ω̃(1/n^{1+O(ε)}) for all i ∈ [n] and when E_i occurs, ⟨Ru, a_i ⊗ a_i⟩^2 / (‖u‖^2 · ‖a_i‖^4) ≥ 1 − Õ(n^{3/2}/(εd^2)).
We will eventually set ε = 1/log n, which gives us a spectral algorithm for recovering a vector (1 − Õ(n/d^{3/2}))-correlated with some a_i^{⊗2}. Once we have a vector correlated with each a_i^{⊗2}, obtaining vectors close to the a_i is straightforward. We will begin by proving this theorem, and give algorithmic details in Section 3.5.2.
In Section 3.5.1 we prove Theorem 3.5.3 using two core facts: the Gaussian vector g is closer to some a_i than to any other with good probability, and the noise term Σ_{i≠j} ⟨g, T(a_i ⊗ a_j)⟩ (a_i ⊗ a_j)(a_i ⊗ a_j)^⊤ is bounded in spectral norm.
3.5.1 Proof of Theorem 3.5.3
The strategy to prove Theorem 3.5.3 is to decompose the matrix M into two parts M = M_diag + M_cross, one formed by the diagonal terms M_diag = Σ_{i∈[n]} ⟨g, T(a_i ⊗ a_i)⟩ · (a_i ⊗ a_i)(a_i ⊗ a_i)^⊤ and one formed by the cross terms M_cross = Σ_{i≠j} ⟨g, T(a_i ⊗ a_j)⟩ · (a_i ⊗ a_j)(a_i ⊗ a_j)^⊤. We use the fact that the top eigenvector of M_diag is likely to be correlated with one of the vectors a_j^{⊗2}, and the fact that M_diag has a noticeable spectral gap. The following propositions characterize the spectra of M_diag and M_cross, and are proven in the full version.
Proposition 3.5.4 (Spectral gap of diagonal terms). Let R = √2 · ((𝔼(xx^⊤)^{⊗2})^+)^{1/2} for x ∼ N(0, Id_d). Let a_1, ..., a_n be independent random vectors from N(0, (1/d) Id_d) with d ≤ n ≤ d^{2−Ω(1)}, and let g ∼ N(0, Id_d) be independent of all the others. Let T := Σ_{i∈[n]} a_i (a_i ⊗ a_i)^⊤. Suppose M_diag = Σ_{i∈[n]} ⟨g, T a_i^{⊗2}⟩ · (a_i a_i^⊤)^{⊗2}. Let also v_j be
such that v_j v_j^⊤ = ⟨g, T a_j^{⊗2}⟩ · (a_j a_j^⊤)^{⊗2}. Then, with probability 1 − o(1) over a_1, ..., a_n, for each ε ≥ polylog d/√d and each j ∈ [n], the event
E_{j,ε} := { ‖R M_diag R − ε · R v_j v_j^⊤ R‖ ≤ ‖R M_diag R‖ − (ε − Õ(√n/d)) · ‖R v_j v_j^⊤ R‖ }
has probability at least Ω̃(1/n^{1+O(ε)}) over the choice of g.
Second, we show that when n ≪ d^{4/3} the spectral norm of M_cross is negligible compared to this spectral gap.
Proposition 3.5.5 (Bound on cross terms). Let a_1, ..., a_n be independent random vectors from N(0, (1/d) Id_d), and let g be a random vector from N(0, Id_d). Let T := Σ_{i∈[n]} a_i (a_i ⊗ a_i)^⊤. Let M_cross := Σ_{i≠j∈[n]} ⟨g, T(a_i ⊗ a_j)⟩ a_i a_i^⊤ ⊗ a_j a_j^⊤. Suppose n ≥ d. Then w.ov.p.,
‖M_cross‖ ≤ Õ((n^3/d^4)^{1/2}).
Using these two propositions we will conclude that the top eigenvector of
RMR is likely to be correlated with one of the vectors a_j^{⊗2}. We also need two simple concentration bounds; we defer the proofs to the full version.
Lemma 3.5.6. Let a_1, ..., a_n be independently sampled vectors from N(0, (1/d) Id_d), and let g be sampled from N(0, Id_d). Let T = Σ_i a_i (a_i ⊗ a_i)^⊤. Then with overwhelming probability, for every j ∈ [n],
|⟨g, T(a_j ⊗ a_j)⟩ − ⟨g, a_j⟩ ‖a_j‖^4| ≤ Õ(√n/d).
Fact 3.5.7. Let x, y ∼ N(0, (1/d) Id). With overwhelming probability, |1 − ‖x‖^2| ≤ Õ(1/√d) and ⟨x, y⟩^2 ≤ Õ(1/d).
As a last technical tool we will need a simple claim about the fourth moment matrix of the multivariate Gaussian:
Fact 3.5.8. Let Σ = 𝔼_{x∼N(0,Id_d)} (xx^⊤)^{⊗2} and let R = √2 (Σ^+)^{1/2}. Then ‖R‖ ≤ 1, and for any v ∈ ℝ^d,
‖R(v ⊗ v)‖^2 = (1 − 1/(d+2)) · ‖v‖^4.
We are now prepared to prove Theorem 3.5.3.
Proof of Theorem 3.5.3 (sketch). Let d ≤ n ≤ d^{4/3}/(polylog d) for some polylog d to be chosen later. Let a_1, ..., a_n be independent random vectors from N(0, (1/d) Id_d) and let g ∼ N(0, Id_d) be independent of the others. Let
M_diag := Σ_{i∈[n]} ⟨g, T(a_i ⊗ a_i)⟩ · (a_i a_i^⊤)^{⊗2},
M_cross := Σ_{i≠j∈[n]} ⟨g, T(a_i ⊗ a_j)⟩ · a_i a_i^⊤ ⊗ a_j a_j^⊤.
Note that M = M_diag + M_cross.
Proposition 3.5.5 implies that
ℙ{ ‖M_cross‖ ≤ Õ(n^{3/2}/d^2) } ≥ 1 − d^{−ω(1)}.   (3.5.1)
Recall that Σ = 𝔼_{x∼N(0,Id_d)} (xx^⊤)^{⊗2} and R = √2 · (Σ^+)^{1/2}. By Proposition 3.5.4, with probability 1 − o(1) over the choice of a_1, ..., a_n, each of the following events E′_{j,ε} for j ∈ [n] and ε ≥ polylog(d)/√d has probability at least Ω̃(1/n^{1+O(ε)}) over the choice of g:
E′_{j,ε} : ‖R (M_diag − ε ⟨g, Ta_j^{⊗2}⟩ (a_j a_j^⊤)^{⊗2}) R‖ ≤ ‖R M_diag R‖ − (ε − Õ(n^{1/2}/d)) |⟨g, Ta_j^{⊗2}⟩| ‖Ra_j^{⊗2}‖^2.
And therefore, together with (3.5.1), the events
E*_{j,ε} : ‖R (M − ε ⟨g, Ta_j^{⊗2}⟩ (a_j a_j^⊤)^{⊗2}) R‖ ≤ ‖R·M·R‖ − (ε − Õ(n^{1/2}/d)) |⟨g, Ta_j^{⊗2}⟩| ‖Ra_j^{⊗2}‖^2 + Õ(n^{3/2}/d^2)
occur under the same stipulations (as M = M_diag + M_cross and ‖R·M_cross·R‖ ≤ ‖R‖^2 ‖M_cross‖ ≤ ‖M_cross‖ by Fact 3.5.8).
By standard reasoning about the top eigenvector of a matrix with a spectral gap (Lemma A.3.3), the event E*_{j,ε} implies that the top eigenvector u ∈ ℝ^{d^2} of R·M·R satisfies
⟨u, Ra_j^{⊗2}/‖Ra_j^{⊗2}‖⟩^2 ≥ 1 − Õ(√n/d)/(ε ‖Ra_j^{⊗2}‖^2) − Õ(n^{3/2}/d^2)/(ε ‖Ra_j^{⊗2}‖^2 |⟨g, Ta_j^{⊗2}⟩|).
Applying Fact 3.5.8, Fact 3.5.7, Lemma 3.5.6, and standard concentration arguments to simplify the expression, we have that with probability 1 − n^{−ω(1)},
⟨u, Ra_j^{⊗2}/‖Ra_j^{⊗2}‖⟩^2 ≥ 1 − Õ(√n/(εd)) − Õ(n^{3/2}/(εd^2)).
(The calculation is performed in more detail in the full version). A union bound now gives the desired conclusion.
Finally, we use concentration arguments and our knowledge of the correlation between u and Ra_i^{⊗2} to upper bound the second eigenvalue,
λ_2(RMR) ≤ 1 − Õ(ε) + Õ(n^{3/2}/(εd^2)),
with details given in the full version of the paper. Therefore,
λ_2(RMR)/λ_1(RMR) ≤ 1 − O(ε),
completing the proof.
Spectral Tensor Decomposition (One Attempt)
This is the main subroutine of our algorithm—we will run it Õ(n) times to recover all of the components a_1, ..., a_n.
Algorithm 3.5.9. Input: T = Σ_{i=1}^n a_i ⊗ a_i ⊗ a_i. Goal: Recover a_i for some i ∈ [n].
• Compute the matrix unfolding T ∈ ℝ^{d × d^2} of T. Then compute a 3-tensor S ∈ ℝ^{d^2 × d^2 × d^2} by starting with the 6-tensor T ⊗ T, permuting indices, and flattening to a 3-tensor. Apply T in one mode of S to obtain M ∈ ℝ^d ⊗ ℝ^{d^2} ⊗ ℝ^{d^2}:
T = Σ_{i∈[n]} a_i (a_i ⊗ a_i)^⊤,  S = Σ_{i,j=1}^n (a_i ⊗ a_j)^{⊗3},
M = S(T, Id_{d^2}, Id_{d^2}) = Σ_{i,j∈[n]} T(a_i ⊗ a_j) ⊗ (a_i ⊗ a_j) ⊗ (a_i ⊗ a_j).
• Sample a vector g ∈ ℝ^d with iid standard Gaussian entries. Evaluate M in its first mode in the direction of g to obtain M ∈ ℝ^{d^2 × d^2}:
M := M(g, Id_{d^2}, Id_{d^2}) = Σ_{i,j∈[n]} ⟨g, T(a_i ⊗ a_j)⟩ · (a_i ⊗ a_j)(a_i ⊗ a_j)^⊤.
• Let Σ := 𝔼[(aa^⊤)^{⊗2}] for a ∼ N(0, Id_d). Let R := √2 · (Σ^+)^{1/2}. Compute the top eigenvector u ∈ ℝ^{d^2} of RMR, and reshape Ru to a d × d matrix U ∈ ℝ^{d×d}.
• For each of the signings ±u_1, ±u_2 of the top 2 unit left (or right) singular vectors u_1, u_2 of U, check if Σ_{i∈[n]} ⟨a_i, ±u_j⟩^3 ≥ 1 − c(n, d), where c(n, d) = Θ(n/d^{3/2}) is an appropriate threshold. If so, output ±u_j. Otherwise output nothing.
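For very small dimensions, the matrix RMR can be formed explicitly. The sketch below is our own illustrative code, not the fast implementation of Lemma 3.5.10; the reshaping and index conventions are ours, the final thresholding step is replaced by simply keeping the best-scoring candidate, and the toy sizes are far from the asymptotic regime. It runs a handful of attempts of the subroutine and reports the best correlation achieved with a component.

    import numpy as np

    def decompose_one_attempt(T3, seed):
        """One attempt of a sketch of Algorithm 3.5.9 for tiny d (illustrative only)."""
        d = T3.shape[0]
        g = np.random.default_rng(seed).normal(size=d)
        T = T3.reshape(d, d * d)                       # unfolding  T = sum_i a_i (a_i (x) a_i)^T
        gT = T.T @ g                                   # g_T = T^T g  in R^{d^2}

        # S = permuted T (x) T over R^{d^2}; contracting its first mode with g_T gives
        # M = sum_{i,j} <g, T(a_i (x) a_j)> (a_i (x) a_j)(a_i (x) a_j)^T
        S = np.einsum('abc,xyz->axbycz', T3, T3).reshape(d*d, d*d, d*d)
        M = np.einsum('p,pqr->qr', gT, S)

        # whitening matrix R = sqrt(2) (Sigma^+)^{1/2}, Sigma = E_{x~N(0,Id)} (xx^T)^{(x)2}
        I = np.eye(d)
        Sigma = (np.einsum('ij,kl->ijkl', I, I) + np.einsum('ik,jl->ijkl', I, I)
                 + np.einsum('il,jk->ijkl', I, I)).reshape(d*d, d*d)
        w, V = np.linalg.eigh(Sigma)
        inv_sqrt = np.zeros_like(w); inv_sqrt[w > 1e-9] = w[w > 1e-9] ** -0.5
        R = np.sqrt(2) * (V * inv_sqrt) @ V.T

        vals, vecs = np.linalg.eigh(R @ M @ R)
        u = vecs[:, np.argmax(np.abs(vals))]           # eigenvector of largest |eigenvalue|
        U = (R @ u).reshape(d, d)                      # reshape Ru to a d x d matrix
        u1, u2 = np.linalg.svd(U)[0][:, 0], np.linalg.svd(U)[0][:, 1]
        cands = [s * x for x in (u1, u2) for s in (1.0, -1.0)]
        # keep the candidate maximizing T(x,x,x) = sum_i <a_i,x>^3
        return max(cands, key=lambda x: np.einsum('abc,a,b,c->', T3, x, x, x))

    # toy usage
    rng = np.random.default_rng(0)
    d, n = 12, 16
    A = rng.normal(scale=1/np.sqrt(d), size=(n, d))
    T3 = np.einsum('ia,ib,ic->abc', A, A, A)
    best = max((decompose_one_attempt(T3, s) for s in range(20)),
               key=lambda x: np.einsum('abc,a,b,c->', T3, x, x, x))
    print("best correlation with a component:", np.max(np.abs(A @ best)))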
3.5.2 Discussion of Full Algorithm
In this subsection we discuss the full details of our tensor decomposition algorithm. As above, the algorithm constructs a random matrix from the input tensor, then computes and post-processes its top eigenvector.
Theorem 3.5.3 is almost enough for the correctness of Algorithm 3.5.9, proving that RMR's top eigenvector is correlated with some a_i^{⊗2} with reasonable probability. We need a few more ingredients to prove Theorem 3.5.2—we must show that finding a vector u correlated with some a_i^{⊗2} is sufficient for finding a vector close to a_i. This can be done by standard eigenvector computations on a reshaping of u to a d × d matrix. The details are left for the full version.
Finally, we must prove a bound on the runtime of the algorithm, which we do by implementing each of the steps as a sequence of carefully-chosen matrix multiplications.
Lemma 3.5.10. Algorithm 3.5.9 can be implemented in time Õ(d^{1+ω}) ≤ Õ(d^{3.3729}), where d^ω is the runtime for multiplying two d × d matrices.
Proof. To run the algorithm, we only require access to power iteration using the matrix RMR. We first give a fast implementation for power iteration with the matrix M, and handle the multiplications with R separately.
Consider a vector v ∈ ℝ^{d^2} and a random vector g ∼ N(0, Id_d), and let V, G ∈ ℝ^{d×d} be the reshapings of v and g_T := T^⊤g respectively into matrices. Call T_v = T(Id_d, V, G) (V and G applied in the second and third modes of T), and call T_v also the reshaping of T_v into a d × d^2 matrix. We have that
T_v = Σ_{i∈[n]} a_i (Va_i ⊗ Ga_i)^⊤.
We show that the matrix-vector multiply Mv can be computed as a flattening of the following product:
T_v T^⊤ = (Σ_{i∈[n]} a_i (Va_i ⊗ Ga_i)^⊤)(Σ_{j∈[n]} (a_j ⊗ a_j) a_j^⊤)
= Σ_{i,j∈[n]} ⟨a_j, Va_i⟩ · ⟨a_j, Ga_i⟩ · a_i a_j^⊤
= Σ_{i,j∈[n]} ⟨a_i ⊗ a_j, v⟩ · ⟨g_T, a_i ⊗ a_j⟩ · a_i a_j^⊤.
Flattening T_v T^⊤ from a d × d matrix to a vector v_{TT} ∈ ℝ^{d^2}, we have that
v_{TT} = Σ_{i,j∈[n]} ⟨g_T, a_i ⊗ a_j⟩ · ⟨a_i ⊗ a_j, v⟩ · a_i ⊗ a_j = Mv.
So Mv is a flattening of the product T_v T^⊤, which we will compute as a proxy for computing Mv via direct multiplication.
Computing T_v = T(Id, V, G) can be done with two matrix multiplication operations, both times multiplying a d^2 × d matrix with a d × d matrix. Computing T_v T^⊤ is a multiplication of a d × d^2 matrix by a d^2 × d matrix. Both these steps may be done in time O(d^{1+ω}), by regarding the d × d^2 matrices as block matrices with blocks of size d × d. The asymptotically fastest known algorithm for matrix multiplication gives a time of O(d^{3.3729}) [35].
Now, to compute the matrix-vector multiply RMRu for any vector u ∈ ℝ^{d^2}, we may first compute v = Ru, perform the operation Mv in time O(d^{1+ω}) as described above, and then again multiply by R. The matrix R is sparse: it has
O(d) entries per row (see the full version for a complete description of R), so computing Ru requires time O(d3).
Performing the update RMRv a total of O(log^2 n) times is sufficient for convergence, as we have that with reasonable probability the spectral gap λ_2(RMR)/λ_1(RMR) ≤ 1 − O(1/log n), as a result of applying Theorem 3.5.3 with the choice of ε = O(1/log n).
Finally, checking the value of Σ_i ⟨a_i, x⟩^3 requires O(d^3) operations, and we do so a constant number of times, once for each of the signings of the top 2 left (or right) singular vectors of U.
3.6 Tensor principal component analysis
The Tensor PCA problem in the spiked tensor model is similar to the setting of tensor decomposition, but here the goal is to recover a single large component with all smaller components of the tensor regarded as random noise.
Problem 3.6.1 (Tensor PCA in the Order-3 Spiked Tensor Model). Given an input
tensor T = τ · v^{⊗3} + A, where v ∈ ℝ^n is an arbitrary unit vector, τ > 0 is the signal-to-noise ratio, and A is a random noise tensor with iid standard Gaussian entries, recover the signal v approximately.
Using the partial trace method, we give the first linear-time algorithm for
this problem that recovers v for signal-to-noise ratio τ ≥ Õ(n^{3/4}). In addition, the algorithm requires only O(n^2) auxiliary space (compared to the input size of n^3) and uses only one non-adaptive pass over the input.
3.6.1 Spiked tensor model
This spiked tensor model (for general order-k tensors) was introduced by Montanari and Richard [67], who also obtained the first algorithms to solve the model with provable statistical guarantees. Subsequently, the SoS approach was applied to the model to improve the signal-to-noise ratio required for odd-order tensors [50]; for 3-tensors, reducing the requirement from τ ≥ Ω(n) to τ ≥ Ω(n^{3/4} log(n)^{1/4}).
Using the linear-algebraic objects involved in the analysis of the SoS relaxation, the previous work has also described algorithms with guarantees similar to those of the SoS SDP relaxation, while requiring only subquadratic or nearly-linear time [50].
The algorithm here improves on the previous results by use of the partial trace method, simplifying the analysis and improving the runtime by a factor of log n.
3.6.2 Linear-time algorithm
Linear-Time Algorithm for Tensor PCA
Algorithm 3.6.2. Input: T = τ · v^{⊗3} + A. Goal: Recover v′ with ⟨v, v′⟩ ≥ 1 − o(1).
• Compute the partial trace M := Tr_n Σ_i T_i ⊗ T_i ∈ ℝ^{n×n}, where T_i are the first-mode slices of T.
• Output the top eigenvector v′ of M.
Theorem 3.6.3. When A has iid standard Gaussian entries and τ ≥ C n^{3/4} log(n)^{1/2}/ε for some constant C, Algorithm 3.6.2 recovers v′ with ⟨v, v′⟩ ≥ 1 − O(ε) with high probability over A.
Theorem 3.6.4. Algorithm 3.6.2 can be implemented in linear time and sublinear space.
These theorems are proved by routine matrix concentration results, showing that in the partial trace matrix, the signal dominates the noise.
To implement the algorithm in linear time it is enough to show that this (sublinear-sized) matrix has constant spectral gap; then a standard application of the matrix power method computes the top eigenvector.
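A compact numpy sketch of Algorithm 3.6.2 follows (our own illustrative code; it uses a full eigendecomposition instead of the power method, and the instance size and signal strength are arbitrary toy choices).

    import numpy as np

    def tensor_pca_partial_trace(T):
        """Sketch of Algorithm 3.6.2: form M = Tr_n sum_i T_i (x) T_i = sum_i Tr(T_i) T_i
        from the first-mode slices T_i and return its top eigenvector."""
        traces = np.trace(T, axis1=1, axis2=2)                 # Tr(T_i) for each slice
        M = np.einsum('i,ijk->jk', traces, T)                  # sum_i Tr(T_i) * T_i
        M = (M + M.T) / 2                                      # symmetrize for convenience
        return np.linalg.eigh(M)[1][:, -1]

    # toy spiked-tensor instance
    rng = np.random.default_rng(5)
    n = 80
    v = rng.normal(size=n); v /= np.linalg.norm(v)
    tau = 8 * n ** 0.75
    T = tau * np.einsum('a,b,c->abc', v, v, v) + rng.normal(size=(n, n, n))
    v_hat = tensor_pca_partial_trace(T)
    print("correlation |<v, v_hat>| =", abs(v @ v_hat))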
Lemma 3.6.5. For any v, with high probability over A, the following occur:
‖Σ_i Tr(A_i) · A_i‖ ≤ O(n^{3/2} log n),
‖Σ_i v(i) · A_i‖ ≤ O(√(n log n)),
‖Σ_i Tr(A_i) v(i) · vv^⊤‖ ≤ O(√(n log n)).
The proof may be found in Appendix A.6.
Proof of Theorem 3.6.3. We expand the partial trace Tr_n Σ_i T_i ⊗ T_i:
Tr_n Σ_i T_i ⊗ T_i = Σ_i Tr(T_i) · T_i
= Σ_i Tr(τ · v(i) vv^⊤ + A_i) · (τ · v(i) vv^⊤ + A_i)
= Σ_i (τ v(i) ‖v‖^2 + Tr(A_i)) · (τ · v(i) vv^⊤ + A_i)
= τ^2 vv^⊤ + τ (Σ_i v(i) · A_i + Σ_i Tr(A_i) v(i) vv^⊤) + Σ_i Tr(A_i) · A_i.
Applying Lemma 3.6.5 and the triangle inequality, we see that
‖τ (Σ_i v(i) · A_i + Σ_i Tr(A_i) v(i) vv^⊤) + Σ_i Tr(A_i) · A_i‖ ≤ O(n^{3/2} log n)
with high probability. Thus, for appropriate choice of τ = Ω(n^{3/4} √((log n)/ε)), the matrix Tr_n Σ_i T_i ⊗ T_i is close to rank one, and the result follows by standard manipulations.
Proof of Theorem 3.6.4. Carrying over the expansion of the partial trace from above and setting τ = O(n^{3/4} √((log n)/ε)), the matrix Tr_n Σ_i T_i ⊗ T_i has a spectral gap ratio of Ω(1/ε), and so the matrix power method finds the top eigenvector in O(log(n/ε)) iterations. This matrix has dimension n × n, so a single iteration takes O(n^2) time, which is sublinear in the input size n^3. Finally, to construct Tr_n Σ_i T_i ⊗ T_i we use
Tr_n Σ_i T_i ⊗ T_i = Σ_i Tr(T_i) · T_i
and note that to construct the right-hand side it is enough to examine each entry of T just O(1) times and perform O(n^3) additions. At no point do we need to store more than O(n^2) matrix entries at the same time.
CHAPTER 4
POLYNOMIAL LIFTS
4.1 Introduction
Tensors are arrays of (real) numbers with multiple indices—generalizing matrices (two indices) and vectors (one index) in a natural way. They arise in many different contexts, e.g., moments of multivariate distributions, higher-order derivatives of multivariable functions, and coefficients of multivariate polynomials. An important ongoing research effort aims to extend algorithmic techniques for vectors and matrices to more general tensors. A key challenge is that many tractable matrix computations (like rank and spectral norm) become NP-hard in the tensor setting (even for just three indices) [?, ?]. However, recent work gives evidence that it is possible to avoid this computational intractability and develop provably efficient algorithms, especially for low-rank tensor decompositions, by making suitable assumptions about the input and allowing for approximations [12, 7, 39, 50, ?]. These algorithms lead to the best known provable guarantees for a wide range of unsupervised learning problems [10, 24, 42, 9], including learning mixtures of Gaussians [37], Latent Dirichlet topic modeling [4], and dictionary learning [17]. Low-rank tensor decompositions are useful for these learning problems because they are often unique up to permuting the factors—in contrast, low-rank matrix factorizations are unique only up to unitary transformation. In fact, as far as we are aware, in all natural situations where finding low-rank tensor decompositions is tractable, the decompositions are also unique.
We consider the following (symmetric) version of the tensor decomposition problem: Let a_1, ..., a_n ∈ ℝ^d be d-dimensional unit vectors. We are given (approximate) access to the first k moments M_1, ..., M_k of the uniform distribution over a_1, ..., a_n, that is,
M_t = (1/n) Σ_{i=1}^n a_i^{⊗t}  for t = 1, ..., k.   (4.1.1)
The goal is to approximately recover the vectors a1,..., an. What conditions on the vectors a1,..., an and the number of moments k allow us to efficiently and robustly solve this problem?
A classical algorithm based on (simultaneous) matrix diagonalization [46, 56, attributed to Jennrich] shows that whenever the vectors a_1, ..., a_n are linearly independent, k = 3 moments suffice to recover the vectors in polynomial time. (This algorithm is also robust against polynomially small errors in the input moment tensors [6, 42, 24].) Therefore an important remaining algorithmic challenge for tensor decomposition is the overcomplete case, when the number of vectors (significantly) exceeds their dimension. Several recent works studied this case with different assumptions on the vectors and the number of moments. In this work, we give a unified algorithmic framework for overcomplete tensor decomposition that achieves—and in many cases surpasses—the previous best guarantees for polynomial-time algorithms.
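For concreteness, here is a minimal numpy sketch of the simultaneous-diagonalization idea in the noiseless, orthonormal case (our own illustration of the classical algorithm, not the robust analysis developed in this chapter).

    import numpy as np

    def jennrich(T):
        """Sketch of Jennrich's simultaneous-diagonalization algorithm, assuming
        linearly independent (here: orthonormal) components and negligible noise.
        T is a d x d x d tensor approximately equal to sum_i a_i^{(x)3}."""
        d = T.shape[0]
        rng = np.random.default_rng(0)
        g, h = rng.normal(size=d), rng.normal(size=d)
        Mg = np.einsum('abc,c->ab', T, g)              # sum_i <a_i, g> a_i a_i^T
        Mh = np.einsum('abc,c->ab', T, h)              # sum_i <a_i, h> a_i a_i^T
        # for orthonormal components, eigenvectors of Mg Mh^+ with nonzero eigenvalues are the a_i
        vals, vecs = np.linalg.eig(Mg @ np.linalg.pinv(Mh))
        order = np.argsort(-np.abs(vals))
        return np.real(vecs[:, order]).T               # rows are candidate directions

    # toy usage with n = 4 orthonormal unit vectors in R^6
    rng = np.random.default_rng(1)
    d, n = 6, 4
    A = np.linalg.qr(rng.normal(size=(d, n)))[0].T     # orthonormal rows a_1, ..., a_n
    T = np.einsum('ia,ib,ic->abc', A, A, A)
    B = jennrich(T)[:n]
    print(np.round(np.abs(A @ B.T), 3))                # close to a permutation matrix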
In particular, some decompositions that previously required quasi-polynomial time to find are reduced to polynomial time in our framework, including the case of general tensors with order logarithmically large in its overcompleteness
n/d [17] and random order-3 tensors with rank n ≤ d^{3/2}/log^{O(1)}(d) [39]. Iterative methods may also achieve fast local convergence guarantees for incoherent order-3 tensors with rank o(d^{3/2}), which become global convergence guarantees under no more than constant overcompleteness [10]. In the smoothed analysis model, where each vector of the desired decomposition is assumed to have
been randomly perturbed by an inverse polynomial amount, polynomial-time decomposition was achieved for order-5 tensors of rank up to d^2/2 [24]. Our framework extends this result to order-4 tensors, for which the corresponding analysis was previously unknown for any superconstant overcompleteness.
The starting point of our work is a new analysis of the aforementioned matrix diagonalization algorithm that works for the case when a1,..., an are linearly independent. A key ingredient of our analysis is a powerful and by now standard concentration bound for Gaussian matrix series [63, 77]. An important feature of our analysis is that it is captured by the sum-of-squares (SoS) proof system in a robust way. This fact allows us to use Jennrich’s algorithm as a rounding procedure for sum-of-squares relaxations of tensor decomposition, which is the key idea behind improving previous quasi-polynomial time algorithms based on these relaxations [17, 39].
The main advantage that sum-of-squares relaxations afford for tensor de- composition is that they allow us to efficiently hallucinate faithful higher-degree moments for a distribution given only its lower-degree moments. We can now run classical tensor decomposition algorithms like Jennrich’s on these halluci- nated higher-degree moments (akin to rounding). The goal is to show that those algorithms work as well as they would on the true higher moments. What is challenging about it is that the analysis of Jennrich’s algorithm relies on small spectral gaps that are difficult to reason about in the sum-of-squares setting. (Previous sum-of-squares based methods for tensor decomposition also followed this outline but used simpler, more robust rounding algorithms which required quasi-polynomial time.)
To this end, we view solutions to sum-of-squares relaxations as pseudo-
distributions, which generalize classical probability distributions in a way that takes computational efficiency into account.^1 More concretely, pseudo-distributions are indistinguishable from actual distributions with respect to tests captured by
a restricted system of proofs, called sum-of-squares proofs.
An interesting feature of how we use pseudo-distributions is that our re- laxations search for pseudo-distributions of large entropy (via an appropriate
surrogate). This objective is surprising, because when we consider convex relax- ations of NP-hard search problems, the intended solutions typically correspond to atomic distributions which have entropy 0. Here, high entropy in the pseudo-
distribution allows us to ensure that rounding results in a useful solution. This appears to be related to the way in which many randomized rounding procedures use maximum-entropy distributions [?], but differs in that the aforementioned
rounding procedures focus on the entropy of the rounding process rather than the entropy (surrogate) of the solution to the convex relaxation. A measure of “entropy” has also been directly ascribed to pseudo-distributions previously [?],
and the principle of maximum entropy has been applied to pseudo-distributions as well [?], but these have previously occurred separately, and our application is
the first to encode a surrogate notion of entropy directly into the sum-of-squares proof system.
Our work also takes inspiration from a recent work that uses sum-of-squares techniques to design fast spectral algorithms for a range of problems includ-
ing tensor decomposition [?]. Their algorithm also proceeds by constructing surrogates for higher moments and applying a classical tensor decomposition algorithm on these surrogates. The difference is that the surrogates in [?] are
^1 In particular, the set of constant-degree moments of n-variate pseudo-distributions admits an n^{O(1)}-time separation oracle based on computing eigenvectors.
139 explicitly constructed as low-degree polynomial of the input tensor, whereas our surrogates are computed by sum-of-squares relaxations. The explicit surrogates of [?] allow for a direct (but involved) analysis through concentration bounds for matrix polynomials. In our case, a direct analysis is not possible because we have very little control over the surrogates computed by sum-of-squares relaxations. Therefore, the challenge for us is to understand to what extent classical tensor decomposition algorithms are compatible with the sum-of-squares proof system. Our analysis ends up being less technically involved compared to [?] (using the language of pseudo-distributions and sum-of-squares proofs).
4.1.1 Results for tensor decomposition
Let {a_1, ..., a_n} ⊆ ℝ^d be a set of unit vectors. We study the task of approximately recovering this set of vectors given (noisy) access to its first k moments (4.1.1).
We organize this overview of our results based on different kinds of assumptions imposed on the set {a1,..., an } and the order of tensor/moments that we have access to. All of our algorithms are randomized and may fail with some small probability over their internal randomness, say probability at most 0.01. (Standard arguments allow us to amplify this probability at the cost of a small increase in running time.)
Orthogonal vectors. This scenario often captures the case of general linearly independent vectors because knowledge of the second moments of a1,..., an allows us to orthonormalize the vectors (this process is sometimes called “whiten- ing”). Many efficient algorithms are known in this case. Our contribution here
is in improving the error tolerance. For a symmetric 3-tensor E ∈ (ℝ^d)^{⊗3}, we use ‖E‖_{{1},{2,3}} to denote the spectral norm of E as a d-by-d^2 matrix (using the first mode of E to index rows and the last two modes of E to index the columns). This norm is at most √d times the injective norm ‖E‖_{{1},{2},{3}} (the maximum of ⟨E, x ⊗ y ⊗ z⟩ over all unit vectors x, y, z ∈ ℝ^d). The previous best error tolerance for this problem required the error tensor E = T − Σ_{i=1}^n a_i^{⊗3} to have injective norm ‖E‖_{{1},{2},{3}} ≪ 1/d. Our algorithm requires only ‖E‖_{{1},{2,3}} ≪ 1, which is satisfied in particular when ‖E‖_{{1},{2},{3}} ≪ 1/√d.
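For reference, this flattened spectral norm is straightforward to compute; the helper below is our own small illustration.

    import numpy as np

    def norm_1_23(E):
        """Spectral norm of a d x d x d tensor flattened to a d x d^2 matrix
        (first mode indexes rows, the last two modes index columns)."""
        d = E.shape[0]
        return np.linalg.norm(E.reshape(d, d * d), 2)

    # rank-one sanity check: for E = x^{(x)3} with unit x, the flattened norm equals 1,
    # and in general it never exceeds sqrt(d) times the injective norm
    d = 20
    x = np.random.default_rng(0).normal(size=d); x /= np.linalg.norm(x)
    print(norm_1_23(np.einsum('a,b,c->abc', x, x, x)))   # 1.0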
Theorem 4.1.1. There exists a polynomial-time algorithm that given a symmetric 3-
tensor T ∈ (ℝ^d)^{⊗3} outputs a set of vectors {a′_1, ..., a′_{n′}} ⊆ ℝ^d such that for every orthonormal set {a_1, ..., a_n} ⊆ ℝ^d, the Hausdorff distance^2 between the two sets is at most
dist_H({a_1, ..., a_n}, {a′_1, ..., a′_{n′}}) ≤ O(1) · ‖T − Σ_{i=1}^n a_i^{⊗3}‖_{{1},{2,3}}.   (4.1.2)
Under the additional assumption ‖T − Σ_{i=1}^n a_i^{⊗3}‖_{{1},{2,3}} ≪ 1/log d, the running time of the algorithm can be improved to O(d^{1+ω}) ≤ d^{3.33} using fast matrix multiplication, where ω is the number such that two n × n matrices can be multiplied together in time n^ω (see Theorem 4.10.2).
It is also possible to replace the spectral norm ‖·‖_{{1},{2,3}} in the above theorem statement by constant-degree sum-of-squares relaxations of the injective norm of 3-tensors. (See Remark 4.5.3 for details.) If the error E has Gaussian distribution N(0, σ^2 · Id_d^{⊗3}), then this norm is w.h.p. bounded by σ · d^{3/4} (log d)^{O(1)} [50], whereas the norm ‖·‖_{{1},{2,3}} has magnitude Ω(σ · d). We prove Theorem 4.1.1 in Section 4.5.2.
^2 The Hausdorff distance dist_H(X, Y) between two finite sets X and Y measures the length of the largest gap between the two sets. Formally, dist_H(X, Y) is the maximum of max_{x∈X} min_{y∈Y} ‖x − y‖ and max_{y∈Y} min_{x∈X} ‖x − y‖.
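The Hausdorff distance between two finite sets of vectors can be computed directly; the following small helper is our own illustration of the definition in the footnote.

    import numpy as np

    def hausdorff(X, Y):
        """dist_H between two finite sets of vectors, given as the rows of X and Y."""
        D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)   # pairwise distances
        return max(D.min(axis=1).max(), D.min(axis=0).max())

    X = np.array([[0.0, 0.0], [1.0, 0.0]])
    Y = np.array([[0.0, 0.1], [1.0, 0.0], [3.0, 0.0]])
    print(hausdorff(X, Y))   # 2.0, dominated by the point (3, 0) far from X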
Random vectors. We consider the case that a_1, ..., a_n are chosen independently at random from the unit sphere of ℝ^d. For n ≤ d, this case is roughly equivalent to the case of orthonormal vectors. Thus, we are interested in the “overcomplete” case n ≫ d, when the rank is larger than the dimension. Previous work found the decomposition in quasi-polynomial time when n ≤ d^{3/2}/log^{O(1)} d [39], or in time subquadratic in the input size when n ≤ d^{4/3}/log^{O(1)} d [?]. Our polynomial-time algorithm therefore is an improvement when n is between d^{4/3} and d^{3/2} (up to logarithmic factors).
Theorem 4.1.2. There exists a polynomial-time algorithm A such that with probability
1 − d^{−ω(1)} over the choice of random unit vectors a_1, ..., a_n ∈ ℝ^d, every symmetric 3-tensor T ∈ (ℝ^d)^{⊗3} satisfies
dist_H(A(T), {a_1, ..., a_n}) ≤ (n^2/d^{1.5})^{Ω(1)} + O(1) · ‖T − Σ_{i=1}^n a_i^{⊗3}‖_{{1},{2,3}}.   (4.1.3)
Again it is possible to replace the spectral norm ‖·‖_{{1},{2,3}} in the above theorem statement by constant-degree sum-of-squares relaxations of the injective norm of 3-tensors, which as mentioned before give better bounds for Gaussian error tensors. We prove Theorem 4.1.2 in Section 4.7.
Smoothed vectors. Next, we consider a more general setup where the vectors a_1, ..., a_n ∈ ℝ^d are smoothed, i.e., randomly perturbed. This scenario is significantly more general than random vectors. Again we are interested in the overcomplete case n ≫ d. The previous best work [24] showed that the fifth moment of smoothed vectors a_1, ..., a_n with n ≤ d^2/2 is enough to approximately recover the vectors even in the presence of a polynomial amount of error. For fourth moments of smoothed vectors, no such result was known even for lower overcompleteness, say n = d^{1.01}.
We give an interpretation of the 4-tensor decomposition algorithm FOOBI^3 [55] as a special case of a sum-of-squares based decomposition algorithm. We show that the sum-of-squares based algorithm works in the smoothed setting even in the presence of a polynomial amount of error. We define a condition number κ(·) for sets of vectors a_1, ..., a_n ∈ ℝ^d (a polynomial in the condition number of two matrices, one with columns {a_i^{⊗2} | i ∈ [n]} and one with columns {a_i ⊗ (a_i ⊗ a_j − a_j ⊗ a_i) ⊗ a_j | i ≠ j ∈ [n]}). First, we show that the algorithm can tolerate error 1/κ, which could be independent of the dimension. Concretely, our algorithm will output a set of vectors â_1, ..., â_n which will be close to {a_1, ..., a_n} up to permutations and sign flips, with a relative error that scales linearly in the relative error of the input and the condition number κ. Second, we show that for smoothed vectors this condition number is at most polynomial with probability exponentially close to 1.
Theorem 4.1.3. There exists a polynomial-time algorithm such that for every symmetric
4-tensor T ∈ (ℝ^d)^{⊗4} and every set {a_1, ..., a_n} ⊆ ℝ^d of vectors (not necessarily unit length), there exists a permutation π: [n] → [n] so that the output {a′_1, ..., a′_n} of the algorithm on input T satisfies
max_{i∈[n]} ‖a_i − a′_{π(i)}‖ / ‖a_i‖ ≤ O(1) · (‖T − Σ_{i=1}^n a_i^{⊗4}‖_{{1,2},{3,4}} / σ_n(Σ_{i=1}^n (a_i^{⊗2})(a_i^{⊗2})^⊤)) · κ(a_1, ..., a_n),   (4.1.4)
where σ_n(A) refers to the nth singular value of the matrix A, here the smallest non-zero singular value.
We say that a distribution over vectors $a_1,\ldots,a_n \in \mathbb{R}^d$ is $\gamma$-smoothed if $a_i = a_i^0 + \gamma\cdot g_i$, where $a_1^0,\ldots,a_n^0$ are fixed vectors and $g_1,\ldots,g_n$ are independent Gaussian vectors from $N(0,\tfrac1d\mathrm{Id}_d)$.

$^3$The FOOBI algorithm is known to work for overcomplete 4-tensors when there is no error in the input. Researchers [24] asked if this algorithm tolerates a polynomial amount of error. Our work answers this question affirmatively for a variant of FOOBI (based on sum-of-squares).
Theorem 4.1.4. Let $\varepsilon > 0$ and $n,d\in\mathbb{N}$ with $n\le d^2/10$. Then, for any $\gamma$-smoothed distribution over vectors $a_1,\ldots,a_n$ in $\mathbb{R}^d$,
$$\Pr\Big\{\kappa(a_1,\ldots,a_n) \le \mathrm{poly}(d,\gamma)\Big\} \;\ge\; 1 - \exp\big(-d^{\Omega(1)}\big)\,.$$
The above theorems together imply a polynomial-time algorithm for approximately decomposing overcomplete smoothed 4-tensors even if the input error is polynomially large. The error probability of the algorithm is exponentially small over the choice of the smoothing. It is an interesting open problem to extend this result to overcomplete smoothed 3-tensors, even for lower overcompleteness $n = d^{1.01}$. Theorem 4.1.3 and Theorem 4.1.4 are proved in Section 4.8.
Separated unit vectors. In the scenario where the inner products among the vectors $a_1,\ldots,a_n \in \mathbb{R}^d$ are bounded by $\rho < 1$ in absolute value, the previous best decomposition algorithm shows that moments of order $(\log n)/\log(1/\rho)$ suffice [69]. Our algorithm requires moments of higher order (by a factor logarithmic in the desired accuracy) but in return tolerates up to constant spectral error. This increased error tolerance also allows us to apply this result to dictionary learning with up to constant sparsity (see Section 4.1.2).
Theorem 4.1.5. There exists an algorithm $A$ with polynomial running time (in the size of its input) such that for all $\eta,\rho\in(0,1)$ and $\sigma\ge 1$, for every set of unit vectors $\{a_1,\ldots,a_n\}\subseteq\mathbb{R}^d$ with $\|\sum_{i=1}^n a_i a_i^T\|\le\sigma$ and $\max_{i\ne j}|\langle a_i,a_j\rangle|\le\rho$, when the algorithm is given a symmetric $k$-tensor $T\in(\mathbb{R}^d)^{\otimes k}$ with $k \ge O\big(1+\tfrac{\log\sigma}{\log(1/\rho)}\big)\cdot\log(1/\eta)$, then its output $A(T)$ is a set of vectors $\{a'_1,\ldots,a'_{n'}\}\subseteq\mathbb{R}^d$ such that
$$\mathrm{dist}_H\big(\{a_1'^{\otimes 2},\ldots,a_{n'}'^{\otimes 2}\},\{a_1^{\otimes 2},\ldots,a_n^{\otimes 2}\}\big)^2 \;\le\; O\Big(\eta + \Big\|T-\sum_{i=1}^n a_i^{\otimes k}\Big\|_{\{1,\ldots,\lfloor k/2\rfloor\},\{\lfloor k/2\rfloor+1,\ldots,k\}}\Big)\,. \qquad(4.1.5)$$
We also show that a simple spectral algorithm with running time close to $d^k$ (the size of the input) achieves similar guarantees (see Remark 4.10.3). However, the error tolerance of this algorithm is in terms of an unbalanced spectral norm: $\|T-\sum_{i=1}^n a_i^{\otimes k}\|_{\{1,\ldots,k/3\},\{k/3+1,\ldots,k\}}$ (the spectral norm of the tensor viewed as a $d^{k/3}$-by-$d^{2k/3}$ matrix). This norm is always larger than the balanced spectral norm in the theorem statement. In particular, for dictionary learning applications, this norm is larger than 1, which renders the guarantee of the simpler spectral algorithm vacuous in this case. We prove Theorem 4.1.5 in Section 4.5.3.
General unit vectors. In this scenario, the number of moments that our algorithm requires is constant as long as $\sum_i a_i a_i^T$ has constant spectral norm and the desired accuracy is constant.
Theorem 4.1.6. There exists an algorithm $A$ (see Algorithm 4) with polynomial running time (in the size of its input) such that for all $\varepsilon\in(0,1)$ and $\sigma\ge 1$, for every set of unit vectors $\{a_1,\ldots,a_n\}\subseteq\mathbb{R}^d$ with $\|\sum_{i=1}^n a_i a_i^T\|\le\sigma$ and every symmetric $2k$-tensor $T\in(\mathbb{R}^d)^{\otimes 2k}$ with $k \ge (1/\varepsilon)^{O(1)}\cdot\log(\sigma)$ and $\|T-\sum_i a_i^{\otimes 2k}\|_{\{1,\ldots,k\},\{k+1,\ldots,2k\}}\le 1/3$, we have
$$\mathrm{dist}_H\big(A(T),\{a_1^{\otimes 2},\ldots,a_n^{\otimes 2}\}\big)^2 \;\le\; O(\varepsilon)\,.$$
The previous best algorithm for this problem required tensors of order $(\log\sigma)/\varepsilon$ and had running time $d^{O((\log\sigma)/\varepsilon^{O(1)}+\log n)}$ [17, Theorem 4.3]. We require the same order of the tensor, and the runtime is improved to be polynomial in the size of the input (that is, $d^{\mathrm{poly}((\log\sigma)/\varepsilon)}$).
We also remark that, somewhat surprisingly, we can handle error $1/3$ in spectral norm, and this is possible partly due to the choice of working with high-order tensors. As a sanity check, we note that information-theoretically the components are identifiable: under the assumptions, the only vectors $u$ that satisfy $\langle T, u^{\otimes 2k}\rangle \ge 1/3$ are those vectors close to one of the $a_i$'s. We also note that rounding the sum-of-squares relaxation of this simple inefficient test requires a new idea beyond what we used previously. Here the difficulty is to make the runtime $d^{\mathrm{poly}((\log\sigma)/\varepsilon)}$ instead of $d^{\mathrm{poly}(\sigma/\varepsilon)}$. See Section 4.9 for details.
Spectral algorithms without sum-of-squares. Finally, using a similar rounding technique directly on an orthogonal tensor (without using sum-of-squares and pseudo-moments), we also obtain a fast and robust algorithm for orthogonal tensor decomposition. See Section 4.10 for details.
4.1.2 Applications of tensor decomposition
Tensor decomposition has a wide range of applications. We focus here on learning sparse dictionaries, which is an example of the more general phenomenon of using tensor decomposition to learn latent variable models. Here, we obtain the
first polynomial-time algorithms that work in the overcomplete regime up to constant sparsity.
Dictionary learning is an important problem in multiple areas, ranging from computational neuroscience [?, ?, ?] and machine learning [?, ?] to computer vision and image processing [?, ?, ?]. The general goal is to find a good basis for given data. More formally, in the dictionary learning problem, also known as sparse coding, we are given samples of a random vector $y\in\mathbb{R}^n$ of the form $y = Ax$, where $A$ is some unknown matrix in $\mathbb{R}^{n\times m}$, called the dictionary, and $x$ is sampled from an unknown distribution over sparse vectors. The goal is to approximately recover the dictionary $A$.
We consider the same class of distributions over sparse vectors $\{x\}$ as [17], which as discussed in [17] admits a wide range of non-product distributions over sparse vectors. (The case of product distributions reduces to the significantly easier problem of independent component analysis.) We say that $\{x\}$ is $(k,\tau)$-nice if $\mathbb{E}\, x_i^k = 1$ for every $i\in[m]$, $\mathbb{E}\, x_i^{k/2}x_j^{k/2} \le \tau$ for all $i\ne j\in[m]$, and $\mathbb{E}\, x^\alpha = 0$ for every non-square degree-$k$ monomial $x^\alpha$. Here, $\tau$ is a measure of the relative sparsity of the vectors $\{x\}$.
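The following numpy sketch (not from the dissertation) only illustrates the sample model $y = Ax$; the sparse distribution used below, with independent $\pm$ entries kept with probability $s/m$ and rescaled so that $\mathbb{E}\,x_i^k = 1$, is merely an illustrative candidate, and verifying that it is $(k,\tau)$-nice is not attempted here. The helper name `sample_y` is ours.
\begin{verbatim}
import numpy as np

def sample_y(A, s, k, num_samples, rng):
    """Draw num_samples vectors y = A x with expected sparsity s per sample."""
    n, m = A.shape
    signs = rng.choice([-1.0, 1.0], size=(num_samples, m))
    mask = rng.random((num_samples, m)) < s / m
    scale = (m / s) ** (1.0 / k)      # so E[x_i^k] = (s/m) * scale^k = 1 for even k
    X = signs * mask * scale
    return X @ A.T                    # each row is one sample y = A x

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 128)) / 8.0   # overcomplete dictionary, m > n
Y = sample_y(A, s=3, k=4, num_samples=1000, rng=rng)
print(Y.shape)                              # (1000, 64)
\end{verbatim}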
We give an algorithm that for nice distributions solves the dictionary learning problem in polynomial time when the desired accuracy is constant, the overcompleteness of the dictionary is constant (measured by the spectral norm $\|A\|$), and the sparsity parameter $\tau$ is a sufficiently small constant (depending only on the desired accuracy and $\|A\|$). The previous best algorithm [17] requires quasi-polynomial time in this setup (but works in polynomial time for polynomial sparsity $\tau\le n^{-\Omega(1)}$).
Theorem 4.1.7. There exists an algorithm $R$ parameterized by $\sigma\ge 1$ and $\eta\in(0,1)$ such that for every dictionary $A\in\mathbb{R}^{n\times m}$ with $\|A\|\le\sigma$ and every $(k,\tau)$-nice distribution $\{x\}$ over $\mathbb{R}^m$ with $k\ge k(\eta,\sigma) = O((\log\sigma)/\eta)$ and $\tau\le\tau(k) = k^{-O(k)}$, the algorithm, given $n^{O(k)}$ samples from $\{y=Ax\}$, outputs in time $n^{O(k)}$ vectors $a'_1,\ldots,a'_m$ that are $O(\eta)^{1/2}$-close to the columns of $A$.
Since previous work [17] provides a black box reduction from dictionary learning to tensor decomposition, the theorem above follows from Theorem 4.1.6.
Our Theorem 4.1.5 implies a dictionary learning algorithm with better parameters for the case that the columns of A are separated.
4.1.3 Polynomial optimization with few global optima
Underlying our algorithms for tensor decomposition is an algorithm for solving general systems of polynomial constraints with the property that the total number of different solutions is small and that there exists a short certificate of that fact in the form of a sum-of-squares proof.
Let $\mathcal{A}$ be a system of polynomial constraints over real variables $x = (x_1,\ldots,x_d)$ and let $P\colon\mathbb{R}^d\to\mathbb{R}^{d^\ell}$ be a polynomial map of degree at most $\ell$ (for example, $P(x) = x^{\otimes\ell}$). We say that solutions $a_1,\ldots,a_n\in\mathbb{R}^d$ to $\mathcal{A}$ are unique under the map $P$ if the vectors $P(a_1),\ldots,P(a_n)$ are orthonormal up to error $0.01$ (in spectral norm) and every solution $a$ to $\mathcal{A}$ satisfies $P(a)\approx P(a_i)$ for some $i\in[n]$. We encode this property algebraically by requiring that the constraints in $\mathcal{A}$ imply the constraint $\sum_{i=1}^n\langle P(a_i),P(x)\rangle^4 \ge 0.99\cdot\|P(x)\|^4$. We say that the solutions $a_1,\ldots,a_n$ are $\ell$-certifiably unique if in addition this implication has a degree-$\ell$ sum-of-squares proof.
The following theorem shows that if polynomial constraints have certifiably unique solutions (under a given map P), then we can find them efficiently (under the map P).
Theorem 4.1.8 (Informal statement of Theorem 4.5.2). Given a system of polynomial constraints $\mathcal{A}$ and a polynomial map $P$ such that there exist $\ell$-certifiably unique solutions $a_1,\ldots,a_n$ for $\mathcal{A}$, we can find in time $d^{O(\ell)}$ vectors that are $0.1$-close to $P(a_1),\ldots,P(a_n)$ in Hausdorff distance.
4.2 Techniques
Here is the basic idea behind using sum-of-squares for tensor decomposition:
Let $a_1,\ldots,a_n\in\mathbb{R}^d$ be unit vectors and suppose we have access to their first three moments $M_1, M_2, M_3$ as in (4.1.1). Since the task of recovering $a_1,\ldots,a_n$ is easier the more moments we know, we would make a lot of progress if we could compute higher moments of $a_1,\ldots,a_n$, say the fourth moment $M_4$. A natural approach toward that goal is to compute a probability distribution $D$ over the sphere of $\mathbb{R}^d$ such that $D$ matches the moments of $a_1,\ldots,a_n$ that we know, i.e., $\mathbb{E}_{D(u)}\,u = M_1$, $\mathbb{E}_{D(u)}\,u^{\otimes 2} = M_2$, $\mathbb{E}_{D(u)}\,u^{\otimes 3} = M_3$, and then use the fourth moment $\mathbb{E}_{D(u)}\,u^{\otimes 4}$ as an estimate for $M_4$.
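For concreteness, the small numpy sketch below (not part of the original text, and assuming the convention that $M_j$ is the sum $\sum_i a_i^{\otimes j}$ of tensor powers of the components) forms the moment tensors that the algorithm receives and the fourth moment it would like to estimate.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 5
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)       # unit components a_i

M2 = np.einsum('ia,ib->ab', A, A)                   # sum_i a_i (x) a_i
M3 = np.einsum('ia,ib,ic->abc', A, A, A)            # sum_i a_i^{(x)3}, the input data
M4 = np.einsum('ia,ib,ic,id->abcd', A, A, A, A)     # sum_i a_i^{(x)4}, unknown to the algorithm
print(M2.shape, M3.shape, M4.shape)
\end{verbatim}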
There are two issues with this approach: (1) computing such a distribution D is intractable and (2) even if we could compute such a distribution it is not clear if its fourth moment will be close to the fourth moments M4 we are interested in.
We address issue (1) by relaxing D to be a pseudo-distribution (solution to sum-of-squares relaxations). Then, we can match the given moments efficiently.
Issue (2) is related to the uniqueness of the tensor decomposition, which relies on properties of the vectors a1,..., an. Here, the general strategy is to first prove that this uniqueness holds for actual distributions and then transfer the uniqueness proof to the sum-of-squares proof system, which would imply that uniqueness also holds for pseudo-distributions.
In subsection 4.2.1 below, we demonstrate our key rounding idea on the (nearly) orthogonal tensor decomposition problem. Then in subsection 4.2.2 we discuss the high-level insight for the robust 4th-order tensor decomposition algorithm, and in subsection 4.2.3 the techniques for random 3rd-order tensor decomposition.
4.2.1 Rounding pseudo-distributions by matrix diagonalization
Our main departure from previous tensor decomposition algorithms based on sum-of-squares [17, 39] lies in rounding: the procedure to extract an actual solution from a pseudo-distribution over solutions. The previous algorithms rounded a pseudo-distribution $D$ by directly using the first moments (or the mean) $\mathbb{E}_{D(u)}\,u$, which requires $D$ to concentrate strongly around the desired solution. Our approach here instead uses Jennrich's (simultaneous) matrix diagonalization [46, 56] to extract the desired solution as a singular vector of a matrix of the form $\mathbb{E}_{D(u)}\,\langle g,u\rangle\,uu^T$, for a random vector $g$.$^4$ This permits us to impose much weaker conditions on $D$.
For the rest of this subsection, we assume that we have an actual distribution $D$ that is supported on vectors close to some orthonormal basis $a_1,\ldots,a_d$ of $\mathbb{R}^d$, and we will design a rounding algorithm that extracts the vectors $a_i$ from the low-degree moments of $D$. This is a much simpler task than rounding from a pseudo-distribution, though it captures most of the essential difficulties. Since pseudo-distributions behave similarly to actual distributions on low-degree moments, the techniques involved in rounding from actual distributions will turn out to be easily generalizable to the case of pseudo-distributions.
Let $D$ be a distribution over the unit sphere in $\mathbb{R}^d$. Suppose that this
$^4$In previous treatments of simultaneous diagonalization, multiple matrices would be used for noise tolerance, increasing the confidence in the solution when more than one matrix agrees on a particular singular vector. This is unnecessary in our setting since, as we will see, the SoS framework itself suffices to certify the correctness of a solution.
distribution is supported on vectors close to some orthonormal basis $a_1,\ldots,a_d$ of $\mathbb{R}^d$, in the sense that the distribution satisfies the constraint
$$\Big\{\sum_{i=1}^d \langle a_i,u\rangle^3 \ge 1-\varepsilon\Big\}_{D(u)}\,. \qquad(4.2.1)$$
(This constraint implies $\{\max_{i\in[d]}\langle a_i,u\rangle \ge 1-\varepsilon\}_{D(u)}$ because $\sum_{i=1}^d\langle a_i,u\rangle^3 \le \max_{i\in[d]}\langle a_i,u\rangle$ by orthonormality.) The analysis of [17] shows that reweighing the distribution $D$ by a function of the form $u\mapsto\langle g,u\rangle^{2k}$ for $g\sim N(0,\mathrm{Id}_d)$ and some $k\le O(\log d)$ creates, with significant probability, a distribution $D'$ such that for one of the basis vectors $a_i$, almost all of the probability mass of $D'$ is on vectors close to $a_i$, in the sense that
$$\max_{i\in[d]}\ \mathbb{E}_{D'(u)}\,\langle a_i,u\rangle^{2k} \ge 1-O(\varepsilon)\,,\qquad\text{where } D'(u)\propto\langle g,u\rangle^{2k}D(u)\,.$$
In this case, we can extract a vector close to one of the vectors $a_i$ by computing the mean $\mathbb{E}_{D'(u)}\,u$ of the reweighted distribution. This rounding procedure takes quasi-polynomial time because it requires access to logarithmic-degree moments of the original pseudo-distribution $D$.
To avoid this quasi-polynomial running time, our strategy is to instead modify the original distribution D in order to create a small bias in one of the directions ai such that a modified moment matrix of D has a one-dimensional eigenspace close to ai. (This kind of modification is much less drastic than the kind of modification in previous works. Indeed, reweighing a distribution such that it concentrates around a particular vector seems to require logarithmic degree.)
Concretely, we will study the spectrum of matrices of the following form, for $g\sim N(0,\mathrm{Id}_d)$:
$$M_g = \mathbb{E}_{D(u)}\,\langle g,u\rangle\cdot uu^T\,.$$
Our goal is to show that with good probability, $M_g$ has a one-dimensional eigenspace close to one of the vectors $a_i$.
However, this is not actually true for a naïve distribution: although we have encoded the basis vectors $a_i$ into the distribution $D$ by means of constraint (4.2.1), we cannot yet conclude that the eigenspaces of $M_g$ have anything to do with them. We can understand this as the error allowed in (4.2.1) being highly under-constrained. For example, the distribution could be a uniform mixture of vectors of the form $a_i + \varepsilon w$ for some fixed vector $w$, which causes $w$ to become by far the most significant contribution to the spectrum of $M_g$. More generally, an arbitrary spectrally small error could still completely displace all of the eigenspaces of $M_g$.
An interpretation of this situation is that we have permitted D itself to contain a large amount of information that we do not actually possess. Constraint (4.2.1) is consistent with a wide range of possible solutions, yet in the pathological example above, the distribution does not at all reflect this uncertainty, instead settling arbitrarily on some particular biased solution: it is this bias that disrupts the usefulness of the rounding procedure.
A similar situation has previously arisen in strategies for rounding convex relaxations: specifically, when the variables of a relaxation were interpreted as the marginals of some probability distribution over solutions, actual solutions were constructed by sampling from that distribution. In that context, a workaround was to sample those solutions from the maximum-entropy distributions consistent with those marginals [?], to ensure that the distribution faithfully reflected the ignorance inherent in the relaxation solution rather than incorporating arbitrary information. Our situation differs in that it is the solution to the convex relaxation itself which is misbehaving, rather than some aspect of the rounding process, but the same approach carries over here as well.
Therefore, suppose that $D$ satisfies the maximum-entropy constraint $\|\mathbb{E}_{D(u)}\,uu^T\| \le 1/n$. This essentially enforces $D$ to be a uniform distribution over vectors close to $a_1,\ldots,a_n$. For the sake of demonstration, we assume that $D$ is the uniform distribution over $a_1,\ldots,a_n$. Moreover, since our algorithm is invariant under linear transformations, we may assume that the components $a_1,\ldots,a_n$ are the standard basis vectors $e_1,\ldots,e_n\in\mathbb{R}^d$. We first decompose $M_g$ along the coordinate $g_1$,
$$M_g = g_1\cdot M_{e_1} + M_{g'}\,,\qquad\text{where } g' = g - g_1\cdot e_1\,.$$
Note that under our simplified assumption for $D$, by simple algebraic manipulation we have $M_{e_1} = \mathbb{E}_{D(u)}\,u_1\,uu^T = \tfrac1n e_1 e_1^T$. Moreover, by definition, $g_1$ and $g'$ are independent. It turns out that the entropy constraint implies $\mathbb{E}_{g'}\|M_{g'}\| \lesssim \sqrt{\log d}\cdot 1/n$ (using concentration bounds for Gaussian matrix series [63]). Therefore, if we condition on the event $g_1 \ge \eta^{-1}\sqrt{\log d}$, the matrix $M_g = g_1 M_{e_1} + M_{g'}$ consists of two parts: a rank-1 part $g_1 M_{e_1}$ with eigenvalue larger than $\eta^{-1}\sqrt{\log d}/n$, and a noise part whose spectral norm is at most $\lesssim\sqrt{\log d}/n$ in expectation. Hence, by the eigenvector perturbation theorem, the top eigenvector is $O(\eta^{1/2})$-close to $e_1$, as desired.
Taking $\eta = 0.1$, we see that with $1/\mathrm{poly}(d)$ probability the event $g_1 \ge \eta^{-1}\sqrt{\log d}$ will happen, and therefore by repeating this procedure $\mathrm{poly}(d)$ times, we obtain a vector that is $O(\eta^{1/2})$-close to $e_1$. We can find the other vectors similarly by repeating the process (in a slightly more delicate way), and the accuracy can also be boosted (see Sections 5.6 and 4.5 for details).
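The following toy numpy sketch (ours, not the dissertation's rounding procedure) illustrates the idea of this subsection on the simplest instance: $D$ is the uniform distribution over the standard basis, we contract its third moment tensor with a Gaussian vector $g$, and we keep the top eigenvector whenever the top eigenvalue is well separated. The repetition count and the separation threshold are illustrative choices, not the ones used in the formal analysis.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
d = 20
basis = np.eye(d)                                          # a_i = e_i
M3 = np.einsum('ia,ib,ic->abc', basis, basis, basis) / d   # E_D[u^{(x)3}]

recovered = []
for _ in range(50 * d):                                    # poly(d) repetitions
    g = rng.standard_normal(d)
    Mg = np.einsum('abc,c->ab', M3, g)                     # E_D[<g,u> u u^T]
    w, V = np.linalg.eigh(Mg)                              # eigenvalues in ascending order
    if w[-1] > 1.5 * max(abs(w[-2]), 1e-12):               # well-separated top eigenvalue
        recovered.append(V[:, -1])

# each kept vector should be close to +-e_i for some i
if recovered:
    print(max(np.abs(v).max() for v in recovered))         # close to 1
\end{verbatim}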
4.2.2 Overcomplete fourth-order tensor
In this section, we give a high-level description of a robust sum-of-squares version of the tensor decomposition algorithm FOOBI [55]. For simplicity of the demonstration, we first work with the noiseless case where we are given a tensor $T\in(\mathbb{R}^d)^{\otimes 4}$ of the form
$$T = \sum_{i=1}^n a_i^{\otimes 4}\,. \qquad(4.2.2)$$
We will first review the key step of the FOOBI algorithm and then show how to convert it into a sum-of-squares algorithm that will naturally be robust to noise.
To begin with, we observe that by viewing $T$ as a $d^2\times d^2$ matrix of rank $n$, we can easily find the span of the $a_i^{\otimes 2}$'s by low-rank matrix factorization. However, since a low-rank matrix factorization is only unique up to unitary transformation, we are not able to recover the $a_i^{\otimes 2}$'s from the subspace that they live in. The key observation of [55] is that the $a_i^{\otimes 2}$'s are actually the only "rank-1" vectors in the span, under a mild algebraic independence condition. Here, a $d^2$-dimensional vector is called "rank-1" if it is a tensor product of two vectors of dimension $d$.
Lemma 4.2.1 ([55]). Suppose the following set of vectors is linearly independent:
$$\big\{\, a_i^{\otimes 2}\otimes a_j^{\otimes 2} - (a_i\otimes a_j)^{\otimes 2} \;\big|\; i\ne j \,\big\}\,. \qquad(4.2.3)$$
Then every vector $x^{\otimes 2}$ in the linear span of $a_1^{\otimes 2},\ldots,a_n^{\otimes 2}$ is a multiple of one of the vectors $a_i^{\otimes 2}$.
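As a quick numerical check of the independence condition (4.2.3), the numpy sketch below (ours, for illustration only) builds the matrix whose columns are $a_i^{\otimes 2}\otimes a_j^{\otimes 2} - (a_i\otimes a_j)^{\otimes 2}$ for random unit vectors $a_i$ and verifies that it has full column rank.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
d, n = 5, 8                                         # overcomplete: n > d, but n(n-1) << d^4
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)

cols = []
for i in range(n):
    for j in range(n):
        if i != j:
            ai2, aj2 = np.kron(A[i], A[i]), np.kron(A[j], A[j])
            aij = np.kron(A[i], A[j])
            cols.append(np.kron(ai2, aj2) - np.kron(aij, aij))   # d^4-dimensional column
M = np.stack(cols, axis=1)
print(np.linalg.matrix_rank(M), n * (n - 1))        # equal (full column rank) generically
\end{verbatim}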
This observation leads to the algorithm FOOBI, which essentially looks for rank-1 vectors in the span of the $a_i^{\otimes 2}$'s. The main drawback is that it uses simultaneous diagonalization as a sub-procedure, which is unlikely to tolerate noise better than inverse polynomial in $d$, and in fact no noise tolerance guarantee has been explicitly shown for it before.
Our approach starts with rephrasing the original proof of Lemma 4.2.1 into the following SoS proof (which only uses polynomial inequalities that can be proved by SoS).
Proof of Lemma 4.2.1. Let $\alpha_1,\ldots,\alpha_n$ be multipliers such that $x^{\otimes 2} = \sum_{i=1}^n \alpha_i\cdot a_i^{\otimes 2}$.$^5$ Then, these multipliers satisfy the following quadratic equations:
$$x^{\otimes 4} = \sum_{i,j}\alpha_i\alpha_j\cdot a_i^{\otimes 2}\otimes a_j^{\otimes 2}\,,$$
$$x^{\otimes 4} = \sum_{i,j}\alpha_i\alpha_j\cdot (a_i\otimes a_j)^{\otimes 2}\,.$$
Together, the two equations imply that
$$0 = \sum_{i,j}\alpha_i\alpha_j\cdot\Big(a_i^{\otimes 2}\otimes a_j^{\otimes 2} - (a_i\otimes a_j)^{\otimes 2}\Big)\,.$$
By assumption, the vectors $a_i^{\otimes 2}\otimes a_j^{\otimes 2} - (a_i\otimes a_j)^{\otimes 2}$ are linearly independent for $i\ne j$. Therefore, from the equation above, we conclude $\sum_{i\ne j}\alpha_i^2\alpha_j^2 = 0$, meaning that at most one of the $\alpha_i$ can be non-zero. Furthermore, this argument is an SoS proof, since for any matrix $A\in\mathbb{R}^{D\times D}$ with linearly independent columns and any vector polynomial $v\in\mathbb{R}[x]^D$, the inequality $\|v\|^2 \le \frac{1}{\sigma_{\min}(A)^2}\|Av\|^2$ can be proved by SoS (here $\sigma_{\min}(A)$ denotes the least singular value of the matrix $A$). So choosing $A$ to be the matrix with columns $a_i^{\otimes 2}\otimes a_j^{\otimes 2} - (a_i\otimes a_j)^{\otimes 2}$ for $i\ne j$ and $v$ to be the vector with entries $\alpha_i\alpha_j$, we find by SoS proof that $\|\alpha\|_4^4 - \|\alpha\|_2^4 = 0$.

$^5$Technically, $\alpha_1,\ldots,\alpha_n$ are polynomials in $x$ so that $x^{\otimes 2} = \sum_{i=1}^n\alpha_i\cdot a_i^{\otimes 2}$ holds.
When there is noise present, we cannot find the true subspace of the $a_i^{\otimes 2}$'s and instead we only have an approximation, denoted by $V$, of that subspace. We will modify the proof above by starting with a polynomial inequality
$$\|\mathrm{Id}_V\, x^{\otimes 2}\|^2 \ge (1-\delta)\|x^{\otimes 2}\|^2\,, \qquad(4.2.4)$$
which constrains $x^{\otimes 2}$ to be close to the estimated subspace $V$ (where $\delta$ is a small number that depends on the error and the condition number). Then an extension of the proof of Lemma 4.2.1 will show that equation (4.2.4) implies (via an SoS proof) that for some small enough $\delta$,
$$\sum_{i\ne j}\alpha_i^2\alpha_j^2 \le o(1)\,. \qquad(4.2.5)$$
Note that $\alpha = Kx^{\otimes 2}$ is a linear transformation of $x^{\otimes 2}$, and furthermore $K$ is the pseudo-inverse of the matrix with columns $a_i^{\otimes 2}$. Moreover, if we assume for a moment that $\alpha$ has 2-norm 1 (which is not true in general), then the equation above further implies that
$$\sum_{i=1}^n\langle K_i, x^{\otimes 2}\rangle^4 = \|\alpha\|_4^4 \ge 1 - o(1)\,, \qquad(4.2.6)$$
where $K_i\in\mathbb{R}^{d^2}$ is the $i$-th row of $K$. This effectively gives us access to the 4-tensor $\sum_i K_i^{\otimes 4}$ (which has ambient dimension $d^2$ when flattened into a matrix), since equation (4.2.6) is anyway the constraint that would have been used by the SoS algorithm if given the tensor $\sum_i K_i^{\otimes 4}$ as input. Note that because the $K_i$ are not necessarily (close to) orthogonal, we cannot apply the SoS orthogonal tensor decomposition algorithm directly. However, since we are working with a 4-tensor whose matrix flattening has the higher dimension $d^2$, we can whiten the $K_i$ effectively in the SoS framework and then use the orthogonal SoS tensor decomposition algorithm to find the $K_i$'s, which will in turn yield the $a_i$'s.
Many details were omitted in the heuristic argument above (for example, we assumed α to have norm 1). The full argument follows in Section 4.8.
4.2.3 Random overcomplete third-order tensor
In the random overcomplete setting, the input tensor is of the form
$$T = \sum_{i=1}^n a_i^{\otimes 3} + E\,,$$
where each $a_i$ is drawn uniformly at random from the Euclidean unit sphere, we have $d < n \le d^{1.5}/(\log d)^{O(1)}$, and $E$ is some noise tensor such that $\|E\|_{\{1\},\{2,3\}} < \varepsilon$, or alternatively such that a constant-degree sum-of-squares relaxation of the injective norm of $E$ is at most $\varepsilon$.
Our original rounding approach depends on the target vectors $a_i$ being orthonormal or nearly so. But when $n \gg d$ in this overcomplete setting, orthonormality fails badly: the vectors $a_i$ are not even linearly independent.
We circumvent this problem by embedding the vectors $a_i$ in a larger ambient space, specifically by taking the tensor powers $a'_1 = a_1^{\otimes 2},\ldots,a'_n = a_n^{\otimes 2}$. Now the vectors $a'_1,\ldots,a'_n$ are linearly independent (with probability 1) and actually close to orthonormal with high probability. Therefore, if we had access to the order-6 tensor $\sum_i (a'_i)^{\otimes 3} = \sum_i a_i^{\otimes 6}$, then we could (almost) apply our rounding method to recover the vectors $a'_i$.
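The near-orthonormality of the lifted vectors is easy to check numerically; the following sketch (ours, not from the text) computes the Gram matrix $\langle a'_i,a'_j\rangle = \langle a_i,a_j\rangle^2$ for random unit vectors and shows that the off-diagonal entries are small and, aside from one large eigenvalue (reflecting the spurious eigenspace mentioned below), the eigenvalues are of constant order.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
d, n = 50, 200                                 # overcomplete regime n >> d
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)

G = (A @ A.T) ** 2                             # Gram matrix of the lifted vectors a_i^{(x)2}
eigs = np.sort(np.linalg.eigvalsh(G))
print(np.abs(G - np.diag(np.diag(G))).max())   # off-diagonal entries roughly of order 1/d
print(eigs[-1], eigs[-2], eigs[0])             # one large eigenvalue; the rest of constant order
\end{verbatim}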
The key here will be to use the sum-of-squares method to generate a pseudo-distribution over the unit sphere having $T$ as its third-order moment tensor, and then to extract from it the set of order-6 pseudo-moments estimating the moment tensor $\sum_i a_i^{\otimes 6}$. This pseudo-distribution would obey the constraint $\{(u\otimes u\otimes u)^T T \ge 1-\varepsilon\}$, which implies the constraint $\{\sum_i\langle a_i,u\rangle^3 \ge 1-\varepsilon\}$, saying, informally, that our pseudo-distribution is close to the actual uniform distribution over $\{a_i\}$. Substituting $v = u^{\otimes 2}$, we obtain an implied pseudo-distribution in $v$ which therefore ought to be close to the uniform distribution over $\{a'_i\}$, and we should therefore be able to round the order-3 pseudo-moments of $v$ to recover $\{a'_i\}$.
Only two preconditions need to be checked: first, that $\sum_i (a'_i)(a'_i)^T$ is not too large in spectral norm, and second, that our pseudo-distribution in $v$ satisfies the constraint $\{\sum_i\langle a'_i,v\rangle^3 \ge 1-O(\varepsilon)\}$. The first precondition is true (except for a spurious eigenspace which can harmlessly be projected away) and is essentially equivalent to a line of matrix concentration arguments previously made in [?]. The second precondition follows from a line of constant-degree sum-of-squares proofs, notably extending arguments made in [39] stating that the constraints $\{\sum_i\langle a_i,u\rangle^3 \ge 1-\varepsilon,\ \|u\|^2 = 1\}$ imply, with constant-degree sum-of-squares proofs, $\{\sum_i\langle a_i,u\rangle^k \ge 1-O(\varepsilon)-\tilde O(n/d^{3/2})\}$ for some higher powers $k$. The rigorous verification of these conditions is detailed in Section 4.7.
4.3 Preliminaries
Unless explicitly stated otherwise, $O(\cdot)$-notation hides absolute multiplicative constants. Concretely, every occurrence of $O(x)$ is a placeholder for some function $f(x)$ that satisfies $\forall x\in\mathbb{R}.\ |f(x)| \le C|x|$ for some absolute constant $C > 0$. Similarly, $\Omega(x)$ is a placeholder for a function $g(x)$ that satisfies $\forall x\in\mathbb{R}.\ |g(x)| \ge |x|/C$ for some absolute constant $C > 0$.
For a matrix $A$, let $A^{+}$ denote the Moore–Penrose pseudo-inverse of $A$. For a symmetric positive semidefinite matrix $B$, let $B^{1/2}$ denote the square root of $B$, i.e., the unique symmetric positive semidefinite matrix $L$ such that $L^2 = B$.
The Kronecker product of two matrices $A$ and $B$ is denoted by $A\otimes B$. A useful identity is that $(A\otimes B)(C\otimes D) = (AC)\otimes(BD)$ whenever the matrix multiplications are defined. The norm $\|\cdot\|$ denotes the Euclidean norm for vectors and the spectral norm for matrices.
Let $T\in(\mathbb{R}^d)^{\otimes k}$ be a $k$-tensor over $\mathbb{R}^d$ such that $T = \sum_{i_1,\ldots,i_k} T_{i_1\cdots i_k}\, e_{i_1}\otimes\cdots\otimes e_{i_k}$, where $e_1,\ldots,e_d$ is the standard basis of $\mathbb{R}^d$. We say $T$ is symmetric if the entries $T_{i_1,\ldots,i_k}$ are invariant under permuting the indices. The $k$ index positions of $T$ are called modes. The injective norm $\|T\|_{\mathrm{inj}}$ is the maximum value of $\langle T, x_1\otimes\cdots\otimes x_k\rangle$ over all vectors $x_1,\ldots,x_k\in\mathbb{R}^d$ with $\|x_1\| = \cdots = \|x_k\| = 1$. A useful class of multilinear operations on tensors has the form $T\mapsto(A_1\otimes\cdots\otimes A_k)T$, where $A_1,\ldots,A_k$ are matrices with $d$ columns. (This notation is the same as the Kronecker product notation for matrices, that is, $(A_1\otimes\cdots\otimes A_k)T = \sum_{i_1,\ldots,i_k} T_{i_1\cdots i_k}\,(A_1 e_{i_1})\otimes\cdots\otimes(A_k e_{i_k})$.) If some of the matrices $A_i$ are row vectors and the others are the identity matrix, then the corresponding operation is called tensor contraction. For example, for a third-order tensor $T\in(\mathbb{R}^d)^{\otimes 3}$ and a vector $g\in\mathbb{R}^d$, we call $(\mathrm{Id}\otimes\mathrm{Id}\otimes g^T)T$ the contraction of the third mode of $T$ with $g$. (Some authors use the notation $T(\mathrm{Id},\mathrm{Id},g)$ to denote this operation.)
For a bipartition $A, B$ of the index set $[k]$ of $T$, we let $\|T\|_{A,B}$ denote the spectral norm of the matrix unfolding $T_{A,B}$ of $T$ with rows indexed by the indices in $A$ and columns indexed by the indices in $B$. Concretely,
$$\|T\|_{A,B} = \max_{\substack{x\in(\mathbb{R}^d)^{\otimes|A|},\ y\in(\mathbb{R}^d)^{\otimes|B|}\\ \|x\|\le 1,\ \|y\|\le 1}}\ \sum_{i_1,\ldots,i_k} T_{i_1\cdots i_k}\cdot x_{i_A}\, y_{i_B}\,.$$
Here, $i_A = i_{a_1}\cdots i_{a_{|A|}}$ and $i_B = i_{b_1}\cdots i_{b_{|B|}}$ are multi-indices, where $A = \{a_1,\ldots,a_{|A|}\}$ and $B = \{b_1,\ldots,b_{|B|}\}$. For $k = 2$, $\|T\|_{\{1\},\{2\}}$ is the spectral norm of $T$ viewed as a $d$-by-$d$ matrix. For $k = 3$, $\|T\|_{\{1,2\},\{3\}}$ is the spectral norm of $T$ viewed as a $d^2$-by-$d$ matrix with rows indexed by the first two modes of $T$ and columns indexed by the last index of $T$. For symmetric 3-tensors, the norms $\|T\|_{\{1,2\},\{3\}}$, $\|T\|_{\{1,3\},\{2\}}$, and $\|T\|_{\{2,3\},\{1\}}$ are all the same.
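Both operations just defined reduce to simple array manipulations; the short numpy sketch below (ours, for illustration) contracts one mode of a symmetric 3-tensor with a vector and computes unfolding norms such as $\|T\|_{\{1,2\},\{3\}}$ by reshaping and taking the largest singular value.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
d = 10
T = rng.standard_normal((d, d, d))
T = (T + T.transpose(1, 0, 2) + T.transpose(2, 1, 0)
       + T.transpose(0, 2, 1) + T.transpose(1, 2, 0) + T.transpose(2, 0, 1)) / 6  # symmetrize

g = rng.standard_normal(d)
Tg = np.einsum('abc,c->ab', T, g)                   # (Id (x) Id (x) g^T) T, a d x d matrix

norm_12_3 = np.linalg.norm(T.reshape(d * d, d), 2)  # ||T||_{{1,2},{3}}
norm_1_23 = np.linalg.norm(T.reshape(d, d * d), 2)  # ||T||_{{1},{2,3}}
print(Tg.shape, norm_12_3, norm_1_23)               # the two norms agree for symmetric T
\end{verbatim}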
4.3.1 Pseudo-distributions
Pseudo-distributions generalize probability distributions in a way that allows us to optimize efficiently over moments of pseudo-distributions. We represent a discrete probability distribution $D$ over $\mathbb{R}^n$ by its probability mass function $D\colon\mathbb{R}^n\to\mathbb{R}$ such that $D(x)$ is the probability of $x$ under the distribution for every $x\in\mathbb{R}^n$. This function is nonnegative point-wise and satisfies $\sum_{x\in\mathrm{supp}(D)} D(x) = 1$. For pseudo-distributions we relax the nonnegativity requirement and only require that the function passes a set of simple nonnegativity tests.

A degree-$d$ pseudo-distribution over $\mathbb{R}^n$ is a finitely$^6$ supported function $D\colon\mathbb{R}^n\to\mathbb{R}$ such that $\sum_{x\in\mathrm{supp}(D)} D(x) = 1$ and $\sum_{x\in\mathrm{supp}(D)} D(x)f(x)^2 \ge 0$ for every function $f\colon\mathbb{R}^n\to\mathbb{R}$ of degree at most $d/2$. We define the pseudo-expectation of a (possibly vector-valued or matrix-valued) function $f$ with respect to $D$ as
$$\tilde{\mathbb{E}}_D f \;\stackrel{\mathrm{def}}{=}\; \sum_{x\in\mathrm{supp}(D)} D(x)f(x)\,.$$
In order to emphasize which variable is bound by the pseudo-expectation, we write $\tilde{\mathbb{E}}_{D(x)} f(x)$. (This notation is useful if $f(x)$ is a more complicated expression involving several variables.)

Note that a degree-$\infty$ pseudo-distribution $D$ satisfies $D(x)\ge 0$ for all $x\in\mathbb{R}^n$. Therefore, $D$ is an actual probability distribution (with finite support). The pseudo-expectation $\tilde{\mathbb{E}}_D f = \mathbb{E}_D f$ of a function $f$ is its expected value under the distribution $D$.

$^6$We restrict these functions to be finitely supported in order to avoid integrals and measurability issues. It turns out to be without loss of generality in our context.

Our algorithms will not work with pseudo-distributions (as finitely-supported functions on $\mathbb{R}^n$) directly. Instead, the algorithms will work with the moment tensors $\tilde{\mathbb{E}}_{D(x)}(1,x_1,\ldots,x_n)^{\otimes d}$ of pseudo-distributions and the associated linear functional $p\mapsto\tilde{\mathbb{E}}_{D(x)}\,p(x)$ on polynomials $p$ of degree at most $d$.
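For orientation, the minimal numpy sketch below (ours) builds this moment object in the simplest case $d = 2$ for an actual finitely supported distribution, where it is just the $(n+1)\times(n+1)$ moment matrix of $(1,x_1,\ldots,x_n)$, and checks the nonnegativity test (positive semidefiniteness) that pseudo-distributions are also required to pass.
\begin{verbatim}
import numpy as np

def moment_matrix(points, weights):
    """points: (m, n) support; weights: (m,) probabilities summing to 1."""
    lifted = np.hstack([np.ones((points.shape[0], 1)), points])   # rows (1, x_1, ..., x_n)
    return np.einsum('k,ka,kb->ab', weights, lifted, lifted)      # E_D (1,x)(1,x)^T

pts = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
w = np.array([0.5, 0.25, 0.25])
M = moment_matrix(pts, w)
print(np.all(np.linalg.eigvalsh(M) >= -1e-12))   # actual distributions give PSD moment matrices
\end{verbatim}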
Unlike actual probability distributions, pseudo-distributions admit general, efficient optimization algorithms. In particular, the set of low-degree moments of pseudo-distributions has an efficient separation oracle.

Theorem 4.3.1 ([71, 64, 54]). For $n,d\in\mathbb{N}$, the following set admits an $n^{O(d)}$-time weak separation oracle (in the sense of [43]):
$$\Big\{\,\tilde{\mathbb{E}}_{D(x)}(1,x_1,\ldots,x_n)^{\otimes d} \;\Big|\; D \text{ is a degree-}d\text{ pseudo-distribution over }\mathbb{R}^n \Big\}\,.$$
This theorem, together with the equivalence of separation and optimization
[43] allows us to solve a wide range of optimization and feasibility problems over pseudo-distributions efficiently.
The following definition captures what kind of linear constraints are induced on a pseudo-distribution over $\mathbb{R}^n$ by a system of polynomial constraints over $\mathbb{R}^n$.
Definition 4.3.2. Let $D$ be a degree-$d$ pseudo-distribution over $\mathbb{R}^n$. For a system of polynomial constraints $\mathcal{A} = \{f_1\ge 0,\ldots,f_m\ge 0\}$ with $\deg(f_i)\le\ell$ for every $i$, we say that $D$ satisfies the polynomial constraints $\mathcal{A}$ at degree $\ell$, denoted $D\models_\ell\mathcal{A}$, if $\tilde{\mathbb{E}}_D\big(\prod_{i\in S} f_i\big)\cdot h \ge 0$ for every $S\subseteq[m]$ and every sum-of-squares polynomial $h$ on $\mathbb{R}^n$ with $|S|\ell + \deg h \le d$.

This is a relaxation (to pseudo-distributions) of the statement that the probability mass of a true distribution contains only solutions to $\mathcal{A}$. Indeed, if an actual distribution $D$ is supported on the solutions to $\mathcal{A}$, then $D$ satisfies $D\models_\ell\mathcal{A}$ regardless of the value of $\ell$.
We say that $D$ satisfies $\mathcal{A}$ (without further specifying the degree) if $D\models_\ell\mathcal{A}$ for $\ell = \max_{\{f\ge 0\}\in\mathcal{A}}\deg f$. We say that a system of polynomial constraints in variables $x$ is explicitly bounded if it contains a constraint of the form $\{\|x\|^2\le M\}$. The following theorem follows from Theorem 4.3.1 and [43]. We give a proof in Appendix ?? for completeness.

Theorem 4.3.3. There exists an $(n+|\mathcal{A}|)^{O(d)}$-time algorithm that, given any explicitly bounded and satisfiable system $\mathcal{A}$ of polynomial constraints in $n$ variables, outputs (up to arbitrary accuracy) a degree-$d$ pseudo-distribution that satisfies $\mathcal{A}$.
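The convex feasibility problem behind this theorem is easy to write down explicitly in the lowest-degree case. The sketch below is ours and assumes the cvxpy package (with its bundled SDP solver); it only illustrates the degree-2 case for the explicitly bounded system $\{\|x\|^2\le 1\}$ together with a target for the first moments, not the general degree-$d$ solver of Theorem 4.3.3.
\begin{verbatim}
import cvxpy as cp
import numpy as np

n = 4
target_mean = np.full(n, 0.3)

X = cp.Variable((n + 1, n + 1), symmetric=True)      # moment matrix of (1, x_1, ..., x_n)
constraints = [
    X >> 0,                                          # pseudo-expectations of squares are >= 0
    X[0, 0] == 1,                                    # E~[1] = 1
    cp.trace(X[1:, 1:]) <= 1,                        # E~[||x||^2] <= 1
    X[0, 1:] == target_mean,                         # E~[x] matches the target
]
cp.Problem(cp.Minimize(0), constraints).solve()
print(np.round(X.value, 3))
\end{verbatim}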
4.3.2 Sum of squares proofs
Let $x = (x_1,\ldots,x_n)$ be a tuple of indeterminates. Let $\mathbb{R}[x]$ be the set of polynomials in these indeterminates with real coefficients. A polynomial $p$ is a sum-of-squares if there are polynomials $q_1,\ldots,q_r$ such that $p = q_1^2 + \cdots + q_r^2$. Let $f_1,\ldots,f_r$ and $g$ be multivariate polynomials in $\mathbb{R}[x]$. A sum-of-squares proof that the constraints $\{f_1\ge 0,\ldots,f_r\ge 0\}$ imply the constraint $\{g\ge 0\}$ consists of sum-of-squares polynomials $(p_S)_{S\subseteq[r]}$ in $\mathbb{R}[x]$ such that
$$g = \sum_{S\subseteq[r]} p_S\cdot\prod_{i\in S} f_i\,.$$
We say that this proof has degree $\ell$ if every set $S\subseteq[r]$ satisfies $\deg\big(p_S\cdot\prod_{i\in S} f_i\big)\le\ell$ (in particular, this would imply $p_S = 0$ for every set $S$ such that $\deg\prod_{i\in S} f_i > \ell$). If there exists a degree-$\ell$ sum-of-squares proof that $\{f_1\ge 0,\ldots,f_r\ge 0\}$ implies $\{g\ge 0\}$, we write
$$\{f_1\ge 0,\ldots,f_r\ge 0\} \vdash_\ell \{g\ge 0\}\,.$$
In order to emphasize the indeterminates for the proofs, we sometimes write
$$\{f_1(x)\ge 0,\ldots,f_r(x)\ge 0\} \vdash_{x,\ell} \{g(x)\ge 0\}\,.$$
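As a tiny illustration of this certificate format (not an example appearing in the original text), note that the single constraint $\{x-1\ge 0\}$ implies $\{x^2-1\ge 0\}$ with a degree-2 sum-of-squares proof, since
$$x^2 - 1 = (x-1)^2 + 2\cdot(x-1)\,,$$
so taking $p_{\emptyset} = (x-1)^2$ and $p_{\{1\}} = 2$ (both sums of squares) exhibits $\{x-1\ge 0\}\vdash_2\{x^2-1\ge 0\}$.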
Sum-of-squares proofs obey the following inference rules, for all polynomials $f,g\colon\mathbb{R}^n\to\mathbb{R}$ and $F\colon\mathbb{R}^n\to\mathbb{R}^m$, $G\colon\mathbb{R}^n\to\mathbb{R}^k$, $H\colon\mathbb{R}^p\to\mathbb{R}^n$:
$$\frac{\mathcal{A}\vdash_\ell\{f\ge 0,\ g\ge 0\}}{\mathcal{A}\vdash_\ell\{f+g\ge 0\}}\,,\qquad \frac{\mathcal{A}\vdash_\ell\{f\ge 0\},\ \mathcal{A}\vdash_{\ell'}\{g\ge 0\}}{\mathcal{A}\vdash_{\ell+\ell'}\{f\cdot g\ge 0\}} \qquad\text{(addition and multiplication)}$$
$$\frac{\mathcal{A}\vdash_\ell\mathcal{B},\quad \mathcal{B}\vdash_{\ell'}\mathcal{C}}{\mathcal{A}\vdash_{\ell\cdot\ell'}\mathcal{C}} \qquad\text{(transitivity)}$$
$$\frac{\{F\ge 0\}\vdash_\ell\{G\ge 0\}}{\{F(H)\ge 0\}\vdash_{\ell\cdot\deg H}\{G(H)\ge 0\}}\,. \qquad\text{(substitution)}$$
Sum-of-squares proofs are sound and complete for polynomial constraints over pseudo-distributions, in the sense that sum-of-squares proofs allow us to reason about what kind of polynomial constraints are satisfied by a pseudo-distribution. We defer the proofs of the following lemmas to Appendix ??.
Lemma 4.3.4 (Soundness). If $D\models_\ell\mathcal{A}$ for a pseudo-distribution $D$ and there exists a sum-of-squares proof $\mathcal{A}\vdash_{\ell'}\mathcal{B}$, then $D\models_{\ell\cdot\ell'}\mathcal{B}$.

Lemma 4.3.5 (Completeness). Suppose $d\ge\ell'\ge\ell$, $\mathcal{A}$ is a collection of polynomial constraints with degree at most $\ell$, and $\mathcal{A}\vdash\{x_1^2+\cdots+x_n^2\le B\}$ for some finite $B$. Let $\{g\ge 0\}$ be a polynomial constraint with degree $\ell'$. If every degree-$d$ pseudo-distribution $D$ that satisfies $D\models_\ell\mathcal{A}$ also satisfies $D\models_{\ell'}\{g\ge 0\}$, then for every $\varepsilon > 0$, there is a sum-of-squares proof $\mathcal{A}\vdash_d\{g\ge-\varepsilon\}$.$^7$

$^7$The completeness claim stated here does not match the strength of the corresponding soundness claim. This reflects an imprecision in how we count the degrees of intermediate sum-of-squares proofs (in particular, our degree accounting is not tight under proof composition), rather than the power of the proofs themselves.
4.3.3 Matrix constraints and sum-of-squares proofs
In Sections 5.6 and 4.9, we will state positive-semidefiniteness constraints on matrices, which will be implied by sum-of-squares proofs. We define notation to express what it means for a matrix constraint to be implied by sum-of-squares. While the duality between proof systems and convex relaxations also holds in the matrix case [?], and it is possible to give a full treatment of matrix constraints in sum-of-squares, here we give an abridged and simplified treatment sufficient for our purposes.
Definition 4.3.6. Let $\mathcal{A}$ be a set of polynomial constraints in the indeterminate $x$, and let $M$ be a symmetric $p\times p$ matrix with entries in $\mathbb{R}[x]$. Then we write $\mathcal{A}\vdash_\ell\{M\succeq 0\}$ if there exist a set of polynomials $q_1(x),\ldots,q_m(x)$ and a set of $p$-dimensional vectors $v_1(x),\ldots,v_m(x)$ with entries in $\mathbb{R}[x]$ such that $\mathcal{A}\vdash_{\ell_i}\{q_i\ge 0\}$ where $\ell_i + 2\deg(v_i)\le\ell$ for every $i$, and
$$M = \sum_{i=1}^m q_i(x)\,v_i(x)v_i(x)^T\,. \qquad(4.3.1)$$
The proof that sum-of-squares is sound for these matrix constraints is very similar to the analogous proof of Lemma 4.3.4 (see Appendix ??).

Lemma 4.3.7. Let $D$ be a pseudo-distribution of degree $d$ with $d\ge\ell\ell'$. Suppose $D\models_\ell\mathcal{A}$ and $\mathcal{A}\vdash_{\ell'} M\succeq 0$. Then $\tilde{\mathbb{E}}_D[M]\succeq 0$.
We now give some basic properties of these matrix sum-of-squares proofs.
Lemma 4.3.8. Suppose $A, B$ are symmetric matrix polynomials such that $\vdash\{A\succeq 0,\ B\succeq 0\}$. Then $\vdash\{A\otimes B\succeq 0\}$.

Proof. Express $A = \sum_{i=1}^n q_i(x)u_i(x)u_i(x)^T$ and $B = \sum_{j=1}^m r_j(x)v_j(x)v_j(x)^T$. Then
$$A\otimes B = \sum_{i=1}^n\sum_{j=1}^m q_i(x)r_j(x)\,\big(u_i(x)\otimes v_j(x)\big)\big(u_i(x)\otimes v_j(x)\big)^T\,. \qquad\square$$
Lemma 4.3.9. Suppose $A, B, A', B'$ are symmetric matrix polynomials such that $\vdash\{A\succeq 0,\ B'\succeq 0,\ A\succeq A',\ B\succeq B'\}$. Then $\vdash\{A\otimes B\succeq A'\otimes B',\ B\otimes A\succeq B'\otimes A'\}$.

Proof. By Lemma 4.3.8, we have $\vdash A\otimes(B-B')\succeq 0$ and $\vdash(A-A')\otimes B'\succeq 0$. Adding the two equations, we complete the proof. We may also take the tensor products in the other order. $\square$
Lemma 4.3.10. Let $u = [u_1,\ldots,u_d]^T$ be an indeterminate. Then $\vdash\{uu^T\preceq\|u\|^2\cdot\mathrm{Id}_d\}$.

Proof. The conclusion follows from the explicit decomposition
$$\vdash\quad \|u\|^2\,\mathrm{Id} - uu^T = \sum_{1\le i<j\le d}(u_i e_j - u_j e_i)(u_i e_j - u_j e_i)^T \succeq 0\,. \qquad\square$$

4.4 Rounding pseudo-distributions

4.4.1 Rounding by matrix diagonalization

The following theorem analyzes a form of Jennrich's algorithm for tensor decomposition through matrix diagonalization, when applied to the moments of a pseudo-distribution. We show that if the pseudo-distribution $D(u)$ has good correlation with some vector $a^{\otimes k}$, then with good chance a simple random contraction of the $(k+2)$-th moments of the pseudo-distribution will return a matrix with top eigenvector close to $a$.

Theorem 4.4.1 below is the key ingredient toward a polynomial-time algorithm. It states that in order for Jennrich's approach to successfully extract a solution in polynomial time, the correlation of the desired solution with the $(k+2)$-th moments of the pseudo-distribution only needs to be large compared to the spectral norm of the covariance matrix of the pseudo-distribution. This covariance matrix can be made as small as $O(1/n)$ in spectral norm in many situations, including, as a toy example, when $D$ is a uniform distribution over $n$ orthogonal unit vectors. Therefore, in this sense the condition (4.4.1) below is a fairly weak requirement, which is key to the polynomial-time algorithm in Section 4.5.1.

Theorem 4.4.1. Let $k\in\mathbb{N}$ be even and $\varepsilon\in(0,1)$. Let $D$ be a degree-$O(k)$ pseudo-distribution over $\mathbb{R}^d$ that satisfies $\{\|u\|^2\le 1\}_{D(u)}$, and let $a\in\mathbb{R}^d$ be a unit vector. Suppose that
$$\tilde{\mathbb{E}}_{D(u)}\langle a,u\rangle^{k+2} \;\ge\; \Omega\Big(\tfrac{1}{\varepsilon\sqrt{k}}\Big)\cdot\big\|\tilde{\mathbb{E}}_{D(u)}\,uu^T\big\|\,. \qquad(4.4.1)$$
Then, with probability at least $1/d^{O(k)}$ over the choice of $g\sim N(0,\mathrm{Id}_d^{\otimes k})$, the top eigenvector $u^\star$ of the following matrix $M_g$ satisfies $\langle a,u^\star\rangle^2\ge 1-O(\varepsilon)$:
$$M_g := \tilde{\mathbb{E}}_{D(u)}\,\langle g,u^{\otimes k}\rangle\cdot uu^T\,. \qquad(4.4.2)$$

As before, we decompose $M_g$ into two parts, with $M_{a^{\otimes k}}$ and $M_{g'}$ defined in analogy with $M_g$:
$$M_g = \langle g,a^{\otimes k}\rangle\cdot M_{a^{\otimes k}} + M_{g'}\,,\qquad\text{where } g' = g - \langle g,a^{\otimes k}\rangle\cdot a^{\otimes k}\,. \qquad(4.4.3)$$
Our proof of Theorem 4.4.1 consists of two propositions: one about the good part $M_{a^{\otimes k}}$ and one about the noise part $M_{g'}$. The first proposition shows that $M_{a^{\otimes k}}$ is close to a multiple of $aa^T$ in spectral norm (which means that the top eigenvector of $M_{a^{\otimes k}}$ is close to $a$).

Proposition 4.4.2. In the setting of Theorem 4.4.1, for $t = \tilde{\mathbb{E}}_{D(u)}\langle a,u\rangle^{k+2}$,
$$\big\|M_{a^{\otimes k}} - t\cdot aa^T\big\| \le O(\varepsilon)\cdot t\,. \qquad(4.4.4)$$

The second proposition shows that $M_{g'}$ has small spectral norm in expectation.

Proposition 4.4.3. In the setting of Theorem 4.4.1, let $g' = g - \langle g,a^{\otimes k}\rangle\cdot a^{\otimes k}$. Then, for $t = \tilde{\mathbb{E}}_{D(u)}\langle a,u\rangle^{k+2}$,
$$\mathbb{E}_{g'}\|M_{g'}\| \le O(\varepsilon^2 k\log d)^{1/2}\cdot t\,.$$

Before proving the above propositions, we demonstrate how they allow us to prove Theorem 4.4.1.

Proof of Theorem 4.4.1. We are to show that with probability $1/d^{O(k)}$ over the choice of the Gaussian vector $g$, there exists $s\in\mathbb{R}$ such that $\|s\cdot M_g - aa^T\|\le O(\varepsilon)$. By the Davis–Kahan theorem (see Theorem ??), this implies the conclusion of Theorem 4.4.1. Let $t = \tilde{\mathbb{E}}_{D(u)}\langle a,u\rangle^{k+2}$. For a parameter $\tau = \Omega(k\log d)^{1/2}$, we bound the spectral norm conditioned on the event $\langle g,a^{\otimes k}\rangle\ge\tau$:
$$\begin{aligned}
\mathbb{E}_g\Big[\Big\|\tfrac{1}{\langle g,a^{\otimes k}\rangle\cdot t}\,M_g - aa^T\Big\|\;\Big|\;\langle g,a^{\otimes k}\rangle\ge\tau\Big]
&\le \Big\|\tfrac1t M_{a^{\otimes k}} - aa^T\Big\| + \mathbb{E}_g\Big[\tfrac{1}{\langle g,a^{\otimes k}\rangle\cdot t}\,\|M_{g'}\|\;\Big|\;\langle g,a^{\otimes k}\rangle\ge\tau\Big] && \text{(by (4.4.3))}\\
&\le \Big\|\tfrac1t M_{a^{\otimes k}} - aa^T\Big\| + \tfrac1\tau\cdot\tfrac1t\,\mathbb{E}_{g'}\|M_{g'}\| && \text{(by independence of $\langle g,a^{\otimes k}\rangle$ and $g'$)}\\
&\le O(\varepsilon) + \tfrac1\tau\cdot O(\varepsilon^2 k\log d)^{1/2} && \text{(by Propositions 4.4.2 and 4.4.3)}\\
&\le O(\varepsilon)\,. && (4.4.5)
\end{aligned}$$
By Markov's inequality, it follows that conditioned on $\langle g,a^{\otimes k}\rangle\ge\tau$, the event $\big\|\tfrac{1}{\langle g,a^{\otimes k}\rangle\cdot t}\,M_g - aa^T\big\|\le O(\varepsilon)$ has probability at least $\Omega(1)$.
The theorem follows 1,a k t h ⊗ i· k O k because the event h1, a⊗ i > τ has probability at least d− ( ). 167 T Proof of Proposition 4.4.2. We are to bound the spectral norm of M k − t · aa for a⊗ h ik+2 k T k T t ¾˜ D u a, u . Let α ¾˜ D u uu . Let Id1 aa be the projector onto the ( ) ( ) subspace spanned by a and let Id 1 Id − Id1 be the projector on the orthogonal − T complement. By our choice of t, we have Id M k Id t · aa . 1 a⊗ 1 − · k kT Since Id 1 Id1 0, we can upper bound the spectral norm of Ma k t a⊗ a⊗ , − ⊗ k − · k k ( − · ) k + k k + k k Ma k t Id1 6 Id1 Ma k t Id1 Id1 Id 1 Ma k Id 1 2 Id1 Ma k Id 1 ⊗ ⊗ − ⊗ − ⊗ − k k + k k 6 Id 1 Ma k Id 1 2 Id1 Ma k Id 1 − ⊗ − ⊗ − (because Id M k Id t · Id ) 1 a⊗ 1 1 k k + k k1 2 · k k1 2 6 Id 1 Ma k Id 1 2 Id1 Ma k Id1 / Id 1 Ma k Id 1 / − ⊗ − ⊗ − ⊗ − (because M k 0) a⊗ √ k k + · k k1 2 6 Id 1 Ma k Id 1 2 α Id 1 Ma k Id 1 / . (4.4.6) − ⊗ − − ⊗ − It remains to bound the spectral norm of Id 1 Ma k Id 1, − ⊗ − k k ˜ h ik · T Id 1 Ma k Id 1 ¾ a, u Id 1 uu Id 1 − ⊗ − D u − − ( ) 6 ¾˜ ha, uik ·(1 − ha, ui2) D u ( ) T 2 2 (because ` Id 1 uu Id 1 (kuk − ha1, ui ) Id) − − 2 2 ¾˜ h i ` + k 2 ·( − 2) 2 6 k 2 a, u (using k 2 x − 1 x 6 k 2 ; see below) − D u − ( ) 2 6 · α (4.4.7) k − 2 k 2 ·( − 2) 2 ∈ Basic calculus shows that the inequality x − 1 x 6 k 2 holds for all x . − Since it is a true univariate polynomial inequality in x, it has a sum-of-squares proof with degree no larger than the degree of the involved polynomials, which is k + 2 in our case. Combining (4.4.6) and (4.4.7), yields as desired that T 1 M k − t · aa 6 O · α 6 O (ε) · t , a⊗ √k 168 h i where the second step uses the condition of Theorem 4.4.1 on t ¾˜ D u a, u ( ) k T k and α ¾˜ D u uu . ( ) h ki · T Proof of Proposition 4.4.3. The matrix M1 ¾˜ D u 10, u⊗ uu , whose spectral 0 ( ) norm we are to bound, is a random contraction of the third-order tensor T ⊗ ⊗ ( k) ¾˜ D u u u u⊗ . Corollary 4.6.6 gives the following bound on the expected ( ) norm of a random contraction in terms of spectral norms of two matrix unfoldings of T—which turn out to be the same in our case due to the symmetry of T. ¾k k ( )1 2 · {k k k k } ( )1 2 · ¾˜ k T M10 6 O log d / max T 1 , 23 , T 2 , 13 O log d / u⊗ u . 10 { } { } { } { } D u ( ) (4.4.8) Theorem 4.6.1 shows that for any pseudo-distribution D that satisfies {kuk2 6 1}, k T T ¾˜ u⊗ u 6 ¾˜ uu . (4.4.9) D u D u ( ) ( ) The statement of the lemma follows by combining the previous bounds (4.4.8) and (4.4.9), ¾k k ( )1 2· ¾˜ k T ( )1 2· ¾˜ T ( 2 )1 2· M10 6 O log d / u⊗ u 6 O log d / uu 6 O ε k log d / t , 10 D u D u ( ) ( ) h ik+2 using condition (4.4.1) of Theorem 4.4.1 which yields t ¾˜ D u a, u > √ ( ) 1 T Ω (ε k)− k¾˜ D uu k. 4.4.2 Improving accuracy of a found solution We need one more technical ingredient before analyzing our main algorithm. Previously, the run-time of the sum-of-squares algorithm in [17] (on which our algorithm is based) depended exponentially on the accuracy parameter 1/ε, and we give here a simple boosting technique that allows us to remove this dependency and achieve polynomially small error. 169 Here we have a set of nearly isotropic vectors a1,..., an. We give a sum- Ín h i4 of-squares proof that if i1 ai , u is only ε off from its maximum possible value, and if u has constant correlation with some ai, then u must in fact be (1−O(ε))-correlated with ai. 
Intuitively, the former constraint forces D to roughly be a mixture distribution over vectors that are close to a1,..., an, and the latter one forces it to actually only be close to ai. We then briefly show how this proof implies an algorithm to boost the accuracy when we already know a vector b that is 0.01-close to a solution, by solving for a pseudo-distribution with the added constraint {hb, ui2 > 0.9}. d Theorem 4.4.4. Let ε > 0 be smaller than some constant. Let a1,..., an ∈ be unit kÍn T k + vectors such that i1 ai ai 6 1 ε. Define the following systems of constraints, for each j ∈ [n] or unit vector b ∈ d: n n 2 Õ 4 2 1 o A j : kuk 6 1, hai , ui > 1 − ε, ha j , ui > i1 2 D u ( ) n n 2 Õ 4 2 o Bb : kuk 6 1, hai , ui > 1 − ε, hb, ui > 0.9 . i1 D u ( ) 2 2 Then A j `4 {ha j , ui > 1 − 10ε} for all j ∈ [n], and also Bb ` A j and hai , bi > 0.8 for some j ∈ [n]. Proof. We have the following sum-of-squares proof: n Õ 4 A `u,4 1 − ε 6 hai , ui i1 2 4 Õ 2 6 ha j , ui + hai , ui (by adding only square terms) i,j 4 22 6 ha j , ui + 1 + ε − ha j , ui Ín h i2 ( + )k k2 (using `u i1 ai , u 6 1 ε u ) h i2 + 1 + + − h i2 6 a j , u 2 ε 1 ε a j , u 2 (since ` 1/2 6 ha j , ui 6 1) 170 1 − h i2 + 1 + 6 2 ε a j , u 2 2ε , (4.4.10) 2 which means that A `u,4 ha j , ui > 1 − 10ε for ε > 0 small enough. To show that Bb ` Ai for some i, it is enough to show that if Bb is consistent (i.e. there exists a pseudo-distribution satisfying Bb), then there exists i ∈ [n] 2 2 such that hai , bi > 0.8, because it implies {hai , ui > 1/2} by triangle inequality. 2 For the sake of contradiction, assume that hai , bi < 0.8 for all i ∈ [n]. Then, by 2 triangle inequality (see Lemma ??), Bb ` { i ∈ [n]. hai , ui 6 0.99} which when ∀ T 2 combined with ai ai 6 1 + ε using substitution, contradicts the assumption B {Ín h i4 − } that b ` i1 ai , u > 1 ε for small enough ε > 0. d | Corollary 4.4.5. Let D be a degree-` pseudo-distribution over such that D ` 4 / Bb, with Bb as defined in Theorem 4.4.4. Then, there exists i ∈ [n] such that 2 − 2 2 ( ) h i2 ¾˜ D u u⊗ a⊗ 6 O ε and ai , b > 0.8. ( ) i 2 Proof. By Theorem 4.4.4, Bb `4 {hai , ui > 1 − 10ε} for some i. It follows by Lemma 4.3.4 that 2 ¾˜ 2 − 2 − ¾˜ 2 2 − ¾˜ h i2 u⊗ ai⊗ 6 2 2 u⊗ , ai⊗ 2 2 ai , u 6 20ε . D u D u D u ( ) ( ) ( ) 4.5 Decomposition with sum-of-squares In this section, we give a generic sum-of-squares algorithm (Algorithm1 and Theorem 4.5.2) that will be used for various different settings in the following subsections (Section 4.5.2 for orthogonal tensors, Section 4.5.3 for tensors with separated components), and in the section 4.7 for random 3-tensor and Section 4.8 for robust FOOBI. 171 4.5.1 General algorithm for tensor decomposition In this section, we provide a general sum-of-squares tensor decomposition that serve as the main building block for sections later. We will need the following lemma, which appears in [17, Proof of Lemma 6.1]. d Lemma 4.5.1. Let ε ∈ (0, 1) and {a1,..., an } be a set of unit vectors in with kÍn T k + ∈ i1 ai ai 6 1 ε. Then, for all even integers k , there exists a sum-of-squares proof that ( n ) ( n ) k k2 Õh i4 − Õh ik+2 − ( ) u 6 1, ai , u > 1 ε `u, k+2 ai , u > 1 O kε . (4.5.1) i1 i1 Our main algorithm below finds the solutions to a system of polynomial constraints A, when given a “hint” in the form of a polynomial transformation of formal variables P(·). 
Roughly P should be an “orthogonalizing” map so that if a1,..., an are the desired solutions to the constraints A, then P(a1),..., P(an) are k Ín ( ) ( )T k + nearly an orthonormal basis, or more precisely i0 P ai P ai 6 1 ε while 2 kP(ai)k > 1 − ε for all i. We then only require that a sum-of-squares proof exists certifying that the solutions to A after being mapped by P are actually close to ( ) ( ) A {Ín h ( ) ( )i4 − } P a1 ,... P an ; more precisely, that `` i1 P ai , P u > 1 ε u for some `. The existence of this sum-of-squares certificate then allows us to recover the solutions P(ai) up to O(ε) accuracy by solving for pseudo-distributions and then rounding them. We later show how Algorithm 1 can be applied to a variety of tensor rank decomposition problems by the design of an appropriate orthogonalizing trans- form P. For example, in Section 4.7 P(·) orthogonalizes an overcomplete tensor by lifting the variables to a higher-dimensional space, and P(·) serves as a whitening transformation on a far-from-orthogonal tensor in Section 4.8. 172 The main technical difficulty in this analysis was in making the run-time polynomial (as opposed to quasi-polynomial in [17]) for the nearly-orthogonal case where P is the identity transform. O ` Theorem 4.5.2. For every ` ∈ , there exists an n ( )-time algorithm (see Algorithm 1) with the following property: Let ε > 0 be smaller than some constant. Let d, d0 ∈ be d d d numbers. Let P : → 0 be a polynomial with deg P 6 `. Let {a1,..., an } ⊆ be d a set of vectors such that b1 P(a1),..., bn P(an) ∈ 0 all have norm at least 1 − ε kÍn T k + A and i1 bi bi 6 1 ε. Let be a system of polynomial inequalities in variables u (u1,..., ud) such that the vectors a1,..., an satisfy A and ( n ) Õ 4 4 A `u,` hbi , P(u)i > (1 − ε) kP(u)k . (4.5.2) i1 A { } ⊆ d0 Then, the algorithm on input and P outputs a set of unit vectors b10 ,..., b0n such that 2 2 ( ) 2 ( ) 2 ( )1 2 distH b1⊗ ,..., bn⊗ , b10 ⊗ ,..., b0n ⊗ 6 O ε / . Proof of Theorem 4.5.2. We analyze Algorithm 1. By Corollary 4.4.5, if there exists a pseudo-distribution D0(u) that satisfies constraints (4.5.5), then the top ( ) ( )T ( )1 2 eigenvector of ¾˜ D u P u P u is O ε / -close to one of the vectors b1,..., bn. 0( ) {h ( ) i } The fact that we add in step 4, the constraint P u , b0i 6 0.1 also implies by Corollary 4.4.5 that in some iteration i, we can never find a vector b0i that is close to one vector b0j from a previous iteration j < i. Therefore, it remains to show that in each of the n iterations with high probability we can find a pseudo-distribution D0(u) that satisfies (4.5.5). Consider a particular iteration i0 ∈ [n] of Algorithm 1. We may assume that the vectors b0 ,..., b0 are close to b1,..., bi 1. First we claim that there 1 i0 1 0 − − 173 Algorithm 1 General tensor decomposition algorithm Parameters: numbers ε > 0, n, ` ∈ . Given: system A of polynomial inequalities over d and polynomial P : d → d0. ∈ d0 Find: vectors b10 ,..., b0n . Algorithm: • For i from 1 to n, do the following: 1. Compute a degree-(k + 2)` pseudo-distribution D(u) over d, with k O(1), that satisfies the constraints A ∪ {1 + ε > kP(u)k2 > 1 − ε} + ( ) ( )T 1 ε ¾˜ D u P u P u 6 . (4.5.3) ( ) n − i + 1 1 T 2. Choose standard Gaussian vectors 1( ) ,..., 1( ) ∼ N(0, Id k ) and d0 O 1 ( ) T d ( ) and compute the top eigenvectors of the following matrices for all t ∈ [T]: t k T d d ¾˜ h1( ) , P(u)⊗ i · P(u)P(u) ∈ 0× 0 . (4.5.4) D u ( ) 3. Check if for one of the normalized top eigenvectors b? 
computed in the previous step, there exists a degree-4` pseudo-distribution D0(u) that satisfies the constraints A ∪ 1 + ε > kP(u)k2 > 1 − ε, hb?, P(u)i2 > 0.99 . (4.5.5) b ¾˜ P(u)P(u)T 4. Set 0i to be the top eigenvector of the matrix D0 u and add A {h ( ) i2 } ( ) to the constraint P u , b0i 6 0.01 . exists a pseudo-distribution satisfying conditions (4.5.3) in step 1, including the additional constraints added to A in previous iterations. Indeed, the uniform distribution over vectors ai ,..., an satisfies all of those conditions. By assumption, A {Ín h ( )i4 − } we have a sum-of-squares proof `u,` i1 bi , P u > 1 ε . Lemma 4.5.1 A {Ín h ( )ik − ( )} then implies `u, k+2 ` i1 bi , P u > 1 O kε for an absolute constant ( ) parameter k to be determined later. Since A includes the added constraints {h ( )i2 h ( )i2 } Ín T 2 + ( ) b1, P u 6 0.1,..., bi0 1, P u 6 0.1 , it follows by i bi bi 6 1 O ε − 1 174 A ` {Íi0 1h ( )ik ( )k 2·( + ( ))} and substitution that i−1 bi , P u 6 0.1 − 1 O ε , here choosing k ( )k 2 ·( + ( )) A {Ín h ( )ik } so that 0.1 − 1 O ε 6 0.001. Therefore, ` k+2 ` ii bi , P u > 0.99 ( ) 0 ˜ Ín h ( )ik ( + ) and so ¾D u ii bi , P u > 0.99 for any degree- k 2 ` pseudo-distribution ( ) 0 D that satisfies constraints (4.5.3). In particular, by averaging, there exists an ? index i ∈ {i0,..., n} such that k 0.99 T ¾˜ hbi? , P(u)i > > 0.9 · ¾˜ P(u)P(u) . D u n − i0 + 1 D u ( ) ( ) By Theorem 4.4.1, for each of the matrices (4.5.4) in step 2, its top eigenvector O 1 is 0.001-close to bi? with probability at least d− ( ). Therefore, we find at least Ω 1 one of those vectors with probability no smaller than 1 − d ( ). In this case, a pseudo-distribution D0(u) as required in step 3 exists, as an atomic distribution supported only on bi? is an example that satisfies the conditions. 4.5.2 Tensors with orthogonal components We apply Theorem 4.5.2 to orthogonal tensors with noise. Theorem (Restatement of Theorem 4.1.1). There exists a polynomial-time algorithm d 3 d that given a symmetric 3-tensor T ∈ ( )⊗ outputs a set of vectors {a0 ,..., a0 } ⊆ 1 n0 d such that for every orthonormal set {a1,..., an } ⊆ , the Hausdorff distance between the two sets is at most n { } 2 ( )· − Õ 3 distH a1,..., an , a10 ,..., a0n 6 O 1 T ai⊗ . (4.5.6) 0 i1 1 , 2,3 { } { } 3 Proof. We feed Algorithm1 with the inputs P(u) u and A hT, u⊗ i > 1 − ε k k − Í 3 where ε E 1 , 2,3 and E T i a⊗ . We have { } { } i n Õ 3 3 3 A `4 hai , ui hT, u⊗ i − hE, u⊗ i i1 175 3 hT, u⊗ i − ε > 1 − 2ε . Here at the second line we used that h 3i k k ` E, u⊗ 6 E 1 , 2,3 6 ε . (4.5.7) { } { } We verify that A satisfies the requirement (4.5.2), n n ! n ! Õ 4 Õ 4 Õ 2 A ` hai , ui > hai , ui hai , ui (using orthonormality) i1 i1 i1 n !2 Õ 3 > hai , ui (Cauchy-Schwarz: Lemma ??) i1 > 1 − 4ε . Therefore calling Algorithm1, we can recover aˆi which is, up to sign flip, close to 1 2 ai with error O(ε / ). We determine the sign by finding the τ ∈ {−1, +1} such h ˆ 3i − ˆ that T, τai⊗ > 1 ε and set the output a0i to τai. Remark 4.5.3. Note that in the proof of Theorem 4.1.1, the conclusion of equa- tion (4.5.7) is the only thing we used about the error term E. Therefore, define the following SoS relaxation of the injective norm: h i k k k k2 h 3i E SoS inf u 6 1 ` E, u⊗ 6 c . c ∈ Then we can replace the right hand side of equation (4.5.6) by O(1)· n T − Í a 3 i1 i⊗ SoS. 
4.5.3 Tensors with separated components The following lemma shows that for separated vectors the sum of higher-order outer products has spectral norm that decrease exponentially with the tensor 176 order. d Lemma 4.5.4. Let a1,..., an be unit vectors in . Then, for every k ∈ , n k n + Õ T k 1 Õ T ai ai ⊗( ) 6 1 + max |hai , a ji| · ai ai . i,j i1 i1 + Í T k 1 ∈ ( d) k+1 Proof. Let A i ai ai ⊗ . For a unit vector x ⊗ we’ll bound the quadratic form xTAx. First, without loss of generality we can assume that x is in the subspace V { k+1} spanned by ai⊗ i. This is because if x had a component y orthogonal to V, T k+1 ∈ [ ] then y ai⊗ 0 for all i n by definition, so that Ay 0 and y can make no nonzero contribution to the quadratic form above. 1 2 + Also let W (A / ) so that W is a whitening transform and WAW is a Í k+1 Í 2 k k2 projector onto V. Then suppose x i ciWai⊗ , so that i ci x 1. Then T Õ ( k+1)T k+1 x Ax ci c j ai⊗ WAWa⊗j ij Õ 2 + Õ h ik+1 ci ci c j ai , a j i i,j k Õ 2 + |h i| Õ h i 6 ci max ai , a j ci c j ai , a j i,j i i,j k n Õ T 6 1 + max |hai , a ji| ai ai , i,j i1 Í T ( 1 2)+ where in the last step we let A0 i ai ai and W0 A0 / , and apply the Í h i Í T T k k inequality i,j ci c j ai , a j ij ci c j ai W0A0W0a j x0 A0x0 6 A0 , where x0 Í i ciW0ai is a unit vector. d d k k 2 Lemma 4.5.5. Let a ∈ and b ∈ ( )⊗ be unit vectors such that ha⊗ , bi > 1 − ε. k 1 Let B be the reshaping of the vector b into a d-by-d − matrix. Then the top left singular d 2 vector a0 ∈ of B satisfies ha0, ai > 1 − O(ε). 177 k Proof. Let c be the top right singular vector of B. Then, ha0⊗c, bi > ha⊗ , bi > 1−ε. 1 2 k 1 Therefore, ka0 ⊗ c − bk 6 O(ε) / . By triangle inequality, ka0 ⊗ c − a ⊗ a⊗ − k 6 1 2 k 1 O(ε) / , which means that as desired |ha, a0i| > ha0 ⊗ c, a ⊗ a⊗( − )i > 1−O(ε). Theorem (Restatement of Theorem 4.1.5). There exists an algorithm A with poly- nomial running time (in the size of its input) such that for all η, ρ ∈ (0, 1) and { } ⊆ d kÍn T k σ > 1, for every set of unit vectors a1,..., an with i1 ai ai 6 σ and |h i| ∈ ( d) k maxi,j ai , a j 6 ρ, when the algorithm is given a symmetric k-tensor T ⊗ with + 1 log σ d k > O · log(1/η), then its output A(T) is a set of vectors {a0 ,..., a0 } ⊆ log ρ 1 n0 such that 2 n { 2 2} { 2 2} + − Õ k distH a0⊗ ,..., a0⊗n , a⊗ ,..., an⊗ 6 O η T ai⊗ . 1 1 i1 1,..., k 2 , k 2 +1,...,k { b / c} {b / c } (4.5.8) − Í k Proof. We use Algorithm 1 from Theorem 4.5.2. Let E T i ai⊗ . We may k k assume that E 1,..., k 2 , k 2 +1,...,k 6 η, since otherwise the theorem follows { b / c} {b / c } k k from the case when η E 1,..., k 2 , k 2 +1,...,k . Let P be the polynomial map { b / c} {b / c } k 4 P(x) x⊗d / e and let A be the system of polynomial inequalities k 2 A {hT, u⊗ i > 1 − η, kuk 1} . (4.5.9) ( ) k k in variables u u1,..., ud . Since E 1,..., k 2 , k 2 +1,...,k 6 η, all of the vectors { b / c} {b / c } a1,..., an satisfy A. Let b1,..., bn be the unit vectors bi P(ai). By Lemma 4.5.4 kÍ T k + k 4 + and the condition on k, these vectors satisfy i bi bi 6 1 ρd / e σ 6 1 η. Then, we have the following sum-of-squares proof n n n A Õh ( )i4 Õh i4 k 4 Õh ik h ki − h ki `u,k bi , P u ai , u d / e > ai , u T, u⊗ E, u⊗ i1 i1 i1 (4.5.10) − − k k − > 1 η T 1,..., k 2 , k 2 +1,...,k > 1 2η . (4.5.11) { b / c} {b / c } 178 It follows that A and P satisfy the conditions of Theorem 4.5.2. Thus, Algorithm 1 A on input and P recovers vectors b10 ,..., b0n with Hausdorff distance at most √ O( η) from b1,..., bn. 
By Lemma 4.5.5, the top left singular vectors of the √ k 4 1 ( ) d-by-dd / e− matrix reshapenings of b10 ,..., b0n are O η -close to the vectors a1,..., an up to sign. (If k is odd, then we may determine the signs of the ai by h ki − ( ) h ki − + ( ) checking if T, a0i ⊗ > 1 O η or T, a0i ⊗ 6 1 O η for each output vector a0i.) 4.6 Spectral norms and tensor operations In this section, we provide several bounds regarding the spectral norms of moments of the lifted vectors, and the spectral norm of random contraction of a tensor, which are crucial in our analysis in previous sections. We suggest readers who are more interested in applications of the algorithms jump to Section 4.7 and 4.8. 4.6.1 Spectral norms and pseudo-distributions Theorem 4.6.1. Let D be a degree-4(p + q) pseudo-distribution over d that satisfies {k k2 } ∈ u 6 1 D u . Then, for all p, q , ( ) p q T T ¾˜ u⊗ u⊗ 6 ¾˜ uu . (4.6.1) D u D u ( ) ( ) The theorem follows by combining Lemma 4.6.2 and Lemma 4.6.4 proved below. Lemma 4.6.2 reduces Theorem 4.6.1 to the case when p q. 179 Lemma 4.6.2. Let D be a degree-4(p + q) pseudo-distribution over d that satisfies {k k2 } ∈ u 6 1 D u . Then, for all p, q , ( ) 2 p q T p p T q q T ¾˜ u⊗ u⊗ 6 ¾˜ u⊗ u⊗ · ¾˜ u⊗ u⊗ . (4.6.2) D u D u D u ( ) ( ) ( ) d p d q Proof. For all unit vectors x ∈ ( )⊗ and y ∈ ( )⊗ p q T p q hx, ¾˜ u⊗ u⊗ yi ¾˜ hx, u⊗ ihu⊗ , yi D u D u ( ) ( ) 1 2 1 2 p 2 / q 2 / 6 ¾˜ hx, u⊗ i · ¾˜ hu⊗ , yi D u D u ( ) ( ) (Cauchy–Schwarz for pseudo-expectations) 1 2 1 2 p p T / q q T / 6 ¾˜ u⊗ u⊗ · ¾˜ u⊗ u⊗ . (4.6.3) D u D u ( ) ( ) The lemma follows from this bound by choosing x and y as the top left and right p q T singular vectors of the matrix ¾˜ D u u⊗ (u⊗ ) . ( ) Towards proving Theorem 4.6.1 for the case of p q, we first establish the following lemma which says that tensoring with vector with norm less 1 won’t increase the spectral norm. Lemma 4.6.3. Let 1(u, v) be a polynomial in indeterminates u, v. Let D be a degree-4 d {k k2 ( ) } pseudo-distribution over that satisfies u 6 1, 1 u, v > 0 D u,v . Then, for all ( ) p ∈ , T T ¾˜ 1(u, v) (u ⊗ v)(u ⊗ v) 6 ¾˜ 1(u, v)vv . (4.6.4) D u,v D v ( ) ( ) Proof. We have the sum-of-squares proof that T ` 1(u, v) Id ⊗vv> − (u ⊗ v)(u ⊗ v)