Tensors, sparse problems and conditional hardness

by Elena-Madalina Persu

A.B., Harvard University (2013) S.M., Massachusetts Institute of Technology (2015)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2018

© Massachusetts Institute of Technology 2018. All rights reserved.

Author (signature redacted): Department of Electrical Engineering and Computer Science, August 24, 2018
Certified by (signature redacted): Ankur Moitra, Rockwell International Associate Professor of Mathematics, Thesis Supervisor
Accepted by (signature redacted): Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, Department Committee on Graduate Students


Tensors, sparse problems and conditional hardness by Elena-Madalina Persu

Submitted to the Department of Electrical Engineering and Computer Science on August 24, 2018, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

In this thesis we study the interplay between theoretical computer science and machine learning in three different directions.

First, we make a connection between two ubiquitous sparse problems: Sparse Principal Component Analysis (SPCA) and Sparse Linear Regression (SLR). We show how to efficiently transform a blackbox solver for SLR into an algorithm for SPCA. Assuming the SLR solver satisfies prediction error guarantees achieved by existing efficient algorithms such as those based on the Lasso, we show that the SPCA algorithm derived from it achieves state-of-the-art performance, matching the guarantees for testing and for support recovery under the single spiked covariance model obtained by the current best polynomial-time algorithms.

Second, we push forward the study of linear algebra properties of tensors by giving a tensor rank detection gadget for tensors in the smoothed model. Tensors have had a tremendous impact and have been extensively applied over the past few years to a wide range of problems, for example in developing estimators for latent variable models, in independent component analysis, and in blind source separation. Unfortunately, their theoretical properties are still not well understood. We make a step in that direction.

Third, we show that many recent conditional lower bounds for a wide variety of problems in combinatorial pattern matching, graph algorithms, data structures and machine learning, including gradient computation in average case neural networks, are true under significantly weaker assumptions. This highlights that the intuition from theoretical computer science can not only help us develop faster practical algorithms but also give us a better understanding of why faster algorithms may not exist.

Thesis Supervisor: Ankur Moitra Title: Rockwell International Associate Professor of Mathematics

Acknowledgements

I will never be able to thank my parents, Camelia and Ion, enough for all their love and dedication throughout the years. Growing up, they always believed in me, taught me how to persevere and placed a great emphasis on education. I am very grateful for their constant love, support, and encouragement. I dedicate this thesis to them.

I have been extremely fortunate to have Ankur Moitra as my advisor. As a researcher, he inspires me with his deep intellectual curiosity and a level of intuition that lets him go to the very core of difficult research problems. He savors and enjoys the process of doing research, both the challenges and the small discoveries along the way. It is Ankur's qualities as a person that I am most grateful for: his patience, compassion, and generosity.

I would like to thank Guy Bresler and Costis Daskalakis for serving on my dissertation committee. Special thanks to and Prateek Jain for expanding my research horizons and hosting me at Microsoft Research over two summers. There are certain moments that completely change one's life path; the first class of Computational Learning Theory was one of those for me. Hence, I would like to thank my undergraduate advisor, , for introducing me to the world of theoretical computer science. I am also very grateful to all my collaborators, from whom I had a lot to learn: Arturs Backurs, Sam Park, Ankur Moitra, Piotr Indyk, Guy Bresler, Ryan Williams, Virginia Williams, Christopher Musco, Cameron Musco, Sam Elder and Michael Cohen. I was very lucky to have the chance to work with Michael; one can only wonder what his mind could have achieved.

I also want to thank the wonderful lab assistants, Debbie and Patrice, for always having a smile on and letting me borrow their keys the many times I got locked out of my office.

Finally, thank you to all my friends during graduate school - from the MIT theory group: especially Katerina, Manolis and the Greeks, Maryam, Ilya, Quanquan, Arturs, Sam, Prashant, Akshay, Itay, Nicole, Aloni, Daniel, Luke, Adam, Rio, Sepideh, Pritish, Ludwig, Jerry, Gautam, Henry; and outside: Horia, Sergio, Andreea, Julia, Patricia, Ioana. Your friendship has meant a lot to me and my MIT experience would not have been the same without you.

Thank you also to my Harvard friends who made my undergraduate years some of the best of my life. Thank you especially to Min and Currierism, Fiona, Lily, Robert, Katrina, Andrei, Miriam, Shiya, Gye-Hyun and Jao-ke.

Contents

List of Symbols

1 Introduction

2 Sparse PCA from Sparse Linear Regression
  2.1 Introduction
    2.1.1 Our contributions
    2.1.2 Previous algorithms
  2.2 Preliminaries
    2.2.1 Problem formulation for SPCA
    2.2.2 Problem formulation for SLR
    2.2.3 The linear model
  2.3 Algorithms and main results
    2.3.1 Intuition of test statistic
    2.3.2 Algorithms
  2.4 Analysis
    2.4.1 Analysis of Q_i under H_1
    2.4.2 Analysis of Q_i under H_0
    2.4.3 Proof of Theorem 5
    2.4.4 Proof of Theorem 6
    2.4.5 Discussion
  2.5 Experiments
    2.5.1 Support recovery
    2.5.2 Hypothesis testing
  2.6 Conclusion

3 Tensor rank under the smoothed model
  3.1 Introduction
    3.1.1 Our results
    3.1.2 Our approach
  3.2 Preliminaries and notations
  3.3 Young flattenings
    3.3.1 Young flattenings in the smoothed model
  3.4 Proof of Theorem 14
  3.5 Future directions
  3.6 Linear algebra lemmas

4 Stronger Fine Grained Hardness via #SETH
  4.1 Preliminaries
  4.2 Pattern matching under edit distance
    4.2.1 Preliminaries
    4.2.2 Reduction
  4.3 Machine learning problems
    4.3.1 Gradient computation in average case neural networks
    4.3.2 Reduction
    4.3.3 Hardness results
  4.4 Wiener index
  4.5 Dynamic graph problems
    4.5.1 Reductions framework
  4.6 Counting Matching Triangles
  4.7 Average case hardness for the Counting Orthogonal Vectors Problem
    4.7.1 Reduction

A Vector Gadgets
  A.1 Vector Gadgets

B Useful lemmas
  B.1 Linear minimum mean-square-error estimation
  B.2 Calculations for linear model from Section 2.2.3
  B.3 Properties of design matrix X
  B.4 Tail inequalities - Chi-squared

Bibliography

List of Figures

2-1 Performance of diagonal thresholding (DT), covariance thresholding (CT), and Q for support recovery at n = d = 625, 1250, varying values of k, and θ = 4.

2-2 Performance of diagonal thresholding (DT), MDP, and Q for hypothesis testing at n = 200, d = 500, k = 30, θ = 4 (left and center). T0 denotes the statistic T under H_0, and similarly T1 under H_1. The effect of rescaling the covariance matrix to make variances indistinguishable is demonstrated (right).

3-1 CANDECOMP/PARAFAC tensor decomposition of a third-order tensor.

List of Symbols

Σ        covariance matrix
Σ̂        sample covariance matrix
E[·]     expectation over the appropriate sample space
N(μ, Σ)  Gaussian distribution with mean vector μ and covariance matrix Σ
I_n      n × n identity matrix
diag{d_1, ..., d_n}  diagonal matrix with diagonal entries d_i
≲        inequality up to an absolute constant
S^n      n-dimensional unit sphere in R^{n+1}
B_0(k)   the set of k-sparse vectors in R^d
[n]      {1, ..., n}
⊗        tensor product
∧        exterior product
w.p.     "with probability"
w.h.p.   "with high probability"

Chapter 1

Introduction

In the past couple of decades machine learning has become a common tool in almost any task that requires information extraction from large data sets. We are surrounded by machine learning-based technology: search engines learn to display the best results in a personalized manner; credit card transactions are secured by software that learns to detect fraud; smart cars keep us safe during our daily commute; smartphones learn to recognize voice commands. Machine learning is also widely used in scientific applications such as bioinformatics, medicine and astronomy. Just as mathematics once was the main tool of the other sciences, computer science, and more specifically machine learning, now seems to have become the lens through which we observe the world. Unfortunately, while inheriting the role, machine learning has not inherited the mathematical rigor: many of the algorithms used in practice work without any sort of provable guarantees on their behavior. The theoretical study of machine learning is important in two ways: from the theoretical side, we hope to understand what better models exist for the study of algorithms beyond worst-case analysis; from the practical side, developing insights about why heuristics work so well allows us to improve them. In this thesis we make advances in both directions.

In modern big data instances, we often work in the so-called high-dimensional regime, where the dimensionality of the data may be significantly larger than the number of samples.

Unfortunately, classical statistical tools are not appropriate, as they focus on the opposite framework: the asymptotic regime where the number of samples n tends to infinity while other parameters are fixed. Classical estimators such as the maximum likelihood estimator are nevertheless consistent; that is, the sample estimate of a parameter converges to the true population value as we acquire more samples. In many modern applications, however, the number of samples we have access to is far less than the dimensionality of the data. Hence, it is often unreasonable to assume we have many more samples than dimensions. This motivates the study of statistical problems in the high-dimensional setting.

Working in high dimensions is both a curse and a blessing ([Wai10]): exponential blowup in sample complexity or runtime is inevitable in certain cases (the so-called "curse of dimensionality"); however, phenomena such as concentration of measure work for us, enabling inference under appropriate assumptions.

To circumvent the curse of dimensionality and make problems in high-dimensional settings tractable, a low-dimensional structure is imposed on the models. Sparsity is a simple and natural assumption for many problems, and has been extensively analyzed in theory and applied in practice. Aside from the mathematical usefulness of such an assumption, real-world data are often sparse in an appropriate basis. For instance, natural images are known to be approximately sparse in alternate bases such as wavelet or Fourier, and this fact is exploited by several compression schemes. In summary, sparse models have proven to be powerful both in theory and practice. See [EK12] or similar for a more extensive history.

We may view the success of sparse models in the light of Occam's razor: among equivalent explanations, the simplest is best. In the case of linear regression, it is reasonable to expect that only a few of the covariates affect the response variable.

Despite continuous theoretical progress in high-dimensional estimation tasks, our understanding is still lacking in some aspects. For some problems, there remains a gap between statistically optimal algorithms and known efficient algorithms: the former are often based on a brute-force search over model parameters, while the latter rely on various convex relaxations and greedy heuristics. In other cases, computationally efficient algorithms require certain restrictive assumptions on the input, and the only known proof techniques rely crucially on those assumptions. Without those assumptions, a much higher signal strength is required in order to do inference, as far as we know. There seems to be a statistical price to pay for computational efficiency.

Sparse Principal Component Analysis (SPCA) and Sparse Linear Regression (SLR) are two problems with a wide range of applications that have attracted a tremendous amount of attention over the last two decades as canonical examples of statistical problems in high dimension. A variety of algorithms have been proposed for both SPCA and SLR, but their literatures have been disjoint for the most part. We have a fairly good understanding of the conditions and regimes under which these algorithms succeed. But is there a deeper connection between the computational structure of SPCA and that of SLR?

In this thesis we show how to efficiently transform a blackbox solver for SLR into an algorithm for SPCA. Assuming the SLR solver satisfies prediction error guarantees achieved by existing efficient algorithms such as those based on the Lasso, we show that the SPCA algorithm derived from it achieves state-of-the-art performance, matching guarantees for testing and for support recovery under the single spiked covariance model as obtained by the current best polynomial-time algorithms. Our reduction not only highlights the inherent similarity between the two problems but also, from a practical standpoint, enables obtaining a collection of algorithms for SPCA directly from known algorithms for SLR. Experiments on simulated data show that these algorithms perform well. This is the content of Chapter 2 of this thesis and is based on joint work with Guy Bresler and Sam Park.

Besides imposing sparsity, another high-dimensional statistics tool that has been used extensively to reveal unknown parameters of different models is tensor decomposition. Many recent problems fit into the following framework: choose some parametric family of distributions that is rich enough to model things like evolution, writing or the formation of social networks, then design algorithms to learn the unknown parameters. These parameters are a proxy for finding hidden structure in the data: for example, a tree of life that explains how species evolved from each other, the common topics of a collection of documents, or the communities of strongly connected persons in a social network. A common approach for this class of problems is to construct a tensor from the moments of the distribution and apply tensor decomposition algorithms to find the hidden factors, which in turn reveal the unknown parameters of the models. More specifically, tensors are useful for learning latent variable models, and being able to decompose higher rank tensors generally goes hand in hand with being able to work with more components [BCMV14a].

The state of the art for guaranteed tensor decomposition involves two steps: converting the input tensor to an orthogonal symmetric form and then solving the orthogonal decomposition through tensor eigendecompositions [Com94, KM11, AGH+14]. While this procedure comes with efficiency guarantees, it suffers from a number of theoretical and practical limitations; most importantly, it is unable to recover overcomplete representations (the case when the tensor rank is larger than the dimension) due to the orthogonality constraint, which is especially limiting given the recent popularity of overcomplete feature learning in many domains [BCV, LS00]. The same drawback holds for the progress on random tensor models. Part of the motivation behind the tensor decomposition work in this thesis is to understand whether there are more tame conditions that allow one to work with overcomplete tensors.

Tensor decompositions are a powerful tool that has been used extensively in a wide range of machine learning problems, for example in developing estimators for latent variable models, in independent component analysis or blind source separation. The uniqueness of decomposition gives tensors a significant advantage over matrices. However, many of the familiar properties of matrices do not generalize to tensors and in the general setting most tensor problems are NP-hard [HL13]. The work presented in this thesis fits into the broad direction of trying to understand what sorts of conditions avoid computational intractability for tensor problems.

Our main technical contribution in Chapter 3 is to push forward the study of linear algebra properties of tensors by giving a tensor rank detection gadget for tensors in the smoothed model. The assumption here is that the model is not adversarially chosen, formalized by a perturbation of the model parameters. Our results involve applying algebraic geometry tools, through Young flattenings, to tensor rank detection. The results in this chapter are based on joint work with Ankur Moitra.

One other direction in which ideas from theoretical computer science can influence machine learning is the hardness in P (polynomial time) agenda. In particular, studying the computational complexity of widely used algorithms, for example empirical risk minimization, can shed light on which primitives in machine learning can and cannot hope to be sped up, under popular complexity conjectures. The Strong Exponential Time Hypothesis (SETH) is one such conjecture; it states that, for the problem of testing the satisfiability of CNF formulas, no algorithm can improve over the running time of the naive exhaustive search algorithm by an exponential factor. Over the last few years this hypothesis has been used to show conditional lower bounds for a wide variety of problems in combinatorial pattern matching, graph algorithms, data structures and machine learning. In Chapter 4, we show that many of the aforementioned results in fact hold under a significantly weaker assumption. This chapter is based on joint work with Arturs Backurs, , Ryan Williams and Virginia Williams. Specifically, we consider the counting analog of SETH (denoted #SETH), which postulates that an analogous lower bound holds for the complexity of counting the number of assignments satisfying a given CNF formula. Assuming #SETH, we show conditional lower bounds for problems such as: pattern matching under edit distance, kernel density estimation, solving kernel SVMs, computing the gradient of the top layer in a depth-3 neural network, maintaining the number of strongly connected components in a directed graph, and others. Our results strengthen the evidence of hardness of these problems.

Chapter 2

Sparse PCA from Sparse Linear Regression

2.1 Introduction

Principal component analysis is a popular and successful technique for dimension reduction and data analysis that aims to find directions along which multivariate data has maximum variance. The goal of sparse PCA (SPCA) is to find a few sparse linear combinations of given data that explain as much of the variance as possible. This has the advantage that the components are more interpretable. We study sparse PCA under the spiked covariance model introduced by [Joh01]. Data X_1, ..., X_n ∈ R^d are drawn from a normal distribution N(0, I_d + θuu^T), where we assume ||u|| = 1, u has at most k nonzero elements, and θ is a parameter that controls the signal-to-noise ratio. We consider two versions of the problem: the goal of hypothesis testing is to distinguish θ ≥ θ_0 vs. θ = 0 for some threshold θ_0; the goal of support recovery is to recover the support of u.

Sparsity is a natural assumption in many situations and makes problems well-defined in the high-dimensional setting. The goal of this chapter is to make a specific algorithmic connection between sparse PCA and sparse linear regression. In linear regression there is a known design matrix X ∈ R^{n×d} and one makes n noisy observations y = Xβ* + w, where w ~ N(0, σ²I_n) is the noise. The goal is to find a regression vector with small loss (either squared error or prediction error). In sparse linear regression (SLR), we impose the assumption that β* has some relatively small number k of nonzero entries. The information-theoretically optimal algorithm for SPCA [BR13b] involves searching over all possible (d choose k) supports of the hidden spike u. This bears resemblance to the optimal algorithm for SLR, which minimizes the prediction error over all (d choose k) supports of the regression vector β*. We illustrate that this similarity is not an accident by giving a general, simple and efficient procedure for transforming a blackbox solver for sparse linear regression with guarantees on prediction error into an algorithm for SPCA. At a high level, our algorithm regresses each coordinate i of our data on the rest of the coordinates, and computes a statistic Q_i based on the prediction error. If Q_i is greater than some threshold, we can safely conclude that i is in the support of u. This immediately gives algorithms for both hypothesis testing and support recovery. The advantages of such a blackbox framework are twofold: it illustrates the inherent structural similarity of the two problems; practically, it allows one to simply plug in any of the number of solvers available for SLR (such as Lasso [Tib96] or FoBa [Zha09]) and directly get an algorithm for SPCA with provable guarantees. Beyond the above, our algorithm also achieves exactly or close to state-of-the-art guarantees for SPCA. We elaborate on these in the next subsection. Our experimental results also indicate that we outperform or match existing algorithms.

2.1.1 Our contributions

We highlight some of our contributions below:

* We give a general and efficient procedure for transforming an SLR blackbox with prediction error guarantees into algorithms for hypothesis testing and support recovery for SPCA under the spiked covariance model. Most known sparse linear regression and sparse recovery algorithms can be used as this blackbox. In experiments, we demonstrate that using popular existing SLR algorithms such as Lasso [Tib96] and FoBa [Zha09] for the "blackbox" results in good performance.

* For hypothesis testing, we match the state-of-the-art provable guarantee for computationally efficient algorithms; our algorithm successfully distinguishes between isotropic and spiked Gaussian distributions as soon as the signal strength satisfies θ ≳ √(k² log d / n). This matches the phase transition of diagonal thresholding (DT) [JL09] and Minimal Dual Perturbation (MDP) [BR13b] up to constant factors.

* For support recovery: for general d and n, when each non-zero entry of u is at least Ω(1/√k) (a standard assumption in the literature), our algorithm succeeds with high probability for signal strength θ ≳ √(k² log d / n). In the scaling limit d/n → α as d, n → ∞, the recent covariance thresholding algorithm [DM14] theoretically succeeds at a signal strength that is an order of √(log d) smaller. However, our experimental results indicate that with an appropriate choice of blackbox, our Q algorithm outperforms covariance thresholding as well as diagonal thresholding.

* We also theoretically and empirically illustrate that our SPCA algorithm is robust to rescaling of the data, for instance by using a Pearson correlation matrix instead of a covariance matrix.¹

1 We remark that the idea of seeing whether an SPCA algorithm works on the correlation matrix was originally found in [VCLR13].

2.1.2 Previous algorithms

Various approaches to SPCA have been designed in an extensive list of prior work. The earliest algorithms, such as SCoTLASS [JTU03] and that of [ZHT06], were the first to impose an ℓ1 penalty on the coefficients of the components to induce sparsity, but no provable guarantees were given. In a different line of work, [JL09] used a simple procedure called diagonal thresholding (DT) to select a subset of variables with the largest variance, and then ran ordinary PCA on the reduced set of variables. Somewhat surprisingly, this simple algorithm matches the best guarantees for hypothesis testing and is nearly optimal for support recovery. [dEGJL07] first introduced a natural SDP relaxation for the problem, and since then this SDP has been used and analyzed in numerous settings.

Spiked covariance model As the most general SPCA problem is NP-hard, additional assumptions are needed in order to derive better guarantees. One popular and successfully analyzed distributional setting has been the spiked covariance model. We focus on the single spike model due to [Joh01].

In the spiked covariance model with r spikes, the data matrix X ∈ R^{n×d} is generated by the formula

X = V D U^T + Z,

where V is an n × r random effects matrix with i.i.d. N(0, 1) entries, D = diag(λ_1^{1/2}, ..., λ_r^{1/2}) with λ_1 ≥ ... ≥ λ_r > 0, U is d × r orthonormal, and Z has i.i.d. N(0, σ²) entries independent of V. Equivalently, X has rows independently drawn from N(0, Σ), where Σ = UΛU^T + σ²I_d and Λ = diag(λ_1, ..., λ_r). For simplicity, our presentation follows this normal assumption on the rows of X; however, we note that our analysis can easily be extended to the case where the rows of X are drawn from a subgaussian distribution of the form Σ^{1/2}Z, where Z has independent subgaussian entries. We focus on the single spike case, where U = u. We introduce a signal strength parameter θ, defined as θ = λ_1/σ². For the discussion below, we note that some works treat the signal strength θ as a variable parameter and compare it against a fixed threshold, while others fix θ to be a constant and analyze the required number of samples n as a function of the dimension d and the sparsity k. Below we review state-of-the-art guarantees for two different goals.
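To make the single-spike setting concrete, the following minimal NumPy sketch draws samples with covariance I_d + θuu^T using the X = √θ·vu^T + Z representation above. The function name and the choice of a uniform-magnitude spike with random signs are illustrative assumptions, not part of the thesis.

```python
import numpy as np

def sample_spiked(n, d, k, theta, seed=None):
    """Draw n samples from N(0, I_d + theta * u u^T) with a k-sparse unit spike u."""
    rng = np.random.default_rng(seed)
    # k-sparse spike with uniform magnitude and random signs on a random support
    u = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    u[support] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    # X = sqrt(theta) * v u^T + Z gives rows with covariance I_d + theta * u u^T
    v = rng.standard_normal((n, 1))
    Z = rng.standard_normal((n, d))
    X = np.sqrt(theta) * v @ u[None, :] + Z
    return X, u, support
```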

Support recovery The goal of support recovery is to find the support S of the spike u. Notice that if we exactly recover the support S, then we can recover u by finding the top eigenvector of the restricted covariance matrix Σ_{S,S}, because this puts us back in the low-dimensional or classical regime. In general, a lower bound on the entries of u is needed in order to guarantee successful recovery (see [FRG09, Wai07] for related lower bounds for sparse recovery). Under the spiked covariance model, for the subcase when the spike is uniform on all k coordinates, [AW09] analyzed both diagonal thresholding and the SDP for support recovery. They showed that the SDP requires an order of k fewer samples when the SDP optimal solution is rank one. However, [KNV13] showed that the rank-one condition does not happen in general, particularly in the regime approaching the information-theoretic limit (√n ≲ k ≲ n/log d). This is consistent with computational lower bounds from [BR13a] (k ≳ √n), but a small gap remains (diagonal thresholding and SDPs succeed only up to k ≲ √(n/log d)). The state of the art for support recovery that closes the above gap is the covariance thresholding algorithm, first suggested by [KNV13] and analyzed by [DM14], which succeeds in the regime √(n/log d) ≲ k ≲ √n, although the theoretical guarantee is limited to the regime where d/n → α due to relying on techniques from random matrix theory.

Hypothesis testing Some works [BR13b, AW09, dBEG14] have focused on the problem of detection. Here one only wants to distinguish between u = 0 and ||u||_2 = 1 (with u still k-sparse). In this case, [BR13b] observed that it suffices to work with the much simpler dual of the standard SDP, called Minimal Dual Perturbation (MDP). In the dual problem, the goal is to perturb the sample covariance matrix to minimize the maximum eigenvalue, subject to a penalty proportional to the entries of the perturbation and the sparsity level k. Diagonal thresholding (DT) and MDP work up to the same signal threshold θ as for support recovery, but MDP seems to outperform DT on simulated data [BR13b]. Note that MDP works at the same signal threshold as the standard SDP relaxation for SPCA.

[dBEG14] analyze a statistic based on an SDP relaxation and its approximation ratio to the optimal statistic. In the regime where k and n are proportional to d, their statistic succeeds at a signal threshold for θ that is independent of d, unlike the MDP. However, their statistic is quite slow to compute; the runtime is at least a high-order polynomial in d.

Regression based approaches Though some previous works ([ZHT06]) have used specific algorithms for SLR such as the Lasso as a subroutine, to the best of our knowledge our work is the first to give a general framework that uses SLR in a blackbox fashion, while matching state-of-the-art theoretical guarantees. A similar regression-based approach has been used in [MB06] as applied to a restricted class of graphical models. The goal there is to recover the neighborhood of each node in a graphical model; they do so by regressing the random variable corresponding to each node on the observations from the rest of the graph. While our regression setup is similar, their statistic is different and their analysis depends directly on the particulars of the Lasso. Further, their algorithm requires extraneous conditions on the data. In particular, their Assumption 5 on minimum partial correlation requires θ² ≳ k, compared to θ² ≳ k² log d / n in our work. [CMW13] also use a reduction to linear regression for their sparse subspace estimation, but their approach differs from ours in several ways. First, their algorithm depends crucially on a good initialization done by a diagonal thresholding-like pre-processing step, whereas our algorithm does not. This further implies that under rescaling of the data², their initialization fails. Second, their framework uses regression for the specific case of orthogonal design, whereas our design matrix can be more general as long as it satisfies a condition similar to the restricted eigenvalue condition. On the other hand, their setup allows for more general ℓ_q-based sparsity as well as the estimation of an entire subspace as opposed to a single component. [M+13] also achieves this more general setup; however, it also suffers from the first problem delineated above.

2 See Section 2.4.5 for more discussion on rescaling.

Sparsity inducing priors Connections between SPCA and SLR have been noted in the probabilistic setting, albeit in an indirect manner. At a high level, the same sparsity-inducing priors can be used for either problem.

[KKGP14] consider the problem of, given a base prior, finding another distribution closest to it in KL divergence ("information projection") while satisfying some constraints. They look at domain constraints (limiting the domain to a particular subset) in particular, and show that the desired optimal distribution is just the base distribution restricted to the subset and rescaled appropriately. Now, if we want to do information projection onto all distributions with a k-sparse support S, then it turns out that the cost function (KL divergence) is submodular in S, so one can achieve a (1 − 1/e)-approximation to the optimal objective. [KGPK15] look at the probabilistic formulation of PCA along with the EM algorithm for it. In the E-step, they optimize over a distribution Q that has sparse support. Since the E-step minimizes the KL divergence between the distribution of the latent variables (principal components) and Q, the technique from [KKGP14] can be readily applied. However, being based on EM, they can only guarantee local optimality. While the above line of work also highlights an intriguing relationship between the two problems, namely that sparsity-inducing priors can be applied to both SPCA and SLR, we view our work as different since we focus on giving a blackbox reduction from one problem to the other. Furthermore, provable guarantees for the EM algorithm/variational method are lacking in general, and it is not immediately clear what signal threshold their algorithm achieves for the single spike covariance model.

SLR background In linear regression, we observe a response vector y ∈ R^n and a design matrix X ∈ R^{n×d} that are linked by the linear model y = Xβ* + w. The vector w ∈ R^n is some form of observation noise, and our goal is to recover β* given the noisy observations y. We focus on the standard Gaussian model, where the entries of w are i.i.d. N(0, σ²). We also work with deterministic design; while the matrices X we consider arise from a (random) correlated Gaussian design (as analyzed in [Wai07], [Wai09]), it will make no difference to assume the matrices are deterministic (by conditioning). Most of the relevant results on sparse linear regression pertain to deterministic design. Analogous to the setting for PCA, linear regression in the high-dimensional setting is meaningless without further constraints. In general, when n < d, the system is under-determined and there is a whole subspace of solutions minimizing the reconstruction error. In sparse linear regression, we additionally assume that β* is sparse, i.e., has only a small number, k ≪ d, of non-zero entries. This makes the problem well posed in the high-dimensional setting, though computationally more challenging. Beyond mathematical necessity, sparsity has been found to be a very suitable assumption, as real-world signals are often sparse in an appropriate basis; that is, the intrinsic dimensionality of the data is often much lower than the dimensionality of the original dataset. Commonly used performance measures for SLR are tailored to prediction error, support recovery (recovering the support of β*), or parameter estimation (estimating β* under some norm). We focus on the prediction error, defined as (1/n)||Xβ* − Xβ̂||², and analyzed over random realizations of the noise.

The ℓ0 estimator, which minimizes the reconstruction error ||y − Xβ||² over all k-sparse regression vectors β, achieves a prediction error bound of the form ([BTW07a], [RWY11]):

(1/n)||Xβ* − Xβ̂||² ≲ σ² k log d / n.

The runtime of this estimator is O(nd^k), which is intractable both in theory and in practice as soon as k is larger than a constant.
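For illustration, here is a hedged sketch of the ℓ0-constrained least-squares estimator just described: it enumerates all (d choose k) supports and keeps the best least-squares fit, which is exactly the source of the d^k-type runtime. The function name is hypothetical, and the code is meant only to make the enumeration explicit.

```python
import numpy as np
from itertools import combinations

def l0_least_squares(y, X, k):
    """Exhaustive search over all (d choose k) supports: one least-squares fit each."""
    n, d = X.shape
    best_err, best_beta = np.inf, np.zeros(d)
    for S in combinations(range(d), k):
        cols = list(S)
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        err = np.sum((y - X[:, cols] @ coef) ** 2)
        if err < best_err:
            best_beta = np.zeros(d)
            best_beta[cols] = coef
            best_err = err
    return best_beta
```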

Efficient methods Various efficient methods have been proposed to circumvent the computational intractability of the above estimator: basis pursuit, the Lasso [Tib96], and the Dantzig selector [CT07] are some of the initial approaches. Greedy pursuit methods such as OMP [MZ93], IHT [BD09], CoSaMP [NT09], and FoBa [Zha09], among others, offer more efficient alternatives.³ Many of the optimization-based approaches relax the ℓ0 penalty to some form of ℓ1 penalty or an equivalent constraint. These algorithms achieve the same prediction error guarantee as ℓ0 up to a constant, but under the assumption that X satisfies certain properties, such as restricted eigenvalue ([BRT09]), compatibility ([vdG07]), the restricted isometry property ([CT05]), and (in)coherence ([BTW07b]). In this work, we focus on the restricted eigenvalue condition (see Definition 3 for a formal definition). We remark that restricted eigenvalue is among the weakest of these conditions, and is only slightly stronger than the compatibility condition. Moreover, [ZWJ14] give complexity-theoretic evidence for the necessity of the dependence on the RE constant for certain worst case instances of the design matrix. See [VDGB+09] for implications between the various conditions. In the next subsection, we give more intuition for some of these conditions.

3 Note that some of these algorithms were presented for compressed sensing; nonetheless, their guarantees can be converted appropriately.

Slow rate Without such conditions on X, the best known guarantees provably obtain only a 1/√n decay rather than a 1/n decay in prediction error as the number of samples increases. [ZWJ15] give some evidence that this gap may be unavoidable by showing that the family of M-estimators based on minimizing the sum of a squared loss and a coordinate-wise decomposable regularizer cannot achieve a rate better than 1/√n.

Optimal estimators The SLR estimators we consider are efficiently computable. Another line of work considers arbitrary estimators that are not necessarily efficiently computable. These include BIC [BTW07a], Exponential Screening [RT11], and Q-aggregation [DRXZ14]. Such estimators achieve strong guarantees regarding minimax optimality in the form of oracle inequalities on MSE.

Notation Capital letters such as X are used to denote matrices, and lowercase letters such as y denote vectors. For clarity, we reserve X for the data matrix in SPCA and X for the design matrix in SLR. X_i is the ith column of X, and X_{−i} is the submatrix obtained by deleting the ith column from X. Similarly, u_i denotes the ith coordinate of u and u_{−i} ∈ R^{d−1} is u with the ith coordinate removed.

Σ_{S,T} is Σ = E[XX^T] restricted to rows in S and columns in T; if S = T, we abbreviate it as Σ_S. For example, Σ_{2:d} is Σ restricted to coordinates 2, ..., d. All vector norms are 2-norms unless specified otherwise. We use C, C′ to denote constants that may change from line to line.

Organization The rest of this chapter is organized as follows. In Section 2.2, we give precise formulations of both problems that will be used in the rest of the chapter. We also set up the linear model for data drawn from the spiked covariance distribution. In Section 2.3, we state our algorithms and main theorems. In Section 2.4, we give the theoretical analysis and discussion. In Section 2.5, we present an empirical evaluation of our algorithms. In Section 2.6, we conclude with some future directions.

2.2 Preliminaries

We give the precise setup for SPCA and SLR that we will study, and introduce the linear model for our data generated from the spiked covariance model that will be subsequently used in defining our statistic and algorithms.

2.2.1 Problem formulation for SPCA

Hypothesis testing Let X^(1), X^(2), ..., X^(n) be n i.i.d. copies of a Gaussian random variable X in R^d. Let X ∈ R^{n×d} be the matrix whose rows are the X^(i). The objective of the SPCA detection problem is to determine whether there is some distinguished (sparse) direction u along which X has higher variance. In SPCA, we also assume that this direction is sparse. This motivates the following null and alternate hypotheses:

H_0: X ~ N(0, I_d)    and    H_1: X ~ N(0, I_d + θuu^T),

where u has unit norm and at most k nonzero entries. The distribution under H_1 is known as the spiked covariance model. As a smaller θ makes the problem only harder, we assume θ ≤ 1 for ease of computation, as is standard in the literature.

We say that a test discriminates between H_0 and H_1 with probability 1 − δ if both the type I and type II errors have probability smaller than δ. The goal is therefore to find a statistic φ(X) and a threshold τ depending on d, n, k, δ such that, for the test ψ(X) = 1{φ(X) > τ},

P_{H_0}(ψ(X) = 1) ≤ δ    and    P_{H_1}(ψ(X) = 0) ≤ δ.

We assume the following additional condition on the spike u.

Assumption 1. c_min/k ≤ u_i² ≤ 1 − c_min for at least one i ∈ [d], where 0 < c_min < 1 is some constant.

While in general we always have at least one i ∈ [d] such that u_i² ≥ 1/k, this is not enough for our regression setup, since we want at least one other coordinate j to have sufficient correlation with coordinate i. We remark that the above condition is a very mild technical condition. If it were violated, almost all of the mass of u would be on a single coordinate, so a simple procedure for testing the variance (which is akin to diagonal thresholding) would suffice. Furthermore, for u drawn uniformly at random from S^{k−1}, we in fact expect a constant fraction of the coordinates to have mass at least 1/(2k), in which case the above assumption is immediately satisfied.

Support recovery The goal of support recovery is to identify the support of u when the X_i are drawn from the spiked distribution under H_1. More precisely, we say that a support recovery algorithm succeeds if the recovered support Ŝ equals S, the support of u. As is standard in the literature [AW09, MB06], we need to assume a minimal bound on the size of the entries of u on the support. Though the settings are a bit different, this minimal bound is also consistent with lower bounds known for sparse recovery. These lower bounds ([FRG09, Wai07]; the bound of [FRG09] is a factor of k weaker) imply that the number of samples (or measurements, in their language) must grow roughly as n ≳ k log d when u_min, the smallest entry of our signal u, is of order 1/√k. For our support recovery algorithm, we will make the following assumption (note that it implies Assumption 1 and is much stronger):

Assumption 2. |u_i| ≥ c_min/√k for some constant 0 < c_min ≤ 1, for all i ∈ S.

This is nearly optimal in comparison to the lower bounds mentioned above. If the smallest entry is smaller by a factor of some constant C, then the signal strength θ needs to be stronger by a factor of C for our recovery algorithm to succeed, which is consistent with the lower bounds.

Unknown sparsity Note that throughout this chapter we assume that the sparsity level k is known. However, if k is unknown, standard techniques could be used to adaptively find an approximate value of k. For hypothesis testing, for instance, we can start with an initial overestimate k′ and keep halving it until we get coordinates i with Q_i passing the threshold for the given k′ (a sketch of such a loop is given below).
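The following rough sketch illustrates such a halving loop. It assumes access to a routine q_statistics(X, k, slr) returning the vector of Q_i values (defined in Section 2.3) and an SLR blackbox slr; both names are hypothetical, and the stopping rule is only illustrative, not the thesis's exact procedure.

```python
import numpy as np

def adaptive_sparsity(X, k_max, q_statistics, slr):
    """Halve an overestimate k' until some coordinate passes the threshold for that k'."""
    n, d = X.shape
    k = k_max
    while k >= 1:
        Q = q_statistics(X, k, slr)          # assumed helper: vector of Q_i values
        if np.any(Q > 13 * k * np.log(d) / n):
            return k                         # some coordinate passes at this sparsity level
        k //= 2
    return None                              # no level passed: consistent with H_0
```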

2.2.2 Problem formulation for SLR

We are given (y, X), where y ∈ R^n and X ∈ R^{n×d} are linked by the linear model y = Xβ* + w. The entries of the vector w ∈ R^n are i.i.d. N(0, σ²), and our goal is to compute β̂ that minimizes the prediction error (or MSE) (1/n)||Xβ* − Xβ̂||². We define the restricted eigenvalue constant, which will be important in our analysis. Many variants exist in the literature; below, we give a definition from [ZWJ14].

Definition 3. First define the cone

C(S) = {β ∈ R^d : ||β_{S^c}||_1 ≤ 3 ||β_S||_1},

where S^c denotes the complement of S and β_T is β restricted to the subset T. The restricted eigenvalue (RE) constant of X, denoted γ(X), is defined as the largest constant γ such that

(1/n)||Xβ||² ≥ γ ||β||²    for all β ∈ ∪_{|S|=k, S⊆[d]} C(S).

Blackbox condition We now define the condition we require on our SLR blackbox, which is invoked as SLR(y, X, k). This is similar to the guarantees achieved by known results for SLR.

Condition 4 (Condition A). Let γ(X) denote the restricted eigenvalue of X. There are universal constants c, c′, c″ such that SLR(y, X, k) outputs β̂ that is k-sparse and satisfies

(1/n)||Xβ̂ − Xβ*||² ≤ c σ² k log d / (γ(X) n)    for all β* ∈ B_0(k),

with probability at least 1 − c′ exp(−c″ k log d).
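As a concrete, purely illustrative way to instantiate such a blackbox, one could run the Lasso, truncate to the k largest coefficients, and refit least squares on the selected support; this mirrors the thresholded-Lasso variant used in the experiments of Section 2.5, but the regularization choice below is an assumption of this sketch, not a prescription of the thesis.

```python
import numpy as np
from sklearn.linear_model import Lasso

def slr_blackbox(y, X, k, lam=None):
    """One possible SLR(y, X, k): Lasso, keep the k largest coefficients, then refit."""
    n, d = X.shape
    if lam is None:
        lam = np.sqrt(np.log(d) / n)        # illustrative regularization level
    beta = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y).coef_
    top_k = np.argsort(np.abs(beta))[-k:]   # truncate to a k-sparse support
    beta_hat = np.zeros(d)
    coef, *_ = np.linalg.lstsq(X[:, top_k], y, rcond=None)
    beta_hat[top_k] = coef                  # least-squares refit on the selected support
    return beta_hat
```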

32 We first discuss the underlying linear structure in the data generated from the spiked covariance model. The specifics of this linear structure will be used in defining our statistic and algorithms. We then state and prove the guarantees for our algorithms.

2.2.3 The linear model

We now set up linear regression for the data X from the SPCA problem. One natural way to apply regression to our samples X from SPCA is to regress one column (coordinate) on the remaining columns. More formally, let X_{−i} denote the matrix of samples in the SPCA model with the ith column removed. For each column i, let us take as input to the blackbox SLR the design matrix X = X_{−i} and the response variable y = X_i.

Under the alternative hypothesis H_1, if i ∈ S, then X_i is correlated with X_j for j ∈ S, j ≠ i. Using properties of multivariate Gaussians, we can write y = Xβ* + w, where

β* = (θ u_i / (1 + θ(1 − u_i²))) u_{−i}    and    w ~ N(0, σ² I_n)  with  σ² = 1 + θ u_i² − θ² u_i² (1 − u_i²) / (1 + θ(1 − u_i²)).

By the theory of LMMSE, this β* minimizes the expected squared prediction error. (See Appendices B.1 and B.2 for the details of this calculation.) If i ∉ S, and for any i ∈ [d] under the null hypothesis, y = w where w = X_i ~ N(0, I_n) (implicitly β* = 0).

Because the population covariance Σ = E[XX^T] has minimum eigenvalue 1, with high probability the sample design matrix X has a constant restricted eigenvalue given enough samples n (see Appendix B.3 for more details), and the prediction error guarantee of Condition 4 will be good enough for our analysis.

Though the dimension and the sparsity of our SLR instances are d − 1 and k − 1 (since we remove one column from the SPCA data matrix X to obtain the design matrix X), for ease of exposition we just use d and k in their place, since this only affects our analysis up to small constant factors.
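The closed forms for β* and σ² above can be checked numerically from the population covariance Σ = I_d + θuu^T (the thesis's derivation is in Appendix B.2); the small script below is only a sanity check of those formulas, with an arbitrary dense spike chosen for illustration.

```python
import numpy as np

theta, d, i = 0.8, 6, 0
u = np.full(d, 1 / np.sqrt(d))             # simple dense unit spike for illustration
Sigma = np.eye(d) + theta * np.outer(u, u)

rest = [j for j in range(d) if j != i]
# population least-squares coefficients and residual variance (LMMSE)
beta_star = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, i])
sigma2 = Sigma[i, i] - Sigma[i, rest] @ beta_star

ui2 = u[i] ** 2
beta_closed = theta * u[i] * u[rest] / (1 + theta * (1 - ui2))
sigma2_closed = 1 + theta * ui2 - theta**2 * ui2 * (1 - ui2) / (1 + theta * (1 - ui2))

assert np.allclose(beta_star, beta_closed)
assert np.isclose(sigma2, sigma2_closed)
```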

33 2.3 Algorithms and main results

2.3.1 Intuition of test statistic

Consider a matrix X of samples generated from the single spiked covariance model. The intuition behind the algorithm is that if i is in the support of the spike, then the rest of the support should provide a nontrivial prediction for X_i, since variables in the support are correlated. Conversely, for i not in the support (or under the isotropic null hypothesis), all of the variables are independent, and the other variables are useless for predicting X_i. So we regress X_i onto the rest of the variables, and our goal is to measure the reduction in noise. How much predictive power do we gain by using X_{−i}? The linear minimum mean-square-error (LMMSE) estimate⁴ of X_i conditioned on X_{−i} (when i is on the support) turns out to put approximately θ/k weight on all the other coordinates on the support.⁵ A calculation shows that the variance in X_i is reduced by approximately θ²/k. We want to measure this reduction in noise to detect whether i is on the support or not. Suppose for instance that we have access to β* rather than β̂ (note that this is not possible in practice since we do not know the support!). Since we want to measure the reduction in noise when the variable is on support, as a first step we might employ the following statistic:

Q̃_i = −(1/n)||y − Xβ*||².

4 See Appendix B.1 for more details.
5 For illustrative purposes, we consider the case where u is uniform on all k coordinates of the support.

Unfortunately this statistic will not be able to distinguish the two hypotheses, as the reduction in LMMSE is minuscule (on the order of θ²/k, compared to an overall variance of order 1 + θ), so deviation due to random sampling will mask the reduction in noise.

We can fix this by adding the variance term (1/n)||y||²:

Q_i = (1/n)||y||² − (1/n)||y − Xβ*||².

Notice that since y = Xβ* + w, the noise term ||w||² cancels out nicely. This effectively shifts the mean of the statistic, and now we are left with a statistic that is close to 0 under H_0 and is larger by about θ²/k under H_1, so distinguishing using this statistic is more effective. On a more intuitive level, including ||y||² allows us to measure the relative gain in predictive power without being penalized by a possibly large variance in y. Fluctuations in y due to noise will typically be canceled out in the difference of terms in Q_i, minimizing the variance of our statistic.

We have to add one final fix to the above estimator. We obviously do not have access to β*, so we must use the estimate β̂ = SLR(y, X, k) (y, X are as defined in Section 2.2.3) which we get from our blackbox. The bulk of the analysis is showing that this substitution does not affect much of the discriminative power of Q_i. This gives our final statistic:

Q_i = (1/n)||y||² − (1/n)||y − Xβ̂||².

2.3.2 Algorithms

Below we give two algorithms based on the Q statistic, one for hypothesis testing and one for support recovery:

Algorithm 1: Q-hypothesis testing
Input: X ∈ R^{n×d}, k
Output: ψ
for i = 1, ..., d do
    β̂_i = SLR(X_i, X_{−i}, k)
    Q_i = (1/n)||X_i||² − (1/n)||X_i − X_{−i}β̂_i||²
    if Q_i > 13 k log d / n then
        return ψ = 1
    end if
end for
return ψ = 0

Algorithm 2: Q-support recovery
Input: X ∈ R^{n×d}, k
Output: Ŝ
Ŝ = ∅
for i = 1, ..., d do
    β̂_i = SLR(X_i, X_{−i}, k)
    Q_i = (1/n)||X_i||² − (1/n)||X_i − X_{−i}β̂_i||²
    if Q_i > 13 k log d / n then
        Ŝ := Ŝ ∪ {i}
    end if
end for
return Ŝ
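A minimal Python rendering of Algorithms 1 and 2 is given below, assuming X ∈ R^{n×d} with rows as samples and any SLR(y, X, k) blackbox satisfying Condition 4 (for instance, the Lasso-based sketch after Condition 4 in Section 2.2.2). Function names are hypothetical; this is a sketch of the pseudocode above, not the thesis's experimental code.

```python
import numpy as np

def q_statistics(X, k, slr):
    """Q_i = ||X_i||^2/n - ||X_i - X_{-i} beta_hat||^2/n for every coordinate i."""
    n, d = X.shape
    Q = np.zeros(d)
    for i in range(d):
        y = X[:, i]
        X_rest = np.delete(X, i, axis=1)
        beta_hat = slr(y, X_rest, k)
        Q[i] = (np.sum(y**2) - np.sum((y - X_rest @ beta_hat) ** 2)) / n
    return Q

def q_hypothesis_test(X, k, slr):
    n, d = X.shape
    return int(np.any(q_statistics(X, k, slr) > 13 * k * np.log(d) / n))

def q_support_recovery(X, k, slr):
    n, d = X.shape
    return np.flatnonzero(q_statistics(X, k, slr) > 13 * k * np.log(d) / n)
```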

35 Below we summarize our guarantees for the above algorithms.

Theorem 5 (Hypothesis test). Given SLR that satisfies Condition 4 with runtime T(d, n, k) per instance, and given Assumption 1, there exist universal constants c_1, c_2, c_3, c_4 such that if θ² ≥ c_1 k² log d / (c_min² n) and n ≥ c_2 k log d, then Algorithm 1 outputs ψ such that

P_{H_0}(ψ(X) = 1) ∨ P_{H_1}(ψ(X) = 0) ≤ c_3 exp(−c_4 k log d)

in time O(dT + d²n).

Theorem 6 (Support recovery). Under the same condition on SLR and given Assumption 2, if θ² ≥ c_1 k² log d / (c_min² n) and n ≥ c_2 k log d, then Algorithm 2 above finds Ŝ = S with probability at least 1 − c_3 exp(−c_4 k log d) in time O(dT + d²n).

Remark 7. Though both guarantees involve bounding the signal strength θ in terms of c_min, Assumption 2 on u in Theorem 6 is much stronger, as all entries in the support of u need to be minimally bounded for Assumption 2 to hold.

2.4 Analysis

In this section we analyze the distribution of Q_i under both H_0 and H_1 on our way to proving Theorems 5 and 6.

2.4.1 Analysis of Q_i under H_1

Without loss of generality assume the support of u, denoted S, is {1, ..., k}, and consider the first coordinate. We expand Q_1 by using y = Xβ* + w as follows:

Q_1 = (1/n)||y||² − (1/n)||y − Xβ̂||²
    = (1/n)||Xβ* + w||² − (1/n)||Xβ* + w − Xβ̂||²
    = (1/n)||Xβ*||² + (2/n) w^T Xβ* − (1/n)||Xβ* − Xβ̂||² − (2/n) w^T (Xβ* − Xβ̂).

Observe that the noise term ||w||² cancels conveniently.

Before bounding each of these four terms, we introduce a useful lemma to bound the cross terms involving the noise w.

Lemma 8 (Lemmas 8 and 9, [RWY11]). For any fixed X ∈ R^{n×d} and independent noise vector w ∈ R^n with i.i.d. N(0, σ²) entries,

(1/n)|w^T Xβ| ≤ 9σ √(||Xβ||²/n) √(k log(d/k)/n)    for all β ∈ B_0(2k),

with probability at least 1 − 2 exp(−40 k log(d/k)).

We bound each term as follows:

Term 1. The first term contains the signal from the spike; notice its resemblance to the k-sparse eigenvalue statistic. Rewritten another way,

(1/n)||Xβ*||² = (β*)^T (X^T X / n) β*.

Hence, we expect this to concentrate around (β*)^T Σ_{2:d} β*, which simplifies to (see Appendix B.2 for the full calculation):

(β*)^T Σ_{2:d} β* = Σ_{1,2:d} Σ_{2:d}^{-1} Σ_{2:d,1} = θ² u_1² (1 − u_1²) / (1 + (1 − u_1²) θ).

For concentration, observe that we may rewrite

(β*)^T (X^T X / n) β* = (1/n) Σ_{i=1}^n (X^{(i)} β*)²,

where X^{(i)} is the ith row, representing the ith sample. This is just an appropriately scaled chi-squared random variable with n degrees of freedom (since each X^{(i)}β* is i.i.d. normal), and the expected value of each term in the sum is the same as computed above. Applying a lower tail bound on the χ² distribution (see Appendix B.4), with probability at least 1 − δ we have

(1/n)||Xβ*||² ≥ (β*)^T Σ_{2:d} β* · (1 − 2√(log(1/δ)/n)).

Choosing δ = exp(−k log d),

(1/n)||Xβ*||² ≥ [θ² u_1² (1 − u_1²) / (1 + (1 − u_1²)θ)] (1 − 2√(k log d / n))
            ≥(a) (1/2) · θ² u_1² (1 − u_1²) / (1 + (1 − u_1²)θ)
            ≥(b) C θ² / k,                                             (2.1)

where (a) holds as long as n ≥ 16 k log d, and (b) holds since θ ≤ 1 and u_1²(1 − u_1²) ≥ c_min²/k under Assumption 1 (here C depends on c_min).

Term 2. The absolute value of the second term, (2/n)|w^T Xβ*|, can be bounded by 18σ √(||Xβ*||²/n) √(k log(d/k)/n) using Lemma 8. From (2.1), as long as θ² ≥ C k² log d / n, we have

(1/n)||Xβ*||² ≥ C′ θ²/k ≥ C″ k log d / n,

so the first two terms together are lower bounded by

(1/n)||Xβ*||² − 18σ √(||Xβ*||²/n) √(k log(d/k)/n) ≥ C (1/n)||Xβ*||² ≥ C θ²/k,    (2.2)

a constant fraction of the first term.

Term 3. The third term, which is the prediction error (1/n)||Xβ* − Xβ̂||², is upper bounded by C σ² k log d / (γ(X) n) with probability at least 1 − C exp(−C′ k log d) by Condition 4 on our SLR blackbox. Note that σ² ≤ 2, as we assume θ ≤ 1 (see Section 2.2.3). Moreover, γ(X) is lower bounded by a positive constant with probability at least 1 − C exp(−C′ n) if n ≥ C″ k log d, again since θ ≤ 1 (see Appendix B.3 for more details). Then

(1/n)||Xβ* − Xβ̂||² ≤ C k log d / n.

Term 4. The contribution of the last cross term, (2/n) w^T X(β* − β̂), can also be bounded by Lemma 8 w.h.p. (note that β* − β̂ ∈ B_0(2k)):

(1/n)|w^T X(β* − β̂)| ≤ 9σ √(||X(β* − β̂)||²/n) √(k log(d/k)/n).

Combined with the above bound on the prediction error, this bounds the cross term's contribution by at most C k log d / n.

Putting the bounds on the four terms together, we get the following lower bound on Q_1.

Lemma 9. There exist constants c_1, c_2, c_3, c_4 such that if θ² ≥ c_1 k² log d / n and n ≥ c_2 k log d, then with probability at least 1 − c_3 exp(−c_4 k log d), for any i ∈ S that satisfies the size bound in Assumption 1,

Q_i > 13 k log d / n.

Proof. From Terms 1-4 above, by a union bound, all four bounds fail to hold with probability at most c_3 exp(−c_4 k log d) for appropriate constants, provided θ² ≥ c_1 k² log d / n (required by Term 2) and n ≥ c_2 k log d for some c_2 > 0 (note that both Terms 1 and 3 require a sufficient number of samples n). That is, we have

Q_i ≥ C θ²/k − C′ k log d / n.

So if c_1 is sufficiently large, the above bound is greater than 13 k log d / n. □

2.4.2 Analysis of Q_i under H_0

We could proceed by decomposing Q_i the same way as under H_1; all the error terms, including the prediction error, are still bounded by O(k log d / n) in magnitude, and the signal term is gone now since β* = 0. This would give the same upper bound (up to a constant) as the following proof is about to show. However, we find the following direct analysis more informative and intuitive. Since our goal is to upper bound Q_i under H_0, we may let β̂ be the optimal possible choice given y and X (one that minimizes ||y − Xβ̂||², and hence maximizes Q_i). We further break this into two steps. We enumerate over all possible subsets S of size k, and conditioned on each S, choose the optimal β̂. Fix some support S of size k. The span of X_S is at most a k-dimensional subspace of R^n. Hence, we can consider some unitary transformation U of R^n that maps the span of X_S into the subspace spanned by the first k standard basis vectors. Since U is an isometry by definition,

n Q_i = ||y||² − ||y − Xβ̂_S||² = ||Uy||² − ||Uy − UXβ̂_S||².

Let ỹ = Uy. Since UXβ̂_S has nonzero entries only in the first k coordinates, the optimal choice (in the sense of maximizing the above quantity) of β̂_S is to choose linear combinations of the first k columns of X so that UXβ̂_S equals the first k coordinates of ỹ. Then nQ_i is just the squared norm of the first k coordinates of ỹ. Since U is some unitary matrix that is independent of y (being a function of X_S, which is independent of y), ỹ still has i.i.d. N(0, 1) entries, and hence nQ_i is a χ² random variable with k degrees of freedom. Now we apply an upper tail bound on the χ² distribution (see Appendix B.4). Choosing t = 3k log(d/k), and union bounding over all (d choose k) supports S, we get that nQ_i ≥ k + 12k log(d/k), i.e., Q_i ≥ 13k log(d/k)/n, with probability at most exp(−3k log(d/k) + k log(ed/k)) ≤ exp(−k log(d/k)) if d/k ≥ e.

Lemma 10. Under H_0, for every i, Q_i ≤ 13 k log(d/k) / n with probability at least 1 − exp(−k log(d/k)).

Remark 11. Union bounding over all S is necessary for the analysis. For instance, we cannot just fix S to be S(β̂) (the support of β̂), since β̂ is a function of y, so fixing S would change the distribution of y.

Remark 12. Observe that this analysis of Q_i under H_0 also extends immediately to H_1 when coordinate i is outside the support. The reason the analysis cannot extend to i ∈ S is that U is not independent of y in that case.

Corollary 13. Under H_1, if i ∉ S, then Q_i ≤ 13 k log(d/k) / n with probability at least 1 − exp(−k log(d/k)).
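The χ²_k behavior of nQ_i under H_0 (with β̂ taken as the best k-sparse least-squares fit, as in the argument above) is easy to check by simulation at small scale. The script below, with illustrative parameter choices, estimates how often Q_1 exceeds the 13 k log(d/k)/n threshold under isotropic data; it is a sanity check only, not part of the thesis.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, k = 200, 20, 2
threshold = 13 * k * np.log(d / k) / n

def best_k_sparse_fit_error(y, X, k):
    """Smallest residual sum of squares over all k-column subsets of X."""
    best = np.sum(y**2)
    for S in combinations(range(X.shape[1]), k):
        cols = list(S)
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        best = min(best, np.sum((y - X[:, cols] @ coef) ** 2))
    return best

trials, exceed = 200, 0
for _ in range(trials):
    X = rng.standard_normal((n, d))        # H_0: isotropic Gaussian data
    y, X_rest = X[:, 0], X[:, 1:]
    Q = (np.sum(y**2) - best_k_sparse_fit_error(y, X_rest, k)) / n
    exceed += Q > threshold
print("fraction of H_0 trials with Q_1 above 13 k log(d/k)/n:", exceed / trials)
```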

40 2.4.3 Proof of Theorem 5

Proof. The proof follows immediately from Lemma 10 and Lemma 9. We use our estimators Q_i to separate H_0 and H_1. Under H_0, applying Lemma 10 to each coordinate i and union bounding, Q_i ≤ 13 k log d / n for all i with probability at least 1 − exp(−C k log d). Meanwhile, under H_1, if we consider any coordinate i that satisfies Assumption 1, Lemma 9 gives

Q_i > 13 k log d / n

with probability at least 1 − c_3 exp(−c_4 k log d). Since ψ tests whether Q_i > 13 k log d / n for at least one i, ψ distinguishes H_0 and H_1 successfully, with type I and type II error probabilities bounded by c_3 exp(−c_4 k log d) for appropriate constants c_3, c_4 (note that these may be different from those of Lemma 9). For the runtime, note that we make d oracle calls to SLR and work with matrices of size n × d. □

2.4.4 Proof of Theorem 6

Proof. As long as every u_i for i ∈ S has magnitude at least c_min/√k, as in Assumption 2, we can repeat the analysis above for all coordinates in the support. If θ meets the same threshold, Q_i > 13 k log d / n for all i ∈ S with probability at least 1 − C exp(−C′ k log d) by a union bound. Also, recall that Q_i > 13 k log d / n for any i ∉ S with probability at most C exp(−C′ k log d) by Corollary 13. By a union bound over all d − k coordinates outside the support, the error probability is at most d · C exp(−C′ k log d) ≤ C exp(−C″ k log d). We have shown that with high probability we exactly recover the support S of u. The runtime analysis is identical to that for the hypothesis test. □

2.4.5 Discussion

Running time The runtime of both Algorithms 1 and 2 is O(nd²),⁶ if we assume the SLR blackbox takes time nearly linear in the input size, O(nd), which is achieved by known existing algorithms. This seems a bit expensive at first, but computing the sample covariance matrix alone takes O(nd²) time.⁷ For a broad comparison, we consider spectral methods and SDP-based methods, though there are methods that do not fall in either category. Spectral methods such as covariance thresholding or the truncated power method have an iteration cost of O(d²) due to operating on d × d matrices, and hence have total running time Õ(d²) (Õ(·) hiding the precise convergence rate), in addition to the same O(nd²) initialization time. SDP-based methods in general take O(d³) time, the time taken by interior point methods to optimize. So overall, Algorithms 1 and 2 are competitive choices for (single spiked) SPCA, at least theoretically.

6 In what follows O(·) hides possible log and accuracy parameter ε factors.
7 Assuming one is using a naive implementation of the matrix multiplication.

Alternate blackbox The above algorithms seem rather wasteful because there is a lot of overlapping information between the different O's we get for the Qi's on support. For instance, it is plausible that / contains a good fraction of the entries in the support if the coordinate we are regressing on happens to be on support. In such case, it is unnecessary to compute Qj's for the j's we already are confident to be on support. We may be able to utilize such information more easily if instead of prediction error we consider support recovery or parameter estimation (say in f2-norm) guarantees for our SLR blackbox.

Robustness of Q statistic to rescaling

A natural and simple way to make diagonal thresholding fail is to rescale all the variables so that their variance is equal. Intuitively, we expect our algorithms based on Q to be robust to rescaling, since it should be possible to predict one variable in the support from the others in the support even after some rescaling. We can more precisely justify this intuition as follows.. Let X +- DX be the rescaling of

7 Assuming one is using naive implementation of .

42 X, where D is some diagonal matrix. Let Ds be D restricted to rows and columns in S. Note that Z, the covariance matrix of the rescaled data, is just DED by expanding the definition.

Similarly, note 52:d = D1D2:dZ2:d, where D2:d denotes D without row and column 1. Now, recall the term which dominated our analysis of Qi under HI, (*)TE 2 :d/*, which was equal to

Z1,2:dE2 A2:,1

We replace the covariances by their rescaled versions to obtain:

* 5* (Dlyl, :dD :d)D 2 2 7D-(D2:dE2:d,1D1) = D - (/*)TE2 :d/*

For the spiked covariance model, rescaling variances to one amount to rescaling with Di = 1 . Thus, we see that our signal strength is affected only by constant factor (assuming 0 < 1).

We should note though that after normalizing variances, the variance term ||yJ|2 loses its effect in the Q statistic, and Q is essentially affected by just the reconstruction error

||y - X0^1| . This robustness to rescaling is an attractive property because intuitively, our algorithms for detecting correlated structure in data should be invariant to rescaling of data; the precise scale or units for which one variable is measured should not have an impact on our ability to find meaningful structure underlying the data.

2.5 Experiments

On randomly simulated synthetic data we demonstrate the performance of our algorithm compared to other existing algorithms for SPCA. The code was implemented in Python using standard libraries. We refer to both hypothesis and support recovery variants of our algorithm from Section 2.3 as Q.

43 2.5.1 Support recovery

We randomly generate a spike u by choosing uniformly among all k-sparse vectors that are uniform on all coordinates (with random signs). In order for comparison with the work of [DM14], we use the same parameter setting of n = d. We study how the performance of four algorithms (diagonal thresholding, covariance thresholding, Q with thresholded Lasso with A = 0.1, and Q with FoBa with c = 0.1) vary over various values of k for fixed n = d. For covariance thresholding, we tried various levels of their parameter T and indeed it performed best at [DM141's recommended value of T ~ 4, which is what is shown. We modified each algorithm to return the top k most likely coordinates in the support (rather than thresholding based on a cutoff), and we count the fraction of planted support recovered. This is averaged over T = 50 trials. On the horizontal axis we measure k/v5; our metric on the vertical axis is the fraction of support correctly recovered. We observe that across almost all regimes

n =d =6j2 ,=d =1250 1,2 -DT 1.2 - DT - Q-Lasso - Q-Lasso 1.0 - -Q-FoBa 1.0 . -. Q-FoBa

0.8 - 0.8 ~0.6- t~ N0.6

a a 0.2 0 .

-0.2 0.2

0.0 0.0

-0.2 -0.2

0.2 04 0.6 0.8 1.0 1.2 1.4 1.6 1-8 2.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 k?, W k,/ 7

Figure 2-1: Performance of diagonal thresholding (DT), covariance thresholding (CT), and Q for support recovery at n = d = 625, 1250, varying values of k, and 0 = 4

of k both versions of Q algorithms outperform covariance thresholding. It is an interesting question to investigate whether the log d factor in our analysis can be removed. Diagonal thresholding is outperformed by all other methods across most values of k.

44 2.5.2 Hypothesis testing

Here we instead generate a spike u by sampling a uniformly random direction from the k-dimensional unit sphere, and embedding the vector at a random subset of k coordinates among d coordinates. For hypothesis testing, in a single trial, we compute various statistics (diagonal thresholding (DT), Minimal Dual Perturbation (MDP), and Q) after drawing n samples from .A(O, I+ OuuT). We repeat for T = 50 trials, and plot the resulting empirical distribution for each statistic. We observe similar performance of DT and Q, while MDP seems slightly more effective at distinguishing Ho and H1 at the same signal strength (that is, the distributions of the statistics under Ho vs. H1 are more well-separated).

Rescaling variables As discussed in Section 2.4.5, our algorithms are robust to rescaling the covariance matrix to the correlation matrix. As illustrated in Figure 2-2 (right), DT fails while Q appears to be still effective for distinguishing hypotheses the same regime of parameters. Other methods such as MDP and CT also appear to be robust to such rescaling

(not shown). This suggests that more modern algorithms for SPCA may be more appropriate than diagonal thresholding in practice, particularly on instances where the relative scales of the variables may not be accurate or knowable in advance, but we still want to be able to find a correlational structure between the variables.

45 110 - Do S MDP 16 - D1 MDP-1 - QO 10 14 -Q1 121

10 A 6 8

6 4

4 2] 2

.5 0.0 0.5 1.0 1-5 2.0 2-5 4 5 6 7 8 9 20 11 value of stafitk

-- DO 4.- DI -QOI 2. Q 1 0

8 2-

4 I

-0.5 0.0 0.5 1.0 1.5 2.0 2.5

Figure 2-2: Performance of diagonal thresholding (D), MDP, and Q for hypothesis testing at n = 200, d = 500, k = 30,0 = 4 (left and center). TO denotes the statistic T under H0 , and similarly for T1. Effect of rescaling covariance matrix to make variances indistinguishable is demonstrated (right)

2.6 Conclusion

We gave a reduction from SPCA to SLR that works up to the computational threshold for SPCA that we believe based on average-case hardness assumptions. One obvious question is if there is a different reduction that extends all the way down to the statistical threshold; this would imply the average-case hardness of SLR under some conditions. A related question is formulating a model more robust than the gaussian spiked covariance model for SPCA, yet still amenable to analysis. It would also be interesting to see if the reduction can be done in the other direction,

46 from SLR to SPCA. One would probably have to restrict the design matrix to a certain class in order to have sufficient control on its distribution.

47 48 Chapter 3

Tensor rank under the smoothed model

3.1 Introduction

Tensors have received a lot of attention over the past years due to their wide application in mathematics as well as statistics and other related fields. We refer to [Lan] for a detailed mathematical introduction to tensors and to [McC],[Moi] for an introduction to the role of tensor decompositions in modern statistics and machine learning, respectively . In the machine learning community, tensors have found applications in phylogenetic reconstruction [Com94l, [MR05],hidden markov models [MR05I, mixture models [HK13], topic modeling [AFH+12], community detection [AGHK13], etc. On a high level, tensors can be thought of as higher order matrices and often times they are more useful than matrices in capturing higher order relations in data. What makes tensors a more powerful tool compared with traditional linear algebra is a uniqueness property of decompositions. More concretely, on one hand, given a matrix M = E' a(') 0 b0) this decomposition is almost never unique, unless we require the factors {a(')}i and {b(')}j to be orthogonal or that M has rank one. This makes the factors {a(z)}, and {b(2)} uninterpretable. On the other hand, given a tensor

T = E 1 a(') 9 b0) 9 c() there are general conditions (e.g see Kruskal [Kru77] or Section 3.6) under which {a(')} , {b0)}, {c(')} are uniquely determined. Even though tensors are more powerful than matrices, they are less well understood and unfortunately, many of the

49 familiar properties of matrices do not generalize to tensors. For example, [Ha's90 shows that in the general setting computing the rank of a tensor is NP-hard. More than two decates later

IHL131 expanded on the prevalence of this computational complexity issue by proving that a plethora of other problems, such as finding the best low-rank approximation, computing the speectral norm and deciding whether a tensor is nonnegative definite are NP-hard too. Our main contribution is to give a rank detection gadget for tensors in the smoothed model, in particular we show that in this model tensor rank detection is not NP-hard. The assumption here is that the model is not adversarily chosen, formalized by a perturbation of the model parameters. Our main technical result is that the rank of Young flattenings of perturbed tensors adds and the analysis involves a careful disentangling of the noise created by the perturbations. We bring Young Flattenings to the TCS community and we conjecture that the tensor machinery developed in this thesis can be used to obtain algorithms for tensor decomposition, again under the smoothed model.

3.1.1 Our results

We study tensor rank detection and tensor decomposition in the smoothed model as intro- duced in [BCMV14b]:

" An adversary chooses a tensor T = Z> w() ® u(1) 0

" Each vector w(') , U(), 0 is p-perturbed to yield iv() , j(j 1

" We are given 'T' = Zi ® iz(7) 0 j().

For tensor rank detection the goal is to recover r while for tensor decomposition the goal will be to recover the factors {j1(i)}i,{J(i},{t(i)}. This model is inspired by smoothed analysis which was first introduced by Spielman and Teng in [ST01j, [ST09I as a framework in which to understand why certain algorithms perform well on realistic imputs. Intuitively,

'An independent gaussian with zero mean and variance p2 /n in each coordinate is added to w(), u(), v() to obtain Cv() i(i)I f() The Gaussian assumption is made for convenience, but our analysis actually applies to any type of perturbation; the key is the independence.

50 good smoothed analysis guarantees show that worst instances are isolated, small perturba- tions of input make instances easy and give best polynomial time guarantees in the absence of any worst-case guarantees. We prove the following main theorem, which gives a gadget for computing the rank of a perturbed tensor in time equal to computing the rank of a matrix:

Theorem 14. Let i E R""'x be a third order perturbed tensor, where T C R""'*f has rank n < R(T) < 3n/2 and t is obtained from the above smoothed analysis model. Let

Xo, X1 ... Xn_ 1 be matrix slices along the first dimension of T. Then almost surely over the noise the following equality holds:

1 1 R(T) = n + -rank(X1X07 X - X 2 2 2X0-X 1 ).

3.1.2 Our approach

We use Young flattenings to prove Theorem 14, in particular, the following lemma is the key part of the proof:

3 Theorem 15. Let T e R xnn be a rank r perturbed third order tensor, T = IW 0 uMs 9 v). For i E [r], let P = wi) 9 u) 9 v) be rank 1 perturbed tensors, so = E'_ 1 Ts. Then almost surely over the noise, the ranks of the Young flattenings tj^,(Ti>^j also add:

r 1 rank(TA ) = rank(( i)^ ), i= 1 where Young flattenings are introduced in (3.4).

Organization

In Section 3.2 we give preliminaries and notations. In Section 3.3 we introduce the Young flattening machinery for general tensors and in Section 3.3.1 we develop our main technical tool in the context of the smoothed model. In Section 3.4 we prove our main theorem. In

51 Section 3.5 we discuss extensions of our rank detection gadget and give a glimpse into how the technique we developed can be used to obtain an algorithm for tensor decomposition. In Section 3.6 we congregate the linear algebra lemmas we use throughout the proof.

3.2 Preliminaries and notations

For an n E N+, let [n] denote the set {0, 1, 2,..., In - 1}. We use 0 to denote outer product and A to denote exterior product. For any diagonal matrix D E R"r, we define D, E Rxf and D' C R(rn)x(r-n) its block components such that

D =Dn 0 (3.1) 0 D'

We use this notation throughout the proof, in particular if Dk is diagonal we denote by

Dnk and D' its corresponding blocks as defined above. For any matrix M, we denote by dimKer(M) the dimension of the kernel of M. We introduce the basics of tensors. In general a tensor is indexed over k-tuples, and k is called the order of a tensor, hence a tensor T can be viewed as a point in Rflxf2x.xfl. Note that based on this definition, a matrix is just an order two tensor. If T is an order three tensor of size m x n x p then T can be thought of as a collection of m matrices of size n x p that are stacked on top of each other. We call these matrices matrix slices along the first dimension of the tensor T.

Definition 16. A rank one, third-order tensor T is the tensor product of three vectors w, u and v, and its entries are

Tijk = WiUjVk

Thus if the dimensions of w, u and v are n 1, n2 and n3 respectively, T is of size n1 x n2 x n3 . Moreover, this can also be written as

T =w 0u v

52 We define the rank of a tensor next:

Definition 17. The rank of a third-order tensor T is the smallest integer r so that we can write r T = iw S UM (g eVM

Tensor T

u1 9v 1 0w 1 U2 0V2 0W2

Figure 3-1: CANDECOMP/PARFAC tensor decomposition of a third-order tensor

Note that the above is a generalization to tensors of a particular definition of the rank r of a matrix M, namely as the smallest number of rank one matrices we need to add up to obtain M. One example of what makes working with tensors a lot more delicate than working with matrices is that while the rank of a matrix has a lot of other equivalent definitions, the above is the only one that can be generalized. For example, the definition of the rank of a matrix M as the dimension of its column/row space doesn't generalize to higher orders because the spans along different coordinates may have different dimensions. We denote the rank of a tensor T by R(T) and the rank of a matrix M by rank(M). Note that Theorem 14 only uses three matrix slices of i to recover its rank, hence we can assume without loss of generality that i is a 3 x n x n perturbed tensor. Let T = WM w(3U( ®v(i) where w(, U, 0) are p-perturbed and r is the rank of t. Let eO, el, e 2 be the elementary basis vectors of R', and decompose T according to this basis as

iT= eO Xo + el ® X1 + e 2 0 X 2 , where {Xj}i E R"nx" are the matrix slices of i along its first dimension. Let U be the matrix

53 with the u(')'s as columns and V be the matrix with the v(2)'s as columns. Decompose the

's along e0 , e1 , e2 as w(z- = _2 dikek and for k E [3] define the diagonal matrices Dk:

d1k 0 Dk= [ -.

0 drk

By equalizing the coefficients in the two decompositions of t above we get

Xo = UDoVT', X 1 = UD1 VT, X2 = UD 2 VT (3.2)

We note that equation 3.2 can also be obtained in the following way. The aim is to show that X, = UDVT for any index *. Recall that as an r-order tensor, T can be written as

T (Zj k) w I

As slices along the first coordinate of the tensor, X, correspond to only looking at the * coordinate of all the w(l components, hence

X= TP., j, (*) (= ( I

X*= wloul) 9 v0) = UD*VT. 1

Let U be the n x n submatrix of U formed by the first n columns of U. Since U(1), U(2),. .. , U(n) are perturbed vectors, Un is invertible almost surely. Moreover let W E R x(r-") be such that UnW gives the submatrix formed by the last r - n columns of U. One could think of W as the weights for a change of basis of the last r - n columns of U, hence W is also a

54 matrix with perturbed columns. Formally, U, and W are such that :

U = [U" U W] . (3.3)

To prove Theorem 14 we leverage Young flattenings that we introduce next.

3.3 Young flattenings

Young flattenings were introduced in [LO13] as generalizations to Strassen's equations [Str83]. While we refer the reader to [LO13] for a detailed treatment of Young flattenings, we sum- marize here only the properties that we use in our proof.

For easiness of presentation, for this section only, imagine T C AOBOC, where dim(A) a and dim(B) = dim(C) = n and A, B, C vector spaces with duals A*, B*, C*. Note that

T can be considered as a linear map B* -+ A 9 C and write TB for this form. Under this notations, Strassen's equations [Str83] may be understood as follows: tensor TB with IdA to obtain a linear map B* 0 A - A ®A 9 C and skew-symmetrize the A 0 A factor to obtain the order 1 Young flattening:

TA^A : B* 0 A 4 A2A0 C. (3.4)

Recall that for a V, A 2V = V 0 V/(v 0 w + w 0 v).

Lemma 18. [L013]If T = a 0 b 0 c has rank one, then rank((a 0 b 0 c)^Al) = a - 1. 2

Proof. Expand a = a1 to a basis a2, .. . .. aa of A with dual basis a1, ... , a' of A*. Then

T^ =lEj [j 0 b] ®[ai A ai S c], so the image is isomorphic to (A/a1 ) 0 c. This gives us that rank(TA^Q) = a - 1. l

1 2 We can similarly define TA : B * @ AP A - AP+ A & C, the order p Young flattening and conclude as above that if T has rank 1 then TAP has rank (a 1)

55 Lemma 19. [LO13] Interpret T G A 3 B D C as a linear map B* - A 0 C. Then rank(TA1 ) < (a - 1)R(T), where R(T) is the rank of the tensor T.3

Proof. Let r = R(T) and let T = Ti + T2 + .. . + T, such that R(T) = iVi. Then

r r rank(Tj^j) = rank(Z(Ti)^') < rank((Ti)^') = R(T)(a - 1), where the last equality was obtained from Lemma 18. L

Next we focus on the case a = 3, explicitly compute the matrix corresponding to the transformation TA 1 and rewrite the result of Lemma 18 in terms of ranks of matrices.

Claim 20. Let T E A9 B 9 C, where dim(A) = 3 be a tensor and TA" : B* O A _ A 2 AOSC be its order 1 Young flattening as defined in (3.4). Let eO, e 1, e 2 be the basis for A and e1 A e 2, eo A e1, eo A e2 the basis for A2 A, and let Xo, X 1, X 2 be matrix slices of T, such that

T= eo9Xo+e1 X1 +e 2 0X2. Then:

eo e1 C2

e1 A e2 0 -X 2 X1 1 AMat(TA ) = e2 A e0 X 2 0 -Xo

e0 A el -X 1 Xo 0

Proof. With the notations above in place we can compute:

TA'(eo 0 0) = (Xo) (Seo A eo + O(X1 ) 9 ei A eo + O(X 2) e 2 A eo

= -#(X 1 ) 0 eo A ei + O(X2) 0 e 2 A eo

1 TA (ei®0 ) = O(Xo) 9 eo A el - 0(X2)0 el A e2

1 TA (e 2 9 #) = -O(Xo) ® e 2 A eo + O(X1 ) 9 el A e 2

3 1 If instead we worked with TA we can similarly obtain rank(T) (P- )R(T).

56 Hence, the matrix corresponding to T^1 is:

eo el 2

e 1 A e2 0 -X2 X1 Mat(T^) =A e 2 A eo X2 0 -XO eo A e1 -X1 Xo 0

Lemma 21. If T e A®9B®C, dim(A) = 3 and T = wouov has rank 1 with w = (do, d1 , d 2 ) and w, U, v not the zero vector, then Mat((T)j') has rank 2 and can be decomposed as follows:

di 1 0 -X 2 X1i Idol

X 0 -X = [0 dovT 0] + -U -d 2V T 0 dovTI 2 [d 1vT

-X 1 X0 0 U 0

Proof. The equality can be checked by direct computation and substituting equations (3.3).

d2,1 do The conditions of u, v, w not being the 0 vector ensures that the pairs of vectors 0

[a U

do

-U and [divT dovT o], [-d 2VT 0 dovT are linearly independent, which gives us 0 rank(Mat(T^1 )) F1

We are ready now to prove our main technical result.

3.3.1 Young flattenings in the smoothed model

In this section we prove Theorem 15, which says that in the smoothed analysis framework the rank of Young flattenings has an additive property.

57 Our overall strategy is to show that the matrices corresponding to the transformations (T)^' can be decomposed into sums of rank 1 matrices whose column span and row span are independent. Since T = i-0)0 it() 9,i), where (, i I,(') are perturbed and so almost surely not equal to the zero vectors, we can apply Lemma 21, to get

di dio

Mat((i)^') =A 0 [dilV(i)T d ov(i)T 0] + -UM [-di2v(i)T 0 djov(i)TI Vi. [01 L I

Lemma 21 also implies that rank(Ii1 ) rank(Mat((Tj)^l)) = 2, Vi. Recall that T

j= 1 Ti, hence

r dio r dio

Mat(Tii ) = 0 [di 1 vi)T dov(i)T 0] + )] [-di2 v(i)T 0 djov(i)T

U~=1 0 or using notations from Section 3.2:

-UDO--D 2 UDOD 1 SD1 VT DoVT 0 Mat(TA 1) = 0 U x -D 2 VT 0 Do VT LU 0

To show that rank(T^') = Z i ra'nk(j=i) = 2r almost surely it suffices to show that

3 the above matrices, let them be C1 E R x2r and C 2 E R 2rx3n have full column and row rank, respectively, almost surely. Note that we are in the case r < 3n/2. It suffices to show the above for r = 3n/2, since in the case r < 3n/2 would require linear independence of a subset of the vectors from the case r = 3n/2. From now on we assume r = 3n/2. We focus on C1 and show that the dimension of its kernel, dimKer(Ci), is 0 almost surely. By using

58 the notation introduced in Section 3.2 we can rewrite C1 as

-UD--D2 --UnWD--'D' UnD-Da UnWDJ-'D' C1 = Un UnW 0 0

0 0 Un UnW

3 Let T1 , T3 E R , n and T2 , T4 E R3nx(-n) be sets of columns of C1 such that C1 =

[T1 T2 T3 T4] (Ti's correspond to the sets of columns above). By direct computation the following relations hold:

-UnWD' -1D + UnD-Dn 2W UnWl1

Dn T2 -T 1 W= 0 = 0 0

UnW D--1 D' - Un D-jDn1W Un W2

T4 -T 3 W= 0 = 0 ,where 0 0

1 W1 = -WD'- D' + D7 Dn2W and W2 = WD-'D' - D-jDn 1 W.

Note that to complete the proof we are only interested in the dimKer(C1 ), hence we will

manipulate C1 by adding and substracting its columns without changing its column span

59 and implicity without changing its dimKer(CI). We have:

dimKer(C ) = dimKer( 1 [T1 7' T T41 )

= dimKer( [T1 T2 -T 1W T3 T4 -T 3 W)

-UnD-Dn 2 UnW1 UnD-jDi UnW2 = dimKer( Un 0 0 0)

0 0 Un 0

Recall that for any A, dimKer(ACI) = dimKer(C1) and moreover U is invertible almost surely, as seen in Section 3.2. Hence we have:

Un-i 0 0

dimKer(C1 ) = dimKer 0 Unl 0 C1

0 0 Un

-D-- W1 W2

= dimKer In i 0 0 0 0 0 In 0 ii (3.5) dimKer [W1 W21 ,I

where (*) is true by Lemma 25. Recall now the definition of W1, W 2 :

1 1 W1= -WD- D0 + D--Dn2W and W2 = WD - D' - D-Dn1 W.

We view the determinant of the matrix [W1 W2] as a polynomial in variables the diag- onal elements of D', DI, D2 i and Dn2 (which are perturbed). If this polynomial is not the zero polynomial, since the set of roots of any multivariate polynomial has measure zero, then almost surely the determinant would not be zero and so dimKer( [W1 W2]) = 0. Suppose

60 for the sake of contradiction that Det( [W1 W2]) is the zero polynomial. We apply Lemma

23 with A = W1, B = -D-jD, 1 W, C = -WD> and D = D' and obtain

aDet IW1 W2 ]_ 8Det [w- = Det( [W WD') 1d'd12 ..-(-9d'9-n)- 1

hence Det( W1 WD-') is also the zero polynomial. Next we apply Lemma 24 with

A=WD'-', B = -WD -1D', C = -D- W and D =Dn2 . Note that WD-', -D-JW E S, E R (r-f) x( 2fl-r) R"(x,-n) and let S2 E R(r-n)x(r-n) be the bottom submatrix of WD>-' and be the top submatrix of -D,-jW. Lemma 24 gives us that Det(S2 )Det(-Si) has to be the zero polynomial, as it is obtained by taking partial derivatives of Det( [W1 WD ]) which is the zero polynomial. However, recall from Section 3.2 that W E R nx(r-n) is a matrix with perturbed entries and multiplying it to the right by D -1 only has the effect of scaling its columns, while multiplying it to the right by -D-1 only scales its rows. This ensures that S2, Si as submatrices of corresponding dimensions are full rank almost surely, and so Det(S2 )Det(-S) can not be the zero polynomial. This contradiction means that our initial assumption had to be false, and so Det( [Wi W2]) is not the zero polynomial and dimKer(C1 ) has to be 0.

By an identical analysis we can obtain that dimKer(C2) 0 and so C1, C2 have full column and row rank, respectively, equal to 2r. This concludes the proof.

Remark 22. Our disentangling method reduces matrices with correlated Gaussian noise per- turbations to smaller matrices with independent Gaussian perturbations through polynomial manipulations. Hence we can apply standard anti-concentrationresults (Carbery- Wright, for instance) to conclude that tJ is not only full rank almost surely, but also well condition. In particular, its smallest singularvalue is at least an inverse polynomial, with failure probability at most an inverse polynomial.

61 3.4 Proof of Theorem 14

For i as described in Theorem 14, let X0 be its first matrix slice along the first dimension.

From equations (3.3), we have X0 = UDoV, where U, Do, V are perturbed, hence almost surely Xo will be full rank. Recall from Claim 20 that for every tensor i we can write Mat(TA') as

0 - X2 X1 - - 0 Q Mat(TAi) X2 0 -XO =, -Q R- -X1 X 0 0

0 -XO X2 where R = IQ -2X]I Xo 0 -X1

Using the Schur Complement we have

0 Q 1 0 -QR--O Q LQ R --R--lQ I L 0 R

Note that the second factor in LHS above is full rank, hence Mat(tA") has the same rank as the RHS above, i.e. rank(Mat(iA)) = rank(R) + rank(-QR-1 Q). Also, since Xo is full rank, R will have rank 2n and by using the definitions of R, Q, Q we have -QR- 1 Q =

X 2 X 'X 1 - X 1 XO-X 2, hence

4 rank(Tj ) = 2n + rank(X2 X6-'XI - X 1X-17X 2 ).

Now, since T is perturbed, by Lemma 15 and Lemma 21 we have

r 1 rank(tAA ) = ranlk(iA' 2R(T).

62 Puting the previous two equations together we obtain the desired result:

1 R(T) = n + --rank(XX671 X - X X&-'XI). 2 2 2

3.5 Future directions

One way to extend our result is by considering Young flattenings of higher order p. Let's consider the case p - 2, so a = dim(A) = 5 and we're interested in TA 2 . We are still able to

0 Q write Mat(Tj(tA2)2 ) [ , but now R and the Q's have different dimensions. In particular, if t = ao Xo +... 5a X5 then R will be a 6n x 6n matrix having 6 blocks of X on the diagonal and

0 [X 1 , X 2] [X 1 , X3] [X1 , X 4]

[X 2, X 1] 0 [X 2, X3] [X2, X4]

[X3 , X 1] [X3, X21 0 [X3, X4]

[X4 ,X 1] [X4 , X21 [X4 , X3] 0

Note that QR-1 Q is the same as QQ if we substitute the blocks [Xi, Xj] with XiX0-1Xj - XXo-1 Xi. It is still true that rank(Mat(TA2 )) rank(R) + rank(-QR-Q), and that rank(TA 2) (4)R(T) = 6R(T), where this last inequality cames from the generalization of Lemma 19. Hence, in the case p = 2 the inequality that we obtain is

1~ R(T) > n + -rank(-QR-Q). 6

We conjecture that our noise disentangling lemmas from Section 3.6 can be generalized to prove that for perturbed tensors the above becomes equality. For general p, the 1/6 is just

I/(2). Recall that QR-1 Q has dimension ( ) x so it has maximum rank ( )n making our method work potentially for R(T) up to 2 n.

63 Decomposition algorithm for perturbed tensors

We give a high level overview of how the machinery developed here may be used to obtain an algorithm for tensor decomposition in the smoothed analysis model. In particular, a generalization of Lemma 21 gives us the insight. The decomposition of Mat(TA1 ) presented in Lemma 21 for the case R(T) 1 holds for any rank:

0 -X 2 X 1 -D2Do1U DIDO1U

X27-X = [ [D 0 1 VT DoVT 0 + -U [-D 2 VT 0 DoV -X1 X0 0 U 0

where XO, X 1, X2 are matrix slices of T and U, V, Di are as introduced in Section 3.2. Note that the Young flattening above was obtained by using only the first 3 matrix slices of T along the first dimension. One could imagine using other slices to obtain different flattenings. The key property of the decomposition above is that it separates the column span of Mat(T"')

into two parts, the first part involving X 2, Xo and the second part involving X 1, Xo, let

this be Type 1 and Type 2. By substituting X1 with other two slices, X3, X4 we obtain three different flattenings. One can show that the intersection of the column spans of the

-D 2DOU three corresponding flattenings is precisely the column span of Type 1, 0 .One U

-D 3D;-U1 can repeat the intersection to obtain another matrix with column span 0 it L U turns out that the vectors ua E R' (columns of U) can be recovered as the only solutions Uici -D2DO'U UiC2 -D3Do1U ColSpan 0 and 0 ColSpan to0 for0 some (

[ Ui o \a L U U U constants c1, c2 . These constants will enable us to recover the wi's and similarly, by looking at the row spans instead of the column spans, one can recover the vi's. There are indications

64 that this algorithm would recover the components of perturbed tensors T for rank up to (1 + E)n, for a small enough constant E.

3.6 Linear algebra lemmas

Lemma 23. Let A c Rx(n--"), BC E R x and D c R'mx" be matrices such that D is diagonal with (di)im as diagonal, where di's are random variables independent of each other

and of A, B, C. If M =A B-CD] andN= [A -C] then

ODet(M) Odj~2 .. Odm= Det(N ) 6id1 ad2 .. .&odm

Proof. We look at the determinant of M as a polynomial in the random variables (dj)j

6Det(M) = Det(N) d1 Md2 ... d

Lemma 24. Let A E Rnx(n-m), BC E Rnxn" and D e R nx" be matrices such that D is diagonal with (di)i n as diagonal, where di's are random variables independent of each other and of A, B, C. If M = [A B - DC] and A 1, C1 Rmx (nm-"), A 2 , C2 C R(n-m)x(n-m) are

65 such that A =A and C [C then A 2 C2

=Det(M)Det(A 2 )Det(-C1 ) Od18d2 ... adm

Proof. Let B 1, Di E Rxm and B 2 C R(n-m)xm, D 2 E R(n~m)-x(n-m) be such that B = B1 B2 and D= I 0 . With this new notation, M can be written as 0 D2J

- DiC1 M= A1 B1

A 2 B2 - D2 C 2

We look at the determinant of M as a polynomial in the random variables (di)i

B1 - DI01 as unique representatives for rows/columns. First note that each such monomial picks exactly one element from each column of B1 - D1 C1 , hence every (di)i

Suppose there is a monomial that doesn't vanish but picks either an element (A 1 )jj or

(B2 - D 2 C2 )ij. If it picks (A 1 )ij, then it can pick no other element from the i'th row of M and so di will not appear. Hence, when taking the partial derivative with respect to di this monomial has to vanish. If it picks (B2 - D2 C 2 )ij then out of the last m columns of M it can only pick at most m - 1 other elements. Hence, by pigeonhole principle and using the uniqueness of representatives from each row and column, we get that at least one row,

66 k, of B1 - D1C1 will not have a representative in this monomial. When taking the partial derivative with respect to dk this monomial would have to vanish again. This completes the proof, and so we have DDet(M)_ daD2 .. aM Det(A 2)Det(-CI), ad1ad2 ... adrn as desired. El

Lemma 25. For B C Rbx, A E R'xa and C E R"x' and D e Rbxd such that C is invertible we have 0 C dimKer = dzmKer( [A D]) (-A B D_

vii-

Proof. Let V2 Ker 0 C 0 where v, E Ra, v c Rnv e R d. Then CV = 0 ([ BDj 2 3 2 V3 and Avi + BV 2 + Dy 3 0. Since C is invertible, this implies v 2 = 0 and consequently

C Avi + DV 3 '0,so V3 Ker( [A D]). Similarly, for any Vi E Ker( [A D]), 0,

L 3 0 C 0 will be in Ker and this bijection ensures that the two spaces have the same A B D dimension. El

Kruskal condition

For completeness, we mention here the sufficient conditions for uniqueness of tensor decom- position as formulated by Kruskal in [Kru77].

Definition 26. The Kruskal rank of a set of vectors {Aj}< is the maximum r such that all subsets of at most r vectors are linearly independent. For a matrix A this is denoted as krank(A) and the set of vectors correspond to the columns of A.

67 Kruskal [Kru77] showed that the decomposition in Definition 17 is unique if

krank(W) + krank(U) + krank(V) > 2k + 2

which is a milder condition compared to matrix decomposition. For instance, for a rank-2 decomposition, this reduces to having all factor matrices W, U, V being full column rank, while in the matrix case we need the stronger condition of orthogonality.

68 Chapter 4

Stronger Fine Grained Hardness via

#SETH

The field of fine-grained complexity aims to establish quantitive bounds on the complexity of computational problems solvable in polynomial time. Over the last few years, the area has seen lots of progress, including the development of tight conditional hardness results for problems such as computing the edit distance or the longest common subsequence between two strings [B15, ABW15, BK15]. These results are based on plausible complexity theoretic conjectures, such as Strong Exponential Time Hypothesis (SETH), which postulates that the satisfiability of CNF formulas with n variables cannot be solved in time c" for some c < 2. This hypothesis is consistent with the state of the art in satisfiability solving algorithms, the best of which run in time roughly 2n('0-()m. Other popular conjecture include 3SUM and APSP [VW15J.

Since these hardness results rely on conjectures, it is important to make them as weak as possible. This line of research has attracted significant attention over the last few years. For example [AHWW16] showed that quadratic hardness of edit distance can be shown assuming the satisfiability of general NC circuits cannot be solved in c' time for c < 2. Since NC circuits are significantly more expressive than CNF formulas, the latter assumption is significantly weaker than SETH. Similarly, [AVWY15] show conditional hardness results

69 assuming that either SETH, 3SUM or APSP hold. In this paper we explore an alternative proposal for weakening SETH, by assuming the hardness of counting the number of satisfying assignments (#SAT), as opposed to just finding one. Such a hypothesis, called #SETH, was formulated in [Wil18] 1, and states that counting the number of assignments to CNF formulas with n variables cannot be solved in time c' for some c < 2. Although the best theoretical upper bounds for the complexity of #SAT and SAT are essentially the same [AWY15, PPSZ05], counting appears to be much more difficult in practice. For example, [Varl5 states that "#SAT is a really hard problem ... in practice quite harder than SAT"; similar observations appear in [CMV13]. Thus, reducing from #SAT gives a stronger evidence of hardness than reducing from SAT itself. We demonstrate the applicability of this assumption in fine-grained complexity, by using it to show hardness of a variety of problems in pattern matching, data structures, graph algorithms and machine learning.

Organization The rest of this chapter is organized as follows. In Section 4.1 we introduce the hardness conjectures on which lower bounds are based. In Section 4.2, we show counting based conditional hardness of the pattern matching under edit distance problem. In Section 4.3, we discuss problems from machine learning and in the Sections 4.4, 4.5, 4.6 we discuss graph problems: the Wiener index, dynamic graph problems and counting mathing triangles respectively. Finally, in Section 4.7 we give a proof in the opposite direction: we show average case hardness for the Counting Orthogonal Vectors Problem assuming the worst case hardness for the minimum-weight k-Clique Problem.

4.1 Preliminaries

Definition 27 (k-SAT Problem). Decide whether a given conjunctive normal form formula on N variables and M clauses, where each clause has at most k literals, is satisfiable. The #k-SAT problem asks to output the number of satisfying assignments.

'See also [DHM+14] for a related but weaker #ETH assumption.

70 Definition 28 (SETH). k-SAT cannot be solved in time 0(2 (1 -)N ) where E > 0 is a constant independent of k.

Definition 29 (#SETH). #k-SAT cannot be solved in time o( 2 (1--)N) where E > 0 is a constant independent of k.

Definition 30 (Counting Orthogonal Vectors (COV) Problem). Given two sets A, B C

{o, 1}" with JAI = n, |B| = m, we want to output the number of pairs a e A, b e B such that a -b = 0. That is, we want to count the number of orthogonal pairs of vectors.

The following conjecture is implied by #SETH.

Definition 31 (Counting Orthogonal Vectors Conjecture). For any constant -Y> 0 and any d = w(logn), the Counting Orthogonal Vectors Problem requires Q(nm)1-0() time, where m = n-.

Note that #SETH implies Counting Orthogonal Vectors Conjecture since the reduction in [Wil04] from SAT to OV preserves the number of solutions. Each of the following sections treats one class of problems.

4.2 Pattern matching under edit distance

4.2.1 Preliminaries

Edit distance For any two sequences P and Q over an alphabet E, the edit distance edit(P, Q) is equal to the minimum number of symbol insertions, symbol deletions or symbol substitutions needed to transform P into Q. It is well known that the edit distance induces a metric; in particular, it is symmetric and satisfies the triangle inequality. In our hardness proofs for this section we use an equivalent definition of edit distance that will make the analysis of the reductions easier, in particular it claims that allowing only deletions and substitutions would give the same result.

71 Observation 32. For any two sequences P, Q, edit(P, Q) is equal to the minimum, over all sequences T, of the number of deletions and substitutions needed to transform P into T and Q into T.

Proof. By the metric properties of the edit distance, edit(P, Q) is equal to the minimum, over all sequences T, of the number of insertions, deletions and substitutions needed to transform P into T and Q into T. To get rid of the insertions, observe that if, while transforming P, we insert a symbol that is later aligned with some symbol of Q, we can instead delete the corresponding symbol in Q. Thus, it suffices to allow deletions and substitutions only. LI

Pattern matching under edit distance [CM07 For two sequence P and T over an al- phabet E, the pattern matching under edit distance problem is to compute the minimum edit distance between P and any suffix of T, T(i), for i =1, ... , T|.

In our reduction we will use vector gadgets that embed into strings the vectors inputs to the COV problem. In the analysis of the reduction it will be intuitive to consider these gadgets as nodes in a graph with edges given by whether two vectors are orthogonal or not. We next introduce some graph theoretic lemmas that we subsequently use in our reduction.

Definition 33 (McDiarmid's indequality). Let X 1, X 2 , ... , X, be independent random vari- ables and assume that f satisfies

sup if (XI, X2 .... - xn) - f (X1, x2, . .. , Xj,i x, i+1,... <, ) ci X1,X2-.-Xn,Yi for all 1 < i < n. Then, for all t > 0 we have

2 Pr [JE[f(X1 ,..., Xn)] - f(X 1 ,..., X) > t] 2exp - t

Definition 34 (Knuth Shuffle - informal). Given a finite sequence, the following procedure produces an unbiased permutations of the original sequence. Put all the elements into a hat and continuously determine the next element by randomly drawing an element from the hat

72 until no elements remain.

Lemma 35. Let G (U U V, E) be a bipartite graph with two parts U and V of size |U| = |V| = n. Let 6 JE|/n 2 be the fraction of edges present in the graph. Let S C U x V be a set of edges (not necessarily a subset of E). Let G, = (U U V, E,) be a random graph obtained from G by applying a uniformly random permutation w to the vertices in V. We consider the random variable |E, n S|, which is the number of edges of E' that are present in S. Observe that E [IE, n Sf] = 61SI. We have:

Pr [ E, n S1 = 61S| t Vnlogn] 1 - n-().

Proof. Generate a random permutation using the Knuth shuffle. Knuth shuffle generates a random permutation using n random independent variables and we can check that the resulting random variable E, n S1 satisfies the constraints of Definition 33 with ci = 2. We use the inequality given in Definition 33 with t = Vn log n to get the desired result. El

Lemma 36. Let G = ({0,...,n - 1} U {O',..., (n - 1)'}, E) be a bipartite graph. Let

6 JE|/n 2 be the fraction of edges present in the graph. Let G, = ({0,... , n - 1} U

(0)/,-.., (n - 1)'}, Er) be a random graph obtained from G by applying a uniformly random permutation 7r to the vertices in {(0)',..., (n - 1)'}. With probability 1 - n-'(), for every 0 ij, k < n -i we have

E., n Si,j,k| = 6k v' log n, where Si,j,k A {(i, (i)'), . . . , (i + k - 1, (j + k - 1)')}. The indices of the vertices wrap around, that is, if i > n, then it corresponds to vertex i - n. Similarly for (j)'.

Proof. Use Lemma 35 and use the union bound over all n3 sets Sij, k.

4.2.2 Reduction

In this section we prove the following theorem.

73 Theorem 37. Let P and T be two sequences of length n. Let T(i) A T,...,TTj be the

suffix of T that starts with the i-th symbol. Computing edit(P, T(i)) for every i = 1,. .. ,T

requires n4 /3-o(1) time assuming COV Conjecture.

1 Let A A {a, ... , aN} and B A {b ,... , bN} be the input to the COV problem. Let 6 be such that the number of orthogonal pairs is 6N 2 . Let G(a) and G'(b) be the vector gadgets from [B1151. We include their description and properties in Appendix A.1 for completeness. Let 1 A V/'n(log n)2 , let 0(a) A 5'G(a)5' and G'(b) = 5'G'(b)5'. Finally, we define

P A G(al)G(a2)...G(aN)6666...

and the text T as follows:

2 T A O'(b7()'(bN) . .. (b7r(N))I(b"(l))d'(b( .. 61 ( )

w gives a permutation obtain from a Knuth shuffle of [N]. The number of symbols 6 in the first sequence is equal to the length of T. Note that the sequences are of length O(N 5 ) implying the promised lower bound, as (N1 5) 4/3 = N 2. Let T' be 1 TA G'(b(2+l) )'(bs+) . .. '(br(N1 )G/(b-( ) )(7) 61( )

Intuition for the reduction The vector gadgets are constructed such that each optimal allignment for a suffix of T will give us the count of ortogonal pairs among n different pairs of vectors from the input to COV. Adding all these up will gives the final count of orthogonal vector pairs. The vector gadgets in T are suffled to ensure concentration of the number of orthogonal pairs among the different suffixes. Note that the padding with 5s between different gadgets is equal to the concentration gap and ensures that the desired allignment is the optimal one. In particular, it ensures that every pair of vectors is acounted for in exactly one suffix.

74 We know that edit(G(a), G'(b)) = C - 2[a I b] (4.1) for some quantity C that depends only on d, the length of vectors a and b. We claim:

n edit(P, T') = C' - 2 Z[ai, b7'('+j) j=1 where C' = nC + TI. This is equality is sufficient to show COV hardness.

edit(P, T) < C' - 2 EnI[a , b7(a+j)] is immediate: align vector gadgets in pairs and use eq 4.1.

In the rest we prove edit(P, T) > C' - 2 EZ'[aj, bs(2+j)]. Multiple times we will use the fact that k S [aj+t I b~r(j' t)1 = k65 \/ log n1 (4.2) t=1 holds for all j, j', k. We get this by using the fact that b's are permuted randomly and using Lemma 36.

Consider an optimal alignment between P and T' and build a bipartite graph G =

({1, ... , N} U {(i + 1)',. . . (2N)'}, E). Add edge (j, (j)') to E if the j-th vector gadget in P is aligned with the (j)'-th vector gadget in T.

Consider two cases.

Case 1. For every (j, ') E E we have j' = i + j. This means that, for every j, if a vector gadget G(ai) is aligned with another gadget, then it must be G'(b"(i+j)). The lower bound follows from eq 4.1: if G(aj) and G'(b(ai+j)) are not aligned, the contribution is at least the total length of the two gadgets; if the gadgets are aligned, the lower bound follows from the proof of Lemma 4 in [B115] (see Case 1.2).

75 Case 2. Exists (j, j') C E such that j' # i +j. In this case we will show that edit(P, T) ;>

C' - 26n + 1/100. We split P and T' into substrings. Start with j = 1 and j' = i + 1. Find (k, k') E E with minimum k > j such that k' # k + j' - j. Let e be the sequence between vector gadgets corresponding to j and k and let e' be the sequence between vector gadgets corresponding to j' and k'. Then set j = k and j'= k' and repeat. For every found pair of e and e' we lower bound the contribution to the edit distance from symbols in e and e'. Let m = min(k - j, k'- j') and d = max(k - j, k'- j') - m. We show that the contribution to the edit distance from e and e' is lower bounded by (C - 26)m + dl/100. Summing over all pairs e and e' we get the required lower bound on the edit distance. Suppose that k - j < k' - j'. The other case is analogous. Let t be the largest integer such that j < t < k and (t, t') E E for t' = t + j' - j. Let f be the prefix of e before the vector gadget corresponding to t and let g be the suffix of e after the vector gadget corresponding to t. Similarly define f' and g'. By eq 4.2, the contribution to edit distance from f and f' is (t - j)(C - 26) V'nilogrn. We claim that the contribution from g and g' is at least dl/100. Let c = k - t and c' = k' - t'.

Since all c and c' gadgets in g and g' are not aligned, they contribute at least clo + c'l' where lo is the length of G(a) and 1' is the length of G'(b). Finally, we have additional contribution of max(0, d1o - c'l - clo) from sequence g' (sequence g' has more symbols 5, we subtract c'1' + clo symbols corresponding to the gadgets G and G'). The final contribution is clo + c' + max(0, dl - c'l - clo) > (C - 26)lo + dl/100.

4.3 Machine learning problems

Using the COV Conjecture we can show conditional hardness for Kernel Support Vector Machines, Kernel Principal Component Analysis, Kernel Ridge Regression, batch gradient computation in neural networks, optimizing the last layer of neural networks. This is can be done using the same reductions as in [BIS17J. Note that this generalizes the results from [BIS17] by simplyfing their hardness assumption. All their results are in the worst case; in the next section we generalize their results in a different direction: we give an average case hardness result for gradient computation in neural networks with ReLu activation functions.

76 4.3.1 Gradient computation in average case neural networks

We can show hardness for batch gradient computation in depth 3 neural networks when the input vectors and the neural networks come from a simple distribution.

4.3.2 Reduction

Consider the polynomial

g(A, B) SY 11(1 - akbk). aGA kc[d] beB COV implies that evaluating g(A, B) is hard on the worst case input A, B. Since g(A, B) E

{, ... , NM}, to evaluate g(A, B) it is sufficient to evaluate g(A, B)(mod p) for O(log N) dis- tinct primes p < (log N) 0 (1).2 Note that the degree of the polynomial g is 2d. By [BRSV17], the following connection holds. For any prime p > 100d, to evaluate g(A, B)(mod p), it is sufficient to evaluate g(A', B')(mod p), where each vector a' E A' and b' E B' has entries uniformly at random from [p]. Indeed, consider the polynomial h(t) = g(A + tA", B + tB"), where A" and B" have vectors with uniformly random entries and we think of the sets as n x d matrices. Note that the degree of h is 2d and that h(O) = g(A, B) is the quantity that we want to compute. Therefore, to evaluate h(0), it is sufficient to evaluate h(1), .. . , h(2d + 2) and interpolating the polynomial. Finally, note that A + tA" has entries uniformly random mod p for t that is not divisible by p. This gives us the average case result.

Theorem 38. Let p be a prime. Given a set of vectors B ; [p]d with |B| = M, we construct a neural network with the following properties.

e Given a vector a E [P]d, the network evaluates the function f(a) such that

f(a) = Y 1(1 - akbk) (mod p). bEB kE[d]

2 Indeed, by the Chinese remainder theorem it suffices to evaluate the expression on k distinct primes Pi, .-- , Pk such that p1 ... Pk > MN. The existance of k = O(log N) such primes pi < (log N)0 (1) (remember that M = No for a constant a > 0) follows from the theorem.

77 * The network is of size M(dp)o() and has 2 layers of hidden neurons.

* The network uses ReL U and the identity activation functions.

* Each of the weights in the network is either a constant or equal to an entry bi, i E [d] for some vector b E B.

We prove the theorem by constructing a neural network for every b C B of size (dp) 0 (') that has 3 layers and evaluates the function fb(a) such that

fb(a) = 7 (1 - akbk) (mod p). kE[d]

We construct the final network for f by putting all networks fb, b E B together. We will use the following lemma.

Lemma 39. For any set X C Z and a function g : X -+ Z we can contruct a neural network with a single input neuron, a single output neuron and O(|XI) hidden neurons such that the network outputs g(x) on every x c X.

Proof. We write g(x) = EEA g(a) (x > a] - [x > a + 1]). This expression can be easily expressed using a neural network with the ReLU activation function by observing that [x > a] = max(x - (a - 1), 0) - max(x - a, 0). L

We will use Lemma 39 to implement the discrete logarithm function defined as follows.

1(x) = -10d if x = 0(mod p) and 1(x) = y if x z 0(mod p), where y C [p] is the unique integer such that 29 = x(mod p).3 As a first step, we will implement function fb using a network with 5 layers of hidden neurons. Then we will decrease the number of hidden layers to 2. The network is as follows.

* The first layer of hidden neurons has d neurons that evaluate to 1 - akbk for every k E [d]. This can be achieved by connecting the k-th hidden neuron with the k-th input neuron (with value ak) and setting the weight of the connecting edge to -bk. We 3 fHAB02] use similar ideas to construct low-depth circuits for several computational problems.

78 can add 1 to the output because, without loss of generality, we can assume that the constant 1 is among the input neurons.

" For every neuron with value 1 - akbk we apply Lemma 39 with X {1 - p 2 to evaluate 1(1 - akbk). This adds 2 layers of hidden neurons. As a result, the k-th

neuron in the third layer of hidden neurons evaluates to 1(1 - akbk).

* We add the fourth layer of hidden neurons with a single neuron: it sums up all 1(1 -

akbk).

* We use Lemma 39 again but this time with function ', which is defined as follows. If X < 0, we set l'(x) = 0. If x > 0, we set l'(x) = y where y EE [p] is such that y = 2x(mod p). This adds the fifth layer of hidden neurons and in the last layer we have one neuron, which is the output neuron of the network.

Correctness of the construction. If fb(a) = 0(mod p), then there exists k E [d] such that 1 - akbk = 0(mod p). The corresponding neuron in the third layer evaluates to -10d. The large negative value forces the neuron in the fourth layer to evaluate to a negative quantity. This, by the definition of ' forces the network to output 0. Consider the case fb(a) # 0(mod p). This means that 1 - akbk # 0(mod p) for all k E [d]. By the definition of the discrete logarithm, the output of the network is the product of quantities 1 - akbk, which is what we needed.

Reducing the number of layers. We note that, if a layer consists only of gates with the identity activation function, we can remove this layer and connect the layer before and after with direct edges. This can increase the size of the network but by no more than a polynomial factor. This immediately implies that the network described above with 5 layers of hidden neurons can be made to have only 2 layers of hidden neurons. This is because the 1st, 3rd and 4th layers have gates with identity activation functions only. These layers can be removed and we are left with a network with 2 layers.

79 4.3.3 Hardness results

Theorem 38 allows us to prove the following two hardness results.

Theorem 40. For any constant a > 0, any d = w(log n) and any prime p with 100d < p < (log n)O(') the following holds. Let n be the size of a network where each weight of an edge is either a constant or equal to an integer ri for some index i. Integers ri are independent samples drawn uniformly at random from [p]. The network has a single output neuron. Let m = n' be the number of d-dimensional input vectors with entries drawn uniformly and independently at random from [p]. Let f(b) be the output of the network on the input vector b e B, where B denotes the set of input vectors. Computing ZbeB f(b) with probability of correctness > 1 - d- 10 requires Q(nm) 1-o() time unless the COV conjecture is false.

Proof. We use the network from Theorem 38. By the properties of the network, EbEB f(b) allows us to evaluate g(A, B)(mod p). The rest follows from the discussion at the beginning of Section 4.3.2. l

Theorem 41. Consider the network from Theorem 40. Let 1 : R -4 R be a function such that l'(0) # 0 (it has non-zero derivative at 0). Let wj be the weight of the j-th edge incoming to the output neuron. Computing (j 3 >j L with probability of correctness > 1 - d-0 requires Q(nm)l-o(l) time unless the COV conjecture is false.

Proof. We observe that the last neuron of the network has the identity activation function (it simply sums up the input values). We set wj = 0 for all j. Thus, for any b E B,

I(f a (b)) l'(0)f(b). The rest follows from the discussion at the beginning of Section 4.3.2.

4.4 Wiener index

Given a graph, the Wiener index of the graph is the sum of distances between all pairs of vertices. The following reduction was used in [RVW13 to show a quadratic conditional

80 lower bound assuming COV Conjecture for the diameter problem of deciding between a 2 diameter or a 3 diameter in a graph. Construct a graph with three layers. The first layer has one vertex for every vector from A, the third layer has one vertex for every vector from B. The second layer has one vertex for every one of d dimensions. We connect a vertex corresponding to a vector a c A with a vertex corresponding to dimension i if and only if ai = 0. Similarly, we connect a vertex corresponding to a vector b E B with a vertex corresponding to dimension i if and only if bi = 0. Since instead of asking for the diameter, the Wiener index asks for the sum of distances between all pairs of vertices, this construction is able to count the number of orthogonal pairs.

4.5 Dynamic graph problems

We consider the dynamic graph problems studied in [AW14I. It turns out that most of the SETH hard problems have a counting version for which it is possible to obtain hardness by a reduction from the Counting Orthogonal Vectors Problem. Below we first state the problem for which SETH hardness result was known and then we state the corresponding counting version.

" SC2. Maintain: directed graph, update: edge insertions/ deletions, query: "Are there more than 2 strongly connected components?". Assuming SETH, either amortized

update time or amortized query time is m1 0().

Counting version. Same updates, query: "What is the number of strongly connected components?".

" #SSR. Maintain: a directed graph with a fixed source s, update: edge insertions/deletions, query: "Given 1, is the number of nodes reachable from s less than I?". Assuming SETH, either amortized update time or amortized query time is m1 0(l).

81 Counting version is the same-we can reduce the Counting Orthogonal Vectors Prob- lem to #SSR.

" ConnSub. Maintain: a fixed undirected graph and a vertex subset S, update: in- sert/remove a node into/from S, query: "Is the subgraph induced by S connected?". Assuming SETH, either amortized update time or amortized query time is ml-"l).

Counting version. Same updates, query: "What is the size of the subgraph induced by S?".

" SubUnion. Maintain: a subset S of a fixed collection X = {X 1,... ,X"} of subsets over a universe U, wich EZ IX = m, update: insert/remove a set Xi into/from S, query:

"Is uxicsXi = U?". Assuming SETH, either amortized update time or amortized query time is ml-"().

Counting version. Same updates, query: "What is I Ux,,s XiI equal to?".

* 0-PP. Maintain: a collection X of subsets X 1, ... , Xk C [n], update: given i, j, insert Xi n Xj into X, query: "Given index i, is Xi = 0?". Assuming SETH, either amortized update time or amortized query time is nl-o(l).

Counting version. Same updates, query: "What is 1X21 equal to?".

" ST-Reach. Maintain: a directed graph and fixed node subsets S and T, update: edge insertions/deletions, query: "Are there some s E S, t E T s.t. t is unreachable from s?". Assuming SETH, either amortized update time or amortized query time is ml-o(1).

Counting version. Same updates, query: "What is the number of pairs s E S, t E T s.t. t is unreachable from s?".

4.5.1 Reductions framework

The reductions from SETH to the dynamic graph problems presented above all follow the same structure using the same graph construction. As noted in [AW14I, similar constructions are used in prior papers [CLR+14, PW10]. In this section we give a description of this general

82 construction. The set of variables V corresponding to the SAT formula are split into two sets U and V

U of size n/2 each. Three sets of nodes are created: S has 2 n/2 nodes, each corresponding

to a partial assignment of the variables in U, T has 2n/2 nodes, each corresponding to a partial assignment of the variables in V and C has 0(n) nodes each corresponding to one clause. Then the edges are added: one direct edge from each partial assignment s E S to a clause c C C iff s does not satisfy c, and a directed edge from c to a partial assigment t C T iff t does not satisfy c. Then, there is a satisfying assignment to the formula if and only if there is a pair of nodes s E S and t E T such that t is not reachable from s. Hence, any algorithm that can solve this static ST-reachability problem would decide the satisfyiability of the formula and violate SETH. [AW14I makes this static construction dynamic in the following way. We give now the description of the construction for #SSR(counting the number of nodes reachable from a source). Instead of having all nodes of S in the above graph G, in the dynamic graph G' there is a single node u. There are 2n/2 stages, one for each partial assignment s C S. In each stage, edges from u are added to C but only to the neighbors of s in G , i.e. the clauses that s does not satisfy. Suppose k edges have been inserted. After the insertions, the following query is asked: "Is the number of nodes reachable from s less than k + 2 n/2?"I. If the answer to the query is yes, then the formula is satisfiable and one can stop. Otherwise, s cannot be completed to a satisfying assignment. Then all inserted edges in this stage are removed and the move corresponding to the next partial assignment of S is executed. Note that this construction can easily be adapted to count the exact number of solutions to the formula. In particular, at each stage, one can ask "By how many oes is the number of nodes reachable from s less than k + 2n/2?. Keeping the sum of all querries in a counter gives us the total number of solutions. The reductions of the remaining problems in this section use a similar approach, with some extra work, however, in all cases, by adapting slightly altering the queries we obtain simplified hardness assumptions through counting.

4.6 Counting Matching Triangles

The $\Delta$-matching triangles problem was introduced in [AWY18] and asks: given a graph $G$ with colored nodes, is there a triple of distinct colors $a, b, c$ such that there are at least $\Delta$ triangles $(x, y, z)$ in $G$ in which $x$ has color $a$, $y$ has color $b$, and $z$ has color $c$? (In other words, are there $\Delta$ triangles with "matching" colors?) They give a reduction from SETH that uses the same split-and-list technique as the SETH-based lower bounds in the previous section, except that the variables are split into three equal sets rather than two. The same reduction goes through and shows conditional hardness for the counting version of the problem. The Counting Matching Triangles problem is as follows: given an unweighted graph with colored vertices and an integer $\Delta$, output the number of triples of colors such that there are at least $\Delta$ triangles with that triple of colors (a brute-force counter is sketched after Definition 42 below). This problem requires $n^{3(1-o(1))}$ time assuming the following conjecture, which is implied by #SETH.

Definition 42 (Counting 3-Orthogonal Vectors Conjecture). Given three sets $A, B, C \subseteq \{0,1\}^d$, each of size $n$, counting the number of triples $a \in A$, $b \in B$, $c \in C$ such that $\sum_{i=1}^{d} a_i b_i c_i = 0$ requires $n^{3-o(1)}$ time for any $d = \omega(\log n)$.
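Returning to the Counting Matching Triangles problem, the brute-force counter referred to above can be written as follows; the toy graph, coloring and threshold $\Delta$ are illustrative, and the cubic enumeration is exactly the baseline that the conditional lower bound says cannot be substantially improved.

```python
# Brute-force counter for the Counting Matching Triangles problem: given
# vertex colors and a threshold delta, count the triples of distinct colors
# that appear on at least delta triangles. Purely an illustrative baseline.
from itertools import combinations
from collections import Counter

def count_matching_triangles(n, edges, color, delta):
    adj = [[False] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = True
    per_triple = Counter()
    for x, y, z in combinations(range(n), 3):            # O(n^3) triangle enumeration
        if adj[x][y] and adj[y][z] and adj[x][z]:
            cols = frozenset((color[x], color[y], color[z]))
            if len(cols) == 3:                            # three distinct colors
                per_triple[cols] += 1
    return sum(1 for t in per_triple.values() if t >= delta)

# toy instance: a 5-cycle plus two chords
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2), (1, 3)]
print(count_matching_triangles(5, edges, color=[0, 1, 2, 0, 1], delta=1))
```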

4.7 Average case hardness for the Counting Orthogonal Vectors Problem

We can show average case hardness for the Counting Orthogonal Vectors Problem assuming the worst case hardness for the minimum-weight k-Clique Problem.

Preliminaries

Notation Given a set $S$ and an integer $i$, let $\binom{S}{i}$ denote the set of all subsets of $S$ of size $i$.

For an integer $i$, $[i] := \{0, \ldots, i - 1\}$.

Definition 43 (d-hypergraphs). A $d$-hypergraph $G = (V, E)$ is a set of vertices $V$ and a set of edges $E \subseteq \binom{V}{d}$. We call a hypergraph $k$-partite if the vertices are partitioned into $k$ parts such that each edge intersects each part in at most one vertex.

Definition 44 (k-clique). Given a $d$-hypergraph $G = (V, E)$, a subset $S \subseteq V$ with $|S| = k$ is a $k$-clique if $\binom{S}{d} \subseteq E$.

Definition 45 (k-Clique Problem). Given a hypergraph, decide if it contains a k-clique.

Definition 46 (Exact-Weight-k-Clique Problem). Given a $d$-hypergraph $G = (V, E)$ with a weight function $w : E \to [n^{O(k)}]$ and an integer $t$, the Exact-Weight-$k$-Clique Problem asks to decide if the graph $G$ has a $k$-clique of weight exactly $t$, that is, whether there exists a $k$-clique $S \subseteq V$ with $\sum_{T \in \binom{S}{d}} w(T) = t$.

Conjecture 47 (k-Clique Conjecture). Solving the Exact-Weight-$k$-Clique Problem on 2-hypergraphs (ordinary graphs) requires $n^{k-o(1)}$ time.

Definition 48 (k-Orthogonal Vectors (k-OV) Problem). Given $k$ sets $A^1, \ldots, A^k \subseteq \{0,1\}^D$ with $|A^i| = n$ for every $i = 1, \ldots, k$, decide if there exist $a^i \in A^i$ such that $\sum_{j=1}^{D} \prod_{i=1}^{k} a^i_j = 0$ (over the integers).

4.7.1 Reduction

[ABDN18] shows the following result.

Theorem 49. Let $1 < d < k$ be integer constants. There is an $n^{2d+o(1)}$-time oracle reduction from the Exact-Weight-$k$-Clique Problem on $d$-hypergraphs to the (unweighted) $k$-Clique Problem on $2d$-hypergraphs. If the input has $n$ vertices, every oracle query has $n$ vertices and the reduction uses at most $n^{o(1)}$ queries.

We use the above theorem with $d = 2$ and obtain the following hardness result: solving $k$-Clique on 4-hypergraphs cannot be done in $O(n^{k(1-\varepsilon)})$ time unless Exact-Weight-$k$-Clique on 2-hypergraphs can be solved in $O(n^{k(1-\varepsilon')})$ time, which in turn would contradict the $k$-Clique Conjecture.

Corollary 50. Assuming the $k$-Clique Conjecture, the $k$-Clique Problem on 4-hypergraphs requires $n^{k-o(1)}$ time.

Consider a 4-hypergraph $G = (V, E)$ and let $u : \binom{V}{4} \to \{0, 1\}$ be the function defined as follows: $u(T) = 1$ if $T \in E$ and $u(T) = 0$ otherwise. Consider the following polynomial:

$$g(u) := \sum_{S \in \binom{V}{k}} \prod_{T \in \binom{S}{4}} u(T).$$

By Corollary 50, evaluating it requires $n^{k-o(1)}$ time. Note that the degree of this polynomial is $\binom{k}{4}$, which is constant when $k$ is constant. Thus, evaluating $g(u) \pmod p$ for a large enough constant prime $p$ is hard on average when $u(T) \in [p]$ is uniformly random and independent for all $T \in \binom{V}{4}$. We adapt a reduction from [GR18] to prove the following theorem.

Theorem 51. Let $k, p > 1$ be constant integers. Given a function $u : \binom{[n]}{4} \to [p]$, in $O(n^4)$ time one can construct a 4-hypergraph $G = (V, E)$ with $|V| = O(n)$ such that the number of $k$-cliques in it is equal to $g'(u) \cdot k!$, where

$$g'(u) := \sum_{S \in \binom{[n]}{k}} \prod_{T \in \binom{S}{4}} u(T).$$

Furthermore, the constructed graph is k-partite.

Proof. First, we define a $k$-partite 4-hypergraph $G' = ([nk], E')$ with parts $\{in, \ldots, (i+1)n - 1\}$ for all $i \in [k]$. We have an edge $T' \in E'$ if $T'$ intersects each part in at most one vertex and $|T| = 4$, where $T = \{x \bmod n \mid x \in T'\}$. We define a function $u' : E' \to [p]$ by setting $u'(T') = u(T)$.

For a vector $z \in [p]^{\binom{[k]}{4}}$, let $G'_z$ be the unweighted graph in which we keep only those edges $T'$ of $G'$ for which $z_i < u'(T')$, where $i \in \binom{[k]}{4}$ indexes the set of 4 parts that $T'$ intersects.

The final graph $G$ is simply a disjoint union of all $p^{\binom{k}{4}}$ graphs $G'_z$. Thus, the number of vertices in the graph $G$ is $p^{\binom{k}{4}} \cdot nk = O(n)$. The promise follows from the observation that each clique $S'$ in $G'$ appears in $\prod_{T' \in \binom{S'}{4}} u'(T') = \prod_{T \in \binom{S}{4}} u(T)$ graphs $G'_z$, where $S = \{x \bmod n \mid x \in S'\}$. We get an additional multiplicative factor of $k!$ because for every set $S$ there are $k!$ sets $S'$. □
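The counting identity of Theorem 51 is easy to check numerically on tiny instances, under the reading of the construction given in the proof above (the parameters $n = k = 5$, $p = 3$ and the particular function $u$ below are arbitrary illustrative choices):

```python
# Toy check of Theorem 51: the union of the p^C(k,4) thresholded copies of
# the k-partite 4-hypergraph has exactly k! * g'(u) k-cliques.
import itertools
from math import factorial

n, k, p = 5, 5, 3                       # tiny illustrative parameters (n >= k, k >= 5)
quads = list(itertools.combinations(range(n), 4))
u = {T: 1 + sum(T) % 2 for T in quads}  # some function u : binom([n],4) -> [p]

def g_prime(u):
    """g'(u) = sum over k-subsets S of [n] of the product of u over binom(S,4)."""
    total = 0
    for S in itertools.combinations(range(n), k):
        prod = 1
        for T in itertools.combinations(S, 4):
            prod *= u[T]
        total += prod
    return total

# Vertex v of a copy lives in part v // n and has residue v % n.
part_index = list(itertools.combinations(range(k), 4))   # indexes binom([k],4)

def edges_of_copy(z):
    """Edges of G'_z: 4 vertices in 4 distinct parts with distinct residues, kept iff z_i < u'(T')."""
    E = set()
    for i, four_parts in enumerate(part_index):
        for residues in itertools.permutations(range(n), 4):
            if z[i] < u[tuple(sorted(residues))]:
                E.add(frozenset(four_parts[j] * n + residues[j] for j in range(4)))
    return E

def cliques_in_copy(E):
    # Every edge uses one vertex per part and four distinct residues, so (for
    # k >= 5) every k-clique picks one vertex per part with pairwise distinct
    # residues; scanning injective residue assignments therefore suffices.
    count = 0
    for residues in itertools.permutations(range(n), k):
        verts = [part * n + residues[part] for part in range(k)]
        if all(frozenset(T) in E for T in itertools.combinations(verts, 4)):
            count += 1
    return count

total = sum(cliques_in_copy(edges_of_copy(z))
            for z in itertools.product(range(p), repeat=len(part_index)))
assert total == factorial(k) * g_prime(u), (total, g_prime(u))
print("k-cliques in the union:", total, "= k! * g'(u)")
```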

We reduce the problem of counting k-cliques to the problem of counting solutions of a k-OV instance.

Theorem 52 (Lemma 3.6 in [ABDN18]). Let $k > 4$ be an integer. Given a $k$-partite 4-hypergraph with $n$ vertices in each part, in $O(n^4)$ time it is possible to construct a $k$-OV instance with $n$ vectors in each set such that the number of $k$-cliques in the original graph is equal to the number of solutions to the $k$-OV instance. The dimensionality of the vectors is $D = O(n^4)$.

Proof. Let $V_1, \ldots, V_k$ be the parts of the 4-hypergraph $G = (V, E)$. Define
$$\bar{E} = \Big\{ T \subseteq V : |T| = 4,\ |T \cap V_i| \le 1 \text{ for all } i \in [k] \Big\} \setminus E$$
to be the set of non-edges of the graph $G$. A set of vertices $v_i \in V_i$, $i \in [k]$, forms a clique in $G$ if and only if all $T \in \bar{E}$ satisfy $T \not\subseteq \{v_1, \ldots, v_k\}$. We construct an instance $A^1, \ldots, A^k$ of $k$-OV as follows. For each $v \in V_i$, we create a vector $a \in A^i \subseteq \{0, 1\}^{\bar{E}}$ as follows. If $T \in \bar{E}$ is disjoint from $V_i$, we set $a_T = 1$. If $T \cap V_i = \{v\}$, we set $a_T = 1$. Otherwise we set $a_T = 0$. It remains to prove the correctness of the reduction, which we do not include here. □
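The following toy script checks this reduction numerically on a small random $k$-partite 4-hypergraph, following the vector construction as described in the proof sketch above (it is a sketch of that description, not claimed to be literally the construction of [ABDN18]); the sizes and the edge probability are illustrative.

```python
# Toy check: the k-OV instance built from the non-edges of a k-partite
# 4-hypergraph has exactly as many solutions as the hypergraph has k-cliques.
import itertools
import random

k, n = 5, 3                                   # k parts, n vertices per part (illustrative)
random.seed(0)
parts = [[(i, v) for v in range(n)] for i in range(k)]

# candidate hyperedges: 4 vertices in 4 distinct parts; keep each with probability 0.8
candidates = [frozenset(t)
              for four in itertools.combinations(range(k), 4)
              for t in itertools.product(*(parts[i] for i in four))]
E = {T for T in candidates if random.random() < 0.8}
non_edges = [T for T in candidates if T not in E]   # one vector coordinate per non-edge

def vector_for(i, v):
    """Vector for vertex v of part i: 1 on non-edges that do not rule this vertex out."""
    vec = []
    for T in non_edges:
        in_part = [w for (j, w) in T if j == i]
        vec.append(1 if (not in_part or in_part == [v]) else 0)
    return vec

A = [{v: vector_for(i, v) for (_, v) in parts[i]} for i in range(k)]

cliques = sum(1 for choice in itertools.product(*parts)
              if all(frozenset(T) in E for T in itertools.combinations(choice, 4)))

# k-OV solutions: tuples whose coordinate-wise product vanishes on every coordinate
solutions = sum(1 for choice in itertools.product(*parts)
                if all(any(A[i][v][c] == 0 for (i, v) in choice)
                       for c in range(len(non_edges))))

assert cliques == solutions
print("k-cliques:", cliques, "= k-OV solutions:", solutions)
```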

Finally, we reduce the problem of counting solutions of a $k$-OV instance to the problem of counting solutions of a 2-OV instance.

Theorem 53 (Lemma 3.7 in [ABDN18]). Let $k > 2$ be an even integer. Given a $k$-OV instance $A^1, \ldots, A^k \subseteq \{0,1\}^D$ with $|A^i| = n$ for $i = 1, \ldots, k$, in $O(n^{k/2} D)$ time we can reduce it to a 2-OV instance $A, B \subseteq \{0,1\}^D$ with $|A| = |B| = n^{k/2}$. The number of solutions is preserved.

Proof. For every choice of $a^i \in A^i$ for $i = 1, \ldots, k/2$, we add a vector $a \in A$ such that $a_j = \prod_{i=1}^{k/2} a^i_j$. Do the same for $B$, except work with $i = (k/2) + 1, \ldots, k$. □
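A quick numerical check of Theorem 53 on a random instance: grouping the $k$ sets into two halves and taking coordinate-wise products preserves the number of solutions. All parameters below are illustrative.

```python
# Toy check of the k-OV -> 2-OV split used in Theorem 53.
import itertools
import random

k, n, D = 4, 3, 6
random.seed(2)
sets = [[[random.randint(0, 1) for _ in range(D)] for _ in range(n)] for _ in range(k)]

def coordinate_product(vectors):
    """Coordinate-wise AND/product of a tuple of 0/1 vectors."""
    return [int(all(v[c] for v in vectors)) for c in range(D)]

A = [coordinate_product(choice) for choice in itertools.product(*sets[: k // 2])]
B = [coordinate_product(choice) for choice in itertools.product(*sets[k // 2:])]

# count 2-OV solutions
two_ov = sum(1 for a in A for b in B if all(a[c] * b[c] == 0 for c in range(D)))

# count k-OV solutions directly
k_ov = sum(1 for choice in itertools.product(*sets)
           if all(any(v[c] == 0 for v in choice) for c in range(D)))

assert two_ov == k_ov
print("k-OV solutions:", k_ov, "= 2-OV solutions:", two_ov)
```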

Appendix A

Vector Gadgets

A.1 Vector Gadgets

For completeness, in this section we describe the vector gadgets introduced in [BI15] and give the intuition about why they indeed satisfy eq 4.1.

The sequences are defined over the alphabet $\Sigma = \{0, 1, 2, 3, 4\}$, and the final purpose is to construct gadgets $G(a)$ and $G'(b)$ for vectors $a, b \in \{0,1\}^d$ whose edit distance depends on whether $a$ and $b$ are orthogonal or not. Let $l_0 \triangleq 10d$. Define coordinate gadget sequences $CG$ and $CG'$ as follows. For $x \in \{0, 1\}$ set

$$CG(x) \triangleq \begin{cases} 2^{l_0}\,0111\,2^{l_0} & \text{if } x = 0; \\ 2^{l_0}\,0001\,2^{l_0} & \text{if } x = 1. \end{cases}$$

$$CG'(x) \triangleq \begin{cases} 2^{l_0}\,0011\,2^{l_0} & \text{if } x = 0; \\ 2^{l_0}\,1111\,2^{l_0} & \text{if } x = 1. \end{cases}$$

The coordinate gadgets are designed such that for any two $x, x' \in \{0, 1\}$:

$$\mathrm{edit}(CG(x), CG'(x')) = \begin{cases} 1 & \text{if } x \cdot x' = 0; \\ 3 & \text{if } x \cdot x' = 1. \end{cases}$$
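This 1-versus-3 property can be verified directly with a standard Levenshtein dynamic program, using the coordinate gadgets as reconstructed above (so the check is only as reliable as that reconstruction):

```python
# Check of the 1-vs-3 property of the coordinate gadgets as reconstructed
# above (middle blocks 0111/0001 and 0011/1111, padded by 2^{l0} on each side).
def edit(s, t):
    """Levenshtein distance via a rolling-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

d = 2
l0 = 10 * d
pad = "2" * l0
CG  = {0: pad + "0111" + pad, 1: pad + "0001" + pad}
CGp = {0: pad + "0011" + pad, 1: pad + "1111" + pad}

for x in (0, 1):
    for xp in (0, 1):
        dist = edit(CG[x], CGp[xp])
        expected = 3 if x * xp == 1 else 1
        assert dist == expected, (x, xp, dist)
        print(f"edit(CG({x}), CG'({xp})) = {dist}")
```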

Define another parameter $l_1 \triangleq (10d)^2$. We use $\sum$-style notation to denote concatenation of sequences. For vectors $a, a', b \in \{0, 1\}^d$, define the vector gadget sequences as

$$G(a, a') \triangleq Z_1\, L(a)\, V_0\, R(a')\, Z_2,$$

$$G'(b) \triangleq V_1\, D(b)\, V_2,$$

where the parameters are set as follows:

$$V_0 = V_1 = V_2 \triangleq 3^{l_1},$$

$$Z_1 = Z_2 \triangleq 4^{l_1},$$

$$L(a) \triangleq \sum_{i \in [d]} CG(a_i), \qquad R(a') \triangleq \sum_{i \in [d]} CG(a'_i),$$

$$D(b) \triangleq \sum_{i \in [d]} CG'(b_i).$$

Let $l \triangleq |L| = |R| = |D| = d(4 + 2 l_0)$ be the common length of $L$, $R$ and $D$. Given three vectors $a, a', b \in \{0, 1\}^d$, $G(a, a')$ and $G'(b)$ are constructed such that their edit distance grows linearly in the minimum of $a \cdot b$ and $a' \cdot b$, i.e.

$$\mathrm{edit}(G(a, a'), G'(b)) = c + 2 \cdot \min(a \cdot b,\, a' \cdot b). \tag{A.1}$$

To achieve this, $G$ and $G'$ are constructed such that there are only two possibilities to achieve small edit distance; in each case it grows linearly in either $a \cdot b$ or $a' \cdot b$. More precisely, the minimum edit distance between $G$ and $G'$ is achieved by following one of the two possible sequences of operations below:

1. Case 1. Delete $Z_1$ and $L$, and substitute $Z_2$ with $V_2$. This costs $c' \triangleq |Z_1| + |L| + |Z_2| = 2 l_1 + l$. Transform $R$ and $D$ into the same sequence by transforming the corresponding coordinate gadgets into the same sequences. By the construction of the coordinate gadgets, the cost of this step is $d + 2 \cdot (a' \cdot b)$. Therefore, this case corresponds to edit distance cost $c' + d + 2 \cdot (a' \cdot b) = c + 2 \cdot (a' \cdot b)$ for $c \triangleq c' + d$.

2. Case 2. Delete $R$ and $Z_2$, and substitute $Z_1$ with $V_1$. This costs $c'$. Transform $L$ and $D$ into the same sequence by transforming the corresponding coordinate gadgets. Similarly as before, the cost of this step is $d + 2 \cdot (a \cdot b)$. Therefore, this case corresponds to edit distance cost $c' + d + 2 \cdot (a \cdot b) = c + 2 \cdot (a \cdot b)$.

Equation (A.1) is obtained by taking the minimum over the two cases above. Showing that these are indeed the optimal alignments, and that no other alignment gives a smaller distance, requires a more detailed case analysis, for which we refer the reader to [BI15].

To obtain the final form of equation (4.1), [BI15] make the following simplifications. They assume that in the orthogonal vectors problem, $b_1 = 1$ for all vectors $b \in B$. They can make this assumption without loss of generality because one can always add a 1 to the beginning of each $b \in B$ and a 0 to the beginning of each $a \in A$ without changing the orthogonality of any pair of vectors. In this way, they can ensure that the dot product $a' \cdot b$ is always equal to 1 by setting $a'$ to be the vector $1 0^{d-1}$ (a single 1 followed by zeros). This defines the final gadget $G(a) \triangleq G(a, 10^{d-1})$, with the property that $\mathrm{edit}(G(a), G'(b))$ is small if the vectors $a$ and $b$ are orthogonal, and slightly larger otherwise:

$$\mathrm{edit}(G(a), G'(b)) = \begin{cases} C_0 & \text{if } a \cdot b = 0; \\ C_1 & \text{otherwise,} \end{cases}$$

for $C_1 > C_0$. This is crucial because it guarantees that the sum of several terms $\mathrm{edit}(G(a), G'(b))$ is smaller than some threshold if and only if $a \cdot b = 0$ for at least one pair of vectors $a$ and $b$, so that one can detect whether such a pair exists. In contrast, this would not hold if $\mathrm{edit}(G(a), G'(b))$ depended linearly on the value of $a \cdot b$, which makes the need for $a'$ clearer. In the construction presented in this chapter we introduced additional padding as well as a permutation of the gadgets to ensure concentration of orthogonal pairs among suffixes. In this way we are able not only to detect whether an orthogonal pair exists but to count their number exactly.

Appendix B

Useful lemmas

B.1 Linear minimum mean-square-error estimation

Given random variables $Y$ and $X$ (the latter can more generally be a vector), a natural question is: what is the best prediction for $Y$ conditioned on knowing $X = x$? What is considered "best" can vary, but usually we consider the mean-square error. That is, we want to come up with an estimator $\hat{y}(x)$ such that $\mathbb{E}[(Y - \hat{y}(X))^2]$ is minimized. It is not hard to show that $\hat{y}(x)$ is just the conditional expectation of $Y$ given $X = x$. The minimum mean-square-error estimate can be a highly nontrivial function of $X$. The linear minimum mean-square-error (LMMSE) estimate instead restricts attention to estimators of the form $\hat{y} = AX + b$. Notice here that $A$ and $b$ are fixed and are not functions of $X$.

One can show that the LMMSE estimator is given by:

$$A = \Sigma_{YX}\, \Sigma_{XX}^{-1},$$

where $\Sigma$ is the appropriately indexed covariance matrix, and $b$ is chosen in the obvious way to make the estimator unbiased (i.e. $b = \mathbb{E}[Y] - A\,\mathbb{E}[X]$).
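As a small illustration, the following snippet estimates the LMMSE coefficients from synthetic jointly Gaussian data and compares them against the generating linear model; the data-generating process and all parameters are illustrative.

```python
# Minimal numerical illustration of the LMMSE estimator yhat = A X + b with
# A = Sigma_YX Sigma_XX^{-1} and b = E[Y] - A E[X], on synthetic Gaussian data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200_000, 3
X = rng.standard_normal((n, d))
true_w = np.array([1.5, -2.0, 0.5])
Y = X @ true_w + 0.3 * rng.standard_normal(n)      # Y is linear in X plus noise

mu_X, mu_Y = X.mean(axis=0), Y.mean()
Sigma_XX = np.cov(X, rowvar=False)
Sigma_YX = ((Y - mu_Y)[:, None] * (X - mu_X)).mean(axis=0)

A = Sigma_YX @ np.linalg.inv(Sigma_XX)
b = mu_Y - A @ mu_X
Y_hat = X @ A + b

print("recovered A:", np.round(A, 2))               # close to true_w
print("MSE:", np.mean((Y - Y_hat) ** 2))            # close to 0.3^2
```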

B.2 Calculations for linear model from Section 2.2.3

To recap our setup, we feed the design matrix $X = X_{-i}$ and the response variable $y = X_i$ as inputs to an SLR blackbox. Our goal is to express $y$ as a linear function of $X$ plus some independent noise $w$. Without loss of generality let $i = 1$, and for the discussion below assume $S = \{1, \ldots, k\}$. For illustration, at times we will simplify the calculation further for the uniform case where $u_i = \frac{1}{\sqrt{k}}$ for $1 \le i \le k$ and $u_i = 0$ for $i > k$. For the moment, consider just one row of $X$, corresponding to one particular sample $X$ of the original SPCA distribution. Since $X$ is jointly Gaussian, we can express (the conditional expectation of) $y = X_1$ as a linear function of the other coordinates:

$$\mathbb{E}[X_1 \mid X_{2:d} = x_{2:d}] = \Sigma_{1,2:d}\, (\Sigma_{2:d})^{-1} x_{2:d}.$$

Hence we can write

$$X_1 = \Sigma_{1,2:d}\, (\Sigma_{2:d})^{-1} X_{2:d} + w,$$

where $w \sim \mathcal{N}(0, \sigma^2)$ for some $\sigma^2$ to be determined, and $w \perp X_i$ for $i = 2, \ldots, d$.

By directly computing the variance of the above expression for $X_1$, we deduce an expression for the noise level:

$$\sigma^2 = \Sigma_{11} - \Sigma_{1,2:d}\,(\Sigma_{2:d})^{-1}\,\Sigma_{2:d,1}.$$

Note that $\sigma^2$ is just $\Sigma_{11}$ under $H_0$. We proceed to compute $\sigma^2$ under $H_1$, when $\Sigma = I_d + \theta u u^T$. To compute $(\Sigma_{2:d})^{-1}$, we use (a special case of) the Sherman-Morrison formula,

$$(I + \theta v v^T)^{-1} = I - \frac{\theta}{1 + \theta\|v\|^2}\, v v^T,$$

which gives

$$(\Sigma_{2:d})^{-1} = (I_{d-1} + \theta u_{-1} u_{-1}^T)^{-1} = I_{d-1} - \frac{\theta}{1 + (1 - u_1^2)\theta}\, u_{-1} u_{-1}^T,$$

where $u_{-1} \in \mathbb{R}^{d-1}$ is $u$ restricted to coordinates $2, \ldots, d$ (so that $\|u_{-1}\|^2 = 1 - u_1^2$).

Hence,

$$\Sigma_{1,2:d}\,(\Sigma_{2:d})^{-1}\,\Sigma_{2:d,1} = \theta^2 u_1^2\; u_{-1}^T \Big( I_{d-1} - \frac{\theta}{1 + (1 - u_1^2)\theta}\, u_{-1} u_{-1}^T \Big) u_{-1} = \frac{\theta^2 u_1^2 (1 - u_1^2)}{1 + (1 - u_1^2)\theta}$$

$$= \frac{\theta^2\,\frac{k-1}{k^2}}{1 + \frac{k-1}{k}\theta} \approx \frac{\theta^2}{k(1+\theta)} \qquad \text{(specializing to the uniform case again)}.$$

Finally, substituting into the expression for $\sigma^2$:

$$\sigma^2 = 1 + \theta u_1^2 - \frac{\theta^2 u_1^2 (1 - u_1^2)}{1 + (1 - u_1^2)\theta} = 1 + \frac{\theta u_1^2}{1 + (1 - u_1^2)\theta} < 2 \quad \text{if } \theta < 1.
$$

We remark that the noise level of column 1 has been reduced by roughly $\tau := \frac{\theta^2}{k(1+\theta)}$ (in the uniform case) by regressing on the correlated columns.

In summary, under $H_1$ (and if $1 \in S$) we can write

$$y = X\beta^* + w,$$

where

$$\beta^* = (\Sigma_{2:d})^{-1}\, \Sigma_{2:d,1} = \theta u_1 \Big( I_{d-1} - \frac{\theta}{1 + (1 - u_1^2)\theta}\, u_{-1} u_{-1}^T \Big) u_{-1} = \frac{\theta u_1}{1 + (1 - u_1^2)\theta}\, u_{-1}$$

(technically, $\beta^*$ as defined on the right-hand side is a $(k-1)$-dimensional vector, but we augment it with zeros to make it $(d-1)$-dimensional) and $w \sim \mathcal{N}(0, \sigma_w^2)$ where $\sigma_w^2 = 1 + \frac{\theta u_1^2}{1 + (1 - u_1^2)\theta}$. Note that in the uniform case, $\beta^* \to \frac{1}{k-1}\mathbf{1}_{k-1}$ as $\theta \to \infty$, where $\mathbf{1}_{k-1}$ is the indicator of the first $k-1$ coordinates, as expected.
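The closed forms above are easy to sanity-check numerically; the following snippet does so for an arbitrary illustrative choice of $d$, $k$ and $\theta$ in the uniform case.

```python
# Numerical check of the closed forms above: with Sigma = I_d + theta*u*u^T,
# regressing X_1 on X_{2:d} gives
#   beta* = theta*u_1 / (1 + (1 - u_1^2)*theta) * u_{-1}
#   sigma^2 = 1 + theta*u_1^2 / (1 + (1 - u_1^2)*theta).
import numpy as np

d, k, theta = 10, 4, 2.5
u = np.zeros(d)
u[:k] = 1 / np.sqrt(k)
Sigma = np.eye(d) + theta * np.outer(u, u)

Sigma_11 = Sigma[0, 0]
Sigma_1r = Sigma[0, 1:]               # Sigma_{1,2:d}
Sigma_rr = Sigma[1:, 1:]              # Sigma_{2:d}

beta_star = np.linalg.solve(Sigma_rr, Sigma_1r)
sigma2 = Sigma_11 - Sigma_1r @ beta_star

u1, u_rest = u[0], u[1:]
beta_closed = theta * u1 / (1 + (1 - u1**2) * theta) * u_rest
sigma2_closed = 1 + theta * u1**2 / (1 + (1 - u1**2) * theta)

assert np.allclose(beta_star, beta_closed)
assert np.isclose(sigma2, sigma2_closed)
print("beta* matches closed form; sigma^2 =", sigma2)
```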

B.3 Properties of design matrix X

We give some intuition for why for SLR certain properties of the design matrix are desirable and natural for signal recovery (though not necessarily in the sense of minimizing prediction error).

One commonly used property is incoherence and (its generalization) the restricted isometry property (RIP). The intuition behind incoherence is that we want the error due to undersampling¹ to look roughly like noise. To put it another way, incoherence measures the tendency of linear reconstruction to leak energy from the true underlying source to other sources; we want to spread this out as uniformly as possible over all sources.

While incoherence just looks at pairs of vectors, RIP generalizes this by looking at subsets of $k$ vectors. Though it is hopeless for an $n \times d$ matrix to be well-conditioned² when $n < d$, RIP of order $k$ means that we only need the matrix to be well-conditioned when restricted to any submatrix spanned by $k$ columns.

It turns out that a much weaker condition, the restricted eigenvalue (RE) condition, suffices for the guarantees of certain SLR algorithms. RE says that the design matrix has its smallest eigenvalue bounded away from zero over a restricted set of directions. In fact, even more generally, this property corresponds to restricted strong convexity for real-valued functions. Strong convexity here means that

¹This expression is from compressed sensing, but basically means the same high-dimensional setting we have been discussing, where the linear system is under-determined. ²Generally, a function is said to be well-conditioned if the output value varies little relative to the change in the input value; the condition number of a matrix is the ratio of its largest to smallest singular value.

the Hessian of the function we are optimizing is strictly positive definite; this implies the function is sufficiently well-conditioned. Restricted means we only need strong convexity to hold in a certain restricted set of directions; usually what suffices is the cone of directions spanned³ by "roughly" sparse vectors. In the case of SLR, strong convexity of the $\ell_2$ reconstruction error is exactly equivalent to the design matrix having a restricted eigenvalue.

It turns out that RSC, together with a property called decomposability of the regularizer, is sufficient to imply very general results on the performance of a certain class of M-estimators for high-dimensional statistical tasks with low-dimensional structure. See [NYWR09] for a lengthier discussion and general results along these lines.

Restricted eigenvalue (RE) Here we check that $X$ defined as in Section 2.2.3 has a constant restricted eigenvalue. This allows us to apply Condition 4 for the SLR blackbox with a good guarantee on prediction error.

The rows of $X$ are drawn from $\mathcal{N}(0,\, I_{d-1} + \theta u_{-1} u_{-1}^T)$, where $u_{-1}$ is $u$ restricted to coordinates $2, \ldots, d$ wlog.⁴

Let $\bar{\Sigma} = I_{d-1} + \theta u_{-1} u_{-1}^T$. We can show that $\bar{\Sigma}^{1/2}$ satisfies RE with $\gamma = 1$ by bounding $\bar{\Sigma}$'s minimum eigenvalue. First, we compute the eigenvalues of $\theta u_{-1} u_{-1}^T$: it has a nullspace of dimension $d - 2$, so eigenvalue 0 has multiplicity $d - 2$, and $u_{-1}$ is an eigenvector with eigenvalue $\theta u_{-1}^T u_{-1} = \theta \frac{k-1}{k}$. Therefore, $\bar{\Sigma}$ has eigenvalues 1 and $1 + \theta \frac{k-1}{k}$, so its minimum eigenvalue is 1.
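A quick numerical confirmation of this eigenvalue computation (parameters illustrative):

```python
# I_{d-1} + theta * u_{-1} u_{-1}^T has eigenvalue 1 with multiplicity d-2
# and a single eigenvalue 1 + theta*(k-1)/k.
import numpy as np

d, k, theta = 12, 5, 3.0
u = np.zeros(d)
u[:k] = 1 / np.sqrt(k)
u_rest = u[1:]
Sigma_bar = np.eye(d - 1) + theta * np.outer(u_rest, u_rest)

eigs = np.sort(np.linalg.eigvalsh(Sigma_bar))
assert np.allclose(eigs[:-1], 1.0)
assert np.isclose(eigs[-1], 1 + theta * (k - 1) / k)
print("eigenvalues:", np.round(np.unique(np.round(eigs, 6)), 6))
```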

Now we can extend this to the sample matrix X by applying Corollary 1 of [RWY10]

³In the sense that the $\ell_1$-weight outside a sparse support is not much larger than the $\ell_1$-weight on the support. ⁴We assume here that $1 \in S$, as in the previous section.

(also see Example 3 therein), and conclude that as soon as

$$n \ge C\, \frac{\max_i \bar{\Sigma}_{ii}}{\gamma^2}\, k \log d = C \Big( 1 + \frac{\theta}{k} \Big) k \log d,$$

or $n = \Omega(k \log d)$, the matrix $X$ satisfies RE with $\gamma(X) = 1/8$.

We remark that the following small technical condition also appears in known bounds on prediction error:

Column normalization This is a condition on the scale of $X$ relative to the noise in SLR, which is always $\sigma$:

$$\frac{\|X\theta\|_2^2}{n} \le 10\,\|\theta\|_2^2 \qquad \text{for all } \theta \in B_0(2k).$$

We can always rescale the original data $X$ (and hence the design matrix $X$) to satisfy this, which would also rescale the noise level $\sigma$ in our linear model, since the noise is derived from $X$ coming from the SPCA generative model rather than added independently as in the usual SLR setup.

Hence, since all scale-dependent quantities are scaled by the same amount when we scale the original data $X$, wlog we may continue to use the same $X$ and $\sigma$ in our analysis. As the column normalization condition does not affect us, we drop it from Condition 4 of our blackbox assumption.

B.4 Tail inequalities - Chi-squared

Lemma 54 (Concentration on upper and lower tails of the $\chi^2$ distribution ([LM00], Lemma 1)). Let $Z$ be a $\chi^2$ random variable with $k$ degrees of freedom. Then,

$$\Pr\big(Z - k \ge 2\sqrt{kt} + 2t\big) \le \exp(-t),$$

$$\Pr\big(k - Z \ge 2\sqrt{kt}\big) \le \exp(-t).$$

We can simplify the upper tail bound as follows for convenience:

Corollary 55. For a $\chi^2$ random variable $Z$ with $k$ degrees of freedom and deviation $t \ge 1$, $\Pr(Z \ge 4kt) \le \exp(-kt)$.
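A quick Monte Carlo sanity check of the upper-tail inequality in Lemma 54 (the number of samples, the degrees of freedom and the values of $t$ below are arbitrary):

```python
# Monte Carlo check of the upper-tail bound in Lemma 54:
# Pr(Z - k >= 2*sqrt(k*t) + 2*t) <= exp(-t) for Z ~ chi^2_k.
import numpy as np

rng = np.random.default_rng(0)
k, trials = 6, 2_000_000
Z = rng.chisquare(k, size=trials)

for t in (0.5, 1.0, 2.0, 4.0):
    threshold = k + 2 * np.sqrt(k * t) + 2 * t
    empirical = np.mean(Z >= threshold)
    print(f"t={t}: empirical tail {empirical:.2e} <= bound {np.exp(-t):.2e}")
```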


M.-W, M"IFIR!"T 1 .11 IMPRIM-111 I I - - -F,- - - 100 Bibliography

[ABDN18] Amir Abboud, Karl Bringmann, Holger Dell, and Jesper Nederlof. More Consequences of Falsifying SETH and the Orthogonal Vectors Conjecture. STOC, 2018.

[ABW15] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 59-78. IEEE, 2015.

[AFH+12] Anima Anandkumar, Dean P Foster, Daniel J Hsu, Sham M Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 917-925, 2012.

[AGH+14] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773-2832, 2014.

[AGHK13] Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham Kakade. A tensor spectral approach to learning mixed membership community models. In Conference on Learning Theory, pages 867-881, 2013.

[AHWW16] Amir Abboud, Thomas Dueholm Hansen, Virginia Vassilevska Williams, and Ryan Williams. Simulating branching programs with edit distance and friends: or: a polylog shaved is a lower bound made. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 375-388. ACM, 2016.

[AVWY15] Amir Abboud, Virginia Vassilevska Williams, and Huacheng Yu. Matching triangles and basing hardness on an extremely popular conjecture. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 41-50. ACM, 2015.

[AW09] Arash A Amini and Martin J Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on, pages 2454-2458. IEEE, 2009.

[AW14] Amir Abboud and Virginia Vassilevska Williams. Popular conjectures imply strong lower bounds for dynamic problems. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 434-443. IEEE, 2014.

[AWY15] Amir Abboud, Ryan Williams, and Huacheng Yu. More applications of the polynomial method to algorithm design. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, pages 218-230. Society for Industrial and Applied Mathematics, 2015.

[AWY18] Amir Abboud, Virginia Vassilevska Williams, and Huacheng Yu. Matching triangles and basing hardness on an extremely popular conjecture. SIAM Journal on Computing, 47(3):1098-1122, 2018.

[BCMV14a] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan. Smoothed analysis of tensor decompositions. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 594-603. ACM, 2014.

[BCMV14b] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan. Smoothed analysis of tensor decompositions. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 594-603. ACM, 2014.

[BCV] Yoshua Bengio, Aaron C Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives.

[BD09] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265-274, 2009.

[BI15] Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 51-58. ACM, 2015.

[BIS17] Arturs Backurs, Piotr Indyk, and Ludwig Schmidt. On the fine-grained complexity of empirical risk minimization: Kernel methods and neural networks. In Advances in Neural Information Processing Systems, pages 4308-4318, 2017.

[BK15] Karl Bringmann and Marvin Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 79-97. IEEE, 2015.

[BR13a] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Conference on Learning Theory, pages 1046-1066, 2013.

[BR13b] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780-1815, 2013.

[BRSV17] Marshall Ball, Alon Rosen, Manuel Sabin, and Prashant Nalini Vasudevan. Average-case fine-grained hardness. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 483-496. ACM, 2017.

[BRT09] Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, pages 1705-1732, 2009.

[BTW07a] Florentina Bunea, Alexandre B Tsybakov, and Marten H Wegkamp. Aggregation for Gaussian regression. The Annals of Statistics, 35(4):1674-1697, 2007.

[BTW07b] Florentina Bunea, Alexandre B Tsybakov, and Marten H Wegkamp. Sparse density estimation with $\ell_1$ penalties. In Learning theory, pages 530-543. Springer, 2007.

[CLR+14] Shiri Chechik, Daniel H Larkin, Liam Roditty, Grant Schoenebeck, Robert E Tarjan, and Virginia Vassilevska Williams. Better approximation algorithms for the graph diameter. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1041-1052. Society for Industrial and Applied Mathematics, 2014.

[CM07] Graham Cormode and S Muthukrishnan. The string edit distance matching problem with moves. ACM Transactions on Algorithms (TALG), 3(1):2, 2007.

[CMV13] Supratik Chakraborty, Kuldeep S Meel, and Moshe Y Vardi. A scalable approximate model counter. In International Conference on Principles and Practice of Constraint Programming, pages 200-216. Springer, 2013.

[CMW13] T Tony Cai, Zongming Ma, and Yihong Wu. Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074-3110, 2013.

[Com94] Pierre Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287-314, 1994.

[CT05] Emmanuel J Candes and Terence Tao. Decoding by linear programming. Information Theory, IEEE Transactions on, 51(12):4203-4215, 2005.

[CT07] Emmanuel Candes and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313-2351, 2007.

[dBEG14] Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. Approximation bounds for sparse principal component analysis. Mathematical Programming, 148(1-2):89-110, 2014.

[dEGJL07] Alexandre d'Aspremont, Laurent El Ghaoui, Michael I Jordan, and Gert RG Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007.

[DHM+14] Holger Dell, Thore Husfeldt, Daniel Marx, Nina Taslaman, and Martin Wahlén. Exponential time complexity of the permanent and the Tutte polynomial. ACM Transactions on Algorithms (TALG), 10(4):21, 2014.

[DM14] Yash Deshpande and Andrea Montanari. Sparse PCA via covariance thresholding. In Advances in Neural Information Processing Systems, pages 334-342, 2014.

[DRXZ14] Dong Dai, Philippe Rigollet, Lucy Xia, and Tong Zhang. Aggregation of affine estimators. Electronic Journal of Statistics, 8(1):302-327, 2014.

[EK12] Yonina C Eldar and Gitta Kutyniok. Compressed sensing: theory and applications. Cambridge University Press, 2012.

[FRG09] Alyson K Fletcher, Sundeep Rangan, and Vivek K Goyal. Necessary and sufficient conditions for sparsity pattern recovery. IEEE Transactions on Information Theory, 55(12):5758-5772, 2009.

[GR18] Oded Goldreich and Guy N. Rothblum. Counting $t$-cliques: Worst-case to average-case reductions and direct interactive proof systems. Electronic Colloquium on Computational Complexity (ECCC), 25:46, 2018.

[HAB02] William Hesse, Eric Allender, and David A Mix Barrington. Uniform constant-depth threshold circuits for division and iterated multiplication. Journal of Computer and System Sciences, 65(4):695-716, 2002.

[Has90] Johan Håstad. Tensor rank is NP-complete. Journal of Algorithms, 11(4):644-654, 1990.

[HK13] Daniel Hsu and Sham M Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, pages 11-20. ACM, 2013.

[HL13] Christopher J Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):45, 2013.

[JL09] Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 2009.

[Joh01] Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295-327, 2001.

[JTU03] Ian T Jolliffe, Nickolay T Trendafilov, and Mudassir Uddin. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12(3):531-547, 2003.

[KGPK15] Rajiv Khanna, Joydeep Ghosh, Russell A Poldrack, and Oluwasanmi Koyejo. Sparse submodular probabilistic PCA. In International Conference on Artificial Intelligence and Statistics, 2015.

[KKGP14] Oluwasanmi O Koyejo, Rajiv Khanna, Joydeep Ghosh, and Russell Poldrack. On prior distributions and approximate inference for structured variables. In Advances in Neural Information Processing Systems, pages 676-684, 2014.

[KM11] Tamara G Kolda and Jackson R Mayo. Shifted power method for computing tensor eigenpairs. SIAM Journal on Matrix Analysis and Applications, 32(4):1095-1124, 2011.

[KNV13] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik. Do semidefinite relaxations really solve sparse PCA. Technical report, Weizmann Institute of Science, 2013.

[Kru77] Joseph B Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95-138, 1977.

[Lan] Joseph M Landsberg. Tensors: geometry and applications, volume 128.

[LM00] Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302-1338, 2000.

[LO13] Joseph M Landsberg and Giorgio Ottaviani. Equations for secant varieties of Veronese and other varieties. Annali di Matematica Pura ed Applicata, 192(4):569-606, 2013.

[LS00] Michael S Lewicki and Terrence J Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337-365, 2000.

[M+13] Zongming Ma et al. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772-801, 2013.

[MB06] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436-1462, 2006.

[McC] P McCullagh. Tensor Methods in Statistics. Chapman and Hall/CRC.

[Moi] Ankur Moitra. Algorithmic aspects of machine learning.

[MR05] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden Markov models. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 366-375. ACM, 2005.

[MZ93] Stéphane G Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. Signal Processing, IEEE Transactions on, 41(12):3397-3415, 1993.

[NT09] Deanna Needell and Joel A Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301-321, 2009.

[NYWR09] Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348-1356, 2009.

[PPSZ05] Ramamohan Paturi, Pavel Pudlák, Michael E Saks, and Francis Zane. An improved exponential-time algorithm for k-SAT. Journal of the ACM (JACM), 52(3):337-364, 2005.

[PW10] Mihai Pătraşcu and Ryan Williams. On the possibility of faster SAT algorithms. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 1065-1075. SIAM, 2010.

[RT11] Philippe Rigollet and Alexandre Tsybakov. Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, 39(2):731-771, 2011.

[RVW13] Liam Roditty and Virginia Vassilevska Williams. Fast approximation algorithms for the diameter and radius of sparse graphs. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 515-524. ACM, 2013.

[RWY10] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research, 11:2241-2259, 2010.

[RWY11] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. Information Theory, IEEE Transactions on, 57(10):6976-6994, 2011.

[ST01] Daniel A Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 296-305. ACM, 2001.

[ST09] Daniel A Spielman and Shang-Hua Teng. Smoothed analysis: an attempt to explain the behavior of algorithms in practice. Communications of the ACM, 52(10):76-84, 2009.

[Str83] Volker Strassen. Rank and optimal computation of generic tensors. Linear Algebra and its Applications, 52:645-685, 1983.

[Tib96] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267-288, 1996.

[Var15] Moshe Y Vardi. The SAT revolution: Solving, sampling, and counting. Highlights, 2015.

[VCLR13] Vincent Q Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In Advances in Neural Information Processing Systems, pages 2670-2678, 2013.

[vdG07] Sara van de Geer. The deterministic lasso. Seminar für Statistik, Eidgenössische Technische Hochschule (ETH) Zürich, 2007.

[VDGB+09] Sara A Van De Geer, Peter Bühlmann, et al. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360-1392, 2009.

[VW15] Virginia Vassilevska Williams. Hardness of easy problems: Basing hardness on popular conjectures such as the strong exponential time hypothesis (invited talk). In LIPIcs-Leibniz International Proceedings in Informatics, volume 43. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2015.

[Wai07] Martin Wainwright. Information-theoretic bounds on sparsity recovery in the high-dimensional and noisy setting. In 2007 IEEE International Symposium on Information Theory, pages 961-965. IEEE, 2007.

[Wai09] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). Information Theory, IEEE Transactions on, 55(5):2183-2202, 2009.

[Wai10] Martin Wainwright. High-dimensional statistics: some progress and challenges ahead. Winedale Workshop, 2010.

[Wil04] Ryan Williams. A new algorithm for optimal constraint satisfaction and its implications. In International Colloquium on Automata, Languages, and Programming, pages 1227-1237. Springer, 2004.

[Wil18] R Ryan Williams. Counting solutions to polynomial systems via reductions. In OASIcs-OpenAccess Series in Informatics, volume 61. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.

[Zha09] Tong Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In Advances in Neural Information Processing Systems, pages 1921-1928, 2009.

[ZHT06] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.

[ZWJ14] Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. arXiv preprint arXiv:1402.1918, 2014.

[ZWJ15] Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators. arXiv preprint arXiv:1503.03188, 2015.
