Statistical Algorithms and a Lower Bound for Detecting Planted Cliques

Vitaly Feldman — IBM Almaden Research Center, San Jose, CA 95120 — [email protected]
Elena Grigorescu∗† — Dept. of Computer Science, Purdue University, West Lafayette, IN 47907 — [email protected]
Lev Reyzin‡† — Dept. of Mathematics, University of Illinois at Chicago, Chicago, IL 60607 — [email protected]
Santosh S. Vempala† — School of Computer Science, Georgia Inst. of Technology, Atlanta, GA 30332 — [email protected]
Ying Xiao† — School of Computer Science, Georgia Inst. of Technology, Atlanta, GA 30332 — [email protected]

∗ This material is based upon work supported by the National Science Foundation under Grant #1019343 to the Computing Research Association for the CIFellows Project.
† Research supported in part by NSF awards CCF-0915903 and CCF-0910584.
‡ Research supported in part by a Simons Postdoctoral Fellowship.

ABSTRACT

We introduce a framework for proving lower bounds on computational problems over distributions, based on a class of algorithms called statistical algorithms. For such algorithms, access to the input distribution is limited to obtaining an estimate of the expectation of any given function on a sample drawn randomly from the input distribution, rather than directly accessing samples. Most natural algorithms of interest in theory and in practice, e.g., moments-based methods, local search, standard iterative methods for convex optimization, MCMC and simulated annealing, are statistical algorithms or have statistical counterparts. Our framework is inspired by and generalizes the statistical query model in learning theory [34].

Our main application is a nearly optimal lower bound on the complexity of any statistical algorithm for detecting planted bipartite clique distributions (or planted dense subgraph distributions) when the planted clique has size O(n^{1/2−δ}) for any constant δ > 0. Variants of these problems have been assumed to be hard to prove hardness for other problems and for cryptographic applications. Our lower bounds provide concrete evidence of hardness, thus supporting these assumptions.

Categories and Subject Descriptors

F.2 [Analysis of Algorithms and Problem Complexity]: General

Keywords

Statistical algorithms; Lower bounds; Planted clique; Statistical query

1. INTRODUCTION

We study the complexity of problems where the input consists of independent samples from an unknown distribution. Such problems are at the heart of machine learning and statistics (and their numerous applications) and also occur in many other contexts such as compressed sensing and cryptography. Many methods exist to estimate the sample complexity of such problems (e.g., VC dimension [47]). Proving lower bounds on the computational complexity of these problems has been much more challenging. The traditional approach to this is via reductions and finding distributions that can generate instances of some problem conjectured to be intractable (e.g., assuming NP ≠ RP).

Here we present a different approach, namely showing that a broad class of algorithms, which we refer to as statistical algorithms, have high complexity, unconditionally. Our definition encompasses most algorithmic approaches used in practice and in theory on a wide variety of problems, including Expectation Maximization (EM) [16], local search, MCMC optimization [45, 25], simulated annealing [36, 48], first and second order methods for linear/convex optimization [17, 6], k-means, Principal Component Analysis (PCA), Independent Component Analysis (ICA), Naïve Bayes, Neural Networks and many others (see [13] and [9] for proofs and many other examples). In fact, we are aware of only one algorithm that does not have a statistical counterpart: Gaussian elimination for solving linear equations over a field (e.g., mod 2).
Informally, statistical algorithms can access the input distribution only by asking for the value of any bounded real-valued function on a random sample from the distribution, or the average value of the function over a specified number of independent random samples. As an example, suppose we are trying to solve min_{u∈U} E_{x∼D}[f(x, u)] by gradient descent. Then the gradient of the objective function is (by interchanging the derivative with the integral)

    ∇_u E_x[f(x, u)] = E_x[∇_u f(x, u)]

and can be estimated from samples; thus the algorithm can proceed without ever examining samples directly. A more involved example is Linear Programming. One version is the feasibility problem: find a nonzero w s.t. a · w ≥ 0 for all a in some set A. We can formulate this as

    max_w E_a[sign(a · w)]

and the distribution over a could be uniform over the set A if it is finite. This can be solved by a statistical algorithm [10, 17]. This is also the case for semidefinite programs and for conic optimization [6]. The key motivation for our definition of statistical algorithms is the empirical observation that almost all algorithms that work on explicit instances are already statistical in our sense or have natural statistical counterparts. Thus, proving lower bounds for statistical algorithms strongly indicates the need for new approaches even for explicit instances. We present the formal oracle-based definitions of statistical algorithms in Section 2.
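To make the oracle point of view concrete, here is a minimal sketch (ours, not from the paper) of the gradient-descent example above: the algorithm touches the unknown distribution D only through empirical averages of a function of a sample. The toy distribution, the objective f(x, u) = ||x − u||², and all names are illustrative assumptions.

```python
# A minimal sketch of a "statistical" gradient descent on E_{x~D}[f(x,u)]:
# the only access to D is through averages of a bounded function of a sample.
import numpy as np

rng = np.random.default_rng(0)

def sample_D(m):
    # stand-in for the unknown input distribution D
    return rng.normal(loc=2.0, scale=1.0, size=(m, 3))

def grad_estimate(u, m=2000):
    # grad_u E_x[f(x,u)] = E_x[2(u - x)], estimated by an empirical average;
    # the algorithm never inspects individual samples beyond this aggregate
    return (2.0 * (u - sample_D(m))).mean(axis=0)

u = np.zeros(3)
for _ in range(200):
    u -= 0.05 * grad_estimate(u)
print(u)  # approaches the minimizer of E[f(x,u)], i.e., the mean of D
```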
The inspiration for our model is the statistical query (SQ) model in learning theory [34], defined as a restriction of Valiant's PAC learning model [46]. The primary goal of the restriction was to simplify the design of noise-tolerant learning algorithms. As was shown by Kearns and others in subsequent works, almost all classes of functions that can be learned efficiently can also be efficiently learned in the restricted SQ model. A notable and so far the only exception is the algorithm for learning parities, based on Gaussian elimination. As was already shown by Kearns [34], parities require exponential time to learn in the SQ model. Further, Blum et al. [11] proved that the number of SQs required for weak learning (that is, for obtaining a non-negligible advantage over random guessing) of a class of functions C over a fixed distribution D is characterized by a combinatorial parameter of C and D, referred to as SQ-DIM(C, D), the statistical query dimension.

Our notion of statistical algorithms generalizes SQ learning algorithms to any computational problem over distributions. For any problem over distributions we define a parameter of the problem that lower bounds the complexity of solving the problem by any statistical algorithm in the same way as SQ-DIM lower bounds the complexity of learning in the SQ model. Our techniques for proving lower bounds on statistical algorithms are also based on methods developed for lower-bounding the complexity of SQ algorithms. However, as we will describe later, they depart from the known techniques in a number of significant ways that are necessary for our more general definition and our applications.

We apply our techniques to the problem of detecting planted bipartite cliques/dense subgraphs.

Detecting Planted Cliques.

In the planted clique problem, we are given a graph G whose edges are generated by starting with a random graph G_{n,1/2}, then "planting," i.e., adding edges to form a clique on k randomly chosen vertices. Jerrum [30] and Kučera [37] introduced the planted clique problem as a potentially easier variant of the classical problem of finding the largest clique in a random graph [33]. A random graph G_{n,1/2} contains a clique of size 2 log n with high probability, and a simple greedy algorithm can find one of size log n. Finding cliques of size (2 − ε) log n is a hard problem for any ε > 0. Planting a larger clique should make it easier to find one. The problem of finding the smallest k for which the planted clique can be detected in polynomial time has attracted significant attention. For k ≥ c√(n log n), simply picking vertices of large degrees suffices [37]. Cliques of size k = Ω(√n) can be found using spectral methods [2, 38, 14], via SDPs [19], nuclear norm minimization [3] or combinatorial methods [21, 15].

While there is no known polynomial-time algorithm that can detect cliques of size below the threshold of Ω(√n), there is a quasipolynomial algorithm for any k ≥ 2 log n: enumerate subsets of size 2 log n; for each subset that forms a clique, take all common neighbors of the subset; one of these will be the planted clique. This is also the fastest known algorithm for any k = O(n^{1/2−δ}), where δ > 0 (a toy rendering of this enumeration is sketched at the end of this discussion).

Some evidence of the hardness of the problem was shown by Jerrum [30], who proved that a specific approach using a Markov chain cannot be efficient for k = o(√n). More evidence of hardness is given in [20], where it is shown that Lovász–Schrijver SDP relaxations, which include the SDP used in [19], cannot be used to efficiently find cliques of size k = o(√n). The problem has been used to generate cryptographic primitives [31], and as a hardness assumption [1, 28, 40].
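The enumeration algorithm described above is easy to state in code. The following sketch is ours, not the authors'; the instance is kept tiny since the running time is n^{O(log n)}, the pruning of stray common neighbors is a simplification we added, and all parameter choices are illustrative.

```python
# Sketch of the quasipolynomial enumeration algorithm: enumerate subsets of
# size s ~ 2 log n; for each subset that forms a clique, take its common
# neighborhood as a candidate for the planted clique.
import itertools, math, random

random.seed(1)
n, k = 16, 8
planted = set(random.sample(range(n), k))
adj = [[False] * n for _ in range(n)]
for u, v in itertools.combinations(range(n), 2):
    adj[u][v] = adj[v][u] = (u in planted and v in planted) or random.random() < 0.5

s = math.ceil(2 * math.log2(n))
best = set()
for T in itertools.combinations(range(n), s):
    if all(adj[u][v] for u, v in itertools.combinations(T, 2)):  # T is a clique
        cand = set(T) | {v for v in range(n)
                         if v not in T and all(adj[v][u] for u in T)}
        while not all(adj[u][v] for u, v in itertools.combinations(cand, 2)):
            # prune stray vertices until the candidate is itself a clique
            cand.remove(min(cand, key=lambda v: sum(adj[v][u] for u in cand if u != v)))
        if len(cand) >= k:
            best = max(best, cand, key=len)
print(sorted(best), sorted(planted))  # for these toy parameters they should agree
```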
We focus on the bipartite planted clique problem, where a bipartite k × k clique is planted in a random bipartite graph. A densest-subgraph version of the bipartite planted clique problem has been used as a hard problem for cryptographic applications [4]. All known bounds and algorithms for the planted k-clique problem can be easily adapted to the bipartite case. Therefore it is natural to suspect that new upper bounds on the planted k-clique problem would also yield new upper bounds for the bipartite case.

The starting point of our investigation for this problem is the property of the bipartite planted k-clique problem that it has an equivalent formulation as a problem over distributions, defined as follows.

Problem 1. Fix an integer k, 1 ≤ k ≤ n, and a subset of k indices S ⊆ {1, 2, ..., n}. The input distribution D_S on vectors x ∈ {0,1}^n is defined as follows: with probability 1 − (k/n), x is uniform over {0,1}^n; and with probability k/n the k coordinates of x in S are set to 1, and the rest are uniform over their support. For an integer t, the distributional planted k-biclique problem with t samples is the problem of finding the unknown subset S using t samples drawn randomly from D_S.

One can view samples x_1, ..., x_t as adjacency vectors of a bipartite graph as follows: the bipartite graph has n vertices on the right (with k marked as members of the clique) and t vertices on the left. Each of the t samples gives the adjacency vector of the corresponding vertex on the left. It is not hard to see that for t = n, conditioned on the event of getting exactly k samples with planted indices, we will get a random bipartite graph with a planted k-biclique (we prove the equivalence formally in the full version of this paper [23]).
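For concreteness, here is a direct sampler for D_S, our sketch of Problem 1's definition (the function and variable names are ours):

```python
# Sampler for the distribution D_S of Problem 1: with probability 1 - k/n the
# vector is uniform; with probability k/n its S-coordinates are forced to 1
# and the rest stay uniform.
import random
rng = random.Random(0)

def sample_planted_biclique(n, S):
    x = [rng.randint(0, 1) for _ in range(n)]
    if rng.random() < len(S) / n:  # a "clique" row
        for i in S:
            x[i] = 1
    return x

S = {0, 3, 5}
for row in [sample_planted_biclique(8, S) for _ in range(6)]:
    print(row)  # t = n such rows form the adjacency matrix of the bipartite instance
```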
Another interesting approach for planted clique was proposed by Frieze and Kannan [24]. They gave a reduction from finding a planted clique in a random graph to finding a direction that maximizes a tensor norm; this was extended to general r in [12]. Specifically, they show that maximizing the r-th moment (or the 2-norm of an r-th order tensor) allows one to recover planted cliques of size Ω̃(n^{1/r}). A related approach is to maximize the 3rd or higher moment of the distribution given by the distributional planted clique problem. For this approach it is natural to consider the following type of optimization algorithm: start with some unit vector u, then estimate the gradient at u (via samples), move along that direction and return to the sphere; repeat to reach an approximate local maximum. Unfortunately, over the unit sphere, the expected r-th moment function can have (exponentially) many local maxima even for simple distributions. A more sophisticated approach [32] is through Markov chains or simulated annealing; it attempts to sample unit vectors from a distribution on the sphere which is heavier on vectors that induce a higher moment, e.g., u is sampled with density proportional to e^{f(u)}, where f(u) is the expected r-th moment along u. This could be implemented by a Markov chain with a Metropolis filter [39, 27] ensuring a proportional steady state distribution. If the Markov chain were to mix rapidly, that would give an efficient approximation algorithm, because sampling from the steady state likely gives a vector of high moment. At each step, all one needs is to be able to estimate f(u), which can be done by sampling from the input distribution.

As we will see presently, these statistical approaches will have provably high complexity. For the bipartite planted clique problem, statistical algorithms that can query expectations of arbitrary functions to within a small tolerance need n^{Ω(log n)} queries to detect planted cliques of size k < n^{1/2−δ} for any δ > 0. Even stronger exponential bounds apply for the more general problem of detecting planted dense subgraphs of the same size. These bounds match the current upper bounds.

Problem 2. Fix 0 < q ≤ p ≤ 1. For 1 ≤ k ≤ n, let S ⊆ {1, 2, ..., n} be a set of k vertex indices and D_S be a distribution over {0,1}^n such that when x ∼ D_S, with probability 1 − (k/n) the entries of x are independently chosen according to a q-biased Bernoulli variable, and with probability k/n the k coordinates in S are independently chosen according to a p-biased Bernoulli variable, and the rest are independently chosen according to a q-biased Bernoulli variable. The generalized planted bipartite densest k-subgraph problem is to find the unknown subset S given access to samples from D_S.

To describe these results precisely and discuss exactly what they mean for the complexity of these problems, we will need to define the notion of statistical algorithms, the complexity measures we use, and our main tool for proving lower bounds, a notion of statistical dimension of a set of distributions. We do this in the next section. In Section 3 we prove our general lower bound theorems, and in Section 5 we estimate the statistical dimension of detecting planted cliques and dense subgraphs.

2. DEFINITIONS AND OVERVIEW

Here we formally define statistical algorithms and the key notion of statistical dimension, and then describe the resulting lower bounds in detail.

2.1 Problems over Distributions

We begin by formally defining the class of problems addressed by our framework.

Definition 1 (Search problems over distributions). For a domain X, let D be a set of distributions over X, let F be a set of solutions and Z : D → 2^F be a map from a distribution D ∈ D to a subset of solutions Z(D) ⊆ F that are defined to be valid solutions for D. For t > 0 the distributional search problem Z over D and F using t samples is to find a valid solution f ∈ Z(D) given access to t random samples from an unknown D ∈ D.

We note that this definition captures decision problems by having F = {0,1}. With a slight abuse of notation, for a solution f ∈ F, we denote by Z^{−1}(f) the set of distributions in D for which f is a valid solution.

It is important to note that the number of available samples t can have a major influence on the complexity of the problem. First, for most problems there is a minimum t for which the problem is information-theoretically solvable. This value is often referred to as the sample complexity of the problem. But even for t which is larger than the sample complexity of the problem, having more samples can make the problem easier computationally. For example, in the context of attribute-efficient learning, there is a problem that is intractable with few samples (under cryptographic assumptions) but is easy to solve with a larger (but still polynomial) number of samples [43]. Our distributional planted biclique problem exhibits the same phenomenon.

2.2 Statistical Algorithms

The statistical query learning model of Kearns [34] is a restriction of the PAC model [46]. It captures algorithms that rely on empirical estimates of statistical properties of random examples of an unknown function instead of individual random examples (as in the PAC model of learning). For general search, decision and optimization problems over a distribution, we define statistical algorithms as algorithms that do not see samples from the distribution but instead have access to estimates of the expectation of any bounded function of a sample from the distribution.

Definition 2 (STAT oracle). Let D be the input distribution over the domain X. For a tolerance parameter τ > 0, the STAT(τ) oracle is the oracle that for any query function h : X → [−1, 1], returns a value

    v ∈ [E_{x∼D}[h(x)] − τ, E_{x∼D}[h(x)] + τ].

The general algorithmic techniques mentioned earlier can all be expressed as algorithms using the STAT oracle instead of samples themselves, in most cases in a straightforward way. We would also like to note that in the PAC learning model some of the algorithms, such as the Perceptron algorithm, did not initially appear to fall into the SQ framework, but SQ analogues were later found for all known learning techniques except Gaussian elimination (for specific examples, see [34] and [10]). We expect the situation to be similar even in the broader context of search problems over distributions.

The most natural realization of the STAT(τ) oracle is one that computes h on O(1/τ²) random samples from D and returns their average (see the sketch below). Chernoff's bound would then imply that the estimate is within the desired tolerance (with constant probability). However, if h(x) is very biased (e.g., equal to 0 with high probability), it can be estimated with fewer samples.
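The realization just mentioned is a one-liner; the following sketch (ours; the constant in O(1/τ²) is an arbitrary choice) makes it explicit.

```python
# The "most natural realization" of STAT(tau): answer a query h : X -> [-1,1]
# with its empirical mean over O(1/tau^2) fresh samples; by Chernoff the
# answer is tau-accurate with constant probability. sample_D stands in for
# one draw from the unknown distribution D.
import math, random
rng = random.Random(0)

def STAT(h, tau, sample_D, c=4.0):
    m = math.ceil(c / tau ** 2)
    return sum(h(sample_D()) for _ in range(m)) / m

# toy check: D uniform over {0,1}^8 and h(x) = x_0, so E[h] = 1/2
print(STAT(lambda x: x[0], 0.05, lambda: [rng.randint(0, 1) for _ in range(8)]))
```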
Our primary application requires a tight bound on the number of samples necessary to solve a problem over distributions. Therefore we define a stronger version of the STAT oracle in which the tolerance is adjusted for the variance of the query function; that is, the oracle returns the expectation to within the same tolerance (up to constant factors) as one expects to get from t samples. More formally, for a Boolean query h, VSTAT(t) can return any value v for which the distribution B(t, v) (sum of t independent Bernoulli variables with bias v) is statistically close to B(t, E[h]) (see Sec. 3.2 for more details on this correspondence).

Definition 3 (VSTAT oracle). Let D be the input distribution over the domain X. For a sample size parameter t > 0, the VSTAT(t) oracle is the oracle that for any query function h : X → {0,1}, returns a value v ∈ [p − τ, p + τ], where p = E_{x∼D}[h(x)] and τ = max{1/t, √(p(1−p)/t)}.

Note that VSTAT(t) always returns the value of the expectation within 1/√t. Therefore it is stronger than STAT(1/√t) but weaker than STAT(1/t). (STAT, unlike VSTAT, allows non-Boolean functions, but this is not a significant difference, as any [−1,1]-valued query can be converted to a logarithmic number of {0,1}-valued queries.)

The STAT and VSTAT oracles we defined can return any value within the given tolerance and therefore can make adversarial choices. We also aim to prove lower bounds against algorithms that use a potentially more benign, "unbiased" statistical oracle. The unbiased statistical oracle gives the algorithm the true value of a Boolean query function on a randomly chosen sample. This model is based on the Honest SQ model in learning by Yang [49] (which itself is based on an earlier model of Jackson [29]).

Definition 4 (1-STAT oracle). Let D be the input distribution over the domain X. The 1-STAT oracle is the oracle that given any function h : X → {0,1}, takes an independent random sample x from D and returns h(x).

Note that the 1-STAT oracle draws a fresh sample each time it is called. Without re-sampling each time, the answers of the 1-STAT oracle could be easily used to recover any sample bit-by-bit, making it equivalent to having access to random samples. The query complexity of an algorithm using 1-STAT is defined to be the number of calls it makes to the 1-STAT oracle. Note that the 1-STAT oracle can be used to simulate VSTAT (with high probability) by taking the average of O(t) replies of 1-STAT for the same function h.
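The following sketch (ours) models the two oracles just defined: VSTAT's answer is the true expectation perturbed anywhere inside its tolerance window (a stand-in for its adversarial freedom), while 1-STAT evaluates the query on one fresh sample; averaging O(t) replies of 1-STAT for the same function gives the simulation of VSTAT mentioned above.

```python
# Toy models of the oracles of Definitions 3 and 4; the uniform perturbation
# merely illustrates VSTAT's freedom to answer adversarially within tolerance.
import math, random
rng = random.Random(0)

def VSTAT(t, p):
    # p = E_{x~D}[h(x)] for the queried h; tolerance from Definition 3
    tau = max(1.0 / t, math.sqrt(p * (1.0 - p) / t))
    return min(1.0, max(0.0, p + rng.uniform(-tau, tau)))

def ONE_STAT(h, sample_D):
    return h(sample_D())  # h evaluated on a single fresh sample

def vstat_via_1stat(h, t, sample_D, c=3):
    # averaging O(t) honest replies simulates VSTAT(t) with high probability
    return sum(ONE_STAT(h, sample_D) for _ in range(c * t)) / (c * t)
```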
While it might seem that access to 1-STAT gives an algorithm more power than access to VSTAT, we will show that t samples from 1-STAT can be simulated using access to VSTAT(O(t)). This will allow us to translate our lower bounds on algorithms with access to VSTAT to query complexity lower bounds for unbiased statistical algorithms.

In the following discussion, we refer to algorithms using STAT or VSTAT oracles (instead of samples) as statistical algorithms. Algorithms using the 1-STAT oracle are henceforth called unbiased statistical algorithms.

2.3 Statistical Dimension

The main tool in our analysis is an information-theoretic bound on the complexity of statistical algorithms. Our definitions originate from the statistical query (SQ) dimension in learning theory [11], used to characterize SQ learning algorithms. Roughly speaking, the SQ dimension corresponds to the number of nearly uncorrelated labeling functions in a class (see the full version for the details of the definition and the relationship to our bounds [23]).

We introduce a natural generalization and strengthening of this approach to search problems over arbitrary sets of distributions and prove lower bounds on the complexity of statistical algorithms based on the generalized notion. Our definition departs from SQ dimension in three aspects. (1) Our notion applies to any set of distributions, whereas in the learning setting all known dimensions require fixing the distribution over the domain and only allow varying the labeling function. (2) Instead of relying on a bound on pairwise correlations, our dimension relies on a bound on average correlations in a large set of distributions. This weaker condition allows us to derive the tight bounds on the complexity of statistical algorithms for the planted k-biclique problem. (3) We show that our dimension also gives lower bounds for the stronger VSTAT oracle (without incurring a quadratic loss in the sample size parameter).

We now define our dimension formally. For two functions f, g : X → R and a distribution D with probability density function D(x), the inner product of f and g over D is defined as

    ⟨f, g⟩_D ≐ E_{x∼D}[f(x)g(x)].

The norm of f over D is ||f||_D = √(⟨f, f⟩_D). We remark that, by convention, the integral from the inner product is taken only over the support of D, i.e., for x ∈ X such that D(x) ≠ 0. Given a distribution D over X, let D(x) denote the probability density function of D relative to some fixed underlying measure over X (for example, the uniform distribution for discrete X or the Lebesgue measure over R^n). Our bound is based on the inner products between functions of the following form: (D′(x) − D(x))/D(x), where D′ and D are distributions over X. For this to be well-defined, we will only consider cases where D(x) = 0 implies D′(x) = 0, in which case D′(x)/D(x) is treated as 1. To see why such functions are relevant to our discussion, note that for every real-valued function f over X,

    E_{x∼D′}[f(x)] − E_{x∼D}[f(x)] = E_{x∼D}[D′(x)f(x)/D(x)] − E_{x∼D}[f(x)] = ⟨(D′ − D)/D, f⟩_D.

This means that the inner product of any function f with (D′ − D)/D is equal to the difference of expectations of f under the two distributions. We also remark that the quantity ⟨D′/D − 1, D′/D − 1⟩_D is known as the χ²(D′, D) distance and is widely used for hypothesis testing in statistics [41]. A key notion for our statistical dimension is the average correlation of a set of distributions D′ relative to a distribution D. We denote it by ρ(D′, D) and define it as follows:

    ρ(D′, D) ≐ (1/m²) Σ_{D_1, D_2 ∈ D′} ⟨D_1/D − 1, D_2/D − 1⟩_D,   where m = |D′|.

Known lower bounds for SQ learning are based on bounding the pairwise correlations between functions. Bounds on pairwise correlations, when strong enough, easily imply bounds on the average correlation. Bounding average correlation is more involved technically, but it appears to be necessary for the tight lower bounds that we give. In Section 3.3 we describe a pairwise-correlation version of our bounds. It is sufficient for some applications and generalizes the statistical query dimension from learning theory (see full version for the details).

We now define the concept of statistical dimension.

Definition 5. For γ̄ > 0, domain X and a search problem Z over a set of solutions F and a class of distributions D over X, let d be the largest integer such that there exists a reference distribution D over X and a finite set of distributions D_D ⊆ D with the following property: for any solution f ∈ F the set D_f = D_D \ Z^{−1}(f) is non-empty and for any subset D′ ⊆ D_f, where |D′| ≥ |D_f|/d, ρ(D′, D) < γ̄. We define the statistical dimension with average correlation γ̄ of Z to be d and denote it by SDA(Z, γ̄).

The statistical dimension with average correlation γ̄ of a search problem over distributions gives a lower bound on the complexity of any (deterministic) statistical algorithm for the problem that uses queries to VSTAT(1/(3γ̄)).

Theorem 1. Let X be a domain and Z be a search problem over a set of solutions F and a class of distributions D over X. For γ̄ > 0 let d = SDA(Z, γ̄). Any statistical algorithm requires at least d calls to the VSTAT(1/(3γ̄)) oracle to solve Z.
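For very small parameters these quantities can be computed exactly. The sketch below (ours) enumerates {0,1}^n, builds the planted biclique densities D_S of Problem 1, and evaluates the inner products and the average correlation ρ appearing in Definition 5.

```python
# Brute-force evaluation of <D_S/D - 1, D_T/D - 1>_D and of the average
# correlation rho over all k-subsets, for the planted biclique distributions
# (n and k are tiny so the 2^n enumeration is feasible).
import itertools

n, k = 6, 2
X = list(itertools.product([0, 1], repeat=n))
D = {x: 1 / 2 ** n for x in X}                       # reference: uniform

def density(S):                                       # D_S from Problem 1
    return {x: (1 - k / n) / 2 ** n
               + ((k / n) / 2 ** (n - k) if all(x[i] == 1 for i in S) else 0.0)
            for x in X}

def inner(dS, dT):                                    # <D_S/D - 1, D_T/D - 1>_D
    return sum(D[x] * (dS[x] / D[x] - 1) * (dT[x] / D[x] - 1) for x in X)

sets = [frozenset(c) for c in itertools.combinations(range(n), k)]
dens = {S: density(S) for S in sets}
rho = sum(inner(dens[S], dens[T]) for S in sets for T in sets) / len(sets) ** 2
print(rho)  # the average correlation of the whole family relative to D
```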
At a high level our proof works as follows. It is not hard to see that for any D, an algorithm that solves Z needs to "distinguish" all distributions in D_f from D for some solution f. Here by "distinguishing" we mean that the algorithm needs to ask a query g such that E_D[g] cannot be used as a response of VSTAT(1/(3γ̄)) for D′ ∈ D_f. In the key component of the proof we show that if a query function g to VSTAT(1/(3γ̄)) distinguishes between a distribution D and any distribution D′ ∈ D′, then D′ must have average correlation of at least γ̄ relative to D. The condition that for any |D′| ≥ |D_f|/d, ρ(D′, D) < γ̄ then immediately implies that at least d queries are required to distinguish any distribution in D_f from D.

In Section 3.1 we give a refinement of SDA which additionally requires that the set D_f is not too small (and not just non-empty). This refined notion allows us to extend the lower bound to randomized statistical algorithms and then to unbiased statistical algorithms.

2.4 Lower Bounds

We will prove lower bounds for the bipartite planted clique problem under both statistical oracles (VSTAT and 1-STAT).

Theorem 2. For any constant δ > 0, any k ≤ n^{1/2−δ} and r > 0, at least n^{Ω(log r)} queries to VSTAT(n²/(rk²)) are required to solve the distributional planted bipartite k-clique. No polynomial-time statistical algorithm can solve the problem using queries to VSTAT(o(n²/k²)) and any statistical algorithm will require n^{Ω(log n)} queries to VSTAT(n^{2−δ}/k²).

This bound also applies to any randomized algorithm with the probability of success being at least a (positive) constant.

Note that this bound is essentially tight. For every vertex in the clique, the probability that the corresponding bit of a randomly chosen point is set to 1 is 1/2 + k/(2n), whereas for every vertex not in the clique, this probability is 1/2. Therefore using n queries to VSTAT(16n²/k²) (i.e., of tolerance k/(4n)) it is easy to detect the planted bipartite clique. Indeed, this can be done by using the query functions h : {0,1}^n → {0,1}, h(x) = x_i, for each i ∈ [n]. So, the answers of the VSTAT oracle represent the expected value of the i-th bit over the sample (a sketch of this detection procedure appears at the end of this subsection).

There is also a statistical algorithm that uses n^{O(log n)} queries to VSTAT(25n/k) (significantly higher tolerance) to find the planted set for any k ≥ log n. In fact, the same algorithm can be used for the standard planted clique problem, where it achieves complexity n^{O(log n)}. We enumerate over subsets T ⊆ [n] of log n indices and query VSTAT(25n/k) with the function g_T : {0,1}^n → {0,1} defined to be 1 if and only if the point has ones in all coordinates in T. Therefore, if the set T is included in the true clique then

    E_D[g_T] = (k/n) · 1 + (1 − k/n) · 2^{−log n} ∈ [k/n, (k+1)/n].

With this expectation, VSTAT(25n/k) has tolerance at most √(k(k+1)/(25n²)) ≤ (k+1)/(5n) and will return at least k/n − (k+1)/(5n) > 3k/(4n). If, on the other hand, at least one element of T is not from the planted clique, then E_D[g_T] ≤ k/(2n) + 1/n and VSTAT(25n/k) will return at most (k+2)/(2n) + (k+2)/(5n) < 3k/(4n). Thus, we will know all log n-sized subsets of the planted clique and hence the entire clique.

Our bounds have direct implications for the average-case planted bipartite k-clique problem. An instance of this problem is a random n × n bipartite graph with a k × k bipartite clique planted randomly. In the full version of this paper, we show that the average-case planted k-biclique is equivalent to our distributional planted k-biclique with n samples. For a statistical algorithm, n samples directly correspond to having access to VSTAT(O(n)). Our bounds show that this problem can be solved in polynomial time when k = Ω(√n). At the same time, for k ≤ n^{1/2−δ}, any statistical algorithm will require n^{Ω(log n)} queries to VSTAT(n^{1+δ}).

We also give a bound for unbiased statistical algorithms.

Theorem 3. For any constant δ > 0 and any k ≤ n^{1/2−δ}, any unbiased statistical algorithm that succeeds with probability at least 2/3 requires Ω̃(n²/k²) queries to solve the distributional planted bipartite k-clique problem.

Each query of an unbiased statistical algorithm requires a new sample from D. Therefore this bound implies that any algorithm that does not reuse samples will require Ω̃(n²/k²) samples. To place this bound in context, we note that it is easy to detect whether a clique of size k has been planted using Õ(n²/k²) samples (as before, to detect if a coordinate i is in the clique we can compute the average of x_i on Õ(n²/k²) samples). Of course, finding all vertices in the clique would require reusing samples (which statistical algorithms cannot do). Note that n²/k² ≤ n if and only if k ≥ √n.
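The coordinate-averaging detector described above takes only a few lines. In this sketch (ours) the constants are inflated for a reliable toy run, and the regime k ≫ √n is chosen deliberately so that roughly n²/k² samples suffice.

```python
# Detecting the planted set by estimating E[x_i] from ~n^2/k^2 samples:
# coordinates in the clique have mean about 1/2 + k/(2n); the threshold
# 1/2 + k/(4n) separates them from the rest.
import random
rng = random.Random(0)

n, k = 100, 30
S = set(range(k))                                   # the unknown planted set

def sample(n, S):                                   # one draw from D_S
    x = [rng.randint(0, 1) for _ in range(n)]
    if rng.random() < len(S) / n:
        for i in S:
            x[i] = 1
    return x

t = 64 * n * n // (k * k)                           # O(n^2/k^2) samples
xs = [sample(n, S) for _ in range(t)]
means = [sum(x[i] for x in xs) / t for i in range(n)]
guess = {i for i in range(n) if means[i] > 0.5 + k / (4 * n)}
print(sorted(guess) == sorted(S))
```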
These bounds match the state of the art for the average-case planted bipartite k-clique and planted k-clique problems. One reason why we consider this natural is that some of the algorithms that solve the problem for k = Ω(√n) are obviously statistical. For example, the key procedure of the algorithm of Feige and Ron [21] removes a vertex that has the lowest degree (in the current graph) and repeats until the remaining graph is a clique. The degree is the number of ones in a column (or row) of the adjacency matrix and can be viewed as an estimate of the expectation of 1 appearing in the corresponding coordinate of a random sample. Even the more involved algorithms for the problem, like finding the eigenvector with the largest eigenvalue or solving an SDP, have statistical analogues.

A closely related problem is the planted densest subgraph problem, where edges in the planted subset appear with higher probability than in the remaining graph. This is a variant of the densest k-subgraph problem, which itself is a natural generalization of k-clique that asks to recover the densest k-vertex subgraph of a given n-vertex graph [18, 35, 7, 8]. The conjectured hardness of its average-case variant, the planted densest subgraph problem, has been used in public key encryption schemes [4] and in analyzing parameters specific to financial markets [5]. Our lower bounds extend in a straightforward manner to this problem.

Theorem 4. For any constant δ > 0, any k ≤ n^{1/2−δ}, ℓ ≤ k and density parameter p = 1/2 + α, at least n^{Ω(ℓ)} queries to VSTAT(n²/(ℓα²k²)) are required to solve the distributional planted bipartite densest k-subgraph with density p. For constants c, δ > 0, density 1/2 < p ≤ 1/2 + 1/n^c, and k ≤ n^{1/2−δ}, any unbiased statistical algorithm requires Ω̃(n^{2+2c}/k²) queries to find a planted densest subgraph of size k.

For example, taking k = n^{1/3}, ℓ = k and α = 1/n^c yields a lower bound of n^{Ω(n^{1/3})} for the VSTAT(n^{1+2c}) oracle.

3. LOWER BOUNDS FROM STATISTICAL DIMENSION

We refer the reader to the full version [23] for the technical details omitted here.

3.1 Lower Bounds for Statistical Algorithms

We now prove Theorem 1, which is the basis of all our lower bounds. We will prove a stronger version of this theorem which also applies to randomized algorithms. For this version we need an additional parameter in the definition of SDA.
Definition 6. For γ̄ > 0, η > 0, domain X and a search problem Z over a set of solutions F and a class of distributions D over X, let d be the largest integer such that there exists a reference distribution D over X and a finite set of distributions D_D ⊆ D with the following property: for any solution f ∈ F the set D_f = D_D \ Z^{−1}(f) has size at least (1 − η) · |D_D| and for any subset D′ ⊆ D_f, where |D′| ≥ |D_f|/d, ρ(D′, D) < γ̄. We define the statistical dimension with average correlation γ̄ and solution set bound η of Z to be d and denote it by SDA(Z, γ̄, η).

Note that for any η < 1, SDA(Z, γ̄) ≥ SDA(Z, γ̄, η), and for 1 > η ≥ 1 − 1/|D_D|, we get SDA(Z, γ̄) = SDA(Z, γ̄, η), where D_D is the set of distributions that maximizes SDA(Z, γ̄).

Theorem 5. Let X be a domain and Z be a search problem over a set of solutions F and a class of distributions D over X. For γ̄ > 0 and η ∈ (0, 1) let d = SDA(Z, γ̄, η). Any randomized statistical algorithm that solves Z with probability δ > η requires at least ((δ − η)/(1 − η)) · d calls to VSTAT(1/(3γ̄)).

Theorem 1 is obtained from Theorem 5 by setting δ = 1 and using any 1 − 1/|D_D| ≤ η < 1. Further, for any η < 1, SDA(Z, γ̄) ≥ SDA(Z, γ̄, η), and therefore for any η < 1, a bound on SDA(Z, γ̄, η) can be used in Theorem 1 in place of a bound on SDA(Z, γ̄).

We will need the following simple lemma that bounds the distance between any p ∈ [0, 1] and a value p′ returned by VSTAT(t) on a query with expectation p, in terms of p′.

Lemma 1. For an integer t and any p ∈ [0, 1], let p′ ∈ [0, 1] be such that |p′ − p| ≥ τ = max{1/t, √(p(1−p)/t)}. Then

    |p′ − p| ≥ √(min{p′, 1 − p′} / (3t)).

We are now ready to prove Theorem 5.
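Lemma 1 is elementary; here is a quick randomized sanity check of the stated inequality (ours, and it assumes the lemma as stated, so the assertion should never fire):

```python
# Randomized sanity check of Lemma 1: whenever |p' - p| exceeds the VSTAT(t)
# tolerance, it should also exceed sqrt(min{p', 1 - p'}/(3t)).
import math, random
rng = random.Random(0)

for _ in range(200000):
    t = rng.randint(1, 1000)
    p = rng.random()
    tau = max(1.0 / t, math.sqrt(p * (1.0 - p) / t))
    p2 = rng.random()
    if abs(p2 - p) >= tau:  # hypothesis of the lemma
        assert abs(p2 - p) >= math.sqrt(min(p2, 1.0 - p2) / (3.0 * t)) - 1e-12
print("no counterexample found")
```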

Proof of Theorem 5. We prove our lower bound by exhibiting a distribution over inputs (which are distributions over X) for which every deterministic statistical algorithm that solves Z with probability δ (over the choice of input) requires at least (δ − η) · d/(1 − η) calls to VSTAT(1/(3γ̄)). The claim of the theorem will then follow by Yao's minimax principle [51].

Let D be the reference distribution and D_D be a set of distributions for which the value d is achieved. Let A be a deterministic statistical algorithm that uses q queries to VSTAT(1/(3γ̄)) to solve Z with probability δ over a random choice of a distribution from D_D. Following an approach from [22], we simulate A by answering any query h : X → {0,1} of A with the value E_D[h(x)]. Let h_1, h_2, ..., h_q be the queries asked by A in this simulation and let f be the output of A.

By the definition of SDA, for D_f = D_D \ Z^{−1}(f) it holds that |D_f| ≥ (1 − η)|D_D|, and for every D′ ⊆ D_f, either ρ(D′, D) < γ̄ or |D′| ≤ |D_f|/d. Let the set D⁺ ⊆ D_D be the set of distributions on which A is successful. Let D_f⁺ = D_f ∩ D⁺ and denote these distributions by {D_1, D_2, ..., D_m}. We note that D_f⁺ = D⁺ \ (D_D \ D_f) and therefore

    m = |D_f⁺| ≥ |D⁺| − |D_D \ D_f| ≥ δ|D_D| − |D_D \ D_f|
      = [(δ|D_D| − |D_D \ D_f|) / (|D_D| − |D_D \ D_f|)] · |D_f| ≥ [(δ − η)/(1 − η)] · |D_f|.   (1)

To lower bound q, we use a generalization of an elegant argument of Szörényi [44]. For every k ≤ q, let A_k be the set of all distributions D_i such that

    |E_D[h_k(x)] − E_{D_i}[h_k(x)]| > τ_{i,k} ≐ max{1/t, √(p_{i,k}(1 − p_{i,k})/t)},

where we use t to denote 1/(3γ̄) and p_{i,k} to denote E_{D_i}[h_k(x)]. To prove the desired bound we first prove the following two claims:

1. Σ_{k≤q} |A_k| ≥ m;
2. for every k, |A_k| ≤ |D_f|/d.

Combining these two implies that q ≥ d · m/|D_f|. By inequality (1), q ≥ [(δ − η)/(1 − η)] · d, giving the desired lower bound. In the rest of the proof, for conciseness, we drop the subscript D for inner products and norms.

To prove the first claim we assume, for the sake of contradiction, that there exists D_i ∉ ∪_{k≤q} A_k. Then for every k ≤ q, |E_D[h_k(x)] − E_{D_i}[h_k(x)]| ≤ τ_{i,k}. This implies that the replies of our simulation E_D[h_k(x)] are within τ_{i,k} of E_{D_i}[h_k(x)]. By the definition of A and VSTAT(t), this implies that f is a valid solution for Z on D_i, contradicting the condition that D_i ∈ D_f⁺ ⊆ D_D \ Z^{−1}(f).

To prove the second claim, suppose that for some k ∈ [q], |A_k| > |D_f|/d. We denote p_k = E_D[h_k(x)] and assume that p_k ≤ 1/2 (when p_k > 1/2, we simply replace h_k by 1 − h_k). We note first that

    E_{D_i}[h_k(x)] − E_D[h_k(x)] = E_D[h_k(x) · D_i(x)/D(x)] − E_D[h_k(x)] = ⟨h_k, D_i/D − 1⟩ = p_{i,k} − p_k.

Let D̂_i(x) = D_i(x)/D(x) − 1 (where the convention is that D̂_i(x) = 0 if D(x) = 0). We will next show upper and lower bounds on the quantity

    Φ = ⟨h_k, Σ_{D_i∈A_k} D̂_i · sign⟨h_k, D̂_i⟩⟩.

By Cauchy-Schwartz we have that

    Φ² ≤ ||h_k||² · ||Σ_{D_i∈A_k} D̂_i · sign⟨h_k, D̂_i⟩||²,

which implies that

    Φ² ≤ ||h_k||² · (Σ_{D_i,D_j∈A_k} ⟨D̂_i, D̂_j⟩) ≤ ||h_k||² · ρ(A_k, D) · |A_k|².   (2)

As before, we also have that

    Φ² = (Σ_{D_i∈A_k} ⟨h_k, D̂_i⟩ · sign⟨h_k, D̂_i⟩)² = (Σ_{D_i∈A_k} |⟨h_k, D̂_i⟩|)² ≥ (Σ_{D_i∈A_k} |p_{i,k} − p_k|)².   (3)

To evaluate the last term of this inequality we use the fact that |p_{i,k} − p_k| ≥ τ_{i,k} = max{1/t, √(p_{i,k}(1 − p_{i,k})/t)} and Lemma 1 to obtain that for every D_i ∈ A_k,

    |p_k − p_{i,k}| ≥ √(min{p_k, 1 − p_k} / (3t)) = √(p_k/(3t)).   (4)

By substituting equation (4) into (3) we get that Φ² ≥ (p_k/(3t)) · |A_k|². We note that h_k is a {0,1}-valued function and therefore ||h_k||² = p_k. Substituting this into equation (2) we get that Φ² ≤ p_k · ρ(A_k, D) · |A_k|². By combining these two bounds on Φ we obtain that ρ(A_k, D) ≥ 1/(3t) = γ̄, which contradicts the definition of SDA. □

3.2 Lower Bounds for Unbiased Statistical Algorithms

Next we address lower bounds on algorithms that use the 1-STAT oracle. We recall that the 1-STAT oracle returns the value of a function on a single randomly chosen point. To estimate the expectation of a function, an algorithm can simply query this oracle multiple times with the same function and average the results. A lower bound for this oracle directly translates to a lower bound on the number of samples that any statistical algorithm must use.

We note that responses of 1-STAT do not have the room for the possibly adversarial deviation afforded by the tolerance of the STAT and VSTAT oracles. The ability to use these slight deviations in a coordinated way is used crucially in our lower bounds against VSTAT and in known lower bounds for SQ learning algorithms. This makes proving lower bounds against unbiased statistical algorithms harder, and indeed lower bounds for Honest SQ learning (which is a special case of unbiased statistical algorithms) required a substantially more involved argument than lower bounds for the regular SQ model [50].

Our lower bounds for unbiased statistical algorithms use a different approach. We show a direct simulation of the 1-STAT oracle using the VSTAT oracle.

Theorem 6. Let Z be a search problem and let A be a (possibly randomized) unbiased statistical algorithm that solves Z with probability at least δ using m samples from 1-STAT. For any δ′ ∈ (0, 1/2], there exists a statistical algorithm A′ that uses at most m queries to VSTAT(m/δ′²) and solves Z with probability at least δ − δ′.

Our proof relies on a simple simulation which ensures that the success probability of the simulated algorithm is not much worse than that of the unbiased statistical algorithm. Combined with Theorem 5, this yields the following lower bound.

Theorem 7. Let X be a domain and Z be a search problem over a set of solutions F and a class of distributions D over X. For γ̄ > 0 and η ∈ (0, 1), let d = SDA(Z, γ̄, η). Any (possibly randomized) unbiased statistical algorithm that solves Z with probability δ requires at least m calls to 1-STAT for

    m = min{ d(δ − η)/(2(1 − η)), (δ − η)²/(12γ̄) }.

In particular, if η ≤ 1/6 then any algorithm with success probability of at least 2/3 requires at least min{d/4, 1/(48γ̄)} samples from 1-STAT.

To conclude, we note that the reduction in the other direction is trivial, namely that the VSTAT(t) oracle can be simulated using the 1-STAT oracle. We provide details in the full version of this paper.

3.3 Statistical Dimension from Pairwise Correlations

In addition to SDA, which is based on average correlation, we introduce a simpler notion based on pairwise correlations.

Definition 7. For γ, β > 0, domain X and a search problem Z over a set of solutions F and a class of distributions D over X, let m be the maximum integer such that there exists a reference distribution D over X and a finite set of distributions D_D ⊆ D such that for any solution f ∈ F, there exists a set of m distributions D_f = {D_1, ..., D_m} ⊆ D_D \ Z^{−1}(f) with the following property:

    |⟨D_i/D − 1, D_j/D − 1⟩_D| ≤ β for i = j ∈ [m], and ≤ γ for i ≠ j ∈ [m].

We define the statistical dimension with pairwise correlations (γ, β) of Z to be m and denote it by SD(Z, γ, β).

Using this notion, we can prove the following lower bound (which is a corollary of Theorem 1):

Theorem 8. Let X be a domain and Z be a search problem over a set of solutions F and a class of distributions D over X. For γ, β > 0, let m = SD(Z, γ, β). For any τ > 0, any statistical algorithm requires at least m(τ² − γ)/(β − γ) calls to the STAT(τ) oracle to solve Z. In particular, if for m > 0, SD(Z, γ = m^{−2/3}, β = 1) ≥ m, then at least m^{1/3}/2 calls of tolerance m^{−1/3} to the STAT oracle are required to solve Z.
4. WARM-UP: MAX-XOR-SAT

In this section, we demonstrate our techniques on a warm-up problem, MAX-XOR-SAT. For this problem, it is sufficient to use pairwise correlations, rather than average correlations. The MAX-XOR-SAT problem over a distribution is defined as follows.

Problem 3. For ε ≥ 0, the ε-approximate MAX-XOR-SAT problem is defined as follows. Given samples from some unknown distribution D over XOR clauses on n variables, find an assignment x that maximizes, up to additive ε, the probability that a random clause drawn from D is satisfied.

In the worst case, it is known that MAX-XOR-SAT is NP-hard to approximate to within 1/2 − δ for any constant δ > 0 [26]. In practice, local search algorithms such as WalkSat [42] are commonly applied as heuristics for maximum satisfiability problems. We give strong evidence that the distributional version of MAX-XOR-SAT is hard for algorithms that locally seek to improve an assignment by flipping variables so as to satisfy more clauses, giving some theoretical justification for the observations of [42]. Moreover, our proof even applies to the case when there exists an assignment that satisfies all the clauses generated by the target distribution.

The bound we obtain can be viewed as a restatement of the known lower bound for learning parities using statistical query algorithms (indeed, the problem of learning parities is a special case of our distributional MAX-XOR-SAT).

Theorem 9. For any δ > 0, any statistical algorithm requires at least τ²(2^n − 1) queries to STAT(τ) to solve (1/2 − δ)-approximate MAX-XOR-SAT. In particular, at least 2^{n/3} queries of tolerance 2^{−n/3} are required.

For a vector c ∈ {0,1}^n we define the parity function χ_c(x) as usual: χ_c(x) ≐ −(−1)^{c·x}. Further, let D_c be the uniform distribution over the set S_c = {x | χ_c(x) = 1}.

Lemma 2. For c ∈ {0,1}^n, c ≠ 0 and the uniform distribution U over {−1,1}^n, the following hold:

1. E_{x∼D_c}[χ_{c′}(x)] = 1 if c = c′, and 0 otherwise.
2. E_{x∼U}[χ_c(x)χ_{c′}(x)] = 1 if c = c′, and 0 otherwise.

These two facts will imply that when D = U (the uniform distribution) and the D_i's consist of the D_c's, we can set γ = 0 and β = 1 in our simpler variant of statistical dimension, from which Theorem 9 follows (a numerical check of Lemma 2 appears below).

Theorem 10. For the MAX-XOR-SAT problem, let F = {χ_x}_{x∈{0,1}^n}, let D be the set of all distributions over clauses c ∈ {0,1}^n, and for any δ > 0, let Z be the (1/2 − δ)-approximate MAX-XOR-SAT problem defined over F and D. Then SD(Z, 0, 1) ≥ 2^n − 1.
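Both facts of Lemma 2 are easy to verify by enumeration for small n. The sketch below is ours; it works over {0,1}^n with the sign convention defined above.

```python
# Exhaustive check of Lemma 2 for small n: chi_c(x) = -(-1)^(c.x) and D_c is
# uniform over S_c = {x : chi_c(x) = 1}.
import itertools

n = 4
X = list(itertools.product([0, 1], repeat=n))
chi = lambda c, x: -(-1) ** sum(ci * xi for ci, xi in zip(c, x))
cs = [c for c in X if any(c)]                        # all c != 0

for c in cs:
    Sc = [x for x in X if chi(c, x) == 1]
    for c2 in cs:
        e_dc = sum(chi(c2, x) for x in Sc) / len(Sc)            # fact 1
        e_u = sum(chi(c, x) * chi(c2, x) for x in X) / len(X)   # fact 2
        want = 1 if c == c2 else 0
        assert e_dc == want and e_u == want
print("Lemma 2 verified for n =", n)
```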
5. PLANTED CLIQUE

We now prove the lower bound claimed in Theorem 2 on the problem of detecting a planted k-clique in the given distribution on vectors from {0,1}^n as defined above. For a subset S ⊆ [n], let D_S be the distribution over {0,1}^n with a planted clique on the subset S. Let {S_1, ..., S_m} be the set of all m = C(n, k) subsets of [n] of size k. For i ∈ [m] we use D_i to denote D_{S_i}. The reference distribution D in our lower bounds will be the uniform distribution over {0,1}^n, and we let D̂_S denote D_S/D − 1. In order to apply our lower bounds based on statistical dimension with average correlation, we now prove that for the planted clique problem average correlations of large sets must be small. We start with a lemma that bounds the correlation of two planted clique distributions relative to the reference distribution D as a function of the overlap between the cliques:

Lemma 3. For i, j ∈ [m],

    ⟨D̂_i, D̂_j⟩_D ≤ 2^λ · k²/n²,

where λ = |S_i ∩ S_j|.

Proof. For the distribution D_i, we consider the probability D_i(x) of generating the vector x. Then

    D_i(x) = ((n−k)/n) · 2^{−n} + (k/n) · 2^{−(n−k)} if x_s = 1 for all s ∈ S_i, and
    D_i(x) = ((n−k)/n) · 2^{−n} otherwise.

We can now compute D̂_i and the inner product. Since D(x) = 2^{−n}, we have D̂_i(x) = (k/n) · (2^k · 1{x_s = 1 for all s ∈ S_i} − 1), and therefore

    ⟨D̂_i, D̂_j⟩_D = (k²/n²) · E_{x∼D}[(2^k · 1{x_{S_i} = 1})(2^k · 1{x_{S_j} = 1}) − 2^k · 1{x_{S_i} = 1} − 2^k · 1{x_{S_j} = 1} + 1]
                = (k²/n²) · (2^{2k} · 2^{−(2k−λ)} − 2 + 1) = (k²/n²) · (2^λ − 1) ≤ 2^λ · k²/n². □

We now give a bound on the average correlation of any D̂_S with a large number of distinct clique distributions.

Lemma 4. For δ > 0 and k ≤ n^{1/2−δ}, let {S_1, ..., S_m} be the set of all subsets of [n] of size k and let {D_1, ..., D_m} be the corresponding distributions on {0,1}^n. Then for any integer ℓ ≤ k, any set S of size k and any subset A ⊆ {S_1, ..., S_m} where |A| ≥ 4(m−1)/n^{2ℓδ},

    (1/|A|) · Σ_{S_i∈A} ⟨D̂_S, D̂_i⟩ < 2^{ℓ+2} · k²/n².

Proof. In this proof we first show that if the total number of sets in A is large, then most of the sets in A have a small overlap with S. We then use the bound on the overlap of most sets to obtain a bound on the average correlation of D̂_S with the distributions for sets in A.

Formally, we let α = k²/n² and using Lemma 3 get the bound ⟨D̂_i, D̂_j⟩ ≤ 2^{|S_i∩S_j|} · α. Summing over S_i ∈ A,

    Σ_{S_i∈A} ⟨D̂_S, D̂_i⟩ ≤ Σ_{S_i∈A} 2^{|S∩S_i|} · α.

For any set A ⊆ {S_1, ..., S_m} of size t, this bound is maximized when the sets of A include S, then all sets that intersect S in k−1 indices, then all sets that intersect S in k−2 indices, and so on until the size bound t is exhausted. We can therefore assume without loss of generality that A is defined in precisely this way.

Let T_λ = {S_i : |S ∩ S_i| = λ} denote the subset of all k-subsets that intersect with S in exactly λ indices, and let λ_0 be the smallest λ for which A ∩ T_λ is non-empty. We first observe that for any 1 ≤ j ≤ k−1,

    |T_j| / |T_{j+1}| = [C(k, j) · C(n−k, k−j)] / [C(k, j+1) · C(n−k, k−j−1)]
                     = (j+1)(n−2k+j+1) / ((k−j−1)(k−j)) ≥ (j+1)(n−2k) / (k(k+1)) ≥ (j+1) · n^{2δ}/2.

By applying this equation inductively we obtain

    |T_j| ≤ 2^j · |T_0| / (j! · n^{2δj}) < 2^j · (m−1) / (j! · n^{2δj})

and

    Σ_{k≥λ≥j} |T_λ| < Σ_{k≥λ≥j} 2^λ · (m−1) / (λ! · n^{2δλ}) ≤ 4(m−1)/n^{2δj}.

By the definition of λ_0, |A| ≤ Σ_{j≥λ_0} |T_j| < 4(m−1)/n^{2δλ_0}. In particular, if |A| ≥ 4(m−1)/n^{2ℓδ} then n^{2δλ_0} < n^{2ℓδ}, that is, λ_0 < ℓ. Now we can conclude that

    Σ_{S_i∈A} ⟨D̂_S, D̂_i⟩ ≤ Σ_{j=λ_0}^{k} 2^j · |T_j ∩ A| · α
                         ≤ (2^{λ_0} · |T_{λ_0} ∩ A| + Σ_{j=λ_0+1}^{k} 2^j · |T_j|) · α
                         ≤ (2^{λ_0} · |T_{λ_0} ∩ A| + 2 · 2^{λ_0+1} · |T_{λ_0+1}|) · α
                         ≤ 2^{λ_0+2} · |A| · α < 2^{ℓ+2} · |A| · α.

To derive the last inequality we note that for every j ≥ 0, 2^j · |T_j| > 2(2^{j+1} · |T_{j+1}|), and we can therefore telescope the sum. □

Lemma 4 gives a simple way to bound the SDA.

Theorem 11. For δ > 0 and k ≤ n^{1/2−δ}, let Z be the planted bipartite k-clique problem. Then for any ℓ ≤ k,

    SDA(Z, 2^{ℓ+2} · k²/n², 1/C(n, k)) ≥ n^{2ℓδ}/4.

Combining Theorems 5 and 11 gives Theorem 2 by taking ℓ = log r. Theorems 7 and 11 imply the sample complexity lower bound stated in Theorem 3 by taking ℓ ≥ 4/δ.

Analogously, for the generalized planted densest subgraph case, we can give a bound on the statistical dimension SDA:

Theorem 12. Fix 0 < q ≤ p ≤ 1. For δ > 0 and k ≤ n^{1/2−δ}, let Z be the generalized planted bipartite densest subgraph problem. Then for any ℓ ≤ k,

    SDA(Z, (2k²/n²) · [(1 + (p−q)²/(q(1−q)))^{ℓ+1} − 1], 1/C(n, k)) ≥ n^{2ℓδ}/4,

provided n^{2δ} ≥ 1 + (p−q)²/(q(1−q)).

Acknowledgments

We thank Benny Applebaum, Avrim Blum, Uri Feige, Ravi Kannan, Michael Kearns, Robi Krauthgamer, Moni Naor, Jan Vondrak, and Avi Wigderson for insightful comments and helpful discussions.

6. REFERENCES

[1] N. Alon, A. Andoni, T. Kaufman, K. Matulef, R. Rubinfeld, and N. Xie. Testing k-wise and almost k-wise independence. In Proceedings of STOC, pages 496–505, 2007.
[2] N. Alon, M. Krivelevich, and B. Sudakov. Finding a large hidden clique in a random graph. In SODA, pages 594–598, 1998.
[3] B. P. W. Ames and S. A. Vavasis. Nuclear norm minimization for the planted clique and biclique problems. Math. Program., 129(1):69–89, 2011.
[4] B. Applebaum, B. Barak, and A. Wigderson. Public-key cryptography from different assumptions. In STOC, pages 171–180, 2010.
[5] S. Arora, B. Barak, M. Brunnermeier, and R. Ge. Computational complexity and information asymmetry in financial products (extended abstract). In ICS, pages 49–65, 2010.
[6] A. Belloni, R. M. Freund, and S. Vempala. An efficient rescaled perceptron algorithm for conic systems. Math. Oper. Res., 34(3):621–641, 2009.
[7] A. Bhaskara, M. Charikar, E. Chlamtac, U. Feige, and A. Vijayaraghavan. Detecting high log-densities: an O(n^{1/4}) approximation for densest k-subgraph. In STOC, pages 201–210, 2010.
[8] A. Bhaskara, M. Charikar, A. Vijayaraghavan, V. Guruswami, and Y. Zhou. Polynomial integrality gaps for strong SDP relaxations of densest k-subgraph. In SODA, pages 388–405, 2012.
[9] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In Proceedings of PODS, pages 128–138, 2005.
[10] A. Blum, A. M. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1/2):35–52, 1998.
[11] A. Blum, M. L. Furst, J. C. Jackson, M. J. Kearns, Y. Mansour, and S. Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In STOC, pages 253–262, 1994.
[12] S. Brubaker and S. Vempala. Random tensors and planted cliques. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, volume 5687, pages 406–419, 2009.
[13] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Proceedings of NIPS, pages 281–288, 2006.
[14] A. Coja-Oghlan. Graph partitioning via adaptive spectral techniques. Combinatorics, Probability & Computing, 19(2):227–284, 2010.
[15] Y. Dekel, O. Gurel-Gurevich, and Y. Peres. Finding hidden cliques in linear time with high probability. In Proceedings of ANALCO, pages 67–75, 2011.
[16] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[17] J. Dunagan and S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. Math. Program., 114(1):101–114, 2008.
[18] U. Feige. Relations between average case complexity and approximation complexity. In IEEE Conference on Computational Complexity, page 5, 2002.
[19] U. Feige and R. Krauthgamer. Finding and certifying a large hidden clique in a semirandom graph. Random Struct. Algorithms, 16(2):195–208, 2000.
[20] U. Feige and R. Krauthgamer. The probable value of the Lovász–Schrijver relaxations for maximum independent set. SICOMP, 32(2):345–370, 2003.
[21] U. Feige and D. Ron. Finding hidden cliques in linear time. In Proceedings of AofA, pages 189–204, 2010.
[22] V. Feldman. A complete characterization of statistical query learning with applications to evolvability. Journal of Computer and System Sciences, 78(5):1444–1459, 2012.
[23] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao. Statistical algorithms and a lower bound for detecting planted cliques. CoRR, abs/1201.1214, 2012.
[24] A. M. Frieze and R. Kannan. A new approach to the planted clique problem. In FSTTCS, pages 187–198, 2008.
[25] A. E. Gelfand and A. F. M. Smith. Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association, 85:398–409, 1990.
[26] J. Håstad. Some optimal inapproximability results. J. ACM, 48:798–859, July 2001.
[27] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
[28] E. Hazan and R. Krauthgamer. How hard is it to approximate the best Nash equilibrium? SIAM J. Comput., 40(1):79–91, 2011.
[29] J. Jackson. On the efficiency of noise-tolerant PAC algorithms derived from statistical queries. Annals of Mathematics and Artificial Intelligence, 39(3):291–313, Nov. 2003.
[30] M. Jerrum. Large cliques elude the Metropolis process. Random Struct. Algorithms, 3(4):347–360, 1992.
[31] A. Juels and M. Peinado. Hiding cliques for cryptographic security. Des. Codes Cryptography, 20(3):269–280, 2000.
[32] R. Kannan. Personal communication.
[33] R. Karp. Probabilistic analysis of graph-theoretic algorithms. In Proceedings of Computer Science and Statistics 12th Annual Symposium on the Interface, page 173, 1979.
[34] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998.
[35] S. Khot. Ruling out PTAS for graph min-bisection, densest subgraph and bipartite clique. In FOCS, pages 136–145, 2004.
[36] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
[37] L. Kučera. Expected complexity of graph partitioning problems. Discrete Applied Mathematics, 57(2-3):193–212, 1995.
[38] F. McSherry. Spectral partitioning of random graphs. In FOCS, pages 529–537, 2001.
[39] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
[40] L. Minder and D. Vilenchik. Small clique detection and approximate Nash equilibria. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, volume 5687, pages 673–685, 2009.
[41] K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, Series 5, 50(302):157–175, 1900.
[42] B. Selman, H. Kautz, and B. Cohen. Local search strategies for satisfiability testing. In DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 521–532, 1995.
[43] R. Servedio. Computational sample complexity and attribute-efficient learning. Journal of Computer and System Sciences, 60(1):161–178, 2000.
[44] B. Szörényi. Characterizing statistical query learning: Simplified notions and proofs. In ALT, pages 186–200, 2009.
[45] M. Tanner and W. Wong. The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82:528–550, 1987.
[46] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, 1984.
[47] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probab. and its Applications, 16(2):264–280, 1971.
[48] V. Černý. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications, 45(1):41–51, Jan. 1985.
[49] K. Yang. On learning correlated boolean functions using statistical queries. In Proceedings of ALT, pages 59–76, 2001.
[50] K. Yang. New lower bounds for statistical query learning. J. Comput. Syst. Sci., 70(4):485–509, 2005.
[51] A. Yao. Probabilistic computations: Toward a unified measure of complexity. In FOCS, pages 222–227, 1977.