
arXiv:1501.00622v4 [math.OC] 19 Jun 2017

Strong NP-Hardness for Sparse Optimization with Concave Penalty Functions

Yichen Chen (Princeton University, NJ, USA), Dongdong Ge (Shanghai University of Finance and Economics, Shanghai, China), Mengdi Wang (Princeton University, NJ, USA), Zizhuo Wang (University of Minnesota, MN, USA), Yinyu Ye (Stanford University, CA, USA), Hao Yin (Stanford University, CA, USA). Correspondence to: Mengdi Wang <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Keywords: Nonconvex optimization, NP-hardness, Concave penalty, Sparsity, Computational complexity

Abstract

Consider the regularized sparse minimization problem, which involves empirical sums of loss functions for n data points (each of dimension d) and a nonconvex sparsity penalty. We prove that finding an O(n^{c1} d^{c2})-optimal solution to the regularized sparse optimization problem is strongly NP-hard for any c1, c2 ∈ [0, 1) such that c1 + c2 < 1. The result applies to a broad class of loss functions and sparse penalty functions. It suggests that one cannot even approximately solve the sparse optimization problem in polynomial time, unless P = NP.

1 Introduction

We study the sparse minimization problem, where the objective is the sum of empirical losses over input data and a sparse penalty function. Such problems commonly arise from empirical risk minimization and variable selection. The role of the penalty function is to induce sparsity in the optimal solution, i.e., to minimize the empirical loss using as few nonzero coefficients as possible.

Problem 1. Given the loss function ℓ : R × R → R+, the penalty function p : R+ → R+, and the regularization parameter λ > 0, consider the problem

    min_{x ∈ R^d}  Σ_{i=1}^n ℓ(a_i^T x, b_i) + λ Σ_{j=1}^d p(|x_j|),

where A = (a_1, ..., a_n)^T ∈ R^{n×d} and b = (b_1, ..., b_n)^T ∈ R^n are input data.

We are interested in the computational complexity of Problem 1 under general conditions on the loss function ℓ and the sparse penalty p. In particular, we focus on the case where ℓ is a convex loss function and p is a concave penalty with a unique minimizer at 0. Optimization problems of this form are common in sparse regression, compressive sensing, and sparse approximation, where sparse penalties such as the L_q norm (q ∈ [0, 1)), the clipped L_1 norm, and SCAD are widely used. A list of applicable examples of ℓ and p is given in Section 3.
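To make Problem 1 concrete, here is a minimal numerical sketch of the objective, using the squared loss ℓ(y, b) = (y − b)^2 and the L_{1/2} penalty p(t) = t^{1/2} as stand-ins; the data, the regularization level, and these particular choices of ℓ and p are illustrative assumptions for the sketch, not prescriptions from the paper.

    import numpy as np

    def sparse_objective(x, A, b, lam, loss, penalty):
        # Evaluates sum_i loss(a_i^T x, b_i) + lam * sum_j penalty(|x_j|), as in Problem 1.
        data_fit = sum(loss(float(a @ x), float(bi)) for a, bi in zip(A, b))
        sparsity = sum(penalty(abs(float(xj))) for xj in x)
        return data_fit + lam * sparsity

    # Illustrative stand-ins: squared loss and the L_{1/2} (bridge) penalty.
    squared_loss = lambda y, b: (y - b) ** 2
    l_half_penalty = lambda t: t ** 0.5

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 3))      # n = 5 data points, d = 3 variables
    b = rng.standard_normal(5)
    x = np.array([0.7, 0.0, -0.2])       # a candidate solution with one zero coefficient
    print(sparse_objective(x, A, b, lam=0.1, loss=squared_loss, penalty=l_half_penalty))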

For certain special cases of Problem 1, it has been shown that finding an exact solution is strongly NP-hard (Huo & Chen, 2010; Chen et al., 2014). However, these results have not excluded the possibility of the existence of polynomial-time algorithms with small approximation error. (Chen & Wang, 2016) established the hardness of approximately solving Problem 1 when p is the L_0 norm.

In this paper, we prove that it is strongly NP-hard to approximately solve Problem 1 within certain optimality error. More precisely, we show that there exists a lower bound on the suboptimality error of any polynomial-time deterministic algorithm. Our results apply to a variety of optimization problems in estimation and machine learning. Examples include sparse classification, sparse logistic regression, and many more. The strong NP-hardness of approximation is one of the strongest forms of complexity result for continuous optimization. To the best of our knowledge, this paper gives the first and strongest set of hardness results for Problem 1 under very general assumptions regarding the loss and penalty functions. Our main contributions are three-fold.

1. We prove the strong NP-hardness for Problem 1 with general loss functions. This is the first result that applies to a broad class of problems including, but not limited to: least squares regression, the linear model with Laplacian noise, robust regression, Poisson regression, logistic regression, inverse Gaussian models, etc.

2. We present a general condition on the sparse penalty function p such that Problem 1 is strongly NP-hard. The condition is a slightly weaker version of strict concavity. It is satisfied by typical penalty functions such as the L_q norm (q ∈ [0, 1)), the clipped L_1 norm, SCAD, etc. To the best of our knowledge, this is the most general condition on the penalty function in the literature.

3. We prove that finding an O(λ n^{c1} d^{c2})-optimal solution to Problem 1 is strongly NP-hard, for any c1, c2 ∈ [0, 1) such that c1 + c2 < 1. Here the O(·) hides parameters that depend on the penalty function p, which is to be specified later. It illustrates a gap between the optimization error achieved by any tractable algorithm and the desired statistical precision. Our proof provides a first unified analysis that deals with a broad class of problems taking the form of Problem 1.

Section 2 summarizes related literature from optimization, machine learning and statistics. Section 3 presents the key assumptions and illustrates examples of loss and penalty functions that satisfy the assumptions. Section 4 gives the main results. Section 5 discusses the implications of our hardness results. Section 6 provides a proof of the main results in a simplified setting. The full proofs are deferred to the appendix.

2 Background and Related Works

Sparse optimization is a powerful machine learning tool for extracting useful information from massive data. In Problem 1, the sparse penalty serves to select the most relevant variables from a large number of variables, in order to avoid overfitting. In recent years, nonconvex choices of p have received much attention; see (Frank & Friedman, 1993; Fan & Li, 2001; Chartrand, 2007; Candes et al., 2008; Fan & Lv, 2010; Xue et al., 2012; Loh & Wainwright, 2013; Wang et al., 2014; Fan et al., 2015).

Within the optimization and mathematical programming community, the complexity of Problem 1 has been considered in a number of special cases. (Huo & Chen, 2010) first proved a hardness result for a relaxed family of penalty functions with L_2 loss. They showed that for the L_0 penalty, the hard-thresholding penalty (Antoniadis & Fan, 2001), and the SCAD penalty (Fan & Li, 2001), the above optimization problem is NP-hard. (Chen et al., 2014) showed that the L_2-L_p minimization is strongly NP-hard when p ∈ (0, 1). At the same time, (Bian & Chen, 2014) proved the strong NP-hardness for another class of penalty functions. The preceding analyses mainly focused on finding an exact global optimum to Problem 1. For this purpose, they implicitly assumed that the input and parameters involved in the reduction are rational numbers with a finite numerical representation; otherwise, finding a global optimum to a continuous problem would always be intractable. A recent technical report (Chen & Wang, 2016) proves the hardness of obtaining an ε-optimal solution when p is the L_0 norm.

Within the theoretical computer science community, there have been several early works on the complexity of sparse recovery, beginning with (Arora et al., 1993). (Amaldi & Kann, 1998) proved that the problem min{||x||_0 : Ax = b} is not approximable within a factor 2^{log^{1−ε} d} for any ε > 0. (Natarajan, 1995) showed that, given ε > 0, A and b, the problem min{||x||_0 : ||Ax − b||_2 ≤ ε} is NP-hard. (Davis et al., 1997) proved a similar result: for some given ε > 0 and M > 0, it is NP-complete to find a solution x such that ||x||_0 ≤ M and ||Ax − b|| ≤ ε. More recently, (Foster et al., 2015) studied sparse recovery and sparse linear regression with subgaussian noises. Assuming that the true solution is K-sparse, it showed that no polynomial-time (randomized) algorithm can find a K·2^{log^{1−δ} d}-sparse solution x with ||Ax − b||_2^2 ≤ d^{C1} n^{1−C2} with high probability, where δ, C1, C2 are arbitrary positive scalars. Another work (Zhang et al., 2014) showed that under the Gaussian linear model, there exists a gap between the mean square loss that can be achieved by polynomial-time algorithms and the statistically optimal mean squared error. These two works focus on estimation of linear models and impose distributional assumptions regarding the input data. These results on estimation are different in nature from our results on optimization. In contrast, we focus on the optimization problem itself. Our results apply to a variety of loss functions and penalty functions, not limited to linear regression. Moreover, we do not make any distributional assumption regarding the input data.

There remain several open questions. First, existing results mainly considered least squares problems or L_q minimization problems. Second, existing results focused mainly on the L_0 penalty function. The complexity of Problem 1 with general loss function and penalty function is yet to be established. Things get complicated when p is a continuous function instead of the discrete L_0 norm function. The complexity of finding an ε-optimal solution with general ℓ and p is not fully understood. We will address these questions in this paper.
3 Assumptions

In this section, we state the two critical assumptions that lead to the strong NP-hardness results: one for the penalty function p, the other one for the loss function ℓ. We argue that these assumptions are essential and very general. They apply to a broad class of loss functions and penalty functions that are commonly used.

3.1 Assumption About Sparse Penalty

Throughout this paper, we make the following assumption regarding the sparse penalty function p(·).

Assumption 1. The penalty function p(·) satisfies the following conditions:

(i) (Monotonicity) p(·) is non-decreasing on [0, +∞).

(ii) (Concavity) There exists τ > 0 such that p(·) is concave but not linear on [0, τ].

In words, condition (ii) means that the concave penalty p(·) is nonlinear. Assumption 1 is the most general condition on penalty functions in the existing literature on sparse optimization. Below we present a few such examples; a short numerical sketch of several of them is given after the list.

1. In variable selection problems, the L_0 penalization p(t) = I{t ≠ 0} arises naturally as a penalty for the number of factors selected.

2. A natural generalization of the L_0 penalization is the L_p penalization p(t) = t^p with 0 < p < 1. The corresponding minimization problem is called the bridge regression problem (Frank & Friedman, 1993).

3. To obtain a hard-thresholding estimator, Antoniadis & Fan (2001) use the penalty functions p_γ(t) = γ^2 − ((γ − t)_+)^2 with γ > 0, where (x)_+ := max{x, 0} denotes the positive part of x.

4. Any penalty function that belongs to the folded concave penalty family (Fan et al., 2014) satisfies the conditions in Theorem 1. Examples include the SCAD (Fan & Li, 2001) and the MCP (Zhang, 2010a), whose derivatives on (0, +∞) are p'_γ(t) = γ I{t ≤ γ} + ((aγ − t)_+ / (a − 1)) I{t > γ} and p'_γ(t) = (γ − t/b)_+, respectively, where γ > 0, a > 2 and b > 1.

5. The conditions in Theorem 1 are also satisfied by the clipped L_1 penalty function (Antoniadis & Fan, 2001; Zhang, 2010b) p_γ(t) = γ · min(t, γ) with γ > 0. This is a special case of the piecewise linear penalty function

    p(t) = k_1 t                     if 0 ≤ t ≤ a,
           k_2 t + (k_1 − k_2) a     if t > a,

where 0 ≤ k_2 < k_1 and a > 0.

6. Another family of penalty functions which bridges the L_0 and L_1 penalties are the fraction penalty functions p_γ(t) = (γ + 1) t / (γ + t) with γ > 0 (Lv & Fan, 2009).

7. The family of log-penalty functions p_γ(t) = log(1 + γt) / log(1 + γ) with γ > 0 also bridges the L_0 and L_1 penalties (Candes et al., 2008).
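For reference, here is a minimal sketch of several penalties from the list above, written directly from the formulas given there (the L_p, hard-thresholding, clipped L_1, fraction, and log penalties); the parameter values of γ and p are illustrative only. Each of these functions is non-decreasing and concave but not linear near the origin, which is what Assumption 1 requires.

    import math

    def lp_penalty(t, p=0.5):
        # bridge / L_p penalty, 0 < p < 1
        return t ** p

    def hard_threshold_penalty(t, gamma=1.0):
        # gamma^2 - ((gamma - t)_+)^2
        return gamma ** 2 - max(gamma - t, 0.0) ** 2

    def clipped_l1_penalty(t, gamma=1.0):
        # gamma * min(t, gamma)
        return gamma * min(t, gamma)

    def fraction_penalty(t, gamma=2.0):
        # (gamma + 1) t / (gamma + t)
        return (gamma + 1.0) * t / (gamma + t)

    def log_penalty(t, gamma=2.0):
        # log(1 + gamma t) / log(1 + gamma)
        return math.log(1.0 + gamma * t) / math.log(1.0 + gamma)

    for t in (0.0, 0.25, 0.5, 1.0, 2.0):
        print(t, lp_penalty(t), hard_threshold_penalty(t), clipped_l1_penalty(t),
              fraction_penalty(t), log_penalty(t))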
3.2 Assumption About Loss Function

We state our assumption about the loss function ℓ.

Assumption 2. Let M be an arbitrary constant. For any interval [τ_1, τ_2] where 0 < τ_1 < τ_2 < M, there exist k ∈ Z+ and b ∈ Q^k such that h(y) = Σ_{i=1}^k ℓ(y, b_i) has the following properties:

(i) h(y) is convex and Lipschitz continuous on [τ_1, τ_2].

(ii) h(y) has a unique minimizer y* in (τ_1, τ_2).

(iii) There exist N ∈ Z+, δ̄ ∈ Q+ and C ∈ Q+ such that when δ ∈ (0, δ̄), we have

    (h(y* ± δ) − h(y*)) / δ^N ≥ C.

(iv) h(y*) and {b_i}_{i=1}^k can be represented in O(log(1/(τ_2 − τ_1))) bits.

Assumption 2 is a critical, but very general, assumption regarding the loss function ℓ(y, b). Condition (i) requires convexity and Lipschitz continuity within a neighborhood. Conditions (ii), (iii) essentially require that, given an interval [τ_1, τ_2], one can artificially pick b_1, ..., b_k to construct a function h(y) = Σ_{i=1}^k ℓ(y, b_i) such that h has its unique minimizer in [τ_1, τ_2] and has enough curvature near the minimizer. This property ensures that a bound on the minimal value of h(y) can be translated into a meaningful bound on the minimizer y*. Conditions (i), (ii), (iii) are typical properties that a loss function usually satisfies. Condition (iv) is a technical condition that is used to avoid dealing with infinitely-long irrational numbers. It can be easily verified for almost all common loss functions.

We will show that Assumption 2 is satisfied by a variety of loss functions. An (incomplete) list is given below; a small numeric check for one of these constructions follows the list.

1. In least squares regression, the loss function has the form Σ_{i=1}^n (a_i^T x − b_i)^2. Using our notation, the corresponding loss function is ℓ(y, b) = (y − b)^2. For all τ_1, τ_2, we choose an arbitrary b' ∈ [τ_1, τ_2]. We can verify that h(y) = ℓ(y, b') satisfies all the conditions in Assumption 2.

2. In the linear model with Laplacian noise, the negative log-likelihood function is Σ_{i=1}^n |a_i^T x − b_i|. So the loss function is ℓ(y, b) = |y − b|. As in the case of least squares regression, the loss function satisfies Assumption 2. A similar argument also holds when we consider the L_q loss |·|^q with q ≥ 1.

3. In robust regression, we consider the Huber loss (Huber, 1964), which is a mixture of the L_1 and L_2 norms. The loss function takes the form

    L_δ(y, b) = (1/2)(y − b)^2          if |y − b| ≤ δ,
                δ(|y − b| − (1/2)δ)     otherwise,

for some δ > 0, where y = a^T x. We then verify that Assumption 2 is satisfied. For any interval [τ_1, τ_2], we pick an arbitrary b ∈ [τ_1, τ_2] and let h(y) = ℓ(y, b). We can see that h(y) satisfies all the conditions in Assumption 2.

4. In Poisson regression (Cameron & Trivedi, 2013), the negative log-likelihood minimization is

    min_{x ∈ R^d} −log L(x; A, b) = min_{x ∈ R^d} Σ_{i=1}^n ( exp(a_i^T x) − b_i · a_i^T x ).

We now show that ℓ(y, b) = e^y − b·y satisfies Assumption 2. For any interval [τ_1, τ_2], we choose q and r such that q/r ∈ [e^{τ_1}, e^{τ_2}]. Note that e^{τ_2} − e^{τ_1} ≥ τ_2 − τ_1, and e^{τ_2} is bounded by e^M. Thus, q, r can be chosen to be polynomial in ⌈1/(τ_2 − τ_1)⌉ by letting r = ⌈1/(τ_2 − τ_1)⌉ and q be some number less than r·e^M. Then, we choose k = r and b ∈ Z^k such that h(y) = Σ_{i=1}^k ℓ(y, b_i) = r·e^y − q·y. Let us verify Assumption 2. Conditions (i), (iv) are straightforward by our construction. For (ii), note that h(y) takes its minimum at ln(q/r), which is inside [τ_1, τ_2] by our construction. To verify (iii), consider the second order Taylor expansion of h(y) at ln(q/r):

    h(y* + δ) − h(y*) = (r·e^{y*}/2)·δ^2 + o(δ^2) ≥ (1/2)·δ^2 + o(δ^2).

We can see that (iii) is satisfied. Therefore, Assumption 2 is satisfied.

5. In logistic regression, the negative log-likelihood function minimization is

    min_{x ∈ R^d} Σ_{i=1}^n log(1 + exp(a_i^T x)) − Σ_{i=1}^n b_i · a_i^T x.

We claim that the loss function ℓ(y, b) = log(1 + exp(y)) − b·y satisfies Assumption 2. By a similar argument as the one in Poisson regression, we can verify that h(y) = Σ_{i=1}^r ℓ(y, b_i) = r·log(1 + exp(y)) − q·y, where q/r ∈ [e^{τ_1}/(1 + e^{τ_1}), e^{τ_2}/(1 + e^{τ_2})] and q, r are polynomial in ⌈1/(τ_2 − τ_1)⌉, satisfies all the conditions in Assumption 2. For (ii), observe that h(y) takes its minimum at y = ln((q/r)/(1 − q/r)). To verify (iii), we consider the second order Taylor expansion at y = ln((q/r)/(1 − q/r)), which is

    h(y + δ) − h(y) = (q/(2(1 + e^y)))·δ^2 + o(δ^2),

where y ∈ [τ_1, τ_2]. Note that e^y is bounded by e^M, which can be computed beforehand. As a result, (iii) holds as well.

6. In the mean estimation of inverse Gaussian models (McCullagh, 1984), the negative log-likelihood function minimization is

    min_{x ∈ R^d} Σ_{i=1}^n (b_i · √(a_i^T x) − 1)^2 / b_i.

Now we show that the loss function ℓ(y, b) = (b·√y − 1)^2 / b satisfies Assumption 2. By setting the derivative with respect to y to zero, we can see that ℓ(y, b) takes its minimum at y = 1/b^2. Thus, for any [τ_1, τ_2], we choose a rational b' ∈ [1/√τ_2, 1/√τ_1]. We can see that h(y) = ℓ(y, b') satisfies all the conditions in Assumption 2.

7. In the estimation of the generalized linear model under the exponential distribution (McCullagh, 1984), the negative log-likelihood function minimization is

    min_{x ∈ R^d} −log L(x; A, b) = min_{x ∈ R^d} Σ_{i=1}^n ( b_i/(a_i^T x) + log(a_i^T x) ).

By setting the derivative with respect to y to zero, we can see that ℓ(y, b) = b/y + log y has a unique minimizer at y = b. Thus, by choosing b' ∈ [τ_1, τ_2] appropriately, we can readily show that h(y) = ℓ(y, b') satisfies all the conditions in Assumption 2.

To sum up, the combination of any penalty function given in Section 3.1 and any loss function given in Section 3.2 results in a strongly NP-hard optimization problem.
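The following is a small numeric sanity check of the Assumption 2 construction for the logistic loss described above: it picks an interval [τ_1, τ_2], chooses integers q and r with q/r inside [e^{τ_1}/(1 + e^{τ_1}), e^{τ_2}/(1 + e^{τ_2})], and confirms that h(y) = r log(1 + e^y) − q y has its minimizer ln((q/r)/(1 − q/r)) inside (τ_1, τ_2) with quadratic local growth. The particular interval and the scaling used to land an integer ratio inside it are illustrative choices, not part of the paper's construction.

    import math

    tau1, tau2 = 0.5, 1.0
    lo = math.exp(tau1) / (1.0 + math.exp(tau1))     # ~0.622
    hi = math.exp(tau2) / (1.0 + math.exp(tau2))     # ~0.731
    r = 100 * math.ceil(1.0 / (tau2 - tau1))         # denominator, scaled so an integer ratio fits
    q = math.ceil(lo * r) + 1                        # numerator with lo < q/r < hi
    assert lo < q / r < hi

    def h(y):
        # h(y) = sum of r logistic losses log(1 + e^y) - b*y, with b = 1 for q of them and b = 0 otherwise
        return r * math.log(1.0 + math.exp(y)) - q * y

    y_star = math.log((q / r) / (1.0 - q / r))       # stationary point of h
    assert tau1 < y_star < tau2                      # condition (ii): minimizer inside the interval

    # condition (iii): h(y* +/- delta) - h(y*) ~ (q / (2 (1 + e^{y*}))) * delta^2
    for delta in (1e-1, 1e-2, 1e-3):
        growth = min(h(y_star + delta), h(y_star - delta)) - h(y_star)
        print(delta, growth, growth / delta ** 2, q / (2.0 * (1.0 + math.exp(y_star))))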

4 Main Results

In this section, we state our main results on the strong NP-hardness of Problem 1. We warm up with a preliminary result for a special case of Problem 1.

Theorem 1 (A Preliminary Result). Let Assumption 1 hold, and let p(·) be twice continuously differentiable in (0, ∞). Then the minimization problem

    min_{x ∈ R^d}  ||Ax − b||_q^q + λ Σ_{j=1}^d p(|x_j|),        (1)

with q ≥ 1, is strongly NP-hard.

The result shows that many of the penalized least squares problems, e.g., (Fan & Lv, 2010), while enjoying small estimation errors, are hard to compute. It suggests that there does not exist a fully polynomial-time approximation scheme for Problem 1. It does not, however, answer the question of whether one can approximately solve Problem 1 within a certain constant error.

Now we show that it is not even possible to efficiently approximate the global optimal solution of Problem 1, unless P = NP. Given an optimization problem min_{x ∈ X} f(x), we say that a solution x̄ is ε-optimal if x̄ ∈ X and f(x̄) ≤ inf_{x ∈ X} f(x) + ε.

Theorem 2 (Strong NP-Hardness of Problem 1). Let Assumptions 1 and 2 hold, and let c1, c2 ∈ [0, 1) be arbitrary such that c1 + c2 < 1. Then it is strongly NP-hard to find a λ·κ·n^{c1} d^{c2}-optimal solution of Problem 1, where d is the dimension of the variable space and κ = min_{t ∈ [τ/2, τ]} { (2p(t/2) − p(t)) / t }.

The non-approximable error in Theorem 2 involves the constant κ, which is determined by the sparse penalty function p. In the case where p is the L_0 norm function, we can take κ = 1. In the case of the piecewise linear L_1 penalty, we have κ = (k_1 − k_2)/4. In the case of the SCAD penalty, we have κ = Θ(γ^2).

According to Theorem 2, the non-approximable error λ·κ·n^{c1} d^{c2} is determined by three factors: (i) properties of the regularization penalty, λ·κ; (ii) the data size n; and (iii) the dimension, or number of variables, d. This result illustrates a fundamental gap that cannot be closed by any polynomial-time deterministic algorithm. This gap scales up when either the data size or the number of variables increases. In Section 5.1, we will see that this gap is substantially larger than the desired estimation precision in a special case of sparse linear regression.

Theorems 1 and 2 validate the long-lasting belief that optimization involving a nonconvex penalty is hard. More importantly, Theorem 2 provides lower bounds for the optimization error that can be achieved by any polynomial-time algorithm. This is one of the strongest forms of hardness result for continuous optimization.

5 An Application and Remarks

In this section, we analyze the strong NP-hardness results in the special case of linear regression with SCAD penalty (Problem 1). We then give a few remarks on the implications of our hardness results.

5.1 Hardness of Regression with SCAD Penalty

Let us try to understand how significant the non-approximable error of Problem 1 is. We consider the special case of linear models with SCAD penalty. Let the input data (A, b) be generated by the linear model Ax̄ + ε = b, where x̄ is the vector of unknown true sparse coefficients and ε is a zero-mean multivariate subgaussian noise. Given the data size n and variable dimension d, we follow (Fan & Li, 2001) and obtain a special case of Problem 1, given by

    min_x  (1/2)||Ax − b||_2^2 + n Σ_{j=1}^d p_γ(|x_j|),        (2)

where γ = √(log d / n). (Fan & Li, 2001) showed that the optimal solution x* of problem (2) has a small statistical error, i.e., ||x̄ − x*||_2 = O(n^{−1/2} + a_n), where a_n = max{ p'_λ(|x*_j|) : x*_j ≠ 0 }. (Fan et al., 2015) further showed that we only need to find a √(n log d)-optimal solution to (2) to achieve such a small estimation error.

However, Theorem 2 tells us that it is not possible to compute an ε_{d,n}-optimal solution for problem (2) in polynomial time, where ε_{d,n} = λ κ n^{1/2} d^{1/3} (by letting c1 = 1/2, c2 = 1/3). In the special case of problem (2), we can verify that λ = n and κ = Ω(γ^2) = Ω(log d / n). As a result, we see that

    ε_{d,n} = Ω(n^{1/2} d^{1/3}) ≫ √(n log d)

for high values of the dimension d. According to Theorem 2, it is strongly NP-hard to approximately solve problem (2) within the required statistical precision √(n log d). This result illustrates a sharp contrast between the statistical properties of sparse estimation and the worst-case computational complexity.
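To see the scale of this gap, here is a short back-of-the-envelope computation with the constants suppressed (taking λ = n and κ = log d / n exactly, which is only an illustrative reading of κ = Ω(log d / n)):

    import math

    def non_approximable_error(n, d):
        # lambda * kappa * n^{1/2} * d^{1/3}, with lambda = n and kappa ~ log(d)/n
        return n * (math.log(d) / n) * math.sqrt(n) * d ** (1.0 / 3.0)

    def statistical_precision(n, d):
        # sqrt(n log d): the suboptimality level that suffices for the oracle statistical rate
        return math.sqrt(n * math.log(d))

    for n, d in [(10**3, 10**4), (10**4, 10**6), (10**5, 10**9)]:
        print(n, d, round(non_approximable_error(n, d)), round(statistical_precision(n, d)))

In these units the ratio of the two quantities is d^{1/3}·√(log d), so the gap widens with the dimension d regardless of n.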
5.2 Remarks on the NP-Hardness Results

As illustrated by the preceding analysis, the non-approximability of Problem 1 suggests that computing the sparse estimator is hard. The results suggest a fundamental conflict between computational efficiency and estimation accuracy in sparse data analysis. Although the results seem negative, they should not discourage researchers from studying the computational perspectives of sparse optimization. We make the following remarks:

1. Theorems 1 and 2 are worst-case complexity results. They suggest that one cannot find a tractable solution to the sparse optimization problems without making additional assumptions to rule out the worst-case instances.

2. Our results do not exclude the possibility that, under more stringent modeling and distributional assumptions, the problem would be tractable with high probability or on average.

In short, the sparse optimization Problem 1 is fundamentally hard from a purely computational perspective.

6 Proof of Theorem 1

In this section, we prove Theorem 1. The proof of Theorem 2 is deferred to the appendix; it is based on the idea of the proof in this section. We construct a polynomial-time reduction from the 3-partition problem (Garey & Johnson, 1978) to the sparse optimization problem. Given a set S of 3m integers s_1, ..., s_{3m}, the 3-partition problem is to determine whether S can be partitioned into m triplets such that the sum of the numbers in each subset is equal. This problem is known to be strongly NP-hard (Garey & Johnson, 1978). The main proof idea is captured by the lemmas below.

A first consequence of Assumption 1 is that for any l ≥ 2 and t_1, ..., t_l ∈ R,

    p(|t_1|) + ··· + p(|t_l|) ≥ min{ p(|t_1 + ··· + t_l|), p(τ) }.

Lemma 4. If p(t) satisfies the conditions in Theorem 1, then there exists τ_0 ∈ (0, τ) such that p(·) is concave but not linear on [0, τ_0] and is twice continuously differentiable on [τ_0, τ]. Furthermore, for any t̃ ∈ (τ_0, τ), let δ̄ = min{τ_0/3, t̃ − τ_0, τ − t̃}. Then for any δ ∈ (0, δ̄), l ≥ 2, and any t_1, t_2, ..., t_l such that t_1 + ··· + t_l = t̃, we have

    p(|t_1|) + ··· + p(|t_l|) ≥ p(t̃) + C·δ,   where C := (p(τ_0/3) + p(2τ_0/3) − p(τ_0)) / (τ_0/3) > 0,

unless |t_i − t̃| < 2δ for some i while |t_j| ≤ δ for all j ≠ i.

Given a fixed rational number τ̂ ∈ (τ_0, τ), define

    g_{θ,µ}(t) := p(|t|) + θ·|t|^q + µ·|t − τ̂|^q,   with θ, µ > 0.

We have the following lemmas about g_{θ,µ}(t).

Lemma 5. If p(t) satisfies the conditions in Theorem 1, q > 1, and τ_0 satisfies the properties in Lemma 4, then there exist θ̄ > 0 and µ̄ > 0 such that for any θ ≥ θ̄ and µ ≥ µ̄·θ, the following properties are satisfied:

1. g''_{θ,µ}(t) ≥ 1 for any t ∈ [τ_0, τ];

2. g_{θ,µ}(t) has a unique global minimizer t*(θ,µ) ∈ (τ_0, τ);

3. Let δ̄ = min{t*(θ,µ) − τ_0, τ − t*(θ,µ), 1}; then for any δ ∈ (0, δ̄), we have g_{θ,µ}(t) < g_{θ,µ}(t*(θ,µ)) + δ^2/2 only if |t − t*(θ,µ)| < δ.

Lemma 6. If p(t) satisfies the conditions in Theorem 1, q = 1, and τ_0 satisfies the properties in Lemma 4, then there exists µ̂ > 0 such that for any µ ≥ µ̂, the following properties are satisfied:

1. g'_{0,µ}(t) < −1 for any t ∈ [τ_0, τ̂) and g'_{0,µ}(t) > 1 for any t ∈ (τ̂, τ];

2. g_{0,µ}(t) has a unique global minimizer t*(0,µ) = τ̂ ∈ (τ_0, τ);

3. Let δ̄ = min{τ̂ − τ_0, τ − τ̂, 1}; then for any δ ∈ (0, δ̄), we have g_{0,µ}(t) < g_{0,µ}(τ̂) + δ only if |t − τ̂| < δ.

Write h(θ,µ) := min_{t ∈ R} g_{θ,µ}(t) = g_{θ,µ}(t*(θ,µ)). Combining the lemmas above (taking θ = 0 when q = 1), we can find θ and µ such that for any l ≥ 2 and t_1, ..., t_l ∈ R,

    Σ_{j=1}^l p(|t_j|) + θ·|Σ_{j=1}^l t_j|^q + µ·|Σ_{j=1}^l t_j − τ̂|^q ≥ h(θ,µ).

Moreover, let δ̄ = min{τ_0/3, (t*(θ,µ) − τ_0)/2, (τ − t*(θ,µ))/2, 1, C}, where C is defined in Lemma 4; then for any δ ∈ (0, δ̄),

    Σ_{j=1}^l p(|t_j|) + θ·|Σ_{j=1}^l t_j|^q + µ·|Σ_{j=1}^l t_j − τ̂|^q < h(θ,µ) + δ^2

holds only if |t_i − t*(θ,µ)| < 2δ for some i while |t_j| ≤ δ for all j ≠ i.
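Here is a minimal grid sketch of g_{θ,µ} for one concrete choice, the clipped L_1 penalty with γ = 1 and hand-picked τ_0, τ, τ̂, θ, µ, q (all illustrative; they are not the constants θ̄, µ̄ produced by the lemmas). It simply locates the global minimizer numerically and checks that it falls inside (τ_0, τ), which is the behaviour Lemma 5 formalizes.

    import numpy as np

    gamma = 1.0                                # clipped L1 penalty p(t) = gamma * min(t, gamma)
    tau0, tau, tau_hat = 1.5, 2.0, 1.75        # illustrative, with tau0 < tau_hat < tau
    theta, mu, q = 0.01, 10.0, 2.0             # illustrative theta, mu and q > 1

    def p(t):
        return gamma * min(t, gamma)

    def g(t):
        return p(abs(t)) + theta * abs(t) ** q + mu * abs(t - tau_hat) ** q

    grid = np.linspace(-1.0, 3.0, 400001)
    values = np.array([g(t) for t in grid])
    t_star = float(grid[values.argmin()])
    print(t_star, float(values.min()))         # minimizer ~1.748 and h(theta, mu) = g(t_star)
    assert tau0 < t_star < tau                 # the minimizer lies in (tau0, tau)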

Proof of Theorem 1. We present a polynomial-time reduction to problem (1) from the 3-partition problem. For any given instance of the 3-partition problem with b = (b_1, ..., b_{3m}), we consider the minimization problem min_x f(x) in the form of (1), with variables x = {x_{ij}}, 1 ≤ i ≤ 3m, 1 ≤ j ≤ m, where

    f(x) := Σ_{j=2}^m | Σ_{i=1}^{3m} b_i x_{ij} − Σ_{i=1}^{3m} b_i x_{i1} |^q
            + λθ Σ_{i=1}^{3m} | Σ_{j=1}^m x_{ij} |^q
            + λµ Σ_{i=1}^{3m} | Σ_{j=1}^m x_{ij} − τ̂ |^q
            + λ Σ_{i=1}^{3m} Σ_{j=1}^m p(|x_{ij}|).

Note that the lower bounds θ̄, µ̄, and µ̂ only depend on the penalty function p(·); we can therefore choose θ ≥ θ̄ and µ ≥ µ̄·θ if q > 1, or θ = 0 and µ ≥ µ̂ if q = 1, such that

    f(x) ≥ λ Σ_{i=1}^{3m} ( Σ_{j=1}^m p(|x_{ij}|) + θ | Σ_{j=1}^m x_{ij} |^q + µ | Σ_{j=1}^m x_{ij} − τ̂ |^q ) ≥ 3mλ·h(θ,µ).        (4)
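Under the reconstruction of f(x) above, one way to see that it is indeed an instance of problem (1) is to list the rows of the data (A, b) explicitly; the following encoding is a sketch consistent with that reconstruction rather than a verbatim restatement from the paper. Treating the 3m·m variables x_{ij} as one long vector, the rows are

    for j = 2, ..., m:     a^T x = Σ_{i=1}^{3m} b_i x_{ij} − Σ_{i=1}^{3m} b_i x_{i1},    with target 0;
    for i = 1, ..., 3m:    a^T x = (λθ)^{1/q} Σ_{j=1}^m x_{ij},                          with target 0;
    for i = 1, ..., 3m:    a^T x = (λµ)^{1/q} Σ_{j=1}^m x_{ij},                          with target (λµ)^{1/q} τ̂;

so that summing the q-th powers of the residuals of these rows reproduces the first three groups of terms in f(x), while the penalty λ Σ_{i,j} p(|x_{ij}|) is exactly the regularization term of (1). Choosing λθ, λµ and τ̂ so that the coefficients (λθ)^{1/q} and (λµ)^{1/q} are rational is one way to keep the instance polynomially representable.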

Now we claim that there exists an equitable partition for the 3-partition problem if and only if the optimal value of f(x) is smaller than 3mλ·h(θ,µ) + ε, where ε is specified later.

On one hand, if S can be equally partitioned into m subsets, then we define

    x_{ij} = t*(θ,µ)  if b_i belongs to the j-th subset,   and   x_{ij} = 0  otherwise.

It can be easily verified that these x_{ij}'s satisfy f(x) = 3mλ·h(θ,µ). Then, due to (4), we know that these x_{ij}'s provide an optimal solution to f(x) with optimal value 3mλ·h(θ,µ).

On the other hand, suppose the optimal value of f(x) is 3mλ·h(θ,µ) and there is a polynomial-time algorithm that solves (1). Then, for

    δ = min{ τ_0 / (8 Σ_{i=1}^{3m} b_i), δ̄ }   and   ε = min{ λδ^2, (τ_0/2)^q },

where

    δ̄ = min{ τ_0/3, (t*(θ,µ) − τ_0)/2, (τ − t*(θ,µ))/2, (p(τ_0/3) + p(2τ_0/3) − p(τ_0)) / (τ_0/3), 1 },

we are able to find a near-optimal solution x such that f(x) < 3mλ·h(θ,µ) + ε within a time polynomial in log(1/ε) and the size of f(x), which is polynomial with respect to the size of the given 3-partition instance. Now we show that we can find an equitable partition based on this near-optimal solution. By the definition of ε, f(x) < 3mλ·h(θ,µ) + ε implies that, for every i,

    Σ_{j=1}^m p(|x_{ij}|) + θ | Σ_{j=1}^m x_{ij} |^q + µ | Σ_{j=1}^m x_{ij} − τ̂ |^q < h(θ,µ) + ε/λ.        (5)

Since ε/λ ≤ δ^2, for each i there is exactly one index j(i) such that |x_{i,j(i)} − t*(θ,µ)| < 2δ, while |x_{ij}| ≤ δ for all j ≠ j(i). Define y_{i,j(i)} = t*(θ,µ) and y_{ij} = 0 for j ≠ j(i), and assign b_i to the subset S_{j(i)}. Then the difference between the sum of the j-th subset and the first subset is

    Σ_{S_j} b_i − Σ_{S_1} b_i = Σ_{i=1}^{3m} b_i y_{ij}/t*(θ,µ) − Σ_{i=1}^{3m} b_i y_{i1}/t*(θ,µ)
                              = (1/t*(θ,µ)) ( Σ_{i=1}^{3m} b_i y_{ij} − Σ_{i=1}^{3m} b_i y_{i1} ).

By the triangle inequality, we have

    | Σ_{S_j} b_i − Σ_{S_1} b_i | ≤ (1/t*(θ,µ)) ( Σ_{i=1}^{3m} b_i |y_{ij} − x_{ij}| + Σ_{i=1}^{3m} b_i |y_{i1} − x_{i1}| + | Σ_{i=1}^{3m} b_i x_{ij} − Σ_{i=1}^{3m} b_i x_{i1} | ).

By the definition of y_{ij}, we have |y_{ij} − x_{ij}| < 2δ for any i, j. For the last term, since f(x) < 3mλ·h(θ,µ) + ε, we know that

    | Σ_{i=1}^{3m} b_i x_{ij} − Σ_{i=1}^{3m} b_i x_{i1} | < ε^{1/q} ≤ τ_0/2.

Therefore, we have

    | Σ_{S_j} b_i − Σ_{S_1} b_i | ≤ (1/t*(θ,µ)) ( 4δ Σ_{i=1}^{3m} b_i + τ_0/2 ) < 1.

Now, since the b_i's are all integers, we must have Σ_{S_j} b_i = Σ_{S_1} b_i for every j, which means that the partition is equitable.
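As a usage example, the following small numerical sanity check exercises the "if" direction of the reduction under the reconstruction above, with illustrative parameters (the clipped L_1 penalty, q = 2, and hand-picked θ, µ, τ̂, λ rather than the constants from the lemmas): it builds f for a toy 3-partition instance, plugs in the solution x_{ij} = t*(θ,µ) induced by an equitable partition, and checks that its value matches 3mλ·h(θ,µ), with h(θ,µ) obtained by grid minimization of g_{θ,µ}.

    import numpy as np

    # toy 3-partition instance: 3m = 6 integers, m = 2 triplets, each triplet summing to 10
    b = [1, 4, 5, 2, 3, 5]
    partition = [[0, 1, 2], [3, 4, 5]]                 # an equitable partition (indices of the triplets)
    m, n3 = 2, 6

    # illustrative parameters (not the constants prescribed by the lemmas)
    gamma, tau_hat, theta, mu, q, lam = 1.0, 1.75, 0.01, 10.0, 2.0, 1.0
    p = lambda t: gamma * min(abs(t), gamma)           # clipped L1 penalty
    g = lambda t: p(t) + theta * abs(t) ** q + mu * abs(t - tau_hat) ** q

    grid = np.linspace(-1.0, 3.0, 400001)
    g_vals = np.array([g(t) for t in grid])
    t_star, h_val = float(grid[g_vals.argmin()]), float(g_vals.min())

    def f(x):
        # reconstruction of the reduced objective for a (3m x m) array x
        col = x.sum(axis=1)                            # sum_j x_ij for each row i
        balance = sum(abs(sum(b[i] * x[i, j] for i in range(n3))
                          - sum(b[i] * x[i, 0] for i in range(n3))) ** q for j in range(1, m))
        return (balance
                + lam * theta * sum(abs(c) ** q for c in col)
                + lam * mu * sum(abs(c - tau_hat) ** q for c in col)
                + lam * sum(p(x[i, j]) for i in range(n3) for j in range(m)))

    x = np.zeros((n3, m))
    for j, triplet in enumerate(partition):
        for i in triplet:
            x[i, j] = t_star                           # x_ij = t* exactly when b_i is placed in subset j

    print(f(x), 3 * m * lam * h_val)                   # the two values agree up to floating-point rounding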

References

Amaldi, E. and Kann, V. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209(1):237–260, 1998.

Antoniadis, A. and Fan, J. Regularization of wavelet approximations. Journal of the American Statistical Association, 96(455):939–967, 2001.

Arora, S., Babai, L., Stern, J., and Sweedyk, Z. The hardness of approximate optima in lattices, codes, and systems of linear equations. In Proceedings of the 34th Annual Symposium on Foundations of Computer Science, pp. 724–733. IEEE, 1993.

Bian, W. and Chen, X. Optimality conditions and complexity for non-Lipschitz constrained optimization problems. Preprint, 2014.

Cameron, A. C. and Trivedi, P. K. Regression Analysis of Count Data, volume 53. Cambridge University Press, 2013.

Candes, E., Wakin, M., and Boyd, S. Enhancing sparsity by reweighted L1 minimization. Journal of Fourier Analysis and Applications, 14(5-6):877–905, 2008.

Chartrand, R. Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Processing Letters, 14(10):707–710, 2007.

Chen, X., Ge, D., Wang, Z., and Ye, Y. Complexity of unconstrained L2-Lp minimization. Mathematical Programming, 143(1-2):371–383, 2014.

Chen, Y. and Wang, M. Hardness of approximation for sparse optimization with L0 norm. Technical Report, 2016.

Davis, G., Mallat, S., and Avellaneda, M. Adaptive greedy approximations. Constructive Approximation, 13(1):57–98, 1997.

Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

Fan, J. and Lv, J. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101–148, 2010.

Fan, J., Xue, L., and Zou, H. Strong oracle optimality of folded concave penalized estimation. The Annals of Statistics, 42(3):819–849, 2014.

Fan, J., Liu, H., Sun, Q., and Zhang, T. TAC for sparse learning: Simultaneous control of algorithmic complexity and statistical error. arXiv preprint arXiv:1507.01037, 2015.

Foster, D., Karloff, H., and Thaler, J. Variable selection is hard. In COLT, pp. 696–709, 2015.

Frank, L. E. and Friedman, J. H. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993.

Garey, M. R. and Johnson, D. S. "Strong" NP-completeness results: Motivation, examples, and implications. Journal of the ACM (JACM), 25(3):499–508, 1978.

Huber, P. J. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.

Huo, X. and Chen, J. Complexity of penalized likelihood estimation. Journal of Statistical Computation and Simulation, 80(7):747–759, 2010.

Loh, P.-L. and Wainwright, M. J. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems, pp. 476–484, 2013.

Lv, J. and Fan, Y. A unified approach to model selection and sparse recovery using regularized least squares. The Annals of Statistics, 37(6A):3498–3528, 2009.

McCullagh, P. Generalized linear models. European Journal of Operational Research, 16(3):285–292, 1984.

Natarajan, B. K. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.

Wang, Z., Liu, H., and Zhang, T. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics, 42(6):2164, 2014.

Xue, L., Zou, H., and Cai, T. Nonconcave penalized composite conditional likelihood estimation of sparse Ising models. The Annals of Statistics, 40(3):1403–1429, 2012.

Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010a.

Zhang, T. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11:1081–1107, 2010b.

Zhang, Y., Wainwright, M. J., and Jordan, M. I. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In COLT, 2014.