Regularization Parameter Selections via Generalized Information Criterion

Please see the online version of this article for supplementary materials.

Yiyun ZHANG, Runze LI, and Chih-Ling TSAI

We apply the nonconcave penalized likelihood approach to obtain variable selections as well as estimators. This approach relies heavily on the choice of regularization parameter, which controls the model complexity. In this paper, we propose employing the generalized information criterion, encompassing the commonly used Akaike information criterion (AIC) and Bayesian information criterion (BIC), for selecting the regularization parameter. Our proposal makes a connection between the classical variable selection criteria and the regularization parameter selections for the nonconcave penalized likelihood approaches. We show that the BIC-type selector enables identification of the true model consistently, and the resulting estimator possesses the oracle property in the terminology of Fan and Li (2001). In contrast, however, the AIC-type selector tends to overfit with positive probability. We further show that the AIC-type selector is asymptotically loss efficient, while the BIC-type selector is not. Our simulation results confirm these theoretical findings, and an empirical example is presented. Some technical proofs are given in the online supplementary material.

KEY WORDS: Akaike information criterion; Bayesian information criterion; Least absolute shrinkage and selection operator; Nonconcave penalized likelihood; Smoothly clipped absolute deviation.

Yiyun Zhang is Senior Biostatistician, Novartis Oncology, Florham Park, NJ 07932 (E-mail: [email protected]). Runze Li is the corresponding author and Professor, Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802-2111 (E-mail: [email protected]). Chih-Ling Tsai is Robert W. Glock Chair Professor, Graduate School of Management, University of California, Davis, CA 95616-8609 (E-mail: [email protected]). We are grateful to the editor, the associate editor, and three referees for their helpful and constructive comments that substantially improved an earlier draft. Zhang's research is supported by National Institute on Drug Abuse grants R21 DA024260 and P50 DA10075 as a research assistant. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDA or the NIH. Li's research is supported by National Science Foundation grants DMS 0348869 and 0722351.

1. INTRODUCTION

Using the penalized likelihood function to simultaneously select variables and estimate unknown parameters has received considerable attention in recent years. To avoid the instability of the classical subset selection procedure, Tibshirani (1996) proposed the Least Absolute Shrinkage and Selection Operator (LASSO). In the same spirit as LASSO, Fan and Li (2001) introduced a unified approach via a nonconcave penalized likelihood function, and demonstrated its usefulness for both linear and generalized linear models. Subsequently, Fan and Li (2002) employed the nonconcave penalized likelihood approach for Cox models, while Fan and Li (2004) used it on semiparametric regression models. In addition, Fan and Peng (2004) investigated the theoretical properties of the nonconcave penalized likelihood when the number of parameters tends to infinity as the sample size increases. Recently, more researchers applied penalized approaches to study variable selections (e.g., Park and Hastie 2007; Wang and Leng 2007; Yuan and Lin 2007; Zhang and Lu 2007; Li and Liang 2008).

In employing the nonconcave penalized likelihood in regression analysis, we face two challenges. The first hurdle is to compute the nonconcave penalized likelihood estimate. This issue has been carefully studied in the recent literature: (i) Fan and Li (2001) proposed the local quadratic approximation (LQA) algorithm, which was further analyzed by Hunter and Li (2005); (ii) Efron et al. (2004) introduced the LARS algorithm, which can be used for the adaptive LASSO (Zou 2006; Wang, Li, and Tsai 2007a; Zhang and Lu 2007). With the aid of the local linear approximation (LLA) algorithm (Zou and Li 2008), LARS can be adopted to solve optimization problems of nonconcave penalized likelihood functions. However, the above computational procedures rely on the regularization parameter. Hence, the selection of this parameter becomes the second challenge, and is therefore the primary aim of our paper.

In the literature, selection criteria are usually classified into two categories: consistent [e.g., the Bayesian information criterion (BIC), Schwarz 1978] and efficient [e.g., the Akaike information criterion (AIC), Akaike 1974; the generalized cross-validation (GCV), Craven and Wahba 1979]. A consistent criterion identifies the true model with a probability that approaches 1 in large samples when a set of candidate models contains the true model. An efficient criterion selects the model so that its average squared error is asymptotically equivalent to the minimum offered by the candidate models when the true model is approximated by a family of candidate models. Detailed discussions on efficiency and consistency can be found in Shibata (1981, 1984), Li (1987), Shao (1997), and McQuarrie and Tsai (1998). In the context of linear and generalized linear models (GLIM) with a nonconcave penalized function, Fan and Li (2001) proposed applying GCV to choose the regularization parameter. Recently, Wang, Li, and Tsai (2007b) found that the resulting model selected by GCV tends to overfit, while BIC is able to identify the finite-dimensional true linear and partial linear models consistently. Wang, Li, and Tsai (2007b) also indicated that GCV is similar to AIC. However, they only studied the penalized least squares function with the smoothly clipped absolute deviation (SCAD) penalty. This motivated us to study the issues of regularization parameter selection for penalized likelihood-based models with a nonconcave penalized function.

In this paper, we adopt Nishii's (1984) generalized information criterion (GIC) to choose regularization parameters in nonconcave penalized likelihood functions. This criterion not only contains AIC and BIC as its special cases, but also bridges the connection between the classical variable selection criteria and the nonconcave penalized likelihood methodology.

This connection provides more flexibility for practitioners to employ their own favored variable selection criteria in choosing desirable models. When the true model is among a set of candidate models with the GLIM structure, we show that the BIC-type tuning parameter selector enables us to identify the true model consistently, whereas the AIC-type selector tends to yield an overfitted model. On the other hand, if the true model is approximated by a set of candidate models with the linear structure, we demonstrate that the AIC-type tuning parameter selector is asymptotically loss efficient, while the BIC-type selector does not have this property in general. These findings are consistent with the features of AIC (see Shibata 1981; Li 1987) and BIC (see Shao 1997) used in best subset variable selections.

The rest of the paper is organized as follows. Section 2 proposes the GIC under a general nonconcave penalized likelihood setting. Section 3 studies the consistent property of GIC for generalized linear models, while Section 4 investigates the asymptotic loss efficiency of the AIC-type selector for linear regression models. Monte Carlo simulations and an empirical example are presented in Section 5 to illustrate the use of the regularization parameter selectors. Section 6 provides discussions, and technical proofs are given in the Appendix.

2. NONCONCAVE PENALIZED LIKELIHOOD FUNCTION

2.1 Penalized Estimators and Penalty Conditions

Consider the data (x_1, y_1), ..., (x_n, y_n) being collected identically and independently, where y_i is the response from the ith subject, and x_i is the associated d-dimensional predictor variable. Let ℓ(β) be the log-likelihood-based (or loss) function of the d-dimensional parameter vector β = (β_1, ..., β_d)^T. Then, adopting Fan and Li's (2001) approach, we define a penalized likelihood to be

    Q(β) = ℓ(β) − n Σ_{j=1}^{d} p_λ(|β_j|),    (1)

where p_λ(·) is a penalty function with regularization parameter λ. Several penalty functions have been proposed in the literature. For example, the L_q (0 < q < 2) penalty, namely p_λ(|β|) = q^{-1} λ |β|^q, leads to the bridge regression (Frank and Friedman 1993). In particular, the L_1 penalty yields the LASSO estimator (Tibshirani 1996). Fan and Li (2001) proposed the nonconcave penalized likelihood method and advocated the use of the SCAD penalty, whose first derivative is given by

    p'_λ(|β_j|) = λ { I(|β_j| ≤ λ) + [(aλ − |β_j|)_+ / ((a − 1)λ)] I(|β_j| > λ) },

with a = 3.7 and p_λ(0) = 0. With a properly chosen regularization parameter λ from 0 to its upper limit, λ_max, the resulting penalized estimator is sparse and therefore suitable for variable selections.

To investigate the asymptotic properties of the regularization parameter selectors, we present the penalty conditions given below:

(C1) Assume that λ_max depends on n and satisfies λ_max → 0 as n → ∞.
(C2) There exists a constant m such that the penalty p_λ(ζ) satisfies p'_λ(ζ) = 0 for ζ > mλ.
(C3) If λ_n → 0 as n → ∞, then the penalty function satisfies

    lim inf_{n→∞} lim inf_{ζ→0+} √n p'_{λ_n}(ζ) → ∞.

Condition (C1) indicates that a smaller regularization parameter is needed if the sample size is large. Condition (C2) assures that the resulting penalized likelihood estimate is asymptotically unbiased (Fan and Li 2001). Both the SCAD and Zhang's (2007) minimax concave penalties (MCP) satisfy this condition. Condition (C3) is adapted from Fan and Li's (2001) equation (3.5), which is used to study the oracle property.

Remark 1. Both the SCAD penalty and the L_q penalty for 0 < q ≤ 1 are singular at the origin. Hence, it becomes challenging to maximize their corresponding penalized likelihood functions. Accordingly, Fan and Li (2001) proposed the LQA algorithm for finding the solution of the nonconcave penalized likelihood. In LQA, p_λ(|β|) is locally approximated by a quadratic function q_λ(|β|), whose first derivative is given by

    q'_λ(|β_j|) = {p'_λ(|β_j|)/|β_j|} β_j.

The above equation is evaluated at the (k + 1)-step iteration of the Newton–Raphson algorithm if β_j^(k) is not very close to zero. Otherwise, the resulting parameter estimator of β_j is set to 0.

2.2 Generalized Information Criterion

Before introducing the GIC, we first define the candidate model, which is involved in variable selections.

Definition 1 (Candidate model). We define α, a subset of ᾱ = {1, ..., d}, as a candidate model, meaning that the corresponding predictors labelled by α are included in the model. Accordingly, ᾱ is the full model. In addition, we denote the size of model α (i.e., the number of nonzero parameters in α) and the coefficients associated with the predictors in model α by d_α and β_α, respectively. Moreover, we denote the collection of all candidate models by A. For a penalized estimator β̂_λ that maximizes the objective function (1), denote the model associated with β̂_λ by α_λ.

In the normal model, y_i = x_i^T β_α + e_i and e_i ∼ N(0, σ²) for i = 1, ..., n, Nishii (1984) proposed the GIC for classical variable selections. It is

    GIC_{κ_n}(α) = log σ̂²_α + (1/n) κ_n d_α,

where β̂_α is the parameter of the candidate model α, σ̂²_α is the maximum likelihood estimator of σ², and κ_n is a positive number that controls the properties of variable selection. Note that Nishii's GIC is different from the GIC proposed by Konishi and Kitagawa (1996). When κ_n = 2, GIC becomes AIC, while κ_n = log(n) leads to GIC being BIC. Because GIC contains a broad range of selection criteria, this motivates us to propose the following GIC-type regularization parameter selector,

    GIC_{κ_n}(λ) = (1/n) {G(y, β̂_λ) + κ_n df_λ},    (2)

where G(y, β̂_λ) measures the fitting of model α_λ, y = (y_1, ..., y_n)^T, β̂_λ is the penalized parameter estimator obtained by maximizing Equation (1) with respect to β, and df_λ is the degrees of freedom of model α_λ. For any given κ_n, we select λ that minimizes GIC_{κ_n}(λ). We can see that, the larger κ_n is, the higher the penalty for models with more variables. Therefore, for some given data, the size of the selected model decreases as κ_n increases.

Remark 2. For any given model α (including the penalized model α_λ and the full model ᾱ), we are able to obtain the nonpenalized parameter estimator β̂*_α by maximizing the log-likelihood function ℓ(β) in (1). Then, Equation (2) becomes

    GIC*_{κ_n}(α) = (1/n) {G(y, β̂*_α) + κ_n d_α},    (3)

which can be used for classical variable selections. In addition, GIC*_{κ_n}(α) turns into Nishii's GIC_{κ_n}(α) if we replace G(y, β̂*_α) in (3) with the −2 log-likelihood function of the fitted normal regression model.

We next study the degrees of freedom used in the second term of GIC. In the selection of the regularization parameter, Fan and Li (2001, 2002) proposed that the degrees of freedom be the trace of the approximate linear projection matrix, that is,

    df_L(λ) = tr[ {∇² Q*(β̂_λ)}^{-1} ∇² ℓ(β̂_λ) ],    (4)

where Q*(β) = ℓ(β) − n Σ_{j=1}^{d} q_λ(|β_j|), [∇² Q*(β)]_{jj'} = ∂² Q*(β)/∂β_j ∂β_{j'}, and [∇² ℓ(β)]_{jj'} = ∂² ℓ(β)/∂β_j ∂β_{j'} for j, j' such that β̂_j ≠ 0 and β̂_{j'} ≠ 0. To understand the large sample property of df_L(λ), we show its asymptotic behavior given below.

Proposition 1. Assume that the penalized likelihood estimator β̂_λ is sparse (i.e., with probability tending to one, β̂_{λj} = 0 if the true value of β_j is 0) and consistent, where β̂_{λj} is the jth component of β̂_λ. Under Conditions (C1) and (C2), we have

    P( df_L(λ) = d_{α_λ} ) → 1,

where d_{α_λ} is the size of model α_λ.

Proof. After algebraic simplifications,

    df_L(λ) = tr[ {∇² ℓ_{α_λ}(β̂_λ) + n Σ_λ}^{-1} ∇² ℓ(β̂_λ) ],

where Σ_λ = diag_{β̂_{λj} ≠ 0} {p'_λ(|β̂_{λj}|)/|β̂_{λj}|}. Because of the consistency and sparsity of β̂_λ, β̂_{λj} converges to β_j with probability tending to 1 for all j such that |β̂_{λj}| > 0. Hence, those β̂_{λj} are all bounded away from 0. This result, together with Conditions (C1) and (C2), implies that Σ_λ = 0 with probability tending to 1. Subsequently, using the fact that n^{-1} ∇² ℓ(β̂_λ) = O_P(1), we complete the proof.

The above proposition suggests that the difference between df_L(λ) and the size of the model, d_{α_λ}, is small. Because d_{α_λ} is simple to calculate, we use it as the degrees of freedom df_λ in (2). In linear regression models, Efron et al. (2004) and Zou, Hastie, and Tibshirani (2007) also suggested using d_{α_λ} as an estimator of the degrees of freedom for LASSO. Moreover, Zou, Hastie, and Tibshirani (2007) showed that d_{α_λ} is an asymptotically unbiased estimator. In this article, our asymptotic results are valid without regard to the use of df_L(λ) or d_{α_λ} as the degrees of freedom. When the sample size is small, however, df_L(λ) should be considered. In the following section, we explore the properties of GIC for the generalized linear models which have been widely used in various disciplines.

3. CONSISTENCY

In this section, we assume that the set of candidate models contains the unique true model, and that the number of parameters in the full model is finite. Under this assumption, we are able to study the asymptotic consistency of GIC by introducing the following definition and condition.

Definition 2 (Underfitted and overfitted models). We assume that there is a unique true model α_0 in A, whose corresponding coefficients are nonzero. Therefore, any candidate model α ⊉ α_0 is referred to as an underfitted model, while any α ⊇ α_0 other than α_0 itself is referred to as an overfitted model.

Based on the above definitions, we partition the tuning parameter interval [0, λ_max] into the underfitted, true, and overfitted subsets, respectively,

    Ω_− = {λ : α_λ ⊉ α_0},
    Ω_0 = {λ : α_λ = α_0}, and
    Ω_+ = {λ : α_λ ⊇ α_0 and α_λ ≠ α_0}.

This partition allows us to assess the performance of regularization parameter selections.

To investigate the asymptotic properties of the regularization parameter selectors, we introduce the technical condition given below:

(C4) For any candidate model α ∈ A, there exists c_α > 0 such that (1/n) G(y, β̂*_α) → c_α in probability. In addition, for any underfitted model α ⊉ α_0, c_α > c_{α_0}, where c_{α_0} is the limit of (1/n) G(y, β_{α_0}) and β_{α_0} is the parameter vector of the true model α_0.

The above condition assures that the underfitted model yields a larger measure of model fitting than that of the true model. We next explore the asymptotic consistency of GIC for the generalized linear models which have been used in various disciplines.

Consider the generalized linear model (GLIM)—see McCullagh and Nelder (1989)—whose conditional density function of y_i given x_i is

    f_i(y_i; θ_i, φ) = exp{ [y_i θ_i − b(θ_i)]/a(φ) + c(y_i, φ) },    (5)

where a(·), b(·), and c(·,·) are suitably chosen functions, θ_i is the canonical parameter, E(y_i | x_i) = μ_i = b'(θ_i), g(μ_i) = θ_i, g is a link function, and φ is a scale parameter. Throughout this paper, we assume that φ is known (such as in the logistic regression model and the Poisson log-linear model) or that it can be estimated by fitting the data with the full model (e.g., the normal linear model). In addition, we follow the classical regression approach to model θ_i by x_i^T β. Based on (5), the log-likelihood-based function in (1) is

    ℓ(β) = ℓ(μ; y) = ℓ(θ) = Σ_{i=1}^{n} log f_i(y_i; θ_i, φ)
         = Σ_{i=1}^{n} [ {y_i x_i^T β − b(x_i^T β)}/a(φ) + c(y_i, φ) ],    (6)

where μ = (μ_1, ..., μ_n)^T, y = (y_1, ..., y_n)^T, and θ = (θ_1, ..., θ_n)^T. Then, the resulting scaled deviance of a penalized estimate β̂_λ is

    D(y; μ̂_λ) = 2 { ℓ(y; y) − ℓ(μ̂_λ; y) },

where μ̂_λ = (g^{-1}(x_1^T β̂_λ), ..., g^{-1}(x_n^T β̂_λ))^T.
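To make the SCAD penalty of Section 2.1 and the LQA approximation of Remark 1 concrete, the following minimal Python sketch (not code from the paper; NumPy only) evaluates the first derivative p'_λ(·) with a = 3.7 and the diagonal LQA weights p'_λ(|β_j|)/|β_j| used in a Newton–Raphson step. The zero-threshold eps is an assumption of this illustration.

```python
import numpy as np

def scad_derivative(beta_abs, lam, a=3.7):
    """First derivative p'_lambda(|beta|) of the SCAD penalty (Fan and Li 2001)."""
    beta_abs = np.asarray(beta_abs, dtype=float)
    deriv = np.zeros_like(beta_abs)
    small = beta_abs <= lam
    mid = (beta_abs > lam) & (beta_abs <= a * lam)
    deriv[small] = lam
    deriv[mid] = (a * lam - beta_abs[mid]) / (a - 1.0)
    # for |beta| > a * lam the derivative is zero, so the penalty stops growing
    return deriv

def lqa_weights(beta, lam, eps=1e-8):
    """Diagonal LQA weights p'_lambda(|beta_j|)/|beta_j|; components that are
    numerically zero are flagged so they can be dropped from the update."""
    beta_abs = np.abs(beta)
    active = beta_abs > eps
    w = np.zeros_like(beta_abs)
    w[active] = scad_derivative(beta_abs[active], lam) / beta_abs[active]
    return w, active
```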

For model α_λ, we employ the scaled deviance D(y; μ̂_λ) as the goodness-of-fit measure, G(y, β̂_λ), in (2), so that the resulting GIC for GLIM is

    GIC_{κ_n}(λ) = (1/n) D(y; μ̂_λ) + (1/n) κ_n df_λ.    (7)

In addition, when we fit the data with the nonpenalized likelihood approach under model α, GIC becomes

    GIC*_{κ_n}(α) = (1/n) D(y; μ̂*_α) + (1/n) κ_n d_α,    (8)

where μ̂*_α = (g^{-1}(x_1^T β̂*_α), ..., g^{-1}(x_n^T β̂*_α))^T, and β̂*_α is the nonpenalized maximum likelihood estimator of β. Accordingly, GIC* can be used in classical variable selection [see equation (3.10) of McCullagh and Nelder 1989]. Next, we show the asymptotic performances of GIC.

Theorem 1. Suppose the density function of the generalized linear model satisfies Fan and Li's (2001) three regularity conditions (A)–(C) and that the technical condition (C4) holds.

(A) If there exists a positive constant M such that κ_n < M, then the tuning parameter λ̂ selected by minimizing GIC_{κ_n}(λ) in (7) satisfies

    P{λ̂ ∈ Ω_−} → 0 and P{λ̂ ∈ Ω_+} ≥ π,

where π is a nonzero probability.

(B) Suppose that Conditions (C1)–(C3) are satisfied. If κ_n → ∞ and κ_n/√n → 0, then the tuning parameter λ̂ selected by minimizing GIC_{κ_n}(λ) in (7) satisfies P{α_λ̂ = α_0} → 1.

The proof is given in Appendix A.

Theorem 1 provides guidance on the choice of the regularization parameter. Theorem 1(A) implies that the GIC selector with bounded κ_n tends to overfit without regard to which penalty function is being used. Analogous to the classical variable selection criterion AIC, we refer to GIC with κ_n = 2 as the AIC selector, while we name GIC with κ_n → 2 the AIC-type selector. In contrast, because κ_n = log(n) fulfills the conditions of Theorem 1(B), we call GIC with κ_n → ∞ and κ_n/√n → 0 the BIC-type selector. Theorem 1(B) indicates that the BIC-type selector identifies the true model consistently. Thus, the nonconcave penalized likelihood of the generalized linear model with the BIC-type selector possesses the oracle property.

Remark 3. In linear regression models, Fan and Li (2001) applied the GCV selector given below to choose the regularization parameter,

    GCV*(λ) = ||y − Xβ̂_λ||² / [n {1 − df_L(λ)/n}²],    (9)

where ||·|| denotes the Euclidean norm, X = (x_1, ..., x_n)^T, and df_L(λ) is defined in (4). To extend the application of GCV, we replace the residual sum of squares in (9) by G(y, β̂_λ), and then choose λ that minimizes

    GCV(λ) = G(y, β̂_λ) / [n {1 − df_λ/n}²].    (10)

Using the Taylor expansion, we further have that

    GCV(λ) ≈ (1/n) { G(y, β̂_λ) + [2 G(y, β̂_λ)/n] df_λ }.

Because G(y, β̂_λ)/n is bounded via Condition (C4), GCV yields an overfitted model with a positive probability. Therefore, the penalized partial likelihood with the GCV selector does not possess the oracle property.

Remark 4. In linear regression models, Wang, Li, and Tsai (2007b) demonstrated that Fan and Li's (2001) GCV selector for the SCAD penalized least squares procedure cannot select the tuning parameter consistently. They further proposed the following BIC tuning parameter selector:

    BIC*(λ) = log(σ̂²_λ) + (1/n) log(n) df_L(λ),

where σ̂²_λ = Σ_{i=1}^{n} (y_i − x_i^T β̂_λ)²/n. Using the result that log(1 + t) ≈ t for small t, BIC*(λ) is approximately equal to

    BIC**(λ) = (1/n) D**(y; μ̂_λ) + (1/n) log(n) df_L(λ),

where D**(y; μ̂_λ) = n σ̂²_λ/σ̂²_ᾱ is the scaled deviance of the normal distribution, and σ̂²_ᾱ is the dispersion parameter estimator computed from the full model. It can be seen that BIC** is a BIC-type selector. Under the conditions in Theorem 1(B), the SCAD penalized least squares procedure with the BIC** selector possesses the oracle property, which is consistent with the findings in Wang, Li, and Tsai (2007b).

4. EFFICIENCY

Under the assumption that the true model is included in a family of candidate models, we established the consistency of BIC-type selectors. In practice, however, this assumption may not be valid, which motivates us to study the asymptotic efficiency of the AIC-type selector. In the literature, the L_2 norm has been commonly used to assess the efficiency of the classical AIC procedure (see Shibata 1981, 1984 and Li 1987) in linear regression models. Hence, we focus on the efficiency of linear regression model selections via the L_2 norm.

Consider the following model:

    y_i = μ_i + ε_i  for i = 1, ..., n,

where μ = (μ_1, ..., μ_n)^T is an unknown mean vector, and the ε_i's are independent and identically distributed (iid) random errors with mean 0 and variance σ². Furthermore, we assume that Xβ constitutes the nearest representation of the true mean vector μ, and hence the full model is not necessarily a correct model. Adapting the formulation of Li (1987), we allow d, the dimension of β, to tend to infinity with n, but d/n → 0.

For the given dataset {(x_i, y_i) : i = 1, ..., n}, we follow the formulation of (1) to define the penalized least squares function

    Q^LS(β) = Σ_{i=1}^{n} (y_i − x_i^T β)² + n Σ_{j=1}^{d} p_λ(|β_j|).    (11)

The resulting penalized estimator of μ in model α_λ is μ̂_λ = Xβ̂_λ. In addition, the nonpenalized estimator of μ in model α is μ̂*_α = Xβ̂*_α. It is noteworthy that β̂_λ and β̂*_α have been defined in Definition 1 and Remark 2, respectively.
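As an illustration of how the selector in (7) can be screened over a grid of regularization parameters, the sketch below (a hypothetical helper written for this exposition, not code from the paper) computes the scaled deviance of a logistic fit, uses the number of nonzero coefficients as df_λ as suggested by Proposition 1, and returns the λ minimizing GIC; κ_n = 2 gives the AIC selector and κ_n = log n the BIC-type selector. The penalized estimates in fitted_betas are assumed to come from an external solver (e.g., an LQA or LLA implementation), which is not specified here.

```python
import numpy as np

def binomial_deviance(y, mu, eps=1e-12):
    """Scaled deviance D(y; mu) = 2{l(y; y) - l(mu; y)} for Bernoulli responses."""
    mu = np.clip(mu, eps, 1.0 - eps)
    return -2.0 * np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))

def gic_select(y, X, lam_grid, fitted_betas, kappa_n):
    """Return the lambda on the grid minimizing
    GIC(lambda) = D(y; mu_hat_lambda)/n + kappa_n * df_lambda / n."""
    n = len(y)
    best_lam, best_gic = None, np.inf
    for lam, beta in zip(lam_grid, fitted_betas):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # logistic mean function
        df = int(np.sum(beta != 0))            # model size d_{alpha_lambda}
        gic = binomial_deviance(y, mu) / n + kappa_n * df / n
        if gic < best_gic:
            best_lam, best_gic = lam, gic
    return best_lam, best_gic
```

With kappa_n = 2 this corresponds to the SCAD-AIC selector of Section 5, and with kappa_n = np.log(n) to the SCAD-BIC selector.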

In practice, the tuning parameter λ is unknown and can be selected by minimizing

    GIC^LS_{κ_n}(λ) = (1/n) Σ_{i=1}^{n} (y_i − x_i^T β̂_λ)² + (1/n) κ_n σ² d_{α_λ},    (12)

where σ² is assumed to be known, and the case with the unknown σ² will be discussed later. When β̂_λ in (12) is replaced by the least squares estimator β̂*_α, GIC^LS_{κ_n} with κ_n = 2 becomes the Mallows' C_p criterion that has been used for classical variable selections.

To assess the performance of the tuning parameter selector, we adopt the approach of Shibata (1981) (also see Li 1987; Shao 1997), and define the average squared loss (or the L_2 loss) associated with the estimator β̂_λ to be

    L(β̂_λ) = (1/n) ||μ − μ̂_λ||² = (1/n) Σ_{i=1}^{n} (μ_i − x_i^T β̂_λ)².    (13)

Accordingly, the risk function is R(β̂_λ) = E[L(β̂_λ)].

Using the average squared loss measure, we further define the asymptotic loss efficiency.

Definition 3 (Asymptotically loss efficient). A tuning parameter selection procedure is said to be asymptotically loss efficient if

    L(β̂_λ̂) / inf_{λ∈[0,λ_max]} L(β̂_λ) → 1,    (14)

in probability, where β̂_λ̂ is associated with the tuning parameter λ̂ selected by this procedure. We also say β̂_λ̂ is asymptotically loss efficient if (14) holds.

Moreover, we introduce the following technical conditions for studying the asymptotic loss efficiency of the AIC-type selector in linear regression:

(C5) (n^{-1} X^T X)^{-1} exists, and its largest eigenvalue is bounded by a constant number C.
(C6) E ε_1^{4q} < ∞, for some positive integer q.
(C7) The risks of the least squares estimators β̂*_α for all α ∈ A satisfy

    Σ_{α∈A} [n R(β̂*_α)]^{-q} → 0.

(C8) Let b = (b_1, ..., b_d)^T, where b_j = p'_λ(|β̂_{λj}|) sgn(β̂_{λj}) for all j such that |β̂_{λj}| > 0, and b_j = 0 otherwise, and β̂_{λj} is the jth component of the penalized estimator β̂_λ. In addition, let β̂*_{α_λ} be the least squares estimator of β obtained from model α_λ. Then, we assume that, in probability,

    sup_{λ∈[0,λ_max]} ||b||² / R(β̂*_{α_λ}) → 0.

Condition (C5) has been commonly considered in the literature. Conditions (C6) and (C7) are adopted from conditions (A.2) and (A.3), respectively, in Li (1987). It can be shown that if the true model is approximated by a set of the candidate models (e.g., the true model is of infinite dimension), then Condition (C7) holds. Condition (C8) ensures that the difference between the penalized mean function estimator and the corresponding least squares mean function estimator is small in comparison with the risk of the least squares estimator (see Lemma 3 in Appendix B). Sufficient conditions for (C8) are also given in Appendix C. We next show the asymptotic efficiency of the AIC-type selector.

Theorem 2. Assume Conditions (C5)–(C8) hold. Then, the tuning parameter λ̂ selected by minimizing GIC^LS_{κ_n}(λ) in (12) with κ_n → 2 yields an asymptotically loss efficient estimator, β̂_λ̂, in the sense of (14).

The proof is given in Appendix B.

Theorem 2 demonstrates that the AIC-type selector is asymptotically loss efficient. In addition, using the result that log(1 + t) ≈ t for small t, the AIC selector behaves similarly to the following AIC* selector,

    AIC*(λ) = log(σ̂²_λ) + 2σ² d_{α_λ}/n.

Accordingly, both the AIC and AIC* selectors are asymptotically loss efficient.

Applying Lemma 4 and Equation (B.1) in Appendix B, we find that if sup_{λ∈[0,λ_max]} |n^{-1}(κ_n − 2)σ² d_{α_λ}/R(β̂_λ)| → 0 in probability, then GIC^LS_{κ_n} is asymptotically loss efficient. Note that it can be shown that n^{-1}σ² d_{α_λ}/R(β̂_λ) is bounded by 1. As a result, κ_n → 2 (including κ_n = 2) is critical in establishing the asymptotic loss efficiency. This finding is similar to the classical efficient criteria (see Shibata 1980; Shao 1997; and Yang 2005). It is also not surprising that the BIC-type selectors do not possess asymptotic loss efficiency, which is consistent with the finding of classical variable selections (see Li 1987 and Shao 1997).

In practice, σ² is often unknown. It is natural to replace the σ² in GIC^LS_2 by its consistent estimator (see Shao 1997). The following corollary shows that the asymptotic property of GIC still holds.

Corollary 1. If the tuning parameter λ̂ is selected by minimizing GIC^LS_{κ_n}(λ) with κ_n → 2 and σ² being replaced by its consistent estimator σ̃², then the resulting procedure is also asymptotically loss efficient.

The proof is given in Appendix B.

Remark 5. As suggested by an anonymous reviewer, we study the asymptotic loss efficiency of the generalized linear model in (5). Following the spirit of the deviance measure commonly used in the GLIM, we adopt the Kullback–Leibler (KL) distance measure to define the KL loss of an estimate β̂ (either the maximum likelihood estimate β̂*_α or the nonconcave penalized estimate β̂_λ), as

    L_KL(β̂) = (2/n) E_0 {ℓ(θ_0) − ℓ(θ̂)},    (15)

where ℓ(·) is defined in (6), θ_0 = (θ_01, ..., θ_0n)^T is the true unknown canonical parameter, θ̂ = (θ̂_1, ..., θ̂_n)^T = Xβ̂ under the canonical link function, and E_0 denotes the expectation under the true model (see, e.g., McQuarrie and Tsai 1998). It can be shown that the KL loss is identical to the squared loss for normal linear regression models with known variance. To obtain the asymptotic loss efficiency for GLIM, however, it is necessary to employ the Taylor expansion to expand b(θ_i) at the true value of θ_0i for i = 1, ..., n. Accordingly, we need to impose some strong assumptions to establish the asymptotic loss efficiency for the nonconcave penalized estimate under the KL loss if the tuning parameter λ̂ is selected by minimizing GIC in (7) with κ_n → 2. To avoid considerably increasing the length of the paper, we do not present detailed discussions, justifications, and proofs here, but they are given in the online supplemental material available on the JASA website.

5. NUMERICAL STUDIES

In this section, we present three examples comprised of two Monte Carlo experiments and one empirical example. It is noteworthy that Example 1 considers a setting in which the true model is in a set of candidate models with logistic regressions. This setting allows us to examine the finite sample performance of the consistent selection criterion, which is expected to perform well via Theorem 1. In contrast, Example 2 considers a setting in which the true model is not included in a set of candidate models with Gaussian regressions. This setting enables us to assess the finite sample performance of the efficient selection criterion, which is expected to perform well via Theorem 2.

Example 1. Adapting the model setting from Tibshirani (1996) and Fan and Li (2001), we simulate the data from the logistic regression model, y | x ∼ Bernoulli{p(x^T β)}, where

    p(x^T β) = μ(x^T β) = exp(x^T β) / {1 + exp(x^T β)}.

In addition, β = (3, 1.5, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0)^T, x is a 12-dimensional random vector, the first nine components of x are multivariate normal with covariance matrix Σ = (σ_ij), in which σ_ij = 0.5^{|i−j|}, and the last three components of x are generated from an independent Bernoulli distribution with success probability 0.5. Moreover, we conduct 1000 realizations with the sample sizes n = 200 and 400.

To assess the finite sample performance of the proposed methods, we report the percentage of models correctly fitted, underfitted, and overfitted with 1, 2, 3, 4, 5 or more parameters by SCAD-AIC (i.e., κ_n = 2), SCAD-BIC [i.e., κ_n = log(n)], SCAD-GCV, AIC, and BIC selectors as well as via the oracle procedure (i.e., the simulated data were fitted with the true model). Their corresponding standard errors can be calculated by √{p̂(1 − p̂)/1000}, where p̂ is the observed proportion in 1000 simulations. Moreover, we report the average number of zero coefficients that were correctly (C) and incorrectly (I) identified by different methods. To compare model fittings, we further calculate the following model error for the new observation (x, y):

    ME(β̂) = E_x {μ(x^T β) − μ(x^T β̂)}²,

where the expectation is taken with respect to the new observed covariate vector x, and μ(x^T β) = E(y | x). Then, we report the median of the relative model error (MRME), where the relative model error is defined as RME = ME / ME_full, and ME_full is the model error calculated by fitting the data with the full model.

Table 1 shows that the MRME of SCAD-BIC is smaller than that of SCAD-AIC. As the sample size increases, the MRME of SCAD-BIC approaches that of the oracle estimator, whereas the MRME of SCAD-AIC remains at the same level. Hence, SCAD-BIC is superior to SCAD-AIC in terms of model error measures in this example. Moreover, Figure 1 presents the boxplots of RMEs, which lead to a similar conclusion.

In model identifications, SCAD-BIC has a higher chance than SCAD-AIC of correctly setting the nine true zero coefficients to zero, while SCAD-BIC is slightly more prone than SCAD-AIC to incorrectly set the three nonzero coefficients to zero when the sample size is small. In addition, SCAD-BIC has a much higher possibility of correctly identifying the true model than SCAD-AIC.

Table 1. Simulation results for the logistic regression model. MRME is the median relative model error; C denotes the average number of the nine true zero coefficients that were correctly identified as zero, while I denotes the average number of the three nonzero coefficients that were incorrectly identified as zero; the numbers in parentheses are standard errors; Under, Exact, and Overfitted represent the proportions of corresponding models being selected in 1000 Monte Carlo realizations, respectively

                                 Zeros                                             Overfitted (%)
Method      MRME (%)    C          I          Under (%)   Exact (%)    1      2      3      4      ≥5

n = 200
SCAD-AIC    58.03       7.4 (1.6)  0.0 (0.0)  0.2         30.2         23.2   19.6   13.6   7.5    5.9
SCAD-BIC    16.90       8.8 (0.5)  0.0 (0.1)  0.7         83.7         13.2   2.4    0.4    0.0    0.0
SCAD-GCV    81.91       5.8 (1.9)  0.0 (0.0)  0.0         7.8          12.0   18.1   18.8   17.6   25.7
AIC         57.05       7.4 (1.2)  0.0 (0.0)  0.1         19.9         31.5   26.7   14.8   4.3    2.8
BIC         18.27       8.8 (0.5)  0.0 (0.1)  0.8         80.4         17.0   2.3    0.1    0.0    0.0
Oracle      12.49       9.0 (0.0)  0.0 (0.0)  0.0         100.0        0.0    0.0    0.0    0.0    0.0

n = 400
SCAD-AIC    64.45       7.4 (1.5)  0.0 (0.0)  0.0         29.4         21.9   21.0   15.1   8.5    4.1
SCAD-BIC    19.03       8.9 (0.4)  0.0 (0.0)  0.0         89.9         7.9    1.8    0.2    0.2    0.0
SCAD-GCV    82.13       6.0 (1.9)  0.0 (0.0)  0.0         9.2          14.7   18.6   19.8   15.9   21.8
AIC         63.17       7.4 (1.2)  0.0 (0.0)  0.0         21.7         29.2   27.3   14.6   5.5    1.7
BIC         20.46       8.8 (0.4)  0.0 (0.0)  0.0         86.3         12.1   1.4    0.2    0.0    0.0
Oracle      15.88       9.0 (0.0)  0.0 (0.0)  0.0         100.0        0.0    0.0    0.0    0.0    0.0
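For readers who want to reproduce a setup like Example 1, the following sketch generates one replicate under the stated design and evaluates the model error ME(β̂) by Monte Carlo over fresh covariates. The Monte Carlo approximation of the expectation E_x and its sample size are choices of this illustration, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true = np.array([3, 1.5, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0], dtype=float)

def draw_covariates(n, rng):
    """First nine components: N(0, Sigma) with Sigma_ij = 0.5^|i-j|;
    last three components: independent Bernoulli(0.5)."""
    idx = np.arange(9)
    sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    x_norm = rng.multivariate_normal(np.zeros(9), sigma, size=n)
    x_bern = rng.binomial(1, 0.5, size=(n, 3)).astype(float)
    return np.hstack([x_norm, x_bern])

def simulate_logistic(n, rng):
    X = draw_covariates(n, rng)
    p = 1.0 / (1.0 + np.exp(-X @ beta_true))
    return X, rng.binomial(1, p)

def model_error(beta_hat, rng, n_mc=100_000):
    """Monte Carlo estimate of ME(beta_hat) = E_x{mu(x'beta) - mu(x'beta_hat)}^2."""
    X_new = draw_covariates(n_mc, rng)
    mu_true = 1.0 / (1.0 + np.exp(-X_new @ beta_true))
    mu_hat = 1.0 / (1.0 + np.exp(-X_new @ beta_hat))
    return np.mean((mu_true - mu_hat) ** 2)

X, y = simulate_logistic(200, rng)   # one replicate with n = 200
```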

Figure 1. Boxplot of relative model error (RME) with n = 200 and n = 400.

Moreover, among the overfitted models, the SCAD-BIC method is likely to include only one irrelevant variable, whereas the SCAD-AIC method often includes two or more. When the sample size increases, the SCAD-BIC method yields a lesser degree of overfitting, while the SCAD-AIC method still overfits quite seriously. These results are consistent with the theoretical findings presented in Theorem 1.

It is not surprising that SCAD-GCV performs similarly to SCAD-AIC. Furthermore, the classical AIC and BIC selection criteria behave as SCAD-AIC and SCAD-BIC, respectively, in terms of model errors and identifications. It is of interest to note that SCAD-AIC usually suggests a sparser model than AIC. In sum, SCAD-BIC performs the best.

Example 2. In practice, it is plausible that the full model fails to include some important explanatory variable. This motivates us to mimic such a situation and then assess various selection procedures. Accordingly, we consider a linear regression model y_i = x_i^T β + ε_i, where the x_i's are iid multivariate normal random variables with dimension 13, the correlation between x_i and x_j is 0.5^{|i−j|}, and the ε_i's are iid N(0, σ²) with σ = 4. In addition, we partition x = (x_full^T, x_exc^T)^T, where x_full contains d = 12 covariates of the full model and x_exc is the covariate excluded from model fittings. Accordingly, we partition β = (β_full^T, β_exc^T)^T, where β_full is a 12 × 1 vector and β_exc is a scalar. To investigate the performance of the proposed methods under various parameter structures, we let β_full = β_0 + γδ/√n, where β_0 = (3, 1.5, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0)^T, δ = (0, 0, 1.5, 1.5, 1, 1, 0, 0, 0, 0, 0.5, 0.5)^T, γ ranges from 0 to 10, and β_exc = 0.2. Because the candidate model is the subset of the full model that contains the 12 covariates in x_full, the above model settings ensure that the true model is not included in the set of candidate models.

We simulate 1000 data sets with n = 400, 800, and 1600. To study the performance of selectors, we define the finite sample's loss efficiency,

    LE(β̂_λ̂) = L(β̂_λ̂) / inf_λ L(β̂_λ),

where λ̂ is chosen by SCAD-GCV, SCAD-AIC, and SCAD-BIC. Because β_full depends on the sample size and γ, a sensible approach to compare the loss efficiency of the selection criterion across different sample sizes is to compare the loss efficiency under the same value of β_full.

Because the performance of SCAD-GCV is very similar to that of SCAD-AIC, it is not reported here. Figure 2(a) and (b) depict the LEs of SCAD-AIC and SCAD-BIC, respectively, across various γ. Figure 2(a) clearly indicates that the loss efficiency of SCAD-AIC converges to 1 regardless of the value of γ, which corroborates the theoretical finding in Theorem 2. However, the loss efficiency of SCAD-BIC in Figure 2(b) does not show this tendency. Specifically, when γ is close to 0 (i.e., the model is nearly sparse), SCAD-BIC results in smaller loss due to its larger penalty function. As γ increases so that the full model contains the medium-sized coefficients, SCAD-BIC is likely to choose a model that is too sparse and hence increases the loss. When γ becomes large enough and the coefficient β_exc is dominated by the remaining coefficients in β_full, the consistency property of SCAD-BIC tends to yield the smaller loss. In sum, Figure 2 shows that SCAD-AIC is efficient, whereas SCAD-BIC is not. Finally, we examine the efficiencies of AIC and BIC, and find that the performances of AIC and BIC are similar to those of SCAD-AIC and SCAD-BIC, respectively, and hence are omitted.

Example 3 (Mammographic mass data). Mammography is the most effective method for screening for the presence of breast cancer.

Figure 2. (a) The loss efficiency of SCAD-AIC; (b) The loss efficiency of SCAD-BIC.

However, mammogram interpretations yield approximately a 70% rate of unnecessary biopsies with benign outcomes. Hence, several computer-aided diagnosis (CAD) systems have been proposed to assist physicians in predicting breast biopsy outcomes from the findings of BI-RADS (Breast Imaging Reporting And Data System). To examine the capability of two novel CAD approaches, Elter, Schulz-Wendtland, and Wittenberg (2007) recently analyzed a dataset containing 516 benign and 445 malignant instances. We downloaded this data set from the UCI Machine Learning Repository, and excluded 131 missing values and 15 coding errors. As a result, we considered a total of 815 cases.

To study the severity (benign or malignant), Elter, Schulz-Wendtland, and Wittenberg (2007) considered the following three covariates: x_1—birads, the BI-RADS assessment assigned in a double-review process by physicians (definitely benign = 1 to highly suggestive of malignancy = 5); x_2—age, the patient's age in years; x_3—mass density, high = 1, iso = 2, low = 3, fat-containing = 4. In addition, we employ the dummy variables x_4 to x_6 to represent the four mass shapes: round = 1, oval = 2, lobular = 3, irregular = 4, where "irregular" is used as the baseline. Analogously, we apply the dummy variables x_7 to x_10 to represent the five mass margins: circumscribed = 1, microlobulated = 2, obscured = 3, ill-defined = 4, spiculated = 5, where "spiculated" is used as the baseline.

To predict the value of the binary response, y = 0 (benign) or y = 1 (malignant), on the basis of the 10 explanatory variables, we fit the data with the following logistic regression model:

    log[ p(x^T β) / {1 − p(x^T β)} ] = β_0 + Σ_{j=1}^{10} x_j β_j,

where x = (x_1, ..., x_10)^T, β = (β_1, ..., β_10)^T, and p(x^T β) is the probability of the case being classified as malignant. As a result, the tuning parameters selected by SCAD-AIC and SCAD-BIC are 0.0332 and 0.1512, respectively, and SCAD-GCV yields the same model as SCAD-AIC. Because the true model is unknown, we follow an anonymous referee's suggestion to include the models selected by the delete-1 cross validation (delete-1 CV) and the 5-fold cross validation (5-fold CV). The detailed procedures for fitting those models can be obtained from the first author.

Because the model selected by the delete-1 CV is the same as SCAD-AIC, Table 2 only presents the nonpenalized maximum likelihood estimates from the full model as well as the SCAD-AIC, SCAD-BIC, and 5-fold CV estimates, together with their standard errors. It indicates that the nonpenalized maximum likelihood approach fits five spurious variables (x_3, x_6, and x_8 to x_10), while SCAD-AIC and 5-fold CV include two variables (x_6 and x_9) and one variable (x_9), respectively, with insignificant effects at a level of 0.05. In contrast, all variables (x_1, x_2, x_4, x_5, and x_7) selected by SCAD-BIC are significant. Because the sample size n = 815 is large, these findings are consistent with Theorem 1 and the simulation results. In addition, the p-value of the deviance test for assessing the SCAD-BIC model against the full model is 0.41, and as a result, there is no evidence of lack of fit in the SCAD-BIC model.
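To clarify the deviance (lack-of-fit) comparison quoted above, the sketch below (assuming SciPy is available; the fitted linear predictors and variable counts are inputs supplied by the user, and the function names are hypothetical) computes the likelihood-ratio statistic between a fitted submodel and the full model and refers it to a chi-squared distribution. From Table 2, the SCAD-BIC model uses 6 parameters (intercept, x1, x2, x4, x5, x7) against 11 in the full model, so the test has 5 degrees of freedom; the p-value 0.41 reported in the paper came from such a comparison.

```python
import numpy as np
from scipy.stats import chi2

def logistic_deviance(y, eta):
    """Residual deviance of a logistic fit with linear predictor eta."""
    mu = np.clip(1.0 / (1.0 + np.exp(-eta)), 1e-12, 1 - 1e-12)
    return -2.0 * np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def deviance_test(y, eta_sub, eta_full, df_sub, df_full):
    """Likelihood-ratio (deviance) test of a submodel against the full model."""
    stat = logistic_deviance(y, eta_sub) - logistic_deviance(y, eta_full)
    return stat, chi2.sf(stat, df_full - df_sub)
```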

Table 2. Estimates for mammographic mass data with standard deviations in parentheses

Variable          MLE              SCAD-AIC         SCAD-BIC         5-Fold CV

Intercept (β0)    −11.04 (1.48)    −11.17 (1.13)    −11.16 (1.07)    −11.49 (1.14)
BIRADS (x1)       2.18 (0.23)      2.19 (0.23)      2.25 (0.23)      2.23 (0.23)
Age (x2)          0.05 (0.01)      0.04 (0.01)      0.04 (0.01)      0.05 (0.01)
Density (x3)      −0.04 (0.29)     0 (–)            0 (–)            0 (–)
sRound (x4)       −0.98 (0.37)     −0.99 (0.37)     −0.80 (0.34)     −0.81 (0.35)
sOval (x5)        −1.21 (0.32)     −1.22 (0.32)     −1.07 (0.30)     −1.07 (0.30)
sLobular (x6)     −0.53 (0.35)     −0.54 (0.34)     0 (–)            0 (–)
mCircum (x7)      −1.05 (0.42)     −0.98 (0.32)     −1.01 (0.30)     −1.07 (0.31)
mMicro (x8)       −0.03 (0.65)     0 (–)            0 (–)            0 (–)
mObscured (x9)    −0.48 (0.39)     −0.42 (0.30)     0 (–)            −0.47 (0.30)
mIlldef (x10)     −0.09 (0.33)     0 (–)            0 (–)            0 (–)

Based on the model selected by SCAD-BIC, we conclude that a higher BI-RADS assessment or a greater age results in a higher chance for malignancy. In addition, the oval or round mass shape yields lower odds for malignance than does the irregular mass shape, while the odds for malignance designated by the lobular mass shape is not significantly different from that of the irregular mass shape. Moreover, the odds for malignance indicated by the microlobulated, obscured, and ill-defined mass margins are not significantly different from that of the spiculated mass margin. However, the circumscribed mass margin leads to lesser odds for malignance than that of the four other types of mass margins.

6. DISCUSSION

In the context of variable selection, we propose the generalized information criterion to choose regularization parameters for nonconcave penalized likelihood functions. Furthermore, we study the theoretical properties of GIC. If we believe that the true model is contained in a set of candidate models with the GLIM structure, then the BIC-type selector identifies the true model with probability tending to 1, while the GIC selectors with bounded κ_n tend to overfit with positive probability. However, if the true model is approximated by a family of candidate models, the AIC-type selector is asymptotically loss efficient, whereas the BIC-type selector is not, in general. Simulation studies support the finite sample performance of the selection criteria.

Although we obtain the theoretical properties of GIC for GLIM, the application of GIC is not limited. For example, we could employ GIC to select the regularization parameter in Cox proportional hazard and quasi-likelihood models. We believe that these efforts would further enhance the usefulness of GIC in real data analysis.

APPENDIX A: PROOF OF THEOREM 1

Before proving the theorem, we show the following two lemmas. Then, Theorems 1(A) and 1(B) follow from Lemmas 1 and 2, respectively. For the sake of convenience, we denote D(y; μ̂_λ) in (7) and D(y; μ̂*_α) in (8) by D(λ) and D*(α), respectively.

Lemma 1. Suppose the density function of the generalized linear model satisfies Fan and Li's (2001) three regularity conditions (A)–(C). Assume that there exists a positive constant M such that κ_n < M. Then under Condition (C4), we have

    P{ inf_{λ∈Ω_−} GIC_{κ_n}(λ) > GIC*_{κ_n}(ᾱ) } → 1    (A.1)

and

    lim inf_{n→∞} P{ inf_{λ∈Ω_0} GIC_{κ_n}(λ) > GIC*_{κ_n}(ᾱ) } ≥ π,    (A.2)

as n → ∞.

Proof. For a given λ, the nonpenalized maximum likelihood estimator, β̂*_{α_λ}, maximizes ℓ(β) under model α_λ. This implies D(λ) ≥ D*(α_λ) for any given λ, which leads to

    GIC_{κ_n}(λ) = D(λ)/n + κ_n df_λ/n > D*(α_λ)/n.    (A.3)

Then, we obtain that

    GIC_{κ_n}(λ) − GIC*_{κ_n}(ᾱ) > D*(α_λ)/n − D*(ᾱ)/n − κ_n d_ᾱ/n,

which holds true for any λ ∈ Ω_− = {λ : α_λ ⊉ α_0}. This, together with Condition (C4) and κ_n d_ᾱ/n = o_P(1) obtained from the fact that κ_n < M, yields

    P{ inf_{λ∈Ω_−} GIC_{κ_n}(λ) − GIC*_{κ_n}(ᾱ) > 0 }
      ≥ P{ min_{α⊉α_0} [ D*(α)/n − D*(ᾱ)/n − κ_n d_ᾱ/n ] > 0 }
      = P{ min_{α⊉α_0} [ c_α − c_ᾱ + o_P(1) ] > 0 } → 1,    (A.4)

as n → ∞. The last step uses the fact that the number of underfitting models is finite, hence min_{α⊉α_0} c_α is strictly greater than c_ᾱ (i.e., c_{α_0}) under Condition (C4). The above equation yields (A.1) immediately.

Next we show that the model selected by the GIC selector with bounded κ_n is overfitted with probability bounded away from zero. For any λ ∈ Ω_0, α_λ = α_0. Then subtracting GIC*_{κ_n}(ᾱ) from both sides of (A.3) and subsequently taking inf_{λ∈Ω_0} over GIC_{κ_n}(λ), we have

    inf_{λ∈Ω_0} GIC_{κ_n}(λ) − GIC*_{κ_n}(ᾱ) > D*(α_0)/n − [ D*(ᾱ)/n + κ_n d_ᾱ/n ].    (A.5)

Note that the right-hand side of the above equation does not involve λ. D*(α_0) − D*(ᾱ) = −2[ℓ(β̂*_{α_0}) − ℓ(β̂*_ᾱ)], where β̂*_{α_0} and β̂*_ᾱ are the nonpenalized maximum likelihood estimators computed via the true and full models, respectively. Under regularity conditions (A)–(C) in Fan and Li (2001), it can be shown that β̂*_{α_0} and β̂*_ᾱ are consistent and asymptotically normal. Hence, the likelihood ratio test statistic −2[ℓ(β̂*_{α_0}) − ℓ(β̂*_ᾱ)] converges in law to χ²_{d_ᾱ − d_{α_0}}. This result, together with κ_n < M and (A.5), yields

    P{ inf_{λ∈Ω_0} GIC_{κ_n}(λ) − GIC*_{κ_n}(ᾱ) > 0 }
      ≥ P{ −(2/n)[ℓ(β̂*_{α_0}) − ℓ(β̂*_ᾱ)] − κ_n d_ᾱ/n > 0 }
      ≥ P{ −2[ℓ(β̂*_{α_0}) − ℓ(β̂*_ᾱ)] > d_ᾱ M }
      → P{ χ²_{d_ᾱ − d_{α_0}} ≥ d_ᾱ M }, which we denote by π.

This implies (A.2), and we complete the proof of Lemma 1.

To prove Theorem 1(B), we present the following lemma.

Lemma 2. Suppose the density function of the generalized linear model satisfies Fan and Li's (2001) three regularity conditions (A)–(C). Assume Conditions (C1)–(C4) hold, and let λ_n = κ_n/√n. If κ_n satisfies κ_n → ∞ and λ_n → 0 as n → ∞, then

    P{ GIC_{κ_n}(λ_n) = GIC*_{κ_n}(α_0) } → 1    (A.6)

and

    P{ inf_{λ∈Ω_−∪Ω_+} GIC_{κ_n}(λ) > GIC_{κ_n}(λ_n) } → 1.    (A.7)

Proof. Without loss of generality, we assume that the first d_{α_0} coefficients of β in the true model are nonzero and the rest are zeros. Note that the density function of the generalized linear model satisfies Fan and Li's (2001) three regularity conditions (A)–(C). These conditions, together with Condition (C3) and the assumptions on κ_n stated in Lemma 2, allow us to apply Fan and Li's theorems 1 and 2 to show that, with probability tending to 1, the last d − d_{α_0} components of β̂_{λ_n} are zeros and the first d_{α_0} components of β̂_{λ_n} satisfy the normal equations

    ∂ℓ(β̂_{λ_n})/∂β_j + b_{λ_n j} = 0  for j = 1, ..., d_{α_0},    (A.8)

where b_{λ_n j} = p'_{λ_n}(|β̂_{λ_n j}|) sgn(β̂_{λ_n j}), and β̂_{λ_n j} is the jth component of β̂_{λ_n}.

Using the oracle property, we have |β̂_{λ_n j}| → |β_j| ≥ min_{1≤j≤d_{α_0}} |β_j|. Then, under Conditions (C1) and (C2), there exists a constant m so that p'_{λ_n}(|β̂_{λ_n j}|) = 0 for min_{1≤j≤d_{α_0}} |β_j| > mλ_n as n gets large. Accordingly, P(b_{λ_n j} = 0) → 1 for j = 1, ..., d_{α_0}. This, together with (A.8), implies that, with probability tending to 1, the first d_{α_0} components of β̂_{λ_n} solve the normal equations

    ∂ℓ(β̂_{λ_n})/∂β_j = 0,  j = 1, ..., d_{α_0},

and the remaining d − d_{α_0} components are zeros. This is exactly the same as the normal equation solved by the nonpenalized maximum likelihood estimator β̂*_{α_0}. As a result, β̂_{λ_n} = β̂*_{α_0} with probability tending to 1. It follows that

    P{ D(λ_n) = D*(α_0) } = P{ β̂_{λ_n} = β̂*_{α_0} } → 1.

Moreover, using the result from Proposition 1, we have P{ df_{λ_n} = d_{α_0} } → 1. Consequently,

    P{ GIC_{κ_n}(λ_n) = GIC*_{κ_n}(α_0) }
      = P{ (1/n)(D(λ_n) − D*(α_0)) + (κ_n/n)(df_{λ_n} − d_{α_0}) = 0 } → 1.

The proof of (A.6) is complete.

We next show that, for any λ which cannot identify the true model, the resulting GIC_{κ_n}(λ) is consistently larger than GIC_{κ_n}(λ_n). To this end, we consider two cases, underfitting and overfitting.

Case 1: Underfitted model (i.e., λ ∈ Ω_− so that α_λ ⊉ α_0). Applying (A.3) and (A.6), we obtain that, with probability tending to 1,

    GIC_{κ_n}(λ) − GIC_{κ_n}(λ_n) > (1/n) D*(α_λ) − (1/n) D*(α_0) − (κ_n/n) d_{α_0}.

Subsequently, take inf_{λ∈Ω_−} of both sides of the above equation, which yields

    inf_{λ∈Ω_−} GIC_{κ_n}(λ) − GIC_{κ_n}(λ_n) > min_{α⊉α_0} { (1/n) D*(α) − (1/n) D*(α_0) − (κ_n/n) d_{α_0} }.    (A.9)

This, in conjunction with Condition (C4) and the number of underfitted models being finite, leads to

    P{ min_{α⊉α_0} [ (1/n) D*(α) − (1/n) D*(α_0) − (κ_n/n) d_{α_0} ] > 0 }
      = P{ min_{α⊉α_0} [ c_α − c_{α_0} + o_P(1) ] > 0 } → 1,

as n → ∞. As a result,

    P{ inf_{λ∈Ω_−} GIC_{κ_n}(λ) > GIC_{κ_n}(λ_n) } → 1.

Case 2: Overfitted model (i.e., λ ∈ Ω_+ so that α_λ ⊋ α_0). According to Lemma 1, with probability tending to 1, D(λ_n) = D*(α_0). In addition, Proposition 1 indicates that df_λ − df_{λ_n} > τ + o_P(1) for some τ > 0 and λ ∈ Ω_+. Moreover, as noticed in the proof of Lemma 1, D(λ) ≥ D*(α_λ). Therefore, with probability tending to 1,

    n [ GIC_{κ_n}(λ) − GIC_{κ_n}(λ_n) ]
      = D(λ) − D(λ_n) + (df_λ − df_{λ_n}) κ_n
      ≥ D*(α_λ) − D*(α_0) + (τ + o_P(1)) κ_n.

Because D*(α_0) − D*(α) follows a χ² distribution with d_α − d_{α_0} degrees of freedom asymptotically for any α ⊋ α_0, it is of order O_P(1). Accordingly,

    inf_{λ∈Ω_+} n [ GIC_{κ_n}(λ) − GIC_{κ_n}(λ_n) ]
      ≥ min_{α⊋α_0} { D*(α_0) − D*(α) } + (τ + o_P(1)) κ_n
      ≈ τ κ_n.    (A.10)

Using the fact that κ_n goes to infinity as n → ∞, the right-hand side of Equation (A.10) goes to positive infinity, which guarantees that the left-hand side of Equation (A.10) is positive as n → ∞. Hence, we finally have

    P{ inf_{λ∈Ω_+} GIC_{κ_n}(λ) > GIC_{κ_n}(λ_n) } → 1.

The results of Cases 1 and 2 complete the proof of (A.7).

Proofs of Theorem 1

Lemma 1 implies that GIC_{κ_n}(λ), which produces the underfitted model, is consistently larger than GIC*_{κ_n}(ᾱ). Thus, the optimal model selected by minimizing GIC_{κ_n}(λ) must contain all of the significant variables with probability tending to one. In addition, Lemma 1 indicates that there is a nonzero probability that the smallest value of GIC_{κ_n}(λ) for λ ∈ Ω_0 is larger than that of the full model. As a result, there is a positive probability that any λ associated with the true model cannot be selected by GIC_{κ_n}(λ) as the regularization parameter. Theorem 1(A) follows.

Lemma 2 indicates that the model identified by λ_n converges to the true model as the sample size gets large. In addition, it shows that λ's that fail to identify the true model cannot be selected by GIC_{κ_n}(λ) asymptotically. Theorem 1(B) follows.

APPENDIX B: PROOFS OF THEOREM 2 AND COROLLARY 1

Before proving the theorem, we establish the following two lemmas. Lemma 3 evaluates the difference between a penalized mean estimator μ̂_λ and its corresponding least squares mean estimator μ̂*_{α_λ}, while Lemma 4 demonstrates that the losses of μ̂_λ and μ̂*_{α_λ} are asymptotically equivalent.

Lemma 3. Under Condition (C5),

    ||μ̂_λ − μ̂*_{α_λ}||² ≤ nC||b||²,

where C is the constant number in Condition (C5) and b is defined in Condition (C8).

Proof. Without loss of generality, we assume that the first d_{α_λ} components of β̂_λ and β̂*_{α_λ} are nonzero, and denote them by β̂_λ^(1) and β̂*_{α_λ}^(1), respectively. Thus, μ̂_λ = Xβ̂_λ = X_{α_λ} β̂_λ^(1) and μ̂*_{α_λ} = Xβ̂*_{α_λ} = X_{α_λ} β̂*_{α_λ}^(1). From the proofs of theorems 1 and 2 in Fan and Li (2001), with probability tending to 1, we have that β̂_λ^(1) is the solution of the following equation:

    (1/n) X_{α_λ}^T (y − X_{α_λ} β_λ^(1)) + b^(1) = 0,

where b^(1) is the subvector of b that corresponds to β̂_λ^(1). Accordingly,

    β̂_λ^(1) = (X_{α_λ}^T X_{α_λ})^{-1} X_{α_λ}^T y + ((1/n) X_{α_λ}^T X_{α_λ})^{-1} b^(1)
            = β̂*_{α_λ}^(1) + V_{α_λ} b^(1),

where V_{α_λ} = ((1/n) X_{α_λ}^T X_{α_λ})^{-1}. In addition, the eigenvalues of V_{α_λ} are bounded under Condition (C5). Hence,

    ||μ̂_λ − μ̂*_{α_λ}||² = ||X_{α_λ}(β̂_λ^(1) − β̂*_{α_λ}^(1))||² = n b^{(1)T} V_{α_λ} b^(1) ≤ nC||b||²

for some positive constant number C. This completes the proof.

Lemma 4. If Conditions (C5)–(C8) hold, then

    sup_{λ∈[0,λ_max]} | L(β̂_λ)/L(β̂*_{α_λ}) − 1 | → 0,

in probability.

Proof. After algebraic simplification, we have

    L(β̂_λ) − L(β̂*_{α_λ}) = ||μ̂*_{α_λ} − μ̂_λ||²/n + 2(μ − μ̂*_{α_λ})^T (μ̂*_{α_λ} − μ̂_λ)/n = I_1 + I_2.

Under Conditions (C6) and (C7), Li (1987) showed that

    sup_{α∈A} | L(β̂*_α)/R(β̂*_α) − 1 | → 0.

This, together with Condition (C8) and Lemma 3, implies

    sup_{λ∈[0,λ_max]} | I_1/L(β̂*_{α_λ}) |
      = sup_{λ∈[0,λ_max]} | ||μ̂*_{α_λ} − μ̂_λ||²/(n R(β̂*_{α_λ}))
        − [ ||μ̂*_{α_λ} − μ̂_λ||²/(n L(β̂*_{α_λ})) ] [ L(β̂*_{α_λ})/R(β̂*_{α_λ}) − 1 ] | → 0.

Applying the Cauchy–Schwarz inequality, we next obtain

    |I_2| ≤ 2 ||μ − μ̂*_{α_λ}|| · ||μ̂*_{α_λ} − μ̂_λ|| / n
         = 2 √{L(β̂*_{α_λ})} · (1/√n) ||μ̂*_{α_λ} − μ̂_λ||.

As a result, sup_{λ∈[0,λ_max]} |I_2/L(β̂*_{α_λ})| → 0, and Lemma 4 follows immediately.

Proof of Theorem 2

To show the asymptotic efficiency of the AIC-type selector, it suffices to demonstrate that minimizing GIC^LS_{κ_n}(λ) with κ_n → 2 is the same as minimizing L(β̂_λ) asymptotically. To this end, we need to prove that, in probability,

    sup_{λ∈[0,λ_max]} | [ GIC^LS_{κ_n}(λ) − ||ε||²/n − L(β̂_λ) ] / L(β̂_λ) | → 0.    (B.1)

Let the projection matrix corresponding to the model α be H_α = X_α(X_α^T X_α)^{-1} X_α^T. Then,

    GIC^LS_{κ_n}(λ) = ||y − Xβ̂_λ||²/n + κ_n σ² d_{α_λ}/n
      = ||y − μ̂*_{α_λ}||²/n + ||μ̂*_{α_λ} − μ̂_λ||²/n + κ_n σ² d_{α_λ}/n
      = ||ε||²/n + L(β̂_λ) + { L(β̂*_{α_λ}) − L(β̂_λ) } + (1/n)||μ̂*_{α_λ} − μ̂_λ||²
        + (2/n) ε^T (I − H_{α_λ}) μ + (2/n){ σ² d_{α_λ} − ε^T H_{α_λ} ε } + (1/n)(κ_n − 2) σ² d_{α_λ}.    (B.2)

Let J_1 = L(β̂*_{α_λ}) − L(β̂_λ), J_2 = ||μ̂*_{α_λ} − μ̂_λ||²/n, J_3 = 2ε^T(I − H_{α_λ})μ/n, J_4 = 2(σ² d_{α_λ} − ε^T H_{α_λ} ε)/n, and J_5 = (κ_n − 2) σ² d_{α_λ}/n. Using Lemma 3, Lemma 4, and similar arguments used in Li (1987), we obtain that, in probability,

    sup_{λ∈[0,λ_max]} | J_j / L(β̂_λ) | → 0  for j = 1, ..., 4.

Because κ_n → 2 and the fact that n^{-1} σ² d_{α_λ}/R(β̂_λ) is bounded by 1, we can further show that sup_{λ∈[0,λ_max]} |J_5/L(β̂_λ)| → 0 using Lemma 4 and Condition (C7). Accordingly, (B.1) holds, which implies that the difference between GIC^LS_{κ_n}(λ) − ||ε||²/n and L(β̂_λ) is negligible in comparison to L(β̂_λ). This completes the proof.

Proof of Corollary 1

When σ² is unknown, the GIC^LS_{κ_n}(λ) in (B.2) becomes

    GIC^LS_{κ_n}(λ) = ||ε||²/n + L(β̂_λ) + J_1 + J_2 + J_3 + J_4 + J_5 + 2(σ̃² − σ²) d_{α_λ}/n.

Using Lemma 4 and Condition (C7), we have

    sup_{λ∈[0,λ_max]} | 2(σ̃² − σ²) d_{α_λ} / (n L(β̂_λ)) | → 0

in probability, which completes the proof.

APPENDIX C: SUFFICIENT CONDITIONS FOR (C8)

We first present three sufficient conditions for (C8).

(S1) There exists a constant M_1 such that λ_max satisfies √n λ_max < M_1 for all n.
(S2) There exists a constant M_2 such that p_λ(θ) satisfies p'_λ(θ) ≤ M_2 λ for any θ.
(S3) The average error of the full model ᾱ, that is, Δ_ᾱ = ||μ − H_ᾱ μ||²/n, satisfies nΔ_ᾱ/d → ∞, as n → ∞.

We next provide motivations for these conditions. Assume that the true model is μ_i = x_i^T β_0 = Σ_{j=1}^{d} x_{ij} β_{j0} for i = 1, ..., n, and then write a_n = max_{1≤j≤d} {p'_{λ_n}(|β_{j0}|), β_{j0} ≠ 0}. Under the condition d = O(n^ν) with ν < 1/4, Fan and Peng (2004) proved that there exists a local maximum of the penalized least squares function, so that ||β̂ − β_0|| = O_P{√d (n^{-1/2} + a_n)}. Thus, Conditions (S1) and (S2) ensure that the penalized estimator is √(n/d)-consistent. As for Condition (S3), consider a case where the full model misses an important variable x_exc that is orthogonal to the rest of the variables. Accordingly, Δ_ᾱ = n^{-1} Σ_{i=1}^{n} x²_{exc;i} β²_exc, and hence nΔ_ᾱ/d is of the same order as nβ²_exc/d. Consequently, (S3) holds if the coefficient β_exc satisfies √(n/d) β_exc → ∞, which is valid, for example, when β_exc is fixed.

In contrast to imposing Condition (S3) on the error of the full model, an alternative Condition (S3*) motivated by Fan and Peng (2004) is given below.

(S3*) The proportion of penalized coefficients satisfies

    sup_{λ∈[0,λ_max]} (1/d_{α_λ}) Σ_{j=1}^{d} I(0 < |β̂_{λj}| ≤ mλ) → 0, as n → ∞,

where m is defined in (C2).

Condition (S3*) means that the proportion of small or moderate size coefficients (0 < √n|β̂_{λj}| ≤ mM_1) is vanishing. This condition is satisfied under the identifiability assumption of Fan and Peng (2004) (i.e., min_{j:β_{0j}≠0} |β_{0j}|/λ → ∞, as n → ∞). As a byproduct, a similar condition can also be established for the adaptive LASSO, which employs data-driven penalties with b_j = λ/(β̂_{ᾱ j})^k for some k > 0 to cope with the bias of LASSO. In sum, (S1) to (S3) (or S3*) are mild conditions, and the two propositions given below demonstrate their sufficiency for Condition (C8).

Proposition 2. Under (S1) to (S3), Condition (C8) holds.

Proof. It is noteworthy that

    R(β̂*_{α_λ}) = Δ_{α_λ} + σ² d_{α_λ}/n ≥ Δ_ᾱ + σ² d_{α_λ}/n.    (C.1)

On the right-hand side of (C.1), the first term dominates the second term via Condition (S3). From (S1), (S2), and (C.1), we have that

    ||b||² / R(β̂*_{α_λ}) ≤ M_2² λ² d_{α_λ} / R(β̂*_{α_λ}) ≤ M_2² · nλ²_max · 1/(nΔ_ᾱ/d),

which goes to zero independently of λ. This completes the proof.

Proposition 3. Assume that the penalty function satisfies Condition (C2). Under (S1) to (S3*), Condition (C8) holds.

Proof. Under Condition (C2), the components in b are zero except for those with β̂_{λj} ≤ mλ. Then, employing (S2) and (C.1), we obtain that

    ||b||² / R(β̂*_{α_λ}) ≤ ||b||² / (n^{-1} σ² d_{α_λ}) ≤ (M_2²/σ²) · nλ²_max · (1/d_{α_λ}) Σ_{j=1}^{d} I(0 < |β̂_{λj}| ≤ mλ).

On the right-hand side of the above equation, the first term is constant, the second term is bounded under (S1), and the third term goes to zero uniformly in λ under (S3*). This completes the proof.

SUPPLEMENTAL MATERIALS

Technical details for Remark 5: Detailed discussions, justifications, and proofs referenced in Remark 5. (suplv_rz_t.pdf)

[Received January 2008. Revised October 2009.]

REFERENCES

Akaike, H. (1974), "A New Look at the Statistical Model Identification," IEEE Transactions on Automatic Control, 19, 716–723.
Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions," Numerische Mathematik, 31, 377–403.
Efron, B., Hastie, T., Johnstone, I. M., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.
Elter, M., Schulz-Wendtland, R., and Wittenberg, T. (2007), "The Prediction of Breast Cancer Biopsy Outcomes Using Two CAD Approaches That Both Emphasize an Intelligible Decision Process," Medical Physics, 34, 4164–4172.
Fan, J., and Li, R. (2001), "Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties," Journal of the American Statistical Association, 96, 1348–1360.
Fan, J., and Li, R. (2002), "Variable Selection for Cox's Proportional Hazards Model and Frailty Model," The Annals of Statistics, 30, 74–99.
Fan, J., and Li, R. (2004), "New Estimation and Model Selection Procedures for Semiparametric Modeling in Longitudinal Data Analysis," Journal of the American Statistical Association, 99, 710–723.
Fan, J., and Peng, H. (2004), "Nonconcave Penalized Likelihood With a Diverging Number of Parameters," The Annals of Statistics, 32, 928–961.
Frank, I. E., and Friedman, J. H. (1993), "A Statistical View of Some Chemometrics Regression Tools," Technometrics, 35, 109–148.
Hunter, D., and Li, R. (2005), "Variable Selection Using MM Algorithms," The Annals of Statistics, 33, 1617–1642.
Konishi, S., and Kitagawa, G. (1996), "Generalised Information Criteria in Model Selection," Biometrika, 83, 875–890.
Li, K.-C. (1987), "Asymptotic Optimality for Cp, CL, Cross-Validation and Generalized Cross-Validation: Discrete Index Set," The Annals of Statistics, 15, 958–975.
Li, R., and Liang, H. (2008), "Variable Selection in Semiparametric Regression Modeling," The Annals of Statistics, 36, 261–286.
McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models, New York: Chapman & Hall/CRC.
McQuarrie, A. D. R., and Tsai, C.-L. (1998), Regression and Time Series Model Selection (1st ed.), Singapore: World Scientific Publishing Co., Pte. Ltd.
Nishii, R. (1984), "Asymptotic Properties of Criteria for Selection of Variables in Multiple Regression," The Annals of Statistics, 12, 758–765.
Park, M.-Y., and Hastie, T. (2007), "An L1 Regularization-Path Algorithm for Generalized Linear Models," Journal of the Royal Statistical Society, Ser. B, 69, 659–677.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Shao, J. (1997), "An Asymptotic Theory for Linear Model Selection," Statistica Sinica, 7, 221–264.
Shibata, R. (1980), "Asymptotically Efficient Selection of the Order of the Model for Estimating Parameters of a Linear Process," The Annals of Statistics, 8, 147–164.
Shibata, R. (1981), "An Optimal Selection of Regression Variables," Biometrika, 68, 45–54.
Shibata, R. (1984), "Approximation Efficiency of a Selection Procedure for the Number of Regression Variables," Biometrika, 71, 43–49.
Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58, 267–288.
Wang, H., and Leng, C. (2007), "Unified LASSO Estimation via Least Squares Approximation," Journal of the American Statistical Association, 102, 1039–1048.
Wang, H., Li, G., and Tsai, C.-L. (2007a), "Regression Coefficient and Autoregressive Order Shrinkage and Selection via LASSO," Journal of the Royal Statistical Society, Ser. B, 69, 63–78.
Wang, H., Li, R., and Tsai, C.-L. (2007b), "Tuning Parameter Selectors for the Smoothly Clipped Absolute Deviation Method," Biometrika, 94, 553–568.
Yang, Y. (2005), "Can the Strengths of AIC and BIC Be Shared? A Conflict Between Model Identification and Regression Estimation," Biometrika, 92, 937–950.
Yuan, M., and Lin, Y. (2007), "On the Non-Negative Garrotte Estimator," Journal of the Royal Statistical Society, Ser. B, 69, 143–161.
Zhang, C.-H. (2007), "Penalized Linear Unbiased Selection," Technical Report 2007-003, Rutgers University, Dept. of Statistics.
Zhang, H. H., and Lu, W. (2007), "Adaptive LASSO for Cox's Proportional Hazards Model," Biometrika, 94, 691–703.
Zou, H. (2006), "The Adaptive LASSO and Its Oracle Properties," Journal of the American Statistical Association, 101, 1418–1429.
Zou, H., and Li, R. (2008), "One-Step Sparse Estimates in Nonconcave Penalized Likelihood Models," The Annals of Statistics, 36, 1509–1533.
Zou, H., Hastie, T., and Tibshirani, R. (2007), "On the Degrees of Freedom of the LASSO," The Annals of Statistics, 35, 2173–2192.