Regularization Parameter Selections via Generalized Information Criterion

Please see the online version of this article for supplementary materials.

Yiyun ZHANG, Runze LI, and Chih-Ling TSAI

We apply the nonconcave penalized likelihood approach to obtain variable selections as well as estimators. This approach relies heavily on the choice of regularization parameter, which controls the model complexity. In this paper, we propose employing the generalized information criterion, encompassing the commonly used Akaike information criterion (AIC) and Bayesian information criterion (BIC), for selecting the regularization parameter. Our proposal makes a connection between the classical variable selection criteria and the regularization parameter selections for the nonconcave penalized likelihood approaches. We show that the BIC-type selector enables identification of the true model consistently, and the resulting estimator possesses the oracle property in the terminology of Fan and Li (2001). In contrast, however, the AIC-type selector tends to overfit with positive probability. We further show that the AIC-type selector is asymptotically loss efficient, while the BIC-type selector is not. Our simulation results confirm these theoretical findings, and an empirical example is presented. Some technical proofs are given in the online supplementary material.

KEY WORDS: Akaike information criterion; Bayesian information criterion; Least absolute shrinkage and selection operator; Nonconcave penalized likelihood; Smoothly clipped absolute deviation.

Yiyun Zhang is Senior Biostatistician, Novartis Oncology, Florham Park, NJ 07932 (E-mail: [email protected]). Runze Li is the corresponding author and Professor, Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802-2111 (E-mail: [email protected]). Chih-Ling Tsai is Robert W. Glock Chair Professor, Graduate School of Management, University of California, Davis, CA 95616-8609 (E-mail: [email protected]). We are grateful to the editor, the associate editor, and three referees for their helpful and constructive comments that substantially improved an earlier draft. Zhang's research is supported by National Institute on Drug Abuse grants R21 DA024260 and P50 DA10075 as a research assistant. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDA or the NIH. Li's research is supported by National Science Foundation grants DMS 0348869 and 0722351.

1. INTRODUCTION

Using the penalized likelihood function to simultaneously select variables and estimate unknown parameters has received considerable attention in recent years. To avoid the instability of the classical subset selection procedure, Tibshirani (1996) proposed the Least Absolute Shrinkage and Selection Operator (LASSO). In the same spirit as LASSO, Fan and Li (2001) introduced a unified approach via a nonconcave penalized likelihood function, and demonstrated its usefulness for both linear and generalized linear models. Subsequently, Fan and Li (2002) employed the nonconcave penalized likelihood approach for Cox models, while Fan and Li (2004) used it on semiparametric regression models. In addition, Fan and Peng (2004) investigated the theoretical properties of the nonconcave penalized likelihood when the number of parameters tends to infinity as the sample size increases. Recently, more researchers applied penalized approaches to study variable selections (e.g., Park and Hastie 2007; Wang and Leng 2007; Yuan and Lin 2007; Zhang and Lu 2007; Li and Liang 2008).

In employing the nonconcave penalized likelihood in regression analysis, we face two challenges. The first hurdle is to compute the nonconcave penalized likelihood estimate. This issue has been carefully studied in the recent literature: (i) Fan and Li (2001) proposed the local quadratic approximation (LQA) algorithm, which was further analyzed by Hunter and Li (2005); (ii) Efron et al. (2004) introduced the LARS algorithm, which can be used for the adaptive LASSO (Zou 2006; Wang, Li, and Tsai 2007a; Zhang and Lu 2007). With the aid of the local linear approximation (LLA) algorithm (Zou and Li 2008), LARS can be adopted to solve optimization problems of nonconcave penalized likelihood functions. However, the above computational procedures rely on the regularization parameter. Hence, the selection of this parameter becomes the second challenge, and is therefore the primary aim of our paper.

In the literature, selection criteria are usually classified into two categories: consistent [e.g., the Bayesian information criterion (BIC), Schwarz 1978] and efficient [e.g., the Akaike information criterion (AIC), Akaike 1974; the generalized cross-validation (GCV), Craven and Wahba 1979]. A consistent criterion identifies the true model with a probability that approaches 1 in large samples when a set of candidate models contains the true model. An efficient criterion selects the model so that its average squared error is asymptotically equivalent to the minimum offered by the candidate models when the true model is approximated by a family of candidate models. Detailed discussions on efficiency and consistency can be found in Shibata (1981, 1984), Li (1987), Shao (1997), and McQuarrie and Tsai (1998). In the context of linear and generalized linear models (GLIM) with a nonconcave penalized function, Fan and Li (2001) proposed applying GCV to choose the regularization parameter. Recently, Wang, Li, and Tsai (2007b) found that the resulting model selected by GCV tends to overfit, while BIC is able to identify the finite-dimensional true linear and partial linear models consistently. Wang, Li, and Tsai (2007b) also indicated that GCV is similar to AIC. However, they only studied the penalized least squares function with the smoothly clipped absolute deviation (SCAD) penalty. This motivated us to study the issues of regularization parameter selection for penalized likelihood-based models with a nonconcave penalized function.

In this paper, we adopt Nishii's (1984) generalized information criterion (GIC) to choose regularization parameters in nonconcave penalized likelihood functions. This criterion not only contains AIC and BIC as its special cases, but also bridges the connection between the classical variable selection criteria and the nonconcave penalized likelihood methodology.

This connection provides more flexibility for practitioners to employ their own favored variable selection criteria in choosing desirable models. When the true model is among a set of candidate models with the GLIM structure, we show that the BIC-type tuning parameter selector enables us to identify the true model consistently, whereas the AIC-type selector tends to yield an overfitted model. On the other hand, if the true model is approximated by a set of candidate models with the linear structure, we demonstrate that the AIC-type tuning parameter selector is asymptotically loss efficient, while the BIC-type selector does not have this property in general. These findings are consistent with the features of AIC (see Shibata 1981; Li 1987) and BIC (see Shao 1997) used in best subset variable selections.

The rest of the paper is organized as follows. Section 2 proposes the GIC under a general nonconcave penalized likelihood setting. Section 3 studies the consistent property of GIC for generalized linear models, while Section 4 investigates the asymptotic loss efficiency of the AIC-type selector for linear regression models. Monte Carlo simulations and an empirical example are presented in Section 5 to illustrate the use of the regularization parameter selectors. Section 6 provides discussions, and technical proofs are given in the Appendix.

2. NONCONCAVE PENALIZED LIKELIHOOD FUNCTION

2.1 Penalized Estimators and Penalty Conditions

Consider the data (x_1, y_1), ..., (x_n, y_n) being collected identically and independently, where y_i is the response from the ith subject, and x_i is the associated d-dimensional predictor variable. Let ℓ(β) be the log-likelihood-based (or loss) function of the d-dimensional parameter vector β = (β_1, ..., β_d)^T. Then, adopting Fan and Li's (2001) approach, we define a penalized likelihood to be

    Q(β) = ℓ(β) − n Σ_{j=1}^{d} p_λ(|β_j|),    (1)

where p_λ(·) is a penalty function with regularization parameter λ. Several penalty functions have been proposed in the literature. For example, the L_q (0 < q < 2) penalty, namely p_λ(|β|) = q^{-1} λ |β|^q, leads to the bridge regression (Frank and Friedman 1993). In particular, the L_1 penalty yields the LASSO estimator (Tibshirani 1996). Fan and Li (2001) proposed the nonconcave penalized likelihood method and advocated the use of the SCAD penalty, whose first derivative is given by

    p'_λ(|β_j|) = λ { I(|β_j| ≤ λ) + [(aλ − |β_j|)_+ / ((a − 1)λ)] I(|β_j| > λ) },

with a = 3.7 and p_λ(0) = 0. With a properly chosen regularization parameter λ from 0 to its upper limit, λ_max, the resulting penalized estimator is sparse and therefore suitable for variable selections.

To investigate the asymptotic properties of the regularization parameter selectors, we present the penalty conditions given below:

(C1) Assume that λ_max depends on n and satisfies λ_max → 0 as n → ∞.
(C2) There exists a constant m such that the penalty p_λ(ζ) satisfies p'_λ(ζ) = 0 for ζ > mλ.
(C3) If λ_n → 0 as n → ∞, then the penalty function satisfies

    lim inf_{n→∞} lim inf_{ζ→0+} √n p'_{λ_n}(ζ) → ∞.

Condition (C1) indicates that a smaller regularization parameter is needed if the sample size is large. Condition (C2) assures that the resulting penalized likelihood estimate is asymptotically unbiased (Fan and Li 2001). Both the SCAD and Zhang's (2007) minimax concave penalties (MCP) satisfy this condition. Condition (C3) is adapted from Fan and Li's (2001) equation (3.5), which is used to study the oracle property.

Remark 1. Both the SCAD penalty and the L_q penalty for 0 < q ≤ 1 are singular at the origin. Hence, it becomes challenging to maximize their corresponding penalized likelihood functions. Accordingly, Fan and Li (2001) proposed the LQA algorithm for finding the solution of the nonconcave penalized likelihood. In LQA, p_λ(|β|) is locally approximated by a quadratic function q_λ(|β|), whose first derivative is given by

    q'_λ(|β_j|) = {p'_λ(|β_j|)/|β_j|} β_j.

The above equation is evaluated at the (k + 1)-step iteration of the Newton–Raphson algorithm if β_j^(k) is not very close to zero. Otherwise, the resulting parameter estimator of β_j is set to 0.

2.2 Generalized Information Criterion

Before introducing the GIC, we first define the candidate model, which is involved in variable selections.

Definition 1 (Candidate model). We define α, a subset of ᾱ = {1, ..., d}, as a candidate model, meaning that the corresponding predictors labelled by α are included in the model. Accordingly, ᾱ is the full model. In addition, we denote the size of model α (i.e., the number of nonzero parameters in α) and the coefficients associated with the predictors in model α by d_α and β_α, respectively. Moreover, we denote the collection of all candidate models by A. For a penalized estimator β̂_λ that maximizes the objective function (1), denote the model associated with β̂_λ by α_λ.

In the normal model, y_i = x_i^T β_α + e_i and e_i ∼ N(0, σ²) for i = 1, ..., n, Nishii (1984) proposed the GIC for classical variable selections. It is

    GIC_{κ_n}(α) = log σ̂²_α + (1/n) κ_n d_α,

where β̂_α is the parameter of the candidate model α, σ̂²_α is the maximum likelihood estimator of σ², and κ_n is a positive number that controls the properties of variable selection. Note that Nishii's GIC is different from the GIC proposed by Konishi and Kitagawa (1996). When κ_n = 2, GIC becomes AIC, while κ_n = log(n) leads to GIC being BIC. Because GIC contains a broad range of selection criteria, this motivates us to propose the following GIC-type regularization parameter selector,

    GIC_{κ_n}(λ) = (1/n) {G(y, β̂_λ) + κ_n df_λ},    (2)

where G(y, β̂_λ) measures the fitting of model α_λ, y = (y_1, ..., y_n)^T, β̂_λ is the penalized parameter estimator obtained by maximizing Equation (1) with respect to β, and df_λ is the degrees of freedom of model α_λ. For any given κ_n, we select λ that minimizes GIC_{κ_n}(λ). We can see that, the larger κ_n is, the higher the penalty for models with more variables. Therefore, for some given data, the size of the selected model decreases as κ_n increases.

Remark 2. For any given model α (including the penalized model α_λ and the full model ᾱ), we are able to obtain the nonpenalized parameter estimator β̂*_α by maximizing the log-likelihood function ℓ(β) in (1). Then, Equation (2) becomes

    GIC*_{κ_n}(α) = (1/n) {G(y, β̂*_α) + κ_n d_α},    (3)

which can be used for classical variable selections. In addition, GIC*_{κ_n}(α) turns into Nishii's GIC_{κ_n}(α) if we replace G(y, β̂*_α) in (3) with the −2 log-likelihood function of the fitted normal regression model.

We next study the degrees of freedom used in the second term of GIC. In the selection of the regularization parameter, Fan and Li (2001, 2002) proposed that the degrees of freedom be the trace of the approximate linear projection matrix, that is,

    df_L(λ) = tr[ {∇² Q*(β̂_λ)}^{-1} ∇² ℓ(β̂_λ) ],    (4)

where Q*(β) = ℓ(β) − n Σ_{j=1}^{d} q_λ(|β_j|), [∇² Q*(β)]_{jj'} = ∂² Q*(β)/∂β_j ∂β_{j'}, and [∇² ℓ(β)]_{jj'} = ∂² ℓ(β)/∂β_j ∂β_{j'} for j, j' such that β̂_j ≠ 0 and β̂_{j'} ≠ 0. To understand the large sample property of df_L(λ), we show its asymptotic behavior given below.

Proposition 1. Assume that the penalized likelihood estimator β̂_λ is sparse (i.e., with probability tending to one, β̂_{λj} = 0 if the true value of β_j is 0) and consistent, where β̂_{λj} is the jth component of β̂_λ. Under Conditions (C1) and (C2), we have

    P( df_L(λ) = d_{α_λ} ) → 1,

where d_{α_λ} is the size of model α_λ.

Proof. After algebraic simplifications,

    df_L(λ) = tr[ {∇² ℓ_{α_λ}(β̂_λ) + n Σ_λ}^{-1} ∇² ℓ(β̂_λ) ],

where Σ_λ = diag_{β̂_{λj} ≠ 0} {p'_λ(|β̂_{λj}|)/|β̂_{λj}|}. Because of the consistency and sparsity of β̂_λ, β̂_{λj} converges to β_j with probability tending to 1 for all j such that |β̂_{λj}| > 0. Hence, those β̂_{λj} are all bounded away from 0. This result, together with Conditions (C1) and (C2), implies that Σ_λ = 0 with probability tending to 1. Subsequently, using the fact that n^{-1} ∇² ℓ(β̂_λ) = O_P(1), we complete the proof.

The above proposition suggests that the difference between df_L(λ) and the size of the model, d_{α_λ}, is small. Because d_{α_λ} is simple to calculate, we use it as the degrees of freedom df_λ in (2). In linear regression models, Efron et al. (2004) and Zou, Hastie, and Tibshirani (2007) also suggested using d_{α_λ} as an estimator of the degrees of freedom for LASSO. Moreover, Zou, Hastie, and Tibshirani (2007) showed that d_{α_λ} is an asymptotically unbiased estimator. In this article, our asymptotic results are valid without regard to the use of df_L(λ) or d_{α_λ} as the degrees of freedom. When the sample size is small, however, df_L(λ) should be considered. In the following section, we explore the properties of GIC for the generalized linear models which have been widely used in various disciplines.

3. CONSISTENCY

In this section, we assume that the set of candidate models contains the unique true model, and that the number of parameters in the full model is finite. Under this assumption, we are able to study the asymptotic consistency of GIC by introducing the following definition and condition.

Definition 2 (Underfitted and overfitted models). We assume that there is a unique true model α_0 in A, whose corresponding coefficients are nonzero. Therefore, any candidate model α ⊉ α_0 is referred to as an underfitted model, while any α ⊇ α_0 other than α_0 itself is referred to as an overfitted model.

Based on the above definitions, we partition the tuning parameter interval [0, λ_max] into the underfitted, true, and overfitted subsets, respectively,

    Ω_− = {λ : α_λ ⊉ α_0},
    Ω_0 = {λ : α_λ = α_0}, and
    Ω_+ = {λ : α_λ ⊇ α_0 and α_λ ≠ α_0}.

This partition allows us to assess the performance of regularization parameter selections.

To investigate the asymptotic properties of the regularization parameter selectors, we introduce the technical condition given below:

(C4) For any candidate model α ∈ A, there exists c_α > 0 such that (1/n) G(y, β̂*_α) → c_α in probability. In addition, for any underfitted model α ⊉ α_0, c_α > c_{α_0}, where c_{α_0} is the limit of (1/n) G(y, β_{α_0}) and β_{α_0} is the parameter vector of the true model α_0.

The above condition assures that the underfitted model yields a larger measure of model fitting than that of the true model. We next explore the asymptotic consistency of GIC for the generalized linear models which have been used in various disciplines.

Consider the generalized linear model (GLIM)—see McCullagh and Nelder (1989)—whose conditional density function of y_i given x_i is

    f_i(y_i; θ_i, φ) = exp{ [y_i θ_i − b(θ_i)]/a(φ) + c(y_i, φ) },    (5)

where a(·), b(·), and c(·,·) are suitably chosen functions, θ_i is the canonical parameter, E(y_i | x_i) = μ_i = b'(θ_i), g(μ_i) = θ_i, g is a link function, and φ is a scale parameter. Throughout this paper, we assume that φ is known (such as in the logistic regression model and the Poisson log-linear model) or that it can be estimated by fitting the data with the full model (e.g., the normal linear model). In addition, we follow the classical regression approach to model θ_i by x_i^T β. Based on (5), the log-likelihood-based function in (1) is

    ℓ(β) = ℓ(μ; y) = ℓ(θ) = Σ_{i=1}^{n} log f_i(y_i; θ_i, φ)
         = Σ_{i=1}^{n} [ {y_i x_i^T β − b(x_i^T β)}/a(φ) + c(y_i, φ) ],    (6)

where μ = (μ_1, ..., μ_n)^T, y = (y_1, ..., y_n)^T, and θ = (θ_1, ..., θ_n)^T. Then, the resulting scaled deviance of a penalized estimate β̂_λ is

    D(y; μ̂_λ) = 2 { ℓ(y; y) − ℓ(μ̂_λ; y) },

where μ̂_λ = (g^{-1}(x_1^T β̂_λ), ..., g^{-1}(x_n^T β̂_λ))^T.
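To make the SCAD penalty of Section 2.1 and the LQA approximation of Remark 1 concrete, the following minimal Python sketch (not code from the paper; NumPy only) evaluates the first derivative p'_λ(·) with a = 3.7 and the diagonal LQA weights p'_λ(|β_j|)/|β_j| used in a Newton–Raphson step. The zero-threshold eps is an assumption of this illustration.

```python
import numpy as np

def scad_derivative(beta_abs, lam, a=3.7):
    """First derivative p'_lambda(|beta|) of the SCAD penalty (Fan and Li 2001)."""
    beta_abs = np.asarray(beta_abs, dtype=float)
    deriv = np.zeros_like(beta_abs)
    small = beta_abs <= lam
    mid = (beta_abs > lam) & (beta_abs <= a * lam)
    deriv[small] = lam
    deriv[mid] = (a * lam - beta_abs[mid]) / (a - 1.0)
    # for |beta| > a * lam the derivative is zero, so the penalty stops growing
    return deriv

def lqa_weights(beta, lam, eps=1e-8):
    """Diagonal LQA weights p'_lambda(|beta_j|)/|beta_j|; components that are
    numerically zero are flagged so they can be dropped from the update."""
    beta_abs = np.abs(beta)
    active = beta_abs > eps
    w = np.zeros_like(beta_abs)
    w[active] = scad_derivative(beta_abs[active], lam) / beta_abs[active]
    return w, active
```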

For model α_λ, we employ the scaled deviance D(y; μ̂_λ) as the goodness-of-fit measure, G(y, β̂_λ), in (2), so that the resulting GIC for GLIM is

    GIC_{κ_n}(λ) = (1/n) D(y; μ̂_λ) + (1/n) κ_n df_λ.    (7)

In addition, when we fit the data with the nonpenalized likelihood approach under model α, GIC becomes

    GIC*_{κ_n}(α) = (1/n) D(y; μ̂*_α) + (1/n) κ_n d_α,    (8)

where μ̂*_α = (g^{-1}(x_1^T β̂*_α), ..., g^{-1}(x_n^T β̂*_α))^T, and β̂*_α is the nonpenalized maximum likelihood estimator of β. Accordingly, GIC* can be used in classical variable selection [see equation (3.10) of McCullagh and Nelder 1989]. Next, we show the asymptotic performances of GIC.

Theorem 1. Suppose the density function of the generalized linear model satisfies Fan and Li's (2001) three regularity conditions (A)–(C) and that the technical condition (C4) holds.

(A) If there exists a positive constant M such that κ_n < M, then the tuning parameter λ̂ selected by minimizing GIC_{κ_n}(λ) in (7) satisfies

    P{λ̂ ∈ Ω_−} → 0 and P{λ̂ ∈ Ω_+} ≥ π,

where π is a nonzero probability.

(B) Suppose that Conditions (C1)–(C3) are satisfied. If κ_n → ∞ and κ_n/√n → 0, then the tuning parameter λ̂ selected by minimizing GIC_{κ_n}(λ) in (7) satisfies P{α_λ̂ = α_0} → 1.

The proof is given in Appendix A.

Theorem 1 provides guidance on the choice of the regularization parameter. Theorem 1(A) implies that the GIC selector with bounded κ_n tends to overfit without regard to which penalty function is being used. Analogous to the classical variable selection criterion AIC, we refer to GIC with κ_n = 2 as the AIC selector, while we name GIC with κ_n → 2 the AIC-type selector. In contrast, because κ_n = log(n) fulfills the conditions of Theorem 1(B), we call GIC with κ_n → ∞ and κ_n/√n → 0 the BIC-type selector. Theorem 1(B) indicates that the BIC-type selector identifies the true model consistently. Thus, the nonconcave penalized likelihood of the generalized linear model with the BIC-type selector possesses the oracle property.

Remark 3. In linear regression models, Fan and Li (2001) applied the GCV selector given below to choose the regularization parameter,

    GCV*(λ) = ||y − Xβ̂_λ||² / [n {1 − df_L(λ)/n}²],    (9)

where ||·|| denotes the Euclidean norm, X = (x_1, ..., x_n)^T, and df_L(λ) is defined in (4). To extend the application of GCV, we replace the residual sum of squares in (9) by G(y, β̂_λ), and then choose λ that minimizes

    GCV(λ) = G(y, β̂_λ) / [n {1 − df_λ/n}²].    (10)

Using the Taylor expansion, we further have that

    GCV(λ) ≈ (1/n) { G(y, β̂_λ) + [2 G(y, β̂_λ)/n] df_λ }.

Because G(y, β̂_λ)/n is bounded via Condition (C4), GCV yields an overfitted model with a positive probability. Therefore, the penalized partial likelihood with the GCV selector does not possess the oracle property.

Remark 4. In linear regression models, Wang, Li, and Tsai (2007b) demonstrated that Fan and Li's (2001) GCV selector for the SCAD penalized least squares procedure cannot select the tuning parameter consistently. They further proposed the following BIC tuning parameter selector:

    BIC*(λ) = log(σ̂²_λ) + (1/n) log(n) df_L(λ),

where σ̂²_λ = Σ_{i=1}^{n} (y_i − x_i^T β̂_λ)²/n. Using the result that log(1 + t) ≈ t for small t, BIC*(λ) is approximately equal to

    BIC**(λ) = (1/n) D**(y; μ̂_λ) + (1/n) log(n) df_L(λ),

where D**(y; μ̂_λ) = n σ̂²_λ/σ̂²_ᾱ is the scaled deviance of the normal distribution, and σ̂²_ᾱ is the dispersion parameter estimator computed from the full model. It can be seen that BIC** is a BIC-type selector. Under the conditions in Theorem 1(B), the SCAD penalized least squares procedure with the BIC** selector possesses the oracle property, which is consistent with the findings in Wang, Li, and Tsai (2007b).

4. EFFICIENCY

Under the assumption that the true model is included in a family of candidate models, we established the consistency of BIC-type selectors. In practice, however, this assumption may not be valid, which motivates us to study the asymptotic efficiency of the AIC-type selector. In the literature, the L_2 norm has been commonly used to assess the efficiency of the classical AIC procedure (see Shibata 1981, 1984 and Li 1987) in linear regression models. Hence, we focus on the efficiency of linear regression model selections via the L_2 norm.

Consider the following model:

    y_i = μ_i + ε_i  for i = 1, ..., n,

where μ = (μ_1, ..., μ_n)^T is an unknown mean vector, and the ε_i's are independent and identically distributed (iid) random errors with mean 0 and variance σ². Furthermore, we assume that Xβ constitutes the nearest representation of the true mean vector μ, and hence the full model is not necessarily a correct model. Adapting the formulation of Li (1987), we allow d, the dimension of β, to tend to infinity with n, but d/n → 0.

For the given dataset {(x_i, y_i) : i = 1, ..., n}, we follow the formulation of (1) to define the penalized least squares function

    Q^LS(β) = Σ_{i=1}^{n} (y_i − x_i^T β)² + n Σ_{j=1}^{d} p_λ(|β_j|).    (11)

The resulting penalized estimator of μ in model α_λ is μ̂_λ = Xβ̂_λ. In addition, the nonpenalized estimator of μ in model α is μ̂*_α = Xβ̂*_α. It is noteworthy that β̂_λ and β̂*_α have been defined in Definition 1 and Remark 2, respectively.
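As an illustration of how the selector in (7) can be screened over a grid of regularization parameters, the sketch below (a hypothetical helper written for this exposition, not code from the paper) computes the scaled deviance of a logistic fit, uses the number of nonzero coefficients as df_λ as suggested by Proposition 1, and returns the λ minimizing GIC; κ_n = 2 gives the AIC selector and κ_n = log n the BIC-type selector. The penalized estimates in fitted_betas are assumed to come from an external solver (e.g., an LQA or LLA implementation), which is not specified here.

```python
import numpy as np

def binomial_deviance(y, mu, eps=1e-12):
    """Scaled deviance D(y; mu) = 2{l(y; y) - l(mu; y)} for Bernoulli responses."""
    mu = np.clip(mu, eps, 1.0 - eps)
    return -2.0 * np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))

def gic_select(y, X, lam_grid, fitted_betas, kappa_n):
    """Return the lambda on the grid minimizing
    GIC(lambda) = D(y; mu_hat_lambda)/n + kappa_n * df_lambda / n."""
    n = len(y)
    best_lam, best_gic = None, np.inf
    for lam, beta in zip(lam_grid, fitted_betas):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # logistic mean function
        df = int(np.sum(beta != 0))            # model size d_{alpha_lambda}
        gic = binomial_deviance(y, mu) / n + kappa_n * df / n
        if gic < best_gic:
            best_lam, best_gic = lam, gic
    return best_lam, best_gic
```

With kappa_n = 2 this corresponds to the SCAD-AIC selector of Section 5, and with kappa_n = np.log(n) to the SCAD-BIC selector.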

In practice, the tuning parameter λ is unknown and can be selected by minimizing

    GIC^LS_{κ_n}(λ) = (1/n) Σ_{i=1}^{n} (y_i − x_i^T β̂_λ)² + (1/n) κ_n σ² d_{α_λ},    (12)

where σ² is assumed to be known, and the case with the unknown σ² will be discussed later. When β̂_λ in (12) is replaced by the least squares estimator β̂*_α, GIC^LS_{κ_n} with κ_n = 2 becomes the Mallows' C_p criterion that has been used for classical variable selections.

To assess the performance of the tuning parameter selector, we adopt the approach of Shibata (1981) (also see Li 1987; Shao 1997), and define the average squared loss (or the L_2 loss) associated with the estimator β̂_λ to be

    L(β̂_λ) = (1/n) ||μ − μ̂_λ||² = (1/n) Σ_{i=1}^{n} (μ_i − x_i^T β̂_λ)².    (13)

Accordingly, the risk function is R(β̂_λ) = E[L(β̂_λ)].

Using the average squared loss measure, we further define the asymptotic loss efficiency.

Definition 3 (Asymptotically loss efficient). A tuning parameter selection procedure is said to be asymptotically loss efficient if

    L(β̂_λ̂) / inf_{λ∈[0,λ_max]} L(β̂_λ) → 1,    (14)

in probability, where β̂_λ̂ is associated with the tuning parameter λ̂ selected by this procedure. We also say β̂_λ̂ is asymptotically loss efficient if (14) holds.

Moreover, we introduce the following technical conditions for studying the asymptotic loss efficiency of the AIC-type selector in linear regression:

(C5) (n^{-1} X^T X)^{-1} exists, and its largest eigenvalue is bounded by a constant number C.
(C6) E ε_1^{4q} < ∞, for some positive integer q.
(C7) The risks of the least squares estimators β̂*_α for all α ∈ A satisfy

    Σ_{α∈A} [n R(β̂*_α)]^{-q} → 0.

(C8) Let b = (b_1, ..., b_d)^T, where b_j = p'_λ(|β̂_{λj}|) sgn(β̂_{λj}) for all j such that |β̂_{λj}| > 0, and b_j = 0 otherwise, and β̂_{λj} is the jth component of the penalized estimator β̂_λ. In addition, let β̂*_{α_λ} be the least squares estimator of β obtained from model α_λ. Then, we assume that, in probability,

    sup_{λ∈[0,λ_max]} ||b||² / R(β̂*_{α_λ}) → 0.

Condition (C5) has been commonly considered in the literature. Conditions (C6) and (C7) are adopted from conditions (A.2) and (A.3), respectively, in Li (1987). It can be shown that if the true model is approximated by a set of the candidate models (e.g., the true model is of infinite dimension), then Condition (C7) holds. Condition (C8) ensures that the difference between the penalized mean function estimator and the corresponding least squares mean function estimator is small in comparison with the risk of the least squares estimator (see Lemma 3 in Appendix B). Sufficient conditions for (C8) are also given in Appendix C. We next show the asymptotic efficiency of the AIC-type selector.

Theorem 2. Assume Conditions (C5)–(C8) hold. Then, the tuning parameter λ̂ selected by minimizing GIC^LS_{κ_n}(λ) in (12) with κ_n → 2 yields an asymptotically loss efficient estimator, β̂_λ̂, in the sense of (14).

The proof is given in Appendix B.

Theorem 2 demonstrates that the AIC-type selector is asymptotically loss efficient. In addition, using the result that log(1 + t) ≈ t for small t, the AIC selector behaves similarly to the following AIC* selector,

    AIC*(λ) = log(σ̂²_λ) + 2σ² d_{α_λ}/n.

Accordingly, both the AIC and AIC* selectors are asymptotically loss efficient.

Applying Lemma 4 and Equation (B.1) in Appendix B, we find that if sup_{λ∈[0,λ_max]} |n^{-1}(κ_n − 2)σ² d_{α_λ}/R(β̂_λ)| → 0 in probability, then GIC^LS_{κ_n} is asymptotically loss efficient. Note that it can be shown that n^{-1}σ² d_{α_λ}/R(β̂_λ) is bounded by 1. As a result, κ_n → 2 (including κ_n = 2) is critical in establishing the asymptotic loss efficiency. This finding is similar to the classical efficient criteria (see Shibata 1980; Shao 1997; and Yang 2005). It is also not surprising that the BIC-type selectors do not possess asymptotic loss efficiency, which is consistent with the finding of classical variable selections (see Li 1987 and Shao 1997).

In practice, σ² is often unknown. It is natural to replace the σ² in GIC^LS_2 by its consistent estimator (see Shao 1997). The following corollary shows that the asymptotic property of GIC still holds.

Corollary 1. If the tuning parameter λ̂ is selected by minimizing GIC^LS_{κ_n}(λ) with κ_n → 2 and σ² being replaced by its consistent estimator σ̃², then the resulting procedure is also asymptotically loss efficient.

The proof is given in Appendix B.

Remark 5. As suggested by an anonymous reviewer, we study the asymptotic loss efficiency of the generalized linear model in (5). Following the spirit of the deviance measure commonly used in the GLIM, we adopt the Kullback–Leibler (KL) distance measure to define the KL loss of an estimate β̂ (either the maximum likelihood estimate β̂*_α or the nonconcave penalized estimate β̂_λ), as

    L_KL(β̂) = (2/n) E_0 {ℓ(θ_0) − ℓ(θ̂)},    (15)

where ℓ(·) is defined in (6), θ_0 = (θ_01, ..., θ_0n)^T is the true unknown canonical parameter, θ̂ = (θ̂_1, ..., θ̂_n)^T = Xβ̂ under the canonical link function, and E_0 denotes the expectation under the true model (see, e.g., McQuarrie and Tsai 1998). It can be shown that the KL loss is identical to the squared loss for normal linear regression models with known variance. To obtain the asymptotic loss efficiency for GLIM, however, it is necessary to employ the Taylor expansion to expand b(θ_i) at the true value of θ_0i for i = 1, ..., n. Accordingly, we need to impose some strong assumptions to establish the asymptotic loss efficiency for the nonconcave penalized estimate under the KL loss if the tuning parameter λ̂ is selected by minimizing GIC in (7) with κ_n → 2. To avoid considerably increasing the length of the paper, we do not present detailed discussions, justifications, and proofs here, but they are given in the online supplemental material available on the JASA website.

5. NUMERICAL STUDIES

In this section, we present three examples comprised of two Monte Carlo experiments and one empirical example. It is noteworthy that Example 1 considers a setting in which the true model is in a set of candidate models with logistic regressions. This setting allows us to examine the finite sample performance of the consistent selection criterion, which is expected to perform well via Theorem 1. In contrast, Example 2 considers a setting in which the true model is not included in a set of candidate models with Gaussian regressions. This setting enables us to assess the finite sample performance of the efficient selection criterion, which is expected to perform well via Theorem 2.

Example 1. Adapting the model setting from Tibshirani (1996) and Fan and Li (2001), we simulate the data from the logistic regression model, y | x ∼ Bernoulli{p(x^T β)}, where

    p(x^T β) = μ(x^T β) = exp(x^T β) / {1 + exp(x^T β)}.

In addition, β = (3, 1.5, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0)^T, x is a 12-dimensional random vector, the first nine components of x are multivariate normal with covariance matrix Σ = (σ_ij), in which σ_ij = 0.5^{|i−j|}, and the last three components of x are generated from an independent Bernoulli distribution with success probability 0.5. Moreover, we conduct 1000 realizations with the sample sizes n = 200 and 400.

To assess the finite sample performance of the proposed methods, we report the percentage of models correctly fitted, underfitted, and overfitted with 1, 2, 3, 4, 5 or more parameters by SCAD-AIC (i.e., κ_n = 2), SCAD-BIC [i.e., κ_n = log(n)], SCAD-GCV, AIC, and BIC selectors as well as via the oracle procedure (i.e., the simulated data were fitted with the true model). Their corresponding standard errors can be calculated by √{p̂(1 − p̂)/1000}, where p̂ is the observed proportion in 1000 simulations. Moreover, we report the average number of zero coefficients that were correctly (C) and incorrectly (I) identified by different methods. To compare model fittings, we further calculate the following model error for the new observation (x, y):

    ME(β̂) = E_x {μ(x^T β) − μ(x^T β̂)}²,

where the expectation is taken with respect to the new observed covariate vector x, and μ(x^T β) = E(y | x). Then, we report the median of the relative model error (MRME), where the relative model error is defined as RME = ME / ME_full, and ME_full is the model error calculated by fitting the data with the full model.

Table 1 shows that the MRME of SCAD-BIC is smaller than that of SCAD-AIC. As the sample size increases, the MRME of SCAD-BIC approaches that of the oracle estimator, whereas the MRME of SCAD-AIC remains at the same level. Hence, SCAD-BIC is superior to SCAD-AIC in terms of model error measures in this example. Moreover, Figure 1 presents the boxplots of RMEs, which lead to a similar conclusion.

In model identifications, SCAD-BIC has a higher chance than SCAD-AIC of correctly setting the nine true zero coefficients to zero, while SCAD-BIC is slightly more prone than SCAD-AIC to incorrectly set the three nonzero coefficients to zero when the sample size is small. In addition, SCAD-BIC has a much higher possibility of correctly identifying the true model than SCAD-AIC.

Table 1. Simulation results for the logistic regression model. MRME is the median relative model error; C denotes the average number of the nine true zero coefficients that were correctly identified as zero, while I denotes the average number of the three nonzero coefficients that were incorrectly identified as zero; the numbers in parentheses are standard errors; Under, Exact, and Overfitted represent the proportions of corresponding models being selected in 1000 Monte Carlo realizations, respectively

                                 Zeros                                             Overfitted (%)
Method      MRME (%)    C          I          Under (%)   Exact (%)    1      2      3      4      ≥5

n = 200
SCAD-AIC    58.03       7.4 (1.6)  0.0 (0.0)  0.2         30.2         23.2   19.6   13.6   7.5    5.9
SCAD-BIC    16.90       8.8 (0.5)  0.0 (0.1)  0.7         83.7         13.2   2.4    0.4    0.0    0.0
SCAD-GCV    81.91       5.8 (1.9)  0.0 (0.0)  0.0         7.8          12.0   18.1   18.8   17.6   25.7
AIC         57.05       7.4 (1.2)  0.0 (0.0)  0.1         19.9         31.5   26.7   14.8   4.3    2.8
BIC         18.27       8.8 (0.5)  0.0 (0.1)  0.8         80.4         17.0   2.3    0.1    0.0    0.0
Oracle      12.49       9.0 (0.0)  0.0 (0.0)  0.0         100.0        0.0    0.0    0.0    0.0    0.0

n = 400
SCAD-AIC    64.45       7.4 (1.5)  0.0 (0.0)  0.0         29.4         21.9   21.0   15.1   8.5    4.1
SCAD-BIC    19.03       8.9 (0.4)  0.0 (0.0)  0.0         89.9         7.9    1.8    0.2    0.2    0.0
SCAD-GCV    82.13       6.0 (1.9)  0.0 (0.0)  0.0         9.2          14.7   18.6   19.8   15.9   21.8
AIC         63.17       7.4 (1.2)  0.0 (0.0)  0.0         21.7         29.2   27.3   14.6   5.5    1.7
BIC         20.46       8.8 (0.4)  0.0 (0.0)  0.0         86.3         12.1   1.4    0.2    0.0    0.0
Oracle      15.88       9.0 (0.0)  0.0 (0.0)  0.0         100.0        0.0    0.0    0.0    0.0    0.0
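For readers who want to reproduce a setup like Example 1, the following sketch generates one replicate under the stated design and evaluates the model error ME(β̂) by Monte Carlo over fresh covariates. The Monte Carlo approximation of the expectation E_x and its sample size are choices of this illustration, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true = np.array([3, 1.5, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0], dtype=float)

def draw_covariates(n, rng):
    """First nine components: N(0, Sigma) with Sigma_ij = 0.5^|i-j|;
    last three components: independent Bernoulli(0.5)."""
    idx = np.arange(9)
    sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    x_norm = rng.multivariate_normal(np.zeros(9), sigma, size=n)
    x_bern = rng.binomial(1, 0.5, size=(n, 3)).astype(float)
    return np.hstack([x_norm, x_bern])

def simulate_logistic(n, rng):
    X = draw_covariates(n, rng)
    p = 1.0 / (1.0 + np.exp(-X @ beta_true))
    return X, rng.binomial(1, p)

def model_error(beta_hat, rng, n_mc=100_000):
    """Monte Carlo estimate of ME(beta_hat) = E_x{mu(x'beta) - mu(x'beta_hat)}^2."""
    X_new = draw_covariates(n_mc, rng)
    mu_true = 1.0 / (1.0 + np.exp(-X_new @ beta_true))
    mu_hat = 1.0 / (1.0 + np.exp(-X_new @ beta_hat))
    return np.mean((mu_true - mu_hat) ** 2)

X, y = simulate_logistic(200, rng)   # one replicate with n = 200
```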

Figure 1. Boxplot of relative model error (RME) with n = 200 and n = 400.

Moreover, among the overfitted models, the SCAD-BIC method is likely to include only one irrelevant variable, whereas the SCAD-AIC method often includes two or more. When the sample size increases, the SCAD-BIC method yields a lesser degree of overfitting, while the SCAD-AIC method still overfits quite seriously. These results are consistent with the theoretical findings presented in Theorem 1.

It is not surprising that SCAD-GCV performs similarly to SCAD-AIC. Furthermore, the classical AIC and BIC selection criteria behave as SCAD-AIC and SCAD-BIC, respectively, in terms of model errors and identifications. It is of interest to note that SCAD-AIC usually suggests a sparser model than AIC. In sum, SCAD-BIC performs the best.

Example 2. In practice, it is plausible that the full model fails to include some important explanatory variable. This motivates us to mimic such a situation and then assess various selection procedures. Accordingly, we consider a linear regression model y_i = x_i^T β + ε_i, where the x_i's are iid multivariate normal random variables with dimension 13, the correlation between x_i and x_j is 0.5^{|i−j|}, and the ε_i's are iid N(0, σ²) with σ = 4. In addition, we partition x = (x_full^T, x_exc^T)^T, where x_full contains d = 12 covariates of the full model and x_exc is the covariate excluded from model fittings. Accordingly, we partition β = (β_full^T, β_exc^T)^T, where β_full is a 12 × 1 vector and β_exc is a scalar. To investigate the performance of the proposed methods under various parameter structures, we let β_full = β_0 + γδ/√n, where β_0 = (3, 1.5, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0)^T, δ = (0, 0, 1.5, 1.5, 1, 1, 0, 0, 0, 0, 0.5, 0.5)^T, γ ranges from 0 to 10, and β_exc = 0.2. Because the candidate model is the subset of the full model that contains the 12 covariates in x_full, the above model settings ensure that the true model is not included in the set of candidate models.

We simulate 1000 data sets with n = 400, 800, and 1600. To study the performance of selectors, we define the finite sample's loss efficiency,

    LE(β̂_λ̂) = L(β̂_λ̂) / inf_λ L(β̂_λ),

where λ̂ is chosen by SCAD-GCV, SCAD-AIC, and SCAD-BIC. Because β_full depends on the sample size and γ, a sensible approach to compare the loss efficiency of the selection criterion across different sample sizes is to compare the loss efficiency under the same value of β_full.

Because the performance of SCAD-GCV is very similar to that of SCAD-AIC, it is not reported here. Figure 2(a) and (b) depict the LEs of SCAD-AIC and SCAD-BIC, respectively, across various γ. Figure 2(a) clearly indicates that the loss efficiency of SCAD-AIC converges to 1 regardless of the value of γ, which corroborates the theoretical finding in Theorem 2. However, the loss efficiency of SCAD-BIC in Figure 2(b) does not show this tendency. Specifically, when γ is close to 0 (i.e., the model is nearly sparse), SCAD-BIC results in smaller loss due to its larger penalty function. As γ increases so that the full model contains the medium-sized coefficients, SCAD-BIC is likely to choose a model that is too sparse and hence increases the loss. When γ becomes large enough and the coefficient β_exc is dominated by the remaining coefficients in β_full, the consistency property of SCAD-BIC tends to yield the smaller loss. In sum, Figure 2 shows that SCAD-AIC is efficient, whereas SCAD-BIC is not. Finally, we examine the efficiencies of AIC and BIC, and find that the performances of AIC and BIC are similar to those of SCAD-AIC and SCAD-BIC, respectively, and hence are omitted.

Example 3 (Mammographic mass data). Mammography is the most effective method for screening for the presence of breast cancer.

Figure 2. (a) The loss efficiency of SCAD-AIC; (b) The loss efficiency of SCAD-BIC.

However, mammogram interpretations yield approximately a 70% rate of unnecessary biopsies with benign outcomes. Hence, several computer-aided diagnosis (CAD) systems have been proposed to assist physicians in predicting breast biopsy outcomes from the findings of BI-RADS (Breast Imaging Reporting And Data System). To examine the capability of two novel CAD approaches, Elter, Schulz-Wendtland, and Wittenberg (2007) recently analyzed a dataset containing 516 benign and 445 malignant instances. We downloaded this data set from the UCI Machine Learning Repository, and excluded 131 missing values and 15 coding errors. As a result, we considered a total of 815 cases.

To study the severity (benign or malignant), Elter, Schulz-Wendtland, and Wittenberg (2007) considered the following three covariates: x_1—birads, the BI-RADS assessment assigned in a double-review process by physicians (definitely benign = 1 to highly suggestive of malignancy = 5); x_2—age, the patient's age in years; x_3—mass density, high = 1, iso = 2, low = 3, fat-containing = 4. In addition, we employ the dummy variables x_4 to x_6 to represent the four mass shapes: round = 1, oval = 2, lobular = 3, irregular = 4, where "irregular" is used as the baseline. Analogously, we apply the dummy variables x_7 to x_10 to represent the five mass margins: circumscribed = 1, microlobulated = 2, obscured = 3, ill-defined = 4, spiculated = 5, where "spiculated" is used as the baseline.

To predict the value of the binary response, y = 0 (benign) or y = 1 (malignant), on the basis of the 10 explanatory variables, we fit the data with the following logistic regression model:

    log[ p(x^T β) / {1 − p(x^T β)} ] = β_0 + Σ_{j=1}^{10} x_j β_j,

where x = (x_1, ..., x_10)^T, β = (β_1, ..., β_10)^T, and p(x^T β) is the probability of the case being classified as malignant. As a result, the tuning parameters selected by SCAD-AIC and SCAD-BIC are 0.0332 and 0.1512, respectively, and SCAD-GCV yields the same model as SCAD-AIC. Because the true model is unknown, we follow an anonymous referee's suggestion to include the models selected by the delete-1 cross validation (delete-1 CV) and the 5-fold cross validation (5-fold CV). The detailed procedures for fitting those models can be obtained from the first author.

Because the model selected by the delete-1 CV is the same as SCAD-AIC, Table 2 only presents the nonpenalized maximum likelihood estimates from the full model as well as the SCAD-AIC, SCAD-BIC, and 5-fold CV estimates, together with their standard errors. It indicates that the nonpenalized maximum likelihood approach fits five spurious variables (x_3, x_6, and x_8 to x_10), while SCAD-AIC and 5-fold CV include two variables (x_6 and x_9) and one variable (x_9), respectively, with insignificant effects at a level of 0.05. In contrast, all variables (x_1, x_2, x_4, x_5, and x_7) selected by SCAD-BIC are significant. Because the sample size n = 815 is large, these findings are consistent with Theorem 1 and the simulation results. In addition, the p-value of the deviance test for assessing the SCAD-BIC model against the full model is 0.41, and as a result, there is no evidence of lack of fit in the SCAD-BIC model.
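To clarify the deviance (lack-of-fit) comparison quoted above, the sketch below (assuming SciPy is available; the fitted linear predictors and variable counts are inputs supplied by the user, and the function names are hypothetical) computes the likelihood-ratio statistic between a fitted submodel and the full model and refers it to a chi-squared distribution. From Table 2, the SCAD-BIC model uses 6 parameters (intercept, x1, x2, x4, x5, x7) against 11 in the full model, so the test has 5 degrees of freedom; the p-value 0.41 reported in the paper came from such a comparison.

```python
import numpy as np
from scipy.stats import chi2

def logistic_deviance(y, eta):
    """Residual deviance of a logistic fit with linear predictor eta."""
    mu = np.clip(1.0 / (1.0 + np.exp(-eta)), 1e-12, 1 - 1e-12)
    return -2.0 * np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def deviance_test(y, eta_sub, eta_full, df_sub, df_full):
    """Likelihood-ratio (deviance) test of a submodel against the full model."""
    stat = logistic_deviance(y, eta_sub) - logistic_deviance(y, eta_full)
    return stat, chi2.sf(stat, df_full - df_sub)
```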

Table 2. Estimates for mammographic mass data with standard deviations in parentheses

Variable          MLE              SCAD-AIC         SCAD-BIC         5-Fold CV

Intercept (β0)    −11.04 (1.48)    −11.17 (1.13)    −11.16 (1.07)    −11.49 (1.14)
BIRADS (x1)       2.18 (0.23)      2.19 (0.23)      2.25 (0.23)      2.23 (0.23)
Age (x2)          0.05 (0.01)      0.04 (0.01)      0.04 (0.01)      0.05 (0.01)
Density (x3)      −0.04 (0.29)     0 (–)            0 (–)            0 (–)
sRound (x4)       −0.98 (0.37)     −0.99 (0.37)     −0.80 (0.34)     −0.81 (0.35)
sOval (x5)        −1.21 (0.32)     −1.22 (0.32)     −1.07 (0.30)     −1.07 (0.30)
sLobular (x6)     −0.53 (0.35)     −0.54 (0.34)     0 (–)            0 (–)
mCircum (x7)      −1.05 (0.42)     −0.98 (0.32)     −1.01 (0.30)     −1.07 (0.31)
mMicro (x8)       −0.03 (0.65)     0 (–)            0 (–)            0 (–)
mObscured (x9)    −0.48 (0.39)     −0.42 (0.30)     0 (–)            −0.47 (0.30)
mIlldef (x10)     −0.09 (0.33)     0 (–)            0 (–)            0 (–)

Based on the model selected by SCAD-BIC, we conclude that a higher BI-RADS assessment or a greater age results in a higher chance for malignancy. In addition, the oval or round mass shape yields lower odds for malignance than does the irregular mass shape, while the odds for malignance designated by the lobular mass shape is not significantly different from that of the irregular mass shape. Moreover, the odds for malignance indicated by the microlobulated, obscured, and ill-defined mass margins are not significantly different from that of the spiculated mass margin. However, the circumscribed mass margin leads to lesser odds for malignance than that of the four other types of mass margins.

6. DISCUSSION

In the context of variable selection, we propose the generalized information criterion to choose regularization parameters for nonconcave penalized likelihood functions. Furthermore, we study the theoretical properties of GIC. If we believe that the true model is contained in a set of candidate models with the GLIM structure, then the BIC-type selector identifies the true model with probability tending to 1, while the GIC selectors with bounded κ_n tend to overfit with positive probability. However, if the true model is approximated by a family of candidate models, the AIC-type selector is asymptotically loss efficient, whereas the BIC-type selector is not, in general. Simulation studies support the finite sample performance of the selection criteria.

Although we obtain the theoretical properties of GIC for GLIM, the application of GIC is not limited. For example, we could employ GIC to select the regularization parameter in Cox proportional hazard and quasi-likelihood models. We believe that these efforts would further enhance the usefulness of GIC in real data analysis.

APPENDIX A: PROOF OF THEOREM 1

Before proving the theorem, we show the following two lemmas. Then, Theorems 1(A) and 1(B) follow from Lemmas 1 and 2, respectively. For the sake of convenience, we denote D(y; μ̂_λ) in (7) and D(y; μ̂*_α) in (8) by D(λ) and D*(α), respectively.

Lemma 1. Suppose the density function of the generalized linear model satisfies Fan and Li's (2001) three regularity conditions (A)–(C). Assume that there exists a positive constant M such that κ_n < M. Then under Condition (C4), we have

    P{ inf_{λ∈Ω_−} GIC_{κ_n}(λ) > GIC*_{κ_n}(ᾱ) } → 1    (A.1)

and

    lim inf_{n→∞} P{ inf_{λ∈Ω_0} GIC_{κ_n}(λ) > GIC*_{κ_n}(ᾱ) } ≥ π,    (A.2)

as n → ∞.

Proof. For a given λ, the nonpenalized maximum likelihood estimator, β̂*_{α_λ}, maximizes ℓ(β) under model α_λ. This implies D(λ) ≥ D*(α_λ) for any given λ, which leads to

    GIC_{κ_n}(λ) = D(λ)/n + κ_n df_λ/n > D*(α_λ)/n.    (A.3)

Then, we obtain that

    GIC_{κ_n}(λ) − GIC*_{κ_n}(ᾱ) > D*(α_λ)/n − D*(ᾱ)/n − κ_n d_ᾱ/n,

which holds true for any λ ∈ Ω_− = {λ : α_λ ⊉ α_0}. This, together with Condition (C4) and κ_n d_ᾱ/n = o_P(1) obtained from the fact that κ_n < M, yields

    P{ inf_{λ∈Ω_−} GIC_{κ_n}(λ) − GIC*_{κ_n}(ᾱ) > 0 }
      ≥ P{ min_{α⊉α_0} [ D*(α)/n − D*(ᾱ)/n − κ_n d_ᾱ/n ] > 0 }
      = P{ min_{α⊉α_0} [ c_α − c_ᾱ + o_P(1) ] > 0 } → 1,    (A.4)

as n → ∞. The last step uses the fact that the number of underfitting models is finite, hence min_{α⊉α_0} c_α is strictly greater than c_ᾱ (i.e., c_{α_0}) under Condition (C4). The above equation yields (A.1) immediately.

Next we show that the model selected by the GIC selector with bounded κ_n is overfitted with probability bounded away from zero. For any λ ∈ Ω_0, α_λ = α_0. Then subtracting GIC*_{κ_n}(ᾱ) from both sides of (A.3) and subsequently taking inf_{λ∈Ω_0} over GIC_{κ_n}(λ), we have

    inf_{λ∈Ω_0} GIC_{κ_n}(λ) − GIC*_{κ_n}(ᾱ) > D*(α_0)/n − [ D*(ᾱ)/n + κ_n d_ᾱ/n ].    (A.5)

Note that the right-hand side of the above equation does not involve λ. D*(α_0) − D*(ᾱ) = −2[ℓ(β̂*_{α_0}) − ℓ(β̂*_ᾱ)], where β̂*_{α_0} and β̂*_ᾱ are the nonpenalized maximum likelihood estimators computed via the true and full models, respectively. Under regularity conditions (A)–(C) in Fan and Li (2001), it can be shown that β̂*_{α_0} and β̂*_ᾱ are consistent and asymptotically normal. Hence, the likelihood ratio test statistic −2[ℓ(β̂*_{α_0}) − ℓ(β̂*_ᾱ)] converges in law to χ²_{d_ᾱ − d_{α_0}}. This result, together with κ_n < M and (A.5), yields

    P{ inf_{λ∈Ω_0} GIC_{κ_n}(λ) − GIC*_{κ_n}(ᾱ) > 0 }
      ≥ P{ −(2/n)[ℓ(β̂*_{α_0}) − ℓ(β̂*_ᾱ)] − κ_n d_ᾱ/n > 0 }
      ≥ P{ −2[ℓ(β̂*_{α_0}) − ℓ(β̂*_ᾱ)] > d_ᾱ M }
      → P{ χ²_{d_ᾱ − d_{α_0}} ≥ d_ᾱ M }, which we denote by π.

This implies (A.2), and we complete the proof of Lemma 1.

To prove Theorem 1(B), we present the following lemma.

Lemma 2. Suppose the density function of the generalized linear model satisfies Fan and Li's (2001) three regularity conditions (A)–(C). Assume Conditions (C1)–(C4) hold, and let λ_n = κ_n/√n. If κ_n satisfies κ_n → ∞ and λ_n → 0 as n → ∞, then

    P{ GIC_{κ_n}(λ_n) = GIC*_{κ_n}(α_0) } → 1    (A.6)

and

    P{ inf_{λ∈Ω_−∪Ω_+} GIC_{κ_n}(λ) > GIC_{κ_n}(λ_n) } → 1.    (A.7)

Proof. Without loss of generality, we assume that the first d_{α_0} coefficients of β in the true model are nonzero and the rest are zeros. Note that the density function of the generalized linear model satisfies Fan and Li's (2001) three regularity conditions (A)–(C). These conditions, together with Condition (C3) and the assumptions on κ_n stated in Lemma 2, allow us to apply Fan and Li's theorems 1 and 2 to show that, with probability tending to 1, the last d − d_{α_0} components of β̂_{λ_n} are zeros and the first d_{α_0} components of β̂_{λ_n} satisfy the normal equations

    ∂ℓ(β̂_{λ_n})/∂β_j + b_{λ_n j} = 0  for j = 1, ..., d_{α_0},    (A.8)

where b_{λ_n j} = p'_{λ_n}(|β̂_{λ_n j}|) sgn(β̂_{λ_n j}), and β̂_{λ_n j} is the jth component of β̂_{λ_n}.

Using the oracle property, we have |β̂_{λ_n j}| → |β_j| ≥ min_{1≤j≤d_{α_0}} |β_j|. Then, under Conditions (C1) and (C2), there exists a constant m so that p'_{λ_n}(|β̂_{λ_n j}|) = 0 for min_{1≤j≤d_{α_0}} |β_j| > mλ_n as n gets large. Accordingly, P(b_{λ_n j} = 0) → 1 for j = 1, ..., d_{α_0}. This, together with (A.8), implies that, with probability tending to 1, the first d_{α_0} components of β̂_{λ_n} solve the normal equations

    ∂ℓ(β̂_{λ_n})/∂β_j = 0,  j = 1, ..., d_{α_0},

and the remaining d − d_{α_0} components are zeros. This is exactly the same as the normal equation solved by the nonpenalized maximum likelihood estimator β̂*_{α_0}. As a result, β̂_{λ_n} = β̂*_{α_0} with probability tending to 1. It follows that

    P{ D(λ_n) = D*(α_0) } = P{ β̂_{λ_n} = β̂*_{α_0} } → 1.

Moreover, using the result from Proposition 1, we have P{ df_{λ_n} = d_{α_0} } → 1. Consequently,

    P{ GIC_{κ_n}(λ_n) = GIC*_{κ_n}(α_0) }
      = P{ (1/n)(D(λ_n) − D*(α_0)) + (κ_n/n)(df_{λ_n} − d_{α_0}) = 0 } → 1.

The proof of (A.6) is complete.

We next show that, for any λ which cannot identify the true model, the resulting GIC_{κ_n}(λ) is consistently larger than GIC_{κ_n}(λ_n). To this end, we consider two cases, underfitting and overfitting.

Case 1: Underfitted model (i.e., λ ∈ Ω_− so that α_λ ⊉ α_0). Applying (A.3) and (A.6), we obtain that, with probability tending to 1,

    GIC_{κ_n}(λ) − GIC_{κ_n}(λ_n) > (1/n) D*(α_λ) − (1/n) D*(α_0) − (κ_n/n) d_{α_0}.

Subsequently, take inf_{λ∈Ω_−} of both sides of the above equation, which yields

    inf_{λ∈Ω_−} GIC_{κ_n}(λ) − GIC_{κ_n}(λ_n) > min_{α⊉α_0} { (1/n) D*(α) − (1/n) D*(α_0) − (κ_n/n) d_{α_0} }.    (A.9)

This, in conjunction with Condition (C4) and the number of underfitted models being finite, leads to

    P{ min_{α⊉α_0} [ (1/n) D*(α) − (1/n) D*(α_0) − (κ_n/n) d_{α_0} ] > 0 }
      = P{ min_{α⊉α_0} [ c_α − c_{α_0} + o_P(1) ] > 0 } → 1,

as n → ∞. As a result,

    P{ inf_{λ∈Ω_−} GIC_{κ_n}(λ) > GIC_{κ_n}(λ_n) } → 1.

Case 2: Overfitted model (i.e., λ ∈ Ω_+ so that α_λ ⊋ α_0). According to Lemma 1, with probability tending to 1, D(λ_n) = D*(α_0). In addition, Proposition 1 indicates that df_λ − df_{λ_n} > τ + o_P(1) for some τ > 0 and λ ∈ Ω_+. Moreover, as noticed in the proof of Lemma 1, D(λ) ≥ D*(α_λ). Therefore, with probability tending to 1,

    n [ GIC_{κ_n}(λ) − GIC_{κ_n}(λ_n) ]
      = D(λ) − D(λ_n) + (df_λ − df_{λ_n}) κ_n
      ≥ D*(α_λ) − D*(α_0) + (τ + o_P(1)) κ_n.

Because D*(α_0) − D*(α) follows a χ² distribution with d_α − d_{α_0} degrees of freedom asymptotically for any α ⊋ α_0, it is of order O_P(1). Accordingly,

    inf_{λ∈Ω_+} n [ GIC_{κ_n}(λ) − GIC_{κ_n}(λ_n) ]
      ≥ min_{α⊋α_0} { D*(α_0) − D*(α) } + (τ + o_P(1)) κ_n
      ≈ τ κ_n.    (A.10)

Using the fact that κ_n goes to infinity as n → ∞, the right-hand side of Equation (A.10) goes to positive infinity, which guarantees that the left-hand side of Equation (A.10) is positive as n → ∞. Hence, we finally have

    P{ inf_{λ∈Ω_+} GIC_{κ_n}(λ) > GIC_{κ_n}(λ_n) } → 1.

The results of Cases 1 and 2 complete the proof of (A.7).

Proofs of Theorem 1

Lemma 1 implies that GIC_{κ_n}(λ), which produces the underfitted model, is consistently larger than GIC*_{κ_n}(ᾱ). Thus, the optimal model selected by minimizing GIC_{κ_n}(λ) must contain all of the significant variables with probability tending to one. In addition, Lemma 1 indicates that there is a nonzero probability that the smallest value of GIC_{κ_n}(λ) for λ ∈ Ω_0 is larger than that of the full model. As a result, there is a positive probability that any λ associated with the true model cannot be selected by GIC_{κ_n}(λ) as the regularization parameter. Theorem 1(A) follows.

Lemma 2 indicates that the model identified by λ_n converges to the true model as the sample size gets large. In addition, it shows that λ's that fail to identify the true model cannot be selected by GIC_{κ_n}(λ) asymptotically. Theorem 1(B) follows.

APPENDIX B: PROOFS OF THEOREM 2 AND COROLLARY 1

Before proving the theorem, we establish the following two lemmas. Lemma 3 evaluates the difference between a penalized mean estimator μ̂_λ and its corresponding least squares mean estimator μ̂*_{α_λ}, while Lemma 4 demonstrates that the losses of μ̂_λ and μ̂*_{α_λ} are asymptotically equivalent.

Lemma 3. Under Condition (C5),

    ||μ̂_λ − μ̂*_{α_λ}||² ≤ nC||b||²,

where C is the constant number in Condition (C5) and b is defined in Condition (C8).

Proof. Without loss of generality, we assume that the first d_{α_λ} components of β̂_λ and β̂*_{α_λ} are nonzero, and denote them by β̂_λ^(1) and β̂*_{α_λ}^(1), respectively. Thus, μ̂_λ = Xβ̂_λ = X_{α_λ} β̂_λ^(1) and μ̂*_{α_λ} = Xβ̂*_{α_λ} = X_{α_λ} β̂*_{α_λ}^(1). From the proofs of theorems 1 and 2 in Fan and Li (2001), with probability tending to 1, we have that β̂_λ^(1) is the solution of the following equation:

    (1/n) X_{α_λ}^T (y − X_{α_λ} β_λ^(1)) + b^(1) = 0,

where b^(1) is the subvector of b that corresponds to β̂_λ^(1). Accordingly,

    β̂_λ^(1) = (X_{α_λ}^T X_{α_λ})^{-1} X_{α_λ}^T y + ((1/n) X_{α_λ}^T X_{α_λ})^{-1} b^(1)
            = β̂*_{α_λ}^(1) + V_{α_λ} b^(1),

where V_{α_λ} = ((1/n) X_{α_λ}^T X_{α_λ})^{-1}. In addition, the eigenvalues of V_{α_λ} are bounded under Condition (C5). Hence,

    ||μ̂_λ − μ̂*_{α_λ}||² = ||X_{α_λ}(β̂_λ^(1) − β̂*_{α_λ}^(1))||² = n b^{(1)T} V_{α_λ} b^(1) ≤ nC||b||²

for some positive constant number C. This completes the proof.

Lemma 4. If Conditions (C5)–(C8) hold, then

    sup_{λ∈[0,λ_max]} | L(β̂_λ)/L(β̂*_{α_λ}) − 1 | → 0,

in probability.

Proof. After algebraic simplification, we have

    L(β̂_λ) − L(β̂*_{α_λ}) = ||μ̂*_{α_λ} − μ̂_λ||²/n + 2(μ − μ̂*_{α_λ})^T (μ̂*_{α_λ} − μ̂_λ)/n = I_1 + I_2.

Under Conditions (C6) and (C7), Li (1987) showed that

    sup_{α∈A} | L(β̂*_α)/R(β̂*_α) − 1 | → 0.

This, together with Condition (C8) and Lemma 3, implies

    sup_{λ∈[0,λ_max]} | I_1/L(β̂*_{α_λ}) |
      = sup_{λ∈[0,λ_max]} | ||μ̂*_{α_λ} − μ̂_λ||²/(n R(β̂*_{α_λ}))
        − [ ||μ̂*_{α_λ} − μ̂_λ||²/(n L(β̂*_{α_λ})) ] [ L(β̂*_{α_λ})/R(β̂*_{α_λ}) − 1 ] | → 0.

Applying the Cauchy–Schwarz inequality, we next obtain

    |I_2| ≤ 2 ||μ − μ̂*_{α_λ}|| · ||μ̂*_{α_λ} − μ̂_λ|| / n
         = 2 √{L(β̂*_{α_λ})} · (1/√n) ||μ̂*_{α_λ} − μ̂_λ||.

As a result, sup_{λ∈[0,λ_max]} |I_2/L(β̂*_{α_λ})| → 0, and Lemma 4 follows immediately.

Proof of Theorem 2

To show the asymptotic efficiency of the AIC-type selector, it suffices to demonstrate that minimizing GIC^LS_{κ_n}(λ) with κ_n → 2 is the same as minimizing L(β̂_λ) asymptotically. To this end, we need to prove that, in probability,

    sup_{λ∈[0,λ_max]} | [ GIC^LS_{κ_n}(λ) − ||ε||²/n − L(β̂_λ) ] / L(β̂_λ) | → 0.    (B.1)

Let the projection matrix corresponding to the model α be H_α = X_α(X_α^T X_α)^{-1} X_α^T. Then,

    GIC^LS_{κ_n}(λ) = ||y − Xβ̂_λ||²/n + κ_n σ² d_{α_λ}/n
      = ||y − μ̂*_{α_λ}||²/n + ||μ̂*_{α_λ} − μ̂_λ||²/n + κ_n σ² d_{α_λ}/n
      = ||ε||²/n + L(β̂_λ) + { L(β̂*_{α_λ}) − L(β̂_λ) } + (1/n)||μ̂*_{α_λ} − μ̂_λ||²
        + (2/n) ε^T (I − H_{α_λ}) μ + (2/n){ σ² d_{α_λ} − ε^T H_{α_λ} ε } + (1/n)(κ_n − 2) σ² d_{α_λ}.    (B.2)

Let J_1 = L(β̂*_{α_λ}) − L(β̂_λ), J_2 = ||μ̂*_{α_λ} − μ̂_λ||²/n, J_3 = 2ε^T(I − H_{α_λ})μ/n, J_4 = 2(σ² d_{α_λ} − ε^T H_{α_λ} ε)/n, and J_5 = (κ_n − 2) σ² d_{α_λ}/n. Using Lemma 3, Lemma 4, and similar arguments used in Li (1987), we obtain that, in probability,

    sup_{λ∈[0,λ_max]} | J_j / L(β̂_λ) | → 0  for j = 1, ..., 4.

Because κ_n → 2 and the fact that n^{-1} σ² d_{α_λ}/R(β̂_λ) is bounded by 1, we can further show that sup_{λ∈[0,λ_max]} |J_5/L(β̂_λ)| → 0 using Lemma 4 and Condition (C7). Accordingly, (B.1) holds, which implies that the difference between GIC^LS_{κ_n}(λ) − ||ε||²/n and L(β̂_λ) is negligible in comparison to L(β̂_λ). This completes the proof.

Proof of Corollary 1

When σ² is unknown, the GIC^LS_{κ_n}(λ) in (B.2) becomes

    GIC^LS_{κ_n}(λ) = ||ε||²/n + L(β̂_λ) + J_1 + J_2 + J_3 + J_4 + J_5 + 2(σ̃² − σ²) d_{α_λ}/n.

Using Lemma 4 and Condition (C7), we have

    sup_{λ∈[0,λ_max]} | 2(σ̃² − σ²) d_{α_λ} / (n L(β̂_λ)) | → 0

in probability, which completes the proof.

APPENDIX C: SUFFICIENT CONDITIONS FOR (C8)

We first present three sufficient conditions for (C8).

(S1) There exists a constant M_1 such that λ_max satisfies √n λ_max < M_1 for all n.
(S2) There exists a constant M_2 such that p_λ(θ) satisfies p'_λ(θ) ≤ M_2 λ for any θ.
(S3) The average error of the full model ᾱ, that is, Δ_ᾱ = ||μ − H_ᾱ μ||²/n, satisfies nΔ_ᾱ/d → ∞, as n → ∞.

We next provide motivations for these conditions. Assume that the true model is μ_i = x_i^T β_0 = Σ_{j=1}^{d} x_{ij} β_{j0} for i = 1, ..., n, and then write a_n = max_{1≤j≤d} {p'_{λ_n}(|β_{j0}|), β_{j0} ≠ 0}. Under the condition d = O(n^ν) with ν < 1/4, Fan and Peng (2004) proved that there exists a local maximum of the penalized least squares function, so that ||β̂ − β_0|| = O_P{√d (n^{-1/2} + a_n)}. Thus, Conditions (S1) and (S2) ensure that the penalized estimator is √(n/d)-consistent. As for Condition (S3), consider a case where the full model misses an important variable x_exc that is orthogonal to the rest of the variables. Accordingly, Δ_ᾱ = n^{-1} Σ_{i=1}^{n} x²_{exc;i} β²_exc, and hence nΔ_ᾱ/d is of the same order as nβ²_exc/d. Consequently, (S3) holds if the coefficient β_exc satisfies √(n/d) β_exc → ∞, which is valid, for example, when β_exc is fixed.

In contrast to imposing Condition (S3) on the error of the full model, an alternative Condition (S3*) motivated by Fan and Peng (2004) is given below.

(S3*) The proportion of penalized coefficients satisfies

    sup_{λ∈[0,λ_max]} (1/d_{α_λ}) Σ_{j=1}^{d} I(0 < |β̂_{λj}| ≤ mλ) → 0, as n → ∞,

where m is defined in (C2).

Condition (S3*) means that the proportion of small or moderate size coefficients (0 < √n|β̂_{λj}| ≤ mM_1) is vanishing. This condition is satisfied under the identifiability assumption of Fan and Peng (2004) (i.e., min_{j:β_{0j}≠0} |β_{0j}|/λ → ∞, as n → ∞). As a byproduct, a similar condition can also be established for the adaptive LASSO, which employs data-driven penalties with b_j = λ/(β̂_{ᾱ j})^k for some k > 0 to cope with the bias of LASSO. In sum, (S1) to (S3) (or S3*) are mild conditions, and the two propositions given below demonstrate their sufficiency for Condition (C8).

Proposition 2. Under (S1) to (S3), Condition (C8) holds.

Proof. It is noteworthy that

    R(β̂*_{α_λ}) = Δ_{α_λ} + σ² d_{α_λ}/n ≥ Δ_ᾱ + σ² d_{α_λ}/n.    (C.1)

On the right-hand side of (C.1), the first term dominates the second term via Condition (S3). From (S1), (S2), and (C.1), we have that

    ||b||² / R(β̂*_{α_λ}) ≤ M_2² λ² d_{α_λ} / R(β̂*_{α_λ}) ≤ M_2² · nλ²_max · 1/(nΔ_ᾱ/d),

which goes to zero independently of λ. This completes the proof.

Proposition 3. Assume that the penalty function satisfies Condition (C2). Under (S1) to (S3*), Condition (C8) holds.

Proof. Under Condition (C2), the components in b are zero except for those with β̂_{λj} ≤ mλ. Then, employing (S2) and (C.1), we obtain that

    ||b||² / R(β̂*_{α_λ}) ≤ ||b||² / (n^{-1} σ² d_{α_λ}) ≤ (M_2²/σ²) · nλ²_max · (1/d_{α_λ}) Σ_{j=1}^{d} I(0 < |β̂_{λj}| ≤ mλ).

On the right-hand side of the above equation, the first term is constant, the second term is bounded under (S1), and the third term goes to zero uniformly in λ under (S3*). This completes the proof.

SUPPLEMENTAL MATERIALS

Technical details for Remark 5: Detailed discussions, justifications, and proofs referenced in Remark 5. (suplv_rz_t.pdf)

[Received January 2008. Revised October 2009.]

REFERENCES

Akaike, H. (1974), "A New Look at the Statistical Model Identification," IEEE Transactions on Automatic Control, 19, 716–723.
Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions," Numerische Mathematik, 31, 377–403.
Efron, B., Hastie, T., Johnstone, I. M., and Tibshirani, R. (2004), "Least Angle Regression," The Annals of Statistics, 32, 407–499.
Elter, M., Schulz-Wendtland, R., and Wittenberg, T. (2007), "The Prediction of Breast Cancer Biopsy Outcomes Using Two CAD Approaches That Both Emphasize an Intelligible Decision Process," Medical Physics, 34, 4164–4172.
Fan, J., and Li, R. (2001), "Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties," Journal of the American Statistical Association, 96, 1348–1360.
Fan, J., and Li, R. (2002), "Variable Selection for Cox's Proportional Hazards Model and Frailty Model," The Annals of Statistics, 30, 74–99.
Fan, J., and Li, R. (2004), "New Estimation and Model Selection Procedures for Semiparametric Modeling in Longitudinal Data Analysis," Journal of the American Statistical Association, 99, 710–723.
Fan, J., and Peng, H. (2004), "Nonconcave Penalized Likelihood With a Diverging Number of Parameters," The Annals of Statistics, 32, 928–961.
Frank, I. E., and Friedman, J. H. (1993), "A Statistical View of Some Chemometrics Regression Tools," Technometrics, 35, 109–148.
Hunter, D., and Li, R. (2005), "Variable Selection Using MM Algorithms," The Annals of Statistics, 33, 1617–1642.
Konishi, S., and Kitagawa, G. (1996), "Generalised Information Criteria in Model Selection," Biometrika, 83, 875–890.
Li, K.-C. (1987), "Asymptotic Optimality for Cp, CL, Cross-Validation and Generalized Cross-Validation: Discrete Index Set," The Annals of Statistics, 15, 958–975.
Li, R., and Liang, H. (2008), "Variable Selection in Semiparametric Regression Modeling," The Annals of Statistics, 36, 261–286.
McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models, New York: Chapman & Hall/CRC.
McQuarrie, A. D. R., and Tsai, C.-L. (1998), Regression and Time Series Model Selection (1st ed.), Singapore: World Scientific Publishing Co., Pte. Ltd.
Nishii, R. (1984), "Asymptotic Properties of Criteria for Selection of Variables in Multiple Regression," The Annals of Statistics, 12, 758–765.
Park, M.-Y., and Hastie, T. (2007), "An L1 Regularization-Path Algorithm for Generalized Linear Models," Journal of the Royal Statistical Society, Ser. B, 69, 659–677.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.
Shao, J. (1997), "An Asymptotic Theory for Linear Model Selection," Statistica Sinica, 7, 221–264.
Shibata, R. (1980), "Asymptotically Efficient Selection of the Order of the Model for Estimating Parameters of a Linear Process," The Annals of Statistics, 8, 147–164.
Shibata, R. (1981), "An Optimal Selection of Regression Variables," Biometrika, 68, 45–54.
Shibata, R. (1984), "Approximation Efficiency of a Selection Procedure for the Number of Regression Variables," Biometrika, 71, 43–49.
Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58, 267–288.
Wang, H., and Leng, C. (2007), "Unified LASSO Estimation via Least Squares Approximation," Journal of the American Statistical Association, 102, 1039–1048.
Wang, H., Li, G., and Tsai, C.-L. (2007a), "Regression Coefficient and Autoregressive Order Shrinkage and Selection via LASSO," Journal of the Royal Statistical Society, Ser. B, 69, 63–78.
Wang, H., Li, R., and Tsai, C.-L. (2007b), "Tuning Parameter Selectors for the Smoothly Clipped Absolute Deviation Method," Biometrika, 94, 553–568.
Yang, Y. (2005), "Can the Strengths of AIC and BIC Be Shared? A Conflict Between Model Identification and Regression Estimation," Biometrika, 92, 937–950.
Yuan, M., and Lin, Y. (2007), "On the Non-Negative Garrotte Estimator," Journal of the Royal Statistical Society, Ser. B, 69, 143–161.
Zhang, C.-H. (2007), "Penalized Linear Unbiased Selection," Technical Report 2007-003, Rutgers University, Dept. of Statistics.
Zhang, H. H., and Lu, W. (2007), "Adaptive LASSO for Cox's Proportional Hazards Model," Biometrika, 94, 691–703.
Zou, H. (2006), "The Adaptive LASSO and Its Oracle Properties," Journal of the American Statistical Association, 101, 1418–1429.
Zou, H., and Li, R. (2008), "One-Step Sparse Estimates in Nonconcave Penalized Likelihood Models," The Annals of Statistics, 36, 1509–1533.
Zou, H., Hastie, T., and Tibshirani, R. (2007), "On the Degrees of Freedom of the LASSO," The Annals of Statistics, 35, 2173–2192.