On the “Degrees of Freedom” of the Lasso

Hui Zou Trevor Hastie† Robert Tibshirani‡

Abstract. We study the degrees of freedom of the Lasso in the framework of Stein's unbiased risk estimation (SURE). We show that the number of non-zero coefficients is an unbiased estimate for the degrees of freedom of the Lasso—a conclusion that requires no special assumption on the predictors. Our analysis also provides mathematical support for a related conjecture by Efron et al. (2004). As an application, various model selection criteria—Cp, AIC and BIC—are defined, which, along with the LARS algorithm, provide a principled and efficient approach to obtaining the optimal Lasso fit with the computational effort of a single ordinary least-squares fit. We propose the use of BIC-Lasso if the Lasso is primarily used as a variable selection method.

Department of Statistics, Stanford University, Stanford, CA 94305. Email: [email protected]. †Department of Statistics and Department of Health Research & Policy, Stanford University, Stanford, CA 94305. Email: [email protected]. ‡Department of Health Research & Policy and Department of Statistics, Stanford University, Stanford, CA 94305. Email: [email protected].

1 Introduction

Modern data sets typically have a large number of observations and predictors. A typical goal in model fitting is to achieve good prediction accuracy with a sparse representation of the predictors in the model. The Lasso is a promising automatic model-building technique, simultaneously producing accurate and parsimonious models (Tibshirani 1996). Suppose y = (y₁, . . . , yₙ)ᵀ is the response vector and x_j = (x_{1j}, . . . , x_{nj})ᵀ, j = 1, . . . , p, are the linearly independent predictors. Let X = [x₁, . . . , x_p] be the predictor matrix. The Lasso estimates for the coefficients of a linear model solve

β̂ = arg min_β ‖y − Σ_{j=1}^p x_j β_j‖²  subject to  Σ_{j=1}^p |β_j| ≤ t,    (1)

or equivalently

β̂ = arg min_β ‖y − Σ_{j=1}^p x_j β_j‖² + λ Σ_{j=1}^p |β_j|,    (2)

where λ is a non-negative regularization parameter. Without loss of generality we assume that the data are centered, so the intercept is not included in the above model. There is a one-one correspondence (generally depending on the data) between t and λ making the optimization problems in (1) and (2) equivalent. The second term in (2) is called the ℓ₁-norm penalty, and λ is called the lasso regularization parameter. Since the Loss + Penalty formulation is common in the statistical community, we use the representation (2) throughout this paper.

Figure 1 displays the Lasso estimates as a function of λ using the diabetes data (Efron et al. 2004). As can be seen from Figure 1 (the left plot), the Lasso continuously shrinks the coefficients toward zero as λ increases, and some coefficients are shrunk to exactly zero if λ is sufficiently large. In addition, the shrinkage often improves the prediction accuracy due to the bias-variance trade-off. Thus the Lasso simultaneously achieves accuracy and sparsity.

Generally speaking, the purpose of regularization is to control the complexity of the fitted model (Hastie et al. 2001). The least regularized Lasso (λ = 0) corresponds to Ordinary Least Squares (OLS), while the most regularized Lasso uses λ = ∞, yielding a constant fit.
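The penalized criterion (2) can be reproduced with any off-the-shelf coordinate-descent Lasso solver. The sketch below is only an illustration on made-up data (not the diabetes study): scikit-learn's Lasso minimizes (1/(2n))‖y − Xβ‖² + α Σ_j |β_j|, so α = λ/(2n) corresponds to the λ of (2).

```python
# Sketch: solving criterion (2) with an off-the-shelf solver.
# scikit-learn's Lasso minimizes (1/(2n))||y - X b||^2 + alpha * sum|b_j|,
# so alpha = lambda/(2n) matches (2). Data and lambda are made up.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.r_[3.0, 1.5, 0.0, 0.0, 2.0, np.zeros(p - 5)]
y = X @ beta_true + rng.standard_normal(n)
X = X - X.mean(axis=0)      # center predictors (no intercept in (2))
y = y - y.mean()            # center response

lam = 50.0                  # lambda in criterion (2)
fit = Lasso(alpha=lam / (2 * n), fit_intercept=False,
            tol=1e-10, max_iter=100_000).fit(X, y)
beta_hat = fit.coef_
print("non-zero coefficients:", int(np.sum(beta_hat != 0)))
```

A quick sanity check on any solver output: the Karush-Kuhn-Tucker conditions of (2) require ‖2Xᵀ(y − Xβ̂)‖_∞ ≤ λ, with equality on the active coordinates.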
So the model complexity is reduced via shrinkage. However, the effect of the Lasso shrinkage is not very clear except for these two extreme cases. An informative measure of model complexity is the effective degrees of freedom (Hastie & Tibshirani 1990). The profile of degrees of freedom clearly shows how the model complexity is controlled by shrinkage. The degrees of freedom also play an important role in estimating the prediction accuracy of the fitted model, which helps us pick an optimal model among all the possible candidates, e.g. the optimal choice of λ in the Lasso. Thus it is desirable to know the degrees of freedom of the Lasso for a given regularization parameter λ, or df(λ). This is an interesting problem of both theoretical and practical importance.

[Figure 1 about here: two panels, "Lasso as a shrinkage method" (standardized coefficients versus log(1 + λ)) and "Unbiased estimate for df(λ)" (d̂f(λ) versus log(1 + λ)).]

Figure 1: Diabetes data with 10 predictors. The left panel shows the Lasso coefficient estimates β̂_j, j = 1, 2, . . . , 10, for the diabetes study. The diabetes data were standardized. The Lasso coefficient estimates are piece-wise linear functions of λ (Efron et al. 2004); hence they are piece-wise non-linear as functions of log(1 + λ). The right panel shows the curve of the proposed unbiased estimate for the degrees of freedom of the Lasso, whose piece-wise constant property is basically determined by the piece-wise linearity of β̂.

Degrees of freedom are well studied for linear procedures. For example, the degrees of freedom in multiple linear regression exactly equals the number of predictors. A generalization is made for all linear smoothers (Hastie & Tibshirani 1990), where the fitted vector is written as ŷ = Sy and the smoother matrix S is free of y. Then df(S) = tr(S) (see Section 2). A leading example is ridge regression (Hoerl & Kennard 1988) with S = X(XᵀX + λI)⁻¹Xᵀ. These results rely on the convenient expressions for representing linear smoothers. Unfortunately, an explicit expression for the Lasso fit is not available (at least so far) due to the nonlinear nature of the Lasso, thus the nice results for linear smoothers are not directly applicable.

Efron et al. (2004) (referred to as the LAR paper henceforth) propose Least Angle Regression (LARS), a new stage-wise model building algorithm. They show that a simple modification of LARS yields the entire Lasso solution path with the computational cost of a single OLS fit. LARS describes the Lasso as a forward stage-wise model fitting process. Starting at zero, the Lasso fits are sequentially updated until reaching the OLS fit, while being piece-wise linear between successive steps. The updates follow the current equiangular direction. Figure 2 shows how the Lasso estimates evolve step by step.

From the forward stage-wise point of view, it is natural to consider the number of steps as the meta parameter to control the model complexity. In the LAR paper, it is shown that under a "positive cone" condition, the degrees of freedom of LARS equals the number of

steps, i.e., df(μ̂_k) = k, where μ̂_k is the fit at step k. Since the Lasso and LARS coincide under the positive cone condition, the remarkable formula also holds for the Lasso. Under general situations df(μ̂_k) is still well approximated by k for LARS. However, this simple approximation cannot be true in general for the Lasso, because the total number of Lasso steps can exceed the number of predictors. This usually happens when some variables are temporarily dropped (coefficients cross zero) during the LARS process and are eventually included in the full OLS model. For instance, the LARS algorithm takes 12 Lasso steps to reach the OLS fit as shown in Figure 2, but the number of predictors is 10. For the degrees of freedom of the Lasso under general conditions, Efron et al. (2004) presented the following conjecture.

Conjecture 1 (EHJT04). Starting at step 0, let m_k be the index of the last model in the Lasso sequence containing k predictors. Then df(μ̂_{m_k}) ≈ k.

In this paper we study the degrees of freedom of the Lasso using Stein's unbiased risk estimation (SURE) theory (Stein 1981). The Lasso exhibits both backward penalization and forward growth pictures, which consequently induces two different ways to describe its degrees of freedom. With the representation (2), we show that for any given λ the number of non-zero predictors in the model is an unbiased estimate for the degrees of freedom, and no special assumption on the predictors is required, e.g. the positive cone condition. The right panel in Figure 1 displays the unbiased estimate for the degrees of freedom as a function of λ on the diabetes data (with 10 predictors). If the Lasso is viewed as a forward stage-wise process, our analysis provides mathematical support for the above conjecture.

The rest of the paper is organized as follows. We first briefly review the SURE theory in Section 2. Main results and proofs are presented in Section 3. In Section 4, criteria are constructed using the degrees of freedom to adaptively select the optimal Lasso fit. We address the difference between two types of optimality: adaptive in prediction and

[Figure 2 about here: "Lasso as a forward stage-wise modeling algorithm"; the LARS/Lasso coefficient sequence is plotted against Step (0 to 12), with Standardized Coefficients on the vertical axis and the number of non-zero coefficients (1 to 10) displayed along the top.]

Figure 2: Diabetes data with 10 predictors: the growth paths of the Lasso coefficient estimates as the LARS algorithm moves forward. At the top of the graph, we display the number of non-zero coefficients at each step.

adaptive in variable selection. Discussions are in Section 5. Proofs of lemmas are presented in the appendix.

2 Stein’s Unbiased Risk Estimation

We begin with a brief introduction to Stein's unbiased risk estimation (SURE) theory (Stein 1981), which is the foundation of our analysis. The reader is referred to Efron (2004) for detailed discussions and recent references on SURE.

Given a model fitting method δ, let μ̂ = δ(y) represent its fit. We assume a homoskedastic model, i.e., given the x's, y is generated according to

y ~ (μ, σ²I),    (3)

where μ is the true mean vector and σ² is the common variance. The focus is how accurate μ̂ can be in predicting future data. Suppose y^new is a new response vector generated from (3); then under the squared-error loss, the prediction risk is E{‖μ̂ − y^new‖²}/n. Efron (2004) shows that

E{‖μ̂ − μ‖²} = E{‖y − μ̂‖²} − nσ² + 2 Σ_{i=1}^n cov(μ̂_i, y_i).    (4)

The last term of (4) is called the optimism of the estimator μ̂ (Efron 1986). Identity (4) also gives a natural definition of the degrees of freedom for an estimator μ̂ = δ(y),

df(μ̂) = Σ_{i=1}^n cov(μ̂_i, y_i)/σ².    (5)

If δ is a linear smoother, i.e., μ̂ = Sy for some matrix S independent of y, then it is easy to verify that, since cov(μ̂, y) = σ²S, df(μ̂) = tr(S), which coincides with the definition given by Hastie & Tibshirani (1990). By (4) we obtain

E{‖μ̂ − y^new‖²} = E{‖y − μ̂‖²} + 2 df(μ̂) σ².    (6)

Thus we can define a Cp-type statistic

C_p(μ̂) = ‖y − μ̂‖²/n + (2 df(μ̂)/n) σ²,    (7)

which is an unbiased estimator of the true prediction error. When σ² is unknown, it is replaced with an unbiased estimate.

Stein proved an extremely useful formula to simplify (5), which is often referred to as Stein's Lemma (Stein 1981). According to Stein, a function g : ℝⁿ → ℝ is said to be almost differentiable if there is a function f : ℝⁿ → ℝⁿ such that

g(x + u) − g(x) = ∫₀¹ uᵀ f(x + tu) dt    (8)

for a.e. x ∈ ℝⁿ and each u ∈ ℝⁿ.

Lemma 1 (Stein's Lemma). Suppose that μ̂ : ℝⁿ → ℝⁿ is almost differentiable and denote ∇·μ̂ = Σ_{i=1}^n ∂μ̂_i/∂y_i. If y ~ N_n(μ, σ²I), then

Σ_{i=1}^n cov(μ̂_i, y_i)/σ² = E[∇·μ̂].    (9)

In many applications ∇·μ̂ is shown to be a constant; for example, with μ̂ = Sy, ∇·μ̂ = tr(S), and the degrees of freedom is easily obtained. Even if ∇·μ̂ depends on y, Stein's Lemma says that

d̂f(μ̂) = ∇·μ̂    (10)

is an unbiased estimate for the degrees of freedom df(μ̂). In the spirit of SURE, we can use

Ĉ_p(μ̂) = ‖y − μ̂‖²/n + (2 d̂f(μ̂)/n) σ²    (11)

as an unbiased estimate for the true risk. It is worth mentioning that in some situations verifying the almost differentiability of μ̂ is not easy.

Even though Stein's Lemma assumes normality, the essence of (9) only requires homoskedasticity (3) and the almost differentiability of μ̂; its justification can be made by a "delta method" argument (Efron et al. 2004). After all, df(μ̂) is about the self-influence of y on the fit, and ∇·μ̂ is a natural candidate for that purpose. Meyer & Woodroofe (2000) discussed the degrees of freedom in shape-restricted regression and argued that the divergence formula (10) provides a measure of the effective dimension.
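Definition (5) and the linear-smoother identity df = tr(S) can be checked numerically. The sketch below (all sizes and the ridge parameter are made-up choices) estimates Σ_i cov(μ̂_i, y_i)/σ² by Monte Carlo for ridge regression and compares the result with tr(S).

```python
# Sketch: Monte Carlo check of definition (5) for a linear smoother.
# Ridge regression: mu_hat = S y with S = X (X'X + lam I)^{-1} X',
# so df(mu_hat) should equal tr(S).
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, sigma = 30, 5, 2.0, 1.0
X = rng.standard_normal((n, p))
mu = X @ rng.standard_normal(p)        # true mean vector in (3)
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

B = 5000
cov_sum = 0.0
for _ in range(B):
    y = mu + sigma * rng.standard_normal(n)
    mu_hat = S @ y
    cov_sum += np.sum((mu_hat - S @ mu) * (y - mu))   # centered products
df_mc = cov_sum / (B * sigma**2)
print(df_mc, np.trace(S))              # the two numbers should be close
```

The same recipe applies to any fitting method δ, which is exactly how the Lasso simulations of Section 3 proceed.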

3 Main Theorems

We adopt the SURE framework with the Lasso fit. Let μ̂_λ be the Lasso fit using the representation (2). Similarly, let μ̂_m be the Lasso fit at step m of the LARS algorithm. For convenience, we also let df(λ) and df(m) stand for df(μ̂_λ) and df(μ̂_m), respectively.

The following matrix representation of Stein's Lemma is helpful. Let ∂μ̂/∂y be the n × n matrix whose elements are

(∂μ̂/∂y)_{i,j} = ∂μ̂_i/∂y_j,  i, j = 1, 2, . . . , n.    (12)

Then we can write

∇·μ̂ = tr(∂μ̂/∂y).    (13)

Suppose M is a matrix with p columns. Let 𝒮 be a subset of the indices {1, 2, . . . , p}. Denote by M_𝒮 the sub-matrix

M_𝒮 = [· · · m_j · · ·]_{j∈𝒮},    (14)

where m_j is the j-th column of M. Similarly, define β_𝒮 = (· · · β_j · · ·)_{j∈𝒮} for any vector β of length p. Let Sgn(·) be the sign function: Sgn(x) = 1 if x > 0, Sgn(x) = 0 if x = 0, and Sgn(x) = −1 if x < 0.

3.1 Results and data examples

Our results are stated as follows. Denote the set of non-zero elements of β̂_λ as B(λ); then

df(λ) = E[|B|],    (15)

where |B| means the size of B(λ). Hence d̂f(λ) = |B| is an unbiased estimate for df(λ). The identity (15) holds for all X, requiring no special assumption.

We also provide mathematical support for the conjecture in Section 1. Actually we argue that if m_k is a Lasso step containing k non-zero predictors, then d̂f(m_k) = k is a good estimate for df(m_k). Note that m_k is not necessarily the last Lasso step containing k non-zero predictors, so the result includes the conjecture as a special case. However, we show in Section 4 that the last-step choice is superior for Lasso model selection. We let m_k^last and m_k^first denote the last and the first Lasso steps containing exactly k non-zero predictors, respectively.

Before delving into the details of the theoretical analysis, we check the validity of our arguments by a simulation study. Here is the outline of the simulation. We take the 64 predictors in the diabetes data, which include the quadratic terms and interactions of the original 10 predictors. The positive cone condition is violated on the 64 predictors (Efron et al. 2004). The response vector y was used to fit an OLS model. We computed the OLS estimates β̂_ols and σ̂²_ols. Then we considered a synthetic model

y = Xβ + σ ε,  ε ~ N(0, I),    (16)

where β = β̂_ols and σ = σ̂_ols. Given the synthetic model, the degrees of freedom of the Lasso (both df(λ) and df(m_k)) can be numerically evaluated by Monte Carlo methods. For b = 1, 2, . . . , B, we independently simulated y(b) from (16). For a given λ, by the definition of df(λ), we need to evaluate

cov_i = E[ (μ̂_{λ,i} − E[μ̂_{λ,i}]) (y_i − (Xβ)_i) ].    (17)

Then df(λ) = Σ_{i=1}^n cov_i / σ². Since E[y_i] = (Xβ)_i, note that

cov_i = E[ (μ̂_{λ,i} − a_i) (y_i − (Xβ)_i) ]    (18)

for any fixed known constant a_i. Then we compute

ĉov_i = (1/B) Σ_{b=1}^B ( μ̂_{λ,i}(b) − a_i ) ( y_i(b) − (Xβ)_i )    (19)

and d̂f(λ) = Σ_{i=1}^n ĉov_i / σ². Typically a_i = 0 is used in Monte Carlo calculations. In this work we use a_i = (Xβ)_i, for it gives a Monte Carlo estimate for df(λ) with smaller variance than that by a_i = 0. On the other hand, for a fixed λ, each y(b) gave the Lasso fit μ̂_λ(b) and the df estimate d̂f(λ)_b. Then we evaluated E[|B|] by Σ_{b=1}^B d̂f(λ)_b / B. Similarly, we computed d̂f(m_k) by replacing μ̂_λ(b) with μ̂_{m_k}(b). We are interested in E[|B|] − df(λ) and k − df(m_k). Standard errors were calculated based on the B replications.

Figure 3 is a very convincing picture for the identity (15). Figure 4 shows that df(m_k) is well approximated by k even when the positive cone condition fails. The simple approximation works pretty well for both m_k^last and m_k^first. In Figure 4, it appears that k − df(m_k) is not exactly zero for some k. We would like to check whether the bias is real; and if the bias is real, we would like to explore the relation between the bias k − df(m_k) and the signal/noise ratio. In the synthetic model (16) the signal/noise ratio Var(Xβ̂_ols)/σ̂²_ols is about 1.25. We repeated the same simulation procedure with (β = 0, σ = 1) and (β = β̂_ols, σ = σ̂_ols/10) in the synthetic model. The corresponding signal/noise ratios are zero and 125, respectively. As shown clearly in Figure 5, the bias k − df(m_k) is truly non-zero for some k. Thus the positive cone condition seems to be sufficient and necessary for turning the approximation into an exact result. However, even if the bias exists, its maximum magnitude is less than one, regardless of the size of the signal/noise ratio. So k is a very good estimate for df(m_k). An interesting observation is that k tends to underestimate df(m_k^last) and overestimate df(m_k^first). In addition, we observe that k − df(m_k^last) ≈ df(m_k^first) − k.

3.2 Theorems on df(λ)

Let B = {j : Sgn(β_j) ≠ 0} be the active set of β, where Sgn(β) is the sign vector of β given by Sgn(β)_j = Sgn(β_j). We denote the active set of β̂(λ) as B(λ) and the corresponding sign vector Sgn(β̂(λ)) as Sgn(λ). We do not distinguish between the index of a predictor and the predictor itself.

Firstly, let us review some characteristics of the Lasso solution. For a given response vector y, there is a sequence of λ's, λ₀ > λ₁ > λ₂ > · · · > λ_K = 0, such that:    (20)

For all λ > λ₀, β̂(λ) = 0.

In the interior of the interval (λ_{m+1}, λ_m), the active set B(λ) and the sign vector Sgn(λ) are constant with respect to λ; thus we write them as B_m and Sgn_m for convenience.

The active set changes at each λ_m. When λ decreases across λ_m, some predictors with zero coefficients at λ_m are about to have non-zero coefficients, and thus they join the active set B_m. However, as λ approaches λ_{m+1} + 0, there are possibly some predictors in B_m whose coefficients reach zero. Hence we call {λ_m} the transition points.

We shall proceed by proving the following lemmas (proofs are given in the appendix).

Figure 3: The synthetic model with the 64 predictors in the diabetes data. In the left panel we compare E[|B|] with the true degrees of freedom df(λ), based on B = 20000 Monte Carlo simulations. The solid line is the 45° line (the perfect match). The right panel shows the estimation bias and its point-wise 95% confidence intervals, indicated by the thin dashed lines. Note that the zero horizontal line is well inside the confidence intervals.
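The comparison behind Figure 3 can be reproduced in miniature. The sketch below uses a made-up 10-predictor design (not the 64-predictor diabetes design) and a single λ: it estimates df(λ) via (19) with a_i = (Xβ)_i and compares the result with the average number of non-zero coefficients, as identity (15) predicts.

```python
# Sketch: a miniature Monte Carlo check of identity (15).
# df(lambda) is estimated via (19) with a_i = (X beta)_i and compared
# with the average active-set size E|B|. The design, beta, sigma and
# lambda below are made up for illustration.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, sigma, lam = 50, 10, 1.0, 30.0
X = rng.standard_normal((n, p))
beta = np.r_[2.0, -1.5, 1.0, np.zeros(p - 3)]
mu = X @ beta
model = Lasso(alpha=lam / (2 * n), fit_intercept=False,
              tol=1e-8, max_iter=50_000)

B = 2000
cov_sum, size_sum = 0.0, 0
for _ in range(B):
    y = mu + sigma * rng.standard_normal(n)
    coef = model.fit(X, y).coef_
    cov_sum += np.sum((X @ coef - mu) * (y - mu))   # (19) with a_i = mu_i
    size_sum += int(np.sum(coef != 0))
df_mc = cov_sum / (B * sigma**2)
print(df_mc, size_sum / B)             # close to each other, per (15)
```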

Figure 4: The synthetic model with the 64 predictors in the diabetes data. We compare d̂f(m_k) with the true degrees of freedom df(m_k), based on B = 20000 Monte Carlo simulations. We consider two choices of m_k: in the left panel ("The last step containing k variables") m_k is the last Lasso step containing exactly k non-zero variables, while the right panel ("The first step containing k variables") chooses the first Lasso step containing exactly k non-zero variables. As can be seen from the plots, our formula works pretty well in both cases.

Figure 5: The true bias k − df(m_k), computed from B = 20000 replications, for signal/noise ratios 0, 1.25 and 125; the top row uses m_k^last and the bottom row m_k^first, with point-wise 95% confidence intervals indicated by the thin dashed lines. It seems that the bias k − df(m_k) ≠ 0 for some k when the positive cone condition is violated: k tends to underestimate df(m_k^last) and overestimate df(m_k^first), and in addition we observe that k − df(m_k^last) ≈ df(m_k^first) − k (the magnitude is about 0.8 at most). The most important message is that the maximum absolute bias is always less than one, regardless of the size of the signal/noise ratio. Under the positive cone condition, the bias is exactly zero.

1 ˆ T T () m = X m X m X m y Sgnm . (21) B B B B 2 Lemma 3. Consider the transition points and , 0. is the active set in m m+1 m+1 Bm ( , ). Suppose i is an index added into at and its index in is i, i.e., m+1 m add Bm m Bm i = ( ) . Denote by (a) the k-th element of the vector a. We can express the transition add Bm i k point m as follows: 1 T T 2 X X m X y m B m = B B i (22) m 1 T X m X m Sgnm B B i Moreover, if j is a dropped (if thereis any) index at and j = ( ) , then drop m+1 drop Bm j m+1 can be written as: 1 T T 2 X X m X y m B m = B B j (23) m+1 1 T X m X m Sgnm B B j n Lemma 4. > 0, a null set which is a nite collection of hyperplanes in R . Let n ∀ ∃ N = R . Then y , is not any of the transition points, i.e., / (y)m . G \ N ∀ ∈ G ∈ { } Lemma 5. , ˆ (y) is a continuous function of y. ∀

Lemma 6. Fix any > 0, consider y as dened in Lemma 4. The active set () and the sign vector Sgn() are locally constant∈ G with respect to y. B

n Theorem 1. Let 0 = R . Fix an arbitrary 0. On the set with full measure as G G dened in Lemma 4, the Lasso t ˆ (y) is uniformly Lipschitz. Precisely,

ˆ (y + y) ˆ (y) y for suciently small y (24) k k k k Moreover, we have the divergence formula

ˆ (y) = . (25) ∇ |B| Proof. If = 0, then the Lasso t is just the OLS t. The conclusions are easy to verify. So we focus on > 0. Fix a y. Choose a small enough such that Ball(y, ) . Since is not any transition point, using (21) we observe G

ˆ (y) = Xˆ(y) = H (y)y ω (y), (26) T 1 T where H(y) = X (X X ) X is the projection matrix on the space X and ω(y) = B B B 1 X (XT X ) 1Sgn B . ConsiderB y < . Similarly, we get 2 B B B B k k ˆ (y + y) = H (y + y)(y + y) ω (y + y). (27) 12 Lemma 6 says that we can further let be suciently small such that both the eective set and the sign vector Sgn stay constant in Ball(y, ). Now x . Hence if y < , B k k then H(y + y) = H(y) and ω(y + y) = ω(y). (28) Then (26) and (27) give ˆ (y + y) ˆ (y) = H (y)y. (29) But since H (y)y y , (24) is proved. k k k k By the local constancy of H(y) and ω(y), we have ∂ˆ (y) = H (y). (30) ∂y Then the trace formula (13) implies

ˆ (y) = tr (H (y)) = . (31) ∇ |B|

By standard analysis arguments, it is easy to check the following proposition Proposition Let f : Rn Rn and suppose f is uniformly Lipschitz on = Rn where → G \ N is a nite set of hyperplanes. If f is continuous, then f is uniformly Lipschitz on Rn. N

Theorem 2. the Lasso t ˆ (y) is uniformly Lipschitz. The degrees of freedom of ˆ (y) equal the expectation∀ of the eective set , i.e., B df() = E [ ] . (32) |B| Proof. The proof is trivial for = 0. We only consider > 0. By Theorem 1 and the

proposition, we conclude that ˆ (y) is uniformly Lipschitz. Therefore ˆ (y) is almost dierentiable, see Meyer & Woodroofe (2000) and Efron et al. (2004). Then (32) is obtained by Stein’s Lemma and the divergence formula (25).
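The local representation (26) underlying Theorems 1 and 2 is easy to verify numerically: fit the Lasso at a fixed λ, build H_B and ω_B from the fitted active set, and check that μ̂_λ = H_B y − ω_B, whence ∇·μ̂_λ = tr(H_B) = |B|. The data below are made up; scikit-learn's α equals λ/(2n) in the notation of (2).

```python
# Sketch: numerical check of representation (26),
#   mu_hat = H_B y - omega_B,  with tr(H_B) = |B| (the divergence (25)),
# against a coordinate-descent Lasso solution.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, lam = 60, 8, 20.0
X = rng.standard_normal((n, p))
y = X @ np.r_[3.0, -2.0, np.zeros(p - 2)] + rng.standard_normal(n)

coef = Lasso(alpha=lam / (2 * n), fit_intercept=False,
             tol=1e-12, max_iter=200_000).fit(X, y).coef_
active = np.flatnonzero(coef != 0)
XB = X[:, active]
G = np.linalg.inv(XB.T @ XB)
H = XB @ G @ XB.T                       # projection onto span(X_B)
omega = (lam / 2) * XB @ G @ np.sign(coef[active])
mu_hat = X @ coef

print(np.allclose(mu_hat, H @ y - omega, atol=1e-4))  # representation (26)
print(np.trace(H), len(active))         # divergence equals |B|
```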

3.3 df(m_k) and the conjecture

In this section we provide mathematical support for the conjecture in Section 1. The conjecture becomes a simple fact for the two trivial cases k = 0 and k = p, thus we only need to consider k = 1, . . . , (p − 1). Our arguments rely on the details of the LARS algorithm. For the sake of clarity, we first briefly describe the LARS algorithm; the reader is referred to the LAR paper (Efron et al. 2004) for the complete description.

The LARS algorithm sequentially updates the Lasso estimate in a predictable way. Initially (step 0), let β̂_0 = 0 and A_0 = ∅. Suppose that β̂_m is the vector of current Lasso coefficient estimates. Then μ̂_m = Xβ̂_m and r̂_m = y − μ̂_m are the current fit and residual vectors. We say ĉ = Xᵀ r̂_m is the vector of current correlations. Define

Ĉ = max_j {|ĉ_j|} and W_m = {j : |ĉ_j| = Ĉ and j ∈ A_m^c}.    (33)

Then λ_m = 2Ĉ. Define the current active set A = A_m ∪ W_m and the signed matrix

X_A^sign = (· · · Sgn(ĉ_j) x_j · · ·)_{j∈A}.    (34)

Let G_A = (X_A^sign)ᵀ X_A^sign, and let 1_A be a vector of 1's of length |A|. Then we compute the equiangular vector

u_A = X_A^sign w_A with w_A = D_A G_A⁻¹ 1_A,    (35)

where D_A = (1_Aᵀ G_A⁻¹ 1_A)^{−1/2}. Let the inner product vector be a = Xᵀ u_A, and define

γ̂ = min⁺_{j∈A^c} { (Ĉ − ĉ_j)/(D_A − a_j), (Ĉ + ĉ_j)/(D_A + a_j) }.    (36)

For j ∈ A we compute d_j = Sgn(ĉ_j) w_{A,j} and γ_j = −(β̂_m)_j/d_j. Define

γ̃ = min_{γ_j > 0} {γ_j} and V_m = {j ∈ A : γ_j = γ̃}.    (37)

The Lasso coefficient estimates are updated by

(β̂_{m+1})_j = (β̂_m)_j + min{γ̂, γ̃} d_j for j ∈ A.    (38)

The set A_m is also updated: if γ̃ < γ̂ then A_{m+1} = A − V_m; otherwise A_{m+1} = A. Let q_m be the indicator of whether V_m is dropped or not. Define q_m V_m = V_m if q_m = 1 and q_m V_m = ∅ otherwise; conventionally let V_{−1} = ∅ and q_{−1} V_{−1} = ∅. Considering the active set B as a function of λ, we summarize the following facts:

|B(λ)| = |B_m| + |W_m| if λ_{m+1} < λ < λ_m,    (39)

|B_{m+1}| = |B_m| + |W_m| − |q_m V_m|.    (40)

In the LARS algorithm, the Lasso is regarded as a kind of forward stage-wise method for which the number of steps is often used as an effective regularization parameter. For each k, k ∈ {1, 2, . . . , (p − 1)}, we seek the models with k non-zero predictors. Let

Ω_k = {m : |B_m| = k}.    (41)

The conjecture is asking for the fit using m_k^last = sup(Ω_k). However, it may happen that for some k there is no m with |B_m| = k. For example, if y is an equiangular vector of all the {X_j}, then the Lasso estimates become the OLS estimates after just one step, so Ω_k = ∅ for k = 2, . . . , (p − 1). The next lemma concerns this type of situation. Basically, it shows that the "one at a time" condition (Efron et al. 2004) holds almost everywhere; therefore Ω_k is not empty for all k a.s.
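The transition points {λ_m} and the active-set sizes summarized in (39) and (40) can be read off from an existing LARS implementation. A sketch using scikit-learn's lars_path with method="lasso" on made-up data (its returned alphas equal λ/(2n) in the notation of (2)):

```python
# Sketch: the Lasso transition points via the LARS algorithm.
# lars_path(method="lasso") returns the breakpoints (alphas) and the
# piece-wise linear coefficient path; the non-zero count at each
# breakpoint tracks the evolution of the active set.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(4)
n, p = 80, 6
X = rng.standard_normal((n, p))
y = X @ np.r_[2.0, 0.0, -1.0, np.zeros(p - 3)] + rng.standard_normal(n)

alphas, _, coefs = lars_path(X, y, method="lasso")
for step, a in enumerate(alphas):
    k = int(np.sum(coefs[:, step] != 0))
    print(f"step {step}: lambda = {2 * n * a:.3f}, non-zero = {k}")
```

With a dropped variable along the path, the number of steps exceeds the number of ever-active predictors, which is the phenomenon behind the m_k^first versus m_k^last distinction.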

Lemma 7. ∃ a set Ñ_0 which is a collection of finitely many hyperplanes in ℝⁿ. ∀y ∈ ℝⁿ − Ñ_0,

|W_m(y)| = 1 and |q_m V_m(y)| ≤ 1, ∀m = 0, 1, . . . , K(y).    (42)

Corollary 1. ∀y ∈ ℝⁿ − Ñ_0, Ω_k is not empty for all k, k = 0, 1, . . . , p.

Proof. This is a direct consequence of Lemma 7 and (39), (40).

The next theorem presents an expression for the Lasso fit at each transition point, which helps us compute the divergence of μ̂_{m_k}(y).

Theorem 3. Let μ̂_m(y) be the Lasso fit at the transition point λ_m, λ_m > 0. Then for any i ∈ W_m, we can write μ̂_m(y) as follows:

μ̂_m(y) = { H_{B(λ_m)} − [ X_{B(λ_m)} (X_{B(λ_m)}ᵀ X_{B(λ_m)})⁻¹ Sgn(λ_m) x_iᵀ (I − H_{B(λ_m)}) ] / [ Sgn(ĉ_i) − x_iᵀ X_{B(λ_m)} (X_{B(λ_m)}ᵀ X_{B(λ_m)})⁻¹ Sgn(λ_m) ] } y    (43)
        =: S_m(y) y,    (44)

where H_{B(λ_m)} is the projection matrix on the subspace spanned by X_{B(λ_m)}. Moreover,

tr(S_m(y)) = |B(λ_m)|.    (45)

Proof. Note that β̂(λ) is continuous in λ. Using (21) in Lemma 2 and taking the limit λ → λ_m, we have

−2 x_jᵀ ( y − Σ_{j'=1}^p x_{j'} β̂(λ_m)_{j'} ) + λ_m Sgn(β̂(λ_m)_j) = 0, for j ∈ B(λ_m).    (46)

However, Σ_{j=1}^p x_j β̂(λ_m)_j = Σ_{j∈B(λ_m)} x_j β̂(λ_m)_j. Thus we have

β̂(λ_m)_{B(λ_m)} = (X_{B(λ_m)}ᵀ X_{B(λ_m)})⁻¹ ( X_{B(λ_m)}ᵀ y − (λ_m/2) Sgn(λ_m) ).    (47)

Hence

μ̂_m(y) = X_{B(λ_m)} (X_{B(λ_m)}ᵀ X_{B(λ_m)})⁻¹ ( X_{B(λ_m)}ᵀ y − (λ_m/2) Sgn(λ_m) )
        = H_{B(λ_m)} y − (λ_m/2) X_{B(λ_m)} (X_{B(λ_m)}ᵀ X_{B(λ_m)})⁻¹ Sgn(λ_m).    (48)

Since i ∈ W_m, we must have the equiangular condition

Sgn(ĉ_i) x_iᵀ ( y − μ̂(λ_m) ) = λ_m/2.    (49)

Substituting (48) into (49), we solve for λ_m/2 and obtain

λ_m/2 = x_iᵀ (I − H_{B(λ_m)}) y / ( Sgn(ĉ_i) − x_iᵀ X_{B(λ_m)} (X_{B(λ_m)}ᵀ X_{B(λ_m)})⁻¹ Sgn(λ_m) ).    (50)

Then putting (50) back into (48) yields (43). Using the identity tr(AB) = tr(BA), we observe

tr( S_m(y) − H_{B(λ_m)} ) = − tr( X_{B(λ_m)} (X_{B(λ_m)}ᵀ X_{B(λ_m)})⁻¹ Sgn(λ_m) x_iᵀ (I − H_{B(λ_m)}) ) / ( Sgn(ĉ_i) − x_iᵀ X_{B(λ_m)} (X_{B(λ_m)}ᵀ X_{B(λ_m)})⁻¹ Sgn(λ_m) )
        = − tr( (X_{B(λ_m)}ᵀ X_{B(λ_m)})⁻¹ Sgn(λ_m) x_iᵀ (I − H_{B(λ_m)}) X_{B(λ_m)} ) / ( Sgn(ĉ_i) − x_iᵀ X_{B(λ_m)} (X_{B(λ_m)}ᵀ X_{B(λ_m)})⁻¹ Sgn(λ_m) )
        = tr(0) = 0.

So tr(S_m(y)) = tr(H_{B(λ_m)}) = |B(λ_m)|.

Definition 1. y ∈ ℝⁿ − Ñ_0 is said to be a locally stable point for Ω_k if, for all y′ such that ‖y′ − y‖ ≤ ε(y) for a small enough ε(y), the effective set B_m(y′) = B_m(y) for all m ∈ Ω_k. Let LS(k) be the set of all locally stable points for Ω_k.

Theorem 4. If y ∈ LS(k), then we have the divergence formula ∇·μ̂_m(y) = k, which holds for all m ∈ Ω_k, including m = sup(Ω_k), the choice in the conjecture.

Proof. The conclusion immediately follows from Definition 1 and Theorem 3.

Points in LS(k) are the majority of ℝⁿ. Under the positive cone condition, LS(k) is a set of full measure for all k. In fact the positive cone condition implies a stronger conclusion.

Definition 2. y is said to be a strong locally stable point if, for all y′ such that ‖y′ − y‖ ≤ ε(y) for a small enough ε(y), the effective set B_m(y′) = B_m(y) for all m = 0, 1, . . . , K(y).

Lemma 8. Let Ñ_1 = {y : γ̂_m(y) = γ̃_m(y) for some m, m ∈ {0, 1, . . . , K(y)}}. Every y in the interior of ℝⁿ − (Ñ_0 ∪ Ñ_1) is a strong locally stable point. In particular, the positive cone condition implies Ñ_1 = ∅.

LARS is a discrete procedure by its definition, but the Lasso is a continuous shrinkage method. So it also makes sense to talk about fractional Lasso steps in the LARS algorithm, e.g. what is the Lasso fit at 3.5 steps? Under the positive cone condition, we can generalize the result of Theorem 4 in the LAR paper to the case of non-integer steps.

Corollary 2. Under the positive cone condition, df(μ̂_s) = s for all real-valued s, 0 ≤ s ≤ p.

Proof. Let k ≤ s < k + 1, s = k + r for some r ∈ [0, 1). According to the LARS algorithm, the Lasso fit is linearly interpolated between steps k and k + 1, so μ̂_s = μ̂_k (1 − r) + μ̂_{k+1} r. Then by definition (5) and the fact that cov(·, ·) is a linear operator, we have

df(μ̂_s) = df(μ̂_k)(1 − r) + df(μ̂_{k+1}) r = k(1 − r) + (k + 1) r = s.    (51)

In (51) we have used the positive cone condition and Theorem 4 in the LAR paper.

4 Adaptive Lasso Shrinkage

4.1 Model selection criteria

For any regularization method, an important issue is to find a good choice of the regularization parameter such that the corresponding model is optimal according to some criterion, e.g. minimizing the prediction risk. For this purpose, model selection criteria have been proposed in the literature to compare different models. Famous examples are AIC (Akaike 1973) and BIC (Schwartz 1978). Mallows's Cp (Mallows 1973) is very similar to AIC, and a whole class of AIC- or Cp-type criteria are provided by SURE theory (Stein 1981). In Efron (2004), Cp and SURE are summarized as covariance penalty methods for estimating the prediction error, and are shown to offer substantially better accuracy than cross-validation and related nonparametric methods, if one is willing to assume the model is correct.

In the previous section we derived the degrees of freedom of the Lasso for both types of regularization: λ and m_k. Although the exact value of df(λ) is still unknown, our formula provides a convenient unbiased estimate. In the spirit of SURE theory, the unbiased estimate for df(λ) suffices to provide an unbiased estimate for the prediction error of μ̂_λ. If we choose m_k as the regularization parameter, the good approximation df(μ̂_{m_k}) ≈ k also serves the SURE purpose well. Therefore an estimate for the prediction error of μ̂ is

p̂e(μ̂) = ‖y − μ̂‖²/n + (2/n) d̂f(μ̂) σ²,    (52)

where d̂f is either d̂f(λ) or d̂f(m_k), depending on the type of regularization. When σ² is unknown, it is usually replaced with an estimate based on the largest model. Equation (52) equivalently derives AIC for the Lasso:

AIC(μ̂) = ‖y − μ̂‖²/(nσ²) + (2/n) d̂f(μ̂).    (53)

Selecting the Lasso model by AIC is called AIC-Lasso shrinkage. Following the usual definition of BIC, we propose BIC for the Lasso as follows:

BIC(μ̂) = ‖y − μ̂‖²/(nσ²) + (log(n)/n) d̂f(μ̂).    (54)

Similarly, the Lasso model selection by BIC is called BIC-Lasso shrinkage. AIC and BIC possess different asymptotic optimality. It is well known that if the true regression function is not in the candidate models, the model selected by AIC asymptotically achieves the smallest average squared error among the candidates; and the AIC estimator of the regression function converges at the minimax optimal rate whether or not the true regression function is in the candidate models; see Shao (1997), Yang (2003) and references therein. On the other hand, BIC is well known for its consistency in selecting the true model (Shao 1997). If the true model is in the candidate list, the probability of selecting the true model by BIC approaches one as the sample size n → ∞. Considering the case where the true underlying model is sparse, BIC-Lasso shrinkage is adaptive in variable selection.
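Criteria (53) and (54) are cheap to evaluate along the whole solution path, since d̂f is just the running count of non-zero coefficients. A hedged sketch (made-up data, σ² treated as known) using scikit-learn's lars_path:

```python
# Sketch: AIC/BIC-Lasso shrinkage, (53)-(54), along the LARS path.
# df is estimated by the number of non-zero coefficients at each
# breakpoint; sigma^2 is assumed known here for simplicity.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(5)
n, p, sigma2 = 100, 8, 1.0
X = rng.standard_normal((n, p))
y = X @ np.r_[3.0, 1.5, 0.0, 0.0, 2.0, np.zeros(p - 5)] \
    + rng.standard_normal(n)

alphas, _, coefs = lars_path(X, y, method="lasso")
rss = np.sum((y[:, None] - X @ coefs) ** 2, axis=0)
df = np.sum(coefs != 0, axis=0)             # unbiased df estimate
aic = rss / (n * sigma2) + (2.0 / n) * df
bic = rss / (n * sigma2) + (np.log(n) / n) * df
print("AIC picks", int(df[np.argmin(aic)]), "variables;",
      "BIC picks", int(df[np.argmin(bic)]), "variables")
```

Because the BIC penalty log(n)/n exceeds the AIC penalty 2/n for n ≥ 8, over the same candidate path BIC can never select a model with more non-zero coefficients than AIC does.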

n      AIC    BIC
100    0.162  0.451
500    0.181  0.623
1000   0.193  0.686
2000   0.184  0.702

Table 1: The simulation example: the probability of discovering the exact true model by AIC and BIC Lasso shrinkage. The calculation is based on 2000 replications. Compared with AIC-Lasso shrinkage, BIC-Lasso shrinkage has a much higher probability of identifying the ground truth.

n      AIC  BIC
100    5    4
500    5    3
1000   5    3
2000   5    3

Table 2: The simulation example: the median of the number of non-zero variables selected by AIC and BIC Lasso shrinkage based on 2000 replications. One can see that AIC-Lasso shrinkage is conservative in variable selection and BIC-Lasso shrinkage tends to find models with the right size.

However, AIC-Lasso shrinkage tends to include more non-zero predictors than the truth. The conservative nature of AIC is a familiar result in linear regression. Hence BIC-Lasso shrinkage is more appropriate than AIC-Lasso shrinkage when variable selection is the primary concern in applying the Lasso.

Here we show a simulation example to demonstrate the above argument. We simulated response vectors y from the linear model y = Xβ + ε with ε ~ N(0, 1), where β = (3, 1.5, 0, 0, 2, 0, 0, 0). The predictors {x_i} are multivariate normal vectors with pairwise correlation cor(x_i, x_j) = (0.1)^|i−j|, and the variance of each x_i is one. An estimate β̂ is said to discover the exact true model if {β̂_1, β̂_2, β̂_5} are non-zero and the remaining coefficients are all zero. Table 1 shows the probability of discovering the exact true model using AIC-Lasso shrinkage and BIC-Lasso shrinkage. In this example both AIC and BIC always select the true predictors {1, 2, 5} in all 2000 replications, but AIC tends to include other variables as real factors, as shown in Table 2. In contrast to AIC, BIC has a much lower false positive rate.

One may think of combining the good properties of AIC and BIC into a new criterion. Although this proposal sounds quite reasonable, a surprising result shows that no model selection criterion can be both consistent and optimal in average squared error at the same time (Yang 2003). In other words, any model selection criterion must sacrifice either prediction optimality or consistency.
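The simulation above can be reproduced in a few lines. The sketch below is ours and deliberately simple: instead of LARS it runs a plain coordinate-descent Lasso over a λ grid, with df̂ equal to the number of non-zero coefficients, so the selected support may differ slightly from the LARS-based results in Tables 1 and 2; all function names are our own.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for ||y - X b||^2 + lam * sum_j |b_j|."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]                # partial residual
            z = X[:, j] @ r_j
            # soft-threshold at lam/2 (the squared loss carries no 1/2 factor)
            b[j] = np.sign(z) * max(abs(z) - lam / 2, 0.0) / col_ss[j]
    return b

def bic_lasso(X, y, lams, sigma2=1.0):
    """Select lambda by BIC (54), using the number of non-zeros as df."""
    n = len(y)
    best_b, best_bic = None, np.inf
    for lam in lams:
        b = lasso_cd(X, y, lam)
        bic = ((y - X @ b) ** 2).sum() / (n * sigma2) \
              + np.log(n) / n * np.count_nonzero(b)
        if bic < best_bic:
            best_b, best_bic = b, bic
    return best_b

# One replication: beta = (3, 1.5, 0, 0, 2, 0, 0, 0),
# cor(x_i, x_j) = 0.1^|i-j|, unit-variance predictors, N(0, 1) noise.
rng = np.random.default_rng(0)
n, p = 100, 8
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
cov = 0.1 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
y = X @ beta + rng.standard_normal(n)
b_hat = bic_lasso(X, y, lams=np.linspace(0.5, 40.0, 40))
```

Repeating this over many replications and recording how often the non-zero set is exactly {1, 2, 5} recovers frequencies of the kind reported in Table 1.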

4.2 Computation

Using either AIC or BIC to find the optimal Lasso model, we face the optimization problem

    λ(optimal) = arg min_λ { ||y − μ̂_λ||²/(nσ²) + (w_n/n) df̂(λ) },    (55)

where w_n = 2 for AIC and w_n = log(n) for BIC. Since the LARS algorithm efficiently computes the Lasso solution for all λ, finding λ(optimal) is attainable in principle. In fact, we show that λ(optimal) is one of the transition points, which further facilitates the search.

Theorem 5. To find λ(optimal), we only need to solve

    m* = arg min_m { ||y − μ̂_{λ_m}||²/(nσ²) + (w_n/n) df̂(λ_m) };    (56)

then λ(optimal) = λ_{m*}.

Proof. Let us consider λ ∈ (λ_{m+1}, λ_m). By (21) we have

    y − μ̂_λ = (I − H_{B_m}) y + (λ/2) X_{B_m} (X_{B_m}^T X_{B_m})^{-1} Sgn_m,    (57)
    ||y − μ̂_λ||² = y^T (I − H_{B_m}) y + (λ²/4) Sgn_m^T (X_{B_m}^T X_{B_m})^{-1} Sgn_m,    (58)

where H_{B_m} = X_{B_m} (X_{B_m}^T X_{B_m})^{-1} X_{B_m}^T; the cross term vanishes because (I − H_{B_m}) X_{B_m} = 0. Thus we can conclude that ||y − μ̂_λ||² is strictly increasing in λ on the interval (λ_{m+1}, λ_m). Moreover, the Lasso estimates are continuous in λ, hence

    ||y − μ̂_{λ_m}||² > ||y − μ̂_λ||² > ||y − μ̂_{λ_{m+1}}||².    (59)

On the other hand, note that df̂(λ) = |B_m| ∀λ ∈ (λ_{m+1}, λ_m) and |B(λ_{m+1})| ≤ |B_m|. Therefore the optimal choice of λ in [λ_{m+1}, λ_m) is λ_{m+1}, which means λ(optimal) ∈ {λ_m}, m = 0, 1, 2, . . . , K.

According to Theorem 5, the optimal Lasso model is immediately selected once the entire Lasso solution path is computed by the LARS algorithm, which has the cost of a single least-squares fit. If we consider the best Lasso fit in the forward stage-wise modeling picture (as in Figure 2), inequality (59) explains the superiority of the choice m_k in the conjecture. Let m_k be the last Lasso step containing k non-zero predictors, and suppose m_k′ is another Lasso step containing k non-zero predictors; then df̂(μ̂(m_k′)) = k = df̂(μ̂(m_k)). However, m_k′ < m_k gives ||y − μ̂_{m_k}||² < ||y − μ̂_{m_k′}||². Then by the Cp statistic, we see that μ̂(m_k) is always more accurate than μ̂(m_k′), while using the same number of non-zero predictors. Using k as the tuning parameter of the Lasso, we need to find k(optimal) such that
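In code, Theorem 5 reduces the search for λ(optimal) to an argmin over the path's transition points. The helper below is our own sketch; it assumes a path algorithm (such as LARS) has already supplied the residual sum of squares and the active-set size at each transition point:

```python
import numpy as np

def select_transition_point(rss, df, n, sigma2=1.0, criterion="bic"):
    """Minimize (56) over the transition points m = 0, 1, ..., K.

    rss[m] = ||y - mu_hat_{lambda_m}||^2 and df[m] = |B_m| at the m-th
    transition point (assumed precomputed by a path algorithm)."""
    w_n = np.log(n) if criterion == "bic" else 2.0   # (55): BIC vs AIC weight
    scores = np.asarray(rss) / (n * sigma2) + (w_n / n) * np.asarray(df)
    return int(np.argmin(scores))
```

Because BIC charges log(n) per degree of freedom instead of 2, it can stop one or more transition points earlier than AIC on the same path.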

    k(optimal) = arg min_k { ||y − μ̂_{m_k}||²/(nσ²) + (w_n/n) k }.    (60)

Once λ = λ(optimal) and k = k(optimal) are found, we fix them as the regularization parameters for fitting the Lasso on future data. Using the fixed k means the fit on future data is μ̂_{m_k}, while the fit using the fixed λ is μ̂_λ. It is easy to see that the models selected by (55) and (60) coincide on the training data, i.e., μ̂_λ = μ̂_{m_k}.

Figure 6 displays the Cp (equivalently AIC) and BIC estimates of risk for the diabetes data. The models selected by Cp are the same as those selected in the LAR paper. With 10 predictors, Cp and BIC select the same model using 7 non-zero covariates. With 64 predictors, Cp selects a model using 15 covariates, while BIC selects a model with 11 covariates.


Figure 6: The diabetes data. Cp and BIC estimates of risk with 10 (left) and 64 (right) predictors. In the left panel Cp and BIC select the same model with 7 non-zero coefficients. In the right panel, Cp selects a model with 15 non-zero coefficients and BIC selects a model with 11 non-zero coefficients.

5 Discussion

It is interesting to note that the true degrees of freedom is a strictly decreasing function of λ, as shown in Figure 7. However, the unbiased estimate df̂(λ) is not necessarily monotone, although its global trend is monotonically decreasing. The same phenomenon is also shown in the right panel of Figure 1. The non-monotonicity of df̂(λ) is due to the fact that some variables can be dropped during the LARS/Lasso process. An interesting question is whether there is a smoothed estimate df̃(λ) such that df̃(λ) is a smooth decreasing function of λ and keeps the unbiasedness property, i.e.,

    df(λ) = E[df̃(λ)]    (61)

holds for all λ. This is a future research topic.


Figure 7: The dotted line is the curve of the estimated degrees of freedom (df̂(λ) vs. log(1 + λ)), using a typical realization y generated by the synthetic model (16). The smooth curve shows the true degrees of freedom df(λ), obtained by averaging 20000 estimated curves. One can see that the estimated df curve is piece-wise constant and non-monotone, while the true df curve is smooth and monotone. The two thin broken lines correspond to df̂(λ) ± 2√(Var(df̂(λ))), where Var(df̂(λ)) is calculated from the B = 20000 replications.
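The contrast in Figure 7 between piecewise-constant estimated curves and a smooth mean curve is easy to reproduce in a toy setting. The sketch below is ours and uses an orthonormal-design simplification (not the model (16) of the paper): there the Lasso reduces to soft-thresholding of z = X^T y ~ N(θ, I), so df̂(λ) is simply the number of |z_j| exceeding λ/2, and df(λ) = E[df̂(λ)] can be approximated by averaging replications. In this special case no variables are ever dropped, so each realization of df̂ is itself monotone; the non-monotonicity seen in Figure 7 comes from dropping in general designs, but the step-function-versus-smooth-average phenomenon is the same.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rep, p = 5000, 10
theta = 2.0 * rng.standard_normal(p)        # fixed true coefficients
lams = np.linspace(0.0, 8.0, 50)

# Orthonormal design: z = X^T y ~ N(theta, I_p), the Lasso estimate is
# soft(z_j, lam/2), hence df_hat(lam) = #{j : |z_j| > lam/2}.
z = theta + rng.standard_normal((n_rep, p))
df_hat = (np.abs(z)[:, :, None] > lams / 2).sum(axis=1)   # (n_rep, len(lams))
true_df = df_hat.mean(axis=0)               # Monte Carlo estimate of E[df_hat]
```

Each row of df_hat is an integer-valued step function of λ, yet the average true_df decreases smoothly, mirroring the dotted versus smooth curves in Figure 7.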

6 Appendix: proofs of Lemmas 2-8

Proof of Lemma 2. Let

    ℓ(β, y) = ||y − Σ_{j=1}^p x_j β_j||² + λ Σ_{j=1}^p |β_j|.    (62)

Given y, β̂(λ) is the minimizer of ℓ(β, y). For those j ∈ B_m we must have ∂ℓ(β, y)/∂β_j = 0, i.e.,

    −2 x_j^T (y − Σ_{j=1}^p x_j β̂(λ)_j) + λ Sgn(β̂(λ)_j) = 0,  for j ∈ B_m.    (63)

Since β̂(λ)_i = 0 for all i ∉ B_m, we have Σ_{j=1}^p x_j β̂(λ)_j = Σ_{j∈B_m} x_j β̂(λ)_j. Thus the equations in (63) become

    −2 X_{B_m}^T (y − X_{B_m} β̂(λ)_{B_m}) + λ Sgn_m = 0,    (64)

which gives (21).

Proof of Lemma 3. We adopt the matrix notation used in S: M[i, ] means the i-th row of M. If i_add joins B_m at λ_m, then

    β̂(λ_m)_{i_add} = 0.    (65)

Consider β̂(λ) for λ ∈ (λ_{m+1}, λ_m). Lemma 2 gives

    β̂(λ)_{B_m} = (X_{B_m}^T X_{B_m})^{-1} (X_{B_m}^T y − (λ/2) Sgn_m).    (66)

By the continuity of β̂(λ)_{i_add}, taking the limit of the i-th element of (66) as λ → λ_m − 0, we have

    {(X_{B_m}^T X_{B_m})^{-1}[i, ] X_{B_m}^T y} = (λ_m/2) {(X_{B_m}^T X_{B_m})^{-1}[i, ] Sgn_m}.    (67)

The second {·} is a non-zero scalar, for otherwise β̂(λ)_{i_add} = 0 for all λ ∈ (λ_{m+1}, λ_m), which contradicts the assumption that i_add becomes a member of the active set B_m. Thus we have

    λ_m = 2 (X_{B_m}^T X_{B_m})^{-1}[i, ] X_{B_m}^T y / ((X_{B_m}^T X_{B_m})^{-1}[i, ] Sgn_m) =: v(B_m, i) X_{B_m}^T y,    (68)

where v(B_m, i) = 2 (X_{B_m}^T X_{B_m})^{-1}[i, ] / ((X_{B_m}^T X_{B_m})^{-1}[i, ] Sgn_m). Rearranging (68), we get (22). Similarly, if j_drop is a dropped index at λ_{m+1}, we take the limit of the j-th element of (66) as λ → λ_{m+1} + 0 to conclude that

    λ_{m+1} = 2 (X_{B_m}^T X_{B_m})^{-1}[j, ] X_{B_m}^T y / ((X_{B_m}^T X_{B_m})^{-1}[j, ] Sgn_m) =: v(B_m, j) X_{B_m}^T y,    (69)

where v(B_m, j) = 2 (X_{B_m}^T X_{B_m})^{-1}[j, ] / ((X_{B_m}^T X_{B_m})^{-1}[j, ] Sgn_m). Rearranging (69), we get (23).

Proof of Lemma 4. Suppose for some y and m, λ = λ(y)_m. λ > 0 means m is not the last Lasso step. By Lemma 3 we have

    λ = λ_m = v(B_m, i) X_{B_m}^T y =: α(B_m, i) y.    (70)

Obviously α(B_m, i) = v(B_m, i) X_{B_m}^T is a non-zero vector. Now let A be the totality of α(B_m, i) obtained by considering all the possible combinations of B_m, i and the sign vector Sgn_m. A depends only on X and is a finite set, since at most p predictors are available. Thus ∀α ∈ A, αy = λ defines a hyperplane in R^n. We define

    N_λ = {y : αy = λ for some α ∈ A}  and  G_λ = R^n \ N_λ.

Then on G_λ, (70) is impossible.

Proof of Lemma 5. For writing convenience we omit the subscript λ. Let

    β̂(y)_ols = (X^T X)^{-1} X^T y    (71)

be the OLS estimates. Note that we always have the inequality

    |β̂(y)|_1 ≤ |β̂(y)_ols|_1.    (72)

Fix an arbitrary y_0 and consider a sequence {y_n} (n = 1, 2, . . .) such that y_n → y_0. Since y_n → y_0, we can find a Y such that ||y_n|| ≤ Y for all n = 0, 1, 2, . . .. Consequently ||β̂(y_n)_ols|| ≤ B for some upper bound B (B is determined by X and Y). By Cauchy's inequality and (72), we have

    |β̂(y_n)|_1 ≤ √p B  for all n = 0, 1, 2, . . .    (73)

(73) implies that to show β̂(y_n) → β̂(y_0), it suffices to show that every convergent subsequence of {β̂(y_n)}, say {β̂(y_{n_k})}, converges to β̂(y_0).

Now assume β̂(y_{n_k}) converges to β̂_∞ as n_k → ∞. We show β̂_∞ = β̂(y_0). The Lasso criterion ℓ(β, y) is written in (62). Let

    ∆ℓ(β, y, y_0) = ℓ(β, y) − ℓ(β, y_0).    (74)

By the definition of β̂(y_{n_k}), we must have

    ℓ(β̂(y_0), y_{n_k}) ≥ ℓ(β̂(y_{n_k}), y_{n_k}).    (75)

Then (75) gives

    ℓ(β̂(y_0), y_0) = ℓ(β̂(y_0), y_{n_k}) + ∆ℓ(β̂(y_0), y_0, y_{n_k})
                   ≥ ℓ(β̂(y_{n_k}), y_{n_k}) + ∆ℓ(β̂(y_0), y_0, y_{n_k})
                   = ℓ(β̂(y_{n_k}), y_0) + ∆ℓ(β̂(y_{n_k}), y_{n_k}, y_0) + ∆ℓ(β̂(y_0), y_0, y_{n_k}).    (76)

We observe

    ∆ℓ(β̂(y_{n_k}), y_{n_k}, y_0) + ∆ℓ(β̂(y_0), y_0, y_{n_k}) = 2(y_0 − y_{n_k})^T X (β̂(y_{n_k}) − β̂(y_0)).    (77)

Let n_k → ∞; the right-hand side of (77) goes to zero. Moreover, ℓ(β̂(y_{n_k}), y_0) → ℓ(β̂_∞, y_0). Therefore (76) reduces to

    ℓ(β̂(y_0), y_0) ≥ ℓ(β̂_∞, y_0).

However, β̂(y_0) is the unique minimizer of ℓ(β, y_0), thus β̂_∞ = β̂(y_0).

Proof of Lemma 6. Fix an arbitrary y_0 ∈ G_λ. Denote by Ball(y, r) the n-dimensional ball with center y and radius r. Note that G_λ is an open set, so we can choose a small enough ε such that Ball(y_0, ε) ⊂ G_λ. Fix ε. Suppose y_n → y_0 as n → ∞; then without loss of generality we can assume y_n ∈ Ball(y_0, ε) for all n. So λ is not a transition point for any y_n. By definition β̂_λ(y_0)_j ≠ 0 for all j ∈ B_λ(y_0). Then Lemma 5 says that ∃ N_1 such that, as long as n > N_1, we have β̂_λ(y_n)_j ≠ 0 and Sgn(β̂_λ(y_n)_j) = Sgn(β̂_λ(y_0)_j) for all j ∈ B_λ(y_0). Thus B_λ(y_0) ⊆ B_λ(y_n) ∀ n > N_1.

On the other hand, we have the following equiangular conditions (Efron et al. 2004):

    λ = 2 |x_j^T (y_0 − X β̂(y_0))|  ∀ j ∈ B_λ(y_0),    (78)
    λ > 2 |x_j^T (y_0 − X β̂(y_0))|  ∀ j ∉ B_λ(y_0).    (79)

Using Lemma 5 again, we conclude that ∃ N > N_1 such that ∀ j ∉ B_λ(y_0) the strict inequalities (79) hold for y_n provided n > N. Thus B_λ(y_0)^c ⊆ B_λ(y_n)^c ∀ n > N. Therefore we have B_λ(y_n) = B_λ(y_0) ∀ n > N. Then the local constancy of the sign vector follows from the continuity of β̂_λ(y).

Proof of Lemma 7. Suppose at step m, |W_m(y)| ≥ 2. Let i and j be two of the predictors in W_m(y), and let i_add and j_add be their indices in the current active set A. Note the current active set A is B_m in Lemma 3. Hence we have

    λ_m = v(A, i) X_A^T y,    (80)
    λ_m = v(A, j) X_A^T y.    (81)

Therefore

    0 = [v(A, i_add) − v(A, j_add)] X_A^T y =: α_add y.    (82)

We claim that α_add = [v(A, i_add) − v(A, j_add)] X_A^T is not a zero vector. Otherwise, since the {x_j} are linearly independent, α_add = 0 forces v(A, i_add) − v(A, j_add) = 0. Then we have

    (X_A^T X_A)^{-1}[i, ] / ((X_A^T X_A)^{-1}[i, ] Sgn_A) = (X_A^T X_A)^{-1}[j, ] / ((X_A^T X_A)^{-1}[j, ] Sgn_A),    (83)

which contradicts the fact that (X_A^T X_A)^{-1} is a full-rank matrix.

Similarly, if i_drop and j_drop are dropped predictors, then

    0 = [v(A, i_drop) − v(A, j_drop)] X_A^T y =: α_drop y,    (84)

and α_drop = [v(A, i_drop) − v(A, j_drop)] X_A^T is a non-zero vector.

Let M_0 be the totality of α_add and α_drop obtained by considering all the possible combinations of A, (i_add, j_add), (i_drop, j_drop) and Sgn_A. Clearly M_0 is a finite set and depends only on X. Let

    Ñ_0 = {y : αy = 0 for some α ∈ M_0}.    (85)

Then on R^n \ Ñ_0, the conclusion holds.

Proof of Lemma 8. First we can choose a sufficiently small ε such that ∀ y′ with ||y′ − y|| < ε, y′ is an interior point of R^n \ (Ñ_0 ∪ Ñ_1). Suppose K is the last step of the Lasso solution given y. We show that for each m ≤ K, there is a δ_m < ε such that q_{m−1}, V_{m−1} and W_m are locally fixed in Ball(y, δ_m); also λ_m and β̂_m are locally continuous in Ball(y, δ_m). We proceed by induction.

For m = 0 we only need to verify the local constancy of W_0. Lemma 7 says W_0(y) = {j}. By the definition of W_0, we have |x_j^T y| > |x_i^T y| for all i ≠ j. Thus the strict inequality still holds if y′ is sufficiently close to y, which implies W_0(y′) = {j} = W_0(y).

Assuming the conclusion holds for m, we consider points in Ball(y, δ_{m+1}) with δ_{m+1} < min_{ℓ ≤ m} δ_ℓ. By the induction assumption, A_m(y) is locally fixed, since it only depends on {(q_ℓ, V_ℓ, W_ℓ), ℓ ≤ m − 1}. The event q_m V_m = ∅ is equivalent to λ̃(y) < λ̂(y). Once A_m and λ_m are fixed, both λ̃(y) and λ̂(y) are continuous in y. Thus if y′ is sufficiently close to y, the strict inequality still holds, which means q_m(y′) V_m(y′) = ∅. If q_m Ṽ_m = V_m, then λ̃(y) > λ̂(y), since the possibility λ̃(y) = λ̂(y) is ruled out. By Lemma 7, we let Ṽ_m(y) = {j}. By the definition of Ṽ_m(y), we can see that if y′ is sufficiently close to y, then Ṽ_m(y′) = {j} and λ̃(y′) > λ̂(y′) by continuity. So q_m(y′) V_m(y′) = Ṽ_m(y′) = Ṽ_m(y).

Then λ_{m+1} and β̂_{m+1} are locally continuous, because their updates are continuous in y once A_m, λ_m and q_m V_m are fixed. Moreover, since q_m V_m is fixed, A_{m+1} is also locally fixed. Let W_{m+1}(y) = {j} for some j ∈ A_{m+1}^c. Then we have

    |x_j^T (y − X β̂_{m+1}(y))| > |x_i^T (y − X β̂_{m+1}(y))|  ∀ i ≠ j, i ∈ A_{m+1}^c.

By the continuity argument, the above strict inequality holds for all y′ provided ||y′ − y|| ≤ δ_{m+1} for a sufficiently small δ_{m+1}. So W_{m+1}(y′) = {j} = W_{m+1}(y). In conclusion, we can choose δ_{m+1} small enough to ensure that q_m V_m and W_{m+1} are locally fixed, and λ_{m+1} and β̂_{m+1} are locally continuous.

7 Acknowledgements

We thank Brad Efron and Yuhong Yang for their helpful comments.

References

Akaike, H. (1973), ‘Information theory and an extension of the maximum likelihood princi- ple’, Second International Symposium on Information Theory pp. 267–281.

Efron, B. (1986), ‘How biased is the apparent error rate of a prediction rule?’, Journal of the American Statistical Association 81, 461–470.

Efron, B. (2004), ‘The estimation of prediction error: covariance penalties and cross- validation’, Journal of the American Statistical Association 99, 619–632.

Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), ‘Least angle regression’, Annals of Statistics 32, 407–499.

Hastie, T. & Tibshirani, R. (1990), Generalized Additive Models, Chapman and Hall, London.

Hastie, T., Tibshirani, R. & Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer Verlag, New York.

Hoerl, A. & Kennard, R. (1988), Ridge regression, in ‘Encyclopedia of Statistical Sciences’, Vol. 8, Wiley, New York, pp. 129–136.

Mallows, C. (1973), ‘Some comments on Cp’, Technometrics 15, 661–675.

Meyer, M. & Woodroofe, M. (2000), ‘On the degrees of freedom in shape-restricted regression’, Annals of Statistics 28, 1083–1104.

Schwartz, G. (1978), ‘Estimating the dimension of a model’, Annals of Statistics 6, 461–464.

Shao, J. (1997), ‘An asymptotic theory for linear model selection (with discussion)’, Statistica Sinica 7, 221–242.

Stein, C. (1981), ‘Estimation of the mean of a multivariate normal distribution’, Annals of Statistics 9, 1135–1151.

Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society, Series B 58, 267–288.

Yang, Y. (2003), ‘Can the strengths of AIC and BIC be shared?’, submitted. http://www.stat.iastate.edu/preprint/articles/2003-10.pdf.
