Regularization of Case-Specific Parameters for Robustness and Efficiency

Yoonkyung Lee, Steven N. MacEachern, and Yoonsuh Jung
The Ohio State University, USA

Address for correspondence: Yoonkyung Lee, Department of Statistics, The Ohio State University, 1958 Neil Ave, Columbus, OH 43210, USA. E-mail: [email protected]

Summary. Regularization methods allow one to handle a variety of inferential problems where there are more covariates than cases. This allows one to consider a potentially enormous number of covariates for a problem. We exploit the power of these techniques, supersaturating models by augmenting the “natural” covariates in the problem with an additional indicator for each case in the data set. We attach a penalty term for these case-specific indicators which is designed to produce a desired effect. For regression methods with squared error loss, an ℓ1 penalty produces a regression which is robust to outliers and high leverage cases; for quantile regression methods, an ℓ2 penalty decreases the variance of the fit enough to overcome an increase in bias. The paradigm thus allows us to robustify procedures which lack robustness and to increase the efficiency of procedures which are robust. We provide a general framework for the inclusion of case-specific parameters in regularization problems, describing the impact on the effective loss for a variety of regression and classification problems. We outline a computational strategy by which existing software can be modified to solve the augmented regularization problem, providing conditions under which such modification will converge to the optimum solution. We illustrate the benefits of including case-specific parameters in the context of mean regression and quantile regression through analysis of NHANES and linguistic data sets.

Keywords: Case indicator; Large margin classifier; LASSO; Leverage point; Outlier; Penalized method; Quantile regression

1. Introduction

A core part of regression analysis involves the examination and handling of individual cases (Weisberg, 2005). Traditionally, cases have been removed or downweighted as outliers or because they exert an overly large influence on the fitted regression surface. The mechanism by which they are downweighted or removed is through inclusion of case-specific indicator variables. For a least squares fit, inclusion of a case-specific indicator in the model is equivalent to removing the case from the data set; for a normal-theory, Bayesian regression analysis, inclusion of a case-specific indicator with an appropriate prior distribution is equivalent to inflating the variance of the case and hence downweighting it. The tradition in robust statistics is to handle the case-specific decisions automatically, most often by downweighting outliers according to an iterative procedure (Huber, 1981).

This idea of introducing case-specific indicators also applies naturally to criterion based regression procedures. Model selection criteria such as AIC or BIC take aim at choosing a model by attaching a penalty for each additional parameter in the model. These criteria can be applied directly to a larger space of models, namely those in which the covariates are augmented by a set of case indicators, one for each case in the data set. When considering inclusion of a case indicator for a large outlier, the criterion will judge the trade-off between the empirical risk (here, negative log-likelihood) and model complexity (here, number of parameters) as favoring the more complex model. It will include the case indicator in the model, and, with a least squares fit, effectively remove the case from the data set. A more considered approach would allow differential penalties for case-specific indicators and “real” covariates. With adjustment, one can essentially recover the familiar t-tests for outliers (e.g. Weisberg (2005)), either controlling the error rate at the level of the individual test or controlling the Bonferroni bound on the familywise error rate.

Case-specific indicators can also be used in conjunction with regularization methods such as the LASSO (Tibshirani, 1996). Again, care must be taken with details of their inclusion. If these new covariates are treated in the same fashion as the other covariates in the problem, one is making an implicit judgement that they should be penalized in the same fashion. Alternatively, one can allow a second parameter that governs the severity of the penalty for the indicators. This penalty can be set with a view of achieving robustness in the analysis, and it allows one to tap into a large, extant body of knowledge about robustness (Huber, 1981).

With regression often serving as a motivating theme, a host of methods for regularized model selection and estimation problems have been developed. These methods range broadly across the field of statistics. In addition to traditional normal-theory linear regression, we find many methods motivated by a loss which is composed of a negative log-likelihood and a penalty for model complexity. Among these regularization methods are penalized linear regression methods (e.g. ridge regression (Hoerl and Kennard, 1970) and the LASSO), regression with a nonparametric mean function (e.g.
smoothing splines (Wahba, 1990) and generalized additive models (Hastie and Tibshirani, 1990)), and extension to regression with non-normal error distributions, namely, generalized linear models (McCullagh and Nelder, 1989). In all of these cases, one can add case-specific indicators along with an appropriate penalty in order to yield an automated, robust analysis. It should be noted that, in addition to a different severity for the penalty term, the case-specific indicators sometimes require a different form for their penalty term.

A second class of procedures open to modification with case-specific indicators are those motivated by minimization of an empirical risk function. The risk function may not be a negative log-likelihood. Quantile regression (whether linear or nonlinear) falls into this category, as do modern classification techniques such as the support vector machine (Vapnik, 1998) and the psi-learner (Shen et al., 2003). Many of these procedures are designed with the robustness of the analysis in mind, often operating on an estimand defined to be the population-level minimizer of the empirical risk. The procedures are consistent across a wide variety of data-generating mechanisms and hence are asymptotically robust. They have little need of further robustification. Instead, scope for bettering these procedures lies in improving their finite sample properties. Interestingly, the finite sample performance of many procedures in this class can be improved by including case-specific indicators in the problem, along with an appropriate penalty term for them.

This paper investigates the use of case-specific indicators for improving modeling and prediction procedures in a regularization framework. Section 2 provides a formal description of the optimization problem which arises with the introduction of case-specific indicators. It also describes a computational algorithm and conditions that ensure the algorithm will obtain the global solution to the regularized problem. Section 3 explains the methodology for a selection of regression methods, motivating particular forms for the penalty terms. Section 4 describes how the methodology applies to several classification schemes. Sections 5 and 6 contain worked examples. We discuss implications of the work and potential extensions in Section 7.

2. Robust and Efficient Modeling Procedures

Suppose that we have $n$ pairs of observations denoted by $(x_i, y_i)$, $i = 1, \ldots, n$, for statistical modeling and prediction. Here $x_i = (x_{i1}, \ldots, x_{ip})^\top$ with $p$ covariates, and the $y_i$'s are responses. As in the standard setting of regression and classification, the $y_i$'s are assumed to be conditionally independent given the $x_i$'s. In this paper, we take modeling of the data as a procedure of finding a functional relationship between $x_i$ and $y_i$, $f(x;\beta)$ with unknown parameters $\beta \in \mathbb{R}^p$, that is consistent with the data. The discrepancy or lack of fit of $f$ is measured by a loss function $\mathcal{L}(y, f(x;\beta))$. Consider a modeling procedure $\mathcal{M}$, say, of finding $f$ which minimizes the empirical risk
\[ R_n(f) = \frac{1}{n}\sum_{i=1}^n \mathcal{L}(y_i, f(x_i;\beta)) \]
or its penalized version, $R_n(f) + \lambda J(f) = \frac{1}{n}\sum_{i=1}^n \mathcal{L}(y_i, f(x_i;\beta)) + \lambda J(f)$, where $\lambda$ is a positive penalty parameter for balancing the data fit and the model complexity of $f$ measured by $J(f)$. A variety of common modeling procedures are subsumed under this formulation, including ordinary linear regression, generalized linear models, nonparametric regression, and supervised learning techniques. For brevity of exposition, we identify $f$ with $\beta$ through a parametric form and view $J(f)$ as a functional depending on $\beta$. Extension of the formulation presented in this paper to a nonparametric function $f$ is straightforward via a basis expansion.

2.1. Modification of Modeling Procedures

First, we introduce case-specific parameters, $\gamma = (\gamma_1, \ldots, \gamma_n)^\top$, for the $n$ observations by augmenting the covariates with $n$ case-specific indicators. Motivated by the beneficial effects of regularization, we propose a general scheme to modify the modeling procedure $\mathcal{M}$ using the case-specific parameters $\gamma$, to enhance $\mathcal{M}$ for robustness or efficiency. Define the modification of $\mathcal{M}$ to be the procedure of finding the original model parameters, $\beta$, together with the case-specific parameters, $\gamma$, that minimize

\[ L(\beta,\gamma) = \sum_{i=1}^n \mathcal{L}\big(y_i,\, f(x_i;\beta) + \gamma_i\big) + \lambda_\beta J(f) + \lambda_\gamma J_2(\gamma). \tag{1} \]

If $\lambda_\beta$ is zero, $\mathcal{M}$ involves empirical risk minimization; otherwise, penalized risk minimization. The adjustment that the added case-specific parameters bring to the loss function $\mathcal{L}(y, f(x;\beta))$ is the same regardless of whether $\lambda_\beta$ is zero or not. In general, $J_2(\gamma)$ measures the size of $\gamma$.

When concerned with robustness, we often take $J_2(\gamma) = \|\gamma\|_1 = \sum_{i=1}^n |\gamma_i|$. A rationale for this choice is that, with added flexibility, the case-specific parameters can curb the undesirable influence of individual cases on the fitted model. To see this effect, consider minimizing $L(\hat\beta, \gamma)$ for fixed $\hat\beta$, which decouples to a minimization of $\mathcal{L}(y_i, f(x_i;\hat\beta) + \gamma_i) + \lambda_\gamma|\gamma_i|$ for each $\gamma_i$. In most cases, an explicit form of the minimizer $\hat\gamma$ of $L(\hat\beta, \gamma)$ can be obtained. Generally the $\hat\gamma_i$'s are large for observations with large “residuals” from the current fit, and the influence of those observations can be reduced in the next round of fitting $\beta$ with the $\hat\gamma$-adjusted data. Such a case-specific adjustment would be necessary only for a small number of potential outliers, and the $\ell_1$ norm, which yields sparsity, works to that effect. The adjustment in the process of sequential updating of $\beta$ is equivalent to changing the loss from $\mathcal{L}(y, f(x;\beta))$ to $\mathcal{L}(y, f(x;\beta) + \hat\gamma)$, which we call the $\gamma$-adjusted loss of $\mathcal{L}$. The $\gamma$-adjusted loss is a re-expression of $\mathcal{L}$ in terms of the adjusted residual, used as a conceptual aid to illustrate the effect of adjustment through the case-specific parameter $\gamma$ on $\mathcal{L}$. Concrete examples of the adjustments will be given in the following sections. Alternatively, one may view $\mathcal{L}_{\lambda_\gamma}(y, f(x;\beta)) := \min_{\gamma \in \mathbb{R}} \{\mathcal{L}(y, f(x;\beta) + \gamma) + \lambda_\gamma|\gamma|\} = \mathcal{L}(y, f(x;\beta) + \hat\gamma) + \lambda_\gamma|\hat\gamma|$ as a whole to be the “effective loss” in terms of $\beta$ after profiling out $\hat\gamma$. The effective loss replaces $\mathcal{L}(y, f(x;\beta))$ for the modified procedure $\mathcal{M}$.

When concerned with efficiency, we often take $J_2(\gamma) = \|\gamma\|_2^2 = \sum_{i=1}^n \gamma_i^2$. This choice has the effect of increasing the impact of selected, non-outlying cases on the analysis.

In subsequent sections, we will take a few standard statistical methods for regression and classification and illustrate how this general scheme applies. This framework allows us to see established procedures in a new light and also generates new procedures. For each method, particular attention will be paid to the form of adjustment to the loss function by the penalized case-specific parameters.

2.2. General Algorithm for Finding Solutions

Although the computational details for obtaining the solution to (1) are specific to each modeling procedure $\mathcal{M}$, it is feasible to describe a common computational strategy which is effective for a wide range of procedures. For fixed $\lambda_\beta$ and $\lambda_\gamma$, the solution pair $\hat\beta$ and $\hat\gamma$ for the modified $\mathcal{M}$ can be found with little extra computational cost. A generic algorithm below alternates estimation of $\beta$ and $\gamma$. Given $\hat\gamma$, minimization of $L(\beta, \hat\gamma)$ is done via the original modeling procedure $\mathcal{M}$. In most cases we consider, minimization of $L(\hat\beta, \gamma)$ given $\hat\beta$ entails simple adjustment of “residuals”. These considerations lead to the following iterative algorithm for finding $\hat\beta$ and $\hat\gamma$; a schematic code sketch is given after the algorithm.

(a) Initialize $\hat\gamma^{(0)} = 0$ and $\hat\beta^{(0)} = \arg\min_\beta L(\beta, 0)$ (the ordinary $\mathcal{M}$ solution).
(b) Iteratively alternate the following two steps, $m = 0, 1, \ldots$:
  • $\hat\gamma^{(m+1)} = \arg\min_\gamma L(\hat\beta^{(m)}, \gamma)$. This step modifies the “residuals”.
  • $\hat\beta^{(m+1)} = \arg\min_\beta L(\beta, \hat\gamma^{(m+1)})$. This step amounts to reapplying the procedure $\mathcal{M}$ to $\hat\gamma^{(m+1)}$-adjusted data, although the nature of the data adjustment would largely depend on $\mathcal{L}$.
(c) Terminate the iteration when $\|\hat\beta^{(m+1)} - \hat\beta^{(m)}\|^2 < \epsilon$, where $\epsilon$ is a prespecified convergence tolerance.
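As a schematic illustration of steps (a)–(c), the following sketch (ours, not the authors' software; the callables `refit` and `gamma_step` are placeholders for the base procedure $\mathcal{M}$ and the residual adjustment, respectively) shows the alternating structure in code.

```python
import numpy as np

def alternate_fit(refit, gamma_step, n, tol=1e-8, max_iter=100):
    """Generic alternating scheme of Section 2.2.

    refit(gamma)     -- returns beta-hat minimizing L(beta, gamma), i.e. the base
                        procedure M applied to the gamma-adjusted data.
    gamma_step(beta) -- returns gamma-hat minimizing L(beta, gamma) for fixed beta.
    n                -- number of cases (length of gamma).
    """
    gamma = np.zeros(n)
    beta = refit(gamma)                              # step (a): ordinary solution
    for _ in range(max_iter):                        # step (b): alternate the two updates
        gamma = gamma_step(beta)
        beta_new = refit(gamma)
        if np.sum((beta_new - beta) ** 2) < tol:     # step (c): convergence check
            return beta_new, gamma
        beta = beta_new
    return beta, gamma
```

For squared error loss with an $\ell_1$ penalty on $\gamma$, `gamma_step` reduces to soft-thresholding of the residuals, as made explicit in Section 3.1.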

In a nutshell, the algorithm attempts to find the joint minimizer $(\beta, \gamma)$ by combining the minimizers $\beta$ and $\gamma$ resulting from the projected subspaces. Convergence of the iterative updates can be established under appropriate conditions. Before we state the conditions and results for convergence, we briefly describe implicit assumptions on the loss function and the complexity or penalty terms, $J(f)$ and $J_2(\gamma)$. $\mathcal{L}(y, f(x;\beta))$ is assumed to be non-negative. For simplicity, we assume that $J(f)$ of $f(x;\beta)$ depends on $\beta$ only, and that it is of the form $J(f) = \|\beta\|_p^p$ and $J_2(\gamma) = \|\gamma\|_p^p$ for $p \ge 1$. The LASSO penalty has $p = 1$ while a ridge regression type penalty sets $p = 2$. Many other penalties of this form for $J(f)$ can be adopted as well to achieve better model selection properties or certain desirable performance of $\mathcal{M}$. Examples include those for the elastic net (Zou and Hastie, 2005), the grouped LASSO (Yuan and Lin, 2006), and the hierarchical LASSO (Zhou and Zhu, 2007). For certain combinations of the loss $\mathcal{L}$ and the penalty functionals $J(f)$ and $J_2(\gamma)$, more efficient computational algorithms can be devised, as in Hastie et al. (2004); Efron et al. (2004); Rosset and Zhu (2007). However, in an attempt to provide a general computational recipe applicable to a variety of modeling procedures which can be implemented with simple modification of existing routines, we do not pursue the optimal implementation tailored to a specific procedure in this paper.

Convexity of the loss and penalty terms plays a primary role in characterizing the solutions of the iterative algorithm. For a general reference to properties of convex functions and convex optimization, see Rockafellar (1997). If $L(\beta,\gamma)$ in (1) is continuous and strictly convex in $\beta$ and $\gamma$ for fixed $\lambda_\beta$ and $\lambda_\gamma$, the minimizer pair $(\beta,\gamma)$ in each step is properly defined. That is, given $\gamma$, there exists a unique minimizer $\beta(\gamma) = \arg\min_\beta L(\beta,\gamma)$, and vice versa. The assumption that $L(\beta,\gamma)$ is strictly convex holds if the loss $\mathcal{L}(y, f(x;\beta))$ itself is strictly convex. Also, it is satisfied when a convex $\mathcal{L}(y, f(x;\beta))$ is combined with $J(f)$ and $J_2(\gamma)$ strictly convex in $\beta$ and $\gamma$, respectively.

Suppose that $L(\beta,\gamma)$ is strictly convex in $\beta$ and $\gamma$ with a unique minimizer $(\beta^*, \gamma^*)$ for fixed $\lambda_\beta$ and $\lambda_\gamma$. Then, the iterative algorithm gives a sequence $(\hat\beta^{(m)}, \hat\gamma^{(m)})$ with strictly decreasing $L(\hat\beta^{(m)}, \hat\gamma^{(m)})$. Moreover, $(\hat\beta^{(m)}, \hat\gamma^{(m)})$ converges to $(\beta^*, \gamma^*)$. This result of convergence of the iterative algorithm is well known in convex optimization, and it is stated here without proof. Interested readers can find a formal proof in Lee et al. (2007).

3. Regression

Consider a linear model of the form $y_i = x_i^\top\beta + \epsilon_i$. Without loss of generality, we assume that each covariate is standardized. Let $X$ be an $n \times p$ design matrix with $x_i^\top$ in the $i$th row and let $Y = (y_1, \ldots, y_n)^\top$.

3.1. Least Squares Method

Taking the least squares method as a baseline modeling procedure $\mathcal{M}$, we make a link between its modification via case-specific parameters and a classical robust regression procedure. The least squares estimator of $\beta = (\beta_1, \ldots, \beta_p)^\top$ is the minimizer $\hat\beta \in \mathbb{R}^p$ of $L(\beta) = \frac{1}{2}(Y - X\beta)^\top(Y - X\beta)$. To reduce the sensitivity of the estimator to influential observations, the $p$ covariates are augmented by $n$ case indicators. Let $z_i$ be the indicator variable taking 1 for the $i$th observation and 0 otherwise, and let $\gamma = (\gamma_1, \ldots, \gamma_n)^\top$ be the coefficients of the case indicators. The proposed modification of the least squares method with $J_2(\gamma) = \|\gamma\|_1$ leads to a well-known robust regression procedure. For the robust modification, we find $\hat\beta \in \mathbb{R}^p$ and $\hat\gamma \in \mathbb{R}^n$ that minimize
\[ L(\beta,\gamma) = \frac{1}{2}\{Y - (X\beta + \gamma)\}^\top\{Y - (X\beta + \gamma)\} + \lambda_\gamma \sum_{i=1}^n |\gamma_i|, \tag{2} \]

where $\lambda_\gamma$ is a fixed regularization parameter constraining $\gamma$. Just as the ordinary LASSO with the $\ell_1$ norm penalty stabilizes regression coefficients by shrinkage and selection, the additional penalty in (2) has the same effect on $\gamma$, whose components gauge the extent of case influences. The minimizer $\hat\gamma$ of $L(\hat\beta, \gamma)$ for a fixed $\hat\beta$ can be found by soft-thresholding the residual vector $r = Y - X\hat\beta$. That is, $\hat\gamma = \mathrm{sgn}(r)(|r| - \lambda_\gamma)_+$. For observations with small residuals, $|r_i| \le \lambda_\gamma$, $\hat\gamma_i$ is set equal to zero with no effect on the current fit; for those with large residuals, $|r_i| > \lambda_\gamma$, $\hat\gamma_i$ is set equal to the residual $r_i = y_i - x_i^\top\hat\beta$ offset by $\lambda_\gamma$ towards zero. Combining $\hat\gamma$ with $\hat\beta$, we define the adjusted residuals to be $r_i^* = y_i - x_i^\top\hat\beta - \hat\gamma_i$. That is, $r_i^* = r_i$ if $|r_i| \le \lambda_\gamma$, and $r_i^* = \mathrm{sgn}(r_i)\lambda_\gamma$ otherwise. Thus, introduction of the case-specific parameters along with the $\ell_1$ penalty on $\gamma$ amounts to winsorizing the ordinary residuals. The $\gamma$-adjusted loss is equivalent to a truncated squared error loss, which is $(y - x^\top\beta)^2$ if $|y - x^\top\beta| \le \lambda_\gamma$, and $\lambda_\gamma^2$ otherwise. Figure 1 shows (a) the relationship between the ordinary residual $r$ and the corresponding $\gamma$, (b) the residual and the adjusted residual $r^*$, (c) the $\gamma$-adjusted loss as a function of $r$, and (d) the effective loss. The effective loss is $\mathcal{L}_{\lambda_\gamma}(y, x^\top\beta) = (y - x^\top\beta)^2/2$ if $|y - x^\top\beta| \le \lambda_\gamma$, and $\lambda_\gamma^2/2 + \lambda_\gamma(|y - x^\top\beta| - \lambda_\gamma)$ otherwise. This effective loss matches Huber's loss function for robust regression (Huber, 1981). As in robust regression, we choose a sufficiently large $\lambda_\gamma$ so that only a modest fraction of the residuals are adjusted. Similarly, modification of the LASSO as a penalized regression procedure yields the Huberized LASSO described by Rosset and Zhu (2004).
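As a concrete instance of the algorithm of Section 2.2 (our sketch, not the authors' code; the function names are ours), squared error loss with an $\ell_1$ penalty on $\gamma$ alternates soft-thresholding of the residuals with an ordinary least squares refit on the winsorized responses.

```python
import numpy as np

def soft_threshold(r, lam):
    # gamma-hat = sgn(r) * (|r| - lam)_+  (soft-thresholding of the residuals)
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def robust_ls(X, y, lam_gamma, tol=1e-10, max_iter=200):
    """Least squares with case-specific indicators and an l1 penalty on gamma."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]               # step (a): ordinary LS fit
    gamma = np.zeros_like(y)
    for _ in range(max_iter):
        gamma = soft_threshold(y - X @ beta, lam_gamma)        # update gamma given beta
        beta_new = np.linalg.lstsq(X, y - gamma, rcond=None)[0]  # refit on adjusted data
        if np.sum((beta_new - beta) ** 2) < tol:
            return beta_new, gamma
        beta = beta_new
    return beta, gamma

# toy usage: one gross outlier is absorbed by its gamma-hat
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=50)
y[0] += 10.0                                                   # contaminate one case
beta_hat, gamma_hat = robust_ls(X, y, lam_gamma=0.5)
```

Because the effective loss is Huber's, the returned $\hat\beta$ coincides with a Huber-type estimate with bending constant $\lambda_\gamma$, as noted above.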

3.2. Location Families

More generally, a wide class of problems can be cast in the form of a minimization of $L(\beta) = \sum_{i=1}^n g(y_i - x_i^\top\beta)$, where $g(\cdot)$ is the negative log-likelihood derived from a location family. The assumption that we have a location family implies that the negative log-likelihood is a function only of $r_i = y_i - x_i^\top\beta$. Dropping the subscript, common choices for the negative log-likelihood $g(r)$ include $r^2$ (least squares, normal distributions) and $|r|$ (least absolute deviations, Laplace distributions). Introducing the case-specific parameters $\gamma_i$, we wish to minimize
\[ L(\beta,\gamma) = \sum_{i=1}^n g(y_i - x_i^\top\beta - \gamma_i) + \lambda_\gamma\|\gamma\|_1. \]
For minimization with a fixed $\hat\beta$, the next result applies to a broad class of $g(\cdot)$ (but not to $g(r) = |r|$).

Proposition 1. Suppose that $g$ is strictly convex with the minimum at 0, and $\lim_{r\to\pm\infty} g'(r) = \pm\infty$, respectively. Then,
\[ \hat\gamma = \arg\min_\gamma \{ g(r - \gamma) + \lambda_\gamma|\gamma| \} = \begin{cases} r - g'^{-1}(\lambda_\gamma) & \text{for } r > g'^{-1}(\lambda_\gamma) \\ 0 & \text{for } g'^{-1}(-\lambda_\gamma) \le r \le g'^{-1}(\lambda_\gamma) \\ r - g'^{-1}(-\lambda_\gamma) & \text{for } r < g'^{-1}(-\lambda_\gamma). \end{cases} \]
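As a quick numerical check of Proposition 1 (our illustration; the choice $g(r) = \cosh(r) - 1$, which satisfies the assumptions and has $g'^{-1} = \operatorname{arcsinh}$, is ours and not from the paper), the closed form can be compared with a brute-force grid minimization.

```python
import numpy as np

lam = 0.7
g = lambda r: np.cosh(r) - 1.0      # strictly convex, minimum at 0, g'(r) = sinh(r)

def gamma_hat_closed_form(r, lam):
    t = np.arcsinh(lam)             # threshold g'^{-1}(lambda_gamma)
    if r > t:
        return r - t
    if r < -t:
        return r + t                # = r - g'^{-1}(-lambda_gamma)
    return 0.0

def gamma_hat_grid(r, lam, width=10.0, num=400001):
    grid = np.linspace(-width, width, num)
    obj = g(r - grid) + lam * np.abs(grid)
    return grid[np.argmin(obj)]

for r in (-3.0, -0.2, 0.0, 0.5, 2.5):
    assert abs(gamma_hat_closed_form(r, lam) - gamma_hat_grid(r, lam)) < 1e-3
```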

The first term in the summation can be decomposed into three parts. Large $r_i$ contribute $g(r_i - r_i + g'^{-1}(\lambda_\gamma)) = g(g'^{-1}(\lambda_\gamma))$. Large, negative $r_i$ contribute $g(g'^{-1}(-\lambda_\gamma))$. Those $r_i$ with intermediate values have $\hat\gamma_i = 0$ and so contribute $g(r_i)$. Thus a graphical depiction of the $\gamma$-adjusted loss is much like that in Figure 1, panel (c), where the loss is truncated above. For asymmetric distributions (and hence asymmetric log-likelihoods), the truncation point may differ for positive and negative residuals. It should be remembered that when $|r_i|$ is large, the corresponding $\hat\gamma_i$ is large, implying a large contribution of $\|\gamma\|_1$ to the overall minimization problem. The residuals will tend to be large for vectors $\beta$ that are at odds with the data. Thus, in a sense, some of the loss which seems to disappear due to the effective truncation of $g$ is shifted into the penalty term for $\gamma$. Hence the effective loss $\mathcal{L}_{\lambda_\gamma}(y, f(x;\beta)) = g(y - f(x;\beta) - \hat\gamma) + \lambda_\gamma|\hat\gamma|$ is the same as the original loss $g(y - f(x;\beta))$ when the residual is in $[g'^{-1}(-\lambda_\gamma),\, g'^{-1}(\lambda_\gamma)]$ and is linear beyond the interval. The linearized part of $g$ is joined with $g$ such that $\mathcal{L}_{\lambda_\gamma}$ is differentiable.

Computationally, the minimization of $L(\beta, \hat\gamma)$ given $\hat\gamma$ entails application of the same modeling procedure $\mathcal{M}$ with $g$ to winsorized pseudo-responses $y_i^* = y_i - \hat\gamma_i$: that is, $y_i^* = y_i$ when $g'^{-1}(-\lambda_\gamma) \le r_i \le g'^{-1}(\lambda_\gamma)$, $y_i^* = x_i^\top\hat\beta + g'^{-1}(\lambda_\gamma)$ when $r_i > g'^{-1}(\lambda_\gamma)$, and $y_i^* = x_i^\top\hat\beta + g'^{-1}(-\lambda_\gamma)$ when $r_i < g'^{-1}(-\lambda_\gamma)$.

3.3. Quantile Regression

Consider median regression with absolute deviation loss $\mathcal{L}(y, x^\top\beta) = |y - x^\top\beta|$, which is not covered in the foregoing discussion. It can be verified easily that the $\ell_1$ adjustment of $\mathcal{L}$ is void due to the piecewise linearity of the loss, reaffirming the robustness of median regression. For an effectual adjustment, the $\ell_2$ norm regularization of the case-specific parameters is considered. With the case-specific parameters $\gamma_i$, we have the following objective function for modified median regression:

\[ L(\beta,\gamma) = \sum_{i=1}^n |y_i - x_i^\top\beta - \gamma_i| + \frac{\lambda_\gamma}{2}\|\gamma\|_2^2. \tag{3} \]

For a fixed $\hat\beta$ and residual $r = y - x^\top\hat\beta$, the $\hat\gamma$ minimizing $|r - \gamma| + \frac{\lambda_\gamma}{2}\gamma^2$ is given by
\[ \hat\gamma = \frac{1}{\lambda_\gamma}\,\mathrm{sgn}(r)\, I\!\left(|r| > \frac{1}{\lambda_\gamma}\right) + r\, I\!\left(|r| \le \frac{1}{\lambda_\gamma}\right). \]
The $\gamma$-adjusted loss for median regression is
\[ \mathcal{L}(y, x^\top\beta + \hat\gamma) = \left(|y - x^\top\beta| - \frac{1}{\lambda_\gamma}\right) I\!\left(|y - x^\top\beta| > \frac{1}{\lambda_\gamma}\right), \]
as shown in Figure 4 (a). Interestingly, this $\ell_2$-adjusted absolute deviation loss is the same as the so-called “$\epsilon$-insensitive linear loss” for support vector regression (Vapnik, 1998) with $\epsilon = 1/\lambda_\gamma$. With this adjustment, the effective loss is Huberized squared error loss. The $\ell_2$ adjustment makes median regression more efficient by rounding the sharp corner of the loss, and leads to a hybrid procedure which lies between mean and median regression. Note that, to achieve the desired effect for median regression, one chooses quite a different value of $\lambda_\gamma$ than one would when adjusting squared error loss for a robust mean regression.

The modified median regression procedure can also be combined with a penalty on $\beta$ for shrinkage and/or selection. Bi et al. (2003) consider support vector regression with the $\ell_1$ norm penalty $\|\beta\|_1$ for simultaneous robust regression and variable selection. These authors rely on $\epsilon$-insensitive linear loss, which comes out as the $\gamma$-adjusted loss of the absolute deviation. In contrast, we rely on the effective loss, which produces a different solution.

In general, quantile regression (Koenker and Bassett, 1978; Koenker and Hallock, 2001) can be used to estimate conditional quantiles of $y$ given $x$. It is a useful regression technique when the assumption of normality on the distribution of the errors $\epsilon$ is not appropriate, for instance, when the error distribution is skewed or heavy-tailed. For the $q$th quantile, the check function $\rho_q$ is employed:
\[ \rho_q(r) = \begin{cases} qr & \text{for } r \ge 0 \\ -(1-q)r & \text{for } r < 0. \end{cases} \tag{4} \]
The standard procedure for the $q$th quantile regression finds $\beta$ that minimizes the sum of asymmetrically weighted absolute errors with weight $q$ on positive errors and weight $(1-q)$ on negative errors:
\[ L(\beta) = \sum_{i=1}^n \rho_q(y_i - x_i^\top\beta). \]

Consider modification of $\rho_q$ with a case-specific parameter $\gamma$ and $\ell_2$ norm regularization. Due to the asymmetry in the loss, except for $q = 1/2$, the size of the reduction in the loss by the case-specific parameter $\gamma$ would depend on its sign. Given $\hat\beta$ and residual $r = y - x^\top\hat\beta$, if $r \ge 0$, then a positive $\gamma$ would lower $\rho_q$ by $q\gamma$, while if $r < 0$, a negative $\gamma$ with the same absolute value would lower the loss by $(q-1)\gamma$. This asymmetric impact on the loss is undesirable. Instead, we create a penalty that leads to the same reduction in loss for positive and negative $\gamma$ of the same magnitude. In other words, the desired $\ell_2$ norm penalty needs to put $q\gamma_+$ and $(1-q)\gamma_-$ on an equal footing, where $\gamma_+$ and $\gamma_-$ denote the positive and negative parts of $\gamma$. This leads to the following penalty, proportional to $q^2\gamma_+^2$ and $(1-q)^2\gamma_-^2$:
\[ J_2(\gamma) := \{q/(1-q)\}\,\gamma_+^2 + \{(1-q)/q\}\,\gamma_-^2. \]
When $q = 1/2$, $J_2(\gamma)$ becomes the symmetric $\ell_2$ norm of $\gamma$. With this asymmetric penalty, given $\hat\beta$, $\hat\gamma$ is now defined as

\[ \hat\gamma = \arg\min_\gamma L_{\lambda_\gamma}(\hat\beta, \gamma) := \rho_q(r - \gamma) + \frac{\lambda_\gamma}{2} J_2(\gamma), \tag{5} \]
and is explicitly given by
\[ \hat\gamma = -\frac{q}{\lambda_\gamma}\, I\!\left(r < -\frac{q}{\lambda_\gamma}\right) + r\, I\!\left(-\frac{q}{\lambda_\gamma} \le r < \frac{1-q}{\lambda_\gamma}\right) + \frac{1-q}{\lambda_\gamma}\, I\!\left(r \ge \frac{1-q}{\lambda_\gamma}\right). \]
The effective loss $\rho_q^\gamma$ is then given by

\[ \rho_q^\gamma(r) = \begin{cases} (q-1)r - \dfrac{q(1-q)}{2\lambda_\gamma} & \text{for } r < -\dfrac{q}{\lambda_\gamma} \\[4pt] \dfrac{\lambda_\gamma}{2}\,\dfrac{1-q}{q}\, r^2 & \text{for } -\dfrac{q}{\lambda_\gamma} \le r < 0 \\[4pt] \dfrac{\lambda_\gamma}{2}\,\dfrac{q}{1-q}\, r^2 & \text{for } 0 \le r < \dfrac{1-q}{\lambda_\gamma} \\[4pt] qr - \dfrac{q(1-q)}{2\lambda_\gamma} & \text{for } r \ge \dfrac{1-q}{\lambda_\gamma}, \end{cases} \tag{6} \]
and its derivative is
\[ \psi_q^\gamma(r) = \begin{cases} q-1 & \text{for } r < -\dfrac{q}{\lambda_\gamma} \\[4pt] \lambda_\gamma\,\dfrac{1-q}{q}\, r & \text{for } -\dfrac{q}{\lambda_\gamma} \le r < 0 \\[4pt] \lambda_\gamma\,\dfrac{q}{1-q}\, r & \text{for } 0 \le r < \dfrac{1-q}{\lambda_\gamma} \\[4pt] q & \text{for } r \ge \dfrac{1-q}{\lambda_\gamma}. \end{cases} \tag{7} \]
We note that, under the assumption that the density is locally constant in a neighborhood of the quantile, the quantile remains the 0 of the effective $\psi_q^\gamma$ function. Figure 2 compares the derivative of the check loss with that of the effective loss in (6). Through penalization of a case-specific parameter, $\rho_q$ is modified to have a continuous derivative at the origin, joined by two lines with a different slope that depends on $q$. The effective loss is reminiscent of the asymmetric squared error loss $q(r_+)^2 + (1-q)(r_-)^2$ considered by Newey and Powell (1987) and Efron (1991) for the so-called expectiles. The proposed modification of the check loss produces a hybrid of the check loss and asymmetric squared error loss, however, with different weights than those for expectiles, to estimate quantiles. Redefining $J_2(\gamma)$ as the sum of the asymmetric penalties for the case-specific parameters $\gamma_i$, $i = 1, \ldots, n$, modified quantile regression is formulated as a procedure that finds $\beta$ and $\gamma$ by minimizing

\[ L(\beta,\gamma) = \sum_{i=1}^n \rho_q(y_i - x_i^\top\beta - \gamma_i) + \lambda_\gamma J_2(\gamma). \tag{8} \]
For a quick illustration of the proposed method, Figure 3 shows the intervals of adjustment $(-q/\lambda_\gamma,\,(1-q)/\lambda_\gamma)$ for a range of quantiles, sample sizes, and error distributions. In the asymptotic analysis below, $\lambda_\gamma$ is taken to grow with the sample size as $\lambda_\gamma = cn^\alpha$ for constants $c > 0$ and $\alpha > 0$. The following theorem shows that if $\alpha$ is sufficiently large, the modified quantile regression estimator $\hat\beta_n^\gamma$ is asymptotically equivalent to the standard estimator $\hat\beta_n$. Knight (1998) proves the asymptotic normality of the regression quantile estimator $\hat\beta_n$ under some mild regularity conditions. Using the arguments in Koenker (2005), we show that $\hat\beta_n^\gamma$ has the same limiting distribution as $\hat\beta_n$, and thus is $\sqrt{n}$-consistent, if $\alpha$ is sufficiently large.

Allowing a potentially different error distribution for each observation, let $Y_1, Y_2, \ldots$ be independent random variables with cdfs $F_1, F_2, \ldots$, and suppose that each $F_i$ has continuous pdf $f_i$. Assume that the $q$th conditional quantile function of $Y$ given $x$ is linear in $x$ and given by $x^\top\beta(q)$, and let $\xi_i(q) := x_i^\top\beta(q)$. Now consider the following regularity conditions:

(C-1) $f_i(\xi)$, $i = 1, 2, \ldots$, are uniformly bounded away from 0 and $\infty$ at $\xi_i$.

(C-2) $f_i(\xi)$, $i = 1, 2, \ldots$, admit a first-order Taylor expansion at $\xi_i$, and $f_i'(\xi)$ are uniformly bounded at $\xi_i$.

(C-3) There exists a positive definite matrix $D_0$ such that $\lim_{n\to\infty} n^{-1}\sum x_i x_i^\top = D_0$.

(C-4) There exists a positive definite matrix $D_1$ such that $\lim_{n\to\infty} n^{-1}\sum f_i(\xi_i)\, x_i x_i^\top = D_1$.

(C-5) $\max_{i=1,\ldots,n} \|x_i\|/\sqrt{n} \to 0$ in probability.

(C-1) and (C-3) through (C-5) are the conditions considered for the limiting distribution of the standard regression quantile estimator $\hat\beta_n$ in Koenker (2005), while (C-2) is an additional assumption that we make.

Theorem 2. Under the conditions (C-1)-(C-5), if α> 1/3, then

\[ \sqrt{n}\,(\hat\beta_n^\gamma - \beta(q)) \xrightarrow{d} N\big(0,\; q(1-q)\, D_1^{-1} D_0 D_1^{-1}\big). \]
The proof of the theorem is in the Appendix.
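For reference, the effective loss (6) and its derivative (7) can be evaluated directly. The sketch below (ours, not the authors' implementation) is a language-neutral rendering of the $\psi$-function that the data analysis of Section 5 supplies to an iteratively reweighted fitting routine.

```python
import numpy as np

def psi_q_gamma(r, q, lam):
    """Derivative (7) of the effective loss for modified quantile regression."""
    r = np.asarray(r, dtype=float)
    lo, hi = -q / lam, (1.0 - q) / lam
    out = np.where(r < 0.0, lam * (1.0 - q) / q * r, lam * q / (1.0 - q) * r)
    out = np.where(r < lo, q - 1.0, out)
    out = np.where(r >= hi, q, out)
    return out

def rho_q_gamma(r, q, lam):
    """Effective loss (6): the check loss with its corner replaced by quadratic pieces."""
    r = np.asarray(r, dtype=float)
    lo, hi = -q / lam, (1.0 - q) / lam
    out = np.where(r < 0.0,
                   0.5 * lam * (1.0 - q) / q * r**2,
                   0.5 * lam * q / (1.0 - q) * r**2)
    out = np.where(r < lo, (q - 1.0) * r - q * (1.0 - q) / (2.0 * lam), out)
    out = np.where(r >= hi, q * r - q * (1.0 - q) / (2.0 * lam), out)
    return out

# the usual check loss is recovered as lam grows
r = np.linspace(-2, 2, 5)
print(rho_q_gamma(r, q=0.25, lam=100.0))
```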

4. Classification

Now suppose that the $y_i$'s indicate binary outcomes. For modeling and prediction of the binary responses, we mainly consider margin-based procedures such as logistic regression, support vector machines (Vapnik, 1998), and boosting (Freund and Schapire, 1997). These procedures can be modified by the addition of case indicators.

4.1. Logistic Regression

Although it is customary to label a binary outcome as 0 or 1 in logistic regression, we instead adopt the symmetric labels $\{-1, 1\}$ for the $y_i$'s. The symmetry facilitates comparison of different classification procedures. Logistic regression takes the negative log-likelihood as a loss for estimation of the logit $f(x) = \log[p(x)/(1-p(x))]$. The loss, $\mathcal{L}(y, f(x)) = \log[1 + \exp(-yf(x))]$, can be viewed as a function of the so-called margin, $yf(x)$. This functional margin $yf(x)$ is a pivotal quantity for defining a family of loss functions in classification, similar to the residual in regression. Logistic regression can be modified with case indicators:

\[ L(\beta_0, \beta, \gamma) = \sum_{i=1}^n \log\big[1 + \exp\big(-y_i\{f(x_i;\beta_0,\beta) + \gamma_i\}\big)\big] + \lambda_\gamma\|\gamma\|_1, \tag{9} \]

where $f(x;\beta_0,\beta) = \beta_0 + x^\top\beta$. When it is clear in the context, $f(x)$ will be used as a short notation for $f(x;\beta_0,\beta)$, a discriminant function in general. For fixed $\hat\beta_0$ and $\hat\beta$, $\gamma_i$ is determined by minimizing

\[ \log\big[1 + \exp\big(-y_i\{f(x_i;\hat\beta_0,\hat\beta) + \gamma_i\}\big)\big] + \lambda_\gamma|\gamma_i|. \]
First note that the minimizer $\gamma_i$ must have the same sign as $y_i$. Letting $\tau = yf$ and assuming that $0 < \lambda_\gamma < 1$, we have $\arg\min_{\gamma \ge 0}\, \{\log[1 + \exp(-\tau - \gamma)] + \lambda_\gamma|\gamma|\} = \log\{(1-\lambda_\gamma)/\lambda_\gamma\} - \tau$ if $\tau \le \log\{(1-\lambda_\gamma)/\lambda_\gamma\}$, and 0 otherwise. This yields a truncated negative log-likelihood given by
\[ \mathcal{L}(y, f(x)) = \begin{cases} \log\{1 + \lambda_\gamma/(1-\lambda_\gamma)\} & \text{if } yf(x) \le \log\{(1-\lambda_\gamma)/\lambda_\gamma\}, \\ \log[1 + \exp(-yf(x))] & \text{otherwise,} \end{cases} \]
as the $\gamma$-adjusted loss. This adjustment is reminiscent of Pregibon (1982)'s proposal for robust logistic regression, which tapers the deviance function so as to downweight extreme observations. See Figure 4 (b) for the $\gamma$-adjusted loss (the dashed line), where $\eta_\lambda := \log\{(1-\lambda_\gamma)/\lambda_\gamma\}$, and it is a decreasing function of $\lambda_\gamma$; $\lambda_\gamma$ determines the level of truncation of the loss. As $\lambda_\gamma$ tends to 1, there is no truncation. Figure 4 (b) also shows the effective loss (the solid line) for the $\ell_1$ adjustment, which linearizes the negative log-likelihood below $\eta_\lambda$.
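In code (our illustration, not the authors' software), the $\hat\gamma$ update written in margin units $y_i\gamma_i$ and the truncated ($\gamma$-adjusted) negative log-likelihood are:

```python
import numpy as np

def gamma_hat_logistic(margin, lam):
    """y_i * gamma_i for the l1-penalized logistic case; requires 0 < lam < 1."""
    eta = np.log((1.0 - lam) / lam)              # truncation point eta_lambda
    return np.maximum(eta - margin, 0.0)

def adjusted_logistic_loss(margin, lam):
    """gamma-adjusted (truncated) negative log-likelihood, margin = y * f(x)."""
    eta = np.log((1.0 - lam) / lam)
    return np.where(margin <= eta,
                    np.log(1.0 + lam / (1.0 - lam)),
                    np.logaddexp(0.0, -margin))  # stable log(1 + exp(-margin))

tau = np.linspace(-5, 5, 11)
print(adjusted_logistic_loss(tau, lam=0.2))      # flat at log(1.25) below eta = log(4)
```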

4.2. Large Margin Classifiers

With the symmetric class labels, the foregoing characterization of the case-specific parameter $\gamma$ in logistic regression can be easily generalized to various margin-based classification procedures. In classification, potential outliers are those cases with large negative margins. Let $g(\tau)$ be a loss function of the margin $\tau = yf(x)$. The following proposition, analogous to Proposition 1, holds for a general family of loss functions.

Proposition 3. Suppose that $g$ is convex and monotonically decreasing in $\tau$, and $g'$ is continuous. Then, for $\lambda_\gamma < \lim_{\tau\to-\infty}\{-g'(\tau)\}$,
\[ \hat\gamma = \arg\min_\gamma\{ g(\tau + \gamma) + \lambda_\gamma|\gamma| \} = \begin{cases} g'^{-1}(-\lambda_\gamma) - \tau & \text{for } \tau \le g'^{-1}(-\lambda_\gamma) \\ 0 & \text{for } \tau > g'^{-1}(-\lambda_\gamma). \end{cases} \]

The proof is straightforward. Examples of the margin-based loss $g$ satisfying the assumption include the exponential loss $g(\tau) = \exp(-\tau)$ in boosting, the squared hinge loss $g(\tau) = \{(1-\tau)_+\}^2$ in the support vector machine (Lee and Mangasarian, 2001), and the negative log-likelihood $g(\tau) = \log(1+\exp(-\tau))$ in logistic regression. Although their theoretical targets are different, all of these loss functions are truncated above for large negative margins when adjusted by $\gamma$. Thus, the effective loss $\mathcal{L}_{\lambda_\gamma}(y, f(x)) = g(yf(x) + \hat\gamma) + \lambda_\gamma|\hat\gamma|$ is obtained by linearizing $g$ for $yf(x) < g'^{-1}(-\lambda_\gamma)$.
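For instance (our illustration), with the exponential loss $g(\tau) = \exp(-\tau)$ we have $g'(\tau) = -\exp(-\tau)$ and hence $g'^{-1}(-\lambda_\gamma) = -\log\lambda_\gamma$, so Proposition 3 and the resulting adjusted and effective losses take a simple closed form:

```python
import numpy as np

def gamma_hat_exp(tau, lam):
    """Prop. 3 for g(tau)=exp(-tau): truncation point is tau = -log(lam)."""
    return np.maximum(-np.log(lam) - tau, 0.0)

def adjusted_exp_loss(tau, lam):
    """gamma-adjusted exponential loss: exp(-tau) truncated above at the value lam."""
    return np.minimum(np.exp(-tau), lam)

def effective_exp_loss(tau, lam):
    """effective loss: exponential for tau >= -log(lam), linear (slope -lam) below."""
    thresh = -np.log(lam)
    return np.where(tau < thresh, lam * (1.0 + thresh - tau), np.exp(-tau))
```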

4.3. Support Vector Machines

As a special case of a large margin classifier, the linear support vector machine (SVM) looks for the optimal hyperplane $f(x;\beta_0,\beta) = \beta_0 + x^\top\beta = 0$ minimizing

\[ L_\lambda(\beta_0,\beta) = \sum_{i=1}^n \big[1 - y_i f(x_i;\beta_0,\beta)\big]_+ + \frac{\lambda}{2}\|\beta\|_2^2, \tag{10} \]
where $[t]_+ = \max(t, 0)$ and $\lambda > 0$ is a regularization parameter. Since the hinge loss for the SVM, $g(\tau) = (1-\tau)_+$, is piecewise linear, its linearization with $\|\gamma\|_1$ is void, indicating that it has little need of further robustification. Instead, we consider modification of the hinge loss with $\|\gamma\|_2^2$. This modification is expected to have the same effect of improving efficiency as in quantile regression. Using the case indicators $z_i$ and their coefficients $\gamma_i$, we modify (10), arriving at the problem of minimizing
\[ L(\beta_0,\beta,\gamma) = \sum_{i=1}^n \big[1 - y_i\{f(x_i;\beta_0,\beta) + \gamma_i\}\big]_+ + \frac{\lambda_\beta}{2}\|\beta\|_2^2 + \frac{\lambda_\gamma}{2}\|\gamma\|_2^2. \tag{11} \]

For fixed $\hat\beta_0$ and $\hat\beta$, the minimizer $\hat\gamma$ of $L(\hat\beta_0, \hat\beta, \gamma)$ is obtained by solving the decoupled optimization problem of minimizing $[1 - y_i f(x_i;\hat\beta_0,\hat\beta) - y_i\gamma_i]_+ + \frac{\lambda_\gamma}{2}\gamma_i^2$ for each $\gamma_i$. With an argument similar to that for logistic regression, the minimizer $\hat\gamma_i$ should have the same sign as $y_i$. Let $\xi = 1 - yf$. A simple calculation shows that
\[ \arg\min_{\gamma \ge 0}\, [\xi - \gamma]_+ + \frac{\lambda_\gamma}{2}\gamma^2 = \begin{cases} 0 & \text{if } \xi \le 0 \\ \xi & \text{if } 0 < \xi < 1/\lambda_\gamma \\ 1/\lambda_\gamma & \text{if } \xi \ge 1/\lambda_\gamma. \end{cases} \]
Hence, the increase in margin $y_i\hat\gamma_i$ due to inclusion of $\gamma$ is given by
\[ \{1 - y_i f(x_i)\}\, I\big(0 < 1 - y_i f(x_i) < 1/\lambda_\gamma\big) + \frac{1}{\lambda_\gamma}\, I\big(1 - y_i f(x_i) \ge 1/\lambda_\gamma\big). \]

The $\gamma$-adjusted hinge loss is $\mathcal{L}(y, f(x)) = [1 - 1/\lambda_\gamma - yf(x)]_+$, with the hinge lowered by $1/\lambda_\gamma$, as shown in Figure 4 (c) (the dashed line). The effective loss (the solid line in the figure) is then given by a smooth function, with the joint replaced by a quadratic piece between $1 - 1/\lambda_\gamma$ and $1$ and linear beyond the interval.
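A direct rendering of these formulas (our illustration, not the authors' code):

```python
import numpy as np

def gamma_hat_hinge(margin, lam):
    """y_i * gamma_i for the l2-adjusted hinge loss, with xi = 1 - margin."""
    xi = 1.0 - margin
    return np.clip(xi, 0.0, 1.0 / lam)

def effective_hinge_loss(margin, lam):
    """Effective loss: hinge with its corner smoothed by a quadratic piece
    on [1 - 1/lam, 1]; linear below that interval, zero above it."""
    xi = 1.0 - margin                      # 1 - y*f(x)
    return np.where(xi <= 0.0, 0.0,
                    np.where(xi < 1.0 / lam, 0.5 * lam * xi**2, xi - 0.5 / lam))
```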

5. Analysis of the NHANES Data

We numerically compare standard quantile regression with modified quantile regression for analysis of real data. The Centers for Disease Control and Prevention conduct the National Health and Nutrition Examination Survey (NHANES), a large scale survey designed to monitor the health and nutrition of residents of the United States. Many are concerned about the record levels of obesity in the population, and the survey contains information on height and weight of individuals, in addition to a variety of dietary and health-related questions. Obesity is defined through body mass index (BMI) in kg/m², a measure which adjusts weight for height. In this analysis, we describe the relationship between height and BMI among the 5938 males over the age of 18 in the aggregated NHANES data sets from 1999, 2001, 2003 and 2005. Since BMI is weight adjusted for height, the null expectation is that BMI and height are unrelated.

We fit a nonparametric quantile regression model to the data. The model is a six knot regression spline using the natural basis expansion. The knots (held constant across quantiles) were chosen by eye. The modified quantile regression method was directly implemented by specifying the effective $\psi$-function $\psi_q^\gamma$, the derivative of the effective loss, in the rlm function in the R package MASS. We undertook extensive simulation studies (available in Jung et al. (2009)) to establish guidelines for selection of the penalty parameter $\lambda_\gamma$. The studies encompassed a range of sample sizes, from $10^2$ to $10^4$, a variety of quantiles, and distributions exhibiting symmetry, varying degrees of asymmetry, and a variety of tail behaviors. An empirical rule was established via a (robust) regression analysis. The analysis considered $\lambda_\gamma$ of the form $c_q n^\alpha/\hat\sigma$, where $c_q$ is a constant depending on $q$ and $\hat\sigma$ is a robust estimate of the scale of the error distribution. The goal of the analysis was to find $\lambda_\gamma$ which, across a broad range of conditions, resulted in an MSE near the condition-specific minimum MSE. This led to the rule, with $\alpha = 0.30$, of $c_q \approx 0.5\exp(-2.118 - 1.097q)$ for $q < 0.5$ and $c_q \approx 0.5\exp(-2.118 - 1.097(1-q))$ for $q \ge 0.5$. The simulation studies show that this choice of penalty parameter results in an accurate estimator of the quantile surface. The rule was used for the NHANES data analysis.

Figure 5 displays the fits from standard (QR) and modified (QR.M) quantile regressions for the quantiles between 0.1 and 0.9 in steps of 0.05. The fitted curves show a slight upward trend, some curvature overall, and mildly increasing spread as height increases. There is a noticeable bump upward in the distribution of BMI for heights near 1.73 meters. The differences between the two methods of fitting the quantile regressions are most apparent in the tails, for example the 0.6th and 0.85th quantiles for large heights.

The predictive performance of the standard and modified quantile regressions is compared in Figure 6. To compare the methods, 10-fold cross validation was repeated 500 times for different splits of the data. Each time, a cross-validation score was computed as

\[ \mathrm{CV} = \frac{1}{n}\sum_{i=1}^n \rho_q(y_i - \hat{y}_i), \tag{12} \]

where $y_i$ is the observed BMI for an individual in the hold-out sample, $\hat{y}_i$ is the fitted value under QR or QR.M, and the sum runs over the hold-out sample. The figure contains plots of the 500 CV scores. The great majority of CV scores are to the lower right side of the 45 degree line, indicating that the modified quantile regression outperforms the standard method, even when the QR empirical risk function is used to evaluate performance. The pattern shown in these panels is consistent across other quantiles (not shown here). The pattern becomes a bit stronger when the QR.M empirical risk function is used to evaluate performance.

Modified quantile regression has an additional advantage which is apparent for small and large heights. The standard quantile regression fits show several crossings of estimated quantiles, while crossing behavior is reduced considerably with modified quantile regression. Crossed quantiles correspond to a claim that a lower quantile lies above a higher quantile, contradicting the laws of probability. Figure 7 shows this behavior. Fixes for this behavior have been proposed (e.g., He (1997)), but we consider it desirable to lessen crossing without any explicit fix. The reduction in crossing holds up across other data sets that we have examined and with regression models that differ in their details.
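The penalty-selection rule just described can be written down directly. The sketch below (ours, not the authors' code) computes $\lambda_\gamma = c_q n^\alpha/\hat\sigma$ with $\alpha = 0.30$ and the stated $c_q$, together with the check-loss CV score in (12); the normalized MAD used for $\hat\sigma$ is our choice, as the paper specifies only "a robust estimate of the scale."

```python
import numpy as np

def c_q(q):
    """Empirical constant c_q from the rule quoted in Section 5."""
    qq = q if q < 0.5 else 1.0 - q
    return 0.5 * np.exp(-2.118 - 1.097 * qq)

def lambda_gamma(q, n, residuals, alpha=0.30):
    """lambda_gamma = c_q * n^alpha / sigma-hat (sigma-hat via normalized MAD, our choice)."""
    sigma_hat = 1.4826 * np.median(np.abs(residuals - np.median(residuals)))
    return c_q(q) * n**alpha / sigma_hat

def check_loss_cv(y, y_hat, q):
    """Cross-validation score (12): average check loss on a hold-out sample."""
    r = np.asarray(y) - np.asarray(y_hat)
    return np.mean(np.where(r >= 0, q * r, (q - 1.0) * r))
```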

6. Analysis of Language Data

Balota et al. (2004) conducted an extensive lexical decision experiment in which subjects were asked to identify whether a string of letters was an English word or a non-word. The words were monosyllabic, and the non-words were constructed to closely resemble words on a number of linguistic dimensions. Two groups were studied, college students and older adults. The data consist of response times by word, averaged over the thirty subjects in each group. For each word, a number of covariates was recorded. Goals of the experiment include determining which features of a word (i.e., covariates) affect response time, and whether the active features affect response time in the same fashion for college students and older adults. The authors make a case for the need to conduct and analyze studies with regression techniques in mind, rather than simpler ANOVA techniques.

Baayen (2007) conducts an extensive analysis of a slightly modified data set which is available in his languageR package. In his analysis, he creates and selects variables to include in a regression model, addresses issues of nonlinearity, collinearity and interaction, and removes selected cases as being influential and/or outlying. He trims a total of 87 of the 4568 cases. The resulting model, based on "typical" words, is used to address issues of linguistic importance. It includes seventeen basic covariates which enter the model as linear terms, a non-linear term for the written frequency of a word (fit as a restricted cubic spline with five knots), and an interaction term between the age group and the (nonlinear) written frequency of the word. We take his final model as an expert fit, and use it as a target to compare the performance of the robust LASSO to the LASSO.

We consider two sets of potential covariates for the model. The small set consists of Baayen's 17 basic covariates and three additional covariates representing a squared term for written frequency and the interaction between age group and the linear and squared terms for written frequency. Age group has been coded as ±1 for the interactions. The large set augments these covariates with nine additional covariates that were not included in Baayen's final model. Baayen excluded some of these covariates for a lack of significance, others because of collinearity.

The LASSO and robust LASSO were fit to all 4568 cases with the small and large sets of covariates. For the robust LASSO, the iterative algorithm in Section 2 was implemented by using LARS (Efron et al., 2004) as the baseline modeling procedure and winsorizing residuals with $\lambda_\gamma$ as a bending constant. The bending constant was taken to be scale invariant, so that $\lambda_\gamma = k\hat\sigma$, where $k$ is a constant and $\hat\sigma$ is a robust scale estimate. The standard literature (Huber, 1981) suggests that good choices of $k$ lie in the range from 1 to 2. A variety of values were considered for $k$ in this analysis. In all cases, the model was selected via the minimum Cp criterion. The estimated error standard deviations were approximately 0.0775 and 0.0789 for both the LASSO and the robust LASSO models with the small and large sets of covariates, respectively. A comparison of the models in terms of sum of squared deviations (SSD) from Baayen's fitted values, with the sum excluding those cases that he removed as outliers and/or influential cases, shows mixed results, with the robust LASSO having a smaller value of SSD for some values of k and larger SSD for other values.
The comparison highlights the discontinuity of the fitted surface in $k$ whenever a different model is selected by the minimum Cp criterion. In order to assess the relative performance of the two methods, we must average over a number of data sets. To obtain the needed replicates and to examine the performance of the robust LASSO with a smaller sample size, we conducted a simulation study.

For one replicate of the simulation, we sampled 400 cases from the data set and fit the LASSO and robust LASSO. Models were selected with the minimum Cp criterion, and SSD computed on the full data set, removing the cases that Baayen did. A summary of SSD is presented in Figure 8. The figure also presents confidence intervals, based on 5,000 replicates, for the robust LASSO. We see that the robust LASSO outperforms the LASSO for both the small and large covariate sets over a wide range of $k$.

In the simulation study, we also tracked the covariates in the selected model, with a summary displayed for $k = 1.6$ in Figure 9. Two key features of the coefficients are whether or not they are zero and how large they are. Several interesting features appear. First, there is little apparent difference between the 20 covariates in the small set (covariates 1 through 20) and the additional nine covariates in the large set (covariates 21 through 29). Both sets of covariates often have coefficients near 0, as indicated by the left-hand panel. The vertical bars, extending from the 10th to the 90th percentiles for a given coefficient, show many effects that are, on the whole, small. The right-hand panel presents the fraction of replicates in which a given coefficient is non-zero in the model selected by the minimum Cp criterion. Again, we see little difference between the two sets of covariates. Overall, covariates in the small set had non-zero coefficients 0.73 of the time under the LASSO and 0.75 of the time under the robust LASSO. The remaining covariates had non-zero coefficients 0.68 and 0.70 of the time, respectively. The models selected by the robust LASSO averaged 21.4 covariates; those selected by the LASSO averaged 20.7 covariates. Second, in spite of the relatively high frequency with which covariates were "included" in the model, they often had very small coefficients, suggesting that there is substantial uncertainty about the "correct" model. Third, the LASSO and the robust LASSO provide coefficients of similar magnitude and with similar non-zero frequency. For this data set and simulation, we expect this behavior, as the fraction of outlying data is small and the magnitude of the outliers is modest. Even in this difficult situation for a robust procedure, as Figure 8 shows, the robust LASSO can better match the expert fit. Fourth, examination of particular covariates demonstrates the appeal of regularization methods. Covariates 20 (WrittenFrequency) and 29 (Familiarity) address the same issue. See Balota et al. (2004) for a more complete description of the covariates. Both covariates appear in nearly all of the models for both the LASSO and the robust LASSO. Subjects are able to decide that a familiar word is a word more quickly (and more accurately) than an unfamiliar word, and so we see negative coefficients for both covariates. Although there seems to be no debate on whether this conceptual effect of similarity exists, there are a variety of viewpoints on how to best capture this effect. Regularization methods facilitate inclusion of a suite of covariates that address a single conceptual effect. The robust LASSO retains this ability.
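A schematic of the robust LASSO fit used here is sketched below. This is our illustration, not the authors' code: the paper relies on LARS with a minimum Cp stopping rule, whereas here scikit-learn's `Lasso` with a fixed regularization parameter `alpha` stands in for the base procedure, and the normalized MAD is our choice of robust scale estimate.

```python
import numpy as np
from sklearn.linear_model import Lasso   # stands in for the LARS-based base procedure

def robust_lasso(X, y, alpha, k=1.6, max_iter=50, tol=1e-8):
    """Iterate: winsorize residuals at lambda_gamma = k * sigma-hat, then refit the
    LASSO on the gamma-adjusted (winsorized) responses.  Model selection over the
    regularization path (Cp in the paper) is omitted in this sketch."""
    model = Lasso(alpha=alpha).fit(X, y)
    coef = model.coef_.copy()
    for _ in range(max_iter):
        r = y - model.predict(X)
        sigma_hat = 1.4826 * np.median(np.abs(r - np.median(r)))   # robust scale (MAD)
        lam_gamma = k * sigma_hat                                  # bending constant
        gamma = np.sign(r) * np.maximum(np.abs(r) - lam_gamma, 0.0)
        model = Lasso(alpha=alpha).fit(X, y - gamma)               # refit on adjusted data
        if np.sum((model.coef_ - coef) ** 2) < tol:
            break
        coef = model.coef_.copy()
    return model, gamma
```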

7. Discussion

In the preceding sections, we have laid out an approach to modifying modeling procedures. The approach is based on the creation of case-specific covariates which are then regularized. With appropriate choices of penalty terms, the addition of these covariates allows us to robustify those procedures which lack robustness and also allows us to improve the efficiency of procedures which are very robust, but not particularly efficient. The method is fully compatible with regularized estimation methods. In this case, the case-specific covariates are merely included as part of the regularization. The techniques are easy to implement, as they often require little modification of existing software. In some cases, there is no need for modification of software, as one merely feeds a modified data set into existing routines.

The motivation behind this work is a desire to move relatively automated modeling procedures in the direction of traditional data analysis (e.g., Weisberg (2004)). An important component of this type of analysis is the ability to take different looks at a data set. These different looks may suggest creation of new variates and differential handling of individual cases or groups of cases. Robust methods allow us to take such a look, even when data sets are large. Coupling robust regression techniques with the ability to examine an entire solution path provides a sharper view of the impact of unusual cases on the analysis. A second motivation for the work is the desire to improve robust, relatively nonparametric methods. This is accomplished by introducing case-specific parameters in a controlled fashion whereby the finite sample performance of estimators is improved.

The perspective provided by this work suggests several directions for future research. Adaptive penalties, whether used for robustness or efficiency, can be designed to satisfy specified invariances. The asymmetric $\ell_2$ penalty for modified quantile regression was designed to satisfy a specified invariance. For a locally constant residual density, it kept the 0 of the $\psi_q^\gamma$ function invariant as the width of the interval of adjustment varied. Specific, alternative forms of invariance for quantile regression are suggested by consideration of parametric driving forms for the residual distribution. A motivating parametric model, coupled with invariance of the 0 of the $\psi_q^\gamma$ function to the size of the penalty term $\lambda_\gamma$, yields a path of penalties. Increasing the size of the case-specific penalty at an appropriate rate leads to asymptotic equivalence with the quantile regression estimator. This allows one to fit the model nonparametrically while tapping into an approximate parametric form to enhance finite sample performance. Similarly, when case-specific penalties are applied to a model such as the generalized linear model, the asymmetry of the likelihood, coupled with invariance, suggests an asymmetric form for the $\ell_1$ penalty used to enhance robustness of inference.

Following development of the technique for quantile regression, one can apply the adaptive loss paradigm for model assessment and selection. For example, in cross-validation, a summary of a model's fit is computed as an out-of-sample estimate of empirical risk and the evaluation is used for choosing the model (parameter value) with the smallest estimated risk. For model averaging, estimated risks are converted into weights which are then attached to model-specific predictions that are then combined to yield an overall prediction. The use of modified loss functions for estimation of risks is expected to improve stability and efficiency in model evaluation and selection.

Appendix

Proof of Theorem 2

Let $u_i := y_i - x_i^\top\beta(q)$ and consider the objective function

\[ Z_n^\gamma(\delta) := \sum_{i=1}^n \big\{\rho_q^\gamma(u_i - x_i^\top\delta/\sqrt{n}) - \rho_q^\gamma(u_i)\big\}. \tag{13} \]

Note that $Z_n^\gamma(\delta)$ is minimized at $\hat\delta_n := \sqrt{n}(\hat\beta_n^\gamma - \beta(q))$, and the limiting distribution of $\hat\delta_n$ is determined by the limiting behavior of $Z_n^\gamma(\delta)$. To study the limit of $Z_n^\gamma(\delta)$, decompose $Z_n^\gamma(\delta)$ as

\[ \begin{aligned} Z_n^\gamma(\delta) &= \sum_{i=1}^n \big\{\rho_q^\gamma(u_i - x_i^\top\delta/\sqrt{n}) - \rho_q(u_i - x_i^\top\delta/\sqrt{n})\big\} + \sum_{i=1}^n \big\{\rho_q(u_i - x_i^\top\delta/\sqrt{n}) - \rho_q(u_i)\big\} \\ &= \sum_{i=1}^n \big\{\rho_q^\gamma(u_i - x_i^\top\delta/\sqrt{n}) - \rho_q(u_i - x_i^\top\delta/\sqrt{n})\big\} + Z_n(\delta), \end{aligned} \]

where $Z_n(\delta) := \sum_{i=1}^n\big\{\rho_q(u_i - x_i^\top\delta/\sqrt{n}) - \rho_q(u_i)\big\}$. By showing that the first sum converges to zero in probability, up to a sequence of constants that do not depend on $\delta$, we will establish the asymptotic equivalence of $\hat\beta_n^\gamma$ to $\hat\beta_n$. Given $\lambda_\gamma = cn^\alpha$, first observe that

\[
\begin{aligned}
&E\big\{\rho_q^\gamma(u_i - x_i^\top\delta/\sqrt{n}) - \rho_q(u_i - x_i^\top\delta/\sqrt{n})\big\} + \frac{q(1-q)}{2\lambda_\gamma} \\
&\quad= \int_{x_i^\top\delta/\sqrt{n}}^{(1-q)/\lambda_\gamma + x_i^\top\delta/\sqrt{n}} \left\{ \frac{\lambda_\gamma}{2}\frac{q}{1-q}\Big(u - \frac{x_i^\top\delta}{\sqrt{n}}\Big)^2 - q\Big(u - \frac{x_i^\top\delta}{\sqrt{n}}\Big) + \frac{q(1-q)}{2\lambda_\gamma} \right\} f_i(\xi_i + u)\,du \\
&\qquad+ \int_{-q/\lambda_\gamma + x_i^\top\delta/\sqrt{n}}^{x_i^\top\delta/\sqrt{n}} \left\{ \frac{\lambda_\gamma}{2}\frac{1-q}{q}\Big(u - \frac{x_i^\top\delta}{\sqrt{n}}\Big)^2 - (q-1)\Big(u - \frac{x_i^\top\delta}{\sqrt{n}}\Big) + \frac{q(1-q)}{2\lambda_\gamma} \right\} f_i(\xi_i + u)\,du \\
&\quad= \int_{x_i^\top\delta/\sqrt{n}}^{(1-q)/\lambda_\gamma + x_i^\top\delta/\sqrt{n}} \frac{\lambda_\gamma}{2}\frac{q}{1-q}\Big(u - \frac{x_i^\top\delta}{\sqrt{n}} - \frac{1-q}{\lambda_\gamma}\Big)^2 f_i(\xi_i + u)\,du \\
&\qquad+ \int_{-q/\lambda_\gamma + x_i^\top\delta/\sqrt{n}}^{x_i^\top\delta/\sqrt{n}} \frac{\lambda_\gamma}{2}\frac{1-q}{q}\Big(u - \frac{x_i^\top\delta}{\sqrt{n}} + \frac{q}{\lambda_\gamma}\Big)^2 f_i(\xi_i + u)\,du.
\end{aligned}
\]

Using a first-order Taylor expansion of fi at ξi from the condition (C-2) and the expression above, we have

\[ \sum_{i=1}^n E\big\{\rho_q^\gamma(u_i - x_i^\top\delta/\sqrt{n}) - \rho_q(u_i - x_i^\top\delta/\sqrt{n})\big\} + \frac{nq(1-q)}{2\lambda_\gamma} = \frac{q(1-q)}{6c^2 n^{2\alpha}}\sum_{i=1}^n f_i(\xi_i) + \frac{q(1-q)}{6c^2 n^{2\alpha}}\sum_{i=1}^n \frac{f_i'(\xi_i)\, x_i^\top\delta}{\sqrt{n}} + o(n^{-2\alpha + 1/2}). \]

Note that $\sum_{i=1}^n f_i'(\xi_i)\, x_i^\top\delta/\sqrt{n} = O(\sqrt{n})$, as $f_i'(\xi_i)$, $i = 1, \ldots, n$, are uniformly bounded from the condition (C-2), and $|x_i^\top\delta| \le \|x_i\|_2\|\delta\|_2 \le (\|x_i\|_2^2 + \|\delta\|_2^2)/2$ while $\sum_{i=1}^n \|x_i\|_2^2 = O(n)$ from the condition (C-3). Taking $C_n := -q(1-q)/(2cn^{\alpha-1}) + q(1-q)/(6c^2 n^{2\alpha})\sum_{i=1}^n f_i(\xi_i)$, we have that
\[ \sum_{i=1}^n E\big\{\rho_q^\gamma(u_i - x_i^\top\delta/\sqrt{n}) - \rho_q(u_i - x_i^\top\delta/\sqrt{n})\big\} - C_n \to 0 \quad \text{if } \alpha > 1/4. \]

Similarly, it can be shown that

\[ \mathrm{Var}\left[\sum_{i=1}^n \big\{\rho_q^\gamma(u_i - x_i^\top\delta/\sqrt{n}) - \rho_q(u_i - x_i^\top\delta/\sqrt{n})\big\}\right] = \sum_{i=1}^n \frac{q^2(1-q)^2 f_i(\xi_i)}{20 c^3 n^{3\alpha}} + o(n^{-3\alpha+1}) \to 0 \quad \text{if } \alpha > 1/3. \]
Thus, if $\alpha > 1/3$,

\[ \sum_{i=1}^n \big\{\rho_q^\gamma(u_i - x_i^\top\delta/\sqrt{n}) - \rho_q(u_i - x_i^\top\delta/\sqrt{n})\big\} - C_n \to 0 \quad \text{in probability.} \]

This implies that the limiting behavior of $Z_n^\gamma(\delta)$ is the same as that of $Z_n(\delta)$. From the proof of Theorem 4.1 in Koenker (2005), $Z_n(\delta) \xrightarrow{d} -\delta^\top W + \frac{1}{2}\delta^\top D_1\delta$, where $W \sim N(0, q(1-q)D_0)$. By the convexity argument in Koenker (2005) (see also Pollard (1991), Hjort and Pollard (1993), and Knight (1998)), $\hat\delta_n$, the minimizer of $Z_n^\gamma(\delta)$, converges in distribution to $\delta_0 := D_1^{-1}W$, the unique minimizer of $-\delta^\top W + \frac{1}{2}\delta^\top D_1\delta$. This completes the proof.

References

Baayen, R. (2007). Analyzing Linguistic Data: a practical introduction to statistics. Cambridge, England: Cambridge University Press.

Balota, D., M. Cortese, S. Sergent-Marshall, D. Spieler, and M. Yap (2004). Visual word recognition of single-syllable words. Journal of Experimental Psychology 133, 283–316.

Bartlett, P. L., M. I. Jordan, and J. D. McAuliffe (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association 101 (473), 138–156.

Bi, J., K. Bennett, M. Embrechts, C. Breneman, and M. Song (2003). Dimensionality reduction via sparse support vector machines. Journal of Machine Learning Research 3, 1229–1243.

Efron, B. (1991). Regression using asymmetric squared error loss. Statistica Sinica 1 (1), 93–125.

Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. Annals of Statistics 32 (2), 407–499.

Freund, Y. and R. E. Schapire (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 (1), 119–139.

Hastie, T., S. Rosset, R. Tibshirani, and J. Zhu (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research 5, 1391–1415.

Hastie, T. and R. Tibshirani (1990). Generalized Additive Models. Chapman & Hall/CRC.

He, X. (1997). Quantile curves without crossing. The American Statistician 51 (2), 186–192.

Hjort, N. L. and D. Pollard (1993). Asymptotics for minimisers of convex processes. Technical report.

Hoerl, A. and R. Kennard (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12 (3), 55–67.

Huber, P. J. (1981). Robust Statistics. New York: John Wiley & Sons.

Jung, Y., S. N. MacEachern, and Y. Lee (2009). Window width selection for ℓ2 adjusted quantile regression. Technical report, Department of Statistics, The Ohio State University.

Knight, K. (1998). Limiting distributions for l1 regression estimators under general conditions. Annals of Statistics 26 (2), 755–770.

Koenker, R. (2005). Quantile Regression. Cambridge University Press.

Koenker, R. and G. Bassett (1978). Regression quantiles. Econometrica 46 (1), 33–50.

Koenker, R. and K. Hallock (2001). Quantile regression. Journal of Economic Perspectives 15, 143–156.

Lee, Y., S. N. MacEachern, and Y. Jung (2007). Regularization of case-specific parameters for robustness and efficiency. Technical Report 799, Department of Statistics, The Ohio State University.

Lee, Y.-J. and O. L. Mangasarian (2001). SSVM: A smooth support vector machine. Computational Optimization and Applications 20, 5–22.

McCullagh, P. and J. Nelder (1989). Generalized Linear Models, Second Edition. Chapman & Hall/CRC.

Newey, W. K. and J. L. Powell (1987). Asymmetric least squares estimation and testing. Econometrica 55 (4), 819–847.

Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory 7 (2), 186–199.

Pregibon, D. (1982). Resistant fits for some commonly used logistic models with medical applications. Biometrics 38 (2), 485–498.

Rockafellar, R. T. (1997). Convex Analysis (Princeton Landmarks in Mathematics and Physics). Princeton University Press.

Rosset, S. and J. Zhu (2004). Discussion of "Least angle regression" by Efron, Hastie, Johnstone and Tibshirani. Annals of Statistics 32 (2), 469–475.

Rosset, S. and J. Zhu (2007). Piecewise linear regularized solution paths. The Annals of Statistics 35 (3), 1012–1030.

Shen, X., X. Zhang, G. C. Tseng, and W. H. Wong (2003). On ψ-learning. Journal of the American Statistical Association 98 (463), 724–734.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58 (1), 267–288.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.

Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: Series in Applied Mathematics, Vol. 59, SIAM.

Weisberg, S. (2004). Discussion of "Least angle regression" by Efron, Hastie, Johnstone and Tibshirani. The Annals of Statistics 32 (2), 490–494.

Weisberg, S. (2005). Applied Linear Regression (3rd ed.). Hoboken, NJ: Wiley.

Wu, Y. and Y. Liu (2007). Robust truncated-hinge-loss support vector machines. Journal of the American Statistical Association 102, 974–983.

Yuan, M. and Y. Lin (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1), 49–67.

Zhou, N. and J. Zhu (2007). Group variable selection via hierarchical lasso and its oracle property. Technical report, Department of Statistics, University of Michigan.

Zou, H. and T. Hastie (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2), 301–320.


Fig. 1. Modification of the squared error loss with a case-specific parameter. (a) γ versus the residual r. (b) the adjusted residual r∗ versus the ordinary residual r. (c) a truncated squared error loss as the γ-adjusted loss. (d) the effective loss.


Fig. 2. The derivative of the check loss, $\psi_q(x)$, in the left panel and that of the modified check loss, $\psi_q^\gamma(x)$, in the right panel, for q = 0.2, 0.5 and 0.7.

[Figure 3 panels: error distributions N(0,1), t(df=2.25), t(df=10), Gamma(shape=3, scale=sqrt(3)), Log-Normal, and Exp(3); quantiles q = 0.1, ..., 0.5 for the first three distributions and q = 0.1, ..., 0.7 for the remaining three.]

Fig. 3. ‘Optimal’ intervals of adjustment for different quantiles (q), sample sizes (n), and error distributions. The vertical lines in each distribution indicate the true quantiles. The stacked horizontal lines for each quantile are the corresponding optimal intervals. The five intervals at each quantile are for $n = 10^2, 10^{2.5}, 10^3, 10^{3.5}$ and $10^4$, respectively, from the bottom.


Fig. 4. Modification of (a) absolute deviation loss for median regression with ℓ2 penalty, (b) negative log likelihood for logistic regression with ℓ1 penalty, and (c) hinge loss for the support vector machine with ℓ2 penalty. The solid lines are for the effective loss, the dashed lines are for the γ-adjusted loss, and the dotted lines are for the original loss in each panel.


Fig. 5. Regression spline estimates of conditional BMI quantiles in steps of 0.05, from 0.1 to 0.9 for the NHANES data. Natural spline bases and 6 knots are used in each fitted curve.


Fig. 6. Scatter plots of 10-fold CV scores from standard quantile regression (QR) and modified quantile regression (QR.M) at 0.25th, 0.5th and 0.9th quantiles. Regression splines with natural spline bases and 6 knots are fitted to the NHANES data. Each of 500 points represents a pair of CV scores as in (12).


Fig. 7. Differences between fitted median line and the other fitted quantiles for standard quantile regres- sion (QR) and modified quantile regression (QR.M). The dashed lines are the minimum and maximum of the observed heights.


Fig. 8. Sum of squared deviations (SSD) from Baayen’s fits in the simulation study. The horizontal line is the mean SSD for the LASSO while the points represent the mean of SSDs for the robust LASSO. The vertical lines give approximate 95% confidence intervals for the mean SSDs. Panel (a) presents results for the small set of covariates and panel (b) presents results for the large set of covariates.


Fig. 9. Coefficients in the simulation study. The first 20 variates are the small set of covariates, while the remaining 9 variates are in the large set of covariates. x is for the robust LASSO and o is for the LASSO. Panel (a) contains a plot of mean coefficients for standardized variates in the simulation. The vertical lines extend from the 10th to 90th percentiles of the coefficients under the robust LASSO. Panel (b) shows the fraction of replicates in which coefficients in the selected model are non-zero.
