Efficient Experimental Design for Regularized Linear Models

C. Devon Lin
Department of Mathematics and Statistics, Queen's University, Canada

Peter Chien
Department of Statistics, University of Wisconsin-Madison

Xinwei Deng
Department of Statistics, Virginia Tech

Abstract

Regularized linear models, such as the Lasso, have attracted great attention in statistical learning and data science. However, there is only sporadic work on constructing efficient data collection for regularized linear models. In this work, we propose an experimental design approach, using nearly orthogonal Latin hypercube designs, to enhance the variable selection accuracy of regularized linear models. Systematic methods for constructing such designs are presented. The effectiveness of the proposed method is illustrated with several examples.

Keywords: Design of experiments; Latin hypercube design; Nearly orthogonal design; Regularization; Variable selection.

1 Introduction

In statistical learning and data science, regularized linear models have attracted great attention across multiple disciplines (Fan, Li, and Li, 2005; Hesterberg et al., 2008; Huang, Breheny, and Ma, 2012; Heinze, Wallisch, and Dunkler, 2018). Among various regularized linear models, the Lasso is one of the most well-known techniques, using the L1 regularization to achieve accurate prediction with variable selection (Tibshirani, 1996). Statistical properties and various extensions of this method have been actively studied in recent years (Tibshirani, 2011; Zhao and Yu, 2006; Zhao et al., 2019; Zou and Hastie, 2015; Zou, 2016). However, there is only sporadic work on constructing efficient data collection for regularized linear models. In this article, we study data collection for the regularized linear model from an experimental design perspective.

First, we give a brief description of the Lasso procedure. Consider a linear model

y = x^T \beta + \epsilon,   (1)

where x = (x_1, \ldots, x_p)^T is the vector of p continuous predictor variables, y is the response value, \beta = (\beta_1, \ldots, \beta_p)^T is the vector of regression parameters, and the error term \epsilon is normally distributed with mean zero and variance \sigma^2. Throughout, assume the data are centered so that the model in (1) has no intercept. Suppose this model has a sparse structure in which only p_0 predictor variables are active with non-zero regression coefficients, where p_0 < p. Let A(\beta) = \{j : \beta_j \neq 0, j = 1, \ldots, p\} be the set of indices of the active variables. Then the cardinality of the set A(\beta) is p_0.

For a given n \times p regression matrix X = (x_1, \ldots, x_n)^T and a given response vector y = (y_1, \ldots, y_n)^T, the Lasso solution is

\hat{\beta} = \arg\min_{\beta} \left[ (y - X\beta)^T (y - X\beta) + \lambda \|\beta\|_{\ell_1} \right],   (2)

where \|\beta\|_{\ell_1} = \sum_{i=1}^{p} |\beta_i| and \lambda is a tuning parameter. Because the \ell_1 norm \|\cdot\|_{\ell_1} is singular at the origin, a desirable property of the Lasso is that some coefficients of \hat{\beta} are exactly zero. Then A(\beta) can be estimated by A(\hat{\beta}) = \{j : \hat{\beta}_j \neq 0, j = 1, \ldots, p\}. The number of false selections of the Lasso is

\gamma = \#\{j : j \in A(\hat{\beta}) \text{ but } j \notin A(\beta)\} + \#\{j : j \notin A(\hat{\beta}) \text{ but } j \in A(\beta)\},   (3)

where \# denotes the set cardinality, the first term counts the number of false positives, and the second term counts the number of false negatives. The scope of this work is in developing experimental design techniques to construct the regression matrix X in (2) in order to minimize the value of \gamma in (3), the number of false selections.
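To fix ideas, here is a minimal sketch of the setup in (1)-(3). The coefficient values, noise level, and the i.i.d. Gaussian regression matrix are hypothetical placeholders (not the paper's settings), and the tuning parameter \lambda is chosen by cross-validation rather than fixed.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, p0 = 64, 192, 20                        # runs, predictors, active predictors
beta = np.zeros(p)
beta[:p0] = np.linspace(0.5, 3.0, p0)         # hypothetical nonzero coefficients
X = rng.standard_normal((n, p))               # placeholder regression matrix (i.i.d. sample)
y = X @ beta + rng.normal(scale=1.0, size=n)  # model (1); data centered, no intercept

# Lasso solution (2), with lambda selected by cross-validation.
beta_hat = LassoCV(cv=5, fit_intercept=False).fit(X, y).coef_

A_true = set(np.flatnonzero(beta))            # A(beta): true active set
A_hat = set(np.flatnonzero(beta_hat))         # A(beta_hat): selected set
gamma = len(A_hat - A_true) + len(A_true - A_hat)  # false positives + false negatives, as in (3)
print("number of false selections gamma:", gamma)
```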
Based on the probabilistic properties of regularized linear models, large randomness in the regression matrix is often required (Jung et al., 2019). Thus, a straightforward way is to take X to be an independently and identically distributed (i.i.d.) sample. However, from an experimental design perspective, the points of an i.i.d. sample are not well stratified in the design space (Box, Hunter, and Hunter, 2005; Wu and Hamada, 2009). To improve upon this scheme, we propose to take X to be a nearly orthogonal Latin hypercube design (NOLHD), that is, a Latin hypercube design with nearly orthogonal columns. A Latin hypercube design is a space-filling design that achieves maximum uniformity when the points are projected onto any one dimension (McKay, Beckman, and Conover, 1979). An NOLHD simultaneously possesses two desirable properties: low-dimensional stratification and near orthogonality.

Owen (1992) stated the advantage of using Latin hypercube designs for fitting additive models. It was discussed in Section 3 of Owen (1992) that the least-squares estimates of the regression coefficients of an additive regression with a Latin hypercube design can have significantly smaller variability than their counterparts under an i.i.d. sample. Since the model in (1) is additive, \hat{\beta} in (2) associated with a Latin hypercube design is expected to be superior to that with an i.i.d. sample. Both random Latin hypercube designs and NOLHDs are popular in computer experiments (Lin and Tang, 2015). It is advantageous to use NOLHDs instead of random Latin hypercube designs for the Lasso problem because the former have guaranteed small columnwise correlations. When the regression matrix X is taken to be an NOLHD, its small columnwise correlations make the active variables less correlated with the inactive variables, thus improving the selection accuracy of the Lasso.

There is some consistency between the concept of NOLHDs and the sparsity concept in variable selection. The sparsity assumption we made earlier for the model in (1) states that only p_0 variables in the model are active, without specifying which p_0 variables they are. If the regression matrix X for this model is an NOLHD of n runs for p input variables, then when the points of this design are projected onto any p_0 dimensions, the resulting design still retains the NOLHD structure for those p_0 factors. Note that an NOLHD has more than two levels and spreads the points evenly in the design space, rather than restricting them to the boundaries. Since the number of false selections \gamma in (3) has a nonlinear relation with the regression matrix X, an NOLHD is more appropriate than a two-level design for exploiting this complicated relation between \gamma and X.

The remainder of the article is organized as follows. Section 2 introduces a new criterion to measure NOLHDs and presents two systematic methods for constructing such designs for the Lasso problem. Section 3 provides numerical examples to demonstrate the effectiveness of the proposed method; these examples clearly indicate the superiority of NOLHDs over two-level designs for the Lasso problem. We provide a brief discussion in Section 4.

2 Methodology

In this section we discuss the construction of NOLHDs and how to use them for the Lasso problem. We prefer NOLHDs over random Latin hypercube designs because the latter are not guaranteed to have small columnwise correlations.
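The following short sketch makes this preference concrete. It generates a random Latin hypercube using the standard construction given in (6) below and summarizes its columnwise sample correlations, which are typically not small; the design sizes here mirror the illustration that follows, but the exact numbers depend on the random seed.

```python
import numpy as np

def random_lhd(n, p, rng):
    """Random Latin hypercube design on [0, 1]^p; see equation (6) below."""
    d = np.argsort(rng.random((n, p)), axis=0) + 1  # independent random permutations of 1..n
    u = rng.random((n, p))                          # independent Uniform[0, 1) jitter
    return (d - u) / n

rng = np.random.default_rng(0)
n, p = 64, 192
Z = random_lhd(n, p, rng)

rho = np.corrcoef(Z, rowvar=False)                  # correlation matrix, cf. (4)-(5)
off = np.abs(rho[np.triu_indices(p, k=1)])          # columnwise sample correlations
print("max |rho_ij|:", off.max())
print("fraction with |rho_ij| > 0.1:", np.mean(off > 0.1))
```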
As an illustration, let n = 64 and p = 192, and compute \gamma in (3) for two different choices of X in (2). Assume the model in (1) has \sigma = 8 and \beta = (0.05, 0.2, \ldots, 3.0, 0, \ldots, 0)^T, where only the first 20 coefficients are nonzero, and the predictor variables take values on the hypercube [-(64 - 1)/2, (64 - 1)/2]^p. The first method takes the design matrix X to be a random Latin hypercube design constructed by (7). Figure 1 depicts the histogram of the columnwise sample correlations of one random Latin hypercube design, where 21% of the columnwise correlations of the matrix are larger than 0.1 in absolute value. This method gives \gamma = 36. The second method takes the design matrix to be a 64 × 192 NOLHD from Section 2.1, where the columnwise correlations of the matrix are very small. The second method gives \gamma = 20. The difference in \gamma values between the two methods indicates that the Lasso solution with an NOLHD can be far superior.

[Figure 1. Histogram of the sample correlations of a 64 × 192 random Latin hypercube design.]

Here are some useful notation and definitions for constructing NOLHDs. The Kronecker product of an n × p matrix A = (a_{ij}) and an m × q matrix B = (b_{ij}) is

A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1p}B \\ a_{21}B & a_{22}B & \cdots & a_{2p}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1}B & a_{n2}B & \cdots & a_{np}B \end{bmatrix},

where a_{ij}B is an m × q matrix whose (k, l) entry is a_{ij} b_{kl}. The correlation matrix of an n × p matrix X = (x_{ij}) is

\rho = \begin{bmatrix} \rho_{11} & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & \rho_{22} & \cdots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & \rho_{pp} \end{bmatrix},   (4)

where

\rho_{ij} = \frac{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)^2 \sum_{k=1}^{n} (x_{kj} - \bar{x}_j)^2}}   (5)

represents the correlation between the ith and jth columns of X, \bar{x}_i = n^{-1} \sum_{k=1}^{n} x_{ki}, and \bar{x}_j = n^{-1} \sum_{k=1}^{n} x_{kj}. The matrix X is orthogonal if \rho in (4) is an identity matrix.

Let D = (d_{ij}) be an n × p random Latin hypercube in which each column is a random permutation of 1, \ldots, n, and all columns are generated independently. Using D, a random Latin hypercube design Z = (z_{ij}) on [0, 1]^p is generated through

z_{ij} = \frac{d_{ij} - u_{ij}}{n}, \quad i = 1, \ldots, n, \; j = 1, \ldots, p,   (6)

where the u_{ij}'s are independent uniform random variables on [0, 1), and the d_{ij}'s and the u_{ij}'s are mutually independent.
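A small sketch under illustrative sizes (not the paper's) ties these definitions together: it builds a random Latin hypercube via (6), verifies the one-dimensional stratification property (each column places exactly one point in each of the n equal-width bins of [0, 1]), and checks the Kronecker product definition against NumPy's np.kron.

```python
import numpy as np

rng = np.random.default_rng(2021)
n, p = 8, 4

# Random Latin hypercube Z on [0, 1]^p via (6): each column of D is an
# independent random permutation of 1, ..., n, and U holds independent
# Uniform[0, 1) variables.
D = np.argsort(rng.random((n, p)), axis=0) + 1
U = rng.random((n, p))
Z = (D - U) / n

# One-dimensional stratification: every column has exactly one point in
# each of the n equal-width bins of [0, 1].
bins = np.linspace(0.0, 1.0, n + 1)
counts = np.stack([np.histogram(Z[:, j], bins)[0] for j in range(p)])
assert (counts == 1).all()

# Kronecker product: np.kron reproduces the block definition above; the
# (1, 1) block of A (x) B equals a_11 * B.
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 5], [6, 7]])
K = np.kron(A, B)
assert np.array_equal(K[:2, :2], A[0, 0] * B)
```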