INVESTIGATION
The Spike-and-Slab Lasso Generalized Linear Models for Prediction and Associated Genes Detection
Zaixiang Tang,*,†,‡ Yueping Shen,*,† Xinyan Zhang,‡ and Nengjun Yi‡,1

*Department of Biostatistics, School of Public Health, and †Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases and Center for Genetic Epidemiology and Genomics, Medical College of Soochow University, Suzhou, 215123, China, and ‡Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, Alabama 35294

1Corresponding author: Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294-0022. E-mail: [email protected]
ABSTRACT Large-scale "omics" data have been increasingly used as an important resource for prognostic prediction of diseases and detection of associated genes. However, there are considerable challenges in analyzing high-dimensional molecular data, including the large number of potential molecular predictors, limited number of samples, and small effect of each predictor. We propose new Bayesian hierarchical generalized linear models, called spike-and-slab lasso GLMs, for prognostic prediction and detection of associated genes using large-scale molecular data. The proposed model employs a spike-and-slab mixture double-exponential prior for coefficients that can induce weak shrinkage on large coefficients, and strong shrinkage on irrelevant coefficients. We have developed a fast and stable algorithm to fit large-scale hierarchical GLMs by incorporating expectation-maximization (EM) steps into the fast cyclic coordinate descent algorithm. The proposed approach integrates nice features of two popular methods, i.e., penalized lasso and Bayesian spike-and-slab variable selection. The performance of the proposed method is assessed via extensive simulation studies. The results show that the proposed approach can provide not only more accurate estimates of the parameters, but also better prediction. We demonstrate the proposed procedure on two cancer data sets: a well-known breast cancer data set consisting of 295 tumors and expression data of 4919 genes, and the ovarian cancer data set from TCGA with 362 tumors and expression data of 5336 genes. Our analyses show that the proposed procedure can generate powerful models for predicting outcomes and detecting associated genes. The methods have been implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).
KEYWORDS cancer; double-exponential distribution; generalized linear model; outcome prediction; spike-and-slab lasso
The growing recognition of precision medicine reflects the emergence of a field that is accelerating rapidly and will help shape new clinical practice in the future (Collins and Varmus 2015; Jameson and Longo 2015). The important basis of precision medicine is to generate knowledge of disease that will enable better assessment of disease risk, understanding of disease mechanisms, and prediction of optimal therapy and prognostic outcome for diseases by using a wide range of biomedical, clinical, and environmental information. Precision medicine needs accurate detection of biomarkers and prognostic prediction (Chin et al. 2011; Barillot et al. 2012). Traditional clinical prognostic and predictive factors often provided poor prognosis and prediction (Barillot et al. 2012). Modern "omics" technologies can generate robust large-scale molecular data, such as large amounts of genomic, transcriptomic, proteomic, and metabolomics data, which provides extraordinary opportunities to detect new biomarkers, and to build more accurate prognostic and predictive models. However, these large-scale data sets also introduce computational and statistical challenges.

Various approaches have been applied in analyzing large-scale molecular profiling data to address the challenges (Bovelstad et al. 2007, 2009; Lee et al. 2010; Barillot et al. 2012). The lasso, and its extensions, are the most commonly used methods (Tibshirani 1996; Hastie et al. 2009, 2015). These methods put an L1-penalty on the coefficients, and can shrink many coefficients exactly to zero, thus performing variable selection. With the L1-penalty, the penalized likelihood can be solved by extremely fast optimization algorithms, for example, the lars and the cyclic coordinate descent algorithms (Efron et al. 2004; Friedman et al. 2010), making them very popular in high-dimensional data analysis.
Table 1 Simulated effect sizes of five nonzero coefficients under different scenarios

Simulated scenarios        β5        β20       β40       βm−50     βm−5
Scenario 1 (m = 1000)      0.362     0.395    −0.418    −0.431     0.467
Scenario 2 (m = 1000)     −0.457    −0.491     0.521     0.550     0.585
Scenario 3 (m = 1000)      0.563    −0.610     0.653    −0.672     0.732
Scenario 4 (m = 3000)      0.357    −0.388     0.414    −0.429     0.462
Scenario 5 (m = 3000)     −0.455    −0.509     0.528     0.552     0.592
Scenario 6 (m = 3000)      0.560    −0.618     0.654    −0.673     0.716

m is the number of simulated predictors. For all scenarios, the number of individuals (n) is 500.

Recently, these penalization approaches have been applied widely for prediction and prognosis (Rapaport et al. 2007; Barillot et al. 2012; Sohn and Sung 2013; Zhang et al. 2013; Yuan et al. 2014; Zhao et al. 2014).

However, the lasso uses a single penalty for all coefficients, and thus can either include a number of irrelevant predictors, or over-shrink large coefficients. An ideal method should induce weak shrinkage on large effects, and strong shrinkage on irrelevant effects (Fan and Li 2001; Zou 2006; Zhang 2010). Statisticians have introduced hierarchical models with mixture spike-and-slab priors that can adaptively determine the amount of shrinkage (George and McCulloch 1993, 1997). The spike-and-slab prior is the fundamental basis for most Bayesian variable selection approaches, and has proved remarkably successful (George and McCulloch 1993, 1997; Chipman 1996; Chipman et al. 2001; Ročková and George 2014, and unpublished results). Recently, Bayesian spike-and-slab priors have been applied to predictive modeling and variable selection in large-scale genomic studies (Yi et al. 2003; Ishwaran and Rao 2005; de los Campos et al. 2010; Zhou et al. 2013; Lu et al. 2015; Shankar et al. 2015; Shelton et al. 2015; Partovi Nia and Ghannad-Rezaie 2016). However, most previous spike-and-slab variable selection approaches use the mixture normal priors on coefficients and employ Markov chain Monte Carlo (MCMC) algorithms (e.g., stochastic search variable selection) to fit the model. Although statistically sophisticated, these MCMC methods are computationally intensive for analyzing large-scale molecular data. The mixture normal priors cannot shrink coefficients exactly to zero, and thus cannot automatically perform variable selection. Ročková and George (2014) developed an expectation-maximization (EM) algorithm to fit large-scale linear models with the mixture normal priors.
V. Ročková and E. I. George (unpublished results) recently proposed a new framework, called the spike-and-slab lasso, for high-dimensional normal linear models by using a new prior on the coefficients, i.e., the spike-and-slab mixture double-exponential distribution. They proved that the spike-and-slab lasso has remarkable theoretical and practical properties, and overcomes some drawbacks of the previous approaches. However, the spike-and-slab lasso, and most of the previous methods, were developed based on normal linear models, and cannot be directly applied to other models. Therefore, extensions of high-dimensional methods using mixture priors to frameworks beyond normal linear regression provide important new research directions for both methodological and applied works (Ročková and George 2014, and unpublished results).

In this article, we extend the spike-and-slab lasso framework to generalized linear models, called spike-and-slab lasso GLMs (sslasso GLMs), to jointly analyze large-scale molecular data for building accurate predictive models and identifying important predictors. By using the mixture double-exponential priors, sslasso GLMs can adaptively shrink coefficients (i.e., weakly shrink important predictors but strongly shrink irrelevant predictors), and thus can result in accurate estimation and prediction. To fit the sslasso GLMs, we propose an efficient algorithm by incorporating EM steps into the extremely fast cyclic coordinate descent algorithm. The performance of the proposed method is assessed via extensive simulations, and compared with the commonly used lasso GLMs. We apply the proposed procedure to two cancer data sets with binary outcomes and thousands of molecular features. Our results show that the proposed method can generate powerful prognostic models for predicting disease outcome, and can also detect associated genes.

Methods

Generalized linear models

We consider GLMs with a large number of correlated predictors. The observed values of a continuous or discrete response are denoted by y = (y1, ..., yn). The predictor variables include numerous molecular predictors (e.g., gene expression). A GLM consists of three components: the linear predictor η, the link function h, and the data distribution p (McCullagh and Nelder 1989; Gelman et al. 2014). The linear predictor for the i-th individual can be expressed as

\[
\eta_i = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j = X_i\beta \qquad (1)
\]

where β0 is the intercept, xij represents the observed value of the j-th variable, βj is the coefficient, Xi contains all variables, and β is a vector of the intercept and all the coefficients. The mean of the response variable is related to the linear predictor via a link function h:

\[
E(y_i \mid X_i) = h^{-1}(X_i\beta) \qquad (2)
\]

The data distribution is expressed as

\[
p(y \mid X\beta,\ \phi) = \prod_{i=1}^{n} p(y_i \mid X_i\beta,\ \phi) \qquad (3)
\]

where φ is a dispersion parameter, and the distribution p(yi | Xiβ, φ) can take various forms, including normal, binomial, and Poisson distributions. Some GLMs (for example, the binomial distribution) do not require a dispersion parameter; that is, φ is fixed at 1. Therefore, GLMs include normal linear and logistic regressions, and various others as special cases.
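As a concrete illustration, the following minimal R sketch simulates data from a logistic GLM of the form in Equations (1)-(3); the sample size, dimensions, and object names are illustrative assumptions, not values taken from the paper.

```r
## A minimal sketch (illustrative, not from the paper): a logistic GLM with
## linear predictor, inverse link, and data distribution as in Equations (1)-(3).
set.seed(1)
n <- 200; J <- 5
X    <- matrix(rnorm(n * J), n, J)        # predictor matrix with entries x_ij
beta <- c(0.5, -0.4, 0.6, 0.5, 0, 0)      # intercept beta_0 followed by beta_1, ..., beta_J
eta  <- beta[1] + X %*% beta[-1]          # linear predictor, Equation (1)
mu   <- 1 / (1 + exp(-eta))               # inverse logit link h^{-1}, Equation (2)
y    <- rbinom(n, 1, mu)                  # binomial data distribution, Equation (3)

## Classical (unpenalized) maximum likelihood fit for comparison
fit <- glm(y ~ X, family = binomial())
logLik(fit)                               # log-likelihood l(beta, phi)
```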
Figure 1 The profiles of deviance under scenarios 3 and 6 over 50 replicated testing data sets. (s0, s1) is the prior scale for sslasso GLMs.

The classical analysis of generalized linear models is to obtain maximum likelihood estimates (MLE) for the parameters (β, φ) by maximizing the logarithm of the likelihood function: l(β, φ) = log p(y | Xβ, φ). However, the classical GLMs cannot jointly analyze multiple correlated predictors, due to the problems of nonidentifiability and overfitting. The lasso is a widely used penalization approach to handling high-dimensional models, adding an L1 penalty to the log-likelihood function, and estimating the parameters by maximizing the penalized log-likelihood (Zou and Hastie 2005; Hastie et al. 2009, 2015; Friedman et al. 2010):

\[
pl(\beta,\ \phi) = l(\beta,\ \phi) - \lambda \sum_{j=1}^{J} |\beta_j| \qquad (4)
\]

The overall penalty parameter, λ, controls the overall strength of penalty and the size of the coefficients; for a small λ, many coefficients can be large, and for a large λ, many coefficients will be shrunk toward zero. Lasso GLMs can be fit by the extremely fast cyclic coordinate descent algorithm, which successively optimizes the penalized log-likelihood over each parameter, with the others fixed, and cycles repeatedly until convergence (Friedman et al. 2010; Hastie et al. 2015).
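The lasso GLM of Equation (4) can be fit with the cyclic coordinate descent implementation of Friedman et al. (2010) in the glmnet R package. The sketch below uses simulated data and only illustrates this baseline; it is not the authors' analysis code, which the abstract indicates is implemented in the BhGLM package.

```r
## Illustrative lasso logistic GLM fit by cyclic coordinate descent (Friedman et al. 2010)
## via the glmnet package; the simulated X and y here stand in for real molecular data.
library(glmnet)
set.seed(1)
n <- 200; J <- 50
X <- matrix(rnorm(n * J), n, J)
y <- rbinom(n, 1, 1 / (1 + exp(-(0.5 * X[, 1] - 0.6 * X[, 2]))))

lasso <- glmnet(X, y, family = "binomial", alpha = 1)     # alpha = 1 gives the L1 (lasso) penalty
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # choose the penalty lambda by cross-validation
coef(cvfit, s = "lambda.min")                             # coefficients at the selected lambda
```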
With a single penalty parameter λ, however, the lasso can either overshrink large effects, or include a number of irrelevant predictors. Ideally, one should use small penalty values for important predictors, and large penalties for irrelevant predictors. However, if we have no prior knowledge about the importance of the predictors, we cannot appropriately preset penalties. Here, we propose a new approach, i.e., the sslasso GLMs, which can induce different shrinkage scales for different coefficients, and allow us to estimate the shrinkage scales from the data.

sslasso GLMs

The sslasso GLMs are more easily interpreted and handled from a Bayesian hierarchical modeling framework. It is well known that the lasso can be expressed as a hierarchical model with a double-exponential prior on the coefficients (Tibshirani 1996; Park and Casella 2008; Yi and Xu 2008; Kyung et al. 2010):

\[
\beta_j \mid s \sim \mathrm{DE}(\beta_j \mid 0,\ s) = \frac{1}{2s}\exp\!\left(-\frac{|\beta_j|}{s}\right) \qquad (5)
\]

where the scale, s, controls the amount of shrinkage; a smaller scale induces stronger shrinkage and forces the estimates of βj toward zero.

We develop sslasso GLMs by extending the double-exponential prior to the spike-and-slab mixture double-exponential prior:

\[
\beta_j \mid \gamma_j,\ s_0,\ s_1 \sim (1-\gamma_j)\,\mathrm{DE}(\beta_j \mid 0,\ s_0) + \gamma_j\,\mathrm{DE}(\beta_j \mid 0,\ s_1)
\]

or, equivalently,

\[
\beta_j \mid \gamma_j,\ s_0,\ s_1 \sim \mathrm{DE}(\beta_j \mid 0,\ S_j) = \frac{1}{2S_j}\exp\!\left(-\frac{|\beta_j|}{S_j}\right) \qquad (6)
\]

where γj is the indicator variable, γj = 1 or 0, and the scale Sj equals one of two preset positive values s0 and s1 (s1 > s0 > 0), i.e., Sj = (1 − γj)s0 + γjs1. The scale value s0 is chosen to be small, and serves as a "spike scale" for modeling irrelevant (zero) coefficients, and inducing strong shrinkage on estimation; and s1 is set to be relatively large, and thus serves as a "slab scale" for modeling large coefficients, and inducing weak shrinkage on estimation.
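The R sketch below illustrates the prior in Equations (5) and (6): the same double-exponential density is evaluated under a small spike scale and a larger slab scale, and the conditional scale Sj is formed from the indicator γj. The particular values of s0 and s1 are arbitrary choices for illustration, not values recommended by the paper.

```r
## Sketch of the spike-and-slab mixture double-exponential prior of Equations (5)-(6);
## the scale values s0 and s1 below are illustrative, not the paper's defaults.
dde <- function(b, s) exp(-abs(b) / s) / (2 * s)  # double-exponential density DE(b | 0, s), Equation (5)

s0 <- 0.03   # small "spike" scale: strong shrinkage of irrelevant (zero) coefficients
s1 <- 0.50   # larger "slab" scale: weak shrinkage of large coefficients
b  <- seq(-2, 2, by = 0.01)

spike <- dde(b, s0)   # prior component selected when gamma_j = 0
slab  <- dde(b, s1)   # prior component selected when gamma_j = 1

## Conditional on gamma_j, the prior is a single DE(beta_j | 0, S_j) with
## S_j = (1 - gamma_j) * s0 + gamma_j * s1, as in Equation (6)
S_j <- function(gamma_j) (1 - gamma_j) * s0 + gamma_j * s1
c(S_j(0), S_j(1))
```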
Figure 2 The solution path and deviance profiles of sslasso GLMs (A, C) and the lasso model (B, D) for scenario 3. The colored points on the solution path represent the estimated values of the five assumed nonzero coefficients, and the circles represent the true nonzero coefficients. The vertical lines correspond to the optimal models.