INVESTIGATION
The Spike-and-Slab Lasso Generalized Linear Models for Prediction and Associated Genes Detection
Zaixiang Tang,*,†,‡ Yueping Shen,*,† Xinyan Zhang,‡ and Nengjun Yi‡,1

*Department of Biostatistics, School of Public Health, and †Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases and Center for Genetic Epidemiology and Genomics, Medical College of Soochow University, Suzhou, 215123, China, and ‡Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, Alabama 35294

1Corresponding author: Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294-0022. E-mail: [email protected]
ABSTRACT Large-scale "omics" data have been increasingly used as an important resource for prognostic prediction of diseases and detection of associated genes. However, there are considerable challenges in analyzing high-dimensional molecular data, including the large number of potential molecular predictors, limited number of samples, and small effect of each predictor. We propose new Bayesian hierarchical generalized linear models, called spike-and-slab lasso GLMs, for prognostic prediction and detection of associated genes using large-scale molecular data. The proposed model employs a spike-and-slab mixture double-exponential prior for coefficients that can induce weak shrinkage on large coefficients, and strong shrinkage on irrelevant coefficients. We have developed a fast and stable algorithm to fit large-scale hierarchical GLMs by incorporating expectation-maximization (EM) steps into the fast cyclic coordinate descent algorithm. The proposed approach integrates nice features of two popular methods, i.e., penalized lasso and Bayesian spike-and-slab variable selection. The performance of the proposed method is assessed via extensive simulation studies. The results show that the proposed approach can provide not only more accurate estimates of the parameters, but also better prediction. We demonstrate the proposed procedure on two cancer data sets: a well-known breast cancer data set consisting of 295 tumors and expression data of 4919 genes, and the ovarian cancer data set from TCGA with 362 tumors and expression data of 5336 genes. Our analyses show that the proposed procedure can generate powerful models for predicting outcomes and detecting associated genes. The methods have been implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).
KEYWORDS cancer; double-exponential distribution; generalized linear model; outcome prediction; spike-and-slab lasso
The growing recognition of precision medicine reflects the emergence of a field that is accelerating rapidly and will help shape new clinical practice in the future (Collins and Varmus 2015; Jameson and Longo 2015). The important basis of precision medicine is to generate knowledge of disease that will enable better assessment of disease risk, understanding of disease mechanisms, and prediction of optimal therapy and prognostic outcome for diseases by using a wide range of biomedical, clinical, and environmental information. Precision medicine needs accurate detection of biomarkers and prognostic prediction (Chin et al. 2011; Barillot et al. 2012). Traditional clinical prognostic and predictive factors often provided poor prognosis and prediction (Barillot et al. 2012). Modern "omics" technologies can generate robust large-scale molecular data, such as large amounts of genomic, transcriptomic, proteomic, and metabolomics data, which provides extraordinary opportunities to detect new biomarkers, and to build more accurate prognostic and predictive models. However, these large-scale data sets also introduce computational and statistical challenges.

Various approaches have been applied in analyzing large-scale molecular profiling data to address the challenges (Bovelstad et al. 2007, 2009; Lee et al. 2010; Barillot et al. 2012). The lasso, and its extensions, are the most commonly used methods (Tibshirani 1996; Hastie et al. 2009, 2015). These methods put an L1-penalty on the coefficients, and can shrink many coefficients exactly to zero, thus performing variable selection. With the L1-penalty, the penalized likelihood can be solved by extremely fast optimization algorithms, for example, the lars and the cyclic coordinate descent algorithms (Efron et al. 2004; Friedman et al. 2010), making them very popular in high-dimensional data analysis.
Table 1 Simulated effect sizes of five nonzero coefficients under different scenarios

Simulated scenarios        β5        β20       β40       βm−50     βm−5
Scenario 1 (m = 1000)      0.362     0.395    −0.418    −0.431     0.467
Scenario 2 (m = 1000)     −0.457    −0.491     0.521     0.550     0.585
Scenario 3 (m = 1000)      0.563    −0.610     0.653    −0.672     0.732
Scenario 4 (m = 3000)      0.357    −0.388     0.414    −0.429     0.462
Scenario 5 (m = 3000)     −0.455    −0.509     0.528     0.552     0.592
Scenario 6 (m = 3000)      0.560    −0.618     0.654    −0.673     0.716

m is the number of simulated predictors. For all scenarios, the number of individuals (n) is 500.

Recently, these penalization approaches have been applied widely for prediction and prognosis (Rapaport et al. 2007; Barillot et al. 2012; Sohn and Sung 2013; Zhang et al. 2013; Yuan et al. 2014; Zhao et al. 2014).

However, the lasso uses a single penalty for all coefficients, and thus can either include a number of irrelevant predictors, or over-shrink large coefficients. An ideal method should induce weak shrinkage on large effects, and strong shrinkage on irrelevant effects (Fan and Li 2001; Zou 2006; Zhang 2010). Statisticians have introduced hierarchical models with mixture spike-and-slab priors that can adaptively determine the amount of shrinkage (George and McCulloch 1993, 1997). The spike-and-slab prior is the fundamental basis for most Bayesian variable selection approaches, and has proved remarkably successful (George and McCulloch 1993, 1997; Chipman 1996; Chipman et al. 2001; Ročková and George 2014, and unpublished results). Recently, Bayesian spike-and-slab priors have been applied to predictive modeling and variable selection in large-scale genomic studies (Yi et al. 2003; Ishwaran and Rao 2005; de los Campos et al. 2010; Zhou et al. 2013; Lu et al. 2015; Shankar et al. 2015; Shelton et al. 2015; Partovi Nia and Ghannad-Rezaie 2016). However, most previous spike-and-slab variable selection approaches use the mixture normal priors on coefficients and employ Markov chain Monte Carlo (MCMC) algorithms (e.g., stochastic search variable selection) to fit the model. Although statistically sophisticated, these MCMC methods are computationally intensive for analyzing large-scale molecular data. The mixture normal priors cannot shrink coefficients exactly to zero, and thus cannot automatically perform variable selection. Ročková and George (2014) developed an expectation-maximization (EM) algorithm to fit large-scale linear models with the mixture normal priors.
V. Ročková and E. I. George (unpublished results) recently proposed a new framework, called the spike-and-slab lasso, for high-dimensional normal linear models by using a new prior on the coefficients, i.e., the spike-and-slab mixture double-exponential distribution. They proved that the spike-and-slab lasso has remarkable theoretical and practical properties, and overcomes some drawbacks of the previous approaches. However, the spike-and-slab lasso, and most of the previous methods, were developed based on normal linear models, and cannot be directly applied to other models. Therefore, extensions of high-dimensional methods using mixture priors to frameworks beyond normal linear regression provide important new research directions for both methodological and applied works (Ročková and George 2014, and unpublished results).

In this article, we extend the spike-and-slab lasso framework to generalized linear models, called spike-and-slab lasso GLMs (sslasso GLMs), to jointly analyze large-scale molecular data for building accurate predictive models and identifying important predictors. By using the mixture double-exponential priors, sslasso GLMs can adaptively shrink coefficients (i.e., weakly shrink important predictors but strongly shrink irrelevant predictors), and thus can result in accurate estimation and prediction. To fit the sslasso GLMs, we propose an efficient algorithm by incorporating EM steps into the extremely fast cyclic coordinate descent algorithm. The performance of the proposed method is assessed via extensive simulations, and compared with the commonly used lasso GLMs. We apply the proposed procedure to two cancer data sets with binary outcomes and thousands of molecular features. Our results show that the proposed method can generate powerful prognostic models for predicting disease outcome, and can also detect associated genes.

Methods

Generalized linear models

We consider GLMs with a large number of correlated predictors. The observed values of a continuous or discrete response are denoted by y = (y1, ..., yn). The predictor variables include numerous molecular predictors (e.g., gene expression). A GLM consists of three components: the linear predictor η, the link function h, and the data distribution p (McCullagh and Nelder 1989; Gelman et al. 2014). The linear predictor for the i-th individual can be expressed as

\[
\eta_i = \beta_0 + \sum_{j=1}^{J} x_{ij}\beta_j = X_i\beta \qquad (1)
\]

where β0 is the intercept, xij represents the observed value of the j-th variable, βj is the coefficient, Xi contains all variables, and β is a vector of the intercept and all the coefficients. The mean of the response variable is related to the linear predictor via a link function h:

\[
E(y_i \mid X_i) = h^{-1}(X_i\beta) \qquad (2)
\]

The data distribution is expressed as

\[
p(y \mid X\beta,\ \phi) = \prod_{i=1}^{n} p(y_i \mid X_i\beta,\ \phi) \qquad (3)
\]

where φ is a dispersion parameter, and the distribution p(yi | Xiβ, φ) can take various forms, including normal, binomial, and Poisson distributions. Some GLMs (for example, the binomial distribution) do not require a dispersion parameter; that is, φ is fixed at 1. Therefore, GLMs include normal linear and logistic regressions, and various others as special cases.
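As a concrete illustration, the following minimal R sketch simulates data from a logistic GLM of the form in Equations (1)-(3); the sample size, dimensions, and object names are illustrative assumptions, not values taken from the paper.

```r
## A minimal sketch (illustrative, not from the paper): a logistic GLM with
## linear predictor, inverse link, and data distribution as in Equations (1)-(3).
set.seed(1)
n <- 200; J <- 5
X    <- matrix(rnorm(n * J), n, J)        # predictor matrix with entries x_ij
beta <- c(0.5, -0.4, 0.6, 0.5, 0, 0)      # intercept beta_0 followed by beta_1, ..., beta_J
eta  <- beta[1] + X %*% beta[-1]          # linear predictor, Equation (1)
mu   <- 1 / (1 + exp(-eta))               # inverse logit link h^{-1}, Equation (2)
y    <- rbinom(n, 1, mu)                  # binomial data distribution, Equation (3)

## Classical (unpenalized) maximum likelihood fit for comparison
fit <- glm(y ~ X, family = binomial())
logLik(fit)                               # log-likelihood l(beta, phi)
```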
Figure 1 The profiles of deviance under scenarios 3 and 6 over 50 replicated testing data sets. (s0, s1) is the prior scale for sslasso GLMs.

The classical analysis of generalized linear models is to obtain maximum likelihood estimates (MLE) for the parameters (β, φ) by maximizing the logarithm of the likelihood function: l(β, φ) = log p(y | Xβ, φ). However, the classical GLMs cannot jointly analyze multiple correlated predictors, due to the problems of nonidentifiability and overfitting. The lasso is a widely used penalization approach to handling high-dimensional models, adding an L1 penalty to the log-likelihood function, and estimating the parameters by maximizing the penalized log-likelihood (Zou and Hastie 2005; Hastie et al. 2009, 2015; Friedman et al. 2010):

\[
pl(\beta,\ \phi) = l(\beta,\ \phi) - \lambda \sum_{j=1}^{J} |\beta_j| \qquad (4)
\]

The overall penalty parameter, λ, controls the overall strength of penalty and the size of the coefficients; for a small λ, many coefficients can be large, and for a large λ, many coefficients will be shrunk toward zero. Lasso GLMs can be fit by the extremely fast cyclic coordinate descent algorithm, which successively optimizes the penalized log-likelihood over each parameter, with the others fixed, and cycles repeatedly until convergence (Friedman et al. 2010; Hastie et al. 2015).
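The lasso GLM of Equation (4) can be fit with the cyclic coordinate descent implementation of Friedman et al. (2010) in the glmnet R package. The sketch below uses simulated data and only illustrates this baseline; it is not the authors' analysis code, which the abstract indicates is implemented in the BhGLM package.

```r
## Illustrative lasso logistic GLM fit by cyclic coordinate descent (Friedman et al. 2010)
## via the glmnet package; the simulated X and y here stand in for real molecular data.
library(glmnet)
set.seed(1)
n <- 200; J <- 50
X <- matrix(rnorm(n * J), n, J)
y <- rbinom(n, 1, 1 / (1 + exp(-(0.5 * X[, 1] - 0.6 * X[, 2]))))

lasso <- glmnet(X, y, family = "binomial", alpha = 1)     # alpha = 1 gives the L1 (lasso) penalty
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # choose the penalty lambda by cross-validation
coef(cvfit, s = "lambda.min")                             # coefficients at the selected lambda
```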
With a single penalty parameter λ, however, the lasso can either overshrink large effects, or include a number of irrelevant predictors. Ideally, one should use small penalty values for important predictors, and large penalties for irrelevant predictors. However, if we have no prior knowledge about the importance of the predictors, we cannot appropriately preset penalties. Here, we propose a new approach, i.e., the sslasso GLMs, which can induce different shrinkage scales for different coefficients, and allow us to estimate the shrinkage scales from the data.

sslasso GLMs

The sslasso GLMs are more easily interpreted and handled from a Bayesian hierarchical modeling framework. It is well known that the lasso can be expressed as a hierarchical model with a double-exponential prior on the coefficients (Tibshirani 1996; Park and Casella 2008; Yi and Xu 2008; Kyung et al. 2010):

\[
\beta_j \mid s \sim \mathrm{DE}(\beta_j \mid 0,\ s) = \frac{1}{2s}\exp\!\left(-\frac{|\beta_j|}{s}\right) \qquad (5)
\]

where the scale, s, controls the amount of shrinkage; a smaller scale induces stronger shrinkage and forces the estimates of βj toward zero.

We develop sslasso GLMs by extending the double-exponential prior to the spike-and-slab mixture double-exponential prior:

\[
\beta_j \mid \gamma_j,\ s_0,\ s_1 \sim (1-\gamma_j)\,\mathrm{DE}(\beta_j \mid 0,\ s_0) + \gamma_j\,\mathrm{DE}(\beta_j \mid 0,\ s_1)
\]

or, equivalently,

\[
\beta_j \mid \gamma_j,\ s_0,\ s_1 \sim \mathrm{DE}(\beta_j \mid 0,\ S_j) = \frac{1}{2S_j}\exp\!\left(-\frac{|\beta_j|}{S_j}\right) \qquad (6)
\]

where γj is the indicator variable, γj = 1 or 0, and the scale Sj equals one of two preset positive values s0 and s1 (s1 > s0 > 0), i.e., Sj = (1 − γj)s0 + γjs1. The scale value s0 is chosen to be small, and serves as a "spike scale" for modeling irrelevant (zero) coefficients, and inducing strong shrinkage on estimation; and s1 is set to be relatively large, and thus serves as a "slab scale" for modeling large coefficients, and inducing weak shrinkage on estimation.
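The R sketch below illustrates the prior in Equations (5) and (6): the same double-exponential density is evaluated under a small spike scale and a larger slab scale, and the conditional scale Sj is formed from the indicator γj. The particular values of s0 and s1 are arbitrary choices for illustration, not values recommended by the paper.

```r
## Sketch of the spike-and-slab mixture double-exponential prior of Equations (5)-(6);
## the scale values s0 and s1 below are illustrative, not the paper's defaults.
dde <- function(b, s) exp(-abs(b) / s) / (2 * s)  # double-exponential density DE(b | 0, s), Equation (5)

s0 <- 0.03   # small "spike" scale: strong shrinkage of irrelevant (zero) coefficients
s1 <- 0.50   # larger "slab" scale: weak shrinkage of large coefficients
b  <- seq(-2, 2, by = 0.01)

spike <- dde(b, s0)   # prior component selected when gamma_j = 0
slab  <- dde(b, s1)   # prior component selected when gamma_j = 1

## Conditional on gamma_j, the prior is a single DE(beta_j | 0, S_j) with
## S_j = (1 - gamma_j) * s0 + gamma_j * s1, as in Equation (6)
S_j <- function(gamma_j) (1 - gamma_j) * s0 + gamma_j * s1
c(S_j(0), S_j(1))
```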
Figure 2 The solution path and deviance profiles of sslasso GLMs (A, C) and the lasso model (B, D) for scenario 3. The colored points on the solution path represent the estimated values of the five assumed nonzero coefficients, and the circles represent the true nonzero coefficients. The vertical lines correspond to the optimal models.