Variable Selection Algorithms in Generalized Linear Models
VARIABLE SELECTION ALGORITHMS IN GENERALIZED LINEAR MODELS

Juan Carlos Laria de la Cruz

A DISSERTATION

Submitted in partial fulfilment of the requirements for the degree of
DOCTOR OF PHILOSOPHY (Mathematical Engineering)

Universidad Carlos III de Madrid

Advisors
Rosa E. Lillo
M. Carmen Aguilera-Morillo

Tutor
Rosa E. Lillo

June, 2020

Variable selection algorithms in generalized linear models
A DISSERTATION, June, 2020
By Juan Carlos Laria de la Cruz
Published by: UC3M, Department of Statistics, Av. Universidad 30, 28912 Leganés, Madrid, Spain
www.uc3m.es
This work is licensed under Creative Commons Attribution – Non Commercial – Non Derivatives

To my mother

Acknowledgements

None of this work would have been possible without the continuous support, expert advice, motivation, constant dedication, and commitment of my thesis advisors and friends, Professors Rosa Lillo and Carmen Aguilera. I am deeply grateful to them.

I would also like to thank my coauthors and coworkers at the Department of Statistics. Many people have given me relevant suggestions and insights at different points during these years.

My being here has required so much patience and sacrifice from those involved in my education that this work is theirs too. Not only my grandmother and my parents but also my school and university teachers authored this thesis.

Special thanks to my family and my friends, who were always there for me, unconditionally, when I needed them.

Contents published and presented

The materials from the following sources are included in the thesis. Their inclusion is not indicated by typographical means or references, since they are fully embedded in each chapter indicated below.

Paper A. Laria, Juan C., Aguilera-Morillo, M. Carmen and Lillo, Rosa E. (2019). "An iterative sparse-group lasso". In: Journal of Computational and Graphical Statistics, 28(3), 722-731.
• url: https://doi.org/10.1080/10618600.2019.1573687
• Included in: Chapter 2 (Full)

Paper B. Laria, Juan C., Aguilera-Morillo, M. Carmen, Álvarez, Enrique, Lillo, Rosa E., López-Taruella, Sara, del Monte-Millán, María, Picornell, Antonio C., Martín, Miguel and Romo, Juan (2020). "Iterative variable selection for high-dimensional data: Prediction of pathological response in triple-negative breast cancer".
• url: http://hdl.handle.net/10016/30572
• Included in: Chapter 3 (Full)

Paper C. Laria, Juan C., Aguilera-Morillo, M. Carmen and Lillo, Rosa E. (2020). "A variable selection and clustering method for generalized linear models".
• Included in: Chapter 4 (Full)
• Note: Submitted for publication

Paper D. Laria, Juan C., Clemmensen, Line H. and Ersbøll, Bjarne K. (2020). "A generalized linear joint trained framework for semi-supervised learning of sparse features". arXiv preprint arXiv:2006.01671.
• url: http://arxiv.org/abs/2006.01671
• Included in: Chapter 5 (Full)
• Note: Submitted for publication

Further research achievements

Papers

1. Delgado-Gómez, D., Laria, J.C. and Ruiz-Hernández, D. (2019). "Computerized adaptive test and decision trees: a unifying approach". Expert Systems with Applications, 117, 358-366.
2. Corujo, J.M., Valdés, J.E. and Laria, J.C. (2019). "Stochastic comparisons of two-unit repairable systems". Communications in Statistics-Theory and Methods, 48(23), 5820-5838.

Software

1. s2net: The Generalized Semi-Supervised Elastic-Net. Maintainer.
   https://cran.r-project.org/package=s2net
2. cat.dt: Computerized Adaptive Testing and Decision Trees. Coauthor.
   https://cran.r-project.org/package=cat.dt

Abstract

This thesis has been developed at University Carlos III of Madrid, motivated by a collaboration with the Gregorio Marañón General University Hospital, in Madrid. It is framed within the field of Penalized Linear Models, specifically Variable Selection in Regression, Classification and Survival Models, but it also explores other techniques such as Variable Clustering and Semi-Supervised Learning.

In recent years, variable selection techniques based on penalized models have gained considerable importance. With the advances in technology over the last decade, it has become possible to collect and process huge volumes of data with algorithms of greater computational complexity. However, although it seemed that models providing simple, interpretable solutions would be definitively displaced by more complex ones, they have still proved to be very useful. Indeed, in a practical sense, a model that is capable of filtering important information, and that can be easily extrapolated and interpreted by a human, is often more valuable than a more complex model that is incapable of providing any kind of feedback on the underlying problem, even when the latter offers better predictions.

This thesis focuses on high-dimensional problems, in which the number of variables is of the same order as, or larger than, the sample size. In this type of problem, restrictions that eliminate variables from the model often lead to better performance and interpretability of the results. For fitting linear regression models in high dimensions, the Sparse Group Lasso regularization method has proven to be very efficient. However, in order to use the Sparse Group Lasso in practice, there are two critical aspects on which the solution depends: the correct selection of the regularization parameters, and a prior specification of groups of variables. Very little research has focused on algorithms for selecting the regularization parameters of the Sparse Group Lasso, and none has explored the issue of the grouping and how to relax this restriction, which in practice is an obstacle to using this method.

The main objective of this thesis is to propose new methods of variable selection in generalized linear models. The thesis explores the Sparse Group Lasso regularization method, analyzing in detail the correct selection of the regularization parameters, and finally relaxing the problem of group specification by introducing a new variable clustering algorithm that is based on the Sparse Group Lasso, but is much more flexible and extends it. In a parallel but related line of research, this thesis reveals a connection between penalized linear models and semi-supervised learning.

This thesis is structured as a compendium of articles, divided into four chapters. Each chapter has a structure and contents independent from the rest; however, all of them follow a common line. First, variable selection methods based on regularization are introduced, describing the optimization problem that arises and a numerical algorithm to approximate its solution when a term of the objective function is not differentiable. The latter occurs naturally when penalties inducing variable selection are added.
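For orientation, a standard way of writing this penalized problem, following the usual Sparse Group Lasso convention in the literature (the notation below is illustrative and not necessarily the exact parametrization adopted in the chapters), is

\[
\hat{\beta} \;=\; \arg\min_{\beta}\;
\frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
\;+\; (1-\alpha)\,\lambda \sum_{k=1}^{K} \sqrt{p_k}\,\lVert \beta^{(k)} \rVert_2
\;+\; \alpha\,\lambda\,\lVert \beta \rVert_1 ,
\]

where \(\beta^{(k)}\) collects the coefficients of the \(k\)-th predefined group, \(p_k\) is the group size, and \(\lambda \ge 0\), \(\alpha \in [0,1]\) are the regularization parameters whose selection is studied in this thesis. Both penalty terms are non-differentiable at zero: the grouped \(\ell_2\) norms push entire groups out of the model, while the \(\ell_1\) term induces sparsity within the selected groups.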
A contribution of this work is the iterative Sparse Group Lasso, an algorithm that obtains the coefficient estimates of the Sparse Group Lasso model without the need to specify the regularization parameters. It uses coordinate descent on the regularization parameters, while approximating the error function on a validation sample. Moreover, with respect to the traditional Sparse Group Lasso, this new proposal considers a more general penalty in which each group has a flexible weight.

A separate chapter presents an extension that uses the iterative Sparse Group Lasso to order the variables in the model according to a defined importance index. The introduction of this index is motivated by problems in which there are a large number of variables, only a few of which are directly related to the response variable. This methodology is applied to genetic data, revealing promising results.

A further significant contribution of this thesis is the Group Linear Algorithm with Sparse Principal decomposition, which is also motivated by problems in which only a small number of variables influence the response variable. However, unlike other methodologies, in this case the relevant variables are not necessarily among the observed data. This makes it a potentially powerful method, adaptable to multiple scenarios, which is also, as a side effect, a supervised variable clustering algorithm. Moreover, it can be interpreted as an extension of the Sparse Group Lasso that does not require an initial specification of the groups. From a computational point of view, this work presents an organized framework for solving problems in which the objective function is a linear combination of a differentiable error term and a penalty. The flexibility of this implementation allows it to be applied to problems in very different contexts, for example, the proposed Generalized Elastic Net for semi-supervised learning.

Regarding its main objective, this thesis offers a framework for the exploration of generalized interpretable models. The last chapter, in addition to compiling a summary of the contributions of the thesis, includes future lines of work within the scope of the thesis.

Resumen

This thesis has been developed at the University Carlos III of Madrid, motivated by a research collaboration with the Gregorio Marañón General University Hospital, in Madrid. It is framed within the field of Penalized Linear Models, specifically Variable Selection in Regression, Classification and Survival Models,