SUPERVISED GROUP LASSO WITH APPLICATIONS TO MICROARRAY DATA ANALYSIS Shuangge Ma1, Xiao Song2, and Jian Huang3 1Department of Epidemiology and Public Health, Yale University 2Department of Health Administration, Biostatistics and Epidemiology, University of Georgia 3Departments of Statistics and Actuarial Science, and Biostatistics, University of Iowa March 2007 The University of Iowa Department of Statistics and Actuarial Science Technical Report No. 375 1 Supervised group Lasso with applications to microarray data analysis Shuangge Ma¤1 Xiao Song 2and Jian Huang 3 1 Department of Epidemiology and Public Health, Yale University, New Haven, CT 06520, USA 2 Department of Health Administration, Biostatistics and Epidemiology, University of Georgia, Athens, GA 30602, USA 3 Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52242, USA Email: Shuangge Ma¤-
[email protected]; Xiao Song -
[email protected]; Jian Huang -
[email protected]; ¤Corresponding author Abstract Background: A tremendous amount of e®orts have been devoted to identifying genes for diagnosis and prognosis of diseases using microarray gene expression data. It has been demonstrated that gene expression data have cluster structure, where the clusters consist of co-regulated genes which tend to have coordinated functions. However, most available statistical methods for gene selection do not take into consideration the cluster structure. Results: We propose a supervised group Lasso approach that takes into account the cluster structure in gene expression data for gene selection and predictive model building. For gene expression data without biological cluster information, we ¯rst divide genes into clusters using the K-means approach and determine the optimal number of clusters using the Gap method.