Group-sparse SVD Models and Their Applications in Biological Data

Wenwen Min, Juan Liu and Shihua Zhang

Abstract—Sparse singular value decomposition (SVD) models have been proposed for biclustering high-dimensional gene expression data to identify block patterns with similar expressions. However, these models do not take into account prior group effects upon variable selection. To this end, we first propose group-sparse SVD models with group Lasso (GL1-SVD) and group L0-norm penalty (GL0-SVD) for non-overlapping group structures of variables. However, such group-sparse SVD models limit their applicability in problems with overlapping structure. Thus, we also propose two group-sparse SVD models with overlapping group Lasso (OGL1-SVD) and overlapping group L0-norm penalty (OGL0-SVD). We adopt an alternating iterative strategy to solve GL1-SVD based on a block coordinate descent method, and GL0-SVD based on a projection method. The key step in solving OGL1-SVD is a proximal operator with overlapping group Lasso penalty, which we solve with an alternating direction method of multipliers (ADMM). Similarly, we develop an approximate method to solve OGL0-SVD. Applications of these methods and comparison with competing ones on simulated data demonstrate their effectiveness. Extensive applications to several real gene expression datasets with prior gene group knowledge identify biologically interpretable gene modules.

Index Terms—sparse SVD, low-rank matrix decomposition, group-sparse penalty, overlapping group-sparse penalty, coordinate descent method, alternating direction method of multipliers (ADMM)

1 INTRODUCTION

Singular value decomposition (SVD) is one of the classical matrix decomposition models [1]. It is a useful tool for data analysis and low-dimensional data representation in many different fields such as signal processing and matrix approximation [2], [3], [4]. However, non-sparse singular vectors involving all variables are difficult to interpret intuitively. In recent years, sparse models have been widely applied in computational biology to improve biological interpretation [5], [6], [7]. In addition, many researchers have applied diverse sparse penalties to the singular vectors in SVD and developed multiple sparse SVD models to improve their interpretation and to capture inherent structures and patterns in the input data [8], [9]. For example, sparse SVD provides a new way of exploring bicluster patterns in gene expression data. Suppose X ∈ R^{p×n} denotes a gene expression matrix with p genes and n samples. Biologically, a subset of patients and genes can be clustered together as a coherent bicluster or block pattern with similar expressions. Previous studies have reported that such biclusters in gene expression data can be identified by low-rank sparse SVD models [10], [11], [12]. However, these sparse models ignore prior information on the gene variables and usually assume that each gene is selected into a bicluster with equal probability. Actually, one gene may belong to multiple pathways in biology [13]. As far as we know, there is not yet a model for biclustering gene expression data by integrating gene pathway information. Group-sparse penalties [14], [15] should be used to induce structured sparsity of variables for variable selection. Several studies have explored the (overlapping) group Lasso in regression tasks [16], [17]. However, little work has focused on developing structured sparse SVD for biclustering high-dimensional data (e.g., biclustering gene expression data by integrating prior gene group knowledge).

In this paper, motivated by the development of sparse coding and structured sparse penalties, we propose several group-sparse SVD models for pattern discovery in biological data. We first introduce the group-sparse SVD model with group Lasso (L1) penalty (GL1-SVD) to integrate non-overlapping group structure of variables. Compared to the L1-norm, the L0-norm is a more natural sparsity-inducing penalty. Thus, we also propose an effective group-sparse SVD via replacing the L1-norm with the L0-norm, called GL0-SVD, which uses a mixed norm combining the group Lasso and L0-norm penalties. However, the non-overlapping group structure limits the applicability of these models in diverse fields. We therefore consider a more general situation, where we assume that groups of variables are potentially overlapping (e.g., a gene may belong to multiple pathways (groups)). We also propose two group-sparse SVD models with overlapping group Lasso (OGL1-SVD) and overlapping group L0-norm penalty (OGL0-SVD).

To solve these models, we design an alternating iterative algorithm to solve GL1-SVD based on a block coordinate descent method and GL0-SVD based on a projection method. Furthermore, we develop a more general approach based on the alternating direction method of multipliers (ADMM) to solve OGL1-SVD. In addition, we extend OGL1-SVD to OGL0-SVD, which is a regularized SVD with overlapping group L0-norm penalty. The key step in solving OGL0-SVD is also a proximal operator, here with overlapping group L0-norm penalty. We propose a greedy method to solve it and obtain an approximate solution. Finally, applications of these methods and comparison with state-of-the-art ones on a set of simulated data demonstrate their effectiveness and computational efficiency. Extensive applications to high-dimensional gene expression data show that our methods can identify biologically relevant gene modules and improve their biological interpretation.

• Wenwen Min is with the School of Mathematics and Science, Jiangxi Science and Technology Normal University, Nanchang 330038, China. E-mail: [email protected].
• Juan Liu is with the School of Computer Science, Wuhan University, Wuhan 430072, China. E-mail: [email protected].
• Shihua Zhang is with NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049; Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China. E-mail: [email protected].

Manuscript received XXX, 2018; revised XXX, 2018.

Related Work. We briefly review the regularized low-rank-r SVD model as follows:

    minimize_{U,D,V}  ||X − U D V^T||_F^2
    subject to  ||U_i||^2 ≤ 1, Ω_1(U_i) ≤ c_1, ∀i,
                ||V_i||^2 ≤ 1, Ω_2(V_i) ≤ c_2, ∀i,    (1)

where X ∈ R^{p×n} with p features and n samples, U ∈ R^{p×r} is a column-orthogonal matrix, V ∈ R^{n×r}, and D is a diagonal matrix. U_i (V_i) corresponds to the i-th column of U (V). To solve the above optimization problem, we introduce a general regularized rank-one SVD model:

    minimize_{u,v,d}  ||X − d u v^T||_F^2
    subject to  ||u||^2 ≤ 1, Ω_1(u) ≤ c_1,
                ||v||^2 ≤ 1, Ω_2(v) ≤ c_2,    (2)

where d is a positive singular value, u is a p-dimensional column vector, and v is an n-dimensional column vector. Ω_1(u) and Ω_2(v) are two penalty functions, and c_1 and c_2 are two hyperparameters. From a Bayesian view, different prior distributions of u and v correspond to different regularization functions. For example, the L1-norm is a very popular sparsity-inducing norm [18] and has been used to obtain sparse solutions in a large number of statistical models including the regression model [18], [19], SVD [20], PCA [21], LDA [22], K-means [23], etc.

Recently, some sparse SVD models have been proposed for coherent sub-matrix detection [20], [10], [11]. For example, Witten et al. [20] developed a penalized matrix decomposition (PMD) method, which regularizes the singular vectors with Lasso and fused Lasso to induce sparsity. Lee et al. [10] proposed a rank-one sparse SVD model with adaptive Lasso (L1) penalty (L1-SVD) on the singular vectors for biclustering of gene expression data. Some generalized sparsity penalty functions (e.g., group Lasso [14] and sparse group Lasso [24]) have been widely used in many regression models for feature selection by integrating group information of variables. However, it is a challenging issue to use these generalized penalty functions such as group Lasso and overlapping group Lasso [15], [25] in the SVD framework with effective algorithms. To this end, we develop several group-sparse SVD models with different group-sparse penalties, including Ω_GL1(u), Ω_GL0(u), Ω_OGL1(u) and Ω_OGL0(u), to integrate diverse group structures of variables for pattern discovery in biological data (see TABLE 1).

Fig. 1: (A) There are three structured ways corresponding to three penalty functions for variable selection: Lasso (each variable as a group), group Lasso (non-overlapping groups) and overlapping group Lasso. In the third way, groups G1 and G2, as well as groups G2 and G3, overlap. (B) A simple example to explain how SVD(OGL0, L0) (see TABLE 1) identifies a gene co-expression sub-matrix by integrating gene expression data and prior group information. Genes are divided into seven overlapping groups (denoted by brackets). SVD(OGL0, L0) is used to identify a pair of sparse singular vectors u and v. Based on the non-zero elements of u and v, a gene bicluster or module can be identified with a gene set {(g1, g2), (g4, g5), (g8, g9, g10)} and a sample set {s1, s4, s6, s8}, where "( )" indicates that some genes are in a given group.

2 GROUP-SPARSE SVD MODELS

In this section, we propose four group-sparse SVD models with respect to different structured penalties (TABLE 1). For given data (e.g., gene expression data), we can make proper adjustments to get one-sided group-sparse SVD models by using (overlapping) group-sparse penalties for the right (or left) singular vector only. For example, SVD(OGL0, L0) is a group-sparse SVD model which uses the overlapping group L0-penalty for u and the L0-penalty for v.

TABLE 1: The group-sparse SVD models

    Model       Penalty function
    GL1-SVD     Group Lasso (GL1)
    GL0-SVD     Group L0 (GL0)
    OGL1-SVD    Overlapping group Lasso (OGL1)
    OGL0-SVD    Overlapping group L0 (OGL0)

Below we introduce these models and their algorithms in detail.
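All four models instantiate the same alternating skeleton around the rank-one model (2). As a minimal illustrative sketch (our own code, not the authors' released implementation; the function and argument names are ours), the shared scheme can be written as follows, where prox_u and prox_v stand for the penalty-specific operators derived in the subsections below:

import numpy as np

def rank_one_sparse_svd(X, prox_u, prox_v, max_iter=100, tol=1e-6):
    """Alternating skeleton for the rank-one model (2).

    prox_u / prox_v map a vector z to a penalized, unit-norm estimate;
    plugging in different operators yields the GL1-, GL0-, OGL1- and
    OGL0-SVD algorithms described below.
    """
    rng = np.random.default_rng(0)
    v = rng.standard_normal(X.shape[1])
    v /= np.linalg.norm(v)
    d_old = 0.0
    for _ in range(max_iter):
        u = prox_u(X @ v)          # update the left singular vector
        v = prox_v(X.T @ u)        # update the right singular vector
        d = u @ X @ v              # current singular value
        if abs(d - d_old) < tol:   # stop by monitoring the change of d
            break
        d_old = d
    return u, v, d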

2.1 GL1-SVD

Suppose the left singular vector u and right singular vector v can be respectively divided into L and M non-overlapping groups: u^(l) ∈ R^{p_l×1}, l = 1, ..., L and v^(m) ∈ R^{q_m×1}, m = 1, ..., M. Here, we consider the (adaptive) group Lasso (GL1) penalty [26] for u and v as follows:

    Ω_GL1(u) = Σ_{l=1}^{L} w_l ||u^(l)||_2   and   Ω_GL1(v) = Σ_{m=1}^{M} τ_m ||v^(m)||_2,    (3)

where w_l and τ_m are adaptive weight parameters. If w_l = √p_l and τ_m = √q_m for the group sizes, the penalty reduces to the traditional group Lasso.

Based on the definition of the GL1 penalty, we propose the first group-sparse SVD model with group Lasso penalty (GL1-SVD), also denoted SVD(GL1, GL1):

    minimize_{u,v,d}  ||X − d u v^T||_F^2
    subject to  ||u||^2 ≤ 1, Ω_GL1(u) ≤ c_1,
                ||v||^2 ≤ 1, Ω_GL1(v) ≤ c_2.    (4)

Since ||X − d u v^T||_F^2 = ||X||_F^2 + d^2 − 2d u^T X v, minimizing ||X − d u v^T||_F^2 is equivalent to minimizing −u^T X v, and once u and v are determined, the value of d is given by d = u^T X v. We obtain the Lagrangian form of the GL1-SVD model as follows:

    L(u, v) = −u^T X v + λ_1 Ω_GL1(u) + λ_2 Ω_GL1(v) + η_1 ||u||^2 + η_2 ||v||^2,    (5)

where λ_1 ≥ 0, λ_2 ≥ 0, η_1 ≥ 0 and η_2 ≥ 0 are Lagrange multipliers. To solve problem (5), we apply an alternating iterative algorithm that optimizes u for a fixed v and vice versa.

2.1.1 Learning u

Fix v and let z = Xv; minimizing Eq. (5) is then equivalent to minimizing the following criterion:

    L(u, λ, η) = −u^T z + λ Σ_{l=1}^{L} w_l ||u^(l)||_2 + η Σ_{l=1}^{L} u^(l)T u^(l),    (6)

where u = [u^(1); u^(2); ...; u^(L)] and, for simplicity, λ = λ_1 and η = η_1. It is obvious that L(u, λ, η) is convex with respect to u, and we develop a block coordinate descent algorithm [27], [28], [29], [30] to minimize Eq. (6), i.e., one group of u is updated at a time. For a single group u^(l), with u^(j) fixed for all 1 ≤ j ≤ L, j ≠ l, the subgradient equation (see [31]) of Eq. (6) with respect to u^(l) is:

    ∇_{u^(l)} L = −z^(l) + λ w_l s^(l) + 2η u^(l) = 0,    (7)

where s^(l) is the subgradient vector of ||u^(l)||_2, which satisfies

    s^(l) = u^(l)/||u^(l)||_2  if u^(l) ≠ 0,  and  s^(l) ∈ {s : ||s||_2 ≤ 1}  otherwise.    (8)

Based on Eq. (7), we have 2η u^(l) = z^(l) − λ w_l s^(l). If ||z^(l)||_2 > λ w_l, then u^(l) ≠ 0. Since η > 0, λ > 0, w_l > 0 and 2η u^(l) = z^(l) − λ w_l u^(l)/||u^(l)||_2, we have u^(l)/||u^(l)||_2 = z^(l)/||z^(l)||_2 and u^(l) = (1/2η) z^(l) (1 − λ w_l/||z^(l)||_2). If ||z^(l)||_2 ≤ λ w_l, then u^(l) = 0. In short, we obtain the following update rule for u^(l) (l = 1, ..., L):

    u^(l) = (1/2η)(1 − λ w_l/||z^(l)||_2) z^(l)  if ||z^(l)||_2 > λ w_l,  and  u^(l) = 0  otherwise.    (9)

Since Eq. (6) is strictly convex and separable, the block coordinate descent algorithm must converge to its optimal solution [27]. Finally, we can choose an η that guarantees the normalizing condition, i.e., u is rescaled to unit norm, u = û/||û||_2.
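To make the update concrete, here is a small sketch of rule (9) (our own illustrative code, with hypothetical names): it applies group-wise soft-thresholding to z = Xv and then rescales to unit norm, which absorbs the factor 1/(2η).

import numpy as np

def update_u_gl1(z, groups, lam, weights, eta=0.5):
    """Block coordinate update (9): group-wise soft-thresholding of z = Xv.

    groups is a list of index arrays for the L non-overlapping groups;
    the final rescaling to unit norm absorbs the 1/(2*eta) factor.
    """
    u = np.zeros_like(z)
    for l, idx in enumerate(groups):
        z_norm = np.linalg.norm(z[idx])
        if z_norm > lam * weights[l]:      # the group survives thresholding
            u[idx] = (1.0 - lam * weights[l] / z_norm) * z[idx] / (2.0 * eta)
    u_norm = np.linalg.norm(u)
    return u / u_norm if u_norm > 0 else u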

2.1.2 Learning v

In the same manner, we fix u in Eq. (5) and let z = X^T u. Similarly, we obtain the coordinate update rule for v^(m), m = 1, 2, ..., M:

    v^(m) = (1/2η)(1 − λ τ_m/||z^(m)||_2) z^(m)  if ||z^(m)||_2 > λ τ_m,  and  v^(m) = 0  otherwise.    (10)

Furthermore, to meet the normalizing condition, we choose an η that guarantees v = v̂/||v̂||_2. Besides, if each group contains only one element, the group Lasso penalty reduces to the Lasso penalty. Accordingly, we get another update formula:

    v_i = (1/2η)(1 − λ/|z_i|) z_i  if |z_i| > λ,  and  v_i = 0  otherwise.    (11)

2.1.3 GL1-SVD Algorithm

Based on Eqs. (9) and (10), we propose an alternating iterative algorithm (Algorithm 1) to solve the GL1-SVD model; its time complexity is O(Tnp + Tp^2 + Tn^2), where T is the number of iterations. We can control the iteration by monitoring the change of d.

Algorithm 1 GL1-SVD or SVD(GL1, GL1)
Require: Matrix X ∈ R^{p×n}, λ_u and λ_v; group information
Ensure: u, v and d
1:  Initialize v with ||v|| = 1
2:  repeat
3:    Let z = Xv
4:    for l = 1 to L do
5:      if ||z^(l)||_2 ≤ λ_u w_l then
6:        u^(l) = 0
7:      else
8:        u^(l) = z^(l) (1 − λ_u w_l/||z^(l)||_2)
9:      end if
10:   end for
11:   u = u/||u||_2
12:   Let z = X^T u
13:   for m = 1 to M do
14:     if ||z^(m)||_2 ≤ λ_v τ_m then
15:       v^(m) = 0
16:     else
17:       v^(m) = z^(m) (1 − λ_v τ_m/||z^(m)||_2)
18:     end if
19:   end for
20:   v = v/||v||_2
21:   d = z^T v
22: until d converges
23: return u, v and d

To make the penalty on each singular vector explicit, GL1-SVD can also be written as SVD(GL1, GL1), denoting that the left singular vector u and the right singular vector v are each regularized by a GL1 penalty. Similarly, we can simply modify Algorithm 1 to solve the SVD(GL1, L1) model, which applies the Lasso penalty to v.

2.2 GL0-SVD

Unlike the GL1 penalty, we consider below a group L0-norm penalty (GL0) for u and v as follows:

    Ω_GL0(u) = ||φ(u)||_0   and   Ω_GL0(v) = ||φ(v)||_0,    (12)

where φ(u) = [||u^(1)||, ||u^(2)||, ..., ||u^(L)||]^T and φ(v) = [||v^(1)||, ||v^(2)||, ..., ||v^(M)||]^T.

Based on the above definition of the GL0 penalty, we propose the second group-sparse SVD model with GL0 penalty, namely GL0-SVD or SVD(GL0, GL0):

    minimize_{u,v,d}  ||X − d u v^T||_F^2
    subject to  ||u||_2 ≤ 1, Ω_GL0(u) ≤ k_u,
                ||v||_2 ≤ 1, Ω_GL0(v) ≤ k_v.    (13)

Here, we employ an alternating iterative strategy to solve problem (13). Fixing u (or v), problem (13) reduces to a projection problem with group L0-norm penalty.

2.2.1 Learning u

Since ||X − d u v^T||_F^2 = ||X||_F^2 + d^2 − 2d u^T X v, fixing v and letting z_u = Xv, Eq. (13) reduces to a group-sparse projection problem with respect to u:

    minimize_{||u|| ≤ 1}  −z_u^T u,  s.t. Ω_GL0(u) ≤ k_u.    (14)

We present Theorem 1 to solve problem (14).

Theorem 1. The optimal solution of Eq. (14) is P_GL0(z_u)/||P_GL0(z_u)||_2, where P_GL0(z_u) is a column vector satisfying

    [P_GL0(z_u)]^(g) = z_u^(g)  if g ∈ supp(φ(z_u), k_u),  and  0  otherwise,    (15)

where [P_GL0(z_u)]^(g) is the sub-vector of the g-th group, g = 1, 2, ..., L, and supp(φ(z_u), k_u) denotes the set of indexes of the largest k_u elements of φ(z_u).

The objective function of (14) can be simplified as −z_u^T u = Σ_{l=1}^{L} −z_u^(l)T u^(l). Theorem 1 shows that solving problem (14) is equivalent to forcing the elements in the L − k_u groups of z_u with the smallest group-norm values to be zeros. We can easily prove that Theorem 1 is true; here we omit the proof.

2.2.2 Learning v

In the same manner, fix u and let z_v = X^T u; problem (13) can then be written as a similar subproblem with respect to v:

    minimize_{||v|| ≤ 1}  −z_v^T v,  s.t. Ω_GL0(v) ≤ k_v.    (16)

Similarly, based on Theorem 1, we obtain the estimator of v as P_GL0(z_v)/||P_GL0(z_v)||_2.

2.2.3 GL0-SVD Algorithm

Finally, we propose an alternating iterative method (Algorithm 2) to solve the optimization problem (13). The time complexity of Algorithm 2 is O(Tnp + Tn^2 + Tp^2), where T is the number of iterations.

Algorithm 2 GL0-SVD or SVD(GL0, GL0)
Require: Matrix X ∈ R^{p×n}, k_u and k_v; group information
Ensure: u, v and d
1:  Initialize v with ||v|| = 1
2:  repeat
3:    Let z_u = Xv
4:    û = P_GL0(z_u) by using Eq. (15)
5:    u = û/||û||_2
6:    Let z_v = X^T u
7:    v̂ = P_GL0(z_v) by using Eq. (15)
8:    v = v̂/||v̂||_2
9:    d = z_v^T v
10: until d converges
11: return u, v and d

Note that once the number of elements in every group equals 1 (i.e., q_i = 1 for i = 1, 2, ..., M), the group L0-norm penalty reduces to the L0-norm penalty. Moreover, Algorithm 2 with a small modification can be used to solve SVD(GL0, L0), which applies the L0-norm as the penalty for the right singular vector v. In addition, in analogy with the adaptive group Lasso [26], we may consider a weighted (adaptive) group L0-penalty by rewriting φ(z_u) = [w_1 ||z^(1)||, ..., w_L ||z^(L)||]^T in Eq. (15), where w_i is a weight coefficient that balances different group sizes, defined by w_i = 1/√q_i with q_i the number of elements in group i.

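The projection of Theorem 1 is straightforward to implement. A minimal sketch (our own code, with hypothetical names) that also supports the weighted variant discussed above:

import numpy as np

def proj_gl0(z, groups, k, weights=None):
    """Group-L0 projection of Theorem 1: keep the k groups of z with the
    largest (optionally weighted) norms and zero out the rest."""
    norms = np.array([np.linalg.norm(z[idx]) for idx in groups])
    if weights is not None:                # weighted (adaptive) variant
        norms = norms * weights
    keep = np.argsort(norms)[-k:]          # indexes of the k largest groups
    u = np.zeros_like(z)
    for g in keep:
        u[groups[g]] = z[groups[g]]
    u_norm = np.linalg.norm(u)
    return u / u_norm if u_norm > 0 else u
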
2.3 OGL1-SVD

In some situations, the non-overlapping group structure in group Lasso limits its applicability in practice. For example, a gene can participate in multiple pathways. Several studies have explored the overlapping group Lasso in regression tasks [16], [17]. However, structured sparse SVD with overlapping group structure remains to be solved.

Here we consider the overlapping group situation, where a variable may belong to more than one group. Suppose u corresponds to the row-variables of X with overlapping groups G^u = {G_1, G_2, ..., G_L} and v corresponds to the column-variables of X with overlapping groups G^v = {G_1, G_2, ..., G_M}. In other words, u and v can be respectively divided into L and M groups, represented by u_{G_l} ∈ R^{p_l×1}, l = 1, ..., L and v_{G_m} ∈ R^{p_m×1}, m = 1, ..., M. We define an overlapping group Lasso (OGL1) penalty of u as follows [15], [16], [32]:

    Ω_OGL1(u) = minimize_{J ⊆ G^u, supp(φ(u)) ⊆ J}  Σ_{l=1}^{L} w_l ||u_{G_l}||,    (17)

where supp(·) denotes the index set of non-zero elements of a given vector.

OGL1 is a specific penalty function for structured sparsity. It leads to sparse solutions whose supports are unions of predefined overlapping groups of variables. Based on the definition of OGL1, we propose the third group-sparse SVD model as follows:

    minimize_{u,v,d}  ||X − d u v^T||_F^2
    subject to  ||u||_2 ≤ 1, Ω_OGL1(u) ≤ c_u,
                ||v||_2 ≤ 1, Ω_OGL1(v) ≤ c_v,    (18)

where c_u and c_v are two hyperparameters. We first introduce two latent vectors ũ and ṽ. Let ũ^(l) = u_{G_l}, l = 1, ..., L and set ũ = (ũ^(1), ..., ũ^(L)), which is a column vector of size Σ_{l=1}^{L} |G_l|. Similarly, we can get ṽ based on v. In addition, we can duplicate rows and columns of the p × n matrix X to obtain a new matrix X̃ of size Σ_{l=1}^{L} |G_l| × Σ_{m=1}^{M} |G_m|, whose row and column variables are non-overlapping. Thus, solving problem (18) is approximately equivalent to solving an SVD(GL1, GL1) problem for the non-overlapping X̃, and we can obtain an approximate solution of (18) by using Algorithm 1. However, if a variable belongs to many different groups, this leads to a large computational burden. For example, consider a protein-protein interaction (PPI) network with about 13,000 genes and 250,000 edges; if we consider each edge of the PPI network as a group, we would construct a high-dimensional matrix X̃ with 500,000 rows.

To address this issue, we develop a method based on the alternating direction method of multipliers (ADMM) [33], [34] to solve problem (18) directly. Similar to Eq. (5), we first redefine problem (18) in its Lagrange form:

    L(u, v) = −u^T X v + λ_1 Ω_OGL1(u) + λ_2 Ω_OGL1(v) + η_1 u^T u + η_2 v^T v,    (19)

where λ_1 ≥ 0, λ_2 ≥ 0, η_1 ≥ 0 and η_2 ≥ 0 are Lagrange multipliers. Inspired by [35], we develop an alternating iterative algorithm to minimize it; that is, we optimize the above problem with respect to u for fixed v and vice versa. Since u and v play symmetric roles in problem (19), we only need to consider the subproblem with respect to u:

    minimize_u  −u^T z + λ Ω_OGL1(u) + η ||u||^2,    (20)

where z = Xv. Since the overlapping group Lasso penalty is a convex function [36], we can apply ADMM [33], [34] to solve problem (20). To obtain the learning algorithm for (20), we first introduce an auxiliary variable y and redefine the problem as follows:

    minimize_u  −u^T z + λ Σ_{l=1}^{L} w_l ||y^(l)||_2 + η ||u||^2
    subject to  y^(l) = u_{G_l}, l = 1, ..., L.    (21)

The augmented Lagrangian of (21) can then be written as:

    L_ρ(u, y, θ) = −u^T z + η ||u||^2 + Σ_{l=1}^{L} θ^(l)T (y^(l) − u_{G_l})
                   + λ Σ_{l=1}^{L} w_l ||y^(l)||_2 + (ρ/2) Σ_{l=1}^{L} ||y^(l) − u_{G_l}||_2^2,    (22)

where the Lagrange multipliers θ = [θ^(1); ...; θ^(L)] and the auxiliary variable y = [y^(1); ...; y^(L)] are two column vectors with L non-overlapping groups. For convenience, we define column vectors θ̃^(l), ỹ^(l) and ẽ^(l) (l = 1, ..., L) that have the same size and group structure as u, where θ̃^(l) satisfies [θ̃^(l)]_{G_k} = θ^(l) if k = l and [θ̃^(l)]_{G_k} = 0 otherwise; ỹ^(l) satisfies [ỹ^(l)]_{G_k} = y^(l) if k = l and 0 otherwise; and ẽ^(l) satisfies [ẽ^(l)]_{G_k} = 1 if k = l and 0 otherwise. Note that [θ̃^(l)]_{G_k}, [ỹ^(l)]_{G_k} and [ẽ^(l)]_{G_k} (k = 1, ..., L) respectively represent the elements of the k-th group of θ̃^(l), ỹ^(l) and ẽ^(l). Thus we have θ^(l)T u_{G_l} = u^T θ̃^(l) and y^(l)T u_{G_l} = u^T ỹ^(l). The gradient equation with respect to u in Eq. (22) is therefore:

    ∇_u L_ρ = 2η u − z − Σ_{l=1}^{L} θ̃^(l) + ρ (Σ_{l=1}^{L} ẽ^(l)) • u − ρ Σ_{l=1}^{L} ỹ^(l) = 0,    (23)

where "•" denotes element-wise multiplication. Thus, we obtain the update rule for u, ensuring it is a unit vector:

    u ← û/||û||,  where  û = z + Σ_{l=1}^{L} θ̃^(l) + ρ Σ_{l=1}^{L} ỹ^(l).    (24)

We also obtain the subgradient equations (see [31]) with respect to y^(l) in Eq. (22):

    ∇_{y^(l)} L_ρ = λ w_l · s^(l) + θ^(l) + ρ (y^(l) − u_{G_l}) = 0,    (25)

where l = 1, ..., L; if y^(l) ≠ 0, then s^(l) = y^(l)/||y^(l)||_2, otherwise s^(l) is a vector with ||s^(l)||_2 ≤ 1. For convenience, let t^(l) = ρ u_{G_l} − θ^(l); we then use a block coordinate descent step to learn the auxiliary variables y. Since the y^(l) (l = 1, ..., L) are independent, they can be updated in parallel according to the following formula:

    y^(l) ← (1/ρ)(1 − λ w_l/||t^(l)||_2) t^(l)  if ||t^(l)||_2 > λ w_l,  and  y^(l) ← 0  otherwise.    (26)

Based on ADMM [34], we also obtain the update rule for θ:

    θ^(l) ← θ^(l) + ρ (y^(l) − u_{G_l}),  l = 1, ..., L.    (27)

Combining Eqs. (24), (26) and (27), we get an ADMM-based method to solve problem (21) (Algorithm 3). Note that the output of Algorithm 3 is a set of selected group indexes, denoted T. For example, if y = [y^(1); y^(2); y^(3)], y^(1) = 0, y^(2) ≠ 0, and y^(3) ≠ 0, then T = {2, 3}.

Algorithm 3 ADMM method for problem (21)
Require: z ∈ R^p, G, λ, ρ > 0
1: Initialize θ and y
2: repeat
3:   Update u with fixed y and θ using Eq. (24)
4:   Update y with fixed u and θ using Eq. (26)
5:   Update θ with fixed u and y using Eq. (27)
6: until convergence
7: T = {g : ||y^(g)||_2 > 0, g ∈ {1, ..., |G|}}
8: return T
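A compact sketch of Algorithm 3 (our own code, not the authors' implementation) iterating the updates (24), (26) and (27); as in Eq. (24), scalar factors absorbed by the normalization of u are ignored:

import numpy as np

def prox_ogl1_admm(z, groups, lam, rho, weights, n_iter=200):
    """ADMM iterations (24), (26), (27) for problem (21); returns the set T
    of active (selected) overlapping groups, as in Algorithm 3."""
    y = [np.zeros(len(idx)) for idx in groups]      # auxiliary group copies
    theta = [np.zeros(len(idx)) for idx in groups]  # Lagrange multipliers
    for _ in range(n_iter):
        u_hat = z.astype(float).copy()              # u-update, Eq. (24)
        for idx, th, yl in zip(groups, theta, y):
            u_hat[idx] += th + rho * yl
        u = u_hat / np.linalg.norm(u_hat)
        for l, idx in enumerate(groups):            # y-update, Eq. (26)
            t = rho * u[idx] - theta[l]
            t_norm = np.linalg.norm(t)
            if t_norm > lam * weights[l]:
                y[l] = (1.0 - lam * weights[l] / t_norm) * t / rho
            else:
                y[l] = np.zeros(len(idx))
        for l, idx in enumerate(groups):            # dual update, Eq. (27)
            theta[l] = theta[l] + rho * (y[l] - u[idx])
    return {l for l in range(len(groups)) if np.linalg.norm(y[l]) > 0}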

Algorithm 4 OGL1-SVD or SVD(OGL1, OGL1)
Require: Matrix X ∈ R^{p×n}, λ_u and λ_v; G^u and G^v
Ensure: u, v and d
1:  Initialize v with ||v|| = 1
2:  repeat
3:    Let z_u = Xv
4:    Get the active groups T_u by Algorithm 3 with z_u and G^u as input
5:    û = z_u ◦ 1_{T_u}
6:    u = û/||û||
7:    Let z_v = X^T u
8:    Get the active groups T_v by Algorithm 3 with z_v and G^v as input
9:    v̂ = z_v ◦ 1_{T_v}
10:   v = v̂/||v̂||
11:   d = z_v^T v
12: until d converges
13: return u, v and d

In summary, based on the ADMM algorithm (Algorithm 3), we adopt an alternating iterative strategy (Algorithm 4) to solve SVD(OGL1, OGL1). In Algorithm 4, the operation x = z ◦ 1_T denotes that x_{G_l} = z_{G_l} if group l ∈ T, and the remaining elements of x are zero.

2.4 OGL0-SVD

Here we define an overlapping group L0-norm penalty (OGL0) of u as follows:

    Ω_OGL0(u) = minimize_{J ⊆ G^u, supp(φ(u)) ⊆ J}  Σ_{l=1}^{L} 1(||u_{G_l}|| ≠ 0),    (28)

where supp(·) denotes the index set of non-zero elements of a given vector.

Based on the definition of OGL0, we propose the fourth group-sparse SVD model with overlapping group L0-norm penalty (OGL0-SVD) as follows:

    minimize_{u,v,d}  ||X − d u v^T||_F^2
    subject to  ||u||_2 ≤ 1, Ω_OGL0(u) ≤ k_u,
                ||v||_2 ≤ 1, Ω_OGL0(v) ≤ k_v.    (29)

Similarly, we solve the above problem by using an alternating iterative method: fixing u (or v), we transform the original optimization problem into a projection problem with overlapping group L0-norm penalty.

Fix v in problem (29) and let z = Xv; the problem can then be written as a projection problem with overlapping group L0-norm penalty:

    minimize_{||u|| ≤ 1}  −z^T u,  s.t. Ω_OGL0(u) ≤ k_u.    (30)

To solve the above problem, we introduce y and rewrite it in a new way:

    minimize_{||u|| ≤ 1, y}  −z^T u,  s.t. Ω_GL0(y) ≤ k_u,  y^(l) = u_{G_l},    (31)

where l = 1, ..., L and y = [y^(1); ...; y^(L)].

The above problem contains an overlapping group-sparsity-inducing penalty with L0-norm; thus, it is difficult to obtain the exact solution of problem (31). To this end, we use an approximate method, which replaces z^T u by Σ_l z_{G_l}^T u_{G_l}. Since y^(l) = u_{G_l} in problem (31), we have Σ_l z_{G_l}^T u_{G_l} = Σ_l z_{G_l}^T y^(l). Thus, problem (31) approximately reduces to the problem below:

    minimize_{||u|| ≤ 1, y}  −Σ_l z_{G_l}^T y^(l),  s.t. Ω_GL0(y) ≤ k_u,  y^(l) = u_{G_l}.    (32)

Since y has a non-overlapping structure, we can easily get the optimal solution of the above problem in u and y. To sum up, we obtain an approximate solution of (30) as Theorem 2 suggests.

Theorem 2. The approximate solution of (30) is P̂_OGL0(z)/||P̂_OGL0(z)||_2, where

    [P̂_OGL0(z)]_{G_i} = z_{G_i}  if i ∈ supp(φ(z), k_u),  and  0  otherwise,    (33)

where i = 1, 2, ..., L, φ(z) = [||z_{G_1}||; ...; ||z_{G_L}||] and supp(φ(z), k_u) denotes the set of indexes of the largest k_u elements of φ(z).

Briefly, Theorem 2 shows that approximately solving problem (30) is equivalent to keeping the elements of the k_u groups with the largest group-norm values and setting the other elements to zero.

Fix u in problem (29) and let z = X^T u; problem (29) then reduces to the following one:

    minimize_{||v|| ≤ 1}  −z^T v,  s.t. Ω_OGL0(v) ≤ k_v.    (34)

Similarly, based on Theorem 2, we obtain the approximate solution of (34) as P̂_OGL0(z)/||P̂_OGL0(z)||_2. Finally, we propose an alternating iterative method based on this approximate projection to solve problem (29) (Algorithm 5).

Algorithm 5 OGL0-SVD or SVD(OGL0, OGL0)
Require: Matrix X ∈ R^{p×n}, k_u and k_v; G^u and G^v
Ensure: u, v and d
1:  Initialize v with ||v||_2 = 1
2:  repeat
3:    Let z_u = Xv
4:    û = P̂_OGL0(z_u) by using Eq. (33)
5:    u = û/||û||_2
6:    Let z_v = X^T u
7:    v̂ = P̂_OGL0(z_v) by using Eq. (33)
8:    v = v̂/||v̂||_2
9:    d = z_v^T v
10: until d converges
11: return u, v and d
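The approximate projection of Theorem 2 differs from Theorem 1 only in that the kept groups may overlap. A minimal sketch (our own code, with hypothetical names):

import numpy as np

def proj_ogl0(z, groups, k):
    """Approximate overlapping-group-L0 projection of Theorem 2: keep the
    union of the k groups with the largest norms ||z_G|| and normalize."""
    norms = [np.linalg.norm(z[idx]) for idx in groups]
    keep = np.argsort(norms)[-k:]
    u = np.zeros_like(z)
    for g in keep:                 # overlapping groups may share variables
        u[groups[g]] = z[groups[g]]
    u_norm = np.linalg.norm(u)
    return u / u_norm if u_norm > 0 else u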

2.5 Convergence analysis

Inspired by [27], [37], for a two-block coordinate problem whose objective is strictly convex in each subproblem, each subproblem has a unique global optimal solution. The Gauss-Seidel method can effectively solve such a two-block coordinate problem, and it converges to a critical point for any given initialization (see [38] and the references therein). We note that both the proposed GL1-SVD (Algorithm 1) and OGL1-SVD (Algorithm 4) are Gauss-Seidel-type methods and that the subproblems of the GL1-SVD and OGL1-SVD models are strictly convex. Thus, the GL1-SVD and OGL1-SVD algorithms are convergent.

Next we discuss the convergence of GL0-SVD (Algorithm 2). In [35], the authors developed a class of methods based on a proximal gradient strategy to solve a broad class of nonconvex and nonsmooth problems:

    minimize_{u,v}  F(u, v) = f(u) + g(v) + H(u, v),    (35)

where f(u) and g(v) are nonconvex and nonsmooth functions and H(u, v) is a smooth function (also see [39], [40], [41]). The GL0-SVD model can be seen as such a problem with

    H(u, v) = −u^T X v,    (36a)
    f(u) = 0 if ||u|| = 1 and Ω_GL0(u) ≤ k_u, and f(u) = +∞ otherwise,    (36b)
    g(v) = 0 if ||v|| = 1 and Ω_GL0(v) ≤ k_v, and g(v) = +∞ otherwise.    (36c)

Note that F(u, v) = f(u) + g(v) + H(u, v) of GL0-SVD is semialgebraic and satisfies the KL property (regarding the semialgebraic and KL properties, see [35], [39], [41]). Based on Theorem 1 in [35] (see also Theorem 2 in [41]), we conclude that the GL0-SVD algorithm converges to a critical point.

In a word, GL1-SVD (Algorithm 1), GL0-SVD (Algorithm 2) and OGL1-SVD (Algorithm 4) converge to their corresponding critical points. Although OGL0-SVD (Algorithm 5) applies an approximate strategy, it shows good convergence in practice.

2.6 Group-sparse SVD for edge-guided gene selection

Given high-dimensional data (e.g., gene expression data) and a prior network (e.g., a gene interaction network), we can consider a special edge-group structure, in which each edge (e.g., a gene interaction) is considered as a group. In our study, the gene interaction network is regarded as the graph (V, E), where V = {1, ..., p} is the set of nodes (genes) and E is the set of edges (gene interactions). SVD(OGL0, L0) can be applied to analyze such high-dimensional gene expression data by integrating the group information G^u = E. The estimated sparse solution is used for gene selection.

2.7 Learning multiple factors

To identify the next gene module, we subtract the signal of the current pair of singular vectors from the input data (i.e., X := X − d u v^T), and then apply SVD(OGL0, L0) again to identify the next pair of sparse singular vectors. Repeating this step r times, we obtain r pairs of sparse singular vectors and a rank-r approximation of the matrix X.

3 SIMULATION STUDY

In this section, we applied the group-sparse SVD methods (GL1-SVD, GL0-SVD, OGL1-SVD and OGL0-SVD) to a set of simulated data and compared their performance with several sparse SVD methods that do not use prior group information, including L0-SVD [42], [43] and L1-SVD [10]. We generated two types of simulated data, with non-overlapping group structure (GR) and overlapping group structure (OGR), respectively (Fig. 1A).

Without loss of generality, we first generated u and v for the GR and OGR cases, and then generated a rank-one data matrix using the formula

    X = d u v^T + γ ε,    (37)

where d = 1, ε_{ij} ~ N(0, 1) i.i.d., and γ is a nonnegative parameter controlling the signal-to-noise ratio (SNR). The logarithm of the SNR (logSNR) is defined by

    logSNR = log_10( ||d u v^T||_F^2 / E(||γ ε||_F^2) ) = log_10( ||d u v^T||_F^2 / (γ^2 n p) ),    (38)

where E(||γ ε||_F^2) denotes the expected sum of squares of the noise.

We evaluated the performance of all methods by the following measures: true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), false discovery rate (FDR) and accuracy (ACC), defined as

    TPR = TP/P,  TNR = TN/N,  FPR = FP/N,  FDR = FP/(TP + FP),  ACC = (TP + TN)/(TP + FP + FN + TN),

where P denotes the number of positive samples, N the number of negative samples, TP the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.

3.1 Non-overlapping group structure (GR)

We generated the simulated data matrix X ∈ R^{p×n} with n = 100 samples without groups, and p row variables with 50 groups. We first generated v = rnorm(n, mean = 0, sd = 1), which samples n elements from the standard normal distribution. Then we generated u = [u_{G_1}; u_{G_2}; ...; u_{G_50}] with 50 groups, where u_{G_i} = sample({−1, 1}, q) if i ∈ {3, 4, 13, 14, 15, 33, 34, 43, 44, 45} (sampling q elements from {−1, 1}, where q denotes the number of members of a group), and u_{G_i} = 0 otherwise. Finally, we obtained the simulated matrix X using Eq. (37).

Here we first considered q ∈ {20, 100} and logSNR ∈ {−1, −2, −2.2, −2.4, −2.6, −2.8} to generate simulated data with GR. Given a logSNR and known u, v and d, we obtained γ from Eq. (38). For each pair (q, logSNR), we generated 50 simulated matrices X.

We evaluated the performance of GL1-SVD and GL0-SVD with k_u = 10 (groups) on these simulated data. For comparison, we forced the identified u to contain 200 non-zero elements if q = 20 (i.e., the number of rows of X is p = 1000) and 1000 non-zero elements if q = 100 (i.e., p = 5000) for L1-SVD and L0-SVD by tuning their parameters. For visualization, we first applied GL1-SVD and GL0-SVD with k_v = 10 to a GR simulated X with p = 1000 and logSNR = −2 to illustrate their performance, and compared them with the other methods (Fig. 2A).
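As an illustration of the generation scheme in Eqs. (37)-(38), the following sketch (our own code; names are ours) draws the Gaussian noise and solves Eq. (38) for γ at a target logSNR:

import numpy as np

def simulate_rank_one(u, v, log_snr, d=1.0, seed=None):
    """Generate X = d*u*v^T + gamma*eps (Eq. (37)), choosing gamma so
    that the data attain a target logSNR as defined in Eq. (38)."""
    rng = np.random.default_rng(seed)
    signal = d * np.outer(u, v)
    p, n = signal.shape
    # Eq. (38): logSNR = log10(||d u v^T||_F^2 / (gamma^2 * n * p))
    gamma = np.sqrt((signal ** 2).sum() / (n * p * 10.0 ** log_snr))
    return signal + gamma * rng.standard_normal((p, n))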


Fig. 2: Evaluation results of GL1-SVD, GL0-SVD, OGL1-SVD and OGL0-SVD and comparison with L1-SVD and L0-SVD on simulated data X with prior (non-)overlapping group information. (A) Illustration of all methods on the simulated X with p = 1000 and logSNR = −2 (original signal and the signals recovered by L1-SVD, L0-SVD, GL1-SVD, GL0-SVD, OGL1-SVD and OGL0-SVD). (B) and (C) Results of all methods in terms of ACC, singular value d and time (seconds) for the GR and OGR cases, respectively. Each bar entry is the average over 50 replications. The input X ∈ R^{p×100} varies with p = 1000 and 5000, and logSNR = −2.8, −2.6, −2.4, −2.2, −2 and −1.

Obviously, GL1-SVD and GL0-SVD can improve the performance of variable selection by integrating non-overlapping group information of variables. Further, we tested our methods on more GR simulation data (Fig. 2B and TABLE 2). We found that the performance of GL0-SVD and GL1-SVD is significantly superior to that of L0-SVD and L1-SVD across different logSNRs with p = 1000 or p = 5000 (Fig. 2B). In particular, the greater the noise in the simulated data, the larger the advantage of our methods. Furthermore, compared with GL1-SVD, GL0-SVD obtains higher singular values for different logSNRs (Fig. 2B). We also compared the CPU time of the different algorithms on an ordinary personal computer; every algorithm takes less than one second. These computational results illustrate that our models can enhance the power of variable selection by integrating group information of variables.

3.2 Overlapping group structure (OGR)

To generate OGR simulated data, we first generated v with n = 100 (samples) whose elements are drawn from a standard normal distribution, i.e., v = rnorm(n, mean = 0, sd = 1). Then we generated u with overlapping groups. We considered the following overlapping group structure for u: G_1 = {1, 2, ..., 2t}, G_2 = {t+1, t+2, ..., 3t}, ..., G_48 = {47t+1, 47t+2, ..., 49t}, G_49 = {48t+1, 48t+2, ..., 50t}, where every group overlaps half of its elements with each adjacent group. Note that the dimension of u is p = 50t. If i ∈ {3, 13, 14, 33, 43, 44} (active groups), then u_{G_i} = sample({−1, 1}, 2t) (2t denotes the number of members of a group); otherwise u_{G_i} = 0. We considered t ∈ {20, 100} and logSNR ∈ {−1, −2, −2.2, −2.4, −2.6, −2.8} for generating the OGR simulated data. Note that once u, v and logSNR are given, we can generate the simulated matrix X ∈ R^{p×n} using Eq. (37) with d = 1. For each pair (t, logSNR), we generated 50 simulated matrices X.

For visualization, we first applied OGL1-SVD and OGL0-SVD with k_v = 5 to a simulated X with OGR, p = 1000 and logSNR = −2, to illustrate their performance, and compared them with the other methods (Fig. 2A). Obviously, OGL1-SVD and OGL0-SVD can improve the performance of variable selection by integrating overlapping group information of variables. Further, we tested our methods on more OGR simulation data (Fig. 2C and TABLE 2). For comparison, we forced the identified u to contain 200 non-zero elements if t = 20 (i.e., the number of rows of X is p = 1000) and 1000 non-zero elements if t = 100 for L1-SVD and L0-SVD by tuning their parameters. The performance of OGL0-SVD and OGL1-SVD is significantly superior to that of L0-SVD and L1-SVD across different logSNRs with p = 1000 or p = 5000 (Fig. 2C). OGL1-SVD and OGL0-SVD obtain similar singular values, whereas OGL1-SVD needs more time for different logSNRs with p = 1000 or p = 5000 (Fig. 2C).

Finally, we also investigated the effect of the size of the data X on our methods. In the simulated data, there are 10 active groups (each containing q members) in the GR cases, and 5 active groups (each containing 2t members) in the OGR cases. We set q = t, so that a common original signal u can be used for the GR and OGR cases. Based on this definition of u and v, we set q ∈ {20, 40, 100, 160, 200} (i.e., p ∈ {1000, 2000, 5000, 8000, 10000}) and logSNR = −2.8 to generate X using Eq. (37). For each pair (q, logSNR), we generated 50 simulated matrices X. We applied GL0-SVD, GL1-SVD, L0-SVD and L1-SVD, as well as OGL1-SVD and OGL0-SVD, to the simulated data and compared their performance in different ways (TABLE 2). In summary, the group-sparse SVD methods obtain higher TPR, TNR and ACC (and lower FPR and FDR) than L1-SVD and L0-SVD (TABLE 2). Naturally, the group-sparse methods spend a bit more time and obtain lower singular values d.

TABLE 2: Evaluation results of GL0-SVD, GL1-SVD, OGL0-SVD, OGL1-SVD, L0-SVD and L1-SVD in terms of TPR, TNR, FPR, FDR, ACC, singular value d and time (seconds). The input X ∈ R^{p×100} varies with p = 1000, 2000, 5000, 8000 and 10000. All simulated data are generated at logSNR = −2.8. Each entry is an average over 50 replications.

            L1-SVD   L0-SVD   GL1-SVD  GL0-SVD  OGL1-SVD  OGL0-SVD
p = 1000
    TPR     0.21     0.21     0.22     0.22     0.23      0.23
    TNR     0.80     0.80     0.81     0.81     0.82      0.82
    FPR     0.20     0.20     0.19     0.19     0.18      0.18
    FDR     0.79     0.79     0.78     0.78     0.76      0.76
    ACC     0.68     0.68     0.69     0.69     0.70      0.70
    d       337.91   367.61   242.38   267.73   257.94    258.08
    time    0.10     0.02     0.27     0.07     0.25      0.12
p = 2000
    TPR     0.21     0.21     0.25     0.24     0.23      0.23
    TNR     0.80     0.80     0.81     0.81     0.82      0.82
    FPR     0.20     0.20     0.19     0.19     0.18      0.18
    FDR     0.79     0.79     0.75     0.76     0.75      0.75
    ACC     0.68     0.68     0.70     0.70     0.70      0.70
    d       436.70   480.66   284.84   329.42   317.45    317.65
    time    0.20     0.05     0.40     0.14     0.36      0.18
p = 5000
    TPR     0.22     0.23     0.66     0.55     0.55      0.55
    TNR     0.81     0.81     0.92     0.89     0.90      0.90
    FPR     0.19     0.19     0.08     0.11     0.10      0.10
    FDR     0.78     0.77     0.34     0.45     0.40      0.40
    ACC     0.69     0.69     0.87     0.82     0.83      0.83
    d       634.84   705.79   409.22   464.14   447.67    447.21
    time    0.48     0.21     0.80     0.29     0.62      0.34
p = 8000
    TPR     0.24     0.25     0.95     0.91     0.83      0.83
    TNR     0.81     0.81     0.99     0.98     0.97      0.97
    FPR     0.19     0.19     0.01     0.02     0.03      0.03
    FDR     0.76     0.75     0.05     0.09     0.13      0.13
    ACC     0.70     0.70     0.98     0.97     0.94      0.94
    d       778.10   869.29   555.55   588.70   569.39    569.11
    time    0.89     0.44     0.79     0.31     0.51      0.30
p = 10000
    TPR     0.25     0.27     1.00     0.99     0.92      0.92
    TNR     0.81     0.82     1.00     1.00     0.98      0.98
    FPR     0.19     0.18     0.00     0.00     0.02      0.02
    FDR     0.75     0.73     0.00     0.01     0.07      0.06
    ACC     0.70     0.71     1.00     1.00     0.97      0.97
    d       863.06   964.79   634.32   655.52   645.11    645.85
    time    1.25     0.60     0.80     0.32     0.48      0.31
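For reference, the support-recovery measures reported in TABLE 2 can be computed from the estimated and true supports of u as in the following sketch (our own helper, not from the paper):

import numpy as np

def support_metrics(u_hat, u_true):
    """TPR, TNR, FPR, FDR and ACC for recovering the support of u."""
    est, tru = u_hat != 0, u_true != 0
    tp = np.sum(est & tru)
    tn = np.sum(~est & ~tru)
    fp = np.sum(est & ~tru)
    fn = np.sum(~est & tru)
    return {"TPR": tp / (tp + fn), "TNR": tn / (tn + fp),
            "FPR": fp / (fp + tn), "FDR": fp / max(tp + fp, 1),
            "ACC": (tp + tn) / (tp + tn + fp + fn)}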

4 BIOLOGICAL APPLICATIONS

We applied our models to gene expression data from two well-known large-scale projects.

4.1 Biological data

CGP expression data. We first downloaded a gene expression dataset from the Cancer Genome Project (CGP) [44] with 13,321 genes across 641 cell lines (samples). The 641 cell lines are derived from different tissues and cancer types.

TCGA expression data. We also obtained gene expression datasets for twelve cancer types across about 4000 cancer samples (http://www.cs.utoronto.ca/~yueli/PanMiRa.html), downloaded from the TCGA database (http://cancergenome.nih.gov/). The twelve cancer types consist of Bladder urothelial carcinoma (BLCA, 134 samples), Breast invasive carcinoma (BRCA, 847 samples), Colon and rectum carcinoma (CRC, 550 samples), Head and neck squamous-cell carcinoma (HNSC, 303 samples), Kidney renal clear-cell carcinoma (KIRC, 474 samples), Brain lower grade glioma (LGG, 179 samples), Lung adenocarcinoma (LUAD, 350 samples), Lung squamous-cell carcinoma (LUSC, 315 samples), Prostate adenocarcinoma (PRAD, 170 samples), Skin cutaneous melanoma (SKCM, 234 samples), Thyroid carcinoma (THCA, 224 samples), and Uterine corpus endometrioid carcinoma (UCEC, 478 samples). We normalized the gene expression dataset of each cancer type using the R function scale. Furthermore, we also downloaded the corresponding clinical data of the above 12 cancer types from Firehose (http://firebrowse.org/).

KEGG pathway data. To integrate pathway group information with SVD(GL0, L0), we also downloaded the KEGG [13] gene pathways from the Molecular Signatures Database (MSigDB) [45]. We considered a KEGG pathway as a group and removed all KEGG pathways with > 100 genes. Finally, we obtained 151 KEGG pathways across 2778 genes by considering only the genes shared between the CGP gene expression data and the KEGG pathway data. On average, a gene pathway contains about 40 genes.

PPI network. We also downloaded a protein-protein interaction (PPI) network from Pathway Commons (http://www.pathwaycommons.org/). In our application, we considered the edge set of the PPI network as overlapping groups G^u, i.e., a gene (protein) interaction edge represents a group.

Data collection. Finally, we assembled three biological datasets to assess our models:

• Dataset 1: "CGP + PPI". This dataset was obtained by combining the CGP gene expression and PPI network data, with 13,321 genes, 641 samples and 262,462 interactions.
• Dataset 2: "TCGA + PPI". This dataset was obtained by combining the TCGA gene expression and PPI network data. For each TCGA cancer type, we obtained the expression of 10,399 genes and a PPI network of 10,399 genes and 257,039 interactions.
• Dataset 3: "CGP + KEGG". This dataset was obtained by combining the CGP gene expression and KEGG pathway data, including gene expression of 641 samples and 151 KEGG pathways across 2778 genes.

Biological analysis of gene modules. To assess whether the identified modules have significant biological functions, we employed the bioinformatics tool DAVID (https://david.ncifcrf.gov/) [46] to perform gene enrichment analysis with GO biological processes (BP) and KEGG pathways. Terms with Benjamini-corrected p-value < 0.05 are considered significant.

For each cancer type, we first used SVD(OGL0, L0) to find a gene module with similar expressions (Fig. 1). To analyze the clinical relevance of each module in a given cancer type, we ran a multivariate Cox proportional hazards model to obtain prognostic scores for the patients of this cancer type, implemented using the predict function in the R package 'survival' with type = "lp". We then divided the patients of this cancer type into low-risk and high-risk groups based on the prognostic scores. Finally, we assessed the overall survival difference between the two groups using the log-rank test and drew a Kaplan-Meier (KM) curve for visualization.
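The survival protocol above is implemented in R (the 'survival' package). For readers working in Python, a rough analogue using the third-party lifelines library might look as follows; the median cutoff for the low-/high-risk split is our assumption, since the paper does not specify the threshold:

import numpy as np
from lifelines import CoxPHFitter
from lifelines.statistics import logrank_test

def module_survival_pvalue(expr, time, event):
    """expr: DataFrame of module-gene expression (patients x genes);
    time/event: survival time and censoring indicator per patient."""
    df = expr.copy()
    df["time"], df["event"] = time, event
    cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
    score = np.ravel(cph.predict_partial_hazard(df))   # prognostic score
    high = score > np.median(score)                    # median split (our choice)
    res = logrank_test(df["time"][high], df["time"][~high],
                       df["event"][high], df["event"][~high])
    return res.p_value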


Fig. 3: Results for the "CGP + PPI" dataset. (A) The gene set in the module identified by SVD(OGL0, L0) corresponds to a subnetwork of the prior PPI network. The gene subnetwork contains 56 genes, 18 of which (the circled ones) belong to the KEGG B cell receptor signaling pathway. (B) The top 10 enriched KEGG and GO BP pathways of this module. Enrichment scores were computed as -log10(p), where p is the Benjamini-Hochberg adjusted p-value.

4.2 Application to CGP data with a PPI network

We applied SVD(OGL0, L0) to the "CGP + PPI" dataset consisting of the CGP gene expression and PPI network data. We set k_u = 100 to extract gene interactions and k_v = 50 to select 50 samples. Here we focus only on the first identified gene module, which contains 56 genes and 272 interactions. We first found that the subnetwork of the identified module in the prior PPI network is dense (Fig. 3A). As we expected, this module contains a large number of genes that are linked in the prior PPI network (the degree sum of these genes is 6321). The identified 50 samples of this module are significantly related to some blood-related cancers, including AML (6 of 16), B cell leukemia (7 of 7), B cell lymphoma (7 of 10), Burkitt lymphoma (11 of 11), lymphoblastic leukemia (6 of 11), and lymphoid neoplasm (5 of 11). On the other hand, these samples are specifically related to blood tissue (50 of 100). Moreover, this module is enriched in 77 different GO biological processes and 11 KEGG pathways, most of which are immune- and blood-related pathways, including immune response-activating cell surface receptor signaling pathway, immune response-regulating cell surface receptor signaling pathway and immune response-activating signal transduction (Fig. 3B).

Finally, we also applied L0-SVD to extract a gene module with the same numbers of genes and samples for comparison. However, this module contains only 35 edges, and most of the identified genes are isolated. We also repeatedly sampled 80 percent of the genes and samples from the original matrix (i.e., the CGP data) 10 times. For each sampled data matrix, we applied OGL0-SVD and L0-SVD to identify a gene module with the same settings as above. We obtained an average of 382.8 edges for OGL0-SVD, but an average of only 35.6 edges for L0-SVD. These results indicate that OGL0-SVD can identify gene modules with more connected edges in the prior PPI network by integrating the edge-group structure.
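The module connectivity statistics used here and in TABLE 3 below (#edge and #degree) can be computed from a prior PPI network, e.g., with networkx; this sketch is our own illustration, not the authors' code:

import networkx as nx

def module_connectivity(ppi, module_genes):
    """Edges among the module genes and their total degree in the
    prior PPI network (the #edge / #degree statistics of TABLE 3)."""
    nodes = [g for g in module_genes if g in ppi]
    n_edge = ppi.subgraph(nodes).number_of_edges()
    total_degree = sum(deg for _, deg in ppi.degree(nodes))
    return n_edge, total_degree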
TABLE 3: The top gene modules identified by SVD(OGL0, L0) and L0-SVD from 12 different cancer-type gene expression datasets. #edge indicates the number of edges among all module genes, and #degree indicates the sum of degrees of all module genes in the prior PPI network.

    Data          #gene   #edge^a  #edge^b  #degree^a  #degree^b
    BLCA-module   66      22       135      1936       1800
    SKCM-module   65      82       202      8348       2800
    LUAD-module   73      25       173      1434       2948
    LUSC-module   67      19       170      793        2813
    BRCA-module   79      83       162      8493       5211
    HNSC-module   84      23       159      1231       3598
    THCA-module   50      52       382      6871       4699
    KIRC-module   66      29       194      4848       10497
    LGG-module    59      198      139      6048       7549
    PRAD-module   68      69       178      7463       9210
    UCEC-module   55      22       251      4512       14144
    CRC-module    49      1        384      1177       16620

    ^a corresponds to L0-SVD. ^b corresponds to SVD(OGL0, L0).

4.3 Application to TCGA data with a PPI network

We next applied SVD(OGL0, L0) to analyze the gene expression data of 12 different cancers (BLCA, SKCM, LUAD, LUSC, BRCA, HNSC, THCA, KIRC, LGG, PRAD, UCEC and CRC). Note that SVD(OGL0, L0) was run independently on each cancer to identify an important gene module across a set of tumor patients. SVD(OGL0, L0) uses the overlapping group L0-norm penalty for gene-interaction selection with k_u = 100 (i.e., it identifies 100 edge-groups) and the L0-norm penalty for sample selection with k_v = 50 (i.e., 50 samples). Thus, we obtained 12 gene modules for the 12 cancer types, with about 65 genes on average (TABLE 3). For convenience, we denote the identified module of cancer type 'X' as the X-module (TABLE 3). For comparison, we also enforced L0-SVD to identify a gene module with the same numbers of samples and genes for each cancer type.

[Fig. 4B panel annotations (log-rank p-values per cancer): BLCA p = 6.61e-05, BRCA p = 2.57e-10, CRC p = 5.39e-11, HNSC p = 3.95e-07, KIRC p = 4.5e-17, LGG p = 9.84e-12, LUAD p = 1.15e-10, LUSC p = 8.33e-09, PRAD p = 3.34e-01, SKCM p = 1.64e-12, THCA p = 1.7e-07, UCEC p = 1.72e-02.]

Fig. 4: Results for the "TCGA + PPI" dataset. (A) The top five enriched GO BP (Gene Ontology biological process) terms of each gene module; enrichment scores were computed as -log10(p-value), where the p-value is Benjamini-Hochberg adjusted. For visualization, -log10(p-value) values greater than 10 are set to 10. (B) Kaplan-Meier survival curves showing the overall survival of the 12 different cancer types; p-values were computed by the log-rank test and '+' denotes censored patients.

As we expected, all modules identified by SVD(OGL0, L0) contain more edges than those identified by L0-SVD. Interestingly, we also find that most of the module genes identified by SVD(OGL0, L0) have higher degree in the prior PPI network than those identified by L0-SVD without prior information (TABLE 3), indicating that these genes with higher degree tend to be related to cancer. In addition, we observed an overlapping pattern in which four cancer-specific modules (for BLCA, SKCM, LUAD and LUSC, cancers related to skin/epidermis tissue) share many common genes. Subsequent functional analysis shows that these common genes are enriched in some important immune-related pathways (Fig. 4A). Taking the 12 cancer-type gene modules together, we find 412 enriched GO biological processes (GO BPs) and 48 KEGG pathways with Benjamini-corrected p-value < 0.05. Interestingly, some important cancer-related KEGG pathways are discovered in both the BRCA-module and the HNSC-module, including ECM-receptor interaction (p-value = 4.3e-18 in the BRCA-module and 2.8e-22 in the HNSC-module), Focal adhesion (p-value = 3.2e-16 in the BRCA-module and 3.5e-22 in the HNSC-module), and Pathways in cancer (p-value = 5.0e-03 in the BRCA-module and 2.3e-04 in the HNSC-module) (Fig. 4A). Some important immune-related pathways are found in the BLCA-, SKCM-, LUAD- and LUSC-modules, including Cell adhesion molecules, Hematopoietic cell lineage, Natural killer cell mediated cytotoxicity, Primary immunodeficiency, and T cell receptor signaling pathway.

TABLE 4: Summary of the top ten gene modules identified by SVD(OGL0, L0) in the "CGP + KEGG" dataset. #Gene and #Sample denote the number of genes and samples in each module; KEGG Term denotes the enriched KEGG pathways; Tumor Type and Tissue Type denote the enriched tumor and tissue types, respectively.

ID 1  | 221 genes | 100 samples
    KEGG Term: Ribosome; Primary immunodeficiency; B cell receptor signaling pathway; Intestinal immune network for IgA production; Asthma
    Tumor Type: AML; B cell leukemia; B cell lymphoma; Burkitt lymphoma; lymphoblastic leukemia; lymphoid neoplasm other
    Tissue Type: Blood
ID 2  | 99 genes | 100 samples
    KEGG Term: DNA replication; Homologous recombination; Base excision repair; Mismatch repair; Nucleotide excision repair
    Tumor Type: B cell leukemia; Lung: small cell carcinoma; lymphoblastic leukemia; lymphoblastic T cell leukaemia
    Tissue Type: Blood; Lung
ID 3  | 200 genes | 100 samples
    KEGG Term: ECM receptor interaction; Glycosaminoglycan degradation; Primary immunodeficiency; Glycosaminoglycan biosynthesis heparan sulfate; Arrhythmogenic right ventricular cardiomyopathy (ARVC)
    Tumor Type: Glioma; lymphoblastic leukemia; lymphoblastic T cell leukaemia; Osteosarcoma
    Tissue Type: Blood; Bone; CNS; Soft-tissue
ID 4  | 216 genes | 100 samples
    KEGG Term: Pathogenic Escherichia coli infection; Lysine degradation; Notch signaling pathway; Adherens junction; RNA degradation
    Tumor Type: AML; B cell leukemia; B cell lymphoma; Burkitt lymphoma; lymphoblastic leukemia; lymphoblastic T cell leukaemia; lymphoid neoplasm other
    Tissue Type: Blood
ID 5  | 83 genes | 100 samples
    KEGG Term: Allograft rejection; Type I diabetes mellitus; Asthma; Graft versus host disease; Intestinal immune network for IgA production
    Tumor Type: AML; B cell leukemia; B cell lymphoma; Burkitt lymphoma; Hodgkin lymphoma; lymphoblastic T cell leukaemia; lymphoid neoplasm other; Myeloma
    Tissue Type: Blood
ID 6  | 166 genes | 100 samples
    KEGG Term: Complement and coagulation cascades; Phenylalanine metabolism; Primary bile acid biosynthesis; PPAR signaling pathway; Steroid biosynthesis
    Tumor Type: Large intestine; Liver
    Tissue Type: GI tract
ID 7  | 253 genes | 100 samples
    KEGG Term: Hematopoietic cell lineage; Acute myeloid leukemia; Fc epsilon RI signaling pathway; Fc gamma R mediated phagocytosis; Primary immunodeficiency
    Tumor Type: AML; B cell lymphoma; lymphoblastic leukemia; lymphoblastic T cell leukaemia; lymphoid neoplasm other; oesophagus; upper aerodigestive tract
    Tissue Type: Blood; Upper aerodigestive
ID 8  | 141 genes | 100 samples
    KEGG Term: Glutathione metabolism; Metabolism of xenobiotics by cytochrome P450; Steroid hormone biosynthesis; Ascorbate and aldarate metabolism; Arachidonic acid metabolism
    Tumor Type: Liver; Lung: NSCLC: adenocarcinoma; Lung: NSCLC: squamous cell carcinoma; Oesophagus
    Tissue Type: GI tract; Lung; Upper aerodigestive
ID 9  | 132 genes | 100 samples
    KEGG Term: Glycosaminoglycan biosynthesis chondroitin sulfate; Mismatch repair; N-glycan biosynthesis; Glycosylphosphatidylinositol (GPI) anchor biosynthesis; Vibrio cholerae infection
    Tumor Type: Glioma; Lung: small cell carcinoma; lymphoblastic leukemia; Neuroblastoma
    Tissue Type: CNS
ID 10 | 193 genes | 100 samples
    KEGG Term: B cell receptor signaling pathway; Circadian rhythm mammal; Primary immunodeficiency; Fc gamma R mediated phagocytosis; Fc epsilon RI signaling pathway
    Tumor Type: AML; B cell lymphoma; Glioma; Lung: small cell carcinoma; lymphoblastic leukemia; lymphoid neoplasm other
    Tissue Type: Blood

This is consistent with previous studies showing that lung cancer is related to immune response [47]. In addition, several KEGG pathways are found in the THCA-module, including Oxidative phosphorylation (p = 1.1e-59), Parkinson's disease (p = 7.9e-52), Alzheimer's disease (p = 5.8e-48) and Huntington's disease (p = 1.8e-46). For clarity, the top five enriched GO BPs of each cancer module are shown in Fig. 4A. Several GO BPs are common to many cancers, while most are specific to certain cancers (Fig. 4A).

Moreover, we find that the patients of 11 cancers (BLCA, SKCM, LUAD, LUSC, BRCA, HNSC, THCA, KIRC, LGG, UCEC and CRC) can be effectively separated into two risk groups with significantly different survival outcomes (p-value < 0.05, log-rank test) (Fig. 4B). All these results show that our method, by integrating interaction structure, can identify more biologically relevant gene modules and improve their biological interpretation.

4.4 Application to CGP data with KEGG pathways

We applied SVD(OGL0, L0) to the "CGP + KEGG" dataset. For convenience, we set k_u = 5 (i.e., five pathways) and k_v = 100 (i.e., 100 samples) in SVD(OGL0, L0). Since the identified top ten pairs of singular vectors explained more than 60% of the variance, we focused on the top ten pairs of singular vectors to extract ten gene functional modules (TABLE 4).

For each gene module, we computed the significance of the overlap between its sample set and each tissue or cancer class using a hypergeometric test, implemented via the R function phyper. Cancer types with few samples (≤ 5) were ignored. We find that each module is significantly related to at least one cancer or tissue type, indicating that the modules are indeed biologically relevant.
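The hypergeometric overlap test above (R's phyper) has a direct Python analogue via scipy.stats.hypergeom; a minimal sketch (our own code):

from scipy.stats import hypergeom

def overlap_pvalue(module_samples, class_samples, n_total):
    """One-sided hypergeometric tail probability of the overlap between a
    module's sample set and a tissue/cancer class (analogue of phyper)."""
    m, c = set(module_samples), set(class_samples)
    k = len(m & c)                                       # observed overlap
    return hypergeom.sf(k - 1, n_total, len(c), len(m))  # P(overlap >= k)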
4.4 Application to CGP data with KEGG Pathways

We applied SVD(OGL0, L0) to the "CGP + KEGG" data. For convenience, we set ku = 5 (i.e., five pathways) and kv = 100 (i.e., 100 samples) in SVD(OGL0, L0). Since the top ten pairs of identified singular vectors explained more than 60% of the variance, we focused on these ten pairs to extract ten gene functional modules (TABLE 4).

For each gene module, we computed the significance of the overlap between its sample set and each tissue or cancer class using a hypergeometric test, implemented via the R function phyper; a minimal sketch of this test is given below. Cancer types with only a few samples (≤ 5) were ignored. We find that each module is significantly related to at least one cancer or tissue type, indicating that the modules are indeed biologically relevant.
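The overlap test takes only a few lines of R. The sketch below is our reading of the procedure described above, with illustrative names; it is not taken from the authors' implementation.

```r
# P-value for the overlap between a module's sample set and one cancer/tissue
# class, under the hypergeometric null (sampling without replacement).
# `N` is the total number of samples in the data set.
overlap_pvalue <- function(module_samples, class_samples, N) {
  k <- length(intersect(module_samples, class_samples))  # observed overlap
  m <- length(class_samples)                             # class size ("white balls")
  n <- N - m                                             # remaining samples ("black balls")
  # P(overlap >= k) when drawing length(module_samples) samples out of N:
  phyper(k - 1, m, n, length(module_samples), lower.tail = FALSE)
}
```

Classes with ≤ 5 samples would simply be skipped before calling this function, matching the filter described above.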
It is worth noting that some cancer/tissue subtype-specific KEGG pathways are discovered. We summarize the key characteristics of the identified modules in TABLE 4. For example, we find that all the samples of module 1 belong to blood tissue, and most of them are significantly enriched in lymphoid-related cancers. Interestingly, the corresponding five blood-specific KEGG pathways (ribosome, primary immunodeficiency, B cell receptor signaling pathway, intestinal immune network for IgA production, and asthma) are related to lymphoma. As another example, module 6 is specifically related to large intestine cancer, liver cancer and GI-tract tissue, with important KEGG pathways including complement and coagulation cascades, phenylalanine metabolism, primary bile acid biosynthesis, PPAR signaling pathway, and steroid biosynthesis. These results provide a new way to understand and study the mechanisms shared between different tissues and cancers.

5 DISCUSSION AND CONCLUSION

Inferring block patterns from high-dimensional biological data is still a central challenge in computational biology. On the one hand, we aim to identify gene subsets that are co-expressed across subsets of samples in the corresponding gene expression data. On the other hand, we expect the identified gene sets to belong to biologically meaningful groups such as functional pathways. To this end, we propose group-sparse SVD models that identify gene block patterns (modules) by integrating gene expression data with prior gene knowledge.

We note that the concept of group sparsity has also been introduced into related models such as the non-negative matrix factorization (NMF) model [48]. To our knowledge, however, only the non-overlapping group Lasso has been applied as a penalty in NMF, and our models differ from the group-sparse NMF model in two respects. First, we use a more direct non-overlapping sparse penalty, the group L0-norm, in the SVD model and propose a group-sparse SVD with group L0-norm penalty (GL0-SVD); more importantly, we prove the convergence of the GL0-SVD algorithm (a sketch of the group-wise projection underlying this model is given at the end of this section). Second, the non-overlapping group structure of the group Lasso and group L0-norm penalties limits their applicability in practice. Several works have studied the (overlapping) group Lasso in regression tasks, but little work has focused on developing structured sparse matrix factorization models with overlapping group structure. We therefore propose two overlapping group-sparse SVD models (OGL1-SVD and OGL0-SVD) and discuss their convergence. Computational results on high-dimensional gene expression data show that our methods identify more biologically relevant gene modules and improve their biological interpretation compared with state-of-the-art sparse SVD methods. Moreover, we expect our methods to be applicable to other high-dimensional biological data, such as single-cell RNA-seq data, as well as to high-dimensional data in other fields. In the future, it will be valuable to extend the current concept to structured sparse tensor factorization models for multi-way analysis of multi-source data.
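For concreteness, here is a minimal R sketch of the group-wise projection that, on our reading, underlies the GL0-SVD update: keep the ku groups of a singular vector with the largest Euclidean norms and zero out the rest. It is an illustration under that assumption, not the authors' published code.

```r
# Project a dense vector u onto {vectors with at most ku nonzero groups}.
# `groups` is a list of integer index vectors forming a non-overlapping
# partition of 1:length(u); all names are illustrative.
group_l0_project <- function(u, groups, ku) {
  norms <- sapply(groups, function(idx) sqrt(sum(u[idx]^2)))
  keep  <- order(norms, decreasing = TRUE)[seq_len(ku)]  # top-ku groups by norm
  u_new <- numeric(length(u))
  for (g in keep) u_new[groups[[g]]] <- u[groups[[g]]]
  u_new / sqrt(sum(u_new^2))  # renormalize to unit length
}
```

Within an alternating scheme, such a projection would be applied after each rank-one update of u (and analogously for v), with groups playing the role of pathways, as in the setting above where ku = 5 pathways are retained.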

ACKNOWLEDGEMENT

Shihua Zhang and Juan Liu are the corresponding authors of this paper. Wenwen Min would like to thank the Academy of Mathematics and Systems Science, CAS, for its support during his visit. This work has been supported by the National Natural Science Foundation of China [11661141019, 61621003, 61422309, 61379092]; the Strategic Priority Research Program of the Chinese Academy of Sciences (CAS) [XDB13040600]; the National Ten Thousand Talent Program for Young Top-notch Talents; the Key Research Program of the Chinese Academy of Sciences [KFZD-SW-219]; the National Key Research and Development Program of China [2017YFC0908405]; and the CAS Frontier Science Research Key Project for Top Young Scientist [QYZDB-SSW-SYS008].

REFERENCES

[1] A. P. Singh and G. J. Gordon, "A unified view of matrix factorization models," in Machine Learning and Knowledge Discovery in Databases (ECML/PKDD), 2008, pp. 358–373.
[2] O. Alter, P. O. Brown, and D. Botstein, "Singular value decomposition for genome-wide expression data processing and modeling," Proceedings of the National Academy of Sciences, vol. 97, no. 18, pp. 10101–10106, 2000.
[3] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
[4] X. Zhou, J. He, G. Huang, and Y. Zhang, "SVD-based incremental approaches for recommender systems," Journal of Computer and System Sciences, vol. 81, no. 4, pp. 717–733, 2015.
[5] I. Sohn, J. Kim, S.-H. Jung, and C. Park, "Gradient lasso for Cox proportional hazards model," Bioinformatics, vol. 25, no. 14, pp. 1775–1781, 2009.
[6] H. Fang, C. Huang, H. Zhao, and M. Deng, "CCLasso: correlation inference for compositional data through lasso," Bioinformatics, vol. 31, no. 19, pp. 3172–3180, 2015.
[7] K. Greenlaw, E. Szefer, J. Graham, M. Lesperance, F. S. Nathoo, and A. D. N. Initiative, "A Bayesian group sparse multi-task regression model for imaging genetics," Bioinformatics, vol. 33, no. 16, pp. 2513–2522, 2017.
[8] J. Shi, N. Wang, Y. Xia, D.-Y. Yeung, I. King, and J. Jia, "SCMF: Sparse covariance matrix factorization for collaborative filtering," in IJCAI, 2013, pp. 2705–2711.
[9] L. Jing, P. Wang, and L. Yang, "Sparse probabilistic matrix factorization by Laplace distribution for collaborative filtering," in IJCAI, 2015, pp. 1771–1777.
[10] M. Lee, H. Shen, J. Z. Huang, and J. Marron, "Biclustering via sparse singular value decomposition," Biometrics, vol. 66, no. 4, pp. 1087–1095, 2010.
[11] M. Sill, S. Kaiser, A. Benner, and A. Kopp-Schneider, "Robust biclustering by sparse singular value decomposition incorporating stability selection," Bioinformatics, vol. 27, no. 15, pp. 2089–2097, 2011.
[12] D. Yang, Z. Ma, and A. Buja, "A sparse singular value decomposition method for high-dimensional data," Journal of Computational and Graphical Statistics, vol. 23, no. 4, pp. 923–942, 2014.
[13] M. Kanehisa and S. Goto, "KEGG: Kyoto encyclopedia of genes and genomes," Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
[14] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
[15] L. Jacob, G. Obozinski, and J.-P. Vert, "Group lasso with overlap and graph lasso," in ICML. ACM, 2009, pp. 433–440.
[16] G. Obozinski, L. Jacob, and J.-P. Vert, "Group lasso with overlaps: the latent group lasso approach," arXiv preprint arXiv:1110.0413, 2011.
[17] L. Yuan, J. Liu, and J. Ye, "Efficient methods for overlapping group lasso," in Advances in Neural Information Processing Systems, 2011, pp. 352–360.
[18] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
[19] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
[20] D. M. Witten, R. Tibshirani, and T. Hastie, "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis," Biostatistics, vol. 10, no. 3, pp. 515–534, 2009.
[21] H. Zou, T. Hastie, and R. Tibshirani, "Sparse principal component analysis," Journal of Computational and Graphical Statistics, vol. 15, no. 2, pp. 265–286, 2006.
[22] L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll, "Sparse discriminant analysis," Technometrics, vol. 53, no. 4, pp. 406–413, 2011.
[23] D. M. Witten and R. Tibshirani, "A framework for feature selection in clustering," Journal of the American Statistical Association, vol. 105, no. 490, pp. 713–726, 2010.
[24] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani, "A sparse-group lasso," Journal of Computational and Graphical Statistics, vol. 22, no. 2, pp. 231–245, 2013.
[25] R. Tibshirani, "Regression shrinkage and selection via the lasso: a retrospective," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 3, pp. 273–282, 2011.
[26] H. Wang and C. Leng, "A note on adaptive group lasso," Computational Statistics & Data Analysis, vol. 52, no. 12, pp. 5277–5286, 2008.
[27] P. Tseng, "Convergence of a block coordinate descent method for nondifferentiable minimization," Journal of Optimization Theory & Applications, vol. 109, no. 3, pp. 475–494, 2001.
[28] J. Friedman, T. Hastie, H. Höfling, R. Tibshirani et al., "Pathwise coordinate optimization," The Annals of Applied Statistics, vol. 1, no. 2, pp. 302–332, 2007.
[29] J. Friedman, T. Hastie, and R. Tibshirani, "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, vol. 33, no. 1, p. 1, 2010.
[30] N. Simon, J. Friedman, and T. Hastie, "A blockwise descent algorithm for group-penalized multiresponse and multinomial regression," arXiv preprint arXiv:1311.6529, 2013.
[31] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty, Nonlinear Programming: Theory and Algorithms. John Wiley & Sons, 2013.
[32] J. Huang, T. Zhang, and D. Metaxas, "Learning with structured sparsity," Journal of Machine Learning Research, vol. 12, no. Nov, pp. 3371–3412, 2011.
[33] Z. Qin and D. Goldfarb, "Structured sparsity via alternating direction methods," Journal of Machine Learning Research, vol. 13, no. May, pp. 1435–1468, 2012.
[34] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[35] J. Bolte, S. Sabach, and M. Teboulle, "Proximal alternating linearized minimization for nonconvex and nonsmooth problems," Mathematical Programming, vol. 146, no. 1-2, pp. 459–494, 2014.
[36] S. Mosci, S. Villa, A. Verri, and L. Rosasco, "A primal-dual algorithm for group sparse regularization with overlapping groups," in Advances in Neural Information Processing Systems, 2010, pp. 2604–2612.
[37] L. Grippo and M. Sciandrone, "On the convergence of the block nonlinear Gauss–Seidel method under convex constraints," Operations Research Letters, vol. 26, no. 3, pp. 127–136, 2000.
[38] M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," SIAM Journal on Optimization, vol. 23, no. 2, pp. 1126–1153, 2013.
[39] T. Pock and S. Sabach, "Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems," SIAM Journal on Imaging Sciences, vol. 9, no. 4, pp. 1756–1787, 2016.
[40] F. Yang, Y. Shen, and Z. S. Liu, "The proximal alternating iterative hard thresholding method for L0 minimization, with complexity O(1/√k)," Journal of Computational and Applied Mathematics, vol. 311, pp. 115–129, 2017.
[41] Y. Xu and W. Yin, "A globally convergent algorithm for nonconvex optimization based on block coordinate update," Journal of Scientific Computing, pp. 1–35, 2017.
[42] M. Asteris, A. Kyrillidis, O. Koyejo, and R. Poldrack, "A simple and provable algorithm for sparse diagonal CCA," in ICML. ACM, 2016, pp. 1148–1157.
[43] W. Min, J. Liu, and S. Zhang, "A novel two-stage method for identifying miRNA-gene regulatory modules in breast cancer," in IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2015, pp. 151–156.
[44] M. J. Garnett, E. J. Edelman, S. J. Heidorn, C. D. Greenman, A. Dastur, K. W. Lau, P. Greninger, I. R. Thompson, X. Luo, J. Soares et al., "Systematic identification of genomic markers of drug sensitivity in cancer cells," Nature, vol. 483, no. 7391, pp. 570–575, 2012.
[45] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander et al., "Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles," Proceedings of the National Academy of Sciences, vol. 102, no. 43, pp. 15545–15550, 2005.
[46] D. W. Huang, B. T. Sherman, and R. A. Lempicki, "Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources," Nature Protocols, vol. 4, no. 1, pp. 44–57, 2009.
[47] P. Roepman, J. Jassem, E. F. Smit, T. Muley, J. Niklinski, T. van de Velde, A. T. Witteveen, W. Rzyman, A. Floore, S. Burgers et al., "An immune response enriched 72-gene prognostic profile for early-stage non–small-cell lung cancer," Clinical Cancer Research, vol. 15, no. 1, pp. 284–290, 2009.
[48] J. Kim, R. D. C. Monteiro, and H. Park, "Group sparsity in nonnegative matrix factorization," in SIAM International Conference on Data Mining, 2012, pp. 851–862.