Shi et al. BMC Systems Biology 2010, 4:74 http://www.biomedcentral.com/1752-0509/4/74

RESEARCH ARTICLE Open Access Co-expressionResearch article module analysis reveals biological processes, genomic gain, and regulatory mechanisms associated with breast cancer progression

Zhiao Shi1,2, Catherine K Derow3 and Bing Zhang*3

Abstract Background: expression signatures are typically identified by correlating gene expression patterns to a disease phenotype of interest. However, individual gene-based signatures usually suffer from low reproducibility and interpretability. Results: We have developed a novel algorithm Iterative Clique Enumeration (ICE) for identifying relatively independent maximal cliques as co-expression modules and a module-based approach to the analysis of gene expression data. Applying this approach on a public breast cancer dataset identified 19 modules whose expression levels were significantly correlated with tumor grade. The correlations were reproducible for 17 modules in an independent breast cancer dataset, and the reproducibility was considerably higher than that based on individual or modules identified by other algorithms. Sixteen out of the 17 modules showed significant enrichment in certain (GO) categories. Specifically, modules related to cell proliferation and immune response were up-regulated in high- grade tumors while those related to cell adhesion was down-regulated. Further analyses showed that transcription factors NYFB, E2F1/E2F3, NRF1, and ELK1 were responsible for the up-regulation of the cell proliferation modules. IRF family and ETS family were responsible for the up-regulation of the immune response modules. Moreover, inhibition of the PPARA signaling pathway may also play an important role in tumor progression. The module without GO enrichment was found to be associated with a potential genomic gain in 8q21-23 in high-grade tumors. The 17- module signature of breast tumor progression clustered patients into subgroups with significantly different relapse- free survival times. Namely, patients with lower cell proliferation and higher cell adhesion levels had significantly lower risk of recurrence, both for all patients (p = 0.004) and for those with grade 2 tumors (p = 0.017). Conclusions: The ICE algorithm is effective in identifying relatively independent co-expression modules from gene co- expression networks and the module-based approach illustrated in this study provides a robust, interpretable, and mechanistic characterization of transcriptional changes.

Background Since the pioneering work on breast cancer [1,2], gene Large-scale gene expression profiling with microarray expression signatures have been reported for different platforms is becoming a popular approach to the identifi- diseases. Some of the signatures have been approved by cation of molecular biomarkers for disease diagnosis, the US Food and Drug Administration for diagnostic prognosis, and response to treatment. Gene expression assay. signatures are usually identified by correlating gene Despite these exciting progresses, individual gene- expression patterns to a disease phenotype of interest. based signatures suffer from some well-known problems. First, given the large number of genes on the array, high * Correspondence: [email protected] correlation among the genes, small number of samples in 3 Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37240, USA a data set, and large variance across patient samples, it is Full list of author information is available at the end of the article

© 2010 Shi et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons At- tribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Shi et al. BMC Systems Biology 2010, 4:74 Page 2 of 14 http://www.biomedcentral.com/1752-0509/4/74

common to see little overlap among gene signatures iden- overlap in real networks, algorithms have been developed tified by different research groups for a common clinical to merge highly overlapping cliques to facilitate further outcome [3]. Secondly, because gene-based signatures analysis [18,21]. often fail to put individual genes in a functional context, Applying graph-theoretic approaches on gene expres- understanding the biology highlighted by a signature sion data requires the construction of gene co-expression remains a significant challenge[4]. Moreover, because network from the data [22]. Usually, gene expression sim- important regulators such as transcription factors are not ilarity is calculated for each pair of genes, and then a net- necessarily regulated at the transcriptional level, gene- work is constructed by setting a threshold for the pair- based signatures can seldom identify regulatory mecha- wise similarity. Such a network is usually represented as nisms underlying the disease phenotype of interest. As a an un-weighted graph, in which each node is a gene and result, module-based approaches aimed at a more robust two genes are connected by an edge if their expression and interpretable characterization of transcriptional similarity level is above a pre-selected threshold. The changes have emerged [4-6]. choice of a threshold can significantly affect the integrity In contrast to the gene-level analyses, module-based of the network and the co-expression modules derived approaches use gene modules as the basic building blocks from it. Various methods have been proposed in an for analysis. Modules can be defined in a knowledge- attempt to guide the selection of an appropriate thresh- driven fashion based on existing knowledge on pathways, old, including those based on statistical analysis [23,24] biological processes, and complexes [5]; they can or network properties [25,26]. Usually, functional similar- also be derived in a data-driven fashion by identifying ity derived from Gene Ontology (GO) is used subse- subgroups of genes sharing similar expression pattern quently to evaluate the biological relevance of the across multiple conditions, i.e., co-expression modules selected threshold [25,27]. [7]. The latter is particularly interesting as it is not limited The application of clique-based approaches on gene co- to or biased towards existing knowledge, and holds the expression networks is hampered by an extreme degree of potential to reveal truly novel regulatory mechanisms overlap among maximal cliques in a network. While underlying the dynamic molecular processes underpin- allowing overlap is an important advantage of clique- ning disease[8]. Various studies have demonstrated the based approaches, excessive overlap carries too much significance of examining gene co-expression in address- redundant information and makes downstream analysis ing biological problems [7,9,10]. difficult. Although one may consider a post-processing Some commonly used methods for analyzing gene co- step to merge highly overlapping cliques [18,21], this is expression pattern in microarray data include hierarchi- usually impractical for gene co-expression networks cal clustering [11], K-means clustering [12], and self orga- because there may exist billions of maximal cliques in a nizing maps [13]. Generally, these methods are suitable network and it is already a significant challenge just to for understanding the global structure of the data but store these cliques. suboptimal for module identification. Recently, various In this paper, we propose a clique-based framework for methods for module detection in large-scale networks the analysis of gene expression data. An iterative clique have been proposed. Some representative methods enumeration (ICE) algorithm is developed to identify a include the betweenness-based method [14], the modu- manageable number of modules that can represent major larity optimization method [15], the spectral partitioning transcriptional programs encoded in a co-expression net- method [16], and graph-theoretic approaches relying on work. Downstream analyses on the modules further cliques and other tightly connected components [17,18]. reveal biological processes and regulatory mechanisms Clique-based approaches have to date provided one of underlying disease phenotypes of interest. Using publicly the most successful ways to consider module overlap, available human breast cancer gene expression datasets, which is an important characteristic of real networks we demonstrated that the ICE algorithm was able to [18]. For example, a gene/protein can have multiple func- detect functionally homogeneous co-expression mod- tions and therefore belong to multiple modules or com- ules, and the detected modules covered a diverse variety plexes. In fact, clique-based approaches have been of biological processes. We further identified modules successfully applied on the identification of protein com- whose overall expression levels were correlated with plexes from protein interaction networks [18-21]. A tumor grade, and showed that the correlations were more clique is a complete sub-network in which all nodes are reproducible in an independent data set than those based connected in a pairwise fashion. A clique is maximal if it on individual genes or modules identified by other algo- is not contained in any other clique, and maximum if it is rithms. Next, we illustrated that the module-based the largest maximal clique in the graph. Clique-based approach could reveal important biological processes as approaches usually start with the identification of all well as previously reported and novel regulatory mecha- maximal cliques in the network. Because maximal cliques nisms underlying breast tumor progression. Finally, we Shi et al. BMC Systems Biology 2010, 4:74 Page 3 of 14 http://www.biomedcentral.com/1752-0509/4/74

showed that the expression pattern of the modules could phenotype of interest (e.g. tumor grade or stage), the cluster patients into subgroups with significantly differ- module-based approach analyzes modules as units and ent relapse-free survival times and provided useful prog- identifies co-expression modules that are significantly nostic information that was independent of tumor grade. correlated with the phenotype, i.e. potential module bio- markers. Finally, identified modules are queried against Results and Discussion gene set databases such as the GO gene sets and Tran- Overview of the co-expression module-based analysis scription Factor Binding Site (TFBS) gene sets to infer framework biological processes and regulatory mechanisms underly- Figure 1 depicts an overview of the co-expression mod- ing the phenotype of interest. ule-based analysis framework. Based on a gene expres- sion data set, a co-expression network is constructed in Knowledge guided co-expression network construction which each node is a gene and two genes are connected Microarray gene expression data on two cohorts of breast by an edge if their expression similarity level is above a cancer patients were used for demonstration in this study. pre-selected threshold. Although we used the Pearson's Both datasets (GSE2109_breast and GSE2990 [28]) were correlation coefficient for the similarity calculation in this downloaded from the Gene Expression Omnibus (GEO) study, other measurements such as the Spearman's corre- database http://www.ncbi.nlm.nih.gov/geo/. lation coefficient and the mutual information can be GSE2109_breast was generated using the Affymetrix equally applied. A knowledge-guided method is U133 Plus 2.0 Array, while GSE2990 was generated using employed for threshold selection to ensure the biological the Affymetrix U133A Array. Clinical information on relevance of the gene co-expression network. Next, the patients in each dataset is summarized in Table 1. ICE algorithm developed in this study is used to identify We used the GSE2109_breast dataset consisting of 351 relatively independent maximal cliques as co-expression breast tumor specimens for co-expression network con- modules. In contrast to the single gene-based analyses in struction. The construction of co-expression network by which individual genes are tested for their correlation to a thresholding is a critical step in the analysis framework.

Co-expression Iterative Clique Thresholding Enumeration

Gene expression data Co-expression network Co-expression modules

10 1 00! 0%& 0(& -3 &( $& )0" $1)& )0#$ ((# 0 $! &00 (& ## #!  )& &! &%"' ! "#$ !  ##( $) 0&2 !&"  )&" $& $&  !  1(& !  ! 2% !  $ ! (&  " ! #" #"! $ $1 ( 1(! 1(!  564 $& $"$ & Phenotype Correlation #   0" "& 3 $& $  %2# $& $  & 1 !$ 3 )# )#  )#  )#   )#   )#   )#  )#  )#   )#   )#  )# )#  )#  )#  )# )# )#  )#  )#  )# )#   )#  )#  )#  )#  )#   )#  )#  )#  )#   )#  )#  )#  )#  )#  )#  )#  )#  )#  )# )#   )#   )#  )#  )#   )#  )#  )# )#  )#  )# )#   )#  )#  )#  )#   )#  )#  )#  )#   )#  )#  )#  )#   )#  )# )#  )#  )#  )#  )#  )#  )# )#  )#  )#  )#  )#  )#   )#   )#  )#  )#  )#  )#  )#  )#   )#  )#   )#  )#  )#   )#   )#  )#  )#  )#  )#   )#  )#  )#  )# )#  )#  )#  )# )#   )#  )#  )#  )#  )#  )#  )#  )#  )# )# )# )#  )#   )# )#   )#  )#  )#  )# )# )#  )#  )# )#  )#   )#  )#  )#  )#  )#  )#  )#  )# )#  )#  )#  )# )# )#  )#  )#   )#  )#  )#  )# )#  )#  )#  )#  )#  )#   )# )#  )#  )#   )#   )#  )#   )# )#  )#  )#  )#  )#  )#  )#  )#  )# )#  )#  )# )#  )# )#  )# )#  )# )#  )#   )#  )#  )#  )#  )# )# )#  )#  )# )#  )#  )#   )#  )#  )#  )#  )#   )#  )# )#  )#  )#  )#  )# )#  )#   )#   )#  )#   )#  )#   )# )#   )#  )# )#  )#   )#  )#  )#  )#  )# )# )#  )#  )#  )#  )#  )#   )# )# )#  )#  )#  )#  )#  )#  )#  )#  )#  )# )#  )#  )#  )#  )#   )# )# )# )# )#  )#  )#   )#  )#  )#  )#   )#   )#  )#   )#  )#  )#  )# )#   )#  )#  )#   )#  )#  )#  )#  )#  Grade 1 Grade 2 Grade 3

Module biomarkers NFYB

TTK NEK2 TACC3 UBE2C TOP2A CENPF KIF23 SPC25 CDKN3 RACGAP1TPX2 KIF20A ESPL1 CDC45L STMN1 RRM2 CDC2

GO Enrichment Analysis Biological processes TFBS Regulatory mechanisms

Figure 1 Schematic overview of the co-expression module-based analysis framework. GO: Gene Ontology. TFBS: Transcription Factor Binding Sites. Shi et al. BMC Systems Biology 2010, 4:74 Page 4 of 14 http://www.biomedcentral.com/1752-0509/4/74

Table 1: Microarray datasets used in the study

GSE2109_breast GSE2990 ● adjp=0.01 Sample size 351 189 top 1% top 0.1% ●

● Pathological stage 0 4- ● ●

● ● 1 38 - ●

●● ● ● ●● ●● ● ● ●●● Functional similarity ●●● ●●● ●●●● ●●●●● 2 129 - ●●●●● ●●●● ●●●●●● ●●●●●●● ●●●●●●●● ● ●●●●●●●● ● ●●●●●●●●●● ●● ●● ●●●●●●●●●●●● ● ● ●●●●●●●● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● 3 67 -

4 5- 0.5 1 1.5 2 2.5 3 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1

NA 108 - Co−expression level

Figure 2 Functional similarity for gene pairs at different co-ex- Pathological grade 1 31 67 pression level. Co-expressions are binned into 0.1 unit intervals and the average GO semantic similarity for each bin is plotted as an open 2 113 46 circle. The three vertical lines correspond to three co-expression levels: Bonferoni corrected p value of 0.01 (blue), the top 1% of all correlations (red), and the top 0.1% of all correlations (green). 3 136 59

co-expression level corresponding to the top 0.1% of all NA 71 17 correlations. Therefore, we used this threshold (Pearson's correlation coefficient of 0.6533) and constructed a gene Recurrence 1 -67co-expression network with 7,819 nodes and 195,928 edges. Because biological networks, including co-expres- 0 - 120 sion networks, usually share some important characteris- tics such as the power law degree distribution and high NA -2clustering coefficient [7,29], we further examined these topological characteristics of the constructed network. Indeed, the network showed a power law degree distribu- To ensure the biological relevance of constructed net- tion (data not shown) and a high clustering coefficient of work, we used a knowledge-guided method for threshold 0.52, suggesting its biological relevance. selection. Figure 2 shows the relationship between co- expression level (as measured by the Pearson's correlation Functional homogeneity within and heterogeneity across coefficient) and functional similarity (as measured by GO co-expression modules semantic similarity). As shown in the figure, in general, a The extremely high clustering coefficient of the network higher co-expression level corresponds to a higher func- indicated a highly modular structure. In the ICE algo- tional similarity score. Although large positive and nega- rithm, we define co-expression modules as subgroups of tive correlations are both statistically significant, large perfectly interconnected genes, i.e., cliques. A classic negative correlations don't correlate with higher func- maximal clique enumeration algorithm [30] identified tional similarity. We examined different threshold selec- 1,306,734,139 maximal cliques of size 10 or more from tions corresponding to the Bonferroni adjusted p value of the network, making existing methods relying on the 0.01, the top 1% of all correlations, and the top 0.1% of all maximal cliques such as the C-Finder algorithm [18] correlations (Figure 2). No critical functional similarity impractical. Applying the ICE algorithm proposed in this change was observed at the co-expression level corre- study, we identified 50 maximal cliques as relatively inde- sponding to the Bonferroni adjusted p value of 0.01, indi- pendent co-expression modules. A major concern in the cating that statistical significance couldn't be transferred application of clique-based approaches is their computa- directly into biological significance. In contrast, a sharp tional complexity. However, as gene co-expression net- increase in functional similarity was observed above the works are sparse with a scale-free distribution, clique- Shi et al. BMC Systems Biology 2010, 4:74 Page 5 of 14 http://www.biomedcentral.com/1752-0509/4/74

based approaches are applicable on these networks [22]. drastically increased in grade 3 tumors compared to For the above analysis, it took only 8 minutes and about grade 1 and 2 tumors. These results are consistent with a 128 MB memory for the ICE algorithm to generate the previous report that different tumor grades are associated results on a computer with a 2.4 GHz AMD Opteron with distinct gene expression signatures, while tumor CPU and 4 GB memory. specimens from distinct pathological stages may share Because all genes in a module are highly co-expressed remarkable similarity in the expression profiles [32]. and are likely co-regulated, we expect functional homo- geneity within the modules. On the other hand, because Reproducibility of the module-based biomarkers in overlap between the modules is restrained, we expect the independent data set modules to be functionally heterogeneous and to repre- Potential application of molecular features (e.g. individual sent relatively independent biological processes and tran- genes or modules) as disease biomarkers relies on the scriptional regulatory programs. In order to evaluate the reproducibility of the association between molecular fea- functional homogeneity of the modules, we performed tures and the disease phenotype in independent patient the functional category enrichment analysis for the mod- cohorts. Ideally, if we find a biomarker from one patient ules against the GO categories. Among the 50 modules, cohort, we hope that this biomarker could hold signifi- 36 (72%) showed high functional homogeneity, with a cant in independent cohorts. In this study, we define Bonferroni-adjusted p-value (B-adjp) less than 0.05 in at reproducibility of a biomarker set identified from one least one of the GO categories. Moreover, these modules patient cohort as the percentage of included biomarkers corresponded to a diverse variety of biological processes that hold significant in an independent cohort. including metabolic process, immune system process, 19 modules showed significant correlation with tumor developmental process, system process, cell cycle, grade in the GSE2109_breast dataset (FDR < 0.01). We response to stimulus, transport, signal transduction, etc. calculated the percentage of the correlations that held These results suggest that co-expression modules identi- significant in the independent dataset GSE2990. Despite fied by the ICE algorithm are functionally homogeneous GSE2990 being generated on a different microarray plat- within a module and heterogeneous across modules. form on which only 64% of the genes in the Lack of functional enrichment for some of the modules GSE2109_breast dataset were present, 17 out of the 19 may be attributed to the false discovery of the algorithm, modules (89.47%) showed significant correlation with the incompleteness of the GO annotations, or the non- tumor grade (FDR < 0.01) in this independent dataset, functional relationship among genes in a module (e.g. indicating high reproducibility of this module-based bio- genomic proximity). marker set. We also checked the modules for their gene components. The 19 modules identified in the Modules correlated with breast cancer progression GSE2109_breast dataset comprised 405 genes, among The modules identified above were dynamically which 371 were included in the 17 reproducible modules expressed in response to different conditions. Because (91.60%). When we focused on genes common to both samples in the dataset included tumor specimens from platforms, the 19 modules comprised 352 common different stages and grades, we hypothesized that the genes, among which 330 were included in the 17 repro- dynamic expression of some of the modules might corre- ducible modules (93.75%). late with tumor stage or grade. Average expression of all As a comparison, we repeated the analysis at the indi- genes in a module was used to represent the overall vidual gene level. Among the 19,803 genes in the expression level of the module. We used the non-para- GSE2109_breast dataset, 5,107 showed significant corre- metric Jonckheere-Terpstra trend test to evaluate the cor- lation with tumor grade (FDR < 0.01). When tested in relation between the expression levels of the modules and GSE2990, only 1,569 genes (30.72%) maintained signifi- stage or grade. Because we were testing 50 modules at the cant correlation with tumor grade (FDR < 0.01). Some of same time, the p values generated by the test were further the inconsistency can be explained by the difference adjusted using the Benjamini and Hochberg correction between the array platforms because only 3,758 out of the [31] to derive False Discovery Rates (FDRs). To our sur- 5,107 genes were presented in the GSE2990 dataset. Nev- prise, none of the modules was correlated with tumor ertheless, even when we considered only the 3,758 genes, stage (FDR > 0.46). On the other hand, 19 out of the 50 the gene-level reproducibility (41.75%) was still much modules were significantly correlated with tumor grade lower than that for the ICE modules (Figure 3). (FDR < 0.01), and some of the correlations were We also used the popular partitioning method K-means extremely significant. For example, module_2 showed an and the graph-based module identification algorithm FDR of 3.52e-22. Module_2 is used for illustration in Fig- MCODE to identify modules from the GSE2109_breast ure 1. As shown in the heat map, all genes in this module dataset. For the K-means analysis, the number of clusters were highly correlated, and their expression levels were was arbitrarily set to 50 to match the number of modules Shi et al. BMC Systems Biology 2010, 4:74 Page 6 of 14 http://www.biomedcentral.com/1752-0509/4/74

On the other hand, owing to its stringent requirement, 100%  the ICE algorithm is likely to miss some genes in a mod-

80% ule, which will limit its use in applications sensitive to

  false negatives. However, as long as the core components

   60% in a co-expression module can be identified, one should be able to use the modules as biomarkers or for the infer-  40% ence of underlying regulatory mechanisms through    enrichment analysis.

20% Major biological processes associated with breast tumor Reproducibility in independent data set (%) Reproducibility progression 0% ICE MCODE K-means Gene_1 Gene_2 To understand the biology related to the grade-correlated Methods modules and the interaction among the modules, we cre- Figure 3 Reproducibility of grade-correlation in an independent ated a network for genes within the 17 reproducible ICE data set. Modules or genes correlated with tumor grade were identi- modules based on the GSE2109_breast dataset. As shown fied in the GSE2109 dataset using different methods and tested for re- in Figure 4, the modules can be divided into three major producibility in the GSE2990 dataset. ICE, MCODE, and K-means are three module-based methods. Gene_1 and Gene_2 are based on indi- groups, corresponding to the three connected compo- vidual genes. Gene_1 does not exclude the difference between mi- nents in the figure. croarray platforms used for the two datasets, while Gene_2 is based on Group I included five up-regulated modules, all of them common genes on the microarray platforms. were related to immune system processes or immune response. Group II included seven up-regulated modules, identified by ICE. For the MCODE analysis, 37 modules six of which were related to cell proliferation. Group III with 10 or more genes were identified. Based on the Jon- included five down-regulated modules that were involved ckheere-Terpstra trend test, 27 K-means modules and 21 in cell adhesion and systems development. High cell pro- MCODE modules were significantly correlated with liferation in high-grade tumors fits with a 'hallmark' of tumor grade (FDR < 0.01) in the GSE2109_breast dataset. cancer being uncontrolled proliferation [33]. Low cell When tested in the GSE2990 dataset, 16 K-means mod- adhesion levels in high-grade tumors is consistent with ules (59.26%) and 14 MCODE modules (66.67%) showed the fact that the loss of cell adhesion is required for cells significant correlation with tumor grade (FDR < 0.01). to invade tissues and reach the bloodstream, where they These comparisons demonstrate that module-based can travel to distant sites to establish metastases [34]. biomarker sets are more reproducible than the one based Although the exact roles of the immune system during on individual genes when the same selection criterion cancer development is complex and still a matter of (FDR < 0.01) is applied. Using more stringent criteria for debate [35], the association between immune system and the selection of gene-based biomarkers may improve the cancer development has been known for over a century reproducibility of the selected biomarkers [28]. However, [36]. These results demonstrate that the module-based with a very stringent criterion, the functional context of approach provides easily interpretable characterization of selected biomarkers may be lost, making it difficult to transcriptional changes. By correlating modules instead understand the underlying biology. For the module-based of individual genes with tumor grades, we gain a higher- methods, although the numbers of reproducible modules order understanding of the biological processes related to were comparable for the three methods, the ICE method breast cancer progression. outperforms both K-means and MCODE with regard to Although the ICE algorithm aims at identifying rela- the reproducibility (Figure 3). The higher reproducibility tively independent modules by ensuring a proportion of of ICE may be attributed to the knowledge-guided net- unique genes in individual modules (10 in this study), we work construction and the stringent requirement noticed considerable overlap among some of the modules imposed by the module definition in the ICE algorithm (e.g. Module_2 and Module_20). It is not clear whether that ensures high-level of co-expression among all genes these modules are really independent. However, down- in a module. Incorporating GO information in the K- stream transcription factor target enrichment analysis means analysis might help improve its performance. could help elucidate whether they are associated with dif- Although MCODE starts with the same network as ICE, ferent regulatory mechanisms. high-level of co-expression among all genes in a module is not guaranteed. High-level of co-expression is critical Genomic gain during breast tumor progression for the inference of co-regulation [8], therefore, the ICE Module_36 in Group II (Figure 4) was not enriched in any algorithm is more likely to identify truly co-regulated GO categories and would not have been identified using gene sets that are reproducible in independent data sets. any knowledge-driven analyses based on GO. However, Shi et al. BMC Systems Biology 2010, 4:74 Page 7 of 14 http://www.biomedcentral.com/1752-0509/4/74

Figure 4 Network of genes in the 17 modules reproducibly correlated with tumor grade. Nodes represent genes and edges represent co-ex- pression between genes. Edges connecting genes in the same module are colored in dark grey, while edges connecting genes in different modules are colored in light grey. Genes in the same module are represented by nodes of the same shape. Genes belonging to multiple modules are repre- sented by round nodes. The color scale bar shows the scale of p values for correlations between gene expression and tumor grade as calculated by the Jonckheere-Terpstra test. Modules are clustered into three connected components I, II, and III. Primary function annotations for each module are labeled in the figure. careful examination of the module found that all 13 genes [37]. Our results indicate that the genomic gain may in this module lie in close proximity on 8, affect a broader region including 8q21, 8q22, and 8q23. suggesting potential biological relevance of the module. Interestingly, a published bioinformatics study using 12 Specifically, 9 of the genes are located in the 8q22 region, independent human breast cancer microarray studies 3 in 8q21, and 1 in 8q23 (Figure 4). If we randomly pick 13 comprising 1422 tumor samples also identified 8q21-23 genes from all 19,803 genes on the microarray, we only as a potential aberrant chromosomal region [39]. It is not expect to find 0.09 genes in the 8q21-23 region that has a clear why this region is consistently duplicated in high- total of 144 genes. Finding all 13 genes from the module grade tumors. However, because Module_36 is moder- in this region is significantly non-random (p = 2.92e-28 in ately correlated with the cell proliferation modules (Fig- the Hypergeometric test). ure 4), we hypothesize that this genomic instability may To check whether this enrichment was due to the be associated with increased cell proliferation. genomic gain of 8q21-23 in high-grade tumors, we plot- ted the gene expression data for all genes in this region Regulatory mechanisms underpinning breast tumor based on the GSE2109_breast dataset (Figure 5). It is progression clear from the figure that not only the 13 genes, the Because genes in a co-expression module are likely to be majority of genes in this region showed consistent and regulated by a cohesive mechanism [8], we performed the elevated expression in high-grade tumors. Previously, functional category enrichment analysis for the modules genomic gain at 8q22 in breast tumor samples has been against the gene sets of transcription factor targets in reported by independent groups and has been associated order to identify potential transcriptional regulatory with poor-prognosis [37,38]. A recent study identified mechanisms underlying the modules. Among the 17 MTDH, one gene in module_36, as the most significant modules, 8 were enriched with targets of certain tran- functional mediator of this poor-prognosis genomic gain scription factors (B-adjp less than 0.05, Figure 6). Shi et al. BMC Systems Biology 2010, 4:74 Page 8 of 14 http://www.biomedcentral.com/1752-0509/4/74

GradeGrade

TRPS1 CSMD3 KCNV1 GOLSYN EBAG9 PKHD1L1 ENY2 NUDCD1 TRHR TMEM74 TTC35 EIF3E TTC35 RSPO2 8q23 ANGPT1 ABRA OXR1 ZFPM2 LRP12 DPYS TM7SF4 RIMS2 SLC25A32 SLC25A32 CTHRC1 FZD6 BAALC C8orf56 ATP6V1C1 ATP6V1C1 AZIN1 KLF10 AZIN1 ODF1 UBR5 RRM2B UBR5 NCALD GRHL2 NACAP1 ZNF706 ZNF706 YWHAZ PABPC1 SNX31 ANKRD46 RNF19A SPAG1 POLR2K POLR2K FBXO43 RGS22 COX6C VPS13B OSR2 STK3 KCNS2

8q22 POP1 HRSP12 C8orf47 SNORA72 RPL30 MATN2 LAPTM4B MTDH TSPYL5 MTDH PGCP SDC2 PTDSS1 MTERFD1 MTERFD1 UQCRB C8orf37 PLEKHF2 C8orf38 TP53INP1 CCNE2 INTS8 DPY19L4 KIAA1429 DPY19L4 RAD54B GEM CDH17 TMEM67 C8orf39 RBM12B FAM92A1 C8orf83 RUNX1T1 SLC26A7 LRRC69 OTUD6B TMEM55A NECAB1 TMEM64 CALB1 DECR1 NBN OSGIN2 RIPK2 MMP16 CNBD1 CNGB3 CPNE3 FAM82B WWP1 SLC7A13 ATP6V0D2 PSKH2 REXO1L1 CA2 CA3 CA1 CA13 C8orf59 E2F5 LRRCC1 RALYL SNX16 CHMP4C

8q21 ZFAND1 IMPA1 ZFAND1 FABP4 PMP2 PAG1 ZNF704 ZBTB10 TPD52 MRPS28 MRPS28 HEY1 STMN2 IL7 FAM164A PKIA PXMP3 ZFHX4 LOC100192378 HNF4G CRISPLD1 PI15 GDAP1 JPH1 LY96 TMEM70 TCEB1 UBE2W TCEB1 STAU2 RDH10 RPL7 C8orf84 TERF1 KCNB2 GSM38110 GSM38106 GSM88988 GSM38051 GSM76486 GSM89018 GSM53080 GSM76514 GSM53028 GSM53059 GSM38109 GSM89068 GSM46893 GSM88985 GSM76582 GSM38059 GSM76493 GSM76545 GSM46908 GSM53130 GSM53032 GSM38094 GSM53172 GSM89019 GSM88960 GSM46849 GSM76521 GSM46894 GSM38091 GSM38062 GSM38054 GSM76560 GSM89008 GSM46974 GSM46968 GSM38057 GSM46855 GSM38102 GSM46862 GSM76497 GSM88991 GSM89003 GSM89023 GSM53115 GSM89006 GSM89016 GSM38081 GSM46891 GSM89011 GSM46874 GSM88970 GSM76566 GSM46870 GSM53102 GSM89009 GSM76491 GSM88979 GSM53027 GSM76528 GSM89015 GSM46863 GSM46959 GSM38099 GSM76563 GSM38080 GSM38090 GSM89021 GSM89005 GSM88986 GSM89055 GSM76637 GSM53086 GSM53131 GSM76592 GSM46846 GSM76627 GSM89071 GSM88996 GSM76619 GSM76495 GSM53108 GSM76564 GSM46890 GSM46885 GSM38082 GSM53099 GSM89000 GSM76492 GSM88990 GSM89010 GSM38063 GSM46827 GSM38083 GSM76517 GSM89102 GSM46873 GSM46880 GSM38092 GSM46883 GSM88969 GSM88980 GSM76561 GSM301657 GSM152759 GSM179895 GSM152707 GSM179863 GSM102523 GSM277734 GSM231917 GSM138054 GSM179780 GSM102482 GSM152627 GSM138035 GSM138028 GSM277679 GSM102569 GSM137903 GSM117758 GSM179835 GSM179829 GSM179932 GSM179891 GSM179841 GSM117761 GSM179791 GSM137950 GSM138051 GSM138041 GSM152615 GSM179946 GSM102468 GSM231918 GSM137898 GSM179885 GSM102525 GSM277685 GSM203669 GSM231892 GSM179919 GSM102563 GSM179898 GSM102564 GSM137988 GSM102573 GSM179854 GSM301688 GSM102487 GSM152685 GSM152753 GSM117585 GSM137936 GSM152621 GSM152768 GSM152620 GSM203752 GSM152680 GSM152616 GSM325787 GSM231884 GSM301686 GSM117745 GSM301653 GSM152571 GSM152782 GSM102440 GSM179861 GSM137989 GSM102441 GSM152698 GSM179886 GSM137924 GSM117685 GSM231887 GSM152761 GSM301664 GSM102506 GSM102439 GSM137987 GSM325826 GSM102536 GSM353926 GSM117730 GSM117581 GSM117686 GSM102544 GSM231886 GSM137991 GSM117695 GSM117661 GSM117766 GSM117692 GSM117578 GSM203648 GSM117675 GSM102499 GSM152699 GSM277681 GSM179902 GSM179837 GSM138031 GSM152713 GSM137943 GSM117759 GSM117687 GSM117584 GSM152781 GSM152775 GSM117639 GSM325808 GSM102526 GSM152585 GSM179862 GSM179806 GSM203693 GSM325792 GSM102530 GSM102571 GSM117580 GSM203726 GSM137944 GSM117588 GSM102566 GSM117719 GSM231976 GSM102567 GSM152576 GSM117660 GSM137978 GSM203639 GSM152711 GSM137896 GSM102545 GSM203708 GSM152803 GSM231919 GSM203690 GSM152590 GSM152570 GSM137937 GSM179957 GSM203623 GSM137895 GSM152678 GSM152766 GSM152645 GSM137953 GSM137963 GSM277682 GSM152729 GSM138027 GSM231932 GSM277693 GSM301651 GSM203701 GSM102583 GSM325788 GSM137925 GSM203784 GSM152691 GSM231992 GSM152661 GSM277707 GSM138014 GSM179892 GSM231902 GSM203792 GSM203624 GSM152618 GSM117659 GSM179896 GSM203759 GSM152605 GSM152669 GSM117777 GSM203658 GSM137939 GSM179950 GSM137984 Band Band Module_33 Grade 1 Grade 2 Grade 3 -3 3 Module_36

Figure 5 Expression profile of genes in the chromosome region 8q21-23. Expression data for genes in the chromosome region 8q21-23 were collected from the GSE2109 dataset and visualized in a heat map. Rows present genes and columns represent samples. Genes are ordered by their chromosome location and colored-coded on the left by three cytogenetic bands 8q21, 8q22, and 8q23. Genes in module_36 are marked in red and labeled on the left. Samples are color-coded on the top by tumor grade, where blue, cyan, and pink correspond to grades 1, 2, and 3, respectively. The color scale bar at the bottom shows the relative gene expression level (0 is the mean expression level of a given gene).

In a previous study using a different data set (GSE3494), IRF2, IRF7, and IRF8) were associated with module_22 Niida et al. employed a Bayesian network approach and and module_31, and those in the ETS family (SPI1, ELF1, identified motifs bound by ELK1, E2F, NRF1 and NFY as and ETS2) were associated with module_5. Both of these principal regulatory motifs driving malignant progression families have implicated roles in breast cancer [41,42]. of breast cancer [40]. All these transcription factors were Module_7 was down-regulated in high-grade tumors, detected in our study. Genes regulated by NYFB, E2F1/ and it was enriched with the targets of transcription fac- E2F3, NRF1, and ELK1 were enriched in module_2, tors CEBPB, TBP, POU2F1, and PPARG. Interestingly, module_20, module_50, and module_46, respectively. All PPARG itself was included in the module and is a target these previously reported regulatory mechanisms are of all the other three transcription factors. Eight out of related to increased cell proliferation in high-grade the 35 genes in this module, including LPL, SORBS1, tumors. The result that overlapping modules such as PPARG, PLIN, FABP4, AQP7, CD36, and ADIPOQ, are module_2 and module_50 are associated with different involved in the PPARA signaling pathway according to transcription factors (Figure 6) suggests that they are the annotation in the KEGG database. The result indi- likely to be independent. cates that this transcriptional program might contribute We also identified regulatory mechanisms related to to breast tumor progression through inhibiting the increased immune response in high-grade tumors. Spe- PPARA signaling pathway. Consistently, it has been cifically, transcription factors in the IRF family (IRF1, recently reported that the odds of breast cancer are dou- Shi et al. BMC Systems Biology 2010, 4:74 Page 9 of 14 http://www.biomedcentral.com/1752-0509/4/74

NFYB E2F1 E2F3

TTK NEK2 TACC3 UBE2C TOP2A CENPF KIF23 SPC25 CDKN3 RACGAP1TPX2 KIF20A ESPL1 CDC45L STMN1 RRM2 CDC2 MCM2 EZH2 MELK H2AFZ KNTC1 MCM6 PCNA FBXO5 MCM3

Module_2 Module_20

NRF1 ELK1 IRF7 IRF2 IRF1 IRF8

LSM2 ILF3 NCBP1 SHMT2 MTX2 SUMO1 RPE HAT1 MRPL3 OASL DDX58 IFIT2 USP18 IFI44 HLA-F CXCL10 PSMB9 TAP1 HLA-C PSMB8

Module_50 Module_46 Module_22 Module_31

SPI1 ELF1 ETS2 POU2F1 TBP CEBPB

ACVR1CADH1B CD36 CHRDL1SLC19A3 ITGA7 RBP4 FABP4 G0S2 LEP PIK3AP1 EVI2B LAIR1 TLR4 TYROBP NCF2 AIF1 FCER1G CTSS HCLS1 GPR65 LST1 CD86 PPARG

Module_5 LPL PLIN GPD1 AQP7

1e-5 1e-5 Module_7 Neg. 1 Pos.

Figure 6 Co-expression modules and their regulators. Nodes represent genes and edges represent transcriptional regulation between transcrip- tion factors (top) and target genes (bottom). Transcription factors that are also included in the module are represented by square nodes, while others are represented by triangle nodes. The color scale bar shows the scale of p values for correlations between gene expression and tumor grade as cal- culated by the Jonckheere-Terpstra test. bled among women with PPARA polymorphism with regard to their likelihood of recurrence. Specifically, rs4253760 [43]. we applied unsupervised hierarchical clustering on the As shown in Figure 6, some transcription factors, such GSE2990 dataset to cluster cancer patients into sub- as IRF1 and PPARG are highly correlated with target groups based on the expression pattern of the 17 mod- genes and therefore are included in the modules. Some ules, and then investigated difference in relapse-free transcription factors are not included in the module but survival for different patient subgroups by comparing still show positive correlation with target genes, such as their survival curves estimated using the Kaplan-Meier E2F1, E2F3, and IRF7. Some transcription factors are method. negatively correlated with target genes such as CEBPB. It As shown in Figure 7A, the modules can be categorized is worth noting that most of the transcription factors do into 3 groups based on the overall expression patterns, not show significant transcriptional changes and cannot corresponding to the three connected components in Fig- be identified by individual gene based analysis. Our mod- ure 4. Patients in the dataset are first divided into two ule-based approach not only confirmed previously groups based primarily on the expression of group II and reported regulatory mechanisms, but also generated group III modules. The group with low-expression of novel hypotheses on regulatory mechanisms associated group II modules and high-expression of group III mod- with breast tumor progression. ules (patient group a) comprised mostly patients with grade 1 or 2 tumors, while the group with high-expres- Prognostic value of the grade-correlated modules sion of group II modules and low-expression of group III It is well known that morphologically similar tumors may modules (patient group b) was enriched in patients with have very different clinical courses. Because tumor speci- grade 3 tumors. This was expected as all of the modules mens with the same grade annotation showed very differ- were significantly correlated with tumor grade in this ent expression pattern of the modules (see Additional file dataset, although they were identified in the independent 1), we explored whether expression pattern of the mod- GSE2109_breast dataset. Patient group b can be further ules could be used to classify tumors more accurately Shi et al. BMC Systems Biology 2010, 4:74 Page 10 of 14 http://www.biomedcentral.com/1752-0509/4/74

a b1 b b2 A B

Grade 100 100 1 2 Module_8 80 80 Module_31

Module_5 60 60

I Module_28 40 40 Module_22 20 20 Module_46 a a_G2 b p=0.004 b_G2 p=0.017 Group1 Group1 Module_35 Group2 Group2 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

Relapse-free survival (%) 0 0

02468101214 024681012 Module_50 0 2 4 6 8 10 12 14 0 2 4 6 8 10 12

Module_36

II Module_20 100 100

Module_2 3 4 80 80 Module_41

Module_27 60 60

Module_15 40 40 Module_7 a G1 III Module_34 20 b1 20 G2 p=0.013 p=0.065 Group1b2 Group1G3 Module_24 Group2 Group2 Group3 Group3 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

Relapse-free survival (%) 0 0

02468101214 02468101214

GSM65753 GSM65827 GSM65859 GSM65785 GSM65762 GSM65796 GSM65347 GSM65847 GSM65334 GSM65848 GSM65346 GSM65349 GSM65363 GSM65836 GSM65880 GSM65830 GSM65862 GSM65355 GSM65365 GSM65858 GSM65801 GSM65867 GSM65864 GSM65871 GSM65874 GSM65854 GSM65362 GSM65865 GSM65323 GSM65342 GSM65824 GSM65772 GSM65826 GSM65849 GSM65329 GSM65817 GSM65781 GSM65876 GSM65336 GSM65808 GSM65814 GSM65320 GSM65812 GSM65791 GSM65319 GSM65783 GSM65807 GSM65318 GSM65771 GSM65797 GSM65803 GSM65800 GSM65779 GSM65788 GSM65840 GSM65775 GSM65758 GSM65764 GSM65773 GSM65780 GSM65789 GSM65770 GSM65799 GSM65787 GSM65853 GSM65332 GSM65763 GSM65766 GSM65331 GSM65793 GSM65765 GSM65754 GSM65776 GSM65782 GSM65818 GSM65767 GSM65790 GSM65810 GSM65804 GSM65815 GSM65805 GSM65757 GSM65846 GSM65860 GSM65869 GSM65345 GSM65360 GSM65832 GSM65358 GSM65861 GSM65356 GSM65370 GSM65828 GSM65857 GSM65351 GSM65353 GSM65844 GSM65377 GSM65792 GSM65856 GSM65348 GSM65366 GSM65372 GSM65373 GSM65316 GSM65324 GSM65756 GSM65317 GSM65327 GSM65795 GSM65831 GSM65752 GSM65837 GSM65845 GSM65755 GSM65816 GSM65852 GSM65868 GSM65371 GSM65774 GSM65341 GSM65354 GSM65325 GSM65359 GSM65820 GSM65835 GSM65337 GSM65375 GSM65343 GSM65786 GSM65340 GSM65361 GSM65378 GSM65879 GSM65367 GSM65878 GSM65823 GSM65875 GSM65364 GSM65855 GSM65368 GSM65834 GSM65338 GSM65374 GSM65335 GSM65768 GSM65321 GSM65811 GSM65877 GSM65872 GSM65873 GSM65326 GSM65330 GSM65357 GSM65379 GSM65798 GSM65784 GSM65369 GSM65344 GSM65825 GSM65839 GSM65806 GSM65802 GSM65866 GSM65350 GSM65838 GSM65761 GSM65822 GSM65328 GSM65813 GSM65769 GSM65339 GSM65794 GSM65333 GSM65760 GSM65821 GSM65843 GSM65842 GSM65322 GSM65850 GSM65870 GSM65841 GSM65819 GSM65833 GSM65863 GSM65352 GSM65851 GSM65376 GSM65829 0 2 4 6 8 10 12 14 0 2 4 6 8 10 12 14 Grade 1 Grade 2 Year Year Grade 3 Grade unknown -3 3

Figure 7 Patient subgroups identified based on module expression pattern and their relapse-free survival. Panel A: two-dimensional cluster- ing of the samples and the 17 modules in the GSE2990 dataset. Rows represent modules and columns represent samples, which are colored-coded by tumor grade, where blue, cyan, pink, and yellow correspond to grades 1, 2, 3, and unknown, respectively. Module expression is calculated as the average standardized expression of all genes in the module. The color scale bar shows the relative module expression level, where 0 is the mean ex- pression level of a given module. Modules and samples are clustered independently by hierarchical clustering. Major module clusters (I, II, and III) and major patient clusters (a, b1, and b2) are labeled on the dendrograms. Panel B shows the survival analysis results for different patient subgroups. B1: group a versus group b, all tumor grades. B2: group a versus group b, limited to grade 2 (G2) tumors. B3: groups a, b1, and b2, all tumor grades. B4: grade 1(G1), 2 (G2), and 3 (G3). divided into two subgroups based on the expression of = 0.013). As shown in Figure 7B_3, group a patients had group I modules. One subgroup showed low-expression obviously better outcome than group b patients. More- of group I modules (patient group b1), while the other over, outcomes for the two subgroups in group b were showed high-expression of these modules (patient group also discernible: patients in group b1 seemed to have b2). higher risk of recurrence compared to those in group b2. We first investigated whether the patient groups a and Considering the function of the modules (Figure 4), the b had different recurrences. As shown in Figure 7B_1, result suggests that patients with lower cell proliferation they did show significant different relapse-free survival (p and higher systems development and cell adhesion have = 0.004 in log-rank test). To test whether this result was lower risk of recurrence. For patients with high cell prolif- independent of tumor grade, we limited the survival anal- eration, although the sample size was too small to draw ysis to grade 2 patients. As shown in Figure 7B_2, grade 2 any statistical conclusions, it seems that the subgroup patients in groups a and b also showed significant differ- with higher immune response has relatively lower risk of ent relapse-free survival (p = 0.017), even with a small recurrence, at least from year 4 to year 8. One explana- sample size. tion is that the activation of adaptive immunity in the Next, we compared relapse-free survival for patient high cell proliferation group could elicit antitumor groups a, b1, and b2 as identified from the hierarchical responses through T-cell-mediated toxicity, antibody- clustering. According to the log-rank test, the overall dif- dependent cell-mediated cytotoxicity, and antibody- ference among these groups was statistically significant (p induced complement-mediated lysis. However, as Shi et al. BMC Systems Biology 2010, 4:74 Page 11 of 14 http://www.biomedcentral.com/1752-0509/4/74

immune system may play paradoxical roles during cancer the module-based approach on a breast cancer microar- development[35], the relationship between immune sys- ray gene expression dataset revealed important biological tem activity and the risk of recurrence may not be processes and previously reported regulatory mecha- straightforward. nisms underlying breast tumor progression. In addition, In comparison, we investigated the association between novel hypotheses on regulatory mechanisms and grade annotation and relapse-free survival. As shown in genomic gain associated with breast tumor progression Figure 7B_4, grade 1 patients had considerable better out- have been generated. The 17-module signature of breast come than grade 2 and 3 patients, while the outcomes for tumor progression clustered patients into subgroups with grade 2 and 3 patients were hardly discernible. The over- significantly different relapse-free survival times in an all difference in relapse-free survival for the three groups independent dataset and provided useful prognostic with different tumor grades was only marginally signifi- information that is independent of tumor grade. cant (p = 0.065). These results demonstrate that the expression pattern Methods of the grade-correlated modules is associated with the Data preprocessing risk of recurrence, that the expression pattern of the cel files for each gene expression dataset were down- modules can classify grade 2 patients into two groups loaded from the Gene Expression Omnibus (GEO) data- with high versus low risks of recurrence, and that the base http://www.ncbi.nlm.nih.gov/geo/ and processed three patient groups identified by the module expression using the Robust MultiChip Analysis (RMA) algorithm pattern may classify patients more accurately than the [45] as implemented in Bioconductor [46]. Probe set iden- traditional grading system. Similar findings have been tifiers (IDs) were mapped to gene symbols based on the reported in the original study in which a gene-level mapping provided by the GEO database. Median expres- approach was applied on the GSE2990 dataset [28]. How- sion levels from multiple probe sets corresponding to the ever, our module-based approach puts genes into biologi- same gene were calculated to represent the gene expres- cal contexts and helps reveal biological processes and sion level. To make the gene expression level comparable regulatory mechanisms underlying the biomarkers. across genes, expression values for each gene were stan- Along the same line, protein interaction modules have dardized using a Z-score transformation. For each data- been proposed as biomarkers for the classification of set, a gene expression matrix with normalized and breast cancer metastasis[44]. As co-expression modules standardized expression values was thus generated. are sub-networks identified from gene co-expression net- works in a data-driven fashion, they are not limited by Gene co-expression network construction existing knowledge on protein interaction and should Given a gene expression matrix with n genes and m sam- compliment the protein interaction module-based ples, Pearson's correlation coefficients are calculated for ×− approach. An obvious extension of the current study is to all the nn()1 gene pairs. To evaluate functional similar- construct prediction models based on the co-expression 2 modules and to evaluate their performance in different ity between each pair of genes, the Resnik's semantic sim- patient cohorts. ilarity [47] is calculated based on the GO biological process annotation according to Elo et al [25]. The aver- Conclusions age functional similarity of gene pairs at various correla- We have developed a novel algorithm ICE for identifying tion ranges is calculated and plotted. A correlation relatively independent maximal cliques as co-expression threshold above which a sharp increase in functional sim- modules and a co-expression module-based approach to ilarity occurs is selected through manual inspection of the analysis of gene expression data. The ICE algorithm the plot. Based on the selected threshold, a gene co- can identify functionally homogeneous modules from expression network is constructed in which each node is complex gene co-expression networks. Because the over- a gene while two nodes are connected by an edge if their lap among the modules is restrained, the algorithm is able Pearson's correlation coefficient is above the threshold. to identify a small number of modules covering a variety Co-expression module identification of biological processes, in turn facilitating downstream We propose an intuitive Iterative Clique Enumeration analyses to investigate major transcriptional programs (ICE) algorithm to identify a manageable number of max- encoded in gene co-expression networks. On the other imal cliques as relatively independent co-expression hand, the stringent requirement imposed by the module modules in order to facilitate further analyses of the tran- definition in the algorithm ensures high-level of co- scriptional mechanisms encoded in a gene co-expression expression among all genes in a module. Therefore, the network. Given a graph , G = (V,E), the algorithm works algorithm is able to identify truly co-regulated gene sets on the original graph G and "residual graphs" G = (V , E ) that are reproducible in independent data sets. Applying i i i Shi et al. BMC Systems Biology 2010, 4:74 Page 12 of 14 http://www.biomedcentral.com/1752-0509/4/74

iteratively, where the residual graph Gi is generated by represent the overall expression of the module in the removing all maximal cliques identified before step i from ∑ EGk G ∈M G. At step i, the algorithm first identifies the maximum sample: E = k j . M j clique Ci in Gi [48], and then finds the largest maximal Mj ' clique Ci in G that covers Ci. Ci' is considered as a module Jonckheere-Terpstra test and is removed from Gi to generate the next residual Suppose that N samples (e.g., tumor specimens) are graph G . Because C is identified in a residual graph, it i+1 i assigned to s ordered categories (e.g., tumor grades) and a has no overlap with previously found modules. However, selected attribute X (e.g., expression of a gene) is mea- expanding C to C ' allows overlap between C ' and existing i i i sured for the samples and ranked. The Jonckheere-Terp- modules. Therefore, overlap among the identified mod- stra test[51] tests the null hypothesis H that there are no ules is allowed but the amount of overlap is restrained. difference among the categories. The alternative hypoth- The iteration stops when a pre-defined clique size thresh- esis is that the sample means change monotonically along old cmin is reached. Because we focus on the major regula- the ordered sequence of the categories. Let us denote by tory programs encoded in the network, c is set to 10 in min Xi1, ..., Xim and Xj1, ..., Xjn the observations on the ith and this study. The ICE algorithm is formally described in the jth categories where i

NN2()()23+−∑ n2 23n + Functional category enrichment analysis Var() W = i i [51]. Therefore, the For a module with n genes and an a priori defined func- 72 tional category with K genes, hypergeometric test[50] is significance of W can be calculated based on the null dis- used to evaluate the significance of the overlap k between tribution. the module and the category. All N genes on the microar- Survival analysis ray are used as a reference. The significance of the overlap Survival analysis was performed in R using the survival ⎛ NK− ⎞⎛ K ⎞ ⎜ ⎟⎜ ⎟ package. Specifically, survival curves were estimated n ⎝ ni− ⎠⎝ i ⎠ can be calculated by p = ∑ . For the using the Kaplan-Meier method, and survival compari- ik= ⎛ N ⎞ ⎜ ⎟ sons among groups were made by the log-rank test. ⎝ n ⎠ source of the functional categories, we used the Gene Network visualization Ontology (GO) gene sets and the transcription factor tar- Networks were visualized using Cytoscape [49]. gets gene sets downloaded from the MSigDB (http:// www.broad.mit.edu/gsea/msigdb/index.jsp, version 2.5). Appendix The later was organized by transcription factor binding Algorithm 1. Iterative Clique Enumeration (ICE) motifs. We further collected known transcription factors 1. Given graph G = (V,E), clique size threshold cmin for the binding motifs from the Transfac database (http:// 2. i = 0, C0 is the maximum clique in G, output C0 www.gene-regulation.com, professional version 12.1). 3. C0 C0, G0  G, V0  V, E0  E Genes associated with different binding motifs that cor- 4. while ||Ci|| ≥ cmin do respond to a common transcription factor were com- 5. i  i + 1 bined into one gene set. Gene sets associated with 6. Gi = (Vi, Ei), where binding motifs that have no known transcription factors ==’’’ ∈∈ were not considered in this study. VVii−−11\, CEE iii − 1 \(,)|{ nnnC stsi − 1 and nC ti 7. Find the maximum clique C in G Overall expression level of a module i i 8. Find the largest maximal clique C ' in G such that Because the gene expression data is normalized and stan- i Ci'  Ci dardized, for a selected sample and a specific module Mj, average expression of all genes in the module is used to 9. Output Ci' 10. end while Shi et al. BMC Systems Biology 2010, 4:74 Page 13 of 14 http://www.biomedcentral.com/1752-0509/4/74

Additional material 11. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95(25):14863-14868. Additional file 1 Dynamic expression of the modules during breast 12. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic cancer progression. Figures visualizing the dynamic expression of the determination of genetic network architecture. Nat Genet 1999, modules during breast cancer progression in the GSE2109 dataset. 22(3):281-285. 13. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander Authors' contributions ES, Golub TR: Interpreting patterns of gene expression with self- BZ and ZS conceived of the study, designed the algorithm, and performed the organizing maps: methods and application to hematopoietic data analysis. BZ and CKD performed the data interpretation. BZ, ZS and CKD differentiation. Proc Natl Acad Sci USA 1999, 96(6):2907-2912. drafted the manuscript. All authors contributed to and approved the final man- 14. Newman MEJ, Girvan M: Finding and evaluating community structure uscript. in networks. Physical Review E 2004, 69(2):026113. 15. Newman MEJ: Fast algorithm for detecting community structure in Acknowledgements networks. Physical Review E 2004, 69(6):066133. This work was supported by the National Institutes of Health (NIH)/National 16. Donetti L, Muñoz M: Detecting Network Communities: a new Institute of General Medical Sciences (NIGMS) through grant R01GM088822 systematic and efficient algorithm. Journal of Statistical Mechanics: and the Pilot Grant support from the Vanderbilt Conte Center for Neuroscience Theory and Experiment 2004, 2004:P10012. Research (P50MH078028). This work was conducted in part using the 17. Bader GD, Hogue CW: An automated method for finding molecular resources of the Advanced Computing Center for Research and Education at complexes in large protein interaction networks. BMC Bioinformatics Vanderbilt University, Nashville, TN. We acknowledge Drs. Natasha Deane and 2003, 4:2. Robbert Slebos for useful comments on this manuscript. 18. Palla G, Derenyi I, Farkas I, Vicsek T: Uncovering the overlapping community structure of complex networks in nature and society. Author Details Nature 2005, 435(7043):814-818. 1Advanced Computing Center for Research & Education, Vanderbilt University, 19. Li J, Zimmerman LJ, Park BH, Tabb DL, Liebler DC, Zhang B: Network- Nashville, TN 37240, USA, 2Department of Electrical Engineering and assisted protein identification and data interpretation in shotgun Computer Science, Vanderbilt University, Nashville, TN 37240, USA and proteomics. Mol Syst Biol 2009, 5:303. 3Department of Biomedical Informatics, Vanderbilt University School of 20. Yu H, Paccanaro A, Trifonov V, Gerstein M: Predicting interactions in Medicine, Nashville, TN 37240, USA protein networks by completing defective cliques. Bioinformatics 2006, 22(7):823-829. Received: 18 February 2010 Accepted: 27 May 2010 21. Zhang B, Park BH, Karpinets T, Samatova NF: From pull-down data to Published: 27 May 2010 protein interaction networks and complexes with biological relevance. ©ThisBMC 2010 isarticleSystems an Shi Open etis Biology availableal; Access licensee 2010, from:article BioMed 4 :74http distributed ://www.biomedcentral.com/1752-0509/4/74Central Ltd.under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Bioinformatics 2008, 24(7):979-986. References 22. Voy BH, Scharff JA, Perkins AD, Saxton AM, Borate B, Chesler EJ, Branstetter 1. van't Veer LJ, Dai H, Vijver MJ van de, He YD, Hart AA, Mao M, Peterse HL, LK, Langston MA: Extracting gene networks for low-dose radiation Kooy K van der, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, using graph theoretical algorithms. PLoS Comput Biol 2006, 2(7):e89. Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling 23. Butte AJ, Kohane IS: Mutual information relevance networks: functional predicts clinical outcome of breast cancer. Nature 2002, genomic clustering using pairwise entropy measurements. Pac Symp 415(6871):530-536. Biocomput 2000:418-429. 2. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, 24. Luo F, Yang Y, Zhong J, Gao H, Khan L, Thompson DK, Zhou J: Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Constructing gene co-expression networks and predicting functions Foekens JA: Gene-expression profiles to predict distant metastasis of of unknown genes by random matrix theory. BMC Bioinformatics 2007, lymph-node-negative primary breast cancer. Lancet 2005, 8:299. 365(9460):671-679. 25. Elo LL, Jarvenpaa H, Oresic M, Lahesmaa R, Aittokallio T: Systematic 3. Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in construction of gene coexpression networks with applications to breast cancer: is there a unique set? Bioinformatics 2005, 21(2):171-178. human T helper cell differentiation process. Bioinformatics 2007, 4. Segal E, Friedman N, Koller D, Regev A: A module map showing 23(16):2096-2103. conditional activity of expression modules in cancer. Nat Genet 2004, 26. Zhang B, Horvath S: A general framework for weighted gene co- 36(10):1090-1098. expression network analysis. Stat Appl Genet Mol Biol 2005, 4:Article17. 5. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, 27. Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P: Coexpression analysis of Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, human genes across many microarray data sets. Genome Res 2004, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, 14(6):1085-1094. Hirschhorn JN, Altshuler D, Groop LC: PGC-1alpha-responsive genes 28. Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, involved in oxidative phosphorylation are coordinately downregulated Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, in human diabetes. Nat Genet 2003, 34(3):267-273. Nuyten D, Buyse M, Vijver MJ Van de, Bergh J, Piccart M, Delorenzi M: Gene 6. Wang L, Zhang B, Wolfinger RD, Chen X: An integrated approach for the expression profiling in breast cancer: understanding the molecular analysis of biological pathways using mixed models. PLoS Genet 2008, basis of histologic grade to improve prognosis. J Natl Cancer Inst 2006, 4(7):e1000115. 98(4):262-272. 7. Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for 29. van Noort V, Snel B, Huynen MA: The yeast coexpression network has a global discovery of conserved genetic modules. Science 2003, small-world, scale-free architecture and can be explained by a simple 302(5643):249-255. model. EMBO Rep 2004, 5(3):280-284. 8. Allocco DJ, Kohane IS, Butte AJ: Quantifying the relationship between 30. Bron C, Kerbosch J: Algorithm 457: finding all cliques of an undirected co-expression, co-regulation and gene function. BMC Bioinformatics graph. Comm of the ACM 1973, 16:575-577. 2004, 5:18. 31. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical 9. Oldham MC, Horvath S, Geschwind DH: Conservation and evolution of and powerful approach to multiple testing. Journal of the Royal gene coexpression networks in human and chimpanzee brains. Proc Statistical Society, Series B 1995, 57:289-300. Natl Acad Sci USA 2006, 103(47):17973-17978. 32. Ma XJ, Salunga R, Tuggle JT, Gaudet J, Enright E, McQuary P, Payette T, 10. Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS: Discovering Pistone M, Stecker K, Zhang BM, Zhou YX, Varnholt H, Smith B, Gadd M, functional relationships between RNA expression and Chatfield E, Kessler J, Baer TM, Erlander MG, Sgroi DC: Gene expression chemotherapeutic susceptibility using relevance networks. Proc Natl profiles of human breast cancer progression. Proc Natl Acad Sci USA Acad Sci USA 2000, 97(22):12182-12186. 2003, 100(10):5974-5979. 33. Hanahan D, Weinberg RA: The hallmarks of cancer. Cell 2000, 100(1):57-70. Shi et al. BMC Systems Biology 2010, 4:74 Page 14 of 14 http://www.biomedcentral.com/1752-0509/4/74

34. Steeg PS: Cancer: micromanagement of metastasis. Nature 2007, 449(7163):671-673. 35. de Visser KE, Eichten A, Coussens LM: Paradoxical roles of the immune system during cancer development. Nat Rev Cancer 2006, 6(1):24-37. 36. Balkwill F, Mantovani A: Inflammation and cancer: back to Virchow? Lancet 2001, 357(9255):539-545. 37. Hu G, Chong RA, Yang Q, Wei Y, Blanco MA, Li F, Reiss M, Au JL, Haffty BG, Kang Y: MTDH activation by 8q22 genomic gain promotes chemoresistance and metastasis of poor-prognosis breast cancer. Cancer Cell 2009, 15(1):9-20. 38. Thomassen M, Tan Q, Kruse TA: Gene expression meta-analysis identifies chromosomal regions and candidate genes involved in breast cancer metastasis. Breast Cancer Res Treat 2009, 113(2):239-249. 39. Buness A, Kuner R, Ruschhaupt M, Poustka A, Sultmann H, Tresch A: Identification of aberrant chromosomal regions from gene expression microarray studies applied to human breast cancer. Bioinformatics 2007, 23(17):2273-2280. 40. Niida A, Smith AD, Imoto S, Tsutsumi S, Aburatani H, Zhang MQ, Akiyama T: Integrative bioinformatics analysis of transcriptional regulatory programs in breast cancer cells. BMC Bioinformatics 2008, 9:404. 41. Bouker KB, Skaar TC, Riggins RB, Harburger DS, Fernandez DR, Zwart A, Wang A, Clarke R: Interferon regulatory factor-1 (IRF-1) exhibits tumor suppressor activities in breast cancer associated with caspase activation and induction of apoptosis. Carcinogenesis 2005, 26(9):1527-1535. 42. Turner DP, Findlay VJ, Moussa O, Watson DK: Defining ETS transcription regulatory networks and their contribution to breast cancer progression. J Cell Biochem 2007, 102(3):549-559. 43. Golembesky AK, Gammon MD, North KE, Bensen JT, Schroeder JC, Teitelbaum SL, Neugut AI, Santella RM: Peroxisome proliferator-activated receptor-alpha (PPARA) genetic polymorphisms and breast cancer risk: a Long Island ancillary study. Carcinogenesis 2008, 29(10):1944-1949. 44. Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Mol Syst Biol 2007, 3:140. 45. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249-264. 46. Reimers M, Carey VJ: Bioconductor: an open source framework for bioinformatics and computational biology. Methods Enzymol 2006, 411:119-134. 47. Resnik P: Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. J Artif intel Res 1999, 11:95-130. 48. Ostergard P: A fast algorithm for the maximum clique problem. Discrete Applied Math 2002, 120:197-207. 49. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13(11):2498-2504. 50. Zhang B, Kirov S, Snoddy J: WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res 2005:W741-748. 51. Lehmann EL, D'Abrera HJM: Nonparametrics. San Francisco: Holden-Day; 1975.

doi: 10.1186/1752-0509-4-74 Cite this article as: Shi et al., Co-expression module analysis reveals biologi- cal processes, genomic gain, and regulatory mechanisms associated with breast cancer progression BMC Systems Biology 2010, 4:74