Supplementary Data
Supplementary information

Contents

1 Supplementary Methods
  1.1 Prototype-based co-expression modules
  1.2 Gene ontology and functional network analysis
  1.3 Survival meta-analysis
    1.3.1 Univariate Cox regression
    1.3.2 Variable selection for multivariate Cox regression
2 Supplementary Figure 1
3 Supplementary Figure 2
4 Supplementary Figure 3
5 Supplementary Table 1
6 Supplementary Table 2
  6.1 Functional annotation of the gene expression modules
7 Supplementary Table 3
8 Supplementary Table 4
9 Supplementary Table 5
10 Alternative computation of prognostic gene signatures
  10.1 GENE70 in [van’t Veer et al., 2002] Dataset
    10.1.1 Score
    10.1.2 Risk
    10.1.3 Stratified Survival Curves
  10.2 GENE70 in [van de Vijver et al., 2002] Dataset
    10.2.1 Score
    10.2.2 Risk
    10.2.3 Stratified Survival Curves
  10.3 GENE76 in [Wang et al., 2005]
    10.3.1 Score
    10.3.2 Risk
    10.3.3 Stratified Survival Curves
  10.4 GENE76 in [Foekens et al., 2006] Dataset
    10.4.1 Score
    10.4.2 Risk
    10.4.3 Stratified Survival Curves
  10.5 Conclusion

1 Supplementary Methods

1.1 Prototype-based co-expression modules

To identify genes that are co-expressed with one specific prototype, we used a database of 581 patients from the NKI2 and VDX datasets.
First, we considered only the genes common to the Affymetrix and Agilent platforms after applying the mapping procedure described in the article (see Section Gene expression data and probe annotation). Hereafter, the NKI2 and VDX reduced datasets refer to the gene expression data restricted to this intersection. The following procedure, sketched in Supplementary Figure 1, was performed for each gene i of the NKI2 and VDX reduced datasets:

1. Univariate linear models were fitted with each prototype as the explanatory variable and gene i as the response variable, in the NKI2 and VDX reduced datasets separately, resulting in seven pairs of univariate linear models.
2. To test whether the variability in coefficient estimates between the two platforms was due to sampling error alone, we applied a stringent test of heterogeneity [Cochrane, 1954] to each pair of coefficients. If at least one coefficient was heterogeneous (p-value < 0.01), gene i was discarded from further analysis.
3. We compared the set of linear models to determine whether gene i is predictable by only one prototype, i.e. whether one model is significantly better than all the other candidates. To do so, we used the PRESS statistic [Allen, 1974] to compute the leave-one-out cross-validation (LOOCV) errors efficiently, and compared two models on the basis of their vectors of LOOCV errors. Friedman’s test [Friedman, 1937] was used to identify the set of best models for the NKI2 and VDX reduced datasets separately. For each comparison, the two p-values were meta-analytically combined using the Z-transform method [Whitlock, 2005]. A model was considered significantly better than another if the combined p-value was < 0.05. Because of computational limitations, we were not able to test all possible combinations of prototypes to predict gene i.
Only the best set of prototypes, with respect to the mean squared LOOCV error of the corresponding multivariate linear model, was identified using orthogonal Gram-Schmidt variable selection [Chen et al., 1989]. This multivariate model was used in addition to the set of univariate models.
4. We tested the specificity of gene i to one prototype by examining this set of best models. If only one univariate model belonged to this set, the model using only prototype j was significantly better than all the models using the other prototypes. Additionally, if the multivariate model belonged to the set of best models, the multivariate model was not significantly better than the model with prototype j.
5. Gene i was then identified as specific to prototype j and was included in module j (also called gene list j). To reduce the size of the modules, we filtered the specific genes using a threshold of 0.95 on the normalized mean squared LOOCV error.

1.2 Gene ontology and functional network analysis

Gene ontology and functional network analyses were executed using Ingenuity Pathways Analysis tools (Ingenuity Systems, Mountain View, CA), a web-delivered application [1] that enables the discovery, visualization, and exploration of molecular interaction networks in gene expression data. The lists of genes identified as specifically associated with the different prototypes, containing the HUGO gene symbol as well as an indication of positive or negative co-expression, were uploaded into Ingenuity Pathways Analysis. Each gene symbol was mapped to its corresponding gene object in the Ingenuity pathway knowledge base (IPKB). These so-called focus genes were then used as a starting point for generating biological networks. It should be noted that some identifiers from the dataset may not be mapped, for one of several reasons:

• The gene/protein ID does not correspond to a known gene product.
• There are insufficient findings in the literature regarding this gene.
• Findings for this gene have not been entered in the Ingenuity pathway knowledge base.

Biological functions were assigned to each dataset by using the knowledge base as a reference set and a proprietary ontology representing over 500,000 classes of biological objects and consisting of millions of individually modeled relationships between proteins, genes, complexes, cells, tissues, small molecules, and diseases. These semantically encoded relationships are based on a continual, formal extraction from the public domain literature and cover >10,000 human genes. The significance value associated with Functions is a measure of how likely it is that genes from the dataset file participate in that function. The significance is expressed as a p-value, which is calculated using the right-tailed Fisher’s Exact Test. In this method, the p-value is calculated by comparing the number of user-specified genes of interest that participate in a given function or pathway, relative to the total number of occurrences of these genes in all functional annotations stored in the IPKB.

1.3 Survival meta-analysis

For the following analyses, we focused on meta-analysis methods, i.e. each dataset was analyzed separately and overall conclusions were drawn from these dataset-specific results.

1.3.1 Univariate Cox regression

In the case of univariate Cox models, the method is simple:

1. A univariate Cox model was fitted for each dataset separately, resulting in a list of estimated coefficients and standard errors. If there were not enough patients to fit the model, the dataset was not considered for further analysis.
2. Using the estimated coefficients and standard errors, we computed the overall hazard ratio using the inverse variance-weighted method with a fixed effect model [Cochrane, 1954] (i.e. assuming that the heterogeneity observed in the coefficient estimates comes from sampling error alone).
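The pooling in step 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the analyses: the function name `fixed_effect_meta` and the example inputs are hypothetical, each dataset is assumed to contribute one Cox coefficient (a log hazard ratio) with its standard error, and the Wald p-value is assumed to be two-sided under a normal approximation.

```python
import math

def fixed_effect_meta(coefs, ses):
    """Pool per-dataset Cox coefficients with inverse variance weights.

    coefs: estimated log hazard ratios, one per dataset (hypothetical input).
    ses:   their standard errors, in the same order.
    """
    if not coefs or len(coefs) != len(ses):
        raise ValueError("need one standard error per coefficient")
    # Each dataset is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, coefs)) / total
    pooled_se = math.sqrt(1.0 / total)
    # Wald statistic and two-sided p-value (assumption: normal approximation).
    z = pooled / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    # Cochran's Q measures how much the estimates disagree; under homogeneity
    # it follows a chi-square distribution with len(coefs) - 1 degrees of freedom.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, coefs))
    return {
        "coef": pooled,
        "se": pooled_se,
        "hazard_ratio": math.exp(pooled),
        "z": z,
        "p_value": p_value,
        "cochran_q": q,
    }

# Hypothetical example: two datasets, the second estimated more precisely.
result = fixed_effect_meta([0.0, 1.0], [1.0, 0.5])
```

With these illustrative inputs the weights are 1 and 4, so the pooled coefficient is 0.8: the more precise dataset dominates the overall estimate, which is the intended behaviour of a fixed effect model.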
[1] http://www.ingenuity.com/products/pathways_analysis.html

1.3.2 Variable selection for multivariate Cox regression

Because of the small number of untreated patients in some datasets, we were not able to fit a multivariate Cox model using all the clinical variables and module scores. Therefore, we used forward stepwise variable selection in a meta-analytical framework to identify a set of variables relevant for survival. The algorithm is summarized by the following steps:

1. A Cox model was fitted for each dataset separately. If there were not enough patients to fit the model, the dataset was not considered for further analysis. This Cox model might be univariate or multivariate, depending on the progress of the variable selection. The Cox model was fitted using the same set of variables in every dataset.
2. The overall hazard ratio and its associated Wald test p-value were computed for each variable present in the Cox models using the inverse variance-weighted method with a fixed effect model.
3. If the variable with the lowest p-value was sufficiently significant (p-value < 0.05), this new variable was included in the set of selected variables.
4. Alternatively, if a variable in the current Cox model was no longer significant because of the inclusion of other variables (p-value ≥ 0.05), this variable was removed from the set of selected variables.
5. The algorithm ended when no variables were removed from or added to the set of selected variables.

2 Supplementary Figure 1

Supplementary Figure 1 sketches the design of the method used to identify prototype-based co-expression modules.

[Figure: flowchart. Inputs: NKI2 reduced dataset and VDX reduced dataset, each providing the prototypes and gene i. Steps: Fit univariate linear models; Test of heterogeneity; Fit multivariate linear model; Set of best linear models; Test of specificity (prototype j); Put gene i in module j.]

3 Supplementary Figure 2

Supplementary Figure 2 sketches the datasets used for each analysis: