Supplementary information on Material and Methods

Patients and samples

All 165 study patients were treated by surgery for lung adenocarcinoma without prior chemotherapy.

Fifty-eight patients received cisplatin-based adjuvant chemotherapy. Patients were divided into two groups according to smoking status. Never smokers had a lifetime exposure of fewer than 100 cigarettes, which was cross-validated using an ad hoc form.

The main cohort included 77 never smokers, to whom 77 ever smokers were matched by surgery center, sex and disease stage. This main cohort was designed to address the issue of differences in genomic DNA copy-number profiles between never and ever smokers. Patients were predominantly women (88%). According to the TNM system in use at the time of diagnosis, the tumors could be classified as stage I for 88 patients, stage II for 18 patients or stage III for 48 patients(1). While by design sex and disease stage were distributed equally between never and ever smokers, the two groups defined by smoking status differed by age (mean age 60 years for ever smokers versus 68 years for never smokers; p-value <0.0001) and by the rates of EGFR mutation (66% for never smokers versus 17% for ever smokers; p-value <0.0001) and KRAS mutation (7% for never smokers versus 41% for ever smokers; p-value <0.0001) (supplementary Table S4).

An additional group of 11 never smokers, who had been treated at participating surgical centers, had been studied using CGH arrays. They presented with typical characteristics of never smoker patients, including female sex preponderance (10 patients), higher age (median 65 years) compared to ever smokers and a high rate of EGFR mutation (7 patients). These additional cases were included only in the present study of expression as the CGH array data could not be easily combined with genomic data obtained using SNP arrays. Their RNAs were extracted, purified, quantified, qualified and hybridized together with the samples from the main cohort.

The pathological diagnoses were reviewed with the help of immunohistochemical stains. Most tumors (93%) expressed NKX2-1. Cases for which a doubt remained about the primary site in the lung were excluded. All adenocarcinomas were invasive. A bronchiolo-alveolar component was recorded when a non-invasive lepidic growth was seen adjacent to a component of invasive adenocarcinoma, which corresponds to the lepidic subtype in the proposed revisions to the histological classification of lung adenocarcinoma(2)(3). A bronchiolo-alveolar component was more frequent (p-value 0.0007) in never smokers (61%) than in ever smokers (34%).

A translocation involving ALK was shown using FISH in 4 tumors from never smokers with wild-type EGFR(4).

Genomic DNA and RNA were extracted from frozen tissue sections using commercial kits (Qiagen, Hilden, Germany) at Institut Gustave-Roussy. Frozen samples were sectioned after removing most of the embedding medium. Beginning and end sections were stained with haematoxylin and eosin to assess the proportion of tumor cells. Only cases with an average tumor cell proportion equal to or above 50% were included. Thirty to 60 sections were placed in two separate tubes and kept frozen in liquid nitrogen until nucleic acid extraction. Both RNA and DNA were assessed for integrity and quantity following stringent quality control criteria (CIT program protocols: cit.ligue-cancer.net).

Genomic array analysis

Genomic arrays were carried out on the Integragen platform (Evry, France). DNAs were hybridized on Illumina SNP HumanCNV370 chips according to the instructions provided by the array manufacturer (Illumina, San Diego, CA). Raw fluorescent signals were imported and normalized using Illumina BeadStudio software. A supplemental normalization procedure, tQN, was applied to correct for dye bias(5). Genomic profiles were segmented using the circular binary segmentation algorithm (DNAcopy package, Bioconductor)(6). The absolute copy numbers were determined using the Genome Alteration Print method(7).

The Genomic Identification of Significant Targets In Cancer (GISTIC) version 2.0 algorithm was applied to high-quality copy-number profiles with an amplitude threshold of 0.2 for copy-number amplifications or copy-number deletions(8). Scoring was performed using the GeneGISTIC procedure. Significant regions were identified with a significance level of 0.25 for the residual q-value. Their peak boundaries were calculated with a confidence level of 0.85. Genes overlapping wide peak limits were listed as the most likely gene targets in each region.

The frequencies of aberrations contributing to significant peak regions identified by GISTIC 2.0 were compared using chi-square tests with Bonferroni correction for multiple testing.
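The comparison described above can be illustrated with a minimal Python sketch of a Pearson chi-square test on a 2x2 table (aberration present or absent, by smoking status) followed by a Bonferroni correction. The counts, the region names, the function names and the absence of a continuity correction are assumptions made for illustration; they are not taken from the original analysis.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom, no continuity
    correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("table has an empty row or column")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical counts per region:
# (altered never smokers, unaltered never smokers,
#  altered ever smokers, unaltered ever smokers)
tables = {
    "region_A": (30, 47, 12, 65),
    "region_B": (20, 57, 22, 55),
}
raw = {name: chi2_2x2(*counts)[1] for name, counts in tables.items()}
adjusted = dict(zip(raw, bonferroni(list(raw.values()))))
```

With many peak regions, the Bonferroni factor is the number of regions compared, so only strong frequency differences remain significant.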

Gene expression analysis

Gene expression arrays were carried out on the IGBMC microarray platform (Strasbourg, France). Total RNAs were amplified, labeled and hybridized to Affymetrix U133 Plus 2.0 GeneChip, following the manufacturer’s protocol (Affymetrix, Santa Clara, CA). Microarrays were scanned with an Affymetrix GeneChip Scanner 3000 and raw intensities were quantified from subsequent images using GCOS 1.4 software (Affymetrix). Data were normalized using the Robust Multi-array Average method(9).

Unsupervised hierarchical clustering analysis of normal lung and tumor samples using the Pearson correlation metric was performed on the 1183 probe sets (quantile 0.975) with the greatest robust coefficient of variation between samples. Only probe sets with Affymetrix annotation class A and located on autosomes were considered. The tumor samples were from the LG cohort. The normal lung samples were from eleven female Asian never smokers with publicly-available Human Genome U133 Plus 2.0 Affymetrix gene expression data (accession number: GSE19804).
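The probe set selection and the correlation metric described above can be sketched in Python as follows. The text does not define the robust coefficient of variation; the sketch assumes the scaled median absolute deviation divided by the median, which is only one possible definition. The function names and the dictionary layout are hypothetical, expression values are assumed to be positive, and the hierarchical clustering itself is not reproduced.

```python
import math
import statistics

def robust_cv(values):
    """Assumed definition: 1.4826 * MAD / median (positive values expected)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return 1.4826 * mad / med

def top_variable(expr, quantile=0.975):
    """expr maps a probe set ID to its expression values across samples.
    Returns the probe sets above the given quantile of robust CV."""
    scores = {pid: robust_cv(vals) for pid, vals in expr.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, round(len(ranked) * (1.0 - quantile)))
    return ranked[:n_keep]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_distance(x, y):
    """Distance usable for clustering: 0 for perfectly correlated profiles."""
    return 1.0 - pearson(x, y)
```

The distance `1 - r` is a common way to turn the Pearson metric into a dissimilarity for hierarchical clustering; the linkage method is a separate choice that the text does not specify.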

Clustering stability was assessed using a resampling strategy, noise addition and comparison of clustering procedures. Differences between sample clusters were tested using the chi-square test.

Gene Ontology (GO) sets were obtained from GeneOntology.org. Genatomy was used to calculate hypergeometric enrichment with FDR correction of p-values(10)(11). Literature Vector Analysis (LitVAn) was used to infer gene cluster functionality with an evaluation of the significance of their scores (litvan.bio.columbia.edu)(12).

Both genomic and gene expression data were deposited in the ArrayExpress database (accession number: E-MTAB-923), which also includes a list of EGFR and KRAS mutations.

ATAD2 relative expression was measured in duplicate by real-time RT-PCR using the Hs00204205 Taqman® probe (Applied Biosystems, Carlsbad, CA, USA). POLR2A and YAP1 were selected as the best combination of stable internal control genes across the cohort using the Normfinder algorithm(13). The ATAD2 and internal control data were corrected for PCR efficiency and interplate variations. The comparative threshold cycle (CT) method was applied to calculate the mean ± s.d. 2^(-ΔCT), and then the fold-change differences between groups and the corresponding two-sided Welch t-test p-values(14).
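The comparative CT calculation can be illustrated with a short Python sketch. It assumes that ΔCT is the mean CT of the target gene minus the average of the mean CTs of the internal control genes, and that the fold change between groups is the ratio of the group means of 2^(-ΔCT). The corrections for PCR efficiency and interplate variation mentioned above are omitted, and all CT values and function names are hypothetical.

```python
from statistics import mean, stdev

def relative_expression(ct_target, ct_refs):
    """Return 2^(-dCT) for one sample.
    ct_target: replicate CT values for the target gene.
    ct_refs: dict mapping each internal control gene to its replicate CTs."""
    ref_ct = mean(mean(values) for values in ct_refs.values())
    delta_ct = mean(ct_target) - ref_ct
    return 2.0 ** (-delta_ct)

def fold_change(group_a, group_b):
    """Ratio of mean relative expression, group A over group B."""
    return mean(group_a) / mean(group_b)

# Hypothetical duplicate measurements for three samples per group
refs = {"POLR2A": [24.0, 24.0], "YAP1": [26.0, 26.0]}
high = [relative_expression(ct, refs) for ct in ([23.0, 23.0], [24.0, 24.0], [23.5, 23.5])]
low = [relative_expression(ct, refs) for ct in ([25.0, 25.0], [26.0, 26.0], [25.5, 25.5])]
summary = {
    "high": (mean(high), stdev(high)),
    "low": (mean(low), stdev(low)),
    "fold_change": fold_change(high, low),
}
```

In this toy data every sample of the first group has a CT two cycles lower than its counterpart, so the fold change is 4.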

Integration of copy-number and gene expression data

We used COpy Number and EXpression In Cancer (CONEXIC) to integrate matched copy number (amplifications or deletions) and gene expression data from 80 paired samples(15).

The list of potential regulators (candidate genes) for which amplification data or deletion data were considered included genes overlapping GISTIC 2.0 significant regions that were less than one third of a chromosome arm in length. Too many candidate drivers can burden CONEXIC computationally and statistically(15). We experienced such a problem with the full list of candidate drivers (3048 genes) overlapping all GISTIC regions. To reduce the number of candidate genes, we could have used a lower threshold for the FDR q-value in the GISTIC analysis. Instead, we kept the FDR q-value threshold of 0.25, which is typically used in GISTIC analyses, and considered only ‘focal’ aberrations, which were arbitrarily identified by their length relative to the chromosome arm. The criterion ‘less than a third of a chromosome arm’ was applied exactly, by calculating the ratio of the length of each aberrant region to that of the chromosome arm where it was located.
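The focal criterion amounts to a simple length ratio, sketched below in Python. The coordinates, the function name and the use of exact rational arithmetic are assumptions for illustration; the source states only that the ratio of region length to arm length had to be less than one third.

```python
from fractions import Fraction

def is_focal(region_start, region_end, arm_start, arm_end,
             max_fraction=Fraction(1, 3)):
    """True when the aberrant region spans strictly less than
    max_fraction of the chromosome arm (integer base-pair positions)."""
    region_len = region_end - region_start
    arm_len = arm_end - arm_start
    if arm_len <= 0 or region_len < 0:
        raise ValueError("invalid coordinates")
    # Fraction keeps the comparison exact at the one-third boundary
    return Fraction(region_len, arm_len) < max_fraction

# Hypothetical arm of 30 Mb
print(is_focal(0, 9_000_000, 0, 30_000_000))   # shorter than a third
print(is_focal(0, 10_000_000, 0, 30_000_000))  # exactly a third
```

A region spanning exactly one third of the arm is not focal under this strict inequality.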

Gene expression data were processed by removing probe sets whose standard deviation was smaller than 0.25 on a log2 scale, resulting in 28821 probe sets measuring 16009 unique genes. Inconsistent multiple probe sets were then removed, resulting in a final set of 10358 genes. Expression values were normalized to a mean of zero and a standard deviation of one for each gene.
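The variance filter and the per-gene standardization can be sketched as follows in Python. The sketch assumes the sample standard deviation (n - 1 denominator) and a hypothetical dictionary layout; the removal of inconsistent multiple probe sets is not reproduced because its criterion is not given in the text.

```python
from statistics import mean, stdev

def filter_and_standardize(expr, min_sd=0.25):
    """expr maps a probe set (or gene) ID to log2 expression values.
    Drops entries with SD below min_sd, then scales the remaining
    entries to mean 0 and SD 1."""
    out = {}
    for name, values in expr.items():
        sd = stdev(values)  # sample SD; an assumption, see lead-in
        if sd < min_sd:
            continue
        mu = mean(values)
        out[name] = [(v - mu) / sd for v in values]
    return out

# Hypothetical log2 values
example = {
    "flat_probe": [5.0, 5.1, 5.0, 5.1],
    "variable_probe": [4.0, 6.0, 8.0],
}
standardized = filter_and_standardize(example)
```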

During the ‘SingleModulator’ step, a conservative set of potential driver genes was obtained using a Welch t-test (p-value<0.05), comparing amplified versus normal or deleted versus normal. All potential driver genes were considered during the following ‘NetworkLearning’ step.

The ‘Single Modulator’ step was run using permutation testing (p-value<0.001) and the ‘noUpDown’ parameter such that each module contained genes positively and negatively regulated with expression of the regulators. Potential regulators were excluded as members of the resulting clusters. The parameters for the scoring function were alpha=2 and lambda=1. Non-parametric bootstrapping was applied to the 80 samples and repeated 100 times. Candidate drivers were retained when they were selected in at least 90% of the runs.

The ‘Network Learning’ step was run using permutation testing (p-value<0.05) and the ‘LikelihoodCutoff’ parameter (value 2) to remove genes that could not be explained by any regulation program. The parameters for the scoring function were alpha=2, lambda=1, beta=20, x=15 and y=0. Non-parametric bootstrapping was applied to the 80 samples and repeated 100 times. Candidate drivers were retained when they were selected in at least 40% of the runs. A final run was performed to obtain the final regulatory programs.
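The bootstrap selection rule used in both steps can be sketched generically in Python. The callable `select_fn` is a hypothetical stand-in for one CONEXIC run on a resampled set of samples; the actual CONEXIC interface is not reproduced here, and the toy selector below exists only to make the sketch runnable.

```python
import random
from collections import Counter

def bootstrap_selection(n_samples, select_fn, n_runs=100,
                        min_fraction=0.9, seed=0):
    """Resample sample indices with replacement n_runs times, call
    select_fn on each resample, and keep the candidates selected in
    at least min_fraction of the runs. Returns {candidate: fraction}."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        resample = [rng.randrange(n_samples) for _ in range(n_samples)]
        counts.update(set(select_fn(resample)))
    return {gene: count / n_runs
            for gene, count in counts.items()
            if count / n_runs >= min_fraction}

# Toy selector (hypothetical): one candidate is returned in every run,
# another in every second run, regardless of the resampled indices.
_calls = {"n": 0}

def toy_select(sample_indices):
    _calls["n"] += 1
    if _calls["n"] % 2 == 0:
        return ["GENE_ALWAYS", "GENE_SOMETIMES"]
    return ["GENE_ALWAYS"]

strict = bootstrap_selection(80, toy_select, min_fraction=0.9)
lenient = bootstrap_selection(80, toy_select, min_fraction=0.4)
```

With the 90% threshold only the candidate returned in every run survives, whereas the 40% threshold also keeps the candidate returned in half of the runs.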

For permutation testing, we specified a p-value of 0.001 as the significance threshold during the Single Modulator step and kept the default 0.05 significance threshold during the Network Learning step. We counted the number of genes that were assigned to modules at different permutation p-value thresholds: 938, 1661, 2972 and 6367 genes were assigned to modules at the 1.0E-6, 1.0E-5, 1.0E-4 and 1.0E-3 thresholds, respectively. With an FDR set at 5%, the p-values for all significant genes were less than the FDR-derived significance thresholds(16). The more stringent significance threshold during the Single Modulator step is needed in order to build a very robust starting point for Network Learning. Likewise, we chose bootstrap confidence thresholds of 90% and 40% during the Single Modulator step and the Network Learning step, respectively. These thresholds were determined from a study of the effect of the bootstrap confidence threshold on the removal of spurious modulators generated by random permutation, such that no spurious modulator passed them(15). Therefore, in the final model each module gene is associated with its modulator not by a single significance test, but after passing two permutation significance tests at the indicated significance thresholds, each bootstrapped with a confidence level (90% and 40%) sufficient to remove spurious correlations.
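To illustrate how an FDR level translates into a p-value significance threshold, the Python sketch below implements the standard Benjamini-Hochberg step-up procedure. This is a simpler, non-adaptive relative of the adaptive linear step-up procedures cited above(16) and is not claimed to be the procedure used in the analysis; the p-values are hypothetical.

```python
def bh_threshold(p_values, fdr=0.05):
    """Benjamini-Hochberg step-up: return the largest sorted p-value
    p_(k) such that p_(k) <= (k / m) * fdr, or None if there is none.
    All p-values at or below the returned value are declared significant."""
    m = len(p_values)
    threshold = None
    for k, p in enumerate(sorted(p_values), start=1):
        if p <= k / m * fdr:
            threshold = p
    return threshold

# Hypothetical permutation p-values
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042,
          0.06, 0.074, 0.205, 0.212, 0.216]
cutoff = bh_threshold(p_vals, fdr=0.05)
significant = [p for p in p_vals if cutoff is not None and p <= cutoff]
```

A fixed permutation threshold such as 0.001 is at least as strict as the FDR-derived cutoff whenever it is smaller than the value this function returns.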

The modules and their modulators were visualized using Genatomy, which was further used to calculate hypergeometric enrichment with FDR correction of p-values as above(17). The MYC target database was kindly provided by Dr Van Dang (Abramson Cancer Centre, PA, USA). To nominate the module associated with smoking status, we used gene set enrichment analysis (GSEA) (http://www.broadinstitute.org/gsea/index.jsp) with smoking status as phenotype and the CONEXIC modules as gene sets.

For validation, the publicly-available gene expression data from two independent cohorts were used. In the Ding cohort 68 lung adenocarcinomas had been studied using the Affymetrix Human Genome U133 Plus 2.0 Genechip (accession number: GSE12667)(18). The data were processed as described above, resulting in a final set of 10115 genes. In the Shedden cohort 442 lung adenocarcinomas had been studied using the Affymetrix Human Genome U133 A Genechip (caarraydb.nci.nih.gov, pId=1015945236141280)(19). The data generated at the University of Michigan Cancer Centre, Moffitt Cancer Centre and Memorial Sloan-Kettering Cancer Centre were combined (391 patients) as they were similar in average signal intensity and variation(19). Expression values were calculated for the probe sets that were used in the CONEXIC analysis of the LG cohort.

The linear relationship between a modulator and its associated genes was measured using the Pearson correlation coefficient.

An overview of original modules in the LG cohort and of the profiles obtained by applying the regulatory programs of these modules to the data from the Ding cohort is available (supplementary figure S3).

Survival analysis

The time to death following the removal of the primary tumor was investigated in the LG and in the Shedden cohort. Survival was censored at 60 months in the Shedden cohort(19).

The survival probability was estimated using the Kaplan-Meier method. Survival in two groups was compared using the log-rank test. Groups were defined by clinical variables, including adjuvant chemotherapy, age, disease stage, sex and smoking status, or by biological variables, including bronchiolo-alveolar component, EGFR mutation, KRAS mutation, 8q24.12 amplification, and ATAD2 or MYC expression. The first-order split expression value found by CONEXIC was used to define the low or high ATAD2 groups in the analysis of gene expression array data in the LG cohort. Otherwise, groups were defined using the median.
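The Kaplan-Meier estimate and the administrative censoring at 60 months can be sketched in plain Python as follows. The analyses described here were run with the ‘survival’ package in R; this sketch only illustrates the estimator, and the follow-up times, event indicators and function names are hypothetical. The log-rank test is not reproduced.

```python
def censor_at(times, events, limit):
    """Administratively censor follow-up at `limit` (e.g. 60 months):
    times beyond the limit are truncated and marked as censored."""
    new_times = [min(t, limit) for t in times]
    new_events = [e if t <= limit else 0 for t, e in zip(times, events)]
    return new_times, new_events

def kaplan_meier(times, events):
    """times: follow-up durations; events: 1 = death, 0 = censored.
    Returns [(time, survival probability)] at each distinct death time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same follow-up time
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve

# Hypothetical follow-up in months
months = [6, 12, 12, 20, 30]
died = [1, 1, 0, 1, 0]
curve = kaplan_meier(months, died)
```

Subjects censored at a given time are counted as at risk for deaths occurring at that same time, which is the usual convention.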

Cox proportional hazards multivariate models were tested to adjust for clinical variables including age, sex and disease stage, which were available in both the LG and Shedden cohorts. The final models were selected by comparing the results of the likelihood ratio test. The proportional hazards assumption was verified using the scaled Schoenfeld residuals test in the selected models.

All survival analyses were performed using the package ‘survival’ in R (version 2.13.2). A p-value less than 0.05 was considered as significant.

References for supplementary Material and Methods

1. Hermanek P, Hutter R, Sobin L, Wagner G, Wittekind C, editors. TNM Atlas. Guide illustré de la classification TNM/pTNM des tumeurs malignes. 4th ed. Paris: Springer-Verlag France; 1998.

2. Travis WD, Brambilla E, Müller-Hermelink HK, Harris CC; International Agency for Research on Cancer. Pathology and genetics of tumours of the lung, pleura, thymus and heart. Lyon: IARC; 2004.

3. Travis WD, Brambilla E, Noguchi M, Nicholson AG, Geisinger KR, Yatabe Y, et al. International Association for the Study of Lung Cancer/American Thoracic Society/European Respiratory Society international multidisciplinary classification of lung adenocarcinoma. J Thorac Oncol 2011;6(2):244–285.

4. Hofman P, Ilie M, Hofman V, Roux S, Valent A, Bernheim A, et al. Immunohistochemistry to identify EGFR mutations or ALK rearrangements in patients with lung adenocarcinoma. Ann Oncol. 2012;23(7):1738–1743.

5. Staaf J, Vallon-Christersson J, Lindgren D, Juliusson G, Rosenquist R, Höglund M, et al. Normalization of Illumina Infinium whole-genome SNP data improves copy number estimates and allelic intensity ratios. BMC Bioinformatics 2008;9:409.

6. Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 2007;23(6):657–663.

7. Popova T, Manié E, Stoppa-Lyonnet D, Rigaill G, Barillot E, Stern MH. Genome Alteration Print (GAP): a tool to visualize and mine complex cancer genomic profiles obtained by SNP arrays. Genome Biol 2009;10(11):R128.

8. Mermel CH, Schumacher SE, Hill B, Meyerson ML, Beroukhim R, Getz G. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 2011;12(4):R41.

9. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003;31(4):e15.

10. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 2005;102(43):15545–15550.

11. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U.S.A. 2003;100(16):9440–9445.

12. Salton G, Buckley C. Global text matching for information retrieval. Science 1991;253(5023):1012–1015.

13. Andersen CL, Jensen JL, Ørntoft TF. Normalization of real-time quantitative reverse transcription-PCR data: a model-based variance estimation approach to identify genes suited for normalization, applied to bladder and colon cancer data sets. Cancer Res. 2004;64(15):5245–5250.

14. Schmittgen TD, Livak KJ. Analyzing real-time PCR data by the comparative C(T) method. Nat Protoc 2008;3(6):1101–1108.

15. Akavia UD, Litvin O, Kim J, Sanchez-Garcia F, Kotliar D, Causton HC, et al. An integrated approach to uncover drivers of cancer. Cell 2010;143(6):1005–1017.

16. Benjamini Y, Krieger AM, Yekutieli D. Adaptive linear step-up procedures that control the false discovery rate. Biometrika 2006;93(3):491–507.

17. Litvin O, Causton HC, Chen B-J, Pe’er D. Modularity and interactions in the genetics of gene expression. Proc. Natl. Acad. Sci. U.S.A. 2009;106(16):6441–6446.

18. Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature 2008;455(7216):1069–1075.

19. Shedden K, Taylor JMG, Enkemann SA, Tsao M-S, Yeatman TJ, Gerald WL, et al. Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat. Med. 2008;14(8):822–827.

Legends of Supplementary Figures S1 to S3

Supplementary Figure S1. Module network views of the original and the modified module 62 in the 68 lung adenocarcinomas from the Ding cohort (GSE12667) and of the modified module 62 in the LG cohort. Genes in module 62 included in both final sets of genes after processing of gene expression data in the LG cohort and in the Ding cohort are analyzed. Samples are sorted according to the CONEXIC regulatory programs with split values indicated as yellow dotted lines. A. The original regulatory program found by CONEXIC in the LG cohort is applied using the original split values for the modulators to module 62 gene expression data in the Ding cohort. Modulators include ATAD2, TUBB3, SLC25A21 and KCNMB4, wherein ATAD2 increased expression at the first-order and the right second-order splits is associated with increased expression of genes in the module. B. Modified regulatory program obtained by replacing TUBB3 by ATAD2 at the left second order split (orange left branch) and applied to module 62 genes in the Ding cohort. ATAD2 is the sole modulator, wherein increased ATAD2 expression split into four groups is associated with increased expression of module 62 genes across the entire sample set. C. Modified module 62 regulatory program obtained by replacing TUBB3 with ATAD2 as above and applied to module 62 genes in the LG cohort, wherein increased ATAD2 expression split into four groups is associated with increased expression of module 62 genes across the entire sample set.

Supplementary Figure S2. Heat maps of genes of the proliferative cluster (cluster f) according to ATAD2 expression, smoking status-associated discrete variables and disease stage in the LG cohort. Proliferative genes included in the final set of genes after processing of LG expression data are analyzed. Samples are first sorted according to the smoking status-associated variable, then within each category into four groups of increasing ATAD2 expression levels (from white to red). The four ATAD2 groups are sorted using the split values found by CONEXIC in the analysis of gene expression data in the LG cohort. A. Bronchiolo-alveolar component (BAC), comparing BAC (blue) versus no BAC. B. EGFR mutation (blue). C. KRAS mutation (blue). D. Disease stage, comparing late stage (blue) versus early stage. E. Sex, comparing male (blue) versus female.

Supplementary Figure S3. Overview of the profiles of modules 1 to 27. Samples are sorted according to the regulatory programs found with CONEXIC in the LG cohort. The 31 main modulators were: ABCA6, ABCA8, ANGPT1, ARHGDIB, ATAD2, C17orf60, CCT2, CD69, CDK3, CDKN2AIP, DSCC1, EMB, FAM83A, GPR89B, HLA-DMA, HLA-DMB, MRPL13, MSRB3, NEIL3, SEC61G, SLA, SLC25A21, SMURF2, SPHK1, SUPT3H, TBX4, TK1, TSPAN4, TUBB3, UBE2O, VCAM1. A. LG cohort. B. The original regulatory programs found by CONEXIC in the LG cohort are applied using the original split values for the modulators to gene expression data in the Ding cohort.
