Supplementary Information on Material and Methods
Patients and samples

All 165 study patients were treated by surgery for lung adenocarcinoma without prior chemotherapy. Fifty-eight patients received cisplatin-based adjuvant chemotherapy. They belonged to two groups based on their smoking status. Never smokers had a lifetime exposure of less than 100 cigarettes, which was cross-validated using an ad hoc form. The main cohort included 77 never smokers to which 77 ever smokers were matched by surgery center, sex and disease stage. This main cohort was designed to address the issue of differences in genomic DNA copy-number profiles between never and ever smokers. Patients were predominantly women (88%). According to the TNM system in use at the time of diagnosis, the tumors could be classified as stage I for 88 patients, stage II for 18 patients or stage III for 48 patients (1). While by design sex and disease stage were distributed equally between never and ever smokers, the two groups defined by smoking status differed by age (mean age 60 years for ever smokers versus 68 years for never smokers; p-value <0.0001) and rate of EGFR (66% for never smokers versus 17% for ever smokers; p-value <0.0001) or KRAS mutation (7% for never smokers versus 41% for ever smokers; p-value <0.0001) (supplementary Table S4).

An additional group of 11 never smokers, who had been treated at participating surgical centers, had been studied using CGH arrays. They presented with typical characteristics of never smoker patients, including female sex preponderance (10 patients), higher age (median 65 years) compared to ever smokers and a high rate of EGFR mutation (7 patients). These additional cases were included only in the present study of gene expression as the CGH array data could not be easily combined with genomic data obtained using SNP arrays. Their RNAs were extracted, purified, quantified, qualified and hybridized together with the samples from the main cohort.
The pathological diagnoses were reviewed with the help of immunohistochemical stains. Most tumors (93%) expressed the NKX2-1 protein. Cases for which a doubt about the primary site in the lung remained were excluded. All adenocarcinomas were invasive. A bronchiolo-alveolar component was recorded when a non-invasive lepidic growth was seen adjacent to a component of invasive adenocarcinoma, which corresponds to the lepidic subtype in the proposed revisions to the histological classification of lung adenocarcinoma (2)(3). A bronchiolo-alveolar component was more frequent (p-value 0.0007) in never smokers (61%) than in ever smokers (34%). A translocation involving ALK was shown using FISH in 4 tumors from never smokers with wild-type EGFR (4).

Genomic DNA and RNA were extracted from frozen tissue sections using commercial kits (Qiagen, Hilden, Germany) at Institut Gustave-Roussy. Frozen samples were sectioned after removing most of the embedding medium. Beginning and end sections were stained with haematoxylin and eosin to assess the proportion of tumor cells. Only cases with an average proportion of tumor cells equal to or above 50% were included. Thirty to 60 sections were placed in two separate tubes and kept frozen in liquid nitrogen until nucleic acid extraction. Both RNA and DNA were assessed for integrity and quantity following stringent quality control criteria (CIT program protocols: cit.ligue-cancer.net).

Genomic array analysis

Genomic arrays were carried out on the Integragen platform (Evry, France). DNAs were hybridized on Illumina SNP HumanCNV370 chips according to the instructions provided by the array manufacturer (Illumina, San Diego, CA). Raw fluorescent signals were imported and normalized using Illumina BeadStudio software. A supplemental normalization procedure, tQN, was applied to correct for dye bias (5). Genomic profiles were segmented using the circular binary segmentation algorithm (DNAcopy package, Bioconductor) (6).
The absolute copy numbers were determined using the Genome Alteration Print method (7). The Genomic Identification of Significant Targets In Cancer (GISTIC) version 2.0 algorithm was applied to high-quality copy-number profiles with an amplitude threshold of 0.2 for copy-number amplifications or copy-number deletions (8). Scoring was performed using the GeneGISTIC 2 procedure. Significant regions were identified with a significance level of 0.25 for the residual q-value. Their peak boundaries were calculated with a confidence level of 0.85. Genes overlapping wide peak limits were listed as the most likely gene targets in each region. The frequencies of aberrations contributing to significant peak regions identified by GISTIC 2.0 were compared using chi-square tests with Bonferroni correction for multiple testing.

Gene expression analysis

Gene expression arrays were carried out on the IGBMC microarray platform (Strasbourg, France). Total RNAs were amplified, labeled and hybridized to Affymetrix Human Genome U133 Plus 2.0 GeneChip, following the manufacturer’s protocol (Affymetrix, Santa Clara, CA). Microarrays were scanned with an Affymetrix GeneChip Scanner 3000 and raw intensities were quantified from subsequent images using GCOS 1.4 software (Affymetrix). Data were normalized using the Robust Multi-array Average method (9).

Unsupervised hierarchical clustering analysis of normal lung and tumor samples using the Pearson correlation metric was performed on the 1183 probe sets (quantile 0.975) with the greatest robust coefficient of variation between samples. Only probe sets with Affymetrix annotation class A and located on autosomes were considered. The tumor samples were from the LG cohort. The normal lung samples were from eleven female Asian never smokers with publicly available Human Genome U133 Plus 2.0 Affymetrix gene expression data (accession number: GSE19804).
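As an illustration of the probe-set selection described above, the following minimal Python sketch ranks probe sets by a robust coefficient of variation, keeps those at or above a chosen quantile, and computes a Pearson correlation distance between two expression profiles. The definition of the robust coefficient of variation used here (scaled median absolute deviation divided by the median) is an assumption, since the text does not specify it; the function names and toy values are hypothetical and do not come from the original analysis.

```python
from math import sqrt
from statistics import median


def robust_cv(values):
    """Robust coefficient of variation (ASSUMED definition): 1.4826 * MAD / median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return 1.4826 * mad / med if med != 0 else 0.0


def select_variable_probes(expr, quantile=0.975):
    """Return probe-set ids whose robust CV is at or above the given quantile.

    expr maps a probe-set id to its list of expression values (one per sample).
    """
    scores = {probe: robust_cv(vals) for probe, vals in expr.items()}
    ranked = sorted(scores.values())
    # Simple empirical quantile without interpolation.
    cutoff = ranked[min(int(quantile * len(ranked)), len(ranked) - 1)]
    return [probe for probe, score in scores.items() if score >= cutoff]


def pearson_distance(x, y):
    """Distance 1 - r between two profiles (undefined if either profile is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 1.0 - cov / sqrt(vx * vy)


# Toy data: four hypothetical probe sets measured in four samples.
toy_expr = {
    "p1": [8, 8, 8, 8],
    "p2": [6, 7, 8, 9],
    "p3": [2, 10, 2, 10],
    "p4": [5, 5, 6, 5],
}
print(select_variable_probes(toy_expr, quantile=0.75))  # → ['p3']
```

The resulting distance could then be passed to any hierarchical clustering routine; the linkage method is not stated in the text and is therefore left open here.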
Clustering stability was assessed using a resampling strategy as well as noise addition and comparison of clustering procedures. Differences between sample clusters were tested using the chi-square test. Gene Ontology (GO) sets were obtained from GeneOntology.org. Genatomy was used to calculate hypergeometric enrichment with FDR correction of p-values (10)(11). Literature Vector Analysis (LitVAn) was used to infer gene cluster functionality with an evaluation of the significance of their scores (litvan.bio.columbia.edu) (12).

Both genomic and gene expression data were deposited in the ArrayExpress database (accession number: E-MTAB-923), which also includes a list of EGFR and KRAS mutations.

ATAD2 relative expression was measured in duplicate by real-time RT-PCR using the Hs00204205 TaqMan® probe (Applied Biosystems, Carlsbad, CA, USA). POLR2A and YAP1 were selected as the best combination of stable internal control genes across the cohort using the Normfinder algorithm (13). The ATAD2 and internal control data were corrected for PCR efficiency and interplate variations. The comparative threshold cycle (CT) method was applied to calculate the mean ± s.d. of 2^-ΔCT, and then the fold-change differences between groups and the corresponding two-sided Welch t-test p-values (14).

Integration of copy-number and gene expression data

We used COpy Number and EXpression In Cancer (CONEXIC) to integrate matched copy-number (amplifications or deletions) and gene expression data from 80 paired samples (15). The list of potential regulators (candidate genes) for which amplification data or deletion data were considered included genes overlapping GISTIC 2.0 significant regions that were less than one third of a chromosome arm in length. Too many candidate drivers can burden CONEXIC computationally and statistically (15). We encountered this problem with the full list of candidate drivers (3048 genes) overlapping all GISTIC regions.
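The length criterion mentioned above (regions shorter than one third of a chromosome arm) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `Region` structure, the function names and all coordinates are hypothetical and are not taken from the GISTIC or CONEXIC software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A significant copy-number region (hypothetical structure for illustration)."""
    name: str
    start: int       # region start, in base pairs
    end: int         # region end, in base pairs
    arm_length: int  # length of the chromosome arm carrying the region, in base pairs


def is_focal(region: Region) -> bool:
    """True if the region is strictly shorter than one third of its chromosome arm.

    Integer arithmetic (3 * length < arm_length) avoids floating-point
    rounding at the one-third boundary.
    """
    return 3 * (region.end - region.start) < region.arm_length


def focal_regions(regions):
    """Keep only the regions that satisfy the focal criterion."""
    return [r for r in regions if is_focal(r)]


# Toy regions with made-up coordinates on a 60 Mb arm.
regions = [
    Region("region_A", 1_000_000, 11_000_000, 60_000_000),  # 10 Mb: focal
    Region("region_B", 0, 20_000_000, 60_000_000),          # exactly one third: not focal
    Region("region_C", 5_000_000, 45_000_000, 60_000_000),  # 40 Mb: not focal
]
print([r.name for r in focal_regions(regions)])  # → ['region_A']
```

Only genes overlapping the regions that pass such a filter would then be kept as candidate regulators.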
To reduce the number of candidate genes, we could have used a lower threshold for the FDR q-value in the GISTIC analysis. Instead, we kept the FDR q-value threshold of 0.25, which is typically used in GISTIC analyses, and considered only ‘focal’ aberrations, which were arbitrarily identified by their length relative to the chromosome arm. The criterion ‘less than a third of chromosome arm’ was applied exactly by calculating the ratio of the length of each aberrant region to that of the chromosome arm on which it was located.

Gene expression data were processed by removing probe sets whose standard deviation was smaller than 0.25 on a log2 scale, resulting in 28821 probe sets measuring 16009 unique genes. Inconsistent multiple probe sets were then removed, resulting in a final set of 10358 genes. Expression values were normalized to a mean of zero and a standard deviation of one for each gene.

During the ‘SingleModulator’ step, a conservative set of potential driver genes was obtained using a Welch t-test (p-value<0.05), comparing amplified versus normal or deleted versus normal. All potential driver genes were considered during the following ‘NetworkLearning’ step. The ‘Single Modulator’ step was run using permutation testing (p-value<0.001) and the ‘noUpDown’ parameter such that each module contained genes positively and negatively regulated with expression of the regulators. Potential regulators were excluded as members of the resulting clusters. The parameters for the scoring function were alpha=2 and lambda=1. Non-parametric bootstrapping was applied to the 80 samples and repeated 100 times. Candidate drivers were retained when they were selected in at least 90% of the runs. The ‘Network Learning’ step was run using permutation testing (p-value<0.05) and the ‘LikelihoodCutoff’ parameter (value 2) to remove