An Integrative Bioinformatics Approach Reveals Coding and Non-Coding
Total Page:16
File Type:pdf, Size:1020Kb
www.nature.com/bjc ARTICLE Genetics & Genomics An integrative bioinformatics approach reveals coding and non-coding gene variants associated with gene expression profiles and outcome in breast cancer molecular subtypes Balázs Győrffy1,2,Lőrinc Pongor1,2, Giulia Bottai3, Xiaotong Li4, Jan Budczies5, András Szabó2, Christos Hatzis4, Lajos Pusztai4 and Libero Santarpia3 BACKGROUND: Sequence variations in coding and non-coding regions of the genome can affect gene expression and signalling pathways, which in turn may influence disease outcome. METHODS: In this study, we integrated somatic mutations, gene expression and clinical data from 930 breast cancer patients included in the TCGA database. Genes associated with single mutations in molecular breast cancer subtypes were identified by the Mann-Whitney U-test and their prognostic value was evaluated by Kaplan-Meier and Cox regression analyses. Results were confirmed using gene expression profiles from the Metabric data set (n = 1988) and whole-genome sequencing data from the TCGA cohort (n = 117). RESULTS: The overall mutation rate in coding and non-coding regions were significantly higher in ER-negative/HER2-negative tumours (P = 2.8E–03 and P = 2.4E–07, respectively). Recurrent sequence variations were identified in non-coding regulatory regions of several cancer-associated genes, including NBPF1, PIK3CA and TP53. After multivariate regression analysis, gene signatures associated with three coding mutations (CDH1, MAP3K1 and TP53) and two non-coding variants (CRTC3 and STAG2)in cancer-related genes predicted prognosis in ER-positive/HER2-negative tumours. CONCLUSIONS: These findings demonstrate that sequence alterations influence gene expression and oncogenic pathways, possibly affecting the outcome of breast cancer patients. Our data provide potential opportunities to identify non-coding variations with functional and clinical relevance in breast cancer. British Journal of Cancer (2018) 118:1107–1114; https://doi.org/10.1038/s41416-018-0030-0 INTRODUCTION associated gene expression signatures often have a greater Results from DNA-sequencing studies revealed a great degree of prognostic value than single mutations.7, 8 genomic heterogeneity in breast cancer, which may partly explain In this study, we used the data from 930 breast cancer patients the diversity of clinical behaviour of breast cancer subtypes.1 It is of The Cancer Genome Atlas (TCGA) database to identify genes also apparent that breast cancer genome is characterised by only with sequence variations in coding and non-coding regions that a few frequently mutated genes together with a long tail of rare were captured by exome sequencing. We next identified genes mutations in a great variety of genes, whose potential prognostic whose expression was associated with a given mutation by impact is difficult to assess due to the small sample size of affected comparing cases with and without the mutation. This analysis was cases.2 Furthermore, although it has been suggested that these done separately for each molecular breast cancer subtype. Once genetic aberrations can be useful drug targets or biomarkers for subtype-specific mutation-associated gene signatures were iden- patient stratification, their utility in clinical practice has been tified, we assessed their effects on survival. Results were also limited so far.3 validated using whole-genome sequencing (WGS) data from the In addition to mutations in protein-coding regions, a new class TCGA cohort (n = 117) and gene expression profiles from the of oncogenic events in non-coding areas of the genome has also Metabric data set (n = 1988). been identified, suggesting that cancer can result from an array of genomic alterations affecting both coding and non-coding regions.4 Most pathogenic DNA sequence alterations directly or MATERIALS AND METHODS indirectly impact gene expression and protein functions, leaving Data source and acquisition an imprint on messenger RNA expression that can be captured by Whole-exome sequencing (WES), RNA sequencing (RNA-seq) data gene expression profiling.5, 6 It is noteworthy that mutation- and the related clinical information from the TCGA database (n = 1MTA TTK Lendület Cancer Biomarker Research Group, Institute of Enzymology, Budapest H-1117, Hungary; 22nd Department of Pediatrics, Semmelweis University, Budapest H- 1094, Hungary; 3Oncology Experimental Therapeutics, Humanitas Clinical and Research Institute, Rozzano, Milan 20089, Italy; 4Breast Medical Oncology, Yale Cancer Center, Yale School of Medicine, New Haven, Connecticut 06520, USA and 5Institute of Pathology, Charité University Hospital, 10117 Berlin, Germany Correspondence: Libero Santarpia ([email protected]) These authors contributed equally: Lőrinc Pongor and Giulia Bottai Received: 19 July 2017 Revised: 19 January 2018 Accepted: 22 January 2018 Published online: 21 March 2018 © Cancer Research UK 2018 Breast cancer genetic variants and prognosis BGyőrffy et al. 1108 930), as well as WGS data (n = 117) were obtained from http:// Gene expression signatures cancergenome.nih.gov/. Gene expression analysis was performed Mann-Whitney U-test was performed to identify genes whose on the pre-processed RNA-seq data (i.e., level 3 data) using expression was significantly associated with a given genotype (i.e., MapSplice and RNA-Seq by Expectation-Maximisation. Individual somatic mutation) in each breast cancer subtype, separately. patient files were merged into a single database using the plyr R Samples were divided into two cohorts according to the mutation package.9 Information for relapse-free survival was available only status and the cohorts were compared to each other. The analysis for 52 patients and therefore we used overall survival (OS) as was performed for all coding and non-coding sequence variations outcome measure. Clinical and pathological features of the breast without filtering for functional significance. The average expres- cancer cohort are described in Supplementary Table 1. The entire sion of the significantly mutation-associated genes was desig- workflow of the study is summarised in Supplementary Figure 1. nated as the gene expression signature of a given genotype. The expression of the downregulated genes was inverted before Processing of sequence variations computing the mean expression of the signature. Significant Aligned data were downloaded via The Cancer Genomics Hub associations from the Mann-Whitney U-test (P ≤ 0.01) were ranked (https://cghub.ucsc.edu/) for both tumour and matched normal based on their achieved P-values. Finally, a maximum of samples. Somatic mutation calls were performed with the MuTect 100 significant genes for each signature was included to reduce programme, as previously described.8 The identified variants were noise in large gene sets. annotated with MuTect using SNP database (dbSNP, build 139) and Catalogue Of Somatic Mutations in Cancer (COSMIC, version Statistical analyses 68) library.10 The recognised sequence variations were functionally To examine the association of mutation status and mutation- annotated via SNPeff (version 3.5) and filtered to include the associated signatures with OS, we performed Kaplan-Meier and COSMIC-identified genes only.8 Known cancer-associated genes Cox proportional hazard regression analyses using the median were defined by the Cancer Gene Census.10 We also analysed expression of the signature to dichotomise the population. sequence variants occurring in introns, promoters (defined as − Multivariate analysis was performed including tumour size 2.5 kb from transcription starting sites) and other regulatory (T stage), lymph node status (N stage), the presence of distant elements, including enhancers, untranslated regions (UTRs) and metastases (M stage) and MKI67 gene expression as a measure transcription factor (TF) binding sites, which were captured by of proliferation. The level of statistical significance was set at WES. We used the LARVA software to identify significantly P<0.05. 1234567890();,: mutated non-coding regulatory elements, via modelling with β- binomial distribution and mutation rate calculation through DNA replication timing correction.11 To identify potential non-coding RESULTS drivers, somatic mutations were annotated and prioritised by Database characteristics FunSeq2, combining inter- and intra-species conservation, loss- Nine hundred and thirty patients with invasive breast cancer were and gain-of-function events for TF binding, enhancer-gene analysed, including 50.2% ER-positive/HER2-negative, 19.9% ER- linkages and network centrality, and per-element recurrence negative/HER2-negative and 29.9% HER2-positive cancers. The across samples.12 mean age of patients was 58.3 years. Nodal status was available for 905 patients, of which 45.3% were lymph node positive. The Sample classification according to breast cancer molecular median follow-up was 31.5 months. The detailed characteristics of subtypes patients included in the analysis are presented in Supplementary Breast cancer molecular subtypes were defined by the oestrogen Table 1. receptor (ER) and human epidermal growth factor receptor 2 (HER2) status determined by RNA-seq data. For ER the RNA-seq ID Genetic variants in coding and non-coding regions in breast 2099 was used with a cut-off of 3,700, whereas for HER2 the RNA- cancer seq ID 2064 was used with a cut-off of 27,000. Cut-off values were We found 208 and 3562 genes with sequence variations in coding determined by a receiver operating curve (ROC) analysis compar- and non-coding areas in at