Curation of the Pancreatic Ductal Adenocarcinoma Subset of The
Total Page:16
File Type:pdf, Size:1020Kb
Published OnlineFirst May 8, 2018; DOI: 10.1158/1078-0432.CCR-18-0290 Perspectives Clinical Cancer Research Curation of the Pancreatic Ductal Adenocarcinoma Subset of the Cancer Genome Atlas Is Essential for Accurate Conclusions about Survival-Related Molecular Mechanisms Ivana Peran1, Subha Madhavan1,2, Stephen W. Byers1, and Matthew D. McCoy1,2 Abstract Purpose: Publicly available databases, for example, The Cancer survival rates of PDAC. A similar discrepancy was observed in Genome Atlas (TCGA), containing clinical and molecular data lung, stomach, and liver cancer subsets of TCGA. The whole from many patients are useful in validating the contribution of transcriptome expression patterns of PDAC-TCGA revealed a particular genes to disease mechanisms and in forming novel cluster of samples derived from neuroendocrine tumors, which hypotheses relating to clinical outcomes. have a distinctive biology and better disease prognosis than Experimental Design: Theimpactofkeydriversofcancer PDAC. Furthermore, PDAC-TCGA contains numerous pseudo- progression can be assessed by segregating a patient cohort normal samples, as well as those that arose from tumors not by certain molecular features and constructing survival plots classified as PDAC. using the associated clinical data. However, conclusions Conclusions: Inclusion of misclassified samples in the bioin- drawn from this straightforward analysis are highly dependent formatic analyses distorts the association of molecular biomar- onthequalityandsourceoftissuesamples,asdemonstrated kers with clinical outcomes, altering multiple published conclu- through the pancreatic ductal adenocarcinoma (PDAC) subset sions used to support and motivate experimental research. Hence, of TCGA. the stringent scrutiny of type and origin of samples included in the Results: Analyses of the PDAC-TCGA database, which contains bioinformatic analyses by researchers, databases, and web-tool mainly resectable cancer samples from patients in stage IIB, reveal developers is of crucial importance for generating accurate con- a difference from widely known historic median and 5-year clusions. Clin Cancer Res; 1–7. Ó2018 AACR. Introduction At the same time, the field of molecular medicine recognized an opportunity for the development of targeted therapies for patients Completion of the Human Genome Project in 2003 (1, 2) with similar gene mutations, expression, and protein activation allowed for association studies mapping genomic variation, par- patterns. Eventually, the scientific and medical community turned ticularly mutations, with disease phenotypes. Besides mutations, toward personalized therapy, specifically accounting for the disease incidence can also be a consequence of aberrant gene molecular mechanism at work in an individual patient. Technical expression or protein activation status. Fortunately, completion developments and high-throughput screening allowed us to gath- of the human genome also laid the foundation for mapping of er information on gene mutation and expression, as well as expressed mRNA transcripts to their genomic sequences, which protein activation across a large number of patients. The resulting has subsequently revealed the diversity of gene expression availability of datasets that include information about phenotypic patterns responsible for the underlying physiology of healthy and molecular features has been a significant driver of transla- and diseased tissues. tional research over the last decade. The Cancer Genome Atlas (TCGA) began cataloging molec- ular datasets for various cancer types in 2005, and has since 1Georgetown-Lombardi Comprehensive Cancer Center, Department of Oncol- grown to include extensive molecular and clinical information ogy, Georgetown University Medical Center, Washington, DC. 2Innovation for individual cancer patient samples (3). Today, the data for 33 Center for Biomedical Informatics, Georgetown University, Washington, DC. different tumor types based on tissues collected from over Note: Supplementary data for this article are available at Clinical Cancer 11,000 patients is available (https://cancergenome.nih.gov). Research Online (http://clincancerres.aacrjournals.org/). To support access and use of the data, many publically available Corresponding Authors: Ivana Peran, Georgetown-Lombardi Comprehen- web tools use processed TCGA datasets to enable rapid analysis sive Cancer Center, Department of Oncology, Georgetown University of gene expression pattern, mutation analysis, and other molec- Medical Center, 3970 Reservoir Road, NW, New Research Building, Room ular and clinical features associated with a specific cancer type. E415, Washington, DC 20057. Phone: 202-687-1891; Fax: 202-687-7505; This allows researchers to compare their findings with the E-mail: [email protected]; and Matthew D. McCoy, TCGA datasets, or correlate particular molecular features with [email protected] a clinical outcome. doi: 10.1158/1078-0432.CCR-18-0290 Here, we focus on the pancreatic ductal adenocarcinoma Ó2018 American Association for Cancer Research. (PDAC) dataset of TCGA (annotated as PAAD within TCGA), www.aacrjournals.org OF1 Downloaded from clincancerres.aacrjournals.org on October 1, 2021. © 2018 American Association for Cancer Research. Published OnlineFirst May 8, 2018; DOI: 10.1158/1078-0432.CCR-18-0290 Peran et al. Translational Relevance Materials and Methods Publicly available molecular databases have significantly The publicly available PDAC dataset of the TCGA (annotated as contributed to translational research, especially when validat- PAAD within TCGA) was initially analyzed using Morpheus ing the impact of a particular gene from a preclinical to clinical (Broad Institute, Cambridge, MA; https://software.broadinsti setting or generating hypotheses related to prognostic biomar- tute.org/morpheus/) and Xenabrowser software (University of kers. In response to the arrival of databases containing anno- California Santa Cruz; Xena, http://xena.ucsc.edu). The processed tated information on molecular and clinical features across a gene expression data were downloaded from Morpheus, and variety of diseases, numerous web tools for big data analysis analyzed using hierarchal clustering to identify subsets of samples have emerged. Here we illustrate the importance of scrutiniz- with related gene expression patterns. Fold changes for genes – ing the underlying content when using common analysis commonly associated with PDAC progression (4 7) were nor- methods to validate laboratory findings and generate a novel malized to their median expression to quantify the distribution hypothesis predicting phenotypic associations with molecular across the entire dataset. The data on pancreatic cancer statistics – – signatures. We advise researchers to carefully consider the was obtained through the NIH NCI Surveillance, Epidemiol- origin and characteristics of the individual samples included ogy, and End Results Program; (https://seer.cancer.gov/statfacts/ – in the bioinformatic analyses and recommend exclusion of html/pancreas.html). Kaplan Meier graphs were generated for data originating from (pseudo)-normal or other cell origins population subsets showing differential expression for a given fi with distinct underlying biology. A failure to properly curate PDAC-associated gene and statistical signi cance was calculated – – the sample pool can lead to inaccurate conclusions about (log-rank test and Gehan Breslow Wilcoxon test) by using clinically relevant biomarkers and misinformed hypotheses GraphPad Prism version 5.0 for Mac OS X (GraphPad Software, about disease-related mechanisms. La Jolla California). Patients were divided into low versus high expression groups based on the data published in the original article to the best of our knowledge. Results which contains genomic, transcriptomic, and proteomic analy- Discrepancy between PDAC-TCGA and SEER's data on ses of 185 patient samples (https://tcga-data.nci.nih.gov/docs/ pancreatic ductal adenocarcinoma statistics publications/tcga/). In addition to molecular data, the PDAC- According to the National Cancer Database, 55% of pan- TCGA annotated dataset contains numerous other pieces of creatic cancer patients between 1992 and 1998 were diagnosed clinical information, including overall survival, records of therapy in stage IV (Fig. 1A; ref. 8). However, in the PDAC-TCGA and surgery performed, metastatic status, history of diabetes, dataset, about two-thirds of patients were diagnosed in stage smoking, alcohol consumption, etc. However, researchers utiliz- IIB, with less than 3% of patients being in stage IV (Fig. 1B). ing the dataset for validation and hypothesis generation should The SEER registry also stratifies patients diagnosed with PDAC be wary of conclusions drawn using the entire collection of between 2007 and 2013 based on the spread of disease into samples. Despite its name, the PDAC dataset includes samples localized, regional, or distant groups. Comparable with the of various cell origin, as well as (pseudo)-normal samples. As National Cancer Database statistics from the previous decade, shown in Table 1, some of the cancer samples did not arise from more than half of the patients in the SEER database were the pancreas; belong to neuroendocrine tumors, or to different diagnosed with distant disease (Fig. 1C). Most PDAC-TCGA subtypes of pancreatic cancer not classified as adenocarcinoma. patients had localized or regionally spread disease (25% and We show here that inclusion of those samples into analyses skews