ENVE: a Novel Computational Framework

Varadan et al. Genome Medicine (2015) 7:69 DOI 10.1186/s13073-015-0192-9 METHOD Open Access ENVE: a novel computational framework characterizes copy-number mutational landscapes in colorectal cancers from African American patients Vinay Varadan1,2,7*†, Salendra Singh2, Arman Nosrati3, Lakshmeswari Ravi3, James Lutterbaugh3, Jill S. Barnholtz-Sloan1,2, Sanford D. Markowitz2,3,4,5, Joseph E. Willis2,4,5,6 and Kishore Guda1,2,4,7*† Abstract Reliable detection of somatic copy-number alterations (sCNAs) in tumors using whole-exome sequencing (WES) remains challenging owing to technical (inherent noise) and sample-associated variability in WES data. We present a novel computational framework, ENVE, which models inherent noise in any WES dataset, enabling robust detection of sCNAs across WES platforms. ENVE achieved high concordance with orthogonal sCNA assessments across two colorectal cancer (CRC) WES datasets, and consistently outperformed a best-in-class algorithm, Control-FREEC. We subsequently used ENVE to characterize global sCNA landscapes in African American CRCs, identifying genomic aberrations potentially associated with CRC pathogenesis in this population. ENVE is downloadable at https://github.com/ENVE-Tools/ENVE. Background comprehensively characterize genome-scale DNA alter- Human cancer is caused in part by structural changes ations in tumor tissues. In particular, whole-exome resulting in DNA copy-number alterations at distinct sequencing (WES) offers a cost-effective way of interro- locations in the tumor genome. Identification of such gating mutation and copy-number profiles within somatic copy-number alterations (sCNA) in tumor tis- protein-coding regions in the tumor genome. This has sues has contributed significantly to our understanding resulted in the increasing use of WES in both the re- of the pathogenesis and to the expansion of therapeutic search [8, 9] and clinical settings [10, 11]. However, avenues across multiple cancers [1–4]. Traditionally, variability in tumor content among clinical samples in sCNAs have been detected using cytogenetic techniques addition to the random technical variability in DNA such as fluorescent in situ hybridization, array compara- library enrichment steps during WES can potentially tive genomic hybridization [5], and representational introduce systematic biases across the genome, thus oligonucleotide microarrays [6] as well as single nucleo- making sCNA detection relatively challenging. Al- tide polymorphism (SNP) arrays [7]. However, each of though quite a few algorithmic approaches have been the above techniques has limitations with regard to the developed to address these issues [12–18], a recent number, resolution, and platform-specific assessability of comprehensive review [19] of these published method- regions that can be interrogated in the genome. ologies, primarily using simulated data, showed sub- More recently, massively parallel sequencing tech- stantial variability in sensitivity and specificity across nologies have provided the unique opportunity to algorithms, with algorithm-specific parameter choice a key confounder of algorithm performance. This poses a significant challenge in reliably detecting sCNAs in * Correspondence: [email protected]; [email protected] † WES data because choosing the right parameter for a Equal contributors 1Division of General Medical Sciences-Oncology, Case Western Reserve given application is non-trivial. There is therefore a University, Cleveland, OH 44106, USA pressing need to develop relatively parameter-free and Full list of author information is available at the end of the article © 2015 Varadan et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Varadan et al. Genome Medicine (2015) 7:69 Page 2 of 15 robust methodologies for detecting these sCNAs across performed by the Oklahoma Medical Research Founda- diverse tumor types and sequencing platforms. tion Next Generation DNA Sequencing Core Facility Here we present a novel computational methodology, (Oklahoma City, OK, USA). Target sequence enrichments ENVE (Extreme Value Distribution Based Somatic were performed using the Illumina TruSeq Exome Enrich- Copy-Number Variation Estimation), which robustly ment kit as per the manufacturer’s protocols (Illumina detects tumor-specific copy-number alterations in mas- Inc., San Diego, CA, USA). Briefly, sample DNA was sively parallel DNA sequencing data without the need quantified using a PicoGreen fluorometric assay, and 3 μg for complex parameter choices or user intervention. of genomic DNA was randomly sheared to an average size We demonstrate the robustness of ENVE’s performance of 300 bp using a Covaris S2 sonicator (Covaris Inc., in two independent matched tumor/normal WES data- Woburn, MA, USA). Sonicated DNA was then end- sets (total N = 107), derived from Caucasian and repaired, A-tailed, and ligated with indexed paired-end African American (AA) colorectal cancers (CRC), by Illumina adapters. Target capture was performed on DNA comparing ENVE-based sCNA calls in WES data pooled from six indexed samples, following which the against SNP arrays and quantitative real-time PCR captured library was PCR amplified for ten cycles to (qPCR)-based sCNA assessments performed on the enrich for target genomic regions. The captured libraries same sample sets. We further show ENVE as signifi- were precisely quantified using a qPCR-based Kapa cantly and consistently outperforming the best-in-class Biosystems Library Quantification Kit (Kapa Biosystems sCNA detection algorithm, Control-FREEC [12, 19], in Inc., Woburn, MA, USA) on a Roche LightCycler 480 these analyses. We additionally demonstrate the repro- (Roche Applied Science, Indianapolis, IN, USA). Deep ducibility of ENVE’s key noise-modeling feature using sequencing of the capture enriched DNA pools was per- an independent WES dataset derived from 54 normal formed on an Illumina HiSeq 2000 instrument to generate diploid samples. Finally, using the ENVE framework, 100-bp paired-end reads, and to achieve an average read we characterize, for the first time, global sCNA land- depth of ~70× per tumor sample and ~50× per matched scapes in colon cancers arising in AA patients, identify- normal sample. A Burrows–Wheeler Aligner version ing genomic aberrations potentially associated with 0.6.1-r104 [22] algorithm [23] was used to align individual colon carcinogenesis in this population. 100-bp reads from the raw FASTQ files to the human reference genome (build hg19). Following the conversion Methods of aligned reads in to Binary Sequence Alignment/Map AA CRC samples (BAM) format and subsequent removal of duplicated The AA CRC sample set included a total of 30 fresh- reads, coverage metrics of target bases were calculated frozen, predominantly late-stage microsatellite stable using the Picard tools version 1.41 [24]. On average, (MSS) CRCs and matched normal samples from AA Picard metrics showed ~69 % of the target bases covered patients (Additional file 1: Table S1). The colon cancer at 20× read depth for the normal samples and ~86 % of diagnosis of all tumor samples was reviewed and con- the target bases covered at 20× read depth for the tumors. firmed by an anatomic pathologist (JW). Genomic DNA from the tumor samples was extracted as previously The Cancer Genome Atlas CRC whole-exome dataset described [20]. DNA from all patients’ tumors was con- We identified a total of 77 MSS colon adenocarcinoma firmed as being MSS by evaluation of microsatellite alleles and matched normal WES samples from The Cancer in tumor and matched normal DNA at microsatellite Genome Atlas (TCGA) colon cancer repository on the markers BAT26, BAT40, D2S123, D5S346, and D17S250 Cancer Genomics Hub [25], for which Affymetrix SNP6 [21]. All samples used in this study were accrued under array-based copy-number profiles were available on the tumor sample accrual protocol entitled, “CWRU 7296: TCGA Data Portal (Additional file 1: Table S1). BAM Colon Epithelial Tissue Bank,” which was approved by the files for the 77 tumor/normal pairs were downloaded University Hospitals Case Medical Center Institutional from Cancer Genomics Hub using the GeneTorrent cli- Review Board for Human Investigation with the assigned ent. Subsequent to removal of duplicated reads, coverage UH IRB number 03-94-105. Under this protocol, tissue metrics of target bases were calculated using the Picard was obtained through written informed consent from tools. On average, Picard metrics showed ~86 % of the patients for research use. All aspects of this study were target bases covered at 20× read depth for both the nor- conducted in accordance with these approved guidelines. mal and tumor samples. Whole-exome capture, deep sequencing, and alignment SNP array-based copy-number analysis of AA CRC samples We evaluated 12 of the 30 AA tumor/normal paired Target capture, library preparation, and deep sequencing samples (Additional file 1: Table S1) for genome-wide of the 30 normal/tumor paired frozen DNA samples were somatic

ENVE: a Novel Computational Framework

VISPA: a Computational Pipeline for the Identification and Analysis Of

INTELLIGENT MEDICINE the Wings of Global Health

Developing and Implementing an Institute-Wide Data Sharing Policy Stephanie OM Dyke and Tim JP Hubbard*

Cancer Genome Interpreter Annotates the Biological and Clinical Relevance of Tumor Alterations David Tamborero1,2, Carlota Rubio-Perez1, Jordi Deu-Pons1,2, Michael P

A 26-Hour System of Highly Sensitive Whole Genome Sequencing for Emergency Management of Genetic Diseases Neil A

Exploration of Haplotype Research Consortium Imputation for Genome-Wide Association Studies in 20,032 Generation Scotland Participants Reka Nagy1, Thibaud S

Omic Personality: Implications of Stable

Downloaded from the CAVA Integration with Data Generated by Pre-NGS Methods Webpage [19]

Characterizing Biobank Organizations in the U.S.: Results from a National

Detecting Protein Variants by Mass Spectrometry: a Comprehensive Study in Cancer Cell-Lines Javier A

Biomed Central Open-Access Research That Covers a Broad Range of Disciplines, and Reaches Influencers and Decision Makers

Systems Medicine and Integrated Care to Combat Chronic