Varadan et al. Genome Medicine (2015) 7:69 DOI 10.1186/s13073-015-0192-9

METHOD ENVE: a novel computational framework characterizes copy-number mutational landscapes in colorectal cancers from African American patients Vinay Varadan1,2,7*†, Salendra Singh2, Arman Nosrati3, Lakshmeswari Ravi3, James Lutterbaugh3, Jill S. Barnholtz-Sloan1,2, Sanford D. Markowitz2,3,4,5, Joseph E. Willis2,4,5,6 and Kishore Guda1,2,4,7*†

Abstract Reliable detection of somatic copy-number alterations (sCNAs) in tumors using whole-exome sequencing (WES) remains challenging owing to technical (inherent noise) and sample-associated variability in WES data. We present a novel computational framework, ENVE, which models inherent noise in any WES dataset, enabling robust detection of sCNAs across WES platforms. ENVE achieved high concordance with orthogonal sCNA assessments across two colorectal cancer (CRC) WES datasets, and consistently outperformed a best-in-class algorithm, Control-FREEC. We subsequently used ENVE to characterize global sCNA landscapes in African American CRCs, identifying genomic aberrations potentially associated with CRC pathogenesis in this population. ENVE is downloadable at https://github.com/ENVE-Tools/ENVE.

Background comprehensively characterize genome-scale DNA alter- Human cancer is caused in part by structural changes ations in tumor tissues. In particular, whole-exome resulting in DNA copy-number alterations at distinct sequencing (WES) offers a cost-effective way of interro- locations in the tumor genome. Identification of such gating mutation and copy-number profiles within somatic copy-number alterations (sCNA) in tumor tis- protein-coding regions in the tumor genome. This has sues has contributed significantly to our understanding resulted in the increasing use of WES in both the re- of the pathogenesis and to the expansion of therapeutic search [8, 9] and clinical settings [10, 11]. However, avenues across multiple cancers [1–4]. Traditionally, variability in tumor content among clinical samples in sCNAs have been detected using cytogenetic techniques addition to the random technical variability in DNA such as fluorescent in situ hybridization, array compara- library enrichment steps during WES can potentially tive genomic hybridization [5], and representational introduce systematic biases across the genome, thus oligonucleotide microarrays [6] as well as single nucleo- making sCNA detection relatively challenging. Al- tide polymorphism (SNP) arrays [7]. However, each of though quite a few algorithmic approaches have been the above techniques has limitations with regard to the developed to address these issues [12–18], a recent number, resolution, and platform-specific assessability of comprehensive review [19] of these published method- regions that can be interrogated in the genome. ologies, primarily using simulated data, showed sub- More recently, massively parallel sequencing tech- stantial variability in sensitivity and specificity across nologies have provided the unique opportunity to algorithms, with algorithm-specific parameter choice a key confounder of algorithm performance. This poses a significant challenge in reliably detecting sCNAs in * Correspondence: [email protected]; [email protected] † WES data because choosing the right parameter for a Equal contributors 1Division of General Medical Sciences-Oncology, Case Western Reserve given application is non-trivial. There is therefore a University, Cleveland, OH 44106, USA pressing need to develop relatively parameter-free and Full list of author information is available at the end of the article

© 2015 Varadan et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Varadan et al. Genome Medicine (2015) 7:69 Page 2 of 15

robust methodologies for detecting these sCNAs across performed by the Oklahoma Medical Research Founda- diverse tumor types and sequencing platforms. tion Next Generation DNA Sequencing Core Facility Here we present a novel computational methodology, (Oklahoma City, OK, USA). Target sequence enrichments ENVE (Extreme Value Distribution Based Somatic were performed using the Illumina TruSeq Exome Enrich- Copy-Number Variation Estimation), which robustly ment kit as per the manufacturer’s protocols (Illumina detects tumor-specific copy-number alterations in mas- Inc., San Diego, CA, USA). Briefly, sample DNA was sively parallel DNA sequencing data without the need quantified using a PicoGreen fluorometric assay, and 3 μg for complex parameter choices or user intervention. of genomic DNA was randomly sheared to an average size We demonstrate the robustness of ENVE’s performance of 300 bp using a Covaris S2 sonicator (Covaris Inc., in two independent matched tumor/normal WES data- Woburn, MA, USA). Sonicated DNA was then end- sets (total N = 107), derived from Caucasian and repaired, A-tailed, and ligated with indexed paired-end African American (AA) colorectal cancers (CRC), by Illumina adapters. Target capture was performed on DNA comparing ENVE-based sCNA calls in WES data pooled from six indexed samples, following which the against SNP arrays and quantitative real-time PCR captured library was PCR amplified for ten cycles to (qPCR)-based sCNA assessments performed on the enrich for target genomic regions. The captured libraries same sample sets. We further show ENVE as signifi- were precisely quantified using a qPCR-based Kapa cantly and consistently outperforming the best-in-class Biosystems Library Quantification Kit (Kapa Biosystems sCNA detection algorithm, Control-FREEC [12, 19], in Inc., Woburn, MA, USA) on a Roche LightCycler 480 these analyses. We additionally demonstrate the repro- (Roche Applied Science, Indianapolis, IN, USA). Deep ducibility of ENVE’s key noise-modeling feature using sequencing of the capture enriched DNA pools was per- an independent WES dataset derived from 54 normal formed on an Illumina HiSeq 2000 instrument to generate diploid samples. Finally, using the ENVE framework, 100-bp paired-end reads, and to achieve an average read we characterize, for the first time, global sCNA land- depth of ~70× per tumor sample and ~50× per matched scapes in colon cancers arising in AA patients, identify- normal sample. A Burrows–Wheeler Aligner version ing genomic aberrations potentially associated with 0.6.1-r104 [22] algorithm [23] was used to align individual colon carcinogenesis in this population. 100-bp reads from the raw FASTQ files to the human reference genome (build hg19). Following the conversion Methods of aligned reads in to Binary Sequence Alignment/Map AA CRC samples (BAM) format and subsequent removal of duplicated The AA CRC sample set included a total of 30 fresh- reads, coverage metrics of target bases were calculated frozen, predominantly late-stage microsatellite stable using the Picard tools version 1.41 [24]. On average, (MSS) CRCs and matched normal samples from AA Picard metrics showed ~69 % of the target bases covered patients (Additional file 1: Table S1). The colon cancer at 20× read depth for the normal samples and ~86 % of diagnosis of all tumor samples was reviewed and con- the target bases covered at 20× read depth for the tumors. firmed by an anatomic pathologist (JW). Genomic DNA from the tumor samples was extracted as previously The Cancer Genome Atlas CRC whole-exome dataset described [20]. DNA from all patients’ tumors was con- We identified a total of 77 MSS colon adenocarcinoma firmed as being MSS by evaluation of microsatellite alleles and matched normal WES samples from The Cancer in tumor and matched normal DNA at microsatellite Genome Atlas (TCGA) colon cancer repository on the markers BAT26, BAT40, D2S123, D5S346, and D17S250 Cancer Genomics Hub [25], for which Affymetrix SNP6 [21]. All samples used in this study were accrued under array-based copy-number profiles were available on the tumor sample accrual protocol entitled, “CWRU 7296: TCGA Data Portal (Additional file 1: Table S1). BAM Colon Epithelial Tissue Bank,” which was approved by the files for the 77 tumor/normal pairs were downloaded University Hospitals Case Medical Center Institutional from Cancer Genomics Hub using the GeneTorrent cli- Review Board for Human Investigation with the assigned ent. Subsequent to removal of duplicated reads, coverage UH IRB number 03-94-105. Under this protocol, tissue metrics of target bases were calculated using the Picard was obtained through written informed consent from tools. On average, Picard metrics showed ~86 % of the patients for research use. All aspects of this study were target bases covered at 20× read depth for both the nor- conducted in accordance with these approved guidelines. mal and tumor samples.

Whole-exome capture, deep sequencing, and alignment SNP array-based copy-number analysis of AA CRC samples We evaluated 12 of the 30 AA tumor/normal paired Target capture, library preparation, and deep sequencing samples (Additional file 1: Table S1) for genome-wide of the 30 normal/tumor paired frozen DNA samples were somatic copy-number alterations using HumanOmni2.5-8 Varadan et al. Genome Medicine (2015) 7:69 Page 3 of 15

BeadChips containing 2,379,855 markers (Illumina). validated using a qBiomarker qPCR copy-number array Briefly, 200 ng of normal and tumor DNA were hybridized as per the manufacturer’s instructions (Qiagen Inc., on to the BeadChips and array images were scanned using Valencia, CA, USA). Briefly, 500–700 ng of genomic the HiScan System (Illumina). The array data were subse- DNA from six matched tumor/normal AA CRC cases quently processed using GenomeStudio software to gener- used for WES, and DNA from an additional six AA nor- ate the B-allele frequency and log-ratio values across all mal colon samples (Additional file 1: Table S1) was used the markers per individual chromosome (Illumina). for qPCR validation of a custom 11-gene panel, with Quality control analysis of the SNP array data revealed each gene mapping to a distinct genomic locus. Of note, an average of 98 % (range 91–99 %) call rate for the each of the 11 genes on the custom panel selected for samples. The B-allele frequency and log-ratio values of qPCR analysis showed significant copy-number alter- the samples were next imported into Partek Genomics ation, as detected by ENVE in WES data, in at least one Suite software (Partek Inc., St. Louis, MO, USA) to of the six AA CRC cases. Pre-designed qPCR primers identify regions showing significant copy-number alter- for the 11 candidate genes and a multi-copy reference ations in the tumors. This analysis was performed as (MRef) control were plated in quadruplicate on a per the manufacturer’s instructions using default set- 96-well plate, enabling the analysis of two samples per tings on the genomic segmentation algorithm, which plate (Qiagen). qPCR was carried out using the CFX96 included a minimum marker-distance of 50, P-value Real-Time PCR equipment (BioRad, Hercules, CA, USA) threshold of 0.001, and signal to noise ratio of 0.3. in a total volume of 25 μl containing the SYBR Green Affymetrix SNP Array 6.0–based copy-number profiles Assay Master Mix (Qiagen) for 10 min at 95 °C, were obtained from TCGA portal [26] for the 77 TCGA followed by 40 cycles of 95 °C for 15 s and 60 °C for 1 CRC WES samples (Additional file 1: Table S1). TCGA min. Cq values obtained from each of the reaction wells Level 3 copy-number data provide segmented copy- were uploaded to an online data analysis tool [28] for number calls, after elimination of potential germline copy- subsequent significance estimation of tumor-specific number variation (CNV) in each sample, using the Broad CNAs in the 11 target genes using the calibrator genome Institute’s Copy Number Inference pipeline for Affymetrix methodology, where the 12 AA normal samples served SNP Array 6.0 arrays. For each tumor sample, genome- as diploid genome controls. Tumor-specific CNAs with wide segmented copy-number calls were obtained at P ≤ 0.05 were considered significant. different Segment-Mean cutoff values ranging from ≥0.1 to ≥0.5 (indicating somatic amplifications) and from ≤−0.1 Control-FREEC–based copy-number analysis for AA and to ≤−0.5 (indicating somatic deletions). In all cases, seg- TCGA CRC datasets mented copy-number calls inferred from at least 10 SNP Somatic copy-number analysis on the AA CRC and array probes were included in the analyses as previously TCGA CRC WES datasets was performed using the suggested [2, 27]. developer’s recommended parameters for processing WES data from matched tumor/normal samples [12]. Copy-number analysis using pooled normals The window size was set to 500 bp with a step of 250 bp For the AA CRC WES dataset, we first computationally for all of the analyses. GC content normalization was pooled reads from the 30 AA normal samples (Add- enabled for all of the analyses, along with the noisyData itional file 1: Table S1). Next, we sub-sampled this option set as TRUE in order to avoid false-positive pooled normal data to generate 12 independent refer- predictions due to non-uniform capture in exome se- ence normals containing a similar number of total quencing data. For the primary analyses, Control-FREEC mapped reads observed for each of the 12 AA tumor (version 6.7) was run in the default mode without enab- samples for which SNP array data were available. We ling correction for contamination by normal cells. then performed ENVE Modules 2a-c on this simulated However, contamination adjustment was subsequently data to identify significant (ENVE P ≤ 0.05) sCNAs in enabled to evaluate whether automatic inference of the 12 AA tumor samples. Similarly, for the SNP array tumor content in the tissue samples improved the per- data, a computationally pooled reference normal was de- formance of Control-FREEC. rived from the 12 AA normal samples using the Partek Genomics Suite software (Partek) followed by SNP Recurrent sCNA identification using GISTIC array-based sCNA detection in the 12 tumors. The ENVE output file containing ENVE P-values assigned to each candidate copy-number–altered segment in the 30 qPCR-based estimation of somatic copy-number AA CRC and stage-matched 30 TCGA Caucasian CRC alterations cases was analyzed using the GISTIC tool (version 2.0.21) Recurrent somatic copy-number alterations in candidate [27]. The markers file for GISTIC was derived as the regions identified by ENVE in the WES dataset were union of the start coordinates of all possible 100-bp Varadan et al. Genome Medicine (2015) 7:69 Page 4 of 15

segments within the exonic regions defined in the region- dataset obtained from TCGA (TCGA CRC, N = 77) of-interest file for the Illumina TruSeq Exome platform. (See “Methods” and Additional file 1: Table S1). A sub- Copy-number–altered segments that were not considered set of these TCGA cases consisting of predominantly significant (ENVE P > 0.05) were assigned a LogRatio late-stage cancers (N = 30) was used to further assess value of 0, thus making them copy-neutral for GISTIC differences in sCNAs in CRCs arising in AA versus analysis. GISTIC broad-level analysis was performed with Caucasian ethnicities. a size threshold of 98 % of a chromosome arm to differen- tiate between arm-level and focal events. sCNA regions and arm-level events with q ≤ 0.25 were considered ENVE methodology overview significant. The significance of focal sCNA events was The ENVE methodology consists of two major modules: determined using residual q-values, which were estimated Module 1 uses non-tumor normal diploid samples to by removing amplifications or deletions that overlapped capture and model inherent noise in WES data likely other, more significant sCNAs in the same chromosome. arising from technical variability in the DNA capture, Focal sCNAs with residual q ≤ 0.05 were considered hybridization, and/or amplification steps, in addition to significant. The frequencies of the resulting significant variability in sequencing platforms. This is followed by recurrent sCNAs were plotted using ggplot2 (R package Module 2. which utilizes the learned model parameters version 0.9.3.1). to reliably detect somatic copy-number alterations in tu- mors (Fig. 1). WES data accessibility for AA CRC and TCGA cohorts As mentioned above, the 77 MSS colon adenocarcinoma Module 1 and matched normal WES samples from TCGA colon Module 1 of the ENVE methodology consists of four cancer cohort are publicly available in the repository on steps as follows: the Cancer Genomics Hub [25]. The AA CRC WES data- set (N = 30) was generated in-house, and all appropriate processed files relevant to this study can be accessed at Module 1a: Pairwise random normal-normal comparisons the ENVE Tool website [29]. In this module (Fig. 1), WES profiles of N normal samples {S1,S2, …,SN} are taken pairwise, resulting in (N !)/ Results (2 * (N − 2) !) random normal–normal combinations We describe the key computational steps in the ENVE {S1:S2,S1:S3, …,SN-1:SN}. We accordingly applied this methodology and evaluate its performance using two module to the WES profiles of the 30 AA normal and 30 matched tumor/normal WES datasets, an in-house TCGA normal samples, resulting in a total of 435 random WES dataset of predominantly late-stage, MSS AA normal–normal comparisons for the AA and TCGA CRCs (N = 30) [20], and a Caucasian MSS CRC WES cohorts.

Module 1 Module 2 Modeling Inherent Noise in WES Data using Utilizing the GEV Noise Model to Detect sCNAs Diploid Normal Samples in Tumors a a Pairwise Random Normal-Normal Comparisons Tumor-Normal Comparisons

S1:S2,S1:S3,S1:S4,...SN-1:SN T1:S1,T2:S2,T3:S3,...TN:SN

b Estimation of Segmental LogRatios in Pairwise b Estimation of Segmental LogRatios in Random Normal-Normal Comparisons Tumor-Normal Comparisons

c Establishing Chromosome-specific Noise Thresholds using: c GEV-based Significance Evaluation of Tumor- Fraction of chromosomal coverage Entropy of chromosomal coverage Normal Segmental LogRatios Silhouette index measure

d Modeling Inherent Noise using the Generalized ENVE Significant sCNAs Extreme Value Distribution (GEV) (P 0.05)

Fig. 1 Overview of the ENVE workflow to detect somatic copy-number alterations. The ENVE framework consists of two major modules: the first involves modeling of inherent noise in WES data using normal diploid samples (Module 1 on the left); the second module utilizes the expected variability as captured by the learned model parameters to detect sCNAs in tumors (Module 2 on the right) Varadan et al. Genome Medicine (2015) 7:69 Page 5 of 15

Module 1b: Estimation of segmental LogRatios in pairwise distributed around zero by subtracting the mode of the random normal-normal comparisons distribution from each of the segmental LogRatios. In this module (Fig. 1), genome-wide segmental LogRatios Accordingly, we applied ENVE Module 1b on the 435 for each of the (N !)/(2 * (N − 2) !) random normal–nor- random normal–normal comparisons derived from the mal combinations are calculated using read depth com- AA and TCGA normal WES datasets. As expected for parison and circular binary segmentation [30]. For each comparisons of normal diploid samples, the vast majority sample pair being compared, each target region within the of the segmental LogRatios across the normal–normal exome is divided into non-overlapping 100-bp windows. comparisons were distributed around zero, with a minor- The ratio of average read depth within these 100-bp ity of segments showing segmental LogRatios deviating windows is estimated for each pair of samples being significantly from zero (Figs. 2 and 3). Because these compared (DSi and DSj) after normalizing for the total significant deviations could result from either inherent number of uniquely mapped bases per sample (TRSi noise in WES data or focal germline CNV within the and TRSj) as follows: normal diploid samples being compared, identification of  those segmental LogRatio deviations associated primarily ¼ DSi à TRSj ð Þ with inherent noise is essential for subsequent noise W RdRatio log2 1 DSj TRSi modeling in WES data.

The ratio of read depth per window (WRdRatio) is then Module 1c: Establishing chromosome-specific noise corrected for GC content according to published meth- thresholds odology [14]. The resulting GC-corrected WRdRatio data ENVE Module 1c (Fig. 1) is specifically designed to iden- are segmented using circular binary segmentation, tify chromosome-specific segmental LogRatio deviations resulting in genomic segments and associated segmental associated particularly with random inherent noise in LogRatios summarized from the GC-corrected WRdRatio WES data. Given that segmental LogRatio deviations in of all 100-bp windows within each segment. The result- the normal–normal sample comparisons tend to be ing distribution of genome-wide segmental LogRatios asymmetrically distributed around zero (for example, in the normal–normal comparisons is adjusted to be Figs. 2a and 3a), the positive and negative LogRatio

Fig. 2 Modeling of inherent noise in the AA CRC WES dataset. a Distribution of exome-wide segmental LogRatios generated using 435 random normal–normal comparisons derived from 30 normal diploid WES samples. b, c Chromosomal coverage by copy-number altered segments at different LogRatio thresholds using the 435 normal–normal comparisons across all chromosomes for the positive (b) and negative (c) LogRatio deviations. Bold horizontal lines within each chromosome indicate the noise-threshold values Varadan et al. Genome Medicine (2015) 7:69 Page 6 of 15

Fig. 3 Modeling of inherent noise in the TCGA CRC WES dataset. a Distribution of exome-wide segmental LogRatios generated using 435 random normal–normal comparisons derived from 30 normal diploid WES samples. b, c Chromosomal coverage by copy-number altered segments at different LogRatio thresholds using the 435 normal–normal comparisons across all chromosomes for the positive (b) and negative (c) LogRatio deviations. Bold horizontal lines within each chromosome indicate the noise-threshold values

deviations in the normal–normal comparisons are mod- of chromosomal coverage at a particular RT is therefore eled separately. Accordingly, each chromosome is first the ratio of non-zero entries in FRT to the length of FRT divided into equal-sized non-overlapping 10-kb win- for a given chromosome. Next, the entropy of chromo- dows. The frequency with which each of the chromo- somal coverage, for given a chromosome, at each RT is somal windows is covered by copy-number altered given as: segments at LogRatios ranging from 0 to 2 in incre- X  ments of 0.05 is counted, both in the positive and nega- E ¼ − f RT Ã log f RT ð2Þ tive directions. Because chromosomal coverage tends to RT j 2 j ∀j be more complete and randomly distributed at lower Absolute LogRatio Thresholds (as expected with random R is defined as the R associated with the maximal inherent noise) for both positive and negative deviations, TF T drop in fraction of chromosomal coverage and R is as opposed to the sparse and focal coverage observed at TE defined as the R associated with the first major loss in higher Absolute LogRatio Thresholds (as expected with T entropy of chromosomal coverage per chromosome. The germline CNVs), ENVE Module 1c employs a robust maximum of R and R corresponds to the noise quantitative approach to differentiate between inherent TF TE threshold (R ), above which we expect to see focal noise and germline CNVs in the normal–normal com- NT alterations associated with germline CNVs, and below parisons (ENVE Module 1c, Fig. 1). Each chromosome is which we expect to see variations associated with first divided into non-overlapping 10-kb windows. Sub- random inherent noise. The average silhouette index sequently, the frequency of segmental coverage within [31] ascertains whether the pairwise distances between each chromosomal window is calculated using segments FRT vectors across R are substantially different from with absolute LogRatios at or above Absolute LogRatio NT RT the pairwise distances between F vectors within all RT Threshold (RT), with RT varying from 0 to 1 (RTmax)in RT above or below R . Positive silhouette index values steps of 0.05. Let F be the vector containing the fre- NT close to 1 suggest that the chromosome-specific noise quencies of segmental coverage f RT for all the win- j thresholds (RNT) appropriately capture the variability dows in a chromosome at a particular RT. The fraction associated with inherent noise. Varadan et al. Genome Medicine (2015) 7:69 Page 7 of 15

Accordingly, we employed the above measures in noise thresholds in the normal–normal comparisons ENVE Module 1c on the 435 AA and TCGA normal– (Figs. 2b,c and 3b,c). Intriguingly, despite the potential normal comparisons, obtaining chromosome-specific confounding effects of somatic alterations as well as noise thresholds (Figs. 2b,c and 3b,c; Additional file 2: platform-associated noise in these tumor samples, we Figures S1 and S2). These chromosome-specific noise found that segments with LogRatios above the noise thresholds exhibited high positive silhouette indices thresholds in the normal–normal comparisons exhibited (≥0.8) in both the AA and TCGA normal–normal com- significantly higher correlation with their matched parisons, indicating that the focal copy-number altered tumor–tumor counterparts in both the AA CRC (Pear- segments observed above the noise thresholds are quali- son’s r = 0.39; Fig. 4a) and TCGA CRC (Pearson’s r =0.36; tatively distinct from the random distribution of seg- Fig. 4b) datasets, as compared to randomly selected ments below the noise thresholds. tumor–tumor comparisons that showed virtually no To further ascertain that these discrete focal alter- correlation (Pearson’s r = 0.008 in AA CRC, Fig. 4a; ations in genomic segments observed above the noise Pearson’s r = 0.005 in TCGA CRC, Fig. 4b). However, thresholds in the normal–normal data are indicative of segments with LogRatios below the noise thresholds in germline CNVs prevalent among the normal samples, the normal–normal comparisons exhibited almost no we repeated the analysis of ENVE Modules 1a-c (Fig. 1) correlation with either their matched tumor–tumor coun- by replacing the normal–normal comparisons with cor- terparts (Pearson’s r =0.03inAACRC;Pearson’s r =0.07 responding matched tumor–tumor comparisons for both in TCGA CRC) or randomly selected tumor–tumor com- the AA and TCGA CRC datasets, assuming that any parisons (Pearson’s r = 0.001 in AA CRC, Fig. 4a; Pearson’s likely germline CNV should also be detectable in the r = 0.000 in TCGA CRC, Fig. 4b). matched tumor samples. We next identified the genomic Taken together, these findings in two independent WES regions from the matched tumor–tumor comparisons datasets strongly indicate that the focal copy-number overlapping the genomic segments falling above the altered segments falling above the noise thresholds are

a AA CRC Datasetb TCGA CRC Dataset Segments Above Noise Thresholds Segments Above Noise Thresholds 6 6 6 r = 0.39 6 r = 0.008 r = 0.36 r = 0.005 Tum LR Tum Tum LR Tum

0 0 0 um- 0 Tum-Tum LR Tum-Tum T

−6 −6 Random Random Tum- Matched Matched−6 LR Tum-Tum −6 −6 0 6 −6 0 6 −6 0 6 −6 0 6 Norm-Norm LR Norm-Norm LR Norm-Norm LR Norm-Norm LR

Segments Below Noise Thresholds Segments Below Noise Thresholds 4 r = 0.034 r = 0.001 4 r = 0.074 r = 0.00 um LR Tum LR Tum T

0 um- 0 0 0 T Random Random Tum- Matched−4 LR Tum-Tum −4 Matched−4 LR Tum-Tum −4 −0.6 0 0.6 −0.6 0 0.6 −1 0 1 −1 0 1 Norm-Norm LR Norm-Norm LR Norm-Norm LR Norm-Norm LR Fig. 4 Ability of ENVE in distinguishing germline copy-number variation from random inherent noise. Segments falling above or below the estimated noise thresholds in normal–normal comparisons are evaluated against segments that overlap with the corresponding genomic regions from tumor–tumor comparisons in AA CRC (a) and TCGA CRC WES (b) datasets. Shown are the Pearson correlation coefficients comparing normal–normal segmental LogRatios with corresponding matched tumor–tumor or random tumor–tumor segmental LogRatios. Note the genomic segments in normal–normal comparisons falling above noise thresholds show higher correlation with segments in corresponding matched tumor–tumor comparisons than with segments in random tumor–tumor comparisons. By contrast, genomic segments in normal–normal comparisons that fall below noise thresholds show no correlation with either matched or random tumor–tumor comparisons Varadan et al. Genome Medicine (2015) 7:69 Page 8 of 15

more likely to be associated with germline CNVs, whereas parameters to evaluate somatic copy-number amplifica- the randomly distributed segments falling below the noise tions. Similarly, the minimum segmental LogRatio values thresholds are indicative of inherent noise in WES data. associated with negative deviations within a chromosome Additionally, we note that the focal genomic segments for each of the K normal–normal comparisons, resulting with high LogRatios above noise thresholds within each in K minima, are used to estimate the GEV parameters to chromosome were repeatedly observed across multiple evaluate somatic copy-number deletions. The R package unique normal sample pair comparisons in both the AA fExtremes (R package version 3010.81) is used to estimate and TCGA CRC datasets, providing further evidence of the above GEV parameters. their being indicative of germline CNVs as opposed to We applied ENVE Module 1d on the AA and TCGA random inherent noise. Importantly, the observed differ- normal–normal comparisons to obtain chromosome- ences in chromosome-specific noise-threshold values in specific GEV parameters, thus effectively capturing and AA versus TCGA CRC datasets (Figs. 1 and 2) further modeling chromosome-specific inherent noise associated highlight ENVE’s ability to model noise in a platform- and with each of these WES datasets. sample-agnostic manner.

Module 1d: Modeling inherent noise using the Generalized Module 2 Extreme Value distribution We next applied ENVE Module 2 (Fig. 1) to call sCNAs This module (Fig. 1) derives generalized extreme value in AA CRC and TCGA CRC WES samples. (GEV) distribution-based models of the inherent noise associated with WES data. First, copy-number altered segments falling below the noise thresholds in the nor- Module 2a-b: Estimation of segmental LogRatios in mal–normal comparisons (Figs. 2 and 3) are selected for tumor–normal comparisons noise modeling. Assuming X1, X2, … Xn to be the seg- ENVE Module 2a-b performs read depth comparison mental LogRatios of selected copy-number altered seg- and circular binary segmentation to identify all of the ments within the normal–normal comparisons, the potentially copy-number altered segments along with Fisher–Tippett theorem [32] states that the distribution their respective GC-corrected segmental LogRatios for of Mn = max{X1, X2, …, Xn} converges to (as n → ∞) the each matched tumor/normal comparison similar to GEV distribution: Module 1a-b. To account for potential aneuploidy/hyper- ploidy [33, 34] in the tumor samples, which could result hi −μ −1=ξ in the segmental LogRatios of copy-neutral regions deviat- ðÞ¼ − þ ξ y ð Þ Gy exp 1 σ 3 ing from zero, the distribution of genome-wide segmental LogRatios in every tumor–normal comparison is adjusted where ξ, μ, and σ are the shape, the location, and scale by subtracting the mode of the distribution from each of parameters, respectively, that fully define the GEV. Be- the segmental LogRatios. Using these modules, we accord- cause the only requirement for the GEV distribution is ingly obtained segmental LogRatios for each of the AA that the segmental LogRatios, Xi, are independent and and TCGA matched tumor/normal comparisons. identically distributed random variables, the tails of whose distributions can have either an exponential or polynomial decay, we modeled the maxima of the Module 2c: GEV-based significance evaluation of tumor– segmental LogRatios using the GEV. Also, because the normal segmental LogRatios variability in segmental LogRatio estimates is likely to be The chromosome-specific GEV parameters for amplifica- chromosome-specific, reflecting variations in gene tions and deletions, as derived in Module 1d above, are density and capture efficiency across regions, separate used in Module 2c to evaluate the probability that an GEV model parameters are inferred for respective chro- observed candidate amplification or deletion within a mosomes. Furthermore, because somatic copy-number chromosome is due to inherent noise in WES data. This deletion events can only fall into two categories (hetero- module employs the pgev function within fExtremes zygous or homozygous deletions), as opposed to the (R package version 3010.81). Segments that achieve a sig- copy-number amplifications, separate GEV parameters nificant probability (P ≤ 0.05) are accordingly classified as for positive and negative deviations are estimated, being amplified or deleted in the respective tumor sample. respectively, using the probability weighted moment Thus, by applying ENVE Modules 2a-c (Fig. 1), we identi- method [32]. Accordingly, per chromosome, the max- fied chromosomal regions showing significant copy- imum segmental LogRatio values associated with positive number alterations (ENVE P ≤ 0.05) in each of the 30 AA deviations within each of the K normal–normal compari- CRC and 77 TCGA CRC samples (Additional file 1: Tables sons, resulting in K maxima, are used to estimate the GEV S2 and S3). Varadan et al. Genome Medicine (2015) 7:69 Page 9 of 15

ENVE implementation FREEC [12], which has been reported to outperform ENVE is implemented as a tool that is freely available published WES-based sCNA detection algorithms in a along with the source codes for academic use at [29]. comprehensive review [19]. We performed sCNA detec- ENVE can accept BAM files and then performs the tion on the two WES datasets using Control-FREEC’s above outlined statistical analyses and outputs somatic recommended parameters (see “Methods”), and subse- copy-number alterations along with their segmental Log- quently compared the ENVE-based and Control-FREEC- Ratios and significance estimates for each tumor sample. based sCNA calls in each of the AA CRC and TCGA The computational resources required to run ENVE using CRC samples, individually, with those detected by the GC-corrected normalized read-counts are lightweight, SNP arrays. We anchored the comparison to only within wherein all of the normal–normal and tumor–normal gene-coding regions because SNP arrays also span sub- analyses for a cohort of 77 tumor/normal samples could stantial non-coding regions that are not interrogated by be performed on a desktop with a single processor and the WES platform. Because SNP arrays are also not a 16GB of memory in under 10 h. gold-standard technique for evaluating sensitivity and specificity, we instead assessed the concordance between Evaluation of ENVE performance in detecting sCNAs in the SNP arrays and the respective ENVE and Control- tumor samples FREEC algorithms. Specifically, as shown in Additional We next proceeded to systematically evaluate the per- file 2: Figure S3, the percent concordance between SNP formance of ENVE by assessing its sensitivity and speci- arrays and ENVE/Control-FREEC was calculated as the ficity on individual tumor samples. Although there exists ratio of the total length of all concordant exonic sCNA no gold-standard technique for use as a comparator in regions called by the WES algorithm to the total length formal evaluation of sensitivity and specificity of sCNA of the exonic SNP array sCNA regions. calls in stromal-admixed clinical tumor samples, we Figure 5a shows the median number of genes associ- nevertheless proceeded to evaluate ENVE’s performance ated with sCNA regions in AA and TCGA CRC WES by comparing against widely used SNP arrays. datasets, as detected by ENVE and Control-FREEC, Accordingly, we performed SNP array-based sCNA de- along with their concordance with SNP array-based tection in both the AA CRC and TCGA CRC datasets. estimates. For regions with copy-number amplifications, For the TCGA CRC dataset, we obtained SNP array- ENVE achieved a higher concordance with SNP arrays based sCNA calls for all of the 77 tumors included in than Control-FREEC both in the AA CRC (97.32 % vs. our WES study from the TCGA portal (see “Methods”). 87.26 %) and TCGA CRC (97.68 % vs. 89.27 %) data- Similarly, for the AA CRC dataset, we obtained SNP sets, despite Control-FREEC calling on average 30 % array sCNA calls in 12 of the 30 AA CRC samples used more amplification events than ENVE in both WES in our WES study (see “Methods”). As an additional key datasets (Fig. 5b). This strongly implies that ENVE has comparator, we evaluated another algorithm, Control- higher sensitivity and specificity in calling copy-number

abAmplifications Deletions

15000 15000 ENVE Amplifications Deletions FREEC Performance 10000 10000 Dataset Control Control Estimates ENVE ENVE FREEC FREEC

AA CRC AA 5000 5000 Median Genes Altered 3563 4896 3490 251 Genes with sCNAs AA CRC % Concordance with 97.32 87.26 90.6 47.68 SNP Array sCNA Calls 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 25000 25000 Median Genes Altered 1950 2657 1478 1114 TCGA % Concordance with 15000 15000 CRC 97.68 89.27 78.22 67.03 SNP Array sCNA Calls 5000 TCGA CRC TCGA 5000 Genes with sCNAs

10 20 30 40 50 60 70 10 20 30 40 50 60 70 Samples Samples Fig. 5 SNP array-based performance estimates of ENVE and Control-FREEC in detecting sCNAs in tumors. a Median number of genes with sCNAs, as detected by ENVE and Control-FREEC in the WES data, along with their median percent concordance with SNP array-based sCNA calls. b Number of genes with sCNAs (Y-axis) in each sample (X-axis), as detected by ENVE (blue) and Control-FREEC (red), in AA and TCGA CRC WES datasets. Significant sCNAs as detected by the Partek Suite were used for comparison in AA CRCs, while for TCGA CRCs SNP array segments with Segment-Mean cutoffs of ±0.5 were used for comparison Varadan et al. Genome Medicine (2015) 7:69 Page 10 of 15

amplifications than Control-FREEC. Similarly, for re- LogRatios ranging between −1 and 0.7 in the WES data gions with copy-number deletions, ENVE achieved a (Fig. 6a), likely suggesting that the higher performance higher concordance rate with SNP arrays (Fig. 5a) than of ENVE compared to Control-FREEC results from Control-FREEC both in the AA CRC (90.6 % vs. 47.68 %) ENVE’s ability to detect sCNAs even at low tumor/nor- and TCGA CRC datasets (78.22 % vs. 67.03 %), with mal read depth ratios in these stromal-admixed tumor Control-FREEC calling on average 16 % more deletion samples. It is important to note that although the genes events than ENVE across samples in the TCGA CRC data- and samples selected for this comparison were chosen set (Fig. 5b). Noting the extremely poor performance of based on ENVE’s output, we had no a priori expectation Control-FREEC in identifying deletion events, especially of the qPCR results, thus allowing for a fair comparison in the AA CRC WES dataset, we proceeded to evaluate with Control-FREEC. whether its performance could be improved by enabling Taken together, although neither qPCR nor SNP arrays Control-FREEC to infer and adjust for potential stromal are gold-standard techniques for a formal evaluation of admixture and tumor content in the two WES datasets. In the sensitivity and specificity of ENVE and Control- this mode, we found Control-FREEC called on aver- FREEC, our comparative analyses based on these com- age 55 % more copy-number altered events than monly used techniques underscore the ability of the ENVE, but did not match ENVE’s sensitivity in de- ENVE methodology to reliably detect sCNAs in variable tecting sCNAs in three of the four comparisons stromal admixture tumor tissues, without having to re- across the AA and TCGA WES datasets (Additional sort to complex and unstable estimations of tumor con- file 2: Figure S4A,B). Moreover, ENVE consistently tent or ploidy. showed better performance than Control-FREEC when tested across different Segment-Mean cutoff values that were used for classifying sCNAs in the TCGA SNP array Characterization of sCNA landscapes in AA CRCs dataset (Additional file 2: Figure S5), or at low overall read Using the ENVE-significant alterations as input (Additional depth simulated scenarios (Additional file 2: Figure S6). file 1: Table S2), we next identified chromosomal Taken together, these results strongly point to the poten- regions showing significant (q-value ≤ 0.25) recurrent tially high sensitivity and specificity of the ENVE frame- focal and arm-level alterations in AA CRCs using the work in detecting sCNAs in WES data. GISTIC tool [27] (see “Methods”; Additional file 1: In addition to the above SNP array-based comparative Tables S4–S6, Fig. 7). While focal sCNAs occurred analysis, we also evaluated ENVE’s performance by using throughout the length of respective chromosomes a second orthogonal platform, qPCR. Accordingly, we (Additional file 1: Tables S4 and S5), GISTIC’sbroad- designed a custom qPCR copy-number array containing level analysis showed significant (q-value ≤ 0.25) a set of 11 genes, each representing a distinct genomic chromosomal arm-level deletions specifically in 1p, 8p, locus that showed recurrent sCNAs (frequency ≥ 30 %) 14q, 15q, 18p, and 18q, and amplifications in 1q, 7p, among the 30 AA CRC cases in the WES dataset, as de- 8q, 13q, 19q, 20p, and 20q in AA CRCs (Additional file 1: tected by ENVE (Additional file 1: Table S2). Using this Table S6). Furthermore, chromosomal regions containing qPCR array, we estimated sCNAs in a subset of AA well-known CRC tumor suppressor genes (TP53, DCC, CRC cases (N = 6), where each of the cancers showed SMAD4, SMAD2) [9, 35, 36] showed recurrent copy- copy-number alteration in at least one of the 11 genes number deletions in ≥25 % of AA CRC cases. Conversely, (Fig. 6a). Of note, these six cases were not represented copy-number amplifications in 13q and 20q loci, regions in the 12 samples used for the SNP array analysis, thus known to harbor candidate oncogenes [37–39], were allowing for an independent evaluation of ENVE’s per- observed in ≥27 % of AA CRC cases. formance. Respective matched normal samples from We next asked if there were any recurrent sCNA these six cancers, along with an additional six AA nor- signatures identified in AA CRCs (Fig. 7) that were sig- mal samples, were used as diploid genome controls in nificantly different from Caucasian CRCs. Accordingly, the qPCR analysis. We again used the WES-based sCNA we identified a set of 30 predominantly late-stage MSS calls from Control-FREEC for these six CRC cases as an Caucasian CRC cases from the TCGA WES cohort additional key comparator in this analysis. Comparison (Additional file 1: Table S1), and evaluated for signifi- of amplifications, deletions, and copy-neutral calls be- cant sCNAs (ENVE P ≤ 0.05; Additional file 1: Table S3) tween qPCR and ENVE showed a significantly higher followed by GISTIC analysis to identify recurrent chromo- overall concordance of 72.72 % (Chi-square P = 0.049, somal arm-level alterations (q ≤ 0.25; Additional file 1: Fig. 6b) compared to the 59 % concordance observed be- Table S6). Assessing for significant chromosomal arm- tween qPCR and Control-FREEC (Additional file 2: level alterations in these two cohorts, however, showed no Figure S7). Notably, 54 % of the qPCR and ENVE marked differences in their frequencies between the AA concordant alterations exhibited low tumor/normal and TCGA CRC WES datasets, and/or between the AA Varadan et al. Genome Medicine (2015) 7:69 Page 11 of 15

a AA 11893 AA 15960 AA 15873 2.0 2.0 2.0 1.0 1.0 1.0

WES 0.0 0.0 0.0 -1.0 -1.0 -1.0 LogRatio (T/N) 5 5 5 4 4 4 3 3 3 2 2 2

qPCR 1 1 1

CN Estimate 0 0 0 AA 13260 AA 15972 AA 11027 2.0 2.0 2.0 1.0 1.0 1.0

WES 0.0 0.0 0.0 -1.0 -1.0 -1.0

LogRatio (T/N) 5 5 5 4 4 4 3 3 3 2 2 2

qPCR 1 1 1

CN Estimate 0 0 0 DAD1 DAD1 DAD1 BMP7 BMP7 BMP7 SMC5 SMC5 SMC5 GNAS GNAS GNAS PGM5 PGM5 PGM5 SALL2 SALL2 SALL2 LECT1 LECT1 LECT1 TONSL TONSL TONSL CSMD2 CSMD2 CSMD2 PARP10 PARP10 PARP10 ZSCAN20 b ZSCAN20 ZSCAN20 1 1 11 111 1 111 AA 11893 100.00 %Concordance AA 15960 1 011 1 0 0 1 1 1 0 63.64 AA 15873 1 1 0 1 1 1 1 0 0 1 1 72.73 (ENVE AA 13260 1 1 1 1 1 1 1 0 1 0 0 72.73 vs. AA 15972 0 0 0 1 1 0 0 1 0 1 1 45.46 qPCR) AA 11027 1 1 1 1 1 1 0 1 1 1 0 81.82 72.72 (chr1) (chr8) (chr1) (chr8) (chr9) (chr9) (chr14) (chr13) (chr14) (chr20) DAD1 (chr20) BMP7 SMC5 GNAS PGM5 SALL2 LECT1 TONSL CSMD2 PARP10 ZSCAN20 Fig. 6 Concordance analysis of ENVE-based and qPCR-based sCNA estimates. a Matched tumor/normal (T/N) LogRatios derived from WES data (above) and corresponding qPCR-based copy-number (CN) estimates (below) for 11 genes in each of the six AA CRC cases. ENVE-significant LogRatios are indicated by solid black circles, with LogRatios above 0 indicating amplifications and LogRatios below 0 indicating deletions. Significant qPCR-based CN alterations are indicated by solid black bars, with values above 2 representing amplifications and values below 2 representing deletions. b Matrix showing concordance (1) and discordance (0) between ENVE and qPCR CN estimates for the 11 genes in each of the six AA CRC cases. For each case, copy-neutral genes by both ENVE and qPCR analyses, as well as genes showing significant CN alterations in the same direction by both platforms, were deemed concordant (from a). The resulting percent concordance between ENVE-based and qPCR-based CN estimates in each sample is shown on the right, with an overall concordance rate of 72.72 %

CRC WES and the TCGA CRC SNP array datasets (Add- Discussion itional file 1: Table S6). We have developed a robust and unbiased method for While the majority of the recurrent sCNAs observed detecting somatic copy-number alterations using WES in AA CRCs (Fig. 7) were consistent with those previ- data. Performance evaluation of ENVE in two independ- ously reported for colon cancers [9], there is a likelihood ent WES tumor tissue datasets showed a high concord- that ethnicity-associated differences exist in both the ance between ENVE and SNP array and qPCR-based location and frequency of focal sCNAs in CRCs. Al- sCNA estimates (Figs. 5 and 6). In addition, we found though we identified a significant number of focal copy- ENVE significantly and consistently outperformed the number alterations in AA CRCs (Additional file 1: Table best-in-class published WES-based sCNA detection S5), a larger platform-matched and algorithm-matched algorithm [19], Control-FREEC [12] (Figs. 5 and 6, sCNA analysis is necessary to systematically characterize Additional file 2: Figures S4–S7). More importantly, our ethnicity-specific differences in focal sCNAs in CRCs. performance evaluations strongly indicate that ENVE Nonetheless, our prior study detailing the gene- has high sensitivity and specificity in detecting sCNAs mutational landscapes in AA CRCs [20], together with from WES data derived from stromal-admixed tumor our current comprehensive characterization of sCNA samples. In particular, our comparative analyses reveal landscapes in AA CRCs, uncovers recurrent genetic the effectiveness of ENVE in detecting genuine sCNAs aberrations that are potentially associated with CRC even at low tumor/normal segmental LogRatios (−1to development in the AA population. 0.7) (Fig. 6a), strongly underscoring the drawbacks with Varadan et al. Genome Medicine (2015) 7:69 Page 12 of 15

1.00

0.75

s

n

o

i

t

a

c i 0.50

f

i

l

p

m

A 0.25

y

c

n

e

u 0.00

q

e

r

F -0.25

s

n

o

i

t -0.50

e

l

e

D -0.75

-1.00 Fig. 7 Recurrent somatic copy-number alterations in AA CRCs. GISTIC plot showing the frequencies (Y-axis) of significant recurrent amplifications (red) and deletions (blue), as identified by ENVE, in the 30 AA CRC cases. Recurrent arm-level alterations are observed within autosomal regions, including deletions in 1p, 8p, 15q, 18p, and 18q, and amplifications in 7p, 8q, 13q, and 20q using pre-defined LogRatio value cutoffs to identify noise assessment. We estimated chromosome-specific sCNAs in tumors. In fact, examination of the relation- noise thresholds using normal–normal comparisons ship between the LogRatios of individual segments and derived from random groups of 16–54 samples in incre- ENVE-based P-values (Additional file 2: Figure S8) ments of 2, repeated ten times. We found the ENVE shows that no single segmental LogRatio-cutoff value estimates of noise thresholds across chromosomes to would have captured all recurrent copy-number amplifi- be nearly all stable with respect to the number of cations or deletions in either the AA or TCGA CRC diploid samples used for the estimation (Additional WES datasets. Besides, the commonly observed variabil- file 2: Figure S9). Although the chromosome-specific ity in cancer cell content among clinical specimens noise thresholds are not sensitive to the number of diploid would preclude the use of a single LogRatio-cutoff value samples being used, reliable estimation of the parameters for determining recurrent sCNAs. Although some pub- of the GEV distribution requires 100–150 extreme values lished copy-number algorithms have attempted to over- [43], corresponding to a lower acceptable limit of 15–20 come the challenge of defining LogRatio-cutoffs by normal diploid samples. Therefore, we suggest that using inferring the tumor content and ploidy of each sample 15–20 normal samples is sufficient to model the inherent to estimate the absolute tumor copy-number [16, 18, 40, 41], noise in WES data, and as such is computationally effi- these estimates are often unstable, with these algo- cient. We therefore anticipate that ENVE’skeyfeature rithms differing in their underlying assumptions, which involving modeling of inherent noise in WES data will may not always correspond to the complex chromo- enable its broad application across studies, where somal architecture in tumors [42]. Our approach, in population-matched and platform-matched normal dip- contrast, does not infer tumor content or ploidy, but loid DNA samples are frequently available. provides a probabilistic estimate of the presence of Foreseeing a likely practical situation where a normal sCNAsintumorsgiventheinherentnoiseinWES sample matching the tumor may not be available from measurements as estimated from non-malignant nor- the patient, we further evaluated the performance of mal diploid samples, and therefore offers a simpler and ENVE in a simulated circumstance where each of the robust alternative. tumor samples was compared to a pooled set of normal Because one of the key characteristics of ENVE is the samples derived from the WES data. This analysis was use of normal diploid samples for capturing inherent performed using computationally derived pooled normal noise associated with WES data, we used DNA samples samples for both the AA CRC WES and SNP array data- derived from 54 immortalized lymphoblastoid cell lines sets (see “Methods”). Next, we compared the ENVE established from patients’ peripheral blood lymphocytes sCNA calls with SNP array-based sCNA calls in the to determine whether noise threshold estimates are sen- same 12 AA CRC cases from above (Fig. 5). Notably, in sitive to the number of normal diploid samples used for this pooled analysis (Additional file 2: Figure S10), ENVE Varadan et al. Genome Medicine (2015) 7:69 Page 13 of 15

exhibited high concordance rates with the SNP array In particular, ENVE reliably detects sCNAs in stromal- calls for both amplifications (97.72 %) and deletions admixed tumor samples and is therefore expected to be (92.86 %), as compared to the matched normal analysis, broadly applicable across cancer-profiling studies. We further suggesting that ENVE remains a viable method- believe this user-friendly methodology should be portable ology for reliably detecting sCNAs in tumors even in the to any massively parallel DNA sequencing platform. absence of a matched normal sample. We note that one of the limitations of published algo- Additional files rithms [12–18] is their exclusive applicability to deep- sequencing data derived from fresh-frozen material but Additional file 1: Table S1. Cohorts, ethnicity, and tumor stage of not archived formalin-fixed paraffin-embedded (FFPE) samples used for WES, SNP array, and qPCR. Table S2. Regions with significant copy-number alterations, as detected by ENVE, in the 30 AA biospecimens. While sequencing of archived FFPE DNA CRC WES cases. Table S3. Regions with significant copy-number alterations, allows for de novo characterization of gene mutations, as detected by ENVE, in the 30 predominantly late-stage Caucasian TCGA as shown by us and others [11, 44], estimation of copy- CRC WES cases. Table S4. Recurrent somatic copy-number altered regions in the 30 AA CRC WES cases estimated by GISTIC. Table S5. Recurrent focal number alterations using WES of FFPE specimens copy-number amplifications and deletions in the 30 AA CRC WES cases remains challenging owing to poor DNA quality in estimated by GISTIC. Table S6. Chromosomal arm-level sCNA frequencies in archived FFPE tumor samples. This, in turn, may result AA and TCGA CRCs estimated by GISTIC. (XLSX 535 kb) in enhanced inherent noise, which may also be prevalent Additional file 2: Figure S1. Fraction of chromosomal coverage across segmental LogRatio thresholds in AA normal–normal comparisons. in FFPE-derived normal diploid DNA samples. We have Figure S2. Entropy of chromosomal coverage across segmental not assessed the performance of ENVE in archival FFPE LogRatio thresholds in AA normal–normal comparisons. Figure S3. samples, but we anticipate that ENVE’s noise-modeling Concordance assessment of ENVE/Control-FREEC sCNA segments with SNP array. Figure S4. Performance evaluation of Control-FREEC with feature may reliably capture the degree of inherent noise contamination-correction against ENVE. Figure S5.Performanceevaluation in FFPE samples, thus potentially enabling use of the of ENVE against Control-FREEC across SNP array Segment-Mean cutoffs in extensive clinically annotated tumor samples held in the TCGA dataset. Figure S6. Impact of sequencing read depth on ENVE versus Control-FREEC performance in the TCGA WES dataset. Figure S7. pathology archives. Concordance analysis of Control-FREEC-based and qPCR-based sCNA One potential limitation of ENVE is that, while it estimates. Figure S8. Relationship between the tumor/normal Segmental P- models sources of inherent noise in WES data, it does LogRatios and ENVE value. Figure S9. Effect of the number of normal samples on ENVE noise-threshold estimates. Figure S10. Performance not explicitly model the likely occurrences of genomic evaluation of ENVE in matched- versus pooled normal analysis scenarios. complexities, such as aneuploidy and hyperploidy, in the (PDF 2411 kb) tumors. This may possibly influence the true positive/ negative sCNA detection rates of the current ENVE Abbreviations AA: African American; BAM: binary sequence alignment/map; CNV: copy- framework. However, estimating allele frequencies in number variation; CRC: colorectal cancer; DCC: DCC netrin 1 receptor; addition to LogRatios from WES data is a conceivable ENVE: Extreme value distribution based somatic copy-Number Variation extension to the current ENVE framework, and may ad- Estimation; FFPE: formalin-fixed paraffin-embedded; GEV: generalized extreme value distribution; MSS: microsatellite stable; qPCR: quantitative dress the influence of such aberrations. Another likely real-time polymerase chain reaction; sCNA: somatic copy-number limitation of ENVE is the requirement of at least 15–20 alteration; SMAD2: SMAD family member 2; SMAD4: SMAD family member platform-matched normal samples in order to capture 4; SNP: single nucleotide polymorphism; TCGA: The Cancer Genome Atlas; and model the inherent noise in WES data. However, TP53: tumor protein p53; WES: whole-exome sequencing. because most cancer-profiling studies are designed to Competing interests include the collection of platform-matched normal sam- The authors declare that they have no competing interests. ples (matched/unmatched with the tumors), this limita- Authors’ contributions tion is likely not burdensome. More importantly, we VV and KG developed the ENVE methodology. VV and KG designed the note that the ENVE’s unique noise-modeling feature, not study and supervised all analyses. VV and SS implemented and optimized included in any of the other published sCNA detection the ENVE workflow. SDM and JW provided the patient samples. JW performed the pathology review. JL processed tissue samples including DNA algorithms, provides detailed and otherwise unavailable extraction and analysis of microsatellite instability in tumor samples. AN comprehension of the inherent noise in any given WES performed read depth comparison and segmentation of the whole-exome dataset to the user (Figs. 1b,c and 2b,c), thus allowing data. JBS performed statistical analyses and analyzed TCGA CRC WES data. KG performed SNP array analysis. KG and LR performed qPCR-based copy- for reliable interrogation of sCNAs in the tumor samples number estimation. VV and KG wrote the manuscript with contributions from in a platform-agnostic manner. all other authors. All authors read and approved the final manuscript.

Acknowledgements Conclusions We thank Dr. Mark Adams at the J. Craig Venter Institute for his critical We present ENVE as a robust method for detecting review of the manuscript and helpful suggestions. The results published here sCNAs in WES-based studies using either matched or un- are in part based upon data generated by The Cancer Genome Atlas (TCGA) project established by the NCI and NHGRI. Information about TCGA and associated matched tumor/normal samples, without the need for investigators and institutions can be found at http://cancergenome.nih.gov/. complex parameter choices or extensive user intervention. This research was supported by PHS awards: K08 CA148980 (KG); Career Varadan et al. Genome Medicine (2015) 7:69 Page 14 of 15

Development Program of Case GI SPORE (P50 CA150964) awards (KG 15. Ha G, Roth A, Lai D, Bashashati A, Ding J, Goya R, et al. Integrative analysis and VV); Case GI SPORE P50 CA150964 (SDM); R21 CA149349 (JEW); P30 of genome-wide loss of heterozygosity and monoallelic expression at CA043703 and U54 CA163060 (KG, VV, SDM, JBS, JEW); Breast Cancer nucleotide resolution reveals disrupted pathways in triple-negative breast Research Foundation (VV); and the Rosalie and Morton Cohen Family cancer. Genome Res. 2012;22:1995–2007. doi:10.1101/gr.137570.112. Memorial Genomics Fund of University Hospitals (VV). 16. Gusnanto A, Wood HM, Pawitan Y, Rabbitts P, Berri S. Correcting for cancer genome size and tumour cell content enables better estimation of copy Author details number alterations from next-generation sequence data. Bioinformatics. 1Division of General Medical Sciences-Oncology, Case Western Reserve 2012;28:40–7. doi:10.1093/bioinformatics/btr593. University, Cleveland, OH 44106, USA. 2Case Comprehensive Cancer Center, 17. Krishnan NM, Gaur P, Chaudhary R, Rao AA, Panda B. COPS: a sensitive and Case Western Reserve University, Cleveland, OH 44106, USA. 3Division of accurate tool for detecting somatic copy number alterations using short- Hematology and Oncology, Case Western Reserve University, Cleveland, OH read sequence data from paired samples. PLoS One. 2012;7, e47812. 44106, USA. 4Department of Medicine, Case Western Reserve University, doi:10.1371/journal.pone.0047812. Cleveland, OH 44106, USA. 5Case Medical Center, Case Western Reserve 18. Bao L, Pu M, Messer K. AbsCN-seq: a statistical method to estimate tumor University, Cleveland, OH 44106, USA. 6Department of Pathology, Case purity, ploidy and absolute copy numbers from next-generation sequencing Western Reserve University, Cleveland, OH 44106, USA. 7Case Western data. Bioinformatics. 2014. doi:10.1093/bioinformatics/btt759. Reserve University, 2103 Cornell Road, Wolstein Research Building, Cleveland, 19. Alkodsi A, Louhimo R, Hautaniemi S. Comparative analysis of methods for OH 44106, USA. identifying somatic copy number alterations from deep sequencing data. Briefings Bioinformatics. 2014;30:1056–63. doi:10.1093/bib/bbu004. Received: 26 February 2015 Accepted: 30 June 2015 20. Guda K, Veigl ML, Varadan V, Nosrati A, Ravi L, Lutterbaugh J, et al. Novel recurrently mutated genes in African American colon cancers. Proc Nat Acad Sci U S A. 2015;112:1149–54. doi:10.1073/pnas.1417064112. 21. Umar A, Risinger JI, Hawk ET, Barrett JC. Testing guidelines for hereditary References non-polyposis colorectal cancer. Nat Rev Cancer. 2004;4:153–8. 1. Al-Kuraya K, Schraml P, Torhorst J, Tapia C, Zaharieva B, Novotny H, et al. doi:10.1038/nrc1278. Prognostic relevance of gene amplifications and coamplifications in breast 22. Li H. Burrows-Wheeler Aligner. http://sourceforge.net/projects/bio-bwa/. cancer. Cancer Res. 2004;64:8534–40. doi:10.1158/0008-5472.CAN-04-1945. Accessed 21 Jan 2014. 2. Zack TI, Schumacher SE, Carter SL, Cherniack AD, Saksena G, Tabak B, et al. 23. Li H, Durbin R. Fast and accurate short read alignment with Burrows- Pan-cancer patterns of somatic copy number alteration. Nat Genet. Wheeler transform. Bioinformatics. 2009;25:1754–60. doi:10.1093/ 2013;45:1134–40. doi:10.1038/ng.2760. bioinformatics/btp324. 3. Ocak S, Yamashita H, Udyavar AR, Miller AN, Gonzalez AL, Zou Y, et al. 24. Broad Institute: Picard Tools. http://broadinstitute.github.io/picard/. DNA copy number aberrations in small-cell lung cancer reveal activation Accessed 21 Jan 2014. of the focal adhesion pathway. Oncogene. 2010;29:6331–42. doi:10.1038/ 25. Cancer Genomics Hub. https://cghub.ucsc.edu. Accessed 09 Jan 2015. onc.2010.362. 26. The Cancer Genome Atlas. https://tcga-data.nci.nih.gov/tcga/. Accessed 09 4. Hieronymus H, Schultz N, Gopalan A, Carver BS, Chang MT, Xiao Y, et al. Jan 2015. Copy number alteration burden predicts prostate cancer relapse. Proc Natl 27. Mermel CH, Schumacher SE, Hill B, Meyerson ML, Beroukhim R, Getz G. Acad Sci U S A. 2014;111:11139–44. doi:10.1073/pnas.1411446111. GISTIC2.0 facilitates sensitive and confident localization of the targets of 5. Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, et al. High focal somatic copy-number alteration in human cancers. Genome Biol. resolution analysis of DNA copy number variation using comparative 2011;12:R41. doi:10.1186/gb-2011-12-4-r41. genomic hybridization to microarrays. Nat Genet. 1998;20:207–11. 28. SABiosciences: qBiomarker Data Analysis version 1.2. http:// doi:10.1038/2524. pcrdataanalysis.sabiosciences.com/cnv/CNVanalysis.php. Accessed 08 Apr 2014. 6. Lucito R, Healy J, Alexander J, Reiner A, Esposito D, Chi M, et al. 29. ENVE Tool. https://github.com/ENVE-Tools/ENVE. Representational oligonucleotide microarray analysis: a high-resolution 30. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary method to detect genome copy number variation. Genome Res. segmentation for the analysis of array-based DNA copy number data. 2003;13:2291–305. doi:10.1101/gr.1349003. Biostatistics. 2004;5:557–72. doi:10.1093/biostatistics/kxh008. 7. Bignell GR, Huang J, Greshock J, Watt S, Butler A, West S, et al. High-resolution 31. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and analysis of DNA copy number using oligonucleotide microarrays. Genome Res. validation of cluster analysis. J Comput Appl Math. 1986;20:53–65. 2004;14:287–95. doi:10.1101/gr.2012304. doi:10.1016/0377-0427(87)90125-7. 8. Stransky N, Egloff AM, Tward AD, Kostic AD, Cibulskis K, Sivachenko A, et al. 32. Coles S. An introduction to statistical modeling of extreme values. Springer The mutational landscape of head and neck squamous cell carcinoma. series in statistics. London. New York: Springer; 2001. Science. 2011;333:1157–60. doi:10.1126/science.1208130. 33. Li A, Liu Z, Lezon-Geyda K, Sarkar S, Lannin D, Schulz V, et al. GPHMM: an 9. Cancer Genome Atlas Network. Comprehensive molecular characterization integrated hidden Markov model for identification of copy number of human colon and rectal cancer. Nature. 2012;487:330–7. doi:10.1038/ alteration and loss of heterozygosity in complex tumor samples using nature11252. whole genome SNP arrays. Nucleic Acids Res. 2011;39:4928–41. doi:10.1093/ 10. Kamalakaran S, Varadan V, Janevski A, Banerjee N, Tuck D, McCombie WR, nar/gkr014. et al. Translating next generation sequencing to practice: opportunities and 34. Yu Z, Liu Y, Shen Y, Wang M, Li A. CLImAT: accurate detection of copy necessary steps. Mol Oncol. 2013;7:743–55. doi:10.1016/j.molonc.2013.04.008. number alteration and loss of heterozygosity in impure and aneuploid 11. Van Allen EM, Wagle N, Stojanov P, Perrin DL, Cibulskis K, Marlow S, et al. tumor samples using whole-genome sequencing data. Bioinformatics. Whole-exome sequencing and clinical interpretation of formalin-fixed, 2014;30:2576–83. doi:10.1093/bioinformatics/btu346. paraffin-embedded tumor samples to guide precision cancer medicine. Nat 35. Wood LD, Parsons DW, Jones S, Lin J, Sjoblom T, Leary RJ, et al. The Med. 2014;20:682–8. doi:10.1038/nm.3559. genomic landscapes of human breast and colorectal cancers. Science. 12. Boeva V, Popova T, Bleakley K, Chiche P, Cappo J, Schleiermacher G, et al. 2007;318:1108–13. doi:10.1126/science.1145720. Control-FREEC: a tool for assessing copy number and allelic content using 36. Sjoblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, et al. The next-generation sequencing data. Bioinformatics. 2012;28:423–5. consensus coding sequences of human breast and colorectal cancers. doi:10.1093/bioinformatics/btr670. Science. 2006;314:268–74. doi:10.1126/science.1133427. 13. Chiang DY, Getz G, Jaffe DB, O’Kelly MJ, Zhao X, Carter SL, et al. High-resolution 37. Day E, Poulogiannis G, McCaughan F, Mulholland S, Arends MJ, Ibrahim AE, mapping of copy-number alterations with massively parallel sequencing. Nat et al. IRS2 is a candidate driver oncogene on 13q34 in colorectal cancer. Int Methods. 2009;6:99–103. J Exp Pathol. 2013;94:203–11. doi:10.1111/iep.12021. 14. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al. VarScan 38. Tabach Y, Kogan-Sakin I, Buganim Y, Solomon H, Goldfinger N, Hovland R, 2: somatic mutation and copy number alteration discovery in cancer by et al. Amplification of the 20q chromosomal arm occurs early in exome sequencing. Genome Res. 2012;22:568–76. doi:10.1101/ tumorigenic transformation and may initiate cancer. PLoS One. gr.129684.111. 2011;6:e14632. doi:10.1371/journal.pone.0014632. Varadan et al. Genome Medicine (2015) 7:69 Page 15 of 15

39. Brim H, Lee E, Abu-Asab MS, Chaouchi M, Razjouyan H, Namin H, et al. Genomic aberrations in an African American colorectal cancer cohort reveals a MSI-specific profile and chromosome X amplification in male patients. PLoS One. 2012;7:e40392. doi:10.1371/journal.pone.0040392. 40. Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012;30:413–21. doi:10.1038/nbt.2203. 41. Van Loo P, Nordgard SH, Lingjaerde OC, Russnes HG, Rye IH, Sun W, et al. Allele-specific copy number analysis of tumors. Proc Natl Acad Sci U S A. 2010;107:16910–5. doi:10.1073/pnas.1009843107. 42. Lonnstedt IM, Caramia F, Li J, Fumagalli D, Salgado R, Rowan A, et al. Deciphering clonality in aneuploid breast tumors using SNP array and sequencing data. Genome Biol. 2014;15:470. doi:10.1186/s13059-014-0470-7. 43. Hosking JRM, Wallis JR, Wood EF. Estimation of the generalized extreme-value distribution by the method of probability-weighted moments. Technometrics. 1985;27:251–61. doi:10.2307/1269706. 44. Adams MD, Veigl ML, Wang Z, Molyneux N, Sun S, Guda K, et al. Global mutational profiling of formalin-fixed human colon cancers from a pathology archive. Modern Pathol. 2012;25:1599–608. doi:10.1038/ modpathol.2012.121.

Submit your next manuscript to BioMed Central and take full advantage of:

• Convenient online submission • Thorough • No space constraints or color figure charges • Immediate publication on acceptance • Inclusion in PubMed, CAS, and Google Scholar • Research which is freely available for redistribution

Submit your manuscript at www.biomedcentral.com/submit