Copy Number Variation Detection from 1000 Genomes Project Exon
Total Page:16
File Type:pdf, Size:1020Kb
Wu et al. BMC Bioinformatics 2012, 13:305 http://www.biomedcentral.com/1471-2105/13/305 METHODOLOGY ARTICLE Open Access Copy Number Variation detection from 1000 Genomes project exon capture sequencing data Jiantao Wu1†, Krzysztof R Grzeda1†, Chip Stewart1, Fabian Grubert2, Alexander E Urban2, Michael P Snyder2 and Gabor T Marth1* Abstract Background: DNA capture technologies combined with high-throughput sequencing now enable cost-effective, deep-coverage, targeted sequencing of complete exomes. This is well suited for SNP discovery and genotyping. However there has been little attention devoted to Copy Number Variation (CNV) detection from exome capture datasets despite the potentially high impact of CNVs in exonic regions on protein function. Results: As members of the 1000 Genomes Project analysis effort, we investigated 697 samples in which 931 genes were targeted and sampled with 454 or Illumina paired-end sequencing. We developed a rigorous Bayesian method to detect CNVs in the genes, based on read depth within target regions. Despite substantial variability in read coverage across samples and targeted exons, we were able to identify 107 heterozygous deletions in the dataset. The experimentally determined false discovery rate (FDR) of the cleanest dataset from the Wellcome Trust Sanger Institute is 12.5%. We were able to substantially improve the FDR in a subset of gene deletion candidates that were adjacent to another gene deletion call (17 calls). The estimated sensitivity of our call-set was 45%. Conclusions: This study demonstrates that exonic sequencing datasets, collected both in population based and medical sequencing projects, will be a useful substrate for detecting genic CNV events, particularly deletions. Based on the number of events we found and the sensitivity of the methods in the present dataset, we estimate on average 16 genic heterozygous deletions per individual genome. Our power analysis informs ongoing and future projects about sequencing depth and uniformity of read coverage required for efficient detection. Background in whole-genome shotgun datasets [12-14]. However, Copy Number Variations (CNVs) i.e. deletions and despite decreasing costs, deep-coverage (≥ 25×) whole- amplifications, are an essential part of normal human genome data is still prohibitively expensive for routine variability [1]. Specific CNV events have also been linked sequencing of hundreds of samples, and in low-coverage to various human diseases [2], including cancer [3,4] (2-6× base coverage) datasets detection sensitivity and autism [5,6] and schizophrenia [7]. Historically, large resolution is limited to long genomic events [1]. CNV events can be observed using FISH [8] but system- Targeted DNA capture technologies combined with atic, genome-wide discovery of CNVs started with high-throughput sequencing now provide a reasonable microarray-based methods [9-11] which can detect balance between coverage and sequencing cost in a sub- events down to 1 kb resolution. As with all hybridization stantial portion of the genome, and full-exome sequen- based approaches, these methods are blind in repetitive cing projects are presently collecting ≥ 25× average and low complexity regions of the genome where probes sequence coverage in thousands of samples. CNV events cannot be designed. High throughput sequencing with in exonic regions are important because the deletions of next-generation technologies have enabled CNV detec- one or both copies, or amplifications affecting exons, are tion at higher resolution (i.e. down to smaller event size), likely to incur phenotypic consequences. Current algorithms for detecting CNVs in whole-genome * Correspondence: [email protected] shotgun sequencing data use one of four signals as evidence †Equal contributors for an event: (1) aberrantly mapped mate-pair reads (RP or 1 Boston College, Boston, Chestnut Hill, MA, USA read pair methods); (2) split-read mapping positions (SR); Full list of author information is available at the end of the article © 2012 Wu et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Wu et al. BMC Bioinformatics 2012, 13:305 Page 2 of 19 http://www.biomedcentral.com/1471-2105/13/305 (3) de novo assembly (AS); and (4) a significant drop or in- based on RD measurement, and detects samples with non- crease of mapped read depth (RD methods). Unfortunately, normal copy number in the capture target regions. As par- these methods are not generally applicable for CNV detec- ticipants of the 1000 Genomes Project, we took part in the tion in capture sequence data without substantial modifica- data analysis of the “Exon Sequencing Pilot” dataset [16], tions. SR, RP, and AS based methods are sensitive only to where 12,475 exons from over 900 genes were targeted CNVs in which mapped reads or fragments span the event and sequenced with a variety of DNA capture and sequen- breakpoint (s). In the case of exon capturing data, this cing technologies. restricts detection to CNV events where at least one break- point falls in a targeted exon. RD based methods suffer Results and discussion from large fluctuations of sequence coverage stemming Brief algorithmic overview from variability in probe-specific hybridization affinities Our algorithm is an extended version of RD-based CNV across different capture targets (in this case: exons) and sets detection that aims to mitigate the vast target-to-target of such targets (in our case: genes), and from the over- (and consequently gene-to-gene) heterogeneity of read dispersion of the read coverage distribution in the same tar- coverage by normalization procedures roughly correspond- get across different samples. Presumably because of the ing to those employed in CNV detection methods from technical challenges, and despite the importance of deletion microarray hybridization intensity data. The overall work- or amplification events within exons, there are currently no flow of our method is shown in Figure 1 and described in reported CNV detection algorithms for targeted DNA cap- greater detail in the Methods section. For a given gene in a ture based exon-sequencing data (with the exception of given sample (we will use the abbreviation GSS: Gene- methods for tumor-normal datasets [15] where the read Sample Site throughout the paper), we define the read depth measured in the normal sample can be used for depth as the number of uniquely mapped reads whose 5’ normalization – signal not available in the case of popula- end falls within any of the targeted exons within that gene. tion sequencing). We compare this measurement with an expected read In this study, we set out to develop a CNV detection al- depth (Eq. 2, Methods), based on a “gene affinity” calcu- gorithm for capture sequencing data. This algorithm is lated from measured read depth for that gene across all A B Number of genes Observed read depth Observed read depth Median read depth C Observed read depth Expected read depth Figure 1 Workflow of the CNV detection method. A. Median Read Depth (MRD) is calculated for each sample, as a measure of sample coverage (NA18523 shown). B. The gene affinity is estimated for each gene as the slope of the least-square-error linear fit between MRD and RD for that gene (TRIM33 shown). C. Example of observed (magenta) and expected (green) read depth for three samples and four genes. The observed read depths were roughly half of the expected values for genes TRIM33 and NRAS, in sample NA18523, and detected as deletions. Wu et al. BMC Bioinformatics 2012, 13:305 Page 3 of 19 http://www.biomedcentral.com/1471-2105/13/305 samples (to account for across-target read coverage vari- simultaneously. According to Principal Component ance due to target-specific hybridization), and the overall Analysis of the observed read depths, (Figure 2A, see read depth for the sample (to account for the variance Methods), target and genes affinities are inconsistent of read coverage due to the overall sequence quantity across data from different centers, and therefore we collected for the sample under examination). We then analyzed each dataset separately. We only considered use a Bayesian scheme to determine whether the mea- datasets with at least 100 samples (SC, BI, BCM) so we sured coverage is consistent with normal copy number can obtain sufficient sample statistics across genes. (e.g. CN = 2 for autosomes), or aberrant copy number After filtering out genes and samples that didn’t meet (i.e. homozygous deletion: CN = 0, heterozygous deletion: our minimum read depth requirements (see Methods), CN = 1, or amplification: CN > 2). We have included two we were left with the following datasets: SC (862 genes algorithmic variants: One is suitable for CNV events that in 106 individuals sequenced with Illumina), BI (739 occur at a low allele frequency (i.e. in a small fraction of genes in 110 samples sequenced with Illumina), and the samples), and the other for capturing higher-frequency BCM (439 genes in 349 samples sequenced with 454) deletion events (see Methods). (Table 1). The number of genes that passed our filters was substantially lower in the BCM dataset both due to Dataset lower overall 454 coverage (see below), and because the In this study we analyzed the exon capture sequencing longer 454 reads result in lower RD (fewer reads) when dataset collected by the 1000 Genomes Project Exon Se- compared to shorter Illumina reads, even at equivalent quencing Pilot, including 931 genes processed with Agi- base coverage. lent liquid-phase and Nimblegen solid-phase capture methods, and sequenced from 697 individuals with Sample coverage and gene affinities Illumina paired-end and/or 454 technologies. The sam- As a metric of coverage for each sample, we calculated ples in the dataset have been sequenced by four different the sample-specific median gene RD, referred to as “Me- data collection centers (Washington University, WU; dian Read Depth” (MRD); see Figure 1A and Methods.