Lupski et al. Genome Medicine 2013, 5:57 http://genomemedicine.com/content/5/6/57

RESEARCH Open Access Exome sequencing resolves apparent incidental findings and reveals further complexity of SH3TC2 variant alleles causing Charcot-Marie-Tooth neuropathy James R Lupski1,2*, Claudia Gonzaga-Jauregui1, Yaping Yang3, Matthew N Bainbridge2, Shalini Jhangiani2, Christian J Buhay2, Christie L Kovar2, Min Wang2, Alicia C Hawes2, Jeffrey G Reid2, Christine Eng3, Donna M Muzny2 and Richard A Gibbs1,2*

Abstract Background: The debate regarding the relative merits of whole genome sequencing (WGS) versus exome sequencing (ES) centers around comparative cost, average depth of coverage for each interrogated base, and their relative efficiency in the identification of medically actionable variants from the myriad of variants identified by each approach. Nevertheless, few genomes have been subjected to both WGS and ES, using multiple next generation sequencing platforms. In addition, no personal genome has been so extensively analyzed using DNA derived from peripheral blood as opposed to DNA from transformed cell lines that may either accumulate mutations during propagation or clonally expand mosaic variants during cell transformation and propagation. Methods: We investigated a genome that was studied previously by SOLiD chemistry using both ES and WGS, and now perform six independent ES assays (Illumina GAII (x2), Illumina HiSeq (x2), Life Technologies’ Personal Genome Machine (PGM) and Proton), and one additional WGS (Illumina HiSeq). Results: We compared the variants identified by the different methods and provide insights into the differences among variants identified between ES runs in the same technology platform and among different sequencing technologies. We resolved the true genotypes of medically actionable variants identified in the proband through orthogonal experimental approaches. Furthermore, ES identified an additional SH3TC2 variant (p.M1?) that likely contributes to the phenotype in the proband. Conclusions: ES identified additional medically actionable variant calls and helped resolve ambiguous single nucleotide variants (SNV) documenting the power of increased depth of coverage of the captured targeted regions. Comparative analyses of WGS and ES reveal that pseudogenes and segmental duplications may explain some instances of apparent disease mutations in unaffected individuals. Keywords: Exome sequencing, Whole-genome sequencing, Incidental findings, SH3TC2, Personal genomes, Precision medicine

* Correspondence: [email protected]; [email protected] 1Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA Full list of author information is available at the end of the article

© 2013 Lupski et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Lupski et al. Genome Medicine 2013, 5:57 Page 2 of 17 http://genomemedicine.com/content/5/6/57

Background databases and the lack of clinical indications for any of Exome sequencing (ES) is an approach to these five conditions, led to the conclusion that these analysis and genetic variant detection that focuses on the alleles were most probably not pathogenic in the proband. coding exons of and closely linked functional ele- In the period since the initial CMT study, technologies ments. ES is less expensive than comparable approaches for both WGS and ES have advanced considerably. We based on whole genome sequencing (WGS), because developed a series of improved ES methods and reagents there is overall less raw DNA sequence data required. that reliably assay >95% of the coding regions of the Furthermore, coding regions contain the changes that are human genome [5,6]. The current ES assay design currently the most easily interpretable, and knowledge involves routine ‘oversampling’ of the targeted regions, gained from ES may therefore have more immediate med- and ensures that a minimum of 95% (average 97%) of ical utility and application. There is a generally held belief, these positions are tested with >20-fold NGS coverage. however, that WGS may be more informative than ES, The new methods also take advantage of longer indivi- even within the boundaries of exome regions, as the ran- dual DNA sequence reads generated on the Illumina dom distribution of individual sequence reads offers an HiSeq platform (2 × 100 bp) versus the previous SOLiD overall higher likelihood of effective testing of individual platform (2 × 50 bp) used for WGS. sites in the genome. Further, biases inherent to the cap- ture process may result in missing or poorly covered Methods bases in a subset of the exome regions [1]. Sample inclusion We previously applied WGS using massively parallel This study conforms to the Helsinki Declaration regarding next-generation sequencing (NGS) methods to a subject ethical principles for medical research involving human with molecularly undefined, autosomal recessive, Char- subjects. Subject BAB195 and additional family members cot-Marie-Tooth (CMT) neuropathy and identified were recruited and consented for genomic and DNA stu- apparently causative compound heterozygous SH3TC2 dies under a research protocol approved by the Institutional variants: nonsense p.R954X (g. chr5: 148,406,435 G>A) Review Board of Baylor College of Medicine. All patients and missense p.Y169H alleles (g. chr5: 148,422,281 gave informed consent for participation in this study. A>G) [2]. The two alleles satisfied multiple criteria for disease causation: both were validated with alternate DNA preparation methods technologies; the nonsense p.R954X allele had been pre- Blood was directly collected in PAXgene Blood DNA tubes viously reported in unrelated CMT families and corro- and DNA was isolated using the PAXgene Blood DNA kit borated with functional studies; each segregated (PreAnalytiX, Qiagen, Valencia, CA, USA). The quality of faithfully in the pedigree and no other lesion was dis- the DNA sample was ascertained by electrophoresis and covered in the locus despite adequate coverage of all determined to be of high quality (size >23 kb) with no visi- coding bases with NGS reads. Further, each allele faith- ble degradation. Quantity was determined using standard fully independently segregated with additional different Pico Green assays. mild phenotypes observed in other family members - susceptibility to carpal tunnel syndrome (presumably Description of NimbleGen VCRome 2.1 exome capture due to loss-of-function nonsense mutation) and an elec- design trophysiologically identified subclinical axonopathy (mis- In the current methods, the HGSC-design ‘VCRome 2.1’ sense allele). reagent [7] was used for exome capture. This NimbleGen Unresolved issues in the study included discovery of probe set targets the Vega [8], CCDS, and RefSeq additional reportedly pathogenic alleles at other loci, models, including predicted genes within RefSeq, as well related to potentially medically actionable conditions (inci- as microRNA (miRNA) [9] and regulatory regions for a dental or secondary findings). For example, the proband total target size of 42 Mb. Thus, exome capture sequen- was suggested to have a dominant variant that was causa- cing will identify SNPs and indels in coding tive for adrenoleukodystrophy (ALD, MIM #300100) [3]; regions, exon flanking sequences (including intron splice ABCD1, chrX:153,008,483 G>A; p.G608D, as well as to be sites), regulatory regions, and small non-coding RNAs (for homozygous for an allele causative for spinal muscular example, microRNAs). atrophy with respiratory distress type 1 (SMARD, MIM #604320) [4]; IGHMBP2, chr11:68,705,674C>A; pT879K. Illumina exome sequencing on GAII and HiSeq These variants were confirmed by Sanger sequencing and Library construction restriction endonuclease digestion (Figure 1). The proband Genomic DNA samples were constructed into Illumina was also found by WGS to be homozygous for reported PairEnd pre-capture libraries according to the manufac- causative mutations in three other Mendelian disease turer’s protocol (Illumina Multiplexing_SamplePrep_- genes. The possibility for erroneous entries in public Guide_1005361_D) with modifications as described in the Lupski et al. Genome Medicine 2013, 5:57 Page 3 of 17 http://genomemedicine.com/content/5/6/57

Figure 1 Confirmation and segregation by endonuclease restriction digestion of mutations presumed to represent incidental findings. The upper gel shows restriction digestion using the HphI restriction enzyme for the p.G608D mutation in ABCD1, causative for adrenoleukodystrophy (ALD). All tested individuals, none of which presented with ALD, are shown to be heterozygous for the mutation, including all males whom are hemizygous for genes on the × . The lower gel shows restriction digestion using the BtgI restriction enzyme for the p.T879K mutation in IGHMBP2, putatively causative for spinal muscular atrophy with respiratory distress type 1 (SMARD1). Random segregation and zygosity for this variant can be observed in this family; none of the individual subjects present with SMARD.

BCM-HGSC Illumina Barcoded Paired-End Capture adaptor sequences. Post-capture LM-PCR amplification Library Preparation protocol. The complete library and was performed using the 2X SOLiD Library High Fidelity capture protocol, as well as oligonucleotide sequences are Amplification Mix with 14 cycles of amplification. After accessible from the HGSC website [10]. the final XP bead purification, quantity and size of the cap- Briefly, 1 ug of genomic DNA in 100 uL volume was ture library was analyzed using the Agilent Bioanalyzer sheared into fragments of approximately 300 base pairs in 2100 DNA Chip 7500. The efficiency of the capture was a Covaris microTube with the E2 system (Covaris, Inc., evaluated by performing a qPCR-based quality check on Woburn, MA, USA) followed by end-repair, A-tailing, and the four standard NimbleGen internal controls. Successful ligation of the Illumina multiplexing PE adaptors. Pre- enrichment of the capture libraries was estimated to be in capture Ligation Mediated-PCR (LM-PCR) was performed the range of 7 to 9 of ΔCt value over the non-enriched for seven cycles of amplification using the 2X SOLiD samples. Library High Fidelity Amplification Mix (custom product Illumina sequencing on GAIIx and HiSeq 2000 manufactured by Invitrogen). Universal primer IMUX- Library templates were prepared for sequencing using Illu- P1.0 and pre-capture barcoded primer IBC were used in mina’s cBot cluster generation system with the corre- the PCR amplification. Purification was performed with sponding TruSeq PE Cluster Kits for the GA and HiSeq. Agencourt AMPure XP beads. Quantification and size dis- Briefly, these libraries were denatured with sodium hydro- tribution of the pre-capture LM-PCR product was deter- xide and diluted to 3 to 6 pM in hybridization buffer in mined using the Agilent Bioanalyzer 2100 DNA 7500 chip. order to achieve a load density of 400 to 550 k clusters/ Capture methods mm2 on the GAIIx and 700 to 900 k clusters/mm2 on the Pre-capture libraries (1 ug) were hybridized in solution to HiSeq 2000. Barcoded libraries were loaded in two inde- the VCRome 2.1 exome capture platform (HGSC design, pendent lanes of a GAIIx flow cell and then merged for NimbleGen) described above according to the manufac- analysis. Due to the higher capacity of the HiSeq 2000, turer’s protocol NimbleGen SeqCap EZ Exome Library SR each barcoded sample was pooled (in a set of three) for User’s Guide (Version 2.2) with minor revisions. Human loading onto a single lane of a HiSeq flowcell and then CotI DNA and full-length Illumina adaptor-specific block- deconvoluted for analysis based on barcode sequence. All ing oligonucleotides were added into the hybridization lanes were spiked with 2% phage phiX DNA control mix to block repetitive genomic sequences and the library for run quality control. After loading onto the flow Lupski et al. Genome Medicine 2013, 5:57 Page 4 of 17 http://genomemedicine.com/content/5/6/57

cell, sample libraries underwent bridge amplification to annotation tools that brings together relevant annotation form clonal clusters, and the sequencing primer was information using ANNOVAR [21] with UCSC and hybridized. RefSeq gene models, as well as a host of other internal and Sequencing runs were performed in paired-end mode external data resources. using the Illumina Genome Analyzer IIx (GAIIx) and HiSeq 2000 platforms. Using the TruSeq SBS Kits for Ion torrent exome sequencing using PGM and Proton the GA and the HiSeq, sequencing-by-synthesis reac- Library construction: The pre-capture libraries for the tions were extended for 101 cycles from each end, with Ion Torrent platforms (PGM and Proton) were con- an additional seven cycles for the index read for runs structed using the Ion Xpress Plus Fragment Library Kit that included pooled samples. Real-time analysis (RTA) (Life Technologies) and the experimental procedures software was used to process the image analysis and followed the manufacturer’s Ion Xpress Plus gDNA and base calling. Sequencing runs generated approximately Amplicon Library Preparation User Guide with minor 40-65 million and 350-500 million successful reads per modifications. lane for the GAIIx and HiSeq, respectively. With these The PGM pre-capture library was generated using 1 ug run yields, each sample library achieved 10 to 12 Gb of of genomic DNA which was fragmented to approximately raw DNA sequence data, which enabled a minimum of 200 base pairs by the Covaris S2 system (Covaris, Inc., 20x coverage for 95% to 97% of the bases targeted in Woburn, MA, USA) and purified with 1.2x Agencourt the exome. Ampure XP Beads (Beckman Coulter, Cat. No. A63882). Fragmentation was followed by end-repair, blunt-end Illumina whole genome sequencing on HiSeq 2000 ligation of the Ion Xpress Barcode and Ion P1 adaptors The WGS library was sequenced with the Illumina as well as nick translation. Pre-capture ligation-mediated HiSeq 2000 platform using the template preparation and PCR (LM-PCR) was performed for eight cycles of ampli- sequencing methods described above for the HiSeq fication using 2X SOLiD Library High Fidelity Amplifica- 2000, with the following exceptions: a total of three flow tion Mix (custom product by Invitrogen). cell lanes were loaded, yielding approximately 148 Gb of The Proton pre-capture library was generated in a simi- sequence for the WGS dataset. lar manner as above starting with fragmentation of geno- mic DNA to 100 base pairs using the Covaris S2 system Illumina analysis and SNP calls and purified with 1.8x Agencourt Ampure XP Beads. Illumina sequence analysis was performed using the Blunt-end ligation was performed using the non-barcoded HGSC Mercury analysis pipeline [11] that manages all Ion A and Ion P1 adaptors. Post-ligation size selection was aspects of data processing and analyses, moving data step performed using a 3% agarose gel cassette with the target by step through various analysis tools from the initial size of 180 bp. Final pre-capture LM-PCR was performed sequence generation on the instrument to annotated var- again using 2X SOLiD Library High Fidelity Amplification iant calls (SNPs and intra-read in/dels). First, the primary Mix (custom product by Invitrogen), for 12 cycles of analysis software on the instrument produces .bcl files that amplification. The above enzymatic reactions for both the are transferred off-instrument into the HGSC analysis ION Torrent PGM and Proton libraries were followed by infrastructure by the HiSeq Real-time Analysis module AMPure XP bead purification (1.4x for the PGM library (1.13.48). Once the run is complete and all .bcl files are and 1.8x for the Proton library, respectively). Quantity and transferred, Mercury runs the vendor’s primary analysis size distribution of the PCR product were analyzed using software (CASAVA v1.8.0), which de-multiplexes pooled the Agilent Bioanalyzer 2100 DNA 7500 chip. samples and generates sequence reads and base-call confi- Capture methods: For target enrichment, 1 ug of each dence values (qualities). The next step is the mapping of pre-capture library was hybridized in solution to the pre- reads and qualities to the GRCh37 human reference gen- viously described HGSC VCRome 2.1 exome capture ome assembly [12] using the Burrows-Wheeler aligner design following the Manual of Ion TargetSeq Custom (BWA [13,14]) and producing a BAM [15] (binary align- Enrichment Kits with minor modifications. Human CotI ment/map) file. The third step involves quality recalibra- DNA and adaptor A and P1-specific blocking oligonu- tion (using GATK [16,17]), and where necessary the cleotides were added into the hybridization reactions to merging of separate sequence-event BAMs into a single block repetitive genomic sequences and the common sample-level BAM is performed. BAM sorting, duplicate adaptor sequences. Post-capture LM-PCR amplification read marking, and realignment to improve in/del discovery was performed using 2X SOLiD Library High Fidelity all occur at this step. Next, we used the Atlas2 [18] suite Amplification Mix with 14 to 16 cycles of amplification. (Atlas-SNP and Atlas-indel) to call variants and produce a Quantities and sizes of the capture libraries were ana- variant call file (VCF [19]). Finally, annotation data is lyzed using the Agilent Bioanalyzer 2100 DNA Chip added to the vcf using the Cassandra [20] suite of 7500. The efficiency of the capture was again evaluated Lupski et al. Genome Medicine 2013, 5:57 Page 5 of 17 http://genomemedicine.com/content/5/6/57

by performing a qPCR-based quality check on the four Ion Torrent analysis methods: Sequence data from standard NimbleGen internal controls. Successful enrich- either Ion Torrent Personal Genome Machine (PGM) or ment of the capture libraries was estimated to range from Proton were processed on the instrument server using a 7 to 9 of ΔCt value over the non-enriched samples. theTorrentSuitev3.2softwarepackage.Theseinclude Template preparation and sequencing: Library tem- signal processing, base calling, and mapping. Reads were plates were prepared for sequencing using the Life Tech- mapped to the GRCh37 human reference genome assem- nologies Ion Xpress and Ion OneTouch protocols and bly [22] via TMAP short read aligner [23]. In the case of reagents. Briefly, library fragments were clonally ampli- the Proton platform, reads identified with the same start fied onto ion sphere particles (ISPs) through emulsion coordinate and 3’ adaptor flow position were considered PCR and then enriched for template-positive ISPs. More duplicates and were removed via a proprietary method in specifically, PGM emulsion PCR reactions utilized the the Torrent Suite. For the PGM, duplicate reads were Ion Xpress 200 Template Kit (Life Technologies) as spe- identified solely by start position and were removed cified by the vendor. Emulsions were generated with an using the Picard MarkDuplicates tool [24]. Where multi- IKA Ultra-Turrax, amplification followed using standard ple sequencing runs were generated on the same library, thermal cycling methods, and the ISPs were recovered BAM files were combined into a single BAM file with with a SOLiD emulsion collection tray (Life Technolo- Picard MergeSamFiles prior to identifying and removing gies) through centrifugation. In some instances, these duplicates. amplification and recovery steps were automated for the Single nucleotide variants (SNVs) were identified using PGM reactions using the Ion OneTouch System and the VarIONt, an extensible and highly configurable variant Ion OneTouch Template Kit v2, and similarly, the Proton caller pipeline developed at the HGSC and specifically emulsion PCR reactions were performed using the Ion tuned for Ion Torrent sequence. It utilizes SAMtools Proton I Template OT2 Kit (Life Technologies), with pileup [25] to generate a pileup string at every mapped amplification and recovery steps automated with the Ion base coordinate. It filters out portions of the read that OneTouch 2 System. Following recovery, enrichment may be unreliable, and then identifies alternate alleles was completed by selectively binding the ISPs containing from the remainder. Part of VarIONt’s capability is to amplified library fragments to streptavidin coated mag- define thresholds at which variants are called. For this netic beads, removing empty ISPs through washing steps, application, we required that the interrogation site have and denaturing the library strands to allow for collection a minimum read coverage of 10X and a minor allele of the template-positive ISPs. For all reactions, these fraction (maf) of 0.05. High confidence variants must steps were accomplished using the Life Technologies ES have a minimum read depth of 30X and 0.1 maf. Once module of the Ion OneTouch 2 System, and template- thevariantswerecalled,thepreviouslydescribedCas- positive ISPs were quantified using the Guava EasyCyte 5 sandra annotation suite was used to annotate the identi- (Millipore Technologies), obtaining >90% enrichment fied variants. efficiency for all reactions. For PGM runs, approximately 35 million template- SOLiD whole genome sequencing positive ISPs per run were deposited onto the Ion 318C SOLiD data used for comparative analyses, referred to chips (Life Technologies) by a series of centrifugation here as SOLiD WGS, were originally obtained as speci- steps that incorporated alternating the chip directional- fied in Lupski et al. [2]. In summary, whole genome ity. Sequencing was performed with the Ion PGM 200 sequence data were generated at 29.6x average depth of Sequencing Kit (Life Technologies) using the 440 flow coverage using the Sequencing by Oligonucleotide Liga- (‘200 bp’) run format. For the single Proton run, the tion and Detection (SOLiD) technology (Life Technolo- Proton P1 Chip was first pre-rinsed and incubated with gies, formerly Applied Biosystems), with a mappable NaOH for 1 min before loading in order to minimize yield of 89.6 Gb. Mapping and variant calling analyses residual contaminants and decrease background signal. were performed using the analysis suite Corona Lite. Approximately 300-350 million template-positive ISPs were deposited onto the Proton PI Chip and then Data release underwent sequencing using the Proton’s 260 flow (‘100 This study has been registered at the National Center bp’) run format and the Life Technologies Ion Proton I for Biotechnology Information (NCBI) under BioProject Sequencing Kit. A total of nine PGM runs and one Pro- ID 203659 (Accession: PRJNA203659) for BioSample: ton run were sequenced, which yielded a total of 6.2 SAMN00009513. Data are publicly available through the and 7.9 Gb of data, respectively. Analyses showed that Sequence Read Archive (SRA) under accession numbers: 91% to 92% of the targeted exome bases were covered SRX286243, SRX286245, SRX286282, SRX286417, to a depth of at least 20x. SRX286419, SRX286832. Lupski et al. Genome Medicine 2013, 5:57 Page 6 of 17 http://genomemedicine.com/content/5/6/57

Data analysis 97% of the targeted DNA bases were covered with >20 Data analyses and comparison of variants between the reads and >94% were covered with >40 reads (Table 1). multiple sequencing runs were performed using custom For exome enrichment and capture we used the in- Perl scripts. housedesignedVCRome2.1exometargetingreagent that covers 197,583 regions with a total number of Confirmation of variants 35,432,211 bases targeted. A median of 91,573 SNPs Variants of interest were confirmed by Sanger sequen- were recovered by the different ES experiments, how- cing of amplified PCR products. Primers specific to the ever not all of the SNPs recovered per exome run are region containing the variant to be tested were designed. coding exonic variants. Standard end-point PCR was performed using QIAGEN The ES data recovered a larger number of variants in HotStar Taq polymerase (QIAGEN Sciences, Maryland, the coding regions than was seen by WGS (Figure 2), USA). For ABCD1 fragment amplification, long-range suggesting a higher sensitivity, but also raising the possi- PCR was performed using the QIAGEN Long-range bility of a higher false positive rate. The number of ES PCR mix. Amplification of specific PCR fragments was variants was similar in four experiments using the same confirmed by agarose gel electrophoresis. Endonuclease chemistry (Illumina GAII (two runs) vs. Illumina HiSeq restriction digestion was performed to orthogonally test (two runs), Table 1, Figure 3). For coding SNPs, the the genotype and segregation of some of the variants of intra-platform concordance between pairs of ES experi- interest. The amplified PCR products were digested ments ranged from 96.16% to 97.79%, while the concor- using specific restriction enzymes (New England Bio- dance among the four ES experiments performed using Labs) to identify the site of the mutations to be tested. the reversible terminators chemistry was 89.22% (Figure Specific endonuclease digestion was verified by agarose 3a). Concordance of indel variants between runs in the gel electrophoresis. same platform ranged from 81.95% to 83.13%; however this decreases when comparing across the four Illumina Results and Discussion experiments (64.41%) (Figure 3b), which suggests that To pursue unresolved issues from the previous WGS some of the indel calls can be platform-specific artifacts. study that included unconfirmed variant calls and the Indels are in general more difficult to assess by variant challenging interpretation of incidental variants, we callers than single-nucleotide substitutions and conse- further tested the subject’s DNA with the latest version quently some of the differences and the lower concor- of ES methods. A total of almost 60 Gb of new raw dance for indel calling within and between platforms data, with an average of 11 Gb per exome sequence run, can also arise at the alignment and variant calling stage were generated. An average depth of sequence coverage of the data processing. of 170 and a median coverage of 140 was attained for We also performed two additional exome sequencing the main captured exome run (HiSeq-1). More than experiments at lower depth of coverage using the latest

Table 1 SNV identified by replicated ES of CMT proband. Platform GAII-1 GAII-2 HiSeq-1 HiSeq-2 PGM Proton WGS-SOLID WGS-HiSeq Total reads produced 117,685,348 107,207,362 119,508,046 118,716,380 41,268,598 71,113,537 2,469,936,140 1,479,177,846 Duplicate reads (%) 4.98 3.63 5.86 5.74 25.10 19.34 5.00 2.29 Total reads aligned (%) 97.80 98.66 96.76 97.09 99.16 98.83 58.12 96.57 Aligned reads on target (%) 64.90 69.70 70.59 70.66 60.81 71.66 N/A N/A Average coverage 138x 137x 170x 169x 65x 93x 29.6x 47.6x Median coverage 110x 111x 140x 139x 60x 80x N/A 40x Targets hit (%) 99.42 99.32 99.42 99.40 99.22 99.29 N/A N/A Bases targeted 35,432,211 35,432,211 35,432,211 35,432,211 35,432,211 35,432,211 N/A N/A Targeted bases with 10+ coverage (%) 97.58 97.49 97.98 97.99 95.84 94.64 N/A N/A Targeted bases with 20+ coverage (%) 96.19 96.18 97.23 97.21 92.28 91.28 N/A N/A Targeted bases with 40+ coverage (%) 90.60 91.08 94.59 94.53 75.19 80.75 N/A N/A Total SNPs 102,444 91,661 92,553 91,484 41,958 57,243 3,420,306 4,016,486 cSNPs (n) 23,321 22,869 23,017 22,980 20,372 22,079 18,829 24,475 nsSNPs (n) 11,491 11,217 11,288 11,252 11,363 10,821 9,069 12,370 Total c_inDels 8,879 8,249 9,576 9,375 N/A N/A N/A 10,439 Lupski et al. Genome Medicine 2013, 5:57 Page 7 of 17 http://genomemedicine.com/content/5/6/57

variants due to strand bias; the second parameter is the low variant ratio, and thirdly the low quality of the called variants in one run versus another which in many cases is influenced by mis-mapping of reads to locations in the genome other than the specific target lowering the var- iant quality score (Figure 5b). The run that had the most ‘private’ calls was the PGM run; however the majority of these private calls fall within the low variant fraction por- tion of the distribution suggesting that most of them are false positives. Interestingly, the distribution of ‘private’ SNPs in the Proton exome sequencing run is more scat- tered and distributed in the coverage and variant fraction ranges usually observed for true positive variant calls (Figure 4). Of these high-quality cSNPs, 16,280 were recovered in the two WGS approaches in spite of the dif- ferent platforms used and the differences in coverage between the two WGS experiments and among all the sequencing experiments. Of note, comparison between the two WGS approaches, one being the original SOLiD WGS at approximately 30x average depth of coverage and the second one the Illumina Figure 2 Diagram depicting the differences between the number of variants identified by WGS and ES of the same individual HiSeq WGS at approximately 47x depth of coverage genome. WGS identified 3,420,306 SNPs throughout the genome, resulted in 3,090,120 concordant SNPs across the whole including 18,829 coding SNPs (cSNPs). Targeted exome sequence (ES) genome between the two experiments (90.34% of the ori- focuses on capturing most of the coding variation contained in ginal SOLiD WGS SNPs). 197,583 exonic regions (VCRome 2.1 design). From this, ES identified 21,772 concordant cSNPs among four Illumina sequencing runs and 18,063 high-quality cSNP variants concordant among all six exome Resolution of incidental findings from WGS sequencing experiments. Of these, 3,709 cSNPs difered from the cSNPs In most cases, the ES data recapitulated the WGS data: identified by the original SOLiD WGS approach. for example, both personal genome approaches identi- fied 12 pharmacologically relevant variants (that is, pharmacogenetic traits) (Table 2). The greater coverage Ion Technology (Life Technologies) using the Personal by ES (65 reads) versus WGS (2 reads) resolved the pre- Genome Machine (PGM) and the Proton Machine. Com- viously observed CYP2C9 p.Arg144Cys allele, involved parison of all the exome experiments yielded a total of in warfarin sensitivity, into a C/T heterozygous rather 21,772 concordant coding SNPs (cSNPs) across the four than a T/T homozygous allele. Similarly, the previously high-coverage Illumina exome sequencing experiments. reported SMARD alleles were found, consistent with Moreover, 18,623 high-quality cSNPs were concordant their identity as erroneous associations with disease in across the different platforms and across all six exome database entries. In addition to the IGHMBP2 allele experiments, of which 8,777 are non-synonymous, reportedly associated with SMARD, three other ‘disease- including 73 nonsense changes. For the six different associated’ variants found by WGS and confirmed as exome sequencing runs, the majority of single-run called homozygous alleles and for which the proband did not SNPs fall below the 0.20-0.25 variant fraction for reliable manifest the disease (GLB1-chr3: 33,138,549G>A;p.P10L heterozygous calls; those single-run calls with higher and ABCA4-chr1: 94,544,234T>C; p.H423R and NHLRC1 variant fractions (>0.20) generally fall below the 10x cov- chr6: 18,122,506G>A; p.P111L, the latter allele resolved erage cutoff (Figure 4). In the four Illumina exome into a heterozygous variant by ES) have subsequently been sequencing runs, the variant fraction vs. coverage distri- observed as homozygous at a high frequency in public bution of SNPs called in one run but not in the other exome and genome sequence databases, again consistent three runs is as expected. The majority of single-run calls with the original interpretation of erroneous entries as cluster in the lower range of the variant fraction distribu- ‘disease associated’ in mutation databases. tion (Figure 5a). Comparison of SNP variants called Elsewhere, the greater sensitivity of the ES data between exome sequencing runs using the same Illumina resolved false positive calls from the WGS. For example, sequencing platforms (GAII-1 vs. GAII-2 and HiSeq1 vs. the dominant ABCD1 ALD allele that had been pre- HiSeq2) shows that the most common parameter for dif- viously reported was resolved as ‘wild-type’ (normal) ferences between sequencing runs is the filtering-out of with the further sequencing on the HiSeq platform. Lupski et al. Genome Medicine 2013, 5:57 Page 8 of 17 http://genomemedicine.com/content/5/6/57

Figure 3 Overlap of SNVs identified in four targeted exome sequencing replicates of the same individual’s DNA in the Illumina platform.(a) Comparison of identified coding SNPs (cSNPs) within and between sequencing technologies (GAII vs. HiSeq). There is a high percentage of shared identified SNPs both within and between the different technologies. (b) Comparison of identified InDels within and between sequencing technologies. We observe less overlap between the two technologies probably due to a higher rate of false positive InDels.

Examination of the individual sequence reads in the origi- to represent other related genomic loci, such as a pseudo- nal WGS data showed that the available data had only six gene or segmental duplications of the genomic interval reads with three reference reads (two unique) and three containing this gene, and not the true ABCD1 locus. variant reads (three unique). In contrast, the new ES data Further data from PCR followed by either Sanger provided 232 reads at this base position. Thirty-five of sequencing or HphI restriction digestion, in all the family these reads did contain the variant; however, the mapping members, were not particularly helpful in resolving the algorithm (BWA) that was used to align the reads also true ABCD1 genotypes: each showed apparent ‘heterozyg- showed all 35 were distinct, providing evidence for non- osity’ for this position (including the male subjects for this unique alignment. Hence, these 35 reads were more likely X-linked gene) (Figure 1), but were consistent with the Lupski et al. Genome Medicine 2013, 5:57 Page 9 of 17 http://genomemedicine.com/content/5/6/57

Figure 4 Distribution of variant fraction vs. coverage per base for SNPs called only in single exome sequencing runs, not replicated in any other exome run. For the six different exome sequencing runs, the majority of single-run called SNPs fall below the 0.20-0.25 variant fraction for reliable heterozygous calls; those single-run calls with higher variant fractions (>0.20) generally fall below the 10x coverage cutoff. The run that had the most ‘private’ calls was the PGM run; however the majority of these private calls fall within the low variant fraction portion of the distribution suggesting that most of them are false positives. Interestingly, the distribution of ‘private’ SNPs in the Proton exome sequencing run is more scattered and distributed in the coverage and variant fraction ranges usually observed for true positive variant calls. model of cross-amplification due to genomic similarities subsequent detailed analysis of the region to design a long- between a distal segment of the ABCD1 gene and the pre- range PCR assay helped to interpret and give a faithful sence in the personal genome of the proband of a partial recapitulation of the published data [26], consistent with pseudogene copy [20] of ABCD1. We observed that this the X-linked hemizygosity expected in males. It seems that particular region spanning 9,757 bp and comprising exons the ratio between the number of reads calling the reference 7 through 10 of the ABCD1 gene is repeated four addi- base versus the number of mismapped reads calling var- tional times in other locations in the haploid reference iants can confound the variant calling algorithms. Copy genome at chr2, chr10, chr16, and chr22 with 94.90% to number variant (CNV) alleles from pseudogenes, low copy 95.70% identity among the different copies (Figure 6a). It repeats (LCRs), and other structural variants of personal is difficult to experimentally assess by next-generation genomes within clinical populations may confound the sequencing, hybridization, DNA specific endonuclease interpretation of simple nucleotide variant (SNV) alleles restriction digestion or computational mapping algorithms using whole genome approaches. based on DNA sequence matching, to which of the dupli- cated copies of the region the reads that contain the pre- Identification of a complex SH3TC2 allele associated sumed variant mismap. Our main interest was to test with CMT whether the functional ABCD1 gene contained the ALD At the critical CMT causative locus in the proband, the mutation thus we designed primers for long-range PCR SH3TC2 mutations (p.R954X and p.Y169H) that were outside the duplicated segment that comprises exons 7 to identified by WGS were also found by ES. This was 10 of ABCD1 to specifically interrogate the ABCD1 region. expected as these two sites had been extensively validated Long-range PCR amplification followed by nested regular and studied in follow-up to the initial discovery. Surpris- PCR amplification of the exon 8 segment containing the ingly, however, an additional potentially significant muta- variant and Sanger sequencing revealed that in fact the tion (p.M1?) was also found in the first codon of SH3TC2 ABCD1 gene contained no mutation and that the variant by ES. At this position, the ES data contained 71 and 98 allele is present in the pseudogene, although which of confidently mapping sequence reads corresponding to the the four copies cannot be differentiated by this approach reference base and variant base, respectively, consistent (Figure 6b). Hence, of all four approaches (WGS; ES; PCR+ with heterozygosity at this position. The mutation was con- Sanger sequencing; PCR+restriction digestion), the obser- firmed by PCR amplification followed by Sanger sequen- vation that all the reads containing the variant had low cing and three independent restriction endonuclease mapping qualities in the high-coverage ES data and digestions (Figure 7) that showed the newly discovered Lupski et al. Genome Medicine 2013, 5:57 Page 10 of 17 http://genomemedicine.com/content/5/6/57

Figure 5 Analysis of missed variants between Illumina ES runs.(a) Distribution of variant fraction and coverage per base for missed SNPs in Illumina exome sequencing experiments. In the four Illumina exome sequencing runs, the variant fraction vs. coverage distribution of SNPs called in one run but not in the other three runs is as expected. The majority of single-run calls cluster in the lower range of variant fraction, below the standard threshold of 0.20-0.25 for reliable heterozygous calls. (b) Classification of filtered-out variants that differed between Illumina exome sequencing runs. Comparison of SNP variants called between exome sequencing runs using the same Illumina sequencing platforms (GAII vs. GAII and HiSeq1 vs. HiSeq2) shows that the most common parameter for differences between sequencing runs is the filtering-out of variants due to strand bias in one run versus another; the second parameter is the low variant ratio and third low quality of the called variants which in many cases is influenced by mis-mapping of reads to different other locations in the genome besides the specific target. Lupski et al. Genome Medicine 2013, 5:57 Page 11 of 17 http://genomemedicine.com/content/5/6/57

Table 2 Observed pharmacogenetic variants by WGS position regardless of the platform used for sequencing. versus ES. We also performed two additional ES experiments using Gene OBS WGS OBS WES Variant Drug(s) the Personal Genome Machine (PGM) and Proton plat- ADRB1 C/C C/C p.Gly389Arg Beta-blockers forms from Life Technologies and an additional Illumina CDA A/C A/C p.Lys27Gln Gemcitabine, cisplatin HiSeq whole-genome sequencing at 47.6 average depth of CYP2C8 C/T C/T p.Arg139Lys Paclitaxel coverage to assess distribution of read depth in WGS using a different sequencing platform with longer reads. The CYP2C8 T/C T/C p.Lys339Arg Paclitaxel three alleles in SH3TC2 were successfully covered and CYP2C9 A/C A/C p.Ile359Leu Warfarin identified in these two additional ES and WGS approaches CYP2C9 T/T(r = 2) C/T(r = 65) p.Arg144Cys Warfarin (Table 3, Figure 9). NAT2 C/C C/C p.Ile114Thr Isoniazid Retrospective analysis of the prior WGS SOLiD data POR C/T C/T p.Ala503Val Midazolam [2] revealed that the total accumulated DNA sequence TPMT C/G C/G p.Ala80Pro Purines reads covering this site had satisfied the applied filters UGT1A3 C/C C/C p.Trp11Arg Estrones, flavonoids (total of 15 unique sequencing reads covering position UGT1A7 C/C C/C p.Trp208Arg Irinotecan chr5:148,442,585). The fraction of those reads, however, UGT1A7 G/G G/G p.Asn129Lys Irinotecan that represented the variant allele was below the accepted threshold (two reads of the total 15, 13.3% versus the filter’s cutoff >20% variant allele fraction). allele segregating faithfully as a complex allele with the pre- Hence, the site was adequately covered overall, but not viously observed p.Y169H allele in the family (Figure 8). by enough reads containing the variant allele for the The three SH3TC2 gene variants were well-covered, depth bioinformatic algorithm to call it a variable position. of coverage ranging from 104x to 209x among the four dif- Others have suggested an intrinsic allelic bias in the ferent Illumina ES replicates, and identified or ‘called’ by ligation based SOLiD sequencing method, however, this the software in these four ES experiments with an approxi- individual example can be readily explained stochasti- mate equal ratio for reference and variant reads over each cally [27,28].

Figure 6 (a) Genomic landscape of the region containing the ABCD1 gene and the repeated 9.7 kb segment reported as a partial pseudogene. This segment comprises exons 7 to 10 of the ABCD1 gene and is repeated in four other locations in the reference genome with approximately 95% identity. (b) Comparison of Sanger sequencing of the reported mutation (p.Gly608Asp) through direct PCR of the segment containing exon 8 and 9 of ABCD1 using genomic DNA as a template (upper) and nested PCR of the same segment after long-range PCR amplification of an approximately 10 kb segment using primers specific to the ABCD1 locus outside of the repeated segment (lower). Lupski et al. Genome Medicine 2013, 5:57 Page 12 of 17 http://genomemedicine.com/content/5/6/57

Figure 7 Flanking sequence of the genomic mutation T ® C at position chr5:148,422,778 in SH3TC2. The coding SNV transition c.1A>G causes a substitution of valine for methionine at the initiation codon. An alternative in frame ATG codon is present 90 codons downstream with a less conserved Kozak sequence than the upstream initiation codon. The wild-type allele can be recognized by two different restriction enzymes, FatI and CviAII, but the mutation destroys the restriction site (data not shown). Conversely, transition T®C creates a restriction site that can be recognized by the endonuclease AflIII. The mutation segregates with the axonal neuropathy phenotype and co-segregates as a complex allele (p.M1?; Y169H) with the nonsense mutation p.R954X to cause the CMT phenotype in this family.

These new data regarding SH3TC2 variant alleles sug- because of possible re-initiation of translation from an gest three possibilities: (1)thenewlyidentifiedp.M1? alternative methionine initiation codon 90 amino acids variant is the causative allele in cis with a benign p. downstream and hence pathogenicity of this allele Y169H; (2) the p.M1? variant alone has little effect results from the downstream p.Y169H; or (3) the CMT

Figure 8 Segregation of nonsense and complex alleles of SH3TC2 in a family with Charcot-Marie-Tooth neuropathy. Previously reported nonsense variant (p.R954X) was inherited from the maternal line; reported missense p.Y169H and newly identified p.M1? variants are in cis inherited from the paternal line and segregating with the axonal neuropathy phenotype in the family. Individuals that inherited both the nonsense and complex allele present with CMT neuropathy. Lupski et al. Genome Medicine 2013, 5:57 Page 13 of 17 http://genomemedicine.com/content/5/6/57

Table 3 Comparison of SH3TC2 alleles in six exome specific anti-SH3TC2 antibody, only one smaller band of sequencing experiments and one whole-genome a size consistent with the protein product of a predicted sequencing. spliced variant of SH3TC2 (Additional File 1). Thus, in Instrument p.M1? Ratio p.Y169H Ratio p.R954x Ratio these lymphoblastoid cell lines from the subject and GAII (Ex1) RR 47 0.48 90 0.50 63 0.47 other family members segregating the alleles, no evi- VR 66 0.58 90 0.50 71 0.53 dence of a truncated protein, perhaps reflecting non- TR 113 - 180 - 134 - sense mediated decay of the nonsense mutation bearing GAII (Ex2) RR 50 0.48 79 0.45 45 0.39 transcript, nor evidence of a predicted sized protein due to re-initiation of translation from an internal AUG VR 54 0.52 95 0.55 69 0.61 initiation codon could be obtained. TR 104 - 174 - 114 - These new findings from ES studies further support HiSeq (Ex1) RR 71 0.42 100 0.48 54 0.47 the original study conclusions of: (1) pathogenic involve- VR 98 0.58 109 0.52 61 0.53 ment of the SH3TC2 gene; and (2) spurious interpreta- TR 169 - 209 - 115 - tions of ‘incidental’ findings that may occur based upon HiSeq (Ex2) RR 70 0.42 90 0.44 72 0.49 current database entries. Further, these new data illumi- VR 97 0.58 114 0.56 76 0.51 nate how structural variation unique to individuals from TR 167 - 204 - 148 - clinical populations might challenge interpretation of PGM (Ex) RR 19 0.37 55 0.55 45 0.58 some variant alleles. VR 33 0.63 45 0.45 33 0.42 Conclusions TR 52 - 100 - 78 - These data illustrate that, as predicted by an editorial Proton (Ex) RR 58 0.55 79 0.50 35 0.56 accompanying the original manuscript [32], ES can indeed VR 48 0.45 80 0.50 28 0.44 identify the causative variants of a mendelian genetic dis- TR 108 - 159 - 63 - ease. In many instances, the high yet skewed depth of cov- HiSeq-WGS RR 29 0.60 25 0.43 32 0.48 erage in targeted regions afforded by the ES methods may VR 19 0.40 33 0.57 34 0.52 offer higher likelihood of recovery of significant variants TR 48 - 58 - 66 - and resolution of their true genotypes when compared to RR: number of reads calling the reference base; TR: total number of reads at the lower, but more uniform WGS coverage (Table 4). that position; VR: number of reads calling the variant base. The ratio columns This may be particularly true because most WGS studies reflect the fraction of reads that called either the reference base or the read are typically implemented at average depths of coverage base over the total number of reads for that position. well below any current ES approach, due to the theoretical assumption and observations that an approximately 30x causality requires the combination of variants in the average depth of coverage across the whole genome would complex allele (p.M1?; p.Y169H) plus the co-segregating provide enough read depth to identify >99% of the homo- nonsense allele p.R954X [29]. This latter hypothesis zygous variants and approximately 98% of the heterozy- wouldbeconsistentwitha‘mutational load’ model, in gous variants; that is, NOT identifying 1% to 2% of which a partial loss-of-function hypomorphic complex variants in the coding regions. The ability to detect rare allele is independently segregating with the electrophy- variant alleles may be particularly relevant to medically siologically identified axonopathy as observed for the actionable variants [31] and pharmacogenetic traits [33]; homozygous PMP22 T118M mutation [30]. Rare variant both providing personal genome variation of potential uti- alleles and combinations of such alleles at a locus may lity in precision medicine. Given the high coverage at aggregate from parental contributions or may occur by which targeted exome sequencing is performed (>100x) new mutations within recent ancestors; a concept and the reduced sampling space provided by the capture referred to as clan genomics [31]. It is distinctly possible step, the percentage of targeted bases that are not that the p.M1? variant occurred de novo on a haplotype sequenced is low due to the distribution of individual that contains p.Y169H within the clan. Of note, while reads throughout the ‘genome space’ interrogated and the the p.M1? variant is novel, since our original report [2] fold sequence coverage for each base position. However, the p.Y169H variant was observed in additional indivi- the technical limitation of ES might be imposed by the duals. However, it is always observed in the heterozy- capture step [34], due to limitations of the capture designs gous state and never as a homozygous variant, making (that is, not all genes and exons have been adequately its functional significance elusive. defined and therefore not included in ES designs) and the Western blotting of lymphoblastoid cell lines derived capture efficiency (70% to 80%). Our data suggest that the from the subject and family members carrying different choice of ES for identification of fully penetrant critical variant alleles at the SH3TC2 locus identified, using a SNV mutations in the coding regions of mammalian Lupski et al. Genome Medicine 2013, 5:57 Page 14 of 17 http://genomemedicine.com/content/5/6/57

Figure 9 Comparison of coverage across the whole SH3TC2 gene and flanking regions of the three different mutations identified in the proband. Exome sequencing (ES) provides saturation of base calling reads at >100x. (a) Coverage across the whole SH3TC2 gene. (b) Coverage across the p.M1? mutation. (c) Coverage across the p.Y169H mutation. (d) Coverage across the p.R954X mutation. genomes may be regarded as superior, rather than a sequencing approach to total variant detection in a personal shortcut, or compromise, when compared with WGS genome if one seeks to capture themajorityofthegenome- approaches at depths of coverage below 100x. WGS con- wide variation including CNVs and other structural tinues to be a more comprehensive next-generation variants in addition to SNVs. However, in order to best Lupski et al. Genome Medicine 2013, 5:57 Page 15 of 17 http://genomemedicine.com/content/5/6/57 chrXchr11 153,008,483chr3 68,705,674 Gchr1 33,138,549 Cchr6 A 94,544,234 Gchr5 A 18,122,506 G608D Tchr5 A 148,442,585 T879K Gchr5 148,422,281 C T het P10L (3:3) A 148,406,435 A hom H423R (9:1) C G het P111L (49:183) G hom (14:0) M1 hom (7:90) A het hom (35:197) (13:1) Y169H hom (139:0) hom hom (3:0) (107:1) N/A R954X hom (170:0) hom (163:0) hom N/A het (147:0) het (17:13) hom (9:6) (131:0) hom (254:1) het (20:15) hom het hom (115:0) (90:90) (148:0) N/A hom N/A het (279:1) hom (63:71) het (52:0) (47:66) hom het (137:3) (79:95) hom (68:0) hom het (113:0) hom (45:69) het (89:0) (50:54) het hom (100:109) N/A (147:0) hom het (52:0) (9:14) hom het (84:1) (54:61) het het N/A (71:98) (90:114) hom het (50:0) het (55:45) het (9:11) N/A (72:76) het (70:97) het (67:63) het N/A (45:33) het (19:33) het (25:33) het het (37:33) het (8:20) (46:80) N/A het (32:34) het (29:19) het (16:17) Table 4 Variant call comparison across all platforms for variants mentioned throughout the text. IGHMPB2 GLB1 ABCA4 NHLRC1 SH3TC2 SH3TC2 SH3TC2 GeneABCD1 chr pos_hg19 Ref var aa_change SOLiD-WGS GAII-Ex1 GAII-Ex2 HiSeq-Ex1 HiSeq-Ex2 PGM-Ex Proton-Ex HiSeq-WGS Lupski et al. Genome Medicine 2013, 5:57 Page 16 of 17 http://genomemedicine.com/content/5/6/57

capture variation higher coverage at lower costs need to be consensus coding DNA sequence exome reveals exons with higher implemented for WGS approaches. variant densities. Genome Biol 2011, 12:R68. 8. Wilming LG, Gilbert JG, Howe K, Trevanion S, Hubbard T, Harrow JL: The vertebrate genome annotation (Vega) database. Nucleic Acids Res 2008, 36(Database):D753-D760. Authors’ contributions 9. Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ: miRBase: tools for JRL and RAG conceived and designed the study and participated in the microRNA genomics. Nucleic Acids Res 2008, 36(Database):D154-D158. drafting and writing of the manuscript. CG-J performed data analyses, 10. [https://www.hgsc.bcm.edu/sites/default/files/documents/ variant confirmations and analyses, and contributed to the writing of the Illumina_Barcoded_Paired-End_Capture_Library_Preparation.pdf]. manuscript. YY, MNB, and SJ performed data analysis. CJB, CLK, MW, and 11. [https://github.com/dsexton2/Mercury-Pipeline]. ACH performed data generation and analysis and helped with the writing of 12. [http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/]. the methods. CE and JGR participated in the coordination of the study. 13. Li H, Durbin R: Fast and accurate short read alignment with Burrows- DMM participated in the coordination of the study, data generation, and Wheeler Transform. Bioinformatics 2009, 25:1754-1760. writing of the manuscript. All authors read and approved the final 14. [http://bio-bwa.sourceforge.net/]. manuscript. 15. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The Competing interests Sequence alignment/map (SAM) format and SAMtools. Bioinformatics JRL is a paid consultant for Athena Diagnostics, holds stock ownership in 2009, 25:2078-2079. 23andMe and Ion Torrent Systems, and is a co-inventor on US and European 16. DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, patents related to molecular diagnostics. RAG is a paid consultant of G.E. del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Clarient. The Department of Molecular and Human Genetics at Baylor Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M: A framework for College of Medicine derives revenue from genetic testing offered in the variation discovery and genotyping using next-generation DNA Whole Genome Laboratory (WGL) and Medical Genetics Laboratories (MGL: sequencing data. Nat Genet 2011, 43:491-498. http://www.bcm.edu/geneticlabs). The remaining authors declare that they 17. [http://www.broadinstitute.org/gatk/]. have no competing interests. 18. Challis D, Yu J, Evani US, Jackson AR, Paithankar S, Coarfa C, Milosavljevic A, Gibbs RA, Yu F: An integrative variant analysis suite for whole exome Acknowledgements next-generation sequencing data. BMC Bioinformatics 2012, 13:8. The authors would like to thank Dr. Arthur Beaudet and Dr. Pawel 19. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Stankiewicz for their critical review of the manuscript; and Joep de Ligt for Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R, 1000 helpful discussions of analysis results. This work was supported in part by US Genomes Project Analysis Group: The variant call format and VCFtools. National Institute of Neurological Disorders and Stroke (NINDS) grant Bioinformatics 2011, 27:2156-2158. R01NS058529, and US National Human Genome Research Institute (NHGRI) 20. Bainbridge MN, Wiszniewski W, Murdock DR, Friedman J, Gonzaga- grants U54HG006542 and U54HG003273. Jauregui C, Newsham I, Reid JG, Fink JK, Morgan MB, Gingras MC, Muzny DM, Hoang LD, Yousaf S, Lupski JR, Gibbs RA: Whole-genome Author details sequencing for optimized patient management. Sci Transl Med 2011, 1 Department of Molecular and Human Genetics, Baylor College of Medicine, 3:87re3. 2 One Baylor Plaza, Houston, TX 77030, USA. Human Genome Sequencing 21. Wang K, Li M, Hakonarson H: ANNOVAR: Functional annotation of genetic Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, variants from next-generation sequencing data. Nucleic Acids Res 2010, 3 USA. Whole Genome Laboratory, Baylor College of Medicine, One Baylor 38:e164. Plaza, MS: NAB2015, Houston, TX 77030, USA. 22. [http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/]. 23. [https://github.com/iontorrent/TMAP]. Received: 11 March 2013 Revised: 20 May 2013 24. [http://picard.sourceforge.net/]. Accepted: 27 June 2013 Published: 27 June 2013 25. [http://samtools.sourceforge.net/pileup.shtml]. 26. Braun A, Kammerer S, Ambach H, Roscher AA: Characterization of a partial References pseudogene homologous to the adrenoleukodystrophy gene and 1. Gonzaga-Jauregui C, Lupski JR, Gibbs RA: Human genome sequencing in application to mutation detection. Hum Mutat 1996, 7:105-108. health and disease. Annu Rev Med 2012, 63:35-61. 27. Hedges DJ, Guettouche T, Yang S, Bademci G, Diaz A, Andersen A, 2. Lupski JR, Reid JG, Gonzaga-Jauregui C, Rio Deiros D, Chen DC, Nazareth L, Hulme WF, Linker S, Mehta A, Edwards YJ, Beecham GW, Martin ER, Pericak- Bainbridge M, Dinh H, Jing C, Wheeler DA, McGuire AL, Zhang F, Vance MA, Zuchner S, Vance JM, Gilbert JR: Comparison of three targeted Stankiewicz P, Halperin JJ, Yang C, Gehman C, Guo D, Irikat RK, Tom W, enrichment strategies on the SOLiD sequencing platform. PLoS One 2011, Fantin NJ, Muzny DM, Gibbs RA: Whole-genome sequencing in a patient 6:e18595. with Charcot-Marie-Tooth neuropathy. N Engl J Med 2010, 362:1181-1191. 28. McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu Y, Tsung EF, 3. Dvoráková L, Storkánová G, Unterrainer G, Hujová J, Kmoch S, Zeman J, Clouser CR, Duncan C, Ichikawa JK, Lee CC, Zhang Z, Ranade SS, Hrebícek M, Berger J: Eight novel ABCD1 gene mutations and three Dimalanta ET, Hyland FC, Sokolsky TD, Zhang L, Sheridan A, Fu H, polymorphisms in patients with X-linked adrenoleukodystrophy: The Hendrickson CL, Li B, Kotler L, Stuart JR, Malek JA, Manning JM, first polymorphism causing an amino acid exchange. Hum Mutat 2001, Antipova AA, Perez DS, Moore MP, Hayashibara KC, Lyons MR, Beaudoin RE, 18:52-60. et al: Sequence and structural variation in a human genome uncovered 4. Tachi N, Kikuchi S, Kozuka N, Nogami A: A new mutation of IGHMBP2 by short-read, massively parallel ligation sequencing using two-base gene in spinal muscular atrophy with respiratory distress type 1. Pediatr encoding. Genome Res 2009, 19:1527-1541. Neurol 2005, 32:288-290. 29. Shroyer NF, Lewis RA, Yatsenko AN, Lupski JR: Null missense ABCR (ABCA4) 5. Bainbridge MN, Wang M, Burgess DL, Kovar C, Rodesch MJ, D’Ascenzo M, mutations in a family with Stargardt disease and retinitis pigmentosa. Kitzman J, Wu YQ, Newsham I, Richmond TA, Jeddeloh JA, Muzny D, Invest Ophthalmol Vis Sci 2001, 42:2757-2761. Albert TJ, Gibbs RA: Whole exome capture in solution with 3 Gbp of 30. Shy ME, Scavina MT, Clark A, Krajewski KM, Li J, Kamholz J, Kolodny E, data. Genome Biol 2010, 11:R62. Szigeti K, Fischer RA, Saifi GM, Scherer SS, Lupski JR: T118M PMP22 6. Bainbridge MN, Wang M, Wu Y, Newsham I, Muzny DM, Jefferies JL, mutation causes partial loss of function and HNPP-like neuropathy. Ann Albert TJ, Burgess DL, Gibbs RA: Targeted enrichment beyond the Neurol 2006, 59:358-364. consensus coding DNA sequence exome reveals exons with higher 31. Lupski JR, Belmont JW, Boerwinkle E, Gibbs RA: Clan genomics and the variant densities. Genome Biol 2011, 12:R68. complex architecture of human disease. Cell 2011, 147:32-43. 7. Bainbridge MN, Wang M, Wu Y, Newsham I, Muzny DM, Jefferies JL, 32. Lifton RP: Individual genomes on the horizon. N Engl J Med 2010, Albert TJ, Burgess DL, Gibbs RA: Targeted enrichment beyond the 362:1235-1236. Lupski et al. Genome Medicine 2013, 5:57 Page 17 of 17 http://genomemedicine.com/content/5/6/57

33. Nelson MR, Wegmann D, Ehm MG, Kessner D, St Jean P, Verzilli C, Shen J, Tang Z, Bacanu SA, Fraser D, Warren L, Aponte J, Zawistowski M, Liu X, Zhang H, Zhang Y, Li J, Li Y, Li L, Woollard P, Topp S, Hall MD, Nangle K, Wang J, Abecasis G, Cardon LR, Zöllner S, Whittaker JC, Chissoe SL, Novembre J, et al: An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science 2012, 337:100-104. 34. Clark MJ, Chen R, Lam HY, Karczewski KJ, Chen R, Euskirchen G, Butte AJ, Snyder M: Performance comparison of exome DNA sequencing technologies. Nat Biotechnol 2011, 29:908-914.

doi:10.1186/gm461 Cite this article as: Lupski et al.: Exome sequencing resolves apparent incidental findings and reveals further complexity of SH3TC2 variant alleles causing Charcot-Marie-Tooth neuropathy. Genome Medicine 2013 5:57.

Submit your next manuscript to BioMed Central and take full advantage of:

• Convenient online submission • Thorough peer review • No space constraints or color figure charges • Immediate publication on acceptance • Inclusion in PubMed, CAS, Scopus and Google Scholar • Research which is freely available for redistribution

Submit your manuscript at www.biomedcentral.com/submit