Exome Sequencing: a Comparison of Enrichment Technologies

Michael James Clark, Ph.D.
August 3rd, 2011

.whoami
Michael James Clark, Ph.D.
Geneticist & Bioinformaticist
Stanford University, Mike Snyder Lab
mendeliandisorder.blogspot.com
Likes: Sequencing Exomes
Contact: [email protected]
Twitter: @mjcgenetics

Why exome sequencing?
[Image: http://maps.google.com]

Why exome sequencing?
[Image: http://map.seqanswers.com]

Exome is highly interpretable
Exome variation
  Missense variations: determinable amino acid changes
  Nonsense variations: estimable effect on expression
  Copy number variations: estimable gene dosage effects
Non-exome variation
  Enhancer/promoter variations: mostly indeterminate gene expression effect

Why exome over whole genome?
[Figure: Cost per Mb of DNA Sequence, log scale from $0.10 to $10,000.00. Source: http://www.genome.gov/sequencingcosts/]
Whole genome sequencing requires 50-100x more sequencing, and is therefore more expensive.

Why exome over whole genome?
[Figure: Cost per Genome, log scale from $1 to $100,000,000. Source: http://www.genome.gov/sequencingcosts/]
The cost of whole genome sequencing remains relatively high.
Exome-sequencing is <1/10th the cost of WGS.
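The "50-100x more sequencing" figure on the cost slide can be sanity-checked with simple arithmetic: at equal mean depth, the raw sequence needed scales with the size of the region being covered. The sketch below is only a rough illustration. The genome size (about 3,100 Mb) and the target sizes (30, 50, and 60 Mb) are assumptions chosen for this example rather than values from the slides, and off-target reads and uneven coverage are ignored.

```python
# Assumption: approximate haploid human genome size in Mb (not stated in the deck).
GENOME_MB = 3100.0


def sequencing_ratio(target_mb, genome_mb=GENOME_MB):
    """Fold more raw sequence needed to cover the whole genome than a
    target of `target_mb` Mb, at the same mean depth."""
    return genome_mb / target_mb


# Assumed target sizes in Mb, for illustration only.
for target_mb in (30.0, 50.0, 60.0):
    ratio = sequencing_ratio(target_mb)
    print(f"{target_mb:.0f} Mb target: about {ratio:.0f}x more sequence for WGS")
```

Under these assumptions the ratio falls between roughly 50x and 100x. A real experiment would need somewhat more exome sequencing than this suggests, because some reads fall off target.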
Method: exome enrichment
Biotinylated oligonucleotide baits complementary to exome sequences
Example images borrowed from Illumina TruSeq exome enrichment workflow

Kits compared
  TruSeq Exome Enrichment Kit
  SureSelect All Exon 50M
  SeqCap EZ Exome v2.0

Exome enrichment platform comparisons

Targeting strategy design differences
[Figure]

Targeting region differences
[Figure: three groups of database overlap values; the platform label for each group is not recoverable]
                   Group 1   Group 2   Group 3
RefSeq (coding)    49kb      200kb     10kb
RefSeq (UTR)       100kb     400kb     28Mb
Ensembl (CDS)      100kb     1.4Mb     300kb
miRBase            60kb      55kb      28kb

Assay differences
                        Nimblegen   Agilent   Illumina
Starting DNA (µg)       3           3         1
Duration (days)         7           3.5       3.5
Hybridization (hours)   72          24        2 x (16-20)
Pre-hyb PCR             12          4-6       10
Post-hyb PCR            18          10-12     10
qPCR validation         Yes         None      Recommended
Molecule used           DNA         RNA       DNA
Automation              Yes         Yes       Yes

Experiment!
Compare enrichment assays

Data filtering and normalization
(percentages are relative to the preceding column)
Platform    Raw Reads      Mapped                 Uniquely Mapped        Not Duplicates
Agilent     124,112,466    123,202,356 (99.2%)    112,602,746 (91.4%)    94,779,030 (84.2%)
Nimblegen   184,983,780    183,502,451 (99.2%)    175,593,794 (95.7%)    154,270,343 (87.9%)
Illumina    112,885,944    110,977,932 (98.3%)    100,236,505 (90.3%)    88,759,249 (88.5%)
Normalization: 80,000,000 reads

Targeting efficiency
[Figure]

Targeting efficiency
Results:
  Nimblegen = most efficient
  Illumina = captures the most
  Illumina requires more sequencing
Denser baits result in higher efficiency and less sequencing
Less dense designs take more sequencing, but capture more

Off-target enrichment
[Figure: schematic of read coverage over an exon, with labels Exon, -500b, +500b, Target Region, Coverage, and Off-target on either side]

Off-target enrichment
Platform    On-target   Off-target
Nimblegen   91%         9%
Agilent     87%         13%
Illumina    64%         36%

Off-target enrichments
[Figure: two bar charts, "% overlapping RepeatMasker" and "% overlapping Segdups", each showing on-target and off-target bars for Agilent, Nimblegen, and Illumina. Bar labels: 47.0%, 30.1%, 38.3%, 32.7%, 19.5%, 18.9%, 10.9%, 9.0%, 8.0%, 7.7%, 6.6%, 5.0%; the bar each label belongs to is not recoverable]

Small variant analysis
Comparing sensitivity and trends in small variant detection by exome-seq

Single nucleotide variations
[Figure: three panels for Illumina, Nimblegen, and Agilent. SNVs detected (axis 0 to 60,000); SNP chip concordance (axis 98.0 to 100.0); Reference bias, Ref vs. Alt (axis 0 to 1)]

SNV detection trends
[Figure]

Platform-specific SNVs
[Figure]

Indel detection trends
[Figure]

Indel size distributions
[Figure]

Exome-seq vs. WGS
Comparison of model exome-seq and whole genome sequencing experiments

Create model experiments
[Figure]

Quality correlates with coverage
[Figure: mean variant quality score (axis 0 to 1000) against read count (50M to 80M) for all SNVs and exome-only SNVs]
Read count   Mean base coverage
50M          30
60M          36
70M          42
80M          48

SNV overlap
Exome-seq experiment   WGS experiment   WGS-only   Both     Exome-seq-only
50M reads / 30x        35x              6,126      42,633   5,408
60M reads / 36x        35x              5,634      43,125   5,687
70M reads / 42x        35x              5,329      43,430   5,881
80M reads / 48x        35x              5,083      43,676   6,060

Novel variant rates
[Figure: % novel variants (axis 0.0% to 60.0%) at 50M, 60M, 70M, and 80M reads for four series: Whole exome, Exome-seq-specific, Whole Genome, WGS-specific. Annotation: False-positive rate of experiment-specific variants]

Variant summary
[Figure: pie charts at 60M, 70M, and 80M reads showing WGS-specific, exome-seq-specific, and shared variants]
Both: 43,125 (60M), 43,430 (70M), 43,676 (80M)
Experiment-specific counts shown across the three panels: 1,447; 1,877; 1,521; 1,947; 2,020; 1,384

Experiment-specific disease variants
Illumina TruSeq Exome, normalized to 80M reads
  WGS-specific: 1,383
  Both: 43,676
  Exome-seq-specific: 2,020
    Disease-associated: 467
    Other: 1,553

Conclusions
Bait length, target choice, PCR cycles all influence exome-seq performance
Greater coverage requires greater sequencing
Most important factors to consider:
  Regions covered
  Amount of sequencing to be done
Exome-seq observes important variants missed by a typical WGS experiment
These lessons extend to all target enrichment experiments
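The SNV overlap counts from the exome-seq vs. WGS comparison can be summarized as simple overlap fractions. The sketch below is one possible way to do that and not the analysis used in the talk: the counts are transcribed from the SNV overlap slide, while the function name, the metric names, and the choice of metrics (including the Jaccard index) are assumptions made for this illustration.

```python
# Counts transcribed from the SNV overlap slide, per exome-seq read count:
# (WGS-only, found by both, exome-seq-only)
SNV_OVERLAP = {
    "50M": (6126, 42633, 5408),
    "60M": (5634, 43125, 5687),
    "70M": (5329, 43430, 5881),
    "80M": (5083, 43676, 6060),
}


def overlap_metrics(wgs_only, shared, exome_only):
    """Summarize one two-way overlap (hypothetical helper for illustration)."""
    wgs_total = wgs_only + shared
    exome_total = exome_only + shared
    return {
        "wgs_total": wgs_total,
        "exome_total": exome_total,
        # Fraction of WGS calls that exome-seq also reports.
        "frac_wgs_seen_by_exome": shared / wgs_total,
        # Fraction of exome-seq calls that WGS also reports.
        "frac_exome_seen_by_wgs": shared / exome_total,
        # Shared calls over the union of both call sets.
        "jaccard": shared / (wgs_only + shared + exome_only),
    }


for reads, counts in SNV_OVERLAP.items():
    m = overlap_metrics(*counts)
    print(
        f"{reads}: WGS total {m['wgs_total']}, exome-seq total {m['exome_total']}, "
        f"WGS calls also in exome-seq {m['frac_wgs_seen_by_exome']:.3f}"
    )
```

The WGS total is the same in every row, as expected for a single 35x WGS experiment compared against four down-sampled exome-seq experiments, and the shared fraction rises as exome-seq read count increases.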
Thank you

GC-content bias
High or low GC-content:
  reduces amplification in PCR
  negatively impacts oligonucleotide array hybridization

GC-content bias
[Figure]

SNV detection trends
[Figure]

Indel detection trends
[Figure]
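Because the GC-content slide notes that both high and low GC-content hurt PCR amplification and hybridization, it can be useful to screen target or bait sequences for extreme GC fractions before interpreting coverage. The following is a minimal sketch of that idea. The function names, the example sequences, and the 0.35 and 0.65 cutoffs are all hypothetical choices for illustration and do not come from the talk.

```python
import math


def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) that are G or C.

    Returns NaN when the sequence contains no called bases.
    """
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return math.nan
    return sum(base in "GC" for base in called) / len(called)


def flag_extreme_gc(targets, low=0.35, high=0.65):
    """Return {name: gc} for sequences whose GC fraction is below `low`
    or above `high`. The default cutoffs are arbitrary assumptions."""
    flagged = {}
    for name, seq in targets.items():
        gc = gc_fraction(seq)
        if gc < low or gc > high:
            flagged[name] = gc
    return flagged


# Hypothetical target sequences, for illustration only.
example_targets = {
    "balanced": "ATGCATGCATGC",
    "gc_rich": "GGCGCCGGGCGC",
    "at_rich": "ATTATAATTTAA",
}
print(flag_extreme_gc(example_targets))
```

Sequences with no called bases yield NaN and are never flagged, since comparisons against NaN are false. Stricter handling may be preferable in a real pipeline.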
