Exome Sequencing: a Comparison of Enrichment Technologies

Michael James Clark, Ph.D
August 3rd, 2011

.whoami
  • Michael James Clark, Ph.D
  • Geneticist & Bioinformaticist
  • Stanford University, Mike Snyder Lab
  • mendeliandisorder.blogspot.com
  • Likes: Sequencing Exomes
  • Contact: [email protected]
  • Twitter: @mjcgenetics

Why exome sequencing?
  [Image source: http://maps.google.com]

Why exome sequencing?
  [Image source: http://map.seqanswers.com]

Exome is highly interpretable
  Exome variation
  • Missense variations: determinable amino acid changes
  • Nonsense variations: estimable effect on expression
  • Copy number variations: estimable gene dosage effects
  Non-exome variation
  • Enhancer/promoter variations: mostly indeterminate gene expression effect

Why exome over whole genome?
  [Figure: Cost per Mb of DNA Sequence; log-scale cost axis from $0.10 to $10,000.00. Source: http://www.genome.gov/sequencingcosts/]
  • Whole genome sequencing requires 50-100x more sequencing, and is therefore more expensive.

Why exome over whole genome?
  [Figure: Cost per Genome; log-scale cost axis from $1 to $100,000,000]
  • The cost of whole genome sequencing remains relatively high.
  • Exome-sequencing is <1/10th the cost of WGS.
  Source: http://www.genome.gov/sequencingcosts/

Method: exome enrichment
  • Biotinylated oligonucleotide baits complementary to exome sequences
  • Example images borrowed from Illumina TruSeq exome enrichment workflow

Exome enrichment platform comparisons
  • TruSeq Exome Enrichment Kit
  • SureSelect All Exon 50M
  • SeqCap EZ Exome v2.0

Targeting strategy design differences
  [Figure]

Targeting region differences
  [Figure: target regions by annotation set; three label groups as extracted]
                     Group 1   Group 2   Group 3
  RefSeq (coding)    49kb      200kb     10kb
  RefSeq (UTR)       100kb     400kb     28Mb
  Ensembl (CDS)      100kb     1.4Mb     300kb
  miRBase            60kb      55kb      28kb

Assay differences
                          Nimblegen   Agilent   Illumina
  Starting DNA (µg)       3           3         1
  Duration (days)         7           3.5       3.5
  Hybridization (hours)   72          24        2 x (16-20)
  Pre-hyb PCR             12          4-6       10
  Post-hyb PCR            18          10-12     10
  qPCR validation         Yes         None      Recommended
  Molecule used           DNA         RNA       DNA
  Automation              Yes         Yes       Yes

Experiment!
  Compare enrichment assays

Data filtering and normalization
  Platform    Raw Reads     Mapped                Uniquely Mapped       Not Duplicates
  Agilent     124,112,466   123,202,356 (99.2%)   112,602,746 (91.4%)   94,779,030 (84.2%)
  Nimblegen   184,983,780   183,502,451 (99.2%)   175,593,794 (95.7%)   154,270,343 (87.9%)
  Illumina    112,885,944   110,977,932 (98.3%)   100,236,505 (90.3%)   88,759,249 (88.5%)
  Normalization: 80,000,000 reads

Targeting efficiency
  [Figure]

Targeting efficiency
  Results:
  • Nimblegen = most efficient
  • Illumina = captures the most
  • Illumina requires more sequencing
  • Denser baits result in higher efficiency and less sequencing
  • Less dense designs take more sequencing, but capture more

Off-target enrichment
  [Figure: schematic with labels Exon, -500b, +500b, Target Region, Coverage, Off-target]

Off-target enrichment
  [Figure: pie charts of on-target vs. off-target reads for Nimblegen, Agilent and Illumina]
  • On-target / off-target splits shown: 91% / 9%, 87% / 13%, 64% / 36%

Off-target enrichments
  [Figure: bar charts of % overlapping RepeatMasker and % overlapping Segdups, on-target vs. off-target, for Agilent, Nimblegen and Illumina]
  • Bar labels as extracted: 47.0%, 30.1%, 38.3%, 32.7%, 19.5%, 18.9%, 10.9%, 9.0%, 8.0%, 7.7%, 6.6%, 5.0%

Small variant analysis
  Comparing sensitivity and trends in small variant detection by exome-seq

Single nucleotide variations
  [Figure: three panels for Illumina, Nimblegen and Agilent: SNVs detected (axis 0 to 60,000), SNP chip concordance (axis 98.0 to 100.0), Reference Bias (Ref vs. Alt, axis 0 to 1)]

SNV detection trends
  [Figure]

Platform-specific SNVs
  [Figure]

Indel detection trends
  [Figure]

Indel size distributions
  [Figure]

Exome-seq vs. WGS
  Comparison of model exome-seq and whole genome sequencing experiments

Create model experiments
  [Figure]

Quality correlates with coverage
  [Figure: mean variant quality score (axis 0 to 1000) vs. read count (50M to 80M), for All SNVs and Exome-only]
  Read Count           50M   60M   70M   80M
  Mean Base Coverage   30    36    42    48

SNV overlap
  Exome-seq experiment   WGS only   Both     Exome-seq only
  50M reads / 30x        6,126      42,633   5,408
  60M reads / 36x        5,634      43,125   5,687
  70M reads / 42x        5,329      43,430   5,881
  80M reads / 48x        5,083      43,676   6,060
  Each exome-seq experiment is compared with 35x WGS.

Novel variant rates
  [Figure: % novel variants (axis 0.0% to 60.0%) at 50M, 60M, 70M and 80M reads, for Whole exome, Exome-seq-specific, Whole Genome and WGS-specific]
  • False-positive rate of experiment-specific variants

Variant summary
  [Figure: three pie charts (60M, 70M, 80M reads) of WGS-specific, Exome-seq-specific and Both]
  • Both: 43,125 (60M); 43,430 (70M); 43,676 (80M)
  • WGS-specific and Exome-seq-specific labels as extracted: 1,447; 1,877; 1,521; 1,947; 2,020; 1,384

Experiment-specific disease variants
  Illumina TruSeq Exome, normalized to 80M reads
  • WGS-specific: 1,383
  • Both: 43,676
  • Exome-seq-specific: 2,020
      Disease-associated: 467
      Other: 1,553

Conclusions
  • Bait length, target choice, PCR cycles all influence exome-seq performance
  • Greater coverage requires greater sequencing
  • Most important factors to consider:
      Regions covered
      Amount of sequencing to be done
  • Exome-seq observes important variants missed by a typical WGS experiment
  • These lessons extend to all target enrichment experiments
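
The percentages in the data filtering slide and the Venn counts in the SNV overlap slides can be re-derived with a little arithmetic. The following is a minimal sketch, not the author's pipeline: the counts are copied from the slides, while the function names and two interpretations are assumptions, namely that each percentage is relative to the preceding filter step, and that each Venn reads WGS-only, shared, exome-seq-only.

```python
# Sketch only. Counts come from the slides; reading each percentage as
# "relative to the previous column" and each Venn as
# (WGS-only, shared, exome-seq-only) is an assumption.

# Platform -> (raw, mapped, uniquely mapped, not duplicates)
READ_COUNTS = {
    "Agilent": (124_112_466, 123_202_356, 112_602_746, 94_779_030),
    "Nimblegen": (184_983_780, 183_502_451, 175_593_794, 154_270_343),
    "Illumina": (112_885_944, 110_977_932, 100_236_505, 88_759_249),
}

# Exome-seq reads (millions) -> (WGS-only, shared, exome-seq-only) SNVs
SNV_OVERLAP = {
    50: (6_126, 42_633, 5_408),
    60: (5_634, 43_125, 5_687),
    70: (5_329, 43_430, 5_881),
    80: (5_083, 43_676, 6_060),
}


def stepwise_retention(counts):
    """Percent of reads kept at each filter step, relative to the prior step."""
    return [100.0 * kept / before for before, kept in zip(counts, counts[1:])]


def overlap_summary(wgs_only, shared, exome_only):
    """Per-experiment SNV totals and the share of WGS SNVs seen by exome-seq."""
    wgs_total = wgs_only + shared
    return {
        "wgs_total": wgs_total,
        "exome_total": exome_only + shared,
        "pct_wgs_recovered": 100.0 * shared / wgs_total,
    }


if __name__ == "__main__":
    for platform, counts in READ_COUNTS.items():
        steps = ", ".join(f"{pct:.1f}%" for pct in stepwise_retention(counts))
        print(f"{platform}: {steps}")
    for reads_m, venn in SNV_OVERLAP.items():
        summary = overlap_summary(*venn)
        print(f"{reads_m}M reads: {summary}")
```

Under these assumptions the recomputed retention values agree with the slide to within 0.1 percentage points, and the WGS total is the same (48,759 SNVs) at every exome read depth, which is consistent with one 35x WGS call set being compared against each model exome experiment.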
Thank you

GC-content bias
  High or low GC-content:
  • reduces amplification in PCR
  • negatively impacts oligonucleotide array hybridization

GC-content bias
  [Figure]

SNV detection trends
  [Figure]

Indel detection trends
  [Figure]
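
To make the GC-content point concrete, here is a minimal sketch of how one might compute the GC fraction of target or bait sequences and flag the extremes. The function names, the example sequences, and the 0.35/0.65 cutoffs are all hypothetical; the slides do not state any threshold.

```python
def gc_fraction(seq):
    """Fraction of called bases (A/C/G/T) that are G or C; other symbols are ignored."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A/C/G/T bases")
    return sum(b in "GC" for b in bases) / len(bases)


def flag_gc_extremes(targets, low=0.35, high=0.65):
    """Return {name: gc} for targets whose GC fraction is outside [low, high].

    The default cutoffs are illustrative assumptions, not values from the slides.
    """
    flagged = {}
    for name, seq in targets.items():
        gc = gc_fraction(seq)
        if gc < low or gc > high:
            flagged[name] = gc
    return flagged


if __name__ == "__main__":
    # Hypothetical target sequences for illustration only.
    example_targets = {
        "at_rich": "ATATATTTAAATATAT",
        "balanced": "ATGCATGCATGCATGC",
        "gc_rich": "GCGCGGCCGCGGGCCC",
    }
    for name, gc in flag_gc_extremes(example_targets).items():
        print(f"{name}: GC fraction {gc:.2f}")
```

In practice such a check would be run per target interval against the reference sequence, so that low-coverage regions can be compared with their GC fraction.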