How Do Biologists Assemble Genomes?
Graph Algorithms
Phillip Compeau and Pavel Pevzner
Bioinformatics Algorithms: An Active Learning Approach
©2015 by Compeau and Pevzner. All rights reserved.

The Newspaper Problem

Outline
• What Is Genome Sequencing?
• The String Reconstruction Problem
• String Reconstruction as a Walk in the Overlap Graph
• Another Graph for String Reconstruction
• The Seven Bridges of Königsberg
• Euler’s Theorem
• De Bruijn Graphs Face Harsh Realities of Assembly

What’s Inside the Cell Nucleus?

Contents of the Nucleus
• The nucleus contains chromosomes.
• Humans have 23 pairs of chromosomes (one in each pair comes from each parent).
• But what are chromosomes made of?

DNA: The Building Block of Life
• One more zoom, and we reach the molecular level.
• Early 1950s: Researchers start uncovering properties of the chromosomal substance now called deoxyribonucleic acid (DNA).
• 1953: Watson and Crick publish the “double helix” structure of DNA.

DNA’s Double Helix

DNA’s Molecular Structure

Molecular Structure of DNA
• Nucleotide: Half of one “rung” of DNA.
• Four choices for the base of a nucleotide:
1. Adenine (A)
2. Cytosine (C)
3. Guanine (G), which bonds to C
4. Thymine (T), which bonds to A

Why Is DNA Important?
Central Dogma of Molecular Biology: DNA is transcribed into RNA, which is then translated into protein (a chain of amino acids).
Courtesy: Rachel Raynes

Genome: A Long DNA “Book”
• Genome: The nucleotide sequence read down one side of an organism’s chromosomal DNA.
…CCGTAGTCGCATGGAACAGTATACGAGACAGTACAGATACGATACGATACGATCATTAACCGAGAGTACCAGATTCCAGATCATAC
TTACGCTTAGCTACGGACGTACGATACCCAGATTACGATCCATATAGATATAACCGGTGTGTCTTGCTAATACGTAACGGGGTGCCT
TCGATAGGTCAGAATACCAGATCTCTCGATCTTCTTACAGATACTACGATCCCCAGATACTACCCCTACTGACCCATCGTACGGGTA
CTACTACGGATATGATACCGATGTAGAGGGATCCATATATCCCGAGACGTCTCGCGCATAAGATCATCGTCTAGATACACGTACGTA
CTAGACTAGCGTATGCCTCTTATGATCGTCCCGATCGAGTCGCGTGCTCAGAAAAGCTACGATACGATACCCGATACTAGACCATAG…
• A human genome has about 3 billion nucleotides.
• Biologists want to be able to read this book. This is what it means to sequence a genome.

We Share 99.9% of Our Genomes
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGA
TCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTAT
CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA
CTATACGAGCTACTACGTACGTACGATCGCGGGACTATTA
TCGACTACAGATAAAACATGCTAGTACAACAGTATACATA
GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGA
TCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCTAT
CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA
CTATACGAGCTACTACGTACGTACGATCGCGTGACTATTA
TCGACTACAGATGAAACATGCTAGTACAACAGTATACATA
GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

Species vs. Individual Sequencing
Species Sequencing: What is the “consensus” genome of an entire species?
Individual Sequencing: What makes an individual unique within their species?

Why Sequence a Species’s Genome?
[Figure: tree of life, from Hug et al., 2016, Nature Microbiology, DOI: 10.1038/NMICROBIOL.2016.48]
Figure 1 | A current view of the tree of life, encompassing the total diversity represented by sequenced genomes. The tree includes 92 named bacterial phyla, 26 archaeal phyla and all five of the Eukaryotic supergroups. Major lineages are assigned arbitrary colours and named, with well-characterized lineage names, in italics.
Lineages lacking an isolated representative are highlighted with non-italicized names and red dots. For details on taxon sampling and tree inference, see Methods. The names Tenericutes and Thermodesulfobacteria are bracketed to indicate that these lineages branch within the Firmicutes and the Deltaproteobacteria, respectively. Eukaryotic supergroups are noted, but not otherwise delineated due to the low resolution of these lineages. The CPR phyla are assigned a single colour as they are composed entirely of organisms without isolated representatives, and are still in the process of definition at lower taxonomic levels. The complete ribosomal protein tree is available in rectangular format with full bootstrap values as Supplementary Fig. 1 and in Newick format in Supplementary Dataset 2.
Nature Microbiology | www.nature.com/naturemicrobiology © 2016 Macmillan Publishers Limited. All rights reserved.

Why Sequence an Individual’s Genome?
Personalized Medicine: Tailoring medical treatment to the individual based on their genetics.
2010: First person whose life was saved due to genome sequencing.

Brief History of Genome Sequencing
1977: Walter Gilbert and Frederick Sanger develop sequencing techniques independently.
1980: They share the Nobel Prize. The resulting sequencing methods cost $1 per nucleotide.
1990: The public Human Genome Project, headed by Francis Collins, aims to sequence the human genome.
1997: Craig Venter founds Celera Genomics, a private firm with the same goal.
2000: Draft of human genome is simultaneously completed by the Human Genome Consortium (public) and Celera Genomics (private).
2000s: Race is on to sequence other mammalian genomes.
2008: US passes the Genetic Information Nondiscrimination Act.
2013: UK declares public funding to sequence 100,000 human genomes.
2015: Illumina reduces cost of sequencing an individual human genome to $1,000.

The Future of Genomics

What Makes Genome Sequencing Hard?
Sequencing machines can only read short pieces of DNA (~250 nucleotides long), called reads.

General Idea of Genome Assembly
Multiple identical copies of a genome
Shatter the genome into reads
Sequence the reads: AGAATATCA, TGAGAATAT, GAGAATATC
Assemble the genome using overlapping reads:
     AGAATATCA
    GAGAATATC
   TGAGAATAT
...TGAGAATATCA...

STOP and Think: What does this remind you of?

Complications in Genome Assembly
1. DNA is double-stranded (and may consist of multiple chromosomes).
2. Reads have imperfect coverage of the underlying genome.
3. Sequencing machines are error-prone.

Assumptions for Genome Assembly
1. DNA is single-stranded (and consists of a single chromosome, as in most bacteria).
2. Reads have perfect coverage of the underlying genome: every k-mer in the genome is present.
3. Sequencing machines are error-free.

k-mer Composition
The k-mer composition of a string Text, denoted Composition_k(Text), is the collection of all k-mer substrings of Text (including repeats).
Composition_3(TATGGGGTGC) = {ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG}

String Reconstruction Problem
String Reconstruction Problem: Reconstruct a string from its k-mer composition.
• Input: An integer k and a collection Patterns of k-mers.
• Output: A string Text with k-mer composition equal to Patterns (if such a string exists).
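
The definition of k-mer composition and the output condition of the String Reconstruction Problem can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `composition` and `is_reconstruction` are hypothetical and do not come from the slides, and the k-mers are returned in lexicographic order simply to match the listing used in the example above.

```python
from collections import Counter


def composition(text: str, k: int) -> list[str]:
    """Return every k-mer substring of text (repeats kept), sorted lexicographically."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Slide a window of width k across text; a text shorter than k yields no k-mers.
    return sorted(text[i:i + k] for i in range(len(text) - k + 1))


def is_reconstruction(text: str, patterns: list[str]) -> bool:
    """Check whether the k-mer composition of text equals patterns (hypothetical helper)."""
    if not patterns:
        return False
    k = len(patterns[0])
    # Compare as multisets so that repeated k-mers such as GGG are counted correctly.
    return Counter(composition(text, k)) == Counter(patterns)


print(composition("TATGGGGTGC", 3))
# ['ATG', 'GGG', 'GGG', 'GGT', 'GTG', 'TAT', 'TGC', 'TGG']
```

Note that `is_reconstruction` only verifies a candidate string; it does not find one. Finding a valid Text from Patterns is the problem itself.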
Exercise Break: Reconstruct a string having the 3-mer composition {AAT, ATG, GTT, TAA, TGT}. What algorithm did you use?

HOW DO WE ASSEMBLE GENOMES?
Solving the String Composition Problem is a straightforward exercise, but in order to model genome