Todays Menu: What Is a Genome?
Total Page:16
File Type:pdf, Size:1020Kb
Todays Menu: BGGN 213 • What is a Genome? • Genome sequencing and the Human genome project Genome Informatics I • What can we do with a Genome? Lecture 13 • Compare, model, mine and edit Barry Grant • Modern Genome Sequencing • 1st, 2nd and 3rd generation sequencing • Workflow for NGS http://thegrantlab.org/bggn213 • RNA-Sequencing and Discovering variation Side note! What is a genome? Genetics and Genomics The total genetic Genetics is primarily the study of individual genes, material of an • mutations within those genes, and their inheritance organism by which patterns in order to understand specific traits. individual traits are encoded, controlled, and ultimately passed • Genomics expands upon classical genetics and on to future considers aspects of the entire genome, typically generations using computer aided approaches. Side note! Genomes come in Side note! many shapes Under a microscope, a Eukaryotic cell's genome (i.e. collection of chromosomes) resembles a chaotic jumble of • Primarily DNA, but can be noodles. The looping is not RNA in the case of some random however and appears to play a role in controlling viruses gene regulation. Prokaryote • Some genomes are circular, others linear Bacteriophage • Can be organized into discrete units (chromosomes) or freestanding molecules (plasmids) Image credit: Eukaryote Scientific American March 2019 Genomes come in Genome Databases many sizes NCBI Genome: http://www.ncbi.nlm.nih.gov/genome 100,000 Eukaryotes Prokaryotes 1,000 Viruses 10 Number of protein coding genes Number of protein 0.01 1.00 100.00 Genome size (Mb) Modified from image by Estevezj / CC BY-SA Early Genome Sequencing The First Sequenced Genomes • Chain-termination “Sanger” sequencing was developed in 1977 by Frederick Sanger, colloquially referred to as the “Father of Genomics” • Sequence reads were typically 750-1000 base pairs Bacteriophage φ-X174 Haemophilus influenzae in length with an error rate • Completed in 1977 • Completed in 1995 of ~1 / 10000 bases • 5,386 base pairs, ssDNA • 1,830,140 base pairs, dsDNA • 11 genes • 1740 genes http://en.wikipedia.org/wiki/Frederick_Sanger http://en.wikipedia.org/wiki/Phi_X_174 http://phil.cdc.gov/ The Human Genome Project • The Human Genome Project (HGP) was an international, public consortium that began in 1990 – Initiated by James Watson – Primarily led by Francis Collins – Eventual Cost: $2.7 Billion • Celera Genomics was a private corporation that started in 1998 – Headed by Craig Venter – Eventual Cost: $300 Million • Both initiatives released initial drafts of the human genome in 2001 – ~3.2 Billion base pairs, dsDNA – ~20,000 genes Modern Genome Sequencing Rapid progress of genome sequencing • Next Generation Sequencing (NGS) technologies have resulted in a paradigm shift from long reads at low coverage to short reads at high coverage • This provides numerous opportunities for new and expanded genomic applications Reference Cost to sequence a Humangenome (in USD) Reads Year Image source: https://en.wikipedia.org/wiki/Carlson_curve Rapid progress of Major impact areas for genome sequencing genomic medicine • Cancer: Identification of driver mutations and drugable variants, Molecular stratification to guide and monitor treatment, Identification of tumor specific variants for personalized immunotherapy approaches (precision medicine). 20,000 fold • Genetic disease diagnose: Rare, inherited and so-called ‘mystery’ change in disease diagnose. the last decade! • Health management: Predisposition testing for complex diseases Cost to sequence a (e.g. cardiac disease, diabetes and others), optimization and MRI: $4k Humangenome (in USD) avoidance of adverse drug reactions. • Health data analytics: Incorporating genomic data with additional health data for improved healthcare delivery. Year Image source: https://en.wikipedia.org/wiki/Carlson_curve What can go wrong in Goals of Cancer Genomics Genome Research WhatWhat cancan gocancergo wrongwrong genomes? inin cancercancer genomes?genomes? • Identify changes in the genomes of TypeType of of change change SomeSome common common technology technology to to study study changes changes tumors that drive cancer progression DNADNA mutations mutations WGS,WGS, WXS WXS DNADNA structural structural variations variations WGSWGS • Identify new targets for therapy CopyCopy number number variation variation (CNV) (CNV) CGHCGH array, array, SNP SNP array, array, WGS WGS DNADNA methylation methylation MethylationMethylation array, array, RRBS, RRBS, WGBS WGBS • Select drugs based on the genomics of mRNAmRNA expression expression changes changes mRNAmRNA expression expression array, array, RNA RNA-seq-seq the tumor miRNAmiRNA expression expression changes changes miRNAmiRNA expression expression array, array, miRNA miRNA-seq-seq ProteinProtein expression expression ProteinProtein arrays, arrays, mass mass spectrometry spectrometry • Provide early cancer detection and treatment response monitoring Marc Rosenthal WGSWGS = =whole whole genome genome sequencing, sequencing, WXS WXS = =whole whole exome exome sequencing sequencing RRBSRRBS = =reduced reduced representation representation bisulfite bisulfite sequencing, sequencing, WGBS WGBS = =whole whole genome genome bisulfite bisulfite sequencing sequencing • Utilize cancer specific mutations to derive neoantigen immunotherapy approaches DNA Sequencing Concepts Modern NGS Sequencing Platforms • Sequencing by Synthesis: Uses a polymerase to incorporate and assess nucleotides to a primer sequence – 1 nucleotide at a time • Sequencing by Ligation: Uses a ligase to attach hybridized sequences to a primer sequence – 1 or more nucleotides at a time (e.g. dibase) Modified from Mardis, ER (2011), Nature, 470, pp. 198-203 Illumina – Reversible terminators Illumina Sequencing - Video Enzymatic amplification on glass surface Polymerase-mediated 1 incorporation of end blocked 2 fluorescent nucleotides Fluorescent emission from incorporated dye-labeled nucleotides 3 https://www.youtube.com/watch?src_vid=womKfikWlxM&v=fCd6B5HRaZ8 Cleave dye & blocking group, repeat… Images adapted from: Metzker, ML (2010), Nat. Rev. Genet, 11, pp. 31-46 NGS Sequencing Terminology Summary: “Generations” of DNA Sequencing Insert Size Sequence Coverage length 6X 0 1 2 3 4 5 6 7 8 9 10 11 12 0 200 400 600 insert size Base coverage by sequence Schadt, EE et al (2010), Hum. Mol. Biol., 19(RI2), pp. R227-R240 Third Generation Sequencing The first direct RNA Side-Note: sequencing by nanopore • Currently in active development • Hard to define what “3rd” generation means • For example this new nanopore sequencing method was just published! • Typical characteristics: https://www.nature.com/articles/nmeth.4577 – Long (1,000bp+) sequence reads • "Sequencing the RNA in a biological sample can unlock a – Single molecule (no amplification step) wealth of information, including the identity of bacteria and – Often associated with nanopore technology viruses, the nuances of alternative splicing or the transcriptional state of organisms. However, current methods • But not necessarily! have limitations due to short read lengths and reverse transcription or amplification biases. Here we demonstrate nanopore direct RNA-seq, a highly parallel, real-time, single- molecule method that circumvents reverse transcription or amplification steps.” SeqAnswers Wiki & BioStars Side-Note: A good repository of analysis software can be found at http://seqanswers.com and https://www.biostars.org/ What can we do with all this sequence information? Population Scale Analysis “Variety’s the very spice of life” We can now begin to assess genetic differences on a -William Cowper, 1785 very large scale, both as naturally occurring variation in human and non-human populations as well “Variation is the spice of life” somatically within tumors -Kruglyak & Nickerson, 2001 • While the sequencing of the human genome was a great milestone, the DNA from a single person is not representative of the millions of potential differences that can occur between individuals • These unknown genetic variants could be the cause of many phenotypes such as differing morphology, susceptibility to disease, or be completely benign. https://www.genomicsengland.co.uk/the-100000-genomes-project/ Germline Variation Somatic Variation • Mutations in the germline are passed along to • Mutations in non-germline cells that are not passed offspring and are present along to offspring in the DNA over every cell • Can occur during mitosis or • In animals, these from the environment itself typically occur in meiosis • Are an integral part in tumor during gamete progression and evolution differentiation Darryl Leja, Courtesy: National Human Genome Research Institute. Types of Genomic Variation Differences Between Individuals • Single Nucleotide Polymorphisms AATCTGAGGCAT The average number of genetic differences in the (SNPs) – mutations of one AATCTCAGGCAT nucleotide to another germline between two random humans can be broken down as follows: • Insertion/Deletion Polymorphisms (INDELs) – small mutations removing AATCTGAAGGCAT or adding one or more nucleotides AATCT--AGGCAT • 3,600,000 single nucleotide differences at a particular locus • 344,000 small insertion and deletions • Structural Variation • 1,000 larger deletion and duplications (SVs) – medium to large sized rearrangements of chromosomal DNA Numbers change depending on ancestry! Darryl Leja, Courtesy: National Human Genome Research Institute. [ Numbers from: 1000 Genomes Project, Nature, 2012 ] Discovering Variation: SNPs and INDELs Discovering