Whole Genome Sequencing


Introduction to the Interpretation of Whole Genome Sequence Data in Food Safety

Introduction

Whole genome sequencing (WGS) for bacterial foodborne pathogen characterization is here to stay. Developments in WGS platforms have made it possible to sequence the entire genome of bacteria at prices comparable to common molecular subtyping methods. These data can provide near-perfect discrimination of bacterial isolates. While common molecular subtyping methods interrogate only small parts of the genome (e.g., restriction sites for pulsed-field gel electrophoresis [PFGE], or the sequence of ~7 loci of ~500 basepairs [bp] each for MLST), WGS approaches make it possible to interrogate more than 99% of the genome, which translates to approximately 2.8 million and 4.8 million basepairs in Listeria monocytogenes and Salmonella enterica, respectively.

What's more, as sequencing technologies and data analytics continue to mature, WGS will provide results more cheaply and quickly than traditional subtyping. U.S. governmental agencies (CDC, FDA, and USDA-FSIS) are beginning to build large, shared databases (e.g., GenomeTrakr) to store WGS data generated from foodborne pathogen isolates collected from routine surveillance or human disease cases, to compare records between isolates, and to use these comparisons to inform regulatory action.

While the field is beginning to coalesce around common WGS sequencing platforms and basic data analysis approaches (1), the application of those platforms and approaches to bacterial food safety has not yet matured. The purpose of this article is to:
1. Introduce whole genome sequencing platforms and analytical approaches to practitioners who have not yet encountered these data in their work;
2. Describe dimensions of WGS analysis for which there is still significant ambiguity in the scientific approaches; and
3. Summarize some outstanding challenges to the application of these methods to bacterial foodborne pathogen subtyping.

We will introduce whole genome sequencing technologies and common analyses, then discuss the application of these methods to regulatory action and different foodborne pathogens.

Sequencing technologies

Sequencing technologies used for WGS can be subdivided into two categories: (i) short-read technologies, which produce sequence reads up to 500 bp (e.g., Illumina, IonTorrent), and (ii) long-read technologies, which produce reads longer than 1000 bp, often with lengths over 70,000 bp (e.g., Pacific Biosciences, Oxford Nanopore). At the time of writing this document (May 2016), the two sequencing platforms most commonly used in WGS are Illumina and Pacific Biosciences. Illumina sequencers (e.g., MiSeq, NextSeq, HiSeq) are popular because of the speed, throughput and high accuracy of the data they produce, allowing bacterial genomes to be sequenced at low cost (between $50 and $100 per bacterial genome). The per-genome cost of Pacific Biosciences sequencers is considerably higher (>$800), making them cost prohibitive for WGS-based typing.

Short-read technologies are best suited for high-throughput applications due to high accuracy and low cost per base sequenced. They are the main technologies used for routine WGS by government agencies and are used for whole genome analogs to nucleotide-based subtyping schemes, such as whole-genome Single Nucleotide Polymorphism (SNP) or Multi Locus Sequence Typing (MLST) analysis. Gaps in genome sequencing (see 'data analysis – genome assembly') prevent interrogation of genome-scale events, such as genome rearrangements or differences in PFGE patterns.

Long-read sequencing technologies can complement short-read technologies, at a higher cost and lower throughput. Their main advantage is that longer reads can often be assembled de novo into a complete genome, either alone or in combination with short-read data. In principle, long-read data could be used to directly calculate PFGE pattern profiles for comparison to existing databases.

Data analysis

The field of computer science called bioinformatics is used to analyze WGS data. This involves algorithm, pipeline and software development, as well as the analysis, transfer, storage and database development of genomics data.

A typical WGS workflow contains the following steps: (i) quality control and data grooming, (ii) genome assembly and/or variant calling, and (iii) post-assembly analysis. Current academic reviews, such as (1), give more detail on these steps than what follows below.

Data quality control and data grooming

Quality control of WGS data involves multiple aspects, but some of the most important involve read quality (e.g., how many sites of a 300 bp read fall below a specific quality threshold), fold coverage or sequence depth, and putative contaminants. Read quality is usually dealt with by data grooming, i.e., removal of low-quality regions of the individual reads with specialized bioinformatics tools. The second aspect involves fold coverage. WGS data typically consist of hundreds of thousands of short sequence reads representing fragments of the genome. Fold coverage or sequence depth refers to the median or average number of reads that cover each nucleotide in a genome. Coverage that is too low will affect the accuracy of downstream analyses, as will coverage that is too high. A commonly overlooked aspect of data quality is contamination with a non-target organism, which can be a laboratory-introduced contaminant or an organism that is co-isolated. This may not pose problems for some downstream analyses, such as reference-based assembly for outbreak detection, but may be problematic for gene detection-based applications such as WGS-based screening of antibiotic-resistance genes.

Genome assembly and/or variant calling

The main objective of WGS analysis is the identification of genomic differences between bacterial strains. Since the raw data of WGS technology are bacterial genome sequence fragments of various sizes (from hundreds to more than 10,000 bp), a fundamental question is how to use those fragments to determine genomic differences, referred to as genomic variants. Bioinformatics pipelines are tools that identify these variants, and are generally referred to as 'variant' callers. Genomic variants include (i) single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs), which indicate a single nucleotide substitution difference between genomes, (ii) insertions and deletions of one or more nucleotides (commonly referred to as indels), and (iii) genomic rearrangements.

One approach to detect variants is to first assemble genomes de novo and then use whole genome alignment-based methods to compare two or more strains. De novo assembly of short-read sequences generally yields so-called draft genome sequences, i.e., genome sequences that still contain gaps. These gaps are generally caused by the presence of repetitive sequences (e.g., rRNA sequence clusters) in the genome. Recent bioinformatic advances in the assembly strategies of long reads from Pacific Biosciences sequencing technologies now make it possible to produce de novo closed genome sequences (i.e., sequences without gaps).

A second approach is the reference mapping approach. In this approach, reads are aligned (mapped) against a (preferably closed) reference genome. After mapping, variants are called from the consensus of the mapped reads. Reference mapping-based approaches are very popular because they are computationally inexpensive and fast compared to de novo assembly. A limitation of this method is its reliance on a reference genome. In particular, if no closely related reference genome is available, mapping against a distantly related genome may lead to problems with variant calling, and unique regions not found in the reference sequence will not be included in downstream analysis.

In addition to de novo assembly and reference mapping-based methods, reference-free de novo variant calling methods exist. These methods do not require reference sequences and are faster than de novo assembly-based methods.

Post assembly analysis

One underappreciated subtlety in WGS analysis is how to interpret the genetic variants between strains as biologically relevant measures of strain difference. This problem has two facets: (i) determining which differences matter and how to count them, and (ii) visualizing the differences in a way that can guide action.

Most WGS analyses use SNPs as the primary measure of genetic distance, although other methods include whole-genome multilocus sequence typing (MLST) [...]

[...] inferred using statistical methods such as the bootstrap for parsimony, maximum likelihood and distance methods, and posterior probabilities for Bayesian methods.

The next challenge is to visualize the differences in a way that can guide action, such as identifying a plausible outbreak or source of contamination. When distances are counted, a reasonable visualization approach is to plot, or report, the differences between groups of strains. For example, one can plot the number of pairwise SNP differences between bacterial isolates within or between outbreaks (2), or individual food establishments (3). Another common choice is to build a phylogenetic tree that displays the calculated evolutionary model of the isolates as a series of splits from a root state. In these trees, clusters of isolates near the 'tips' of the tree are more closely related to each other than to isolates elsewhere in the tree. As a hybrid approach, one can combine SNP counting and phylogenetic tree production to find clusters of isolates and then report the differences [...]
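
Two of the quantities discussed above, fold coverage and pairwise SNP differences between isolates, can be illustrated with a minimal Python sketch. This is not the method of any particular pipeline. The function names, the read count (400,000), the read length (250 bp) and the toy alignment are hypothetical values chosen for illustration; only the 4.8 million bp genome size comes from the text. Average coverage is approximated here as total sequenced bases divided by genome size, and the choice to skip gap ('-') and ambiguous ('N') positions when counting SNPs is one possible convention, not a standard.

```python
from itertools import combinations
from typing import Dict, Tuple


def fold_coverage(num_reads: int, mean_read_length_bp: float, genome_size_bp: int) -> float:
    """Approximate average fold coverage: total sequenced bases / genome size."""
    return num_reads * mean_read_length_bp / genome_size_bp


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count substitution differences between two aligned sequences.

    Positions where either sequence has a gap ('-') or an ambiguous
    base ('N') are skipped (an illustrative convention only).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"-", "N"}
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in skip and b not in skip
    )


def pairwise_snp_distances(alignment: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Return the SNP distance for every unordered pair of isolates."""
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


if __name__ == "__main__":
    # Hypothetical run: 400,000 reads of 250 bp on a 4.8 Mbp genome.
    print(round(fold_coverage(400_000, 250, 4_800_000), 1))  # 20.8

    # Hypothetical toy alignment of three isolates (10 aligned sites).
    toy_alignment = {
        "isolate_A": "ACGTACGTAC",
        "isolate_B": "ACGTACGTAT",
        "isolate_C": "ACGAACGNAT",
    }
    for pair, dist in pairwise_snp_distances(toy_alignment).items():
        print(pair, dist)
```

A table of pairwise distances like this is the raw material for the plots of SNP differences within or between outbreaks mentioned above. Real analyses would take the aligned sites from a variant caller's output, not from hand-written strings.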