Easy and Accurate Reconstruction of Whole HIV Genomes from Short-Read Sequence Data

bioRxiv preprint doi: https://doi.org/10.1101/092916; this version posted December 9, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license. Easy and Accurate Reconstruction of Whole HIV Genomes from Short-Read Sequence Data Chris Wymant1,2,*, Fran¸cois Blanquart2, Astrid Gall3, Margreet Bakker4, Daniela Bezemer5, Nicholas J. Croucher2, Tanya Golubchik1,6, Matthew Hall1,2, Mariska Hillebregt5, Swee Hoe Ong3, Jan Albert7,8, Norbert Bannert9, Jacques Fellay10,11, Katrien Fransen12, Annabelle Gourlay13, M. Kate Grabowski14, Barbara Gunsenheimer-Bartmeyer15, Huldrych Günthard16,17, Pia Kivelä18, Roger Kouyos16,17, Oliver Laeyendecker19, Kirsi Liitsola18, Laurence Meyer20, Kholoud Porter13, Matti Ristola18, Ard van Sighem5, Guido Vanham21, Ben Berkhout4, Marion Cornelissen4, Paul Kellam22,23, Peter Reiss5, Christophe Fraser1,2, and The BEEHIVE Collaborationy 1Oxford Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, Nuffield Department of Medicine, University of Oxford, UK 2Medical Research Council Centre for Outbreak Analysis and Modelling, Department of Infectious Disease Epidemiology, Imperial College London, London, UK 3Virus Genomics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK 4Laboratory of Experimental Virology, Department of Medical Microbiology, Center for Infection and Immunity Amsterdam (CINIMA), Academic Medical Center of the University of Amsterdam, Amsterdam, The Netherlands 5Stichting HIV Monitoring, Amsterdam, The Netherlands 6Wellcome Trust Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, UK 7Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Stockholm, Sweden 8Department of Clinical Microbiology, Karolinska University Hospital, Stockholm, Sweden 9Division for HIV and other Retroviruses, Robert Koch Institute, Berlin, Germany 10School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Switzerland 11Swiss Institute of Bioinformatics, Lausanne, Switzerland 12HIV/STI reference laboratory, WHO collaborating centre, Institute of Tropical Medicine, Department of Clinical Science, Antwerpen, Belgium 13Department of Infection and Population Health, University College London, London, UK 14John Hopkins University, Baltimore, USA 15Department of Infectious Disease Epidemiology, Robert Koch-Institute, Berlin, Germany 16Division of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, Zurich, Switzerland 17Institute of Medical Virology, University of Zurich, Switzerland 18Department of infectious Diseases, Helsinki University Hospital, Helsinki, Finland 19Laboratory of Immunoregulation, NIAID, NIH, Baltimore, USA 20INSERM CESP U1018, UniversitéParis Sud, UniversitéParis Saclay, APHP, Service de SantéPublique, Hôpitalde Bicêtre, Le Kremlin-Bicêtre, France 21Virology Unit, Immunovirology Research Pole, Biomedical Sciences Department, Institute of Tropical Medicine, Antwerpen, Belgium 22Kymab Ltd, Cambridge, UK 23Division of Infectious Diseases, Department of Medicine, Imperial College London, London, UK *To whom correspondance should be addressed: [email protected] ySee attached document for a complete list. Abstract Next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of rapid between- and within-host evolution may have been a deterrent. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by effectively aligning the reads to themselves, producing a set of sequences called contigs. However contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to preprocess reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with existing references sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We use shiver to reconstruct the consensus sequence and minority variant information from paired-end short read data produced with the Illumina platform, for 65 existing publicly available samples and 50 new samples. We show the systematic superiority of mapping to shiver's constructed reference over mapping the same reads to the standard reference HXB2: an average of 29 bases per sample are called differently, of which 98.5% are supported by higher coverage. We also provide a practical guide to working with imperfect contigs. 1 bioRxiv preprint doi: https://doi.org/10.1101/092916; this version posted December 9, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license. 1 Introduction founder virus The genetic sequences of pathogens are a rich data source sample for studying their epidemiology and evolution, and pro- taken transmission within-host diversification vide information for vaccine and therapeutic design. In the bottleneck RNA past decade, next-generation sequencing (NGS) has trans- extraction formed genomics, with decreasing costs and enormous in- creases in the amount of data available. Despite the suc- _____ cess of NGS in other fields, sequencing of human immun- _____ _____ odeficiency virus (HIV) is still largely based on the older _____ method of Sanger sequencing. For example, on the com- _____ prehensive Los Alamos HIV database [1], of the 119,237 _____ short reads RT PCR _____ _____ _____ _____ _____ fragmentation _____ _____ _____ _____ _____ _____ _____ _____ samples with platform information, 91.6% were generated _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ and sequencing _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ by Sanger sequencing, 7.0% with the Roche 454 platform, _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ 1.4% with Illumina platforms, and 0.02% with the IonTor- _____ _____ _____ _____ _____ _____ _____ _____ _____ rent platform. Restricting to the 38,635 samples dating _____ from 2010 or later, these numbers change only to 94.6% ? Sanger sequencing, 2.0% 454, 3.4% Illumina and 0.02% IonTorrent. Figure 1: Interpreting next-generation sequencing data for More broadly, NGS has been hugely successful both HIV. for sequencing samples with no within-sample diversity, and at the opposite end of the spectrum, for metage- nomic studies. In the first case, any apparent within- The quasispecies in one patient can be summarised by sample diversity is attributable to sequencing error; in the the consensus sequence { the àverage' sequence of those latter case, there is no presumption that different frag- virions sampled, as represented in the reads. Determining ments of DNA have the same origin, and so each fragment the most common base at each position in the genome, is checked against large databases to catalogue within- and which other bases are present and at what frequen- sample diversity [2, 3]. cies, requires the reads to be aligned. To what should HIV is an intermediate case: the long duration of they be aligned? Mapping (aligning) to a reference too chronic infection coupled with high rates of replication and far from the quasispecies' true consensus leads to biased mutation mean that a single infection, and hence a single loss of information [9{12]. Like any form of sequence align- sample, will contain a diverse collection of related viral ment, mapping relies upon sequence similarity; the more particles, frequently called a quasispecies. Reconstructing a read differs from its reference, the less likely it is to be different aspects of these quasispecies from reads (frag- aligned correctly or at all. This hides differences between ments of sequence; see Fig. 1) has proven technically chal- the sample and the reference, giving a consensus genome lenging, and may have been a significant obstacle to the erroneously similar the reference chosen. widespread adoption

Easy and Accurate Reconstruction of Whole HIV Genomes from Short-Read Sequence Data

A Comprehensive Workflow for Variant Calling Pipeline Comparison and Analysis Using R Programming

Property Graph Vs RDF Triple Store: a Comparison on Glycan Substructure Search

Genomic Alignment (Mapping) and SNP / Polymorphism Calling

Genetics 211 - 2018 Lecture 3

Introduction to Bioinformatics (Elective) – SBB1609

Meeting Booklet

UNIVERSITY of CALIFORNIA RIVERSIDE SNP Calling Using Genotype Model Selection on High-Throughput Sequencing Data a Dissertation

Rapid and Comprehensive Quality Assessment of Raw Sequence Reads

Supplementary Figure 1. Seb1 Interacts with the CF-CPF Complex

BIOINFORMATICS APPLICATIONS NOTE Doi:10.1093/Bioinformatics/Btp352

N-Glycosylation Profiles As a Risk Stratification Biomarker for Type II Diabetes Mellitus and Its Associated Factors

Provisional- Reference Genome’