The Utility of Long Read Sequencing for the Discovery of Genomic Retroviral Insertions and for Hybrid Genome Assembly
The Utility of Long Read Sequencing for the Discovery of Genomic Retroviral Insertions and for Hybrid Genome Assembly
by Ryan Salinas
A thesis submitted as fulfilment of the requirement for the degree of Masters of Science
School of Biotechnology and Biomolecular Sciences
University of New South Wales
August, 2017
THE UNIVERSITY OF NEW SOUTH WALES
Thesis/Dissertation Sheet
Surname or Family name: Salinas
First name: Ryan
Other name/s: Matthew
Abbreviation for degree as given in the University calendar: MSC
School: School of Biotechnology and Biomolecular Sciences
Faculty: Faculty of Science
Title: The Utility of Long Read Sequencing for the Discovery of Genomic Retroviral Insertions and for Hybrid Genome Assembly
Abstract 350 words maximum:
Long sequence reads from PacBio technology show promise in resolving hard-to-assemble regions of genomes and should be of utility in hybrid genome assemblies. However, the benefits of such approaches are not yet fully understood. To help address this, two investigations were undertaken on the koala genome, the first which sought to find and characterise koala retrovirus (KoRV) insertions through progressive use of targeted assemblies and the second which used increasing amounts of long read sequence data in hybrid genome assemblies. For the analysis of KoRV insertions, targeted short read-based assemblies generated limited insights into the KoRV insertion points and were unable to reconstruct adjacent gene structures. The use of PacBio long read technology, by contrast, allowed KoRV inserts to be fully assembled and characterised and for adjoining genomic koala genes to be discovered. Within the full long read genome assembly, no cases of KoRV insertion into coding regions of annotated genes were observed. The investigation into use of differing amounts of long reads in hybrid assemblies generated some unexpected results, as measured by commonly used assembly statistics, the presence of conserved gene sets and of repeat regions.
Within the range of assemblies generated, using 20X to 57X short read data with up to 20X PacBio long read data, there was almost no change in the presence of conserved gene sets and thus no improvement in the assembly of protein gene-containing regions in hybrid assembly with long reads. However, the hybrid assemblies saw increases in N50 and maximum scaffold size, and in the assembly of repeat-containing regions. It can be concluded that long sequence reads are of use in the assembly and annotation of regions that contain repeated sequences but their utility in hybrid assemblies, at least for high content regions containing protein coding genes, may be limited.
Declaration relating to disposition of project thesis/dissertation
I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only). The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.
FOR OFFICE USE ONLY
Date of completion of requirements for Award:
ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’
COPYRIGHT STATEMENT
‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.’
AUTHENTICITY STATEMENT
‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis.
No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’
ABSTRACT
Long sequence reads from PacBio technology show promise in resolving hard-to-assemble regions of genomes and should be of utility in hybrid genome assemblies. However, the benefits of such approaches are not yet fully understood. To help address this, two investigations were undertaken on the koala genome, the first which sought to find and characterise koala retrovirus (KoRV) insertions through progressive use of targeted assemblies and the second which used increasing amounts of long read sequence data in hybrid genome assemblies. For the analysis of KoRV insertions, targeted short read-based assemblies generated limited insights into the KoRV insertion points and were unable to reconstruct adjacent gene structures. The use of PacBio long read technology, by contrast, allowed KoRV inserts to be fully assembled and characterised and for adjoining genomic koala genes to be discovered. Within the full long read genome assembly, no cases of KoRV insertion into coding regions of annotated genes were observed. The investigation into use of differing amounts of long reads in hybrid assemblies generated some unexpected results, as measured by commonly used assembly statistics, the presence of conserved gene sets and of repeat regions. Within the range of assemblies generated, using 20X to 57X short read data with up to 20X PacBio long read data, there was almost no change in the presence of conserved gene sets and thus no improvement in the assembly of protein gene-containing regions in hybrid assembly with long reads. However, the hybrid assemblies saw increases in N50 and maximum scaffold size, and in the assembly of repeat-containing regions.
It can be concluded that long sequence reads are of use in the assembly and annotation of regions that contain repeated sequences but their utility in hybrid assemblies, at least for high content regions containing protein coding genes, may be limited.
PUBLICATIONS
Work undertaken in this thesis has been incorporated into two publications.
1. Long-read genome sequence assembly provides insight into ongoing retroviral invasion of the koala germline. Hobbs M, King A, Salinas R, Chen Z, Tsangaras K, Greenwood AD, Johnson RN, Belov K, Wilkins MR, Timms P. Scientific Reports 2017. 7(1):15838.
2. Adaptation and conservation insights from the koala genome. Rebecca N. Johnson, Denis O’Meally, Zhiliang Chen, Graham J. Etherington, Simon Y. W. Ho, Will J. Nash, Catherine E. Grueber, Yuanyuan Cheng, Camilla M. Whittington, Siobhan Dennison, Emma Peel, Wilfried Haerty, Rachel J. O’Neill, Don Colgan, Tonia L. Russell, David E. Alquezar-Planas, Val Attenbrow, Jason G. Bragg, Parice A. Brandies, Amanda Yoon-Yee Chong, Janine E. Deakin, Federica Di Palma, Zachary Duda, Mark D. B. Eldridge, Kyle M. Ewart, Carolyn J. Hogg, Greta J. Frankham, Arthur Georges, Amber K. Gillett, Merran Govendir, Alex D. Greenwood, Takashi Hayakawa, Kristofer M. Helgen, Matthew Hobbs, Clare E. Holleley, Thomas N. Heider, Elizabeth A. Jones, Andrew King, Danielle Madden, Jennifer A. Marshall Graves, Katrina M. Morris, Linda E. Neaves, Hardip R. Patel, Adam Polkinghorne, Marilyn B. Renfree, Charles Robin, Ryan Salinas, Kyriakos Tsangaras, Paul D. Waters, Shafagh A. Waters, Belinda Wright, Marc R. Wilkins, Peter Timms, Katherine Belov. Nature Genetics, in press.
TABLE OF CONTENTS
1 INTRODUCTION
1.1 A BRIEF HISTORY OF GENOME SEQUENCING
1.2 SHORT-READ SEQUENCING AND ASSEMBLY
1.2.1 Sequencing methods