A Galaxy-Based Virus Genome Assembly Pipeline A
Total Page:16
File Type:pdf, Size:1020Kb
The Pennsylvania State University The Graduate School VIRAMP: A GALAXY-BASED VIRUS GENOME ASSEMBLY PIPELINE A Thesis in Integrative Biosciences by Yinan Wan © 2014 Yinan Wan Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science December 2014 The thesis of Yinan Wan was reviewed and approved* by the following: Istvan Albert Associate Professor, Bioinformatics, Biochemistry and Molecular Biology Thesis Advisor Moriah L. Szpara Assistant Professor of Biochemistry & Molecular Biology Cooduvalli S. Shashikant Associate Professor of Molecular and Development Biology Co-Director, IBIOS Graduate Program Option in Bioinformatics and Genomics Peter Hudson Willaman Professor of Biology Director, Huck Institutes of the Life Sciences *Signatures are on file in the Graduate School iii ABSTRACT Background Advances in next generation sequencing make it possible to obtain high-coverage sequence data for large numbers of viral strains in a short time. However, since most bioinformatics tools are developed for command line use, the selection and accessibility of computational tools for genome assembly and variation analysis limits the ability of individual scientist to perform further bioinformatics analysis. Findings We have developed a multi-step viral genome assembly pipeline named VirAmp that combines existing tools and techniques and presents them to end users via a web-enabled Galaxy interface. Our pipeline allows users to assemble, analyze and interpret high coverage viral sequencing data with an ease and efficiency that previously was not feasible. Our software makes a large number of genome assembly and related tools available to life scientists and automates the currently recommended best practices into a single, easy to use interface. We tested our pipeline with three different datasets from human herpes simplex virus (HSV). Conclusions VirAmp provides a user-friendly interface and a complete pipeline for viral genome assembly and analysis. We make our software available via an Amazon Elastic Cloud disk image that can be easily launched by anyone with an Amazon web service account. A demonstration version of our system can be found at http://www.viramp.com. We also maintain detailed documentation on each tool and methodology at http://docs.viramp.com. iv TABLE OF CONTENTS List of Figures .......................................................................................................................... vi List of Tables ........................................................................................................................... vii Acknowledgements ................................................................................................................. viii Chapter 1 Introduction ............................................................................................................. 1 Chapter 2 Background and Related Work ............................................................................... 3 Whole-genome variation analysis strategy ...................................................................... 3 Mapping against a reference genome ...................................................................... 3 de novo Assembly .................................................................................................... 4 Viral Genome Related Assemblers .................................................................................. 5 Chapter 3 Assembly Pipeline Description ............................................................................... 7 Data preprocessing ........................................................................................................... 8 Error correction and Coverage Reduction ....................................................................... 9 Two-step semi- de novo assembly ................................................................................... 11 de novo assembly ..................................................................................................... 11 Reference-based schaffolding .................................................................................. 12 Information Recovery ...................................................................................................... 13 Assembly Assessment and Variation Analysis ................................................................ 14 Assembly Assessment .............................................................................................. 14 Assembly-Reference Comparison Report ................................................................ 15 Circos Graph for genome alignment visualization .................................................. 15 Variation Analysis ................................................................................................... 16 Recognising Errors in Assemblies using Paired Reads ................................................... 17 Chapter 4 Example analysis .................................................................................................... 20 Data and Computational Resources Description ............................................................. 20 Comparing the performances at each step of the assembly pipeline ............................... 21 Comparing the final assemblies to the reference genome ............................................... 23 Comparing VIRAMP with other popular assemblers ...................................................... 26 REAPR reference-free evaluation of genome assembly .................................................. 28 Chapter 5 Discussion ............................................................................................................... 31 Final Gap Filling .............................................................................................................. 31 Repetitive Region ............................................................................................................ 33 Assemble genome using single-end reads ....................................................................... 33 v Chapter 6 Conclusion .............................................................................................................. 35 Bibliography ............................................................................................................................ 36 Appendix Demonstration of VirAmp platform ..................................................................... 38 vi LIST OF FIGURES Figure 3-1 Overview of the VIRAMP pipeline ....................................................................... 15 Figure 3-2 Demonstration of Read coverage before and after digital normalization .............. 18 Figure 3-3 Demonstration of SSPACE scaffolding and extension .......................................... 21 Figure 3-4 Circos graph projecting the comparison between assembly and reference genome ............................................................................................................................. 24 Figure 3-5 Demonstration of REAPR core algorithm FCD calculation [18] .......................... 25 Figure 3-6 Gap identification in REAPR[18] .......................................................................... 25 Figure 4-1 Comparing the final assembly with the reference genome .................................... 31 Figure 4-2 VIRAMP visualization of the alignment between the final assembly and reference genome. ............................................................................................................ 33 Appendix Figure 1 Settings to run the VirAmp pipeline ......................................................... 44 Appendix Figure 2 A published Galaxy workflow connects individual tools in the platform and demonstrates VirAmp paired-end assembly. ............................................. 46 Appendix Figure 3 VirAmp with result file at pipeline complete status ................................. 47 vii LIST OF TABLES Table 3-1 Example of Assembly-Reference Comparison Report ........................................... 15 Table 4-1 Statistics of assemblies at each step of the VIRAMP pipeline ............................... 21 Table 4-2 The contigs coordinates in reference gnome ........................................................... 21 Table 4-3 Comparison report between the final assembly and reference genome .................. 23 Table 4-4 Comparing VirAmp results with three other popular Assembly pipelines ............. 27 Table 4-5 Reapr evaluation of assemblies from different pipelines ........................................ 29 viii ACKNOWLEDGEMENTS First and foremost, I would like to express my sincere gratitude to my advisor Prof. Istvan Albert, who generously supported me during my graduate education. His guidance has helped me in all aspect of my research as well as in developing my verbal and written skills. At the same time he has also offered me advice and support beyond the academic while helping me through various difficult situations. I would like to thank Prof. Moriah Szpara for starting the collaboration that lead to this thesis project, sharing her insights and thoughts while providing me with the training that allowed me to complete this project. I would also like to thank Prof. Cooduvalli Shashikant for being my committee member. In his role as the co-director in Bioinformatics & Genomics program, Dr.Shashikant also offered me help and support all the way along for the past three years, helping