Viral Quasispecies Reconstruction Using Next Generation Sequencing Reads
Total Page:16
File Type:pdf, Size:1020Kb
Georgia State University ScholarWorks @ Georgia State University Computer Science Dissertations Department of Computer Science Summer 8-12-2013 Viral Quasispecies Reconstruction Using Next Generation Sequencing Reads Bassam A. Tork Georgia State University Follow this and additional works at: https://scholarworks.gsu.edu/cs_diss Recommended Citation Tork, Bassam A., "Viral Quasispecies Reconstruction Using Next Generation Sequencing Reads." Dissertation, Georgia State University, 2013. https://scholarworks.gsu.edu/cs_diss/77 This Dissertation is brought to you for free and open access by the Department of Computer Science at ScholarWorks @ Georgia State University. It has been accepted for inclusion in Computer Science Dissertations by an authorized administrator of ScholarWorks @ Georgia State University. For more information, please contact [email protected]. VIRAL QUASISPECIES RECONSTRUCTION USING NEXT GENERATION SEQUENCING READS by BASSAM TORK Under the Direction of Dr. Alexander Zelikovsky ABSTRACT The genomic diversity of viral quasispecies is a subject of great interest, especially for chronic infections. Characterization of viral diversity can be addressed by high-throughput sequencing technology (454 Life Sciences, Illumina, SOLiD, Ion Torrent, etc.). Standard assembly software was originally designed for single genome assembly and cannot be used to assemble and estimate the frequency of closely related quasispecies sequences. This work focuses on parsimonious and maximum likelihood models for assembling viral quasispecies and estimating their frequencies from 454 sequencing data. Our methods have been applied to several RNA viruses (HCV, IBV) as well as DNA viruses (HBV), genotyped using 454 Life Sciences amplicon and shotgun methods. INDEX WORDS: RNA viruses, Viral quasispecies, Genome assembly VIRAL QUASISPECIES RECONSTRUCTION USING NEXT GENERATION SEQUENCING READS by BASSAM TORK A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the College of Arts and Sciences Georgia State University 2013 Copyright by Bassam Tork 2013 VIRAL QUASISPECIES RECONSTRUCTION USING NEXT GENERATION SEQUENCING READS by BASSAM TORK Committee Chair: Dr. Alexander Zelikovsky Committee: Dr. Yury Khudyakov Dr. Robert Harrison Dr. Rajshekhar Sunderraman Electronic Version Approved: Office of Graduate Studies College of Arts and Sciences Georgia State University August 2013 iv DEDICATION To my parents Abdel Rahman Tork and Mariam Tork. v ACKNOWLEDGEMENTS I would especially like to thank my advisor Dr. Alexander Zelikovsky for his encourage- ment, advisement and constant support over the years under his supervision at Georgia State University. This dissertation could not have been done without his guidance, patience and motivation. I have learned more about computer science and conducting research from him than virtually everyone else. I would especially like to thank my committee members: Dr. Yury Khudyakov, Dr. Robert Harrison, and Dr. Rajshekhar Sunderraman. Their critique of my work with no doubt strengthened and improved it. Special thanks to Professor Ion Mandoiu and Professors I.Khan and O'Neill of University of Connecticut for their valuable advisement. Special thanks to Dr. Yi Pan Chairman of the Computer Science Department for his encouragement, advisement and support over the years of my PhD study. Special thanks to Dr. Rajshekhar Sunderraman for his encouragement, advisement and support over the years of my PhD study. Special thanks to Computer Science Department of Georgia State University faculty, staff and students. Special thanks to the GSU Molecular Basis of Disease Initiative (MBD). Special thanks go out to my dear friends Nick, Hamed, Serghei, Adrian, Sasha, Blanche, Olga, and Ekaterina. Finally I am grateful to my parents, my brothers, my sisters and all my friends for their support and ecouragement. Special thanks go from my heart to my nice family and daughters whom I miss a lot. vi TABLE OF CONTENTS ACKNOWLEDGEMENTS ::::::::::::::::: v LIST OF TABLES :::::::::::::::::::: ix LIST OF FIGURES :::::::::::::::::::: xi LIST OF ABBREVIATIONS :::::::::::::::: xiv PART 1 INTRODUCTION ::::::::::::::: 1 1.1 Viral Quasispecies ............................. 1 1.2 Challenges of Quasispecies Reconstruction From NGS Reads . 2 1.3 Viral Quasispecies Spectrum Reconstruction Problem . 2 1.4 Existing Approaches For Sequencing Viral Quasispecies . 3 1.5 Main Contributions ............................ 4 1.6 Future Work ................................ 5 1.7 Roadmap of the Rest of the Dissertation ............... 5 1.8 Publications ................................. 6 1.9 Oral Presentations ............................. 7 1.10 Book Chapters ............................... 7 1.11 Posters .................................... 7 PART 2 STATE-OF-THE-ART IN VIRAL QUASISPECIES RE- CONSTRUCTION ::::::::::::::: 9 2.1 Introduction ................................. 9 2.2 NGS Technologies ............................. 10 2.3 Existing Tools ................................ 17 2.3.1 ShoRAH . 18 vii 2.3.2 QuRe . 19 2.3.3 Error Correction . 20 PART 3 QUASISPECIES SPECTRUM RECONSTRUCTION FROM SHOTGUN READS :::::::::::::: 21 3.1 ViSpA .................................... 21 3.1.1 Read Error Correction . 22 3.1.2 Read Alignment . 22 3.1.3 Processing of 454 Reads . 23 3.1.4 Read Graph . 23 3.1.5 Candidate path selection. 23 3.1.6 Estimation of candidate quasispecies sequence frequencies. 23 3.2 Fixing Erroneous Homopolymer Indels via Protein Alignment . 24 PART 4 QUASISPECIES SPECTRUM RECONSTRUCTION FROM AMPLICON READS :::::::::::::: 27 4.1 Theoretical Analysis of Quasispecies Assembly Problem for Ampli- con Reads .................................. 27 4.1.1 Quasispecies Assembly in the Error-Free, Ideal-Frequency Model Prob- lem . 28 4.1.2 Reduction of Skewed-Frequency Model to Ideal-Frequency Model 31 4.2 Viral Quasispecies Reconstruction Algorithms From Amplicon Reads 32 4.2.1 Read Graph . 32 4.2.2 Maximum Bandwidth Path . 33 4.2.3 Maximum Frequency Path. 33 4.2.4 Greedy Algorithm for Resolving Forks. 33 4.3 Data Sets .................................. 34 4.3.1 Simulated Reads . 34 4.3.2 HBV Data . 34 viii 4.4 Validation Methods ............................ 34 4.4.1 Quasispecies Assembling Validation . 35 4.4.2 Validation of Frequency Distribution . 35 4.5 Experimental Results ........................... 35 4.5.1 Simulated Reads . 35 4.5.2 Real Reads . 42 PART 5 RECONSTRUCTION OF IBV QUASISPECIES ::: 44 5.1 Introduction ................................. 44 5.2 Methods ................................... 45 5.2.1 Compared Methods . 46 5.3 Data Sets and Golden Standards .................... 47 5.4 Validation .................................. 48 5.4.1 Validation of 454 Reads . 48 5.4.2 Validation of Correction Methods . 48 5.4.3 Comparison Measures . 49 5.5 Experimental Results ........................... 50 5.5.1 Results of 454 Reads Validation . 50 5.5.2 Results of Correction Methods Validation . 53 5.5.3 Results of Reconstructed Quasispecies . 53 PART 6 CONCLUSIONS AND FUTURE WORK :::::: 63 REFERENCES ::::::::::::::::::::: 65 ix LIST OF TABLES Table 4.1 Average sensitivity/PPV of ViSpA and Maximum Bandwidth algo- rithms for shotgun and amplicon error-free read data sets simulated from ten and twenty quasispecies [1, 2]. 42 Table 4.2 Total and distinct reads of amplicons per patient [1, 2]. 43 Table 5.1 Pairwise edit distance between the 10 Sanger clones [3]. 54 Table 5.2 Edit distance between collapsed Sanger clones and reconstructed qua- sispecies using parameters 1 2 5( the number of mismatches between sub-reads and super-reads, the number of mismatches between two overlapped reads, and mutation rate respectively), threshold 0.005 on KEC corrected reads using ViSpA [3]. 55 Table 5.3 Edit distance between collapsed sanger clones and reconstructed qua- sispecies using parameters 1 2 5( the number of mismatches between sub-reads and super-reads, the number of mismatches between two overlapped reads, and mutation rate respectively), threshold 0.005 on uncorrected reads using ViSpA [3]. 56 Table 5.4 Edit distance between collapsed Sanger clones and reconstructed qua- sispecies using ShoRAH assembler with default parameters [3]. 58 Table 5.5 Average distance to clones (ADC) for the reconstructed quasispecies using different methods [3]. 60 Table 5.6 Average prediction error (APE) for the reconstructed quasispecies us- ing different methods [3]. 60 x Table 5.7 Average prediction error(APE) and Average distance to clones(ADC) of reconstructed quasispecies for different methods [3]. 61 xi LIST OF FIGURES Figure 2.1 454 Sequencing Method [4]. 12 Figure 2.2 Illumina Sequencing Method[5]. 14 Figure 2.3 SOLiD Sequencing Method [6]. 16 Figure 2.4 Ion Torrent Sequencing Method [7]. 17 Figure 3.1 Algorithmic approach to solve QSA problem. 22 Figure 3.2 protein sequence alignment (CLUSTAL W) of the quasispecies se- quence (0.0498579823867 ID20 * 1) with the reference (contig20 * 3) before systematic error fixing. 25 Figure 3.3 protein sequence alignment (CLUSTAL W) of the quasispecies se- quence (0.0498579823867 ID20 * 1) with the reference (contig20 * 3) after systematic error fixing. 26 Figure 4.1 The case of two distinct reads for both amplicons [1, 2]. 30 Figure 4.2 Sensitivity for different amplicon window shifts from uniform distribu- tion [1, 2] . 37 Figure 4.3 Sensitivity for different amplicon window shifts from skewed distribu- tion [1, 2] . 37 Figure 4.4 Sensitivity for different amplicon window shifts from geometric