Algorithm Development for Next Generation Sequencing-Based Metagenome Analysis
Total Page:16
File Type:pdf, Size:1020Kb
ALGORITHM DEVELOPMENT FOR NEXT GENERATION SEQUENCING-BASED METAGENOME ANALYSIS A Thesis Presented to The Academic Faculty by Andrey O. Kislyuk In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in Bioinformatics School of Biology Georgia Institute of Technology December 2010 ALGORITHM DEVELOPMENT FOR NEXT GENERATION SEQUENCING-BASED METAGENOME ANALYSIS Approved by: Professor Joshua S. Weitz, Advisor Professor Yury Chernoff School of Biology School of Biology Georgia Institute of Technology Georgia Institute of Technology Professor I. King Jordan, Co-Advisor Professor David A. Bader School of Biology School of Computational Science and Georgia Institute of Technology Engineering Georgia Institute of Technology Professor Nicholas H. Bergman Date Approved: 13 August 2010 Department of Homeland Security National Biodefense Analysis and Countermeasures Center Adjunct Professor, School of Biology Georgia Institute of Technology To the ghost in the machine iii ACKNOWLEDGEMENTS I am deeply grateful to my advisor, Joshua Weitz, for his invaluable guidance, support, and encouragement. I would like to thank Inna Dubchak and Michael Brudno, who introduced me to computational biology, and King Jordan, my co-advisor, for their support and mentorship. I am also grateful to Nicholas Bergman for his support and advice; and I am thankful to Stephen Turner for his support, mentorship, and the opportunities he has given me. It has been a privilege to work with these brilliant and talented mentors. I want to thank my collaborators and co-workers, especially Alexandre Lomsadze, Alex Mitrophanov, Alex Poliakov, Alla Lapidus, Scott Sammons, Russell Neches, Jonathan Eisen, Jonathan Dushoff, Bart Haegeman, and Mark Borodovsky, who have all contributed to my understanding of the science and practice of bioinformatics and computational biology. I would also like to thank Lee Katz, Sonia Agrawal, Matthew Hagen, Andrew Conley, Pushkala Jayaraman, Viswateja Nelakuditi, Jay Humphrey, Dhwani Govil, Raydel Mair, Kathleen Tatti, Maria Tondella, Brian Harcourt, Leonard Mayer, and Srijak Bhatnagar, who have contributed to the work that became part of this thesis. I am thankful to my friends and instructors at Georgia Tech and Berkeley; to Nils Onsager for his dedicated vision and instruction; and to Madison Park for her support. Finally, I am thankful to my family for their love and support. iv TABLE OF CONTENTS DEDICATION ................................... iii ACKNOWLEDGEMENTS ............................ iv LIST OF TABLES................................. ix LIST OF FIGURES ................................ xiii SUMMARY..................................... xix I INTRODUCTION.............................. 1 II ALGORITHM DESIGN IN METAGENOMICS: A PRIMER...... 3 2.1 Overview of metagenome analysis...................5 2.2 Major computational tasks in metagenome analysis.........7 2.2.1 Assembly............................7 2.2.2 Binning............................. 11 2.2.3 Phylotyping and phylogeny reconstruction.......... 13 2.2.4 Metabolic pathway reconstruction.............. 17 2.2.5 Gene prediction and annotation................ 18 2.2.6 Technology advancement.................... 19 2.3 Algorithmic techniques......................... 19 2.3.1 Feature selection........................ 20 2.3.2 Randomized and approximation algorithms......... 22 2.3.3 String processing........................ 23 2.4 Toolkit................................. 25 2.4.1 Testing and validation..................... 27 2.4.2 Conclusion........................... 28 III FRAMESHIFT DETECTION ....................... 30 3.1 Introduction.............................. 30 3.2 Materials................................ 33 v 3.2.1 Sequences with artificial sequencing errors.......... 33 3.2.2 Sequences with 454 pyrosequencing errors.......... 33 3.2.3 Protein sequences....................... 34 3.3 Methods................................. 34 3.3.1 Verification by protein sequence alignment.......... 36 3.4 Results................................. 38 3.5 Discussion................................ 40 3.6 Technical details............................ 41 3.6.1 Prior design........................... 42 3.6.2 New design........................... 43 3.7 Acknowledgements........................... 46 3.8 Funding................................. 53 IV GENOME ASSEMBLY AND ANNOTATION PIPELINE . 54 4.1 Introduction.............................. 55 4.2 System and Methods.......................... 59 4.2.1 Genome test data....................... 59 4.2.2 Pipeline organization...................... 59 4.2.3 Assembly............................ 59 4.2.4 Feature prediction....................... 65 4.2.5 Functional annotation..................... 67 4.2.6 Availability........................... 68 4.3 Discussion................................ 70 4.3.1 Genome biology of N. meningitidis and B. bronchiseptica .. 70 4.3.2 Computational genomics pipeline............... 72 4.4 Validation on known data....................... 73 4.5 Acknowledgements........................... 74 4.6 Funding................................. 75 vi V METAGENOMIC BINNING ........................ 76 5.1 Background............................... 77 5.2 Methods................................. 81 5.2.1 The binning problem...................... 81 5.2.2 MCMC framework....................... 83 5.2.3 Numerical details........................ 87 5.2.4 Testing methodology...................... 90 5.3 Results and Discussion......................... 92 5.4 Conclusions............................... 97 5.4.1 Example application of likelihood model........... 99 5.5 Acknowledgements........................... 101 5.6 Funding................................. 101 VI CORE- AND PAN-GENOMES....................... 107 6.1 Introduction.............................. 108 6.2 Results................................. 110 6.2.1 Pan and core genome sizes cannot be reliably estimated.. 110 6.2.2 Genomic fluidity is a robust and reliable estimator of gene diversity............................. 111 6.2.3 Fluidity and its variance can be estimated from a group of sequenced genomes....................... 114 6.2.4 Rank-ordering of genomic fluidity is robust to variation in alignment parameters..................... 117 6.2.5 Genomic fluidity is a natural metric spanning phylogenetic scales from species to kingdom................ 118 6.3 Discussion................................ 122 6.4 Materials and Methods......................... 124 6.4.1 Fluidity estimator pipeline................... 124 6.4.2 Significance test for fluidity differences............ 125 6.5 Acknowledgements........................... 126 6.6 Funding................................. 126 vii VII CONCLUSION................................ 134 viii LIST OF TABLES 1 Task-approach matrix: metagenome analysis tasks........... 29 2 Accuracy parameters of the different versions of the algorithm as well as the FrameD program determined on genomic sequences with synthetic frameshifts. A/ for the old algorithm, described in the \Prior design" section in Appendix; B/for the new algorithm with coding potential analysis only (the classifier algorithm off); C/full ab initio prediction (includes the coding potential analysis and the classifier algorithm); D/ for the ab initio prediction followed by protein database search and alignment and rejection of ab initio predictions with negative evidence (scenarios 3 and 4, Fig. 2); E/for the ab initio prediction followed by the protein database search and alignment with acceptance of the predictions possessing a positive evidence (scenarios 1 and 2, Fig. 2); F/ for the FrameD program (Schiex et al., 2003). Bold numbers indi- cate best performance among A-E as measured by the average value (Sn+Sp)/2................................. 48 3 Characteristics of the algorithm performance on genomes sequenced by 454 pyrosequencing method. Designations of the methods B and C are the same as in Table2; C*, D*, E* are analogous to Table2; * indicates that the algorithm was using the homopolymer correction (see text). Bold numbers indicate best performance among B-E* as measured by the average, (Sn+Sp)/2.................. 49 4 Parameters of the fitted normal distributions for the values of orf overlap, gene overlap, ds stop dist, rbs score as observed in the sets of sequences with artificial frameshifts and sequences with gene overlaps. These parameters describing the \true frameshift" vs. \gene overlap" class distributions were used in the classifier algorithm........... 52 5 Quantitative analysis of frameshift predictions. Designations for columns C, D, and E are identical to Table2. FS/Kbp, artificial frameshifts per 1000 base pairs. Pred, predicted frameshifts. TP, true positives. 52 6 Parameters of the coding potential analysis algorithm........ 53 7 Summary of sequencing projects used in the pipeline development. Data for each strain are presented in rows............... 60 ix 8 Summary of assembler performance. Data for each strain are pre- sented in rows. Statistics from standalone assemblers (Newbler and AMOScmp) are presented together with results of the combining pro- tocol (default output of the pipeline) and an optional, manually as- sisted predictive gap closure protocol. (a) N50 is a standard quality metric for genome assemblies that summarizes the length distribution of contigs. It represents the size N such that 50% of the genome is contained in contigs of size N or greater. Greater N50 values indicate higher quality assemblies. (b) No improvement was