Reconstruction of Bacterial Strain Genomes from Shotgun Metagenomic Reads
University of Central Florida
STARS
Electronic Theses and Dissertations, 2020-
2020

Part of the Computer Sciences Commons

This Doctoral Dissertation (Open Access) is brought to you for free and open access by STARS. It has been accepted for inclusion in Electronic Theses and Dissertations, 2020- by an authorized administrator of STARS.

STARS Citation
Li, Xin, "Reconstruction of Bacterial Strain Genomes from Shotgun Metagenomic Reads" (2020). Electronic Theses and Dissertations, 2020-. 377. https://stars.library.ucf.edu/etd2020/377

RECONSTRUCTION OF BACTERIAL STRAIN GENOMES FROM SHOTGUN METAGENOMIC READS
by
XIN LI
B.E. Yanshan University, 2010
M.S. Florida International University, 2013

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Department of Computer Science in the College of Engineering and Computer Science at the University of Central Florida, Orlando, Florida

Fall Term 2020
Major Professor: Haiyan Hu

ABSTRACT
It is necessary to study bacterial strains in environmental samples. Environmental samples are mixed DNA samples collected from oceans, soil, lakes, human body sites, and other habitats. In natural environments, they give us new insights into the diversity of our planet. For bacterial strains on or inside the human body, identifying the strains and reconstructing their genomes is critical to selecting the proper treatment for the diseases they cause. However, doing so is challenging because the DNA of a large number of unknown microbial species is mixed together in an environmental sample.
The majority of available computational methods depend on available sequenced genomes and marker genes, and therefore cannot fully discover the strains or reconstruct their genomes from shotgun metagenomic reads. In this dissertation, we studied bacterial strain reconstruction, including one case study on shotgun metagenomic sequencing and two novel approaches that improve the performance of bacterial strain reconstruction. First, we studied how newly sequenced genomes affect the results of analyzing shotgun metagenomic datasets. In this study, we found two phyla related to colitis development that were not identified in a previous study, and these two phyla were also more statistically significant. Furthermore, we found that one major conclusion of the previous study was not supported when the analysis was repeated with an updated marker gene database and updated metagenomics tools. Second, to better analyze shotgun metagenomic datasets, we developed BHap, a novel algorithm based on fuzzy flow networks and de Bruijn graphs, to reconstruct bacterial strains. BHap had high precision, recall, and F1 score, and low susceptibility to sequencing errors. It also outperformed existing tools, with better precision, better recall, a higher F1 score, and a more accurate estimate of the number of strains. Finally, a second approach, mixtureS, was developed by considering all genome positions. MixtureS uses the expectation-maximization (EM) algorithm and the frequency differences among strains to distinguish different strains of a bacterial species in shotgun metagenomic datasets. Compared with several existing methods, including BHap, mixtureS performed better in terms of precision, recall, and the accuracy of the predicted number of strains and their abundances. Based on the BHap and mixtureS methods, we also developed two software tools, which will be valuable for future strain studies in metagenomics.

ACKNOWLEDGMENTS
I would like to sincerely express my gratitude to my advisor, Dr.
Haiyan Hu, who continually teaches me the importance of a good attitude toward research and work. That has been extremely important for my Ph.D. studies. It would not have been possible to finish this dissertation without her guidance and help. I would also like to thank my co-advisor, Dr. Xiaoman Li, who is a talented scientist. I have learned a lot from him over the past six years. He has given countless thoughtful suggestions on my research and guided me in the right research direction. I also want to take this opportunity to thank my committee members, Dr. Saleh A. Naser and Dr. Liqiang Wang, for serving on my committee. Their suggestions were very helpful in completing my dissertation. Finally, I would like to thank my parents. They have supported and encouraged me in many different ways throughout my graduate studies.

TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER ONE: INTRODUCTION
    1.1 Metagenomics
    1.2 16S ribosomal RNA and shotgun sequencing
    1.3 Bacterial strains
    1.4 Other strain reconstruction method
    1.5 Overview of the Dissertation
CHAPTER TWO: OLD METAGENOMIC DATA MEET NEWLY SEQUENCED GENOMES
    2.1 Introduction
    2.2 Materials and Methods
        2.2.1 Data and their processing
        2.2.2 Database preparation and Centrifuge
        2.2.3 MetaPhlAn
        2.2.4 Statistical analysis
    2.3 Results
        2.3.1 At least two new phyla may relate to colitis development in patients
        2.3.2 Hundreds of lower taxa may relate to colitis development in patients
        2.3.3 Many identified taxa may relate to colitis development based on literature
        2.3.4 Re-analyses with MetaPhlAn support that Actinobacteria is different between CF samples and PtC samples
    2.4 Discussion
CHAPTER THREE: AN APPROACH FOR BACTERIAL STRAIN RECONSTRUCTION BASED ON DE BRUIJN GRAPH
    3.1 Introduction
    3.2 Materials and Methods
        3.2.1 Simulated datasets
        3.2.2 Experimental datasets
        3.2.3 BHap, a novel approach for strain reconstruction in bacterial populations
        3.2.4 Evaluation of BHap and other tools
    3.3 Results
        3.3.1 BHap has a robust performance with varied parameter values
        3.3.2 BHap reconstructs strains better than EVORhA on simulated datasets
        3.3.3 BHap reconstructs strains better than EVORhA on experimental dataset
    3.4 Discussion
CHAPTER FOUR: A NOVEL TOOL FOR BACTERIAL STRAIN RECONSTRUCTION FROM READS
    4.1 Introduction
    4.2 Materials and Methods