Mining Bacterial NGS Data Vastly Expands the Complete Genomes Of
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/2021.07.15.452192; this version posted July 15, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 1 Title: Mining bacterial NGS data vastly expands the complete genomes of 2 temperate phages 3 Authors: Xianglilan Zhang1†, Ruohan Wang2†, Xiangcheng Xie3†, Yunjia Hu4†, Jianping 4 Wang2†, Qiang Sun5, Xikang Feng6, Shanwei Tong7,8, Yujun Cui1, Mengyao Wang2, 5 Shixiang Zhai9,10, Qi Niu11, Fangyi Wang12, Andrew M. Kropinski13, Xiaofang Jiang14, 6 Shaoliang Peng12*, Shuaicheng Li2*, Yigang Tong4* 7 †These authors contributed equally to this work. 8 *Corresponding author. Email: [email protected] (Y.T.); [email protected] 9 (S.C.L.); [email protected] (S.P.) 10 Affiliations: 11 1State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology 12 and Epidemiology, Beijing 100071, People’s Republic of China. 13 2Department of Computer Science, City University of Hong Kong, Hong Kong 999077, 14 People’s Republic of China. 15 3College of Computer, National University of Defense Technology, Changsha 410073, 16 People’s Republic of China. 17 4Beijing Advanced Innovation Center for Soft Matter Science and Engineering (BAIC- 18 SM), College of Life Science and Technology, Beijing University of Chemical 19 Technology, Beijing 100029, People’s Republic of China. 20 5The 964th Hospital, Changchun 130021, People’s Republic of China. bioRxiv preprint doi: https://doi.org/10.1101/2021.07.15.452192; this version posted July 15, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 21 6School of Software, Northwestern Polytechnical University, Xi'an 710072, People’s 22 Republic of China. 23 7Bioinformatics Graduate Program, University of British Columbia, Vancouver, British 24 Columbia, Canada. 25 8Food, Nutrition and Health Program, Faculty of Land and Food Systems, The University 26 of British Columbia, Vancouver, British Columbia, Canada. 27 9Yantai Institute of Coastal Zone Research, Chinese Academy of Sciences, Yantai, 28 Shandong 264003, People’s Republic of China. 29 10University of Chinese Academy of Sciences, Beijing 100049, People’s Republic of 30 China. 31 11School of Computer Science and Electronic Engineering, Hunan University, Changsha 32 410082, People’s Republic of China. 33 12Department of Biostatistics, Yale School of Public Health, New Haven, CT 06510, 34 USA. 35 13Departments of Food Science, and Pathobiology, University of Guelph, Guelph ON 36 N1G 2W1, Canada. 37 14National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA. 38 39 Temperate phages (active prophages induced from bacteria) help control 40 pathogenicity, modulate community structure, and maintain gut homeostasis1. 41 Complete phage genome sequences are indispensable for understanding phage bioRxiv preprint doi: https://doi.org/10.1101/2021.07.15.452192; this version posted July 15, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 42 biology. Traditional plaque techniques are inapplicable to temperate phages due to 43 the lysogenicity of these phages, which curb the identification and characterization 44 of temperate phages. Existing in silico tools for prophage prediction usually fail to 45 detect accurate and complete temperate phage genomes2-5. In this study, by a novel 46 computational method mining both the integrated active prophages and their 47 spontaneously induced forms (temperate phages), we obtained 192,326 complete 48 temperate phage genomes from bacterial next-generation sequencing (NGS) data, 49 hence expanded the existing number of complete temperate phage genomes by more 50 than 100-fold. The reliability of our method was validated by wet-lab experiments. 51 The experiments demonstrated that our method can accurately determine the 52 complete genome sequences of the temperate phages, with exact flanking sites (attP 53 and attB sites), outperforming other state-of-the-art prophage prediction methods. 54 Our analysis indicates that temperate phages are likely to function in the evolution 55 of microbes by 1) cross-infecting different bacterial host species; 2) transferring 56 antibiotic resistance and virulence genes; and 3) interacting with hosts through 57 restriction-modification and CRISPR/anti-CRISPR systems. This work provides a 58 comprehensive complete temperate phage genome database and relevant 59 information, which can serve as a valuable resource for phage research. 60 Main Text 61 Temperate phage is a key component of the microbiome. Temperate phages have 62 both lysogenic and lytic cycles. When they integrate (as prophages) into bacterial 63 chromosomes, they enter the lysogenic cycle to participate in several bacterial cellular 64 processes6. As the temperate phages have an inherent capacity to mediate the gene bioRxiv preprint doi: https://doi.org/10.1101/2021.07.15.452192; this version posted July 15, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 65 transfers between bacteria by specialized transduction, they may increase bacterial 66 virulence and promote antibiotic resistance1,7. Therefore, temperate phage detection is 67 necessary to ensure the safety of phage therapy. Also, isolating suitable virulent phages is 68 challenging for some bacterial infections, such as Clostridioides difficile and 69 Mycobacterium tuberculosis8-10. Hence, genetically modifying temperate phages into 70 virulent phages is a promising phage therapy approach for such bacterial infections. In 71 recent years, fecal microbiota transplantation (FMT) has increasingly become a 72 prominent therapy to treat C.difficile infection (CDI) by normalizing microbial diversity 73 and community structure in patients11,12. FMT may also help manage other disorders 74 associated with gut microbiota alteration11. However, drug-resistant and highly virulent 75 bacteria in the donor's stool pose a potential threat to the patient's life. Identifying and 76 profiling the temperate phages that carry important antibiotic resistance genes and 77 virulence genes, is crucial for FMT success. 78 Even though temperate phages play essential roles in medical treatments, only a 79 few genomes are accessible untill now, seriously hindering the related research. Using 80 integrase as a marker, we identified only 1,504 complete temperate phage genomes on 81 GenBank (December 2020, Extended Data Table 4). Detecting temperate phages is 82 challenging. Traditionally, phage detection relies on culture-based methods to isolate the 83 temperate phage after inducing it from a lysogenic strain, amplify the phage to high titers, 84 and characterize it often by transducing the phage into different strains13. However, as 85 temperate phages are readily integrated into the host genome and stay silent, forming 86 phage plaques is difficult after transduction, which causes a big challenge for the 87 traditional culture method. Nowadays, an increasing amount of bacterial genomic bioRxiv preprint doi: https://doi.org/10.1101/2021.07.15.452192; this version posted July 15, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 88 sequences have become available in databases due to the advances in next-generation 89 sequencing (NGS) technologies. Bioinformatics tools have been developed to predict 90 potential prophage sequences within bacterial genomes, including Phage_Finder14, 91 Prophage finder15, Prophinder16, PHAST2, PHASTER17, PhiSpy3, VirSorter18, Prophage 92 Hunter4, and VIBRANT5. Among these tools, PHASTER/PHAST, VirSorter, and 93 Prophage Hunter consider the completeness of a predicted prophage sequence. Typically, 94 these tools rely on phage protein clusters to locate possible prophage regions on 95 assembled bacterial genome sequences, yet fail to acquire exact active prophage 96 (temperate phage) sequence boundaries2-5. That is, all the existing prophage prediction 97 tools are unable to obtain accurately complete temperate phage genome sequences. 98 Here we present a novel computational method to acquire complete temperate 99 phage genome sequences using bacterial NGS data. Unlike other in silico tools that adopt 100 machine learning or statistical classifier to predict possible prophage regions from the 101 bacterial genomes (Table 1), our method utilizes raw NGS data (reads) to identify 102 sequences that match the circularized temperate phage genome. The raw reads capture 103 the biological replication process signals of temperate phages in their lytic cycles. 104 Essentially, our method incorporates three biological principles from phage life cycles to 105 detect temperate phages: 1) In the replication process of the