Network Analysis of Ten Thousand Genomes Shed Light on Pseudomonas
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/2021.08.16.456539; this version posted August 16, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 1 Network analysis of ten thousand genomes shed light on Pseudomonas 2 diversity and classification 3 Hemanoel Passarelli-Araujo1,2,*, Glória Regina Franco1, Thiago M. Venancio2,* 4 1Departamento De Bioquímica e Imunologia, Instituto De Ciências Biológicas, UniversiDaDe FeDeral De 5 Minas Gerais, Belo Horizonte, MG, Brazil 6 2Laboratório De Química e Função De Proteínas e PeptíDeos, Centro De Biociências e Biotecnologia, 7 UniversiDaDe EstaDual do Norte Fluminense Darcy Ribeiro, Campos dos Goytacazes, RJ, Brazil. 8 9 *CorresponDing authors 10 Av. Alberto Lamego 2000, P5 sala 217; Parque Califórnia 11 Campos Dos Goytacazes, RJ, Brazil 12 CEP: 28013-602 13 HPA: [email protected]; TMV: [email protected] bioRxiv preprint doi: https://doi.org/10.1101/2021.08.16.456539; this version posted August 16, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 14 SUMMARY 15 The growth of sequenceD bacterial genomes has revolutionizeD the assessment of microbial 16 diversity. Pseudomonas is a wiDely diverse genus, comprising isolates associateD with processes 17 from pathogenesis to biotechnological applications. However, this high diversity led to historical 18 taxonomic inconsistencies. Although type strains have been employeD to estimate 19 Pseudomonas diversity, they represent a small fraction of the genomic Diversity at a genus level. 20 We useD 10,035 available Pseudomonas genomes, incluDing 210 type strains, to builD a genomic 21 distance network to estimate the number of species through community identification. We 22 identifieD inconsistencies with several type strains anD founD that 25.65% of the Pseudomonas 23 genomes deposited on Genbank are misclassifieD. We retrieveD the 13 main Pseudomonas 24 groups anD proposeD P. alcaligenes as a new group. Finally, this work proviDes new insights on 25 the phylogenetic boundaries of Pseudomonas anD highlights that the Pseudomonas diversity has 26 been hitherto overlookeD. 27 Keywords: Phylogenomics, Pseudomonads, Taxonomy, Community detection bioRxiv preprint doi: https://doi.org/10.1101/2021.08.16.456539; this version posted August 16, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 28 INTRODUCTION 29 Biological networks have been an essential analytical tool to better unDerstanD microbial 30 diversity anD ecology1, 2. A network is a set of connecteD objects, in which objects can be 31 representeD as noDes anD connections as eDges. Networks proviDe a simple anD powerful 32 abstraction to evaluate the importance of inDiviDual or clustereD noDes in maintaining a given 33 system. CoupleD with whole-genome sequencing, it can refine our knowledge about genetic 34 relationships of Diverse bacteria such as Pseudomonas. 35 Pseudomonas is a genus within the Gammaproteobacteria class, whose members 36 colonize aquatic anD terrestrial habitats. These bacteria are involveD in plant anD human 37 diseases, as well as in biotechnological applications such as plant growth-promotion and 38 bioremediation3. The genus Pseudomonas was Described at the enD of the nineteenth century 39 based on morphology, anD its remarkable nutritional versatility was recognized thereafter4. The 40 metabolic Diversity of pseuDomonaDs, combineD with biochemical tests to Describe species, 41 culminateD in a chaotic taxonomic situation4. 42 In 1984, the genus was reviseD anD subDiviDeD into five groups baseD on DNA-DNA anD 43 rRNA-DNA hybriDization5, with group I retaining the name Pseudomonas. Over the past 30 years, 44 other molecular markers such as housekeeping genes have been used to mitigate the issues of 45 Pseudomonas taxonomy6, 7, 8. BaseD on the 16S rRNA gene sequences, the genus is DiviDeD into 46 three main lineages representeD by Pseudomonas pertucinogena, Pseudomonas aeruginosa, 47 anD Pseudomonas fluorescens9. These lineages comprise groups of different species – both, 48 lineages anD groups receive the name of the representative species. Currently, there are 254 49 Pseudomonas species with valiDateD names accorDing to the List of Prokaryotic Names with 50 StanDing in the Nomenclature (LPSN)10. However, although the genus Division into lineages anD 51 groups has facilitated the classification of new species, the remnants of the Pseudomonas 52 misclassification still linger in public Databases11, 12. 53 The explosion in availability of complete genomes for both cultureD anD uncultureD 54 microorganisms has improveD the classification of several bacteria, incluDing Pseudomonas8, 13. 55 One of the golD stanDarDs for species circumscription is the Digital whole genome comparison 56 by Average Nucleotide Identity (ANI)14. Since using only genomes from type strains might bias 57 anD proviDe an unrealistic picture of microbial Diversity, we aimeD to estimate the Pseudomonas 58 diversity using all available genomes through a network approach. Here, we proviDe new 59 perspectives on Pseudomonas diversity by exploring the topology of the genomic distance 60 network and the phylogenetic tree from representative genomes. This work also provides novel 61 insights about the misclassification anD phylogenetic borders of Pseudomonas. 62 63 RESULTS 64 Dataset collection 65 We obtaineD 11,025 genomes from GenBank on June 2020. After evaluating the quality of each 66 genome (see methoDs for more Details) anD removing fragmenteD genomes, 10,035 genomes 67 passed in the 80% quality threshold (Figure S1). We used 238 type strains with available 68 genomes anD names valiDly publisheD accorDing to the List of Prokaryotic Names with Standing 69 in Nomenclature in March 2021. AccorDing to the NCBI classification, the top four abunDant bioRxiv preprint doi: https://doi.org/10.1101/2021.08.16.456539; this version posted August 16, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 70 species in our Dataset are P. aeruginosa (n = 5,088), P. viridiflava (n = 1,509), Pseudomonas sp. 71 (n = 1,083), anD P. syringae (n = 435) (Table S1). The sizes of the retrieveD genome rangeD from 72 3.0 to 9.4 Mb. 73 74 Genome-based analyses reveals the presence of synonymous Pseudomonas species 75 The misclassification of some Pseudomonas type strains has been reported by several studies 8, 76 15, 16, 17. Type strains play an important role in taxonomy by anchoring species names as 77 unambiguous points of reference18. In this context, the term “synonym” refers to the situation 78 where the same taxon receives different scientific names. We used 238 type strain genomes to 79 evaluate the presence of synonymous species in Pseudomonas. The ANI was computeD for all 80 type strains to construct an identity network that was further useD to check the linkage between 81 genomes baseD on a 95% ANI thresholD (Figure 1). Since 95% has been accepteD as species 82 delimitation thresholD14, connections between type strains inDicate synonymous names or 83 subspecies. 84 We iDentifieD 30 connecteD genomes in the ANI network (Figure 1). Four of these 85 connecteD genomes are expecteD because they represent P. chlororaphis anD its subspecies. Of 86 the 26 remaining connecteD species, 15 have been previously reporteD, such as that in the group 87 containing P. amygdali, P. ficuserectae, anD P. savastanoi15, 16. Here, we observeD 11 88 connections, incluDing the one between P. panacis anD P. marginalis with 97.34% iDentity, 89 suggesting that P. panacis is a later synonym of P. marginalis. 90 91 92 Figure 1. Type strain validation based on Average Nucleotide Identity. Each noDe in the network 93 represent a type strain genome anD nodes are connecteD if they share at least 95% iDentity. The left panel 94 is a magnifieD representation of the connecteD noDes, with eDges coloreD accorDing to percent iDentity 95 between the nodes. 96 97 The Pseudomonas genomic distance network is highly structured 98 In networks, the community structure plays an important role to unDerstanD network topology. 99 We useD all 10,035 Pseudomonas genomes to construct a Distance network to estimate the 100 number of Pseudomonas species from the number of communities detected in this network. 101 Since alignment-based methods to estimate genome similarity (e.g. ANI) is computationally 102 expensive due to the algorithm quadratic time complexity19, it becomes impractical for thousanD bioRxiv preprint doi: https://doi.org/10.1101/2021.08.16.456539; this version posted August 16, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 103 genomes. Therefore, we estimateD the Mash Distance that strongly correlates with ANI anD can 104 be rapidly computeD for large Datasets20. 105 Mash Distances are computeD by reDucing large sequences to small anD representative 106 sketches20. We estimateD the pairwise Mash Distance for all genomes using sketch sizes of 1000 107 anD 5000, which convergeD to similar Distance values (Figure S2a). However, we observeD that 108 the greater the Distance between two Pseudomonas genomes, the more Divergent the Distance 109 estimation (Figure S2b), although the Density Distribution is similar (Figure S2c). The final 110 distance between two genomes was given as the average Distance value from both sketch sizes.