bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1 Comparative phylogenomic analysis reveals evolutionary genomic

2 changes and novel toxin families in endophytic Liberibacter

3 pathogens

4

5 Yongjun Tan1, Cindy Wang1, Theresa Schneider1, Huan Li1, Robson Francisco de Souza2, 6 Xueming Tang3, Tzung-Fu Hsieh4,5, Xu Wang6,7,8, Xu Li4,5, and Dapeng Zhang1,9,*

7

8 1Department of Biology, College of Arts & Sciences, Saint Louis University, St. Louis, MO 63103

9 2Departamento de Microbiologia, Instituto de Ciências Biomédicas, Universidade de São Paulo, 10 São Paulo 05508-900, Brazil

11 3School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai, China

12 4Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC 27695

13 5Plants for Human Health Institute, North Carolina State University, Kannapolis, NC 28081

14 6Department of Pathobiology, College of Veterinary Medicine, Auburn University, Auburn, AL 15 36849

16 7Alabama Agricultural Experiment Station, Auburn University, Auburn, AL 36849

17 8HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806

18 9Bioinformatics and Computational Biology Program, College of Arts & Sciences, Saint Louis 19 University, St. Louis, MO 63103

20

21 Correspondence and requests for materials should be addressed to D.Z. (email: 22 [email protected])

23

24 Running title: Comparative Phylogenomics of Liberibacter Pathogens

25 Keywords: Liberibacter pathogens, huanglongbing, zebra chip, toxins, prophages, 26 pathogenesis, evolution, comparative genomics

27

1

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

28 Abstract

29 Liberibacter pathogens are the causative agents of several severe crop diseases worldwide, 30 including citrus Huanglongbing and potato Zebra Chip. These are endophytic and non- 31 culturable, which makes experimental approaches challenging and highlights the need for 32 bioinformatic analysis in advancing our understanding about Liberibacter pathogenesis. Here, 33 we performed an in-depth comparative phylogenomic analysis of the Liberibacter pathogens 34 and their free-living, nonpathogenic, ancestral species, aiming to identify the major genomic 35 changes and determinants associated with their evolutionary transitions in living habitats and 36 pathogenicity. We found that prophage loci represent the most variable regions among 37 Liberibacter genomes. Using gene neighborhood analysis and phylogenetic classification, we 38 systematically recovered, annotated, and classified all prophage loci into four types, including 39 one previously unrecognized group. We showed that these prophages originated through 40 independent gene transfers at different evolutionary stages of Liberibacter and only the SC-type 41 prophage was associated with the emergence of the pathogens. Using ortholog clustering, we 42 vigorously identified two additional sets of genomic genes, which were either lost or gained in 43 the ancestor of the pathogens. Consistent with the habitat change, the lost genes were 44 enriched for biosynthesis of cellular building blocks. Importantly, among the gained genes, we 45 uncovered several previously unrecognized toxins, including a novel class of polymorphic toxins, 46 a YdjM phospholipase toxin, and a secreted EEP protein. Our results substantially extend the 47 knowledge on the evolutionary events and potential determinants leading to the emergence of 48 endophytic, pathogenic Liberibacter species and will facilitate the design of functional 49 experiments and the development of new detection and blockage methods of these pathogens.

50

51 Importance

52 Liberibacter pathogens are associated with several severe crop diseases, including citrus 53 Huanglongbing, the most destructive disease to the citrus industry. Currently, no effective cure 54 or treatments are available, and no resistant citrus variety has been found. The fact that these 55 obligate endophytic pathogens are not culturable has made it extremely challenging to

2

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

56 experimentally uncover from the whole genome the genes/proteins important to Liberibacter 57 pathogenesis. Further, earlier bioinformatics studies failed to identify the key genomic 58 determinants, such as toxins and effector proteins, that underlie the pathogenicity of the 59 bacteria. In this study, an in-depth comparative genomic analysis of Liberibacter pathogens 60 together with their ancestral non-pathogenic species identified the prophage loci and several 61 novel toxins that are evolutionarily associated with the emergence of the pathogens. These 62 results shed new lights on the disease mechanism of Liberibacter pathogens and will facilitate 63 the development of new detection and blockage methods targeting the toxins.

64

65

66

3

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

67 Introduction

68 Candidatus Liberibacter, a genus of Gram-negative bacteria in the order of Rhizobiales, 69 has recently received increasing attention due to the fact that several of its species are 70 associated with severe diseases on multiple crop plants, such as citrus, potato, and tomato. 71 These include three species, Candidatus Liberibacter asiaticus found in Asia and North America, 72 Candidatus Liberibacter africanus found in Africa 1, and Candidatus Liberibacter americanus 73 found in Brazil 2, that cause the citrus Huanglongbing (HLB) disease, and Candidatus 74 Liberibacter solanacearum, that causes similar diseases in tomato and potato, referred to as 75 Psyllid yellows and Zebra Chip (ZC), respectively 3,4. The HLB disease, also known as citrus 76 greening, is the most destructive, worldwide disease of citrus 5. It is characterized by yellowing 77 of citrus tree leaves, premature defoliation, decay of feeder rootlets and lateral roots, production 78 of small bitter fruit, and eventually death of the citrus tree 6. Thus, HLB has led to a substantial 79 loss of citrus production and severe damage to the economy and job market 7. Unfortunately, 80 there is currently no effective cure or treatments available, and no resistant citrus variety has 81 been found so far. In a similar vein, ZC is a new disease of potato that was first identified in 82 Mexico in 1994 8 and has quickly spread to many countries in recent years. It negatively affects 83 growth, yield, propagation potential, and qualities of tubers and thus impacts international trade 84 and the economy significantly 9.

85 Given the severity and rapid spread of these diseases in commercial crops, extensive 86 research efforts have been made to better understand the basic biology of these pathogens and 87 the pathogenesis mechanisms of the diseases, especially on HLB. However, the progress has 88 been slow due to two main obstacles. First, the Liberibacter pathogens of these diseases are 89 endophytic bacteria, whose transmissions are naturally vectored by several psyllids, such as 90 Asian citrus psyllid (Diaphorina citri) 10, African citrus psyllid (Trioza erytreae) 11, and Bactericera 91 cockerelli 9. Therefore, they are not culturable which makes it difficult for wet-lab researchers to 92 conduct functional studies. Second, secretion of toxins or effectors represents one of the most 93 important mechanisms of bacterial pathogenesis and virulence. These toxins either manipulate 94 host/vector immunity and physiology or damage the host cells 12. However, such molecules 95 involved in the HLB disease have remained elusive for years. Previous research has focused 96 primarily on secreted proteins in HLB-associated pathogens which carry T2SS specific signal 97 peptides 13. Thus far, the only proteins claimed to be the potential toxins behind the HLB 98 disease have been the short proteins carrying these signal peptides 14-17. However, these 99 proteins are mainly present in one strain of the C.L. asiaticus species, and not found in other

4

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

100 C.L. asiaticus strains and other Liberibacter pathogens associated with HLB (C.L. americanus 101 and C.L. africanus)14,16-21, suggesting that other types of toxins or effectors may contribute to the 102 primary HLB pathology. Without accurate and comprehensive information about the HLB- 103 associated toxins/effectors, targeted strategies for detection, prevention, and more importantly 104 blockage will not be possible.

105 In this study, we aimed to tackle these problems by utilizing the available genome 106 information and dedicated bioinformatics strategies. Our analysis is based on the observed 107 phenotypic difference between the most ancestral Liberibacter species, Liberibacter crescens, 108 that displays a free-living habitat and is apparently non-pathogenic, and the descendants that 109 are both endophytic and pathogenic. We hypothesized that the functional difference is 110 determined by the genomic changes, including both gene-loss and gene-gain events, that 111 occurred at the common ancestor of the pathogens, after its divergence from L. crescens. We 112 have designed multiple comparative genomics strategies to systematically mine the major 113 genomic differences, including unique prophage loci and other genes that are evolutionarily 114 associated with the emergence of the pathogens. More importantly, we identified several novel 115 toxin proteins, including a new type of polymorphic toxins, a YdjM phospholipase toxin, and a 116 secreted EEP family protein. The new information gained from this research provides important 117 insights into the evolution and pathogenesis of Liberibacter pathogens and will facilitate future 118 research to develop novel detection and blockage methods targeting the toxins.

119 Results

120 Phylogenetic relationships of HLB-associated pathogens suggest an evolutionary 121 transition from the non-pathogenic ancestor to pathogenic descendants. We first sought 122 to understand the evolutionary relationship of the three HLB-associated pathogens. We 123 collected all Liberibacter species whose genome information is available in the NCBI GenBank 124 database (Fig. 1a and Supplementary Table 1). This includes eight genomes of HLB-associated 125 pathogens (six strains of C.L. asiaticus 5,22-26, one of C.L. africanus 27, and one of C.L. 126 americanus 2), one genome of C.L. solanacearum28, one genome of C.L. europaeus, which is a 127 potential pathogen of Cytisus scoparius and vectored by Arytainilla spartiophila 29, and two 128 genomes of L. crescens, which were isolated from papaya and represent the most basal lineage 129 within the Liberibacter genus30,31. We conducted Maximum Likelihood phylogenetic analyses by 130 using three conserved genes including 16S rRNA, 23S rRNA, and DNA polymerases A, all of 131 which support the same tree topology (Fig. 1b-d). Importantly, this analysis revealed that the 132 species associated with HLB are not monophyletic. Both C.L. africanus and C.L. asiaticus share

5

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

133 a common ancestor. However, C.L. americanus shows a close relationship with C.L. europaeus. 134 Between these two groups is C.L. solanacearum, the pathogen of potatoes, tomatoes, and 135 carrots. Further, all five species of pathogens share a common ancestor, whose sister group is 136 L. crescens. L. crescens are free-living bacteria and do not seem to be pathogenic, despite the 137 fact that they were isolated from papaya 31,32. Our phylogenetic trees also support its ancestral 138 (deepest branching) position at the base of Liberibacter (Fig. 1b-d).

139 According to this phyletic pattern between the ancestral species of Liberibacter, L. 140 crescens, which is free-living and apparently non-pathogenic, and the descendants, which are 141 endophytic and pathogenic bacteria, there is a clear transition in term of living habitats and 142 pathogenicity. In other words, both the intracellular habitats and pathogenicity of these bacteria 143 are derived traits of Liberibacter during evolution. We hypothesize that the development of these 144 new traits was attributed to the genomic changes, both gene-loss and gene-gain events, that 145 occurred in the ancestor of the Liberibacter pathogens. With this hypothesis, we designed 146 several phylogenomic comparison strategies to specifically identify the genes or genomic 147 regions that display unique phyletic patterns in either pathogenic descendants or ancestral L. 148 crescens species and to understand the evolution and pathogenicity mechanisms of 149 Liberibacter.

150 Whole genome comparisons of Liberibacter bacteria. The first computational strategy 151 involved comparing whole Liberibacter genomes in order to develop a general idea of genomic 152 dynamics and to identify any large genomic regions that were deleted or inserted during 153 evolution. Pairwise TBLASTX comparisons were used to identify the gene correspondence 154 between the genomes (Fig. 2). Each line indicates one correspondence of the homologs. We 155 found that between five pathogenic species, the majority of their genome regions preserve 156 similar gene composition and organization. The major difference between their genomes arises 157 from large-scale genome recombination and inversions (as shown in crossed blue lines). 158 However, when we compared genomes of the non-pathogenic L. crescens and pathogenic C.L. 159 europaeus, the difference between their genomes is evident. Although many short operonic 160 structures are still preserved between the two genomes, the recombination and inversions are 161 more prominent and extensive. Thus, the observed extensive genomic changes between the 162 non-pathogenic ancestor and pathogenic ancestor are consistent with their transitions in living 163 habitat and pathogenicity.

164 When the genome regions that display strong variations among pathogens and between 165 pathogens and the non-pathogenic ancestor were examined, we found that many of them are

6

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

166 prophage loci (as highlighted on each genome) (Fig. 2). This is very interesting, as prophages 167 are known to play a critical role in bacterial pathogenesis 33 and many toxins or virulence factors 168 of bacterial pathogens are carried by prophage loci, such as cholera toxins 34 and the toxins 169 used by pathogenic E.coli 35. Therefore, we conducted a detailed analysis of all prophage loci to 170 test if any prophage loci are associated with the pathogenicity of the pathogens.

171 Genomic organization dissection and classification of Liberibacter prophage loci. Several 172 prophage loci have been previously identified in Liberibacter bacteria 30,36,37. However, their 173 evolutionary relationships, origins, genomic organizations, and functional compositions remain 174 unclear. Thus, we conducted an in-depth gene-neighborhood analysis to systematically extract 175 the potential prophage loci, identify their boundaries, and annotate their components. As a result, 176 we retrieved 36 prophage loci from the 12 available Liberibacter genomes (Fig. 3, 177 Supplementary Table 2 and Data 1). We annotated their gene components by clustering the 178 protein products and dissecting their domains. We found that these prophage loci display a 179 highly variable gene composition. However, based on the shared gene components, we were 180 able to classify them into four major types, including LC1, LC2, SC, and UT (Fig. 3). Specifically, 181 the LC1 type is only found in non-pathogenic L. crescens whereas LC2 is present in all 182 Liberibacter species. They are similar in terms of gene composition; however, LC2 contains 183 several unique genes such as LC_TM, Peptidase_S74_2, LC_3, HTH_XRE, and DUF1376. The 184 SC type is found in all Liberibacter pathogens; it typically has a large genome size with big 185 variations among loci which are mainly caused by independent genome deletion and insertions. 186 It is noteworthy that several genomes contain multiple copies of the SC-type phages. These 187 include two copies in C.L. asiaticus gxpsy genome, two copies in C.L. solanacearum ZC1 188 genome, and three copies in C.L. europaeus NZ1 genome. By contrast, the UT type represents 189 a novel group of prophages that we recovered in all C.L. asiaticus strains. Its size is much 190 smaller, and according to its gene composition, it is close to the SC type. There is also a small 191 prophage locus ZC1 on the C.L. solanacearum genome, but it is more likely a fragment.

192 Phylogenetic analysis reveals independent gene transfers of phages to Liberibacter. We 193 next attempted to trace the evolutionary histories of these prophages to examine their 194 correlations with the development of pathogenicity. By comparing gene components of 195 prophage loci, we found that terminase, the key phage component involved in the phage DNA 196 packing process 38, is the only gene family conserved across all identified prophage loci (Fig. 3). 197 Therefore, we used the terminase as a marker to study the evolutionary origins of these 198 prophages. We conducted several BLASTP searches using the terminases from Liberibacter

7

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

199 species to collect homologs in the NCBI NR database. Using both Maximum Likelihood and 200 Bayesian Inference analyses (Fig. 4), we found that the terminases of Liberibacter prophages 201 are not monophyletic; instead, terminases of different types, LC1, LC2, and SC (together with 202 UT), were nested in three separate clades. This indicates that the LC1, LC2, and SC prophages 203 have different evolutionary origins, and they have been transferred independently to Liberibacter 204 at different evolutionary time points (Fig. 4a). In addition to the independent origins, the tree also 205 suggests that multiple copies of the SC phage in several Liberibacter pathogens were likely 206 generated from genome-specific duplications, such as those in C.L. solanacearum, C.L. 207 europaeus, and C.L. asiaticus (Fig. 4a). However, the SC prophages also appear to have 208 undergone recombination between species and strains, given the facts: 1) their terminases did 209 not follow a typical pattern of vertical evolution, unlike the LC2 terminases (Fig. 4a); 2) in our 210 genome comparison analysis, the SC phages from different C.L. asiaticus strains display a 211 strikingly rapid divergence in certain regions, in contrast to other genomic regions (Fig. 4b).

212 Based on the results from both gene neighborhood-based classification and 213 phylogenetic analysis, it is likely that LC2, LC1, and SC prophages were introduced at the base 214 of Liberibacter, the base of non-pathogenic L. crescens, and the common ancestor of the 215 Liberibacter pathogens, respectively. As for the UT phage, it likely originated from a duplication 216 event of the ancestral SC phage at the base of the C.L. asiaticus species (Fig. 4b). According to 217 the phylogenetic histories of these prophages, we can infer that the origin of the SC type of 218 prophages was associated with the emergence of pathogenicity of the Liberibacter pathogens.

219 Ortholog clustering analysis reveals additional unique genes which were either lost or 220 gained in the ancestor of pathogens. In addition to prophages, we sought to identify other 221 genomic (non-prophage) genes which might be associated with the transition from the non- 222 pathogenic ancestor to pathogenic descendants. These genes should be featured by having 223 undergone either gene-loss or gene-gain events in the common ancestor of all pathogens. 224 Therefore, we were specifically interested in identifying two groups of genes which display 225 unique phyletic patterns: first, the genes, that underwent gene loss, are present in the non- 226 pathogenic L. crescens species, but not in pathogenic descendants; second, the genes, that 227 underwent gene gain, are present in all pathogens but not in non-pathogenic L. crescens 228 species (Fig. 5a).

229 To identify these genes, we utilized three steps of analysis. First, we carried out an 230 ortholog analysis using the OrthoFinder program 39 to cluster the proteins (excluding the 231 prophage components) of all collected Liberibacter genomes into different orthologous groups

8

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

232 (orthogroups). Based on their phyletic profiles, we identified the orthogroups that are only 233 present in pathogens (but not in the non-pathogenic ancestor) or in the non-pathogenic ancestor 234 (but not in pathogens). We also utilized whole-genome BLASTP searches to specifically identify 235 the cases of gene gain or gene loss by comparing the Liberibacter proteins with other 236 sequences in the NCBI database, according to the pairwise sequence similarity scores. Finally, 237 to validate and confirm the above results, we conducted extensive phylogenetic analyses on all 238 identified orthogroups. From these analyses, we identified 323 orthogroups (about 335 genes in 239 L. crescens BT-1) that were lost in all pathogenic descendants and 35 orthogroups (about 35-37 240 genes in each pathogen) that were gained in the ancestor of all pathogens (Fig. 5a and 241 Supplementary Table 3 and Table 4). Figs. 5b,c,d show the evolutionary histories of the three 242 orthogroups. The tree topology supports a common ancestry of the homologous genes at the 243 base of the Liberibacter pathogens and their origins were likely from bacteria other than the 244 non-pathogenic Liberibacter ancestor.

245 The gene-loss and gene-gain events might contribute to the establishment of endophytic 246 habitats of pathogens. To understand the functional significance of these two types of 247 orthogroups, we performed Gene Ontology and KEGG pathway analysis. This revealed that 248 many of the genes that were lost in the endophytic pathogens are involved in synthesis and 249 metabolism of several major types of cellular components, including amino acids, phosphate- 250 containing compounds, nitrogen components, and nucleotide acids (Supplementary Figure 1, 2, 251 and Table 5). Strikingly, the whole biosynthetic pathways for Tryptophan and Histidine were 252 missing in the pathogens (Fig. 6). The majority of the biosynthetic genes are clustered together 253 as different operons on the ancestral L. cresens genome; however, a complete deletion of 254 several related operons and other genes highlights the dependency of these intracellular 255 pathogens on receiving these nutrients from their hosts.

256 For the genes that were gained in the pathogens, we found that they encode multiple 257 transporters, enzymes involved in DNA/RNA synthesis, regulation, or small molecule 258 metabolism, and several transmembrane proteins (Supplementary Table 4). Some of them 259 might contribute to the pathogen’s ability to adapt and proliferate in the intracellular environment 260 of host cells. For example, the survival protein SurE, is a metal-dependent nucleotide 261 phosphatase that has been shown to be essential for bacterial pathogenesis and survival in the 262 stationary phase and in harsh conditions 40. Further, the multiple transporters might play 263 important roles in exchanging chemical components between the pathogens and the plant host

9

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

264 41. Thus, both the gene-loss and gene-gain events that happened in the ancestor of Liberibacter 265 pathogens seem to have defined the molecular foundation of their endophytic habitats.

266 Extensive sequence and structure analyses identify potential virulence factors, including 267 several polymorphic toxins. Since pathogenicity is an acquired trait for Liberibacter bacteria, 268 we hypothesize that such ability is attributed to the unique pathogenicity-related genes, that 269 were gained by the ancestor of these pathogens during evolution (or gene-gain events). 270 Therefore, the potential virulence factors, such as toxins and effectors, should be among the 271 gene-gain list that we identified (Supplementary Table 4). However, of the proteins on our list, 272 almost half of them are hypothetical proteins with no functional annotation. To accurately 273 uncover the function of these proteins, we conducted a series of sequence and structure 274 analyses to examine the sequence/structural features of these proteins, dissect their domain 275 components, establish the distant relationship between these domains with the known Pfam 276 domains, and further synthesize the function of the proteins by combining the domain 277 annotations. This systematic analysis has allowed us to identify three groups of proteins as 278 potential toxins/effectors (Supplementary Table 4).

279 The first toxin group comprises two hypothetical proteins (e.g. WP_012778667.1 and 280 WP_012778668.1 in C.L. asiaticus strain gxpsy). By sequence analysis, we found that these 281 two proteins share a unique domain architecture containing two previously un-recognized 282 domains, a PD domain (designated for the conservation of two residues, Pro and Asp; 283 Supplementary Figure 3) and a Ntox52 domain (Supplementary Figure 4). No function has been 284 associated with these domains. However, by studying the proteins containing either the PD or 285 Ntox52 domains, we found that they are also coupled with other domains that are components 286 of polymorphic toxin systems (PTSs). PTSs are a vast class of toxin systems that we have 287 recently identified in bacteria and archaea 42-45. They have served as the primary weaponry for 288 many bacterial pathogens 46-49 which were exported through different secretion pathways such 289 as T2SS, T5SS, T6SS and T7SS 42,43,45,50,51. Despite the extensive diversity of these 290 polymorphic toxins, we were able to dissect the underlying principles of these proteins, including 291 1) The toxins typically contain multiple modular architectures with N-terminal secretion-related 292 domains, central repeats or linker domains, pre-toxin domains, and C-terminal toxin domains 293 42,43,45; and 2) The toxins display a tremendous polymorphism in domain composition as they 294 diversify through domain recombination or shuffling 42,43,45. Therefore, the association of toxin- 295 related domains is the most prominent feature of toxin proteins 42,43,45.

10

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

296 We found that both the PD and Ntox52 domains display a typical association with other 297 toxin-specific domains (Fig. 7a). Specifically, the PD domain is frequently coupled with several 298 long N-terminal regions and a number of C-terminal toxin domains including nucleases (REase- 299 1, AHH, EndoU), peptidases (MPTase, Tox-PL4), deaminases, ADP-Ribosyltransferases (ART- 300 HYD3, ART-VIP2), and Anthrax toxA (Fig. 7a), that were identified in our earlier studies 42,43,45,52. 301 The Ntox52 domain, on the other hand, is always located at the C-terminus of toxin proteins, a 302 position that the toxin domain typically occupies, and coupled with other toxin-specific central 303 modules, such as RHS and PseudoCD2, in addition to the PD domain. These lines of evidence 304 strongly suggest that the two identified proteins represent novel polymorphic toxins where the 305 PD domain is a pre-toxin module and the Ntox52 is a toxin module. By secondary structure 306 prediction and conservation analysis, the Ntox52 domain is featured by a core structure 307 containing several alpha helices and four beta strands and by several conserved polar residues 308 (Supplementary Figure 4). However, no distant homolog has been found in our profile-based 309 sequence searches.

310 Another potential toxin is the YdjM protein (Fig. 7b). Our profile-profile comparison 311 revealed that it is related to several toxin domains including bacterial alpha-toxin (probability 95% 312 of profile-profile match), a membrane-disrupting phospholipase responsible for gas gangrene 313 and myonecrosis in C. perfringens infected tissues 53, and the HET-C domain (probability 93% 314 of profile-profile match), that is also a toxin component that we identified in bacterial PTSs and 315 in the fungal nonallelic incompatibility system 43,54 (Fig. 7b). Although they share a low sequence 316 identity, these domains display conservation in both the structural composition and the catalytic 317 core (Fig. 7b and 7c). The catalytic core of the known alpha toxin and Het-C domains involves 318 seven conserved residues, six of which are preserved in the YdjM domain. However, our 319 structural modeling suggested that the YdjM-specific His76 serves an equivalent role as His136 320 of the alpha toxin (Fig. 7c). Further, by analyzing the domain architectures of many other 321 proteins containing the YdjM domain, we found: 1) several bacterial RHS-type toxins use YdjM 322 as their C-terminal toxin module (e.g. WP_143945134) (Fig. 7c); 2) the YdjM domain is 323 predominantly coupled with a novel beta-sandwich domain (Fig. 7c; Supplementary Figure 5), 324 like the alpha toxin, which has a beta-sandwich domain to facilitate toxin localization at the 325 membrane 53. Thus, based on these lines of evidence, we proposed that YdjM is a novel 326 phospholipase toxin which might disrupt the membrane of host cells.

327 The third toxin group includes several proteins belonging to the EEP 328 (Endonuclease/Exonuclease/Phosphatase) family (Fig. 5b and Supplementary Table 5). They

11

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

329 contain an N-terminal signal peptide, indicating that they are secreted and therefore might target 330 host cells. The EEP family comprises many enzymes with different activities, from DNA I-like 331 nucleases, endonucleases of retrotransposons, inositol polyphosphatases to 332 phosphodiesterases (Pfam ID: PF03372). By a phylogenetic analysis of the Liberibacter EEP 333 related sequences and other known EEP enzymes (Supplementary Figure 7), we found that the 334 Liberibacter EEP and related sequences form a distinct clade which is more closely related to a 335 clade of DNA I-like nucleases, which also include Haemophilus ducreyi cytolethal distending 336 toxins (1SR4_B)55 and Salmonella typhoid toxin (4K6L_F)56. Therefore, it is possible that the 337 secreted Liberibacter EEP proteins act as nuclease toxins targeting the host DNAs.

338

339 Discussion

340 HLB is the most destructive disease of citrus worldwide and associated with several species of 341 Candidatus Liberibacter, a psyllid-transmitted, phloem-limited, alpha . Despite 342 intense scrutiny, molecular mechanisms of the HLB pathogenesis remain to be elucidated. The 343 fact that these pathogens are obligate intracellular bacteria and are not culturable in vitro has 344 made experimental, functional studies extremely challenging. Thus, computational mining of 345 available genome information has become a promising strategy to dig into the potential 346 pathogenesis mechanisms of HLB and to guide the directed experimental studies. Here we 347 present an in-depth comparative genomic analysis of several Liberibacter species, including 348 those pathogens associated with citrus HLB and potato ZC, along with the ancestral, non- 349 pathogenic Liberibacter species, L. crescens. We successfully identified the major genomic 350 changes that might contribute to the establishment of endophytic habitats and pathogenicity of 351 the Liberibacter pathogens.

352 One of the major genome variations found by our comparative genomic analysis resides 353 at the prophage loci (Fig. 2 and Fig. 3). Prophages in Liberibacter have been extensively 354 studied in the past, but the results are controversial. On the one hand, the primary HLB 355 pathogen, C.L. asiaticus, is found to carry two prophage loci whose lytic cycle was activated 356 during bacterial infection in plant 36, indicating that the prophage might be involved in 357 pathogenicity or infection. On the other hand, the absence of the prophage in a Japanese strain 358 of C.L. asiaticus Ishi-1, which also induced severe symptoms in citrus, suggests that the 359 prophage might be dispensable for pathogenicity. Further, many different prophage loci are 360 identified in both Liberibacter pathogens and the free-living ancestor 30,37, making it more difficult 361 to understand their functional contribution. To resolve these puzzles, we conducted a unique,

12

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

362 comprehensive operonic association analysis to recover all the prophage loci in the Liberibacter 363 genomes and systematically annotate the conserved phage components. According to both the 364 prophage composition and phylogeny of the conserved terminase, we were able to classify 365 these diverse prophage loci into four major types, namely LC1, LC2, SC, and UT. This result 366 has clarified several confusions about the presence of the prophage loci in the Liberibacter 367 species. For example, our phylogenetic analysis has linked the recently classified type 4 368 prophages from the Liberibacter pathogens 37 to the earlier LC2 prophage from the ancestral L. 369 crescens30. We also uncovered a previously un-detected prophage type, the UT phage, that is 370 present in all the C.L. asiaticus strains. Further, we were able to identify the LC2 and UT 371 prophages in the Japanese strain which was previously thought to contain no prophage loci 5. 372 Thus, our detailed annotation and classification of prophages has provided a genomic resource 373 for studying the prophages in the Liberibacter bacteria.

374 Importantly, our analysis revealed that these different types of prophages originated 375 through multiple independent gene transfer events to Liberibacter at different evolutionary 376 stages (Fig. 4c). However, it is only the SC-type prophage that was acquired by the ancestor of 377 the pathogens. Thus, this evolutionary association suggests that the acquisition of the SC 378 phage might contribute to the emergence of pathogenicity in these pathogens. Indeed, many 379 toxins of bacterial pathogens are known to be carried by prophage loci 33-35. On average, a 380 typical SC-prophage locus contains about 35 genes. While the majority of these genes encode 381 structural components and enzymes involved in phage replication and transcription, others 382 remain uncharacterized. Therefore, a systematic analysis of the SC-prophage components will 383 be necessary to understand its role in the Liberibacter pathogens. To be noted, we observed an 384 unusual divergence between the SC prophage loci from different Liberibacter pathogens and 385 different strains of the same pathogen, distinct from other genomic regions and other types of 386 prophages. This rapid evolutionary pattern of the SC prophage suggests that some of its 387 components may have acted at the interface of host-pathogen interactions and gained the 388 upper hand during the evolutionary arms race 57,58.

389 Our comparative genomic analysis also identified genomic (non-prophage) genes that 390 were lost or gained at the ancestor of the Liberibacter pathogens. To be noted, of the 323 391 orthogroups (335 genes) that were lost in all the pathogens, 56 genes were also identified as 392 the culture-essential genes for L. crescens in a Tn5 transposon mutagenesis screening 393 experiment 59. Functional and pathway analysis revealed that biosynthetic pathways of several 394 essential amino acids in the pathogens were disrupted by gene loss, consistent with the earlier

13

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

395 studies 30,60. Some of the genes that were gained in the ancestor of the pathogens encode 396 transporters or transmembrane proteins, which may also contribute to the molecular adaptation 397 of the bacteria to endophytic habitats. Further functional characterization of the genes on the 398 gene-loss and gene-gain lists holds promises for advancing our understanding about the 399 nutrient dependence of endophytic Liberibacter bacteria and enabling rational design of new 400 strategies to culture them in vitro.

401 More importantly, our detailed sequence and structural analyses have identified several 402 potential protein toxins. Protein toxins or effectors are the major pathogenicity components of 403 bacteria-caused diseases. However, the toxins that are responsible for HLB and other related 404 diseases have remained enigmatic for years. Early genome annotation did not reveal any 405 bacterial pathogenesis-related genes in the genome 5. Although several studies have attempted 406 to examine the genes encoding small, secreted proteins, candidates thus far are only limited to 407 certain C.L. asiaticus strains 14,16,18-21. We had also tried to identify toxins in Liberibacter by 408 utilizing our previously established toxin domain profiles of polymorphic toxin systems; 409 unfortunately, no candidate was identified. This suggests that the Liberibacter pathogens might 410 utilize some novel toxins whose toxin domains are not yet identified. Indeed, by studying the 411 hypothetical proteins that were found to follow a gene gain pattern, we were able to uncover 412 three new toxin groups. Importantly, the newly identified toxin domains are also fused with other 413 typical domains of the polymorphic toxins. This strongly supported their role as toxins in the 414 Liberibacter pathogens. In addition to the above toxin candidates, other hypothetical proteins in 415 the gene-gain list may also be potential toxins. One such example is the protein 416 (WP_015452959.1) that was recently found to cause cell death in the leaves of N. benthamiana 417 61.

418 This research expands our recent efforts in using computational means to dissect the 419 molecular mechanisms of complex organismal interactions 42,43,45,62. Organismal or species 420 interactions, either inter-specific, intra-specific, or those between pathogen and host, are the 421 mainstay of Life 43. Driven by the evolutionary arms race, the proteins mediating these 422 interactions, such as toxins, effectors, or virulence factors, typically evolve rapidly and are 423 difficult to identify using traditional experimental and computational methods 63. The major 424 contribution of our research in this regard is to use dedicated protein domain-centric analysis 425 strategies, comparative genomics, and evolutionary theory to identify such toxin/effector 426 components and to formulate the principle of the systems behinds these interactions 427 42,43,62,63.Previously, this approach has led to the discovery of several distinct classes of conflict

14

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

428 systems, including bacterial polymorphic toxin systems involved in kin selection and bacteria- 429 host interactions 42,43,45, Crinkler-RHS (CR) effector systems at the interface of eukaryotic 430 pathogen/symbiont-host interactions 62, nucleotide-centric conflict systems 64, DNA modification 431 systems deployed in phage-bacteria interactions 65, and viral pathogenicity factors involved in 432 coronavirus-host interactions 57,58. Many of our computational predictions in protein function and 433 the organizational principles of the systems have facilitated the creation of new concepts and 434 directions in several research fields and have been later experimentally validated, including the 435 recent discoveries of the enzymatic function of animal and plant TIR proteins 66,67, adenosine 436 methylation enzymes 68,69, type III CRISPR-Cas systems 70,71, and novel immunoglobulin 437 proteins and ion channel proteins in SARS-CoV-2 57,58, in addition to many work on PTSs 42,43,45.

438 As an extension to understand the interactions between bacterial pathogens and their 439 host, we present our identification of the elusive pathogenicity factors in the Liberibacter 440 bacteria. This knowledge will open doors to developing and deploying new strategies for the 441 detection of Liberibacter pathogens and treatment of the associated diseases. Both the SC 442 prophage loci and the genes gained in the ancestor of the pathogenic strains, especially the 443 newly identified toxin groups, can serve as biomarkers for specific detection of the pathogens. 444 The association of these genes with pathogenic species suggests that some of them may play a 445 role in pathogenesis. This hypothesis can be tested directly using targeted, functional 446 experiments. We envision that future research informed and enabled by our genomic analysis 447 holds great promises to elucidate the pathogenesis mechanisms of Liberibacter bacteria, 448 leading to long-sought solutions for curing HLB and other related diseases.

449

450 Methods and material

451 Genomes that were investigated in this study. A total of 12 genomes of Liberibacter species 452 were collected from the nucleotide database of the National Center of Biotechnology Information 453 (NCBI). Their genome annotations including the coding sequences (CDS) and RNA genes were 454 extracted from NCBI GenBank files. The encoded protein sequences were retrieved from the 455 NCBI protein database. The detailed information of the genomes used in this study can be 456 found in Supplementary Table 1.

457 Pair-wise genome comparisons. To systemically identify the genomic features and variations 458 among Liberibacter species, we performed pair-wise whole genome comparisons by using the 459 local TBLASTX 72 program with the cut-off e-value of 0.001 serving as the significant threshold.

15

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

460 The results were visualized using Artemis Comparison Tool (ACT) 73 with a cut-off matching 461 sequence length at 500 bp, and modified using Adobe Illustrator. The C.L. europaeus genome 462 has 15 contigs and in order to compare it to other genomes, we linked these contigs based on 463 the order of their accession numbers: PSQJ01000001.1, PSQJ01000002.1, PSQJ01000004.1, 464 PSQJ01000005.1, PSQJ01000006.1, PSQJ01000007.1, PSQJ01000008.1, PSQJ01000009.1, 465 PSQJ01000010.1, PSQJ01000011.1, PSQJ01000012.1, PSQJ01000013.1, PSQJ01000014.1, 466 PSQJ01000015.1 and PSQJ01000003.1.

467 Protein sequence search and analysis. To collect protein homologs, iterative sequence 468 profile searches were conducted using the Position-Specific Iterated BLAST (PSI-BLAST) 469 program 72 against the non-redundant (nr) protein database of NCBI with a cut-off e-value of 470 0.005 serving as the significance threshold. Similarity-based clustering was performed by 471 BLASTCLUST, a BLAST score-based single-linkage clustering method 472 (ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html). Multiple sequence alignments (MSA) 473 were built by the KALIGN 74, MUSCLE 75 or PROMALS3D 76 programs, followed by careful 474 manual adjustments based on the profile–profile alignment and the secondary structure 475 information generated by the JPRED program 77. A consensus method was used to calculate 476 the conservation pattern of the MSA based on different categories of amino acid physico- 477 chemical properties developed by Taylor in 1986 78. Consensus is calculated by examining each 478 column of the MSA to determine whether a threshold fraction (either 75% or 80%) of the amino 479 acids belongs to a defined category. Then the MSA was colored using the CHROMA program 79 480 based on the calculated consensus sequence and further modified using Adobe Illustrator or 481 Microsoft Word. The HHsearch program was used for profile-profile comparison 80. Signal 482 peptide and transmembrane region prediction was detected using the Phobius program 81. 483 Potential open reading frames were detected using the ORFfinder program 484 (https://www.ncbi.nlm.nih.gov/orffinder/).

485 Molecular phylogenetic analysis. We used both the Maximum Likelihood (ML) analysis, 486 implicated in the MEGA7 programs 82, and Bayesian Inference (BI), implemented in the BEAST 487 1.8.4 program 83, to reconstruct the phylogenetic relationships of genes and proteins.

488 To infer the phylogeny of Liberibacter species, we selected three genes, including 16S 489 rRNA, 23S rRNA, and DNA polymerase A, collected their homologs from Liberibacter bacteria 490 and other related sequences using BLASTN or BLASTP programs, generated MSAs which 491 were further analyzed using the ML analysis. A GTR model was applied to the RNA sequences 492 and a JTT model was applied to the protein sequences. Initial tree(s) for the heuristic search

16

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

493 were obtained automatically by applying Neighbor-Join and BioNJ algorithms to a matrix of 494 pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach, and 495 then selecting the topology with a superior log-likelihood value. For both data types, a discrete 496 Gamma distribution was used to model evolutionary rate differences among sites (four 497 categories). A bootstrap analysis with 100 repetitions was performed to assess the significance 498 of the phylogenetic grouping.

499 To reconstruct the evolutionary history of terminases, we first applied the ML analysis in 500 which a WAG model with a discrete gamma distribution (four categories) was used to model 501 rate heterogeneity among sites. A bootstrap analysis with 100 repetitions was performed to 502 assess the significance of the phylogenetic grouping. We also applied the BI analysis with a JTT 503 model and a discrete Gamma distribution (four categories). Markov chain Monte Carlo (MCMC) 504 duplicate runs of 10 million states each, sampling every 10,000 steps was computed. Logs of 505 MCMC runs were examined using Tracer 1.7.1 program 83. Burn-ins were set to be 2% of 506 iterations.

507 All trees with the highest log-likelihood from the ML analysis were visualized using the 508 MEGA7 program 82. The tree is drawn to scale, with branch lengths measured in the number of 509 substitutions per site. The bootstrapping values from ML analysis and/or the Posterior values 510 from Bayesian inference analysis are shown next to the branches.

511 Gene neighborhood analysis of the prophage loci. In order to identify the prophage loci, we 512 utilized the gene neighborhood analysis. Unlike many earlier studies 25,30,37, which relayed on 513 the existing phage database and sequences, our analysis is based on the operonic association 514 to comply with the extreme diversity of prophage loci. As the terminase is the core component of 515 prophage, we used it as a marker to identify potential prophage loci on the genome. We 516 collected the upstream and downstream gene neighbors of the terminase from the NCBI 517 GenBank files. All protein sequences were clustered using the similarity-based BLASTCLUST 518 program (ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html). Protein clusters were further 519 annotated with the conserved domains which are identified by the hmmscan program searching 520 against Pfam 80,84 and our own curated domain profiles.

521 Identification of the genes gained or lost at the ancestor of pathogenic species. To 522 identify the genes that were gained or lost at the ancestor of all Liberibacter pathogens, we first 523 utilized the OrthoFinder v2.2.3 program 39,85 with default parameters to infer groups of 524 orthologous gene clusters (orthogroups) among all 12 Liberibacter proteomes based on protein 525 homology detection by Diamond 86 and MCL clustering 87. Orthogroup comprises genes of

17

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

526 related species that evolved from a common gene ancestor by speciation 88. Given the common 527 ancestry of all Liberibacter species, the majority of the genes should share common ancestors 528 and be preserved in all these species, whereas the genes that were gained or lost at the 529 ancestor of the Liberibacter pathogens will display a different presence in either pathogenic 530 Liberibacter species or non-pathogenic ancestor. Therefore, we extracted these orthogroups 531 using a custom python script based on the following criteria: 1. Orthogroups with gene loss: the 532 ones present in non-pathogenic L. crescens but not in any of the pathogenic Candidatus 533 Liberibacter species; 2. Orthogroups with gene gain: the ones that are found in at least 4 534 Candidatus Liberibacter species (a total of 5 species used in this study), but not in L. crescens 535 genomes.

536 Importantly, by conducting case-by-case phylogenetic analyses, we realized that there 537 were still some false positive and false negatives in the ortholog clustering result. To overcome 538 this methodological limitation, we utilized another strategy given the special situation of 539 Liberibacter species. For ortholog clustering, the gene transfers between bacterial genomes are 540 the major obstacle. In the case of Liberibacter bacteria, all pathogenic species are endophytic 541 bacteria and their life cycle is entirely restricted in the host cells. Therefore, they have little 542 chance to exchange genes with other bacteria and evolution of their genes should have been 543 mainly influenced by sequence diversification and their relationship should be readily 544 computationally tractable using pairwise sequence similarity scores. Based on this evidence, we 545 conduced genome-wide BLASTP searches against the NCBI-NR database using all C.L. 546 asiaticus proteins as queries. The results showed a good correlation between sequence 547 similarity scores and the evolutionary distance. Thus, by this simple BLAST search, we were 548 able to identify 1) the proteins in the non-pathogenic ancestor that have no close homologs in 549 pathogenic Liberibacter species, and 2) the proteins in pathogenic species that have no close 550 homologs in the ancestor.

551 To validate the above results on gene gain and gene loss, we carried out extensive 552 bootstrapped maximum likelihood phylogenetic analyses using the MEGA7 program for all the 553 identified orthogroups by including both Liberibacter proteins and other related homologs that 554 were identified from NCBI-NR database. These three steps led to an identification of 323 555 orthogroups (about 335 genes) that are only found in non-pathogenic L. crescens and 35 556 orthogroups that are only present in pathogenic bacteria (Supplementary Table 3 and Table 4). 557 The number of gene-loss and gene-gain genes might differ between species or strains due to 558 genome-specific duplications.

18

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

559 Protein function analysis. We used three levels of analyses to conduct the functional 560 annotation of the proteins that were identified in this study. First, we retrieved the annotation 561 information from the NCBI RefSeq database and Gene Ontology (GO) terms using the 562 Blast2GO program 89. Their involvement in the molecular synthesis pathways was determined 563 by mapping the KEGG database 90. Second, for those proteins that have no functional 564 annotation, we annotated them using the conserved domains which were identified using the 565 hmmscan program searching against Pfam 80,84 and our own curated domain profiles. Finally, 566 for proteins with potential uncharacterized domains, we conducted detailed case-by-case 567 analysis of protein sequences and structures, such as homologous sequence searches (PSI- 568 BLAST) 72, multiple sequence alignment analysis (KALIGN, MUSCLE, PROMALS3D)74-76, 569 secondary structure prediction (JPRED) 77, and sequence-profile/profile-profile searches 570 (HHsearch) 80, to dissect their domain architectures, identify the conserved sequence/structural 571 features, and to predict aspects of their biochemical and biological function.

572 Protein structure prediction and analysis. The MODELLER (version 9v1) program 91 was 573 utilized for homology modeling of the tertiary structure of the YdjM protein by using the alpha 574 toxin (1OLP_A) as a template. The sequence identity between the template and the targets is 575 9%. Since in these low sequence-identity cases, sequence alignment is the most important 576 factor affecting the quality of model 92, the alignment used in this analysis has been carefully 577 built and cross-validated based on the information from HHsearch and edited manually using 578 the secondary structure information. Structural analysis and comparison were conducted using 579 the molecular visualization program PyMOL 93.

580

581 Data availability

582 Authors can confirm that all relevant data are included in the paper and/or its Supplementary 583 Information files.

584

585 Acknowledgements

586 Y.T., C.W., T.S., H.L., and D.Z. are supported by the Saint Louis University Start-up Fund. T.- 587 F.H. is supported by the National Institute of Food and Agriculture Hatch Project 02413. X.W. is 588 supported by a USDA National Institute of Food and Agriculture Hatch project 1018100, 589 National Science Foundation EPSCoR RII Track-4 Research Fellowship (NSF OIA 1928770),

19

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

590 an Alabama Agriculture Experiment Station ARES Agriculture Research Enhancement, 591 Exploration and Development (AgR-SEED) award, as well as a generous laboratory start-up 592 fund from Auburn University College of Veterinary. X.L. is supported by the National Institute of 593 Food and Agriculture Hatch Project 02634.

594

595 Author contributions

596 D.Z. conceived the overall project. Y.T., C.W., T.S., H.L. performed the phylogenetic analysis. 597 Y.T and D.Z. performed genome comparisons, GO and pathway analysis, protein domain 598 identification, domain architecture analysis, and structure analysis/modeling. Y.T. and C.W. 599 conducted the operonic analysis of prophages and studied the function of phage components. 600 Y.T., C.W., and R.F.D were involved in ortholog analysis. Y.T. and R.F.D. contributed to the 601 data organization. Y.T., X.T., T.F.H., X.W., X.L. and D.Z. contributed to the design of the 602 analysis strategies. Y.T., H.L., X.T., T.F.H., X.W., X.L., and D.Z. interpreted the results. Y.T. and 603 D.Z. wrote the manuscript. All authors have read, revised, and commented on the manuscript.

604

605 Additional information

606 Supplementary Information accompanies this paper are available at Nature Communications 607 online.

608

609 Competing Interests

610 The authors declare no competing interests.

20

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

611 Figures and legends

612 Fig. 1 Phylogenetic relationship of HLB-associated bacteria and their relative species. a 613 Detailed information of Liberibacter species whose genomes were investigated in this study, 614 including their evolutionary relationship, transmitted vectors, hosts, and associated diseases. 615 The evolutionary relationship was derived from phylogenetic trees of 16S rRNA (b), 23S rRNA 616 (c), and DNA polymerase A (d). b-d The phylogenetic trees were inferred using the Maximum 617 Likelihood (ML) method, where the supporting values from 100 bootstrap are shown for major 618 branches only. The outgroup clades are colored in grey background.

619

620

621

21

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

622 Fig. 2 Whole genome comparisons of Liberibacter species. The genomes are labelled by their 623 scientific names followed by strain information. Cyan and yellow boxes are the protein coding 624 frames and RNA regions on either forward (+, top) and reverse (-, bottom) strands, respectively. 625 Homologous genes between genomes are linked by lines, where the red and blue lines 626 represent the forward and reverse (complementary) matches, respectively. The intensity of the 627 color bands is proportional to the percent identity of the match, where higher intensity indicates 628 higher sequence identity. The phage loci are shown in colored boxes (legend on the right).

629

630

631

632

22

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

633 Fig. 3 Genomic structures of the prophages identified in the Liberibacter species. The coding 634 frames in prophage loci are presented in blocks. The blocks, labelled with gene annotations and 635 highlighted in different colors, are the coding products shared by at least four prophage loci, 636 while the small grey blocks represent non-conserved coding products or pseudogenes. The 637 genome structures were aligned based on the shared terminases of these prophage loci. On the 638 left side, the prophage loci are indicated by their species names, strains, and terminase 639 accession numbers or locus tag. The prophage loci were classified into four types (LC1, LC2, 640 SC-like, UT) and a distinct prophage fragment found in a C.L. solanacearum ZC1 strain. One 641 C.L. europaeus SC-like prophage structure was derived from two genome contigs, 642 PSQJ01000015.1 and PSQJ01000003.1, which is indicated by a star (*).

643

644

645

23

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

646 Fig. 4 Independent origins of Liberibacter prophages. a Phylogenetic relationship of prophage 647 terminases. The phylogeny was constructed by using Maximum Likelihood (ML) and Bayesian 648 Inference (BI) methods. The ML tree with the highest log likelihood is presented with the major 649 branches supported by both ML bootstrap values (top) and the BI Posterior values (bottom, 650 bracketed). The terminases from the previously-defined Liberibacter prophage types are 651 highlighted in different background color: LC2 prophage in light blue, LC1 prophage in light 652 green, ZC1 prophage in light purple, UT prophage in light red, and SC prophage in yellow. 653 Sequence annotations were colored differently according to their species. The terminase 654 sequences from C.L. asiaticus strains are identical, therefore, one sequence is used to 655 represent and indicated by hashtag (#). Several protein sequences were translated by the 656 ORFfinder program, which are indicated by a star symbol (*). b Whole genome comparison of 657 different strains of C.L. asiaticus. Homologous genes between genomes are linked by lines. 658 Colored boxes are the prophage loci and the high variation of SC-type prophage loci is shown at 659 the 3’ end of the genomes. c The inferred independent origins of different Liberibacter phage 660 types. The evolutionary relationship of Liberibacter species is shown on the left and their 661 composition of the phage loci are illustrated on the right. The coloring themes refer to the Figure 662 2 legend. The red arrows indicate the potential gene transfer events of different prophage types.

24

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

663

25

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

664 Fig. 5 Ortholog clustering analysis identified the genomic genes that were gained and lost 665 during the transition from non-pathogenic ancestor to pathogenic decedents. a A conceptual 666 figure illustrating the strategies to identify the Liberibacter genes which were gained or lost 667 during evolution according to the phyletic patterns (presence or absence) of orthologous genes 668 in both ancestral free-living L. crescens and pathogenic decedents. b-d Phylogenetic verification 669 of gene transfer events of three gene products, exonuclease-endonuclease-phosphatase (EEP), 670 peptidase M75 (iron transporter), and sodium:dicarboxylate symporter family (SDF) protein, to 671 the pathogenic ancestor. The trees were inferred using Maximum Likelihood (ML) method, and 672 the supporting values from 100 bootstraps are shown for major branches. Sequences were 673 annotated by their species and strains, colored accordingly. Some sequences from different C.L. 674 asiaticus stains are identical and so we use one sequence to represent, which is indicated by a 675 star symbol (*).

26

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

676 27

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

677 Fig.6 KEGG pathway analysis reveals a complete deletion of a chain of enzymes of the 678 pathways responsible for biosynthesis of Tryptophan (a) and Histidine (b) in the pathogenic 679 Liberibacter species. In each case, operonic structure of biosynthesis enzymes in the ancestral 680 L. crescens BT-1 genome is presented above, with the enzyme genes in colored arrow blocks 681 and gene boundaries indicated by genomic locations; the KEGG pathway is shown below, with 682 reaction steps in arrowed lines, colored according to the operonic blocks. The grey blocks 683 indicate the genes that are not associated with the pathway and not lost in pathogenic 684 Liberibacter species.

685

686

687

688

689

690

28

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

691 Fig. 7 Identification of potential protein toxins. a Domain architectures of representative 692 polymorphic toxins containing the PD domain and the Ntox52 domain. The PD toxins of C.L. 693 asiaticus were highlighted in red. Domain architectures are labelled by accession numbers, 694 species names and their lineages in bracket. Domain architectures are not drawn to scale. b 695 Multiple sequence alignment between the YdjM, alpha toxin, Het-C, and S1-P1 nuclease 696 domains. The conserved catalytic residues are highlighted in black background. c Structural 697 comparison of YdjM and the alpha toxin domains. Representative domain architectures of these 698 toxin proteins are shown.

29

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

699

700

30

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

701 References

702 1 Jagoueix, S., Bove, J.-m. & Garnier, M. The phloem-limited bacterium of greening 703 disease of citrus is a member of the α subdivision of the Proteobacteria. International 704 Journal of Systematic and Evolutionary Microbiology 44, 379-386 (1994). 705 2 Texeira, D. et al. First report of a huanglongbing-like disease of citrus in São Paulo State, 706 Brazil and association of a new Liberibacter species,“Candidatus Liberibacter 707 americanus”, with the disease. Plant disease 89, 107-107 (2005). 708 3 Munyaneza, J., Sengoda, V., Crosslin, J., Garzon-Tiznado, J. & Cardenas-Valenzuela, 709 O. First report of “Candidatus Liberibacter solanacearum” in tomato plants in Mexico. 710 Plant Disease 93, 1076-1076 (2009). 711 4 Liefting, L. W. et al. A new ‘Candidatus Liberibacter’species associated with diseases of 712 solanaceous crops. Plant disease 93, 208-214 (2009). 713 5 Katoh, H., Miyata, S.-i., Inoue, H. & Iwanami, T. Unique features of a Japanese 714 ‘Candidatus Liberibacter asiaticus’ strain revealed by whole genome sequencing. PLoS 715 One 9, e106109 (2014). 716 6 Graca, J. d. Citrus greening disease. Annual review of phytopathology 29, 109-136 717 (1991). 718 7 Callaway, E. Bioterror: the green menace. Nature 452, 148-150, doi:10.1038/452148a 719 (2008). 720 8 Secor, G. A. & Rivera-Varas, V. V. Emerging diseases of cultivated potato and their 721 impact on Latin America. Revista Latinoamericana de la Papa (Suplemento) 1, 1-8 722 (2004). 723 9 Munyaneza, J., Crosslin, J. & Upton, J. Association of Bactericera cockerelli (Homoptera: 724 Psyllidae) with “zebra chip,” a new potato disease in southwestern United States and 725 Mexico. Journal of Economic Entomology 100, 656-663 (2007). 726 10 Capoor, S., Rao, D. & Viswanath, S. Diaphorina citri Kuway., a vector of the greening 727 disease of citrus in India. Indian J. Agric. Sci 37, 572-576 (1967). 728 11 McClean, A. & Oberholzer, P. Citrus psylla, a vector of the greening disease of sweet 729 orange. South African Journal of Agricultural Science 8, 297-298 (1965). 730 12 Hogenhout, S. A., Van der Hoorn, R. A., Terauchi, R. & Kamoun, S. Emerging concepts 731 in effector biology of plant-associated organisms. Molecular plant-microbe interactions 732 22, 115-122 (2009). 733 13 Prasad, S., Xu, J., Zhang, Y. & Wang, N. SEC-translocon dependent extracytoplasmic 734 proteins of Candidatus Liberibacter asiaticus. Frontiers in microbiology 7, 1989 (2016).

31

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

735 14 Clark, K. et al. An effector from the Huanglongbing-associated pathogen targets citrus 736 proteases. Nature communications 9, 1-11 (2018). 737 15 Du, P. et al. “Candidatus Liberibacter asiaticus” Secretes Nonclassically Secreted 738 Proteins That Suppress Host Hypersensitive Cell Death and Induce Expression of Plant 739 Pathogenesis-Related Proteins. Applied and Environmental Microbiology (2021). 740 16 Zhang, C. et al. A Sec-Dependent Secretory Protein of the Huanglongbing-Associated 741 Pathogen Suppresses Hypersensitive Cell Death in Nicotiana benthamiana. Frontiers in 742 microbiology 11 (2020). 743 17 Zhou, Y. et al. ‘Candidatus Liberibacter Asiaticus’ SDE1 Effector Induces Huanglongbing 744 Chlorosis by Downregulating Host DDX3 Gene. International journal of molecular 745 sciences 21, 7996 (2020). 746 18 Pitino, M., Armstrong, C. M., Cano, L. M. & Duan, Y. Transient expression of Candidatus 747 Liberibacter asiaticus effector induces cell death in Nicotiana benthamiana. Front. Plant 748 Sci. 7, 982 (2016). 749 19 Ying, X. et al. Identification of the virulence factors of Candidatus Liberibacter asiaticus 750 via heterologous expression in Nicotiana benthamiana using Tobacco Mosaic Virus. 751 International journal of molecular sciences 20, 5575 (2019). 752 20 Zhang, C. et al. A novel ‘Candidatus Liberibacter asiaticus’-encoded Sec-dependent 753 secretory protein suppresses programmed cell death in Nicotiana benthamiana. 754 International journal of molecular sciences 20, 5802 (2019). 755 21 Pang, Z. et al. Citrus CsACD2 is a target of Candidatus Liberibacter asiaticus in 756 Huanglongbing disease. Plant physiology 184, 792-805 (2020). 757 22 Lin, H. et al. Complete genome sequence of a Chinese strain of “Candidatus 758 Liberibacter asiaticus”. Genome announcements 1 (2013). 759 23 Zheng, Z., Deng, X. & Chen, J. Whole-genome sequence of “Candidatus Liberibacter 760 asiaticus” from Guangdong, China. Genome Announcements 2 (2014). 761 24 Zheng, Z. et al. A type 3 prophage of ‘Candidatus Liberibacter asiaticus’ carrying a 762 restriction-modification system. Phytopathology 108, 454-461 (2018). 763 25 Dai, Z. et al. Prophage diversity of ‘Candidatus Liberibacter asiaticus’ strains in 764 California. Phytopathology 109, 551-559 (2019). 765 26 Duan, Y. et al. Complete genome sequence of citrus huanglongbing 766 bacterium,‘Candidatus Liberibacter asiaticus’ obtained through metagenomics. 767 Molecular Plant-Microbe Interactions 22, 1011-1020 (2009).

32

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

768 27 Lin, H. et al. Complete genome sequence of “Candidatus Liberibacter africanus,” a 769 bacterium associated with citrus huanglongbing. Genome announcements 3 (2015). 770 28 Lin, H. et al. The complete genome sequence of ‘Candidatus Liberibacter solanacearum’, 771 the bacterium associated with potato zebra chip disease. PLoS One 6, e19135 (2011). 772 29 Thompson, S. et al. First report of'Candidatus Liberibacter europaeus' associated with 773 psyllid infested Scotch broom. New Disease Reports 27, 2044-0588.2013 (2013). 774 30 Leonard, M. T., Fagen, J. R., Davis-Richardson, A. G., Davis, M. J. & Triplett, E. W. 775 Complete genome sequence of Liberibacter crescens BT-1. Standards in genomic 776 sciences 7, 271-283 (2012). 777 31 Fagen, J. R. et al. Liberibactercrescens gen. nov., sp. nov., the first cultured member of 778 the genus Liberibacter. International journal of systematic and evolutionary microbiology 779 64, 2461-2466 (2014). 780 32 Thapa, S. P. et al. Genome‐wide analyses of Liberibacter species provides insights into 781 evolution, phylogenetic relationships, and virulence factors. Molecular plant pathology 21, 782 716-731 (2020). 783 33 Brüssow, H., Canchaya, C. & Hardt, W.-D. Phages and the evolution of bacterial 784 pathogens: from genomic rearrangements to lysogenic conversion. Microbiology and 785 molecular biology reviews 68, 560-602 (2004). 786 34 Faruque, S. M., Alim, A. A., Albert, M. J., Islam, K. N. & Mekalanos, J. J. Induction of the 787 lysogenic phage encoding cholera toxin in naturally occurring strains of toxigenic Vibrio 788 cholerae O1 and O139. Infection and immunity 66, 3752-3757 (1998). 789 35 Tozzoli, R. et al. Shiga toxin-converting phages and the emergence of new pathogenic 790 Escherichia coli: a world in motion. Frontiers in cellular and infection microbiology 4, 80 791 (2014). 792 36 Zhang, S. et al. ‘Ca. Liberibacter asiaticus’ carries an excision plasmid prophage and a 793 chromosomally integrated prophage that becomes lytic in plant infections. Molecular 794 plant-microbe interactions 24, 458-468 (2011). 795 37 Dominguez-Mirazo, M., Jin, R. & Weitz, J. S. Functional and Comparative Genomic 796 Analysis of Integrated Prophage-Like Sequences in “Candidatus Liberibacter asiaticus”. 797 Msphere 4 (2019). 798 38 Black, L. W. DNA packaging and cutting by phage terminases: control in phage T4 by a 799 synaptic mechanism. Bioessays 17, 1025-1030 (1995).

33

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

800 39 Emms, D. M. & Kelly, S. OrthoFinder: solving fundamental biases in whole genome 801 comparisons dramatically improves orthogroup inference accuracy. Genome biology 16, 802 1-14 (2015). 803 40 Li, C., Wu, P.-Y. & Hsieh, M. Growth-phase-dependent transcriptional regulation of the 804 pcm and surE genes required for stationary-phase survival of Escherichia coli. 805 Microbiology 143, 3513-3520 (1997). 806 41 Li, W., Cong, Q., Pei, J., Kinch, L. N. & Grishin, N. V. The ABC transporters in 807 Candidatus Liberibacter asiaticus. Proteins: Structure, Function, and Bioinformatics 80, 808 2614-2628 (2012). 809 42 Zhang, D., Iyer, L. M. & Aravind, L. A novel immunity system for bacterial nucleic acid 810 degrading toxins and its recruitment in various eukaryotic and DNA viral systems. 811 Nucleic acids research 39, 4532-4552 (2011). 812 43 Zhang, D., de Souza, R. F., Anantharaman, V., Iyer, L. M. & Aravind, L. Polymorphic 813 toxin systems: comprehensive characterization of trafficking modes, processing, 814 mechanisms of action, immunity and ecology using comparative genomics. Biology 815 direct 7, 1-76 (2012). 816 44 Makarova, K. S. et al. Antimicrobial peptides, polymorphic toxins, and self-nonself 817 recognition systems in Archaea: An untapped armory for intermicrobial conflicts. MBio 818 10 (2019). 819 45 Iyer, L. M., Zhang, D., Rogozin, I. B. & Aravind, L. Evolution of the deaminase fold and 820 multiple origins of eukaryotic editing and mutagenic nucleic acid deaminases from 821 bacterial toxin systems. Nucleic acids research 39, 9473-9497 (2011). 822 46 Jamet, A. & Nassif, X. New players in the toxin field: polymorphic toxin systems in 823 bacteria. MBio 6 (2015). 824 47 Kung, V. L. et al. An rhs gene of aeruginosa encodes a virulence protein 825 that activates the inflammasome. Proceedings of the National Academy of Sciences 109, 826 1275-1280 (2012). 827 48 Elhosseiny, N. M. & Attia, A. S. Acinetobacter: an emerging pathogen with a versatile 828 secretome. Emerging microbes & infections 7, 1-15 (2018). 829 49 Jamet, A. et al. A new family of secreted toxins in pathogenic Neisseria species. PLoS 830 Pathog 11, e1004592 (2015). 831 50 Whitney, J. C. et al. A broadly distributed toxin family mediates contact-dependent 832 antagonism between gram-positive bacteria. Elife 6, e26938 (2017).

34

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

833 51 Lien, Y.-W. & Lai, E.-M. Type VI secretion effectors: methodologies and biology. 834 Frontiers in cellular and infection microbiology 7, 254 (2017). 835 52 Aravind, L., Zhang, D., de Souza, R. F., Anand, S. & Iyer, L. M. The natural history of 836 ADP-ribosyltransferases and the ADP-ribosylation system. Endogenous ADP- 837 Ribosylation, 3-32 (2014). 838 53 Clark, G. C. et al. Clostridium absonum α-toxin: new insights into Clostridial 839 phospholipase C substrate binding and specificity. Journal of molecular biology 333, 840 759-769 (2003). 841 54 Paoletti, M. & Clave, C. The fungus-specific HET domain mediates programmed cell 842 death in Podospora anserina. Eukaryotic Cell 6, 2001-2008 (2007). 843 55 Nešić, D., Hsu, Y. & Stebbins, C. E. Assembly and function of a bacterial genotoxin. 844 Nature 429, 429-433 (2004). 845 56 Song, J., Gao, X. & Galán, J. E. Structure and function of the Salmonella Typhi 846 chimaeric A 2 B 5 typhoid toxin. Nature 499, 350-354 (2013). 847 57 Tan, Y., Schneider, T., Leong, M., Aravind, L. & Zhang, D. Novel immunoglobulin 848 domain proteins provide insights into evolution and pathogenesis of SARS-CoV-2- 849 related viruses. MBio 11 (2020). 850 58 Tan, Y. et al. Unification and extensive diversification of M/Orf3-related ion channel 851 proteins in coronaviruses and other nidoviruses. Virus Evolution 7, veab014 (2021). 852 59 Lai, K.-K., Davis-Richardson, A. G., Dias, R. & Triplett, E. W. Identification of the genes 853 required for the culture of Liberibacter crescens, the closest cultured relative of the 854 Liberibacter plant pathogens. Frontiers in microbiology 7, 547 (2016). 855 60 Fagen, J. R. et al. Comparative genomics of cultured and uncultured strains suggests 856 genes essential for free-living growth of Liberibacter. PLoS One 9, e84469 (2014). 857 61 Frampton, R. A. et al. Draft genome sequence of a “Candidatus Liberibacter europaeus” 858 strain assembled from broom psyllids (Arytainilla spartiophila) from New Zealand. 859 Genome announcements 6 (2018). 860 62 Zhang, D., Burroughs, A. M., Vidal, N. D., Iyer, L. M. & Aravind, L. Transposons to toxins: 861 the provenance, architecture and diversification of a widespread class of eukaryotic 862 effectors. Nucleic acids research 44, 3513-3533 (2016). 863 63 Zhang, D., Iyer, L. M., Burroughs, A. M. & Aravind, L. Resilience of biochemical activity 864 in protein domains in the face of structural divergence. Current opinion in structural 865 biology 26, 92-103 (2014).

35

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

866 64 Burroughs, A. M., Zhang, D., Schäffer, D. E., Iyer, L. M. & Aravind, L. Comparative 867 genomic analyses reveal a vast, novel network of nucleotide-centric systems in 868 biological conflicts, immunity and signaling. Nucleic acids research 43, 10633-10654 869 (2015). 870 65 Iyer, L. M., Zhang, D., Maxwell Burroughs, A. & Aravind, L. Computational identification 871 of novel biochemical systems involved in oxidation, glycosylation and other complex 872 modifications of bases in DNA. Nucleic acids research 41, 7635-7655 (2013). 873 66 Essuman, K. et al. The SARM1 Toll/interleukin-1 receptor domain possesses intrinsic 874 NAD+ cleavage activity that promotes pathological axonal degeneration. Neuron 93, 875 1334-1343. e1335 (2017). 876 67 Wan, L. et al. TIR domains of plant immune receptors are NAD+-cleaving enzymes that 877 promote cell death. Science 365, 799-803 (2019). 878 68 Iyer, L. M., Zhang, D. & Aravind, L. Adenine methylation in eukaryotes: Apprehending 879 the complex evolutionary history and functional potential of an epigenetic modification. 880 Bioessays 38, 27-40 (2016). 881 69 Wang, X. et al. Structural basis of N 6-adenosine methylation by the METTL3–METTL14 882 complex. Nature 534, 575-578 (2016). 883 70 Kazlauskiene, M., Kostiuk, G., Venclovas, Č., Tamulaitis, G. & Siksnys, V. A cyclic 884 oligonucleotide signaling pathway in type III CRISPR-Cas systems. Science 357, 605- 885 609 (2017). 886 71 Niewoehner, O. et al. Type III CRISPR–Cas systems produce cyclic oligoadenylate 887 second messengers. Nature 548, 543-548 (2017). 888 72 Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein 889 database search programs. Nucleic acids research 25, 3389-3402 (1997). 890 73 Carver, T. J. et al. ACT: the Artemis comparison tool. Bioinformatics 21, 3422-3423 891 (2005). 892 74 Lassmann, T. & Sonnhammer, E. L. Kalign–an accurate and fast multiple sequence 893 alignment algorithm. BMC bioinformatics 6, 1-9 (2005). 894 75 Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high 895 throughput. Nucleic acids research 32, 1792-1797 (2004). 896 76 Pei, J., Kim, B.-H. & Grishin, N. V. PROMALS3D: a tool for multiple protein sequence 897 and structure alignments. Nucleic acids research 36, 2295-2300 (2008). 898 77 Drozdetskiy, A., Cole, C., Procter, J. & Barton, G. J. JPred4: a protein secondary 899 structure prediction server. Nucleic acids research 43, W389-W394 (2015).

36

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

900 78 Taylor, W. R. The classification of amino acid conservation. Journal of theoretical 901 Biology 119, 205-218 (1986). 902 79 Goodstadt, L. & Ponting, C. P. CHROMA: consensus-based colouring of multiple 903 alignments for publication. Bioinformatics 17, 845-846 (2001). 904 80 Steinegger, M. et al. HH-suite3 for fast remote homology detection and deep protein 905 annotation. BMC bioinformatics 20, 1-15 (2019). 906 81 Käll, L., Krogh, A. & Sonnhammer, E. L. Advantages of combined transmembrane 907 topology and signal peptide prediction—the Phobius web server. Nucleic acids research 908 35, W429-W432 (2007). 909 82 Kumar, S., Stecher, G. & Tamura, K. MEGA7: molecular evolutionary genetics analysis 910 version 7.0 for bigger datasets. Molecular biology and evolution 33, 1870-1874 (2016). 911 83 Suchard, M. A. et al. Bayesian phylogenetic and phylodynamic data integration using 912 BEAST 1.10. Virus evolution 4, vey016 (2018). 913 84 Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Research 49, 914 D412-D419 (2021). 915 85 Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative 916 genomics. Genome biology 20, 1-14 (2019). 917 86 Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using 918 DIAMOND. Nature methods 12, 59-60 (2015). 919 87 Li, L., Stoeckert, C. J. & Roos, D. S. OrthoMCL: identification of ortholog groups for 920 eukaryotic genomes. Genome research 13, 2178-2189 (2003). 921 88 Kuzniar, A., van Ham, R. C., Pongor, S. & Leunissen, J. A. The quest for orthologs: 922 finding the corresponding gene across genomes. Trends in Genetics 24, 539-551 (2008). 923 89 Conesa, A. et al. Blast2GO: a universal tool for annotation, visualization and analysis in 924 functional genomics research. Bioinformatics 21, 3674-3676 (2005). 925 90 Kanehisa, M., Furumichi, M., Sato, Y., Ishiguro-Watanabe, M. & Tanabe, M. KEGG: 926 integrating viruses and cellular organisms. Nucleic Acids Research 49, D545-D551 927 (2021). 928 91 Webb, B. & Sali, A. Comparative protein structure modeling using MODELLER. Current 929 protocols in bioinformatics 54, 5.6. 1-5.6. 37 (2016). 930 92 Cozzetto, D. & Tramontano, A. Relationship between multiple sequence alignments and 931 quality of protein comparative models. Proteins: Structure, Function, and Bioinformatics 932 58, 151-157 (2005).

37

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

933 93 DeLano, W. L. Pymol: An open-source molecular graphics tool. CCP4 Newsletter on 934 protein crystallography 40, 82-92 (2002).

935

38

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

a Bacteria Vector Host Associated Disease

C.L. asiaticus psy62

C.L. asiaticus JXGC

C.L. asiaticus Ishi-1 Diaphorina citri C.L. asiaticus A4 Citrus Huanglongbing Endophytic C.L. asiaticus gxpsy Pathogenic C.L. asiaticus AHCA1 Gene gain C.L. africanus PTSAPSY Trioza erytreae Potato Zebra Chip C.L. solanacearum ZC1 Bactericera cockerelli Tomato Psyllid Yellows

Gene C.L. americanus SaoPaulo Diaphorina citri Citrus Huanglongbing loss Stunted Growth of Shoots, C.L. europaeus NZ1 Arytainilla spartiophila Cytisus scoparius Shortened Internodes, Leaf Dwarfing, Leaf Tip Chlorosis L. crescens BT-1 Free-living Unknown Papaya Unknown L. crescens BT-0 Non-pathogenic

Outgroup

bC.L. asiaticus psy62 cC.L. asiaticus psy62 d C.L. asiaticus AHCA1 C.L. asiaticus gxpsy C.L. asiaticus JXGC C.L. asiaticus psy62 C.L. asiaticus JXGC 1.00 C.L. asiaticus Ishi1 1.00 C.L. asiaticus A4 1.00 C.L. asiaticus Ishi-1 C.L. asiaticus A4 C.L. asiaticus JXGC 1.00 1.00 0.92 C.L. asiaticus A4 C.L. asiaticus gxpsy C.L. asiaticus gxpsy C.L. asiaticus AHCA1 C.L. asiaticus AHCA1 C.L. asiaticus Ishi-1 0.94 0.97 1.00 C.L. africanus PTSAPSY C.L. africanus PTSAPSY C.L. africanus PTSAPSY 1.00 C.L. solanacearum ZC1 1.00 C.L. solanacearum ZC1 1.00 C.L. solanacearum ZC1

0.98 C.L. americanus SaoPaulo C.L. americanus SaoPaulo C.L. americanus SaoPaulo 1.00 1.00 1.00 C.L. europaeus NZ1 0.88 C.L. europaeus NZ1 0.61 C.L. europaeus NZ1 1.00 L. crescens BT-1 1.00 L. crescens BT-1 1.00 L. crescens BT-1 L. crescens BT-0 L. crescens BT-0 L. crescens BT-0 P. zundukense Tri-48 P. zundukense Tri-48 P. zundukense Tri-48 R. grahamii BG7 R. leguminosarum Norway R. grahamii BG7 1.00 R. leguminosarum Norway 0.99 R. grahamii BG7 1.00 R. gallicum R602

R. gallicum R602 1.00 R. gallicum R602 R. leguminosarum Norway E. adhaerens OV14 E. adhaerens OV14 0.90 E. adhaerens OV14 0.98 Outgroup S. fredii CCBAU25509 0.89 E. alkalisoli YIC4027 Outgroup S. americanum CFNEI73 Outgroup S. americanum CFNEI73 S. fredii CCBAU25509 0.99 S. fredii CCBAU25509

1.00 E. alkalisoli YIC4027 S. americanum CFNEI73 E. alkalisoli YIC4027 S. tmeliloti KH35c S. meliloti KH35c S. meliloti KH35c

0.01 0.02 0.1 16S rRNA 23S rRNA DNA Polymerase A bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. C.L. asiaticus (+) gxpsy (-)

C.L. africanus (+) PTSAPSY (-)

C.L. solanacearum (+) ZC1 (-) Prophage LC1

Prophage LC2

Prophage SC-like

Prophage UT

Prophage ZC1

C.L. americanus (+) SaoPaulo (-)

C.L. europaeus (+) NZ1 (-)

L. crescens (+) BT-1 (-) Peptidase Phage Tail Phage L.cre BT-1 B488_RS02435 _S74_1 LC_2 Stabilisation LC_4 Tubular LC_1 DUF5309 Portal Terminase Protein LC1 bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which Peptidase Phage Tail DUF5309 Portal Terminase Phage L.cre BT-0was WP_060774378.1 not certified by peer review) is the author/funder,_S74_1 LC_2 Stabilisation LC_4 TubularwhoLC_1 has granted bioRxivProtein a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Peptidase Peptidase Peptidase Phage Tail Phage C.L.asi gxpsy WSI_RS05090 LC_TM _S74_2 _S74_1 _S74_1 LC_2 Stabilisation LC_4 Tubular Scaffold Portal Terminase LC_3 HTH_XRE DUF1376

Peptidase Peptidase Peptidase Phage Tail Phage C.L.asi psy62 CLIBASIA_RS05125 LC_TM _S74_2 _S74_1 _S74_1 LC_2 Stabilisation LC_4 Tubular Scaffold Portal Terminase LC_3 HTH_XRE DUF1376

Peptidase Peptidase Peptidase Phage Tail Phage C.L.asi A4 CD16_RS05195 LC_TM _S74_2 _S74_1 _S74_1 LC_2 Stabilisation LC_4 Tubular LC_1 Scaffold Portal Terminase LC_3 HTH_XRE DUF1376

Peptidase Peptidase Phage Tail Phage C.L.asi AHCA1 DIC79_RS05225 LC_TM _S74_2 _S74_1 LC_2 Stabilisation LC_4 Tubular LC_1 Scaffold Portal Terminase LC_3 HTH_XRE DUF1376

Peptidase Peptidase Peptidase Phage Tail Phage C.L.asi JXGC B2I23_RS05215 LC_TM _S74_2 _S74_1 _S74_1 LC_2 Stabilisation LC_4 Tubular LC_1 Scaffold Portal Terminase LC_3 HTH_XRE DUF1376

Peptidase Peptidase Peptidase Phage Tail Phage C.L.asi Ishi-1 CGUJ_RS05110 LC_TM _S74_2 _S74_1 _S74_1 LC_2 Stabilisation LC_4 Tubular Scaffold Portal Terminase LC_3 HTH_XRE DUF1376 LC2 Peptidase Phage Tail Phage HTH_XRE+ C.L.afr PTSAPSY WP_083965928.1 LC_TM _S74_1 LC_2 Stabilisation LC_4 Tubular LC_1 Scaffold Portal Terminase LC_3 HTH_XRE ATPase DUF1376 LC_14 HTH_XRE

Peptidase Peptidase Phage Tail Phage C.L.sol ZC1 WP_050780892.1 LC_TM _S74_2 _S74_1 LC_2 Stabilisation Tubular LC_1 DUF5309 Scaffold Portal Terminase LC_3 HTH_XRE ATPase DUF1376 LC_14

Peptidase Peptidase Peptidase Phage Tail Phage C.L.ame SaoPaulo WP_081604698.1 LC_TM _S74_2 _S74_1 _S74_1 LC_2 Stabilisation LC_4 Tubular LC_1 DUF5309 Scaffold Portal Terminase LC_3 HTH_XRE LC_14

Peptidase Peptidase Phage Tail Phage C.L.eur NZ1 PTL86342.1 LC_TM _S74_2 _S74_1 LC_2 Stabilisation LC_4 Tubular LC_1 DUF5309 Scaffold Portal Terminase LC_3 HTH_XRE ATPase DUF1376DUF1376 LC_14

Peptidase Peptidase Phage Tail Phage Portal L.cre BT-0 WP_144049198.1 LC_TM _S74_2 _S74_1 LC_2 Stabilisation LC_4 Tubular LC_1 DUF5309 Scaffold Terminase LC_3 DUF1376 ATPase

Peptidase Peptidase Phage Tail Phage Portal L.cre BT-1 WP_144049198.1 LC_TM _S74_2 _S74_1 LC_2 Stabilisation LC_4 Tubular LC_1 DUF5309 Scaffold Terminase LC_3 DUF1376 ATPase

HTH_XRE+ SC_ SC_ Tail Head-Tail Terminase Phage DNA DNA DNA Guanylate Phage C.L.asi A4 WP_055688163.1 TM+TM SC_TMTM+TM SC_TM SC_13 SC_8 SC_10 SC_12 Tubular Tube Capsid SC_1 SC_2 Connector HTH_Tnp HTH_XRE SC_4 SC_9 Repeat Protein Primase AHD3 SC_5 SC_4 SC_15 DUF2800 BroN DUF2815 Polymerase VRRNUC Ligase Kinase Integrase

HTH_XRE+ SC_ SC_ Tail Head-Tail Terminase Phage DNA DNA DNA Guanylate C.L.asi gxpsy WP_015453047.1 (SC-1) TM+TM SC_TM TM+TM SC_TM SC_13 SC_8 SC_10 SC_12 Tubular Tube Capsid SC_1 SC_2 Connector HTH_Tnp HTH_XRE SC_4 SC_9 Repeat Protein Primase AHD3 SC_4 SC_15 DUF2800 BroN DUF2815 Polymerase VRRNUC SNF Ligase Kinase

HTH_XRE+ SC_ Tail Head-Tail Terminase Phage DNA DNA DNA Guanylate Phage C.L.asi gxpsy WP_015453086.1 (SC-2) TM+TM SC_TM SC_6 SC_7 SC_11 SC_3 Tubular Tube Capsid SC_1 SC_2 Connector HTH_Tnp HTH_XRE SC_4 SC_9 Repeat Protein Primase AHD3 SC_4 SC_15 DUF2800 BroN DUF2815 Polymerase VRRNUC SNF Ligase Kinase Integrase

HTH_XRE+ SC_ Tail Head-Tail Terminase Phage DNA DNA DNA Guanylate Phage C.L.asi psy62 WP_015825010.1 TM+TM SC_TM SC_6 SC_7 SC_11 SC_3 Tubular Tube Capsid SC_1 SC_2 Connector HTH_Tnp HTH_XRE SC_4 SC_9 Repeat Protein Primase AHD3 SC_5 SC_4 SC_15 DUF2800 BroN DUF2815 Polymerase VRRNUC Ligase Kinase Integrase

DNA DNA Guanylate Phage C.L.asi AHCA1 DIC79_RS05475 SC_TM Terminase HTH_Tnp SC_9 AHD3 SC_4 SC_15 DUF2815 Polymerase VRRNUC SNF Ligase Kinase Integrase

NTPase NTPase HTH_XRE+ Phage DNA DNA DNA Guanylate Phage C.L.asi JXGC B2I23_RS05500 coiled-coil (fragment) (fragment) HsdR HsdS HsdM Terminase HTH_Tnp HTH_XRE SC_4 SC_9 Repeat Protein Primase AHD3 SC_4 SC_15 DUF2800 BroN DUF2815 Polymerase VRRNUC SNF Ligase Kinase Integrase

HTH_XRE+ SC_ Tail Head-Tail Terminase Phage DNA Guanylate Phage C.L.afr PTSAPSY WP_102030493.1 TM+TM SC_TM SC_3 Tubular Tube Capsid SC_1 SC_2 Connector HTH_XRE SC_4 SC_9 Protein AHD3 SC_5 Ligase Kinase Integrase

HTH_XRE+ SC_ Tail Head-Tail Terminase DNA DNA Phage SC-like C.L.sol ZC1 WP_013461597.1 (SC-1) TM+TM SC_TM SC_8 SC_3 Tubular Tube Capsid SC_2 Connector HTH_Tnp HTH_XRE Repeat SC_9 BroN SC_4 DUF2800 BroN DUF2815 Polymerase VRRNUC SNF Ligase Integrase

SC_ Head-Tail HTH_XRE+ HTH_XRE+ DNA Tail Terminase DNA Phage C.L.sol ZC1 WP_143827807.1 (SC-2) TM+TM SC_TM SC_8 SC_3 Tubular Tube Capsid SC_1 SC_2 Connector HTH_Tnp HTH_XRE HTH_XRE SC_9 BroN SC_4 DUF2800 BroN Polymerase VRRNUC SNF Ligase Integrase

HTH_XRE+ SC_ Tail Head-Tail Terminase Phage DNA DNA Guanylate C.L.ame SaoPaulo WP_007557549.1 (SC-1) TM+TM SC_TM SC_6 SC_7 SC_11 SC_3 Tubular Tube Capsid SC_1 SC_2 Connector HTH_Tnp HTH_XRE BroN Repeat Protein Primase AHD3 SC_5 SC_4 SC_15 DUF2800 DUF2815 Polymerase VRRNUC SNF Kinase

SC_ DNA Guanylate C.L.ame SaoPaulo (SC-2) TM+TM SC_TM SC_8 SC_10SC_3 SC_12 DUF2800 DUF2815 Polymerase VRRNUC SNF Kinase

Tail Head-Tail HTH_XRE+ Phage C.L.eur NZ1 C4617_04470 (SC-1) SC_TM SC_13 Tubular Tube Connector Terminase HTH_Tnp HTH_XRE Protein

HTH_XRE+ Tail Head-Tail Terminase DNA DNA DNA Guanylate Phage C.L.eur NZ1 PTL86184.1 (SC-2) SC_TM SC_13 SC_6 SC_7 SC_10 SC_3 SC_12 Tubular Tube Capsid SC_1 SC_2 Connector HTH_Tnp HTH_XRE BroN Primase SC_5 SC_5 SC_15 DUF2800 BroN DUF2815 Polymerase VRRNUC SNF Ligase Kinase Integrase

Head-Tail C.L.eur NZ1 PTL86079.1 (SC-3) SC_2 Connector Terminase

* Tail Head-Tail Phage DNA DNA DNA Guanylate Phage C.L.eur NZ1 PTL86012.1 (SC-4) SC_6 SC_7 SC_11 SC_3 Tubular Tube Capsid SC_1 SC_2 Connector Terminase Protein Primase SC_5 SC_15 DUF2800 BroN DUF2815 Polymerase VRRNUC SNF Ligase Kinase Integrase

Phage ZC1 C.L.sol ZC1 WP_013462293.1 Capsid Terminase Integrase

DNA DNA Guanylate Phage C.L.asi gxpsy WP_015453015.1 Terminase HTH_Tnp DUF2800 DUF2815 Polymerase VRRNUC SNF Ligase Kinase Protein

DNA DNA Guanylate C.L.asi psy62 WP_015453015.1 Terminase HTH_Tnp DUF2800 DUF2815 Polymerase VRRNUC SNF Ligase Kinase

DNA DNA Guanylate Phage C.L.asi A4 WP_015453015.1 Terminase HTH_Tnp DUF2800 DUF2815 Polymerase VRRNUC SNF Ligase Kinase Protein UT DNA DNA Guanylate C.L.asi AHCA1 WP_015453015.1 Terminase HTH_Tnp DUF2800 DUF2815 Polymerase VRRNUC SNF Ligase Kinase

DNA DNA Guanylate Phage C.L.asi JXGC WP_015453015.1 Terminase HTH_Tnp DUF2800 DUF2815 Polymerase VRRNUC SNF Ligase Kinase Protein

DNA DNA Guanylate Phage C.L.asi Ishi-1 WP_045490419.1 Terminase HTH_Tnp DUF2800 DUF2815 Polymerase VRRNUC SNF Ligase Kinase Integrase bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. a b P. sp. B. sp. D. bacterium S. americanum S. sp. BJ1 LC2 prophage 0.95 S. meliloti terminase clade R. leguminosarum 1.00 L. crescens BT-1 0.97 (1.0) L. crescens BT-0 C.L. europaeus NZ1 C.L. americanus SaoPaulo 0.88 C.L. solanacearum ZC1 C.L. africanus PTSAPSY C.L. asiaticus# 1.00 A. insulae (1.0) bacterium 0.88

1.00 0.85 0.99 (1.0) 0.95 H. bacterium P. sp. ctka020 C.D. bacterium LC1 prophage 1.00 L. crescens BT-1 terminase clade L. crescens BT-0 1.00 T. sp. (1.0) 1.00 R. ecuadorense R. mesoamericanum 0.98 R. bacterium 1.00

P. bacterium 1.00 P. bacterium uncultured Mediterranean phage uvMED Prophage LC2 Prophage SC-like Prophage UT C.P. archaeon 100 1.00 (1.0) F. bacterium C.D. bacterium N. sp. Nsp37 G. bacterium c C.L. asiaticus A4 LC2 UT SC M. tunisiensis SC-like prophage M. bacterium terminase clade N. bacterium C.L. asiaticus gxpsy LC2 UT SC-1 SC-2 0.86 D. bacterium 1.00 C.L. solanacearum ZC1 (ZC1) UT C.L. asiaticus psy62 LC2 UT SC C.L. solanacearum R1 1.00 C.L. europaeus NZ1 0.86 C.L. europaeus NZ1 C.L. asiaticus AHCA1 LC2 UT SC C.L. europaeus NZ1 C.L. europaeus NZ1* C.L. asiaticus JXGC LC2 UT SC 0.90 C.L. solanacearum ISR100 C.L. asiaticus gxpsy LC2 UT 0.99 C.L. asiaticus JXGC* C.L. asiaticus Ishi-1 (1.0) 0.91 C.L. asiaticus gxpsy C.L. asiaticus psy62 C.L. africanus PTSAPSY LC2 SC 1.00 C.L. asiaticus AHCA1* SC-like C.L. asiaticus A4 C.L. solanacearum ZC1 SC-1 LC2 SC-2 C.L. solanacearum ZC1 ZC1(Fragment) 0.97 C.L. solanacearum ZC1 0.89 SC C.L. solanacearum LsoNZ1 LC2 C.L. americanus SaoPaulo LC2 SC-1 -2 C.L. africanus PTSAPSY C.L. americanus SaoPaulo C.L. europaeus NZ1 LC2 SC-1 SC-2 SC-4 C.L. asiaticus TX2351 (UT) SC-3(Fragment) C.L. asiaticus Ishi-1 (UT) 0.99 C.L. asiaticus gxpsy (UT) L. crescens BT-1 LC1 LC2 0.99 C.L. asiaticus psy62 (UT) C.L. asiaticus A4 (UT) LC1 L. crescens BT-0 LC1 LC2 C.L. asiaticus AHCA1 (UT) 0.5 C.L. asiaticus JXGC (UT) Outgroup a bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version postedGene June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxivgain a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-NDGene 4.0 International license. loss

L. crescens

Outgroup C.L. africanus C.L. asiaticus C.L. europaeusC.L. americanusC.L. solanacearum

OrthoFinder Whole Genome OrthoFinder Analysis BLASTP Analysis

52 38 378

2923 9 BLASTP & Phylogenetic Analysis BLASTP & Phylogenetic Analysis 323 orthogroups 323 35 orthogroups are lost in 284 3 only present in pathogenic bacteria pathogenic bacteria

b 0.94 C.L. europaeus NZ1 c 0.94 C.L. europaeus NZ1 d 0.98 C.L. europaeus NZ1 C.L. americanus SaoPaulo C.L. americanus SaoPaulo C.L. americanus SaoPaulo C.L. solanacearum ISR100 1.00 C.L. solanacearum ZC1 C.L. solanacearum ZC1 C.L. solanacearum FIN114 C.L. solanacearum NZ1 0.83 0.86 1.00 C.L. solanacearum NZ1 1.00 C.L. solanacearum ISR100 C.L. solanacearum ISR100 C.L. solanacearum HenneA 1.00 C.L. solanacearum R1 1.00 C.L. solanacearum NZ1 C.L. solanacearum ZC1 C.L. solanacearum NZ1 C.L. solanacearum ZC1 C.L. solanacearum ZC1 C.L. solanacearum FIN114 0.89 C.L. solanacearum FIN114 C.L. solanacearum FIN114 0.91 C.L. africanus PTSAPSY C.L. africanus PTSAPSY C.L. solanacearum FIN114 C.L. asiaticus CRCFL16 C.L. asiaticus MFL16 1.00 C.L. africanus PTSAPSY 0.93 C.L. asiaticus gxpsy 0.95 C.L. asiaticus MFL16 C.L. africanus PTSAPSY C.L. asiaticus GFR3TX3 0.96 1.00 1.00 C.L. asiaticus* C.L. asiaticus* C.L. asiaticus* 0.98 S. arboris LMG_14919 C.L. asiaticus psy62 C.L. asiaticus TX2351 E. aridi LMR013 0.95 R. undicola ATCC_700741 C.L. asiaticus CRCFL16 1.00 E. aridi LMR001 A. vitis V80/94 R. bacterium Ac37b E. sp. LCM_4579 A. arsenijevicii 1.00 L. lansingensis NCTC12830 1.00 N. sp. StC3 A. rhizogenes SBV_302_78_2 0.96 A. tumefaciens complex L. sp. Alg231-35 0.87 L. hackeliae NCTC11979 M. sp. LSJC277A00 R. pusense S2 0.91 A. rosae SBV33 0.91 L. clemsonensis CDC-D5610 0.81 S. sp. W43 0.98 0.98 L. donaldsonii NCTC13292 1.00 P. sedentarius R. bacterium AFS063392 R. sp. NT-26 S. abyssi DSM_100673 L. feeleii NCTC11978 1.00 R. bacterium AFS084561 0.80 R. sp. RKSG952 L. sp. W10-070 0.80 R. bacterium AFS049984 M. sp. SYSU_G3D203 R. bacterium AFS054504 C.J. acanthamoeba UWC36 1.00 H. sp. EC-HK425 R. sp. LC145 0.98 C. avium PV4360_2 A. vitis CG415 R. sp. YS-13r C.X. pacificiensis L6 C. bacterium

0.5 0.5 0.5 Exonuclease-Endonuclease-Phosphatase Peptidase_M75 Sodium:dicarboxylate symporter family (EEP) (Iron transporter) (SDF) bioRxiv preprinta Tryptophan doi: https://doi.org/10.1101/2021.06.02.446850 biosynthesis pathway; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 1424383 trpE trpD trpCF trpB trpA 1417832

trpE trpD trpCF trpB trpA

Chorismate Anthranilate N-(5-Phospho- 1-(2-Carboxy- (3-Indolyl)-glycerol L-Tryptophan beta-D-ribosyl) phenylamino)- phosphate anthranilate 1-deoxy-D-ribulose 5-phosphate

b Histidine biosynthesis pathway 1224824 1225924 1080405 hisG hisG hisD 1083632 hisC

1470072 hisB hisH hisA hisF hisE 1473386

hisG hisE hisI hisA hisHF hisB

PRPP Phosphoribosyl- Phosphoribosyl- Phosphoribosyl- Phosphoribulosyl- D-erythro- Imidazole-acetol ATP AMP formimino- formimino- Imidazole-glycerol phosphate AICAR-phosphate AICAR-phosphate 3-phosphate AICAR hisC 1137779 1138225 hisI hisD hisD hisN

L-Histidine L-Histidinal L-Histidinol L-Histidinol phosphate

1212790 1213575 hisN bioRxiv preprint doi: https://doi.org/10.1101/2021.06.02.446850; this version posted June 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

a WP_012778667.1 C. Liberibacter asiaticus (Alphaproteobacteria) WP_011335381.1 Pseudomonas fluorescens ()

PD Ntox52 PseuCD PD Deaminase

WP_012778668.1 C. Liberibacter asiaticus (Alphaproteobacteria) VVE31266.1 Pandoraea iniqua (Betaproteobacteria)

PD Ntox52 SP PseuCD PD MPTase

NIH12428.1 Serratia symbiotica (Gammaproteobacteria) WP_057021785.1 Pseudomonas synxantha (Gammaproteobacteria) ART- PD Ntox52 PseuCD PD HYD3

WP_131701693.1 Enterobacter genomosp. O (Gammaproteobacteria) WP_060549918.1 Pseudomonas poae (Gammaproteobacteria)

RHS RHS RHS RHS RHS RHS RHS RHS Ntox52 PseuN PseuCD PD Tox-PL4

WP_014716902.1 Pseudomonas fluorescens (Gammaproteobacteria) WP_174554041.1 Pseudomonas palleroniana (Gammaproteobacteria) ART- SP PseuCD2 Ntox52 PseuCD PD VIP2

MBC8081718.1 Hymenobacter sp. (Bacteroidetes) WP_012667360.1 Erwinia pyrifoliae (Gammaproteobacteria) ART- DUF4280 PD REase-1 ErwinN2 EnterCD PD VIP2

WP_082334377.1 Selenomonas ruminantium (Firmicutes) WP_162617360.1 Providencia rettgeri (Gammaproteobacteria)

Phage_GPD PD AHH ??? RHS RHS RHS RHS RHS RHS RHS RHS RHS RHS RHS RHS RHS RHS RHS RHS RHS RHS RHS RHS RHS SHH EnteCD PD EndoU

WP_148527566.1 Vibrio cholerae (Gammaproteobacteria) WP_175248061.1 Vibrio cholerae (Gammaproteobacteria) Anthrax_ Anthrax_ SP SMAD PD toxA SP VibrCD PD toxA

b Secondary Structure YdjM 6 46 71 76 113 117 147 CLas_WP_015452856.1 MTYEGHCIFALASVILAKKVN 10 GTVVLGALLSCLLPDIDHP 14 TSCISSHRGF THSILAVILYKWLIHHFF 7 QGLQDALIIGYVSHLVADVLTPTGI 18 SPIGEIFFCVLFLIYAIIS CLaf_WP_047263772.1 MTFRGHCIFALASIILAKKIN 10 GIVIAGALLSCLLPDIDHP 14 PSFILNHRGF THSILAVILYGWLIHYFF 7 QGFQDSLIIGYVSHLVADILTPSGI 18 SPTIEIFFCVLCLIYAIII CLso_WP_044054324.1 MTFKGHCIFALASVILAKKVS 10 GIIALGALLSCLLPDIDHS 14 TSYILKHRGF THSILAVILYVECIHQFF 7 QDFQDAMIIGYVSHLVADILTPAGI 18 SSVREILFCVLFLIYATVV CLam_WP_023466341.1 MTFKGHCVFALSMIIIAKKFV 10 GNVILGSMLSFSLPDIDHP 14 TSHILTHRGF THSILSVVLFSLLINNCL 7 KGLHDSMIIGYISHLIADMFTKSGI 18 TPLKEGFFCVLILIFAIII CLeu_PTL86073.1 MTFKGHCIFSLSTIIVAKKIG 10 DIVIIGALLSCVLPDIDHP 14 TCGILNHRGF THSILSFFLYIWLLNQVL 7 RGFHDAMIIGYASHLIADILTTTGI 18 TPLKETLFCLLVLIYSIIG Gbet_WP_072548542.1 MMAGSHVALGAAAWLIVAPKL 16 AAGLVLAVAGALLPDVDHP 14 LSKAFGHRGV THSLLAIIACGWAMQHSR 2 RTIIDPLVTGYLSHLAADLLTPAGL 17 GSAFEPLIVALALWGAGSA Alpha toxin 11 56 68 126 130 136 152 Csar_1OLP_A DGTGTHSVIVTQAIEMLKHDL 15 EKNLHKFQLGSTFPDYDPN 2 -SLYQDHFWD 21 NAESQTRKFATLAKNEWD 2 NYEKAAWYLGQGMHYFGDLNTPYHA 10 HVKFETYAEERKDTYRLDT Bcer_2HUC_A EGVNSHLWIVNRAIDIMSRNT 11 NEWRTELENGIYAADYENP 4 -STFASHFYD 12 QAKETGAKYFKLAGESYK 2 DMKQAFFYLGLSLHYLGDVNQPMHA 12 HSKYENFVDTIKDNYKVTD Nmag_YP_003480544.1 ELTQDHAETIDEIRDYYEIEQ 90 ECSFDWLPGTGFLDDFLVD 12 HTHEPYHFYW 15 SAFGGAPGEAQYMMDQAY 2 SFSTRRDYLGYALHYVQDMGVPLHT 29 HFDYEEWAADNFEDADWYL Bsp._ZP_02000037.1 YELETHATISQEATKLSVLGT 26 KKLLEVIGDGARFEDNWP- --RFFNHFYD 40 DGRDYLYQALTLPNATER 2 NFGLFFRTLGHVTHHIADMAQPQHT 4 HAFGDNVLYEGYTEKVVDD Het-C Pent_YP_609853.1 AGEGGHGTIERVLEEV----- 62 -DDGEEQGDFQVSPEVLGV -YKASEHIDN 44 VAHMSSKIREAAREGK-- -TAEGMRAFGEALHVLED--YFAHS -NFVELSLHKQGHTQVLPW Psyr_NP_794309.1 EHTHTHGAIEAVLESA----- 61 --MRIDRSRFVVTPERLGV -YRPGEHIDN 44 VDIMVAGLESAVNAGPQ- -STDGLRDMGAALHILED--FFAHS -NFAELSLIKLGHTRVLPW Fvan_XP_003040294.1 GYVWRHGDIVEALRYLPASFV 51 ---GFATDEFEVTRERLGC -YKHVEHIDN 46 ADYVTTQLRECIRLGREK 3 SKKEAFIHLGAALHTLED--FSAHS -NFTELCLHELGEGDVFAF Pgra_EFP88730.1 GKAWRHGDIEDSLEGLVKVAG 78 ---GYATREFQVTRERLSV -YLLTEHIDN 68 TACIRRVFRQCIELGRQA 7 KEFESFRLLGTGLHTLED--FSAHS -NWCELALQKLGHKHVFAH S1-P1 nuclease Pcit_1AK0_A WGALGHATVAYVAQHYVSPEA 11 --SSSYLASIASWADEYRL 4 KWSASLHFID 21 CSISAIANYTQRVSDSSL 1 -SENHAEALRFLVHFIGDMTQPLHD 21 HSDWDTYMPQKLIGGHAL- Mmar_ZP_01687413.1 WGFFAHQRINRLAVFTLPSEM 9 -----YLTENAINPDARRY 2 KGEAPRHFID 36 IVPWHINFMKYRLTKAFK 3 -ATKILRLSAEIGHYIADANVPLHT 14 HGFWESRLPELYSDNY--- Gmal_ZP_07031395.1 WGRDGHMMINRLAAQYLPSDV 9 ---LDTMEYLGPEPDRWKG 7 DTQSPDHFFD 56 QVEEVWQRLKTDLREYRK 11 -QTVILFEAGWLGHYVGDGSQPLHT 21 HSLFESLYVSANIKPA--- Lint_YP_001639.1 WGWEGHRTIGIIAQQLLINSK 9 --GDLTLEQISTCPDELKA 22 TNTGPWHFID 20 CVVAEINKWSSVLADTTQ 1 -KAKRLQALSFVVHFIGDLHQPLHT 23 HSMWDINLVNYISTNP--- Consensus/75% hhh.sHs.bs..h..bh...... hh....hhs+b..s ...h.pHbh. sh..hh.bb..b...p.b ...p.hbhbGbh.Hbh.Dh..shh. .shh+.hb..bh...sh.. c

H76 D56 C H71 H68 D46 H126 N H113 H136

D130 D117 H6 H11 N E147 E152 C

YdjM (WP_015452856.1) Alpha toxin (1OLP_A)

WP_143942405.1 Streptomyces sp. MZ03-37 WP_081417210.1 Paenibacillus sp. Soil522 1OLP_A Clostridium sardiniense

beta- Alpha PLAT RHS RHS RHS RHS RHS YdjM YdjM sandwich Toxin