Articles https://doi.org/10.1038/s41588-017-0012-9

Genomic features of bacterial adaptation to plants

Asaf Levy1, Isai Salas Gonzalez2,3, Maximilian Mittelviefhaus4, Scott Clingenpeel1, Sur Herrera Paredes2,3,15, Jiamin Miao5,16, Kunru Wang5, Giulia Devescovi6, Kyra Stillman1, Freddy Monteiro2,3, Bryan Rangel Alvarez1, Derek S. Lundberg2,3, Tse-Yuan Lu7, Sarah Lebeis8, Zhao Jin9, Meredith McDonald2,3, Andrew P. Klein2,3, Meghan E. Feltcher2,3,17, Tijana Glavina Rio1, Sarah R. Grant2, Sharon L. Doty10, Ruth E. Ley11, Bingyu Zhao5, Vittorio Venturi6, Dale A. Pelletier7, Julia A. Vorholt4, Susannah G. Tringe1,12*, Tanja Woyke1,12* and Jeffery L. Dangl2,3,13,14*

Plants intimately associate with diverse bacteria. Plant-associated bacteria have ostensibly evolved genes that enable them to adapt to plant environments. However, the identities of such genes are mostly unknown, and their functions are poorly characterized. We sequenced 484 genomes of bacterial isolates from roots of Brassicaceae, poplar, and maize. We then compared 3,837 bacterial genomes to identify thousands of plant-associated gene clusters. Genomes of plant-associated bacteria encode more carbohydrate metabolism functions and fewer mobile elements than related non-plant-associated genomes do. We experimentally validated candidates from two sets of plant-associated genes: one involved in plant colonization, and the other serving in microbe–microbe competition between plant-associated bacteria. We also identified 64 plant-associated protein domains that potentially mimic plant domains; some are shared with plant-associated fungi and oomycetes. This work expands the genome-based understanding of plant–microbe interactions and provides potential leads for efficient and sustainable agriculture through microbiome engineering.

The microbiota of plants and animals have coevolved with their hosts for millions of years1–3. Through photosynthesis, plants serve as a rich source of carbon for diverse bacterial communities. These include mutualists and commensals, as well as pathogens. Phytopathogens and growth-promoting bacteria have considerable effects on plant growth, health, and productivity4–7. Except for intensively studied relationships such as root nodulation in legumes8, T-DNA transfer by Agrobacterium9, and type III secretion–mediated pathogenesis10, the molecular mechanisms that govern plant–microbe interactions are not well understood. It is therefore important to identify and characterize the bacterial genes and functions that help microbes thrive in the plant environment. Such knowledge should improve the ability to combat plant diseases and harness beneficial bacterial functions for agriculture, with direct effects on global food security, bioenergy, and carbon sequestration.

Cultivation-independent methods based on profiling of marker genes or shotgun metagenome sequencing have considerably improved the overall understanding of microbial ecology in the plant environment11–15. In parallel, reduced sequencing costs have enabled the genome sequencing of plant-associated bacterial isolates at a large scale16. Importantly, isolates enable functional validation of in silico predictions. Isolate genomes also provide genomic and evolutionary context for individual genes, as well as the potential to access genomes of rare organisms that might be missed by metagenomics because of limited sequencing depth. Although metagenome sequencing has the advantage of capturing the DNA of uncultivated organisms, multiple 16S rRNA gene surveys have reproducibly shown that the most common plant-associated bacteria are derived mainly from four phyla13,17 (Proteobacteria, Firmicutes, Bacteroidetes, and Actinobacteria) that are amenable to cultivation. Thus, bacterial cultivation is not a major limitation in sampling of the abundant members of the plant microbiome16.

Our objective was to characterize the genes that contribute to bacterial adaptation to plants (plant-associated genes) and those genes that specifically aid in bacterial root colonization (root-associated genes). We sequenced the genomes of 484 new bacterial isolates and single bacterial cells from the roots of Brassicaceae, maize, and poplar trees. We combined the newly sequenced genomes with existing genomes to create a dataset of 3,837 high-quality, nonredundant genomes. We then developed a computational approach to identify plant-associated genes and root-associated genes based on comparison of phylogenetically related genomes with knowledge of the origin of isolation. We experimentally validated two sets of plant-associated genes, including a previously unrecognized gene family that functions in plant-associated microbe–microbe competition. In addition, we characterized many plant-associated genes that are shared between bacteria of different phyla, and even between bacteria and plant-associated eukaryotes. This study represents

1DOE Joint Genome Institute, Walnut Creek, CA, USA. 2Department of Biology, University of North Carolina, Chapel Hill, NC, USA. 3Howard Hughes Medical Institute, Chevy Chase, MD, USA. 4Institute of Microbiology, ETH Zurich, Zurich, Switzerland. 5Department of Horticulture, Virginia Tech, Blacksburg, VA, USA. 6International Centre for Genetic Engineering and Biotechnology, Trieste, Italy. 7Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA. 8Department of Microbiology, University of Tennessee, Knoxville, TN, USA. 9Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA. 10School of Environmental and Forest Sciences, University of Washington, Seattle, WA, USA. 11Max Planck Institute for Developmental Biology, Tübingen, Germany. 12School of Natural Sciences, University of California, Merced, Merced, CA, USA. 13The Carolina Center for Genome Sciences, University of North Carolina, Chapel Hill, NC, USA. 14Department of Microbiology and Immunology, University of North Carolina, Chapel Hill, NC, USA. Present address: 15Department of Biology, Stanford University, Stanford, CA, USA; 16The Grassland College, Gansu Agricultural University, Lanzhou, Gansu, China; 17BD Technologies and Innovation, Research Triangle Park, NC, USA. Asaf Levy and Isai Salas Gonzalez contributed equally to this work. *e-mail: [email protected]; [email protected]; [email protected]

Nature Genetics | www.nature.com/naturegenetics © 2017 Nature America Inc., part of Springer Nature. All rights reserved.

a comprehensive and unbiased effort to identify and characterize candidate genes required at the bacteria–plant interface.

Results

Expanding the plant-associated bacterial reference catalog. To obtain a comprehensive reference set of plant-associated bacterial genomes, we isolated and sequenced 191, 135, and 51 novel bacterial strains from the roots of Brassicaceae (91% from Arabidopsis thaliana), poplar trees (Populus trichocarpa and Populus deltoides), and maize, respectively (Methods, Table 1, Supplementary Tables 1–3). The bacteria were specifically isolated from the interior (endophytic compartment) or surface (rhizoplane) of plant roots, or from soil attached to the root (rhizosphere). In addition, we isolated and sequenced 107 single bacterial cells from surface-sterilized roots of A. thaliana. All genomes were assembled, annotated, and deposited in public databases and in a dedicated website (“URLs,” Supplementary Table 3, Methods).

A broad, high-quality bacterial genome collection. In addition to the newly sequenced genomes noted above, we collected 5,587 bacterial genomes belonging to the four most abundant phyla of plant-associated bacteria13 from public databases (Methods). We manually classified each genome as plant-associated, non-plant-associated (NPA), or soil-derived on the basis of its unambiguous isolation niche (Methods, Supplementary Tables 1 and 2). The plant-associated genomes included organisms isolated from plants or rhizospheres. A subset of the plant-associated bacteria was also annotated as ‘root-associated’ when isolated from the rhizoplane or the root endophytic compartment. Genomes from bacteria isolated from soil were considered as a separate group, as it is unknown whether these strains can actively associate with plants. Finally, the remaining genomes were labeled as NPA genomes; these were isolated from diverse sources, including humans, non-human animals, air, sediments, and aquatic environments.

We carried out stringent quality control to remove low-quality or redundant genomes (Methods). This led to a final dataset of 3,837 high-quality and nonredundant genomes, including 1,160 plant-associated genomes, 523 of which were also root-associated. We grouped these 3,837 genomes into nine monophyletic taxa to allow comparative genomics analysis among phylogenetically related genomes (Fig. 1a, Supplementary Tables 1 and 2, Methods, “URLs”).

To determine whether our genome collection from cultured isolates was representative of plant-associated bacterial communities, we analyzed cultivation-independent 16S rDNA surveys and metagenomes from the plant environments of Arabidopsis11,12, barley18, wheat, and cucumber14 (Methods). The nine taxa analyzed here account for 33–76% (median, 41%; Supplementary Table 4) of the total bacterial communities found in plant-associated environments and therefore represent a substantial portion of the plant microbiota, consistent with previous reports13,16,19.

Increased carbohydrate metabolism and fewer mobile elements in plant-associated genomes. We compared the genomes of bacteria isolated from plant environments with those from bacteria of shared ancestry that were isolated from non-plant environments. We assumed that the two groups should differ in the set of accessory genes that evolved as part of their adaptation to a specific niche. Comparison of the size of plant-associated, soil, and NPA genomes showed that plant-associated and/or soil genomes were significantly larger than NPA genomes (P < 0.05, PhyloGLM and t-tests; Supplementary Fig. 1a, Supplementary Table 5). We observed this trend in six to seven of the nine analyzed taxa (depending on the test), representing all four phyla. Pangenome analyses of a few genera with plant-associated and NPA isolation sites showed that pangenome sizes were similar between plant-associated and NPA genomes (Supplementary Fig. 2).

Next, we examined whether certain gene categories are enriched or depleted in plant-associated genomes versus in their NPA counterparts, using 26 broad functional gene categories (Supplementary Table 6). We used the PhyloGLM test (Fig. 1b) and t-test (Supplementary Fig. 3) to detect enrichment. Two gene categories demonstrated similar phylogeny-independent trends suggestive of an environment-dependent selection process. The “Carbohydrate metabolism and transport” gene category was expanded in the plant-associated organisms of six taxa (Fig. 1b). This was the most expanded category in Alphaproteobacteria, Bacteroidetes, Xanthomonadaceae, and (Supplementary Fig. 3). In contrast, mobile genetic elements (phages and transposons) were underrepresented in four plant-associated taxa (Fig. 1b and Supplementary Fig. 3). Plant-associated genomes showed increased genome sizes despite a reduction in the number of mobile elements that often serve as vehicles for horizontal gene transfer and genome expansion. A comparison of root-associated bacteria to soil bacteria showed less drastic changes than those seen between plant-associated and NPA groups, as expected for organisms that live in more similar habitats (Fig. 1b and Supplementary Fig. 3).

Identification and validation of plant- and root-associated genes. We sought to identify specific genes enriched in plant- and root-associated genomes compared with NPA and soil-derived

Table 1 | Novel and previously sequenced genomes used in this analysis

Taxon                 Taxonomic rank  Novel sequenced PA genomes  Scanned genomes  Genomes used in analysis  PA     NPA    Soil  RA
Alphaproteobacteria¹  Class           126                         784              610                       368    199    43    169
Burkholderiales¹      Order           85                          612              433                       160    209    64    86
Acinetobacter¹        Genus           4                           926              454                       7      442    5     3
Pseudomonas¹          Genus           75                          506              349                       169    137    43    61
Xanthomonadaceae¹     Family          15                          264              147                       110    26     11    26
Bacillales²           Order           54                          664              454                       97     185    172   54
Actinobacteria 1³     NA              69                          504              394                       164    142    88    89
Actinobacteria 2³     NA              19                          845              587                       29     526    32    18
Bacteroidetes⁴        Phylum          37                          481              409                       56     293    60    17
Total                                 484                         5,586            3,837                     1,160  2,159  518   523

¹Proteobacteria. ²Firmicutes. ³Actinobacteria phylum. PA, plant-associated bacteria; NPA, non-plant-associated bacteria; soil, soil-associated bacteria; RA, root-associated bacteria; NA, not available (an artificial taxon).
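The classification described in the text implies some simple arithmetic relations within Table 1: each genome used in the analysis is labeled PA, NPA, or soil, and RA genomes are a subset of PA genomes. The following minimal sketch transcribes the table and checks those relations. The function names and the tuple layout are illustrative assumptions, not part of the study's pipeline.

```python
# Rows transcribed from Table 1:
# (taxon, novel, scanned, used, PA, NPA, soil, RA)
TABLE_1 = [
    ("Alphaproteobacteria", 126, 784, 610, 368, 199, 43, 169),
    ("Burkholderiales", 85, 612, 433, 160, 209, 64, 86),
    ("Acinetobacter", 4, 926, 454, 7, 442, 5, 3),
    ("Pseudomonas", 75, 506, 349, 169, 137, 43, 61),
    ("Xanthomonadaceae", 15, 264, 147, 110, 26, 11, 26),
    ("Bacillales", 54, 664, 454, 97, 185, 172, 54),
    ("Actinobacteria 1", 69, 504, 394, 164, 142, 88, 89),
    ("Actinobacteria 2", 19, 845, 587, 29, 526, 32, 18),
    ("Bacteroidetes", 37, 481, 409, 56, 293, 60, 17),
]
# "Total" row of Table 1, in the same column order.
TOTALS = (484, 5586, 3837, 1160, 2159, 518, 523)


def column_sums(rows):
    """Sum every numeric column over all taxa."""
    return tuple(sum(col) for col in zip(*(row[1:] for row in rows)))


def check_table(rows, totals):
    """Return a list of inconsistencies (empty if the table is self-consistent)."""
    problems = []
    for taxon, _novel, _scanned, used, pa, npa, soil, ra in rows:
        # Every genome used in the analysis carries exactly one of three labels.
        if pa + npa + soil != used:
            problems.append(f"{taxon}: PA + NPA + soil != genomes used")
        # Root-associated genomes are a subset of plant-associated genomes.
        if ra > pa:
            problems.append(f"{taxon}: RA exceeds PA")
    if column_sums(rows) != totals:
        problems.append("column sums differ from the Total row")
    return problems


if __name__ == "__main__":
    print(check_table(TABLE_1, TOTALS) or "Table 1 is internally consistent")
```

With the values as transcribed here, the per-taxon label counts add up to the genomes used, and the column sums match the Total row.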


[Figure 1 graphic. a, Phylogenetic tree (tree scale: 0.1). Taxonomic groups: Bacillales (n = 454), Actinobacteria 1 (n = 394), Actinobacteria 2 (n = 587), Bacteroidetes (n = 409), Alphaproteobacteria (n = 610), Burkholderiales (n = 433), Xanthomonadaceae (n = 147), Pseudomonas (n = 349), Acinetobacter (n = 454). Classification: NPA (n = 2,159), PA (n = 1,160), RA (n = 523), soil (n = 518). b, Heat maps of PhyloGLM estimates per gene category and taxon for PA vs. NPA (top) and RA vs. soil (bottom), with histograms of gene counts.]

Fig. 1 | The genome dataset used in analysis, and differences in gene category abundances. a, The maximum-likelihood phylogenetic tree of 3,837 high-quality and nonredundant bacterial genomes, based on the concatenated alignment of 31 single-copy genes. The outer ring shows the taxonomic group, the central ring shows the isolation source, and the inner ring shows the root-associated (RA) genomes within plant-associated (PA) genomes. Taxon names are color-coded according to phylum: green, Proteobacteria; red, Firmicutes; blue, Bacteroidetes; purple, Actinobacteria. See “URLs” for the iTOL interactive phylogenetic tree. b, Differences in gene categories between plant-associated and NPA genomes (top) and between root-associated and soil-associated genomes (bottom) of the same taxon. Both heat maps indicate the level of enrichment or depletion based on a PhyloGLM test. Significant cells (color-coded according to the key) represent P values of < 0.05 (FDR-corrected). Pink-red cells indicate significantly more genes in plant-associated and root-associated genomes in the top and bottom heat maps, respectively. Histograms at the top and right of each heat map represent the total number of genes compared in each column and row, respectively. Asterisks indicate non-formal class names. “Carbohydrates” denotes the carbohydrate metabolism and transport gene category. Full COG category names for the x-axis labels are presented in Supplementary Table 6. Note that cells representing high absolute estimate values (dark colors) are based on categories of few genes and are therefore more likely to be less accurate. Phylum names are color-coded as in a. Xanthomon., Xanthomonadales; Pseudomon., Pseudomonadales; Pseudom., Pseudomonadaceae; Moraxel., Moraxellaceae.
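The caption reports FDR-corrected P values but this section does not name the correction procedure. As a hedged illustration of what such a correction does when many category-by-taxon cells are tested at once, the sketch below assumes the Benjamini–Hochberg step-up procedure; that choice, the function name, and the example P values are assumptions made for illustration only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values, in the input order.

    Assumption: BH is used here only as a common FDR procedure; the
    article states "FDR-corrected" without naming the method.
    """
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that each q-value is the
    # minimum of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


if __name__ == "__main__":
    # Hypothetical P values for five heat-map cells.
    raw = [0.001, 0.008, 0.039, 0.041, 0.60]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"P = {p:<6} q = {q:.5f} significant = {q < 0.05}")
```

In this toy input, the third and fourth cells have raw P values below 0.05 but adjusted values above it, which is the situation the "FDR-corrected" qualifier in the caption guards against.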

genomes, respectively (Supplementary Fig. 4, Methods). First, we clustered the proteins and/or protein domains of each taxon on the basis of homology, using the annotation resources COG20, KEGG Orthology21, and TIGRFAM22, which typically comprise 35–75% of all genes in bacterial genomes23. To capture genes that do not have existing functional annotations, we also used OrthoFinder24 (after benchmarking; Supplementary Fig. 5) to cluster all protein sequences within each taxon into homology-based orthogroups. Finally, we clustered protein domains with Pfam25 (Methods, “URLs”). We used these five protein/domain-clustering approaches in parallel comparative genomics pipelines. Each protein/domain sequence was additionally labeled as originating from either a plant-associated genome or an NPA genome.

Next, we determined whether protein/domain clusters were significantly associated with a plant-associated lifestyle by using five independent statistical approaches: hypergbin, hypergcn (two versions of the hypergeometric test), phyloglmbin, phyloglmcn (two phylogenetic tests based on PhyloGLM26), and Scoary27 (a stringent combined test) (Methods). These analyses were based on either gene presence/absence or gene copy number. We defined a gene as significantly plant-associated if at least one test showed that it belonged to a significant plant-associated gene cluster, and if it originated from a plant-associated genome. We defined significant NPA, root-associated, and soil genes in the same way. Significant gene clusters identified by the different methods had varying degrees of overlap (Supplementary Figs. 6 and 7). In general, we noted a high degree of overlap between plant-associated and root-associated genes and overlap between NPA and soil-associated genes (Supplementary Fig. 8). Overall, plant-associated genes were depleted from NPA genomes from heterogeneous isolation sources (Supplementary Figs. 9 and 10). Principal coordinates analysis with matrices that contained only the plant-associated and NPA genes derived from each method as features increased the separation of plant-associated from NPA genomes along the first two axes (Supplementary Fig. 11). We provide full lists of statistically significant plant-associated, root-associated, soil-associated, and NPA proteins and domains according to the five clustering techniques and five statistical approaches for each taxon in Supplementary Tables 7–15 (also see “URLs”).

To validate our predictions, we assessed the abundance patterns of plant-associated and root-associated genes in natural environments. We retrieved 38 publicly available plant-associated, NPA, root-associated, and soil-associated shotgun metagenomes, including some from plant-associated environments that were not used for isolation of the bacteria analyzed here14,28,29 (Supplementary Table 16a). We mapped reads from these culture-independent metagenomes to plant-associated genes found with all statistical approaches (Methods, Supplementary Figs. 12–16). Plant-associated genes in up to seven taxa were more abundant (P < 0.05, t-test) in plant-associated metagenomes than in NPA metagenomes (Fig. 2a, Supplementary Table 16b). Root-associated, soil-associated, and NPA genes, in contrast, were not necessarily more abundant in their expected environments (Supplementary Table 16b).

In addition, we selected eight genes that were predicted to be plant-associated by multiple approaches (Supplementary Table 17a) for experimental validation via an in planta bacterial fitness assay (Methods). We inoculated the roots of surface-sterilized rice seedlings (n = 9–30 seedlings per experiment) with wild-type Paraburkholderia kururiensis M130 (a rice endophyte30) or a knockout mutant strain for each of the eight genes. We grew the plants for 11 d and then collected and quantified the bacteria that were tightly attached to the roots (Methods, Supplementary Table 17b). Mutations in two genes led to fourfold to sixfold reductions in colonization (false discovery rate (FDR)-corrected Wilcoxon rank sum test, q < 0.1) relative to that by wild-type bacteria (Fig. 2b), without an observed effect on growth rate (Supplementary Fig. 17). These two genes encode an outer-membrane efflux transporter from the nodT family and a Tir chaperone protein (CesT), respectively. It is plausible that the other six genes assayed function in facets of plant association not captured in this experimental context.

Functions for which coexpression of and cooperation between different proteins are needed are often encoded by gene operons in bacteria. We therefore tested whether our methods could correctly predict known plant-associated operons. We grouped plant-associated and root-associated genes into putative plant-associated and root-associated operons on the basis of their genomic proximity and orientation (Supplementary Fig. 4, Methods, “URLs”). This analysis yielded some well-known plant-associated functions, such as those of the nodABCSUIJZ and nifHDKENXQ operons (Fig. 2c,d). Nod and Nif proteins are integral for biological nitrogen cycling and mediate root nodulation31 and nitrogen fixation32, respectively. We also identified the biosynthetic gene cluster for the precursor of the plant hormone gibberellin33,34 (Fig. 2e). Other known plant-associated operons identified are related to chemotaxis35, secretion systems such as T3SS36 and T6SS37, and flagellum biosynthesis38–40 (Fig. 2f–i).

Thus, we identified thousands of plant-associated and root-associated gene clusters by using five different statistical approaches (Supplementary Table 18) and validated them by means of computational and experimental approaches, broadening our understanding of the genetic basis of plant–microbe interactions and providing a valuable resource to drive further experimentation.

Protein domains reproducibly enriched in diverse plant-associated genomes. Plant-associated and root-associated proteins and protein domains conserved across evolutionarily diverse taxa are potentially pivotal to the interaction between bacteria and plants. We identified 767 Pfam domains as significant plant-associated domains in at least three taxa, on the basis of multiple tests (Supplementary Table 19a). Below we elaborate on a few domains that were plant-associated or root-associated in all four phyla. Two of these domains, a DNA-binding domain (pfam00356) and a ligand-binding (pfam13377) domain, are characteristic of the LacI transcription factor (TF) family. These TFs regulate gene expression in response to different sugars41, and their copy numbers were expanded in the genomes of plant-associated and root-associated bacteria in eight of the nine taxa analyzed (Fig. 3a). Examination of the genomic neighbors of lacI-family genes identified strong enrichment for genes involved in carbohydrate metabolism and transport in all of these taxa, consistent with their expected regulation by a LacI-family protein41 (Supplementary Fig. 18). We analyzed the promoter regions of these putative regulatory targets of LacI-family TFs, and identified three AANCGNTT palindromic octamers that were statistically enriched in all but one taxon, and which may serve as the TF-binding site (Supplementary Table 20). These data suggest that accumulation of a large repertoire of LacI-family-controlled regulons is a common strategy across bacterial lineages during adaptation to the plant environment.

Another domain, the metabolic domain aldo-keto reductase (pfam00248), was enriched in the genomes of plant-associated and root-associated bacteria from eight taxa belonging to all four phyla investigated (Fig. 3b). This domain is involved in the metabolic conversion of a broad range of substrates, including sugars and toxic carbonyl compounds42. Thus, bacteria that inhabit plant environments may consume similar substrates. Additional plant-associated and root-associated proteins and domains that were enriched in at least six taxa are described in Supplementary Fig. 19.

We also identified domains that were reproducibly enriched in NPA and/or soil-associated genomes, including many domains of mobile genetic elements (Supplementary Fig. 20).

Putative plant protein mimicry by plant- and root-associated proteins. Convergent evolution and horizontal transfer of protein domains from eukaryotes to bacteria have been suggested for some


[Figure 2 graphic. a, Normalized read counts in NPA, soil, PA, and RA metagenome samples, by taxon. b, Rice root colonization (log10 CFU/g of root) for the wild type and mutants G118DRAFT_05604 and G118DRAFT_03668. c, Nodulation genes, Bradyrhizobium sp. ARR65 (Alphaproteobacteria). d, Nitrogen fixation genes, Burkholderia kururiensis M130 (Burkholderiales). e, Ent-kaurene (gibberellin precursor) biosynthesis genes, Rhizobium etli bv. mimosae IE4771 (Alphaproteobacteria). f, Chemotaxis genes, Rhizobium leguminosarum bv. viciae 3841 (Alphaproteobacteria). g, Type III secretion system, Xanthomonas axonopodis pv. vasculorum NCPPB 900 (Xanthomonadaceae). h, Type VI secretion system, Variovorax sp. Root411 (Burkholderiales). i, Flagellum biosynthesis genes, Ensifer medicae WSM1115 (Alphaproteobacteria).]

Fig. 2 | Validation of predicted plant-associated genes by multiple approaches. a, Plant-associated (PA) genes, which were predicted from isolate genomes, were more abundant in PA metagenomes than in NPA metagenomes. Reads from 38 shotgun metagenome samples were mapped to significant PA, NPA, RA, and soil-associated genes predicted by Scoary. P values are indicated for the significant differences between PA and NPA genes or RA and soil-associated genes in each taxon (two-sided t-test). Full results and an explanation for normalization are presented in Supplementary Fig. 14. b, Results of a rice root colonization experiment using wild-type Paraburkholderia kururiensis M130 or knockout mutants for two predicted plant-associated genes. Two mutants showed reduced colonization compared with the wild type: G118DRAFT_05604 (q-value = 0.00013, Wilcoxon rank sum test), which encodes an outer membrane efflux transporter from the nodT family, and G118DRAFT_03668 (q-value = 0.0952, Wilcoxon rank sum test), a Tir chaperone protein (CesT). Each point represents the average count of a minimum of three to six plates derived from the same plantlet, expressed as colony-forming units (CFU) per gram of root. c–i, Examples of known functional plant-associated operons captured by different statistical approaches. The plant-associated genes are highlighted by shaded bars, colored according to the key. c, Nod genes. d, NIF genes. e, Ent-kaurene (gibberellin precursor). f, Chemotaxis proteins in bacteria from different taxa. g, Type III secretion system. h, Type VI secretion system, including the imp genes (impaired in nodulation). i, Flagellum biosynthesis in Alphaproteobacteria. Labels show the gene symbol or the protein name for which such information was available.
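Two of the five statistical approaches named in the Results (hypergbin and hypergcn) are versions of the hypergeometric test. The sketch below shows a one-sided hypergeometric enrichment test on gene presence/absence, which is one plausible reading of hypergbin. It is a simplified, phylogeny-unaware illustration under that assumption, not the study's implementation (see Methods for that), and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_genomes, n_pa, n_with_gene, n_pa_with_gene):
    """One-sided hypergeometric P value for enrichment in PA genomes.

    n_genomes      -- all genomes in the taxon (PA plus non-PA)
    n_pa           -- plant-associated genomes in the taxon
    n_with_gene    -- genomes that carry the gene cluster
    n_pa_with_gene -- plant-associated genomes that carry it

    Returns P(X >= n_pa_with_gene), where X counts PA genomes among
    n_with_gene genomes drawn without replacement from the taxon.
    """
    upper = min(n_pa, n_with_gene)
    total = comb(n_genomes, n_with_gene)
    # math.comb returns 0 when k > n, so impossible terms drop out.
    tail = sum(
        comb(n_pa, k) * comb(n_genomes - n_pa, n_with_gene - k)
        for k in range(n_pa_with_gene, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Genome counts for Pseudomonas are taken from Table 1 (349 genomes,
    # 169 of them PA); the gene-cluster counts are hypothetical.
    p = hypergeom_enrichment_p(
        n_genomes=349, n_pa=169, n_with_gene=60, n_pa_with_gene=50
    )
    print(f"enrichment P value: {p:.3g}")
```

Because closely related genomes are not independent observations, a test of this kind can overstate significance, which is presumably why the study pairs it with phylogeny-aware PhyloGLM tests and Scoary.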


[Figure 3 graphic. a, Transcriptional regulator, LacI family: fraction of all domains per genome for the LacI DNA-binding domain (pfam00356) and the periplasmic binding protein-like ligand-binding domain (pfam13377) in NPA, PA, RA, and soil genomes of eight taxa. b, Aldo-keto reductase (pfam00248): fraction of all domains per genome in the same groups and taxa.]

Fig. 3 | Proteins and protein domains that were reproducibly enriched as plant-associated or root-associated in multiple taxa. We compared the occurrence of protein domains (from Pfam) between plant-associated (PA) and NPA bacteria and between root-associated (RA) and soil-associated bacteria. Color-coding is as in Fig. 1a. a, Transcription factors with LacI (Pfam00356) and periplasmic-binding protein domains (Pfam13377). These proteins are often annotated as COG1609. b, Aldo-keto reductase domain (Pfam00248). Proteins with this domain are often annotated as COG0667. We used a two-sided t-test to test for the presence of the genes in a and b in genomes that shared the same label and to verify the enrichment reported by the various tests. FDR-corrected P values are shown for significant results (q-value < 0.05). Colored circles indicate the number of different statistical tests (≤ 5) supporting plant, non-plant, root, or soil association of a gene or domain, with each circle representing one test. Gene illustrations above each graph represent random protein models. Note that a and b each contain two graphs because of the different scales. Actino., Actinobacteria; Alphaprot., Alphaproteobacteria; Burkho., Burkholderiales; Bactero., Bacteroidetes; Pseud., Pseudomonas; Xanthom., Xanthomonadaceae. Box-and-whisker plots show the median (center lines), 25th and 75th percentiles (box edges), extreme data points within 1.5 times the interquartile range from the box edge (whiskers), and outliers (isolated data points). Full results are in Supplementary Table 19.
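The Results describe AANCGNTT palindromic octamers in the promoters of putative LacI-family targets. The sketch below illustrates the two notions involved: a motif is palindromic when it equals its own reverse complement, and the degenerate position N matches any base. The helper names and the example promoter string are assumptions for illustration; this is not the motif-enrichment analysis used in the study.

```python
# Complement table for the unambiguous bases plus the wildcard N.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse complement of a DNA string made of A, C, G, T, and N."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def is_palindromic(motif):
    """True if the motif reads the same on both DNA strands."""
    return motif.upper() == reverse_complement(motif)


def find_motif(sequence, motif="AANCGNTT"):
    """Return 0-based start positions where the motif matches.

    N in the motif matches any base in the sequence.
    """
    sequence = sequence.upper()
    motif = motif.upper()
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        if all(m == "N" or m == b for m, b in zip(motif, window)):
            hits.append(start)
    return hits


if __name__ == "__main__":
    print(is_palindromic("AANCGNTT"))    # the degenerate pattern itself
    print(is_palindromic("AATCGATT"))    # a concrete palindromic instance
    print(is_palindromic("AATCGCTT"))    # matches the pattern, not palindromic
    print(find_motif("GGAATCGATTGG"))    # hypothetical promoter fragment
```

A concrete octamer that matches AANCGNTT is itself palindromic only when the bases at the two N positions are complementary to each other, so matching the pattern and being palindromic are separate checks.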

microbial effector proteins that are secreted into eukaryotic host cells to suppress defense and facilitate microbial proliferation43–45. We searched for new candidate effectors or other functional plant-protein mimics. We retrieved a set of significant plant-associated and root-associated Pfam domains that were reproducibly predicted by multiple approaches or in multiple taxa, and we cross-referenced these with protein domains that were also more abundant in plant genomes than in bacterial genomes (Methods). This analysis yielded 64 plant-resembling plant-associated and root-associated domains (PREPARADOs) encoded by 11,916 genes (Supplementary Fig. 21, Supplementary Table 21). The number of PREPARADOs was fourfold higher than the number of domains that overlapped reproducible NPA/soil-associated domains and plant domains (n = 15). The PREPARADOs were relatively abundant in genomes of plant-associated Bacteroidetes and Xanthomonadaceae (>0.5% of all domains on average; Supplementary Fig. 22). Some PREPARADOs were previously described as domains within effector proteins, such as Ankyrin repeats46, regulator of chromosome condensation repeat (RCC1)47, leucine-rich repeat (LRR)48, and pectate lyase49. PREPARADOs from plant genomes were enriched 3–14-fold (P < 10⁻⁵, Fisher’s exact test) as domains predicted to be ‘integrated effector decoys’ when fused to plant intracellular innate immune receptors of the NLR class50–53 (compared with two random domain sets; Methods, Supplementary Figs. 21 and 23, Supplementary Table 21). We found that 2,201 bacterial proteins that encode 17 of the 64 PREPARADOs shared ≥40% identity across the entire protein sequence with eukaryotic proteins from plants, plant-associated fungi, or plant-associated oomycetes, and therefore are likely to maintain a similar function (Supplementary Fig. 24, Supplementary Tables 21 and 22). The varied phylogenetic distribution among this protein class could have resulted from convergent evolution or from cross- horizontal gene transfer between phylogenetically distant organisms subjected to the shared selective forces of the plant environment.

Seven PREPARADO-containing protein families were characterized by N-terminal eukaryotic or bacterial signal peptides followed

Nature Genetics | www.nature.com/naturegenetics © 2017 Nature America Inc., part of Springer Nature. All rights reserved. Nature Genetics Articles

[Figure 4 graphic omitted. Phylogenetic tree (scale bar, 0.1) of Jacalin-domain proteins from Oryza sativa japonica group, Arabidopsis thaliana, Paenibacillus sp. PAMC 26794 (Firmicutes), Magnaporthe oryzae 70-15 and Aspergillus oryzae 3.042 (Ascomycota), Fomitopsis pinicola FP-58527 SS1 and Rhizoctonia solani AG-3 Rhs1AP (Basidiomycota), Phytophthora parasitica (oomycetes), Leptonema illini DSM 21528, Streptomyces sp. 303MFCol5.2 (Actinobacteria), and Xanthomonas axonopodis pv. citri str. 306 (Proteobacteria). Legend: domains (Jacalin; Dirigent; protein kinase, catalytic domain; F-box associated 1; endonuclease/exonuclease/phosphatase family), signal peptides (Gram-negative bacteria, Gram-positive bacteria, eukaryotes), and organism classes (bacteria: PA, NPA, soil; PA fungi; PA oomycetes; plants).]

Fig. 4 | A protein family shared by plant-associated bacteria, fungi, and oomycetes that resembles plant proteins. A maximum-likelihood phylogenetic tree of representative proteins with Jacalin-like domains across plants and plant-associated (PA) organisms. Endonuclease/exonuclease/phosphatase-Jacalin proteins are present across PA eukaryotes (fungi and oomycetes) and PA bacteria. In most cases these proteins contain a signal peptide in the N terminus. The Jacalin-like domain is found in many plant proteins, often in multiple copies. The protein accession is shown above each protein illustration.

by a PREPARADO dedicated to carbohydrate binding or metabolism (Supplementary Table 21). One of these domains, Jacalin, is a mannose-binding lectin domain that is found in 48 genes in the A. thaliana genome, compared with three genes in the human genome25. Mannose is found on the cell wall of different bacterial and fungal pathogens and could serve as a microbial-associated molecular pattern that is recognized by the plant immune system54–61. We identified a family of ~430-amino-acid-long microbial proteins with a signal peptide followed by a functionally ill-defined endonuclease/exonuclease/phosphatase family domain (pfam03372), and ending with a Jacalin domain (pfam01419). This domain architecture is absent in plants but is found in diverse microorganisms, many of which are phytopathogens, including Gram-negative and Gram-positive bacteria, fungi from the Ascomycota and Basidiomycota phyla, and oomycetes (Fig. 4). We speculate that these microbial lectins may be secreted to outcompete plant immune receptors for mannose-binding on the microbial cell wall, effectively serving as camouflage.

We thus discovered a large set of protein domains that are shared between plants and the microbes that colonize them. In many cases the entire protein is conserved across evolutionarily distant plant-associated microorganisms.

Co-occurrence of plant-associated gene clusters. We identified numerous cases of plant-associated gene clusters (orthogroups) that demonstrate high co-occurrence between genomes (“URLs”). When the plant-associated genes were derived by phylogeny-aware tests (i.e., PhyloGLM and Scoary), they were candidates for inter-taxon horizontal gene transfer events. For example, we identified a cluster predicted by Scoary of up to 11 co-occurring genes (mean pairwise Spearman correlation: 0.81) in a flagellum-like locus from sporadically distributed plant-associated or soil-associated genomes across 12 different genera in Burkholderiales (Fig. 5). Two of the annotated flagellar-like proteins, FlgB (COG1815) and FliN (pfam01052), are also encoded by plant-associated genes in Actinobacteria 1 and Alphaproteobacteria taxa. Six of the remaining genes encode hypothetical proteins, all but one of which are specific to , suggestive of a flagellar structure variant that evolved in this class in the plant environment. Flagellum-mediated motility or flagellum-derived secretion systems (for example, T3SS) are important for plant colonization and virulence39,40,62,63 and can be horizontally transferred64.

Novel putative plant- and root-associated gene operons. In addition to successfully capturing several known plant-associated operons (Fig. 2c–i), we also identified putative plant-associated bacterial operons (“URLs”). Two previously uncharacterized plant-associated gene families were conspicuous. These genes are organized in multiple loci in plant-associated genomes, each with up to five tandem gene copies. They encode short, highly divergent, high-copy-number proteins that are predicted to be secreted, as explained below. These two plant-associated protein families never co-occurred in the same genome, and their genomic presence was perfectly correlated with lifestyles of pathogenic or non-pathogenic bacteria of the genus Acidovorax (order Burkholderiales) (Fig. 6a). We named the gene families present in non-pathogens and pathogens Jekyll and Hyde, respectively, after the characters in Robert Louis Stevenson’s classic novel.

The typical Jekyll gene is 97 amino acids long, contains an N-terminal signal peptide, lacks a transmembrane domain, and, in 98.5% of cases, appears in non-pathogenic plant-associated or soil-associated Acidovorax isolates (Fig. 6a, Supplementary Fig. 25d, Supplementary Table 23a). A single genome may encode up to 13 Jekyll gene copies (Fig. 6a) distributed in up to nine loci (Supplementary Table 23a). We recently isolated four Acidovorax strains from the leaves of naturally grown Arabidopsis16. Even these nearly identical isolates carried hypervariable Jekyll loci that were substantially more divergent than neighboring genes and included copy-number variations and various mutations (Fig. 6b, Supplementary Fig. 25, Supplementary Table 24).

The Hyde putative operons, in contrast, are composed of two distinct gene families unrelated to Jekyll. A typical Hyde1 protein has 135 amino acids and an N-terminal transmembrane helix. Hyde1 proteins are also highly variable, as demonstrated by copy-number variation, sequence divergence, and intralocus transposon insertions (Fig. 6a,c, Supplementary Fig. 26a–c, Supplementary Table 23b). Hyde1 was found in 99% of cases in phytopathogenic Acidovorax. These genomes carried up to 15 Hyde1 gene copies distributed in up to ten loci (Fig. 6a, Supplementary Table 23b). In 70% of cases Hyde1 was located directly downstream from a more conserved ~300-amino-acid-long plant-associated protein-coding gene that we named Hyde2 (Fig. 6c,d, Supplementary Table 23d). We identified loci with Hyde2 followed by Hyde1-like genes in different members of the Proteobacteria phylum. These contained a highly variable Hyde1-like protein family that maintained only the


[Figure 5 graphic omitted. Panel a: Spearman correlation matrix (color scale from −1 to 1) of clustered orthogroups, highlighting Burkholderiales_OG0008861 (FliN), OG0008863 (FlgA), OG0010506 (FliO), OG0008351 (FlgB), OG0007689, OG0008862, OG0008350, OG0009542, OG0007460, and OG0010507 (all hyp.), and OG0004311 (RHS), alongside the flagellar-like loci of Methylibium sp. YR605, Burkholderia sp. Msh1, and Ralstonia solanacearum P673. Locus legend: flagellar genes; PA orthogroups, Burkholderiales (Scoary); not a significant PA gene (Scoary). Panel b: Burkholderiales phylogenetic tree (tree scale, 0.1) with genome classification (PA, NPA, RA, soil).]

Fig. 5 | Co-occurring plant-associated and soil-associated flagellum-like gene clusters are sporadically distributed across Burkholderiales. a, Left, a hierarchically clustered correlation matrix of all 202 significant plant-associated (PA) orthogroups (gene clusters) from Burkholderiales, predicted by Scoary. Right, the orthogroups present within and adjacent to the flagellar-like locus of different genomes. Gene names based on a BLAST search are shown in parentheses. Hyp., hypothetical protein; RHS, RHS repeat protein. Genes illustrated above and below the black horizontal line for each species are located on the positive and negative strand, respectively. b, The Burkholderiales phylogenetic tree based on the concatenated alignment of 31 single-copy genes. Colored circles represent the 11 orthogroups presented in a, with the same color-coding as in a. Genus names are shown next to pillars of stacked circles. RA, root-associated.
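The co-occurrence measure behind the correlation matrix in panel a can be illustrated with a small, dependency-free sketch. It assumes that each orthogroup is represented by a presence/absence (or copy-number) profile across genomes; the profiles, orthogroup labels, and function names below are hypothetical and are not data or code from the study.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_spearman(profiles):
    """Average Spearman correlation over all pairs of orthogroup profiles."""
    return mean(spearman(a, b) for a, b in combinations(profiles.values(), 2))


# Hypothetical presence/absence profiles across six genomes.
PROFILES = {
    "OG_A": [1, 1, 1, 0, 0, 0],
    "OG_B": [1, 1, 1, 0, 0, 0],
    "OG_C": [1, 1, 0, 0, 0, 0],
}
```

Calling `mean_pairwise_spearman(PROFILES)` summarizes how tightly a candidate cluster co-occurs, in the same spirit as the mean pairwise value quoted in the text for the flagellum-like locus.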


[Figure 6 graphic omitted. Panel a: Acidovorax phylogenetic tree (outgroup, Comamonas testosteroni ATCC 11996) with gene copy numbers (Jekyll, Hyde1, Hyde2) per genome, genome classification (PA, NPA, soil), and bacterial lifestyle (pathogen, non-pathogen). Panel b: Jekyll loci of Acidovorax sp. Leaf191, Leaf76, Leaf78, and Leaf84. Panel c: Hyde1, Hyde2, transposase, and RNase G genes in Acidovorax avenae subsp. citrulli AAC00-1 (watermelon), tw6 (watermelon), Acidovorax avenae T10_61 (sugarcane), Acidovorax oryzae ATCC 19882 (rice), and Acidovorax avenae avenae RS-1 (rice). Panel d: Hyde1-like, Hyde2, transposase, PAAR, and malate synthase genes in P. syringae strains from tomato, kiwifruit, common bean, cauliflower, and green bean.]

Fig. 6 | Rapidly diversifying, high-copy-number Jekyll and Hyde plant-associated genes. a, A maximum likelihood phylogenetic tree of Acidovorax isolates based on concatenation of 35 single-copy genes. The pathogenic and non-pathogenic branches of the tree are perfectly correlated with the presence of Hyde1 and Jekyll genes, respectively. b, An example of a variable Jekyll locus in highly related Acidovorax species isolated from leaves of wild Arabidopsis from Brugg, Switzerland. Arrows indicate the following locus tags (from top to bottom): Ga0102403_10161, Ga0102306_101276, Ga0102307_107159, and Ga0102310_10161. c, An example of a variable Hyde locus from pathogenic Acidovorax infecting different plants (the host plant is shown after the species name). The transposase in the first operon fragmented a Hyde2 gene. Arrows indicate the following locus tags (from top to bottom): Aave_3195, Ga0078621_123525, Ga0098809_1087148, T336DRAFT_00345, and AASARDRAFT_03920. d, An example of a variable Hyde locus from pathogenic Pseudomonas syringae infecting different plants. Arrows indicate the following locus tags (from top to bottom): PSPTOimg_00004880 (a.k.a. PSPTO_0475), A243_06583, NZ4DRAFT_02530, Pphimg_00049570, PmaM6_0066.00000100, PsyrptM_010100007142, and Psyr_4701. Genes color-coded with the same colors in b–d are homologous, with the exception of genes colored in ivory (unannotated genes) and Hyde1 and Hyde1-like genes, which are analogous in terms of their similar size, high diversification rate, position downstream of Hyde2, and tendency to have a transmembrane domain. PAAR, proline-alanine-alanine-arginine repeat superfamily.
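The pattern described for panel a (Jekyll and Hyde1 never sharing a genome, and each family tracking one lifestyle) can be checked with a short sketch. The copy-number triples below are a subset of the (Jekyll, Hyde1, Hyde2) tip labels of Fig. 6a as they appear in the extracted figure text. The lifestyle assigned to each strain is an assumption made for this illustration, inferred from the branches the caption describes, and the class and function names are hypothetical.

```python
from typing import NamedTuple


class Strain(NamedTuple):
    name: str
    lifestyle: str  # assumed label, not read directly from the figure
    jekyll: int
    hyde1: int
    hyde2: int


# Subset of Fig. 6a tip labels: (Jekyll, Hyde1, Hyde2) copy numbers.
STRAINS = [
    Strain("A. sp. JS42", "non-pathogen", 0, 0, 0),
    Strain("A. sp. Root70", "non-pathogen", 10, 0, 0),
    Strain("A. delafieldii DSM 64", "non-pathogen", 13, 0, 0),
    Strain("A. radicis N35", "non-pathogen", 13, 0, 0),
    Strain("A. sp. Leaf76", "non-pathogen", 11, 0, 1),
    Strain("A. avenae citrulli AAC00-1", "pathogen", 0, 11, 3),
    Strain("A. citrulli DSM 17060", "pathogen", 0, 15, 4),
    Strain("A. oryzae ATCC 19882", "pathogen", 0, 9, 3),
    Strain("A. konjaci DSM 7481", "pathogen", 0, 5, 4),
]


def co_occurring(strains):
    """Names of genomes that carry both Jekyll and Hyde1 copies."""
    return [s.name for s in strains if s.jekyll > 0 and s.hyde1 > 0]


def lifestyles_of_carriers(strains, family):
    """Set of lifestyles among genomes with at least one copy of a family."""
    return {s.lifestyle for s in strains if getattr(s, family) > 0}
```

On this subset, an empty result from `co_occurring(STRAINS)` and a single lifestyle per family from `lifestyles_of_carriers` would be consistent with the mutually exclusive distribution reported in the text. Genomes with no copies of either family, such as A. sp. JS42, do not contradict it.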


[Figure 7 graphic omitted. Panel a: log10 CFU/ml of E. coli expressing GFP, Hyde2 (Aave_0990), or Hyde1 (Aave_0989 or Aave_3191), with no IPTG or with 0.5 mM IPTG. Panel b: log10 (CFU prey/CFU neg) for prey strains (E. coli BW25113 and Arabidopsis leaf isolates of Sphingomonas, Pseudomonas, Novosphingobium, Chryseobacterium, Stenotrophomonas, Acinetobacter, and Flavobacterium) after coincubation with AAC00-1 wt, AAC00-1 ∆T6SS, or AAC00-1 ∆5-Hyde1.]

Fig. 7 | Hyde1 proteins of Acidovorax citrulli AAC00-1 are toxic to E. coli and various plant-associated bacterial strains. a, Toxicity assay of Hyde proteins expressed in E. coli. GFP, Hyde2-Aave_0990, and two Hyde1 genes from two loci, Aave_0989 and Aave_3191, were cloned into pET28b and transformed into E. coli C41 cells. Aave_0989 and Aave_3191 proteins were 53% identical. Bacterial cultures from five independent colonies were spotted on an LB plate. Gene expression of the cloned genes was induced with 0.5 mM IPTG. P values are shown for significant results (two-sided t-test). b, Quantification of recovered prey cells after coincubation with Acidovorax aggressor strains. Antibiotic-resistant prey strains E. coli BW25113 and nine different Arabidopsis leaf isolates were mixed at equal ratios with different aggressor strains or with NB medium (negative control). Five Hyde1 loci (including 9 out of 11 Hyde1 genes) are deleted in ∆5-Hyde1. ∆T6SS contains a vasD (Aave_1470) deletion. After coincubation for 19 h on NB agar plates, mixed populations were resuspended in NB medium and spotted on selective antibiotic-containing NB agar. The box plots represent results from at least three independent experiments, with individual values superimposed as dots. The center line represents the median, the box limits represent the 25th and 75th percentiles, and the edges represent the minimal and maximal values. P values are shown at the top; double asterisks denote a significant difference (one-way ANOVA followed by Tukey’s honest significant difference test) between results for wild type versus ∆T6SS and for wild type versus ∆5-Hyde1. Full strain names and statistical information are presented in Supplementary Table 25. For a time course experiment with exemplary strains, see Supplementary Fig. 29.

short length and a transmembrane helix (Supplementary Fig. 26d).
Hyde-carrying organisms included other phytopathogens, such as Pseudomonas syringae, in which the Hyde1-like-Hyde2 locus was again highly variable between closely related strains (Fig. 6d, Supplementary Table 23c). However, the striking Hyde genomic expansion was specific to the phytopathogenic Acidovorax lineage (Supplementary Table 23e). Notably, we observed that Hyde genes often are directly preceded by genes that encode core structural T6SS proteins, such as PAAR, VgrG, and Hcp65, or are fused to PAAR (Fig. 6d, Supplementary Fig. 27a,b, Supplementary Table 23e). We therefore suggest that Hyde1 and/or Hyde2 might constitute a new T6SS effector family.

The high sequence diversity of Jekyll and Hyde1 genes suggests that the two plant-associated protein families encoded by these

genes could be involved in molecular arms races with other organisms in the plant environment. As many type VI effectors are used in interbacterial warfare, we tested Acidovorax Hyde1 proteins for antibacterial properties. Expression of two variants of the gene in Escherichia coli led to a 10^5–10^6-fold reduction in cell numbers (Fig. 7a, Supplementary Table 25). We constructed a mutant strain of the phytopathogen Acidovorax citrulli AAC00-1 with deletion of five Hyde1 loci (Δ5-Hyde1), encompassing 9 of 11 Hyde1 genes (Supplementary Fig. 28, Supplementary Table 25). Wild-type, Δ5-Hyde1, and T6SS-mutant (ΔT6SS) Acidovorax strains were coincubated with an E. coli strain that is susceptible to T6SS killing66 and nine phylogenetically diverse Arabidopsis leaf bacterial isolates16. Survival of wild-type E. coli and six of the leaf isolates after coincubation with wild-type Acidovorax was reduced 10^2–10^6-fold compared with that after coincubation with Δ5-Hyde1 or ΔT6SS Acidovorax (Fig. 7b, Supplementary Fig. 29, Supplementary Table 25). Combined with the genomic association of Hyde loci with T6SS, these results suggest that the T6SS antibacterial phenotype of Acidovorax is mediated by Hyde proteins and that these toxins could be used in competition against other plant-associated organisms. Consistent with a function in microbe–microbe interactions, we did not detect compromised virulence of the Δ5-Hyde1 strain on host plants (watermelon; data not shown). However, clearance of competitors via T6SS can promote the persistence of Acidovorax citrulli on its host67.

Discussion
There is increasing awareness that plant-associated microbial communities have important roles in host growth and health. An understanding of plant–microbe relationships at the genomic level could enable scientists to use microbes to enhance agricultural productivity. Most studies have focused on specific plant microbiomes, with more emphasis on microbial diversity than on gene function12,14,16,18,68–74. Here we sequenced nearly 500 root-associated bacterial genomes isolated from different plant hosts. These new genomes were combined in a collection of 3,837 high-quality bacterial genomes for comparative analysis. We developed a systematic approach to identify plant-associated and root-associated genes and putative operons. Our method is accurate as reflected by its ability to capture numerous operons previously shown to have a plant-associated function, the enrichment of plant-associated genes in plant-associated metagenomes, the validation of Hyde1 proteins as likely type VI effectors in Acidovorax directed against other plant-associated bacteria, and the validation of two new genes in P. kururiensis that affect rice root colonization. We note that bacterial genes that are enriched in genomes from the plant environment are also likely to be involved in adaptation to the many other organisms that share the same niche, as we demonstrated for Hyde1.

We used five different statistical approaches to identify genes that were significantly associated with the plant/root environment, each with its advantages and disadvantages. The phylogeny-correcting approaches (phyloglmbin, phyloglmcn, and Scoary) allow accurate identification of genes that are polyphyletic and correlate with an environment independently of ancestral state. On the basis of our metagenome validation, the hypergeometric test predicts more genes that are abundant in plant-associated communities than PhyloGLM does. It also identifies monophyletic plant-associated genes, but it yields more false positives than the phylogenetic tests, because in every plant-associated lineage many lineage-specific genes will be considered plant-associated. Scoary is the most stringent method of all and yielded the fewest predictions (Supplementary Table 18). Future experimental validation should prioritize genes predicted in multiple taxa and/or by multiple approaches (Supplementary Figs. 5 and 6, Supplementary Tables 20 and 26).

We discovered 64 PREPARADOs. Proteins containing 19 of these domains are predicted to be secreted by the Sec or T3SS protein secretion systems (Supplementary Table 21). Notably, plant proteins carrying 35 of these domains belonged to the NLR class of intracellular innate immune receptors (Supplementary Fig. 23, Supplementary Table 21). Thus, these PREPARADO protein domains may serve as molecular mimics. Some may interfere with plant immune functions through disruption of key plant protein interactions75,76. Likewise, the Jacalin-containing proteins we detected in plant-associated bacteria, fungi, and oomycetes may represent a strategy of avoiding immunity triggered by microbial-associated molecular patterns, by binding to extracellular microbial mannose molecules and thereby serving as a molecular invisibility cloak77,78.

Finally, we demonstrated that numerous plant-associated functions are consistent across phylogenetically diverse bacterial taxa, and that some functions are even shared with plant-associated eukaryotes. Some of these traits may facilitate plant colonization by microbes and therefore might prove useful in genome engineering of agricultural inoculants to eventually yield a more efficient and sustainable agriculture.

URLs. iTOL Interactive tree (Fig. 1a), https://itol.embl.de/tree/15223230182273621508772620; datasets at the Dangl lab’s dedicated website, http://labs.bio.unc.edu/Dangl/Resources/gfobap_website/index.html (Dataset 1, FNA: nucleotide FASTA files of the 3,837 genomes; Dataset 2, FAA: FASTA files of all proteins used in the analysis; Dataset 3, COG/KEGG Orthology/Pfam/TIGRFAM IMG annotations of all genes used in analysis; Dataset 4, metadata of all genomes; Dataset 5, phylogenetic trees of each of the nine taxa; Dataset 6, pangenome matrices; Dataset 7, pangenome data frames; Dataset 8, OrthoFinder orthogroup FASTA files; Dataset 9, Mafft MSA of all orthogroups; Dataset 10, hidden Markov models of all orthogroups; Dataset 11, plant-associated/NPA and root-associated/soil-associated enrichment tables; Dataset 12, correlation matrices; Dataset 13, predicted operons); DSMZ, https://www.dsmz.de/; ATCC, https://www.atcc.org/; NCBI Biosample, https://www.ncbi.nlm.nih.gov/biosample/; IMG, https://img.jgi.doe.gov/cgi-bin/mer/main.cgi; GOLD, https://gold.jgi.doe.gov/; Phytozome, https://phytozome.jgi.doe.gov/pz/portal.html; BrassicaDB, http://brassicadb.org/brad/; sm R package, http://www.stats.gla.ac.uk/~adrian/sm; vegan R package, https://cran.r-project.org/web/packages/vegan/index.html; ape R package, https://cran.r-project.org/web/packages/ape/ape.pdf; fpc R package, https://cran.r-project.org/web/packages/fpc/index.html; phylolm R package, https://cran.r-project.org/web/packages/phylolm/index.html; scripts used to compute the orthogroups, https://github.com/isaisg/gfobap/tree/master/orthofinder_diamond; scripts used to run the gene enrichment tests, https://github.com/isaisg/gfobap/tree/master/enrichment_tests; scripts used to perform the PCoA, https://github.com/isaisg/gfobap/tree/master/pcoa_visualization_ogs_enriched.

Methods
Methods, including statements of data availability and any associated accession codes and references, are available at https://doi.org/10.1038/s41588-017-0012-9.

Received: 20 January 2017; Accepted: 10 November 2017; Published: xx xx xxxx

References
1. Ley, R. E. et al. Evolution of mammals and their gut microbes. Science 320, 1647–1651 (2008).
2. Baumann, P. Biology of bacteriocyte-associated endosymbionts of plant sap-sucking insects. Annu. Rev. Microbiol. 59, 155–189 (2005).
3. Sprent, J. I. 60Ma of legume nodulation. What’s new? What’s changing? J. Exp. Bot. 59, 1081–1084 (2008).
4. Pfeilmeier, S., Caly, D. L. & Malone, J. G. Bacterial pathogenesis of plants: future challenges from a microbial perspective: Challenges in Bacterial Molecular Plant Pathology. Mol. Plant Pathol. 17, 1298–1313 (2016).


5. Chowdhury, S. P., Hartmann, A., Gao, X. & Borriss, R. Biocontrol mechanism by root-associated Bacillus amyloliquefaciens FZB42 – a review. Front. Microbiol. 6, 780 (2015).
6. Fibach-Paldi, S., Burdman, S. & Okon, Y. Key physiological properties contributing to rhizosphere adaptation and plant growth promotion abilities of Azospirillum brasilense. FEMS Microbiol. Lett. 326, 99–108 (2012).
7. Santhanam, R. et al. Native root-associated bacteria rescue a plant from a sudden-wilt disease that emerged during continuous cropping. Proc. Natl. Acad. Sci. USA 112, E5013–E5020 (2015).
8. Peters, N. K., Frost, J. W. & Long, S. R. A plant flavone, luteolin, induces expression of Rhizobium meliloti nodulation genes. Science 233, 977–980 (1986).
9. Hiei, Y., Ohta, S., Komari, T. & Kumashiro, T. Efficient transformation of rice (Oryza sativa L.) mediated by Agrobacterium and sequence analysis of the boundaries of the T-DNA. Plant J. 6, 271–282 (1994).
10. Hueck, C. J. Type III protein secretion systems in bacterial pathogens of animals and plants. Microbiol. Mol. Biol. Rev. 62, 379–433 (1998).
11. Bulgarelli, D. et al. Revealing structure and assembly cues for Arabidopsis root-inhabiting bacterial microbiota. Nature 488, 91–95 (2012).
12. Lundberg, D. S. et al. Defining the core Arabidopsis thaliana root microbiome. Nature 488, 86–90 (2012).
13. Bulgarelli, D., Schlaeppi, K., Spaepen, S., Ver Loren van Themaat, E. & Schulze-Lefert, P. Structure and functions of the bacterial microbiota of plants. Annu. Rev. Plant Biol. 64, 807–838 (2013).
14. Ofek-Lalzar, M. et al. Niche and host-associated functional signatures of the root surface microbiome. Nat. Commun. 5, 4950 (2014).
15. Gottel, N. R. et al. Distinct microbial communities within the endosphere and rhizosphere of Populus deltoides roots across contrasting soil types. Appl. Environ. Microbiol. 77, 5934–5944 (2011).
16. Bai, Y. et al. Functional overlap of the Arabidopsis leaf and root microbiota. Nature 528, 364–369 (2015).
17. Hardoim, P. R. et al. The hidden world within plants: ecological and evolutionary considerations for defining functioning of microbial endophytes. Microbiol. Mol. Biol. Rev. 79, 293–320 (2015).
18. Bulgarelli, D. et al. Structure and function of the bacterial root microbiota in wild and domesticated barley. Cell Host Microbe 17, 392–403 (2015).
19. Hacquard, S. et al. Microbiota and host nutrition across plant and animal kingdoms. Cell Host Microbe 17, 603–616 (2015).
20. Tatusov, R. L., Galperin, M. Y., Natale, D. A. & Koonin, E. V. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28, 33–36 (2000).
21. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).
22. Haft, D. H., Selengut, J. D. & White, O. The TIGRFAMs database of protein families. Nucleic Acids Res. 31, 371–373 (2003).
23. Huntemann, M. et al. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4). Stand. Genomic Sci. 10, 86 (2015).
24. Emms, D. M. & Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 16, 157 (2015).
25. Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279–D285 (2016).
26. Ives, A. R. & Garland, T. Jr. Phylogenetic logistic regression for binary dependent variables. Syst. Biol. 59, 9–26 (2010).
27. Brynildsrud, O., Bohlin, J., Scheffer, L. & Eldholm, V. Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary. Genome Biol. 17, 238 (2016).
28. Hultman, J. et al. Multi-omics of permafrost, active layer and thermokarst bog soil microbiomes. Nature 521, 208–212 (2015).
29. Louca, S. et al. Integrating biogeochemistry with multiomic sequence information in a model oxygen minimum zone. Proc. Natl. Acad. Sci. USA 113, E5925–E5933 (2016).
30. Coutinho, B. G., Licastro, D., Mendonça-Previato, L., Cámara, M. & Venturi, V. Plant-influenced gene expression in the rice endophyte Burkholderia kururiensis M130. Mol. Plant Microbe Interact. 28, 10–21 (2015).
31. Long, S. R. Rhizobium-legume nodulation: life together in the underground. Cell 56, 203–214 (1989).
32. Ruvkun, G. B., Sundaresan, V. & Ausubel, F. M. Directed transposon Tn5 mutagenesis and complementation analysis of Rhizobium meliloti symbiotic nitrogen fixation genes. Cell 29, 551–559 (1982).
33. Hershey, D. M., Lu, X., Zi, J. & Peters, R. J. Functional conservation of the capacity for ent-kaurene biosynthesis and an associated operon in certain rhizobia. J. Bacteriol. 196, 100–106 (2014).
34. Nett, R. S. et al. Elucidation of gibberellin biosynthesis in bacteria reveals convergent evolution. Nat. Chem. Biol. 13, 69–74 (2017).
35. Scharf, B. E., Hynes, M. F. & Alexandre, G. M. Chemotaxis signaling systems in model beneficial plant-bacteria associations. Plant Mol. Biol. 90, 549–559 (2016).
36. Büttner, D. & He, S. Y. Type III protein secretion in plant pathogenic bacteria. Plant Physiol. 150, 1656–1664 (2009).
37. Gao, R. et al. Genome-wide RNA sequencing analysis of quorum sensing-controlled regulons in the plant-associated Burkholderia glumae PG1 strain. Appl. Environ. Microbiol. 81, 7993–8007 (2015).
38. Weller-Stuart, T., Toth, I., De Maayer, P. & Coutinho, T. Swimming and twitching motility are essential for attachment and virulence of Pantoea ananatis in onion seedlings. Mol. Plant Pathol. 18, 734–745 (2017).
39. De Weger, L. A. et al. Flagella of a plant-growth-stimulating Pseudomonas fluorescens strain are required for colonization of potato roots. J. Bacteriol. 169, 2769–2773 (1987).
40. de Weert, S. et al. Flagella-driven chemotaxis towards exudate components is an important trait for tomato root colonization by Pseudomonas fluorescens. Mol. Plant Microbe Interact. 15, 1173–1180 (2002).
41. Ravcheev, D. A. et al. Comparative genomics and evolution of regulons of the LacI-family transcription factors. Front. Microbiol. 5, 294 (2014).
42. Yamauchi, Y., Hasegawa, A., Taninaka, A., Mizutani, M. & Sugimoto, Y. NADPH-dependent reductases involved in the detoxification of reactive carbonyls in plants. J. Biol. Chem. 286, 6999–7009 (2011).
43. Burstein, D. et al. Genome-scale identification of Legionella pneumophila effectors using a machine learning approach. PLoS Pathog. 5, e1000508 (2009).
44. Dean, P. Functional domains and motifs of bacterial type III effector proteins and their roles in infection. FEMS Microbiol. Rev. 35, 1100–1125 (2011).
45. Stebbins, C. E. & Galán, J. E. Structural mimicry in bacterial virulence. Nature 412, 701–705 (2001).
46. Price, C. T. et al. Molecular mimicry by an F-box effector of Legionella pneumophila hijacks a conserved polyubiquitination machinery within macrophages and protozoa. PLoS Pathog. 5, e1000704 (2009).
47. Rothmeier, E. et al. Activation of Ran GTPase by a Legionella effector promotes microtubule polymerization, pathogen vacuole motility and infection. PLoS Pathog. 9, e1003598 (2013).
48. Xu, R.-Q. et al. AvrAC(Xcc8004), a type III effector with a leucine-rich repeat domain from Xanthomonas campestris pathovar campestris confers avirulence in vascular tissues of Arabidopsis thaliana ecotype Col-0. J. Bacteriol. 190, 343–355 (2008).
49. Shevchik, V. E., Robert-Baudouy, J. & Hugouvieux-Cotte-Pattat, N. Pectate lyase PelI of Erwinia chrysanthemi 3937 belongs to a new family. J. Bacteriol. 179, 7321–7330 (1997).
50. Cesari, S., Bernoux, M., Moncuquet, P., Kroj, T. & Dodds, P. N. A novel conserved mechanism for plant NLR protein pairs: the “integrated decoy” hypothesis. Front. Plant Sci. 5, 606 (2014).
51. Sarris, P. F. et al. A plant immune receptor detects pathogen effectors that target WRKY transcription factors. Cell 161, 1089–1100 (2015).
52. Sarris, P. F., Cevik, V., Dagdas, G., Jones, J. D. & Krasileva, K. V. Comparative analysis of plant immune receptor architectures uncovers host proteins likely targeted by pathogens. BMC Biol. 14, 8 (2016).
53. Le Roux, C. et al. A receptor pair with an integrated decoy converts pathogen disabling of transcription factors to immunity. Cell 161, 1074–1088 (2015).
54. Brown, G. D. & Netea, M. G. (eds.). Immunology of Fungal Infections. (Springer, Dordrecht, The Netherlands, 2007).
55. Gadjeva, M., Takahashi, K. & Thiel, S. Mannan-binding lectin – a soluble pattern recognition molecule. Mol. Immunol. 41, 113–121 (2004).
56. Ma, Q.-H., Tian, B. & Li, Y.-L. Overexpression of a wheat jasmonate-regulated lectin increases pathogen resistance. Biochimie 92, 187–193 (2010).
57. Xiang, Y. et al. A jacalin-related lectin-like gene in wheat is a component of the plant defence system. J. Exp. Bot. 62, 5471–5483 (2011).
58. Yamaji, Y. et al. Lectin-mediated resistance impairs plant virus infection at the cellular level. Plant Cell 24, 778–793 (2012).
59. Weidenbach, D. et al. Polarized defense against fungal pathogens is mediated by the Jacalin-related lectin domain of modular Poaceae-specific proteins. Mol. Plant 9, 514–527 (2016).
60. Sahly, H. et al. Surfactant protein D binds selectively to Klebsiella pneumoniae lipopolysaccharides containing mannose-rich O-antigens. J. Immunol. 169, 3267–3274 (2002).
61. Osborn, M. J., Rosen, S. M., Rothfield, L., Zeleznick, L. D. & Horecker, B. L. Lipopolysaccharide of the gram-negative cell wall. Science 145, 783–789 (1964).
62. Tans-Kersten, J., Huang, H. & Allen, C. Ralstonia solanacearum needs motility for invasive virulence on tomato. J. Bacteriol. 183, 3597–3605 (2001).
63. Cole, B. J. et al. Genome-wide identification of bacterial plant colonization genes. PLoS Biol. 15, e2002860 (2017).
64. Poggio, S. et al. A complete set of flagellar genes acquired by horizontal transfer coexists with the endogenous flagellar system in Rhodobacter sphaeroides. J. Bacteriol. 189, 3208–3216 (2007).


65. Ho, B. T., Dong, T. G. & Mekalanos, J. J. A view to a kill: the bacterial type VI secretion system. Cell Host Microbe 15, 9–21 (2014).
66. MacIntyre, D. L., Miyata, S. T., Kitaoka, M. & Pukatzki, S. The Vibrio cholerae type VI secretion system displays antimicrobial properties. Proc. Natl. Acad. Sci. USA 107, 19520–19524 (2010).
67. Tian, Y. et al. The type VI protein secretion system contributes to biofilm formation and seed-to-seedling transmission of Acidovorax citrulli on melon. Mol. Plant Pathol. 16, 38–47 (2015).
68. Peiffer, J. A. et al. Diversity and heritability of the maize rhizosphere microbiome under field conditions. Proc. Natl. Acad. Sci. USA 110, 6548–6553 (2013).
69. Agler, M. T. et al. Microbial hub taxa link host and abiotic factors to plant microbiome variation. PLoS Biol. 14, e1002352 (2016).
70. Bokulich, N. A., Thorngate, J. H., Richardson, P. M. & Mills, D. A. Microbial biogeography of wine grapes is conditioned by cultivar, vintage, and climate. Proc. Natl. Acad. Sci. USA 111, E139–E148 (2014).
71. Coleman-Derr, D. et al. Plant compartment and biogeography affect microbiome composition in cultivated and native Agave species. New Phytol. 209, 798–811 (2016).
72. Shade, A., McManus, P. S. & Handelsman, J. Unexpected diversity during community succession in the apple flower microbiome. MBio 4, e00602–e00612 (2013).
73. Turner, T. R. et al. Comparative metatranscriptomics reveals kingdom level changes in the rhizosphere microbiome of plants. ISME J. 7, 2248–2258 (2013).
74. Edwards, J. et al. Structure, variation, and assembly of the root-associated microbiomes of rice. Proc. Natl. Acad. Sci. USA 112, E911–E920 (2015).
75. Kroj, T., Chanclud, E., Michel-Romiti, C., Grand, X. & Morel, J.-B. Integration of decoy domains derived from protein targets of pathogen effectors into plant immune receptors is widespread. New Phytol. 210, 618–626 (2016).
76. Mukhtar, M. S. et al. Independently evolved virulence effectors converge onto hubs in a plant immune system network. Science 333, 596–601 (2011).
77. Vimr, E. & Lichtensteiger, C. To sialylate, or not to sialylate: that is the question. Trends Microbiol. 10, 254–257 (2002).
78. de Jonge, R. et al. Conserved fungal LysM effector Ecp6 prevents chitin-triggered immunity in plants. Science 329, 953–955 (2010).

Acknowledgements
The work conducted by the US Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, is supported by the Office of Science of the US Department of Energy under contract no. DE-AC02-05CH11231. J.L.D. and S.G.T. were supported by NSF INSPIRE grant IOS-1343020, and J.L.D. was also supported by DOE–USDA Feedstock Award DE-SC001043 and by the Office of Science (BER), US Department of Energy, grant no. DE-SC0014395. S.H.P. was supported by NIH Training Grant T32 GM067553-06 and was a Howard Hughes Medical Institute (HHMI) International Student Research Fellow. D.S.L. was supported by NIH Training Grant T32 GM07092-34. J.L.D. is an Investigator of the HHMI, supported by the HHMI and the Gordon and Betty Moore Foundation (GBMF3030). M.E.F. was supported by NIH Dr. Ruth L. Kirschstein NRSA Fellowship F32-GM112345. D.A.P. and T.-Y.L. were supported by the Genomic Science Program, US Department of Energy, Office of Science, Biological and Environmental Research as part of the Oak Ridge National Laboratory Plant Microbe Interfaces Scientific Focus Area (http://pmi.ornl.gov) and Plant Feedstock Genomics Award DE-SC001043. Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the US Department of Energy under contract DE-AC05-00OR22725. J.A.V. was supported by a SystemsX.ch grant (Micro2X) and a European Research Council (ERC) advanced grant (PhyMo). We thank I. Bertani, C. Bez, R. Bowers, D. Burstein, A. Chun Chen, D. Chiniquy, B. Cole, O. Cohen, A. Copeland, J. Eisen, E. Eloe-Fadrosh, M. Hadjithomas, O. Finkel, H. Schnitzel Meule Fux, N. Ivanova, J. Knelman, R. Malmstrom, R. Perez-Torres, D. Salomon, R. Sorek, T. Mucyn, R. Seshadri, T.K. Reddy, L. Ryan, and H. Sberro Livnat for general help, text editing, and ideas for this work. We thank R. Walcott (University of Georgia, Athens, GA, USA) for providing the Acidovorax citrulli VasD mutant strain.

Author contributions
A.L. performed most data analysis and wrote the paper. I.S.G. performed phylogenetic inference, performed phylogenetically aware analyses, analyzed the data, provided the supporting website, and contributed to manuscript writing. M. Mittelviefhaus and J.A.V. designed and performed experiments related to Hyde1 gene function and contributed to manuscript writing. S.C. isolated single bacterial cells and prepared metadata for data analysis. F.M. analyzed data. S.H.P. analyzed data and contributed to manuscript writing. J.M. produced a mutant strain for Hyde1. K.W. tested Hyde1 toxicity in E. coli. G.D. and V.V. produced deletion mutants and designed and performed rice root colonization experiments. K.S. helped in data analysis. B.R.A. prepared metadata for data analysis. D.S.L., T.-Y.L., S.L., Z.J., M. McDonald, A.P.K., M.E.F., and S.L.D. isolated bacteria from different plants or managed this process. T.G.d.R. managed the sequencing project. S.R.G., D.A.P., and R.E.L. managed bacterial isolation efforts and contributed to manuscript writing. B.Z. managed Hyde1 deletion and toxicity testing. S.G.T. contributed to manuscript writing. T.W. managed single-cell isolation efforts and contributed to manuscript writing. J.L.D. directed the overall project and contributed to manuscript writing.

Competing interests
J.L.D. is a cofounder of and shareholder in, and S.H.P. collaborates with, AgBiome LLC, a corporation that aims to use plant-associated microbes to improve plant productivity.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41588-017-0012-9.
Reprints and permissions information is available at www.nature.com/reprints.
Correspondence and requests for materials should be addressed to S.G.T. or T.W. or J.L.D.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Methods
Additional method descriptions appear in Supplementary Note 1.

Bacterial isolation and genome sequencing. The detailed isolation procedure is described in Supplementary Note 1. Bacterial strains from Brassicaceae and poplar were isolated via previously described protocols79,80. Poplar strains were cultured from root tissues collected from Populus deltoides and Populus trichocarpa trees in Tennessee, North Carolina, and Oregon. Root samples were processed as described previously15,80. Briefly, we isolated rhizosphere strains by plating serial dilutions of root wash, whereas for endosphere strains, we pulverized surface-sterilized roots with a sterile mortar and pestle in 10 mL of MgSO4 (10 mM) solution before plating serial dilutions. Strains were isolated on R2A agar media, and the resulting colonies were picked and re-streaked a minimum of three times to ensure isolation. Isolated strains were identified by 16S rDNA PCR followed by Sanger sequencing.

For maize isolates, we selected soils associated with Il14h and Mo17 maize genotypes grown in Lansing, NY, and Urbana, IL. The rhizosphere soil samples of each maize genotype were grown at each location and were collected at week 12 as previously described68. From each rhizosphere soil sample, soil was washed and samples were plated onto Pseudomonas isolation agar (BD Diagnostic Systems). The plates were incubated at 30 °C until colonies formed, and DNA was extracted from cells.

For isolation of single cells, A. thaliana accessions Col-0 and Cvi-0 were grown to maturity. Roots were washed in distilled water multiple times. Root surfaces were sterilized with bleach. Surface-sterilized roots were then ground with a sterile mortar and pestle. Individual cells were isolated by flow cytometry followed by DNA amplification with MDA, and 16S rDNA screening as described previously81.

DNA from isolates and single cells was sequenced on next-generation sequencing platforms, mostly using Illumina HiSeq technology (Supplementary Table 3). Sequenced genomic DNA was assembled via different assembly methods (Supplementary Table 3). Genomes were annotated using the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)23 and deposited at the IMG database (“URLs”), ENA, or Genbank for public use.

Data compilation of 3,837 isolate genomes and their isolation-site metadata. We retrieved 5,586 bacterial genomes from the IMG system (“URLs,” Supplementary Table 1). Isolation sites were identified through a manual curation process that included scanning of IMG metadata, DSMZ, ATCC, NCBI Biosample (“URLs”), and the scientific literature. On the basis of its isolation site, each genome was labeled as plant-associated, NPA, or soil-associated. Plant-associated organisms were also labeled as root-associated when isolated from the endophytic compartments or from the rhizoplane. We applied stringent quality control measures to ensure a high-quality and minimally biased set of genomes:
a. Known isolation site: genomes with missing isolation-site information were filtered out.
b. High genome quality and completeness: all isolate genomes passed this filter if N50 (the shortest sequence length at 50% of the genome) was more than 50,000 bp. Single amplified genomes passed the quality filter if they had at least 90% of 35 universal single-copy clusters of orthologous groups (COGs)82. In addition, we used CheckM83 to assess isolate genome completeness and contamination. Only genomes that were at least 95% complete and no more than 5% contaminated were used.
c. High-quality gene annotation: genomes that passed this filter had at least 90% genome sequence coding for genes, with one exception: in the Bartonella genus, most genomes have coding base percentages below 90%.
d. Nonredundancy: we computed whole-genome average nucleotide identity and alignment fraction values for each pair of genomes84. When the alignment fraction exceeded 90% and the whole-genome average nucleotide identity was greater than 99.995%, we considered the genome pair redundant. In such cases one genome was randomly selected, and the other genome was marked as redundant and was filtered out.
e. Consistency in the phylogenetic tree: we filtered out 14 bacterial genomes that showed discrepancy between their given taxonomy and their actual phylogenetic placement in the bacterial tree.

Construction of the bacterial genome tree. To generate a phylogenetic tree of the 3,837 high-quality and nonredundant bacterial genomes, we retrieved 31 universal single-copy genes from each genome with AMPHORA2 (ref. 85). For each individual marker gene, we used Muscle with default parameters to construct an alignment. We masked the 31 alignments by using Zorro86 and filtered the low-quality columns of the alignment. Finally, we concatenated the 31 alignments into an overall merged alignment, from which we built an approximately maximum-likelihood phylogenetic tree with the WAG model implemented in FastTree 2.1 (ref. 87). Trees of each taxon are provided in Dataset S5 at http://labs.bio.unc.edu/Dangl/Resources/gfobap_website/faa_trees_metadata.html.

Clustering of 3,837 genomes into nine taxa. We divided the dataset into different taxa (taxonomic groups) to allow downstream identification of genes enriched in the plant-associated or root-associated genomes of each taxon compared with the NPA or soil-associated genomes from the same taxon, respectively. To determine the number of taxonomic groups to analyze, we converted the phylogenetic tree into a distance matrix, using the cophenetic function implemented in the R package ape (“URLs”). We then clustered the 3,837 genomes into nine groups using k-medoids clustering as implemented in the PAM (partitioning around medoids) algorithm from the R package fpc (“URLs”). The k-medoids algorithm clusters a dataset of n objects into k a priori–defined clusters. To identify the optimal k value for the dataset, we compared the silhouette coefficients for values of k ranging from 1 to 30. We selected a value of k = 9 because it yielded the maximal average silhouette coefficient (0.66). In addition, at k = 9 the taxa were monophyletic, contained hundreds of genomes, and were relatively balanced between plant-associated and NPA genomes in most taxa (Table 1). The resulting genome clusters generally overlapped with annotated taxonomic units. One exception was in the Actinobacteria phylum. Here our clustering divided the genomes into two taxa that we named, for simplicity, Actinobacteria 1 and Actinobacteria 2. However, our rigorous phylogenetic analysis supports previous suggestions for revisions in the taxonomy of phylum Actinobacteria88. In addition, the tree showed very divergent bacterial taxa in the Bacteroidetes phylum that could not be separated into monophyletic groups. Specifically, the Sphingobacteriales order (from class Sphingobacteriia) and the Cytophagaceae (from class Cytophagia) are paraphyletic. Therefore, we decided to unify all Bacteroidetes into one phylum-level taxon. Analysis of the prevalence of the nine taxa in 16S rDNA and metagenome appears in the Supplementary Information.

Pangenome analysis. For each comparison in Supplementary Fig. 2, a random set of ten genomes from each environment (plant-associated and NPA from specific environments) was selected, and the mean and s.d. of the phylogenetic distance in the set were calculated. This step was repeated 50 times to produce two random sets of genomes (plant-associated and NPA) that were comparable and had minimum differences between their mean and s.d. of phylogenetic distances. Genes for pangenome analysis were taken from the orthogroups (see below). Core genome, accessory genome, and unique genes were defined as genes that appeared in all ten genomes, in two to nine genomes, and in only one genome, respectively. For core and accessory genomes, the median copy number in each relevant orthogroup was used.

Genome size comparison and gene category enrichment analysis. Genome sizes were retrieved from the IMG database (“URLs”) and compared by t-test and PhyloGLM26. Kernel density plots from the R sm package (“URLs”) were used to prepare Supplementary Fig. 1. Protein-coding genes were retrieved and mapped to COG IDs with the program RPS-BLAST at an e-value cutoff of 1e-2 and an alignment length of at least 70% of the consensus sequence length. Each COG ID was mapped to at least one COG category (Supplementary Table 6). For each genome, we counted the number of genes from a given category. A t-test and PhyloGLM test were used to compare the number of genes in the genomes that shared the same taxon and category but different labels (e.g., plant-associated versus NPA).

Benchmarking gene clustering with UCLUST and OrthoFinder. We computed clusters of coding sequences across each of the nine taxa defined above with two algorithms: UCLUST89 (v 7.0) and OrthoFinder24 (v 1.1.4). UCLUST was run with 50% identity and 50% coverage in the target to call the clusters. Command used: usearch7.0.1090_i86linux64 -cluster_fast <input_file> -id 0.5 -maxaccepts 0 -maxrejects 0 -target_cov 0.5 -uc <output_file>. To improve pairwise alignment performance, we used the accelerated protein alignment algorithm implemented in DIAMOND90 (v 0.8.36.98) with the --very-sensitive option in the DIAMOND BLASTP algorithm. After computing the alignments, we ran OrthoFinder with default parameters. See “URLs” for the scripts used to compute the orthogroups. Supplementary Fig. 5 shows benchmarking of OrthoFinder against UCLUST. To estimate the quality of the clusters output by UCLUST and OrthoFinder, we mapped the proteins from our datasets to the curated set of taxon markers from Phyla-AMPHORA91. Next, we compared the distribution of each of the taxon-specific markers identified by Phyla-AMPHORA across the clusters output by UCLUST and OrthoFinder. To compare the two approaches, we estimated two metrics: the purity and the fragmentation index, explained in Supplementary Fig. 5 and in the Supplementary Information.

Identification of plant-associated, NPA, root-associated, and soil genes/domains. The following description applies to plant-associated, NPA, root-associated, and soil genes. For conciseness, only plant-associated genes are described here. Plant-associated genes were identified via a two-step process that included protein/domain clustering on the basis of amino acid sequence similarity and subsequent identification of the protein/domain clusters significantly enriched in proteins/domains from plant-associated bacteria (Supplementary Fig. 4). Clustering of genes and protein domains involved five independent methods: OrthoFinder24, COG20, KEGG Orthology (KO)21, TIGRFAM22, and Pfam25. OrthoFinder was selected (after the aforementioned benchmarking) as a clustering approach that included all proteins, including those that lack any functional annotation. We first compiled, for each taxon separately, a list of all proteins in the genomes. For COG, KO, TIGRFAM, and Pfam, we used the existing annotations

Nature Genetics | www.nature.com/naturegenetics © 2017 Nature America Inc., part of Springer Nature. All rights reserved.

of IMG genes that are based on BLAST alignments to the different protein/domain models23. This process yielded gene/domain clusters. Next, we determined which clusters were significantly enriched with genes derived from plant-associated genomes. These clusters were termed plant-associated clusters. In the statistical analysis, we used only clusters of more than five members. We corrected P values with Benjamini–Hochberg FDR and used q < 0.05 as the significance threshold, unless stated otherwise. The proteins in each cluster were categorized as either plant-associated or NPA, on the basis of the label of the encoding genome. Namely, a plant-associated gene is a gene derived from a plant-associated gene cluster and a plant-associated genome.

The three main approaches were the hypergeometric test (Hyperg), PhyloGLM, and Scoary. Hyperg looks for overall enrichment of gene copies across a group of genomes but ignores the phylogenetic structure of the dataset. PhyloGLM26 takes into account phylogenetic information to eliminate apparent enrichments that can be explained by shared ancestry. The Hyperg and PhyloGLM tests were used in two versions, based on either gene presence/absence data (hypergbin, phyloglmbin) or gene copy-number data (hypergcn, phyloglmcn). We also used a stringent version of Scoary27, a gene presence/absence approach that combines Fisher’s exact test, a phylogenetic test, and a label-permutation test. The first hypergeometric test, hypergcn, used the gene copy-number data, with the cluster being the sample, the total number of plant-associated and NPA genes being the population, and the number of plant-associated genes within the cluster being considered as ‘successes’. The second version, hypergbin, used gene presence/absence data. P values were corrected by Benjamini–Hochberg FDR92 for clusters of COG/KO/TIGRFAM/Pfam. For the abundant OrthoFinder clusters, we used Bonferroni correction with a threshold of P < 0.1, as downstream validation with metagenomes showed fewer false positives with the more significant clusters. The third and fourth statistical approaches used PhyloGLM26, implemented in the phylolm (v 2.5) R package (“URLs”). PhyloGLM combines a Markov process of lifestyle (e.g., plant-associated versus NPA) evolution with a regularized logistic regression. This approach takes advantage of the known phylogeny to specify the residual correlation structure between genomes that share common ancestry, and so it does not need to make the incorrect assumption that observations are independent. Intuitively, PhyloGLM favors genes found in multiple lineages of the same taxon. For each taxon we used the subtree from Fig. 1a to estimate the correlation matrix between observations and used the copy number (in phyloglmcn) or presence/absence pattern (in phyloglmbin) of each gene as the only independent variable. Positive and negative estimates in phyloglmbin/phyloglmcn indicated plant-associated/root-associated and NPA/soil-associated proteins/domains, respectively.

Finally, the fifth statistical approach was Scoary27, which uses a gene presence/absence dataset. Scoary combines Fisher’s exact test, a phylogeny-aware test, and an empirical label-switching permutation analysis. A gene cluster was considered significant by Scoary only if (1) it had a q-value less than 0.05 for Fisher’s exact test, (2) the ‘worst’ P value from the pairwise comparison algorithm was < 0.05, and (3) the empirical (permutation-based) P value was < 0.05. These are very stringent criteria that yielded relatively few significant predictions. Odds ratios greater than or less than 1 in Scoary indicated plant-associated/root-associated and NPA/soil-associated proteins/domains, respectively.

See “URLs” for links to the code used for the gene enrichment tests. A description of additional assessment of plant-associated/NPA prediction robustness using validation genome datasets is presented in Supplementary Note 1.

Validation of predicted plant-associated, NPA, root-associated, and soil-associated genes using metagenomes. Metagenome samples (n = 38; Supplementary Table 16) were downloaded from NCBI and GOLD (“URLs”). The reads were translated into proteins, and proteins at least 40 amino acids long were aligned using HMMsearch93 against the different protein references. The protein references included the predicted plant-associated, root-associated, soil-associated, and NPA proteins from OrthoFinder that were found to be significant by the different approaches. The normalization process is explained in Supplementary Figs. 12–16.

models fitted. Briefly, the Akaike information criterion estimates how much information is lost when a model is applied to represent the true model of a particular dataset. See “URLs” for a link to the scripts used to perform the PCoA.

Validation of plant-associated genes in Paraburkholderia kururiensis M130 affecting rice root colonization. Growth and transformation details of P. kururiensis M130 are described in Supplementary Note 1.

Mutant construction. Internal fragments of 200–900 bp from each gene of interest were PCR-amplified with the primers listed in Supplementary Table 17c. Fragments were cloned in the pGem2T easy vector (Promega) and sequenced (GATC Biotech; Germany), then excised with EcoRI restriction enzyme and cloned in the corresponding site in pKNOCK-KmR (ref. 94). These plasmids were then used as a suicide delivery system to create the knockout mutants and transferred to P. kururiensis M130 by triparental mating. All the mutants were verified by PCR with primers specific to the pKNOCK-Km vector and to the genomic DNA sequences upstream and downstream from the targeted genes.

Rhizosphere colonization experiments with P. kururiensis and mutant derivatives. Seeds of Oryza sativa (BALDO variety) were surface-sterilized and then left to germinate in sterile conditions at 30 °C in the dark for 7 d. Each seedling was then aseptically transferred into a 50-mL Falcon tube containing 35 mL of half-strength Hoagland solution semisolid substrate (0.4% agar). The tubes were then inoculated with 10⁷ colony-forming units (CFU) of a P. kururiensis suspension. Plants were grown for 11 d at 30 °C (16/8-h light/dark cycles). For the determination of the bacterial counts, plants were washed under tap water for 1 min and then cut below the cotyledon to excise the roots. Roots were air-dried for 15 min, weighed, and then transferred to a sterile tube containing 5 mL of PBS. After vortexing, the suspension was serially diluted to 10⁻¹ and 10⁻² in PBS, and aliquots were plated on KB plates containing the appropriate antibiotic (rifampicin 50 µg/mL for the wild type, rifampicin 50 µg/mL and kanamycin 50 µg/mL for the mutants). After 3 d of incubation at 30 °C, we counted CFU. Three replicates for each dilution from ten independent plantlets were used to determine the average CFU values.

Plant-mimicking plant-associated and root-associated proteins. Supplementary Fig. 21 summarizes the algorithm used to find plant-mimicking plant-associated and root-associated proteins. Pfam25 version 30.0 metadata were downloaded. Protein domains that appeared in both Viridiplantae and bacteria and occurred at least two times more frequently in Viridiplantae than in bacteria were considered as plant-like domains (n = 708). In parallel, we scanned the set of significant plant-associated, root-associated, NPA, and soil-associated Pfam protein domains predicted by the five algorithms in the nine taxa. We compiled a list of domains that were significantly plant-associated/root-associated in at least four tests, and significantly NPA/soil-associated in up to two tests (n = 1,779). The overlapping domains between the first two sets were defined as PREPARADOs (n = 64). In parallel, we created two control sets of 500 random plant-like Pfam domains and 500 random plant-associated/root-associated Pfam domains. Enrichment of PREPARADOs integrated into plant NLR proteins in comparison to the domains in the control groups was tested by Fisher’s exact test. To identify domains found in plant disease-resistance proteins, we retrieved all proteins from Phytozome and BrassicaDB (“URLs”). To identify domains in plant disease-resistance proteins, we used hmmscan to search protein sequences for the presence of NB-ARC (PF00931.20), TIR (PF01582.18), TIR_2 (PF13676.4), or RPW8 (PF05659.9) domains. Bacterial proteins carrying the PREPARADO domains were considered as having full-length identity to fungal, oomycete, or plant proteins on the basis of LAST alignments to all Refseq proteins of plants, fungi, and protozoa. “Full-length” is defined here as an alignment length of at least 90% of the length of both query and reference proteins. The threshold used for considering a high amino acid identity was 40%. An explanation of the prediction of secretion of proteins with PREPARADOs is presented in the Supplementary Information.
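As a rough illustration of the hypergeometric enrichment test and the Benjamini–Hochberg correction described earlier in this section, the following Python sketch computes an upper-tail P value per cluster and converts the P values into q-values. The cluster names and counts are invented, the function names are hypothetical, and the use of a one-sided upper tail is an assumption; this is not the code linked under “URLs”.

```python
from math import comb

def hypergeom_upper_tail(k, population, successes, sample):
    """P(X >= k) when `sample` genes are drawn from `population` genes,
    of which `successes` are plant-associated."""
    top = min(successes, sample)
    total = comb(population, sample)
    hits = sum(comb(successes, i) * comb(population - successes, sample - i)
               for i in range(k, top + 1))
    return hits / total

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg q-values in the same order as `pvals`."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        qvals[idx] = running_min
    return qvals

# Invented example: 1,000 genes in total, 400 of them from plant-associated genomes.
POPULATION, PA_GENES = 1000, 400
# cluster name -> (plant-associated genes in cluster, cluster size)
clusters = {"cluster_A": (18, 20), "cluster_B": (8, 20), "cluster_C": (4, 10)}

names = list(clusters)
p_values = [hypergeom_upper_tail(clusters[n][0], POPULATION, PA_GENES, clusters[n][1])
            for n in names]
q_values = dict(zip(names, benjamini_hochberg(p_values)))

for name in names:
    label = "enriched" if q_values[name] < 0.05 else "not significant"
    print(f"{name}: q = {q_values[name]:.3g} ({label})")
```

In this toy setting the cluster plays the role of the sample and the plant-associated genes are the successes, mirroring the hypergcn description; a presence/absence variant would count genomes rather than gene copies.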

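The selection rules for plant-mimicking domains described above can be sketched in Python as simple set operations. This is a minimal sketch under stated assumptions: the domain names and numbers are invented, “frequency” is treated as any comparable per-kingdom rate, the test counts are assumed to be tallied per domain, and the 40% identity threshold is assumed to be inclusive. None of the function names come from the authors’ scripts.

```python
def plant_like_domains(plant_freq, bacterial_freq, fold=2.0):
    """Domains seen in both kingdoms and at least `fold` times more frequent in plants."""
    return {
        domain
        for domain, freq in plant_freq.items()
        if freq > 0
        and bacterial_freq.get(domain, 0) > 0
        and freq >= fold * bacterial_freq[domain]
    }

def plant_associated_domains(test_calls, min_pa_tests=4, max_npa_tests=2):
    """test_calls maps domain -> (tests calling it PA/RA, tests calling it NPA/soil)."""
    return {
        domain
        for domain, (pa_tests, npa_tests) in test_calls.items()
        if pa_tests >= min_pa_tests and npa_tests <= max_npa_tests
    }

def is_full_length_hit(aln_len, query_len, ref_len, pct_identity,
                       min_cov=0.9, min_identity=40.0):
    """Alignment must cover at least 90% of both proteins and reach the identity threshold."""
    return (aln_len >= min_cov * query_len
            and aln_len >= min_cov * ref_len
            and pct_identity >= min_identity)

# Invented example data (hypothetical domain names).
plant_freq = {"LRR_x": 0.020, "Kinase_x": 0.010, "Plant_only": 0.030}
bacterial_freq = {"LRR_x": 0.005, "Kinase_x": 0.009, "Bact_only": 0.020}
test_calls = {"LRR_x": (5, 0), "Kinase_x": (4, 1), "Bact_only": (1, 4)}

# Candidate plant-mimicking domains are the overlap of the two sets.
candidates = plant_like_domains(plant_freq, bacterial_freq) & plant_associated_domains(test_calls)
print(sorted(candidates))
print(is_full_length_hit(aln_len=95, query_len=100, ref_len=102, pct_identity=45.0))
```

Keeping the two filters as separate functions makes it easy to build the random control sets mentioned in the text by sampling from each set independently.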
Principal coordinates analysis. To visualize the overall contribution of statistically significant enriched/depleted orthogroups to the differentiation of plant-associated and NPA genomes, we used principal coordinates analysis (PCoA) and logistic regression. For each of the nine taxa analyzed, we ran this analysis over a collection of matrices. The first matrix was the full pan-genome matrix, which depicted the distribution of all the orthogroups contained across all the genomes in a given taxon. The subsequent matrices represented subsets of the full pan-genome matrix; each of these matrices depicted the distribution of only the statistically significant orthogroups as called by one of the five different algorithms used to test for the genotype–phenotype association. A full description of this process is presented in Supplementary Note 1.

We used the function cmdscale from the R (v 3.3.1) stats package to run PCoA over all the matrices described above, using the Canberra distance as implemented in the vegdist function from the vegan (v 2.4-2) R package (“URLs”). Then, we took the first two axes output from the PCoA and used them as independent variables to fit a logistic regression over the labels of each genome (plant-associated, NPA). Finally, we computed the Akaike information criterion for each of the different

Prediction of plant-associated, NPA, root-associated, and soil-associated operons and their annotation as biosynthetic gene clusters. Significant plant-associated, NPA, root-associated, and soil-associated genes of each genome were clustered on the basis of genomic distance: genes sharing the same scaffold and strand that were up to 200 bp apart were clustered into the same predicted operon. We allowed up to one spacer gene, which is a non-significant gene, between each pair of significant genes within an operon. Operons were predicted for the genes in COG and OrthoFinder clusters using all five approaches. Operons were annotated as biosynthetic gene clusters if at least one of the constituent genes was part of a biosynthetic gene cluster from the IMG-ABC database95.

Jekyll and Hyde analyses. To find all homologs and paralogs of Jekyll and Hyde genes, we used IMG BLAST search with an e-value threshold of 1e-5 against all IMG isolates. We searched Hyde1 homologs of Acidovorax, Hyde1 homologs of Pseudomonas, Hyde2, and Jekyll genes using proteins of genes Aave_1071, A243_06583, Ga0078621_123530, and Ga0102403_10160 as the query sequence, respectively. Multiple sequence alignments were done with MAFFT96. A phylogenetic tree of Acidovorax species was produced with RAxML97, based on concatenation of 35 single-copy genes98.

Hyde1 toxicity assay. To verify the toxicity of Hyde1 and Hyde2 proteins to E. coli, we cloned genes encoding proteins Aave_0990 (Hyde2), Aave_0989 (Hyde1), and Aave_3191 (Hyde1), or GFP as a control, to the inducible pET28b expression vector via the LR reaction. The recombinant vectors were transformed into E. coli C41 competent cells by electroporation after sequencing validation. Five colonies were selected and cultured in LB liquid media supplemented with kanamycin with shaking overnight. The OD600 of the bacterial culture was adjusted to 1.0, and then the culture was diluted by 10², 10⁴, 10⁶, and 10⁸ times successively. Bacteria culture gradients were spotted (5 µL) on LB plates with or without 0.5 mM IPTG to induce gene expression.

Construction of ∆5-Hyde1 strain. Details of the construction of the ∆5-Hyde1 strain are presented in Supplementary Note 1. A. citrulli strain AAC00-1 and its derived mutants were grown on nutrient agar medium supplemented with rifampicin (100 µg/mL). To delete a cluster of five Hyde1 genes (Aave_3191–3195), we carried out a marker-exchange mutagenesis as previously described99. The marker-free mutant was designated as ∆1-Hyde1, and its genotype was confirmed by PCR amplification and sequencing. The marker-exchange mutagenesis procedure was repeated to delete four other Hyde1 loci (Supplementary Fig. 28). The primers used are listed in Supplementary Table 25. The final mutant, with deletion of 9 out of 11 Hyde1 genes (in five loci), was designated as ∆5-Hyde1 and was used for competition assay. The ∆T6SS mutant was provided by Ron Walcott’s lab.

Competition assay of Acidovorax citrulli AAC00-1 against different strains. Bacterial strains. E. coli BW25113 pSEVA381 was grown aerobically in LB broth (5 g/L NaCl) at 37 °C in the presence of chloramphenicol. Naturally antibiotic-resistant bacterial leaf isolates16 and Acidovorax strains were grown aerobically in NB medium (5 g/L NaCl) at 28 °C in the presence of the appropriate antibiotic. Antibiotic resistance and concentrations used in the competition assay are mentioned in Supplementary Table 25.

Competition assay. Competition assays were conducted similarly as described elsewhere66,100. Briefly, bacterial overnight cultures were harvested and washed in PBS (pH 7.4) to remove excess antibiotics, and resuspended in fresh NB medium to an optical density of 10. Predator and prey strains were mixed at a 1:1 ratio, and 5 µL of the mixture was spotted onto dry NB agar plates and incubated at 28 °C. As a negative control, the same volume of NB medium was mixed with prey cells instead of the predator strain. After 19 h of coincubation, bacterial spots were excised from the agar and resuspended in 500 µL of NB medium and then spotted on NB agar containing antibiotic selective for the prey strains. CFUs of recovered prey cells were determined after incubation at 28 °C. All assays were performed in at least three biological replicates.

Life sciences reporting summary. Further information on experimental design is available in the Life Sciences Reporting Summary.

Data availability. All new genomes (Supplementary Table 3) were submitted and are publicly available in at least one of the following databanks (see accessions in Supplementary Table 3):
1. IMG/M, https://img.jgi.doe.gov/cgi-bin/m/main.cgi
2. GenBank, https://www.ncbi.nlm.nih.gov/genbank/
3. ENA, http://www.ebi.ac.uk/ena
4. A dedicated website for the Dangl lab: http://labs.bio.unc.edu/Dangl/Resources/gfobap_website/index.html
The dedicated website contains nucleotide and amino acid FASTA files of all datasets used, protein/domain annotations (COG, KO, TIGRFAM, Pfam), metadata, phylogenetic trees, OrthoFinder orthogroups, orthogroup hidden Markov models, full enrichment datasets, correlation between orthogroups, and predicted operons (“URLs”). Links to different scripts that were used in analysis are included in the “URLs” section. The full genome sequence, gene annotation, and metadata of each genome used can be found at the IMG website (https://img.jgi.doe.gov/). For example, the metadata of taxon ID 2558860101 can be found at https://img.jgi.doe.gov/cgi-bin/mer/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2558860101.

References
79. Doty, S. L. et al. Diazotrophic endophytes of native black cottonwood and willow. Symbiosis 47, 23–33 (2009).
80. Weston, D. J. et al. Pseudomonas fluorescens induces strain-dependent and strain-independent host plant responses in defense networks, primary metabolism, photosynthesis, and fitness. Mol. Plant Microbe Interact. 25, 765–778 (2012).
81. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
82. Beszteri, B., Temperton, B., Frickenhaus, S. & Giovannoni, S. J. Average genome size: a potential source of bias in comparative metagenomics. ISME J. 4, 1075–1077 (2010).
83. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
84. Varghese, N. J. et al. Microbial species delineation using whole genome sequences. Nucleic Acids Res. 43, 6761–6771 (2015).
85. Kerepesi, C., Bánky, D. & Grolmusz, V. AmphoraNet: the webserver implementation of the AMPHORA2 metagenomic workflow suite. Gene 533, 538–540 (2014).
86. Wu, M., Chatterji, S. & Eisen, J. A. Accounting for alignment uncertainty in phylogenomics. PLoS One 7, e30288 (2012).
87. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS One 5, e9490 (2010).
88. Sen, A. et al. Phylogeny of the class Actinobacteria revisited in the light of complete genomes. The orders ‘Frankiales’ and Micrococcales should be split into coherent entities: proposal of Frankiales ord. nov., Geodermatophilales ord. nov., Acidothermales ord. nov. and Nakamurellales ord. nov. Int. J. Syst. Evol. Microbiol. 64, 3821–3832 (2014).
89. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
90. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
91. Wang, Z. & Wu, M. A phylum-level bacterial phylogenetic marker database. Mol. Biol. Evol. 30, 1258–1262 (2013).
92. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Stat. Methodol. 57, 289–300 (1995).
93. Finn, R. D. et al. HMMER web server: 2015 update. Nucleic Acids Res. 43, W30–W38 (2015).
94. Alexeyev, M. F. The pKNOCK series of broad-host-range mobilizable suicide vectors for gene knockout and targeted DNA insertion into the chromosome of gram-negative bacteria. Biotechniques 26, 824–826 (1999).
95. Hadjithomas, M. et al. IMG-ABC: a knowledge base to fuel discovery of biosynthetic gene clusters and novel secondary metabolites. MBio 6, e00932 (2015).
96. Katoh, K., Misawa, K., Kuma, K. & Miyata, T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066 (2002).
97. Stamatakis, A., Hoover, P. & Rougemont, J. A rapid bootstrap algorithm for the RAxML Web servers. Syst. Biol. 57, 758–771 (2008).
98. Finkel, O. M., Béjà, O. & Belkin, S. Global abundance of microbial rhodopsins. ISME J. 7, 448–451 (2013).
99. Traore, S. M. Characterization of Type Three Effector Genes of A. citrulli, the Causal Agent of Bacterial Fruit Blotch of Cucurbits. (Virginia Polytechnic Institute and State University, Blacksburg, VA, 2014).
100. Basler, M., Ho, B. T. & Mekalanos, J. J. Tit-for-tat: type VI secretion system counterattack during bacterial cell-cell interactions. Cell 152, 884–894 (2013).
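The operon-prediction rule given in the Methods (significant genes on the same scaffold and strand, up to 200 bp apart, with at most one non-significant spacer gene between them) can be sketched in Python as below. This is only an illustration under assumptions: gaps are measured between adjacent genes on the same strand (including gaps to a spacer), genes on the opposite strand are ignored, and a predicted operon is required to contain at least two significant genes. The `Gene` record and all gene identifiers are hypothetical.

```python
from collections import defaultdict, namedtuple

Gene = namedtuple("Gene", "gene_id scaffold strand start end significant")

def predict_operons(genes, max_gap=200, max_spacers=1, min_genes=2):
    """Group significant genes into operons; returns lists of gene IDs."""
    by_track = defaultdict(list)
    for gene in genes:
        by_track[(gene.scaffold, gene.strand)].append(gene)

    operons = []
    for key in sorted(by_track):
        track = sorted(by_track[key], key=lambda g: g.start)
        current = []     # significant genes in the operon being built
        spacers = 0      # non-significant genes since the last significant one
        prev_end = None
        for gene in track:
            gap_ok = prev_end is not None and gene.start - prev_end <= max_gap
            if not gap_ok:
                # Too far from the previous gene: close the open operon.
                if len(current) >= min_genes:
                    operons.append(current)
                current, spacers = [], 0
            if gene.significant:
                current.append(gene.gene_id)
                spacers = 0
            elif current:
                spacers += 1
                if spacers > max_spacers:
                    # More spacer genes than allowed: close the open operon.
                    if len(current) >= min_genes:
                        operons.append(current)
                    current, spacers = [], 0
            prev_end = gene.end
        if len(current) >= min_genes:
            operons.append(current)
    return operons

# Invented example: g3 is a non-significant spacer, g5 is too far away, g6 is on the other strand.
example = [
    Gene("g1", "scaffold_1", "+", 100, 400, True),
    Gene("g2", "scaffold_1", "+", 450, 900, True),
    Gene("g3", "scaffold_1", "+", 950, 1200, False),
    Gene("g4", "scaffold_1", "+", 1250, 1600, True),
    Gene("g5", "scaffold_1", "+", 2500, 2900, True),
    Gene("g6", "scaffold_1", "-", 3000, 3300, True),
]
operons = predict_operons(example)
print(operons)
```

Annotating a predicted operon as a biosynthetic gene cluster would then amount to checking whether any of its gene IDs appears in an external set of cluster members, such as one exported from IMG-ABC.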

nature research | life sciences reporting summary

Corresponding author(s): Jeffery L. Dangl

Life Sciences Reporting Summary

Nature Research wishes to improve the reproducibility of the work that we publish. This form is intended for publication with all accepted life science papers and provides structure for consistency and transparency in reporting. Every life science submission will use this form; some list items might not apply to an individual manuscript, but all fields must be completed for clarity. For further information on the points included in this form, see Reporting Life Sciences Research. For further information on Nature Research policies, including our data availability policy, see Authors & Referees and the Editorial Policy Checklist.

Experimental design

1. Sample size
Describe how sample size was determined.
Computational analysis was done with sample sizes on the order of hundreds. Experiments were done with sample sizes of 3–30 biological replicates.

2. Data exclusions
Describe any data exclusions.
No data were excluded.

3. Replication
Describe whether the experimental findings were reliably reproduced.
All reported results in the paper were reliably reproduced in biological replicates.

4. Randomization
Describe how samples/organisms/participants were allocated into experimental groups.
Mostly irrelevant to the study. In the few cases where we used randomizations, we made sure the randomized control group had similar features to the tested group. For example, in the randomization done for PREPARADOs enrichment in plant NLR immune proteins, we randomly sampled 500 PA/RA domains and 500 plant protein domains.

5. Blinding
Describe whether the investigators were blinded to group allocation during data collection and/or analysis.
Irrelevant to the study.

Note: all studies involving animals and/or human research participants must disclose whether blinding and randomization were used.

6. Statistical parameters
For all figures and tables that use statistical methods, confirm that the following items are present in relevant figure legends (or in the Methods section if additional space is needed).

The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement (animals, litters, cultures, etc.)
A description of how samples were collected, noting whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
A statement indicating how many times each experiment was replicated
The statistical test(s) used and whether they are one- or two-sided (note: only common tests should be described solely by name; more complex techniques should be described in the Methods section)
A description of any assumptions or corrections, such as an adjustment for multiple comparisons
The test results (e.g. P values) given as exact values whenever possible and with confidence intervals noted
A clear description of statistics including central tendency (e.g. median, mean) and variation (e.g. standard deviation, interquartile range)
Clearly defined error bars

See the web collection on statistics for biologists for further resources and guidance.

Software

Policy information about availability of computer code

7. Software
Describe the software used to analyze the data in this study.
All software used is described in methods and provided to readers. We mention different R packages used in analysis.

For manuscripts utilizing custom algorithms or software that are central to the paper but not yet described in the published literature, software must be made available to editors and reviewers upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). Nature Methods guidance for providing algorithms and software for publication provides further information on this topic.

Materials and reagents

Policy information about availability of materials

8. Materials availability
Indicate whether there are restrictions on availability of unique materials or if these materials are only available for distribution by a for-profit company.
There are no restrictions on materials availability.

9. Antibodies
Describe the antibodies used and how they were validated for use in the system under study (i.e. assay and species).
No antibodies were used.

10. Eukaryotic cell lines
a. State the source of each eukaryotic cell line used.
Irrelevant to the study.
b. Describe the method of cell line authentication used.
Irrelevant to the study.
c. Report whether the cell lines were tested for mycoplasma contamination.
Irrelevant to the study.
d. If any of the cell lines used are listed in the database of commonly misidentified cell lines maintained by ICLAC, provide a scientific rationale for their use.
Irrelevant to the study.

Animals and human research participants

Policy information about studies involving animals; when reporting animal research, follow the ARRIVE guidelines

11. Description of research animals
Provide details on animals and/or animal-derived materials used in the study.
Irrelevant to the study.

Policy information about studies involving human research participants

12. Description of human research participants
Describe the covariate-relevant population characteristics of the human research participants.
Irrelevant to the study.
