bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1 Polyploidization events shaped the 2 repertoires in legumes () 3 4 Kanhu C. Moharana and Thiago M. Venancio* 5 6 Laboratório de Química e Função de Proteínas e Peptídeos, Centro de Biociências e 7 Biotecnologia, Universidade Estadual do Norte Fluminense Darcy Ribeiro; Campos dos 8 Goytacazes, Brazil. 9 10 * Corresponding author 11 Av. Alberto Lamego 2000 / P5 / 217; Parque Califórnia 12 Campos dos Goytacazes, RJ 13 Brazil 14 CEP: 28013-602 15 [email protected] 16

1 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

17 Abstract 18 Transcription factors (TF) are essential for proper growth and development. 19 Several legumes, particularly soybean, are rich sources of protein and oil, with great 20 impact in the economy of several countries. Here we report a phylogenomic analysis of 21 major TF families in legumes and their potential association with important traits such 22 as nitrogen fixation and seed development. We used TF DNA-binding domains to 23 systematically screen the genomes of 15 legume and 5 non-legume species. The 24 percentage of TFs ranged from 3-8% of the gene complements. TF orthologous groups 25 (OG) in extant species were used to estimate OG sizes in ancestor nodes using a gene 26 birth-death model, which allowed us to identify lineage-specific expansions. Together, 27 OG analysis and rate of synonymous substitutions (Ks) between gene pairs show that 28 major TF expansions are strongly associated with known whole-genome duplication 29 (WGD) events in the legume (~58 mya) and Glycine (~13 mya) lineages, which account 30 for a large fraction of the Ph. vulgaris and Gl. TF repertoires. Out of the 3407 Gl. 31 max TFs, 1808 and 676 can be traced back to a single homeolog in Ph. vulgaris and Vi 32 vinifera, respectively. We found a trend for TFs expanded in legumes to be preferentially 33 transcribed in roots and nodules, suggesting their recruitment early in the evolution of 34 nodulation in the legume clade. We also found TF expansions in the Glycine WGD that 35 were followed by gene loss in the wild soybean Gl. soja, including genes located within 36 important quantitative trait loci. Together, our findings strongly support the roles of two 37 WGDs in shaping the TF repertoires in the legume and Glycine lineages, which are likely 38 related to important aspects of legume and soybean biology. 39 40 Keywords: whole genome duplication, phylogenomics, segmental duplication, 41 nodulation, soybean.

42 Abbreviations: 43 TF: transcription factors; DBD: DNA binding domains; AP2: APETALA 2; ERF: Ethylene 44 Responsive Factor; RAV: Related to ABI3/VP1; ARF: Auxin response factor; BBR-BPC: 45 Barley B Recombinant (BBR) - BASIC PENTACYSTEINE1 (BPC1); BES1: BRI1-EMS- 46 SUPPRESSOR; bHLH: Basic helix loop helix; bZIP: Basic ; Dof: DNA binding 47 with one finger; CO-like: CONSTANS-like; LSD: LESION SIMULATING DISEASE 1 (LSD1); 48 C2H2: CCHH (Zn); C3H: CCCH (Zn); CAMTA: Calmodulin binding transcription factors; 49 CPP: Cystein-rich polycomb-like protein; DBB: Double B-box ; /DP: E2 50 factor protein and DP protein ; EIL: Ethylene-Insentive3 (EIN3)-like protein3 (EIL3); FAR1: 51 FAR-RED IMPAIRED RESPONSE1 ; LFY: LEAFY; G2-like: Golden2 (G2)-like ; ARR-B: Type-B 52 phospho-accepting response regulator; GeBP: GLABROUS1 enhancer-binding protein ; 2 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

53 GRAS: GAI, RGA, and SCR; GRF: GROWTH-REGULATING FACTOR; TALE: Three Amino acid 54 Loop Extension; WOX: WUS -containing protein family ; HB: Homeobox ; HB- 55 PHD: HB-PHD finger; HB-other: HB-other; HRT-like: Hairy-Related transcription-factor- 56 like; HSF: ; LBD: ASYMMETRIC LEAVES2/LATERAL ORGAN 57 BOUNDARIES; M-type: MADS-type I; MIKC: MADS-type II; MYB: Myb proto-oncogene 58 protein; NAC: NAM, ATAF1, 2 and CUC2; NF-X1: Nuclear factor, X-box binding 1; NF-YA: 59 Nuclear factor Y subunit A; NF-YB: Nuclear factor Y subunit B; NF-YC: Nuclear factor Y 60 subunit C; Nin-like: NODULE INCEPTION ; NZZ/SPL: SPOROCYTELESS/NOZZLE; S1Fa-like: 61 S1Fa-like; TCP: TEOSINTE-LIKE1, CYCLOIDEA, and PROLIFERATING CELL FACTOR1; ZF-HD: 62 Zinc finger homeodomain protein; SBP: SQUAMOSA promoter binding protein; SRS: SHI 63 RELATED SEQUENCE; SAP: STERILE APETALA; STAT: Signal Transducers and Activators of 64 Transcription; VOZ: One-Zinc finger. 65 66 Introduction 67 Legumes (Fabaceae) are the third largest Angiosperm family, comprising nearly 20,000 68 species with tremendous morphological and ecological variation (Lewis, 2005). Legumes 69 are notorious for their symbiotic interactions with specific diazotrophic bacteria, which 70 is a feature of major ecological and agronomic relevance. Out of the six Fabaceae 71 subfamilies, Papilionoideae alone accounts for two-thirds of all legume species, 72 including economically important crops, such as Glycine max (soybean), Phaseolus 73 vulgaris (common bean), Arachis hypogaea (peanut), and arietinum () 74 (Graham and Vance, 2003;Cardoso et al., 2012;Azani et al., 2017). Several legume grains 75 and pulses are rich sources of dietary proteins, cooking oils, and biofuels. Currently, at 76 least 15 legume genomes are publicly available (Table 1), including those from wild and 77 cultivated soybean (Glycine soja and Gl. max). Nevertheless, Fabaceae subfamilies other 78 than Papilionoideae are largely underrepresented among sequenced genomes, despite 79 the recently published Chamaecrista fasciculata and Mimosa pudica genomes 80 (Griesmann et al., 2018). 81 Virtually all major biological processes are at least partially regulated at the 82 transcriptional level by specific DNA-binding transcription factors (TFs), which bind to 83 cis-regulatory elements of target genes by means of DNA binding domains (DBDs). 84 Because of their key regulatory roles, TFs have been extensively demonstrated to be 85 critical for plant evolution and adaptation to multiple environments (Doebley and 86 Lukens, 1998;Lehti-Shiu et al., 2017). A typical TF family encodes proteins sharing a 87 common DBD. Over 50 TF families have been identified in (Jin et al., 2017), out of 88 which many regulate biological processes such as growth, development, stress signaling, 89 and defense against pathogens (Lata et al., 2011;Tripathi et al., 2016;He et al., 2018).

3 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

90 Although many TF families are present throughout eukaryotes, their sizes can vary 91 considerably (Lespinet et al., 2002;Nagata et al., 2016;Lehti-Shiu et al., 2017). Plant TF 92 families are usually larger than their animal counterparts (Shiu, 2005) and the 93 exceptional increase in TF family sizes in higher plants is often related with whole 94 genome duplication (WGD) and triplication (WGT) events, also known as 95 polyploidization (Lehti-Shiu et al., 2017). 96 Polyploidization is widely accepted as an important factor in angiosperm 97 evolution (Pickett and Meeks-Wagner, 1995;Soltis et al., 2015). All core share 98 at least one polyploidization event (a hexaploidy event, i.e. the γ polyploidy) (Jaillon et 99 al., 2007;Aköz and Nordborg, 2019). With the increasing availability of plant genomes, it 100 became clear that several other WGD events occurred after the γ polyploidization event. 101 For example, Arabidopsis has two additional lineage-specific WGDs (α and β polyploidy) 102 (Blanc et al., 2000;Vision et al., 2000), whereas legumes share a tetraploid ancestor that 103 originated ~58-60 million years ago (mya) (Cannon et al., 2006). Further analysis of the 104 soybean and narrowleaf lupin (Lupinus angustifolius) genomes uncovered additional 105 lineage-specific WGDs in these lineages (Schmutz et al., 2010;Kroc et al., 2014;Hane et 106 al., 2017). Importantly, domestication of polyploid species is more likely than that of 107 their wild relatives (Salman-Minkov et al., 2016), supporting the importance of such 108 events in agriculture. In addition to large scale duplication events, small scale 109 duplications (SSDs) or local (e.g. tandem) duplications also contributed to the expansion 110 of multiple gene families (Cannon et al., 2004). 111 The relative contribution of duplication modes in plant genomes is a subject of 112 intense research (Panchy et al., 2016). Upon WGD, gene loss is the most common fate of 113 one of the duplicates, in a process called fractionation (Freeling et al., 2015;Panchy et 114 al., 2016;Cheng et al., 2018). Nevertheless, some gene families (e.g. TFs and signal 115 transduction genes) are apparently more prone to retain duplicated copies than others 116 (Blanc and Wolfe, 2004;Lehti-Shiu et al., 2017). Different mechanistic explanations have 117 been proposed for this phenomenon, out of which the gene balance hypothesis is the 118 most accepted one. According to this hypothesis, upon a WGD, genes with many 119 interaction partners have higher probability of being retained in duplicates, since 120 alterations in the stoichiometry their protein products tend to be deleterious (Birchler 121 and Veitia, 2007;Freeling, 2009;Birchler and Veitia, 2011). Retained copies then typically 122 evolve via subfunctionalization (i.e. duplicates acquire complementary functions) or 123 neofunctionalization (i.e. one of the copies evolves a new function) (Freeling et al., 124 2015). 125 It is currently accepted that the transition from wild to domesticated soybean 126 took place in Central China between 5,000 and 9,000 years ago through a gradual 127 process that involved an intermediary species, Gl. gracilis (Han et al., 2016;Sedivy et al., 4 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

128 2017). Other lines of evidence support independent domestication events in East Asia 129 (Korea and Japan) (Zhou et al., 2015;Sedivy et al., 2017). Artificial selection during 130 domestication involved several distinct traits, such as pod shattering, seed hardness, 131 adaptation to different photoperiods, flowering time, and stress resistance (Zhou et al., 132 2015;Sedivy et al., 2017). Several important TFs are involved in these traits. SHAT1-5 133 (NAC family TF) promote shattering resistance by increasing lignification of fiber cap 134 cells (Dong et al., 2014). 135 The progress in sequencing technologies and automation over the past 12 years 136 unleashed the power of comparative and population genomics in pinpointing key genes 137 involved in domestication and improvement of soybean and other crops, as illustrated 138 by the discovery of many Quantitative Trait Loci (QTL) involved in commercially 139 important traits (e.g. seed weight and oil content) by the resequencing of 302 soybean 140 accessions (Zhou et al., 2015). In the present work we systematically investigate the 141 evolution of TF repertoires in legumes (Fabaceae). Briefly, we performed large-scale 142 comparative analysis of TFs from 15 legume and five non-legume species. Our results 143 unveil a profound impact of polyploidization events on the expansion of TF families 144 throughout legumes. In particular, some TF families that expanded in the legume WGD 145 event (~58 mya) are preferentially expressed in roots and nodules, supporting their 146 importance in the evolution of nodulation. Further, TF expansions that happened at the 147 Glycine WGD (~13 mya) include genes that were subsequently lost in Gl. soja, including 148 TF genes that are within Gl. max QTLs associated with leaf shape, area and width, 149 proportion of FA18 in seeds, and branch density. Together, our results strongly support 150 that WGD events deeply shaped the evolution of TF repertoires in legumes and likely 151 generated TFs that regulate nodulation and other traits of key agronomic importance. 152 153 154 Results and Discussion

155 Systematic identification and characterization of transcription factors 156 We used a set of diagnostic specific DNA binding domains and forbidden domains 157 (Supplementary Table S1) to identify TFs in the genomes of 20 plant species (Table 1). 158 We predicted a total of 37,008 TFs (Supplementary Table S2), which were classified in 58 159 broad families (Table 2). A total of 31,111 TFs were predicted in the 15 legume genomes 160 (Supplementary Table S2). We benchmarked our pipeline by comparing the detected 161 TFs with those previously predicted in Ar. thaliana. Out of 1713 Ar. thaliana TFs 162 available in PlantTFDB, 1673 (98%) were correctly predicted (Supplementary Figure S1). 163 Further, 59 TFs were exclusively predicted by our pipeline, out of which 40 were 164 annotated as TFs in the TAIR database (https://www.arabidopsis.org) (Supplementary 5 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

165 Table S3). The percentage of TFs across genomes ranged from 5 to 8%, which is in line 166 with a previous estimation from 95 eudicot species (Jin et al., 2017). Gl. soja and Se. 167 moellendorffii showed the highest and lowest number of TFs, respectively (Figure 1). 168 Legumes typically showed greater number of TFs than non-legumes (Figure 1), although 169 the variation in these fractions indicates that some TF expansions play specific roles in 170 particular lineages. 171 To better understand the different proportion of TFs across genomes, we 172 compared TF family sizes between pairs of species. All except six TF families (i.e. AP2, 173 GRAS, B3, Nin-like, HRT-like, and Trihelix) expanded in the basal angiosperm Am. 174 trichopoda in comparison to the lycophyte Se. moellendorffii, supporting the 175 contribution ancient WGD events to the TF repertoires of seed plants (Albert et al., 176 2013). Although MADS TFs tightly regulate flower development, their diversification has 177 been proposed to predate the origin of angiosperms (Albert et al., 2013). We found 178 twice more MADS genes in Am. trichopoda (n=34) than in Se. moellendorffii (n=15). In 179 particular, the MIKC-type (type II) MADS subfamily alone has increased by five-fold, in 180 spite of the higher rate of gene birth/death of the M-type (type-I) MADS subfamily (Nam 181 et al., 2004;Kumpeangkeaw et al., 2019). By analyzing TF clusters (described later) we 182 observed that genes from two M-type MADS clusters are exclusively present in Am. 183 trichopoda, probably as a result of a lineage-specific expansion. We also analyzed the 184 expansion of the GRAS family in Am. trichopoda (n= 44) as compared to the basal dicot 185 Aq. coerulea (n= 36), which happened via lineage-specific tandem duplications (10 186 genes) in the former (Supplementary Figure S3). These 10 genes belong to a single 187 orthologous group (OG, see below) that does not have orthologs from other dicots 188 except one from the basal eudicot Aq. coerulea. In addition, we found one more Am. 189 trichopoda specific OG consisting of two GRAS genes (scaffold00007.332 and 190 scaffold00007.335). There are also a few remarkable expansions in Se. moellendorffii in 191 comparison to Am. trichopoda (e.g. HD-Zip, NAC, TCP, GATA, expanded by more than 192 two fold) (Table 2). Together, these results support a growth of the TF repertoire early in 193 the diversification of angiosperms. 194 Aq. coerulea is an ancient tetraploid and this tetraploidy was likely an important 195 first step towards the gamma hexaploidy (4n+2n) that is shared by all core eudicots 196 (Aköz and Nordborg, 2019). Nevertheless, some TF families are remarkably larger in Aq. 197 coerulea than in Vi. vinifera, such as FAR1 (Aco: 92, Vvi: 19), B3 (Aco: 104, Vvi:29), GeBP 198 (Aco: 13, Vvi: 1), and M-Type MADS (Aco: 50, Vvi: 17) (Figure 2; Table 2). After the 199 gamma hexaplodization event, Vi. vinifera has not undergone any large scale 200 duplication event, making it a suitable reference for comparative analysis with other 201 core eudicots (Jaillon et al., 2007;Severin et al., 2011;Wang et al., 2017). Most large 202 families are expanded in Ar. thaliana and legumes in comparison to Vi. vinifera (Figure 6 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

203 2). There are also some notable species-specific expansions in legumes, such as that of 204 FAR1, B3, and M-Type MADS in Me. truncatula (Figure 2). FAR1 has also expanded 10 205 times in Ar. ipaensis and Ar. duranensis. FAR1 has been linked with skotomorphogenesis 206 and photomorphogenesis in higher plants and its expansion might be related with the 207 fructification process in peanuts (Chen et al., 2016;Lu et al., 2018). Unlike the above- 208 mentioned expansions of MADS TFs in Am. trichopoda, only M-type (type-I) MADS had 209 large expansions in all legumes in comparison to Vi. Vinifera, as previously discussed 210 (Nam et al., 2004;Feil et al., 2013;Kumpeangkeaw et al., 2019). 211 The Glycine genus has a more recent WGD that is not shared with Phaseolus. 212 Accordingly, we found an approximate ratio of 1:2 between Ph. vulgaris and Gl. max in 213 90% (52/58) of the TF families, implying that the Glycine WGD has strongly contributed 214 to the soybean TF repertoire. There are also a few deviations from this trend, such as 215 the NAC (Gl. max: 142 and Ph. vulgaris: 90) and NOZZLE/SPL (Gl. max: 2 and Ph. vulgaris: 216 3) families. Of the 42 NAC OGs with genes from Ph. vulgaris and Gl. max, 9 have 217 identical number of genes, indicating that there are subfamilies that rapidly reverted 218 back to their configuration before the Glycine WGD, probably due to gene dosage 219 sensitivity. We also noticed that Gl. soja has 38 more TFs than Gl. max. While sixteen TF 220 families had identical number of genes in both species, others such as ERF (Gl. max: 290, 221 Gl. soja: 279) and FAR1 (Gl. max: 68, Gl. soja:79) showed significant variation between 222 cultivated and wild soybeans. 223

224 Origin of TF paralogs 225 We also investigated the modes of duplication shaping TF family sizes. 226 Paralogous pairs had their modes of duplication predicted using a previously described 227 scheme (Proulx et al., 2011;Qiao et al., 2018) that assign duplicate pairs in the following 228 categories: segmental duplicates (SD); tandem duplicates (TD); pairs separated by one 229 to five intervening genes were called proximal duplicates (PD); pairs originated via 230 retrotransposon (rTE) or DNA transposon (dTE) activity. The remaining pairs were 231 classified as dispersed duplicates (DD). We used the priority order 232 SD>TD>PD>rTE>dTE>DD to assign a single duplication mode to each pair. In legumes, 233 more than 70% of the TFs have at least one paralog (Table 3). Further, it is clear that SD 234 is the main duplication mode, supporting their origin through large scale duplication, as 235 previously reported (Lehti-Shiu et al., 2017). Gl. max and Gl. soja are the species with 236 the greatest number of SD TFs, which comprise 77.7% of the TF repertoire in the former. 237 In addition, local duplications (i.e. TD and PD) also contributed to TF repertoires, 238 particularly in Me. truncatula, which has 11% (241/1752) of the duplicate TFs classified 239 as TD and PD, especially in the ERF, WRKY, and B3 families (Supplementary Table S4). 7 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

240 The prevalence of TD pairs in Me. truncatula has also been reported in genes related to 241 other regulatory roles, such as in the F-box family (Bellieny-Rabelo et al., 2013). Further, 242 in Ar. duranensis, Ar. ipaensis, and Vi. radiata, nearly 7% of the duplicated TFs are 243 derived from TD, whereas dTE duplications account for 29.8% of the duplicated TFs in 244 Ca. fasciculata, especially in the MYB, NAC, and bHLH families (Table 3; Supplementary 245 Table S4). There are also some notable differences in the prevalence of modes of 246 duplication between closely related species. For example, local TF duplications are more 247 frequent in Ar. ipaensis than in Ar. duranensis (Table 3). 248 Between 5.5% (188/3407, in Gl. max) and 60.7% (560/923, in Am. trichopoda) of 249 the TFs were classified as singletons (Table 3). While in Ar. thaliana 22.58% (392/1736) 250 of the TFs were singletons, in legumes this number ranges between 5.5 and 29% (Table 251 3). Importantly, a large fraction of these singletons remain syntenic to a reference 252 outgroup species (Table 4). In Ph. vulgaris and Me. truncatula, syntenic singleton TFs 253 were significantly more expressed than their non-syntenic counterparts (Figure 3A), 254 suggesting that their greater functional conservation is associated with their genomic 255 context. Most SD TFs were also found to be syntenic in their closest outgroup species 256 (Table 4). We also estimated non-synonymous/synonymous mutation ratios (Ka/Ks) 257 between singleton and SD TFs with preserved synteny in an outgroup species. Orthologs 258 from SD pairs had significantly lower Ka/Ks than the singleton orthologs (Figure 3B), 259 leading us to hypothesize that these genes are under strong purifying selection due to 260 their involvement in intricate regulatory systems emerging from the WGD events. 261 Similar observations on the strong negative selection of duplicated genes have been 262 also reported in other species (Davis and Petrov, 2004;Jordan et al., 2004). 263 The significant number of syntenic SD TFs derived mostly by gene retention after 264 successive WGD events (Table 3). Gene duplicability, the ability of a duplicate pair to 265 remain duplicate, is non-random and biased towards specific gene families, including 266 TFs (Lynch and Conery, 2000;Davis and Petrov, 2004;Li et al., 2016). We analyzed TF 267 duplicability using Gl. max TFs from syntenic blocks that survived the 58 mya and the 13 268 mya WGD events. We used intra-species collinear blocks to identify Gl. max SD TFs that 269 correspond to single syntenic regions in Ph. vulgaris. We used a maximum Ks threshold 270 of 0.4 to filter the Gl. max SD pairs that likely emerged in the 13 mya WGD (Schmutz et 271 al., 2010). Nearly 81% (1808/2230) of the Gl. max SD TFs within that Ks range had a 272 syntenic gene in Ph. vulgaris. Further, 75% (676/904) of such Ph. vulgaris orthologs had 273 a single Vi. vinifera syntenic ortholog. In both cases we found that bHLH family had the 274 highest number of syntenic gene pairs (Gl. max-Ph. vulgaris: 99 pairs and Ph. vulgaris-Vi. 275 Vinifera: 43 pairs). Conversely, only 16% (15/93) of the Gl. max syntenic singleton TFs 276 correspond to single genomic regions in Ph. vulgaris and Vi vinifera. We hypothesize 277 that these genes not only depend on the conservation of a local genomic context, but 8 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

278 are also sensitive to gene dosage. Our results clearly illustrate the high duplicability of 279 most TF families in soybean and further support the impact of two WGD events that 280 account for a prominent fraction of the TF repertoire of this species. 281

282 Large scale duplication events correlate with increase in TF copy number 283 Many TF families explored here are broad and diversified, often comprising 284 multiple sub-groups, such as bHLH (Pires and Dolan, 2010), MYB (Du et al., 2012), and 285 ERF (Nakano, 2006). To obtain an overview of the diversification of plant TF families, we 286 assigned them to OGs by using all-vs-all reciprocal BLASTP search, followed by Markov 287 clustering (see methods for details). For example, AP2 had 28 clusters (Supplementary 288 Table S5), labeled as AP:1 to AP:28. We found 1557 TF OGs from the 58 TF families 289 reported above. Nearly 9% (144/1557) of these OGs had no members from legume 290 species, whereas 29% (452/1557) were legume-specific, and 43% (672/1557) had genes 291 from at least 10 species. Expectedly, larger families had more OGs, such as bHLH, C2H2, 292 and MYB, with more than 100 OGs each. Conversely, a few families diverted from this 293 trend, such as SAP (65 genes and 7 OGs) and EIL (143 members and 13 OGs) 294 (Supplementary Figure S4). 295 To investigate the evolution of TF families in more detail, we analyzed the 296 number of genes per OG in each species using CAFE (v.4.2) (Bie et al., 2006;Han et al., 297 2013), an elegant method that uses gene birth (λ) and death (μ) rates to model gain and 298 loss events in different lineages of a given ultrametric species tree (see methods for 299 details). We searched for optimal λ and μ based on the maximum likelihood score, using 300 the option to take potentially fragmented genomes into account (see methods for 301 details). We used 672 TF OGs with sufficient variation in number of genes per species 302 (statistical variance ≥ 0.5) and containing genes from at least 10 species. For example, 303 10 out of 31 AP2 clusters were used for rate estimation (Supplementary Table S5). We 304 repeated the rate parameter search for 50 times and the parameters resulting in the 305 best maximum likelihood score were used for further analysis. The results obtained with 306 CAFE largely confirm the general trend for TF gain upon WGD (Figure 4), which is in line 307 with the above-mentioned correlation between SD, the retention of paralogous TF pairs, 308 and intraspecies synteny. This trend can be exemplified by the nodes representing the 309 legume and Glycine ancestors, which have a high number of expanded TF families 310 (Figure 4). 311 We also investigated the impact of the legume and Glycine WGDs in the TF 312 repertoires of Gl. max and Gl. soja. Firstly, we analyzed the 138 OGs (from 34 TF 313 families) that expanded in legumes in comparison to non-legumes (Figure 4). If all 1557 314 OGs are considered, an average of 0.21 genes were gained per OG in legumes, in 9 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

315 contrast to 1.09 genes in the 138 expanded OGs. Secondly, we identified rapidly 316 evolving OGs (10 of 138; 7.25%) (Table 5), which are those with significant gene gain or 317 loss rate (p-value < 0.05) (Table 6). In these 10 OGs, the average rate of gene gain was 318 2.0, nearly two and 10 times of that observed in the 138 expanded OGs and in the 319 complete set of 1557 OGs, respectively. 320 We analyzed the most prevalent modes of duplication in the 138 OGs that 321 expanded in legumes and found that a major fraction of them emerged via SD. While 322 SDs comprise more than 80% of the TFs in species with more recent WGDs (i.e. Gl. max, 323 Gl. soja, and Lu. angustifolius) (Figure 5), several SD pairs might have lost collinearity 324 after the legume WGD and were classified as DDs. When inspecting the Ks distributions 325 of the paralogous pairs from the 138 OGs that expanded in legumes, we found ranges 326 corresponding to both, the legume and Glycine WGDs (Figure 5) (Schmutz et al., 327 2010;Cannon, 2013), suggesting that a fraction of the TFs that expanded in the legume 328 WGD subsequently duplicated for a second time in the Glycine WGD. Deviations from 329 this range were observed for Me. truncatula, Arachis spp., Cicer spp. and, Ch. 330 fasciculata, as previously reported (Cannon et al., 2010;Varshney et al., 2013;Tang et al., 331 2014;Chen et al., 2016). DD paralogs have a more dispersed Ks distribution than that SD, 332 although their Ks distributions also indicate that several DD pairs were likely generated 333 by SD with subsequent loss of collinearity (Figure 5). Collectively, these results support 334 the association between the expansion of legume-specific TF expansions and the WGD 335 event that took place 58 mya. 336 To further explore the functional relevance of legume TF expansions, we 337 analyzed gene expression patterns (see methods for details) across multiple tissues from 338 Me. truncatula, Ph. vulgaris, and Gl. max (Figure 6; Supplementary Figure S5). Strikingly, 339 the 138 OGs that expanded in legumes are enriched in genes with preferential 340 expression in nodules (Fisher's Exact Test, p-values = 1.7 × 10-4 and 1.4 × 10-5 for Me. 341 truncatula and Gl. max, respectively) and roots (Fisher's Exact Test, p-values = 1.4 × 10-9 342 and 5.9 × 10-3 for Me. truncatula and Ph. vulgaris, respectively). These results indicate 343 that the recruitment of these genes predate the emergence of nodulation in legumes 344 and might have played roles in the root physiology associated with this process. 345 Next, we integrated phylogenetic reconstructions of the 10 rapidly expanded 346 OGs with gene expression data and found three interesting groups (i.e. bHLH:12, M- 347 type:1, ERF:10) (Table 6). The bHLH:12 OG showed significantly higher expression during 348 nodule development in Me. truncatula, Ph. vulgaris, and Gl. max (Figure 6B and 349 Supplementary Figure S5). This OG includes two SD pairs of Me. truncatula bHLHs, 350 Medtr4g087920-Medtr2g015890 and Medtr4g079760-Medtr2g091190 with Ks values of 351 1.0523 and 0.8292, respectively. Ph. vulgaris and Gl. max orthologs of these genes were 352 also more expressed in roots and nodules than in other tissues (Figure 6). 10 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

353 Another interesting OG encodes TFs from the M-type MADS family (M-Type:1). 354 This OG has independently expanded in different species (Supplementary Figure S6), 355 including a major expansion in Lu. angustifolius. Nine of 14 Gl. max genes in this cluster 356 emerged within the Glycine genus, including one gene (Glyma.03G083700.1) with 357 preferential expression in seeds and flowers (Supplementary Figure S7A). The two Ph. 358 vulgaris orthologs (Phvul.006G077700 and Phvul.006G077800.1) showed seed-specific 359 expression, suggesting their importance in seed development (Supplementary Figure 360 S7B). Interestingly, these Ph. vulgaris genes originated from an ancestral tandem 361 duplication, as this organization is also found in other legumes (Supplementary Figure 362 S6). Although this OG lacked an Ar. thaliana member, the closest Ar. thaliana homologs 363 include AT5G27810, AT1G22590 (AGAMOUS-LIKE 87, AGL87), and AT5G48670 364 (AGAMOUS-LIKE80, AGL80). Importantly, AGL80 has been shown to be responsible for 365 central cell and endosperm development in Arabidopsis (Portereiko et al., 2006). 366 We also analyzed two ERF (ethylene response factor) OGs (ERF:10 and ERF:18) 367 containing genes playing critical roles in nodulation. Of these two, only ERF:10 was 368 among the 10 rapidly expanded OGs. Manual curation revealed that ERF:10 and ERF:18 369 comprise ERF required for nodule differentiation (EFD) and ERF required for nodulation 370 (ERN) genes, respectively. ERN and EFD genes regulate nodulation in Me. truncatula 371 (Vernie et al., 2008;Young et al., 2011;Cerri et al., 2012). Three of the four MtERNs (i.e. 372 Medtr7g085810.1, Medtr6g029180.1, and Medtr8g085960.1) had relatively higher 373 expression after inoculation than in roots or nodules, supporting their critical role in 374 nodule development (Supplementary Figure S9A). The biased expression towards 375 nodules and root tissues are also observed in Ph. vulgaris and Gl. max orthologs 376 (Supplementary Figure S8). Of the two MtEFDs, Medtr4g008860 and Medtr3g106290 377 were more expressed in nodules and inoculated root hairs, respectively. Similarly, Ph. 378 vulgaris and Gl. max ERNs are also more expressed in roots and nodules than in aerial 379 tissues (Supplementary Figure S8). 380

381 Association between quantitative traits and Glycine max specific TFs 382 The Glycine node had the largest number of expanded OGs, with 36% (563/1557) 383 (Figure 4) of them expanding by an average rate of 1.67 genes per OG. Out of these, 57 384 had rapidly expanded (p-value < 0.05) with an average rate of 2 genes per OG. In 385 comparison to the Glycine node, 76 and 82 OGs expanded in Gl. soja and Gl. max, 386 respectively. Of the OGs that expanded in Gl. max, 79% (65/82) showed significant 387 expansions (p-value < 0.05), with average rate of 1.44. Among these families, ERF (6 388 OGs), MYB (5 OGs), MYB_related (8 OGs), bHLH (7 OGs), and C2H2 (7 OGS) TFs gained 389 more than 5 genes per OG. Interestingly, 59% (202/341) of the genes from expanded 11 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

390 OGs lie within SD regions, supporting the importance of the Glycine WGD in shaping 391 these families in Gl. max. 392 We explored whether some of these OGs could be related with important 393 soybean agronomic traits. We searched the Gl. soja syntenic regions corresponding to 394 the 341 Gl. max TFs from the 65 rapidly expanded OGs. We identified 50 TFs without a 395 homeolog in Gl. soja (Supplementary Table S6), out of which only five had a syntenic 396 ortholog in Ph. vulgaris. Interestingly, two ERF (Glyma.20G115300, Glyma.14G161900) 397 and one SBP (Glyma.06G205700) TFs are within previously reported chromosomal 398 regions associated with important quantitative traits (Supplementary Figure S9) (Fang et 399 al., 2017). The ERF Glyma.20G115300 was located within a region associated with 400 overall leaf size and average number of seeds per pod. The second ERF, 401 Glyma.14G161900, is within a region associated with FA18 content and ratio in mature 402 seeds. Finally, the SBP TF Glyma.06G205700 is within a region regulating branch density 403 (i.e. ratio of branch number and plant height) and beginning bloom date. 404 (Supplementary Table S6). We envisage that many more of these 50 TFs will be 405 associated with important traits, which could be revealed in by a more comprehensive 406 work integrating QTL information from other genotypes and studies with our 407 phylogenomic results. 408

12 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

409 Methods

410 Genomic data 411 Genome sequences and annotations for 20 plant species were obtained from public 412 repositories (Table 1). We used coding sequences and predicted protein sequences from 413 the longest splicing isoform (when more than one are available). Predicted proteins with 414 less than 50 amino acids or containing premature stop codons or more than 20% 415 ambiguous amino acids were excluded.

416 Prediction and classifications of transcription factors (TFs) 417 To remove redundancy due to splicing isoforms and incomplete gene predictions, we 418 removed nearly identical sequences using BLASTCLUST (Altschul et al., 1997) as 419 previously described (parameters: -S 1.89 -L 0.9 -b F) (Gossani et al., 2014;Vidal et al., 420 2016). 421 We adopted the TF family classification scheme of plantTFDB (Zhang et al., 422 2011;Jin et al., 2017). We created a local database of protein domains by combining all 423 HMM profiles from -A (Release 31.0) (Finn et al., 2016) and 13 plant specific TF 424 HMM profiles downloaded from PlantTFDB (Supplementary Table S1). Protein 425 sequences were searched for conserved domains using HMMER 3.0 (http://hmmer.org) 426 (domain e-value cutoff < 0.01). TFs were classified in 58 families according to their DBD. 427

428 Species phylogeny 429 A species phylogeny was reconstructed using low copy orthologs present in all 20 430 species. We clustered the predicted proteins on the basis of the pairwise sequence 431 similarity of their longest protein products, which was computed with BLAST (e-value ≤ 432 1e-5) (Altschul et al., 1997). Sequence pairs with percentage identity of at least 35% and 433 query coverage of at least 50% were used for Markov clustering using mclblastline (v. 434 12-068; Inflation parameter: 1.5) (Enright et al., 2002). Clusters containing up to 22 435 genes with at least one gene from each species were used. If a species had paralogous 436 genes, the paralog with greater identity to orthologs from other species was used. 437 Amino acid sequence alignment was performed using DECIPHER (Wright, 2015) and 438 cDNA alignment performed with PAL2NAL (Suyama et al., 2006). We concatenated the 439 codon alignments of these genes to create a super-alignment. Next, phangorn (Schliep, 440 2011) was used to estimate the best substitution model for the phylogenetic 441 reconstruction, which was performed using RAxML (v8.2.11; model: GTRGAMMAIG4, 442 bootstrap: 1000) (Stamatakis, 2014). The phylogram and sequence alignment were used 443 in relTime-ML (implemented in MEGA-X, v.10.0.1) (Tamura et al., 2018) to generate an 13 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

444 ultrametric tree. We used the TimeTree database (Kumar et al., 2017) to retrieve the 445 divergence times of Fabaceae and Vi. vinifera (110 mya) and of Ph. vulgaris and Gl. max 446 (24 mya), which were used as references.

447 Synteny and synonymous mutation rate (Ks) 448 We identified segmental duplications using DAGCHAINER (r02062008) (Haas et al., 449 2004) by using bidirectional best BLASTP hits (e-value ≤ 1e-10, 35% minimum identity, 450 50% minimum query coverage). A minimum of four collinear genes were required to 451 identify a syntenic block (DAGCHAINER, parameter -A 4), as previously used in soybean 452 (Severin et al., 2011). Tandem duplicates were also identified using DAGCHAINER 453 (parameters -T -A 2). Tandem or segmental gene pairs had their non-synonymous (Ka) 454 and synonymous (Ks) mutation rates estimated using the bp_pairwise_kaks script, 455 distributed with BioPerl (v5.22.1) (Stajich et al., 2002).

456 Orthologous cluster and TF paralog identification 457 TF OGs were inspected for different modes of gene duplication. Multiple modes of 458 duplication can also co-occur in a group of genes. In these cases, only one type of 459 duplication is reported, following the order SD>TD>PD>rTE>dTE>DD (Proulx et al., 460 2011;Qiao et al., 2018). Although this strategy can slightly underestimate some 461 duplication levels, it helped us to assess the main forces shaping the expansion of TF 462 families.

463 Estimation of expansions and contractions in TF families 464 We used CAFE (v4.2) (Han et al., 2013) to assess the evolution of TF family sizes using 465 the time-calibrated species tree and TF OG compositions as inputs. We used the 466 cafeerror.py script, available in the CAFE package, to model error rates that might have 467 been introduced in gene family sizes, particularly by species with more fragmented 468 genome assemblies (e.g. Lu. angustifolius) (Han et al., 2013). This error model was used 469 adjust family sizes. We estimated λ and μ by running CAFE for 50 times and selected the 470 parameters that gave the best maximum likelihood estimate. These parameters were 471 used to estimate OG sizes at ancestor nodes and to predict rapidly evolving OGs (p-value 472 < 0.05), which are those that significantly gained or lost genes. The average net change 473 at each node on the species tree was expressed as:

+ (& () ) 474 The average change on node ! = ',- ' ' " +

475 where n is the total number of OGs, (Mi−Xi) is the difference in OG size between node M

476 and its parent node X for a given OG i. A negative or positive Am value stands for 14 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

477 contraction or expansion, respectively. Some remarkably expanded or contracted TF OG 478 had their phylogenies reconstructed with RAxML (v8.2.11; model: GTRGAMMAAUTO, 479 bootstrap: 1000) and visualized using Figtree (v.1.4.3) 480 (http://tree.bio.ed.ac.uk/software/figtree/).

481 Gene expression data and tissue specificity 482 Ar. thaliana and Ph. vulgaris normalized gene expression data were obtained from 483 public databases such as ArrayExpress (Liu et al., 2012) and PvGEA (O’Rourke et al., 484 2014), respectively. Four additional RNAseq datasets were downloaded from the NCBI 485 SRA database (https://www.ncbi.nlm.nih.gov/sra/). The first two datasets comprise two 486 soybean transcriptome studies (Bioproject PRJNA208048, PRJNA79597) (Libault et al., 487 2010;Severin et al., 2010). The third dataset includes Me. truncatula transcriptomes 488 (Boscari et al., 2013) in the following conditions and tissues: nitrogen-starving roots, 489 roots inoculated with Sinorhizobium meliloti, and root nodules (BioProject PRJNA79233). 490 We also downloaded an additional Me. truncatula RNAseq data covering 7 different 491 tissues (BioProject PRJNA80163). RNAseq reads were mapped on each species genome 492 using STAR v2.5.3a (Dobin et al., 2013) and normalized gene expression values were 493 estimated with StringTie v1.3.4d (Pertea et al., 2015), both with default parameters. 494 Expression values lower than 1 were converted to 0 and considered not

495 expressed. We added 1 to all values, which were then log2 transformed. To determine 496 tissue-preferential expression, we transformed the gene expression values in a 497 transformed z-score index (Kryuchkova-Mostacci and Robinson-Rechavi, 2016). 498 Depending on the highest expression in a given tissue, genes with transformed z-score 499 index > 0.9 were labeled as preferentially expressed. 500

501 Micro-syntenic regions between Gl. max, Gl. soja, and Ph. vulgaris 502 We used DAGCHAINER output files to identify the microsyntenic regions in Gl. max, Gl. 503 soja, and Ph. vulgaris. In particular, we queried the genes from OGs with significantly 504 larger (as predicted by CAFE) sizes in Gl. max in comparison to the Glycine node. For 505 each Gl. max gene, we considered only one collinear region from Gl. soja and Ph. 506 vulgaris. When more than one collinear region was detected, we selected that with the 507 highest DAGCHAINER alignment score. We visualized the microsynteny regions using 508 Genome Context Viewer available on Legume Information System (Cleary et al., 2017).

15 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

509 QTL intervals 510 We obtained the chromosomal coordinates of 150 QTLs significantly associated with 57 511 soybean traits (Fang et al., 2017). Chromosomal coordinates of soybean genes were 512 mapped to these QTL regions using bedtools v2.26.0 (Quinlan and Hall, 2010). 513 514 Acknowledgements 515 This work was supported by funding from Coordenação de Aperfeiçoamento de Pessoal 516 de Nível Superior (CAPES, Finance code 001), Conselho Nacional de Desenvolvimento 517 Científico e Tecnológico (CNPq), and Fundação Carlos Chagas Filho de Amparo à 518 Pesquisa do Estado do Rio de Janeiro (FAPERJ). 519 520 521 References

522 Aköz, G., and Nordborg, M. (2019). The Aquilegia genome reveals a hybrid origin of core eudicots. 523 bioRxiv.10.1101/407973 524 Albert, V.A., Barbazuk, W.B., Depamphilis, C.W., Der, J.P., Leebens-Mack, J., Ma, H., Palmer, J.D., 525 Rounsley, S., Sankoff, D., Schuster, S.C., Soltis, D.E., Soltis, P.S., Wessler, S.R., Wing, R.A., Albert, 526 V.A., Ammiraju, J.S.S., Barbazuk, W.B., Chamala, S., Chanderbali, A.S., Depamphilis, C.W., Der, J.P., 527 Determann, R., Leebens-Mack, J., Ma, H., Ralph, P., Rounsley, S., Schuster, S.C., Soltis, D.E., Soltis, 528 P.S., Talag, J., Tomsho, L., Walts, B., Wanke, S., Wing, R.A., Albert, V.A., Barbazuk, W.B., Chamala, 529 S., Chanderbali, A.S., Chang, T.-H., Determann, R., Lan, T., Soltis, D.E., Soltis, P.S., Arikit, S., Axtell, 530 M.J., Ayyampalayam, S., Barbazuk, W.B., Burnette, J.M., Chamala, S., De Paoli, E., Depamphilis, 531 C.W., Der, J.P., Estill, J.C., Farrell, N.P., Harkess, A., Jiao, Y., Leebens-Mack, J., Liu, K., Mei, W., 532 Meyers, B.C., Shahid, S., Wafula, E., Walts, B., Wessler, S.R., Zhai, J., Zhang, X., Albert, V.A., 533 Carretero-Paulet, L., Depamphilis, C.W., Der, J.P., Jiao, Y., Leebens-Mack, J., Lyons, E., Sankoff, D., 534 Tang, H., Wafula, E., Zheng, C., Albert, V.A., Altman, N.S., Barbazuk, W.B., Carretero-Paulet, L., 535 Depamphilis, C.W., Der, J.P., Estill, J.C., Jiao, Y., Leebens-Mack, J., Liu, K., Mei, W., Wafula, E., 536 Altman, N.S., Arikit, S., Axtell, M.J., Chamala, S., Chanderbali, A.S., Chen, F., Chen, J.-Q., Chiang, V., 537 De Paoli, E., Depamphilis, C.W., Der, J.P., et al. (2013). The Amborella Genome and the Evolution 538 of Flowering Plants. Science 342, 1241089-1241089.10.1126/science.1241089 539 Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997). 540 Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic 541 Acids Res 25, 3389-3402 542 Arabidopsis-Genome-Initiative (2000). Analysis of the genome sequence of the 543 . Nature 408, 796-815.10.1038/35048692 544 Azani, N., Babineau, M., Bailey, C.D., Banks, H., Barbosa, A., Pinto, R.B., Boatwright, J., Borges, L., 545 Brown, G., Bruneau, A., Candido, E., Cardoso, D., Chung, K.-F., Clark, R., Conceição, A.D., Crisp, M., 546 Cubas, P., Delgado-Salinas, A., Dexter, K., Doyle, J., Duminil, J., Egan, A., De La Estrella, M., Falcão, 547 M., Filatov, D., Fortuna-Perez, A.P., Fortunato, R., Gagnon, E., Gasson, P., Rando, J.G., Azevedo 548 Tozzi, A.M.G.D., Gunn, B., Harris, D., Haston, E., Hawkins, J., Herendeen, P., Hughes, C., Iganci, 549 J.V., Javadi, F., Kanu, S.A., Kazempour-Osaloo, S., Kite, G., Klitgaard, B., Kochanovski, F., Koenen, 550 E.M., Kovar, L., Lavin, M., Roux, M.L., Lewis, G., De Lima, H., López-Roberts, M.C., Mackinder, B., 551 Maia, V.H., Malécot, V., Mansano, V., Marazzi, B., Mattapha, S., Miller, J., Mitsuyuki, C., Moura, 16 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

552 T., Murphy, D., Nageswara-Rao, M., Nevado, B., Neves, D., Ojeda, D., Pennington, R.T., Prado, D., 553 Prenner, G., De Queiroz, L.P., Ramos, G., Ranzato Filardi, F., Ribeiro, P., Rico-Arce, M.D.L., 554 Sanderson, M., Santos-Silva, J., São-Mateus, W.B., Silva, M.S., Simon, M., Sinou, C., Snak, C., De 555 Souza, É., Sprent, J., Steele, K., Steier, J., Steeves, R., Stirton, C., Tagane, S., Torke, B., Toyama, H., 556 Cruz, D.T.D., Vatanparast, M., Wieringa, J., Wink, M., Wojciechowski, M., Yahara, T., Yi, T., and 557 Zimmerman, E. (2017). A new subfamily classification of the Leguminosae based on a 558 taxonomically comprehensive phylogeny – The Legume Phylogeny Working Group (LPWG). Taxon 559 66, 44-77.10.12705/661.3 560 Banks, J.A., Nishiyama, T., Hasebe, M., Bowman, J.L., Gribskov, M., Depamphilis, C., Albert, V.A., 561 Aono, N., Aoyama, T., Ambrose, B.A., Ashton, N.W., Axtell, M.J., Barker, E., Barker, M.S., 562 Bennetzen, J.L., Bonawitz, N.D., Chapple, C., Cheng, C., Correa, L.G., Dacre, M., Debarry, J., 563 Dreyer, I., Elias, M., Engstrom, E.M., Estelle, M., Feng, L., Finet, C., Floyd, S.K., Frommer, W.B., 564 Fujita, T., Gramzow, L., Gutensohn, M., Harholt, J., Hattori, M., Heyl, A., Hirai, T., Hiwatashi, Y., 565 Ishikawa, M., Iwata, M., Karol, K.G., Koehler, B., Kolukisaoglu, U., Kubo, M., Kurata, T., Lalonde, S., 566 Li, K., Li, Y., Litt, A., Lyons, E., Manning, G., Maruyama, T., Michael, T.P., Mikami, K., Miyazaki, S., 567 Morinaga, S., Murata, T., Mueller-Roeber, B., Nelson, D.R., Obara, M., Oguri, Y., Olmstead, R.G., 568 Onodera, N., Petersen, B.L., Pils, B., Prigge, M., Rensing, S.A., Riano-Pachon, D.M., Roberts, A.W., 569 Sato, Y., Scheller, H.V., Schulz, B., Schulz, C., Shakirov, E.V., Shibagaki, N., Shinohara, N., Shippen, 570 D.E., Sorensen, I., Sotooka, R., Sugimoto, N., Sugita, M., Sumikawa, N., Tanurdzic, M., Theissen, G., 571 Ulvskov, P., Wakazuki, S., Weng, J.K., Willats, W.W., Wipf, D., Wolf, P.G., Yang, L., Zimmer, A.D., 572 Zhu, Q., Mitros, T., Hellsten, U., Loque, D., Otillar, R., Salamov, A., Schmutz, J., Shapiro, H., 573 Lindquist, E., et al. (2011). The Selaginella genome identifies genetic changes associated with the 574 evolution of vascular plants. Science 332, 960-963.10.1126/science.1203810 575 Bellieny-Rabelo, D., Oliveira, A.E., and Venancio, T.M. (2013). Impact of whole-genome and tandem 576 duplications in the expansion and functional diversification of the F-box family in legumes 577 (Fabaceae). PLoS One 8, e55127.10.1371/journal.pone.0055127 578 Bertioli, D.J., Cannon, S.B., Froenicke, L., Huang, G., Farmer, A.D., Cannon, E.K., Liu, X., Gao, D., 579 Clevenger, J., Dash, S., Ren, L., Moretzsohn, M.C., Shirasawa, K., Huang, W., Vidigal, B., Abernathy, 580 B., Chu, Y., Niederhuth, C.E., Umale, P., Araujo, A.C., Kozik, A., Kim, K.D., Burow, M.D., Varshney, 581 R.K., Wang, X., Zhang, X., Barkley, N., Guimaraes, P.M., Isobe, S., Guo, B., Liao, B., Stalker, H.T., 582 Schmitz, R.J., Scheffler, B.E., Leal-Bertioli, S.C., Xun, X., Jackson, S.A., Michelmore, R., and Ozias- 583 Akins, P. (2016). The genome sequences of Arachis duranensis and Arachis ipaensis, the diploid 584 ancestors of cultivated peanut. Nat Genet 48, 438-446.10.1038/ng.3517 585 Bie, T.D., Cristianini, N., Demuth, J.P., and Hahn, M.W. (2006). CAFE : a computational tool for the 586 study of gene family evolution. 22, 1269-1271.10.1093/bioinformatics/btl097 587 Birchler, J.A., and Veitia, R.A. (2007). The Gene Balance Hypothesis: From Classical Genetics to 588 Modern Genomics. THE PLANT CELL ONLINE 19, 395-402.10.1105/tpc.106.049338 589 Birchler, J.A., and Veitia, R.A. (2011). Protein–protein and protein–DNA dosage balance and 590 differential paralog transcription factor retention in polyploids. Frontiers in Plant Genetics and 591 Genomics 2, 64.10.3389/fpls.2011.00064 592 Blanc, G., Barakat, A., Guyot, R., Cooke, R., and Delseny, M. (2000). Extensive duplication and 593 reshuffling in the Arabidopsis genome. Plant Cell 12, 1093-1101 594 Blanc, G., and Wolfe, K.H. (2004). Functional divergence of duplicated genes formed by polyploidy 595 during Arabidopsis evolution. The Plant cell 16, 1679-1691.10.1105/tpc.021410 596 Boscari, A., Del Giudice, J., Ferrarini, A., Venturini, L., Zaffini, A.L., Delledonne, M., and Puppo, A. 597 (2013). Expression dynamics of the Medicago truncatula transcriptome during the symbiotic 598 interaction with Sinorhizobium meliloti: which role for nitric oxide? Plant Physiol 161, 425- 599 439.10.1104/pp.112.208538

17 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

600 Cannon, S.B. (2013). The model legume genomes. Methods Mol Biol 1069, 1-14.10.1007/978-1- 601 62703-613-9_1 602 Cannon, S.B., Ilut, D., Farmer, A.D., Maki, S.L., May, G.D., Singer, S.R., and Doyle, J.J. (2010). 603 Polyploidy did not predate the evolution of nodulation in all legumes. PLoS One 5, 604 e11630.10.1371/journal.pone.0011630 605 Cannon, S.B., Mitra, A., Baumgarten, A., Young, N.D., and May, G. (2004). The roles of segmental and 606 tandem gene duplication in the evolution of large gene families in Arabidopsis thaliana. BMC 607 Plant Biol 4, 10.10.1186/1471-2229-4-10 608 Cannon, S.B., Sterck, L., Rombauts, S., Sato, S., Cheung, F., Gouzy, J., Wang, X., Mudge, J., Vasdewani, 609 J., Schiex, T., Spannagl, M., Monaghan, E., Nicholson, C., Humphray, S.J., Schoof, H., Mayer, K.F., 610 Rogers, J., Quetier, F., Oldroyd, G.E., Debelle, F., Cook, D.R., Retzel, E.F., Roe, B.A., Town, C.D., 611 Tabata, S., Van De Peer, Y., and Young, N.D. (2006). Legume genome evolution viewed through 612 the Medicago truncatula and Lotus japonicus genomes. Proc Natl Acad Sci U S A 103, 14959- 613 14964.10.1073/pnas.0603228103 614 Cardoso, D., De Queiroz, L.P., Pennington, R.T., De Lima, H.C., Fonty, E., Wojciechowski, M.F., and 615 Lavin, M. (2012). Revisiting the phylogeny of papilionoid legumes: New insights from 616 comprehensively sampled early-branching lineages. American Journal of Botany 99, 1991- 617 2013.10.3732/ajb.1200380 618 Cerri, M.R., Frances, L., Laloum, T., Auriac, M.C., Niebel, A., Oldroyd, G.E., Barker, D.G., Fournier, J., 619 and De Carvalho-Niebel, F. (2012). Medicago truncatula ERN transcription factors: regulatory 620 interplay with NSP1/NSP2 GRAS factors and expression dynamics throughout rhizobial infection. 621 Plant Physiol 160, 2155-2172.10.1104/pp.112.203190 622 Chen, X., Li, H., Pandey, M.K., Yang, Q., Wang, X., Garg, V., Chi, X., Doddamani, D., Hong, Y., 623 Upadhyaya, H., Guo, H., Khan, A.W., Zhu, F., Zhang, X., Pan, L., Pierce, G.J., Zhou, G., 624 Krishnamohan, K.A., Chen, M., Zhong, N., Agarwal, G., Li, S., Chitikineni, A., Zhang, G.Q., Sharma, 625 S., Chen, N., Liu, H., Janila, P., Wang, M., Wang, T., Sun, J., Li, X., Li, C., Yu, L., Wen, S., Singh, S., 626 Yang, Z., Zhao, J., Zhang, C., Yu, Y., Bi, J., Liu, Z.J., Paterson, A.H., Wang, S., Liang, X., Varshney, 627 R.K., and Yu, S. (2016). Draft genome of the peanut A-genome progenitor (Arachis duranensis) 628 provides insights into geocarpy, oil biosynthesis, and allergens. Proc Natl Acad Sci U S A 113, 629 6785-6790.10.1073/pnas.1600899113 630 Cheng, F., Wu, J., Cai, X., Liang, J., Freeling, M., and Wang, X. (2018). Gene retention, fractionation 631 and subgenome differences in polyploid plants. Nat Plants 4, 258-268.10.1038/s41477-018-0136- 632 7 633 Cleary, A., Farmer, A., and Hancock, J. (2017). Genome Context Viewer: visual exploration of multiple 634 annotated genomes using microsynteny. Bioinformatics 34, 1562- 635 1564.10.1093/bioinformatics/btx757 636 Davis, J.C., and Petrov, D.A. (2004). Preferential duplication of conserved proteins in eukaryotic 637 genomes. PLoS Biol 2, E55.10.1371/journal.pbio.0020055 638 Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., and 639 Gingeras, T.R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15- 640 21.10.1093/bioinformatics/bts635 641 Doebley, J., and Lukens, L. (1998). Transcriptional regulators and the evolution of plant form. The 642 Plant cell 10, 1075-1082.10.1105/tpc.10.7.1075 643 Dong, Y., Yang, X., Liu, J., Wang, B.H., Liu, B.L., and Wang, Y.Z. (2014). Pod shattering resistance 644 associated with domestication is mediated by a NAC gene in soybean. Nat Commun 5, 645 3352.10.1038/ncomms4352 646 Du, H., Yang, S.-S., Liang, Z., Feng, B.-R., Liu, L., Huang, Y.-B., and Tang, Y.-X. (2012). Genome-wide 647 analysis of the MYB transcription factor superfamily in soybean. BMC Plant Biol 12, 648 106.10.1186/1471-2229-12-106

18 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

649 Enright, A.J., Van Dongen, S., and Ouzounis, C.A. (2002). An efficient algorithm for large-scale 650 detection of protein families. Nucleic Acids Res 30, 1575-1584 651 Fang, C., Ma, Y., Wu, S., Liu, Z., Wang, Z., Yang, R., Hu, G., Zhou, Z., Yu, H., Zhang, M., Pan, Y., Zhou, 652 G., Ren, H., Du, W., Yan, H., Wang, Y., Han, D., Shen, Y., Liu, S., Liu, T., Zhang, J., Qin, H., Yuan, J., 653 Yuan, X., Kong, F., Liu, B., Li, J., Zhang, Z., Wang, G., Zhu, B., and Tian, Z. (2017). Genome-wide 654 association studies dissect the genetic networks underlying agronomical traits in soybean. 655 Genome Biology 18.10.1186/s13059-017-1289-9 656 Feil, R., Yoshida, T., and Kawabe, A. (2013). Importance of Gene Duplication in the Evolution of 657 Genomic Imprinting Revealed by Molecular Evolutionary Analysis of the Type I MADS-Box Gene 658 Family in Arabidopsis Species. PLoS One 8, e73588.10.1371/journal.pone.0073588 659 Filiault, D.L., Ballerini, E.S., Mandakova, T., Akoz, G., Derieg, N.J., Schmutz, J., Jenkins, J., Grimwood, 660 J., Shu, S., Hayes, R.D., Hellsten, U., Barry, K., Yan, J., Mihaltcheva, S., Karafiatova, M., Nizhynska, 661 V., Kramer, E.M., Lysak, M.A., Hodges, S.A., and Nordborg, M. (2018). The Aquilegia genome 662 provides insight into adaptive radiation and reveals an extraordinarily polymorphic chromosome 663 with a unique history. Elife 7.10.7554/eLife.36426 664 Finn, R.D., Coggill, P., Eberhardt, R.Y., Eddy, S.R., Mistry, J., Mitchell, A.L., Potter, S.C., Punta, M., 665 Qureshi, M., Sangrador-Vegas, A., Salazar, G.A., Tate, J., and Bateman, A. (2016). The Pfam 666 protein families database: towards a more sustainable future. Nucleic Acids Res 44, D279- 667 285.10.1093/nar/gkv1344 668 Freeling, M. (2009). Bias in plant gene content following different sorts of duplication: tandem, 669 whole-genome, segmental, or by transposition. Annu Rev Plant Biol 60, 433- 670 453.10.1146/annurev.arplant.043008.092122 671 Freeling, M., Scanlon, M.J., and Fowler, J.E. (2015). Fractionation and subfunctionalization following 672 genome duplications: mechanisms that drive gene content and their consequences. Curr Opin 673 Genet Dev 35, 110-118.10.1016/j.gde.2015.11.002 674 Gossani, C., Bellieny-Rabelo, D., and Venancio, T.M. (2014). Evolutionary analysis of multidrug 675 resistance genes in fungi - impact of gene duplication and family conservation. FEBS J 281, 4967- 676 4977.10.1111/febs.13046 677 Graham, P.H., and Vance, C.P. (2003). Legumes: importance and constraints to greater use. PLANT 678 PHYSIOLOGY 131, 872-877.10.1104/pp.017004 679 Griesmann, M., Chang, Y., Liu, X., Song, Y., Haberer, G., Crook, M.B., Billault-Penneteau, B., 680 Lauressergues, D., Keller, J., Imanishi, L., Roswanjaya, Y.P., Kohlen, W., Pujic, P., Battenberg, K., 681 Alloisio, N., Liang, Y., Hilhorst, H., Salgado, M.G., Hocher, V., Gherbi, H., Svistoonoff, S., Doyle, J.J., 682 He, S., Xu, Y., Xu, S., Qu, J., Gao, Q., Fang, X., Fu, Y., Normand, P., Berry, A.M., Wall, L.G., Ane, J.M., 683 Pawlowski, K., Xu, X., Yang, H., Spannagl, M., Mayer, K.F.X., Wong, G.K., Parniske, M., Delaux, 684 P.M., and Cheng, S. (2018). Phylogenomics reveals multiple losses of nitrogen-fixing root nodule 685 symbiosis. Science 361.10.1126/science.aat1743 686 Gupta, S., Nawaz, K., Parween, S., Roy, R., Sahu, K., Kumar Pole, A., Khandal, H., Srivastava, R., Kumar 687 Parida, S., and Chattopadhyay, D. (2017). Draft genome sequence of Cicer reticulatum L., the wild 688 progenitor of chickpea provides a resource for agronomic trait improvement. DNA Res 24, 1- 689 10.10.1093/dnares/dsw042 690 Haas, B.J., Delcher, A.L., Wortman, J.R., and Salzberg, S.L. (2004). DAGchainer: A tool for mining 691 segmental genome duplications and synteny. Bioinformatics 20, 3643- 692 3646.10.1093/bioinformatics/bth397 693 Han, M.V., Thomas, G.W., Lugo-Martinez, J., and Hahn, M.W. (2013). Estimating gene gain and loss 694 rates in the presence of error in genome assembly and annotation using CAFE 3. Mol Biol Evol 30, 695 1987-1997.10.1093/molbev/mst100

19 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

696 Han, Y., Zhao, X., Liu, D., Li, Y., Lightfoot, D.A., Yang, Z., Zhao, L., Zhou, G., Wang, Z., Huang, L., Zhang, 697 Z., Qiu, L., Zheng, H., and Li, W. (2016). Domestication footprints anchor genomic regions of 698 agronomic importance in soybeans. New Phytol 209, 871-884.10.1111/nph.13626 699 Hane, J.K., Ming, Y., Kamphuis, L.G., Nelson, M.N., Garg, G., Atkins, C.A., Bayer, P.E., Bravo, A., 700 Bringans, S., Cannon, S., Edwards, D., Foley, R., Gao, L.L., Harrison, M.J., Huang, W., Hurgobin, B., 701 Li, S., Liu, C.W., Mcgrath, A., Morahan, G., Murray, J., Weller, J., Jian, J., and Singh, K.B. (2017). A 702 comprehensive draft genome sequence for lupin (Lupinus angustifolius), an emerging health 703 food: insights into plant-microbe interactions and legume evolution. Plant Biotechnol J 15, 318- 704 330.10.1111/pbi.12615 705 He, M., He, C.-Q., and Ding, N.-Z. (2018). Abiotic Stresses: General Defenses of Land Plants and 706 Chances for Engineering Multistress Tolerance. Frontiers in Plant Science 707 9.10.3389/fpls.2018.01771 708 Jaillon, O., Aury, J.-M., Noel, B., Policriti, A., Clepet, C., Casagrande, A., Choisne, N., Aubourg, S., 709 Vitulo, N., Jubin, C., Vezzi, A., Legeai, F., Hugueney, P., Dasilva, C., Horner, D., Mica, E., Jublot, D., 710 Poulain, J., Bruyère, C., Billault, A., Segurens, B., Gouyvenoux, M., Ugarte, E., Cattonaro, F., 711 Anthouard, V., Vico, V., Del Fabbro, C., Alaux, M., Di Gaspero, G., Dumas, V., Felice, N., Paillard, S., 712 Juman, I., Moroldo, M., Scalabrin, S., Canaguier, A., Le Clainche, I., Malacrida, G., Durand, E., 713 Pesole, G., Laucou, V., Chatelet, P., Merdinoglu, D., Delledonne, M., Pezzotti, M., Lecharny, A., 714 Scarpelli, C., Artiguenave, F., Pè, M.E., Valle, G., Morgante, M., Caboche, M., Adam-Blondon, A.-F., 715 Weissenbach, J., Quétier, F., Wincker, P., and Characterization, F.-I.P.C.F.G.G. (2007). The 716 grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. 717 Nature 449, 463-467.10.1038/nature06148 718 Jain, M., Misra, G., Patel, R.K., Priya, P., Jhanwar, S., Khan, A.W., Shah, N., Singh, V.K., Garg, R., Jeena, 719 G., Yadav, M., Kant, C., Sharma, P., Yadav, G., Bhatia, S., Tyagi, A.K., and Chattopadhyay, D. 720 (2013). A draft genome sequence of the pulse crop chickpea (Cicer arietinum L.). Plant J 74, 715- 721 729.10.1111/tpj.12173 722 Jin, J., Tian, F., Yang, D.C., Meng, Y.Q., Kong, L., Luo, J., and Gao, G. (2017). PlantTFDB 4.0: toward a 723 central hub for transcription factors and regulatory interactions in plants. Nucleic Acids Res 45, 724 D1040-D1045.10.1093/nar/gkw982 725 Jordan, I.K., Wolf, Y.I., and Koonin, E.V. (2004). Duplicated genes evolve slower than singletons 726 despite the initial rate increase. BMC Evol Biol 4, 22.10.1186/1471-2148-4-22 727 Kang, Y.J., Kim, S.K., Kim, M.Y., Lestari, P., Kim, K.H., Ha, B.K., Jun, T.H., Hwang, W.J., Lee, T., Lee, J., 728 Shim, S., Yoon, M.Y., Jang, Y.E., Han, K.S., Taeprayoon, P., Yoon, N., Somta, P., Tanya, P., Kim, K.S., 729 Gwag, J.G., Moon, J.K., Lee, Y.H., Park, B.S., Bombarely, A., Doyle, J.J., Jackson, S.A., Schafleitner, 730 R., Srinives, P., Varshney, R.K., and Lee, S.H. (2014). Genome sequence of mungbean and insights 731 into evolution within Vigna species. Nat Commun 5, 5443.10.1038/ncomms6443 732 Kroc, M., Koczyk, G., Swiecicki, W., Kilian, A., and Nelson, M.N. (2014). New evidence of ancestral 733 polyploidy in the Genistoid legume Lupinus angustifolius L. (narrow-leafed lupin). Theor Appl 734 Genet 127, 1237-1249.10.1007/s00122-014-2294-y 735 Kryuchkova-Mostacci, N., and Robinson-Rechavi, M. (2016). A benchmark of gene expression tissue- 736 specificity metrics. Brief Bioinform, bbw008.10.1093/bib/bbw008 737 Kumar, S., Stecher, G., Suleski, M., and Hedges, S.B. (2017). TimeTree: A Resource for Timelines, 738 Timetrees, and Divergence Times. Mol Biol Evol 34, 1812-1819.10.1093/molbev/msx116 739 Kumpeangkeaw, A., Tan, D., Fu, L., Han, B., Sun, X., Hu, X., Ding, Z., and Zhang, J. (2019). Asymmetric 740 birth and death of type I and type II MADS-box gene subfamilies in the rubber tree facilitating 741 laticifer development. PLoS One 14, e0214335.10.1371/journal.pone.0214335 742 Lata, C., Yadav, A., and Pras, M. (2011). "Role of Plant Transcription Factors in Abiotic Stress 743 Tolerance," in Abiotic Stress Response in Plants - Physiological, Biochemical and Genetic 744 Perspectives, eds. A. Shanker & B. Venkateswarlu. InTech), 269-296.

20 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

745 Lehti-Shiu, M.D., Panchy, N., Wang, P., Uygun, S., and Shiu, S.H. (2017). Diversity, expansion, and 746 evolutionary novelty of plant DNA-binding transcription factor families. Biochim Biophys Acta 747 Gene Regul Mech 1860, 3-20.10.1016/j.bbagrm.2016.08.005 748 Lespinet, O., Wolf, Y.I., Koonin, E.V., and Aravind, L. (2002). The role of lineage-specific gene family 749 expansion in the evolution of eukaryotes. Genome Res 12, 1048-1059.10.1101/gr.174302 750 Lewis, G.P. (2005). Legumes of the World. Royal Botanic Gardens, Kew. 751 Li, Z., Defoort, J., Tasdighian, S., Maere, S., Van De Peer, Y., and De Smet, R. (2016). Gene 752 Duplicability of Core Genes Is Highly Consistent across All Angiosperms. Plant Cell 28, 326- 753 344.10.1105/tpc.15.00877 754 Libault, M., Farmer, A., Joshi, T., Takahashi, K., Langley, R.J., Franklin, L.D., He, J., Xu, D., May, G., and 755 Stacey, G. (2010). An integrated transcriptome atlas of the crop model Glycine max, and its use in 756 comparative analyses in plants. Plant J 63, 86-99.10.1111/j.1365-313X.2010.04222.x 757 Liu, J., Jung, C., Xu, J., Wang, H., Deng, S., Bernad, L., Arenas-Huertero, C., and Chua, N.H. (2012). 758 Genome-wide analysis uncovers regulation of long intergenic noncoding RNAs in Arabidopsis. 759 Plant Cell 24, 4333-4345.10.1105/tpc.112.102855 760 Lu, Q., Li, H., Hong, Y., Zhang, G., Wen, S., Li, X., Zhou, G., Li, S., Liu, H., Liu, H., Liu, Z., Varshney, R.K., 761 Chen, X., and Liang, X. (2018). Genome Sequencing and Analysis of the Peanut B-Genome 762 Progenitor (Arachis ipaensis). Frontiers in Plant Science 9.10.3389/fpls.2018.00604 763 Lynch, M., and Conery, J.S. (2000). The evolutionary fate and consequences of duplicate genes. 764 Science 290, 1151-1155 765 Mochida, K., Sakurai, T., Seki, H., Yoshida, T., Takahagi, K., Sawai, S., Uchiyama, H., Muranaka, T., and 766 Saito, K. (2017). Draft genome assembly and annotation of Glycyrrhiza uralensis, a medicinal 767 legume. Plant J 89, 181-194.10.1111/tpj.13385 768 Nagata, T., Hosaka-Sasaki, A., and Kikuchi, S. (2016). The Evolutionary Diversification of Genes that 769 Encode Transcription Factor Proteins in Plants. 73-97.10.1016/b978-0-12-800854-6.00005-1 770 Nakano, T. (2006). Genome-Wide Analysis of the ERF Gene Family in Arabidopsis and Rice. PLANT 771 PHYSIOLOGY 140, 411-432.10.1104/pp.105.073783 772 Nam, J., Kim, J., Lee, S., An, G., Ma, H., and Nei, M. (2004). Type I MADS-box genes have experienced 773 faster birth-and-death evolution than type II MADS-box genes in angiosperms. Proceedings of the 774 National Academy of Sciences 101, 1910-1915.10.1073/pnas.0308430100 775 O’rourke, J.A., Iniguez, L.P., Fu, F., Bucciarelli, B., Miller, S.S., Jackson, S.A., Mcclean, P.E., Li, J., Dai, X., 776 Zhao, P.X., Hernandez, G., and Vance, C.P. (2014). An RNA-Seq based gene expression atlas of the 777 common bean. BMC genomics 15, 866.10.1186/1471-2164-15-866 778 Panchy, N., Lehti-Shiu, M., and Shiu, S.H. (2016). Evolution of Gene Duplication in Plants. Plant 779 Physiol 171, 2294-2316.10.1104/pp.16.00523 780 Parween, S., Nawaz, K., Roy, R., Pole, A.K., Venkata Suresh, B., Misra, G., Jain, M., Yadav, G., Parida, 781 S.K., Tyagi, A.K., Bhatia, S., and Chattopadhyay, D. (2015). An advanced draft genome assembly of 782 a desi type chickpea (Cicer arietinum L.). Sci Rep 5, 12806.10.1038/srep12806 783 Pertea, M., Pertea, G.M., Antonescu, C.M., Chang, T.C., Mendell, J.T., and Salzberg, S.L. (2015). 784 StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat 785 Biotechnol 33, 290-295.10.1038/nbt.3122 786 Pickett, F.B., and Meeks-Wagner, D.R. (1995). Seeing double: appreciating genetic redundancy. Plant 787 Cell 7, 1347-1356.10.1105/tpc.7.9.1347 788 Pires, N., and Dolan, L. (2010). Early evolution of bHLH proteins in plants. Plant Signal Behav 5, 911- 789 912.10.4161/psb.5.7.12100 790 Portereiko, M.F., Lloyd, A., Steffen, J.G., Punwani, J.A., Otsuga, D., and Drews, G.N. (2006). AGL80 is 791 required for central cell and endosperm development in Arabidopsis. Plant Cell 18, 1862- 792 1872.10.1105/tpc.106.040824

21 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

793 Proulx, S.R., Wang, Y., Wang, X., Tang, H., Tan, X., Ficklin, S.P., Feltus, F.A., and Paterson, A.H. (2011). 794 Modes of Gene Duplication Contribute Differently to Genetic Novelty and Redundancy, but Show 795 Parallels across Divergent Angiosperms. PLoS One 6, e28150.10.1371/journal.pone.0028150 796 Qiao, X., Yin, H., Li, L., Wang, R., Wu, J., and Zhang, S. (2018). Different Modes of Gene Duplication 797 Show Divergent Evolutionary Patterns and Contribute Differently to the Expansion of Gene 798 Families Involved in Important Fruit Traits in Pear (Pyrus bretschneideri). Front Plant Sci 9, 799 161.10.3389/fpls.2018.00161 800 Quinlan, A.R., and Hall, I.M. (2010). BEDTools: a flexible suite of utilities for comparing genomic 801 features. Bioinformatics 26, 841-842.10.1093/bioinformatics/btq033 802 Salman-Minkov, A., Sabath, N., and Mayrose, I. (2016). Whole-genome duplication as a key factor in 803 crop domestication. Nat Plants 2.10.1038/nplants.2016.115 804 Sato, S., Nakamura, Y., Kaneko, T., Asamizu, E., Kato, T., Nakao, M., Sasamoto, S., Watanabe, A., Ono, 805 A., Kawashima, K., Fujishiro, T., Katoh, M., Kohara, M., Kishida, Y., Minami, C., Nakayama, S., 806 Nakazaki, N., Shimizu, Y., Shinpo, S., Takahashi, C., Wada, T., Yamada, M., Ohmido, N., Hayashi, 807 M., Fukui, K., Baba, T., Nakamichi, T., Mori, H., and Tabata, S. (2008). Genome structure of the 808 legume, Lotus japonicus. DNA Res 15, 227-239.10.1093/dnares/dsn008 809 Schliep, K.P. (2011). phangorn: phylogenetic analysis in R. Bioinformatics 27, 592- 810 593.10.1093/bioinformatics/btq706 811 Schmutz, J., Cannon, S.B., Schlueter, J., Ma, J., Mitros, T., Nelson, W., Hyten, D.L., Song, Q., Thelen, 812 J.J., Cheng, J., Xu, D., Hellsten, U., May, G.D., Yu, Y., Sakurai, T., Umezawa, T., Bhattacharyya, M.K., 813 Sandhu, D., Valliyodan, B., Lindquist, E., Peto, M., Grant, D., Shu, S., Goodstein, D., Barry, K., 814 Futrell-Griggs, M., Abernathy, B., Du, J., Tian, Z., Zhu, L., Gill, N., Joshi, T., Libault, M., Sethuraman, 815 A., Zhang, X.C., Shinozaki, K., Nguyen, H.T., Wing, R.A., Cregan, P., Specht, J., Grimwood, J., 816 Rokhsar, D., Stacey, G., Shoemaker, R.C., and Jackson, S.A. (2010). Genome sequence of the 817 palaeopolyploid soybean. Nature 463, 178-183.10.1038/nature08670 818 Schmutz, J., Mcclean, P.E., Mamidi, S., Wu, G.A., Cannon, S.B., Grimwood, J., Jenkins, J., Shu, S., Song, 819 Q., Chavarro, C., Torres-Torres, M., Geffroy, V., Moghaddam, S.M., Gao, D., Abernathy, B., Barry, 820 K., Blair, M., Brick, M.A., Chovatia, M., Gepts, P., Goodstein, D.M., Gonzales, M., Hellsten, U., 821 Hyten, D.L., Jia, G., Kelly, J.D., Kudrna, D., Lee, R., Richard, M.M., Miklas, P.N., Osorno, J.M., 822 Rodrigues, J., Thareau, V., Urrea, C.A., Wang, M., Yu, Y., Zhang, M., Wing, R.A., Cregan, P.B., 823 Rokhsar, D.S., and Jackson, S.A. (2014). A reference genome for common bean and genome-wide 824 analysis of dual domestications. Nat Genet 46, 707-713.10.1038/ng.3008 825 Sedivy, E.J., Wu, F., and Hanzawa, Y. (2017). Soybean domestication: the origin, genetic architecture 826 and molecular bases. New Phytologist 214, 539-553.10.1111/nph.14418 827 Severin, A.J., Cannon, S.B., Graham, M.M., Grant, D., and Shoemaker, R.C. (2011). Changes in twelve 828 homoeologous genomic regions in soybean following three rounds of polyploidy. Plant Cell 23, 829 3129-3136.10.1105/tpc.111.089573 830 Severin, A.J., Woody, J.L., Bolon, Y.T., Joseph, B., Diers, B.W., Farmer, A.D., Muehlbauer, G.J., Nelson, 831 R.T., Grant, D., Specht, J.E., Graham, M.A., Cannon, S.B., May, G.D., Vance, C.P., and Shoemaker, 832 R.C. (2010). RNA-Seq Atlas of Glycine max: a guide to the soybean transcriptome. BMC Plant Biol 833 10, 160.10.1186/1471-2229-10-160 834 Shiu, S.-H. (2005). Transcription Factor Families Have Much Higher Expansion Rates in Plants than in 835 Animals. PLANT PHYSIOLOGY 139, 18-26.10.1104/pp.105.065110 836 Soltis, P.S., Marchant, D.B., Van De Peer, Y., and Soltis, D.E. (2015). Polyploidy and genome evolution 837 in plants. Curr Opin Genet Dev 35, 119-125.10.1016/j.gde.2015.11.003 838 Stajich, J.E., Block, D., Boulez, K., Brenner, S.E., Chervitz, S.A., Dagdigian, C., Fuellen, G., Gilbert, J.G., 839 Korf, I., Lapp, H., Lehvaslaiho, H., Matsalla, C., Mungall, C.J., Osborne, B.I., Pocock, M.R., 840 Schattner, P., Senger, M., Stein, L.D., Stupka, E., Wilkinson, M.D., and Birney, E. (2002). The 841 Bioperl toolkit: Perl modules for the life sciences. Genome Res 12, 1611-1618.10.1101/gr.361602

22 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

842 Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large 843 phylogenies. Bioinformatics 30, 1312-1313.10.1093/bioinformatics/btu033 844 Suyama, M., Torrents, D., and Bork, P. (2006). PAL2NAL: robust conversion of protein sequence 845 alignments into the corresponding codon alignments. Nucleic Acids Res 34, W609- 846 612.10.1093/nar/gkl315 847 Tamura, K., Tao, Q., Kumar, S., and Russo, C. (2018). Theoretical Foundation of the RelTime Method 848 for Estimating Divergence Times from Variable Evolutionary Rates. Mol Biol Evol 35, 1770- 849 1782.10.1093/molbev/msy044 850 Tang, H., Krishnakumar, V., Bidwell, S., Rosen, B., Chan, A., Zhou, S., Gentzbittel, L., Childs, K.L., 851 Yandell, M., Gundlach, H., Mayer, K.F., Schwartz, D.C., and Town, C.D. (2014). An improved 852 genome release (version Mt4.0) for the model legume Medicago truncatula. BMC genomics 15, 853 312.10.1186/1471-2164-15-312 854 Tripathi, P., Galla, A., Rabara, R.C., and Rushton, P.J. (2016). "Transcription Factors that Regulate 855 Defence Responses and Their Use in Increasing Disease Resistance," in Plant Pathogen Resistance 856 Biotechnology, ed. D.B. Collinge.), 109-129. 857 Varshney, R.K., Chen, W., Li, Y., Bharti, A.K., Saxena, R.K., Schlueter, J.A., Donoghue, M.T., Azam, S., 858 Fan, G., Whaley, A.M., Farmer, A.D., Sheridan, J., Iwata, A., Tuteja, R., Penmetsa, R.V., Wu, W., 859 Upadhyaya, H.D., Yang, S.P., Shah, T., Saxena, K.B., Michael, T., Mccombie, W.R., Yang, B., Zhang, 860 G., Yang, H., Wang, J., Spillane, C., Cook, D.R., May, G.D., Xu, X., and Jackson, S.A. (2011). Draft 861 genome sequence of pigeonpea (Cajanus cajan), an orphan legume crop of resource-poor 862 farmers. Nat Biotechnol 30, 83-89.10.1038/nbt.2022 863 Varshney, R.K., Song, C., Saxena, R.K., Azam, S., Yu, S., Sharpe, A.G., Cannon, S., Baek, J., Rosen, B.D., 864 Tar'an, B., Millan, T., Zhang, X., Ramsay, L.D., Iwata, A., Wang, Y., Nelson, W., Farmer, A.D., Gaur, 865 P.M., Soderlund, C., Penmetsa, R.V., Xu, C., Bharti, A.K., He, W., Winter, P., Zhao, S., Hane, J.K., 866 Carrasquilla-Garcia, N., Condie, J.A., Upadhyaya, H.D., Luo, M.C., Thudi, M., Gowda, C.L., Singh, 867 N.P., Lichtenzveig, J., Gali, K.K., Rubio, J., Nadarajan, N., Dolezel, J., Bansal, K.C., Xu, X., Edwards, 868 D., Zhang, G., Kahl, G., Gil, J., Singh, K.B., Datta, S.K., Jackson, S.A., Wang, J., and Cook, D.R. 869 (2013). Draft genome sequence of chickpea (Cicer arietinum) provides a resource for trait 870 improvement. Nat Biotechnol 31, 240-246.10.1038/nbt.2491 871 Vernie, T., Moreau, S., De Billy, F., Plet, J., Combier, J.P., Rogers, C., Oldroyd, G., Frugier, F., Niebel, A., 872 and Gamas, P. (2008). EFD Is an ERF transcription factor involved in the control of nodule number 873 and differentiation in Medicago truncatula. Plant Cell 20, 2696-2713.10.1105/tpc.108.059857 874 Vidal, N.M., Grazziotin, A.L., Iyer, L.M., Aravind, L., and Venancio, T.M. (2016). Transcription factors, 875 chromatin proteins and the diversification of Hemiptera. Insect Biochem Mol Biol 69, 1- 876 13.10.1016/j.ibmb.2015.07.001 877 Vision, T.J., Brown, D.G., and Tanksley, S.D. (2000). The origins of genomic duplications in 878 Arabidopsis. Science 290, 2114-2117 879 Wang, J., Sun, P., Li, Y., Liu, Y., Yu, J., Ma, X., Sun, S., Yang, N., Xia, R., Lei, T., Liu, X., Jiao, B., Xing, Y., 880 Ge, W., Wang, L., Wang, Z., Song, X., Yuan, M., Guo, D., Zhang, L., Zhang, J., Jin, D., Chen, W., Pan, 881 Y., Liu, T., Jin, L., Sun, J., Cheng, R., Duan, X., Shen, S., Qin, J., Zhang, M.C., Paterson, A.H., and 882 Wang, X. (2017). Hierarchically Aligning 10 Legume Genomes Establishes a Family-Level Genomics 883 Platform. Plant Physiol 174, 284-300.10.1104/pp.16.01981 884 Wright, E.S. (2015). DECIPHER: harnessing local sequence context to improve protein multiple 885 sequence alignment. BMC Bioinformatics 16, 322.10.1186/s12859-015-0749-z 886 Xie, M., Chung, C.Y.-L., Li, M.-W., Wong, F.-L., Wang, X., Liu, A., Wang, Z., Leung, A.K.-Y., Wong, T.-H., 887 Tong, S.-W., Xiao, Z., Fan, K., Ng, M.-S., Qi, X., Yang, L., Deng, T., He, L., Chen, L., Fu, A., Ding, Q., 888 He, J., Chung, G., Isobe, S., Tanabata, T., Valliyodan, B., Nguyen, H.T., Cannon, S.B., Foyer, C.H., 889 Chan, T.-F., and Lam, H.-M. (2019). A reference-grade wild soybean genome. Nat Commun 890 10.10.1038/s41467-019-09142-9

23 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

891 Yang, K., Tian, Z., Chen, C., Luo, L., Zhao, B., Wang, Z., Yu, L., Li, Y., Sun, Y., Li, W., Chen, Y., Zhang, Y., 892 Ai, D., Zhao, J., Shang, C., Ma, Y., Wu, B., Wang, M., Gao, L., Sun, D., Zhang, P., Guo, F., Wang, W., 893 Wang, J., Varshney, R.K., Ling, H.Q., and Wan, P. (2015). Genome sequencing of adzuki bean 894 (Vigna angularis) provides insight into high starch and low fat accumulation and domestication. 895 Proc Natl Acad Sci U S A 112, 13213-13218.10.1073/pnas.1420949112 896 Young, N.D., Debelle, F., Oldroyd, G.E., Geurts, R., Cannon, S.B., Udvardi, M.K., Benedito, V.A., Mayer, 897 K.F., Gouzy, J., Schoof, H., Van De Peer, Y., Proost, S., Cook, D.R., Meyers, B.C., Spannagl, M., 898 Cheung, F., De Mita, S., Krishnakumar, V., Gundlach, H., Zhou, S., Mudge, J., Bharti, A.K., Murray, 899 J.D., Naoumkina, M.A., Rosen, B., Silverstein, K.A., Tang, H., Rombauts, S., Zhao, P.X., Zhou, P., 900 Barbe, V., Bardou, P., Bechner, M., Bellec, A., Berger, A., Berges, H., Bidwell, S., Bisseling, T., 901 Choisne, N., Couloux, A., Denny, R., Deshpande, S., Dai, X., Doyle, J.J., Dudez, A.M., Farmer, A.D., 902 Fouteau, S., Franken, C., Gibelin, C., Gish, J., Goldstein, S., Gonzalez, A.J., Green, P.J., Hallab, A., 903 Hartog, M., Hua, A., Humphray, S.J., Jeong, D.H., Jing, Y., Jocker, A., Kenton, S.M., Kim, D.J., Klee, 904 K., Lai, H., Lang, C., Lin, S., Macmil, S.L., Magdelenat, G., Matthews, L., Mccorrison, J., Monaghan, 905 E.L., Mun, J.H., Najar, F.Z., Nicholson, C., Noirot, C., O'bleness, M., Paule, C.R., Poulain, J., Prion, 906 F., Qin, B., Qu, C., Retzel, E.F., Riddle, C., Sallet, E., Samain, S., Samson, N., Sanders, I., Saurat, O., 907 Scarpelli, C., Schiex, T., Segurens, B., Severin, A.J., Sherrier, D.J., Shi, R., Sims, S., Singer, S.R., 908 Sinharoy, S., Sterck, L., Viollet, A., Wang, B.B., et al. (2011). The Medicago genome provides 909 insight into the evolution of rhizobial symbioses. Nature 480, 520-524.10.1038/nature10625 910 Zhang, H., Jin, J., Tang, L., Zhao, Y., Gu, X., Gao, G., and Luo, J. (2011). PlantTFDB 2.0: update and 911 improvement of the comprehensive plant transcription factor database. Nucleic Acids Res 39, 912 D1114-D1117.10.1093/nar/gkq1141 913 Zhou, Z., Jiang, Y., Wang, Z., Gou, Z., Lyu, J., Li, W., Yu, Y., Shu, L., Zhao, Y., Ma, Y., Fang, C., Shen, Y., 914 Liu, T., Li, C., Li, Q., Wu, M., Wang, M., Wu, Y., Dong, Y., Wan, W., Wang, X., Ding, Z., Gao, Y., 915 Xiang, H., Zhu, B., Lee, S.H., Wang, W., and Tian, Z. (2015). Resequencing 302 wild and cultivated 916 accessions identifies genes related to domestication and improvement in soybean. Nat Biotechnol 917 33, 408-414.10.1038/nbt.3096 918

24 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

919 Tables

920 Table 1: Plant species used in this study and their corresponding genome assembly versions. Non-legume 921 species are marked with asterisks. 922

Scientific name Genome assembly version Genes Chromosome Reference number Cajanus cajan GCA 000340665.1 48,331 11 (Varshney et al., 2011) Phaseolus vulgaris Pvulgaris 218 v1.0 27,197 11 (Schmutz et al., 2014) Vigna radiata Vradiata ver6 35,143 11 (Kang et al., 2014) Vigna angularis Vigan1.1 34,172 11 (Yang et al., 2015) Glycine max Gmax 275 Wm82.a2.v1 56,044 20 (Schmutz et al., 2010) Glycine soja W05v1.0 55,539 20 (Xie et al., 2019) Cicer reticulatum WCGAP v1.0 25,680 8 (Gupta et al., 2017) (Jain et al., 2013;Varshney et al., Cicer arietinum ASM33114v1 33,107 8 2013;Parween et al., 2015) (Young et al., 2011;Tang et al., Medicago truncatula Mtruncatula 285 Mt4.0v1 50,894 8 2014) Glycyrrhiza uralensis Gur.draft-genome.20151208 34,445 8 (Mochida et al., 2017) Lotus japonicus Genome assembly build 3.0 39,734 6 (Sato et al., 2008) Lupinus angustifolius v1.0 33,076 20 (Hane et al., 2017) Arachis ipaensis Araip1.0 46,410 10 (Bertioli et al., 2016) (Bertioli et al., 2016;Chen et al., Arachis duranensis Aradu1.0 42,562 10 2016) Chamaecrista fasciculata version.1 21,781 8 (Griesmann et al., 2018) (Arabidopsis-Genome-Initiative, Arabidopsis thaliana* Athaliana 167 TAIR10 27,416 5 2000) Vitis vinifera* Vvinifera 145 Genoscope.12X 26,346 19 (Jaillon et al., 2007) Amborella trichopoda* AmTr v1.1 26,846 13 (Albert et al., 2013) Aquilegia coerulea* Aquilegia coerulea v3.1 30,023 7 (Filiault et al., 2018) Selaginella moellendorffii* Selaginella moellendorffii v1.0 22,285 10 (Banks et al., 2011) 923 924 925 926 927 928 929 930 931 932 933 934 935 936 25 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

937 Table 2: Number of transcription factors identified in each family across species.

family family F Super T Ca. cajan Ph. vulgaris Vi. radiata Vi. angularis Gl. max Gl. soja Ci. reticulatum Ci. arietinum Me. truncatula Gl. uralensis Lo. japonicus Lu. angustifolius Ar. ipaensis Ar. duranensis Ch. fasciculata Ar. thaliana Vi. vinifera Aq. coerulea Am. trichopoda Se. moellendorffii AP2 26 31 24 26 49 52 24 25 31 24 28 38 30 24 22 18 20 16 12 16 AP2/ERF ERF 147 149 156 161 290 279 115 129 186 112 107 189 143 130 131 122 72 78 64 35 RAV 2 3 3 3 5 5 2 3 3 3 3 6 3 3 4 6 3 4 1 2 BBR-BPC 5 4 4 6 4 5 2 3 2 4 5 10 5 6 4 7 5 4 4 1 BES1 6 6 6 8 10 8 7 6 7 6 5 12 9 8 7 8 6 4 5 5 B3 ARF 23 27 28 27 42 45 25 25 38 11 25 43 31 30 22 22 18 13 13 7 B3 49 48 36 33 77 79 23 35 132 49 61 52 69 68 29 73 29 104 18 19 C3H 44 44 42 43 73 80 45 50 58 40 54 65 48 45 37 50 47 40 32 29 CO-like 10 13 12 10 22 22 10 10 11 11 8 20 11 11 10 17 6 8 5 4 Dof 37 42 42 40 74 73 34 37 40 41 30 67 39 38 39 36 22 29 18 11 C2C2 Zn- GATA 33 32 29 27 61 63 26 27 42 30 20 45 25 25 26 30 20 29 21 6 finger LSD 7 4 4 4 5 4 4 4 4 2 5 7 4 4 3 3 3 3 2 2 YABBY 9 8 10 9 12 13 7 8 7 8 7 14 7 8 9 6 7 5 6 0 C2H2 223 134 128 128 219 218 78 108 111 108 93 183 136 127 157 104 64 87 85 35 CAMTA 10 8 11 6 14 15 7 7 8 8 7 9 8 7 6 6 5 5 3 5 CPP 8 9 10 7 16 15 7 8 12 9 10 13 14 13 9 10 8 7 6 5 DBB 11 12 14 13 16 16 6 7 8 8 8 13 7 8 10 8 8 5 4 4 E2F/DP 7 7 7 9 12 13 6 6 6 10 7 12 10 9 6 8 7 7 5 4 MADS M-type 48 44 23 31 77 80 34 43 101 29 34 38 28 23 32 65 17 50 19 12 MIKC 23 34 47 27 75 71 16 51 38 14 21 27 44 48 28 42 35 24 15 3 EIL 6 7 4 5 10 12 6 7 13 9 7 9 7 6 17 6 2 2 2 6 FAR1 49 25 67 20 68 79 17 37 76 79 29 10 298 198 41 17 19 92 10 0 LFY 1 1 1 1 3 2 1 1 1 2 0 3 2 2 1 1 1 1 1 1 GARP ARR-B 15 15 16 18 31 34 13 14 28 14 10 19 12 11 9 15 12 13 7 6 G2-like 44 50 52 46 96 100 36 41 44 49 39 78 49 46 47 42 39 34 27 20 GRAS 57 55 58 58 108 111 47 46 66 53 62 54 49 48 52 34 43 36 44 47 GRF 10 10 9 8 20 20 8 8 8 10 9 15 11 11 10 9 8 7 6 4 GeBP 7 5 8 5 9 11 7 8 7 5 4 12 4 7 4 23 1 13 6 1 HB-PHD 2 3 3 3 6 6 2 2 2 3 3 2 4 3 3 2 2 2 2 1 HD-ZIP 51 55 54 54 89 92 45 49 57 54 40 82 45 43 45 48 33 30 22 9 Homeobox TALE 34 32 31 28 57 61 22 24 23 32 25 45 28 30 25 21 22 17 12 7 WOX 18 18 20 61 32 33 14 18 19 17 14 31 15 15 13 16 10 10 9 8 Other 6 7 8 7 15 13 8 8 7 8 9 12 9 7 6 6 7 6 5 4 HRT-like 1 1 1 1 1 1 0 2 3 1 1 1 1 1 1 2 1 1 1 2 HSF 27 29 32 32 46 45 20 22 28 32 10 33 22 22 32 24 18 15 12 6 LBD 52 48 47 49 75 75 35 47 59 53 41 70 52 49 52 41 43 29 23 13 MYB MYB 173 170 172 169 294 288 80 132 162 146 100 210 136 135 152 146 138 97 61 21 MYB 84 82 73 75 162 165 53 62 93 95 62 107 76 70 62 72 57 55 36 34 relatedNAC 91 90 92 90 142 139 61 77 97 72 81 117 85 84 88 112 70 76 45 21 NF-X1 1 2 2 2 3 3 3 3 3 3 3 2 2 2 2 2 3 3 2 2 NF-YA 11 9 9 9 16 18 7 8 8 9 8 13 12 11 7 10 7 5 5 1 NF-Y NF-YB 23 19 23 18 36 36 19 22 24 23 18 25 16 14 20 13 17 14 9 7 NF-YC 14 15 14 14 22 22 11 11 15 14 9 16 13 12 10 14 8 10 8 5 NZZ/SPL 1 3 2 3 2 2 2 2 1 1 2 3 4 3 2 1 2 2 1 3 Nin-like 13 12 11 12 27 26 10 9 13 12 9 15 16 12 11 14 8 9 7 7 S1Fa-like 2 4 3 3 4 4 3 4 5 3 3 6 2 2 5 3 3 1 1 0 SAP 3 5 3 1 5 5 2 2 3 3 4 5 5 5 1 4 3 3 2 1 SBP 24 23 22 23 42 40 17 19 24 25 16 35 19 18 19 16 18 13 12 9 SRS 11 10 11 10 22 22 8 8 11 10 11 12 11 10 10 11 6 4 6 4 STAT 1 1 1 1 1 1 1 1 1 1 1 1 2 1 0 2 1 1 1 2 TCP 25 27 27 27 55 56 19 24 21 24 24 42 30 23 23 24 15 16 14 4 Trihelix 38 41 41 42 70 74 28 38 36 36 32 64 40 43 39 30 26 33 30 34 VOZ 2 3 3 3 4 5 2 2 2 3 2 4 2 2 2 2 2 2 1 0 WRKY 96 90 95 94 171 170 59 81 104 79 72 112 85 84 71 73 59 38 31 12 Whirly 3 3 3 5 7 7 3 3 3 2 3 4 2 3 3 3 2 3 2 1 ZF-HD 18 19 18 18 41 45 14 15 17 15 20 26 13 15 31 17 10 16 9 7 bHLH 156 164 161 159 321 320 110 141 162 141 129 207 150 145 137 142 106 99 72 45 bZIP 72 79 85 85 141 146 61 70 91 73 64 128 73 74 65 77 50 46 41 28 26 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

938 Table 3: Prevalence of different modes of duplication among transcription factors.

Species Total Singletons (%) Duplicates SD TD PD rTE dTE DD Ca. cajan TFs1970 314 (15.94) (%)1656 (84.06) 548 90 21 40 397 560 Ph. vulgaris 1891 257 (13.59) 1634 (86.41) 914 99 8 16 244 353 Vi. radiata 1918 253 (13.19) 1665 (86.81) 743 115 9 24 319 455 Vi. angularis 1877 307 (16.36) 1570 (83.64) 842 74 9 14 220 411 Gl. max 3407 188 (5.52) 3219 (94.48) 264 79 14 28 242 211 Gl. soja 3445 224 (6.5) 3221 (93.5) 5264 85 13 24 231 220 Ci. reticulatum 1332 365 (27.4) 967 (72.6) 8350 45 3 24 207 338 Ci. arietinum 1660 339 (20.42) 1321 (79.58) 521 101 7 17 267 408 Me. truncatula 2182 430 (19.71) 1752 (80.29) 648 209 32 27 235 601 Gl. uralensis 1738 403 (23.19) 1335 (76.81) 482 36 3 19 281 514 Lo. japonicus 1514 428 (28.27) 1086 (71.73) 195 39 6 25 345 476 Lu. angustifolius 2493 223 (8.95) 2270 (91.05) 165 51 0 0 0 565 Ar. ipaensis 2073 365 (17.61) 1708 (82.39) 4336 113 21 32 434 772 Ar. duranensis 1902 363 (19.09) 1539 (80.91) 410 95 5 27 373 629 Ch. fasciculata 1709 426 (24.93) 1283 (75.07) 186 36 2 36 383 640 Ar. thaliana 1736 392 (22.58) 1344 (77.42) 707 81 13 15 152 376 Vi. vinifera 1274 478 (37.52) 796 (62.48) 303 82 13 5 193 200 Aq. coerulea 1376 552 (40.12) 824 (59.88) 130 86 14 0 0 594 Am. trichopoda 923 560 (60.67) 363 (39.33) 14 33 4 0 0 312 Se. moellendorffii 588 278 (47.28) 310 (52.72) 79 10 4 0 0 217 939 Abbreviations: Segmental duplicates (SD); Tandem duplicates (TD); Proximal duplicates (PD); Retrotransposon mediated duplicates 940 (rTE); Transposon mediated duplicates (dTE); Dispersed duplicates (DD). 941

27 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

942 Table 4: Percentage of singletons within collinear regions in a reference species.

SD in collinear Reference Singletons in Species Singletons SD regions in the outgroup species collinear regions reference outgroup Ca. cajan Vi. vinifera 314 132 (42%) 548 430 (78%) Ph. vulgaris Vi. vinifera 257 142 (55%) 914 750 (82%) Vi. radiata Vi. vinifera 253 120 (47%) 743 595 (80%) Vi. angularis Vi. vinifera 307 161 (52%) 842 659 (78%) Gl. max Ph. vulgaris 188 93 (49%) 2645 2546 (96%) Gl. soja Ph. vulgaris 224 105 (47%) 2648 2546 (96%) Ci. reticulatum Vi. vinifera 365 199 (55%) 350 286 (81%) Ci. arietinum Vi. vinifera 339 201 (59%) 521 414 (79%) Me. truncatula Vi. vinifera 430 184 (43%) 648 520 (80%) Gl. uralensis Vi. vinifera 403 170 (42%) 482 376 (78%) Lo. japonicus Vi. vinifera 428 177 (41%) 195 154 (78%) Lu. angustifolius Vi. vinifera 223 73 (32%) 1654 1114 (67%) Ar. ipaensis Vi. vinifera 365 171 (47%) 336 275 (81%) Ar. duranensis Vi. vinifera 363 152 (42%) 410 327 (79%) Ch. fasciculata Vi. vinifera 426 144 (34%) 186 95 (51%) Ar. thaliana Vi. vinifera 392 226 (58%) 707 512 (72%) Vi. vinifera Aq. coerulea 478 295 (62%) 303 285 (94%) Aq. coerulea Am. trichopoda 552 177 (32%) 130 65 (50%) Am. trichopoda Se. moellendorffii 560 0 (0%) 14 0 (0%) 943 Abbreviations: Segmental duplicates (SD). 944 945 946 947

28 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

948 Table 5: Number of genes and prevalence of modes of duplication in TF orthologous groups that 949 expanded in legumes.

Species Total SD TD PD rTE dTE DD Ca. cajan 38 14 3 0 0 7 14 Ph. vulgaris 36 14 6 1 0 7 8 Vi. radiata 36 15 4 1 0 12 4 Vi. angularis 38 14 9 0 0 9 6 Gl. max 67 48 5 3 0 10 1 Gl. soja 69 50 2 3 0 13 1 Ci. reticulatum 22 6 0 0 0 7 9 Ci. arietinum 24 11 0 2 0 1 10 Me. truncatula 39 11 4 2 0 12 10 Gl. uralensis 31 12 0 0 0 4 15 Lo. japonicus 28 6 2 0 0 13 7 Lu. angustifolius 60 42 6 0 0 0 12 Ar. ipaensis 31 11 4 1 0 9 6 Ar. duranensis 32 9 5 1 0 10 7 Ch. fasciculata 34 2 6 0 0 13 13 950 Abbreviations: Segmental duplicates (SD); Tandem duplicates (TD); Proximal duplicates (PD); Retrotransposon mediated duplicates 951 (rTE); Transposon mediated duplicates (dTE); Dispersed duplicates (DD). 952

29 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

953 954 Table 6: Orthologous groups with significantly (p-value < 0.05) rapid expansion in legumes. M- Description ARF:1 ARF:2 bHLH:5 bHLH:12 ERF:10 LBD:8 MYB:1 MYB:25 NAC:3 type:1 Ca. cajan 4 2 3 3 4 3 7 5 3 4 Ph. vulgaris 3 3 4 2 4 3 5 6 2 4 Vi. radiata 4 3 4 2 3 4 4 5 3 4 Vi. angularis 3 3 4 2 4 3 5 8 3 3 Gl. max 5 6 8 6 5 4 14 9 5 5 Gl. soja 6 7 8 5 6 4 14 9 5 5 Ci. reticulatum 3 3 2 4 3 1 4 1 1 3 Ci. arietinum 3 3 2 4 3 2 4 1 1 3 Me. truncatula 3 3 3 4 2 2 7 5 7 3 Gl. uralensis 0 3 3 2 4 4 6 2 3 4 Lo. japonicus 3 3 3 3 3 2 4 4 1 3 Lu. angustifolius 5 5 6 5 5 6 18 2 3 5 Ar. ipaensis 4 3 4 2 3 3 4 3 1 5 Ar. duranensis 4 3 4 2 2 2 5 3 2 5 Ch. fasciculata 3 3 4 3 3 3 3 4 5 3 Ar. thaliana 1 0 0 0 1 1 0 0 0 1 Vi. vinifera 1 1 1 1 0 2 1 1 0 2 Aq. coerulea 1 1 1 1 1 1 0 0 0 1 Am. trichopoda 0 0 0 1 0 0 0 0 0 0 Se. moellendorffii 0 0 0 0 0 0 0 0 0 0 955 956

30 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

957 Figures 958

959 960 Figure 1: Absolute and relative number of transcription factors in each species. Grey bars and the orange 961 line represent the absolute number and percentage of transcription factors in each species, respectively. 962 Legumes and non-legumes are separated by a dotted vertical line.

963

31 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

964

965

966 Figure 2: Ratio (in log2 scale) of the sizes of each transcription factor family each species in relation to Vi. 967 vinifera. Values greater or smaller than zero represent transcription factor families that are relatively 968 larger or smaller in a given species (in columns) in comparison to Vi. vinifera, respectively. The numbers in 969 parentheses stand for the absolute size of that particular family in Vi. vinifera.

970

971 32 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

972 973

974 975 Figure 3: Expression levels and Ka/Ks ratio of singleton transcription factors. A. Expression (in FPKM) of 976 syntenic- and non-syntenic singletons in three legume species and in Arabidopsis thaliana. B. Ka/Ks 977 distribution of syntenic singletons and syntenic segmental duplicates. Syntenic singletons are 978 transcription factors genes, without a close paralog, that are located in a syntenic region in a reference 979 outgroup. Segmental duplicates are paralogous transcription factors with preserved synteny in the same 980 genome, as well as in the genome of a reference outgroup. Ph. vulgaris was used as reference for Gl. max 981 and Vi. vinifera was used as reference for the other three species. Statistical significance test was 982 performed using the Mann-Whitney U test and asterisk (*) mark indicates p-value < 0.05.

33 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

983 984 Figure 4: Species tree showing number of transcription factor orthologous groups that gained or lost 985 genes. We used different rates of evolution in different lineages, which are represented as branch 986 styles/colors. Known polyploidization events are marked with stars. Green and red triangles refer to 987 nodes with more expansions and contractions, respectively. Numbers of expanded and contracted 988 orthologous groups are shown in green and red, respectively.

989

34 bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

990 991 992 Figure 5: Distribution of synonymous substitution rates (Ks) of 138 orthologous groups with gene gain in 993 legumes. Genome-wide Ks distributions are shown as density plots on the top panel. The bottom panel 994 shows Ks distributions of segmental and dispersed duplicate gene pairs.

995 996 997 998

35

bioRxiv preprint doi: https://doi.org/10.1101/849778; this version posted November 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

999

1000 1001 Figure 6: A. Phylogenetic reconstruction of the orthologous group bHLH:12, which is expanded in legumes. B. Gene expression patterns of bHLH:12 genes in 1002 Me. truncatula (BioProject: PRJNA80163), Gl. max (Libault et al., 2010) and Ph. vulgaris (O’Rourke et al., 2014), showing a trend for greater expression in roots 1003 and nodules.

36