G3: |Genomes| Early Online, published on February 21, 2018 as doi:10.1534/g3.118.200040

1 Title: Synopsis of The SOFL Plant-Specific Family

2

3 Reuben Tayengwa*,†,1, Jianfei Zhao‡, Courtney F. Pierce†,2, Breanna E. Werner†,3, Michael

4 M. Neff*,†

5

6 *Program in Molecular Plant Sciences, Washington State University, Pullman, WA, USA.

7 †Department Crop and Soil Sciences, Washington State University, Pullman, WA, USA.

8 ‡Department of , University of Pennsylvania, Philadelphia, PA, USA.

9

10 Present addresses

11 1University of Maryland, Plant Sciences and Horticultural Landscape Department, College Park,

12 MD 20742.

13 2Colorado State University, Department of Animal Sciences, Fort Collins, CO 80523.

14 3Washington State University College of Nursing, Spokane, WA 99202.

15

16

17

18

1

© The Author(s) 2013. Published by the Genetics Society of America. 19 Running Title: The SOFL gene family

20

21 Keywords

22 SOFL, plant-specific, gene-family, phylogenetic, Arabidopsis, evolution, introns

23 For correspondence (e-mail [email protected])

24 Address: 387 Johnson Hall, PO Box 646420, Pullman WA 99164-6420 USA

25 Phone: 509-335-7705, Fax: 509-335-8674

26

27

28

29

30

31

32

33

34

35

36

2

37 ABSTRACT

38 SUPPRESSOR OF PHYB-4#5 DOMINANT (sob5-D) was previously identified as a suppressor

39 of the phyB-4 long-hypocotyl phenotype in . Overexpression of SOB5

40 conferred dwarf phenotypes similar to those observed in plants containing elevated levels of

41 cytokinin (CK) nucleotides and nucleosides. Two SOB-FIVE- LIKE (SOFL) proteins, AtSOFL1

42 and AtSOFL2, which are more similar at the protein level to each other than they are to SOB5,

43 conferred similar phenotypes to the sob5-D mutant when overexpressed. We used founding

44 SOFL gene family members to perform database searches and identified a total of 289 SOFL

45 homologues in sequenced genomes of 89 angiosperm species. Phylogenetic analysis results

46 implied that the SOFL gene family emerged during the expansion of angiosperms and later

47 evolved into four distinct clades. Among the newly identified gene family members are four

48 previously unreported Arabidopsis SOFLs. Multiple sequence alignment of the 289 SOFL

49 protein sequences revealed two highly conserved domains; SOFL-A and SOFL-B.

50 Overexpression and site-directed mutagenesis studies demonstrated that the SOFL domains are

51 necessary for SOB5 and AtSOFL1’s overexpression phenotypes. Examination of the subcellular

52 localization patterns of the founding Arabidopsis thaliana SOFLs suggested they may be

53 localized in the cytoplasm and the nucleus. We have discovered that SOFLs are a plant-specific

54 gene family characterized by two conserved domains that are important for function.

55

56 INTRODUCTION

57 Many genes can be grouped into specific families based on nucleotide and protein sequence

58 similarity. Such gene families have arisen as a result of duplication and expansion of individual

59 members (Wang et al. 2012). Gene family sizes vary across different lineages and may have

3

60 important functional outcomes related to adaptation or speciation (Qu and Zhu 2006; Vollrath et

61 al. 1998). Functional and evolutionary studies of genes belonging to various families have

62 greatly enhanced our understanding of components involved in plant growth and development

63 (Kim et al. 2007; Higuchi et al. 2004; Xie et al. 1999; Kim 2006; Kondou et al. 2010; Zhao et al.

64 2014). Despite significant progress made towards discovery and curations of many gene families,

65 it is estimated that the majority of plant gene families remain uncharacterized (Guo 2012).

66 Therefore, the continued identification and study of hitherto uncharacterized gene families

67 remains crucial for our continued endeavor to eventually achieve the goal of functionally

68 characterizing most of the genes.

69

70 The Arabidopsis thaliana activation-tagged mutant sob5-D (SUPPRESSOR OF PHYB-4#5

71 DOMINANT) was previously identified as a suppressor of the long-hypocotyl phenotype

72 conferred by a phyB-4 mutation (Zhang et al. 2006). Database searches using SOB5 (At5g08150)

73 protein sequence as a query revealed two SOB-FIVE- LIKE (SOFL) proteins in Arabidopsis,

74 AtSOFL1 (At1g26210) and AtSOFL2 (At1g68870) (Zhang et al. 2006). AtSOFL1 and AtSOFL2

75 are more similar to each other than they are to SOB5 at the protein level (Zhang et al. 2006,

76 2009). When individually overexpressed, AtSOFL1 and AtSOFL2 conferred phenotypes similar

77 to the sob5-D activation-tagged mutant (Zhang et al. 2006, 2009). Transgenic lines individually

78 overexpressing SOB5, AtSOFL1 and AtSOFL2 genes displayed reduced apical dominance, and

79 conferred smaller rosette leaves, retarded root growth and delayed senescence phenotypes

80 (Zhang et al. 2006, 2009). In addition, these transgenic lines also contained elevated levels of

81 cytokinin (CK) nucleotides and nucleosides; trans-zeatin riboside (tZR), trans-zeatin riboside

82 monophosphate (tZRMP), and N6-(∆2-isopentenyl)adenosine monophosphate (iPRMP) (Zhang et

4

83 al. 2006, 2009). These results suggested that the founding SOFL gene family members may have

84 similar or overlapping CK-related functions. However, the specific details of what roles, if any,

85 intermediate CK nucleotide/nucleoside species play during plant growth and development are

86 not fully understood and will have to be further investigated in the future.

87

88 Zhang et al., (2006), reported that the three founding SOFLs constituted a small, novel and

89 intron-less Arabidopsis thaliana gene family. In addition, SOFL homologues and expressed

90 sequence tags were also identified in a few monocotyledonous and dicotyledonous species;

91 Malus x domestica, Brassica napus, Oryza sativa and Populus trichocarpa (Zhang et al. 2006).

92 Due to limited sequenced genomes at the time it was difficult to obtain a comprehensive

93 phylogenetic overview of the SOFL gene family. Nonetheless, based on these preliminary results

94 we speculated that SOFL homologues also existed in genomes of other eukaryotic, prokaryotic,

95 and or mammalian species (Zhang et al. 2006).

96

97 Since Arabidopsis SOFLs were first identified, more genome sequence information has been

98 released (Fang et al. 2013; Wheeler et al. 2008; Kersey et al. 2016). We have taken a

99 phylogenetic approach to update our understanding of the evolution and diversification of the

100 SOFL gene family. Database searches using the three founding SOFL members’ proteins

101 sequences as queries revealed at least 289 SOFL homologues in 89 angiosperm species. During

102 this process, we also discovered that the Arabidopsis thaliana genome contains four additional

103 and more divergent SOFLs that were previously unreported. In addition, phylogenetic analysis

104 showed that the 289 SOFL homologues evolved into four main clades. Finally, to generate

105 preliminary data as well as gain some insights into this gene family, we performed subcellular

5

106 localization, tissue expression and domain mutational studies using the three founding SOFL

107 members. This updated synopsis of the SOFL gene family provides basic background

108 information needed to design future studies to eventually fully characterize this novel gene

109 family.

110

111 MATERIALS AND METHODS

112 Plant materials and growth conditions

113 All Arabidopsis thaliana lines used in this manuscript are in the Columbia (Col-0) background.

114 For phenotypic analysis, all plants were grown on soil in pots in growth chambers set at 21°C

115 with white light (200 μmol m-2 sec-1) and 60-70 % humidity, and under 16 hours light and 8

116 hours darkness.

117

118 Sequence alignment and phylogenetic analysis

119 SOB5, AtSOFL1 and AtSOFL2 protein sequences were downloaded from NCBI (Lamesch et al.

120 2012; Goodstein et al. 2011; Wheeler et al. 2008; Sayers et al. 2009). The sequences were used

121 as queries to search NCBI and Phytozome databases for SOFL homologues using BLASTP

122 option using a < 1 X 10-2 cut-off value. The amino acid sequences were aligned using Probalign

123 V1.3 (Roshan 2014) on CIPRES Science Gateway Server (Miller et al. 2015). Bayesian

124 inference analysis was performed with the Mr. Bayes 3.2.1 on XSEDE tool on CIPRES Science

125 Gateway for 30 million generations with 8 chains to run. Generations were sampled every 10,000

126 generations with the first 25% generations were used as a burn-in.

127

128 Sequence logo analysis

6

129 Sequence logo analysis of conserved SOFL-A and SOFL-B domains was performed using the

130 online WebLogo tool (Crooks et al. 2004). 289 SOFL sequences retrieved from NCBI (Sayers et

131 al. 2009)and Phytozome (Goodstein et al. 2011) were used to generate the sequence logo using

132 default settings.

133

134 Overexpression and point-mutation analysis of SOFL domains in SOB5 and AtSOFL1

135 Site-directed point mutations were generated using the QuikChange® Lightning mutagenesis kit

136 according to manufacturer’s instructions (Agilent Technologies). Mutagenesis primers were

137 designed using Agilent’s web-based QuikChange® Primer Design Program

138 (http://www.genomics.agilent.com/primerDesignProgram.jsp) (Table S1). Gateway® compatible

139 entry vectors, pENTR223 (ABRC), carrying SOB5 and AtSOFL1 coding sequences, were used as

140 templates in mutagenesis PCR reactions. Each point mutation was confirmed via sequencing.

141 Once mutagenesis was confirmed, SOB5 or AtSOFL1 coding sequences carrying point mutations

142 were cloned via LR® reactions into pCHF3 (Zhang et al. 2006) and pEarlyGate100 (Earley et al.

143 2006) binary vectors, respectively. Destination binary vectors carrying mutated SOB5 and

144 AtSOFL1 genes under control of the constitutive 35S promoter were transformed into

145 Arabidopsis thaliana Col-0 wild-type plants using the floral dip method and the

146 tumafaciens strain GV3101 (Clough and Bent 1998).Transformants were screened on 1/2×

147 Linsmaier and Skoog media plates containing 25 mg L-1 Basta (pEarleyGate 100) and 30 mg L-1

148 kanamycin (pCHF3). Homozygous lines were identified in the T3 generation. Three independent

149 transgenic lines representing each construct were selected for analysis. All plants were grown

150 and analyzed in a growth chamber.

151

7

152 RNA extraction and real-time PCR

153 Total RNA was extracted from tissues pooled from three independent plants (n=3 for all tissues)

154 of six-day old seedlings, 10-day old juvenile plants, rosette leaves from a 20-day old plant, floral

155 tissue, siliques, a 25-day old adult plant, and roots from a 25-day old plant using the Qiagen®

156 RNeasy Mini Kit (Qiagen). On-column DNase digestion was performed using the RNase-Free

157 DNase Set (Qiagen, Valencia, CA). Complementary DNA (cDNA) was further generated using

158 the iScript Reverse Transcription Supermix for RT-qPCR (Bio Rad, Hercules, CA). Identical

159 amounts of cDNA input were used in PCR reactions using either ubiquitin10 (UBQ10) or gene-

160 specific primers (Table S1). Amplification using the UBQ10 primers was done in 30 cycles

161 while for gene-specific primers 45 cycles were used. Negative control, no reverse transcriptase

162 (No RT), samples were prepared using the same reaction conditions and reagents (minus reverse

163 transcriptase enzyme) used to make cDNA.

164

165 CFP-SOFL fusion constructs and onion bombardment

166 To generate pSAT4-CFP-SOB5, pSAT4-CFP-AtSOFL1, and pSAT4-CFP-AtSOFL2 N-terminal

167 fusion constructs, respective full-length coding sequences, each contained in the entry vector

168 pENTR223 (ABRC), were cloned into pSAT4-CFP (ABRC) destination vector via Gateway®

169 LR reactions (Invitrogen, Carlsbad, CA). Onion epidermal layers were co-bombarded with

170 pSAT6-mRFP (ABRC) and pSAT4-CFP constructs carrying SOFLs fused to CFP using

171 a PDS-1000/He Biolistic transformation system (Bio Rad, Hercules, CA). Bombarded onion

172 epidermal layers were incubated in the dark for 40 hours. To identify successfully transformed

173 cells and observe fluorescent signals, the onion epidermal layer was examined on a Leica TCS

174 SP8 X (Leica Microsystems, Mannheim, Germany) confocal microscope.

8

175

176 Data availability

177 Genetic material used in this manuscript is available upon request. Primers used in the study are

178 listed in Table S1. Total numbers of SOFL homologues identified in various species are listed in

179 Table S2. Table S3 shows results from protein localization prediction analysis. Alternatively-

180 spliced AtSOFL1 product sequences (AtSOFL1.1 and AtSOFL1.2) are provided in the

181 supplemental file (File S2). Figure S1 shows a cartoon and gel image of AtSOFL1’s alternative

182 splicing products. Sequences extracted from our database searches using the founding SOFL

183 protein sequences as queries are provided in Supplemental file.

184

185 RESULTS

186 SOFLs are a plant-specific gene family

187 To identify SOFL homologues, we performed BLASTP searches against the NCBI database

188 (Sayers et al. 2009) using SOB5, AtSOFL1 and AtSOFL2 sequences as queries. We extracted

189 289 SOFL homologues from 89 flowering plant species, of which 15 were monocotyledons and

190 74 were dicotyledons (Table S2). To gain insight into the evolutionary history of the SOFL gene

191 family, we next performed phylogenetic analysis using the 289 SOFL protein sequences

192 extracted via database searches. Our analysis suggested that SOFLs likely evolved into four

193 major clades: Clade-I, II, III and IV (Figure 1a and 1b). We hypothesize that SOFLs within each

194 clade have greatly expanded through evolution, with Clades-I and -IV showing the largest

195 expansion (Figure 1a and1b). However, since the tree was constructed only with the existing

196 SOFL genes in the plant species examined and no outgroup used, we cannot rule out gene loss

197 events. Annotation data from NCBI (Sayers et al. 2009), Phytozome (Goodstein et al. 2011) and

9

198 individual species genome databases revealed that 60 SOFL homologues contained introns and

199 another 226 did not (Table S2). There was no annotation data for Arabis alpina and Erythranthe

200 guttata (Mimulas guttatus) SOFL homologues.

201

202 Arabidopsis thaliana genome contains seven SOFL members.

203 It was previously reported that the Arabidopsis genome contained three SOFL members (Zhang

204 et al. 2006, 2009). However, latest database search results revealed that the Arabidopsis thaliana

205 genome contained a total of seven SOFL members (File S1). In addition, we also identified seven

206 SOFLs in Arabidopsis lyrata (File S1). We have designated the newly identified Arabidopsis

207 thaliana SOFL genes as AtSOFL3 (AT3G30580), AtSOFL4 (AT5G38790), AtSOFL5

208 (AT4G33800) and AtSOFL6 (AT1G58460). SOB5 is in Clade VI of the phylogenetic tree (Figure

209 1a and 1b), compared to AtSOFL1 and AtSOFL2 which were in Clade I. Interestingly, the newly

210 identified SOFL members, AtSOFL3, AtSOFL4, AtSOFL5 and AtSOFL6 are in Clade IV (Figure

211 1b), suggesting that genetic redundancy may exist among these genes and SOB5. All

212 Arabidopsis SOFLs are intron-less except AtSOFL1(recently classified by TAIR as intron-

213 containing), AtSOFL5 and AtSOFL6.

214

215 We next assessed the expression patterns of Arabidopsis thaliana SOFL homologues. RNA was

216 extracted from seedlings, juvenile plants, adult rosette leaf, floral structures, siliques, an entire

217 flowering plant and adult plant roots. SOB5, AtSOFL1 and AtSOFL2 transcripts were detected in

218 all samples tested except roots (Figure 2). AtSOFL3, AtSOFL4, AtSOFL5 and AtSOFL6 were

219 expressed at varying levels in seedlings, juvenile plants, flowers, siliques and whole plant

220 (Figure 2). AtSOFL3, AtSOFL4, AtSOFL5 were the only ones detected in roots. AtSOFL6

10

221 showed little to no expression in rosette leaves and roots, and similarly, AtSOFL5 showed little

222 to no expression in seedlings and rosette leaves (Figure 2). AtSOFL4 was expressed in all

223 samples tested but had the lowest transcript levels overall. The varying transcript accumulation

224 levels and differential tissue expression patterns may suggest unique functional roles among the

225 seven Arabidopsis thaliana SOFLs during plant development (Figure 2). Surprisingly, during

226 efforts to amplify a full-length AtSOFL1 (At1g26210) cDNA we identified a previously

227 unreported splice variant. We have designated the original 447 bp transcript AtSOFL1

228 (At1g26210.1), and the newly identified 771bp second transcript as (At1g26210.2) (Figure S1a,

229 b).

230

231 SOFL homologues are characterized by two conserved domains

232 Multiple sequence alignment analysis is a useful tool used to infer relationships among gene

233 family members. Alignment data can provide functional information through the identification of

234 key conserved domains and other important features. We aligned SOFL homologues protein

235 sequences and revealed two conserved domains in the N-terminal region (Figure 3a), which were

236 previously reported by Zhang et al. (2006), albeit based on fewer protein sequences. We have

237 designated the two conserved domains, SOFL-A and SOFL-B (Figure 3a). When only

238 Arabidopsis SOFL sequences were aligned (Figure 3b) and compared to an alignment of all 289

239 SOFL homologues (Figure 3c, 3d) we observed slight differences in the SOFL-B domain. In an

240 alignment of Arabidopsis SOFL sequences only, SOFL-B domain contains eight 100%

241 conserved residues SM×SDASS×P (Figure 3b). However, the same domain only contains four

242 100% conserved residues (S××SDA) when all 289 SOFLs homologous sequences were aligned

243 (Figure 3d). In general, based on all SOFL orthologue sequences identified so far, SOFL-A

11

244 domain contains an SGWT×Y motif and SOFL-B domain contains an S××SDA motif (× =

245 amino acid residue that is not 100% conserved) (Figure 3c, d). There were no areas of high

246 sequence similarity in the C-terminal region, except for an increased presence of basic amino

247 acid residues in most SOFL homologues (Table S3).

248

249 Conserved amino acid residues in SOFL domains are required for the manifestation of

250 SOB5 and AtSOFL1 overexpression phenotypes.

251 Previously, Zhang et al., (2009), investigated the requirement of some of the 100% conserved

252 residues in SOFL-A and SOFL-B domains for the manifestation of AtSOFL2’s overexpression

253 phenotype via site-directed mutagenesis. Constructs carrying AtSOFL2 gene harboring individual

254 point mutations in each of the two conserved domains were used to generate plants

255 overexpressing an aberrant protein. Resultant transgenic mutant plants overexpressing AtSOFL2

256 gene harboring T21I and D80N point mutations lost the phenotype that is typically associated

257 with the overexpression of the wild type AtSOFL2 gene (Zhang et al. 2009). These results

258 implied that the conserved amino acid residues were important for the manifestation of AtSOFL2

259 overexpression phenotype.

260

261 To further investigate the biological importance of some of the 100% conserved residues in the

262 other two founding SOFLs, SOB5 and AtSOFL1, we generated a series of site-directed point

263 mutations in their respective SOFL-A and SOFL-B domains. Wild-type Arabidopsis thaliana

264 plants were separately transformed with constructs carrying mutations in SOB5 (T21I and P61R)

265 and AtSOFL1 (T23I and P84R) coding sequences. Transgenic plants overexpressing SOB5

266 (T21I, P61R) and AtSOFL1 (T23I, P84R) mutated genes lost the dwarf/semi-dwarf phenotypes

12

267 typically observed when wild-type versions were overexpressed (Figure 4a, b). We also

268 performed site-directed mutagenesis on non-conserved amino acid residues; D33H and D53H for

269 SOB5 and AtSOFL1, respectively, which are not located in the two conserved domains (Figure

270 3b). Transgenic plants overexpressing mutated SOB5 (D33H) and AtSOFL1(D53H) exhibited

271 phenotypes similar to those observed when wild-type genes are constitutively expressed (Figure

272 4a, b) (Zhang et al. 2006, 2009). These data suggest that less conserved amino acid residues are

273 not required for the observed phenotypes.

274

275 SOFL subcellular localization

276 We used two publicly available online prediction programs, Wolf PSORT (Horton et al. 2007)

277 and SeqNLS (Lin and Hu 2013), to gain insights into subcellular localization patterns of the 289

278 SOFL homologues. Wolf PSORT converts protein amino acid sequences into numerical

279 localization features based on sorting signals, amino acid composition and functional motifs such

280 as DNA-binding motifs (Horton et al. 2007). SeqNLS uses a sequential pattern mining algorithm

281 to effectively identify potential nuclear localization signals (NLS) in protein sequences (Lin and

282 Hu 2013). Wolf PSORT predicted 274 SOFL homologues to localize to the nucleus, with the

283 remainder predicted to localize to chloroplast, cytoplasm, mitochondria, peroxisome, Golgi and

284 extracellular space (Table S3). In contrast, the SeqNLS program detected nuclear localization

285 signals in only 156 SOFLs with a statistically significant prediction score of at least 0.89 (Table

286 S2) (Lin and Hu 2013). Interestingly SeqNLS did not detect nuclear localization signals in 13

287 SOFLs which were predicted to localize to the nucleus by Wolf PSORT (Table S3). 119 SOFLs

288 scored below the default cutoff threshold to be classified as containing a NLS. Overall, at least

13

289 156 SOFL homologues were predicted to localize to the nucleus by both Wolf PSORT and

290 SeqNLS algorithms (Table S3).

291

292 To further assess some of the subcellular prediction data, we examined the intracellular

293 distribution of the founding Arabidopsis thaliana SOFL members. We fused SOB5, AtSOFL1

294 and AtSOFL2 to the carboxyl end of a cyan fluorescent protein (CFP) and transiently expressed

295 the fusion proteins in onion epidermal cells. CFP-SOB5, CFP-AtSOFL1 and CFP-AtSOFL2

296 fluorescent signals were observed in both the cytoplasm and nucleus of the onion epidermal cells

297 (Figure 5). We also examined whether the fluorescent signals of all three fusion proteins were

298 localized to the cytosol and not the wall. This was achieved by inducing plasmolysis through

299 exposure of onion epidermal peels to 0.8 M mannitol. Ensuing the mannitol treatment, CFP

300 signal remained restricted to the edges of the plasmolyzed and plasma membrane and or

301 cytoplasm, suggesting that CFP-SOB5, CFP-AtSOFL1 and CFP-AtSOFL2 fusion proteins were

302 not localized to the cell wall.

303

304 DISCUSSION

305 SOFL homologues emerged in angiosperms

306 The founding members of the SOFL gene family were first identified via an activation tagging

307 screen in Arabidopsis thaliana (Zhang et al. 2006). Based on the limited number of sequenced

308 genomes available at the time, it was initially concluded that SOB5, AtSOFL1 and AtSOFL2 were

309 a small three-member intron-less Arabidopsis thaliana gene family. To expand on earlier studies

310 and acquire a basic level understanding of this poorly characterized gene family, we took

311 advantage of an increasing number of sequenced genomes to perform database searches using

14

312 the founding Arabidopsis thaliana founding SOFL members as queries. We have retrieved a total

313 of 289 SOFL sequences from 89 angiosperm species (Table S2). However, we cannot rule out

314 the possibility that SOFL homologues are present in unsequenced organisms or were lost in other

315 genomes. In addition, we expect additional SOFL homologues to be identified in other species as

316 more genomes are sequenced. Furthermore, as genome sequence annotation improves, it is

317 possible that some of the SOFL homologues currently classified as intron-less or intron-

318 containing may be re-categorized in the future.

319

320 No SOFL homologues were identified in Volvox cartari (green algae), Chlamydomonas

321 reinhardtii (green algae), Ostreococcus lucimarinas (single-celled water algae), Micromonas

322 pusilla (water algae), Physcomitrella patens (non-vascular bryophyte, moss) and Selaginella

323 moellendorffii (member of an ancient vascular plant lineage) (Lamesch et al. 2012), strongly

324 suggesting that SOFLs may have emerged after the separation of lycophytes, pterophytes and

325 seed plants. Out of the four main plant groups: bryophytes, seedless vascular

326 (pteridophytes/lycophytes), gymnosperms and angiosperms, SOFLs have so far only been

327 identified in angiosperms. We hypothesize that the SOFL gene family emerged and expanded

328 during the evolution of flowering plant species.

329

330 SOFL gene family is comprised of intron-containing and intron-less genes

331 Approximately 20% of SOFL homologues identified so far contain introns and 80% are intron-

332 less (Table S2). These results are inconsistent with overall data from rice and Arabidopsis

333 genomes which revealed the presence of only 19.9% and 21.7% intron-lacking genes,

334 respectively (Jain et al. 2008). Our latest database search results also showed that Arabidopsis

15

335 thaliana contains seven SOFL genes, in contrast to previous reports of three intron-less family

336 members (Zhang et al. 2006, 2009). Out of the four newly discovered Arabidopsis SOFLs, only

337 AtSOFL5 and AtSOFL6 contain introns, a departure from the previous designation of the

338 Arabidopsis SOFL family as being comprised of members lacking introns (Zhang et al. 2006,

339 2009). Genes lacking introns are a characteristic of prokaryotes and are a useful resource for

340 studying the evolution of gene architecture in eukaryotes, but information on their biological

341 significance remains limited (Yan et al. 2014). Considering that SOFLs seem to have emerged

342 during the evolution of angiosperms, it is surprising that the majority of the genes in this family

343 are intron-less, a trait expected in early species (Zou et al. 2011). On the other hand, this result

344 can be explained by the fact that majority of the intron-less SOFLs could have potentially arisen

345 because of gene duplication events (Yan et al. 2016; Lecharny et al. 2003). Gene prediction and

346 gene functional studies data suggests that intron-less genes may play unique roles in growth and

347 development, including translation and energy metabolism in maize, rice and Arabidopsis as well

348 as cell envelope and amino acid biosynthesis in rice and Arabidopsis (Yan et al. 2014; Jain et al.

349 2008). In addition, gain-of-function studies from Zhang et al. (2006, 2009) suggested that the

350 three-founding intron-lacking SOFLs may be involved in CK-related functions. According to

351 Zhao et al. (2014) the presence of introns enhances the transcription of associated genes.

352 Therefore, to further explore the biological significance of intron-less genes, studies that include

353 gain-of-function, loss of function, pattern, gene expression level analysis and

354 subcellular localization will need to be performed in the future.

355

356 SOFL-A and SOFL-B domains are important for SOB5 and AtSOFL1’s overexpression

357 phenotypes

16

358 Previously, Zhang et al. (2009) demonstrated via site-directed mutagenesis experiments that

359 certain amino residues in the SOFL domains were necessary for the manifestation of AtSOFL2’s

360 overexpression phenotype. In our study, we have similarly shown that specific conserved amino

361 acid residues in SOFL-A and SOFL-B domains are also necessary for both SOB5’s and

362 AtSOFL1’s overexpression phenotypes (Figure 4). These results, together with sequence logo

363 analysis showing that certain amino acid residues in SOFL-A and SOFL-B domains are 100%

364 conserved in all 289 SOFLs, further strengthen the hypothesis that they are important for

365 function. These results should, however, be interpreted with caution because point mutations can

366 potentially cause proteins to fold incorrectly. Nonetheless, our results demonstrated, at least, that

367 not all point mutations lead to a loss of protein function. Future experiments in which all

368 conserved residues are replaced with alanines or entire domains are deleted may provide answers

369 to whether SOFL-A and SOFL-B domains are important for function. It is not yet clear what

370 biological role these highly conserved domains play. However, conserved domains typically play

371 crucial roles in protein-protein interactions, DNA binding, and other important cellular

372 processes.

373

374 Founding Arabidopsis thaliana SOFL members localize to the nucleus and cytoplasm

375 To begin to examine the intracellular distribution of SOFLs homologues we first used nuclear

376 localization signal (NLS) detection and sub-cellular localization programs, SeqNLS (Lin and Hu

377 2013) and Wolf PSORT (Horton et al. 2007). The two programs predicted that majority of

378 SOFLs contained NLSs and may localize to the nucleus among other organelles (Table S3). This

379 hypothesis was further supported by CFP-SOFL protein fusion subcellular localization

380 experimental data, which showed that SOB5, AtSOFL1 and AtSOFL2 localize to the nucleus

17

381 and the cytosol (Figure 5). However, SOB5 subcellular localization results were at odds with

382 findings by Zhang et al, (2006) who reported that SOB5 only localized to the cytoplasm and/or

383 plasma membrane, but not the nucleus. These contrasting conclusions could be due to a different

384 experimental design and fluorescence detection methods used. The SOB5-GFP C-terminal fusion

385 protein in Zhang et al. (2006) was missing five C-terminal amino acids from SOB5, whereas in

386 our case we used a full-length SOB5 protein fused to the carboxyl end of the CFP tag. Secondly,

387 we used confocal microscopy to detect CFP fluorescence, which is more reliable compared to

388 fluorescence microscopy used by Zhang et al. (2006). In addition, we generated and analyzed N-

389 terminal fusions in contrast to the C-terminal fusion described in Zhang et al., (2006). C-terminal

390 and N-terminal tagged proteins have been shown to show opposite localization patterns (Palmer

391 and Freeman 2004), a possibility that may also explain SOB5-GFP and CFP-SOB5 localization

392 pattern. In the future, both C-and N-terminal fusions of each protein should be examined to avoid

393 such potential problems.

394

395 Even though we used a plasmolysis assay to show that the three founding SOFLs were not

396 localized to the cell wall, we could not distinguish whether they were localized in the plasma

397 membrane, cytosol or both. To try and answer this question, we used a bioinformatics approach

398 by running the 289 SOFL homologue sequences against web-based transmembrane protein

399 topology prediction algorithms. TMHMM (Kahsay et al. 2005) and TMMOD (Kahsay et al.

400 2005), both hidden Markov prediction models, did not predict any transmembrane helices in all

401 289 SOFL orthologue sequences. This outcome could be verified experimentally by

402 discriminating between cytosolic and cell membrane protein using osmotic disruption of the

403 protoplast vacuole in hypotonic solution (Serna 2005). This method results in the diffusion of the

18

404 GFP signal from the cell periphery to the central part of the cell volume, an outcome that will not

405 occur when the protein under study is attached to the cell membrane (Serna 2005). Overall, these

406 data suggest that at least, the three founding Arabidopsis thaliana SOFLs and possibly several

407 other SOFL homologues may localize to the nucleus or cytoplasm.

408

409 CONCLUSION

410 The identification of at least 289 SOFL homologues creates an opportunity to study and

411 characterize this novel gene family in at least 89 flowering plant species. A combination of gain-

412 of-function, loss-of-function, protein-protein interaction and CK quantitation studies will go a

413 long way towards answering various questions raised by Zhang et al. (2006, 2009) about the

414 SOFL gene family. One critical question is whether CK nucleosides and nucleotides have

415 biological activity in plant growth and development, as suggested by Zhang et al., (2006, 2009).

416 Gain-of-function and CK-quantitation studies in selected species may be used to test the

417 hypothesis that overexpression of SOB5, AtSOFL1 and AtSOFL2 homologues can cause similar

418 CK-related phenotypes reported by Zhang et al., (2006, 2009). Results from such studies can

419 then be compared to higher order null mutants in which putative redundant SOFLs from the same

420 phylogenetic clades are knocked out. In addition, our site-directed mutagenesis studies suggested

421 that conserved SOFL-A and SOFL-B domains were important for function. Similar studies

422 involving mutagenesis of additional conserved amino acid residues in both Arabidopsis and other

423 species will likely provide new functional details regarding these poorly characterized domains.

424 Finally, even though our latest database searches suggest that SOFLs are a plant specific gene

425 family, the ever-increasing number of sequenced genomes will continue to test this hypothesis.

426

19

427 ACKNOWLEDGEMENTS

428 We thank Dr. Jingyu Zhang of the Institute of Botany, Chinese Academy of Sciences, Beijing,

429 China for assistance in providing background information regarding prior work. We also thank

430 Dr. Amit Dhingra and his lab members at Washington State University for help using the

431 Biolistic PDS- 1000/He Particle Delivery System during transient expression of CFP-protein

432 fusion in onion epidermal layers. We are grateful to Dr. Daniel Mullendore of the WSU

433 Franceschi Microscopy and Imaging Center for the guidance and assistance in using the confocal

434 microscope during sub-cellular localization analyses. This research was supported by the USDA

435 National Institute of Food and Agriculture, HATCH project 1007178 (to M.M.N.). This work

436 was also supported in part by the funds provided by the Program in WSU Molecular Plant

437 Sciences GPSI Fellowship (to R.T.).

438

439 Conflict of interest

440 The authors have no conflict of interest to declare.

441

442

443

444

445

446

447

448

449

20

450 Short legends for Supporting Information

451 Table S1. Primers used in this manuscript. List of primers that were used for genotyping,

452 construct development and measuring transcript accumulation.

453

454 Table S2. A total of 289 SOFL homologues were identified in at least 89 flowering plant

455 species. Founding Arabidopsis thaliana SOFL family members were used as queries to blast the

456 NCBI (Sayers et al. 2009) and Phytozome databases. SOFL gene family homologues were

457 retrieved from genome sequences of 89 monocotyledon and dicotyledon species. The total

458 number of SOFLs included in this list is not final and is likely to change over time as the

459 curation, annotation and sequencing of more genome sequences continue to improve and

460 increase.

461

462 Table S3. Subcellular localization prediction analysis. All 289 SOFL orthologue protein

463 sequences were analyzed through Wolf PSORT (Horton et al. 2007) and SeqNLS (Lin and Hu

464 2013) online databases. SeqNLS did not detect NLS in 14 SOFL homologues and classified an

465 additional 119 SOFLs as scoring below a cut-off threshold score to be confidently predicted as

466 nuclear localizing (containing residues in yellow, green, cyan, blue and navy-blue shades). Wolf

467 PSORT predicted 274 SOFLs out of 289 homologues to localize to the nucleus. Underlined

468 orange and red colored parts of the peptide sequence represent predicted NLSs.

469

470 Figure S1. AtSOFL1 is alternatively spliced into AtSOFL1.1 and AtSOFL1.2 splice variants.

471 (a) AtSOFL1.1 is the first splicing variant and represents At1g26210, previously thought to be an

472 intron-less gene, and includes a partially retained intron (dashed box and red solid line).

21

473 AtSOFL1.2 represents the second splicing variant and includes two predicted exons. The blue

474 box [a] and the green box [b] both represent the two predicted exons. The red solid line

475 represents an intron. The brown box represents 3’ UTR region. Small orange and purple regions

476 represent the SOFL-A and SOFL-B domains, respectively. (b) Agarose gel image showing a

477 AtSOFL1.1 PCR product amplified from genomic DNA and an alternatively spliced AtSOFL1.2

478 semi-quantitative PCR product. M = marker.

479

480 File S1. SOFL homologue protein sequences. List of 289 SOFL homologues retrieved from

481 NCBI and Phytozome database searches.

482

483 File S2. Annotation of AtSOFL1 and the two alternative splicing variants. Our BLASTP

484 search revealed a, (a) 168 amino acid hypothetical protein, GenBank accession number

485 AAG50669.1 with a predicted (b) 507 nucleotide long coding sequence (CDS), GenBank

486 accession number AC079829.6 from which we designed PCR primers (Table S1). To further

487 investigate this putative SOFL gene, we performed PCR using genomic DNA and

488 complementary DNA as templates and amplified products of (c) 986 and (d) 771 base pairs sizes

489 (which includes the 3’ UTR sequence), respectively. Based on the predicted CDS length we had

490 expected a PCR product of 507 base pairs if we used cDNA as template. To investigate this

491 anomaly, we sequenced the two amplified products. Sequence alignment of the genomic DNA

492 and cDNA PCR amplified products revealed that AtSOFL1 contains a 215-base pair intron

493 (highlighted in grey). Sequence alignment analysis showed that (e) AtSOFL1.1’s first 319

494 nucleotides out of 986 are 100% identical to (f) AtSOFL1 (At1G26210) gene’s CDS.

495 Furthermore, the terminal 128 nucleotides of AtSOFL1.1’s CDS are 100% identical to the first

22

496 segment of AtSOFL1’s 215 nucleotide long intron segment. This suggests that AtSOFL1’s 986

497 nucleotide-long gene produce two alternatively spliced variants; AtSOFL1.1 and AtSOFL1.2.

498 AtSOFL1.1 CDS is translated into a 148-amino acid long peptide and (g) AtSOFL1.2 CDS is

499 translated into a predicted 118 amino acid products. *represents a stop codon.

500

501 FIGURE LEGENDS

502 Figure 1. Phylogenetic analysis of 289 SOFL homologue protein sequences showed that the

503 gene family evolved into four main clades. (A) and (B) Annotated rectangular layout mid-point

504 root Mr. Bayes phylogenetic trees. Bayesian inference analysis was performed on 289 SOFL

505 homologues amino acid sequences with the Mr. Bayes 3.2.1 on XSEDE tool on CIPRES Science

506 Gateway for 30 million generations with 8 chains to run with convergence at 0.0099 (Miller et al.

507 2015).

508

509 Figure 2. Expression levels of Arabidopsis thaliana SOFL genes in various tissues. Total

510 RNA was isolated from; 6-day old seedlings (S), juvenile plants (J), rosette leaf (L), floral

511 structure (F), siliques (SQ), whole adult plant (WP) and roots (R) and used for RT-PCR analysis.

512 UBIQUITIN10 was used as an internal control for PCR. No reverse transcriptase (No RT) RNA

513 samples were used as negative controls.

514

515 Figure 3. Multiple sequence alignment and analysis of SOFL proteins. (a) Illustration

516 showing the topology of SOFL proteins. Purple rectangular blocks represent the two conserved

517 domains, SOFL-A and SOFL-B. (b) N-terminal amino acid sequence alignment of Arabidopsis

518 thaliana SOFL proteins. The alignment was obtained using JalView Probcons with Defaults

519 program (Malek 2001; Zhang 2003). Semi-conserved amino acids are indicated with a light

23

520 purple shade. Conserved amino acids are indicated by a darker purple shade. Red squares

521 indicate amino acid residues that were selected for site-directed mutagenesis. WebLogo©

522 analysis of conserved, (c) SOFL-A domain, and (d) SOFL-B domain, using all 289 SOFL protein

523 sequences (Lin and Hu 2013). Red asterisk denotes 100% conserved amino acid residues among

524 all 289 SOFL sequences.

525 Figure 4. Phenotypes of transgenic plants overexpressing full-length SOB5 and AtSOFL1

526 genes harboring point mutations generated via site-directed mutagenesis. (a) Wild-type and

527 sob5-D control plants compared to transgenic plants overexpressing the following SOB5 point

528 mutations: T21I, P61R and D33H. D33H mutation is in a non-conserved portion of SOB5 gene.

529 (b) Wild-type and 35S:AtSOFL1 control plants compared to transgenic plants overexpressing the

530 following AtSOFL1 point mutations: T23I, P84R and D53H. D53H is in a non-conserved portion

531 of AtSOFL1 gene. All plants were grown together in a growth chamber for two weeks. Three

532 independent transgenic lines were analyzed. This experiment was repeated three times with

533 similar outcomes.

534 Figure 5. Subcellular localization of CFP-SOB5 (a), CFP-AtSOFL1 (b), CFP-AtSOFL2 (c)

535 fusion proteins in onion epidermal cells. Most biochemical functions carried out in plant cells

536 are performed by proteins in specific cellular locations. Protein subcellular localization studies, via

537 fluorescent protein fusions, are a useful tool to narrow down cellular functions of novel or unknown

538 proteins. Onion epidermal peels were biolistically bombarded with CFP-protein fusion constructs

539 and incubated in the dark for 40 hours, then visualized under a confocal microscope. Each panel

540 shows, in a clockwise direction, CFP-protein fusion signal, free RFP signal, merge of CFP and

541 RFP signal and bright-field view showing outline of cells. White arrows indicate outline of

24

542 plasmolyzed membrane. Free RFP construct was used as a control for successful bombardment

543 as well as a localization marker. Plasmolysis was induced by treatment with 0.8 M mannitol.

544

545

546

547

548

549

550

551

552

553

554

555

556

557

558

559

560

561

562

563

564

25

565 LITERATURE CITED

566 Clough, S.J., and A.F. Bent, 1998 Floral dip: a simplified method for Agrobacterium-mediated

567 transformation of Arabidopsis thaliana. Plant J 16 (6):735-743.

568 Crooks, G.E., G. Hon, J.M. Chandonia, and S.E. Brenner, 2004 WebLogo: a sequence logo generator.

569 Genome Res 14 (6):1188-1190.

570 Earley, K.W., J.R. Haag, O. Pontes, K. Opper, T. Juehne et al., 2006 Gateway-compatible vectors for

571 plant functional genomics and proteomics. Plant J 45 (4):616-629.

572 Fang, H., M.E. Oates, R.B. Pethica, J.M. Greenwood, A.J. Sardar et al., 2013 A daily-updated tree of

573 (sequenced) life as a reference for genome research. Sci Rep 3:2015.

574 Goodstein, D.M., S. Shu, R. Howson, R. Neupane, R.D. Hayes et al., 2011 Phytozome: a comparative

575 platform for green plant genomics. Nucleic Acids Res 40 (Database issue):D1178-1186.

576 Guo, Y.L., 2012 Gene family evolution in green plants with emphasis on the origination and evolution of

577 Arabidopsis thaliana genes. Plant J 73 (6):941-951.

578 Higuchi, M., M.S. Pischke, A.P. Mahonen, K. Miyawaki, Y. Hashimoto et al., 2004 In planta functions of

579 the Arabidopsis cytokinin receptor family. Proceedings of the National Academy of Sciences of

580 the United States of America 101 (23):8821-8826.

581 Horton, P., K.J. Park, T. Obayashi, N. Fujita, H. Harada et al., 2007 WoLF PSORT: protein localization

582 predictor. Nucleic Acids Res 35 (Web Server issue):W585-587.

583 Jain, M., P. Khurana, A.K. Tyagi, and J.P. Khurana, 2008 Genome-wide analysis of intronless genes in

584 rice and Arabidopsis. Funct Integr Genomics 8 (1):69-78.

585 Kahsay, R.Y., G. Gao, and L. Liao, 2005 An improved hidden Markov model for transmembrane protein

586 detection and topology prediction and its applications to complete genomes. Bioinformatics 21

587 (9):1853-1858.

588 Kersey, P.J., J.E. Allen, I. Armean, S. Boddu, B.J. Bolt et al., 2016 Ensembl Genomes 2016: more

589 genomes, more complexity. Nucleic Acids Res 44 (D1):D574-580.

26

590 Kim, S.Y., 2006 The role of ABF family bZIP class transcription factors in stress response. Physiologia

591 Plantarum 126 (4):519-527.

592 Kim, S.Y., S.G. Kim, Y.S. Kim, P.J. Seo, M. Bae et al., 2007 Exploring membrane-associated NAC

593 transcription factors in Arabidopsis: implications for membrane biology in genome regulation.

594 Nucleic Acids Res 35 (1):203-213.

595 Kondou, Y., M. Higuchi, and M. Matsui, 2010 High-throughput characterization of plant gene functions

596 by using gain-of-function technology. Annu Rev Plant Biol 61:373-393.

597 Lamesch, P., T.Z. Berardini, D. Li, D. Swarbreck, C. Wilks et al., 2012 The Arabidopsis Information

598 Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res 40 (Database

599 issue):D1202-1210.

600 Lecharny, A., N. Boudet, I. Gy, S. Aubourg, and M. Kreis, 2003 Introns in, introns out in plant gene

601 families: a genomic approach of the dynamics of gene structure. J Struct Funct Genomics 3 (1-

602 4):111-116.

603 Lin, J.R., and J. Hu, 2013 SeqNLS: nuclear localization signal prediction based on frequent pattern

604 mining and linear motif scoring. PLoS One 8 (10):e76864.

605 Malek, J.A., 2001 Abundant protein domains occur in proportion to proteome size. Genome Biol 2

606 (9):RESEARCH0039.

607 Miller, M.A., T. Schwartz, B.E. Pickett, S. He, E.B. Klem et al., 2015 A RESTful API for Access to

608 Phylogenetic Tools via the CIPRES Science Gateway. Evol Bioinform Online 11:43-48.

609 Palmer, E., and T. Freeman, 2004 Investigation into the use of C- and N-terminal GFP fusion proteins for

610 subcellular localization studies using reverse microarrays. Comp Funct Genomics 5

611 (4):342-353.

612 Qu, L.J., and Y.X. Zhu, 2006 Transcription factor families in Arabidopsis: major progress and

613 outstanding issues for future research. Curr Opin Plant Biol 9 (5):544-549.

614 Roshan, U., 2014 Multiple sequence alignment using Probcons and Probalign. Methods Mol Biol

615 1079:147-153.

27

616 Sayers, E.W., T. Barrett, D.A. Benson, S.H. Bryant, K. Canese et al., 2009 Database resources of the

617 National Center for Information. Nucleic Acids Res 37 (Database issue):D5-15.

618 Serna, L., 2005 A simple method for discriminating between cell membrane and cytosolic proteins. New

619 Phytol 165 (3):947-952.

620 Vollrath, D., V.L. Jaramillo-Babb, M.V. Clough, I. McIntosh, K.M. Scott et al., 1998 Loss-of-function

621 mutations in the LIM-homeodomain gene, LMX1B, in nail-patella syndrome. Hum Mol Genet 7

622 (7):1091-1098.

623 Wang, Y., X. Wang, and A.H. Paterson, 2012 Genome and gene duplications and gene expression

624 divergence: a view from plants. Ann N Y Acad Sci 1256:1-14.

625 Wheeler, D.L., T. Barrett, D.A. Benson, S.H. Bryant, K. Canese et al., 2008 Database resources of the

626 National Center for Biotechnology Information. Nucleic Acids Res 36 (Database issue):D13-21.

627 Xie, Q., A.P. Sanz-Burgos, H. Guo, J.A. Garcia, and C. Gutierrez, 1999 GRAB proteins, novel members

628 of the NAC domain family, isolated by their interaction with a geminivirus protein. Plant Mol

629 Biol 39 (4):647-656.

630 Yan, H., X. Dai, K. Feng, Q. Ma, and T. Yin, 2016 IGDD: a database of intronless genes in dicots. BMC

631 Bioinformatics 17:289.

632 Yan, H., W. Zhang, Y. Lin, Q. Dong, X. Peng et al., 2014 Different evolutionary patterns among

633 intronless genes in maize genome. Biochem Biophys Res Commun 449 (1):146-150.

634 Zhang, J., R. Vankova, J. Malbeck, P.I. Dobrev, Y. Xu et al., 2009 AtSOFL1 and AtSOFL2 act

635 redundantly as positive modulators of the endogenous content of specific cytokinins in

636 Arabidopsis. PLoS One 4 (12):e8236.

637 Zhang, J., E.L. Wrage, R. Vankova, J. Malbeck, and M.M. Neff, 2006 Over-expression of SOB5 suggests

638 the involvement of a novel plant protein in cytokinin-mediated development. Plant Journal 46

639 (5):834-848.

640 Zhang, J.Z., 2003 Overexpression analysis of plant transcription factors. Curr Opin Plant Biol 6 (5):430-

641 440.

28

642 Zhao, J., D.S. Favero, J. Qiu, E.H. Roalson, and M.M. Neff, 2014 Insights into the evolution and

643 diversification of the AT-hook Motif Nuclear Localized gene family in land plants. BMC Plant

644 Biol 14 (1):266.

645 Zou, M., B. Guo, and S. He, 2011 The roles and evolutionary patterns of intronless genes in

646 deuterostomes. Comp Funct Genomics 2011:680673.

647

648

29

1 BnSOFL15 1 BnSOFL20 1 BrSOFL11 1 BnSOFL19 1 BnSOFL16 BrSOFL9 0.84 1 BnSOFL18 1 BnSOFL11 BrSOFL8 1 1 RsSOFL5 0.99 RsSOFL7 1 BnSOFL13 0.74 1 BrSOFL10 0.98 1 BnSOFL14 1 RsSOFL8 EsSOFL2 AaSOFL2 1 1 0.89 CsaSOFL9 CsaSOFL10 0.99 CsaSOFL8 1 CgSOFL3 0.79 1 1 CrSOFL1 1 AhSOFL2 0.8 AlSOFL7 1 AtSOFL2 1 BsSOFL3 ThSOFL3 1 CsaSOFL6 1 1 CsaSOFL7 1BsSOFL2 1 CgSOFL2 AhSOFL3 0.98 AlSOFL3 1 AtSOFL1 1 BnSOFL6 BnSOFL7 1 RsSOFL4 1 BnSOFL17 0.87 BrSOFL5 1 1 BrSOFL6 RsSOFL6 BnSOFL8 BrSOFL7 0.98 SlSOFL3 1 SpSOFL1 0.86 StSOFL1 1 StSOFL3 1 CanSOFL1 1 CanSOFL21 NtSOFL1 1 NsSOFL2 0.65 1 NaSOFL1 NaSOFL21 1 NtoSOFL2 1 NtoSOFL1 1 NsSOFL3 1 1 DhSOFL1 1 DhSOFL2 1 OeSOFL1 1 0.78 1 GaSOFL1 1 HiSOFL1 0.61 1 EgrSOFL1 CcaSOFL1 0.86 GarSOFL2 1 GhSOFL1 0.94 GhSOFL2 0.8 1 GrSOFL3 DzSOFL1 1 1 HuSOFL1 TcSOFL3 1 CcapSOFL2 0.88 1 CoSOFL2 0.95 CfSOFL1 CpSOFL1 FvSOFL1 1 HbSOFL1 1 MeSOFL4 0.99 1 1 PtSOFL1 0.61 1 PtSOFL3 1 1CsiSOFL1 CsiSOFL4 0.86 1 1 1 MdSOFL3 1 1MdSOFL5 1 PbSOFL5 0.88 1 PpSOFL31 1 PpSOFL5 MnSOFL2 1 1 1 GsSOFL3 0.99 GsSOFL51 1 1 1 LaSOFL2 1 LaSOFL3 Clade I 1 1 CarSOFL4 1 MeSOFL1 1 1 MeSOFL3 HbSOFL2 1 1 1 HbSOFL3 MeSOFL2 0.99 1 JcSOFL2 RcSOFL1 1 PeSOFL1 PtSOFL5 1 1 0.96 1 NtoSOFL4 1 NsSOFL1 1 SlSOFL2 1 1 StSOFL2 1 0.98 SlSOFL1 1 StSOFL4 1 NtoSOFL3 PgSOFL1 0.96 1 1 GhSOFL3 0.52 1 GrSOFL1 0.85 1 1 GarSOFL3 0.98 CcapSOFL1 1 CoSOFL1 1 1 DzSOFL2 TcSOFL2 1 GarSOFL1 0.61 1 GrSOFL2 0.65 1 CsiSOFL5 MnSOFL3 ZjSOFL2 1 0.93 GmSOFL1 0.8 1 GmSOFL6 1 GsSOFL1 0.98 GmSOFL7 1 GsSOFL4 1 1 1 VaSOFL2 1 1 VrSOFL2 0.71 1 PvuSOFL1 0.8 1 CcSOFL2 1 CarSOFL2 TsSOFL2 0.79 LaSOFL1 1 0.8 GmSOFL4 0.79 GmSOFL5 0.86 1 PvuSOFL2 CarSOFL1 1 CmSOFL1 1 CusaSOFL1 Clade II Continue with next page Continue with previous page 1 0.86 GmSOFL2 1 GsSOFL2 0.54 CcSOFL1 0.56 0.98 VaSOFL1 1 VrSOFL1 1 GmSOFL3 1 1 1 1 MtSOFL2 0.99 1 TsSOFL1 1 CarSOFL3 1 LaSOFL4 LaSOFL5 1 0.79 MdSOFL1 1 PbSOFL4 1 1 MdSOFL2 1 1 PbSOFL1 0.51 0.95 PmSOFL1 0.91 1 PpSOFL1 0.78 1 MnSOFL1 0.53 1 ZjSOFL1 0.95 1 CpSOFL2 1 VvSOFL1 1 VvSOFL2 0.98 JcSOFL1 1 1 MeSOFL5 RcSOFL21 1 1 PtSOFL2 PtSOFL4 1 MdSOFL4 1 MdSOFL7 1 PbSOFL2 0.99 1 MdSOFL6 1 PbSOFL3 PpSOFL2 0.81 PpSOFL4 0.81 1 PpSOFL6 1 PpSOFL7 1PmSOFL2 1 DcSOFL1 1 1 DcSOFL2 1 DcSOFL31 1 DcSOFL4 1 CsiSOFL2 CsiSOFL3 1 EgrSOFL2 1 1 EgrSOFL3 1 NnSOFL1 1 NnSOFL2 EguSOFL1 Clade III

1 1 BnSOFL2 1 1 BnSOFL3 1 1BrSOFL1 1 BrSOFL4 0.71 1 RsSOFL21 0.56 1BnSOFL5 1 1 BrSOFL3 RsSOFL11 0.94 1 1 BnSOFL4 BnSOFL21 0.93 0.98 0.83 BnSOFL1 1 1 BrSOFL2 BnSOFL22 1 1 RsSOFL3 1 1 AaSOFL1 1 CsaSOFL1 1 1 CsaSOFL2 CsaSOFL3 1 0.99 1 1 CsaSOFL4 1 1 BsSOFL1 1 1 CgSOFL1 1 CcSOFL3 1 1 MtSOFL1 AhSOFL1 1 CarSOFL7 1 AlSOFL1 1 SOB5 0.99 CClSOFL1 0.99 ThSOFL1 1 CsiSOFL6 ThSOFL2 1 1 JcSOFL3 0.61 1 1 MdSOFL8 1 MdSOFL9 0.8 MdSOFL10 0.99 1 MdSOFL11 1 MdSOFL12 CClSOFL2 1 0.68 1 MaSOFL5 0.61 1 MaSOFL6 1 1 MaSOFL4 1 AlSOFL6 1 AtSOFL5 1 CarSOFL5 1 CarSOFL6 0.51 AlSOFL5 1 AtSOFL6 1 1 CcarSOFL1 1 HaSOFL1 TcSOFL1 1 1 1 PviSOFL1 0.99 PviSOFL2 SiSOFL1 1 1 1ZmSOFL1 1 1 ZmSOFL2 1 OtSOFL1 1 1 OsSOFL2 1 OsSOFL3 1 1 ObSOFL1 BdSOFL2 1 AtaSOFL11 1 TuSOFL1 0.99 1PhSOFL1 1 1 1 PviSOFL3 1 1 PviSOFL4 1 SiSOFL2 SvSOFL1 1 1 ZmSOFL3 1 ZmSOFL4 1 1 1 SbSOFL1 0.53 BdSOFL1 1 BstSOFL1 1 ObSOFL2 0.98 OsSOFL1 1 1 0.94 CgSOFL4 0.99 CrSOFL2 CsaSOFL5 0.94 AlSOFL4 0.59 1 AtSOFL3 1 BnSOFL10 1 BnSOFL12 0.98 EsSOFL1 0.89 1 BnSOFL9 1 1 AlSOFL2 1 AtSOFL4 1 CgSOFL5 1 ThSOFL4 1 AcSOFL1 InSOFL1 1 1 1 EguiSOFL1 1 1 EguiSOFL4 1 EguiSOFL5 1 1 PdSOFL1 1 1 1 1 PdSOFL2 1 PdSOFL5 1 EguiSOFL2 1 0.96 1 MaSOFL1 1 PdSOFL3 0.92 PdSOFL4 EguiSOFL3 1 MaSOFL2 Clade VI MaSOFL3 S J L F SQ WP R

AtSOFL6

AtSOFL5

AtSOFL4

AtSOFL3

AtSOFL2

AtSOFL1

SOB5

UBQ

UBQ No RT A

B

C * * * * *

D * * * *

Col-0 sob5-D T21I A P61R D33H

B Col-0 35S:AtSOFL1 T23I P84R D53H A

B

C