<<

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 Re-annotated benthamiana gene models for 2 enhanced proteomics and reverse genetics 3 4 Jiorgos Kourelis1, Farnusch Kaschani2, Friederike M. Grosse-Holz1, Felix Homma1, 5 Markus Kaiser2, Renier A. L. van der Hoorn1 6 1Plant Chemetics Laboratory, Department of Sciences, University of Oxford, South Parks Road, 7 OX1 3RB Oxford, UK; 2Chemische Biologie, Zentrum fur Medizinische Biotechnologie, Fakultät für 8 Biologie, Universität Duisburg-Essen, Essen, Germany.

9 Nicotiana benthamiana is an important model organism and representative of the 10 (Nightshade) family. N. benthamiana has a complex ancient allopolyploid 11 genome with 19 chromosomes, and an estimated genome size of 3.1Gb. Several draft 12 assemblies of the N. benthamiana genome have been generated, however, many of the 13 gene-models in these draft assemblies appear incorrect. Here we present a nearly non- 14 redundant database of 42,855 improved N. benthamiana gene-models. With an 15 estimated 97.6% completeness, the new predicted proteome is more complete than the 16 previous proteomes. We show that the database is more sensitive and accurate in 17 proteomics applications, while maintaining a reasonable low gene number. As a proof- 18 of-concept we use this proteome to compare the leaf extracellular (apoplastic) 19 proteome to a total extract of leaves. Several gene families are more abundant in the 20 apoplast. For one of these apoplastic protein families, the subtilases, we present a 21 phylogenetic analysis illustrating the utility of this database. Besides proteome 22 annotation, this database will aid the research community with improved target gene 23 selection for genome editing and off-target prediction for gene silencing. 24 25 Keywords: Solanaceae // Genome annotation // Nicotiana benthamiana // Proteomics // Subtilases 26 27 Database availability. The database has been uploaded at Oxford Research Archives (ORA) and can be 28 downloaded from this link: https://deposit.ora.ox.ac.uk/datasets/uuid:f09e1d98-f0f1-4560-aed4-a5147bc7739d. 29 We hope that this database will be accessible via the SolGenomics database in the near future. 30 31 Introduction 56 includes important crops such as potato (Solanum 32 Nicotiana benthamiana has risen to prominence as 57 tuberosum), tomato (Solanum lycopersicum), 33 a model organism for several reasons. First, N. 58 eggplant (Solanum melongena), and pepper 34 benthamiana is highly susceptible to viruses, 59 (Capsicum ssp.), as well as (Nicotiana 35 resulting in highly efficient virus-induced gene- 60 tabacum) and petunia (Petunia ssp.). 36 silencing (VIGS) for rapid reverse genetic screens 61 N. benthamiana belongs to the 37 (Senthil-Kumar and Mysore, 2014). This 62 Suaveolentes section of the Nicotiana genus, and 38 hypersusceptibility to viruses is due to an ancient 63 has an ancient allopolyploid origin (>10Mya) 39 disruptive mutation in the RNA-dependent RNA 64 accompanied by chromosomal re-arrangements 40 polymerase 1 gene (Rdr1), present in the lineage 65 resulting in a complex genome with 19 41 of N. benthamiana which is used in laboratories 66 chromosomes in the haploid genome – reduced 42 around the world (Bally et al., 2015). Reverse 67 from the ancestral allotetraploid 24 chromosomes 43 genetics using N. benthamiana have confirmed 68 - and an estimated haploid genome size of ~3.1Gb 44 many genes important for disease resistance (Wu 69 (Leitch et al., 2008; Goodin et al., 2008; Wang 45 et al., 2017; Senthil‐ Kumar et al., 2018). 70 and Bennetzen, 2015). There are four independent 46 Additionally, N. benthamiana is highly amenable 71 draft assemblies of the N. benthamiana genome 47 to the generation of stable transgenic lines 72 (Bombarely et al., 2012; Naim et al., 2012), as 48 (Clemente, 2006; Sparkes et al., 2006) and to 73 well as a de-novo transcriptome generated from 49 transient expression of transgenes (Goodin et al., 74 short-read RNAseq (Nakasugi et al., 2014). These 50 2008). This easy manipulation has facilitated rapid 75 datasets have greatly facilitated research in N. 51 forward genetic screens and has established N. 76 benthamiana, allowing for efficient prediction of 52 benthamiana as the plant bioreactor of choice for 77 off-targets of VIGS (Fernandez-Pozo et al., 2015) 53 the production of biopharmaceuticals (Stoger et 78 and genome editing using CRISPR/Cas9 (Liu et 54 al., 2014). Finally, N. benthamiana is a member of 79 al., 2017), as well as RNAseq and proteomics 55 the Solanaceae (Nightshade) family which 80 studies (Grosse-Holz et al., 2018). These draft

1

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

81 assemblies are, however, several years old and 141 redundancy using CD-HIT-EST and combined the 82 gene annotations have not been updated since. 142 gene-models into a single database (NbA) 83 Furthermore, in the course of our research we 143 containing 41,651 gene-models (Figure 1, Step 84 realized that many of the gene models in these 144 3). We next compared the predicted proteome 85 draft assemblies are incorrect, and that putative 145 derived from our NbA database against the 86 pseudo-genes are often annotated as protein- 146 published predicted proteomes using a proteomics 87 encoding genes. This is exacerbated because these 147 dataset from a full-leaf extract or apoplastic fluid 88 draft assemblies are highly fragmented and that N. 148 (samples described further in the manuscript). 89 benthamiana has a complex origin. Furthermore, 149 Proteins for which peptides were identified in the 90 the de-novo transcriptome assembly has a high 150 published proteomes but not in our NbA database 91 proportion of chimeric transcripts. Because of 151 were extracted and re-annotated in the draft 92 incorrect annotations, extensive processing is 152 genomes as described above and added to our 93 required to select target genes for reverse genetic 153 NbA database resulting in the NbB database 94 approaches such as gene silencing and editing, or 154 containing 42,884 gene-models (Figure 1; Step 4, 95 for phylogenetic analysis of gene families. We 155 1233 additional entries). Finally, missing 96 realized that gene annotation in several other 156 BUSCOs (Benchmarking Universal Single-Copy 97 species in the Nicotiana genus were much better 157 Orthologues) (Simão et al., 2015; Waterhouse et 98 (Xu et al., 2017; Sierro et al., 2013; Sierro et al., 158 al., 2018) were re-annotated in our database as 99 2014), and decided to re-annotate the available N. 159 above, together with the manual curation of 100 benthamiana draft genomes using these gene 160 several gene-families in which several duplicated 101 models as a template. The gene models obtained 161 genes were removed (see Material & Methods) to 102 in this way were extracted into a single non- 162 obtain our final NbC database containing 42,853 103 redundant database with improved gene models. 163 entries (Figure 1; Step 5). 104 Here we show that this database is more accurate 164 105 and sensitive for proteomics, facilitates 165 The new proteome database is more complete, 106 phylogenetic analysis of gene families, and may 166 more sensitive and accurate, and relatively small 107 be useful for genome editing and VIGS on- and 167 We next compared the predicted proteome 108 off-target prediction. 168 database to the published predicted proteomes. 109 169 We also included our Nicotiana_db95 proteome 110 Results and Discussion 170 database in this comparison. The published 111 Re-annotation of gene-models in the N. 171 proteomes included the predicted proteomes from 112 benthamiana genome assemblies 172 the Niben0.4.4 and Niben1.0.1 draft genomes, a 113 For the annotation of gene-models in the different 173 previously described curated database in which 114 N. benthamiana draft genomes, we chose to use 174 gene-models from Niben1.0.1 were corrected 115 Scipio (Keller et al., 2008). Scipio refines the 175 using RNAseq reads (Grosse-Holz et al., 2018), 116 transcription start-site, exon-exon boundaries, and 176 and the predicted proteome derived from the de- 117 the stop-codon position of protein sequences 177 novo transcriptomes (Nakasugi et al., 2014). 118 aligned to the genome using BLAT (Keller et al., 178 We used BUSCO (Simão et al., 2015; 119 2008). Importantly, given that the input protein 179 Waterhouse et al., 2018) as a quality measure to 120 sequences are well-annotated, this method is more 180 estimate the completion of our database as 121 accurate and sensitive than other gene prediction 181 compared to the published predicted proteomes 122 methods (Keller et al., 2011). Because the 182 (Figure 2a). The BUSCO set used contains 1440 123 efficiency of this process correlates with 183 highly conserved plant genes which are expected 124 phylogenetic distance, we took the predicted 184 to be predominantly found in a single-copy 125 protein sequences from recently sequenced 185 (Simão et al., 2015). Nicotiana_db95 has one 126 Nicotiana species (Figure 1) (Sierro et al., 2013; 186 fragmented and nine missing BUSCOs, indicating 127 Xu et al., 2017). We then used CD-HIT at a 95% 187 that at best we should be able to identify 99.3% of 128 identity cut-off to reduce the redundancy in this 188 the N. benthamiana genes using this database 129 database and additionally to remove partial 189 (Figure 2a). In our NbB database, 1406/1440 130 sequences (Figure 1, Step 1). The resulting 190 (97.6%) BUSCO proteins were identified as 131 database – Nicotiana_db95 – contains 85,453 191 complete, of which 762 were single-copy (54.2%), 132 protein sequences from various Nicotiana species. 192 644 were duplicated (45.8%), nine sequences 133 We used the protein sequences in this database as 193 were fragmented (0.63%), and 25 were missing 134 an input to annotate gene-models in the four 194 (1.74%) (Figure 2a). The high number of 135 independent draft assemblies of the N. 195 duplication is likely due to either technical 136 benthamiana genome using Scipio (Figure 1, Step 196 duplication generated by small variations between 137 2). As the available N. benthamiana draft genomes 197 the different draft genomes, or genuinely 138 are highly fragmented and each individual draft 198 duplicated genes arising from the allo-tetraploid 139 genome may miss a number of genes, we extracted 199 origin of N. benthamiana. By adding missing 140 the gene-models generated by Scipio, filtered for 200 BUSCOs into the NbC database, we recovered

2

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

201 eight of the nine fragmented BUSCOs and ten of 261 1; Step 4), we independently validated our 202 the 25 missing BUSCOs. In comparison, the next 262 database on an independent dataset where we re- 203 most complete, previously published proteome is 263 analysed a previously published apoplastic 204 the predicted proteome from the Nbv5.1 primary 264 proteome of agro-infiltrated N. benthamiana as 205 + alternate transcriptome, which has 12 265 compared to non-infiltrated N. benthamiana 206 fragmented and 32 missing BUSCOs, but it also 266 (PRIDE repository PXD006708) (Grosse-Holz et 207 has nearly five times more proteins than our 267 al., 2018). The new NbC database was also more 208 database and 71.8% of BUSCOs are duplicated. 268 sensitive and accurate than the Curated database 209 Next, we investigated the number of 269 on this dataset (18,430 vs 17,960 peptides 210 unique PFAM identifiers found with each entry in 270 detected, 22.5%±/-3.1% vs 21.7%±3.0% spectra 211 each proteome, as an estimation of the number of 271 identified). 212 proteins incorrectly annotated (Figure 2b). We 272 Finally, since phylogenetic analysis of 213 expect that miss-annotated sequences and 273 gene families in closely related species often relies 214 fragmented gene products are less likely to get a 274 on gene-annotations, we compared the predicted 215 PFAM annotation. Indeed, significantly more 275 proteome from our NbB database against the 216 proteins get at least one PFAM identifier in our 276 predicted proteomes of Solanaceae species for 217 three databases as compared to the published 277 which genomes have been sequenced (Figure 218 proteomes, indicating that proteins in our database 278 S1a,b). Our NbB proteome compares well to the 219 are better annotated. 279 predicted proteomes of other sequences 220 Furthermore, we looked at the length 280 Solanaceae species. Additionally, since the 221 distributions of proteins in the different predicted 281 predicted proteomes of some of these species miss 222 proteomes (Figure 2c). We reasoned that the 282 a relatively high proportion of genes (up to 28.5% 223 protein-length distribution should be similar to 283 of genes missing or fragmented), care must be 224 that of the Nicotiana_db95 database. The proteins 284 taken to not over-interpret results derived from 225 in the final proteome are significantly longer than 285 phylogenetic analysis using these sequences. 226 those in the Niben0.4.4, Niben1.0.1, and manually 286 227 curated proteome (Grosse-Holz et al., 2018) while 287 Improved annotation of the apoplastic proteome 228 the proteins in the Nbv5.1 primary + alternate 288 of N. benthamiana 229 proteome are on average larger than in our final 289 Next we used our final NbC database to analyse 230 NbC database. We speculate that the Niben0.4.4 290 the extracellular protein repertoire of the N. 231 and Niben1.0.1 predicted proteomes contain many 291 benthamiana apoplast. The plant apoplast is the 232 pseudo-genes which are annotated as protein- 292 primary interface in plant-pathogen interactions 233 encoding as well as partial genes (Figure 2c), 293 (Misas-Villamil and van der Hoorn, 2008; 234 while the Nbv5.1 primary + alternative proteome 294 Doehlemann and Hemetsberger, 2013) and 235 has a high proportion of chimeric sequences which 295 apoplastic proteins include many enzymes 236 due to the short-read sequencing techniques used 296 potentially important in plant-pathogen 237 are biased towards long transcripts (Figure 2c). 297 interactions. We found the protein composition of 238 Additionally, the curated proteome has a large 298 leaf apoplastic fluid (AF) to be distinct from that 239 proportion of very small proteins and 47.3% of 299 of a leaf total extract (TE) (Figure 3a). We 240 genes do have a PFAM annotation, which we 300 considered proteins apoplastic when only detected 241 speculate is due to partial sequences or spurious 301 in the AF samples or those with a log2 fold 242 small ORFs being annotated as protein-encoding 302 abundance difference ≥1.5 and a p-value cut-off 243 (Figure 2b,c). 303 off ≤0.01 (BH-adjusted moderated t-test) in the 244 Finally, comparing the different 304 comparison of AF vs TE (518 proteins). Similarly, 245 proteomes on a proteomics dataset indicates the 305 we considered proteins intracellular when found 246 new database has the highest sensitivity, with the 306 only in the TE or those proteins with a log2 fold 247 highest percentage of annotated MS/MS spectra in 307 abundance difference ≤-1.5 with a p-value cut-off 248 both tested samples, while it has the fewest entries 308 off ≤0.01 in the comparison of AF vs TE (1042 249 (Figure 2d). Additionally, using our new NbC 309 proteins) (Figure 3b). The remainder proteins was 250 database, we identify the highest number of 310 considered both apoplastic and intracellular (832 251 unique peptides identified in at least 3 out of 4 311 proteins). As expected, the apoplastic proteome is 252 biological replicates of both proteomes (Figure 312 significantly enriched for signal peptide 253 2d). These metrics combined indicate that the new 313 containing proteins, while the intracellular 254 NbC database is more sensitive and accurate for 314 proteins and proteins present both in the apoplast 255 proteomics than the currently available databases. 315 and intracellular are significantly enriched for 256 Importantly, this does not come at the cost of 316 proteins lacking a signal peptide (BH-adjusted 257 increased redundancy, which would hinder 317 hypergeometric test, p<0.001). 258 downstream applications. 318 Proteins considered predominantly 259 Since the previous proteomics dataset 319 intracellular are enriched for GO-SLIM terms 260 was also used to re-annotate gene-models (Figure 320 associated with translation, photosynthesis and

3

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

321 transport as biological processes (Figure 3c), and 381 aspartic acid (Figure S2, and Figure S3, 322 a similar pattern is seen for the molecular function 382 Reichardt et al., 2018). 323 terms (Figure 3d). Proteins present both in TE and 383 During this analysis we discovered three 324 in AF are enriched for GO-SLIM terms associated 384 subtilisin genes that are missing in our NbB 325 with biosynthetic processes, and homeostasis 385 database, and six incomplete sequences lacking 5- 326 (Figure 3c). These processes usually performed 386 107 amino acids. In addition, five putative pseudo- 327 by proteins acting at multiple subcellular 387 genes were annotated as protein-encoding genes 328 localizations. The apoplastic proteome is enriched 388 and were removed from the final NbC database, 329 for proteins acting in catabolic processes and 389 and 18 subtilase genes were found to be duplicated 330 carbohydrate and lipid metabolic processes 390 and these duplicates were removed in the final 331 (Figure 3c), which is reflected in the enrichment 391 NbC database (Table S3). In comparison, the 332 of peptidases and glycosidases (Figure 3d, Table 392 Niben1.0.1 genome annotation predicts 103 333 S1 for a full list). 393 different subtilisin gene products. However, we 334 To specify which peptidases are enriched 394 found that these annotated genes correspond to 38 335 in the apoplast, we also annotated the proteome 395 pseudo-genes and 49 protein-encoding genes - 336 with MEROPS peptidase identifiers (Rawlings et 396 none of which are correctly annotated - while 16 337 al., 2018). Three of the 15 different families of 397 subtilisin genes are absent from Niben1.0.1 338 peptidases detected in the apoplast have 398 (Table S3). Furthermore, the predicted proteome 339 significantly more members enriched in the AF as 399 from the Nbv5.1 primary+alternate transcriptome 340 compared to TE, namely the subtilase (S08; 13 400 contains more than 400 subtilisin gene products, 341 members, p<0.001), serine carboxypeptidase-like 401 largely due to a large number of chimeric 342 (S10; 8 members, p<0.01), and aspartic peptidase 402 sequences. In conclusion, the new database 343 families (A01; 16 members, p<0.001), while the 403 represents a significant improvement over 344 proteasome is enriched in the intracellular fraction 404 previous genome annotations and facilitates more 345 (T01; 27 members, p<0.001) (BH-adjusted 405 accurate and meaningful phylogenetic analysis of 346 hypergeometric test, Table S2 for a full list). 406 gene families in N. benthamiana. 347 407 348 Pseudogenization in the subtilisin family is 408 Improved accuracy for genome editing: the 349 consistent with a contracting functional genome 409 subtilase gene-family 350 One of the gene families found enriched in the 410 Target selection for genome editing is improved 351 apoplast is the subtilisin family. Several subtilisins 411 by the use of our new database for several reasons: 352 are implicated in immunity, notably the tomato 412 1) gene-models in this database are more 353 P69 clade of subtilisins (Taylor and Qiu, 2017). In 413 complete; 2) fewer pseudo-genes are annotated as 354 order to estimate the completeness of our 414 protein-encoding genes; 3) gene duplication is 355 database, we manually verified and corrected 415 reduced as compared to the de-novo 356 genes belonging to the subtilisin gene family. Our 416 transcriptome; and 4) the remaining duplication in 357 NbC database contains 64 complete subtilisin 417 our database is easily resolved for genes of interest 358 genes, and one partial gene. By searching the 418 as it mostly involves genes with slight sequence 359 Niben1.0.1 and Niben0.4.4 genome assemblies, 419 variations between the different draft genomes. 360 we identified an additional 43 putative subtilisin 420 These sequence variations may be due to 361 pseudo-genes which had internal stop-codons and 421 heterozygosity or technical artefacts of the 362 are therefore likely non-coding. 422 sequencing and assembly. As an example we 363 Interestingly, phylogenetic analysis 423 show the gene-model of one of the subtilisins in 364 shows that close paralogs are often 424 the different databases. In our NbC database, this 365 pseudogenised. This pattern of pseudogenization 425 subtilisin is encoded by a single-exon gene-model 366 in the subtilisin gene family is consistent with a 426 of 2,268bp encoding for a 756 amino acid protein 367 contracting functional genome upon 427 (Figure 5a). This subtilisin is highly fragmented 368 polyploidization, where for each functional 428 in the Niben1.0.1 genome assembly, with parts of 369 protein-encoding gene there is a corresponding 429 the sequence present on different contigs, while 370 pseudo-gene (Figure 4, and Figure S2). 430 the gene is only partially annotated (Figure 5b). 371 Remarkably, no SBT3 clade family members were 431 The last 90bp of this gene are not annotated in the 372 identified in N. benthamiana (Figure S3). Finally, 432 Nbv0.5 genome (Figure 5c). Furthermore, there is 373 we looked for the amino acid residue at the pro- 433 a 132bp insertion in the Niben0.4.4 genome 374 domain junction, as the presence of an aspartic 434 assembly resulting in a predicted protein with a 44 375 acid residue is indicative of phytaspase activity 435 amino-acid insertion (Figure 5d). Additionally, 376 (Reichardt et al., 2018). Three N. benthamiana 436 we identified 13 sequences corresponding to 377 subtilisins may possess phytaspase activity based 437 partial or chimeric variants of this subtilisin are 378 on the presence of an apartic acid residue at the 438 present in the Nb5.1 primary + alternate predicted 379 pro-domain junction as well as a histidine residue 439 proteome using BLAST with no full match. 380 in the S1 pocket which is thought to bind to P1 440 Finally, this subtilisin differs by three non-

4

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

441 synonymous SNPs between the Niben1.0.1 and 501 genomes were made available by the Waterhouse 442 Nbv0.5 genome assemblies, while two of these 502 lab at the Queensland University of Technology 443 non-synonymous SNPs are present in the 503 (Naim et al., 2012). 444 Niben0.4.4 genome assembly (Figure 5b,d). In 504 Annotation - In order to extract gene- 445 conclusion, this example displays how combining 505 models from the published N. benthamiana draft 446 gene-models derived from different genome 506 genomes we combined all the Nicotiana protein 447 assemblies has made our database more complete 507 sequences, except for those from N. obtusifolia, in 448 than annotating any single genome assembly 508 one database, with the addition of 110 genes 449 currently available. 509 which we had previously manually curated 450 Although our NbC database does not 510 leading to a database with 226,543 protein 451 contain the genomic context and lacks non-coding 511 sequences. We used CD-HIT (v4.6.8) (Fu et al., 452 genes, this database will vastly improve research 512 2012) to cluster these sequences at a 95% identity 453 on N. benthamiana. We trust our NbC database to 513 threshold and reduce the redundancy in our 454 be useful for the large research community of 514 database while removing partials 455 plant scientists using N. benthamiana as a model 515 (Nicotiana_db95; 85,453 sequences). This 456 system, for example to identify novel interactors 516 database was used to annotate the gene-models in 457 in Co-IP experiments, but also to facilitate reverse 517 the different N. benthamiana genome builds using 458 genetic approaches such as genome editing and 518 Scipio version 1.4.1 (Keller et al., 2008) which 459 VIGS. 519 was run with default settings. After running Scipio 460 520 we used Augustus (v3.3) (Stanke et al., 2006) to 461 Material & Methods 521 extract complete and partial gene models. Putative 462 Sequence retrieval - The predicted proteomes for 522 pseudo-genes (containing internal stop codons) 463 N. attenuata (GCF_001879085.1) (Xu et al., 523 and genes lacking an ATG start or stop codon 464 2017) (http://nadh.ice.mpg.de/NaDH/), N. 524 were stored separately. Transdecoder (v5.0.2) 465 tabacum TN90 (GCF_000715135.1) (Sierro et al., 525 (Haas et al., 2013) was used to retrieve the single- 466 2013), N. sylvestris (GCF_000393655.1) (Sierro 526 best ORF on the putative pseudo-genes containing 467 et al., 2013) and N. tomentosiformis 527 homology to the Nicotiana_db95 database as 468 (GCF_000390325.2) (Sierro et al., 2013), and 528 determined by BlastP searches. If a putative 469 Daucus carota subsp. sativus 529 pseudo-gene contained an ORF >90% of the 470 (GCA_001625215.1) (Iorizzo et al., 2016) were 530 annotated gene length and lacking <30 amino 471 downloaded from Genbank. In addition, we 531 acids it was considered a putative gene. Other 472 retrieved 565 full-length N. benthamiana protein 532 putative pseudo-genes were discarded. Next we 473 sequences from Genbank. The Arabidopsis 533 used CD-HIT-EST to filter the redundancy from 474 thaliana predicted proteome 534 this database. First, we used CD-HIT-EST to 475 (Araport11_genes.201606.pep) was obtained 535 cluster the CDS derived from the gene-models 476 from Araport (Cheng et al., 2017). The Solanum 536 derived from the different databases at 100% 477 melongena predicted proteome 537 identity. Next, we selected the longest sequence at 478 (SME_r2.5.1_pep) was obtained from the 538 99% identity between the different genome builds 479 Eggplant Genome DataBase (Hirakawa et al., 539 using CD-HIT-EST-2D in the following order for 480 2014). The N. obtusifolia (NIOBT_r1.0) predicted 540 both the complete and the partial databases: 481 proteome was obtained from the Nicotiana 541 Niben1.0.1 > Nbv0.5 > Niben0.4.4 > Nbv0.3. 482 attenuata Data Hub (Xu et al., 2017) 542 Since sequences which are smaller are maintained 483 (http://nadh.ice.mpg.de/NaDH/). The Petunia 543 like this we used the reduced databases in the 484 axillaris N (Petunia_axillaris_v1.6.2_proteins) 544 opposite direction to remove partial genes: Nbv0.3 485 and P. inflata S6 545 > Niben0.4.4 > Nbv0.5 > Niben1.0.1. Finally we 486 (Petunia_inflata_v1.0.1_proteins) (Bombarely et 546 used CD-HIT-EST-2D to remove genes from the 487 al., 2016), Capsicum annuum glabriusculum 547 partial database with a longer representative in the 488 (CaChiltepin.pep) and C. annuum zunla-1 548 complete database at 99% identity and vice-versa. 489 (CaZL1.pep) (Qin et al., 2014), C. annuum cv 549 This resulted in the NbA database. We compared 490 CM334 (Pepper.v.1.55.proteins.annotated) (Kim 550 this database for proteomic analysis on the 491 et al., 2014), Solanum tuberosum 551 described proteomics dataset containing 492 (PGSC_DM_v3.4_pep) (Consortium, 2011), and 552 apoplastic fluid (AF) samples and full-leaf extract 493 Solanum lycopersicum (ITAG3.2_proteins) 553 (TE) samples and compared its performance to the 494 (Consortium, 2012) predicted proteomes were 554 other published predicted proteomes. For this 495 downloaded from Solgenomics. The N. 555 analysis we predicted the Nbv5.1 proteome from 496 benthamiana draft genome builds Niben1.0.1 and 556 the transcriptome using Transdecoder and 497 Niben0.4.4 - both generated by the Boyce 557 selecting the single-best ORF with homology to 498 Thompson Institute for Plant Research (BTI) 558 the Nicotiana_db95 database, and filtered the 499 (Bombarely et al., 2012) - were downloaded from 559 database using CD-HIT at 100% identity. Proteins 500 Solgenomics, and the Nbv0.5 and Nbv0.3 draft 560 for which peptides were identified in the other

5

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

561 databases but absent from the NbA database 621 concentration to 25mM. Protein digestion was 562 search were extracted, clustered at 100% using 622 started by addition of 1000ng LysC (Wako 563 CD-HIT, and re-annotated in the genomes as 623 Chemicals GmbH) and incubation for 3hr at 37°C 564 above. This resulted in the NbB database. Finally, 624 while gently shaking (800rpm). The samples were 565 we ran BUSCO (v3.0.2; dependencies: NCBI- 625 then diluted with ammoniumbicarbonate (final 566 BLAST v2.7.1+; HMMER v3.1; Augustus v3.3) 626 concentration 80mM) to a final urea concentration 567 (Simão et al., 2015; Waterhouse et al., 2018) on 627 of 1M. 1000ng Sequencing grade Trypsin 568 the different N. benthamiana predicted proteomes 628 (Promega) was added and the samples were 569 using the set (Embryophyta_odb9), 629 incubated overnight at 37°C while gently shaking 570 extracted the missing BUSCOs and re-annotated 630 (800rpm). Protein digestion was stopped by 571 these as above. Additionally, we manually 631 addition of formic acid (FA, final 5% v/v). Tryptic 572 inspected the database for the PLCP, subtilisin, 632 digests were desalted on home-made C18 573 VPE, and GH35-domain encoding gene families, 633 StageTips (Rappsilber et al., 2007) by passing the 574 and manually removed redundant sequences. This 634 solution over 2 disc StageTips in 150µL aliquots 575 resulted in our final database. This database was 635 by centrifugation (600-1200× g). Bound peptides 576 annotated using SignalP (v4) (Petersen et al., 636 were washed with 0.1% FA and subsequently 577 2011), ApoplastP (v1.0.1) (Sperschneider et al., 637 eluted with 80% Acetonitrile (ACN). Using a 578 2018), and PFAM (v31) (Finn et al., 2016). 638 vacuum concentrator (Eppendorf) samples were 579 Finally we annotated the predicted proteome with 639 dried, and the peptides were resuspended in 20 µL 580 GO terms and UniProt identifiers using Sma3s v2 640 0.1% FA solution. 581 (Casimiro-Soriguer et al., 2017). 641 LC-MS/MS - The samples were analysed 582 Sample preparation for proteomics and 642 as in (Grosse-Holz et al., 2018). Briefly, samples 583 definition of biological replicates - Four-week old 643 were run on an Orbitrap Elite instrument (Thermo) 584 N. benthamiana plants were used. The AF was 644 (Michalski et al., 2011) coupled to an EASY-nLC 585 extracted by vacuum infiltrating N. benthamiana 645 1000 liquid chromatography (LC) system 586 leaves with ice-cold MilliQ. Leaves were dried to 646 (Thermo) operated in the one-column mode. 587 remove excess liquid, and apoplastic fluid was 647 Peptides were directly loaded on a fused silica 588 extracted by centrifugation of the leaves in a 20 ml 648 capillary (75µm × 30cm) with an integrated 589 syringe barrel (without needle or plunger) in a 50 649 PicoFrit emitter (New Objective) analytical 590 ml falcon tube at 2000x g, 4°C for 25min. Samples 650 column packed in-house with Reprosil-Pur 120 591 were snap-frozen in liquid nitrogen and stored at - 651 C18-AQ 1.9 µm resin (Dr. Maisch), taking care to 592 80°C prior to use. TE was collected by removing 652 not exceed the set pressure limit of 980 bar 593 the central vein and snap-freezing the leaves in 653 (usually around 0.5-0.8µl/min). The analytical 594 liquid nitrogen followed by grinding in a pestle 654 column was encased by a column oven (Sonation; 595 and mortar and addition of three volumes of 655 45°C during data acquisition) and attached to a 596 phosphate-buffered saline (PBS) (w/v). One 656 nanospray flex ion source (Thermo). Peptides 597 biological replicate was defined as a sample, AF 657 were separated on the analytical column by 598 or TE, consisting of one leaf from three 658 running a 140-min gradient of solvent A (0.1% FA 599 independent plants (3 leaves total). Four 659 in water; ; Ultra-Performance Liquid 600 independent biological replicates were taken for 660 Chromatography (UPLC) grade) and solvent B 601 AF and TE. 661 (0.1% FA in ACN; UPLC grade) at a flow rate of 602 Protein digestion and sample clean-up - 662 300nl/min (gradient: start with 7% B; gradient 7% 603 AF and TE sample corresponding to 15μg of 663 to 35% B for 120 min; gradient 35% to 100% B 604 protein was taken for each sample (based on 664 for 10 min and 100% B for 10 min) at a flow rate 605 Bradford assay). Dithiothreitol (DTT) was added 665 of 300 nl/min.). The mass spectrometer was 606 to a concentration of 40mM, and the volume 666 operated using Xcalibur software (version 2.2 607 adjusted to 250μl with MS-grade water (Sigma). 667 SP1.48) in positive ion mode. Precursor ion 608 Proteins were precipitated by the addition of 4 668 scanning was performed in the Orbitrap analyzer 609 volumes of ice-cold acetone, followed by a 1hr 669 (FTMS; Fourier Transform Mass Spectrometry) 610 incubation at -20°C and subsequent cetrifugation 670 in the scan range of m/z 300-1800 and at a 611 at 18,000 g, 4°C for 20min. The pellet was dried 671 resolution of 60000 with the internal lock mass 612 at room temperature (RT) for 5min and 672 option turned on (lock mass was 445.120025 m/z, 613 resuspended in 25µL 8M urea, followed by a 673 polysiloxane) (Olsen et al., 2005). Product ion 614 second chloroform/methanol precipitation. The 674 spectra were recorded in a data-dependent manner 615 pellet was dried at RT for 5 min and resuspended 675 in the ion trap (ITMS) in a variable scan range and 616 in 25µL 8M urea. Protein reduction and alkylation 676 at a rapid scan rate. The ionization potential was 617 was achieved by sequential incubation with DTT 677 set to 1.8kV. Peptides were analysed by a 618 (final 5mM, 30 min, RT) and iodoacetamide 678 repeating cycle of a full precursor ion scan (1.0 × 619 (IAM; final 20mM, 30min, RT, dark). Non- 679 106 ions or 50ms) followed by 15 product ion 620 reacted IAM was quenched by raising the DTT 680 scans (1.0 × 104 ions or 50ms). Peptides exceeding

6

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

681 a threshold of 500 counts were selected for tandem 741 terms that were either depleted or enriched 682 mass (MS2) spectrum generation. Collision 742 (p≤0.05) in the bonafide AF protein groups as 683 induced dissociation (CID) energy was set to 35% 743 compared to bonafide AF depleted proteins or 684 for the generation of MS2 spectra. Dynamic ion 744 protein groups present both in the AF and TE. 685 exclusion was set to 60 seconds with a maximum 745 Phylogenetic analysis - Predicted 686 list of excluded ions consisting of 500 members 746 proteomes were annotated with PFAM identifiers, 687 and a repeat count of one. Ion injection time 747 and all sequences containing a Peptidase S8 688 prediction, preview mode for the Fourier 748 (PF00082) domain were extracted from the 689 transform mass spectrometer (FTMS, the 749 different databases. Additionally, we manually 690 orbitrap), monoisotopic precursor selection and 750 curated the subtilisin gene-family in the 691 charge state screening were enabled. Only charge 751 Niben1.0.1 draft genome, identifying putative 692 states higher than 1 were considered for 752 pseudo-genes which were annotated as protein- 693 fragmentation. 753 encoding genes, as well as missing genes and 694 Peptide and Protein Identification - 754 incorrect gene models or genes in which the 695 Peptide spectra were searched in MaxQuant 755 reference sequence was absent in Niben1.0.1. 696 (version 1.5.3.30) using the Andromeda search 756 Tomato subtilisins were retrieved from 697 engine (Cox et al., 2011) with default settings and 757 Solgenomics, and other previously characterized 698 label-free quantification and match-between-runs 758 subtilisins (Taylor and Qiu, 2017) were retrieved 699 activated (Cox and Mann, 2008; Cox et al., 2014) 759 from NCBI. Clustal Omega (Sievers et al., 2011; 700 against the databases specified in the text 760 Li et al., 2015) was used to align these sequences. 701 including a known contaminants database. 761 The putative pseudo-gene sequences were 702 Included modifications were 762 substituted with the best blast hit in NCBI in order 703 carbamidomethylation (static) and oxidation and 763 to visualize pseudogenization in the alignment and 704 N-terminal acetylation (dynamic). Precursor mass 764 phylogenetic tree. Determining the best model for 705 tolerance was set to ±20 ppm (first search) and 765 maximum likelihood phylogenetic analysis and 706 ±4.5 ppm (main search), while the MS/MS match 766 the phylogenetic analysis was performed in 707 tolerance was set to ±0.5 Da. The peptide 767 MEGA X (Kumar et al., 2018). The evolutionary 708 spectrum match FDR and the protein FDR were 768 history was inferred by using the Maximum 709 set to 0.01 (based on a target-decoy approach) and 769 Likelihood method based on the Whelan and 710 the minimum peptide length was set to 7 amino 770 Goldman model. A discrete Gamma distribution 711 acids. Protein quantification was performed in 771 was used to model evolutionary rate differences 712 MaxQuant (Tyanova et al., 2016), based on 772 among sites, and the rate variation model allowed 713 unique and razor peptides including all 773 for some sites to be evolutionarily invariable. All 714 modifications. 774 positions with less than 80% site coverage were 715 Proteomics processing in R - Identified 775 eliminated. Niben101Scf00595_742942-795541 716 protein groups were filtered for reverse and 776 was used to root the phylogenetic trees. 717 contaminants proteins and those only identified by 777 Data Availability - The mass 718 matching, and only those protein groups identified 778 spectrometry proteomics data have been deposited 719 in 3 out of 4 biological replicates either AF or TE 779 to the ProteomeXchange Consortium via the 720 were selected. The LFQ values were log2 780 PRIDE (Vizcaíno et al., 2016) partner repository 721 transformed, and missing values were imputed 781 (https://www.ebi.ac.uk/pride/archive/) with the 722 using a minimal distribution as implemented in 782 data set identifier PXD010435. During the review 723 imputeLCMD (v2.0) (Lazar, 2015). A moderated 783 process the data can be accessed via a reviewer 724 t-test was used as implemented in Limma 784 account (Username: [email protected]; 725 (v3.34.3) (Ritchie et al., 2015; Phipson et al., 785 Password: PQSfFZyN). Samples FGH01-04 726 2016) and adjusted using Benjamini–Hochberg 786 represent AF and FGH05-08 represent TE. 727 (BH) adjustment to identify protein groups 787 728 significantly differing between AF and TE. 788 Acknowledgements. We would like to thank 729 Bonafide apoplastic protein groups were those 789 Philippe Varennes-Jutras and Daniela Sueldo for 730 only detected in AF and those significantly 790 critically reading the manuscript and providing 731 (p≤0.01) log2 fold change ≥1.5 in AF samples. 791 important suggestions for improving the 732 Protein groups only detected in TE and those 792 manuscript. 733 significantly (p≤0.01) log2 fold change ≤-1.5 793 734 depleted in AF samples were considered 794 Funding. This work has been supported by ‘The 735 intracellular. The remainder was considered both 795 Clarendon Fund’ (JK), and the ERC Consolidator 736 apoplastic and intra-cellular. Majority proteins 796 grant 616449 ‘GreenProteases’ (RvdH, FGH). 737 were annotated with SignalP, PFAM, MEROPS 797 The funders had no role in study design, data 738 (v12) (Rawlings et al., 2018), GO, and UniProt 798 collection and analysis, decision to publish, or 739 keywords identifiers. A BH-adjusted 799 preparation of the manuscript. 740 Hypergeometric test was used to identify those 800

7

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

801 Competing interests. The authors have 861 http://link.springer.com/protocol/10.1385/1- 802 declared that no competing interests exist. 862 59745-130-4%3A143 [Accessed January 4, 803 863 2015]. 804 Author contributions. Conceptualization: JK, 864 Consortium, T.P.G.S. (2011) Genome sequence 805 RvdH; Formal analysis: JK; Funding acquisition: 865 and analysis of the tuber crop potato. Nature, 806 RvdH; Wetlab experiments: FHG; Proteomics: 866 475, 189–195. 807 FK, MK; Programming: JK, FH; Writing: JK, 867 Consortium, T.T.G. (2012) The tomato genome 808 RvdH. 868 sequence provides insights into fleshy fruit 809 869 evolution. Nature, 485, 635–641. 810 References 870 Cox, J., Hein, M.Y., Luber, C.A., Paron, I., 811 Bally, J., Nakasugi, K., Jia, F., Jung, H., Ho, 871 Nagaraj, N. and Mann, M. (2014) Accurate 812 S.Y.W., Wong, M., Paul, C.M., Naim, F., 872 proteome-wide label-free quantification by 813 Wood, C.C., Crowhurst, R.N., Hellens, R.P., 873 delayed normalization and maximal peptide 814 Dale, J.L. and Waterhouse, P.M. (2015) The 874 ratio extraction, termed MaxLFQ. Mol. Cell. 815 extremophile Nicotiana benthamiana has 875 Proteomics, 13, 2513–2526. 816 traded viral defence for early vigour. Nat 876 Cox, J. and Mann, M. (2008) MaxQuant 817 Plants, 1, 15165. 877 enables high peptide identification rates, 818 Bombarely, A., Moser, M., Amrad, A., Bao, 878 individualized p.p.b.-range mass accuracies and 819 M., Bapaume, L., Barry, C.S., Bliek, M., 879 proteome-wide protein quantification. Nat 820 Boersma, M.R., Borghi, L., Bruggmann, R., 880 Biotech, 26, 1367–1372. 821 Bucher, M., D’Agostino, N., Davies, K., 881 Cox, J., Neuhauser, N., Michalski, A., 822 Druege, U., Dudareva, N., Egea-Cortines, 882 Scheltema, R.A., Olsen, J.V. and Mann, M. 823 M., Delledonne, M., Fernandez-Pozo, N., 883 (2011) Andromeda: a peptide search engine 824 Franken, P., Grandont, L., Heslop-Harrison, 884 integrated into the MaxQuant environment. J. 825 J.S., Hintzsche, J., Johns, M., Koes, R., Lv, 885 Proteome Res., 10, 1794–1805. 826 X., Lyons, E., Malla, D., Martinoia, E., 886 Doehlemann, G. and Hemetsberger, C. (2013) 827 Mattson, N.S., Morel, P., Mueller, L.A., 887 Apoplastic immunity and its suppression by 828 Muhlemann, J., Nouri, E., Passeri, V., 888 filamentous plant pathogens. New Phytol., 198, 829 Pezzotti, M., Qi, Q., Reinhardt, D., Rich, M., 889 1001–1016. 830 Richert-Pöggeler, K.R., Robbins, T.P., 890 Fernandez-Pozo, N., Rosli, H.G., Martin, G.B. 831 Schatz, M.C., Schranz, M.E., Schuurink, 891 and Mueller, L.A. (2015) The SGN VIGS 832 R.C., Schwarzacher, T., Spelt, K., Tang, H., 892 Tool: user-friendly software to design Virus- 833 Urbanus, S.L., Vandenbussche, M., 893 Induced Gene Silencing (VIGS) constructs for 834 Vijverberg, K., Villarino, G.H., Warner, 894 functional genomics. Mol. Plant, 8, 486–488. 835 R.M., Weiss, J., Yue, Z., Zethof, J., 895 Finn, R.D., Coggill, P., Eberhardt, R.Y., Eddy, 836 Quattrocchio, F., Sims, T.L. and 896 S.R., Mistry, J., Mitchell, A.L., Potter, S.C., 837 Kuhlemeier, C. (2016) Insight into the 897 Punta, M., Qureshi, M., Sangrador-Vegas, 838 evolution of the Solanaceae from the parental 898 A., Salazar, G.A., Tate, J. and Bateman, A. 839 genomes of Petunia hybrida. Nat. Plants, 2, 899 (2016) The Pfam protein families database: 840 16074. 900 towards a more sustainable future. Nucleic 841 Bombarely, A., Rosli, H.G., Vrebalov, J., 901 Acids Res, 44, D279–D285. 842 Moffett, P., Mueller, L.A. and Martin, G.B. 902 Fu, L., Niu, B., Zhu, Z., Wu, S. and Li, W. 843 (2012) A draft genome sequence of Nicotiana 903 (2012) CD-HIT: accelerated for clustering the 844 benthamiana to enhance molecular plant- 904 next-generation sequencing data. 845 microbe biology research. Mol. Plant-Microbe 905 Bioinformatics, 28, 3150–3152. 846 Interact., 25, 1523–1530. 906 Goodin, M.M., Zaitlin, D., Naidu, R.A. and 847 Casimiro-Soriguer, C.S., Muñoz-Mérida, A. 907 Lommel, S.A. (2008) Nicotiana benthamiana: 848 and Pérez-Pulido, A.J. (2017) Sma3s: A 908 its history and future as a model for plant– 849 universal tool for easy functional annotation of 909 pathogen interactions. MPMI, 21, 1015–1026. 850 proteomes and transcriptomes. Proteomics, 17. 910 Grosse-Holz, F.M., Kelly, S., Blaskowski, S., 851 Cheng, C.-Y., Krishnakumar, V., Chan, A.P., 911 Kaschani, F., Kaiser, M. and van der Hoorn, 852 Thibaud-Nissen, F., Schobel, S. and Town, 912 R.A.L. (2018) The transcriptome, extracellular 853 C.D. (2017) Araport11: a complete 913 proteome and active secretome of 854 reannotation of the 914 agroinfiltrated Nicotiana benthamiana uncover 855 reference genome. Plant J, 89, 789–804. 915 a large, diverse protease repertoire. Plant 856 Clemente, T. (2006) Nicotiana (Nicotiana 916 Biotechnol. J., 16, 1068–1084. 857 tobaccum, Nicotiana benthamiana). In K. 917 Haas, B.J., Papanicolaou, A., Yassour, M., 858 Wang, ed. Agrobacterium Protocols. Methods 918 Grabherr, M., Blood, P.D., Bowden, J., 859 in Molecular Biology (Clifton, N.J.). Humana 919 Couger, M.B., Eccles, D., Li, B., Lieber, M., 860 Press, pp. 143–154. Available at: 920 MacManes, M.D., Ott, M., Orvis, J., Pochet,

8

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

921 N., Strozzi, F., Weeks, N., Westerman, R., 980 Kumar, S., Stecher, G., Li, M., Knyaz, C., 922 William, T., Dewey, C.N., Henschel, R., 981 Tamura, K. and Battistuzzi, F.U. (2018) 923 LeDuc, R.D., Friedman, N. and Regev, A. 982 MEGA X: Molecular Evolutionary Genetics 924 (2013) De novo transcript sequence 983 Analysis across computing platforms. Mol Biol 925 reconstruction from RNA-seq using the Trinity 984 Evol, 35, 1547–1549. 926 platform for reference generation and analysis. 985 Lazar, C. (2015) imputeLCMD: a collection of 927 Nat. Protocols, 8, 1494–1512. 986 methods for left-censored missing data 928 Hirakawa, H., Shirasawa, K., Miyatake, K., 987 imputation, Available at: https://cran.r- 929 Nunome, T., Negoro, S., Ohyama, A., 988 project.org/web/packages/imputeLCMD/index. 930 Yamaguchi, H., Sato, S., Isobe, S., Tabata, S. 989 html [Accessed February 25, 2018]. 931 and Fukuoka, H. (2014) Draft genome 990 Leitch, I.J., Hanson, L., Lim, K.Y., Kovarik, 932 sequence of eggplant (Solanum melongena L.): 991 A., Chase, M.W., Clarkson, J.J. and Leitch, 933 the representative Solanum species indigenous 992 A.R. (2008) The ups and downs of genome size 934 to the old world. DNA Res, 21, 649–660. 993 evolution in polyploid species of Nicotiana 935 Iorizzo, M., Ellison, S., Senalik, D., Zeng, P., 994 (Solanaceae). Ann Bot, 101, 805–814. 936 Satapoomin, P., Huang, J., Bowman, M., 995 Li, W., Cowley, A., Uludag, M., Gur, T., 937 Iovene, M., Sanseverino, W., Cavagnaro, P., 996 McWilliam, H., Squizzato, S., Park, Y.M., 938 Yildiz, M., Macko-Podgórni, A., Moranska, 997 Buso, N. and Lopez, R. (2015) The EMBL- 939 E., Grzebelus, E., Grzebelus, D., Ashrafi, H., 998 EBI bioinformatics web and programmatic 940 Zheng, Z., Cheng, S., Spooner, D., Van 999 tools framework. Nucleic Acids Res, 43, 941 Deynze, A. and Simon, P. (2016) A high- 1000 W580–W584. 942 quality carrot genome assembly provides new 1001 Liu, H., Ding, Y., Zhou, Y., Jin, W., Xie, K. 943 insights into carotenoid accumulation and 1002 and Chen, L.-L. (2017) CRISPR-P 2.0: an 944 asterid genome evolution. Nat. Genet., 48, 657– 1003 improved CRISPR-Cas9 tool for genome 945 666. 1004 editing in plants. Molecular Plant, 10, 530– 946 Keller, O., Kollmar, M., Stanke, M. and 1005 532. 947 Waack, S. (2011) A novel hybrid gene 1006 Michalski, A., Damoc, E., Hauschild, J.-P., 948 prediction method employing protein multiple 1007 Lange, O., Wieghaus, A., Makarov, A., 949 sequence alignments. Bioinformatics, 27, 757– 1008 Nagaraj, N., Cox, J., Mann, M. and Horning, 950 763. 1009 S. (2011) Mass spectrometry-based proteomics 951 Keller, O., Odronitz, F., Stanke, M., Kollmar, 1010 using Q Exactive, a high-performance benchtop 952 M. and Waack, S. (2008) Scipio: using protein 1011 quadrupole Orbitrap mass spectrometer. Mol 953 sequences to determine the precise exon/intron 1012 Cell Proteomics, 10, M111.011015. 954 structures of genes and their orthologs in 1013 Misas-Villamil, J.C. and van der Hoorn, 955 closely related species. BMC Bioinformatics, 9, 1014 R.A.L. (2008) Enzyme–inhibitor interactions at 956 278. 1015 the plant–pathogen interface. Current Opinion 957 Kim, S., Park, M., Yeom, S.-I., Kim, Y.-M., 1016 in Plant Biology, 11, 380–388. 958 Lee, J.M., Lee, H.-A., Seo, E., Choi, J., 1017 Naim, F., Nakasugi, K., Crowhurst, R.N., 959 Cheong, K., Kim, K.-T., Jung, K., Lee, G.- 1018 Hilario, E., Zwart, A.B., Hellens, R.P., 960 W., Oh, S.-K., Bae, C., Kim, S.-B., Lee, H.- 1019 Taylor, J.M., Waterhouse, P.M. and Wood, 961 Y., Kim, S.-Y., Kim, M.-S., Kang, B.-C., Jo, 1020 C.C. (2012) Advanced engineering of lipid 962 Y.D., Yang, H.-B., Jeong, H.-J., Kang, W.- 1021 metabolism in Nicotiana benthamiana using a 963 H., Kwon, J.-K., Shin, C., Lim, J.Y., Park, 1022 draft genome and the V2 viral silencing- 964 J.H., Huh, J.H., Kim, J.-S., Kim, B.-D., 1023 suppressor protein. PLOS ONE, 7, e52717. 965 Cohen, O., Paran, I., Suh, M.C., Lee, S.B., 1024 Nakasugi, K., Crowhurst, R., Bally, J. and 966 Kim, Y.-K., Shin, Y., Noh, S.-J., Park, J., 1025 Waterhouse, P. (2014) Combining 967 Seo, Y.S., Kwon, S.-Y., Kim, H.A., Park, 1026 transcriptome assemblies from multiple de 968 J.M., Kim, H.-J., Choi, S.-B., Bosland, P.W., 1027 novo assemblers in the allo-tetraploid plant 969 Reeves, G., Jo, S.-H., Lee, B.-W., Cho, H.-T., 1028 Nicotiana benthamiana. PLOS ONE, 9, 970 Choi, H.-S., Lee, M.-S., Yu, Y., Do Choi, Y., 1029 e91776. 971 Park, B.-S., Deynze, A. van, Ashrafi, H., Hill, 1030 Olsen, J.V., Godoy, L.M.F. de, Li, G., Macek, 972 T., Kim, W.T., Pai, H.-S., Ahn, H.K., Yeam, 1031 B., Mortensen, P., Pesch, R., Makarov, A., 973 I., Giovannoni, J.J., Rose, J.K.C., Sørensen, 1032 Lange, O., Horning, S. and Mann, M. (2005) 974 I., Lee, S.-J., Kim, R.W., Choi, I.-Y., Choi, 1033 Parts per million mass accuracy on an Orbitrap 975 B.-S., Lim, J.-S., Lee, Y.-H. and Choi, D. 1034 mass spectrometer via lock mass injection into 976 (2014) Genome sequence of the hot pepper 1035 a C-trap. Mol Cell Proteomics, 4, 2010–2021. 977 provides insights into the evolution of 1036 Petersen, T.N., Brunak, S., Heijne, G. von and 978 pungency in Capsicum species. Nat. Genet., 46, 1037 Nielsen, H. (2011) SignalP 4.0: discriminating 979 270–278. 1038 signal peptides from transmembrane regions. 1039 Nat. Methods, 8, 785–786.

9

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1040 Phipson, B., Lee, S., Majewski, I.J., 1100 Wang, K., Xu, P., Eck, J.V., Chakravarthy, 1041 Alexander, W.S. and Smyth, G.K. (2016) 1101 S., Dinesh‐ Kumar, S.P., Martin, G.B. and 1042 Robust hyperparameter estimation protects 1102 Mysore, K.S. (2018) Virus-induced gene 1043 against hypervariable genes and improves 1103 silencing database for phenomics and 1044 power to detect differential expression. Ann 1104 functional genomics in Nicotiana benthamiana. 1045 Appl Stat, 10, 946–963. 1105 Plant Direct, 2, e00055. 1046 Qin, C., Yu, C., Shen, Y., Fang, X., Chen, L., 1106 Sierro, N., Battey, J.N., Ouadi, S., Bovet, L., 1047 Min, J., Cheng, J., Zhao, S., Xu, M., Luo, Y., 1107 Goepfert, S., Bakaher, N., Peitsch, M.C. and 1048 Yang, Y., Wu, Z., Mao, L., Wu, H., Ling-Hu, 1108 Ivanov, N.V. (2013) Reference genomes and 1049 C., Zhou, H., Lin, H., González-Morales, S., 1109 transcriptomes of Nicotiana sylvestris and 1050 Trejo-Saavedra, D.L., Tian, H., Tang, Xin, 1110 Nicotiana tomentosiformis. Genome Biology, 1051 Zhao, M., Huang, Z., Zhou, A., Yao, X., Cui, 1111 14, R60. 1052 J., Li, Wenqi, Chen, Z., Feng, Y., Niu, Y., Bi, 1112 Sierro, N., Battey, J.N.D., Ouadi, S., Bakaher, 1053 S., Yang, X., Li, Weipeng, Cai, H., Luo, X., 1113 N., Bovet, L., Willig, A., Goepfert, S., 1054 Montes-Hernández, S., Leyva-González, 1114 Peitsch, M.C. and Ivanov, N.V. (2014) The 1055 M.A., Xiong, Z., He, X., Bai, L., Tan, S., 1115 tobacco genome sequence and its comparison 1056 Tang, Xiangqun, Liu, D., Liu, J., Zhang, S., 1116 with those of tomato and potato. Nat Commun, 1057 Chen, M., Zhang, Lu, Zhang, Li, Zhang, 1117 5, 3833. 1058 Yinchao, Liao, W., Zhang, Yan, Wang, M., 1118 Sievers, F., Wilm, A., Dineen, D., Gibson, T.J., 1059 Lv, X., Wen, B., Liu, H., Luan, H., Zhang, 1119 Karplus, K., Li, W., Lopez, R., McWilliam, 1060 Yonggang, Yang, S., Wang, X., Xu, J., Li, X., 1120 H., Remmert, M., Söding, J., Thompson, 1061 Li, S., Wang, J., Palloix, A., Bosland, P.W., 1121 J.D. and Higgins, D.G. (2011) Fast, scalable 1062 Li, Y., Krogh, A., Rivera-Bustamante, R.F., 1122 generation of high‐ quality protein multiple 1063 Herrera-Estrella, L., Yin, Y., Yu, J., Hu, K. 1123 sequence alignments using Clustal Omega. 1064 and Zhang, Z. (2014) Whole-genome 1124 Molecular Systems Biology, 7, 539. 1065 sequencing of cultivated and wild peppers 1125 Simão, F.A., Waterhouse, R.M., Ioannidis, P., 1066 provides insights into Capsicum domestication 1126 Kriventseva, E.V. and Zdobnov, E.M. (2015) 1067 and specialization. Proc. Natl. Acad. Sci. USA, 1127 BUSCO: assessing genome assembly and 1068 111, 5135–5140. 1128 annotation completeness with single-copy 1069 Reichardt, S., Repper, D., Tuzhikov, A.I., 1129 orthologs. Bioinformatics, 31, 3210–3212. 1070 Galiullina, R.A., Planas-Marquès, M., 1130 Sparkes, I.A., Runions, J., Kearns, A. and 1071 Chichkova, N.V., Vartapetian, A.B., Stintzi, 1131 Hawes, C. (2006) Rapid, transient expression 1072 A. and Schaller, A. (2018) The tomato 1132 of fluorescent fusion proteins in tobacco plants 1073 subtilase family includes several cell death- 1133 and generation of stably transformed plants. 1074 related proteinases with caspase specificity. Sci. 1134 Nat. Protocols, 1, 2019–2025. 1075 Rep. 8, 10531. 1135 Sperschneider, J., Dodds, P.N., Singh, K.B. 1076 Rappsilber, J., Mann, M. and Ishihama, Y. 1136 and Taylor, J.M. (2018) ApoplastP: prediction 1077 (2007) Protocol for micro-purification, 1137 of effectors and plant proteins in the apoplast 1078 enrichment, pre-fractionation and storage of 1138 using machine learning. New Phytol, 217, 1079 peptides for proteomics using StageTips. Nat. 1139 1764–1778. 1080 Protocols, 2, 1896–1906. 1140 Stanke, M., Schöffmann, O., Morgenstern, B. 1081 Rawlings, N.D., Barrett, A.J., Thomas, P.D., 1141 and Waack, S. (2006) Gene prediction in 1082 Huang, X., Bateman, A. and Finn, R.D. 1142 eukaryotes with a generalized hidden Markov 1083 (2018) The MEROPS database of proteolytic 1143 model that uses hints from external sources. 1084 enzymes, their substrates and inhibitors in 2017 1144 BMC Bioinformatics, 7, 62. 1085 and a comparison with peptidases in the 1145 Stoger, E., Fischer, R., Moloney, M. and Ma, 1086 PANTHER database. Nucleic Acids Res, 46, 1146 J.K.-C. (2014) Plant molecular for 1087 D624–D632. 1147 the treatment of chronic and infectious 1088 Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., 1148 diseases. Annual Review of Plant Biology, 65, 1089 Law, C.W., Shi, W. and Smyth, G.K. (2015) 1149 743–768. 1090 Limma powers differential expression analyses 1150 Taylor, A. and Qiu, Y.-L. (2017) Evolutionary 1091 for RNA-sequencing and microarray studies. 1151 history of subtilases in land plants and their 1092 Nucleic Acids Res, 43, e47–e47. 1152 involvement in symbiotic interactions. Mol. 1093 Senthil-Kumar, M. and Mysore, K.S. (2014) 1153 Plant-Microbe Interact, 30, 489–501. 1094 Tobacco rattle virus–based virus-induced gene 1154 Tyanova, S., Temu, T. and Cox, J. (2016) The 1095 silencing in Nicotiana benthamiana. Nat. 1155 MaxQuant computational platform for mass 1096 Protocols, 9, 1549–1562. 1156 spectrometry-based shotgun proteomics. Nat. 1097 Senthil‐ Kumar, M., Wang, M., Chang, J., 1157 Protocols, 11, 2301–2319. 1098 Ramegowda, V., Pozo, O. del, Liu, Y., 1158 Vizcaíno, J.A., Csordas, A., Toro, N. del-, 1099 Doraiswamy, V., Lee, H.-K., Ryu, C.-M., 1159 Dianes, J.A., Griss, J., Lavidas, I., Mayer,

10

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1160 G., Perez-Riverol, Y., Reisinger, F., Ternent, 1188 1161 T., Xu, Q.-W., Wang, R. and Hermjakob, H. 1189 Figure S1: Comparison of Solanaceae 1162 (2016) 2016 update of the PRIDE database and 1190 proteomes. 1163 its related tools. Nucleic Acids Res, 44, D447– 1191 Figure S2: Phylogenetic analysis of the 1164 D456. 1192 subtilisin gene-family from figure 4 with 1165 Wang, X. and Bennetzen, J.L. (2015) Current 1193 names. 1166 status and prospects for the study of Nicotiana 1194 Figure S3: Phylogenetic analysis of the 1167 genomics, genetics, and biosynthesis 1195 subtilisin gene-family of tomato and 1168 genes. Mol Genet Genomics, 290, 11–21. 1196 Arabidopsis and including other previously 1169 Waterhouse, R.M., Seppey, M., Simão, F.A., 1197 characterized subtilisins. 1170 Manni, M., Ioannidis, P., Klioutchnikov, G., 1198 1171 Kriventseva, E.V. and Zdobnov, E.M. (2018) 1199 Table S1: GO-SLIM term enrichment complete at 1172 BUSCO applications from quality assessments 1200 p≤0.05 1173 to gene prediction and phylogenomics. Mol 1201 Table S2: MEROPS family term enrichment 1174 Biol Evol, 35, 543–548. 1202 complete 1175 Wu, C.-H., Abd-El-Haliem, A., Bozkurt, T.O., 1203 Table S3: Gene-model comparison 1176 Belhaj, K., Terauchi, R., Vossen, J.H. and 1204 1177 Kamoun, S. (2017) NLR network mediates 1205 Supplemental dataset 1: NbC gene-models 1178 immunity to diverse plant pathogens. Proc. 1206 database fasta nucleotide sequence 1179 Natl. Acad. Sci. USA, 114, 8113–8118. 1207 Supplemental dataset 2: NbC gene-models 1180 Xu, S., Brockmöller, T., Navarro-Quezada, A., 1208 database gff3 annotation 1181 Kuhl, H., Gase, K., Ling, Z., Zhou, W., 1209 Supplemental dataset 3: NbC predicted CDS 1182 Kreitzer, C., Stanke, M., Tang, H., Lyons, 1210 Supplemental dataset 4: NbC predicted 1183 E., Pandey, P., Pandey, S.P., Timmermann, 1211 proteome 1184 B., Gaquerel, E. and Baldwin, I.T. (2017) 1212 Supplemental dataset 5: PFAM, SignalP, 1185 Wild tobacco genomes reveal the evolution of 1213 ApoplastP, and Sma3 v2 annotation of the NbC 1186 nicotine biosynthesis. Proc. Natl. Acad. Sci. 1214 predicted proteome. 1187 USA, 114, 6133–6138.

11

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1215 1216 1217 Figure 1. Bioinformatics pipeline for improved Nicotiana benthamiana proteome annotation. 1218 A database of Nicotiana predicted protein sequences was retrieved from Genbank and clustered at 95% 1219 identity threshold to reduce redundancy (Step 1), and used to annotate the four available N. benthamiana 1220 draft genomes (Step 2). The databases derived from the different genome builds were clustered at 99% 1221 sequence identity using CD-HIT-EST-2D in the following order: Niben1.0.1 > Nbv0.5 > Niben0.4.4 > 1222 Nbv0.3 generating NbA (Step 3). Proteins identified by proteomics in other databases but missing in 1223 NbA were added as above to generate NbB (Step 4), and finally missing BUSCOs were added as above 1224 to generate the final database NbC (Step 5).

12

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1225 1226 1227 Figure 2. Increased coverage and annotation of N. benthamiana proteins. 1228 a) Completeness of the different predicted proteomes was estimated using BUSCO with the 1229 embryophyta database. b) The percentage of the proteins assigned with a certain number of unique 1230 PFAM identifiers (from 0-8) is plotted. c) Violin and boxplot graph of log10 protein length distribution 1231 of each database. Jittered dots show the raw underlying data. d) Percentage of annotated MS/MS spectra 1232 in AF or TE samples. Means and standard deviation (n=4) are shown and the number of unique peptides 1233 identified in at least of three out of four biological replicates of either AF or TE.

13

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1234 1235 Figure 3. Improved annotation of the N. benthamiana apoplastic proteome. 1236 a) Correlation matrix heat map of the LFQ intensity of protein groups in the biological replicates of AF 1237 and TE. Biological replicates are clustered on similarity. b) A volcano plot is shown plotting log2 fold 1238 difference of AF/TE over –log10 BH-adjusted moderated p-values. Proteins log2 fold change ≥1.5 and 1239 p≤0.01 were considered apoplastic, as well as those only found in AF. Conversely, proteins with a log2 1240 fold change ≤-1.5 and p≤0.01 were considered strictly intracellular, as well as those found only in TE. 1241 c) and d) shows a grid where each row represents an GO-SLIM annotation significantly enriched or 1242 depleted (BH-adjusted hypergeometric test, p<0.05) in at least one of the fractions (apoplast, 1243 intracellular, or both) and each column the compartment. Each bubble indicates the percentage of genes 1244 containing that specific GO-SLIM annotation in that compartment. Colours indicate whether the GO- 1245 SLIM annotations are enriched or depleted in that fraction (p<0.05, n.s. = non-significant). c) 1246 Percentage of proteins in each fraction annotated with biological process-associated GO-SLIM terms; 1247 d) Molecular function-associated GO-SLIM terms. 1248

14

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1249 1250 1251 Figure 4. Birth and death of subtilase gene paralogs in N. benthamiana. 1252 The evolutionary history of the subtilisin gene family was inferred by using the Maximum Likelihood 1253 method based on the Whelan and Goldman model. The bootstrap consensus tree inferred from 500 1254 replicates is taken to represent the evolutionary history of the taxa analysed. Putative pseudogenes are 1255 indicated in grey. Subtilases identified in apoplastic fluid (AF) and/or total extract (TE) are indicated 1256 with yellow and green dots, respectively. Naming of subtilase clades is according to (Taylor and Qiu, 1257 2017). Figure S2 includes the individual names.

15

bioRxiv preprint doi: https://doi.org/10.1101/373506; this version posted July 25, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1258 1259 1260 Figure 5. Example of a subtilase gene misannotation in different genome assemblies. 1261 The gene-models corresponding to a subtilase encoded by Nbv0.5scaffold3272_191000-195270 were 1262 retrieved from the different databases: a) NbC, b) Niben1.0.1 genome, c) Nbv0.5 genome, and d) 1263 Niben0.4.4 genome. CDS is annotated in grey boxes. The dotted line indicates a stretch of unknown 1264 sequence (Ns). Vertical dotted lines indicate SNPs, and the triangle indicate an insertion. Strand 1265 direction is indicated by arrows.

16