1 Supplementary Information Appendix 2 3 1. material, sequencing, assembly and map integration of the Nicotiana attenuata and N. 4 obtusifolia ge no mes ...... 2 5 1.1 Plant material and DNA preparation ...... 2 6 1.2 Raw data processing...... 2 7 1.3 Genome assembly pipeline...... 3 8 1.4 Generating the optical map ...... 4 9 1.5 Anchor scaffolds to pseudomolecules ...... 5 10 1.6 Quality assessment and validation of assembly ...... 6 11 2. Genome annotation ...... 7 12 2.1 Annotation of repeats and LTR insertion time estimation in Nicotiana ...... 7 13 2.2 Analyzing DTT-NIC1 insertions in Nicotiana genomes...... 8 14 2.3 Annotation of miRNA, tRNA and rRNA ...... 9 15 2.4 Annotation of protein coding genes...... 10 16 2.5 RNA sequencing and data analysis ...... 12 17 2.6 Expression of transposable elements...... 14 18 2.7 Functional annotation of protein coding genes ...... 14 19 2.8 Functional annotation of protein domains ...... 15 20 2.9 Annotation of transcription factor and protein kinase genes ...... 16 21 3. Evolutionary genomics analyses ...... 16 22 3.1 Homologous group identification ...... 16 23 3.2 Detection of gene duplication events in each HG ...... 16 24 3.3 Confirmation of the whole-genome triplication in Nicotiana...... 18 25 3.4 Estimation of species divergence times...... 20 26 3.5 Identification of lineage-specific gene family expansion ...... 21 27 4. Evolution of nicotine biosynthesis ...... 22 28 4.1 Identification and reconstruction of the evolution history of nicotine biosynthesis genes ...... 22 29 4.2 Validation of the promoter sequences of nicotine biosynthesis genes and their ancestral copies... 23 30 4.3 Prediction of TE-derived putative miRNA target sites into regulatory region of nicotine 31 biosynthesis genes...... 23 32 4.4 Evolution of root-specific expression of nicotine biosynthesis genes...... 24 33 4.5 Data access...... 25 34 5. Supplemental Tables ...... 27 35 6. Supplemental Figures...... 36 36 7. Captions for supplementary datasets S1 to S9 ...... 72 37 8. References ...... 74 38 1

39 1. Plant material, sequencing, assembly and map integration of the Nicotiana attenuata and

40 N. obtusifolia genomes

41 1.1 Plant material and DNA preparation

42 were grown as previously described (1). The genomic DNA sequenced by 454 and

43 Illumina HiSeq2000 technologies was isolated from late rosette-stage plants using the CTAB-method(2).

44 The two N. attenuata DNA plants used for this extraction were from a 30th generation inbred line,

45 referred to as the UT accession, which originated from a 1996 collection from a native population in

46 Washington County, Utah, USA (1). N. obtusifolia DNA was obtained from a single plant of the first

47 inbred generation derived from seeds collected from a native population in 2004 at the Lytle Ranch

48 Preserve, Saint George, Utah, USA. High molecular weight genomic DNA used to generate the optical

49 map of N. attenuata was isolated using a nuclei based protocol (3) from approximately one hundred

50 freshly harvested N. attenuata plants (harvested 29 days post germination) of the same inbred generation

51 and origin as used for the short-read sequencing.

522. To reduce the potential effects of secondary metabolites on single molecular sequencing, the plant

53 material used for PacBio sequencing was from a cross of two isogenic N. attenuata transgenic lines

54 (mother: ir-aoc, line A-07-457-1, which was transformed with pRESC5AOC [GenBank KX011463] and

55 is impaired in JA biosynthesis; father: irGGPPS, line A-07-230-5, which was transformed with

56 pRESC5GGPPS [GenBank KX011462] and is impaired in the synthesis of the abundant 17-

57 hydroxygeranyllinalool diterpene glycosides), both generated from the 22nd inbred generation of the

58 same origin as the N. attenuata plants described above (1). Genomic DNA was isolated from

59 approximately one hundred young plants (harvested 29 days post germination) by Amplicon Express

60 (http://ampliconexpress.com) according to a proprietary protocol.

61 1.2 Raw data processing

62 We filtered and trimmed the Illumina HiSeq2000 whole-genome shotgun raw sequences of N.

63 attenuata and N. obtusifolia to obtain high-quality non-duplicated read pairs prior to genome assembly.

2

64 This was achieved by a custom script that extracted only the largest reads, longer than 32 bp, and which

65 contained no bases with a Phred quality score lower than 11. We compared the first 32 bp of each read in

66 a pair to those of other read pairs and retained only one pair if the same sequence was found more than

67 once (deduplication). Roche/454 long reads for N. attenuata were processed by the sffToCA script

68 provided by the Celera Assembler v7 pipeline (CA7). Table S3 provides details of the filtered data used

69 for the genome assemblies.

70 1.3 Genome assembly pipeline.

71 The overall assembly workflow for N. attenuata is shown as SI Appendix, Fig. S3. All paired-end

72 reads from the sequenced libraries were assembled using the Celera Assembler (CA7) with a minimum

73 read length cut-off at 64bp. Preliminary tests showed that single end or short reads did not improve the

74 assemblies, but increased calculation time significantly. In total, 86.4 Gb short read data were assembled.

75 The expected genome size of N. attenuata is of 2.54 Gb based on coverage of the “larger than N50 length

76 unitigs”, similar to the estimation (1C=2.5pg) from the flow cytometry analysis. We used the SSPACE v2

77 scaffolder to further improve scaffolding using the mate-pair data and filled gaps using GapCloser v1.12

78 (4). After manual inspections, we found that certain neighboring contigs in the scaffolds still contained

79 overlaps, which might be due to the assembly process from CA7 that places copies of repeat sequence at

80 the end of contigs or due to issues in Gapcloser v1.12 that leave open some closable gaps. To close these

81 gaps between overlapped contigs, we compared neighboring contigs using BLASTN (min. identity

82 95%/min. length 43) and then joined overlapping contigs with custom scripts.

83 The assembly scaffolds from short reads were used for gap filling and further scaffolding using

84 PBJelly (v15.8.24)(5) with ~10x PacBio reads (N50=14.9 kb, max read length =48.9 kb), which resulted

85 in a 2.17 Gb genome assembly. While PBjelly only increased the N50 scaffold size from 176 kb to 202 kb,

86 it significantly increased the N50 contig size from 67 kb to 90 kb. Because PacBio reads contain about

87 12-15% errors, we performed an additional correction step using short reads with PILON(6). In total, 98.2%

88 of the draft assembly was confirmed by short reads and ~1.3 Mb sequences were corrected by PILON.

3

89 The PILON corrected assembly was further mapped to the 10x PacBio reads using BLASR and used

90 SSPACE-longreads for the second round scaffolding, which increased the N50 scaffold size to 292 kb.

91 To assemble the N. obtusifolia genome, we employed a hybrid strategy in which we first

92 assembled all short reads by a ‘de Bruijn graph’ assembler using idba-ud v1.1.1 (7), and then assembled

93 the locally re-assembled contigs and a subset of the short read data by an ‘OLC’ long read assembler

94 using CA7. Scaffolding and gap filling were carried out using SSPACE v2 using mate-pair data in a

95 similar manner to the N. attenuata assembly.

96 1.4 Generating the optical map

97 The optical maps of N. attenuata obtained from BioNano Genomics were used to improve the

98 assembly quality. High molecular weight genomic DNA was isolated from fresh N. attenuata tissues

99 using the protocol similar to that of VanBuren et al(3). Briefly, around 5g leaves were collected from

100 young plants and fixed with formaldehyde. After being homogenized in isolation buffer, filtrating

101 washing treatment was performed. The nuclei were purified on percoll cushions and washed extensively

102 and finally embedded in low melting agarose at different dilutions. The DNA plugs were treated with a

103 lysis buffer containing detergent, proteinase K and β-mercaptoethanol (BME). In total, 140 Gb of data

104 (>100 kb) were collected representing ~55× genome coverage with a molecule N50 length of 250 kb (Fig.

105 S1). Molecules were de novo assembled as previously described (8). The optical map finally assembled

106 consists of 2.2 Gb with an N50 size of 1.4 Mb.

107 The optical map final assembly was then used to anchor sequence scaffolds using the sewing

108 machine pipeline (9). Overall 81.7% of sequence assembly can be mapped to the optical map and 84.7 %

109 of the optical map can be aligned to sequence assembly. The super scaffolding was performed using

110 parameter ‘--f_con 12 --f_algn 40 --s_con 10 --s_algn –T 1e-8’ after manual inspection of the quality of

111 the scaffolds. The super scaffolding using the optical map generated a genome assembly consisting of

112 39,115 scaffolds with N50 to 358.4 kb (total size 2.4 Gb). Construction of a high-density linkage map

4

113 A high-density linkage map was constructed using the genotyping by sequencing (GBS) method

114 on 256 individuals from an advanced intercross recombinant inbred line population (AI-RIL). The

115 establishment of the AI-RIL and detailed procedures for DNA isolation and sequencing is described in

116 detail elsewhere (10). In total, 1.2 billion paired-end clean reads were obtained from 256 samples

117 (average = 4.7 million) after quality control and adapter trimming using AdapterRemoval (11) with

118 parameter “--minquality 30 --minlength 36”. All reads were then mapped to the draft genome of N.

119 attenuata (release v1.0) using Bowtie 2 (version 2.2.5) with default parameters and only reads with

120 mapping quality greater than 3 were used for the downstream analysis. Genome Analysis Toolkit (GATK,

121 version 3.3-0-g37228af) (12) was used to call single nucleotide polymorphisms (SNPs) and indels with

122 the parameter “-stand_call_conf 30 -stand_emit_conf 10” after realignment at the indel loci as

123 recommended. All SNPs were filtered based on mapping and SNP quality and sequence depth using the

124 parameter “--clusterWindowSize 10 MQ0 >= 4 && ( (MQ0 / (1.0 * DP)) > 0.1) QUAL < 30.0 QD < 5.0”,

125 which resulted in 16,904 polymorphic markers. For linkage map construction, we further removed all

126 markers that were missing in more than 30% of individuals or showed segregation distortions (P < 0.001,

127 χ2 test). The final dataset used for the linkage group construction contained 7989 markers. The linkage

128 map was constructed using ASmap (13) following recommended workflow. In brief, all markers that were

129 typical for more than 70% of individuals were used to construct the backbone of the linkage map using

130 parameters: dist.fun = "kosambi", p.value = 1e-12. These parameters were selected based on several

131 rounds of manual optimizations. Then, markers that were typical of less than 70% of all individuals were

132 pushed back to the map based on their similarity with markers included in the backbone map. The final

133 linkage map consists of 12 linkage groups with 2,906 cM.

134 1.5 Anchor scaffolds to pseudomolecules

135 To anchor scaffolds to pseudomolecules, we first used chromonomer

136 (http://catchenlab.life.illinois.edu/chromonomer/) to identify conflicts between genome assembly and

137 linkage maps and remove inconsistent markers. In total, 219 scaffolds were identified as containing

5

138 potential assembly errors and 188 were split at the largest gap between two markers that mark the

139 boundaries of the two positions. The remaining 31 scaffolds that contained too many potential errors were

140 excluded from anchoring to pseudomolecules but included as scaffolds in the final genome assembly. The

141 final step of anchoring scaffolds to pseudomolecules was performed using ALLMAPS (14). In total,

142 2,132 scaffolds representing 825.8Mb (34.9%) of the final genome assembly and 13,076 (38.9%) of total

143 predicted genes were anchored to 12 pseudomolecules (Fig. S2).

144 The final assembly consists 37,194 scaffolds, which represents 92% (2.37 Gb) of the estimated

145 genome size. The N50 scaffold and contig sizes equal to 524.5kb and 64.2kb, respectively, and 50% of

146 the genome was represented in 420 longest scaffolds (L50). The detailed step-by-step workflow of the

147 genome assembly procedures is shown in Fig. S3.

148 1.6 Quality assessment and validation of assembly

149 We assessed the completeness and quality of the final N. attenuata assembly using four different

150 methods. First, we mapped 80,044 N. attenuata EST sequences that were assembled in a previous study

151 (with a length greater than 300bp) to the genome using gmap (15). The overall mapping rate was ~97.1%

152 (identity greater than 95% and coverage greater than 80% of each transcript). In addition, we also mapped

153 1207 million paired-end reads to the genome using TopHat2 (16). Overall, 96.5% of them could be

154 mapped to the genome and 94.2% were properly paired. Furthermore, among the 45,816 N. benthamiana

155 EST sequences from NCBI (length greater than 300 bp), 88.7% of them could be mapped to the final N.

156 attenuata assembly (greater than 85% identity and 60% coverage). Second, mapping 224 million (a subset)

157 paired-end 100 bp Illumina reads (1kb library) back to the genome using Bowtie2 (-I 500 -X 2000)

158 showed that 99.94% of reads can be mapped to the genome and 97.7% were properly paired. Third, we

159 used CEGMA based on 248 ultra-conserved CEGs to further estimate the completeness of the assembly.

160 In total, 208 (83.9%) were found in full length and 243 (98.0%) were found to be partial or full-length in

161 our final assembly. The overall completeness of the final N. attenuata assembly estimated based on

162 CEGMA is slightly less than the tomato genome, but similar to the potato genome (Table S2). Fourth, re-

6

163 sequencing of ~2kb the upstream regions of 26 candidate genes using Sanger sequencer confirmed that

164 96.2% (25 out of 26) of the assembly were correct. For these 25 correctly assembled genes, the similarity

165 between sequences obtained from Sanger sequencer and WGS assembly is more than 99.86%. Most of

166 the mismatches (53 out of 72 bps) were due to an additional AT-rich fragment was miss-assembled to the

167 promoter region of NIATv7_g09977. Together, these data suggest that the overall quality of this assembly

168 is high.

169 2. Genome annotation

170 2.1 Annotation of repeats and LTR insertion time estimation in Nicotiana

171 The summary of repeat content is shown in Table S4. The abundance of MITEs is shown in

172 Supplementary Dataset 1. Because the expansion of LTRs contributed to the rapid increase of genome

173 size in Nicotiana, we further analyzed the insertion time of this subgroup. All annotated LTR/gypsy

174 and/or LTR/copia TE sequences with a mapping score greater than 250 and a size greater than 100 bp

175 were extracted from the RepeatMasker output. Only TE families that contained more than 200 copies

176 were retained for downstream analysis. For each of the extracted LTR sequences, BLASTN was used to

177 find the other LTR sequences with the highest similarity score with parameters: -evalue 1e-10 -task dc-

178 megablast. The pairwise aligned fragments that had a length greater than 200 bp and a bit score greater

179 than 200 were used for estimating substitution rates. The identical reciprocal best blast hits were only

180 counted once. A substitution rate was then calculated using baseml from the PAML (4.7) package with

181 the “REV” molecular evolution model. Our LTR expansion dating approach is different from the method

182 described by SanMiguel et al (17), which compares the sequence divergence between two LTR regions of

183 complete retrotransposons. However, in the N. attenuata genome like in many other plant genomes, the

184 complete LTR set only contributes to a small proportion of the total amount of retrotransposons,

185 especially since many old retrotransposons are rapidly disrupted by new insertions and thus cannot be

186 detected as complete LTRs. This may result in a biased picture of the overall insertion history of

7

187 retrotransposons in the genome. Our method directly calculates the divergence between most recently

188 duplicated repeat sequences and thus avoids the bias introduced from annotating complete LTRs.

189 2.2 Analyzing DTT-NIC1 insertions in Nicotiana genomes

190 To analyze the effect of DTT-NIC1 insertions on gene expression divergences between

191 duplicated genes, we first calculated the gene expression and tissue specificity divergences between

192 duplicated gene pairs in N. attenuata using 21 different RNA-seq libraries (see details below on

193 expression analysis and gene duplication identification). To reduce redundancy and false positives, we

194 performed the analysis based on the following protocol: 1) we analyzed the gene duplication history of all

195 N. attenuata genes based on the constructed phylogenetic trees of each homologous group; 2) only the

196 reciprocal most recently duplicated paralog pairs were identified and used for the downstream analysis; 3)

197 only the gene pairs that resulted from duplications that enjoyed an approximate Bayesian support of

198 greater than 0.9 were retained in the analysis; 4) all gene pairs were assigned into two groups: A: one

199 member of the gene pair with at least one DTT-NIC1 insertion (within the 1kb upstream of 5’ region) and

200 B: neither one of gene pair with DTT-NIC1 insertions; 5) gene expression divergence at both expression

201 and tissue specificity levels were then compared between the two groups. The tissue-specificity score for

202 each gene/transcript using the τ index (18) based on a subset of the RNA-seq libraries. The index τ is

203 defined as:

204

(1 ) = 푁 1 ∑푖=1 − 푋푖 τ 205 푁 −

206 where N is the number of tissues and is the expression profile component normalized by the maximal

207 component value. Euclidean distance 푋between푖 the expression profiles of two paralogs among the same

208 RNA-seq libraries was used to quantify the gene expression divergence. Both tissue specificity and

209 expression divergence were calculated based on log2 transformed TPM values.

8

210 2.3 Annotation of miRNA, tRNA and rRNA

211 Small RNA sequencing was performed to capture all small RNAs (smRNAs), especially the

212 miRNAs that are expressed during day and night in the wild-type N. attenuata. Three replicates were used;

213 clean reads were generated from the raw smRNA reads after removing the adaptor sequences and

214 removing the low quality reads—such as reads with unidentified nucleotides (N) or reads with any single

215 nucleotide stretch > 5 nucleotides. Clean reads, > 15 nucleotides, were filtered for further analysis. Next,

216 all the clean reads were aligned against Rfam and the reads mapped to tRNA, rRNA and snoRNA were

217 removed. Remaining reads in all the replicates were merged for each time point, and reads that were

218 expressed in at least two of three replicates were retained for further analysis. These reads are referred to

219 as “mappable miRNA reads.” These reads were aligned to the N. attenuata genome using Bowtie with

220 maximum two mismatches and five reported genomic location alignments. Sequences that did not match

221 the genome were discarded. Further, aligned reads were mapped against the N. attenuata transcriptome by

222 Bowtie with no reverse complement mapping parameter, and those aligned to the transcriptome were

223 removed from the bin of miRNA mappable reads. Conserved miRNAs were identified by comparing the

224 N. attenuata mappable reads against the known plant mature miRNAs and their precursors in the

225 miRBase database (www.mirbase.org). Sequences with perfect matches to the known sequences were

226 regarded as bona-fide conserved miRNAs and were subjected to precursor prediction in the Nicotiana

227 genome. A flanking sequence of 200 bases was extracted from the mapped genomic locations for each

228 read and RNAfold was used to predict precursor stem-loop structure. The miRCheck algorithm was used

229 to compare all the potential miRNAs and precursors against a set of secondary structure constraints

230 derived from known plant miRNA precursors.

231 A total of 131 miRNA reads corresponding to 83 bona-fide conserved miRNAs were identified

232 that were expressed during day and night harvests of N. attenuata leaves. These miRNAs were obtained

233 from 158 genomic locations in Nicotiana genome and were conserved in 34 other plant species (such as

234 Arabidopsis lyrata, Arabidopsis thaliana, Vitis vinifera, and Zea mays etc.) as shown in Supplementary

235 Dataset 2. Of these 83 miRNAs (131 miRNA reads), twelve miRNAs (miR160c-3p, miR171b-3p,

9

236 miR398a-3p, miR426, miR1429-5p, miR5069, miR5497, miR6020b, miR6021, miR6206, miR7711-5p.4

237 and miR7744-5p) were expressed only during day time. On the other hand, 14 miRNAs (miR160-5p,

238 miR408-5p, miR2609a, miR4379, miR5015, miR5042-3p, miR5255, miR5303c, miR5635a, miR5741a,

239 miR6444, miR7750-5p, miR8143 and miR8764) were expressed only during the night. Fifty seven

240 miRNAs (corresponding to 88 miRNA reads) were expressed during both day and night, of which 12

241 miRNAs (miR172c, miR172j, miR5303, miR5303a, miR6149a, miR6161a, miR6161b, miR6161c,

242 miR6161d, miR6164a, miR6164b and miR7997c) were expressed with different mature sequences at

243 these times. Differences in sequences for a miRNA indicate that different isomiRs were expressed

244 between the day and night.

245 tRNAs were first annotated using tRNAscan-SE (v 1.3.1)(19) with default parameters. All

246 predicted pseudo tRNAs and tRNAs that showed high similarity to chloroplast and mitochondrial

247 genomes (e-value < 1e-10, identify > 95) were removed. In total, 1052 tRNAs that decode standard 20

248 AA and four selenocysteine tRNAs were predicted. rRNAs were annotated using RNAmmer (v1.2)(20)

249 with parameter “ -S euk -m tsu,lsu,ssu”. In total, 799 rRNA sequences were predicted.

250 2.4 Annotation of protein coding genes.

251 The N. attenuata genome was annotated using the Nicotiana Genome Annotation (NGA) pipeline,

252 which employs both hint-guided (hg) Augustus (v. 2.7) (21) (hg-Augustus) and MAKER2 (v.2.28)(22),

253 gene prediction pipelines based on genome release v1.0.

254 For hg-Augustus gene annotation, the HMM gene model was specifically trained for N. attenuata

255 using RNA-seq data from major plant tissues, and gene models were predicted using unmasked genome

256 sequences. The repeat regions were here given less probability to be predicted as a gene, in particular

257 when RNA-seq evidence was missing. For MAKER2 annotation pipeline, we integrated evidence of

258 multiple protein codings from three sources: ab initio gene predictions, transcript evidence and protein

259 homolog evidence. The input evidences for MAKER2 were: 1) ab initio gene predictors (GeneMark and

10

260 Augustus) that were each trained with full-length transcripts; 2) Trinity (v. r20131110) assembled

261 transcripts using RNA-seq data from major tissues; 3) UNIREF90 plant proteins (mapped using genewise,

262 v. wise2-4-1); 4) six high quality plant proteomes (tomato, potato, grape, Arabidopsis, Populus and rice).

263 The MAKER2 annotation pipeline was run on the repeat masked N. attenuata genome.

264 The predicted gene models from hg-Augustus were filtered based on their repeats contents. All

265 genes with repeats occupying more than 50% of the gene length were removed, and genes were retained if

266 less than 10% of their sequence matched to repeats. For genes that contained repeats occupying 10-50%

267 of their entire gene length, we performed an additional search of the plant Refseq database using

268 BLASTX. If the gene matched with a non-repeats homolog (e-value greater than 1e-5 and bit score

269 greater than 200) from the plant refseq database, the predicted gene models were retained for downstream

270 analysis. In total, 35,737 gene models from hg-Augustus predictions were retained. For MAKER2

271 predicted gene models, in addition to the filtering based on repeat content as described above, the gene

272 models that had low evidence support were removed from downstream analysis (eAED <= 0.45, this

273 cutoff was set after manual inspections on the predicted gene models). In total 33,274 gene models from

274 MAKER2 prediction were retained. The predicted gene models from hg-Augustus and MAKER2

275 pipelines were then combined and overlapping gene structures were removed. After manual inspections,

276 we found that hg-Augustus outperformed MAKER prediction when the pipelines predicted different gene

277 structures for the same gene. Therefore, when the two pipelines predicted different gene structures for the

278 same genome region, we retained only the hg-Augustus predicted gene models. After merging the

279 predicted models from both hg-Augustus and MAKER2 predictions, genes with pre-mature stop codons

280 were considered as pseudogenes and were removed from the downstream analysis. The predicted gene

281 models were then transferred to N. attenuata genome release v2.0 using a custom script which first

282 identified homologous regions using BLAST and then predicted gene structure using GeneWise. This

283 finally resulted in 33,449 high quality gene models in the final N. attenuata genome, of which 74.9%

284 were supported by at least 50 RNA-seq reads. Because gene models within a are usually conserved,

11

285 we annotated N. obtusifolia and two other publicly available diploid Nicotiana genomes, N. sylvestris and

286 N. tomentosiformis(23), using a homology-based annotation pipeline based on all predicted N. attenuata

287 protein coding gene models. Using all predicted gene models from N. attenuata, we predicted protein

288 coding gene models in N. obtusifolia, N. tomentosiformis and N. sylvestris using a homolog-based

289 approach.

290 2.5 RNA sequencing and data analysis

291 RNA was isolated from plants of the same origin and inbred generation as described above for N.

292 attenuata DNA isolation. RNA was isolated using TRIZOL® (Thermo Fisher Scientific) according to the

293 instructions of the manufacturer. DNA was removed from all RNA preparations using TURBO™ DNase

294 (Thermo Fisher Scientific) according to the manufacturer’s protocol. In total, twenty one RNA-seq

295 libraries from different plant tissues and the same tissues under different biotic and abiotic stress

296 treatments were first enriched for RNAs with poly-A tails and then used for RNA-seq library construction

297 with Illumina's TruSeq RNA sample preparation kits. The insertion sizes of the libraries are

298 approximately 200 bp. All RNA-seq libraries were sequenced using Illumina 2000 HiSeq platform with

299 read lengths of 50 or 101 bp and resulted in 793,785,373 paired-end Illumina reads.

300 The raw sequence reads were trimmed using AdapterRemoval (v1.1) with parameters “--collapse

301 --trimns --trimqualities 2 --minlength 36”. The trimmed reads were then aligned to the N. attenuata

302 genome assembly (v 2.0) using TopHat2 (v2.1.0) (24), with maximum and minimum intron size set to

303 50,000 and 41 base pairs (bp), respectively, estimated from the N. attenuata genome annotation.

304 To estimate the expression level of assembled genes and transcripts, all trimmed RNA-seq reads

305 were mapped to the assembled transcripts using RSEM (v1.2.20) (25). Transcripts per million (TPM) was

306 calculated for each transcript and gene. To exclude low-expressed genes and transcripts, only genes with

307 TPM greater than five in at least one sample were considered as expressed. Similarly, only the transcripts

308 with TPM greater than one in at least one sample were considered as expressed.

309

12

310 In total, 21 RNA-seq library from different tissues and same tissue with different treatments were

311 sequenced (Table S5). Expression profile of all predicted gene models is provided in Supplementary

312 Dataset 3. For transcript expression, only the transcripts with TPM greater than one in at least one sample

313 were considered as expressed. In total, 25,506 genes and 73,624 transcripts were expressed, respectively

314 (Supplementary Dataset 4). Among all sequenced tissues, roots and pollen tubes expressed the highest

315 and lowest number of both genes and transcripts (Fig. S18), respectively. We further annotated the repeat

316 content of all expressed transcripts using RepeatMasker (open-4.0.5) with repeat library annotated from N.

317 attenuata as described previously. Among these 73,624 expressed transcripts, 17,463 of them (~24%)

318 were found containing repeats (Supplementary Dataset 4), suggesting that a large number of repeats of N.

319 attenuata are expressed.

320 The floral gene expression was calculated based on RNA-seq library from open flowers (OFL),

321 and for tissues that contain more than one library (with different treatments), we used the average

322 expression values among all libraries to represent the expression level in the respective tissues. Tissue-

323 specific genes/transcripts were considered as τ index >= 0.95. Among all sequenced tissues, roots have

324 the highest number of tissue-specific genes and transcripts (Fig. S19). Interestingly, although pollen tubes

325 expressed fewer genes (4201) than other tissues, they harbored the largest proportion of tissue-specific

326 genes (10%).

327 We annotated all of the alternative splicing (AS) events using JuncBASE (v0.6) (26) based on the

328 splicing junctions (SJs) information from the BAM files produced by Tophat2 (16). To reduce false

329 positives that likely result from non-specific or erroneous alignments, all original junctions were filtered

330 based on overhangs greater than 13 bp in length, as shown in our previous study(27). The overall pattern

331 of AS detected among different tissues is consistent with our previous study based on roots and leaves of

332 N. attenuata, which showed that intron retention (IR) and exon skipping (ES) are the most and least

333 abundant AS events, respectively (Fig. S20).

13

334 2.6 Expression of transposable elements

335 The expression of TE was analyzed using two different datasets. Among different tissues, RNA-

336 seq data were used. We first remapped all clean reads to the N. attenuata genome using tophat2 with the

337 parameters mentioned above, except for the retention of 150 multiple mapped reads (-g 150). Then, the

338 mapped bam files were used to estimate expression of the different TE families using TEtranscripts (28).

339 Among all measured tissues, we found that TE expression was highest in pollen tube (Fig. S21).

340 For M. sexta herbivory-induced TE expression in leaves, we used a previously published time-

341 course microarray dataset (29). All probes were mapped to the genome using blastn. Only probes that had

342 a perfect match to the genome were considered. Overlaps between positions of mapped probes and

343 repeats were identified using bedtools (30) based on repeatmakser output. All probes of which 100%

344 could be located within the repeat region and did not map to any annotated genes were used to analyze

345 expression of the M. sexta herbivory-induced expression of TEs. The microarray data were first

346 normalized using quantile normalization, and differential gene expression was analyzed using the limma

347 R package (31). Probes that showed a false discovery rate (FDR) adjusted P value lower than 0.05 and a

348 fold change greater than 2 were considered as induced by M. sexta OS. Differentially expressed probes

349 that were annotated as repeats are reported in Supplementary Dataset 5. Similar to protein coding genes,

350 most of TEs showed the highest induction at 1 h after elicitation (Fig. S22), except LTR/copia, which

351 were induced the most at 13 h after elicitation.

352 2.7 Functional annotation of protein coding genes

353 The functions and gene ontology (GO) of all predicted protein coding genes were annotated

354 using BLAST2GO (32). All protein coding sequences were first compared to the plant reference sequence

355 database (downloaded in February 2016) using BLAST with e-value cutoff 1e-10. The BLAST results

356 were imported to BLAST2GO and the functions and GO terms of each gene were annotated using the

357 default settings. In total, 59.3% of genes were assigned to at least one GO term. The enzyme commission

358 (EC) number was also assigned to the predicted genes by comparing to the KEGG pathway databases.

14

359 The pathways that contained only one mapped enzyme were removed. Overall, 5,370 genes (16.1 %)

360 were assigned to 942 EC codes that were involved in 144 pathways. The top three pathways that contain

361 the highest number of annotated genes are starch and sucrose metabolism, purine metabolism and

362 phenylalanine metabolism. Furthermore, the KEGG orthologs (KO) were annotated using kobas 2.0 (33),

363 with e-value cutoff set as 1e-10. In total, 10,746 genes were assigned to KO terms. The information of

364 mapped KO information for each gene is reported in the Supplementary Dataset 6.

365 2.8 Functional annotation of protein domains

366 To identify the functional domains of protein-coding genes, INTERPROSCAN (v. 5.16-55.0)(34)

367 was used to scan protein sequences against the protein signatures from InterPro database (v. 55.0,

368 downloaded in February, 2016). The InterPro database integrates protein families, Pfam domains and

369 functional sites from different databases. The following databases from InterPro were used: Coils (v.

370 2.2.1), CATH-Gene3D (v. 3.5.0), HAMAP (v. 201511.02), Panther (v. 10.0), Pfam (v. 28.0), PIRSF (v.

371 3.01), PRINTS (v. 42.0), ProDom (v. 2006.1), PROSITE (v. 20.113), SMART (v. 6.2), SUPERFAMILY

372 (v. 1.75) and TIGRFAM (v. 15.0). In total, 38,810 protein domains were identified, consisting of 3,925

373 unique Pfams. For predicted genes in N. attenuata genome, 71.1% (23,783 out of 33,449) of them contain

374 at least one predicted protein domain. The top 20 Pfam and SUPERFAMILY domains are shown in Fig.

375 S23 and S24. The top 20 most frequent domains account for 22.7% (8,780 out of 38,729) and 37.6%

376 (10,950 out of 29,091) of all predicted Pfam domains and SUPERFAMILIES respectively. In comparison

377 to tomato(35), most of the top 20 SUPERFAMILIES are similar except three, which are ribonuclease H-

378 like (SSF53098), DNA/RNA polymerases (SSF56672) and trans-glycosidases (SSF51445). At domain

379 levels, the top 20 list in N. attenuata and tomato differed in seven domains, including domains of reverse

380 transcriptase-like (PF13456), PPR repeat (PF12854) and a domain of unknown function (PF14111).

15

381 2.9 Annotation of transcription factor and protein kinase genes

382 The transcription factor and protein kinase-containing genes were identified based on the

383 identified domain in each gene according to rules described in Pérez-Rodríguez et. al(36) using the iTAK

384 tool (http://bioinfo.bti.cornell.edu/cgi-bin/itak/index.cgi). In total, 2,498 and 1,071 genes were annotated

385 as transcription factor and protein kinase genes respectively (Supplementary Dataset 7). Among all

386 transcription factors, MYB, AP2 and bHLH were the three largest families.

387 3. Evolutionary genomics analyses

388 3.1 Homologous group identification

389 The detailed methods used for defining the homologous groups (HG) are given in the Method

390 section of the main text. In total, we identified 23,340 HGs (with at least two homolog sequences) from

391 the 11 plant genomes (Table S1). The average gene family size within each species varied from 1.8 in C.

392 sativus to 2.8 in P. trichocarpa. Among these identified HGs, 4,328 contained at least one gene from each

393 of all 11 dicots species, thus representing the core dicot HGs. In N. attenuata, 13,632 HGs containing

394 30,513 protein-coding genes were identified, of which 89.7% (12,231) contained orthologs from both

395 Nicotiana genomes sequenced in this study. In addition, 8.8% (2,936 of 33,449) of N. attenuata genes

396 were assigned as orphan genes, since they did not cluster with any other sequences (Fig. S25).

397 3.2 Detection of gene duplication events in each HG

398 We first constructed the phylogenetic tree for each HG. For this, we aligned all coding sequences

399 for each HG using MUSCLE (v.3.8.31) (37), based on translated protein sequences from TranslatorX

400 (v.1.1) (38). For all aligned sequences, all non-informational sites (gaps in more than 20% of sequences)

401 were removed using trimAL (v1.4) (39). Then, for each HG, PhyML (v. 20140206) (40) was used to

402 construct the gene tree with the best nucleotide substitution model, estimated based on jModeltest2

403 (v.2.1.10) (41) with the following parameters: -f -i -g 4 -s 3 -AIC -a. The support for each branch was

404 calculated using the approximate Bayesian method.

16

405 The duplication events within each HG were predicted based on the constructed gene trees using

406 a tree reconciliation algorithm which compares the structure of a species’ tree and gene tree to infer the

407 duplication events (42). This approach allowed us to predict the history of gene duplication events on

408 each branch of the species tree. To reduce false positives, we only considered tree structures with

409 approximate Bayes branch supports greater than 0.9 for the downstream analysis. In addition, we also

410 characterized the gene duplication events based on their genomic location. Tandem duplicated gene pairs

411 were identified as gene pairs from the same HG if located within a distance of 100 kb or four genes from

412 each other.

413 Using the tree reconciliation algorithm mentioned above, we identified 81,859 duplication events

414 on the branches of 11 species tree. Among all these duplicated events, genes from N. attenuata were

415 involved in 27.0% of them (22,054 out of 81,859). The species/genera that were known to have

416 experienced whole genome duplication (WGD) or triplication (WGT), such as Arabidopsis, Populus,

417 Mimulus, showed much higher numbers of duplication events (Fig. S4, and Supplementary Dataset 8).

418 Furthermore, a large number of gene duplication events were found on the branch of ,

419 indicating the shared WGT (35) between the Nicotiana and Solanum genera (Fig. S4). Although the

420 cultivated potato is an autotetraploid, the genome sequences used for our analysis represent a diploid

421 species (2n=2x=24), which were obtained through the combinations of sequencing a homozygous

422 doubled-monoploid derived from classical tissue culture and a heterozygous diploid breeding line (43).

423 Therefore the additional polyploidization likely did not affect the detection of gene duplication events at

424 the branch of Solanaceae and Nicotiana.

425 Additionally, we inferred the type of duplication to be tandem duplication or genome-wide

426 duplication by using gene distance and syntenic block information of the genome, respectively. All

427 duplicated genes gene pairs that are separated less than 10 genes and were less than 150 kbp apart in the

428 genome of any species of tomato, potato or N. attenuata were considered as derived from tandem

429 duplications. In total, we found 535 duplications that occurred on the Solanaceae branch (7.8%) which

430 were likely to be tandem. We further used syntenic blocks estimated in tomato and potato that were

17

431 predicted by the Plant Genome Duplication Database (PGDD,

432 http://chibba.pgml.uga.edu/duplication/index/downloads) to infer duplications that resulted from genome-

433 wide duplications. All duplication events occurred on the Solanaceae branch and can be found in a

434 syntenic block (with e-value < 0.01 and 0.3

435 duplications.

436 One example of duplications via WGT is the evolution of threonine deaminase (TD) family,

437 which has four copies in both Nicotiana and Solanum resulting from one round of local duplication

438 followed by a genome-wide duplication (Fig. S6). TD is a Solanaceae specific anti-herbivore defense

439 gene that encodes a pyridoxal phosphate-dependent enzyme that dehydrates threonine to α-ketobutyrate

440 and ammonia, as the committed step in the biosynthesis of isoleucine (Ile) in plants. After one tandem

441 duplication followed by WGT, the duplicated gene copy (TD2.2), likely through several adaptive

442 substitutions (44), evolved as an anti-digestive defense in tomato that degrades threonine in the guts of

443 attacking larvae (45). Interesting, in N. attenuata TD2.2 likely retained its biosynthetic function to supply

444 the Ile essential for jasmonate-mediated defense signaling (46). Consistent with a functional

445 specialization and in contrast to other WGT-derived TD copies, N. attenuata TD2.2 exhibits a unique

446 spatial expression pattern and is specifically induced by herbivore attack (Fig. S6).

447

448 3.3 Confirmation of the whole-genome triplication in Nicotiana

449 To further support the conclusion that Nicotiana spp. share the WGT event previously identified

450 in Solanum (35), we performed molecular evolution and phylogenomic analyses. For the molecular dating,

451 the synonymous substitution rates (Ks) for the gene pairs in all homologous groups (with less than 5

452 genes per species) were calculated based on YN model using KaKs_Calculator (v.1.2) (47). To estimate

453 the divergence time, the formula T=Ks/ (2*mutational rate) was used, where the mutational rate was set

454 to 7.1*10-9 substitutions/site/million years (MYA)(48). The within-species 4dTV distribution in N.

455 attenuata and tomato showed a similar peak between 0.2 to 0.25, indicating that a genome-wide

18

456 duplication event had occurred at a similar time in both species (Fig. S5). Given the previously identified

457 WGT in tomato, it is likely that Nicotiana shares this genome-wide duplication event with tomato. This

458 inference is also consistent with the more ancient estimated age of this WGT (71 ±19.4 MYA)(35)

459 compared to the recent divergence between Nicotiana and Solanum (24.4 MYA to 28.6 MYA).

460 To further validate the shared WGT event between Nicotiana and Solanum, we analyzed a subset

461 of genes that underwent a triplication in either Solanum or Nicotiana. The genes that underwent a

462 triplication event in Solanum were identified based on the fact that the tree structure consists of one

463 outgroup node (at least one of genes in Arabidopsis, Populus, Cucumber or Vitis) and two duplication

464 events that were shared between tomato and potato with no gene loss in either tomato or potato. Similarly,

465 the genes that underwent a triplication event in Nicotiana were identified based on the fact that the tree

466 structure consists of one outgroup node and two duplication events that were shared between N. attenuata

467 and N. obtusifolia with no gene loss in either species. In order to reduce the false positives, we only

468 considered the tree nodes that have approximate Bayes branch supports of greater than 0.9. We then

469 analyzed the number of these triplication events that were shared between Solanum and Nicotiana based

470 on the complete tree structure. In total, we identified 229 and 436 triplication events in Solanum and

471 Nicotiana, respectively. In both the Solanum and Nicotiana datasets, the majority (89.2% and 79.9%

472 respectively) of the triplication events were shared with the other genus. These results are consistent with

473 the fact that the WGT event found in tomato is indeed shared between Solanum and Nicotiana.

474 Furthermore, using MCSCAN, we also found 22 duplication blocks, each of which contains at least 20

475 genes among assembled pseudo-chromosomes. In addition, we also compared the identified triplication

476 events in Solanum and Nicotiana with those of genes in Mimulus. Our results showed that a majority of

477 the triplication events (98.7% and 98.4% in Solanum and Nicotiana, respectively) were not shared with

478 Mimulus, indicating that gene duplication events in Mimulus are independent of the WGT found in

479 Solanaceae.

19

480 3.4 Estimation of species divergence times

481 We calculated the divergence times of four Solanaceae spp. (N. attenuata, N. obtusifolia, S.

482 lycopersicum and S. tuberosum) with V. vinifera as the outgroup using a Bayesian approach. In brief, we

483 first identified 1,622 one-to-one orthologs using BLAST reciprocal best-hits (RBH) algorithm. Then each

484 orthologous group was aligned using the protein coding sequences with MUSCLE (v. 3.8.31) (37) based

485 on translated protein sequences from TranslatorX (v.1.1) (38). These alignments were concatenated to one

486 super-alignment and all ambiguously aligned regions were removed using trimAL (v. 1.4) (39) with

487 parameter: -gt 0.8. The final alignment that contained ~2.1Mb nucleotide sites was then used for species

488 divergence time estimation. The nucleotide substitution parameters were first estimated with the HKY85

489 model using baseml from the PAML package (v4.8) (49). The branch-length and the corresponding

490 variance-covariance matrix were then estimated using estbranches. The results from estbranches were

491 used to predict divergence times with the MultiDivTime program (50). The MultiDivTime analysis was

492 performed according to the manual with the following parameters: the root-to-tip mean was set to 119.5

493 MYA with a standard deviation of 10 MYA based on the divergence time estimated by Guyot et. Al (51);

494 the evolutionary rate of the root was set to 0.007631799 substitutions per nucleotide site per MYA

495 calculated based on the results of baseml; and a burn-in time of 10,000,000 generations was used. MCMC

496 chains were sampled every 10,000 generations until 100,000 samples were taken and the calibration

497 points of 7.3 MYA and 23.7 MYA for the divergence time of tomato-potato and tomato-Nicotiana,

498 respectively (52), were used.

499 The analysis revealed that N. attenuata and N. obtusifolia diverged about 12.5 MYA ago (95%

500 confidence interval: 10.1 MYA to 14.7 MYA), which is about five million years earlier than the

501 divergence time between potato and tomato (7.1 MYA, range from 6.4 MYA to 8.2 MYA). The estimated

502 divergence time between the Nicotiana and Solanum lineages was estimated to be of 27.1 MYA (95%

503 confidence interval: 24.4 MYA to 28.6 MYA) and the divergence time of V. vinifera and solanaceous

504 species is around 116.70 MYA (95% confidence interval: 100.9 MYA to 133.6 MYA).

20

505 3.5 Identification of lineage-specific gene family expansion

506 We analyzed Solanaceae and Nicotiana lineage-specific HG expansions using the previously

507 identified gene duplication events. For this, we performed a Fisher’s exact test to identify gene families

508 that exhibit significantly more duplications in comparison to the genome-wide pattern. Because the total

509 number of gene families was large, we reduced the number of tests and false positives for a given branch

510 by calculating a Z-score for each HG and only considering those HGs that had Z-scores > 1.96 for

511 Fisher’s exact test. Multiple testing corrections were then performed based on these P-values using the

512 false discovery rate (FDR < 0.1) method. It should be noted that this is a conservative approach, as many

513 small gene families cannot be detected by such stringent statistical tests. Hence, the gene families

514 identified by this analysis as significantly expanding are likely true positives.

515 For the Solanaceae branch, we found 1,596 HGs with a Z-score greater than 1.96. Among them,

516 four HGs experienced significantly more duplications than the genome-wide pattern based on Fisher’s

517 exact test. One of these belongs to the S-locus F-box protein (SLF) family, of which 10 duplications were

518 identified on the branch of Solanaceae. Consistently, a recent study also found increased number of SLF

519 genes in Petunia, another Solanaceae species (53). The phylogenetic tree combining SLF from Petunia,

520 Solanum and Nicotiana revealed that this particular SLF subfamily experienced frequent duplication and

521 gene loss in different lineages (Fig. S26). The SLF gene family is known to be involved in pollen

522 recognition and self-compatibility processes, and the expansion of this family might have contributed to

523 the observed diversity of the mating systems in the Solanaceae family (54). Another HG that significantly

524 expanded in the Solanaceae is corresponding to the Zeatin O-glucosyltransferase-like (ZOG) gene family,

525 of which we detected 25 duplication events (Fig. S27). Genes from the ZOG family are involved in

526 regulating cytokinin levels a process critical in tuning the signaling mediated by this hormonal pathway to

527 biotic and abiotic stresses (55, 56). Expansion of this gene family in the Solanaceae branch likely reflects

528 physiological adaptations to the diverse habitats colonized by these species.

529 Analyzing gene duplications within the Nicotiana branch showed that 256 HGs have a Z-score

530 greater than 1.96, and 58 HGs experienced significantly more duplications, based on Fisher’s exact test,

21

531 than seen from the genome-wide pattern (Supplementary Dataset 9). Among these significantly expanded

532 gene families, an NBS-LRR type disease resistance gene family specifically duplicated three times in

533 Nicotiana but not in other branches of the Solanaceae. Genomic location of these duplicated genes shows

534 that all four genes are located in a gene cluster within one scaffold (Fig. S28). Further expression analyses

535 revealed that all four genes are highly expressed in roots, suggesting that these genes might be involved in

536 plant-pathogen interactions in Nicotiana roots.

537 Another gene family that significantly expanded in Nicotiana species is that of the Purine uptake

538 permeases (Fig. S29). Genes from this family are known as plasma membrane-localized transporters (57).

539 More specifically, in tobacco, one member of this family, Nicotine Uptake Permease1 (NUP1), has

540 demonstrated functions in regulating nicotine localization via its transporter activity (57, 58). Furthermore,

541 NUP1 has also been shown to act as a transcriptional regulator of the key transcription factor ERF189 in

542 the nicotine biosynthesis pathway (58), although details on the underlying mechanism are lacking. The

543 expression profile of NUP1 in N. attenuata highlights that this gene is not only expressed in roots, but

544 also shows high levels of expression in floral tissues (Fig. S29), which indicates that NUP1 might be

545 involved in the allocation of nicotine to flowers of N. attenuata, where it serves as a deterrent for

546 pollinators and increases out-crossing rates (59). In addition, several other members of this family are

547 specifically induced in the leaves and stem by simulated herbivory, indicating that these genes in N.

548 attenuata might also be involved in allocating nicotine to particular parts of the vegetative tissues as an

549 anti-herbivore toxin.

550 4. Evolution of nicotine biosynthesis

551 4.1 Identification and reconstruction of the evolution history of nicotine biosynthesis genes

552 We identified nicotine biosynthesis genes and their ancestral copies based on sequence homology

553 using blastp and manual curations (Table S6). Gene evolutionary history was inferred using the

554 combination of phylogenetic and synteny analysis when possible. However, due to lack of genomic

22

555 information from the closely related genus of Nicotiana, it remains unclear whether the Nicotiana-lineage

556 specific duplications, such as the NAD pathway genes and neofunctionalization of BBLs occurred in the

557 ancestors of all genera, in which trace levels of nicotine have been controversially reported, such as

558 Crenidium, Cyphanthera and Duboisia (60).

559 The putrescine biosynthetic pathway for nicotine biosynthesis could have two routes: 1)

560 synthesized by ODC-mediated decarboxylation of ornithine or 2) synthesized by ADC-mediated

561 decarboxylation of arginine. Recent studies that individually silenced ODC and ADC suggest that the

562 putrescine for nicotine biosynthesis in Nicotiana is likely through the former route (61, 62). Consistently,

563 while phylogenetic analysis showed that two ADC copies are present in Nicotiana genomes likely

564 through the WGT, none of them showed root specific expression pattern.

565 4.2 Validation of the promoter sequences of nicotine biosynthesis genes and their ancestral copies

566 We validated the 2kb promoter sequences of 26 genes from N. attenuata, which included all

567 nicotine biosynthesis genes and several of their ancestral copies using Sanger sequencing. Primers were

568 first designed based on the assembled genome sequences, and direct PCRs were performed to amplify the

569 target fragments (Table S7). For genes that gave multiple bands or no PCR products, nested PCRs were

570 performed. The amplified fragments with the expected size were cut from gels and used for either direct

571 sequencing or sequencing after cloning into the pJET vector. All Sanger sequences were manually

572 inspected and compared to the genome assembly. Among all tested genes, only the promoter region of

573 one gene, NIATv7_g05934, was found miss-assembled (PCR products do not match the genome

574 assembly). For all other correctly assembled genes, 99.86% identity was found between Sanger

575 sequencing and the assembled genome.

576 4.3 Prediction of TE-derived putative miRNA target sites into regulatory region of nicotine

577 biosynthesis genes

578 Insertions of TE are known to introduce putative target sites of miRNAs. To examine whether the

579 observed TE insertions within the 2kb upstream region of nicotine biosynthesis genes and of their

23

580 ancestral copies had introduced candidate miRNA targeting sites, we performed in silico miRNA target

581 site predictions using the method described in Pandey et al (63). Briefly, all the candidate promoter

582 sequences were first checked for miRNA seed-pairs using custom written Perl script. The promoter

583 sequences from 3’ end were used for "Watson-Crick" complementarity matching against 5’ ends of

584 miRNAs after generating 7- to 13-nt seeds starting from first or second nucleotide position at 5’ end of

585 the miRNAs. The matches were extended by allowing mismatches after the seed match or 9th nucleotide

586 in the miRNAs.

587 In total, 23 and 17 putative miRNA target sites were predicted among the 2kb upstream regions

588 of 10 genes that are involved in nicotine biosynthesis and of their ancestral copies or non-root specific

589 expressed genes (10), respectively (Table S8). Among all nicotine biosynthesis genes, five predicted

590 miRNA target sites that were detected from 4 genes (BBL2.2, PMT1.2, ODC2 and QS) overlapped with

591 TE insertions.

592 4.4 Evolution of root-specific expression of nicotine biosynthesis genes

593 One of the important features associated with efficient nicotine production is the root-specific

594 expression of nicotine biosynthesis genes. To obtain evolutionary insights into the process of root-specify

595 expression acquisition by these genes, we explored the expression pattern of orthologous genes of the

596 nicotine biosynthetic pathway in tomato and potato. The results show that although absolute expression

597 levels of these orthologues are consistently low in the roots of these two species, some genes exhibit a

598 certain degree of root-specific expression in these Solanum species, such as PMT1 and ODC2 from the

599 duplicated polyamine pathway as well as the homologs of BBLs and A622. These observations indicate

600 that the evolution of a preferential root expression for these genes may have taken place before the

601 divergence between Nicotiana and Solanum (Fig. S15). This, however, does not exclude the likely

602 possibility that the neofunctionalization of BBL and A622 occurred after the divergence between

603 Nicotiana and Solanum. In contrast, the root-specific expression of the NAD pathway genes, such as AO2,

604 QPT2 and QS likely evolved specifically in Nicotiana. A further comparative analysis of the expression

24

605 profile of ERF189, the key transcription factor that regulates most of the nicotine biosynthesis genes,

606 showed that its root-specific expression likely evolved in Nicotiana, as its orthologues are more uniformly

607 expressed across tissues in tomato and potato (Fig. S17).

608 From a mechanistic point of view, there are at least two possibilities: 1), while gaining additional

609 GCC/G-box motifs increased their expression in roots, nicotine genes may have lost the transcription

610 factor binding sites that allow them to be expressed in other tissues, and 2) specific smRNAs may have

611 contributed either directly to or through interactions with endogenous target mimicry (64) the suppression

612 of the expression of nicotine genes in non-root tissues. We speculate that both mechanisms might have

613 contributed to root expression specialization. Future molecular studies that specifically manipulate the

614 promoter sequences of nicotine biosynthesis genes and targeted smRNAs will be required to understand

615 this process.

616 4.5 Data access

617 To provide free access to our genome data, we established a webserver (Nicotiana attenuata Data

618 Hub: http://nadh.ice.mpg.de/)(65), which allows data visualization of gene families and co-expressed

619 genes, as well as data download of the annotated gene models of the two Nicotiana genomes. The Whole

620 Genome Shotgun projects of N. attenuata and N. obtusifolia have been deposited at DDBJ/ENA/GenBank

621 under the accession MJEQ00000000 and MJEQ00000000, respectively. All short reads and PacBio reads

622 used in this study were deposited in NCBI under BioProject PRJNA317743 (RNA-seq reads of N.

623 attenuata), PRJNA316810 (short gun reads of N. attenuata), PRJNA378521 (GBS sequencing of AI-RIL

624 population), PRJNA317654 (WGS PacBio long reads of N. attenuata), PRJNA316803 (RNA-seq reads

625 and assembly of N. obtusifolia), and PRJNA316794 (short reads of N. obtusifolia). The assembled

626 genome sequences of N. attenuata and annotation information of all protein coding genes are available

627 from the CoGe platform, which provides genome-browser view and downstream comparative genomic

628 analysis, and Sol Genomics network (https://solgenomics.net/). There are 129 gene models that are

629 deposited in CoGe and SGN but were not submitted to NCBI because they contain introns smaller than

25

630 10bp, which do not fulfill the criteria for NCBI submission. However these genes models are likely to be

631 correct based on their sequence conservation. Therefore, they were retained and made them publically

632 available via SGN and CoGe. Including or excluding these gene models does not affect any conclusion of

633 the study. The python scripts used for phylogenomic analysis are deposited on github

634 (https://github.com/thobrock/Nicotiana-attenuata-genome-project/).

635

26

636 5. Supplemental Tables

637 Table S1. Genomes of 11 plant species used for comparative genomic analysis.

Species Version # of gene models URL Reference

N. attenuata r2.0 33,449 http://nadh.ice.mpg.de T his st udy

N. obtusifolia v1.0 27,911 http://nadh.ice.mpg.de T his st udy

The Arabidopsis Genome Initiative.

A. thaliana T AIR 10 27,416 http://phytozome.jgi.doe.gov/arabidopsis 2000 (66)

http://peppersequence.genomics.cn/page/species/d C. annuum v2.0 35,336 Kim et al. 2014 (67) ownload.jsp Huang et al. 2009 (68) C. sativus v1.0 21,503 http://phytozome.jgi.doe.gov/cucumber

Hellsten et al. 2013 (69) M. guttatus v2.0 28,140 http://phytozome.jgi.doe.gov/mimulus

Tuskan et al. 2006 (70) P. trichocarpa v3.0 41,335 http://phytozome.jgi.doe.gov/poplar

Tomato Consortium. 2012 (35) S. lycopersicum IT AG2 .3 34,727 http://phytozome.jgi.doe.gov/tomato

Hirakawa et al. 2014 (71) S. melongena v2.5.1 42,035 ftp://ftp.kazusa.or.jp/pub/eggplant/

Xu et al. 2011 (43) S. tuberosum v3.4 35,119 http://phytozome.jgi.doe.gov/potato

Jaillon et al. 2007 (72) V. vinifera Genoscope 12X 26,346 http://phytozome.jgi.doe.gov/grape

638

639

27

640 Table S2. Completeness of N. attenuata and N. obtusifolia, two other published Nicotiana species,

641 potato and tomato genomes (potato version 206 and tomato version 225) estimated based on 248

642 ultra-conse rved CEGs using CEGMA.

Comple te Partial Groups Average G roup 1 G roup 2 G roup 3 G roup 4 Average G roup 1 G roup 2 G roup 3 G roup 4

#Prots 208 51 44 52 61 243 63 54 61 65

%Completeness 83.87 77.27 78.57 85.25 93.85 97.98 95.45 96.43 100 100

N. attenuata #Total 463 95 86 121 161 646 145 135 167 199

Average 2.23 1.86 1.95 2.33 2.64 2.66 2.3 2.5 2.74 3.06

%Ortho 63.94 52.94 54.55 71.15 73.77 75.31 66.67 74.07 78.69 81.54

#Prots 211 52 45 55 59 241 65 53 61 62

%Completeness 85.08 78.79 80.36 90.16 90.77 97.18 98.48 94.64 100 95.38

N. obtusifolia #Total 421 96 78 113 134 573 121 117 169 166

Average 2 1.85 1.73 2.05 2.27 2.38 1.86 2.21 2.77 2.68

%Ortho 56.87 51.92 42.22 65.45 64.41 70.54 53.85 66.04 88.52 74.19

#Prots 195 49 39 50 57 234 60 52 60 62

%Completeness 78.63 74.24 69.64 81.97 87.69 94.35 90.91 92.86 98.36 95.38

N. sylvestris #Total 402 85 73 102 142 588 124 118 157 189

Average 2.06 1.73 1.87 2.04 2.49 2.51 2.07 2.27 2.62 3.05

%Ortho 60.51 42.86 61.54 64 71.93 73.93 60 71.15 85 79.03

#Prots 164 51 35 39 39 188 57 42 44 45

%Completeness 66.13 77.27 62.5 63.93 60 75.81 86.36 75 72.13 69.23

N. tomentosiformis #Total 321 84 62 88 87 441 113 88 121 119

Average 1.96 1.65 1.77 2.26 2.23 2.35 1.98 2.1 2.75 2.64

%Ortho 56.71 41.18 48.57 66.67 74.36 72.34 61.4 66.67 81.82 82.22

#Prots 208 51 43 51 63 239 63 51 60 65

%Completeness 83.87 77.27 76.79 83.61 96.92 96.37 95.45 91.07 98.36 100

S. tuberosum #Total 392 84 76 98 134 512 108 102 137 165

Average 1.88 1.65 1.77 1.92 2.13 2.14 1.71 2 2.28 2.54

%Ortho 53.85 41.18 53.49 54.9 63.49 65.27 47.62 62.75 70 80

#Prots 215 55 46 54 60 238 63 52 61 62

%Completeness 86.69 83.33 82.14 88.52 92.31 95.97 95.45 92.86 100 95.38

S. lycopersicum #Total 391 88 73 104 126 497 112 98 136 151

Average 1.82 1.6 1.59 1.93 2.1 2.09 1.78 1.88 2.23 2.44

%Ortho 47.44 34.55 43.48 53.7 56.67 58.4 49.21 55.77 60.66 67.74 643

28

644 Table S3. Summary of filtered (duplicate removal, quality clipping) WGS sequencing data used for

645 the assemblies of N. attenuata and N. obtusifolia.

Nicotiana attenuata (2.5Gb) Nicotiana obtusifolia (1.4Gb)

Library PE180 PE250 PE600 PE950 MP5500 MP20000 SR454 PE480 PE1050 MP3500 type

Number 354,327,6 254,673,1 152,657,0 92,178,3 80,509,5 2,005,65 478,531,3 222,388,2 108,153,56 paired - 62 42 90 08 74 8 66 20 2 reads

Number 103,391,4 30,351,12 68,691,92 77,197,1 25,188,0 - - - - - non paired 83 1 4 96 20

Avg. 87.5 97.9 91.7 88.6 75.0 75.0 644.1 91.3 77.6 94.6 length [bp]

Length 40.03 27.91 20.30 15.01 6.04 0.150 16.22 43.70 17.26 10.23 sum [Gb]

Approx.

seq. 16.7 11.6 8.5 6.3 2.5 0.1 6.8 31.2 12.3 7.3

coverage

Approx.

frag. 13.3 13.3 19.1 18.2 92.3 8.4 6.8 82.0 83.4 135.2

coverage

646

29

647 Table S4. Repeat content among six solanaceous genomes.

S. tuberosum S. lycopersicum N. obtusifolia N. tomentosiformis N. sylvestris N. attenuata TE class TE type (Mb) (Mb) (Mb) (Mb) (Mb) (Mb)

DNA 10.3 11.9 36.2 66.5 69.4 70.0

DNA/CMC-EnSpm 7.5 9.4 11.7 7.6 8.6 12.4

DNA/Dada 0.1 0.0 2.9 5.5 7.8 7.7

DNA/MULE-MuDR 3.3 3.2 0.4 0.6 0.4 0.6

DNA/MuLE-MuDR 1.4 5.0 3.2 5.5 9.7 9.5

DNA/PIF-Harbinger 5.6 4.3 4.7 4.8 5.8 6.0

DNA DNA/TcMar-Pogo 0.4 0.4 - - - -

DNA/TcMar-Stowaway 9.2 4.1 2.7 3.9 4.0 3.8

DNA/hAT 0.1 - 0.1 0.2 0.4 1.1

DNA/hAT-Ac 7.5 4.1 9.0 11.1 12.5 12.7

DNA/hAT-Charlie 0.1 - - - - -

DNA/hAT-Tag1 1.7 1.2 1.2 2.0 1.5 2.0

DNA/hAT-Tip100 4.6 2.6 1.5 2.3 2.3 2.2

LINE - - 11.7 14.6 19.2 22.9

LINE/L1 10.6 7.9 18.2 12.0 18.3 24.9

LINE/L1-Tx1 - - 0.0 0.0 0.5 0.6 LINE LINE/L2 - - 0.5 0.2 0.5 0.5

LINE/R1 - - 0.2 0.3 0.4 0.4

LINE/RTE-BovB 5.1 3.3 8.6 13.0 15.4 12.4

LTR 0.1 - 83.1 133.7 129.4 132.5

LTR/Caulimovirus 3.7 2.4 12.5 1.5 2.4 1.1

LTR LTR/Copia 29.1 48.8 42.6 61.5 87.1 131.0

LTR/ERVL - - 0.0 0.9 4.1 4.0

LTR/Gypsy 244.6 318.0 519.7 864.6 1171.4 1201.5

RC/Helitron 2.4 0.6 3.1 0.0 3.0 3.4

Retrotransposon 3.8 3.5 8.1 3.4 12.6 14.8

Others SINE 0.7 0.5 1.3 13.5 2.5 2.5

Satellite 3.8 1.0 5.9 1.9 3.2 4.1

Unknown (Mb) 46.4 37.1 3.2 5.1 9.1 9.5

TE size (Mb) 402.2 469.2 792.3 1236.2 1601.6 1694.2

Assembled genome size excluding Ns (Mb) 676.3 737.6 1223.2 1642.4 2047.5 2090.5

Percentage of TE in the genome (%) 59.5 63.6 64.8 75.3 78.2 81.0 648

30

649 Table S5. Information on samples used for RNA-seq.

Treatment / % uniquely mapped Tissue Library ID # raw reads # clean reads Read length development stage reads

M. sexta OS induced on Roots leaves NA1498ROT 327,772,944 317,843,200 93.77 50bp

Control NA1717LEC 61,531,550 16,430,982 93.29 100bp

Leaves M. sexta OS locally 328,071,888 induced NA1500LET 317,905,162 91.94 50bp

Smoke solution treated NA1501SES 51,423,280 49,120,770 90.53 50bp

Seeds Water treated NA1502SEW 75,944,970 73,525,266 90.31 50bp

Dry NA1503SED 63,463,542 61,285,870 89.01 50bp

M. sexta OS treated on Stems 72,473,514 leaves NA1504STT 70,575,710 94.70 50bp

Early developmental 44,064,054 Corollas st age NA1505COE 23,081,618 95.14 100bp

Late developmental stage NA1515COL 41,110,650 17,017,608 94.71 100bp

Stigmas Mat ure NA1506STI 39,281,658 17,878,354 94.88 100bp

Germinated on pollen Po l l en t u bes 41,692,244 germinat ion medium NA1507POL 22,451,272 96.83 100bp

Without pollination NA1508SNP 45,428,336 21,445,272 94.64 100bp

Pollinated with non-self 39,071,214 Styles pollen NA1509STO 18,816,496 94.29 100bp

Pollinated with self- 50,505,064 pollen NA1510STS 23,107,862 95.20 100bp

Nectaries Mat ure NA1511NEC 55,777,620 23,403,620 92.92 100bp

Anthers Mat ure NA1512ANT 41,480,422 18,767,454 96.22 100bp

O vari e s Mat ure NA1513OVA 39,608,326 17,426,638 94.42 100bp

Pedicels No treatment NA1514PED 41,520,690 19,508,002 95.14 100bp

Fully opened NA1516OFL 43,791,980 18,550,688 94.87 100bp Fl owe rs Buds NA1517FLB 45,864,746 22,167,496 95.47 100bp

Leave samples that were

Leaves collect at different time NA1821CTN 37,692,054 36,884,606 95.44 100bp

points

31

650 Table S6. Annotation of nicotine biosynthesis genes and number of motifs in the 2kb upstream

651 region. TPM: transcript per million.

G ene ID in # of G-box # of GCC Root expression Reference sequences in Gene name References N. attenuata motifs motifs (TPM) database

NIATv7_g12091 AO2 5 2 3640.36 DW001381 (73)

NIATv7_g36700 QS 2 2 3311.29 AF154657 (73)

NIATv7_g25254 QPT2 3 2 2816.12 AJ748263.1 (74)

NIATv7_g05809 ODC2 2 0 3866.97 AF127242.1 (73) Nicotine NIATv7_g61142 PMT1.1 4 3 4457.42 D28506.1 (75) biosynthesis NIATv7_g05615 PMT1.2 4 4 1806.61 D28506.1 (75) copies NIATv7_g14063 MPO1 1 3 454.52 AB289456.1 (76)

NIATv7_g17187 BBL2.1 2 2 275.14 AB604219.1 (77)

NIATv7_g00769 BBL2.2 4 3 956.46 AB604219.1 (77)

NIATv7_g09977 A622 4 0 2258.39 AB445396.1 (78)

NIATv7_g00125 DAO 1 4 12.01 XM_009602560 -

NIATv7_g01601 SPDS1 1 0 226.22 NM_001302585.1 (79)

Ancestors/th NIATv7_g11058 DAO2-like 2 0 12.86 AB289457 (76)

e closest NIATv7_g29386 AO1 3 0 8.21 XM_009759430 -

homologs NIATv7_g29483 BBL-like 0 0 0 AB604221 (77)

NIATv7_g34686 ODC1 0 2 11.27 XM_009771959 -

NIATv7_g58606 QPT1 1 1 29.29 AJ748262.1 (74) 652

32

653 Table S7. Primers used for validating 2kb upstream sequence regions of selected genes. While direct

654 PCR amplifications worked well for most of the genes using primers directly binding to beginning of the

655 protein coding region and the end of upstream 2kb fragments, nested PCR amplifications were required

656 for four genes, likely due to the presences of repetitive sequences present in these genes.

Gene ID Forward Primer Reverse Primer PCR amplification

NIATv7_g00125 CTCCATCAAAACTGAAGATCCAGTAGA GGGGAAGAAACCGTCGCCTTTTCCTGA

NIATv7_g01601 GTTGGTGTTTATAGAAATATGAGAACCA CGTTGTCGTTGTTGTTGTGGTTGGCTG

NIATv7_g04912 CATGAGGGGGAGACATGACAAACAGTT GCAGCGTCTACACAACAACCTAGGGC

NIATv7_g05615 ATCCATCCGCTCGTCCATTTCTCTCATG TAGAGCCATTTGTGTTGGTAGATATGA

NIATv7_g05809 ATCAGACCAGACAAATTAGCTTATGCG GAATGGCCGCCGGGTTCAACCCGGAA

NIATv7_g09978 CTCCGATTTTGGTGACATGAGTTTACCT CAAGGGCTGCTCAACTTCAGACTTCAT

NIATv7_g12091 GAGTTGGAGTGGGCTAAAGTTCATGCC CTGTCCGCATCCTGAAGCGATACCAG

NIATv7_g14063 CCCCGCTTGGTCCCATTAGGCCTATTGG GTGCCGTCACTTTCTGTTTAGTAGTGG

NIATv7_g20235 CATGAATACAGTAGCAAAAATAGGCGT CTCCGAGAAAGAGAAGGTTGCATTAA One step NIATv7_g20333 CTATCTCGCTCGAAATCGAGAACTTTG GTCCTACAATGGCTTCTACGGGAGAA amplification NIATv7_g21863 CTATTTTCCCTGTACTCAGCTTTAGTAAC CCATTTTCCACTGGCGTCCCCTTCAGA

NIATv7_g25254 CGGTTAATCGATTGTCAAATATGATTTC CACTGTTGCAGTGAAAGGAATAGCTC

NIATv7_g28218 GCGGAAAGGATTCCGACAATA GGGGAA GACTCCATGATTTTCCCCATGCTAATG

NIATv7_g29386 ATGCCTGACAGGTAAGAACATATAGGT CCGCTTCCTGAAGCGATACCAGTTGC

NIATv7_g29483 GCTAATCAAAGCAACTACACAAGAGTT GAATTATGAGCAGAAGCTGAAGAAAC

NIATv7_g32622 GATATCCGGTGGTTTAGGTGGAATGGA GCAACAGTTACCAATCGCTTC TTCTCC

NIATv7_g34686 GTACACCGGAGAAAGTATACTCCTAAT GGGAACATGACTCTTACCGATAGCTGT

NIATv7_g36700 GACACAATCACGCACAGAGGCCAGAGC GAAGATTTCATGACTAAATTTGCGGCA

NIATv7_g42088 CAGATGGATTTTACAATGCCTTTTTGTT CAAGGAGATTAAAATCAGAGAAAGAG

NIATv7_g58606 CTCCCAACTCCAGAATCACCATACG GTAATTGCATGAGGATGCACTATTGCA

NIATv7_g61142 GCTGGGGCTGGAGACTAAGGAACCAGA GCCATTTGTGTTGGTAGATATGACTTC

GTGAAGTACGGTGTATAGGCATGGTGA GATGATGCAAGCGCATGCT 1st nested PCR NIATv7_g14350 GTGAAGTACGGTGTATAGGCATGGTGA GTTGGTGAACTTCGCTCTTAGAGGAAG 2nd nested PCR

CAGTCCACACACATTTCGGCAGGTG CTTCATTCCACTGTGCTAGATACTGAA 1st nested PCR NIATv7_g00769 CAGTCCACACACATTTCGGCAGGTG CTTGCTGTTTGGTAGGATAATGAGG 2nd nested PCR

GTCATTATCTTGTCTTTTTCTTCTCATCT AAGGATGGCATGTATGAGCTCTTG 1st nested PCR

NIATv7_g11058 CATATTGCTGGTGCATGATATCAC GGAGGAGTCACCTTGTGCAAAGTTGC 2nd nested PCR

GGATTGAAAGTAGTATATTTTTGTG GGAGGAGTCACCTTGTGCAAAGTTGC 3rd nested PCR

TGGGATATTAAAATTCAATATACCCAGT GTAAGGAGATTCATTATTGTTATTCGC 1st nested PCR

NIATv7_g34398 TGGGATATTAAAATTCAATATACCCAGT CTTGCTTTTGTACTGACCCCTTTGAG 2nd nested PCR left

GACTAAAACTTCTACAAATCGTCTGC GTAAGGAGATTCATTATTGTTATTCGC 2nd nested PCR right

CTTACACGTACGGTAAGAGGCTGAGT CCTGCTGCTTGGTAGGATAATGAAC 1st nested PCR

NIATv7_g17187 GAACTTCTTGTTTATACTTATGACAAG CGGACTAATAAGGAAAGTGTGTCATC 2nd nested PCR left

GAAACTGTTTCTACGGCTCAAAAAA CCTGCTGCTTGGTAGGATAATGAAC 2nd nested PCR right

CTTTCGCTTTAGAGATGGATTCACTCTT GTAGCCTGTGCCTCCAATTATTAAGAT 1st nested PCR

NIATv7_g09977 CTCAGTTGTTCTAGTGAAGTTGAAG CACATTTACCTATTATAACATGTAAC 2nd nexted PCR left CGATTTTAGCACCACTTTGGCAGCA GTAGCCTGTGCCTCCAATTATTAAGAT 2nd nested PCR right 657

33

658 Table S8. Prediction of miRNA targeting sites on the 2kb upstream region of nicotine biosynthesis

659 genes and their ancestral or non-root specific expressed copies. The overlap between predicted

660 miRNA targeting sites and TE insertions are also provided in the last column. The positions indicate the

661 seeds of predicted miRNA targeting sites.

Seed Root-specific GeneID Start End miRNA G eneName Overlap with TE length expression

rnd-5_family -2472: NIATv7_g00769 -1937 -1929 NIATTr2_miR77253p.2 8 BBL2.2 YES DNA/hAT-Ac

NIATv7_g00769 -1228 -1219 NIATTr2_miR1429.5p 9 BBL2.2 YES -

NIATv7_g00769 -1233 -1225 NIATTr2_miR6164 8 BBL2.2 YES -

NIATv7_g05615 -36 -28 NIATTr2_miR7504 8 PMT1.2 YES -

NIATv7_g05615 -1499 -1491 NIATTr2_miR7711.5p.4 8 PMT1.2 YES rnd-5_family -822: LTR/G y p sy

NIATv7_g05615 -737 -729 NIATTr2_miR5163.3p 8 PMT1.2 YES rnd-5_family -58: LTR

NIATv7_g05809 -1865 -1857 NIATTr2_miR1429.5p 8 ODC2 YES -

rnd-4_family -757: LINE/RTE- NIATv7_g05809 -1450 -1442 NIATTr2_miR162 8 ODC2 YES BovB

NIATv7_g05809 -1865 -1857 NIATTr2_miR6161 8 ODC2 YES -

NIATv7_g05809 -106 -95 NIATTr2_miR9487 11 ODC2 YES -

NIATv7_g09977 -977 -968 NIATTr2_miR1066 9 A622 YES -

NIATv7_g09977 -688 -680 NIATTr2_miR6161 8 A622 YES -

NIATv7_g12091 -724 -716 NIATTr2_miR8015.3p 8 AO2 YES -

NIATv7_g12091 -1092 -1083 NIATTr2_miR398.3p 9 AO2 YES -

NIATv7_g14063 -1219 -1211 NIATTr2_miR6444 8 MPO2 YES -

NIATv7_g14063 -780 -772 NIATTr2_miR8015.3p 8 MPO2 YES -

NIATv7_g17187 -1700 -1692 NIATTr2_miR5497 8 BBL2.1 YES -

NIATv7_g25254 -793 -785 NIATTr2_miR6161 8 QPT2 YES -

NIATv7_g25254 -1559 -1551 NIATTr2_miR156.5p 8 QPT2 YES -

NIATv7_g25254 -940 -932 NIATTr2_miR5750 8 QPT2 YES -

NIATv7_g36700 -1075 -1067 NIATTr2_miR6164 8 QS YES rnd-4_family -1369: DNA

NIATv7_g61142 -36 -28 NIATTr2_miR7504 8 PMT1.1 YES -

NIATv7_g61142 -1042 -1032 NIATTr2_miR1429.5p 10 PMT1.1 YES -

NIATv7_g00125 -1151 -1143 NIATTr2_miR7504 8 DAO NO -

NIATv7_g01601 -1138 -1130 NIATTr2_miR5750 8 SPDS1 NO -

NIATv7_g04912 -1586 -1578 NIATTr2_miR6444 8 ADC2 NO rnd-3_family -18: LTR/Copia

NIATv7_g04912 -1190 -1182 NIATTr2_miR8015.3p 8 ADC2 NO rnd-4_family -175: DNA

NIATv7_g05934 -67 -59 NIATTr2_miR5750 8 ADC1 NO -

NIATv7_g11058 -215 -207 NIATTr2_miR8007.5p 8 MPO1 NO -

NIATv7_g11058 -482 -474 NIATTr2_miR172.3p 8 MPO1 NO -

NIATv7_g29386 -1738 -1730 NIATTr2_miR8667 8 AO1 NO rnd-4_family -391: DNA

NIATv7_g29483 -1830 -1821 NIATTr2_miR1429.5p 9 BBL-like NO -

NIATv7_g29483 -825 -817 NIATTr2_miR408.5p 8 BBL-like NO -

NIATv7_g29483 -1830 -1821 NIATTr2_miR6161 9 BBL-like NO -

34

rnd-4_family -1284: NIATv7_g34398 -982 -974 NIATTr2_miR7725.3p.2 8 SPDS2 NO DNA/TcMar-Stowaway

NIATv7_g34398 -695 -686 NIATTr2_miR6161 9 SPDS2 NO -

NIATv7_g34686 -1228 -1220 NIATTr2_miR5069 8 ODC1 NO

NIATv7_g34686 -447 -439 NIATTr2_miR6164 8 ODC1 NO rnd-4_family -173: DTT-NIC1

NIATv7_g58606 -1401 -1392 NIATTr2_miR426 9 QPT1 NO rnd-2_family -51: LTR/Gypsy

NIATv7_g58606 -1962 -1954 NIATTr2_miR9496 8 QPT1 NO rnd-5_family -192: LTR/G y p sy 662

35

663 6. Supplemental Figures

664

665 Fig. S1. Distribution of Bio-Nano molecule length. Left panel indicates the size distribution of the

666 BioNano molecule length, right panel shows an example of the BioNano output image.

667

36

668

669

670 Fig. S2. Anchoring of scaffolds to the linkage map. In total, 12 linkage groups were constructed.

671 Linkage map and anchored pseudo-chromosomes are shown on left and right sides, respectively.

672

37

673

674 Fig. S3. Workflow used to assemble the genome of N. attenuata. Data used for N. attenuata assembly

675 are shown in the rounded boxes. Software and assembly processes are shown on the left and right side of

676 each arrow, respectively.

677

38

678

679 Fig. S4. Gene family size evolution among 11 plant genomes. The number of all duplication events

680 (above branch nodes) from 11 different plant species. The duplication events were estimated from

681 phylogenetic trees constructed from homologous groups. Only branches that have approximated Bayes

682 branch support values greater than 0.9 are presented. Known whole genome triplication (red stars) and

683 duplication (green stars) events are shown.

39

684

685 Fig. S5. Distribution of 4dTV is consistent with the hypothesis that Nicotiana and Solanum share a

686 genome-wide duplication event. The distributions of fourth-fold degenerate sites (4dTV) between

687 duplicated paralogs within and orthologs between genomes are shown. The comparison N. attenuata with

688 tomato reflects the speciation events between Nicotiana and Solanum. Within genome comparisons show

689 the divergence between duplicated gene pairs and thus reflects genome-wide duplication events.

690

691

40

692

693 Fig. S6. Evolution of threonine deaminase (TD). A) Molecular function of TD, which converts

694 threonine to α- ketobutyrate, the substrate for the biosynthesis of isoleucine. B) Syntenic information of

695 TD in grape and tomato genomes. Upper panel shows the syntenic region between grape and tomato

696 chromosome 9. Two copies of TD were found on the tomato chromosome 10 (between 61.8Mb- 63.8Mb),

697 and one copy was found on the tomato chromosome 9, which was reverse duplicated from chromosome

698 10. C and D). Phylogenetic tree of the TD family (C) and a simplified model (D) show the evolutionary

41

699 history of TD. Numbers on each branch indicate the approximate Bayes branch supports. Local

700 duplication and whole genome duplication events are shown as purple and green circles, respectively. E

701 and F) expression of four TDs among different tissues of N. attenuata (E), and among different time

702 points after M. sexta OS elicitation (F) in leaves. In (F), the expression is shown as mean expression and

703 standard error (from three biological replicates).

704

42

705

706 Fig. S7. Evolution of ornithine decarboxylases (ODC). A and B) phylogenetic tree of ODC among

707 different plants (A) and the predicted evolutionary history of ODC (B). An ancient duplication event

708 occurred before the divergence between Vitaceae and Solanaceae. The number on the branch shows the

709 approximate Bayes branch support. C) Syntenic information between the homologs of ODC1 and ODC2

710 in tomato. No clear signature of synteny was found. D) A dot plot depicts the sequence similarity of CDS

711 and the 2kb upstream region between OCD1 and ODC2 in N. attenuata. E) Detailed annotations of TE

712 and transcription factor binding motifs. Light blue regions indicate the TEs annotated from RepeatMasker.

713

43

714

715 Fig. S8. Evolution of N-putrescine methyltransferases (PMT). A) Phylogenetic tree of PMT and

716 SPERMIDINE SYNTHASE (SPDS) in six plant species. The number on the branch shows the approximate

44

717 Bayes branch support. The closest homolog from Arabidopsis and grape were considered as outgroups. B)

718 and C) simplified evolutionary model of SPDS and PMT in tomato and Nicotiana. The syntenic

719 information used to construct these models is from tomato. D) a simplified schematic representation of

720 SPDS and PMT functions. E) and F) detailed annotation of TE and transcription factor binding motifs of

721 NaPMT1.1 (E) and NaPMT1.2 (F). Light blue regions indicate the TEs annotation from RepeatMasker,

722 and pink rounded rectangles depict the motif sequences and their 150 bp flanking region that shows

723 significant homology to TEs (e-value < 1e-5).

724

725

45

726

727 Fig. S9. Evolution of N-methylputrescine oxidases (MPO). A and B) the phylogenetic tree of MPO and

728 DAO in 11 plant species (A) and the predicted evolutionary history of MPO (B). MPO evolved from the

729 duplication of DAO, likely before the divergence between Phrymaceae (Lamiales) and Solanaceae. The

730 duplication between MPO1 and DAO2 are shared among different Solanaceae species. C) The

731 homologues of MPO1 and DAO2 from tomato are in a syntenic block. D) Two dot plots depict the

46

732 sequence similarity of protein coding sequences and 2kp upstream regions between MPO1 and DAO1

733 (left) and between MPO1 and DAO2 from N. attenuata (right). E) Detailed annotation of TE and

734 transcription factor binding motifs. Pink rounded rectangles depict the motif sequences and their 150 bp

735 flanking region that show significant homology to TEs (e-value < 1e-5).

736

47

737

738 Fig. S10. Evolution of aspartate oxidases (AO). A and B ) phylogenetic tree of AO among 11 plants (A)

739 and the predicted evolutionary history of AO (B). Both N. attenuata and N. obtusifolia have two copies of

740 AO. The number on the branch shows the approximate Bayes branch support. C) and D) The dot plot of

741 CDS sequence together with the 2kp upstream region between N. attenuata AO1 and AO2 (C), and

742 between AO2 from N. attenuata and N. obtusifolia (D). E) Detailed annotation of TE and transcription

743 factor binding motifs. Light blue region indicates the TEs annotated from RepeatMasker, and pink

744 rounded rectangles depict the motifs sequences and their 150 bp flanking region that show significant

745 homology to TEs (e-value < 1e-5).

746

48

747

748 Fig. S11. Evolution of quinolinic acid phosphoribosyl transferases (QPT). A and B) The phylogenetic

749 tree of QPT among 13 plants and the predicted evolutionary history of QPT (B). All Nicotiana species

750 have two copies of QPT, except N. obtusifolia, which is likely due to incomplete genome assembly. The

751 number on the branch shows the bootstrap value. C and D) Dot plots depicting the sequence similarity of

752 CDS sequence together with the 2kp upstream region of QPT1 and QTP2 in N. attenuata (C) and QPT2

753 between N. attenuata and N. obtusifolia (D). E) Detailed annotation of TE and transcription factor

754 binding motifs. Light blue regions indicate the TE annotated from RepeatMasker.

755

49

756

50

757 Fig. S12. Evolution of berberine bridge enzyme-like proteins (BBL). A and B) Phylogenetic tree of

758 BBLs among different plants (A) and the predicted evolutionary history of BBL (B). Gene structures of all

759 BBL homologs are shown. The numbers on the branches show the bootstrap values. C-F) Dot plots of

760 CDS sequence together with 2kp upstream region between NaBBL2.2 and NoBBL2 (C), NaBBL2.1 and

761 NaBBL2.2 (D), NaBBL2.1and NaBBL-like (E), NaBBL2.2 and NaBBL-like (F). G and H) Detailed

762 annotation of TE and transcription factor binding motifs of NaBBL2.1 (G) and NaBBL2.2 (H). Light blue

763 regions indicate the TEs annotated from RepeatMasker, and pink rounded rectangles depict the motifs

764 sequences and their 150 bp flanking region that show significant homology to TEs (e-value < 1e-5). Both

765 NaBBL2.1 and NaBBL2.2 have an insertion of the DTT-NIC1 MITE insertion.

766

51

767

768 Fig. S13. Evolution of quinolinic acid synthases (QS). A) Phylogenetic tree of QS among 11 plants.

769 Both N. attenuata and N. obtusifolia have one copy of QS. The number on the branch shows the

770 approximate Bayes branch support. B) A dot plot depicts the sequence similarity of CDS sequence

771 together with the 2kp upstream region of QS between N. attenuata and N. obtusifolia. C) Detailed

772 annotation of TE and transcription factor binding motifs. Light blue regions indicate the TEs annotated

773 from RepeatMasker.

774

52

775

776

777 Fig. S14. Evolution of A622 protein. A) A subclade tree of the A622 and A622-like superfamily is

778 shown among six plants. Both N. attenuata and N. obtusifolia have single copies of A622. The numbers

779 on the branch show the approximate Bayesian support. B) The dot plot depicts the sequence similarity of

780 protein coding sequences and 2kp upstream regions between A622 from N. attenuata and N. obtusifolia.

781 C) Detailed annotation of TE and transcription factor binding motifs. Light blue regions indicate the TEs

782 annotated from RepeatMasker, and pink rounded rectangles depict the motifs sequences and their 150 bp

783 flanking region that show significant homology to TEs (e-value < 1e-5).

53

784

785 Fig. S15. Evolution of root-specific expression of nicotine biosynthesis genes. Heatmaps depict the

786 expression of nicotine biosynthesis genes or their orthologues in N. attenuata, tomato and potato. The

787 results show that PMT1, A622-like and BBL-like in tomato showed root-specific expression, although

788 their absolute expression levels are low. It is worth noting that the enzymatic functions of A622-like and

789 BBL-like genes in tomato and potato remain unknown. The expression data of N. attenuata, tomato and

790 potato were retrieved from N. attenuata datahub (http://nadh.ice.mpg.de/NaDH/), tomato eFP browser

791 (http://bar.utoronto.ca/efp_tomato/cgi-bin/efpWeb.cgi) and potato eFP browser

792 (http://bar.utoronto.ca/efp_potato/cgi-bin/efpWeb.cgi), respectively.

793

794

54

795

796 Fig. S16. Evolution of MYC2 in Nicotiana. Phylogenetic tree of MYC2 among different plants (A) and

797 the proposed evolutionary history of MYC2 (B). The numbers on the branches show the bootstrap values.

798

55

799

800 Fig. S17. Evolution of ERF189 in Nicotiana. (a) Phylogenetic tree of ERF189 among different plants.

801 The numbers on the branches denote approximate Bayes branch support values. Red arrows indicate the

56

802 tandem duplications that were identified based on synteny information. (b) Evolutionary model of the

803 ERF189 gene cluster based on both phylogenetic relationship among homolog genes and their syntenic

804 information in P. axillaris (80), tomato (35) and N. obtusifolia. Lineage specific tandem duplications of

805 ERFs are shown in blue (Petunia), light blue (Solanum) and red (Nicotiana). The expression profile of

806 ERF189 and its orthologous gene in tomato were retrieved from N. attenuata data hub (nadh.ice.mpg.de)

807 and tomato eFP browser (http://bar.utoronto.ca/efp_tomato/cgi-bin/efpWeb.cgi), respectively. The

808 asterisk indicates the gene NIOBTv3_g42088 is likely a pseudogene in N. obtusifolia.

809

57

810

811 Fig. S18. Distribution of expressed genes and transcripts among different tissue-specific RNA-seq

812 libraries. A) numbers of expressed genes; B) number of expressed total transcripts; C) number of

813 expressed transcripts that contain transposon sequences. The abbreviations of sample information are:

814 ROT: root from plant induced by M. sexta OS; LET: leaf from plant induced by M. sexta OS; LEC: leaf

815 from non-treated plant; SED: dry seeds; SEW: seeds treated with water; SES: seeds treated with liquid

816 smoke; STT: stem from M. sexta OS-induced plant; COE: corollas at early developmental stage; COL:

58

817 corollas at late developmental stage; STI: styles; POL: pollen tubes grown in pollen germination media;

818 SNP: stigmas not pollinated; STO: stigmas outcrossed; STS: stigmas self-pollinated; NEC: nectaries;

819 ANT: anthers; OVA: ovaries; PED: pedicels; OFL: open flower; FLB: flower bud.

59

820

821 Fig. S19. Distribution of tissue-specific expressed genes and transcripts. A) numbers of tissue-specific

822 expressed genes; B) numbers of tissue-specific expressed transcripts.

823

60

824

825 Fig. S20. Distribution of different alternative splicing events in N. attenuata. AltA refers to

826 alternative acceptor site; AltD refers to alternative donor site; IR refers to intron retention and ES

827 refers to exon skipping.

828

61

829 62

830 Fig. S21. Expression of transposable elements among nine tissues. A) heatmap depicting the relative

831 expression of 577 TE families that showed expression in at least one tissue (FPKM>1). Classification of

832 the different TEs is shown with different colors on the left side. B) Boxplots showing the expression of

833 TEs among different tissues. Different letters indicate statistical differences (multiple comparisons after

834 Kruskal-Wallis test P < 0.01) among different tissues. Numbers on each boxplot indicate median values

835 for each tissue.

63

836

837

838

839

840

841 Fig. S22. Percentage of TEs induced by M. sexta oral secretions in N. attenuata leaves within a 21 h

842 time course. Each color indicates different classes of TE families.

843

64

844

845

846 Fig. S23. The most abundant 20 Pfam domains in N. attenuata.

847

65

848

849

850 Fig. S24. Top 20 superfamily domains in N. attenuata.

851

852

66

853

854 Fig. S25. Distribution of gene families among four plant species.

855

67

856

857 Fig. S26. Expansion of SLF gene family in Solanaceae. Phylogenetic diagram showing the rapid gene

858 duplication (colors) and loss (gray) of SLF subfamily members in the Solanaceae. The sequence from

859 Pyrus pyrifolia was used as an outgroup. The number at each branch indicates the approximate Bayes

860 branch supports. Red circles indicate the detected duplication shared among Solanaceae species. The

861 color of each node indicates the species while the gene loss in a subclade is indicated by the gray regions.

862 Nodes with only color but no gene ID indicate the gene loss detected in the given species/clade.

863

864

68

865

866 Fig. S27. Expansion of the Zeatin O-glucosyltransferase-like gene family in Solanaceae. A

867 phylogenetic tree showing the expansion of the family. Numbers at the branches refer to the approximate

868 Bayes branch support. Red circles indicate the detected duplication events shared among Solanaceae

869 species.

870

871

69

872

873 Fig. S28. Expansion of a disease resistant gene family (RPM1-like) in Nicotiana. A) phylogenetic tree

874 of the gene family. This gene family was found specific to Solanaceae plants. While no duplication was

875 found in other Solanaceae species, three duplication events were identified in Nicotiana. Red circles

876 indicate gene duplication events. The numbers below the branches refer to the approximate Bayes branch

877 supports. B) Genomic localization of the four genes in N. attenuata.

878

879

880

70

881

882 Fig. S29. Expansion of the purine uptake permease gene family in Nicotiana. A) phylogenetic tree of

883 the gene family. The color indicates the species. Red circles indicate Nicotiana-specific gene duplication

884 events. The numbers at the branches refer to the approximate Bayes branch supports. B) Heatmap

885 showing the expression of genes from six representative N. attenuata tissues. The orthologous genes of N.

886 tabaccum NICOTINE UPTAKE PERMEASE1 (NUP1) is highlighted in red color. ROT: root, LET: leaf

887 treated with M. sexta oral secretion. LEC, control untreated leaf; STT: stem from plant treated with M.

888 sexta oral secretion in leaf; NEC: nectary; OFL: flower. Heatmap color gradient indicates the scaled log10

889 TPM value.

71

890 7. Captions for supplementary datasets S1 to S9

891 Supplementary Dataset 1. Summary information on MITE families in Nicotiana and their

892 homologous families in Solanum genomes. Different superfamilies are represented by different

893 codes. DTA for Tc1/Mariner, DTA for hAT, DTH for PIF/Harbinger. For tomato and potato, the

894 original MITE IDs are shown in the last column. The biggest MITE family in Nicotiana, DTT-NIC1,

895 is highlighted in bold font.

896 Supplementary Dataset 2. Annotation of smRNAs in N. attenuata.

897 Supplementary Dataset 3. Expression of genes among 21 RNA-seq libraries. The expression

898 numbers are TPM values. The abbreviation of sample information are: ROT: root from plant induced

899 by M. sexta OS; LET: leaf from plant induced by M. sexta OS; LEC: leaf from non-treated plant;

900 SED: dry seeds; SEW: seeds treated with water; SES: seeds treated with liquid smoke; STT: stem

901 from M. sexta OS-induced plant; COE: corollas at early developmental stage; COL: corollas at late

902 developmental stage; STI: styles; POL: pollen tubes grown in pollen germination media; SNP:

903 stigmas not pollinated; STO: stigmas outcrossed; STS: stigmas self-pollinated; NEC: nectaries; ANT:

904 anthers; OVA: ovaries; PED: pedicels; OFL: open flower; FLB: flower bud; CTN: leaf samples

905 collected at different time points pooled together. Tissue specificity was calculated using the τ index

906 among all tissues except CTN.

907 Supplementary Dataset 4. Expression of assembled transcripts among 21 RNA-seq libraries.

908 The expression numbers are TPM values. The abbreviation of sample information are: ROT: root

909 from plant induced by M. sexta OS; LET: leaf from plant induced by M. sexta OS; LEC: leaf from

910 non-treated plant; SED: dry seeds; SEW: seeds treated with water; SES: seeds treated with liquid

911 smoke; STT: stem from M. sexta OS-induced plant; COE: corollas at early developmental stage;

912 COL: corollas at late developmental stage; STI: styles; POL: pollen tubes grown in pollen

913 germination media; SNP: stigmas not pollinated; STO: stigmas outcrossed; STS: stigmas self-

914 pollinated; NEC: nectaries; ANT: anthers; OVA: ovaries; PED: pedicels; OFL: open flower; FLB:

915 flower bud; CTN: leaf samples collected at different time points pooled together. Tissue specificity

72

916 and transcripts of which at least 50bp were annotated as TE are also indicated. Tissue specificity was

917 calculated using the τ index among all tissues except CTN.

918 Supplementary Dataset 5. Expression of microarray probes that are both induced by

919 simulated herbivory by application of M. sexta oral secretion and mapped to transposable

920 elements of N. attenuata. SL: systemic untreated leaf; TL: treated leaf; RT: systemic untreated root.

921 Supplementary Dataset 6. KEGG ortholog (KO) annotation of N. attenuata genes.

922 Supplementary Dataset 7. List of genes annotated as protein kinases and transcription factors.

923 Supplementary Dataset 8. List of recent duplication events of N. attenuata genes and their

924 duplicated copies. Annotations of the duplication that the gene was most recently involved, and the

925 functional annotation of gene based on Blast2GO are also indicated. Detected duplication times are:

926 N. attenuata specific; shared among Nicotiana; shared among Solanaceae; shared with M. guttatus;

927 shared among core . Inferred duplication types are: genome-wide duplications, tandem

928 duplications.

929 Supplementary Dataset 9. Gene families that expanded significantly in Nicotiana.

930

73

931 8. References

932 1. Krügel T, Lim M, Gase K, Halitschke R, & Baldwin IT (2002) Agrobacterium-mediated 933 transformation of Nicotiana attenuata, a model ecological expression system. Chemoecology 934 12(4):177-183. 935 2. Bubner B, Gase K, & Baldwin IT (2004) Two-fold differences are the detection limit for 936 determining transgene copy numbers in plants by real-time PCR. BMC Biotechnol. 4. 937 3. VanBuren R, et al. (2015) Single-molecule sequencing of the desiccation-tolerant grass 938 Oropetium thomaeum. Nature 527(7579):508-511. 939 4. Luo RB, et al. (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de 940 novo assembler. GigaScience 1. 941 5. English AC, et al. (2012) Mind the gap: upgrading genomes with Pacific Biosciences RS long- 942 read sequencing technology. PLoS One 7(11). 943 6. Walker BJ, et al. (2014) Pilon: An integrated tool for comprehensive microbial variant detection 944 and genome assembly improvement. PLoS One 9(11). 945 7. Peng Y, Leung HCM, Yiu SM, & Chin FYL (2012) IDBA-UD: a de novo assembler for single- 946 cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28(11):1420- 947 1428. 948 8. Cao HZ, et al. (2014) Rapid detection of structural variation in a human genome using 949 nanochannel-based genome mapping technology. GigaScience 3. 950 9. Shelton JM, et al. (2015) Tools and pipelines for BioNano data: molecule assembly pipeline and 951 FASTA super scaffolding tool. BMC Genomics 16. 952 10. Zhou W, et al. (In press) Tissue-specific emission of (E)-α-bergamotene helps resolve the 953 dilemma when pollinators are also herbivores. Curr. Biol. 954 11. Lindgreen S (2012) AdapterRemoval: easy cleaning of next-generation sequencing reads. BMC 955 research notes 5:337. 956 12. McKenna A, et al. (2010) The genome analysis toolkit: a mapreduce framework for analyzing 957 next-generation DNA sequencing data. Genome Res 20(9):1297-1303. 958 13. Wu YH, Bhat PR, Close TJ, & Lonardi S (2008) Efficient and accurate construction of genetic 959 linkage maps from the minimum spanning tree of a graph. PLoS Genetics 4(10). 960 14. Tang HB, et al. (2015) ALLMAPS: robust scaffold ordering based on multiple maps. Genome 961 Biol 16. 962 15. Wu TD & Watanabe CK (2005) GMAP: a genomic mapping and alignment program for mRNA 963 and EST sequences. Bioinformatics 21(9):1859-1875. 964 16. Kim D, et al. (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, 965 deletions and gene fusions. Genome Biol 14(4). 966 17. SanMiguel P, et al. (1996) Nested retrotransposons in the intergenic regions of the maize genome. 967 Science 274(5288):765-768. 968 18. Yanai I, et al. (2005) Genome-wide midrange transcription profiles reveal expression level 969 relationships in human tissue specification. Bioinformatics 21(5):650-659. 970 19. Lowe TM & Eddy SR (1997) tRNAscan-SE: A program for improved detection of transfer RNA 971 genes in genomic sequence. Nucleic Acids Res 25(5):955-964. 972 20. Lagesen K, et al. (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. 973 Nucleic Acids Res 35(9):3100-3108. 974 21. Stanke M, Steinkamp R, Waack S, & Morgenstern B (2004) AUGUSTUS: a web server for gene 975 finding in eukaryotes. Nucleic Acids Res 32:W309-W312. 976 22. Cantarel BL, et al. (2008) MAKER: An easy-to-use annotation pipeline designed for emerging 977 model organism genomes. Genome Res 18(1):188-196. 978 23. Sierro N, et al. (2013) Reference genomes and transcriptomes of Nicotiana sylvestris and 979 Nicotiana tomentosiformis. Genome Biol 14(6).

74

980 24. Trapnell C, Pachter L, & Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. 981 Bioinformatics 25(9):1105-1111. 982 25. Li B & Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or 983 without a reference genome. BMC Bioinformatics 12:323. 984 26. Brooks AN, et al. (2011) Conservation of an RNA regulatory map between Drosophila and 985 mammals. Genome Res 21(2):193-202. 986 27. Ling Z, Zhou W, Baldwin IT, & Xu S (2015) Insect herbivory elicits genome-wide alternative 987 splicing responses in Nicotiana attenuata. Plant J. 84(1):228-243. 988 28. Jin Y, Tam OH, Paniagua E, & Hammell M (2015) TEtranscripts: a package for including 989 transposable elements in differential expression analysis of RNA-seq datasets. Bioinformatics 990 31(22):3593-3599. 991 29. Kim SG, Yon F, Gaquerel E, Gulati J, & Baldwin IT (2011) Tissue specific diurnal rhythms of 992 metabolites and their regulation during herbivore attack in a native tobacco, Nicotiana attenuata. 993 PLoS One 6(10):e26214. 994 30. Quinlan AR & Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic 995 features. Bioinformatics 26(6):841-842. 996 31. Diboun I, Wernisch L, Orengo CA, & Koltzenburg M (2006) Microarray analysis after RNA 997 amplification can detect pronounced differences in gene expression using limma. BMC Genomics 998 7. 999 32. Gotz S, et al. (2008) High-throughput functional annotation and data mining with the Blast2GO 1000 suite. Nucleic Acids Res 36(10):3420-3435. 1001 33. Xie C, et al. (2011) KOBAS 2.0: a web server for annotation and identification of enriched 1002 pathways and diseases. Nucleic Acids Res 39:W316-W322. 1003 34. Jones P, et al. (2014) InterProScan 5: genome-scale protein function classification. 1004 Bioinformatics 30(9):1236-1240. 1005 35. Sato S, et al. (2012) The tomato genome sequence provides insights into fleshy fruit evolution. 1006 Nature 485(7400):635-641. 1007 36. Riano-Pachon DM, Ruzicic S, Dreyer I, & Mueller-Roeber B (2007) PlnTFDB: an integrative 1008 plant transcription factor database. BMC Bioinformatics 8. 1009 37. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high 1010 throughput. Nucleic Acids Res 32(5):1792-1797. 1011 38. Abascal F, Zardoya R, & Telford MJ (2010) TranslatorX: multiple alignment of nucleotide 1012 sequences guided by amino acid translations. Nucleic Acids Res 38:W7-W13. 1013 39. Capella-Gutierrez S, Silla-Martinez JM, & Gabaldon T (2009) trimAl: a tool for automated 1014 alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25(15):1972-1973. 1015 40. Guindon S, Delsuc F, Dufayard JF, & Gascuel O (2009) Estimating maximum likelihood 1016 phylogenies with PhyML. Bioinformatics for DNA Sequence Analysis 537:113-137. 1017 41. Darriba D, Taboada GL, Doallo R, & Posada D (2012) jModelTest 2: more models, new 1018 heuristics and parallel computing. Nature Methods 9(8):772-772. 1019 42. Page RDM & Charleston MA (1997) From gene to organismal phylogeny: Reconciled trees and 1020 the gene tree species tree problem. Mol Phylogenet Evol 7(2):231-240. 1021 43. Xu X, et al. (2011) Genome sequence and analysis of the tuber crop potato. Nature 1022 475(7355):189-U194. 1023 44. Rausher MD & Huang J (2016) Prolonged adaptive evolution of a defensive gene in the 1024 Solanaceae. Mol. Biol. Evol. 33(1):143-151. 1025 45. Gonzales-Vigil E, Bianchetti CM, Phillips GN, & Howe GA (2011) Adaptive evolution of 1026 threonine deaminase in plant defense against insect herbivores. Proc. Natl. Acad. Sci. U.S.A. 1027 108(14):5897-5902. 1028 46. Kang JH, Wang L, Giri A, & Baldwin IT (2006) Silencing threonine deaminase and JAR4 in 1029 Nicotiana attenuata impairs jasmonic acid-isoleucine-mediated defenses against Manduca sexta. 1030 Plant Cell 18(11):3303-3320.

75

1031 47. Zhang Z, et al. (2006) KaKs_Calculator: calculating Ka and Ks through model selection and 1032 model averaging. Genomics Proteomics Bioinformatics 4(4):259-263. 1033 48. Ossowski S, et al. (2010) The rate and molecular spectrum of spontaneous mutations in 1034 Arabidopsis thaliana. Science 327(5961):92-94. 1035 49. Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 1036 24(8):1586-1591. 1037 50. Thorne JL & Kishino H (2002) Divergence time and evolutionary rate estimation with multilocus 1038 data. Syst. Biol. 51(5):689-702. 1039 51. Guyot R, et al. (2012) Ancestral synteny shared between distantly-related plant species from the 1040 asterid (Coffea canephora and Solanum Sp.) and rosid (Vitis vinifera) clades. BMC Genomics 13. 1041 52. Wu FN & Tanksley SD (2010) Chromosomal evolution in the plant family Solanaceae. BMC 1042 Genomics 11. 1043 53. Williams JS, Der JP, dePamphilis CW, & Kao TH (2014) Transcriptome analysis reveals the 1044 same 17 S-locus F-box genes in two haplotypes of the self-incompatibility locus of Petunia 1045 inflata. Plant Cell 26(7):2873-2888. 1046 54. Goldberg EE, et al. (2010) Species selection maintains self-incompatibility. Science 1047 330(6003):493-495. 1048 55. Havlova M, et al. (2008) The role of cytokinins in responses to water deficit in tobacco plants 1049 over-expressing trans-zeatin O-glucosyltransferase gene under 35S or SAG12 promoters. Plant 1050 Cell Environ 31(3):341-353. 1051 56. Schäfer M, et al. (2015) Cytokinin levels and signaling respond to wounding and the perception 1052 of herbivore elicitors in Nicotiana attenuata. J Integr Plant Biol 57(2):198-212. 1053 57. Hildreth SB, et al. (2011) Tobacco nicotine uptake permease (NUP1) affects alkaloid metabolism. 1054 Proc. Natl. Acad. Sci. U.S.A. 108(44):18179-18184. 1055 58. Kato K, Shoji T, & Hashimoto T (2014) Tobacco nicotine uptake permease regulates the 1056 expression of a key transcription factor gene in the nicotine biosynthesis pathway. Plant Physiol. 1057 166(4):2195-U1500. 1058 59. Kessler D, et al. (2012) Unpredictability of nectar nicotine promotes outcrossing by 1059 hummingbirds in Nicotiana attenuata. Plant J. 71(4):529-538. 1060 60. Eckart E (2008) Solanaceae and convolvulaceae: secondary metabolites: biosynthesis, 1061 chemotaxonomy, biological and economic significance (Springer, Berlin). 1062 61. Chintapakorn Y & Hamill JD (2007) Antisense-mediated reduction in ADC activity causes minor 1063 alterations in the alkaloid profile of cultured hairy roots and regenerated transgenic plants of 1064 Nicotiana tabacum. Phytochemistry 68(19):2465-2479. 1065 62. Dalton HL, et al. (2016) Effects of down-regulating ornithine decarboxylase upon putrescine- 1066 associated metabolism and growth in Nicotiana tabacum L. J. Exp. Bot. 67(11):3367-3381. 1067 63. Pandey SP, Shahi P, Gase K, & Baldwin IT (2008) Herbivory-induced changes in the small-RNA 1068 transcriptome and phytohormone signaling in Nicotiana attenuata. Proc. Natl. Acad. Sci. U.S.A. 1069 105(12):4559-4564. 1070 64. Li FF, et al. (2015) Regulation of nicotine biosynthesis by an endogenous target mimicry of 1071 microRNA in tobacco. Plant Physiol. 169(2):1062-1071. 1072 65. Brockmöller T, et al. (2017) Nicotiana attenuata Data Hub (NaDH): an integrative platform for 1073 exploring genomic, transcriptomic and metabolomic data in wild tobacco. BMC Genomics 1074 18(1):79. 1075 66. Kaul S, et al. (2000) Analysis of the genome sequence of the Arabidopsis 1076 thaliana. Nature 408(6814):796-815. 1077 67. Kim S, et al. (2014) Genome sequence of the hot pepper provides insights into the evolution of 1078 pungency in Capsicum species. Nat. Genet. 46(3):270-278. 1079 68. Huang SW, et al. (2009) The genome of the cucumber, Cucumis sativus L. Nat. Genet. 1080 41(12):1275-U1229.

76

1081 69. Hellsten U, et al. (2013) Fine-scale variation in meiotic recombination in Mimulus inferred from 1082 population shotgun sequencing. Proc. Natl. Acad. Sci. U.S.A. 110(48):19478-19482. 1083 70. Tuskan GA, et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). 1084 Science 313(5793):1596-1604. 1085 71. Hirakawa H, et al. (2014) Draft genome sequence of eggplant (Solanum melongena L.): the 1086 representative Solanum species indigenous to the old world. DNA Res. 21(6):649-660. 1087 72. Jaillon O, et al. (2007) The grapevine genome sequence suggests ancestral hexaploidization in 1088 major angiosperm phyla. Nature 449(7161):463-U465. 1089 73. Shoji T, Kajikawa M, & Hashimoto T (2010) Clustered transcription factor genes regulate 1090 nicotine biosynthesis in tobacco. Plant Cell 22(10):3390-3409. 1091 74. Shoji T & Hashimoto T (2011) Recruitment of a duplicated primary metabolism gene into the 1092 nicotine biosynthesis regulon in tobacco. Plant J. 67(6):949-959. 1093 75. Shoji T, Yamada Y, & Hashimoto T (2000) Jasmonate induction of putrescine N- 1094 methyltransferase genes in the root of Nicotiana sylvestris. Plant Cell Physiol. 41(7):831-839. 1095 76. Shoji T & Hashimoto T (2008) Why does anatabine, but not nicotine, accumulate in jasmonate- 1096 elicited cultured tobacco BY-2 cells? Plant Cell Physiol. 49(8):1209-1216. 1097 77. Kajikawa M, Shoji T, Kato A, & Hashimoto T (2011) Vacuole-localized berberine bridge 1098 enzyme-like proteins are required for a late step of nicotine biosynthesis in tobacco. Plant Physiol. 1099 155(4):2010-2022. 1100 78. Kajikawa M, Hirai N, & Hashimoto T (2009) A PIP-family protein is required for biosynthesis of 1101 tobacco alkaloids. Plant Mol. Biol. 69(3):287-298. 1102 79. Hashimoto T, Tamaki K, Suzuki K, & Yamada Y (1998) Molecular cloning of plant spermidine 1103 synthases. Plant Cell Physiol. 39(1):73-79. 1104 80. Bombarely A, et al. (2016) Insight into the evolution of the Solanaceae from the parental 1105 genomes of Petunia hybrida. Nat. Plants 2(6). 1106

77