1 Supplementary Information Appendix 2 3 1. Plant material, sequencing, assembly and map integration of the Nicotiana attenuata and N. 4 obtusifolia ge no mes ...... 2 5 1.1 Plant material and DNA preparation ...... 2 6 1.2 Raw data processing...... 2 7 1.3 Genome assembly pipeline...... 3 8 1.4 Generating the optical map ...... 4 9 1.5 Anchor scaffolds to pseudomolecules ...... 5 10 1.6 Quality assessment and validation of assembly ...... 6 11 2. Genome annotation ...... 7 12 2.1 Annotation of repeats and LTR insertion time estimation in Nicotiana ...... 7 13 2.2 Analyzing DTT-NIC1 insertions in Nicotiana genomes...... 8 14 2.3 Annotation of miRNA, tRNA and rRNA ...... 9 15 2.4 Annotation of protein coding genes...... 10 16 2.5 RNA sequencing and data analysis ...... 12 17 2.6 Expression of transposable elements...... 14 18 2.7 Functional annotation of protein coding genes ...... 14 19 2.8 Functional annotation of protein domains ...... 15 20 2.9 Annotation of transcription factor and protein kinase genes ...... 16 21 3. Evolutionary genomics analyses ...... 16 22 3.1 Homologous group identification ...... 16 23 3.2 Detection of gene duplication events in each HG ...... 16 24 3.3 Confirmation of the whole-genome triplication in Nicotiana...... 18 25 3.4 Estimation of species divergence times...... 20 26 3.5 Identification of lineage-specific gene family expansion ...... 21 27 4. Evolution of nicotine biosynthesis ...... 22 28 4.1 Identification and reconstruction of the evolution history of nicotine biosynthesis genes ...... 22 29 4.2 Validation of the promoter sequences of nicotine biosynthesis genes and their ancestral copies... 23 30 4.3 Prediction of TE-derived putative miRNA target sites into regulatory region of nicotine 31 biosynthesis genes...... 23 32 4.4 Evolution of root-specific expression of nicotine biosynthesis genes...... 24 33 4.5 Data access...... 25 34 5. Supplemental Tables ...... 27 35 6. Supplemental Figures...... 36 36 7. Captions for supplementary datasets S1 to S9 ...... 72 37 8. References ...... 74 38 1
39 1. Plant material, sequencing, assembly and map integration of the Nicotiana attenuata and
40 N. obtusifolia genomes
41 1.1 Plant material and DNA preparation
42 Plants were grown as previously described (1). The genomic DNA sequenced by 454 and
43 Illumina HiSeq2000 technologies was isolated from late rosette-stage plants using the CTAB-method(2).
44 The two N. attenuata DNA plants used for this extraction were from a 30th generation inbred line,
45 referred to as the UT accession, which originated from a 1996 collection from a native population in
46 Washington County, Utah, USA (1). N. obtusifolia DNA was obtained from a single plant of the first
47 inbred generation derived from seeds collected from a native population in 2004 at the Lytle Ranch
48 Preserve, Saint George, Utah, USA. High molecular weight genomic DNA used to generate the optical
49 map of N. attenuata was isolated using a nuclei based protocol (3) from approximately one hundred
50 freshly harvested N. attenuata plants (harvested 29 days post germination) of the same inbred generation
51 and origin as used for the short-read sequencing.
522. To reduce the potential effects of secondary metabolites on single molecular sequencing, the plant
53 material used for PacBio sequencing was from a cross of two isogenic N. attenuata transgenic lines
54 (mother: ir-aoc, line A-07-457-1, which was transformed with pRESC5AOC [GenBank KX011463] and
55 is impaired in JA biosynthesis; father: irGGPPS, line A-07-230-5, which was transformed with
56 pRESC5GGPPS [GenBank KX011462] and is impaired in the synthesis of the abundant 17-
57 hydroxygeranyllinalool diterpene glycosides), both generated from the 22nd inbred generation of the
58 same origin as the N. attenuata plants described above (1). Genomic DNA was isolated from
59 approximately one hundred young plants (harvested 29 days post germination) by Amplicon Express
60 (http://ampliconexpress.com) according to a proprietary protocol.
61 1.2 Raw data processing
62 We filtered and trimmed the Illumina HiSeq2000 whole-genome shotgun raw sequences of N.
63 attenuata and N. obtusifolia to obtain high-quality non-duplicated read pairs prior to genome assembly.
2
64 This was achieved by a custom script that extracted only the largest reads, longer than 32 bp, and which
65 contained no bases with a Phred quality score lower than 11. We compared the first 32 bp of each read in
66 a pair to those of other read pairs and retained only one pair if the same sequence was found more than
67 once (deduplication). Roche/454 long reads for N. attenuata were processed by the sffToCA script
68 provided by the Celera Assembler v7 pipeline (CA7). Table S3 provides details of the filtered data used
69 for the genome assemblies.
70 1.3 Genome assembly pipeline.
71 The overall assembly workflow for N. attenuata is shown as SI Appendix, Fig. S3. All paired-end
72 reads from the sequenced libraries were assembled using the Celera Assembler (CA7) with a minimum
73 read length cut-off at 64bp. Preliminary tests showed that single end or short reads did not improve the
74 assemblies, but increased calculation time significantly. In total, 86.4 Gb short read data were assembled.
75 The expected genome size of N. attenuata is of 2.54 Gb based on coverage of the “larger than N50 length
76 unitigs”, similar to the estimation (1C=2.5pg) from the flow cytometry analysis. We used the SSPACE v2
77 scaffolder to further improve scaffolding using the mate-pair data and filled gaps using GapCloser v1.12
78 (4). After manual inspections, we found that certain neighboring contigs in the scaffolds still contained
79 overlaps, which might be due to the assembly process from CA7 that places copies of repeat sequence at
80 the end of contigs or due to issues in Gapcloser v1.12 that leave open some closable gaps. To close these
81 gaps between overlapped contigs, we compared neighboring contigs using BLASTN (min. identity
82 95%/min. length 43) and then joined overlapping contigs with custom scripts.
83 The assembly scaffolds from short reads were used for gap filling and further scaffolding using
84 PBJelly (v15.8.24)(5) with ~10x PacBio reads (N50=14.9 kb, max read length =48.9 kb), which resulted
85 in a 2.17 Gb genome assembly. While PBjelly only increased the N50 scaffold size from 176 kb to 202 kb,
86 it significantly increased the N50 contig size from 67 kb to 90 kb. Because PacBio reads contain about
87 12-15% errors, we performed an additional correction step using short reads with PILON(6). In total, 98.2%
88 of the draft assembly was confirmed by short reads and ~1.3 Mb sequences were corrected by PILON.
3
89 The PILON corrected assembly was further mapped to the 10x PacBio reads using BLASR and used
90 SSPACE-longreads for the second round scaffolding, which increased the N50 scaffold size to 292 kb.
91 To assemble the N. obtusifolia genome, we employed a hybrid strategy in which we first
92 assembled all short reads by a ‘de Bruijn graph’ assembler using idba-ud v1.1.1 (7), and then assembled
93 the locally re-assembled contigs and a subset of the short read data by an ‘OLC’ long read assembler
94 using CA7. Scaffolding and gap filling were carried out using SSPACE v2 using mate-pair data in a
95 similar manner to the N. attenuata assembly.
96 1.4 Generating the optical map
97 The optical maps of N. attenuata obtained from BioNano Genomics were used to improve the
98 assembly quality. High molecular weight genomic DNA was isolated from fresh N. attenuata tissues
99 using the protocol similar to that of VanBuren et al(3). Briefly, around 5g leaves were collected from
100 young plants and fixed with formaldehyde. After being homogenized in isolation buffer, filtrating
101 washing treatment was performed. The nuclei were purified on percoll cushions and washed extensively
102 and finally embedded in low melting agarose at different dilutions. The DNA plugs were treated with a
103 lysis buffer containing detergent, proteinase K and β-mercaptoethanol (BME). In total, 140 Gb of data
104 (>100 kb) were collected representing ~55× genome coverage with a molecule N50 length of 250 kb (Fig.
105 S1). Molecules were de novo assembled as previously described (8). The optical map finally assembled
106 consists of 2.2 Gb with an N50 size of 1.4 Mb.
107 The optical map final assembly was then used to anchor sequence scaffolds using the sewing
108 machine pipeline (9). Overall 81.7% of sequence assembly can be mapped to the optical map and 84.7 %
109 of the optical map can be aligned to sequence assembly. The super scaffolding was performed using
110 parameter ‘--f_con 12 --f_algn 40 --s_con 10 --s_algn –T 1e-8’ after manual inspection of the quality of
111 the scaffolds. The super scaffolding using the optical map generated a genome assembly consisting of
112 39,115 scaffolds with N50 to 358.4 kb (total size 2.4 Gb). Construction of a high-density linkage map
4
113 A high-density linkage map was constructed using the genotyping by sequencing (GBS) method
114 on 256 individuals from an advanced intercross recombinant inbred line population (AI-RIL). The
115 establishment of the AI-RIL and detailed procedures for DNA isolation and sequencing is described in
116 detail elsewhere (10). In total, 1.2 billion paired-end clean reads were obtained from 256 samples
117 (average = 4.7 million) after quality control and adapter trimming using AdapterRemoval (11) with
118 parameter “--minquality 30 --minlength 36”. All reads were then mapped to the draft genome of N.
119 attenuata (release v1.0) using Bowtie 2 (version 2.2.5) with default parameters and only reads with
120 mapping quality greater than 3 were used for the downstream analysis. Genome Analysis Toolkit (GATK,
121 version 3.3-0-g37228af) (12) was used to call single nucleotide polymorphisms (SNPs) and indels with
122 the parameter “-stand_call_conf 30 -stand_emit_conf 10” after realignment at the indel loci as
123 recommended. All SNPs were filtered based on mapping and SNP quality and sequence depth using the
124 parameter “--clusterWindowSize 10 MQ0 >= 4 && ( (MQ0 / (1.0 * DP)) > 0.1) QUAL < 30.0 QD < 5.0”,
125 which resulted in 16,904 polymorphic markers. For linkage map construction, we further removed all
126 markers that were missing in more than 30% of individuals or showed segregation distortions (P < 0.001,
127 χ2 test). The final dataset used for the linkage group construction contained 7989 markers. The linkage
128 map was constructed using ASmap (13) following recommended workflow. In brief, all markers that were
129 typical for more than 70% of individuals were used to construct the backbone of the linkage map using
130 parameters: dist.fun = "kosambi", p.value = 1e-12. These parameters were selected based on several
131 rounds of manual optimizations. Then, markers that were typical of less than 70% of all individuals were
132 pushed back to the map based on their similarity with markers included in the backbone map. The final
133 linkage map consists of 12 linkage groups with 2,906 cM.
134 1.5 Anchor scaffolds to pseudomolecules
135 To anchor scaffolds to pseudomolecules, we first used chromonomer
136 (http://catchenlab.life.illinois.edu/chromonomer/) to identify conflicts between genome assembly and
137 linkage maps and remove inconsistent markers. In total, 219 scaffolds were identified as containing
5
138 potential assembly errors and 188 were split at the largest gap between two markers that mark the
139 boundaries of the two positions. The remaining 31 scaffolds that contained too many potential errors were
140 excluded from anchoring to pseudomolecules but included as scaffolds in the final genome assembly. The
141 final step of anchoring scaffolds to pseudomolecules was performed using ALLMAPS (14). In total,
142 2,132 scaffolds representing 825.8Mb (34.9%) of the final genome assembly and 13,076 (38.9%) of total
143 predicted genes were anchored to 12 pseudomolecules (Fig. S2).
144 The final assembly consists 37,194 scaffolds, which represents 92% (2.37 Gb) of the estimated
145 genome size. The N50 scaffold and contig sizes equal to 524.5kb and 64.2kb, respectively, and 50% of
146 the genome was represented in 420 longest scaffolds (L50). The detailed step-by-step workflow of the
147 genome assembly procedures is shown in Fig. S3.
148 1.6 Quality assessment and validation of assembly
149 We assessed the completeness and quality of the final N. attenuata assembly using four different
150 methods. First, we mapped 80,044 N. attenuata EST sequences that were assembled in a previous study
151 (with a length greater than 300bp) to the genome using gmap (15). The overall mapping rate was ~97.1%
152 (identity greater than 95% and coverage greater than 80% of each transcript). In addition, we also mapped
153 1207 million paired-end reads to the genome using TopHat2 (16). Overall, 96.5% of them could be
154 mapped to the genome and 94.2% were properly paired. Furthermore, among the 45,816 N. benthamiana
155 EST sequences from NCBI (length greater than 300 bp), 88.7% of them could be mapped to the final N.
156 attenuata assembly (greater than 85% identity and 60% coverage). Second, mapping 224 million (a subset)
157 paired-end 100 bp Illumina reads (1kb library) back to the genome using Bowtie2 (-I 500 -X 2000)
158 showed that 99.94% of reads can be mapped to the genome and 97.7% were properly paired. Third, we
159 used CEGMA based on 248 ultra-conserved CEGs to further estimate the completeness of the assembly.
160 In total, 208 (83.9%) were found in full length and 243 (98.0%) were found to be partial or full-length in
161 our final assembly. The overall completeness of the final N. attenuata assembly estimated based on
162 CEGMA is slightly less than the tomato genome, but similar to the potato genome (Table S2). Fourth, re-
6
163 sequencing of ~2kb the upstream regions of 26 candidate genes using Sanger sequencer confirmed that
164 96.2% (25 out of 26) of the assembly were correct. For these 25 correctly assembled genes, the similarity
165 between sequences obtained from Sanger sequencer and WGS assembly is more than 99.86%. Most of
166 the mismatches (53 out of 72 bps) were due to an additional AT-rich fragment was miss-assembled to the
167 promoter region of NIATv7_g09977. Together, these data suggest that the overall quality of this assembly
168 is high.
169 2. Genome annotation
170 2.1 Annotation of repeats and LTR insertion time estimation in Nicotiana
171 The summary of repeat content is shown in Table S4. The abundance of MITEs is shown in
172 Supplementary Dataset 1. Because the expansion of LTRs contributed to the rapid increase of genome
173 size in Nicotiana, we further analyzed the insertion time of this subgroup. All annotated LTR/gypsy
174 and/or LTR/copia TE sequences with a mapping score greater than 250 and a size greater than 100 bp
175 were extracted from the RepeatMasker output. Only TE families that contained more than 200 copies
176 were retained for downstream analysis. For each of the extracted LTR sequences, BLASTN was used to
177 find the other LTR sequences with the highest similarity score with parameters: -evalue 1e-10 -task dc-
178 megablast. The pairwise aligned fragments that had a length greater than 200 bp and a bit score greater
179 than 200 were used for estimating substitution rates. The identical reciprocal best blast hits were only
180 counted once. A substitution rate was then calculated using baseml from the PAML (4.7) package with
181 the “REV” molecular evolution model. Our LTR expansion dating approach is different from the method
182 described by SanMiguel et al (17), which compares the sequence divergence between two LTR regions of
183 complete retrotransposons. However, in the N. attenuata genome like in many other plant genomes, the
184 complete LTR set only contributes to a small proportion of the total amount of retrotransposons,
185 especially since many old retrotransposons are rapidly disrupted by new insertions and thus cannot be
186 detected as complete LTRs. This may result in a biased picture of the overall insertion history of
7
187 retrotransposons in the genome. Our method directly calculates the divergence between most recently
188 duplicated repeat sequences and thus avoids the bias introduced from annotating complete LTRs.
189 2.2 Analyzing DTT-NIC1 insertions in Nicotiana genomes
190 To analyze the effect of DTT-NIC1 insertions on gene expression divergences between
191 duplicated genes, we first calculated the gene expression and tissue specificity divergences between
192 duplicated gene pairs in N. attenuata using 21 different RNA-seq libraries (see details below on
193 expression analysis and gene duplication identification). To reduce redundancy and false positives, we
194 performed the analysis based on the following protocol: 1) we analyzed the gene duplication history of all
195 N. attenuata genes based on the constructed phylogenetic trees of each homologous group; 2) only the
196 reciprocal most recently duplicated paralog pairs were identified and used for the downstream analysis; 3)
197 only the gene pairs that resulted from duplications that enjoyed an approximate Bayesian support of
198 greater than 0.9 were retained in the analysis; 4) all gene pairs were assigned into two groups: A: one
199 member of the gene pair with at least one DTT-NIC1 insertion (within the 1kb upstream of 5’ region) and
200 B: neither one of gene pair with DTT-NIC1 insertions; 5) gene expression divergence at both expression
201 and tissue specificity levels were then compared between the two groups. The tissue-specificity score for
202 each gene/transcript using the τ index (18) based on a subset of the RNA-seq libraries. The index τ is
203 defined as:
204
(1 ) = 푁 1 ∑푖=1 − 푋푖 τ 205 푁 −
206 where N is the number of tissues and is the expression profile component normalized by the maximal
207 component value. Euclidean distance 푋between푖 the expression profiles of two paralogs among the same
208 RNA-seq libraries was used to quantify the gene expression divergence. Both tissue specificity and
209 expression divergence were calculated based on log2 transformed TPM values.
8
210 2.3 Annotation of miRNA, tRNA and rRNA
211 Small RNA sequencing was performed to capture all small RNAs (smRNAs), especially the
212 miRNAs that are expressed during day and night in the wild-type N. attenuata. Three replicates were used;
213 clean reads were generated from the raw smRNA reads after removing the adaptor sequences and
214 removing the low quality reads—such as reads with unidentified nucleotides (N) or reads with any single
215 nucleotide stretch > 5 nucleotides. Clean reads, > 15 nucleotides, were filtered for further analysis. Next,
216 all the clean reads were aligned against Rfam and the reads mapped to tRNA, rRNA and snoRNA were
217 removed. Remaining reads in all the replicates were merged for each time point, and reads that were
218 expressed in at least two of three replicates were retained for further analysis. These reads are referred to
219 as “mappable miRNA reads.” These reads were aligned to the N. attenuata genome using Bowtie with
220 maximum two mismatches and five reported genomic location alignments. Sequences that did not match
221 the genome were discarded. Further, aligned reads were mapped against the N. attenuata transcriptome by
222 Bowtie with no reverse complement mapping parameter, and those aligned to the transcriptome were
223 removed from the bin of miRNA mappable reads. Conserved miRNAs were identified by comparing the
224 N. attenuata mappable reads against the known plant mature miRNAs and their precursors in the
225 miRBase database (www.mirbase.org). Sequences with perfect matches to the known sequences were
226 regarded as bona-fide conserved miRNAs and were subjected to precursor prediction in the Nicotiana
227 genome. A flanking sequence of 200 bases was extracted from the mapped genomic locations for each
228 read and RNAfold was used to predict precursor stem-loop structure. The miRCheck algorithm was used
229 to compare all the potential miRNAs and precursors against a set of secondary structure constraints
230 derived from known plant miRNA precursors.
231 A total of 131 miRNA reads corresponding to 83 bona-fide conserved miRNAs were identified
232 that were expressed during day and night harvests of N. attenuata leaves. These miRNAs were obtained
233 from 158 genomic locations in Nicotiana genome and were conserved in 34 other plant species (such as
234 Arabidopsis lyrata, Arabidopsis thaliana, Vitis vinifera, and Zea mays etc.) as shown in Supplementary
235 Dataset 2. Of these 83 miRNAs (131 miRNA reads), twelve miRNAs (miR160c-3p, miR171b-3p,
9
236 miR398a-3p, miR426, miR1429-5p, miR5069, miR5497, miR6020b, miR6021, miR6206, miR7711-5p.4
237 and miR7744-5p) were expressed only during day time. On the other hand, 14 miRNAs (miR160-5p,
238 miR408-5p, miR2609a, miR4379, miR5015, miR5042-3p, miR5255, miR5303c, miR5635a, miR5741a,
239 miR6444, miR7750-5p, miR8143 and miR8764) were expressed only during the night. Fifty seven
240 miRNAs (corresponding to 88 miRNA reads) were expressed during both day and night, of which 12
241 miRNAs (miR172c, miR172j, miR5303, miR5303a, miR6149a, miR6161a, miR6161b, miR6161c,
242 miR6161d, miR6164a, miR6164b and miR7997c) were expressed with different mature sequences at
243 these times. Differences in sequences for a miRNA indicate that different isomiRs were expressed
244 between the day and night.
245 tRNAs were first annotated using tRNAscan-SE (v 1.3.1)(19) with default parameters. All
246 predicted pseudo tRNAs and tRNAs that showed high similarity to chloroplast and mitochondrial
247 genomes (e-value < 1e-10, identify > 95) were removed. In total, 1052 tRNAs that decode standard 20
248 AA and four selenocysteine tRNAs were predicted. rRNAs were annotated using RNAmmer (v1.2)(20)
249 with parameter “ -S euk -m tsu,lsu,ssu”. In total, 799 rRNA sequences were predicted.
250 2.4 Annotation of protein coding genes.
251 The N. attenuata genome was annotated using the Nicotiana Genome Annotation (NGA) pipeline,
252 which employs both hint-guided (hg) Augustus (v. 2.7) (21) (hg-Augustus) and MAKER2 (v.2.28)(22),
253 gene prediction pipelines based on genome release v1.0.
254 For hg-Augustus gene annotation, the HMM gene model was specifically trained for N. attenuata
255 using RNA-seq data from major plant tissues, and gene models were predicted using unmasked genome
256 sequences. The repeat regions were here given less probability to be predicted as a gene, in particular
257 when RNA-seq evidence was missing. For MAKER2 annotation pipeline, we integrated evidence of
258 multiple protein codings from three sources: ab initio gene predictions, transcript evidence and protein
259 homolog evidence. The input evidences for MAKER2 were: 1) ab initio gene predictors (GeneMark and
10
260 Augustus) that were each trained with full-length transcripts; 2) Trinity (v. r20131110) assembled
261 transcripts using RNA-seq data from major tissues; 3) UNIREF90 plant proteins (mapped using genewise,
262 v. wise2-4-1); 4) six high quality plant proteomes (tomato, potato, grape, Arabidopsis, Populus and rice).
263 The MAKER2 annotation pipeline was run on the repeat masked N. attenuata genome.
264 The predicted gene models from hg-Augustus were filtered based on their repeats contents. All
265 genes with repeats occupying more than 50% of the gene length were removed, and genes were retained if
266 less than 10% of their sequence matched to repeats. For genes that contained repeats occupying 10-50%
267 of their entire gene length, we performed an additional search of the plant Refseq database using
268 BLASTX. If the gene matched with a non-repeats homolog (e-value greater than 1e-5 and bit score
269 greater than 200) from the plant refseq database, the predicted gene models were retained for downstream
270 analysis. In total, 35,737 gene models from hg-Augustus predictions were retained. For MAKER2
271 predicted gene models, in addition to the filtering based on repeat content as described above, the gene
272 models that had low evidence support were removed from downstream analysis (eAED <= 0.45, this
273 cutoff was set after manual inspections on the predicted gene models). In total 33,274 gene models from
274 MAKER2 prediction were retained. The predicted gene models from hg-Augustus and MAKER2
275 pipelines were then combined and overlapping gene structures were removed. After manual inspections,
276 we found that hg-Augustus outperformed MAKER prediction when the pipelines predicted different gene
277 structures for the same gene. Therefore, when the two pipelines predicted different gene structures for the
278 same genome region, we retained only the hg-Augustus predicted gene models. After merging the
279 predicted models from both hg-Augustus and MAKER2 predictions, genes with pre-mature stop codons
280 were considered as pseudogenes and were removed from the downstream analysis. The predicted gene
281 models were then transferred to N. attenuata genome release v2.0 using a custom script which first
282 identified homologous regions using BLAST and then predicted gene structure using GeneWise. This
283 finally resulted in 33,449 high quality gene models in the final N. attenuata genome, of which 74.9%
284 were supported by at least 50 RNA-seq reads. Because gene models within a genus are usually conserved,
11
285 we annotated N. obtusifolia and two other publicly available diploid Nicotiana genomes, N. sylvestris and
286 N. tomentosiformis(23), using a homology-based annotation pipeline based on all predicted N. attenuata
287 protein coding gene models. Using all predicted gene models from N. attenuata, we predicted protein
288 coding gene models in N. obtusifolia, N. tomentosiformis and N. sylvestris using a homolog-based
289 approach.
290 2.5 RNA sequencing and data analysis
291 RNA was isolated from plants of the same origin and inbred generation as described above for N.
292 attenuata DNA isolation. RNA was isolated using TRIZOL® (Thermo Fisher Scientific) according to the
293 instructions of the manufacturer. DNA was removed from all RNA preparations using TURBO™ DNase
294 (Thermo Fisher Scientific) according to the manufacturer’s protocol. In total, twenty one RNA-seq
295 libraries from different plant tissues and the same tissues under different biotic and abiotic stress
296 treatments were first enriched for RNAs with poly-A tails and then used for RNA-seq library construction
297 with Illumina's TruSeq RNA sample preparation kits. The insertion sizes of the libraries are
298 approximately 200 bp. All RNA-seq libraries were sequenced using Illumina 2000 HiSeq platform with
299 read lengths of 50 or 101 bp and resulted in 793,785,373 paired-end Illumina reads.
300 The raw sequence reads were trimmed using AdapterRemoval (v1.1) with parameters “--collapse
301 --trimns --trimqualities 2 --minlength 36”. The trimmed reads were then aligned to the N. attenuata
302 genome assembly (v 2.0) using TopHat2 (v2.1.0) (24), with maximum and minimum intron size set to
303 50,000 and 41 base pairs (bp), respectively, estimated from the N. attenuata genome annotation.
304 To estimate the expression level of assembled genes and transcripts, all trimmed RNA-seq reads
305 were mapped to the assembled transcripts using RSEM (v1.2.20) (25). Transcripts per million (TPM) was
306 calculated for each transcript and gene. To exclude low-expressed genes and transcripts, only genes with
307 TPM greater than five in at least one sample were considered as expressed. Similarly, only the transcripts
308 with TPM greater than one in at least one sample were considered as expressed.
309
12
310 In total, 21 RNA-seq library from different tissues and same tissue with different treatments were
311 sequenced (Table S5). Expression profile of all predicted gene models is provided in Supplementary
312 Dataset 3. For transcript expression, only the transcripts with TPM greater than one in at least one sample
313 were considered as expressed. In total, 25,506 genes and 73,624 transcripts were expressed, respectively
314 (Supplementary Dataset 4). Among all sequenced tissues, roots and pollen tubes expressed the highest
315 and lowest number of both genes and transcripts (Fig. S18), respectively. We further annotated the repeat
316 content of all expressed transcripts using RepeatMasker (open-4.0.5) with repeat library annotated from N.
317 attenuata as described previously. Among these 73,624 expressed transcripts, 17,463 of them (~24%)
318 were found containing repeats (Supplementary Dataset 4), suggesting that a large number of repeats of N.
319 attenuata are expressed.
320 The floral gene expression was calculated based on RNA-seq library from open flowers (OFL),
321 and for tissues that contain more than one library (with different treatments), we used the average
322 expression values among all libraries to represent the expression level in the respective tissues. Tissue-
323 specific genes/transcripts were considered as τ index >= 0.95. Among all sequenced tissues, roots have
324 the highest number of tissue-specific genes and transcripts (Fig. S19). Interestingly, although pollen tubes
325 expressed fewer genes (4201) than other tissues, they harbored the largest proportion of tissue-specific
326 genes (10%).
327 We annotated all of the alternative splicing (AS) events using JuncBASE (v0.6) (26) based on the
328 splicing junctions (SJs) information from the BAM files produced by Tophat2 (16). To reduce false
329 positives that likely result from non-specific or erroneous alignments, all original junctions were filtered
330 based on overhangs greater than 13 bp in length, as shown in our previous study(27). The overall pattern
331 of AS detected among different tissues is consistent with our previous study based on roots and leaves of
332 N. attenuata, which showed that intron retention (IR) and exon skipping (ES) are the most and least
333 abundant AS events, respectively (Fig. S20).
13
334 2.6 Expression of transposable elements
335 The expression of TE was analyzed using two different datasets. Among different tissues, RNA-
336 seq data were used. We first remapped all clean reads to the N. attenuata genome using tophat2 with the
337 parameters mentioned above, except for the retention of 150 multiple mapped reads (-g 150). Then, the
338 mapped bam files were used to estimate expression of the different TE families using TEtranscripts (28).
339 Among all measured tissues, we found that TE expression was highest in pollen tube (Fig. S21).
340 For M. sexta herbivory-induced TE expression in leaves, we used a previously published time-
341 course microarray dataset (29). All probes were mapped to the genome using blastn. Only probes that had
342 a perfect match to the genome were considered. Overlaps between positions of mapped probes and
343 repeats were identified using bedtools (30) based on repeatmakser output. All probes of which 100%
344 could be located within the repeat region and did not map to any annotated genes were used to analyze
345 expression of the M. sexta herbivory-induced expression of TEs. The microarray data were first
346 normalized using quantile normalization, and differential gene expression was analyzed using the limma
347 R package (31). Probes that showed a false discovery rate (FDR) adjusted P value lower than 0.05 and a
348 fold change greater than 2 were considered as induced by M. sexta OS. Differentially expressed probes
349 that were annotated as repeats are reported in Supplementary Dataset 5. Similar to protein coding genes,
350 most of TEs showed the highest induction at 1 h after elicitation (Fig. S22), except LTR/copia, which
351 were induced the most at 13 h after elicitation.
352 2.7 Functional annotation of protein coding genes
353 The functions and gene ontology (GO) of all predicted protein coding genes were annotated
354 using BLAST2GO (32). All protein coding sequences were first compared to the plant reference sequence
355 database (downloaded in February 2016) using BLAST with e-value cutoff 1e-10. The BLAST results
356 were imported to BLAST2GO and the functions and GO terms of each gene were annotated using the
357 default settings. In total, 59.3% of genes were assigned to at least one GO term. The enzyme commission
358 (EC) number was also assigned to the predicted genes by comparing to the KEGG pathway databases.
14
359 The pathways that contained only one mapped enzyme were removed. Overall, 5,370 genes (16.1 %)
360 were assigned to 942 EC codes that were involved in 144 pathways. The top three pathways that contain
361 the highest number of annotated genes are starch and sucrose metabolism, purine metabolism and
362 phenylalanine metabolism. Furthermore, the KEGG orthologs (KO) were annotated using kobas 2.0 (33),
363 with e-value cutoff set as 1e-10. In total, 10,746 genes were assigned to KO terms. The information of
364 mapped KO information for each gene is reported in the Supplementary Dataset 6.
365 2.8 Functional annotation of protein domains
366 To identify the functional domains of protein-coding genes, INTERPROSCAN (v. 5.16-55.0)(34)
367 was used to scan protein sequences against the protein signatures from InterPro database (v. 55.0,
368 downloaded in February, 2016). The InterPro database integrates protein families, Pfam domains and
369 functional sites from different databases. The following databases from InterPro were used: Coils (v.
370 2.2.1), CATH-Gene3D (v. 3.5.0), HAMAP (v. 201511.02), Panther (v. 10.0), Pfam (v. 28.0), PIRSF (v.
371 3.01), PRINTS (v. 42.0), ProDom (v. 2006.1), PROSITE (v. 20.113), SMART (v. 6.2), SUPERFAMILY
372 (v. 1.75) and TIGRFAM (v. 15.0). In total, 38,810 protein domains were identified, consisting of 3,925
373 unique Pfams. For predicted genes in N. attenuata genome, 71.1% (23,783 out of 33,449) of them contain
374 at least one predicted protein domain. The top 20 Pfam and SUPERFAMILY domains are shown in Fig.
375 S23 and S24. The top 20 most frequent domains account for 22.7% (8,780 out of 38,729) and 37.6%
376 (10,950 out of 29,091) of all predicted Pfam domains and SUPERFAMILIES respectively. In comparison
377 to tomato(35), most of the top 20 SUPERFAMILIES are similar except three, which are ribonuclease H-
378 like (SSF53098), DNA/RNA polymerases (SSF56672) and trans-glycosidases (SSF51445). At domain
379 levels, the top 20 list in N. attenuata and tomato differed in seven domains, including domains of reverse
380 transcriptase-like (PF13456), PPR repeat (PF12854) and a domain of unknown function (PF14111).
15
381 2.9 Annotation of transcription factor and protein kinase genes
382 The transcription factor and protein kinase-containing genes were identified based on the
383 identified domain in each gene according to rules described in Pérez-Rodríguez et. al(36) using the iTAK
384 tool (http://bioinfo.bti.cornell.edu/cgi-bin/itak/index.cgi). In total, 2,498 and 1,071 genes were annotated
385 as transcription factor and protein kinase genes respectively (Supplementary Dataset 7). Among all
386 transcription factors, MYB, AP2 and bHLH were the three largest families.
387 3. Evolutionary genomics analyses
388 3.1 Homologous group identification
389 The detailed methods used for defining the homologous groups (HG) are given in the Method
390 section of the main text. In total, we identified 23,340 HGs (with at least two homolog sequences) from
391 the 11 plant genomes (Table S1). The average gene family size within each species varied from 1.8 in C.
392 sativus to 2.8 in P. trichocarpa. Among these identified HGs, 4,328 contained at least one gene from each
393 of all 11 dicots species, thus representing the core dicot HGs. In N. attenuata, 13,632 HGs containing
394 30,513 protein-coding genes were identified, of which 89.7% (12,231) contained orthologs from both
395 Nicotiana genomes sequenced in this study. In addition, 8.8% (2,936 of 33,449) of N. attenuata genes
396 were assigned as orphan genes, since they did not cluster with any other sequences (Fig. S25).
397 3.2 Detection of gene duplication events in each HG
398 We first constructed the phylogenetic tree for each HG. For this, we aligned all coding sequences
399 for each HG using MUSCLE (v.3.8.31) (37), based on translated protein sequences from TranslatorX
400 (v.1.1) (38). For all aligned sequences, all non-informational sites (gaps in more than 20% of sequences)
401 were removed using trimAL (v1.4) (39). Then, for each HG, PhyML (v. 20140206) (40) was used to
402 construct the gene tree with the best nucleotide substitution model, estimated based on jModeltest2
403 (v.2.1.10) (41) with the following parameters: -f -i -g 4 -s 3 -AIC -a. The support for each branch was
404 calculated using the approximate Bayesian method.
16
405 The duplication events within each HG were predicted based on the constructed gene trees using
406 a tree reconciliation algorithm which compares the structure of a species’ tree and gene tree to infer the
407 duplication events (42). This approach allowed us to predict the history of gene duplication events on
408 each branch of the species tree. To reduce false positives, we only considered tree structures with
409 approximate Bayes branch supports greater than 0.9 for the downstream analysis. In addition, we also
410 characterized the gene duplication events based on their genomic location. Tandem duplicated gene pairs
411 were identified as gene pairs from the same HG if located within a distance of 100 kb or four genes from
412 each other.
413 Using the tree reconciliation algorithm mentioned above, we identified 81,859 duplication events
414 on the branches of 11 species tree. Among all these duplicated events, genes from N. attenuata were
415 involved in 27.0% of them (22,054 out of 81,859). The species/genera that were known to have
416 experienced whole genome duplication (WGD) or triplication (WGT), such as Arabidopsis, Populus,
417 Mimulus, showed much higher numbers of duplication events (Fig. S4, and Supplementary Dataset 8).
418 Furthermore, a large number of gene duplication events were found on the branch of Solanaceae,
419 indicating the shared WGT (35) between the Nicotiana and Solanum genera (Fig. S4). Although the
420 cultivated potato is an autotetraploid, the genome sequences used for our analysis represent a diploid
421 species (2n=2x=24), which were obtained through the combinations of sequencing a homozygous
422 doubled-monoploid derived from classical tissue culture and a heterozygous diploid breeding line (43).
423 Therefore the additional polyploidization likely did not affect the detection of gene duplication events at
424 the branch of Solanaceae and Nicotiana.
425 Additionally, we inferred the type of duplication to be tandem duplication or genome-wide
426 duplication by using gene distance and syntenic block information of the genome, respectively. All
427 duplicated genes gene pairs that are separated less than 10 genes and were less than 150 kbp apart in the
428 genome of any species of tomato, potato or N. attenuata were considered as derived from tandem
429 duplications. In total, we found 535 duplications that occurred on the Solanaceae branch (7.8%) which
430 were likely to be tandem. We further used syntenic blocks estimated in tomato and potato that were
17
431 predicted by the Plant Genome Duplication Database (PGDD,
432 http://chibba.pgml.uga.edu/duplication/index/downloads) to infer duplications that resulted from genome-
433 wide duplications. All duplication events occurred on the Solanaceae branch and can be found in a
434 syntenic block (with e-value < 0.01 and 0.3 435 duplications. 436 One example of duplications via WGT is the evolution of threonine deaminase (TD) family, 437 which has four copies in both Nicotiana and Solanum resulting from one round of local duplication 438 followed by a genome-wide duplication (Fig. S6). TD is a Solanaceae specific anti-herbivore defense 439 gene that encodes a pyridoxal phosphate-dependent enzyme that dehydrates threonine to α-ketobutyrate 440 and ammonia, as the committed step in the biosynthesis of isoleucine (Ile) in plants. After one tandem 441 duplication followed by WGT, the duplicated gene copy (TD2.2), likely through several adaptive 442 substitutions (44), evolved as an anti-digestive defense in tomato that degrades threonine in the guts of 443 attacking larvae (45). Interesting, in N. attenuata TD2.2 likely retained its biosynthetic function to supply 444 the Ile essential for jasmonate-mediated defense signaling (46). Consistent with a functional 445 specialization and in contrast to other WGT-derived TD copies, N. attenuata TD2.2 exhibits a unique 446 spatial expression pattern and is specifically induced by herbivore attack (Fig. S6). 447 448 3.3 Confirmation of the whole-genome triplication in Nicotiana 449 To further support the conclusion that Nicotiana spp. share the WGT event previously identified 450 in Solanum (35), we performed molecular evolution and phylogenomic analyses. For the molecular dating, 451 the synonymous substitution rates (Ks) for the gene pairs in all homologous groups (with less than 5 452 genes per species) were calculated based on YN model using KaKs_Calculator (v.1.2) (47). To estimate 453 the divergence time, the formula T=Ks/ (2*mutational rate) was used, where the mutational rate was set 454 to 7.1*10-9 substitutions/site/million years (MYA)(48). The within-species 4dTV distribution in N. 455 attenuata and tomato showed a similar peak between 0.2 to 0.25, indicating that a genome-wide 18 456 duplication event had occurred at a similar time in both species (Fig. S5). Given the previously identified 457 WGT in tomato, it is likely that Nicotiana shares this genome-wide duplication event with tomato. This 458 inference is also consistent with the more ancient estimated age of this WGT (71 ±19.4 MYA)(35) 459 compared to the recent divergence between Nicotiana and Solanum (24.4 MYA to 28.6 MYA). 460 To further validate the shared WGT event between Nicotiana and Solanum, we analyzed a subset 461 of genes that underwent a triplication in either Solanum or Nicotiana. The genes that underwent a 462 triplication event in Solanum were identified based on the fact that the tree structure consists of one 463 outgroup node (at least one of genes in Arabidopsis, Populus, Cucumber or Vitis) and two duplication 464 events that were shared between tomato and potato with no gene loss in either tomato or potato. Similarly, 465 the genes that underwent a triplication event in Nicotiana were identified based on the fact that the tree 466 structure consists of one outgroup node and two duplication events that were shared between N. attenuata 467 and N. obtusifolia with no gene loss in either species. In order to reduce the false positives, we only 468 considered the tree nodes that have approximate Bayes branch supports of greater than 0.9. We then 469 analyzed the number of these triplication events that were shared between Solanum and Nicotiana based 470 on the complete tree structure. In total, we identified 229 and 436 triplication events in Solanum and 471 Nicotiana, respectively. In both the Solanum and Nicotiana datasets, the majority (89.2% and 79.9% 472 respectively) of the triplication events were shared with the other genus. These results are consistent with 473 the fact that the WGT event found in tomato is indeed shared between Solanum and Nicotiana. 474 Furthermore, using MCSCAN, we also found 22 duplication blocks, each of which contains at least 20 475 genes among assembled pseudo-chromosomes. In addition, we also compared the identified triplication 476 events in Solanum and Nicotiana with those of genes in Mimulus. Our results showed that a majority of 477 the triplication events (98.7% and 98.4% in Solanum and Nicotiana, respectively) were not shared with 478 Mimulus, indicating that gene duplication events in Mimulus are independent of the WGT found in 479 Solanaceae. 19 480 3.4 Estimation of species divergence times 481 We calculated the divergence times of four Solanaceae spp. (N. attenuata, N. obtusifolia, S. 482 lycopersicum and S. tuberosum) with V. vinifera as the outgroup using a Bayesian approach. In brief, we 483 first identified 1,622 one-to-one orthologs using BLAST reciprocal best-hits (RBH) algorithm. Then each 484 orthologous group was aligned using the protein coding sequences with MUSCLE (v. 3.8.31) (37) based 485 on translated protein sequences from TranslatorX (v.1.1) (38). These alignments were concatenated to one 486 super-alignment and all ambiguously aligned regions were removed using trimAL (v. 1.4) (39) with 487 parameter: -gt 0.8. The final alignment that contained ~2.1Mb nucleotide sites was then used for species 488 divergence time estimation. The nucleotide substitution parameters were first estimated with the HKY85 489 model using baseml from the PAML package (v4.8) (49). The branch-length and the corresponding 490 variance-covariance matrix were then estimated using estbranches. The results from estbranches were 491 used to predict divergence times with the MultiDivTime program (50). The MultiDivTime analysis was 492 performed according to the manual with the following parameters: the root-to-tip mean was set to 119.5 493 MYA with a standard deviation of 10 MYA based on the divergence time estimated by Guyot et. Al (51); 494 the evolutionary rate of the root was set to 0.007631799 substitutions per nucleotide site per MYA 495 calculated based on the results of baseml; and a burn-in time of 10,000,000 generations was used. MCMC 496 chains were sampled every 10,000 generations until 100,000 samples were taken and the calibration 497 points of 7.3 MYA and 23.7 MYA for the divergence time of tomato-potato and tomato-Nicotiana, 498 respectively (52), were used. 499 The analysis revealed that N. attenuata and N. obtusifolia diverged about 12.5 MYA ago (95% 500 confidence interval: 10.1 MYA to 14.7 MYA), which is about five million years earlier than the 501 divergence time between potato and tomato (7.1 MYA, range from 6.4 MYA to 8.2 MYA). The estimated 502 divergence time between the Nicotiana and Solanum lineages was estimated to be of 27.1 MYA (95% 503 confidence interval: 24.4 MYA to 28.6 MYA) and the divergence time of V. vinifera and solanaceous 504 species is around 116.70 MYA (95% confidence interval: 100.9 MYA to 133.6 MYA). 20 505 3.5 Identification of lineage-specific gene family expansion 506 We analyzed Solanaceae and Nicotiana lineage-specific HG expansions using the previously 507 identified gene duplication events. For this, we performed a Fisher’s exact test to identify gene families 508 that exhibit significantly more duplications in comparison to the genome-wide pattern. Because the total 509 number of gene families was large, we reduced the number of tests and false positives for a given branch 510 by calculating a Z-score for each HG and only considering those HGs that had Z-scores > 1.96 for 511 Fisher’s exact test. Multiple testing corrections were then performed based on these P-values using the 512 false discovery rate (FDR < 0.1) method. It should be noted that this is a conservative approach, as many 513 small gene families cannot be detected by such stringent statistical tests. Hence, the gene families 514 identified by this analysis as significantly expanding are likely true positives. 515 For the Solanaceae branch, we found 1,596 HGs with a Z-score greater than 1.96. Among them, 516 four HGs experienced significantly more duplications than the genome-wide pattern based on Fisher’s 517 exact test. One of these belongs to the S-locus F-box protein (SLF) family, of which 10 duplications were 518 identified on the branch of Solanaceae. Consistently, a recent study also found increased number of SLF 519 genes in Petunia, another Solanaceae species (53). The phylogenetic tree combining SLF from Petunia, 520 Solanum and Nicotiana revealed that this particular SLF subfamily experienced frequent duplication and 521 gene loss in different lineages (Fig. S26). The SLF gene family is known to be involved in pollen 522 recognition and self-compatibility processes, and the expansion of this family might have contributed to 523 the observed diversity of the mating systems in the Solanaceae family (54). Another HG that significantly 524 expanded in the Solanaceae is corresponding to the Zeatin O-glucosyltransferase-like (ZOG) gene family, 525 of which we detected 25 duplication events (Fig. S27). Genes from the ZOG family are involved in 526 regulating cytokinin levels a process critical in tuning the signaling mediated by this hormonal pathway to 527 biotic and abiotic stresses (55, 56). Expansion of this gene family in the Solanaceae branch likely reflects 528 physiological adaptations to the diverse habitats colonized by these species. 529 Analyzing gene duplications within the Nicotiana branch showed that 256 HGs have a Z-score 530 greater than 1.96, and 58 HGs experienced significantly more duplications, based on Fisher’s exact test, 21 531 than seen from the genome-wide pattern (Supplementary Dataset 9). Among these significantly expanded 532 gene families, an NBS-LRR type disease resistance gene family specifically duplicated three times in 533 Nicotiana but not in other branches of the Solanaceae. Genomic location of these duplicated genes shows 534 that all four genes are located in a gene cluster within one scaffold (Fig. S28). Further expression analyses 535 revealed that all four genes are highly expressed in roots, suggesting that these genes might be involved in 536 plant-pathogen interactions in Nicotiana roots. 537 Another gene family that significantly expanded in Nicotiana species is that of the Purine uptake 538 permeases (Fig. S29). Genes from this family are known as plasma membrane-localized transporters (57). 539 More specifically, in tobacco, one member of this family, Nicotine Uptake Permease1 (NUP1), has 540 demonstrated functions in regulating nicotine localization via its transporter activity (57, 58). Furthermore, 541 NUP1 has also been shown to act as a transcriptional regulator of the key transcription factor ERF189 in 542 the nicotine biosynthesis pathway (58), although details on the underlying mechanism are lacking. The 543 expression profile of NUP1 in N. attenuata highlights that this gene is not only expressed in roots, but 544 also shows high levels of expression in floral tissues (Fig. S29), which indicates that NUP1 might be 545 involved in the allocation of nicotine to flowers of N. attenuata, where it serves as a deterrent for 546 pollinators and increases out-crossing rates (59). In addition, several other members of this family are 547 specifically induced in the leaves and stem by simulated herbivory, indicating that these genes in N. 548 attenuata might also be involved in allocating nicotine to particular parts of the vegetative tissues as an 549 anti-herbivore toxin. 550 4. Evolution of nicotine biosynthesis 551 4.1 Identification and reconstruction of the evolution history of nicotine biosynthesis genes 552 We identified nicotine biosynthesis genes and their ancestral copies based on sequence homology 553 using blastp and manual curations (Table S6). Gene evolutionary history was inferred using the 554 combination of phylogenetic and synteny analysis when possible. However, due to lack of genomic 22 555 information from the closely related genus of Nicotiana, it remains unclear whether the Nicotiana-lineage 556 specific duplications, such as the NAD pathway genes and neofunctionalization of BBLs occurred in the 557 ancestors of all genera, in which trace levels of nicotine have been controversially reported, such as 558 Crenidium, Cyphanthera and Duboisia (60). 559 The putrescine biosynthetic pathway for nicotine biosynthesis could have two routes: 1) 560 synthesized by ODC-mediated decarboxylation of ornithine or 2) synthesized by ADC-mediated 561 decarboxylation of arginine. Recent studies that individually silenced ODC and ADC suggest that the 562 putrescine for nicotine biosynthesis in Nicotiana is likely through the former route (61, 62). Consistently, 563 while phylogenetic analysis showed that two ADC copies are present in Nicotiana genomes likely 564 through the WGT, none of them showed root specific expression pattern. 565 4.2 Validation of the promoter sequences of nicotine biosynthesis genes and their ancestral copies 566 We validated the 2kb promoter sequences of 26 genes from N. attenuata, which included all 567 nicotine biosynthesis genes and several of their ancestral copies using Sanger sequencing. Primers were 568 first designed based on the assembled genome sequences, and direct PCRs were performed to amplify the 569 target fragments (Table S7). For genes that gave multiple bands or no PCR products, nested PCRs were 570 performed. The amplified fragments with the expected size were cut from gels and used for either direct 571 sequencing or sequencing after cloning into the pJET vector. All Sanger sequences were manually 572 inspected and compared to the genome assembly. Among all tested genes, only the promoter region of 573 one gene, NIATv7_g05934, was found miss-assembled (PCR products do not match the genome 574 assembly). For all other correctly assembled genes, 99.86% identity was found between Sanger 575 sequencing and the assembled genome. 576 4.3 Prediction of TE-derived putative miRNA target sites into regulatory region of nicotine 577 biosynthesis genes 578 Insertions of TE are known to introduce putative target sites of miRNAs. To examine whether the 579 observed TE insertions within the 2kb upstream region of nicotine biosynthesis genes and of their 23 580 ancestral copies had introduced candidate miRNA targeting sites, we performed in silico miRNA target 581 site predictions using the method described in Pandey et al (63). Briefly, all the candidate promoter 582 sequences were first checked for miRNA seed-pairs using custom written Perl script. The promoter 583 sequences from 3’ end were used for "Watson-Crick" complementarity matching against 5’ ends of 584 miRNAs after generating 7- to 13-nt seeds starting from first or second nucleotide position at 5’ end of 585 the miRNAs. The matches were extended by allowing mismatches after the seed match or 9th nucleotide 586 in the miRNAs. 587 In total, 23 and 17 putative miRNA target sites were predicted among the 2kb upstream regions 588 of 10 genes that are involved in nicotine biosynthesis and of their ancestral copies or non-root specific 589 expressed genes (10), respectively (Table S8). Among all nicotine biosynthesis genes, five predicted 590 miRNA target sites that were detected from 4 genes (BBL2.2, PMT1.2, ODC2 and QS) overlapped with 591 TE insertions. 592 4.4 Evolution of root-specific expression of nicotine biosynthesis genes 593 One of the important features associated with efficient nicotine production is the root-specific 594 expression of nicotine biosynthesis genes. To obtain evolutionary insights into the process of root-specify 595 expression acquisition by these genes, we explored the expression pattern of orthologous genes of the 596 nicotine biosynthetic pathway in tomato and potato. The results show that although absolute expression 597 levels of these orthologues are consistently low in the roots of these two species, some genes exhibit a 598 certain degree of root-specific expression in these Solanum species, such as PMT1 and ODC2 from the 599 duplicated polyamine pathway as well as the homologs of BBLs and A622. These observations indicate 600 that the evolution of a preferential root expression for these genes may have taken place before the 601 divergence between Nicotiana and Solanum (Fig. S15). This, however, does not exclude the likely 602 possibility that the neofunctionalization of BBL and A622 occurred after the divergence between 603 Nicotiana and Solanum. In contrast, the root-specific expression of the NAD pathway genes, such as AO2, 604 QPT2 and QS likely evolved specifically in Nicotiana. A further comparative analysis of the expression 24 605 profile of ERF189, the key transcription factor that regulates most of the nicotine biosynthesis genes, 606 showed that its root-specific expression likely evolved in Nicotiana, as its orthologues are more uniformly 607 expressed across tissues in tomato and potato (Fig. S17). 608 From a mechanistic point of view, there are at least two possibilities: 1), while gaining additional 609 GCC/G-box motifs increased their expression in roots, nicotine genes may have lost the transcription 610 factor binding sites that allow them to be expressed in other tissues, and 2) specific smRNAs may have 611 contributed either directly to or through interactions with endogenous target mimicry (64) the suppression 612 of the expression of nicotine genes in non-root tissues. We speculate that both mechanisms might have 613 contributed to root expression specialization. Future molecular studies that specifically manipulate the 614 promoter sequences of nicotine biosynthesis genes and targeted smRNAs will be required to understand 615 this process. 616 4.5 Data access 617 To provide free access to our genome data, we established a webserver (Nicotiana attenuata Data 618 Hub: http://nadh.ice.mpg.de/)(65), which allows data visualization of gene families and co-expressed 619 genes, as well as data download of the annotated gene models of the two Nicotiana genomes. The Whole 620 Genome Shotgun projects of N. attenuata and N. obtusifolia have been deposited at DDBJ/ENA/GenBank 621 under the accession MJEQ00000000 and MJEQ00000000, respectively. All short reads and PacBio reads 622 used in this study were deposited in NCBI under BioProject PRJNA317743 (RNA-seq reads of N. 623 attenuata), PRJNA316810 (short gun reads of N. attenuata), PRJNA378521 (GBS sequencing of AI-RIL 624 population), PRJNA317654 (WGS PacBio long reads of N. attenuata), PRJNA316803 (RNA-seq reads 625 and assembly of N. obtusifolia), and PRJNA316794 (short reads of N. obtusifolia). The assembled 626 genome sequences of N. attenuata and annotation information of all protein coding genes are available 627 from the CoGe platform, which provides genome-browser view and downstream comparative genomic 628 analysis, and Sol Genomics network (https://solgenomics.net/). There are 129 gene models that are 629 deposited in CoGe and SGN but were not submitted to NCBI because they contain introns smaller than 25 630 10bp, which do not fulfill the criteria for NCBI submission. However these genes models are likely to be 631 correct based on their sequence conservation. Therefore, they were retained and made them publically 632 available via SGN and CoGe. Including or excluding these gene models does not affect any conclusion of 633 the study. The python scripts used for phylogenomic analysis are deposited on github 634 (https://github.com/thobrock/Nicotiana-attenuata-genome-project/). 635 26 636 5. Supplemental Tables 637 Table S1. Genomes of 11 plant species used for comparative genomic analysis. Species Version # of gene models URL Reference N. attenuata r2.0 33,449 http://nadh.ice.mpg.de T his st udy N. obtusifolia v1.0 27,911 http://nadh.ice.mpg.de T his st udy The Arabidopsis Genome Initiative. A. thaliana T AIR 10 27,416 http://phytozome.jgi.doe.gov/arabidopsis 2000 (66) http://peppersequence.genomics.cn/page/species/d C. annuum v2.0 35,336 Kim et al. 2014 (67) ownload.jsp Huang et al. 2009 (68) C. sativus v1.0 21,503 http://phytozome.jgi.doe.gov/cucumber Hellsten et al. 2013 (69) M. guttatus v2.0 28,140 http://phytozome.jgi.doe.gov/mimulus Tuskan et al. 2006 (70) P. trichocarpa v3.0 41,335 http://phytozome.jgi.doe.gov/poplar Tomato Consortium. 2012 (35) S. lycopersicum IT AG2 .3 34,727 http://phytozome.jgi.doe.gov/tomato Hirakawa et al. 2014 (71) S. melongena v2.5.1 42,035 ftp://ftp.kazusa.or.jp/pub/eggplant/ Xu et al. 2011 (43) S. tuberosum v3.4 35,119 http://phytozome.jgi.doe.gov/potato Jaillon et al. 2007 (72) V. vinifera Genoscope 12X 26,346 http://phytozome.jgi.doe.gov/grape 638 639 27 640 Table S2. Completeness of N. attenuata and N. obtusifolia, two other published Nicotiana species, 641 potato and tomato genomes (potato version 206 and tomato version 225) estimated based on 248 642 ultra-conse rved CEGs using CEGMA. Comple te Partial Groups Average G roup 1 G roup 2 G roup 3 G roup 4 Average G roup 1 G roup 2 G roup 3 G roup 4 #Prots 208 51 44 52 61 243 63 54 61 65 %Completeness 83.87 77.27 78.57 85.25 93.85 97.98 95.45 96.43 100 100 N. attenuata #Total 463 95 86 121 161 646 145 135 167 199 Average 2.23 1.86 1.95 2.33 2.64 2.66 2.3 2.5 2.74 3.06 %Ortho 63.94 52.94 54.55 71.15 73.77 75.31 66.67 74.07 78.69 81.54 #Prots 211 52 45 55 59 241 65 53 61 62 %Completeness 85.08 78.79 80.36 90.16 90.77 97.18 98.48 94.64 100 95.38 N. obtusifolia #Total 421 96 78 113 134 573 121 117 169 166 Average 2 1.85 1.73 2.05 2.27 2.38 1.86 2.21 2.77 2.68 %Ortho 56.87 51.92 42.22 65.45 64.41 70.54 53.85 66.04 88.52 74.19 #Prots 195 49 39 50 57 234 60 52 60 62 %Completeness 78.63 74.24 69.64 81.97 87.69 94.35 90.91 92.86 98.36 95.38 N. sylvestris #Total 402 85 73 102 142 588 124 118 157 189 Average 2.06 1.73 1.87 2.04 2.49 2.51 2.07 2.27 2.62 3.05 %Ortho 60.51 42.86 61.54 64 71.93 73.93 60 71.15 85 79.03 #Prots 164 51 35 39 39 188 57 42 44 45 %Completeness 66.13 77.27 62.5 63.93 60 75.81 86.36 75 72.13 69.23 N. tomentosiformis #Total 321 84 62 88 87 441 113 88 121 119 Average 1.96 1.65 1.77 2.26 2.23 2.35 1.98 2.1 2.75 2.64 %Ortho 56.71 41.18 48.57 66.67 74.36 72.34 61.4 66.67 81.82 82.22 #Prots 208 51 43 51 63 239 63 51 60 65 %Completeness 83.87 77.27 76.79 83.61 96.92 96.37 95.45 91.07 98.36 100 S. tuberosum #Total 392 84 76 98 134 512 108 102 137 165 Average 1.88 1.65 1.77 1.92 2.13 2.14 1.71 2 2.28 2.54 %Ortho 53.85 41.18 53.49 54.9 63.49 65.27 47.62 62.75 70 80 #Prots 215 55 46 54 60 238 63 52 61 62 %Completeness 86.69 83.33 82.14 88.52 92.31 95.97 95.45 92.86 100 95.38 S. lycopersicum #Total 391 88 73 104 126 497 112 98 136 151 Average 1.82 1.6 1.59 1.93 2.1 2.09 1.78 1.88 2.23 2.44 %Ortho 47.44 34.55 43.48 53.7 56.67 58.4 49.21 55.77 60.66 67.74 643 28 644 Table S3. Summary of filtered (duplicate removal, quality clipping) WGS sequencing data used for 645 the assemblies of N. attenuata and N. obtusifolia. Nicotiana attenuata (2.5Gb) Nicotiana obtusifolia (1.4Gb) Library PE180 PE250 PE600 PE950 MP5500 MP20000 SR454 PE480 PE1050 MP3500 type Number 354,327,6 254,673,1 152,657,0 92,178,3 80,509,5 2,005,65 478,531,3 222,388,2 108,153,56 paired - 62 42 90 08 74 8 66 20 2 reads Number 103,391,4 30,351,12 68,691,92 77,197,1 25,188,0 - - - - - non paired 83 1 4 96 20 Avg. 87.5 97.9 91.7 88.6 75.0 75.0 644.1 91.3 77.6 94.6 length [bp] Length 40.03 27.91 20.30 15.01 6.04 0.150 16.22 43.70 17.26 10.23 sum [Gb] Approx. seq. 16.7 11.6 8.5 6.3 2.5 0.1 6.8 31.2 12.3 7.3 coverage Approx. frag. 13.3 13.3 19.1 18.2 92.3 8.4 6.8 82.0 83.4 135.2 coverage 646 29 647 Table S4. Repeat content among six solanaceous genomes. S. tuberosum S. lycopersicum N. obtusifolia N. tomentosiformis N. sylvestris N. attenuata TE class TE type (Mb) (Mb) (Mb) (Mb) (Mb) (Mb) DNA 10.3 11.9 36.2 66.5 69.4 70.0 DNA/CMC-EnSpm 7.5 9.4 11.7 7.6 8.6 12.4 DNA/Dada 0.1 0.0 2.9 5.5 7.8 7.7 DNA/MULE-MuDR 3.3 3.2 0.4 0.6 0.4 0.6 DNA/MuLE-MuDR 1.4 5.0 3.2 5.5 9.7 9.5 DNA/PIF-Harbinger 5.6 4.3 4.7 4.8 5.8 6.0 DNA DNA/TcMar-Pogo 0.4 0.4 - - - - DNA/TcMar-Stowaway 9.2 4.1 2.7 3.9 4.0 3.8 DNA/hAT 0.1 - 0.1 0.2 0.4 1.1 DNA/hAT-Ac 7.5 4.1 9.0 11.1 12.5 12.7 DNA/hAT-Charlie 0.1 - - - - - DNA/hAT-Tag1 1.7 1.2 1.2 2.0 1.5 2.0 DNA/hAT-Tip100 4.6 2.6 1.5 2.3 2.3 2.2 LINE - - 11.7 14.6 19.2 22.9 LINE/L1 10.6 7.9 18.2 12.0 18.3 24.9 LINE/L1-Tx1 - - 0.0 0.0 0.5 0.6 LINE LINE/L2 - - 0.5 0.2 0.5 0.5 LINE/R1 - - 0.2 0.3 0.4 0.4 LINE/RTE-BovB 5.1 3.3 8.6 13.0 15.4 12.4 LTR 0.1 - 83.1 133.7 129.4 132.5 LTR/Caulimovirus 3.7 2.4 12.5 1.5 2.4 1.1 LTR LTR/Copia 29.1 48.8 42.6 61.5 87.1 131.0 LTR/ERVL - - 0.0 0.9 4.1 4.0 LTR/Gypsy 244.6 318.0 519.7 864.6 1171.4 1201.5 RC/Helitron 2.4 0.6 3.1 0.0 3.0 3.4 Retrotransposon 3.8 3.5 8.1 3.4 12.6 14.8 Others SINE 0.7 0.5 1.3 13.5 2.5 2.5 Satellite 3.8 1.0 5.9 1.9 3.2 4.1 Unknown (Mb) 46.4 37.1 3.2 5.1 9.1 9.5 TE size (Mb) 402.2 469.2 792.3 1236.2 1601.6 1694.2 Assembled genome size excluding Ns (Mb) 676.3 737.6 1223.2 1642.4 2047.5 2090.5 Percentage of TE in the genome (%) 59.5 63.6 64.8 75.3 78.2 81.0 648 30 649 Table S5. Information on samples used for RNA-seq. Treatment / % uniquely mapped Tissue Library ID # raw reads # clean reads Read length development stage reads M. sexta OS induced on Roots leaves NA1498ROT 327,772,944 317,843,200 93.77 50bp Control NA1717LEC 61,531,550 16,430,982 93.29 100bp Leaves M. sexta OS locally 328,071,888 induced NA1500LET 317,905,162 91.94 50bp Smoke solution treated NA1501SES 51,423,280 49,120,770 90.53 50bp Seeds Water treated NA1502SEW 75,944,970 73,525,266 90.31 50bp Dry NA1503SED 63,463,542 61,285,870 89.01 50bp M. sexta OS treated on Stems 72,473,514 leaves NA1504STT 70,575,710 94.70 50bp Early developmental 44,064,054 Corollas st age NA1505COE 23,081,618 95.14 100bp Late developmental stage NA1515COL 41,110,650 17,017,608 94.71 100bp Stigmas Mat ure NA1506STI 39,281,658 17,878,354 94.88 100bp Germinated on pollen Po l l en t u bes 41,692,244 germinat ion medium NA1507POL 22,451,272 96.83 100bp Without pollination NA1508SNP 45,428,336 21,445,272 94.64 100bp Pollinated with non-self 39,071,214 Styles pollen NA1509STO 18,816,496 94.29 100bp Pollinated with self- 50,505,064 pollen NA1510STS 23,107,862 95.20 100bp Nectaries Mat ure NA1511NEC 55,777,620 23,403,620 92.92 100bp Anthers Mat ure NA1512ANT 41,480,422 18,767,454 96.22 100bp O vari e s Mat ure NA1513OVA 39,608,326 17,426,638 94.42 100bp Pedicels No treatment NA1514PED 41,520,690 19,508,002 95.14 100bp Fully opened NA1516OFL 43,791,980 18,550,688 94.87 100bp Fl owe rs Buds NA1517FLB 45,864,746 22,167,496 95.47 100bp Leave samples that were Leaves collect at different time NA1821CTN 37,692,054 36,884,606 95.44 100bp points 31 650 Table S6. Annotation of nicotine biosynthesis genes and number of motifs in the 2kb upstream 651 region. TPM: transcript per million. G ene ID in # of G-box # of GCC Root expression Reference sequences in Gene name References N. attenuata motifs motifs (TPM) database NIATv7_g12091 AO2 5 2 3640.36 DW001381 (73) NIATv7_g36700 QS 2 2 3311.29 AF154657 (73) NIATv7_g25254 QPT2 3 2 2816.12 AJ748263.1 (74) NIATv7_g05809 ODC2 2 0 3866.97 AF127242.1 (73) Nicotine NIATv7_g61142 PMT1.1 4 3 4457.42 D28506.1 (75) biosynthesis NIATv7_g05615 PMT1.2 4 4 1806.61 D28506.1 (75) copies NIATv7_g14063 MPO1 1 3 454.52 AB289456.1 (76) NIATv7_g17187 BBL2.1 2 2 275.14 AB604219.1 (77) NIATv7_g00769 BBL2.2 4 3 956.46 AB604219.1 (77) NIATv7_g09977 A622 4 0 2258.39 AB445396.1 (78) NIATv7_g00125 DAO 1 4 12.01 XM_009602560 - NIATv7_g01601 SPDS1 1 0 226.22 NM_001302585.1 (79) Ancestors/th NIATv7_g11058 DAO2-like 2 0 12.86 AB289457 (76) e closest NIATv7_g29386 AO1 3 0 8.21 XM_009759430 - homologs NIATv7_g29483 BBL-like 0 0 0 AB604221 (77) NIATv7_g34686 ODC1 0 2 11.27 XM_009771959 - NIATv7_g58606 QPT1 1 1 29.29 AJ748262.1 (74) 652 32 653 Table S7. Primers used for validating 2kb upstream sequence regions of selected genes. While direct 654 PCR amplifications worked well for most of the genes using primers directly binding to beginning of the 655 protein coding region and the end of upstream 2kb fragments, nested PCR amplifications were required 656 for four genes, likely due to the presences of repetitive sequences present in these genes. Gene ID Forward Primer Reverse Primer PCR amplification NIATv7_g00125 CTCCATCAAAACTGAAGATCCAGTAGA GGGGAAGAAACCGTCGCCTTTTCCTGA NIATv7_g01601 GTTGGTGTTTATAGAAATATGAGAACCA CGTTGTCGTTGTTGTTGTGGTTGGCTG NIATv7_g04912 CATGAGGGGGAGACATGACAAACAGTT GCAGCGTCTACACAACAACCTAGGGC NIATv7_g05615 ATCCATCCGCTCGTCCATTTCTCTCATG TAGAGCCATTTGTGTTGGTAGATATGA NIATv7_g05809 ATCAGACCAGACAAATTAGCTTATGCG GAATGGCCGCCGGGTTCAACCCGGAA NIATv7_g09978 CTCCGATTTTGGTGACATGAGTTTACCT CAAGGGCTGCTCAACTTCAGACTTCAT NIATv7_g12091 GAGTTGGAGTGGGCTAAAGTTCATGCC CTGTCCGCATCCTGAAGCGATACCAG NIATv7_g14063 CCCCGCTTGGTCCCATTAGGCCTATTGG GTGCCGTCACTTTCTGTTTAGTAGTGG NIATv7_g20235 CATGAATACAGTAGCAAAAATAGGCGT CTCCGAGAAAGAGAAGGTTGCATTAA One step NIATv7_g20333 CTATCTCGCTCGAAATCGAGAACTTTG GTCCTACAATGGCTTCTACGGGAGAA amplification NIATv7_g21863 CTATTTTCCCTGTACTCAGCTTTAGTAAC CCATTTTCCACTGGCGTCCCCTTCAGA NIATv7_g25254 CGGTTAATCGATTGTCAAATATGATTTC CACTGTTGCAGTGAAAGGAATAGCTC NIATv7_g28218 GCGGAAAGGATTCCGACAATA GGGGAA GACTCCATGATTTTCCCCATGCTAATG NIATv7_g29386 ATGCCTGACAGGTAAGAACATATAGGT CCGCTTCCTGAAGCGATACCAGTTGC NIATv7_g29483 GCTAATCAAAGCAACTACACAAGAGTT GAATTATGAGCAGAAGCTGAAGAAAC NIATv7_g32622 GATATCCGGTGGTTTAGGTGGAATGGA GCAACAGTTACCAATCGCTTC TTCTCC NIATv7_g34686 GTACACCGGAGAAAGTATACTCCTAAT GGGAACATGACTCTTACCGATAGCTGT NIATv7_g36700 GACACAATCACGCACAGAGGCCAGAGC GAAGATTTCATGACTAAATTTGCGGCA NIATv7_g42088 CAGATGGATTTTACAATGCCTTTTTGTT CAAGGAGATTAAAATCAGAGAAAGAG NIATv7_g58606 CTCCCAACTCCAGAATCACCATACG GTAATTGCATGAGGATGCACTATTGCA NIATv7_g61142 GCTGGGGCTGGAGACTAAGGAACCAGA GCCATTTGTGTTGGTAGATATGACTTC GTGAAGTACGGTGTATAGGCATGGTGA GATGATGCAAGCGCATGCT 1st nested PCR NIATv7_g14350 GTGAAGTACGGTGTATAGGCATGGTGA GTTGGTGAACTTCGCTCTTAGAGGAAG 2nd nested PCR CAGTCCACACACATTTCGGCAGGTG CTTCATTCCACTGTGCTAGATACTGAA 1st nested PCR NIATv7_g00769 CAGTCCACACACATTTCGGCAGGTG CTTGCTGTTTGGTAGGATAATGAGG 2nd nested PCR GTCATTATCTTGTCTTTTTCTTCTCATCT AAGGATGGCATGTATGAGCTCTTG 1st nested PCR NIATv7_g11058 CATATTGCTGGTGCATGATATCAC GGAGGAGTCACCTTGTGCAAAGTTGC 2nd nested PCR GGATTGAAAGTAGTATATTTTTGTG GGAGGAGTCACCTTGTGCAAAGTTGC 3rd nested PCR TGGGATATTAAAATTCAATATACCCAGT GTAAGGAGATTCATTATTGTTATTCGC 1st nested PCR NIATv7_g34398 TGGGATATTAAAATTCAATATACCCAGT CTTGCTTTTGTACTGACCCCTTTGAG 2nd nested PCR left GACTAAAACTTCTACAAATCGTCTGC GTAAGGAGATTCATTATTGTTATTCGC 2nd nested PCR right CTTACACGTACGGTAAGAGGCTGAGT CCTGCTGCTTGGTAGGATAATGAAC 1st nested PCR NIATv7_g17187 GAACTTCTTGTTTATACTTATGACAAG CGGACTAATAAGGAAAGTGTGTCATC 2nd nested PCR left GAAACTGTTTCTACGGCTCAAAAAA CCTGCTGCTTGGTAGGATAATGAAC 2nd nested PCR right CTTTCGCTTTAGAGATGGATTCACTCTT GTAGCCTGTGCCTCCAATTATTAAGAT 1st nested PCR NIATv7_g09977 CTCAGTTGTTCTAGTGAAGTTGAAG CACATTTACCTATTATAACATGTAAC 2nd nexted PCR left CGATTTTAGCACCACTTTGGCAGCA GTAGCCTGTGCCTCCAATTATTAAGAT 2nd nested PCR right 657 33 658 Table S8. Prediction of miRNA targeting sites on the 2kb upstream region of nicotine biosynthesis 659 genes and their ancestral or non-root specific expressed copies. The overlap between predicted 660 miRNA targeting sites and TE insertions are also provided in the last column. The positions indicate the 661 seeds of predicted miRNA targeting sites. Seed Root-specific GeneID Start End miRNA G eneName Overlap with TE length expression rnd-5_family -2472: NIATv7_g00769 -1937 -1929 NIATTr2_miR77253p.2 8 BBL2.2 YES DNA/hAT-Ac NIATv7_g00769 -1228 -1219 NIATTr2_miR1429.5p 9 BBL2.2 YES - NIATv7_g00769 -1233 -1225 NIATTr2_miR6164 8 BBL2.2 YES - NIATv7_g05615 -36 -28 NIATTr2_miR7504 8 PMT1.2 YES - NIATv7_g05615 -1499 -1491 NIATTr2_miR7711.5p.4 8 PMT1.2 YES rnd-5_family -822: LTR/G y p sy NIATv7_g05615 -737 -729 NIATTr2_miR5163.3p 8 PMT1.2 YES rnd-5_family -58: LTR NIATv7_g05809 -1865 -1857 NIATTr2_miR1429.5p 8 ODC2 YES - rnd-4_family -757: LINE/RTE- NIATv7_g05809 -1450 -1442 NIATTr2_miR162 8 ODC2 YES BovB NIATv7_g05809 -1865 -1857 NIATTr2_miR6161 8 ODC2 YES - NIATv7_g05809 -106 -95 NIATTr2_miR9487 11 ODC2 YES - NIATv7_g09977 -977 -968 NIATTr2_miR1066 9 A622 YES - NIATv7_g09977 -688 -680 NIATTr2_miR6161 8 A622 YES - NIATv7_g12091 -724 -716 NIATTr2_miR8015.3p 8 AO2 YES - NIATv7_g12091 -1092 -1083 NIATTr2_miR398.3p 9 AO2 YES - NIATv7_g14063 -1219 -1211 NIATTr2_miR6444 8 MPO2 YES - NIATv7_g14063 -780 -772 NIATTr2_miR8015.3p 8 MPO2 YES - NIATv7_g17187 -1700 -1692 NIATTr2_miR5497 8 BBL2.1 YES - NIATv7_g25254 -793 -785 NIATTr2_miR6161 8 QPT2 YES - NIATv7_g25254 -1559 -1551 NIATTr2_miR156.5p 8 QPT2 YES - NIATv7_g25254 -940 -932 NIATTr2_miR5750 8 QPT2 YES - NIATv7_g36700 -1075 -1067 NIATTr2_miR6164 8 QS YES rnd-4_family -1369: DNA NIATv7_g61142 -36 -28 NIATTr2_miR7504 8 PMT1.1 YES - NIATv7_g61142 -1042 -1032 NIATTr2_miR1429.5p 10 PMT1.1 YES - NIATv7_g00125 -1151 -1143 NIATTr2_miR7504 8 DAO NO - NIATv7_g01601 -1138 -1130 NIATTr2_miR5750 8 SPDS1 NO - NIATv7_g04912 -1586 -1578 NIATTr2_miR6444 8 ADC2 NO rnd-3_family -18: LTR/Copia NIATv7_g04912 -1190 -1182 NIATTr2_miR8015.3p 8 ADC2 NO rnd-4_family -175: DNA NIATv7_g05934 -67 -59 NIATTr2_miR5750 8 ADC1 NO - NIATv7_g11058 -215 -207 NIATTr2_miR8007.5p 8 MPO1 NO - NIATv7_g11058 -482 -474 NIATTr2_miR172.3p 8 MPO1 NO - NIATv7_g29386 -1738 -1730 NIATTr2_miR8667 8 AO1 NO rnd-4_family -391: DNA NIATv7_g29483 -1830 -1821 NIATTr2_miR1429.5p 9 BBL-like NO - NIATv7_g29483 -825 -817 NIATTr2_miR408.5p 8 BBL-like NO - NIATv7_g29483 -1830 -1821 NIATTr2_miR6161 9 BBL-like NO - 34 rnd-4_family -1284: NIATv7_g34398 -982 -974 NIATTr2_miR7725.3p.2 8 SPDS2 NO DNA/TcMar-Stowaway NIATv7_g34398 -695 -686 NIATTr2_miR6161 9 SPDS2 NO - NIATv7_g34686 -1228 -1220 NIATTr2_miR5069 8 ODC1 NO NIATv7_g34686 -447 -439 NIATTr2_miR6164 8 ODC1 NO rnd-4_family -173: DTT-NIC1 NIATv7_g58606 -1401 -1392 NIATTr2_miR426 9 QPT1 NO rnd-2_family -51: LTR/Gypsy NIATv7_g58606 -1962 -1954 NIATTr2_miR9496 8 QPT1 NO rnd-5_family -192: LTR/G y p sy 662 35 663 6. Supplemental Figures 664 665 Fig. S1. Distribution of Bio-Nano molecule length. Left panel indicates the size distribution of the 666 BioNano molecule length, right panel shows an example of the BioNano output image. 667 36 668 669 670 Fig. S2. Anchoring of scaffolds to the linkage map. In total, 12 linkage groups were constructed. 671 Linkage map and anchored pseudo-chromosomes are shown on left and right sides, respectively. 672 37 673 674 Fig. S3. Workflow used to assemble the genome of N. attenuata. Data used for N. attenuata assembly 675 are shown in the rounded boxes. Software and assembly processes are shown on the left and right side of 676 each arrow, respectively. 677 38 678 679 Fig. S4. Gene family size evolution among 11 plant genomes. The number of all duplication events 680 (above branch nodes) from 11 different plant species. The duplication events were estimated from 681 phylogenetic trees constructed from homologous groups. Only branches that have approximated Bayes 682 branch support values greater than 0.9 are presented. Known whole genome triplication (red stars) and 683 duplication (green stars) events are shown. 39 684 685 Fig. S5. Distribution of 4dTV is consistent with the hypothesis that Nicotiana and Solanum share a 686 genome-wide duplication event. The distributions of fourth-fold degenerate sites (4dTV) between 687 duplicated paralogs within and orthologs between genomes are shown. The comparison N. attenuata with 688 tomato reflects the speciation events between Nicotiana and Solanum. Within genome comparisons show 689 the divergence between duplicated gene pairs and thus reflects genome-wide duplication events. 690 691 40 692 693 Fig. S6. Evolution of threonine deaminase (TD). A) Molecular function of TD, which converts 694 threonine to α- ketobutyrate, the substrate for the biosynthesis of isoleucine. B) Syntenic information of 695 TD in grape and tomato genomes. Upper panel shows the syntenic region between grape and tomato 696 chromosome 9. Two copies of TD were found on the tomato chromosome 10 (between 61.8Mb- 63.8Mb), 697 and one copy was found on the tomato chromosome 9, which was reverse duplicated from chromosome 698 10. C and D). Phylogenetic tree of the TD family (C) and a simplified model (D) show the evolutionary 41 699 history of TD. Numbers on each branch indicate the approximate Bayes branch supports. Local 700 duplication and whole genome duplication events are shown as purple and green circles, respectively. E 701 and F) expression of four TDs among different tissues of N. attenuata (E), and among different time 702 points after M. sexta OS elicitation (F) in leaves. In (F), the expression is shown as mean expression and 703 standard error (from three biological replicates). 704 42 705 706 Fig. S7. Evolution of ornithine decarboxylases (ODC). A and B) phylogenetic tree of ODC among 707 different plants (A) and the predicted evolutionary history of ODC (B). An ancient duplication event 708 occurred before the divergence between Vitaceae and Solanaceae. The number on the branch shows the 709 approximate Bayes branch support. C) Syntenic information between the homologs of ODC1 and ODC2 710 in tomato. No clear signature of synteny was found. D) A dot plot depicts the sequence similarity of CDS 711 and the 2kb upstream region between OCD1 and ODC2 in N. attenuata. E) Detailed annotations of TE 712 and transcription factor binding motifs. Light blue regions indicate the TEs annotated from RepeatMasker. 713 43 714 715 Fig. S8. Evolution of N-putrescine methyltransferases (PMT). A) Phylogenetic tree of PMT and 716 SPERMIDINE SYNTHASE (SPDS) in six plant species. The number on the branch shows the approximate 44 717 Bayes branch support. The closest homolog from Arabidopsis and grape were considered as outgroups. B) 718 and C) simplified evolutionary model of SPDS and PMT in tomato and Nicotiana. The syntenic 719 information used to construct these models is from tomato. D) a simplified schematic representation of 720 SPDS and PMT functions. E) and F) detailed annotation of TE and transcription factor binding motifs of 721 NaPMT1.1 (E) and NaPMT1.2 (F). Light blue regions indicate the TEs annotation from RepeatMasker, 722 and pink rounded rectangles depict the motif sequences and their 150 bp flanking region that shows 723 significant homology to TEs (e-value < 1e-5). 724 725 45 726 727 Fig. S9. Evolution of N-methylputrescine oxidases (MPO). A and B) the phylogenetic tree of MPO and 728 DAO in 11 plant species (A) and the predicted evolutionary history of MPO (B). MPO evolved from the 729 duplication of DAO, likely before the divergence between Phrymaceae (Lamiales) and Solanaceae. The 730 duplication between MPO1 and DAO2 are shared among different Solanaceae species. C) The 731 homologues of MPO1 and DAO2 from tomato are in a syntenic block. D) Two dot plots depict the 46 732 sequence similarity of protein coding sequences and 2kp upstream regions between MPO1 and DAO1 733 (left) and between MPO1 and DAO2 from N. attenuata (right). E) Detailed annotation of TE and 734 transcription factor binding motifs. Pink rounded rectangles depict the motif sequences and their 150 bp 735 flanking region that show significant homology to TEs (e-value < 1e-5). 736 47 737 738 Fig. S10. Evolution of aspartate oxidases (AO). A and B ) phylogenetic tree of AO among 11 plants (A) 739 and the predicted evolutionary history of AO (B). Both N. attenuata and N. obtusifolia have two copies of 740 AO. The number on the branch shows the approximate Bayes branch support. C) and D) The dot plot of 741 CDS sequence together with the 2kp upstream region between N. attenuata AO1 and AO2 (C), and 742 between AO2 from N. attenuata and N. obtusifolia (D). E) Detailed annotation of TE and transcription 743 factor binding motifs. Light blue region indicates the TEs annotated from RepeatMasker, and pink 744 rounded rectangles depict the motifs sequences and their 150 bp flanking region that show significant 745 homology to TEs (e-value < 1e-5). 746 48 747 748 Fig. S11. Evolution of quinolinic acid phosphoribosyl transferases (QPT). A and B) The phylogenetic 749 tree of QPT among 13 plants and the predicted evolutionary history of QPT (B). All Nicotiana species 750 have two copies of QPT, except N. obtusifolia, which is likely due to incomplete genome assembly. The 751 number on the branch shows the bootstrap value. C and D) Dot plots depicting the sequence similarity of 752 CDS sequence together with the 2kp upstream region of QPT1 and QTP2 in N. attenuata (C) and QPT2 753 between N. attenuata and N. obtusifolia (D). E) Detailed annotation of TE and transcription factor 754 binding motifs. Light blue regions indicate the TE annotated from RepeatMasker. 755 49 756 50 757 Fig. S12. Evolution of berberine bridge enzyme-like proteins (BBL). A and B) Phylogenetic tree of 758 BBLs among different plants (A) and the predicted evolutionary history of BBL (B). Gene structures of all 759 BBL homologs are shown. The numbers on the branches show the bootstrap values. C-F) Dot plots of 760 CDS sequence together with 2kp upstream region between NaBBL2.2 and NoBBL2 (C), NaBBL2.1 and 761 NaBBL2.2 (D), NaBBL2.1and NaBBL-like (E), NaBBL2.2 and NaBBL-like (F). G and H) Detailed 762 annotation of TE and transcription factor binding motifs of NaBBL2.1 (G) and NaBBL2.2 (H). Light blue 763 regions indicate the TEs annotated from RepeatMasker, and pink rounded rectangles depict the motifs 764 sequences and their 150 bp flanking region that show significant homology to TEs (e-value < 1e-5). Both 765 NaBBL2.1 and NaBBL2.2 have an insertion of the DTT-NIC1 MITE insertion. 766 51 767 768 Fig. S13. Evolution of quinolinic acid synthases (QS). A) Phylogenetic tree of QS among 11 plants. 769 Both N. attenuata and N. obtusifolia have one copy of QS. The number on the branch shows the 770 approximate Bayes branch support. B) A dot plot depicts the sequence similarity of CDS sequence 771 together with the 2kp upstream region of QS between N. attenuata and N. obtusifolia. C) Detailed 772 annotation of TE and transcription factor binding motifs. Light blue regions indicate the TEs annotated 773 from RepeatMasker. 774 52 775 776 777 Fig. S14. Evolution of A622 protein. A) A subclade tree of the A622 and A622-like superfamily is 778 shown among six plants. Both N. attenuata and N. obtusifolia have single copies of A622. The numbers 779 on the branch show the approximate Bayesian support. B) The dot plot depicts the sequence similarity of 780 protein coding sequences and 2kp upstream regions between A622 from N. attenuata and N. obtusifolia. 781 C) Detailed annotation of TE and transcription factor binding motifs. Light blue regions indicate the TEs 782 annotated from RepeatMasker, and pink rounded rectangles depict the motifs sequences and their 150 bp 783 flanking region that show significant homology to TEs (e-value < 1e-5). 53 784 785 Fig. S15. Evolution of root-specific expression of nicotine biosynthesis genes. Heatmaps depict the 786 expression of nicotine biosynthesis genes or their orthologues in N. attenuata, tomato and potato. The 787 results show that PMT1, A622-like and BBL-like in tomato showed root-specific expression, although 788 their absolute expression levels are low. It is worth noting that the enzymatic functions of A622-like and 789 BBL-like genes in tomato and potato remain unknown. The expression data of N. attenuata, tomato and 790 potato were retrieved from N. attenuata datahub (http://nadh.ice.mpg.de/NaDH/), tomato eFP browser 791 (http://bar.utoronto.ca/efp_tomato/cgi-bin/efpWeb.cgi) and potato eFP browser 792 (http://bar.utoronto.ca/efp_potato/cgi-bin/efpWeb.cgi), respectively. 793 794 54 795 796 Fig. S16. Evolution of MYC2 in Nicotiana. Phylogenetic tree of MYC2 among different plants (A) and 797 the proposed evolutionary history of MYC2 (B). The numbers on the branches show the bootstrap values. 798 55 799 800 Fig. S17. Evolution of ERF189 in Nicotiana. (a) Phylogenetic tree of ERF189 among different plants. 801 The numbers on the branches denote approximate Bayes branch support values. Red arrows indicate the 56 802 tandem duplications that were identified based on synteny information. (b) Evolutionary model of the 803 ERF189 gene cluster based on both phylogenetic relationship among homolog genes and their syntenic 804 information in P. axillaris (80), tomato (35) and N. obtusifolia. Lineage specific tandem duplications of 805 ERFs are shown in blue (Petunia), light blue (Solanum) and red (Nicotiana). The expression profile of 806 ERF189 and its orthologous gene in tomato were retrieved from N. attenuata data hub (nadh.ice.mpg.de) 807 and tomato eFP browser (http://bar.utoronto.ca/efp_tomato/cgi-bin/efpWeb.cgi), respectively. The 808 asterisk indicates the gene NIOBTv3_g42088 is likely a pseudogene in N. obtusifolia. 809 57 810 811 Fig. S18. Distribution of expressed genes and transcripts among different tissue-specific RNA-seq 812 libraries. A) numbers of expressed genes; B) number of expressed total transcripts; C) number of 813 expressed transcripts that contain transposon sequences. The abbreviations of sample information are: 814 ROT: root from plant induced by M. sexta OS; LET: leaf from plant induced by M. sexta OS; LEC: leaf 815 from non-treated plant; SED: dry seeds; SEW: seeds treated with water; SES: seeds treated with liquid 816 smoke; STT: stem from M. sexta OS-induced plant; COE: corollas at early developmental stage; COL: 58 817 corollas at late developmental stage; STI: styles; POL: pollen tubes grown in pollen germination media; 818 SNP: stigmas not pollinated; STO: stigmas outcrossed; STS: stigmas self-pollinated; NEC: nectaries; 819 ANT: anthers; OVA: ovaries; PED: pedicels; OFL: open flower; FLB: flower bud. 59 820 821 Fig. S19. Distribution of tissue-specific expressed genes and transcripts. A) numbers of tissue-specific 822 expressed genes; B) numbers of tissue-specific expressed transcripts. 823 60 824 825 Fig. S20. Distribution of different alternative splicing events in N. attenuata. AltA refers to 826 alternative acceptor site; AltD refers to alternative donor site; IR refers to intron retention and ES 827 refers to exon skipping. 828 61 829 62 830 Fig. S21. Expression of transposable elements among nine tissues. A) heatmap depicting the relative 831 expression of 577 TE families that showed expression in at least one tissue (FPKM>1). Classification of 832 the different TEs is shown with different colors on the left side. B) Boxplots showing the expression of 833 TEs among different tissues. Different letters indicate statistical differences (multiple comparisons after 834 Kruskal-Wallis test P < 0.01) among different tissues. Numbers on each boxplot indicate median values 835 for each tissue. 63 836 837 838 839 840 841 Fig. S22. Percentage of TEs induced by M. sexta oral secretions in N. attenuata leaves within a 21 h 842 time course. Each color indicates different classes of TE families. 843 64 844 845 846 Fig. S23. The most abundant 20 Pfam domains in N. attenuata. 847 65 848 849 850 Fig. S24. Top 20 superfamily domains in N. attenuata. 851 852 66 853 854 Fig. S25. Distribution of gene families among four plant species. 855 67 856 857 Fig. S26. Expansion of SLF gene family in Solanaceae. Phylogenetic diagram showing the rapid gene 858 duplication (colors) and loss (gray) of SLF subfamily members in the Solanaceae. The sequence from 859 Pyrus pyrifolia was used as an outgroup. The number at each branch indicates the approximate Bayes 860 branch supports. Red circles indicate the detected duplication shared among Solanaceae species. The 861 color of each node indicates the species while the gene loss in a subclade is indicated by the gray regions. 862 Nodes with only color but no gene ID indicate the gene loss detected in the given species/clade. 863 864 68 865 866 Fig. S27. Expansion of the Zeatin O-glucosyltransferase-like gene family in Solanaceae. A 867 phylogenetic tree showing the expansion of the family. Numbers at the branches refer to the approximate 868 Bayes branch support. Red circles indicate the detected duplication events shared among Solanaceae 869 species. 870 871 69 872 873 Fig. S28. Expansion of a disease resistant gene family (RPM1-like) in Nicotiana. A) phylogenetic tree 874 of the gene family. This gene family was found specific to Solanaceae plants. While no duplication was 875 found in other Solanaceae species, three duplication events were identified in Nicotiana. Red circles 876 indicate gene duplication events. The numbers below the branches refer to the approximate Bayes branch 877 supports. B) Genomic localization of the four genes in N. attenuata. 878 879 880 70 881 882 Fig. S29. Expansion of the purine uptake permease gene family in Nicotiana. A) phylogenetic tree of 883 the gene family. The color indicates the species. Red circles indicate Nicotiana-specific gene duplication 884 events. The numbers at the branches refer to the approximate Bayes branch supports. B) Heatmap 885 showing the expression of genes from six representative N. attenuata tissues. The orthologous genes of N. 886 tabaccum NICOTINE UPTAKE PERMEASE1 (NUP1) is highlighted in red color. ROT: root, LET: leaf 887 treated with M. sexta oral secretion. LEC, control untreated leaf; STT: stem from plant treated with M. 888 sexta oral secretion in leaf; NEC: nectary; OFL: flower. Heatmap color gradient indicates the scaled log10 889 TPM value. 71 890 7. Captions for supplementary datasets S1 to S9 891 Supplementary Dataset 1. Summary information on MITE families in Nicotiana and their 892 homologous families in Solanum genomes. Different superfamilies are represented by different 893 codes. DTA for Tc1/Mariner, DTA for hAT, DTH for PIF/Harbinger. For tomato and potato, the 894 original MITE IDs are shown in the last column. The biggest MITE family in Nicotiana, DTT-NIC1, 895 is highlighted in bold font. 896 Supplementary Dataset 2. Annotation of smRNAs in N. attenuata. 897 Supplementary Dataset 3. Expression of genes among 21 RNA-seq libraries. The expression 898 numbers are TPM values. The abbreviation of sample information are: ROT: root from plant induced 899 by M. sexta OS; LET: leaf from plant induced by M. sexta OS; LEC: leaf from non-treated plant; 900 SED: dry seeds; SEW: seeds treated with water; SES: seeds treated with liquid smoke; STT: stem 901 from M. sexta OS-induced plant; COE: corollas at early developmental stage; COL: corollas at late 902 developmental stage; STI: styles; POL: pollen tubes grown in pollen germination media; SNP: 903 stigmas not pollinated; STO: stigmas outcrossed; STS: stigmas self-pollinated; NEC: nectaries; ANT: 904 anthers; OVA: ovaries; PED: pedicels; OFL: open flower; FLB: flower bud; CTN: leaf samples 905 collected at different time points pooled together. Tissue specificity was calculated using the τ index 906 among all tissues except CTN. 907 Supplementary Dataset 4. Expression of assembled transcripts among 21 RNA-seq libraries. 908 The expression numbers are TPM values. The abbreviation of sample information are: ROT: root 909 from plant induced by M. sexta OS; LET: leaf from plant induced by M. sexta OS; LEC: leaf from 910 non-treated plant; SED: dry seeds; SEW: seeds treated with water; SES: seeds treated with liquid 911 smoke; STT: stem from M. sexta OS-induced plant; COE: corollas at early developmental stage; 912 COL: corollas at late developmental stage; STI: styles; POL: pollen tubes grown in pollen 913 germination media; SNP: stigmas not pollinated; STO: stigmas outcrossed; STS: stigmas self- 914 pollinated; NEC: nectaries; ANT: anthers; OVA: ovaries; PED: pedicels; OFL: open flower; FLB: 915 flower bud; CTN: leaf samples collected at different time points pooled together. Tissue specificity 72 916 and transcripts of which at least 50bp were annotated as TE are also indicated. Tissue specificity was 917 calculated using the τ index among all tissues except CTN. 918 Supplementary Dataset 5. Expression of microarray probes that are both induced by 919 simulated herbivory by application of M. sexta oral secretion and mapped to transposable 920 elements of N. attenuata. SL: systemic untreated leaf; TL: treated leaf; RT: systemic untreated root. 921 Supplementary Dataset 6. KEGG ortholog (KO) annotation of N. attenuata genes. 922 Supplementary Dataset 7. List of genes annotated as protein kinases and transcription factors. 923 Supplementary Dataset 8. List of recent duplication events of N. attenuata genes and their 924 duplicated copies. Annotations of the duplication that the gene was most recently involved, and the 925 functional annotation of gene based on Blast2GO are also indicated. Detected duplication times are: 926 N. attenuata specific; shared among Nicotiana; shared among Solanaceae; shared with M. guttatus; 927 shared among core eudicots. Inferred duplication types are: genome-wide duplications, tandem 928 duplications. 929 Supplementary Dataset 9. Gene families that expanded significantly in Nicotiana. 930 73 931 8. References 932 1. Krügel T, Lim M, Gase K, Halitschke R, & Baldwin IT (2002) Agrobacterium-mediated 933 transformation of Nicotiana attenuata, a model ecological expression system. Chemoecology 934 12(4):177-183. 935 2. Bubner B, Gase K, & Baldwin IT (2004) Two-fold differences are the detection limit for 936 determining transgene copy numbers in plants by real-time PCR. BMC Biotechnol. 4. 937 3. VanBuren R, et al. (2015) Single-molecule sequencing of the desiccation-tolerant grass 938 Oropetium thomaeum. Nature 527(7579):508-511. 939 4. Luo RB, et al. (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de 940 novo assembler. GigaScience 1. 941 5. English AC, et al. (2012) Mind the gap: upgrading genomes with Pacific Biosciences RS long- 942 read sequencing technology. PLoS One 7(11). 943 6. Walker BJ, et al. (2014) Pilon: An integrated tool for comprehensive microbial variant detection 944 and genome assembly improvement. PLoS One 9(11). 945 7. Peng Y, Leung HCM, Yiu SM, & Chin FYL (2012) IDBA-UD: a de novo assembler for single- 946 cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28(11):1420- 947 1428. 948 8. Cao HZ, et al. (2014) Rapid detection of structural variation in a human genome using 949 nanochannel-based genome mapping technology. GigaScience 3. 950 9. Shelton JM, et al. (2015) Tools and pipelines for BioNano data: molecule assembly pipeline and 951 FASTA super scaffolding tool. BMC Genomics 16. 952 10. Zhou W, et al. (In press) Tissue-specific emission of (E)-α-bergamotene helps resolve the 953 dilemma when pollinators are also herbivores. Curr. Biol. 954 11. Lindgreen S (2012) AdapterRemoval: easy cleaning of next-generation sequencing reads. BMC 955 research notes 5:337. 956 12. McKenna A, et al. (2010) The genome analysis toolkit: a mapreduce framework for analyzing 957 next-generation DNA sequencing data. Genome Res 20(9):1297-1303. 958 13. Wu YH, Bhat PR, Close TJ, & Lonardi S (2008) Efficient and accurate construction of genetic 959 linkage maps from the minimum spanning tree of a graph. PLoS Genetics 4(10). 960 14. Tang HB, et al. (2015) ALLMAPS: robust scaffold ordering based on multiple maps. Genome 961 Biol 16. 962 15. Wu TD & Watanabe CK (2005) GMAP: a genomic mapping and alignment program for mRNA 963 and EST sequences. Bioinformatics 21(9):1859-1875. 964 16. Kim D, et al. (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, 965 deletions and gene fusions. Genome Biol 14(4). 966 17. SanMiguel P, et al. (1996) Nested retrotransposons in the intergenic regions of the maize genome. 967 Science 274(5288):765-768. 968 18. Yanai I, et al. (2005) Genome-wide midrange transcription profiles reveal expression level 969 relationships in human tissue specification. Bioinformatics 21(5):650-659. 970 19. Lowe TM & Eddy SR (1997) tRNAscan-SE: A program for improved detection of transfer RNA 971 genes in genomic sequence. Nucleic Acids Res 25(5):955-964. 972 20. Lagesen K, et al. (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. 973 Nucleic Acids Res 35(9):3100-3108. 974 21. Stanke M, Steinkamp R, Waack S, & Morgenstern B (2004) AUGUSTUS: a web server for gene 975 finding in eukaryotes. Nucleic Acids Res 32:W309-W312. 976 22. Cantarel BL, et al. (2008) MAKER: An easy-to-use annotation pipeline designed for emerging 977 model organism genomes. Genome Res 18(1):188-196. 978 23. Sierro N, et al. (2013) Reference genomes and transcriptomes of Nicotiana sylvestris and 979 Nicotiana tomentosiformis. Genome Biol 14(6). 74 980 24. Trapnell C, Pachter L, & Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. 981 Bioinformatics 25(9):1105-1111. 982 25. Li B & Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or 983 without a reference genome. BMC Bioinformatics 12:323. 984 26. Brooks AN, et al. (2011) Conservation of an RNA regulatory map between Drosophila and 985 mammals. Genome Res 21(2):193-202. 986 27. Ling Z, Zhou W, Baldwin IT, & Xu S (2015) Insect herbivory elicits genome-wide alternative 987 splicing responses in Nicotiana attenuata. Plant J. 84(1):228-243. 988 28. Jin Y, Tam OH, Paniagua E, & Hammell M (2015) TEtranscripts: a package for including 989 transposable elements in differential expression analysis of RNA-seq datasets. Bioinformatics 990 31(22):3593-3599. 991 29. Kim SG, Yon F, Gaquerel E, Gulati J, & Baldwin IT (2011) Tissue specific diurnal rhythms of 992 metabolites and their regulation during herbivore attack in a native tobacco, Nicotiana attenuata. 993 PLoS One 6(10):e26214. 994 30. Quinlan AR & Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic 995 features. Bioinformatics 26(6):841-842. 996 31. Diboun I, Wernisch L, Orengo CA, & Koltzenburg M (2006) Microarray analysis after RNA 997 amplification can detect pronounced differences in gene expression using limma. BMC Genomics 998 7. 999 32. Gotz S, et al. (2008) High-throughput functional annotation and data mining with the Blast2GO 1000 suite. Nucleic Acids Res 36(10):3420-3435. 1001 33. Xie C, et al. (2011) KOBAS 2.0: a web server for annotation and identification of enriched 1002 pathways and diseases. Nucleic Acids Res 39:W316-W322. 1003 34. Jones P, et al. (2014) InterProScan 5: genome-scale protein function classification. 1004 Bioinformatics 30(9):1236-1240. 1005 35. Sato S, et al. (2012) The tomato genome sequence provides insights into fleshy fruit evolution. 1006 Nature 485(7400):635-641. 1007 36. Riano-Pachon DM, Ruzicic S, Dreyer I, & Mueller-Roeber B (2007) PlnTFDB: an integrative 1008 plant transcription factor database. BMC Bioinformatics 8. 1009 37. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high 1010 throughput. Nucleic Acids Res 32(5):1792-1797. 1011 38. Abascal F, Zardoya R, & Telford MJ (2010) TranslatorX: multiple alignment of nucleotide 1012 sequences guided by amino acid translations. Nucleic Acids Res 38:W7-W13. 1013 39. Capella-Gutierrez S, Silla-Martinez JM, & Gabaldon T (2009) trimAl: a tool for automated 1014 alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25(15):1972-1973. 1015 40. Guindon S, Delsuc F, Dufayard JF, & Gascuel O (2009) Estimating maximum likelihood 1016 phylogenies with PhyML. Bioinformatics for DNA Sequence Analysis 537:113-137. 1017 41. Darriba D, Taboada GL, Doallo R, & Posada D (2012) jModelTest 2: more models, new 1018 heuristics and parallel computing. Nature Methods 9(8):772-772. 1019 42. Page RDM & Charleston MA (1997) From gene to organismal phylogeny: Reconciled trees and 1020 the gene tree species tree problem. Mol Phylogenet Evol 7(2):231-240. 1021 43. Xu X, et al. (2011) Genome sequence and analysis of the tuber crop potato. Nature 1022 475(7355):189-U194. 1023 44. Rausher MD & Huang J (2016) Prolonged adaptive evolution of a defensive gene in the 1024 Solanaceae. Mol. Biol. Evol. 33(1):143-151. 1025 45. Gonzales-Vigil E, Bianchetti CM, Phillips GN, & Howe GA (2011) Adaptive evolution of 1026 threonine deaminase in plant defense against insect herbivores. Proc. Natl. Acad. Sci. U.S.A. 1027 108(14):5897-5902. 1028 46. Kang JH, Wang L, Giri A, & Baldwin IT (2006) Silencing threonine deaminase and JAR4 in 1029 Nicotiana attenuata impairs jasmonic acid-isoleucine-mediated defenses against Manduca sexta. 1030 Plant Cell 18(11):3303-3320. 75 1031 47. Zhang Z, et al. (2006) KaKs_Calculator: calculating Ka and Ks through model selection and 1032 model averaging. Genomics Proteomics Bioinformatics 4(4):259-263. 1033 48. Ossowski S, et al. (2010) The rate and molecular spectrum of spontaneous mutations in 1034 Arabidopsis thaliana. Science 327(5961):92-94. 1035 49. Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 1036 24(8):1586-1591. 1037 50. Thorne JL & Kishino H (2002) Divergence time and evolutionary rate estimation with multilocus 1038 data. Syst. Biol. 51(5):689-702. 1039 51. Guyot R, et al. (2012) Ancestral synteny shared between distantly-related plant species from the 1040 asterid (Coffea canephora and Solanum Sp.) and rosid (Vitis vinifera) clades. BMC Genomics 13. 1041 52. Wu FN & Tanksley SD (2010) Chromosomal evolution in the plant family Solanaceae. BMC 1042 Genomics 11. 1043 53. Williams JS, Der JP, dePamphilis CW, & Kao TH (2014) Transcriptome analysis reveals the 1044 same 17 S-locus F-box genes in two haplotypes of the self-incompatibility locus of Petunia 1045 inflata. Plant Cell 26(7):2873-2888. 1046 54. Goldberg EE, et al. (2010) Species selection maintains self-incompatibility. Science 1047 330(6003):493-495. 1048 55. Havlova M, et al. (2008) The role of cytokinins in responses to water deficit in tobacco plants 1049 over-expressing trans-zeatin O-glucosyltransferase gene under 35S or SAG12 promoters. Plant 1050 Cell Environ 31(3):341-353. 1051 56. Schäfer M, et al. (2015) Cytokinin levels and signaling respond to wounding and the perception 1052 of herbivore elicitors in Nicotiana attenuata. J Integr Plant Biol 57(2):198-212. 1053 57. Hildreth SB, et al. (2011) Tobacco nicotine uptake permease (NUP1) affects alkaloid metabolism. 1054 Proc. Natl. Acad. Sci. U.S.A. 108(44):18179-18184. 1055 58. Kato K, Shoji T, & Hashimoto T (2014) Tobacco nicotine uptake permease regulates the 1056 expression of a key transcription factor gene in the nicotine biosynthesis pathway. Plant Physiol. 1057 166(4):2195-U1500. 1058 59. Kessler D, et al. (2012) Unpredictability of nectar nicotine promotes outcrossing by 1059 hummingbirds in Nicotiana attenuata. Plant J. 71(4):529-538. 1060 60. Eckart E (2008) Solanaceae and convolvulaceae: secondary metabolites: biosynthesis, 1061 chemotaxonomy, biological and economic significance (Springer, Berlin). 1062 61. Chintapakorn Y & Hamill JD (2007) Antisense-mediated reduction in ADC activity causes minor 1063 alterations in the alkaloid profile of cultured hairy roots and regenerated transgenic plants of 1064 Nicotiana tabacum. Phytochemistry 68(19):2465-2479. 1065 62. Dalton HL, et al. (2016) Effects of down-regulating ornithine decarboxylase upon putrescine- 1066 associated metabolism and growth in Nicotiana tabacum L. J. Exp. Bot. 67(11):3367-3381. 1067 63. Pandey SP, Shahi P, Gase K, & Baldwin IT (2008) Herbivory-induced changes in the small-RNA 1068 transcriptome and phytohormone signaling in Nicotiana attenuata. Proc. Natl. Acad. Sci. U.S.A. 1069 105(12):4559-4564. 1070 64. Li FF, et al. (2015) Regulation of nicotine biosynthesis by an endogenous target mimicry of 1071 microRNA in tobacco. Plant Physiol. 169(2):1062-1071. 1072 65. Brockmöller T, et al. (2017) Nicotiana attenuata Data Hub (NaDH): an integrative platform for 1073 exploring genomic, transcriptomic and metabolomic data in wild tobacco. BMC Genomics 1074 18(1):79. 1075 66. Kaul S, et al. (2000) Analysis of the genome sequence of the flowering plant Arabidopsis 1076 thaliana. Nature 408(6814):796-815. 1077 67. Kim S, et al. (2014) Genome sequence of the hot pepper provides insights into the evolution of 1078 pungency in Capsicum species. Nat. Genet. 46(3):270-278. 1079 68. Huang SW, et al. (2009) The genome of the cucumber, Cucumis sativus L. Nat. Genet. 1080 41(12):1275-U1229. 76 1081 69. Hellsten U, et al. (2013) Fine-scale variation in meiotic recombination in Mimulus inferred from 1082 population shotgun sequencing. Proc. Natl. Acad. Sci. U.S.A. 110(48):19478-19482. 1083 70. Tuskan GA, et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). 1084 Science 313(5793):1596-1604. 1085 71. Hirakawa H, et al. (2014) Draft genome sequence of eggplant (Solanum melongena L.): the 1086 representative Solanum species indigenous to the old world. DNA Res. 21(6):649-660. 1087 72. Jaillon O, et al. (2007) The grapevine genome sequence suggests ancestral hexaploidization in 1088 major angiosperm phyla. Nature 449(7161):463-U465. 1089 73. Shoji T, Kajikawa M, & Hashimoto T (2010) Clustered transcription factor genes regulate 1090 nicotine biosynthesis in tobacco. Plant Cell 22(10):3390-3409. 1091 74. Shoji T & Hashimoto T (2011) Recruitment of a duplicated primary metabolism gene into the 1092 nicotine biosynthesis regulon in tobacco. Plant J. 67(6):949-959. 1093 75. Shoji T, Yamada Y, & Hashimoto T (2000) Jasmonate induction of putrescine N- 1094 methyltransferase genes in the root of Nicotiana sylvestris. Plant Cell Physiol. 41(7):831-839. 1095 76. Shoji T & Hashimoto T (2008) Why does anatabine, but not nicotine, accumulate in jasmonate- 1096 elicited cultured tobacco BY-2 cells? Plant Cell Physiol. 49(8):1209-1216. 1097 77. Kajikawa M, Shoji T, Kato A, & Hashimoto T (2011) Vacuole-localized berberine bridge 1098 enzyme-like proteins are required for a late step of nicotine biosynthesis in tobacco. Plant Physiol. 1099 155(4):2010-2022. 1100 78. Kajikawa M, Hirai N, & Hashimoto T (2009) A PIP-family protein is required for biosynthesis of 1101 tobacco alkaloids. Plant Mol. Biol. 69(3):287-298. 1102 79. Hashimoto T, Tamaki K, Suzuki K, & Yamada Y (1998) Molecular cloning of plant spermidine 1103 synthases. Plant Cell Physiol. 39(1):73-79. 1104 80. Bombarely A, et al. (2016) Insight into the evolution of the Solanaceae from the parental 1105 genomes of Petunia hybrida. Nat. Plants 2(6). 1106 77