
Supplement: Subgenome dominance in an interspecific hybrid, synthetic allopolyploid, and a 140 year old naturally established neo-allopolyploid monkeyflower

Patrick P. Edger1,*, Ronald Smith2, Michael R. McKain3, Arielle M. Cooley4, Mario Vallejo-Marin5,

Yaowu Yuan6, Adam J. Bewick7, Lexiang Ji7, Adrian E. Platts8, Megan Bowman9, Kevin Childs9,

Robert J. Schmitz7, Gregory D. Smith2, J. Chris Pires10 and Joshua R. Puzey11,*,a

1Department of Horticulture, Michigan State University, East Lansing, MI

2Department of Applied Science, College of William and Mary, Williamsburg, VA

3Donald Danforth Science Center, St. Louis, MO

4Biology Department, Whitman College, Walla Walla, WA

5Biological and Environmental Sciences, University of Stirling, Stirling, UK

6Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT

7Department of Genetics, University of Georgia, Athens, GA

8McGill Centre for Bioinformatics, McGill University, Montreal, QC, Canada H3A 0E9

9Department of Plant Biology, Michigan State University, East Lansing, MI

10Division of Biological Sciences, University of Missouri, Columbia, MO 65211

11Department of Biology, College of William and Mary, Williamsburg, VA

*These authors contributed equally

aAuthor for correspondence: [email protected]

Contents

1 Methods
1.1 Genome assembly and annotation
1.2 Whole genome alignments and phylogenetic analyses
1.3 RNA-sequencing and quantification of gene expression
1.4 Whole genome methylation sequencing and analysis
1.5 LRT to detect homeolog expression bias
1.6 Methods for Transposon annotation
1.6.1 Autonomous cut-and-paste TEs
1.6.2 Helitrons
1.6.3 LTR retrotransposons
1.6.4 LINEs
1.6.5 SINEs
1.7 TE exemplar

List of Figures

1 Concatenation tree. Maximum likelihood phylogeny of 96 concatenated single copy loci identified across sampled species. Bootstrap values are displayed on branches for 500 replicates. Scale indicates number of substitutions per site.
2 Astral tree. Coalescence-based phylogeny of 96 single copy loci estimated in ASTRAL. Node labels represent bootstrap values for 100 replicates.
3 Results of PUG analysis using the coalescence tree. Putative paralog pairs were estimated in the PUG software based on gene tree topologies. Nodes with bootstrap values less than 80 were collapsed. Total numbers of unique gene duplications are mapped to each branch, suggesting five to six polyploid events. Nodes that have less than 10% of the highest number of unique duplications (3312) are not mapped. The basal node of the Lamiales suggests a polyploid event, but support for this event is potentially skewed by the gamma triplication shared between Lamiales and Solanum lycopersicum.
4 Summary of genome-wide transposon abundances
5 Relationship of TE density and gene expression for M. luteus subgenomic regions. Top: F1 hybrid, Middle: Natural allopolyploid, Bottom: Synthetic allopolyploid
6 Relationship of TE density and gene expression for M. guttatus subgenomic regions. Top: F1 hybrid, Middle: Natural allopolyploid, Bottom: Synthetic allopolyploid

1 Methods

1.1 Genome assembly and annotation

Genome assembly of M. luteus was performed using the ALLPATHS-LG assembler v45395 [1] with approximately 33x coverage of Illumina paired-end (2x100bp) reads and a TruSeq mate-pair (5Kb insert) library. Additional gap closing was undertaken using GapCloser (v. 1.12, BGI). The assembly totaled 409Mb in 6,349 scaffolds with a scaffold N50/L50 of 283/439Kb, a contig N50 of 52Kb and 96 percent of bases called (IUPAC ambiguous bases converted to N). This is 60 percent of the total genome size (680Mb) anticipated by the assembler from the k-mer depth distribution. Genome completeness in terms of gene content was assessed using BUSCO [2] with default settings and a set of universal single-copy orthologs. The vast majority of BUSCO groups, 931 of 956 (97.4 percent), were identified in the M. luteus genome assembly, with 837 duplicates. TEs were annotated using a combination of sequence and structure homology as well as de novo approaches (see TE annotation methods). The MAKER genome annotation pipeline was used to annotate the M. luteus and M. guttatus genome assemblies using the SNAP, AUGUSTUS and Fgenesh gene prediction programs [3, 4, 5, 6]. Species-specific transcript assemblies as well as TAIR Arabidopsis thaliana and plant-specific SwissProt protein sequences were aligned to the repeat-masked genome assemblies and used as evidence to aid the gene prediction programs [7] [8]. Gene predictions with transcript, protein or protein domain support were retained as final high-quality gene predictions [9]. Assembled genome sequences and annotation files (GFFs), including gene models, are deposited on CoGe (https://genomevolution.org/coge; Genome IDs 22656 and 22665) hosted by National Science Foundation - CyVerse (http://www.cyverse.org/). All raw Illumina data, in fastq format, will be made available on SRA following acceptance (SRA#).
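As an illustration of the contiguity statistics quoted above, the following minimal sketch computes N50 and L50 from a list of sequence lengths using their common definitions. It is an assumption that the reported values follow these definitions; the function names are hypothetical and the assembler's own reporting code is not reproduced here.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold or contig lengths.

    N50: length of the shortest sequence in the smallest set of longest
    sequences that together cover at least half of the assembly.
    L50: number of sequences in that set.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= half:
            return length, count


def assembled_fraction(assembly_bp, estimated_genome_bp):
    """Fraction of the estimated genome size present in the assembly."""
    return assembly_bp / estimated_genome_bp


# Toy example with five hypothetical scaffolds.
print(n50_l50([10, 8, 6, 4, 2]))
# Assembly size and k-mer based genome size estimate quoted in the text.
print(round(assembled_fraction(409e6, 680e6), 2))
```

With the figures from the text, the second call gives the 60 percent assembled fraction stated above.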

1.2 Whole genome alignments and phylogenetic analyses

Pairwise genomic alignments between the M. luteus and M. guttatus genomes were made using SynMap [10] and QUOTA-ALIGN [11], and then filtered to identify syntenic orthologs between the two genomes and retained homeologs in M. luteus. Data for various members of the order Lamiales were downloaded from NCBI-SRA and combined with newly generated transcriptome data for M. peregrinus and genomic data from M. luteus, M. guttatus (v.1.2) [12], and Solanum lycopersicum (ITAG2.3) [13] for phylogenetic analysis of retained duplicates.

1.3 RNA-sequencing and quantification of gene expression

RNA-seq libraries were constructed using the TruSeq RNA kit (Illumina; San Diego, CA) from total RNA isolated from three tissue types (calyx, stem, and petals) of both M. guttatus and M. luteus, and then sequenced with single-end 100bp reads on an Illumina HiSeq-2000 at the University of Missouri Sequencing Core. Illumina reads were quality filtered using NextGENe v2.3.3.1 (SoftGenetics; State College, PA) by removing adapter sequences and reads with a median quality score of less than 22, trimming reads at positions that had three consecutive bases with a quality score of less than 20, and discarding any trimmed reads with a total length less than 40bp. This resulted in 87.9 percent of the reads passing the quality-score filter. Expression levels, in FPKM (fragments per kilobase per million reads), were determined for all genes in the M. guttatus and M. luteus genomes. Quality-filtered reads for each library were aligned to the respective genomes using NextGENe v2.3.3.1, counting only uniquely mapped reads, with the parameters (A. Matching Requirement: >40 bases and >99 percent, B. Allow Ambiguous Mapping: FALSE, and C. Rigorous Alignment: TRUE), resulting in the alignment of over 189.6 million reads to the M. guttatus and M. luteus genomes.
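The normalization behind FPKM can be sketched as follows. This is the standard definition (counts scaled by gene length in kilobases and by library size in millions of mapped reads), not the NextGENe implementation; the function and argument names are hypothetical.

```python
def fpkm(read_count, gene_length_bp, total_mapped_reads):
    """Fragments per kilobase of gene length per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    kilobases = gene_length_bp / 1e3
    millions = total_mapped_reads / 1e6
    return read_count / kilobases / millions


# Hypothetical gene: 500 uniquely mapped reads, 2 kb long,
# in a library with 10 million uniquely mapped reads.
print(fpkm(500, 2000, 10_000_000))
```

Because only uniquely mapped reads were counted, the library size passed to such a function would be the number of uniquely mapped reads rather than the raw read total.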

1.4 Whole genome methylation sequencing and analysis

MethylC-seq libraries were prepared according to a published protocol [14] and were sequenced on an Illumina HiSeq 2500 platform with 150bp reads at the University of Georgia. Cutadapt v1.9 [15] was used to trim adapter sequences, and Bowtie 1.1.0 [16] was then used to align reads to the reference genomes as previously described in [17]. Only uniquely aligned reads were retained. The mitochondrial sequence of M. guttatus (which is fully unmethylated) was used as a control to calculate the sodium bisulfite reaction non-conversion rate of unmodified cytosines. For metaplots, both upstream and downstream regions were divided into 20 bins, each 50bp in length, for a total of 1 Kb in each direction. Gene and TE bodies were divided into 20 bins, each spanning 5 percent of the body length. Weighted methylation levels were computed for each bin as described previously [18].
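The binning and weighting described above can be sketched as follows. The weighted methylation level of a bin is taken here as the number of methylated read calls divided by the total number of read calls over all cytosines in the bin, following the definition in [18]. The data layout (tuples of position, methylated calls, total calls) and the function names are assumptions made for illustration, and strand orientation is ignored.

```python
def body_bin(pos, start, end, n_bins=20):
    """Index of the equal-width bin containing pos within [start, end)."""
    frac = (pos - start) / (end - start)
    return min(int(frac * n_bins), n_bins - 1)


def weighted_methylation(sites, start, end, n_bins=20):
    """Weighted methylation level per bin of a gene or TE body.

    sites: iterable of (position, methylated_calls, total_calls).
    Returns a list of n_bins values; bins without coverage are None.
    """
    meth = [0] * n_bins
    total = [0] * n_bins
    for pos, m, t in sites:
        if start <= pos < end:
            b = body_bin(pos, start, end, n_bins)
            meth[b] += m
            total[b] += t
    return [m / t if t else None for m, t in zip(meth, total)]


def non_conversion_rate(methylated_calls, total_calls):
    """Fraction of calls read as methylated on an unmethylated control."""
    return methylated_calls / total_calls


# Hypothetical 2 kb gene body starting at position 100 (bins of 100 bp).
levels = weighted_methylation(
    [(100, 2, 10), (101, 8, 10), (1950, 0, 5)], start=100, end=2100
)
print(levels[0], levels[18])
```

The same function applies to the 1 Kb flanking regions by passing the flank coordinates, which yields the 20 bins of 50bp described in the text.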

1.5 LRT to detect homeolog expression bias

We defined homeolog expression bias (B) as

\[ B = \frac{1}{N} \sum_{j=1}^{N} \log_2 \left( \frac{\mathrm{RPKM}_A^j}{\mathrm{RPKM}_B^j} \right) \]

where A and B are the two subgenomes being considered (e.g. M. luteus and M. guttatus). In this formula $\mathrm{RPKM}_A^j$ and $\mathrm{RPKM}_B^j$ are the observed expression levels of the homeologs normalized by gene length and total number of mapped reads, and $j$ is an index over the $N$ tissues. This measure of homeolog expression bias can be computed for any homeolog pair with non-zero read counts (testable homeolog pairs).
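A minimal sketch of this definition is given below: B is the mean over tissues of the log2 ratio of the two homeologs' RPKM values. The function name and the input layout (one RPKM value per tissue for each homeolog) are assumptions for illustration.

```python
import math


def homeolog_expression_bias(rpkm_a, rpkm_b):
    """Mean over tissues of log2(RPKM_A / RPKM_B) for one homeolog pair."""
    if len(rpkm_a) != len(rpkm_b) or not rpkm_a:
        raise ValueError("need one RPKM value per tissue for each homeolog")
    if any(v <= 0 for v in list(rpkm_a) + list(rpkm_b)):
        raise ValueError("pair is not testable: expression must be non-zero")
    ratios = [math.log2(a / b) for a, b in zip(rpkm_a, rpkm_b)]
    return sum(ratios) / len(ratios)


# Homeolog A expressed four-fold higher than B in all three tissues.
print(homeolog_expression_bias([4.0, 4.0, 4.0], [1.0, 1.0, 1.0]))
```

Positive values indicate higher expression of the A homeolog and negative values higher expression of the B homeolog; opposite biases in different tissues can cancel in the mean.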

For the analysis of our RNA-seq data, we developed a likelihood ratio test involving three nested hypotheses to identify cases of homeolog expression bias that do not involve tissue specific expression differences. The null hypothesis ($H_0$) is that both homeologs are expressed at equal levels (ratio of homeolog-1 to homeolog-2 equals 1 for all three tissues). The first alternative hypothesis ($H_1$) is that homeologs are expressed at different levels, but similar ratios, across all three tissue types. The second alternative hypothesis ($H_2$) is that homeologs are expressed at different levels and at different ratios across all three tissues.

A brief description of our likelihood ratio test for homeolog expression bias that does not involve tissue specific expression differences follows. Denote the true but unknown expression levels (per kilobase of coding DNA sequence) of homeologs A and B as $\Lambda_j^A$ and $\Lambda_j^B$. The expected numbers of reads in tissue $j$ ($\lambda_j^A$ and $\lambda_j^B$) are

\[ \lambda_j^A = \Lambda_j^A k^A \tag{1} \]

\[ \lambda_j^B = \Lambda_j^B k^B \tag{2} \]

where $k^A$ and $k^B$ are the transcript lengths of homeologs A and B, respectively. Define $\alpha_j = \Lambda_j^B / \Lambda_j^A$ to be the 'expression scaling factor' of A compared to B in tissue $j$, and $K = k^B / k^A$ to be the ratio of the homeologs' transcript lengths. In that case, the expected numbers of reads, $\lambda_j^A$ and $\lambda_j^B$, are related by

\[ \lambda_j^B = \alpha_j K \lambda_j^A \tag{3} \]

where $K$ is a known quantity.

Because we are working with single replicates (one observation for each homeolog pair in each tissue), we model the observed read counts ($X_j^A$ and $X_j^B$) as Poisson distributed random variables with parameters $\lambda_j^A$ and $\lambda_j^B$, respectively. The parameter sets associated with the three nested hypotheses for our likelihood ratio test are as follows:

\[ \Theta_0 : \{\lambda_1, \ldots, \lambda_N, \alpha_1, \ldots, \alpha_N : \lambda_j > 0 \text{ and } \alpha_j = 1 \text{ where } j = 1, \ldots, N\} \]

\[ \Theta_1 : \{\lambda_1, \ldots, \lambda_N, \alpha_1, \ldots, \alpha_N : \lambda_j > 0 \text{ and } \alpha_j = \alpha > 0 \text{ where } j = 1, \ldots, N\} \]

\[ \Theta_2 : \{\lambda_1, \ldots, \lambda_N, \alpha_1, \ldots, \alpha_N : \lambda_j > 0 \text{ and } \alpha_j > 0 \text{ where } j = 1, \ldots, N\} \]

where $\Theta_0 \subset \Theta_1 \subset \Theta_2$. The parameter set $\Theta_0$ corresponds to the null hypothesis ($H_0$) that, after accounting for differences in transcript length, there are no tissue- or subgenome-specific differences in expression levels. The parameter set $\Theta_1$ corresponds to the first alternative hypothesis ($H_1$) that there are subgenome- but not tissue-specific differences in expression levels. The parameter set $\Theta_2$ corresponds to the second alternative hypothesis ($H_2$) that there are tissue-specific differences in expression levels.

Assuming the probability models described above, we analytically derived the likelihood functions for $H_0$, $H_1$ and $H_2$ (not shown). Using the observed counts for each homeolog pair ($X_j^A$ and $X_j^B$), maximum likelihood estimation of the free parameters was performed using MATLAB's built-in nonlinear system solver (fsolve) for $H_1$, while $H_0$ and $H_2$ admit analytic solutions. Subsequently, two likelihood ratio tests were performed (using significance levels of 0.01). The first test was to determine whether we could reject $H_0$ in favor of $H_1$. If so, a second test determined whether we could reject $H_1$ in favor of $H_2$. Rejecting $H_0$, but not rejecting $H_1$, is interpreted as statistically significant homeolog expression bias that does not involve tissue specific expression differences.
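The sequential testing procedure can be sketched in Python as follows. This is an illustrative re-implementation under stated assumptions, not the MATLAB code used for the analysis. Two assumptions go beyond the text: (i) for a single transcript-length ratio K per homeolog pair, setting the derivatives of the Poisson log-likelihood to zero gives $\alpha K = \sum_j X_j^B / \sum_j X_j^A$ under $H_1$, so the sketch uses that closed form in place of a numerical solver; (ii) the test statistics are compared with chi-square distributions with 1 and N-1 degrees of freedom, the usual asymptotic reference for nested models. All function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    total = math.erfc(math.sqrt(x / 2))
    term = math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * i + 1)
    return total


def classify_pair(xa, xb, K, level=0.01):
    """Classify one homeolog pair from per-tissue read counts xa and xb.

    K is the transcript length ratio k^B / k^A.
    """
    n = len(xa)
    if n != len(xb) or n < 2:
        raise ValueError("need counts for the same two or more tissues")
    if min(xa) <= 0 or min(xb) <= 0:
        raise ValueError("pair is not testable: counts must be non-zero")

    def loglik(scales):
        # scales[j] = alpha_j * K; the x! terms cancel in likelihood ratios.
        total = 0.0
        for a, b, s in zip(xa, xb, scales):
            lam = (a + b) / (1 + s)  # MLE of lambda_j^A for a given scale
            total += a * math.log(lam) - lam
            total += b * math.log(s * lam) - s * lam
        return total

    ll0 = loglik([K] * n)                           # H0: alpha_j = 1
    ll1 = loglik([sum(xb) / sum(xa)] * n)           # H1: shared alpha
    ll2 = loglik([b / a for a, b in zip(xa, xb)])   # H2: free alpha_j

    p01 = chi2_sf(max(0.0, 2 * (ll1 - ll0)), 1)
    p12 = chi2_sf(max(0.0, 2 * (ll2 - ll1)), n - 1)
    if p01 >= level:
        return "no bias"
    if p12 >= level:
        return "bias, not tissue-specific"
    return "tissue-specific bias"


print(classify_pair([200, 210, 190], [100, 105, 95], K=1.0))
```

Because the second test is only run when the first rejects $H_0$, a pair whose biases point in opposite directions in different tissues and cancel in the pooled ratio is reported as "no bias" by this procedure.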

1.6 Methods for Transposon annotation

1.6.1 Autonomous cut-and-paste TEs

There are five superfamilies of cut-and-paste DNA TEs in angiosperm genomes [i.e., Tc1/mariner, PIF/Harbinger, Mutator-like elements (MULEs), hAT, and CACTA] [19]. They can be readily distinguished by the size of their target site duplications (TSDs) and the sequence similarity of the encoded transposase (TPase). The DDE/D domain alignment profile for each of the five superfamilies (obtained from [19]) was used as a query to search against the Mimulus guttatus genome assembly by TBLASTN [20], as implemented in the TARGeT pipeline [21]. TARGeT output the matched DNA sequences with 10 kb upstream and downstream of the matched region, and a guide tree depicting the phylogenetic relationships of these putative elements. The ends of each putative element were then determined by aligning two closely related elements with their 20 kb flanking sequences, using NCBI-BLAST 2 SEQUENCES on the NCBI server. Usually the breakpoint of a pair-wise alignment is the boundary of a full-length element, which can be subsequently refined by identifying the terminal inverted repeats (TIRs) and TSDs around the breakpoint. Full-length elements were then classified into individual families following the guidelines proposed by Wicker et al. (2007) [22].

Non-autonomous elements do not have coding capacity and thus cannot be identified effectively using sequence homology-based approaches. However, virtually all plant non-autonomous cut-and-paste TEs have TIRs (10-400 bp) on both ends, are flanked by TSDs (2-10 bp), and are less than 3 kb in length. These structural features have been incorporated into the annotation pipeline MITE-Hunter [23], which was primarily designed to search for Miniature Inverted repeat Transposable Elements (MITEs), but can also be used to search for other non-autonomous cut-and-paste TEs. MITE-Hunter was employed to search the M. guttatus genome assembly, and the output consensus sequence for each putative nonautonomous family was used as a query to retrieve all family members with 200-bp flanking sequences from the genome, using BLASTN implemented in TARGeT. Inspection of the multiple alignment profile of all family members led to verification or refinement of the element boundary.
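The structural criteria for non-autonomous cut-and-paste TEs (TIRs of 10-400 bp, TSDs of 2-10 bp, total length under 3 kb) can be expressed as a simple check on a candidate element and its flanks. The sketch below is not MITE-Hunter: it assumes exact matches for both the TIRs and the TSDs, whereas real elements usually tolerate mismatches, and all names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def tir_length(element, max_len=400):
    """Length of the longest exact terminal inverted repeat of an element."""
    element = element.upper()
    rc = revcomp(element)
    limit = min(max_len, len(element) // 2)
    n = 0
    while n < limit and element[n] == rc[n]:
        n += 1
    return n


def tsd_length(left_flank, right_flank, min_len=2, max_len=10):
    """Longest exact duplication directly flanking the element, else 0."""
    left, right = left_flank.upper(), right_flank.upper()
    for k in range(max_len, min_len - 1, -1):
        if len(left) >= k and len(right) >= k and left[-k:] == right[:k]:
            return k
    return 0


def looks_like_nonautonomous_tir_element(left_flank, element, right_flank):
    """Apply the three structural criteria quoted in the text."""
    return (
        len(element) < 3000
        and tir_length(element) >= 10
        and tsd_length(left_flank, right_flank) > 0
    )


# Hypothetical element: a 15-bp TIR around an unrelated internal region,
# inserted at a site duplicated as "TTA".
tir = "GGCCAGTCACAATGG"
element = tir + "AAAAACCCCCAAAAA" + revcomp(tir)
print(tir_length(element), tsd_length("CCGTTA", "TTAGGC"))
```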

1.6.2 Helitrons

The program HelSearch [24] was used to identify putative Helitron 3' ends, which are characterized by a CTRR terminus that is followed by a target site T residue and is preceded 5-10 bp upstream by an 18-bp hairpin structure. HelSearch assigns candidate 3' ends to groups based on the hairpin stem sequences and generates a multiple alignment of flanking sequences for each group. These alignments were manually inspected to define authentic 3' ends, and false positives were discarded. The last 100 nucleotides of each remaining alignment of authentic 3' ends were then used as a query to retrieve the 15-kb upstream sequences using TARGeT. The 5' ends were determined by inspection of the breakpoints of pair-wise alignments of these upstream sequences using NCBI-BLAST 2 SEQUENCES on the NCBI server. A typical Helitron 5' end starts with TC, following a target site A residue, and is conspicuously AT-rich in the first 18-bp region. Once the 5' ends were identified, 70 nucleotides from both the 5' and 3' ends for each group were joined together to form a pseudo-element, which was used as a query to retrieve all related Helitrons from the genome using TARGeT, allowing the sequence between the two ends to be up to 15kb. Sequence similarity of the 70-bp region at the 5' end was the basis for family assignment.
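The terminal signatures described above (a 5' TC after a target site A, and a 3' CTRR followed by a target site T, where R is A or G) can be checked with a few string tests. This sketch covers only the termini and target site; it does not look for the 18-bp hairpin or the AT-rich 5' region, and it is not the HelSearch algorithm. The function name is hypothetical.

```python
import re

# CTRR at the very end of the element, with R = A or G.
_THREE_PRIME_END = re.compile(r"CT[AG][AG]$")


def has_helitron_termini(left_flank, element, right_flank):
    """True if the element and its flanks show the Helitron end signatures."""
    element = element.upper()
    return (
        element.startswith("TC")
        and _THREE_PRIME_END.search(element) is not None
        and left_flank.upper().endswith("A")
        and right_flank.upper().startswith("T")
    )


# Hypothetical candidate inserted between an A and a T.
print(has_helitron_termini("GGA", "TCTATATATGCGCGCTAG", "TCC"))
```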

1.6.3 LTR retrotransposons

LTR retrotransposons were mined using LTR_STRUC [25], a structural data-searching program that identifies LTR retrotransposons based on the presence of the LTR pairs, the 4-6-bp TSDs flanking the LTRs, the primer binding site (PBS) and polypurine tract (PPT), as well as the canonical TG/CA dinucleotides at the 5' and 3' end of each LTR. A copia- and a gypsy-specific reference sequence from the most conserved region of the RT domain (corresponding to Pfam families PF07727 and PF00078, respectively) were used as BLAST queries against the full-length retrotransposon sequences found by LTR_STRUC. A phylogenetic tree was generated for all the matched copia-like and gypsy-like elements, respectively. Elements that have the RT domain were then classified into families based on the RT phylogeny. The remaining elements that do not have the RT domain (i.e. nonautonomous) were classified into families based on an all-to-all BLAST, following the 80-80 rule suggested by Wicker et al. (2007) [22].
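One way to turn all-to-all BLAST hits into families under an 80-80 rule is sketched below. It rests on two assumptions that the text does not specify: the rule is read as at least 80 percent identity over at least 80 percent of the shorter element, and elements are grouped by single linkage (any passing hit joins two elements into the same family). The function names and the hit layout are hypothetical.

```python
def passes_80_80(identity_pct, aligned_len, len_a, len_b):
    """At least 80% identity over at least 80% of the shorter element."""
    return identity_pct >= 80.0 and aligned_len >= 0.8 * min(len_a, len_b)


def group_families(lengths, hits):
    """Single-linkage grouping of elements connected by passing hits.

    lengths: {element_name: length_bp}
    hits: iterable of (name_a, name_b, identity_pct, aligned_len)
    """
    parent = {name: name for name in lengths}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, identity, aligned in hits:
        if passes_80_80(identity, aligned, lengths[a], lengths[b]):
            parent[find(a)] = find(b)

    families = {}
    for name in lengths:
        families.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in families.values())


# Hypothetical elements and pairwise hits.
lengths = {"e1": 1000, "e2": 1000, "e3": 1200, "e4": 900}
hits = [("e1", "e2", 92.0, 950), ("e2", "e3", 85.0, 820), ("e3", "e4", 95.0, 300)]
print(group_families(lengths, hits))
```

Single linkage can chain distantly related elements into one family through intermediates, so a stricter grouping may be preferable where family boundaries matter.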

1.6.4 LINEs

The protocol for LINE identification is very similar to that for autonomous cut-and-paste TEs. There are two major clades/superfamilies of LINEs in plants, L1-like and RTE-like [26]. The RT amino acid sequence alignments of representative elements of these two clades were retrieved from Kapitonov et al. (2009) [26], and were used as queries to search for all putative LINEs with their flanking sequences using TARGeT. Pair-wise alignments were then used to determine the element boundary. The 3' end is usually characterized by a poly(A) tail followed by a 7-21 bp TSD sequence. The TSD, in turn, can be used to demarcate the 5' end. The RT phylogeny was used to guide family assignments.

1.6.5 SINEs

The M. guttatus genome assembly was subjected to RepeatModeler open-1.0 [27], a de novo repeat detection package that searches for putative interspersed repeats and builds consensus models of identified repeat families. A TE exemplar set (see below) was built from all other elements identified by the homology-based and structure-based methods, as described above, and was used to mask the de novo repeat library output from RepeatModeler. The unmasked consensus sequences were then manually examined individually to identify SINEs and additional TEs that were missed by structure- and homology-based methods. Sequences that are short (<500 bp) and have two conserved motifs corresponding to the Pol III internal promoter A and B boxes within the first 100-bp region are candidate SINEs [28, 29]. Each candidate consensus sequence was used as a query to retrieve all related sequences through TARGeT. A 7-21 bp TSD flanking most of these sequences would verify bona fide SINEs.

1.7 TE exemplar

One to a few exemplar sequences were chosen to represent each family for building an exemplar library that can represent the entire TE content of the genome with a minimum number of sequences. For autonomous elements, these exemplar sequences were chosen based on two criteria: (1) sequence similarity between the exemplar and all other family members is at least 80%; (2) encoded ORFs should be the most intact. The first criterion ensures that the exemplar library can be used to mask almost all TEs in the genome prior to gene model predictions and other kinds of genome annotations. The second criterion makes the exemplars convenient for comparative analysis between different genomes, because elements satisfying the second criterion are often among the youngest and most abundant members of a family and the encoded protein products can be translated with minimal interruption. For nonautonomous elements, only the first criterion was applied. Helitrons are much more heterogeneous in sequence between family members than other TE types, so usually many more exemplars need to be included for each family to capture the diversity. The complete M. guttatus TE exemplar library is available at http://monkeyflower.uconn.edu/resources/.
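The two selection criteria can be illustrated with a greedy sketch: repeatedly pick the family member that is at least 80% similar to the largest number of not-yet-represented members, breaking ties in favor of the member with the most intact ORFs. This greedy scheme, the pairwise similarity table, and the ORF intactness score are all assumptions made for illustration; the text does not state how exemplars were selected in practice.

```python
def pick_exemplars(members, similarity, orf_score=None, threshold=80.0):
    """Choose exemplars so every member is >= threshold similar to one.

    members: list of element names in one family
    similarity: {(name_a, name_b): percent_similarity}, either key order
    orf_score: optional {name: intactness}, higher is more intact
    """
    orf_score = orf_score or {}

    def covers(a, b):
        if a == b:
            return True
        value = similarity.get((a, b), similarity.get((b, a), 0.0))
        return value >= threshold

    uncovered = set(members)
    exemplars = []
    while uncovered:
        # Most newly represented members first, then most intact ORFs.
        best = max(
            sorted(members),
            key=lambda m: (
                sum(covers(m, u) for u in uncovered),
                orf_score.get(m, 0.0),
            ),
        )
        exemplars.append(best)
        uncovered -= {u for u in uncovered if covers(best, u)}
    return exemplars


# Hypothetical family of four elements; "d" is divergent from the rest.
similarity = {("a", "b"): 90.0, ("a", "c"): 85.0, ("b", "c"): 82.0, ("c", "d"): 60.0}
print(pick_exemplars(["a", "b", "c", "d"], similarity, orf_score={"a": 1.0}))
```

For heterogeneous families such as Helitrons, a procedure of this kind naturally returns more exemplars, in line with the observation above that more sequences are needed to capture their diversity.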

References

[1] Gnerre, S. et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences 108, 1513–1518 (2011).

[2] Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. Busco: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics btv351 (2015).

[3] Cantarel, B. L. et al. Maker: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome research 18, 188–196 (2008).

[4] Korf, I. Gene finding in novel genomes. BMC bioinformatics 5, 1 (2004).

[5] Stanke, M. & Waack, S. Gene prediction with a hidden markov model and a new intron submodel. Bioinformatics 19, ii215–ii225 (2003).

[6] Salamov, A. A. & Solovyev, V. V. Ab initio gene finding in drosophila genomic dna. Genome research 10, 516–522 (2000).

[7] Consortium, U. et al. Uniprot: a hub for protein information. Nucleic acids research gku989 (2014).

[8] Berardini, T. Z. et al. The arabidopsis information resource: Making and mining the 'gold standard' annotated reference plant genome. genesis 53, 474–485 (2015).

[9] Campbell, M. S., Holt, C., Moore, B. & Yandell, M. Genome annotation and curation using maker and maker-p. Current Protocols in Bioinformatics 4–11 (2014).

[10] Lyons, E., Pedersen, B., Kane, J. & Freeling, M. The value of nonmodel genomes and an example using synmap within coge to dissect the hexaploidy that predates the rosids. Tropical Plant Biology 1, 181–190 (2008).

[11] Tang, H. et al. Screening synteny blocks in pairwise genome comparisons through integer programming. BMC bioinformatics 12, 1 (2011).

[12] Hellsten, U. et al. Fine-scale variation in meiotic recombination in mimulus inferred from population shotgun sequencing. Proceedings of the National Academy of Sciences 110, 19478–19482 (2013).

[13] Consortium, T. G. et al. The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485, 635–641 (2012).

[14] Urich, M. A., Nery, J. R., Lister, R., Schmitz, R. J. & Ecker, J. R. Methylc-seq library preparation for base-resolution whole-genome bisulfite sequencing. Nature protocols 10, 475–483 (2015).

[15] Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, pp–10 (2011).

[16] Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with bowtie 2. Nature methods 9, 357–359 (2012).

[17] Schmitz, R. J. et al. Patterns of population epigenomic diversity. Nature 495, 193–198 (2013).

[18] Schultz, M. D., Schmitz, R. J. & Ecker, J. R. 'leveling' the playing field for analyses of single-base resolution dna methylomes. Trends in genetics: TIG 28, 583 (2012).

[19] Yuan, Y.-W. & Wessler, S. R. The catalytic domain of all eukaryotic cut-and-paste transposase superfamilies. Proceedings of the National Academy of Sciences 108, 7884–7889 (2011).

[20] Altschul, S. F. et al. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research 25, 3389–3402 (1997).

[21] Han, Y., Burnette, J. M. & Wessler, S. R. Target: a web-based pipeline for retrieving and characterizing gene and transposable element families from genomic sequences. Nucleic acids research 37, e78–e78 (2009).

[22] Wicker, T. et al. A unified classification system for eukaryotic transposable elements. Nature Reviews Genetics 8, 973–982 (2007).

[23] Han, Y. & Wessler, S. R. Mite-hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic acids research gkq862 (2010).

[24] Yang, L. & Bennetzen, J. L. Structure-based discovery and description of plant and animal helitrons. Proceedings of the National Academy of Sciences 106, 12832–12837 (2009).

[25] McCarthy, E. M. & McDonald, J. F. Ltr struc: a novel search and identification program for ltr retrotransposons. Bioinformatics 19, 362–367 (2003).

[26] Kapitonov, V. V., Tempel, S. & Jurka, J. Simple and fast classification of non-ltr retrotransposons based on phylogeny of their rt domain protein sequences. Gene 448, 207–213 (2009).

[27] Smit, A. F., Hubley, R. & Green, P. Repeatmasker open-3.0 (1996).

[28] Kumar, A. & Bennetzen, J. L. Plant retrotransposons. Annual review of genetics 33, 479–532 (1999).

[29] Schmidt, T. Lines, sines and repetitive dna: non-ltr retrotransposons in plant genomes. Plant molecular biology 40, 903–910 (1999).

Figure 1: Concatenation tree. Maximum likelihood phylogeny of 96 concatenated single copy loci identified across sampled species. Bootstrap values are displayed on branches for 500 replicates. Scale indicates number of substitutions per site.

Tip labels: Mimulus peregrinus SYN, Mimulus peregrinus NAT, Mimulus luteus, Mimulus guttatus, Orobanche aegyptiaca, Striga hermonthica, Tectona grandis, Pogonstemon cablin, Paulownia tomentosa, Sesamum indicum, Acanthus leucostahcyus, Avicennia marina, Utricularia gibba, Lippia dulcis, Plantago ovata, Collinsia linearis, Primulina pteropoda, Fraxinus pennsylvanicus, Olea europaea, Solanum lycopersicum.

Figure 2: Astral tree. Coalescence-based phylogeny of 96 single copy loci estimated in ASTRAL. Node labels represent bootstrap values for 100 replicates.

Tip labels: Mimulus peregrinus SYN, Mimulus peregrinus NAT, Mimulus luteus, Mimulus guttatus, Striga hermonthica, Orobanche aegyptia, Tectona grandis, Pogonstemon cablin, Paulownia tomentosa, Lippia dulcis, Avicennia marina, Acanthus leucostahcyus, Sesamum indicum, Utricularia gibba, Plantago ovata, Primulina pteropoda, Olea europaea, Fraxinus pennsylvanicus, Solanum lycopersicum.

Figure 3: Results of PUG analysis using the coalescence tree. Putative paralog pairs were estimated in the PUG software based on gene tree topologies. Nodes with bootstrap values less than 80 were collapsed. Total numbers of unique gene duplications are mapped to each branch, suggesting five to six polyploid events. Nodes that have less than 10% of the highest number of unique duplications (3312) are not mapped. The basal node of the Lamiales suggests a polyploid event, but support for this event is potentially skewed by the gamma triplication shared between Lamiales and Solanum lycopersicum.

Color scale: unique gene duplications, 0 to 3312 (midpoint 1656).

Tip labels: Striga hermonthica, Orobanche aegyptiaca, Mimulus peregrinus SYN, Mimulus peregrinus NAT, Mimulus luteus, Mimulus guttatus, Tectona grandis, Pogonstemon cablin, Paulownia tomentosa, Lippia dulcis, Avicennia marina, Acanthus leucostahcyus, Sesamum indicum, Utricularia gibba, Collinsia linearis, Plantago ovata, Primulina pteropoda, Olea europaea, Fraxinus pennsylvanicus, Solanum lycopersicum.

Class     Super-family     No. of families   No. of intact elements    No. of all TEs (including fragments)   Coverage (Mb)   Genome fraction
-------------------------------------------------------------------------------------------------------------------------------------------
Class I   LTR/Copia        88 (74 / 15)      811 (595 / 216)           15,124                                 30.32           10.08%
Class I   LTR/Gypsy        86 (56 / 30)      1005 (332 / 673)          27,603                                 31.87           10.60%
Class I   LTR/Unknown      46 (0 / 46)       385 (0 / 385)             9,133                                  7.08            2.35%
Class I   LINE             105 (105 / 0)     386 (386 / 0)             5,218                                  4.16            1.38%
Class I   SINE             5 (0 / 5)         1,250 (0 / 1,250)         5,517                                  0.72            0.24%
Total class I              331               3,837                     62,595                                 74.16           24.67%
Class II  Tc1-Mariner      10 (4 / 6)        2,522 (10 / 2,512)        9,067                                  1.66            0.55%
Class II  PIF-Harbinger    85 (40 / 45)      4,662 (124 / 4,538)       25,568                                 6.43            2.14%
Class II  MULE             253 (76 / 177)    15,050 (482 / 14,568)     91,554                                 34.75           11.56%
Class II  hAT              137 (43 / 94)     4,907 (136 / 4,761)       45,381                                 10.79           3.59%
Class II  CACTA            35 (24 / 11)      458 (44 / 414)            12,842                                 3               1.00%
Class II  Helitron         12                1,570                     53,622                                 18.9            6.29%
Total class II             532               29,159                    238,034                                75.53           25.12%
Total                      863               32,996                    300,629                                149.69          49.79%

Figure 4: Summary of genome-wide transposon abundances

Figure 5: Relationship of TE density and gene expression for M. luteus subgenomic regions. Top: F1 hybrid, Middle: Natural allopolyploid, Bottom: Synthetic allopolyploid. In each panel the x-axis is transposon density (number of TE bases/window size) and the y-axis is log(RPKM).

Figure 6: Relationship of TE density and gene expression for M. guttatus subgenomic regions. Top: F1 hybrid, Middle: Natural allopolyploid, Bottom: Synthetic allopolyploid. In each panel the x-axis is transposon density (number of TE bases/window size) and the y-axis is log(RPKM).