Subgenome Dominance in an Interspecific Hybrid, Synthetic Allopolyploid, and a 140 Year Old Naturally Established Ne
Total Page:16
File Type:pdf, Size:1020Kb
Supplement: Subgenome dominance in an interspecific hybrid, synthetic allopolyploid, and a 140 year old naturally established neo-allopolyploid monkeyflower Patrick P. Edger1,*, Ronald Smith2, Michael R. McKain3, Arielle M. Cooley4, Mario Vallejo-Marin5, Yaowu Yuan6, Adam J. Bewick7, Lexiang Ji7, Adrian E. Platts8, Megan Bowman9, Kevin Childs9, Robert J. Schmitz7, Gregory D. Smith2, J. Chris Pires10 and Joshua R. Puzey11,*,a 1Department of Horticulture, Michigan State University, East Lansing, MI 2Department of Applied Science, College of William and Mary, Williamsburg, VA 3Donald Danforth Plant Science Center, St. Louis, MO 4Biology Department, Whitman College, Walla Walla, WA 5Biological and Environmental Sciences, University of Stirling, Stirling, UK 6Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT 7Department of Genetics, University of Georgia, Athens, GA 8McGill Centre for Bioinformatics, McGill University, Montreal, QC, Canada H3A 0E9 9Department of Plant Biology, Michigan State University, East Lansing, MI 10Division of Biological Sciences, University of Missouri, Columbia, MO 65211 11Department of Biology, College of William and Mary, Williamsburg, VA *These authors contributed equally aAuthor for correspondance: [email protected] Contents 1 Methods 3 1 1.1 Genome assembly and annotation . .3 1.2 Whole genome alignments and phylogenetic analyses . .3 1.3 RNA-sequencing and quantification of gene expression . .3 1.4 Whole genome methylation sequencing and analysis . .4 1.5 LRT to detect homeolog expression bias . .4 1.6 Methods for Transposon annotation . .6 1.6.1 Autonomous cut-and-paste TEs . .6 1.6.2 Helitrons . .7 1.6.3 LTR retrotransposons . .7 1.6.4 LINEs . .8 1.6.5 SINEs . .8 1.7 TE examplar . .8 List of Figures 1 Concatenation tree Maximium likelihood phylogeny of 96 concatenated single copy loci identified across sampled species. Bootstrap values are displayed on branches for 500 replicates. Scale indicates number of substitutions per site. 11 2 Astral tree Coalescence-based phylogeny of 96 single copy loci estimated in ASTRAL. Node labels represent bootstrap values for 100 replicates. 12 3 Results of PUG analysis using the coalescence tree. Putative paralog pairs were estimated in the PUG software based on gene tree topologies. Nodes with bootstrap values less than 80 were collapsed. Total numbers of unique gene duplications are mapped to each branch suggesting five to six polyploid events. Nodes that have less than 10% of the highest number of unique duplications (3312) are not mapped. The basal node of the Lamiales suggests a polyploid event, but support for this event is potentially skewed by the shared gamma triplication shared between Lamiales and Solanum lycopersicum. 13 4 Summary of genome-wide transposon abundances ................ 14 5 Relationship of TE density and gene expression for M. luteus subgenomic regions Top: F1 hybrid, Middle: Natural allopolyploid, Bottom: Synthetic allopolyploid 15 6 Relationship of TE density and gene expression for M. guttatus subgenomic regions Top: F1 hybrid, Middle: Natural allopolyploid, Bottom: Synthetic allopolyploid 16 2 1 Methods 1.1 Genome assembly and annotation Genome assembly of M. luteus was performed using the ALLPATHS-LG assembler v45395 [1] with approximately 33x Illumina paired end (2x100bp) reads and a TruSeq mate-pair (5Kb insert) library. Additional gap closing was undertaken using GapCloser (v. 1.12, BGI). The assembly totaled 409Mb in 6,349 scaffolds with a scaffold N50/L50 of 283/439Kb, a contig N50 of 52Kb and 96 percent of bases called (IUPAC ambiguous bases converted to N). This is 60 percent of the total genome size (680Mb) anticipated by the assembler from the Kmer depth distribution. Genome completeness in terms of gene content was assessed using BUSCO [2] with default setting and a set of universal single-copy orthologs. The vast majority of BUSCO groups, 931 of 956 (97.4 percent), were identified in the M. luteus genome assembly with 837 duplicates. TEs were annotated using a combination of sequence and structure homology as well as de novo approaches (see TE annotation methods). The MAKER genome annotation pipeline was used to annotate the M. luteus and M. guttatus genome assemblies using SNAP, AUGUSTUS and Fgenesh gene prediction programs [3, 4, 5, 6]. Species-specific transcript assemblies as well as TAIR Arabidopsis thaliana and plant-specific Swis- sProt protein sequences were aligned to the repeat-masked genome assemblies and used as evidence to aid the gene prediction programs [7] [8]. Gene predictions with transcript, protein or protein domain support were retained as final high-quality gene predictions [9]. Assembled genome sequences and an- notation files (GFFs), including gene models, are deposited on CoGe (https://genomevolution.org/coge ; Genome IDs 22656 and 22665) hosted by National Science Foundation - CyVerse (http://www.cyverse.org/). All raw Illumina data, in fastq format, will be made available on SRA following acceptance (SRA#). 1.2 Whole genome alignments and phylogenetic analyses Pairwise genomic alignments between the M. luteus and M. guttatus genomes were made using Syn- Map [10] and QUOTA-ALIGN [11], and then filtered to identify syntenic orthologs between the two genomes and retained homeologs in M. luteus. Data for various members of the order Lamiales were downloaded from NCBI-SRA and combined with newly generated transcriptome data for M. peregri- nus and genomic data from M. luteus, M. guttatus (v.1.2) [12], and Solanum lycopersicum (ITAG2.3) [13] for phylogenetic analysis of retained duplicates. 1.3 RNA-sequencing and quantification of gene expression RNA-seq libraries were constructed using the TruSeq RNA kit (Illumina; San Diego, CA) from total RNA isolated from calyx, stem, and petals and then sequenced with single-end 100bp reads on an 3 Illumina HiSeq-2000 at the University of Missouri Sequencing Core for three tissue types for both M. guttatus and M. luteus. Illumina reads were quality filtered using NextGENe v2.3.3.1 (SoftGenetics; State College, PA), removing adapter sequences, reads with a median quality score of less than 22, trimmed reads at positions that had three consecutive bases with a quality score of less than 20, and any trimmed reads with a total length less than 40bp. This resulted in 87.9 percent of the reads passing the quality-score filter. Expression levels, in FPKM (fragments per kilobase per million reads), were determined for all genes in the M. guttatus and M. luteus genomes. Quality-filtered reads for each library were aligned to the respective genomes using NextGENe v2.3.3.1 and counting only uniquely mapped reads, using parameters (A. matching Requirement: >40 Bases and >99 percent, B. Allow Ambiguous Mapping: FALSE, and C. Rigorous Alignment: TRUE) resulting in the alignment of over 189.6 million reads to the M. guttatus and M. luteus genomes. 1.4 Whole genome methylation sequencing and analysis MethylC-seq libraries were prepared according to the following protocol [14], and were sequenced on an Illumina HiSeq 2500 platform with 150bp reads at the University of Georgia. Cutadapt v1.9 [15] was used to trim adapter sequences, and then Bowtie 1.1.0 [16] aligned reads to reference genomes as previously described in [17]. Only uniquely aligned reads were retained. Mitochondria sequence of M. guttatus (which is fully unmethylated) was used as control to calculate the sodium bisulfite reaction non-conversion rate of unmodified cytosines. For metaplots, both upstream and downstream regions were divided into 20 bins each of 50bp in length for a total 1 Kb in each direction. Gene and TE bodies were separated every 5 percent, for a total of 20 bins. Weighted methylation levels were computed for each bin as described previously [18]. 1.5 LRT to detect homeolog expression bias We defined homeolog expression bias (B) as 0 1 N j 1 X RP KM B = log A N @ 2 j A j=1 RP KMB where A and B are the two subgenomes being considered (e.g. M. luteus and M. guttatus). In this j j formula RP KMA and RP KMB are the observed expression levels of the homeologs normalized by gene length and total number of mapped reads and j is an index over the N tissues. This measure of homeolog expression bias can be computed for any homeolog pair with non-zero read counts (testable homeolog pairs). For the analysis of our RNA-seq data, we developed a likelihood ratio test involving three nested 4 hypotheses to identify cases of homeolog expression bias that do not involve tissue specific expression differences. The null hypothesis (H0) is that both homeologs are expressed at equal levels (ratio of homeolog-1 to homeolog-2 equals 1 for all three tissues). The first alternative hypothesis (H1) is that homeologs are expressed at different levels, but similar ratios, across all three tissue types. The second alternative hypothesis (H2) is that homeologs are expressed at different levels and at different ratios across all three tissues. A brief description of our likelihood ratio test for homeolog expression bias that does not involve tissue specific expression differences follows. Denote the true but unknown expression levels (per kilobase of A B coding DNA sequence) of homeologs A and B as Λj and Λj . The expected number of reads in tissue A B j (λj and λj ) are A A A λj = Λj k (1) B B B λj = Λj k (2) A B B A where k and k are the transcript lengths of homeologs A and B, respectively. Define αj = Λj =Λj to be the `expression scaling factor' of A compared to B in tissue j, and K = kB=kA to be the ratio of A B the homeologs' transcript lengths. In that case, the expected number of reads, λj and λj , are related by B A λj = αjKλj (3) where K = kB=kA is a known quantity. Because we are working with single replicates (one observation for each homeolog pair in each tissue), A B we model the observed read counts (Xj and Xj ) as Poisson distributed random variables with A B parameters λj and λj respectively.