MORALES-BRIONES ET AL.

SUPPLEMENTARY METHODS

Library preparation

Library preparation was carried out using either poly-A enrichment or ribosomal RNA depletion. For Tidestromia oblongifolia, tissue collection, RNA isolation, and library preparation were carried out using the KAPA Stranded mRNA-Seq Kits (KAPA Biosystems, Wilmington, Massachusetts, USA). The library was multiplexed with 10 other samples from a different project and sequenced on an Illumina HiSeq2500 platform with V4 chemistry at the University of Michigan Sequencing Core. For the remaining 16 samples, total RNA was isolated from c. 70-125 mg of leaf tissue collected in liquid nitrogen using the RNeasy Plant Mini Kit (Qiagen) following the manufacturer’s protocol (June 2012). A DNase digestion step was included with the RNase-Free DNase Set (Qiagen). Quality and quantity of RNA were checked using the 2100 Bioanalyzer (Agilent Technologies). Library preparation was carried out using the TruSeq® Stranded Total RNA Library Prep Plant with RiboZero probes (96 Samples; Illumina, #20020611; Schuierer et al. 2017). Indexed libraries were normalized, pooled, and size-selected to 320 bp ± 5% using the Pippin Prep HT instrument to generate libraries with mean inserts of 200 bp, and sequenced on the Illumina HiSeq2500 platform with V4 chemistry at the University of Minnesota Genomics Center. Reads from all 17 libraries were paired-end 125 bp.

Transcriptome data processing and assembly

Scripts and instructions for read processing, assembly, translation, and homology and orthology search can be found at https://bitbucket.org/yanglab/phylogenomic_dataset_construction/ as part of an updated ‘phylogenomic dataset construction’ pipeline (Yang and Smith 2014). We processed raw reads for all 88 transcriptome datasets (except Bienertia sinuspersici) used in this study (Table S1). Sequencing errors in raw reads were corrected with Rcorrector (Song and Florea 2015), and reads flagged as uncorrectable were removed.
Sequencing adapters and low-quality bases were removed with Trimmomatic v0.36 (ILLUMINACLIP:TruSeq_ADAPTER:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25; Bolger et al. 2014). Additionally, chloroplast and mitochondrial reads were filtered with Bowtie2 v 2.3.2 (Langmead and Salzberg 2012) using publicly available Caryophyllales organelle genomes from the Organelle Genome Resources database (RefSeq; [Pruitt et al. 2007]; last accessed on October 17, 2018) as references. Read quality was assessed with FastQC v 0.11.7 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Overrepresented sequences detected with FastQC were discarded. De novo assembly was carried out with Trinity v 2.5.1 (Grabherr et al. 2011) with default settings, but without in silico normalization. Assembly quality was assessed with Transrate v 1.0.3 (Smith-Unna et al. 2016). Low-quality and poorly supported transcripts were removed using individual cut-off values for three contig score components of Transrate: 1) proportion of nucleotides in a contig that agree in identity with the aligned read, s(Cnuc) ≤ 0.25; 2) proportion of nucleotides in a contig that have one or more mapped reads, s(Ccov) ≤ 0.25; and 3) proportion of reads that map to the contig in correct orientation, s(Cord) ≤ 0.5. Furthermore, chimeric transcripts (trans-self and trans-multi-gene) were removed following the approach described in Yang and Smith (2013) using Beta vulgaris as the reference proteome, and percentage similarity and length cutoffs of 30 and 100, respectively. In order to remove isoforms and assembly artifacts, filtered reads were remapped to filtered transcripts with Salmon v 0.9.1 (Patro et al. 2017), and putative genes were clustered with Corset v 1.07 (Davidson and Oshlack 2014) using default settings, except that we used a minimum of five reads as the threshold to remove transcripts with low coverage (-m 5).
Only the longest transcript of each putative gene inferred by Corset was retained (Chen et al. 2019). Filtered transcripts were translated with TransDecoder v 5.0.2 (Haas et al. 2013) with default settings and the proteomes of Beta vulgaris and Arabidopsis thaliana to identify open reading frames. Finally, coding sequences (CDS) from translated amino acids were further reduced with CD-HIT v 4.7 (-c 0.99; [Fu et al. 2012]) to remove near-identical sequences.

Homology and orthology inference of nuclear genes

Initial homology inference was carried out following Yang and Smith (2014) with some modifications. First, an all-by-all BLASTN search was performed on CDS using an E value cutoff of 10 and max_target_seqs set to 100. Raw BLAST output was filtered with a hit fraction of 0.4. Then, putative homolog groups were clustered using MCL v 14-137 (van Dongen 2000) with a minimum minus log-transformed E value cutoff of 5 and an inflation value of 1.4. Finally, only clusters with a minimum of 25 taxa were retained. Individual clusters were aligned using MAFFT v 7.307 (Katoh and Standley 2013) with settings ‘--genafpair --maxiterate 1000’. Aligned columns with more than 90% missing data were removed using Phyx (Brown et al. 2017). Homolog trees were built using RAxML v 8.2.11 (Stamatakis 2014) with a GTRCAT model, and clade support was assessed with 200 rapid bootstrap (BS) replicates. Spurious or outlier long tips were detected and removed with TreeShrink v 1.0.0 by maximally reducing the tree diameter (Mai and Mirarab 2018). Monophyletic and paraphyletic tips that belonged to the same taxon were removed, keeping the tip with the highest number of characters in the trimmed alignment. After visual inspection of ca. 50 homolog trees, we determined that internal branches longer than 0.25 likely represented deep paralogs. These branches were cut apart, keeping resulting subclades with a minimum of 25 taxa.
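The deep-paralog cutting step described above can be illustrated with a minimal sketch. This is not the script used in the pipeline: the `Node` class, the function name, and the assumption that each tip label is a taxon name are hypothetical, and the sketch does not collapse single-child nodes left behind after a cut.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical minimal tree node (not the pipeline's data structure)."""
    name: Optional[str] = None   # tip label; None for internal nodes
    length: float = 0.0          # length of the branch leading to this node
    children: List["Node"] = field(default_factory=list)

    def tips(self) -> List[str]:
        """Return the labels of all tips below (or at) this node."""
        if not self.children:
            return [self.name] if self.name is not None else []
        return [t for child in self.children for t in child.tips()]


def cut_long_internal_branches(root: Node, cutoff: float = 0.25,
                               min_taxa: int = 25) -> List[Node]:
    """Split a tree at internal branches longer than `cutoff` and keep
    only the resulting pieces with at least `min_taxa` distinct tips."""
    pieces: List[Node] = []

    def prune(node: Node) -> None:
        kept = []
        for child in node.children:
            prune(child)  # cut deeper branches first
            if child.children and child.length > cutoff:
                child.length = 0.0    # the detached clade becomes its own tree
                pieces.append(child)
            else:
                kept.append(child)
        node.children = kept

    prune(root)
    pieces.append(root)  # whatever remains attached to the original root
    return [p for p in pieces if len(set(p.tips())) >= min_taxa]


# Toy example with a lowered taxon threshold so the result is visible.
toy = Node(children=[
    Node("a", 0.10),
    Node(length=0.40, children=[Node("b", 0.05), Node("c", 0.05)]),
    Node(length=0.10, children=[Node("d", 0.05), Node("e", 0.05)]),
])
subtrees = cut_long_internal_branches(toy, cutoff=0.25, min_taxa=2)
```

In the toy tree, only the clade containing `b` and `c` sits on an internal branch longer than the cutoff, so it is detached and the remaining tips stay together.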
Homolog tree inference, tip and outlier removal, and deep paralog cutting were carried out for a second time using the same settings to obtain final homologs. Orthology inference was carried out following the ‘monophyletic outgroup’ approach from Yang and Smith (2014), keeping only ortholog groups with at least 25 ingroup taxa. The ‘monophyletic outgroup’ approach filters for clusters in which the outgroup taxa are monophyletic and single-copy, and therefore filters for single- and low-copy genes. It then roots the gene tree by the outgroups, traverses the rooted tree from root to tip, and removes the side with fewer taxa when a gene duplication is detected.

Synteny analyses

To visualize any genomic patterns of the phylogenetic history of Beta vulgaris regarding its relationship with Amaranthaceae s.s. and Chenopodiaceae, we first identified syntenic regions between the genomes of Beta vulgaris and the outgroup Mesembryanthemum crystallinum using the SynNet pipeline (https://github.com/zhaotao1987/SynNet-Pipeline; Zhao and Schranz 2019). We used DIAMOND v.0.9.24.125 (Buchfink et al. 2015) to perform all-by-all inter- and intra-pairwise protein searches with default parameters, and MCScanX (Wang et al. 2012) for pairwise synteny block detection with default parameters, except the match score (-k), which was set to five. Then, we plotted the nine chromosomes of Beta vulgaris by assigning each of the 8,258 orthologs of the quartet composed of Mesembryanthemum crystallinum (outgroup), Amaranthus hypochondriacus, Beta vulgaris, and Chenopodium quinoa (BC1A) to synteny blocks and to one of the three possible quartet topologies based on best likelihood score.
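The final assignment step, choosing one of the three quartet topologies per ortholog by best likelihood score, can be sketched as follows. The topology labels, chromosome names, and log-likelihood values are hypothetical placeholders, and the sketch assumes the per-topology scores have already been computed elsewhere.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical labels for the three unrooted quartet topologies, with
# B = Beta, C = Chenopodium, A = Amaranthus (the outgroup is implied).
TOPOLOGIES = ("((B,C),A)", "((B,A),C)", "((C,A),B)")


def best_topology(log_likelihoods: Dict[str, float]) -> str:
    """Return the topology with the highest log-likelihood.
    Ties resolve to the first topology in dictionary order."""
    return max(log_likelihoods, key=log_likelihoods.get)


def tally_by_chromosome(
    orthologs: Iterable[Tuple[str, Dict[str, float]]]
) -> Dict[str, Counter]:
    """Count, for each chromosome, how many orthologs favor each topology."""
    counts: Dict[str, Counter] = {}
    for chromosome, scores in orthologs:
        counts.setdefault(chromosome, Counter())[best_topology(scores)] += 1
    return counts


# Made-up scores for three orthologs on two chromosomes.
example = [
    ("chr1", dict(zip(TOPOLOGIES, (-1000.2, -1003.9, -1004.1)))),
    ("chr1", dict(zip(TOPOLOGIES, (-2211.0, -2205.5, -2210.7)))),
    ("chr2", dict(zip(TOPOLOGIES, (-980.4, -981.0, -975.3)))),
]
tally = tally_by_chromosome(example)
```

A per-chromosome tally of this kind is one simple way to summarize the assignments before plotting them along the chromosomes.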

Assessment of substitutional saturation, codon usage bias, compositional heterogeneity, and model of sequence evolution misspecification

We refiltered the final 105-taxon ortholog alignments to again include genes that have the same 11 taxa (referred to herein as the 11-taxon(tree) dataset) used for the species network analyses. We realigned individual genes using MACSE v.2.03 (Ranwez et al. 2018) to account for codon structure and frameshifts. Codons with frameshifts were replaced with gaps, and ambiguous alignment sites were removed using GBLOCKS v0.9b (Castresana 2000) while accounting for codon alignment (-t=c -b1=6 -b2=6 -b3=2 -b4=2 -b5=h). After realignment and removal of ambiguous sites, we kept genes with a minimum of 300 aligned base pairs. To detect potential saturation, we plotted the uncorrected genetic distances against the inferred distances as described in Philippe and Forterre (1999). The level of saturation was determined by the slope of the linear regression between the two distances, where a shallow slope (i.e., < 1) indicates saturation. We estimated the level of saturation by concatenating all genes and separating the first and second codon positions from the third codon positions. We calculated uncorrected distances and distances inferred with the TN93 substitution model using APE v5.3 (Paradis and Schliep 2019) in R. To determine the effect of saturation on the phylogenetic inferences, we estimated individual gene trees using three partition schemes. We inferred ML trees from (1) an unpartitioned alignment, (2) an alignment partitioned into first plus second codon positions versus third codon positions, and (3) an alignment with all third codon positions removed. All tree searches were carried out in RAxML with a GTRGAMMA model and 200 bootstrap replicates. A species tree for each of the three data schemes was estimated with ASTRAL-III v5.6.3 (Zhang et al. 2018), and gene tree discordance was examined with PhyParts (Smith et al. 2015).
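The saturation check described above, regressing uncorrected distances on model-inferred distances and reading the slope, can be sketched as below. Note the substitution: the study used TN93 distances computed with APE in R, whereas this sketch uses the simpler Jukes-Cantor (JC69) correction as a stand-in so that it stays self-contained. The function names and the toy sequences are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence


def p_distance(a: str, b: str) -> float:
    """Uncorrected distance: proportion of differing sites,
    ignoring sites where either sequence is not A, C, G, or T."""
    pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    return sum(x != y for x, y in pairs) / len(pairs)


def jc69(p: float) -> float:
    """Jukes-Cantor corrected distance (stand-in for TN93).
    Only defined for p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Slope of the ordinary least-squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def saturation_slope(seqs: List[str]) -> float:
    """Regress uncorrected distances (y) on inferred distances (x)
    over all sequence pairs; a slope below 1 suggests saturation."""
    uncorrected = [p_distance(a, b) for a, b in combinations(seqs, 2)]
    inferred = [jc69(p) for p in uncorrected]
    return ols_slope(inferred, uncorrected)


# Toy aligned sequences (made up for illustration).
example = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",
    "ACGAACGTACCTACGTTCGA",
]
slope = saturation_slope(example)
```

To mirror the comparison in the text, the same function would be applied separately to the concatenated first plus second codon positions and to the third codon positions.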
Codon usage bias was evaluated using a correspondence analysis of the Relative Synonymous Codon Usage (RSCU), which is defined as the number of times a particular codon is observed relative to the number of times that the codon would be observed in the absence of any codon usage bias (Sharp and Li 1986). RSCU for each codon in the 11-taxon concatenated alignment was estimated with CodonW v.1.4.4 (Peden 1999). Correspondence analysis was carried out using FactoMineR v1.4.1 (Lê et al. 2008) in R (R Core Team 2019).
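As an illustration of the RSCU definition above, the following sketch computes RSCU for a single in-frame coding sequence: the observed count of a codon is multiplied by the number of synonymous codons for its amino acid and divided by the total count across those synonymous codons. The study itself used CodonW; this sketch is independent of it, assumes the standard genetic code, and uses a hypothetical function name.

```python
from collections import Counter
from itertools import product
from typing import Dict

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(cds: str) -> Dict[str, float]:
    """Relative Synonymous Codon Usage for an in-frame coding sequence.

    Codons containing gaps or ambiguous bases are ignored, stop codons are
    skipped, and amino acids absent from the sequence are left out."""
    cds = cds.upper().replace("U", "T")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))

    # Group synonymous codons by the amino acid they encode.
    families: Dict[str, list] = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values: Dict[str, float] = {}
    for aa, codons in families.items():
        total = sum(counts[c] for c in codons)
        if aa == "*" or total == 0:
            continue
        for c in codons:
            # Observed count relative to the count expected under equal usage.
            values[c] = counts[c] * len(codons) / total
    return values


# Toy sequence: four alanine codons, three GCT and one GCC.
result = rscu("GCTGCTGCTGCC")
```

A value of 1 means a codon is used as often as expected under equal usage of synonymous codons; values above or below 1 indicate over- or under-representation.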