Supporting Information
McGaugh et al. 10.1073/pnas.1419659112 (www.pnas.org/cgi/content/short/1419659112)

SI Text

Summary of New Resources Available. For the 18 liver transcriptomes we generated, the raw reads can be found at the NCBI Sequence Read Archive (SRA062458 at www.ncbi.nlm.nih.gov/sra/?term=SRA062458 and SRP017466 at www.ncbi.nlm.nih.gov/sra/?term=SRP017466). Transcriptome assemblies, annotation summaries, and alignments for protein coevolution analyses are available through Dryad (dx.doi.org/10.5061/dryad.vn872). Individual identifiers for these data can be found under citation in Table S2. The following resources are available through Dryad (dx.doi.org/10.5061/dryad.vn872):

i) The transcriptome assembly for each of the 18 individuals sequenced. These assemblies contain the longest ORFs produced by Trinity, which were then clustered by UCLUST into centroids to reduce redundancy within a single species' transcriptome. A centroid may have collapsed multiple isoforms, truncated transcripts, and alleles from a gene, but it may also have collapsed very recent paralogs (1).
ii) Trinotate annotation databases for each individual. The IDs in the database correspond to the centroid IDs in the transcriptome assembly described above.
iii) Putative ortholog amino acid alignments and corresponding nucleotide alignments. We used OrthoMCL to cluster ORF centroids into putative orthologs from all of the species included in this study. Data are available as separate files for each ortholog (104,235 total orthologs with two or more species). Additionally, we included a spreadsheet showing the best BLAST hit of each putative ortholog cluster to the UniProt database.
iv) "Best" ortholog amino acid and nucleotide alignments. The 104,235 putative orthologs described above often contained more than two representative sequences per species. For the first 15,000 putative orthologs (those with the most species included in the alignments), we used UCLUST to find the best representative per species per ortholog by taking the sequence that was closest to the centroid for that ortholog.
v) The final nucleotide and amino acid alignments for the 1,417 "control genes."
vi) The hand-curated nucleotide and amino acid alignments for 61 IIS/TOR network genes.

SI Materials and Methods

Sample Collection. Animals or tissues used in this study were provided by colleagues or our research colonies. Each individual was maintained at or shipped to Iowa State University (ISU). In agreement with ISU Institutional Animal Care and Use Committee protocol 3-2-5125J, animals were euthanized by decapitation, exsanguinated, and dissected, with relevant organs snap frozen. The exceptions were the cottonmouth and alligator (Agkistrodon piscivorus and Alligator mississippiensis), which were euthanized onsite in Texas and California, respectively, following our established protocol; snap-frozen tissues were sent to ISU. The animals used were of a variety of ages and both sexes; thus, findings reported here are robust to variation in transcripts that depend on age, sex, and rearing condition (Table S2).

Tissue and RNA Extraction and Sequencing. Total RNA was isolated from 12 to 19 mg of snap-frozen liver from each of 18 individuals from 17 species: a single individual for each of 16 species and two individuals of different ecotypes for Thamnophis elegans (Table S2 and Fig. S2). We followed standard protocols, including the Qiagen RNeasy kit (Qiagen cat. no. 74104) with a DNA digestion on the membrane, as described in the manual. The quality and quantity of RNA were determined on an Agilent Bioanalyzer using a Nano-RNA chip. For each sample, 1 μg total RNA was sent to the Duke Genome Sequencing and Analysis Core Resource for library preparation and to generate 100-bp paired-end reads using an Illumina HiSeq 2000 with TruSeq v3 chemistry with a standard insert size distribution. The library preparation protocol was based on the technical document TruSeq_RNA_SamplePrep_Guide_15008136_A. Individual libraries were uniquely barcoded (indexed), and quality was checked on the Bioanalyzer DNA100 chip. For 15 non–garter snake species, five indexed libraries were pooled in each lane, and ∼8 pM of library pool was deposited on each lane. Because garter snakes (Thamnophis spp.) are focal species in our laboratory, the two Thamnophis species (three samples) were sequenced more deeply. The Thamnophis couchii indexed library was pooled with separately indexed libraries from two individual Thamnophis elegans of different ecotypes (meadow and lakeshore in Table S2). This Thamnophis pool (one T. couchii and two T. elegans individuals) was sequenced twice, resulting in larger amounts of data available overall for these two species. None of the libraries were normalized. The raw reads for the 15 species excluding Thamnophis species can be found at the SRA under SRA062458. The raw reads for the three garter snake liver transcriptomes (i.e., one from T. couchii and two from T. elegans) can be found at the SRA under SRP017466 (samples HS08, HS11, and TC).

Processing and de Novo Assembly of Reads. For de novo assembly of each species' transcriptome, we used the Trinity version released on February 25, 2013 (2). Original reads were processed with the following steps, which were performed using the FASTX-Toolkit (hannonlab.cshl.edu/fastx_toolkit/), Cutadapt (3), and Trimmomatic (4).

i) Fastx_trimmer was used to remove the first base, as Illumina personnel indicate that this base can be unreliable (Gary Schroth).
ii) Cutadapt was used to trim adapters from the 3′ ends of reads with an allowed error rate of 0.01.
iii) Trimmomatic was used to remove reads with sliding windows of 6 bp that had average quality scores of 30 or less, and then reads less than 30 bp in length were removed.

From this point, reads that were orphaned (only the left or the right remained after processing) were removed from the left and right read files. These reads were placed at the end of the left read files, as specified in the Trinity manual. All default settings were kept for transcriptome assembly.

Transcriptome Quality Assessment and Annotation. We sequenced 33.73–140.95 million reads per species (mean: 50.23; median: 42.10). Reads were assembled into 87,016–221,818 contigs using Trinity (mean: 155,855; median: 165,685). Contigs shorter than 200 bp were excluded (5). Table S2 contains statistics about the Trinity assemblies.

To evaluate the quality of each transcriptome assembly, we aligned the assembled Trinity transcripts to the proteins of the UniProtKB/Swiss-Prot database (downloaded on March 21, 2013) using blastx with an E-value cutoff of 1e-20 and allowing only a single target sequence to be reported. Next, we determined the percent of the UniProtKB/Swiss-Prot protein that aligned to the best matching Trinity transcript through the perl script analyze_blastPlus_topHit_coverage.pl provided through Trinity.

Likely coding regions (ORFs) were extracted from Trinity transcripts using Transdecoder. Transdecoder identified between 25,945 and 113,672 best ORFs (mean: 65,766; median: 72,152). Transcriptome size of the best ORFs identified in Transdecoder ranged from 27.80 to 113.60 Mb (mean: 69.54 Mb; median: 78.65 Mb), indicating ∼57- to 269-fold coverage when considering the amount of filtered and trimmed data input into Trinity (range, 5.21–11.55 Gb; mean: 6.80 Gb; median: 6.43 Gb). These ORFs were clustered into centroids using USEARCH (6) separately for each transcriptome (see below for a more detailed description).

The coding sequence of the peptides produced by Transdecoder and the centroids were also analyzed with the analyze_blastPlus_topHit_coverage.pl script provided by Trinity to determine the percent length of coverage for the top hit in the UniProtKB/Swiss-Prot database. We conducted this analysis on the best ORF sequences and separately on the centroids to examine whether the Transdecoder or USEARCH processes resulted in ORFs that spanned a greater percent length of their best blast hit relative to the originally produced Trinity transcript contigs. Blastx analysis of the original Trinity transcripts to the UniProtKB/Swiss-Prot database resulted

(range, 10.43–26.51%). All Trinotate annotation databases are publicly available on Dryad: dx.doi.org/10.5061/dryad.vn872.

Identifying Candidate Orthologs and Generating Multiple Species Alignments. For any comparative evolutionary analysis, identification of putative orthologs and accurate alignment are essential but can be extremely challenging due to paralogs and alternative splicing. In addition, we found that in some cases, a particular species may have Trinity transcripts that blasted with high confidence to the particular gene of interest, but this species was unrepresented in our final multiple species alignments because Transdecoder did not include the transcript from that particular gene in its best ORF candidate file. To avoid this complication, we only used ORFs from the longest ORF file and not the best ORF predictions.

We reduced overlap between the ORFs for each individual species using USEARCH (6) with an identity threshold of 95% of the nucleotide sequences sorted by length (gaps are counted as differences in USEARCH). Because our goal was to cluster isoforms to have one representative sequence per gene, we reduced the gap penalties to the settings -gapopen 5I/1E -gapext 0.1I/0.1E. These clustered centroids were used for all subsequent analyses.