A Comparative Genomic Investigation of High-Elevation Adaptation in Ectothermic Snakes
Supplementary Information Appendix
www.pnas.org/cgi/doi/10.1073/pnas.1805348115

Table of contents
1. Materials and sequencing
1.1. De novo genome sequencing
1.2. Transcriptome sequencing
1.3. Genome resequencing
2. Assembly
2.1. De novo genome assembly of a female Tibetan hot spring snake
2.2. De novo transcriptome assembly of the Tibetan hot spring snake
2.3. De novo genome assembly of five re-sequenced snakes
2.4. Mitochondrial genome assembly
3. Repeat
3.1. Repeat annotation
3.2. Transposable element (TE) expansion history analysis
3.3. Discussion
4. Gene annotation
4.1. Gene structural annotation
4.2. Gene functional annotation
5. GC contents
5.1. Isochore structure in vertebrates
6. Genome evolution
6.1. Identification of gene families
6.2. Phylogenetic tree construction
6.3. Divergence time estimation
6.4. Expanded and contracted gene families
6.5. Positively selected genes (PSGs)
7. Whole genome alignments and sex chromosome evolution
8. High-altitude adaptation
8.1. Shared amino acid substitutions
8.2. Functional assay of FEN1
8.3. Functional assay of EPAS1
9. Supplementary Figures
10. Supplementary Tables
REFERENCES

1. Materials and sequencing

1.1. De novo genome sequencing
Blood samples acquired from a female Tibetan hot spring snake (Thermophis baileyi, sample name: 1-13) captured in Yangbajing, Xizang, China were used for de novo sequencing. Whole-genome shotgun sequencing was employed: short-insert paired-end libraries (280 bp and 450 bp) and long-insert mate-pair libraries (2 kb, 5 kb, and 10 kb) were constructed using a standard protocol provided by Illumina (San Diego, USA). Paired-end sequencing was performed using the Illumina HiSeq 2000 system (SI Appendix, Table S1).

1.2.
Transcriptome sequencing
Six tissues (liver, brain, heart, lung, muscle, and ovary) were collected from the same Tibetan hot spring snake individual (individual name: 1-13) used for genome sequencing. Total RNA was extracted from pooled tissues using a TRIzol® kit (Life Technologies, Carlsbad, USA). Poly(A) messenger RNAs (mRNAs) were isolated using oligo(dT) magnetic beads and fragmented into short segments, followed by cDNA synthesis using random hexamer primers and reverse transcriptase. After end repair, adapter ligation, and polymerase chain reaction (PCR) amplification, each paired-end cDNA library was sequenced with a read length of 150 bp on the Illumina HiSeq 2000 sequencing platform.

1.3. Genome resequencing
Samples used for genome resequencing were acquired from three Thermophis and two false cobra (Pseudoxenodon) individuals. A male Tibetan hot spring snake (T. baileyi) sample, a female Sichuan hot spring snake (T. zhaoermii) sample, and a female Shangri-La hot spring snake (T. shangrila) sample were obtained from Yangbajing, Xizang, China; Quhe village, Litang, Sichuan, China; and Tianshengqiao, Shangri-La, Yunnan, China, respectively. A female large-eyed false cobra (Pseudoxenodon macrops, http://www.iucnredlist.org/details/191926/0) and a male bamboo false cobra (Pseudoxenodon bambusicola, http://www.iucnredlist.org/details/192230/0) were collected from Honghe, Yunnan and Quanzhou, Fujian, China. For each of the five snakes, two paired-end libraries with an average insert size of 450 bp were constructed. Each library was prepared according to the appropriate Illumina protocols and sequenced with a read length of 150 bp using the HiSeq 2000 instrument (SI Appendix, Table S2).

2. Assembly

2.1.
De novo genome assembly of a female Tibetan hot spring snake
Short-insert (280 bp and 450 bp) paired-end reads were filtered by removing adaptor sequences, PCR duplicates, and low-quality reads using Trimmomatic (v3.20)(1), followed by error correction using SOAPec (v2.01)(2). Where the target DNA fragment size was less than twice the single-end read length (e.g., 150 bp Illumina reads from the 280 bp insert-size library), the two reads of a pair may overlap; for such read-through libraries, overlapping paired-end reads were merged into a single longer fragment using PEAR(3). Long mate-pair reads (2 kb and 5 kb) were trimmed using NextClip(4), and fragments with the junction adapter in at least one of the paired reads were used (SI Appendix, Table S3).

To estimate the genome size of the Tibetan hot spring snake, the KmerFreq_AR program in the SOAPec (v2.01) package was used to construct a k-mer frequency spectrum (k = 17) using data from library I310 (SI Appendix, Fig. S1 and Table S3). In our analysis, the total number of input reads was 301,627,168, the total number of bases was 41,624,549,184, the total k-mer number (K_num) was 36,798,245,614, and the expected depth (K_depth) was 20. Assuming that genome size (G) can be estimated as G = K_num/K_depth(5), the Tibetan hot spring snake genome size was estimated as 1.76 Gb.

Whole-genome shotgun assembly of the Tibetan hot spring snake was performed using the short oligonucleotide analysis package SOAPdenovo2(2). Qualified reads from short-insert libraries were used to construct a de Bruijn graph without using paired-end information. Contigs were constructed by merging bubbles and resolving small repeats. All of the qualified reads were then realigned to the contig sequences, and the paired-end relationships between reads were used to establish linkages between contigs.
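The k-mer arithmetic behind the genome size estimate in Section 2.1 can be checked with a short script. This is a minimal sketch rather than the KmerFreq_AR implementation: the function names are hypothetical, and the relation k-mers = bases - reads x (k - 1) assumes every read contributes all of its k-mers (no k-mers dropped for ambiguous bases or low quality). Only the input numbers come from the text.

```python
# Sketch of the k-mer based genome size arithmetic from Section 2.1.
# Function names are illustrative; they are not part of SOAPec/KmerFreq_AR.

def expected_kmer_count(total_bases: int, total_reads: int, k: int) -> int:
    """A read of length L yields L - k + 1 k-mers, so summing over all reads
    gives total_bases - total_reads * (k - 1). Assumes no k-mers are skipped."""
    return total_bases - total_reads * (k - 1)


def genome_size(k_num: int, k_depth: float) -> float:
    """G = K_num / K_depth, the estimator quoted in the text."""
    return k_num / k_depth


# Values reported in the text for library I310.
READS = 301_627_168
BASES = 41_624_549_184
K = 17
K_NUM = 36_798_245_614
K_DEPTH = 20
REPORTED_SIZE_BP = 1.76e9

approx_k_num = expected_kmer_count(BASES, READS, K)
size_gb = genome_size(K_NUM, K_DEPTH) / 1e9
implied_depth = K_NUM / REPORTED_SIZE_BP

print(f"k-mers expected from reads and bases: {approx_k_num:,}")
print(f"k-mers reported:                      {K_NUM:,}")
print(f"G with K_depth = {K_DEPTH}: {size_gb:.2f} Gb")
print(f"depth implied by a 1.76 Gb genome: {implied_depth:.2f}")
```

Under these assumptions the expected k-mer count agrees with the reported K_num to within about 0.001%. Note that the rounded depth of 20 gives roughly 1.84 Gb, whereas the reported 1.76 Gb corresponds to a depth near 20.9; the reported figure may reflect an unrounded peak depth or a correction not stated in this section.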
Step by step, we used the paired-end relationships, from the short-insert paired ends to the long-distance paired ends, to construct scaffolds. Gaps were then closed with GapCloser (version 1.10)(2), which uses paired-end information to retrieve read pairs in which one end maps to a unique contig and the other is located in the gap region (SI Appendix, Table S4).

To assess assembly quality, we plotted the GC-depth distribution for the assemblies (SI Appendix, Fig. S2). We also used the Core Eukaryotic Genes Mapping Approach (CEGMA)(6) to identify core genes in the Tibetan hot spring snake genome assembly; 235 of the 248 core eukaryotic genes (94%) were found in the assembly (SI Appendix, Table S5).

2.2. De novo transcriptome assembly of the Tibetan hot spring snake
Before de novo assembly, Illumina raw reads were filtered by the following steps. Read pairs with adapter contamination were removed, and each remaining read was cropped by 10 bp at the head and 5 bp at the tail. Read pairs with N contents greater than 5% and over 50% low-quality bases (Q3) were also removed. Finally, reads with potentially low-quality regions were trimmed using Trimmomatic (v3.20)(1). Bases with a quality score below 3 at the beginning or end of a read were trimmed off, and the 3' or 5' ends of reads were trimmed where the average quality score in a four-base sliding window fell below Q15. Any reads that became shorter than 25 bp were excluded from further assembly. After trimming, all of the cleaned reads were assembled using Trinity (version 2.0.3)(7) with default parameters (SI Appendix, Table S6).

2.3. De novo genome assembly of five re-sequenced snakes
Before de novo assembly, Illumina raw reads were filtered (SI Appendix, Table S7) by the following steps. Read pairs with adapter contamination were removed, and reads with potentially low-quality regions were trimmed using Trimmomatic (v3.20)(1).
Bases with a quality score below 3 at the beginning or end of a read were also trimmed off, and the 3' or 5' ends of reads were trimmed where the average quality score in a four-base sliding window fell below Q15. Any reads shorter than 32 bp were also excluded from further assembly. After trimming, all of the cleaned reads of each individual were assembled using SOAPdenovo2(2) with K = 31. The gap regions of each assembly were filled using GapCloser (version 1.10)(2) (SI Appendix, Table S8).

2.4. Mitochondrial genome assembly
The mitochondrial genome of each individual was reconstructed (SI Appendix, Table S9) directly from genomic next-generation sequencing reads using MITObim v1.8(8), which implements a baiting and iterative mapping approach. The complete mitochondrial genome of T. zhaoermii (National Center for Biotechnology Information (NCBI) Accession No.: GQ166168.1) was used as the initial reference, and 24 million pairs of clean reads (SI Appendix, Table S3 and Table S7) were used as input for each individual.

3. Repeat

3.1. Repeat annotation
A de novo repeat annotation of the Tibetan hot spring snake genome was conducted using RepeatModeler(9) and RepeatMasker(10). Tibetan hot spring snake-specific de novo repeat libraries were constructed by combining results from LTR_STRUC(11) and RepeatModeler(9), both run with default parameters. The consensus sequences in these libraries, together with their classification information, were then combined with the RepeatMasker(10) library, and RepeatMasker(10) was run on the assembled scaffolds with the combined library, followed by tandem repeat identification using TRF(12). The Ophiophagus hannah, Python bivittatus, Boa constrictor, Thamnophis sirtalis, Ophisaurus gracilis, Pogona vitticeps, Anolis carolinensis, and Gekko japonicus genomes were annotated with the same pipeline (SI Appendix, Fig. S3 and Tables S10-S12).

3.2.
Transposable element (TE) expansion history analysis
An in-house Perl script was written to parse the RepeatMasker(10) output (.out file) and extract the divergence between each TE copy identified in the genome and its consensus sequence in the library. The distributions of each TE subtype in each species were then plotted (SI Appendix, Fig.