Materials and Methods
www.pnas.org/cgi/doi/10.1073/pnas.1813597116

Reference genome

Sampling and DNA extraction
In 2012, we extracted DNA from the blood of a male vinous-throated parrotbill (S. w. bulomachus) sampled in Taiwan (24°06’N, 119°57’E) using a QIAGEN Genomic-tip 100/G kit, following the manufacturer’s instructions. Because birds have a ZW sex chromosome system, we chose a male, which carries two Z chromosomes (ZZ), to ensure equal coverage of autosomes and sex chromosomes during sequencing and de novo assembly.

Genomic library preparation, sequencing and reference de novo assembly
Libraries with insert sizes of 180 bp, 3 kb and 5 kb were constructed (with the Paired-End DNA Sample Prep Kit for the 180 bp library and the Mate Pair Library Prep Kit v2 for the 3 kb and 5 kb libraries) and sequenced on an Illumina HiSeq 2000 platform by BGI (Beijing Genomics Institute, Hong Kong) and Genomics BioSci & Tech (Taipei, Taiwan) (table S1). The genome was assembled with ALLPATHS-LG (1) version 44099 using the parameters PLOIDY=2 and PHRED_64=1. The ALLPATHS-LG correction steps CleanCorrectedReads and ErrorCorrectJump removed 0.5% of paired-end reads and corrected 71.1% of mate-pair reads on the basis of low-frequency k-mers (K = 25 for paired-end reads and K = 96 for mate-pair reads; for details of the de novo assembly, see table S1 and fig. S1).

Masking repeated sequences and removing contaminating DNA
A total of 72 Mb (6.8%) of the genome sequence was masked by RepeatMasker (table S2). To remove potential contaminating DNA from the draft genome, we followed the procedure of Ellegren et al. (2). To check for contaminants, we sampled one 1 kb fragment free of unknown bases (“N”) from every 100 kb of each genome scaffold. A total of 16,115 one-kb fragments were sampled and searched against the NCBI nt database with BLASTN (v. 2.2.30) using an e-value cut-off of $10^{-5}$.
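The fragment sampling described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name `sample_clean_fragments` is hypothetical, and because the text does not say which N-free 1 kb fragment within each 100 kb block was taken, the sketch assumes the first one found when scanning in 1 kb steps.

```python
def sample_clean_fragments(seq, block=100_000, frag=1_000):
    """Return (start, fragment) pairs, at most one N-free fragment per block.

    `start` is a 0-based offset into `seq`. The choice of the first N-free
    fragment in each block is an assumption made for illustration.
    """
    picks = []
    for block_start in range(0, len(seq), block):
        block_end = min(block_start + block, len(seq))
        # Scan the block in fragment-sized steps (assumed scan order).
        for start in range(block_start, block_end - frag + 1, frag):
            piece = seq[start:start + frag]
            if "N" not in piece.upper():
                picks.append((start, piece))
                break  # keep only one fragment per block
    return picks
```

Each returned fragment could then be written to a FASTA file and used as a BLASTN query. Blocks that contain no N-free fragment contribute nothing to the sample.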
Fragments whose top BLASTN hit was a species other than a bird or a lizard were considered contaminants, and contigs containing these fragments were removed from the scaffolds. In total, contaminating DNA was found in 26 scaffolds (15 Mb), with top hits to bacteria (N = 1), fishes (N = 5), arthropods (N = 2), amphibians (N = 6), and mammals (N = 12). Fifteen of these scaffolds were removed from our analysis entirely because each consisted of a single contaminating contig; in the remaining scaffolds, only the contaminating contigs were excluded.

Genome annotation

Gene structure and functional annotation
We used Trinity (3, 4) to re-analyze 13 Gb of RNA-seq reads from a male S. w. bulomachus (5). Using both de novo and genome-guided assembly (with the parrotbill genome assembly as the reference), we generated RNA assemblies of 122 Mb and 78 Mb, respectively. We then merged the two assemblies with PASA (6) and obtained an 89 Mb parrotbill RNA assembly, which served as the gene model for training Augustus (7) for gene prediction. Augustus predicted 20,149 genes in the parrotbill genome, spanning 278 Mb (26% of the total genome), of which 253 Mb were introns and 25 Mb were coding sequences (CDSs). We extracted these CDSs and aligned them to the zebra finch (Taeniopygia guttata) genome in the NCBI nr database with BLAT (standalone version v.34x12) using the following criteria: identity >90% and alignment length >40%; if a single CDS mapped to multiple genes, the gene with the longest alignment was considered the best match and retained; if multiple CDSs mapped to the same gene, CDSs with overlapping genomic locations, or separated by less than the length of the gene, were considered parts of the same gene; otherwise, they were considered independent repetitive genes.
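The first two matching rules above (the identity and alignment-length thresholds, and the longest-alignment tie-break) can be illustrated with a short sketch. The `Hit` record and the function `best_matches` are hypothetical names, not part of BLAT, and the sketch assumes that "alignment length >40%" means the aligned length as a fraction of the full CDS length, which the text does not state explicitly. The rule for merging several CDSs that hit the same gene is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    cds: str         # query CDS identifier
    gene: str        # zebra finch gene that was hit
    identity: float  # percent identity of the alignment
    aligned: int     # aligned length in bp
    cds_len: int     # full CDS length in bp


def best_matches(hits, min_identity=90.0, min_fraction=0.40):
    """Keep, for each CDS, the passing hit with the longest alignment."""
    best = {}
    for h in hits:
        if h.identity <= min_identity:
            continue  # fails the identity > 90% criterion
        if h.aligned / h.cds_len <= min_fraction:
            continue  # fails the assumed aligned-fraction > 40% criterion
        kept = best.get(h.cds)
        if kept is None or h.aligned > kept.aligned:
            best[h.cds] = h
    return best
```

In practice the fields would be parsed from BLAT output rather than built by hand; the dataclass only fixes the inputs the filter needs.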
To explore possible adaptation in the non-coding parts of the genome, we used the ALDB database (8) of chicken lncRNAs; the thresholds for lncRNA identification and alignment were both set to 50% sequence identity. In total, we identified 607 chicken lncRNA analogues in the parrotbill genome. After accounting for multiple transcripts of the same gene, we obtained 497 parrotbill lncRNAs.

Population re-sequencing

Whole genome re-sequencing
We extracted DNA from blood samples with traditional chloroform extraction and LiCl precipitation (modified from Gemmell and Akiyama (9)). A whole genome re-sequencing library with an insert size of 500 bp was constructed for each individual. Our sequencing data were generated on three different platforms: samples from western Taiwan (Taichung city and Nantou county) were sequenced on an Illumina HiSeq 2000 platform (BGI), samples from eastern Taiwan (Hualien county) were sequenced on an Illumina NextSeq 500 platform (Genomics BioSci & Tech), and samples from the mainland were sequenced on an Illumina HiSeq 2500 platform (Genomics BioSci & Tech).

Re-sequencing mapping
The BWA-MEM algorithm supports mapping of longer reads (up to 1 Mb in length) to the reference genome. Because six females were included in the sample from a high-altitude local population (Hualien County), we excluded one of the two putative Z-chromosome haplotypes of these individuals, based on the Satsuma results.

Variant calling
After mapping raw reads to the reference genome, we used Samtools (9) for variant calling. To improve the accuracy of variant calling, we filtered SNPs using four criteria: (1) the number of reads supporting the alternate base had to be higher than one third of the average coverage; (2) the read depth could not be lower than one third of the average coverage; (3) the read depth could not exceed twice the average coverage; and (4) INDEL sites were excluded.
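The four filtering criteria can be written as a single predicate. This is a sketch under stated assumptions rather than the commands used in the study: `keep_snp` and its argument names are hypothetical, and the handling of values exactly at a threshold (for example, a depth of exactly twice the average coverage is kept here) is an assumption, since the text does not specify it.

```python
def keep_snp(depth, alt_depth, mean_cov, is_indel=False):
    """Return True if a variant site passes the four filters.

    depth     -- total read depth at the site
    alt_depth -- number of reads supporting the alternate base
    mean_cov  -- average coverage of the individual
    """
    if is_indel:
        return False  # criterion 4: INDEL sites are excluded
    if alt_depth <= mean_cov / 3:
        return False  # criterion 1: alternate-base support too low
    if depth < mean_cov / 3:
        return False  # criterion 2: read depth too low
    if depth > 2 * mean_cov:
        return False  # criterion 3: read depth too high (boundary assumed inclusive)
    return True
```

With an average coverage of 30, for instance, a site needs more than 10 reads supporting the alternate base and a total depth between 10 and 60.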
We used VCFtools (10) to generate a consensus sequence for each individual, and then randomly assigned the alleles of heterozygous genotypes to two putative haplotypes.

Divergence demography of the vinous-throated parrotbill

Inferring effective population size
PSMC was run with the parameters -N25 -t15 -r5 -p "4+25*2+4+6"; the mutation rate and generation time were set to $4.6 \times 10^{-9}$ mutations per site per generation (11) and two years per generation, respectively. We also performed a bootstrap analysis with the option -b, using 100 rounds of bootstrapping (fig. S4).

Divergence demography inferred by G-PhoCS
We sampled one-kb sequences free of unknown bases (“N”) at 50 kb intervals across all autosomal sequences, for a total of 18,188 one-kb sequences. The model was run with default parameters and the standard priors described by Gronau et al. (11), except that we assumed no gene flow between populations. The Markov chain was first run for 100,000 iterations as burn-in; parameter values were then sampled every 100 iterations over the following 1,000,000 iterations, yielding 10,000 samples. Demographic parameters were calibrated with a substitution rate of $4.6 \times 10^{-9}$ per site per generation and an average generation time of 2 years.

Genome scan

Calculating summary statistics of genetic variation
Because recombination has more time to break down haplotypes around older variants, selection on standing genetic variation usually leaves a signature of recent positive selection across a much shorter genomic region than does selection on a novel mutation. Therefore, we used 10 kb sliding windows to allow the detection of candidate regions carried by both ancient/short and recent/long haplotypes. Because the position and orientation of the scaffolds on the chromosomes are uncertain, sliding windows were generated on a per-scaffold basis, with the first window starting at position one of each scaffold.
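The per-scaffold windowing can be sketched as below. The function `scaffold_windows` is a hypothetical helper, coordinates are 1-based and inclusive, and keeping a shorter final window at the end of each scaffold is an assumption; the text does not say how trailing partial windows were treated.

```python
def scaffold_windows(scaffold_lengths, size=10_000):
    """Return (scaffold, start, end) tuples of non-overlapping windows.

    scaffold_lengths -- mapping of scaffold name to length in bp
    Windows restart at position 1 on every scaffold, so no window
    spans two scaffolds.
    """
    windows = []
    for name, length in scaffold_lengths.items():
        for start in range(1, length + 1, size):
            end = min(start + size - 1, length)  # final window may be shorter (assumed)
            windows.append((name, start, end))
    return windows
```

Restarting the windows on each scaffold means that no statistic is computed across a scaffold boundary, where the true chromosomal adjacency is unknown.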
The draft genome was divided into 103,658 non-overlapping 10 kb windows.

Finding candidate regions
We calculated $zF_{ST}$ for autosomes and the Z chromosome separately, because both hemizygosity and the high variance of male reproductive success could give the Z chromosome a lower effective population size than the autosomes. Additionally, genetic drift is expected to act more strongly on the Z chromosome and therefore to produce higher $F_{ST}$ values for genomic regions on the Z chromosome than for those on autosomes (12). The formula for $F_{ST}$ is

$$F_{ST} = \frac{\pi_{\mathrm{Between}} - \pi_{\mathrm{Within}}}{\pi_{\mathrm{Between}}}$$

where $\pi_{\mathrm{Between}}$ is the average number of pairwise differences between populations and $\pi_{\mathrm{Within}}$ is the average number of pairwise differences within populations. In our study, the $\pi_{\mathrm{Within}}$ values were small, which raised the possibility that high $F_{ST}$ values could be attributed to linked selection within populations (a low recombination rate could result in low $\pi_{\mathrm{Within}}$ values) rather than to divergent selection between populations (13, 14). In addition, we found that the level of linkage disequilibrium ($r^2$) was positively correlated with the $F_{ST}$ value ($r = 0.06$, $p < 1 \times 10^{-15}$; fig. S7) in both altitudinal population pairs, and negatively correlated with the $\pi$ value ($r = -0.27$ to $-0.29$, all $p < 1 \times 10^{-15}$; fig. S8) in each population. Therefore, we further implemented $\Delta F_{ST}$ (15, 16) to eliminate the effect of linked selection. The low-altitude populations, rather than the high-altitude ones, were chosen as controls for the population pairs from both sides of the CMR, because the high-altitude sampling site in the east is 500 m higher than that in the west.

Fig. S1. Distribution of the length of (a) contigs (over 50 kb) and (b) scaffolds (over 50 kb). We assembled 68,427 contigs with an N50 size of 41.1 kb and a total length of 0.99 Gb (contig size: 1 kb to 370 kb), and these contigs were concatenated into 6,512 scaffolds with an N50 size of 1.94 Mb (scaffold size: 1 kb to 16 Mb).

Fig. S2. Length of chromosomes of the (a) zebra finch (Taeniopygia guttata) and (b) vinous-throated parrotbill (aligned by Satsuma).
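As an illustration of the FST calculation in "Finding candidate regions" (the difference between πBetween and πWithin, divided by πBetween), a per-window sketch follows. The function names are hypothetical. The Z-transformation shown is the standard one (subtract the mean, divide by the standard deviation); the text does not define zFST explicitly, so this, and the use of the population rather than the sample standard deviation, are assumptions.

```python
from statistics import mean, pstdev


def fst(pi_between, pi_within):
    """F_ST for one window from between- and within-population diversity."""
    return (pi_between - pi_within) / pi_between


def z_transform(values):
    """Standardize per-window values to zero mean and unit variance.

    Uses the population standard deviation (an assumption).
    """
    mu = mean(values)
    sd = pstdev(values)
    return [(v - mu) / sd for v in values]
```

Because autosomes and the Z chromosome were analyzed separately, `z_transform` would be applied once to the autosomal windows and once to the Z-linked windows. Windows with πBetween equal to zero would need to be skipped before calling `fst`.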