Supporting Information
Hübner et al. 10.1073/pnas.1321533111 (www.pnas.org/cgi/content/short/1321533111)

SI Materials and Methods

Sampling and DNA Extractions. Single nucleotide polymorphism was tested in Drosophila melanogaster isofemale lines established from females inseminated in nature and collected on the opposite slopes of Nahal Oren canyon (Carmel massif, Israel) in October 2010. All flies were reared on standard instant Drosophila food medium (Carolina Biological Supply Co.) in half-pint milk bottles at a temperature of 24 ± 1 °C and on a 12:12 light/dark cycle. For the molecular experiments, we used flies from 32 isofemale lines [16 from the north-facing slope (NFS) and 16 from the south-facing slope (SFS)]. DNA was extracted from 100 males of each line using DNAzol Genomic DNA Isolation Reagent (Molecular Research Center Inc.) according to the manufacturer’s protocol. Equal amounts of ∼1 μg of DNA from each line were pooled to make slope representations. Illumina paired-end libraries were constructed and sequenced with HiSeq, 100-cycle, at ∼40× coverage per population.

Mapping Reads and Data Processing. Paired-end reads were filtered for a minimum average base quality score of 20 and a minimum length of 50 bp using PoPoolation (1). Trimmed reads were mapped to the D. melanogaster reference genome using the Burrows–Wheeler aligner (BWA) v0.6.2 (2) with the following parameters: -n 0.01 -l 100 -o 1 -d 12 -e 12. Paired-end data were merged to single files in sam format with the “sampe” option of BWA, converted to binary alignment/map format, and filtered for a minimum mapping quality of 20 using SAMtools v1.18 (3). Files of both NFS and SFS populations were further converted to a pileup file and synchronized for downstream analysis of allele frequency differences between the slopes of the canyon. For each detected SNP, differentiation between populations was calculated using both Fst and the Fisher exact test as implemented in PoPoolation2 (4). For each of the two populations, footprints of natural selection were detected using Tajima’s D calculated over nonoverlapping 10-kb windows along chromosome arms. Trimmed files of each population were converted to a pileup format separately and used to calculate population measures over a nonoverlapping window of size 10 kb in PoPoolation using a minimum allele frequency of 2, minimum quality of 20, minimum coverage of 6, and maximum coverage of 400.

Detection of Interslope Selective Differentiated Genomic Regions. To reveal genomic regions with significant interslope differentiation, we used Hidden Markov Model (HMM) analysis (5–7). The HMM was used to discriminate between the distributions of three hidden states corresponding to high, moderate, and low interslope differentiation and to assign each SNP to a corresponding state. We thus applied an HMM as implemented in the R package “HiddenMarkov” (8) using the interslope Fst values obtained for each SNP from the PoPoolation2 software. Additional parameters for the HMM are the transition probability matrix between states, emission probabilities, and state distribution parameters. These parameters were estimated within the HMM framework using the expectation maximization Baum–Welch algorithm (9) for each chromosome arm separately, allowing a maximum of 1,000 iterations to reach convergence. Initial parameters were defined from the data using a maximum likelihood approach by launching the Baum–Welch algorithm 100 times with randomly sampled values from a predefined parameter space. Single-SNP Fst values were found to follow a gamma distribution, with rate and shape parameters extending between 0–300 and 0–60, respectively. A constrained transition matrix was imposed, disallowing a direct transition between low and high states (7). For each chromosome arm, the parameters with the highest log-likelihood were used in the HMM. Assignment of SNPs to each of the detected “hidden” distributions was conducted by local decoding of each point to its a posteriori most probable state. A local decoding approach was preferred over the Viterbi global decoding algorithm (10) because the latter maximizes the joint distribution of the hidden states even at the cost of local false assignments.

To search for high-differentiation regions along chromosome arms, we used 10-kb nonoverlapping windows. For each window, the level of differentiation was represented by two parameters, LI = nL/nT (“low island”) and HI = nH/nT (“high island”), where nT is the total number of SNPs within each window, whereas nH and nL are the numbers of SNPs assigned in the HMM step to either high- or low-differentiation states, respectively. Following this step, a permutation test was conducted to detect genomic regions with significant enrichment of HI scores. For each permutation, Fst values were taken randomly from the whole set of SNPs within a chromosome arm together with their corresponding assignment to the high-differentiation hidden state. The high-differentiation score HI was then calculated for each window. This reshuffling process was repeated 10,000 times, allowing only significantly differentiated regions to be compared with random expectations.

Significantly differentiated genomic regions between slopes were then inspected for the type and strength of selection using the Tajima’s D scores for each window as obtained with PoPoolation (1). Trimmed files of each population were converted to a pileup format separately and used to calculate population measures over a nonoverlapping window of size 10 kb in PoPoolation. A score of D < −2 is indicative of a recent selective sweep (fixation of a novel mutation) followed by a slow recovery of variation, thus an excess of rare alleles. On the other hand, D > 2 is indicative of small allele frequency differences due to balancing selection. Thus, combining both differentiation and selection scores can be used for detection of genomic regions corresponding to interslope differentiation caused by alternative selection (small D and significant HI).

Gene Ontology Term Analysis. To study the biological significance of genes under diversifying selection, an enrichment analysis of gene ontology (GO) terms was conducted with the Bioconductor package goseq (11), which accounts for the gene-length bias. Genes located in genomic regions found to be under differentiation between slopes (HI) were further tested for their association with the detected differentiation. For each SNP within each gene, a Fisher exact test was performed, and the proportion of significantly differentiated SNPs (P < 0.01) was recorded for each gene. Genes with a significantly high proportion of differentiated SNPs (P adjusted < 0.0001) were selected for downstream GO enrichment analysis. A list of all genes (significant and nonsignificant) and their corresponding sequence lengths was used to generate the probability weighting function, which enables one to correct for the gene-length bias. For time efficiency, the Wallenius distribution method (11) was used as an approximation for the H0 distribution. Gene length was calculated for each gene using FlyBase resources (batch download tool), based on sequence minimum and maximum locations (available for D. melanogaster reference genome v5.31). We generated a list of GO annotations for the whole genome using AmiGO v1.8 software, the official web-based set of tools for searching and browsing the Gene Ontology database (12), which served to determine significant GO terms based on the chosen criteria. Significantly overrepresented GO terms were further corrected for multiple comparisons using the Benjamini–Hochberg method (13). The Functional Annotation Clustering tool in DAVID (v. 6.7) (14, 15) was used for an in-depth GO enrichment analysis and clustering. We performed functional annotation clustering for GO terms with high classification stringency.

Effects of Recombination. In the current study, we tested, albeit indirectly, the effect of recombination on population polymorphism in the canyon for SNP sites. The effect of recombination was tested based on the local recombination rates using data from Marais et al. (16), represented by the score r (cM/Mb), and on the physical position of the gene loci in the chromosome sequence relative to the centromere (dist, Mb), which is known as a strong suppressor of recombination (“centromeric effect”) (for a review, see ref. 17). The two variables, r and dist, allow taking into account the centromeric effect and the local variation of recombination superimposed on the centromeric effect. The effect of recombination was tested for each of the large chromosomal arms using rank correlation of r and dist with the following parameters calculated for the components of gene sequence [coding DNA sequences (CDSs), introns, 5′ UTR and 3′ UTR, and the total gene sequence]: density of polymorphic SNPs per kb (parameter all/kb) and density of SNPs significantly differentiated between the slopes (parameter sig/kb).

1. Kofler R, et al. (2011) PoPoolation: A toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PLoS ONE 6(1):e15925.
2. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754–1760.
3. Li H, et al.; 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16):2078–2079.
4. Kofler R, Pandey RV, Schlötterer C (2011) PoPoolation2: Identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq).
10. Viterbi AJ (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inf Theory 13:260–269.
11. Young MD, Wakefield MJ, Smyth GK, Oshlack A (2010) Gene ontology analysis for RNA-seq: Accounting for selection bias. Genome Biol 11(2):R14.
12. Carbon S, et al.; AmiGO Hub; Web Presence Working Group (2009) AmiGO: Online access to ontology and annotation data. Bioinformatics 25(2):288–289.
13. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practical and