Insights from High-Resolution Y Chromosome and Mtdna Sequences
Total Page:16
File Type:pdf, Size:1020Kb
1 2 Supplementary Information 3 4 Human paternal and maternal demographic histories: 5 insights from high-resolution Y chromosome and mtDNA sequences 6 7 8 Sebastian Lippold, Hongyang Xu, Albert Ko, Anne Butthof, Mingkun Li, 9 Gabriel Renaud, Roland Schröder, Mark Stoneking 10 11 12 This supplement contains the complete Methods, Supplementary Figures 1-19, Supplementary 13 Tables 3-10 (Supplementary Tables 1 and 2 are provided as individual Excel files), and associated 14 References. 15 16 1 1 Methods 2 3 Samples and sequencing library preparation 4 The samples consist of 623 males (Supplementary Table 2) from the CEPH Human Genome 5 Diversity Panel (HGDP)1. The samples were taken from the subset “H952”, which excludes atypical, 6 duplicated and closely related samples2. Approximately 200 ng of genomic DNA from each sample 7 was sheared by sonication using a Bioruptor system (Diogenode) and used to construct an Illumina 8 Sequencing library with a specific double-index as described previously3. The libraries were then 9 enriched separately for NRY and mtDNA sequences as described below. 10 11 Y-chromosome capture array design 12 Our strategy in designing a capture array to enrich sequencing libraries for NRY sequences was 13 to target unique regions that are free of repeats and to which the typically short next-generation 14 sequencing reads could be mapped with high confidence. We used the UCSC table browser4 and the 15 February 2009 (GRCh37/hg19) assembly and applied the following filter criteria. First, from the 16 group “variation and repeats”, sequence regions annotated in the following tracks were removed: 17 Interrupted Repeats, RepeatMasker, Simple Repeats, and Segmental Duplications. Next, we used the 18 “mapability” table “CRG Align 75” from the group “mapping and sequencing tracks” to identify and 19 remove regions with mapability scores below 1. We then removed regions of less than 500 bp in 20 order to reduce the number of fragments and thereby the number of fragment ends, which have low 21 probe densities. We also removed 15mers that occurred more than 100 times in the hg19 genome 22 assembly, as described previously5, which resulted in splitting some target regions into sub-regions 23 that were less than 500 bp. The final result was a total of ~500 kb of unique NRY sequence, 24 distributed among 655 target regions ranging from 61 bp –3.9 kb (Supplementary Table 1). These 2 1 regions were then used to design a custom array (SureSelect 1M capture array, Agilent) with 60nt 2 probes that were printed twice with a tiling density of 1 bp. 3 4 NRY enrichment 5 Up to 60 barcoded libraries were pooled in equimolar ratio. The library mix was enriched for 6 target NRY regions by hybridization-capture on the custom designed array following the protocol 7 described previously6. After enrichment the library-pool was quantified by qPCR and then amplified 8 to a total of ~1012 molecules. The final concentration and length distribution was measured on an 9 Agilent DNA 100 microchip, and 10 nmol of the amplified library pool was used for sequencing. 10 Each pool, consisting of 48-60 samples, was sequenced on a Solexa GAII lane using a paired end 75 11 cycle run plus two 7nt index reads. 12 13 MtDNA enrichment 14 Up to 94 libraries were pooled in equimolar ratio and the library pool was enriched for mtDNA 15 sequences by an in-solution based hybridization-capture method7. The hybridization eluate was 16 measured by qPCR and then amplified to produce a final concentration of 10 nmol. Up to 200 17 samples were sequenced on a Solexa GAII lane using a paired end 75 cycle run, plus two 7nt index 18 reads. 19 20 Data processing 21 In each Solexa GAII lane, 1% PhiX174 phage DNA was spiked in and used as a training set to 22 estimate base quality scores with the IBIS base-caller8. Reads with more than 5 bases having a 23 PHRED scaled quality score below Q15 were discarded, as were reads having a single base quality 3 1 in the index read (7nt) score below Q10. Reads with no mismatches to the expected double index 2 sequences were assigned to each individual sample library. 3 For the NRY-enriched data, reads were mapped to the human reference genome (GRCh37) using 4 default settings with BWA v0.5.109. We mapped to the whole genome rather than just the target 5 region, in order to identify reads that might, with equal probability, map to another position in the 6 genome. The bam files containing the mapping information and reads were processed with samtools 7 v0.1.1810. We used Picard 1.42 (http://picard.sourceforge.net) to mark duplicates, based on the start 8 and end coordinates of the read pairs. The final SNP call was done on all samples simultaneously 9 using the UnifiedGenotyper from the GATK v2.0-35 package11 and the following options: -- 10 output_mode EMIT_ALL_CONFIDENT_SITES, --genotype_likelihoods_model SNP, -- 11 min_base_quality_score 20 and --heterozygosity 0.0000000001. The result was stored in a VCF file 12 containing information for each callable site of the target region (Supplementary Table 1), and a 13 second VCF file was created that contained only the variable positions among the 623 samples. For 14 each sample at each variable position the PL scores, which are the normalized, PHRED-scaled 15 likelihoods for the three genotypes (0/0, 0/1, 1/1) were investigated. Positions that showed a 16 difference in the PL value of less than 30 between homozygote reference (0/0) and homozygote 17 alternative (1/1) were removed, as were heterozygote calls (0/1) with a difference greater than or 18 equal to 30 to the most likely homozygous genotype. Sites where more than two bases were called 19 (i.e., multi-allelic sites) were also removed. 20 For the mtDNA-enriched data, reads were mapped to the revised mtDNA reference sequence 21 (GenBank number: NC_012920) using the software MIA12.The consensus sequences were aligned 22 using MUSCLE v3.8.3113 (cmd line: muscle -maxiters 1 –diags mt_623seq.fasta mt_623seq.aln). 23 24 25 4 1 Imputation for the NRY 2 After quality filtering, there were 2847 variable sites in the 623 individuals, with a total of 2.54% 3 of the individual genotypes at variable positions scored as “N” (i.e., as missing data; the number of 4 missing sites per individual ranged from 9-1173, with an average of 122 missing sites per individual). 5 Since missing data can influence the results of some analyses, we took advantage of the fact that the 6 NRY target regions are completely linked with no recombination to impute missing data as follows. 7 First, all sites with no missing data (605 sites) were used as the reference set to define haplotypes and 8 calculate the number of differences between each haplotype. Sites with missing data were then 9 imputed, beginning with the site with the smallest amount of missing data and proceeding 10 sequentially. For each haplotype with missing data for that site, the missing base was imputed as the 11 allele present in the reference haplotype that had the fewest differences (based on the sites with no 12 missing data). After imputation was finished for that site, it was added to the reference set, and the 13 procedure continued for the next site with the smallest amount of missing data. 14 As a check on the accuracy of the imputation, we randomly deleted 2.54% of the known alleles, 15 following the distribution of missing alleles in the full dataset, thereby creating an artificial dataset 16 with a similar distribution of missing alleles as in the observed dataset. We then imputed the missing 17 data according to the above procedure and compared the imputed alleles to the true alleles; this 18 procedure was carried out 1000 times. The imputed allele matched the true allele in 99.1% of the 19 comparisons, indicating that the imputation procedure is quite accurate. 20 21 Recurrent NRY mutations 22 We expect the majority of the NRY SNPs to have mutated only once, as recurrent mutations in 23 the known NRY phylogeny are quite rare14,15. Therefore, as a further quality control measure, we 24 investigated the NRY data for recurrent mutations by constructing a maximum parsimony tree for 25 the 2847 SNPs using programs in PHYLIP (http://evolution.genetics.washington.edu/phylip.html). 5 1 We then estimated the number of mutations at each SNP, and removed 48 SNPs (1.7%) that mutated 2 more than twice as these are likely to reflect sequencing errors. The final dataset contains 2799 SNPs, 3 of which 2228 are from the target resequencing regions, and the remaining 571 are from the 4 ascertained SNPs from dbSNP. 5 6 Data Analysis 7 Basic summary statistics (haplotype diversity, mean number of pairwise differences, nucleotide 8 diversity, Tajima’s D value and theta(S)) were calculated using Arlequin v3.5.1.316. Arlequin was 9 further used to estimate pairwise ΦST values and for Analysis of Molecular Variance (AMOVA). The 10 pairwise ΦST matrices were used in non-metric Multidimensional Scaling (MDS) analyses, which 11 were carried out in R using the ‘isoMDS’ command from the ‘MASS’ package17. The observed ratio 12 of the mean pairwise differences (mpd) for the NRY vs. mtDNA was calculated as mpdNRY/mpdmt. In 13 order to detect group-specific deviations from the mean distribution of the mpd ratio in the dataset, 14 we carried out a resampling approach. For each group sample size (Ngroup) we chose randomly Ngroup 15 individuals (out of 623) and calculated the mpd ratio using the dist.dna command from the APE 18 16 package in R.