Assessment of the Genetic Analyses of Rasmussen et al.
Technical Report: Assessment of the genetic analyses of Rasmussen et al. (2015)

John Novembre, PhD, David Witonsky, Anna Di Rienzo, PhD

April 4, 2016

SYNOPSIS

The primary aim of the analysis undertaken here (U.S. Army Corps of Engineers, St Louis District Contract #W912P9-16-P-0010) is to provide an independent validation of the genetic evidence underlying a publication by Morten Rasmussen and colleagues in Nature on July 23rd, 2015 (Vol 523:455–58). Based on our analysis of the Kennewick Man's sequence data and the Colville tribe genotype data generated by Rasmussen et al., we concur with the finding of the original paper that the sample is genetically closer to modern Native Americans than to any other population worldwide. We carried out several analyses to support this conclusion, including (i) principal component analysis (PCA; Patterson et al. 2006), (ii) unsupervised genetic clustering using ADMIXTURE (Alexander, Novembre, and Lange 2009), (iii) estimation of genetic affinity to modern human populations using f3 and D statistics (Patterson et al. 2012), and (iv) a novel approach based on the geographic distribution of rare variants. Importantly, these distinct analyses, spanning three non-overlapping subsets of the data, are each consistent with Native American ancestry.

OVERVIEW OF ANALYSIS AND RESULTS

We provide an overview of our key analysis steps and results here. The technical details of the precise methods and settings we used are listed in the appendix.

Terminology. In the presentation that follows we use the term “Native Americans” to refer to the indigenous peoples of North, Central, and South America. The human genetics research community understands there to be shared genetic signatures among these peoples relative to other populations around the globe, and we refer to people carrying these signatures as having “Native American ancestry”.
Finer levels of genetic structure exist within Native Americans, and as a result, we at times distinguish between Native American groups or populations. When we use the term population, we do not imply any level of genetic differentiation; we simply mean a collection of individuals with a defining characteristic (usually because they inhabit a particular geographic area or have a specific group/tribal affiliation).

Obtaining data. To begin the analysis we acquired the Kennewick Man sample raw sequence read data from the Short Read Archive (Accession # SRS937952). We also acquired the Colville Tribe sample genotypes from Morten Rasmussen (personal communication). For reference samples, we used the publicly available phase 3 sequence data from the 1000 Genomes Project Consortium (2015) and a recent large-scale survey of Native American variation (Raghavan et al. 2015).

Processing of sequence reads. To process the Kennewick sequence data, we trimmed the adapters from the raw reads and mapped the remaining read data to the human reference genome (hg19). Read mapping is the most time-consuming computational step, as the original input has ∼6 billion reads (this step took 16,104 CPU hours to complete). After removing PCR duplicates and reads with low mapping quality or short length, the number of reads drops to ∼54.3 million. This number differs somewhat from the Rasmussen et al. paper, in which their pipeline produced ∼60 million reads, because we apply a slightly more stringent filter for fragment length. We observe ∼1x average coverage as a result, which also agrees with the Rasmussen et al. analysis. On average, 6.69% of the input reads map to the reference human genome. This fraction of endogenous DNA is within the range of other archaeological samples that have been successfully analyzed.

Assessment of chemical damage pattern typical of ancient DNA.
DNA molecules experience a post-mortem degradation process, leading to a set of chemical modification patterns typical of ancient DNA (aDNA). First, aDNA molecules are typically short, often < 100 base pairs (bp). Second, cytosine deamination occurs with higher frequency near the ends of molecules and generates an increasing level of C-to-T substitution at the 5′ end of a read and a corresponding G-to-A substitution pattern at the 3′ end. Lastly, loss of a purine base (A or G) induces a break on the 3′ side of the depurinated site, leading to an excess of purines at the genomic positions preceding the 5′ end of molecules and a corresponding increase in the proportion of pyrimidines (C or T) following the 3′ end.

We assessed the size distribution and damage patterns of inserted molecules using uniquely mapped, non-duplicate reads with the mapDamage2 program (Jónsson et al. 2013). The sequence reads for the Kennewick sample clearly show these patterns: short molecule length (median length 49 bp, Figure 1), an increase in C-to-T substitutions at the 5′ and 3′ ends of reads from single-stranded libraries, C-to-T substitutions at the 5′ ends and G-to-A substitutions at the 3′ ends of reads from double-stranded libraries (Figure 2), and the signature of a loss of purines next to depurinated sites (Figure 3).

Assessment of contamination from modern humans. Estimating the level of contamination from modern human DNA via various sources, such as contact with those who conducted excavation or sequencing experiments or prepared laboratory reagents, is a critical step in aDNA analysis. We estimated the level of contamination from exogenous human sources in the Kennewick sequence data based on a method implemented in the contamMix program (Fu et al. 2013).
In brief, this program uses the majority-based mitochondrial DNA (mtDNA) consensus sequence as a representative of the endogenous sequence, assuming that sequence coverage is high enough to call a consensus sequence accurately in the presence of some contaminant reads and sequencing errors. It then models the observed sequence reads as a sample from a mixture of mitogenomes, including the endogenous one and potential contaminants, as represented by 311 contemporary human mitogenomes. The probability that a read is drawn from the endogenous mitogenome, and its 95% confidence interval, is estimated through a Markov chain Monte Carlo algorithm. From this we obtained an estimate of 7.3% contamination (with a 92.5%–97.5% credible interval of 6%–8.8%). Rasmussen et al. report a 95% confidence interval of 3.7–7.1%.

The level of mtDNA contamination does not necessarily reflect the level of nuclear DNA contamination. To assess contamination in the nuclear genome, we took advantage of the fact that a male is hemizygous for the X chromosome, so that no mismatches among the X chromosome sequence reads are expected. By evaluating the minor allele mismatch reads on the X chromosome relative to allele frequencies in the European population, we obtain contamination estimates of 3–3.2% depending on which method is used. These are slightly higher than the 2.5% reported in the Rasmussen et al. manuscript.

Mitochondrial DNA (mtDNA) haplogroup. The mtDNA haplogroup an individual carries is indicative of ancestry and relatively easy to observe from aDNA, as mtDNA is more abundant in cells than autosomal DNA. Using the consensus sequence of the mtDNA reads obtained, we find the mtDNA haplotype to belong to haplogroup X2a. This haplogroup has only been observed in Native Americans (Reidla et al. 2003). Rasmussen et al. report the same haplogroup for the sample.
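The X-chromosome argument used for the nuclear contamination estimate in this section can be illustrated with a small sketch. It is a simplified moment estimator written for illustration only, not the method used in this report or by Rasmussen et al. It assumes that endogenous reads from a male always carry the consensus allele, that a fraction c of reads are contaminants drawn from a population with known allele frequencies, and that the sequencing error rate can be measured at nearby monomorphic sites. The `XSite` record, the function name, and the toy counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class XSite:
    """Read counts at one X-chromosome site in a male sample (hypothetical format)."""
    consensus_reads: int    # reads matching the sample's consensus allele
    mismatch_reads: int     # reads carrying any other allele
    alt_allele_freq: float  # non-consensus allele frequency in the assumed
                            # contaminating population (0.0 at monomorphic sites)


def mismatch_rate(sites):
    """Pooled fraction of reads that disagree with the consensus allele."""
    mismatches = sum(s.mismatch_reads for s in sites)
    total = sum(s.mismatch_reads + s.consensus_reads for s in sites)
    return mismatches / total


def estimate_contamination(poly_sites, flanking_sites):
    """Simplified moment estimate of the contamination fraction c.

    Assumed model (an illustration, not a published estimator):
      polymorphic sites: E[mismatch rate] ~= c * mean(alt_allele_freq) + error
      flanking sites:    E[mismatch rate] ~= error
    so c ~= (observed - error) / mean(alt_allele_freq).
    """
    error = mismatch_rate(flanking_sites)
    observed = mismatch_rate(poly_sites)
    depths = [s.consensus_reads + s.mismatch_reads for s in poly_sites]
    # Depth-weighted mean frequency of the non-consensus allele.
    mean_alt = sum(d * s.alt_allele_freq for d, s in zip(depths, poly_sites)) / sum(depths)
    c = (observed - error) / mean_alt
    return min(max(c, 0.0), 1.0)  # clamp to a valid proportion


# Toy counts constructed with 3% contamination and a 0.2% error rate.
poly = [XSite(992, 8, 0.2), XSite(986, 14, 0.4)]
flank = [XSite(998, 2, 0.0), XSite(998, 2, 0.0)]
contamination = estimate_contamination(poly, flank)
print(f"estimated contamination: {contamination:.3f}")
```

Subtracting the flanking-site mismatch rate is what separates contamination from sequencing error in this sketch; without that correction, ordinary base-calling errors would inflate the estimate.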
[Figure 1 panels: single-end read length distribution and read length per strand (+ and − strand), with read length on the x-axis and occurrences on the y-axis; cumulative frequencies of C>T and G>A substitutions by read position. (A) SRR2034764.trim.n1o2.uniq. (B) SRR2034009.trim.n1o2.uniq.]

Figure 1: The distribution of fragment lengths from two representative libraries. (A) Single-strand read library (SRR2034764). (B) Double-strand read library (SRR2034009). The concentration of short reads is typical of damaged, ancient DNA. Only reads with length greater than 30 bp were included in downstream analyses.

[Plot panels for SRR2034764.trim.n1o2.uniq.bam.Q30.L30 and SRR2034009.trim.n1o2.uniq.bam.Q30.L30: frequency of bases A, C, G, and T.]
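The cytosine deamination signature described in the text, an elevated C-to-T substitution rate near the 5′ end of reads, can be tabulated with a short sketch. This is not the mapDamage2 implementation; it is a minimal illustration that assumes each alignment has already been reduced to a gap-free pair of equal-length strings (reference, read), both oriented 5′ to 3′ along the read. The function name and the toy alignments are hypothetical.

```python
from collections import Counter


def c_to_t_by_position(alignments, max_pos=10):
    """Per-position C-to-T substitution frequency from the 5' end of reads.

    alignments: iterable of (reference_seq, read_seq) pairs, gap-free and of
    equal length, oriented 5'->3' along the read (an assumed input format).
    Returns a list of length max_pos; an entry is None where no reference C
    was observed at that position.
    """
    ref_c = Counter()   # reference C's seen at each read position
    c_to_t = Counter()  # of those, how many the read shows as T
    for ref, read in alignments:
        for pos, (ref_base, read_base) in enumerate(zip(ref[:max_pos], read[:max_pos])):
            if ref_base == "C":
                ref_c[pos] += 1
                if read_base == "T":
                    c_to_t[pos] += 1
    return [c_to_t[p] / ref_c[p] if ref_c[p] else None for p in range(max_pos)]


# Toy alignments: substitutions concentrate at the first two read positions.
toy = [
    ("CACG", "TACG"),
    ("CCGT", "CTGT"),
    ("CAAC", "TAAC"),
    ("GCTA", "GCTA"),
]
freqs = c_to_t_by_position(toy, max_pos=4)
print(freqs)
```

For damaged ancient DNA one would expect these frequencies to be highest at position 0 and to decay toward the read interior; applying the same tabulation to G-to-A substitutions counted from the 3′ end gives the complementary pattern described for double-stranded libraries.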