Technical Report: Assessment of the genetic analyses of Rasmussen et al. (2015)

John Novembre, PhD, David Witonsky, Anna Di Rienzo, PhD

April 4, 2016

SYNOPSIS

The primary aim of the analysis undertaken here (U.S. Army Corps of Engineers, St Louis District Contract #W912P9-16-P-0010) is to provide an independent validation of the genetic evidence underlying a recent publication by Morten Rasmussen and colleagues on July 23rd, 2015, in Nature (Vol 523:455–58). Based on our analysis of the Kennewick Man’s sequence data and Colville tribe genotype data generated by Rasmussen et al., we concur with the findings of the original paper that the sample is genetically closer to modern Native Americans than to any other population worldwide. We carried out several analyses to support this conclusion, including (i) principal component analysis (PCA; Patterson et al. 2006), (ii) unsupervised genetic clustering using ADMIXTURE (Alexander, Novembre, and Lange 2009), (iii) estimation of genetic affinity to modern human populations using f3 and D statistics (Patterson et al. 2012), and (iv) a novel approach based on the geographic distribution of rare variants. Importantly, these distinct analyses, spanning three non-overlapping subsets of the data, are each consistent with Native American ancestry.

OVERVIEW OF ANALYSIS AND RESULTS

We provide an overview of our key analysis steps and results here. The technical details of the precise methods and settings we used are listed in the appendix.

Terminology. In the presentation that follows we use the term “Native Americans” to refer to the indigenous peoples of North, Central, and South America. The human genetics research community understands there to be shared genetic signatures among these peoples relative to other populations around the globe, and we refer to people carrying these signatures as having “Native American ancestry”. Finer levels of genetic structure exist within Native Americans, and as a result, we at times distinguish between Native American groups or populations. When we use the term population, we do not imply any level of genetic differentiation; we simply mean a collection of individuals with a defining characteristic (usually because they inhabit a particular geographic area or have a specific group/tribal affiliation).

Obtaining data. To begin the analysis we acquired the Kennewick Man sample raw sequence read data from the Short Read Archive (Accession # SRS937952). We also acquired the Colville Tribe sample genotypes from Morten Rasmussen (personal communication). For reference samples, we used the publicly available phase 3 sequence data from the 1000 Genomes Project Consortium (2015) and a recent large-scale survey of Native American variation (Raghavan et al. 2015).

Processing of sequence reads. To process the Kennewick sequence data, we trimmed the adapters from the raw reads and mapped the remaining read data to the human reference genome (hg19). The read mapping is the most time-consuming computational step, as the original input has ∼6 billion reads (this step took 16,104 CPU hours to complete). After removing PCR duplicates and reads with low mapping quality or short length, the read number reduces to ∼54.3 million reads. This number differs somewhat from the Rasmussen et al. paper, in which their pipeline produced ∼60 million reads, since we apply a slightly more stringent filter for fragment length. We observe ∼1x average coverage as a result, which also agrees with the Rasmussen et al. analysis. Of the input reads, on average 6.69% map to the reference human genome. This fraction of endogenous DNA is within the range of other archaeological samples that have been successfully analyzed.

Assessment of chemical damage pattern typical of ancient DNA. DNA molecules experience a post-mortem degradation process, leading to a set of chemical modification patterns typical of ancient DNA (aDNA). First, aDNA molecules are typically short in length, often < 100 base pairs (bp). Second, cytosine deamination occurs with higher frequency near the ends of molecules and generates an increasing level of C-to-T substitution at the 5′ end of a read and a corresponding G-to-A substitution pattern at the 3′ end.
Lastly, loss of a purine base (A or G) induces a break on the 3′ side of the depurinated site, leading to an excess of purines at the genomic positions preceding the 5′ end of molecules and a corresponding increase in the proportion of pyrimidines (C or T) following the 3′ end. We assessed the size distribution and damage patterns of inserted molecules using uniquely mapped, non-duplicate reads with the mapDamage2 program (Jónsson et al. 2013). The sequence reads for the Kennewick sample clearly show these patterns: short molecule length (median length 49 bp, Figure 1), increase in C-to-T substitutions at 5′ and 3′ ends of reads from single-stranded libraries, and C-to-T substitutions at 5′ ends and G-to-A substitutions at 3′ ends of reads from double-stranded libraries (Figure 2), and the signature of a loss of purines next to depurinated sites (Figure 3).

Assessment of contamination from modern humans. Estimating the level of contamination from modern human DNA via various sources, such as contact with those who conducted excavation or sequencing experiments or prepared laboratory reagents, is a critical step in aDNA analysis. We estimated the level of contamination from exogenous human sources in the Kennewick sequence data based on a method implemented in the contamMix program (Fu et al. 2013). In brief, this program uses the majority-based mitochondrial DNA (mtDNA) consensus sequence as a representative of the endogenous sequence, assuming that sequence coverage is high enough to accurately call a consensus sequence given the presence of some contaminant reads and sequencing errors. Then, it models observed sequence reads as a sample from a mixture of mitogenomes, including the endogenous one and potential contaminants, as represented by 311 contemporary human mitogenomes. The probability of reads being drawn from the endogenous mitogenome, and its 95% credible interval, is estimated through a Markov chain Monte Carlo algorithm. From this we obtained an estimate of 7.3% contamination (with a 95% credible interval, i.e. the 2.5%–97.5% quantiles, of 6%–8.8%). Rasmussen et al. report a 95% confidence interval ranging from 3.7–7.1%.

The level of mtDNA contamination does not necessarily reflect the level of nuclear DNA contamination. To assess contamination in the nuclear genome, we took advantage of the fact that a male is hemizygous for the X chromosome and that no mismatches in the X chromosome sequence reads are expected. By evaluating the minor allele mismatch reads on the X chromosome relative to allele frequencies in the European population, we obtain contamination estimates of 3–3.2% depending on which method is used. These are slightly higher than the 2.5% reported in the Rasmussen et al. manuscript.

Mitochondrial DNA (mtDNA) haplogroup. The mtDNA haplogroup an individual carries is indicative of ancestry and relatively easy to observe from aDNA, as mtDNA is more abundant in cells than autosomal DNA. Using the consensus sequence of the mtDNA reads obtained, we find the mtDNA haplotype to belong to haplogroup X2a. This haplogroup has only been observed in Native Americans (Reidla et al. 2003). Rasmussen et al. report the same haplogroup for the sample.
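The X-chromosome reasoning above can be illustrated with a small sketch. It is not the estimator used in this report (the actual methods are listed in the appendix); it is a simplified moment estimator under stated assumptions: reads at known polymorphic X sites mismatch the consensus either through sequencing error or through contamination, the error rate is measured at flanking sites assumed to be monomorphic, and a contaminant read carries the non-consensus allele with probability equal to that allele's frequency in the contaminating population. The function name and input layout are hypothetical.

```python
def estimate_x_contamination(poly_sites, flank_mismatches, flank_depth):
    """Toy contamination estimate for a male (hemizygous X) sample.

    poly_sites: list of (mismatch_reads, depth, alt_freq) tuples, one per
        polymorphic X site, where alt_freq is the frequency of the
        non-consensus allele in the assumed contaminating population.
    flank_mismatches, flank_depth: pooled mismatch count and depth at
        flanking sites, used here as an estimate of the sequencing error rate.
    """
    error_rate = flank_mismatches / flank_depth
    total_depth = sum(depth for _, depth, _ in poly_sites)
    mismatch_rate = sum(mm for mm, _, _ in poly_sites) / total_depth
    # Depth-weighted chance that a contaminant read shows the other allele.
    mean_alt_freq = sum(depth * freq for _, depth, freq in poly_sites) / total_depth
    # Excess mismatches beyond sequencing error are attributed to contamination.
    return max(0.0, (mismatch_rate - error_rate) / mean_alt_freq)


# Hypothetical counts: 4 mismatching reads out of 200 at polymorphic sites,
# 2 out of 1000 at flanking sites.
sites = [(3, 100, 0.5), (1, 100, 0.3)]
contamination = estimate_x_contamination(sites, flank_mismatches=2, flank_depth=1000)
print(f"{contamination:.3f}")
```

With these made-up counts the excess mismatch rate is 1.8% and the mean alternative-allele frequency is 0.4, so the sketch returns about 4.5%.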

[Figure 1 panels: read length distribution and read length per strand (+ strand, − strand) for each library; x-axis: read length; y-axis: occurrences.]

Figure 1: The distribution of fragment lengths from two representative libraries. (A) Single-strand read library (SRR2034764). (B) Double-strand read library (SRR2034009). The concentration of short reads is typical of damaged, ancient DNA. Only reads with length greater than 30 bp were included in downstream analyses.


Figure 2: Average number of mismatches between aDNA sequence reads and a reference sequence per position in reads. X-axis denotes position relative to the 5′ (left) and 3′ (right) end of the reads. (A) An example of the read damage patterns from a single-strand read library (SRR2034764). The enrichment of C-to-T transitions (red line) at the 5′ end and G-to-A transitions (blue line) at the 3′ end is expected for single-strand sequencing of damaged DNA such as ancient DNA. (B) An example of the read damage patterns from a double-strand read library (SRR2034009). The enrichment of C-to-T transitions (red line) at the 5′ and 3′ ends is expected for double-strand sequencing of ancient DNA.
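The per-position substitution frequencies plotted in Figure 2 can be tabulated along the following lines. This is a minimal sketch and not the mapDamage2 implementation: it assumes each read is already paired with the equally long stretch of reference it aligns to (no indels), and that both strings are given in the read's own 5′-to-3′ orientation. The function name and parameters are hypothetical.

```python
def misincorporation_by_position(pairs, ref_base, read_base,
                                 n_positions=10, from_end="5p"):
    """Fraction of reference `ref_base` sites read as `read_base`,
    by distance from the chosen read end ("5p" or "3p").

    pairs: list of (reference_segment, read_sequence) strings of equal length.
    Returns a list of length n_positions; None where no reference base was seen.
    """
    hits = [0] * n_positions
    totals = [0] * n_positions
    for ref, read in pairs:
        if from_end == "3p":
            # Count positions starting from the 3' end instead.
            ref, read = ref[::-1], read[::-1]
        for i in range(min(n_positions, len(ref))):
            if ref[i] == ref_base:
                totals[i] += 1
                if read[i] == read_base:
                    hits[i] += 1
    return [h / t if t else None for h, t in zip(hits, totals)]


# Tiny made-up alignments: two of three reads show C-to-T at the first base.
pairs = [("CACGT", "TACGT"), ("CCGGA", "CCGGA"), ("CGTAG", "TGTAA")]
c_to_t_5p = misincorporation_by_position(pairs, "C", "T", n_positions=2)
g_to_a_3p = misincorporation_by_position(pairs, "G", "A", n_positions=1, from_end="3p")
```

A frequency that rises toward the read ends, as described in the caption above, is the pattern taken to indicate authentic aDNA.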

[Figure 3 panels: frequency of each base (A, C, G, T) by position around the read ends, for each library; y-axis: frequency.]

Figure 3: Frequency of the four bases in the reference sequence at 10 nucleotide positions before and after the start of the sequenced fragments. Authentic aDNA is expected to show an excess of purines immediately upstream of the 5′ end of the sequence reads while an excess of pyrimidines is expected downstream of the 3′ end. (A) Single-strand read library (SRR2034764). (B) Double-strand read library (SRR2034009). The excess of changes neighboring depurinated sites is typical of ancient DNA.

Autosomal ancestry with principal components analysis (PCA). Using PCA applied to individual-level genotypes (Patterson et al. 2006), we provide a visual depiction of the ancestry of the Kennewick Man sample in the context of 3,075 individuals from around the globe genotyped at 120,037 SNPs (Raghavan et al. 2015). Specifically, we project the Kennewick, Anzick-1 (Rasmussen et al. 2014), and Saqqaq (Rasmussen et al. 2010) samples onto the first two axes of variation defined by the reference set. This approach is suitable for assessing the large-scale continental ancestry of an individual. As seen in Figure 4 (below), the Kennewick and Anzick-1 samples project into a cluster of Native American individuals, indicating strong genetic affinities with these groups.

Figure 4: PCA of global samples with ancient samples projected to show ancestry. ’K’ indicates the genetic position of the Kennewick sample, + indicates Anzick-1. For the reference set, each two-letter abbreviation marks the position of one individual and gives the population label assigned by Raghavan et al. (2015). The large points represent median positions of individuals from across different geographic regions or groupings. Regional abbreviations: Afr=Africa, Amer=America, Adm=Admixed from the Americas, Cauc=Caucasus, eAs=East Asia, Eur=Europe, NE=Near East, Ocea=Oceania, Sib=Siberia, sAs=South Asia, seAs=Southeast Asia. Samples from Africa lie to the right along the PC1 axis, outside the plotted region.
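The projection step described above, in which the axes are defined by the reference set alone and the ancient samples are then placed on them, can be sketched as follows. This is an illustration in plain Python using power iteration, not the pipeline used for Figure 4. It assumes a small, complete reference genotype matrix (0/1/2 allele counts), skips normalization by allele frequency, and handles missing genotypes in the projected sample by simply skipping those sites, which is a cruder treatment than dedicated PCA software applies. All function names are hypothetical.

```python
import math
import random


def fit_axes(reference, n_axes=2, n_iter=200, seed=0):
    """Return (site means, principal axes) from reference genotypes only."""
    n, m = len(reference), len(reference[0])
    means = [sum(row[j] for row in reference) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in reference]
    rng = random.Random(seed)
    axes = []
    for _axis_index in range(n_axes):
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        for _step in range(n_iter):
            # One power-iteration step: v <- X^T (X v).
            scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
            v = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(m)]
            # Deflation: remove the part already explained by earlier axes.
            for axis in axes:
                dot = sum(x * a for x, a in zip(v, axis))
                v = [x - dot * a for x, a in zip(v, axis)]
            norm = math.sqrt(sum(x * x for x in v))
            v = [x / norm for x in v]
        axes.append(v)
    return means, axes


def project(sample, means, axes):
    """Project one sample onto the axes; sites with None are skipped."""
    return [
        sum((g - mu) * a for g, mu, a in zip(sample, means, axis) if g is not None)
        for axis in axes
    ]


# Two made-up reference groups and one low-coverage sample with a missing site.
group_a = [[0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1]]
group_b = [[2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1]]
means, axes = fit_axes(group_a + group_b)
ancient_xy = project([0, None, 2, 2], means, axes)
```

Because the projected sample does not influence the axes, its position reflects only how its genotypes relate to variation already present in the reference set.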

Autosomal ancestry with ADMIXTURE. We next used ADMIXTURE (Alexander et al. 2009) to conduct model-based inference of ancestry proportions. This method estimates the ancestry proportion for each individual in each of K assumed source populations. No training data is used; the method simultaneously learns the nature of the source populations and estimates the proportions of each individual’s ancestry in each source. As such, it provides a useful agnostic approach to assessing patterns of ancestry in a sample, and it is particularly well suited for analysis of discrete populations with admixed descendant individuals. We used the same data set as in the PCA, and we find that Kennewick Man has an ancestry profile similar to other Native American samples from North America.

[Figure 5 panels, by region: Africa, Near East, Europe, Caucasus, South Asia, Oceania, Southeast Asia, East Asia, Siberia, America, Ancient (Anzick, Kennewick, Saqqaq), Colville. Bars are labeled with abbreviated population identifiers.]

Figure 5: Results of running ADMIXTURE with K=14 ancestral populations assumed on a globally representative sample (the Raghavan dataset); K=14 was chosen because it gave the lowest cross-validation error. Each stacked vertical bar represents a single individual. The broad geographic region is indicated above each section and population identifiers are included. The Kennewick sample shows ancestry components found predominantly among Native Americans.
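The model behind these ancestry proportions treats each genotype as a binomial draw whose allele frequency is a mixture of the source-population frequencies, weighted by the individual's ancestry proportions. The sketch below evaluates that likelihood for one individual and finds the best two-way split by grid search. It is an illustration only: ADMIXTURE estimates the proportions and the source frequencies jointly and with a far more efficient optimizer, whereas here the source frequencies are assumed known, K is fixed at 2, and the function names are hypothetical.

```python
import math


def log_likelihood(genotypes, q, freqs):
    """Log-likelihood of one individual's genotypes (0/1/2, None = missing).

    q: ancestry proportions over K sources (summing to 1).
    freqs: freqs[k][j] is the allele frequency at SNP j in source k,
        assumed to lie strictly between 0 and 1.
    """
    total = 0.0
    for j, g in enumerate(genotypes):
        if g is None:
            continue
        # Individual-specific allele frequency: mixture over the K sources.
        p = sum(q[k] * freqs[k][j] for k in range(len(q)))
        total += g * math.log(p) + (2 - g) * math.log(1 - p)
    return total


def best_two_way_fraction(genotypes, freqs, steps=100):
    """Grid-search the proportion of ancestry from source 0 when K = 2."""
    grid = [s / steps for s in range(steps + 1)]
    return max(grid, key=lambda a: log_likelihood(genotypes, [a, 1 - a], freqs))


# Made-up frequencies for two sources at three SNPs.
source_freqs = [[0.9, 0.8, 0.1], [0.1, 0.2, 0.9]]
unadmixed = best_two_way_fraction([2, 2, 0], source_freqs)
admixed = best_two_way_fraction([1, 1, 1], source_freqs)
```

In this toy setup an individual homozygous for the alleles common in source 0 is assigned entirely to that source, while an individual heterozygous at every site is assigned an even split.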

To provide a more detailed view, we carried out a second ADMIXTURE run on a subset of the full Raghavan data restricted to individuals who have only Native American ancestries. We find the Kennewick Man shares ancestry components that are observed largely in Native American groups from North America.

[Figure 6 panels, by language family: Andean, Equatorial-Tucanoan, Ge-Pano-Carib, Chibchan-Paezan, Central-Amerind, Northern-Amerind, Na-Dene, Eskimo-Aleut, Ancient (Kennewick, Anzick, Saqqaq), Colville. Bars are labeled with abbreviated population identifiers.]

Figure 6: Results of running ADMIXTURE with K=8 ancestral populations assumed; K=8 was chosen because it gave the lowest cross-validation error. Here we use a sample that includes only populations from the Americas to provide improved resolution, and we plot individuals grouped by language family.

D and f-stat analyses. To measure genetic similarity between populations in a form that is robust to observation error, we use an “f3-outgroup” statistic, which measures how much similarity two populations share with each other relative to a third. We calculated the f3-outgroup statistic between the Kennewick sample and each population in the Raghavan dataset using the Yoruban population (West Africa) as an outgroup. We find the Kennewick sample has the highest shared similarity to Native American populations, with the highest values observed being with populations from South America (Figure 7), in line with the observations from Rasmussen et al.

We also use the D-statistic, which provides a test of whether four populations (numbered 1 to 4 arbitrarily) can be arranged into a tree structure in which Populations 1 and 2 are most closely related and Populations 3 and 4 are most closely related, with no secondary gene flow between members of 1 and 2 with 3 and 4. The D-statistic is normalized to produce a Z-score, which can be used to reject the tree hypothesis for large-magnitude values (e.g. |Z-score| > 3). We find that any tree hypothesis with Kennewick grouped with other Native American populations is not rejected, and relevant alternatives are rejected (Table 1). We did not have access to Ainu or Polynesian data, so we use Japanese and Solomon Islands population data as surrogates.

[Figure 7 panels: populations ordered by f3-outgroup statistic, one panel for Kennewick (x-axis: Kennewick f3) and one for Anzick-1 (x-axis: Anzick-1 f3), each axis running from 0.00 to 0.30. Legend, by region: America, Siberia, East Asia, SE Asia, South Asia, Oceania, Europe, Caucasus, Near East.]

Figure 7: The ordered f3-outgroup statistics between Kennewick Man and other populations. The higher values represent increased genetic similarity and the highest values are found to be with Native American groups from South America.
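The ordering shown in Figure 7 rests on the definition of the f3-outgroup statistic as the average, over SNPs, of the product of allele-frequency differences between the outgroup and each of the two populations being compared. The sketch below computes that plain average and ranks candidate populations by it. It is a simplified illustration: it omits the finite-sample corrections and standard errors that dedicated software provides, it treats the ancient sample as a vector of allele frequencies, and the function names and numbers are hypothetical.

```python
def f3_outgroup(outgroup, pop_a, pop_b):
    """Mean over SNPs of (o - a) * (o - b), given allele frequencies per SNP.

    Larger values mean pop_a and pop_b share more drift since splitting
    from the outgroup, i.e. they are more closely related to each other.
    """
    terms = [(o - a) * (o - b) for o, a, b in zip(outgroup, pop_a, pop_b)]
    return sum(terms) / len(terms)


def rank_by_f3(outgroup, target, populations):
    """Sort candidate populations by shared drift with the target, highest first."""
    scores = {name: f3_outgroup(outgroup, target, freqs)
              for name, freqs in populations.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up allele frequencies at four SNPs.
outgroup = [0.5, 0.5, 0.5, 0.5]
target = [1.0, 0.0, 1.0, 0.0]
candidates = {
    "near": [0.9, 0.1, 0.9, 0.1],
    "far": [0.6, 0.5, 0.4, 0.5],
}
ranking = rank_by_f3(outgroup, target, candidates)
```

In this toy example the population whose frequencies have drifted from the outgroup in the same direction as the target ranks first.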

Pop 1      Pop 2         Pop 3       Pop 4        D          Z-score
Japanese   Kennewick  :  Han         ColvilleP     0.1877    42.009
Japanese   Kennewick  :  Han         Chipewyan     0.1757    47.559
Solomons   Kennewick  :  Han         Karitiana     0.1821    30.554
Solomons   Kennewick  :  Han         Anzick        0.1726    18.157
Solomons   Kennewick  :  Han         Chipewyan     0.1552    29.847
Han        Kennewick  :  Chukchis    Karitiana     0.1224    28.193
Han        Kennewick  :  Chukchis    Saqqaq       -0.0315    -4.356
Han        Solomons   :  Kennewick   Karitiana     0.0003     0.08
Han        Solomons   :  Kennewick   ColvilleP    -0.0004    -0.083
Han        Solomons   :  Kennewick   Chipewyan     0.0014     0.342
Han        Solomons   :  Kennewick   Anzick       -0.0082    -1.57
Han        Japanese   :  Kennewick   ColvilleP    -0.0011    -0.76
Han        Chukchis   :  Kennewick   Saqqaq       -0.0036    -0.84
Han        Chukchis   :  Kennewick   Karitiana     0.0005     0.195

Table 1: Summary of D-stat values for various possible trees relating the populations listed. The upper half (first seven rows) presents tree configurations clearly rejected by the data. The bottom half (last seven rows) presents tree configurations that are consistent with the data.
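The D values and Z-scores in Table 1 can be understood through a short sketch of the statistic in the form given by Patterson et al. (2012): a sum over SNPs of the product of allele-frequency differences within each pair, divided by a normalizing sum, with a jackknife over blocks of SNPs supplying the standard error. The code below is an illustration under simplifying assumptions, not the software used for the table: blocks are contiguous and equally sized rather than defined by genomic distance, the jackknife is unweighted, sign conventions differ between implementations, and the function names are hypothetical.

```python
import math


def d_statistic(p1, p2, p3, p4):
    """D for the proposed tree ((1, 2), (3, 4)) from allele frequencies per SNP."""
    num = sum((a - b) * (c - d) for a, b, c, d in zip(p1, p2, p3, p4))
    den = sum((a + b - 2 * a * b) * (c + d - 2 * c * d)
              for a, b, c, d in zip(p1, p2, p3, p4))
    return num / den


def d_with_z(p1, p2, p3, p4, n_blocks=5):
    """Return (D, Z) using a delete-one-block jackknife over contiguous blocks."""
    n = len(p1)
    size = n // n_blocks
    estimates = []
    for b in range(n_blocks):
        start = b * size
        stop = n if b == n_blocks - 1 else start + size
        keep = [i for i in range(n) if not start <= i < stop]
        subset = [[p[i] for i in keep] for p in (p1, p2, p3, p4)]
        estimates.append(d_statistic(*subset))
    mean = sum(estimates) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum((e - mean) ** 2 for e in estimates)
    d = d_statistic(p1, p2, p3, p4)
    return d, d / math.sqrt(variance)


# Made-up frequencies at ten SNPs in which population 1 resembles 3 and 2 resembles 4,
# so the proposed tree ((1, 2), (3, 4)) fits poorly and D is far from zero.
p1 = [0.9, 0.1, 0.8, 0.2, 0.9, 0.3, 0.7, 0.1, 0.8, 0.2]
p2 = [0.1, 0.9, 0.2, 0.8, 0.2, 0.8, 0.1, 0.9, 0.3, 0.7]
p3 = [0.8, 0.2, 0.9, 0.1, 0.7, 0.2, 0.9, 0.2, 0.9, 0.1]
p4 = [0.2, 0.8, 0.1, 0.9, 0.1, 0.9, 0.2, 0.7, 0.1, 0.8]
d_value, z_score = d_with_z(p1, p2, p3, p4)
```

A proposed tree is rejected when the magnitude of the Z-score is large, as in the first seven rows of Table 1, and is retained when D is close to zero relative to its standard error.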

Rare variant analysis. To further assess the ancestry of the Kennewick Man, we take a novel approach that is based on the geographic distribution patterns of rare variants. Most rare variants are the result of recent mutation events, and as a result they are geographically clustered to single regions of the globe and are highly indicative of ancestry. Here, we tabulate the geographic distribution of the rare variants carried by the Kennewick Man. This approach requires the use of global sequence data to define the rare variants. As a reference set we use the 1000 Genomes phase 3 data. This dataset does not contain Native Americans from North America. However, it contains samples with Native American ancestry, specifically Peruvians from Lima (PEL), Mexicans from Los Angeles (MXL), and Colombians from Medellin (CLM). These individuals exhibit varying degrees of admixture between Native American, European, and African ancestries. The Peruvians from Lima are known from previous studies to contain the most Native American ancestry (>75%, de Moura et al. 2016).

We find 60 globally rare variants that pass our QC filters and are found in the Kennewick Man. Figure 8 shows the number that are present in each 1000 Genomes population. The plot shows that rare variants present in the 1000 Genomes data and carried by the Kennewick Man are predominantly found among individuals with Native American ancestry. Specifically, the highest sharing is with Peruvians from Lima (PEL). Previous studies estimate that the PEL have 1.42–1.95-fold more Native American ancestry than Mexicans from Los Angeles (MXL) (de Moura et al. 2016), and this likely explains the nominally elevated sharing (44 vs 42 shared rare alleles in PEL vs MXL). A correction for the differing levels of Native American ancestry would likely show the MXL as having the highest level of sharing.

To provide a positive and negative control for this procedure, we also show sharing profiles for a Neolithic European sample (Lazaridis et al. 2014) and the Anzick-1 Clovis sample (Rasmussen et al. 2014) (Figures 9 and 10). For the Neolithic European sample the sharing is elevated with European populations (Tuscans from Italy (TSI), Spanish (IBS), CEPH study families from Utah (CEU), Great Britain (GBR)). For the Anzick-1 sample, the profile is largely similar to that of Kennewick. (Note: the raw numbers are higher for these samples than for the Kennewick sample because of differences in coverage.)
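The tabulation behind these sharing profiles can be sketched in a few lines. This is an illustration of the counting logic only: the frequency cutoff that defines "globally rare" and the QC filters used in this report are described in the appendix and are not reproduced here, so the threshold below, the data layout, and the function name are all hypothetical.

```python
from collections import Counter


def rare_variant_sharing(reference_counts, sample_variants,
                         total_chromosomes, max_global_freq=0.01):
    """Count, per reference population, the globally rare variants that are
    both present in that population and carried by the sample.

    reference_counts: {variant_id: {population: allele count}} from the panel.
    sample_variants: set of variant ids whose rare allele the sample carries.
    total_chromosomes: number of chromosomes in the whole reference panel.
    max_global_freq: hypothetical cutoff for calling a variant globally rare.
    """
    sharing = Counter()
    for variant, per_pop in reference_counts.items():
        global_freq = sum(per_pop.values()) / total_chromosomes
        if not 0 < global_freq <= max_global_freq:
            continue  # absent from the panel or too common to be informative
        if variant not in sample_variants:
            continue
        for pop, count in per_pop.items():
            if count > 0:
                sharing[pop] += 1
    return sharing


# Made-up panel counts for four variants in a panel of 5000 chromosomes.
panel = {
    "v1": {"PEL": 3, "MXL": 1},
    "v2": {"PEL": 2},
    "v3": {"CEU": 400, "PEL": 300},  # too common to count as rare
    "v4": {"TSI": 2},                # rare, but not carried by the sample
}
profile = rare_variant_sharing(panel, {"v1", "v2", "v3"}, total_chromosomes=5000)
```

Plotting the resulting counts per population gives a profile of the kind shown in Figures 8 to 10.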

Figure 8: Rare variant sharing profile of the Kennewick sample. The y-axis counts the number of globally rare variants in the 1000 Genomes Project found in each population and carried by the Kennewick sample. The highest levels of sharing for Kennewick are with populations from the Americas: CLM = Colombian from Medellin; PEL = Peruvian from Lima; PUR = Puerto Rican; MXL = Mexican from Los Angeles. The full list of codes is available in the phase 3 1000 Genomes Project Consortium paper (2015).

Figure 9: Rare variant sharing profile of an early Neolithic European sample from Stuttgart, Germany (Lazaridis et al. 2014). The highest level of sharing is with samples from Europe: Tuscans from Italy (TSI), Spanish (IBS), CEPH study families from Utah (CEU), Great Britain (GBR). Neolithic farmer ancestry is known to be enriched in southern Europe, which is consistent with TSI showing the highest levels of sharing.

Figure 10: Rare variant sharing profile of the Anzick-1 sample. The profile is very similar to that of the Kennewick sample (see above).

DISCUSSION OF RESULTS

The analyses carried out in the original Rasmussen et al. study use methods that are widely applied in the field of population genetics, and as such, the main concern in re-evaluating their work is to rule out the possibility of an error in the implementation of their analysis pipeline. Given the numerous steps that take place in a population genetic analysis, there is some probability of implementation errors. One prominent recent example is the published claim of back gene flow into Sub-Saharan Africa from outside of Africa on the basis of analysis of ancient DNA from an Ethiopian sample (Gallego Llorente et al. 2015). This claim has since been refuted, as it apparently arose from a bioinformatic error in the merging of data files (Callaway 2016). Thus, one of our main contributions here has been to carry out the major analysis steps of the Rasmussen et al. study independently.

Based on our results, we found no reason at any stage to suspect errors in the analysis pipeline carried out by Rasmussen et al. The precise numeric values we find differ in some cases, but the differences are, in our judgement, small and in line with the variance expected from different parameter choices in the initial read mapping steps and subsequent analyses. The analyses we re-implemented from Rasmussen et al. are all consistent with the Native American ancestry of the Kennewick Man.

One minor criticism may be the use by Rasmussen and colleagues of a filter on sequence read length of ≥30 bp. Several published analyses use a still more stringent criterion of a ≥35 bp length threshold.
Using longer read length thresholds can be beneficial because it removes reads that possibly come from microbes associated with the archaeological sample and/or reads that are human but map poorly to the reference genome. On the other hand, using longer length thresholds plausibly throws out informative reads and enriches the sample for potential contaminants, as modern human DNA reads are usually less damaged and hence longer (>100 bp). Here we used a threshold of >30 bp.

The choices regarding read lengths may explain the discrepancy among the contamination estimates. We obtained nuclear genome contamination estimates of 3–3.2%, while the Rasmussen et al. study obtained a nuclear estimate of 2.5%. For the mtDNA-based estimates, we found a 95% credible interval of 6–8.8%, and Rasmussen et al. provide an interval of 3.7–7.1%. Some variance between the nuclear and mtDNA contamination estimates is expected because of their distinct molecular properties (mtDNA exists in much higher copy number per cell than nuclear DNA, has a distinct structure, and is carried in mitochondrial organelles as opposed to the nucleus).

For the difference between studies, the discrepancy is relatively small and likely reflects the read length thresholds used by the original study and by our follow-up. The X-chromosome-based estimation of contamination is designed such that having more microbial reads included (as might be expected with the less stringent threshold of ≥30 bp) would bias the estimated contamination rate downwards, consistent with the difference observed. The mtDNA-based estimation of contamination should be unaffected by the inclusion of additional microbial reads. A second effect is that, because short reads are more likely to be endogenous than contaminant, shorter read length thresholds should result in read collections that have a lower proportion of contaminating reads. This should cause both the X-based and mtDNA-based contamination estimates to move downwards when shorter reads are included, as observed.
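The second effect described above can be illustrated with a toy calculation. The sketch below uses invented read lengths (not data from this study) and a hypothetical helper, contaminant_fraction, to show that when contaminant reads tend to be longer than endogenous reads, a lower minimum length threshold yields a lower contaminant share among the retained reads. It is only an arithmetic illustration and is not the estimator used by contamMix or angsd.

```python
def contaminant_fraction(endogenous_lengths, contaminant_lengths, min_length):
    """Share of retained reads (length >= min_length) that are contaminants."""
    kept_endogenous = sum(1 for n in endogenous_lengths if n >= min_length)
    kept_contaminant = sum(1 for n in contaminant_lengths if n >= min_length)
    total = kept_endogenous + kept_contaminant
    if total == 0:
        raise ValueError("no reads pass the length threshold")
    return kept_contaminant / total


# Hypothetical read lengths in bp, chosen only for illustration:
# short, damaged endogenous reads and long modern contaminant reads.
endogenous = [30, 31, 32, 33, 34, 36, 38, 40, 45, 50]
contaminant = [90, 110]

frac_30 = contaminant_fraction(endogenous, contaminant, 30)  # keeps all 12 reads
frac_35 = contaminant_fraction(endogenous, contaminant, 35)  # drops 5 short endogenous reads
```

With these invented inputs, the stricter threshold discards only endogenous reads, so the contaminant share of what remains goes up.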

As we carried out our analysis we were especially careful to rule out that modern Native American DNA has contaminated the sample. Such contamination would be more challenging to detect than contamination from more distant sources, such as Europeans. For this reason, the assessment of whether the bulk of the reads truly represent ancient DNA is important. The damage patterns we observed are typical of aDNA and, together with the analyses of mtDNA and X chromosome reads, argue against the possibility of extensive modern Native American contamination driving the results.

To add a new dimension to the assessment of the Kennewick sample's ancestry, we assessed rare variant sharing patterns. This analysis identifies globally rare variants and asks which ones are carried by the Kennewick Man. These variants are typically informative of ancestry because rare variants cluster geographically. They are also not typically included on genotyping arrays and so would have little impact on the PCA, ADMIXTURE, f3, and D-statistic analyses. While our implementation of this approach relied on the 1000 Genomes data, which unfortunately contains poor representation of Native American populations from North America, the patterns of rare variant sharing we found are consistent with Native American ancestry. For example, the profile seen for Kennewick is similar to that seen in the Anzick sample and distinct from that seen in ancient European samples.

Taken together, our analyses of mtDNA, of variants that overlap genotyping arrays (PCA, ADMIXTURE, f3, and D-statistic analyses), and of rare variants provide evidence of ancestry from three non-overlapping subsets of the sequence read data. Each is consistent with Native American ancestry.

In sum, our analyses here, based on the human sequence reads associated with the Kennewick Man provided in the Short Read Archive, all point to the Native American ancestry of the sample. Our results mitigate any concern over errors by Rasmussen et al. because, while we used some of the same software used by Rasmussen et al. to process and analyze the data, the pipelines used here were independently established. We also added a new method based on rare variants and again found evidence for Native American ancestry.

REFERENCES
• 1000 Genomes Project Consortium, Adam Auton, Lisa D. Brooks, Richard M. Durbin, Erik P. Garrison, Hyun Min Kang, Jan O. Korbel, et al. 2015. A Global Reference for Human Genetic Variation. Nature 526 (7571): 68-74.

• Alexander, David H., John Novembre, and Kenneth Lange. 2009. Fast Model-Based Estimation of Ancestry in Unrelated Individuals. Genome Research 19 (9): 1655-64.

• Callaway, Ewen. 2016. Error Found in Study of First Ancient African Genome. Nature News. Accessed March 12. doi:10.1038/nature.2016.19258.

• de Moura, Ronald R., Valdir de Queiroz Balbino, Sergio Crovella, and Lucas A. C. Brandão. 2016. On the Use of Chinese Population as a Proxy of Amerindian Ancestors in Genetic Admixture Studies with Latin American Populations. European Journal of Human Genetics: EJHG 24 (3): 326-27.

• Fu, Qiaomei, Alissa Mittnik, Philip L. F. Johnson, Kirsten Bos, Martina Lari, Ruth Bollongino, Chengkai Sun, et al. 2013. A Revised Timescale for Human Evolution Based on Ancient Mitochondrial Genomes. Current Biology: CB 23 (7): 553-59.

• Gallego Llorente, M., E. R. Jones, A. Eriksson, V. Siska, K. W. Arthur, J. W. Arthur, M. C. Curtis, et al. 2015. Ancient Ethiopian Genome Reveals Extensive Eurasian Admixture throughout the African Continent. Science 350 (6262): 820-22.

• Jónsson, Hákon, Aurélien Ginolhac, Mikkel Schubert, Philip L. F. Johnson, and Ludovic Orlando. 2013. mapDamage2.0: Fast Approximate Bayesian Estimates of Ancient DNA Damage Parameters. Bioinformatics 29 (13): 1682-84.

• Korneliussen, Thorfinn Sand, Anders Albrechtsen, and Rasmus Nielsen. 2014. ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics 15: 356.

• Lazaridis, Iosif, Nick Patterson, Alissa Mittnik, Gabriel Renaud, Swapan Mallick, Karola Kirsanow, Peter H. Sudmant, et al. 2014. Ancient Human Genomes Suggest Three Ancestral Populations for Present-Day Europeans. Nature 513 (7518): 409-13.

• Patterson, Nick, Alkes L. Price, and David Reich. 2006. Population Structure and Eigenanalysis. PLoS Genetics 2 (12): e190.

• Patterson, Nick, Priya Moorjani, Yontao Luo, Swapan Mallick, Nadin Rohland, Yiping Zhan, Teri Genschoreck, Teresa Webster, and David Reich. 2012. Ancient Admixture in Human History. Genetics 192: 1065-93.

• Raghavan, Maanasa, Matthias Steinrücken, Kelley Harris, Stephan Schiffels, Simon Rasmussen, Michael DeGiorgio, Anders Albrechtsen, et al. 2015. Genomic Evidence for the Pleistocene and Recent Population History of Native Americans. Science 349 (6250): aab3884.

• Rasmussen, Morten, Sarah L. Anzick, Michael R. Waters, Pontus Skoglund, Michael DeGiorgio, Thomas W. Stafford Jr, Simon Rasmussen, et al. 2014. The Genome of a Late Pleistocene Human from a Clovis Burial Site in Western Montana. Nature 506 (7487): 225-29.

• Rasmussen, Morten, Xiaosen Guo, Yong Wang, Kirk E. Lohmueller, Simon Rasmussen, Anders Albrechtsen, Line Skotte, et al. 2011. An Aboriginal Australian Genome Reveals Separate Human Dispersals into Asia. Science 334 (6052): 94-98.

• Rasmussen, Morten, Martin Sikora, Anders Albrechtsen, Thorfinn Sand Korneliussen, J. Víctor Moreno-Mayar, G. David Poznik, Christoph P. E. Zollikofer, et al. 2015. The Ancestry and Affiliations of Kennewick Man. Nature 523 (7561): 455-58.

• Reidla, Maere, Toomas Kivisild, Ene Metspalu, Katrin Kaldma, Kristiina Tambets, Helle-Viivi Tolk, Jüri Parik, et al. 2003. Origin and Diffusion of mtDNA Haplogroup X. American Journal of Human Genetics 73 (5): 1178-90.

• Vianello, Dario, Federica Sevini, Gastone Castellani, Laura Lomartire, Miriam Capri, and Claudio Franceschi. 2013. HAPLOFIND: A New Method for High-Throughput mtDNA Haplogroup Assignment. Human Mutation 34 (9): 1189-94.

APPENDIX: ADDITIONAL DETAIL ON METHODS
Read processing and mapping: We used cutadapt-1.8.3 to trim sequence adapters from single end reads with the following command line:

    cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -o

For paired end reads, we used the python script MergeReadsFastQ_cc.py (version 14.06.2009) to trim both forward and reverse adapters and to merge the reads.

    python ./MergeReadsFastQ_cc.py -f AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -s AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -r 82 <input fastq file>

Only those sets of paired end reads with overlapping forward and reverse reads that could be merged into a single contiguous read were retained for further downstream analyses. We used the program fastqc to assess raw sequence read quality both before and after adapter trimming.

Single end reads and merged paired end reads were mapped and filtered in a similar manner. We used bwa (0.7.5a) to align the trimmed reads to the human reference sequence (hg19). The following bwa command line was used for each of the FASTQ files:

    bwa aln -n 0.01 -o 2 -l 9999 hg19_all_contigs.fa

We also used bwa to convert the aligned single end read file to a SAM formatted file using the following command:

    bwa samse -f -r 'read group hg19_all_contigs.fa <fastq file>

Uniquely mapped reads were extracted from the SAM files using the following linux command:

    grep "XT:A:U" >

Mapped reads were also filtered for mapping quality and length. We removed PCR duplicates from the completely merged set of uniquely mapped and filtered reads using picard-tools-1.133 and java from jdk1.8.0_45 with the following command line:

    java -Xmx12g -jar picard.jar MarkDuplicates REMOVE_DUPLICATES=true INPUT=<input BAM file> OUTPUT= ASSUME_SORTED=true METRICS_FILE= metrics.txt VALIDATION_STRINGENCY=LENIENT

For post-processing of BAM files, we used samtools version 1.2.
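The read filtering described above (unique mapping, mapping quality, length) can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline actually used: the function name passes_filters is hypothetical, and the default thresholds (mapping quality >= 30, length > 30) are an assumption borrowed from the mapDamage settings reported later in this appendix. The XT:A:U tag is the one targeted by the grep command above.

```python
def passes_filters(sam_line, min_mapq=30, length_above=30):
    """Return True for a uniquely mapped SAM record passing both thresholds.

    Assumed thresholds: MAPQ >= min_mapq and read length > length_above.
    """
    if sam_line.startswith("@"):
        return False  # header line, not an alignment record
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return False  # malformed record: SAM has 11 mandatory fields
    flag = int(fields[1])
    if flag & 0x4:
        return False  # read is unmapped
    mapq = int(fields[4])
    read_length = len(fields[9])
    # bwa aln/samse marks uniquely mapped reads with the optional tag XT:A:U
    is_unique = "XT:A:U" in fields[11:]
    return is_unique and mapq >= min_mapq and read_length > length_above


def make_record(mapq, length, xt_tag):
    """Build a toy SAM record for demonstration (all values are hypothetical)."""
    return "\t".join(["read1", "0", "chr1", "100", str(mapq), f"{length}M",
                      "*", "0", "0", "A" * length, "I" * length, xt_tag])


kept = passes_filters(make_record(37, 40, "XT:A:U"))
repeat_hit = passes_filters(make_record(37, 40, "XT:A:R"))
too_short = passes_filters(make_record(37, 30, "XT:A:U"))
low_quality = passes_filters(make_record(20, 40, "XT:A:U"))
```

Checking the tag field by field, rather than matching the substring anywhere on the line, avoids accidental matches in read names or quality strings.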
Merging Kennewick, Anzick and Raghavan genotype data: For the Kennewick and Anzick samples, we generated pileup files at covered sites in the Raghavan SNP data set using the program samtools and the following command line:

    samtools -Q 30 -l <site file> -O -f hg19_all_contigs.fa > <output pileup file>

Sites that were either C-to-T or G-to-A transitions were removed. For each site, the base from a randomly selected read was converted to a homozygous genotype.

Assessing DNA damage patterns: We used the mapDamage 2.0.6 package (Jónsson et al. 2013) with the following command line:

    mapDamage -i -r hg19_all_contigs.fa

MapDamage was run on uniquely aligned reads with mapping quality >= 30 and fragment length > 30.

mtDNA haplogroup assignment: We used the package HaploFind (Vianello et al. 2013). We also found the same result using the analysis software provided at http://dna.jameslick.com/mthap/.

Contamination: For the mtDNA contamination estimation steps we used the contamMix package (Fu et al. 2013). For the X chromosome based estimation we used the routines in angsd (Korneliussen, Albrechtsen, and Nielsen 2014) that are based on methods developed earlier (Rasmussen et al. 2011). We used the following command lines:

    angsd -i $BAM -r chrX:5000000-154900000 -doCounts 1 -iCounts 1 -minMapQ 30 -minQ 20 -out contam.est
    contamination -a contam.est.icnts.gz -h $ANGSD_DIR/RES/HapMapChrX.gz

The range of estimates reported above is based on the results from using Method 2 (the random sampling of reads method), which is less biased than the alternative (assuming independence of reads). Specifically, we report the interval spanning the method-of-moments point estimate and the maximum-likelihood estimate; because the standard errors on either point estimate are negligible, this provides a more conservative credible interval.

PCA: We performed principal component analysis using the program smartpca in the EIGENSOFT version 5.01 software package. The eigenvectors were computed using the data from all samples in the Raghavan data set, and the Kennewick and Anzick samples were projected onto those eigenvectors.

ADMIXTURE: We performed admixture analysis using the program ADMIXTURE (version 1.22), using the following command line:

    admixture -j 4 --cv

ADMIXTURE was run on the set of genotypes from all worldwide samples in the Raghavan data set and the constructed genotypes for the Kennewick and Anzick-1 samples. All values of K, the number of ancestral populations, from 6 to 15 were tried.
K=14 was found to be the value that minimized the cross validation error and thus was chosen for the final analysis. Admixture analysis was then performed on the Kennewick, Anzick-1, and only the subset of Raghavan samples from the Americas with less than 5% European ancestry, as determined from the prior worldwide admixture analysis. This excluded all but two of the Colville samples. For the admixture analysis on just the American populations, all values of K from 6 to 9 were tried and the value K=8 was found to minimize the cross validation error.

F-statistics: We computed outgroup f3 and D statistics using the programs qp3Pop and qpDstat, respectively, in the software package ADMIXTOOLS version 1.0.

Rare variant sharing: We filtered the 1000 Genomes variant calls to:

• include bi-allelic variants with global minor allele frequency (MAF) such that 0.001

• remove sites that are C-to-T or G-to-A transitions

• include sites in the Kennewick Man overlapping with sites that pass the two filters above, with total coverage between 5 and 20, and where at least half the alleles matched the global minor allele.

There were 64 such variants. Using the jsonlite package in R, we downloaded the frequency data for the 1000 Genomes data using the frequency API of the Novembre Lab's Geography of Genetic Variants browser (accessible via calls such as: "http://popgen.uchicago.edu/ggv_api/freq_table?data=\"1000genomes_phase3_table\"&chr=19&pos=42098314"). Frequencies were available for 60 of the variants, and for those the presence/absence of the variants in each of the populations was tallied and plotted. (Note that 4 of the variants had missing frequencies due to the 1000 Genomes version used in the GGV.)
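The site filter applied to the Kennewick reads and the presence/absence tally can be sketched as follows. This is a hedged illustration under stated assumptions, not the code used for the report: keep_site and tally_presence are hypothetical names, the coverage bounds are treated as inclusive, a transition site is recognised from its unordered pair of alleles, and "present" is taken to mean a population frequency greater than zero. The population codes follow 1000 Genomes conventions, but the frequencies are invented.

```python
from collections import Counter

# Allele pairs affected by typical ancient DNA damage (C-to-T, G-to-A).
DAMAGE_PRONE = ({"C", "T"}, {"G", "A"})


def keep_site(ref, alt, minor_allele, observed_bases, min_cov=5, max_cov=20):
    """Apply the Kennewick site filter to the bases observed at one site."""
    if {ref, alt} in DAMAGE_PRONE:
        return False  # transition site that damage could mimic
    coverage = len(observed_bases)
    if not (min_cov <= coverage <= max_cov):
        return False  # coverage bounds assumed inclusive
    matches = sum(1 for base in observed_bases if base == minor_allele)
    return 2 * matches >= coverage  # at least half match the minor allele


def tally_presence(freqs_by_variant):
    """Count, per population, the variants with frequency greater than zero."""
    counts = Counter()
    for freqs in freqs_by_variant.values():
        for population, frequency in freqs.items():
            if frequency > 0:
                counts[population] += 1
    return counts


# Hypothetical frequencies for two variants in two populations.
example = {
    "variant_1": {"PEL": 0.02, "CEU": 0.0},
    "variant_2": {"PEL": 0.01, "CEU": 0.003},
}
presence = tally_presence(example)
```

Plotting the resulting per-population counts gives the kind of sharing profile compared between the Kennewick, Anzick, and ancient European samples.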
