Archaic adaptive introgression in TBX15/WARS2 Supplementary Materials

Fernando Racimo1*^, David Gokhman2, Matteo Fumagalli3, Amy Ko1, Torben Hansen4, Ida Moltke5, Anders Albrechtsen5, Liran Carmel2, Emilia Huerta-Sánchez6, Rasmus Nielsen1,7,8*

Affiliations: 1 Department of Integrative Biology, University of California Berkeley, Berkeley, CA 94720, USA. 2 Department of Genetics, The Alexander Silberman Institute of Life Sciences, Faculty of Science, The Hebrew University of Jerusalem, Edmond J. Safra Campus, Givat Ram, Jerusalem 91904, Israel. 3 Department of Genetics, Evolution, and Environment, University College London, London WC1E 6BT, UK. 4 The Novo Nordisk Foundation Center for Basic Metabolic Research, Section of Metabolic Genetics, Faculty of Health and Medical Sciences, University of Copenhagen, 2100 Copenhagen, Denmark. 5 The Bioinformatics Centre, Department of Biology, University of Copenhagen, 2200 Copenhagen, Denmark. 6 School of Natural Sciences, University of California Merced, Merced, CA 95343, USA. 7 Department of Statistics, University of California Berkeley, Berkeley, CA 94720, USA. 8 Museum of Natural History, University of Copenhagen, 2100 Copenhagen, Denmark. *Correspondence to: FR ([email protected]) and R.N. ([email protected]). ^Now at: New York Genome Center, 101 Ave of the Americas, New York, NY 10013.

Supplementary Figures

Figure S1. fD plotted as a function of the denominator in the D statistic, (ABBA+BABA), for all genes with coverage in SNP chip data, plus a 20 kbp region on either side of the gene. Points in red are the candidate genes for selection in the GI selection scan (Fumagalli, et al. 2015). The blue circles denote the TBX15 and WARS2 genes. The solid and dashed lines denote the locations in the plot where fD is equal to -1, 0 and 1.

We only show values of fD that are larger than -3 and smaller than 3. fD (X) = fD(Yoruba,GI,X,Chimpanzee).

Figure S2. fD plotted as a function of the denominator in the D statistic, (ABBA+BABA), for all candidate genes for selection in the SNP chip data of the GI selection scan (Fumagalli, et al. 2015), plus a 20 kbp region on either side of the gene. The blue circles denote the TBX15 and WARS2 genes. The solid and dashed lines denote the locations in the plot where fD is equal to -1, 0 and 1.

Figure S3. East Asian, South Asian, American and European phased chromosomes from the 1000 Genomes Project are inferred to contain archaic haplotypes (blue tracts, obtained via an HMM), which likely came from an individual that was more closely related to the Denisovan than to the Altai Neanderthal in this region. The tract coincides with the peak of PBS scores obtained in the Greenlander scan for positive selection (red dots). The frequency of the archaic haplotype is higher among East Asians and Americans than among South Asians and Europeans. The green dashed lines denote the limits of the putatively introgressed haplotype.

Figure S4. East Asian, South Asian, American and European phased chromosomes from the 1000 Genomes Project are inferred to contain archaic haplotypes (blue segments), which coincide with the PBS peak in the Greenlander positive selection scan (red dots). Here, instead of using Denisova as the archaic source, we used the Altai Neanderthal, which results in introgressed fragments of shorter size. The green dashed lines denote the limits of the inferred tract when using Denisova as the source.

Figure S5. Recombination rate in the TBX15/WARS2 region estimated from the HapMap genetic distance map (International-HapMap-Consortium 2005). The red dashed lines denote the introgressed tract as inferred by the HMM. The blue dashed lines denote the introgressed tract as inferred by the CRF.

Figure S6. Principal component analysis plots for individuals of Yoruba (YRI, purple), Central European (CEU, red), Han Chinese (CHB, green) and various Latin American (light blue) panels from the 1000 Genomes Project that were used in this study. To represent Native Americans in our selection analyses, we analyzed individuals from Peru (PEL) and Colombia (CLM) that had high proportions of Native American ancestry, as determined by ngsAdmix (Skotte, et al. 2013).

[Figure S7 plot: three panels showing FST (y-axis, 0.0 to 1.0) against position on chromosome 1 (x-axis, 119,450,000 to 119,750,000). Legend: IBS, TSI, FIN, GBR (left panel); JPT, CHS, CDX, KHV (middle panel); GIH, PJL, BEB, STU, ITU (right panel).]

Figure S7. We computed local FST values around the putatively introgressed haplotype using a sliding-window approach, with a window size of 20 kbp and a step size of 2 kbp. Values on the y-axis are scaled as in Figure 5. We performed this analysis by comparing 1000 Genomes Project panels of European (EUR, left panel), East Asian (EAS, middle panel) and South Asian (SAS, right panel) ancestry with samples of African Yoruba descent (YRI). The boundaries of the introgressed haplotype as identified by the HMM are indicated by vertical dotted lines, and the boundaries as defined by the CRF are indicated by vertical dashed lines.
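The windowing scheme described for Figure S7 (20 kbp windows moved in 2 kbp steps) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, windows are treated as half-open intervals, and the plain mean of per-SNP values is only a placeholder because the caption does not say which FST estimator was used.

```python
from typing import Iterator, List, Optional, Sequence, Tuple


def sliding_windows(region_start: int, region_end: int,
                    window_size: int = 20_000,
                    step: int = 2_000) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) windows fully contained in the region."""
    start = region_start
    while start + window_size <= region_end:
        yield start, start + window_size
        start += step


def window_means(positions: Sequence[int], values: Sequence[float],
                 windows: Sequence[Tuple[int, int]]) -> List[Optional[float]]:
    """Average the per-SNP values in each window (None if a window is empty).

    Assumption: a simple mean stands in for whatever windowed FST
    estimator was actually used.
    """
    means: List[Optional[float]] = []
    for start, end in windows:
        inside = [v for p, v in zip(positions, values) if start <= p < end]
        means.append(sum(inside) / len(inside) if inside else None)
    return means
```

For example, `list(sliding_windows(119_450_000, 119_750_000))` enumerates the windows over the plotted range of chromosome 1.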

Figure S8. We aimed to determine whether the value of FST between CHB and PEL in the TBX15/WARS2 region was consistent with ongoing selection in Native Americans after the crossing of Beringia, or could just be explained by a single bout of selection in the ancestral population of Greenlandic Inuit and Native Americans. We simulated demographic scenarios for CHB, PEL and GI with different GI-AMR population split times (x-axis) and AMR effective population sizes (y-axis), as well as different strengths of selection in AMR after the split from GI (S = 2Ns). The squares with light blue dots correspond to the parameter combinations that result in a value of FST between CHB and PEL that is as extreme as that observed in the TBX15/WARS2 region.

Figure S9. Upper-left panel: genome-wide distribution of divergences between the Altai Neanderthal genome and the Mezmaiskaya Neanderthal genome, computed only on the Altai side of the tree, and scaled by the divergence between the human reference and the human-chimpanzee ancestor, for regions of the genome of equal size as the introgressed haplotype. The red line shows the scaled divergence between a genome that is homozygous for the introgressed haplotype (HG00436) and the Altai Neanderthal genome, computed only on the Altai side of the tree, and the quantile in which it falls within the distribution. Upper-right panel: genome-wide distribution of divergences between the Denisova genome and the Altai Neanderthal genome, again scaled by the reference-ancestor divergence. The red and blue lines denote the scaled divergence between the introgressed haplotype and Denisova and Altai Neanderthal, respectively. Lower-left panel: genome-wide distribution of scaled divergences between the Denisovan genome and a high-coverage Yoruba genome (HGDP00936). The red and green lines show the scaled divergence between the introgressed haplotype and the Denisovan and Yoruba genomes, respectively. Lower-right panel: genome-wide distribution of scaled divergences between the Altai Neanderthal genome and HGDP00936. The blue and green lines show the scaled divergence between the introgressed haplotype and the Altai Neanderthal and Yoruba genomes, respectively.

Figure S10. We simulated two populations that split at different times (100,000, 300,000 or 450,000 years), under a range of effective population sizes (1,000, 2,500, 5,000), based on estimates for the Neanderthal/Denisova population sizes (Prüfer, et al. 2014). We compared these simulations to the divergence between the Altai Neanderthal genome and a human genome that is homozygous for the introgressed tract (HG00436) (red line), as well as to the divergence between the same human genome and the Denisovan genome (blue line), and the divergence between the same human genome and a Yoruba genome (HGDP00936) (green line).

Figure S11. Hamming distance between Denisova haplotype 24 and other archaic and present-day human haplotypes, including the other Denisova haplotype (23), the Altai Neanderthal haplotypes (21 and 22) and the 20 most abundant present-day human haplotypes from the 1000 Genomes Project. The colors indicate the proportion of individuals from each human population that carries a given haplotype. AFR: Africans. AMR: Americans. EAS: East Asians. EUR: Europeans. SAS: South Asians.

Figure S12. We compared the divergences between Denisova, Altai Neanderthal and the closest haplotype from each of the clusters in the TBX15/WARS2 network to their corresponding values in EPAS1. A. Absolute number of differences. B. Sequence divergence (accounting for differences in length between the introgressed tracts in the two regions). We only used sites that were segregating in present-day humans, to make a fair comparison with published data for EPAS1.

Figure S13. We computed the divergence between the human reference genome and the inferred human-chimpanzee ancestor (Paten, et al. 2008) in 40 kb windows with a 10 kb overlap along a 1 Mb region surrounding the TBX15/WARS2 genes. The red dashed lines denote the tract as inferred by the HMM. The blue lines denote the tract as inferred by the CRF. The dark green dashed line is the average divergence over all windows in the region.

[Figure S14 plot: eight panels (HIPadjBMI, HIP, WCadjBMI, WC, WHRadjBMI, WHR, Height and BMI). In each panel the y-axis shows -log10(P-value), from 0 to 30, and the x-axis shows position (hg19), from 119,000,000 to 120,000,000, with the locations of TBX15 and WARS2 marked. Legend: none conditional; conditional on rs984222; top SNP; top SNP (conditional).]

Figure S14. We queried the GIANT data and found that the introgressed tract lies in the middle of an association peak for BMI-adjusted waist circumference, waist-hip ratio and BMI-adjusted waist-hip ratio. Orange dots represent -log10(P-values) of SNPs that are top PBS hits in the Greenlandic Inuit SNP-chip selection scan (Table 1), while blue dots represent -log10(P-values) of all other SNPs in the region. We also performed a conditional association analysis by conditioning on the SNP with the lowest P-value in the region (rs984222). We show the conditional -log10(P-values) of the SNPs in Table 1 as light green triangles, and of all other SNPs as dark green triangles. HIPadjBMI = BMI-adjusted hip circumference. HIP = hip circumference. WCadjBMI = BMI-adjusted waist circumference. WC = waist circumference. WHRadjBMI = BMI-adjusted waist-hip ratio. WHR = waist-hip ratio.

Supplementary Tables

chr  start      end        gene      ABBA (Nea)  ABBA (Den)  BABA (Nea)  BABA (Den)  D (Nea)  D (Den)  fD (Nea)  fD (Den)
1    119395669  119562179  TBX15     8.83        17.61       13.94       2.47        -0.22    0.75     -0.16     0.49
1    119543839  119713294  WARS2     9.48        18.46       1           2.5         0.81     0.76     0.31      0.48
1    234010679  234490262  SLC35F3   0.61        0.61        0.15        0.41        0.61     0.2      0.17      0.07
4    10411498   10489034   ZNF518B   0           0           0           0           NA       NA       NA        NA
5    148491046  148670105  ABLIM3    0           0.5         0           0.5         NA       0        0         0
6    7511808    7616950    DSP       0           0           0           0           NA       NA       NA        NA
8    143923772  143991262  CYP11B1   0           0           0           0           NA       NA       NA        NA
11   61246272   61308482   LRRC10B   0           0           0           0           1        1        0         0
11   61252785   61378620   SYT7      0           0           0           0           1        1        0         0
11   61530452   61664826   FADS2     4.98        2.87        5.56        3.4         -0.05    -0.08    -0.06     -0.06
11   61537099   61626790   FADS1     3.01        1.89        5.48        2.33        -0.29    -0.11    -0.45     -0.1
11   61610991   61689523   FADS3     2.8         2.1         2.07        2.4         0.15     -0.07    0.14      -0.06
12   14926506   15089520   C12orf60  0           0           0           0           NA       NA       NA        NA
12   111254573  111375339  CCDC63    1.98        2.49        3.38        3.88        -0.26    -0.22    -0.25     -0.12
12   111318623  111388526  MYL2      2.92        2.49        3.44        3.88        -0.08    -0.22    -0.07     -0.12
17   80644559   80718204   FN3KRP    4.54        7.97        4.28        0.03        0.03     0.99     0.02      0.6
17   80663451   80739073   FN3K      5.04        9.7         5.6         0.62        -0.05    0.88     -0.03     0.54
19   10173014   10243472   ANGPTL6   0           0           0           0           NA       NA       NA        NA
19   16577122   16662163   C19orf44  0           0           0           0           NA       NA       NA        NA
19   16598700   16683341   CHERP     0           0           0           0           NA       NA       NA        NA
20   44432449   44501914   SNX21     0           0           0           0           NA       NA       NA        NA
22   40410821   40761811   TNRC6B    1.1         1.1         0           0           1        1        0.41      0.41
22   40712507   40794823   ADSL      0.92        0.92        0           0           1        1        0.92      0.92

Table S1. Differential archaic ancestry statistics calculated, using the SNP chip data, on the candidate genes for selection in the GI selection scan (Fumagalli, et al. 2015). D(X) = D(Yoruba,Greenlandic Inuit,X,Chimp). fD(X) = fD(Yoruba,Greenlandic Inuit,X,Chimp). Values with “NA” denote genes where a specific statistic has a denominator equal to 0.
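The D columns of Table S1 can be recomputed from the ABBA and BABA columns. The sketch below assumes the standard definition D = (ABBA - BABA) / (ABBA + BABA), whose denominator matches the one named in the Figure S1 caption; the function name is illustrative, and returning `None` stands in for the "NA" entries. fD is not reproduced here because it needs additional site-pattern terms that are not listed in the table.

```python
from typing import Optional


def d_statistic(abba: float, baba: float) -> Optional[float]:
    """Return (ABBA - BABA) / (ABBA + BABA), or None when the denominator is 0."""
    denominator = abba + baba
    if denominator == 0:
        return None  # reported as "NA" in the table
    return (abba - baba) / denominator


# TBX15 row of Table S1: Neanderthal and Denisovan site-pattern counts
print(round(d_statistic(8.83, 13.94), 2))   # D (Nea)
print(round(d_statistic(17.61, 2.47), 2))   # D (Den)
```

Rows whose counts are printed as 0 but whose D value is 1 (for example LRRC10B) presumably have small nonzero counts that were rounded for display, so they cannot be recomputed from the table alone.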

Panel pair   FST        Percentile
YRI - CEU    0.103448   0.2101975
YRI - CHB    0.491954   0.01697042
YRI - PEL    0.915361   6.71722e-05
YRI - MXL    0.851473   0.0001399713
CEU - CHB    0.282759   0.04307482
CEU - PEL    0.777674   9.332711e-05
CEU - MXL    0.670402   5.574053e-05
CHB - PEL    0.336782   0.02118221
CHB - MXL    0.185698   0.06273516
PEL - MXL    0.0174175  0.2312871

Table S2. FST values for different 1000 Genomes panel pairs at a SNP that tags the introgressed haplotype (rs2298080). We also show the percentile of the value relative to the genome-wide FST distribution.

TABLE S3 IS ATTACHED SEPARATELY.

Table S3. DNA methylation levels within WARS2 and TBX15 for 62 present-day individuals, as well as the Denisovan and the Neanderthal.

Tissue                                     log10(RPKM) > 0?  P-Value  Effect Size
Adipose - Subcutaneous                     YES               0.12     0.089
Adipose - Visceral (Omentum)               YES               0.18     0.11
Adrenal Gland                              NO                0.19     0.22
Artery - Aorta                             NO                0.22     0.12
Artery - Coronary                          YES               0.58     -0.077
Artery - Tibial                            YES               0.093    0.12
Brain - Amygdala                           NO                0.4      -0.18
Brain - Anterior cingulate cortex (BA24)   NO                0.67     -0.071
Brain - Caudate (basal ganglia)            NO                0.34     -0.15
Brain - Cerebellar Hemisphere              NO                0.84     0.039
Brain - Cerebellum                         NO                0.89     0.021
Brain - Cortex                             NO                0.86     -0.026
Brain - Frontal Cortex (BA9)               NO                0.12     -0.23
Brain - Hippocampus                        NO                0.99     -0.0019
Brain - Hypothalamus                       NO                0.008    -0.36
Brain - Nucleus accumbens (basal ganglia)  NO                0.19     -0.25
Brain - Putamen (basal ganglia)            NO                0.83     0.032
Brain - Substantia nigra                   NO                0.43     -0.25
Breast - Mammary Tissue                    YES               0.22     -0.083
Cells - EBV-transformed lymphocytes        YES               0.089    0.14
Colon - Transverse                         NO                0.14     0.19
Esophagus - Muscularis                     NO                0.17     0.13
Heart - Atrial Appendage                   YES               0.1      0.17
Heart - Left Ventricle                     NO                0.24     0.12
Liver                                      YES               0.55     0.086
Lung                                       NO                0.009    -0.25
Muscle - Skeletal                          YES               0.24     -0.054
Nerve - Tibial                             YES               0.22     0.06
Ovary                                      NO                0.18     0.29
Pancreas                                   NO                0.31     0.16
Pituitary                                  NO                0.57     -0.098
Prostate                                   NO                0.0073   0.52
Skin - Not Sun Exposed (Suprapubic)        YES               0.87     0.0094
Skin - Sun Exposed (Lower leg)             YES               0.52     0.024
Spleen                                     NO                0.58     -0.13
Stomach                                    NO                0.12     0.16
Testis                                     NO                0.00018  0.45
Thyroid                                    YES               0.22     0.078
Uterus                                     NO                0.25     0.23
Vagina                                     NO                0.64     -0.088
Whole Blood                                NO                0.67     -0.031

Table S4. Cis-eQTL p-values and beta (effect size) values retrieved from the test performed between rs2298080 and TBX15, in the GTEx pilot data, across different tissues. The only tissue with a P-value of less than (0.05 / number of tissues) is testis, but the overall expression of TBX15 in this tissue is very low (log10(RPKM) < 0). Effect sizes are with respect to the selected allele in Greenlandic Inuit.

Tissue                                     log10(RPKM) > 0?  WARS2 P-Value  WARS2 Effect Size
Adipose - Subcutaneous                     YES               0.000098       -0.31
Adipose - Visceral (Omentum)               YES               0.04           -0.2
Adrenal Gland                              YES               0.000066       -0.53
Artery - Aorta                             YES               0.00021        -0.33
Artery - Coronary                          YES               0.0063         -0.35
Artery - Tibial                            YES               0.000035       -0.36
Brain - Amygdala                           YES               0.16           -0.26
Brain - Anterior cingulate cortex (BA24)   YES               0.2            -0.26
Brain - Caudate (basal ganglia)            YES               0.044          -0.35
Brain - Cerebellar Hemisphere              YES               0.081          -0.36
Brain - Cerebellum                         YES               0.0034         -0.47
Brain - Cortex                             YES               0.094          -0.3
Brain - Frontal Cortex (BA9)               YES               0.012          -0.46
Brain - Hippocampus                        YES               0.15           -0.26
Brain - Hypothalamus                       YES               0.19           -0.25
Brain - Nucleus accumbens (basal ganglia)  YES               0.099          -0.25
Brain - Putamen (basal ganglia)            YES               0.93           -0.021
Brain - Substantia nigra                   YES               0.17           -0.33
Breast - Mammary Tissue                    YES               0.028          -0.25
Cells - EBV-transformed lymphocytes        YES               0.47           0.098
Colon - Transverse                         YES               0.29           -0.12
Esophagus - Muscularis                     YES               0.00077        -0.31
Heart - Atrial Appendage                   YES               0.04           -0.25
Heart - Left Ventricle                     YES               0.029          -0.16
Liver                                      YES               0.13           -0.19
Lung                                       YES               0.074          -0.16
Muscle - Skeletal                          YES               0.00092        -0.22
Nerve - Tibial                             YES               0.23           -0.12
Ovary                                      YES               0.15           -0.21
Pancreas                                   YES               0.023          -0.23
Pituitary                                  YES               0.0013         -0.58
Prostate                                   YES               0.76           -0.064
Skin - Not Sun Exposed (Suprapubic)        YES               0.027          -0.27
Skin - Sun Exposed (Lower leg)             YES               0.0016         -0.28
Spleen                                     YES               0.14           -0.32
Stomach                                    YES               0.035          -0.17
Testis                                     YES               0.43           -0.11
Thyroid                                    YES               0.016          -0.24
Uterus                                     YES               0.9            -0.023
Vagina                                     YES               0.06           -0.34
Whole Blood                                NO                0.61           -0.032

Table S5. Cis-eQTL p-values and beta (effect size) values retrieved from the test performed between rs2298080 and WARS2, in the GTEx pilot data, across different tissues. P-values in red are less than (0.05 / number of tissues) (Bonferroni multiple test correction). Effect sizes are with respect to the selected allele in Greenlandic Inuit.

Supplementary Information 1

We used a Hidden Markov Model (HMM) method to infer Denisovan or Neanderthal admixture tracts. This method is a new version of the one deployed in an earlier study (Seguin-Orlando, et al. 2014) and is available for download at https://github.com/amyko/admixtureHMM. The hidden and observed states are similar to those discussed in a previously developed conditional random field model (Sankararaman, et al. 2014). Consider a test haplotype $h = (h_1, \ldots, h_k) \in \{0,1\}^k$ of length $k$ from an admixed population, where 1 denotes a derived allele and 0 an ancestral allele. We denote $z = (z_1, \ldots, z_k) \in \{0,1\}^k$ as the ancestry vector for the test haplotype. We assign $z_i = 1$ if site $i$ is of archaic human ancestry, and $z_i = 0$ if it is of modern human ancestry. We also define an observed vector $y = (y_1, \ldots, y_k) \in \{0,1\}^k$ as follows:

$$y_i = \begin{cases} 1, & \text{if } h_i = 1,\ f_i^{\mathrm{unadmixed}} < \epsilon,\ f_i^{\mathrm{archaic}} > 0 \\ 0, & \text{otherwise} \end{cases}$$

where $f_i^{\mathrm{unadmixed}}$ is the derived allele frequency estimated from a panel of individuals from an unadmixed population such as the Yoruba, and $f_i^{\mathrm{archaic}}$ is the derived allele frequency estimated from a panel of archaic genomes (Denisovan or Neanderthal). In other words, the site is informative of archaic ancestry if the allele 1) is derived in the test haplotype, 2) is present in the archaic panel, and 3) is absent from the panel of unadmixed individuals. The parameter $\epsilon$ is added to account for possible sequencing errors in the unadmixed population. We consider as missing data sites that are fixed in the test population or sites for which the observed state cannot be determined.
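The three criteria for an informative site translate directly into a small function. This is a sketch under stated assumptions, not code from the admixtureHMM repository: the function and argument names are hypothetical, the default value of epsilon is a placeholder, "present in the archaic panel" is read as an archaic derived allele frequency above zero, and `None` marks missing data.

```python
from typing import Optional


def observed_state(h_i: Optional[int],
                   f_unadmixed: Optional[float],
                   f_archaic: Optional[float],
                   epsilon: float = 0.01) -> Optional[int]:
    """Return y_i for one site of a test haplotype.

    h_i         -- allele on the test haplotype (1 derived, 0 ancestral)
    f_unadmixed -- derived allele frequency in the unadmixed panel
    f_archaic   -- derived allele frequency in the archaic panel
    epsilon     -- tolerance for sequencing error (placeholder default)
    """
    if h_i is None or f_unadmixed is None or f_archaic is None:
        return None  # observed state cannot be determined
    informative = h_i == 1 and f_unadmixed < epsilon and f_archaic > 0
    return 1 if informative else 0
```

Sites that are fixed in the test population would also be set to `None` by the caller; that check needs population-level frequencies and is left out here.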

Using these hidden and observed states, we define a two-state, time-homogeneous HMM to infer the admixture tracts of a test haplotype. The ancestry vector z is the sequence of hidden states along the test haplotype and y is its associated sequence of observed states. Following a previously developed model (Gravel 2012), the transition rates between the hidden states are given by:

$$P(z_i = 1 \mid z_{i-1} = 0) = r \cdot m \, (t - 1)$$

$$P(z_i = 0 \mid z_{i-1} = 1) = r \cdot (1 - m)(t - 1),$$

where m is the admixture proportion, t is the admixture time, and r is the recombination rate. Note that the exponentially distributed tract lengths implied by these equations do not hold exactly, but they are a reasonable approximation for the level of divergence and admixture between archaic and modern humans (Liang and Nielsen 2014).

The emission probabilities were obtained from simulated training data. We used a modified version of msHOT (Hellenthal and Stephens 2007) to simulate a demographic history of Europeans (admixed), the Yoruba (unadmixed), and Neanderthals under previously estimated demographic parameters (Sankararaman, et al. 2014). We then sampled 100, 100, and 2 haplotypes from the European, the Yoruba, and the Neanderthal population, respectively. The modified msHOT output keeps a record of the local ancestry of the haplotypes in the admixed population (i.e. z is known). This allows us to estimate the emission probabilities $P(y = 1 \mid z = 0)$ and $P(y = 1 \mid z = 1)$ by simply counting the number of sites in the simulated European haplotypes that fit the criteria. For example, we estimate $P(y = 1 \mid z = 0)$ by:

$$\frac{\sum \mathbb{1}(z = 0 \wedge y = 1)}{\sum \mathbb{1}(z = 0)},$$

where the summation is taken over all haplotypes of the admixed panel. We then use the Viterbi algorithm to infer the most likely sequence of local ancestry along the test haplotype.
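The counting estimate and the Viterbi decoding step can be sketched together for a two-state HMM. This is an illustrative implementation under stated assumptions rather than the admixtureHMM code: all names are hypothetical, the transition expressions r·m·(t − 1) and r·(1 − m)(t − 1) are used directly as per-site switching probabilities (ignoring variable spacing between sites), the initial probability of archaic ancestry is a free parameter, and missing observations (`None`) contribute no emission term.

```python
import math
from typing import List, Optional, Sequence, Tuple


def transition_probabilities(r: float, m: float, t: float) -> Tuple[float, float]:
    """Return (P(modern -> archaic), P(archaic -> modern)) per site."""
    return r * m * (t - 1), r * (1 - m) * (t - 1)


def estimate_emissions(z_seqs: Sequence[Sequence[int]],
                       y_seqs: Sequence[Sequence[Optional[int]]]) -> Tuple[float, float]:
    """Estimate (P(y=1|z=0), P(y=1|z=1)) by counting over training haplotypes.

    Assumes both ancestry states occur at least once in the training data.
    """
    sites = [0, 0]   # number of non-missing sites with z = 0 and z = 1
    hits = [0, 0]    # how many of those sites have y = 1
    for z_seq, y_seq in zip(z_seqs, y_seqs):
        for z, y in zip(z_seq, y_seq):
            if y is None:
                continue
            sites[z] += 1
            hits[z] += y
    return hits[0] / sites[0], hits[1] / sites[1]


def viterbi(y: Sequence[Optional[int]], p_01: float, p_10: float,
            e0: float, e1: float, prior_archaic: float) -> List[int]:
    """Most likely ancestry path (0 modern, 1 archaic) for observations y."""
    log_trans = [[math.log(1 - p_01), math.log(p_01)],
                 [math.log(p_10), math.log(1 - p_10)]]
    emit_one = [e0, e1]

    def log_emit(state: int, obs: Optional[int]) -> float:
        if obs is None:
            return 0.0  # missing data: no information about ancestry
        p = emit_one[state] if obs == 1 else 1 - emit_one[state]
        return math.log(p)

    score = [math.log(1 - prior_archaic) + log_emit(0, y[0]),
             math.log(prior_archaic) + log_emit(1, y[0])]
    backpointers: List[List[int]] = []
    for obs in y[1:]:
        new_score, pointers = [], []
        for state in (0, 1):
            prev = max((0, 1), key=lambda p: score[p] + log_trans[p][state])
            pointers.append(prev)
            new_score.append(score[prev] + log_trans[prev][state] + log_emit(state, obs))
        backpointers.append(pointers)
        score = new_score

    state = max((0, 1), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Working in log space avoids numerical underflow on long haplotypes, which matters when per-site switching probabilities are very small.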

Supplementary Information 2

We were interested in determining whether the version of the introgressed haplotype present in Greenlandic Inuit matches the longer version of the introgressed haplotype present in the Yakut and the Even better than it matches other introgressed haplotypes present throughout Asia. The Yakut are fixed for this longer version, so if all or most Greenlandic Inuit carry it, we would expect their divergence to the Yakut in this region to be lower than their divergence to other populations. To test this, we computed the divergence between each of the SGDP Central Asian / Siberian populations and the Greenlandic Inuit along the extent of the Yakut haplotype. As we only have SNP chip data for the Greenlandic Inuit, we took the absolute difference between the Greenlandic Inuit sample derived allele frequency and each population's sample derived allele frequency, and then scaled that number by the total number of SNPs that we had available (the intersection of the SGDP SNPs and the SNPs in the chip used in the GI selection scan (Fumagalli, et al. 2015)). To account for possible confounding effects due to European admixture, we computed this divergence using all the Greenlandic Inuit and also using only the Greenlandic Inuit with the least amount of European admixture (Moltke, et al. 2015) (Table S6).
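One reading of this frequency-based divergence is the absolute difference in derived allele frequency, summed over the SNPs shared by the two data sets and divided by the number of shared SNPs. The sketch below implements that reading; the function name and the dictionary layout (SNP identifier mapped to derived allele frequency) are assumptions made for illustration.

```python
from typing import Dict


def frequency_divergence(freq_a: Dict[str, float],
                         freq_b: Dict[str, float]) -> float:
    """Mean absolute difference in derived allele frequency over shared SNPs."""
    shared = [snp for snp in freq_a if snp in freq_b]
    if not shared:
        raise ValueError("the two populations share no SNPs")
    total = sum(abs(freq_a[snp] - freq_b[snp]) for snp in shared)
    return total / len(shared)


# Toy frequencies (not real data): only s1 and s2 are shared
inuit = {"s1": 1.0, "s2": 0.5, "s3": 0.0}
other = {"s1": 1.0, "s2": 0.0, "s4": 1.0}
print(frequency_divergence(inuit, other))
```

Restricting to the shared SNPs mirrors the intersection of SGDP sites and chip sites described above.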

Population       Divergence to      Divergence to Greenlandic Inuit
                 Greenlandic Inuit  (Tasiilaq + South villages only)
Itelman          0.0087             0.0068
Altaian          0.0092             0.007
Eskimo Naukan    0.011              0.0092
Tlingit          0.0111             0.0092
Eskimo Sireniki  0.0175             0.018
Yakut            0.0732             0.0725
Ulchi            0.0766             0.078
Aleut            0.0813             0.0813
Kyrgyz           0.1376             0.1378
Chukchi          0.1498             0.1509
Even             0.1535             0.1549
Eskimo Chaplin   0.1648             0.1663
Mansi            0.2323             0.2338
Tubalar          0.2441             0.2455
Mongola          0.2568             0.2584

Table S6. Divergence between Greenlandic Inuit and different Central Asian and Siberian populations from the SGDP, calculated over the region covered by the Yakut introgressed haplotype. We computed this divergence using all Greenlandic Inuit (middle column) and using only the Greenlandic Inuits with the least amount of European admixture (Tasiilaq, Tasiilaq villages and south villages).

Although the Yakut have relatively low divergence, they are not the closest population to the Greenlandic Inuit, and the closest populations do not carry the longer version of the haplotype. Therefore, not all Greenlandic Inuit carry the Yakut version of the haplotype, although we cannot rule out the possibility that some of them do. Resolving this will require population-level whole-genome data from the Greenlandic Inuit.

Notably, the population with the highest divergence (the Mongola) is also fixed in the SGDP panel for the absence of the short version of the introgressed haplotype. This is consistent with the fact that Greenlanders are almost fixed for the presence of the haplotype (either the long or the short version), and becomes even clearer when the same divergence calculation is restricted to the region covered by the short version of the introgressed haplotype (Table S7).

Population       Divergence to      Divergence to Greenlandic Inuit
                 Greenlandic Inuit  (Tasiilaq + South villages only)
Altaian          0.002              0.0007
Itelman          0.002              0.0007
Eskimo Sireniki  0.0052             0.0049
Eskimo Naukan    0.0074             0.0060
Tlingit          0.0074             0.0062
Yakut            0.0746             0.0737
Ulchi            0.0873             0.0881
Aleut            0.1054             0.1057
Chukchi          0.1509             0.1514
Kyrgyz           0.1574             0.1579
Eskimo Chaplin   0.1776             0.1781
Even             0.2078             0.2083
Mansi            0.2613             0.2619
Tubalar          0.276              0.2766
Mongola          0.3569             0.3575

Table S7. Divergence between Greenlandic Inuit and different Central Asian and Siberian populations from the SGDP, calculated over the region covered by the main introgressed haplotype. We computed this divergence using all Greenlandic Inuit (middle column) and using only the Greenlandic Inuits with the least amount of European admixture (Tasiilaq, Tasiilaq villages and south villages).

Supplementary Information 3

Introduction

We are interested in the following probability. Given that a mutation was segregating (>5% frequency) in the ancestral population of modern humans and Denisovans, that it survived up to the split of Africans and non-Africans, and that it is at high frequency in non-Africans (30% to 99%), how likely is it to be at low frequency (<1%) in Africans and present in homozygous form in Denisovans?

Methods

To answer this question, we generated simulations in SLiM 2.0 (Messer 2013) such that:
1) A randomly chosen mutation segregates in the ancestral modern-archaic population at >5% frequency before the modern-archaic population split.
2) The same mutation is still segregating (>0% frequency) right before the split of Africans and non-Africans.
3) The same mutation becomes positively selected in non-Africans after their split from Africans and is found in present-day non-Africans at a frequency larger than 30% (as a conservative lower bound).

We set the mutation rate to 1.45e-8 and the recombination rate to 0.155e-8, to emulate the estimated recombination rate in the TBX15/WARS2 region. We used a demographic history of population size changes and population split times based on previously estimated demographic parameters (Gravel, et al. 2011; Prüfer, et al. 2014) converting times where appropriate to account for the mutation rate used. Importantly, we simulated no post-divergence admixture between archaic and modern humans. In the present, we sampled 2 Denisovan sequences, 100 African sequences and 100 non-African sequences.

We ran 1,000 simulations, each 200,000 bp in length, and then recorded, for each simulation:
a) The frequency of the selected mutation in the sampled Denisovan genome.
b) The frequency of the selected mutation in present-day Africans.
c) The frequency of the selected mutation in present-day non-Africans.
d) The proportion of non-Africans that contain an inferred archaic tract (inferred using our HMM) that is as large as the inferred tract in the TBX15/WARS2 region (27,992 bp) and that overlaps with the selected mutation.

We generated four sets of simulations with different selection coefficients: s=0.1, s=0.01, s=0.001 and s=0.

We also generated a fifth set of simulations (with s=0.1) but adding an instantaneous admixture event between archaic and modern humans at a rate of 2%, so as to make sure that our HMM actually had power to detect introgressed tracts.
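The per-simulation records can be summarized as proportions of simulations meeting each condition of interest. The sketch below is an assumption-laden illustration, not the authors' post-processing script: the record fields and function names are hypothetical, the frequency bounds (100% in the Denisovan sample, below 1% in Africans, above 30% and below 99% in non-Africans, tract frequency above 30%) follow the thresholds stated in the text, and the strictness of each inequality is a guess.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SimulationRecord:
    denisovan_freq: float          # selected mutation in the Denisovan sample
    african_freq: float            # selected mutation in present-day Africans
    non_african_freq: float        # selected mutation in present-day non-Africans
    long_tract_freq: float         # non-Africans with an inferred tract >= 27,992 bp
    long_tract_overlap_freq: float # same, but tract must overlap the selected site


def matches_observed_pattern(rec: SimulationRecord) -> bool:
    """Fixed in Denisova, rare in Africans, common in non-Africans."""
    return (rec.denisovan_freq == 1.0
            and rec.african_freq < 0.01
            and 0.30 < rec.non_african_freq < 0.99)


def has_common_long_tract(rec: SimulationRecord) -> bool:
    return rec.long_tract_freq > 0.30


def has_common_overlapping_tract(rec: SimulationRecord) -> bool:
    return rec.long_tract_overlap_freq > 0.30


def proportion(records: Sequence[SimulationRecord],
               condition: Callable[[SimulationRecord], bool]) -> float:
    """Fraction of simulations that satisfy the condition."""
    if not records:
        raise ValueError("no simulations to summarize")
    return sum(1 for rec in records if condition(rec)) / len(records)
```

Calling `proportion(records, matches_observed_pattern)` once per set of simulations (each selection coefficient, with or without admixture) gives one row of a summary table.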

Results

We display the results of the simulations in Tables S8, S9 and S10.

Table S8. Proportion of simulations where the frequency of the selected mutation in Denisova = 100%, where the frequency in Africans < 1% and the frequency is between 30% and < 99% in non-Africans:

Condition                      Proportion
No admixture & s = 0.1         0.018
No admixture & s = 0.01        0.018
No admixture & s = 0.001       0.007
No admixture & s = 0           0.01
Admixture (2%) & s = 0.1       0.078

Table S9. Proportion of simulations where an inferred admixture tract of length larger than 27,992 bp has more than 30% frequency:

Condition                      Proportion
No admixture & s = 0.1         0.029
No admixture & s = 0.01        0.025
No admixture & s = 0.001       0.022
No admixture & s = 0           0.016
Admixture (2%) & s = 0.1       0.136

Table S10. Proportion of simulations where an inferred admixture tract of length larger than 27,992 bp has more than 30% frequency and the tract overlaps the selected site:

Condition                      Proportion
No admixture & s = 0.1         0.013
No admixture & s = 0.01        0.014
No admixture & s = 0.001       0.009
No admixture & s = 0           0.007
Admixture (2%) & s = 0.1       0.108

Supplementary Information 4

We were interested in computing the frequency of the selected allele in various ancient Eurasian populations. We focused on the SNPs that had significantly high PBS values in the Greenlandic Inuit and were located in the introgressed region, as defined by the HMM. We used a combined ancient DNA dataset from various Mesolithic, Neolithic, Bronze Age and Iron Age Eurasian populations (Fu, et al. 2014; Gamba, et al. 2014; Lazaridis, et al. 2014; Raghavan, et al. 2014; Allentoft, et al. 2015), and found that 18 of our putatively selected SNPs overlapped with these data. We obtained genotype likelihoods from ANGSD (Korneliussen, et al. 2014) and only looked at sites where the genotype with the highest likelihood had a log-likelihood that was 5 points ("strict cutoff") or 2 points ("lenient cutoff") higher than the genotype with the second highest likelihood. We then used these likelihoods to estimate the probability of a particular allele being present at each SNP in each individual. Finally, we computed the average allele frequency over individuals that belonged to the same population group, as defined by an earlier study (Allentoft, et al. 2015).
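The cutoff and per-individual estimate described above can be sketched in Python as follows. This is one possible reading, not the study's actual pipeline. It assumes natural-log genotype likelihoods ordered by the number of copies of the allele (0, 1, 2), a flat prior over genotypes, and that "5 points higher" means a difference of at least 5. All function names are hypothetical, and no ANGSD output format is assumed.

```python
import math


def confident_call(log_likes, cutoff):
    """True if the best genotype log-likelihood beats the second best by >= cutoff."""
    ranked = sorted(log_likes, reverse=True)
    return ranked[0] - ranked[1] >= cutoff


def expected_allele_fraction(log_likes):
    """Expected fraction of chromosomes carrying the allele in one individual.

    Assumes log_likes = (hom. absent, heterozygous, hom. present) in natural
    log units and a flat genotype prior (both are assumptions of this sketch).
    """
    top = max(log_likes)
    weights = [math.exp(ll - top) for ll in log_likes]  # subtract max for stability
    total = sum(weights)
    posterior = [w / total for w in weights]
    return 0.5 * posterior[1] + posterior[2]


def population_frequency(individual_log_likes, cutoff):
    """Mean allele fraction over individuals passing the cutoff, or None if none pass."""
    kept = [
        expected_allele_fraction(ll)
        for ll in individual_log_likes
        if confident_call(ll, cutoff)
    ]
    return sum(kept) / len(kept) if kept else None
```

Calling `population_frequency` with `cutoff=5.0` would correspond to the strict cutoff and `cutoff=2.0` to the lenient one.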

Because data is sparse for any given SNP, we then computed an average over the average allele frequencies in all the SNPs. This allowed us to pool information from SNPs that are covered in one individual but not another. Because all the selected SNPs are in high LD, this should not lead to artifacts (i.e. we find that all the SNPs have the same allele frequency in CHB, YRI and TSI, so no SNP will have higher weight than another in the mean computation).
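The pooling step can be illustrated with a short Python sketch: a mean is taken per SNP over the individuals that have data, and then a mean is taken over those per-SNP means. The function names and the use of `None` for uncovered sites are assumptions made for this illustration.

```python
def mean_ignoring_missing(values):
    """Mean of the non-missing values, or None if every value is missing."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def pooled_frequency(per_snp_individual_freqs):
    """Mean over SNPs of the per-SNP mean allele frequency.

    per_snp_individual_freqs: one list per SNP, holding one estimate per
    individual in the population group (None where the site is not covered).
    """
    per_snp_means = [mean_ignoring_missing(snp) for snp in per_snp_individual_freqs]
    return mean_ignoring_missing(per_snp_means)
```

Because each SNP contributes one value to the outer mean, a SNP covered in a single individual carries the same weight as one covered in many, which is reasonable only when the SNPs are in high LD, as stated above.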

The results are shown in Figure S15A (strict cutoff) and S15B (lenient cutoff). We find that most populations where the population allele frequency is high are related to the eastern steppe populations (Mal’ta, Yamnaya, Corded Ware, Afanasievo), while those where the allele frequency is very low are mostly European pre-steppe-migration populations: Mesolithic hunter-gatherers and Neolithic individuals from Central Europe (LBK) and Hungary.

[Figure S15, panels A and B]

Figure S15. We estimated population allele frequencies in the TBX15/WARS2 region from the low-coverage data of various ancient DNA shotgun-sequencing studies (Fu, et al. 2014; Gamba, et al. 2014; Lazaridis, et al. 2014; Raghavan, et al. 2014; Allentoft, et al. 2015). We concentrated on the SNPs that had significantly high PBS values in the Greenlandic Inuit and were located in the introgressed region, as defined by the HMM. As the SNPs are in very high LD with each other, we computed the mean over the estimated allele frequencies of all the SNPs in the region, in order to pool data from SNPs that may have been covered in one ancient individual but not another. YRI = 1000G Yoruba. IBS = 1000G Iberian. CEU = 1000G Central European. FIN = 1000G Finn. TSI = 1000G Tuscan. CHB = 1000G Han Chinese from Beijing. Ust = Ust’-Ishim. baHu = Bronze Age Hungarians. neolC = Neolithic Central Europeans. neolHu = Neolithic Hungarians. baSca = Bronze Age Scandinavians. baAndrov = Bronze Age Andronovo. baUne = Bronze Age Unetice. baSin = Bronze Age Sintashta. baMezh = Bronze Age Mezhovskaya. hunterW = Mesolithic hunter-gatherers. baYam = Bronze Age Yamnaya. baKarasuk = Bronze Age Karasuk. baRem = Bronze Age Remedello. baAfan = Bronze Age Afanasievo. baCw = Bronze Age Corded Ware. irHu = Iron Age Hungarian. irAltai = Iron Age Altai. Malta_man = Mal’ta (MA-1). A) Only using sites in which the highest genotype log-likelihood is 5 points higher than the second highest genotype log-likelihood. B) Only using sites in which the highest genotype log-likelihood is 2 points higher than the second highest genotype log-likelihood.

References

Allentoft ME, Sikora M, Sjögren KG, Rasmussen S, Rasmussen M, Stenderup J, Damgaard PB, Schroeder H, Ahlström T, Vinner L, et al. 2015. Population genomics of Bronze Age Eurasia. Nature 522:167-172.

Fu Q, Li H, Moorjani P, Jay F, Slepchenko SM, Bondarev AA, Johnson PL, Aximu-Petri A, Prüfer K, de Filippo C, et al. 2014. Genome sequence of a 45,000-year-old modern human from western Siberia. Nature 514:445-449.

Fumagalli M, Moltke I, Grarup N, Racimo F, Bjerregaard P, Jørgensen ME, Korneliussen TS, Gerbault P, Skotte L, Linneberg A, et al. 2015. Greenlandic Inuit show genetic signatures of diet and climate adaptation. Science 349:1343-1347.

Gamba C, Jones ER, Teasdale MD, McLaughlin RL, Gonzalez-Fortes G, Mattiangeli V, Domboróczki L, Kővári I, Pap I, Anders A, et al. 2014. Genome flux and stasis in a five millennium transect of European prehistory. Nat Commun 5:5257.

Gravel S. 2012. Population genetics models of local ancestry. Genetics 191:607-619.

Gravel S, Henn BM, Gutenkunst RN, Indap AR, Marth GT, Clark AG, Yu F, Gibbs RA, Bustamante CD, 1000 Genomes Project. 2011. Demographic history and rare allele sharing among human populations. Proc Natl Acad Sci U S A 108:11983-11988.

Hellenthal G, Stephens M. 2007. msHOT: modifying Hudson's ms simulator to incorporate crossover and gene conversion hotspots. Bioinformatics 23:520-521.

International-HapMap-Consortium. 2005. A haplotype map of the human genome. Nature 437:1299-1320.

Korneliussen TS, Albrechtsen A, Nielsen R. 2014. ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics 15:356.

Lazaridis I, Patterson N, Mittnik A, Renaud G, Mallick S, Kirsanow K, Sudmant PH, Schraiber JG, Castellano S, Lipson M, et al. 2014. Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature 513:409-413.

Liang M, Nielsen R. 2014. The Lengths of Admixture Tracts. Genetics 197:953-967.

Messer PW. 2013. SLiM: simulating evolution with selection and linkage. Genetics 194:1037-1039.

Moltke I, Fumagalli M, Korneliussen TS, Crawford JE, Bjerregaard P, Jørgensen ME, Grarup N, Gulløv HC, Linneberg A, Pedersen O, et al. 2015. Uncovering the genetic history of the present-day Greenlandic population. Am J Hum Genet 96:54-69.

Paten B, Herrero J, Beal K, Fitzgerald S, Birney E. 2008. Enredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogs. Genome Res 18:1814-1828.

Prüfer K, Racimo F, Patterson N, Jay F, Sankararaman S, Sawyer S, Heinze A, Renaud G, Sudmant PH, de Filippo C, et al. 2014. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505:43-49.

Raghavan M, Skoglund P, Graf KE, Metspalu M, Albrechtsen A, Moltke I, Rasmussen S, Stafford TW, Orlando L, Metspalu E, et al. 2014. Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans. Nature 505:87-91.

Sankararaman S, Mallick S, Dannemann M, Prüfer K, Kelso J, Pääbo S, Patterson N, Reich D. 2014. The genomic landscape of Neanderthal ancestry in present-day humans. Nature 507:354-357.

Seguin-Orlando A, Korneliussen TS, Sikora M, Malaspinas AS, Manica A, Moltke I, Albrechtsen A, Ko A, Margaryan A, Moiseyev V, et al. 2014. Genomic structure in Europeans dating back at least 36,200 years. Science.

Skotte L, Korneliussen TS, Albrechtsen A. 2013. Estimating individual admixture proportions from next generation sequencing data. Genetics 195:693-702.