Population and comparative genomics

This lecture: Population genomics
Next lecture: Comparative genomics
Daniel Jeffares
Email [email protected]

Updated 19/4/2020

Population genomics and comparative genomics

This lecture: population genomics

• What is population genomics?
• How we gather the data.
• What can we find out from population genomics (that we can’t from population genetics).
• Concepts:
  • Genetic diversity (origin, and movement through space and time)
  • Genome-wide summary statistics (π, allele frequencies (MAF, DAF), Tajima’s D)
  • Population structure
  • Purifying selection: expectations and observations
  • Adaptive selection: expectations and observations
  • Balancing selection: expectations and observations
  • Polygenic selection and genome-scale data
  • Linkage of alleles on chromosomes
• Case studies:
  • Selective sweeps in malaria parasites
  • Altitude adaptation in Tibetans

Comparative genomics

Next lecture: comparative genomics
• What is comparative genomics?
• How we gather the data.
• What can we find out from comparative genomics
• Concepts:
  • Diversity within species gives rise to divergence between species
  • Evolutionary rates
  • Purifying selection (constraint): expectations and observations
  • Adaptive evolution: expectations and observations, tests for selection
  • Polygenic selection and genome-scale data
• Case studies:
  • Evolutionary constraint in mammalian
  • The McDonald-Kreitman test and evolution in the human genome

All the articles that I mention can be found here: https://paperpile.com/shared/RYh93p

Population genomics

What is population genomics?

• Population genetics is the study of genetic variation within species.
• Population genomics expands the data to study variation within species using whole genome data.
• Population genomics:
  • Is more challenging data to gather, more expensive, more challenging to analyse
  • Genome-scale data produces a more comprehensive picture of:
    • Demography (population size, migration, population structuring)
    • Selection (purifying, adaptive, balancing)

How we gather population genomics data

1. Hypothesis/query
2. Sample collection and DNA extraction
   a) Choose geographic/habitat region of interest
   b) Gather hundreds to thousands of individuals (strains/subjects) from within a species (sometimes tens of thousands)
   c) Extract genomic DNA from each individual
3. Genome sequencing
   a) Aim: to obtain sequence data covering the entire genome at 5x to 40x coverage
      • Coverage: how many reads cover each site
   b) Sequence genome using ‘short read’ technology (usually Illumina)
   c) Main issue: cost/base

[Figure: reads from one individual, aligned to a reference genome, shown at 2x coverage and 4x coverage]

How we gather population genomics data

4. Read mapping and ‘variant calling’
   a) Locate genetic variants (sites in the genome that differ between individuals, polymorphisms)
      • e.g. SNPs, indels etc.
   b) By mapping (aligning) sequence reads to a reference genome, and identifying sites in the genome that differ
5. Segregating genetic variants are (usually) the final data set
   a) A list of positions that vary. Alleles/polymorphisms/variants.
6. Analysis:
   a) Describing demography
   b) Detecting selection
   c) Quantitative genetics (like GWAS)
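The sequencing and variant-calling steps in the list above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not how production tools work: reads are assumed to be already aligned to the reference without gaps, positions are 0-based, sequencing errors are ignored, and the names `mean_coverage`, `call_snps` and `min_depth` are hypothetical, chosen only for this illustration. The genome size and read numbers are made-up example values.

```python
from collections import Counter, defaultdict


def mean_coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each site (e.g. 5x to 40x)."""
    return n_reads * read_length / genome_size


def call_snps(reference, aligned_reads, min_depth=2):
    """Toy variant caller (illustration only).

    aligned_reads is a list of (start, sequence) pairs, where start is the
    0-based reference position at which the read begins.
    Returns (position, reference_base, called_base, depth) tuples.
    """
    # Build a pileup: for each reference position, count the bases seen in reads.
    pileup = defaultdict(Counter)
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1

    snps = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        called_base, _ = counts.most_common(1)[0]
        # Report a site only if enough reads cover it and they disagree
        # with the reference.
        if depth >= min_depth and called_base != reference[pos]:
            snps.append((pos, reference[pos], called_base, depth))
    return snps


# Hypothetical example values.
reference = "ATCCCGTTAAATTTT"
reads = [(0, "AGCCCG"), (1, "GCCCGT"), (4, "CGTTAAAT")]

print(mean_coverage(1_000_000, 150, 12_500_000))  # 12.0
print(call_snps(reference, reads))  # [(1, 'T', 'G', 2)]
```

The depth threshold is the reason coverage matters: with only one read over a site, a sequencing error cannot be told apart from a real variant.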

[Figure: reads from one individual (some other strain/individual we are comparing to the reference genome), aligned to a reference genome. G and T bases mark a site where the reads differ from the reference.]

Sequencing technology now

At this point in time this is what we do:

Sanger sequencing (using ABI machines)
• Technology of choice for low- to medium-output sequencing
• E.g. checking plasmids, small-scale population surveys

Illumina
• Technology of choice for genome re-sequencing (of populations)
• Widely used for small de novo genome assemblies, RNA-seq (sequencing RNA), ChIP-seq, 3C

Pacific Biosciences (PacBio)
• One technology of choice for genome assemblies
• Produces fairly long, fairly accurate reads

Oxford Nanopore
• One of the technologies of choice for genome assemblies
• Produces the longest reads
• Has the worst error rate
• Is developing very fast

Sequencing technology develops rapidly. Next year this slide may be different.

The falling cost of sequencing has transformed evolutionary biology.

What we can find out from population genomics (that we can’t find out from population genetics)
• Demography
  • Estimates of population size
  • Estimates of population size through time (backwards only!)
  • Population structure (which individuals are more/less closely related)
  • Migration and ‘gene flow’ between populations (also past migration)
  • Inbreeding/outbreeding rates
• Selection
  • Which regions of the genome are subject to strong purifying (negative) selection
    • functional analysis of genomes
  • Recent/ongoing events of adaptation (positive selection)
• Quantitative genetics
  • GWAS: which alleles contribute to traits?

Demography, selection and quantitative genetics are intimately related.

Concepts in population genomics

Concept: genetic diversity

• Polymorphisms/alleles/variants: sites in a genome that differ between individuals of a species
  • Single nucleotide polymorphisms (SNPs)
  • Small insertions/deletions (indels)
  • Transposon insertions
  • ‘Structural’ variants: duplications, rearrangements, large insertions/deletions
• Initial origin: a mutation in one individual
  • All polymorphisms begin their existence in just one individual
• Polymorphisms then move through space and time, within the population
• Their frequency in the population will change

[Figure: a genomic site/gene from many individuals within a species (individuals 1 to 6), with variants ranging from “Rare” to “Common”. Example sequences: ATCCCG-TAAATTTT, AGCCCG-TAAATTTT, AGCCCGTTAAAGTTTT]

Concept: genetic diversity

[Figure, built up over several slides: a genomic site/gene from many individuals within a species, shown for “One population” and “Another population”]

• Most variants are rare
• Variants tend to remain within their original population
• Physically linked variants tend to travel together

Many signals can be found within polymorphism data. The patterns of polymorphism data are complex. Hence: summary statistics

Concept: genome-wide summary statistics

• Average pairwise differences (‘diversity’, π):
  • If we compare every sequence to every other, what is the average number of differences?
• The number of segregating sites (S)
• Allele frequencies:
  • Minor allele frequency (MAF)
  • Derived allele frequency (DAF)
• Each of these statistics can be described as:
  • A histogram (a distribution)
  • Plots along chromosomes

[Figure: a genomic site/gene from many individuals within a species, with one sequence labelled as the ancestral state:
ATCCCG-TAAATTTT
AGCCCG-TAAAGTTT
AGCCCGTTAAATTTT
AGCCCGTTAAATTTT]
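As a rough illustration of the statistics listed above, the sketch below computes π, S, Watterson’s θ, MAF and DAF for a small alignment. The alignment and the ancestral sequence are hypothetical toy data (equal-length sequences, no gaps, at most two alleles per site), and the function names are chosen only for this example.

```python
from collections import Counter
from itertools import combinations


def pairwise_diversity(seqs):
    """pi: average number of differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def segregating_sites(seqs):
    """S: number of alignment columns with more than one allele."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """theta estimated from S: S divided by the sum of 1/i for i = 1..n-1."""
    a1 = sum(1 / i for i in range(1, len(seqs)))
    return segregating_sites(seqs) / a1


def allele_frequencies(seqs, ancestral):
    """(position, MAF, DAF) for each segregating site (assumes two alleles)."""
    rows = []
    for pos, column in enumerate(zip(*seqs)):
        counts = Counter(column)
        if len(counts) < 2:
            continue  # not a segregating site
        maf = min(counts.values()) / len(column)
        daf = sum(1 for base in column if base != ancestral[pos]) / len(column)
        rows.append((pos, maf, daf))
    return rows


# Hypothetical toy alignment and ancestral state.
seqs = ["ATCCCGTA", "AGCCCGTA", "AGCCCGTT", "AGCCCGTT"]
ancestral = "ATCCCGTA"

print(segregating_sites(seqs))              # 2
print(pairwise_diversity(seqs))             # 7 differences over 6 pairs
print(watterson_theta(seqs))
print(allele_frequencies(seqs, ancestral))  # [(1, 0.25, 0.75), (7, 0.5, 0.5)]
```

Note that the MAF and DAF differ at position 1: the derived allele (G) is the common one there, so DAF is 0.75 while MAF is 0.25.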

See:
Nucleotide diversity (π) on Wikipedia: https://en.wikipedia.org/wiki/Nucleotide_diversity
Watterson estimator of θ on Wikipedia: https://en.wikipedia.org/wiki/Watterson_estimator

Concept: genome-wide summary statistics: Tajima’s D

Tajima’s D uses summary statistics to detect selection (or demography)
• S is the number of segregating sites; n is the number of sequences compared
• θ can be estimated from the number of segregating sites: $\theta = S \big/ \sum_{i=1}^{n-1} \frac{1}{i}$
  • For six sequences the sum is $1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \frac{1}{5}$
• In a neutral site π = θ
• Fumio Tajima worked out that π ≠ θ in certain circumstances
• Tajima’s D is approximately π − θ
• When D is negative:
  • more rare alleles than expected
  • selective sweep (or expanding population)
• When D is positive (too few rare alleles):
  • balancing selection or shrinking population or population structure

In theory, θ = π = 4Nμ, where N = population size and μ = mutation rate. So π can give us an estimate of population size.
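The sketch below shows one way these quantities could be computed in Python. The slide gives D only as approximately π − θ; the full statistic divides that difference by an estimate of its standard deviation, and the constants used for that here (a1, a2, b1, b2, c1, c2, e1, e2) are the standard ones from Tajima’s 1989 paper, not something taken from this lecture. The function names and toy alignments are hypothetical, and the alignment is assumed to be gap-free with at least four sequences.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for a gap-free alignment; None if it cannot be computed."""
    n = len(seqs)
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if n < 4 or s == 0:
        return None  # the variance term is zero or there is no variation

    # pi: average number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # theta estimated from the number of segregating sites.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    theta = s / a1

    # Standard normalising constants (Tajima 1989).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    return (pi - theta) / sqrt(e1 * s + e2 * s * (s - 1))


def population_size_from_pi(pi_per_site, mu):
    """Rearranges pi = 4 * N * mu to give N (mu is per site per generation)."""
    return pi_per_site / (4 * mu)


# Hypothetical toy alignments.
only_rare = ["TAAA", "ATAA", "AATA", "AAAT"]    # every variant is a singleton
only_common = ["AAAA", "AAAA", "TTTT", "TTTT"]  # every variant is at 50%

print(tajimas_d(only_rare))    # negative: excess of rare alleles
print(tajimas_d(only_common))  # positive: excess of common alleles
print(population_size_from_pi(0.001, 1e-8))  # about 25,000
```

With so few sequences and sites the values are only illustrative; in practice D is computed per gene or per genomic window from many individuals.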

Tajima’s D on Wikipedia: https://en.wikipedia.org/wiki/Tajima%27s_D
Excellent explanation of π, θ and Tajima’s D on Youtube: https://www.youtube.com/watch?v=wiyay4YMq2A

Concept: genome-wide summary statistics: Tajima’s D

Tajima’s D = 0:
• Neutrally evolving, stable population

When D is negative:
• more rare alleles than expected
• selective sweep or expanding population

When D is positive:
• more common alleles than expected
• balancing selection
• shrinking population
• population structure

[Figure: allele patterns in “One population” and “Another population” for each case]

Concept: Population structure

Leslie 2015

Variants tend to remain within their original population

[Figure: “One population” and “Another population”]

• When people, animals, plants or microbes move about, they carry their DNA with them.
• Every individual carries its history within its DNA
• Because we know the ‘rules’ (mutation, drift, recombination), population movements can be modelled
• Because populations can have small contributions from one or more other populations, this can’t always be drawn on a phylogenetic tree.

Concept: Purifying selection – expectations and observations

Purifying selection is the loss of deleterious (harmful) variants
This process will
• Reduce diversity in regions that are important
• Increase the proportion of rare alleles
• Cause negative Tajima’s D
• Purifying selection is expected to be a common event

[Figure: Jeffares 2015, Figure 3 (Nature Genetics, volume 47, number 3, March 2015, page 238). Relationships between genetic diversity and genome function. (a) Main features of diversity in the genome along chromosomes 1 to 3 (x axis: chromosome position, Mb): diversity (Watterson’s θ) calculated using SNPs, recombination rate (LDU/Mb), and sites of Tf family LTR insertion (number of strains with insertion). (b) Diversity described by genome annotation: distribution of Watterson’s θ for sites annotated as exons (EXO), 5′ and 3′ UTRs (5UT and 3UT), introns (INT), lncRNAs (RNA), unannotated regions (NIL), LTRs of Tf2 family transposons (LTR), and onefold-degenerate (1FD) and fourfold-degenerate (4FD) sites in exons. Annotation classes with diversity significantly lower than fourfold-degenerate sites (the proxy for neutral sites) are shaded gray. (c) Diversity is negatively correlated with exon density in 10-kb genomic windows (Spearman’s r = –0.50, P < 1 × 10^–16).]

Concept: Adaptive evolution – expectations and observations

Adaptive evolution is the increase in frequency of an adaptive (helpful) variant
This process will
• Reduce diversity around the beneficial allele
• Increase rare alleles
• Cause negative Tajima’s D
• Strong adaptive evolution is expected to be a rare event
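One simple way to look for the reduced diversity around a beneficial allele is to compute π in windows along a chromosome and look for a dip. The sketch below is a minimal illustration, assuming a gap-free alignment; the function names, window size, step and toy sequences are all hypothetical.

```python
from itertools import combinations


def pairwise_diversity(seqs):
    """Average number of differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def windowed_diversity(seqs, window, step):
    """Per-site pi in windows along the alignment: (window_start, pi) pairs."""
    length = len(seqs[0])
    results = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        results.append((start, pairwise_diversity(chunk) / window))
    return results


# Hypothetical toy alignment: variable flanks, no variation in the middle,
# as might be expected close to a swept allele.
seqs = [
    "ATGCAAAACGTA",
    "TTGCAAAACGTT",
    "ATGGAAAACCTA",
    "AAGCAAAAGGTA",
]

print(windowed_diversity(seqs, window=4, step=4))
# [(0, 0.375), (4, 0.0), (8, 0.375)]
```

A dip in π alone is not proof of a sweep, since purifying selection also reduces diversity; that is why it is combined with other signals such as allele frequencies and linkage.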

[Figure: a ‘soft’ selective sweep, showing the beneficial allele and the beneficial allele’s haplotype]

Concept: Balancing selection and genome-scale data

• Sometimes there is an advantage to maintaining two alleles in a population:
  • when heterozygotes are fitter (sickle cell anaemia)
  • when there is an advantage to being the rare allele (parasites evading the immune system)
• This causes ‘balancing selection’ to maintain more diversity
• Balancing selection is probably very rare
• This will cause high Tajima’s D
  • π is higher than expected from the number of segregating sites

[Figure: allele frequency patterns. Normal: excess of rare alleles. Balancing selection: fewer rare alleles, more common alleles.]

Concept: Polygenic selection and genome-scale data

• GWAS shows that most traits are determined by multiple genes
  • (multiple alleles within the genes)
  • Called ‘complex traits’
• Selection acts on all the alleles at once
• This may be why strong selective sweeps are rare events

[Figure: beneficial allele 1 somewhere in the genome and beneficial allele 2 at another location in the genome. Individuals are fit from allele 1, fit from allele 2, or extra fit from alleles 1 & 2.]

Concept: Linkage of alleles on chromosomes

When a strongly beneficial allele arises it will ‘sweep’ through the population. This will cause strong genetic signatures in the genome:
• loss of diversity around the sweep (assess by looking at π)
• increase in linkage (look at linkage)

Alleles are referred to as ‘linked’ when they are often found together. Linkage is strongest for alleles that sit close together on the same chromosome.

[Figure: pairs of alleles showing strong linkage, weaker linkage and unlinked alleles]

See: https://en.wikipedia.org/wiki/Selective_sweep
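How often two alleles are “found together” can be put into numbers. The slides do not name a particular measure; the sketch below uses r², one commonly used measure of linkage disequilibrium, as an assumption. It takes phased haplotypes as strings and two site positions, assumes both sites have exactly two alleles, and uses hypothetical names and toy data.

```python
def r_squared(haplotypes, i, j):
    """r^2 between sites i and j; None if either site does not vary."""
    n = len(haplotypes)
    # Track one allele at each site (the one carried by the first haplotype).
    allele_i = haplotypes[0][i]
    allele_j = haplotypes[0][j]

    p_i = sum(h[i] == allele_i for h in haplotypes) / n
    p_j = sum(h[j] == allele_j for h in haplotypes) / n
    p_ij = sum(h[i] == allele_i and h[j] == allele_j for h in haplotypes) / n

    # D: how much more often the two alleles occur together than expected
    # if they were independent.
    d = p_ij - p_i * p_j
    denominator = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denominator == 0:
        return None
    return d * d / denominator


# Hypothetical toy haplotypes (two sites each).
linked = ["AG", "AG", "TC", "TC"]    # A always travels with G
unlinked = ["AG", "AC", "TG", "TC"]  # every combination is equally common

print(r_squared(linked, 0, 1))    # 1.0
print(r_squared(unlinked, 0, 1))  # 0.0
```

Values near 1 correspond to the strong linkage case in the figure, and values near 0 to unlinked alleles.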

Effect of linkage in a sweep
• Recombination unlinks alleles
• Very close to the advantageous allele: very low diversity (none), all alleles linked
• Near to the advantageous allele: some loss of diversity, increased linkage

Case studies: population genomics results

Case study 1: selective sweeps in malaria parasites

• Malaria parasites
  • have large populations
  • are subject to strong pressure to evolve drug resistance
• Nair 2003: loss of diversity around a selected, pyrimethamine-resistance-conferring allele, due to linkage.
• EHH: the probability that any two chromosome segments that carry a given core haplotype are identical by descent.

[Figure: Nair 2003, “A Selective Sweep in SE Asian Malaria Parasites”, Fig. 1. Microsatellite variability is reduced for approximately 100 kb around the dhfr locus on chromosome 4 on the Thailand-Myanmar border. x axis: distance (kb) of genotyped microsatellite markers relative to dhfr; y axis: expected heterozygosity (He).]

[Figure: Nair 2003, Fig. 3. Markers flanking resistant dhfr alleles show reduced variation, skewed allele frequency distributions, and greater LD than those flanking sensitive dhfr alleles. (A) Observed and predicted He of microsatellite markers flanking resistant dhfr (filled dots) and sensitive dhfr alleles (open dots) from five SE Asian countries. (B) LD, measured by EHH, in sensitive (open dots) and resistant (filled dots) chromosomes from SE Asia.]

Source: https://academic.oup.com/mbe/article-abstract/20/9/1526/976871

Case study 1: selective sweeps in malaria parasites

• Parobek 2016 used the nSL test to identify regions that have been subject to

selection in PlasmodiumRobust method for detecting vivax hard and soft sweeps. 19

derived allele Downloaded from http://mbe.oxfordjournals.org/ at University of California School Law (Boalt Hall) on February 26, 2014 ancestral allele

The top hit (AP2 gene) also had high EHH value. Orange: selected haplotype FIG. 5.—Haplotype pattern for three SNPs in the YRI population

The vertical dark line indicates the location of the SNP (SNPid is shown). The average length of each haplotype is Green: unselected haplotype Therepresented nSL test by a horizontal looks line. The for different region haplotype backgrounds with carryinglong the linkage, derived allele are denoted by different tones of green. Each tone of grey indicates a different haplotype background carrying the ancestral allele. All haplotypes carrying the derived allele lie on the white background, whereas haplotypes carrying the ancestral allele lie in aon derived the green background. allele. Fig. 5. Evidence for strong selective sweeps in Cambodian P. vivax.(A)AManhattanplotofnormalizednSL values. Each point corresponds to an SNP, and the

top 0.5% valuesFig. (under 5. Evidence strong directional for strong selection) selective sweepsare rendered in Cambodian in orange.P. Polymorphismsvivax.(A)AManhattanplotofnormalized without evidence of strongnS directionalL values. Each selection point are corresponds rendered into an SNP, and the either gray or green,top 0.5% according values (under to chromosome. strong directional This view selection) suggests are that rendered several in genomic orange. regions Polymorphisms are under without positive evidence selection, of strong including directional areas near selection tran- are rendered in scription factorseither (AP2 domain-containing), gray or green, according chromatin to chromosome. regulators (SET10 This view and HP1), suggests antigens that under several known genomic positive regions selection are under (SERA4 positive and 5), selection, and drug- includingresistance areas near tran- genes (MDR1 andscription MRP1). factors (B and (AP2C) The domain-containing), extent of EHH in the chromatin strongest regulators sweep, within (SET10 chromosome and HP1), antigens 14 in A.( underB) EHH known decay positive for the selectionhaplotypes (SERA4 around and selected 5), and drug-resistance (orange) and unselectedgenes (MDR1 (green) and alleles. MRP1). (C (B) Aand haplotypeC) The extent bifurcation of EHH diagram in the strongest is shown centered sweep, within on the chromosome focal variant. 14 Line in A thickness.(B) EHH represents decay for the the haplotypes number of around selected identical haplotypes(orange) flanking and unselected the selected (green) (orange) alleles. or unselected (C) A haplotype (green) bifurcation alleles. Linkage diagram breaks is shown down with centered increasing on the distance focal variant. from the Line focal thickness variant. representsBand C the number of provide evidenceidentical that the haplotypes strong sweep flanking on chromosome the selected (orange) 14 extends or unselected∼50 kb in each (green) direction. 
P. falciparum ortholog (PF3D7_1317200) (24). In addition to the AP2-domain transcription factor, this region contains 25 genes (Table S4).

We found complementary evidence that P. vivax transcriptional regulators are under strong directional selection using allele frequency-based testing for selection. We calculated genewise Tajima's D (an allele frequency-based test, in contrast to nSL and iHS, which are haplotype-based tests of selection) for P. vivax and P. falciparum. We selected the first and 99th percentile of observed genewise Tajima's D values and investigated these genes for functional enrichment. We observed enrichment of histone deacetylase complex members among the first percentile of genes, but this observation did not reach significance after Bonferroni correction. Because strong local values of Tajima's D can be obscured when considering an entire gene, we also performed this analysis in an exonwise manner for both P. vivax and P. falciparum (18). In this way we found statistically significant enrichment for histone deacetylase complex members among the first-percentile exons after strict Bonferroni correction (P = 0.0423) (Table 1).

Both P. vivax and P. falciparum Show Evidence of Recent Directional Selection on Known and Putative Drug-Resistance Genes. Four of the 15 strongest nSL signals in P. vivax were in close proximity to transporters (pvmdr1, pvmdr2, pvmrp1, and an ABC transporter), all of which are potential drug-resistance loci (Table 2). A prominent sweep did encompass the pvmdr1 locus. We compared key drug-resistance SNP frequencies in pvmdr1 to frequencies in Cambodian isolates collected several years earlier, during 2006 and 2007, from Kâmpôt province (25). Two key mutations (Y976F and F1076L) existed at roughly the same frequency (89% in previously

E8100 | www.pnas.org/cgi/doi/10.1073/pnas.1608828113 Parobek et al.
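The excerpt above uses genewise Tajima's D as an allele frequency-based test. As a minimal sketch, the statistic can be computed from a set of aligned haplotypes using Tajima's standard formula. The haplotypes below are invented toy data and the function name is hypothetical. This is not the pipeline used by Parobek et al.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for equal-length haplotype strings (one per chromosome)."""
    n = len(haplotypes)
    # Segregating sites: alignment columns with more than one allele.
    columns = list(zip(*haplotypes))
    s = sum(1 for col in columns if len(set(col)) > 1)
    if s == 0:
        raise ValueError("Tajima's D is undefined without segregating sites")
    # pi: mean number of pairwise differences.
    n_pairs = n * (n - 1) / 2
    pi = sum(
        sum(a != b for a, b in zip(h1, h2))
        for h1, h2 in combinations(haplotypes, 2)
    ) / n_pairs
    # Constants from Tajima's variance formula.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimators of theta, scaled by its SD.
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy data (assumption): 6 chromosomes, 4 sites each.
rare = ["1000", "0100", "0010", "0001", "0000", "0000"]    # only singletons
common = ["1111", "1111", "1111", "0000", "0000", "0000"]  # mid-frequency alleles

print("excess of rare alleles:        D =", round(tajimas_d(rare), 3))
print("excess of intermediate alleles: D =", round(tajimas_d(common), 3))
```

An excess of rare alleles gives a negative D, as expected after a sweep or under purifying selection. An excess of intermediate-frequency alleles gives a positive D, as expected under balancing selection.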

Case study 2: altitude adaptation in Tibetans

• Indigenous Tibetan people have lived on the Tibetan Plateau for millennia. There is a long-standing question about the genetic basis of high-altitude adaptation in Tibetans.
• Yang et al. (2017) conducted a genome-wide study of 7.3 million genotyped and imputed SNPs in 3,008 Tibetans and 7,287 non-Tibetan individuals of Eastern Asian ancestry.

ancestry (SI Appendix, Fig. S2B). On a finer scale, the subjects are stratified along the first PC (SI Appendix, Fig. S2C), consistent with a few hundred self-reported Han in the sample. We classified our subjects into three groups (Tibetans, Han, and possibly admixed) (Materials and Methods and SI Appendix, Fig. S2D) and removed the possibly admixed subjects. There were 3,008 Tibetans and 373 Han retained for analysis.

We projected the PCs of our subjects on the Chinese subjects from the Human Genome Diversity Project (HGDP) (16) and illustrated the genetic relatedness between Tibetans and other ethnic groups in China (Fig. 1A). Our result suggests that Tibetans show the nearest genetic relatedness to Yi, Tu, and Naxi ethnic minority populations (Fig. 1A and SI Appendix, Table S1), consistent with these populations who reside in the neighboring regions of the TP (Yi and Naxi people are mainly distributed in Yunnan and Sichuan provinces, and most Tu people reside in Qinghai province) (Fig. 1B).

We estimated the divergence time between Tibetan and Han populations using the conventional FST-based approach (17) (SI Appendix, Text S1). As described above, there were 3,008 Tibetan and 373 Han subjects collected from the TP after QC. We included in this analysis an additional set of 1,726 Han subjects collected from the Eye Hospital of Wenzhou Medical University (WZ) after QC (Materials and Methods). We used GCTA-GRM to remove cryptic relatedness in the Tibetan and Han samples (note that the Han sample was a combined set of 373 Han subjects from the TP and 1,726 Han subjects from WZ) at a relatedness threshold of 0.05 and retained 1,998 unrelated Tibetan and 2,059 unrelated Han subjects. There was no genetic difference between WZ-Han and TP-Han as shown by PCA (SI Appendix, Fig. S3), probably because most of the Han subjects, collected from either TP or WZ, were originally from diverse regions of China. The genome-wide mean FST between Tibetans and Han was 0.012 [using the method by Weir and Cockerham (18) implemented in GCTA], consistent with the estimate of the Han subjects from the HGDP (SI Appendix, Table S1). Given the genome-wide mean FST value (Materials and Methods), we estimated that the divergence time between Tibetan and Han populations was 189 generations. Assuming an average generation time of 25 y as in previous studies (3, 19), this estimate suggests that Tibetans and Han split about 4,725 y ago, ∼2,000 y earlier than that estimated from whole-exome sequencing data (3) but consistent with recent evidence from archeological studies (20, 21).

Genome-Wide Analysis to Detect Genetic Signals of Adaptation. To detect genetic signals of high-altitude adaptation, we used a mixed linear model-based leave one chromosome out association (MLMA-LOCO) analysis approach [implemented in the BOLT-LMM software tool (22)] to test for allele frequency difference between Tibetans and non-Tibetans of EAS ancestry (Materials and Methods). We investigated the statistical properties of the method using simulations (SI Appendix, Table S2). Similar approaches have been used in genome-wide association studies (GWASs) to control for population structure (22, 23). In the MLMA-LOCO model, the target SNP to be tested is fitted as a fixed effect, and all SNPs on the other chromosomes are fitted as random effects (details about the model are in Materials and Methods). The underlying assumption is that, under a drift model, the random effects follow a normal distribution with the variance being proportional to p0(1 – p0)FST, where p0 is the allele frequency in the ancestral population and FST is the Wright's fixation index between the two derived populations (24). If there are two diverged populations in the sample, even if neither of the populations have been under natural selection, SNPs on different chromosomes will be correlated because of the systematic difference in allele frequency between populations caused by cryptic relatedness in the samples, genetic drift, and/or possibly, admixture with other populations (see below for examples). We, therefore, can correct for the interchromosome correlations by modeling all of the SNPs on the other chromosomes (as random effects, because number of SNPs is usually larger than sample size) when testing for the association of an SNP. To maximize power, we included in the analysis all of the subjects collected from the TP and WZ in China (3,008 Tibetans and 2,099 Han) and an additional set of 5,188 subjects of EAS ancestry from the Genetic Epidemiology Research on Aging (GERA) Study (25) in the United States (Materials and Methods). Because the GERA-EAS subjects were genotyped on a different SNP array (Affymetrix Axiom), we imputed all of the genotype data

Tibetans are a genetically distinct group of people.

Fig. 1. PCA of genetic ancestry in Chinese populations using genome-wide SNP data. (A) Result from a PCA in a combined sample of 3,381 genetically confirmed Tibetan and Han from this study and 180 Chinese subjects (multiple ethnic groups) from the HGDP. PC1 and PC2 represent the first two eigenvectors from PCA. Note that one of the Yi subjects from the HGDP seems to be of Tibetan ancestry. (B) Distribution of the ethnic groups in China. The blue circles represent the main distribution areas of the ethnic populations in the HGDP, and the red circle represents the Tibetan population. Note that many of the populations, such as Han, Mongola, Tibetan, and Uygur, are distributed widely in a range of regions rather than the specific areas labeled on the map. The green triangles represent the two areas (Seda and Litang) from which our Tibetan subjects were recruited (SI Appendix, Fig. S1).
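The divergence-time figures quoted in the Yang et al. excerpt (FST = 0.012, 189 generations, 25 years per generation) can be checked with a few lines of arithmetic. The second half of the sketch assumes the textbook drift relation FST ≈ 1 − exp(−T / 2Ne) and backs out the effective population size that this relation would imply. That relation is an assumption made here for illustration. The authors' exact method is described in their SI Appendix, which is not shown on the slide, and it may differ.

```python
from math import log

# Values quoted in the excerpt.
FST = 0.012             # genome-wide mean FST, Tibetans vs Han
GENERATIONS = 189       # estimated divergence time in generations
YEARS_PER_GENERATION = 25

# Generations to years: reproduces the "about 4,725 y ago" figure.
years_since_split = GENERATIONS * YEARS_PER_GENERATION
print("split (years ago):", years_since_split)

# Assumption: pure drift in isolation, FST ~ 1 - exp(-T / (2 * Ne)),
# so Ne = -T / (2 * ln(1 - FST)). This is a textbook approximation,
# not necessarily the calculation used in the paper.
implied_ne = -GENERATIONS / (2 * log(1 - FST))
print("implied effective population size:", round(implied_ne))
```

Under this approximation a small FST over a short divergence time corresponds to an effective size of several thousand, which shows how the time estimate depends on the population size assumed.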

4190 | www.pnas.org/cgi/doi/10.1073/pnas.1617042114 Yang et al.

Case study 2: altitude adaptation in Tibetans

Variants tend to remain within their original population, but they also 'drift' between populations.

[Diagram: two overlapping populations, Tibetans and other Chinese, with some gene flow and allele sharing. Labels: unshared Tibetan (red), shared (pink), unshared (white).]

Regions that are enriched for population separation might be under selection to be different. To favour high altitude living?

Fig. 2. Genome-wide scan for genetic signatures of adaptation. Shown on the y axis are −log10 of P values from the tests of allele frequency difference between Tibetan Chinese (n = 3,008) and EASs (n = 7,287). The analysis was performed using the MLMA-LOCO method, which tests for difference in allele frequency between populations taking into account the difference caused by random drift. SNPs at the genome-wide significant loci are highlighted in red.

to 1000G reference panels using IMPUTE2 (26). There were three ancestry outliers, which were excluded from analysis (SI Appendix, Fig. S4). To exclude SNPs with allele frequency differences between cohorts caused by potential batch effects, we performed a "control–control" analysis using the MLMA-LOCO approach to test for difference in allele frequency between TP-Han and a combined set of WZ-Han and GERA-EAS and removed SNPs with P value < 1 × 10^−6. We then performed a "case–control" analysis using the MLMA-LOCO approach to test for difference in allele frequency between Tibetans ("cases": n = 3,008) and EAS subjects ("controls": TP-Han, WZ-Han, and GERA-EAS; n = 7,287) and identified nine loci that passed the genome-wide significance level (P_MLMA-LOCO < 5e-8) (Fig. 2 and SI Appendix, Fig. S5). Of nine loci, two loci, EPAS1 and EGLN1, which show the strongest signals in our analysis, are known (3–7), and the other seven loci are unique (Table 1 and SI Appendix, Fig. S6). Note that FGF10 was one of a set of genes that showed large population branch statistic (PBS) values (Tibetans vs. Han vs. Europeans) in a recent study (15). We show by linkage disequilibrium (LD) score regression analysis (SI Appendix, Text S2) that there is no inflation in the test statistic (an estimate of the regression intercept of 0.99 with an SE of 0.01, which is not significantly different from 1), suggesting that the sample structure has been well-controlled in the MLMA-LOCO analysis as expected from theory (23). We further divided the data into the Seda and Litang subsets and reran the analysis in each subset (Materials and Methods). Although all of the nine loci remained highly significant, not all of them passed the genome-wide significance level in either subset (SI Appendix, Table S3). This analysis shows the gain of power for detecting genetic signals of natural selection in a dataset of large sample size. In addition, we performed conditional analyses (27, 28) at nine genome-wide significant loci and did not find evidence of multiple signals at any of these loci. We also performed the MLMA-LOCO analysis to detect signatures of genetic adaptation on the mitochondrial genome and did not observe any significant signal (SI Appendix, Fig. S7). We replicated a number of candidate gene loci as reported in previous studies (3, 4). The replication rate after correcting for multiple testing was ∼35.7% (= 5/14), much higher than expected by chance (SI Appendix, Table S4).

Associations of the Loci Under Natural Selection with Phenotypes in Tibetans. Having identified nine genetic loci that have been under natural selection, we then asked whether these loci are associated

Table 1. Nine genetic loci with signals of natural selection (the Tibetan and EAS columns give the frequency of allele A1)

Chromosome  SNP          bp           A1  A2  Freq. A1 Tibetan  Freq. A1 EAS  P value  FST    Nearest gene
1           rs1801133    11,856,378   A   G   0.238             0.333         6.3E-09  0.021  MTHFR
1           rs71673426   112,159,304  C   T   0.102             0.013         1.5E-08  0.100  RAP1A
1           rs78720557   198,096,548  A   T   0.498             0.201         4.7E-08  0.191  NEK7
1           rs78561501   231,448,497  A   G   0.599             0.135         6.1E-15  0.414  EGLN1
2           rs116611511  46,600,030   G   A   0.447             0.003         3.6E-19  0.570  EPAS1
4           rs2584462    100,324,464  G   A   0.211             0.549         3.9E-09  0.203  ADH7
5           rs4498258    44,325,322   T   A   0.586             0.287         1.7E-08  0.171  FGF10
6           rs9275281    32,662,920   G   A   0.095             0.365         1.1E-10  0.162  HLA-DQB1
12          rs139129572  123,178,478  GA  G   0.316             0.449         5.8E-09  0.036  HCAR2

P value indicates the P value from the MLMA-LOCO analysis. FST is the FST value between Tibetans and EASs. Nearest gene indicates the nearest annotated gene to the top differentiated SNP at each locus except EGLN1, which is known to be associated with high-altitude adaptation; rs139129572 is an insertion SNP with two alleles: GA and G. A1, allele 1; A2, allele 2.
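As a small worked example, the values in Table 1 can be loaded into a script to check the genome-wide significance threshold quoted in the text (P < 5e-8) and to rank the loci. The numbers are copied from the table. The `Locus` type and the variable names are hypothetical, introduced only for this sketch.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    chromosome: int
    snp: str
    gene: str
    freq_tibetan: float  # frequency of allele A1 in Tibetans
    freq_eas: float      # frequency of allele A1 in EAS subjects
    p_value: float       # from the MLMA-LOCO analysis
    fst: float


# Values copied from Table 1.
LOCI = [
    Locus(1, "rs1801133", "MTHFR", 0.238, 0.333, 6.3e-09, 0.021),
    Locus(1, "rs71673426", "RAP1A", 0.102, 0.013, 1.5e-08, 0.100),
    Locus(1, "rs78720557", "NEK7", 0.498, 0.201, 4.7e-08, 0.191),
    Locus(1, "rs78561501", "EGLN1", 0.599, 0.135, 6.1e-15, 0.414),
    Locus(2, "rs116611511", "EPAS1", 0.447, 0.003, 3.6e-19, 0.570),
    Locus(4, "rs2584462", "ADH7", 0.211, 0.549, 3.9e-09, 0.203),
    Locus(5, "rs4498258", "FGF10", 0.586, 0.287, 1.7e-08, 0.171),
    Locus(6, "rs9275281", "HLA-DQB1", 0.095, 0.365, 1.1e-10, 0.162),
    Locus(12, "rs139129572", "HCAR2", 0.316, 0.449, 5.8e-09, 0.036),
]

GENOME_WIDE_THRESHOLD = 5e-8


def freq_difference(locus):
    """Absolute allele frequency difference between the two groups."""
    return abs(locus.freq_tibetan - locus.freq_eas)


significant = [l for l in LOCI if l.p_value < GENOME_WIDE_THRESHOLD]
by_fst = sorted(LOCI, key=lambda l: l.fst, reverse=True)
by_difference = sorted(LOCI, key=freq_difference, reverse=True)

print(len(significant), "of", len(LOCI), "loci pass P < 5e-8")
print("highest FST:", by_fst[0].gene)
print("largest frequency difference:", by_difference[0].gene)
```

The two rankings need not agree: FST depends on the allele frequencies themselves as well as on their difference, so a locus where one population is nearly fixed can have a higher FST than a locus with a slightly larger raw difference.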

Yang et al. PNAS | April 18, 2017 | vol. 114 | no. 16 | 4191

for two SNPs at low frequency in Han Chinese (Extended Data Fig. 4 is ancestral lineage sorting. However, this explanation is very unlikely and Supplementary Table 7). as it cannot explain the significant D and S* values and because it would If we consider all SNPs (not just the most differentiated) in the 32.7-kb require a long haplotype to be maintained without recombination since region annotated in humans, to build a haplotype network22 using the the time of divergence between Denisovans and humans (estimated to 40 most common haplotypes, we observe a clear pattern in which the be at least 200,000 years (ref. 14)). The chance of maintaining a 32.7-kb Tibetan haplotype is much closer to the Denisovan haplotype than any fragment in both lineages throughout 200,000 years is conservatively modern human haplotype (Fig. 3 and Extended Fig. 5a; see Extended estimated at P 5 0.0012 assuming a constant recombination of 2.3 3 Data Fig. 6a, b for haplotype networks constructed using other criteria). 1028 per base pair (bp) per generation (see Methods). Furthermore, the Furthermore, we find that the Tibetan haplotype is slightly more diver- haplotype would have to have been independently lostin all African and gent from other modern human populations than the Denisovan haplo- non-African populations, except for Tibetans and Han Chinese. type is, a pattern expected under introgression (see Methods and Extended We have re-sequenced the EPAS1 region and found that Tibetans har- Data Fig. 5b). Raw sequence divergence for all sites and all haplotypes bour a highly differentiated haplotype that is only found at very low fre- shows a similar pattern (Extended Data Fig. 7). Moreover, the divergence quency in the Han population among all the 1000 Genomes populations, between the common Tibetan haplotype and Han haplotypes is larger and is otherwise only observed in a previously sequenced Denisovan than expected for comparisons among modern humans, but well within individual14. 
As the haplotype is observed in a single individual in both the distribution expected from human–Denisovan comparisons (Extended CHS and CHB samples, it suggests that it was introduced into humans Data Fig. 8). Notably, sequence divergence between the Tibetans’ most before the separation of Han and Tibetan populations, but subject to common haplotype and Denisovan is significantly lower (P 5 0.0028) selection in Tibetans after the Tibetan plateau was colonized. Alterna- than expected from human–Denisovan comparisons (Extended Data tively, recent admixture from Tibetans to Hans may have introduced Fig. 8). We also find that the number of pairwise differences between the haplotype to nearby Han populations outside Tibet. The CHS and the common Tibetan haplotype and the Denisovan haplotype (n 5 12) CHB individuals carrying the five-SNP Tibetan–Denisovan haplotype is compatible with the levels one would expect from mutation accu- (Extended Data Fig. 3) show no evidence of being recent migrants from mulation since the introgression event (see Methods for Extended Data Tibet (see Methods and Extended Data Fig. 10b), suggesting that if the Fig. 8). Finally, if we compute D (ref. 14) and S* (refs 23, 24), two statis- haplotype was carried from Tibet to China by migrants, this migration tics that have been designed to detect archaic introgression into mod- did not occur within the last few generations. ern humans, we obtain significant values (D-statistic P , 0.001, and S* Previous studies examining the genetic contributions of Denisovans P # 0.035) for the 32.7-kb region using multiple null models of no gene to modern humans14,25 suggest that Melanesians have a much larger Deni- flow (see Methods, Supplementary Tables 8–10, and Extended Data sovancomponent than either HanHuerta orMongolians,- evenSánchez thoughthe latter 2014 Figs 9 and 10a). populations are geographically much closer to the Altai mountains14,25. 
Thus, we conclude that the haplotype associated with altitude adapta- Interestingly, the putatively beneficial Denisovan EPAS1 haplotype is Case study 2: altitude adaptationtion inTibetans in isTibetans likely to be a product of introgressionfrom Denisovan not observed in modern-day Melanesians or in the high-coverage Altai or Denisovan-related populations.LETTER RESEARCH The only other possible explanation Neanderthal26 (Supplementary Table 4). Evidence has been found for

XXVII 1.0 EPAS1 gene others XXXIX ***** ** * * * ** ** * ** * * * * * * * *** ** * * 0.9 SLC45A2 was strong SLC45A2 XLI EPAS1 XXV 0.8 SLC45A2 HERC2 outlier in XII VIII 0.7 population III XXXI XXXIV 0.6 XXXVI genetic XVII XXVIII XXXVII 0.5

Maximum frequency difference XXXVIII distance test V XXI Denisovan Tibetans 0.4 XXIX X XXXII 35 0.00 0.05 0.10 0.15 0.20 IX XXII I Genome-wide FST XV LETTER RESEARCH XIII XXXV 9 VI XXIV Figure 1 | Genome-wide FST versus maximal allele frequency difference. XIV 40 XXXIII The relationship between genome-wide FST (x axis) computed for each pair of XXX XL IV 1 1.0 the 26 populations and maximal allele frequency difference (y axis), first II Tibetans VII explored in ref. 19. Maximal allele frequency difference is defined as the largest XXVI XVI XX SLC45A2 ***** ** * * * ** ** * ** * * * * * * * *** ** * * 0.9 frequency difference observed for any SNP between a population pair. The 26 2 10 20 SLC45A2 EPAS1 XI SLC45A2 populations are from the Human Genome Diversity Panel (HGDP). The XIX 0.8 Tibetan CLM HERC2 labels highlight genes that harbour SNPs previously identified as having strong XVIII CHB GBR JPT ASW YRI FIN XXIII Denisovan CHS PUR LWK CEU TSI MXL 0.7 local adaptation. The grey points represent the observed relationship between population differentiation (FST ) and maximal allele frequency difference; EPAS1 gene region isFigure 3 | A haplotype network based on the number of pairwise differences legend shows all the possible haplotypes among these populations. The 0.6 the more differentiated populations tend to have mutations with larger between the 40 most common haplotypes. The haplotypes were defined from numbers (1, 9, 35 and 40) next to an edge (the line connecting two haplotypes) frequency differences. The star symbol and the yellow symbols represent very different 0.5 all the SNPs present in the combined 1000 Genomes and Tibetan samples: in the bottom right are the number of pairwise differences between the Maximum frequency difference outliers; these are populations that are not highly differentiated but wherebetween we Han Chinese Tibetans and515 SNPs in total within theThis 32.7-kb regionEPAS1 region. 
is The similar Denisovan haplotypes to ancientcorresponding Denisovan haplotypes. We added anDNA. edge afterwards between the Tibetan 0.4 Tibetans find some mutations that have higher frequency differences than expected were added to the set of the common haplotypes. The R software package haplotype XXXIII and its closest non-Denisovan haplotype (XXI) to indicate its 0.00 0.05 0.10 0.15 0.20 (light blue line). pegas23 was used to generate the figure, using pairwise differences as distances. divergence from the other modern human groups. Extended Data Fig. 5a Genome-wide F Chinese. ST Each pie chart represents one unique haplotype, labelled with Roman numerals, contains all the pairwise differences between the haplotypes presented in this

Figure 1 | Genome-wide FST versus maximal allele frequency difference. minority haplotypes and the common haplotype observed in Han Chinese and the radius of the pie chart is proportional to the log2(number of figure. ASW, African Americans from the south western United States; CEU, The relationship between genome-wide FST (x axis) computed for each pair of For example, the region harbours a highly differentiated 5-SNP haplo- chromosomes with that haplotype) plus a minimum size so that it is easier Utah residents with northern and western European ancestry; GBR, British; to see the Denisovan haplotype. The sections in the pie provide the breakdown FIN, Finnish; JPT, Japanese; LWK, Luhya; CHS, southern Han Chinese; CHB, the 26 populations and maximal allele frequency difference (y axis), first type motif (AGGAA) within a 2.5-kb window that is only seen in Tibetan Figure 2 | Haplotype pattern in a region defined by SNPs that are at high It is not similar to anyof the haplotype representation amongst populations. The width of the edges Han Chinese from Beijing; MXL, Mexican; PUR, Puerto Rican; CLM, explored in ref. 19. Maximal allele frequency difference is defined as the largest samples and in none of the Han samples (the first five SNPs in Sup- frequency in Tibetans and at low frequency in Han Chinese. Each column is is proportional to the number of pairwise differences between the joined Colombian; TSI, Toscani; YRI, Yoruban. Where there is only one line within a frequency difference observed for any SNP between a population pair. The 26 plementary Table 3, and blue arrows in Fig. 2). The pattern of geneticmoderna polymorphic human genomic location (95 in total), each row is a phased haplotype populations are from the Human Genome Diversity Panel (HGDP). The haplotypes; the thinnest edge represents a difference of one mutation. The pie chart, this indicates that only one population contains the haplotype. 
variation within Tibetans appears even more unusual because none of (80 Han and 80 Tibetan haplotypes), and the coloured column on the left labels highlight genes that harbour SNPs previously identified as having strong population.denotes the population identity of the individuals. Haplotypes of the Denisovan the variants in the five-SNP motif is present in any of the minority hap- 196 | NATURE | VOL 512 | 14 AUGUST 2014 local adaptation. The grey points represent the observed relationship between individual are shown in the top two rows (green). The black cells represent the lotypes of Tibetans. Even when subject to a selective sweep, we would ©2014 Macmillan Publishers Limited. All rights reserved population differentiation (FST ) and maximal allele frequency difference; presence of the derived allele and the grey space represents the presence of the more differentiated populations tend to have mutations with larger not generally expect a singlehaplotype to containsomany unique muta- the ancestral allele (see Methods). The first and last columns correspond to the frequency differences. The star symbol and the yellow symbols represent tions not found on other haplotypes. first and last positions in Supplementary Table 3, respectively. The red and outliers; these are populations that are not highly differentiated but where we Han Chinese We investigate whether a model of selection oneither a de novo muta- blue arrows indicate the 32 sites in Supplementary Table 3. The blue arrows find some mutations that have higher frequency differences than expected tion (SDN) or selection on standing variation (SSV) could possibly lead represent a five-SNP haplotype block defined by the first five SNPs in the (light blue line). to so many fixed differences between haplotype classes in such a short 32.7-kb region. Asterisks indicate sites at which Tibetans share a derived allele regionwithin a single population. Todo so, we simulate a 32.7-kb region with the Denisovan individual. 
minority haplotypes and the common haplotype observed in Han Chinese under these models assuming different strengths of selection and con- For example, the region harbours a highly differentiated 5-SNP haplo- ditioning on the current allele frequency in the sample (see Methods). of these populations, except for a single CHS (Southern Han Chinese) type motif (AGGAA) within a 2.5-kb window that is only seen in Tibetan FigureWe 2 | Haplotype find that the pattern observed in a regionnumber defined of fixed by differences SNPs that are between at high the hap- and a single CHB (Beijing Han Chinese) individual. Extended Data Fig. 3 samples and in none of the Han samples (the first five SNPs in Sup- frequencylotype in Tibetans classes is and significantly at low frequency higher in thanHan Chinese. what is expectedEach column byis simula- contains the frequencies of all the haplotypes present in the fourteen plementary Table 3, and blue arrows in Fig. 2). The pattern of genetic a polymorphictions under genomic any location of the models (95 in total), explored each row (Extended is a phased Data haplotype Fig. 2). Thus 1000 Genomes populations21 at these five SNP positions. Furthermore, variation within Tibetans appears even more unusual because none of (80 Han and 80 Tibetan haplotypes), and the coloured column on the left denotesthe the degree population of differentiation identity of the individuals. between haplotypes Haplotypes of is the significantly Denisovan larger when we examine the data set from ref. 14 containing both modern (Pap- the variants in the five-SNP motif is present in any of the minority hap- than expected from mutation, and directional selection alone. uan, San, Yoruba, Mandeka, Mbuti, French, Sardinian, Han Dai, Dinka, lotypes of Tibetans. Even when subject to a selective sweep, we would individual are shown in the top two rows (green). 
The black cells represent the presenceIn of other the derived words, allele it is unlikelyand the grey (P , space0.02 represents under either the presence a SSV scenario of or Karitiana, and Utah residents of northern and western European ances- not generally expect a singlehaplotype to containsomany unique muta- the ancestralunder allele a SDN (see scenario) Methods). that The the first high and degree last columns of haplotype correspond differentiation to the try (CEU)) and archaic (high-coverage Denisovan and low-coverage tions not found on other haplotypes. first andcould last positions be caused in by Supplementary a single beneficial Table 3, mutation respectively. landing The red by and chance on a Croatian Neanderthal) human genomes14, we discover that the five-SNP We investigate whether a model of selection oneither a de novo muta- blue arrowsbackground indicate ofthe rare 32 sites SNPs, in Supplementarywhich are then Table brought 3. The to blue high arrows frequency by motif is completely absent in all of their modern human population sam- tion (SDN) or selection on standing variation (SSV) could possibly lead representselection. a five-SNP The haplotype remaining block explanations defined by are the the first presence five SNPs of in strong the epistasis ples (Supplementary Table 4). Therefore, apart from one CHS and one to so many fixed differences between haplotype classes in such a short 32.7-kbbetween region. Asterisks many mutations, indicate sites or thatat which a div Tibetansergent population share a derived introduced allele the CHB individual, none of the other extant human populations sampled with the Denisovan individual. regionwithin a single population. Todo so, we simulate a 32.7-kb region haplotype into Tibetans by gene flow or through ancestral lineage sorting. to date carry this five-SNP haplotype. 
…under these models assuming different strengths of selection and conditioning on the current allele frequency in the sample (see Methods). We find that the observed number of fixed differences between the haplotype classes is significantly higher than what is expected by simulations under any of the models explored (Extended Data Fig. 2). Thus the degree of differentiation between haplotypes is significantly larger than expected from mutation, genetic drift and directional selection alone. In other words, it is unlikely (P < 0.02 under either a SSV scenario or under a SDN scenario) that the high degree of haplotype differentiation could be caused by a single beneficial mutation landing by chance on a background of rare SNPs, which are then brought to high frequency by selection. The remaining explanations are the presence of strong epistasis between many mutations, or that a divergent population introduced the haplotype into Tibetans by gene flow or through ancestral lineage sorting.

We search for potential donor populations in two different data sets: the 1000 Genomes Project (ref. 21) and whole genome data from ref. 14. We originally defined the EPAS1 32.7-kb region boundaries by the level of observed differentiation between the Tibetans and Han only (Supplementary Table 3, Extended Data Fig. 1 and Fig. 2) as described in the previous section. In that region, the most common haplotype in Tibetans is tagged by the distinctive five-SNP motif (AGGAA; the first five SNPs in Fig. 2), not found in any of our 40 Han samples. We first focus on this five-SNP motif and determine whether it is unique to Tibetans or if it is found in other populations.

Intriguingly, when we examine the 1000 Genomes Project data set, we discover that the Tibetan five-SNP motif (AGGAA) is not present in any of these populations, except for a single CHS (Southern Han Chinese) and a single CHB (Beijing Han Chinese) individual. Extended Data Fig. 3 contains the frequencies of all the haplotypes present in the fourteen 1000 Genomes populations at these five SNP positions. Furthermore, when we examine the data set from ref. 14 containing both modern (Papuan, San, Yoruba, Mandeka, Mbuti, French, Sardinian, Han Dai, Dinka, Karitiana, and Utah residents of northern and western European ancestry (CEU)) and archaic (high-coverage Denisovan and low-coverage Croatian Neanderthal) human genomes (ref. 14), we discover that the five-SNP motif is completely absent in all of their modern human population samples (Supplementary Table 4). Therefore, apart from one CHS and one CHB individual, none of the other extant human populations sampled to date carry this five-SNP haplotype. Notably, the Denisovan haplotype at these five sites (AGGAA) exactly matches the five-SNP Tibetan motif (Supplementary Table 4 and Extended Data Fig. 3).

We observe the same pattern when focusing on the entire 32.7-kb region and not just the five-SNP motif. Twenty SNPs in this region have unusually high frequency differences of at least 0.65 between Tibetans and all the other populations from the 1000 Genomes Project (Extended Data Fig. 4). However, in Tibetans, 15 out of these 20 SNPs are identical to the Denisovan haplotype generating an overall pattern of high haplotype similarity between the selected Tibetan haplotype and the Denisovan haplotype (Supplementary Tables 5–7). Interestingly, five of these SNPs in the region are private SNPs shared between Tibetans and the Denisovan, but not shared with any other population worldwide, except…

14 AUGUST 2014 | VOL 512 | NATURE | 195 ©2014 Macmillan Publishers Limited. All rights reserved
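The two checks described in the excerpt (how often a five-SNP motif occurs in each population, and which sites differ in frequency by at least 0.65) can be sketched roughly as follows. This is only an illustration, assuming phased haplotypes are available as nucleotide strings; the function names and the toy haplotypes are hypothetical and are not data from the paper.

```python
# Toy sketch: motif frequency and a per-site frequency-difference screen.
# All haplotypes below are made-up illustration values.

def motif_frequency(haplotypes, motif):
    """Fraction of haplotypes that exactly match `motif`."""
    return sum(h == motif for h in haplotypes) / len(haplotypes)

def allele_frequencies(haplotypes, reference):
    """Per-site frequency of the allele carried by `reference`."""
    n = len(haplotypes)
    return [sum(h[i] == reference[i] for h in haplotypes) / n
            for i in range(len(reference))]

def high_difference_sites(pop_a, pop_b, reference, threshold=0.65):
    """Indices of sites whose frequency differs by at least `threshold`."""
    freq_a = allele_frequencies(pop_a, reference)
    freq_b = allele_frequencies(pop_b, reference)
    return [i for i, (fa, fb) in enumerate(zip(freq_a, freq_b))
            if abs(fa - fb) >= threshold]

archaic = "AGGAA"                                  # motif used as the reference haplotype
pop_a = ["AGGAA", "AGGAA", "AGGAA", "GAAGG"]       # hypothetical population A
pop_b = ["GAAGG", "GAAGG", "GAGGG", "GAAGG"]       # hypothetical population B

print(motif_frequency(pop_a, archaic))             # 0.75
print(motif_frequency(pop_b, archaic))             # 0.0
print(high_difference_sites(pop_a, pop_b, archaic))  # [0, 1, 3, 4]
```

With real data the haplotypes would come from phased variant calls, and the threshold would be applied across every SNP in the region of interest.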

Summary of important points

• Population genomics methods are:
  • Hypothesis/query
  • Sample collection, DNA extraction
  • Sequencing, read mapping and variant calling
  • Analysis
• Genome-wide summary statistics:
  • π (average pairwise diversity)
  • Two measures of allele frequencies (MAF, DAF)
  • Tajima’s D

• θ = 4Neμ (the expected value of π at neutral equilibrium)
• Population structure can be inferred from sequencing data
• Linkage of alleles on chromosomes
• Purifying selection: expectations and observations
• Adaptive selection: expectations and observations
• Polygenic selection and genome-scale data
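The summary statistics listed above can be computed from a small set of aligned haplotypes. The sketch below is a minimal illustration, not a production implementation. It assumes biallelic sites coded as 0/1 strings with 0 taken to be the ancestral allele (needed for DAF), uses the standard Tajima (1989) formula for D, and uses hypothetical toy data and an illustrative mutation rate.

```python
from math import sqrt

def summary_statistics(haplotypes):
    """Return S, pi, Watterson's theta, MAF, DAF and Tajima's D.

    `haplotypes` is a list of equal-length strings of '0'/'1',
    where '0' is assumed to be the ancestral allele.
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    counts = [sum(h[i] == "1" for h in haplotypes) for i in range(n_sites)]
    counts = [k for k in counts if 0 < k < n]       # keep segregating sites only
    S = len(counts)

    # pi: average number of pairwise differences, summed over sites
    pi = sum(2 * k * (n - k) / (n * (n - 1)) for k in counts)
    maf = [min(k, n - k) / n for k in counts]       # minor allele frequency
    daf = [k / n for k in counts]                   # derived allele frequency

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    theta_w = S / a1                                # Watterson's estimator of theta

    tajima_d = None                                 # undefined without segregating sites
    if S > 0:
        b1 = (n + 1) / (3 * (n - 1))
        b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
        c1 = b1 - 1 / a1
        c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
        e1 = c1 / a1
        e2 = c2 / (a1**2 + a2)
        tajima_d = (pi - theta_w) / sqrt(e1 * S + e2 * S * (S - 1))

    return {"S": S, "pi": pi, "theta_w": theta_w,
            "maf": maf, "daf": daf, "tajima_d": tajima_d}

def effective_population_size(theta_per_site, mu):
    """Rearrange theta = 4 * Ne * mu to estimate Ne."""
    return theta_per_site / (4 * mu)

# Hypothetical sample: five haplotypes, every variant is a singleton
haps = ["10000", "01000", "00100", "00010", "00001"]
stats = summary_statistics(haps)
print(stats["S"], stats["pi"], stats["tajima_d"])

# Illustrative values only: per-site diversity 0.001, mutation rate 1.25e-8
print(effective_population_size(0.001, 1.25e-8))
```

In this toy sample every variant is rare, so π falls below Watterson's estimate and Tajima's D comes out negative, the pattern expected after a population expansion or a selective sweep.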