<<

Downloaded from http://mbe.oxfordjournals.org/ at MPI Study of Societies on March 30, 2016 ; ). 6 fig. 1 ; ). .Allthese egurel et al. ons, please S ; 2 es 2011 LAD1 Andr ; Rebeca Atencia, 5 Wiuf et al. 2004 Charlesworth 2006). ; Takahata and Nei 1990 ; -species (trSNP; e signatures to short genomic Charlesworth 2006 ; Clark 1997; ; Fernando Racimo, ). If selection is strong and constant 1 de/primate/exomes/VCFs. Charlesworth 2006 , Leipzig, Germany vance Access publication January 19, 2015 ; Claudine Andre, 4 s, 20 chimpanzees, and 20 bonobos and ,1 Maintains a Missense es* . Leffler et al. 2013 ding trSNP (rs12088790) in the LAD1 Juan R. Meneu, er have been deposited in the European Nucleotide , great apes. LAD1 1 ionary Anthropology Charlesworth et al. 1997 Hudson and Kaplan 1988 In species with oldunder enough divergence neutrality time and trSNPs( are are rare hallmarks of balancing selection 2012; enough, the polymorphism may survivespecies the and split persist of in different one present-day species, populations resulting in of a more trans Muirhead than et al. 2002 CharlesworthRecombination et restricts thes segments (Wiuf al. et al. 2004 1997 2006). Selection maintains closelibrium, to the the frequency frequency equi- that maximizestion. fitness This in results the in popula- frequency an enrichment equilibrium of( in variants close selected to the and linked variation typic diversity in populations. When selection acts for long the majority of these trSNPs were found in three of y species and also exhibit other signatures of long-term dermal-epidermal junction; the gene is also an autoantigen LAD1 mbalancingselectionintheevolutionofprotein-coding ssense change (Leucine257Proline) and, besides altering the Michel Halbwax, ) that are shared among the three species and have segregated ssed at http://cdna.eva.mpg. 3 ty for Molecular Biology and Evolution. All rights reserved. For permissi ), and Aida M. Andr 1 es 2011; ). € abo, € a Antje Weihmann, Andr Charlesworth Muehlenbachs ; ,1 Anne Fischer, ; y 1 Pasvol et al. 1978 ; es de Tchimpounga, Jane Goodall Institute, Pointe-Noire, Republic of Congo Svante P cluster, we also uncovered one co 32(5):1186–1196 doi:10.1093/molbev/msv007 Ad niversity of California, Berkeley 1 Clark 1997; iology and Ecology, Nairobi, Kenya ; with changes in gene expression of sa, Democratic Republic Congo ), temporal and spatial vari- , a gene that encodes for Ladinin-1, a collagenous anchoring filament protein of basement clade. We sequenced the exome of 20 human Gillespie 1978 s advantageous polymorphisms aChimpanz Birgit Nickel, Gendzekhadze et al. 2009 LAD1 ıs Parra, 1 E-mail: [email protected]. Cesare de Filippo, Mol. Biol. Evol. Gen ,1 1 Homo–Pan y John Novembre Theexomesequencecapturedatareportedinthispap Gigord et al. 2001 ; erinaire du Dr. Jacquemin, Maisons-Alfort, France balancing selection, trans-species polymorphism, Charlesworth et al. 1997 et ), or pleiotropy (

When balancing selection acts on a variant long enough it ~ ao C. Teixeira, The Author 2015. Published by Oxford University Press on behalf of the Socie eserve Naturelle Sanctuaire These authors contributed equally to this work. Clinique v Lola Ya Bonobo sanctuary, Kinsha R Department of Evolutionary Genetics, MaxDepartment Planck of Institute Integrative for Biology, Evolut U International Center for Insect Phys Wright 1939 Archive (accession no. PRJEB8350). The variant calls can be acce protein sequence, is associated responsible for linear IgA disease. This trSNP results in aKey mi words: membrane that is responsible for maintaining cohesion at the ß 1186 balancing selection, such as segregatingwe at focus intermediate on frequency the trSNP and in lying in a locus with high genetic diversity. Here, in populations, preventing fixationcreasing of genetic alleles by diversity drift (Charlesworth and 2006 in- trSNPs show clustering of sequences by rather than b Balancing selection maintain e-mail: [email protected] Introduction sequences in the for approximately 14 My of independent evolution. Although Abstract Balancing selection maintains advantageous genetic and pheno y evolutionary periods selected polymorphismsdifferent may species. survive Here, species we splits investigate and the segregatedetected eight in role coding trans-species present-day of polymorphisms populations (trSNPs long-ter of the major histocompatibility locus 4 5 6 2 3 Jo Michael Dannemann, Matthias Meyer, 1 Long-Term Balancing Selection in

Bonobos Trans-Species Polymorphism in Humans, Chimpanzees, and Associate editor: Data deposition: *Corresponding author: Key et al. 2014).which There balancing are selection a can varietyor act, of heterozygote including mechanisms advantage overdominance through (Allison 1956 frequency-dependent selection,( and rare-allele advantage ation in selective pressures ( et al. 2008 creates long local genealogies, withtimes. unusually Selected old coalescence alleles can segregateneutral for diversity millions accumulating of near the years, selected with variant(s)to due linkage (

Article Long-Term Balancing Selection in LAD1 . doi:10.1093/molbev/msv007 MBE

A

B

tHCB

Mutation

A allele

B allele

tHCB split human-chimpanzee-bonobo

tCB split chimpanzee-bonobo tCB

hA hB bB bA cA cB Downloaded from Human Bonobo Chimpanzee

FIG.1. Schematic representation of a possible genealogy leading to a trSNP in human, chimpanzee, and bonobo.

The assumption that trSNPs are very rare in humans specific population sizes estimated in Prado-Martinez et al. combined with the absence of unbiased genome-wide poly- (2013). Under this model, given that the lineages of bonobos http://mbe.oxfordjournals.org/ morphism data sets in other great ape species resulted in few and chimpanzees diverged only about 2 Ma (Prufer€ et al. trSNPs being described in humans: Several SNPs in the major 2012) and their present-day populations share polymor- histocompatibility locus (MHC; Klein et al. 1993; Asthana et al. phisms, we expect, under neutrality, 0.85% of the SNPs in 2005), and a few non-MHC genes (e.g., TRIM5 [Cagliani et al. bonobos to be segregating in chimpanzees, and 4.6% of chim- 2010], ZC3HAV1 [Cagliani et al. 2012], and ABO [Segurel et al. panzee SNPs to also be segregating in bonobos (see supple- 2012]). Recently, six well-defined short trans-species haplo- mentary information SI, Supplementary Material online). types containing at least two trSNPs shared in humans and Conversely, a neutral trSNP between Homo and any of the chimpanzees have been identified (Leffler et al. 2013). two Pan species is unlikely to occur by alone: We at MPI Study of Societies on March 30, 2016 Interestingly, none of these haplotypes contains coding estimate that a SNP found in a sample of humans has a 8 SNPs, and the authors propose a role in the regulation of probability PHC =1.6 10 of being polymorphic in chim- genes for the maintenance of these SNPs. Leffler et al. panzees too (see also supplementary information SI, (2013) also identified a number of coding SNPs shared be- Supplementary Material online). The model also allows us tween humans and chimpanzees, but because filtering on to calculate the probability of observing a SNP shared by all allelic trees or CpG sites was not performed, it is unclear three species (bonobo, chimpanzee, and human) in a sample whether they represent trSNPs or recurrent mutations (an of human SNPs. This probability (called PFINAL) is, under neu- 10 important question in the identification of trSNPs, see below). trality, approximately equal to 4.0 10 .Thisisroughly39 Here, we analyze the exomes of 20 humans, 20 chimpan- times lower than the probability that a SNP in humans is also zees, and 20 bonobos to identify trSNPs present since the polymorphic in chimpanzees (PHC), illustrating the advantage Homo–Pan common ancestor until the present-day popula- of including bonobos in the comparison. Given that we ob- 5 tion of each of the three species. By including the three species serve 121,904 human SNPs, we expect about 5.0 10 neu- we focus only on strong balancing selection that has been tral trSNPs in the three species. We note that these are maintained in the three lineages. Besides identifying coding actually overestimates, because coding variation is subject trSNPs in several MHC genes, we also identify a novel trSNP to purifying and background selection that produce shallower (rs12088790) maintained by long-term balancing selection in coalescent trees than neutrally evolving loci. Therefore, any the gene LAD1 (ladinin-1). trSNP that we find is highly unlikely to have occurred under neutrality. An exploration of the behavior of the model under Results a range of parameters for the split times and population sizes is detailed in supplementary information SI, Supplementary A Model for Neutral trSNPs in Humans, Material online. We note that the parameters needed to ex- Chimpanzees, and Bonobos plain the presence of neutral trSNPs in the three species are As mentioned above, the presence of neutral trSNPs is un- unrealistic, given our knowledge of human and great ape likely when species diverged long ago. To estimate how prob- demographic history. ableasharedSNPwouldbeinasampleofSNPsfromoneof the three species, we developed a model based on coalescent Idenfication of trSNPs theory (supplementary information SI, Supplementary We sequenced the exomes of 20 Yoruba humans, 20 central Material online), assuming the ancestral and the species- chimpanzees (Pan troglodytes troglodytes), and 20 bonobos

1187 Teixeira et al. . doi:10.1093/molbev/msv007 MBE

(Pan paniscus) to an average coverage of approximately 18 species tree allelic tree in each individual (data are very homogeneous across species orangutan A orangutan C in coverage and quality, see Materials and Methods). We human C chimp G uncovered a total of 121,904 high-quality SNPs in human, 262,960 in chimpanzee, and 99,142 in bonobo. This represents human A bonobo G a novel SNP discovery rate of approximately 33.54% in chimp A human G bonobo, approximately 49.29% in chimpanzee, and approxi- mately 2.8% in human (compared with Prado-Martinez et al. chimp C bonobo C [2013] anddbSNPbuild138).Wefocusedonthe202coding bonobo C human C SNPs with the same two segregating alleles in the three spe- cies, the shared SNPs (shSNPs). bonobo A chimp C Two important confounding factors in the identification of FIG.2. Examples of a species tree and an allelic tree using six haplotypes, trSNPs are genotype errors and recurrent mutations. To limit one per species and allele. Each Neighbor-joining tree is computed on a the influence of genotype errors in the form of mapping and 500 bp region around a shSNP in our data set for the genes TXNDC2 sequencing artifacts, we conservatively removed SNPs that fall (species trees) and HLA-DQA1 (allelic tree).

in sites that: 1) are in the upper 5% tail of the empirical Downloaded from distribution of coverage in at least one species; and 2) do considered ‘candidate trSNPs’. They lie in 15 different genes, not have high mappability when using the 24-mer filter including three HLA genes. Figure 3 shows the neighbor-join- (Derrien et al. 2012). We further removed SNPs that are not ing tree of one of such trSNPs, the one present in gene LAD1, in Hardy–Weinberg equilibrium (HWE) with P-value with sequences clustering by allelic type. Only two “candidate (P) < 0.05 in at least one species (see supplementary infor- trSNPs” (both in HLA-DQA1) are not associated with CpG http://mbe.oxfordjournals.org/ mation SII, Supplementary Material online). Regarding recur- sites (supplementary table S4, Supplementary Material rent mutations, they are particularly likely in hypermutable online). We also note that other HLA genes that have been sites where the probability of a parallel mutation in two lin- described before as being targets of balancing selection in eages is high. Examples of this are CpG dinucleotides (where a humans have shSNPs that were excluded from our analysis methylated cytosine can deaminate to a thymine and re- due to the stringent filtering criteria implemented, although sult in a C-4 T transition [Bird 1980; Hodgkinson and Eyre- no specific filters were applied in the MHC region. Walker 2011]), but additional, cryptic heterogeneity in muta- Because trSNPs have been previously described in HLA tion rate exists (Hodgkinson et al. 2009; Hodgkinson and Eyre- genes (Lawlor et al. 1988; Mayer et al. 1988; Fan et al. 1989; at MPI Study of Societies on March 30, 2016 Walker 2010; Johnson and Hellmann 2011). Klein et al. 1993; Asthana et al. 2005; Leffler et al. 2013)we Removing CpGs could reduce the number of recurrent focus on the remaining genes (13 candidate trSNPs). Our mutations, but SNPs associated with CpGs represent a large filtering criteria exclude the majority of systematic sequencing fraction of SNPs in the genome (~25% of human SNPs) and errors, so we next investigated the possibility of mapping recurrent mutations can also occur at non-CpG sites errors due to collapsed paralogs (when paralogs are very sim- (Hodgkinson and Eyre-Walker 2011). We therefore mark ilar in sequence, mapping errors can result in erroneous SNP but consider CpG SNPs (those for which either allele results calls). We BLAT (Kent 2002) the 25 bp region surrounding the in a CG dinucleotide), and use additional lines of information 13 non-HLA candidate trSNPs to the reference genome se- to tell apart trSNPs from recurrent mutations. Specifically, quences of human (hg19) and chimpanzee (PanTro4). Only SNPs that result from recurrent mutations are expected to four candidate trSNPs (in genes LY9, LAD1, SLCO1A2,and fall in genomic regions that follow the species tree (fig. 2) OAS1) have high mappability in all genomes (i.e., a high because the most recent common ancestor of the genomic degree of uniqueness in the genome), with the remaining segment containing a human SNP falls in the human branch, nine candidate trSNPs mapping to regions that have a close predating (backward in time) the coalescence of lineages paralog in at least one species (see supplementary informa- from the different species (see previous section). On the tion SII and table S4, Supplementary Material online). contrary, trSNPs create local genealogies that cluster by Although this does not discard these positions as SNPs in allele (fig. 2) because the most recent common ancestor of the other species (or in the species with non-unique BLAT the genomic segment containing the trSNP predates the split hits), we conservatively removed them from further analyses. of the three species (Schierup et al. 2001; Wiuf et al. 2004). We therefore focus on these four SNPs, to investigate Therefore, a SNP’s surrounding region allows us to distinguish additional signatures of long-term balancing selection. trSNPs from recurrent mutations. For each shSNP we inferred the phylogeny of its genomic region (Materials and Methods) and considered further only The Probability of an Allelic Tree shSNPs that fall in genomic regions that exhibit trees that As the allelic tree provides very strong evidence for a SNP to cluster by allelic type. Of the 202 original shSNPs, and after be a trSNP, we next aim to determine how likely an allelic tree additional filtering (coverage, mappability, and HWE), only 20 is, for each candidate trSNP, under recurrent mutation. To have a probability of an allelic tree (Pallelic) 4 0.90 (supple- answer this question we ask how often we observe an allelic mentary table S4, Supplementary Material online); these tree of the same length and minimum number of informative shSNPs, all of which are present in dbSNP build 138, were sites as those of the candidate trSNPs. We estimated the false

1188 Long-Term Balancing Selection in LAD1 . doi:10.1093/molbev/msv007 MBE

orangutan G

G

A

A

A

A

G

G

G Downloaded from G

G

human 1 nucleotide difference chimpanzee http://mbe.oxfordjournals.org/ bonobo

FIG.3. Neighbor-joining tree of LAD1. The tree was constructed using a 350 bp region as described in Materials and Methods. The size of the pie charts is proportional to the number of haplotypes (n = 120), with colors representing the species. The alleles of the trSNP are shown next to the pie charts. The orangutan sequence (PonAbe2) was used as outgroup. Three chimpanzee haplotypes carrying the G allele cluster with haplotypes carrying the A allele, likely due to a recombination event (more likely to occur in chimpanzee, the species with the largest effective population size). discovery rate (FDR, the chance of obtaining an allelic tree exhibit an excess of genetic diversity after taking heterogene- at MPI Study of Societies on March 30, 2016 under recurrent mutation) for each allelic tree length by an- ity in mutation rate into account. We calculated the ratio of alyzing random SNPs in the genome of the three species. In polymorphism to divergence (PtoD = p/(d+1), where p is the short, we pair random SNPs in the human genome with a number of polymorphisms identified in a species and d the close-by SNP in chimpanzees and bonobos; these nearby, in- number of fixed differences identified between species—see dependent mutations act as pseudorecurrent mutations supplementary information SIV, Supplementary Material where to investigate the neutral probability of an allelic tree online) in the genes containing our four non-HLA candidate (see Materials and Methods, suppiementary information SIII trSNPs (LY9, LAD1, SLCO1A2,andOAS1); we also analyze the and table S1, Supplementary Material online, for details). We seven HLA trSNPs (HLA-C, HLA-DQA1,andHLA-DPB1). For found that, as expected, the FDR is inversely proportional to each gene we investigate different genomic regions, in each the length of an allelic tree (supplementary information SIII, species: 1) “ALL”—the entire genic region; 2) “coding”—only Supplementary Material online); that is, the longer the geno- their coding exonic sequence; 3) “500 bp”—the 500 bp sur- mic region, the lower the FDR of an allelic tree because the rounding the trSNP; and 4) the “length of allelic tree” (sup- number of phylogenetically informative positions grows and plementary table S4, Supplementary Material online). First, if so does the chance for recombination. If we condition on we focus on individual genes, HLA genes are in the very far tail observing additional informative sites (besides the candidate of the empirical genomic distribution of PtoD,withasignifi- trSNP) the FDR drops substantially and becomes more uni- cant excess of polymorphism in all the comparisons per- form across the different lengths (supplementary table S1, formed (supplementary table S4, Supplementary Material Supplementary Material online). online). For non-HLA genes, only LAD1 shows a consistent For our set of candidate trSNPs, and after considering the excess of diversity in the three species, with most comparisons exact number of informative sites uncovered in the length of being significant in human andbonobo,andmarginally each allelic tree, only LY9’s allelic tree shows high FDR (36.8% nonsignificant in chimpanzee (see tables 1 and 2; supplemen- for 100 bp and four informative sites). All other candidate tary table S4 and information SIV, Supplementary Material trSNPs fall in allelic trees that given the number of informative online). The weaker signal in chimpanzee is likely due to this sites uncovered in each tree have low FDR (table 1). species’ larger effective population size (Prado-Martinez et al. 2013) that translates in higher genomic diversity and lower power to detect the localized increased diversity in LAD1.The Excess of Polymorphism Linked to the trSNPs signal is weaker for the other three genes. No excess of poly- We further investigate whether, as expected under long-term morphism is observed in SLCO1A2,andinLY9 high PtoD balancing selection, the candidate trSNPs fall in regions that values are observed only for the “length of tree,” due to the

1189 Teixeira et al. . doi:10.1093/molbev/msv007 MBE

Table 1. Comparison of Different Signatures in Candidate trSNPs and Genes. PtoD (p)MAF chr:position Gene Tree bp #SNPs #FDs HCB3sppHCB (FDR%) (H;C;B) (H;C;B) 1:160788067a LY9 100 (36.8) (2;1;1) (0;0;0) 0.3 (0.60) 1.1 (0.34) 0.4 (0.34) 1.7 (0.43) 0.100 0.300 0.125 1:201355761a LAD1 350 (1.5) (3;3;3) (0;0;0) 1.5 (0.02) 2.4 (0.06) 1.5 (0.02) 4.2 (0.03) 0.450 0.325 0.225 6:31237124a HLA-C 150 (4.0) (2;3;2) (0;0;0) 25.0 (0.00) 22.0 (0.00) 20.0 (0.00) 38.0 (0.00) 0.225 0.225 0.225 6:32609097 HLA-DQA1 100 (0.0) (11;7;5) (0;0;0) 39.0 (0.00) 39.0 (0.00) 38.0 (0.00) 60.0 (0.00) 0.200 0.400 0.050 6:32609105a 250 (0.0) (19;14;13) (0;0;0) 0.500 0.350 0.050 6:32609271a 750 (0.0) (25;23;24) (0;0;0) 0.475 0.400 0.050 6:33052736a HLA-DPB1 1,000 (0.0) (8;10;11) (0;0;0) 5.3 (0.00) 10.3 (0.00) 5.8 (0.00) 9.8 (0.00) 0.300 0.425 0.325 6:33052743 1,000 (0.0) (8;10;11) (0;0;0) 0.300 0.450 0.350 6:33052768 1,000 (0.0) (8;10;11) (0;0;0) 0.300 0.475 0.350 12:21453466 SLCO1A2 1,000 (0.9) (2;4;2) (1;1;1) 0.6 (0.26) 1.6 (0.18) 0.9 (0.07) 2.6 (0.12) 0.025 0.350 0.050 12:113354384a OAS1 250 (2.5) (1;6;1) (0;0;0) 0.4 (0.52) 5.3 (0.01) 0.5 (0.28) 2.3 (0.16) 0.025 0.350 0.025 Downloaded from NOTE.—H, human; C, chimpanzee; B, bonobo. The estimated length of the allelic trees (and respective FDR), the number of polymorphisms (#SNPs) and fixed differences (#FDs) in the allelic tree, the PtoD ratios for the whole gene and MAFs of trSNPs are shown. For “human,” we present the PtoD ratio obtained in the human–bonobo comparison, which is very similar to the human–chimpanzee comparison. The genes with trSNPs and consistent signatures of long-term balancing selection are showninitalic. aNonsynonymous trSNP. http://mbe.oxfordjournals.org/ Table 2. PtoD Ratios Calculated in the Gene LAD1 (with the corre- SLCO1A2, LY9 or OAS1 have been under balancing selection, sponding percentile in the empirical distribution in parenthesis). but conservatively we focus on LAD1, HLA-C, HLA-DQA1,and PtoD Human Chimpanzee Bonobo HLA-DPB1 as our final set of trSNPs. All 1.50 (0.023) 2.40 (0.059) 1.50 (0.019) The set of these four genes is, in all species, significantly Coding 2.00 (0.018) 1.25 (0.332) 1.00 (0.068) more polymorphic than the empirical distribution of all genes 500 bp 2.00 (0.024) 1.50 (0.317) 1.50 (0.069) with at least one variable site (polymorphism or substitution) Length allelic tree 3.00 (0.028) 3.00 (0.074) 3.00 (0.024) in our data set (supplementary tables S2 and S3 and figs. S6 3spp 4.20 (0.028) and S7, Supplementary Material online). LAD1 is the least at MPI Study of Societies on March 30, 2016 NOTE.—For human, we present the PtoD ratio obtained in the human–bonobo polymorphic of the four genes, which is not surprising as comparison, which is very similar to the human–chimpanzee comparison. the remaining trSNPs fall in HLA genes.

Intermediate- of the trSNPs and presence of a single additional SNP in humans (in such a small Linked Variants region). OAS1 shows significant excess of polymorphism only in chimpanzee. The allele frequency distribution of sites linked to a balanced We also calculate a three-species PtoD (“3spp”) for the polymorphism is expected to exhibit an excess of alleles at entire genic region by jointly considering (the union of) all frequencies close to the frequency equilibrium. If the fre- polymorphisms and divergent sites across the three species. quency equilibrium is high enough (e.g., 0.5) the local site The 3spp PtoD is unusually high in all HLA genes (P 0.002) frequency spectrum (SFS) will show an observable departure and in LAD1 (P = 0.028), but not in the other three genes from the genome-wide empirical distribution. We note that (tables 1 and 2 and supplementary table S4, Supplementary the frequency equilibrium can be at any allele frequency, so Material online). In fact, only 0.005% of genes in the genome while an excess of intermediate-frequency alleles is indicative have,ineachofthethreespecies,aP-value equal or lower of balancing selection, this is not a necessary signature. than that of LAD1. This shows that the combined excess of The SFS of the four genes together shows a significant shift diversity of LAD1 in all three species is highly unusual. In toward intermediate-frequency alleles, in all species (Mann– 10 addition, we note that LAD1’s signature is due to the Whitney U test [MW-U] P < 4 10 ; fig. 4 and table 3). strong enrichment in polymorphism in the region surround- When we consider the genes individually, almost all exhibit a ing the trSNP (rs12088790): All SNPs we identified in LAD1 are significant excess of intermediate-frequency alleles in all spe- within 182 bp of rs12088790. cies except for LAD1 in bonobo and human (marginally Taken together, these results indicate that apart from the nonsignificant), and for HLA-C in bonobo (table 3). When three HLA genes, only LAD1 has a signature of long-term we combine all SNPs in each gene (the union of SNPs in all balancing selection in the three species. OAS1 shows signa- three species) and compare the resulting SFS with the com- tures of balancing selection in central chimpanzees, which bined empirical SFS (the union of all SNPs from all three have been previously reported (Ferguson et al. 2012), but species), all genes show a significant shift towards intermedi- the gene shows rather unremarkable signatures in bonobo ate frequencies (MW-U test P 0.046, table 3), including and human (table 1). We cannot discard the possibility that LAD1.

1190 Long-Term Balancing Selection in LAD1 . doi:10.1093/molbev/msv007 MBE

FIG.4. Folded SFS of trSNPs and other SNPs in the genes. The x axis represents the MAF and the y axis the proportion of sites in that frequency bin. The histograms show the spectrum of the entire exome (“all”) for each species, excluding the four genes containing a trSNP; the lines show the combined SFS of all SNPs in the four genes containing a trSNP. The number of SNPs in each category is annotated in the legend. The trSNPs are shown as empty circles, with size proportional to the number. A black circle represents the trSNP in LAD1. Downloaded from

Table 3. P-values (MW-U test) for Excess of Intermediate-Frequency shared across human populations, although its frequency Alleles Comparing the SFS of the Genes to the Genome-Wide SFS. shows certain differences among human groups. http://mbe.oxfordjournals.org/ Gene Human Chimpanzee Bonobo 3spp LAD1 5.7 102 4.3 102 7.4 101 4.6 102 Balancing Selection in LAD1 HLA-C 2.5 102 1.8 105 5.4 102 3.1 107 6 12 3 19 LAD1 (ladinin 1) spans 18,704 bp and is composed of ten HLA-DQA1 1.4 10 1.9 10 4.5 10 3.4 10 exons. We obtained a total of 1,213 bp of the gene by se- 9 17 11 36 HLA-DPB1 4.4 10 5.1 10 1.2 10 1.7 10 quencing the complete exons 4, 7, and 9, as well as parts of 14 30 10 54 All four genes 3.9 10 2.0 10 3.7 10 1.8 10 exons2,3,and5.ThetrSNPfoundinLAD1 (chr1: 201355761, rs12088790) lies in a position that has an average

mappability 4 0.9 when considering 24-mers and average at MPI Study of Societies on March 30, 2016 The trSNP in LAD1, which is a missense polymorphism, is at intermediate frequency in all three species (table 1 and mappability = 1 when considering 35-mers (we note that fig. 4): minor allele frequency (MAF) = 0.450 in human, our reads are paired end 75-mers). rs12088790 is an A/G 0.325 in chimpanzee, and 0.225 in bonobos. These frequencies polymorphism (reverse strand), which we validated with are all in the upper quartile of the empirical allele frequency Sanger sequencing, and that results in a missense change distributions of nonsynonymous variants: In the upper 1.9% located in exon 3 that results in a Leucine to Proline quantile for human, in the 8.6% for chimpanzee, and in the change. The change has a moderately conservative 23.8% in bonobo. Grantham score (amino acid replacement score based on When we investigate the 1000 Genomes data set (Abecasis chemical similarity – Leucine -4 Proline = 98) (Grantham et al. 2012), which contains both coding and noncoding data 1974). for LAD1, we observe a significant excess of intermediate-fre- Besides altering the sequence of the protein, the trSNP is quency alleles in all African populations, although the signa- associated with expression changes in present-day humans. ture varies across human groups with some non-African Specifically, when we analyzed expression data in lymphoblast populations showing an excess of low-frequency alleles in- cell lines from a subset of the individ- stead (supplementary table S6, Supplementary Material uals (Lappalainen et al. 2013), we observed significantly lower online). The trSNP is itself present in all these populations expression of LAD1 in carriers of at least one ancestral G allele throughout the world: At intermediate frequency in all (GG and GA genotypes) than in AA homozygotes (P =0.02). African populations (31% < MAF < 48%) and at lower fre- Comparing carriers of at least one A allele with GG homozy- quency (MAF < 8%) in the non-Africans. Interestingly, gotes did not show a significant difference in expression levels (P = 0.21). This shows that the derived A allele is associated when we compute FST (Weir and Cockerham 1984)values for LAD1’s trSNP between the African Yoruba and two non- with increased expression of LAD1 in an at least partially African populations (Toscani and Han Chinese), we observe recessive manner. Mapping biases are not responsible for this result as the total number of SNPs uncovered in the high allele frequency differences (FST = 0.238 and 0.293, re- closest region (one additional SNP in the 150 bp region that spectively), which are in the 6.5% tail of the empirical FST distribution. If we condition the empirical distribution to con- affects read mapping) is only moderate. tain only alleles observed at intermediate frequency in Yoruba Discussion (30% < MAF < 50%), the FST of rs12088790 is not in the upper tail (0.114 < P < 0.310—supplementary table S9, By comparing the exomes of humans, chimpanzees, and bo- Supplementary Material online). The polymorphism is thus nobos, we identify polymorphisms maintained by long-term

1191 Teixeira et al. . doi:10.1093/molbev/msv007 MBE balancing selection in the Homo–Pan clade. Undoubtedly, populations and at low frequency in non-African populations. other cases of long-term balancing selection exist, including It is not uncommon for targets of long-term balancing selec- species-specific balancing selection (Pasvol et al. 1978; tion to show population differences in the allele frequency Bamshad et al. 2002; Wooding et al. 2004, 2005; distribution (Andres et al. 2009), sometimes due to changes in Muehlenbachs et al. 2008; Andres et al. 2009, 2010), but selective pressure across human groups (de Filippo C, Key FM, here we focus on selection that is old, strong, constant, and Ghirotto S, Benazzo A, Meneu JR, Weihmann A, NISC shared across lineages, and that results in trSNPs. Even among Comparative Sequence Program, Parra G, Green ED, and trSNPs, we focus only on coding variants shared among the Andres AM et al., submitted). The fact that only African pop- three species, and likely underestimate the number of human ulations show a significant excess of intermediate frequency trSNPs. First, by focusing on coding variation we are blind to alleles in LAD1,andthatFST for rs12088790 is high (although balancing selection that maintains variants outside genes, not significantly so) between African and non-African popu- which may not be rare (Leffler et al. 2013). Second, by restrict- lations suggest that this might be the case for LAD1. We ing on a SNP being present in the three species we discard expect low FST values between populations that have identical cases where the variant was lost in one of the lineages, which frequency equilibria, so there might have been changes in may again not be rare. Even one of the best-established cases selective pressure among human groups. Although specula-

of trSNPs, the one present in the ABO gene from humans to tive, it seems possible that some environmental pressures Downloaded from old word monkeys, is not shared among the three species long shared by humans, chimpanzees, and bonobos, and because it was lost in chimpanzees (Segurel et al. 2012). This is that still affect certain African populations, have changed in not unexpected as it is likely that one of the species has other human populations. We note nevertheless that the FST undergone demographic or selective changes that weakened values in rs12088790 are not unusually high, so the observed or changed selection on an old balanced polymorphism. population differentiation is compatible with the pure effect http://mbe.oxfordjournals.org/ Conversely, considering three species (e.g., adding bonobo) of genetic drift. reduces the probability of trSNPs under neutrality; in fact, Although rs12088790 in LAD1 is a good candidate to have after considering the number of SNPs discovered in been the target of selection (being nonsynonymous and pre- humans (121,904), we expect to observe no neutral trSNP sent in the three species), it is possible that it is instead main- (specifically, we expect 5.0 105 neutral trSNPs). tained by linkage to an undiscovered selected trSNP, as the Consistent with this, the majority of coding shSNPs we iden- maintenance of several linked trSNPs is possible under long- tified are likely the result of recurrent mutations, as they fall in term balancing selection (Segurel et al. 2012). Although more genomic regions whose phylogenies agree with the expected detailed genomic and functional analysis on LAD1 are needed at MPI Study of Societies on March 30, 2016 species tree. to completely clarify this question, we explored a recently We identify eight trSNPs that pass our filtering criteria and published catalog of great ape genetic polymorphism in that cluster by allelic tree, with an extremely low probability search for additional human–chimpanzee–bonobo shSNPs under recurrent mutation. The loci containing these eight (Prado-Martinez et al. 2013). Besides rs12088790 (which in SNPs show, in addition, the excess of polymorphism expected that data set also segregates in all three species), we identified under long-term balancing selection. Seven trSNPs are located one additional shSNP in the three species. This SNP in HLA genes (HLA-DQA1, HLA-C,andHLA-DPB1) and one is a (rs12035254, chr1:201349024) is intronic and downstream nonsynonymous SNP in exon 3 of the gene LAD1 of exon 10, and is located about 6 kb downstream (rs12088790). This variant, which has segregated for millions rs12088790 (see supplementary information SVI, of years in these lineages, represents to our knowledge the Supplementary Material online). The distance between the only trSNP known to segregate in present-day populations of two SNPs makes it unlikely that rs12035254 is responsible for these three species outside of the MHC. As for the remaining the very localized signatures in rs12088790’s genomic region. candidate trSNPs, the combined results of our analyses are We further compared the trSNPs found in this study with a not strong and consistent enough to provide unequivocal list of shSNPs between human and western chimpanzee pro- evidence that these are targets of long-term balancing selec- vided by Leffler et al. (2013) butwereunabletoretrievethem. tion (although they can be). We thus focus on LAD1,where This is likely due to different sampling and sequencing strat- theevidenceisclear. egies adopted in the two studies (see supplementary infor- Besides containing a trSNP whose genomic region clusters mation SVI, Supplementary Material online). Nonetheless, by allelic type, LAD1 exhibits high levels of genetic diversity Leffler et al. (2013) also reported several human–chimpanzee (particularly in bonobos and humans) and it shows excess of shSNPs in the genes HLA-DQA1 and HLA-DPB1, although the intermediate-frequency alleles (significant in chimpanzee and specific variants are different from the ones uncovered here. marginally nonsignificant in humans, although highly signifi- Interestingly, the two alleles of rs12088790 are associated cant in the 1000 Genomes’ Africans). LAD1 is thus an unusual with differences in expression levels of LAD1,withhigherex- gene in its consistent signatures of long-term balancing pression associated with the ancestral G allele in lymphoblas- selection. toid cell lines. This highlights the possibility that, in addition to The trSNP, rs12088790, segregates at intermediate fre- causing an amino acid replacement, the trSNP might also quency in Yoruba, bonobos and chimpanzees. It is present have regulatory effects. Association of nonsynonymous alleles in all 1000 Genomes human populations (Abecasis et al. with differences in gene expression is not rare (Lappalainen 2012), although at intermediate frequency in African et al. 2013), but we cannot discard the possibility that

1192 Long-Term Balancing Selection in LAD1 . doi:10.1093/molbev/msv007 MBE another, nearby variant, is responsible for the observed differ- DNA was extracted using the Gentra Purgene Tissue Kit ences in expression. (Qiagen), sheared to a size range of 200–300 bp using the The precise biological mechanisms leading to long-term Bioruptor (Diagenode) and converted into DNA libraries for balancing selection on LAD1 arenotknown.Thegeneen- capture and sequencing (Meyer and Kircher 2010). All sam- codes a collagenous anchoring filament protein of basement ples were double-indexed to prevent cross-sample contami- membrane at the dermal–epidermal junction. The mRNA nation during the processing and sequencing of the samples and the protein are observed in a number of tissues including (Kircher et al. 2012). Exome capture was performed using the the gastrointestinal system (and its accessory organs), the SureSelect Human All Exon 50 Mb Kit (Agilent Technologies). kidney, prostate, placenta, and one type of hematopoietic The kit design is based on the complete annotation of cells (Kim et al. 2014).Genesinvolvedincelladhesionand coding regions from the GENCODE project with a capture extracellular matrix components are enriched among candi- size of approximately 50 Mb. We selected all Ensembl genes date targets of balancing selection and among genes with (mapping uniquely to hg19) that are RefSeq genes (with intermediate-frequency alleles in pathogen-rich environ- good functional support) and targeted by our capture ments (Andres et al. 2009; Fumagalli et al. 2009, 2011; Key design, and selected their longest RefSeq transcript. Samples et al. 2014). This suggests that certain components of the were then pooled by species and sequencing was

cellular junction may benefit from the presence of functional performed on Illumina’s GAIIx platform, with paired-end Downloaded from polymorphism, perhaps as a defense against pathogens. In reads of 76 bp. this context, LAD1 may represent one of such examples. Interestingly, genetic variation in LAD1 is associated with Base Calling and Read Mapping linear IgA disease, an autoimmune blistering disease. The dis- Base calling was performed with Ibis (Kircher et al. 2009), and ease, which affects mostly children and elderly adults (McKee http://mbe.oxfordjournals.org/ et al. 2005), is caused by the presence of circulating IgA auto- reads with more than 5 bases with a base quality score lower antibodies that target peptides in the Ladinin-1 protein, caus- than 15 were discarded. Reads were aligned to the human ing an immunological reaction. This results in the disruption reference genome hg19 using BWA with default parameters. of the dermal–epidermal cohesion, leading to skin blistering Mapping all individuals to the same reference genome pre- that predominantly affects the genitalia but also the face, vented complications from mapping to genomes of different trunk, and limbs (Ishiko et al. 1996; Marinkovich et al. 1996; quality. Only reads with a mapping quality (MQ) 25 and Motoki et al. 1997; McKee et al. 2005). Although our under- mapping outside of known segmental duplications in the three species were considered for further analysis.

standing of the effect of the disease in different populations is at MPI Study of Societies on March 30, 2016 biased by the fact that the disease (which is rare) has mostly Specifically, the average coverage for each individual is been studied in Western countries, some evidence suggests 18.9 in human, 17.9 in chimp, and 17.9 in bonobo. that it is more common in Africa (Aboobaker et al. 1991; Denguezli et al. 1994; Monia et al. 2011). Balancing selection Genotype Calling and Filtering has been proposed to play a role in the evolution of autoim- Genotype calls were performed in the autosomes using the mune genes, because the inflammatory response must be Genome Analysis Toolkit (GATK) UnifiedGenotyper (version precisely balanced to be effective yet moderate (Ferrer- 1.3-14; McKenna et al. 2010). Aside from true variation, these Admetlla et al. 2008). Whether balancing selection in LAD1 preliminary SNP calls likely include false positives due to the is responsible for its role in autoimmunity remains though presence of mismapped reads, misaligned indels, and system- unclear. It is possible, and perhaps more likely, that autoim- atic errors. We used a combination of strict filters to remove mune diseases appear as consequences of diversity in proteins such errors. SNPs were removed using the following criteria that is maintained by balancing selection and happen to be (acronyms correspond to the GATK package or fields in the able to initiate pathogenic immunological reactions. Further VCF files): work is necessary to discern the functional consequences and  advantageous role of its balanced polymorphisms in humans The depth of coverage was less than 8 or greater than 100 and other primates. in at least 50% of the individuals of each species. This allowed us not only to exclude positions for which the Materials and Methods coverage depth was low, but also positions that might fall in segmental duplications not annotated in the data DNA Samples and Sequencing sets above [28–30]; We performed whole-exome capture and high-coverage se-  The quality score of the call was less than 50; quencing of 20 humans, 20 central chimpanzees (P. troglo-  There was evidence of strand bias (SB 4 0); dytes troglodytes), and 20 bonobos (P. paniscus). Human  The genotype quality was less than 10 in all individuals samples belong to the well-studied Yoruba population from carrying the alternative allele; HapMap; bonobo and chimpanzee blood samples were col-  The SNP was located within 3 bp of a homopolymer with a lected in African sanctuaries (Lola ya bonobo sanctuary in minimum length of 5 bp; Kinshasa, Democratic Republic Congo; and Tchimpounga  The SNP was located within 5 bp up- and downstream of sanctuary, Jane Goodall Institute, Republic of Congo, respec- an insertion or deletion (indel) polymorphism or substitu- tively) and immortalized as cell culture (Fischer et al. 2011). tion with the human reference genome.

1193 Teixeira et al. . doi:10.1093/molbev/msv007 MBE

Shared SNPs as trSNPs lineage, they perfectly mimic a recurrent mutation (falling Wrongly mapped reads are difficult to account for and can at the same site) in the three species. The proportion of result in an increased false discovery of shSNPs. In order to random samplings with Pallelic 4 0.9 (i.e., the criterion used remove undetected duplications, we further filtered shSNPs to consider the trSNP) reflects the FDR for that given length. to remove sites with unusually high coverage, that are in Hardy–Weinberg disequilibrium, and that do not lie in re- Polymorphism-to-Divergence Ratios gions of high mappability in the human genome (i.e., posi- We defined the ratio of polymorphism to divergence tions that have a 24-mer mappability score lower than 1, as (PtoD) =p/(d+1), where p is the number of polymorphisms defined by the CRG Alignability track in the UCSC browser observed in a species and d the number of fixed differences [Derrien et al. 2012]) (see Results). between species. For each candidate gene, we estimated sig- nificance based on the percentile of each candidate in the Haplotype Inference and Allelic Trees empirical genomic distribution of all genes. In order to ascertain significance when comparing the set We use the fastPHASE 1.4.0 software (Scheet and Stephens of candidate loci to the set of control loci (empirical distribu- 2006) to infer the chromosomal phase for the alleles of each tion), we performed two-tail MW-U tests and used a critical of the genes containing at least one shSNP. The inferences value of 5%. After comparing the PtoD values in the two Downloaded from were performed separately for each species and for each chro- groups, we sequentially removed the top candidate gene mosome using the default parameters of fastPHASE. (i.e., one gene each time) from the candidate’s group The region surrounding a trSNP is expected to follow un- and recalculated MW-UP-values maintaining the control usual genealogies where haplotypes cluster by allelic type group unaltered (see supplementary information SIV, rather than by species. This occurs because the age of the Supplementary Material online, for details). http://mbe.oxfordjournals.org/ balanced polymorphism predates the speciation time and, unless recombination happens, there will be no fixation of new mutations. We call these two types of phylogenies “allelic Measuring Expression Levels in LAD1 Alleles tree” and “species tree” (fig. 2). The trees were inferred in We analyzed lymphoblastoid cell line expression data ob- windows of different lengths (from 100 to 2,000 bp) centered tained from a subset of 462 of the 1000 genomes project on the shared polymorphism, as the region expected to follow individuals provided by Lappalainen et al. (2013).Tocompute the allelic tree is very short due to the long-term effects of gene expression we used the aligned reads provided by recombination. We considered as candidate trSNPs only Lappalainen et al. (2013) and assigned reads with an shSNPs that show an allelic tree with probability MQ 30 to protein-coding genes by overlapping the read at MPI Study of Societies on March 30, 2016 (Pallelic) 4 0.9 in a window of at least 100 nt. coordinates with gene coordinates (ENSEMBL version 69). We adopted a simple resampling approach to calculate Reads overlapping a gene are summed up and used as the Pallelic in the region surrounding a shSNP. We randomly cre- estimate for gene expression. ated 1,000 samples of six haplotypes (one haplotype per allele We grouped the individuals by their genotype at position and per species). For each of the 1,000 resamples we built a chr1:201355761 (rs12088790, the nonsynonymous trSNP in neighbor-joining tree using as distance matrix the number of LAD1). We sought to test for allele-specific expression for nucleotide differences among the six haplotypes. If the three LAD1 between individuals carrying the two different trSNP closest tips were haplotypes from the three species containing alleles by testing for differential expression between 1) the the same allele of the shSNP, it was considered an allelic tree. If groups of individuals with genotype AA versus GG/GA and the two different human haplotypes are closer to each other 2) the groups of individuals with genotype GG versus GA/AA. than to any other haplotypes, the tree was considered a spe- We computed differential expression for LAD1 for (1) and (2) cies tree (the relationship between chimpanzees and bonobos using the DESeq package (Anders and Huber 2010). was not considered because shared polymorphism can occur Expression values in both groups are modeled by a fit of a negative binomial distribution. DESeq tests then for differ- given their short divergence time). Pallelic was estimated as the proportion of resampled trees that were allelic trees. Figure 2 ences between the distributions of the two groups. shows an example of allelic and species trees built from six Supplementary Material haplotypes. We also estimated the probability to observe an allelic tree Supplementary material is available at Molecular Biology and of a given length (FDRs) under recurrent mutation and based Evolution online (http://www.mbe.oxfordjournals.org/). on our empirical data set. For each observed allelic tree lengths (supplementary table S1, Supplementary Material Acknowledgments online), we randomly chose 1,000 human SNPs and the clos- The authors thank the sequencing group at the Max Planck est SNP in chimpanzee and bonobo. We then “paired” these Institute for Evolutionary Anthropology and Martin Kircher SNPs (i.e., use the allelic information of each SNP) as if they for data production. The authors also thank Felix M. Key, occurredinthesamegenomicpositionratherthanindiffer- Gabriel Renaud, Jing Li, Kay Prufer,€ and Joshua Schraiber for ent positions, and calculated Pallelic for these haplotypes helpful discussions and suggestions, as well as four anony- (based on the alleles found in each species’ SNP). Because mous reviewers, whose comments improved substantially these SNPs arose from independent mutations in each the manuscript. The authors are grateful to the Lola Ya

1194 Long-Term Balancing Selection in LAD1 . doi:10.1093/molbev/msv007 MBE

Bonobo sanctuary in Kinshasa, Democratic Republic Congo, Ferguson W, Dvora S, Fikes RW, Stone AC, Boissinot S. 2012. Long-term and the Tchimpounga sanctuary, Jane Goodall Institute, balancing selection at the antiviral gene OAS1 in Central African chimpanzees. MolBiolEvol.29:1093–1103. Republic of Congo, for allowing access to the primate samples. Ferrer-Admetlla A, Bosch E, Sikora M, Marque`s-Bonet T, Ramırez-Soriano This work was supported by the Max Planck Society. J.C.T. is A, Muntasell A, Navarro A, Lazarus R, Calafell F, Bertranpetit J, et al. supported by Fundac¸aoparaaCi~ enciaeaTecnologia(FCT)^ 2008. Balancing selection is the main force shaping the evolution of within the Portuguese Ministry for Science and Education innate immunity genes. JImmunol.181:1315–1322. (SFRH/BD/77043/2011). Fischer A, Prufer€ K, Good JM, Halbwax M, Wiebe V, Andre C, Atencia R, Mugisha L, Ptak SE, Paabo S. 2011. Bonobos fall within the genomic variation of chimpanzees. PLoS One 6:e21605. References Fumagalli M, Cagliani R, Pozzoli U, Riva S, Comi GP, Menozzi G, Bresolin N, Sironi M. 2009. Widespread balancing selection and pathogen- Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker driven selection at blood group antigen genes. Genome Res. 19: RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map 199–212. of genetic variation from 1,092 human genomes. Nature 491: Fumagalli M, Sironi M, Pozzoli U, Ferrer-Admetlla A, Pattini L, Nielsen R. 56–65. 2011. Signatures of environmental genetic adaptation pinpoint Aboobaker J, Wojnarowska FT, Bhogal B, Black MM. 1991. Chronic pathogens as the main selective pressure through human evolution. bullous dermatosis of childhood—clinical and immunological fea- PLoS Genet. 7:e1002355. tures seen in African patients. Clin Exp Dermatol. 16:160–164. Gendzekhadze K, Norman PJ, Abi-Rached L, Graef T, Moesta AK, Downloaded from Allison AC. 1956. The sickle-cell and haemoglobin C genes in some Layrisse Z, Parham P. 2009. Co-evolution of KIR2DL3 with HLA-C African populations. Ann Hum Genet. 21:67–89. in a human population retaining minimal essential diversity Andres AM. 2011. Balancing selection in the human genome. of KIR and HLA class I ligands. Proc Natl Acad Sci U S A. 106: Encyclopedia of life sciences (eLS). Chichester: John Wiley & Sons 18692–18697. Ltd. Gigord LD, Macnair MR, Smithson A. 2001. Negative frequency- Andres AM, Dennis MY, Kretzschmar WW, Cannons JL, Lee-Lin SQ, dependent selection maintains a dramatic flower color polymor- http://mbe.oxfordjournals.org/ Hurle B, Schwartzberg PL, Williamson SH, Bustamante CD, Nielsen phism in the rewardless orchid Dactylorhiza sambucina (L.) Soo. R, et al. 2010. Balancing selection maintains a form of ERAP2 that Proc Natl Acad Sci U S A. 98:6253–6255. undergoes nonsense-mediated decay and affects antigen presenta- Gillespie JH. 1978. A general model to account for enzyme variation in tion. PLoS Genet. 6:e1001157. natural populations. V. The SAS–CFF model. Theor Popul Biol. 14: Andres AM, Hubisz MJ, Indap A, Torgerson DG, Degenhardt JD, Boyko 1–45. AR, Gutenkunst RN, White TJ, Green ED, Bustamante CD, et al. Grantham R. 1974. Amino acid difference formula to help explain pro- 2009. Targets of balancing selection in the human genome. Mol tein evolution. Science 185:862–864. Biol Evol. 26:2755–2764. Hodgkinson A, Eyre-Walker A. 2010. The genomic distribution and local Anders S, Huber W. 2010. Differential expression analysis for sequence context of coincident SNPs in human and chimpanzee. Genome Biol

count data. Genome Biol. 11:R106. Evol. 2:547–557. at MPI Study of Societies on March 30, 2016 Asthana S, Schmidt S, Sunyaev S. 2005. A limited role for balancing Hodgkinson A, Eyre-Walker A. 2011. Variation in the mutation rate selection. Trends Genet. 21:30–32. across mammalian genomes. Nat Rev Genet. 12:756–766. Bamshad MJ, Mummidi S, Gonzalez E, Ahuja SS, Dunn DM, Watkins WS, Hodgkinson A, Ladoukakis E, Eyre-Walker A. 2009. Cryptic variation in Wooding S, Stone AC, Jorde LB, Weiss RB, et al. 2002. A strong the human mutation rate. PLoS Biol. 7:e1000027. signature of balancing selection in the 5’ cis-regulatory region of Hudson RR, Kaplan NL. 1988. The coalescent process in models with CCR5. Proc Natl Acad Sci U S A. 99:10539–10544. selection and recombination. Genetics 120:831–840. Bird AP. 1980. DNA methylation and the frequency of CpG in animal Ishiko A, Shimizu H, Masunaga T, Hashimoto T, Dmochowski M, DNA. Nucleic Acids Res. 8:1499–1504. Wojnarowska F, Bhogal BS, Black MM, Nishikawa T. 1996. 97-kDa Cagliani R, Fumagalli M, Biasin M, Piacentini L, Riva S, Pozzoli U, Bonaglia linear IgA bullous dermatosis (LAD) antigen localizes to the lamina MC, Bresolin N, Clerici M, Sironi M. 2010. Long-term balancing se- lucida of the epidermal basement membrane. J Invest Dermatol. 106: lection maintains trans-specific polymorphisms in the human 739–743. TRIM5 gene. Hum Genet. 128:577–588. Johnson PL, Hellmann I. 2011. Mutation rate distribution inferred from Cagliani R, Guerini FR, Fumagalli M, Riva S, Agliardi C, Galimberti D, coincident SNPs and coincident substitutions. Genome Biol Evol. 3: Pozzoli U, Goris A, Dubois B, Fenoglio C, et al. 2012. A trans-specific 842–850. polymorphism in ZC3HAV1 is maintained by long-standing balanc- Kent WJ. 2002. BLAT—the BLAST-like alignment tool. Genome Res. 12: ing selection and may confer susceptibility to multiple sclerosis. Mol 656–664. Biol Evol. 29:1599–1613. Key FM, Teixeira JC, de Filippo C, Andres AM. 2014. Advantageous Charlesworth B, Nordborg M, Charlesworth D. 1997. The effects of local diversity maintained by balancing selection in humans. Curr Opin selection, balanced polymorphism and background selection on Genet Dev. 29C:45–51. equilibrium patterns of genetic diversity in subdivided populations. Kim MS, Pinto SM, Getnet D, Nirujogi RS, Manda SS, Chaerkady R, Genet Res. 70:155–174. Madugundu AK, Kelkar DS, Isserlin R, Jain S, et al. 2014. A draft Charlesworth D. 2006. Balancing selection and its effects on sequences in map of the human proteome. Nature 509:575–581. nearby genome regions. PLoS Genet. 2:e64. Kircher M, Sawyer S, Meyer M. 2012. Double indexing overcomes inac- Clark AG. 1997. Neutral behavior of shared polymorphism. Proc Natl curacies in multiplex sequencing on the Illumina platform. Nucleic Acad Sci U S A. 94:7730–7734. Acids Res. 40:e3. Denguezli M, Ben Nejma B, Nouira R, Korbi S, Bardi R, Ayed K, Essoussi Kircher M, Stenzel U, Kelso J. 2009. Improved base calling for the AS, Jomaa B. 1994. [Iga linear bullous dermatosis in children. A series Illumina Genome Analyzer using machine learning strategies. of 12 Tunisian patients]. Ann Dermatol Venereol. 121:888–892. Genome Biol. 10:R83. Derrien T, Estelle J, Marco Sola S, Knowles DG, Raineri E, Guigo R, Ribeca KleinJ,SattaY,O’HUiginC,TakahataN.1993.Themoleculardescentofthe P. 2012. Fast computation and applications of genome mappability. major histocompatibility complex. Annu Rev Immunol. 11:269–295. PLoS One 7:e30377. Lappalainen T, Sammeth M, Friedlander MR, ‘t Hoen PA, Monlong J, Fan WM, Kasahara M, Gutknecht J, Klein D, Mayer WE, Jonker M, Klein J. Rivas MA, Gonzalez-Porta M, Kurbatova N, Griebel T, Ferreira PG, 1989. Shared class II MHC polymorphisms between humans and et al. 2013. Transcriptome and genome sequencing uncovers func- chimpanzees. Hum Immunol. 26:107–121. tional variation in humans. Nature 501:506–511.

1195 Teixeira et al. . doi:10.1093/molbev/msv007 MBE

Lawlor DA, Ward FE, Ennis PD, Jackson AP, Parham P. 1988. HLA-A and Pasvol G, Weatherall DJ, Wilson RJ. 1978. Cellular mechanism for the B polymorphisms predate the divergence of humans and chimpan- protective effect of haemoglobin S against P. falciparum malaria. zees. Nature 335:268–271. Nature 274:701–703. Leffler EM, Gao Z, Pfeifer S, Segurel L, Auton A, Venn O, Bowden R, Prado-Martinez J, Sudmant PH, Kidd JM, Li H, Kelley JL, Lorente-Galdos Bontrop R, Wall JD, Sella G, et al. 2013. Multiple instances of ancient B, Veeramah KR, Woerner AE, O’Connor TD, Santpere G, et al. 2013. balancing selection shared between humans and chimpanzees. Great ape genetic diversity and population history. Nature 499: Science 339:1578–1582. 471–475. Marinkovich MP, Taylor TB, Keene DR, Burgeson RE, Zone JJ. 1996. LAD- Prufer€ K, Munch K, Hellmann I, Akagi K, Miller JR, Walenz B, Koren S, 1, the linear IgA bullous dermatosis autoantigen, is a novel 120-kDa Sutton G, Kodira C, Winer R, et al. 2012. The bonobo genome anchoring filament protein synthesized by epidermal cells. JInvest compared with the chimpanzee and human genomes. Nature Dermatol. 106:734–738. 486:527–531. Mayer WE, Jonker M, Klein D, Ivanyi P, van Seventer G, Klein J. Scheet P, Stephens M. 2006. A fast and flexible statistical model for large- 1988. Nucleotide sequences of chimpanzee MHC class I alleles: scale population genotype data: applications to inferring missing evidence for trans-species mode of evolution. EMBO J. 7: genotypes and haplotypic phase. Am J Hum Genet. 78:629–644. 2765–2774. Schierup MH, Mikkelsen AM, Hein J. 2001. Recombination, balancing McKee PH, Calonje E, Granter SR, editors 2005. Pathology of the skin: selection and phylogenies in MHC and self-incompatibility genes. with clinical correlations. Edinburgh: Philadelphia Elsevier Mosby. Genetics 159:1833–1844. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Segurel L, Thompson EE, Flutre T, Lovstad J, Venkat A, Margulis SW, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, Moyse J, Ross S, Gamble K, Sella G, et al. 2012. The ABO blood group et al. 2010. The genome analysis toolkit: a MapReduce frame- is a trans-species polymorphism in primates. Proc Natl Acad Sci U S Downloaded from work for analyzing next-generation DNA sequencing data. A. 109:18493–18498. Genome Res. 20:1297–1303. Takahata N, Nei M. 1990. Allelic genealogy under overdominant and Meyer M, Kircher M. 2010. Illumina sequencing library preparation for frequency-dependent selection and polymorphism of major histo- highly multiplexed target capture and sequencing. Cold Spring Harb compatibility complex loci. Genetics 124:967–978. Protoc. 2010:pdb prot5448. Weir BS, Cockerham CC. 1984. Estimating F-statistics for the analysis of Monia K, Aida K, Amel K, Ines Z, Becima F, Ridha KM. 2011. Linear IgA population-structure. Evolution 38:1358–1370. http://mbe.oxfordjournals.org/ bullous dermatosis in tunisian children: 31 cases. Indian J Dermatol. Wiuf C, Zhao K, Innan H, Nordborg M. 2004. The probability and chro- 56:153–159. mosomal extent of trans-specific polymorphism. Genetics 168: Motoki K, Megahed M, LaForgia S, Uitto J. 1997. Cloning and chromo- 2363–2372. somal mapping of mouse ladinin, a novel basement membrane Wooding S, Kim UK, Bamshad MJ, Larsen J, Jorde LB, Drayna D. 2004. zone component. Genomics 39:323–330. and in PTC, a bitter-taste Muehlenbachs A, Fried M, Lachowitzer J, Mutabingwa TK, Duffy PE. receptor gene. Am J Hum Genet. 74:637–646. 2008. Natural selection of FLT1 alleles and their association with Wooding S, Stone AC, Dunn DM, Mummidi S, Jorde LB, Weiss RK, Ahuja malaria resistance in utero. Proc Natl Acad Sci U S A. 105: S, Bamshad MJ. 2005. Contrasting effects of natural selection on

14488–14491. human and chimpanzee CC chemokine receptor 5. Am J Hum at MPI Study of Societies on March 30, 2016 Muirhead CA, Glass NL, Slatkin M. 2002. Multilocus self-recognition Genet. 76:291–301. systems in fungi as a cause of trans-species polymorphism. Wright S. 1939. The distribution of self-sterility alleles in populations. Genetics 161:633–641. Genetics 24:538–552.

1196