and Immunity (2012) 13, 98–102 & 2012 Macmillan Publishers Limited All rights reserved 1466-4879/12 www.nature.com/gene

SHORT COMMUNICATION Genetic diversity in human erythrocyte

J Berghout1, S Higgins2, C Loucoubar3, A Sakuntabhai3,4, KC Kain2 and P Gros1 1Department of Biochemistry, McGill University, Complex Traits Group, Montreal, Canada; 2McLaughlin-Rotman Centre for Global Health, Toronto General Hospital, University of Toronto, Toronto, Canada; 3Institut Pasteur, Unite´ de Pathoge´nie Virale, Paris, France and 4Center of Excellence for Vector and Vector-Borne Diseases, Faculty of Science, Mahidol University, Bangkok, Thailand

Previously, we have shown that pyruvate kinase, liver and red cell isoform (PKLR) deficiency protects mice in vivo against blood-stage malaria, and observed that reduced PKLR function protects human erythrocytes against Plasmodium falciparum replication ex vivo. Here, we have sequenced the human PKLR in 387 individuals from malaria-endemic and other regions in order to assess genetic variability in different geographical regions and ethnic groups. Rich genetic diversity was detected in PKLR, including 59 single-nucleotide polymorphisms and several loss-of-function variants (frequency 1.5%). Haplotype distribution and allele frequency varied considerably with geography. Neutrality testing suggested positive selection of the genein the sub-Saharan African and Pakistan populations. It is possible that such positive selection involves the malarial parasite. Genes and Immunity (2012) 13, 98–102; doi:10.1038/gene.2011.54; published online 11 August 2011

Keywords: genetics; pyruvate kinase; sequencing; genetic diversity; CEPH; malaria

Introduction polymorphisms (SNPs; Figure 1b, Supplementary Table 1 for details). Full nucleotide sequences can be found Erythrocytes rely exclusively on for production by accessing the CEPH database. Our data is almost of adenosine triphosphate, and, therefore, loss of pyruvate certainly an underestimate as the sequencing technology kinase, liver and red cell isoform (PKLR) activity has major failed to accurately read through two highly repetitive consequences for cell function and lifespan. PK deficiency AT-rich regions of the gene.7 The majority of SNPs were is the second most common cause of hereditary non- 1 in non-coding regions, though seven synonymous spherocytic hemolytic anemia in humans. The large coding SNPs (A36A, Q57Q, L61L, R194R, L236L, L301L mutational spectrum in PKLR is associated with a broad and R569R) and seven non-synonymous mutations were clinical spectrum, varying from mild anemia to a transfusion- 2,3 detected (Figure 2). Many SNPs (21/59) were limited dependent disorder. Clinical PK deficiency is usually to a single heterozygote individual. The sub-Saharan associated with homozygosity or compound heterozyg- 3 population (SS) showed the highest diversity with 31 osity of mutant alleles. PK-deficiency protects mice polymorphisms, including 13 unique sites (Figure 1b; against blood-stage malaria (Plasmodium chabaudi) in 4 Supplementary Table 2). The European (EUR) dataset vivo; and studies with P. falciparum-infected erythrocytes had the lowest genetic diversity (eight SNPs), though demonstrate both homozygosity and heterozygosity only a small number of individuals were included for mutant PKLR alleles decrease parasite invasion and 5,6 (n ¼ 23). All SNPs were in Hardy–Weinberg equilibrium replication, and increase phagocytosis of infected cells. within the populations in which they were detected with These findings raise the possibility that human PKLR the exception of rs3762272 in Pakistan (PAK), which variants may have a protective effect if retained in had an excess of minor allele homozygotes (Po0.0002). malaria-pressured areas. To address this possibility, Most of these were Hazara, likely indicating popu- we sequenced the PKLR gene from 387 individuals, lation substructure in the PAK group. This SNP was primarily those of African and Asian origin living in also enriched in PAK (minor allele frequency, malaria-endemic areas (Figure 1a). MAFPAK ¼ 11.6%) and Southeast Asia (SEA, MAFSEA ¼ 19.6%) while being nearly absent in Africa. We observed linkage disequilibrium across the entire Results and discussion gene in all but the EUR population. To compare haplotype distribution between populations, we used Sequencing PKLR identified 59 polymorphic sites, the seven SNPs indicated by arrows in Figure 1b, which including 30 previously unreported single-nucleotide assigned 490% of samples to a haplotype with a frequency 42% (Figure 1c). Two haplotypes dominated: Correspondence: Dr P Gros, Department of Biochemistry, McGill 50-TTGGCCT-30 (haplotype 1), which accounts for the University, Complex Traits Group, Bellini Life Sciences Building, majority of samples in North Africa (NA), PAK, SEA and room 366, 3649 Sir William Osler Promenade, Montreal, Quebec, EUR, and matches the UCSC reference genome—and Canada H3G-0B1. 50-CCCGATC-30 (haplotype 2), which accounts for E-mail: [email protected] Received 5 April 2011; revised 30 June 2011; accepted 7 July 2011; the majority of SS and South African (SA) haplotypes. published online 11 August 2011 The ancestral haplotype 7 was represented at 10.4% PKLR genetic diversity J Berghout et al 99

Burusho

Orcadian Kalash NA SS SA PAK SEA EUR Tuscan Hazara [1] TTGGCTC 0.569 0.182 0.320 0.666 0.469 0.572 Pathan Balochi [2] CCCGACT 0.277 0.496 0.620 0.176 0.229 0.159 Mozabite Brahui Sindhi [3] CCCAACT 0.087 0.206 0.023 [4] TCGGCTC 0.145 0.011 Mandenka Makrani [5] CCCGACC 0.083 inset [6] CCCGCCT 0.034 [7] CCCGCTC 0.104 0.04 Papuan Yoruba Cambodian [8] TTGGCCC 0.175 [9] TTCGCTC 0.052 Biaka Kenyan Pygmy Bantu Melanesian Mbuti Pygmy San

Bantu Indel-1 SNP-1 A31V SNP-2 SNP-3 SNP-4 Indel-2 SNP-5 SNP-6 rs8177962 SNP-7 rs8177963 SNP-8 rs8177964 rs3020781 SNP-9 rs8177971 SNP-10 rs8177872 rs2071053 SNP-11 SNP-12 SNP-13 rs8177973 SNP-14 V269F L272V E277K A295I SNP-15 rs8177979 rs8177981 SNP-16 rs61208773 R449C rs8177983 rs4620533 rs8177985 SNP-17 SNP-18 SNP-19 SNP-20 SNP-21 SNP-22 rs8177987 SNP-23 rs3762272 R486W H507H SNP-24 SNP-25 (triple) rs1052176 rs1052177 SNP-26 (multiple) rs932972 NA SS SA PAK SEA EUR N S S S S S N N N N S N N S S

Figure 1 (a) 387 unrelated DNA samples were selected from the H952 subset10 of the Human Genetic Diversity Panel provided by the Centre d’Etude du Polymorphisme Humain.24 Ethnic groups were pooled into geographically based populations: ‘North African’ (Mozabite n ¼ 29), ‘Sub-Saharan’ (Biaka Pygmy n ¼ 23, Mbuti Pygmy n ¼ 13, Yoruba n ¼ 22, Mandenka n ¼ 22), ‘South African’ (San n ¼ 6, Bantu n ¼ 8, Kenyan Bantu n ¼ 11), ‘Pakistan’ (inset; Brahui n ¼ 25, Balochi n ¼ 24, Hazara n ¼ 22, Makrani n ¼ 25, Sindhi n ¼ 24, Pathan n ¼ 24, Kalash n ¼ 23, Burusho n ¼ 25), ‘Southeast Asian’ (Cambodian n ¼ 10, Melanesian, n ¼ 11, Papuan n ¼ 16) and European (Tuscan n ¼ 8, Orcadian n ¼ 15). All samples were originally collected following proper informed consent. (b) Coding regions of PKLR were PCR amplified and sequenced (see Supplementary Table 3 for primers). Polymorphic sites are identified as columns, listed from 50–30. Each SNP is scored within the given population as having only the reference allele detected (white), alternative allele detected as a single heterozygous individual (gray), or alternative allele detected in more than one individual (black). Slashes indicate 475% missing data at this locus. Letters below the grid indicate synonymous (S) or non-synonymous (N) substitutions in the amino-acid sequence of the protein. (c) Haplotype frequency distributions were determined in Haploview25 according to Gabriel’s method,26 selecting SNPs with a MAF of at least 10% that defined whole gene linkage disequilibrium (arrows in panel b). frequency in SS, but below 5% in other populations. using Sweep9 (data not shown). In the NA dataset,

All populations had significantly different haplotype A31V appeared in four individuals (MAFNA ¼ 6.9%), distributions (w2-test, Po0.05). three times on haplotype 2 (27% frequency) and a fourth The seven non-synonymous PKLR coding mutations time on haplotype 5 that differs by a single T/C suggest an overall mutant allele frequency of 1.5±0.9% polymorphism at the distal end of the haplotype (Figure 2). All mutations were found as heterozygotes, block. This polymorphism may have appeared recently and no one had more than one mutation present. on a common core haplotype within the Mozabite Pakistan had the highest mutation diversity with V269F lineage, and been retained at a moderate frequency (Hazara), L272V (Balochi), R449C (Makrani), E277K either by positive selection or by relatedness within the (Brahui) and R486W (Burusho) appearing as singletons. population (Supplementary Figure 1). Although the When the data were phased using BEAGLE software,8 all mutated individuals are not first- or second-degree of the PAK SNPs except E277K appeared on haplotype 1, relatives,10 CEPH panel Mozabites have a high inbreed- which was the most frequently observed haplotype in ing coefficient11 and historically small effective popula- this population. R486W in NA appeared on the same tion size.12 Recent common ancestry or ascertainment haplotype. E277K appeared in a Brahui (PAK) individual bias inherent to the small sample size may be confound- on haplotype 7 and on a rare haplotype (50-CCGGACT-30, ing variables. o2%) in a Mandenka individual (SS), having likely To examine more closely the potential functional arisen separately. The A295I mutation in a Biaka Pygmy consequences of the mutations, we compared the PKLR appeared on the most common haplotype in SS protein across 13 species (Homo sapiens NP_000289.1, (haplotype 2). All of these mutations are too rare to have Pan troglodytes XP_524896.2, Macaca mulatta, influenced the underlying haplotype distribution, when XP_001112902.2, Canis lupus familiaris XP_547547.2, Bos we examined the extended haplotype homozygosity taurus NP_001069644.1, Mus musculus NP_038659.2,

Genes and Immunity PKLR genetic diversity J Berghout et al 100 L272V E277K A31V V269F A295I R449C R486W N domain A domain B domain A domain C domain

Catalytically active residues

A31V V269F H. sapiens : LAKSILIG A PGGPAGYL LPGLSEQD V RDLRFGVE M. mulatta : LAKSILIG A PGGPAGYL LPGLSEQD V RDLRFGVE M. musculus : LAKSALSG A PGGPAGYL LPGLSEQD L LDLRFGVE X. laevis : ------LPALSERD C LDLKFGIE D. melanogaster : ------M LPAVSEKD K SDLLFGVE **.:**:* ** **:: Active L272V E277K H. sapiens : LSEQDVRD L RFGVEHGV VRDLRFGV E HGVDIVFA A295I site M. mulatta : LSEQDVRD L RFGVEHGV VRDLRFGV E HGVDIVFA M. musculus : LSEQDLLD L RFGVEHNV LLDLRFGV E HNVDIIFA V269F X laevis : LSERDCLD L KFGIEQGV CLDLKFGI E QGVDMVFA D. melanogaster : VSEKDKSD L LFGVEQEV KSDLLFGV E QEVDMIFA L272V :**:* * * **::: * ** **: : : **::*. R486W

A295I R449C H. sapiens : FVRKASDV A AVRAALG LRRAAPLS R DPTEVTAI M. mulatta : FVRKASDV A AVRAALG LRRAAPLS R DPTEVIAI M. musculus : FVRKASDV V AVRDALG LRRAAPLS R DPTEVTAI X laevis : FIRKAQDV H TIRKELG LRRVTPLT Q DPTEVTAI E277K D. melanogaster : FIRNAAAL T EIRKVLG LVRGAG-T I DASHAAAI *:* * : :* ** : * :.:.. * R486W H. sapiens : GRSAQLLS R YRPRAAVI Allosteric R449C M. mulatta : GRSAQLLS Q YRPRAAVI site M. musculus : GRSAQLLS R YRPRAAVI X laevis : GRSAQLLS R YRPRAPII D. melanogaster : GKSAFQVS K YRPRCPII *:** :* : ****. :*

Figure 2 (a) Schematic representation of the PKLR protein, known structural and functional domains, and location of non-synonymous amino-acid substitutions. (b) Subset of multiple species protein alignments indicating conservation of mutated residues. ‘*’ indicates 100% identity across all 13 species, ‘:’ strong group conserved and ‘.’ weak group conserved according to ClustalW. (c) Three-dimensional modeling of the non-synonymous substitutions in the PKLR protein based on the 2VG crystal structure.2

Rattus norvegicus NP_036756.3, Oryctolagus cuniculus clustered within a 25-residue segment of the A domain XP_002715467.1, Monodelphis domestica XP_001374169.1, of the , which contains active site residues, while Gallus gallus NP_990800.1, Xenopus laevis R449C and R486W map near the allosteric regulatory site NP_001083514.1, Danio rerio NP_958446.1, Drosophila (C domain). The large number and variety of PKLR melanogaster AAF55979.3) using ClustalW. Overall, PKLR mutations reported to cause PK deficiency suggest that is well conserved with a minimum of 60% identity across the protein is highly sensitive to even small changes. the species examined. To quantify conservation, we used Modest changes in protein folding/stability could have SIFT, which assesses conservation and the physical serious consequences for long-term erythrocyte function properties to assign scores where values o0.05 are as mature cells lack the means to compensate by predicted to affect protein function. A31V (median generating more protein. sequence conservation (MSC) ¼ 4.32; score ¼ 0.00), A31V merits further comment as it appeared at high L272V (MSC ¼ 3.84; 0.00), E277K (MSC ¼ 3.84; 0.02), frequency (6.9%) in NA. This mutation lies in the R449C (MSC ¼ 3.84; 0.03) and R486W (MSC ¼ 3.84; 0.00) erythrocyte-specific N-domain, affecting a residue highly were predicted to substantially affect function, while conserved in mammals. This mutation is either V269F (MSC ¼ 3.84, 0.21) and A295I (MSC ¼ 3.84, 0.57) fairly neutral, offers some advantage, or is in linkage may be tolerated. disequilibrium with a polymorphism increasing fitness. Further analyses support that these substitutions are Distinguishing these possibilities will require character- likely to reduce PK enzymatic activity. Notably, L272V, izing A31V enzymatic activity and testing its association E277K and R486W have been reported in PK-deficient with response to malaria in case–control studies. patients2,3 with R486W being the most common mutation Should all coding variants be loss-of-function alleles, associated with PK deficiency.13 While A295I has not the expected overall rate of PK deficiency would be B1/ been previously associated with PK deficiency, a similar 4203 births. Stratified by geography, NA had the highest mutation (A295V) does cause clinical symptoms.14 rate with an 8.6% mutant allele frequency, followed by Additionally, V269F, L272V, E277K and A295I are PAK (1.3%) and SS (1.25%). No mutants were detected in

Genes and Immunity PKLR genetic diversity J Berghout et al 101 our SA, SEA or EUR datasets. Although our report is the structural implications of amino acid substitutions in PK. first to determine mutant allele frequency by complete Hum Mutat 2009; 30: 446–453. sequencing, other studies have estimated mutant allele 4 Min-Oo G, Fortin A, Tam MF, Nantel A, Stevenson MM, Gros frequencies between 0.4–6% by genotyping for common P. Pyruvate kinase deficiency in mice protects against malaria. variants, with the highest mutation prevalence in Saudi Nat Genet 2003; 35: 357–362. Arabia (6%) and Hong Kong (3%).13,15–17 5 Ayi K, Min-Oo G, Serghides L, Crockett M, Kirby-Allen M, We applied neutrality tests in DnaSP 5.1018 to Quirt I et al. Pyruvate kinase deficiency and malaria. determine whether selection has altered the expected N Engl J Med 2008; 358: 1805–1810. 19 6 Ayi K, Liles WC, Gros P, Kain KC. Adenosine triphosphate pattern of genetic variability in PKLR. Neutrality tests depletion of erythrocytes simulates the phenotype associated are based on the principle that the mutation rate is with pyruvate kinase deficiency and confers protection constant and synonymous substitutions will behave in a against Plasmodium falciparum in vitro. J Infect Dis 2009; neutral fashion, whereas non-synonymous changes 200: 1289–1299. will be either highly beneficial or deleterious to fitness, 7 Machado P, Pereira R, Rocha AM, Manco L, Fernandes N, and thus targets of evolutionary selection.20 As up to two Miranda J et al. Malaria: looking for selection signatures in thirds of statistically significant non-neutrality observa- the human PKLR gene region. Br J Haematol 2010; 149: tions can be influenced by demographic history,19 we 775–784. also generated 1000 coalescent simulations using the 8 Browning SR, Browning BL. Rapid and accurate haplotype CoSi package (http://www.broad.mit.edu/~sfs/cosi) phasing and missing-data inference for whole-genome 21 association studies by use of localized haplotype clustering. and Best-fit model described by Schaffner et al. to Am J Hum Genet 2007; 81: 1084–1097. determine empirical P-values (emp-P). This model takes 9 Sabeti PC, Reich DE, Higgins JM, Levine HZ, Richter DJ, into account migration history, bottleneck, agriculture Schaffner SF et al. Detecting recent positive selection in the history, recombination rate and mutation rate, which from haplotype structure. Nature 2002; 419: were optimized to fit well with empirical data of each 832–837. population. Simulations were run on a sequence length 10 Rosenberg NA. Standardized subsets of the HGDP-CEPH of 1725 bp corresponding to the coding region of PKLR, Human Genome Diversity Cell Line Panel, accounting for and population sizes were matched with our experi- atypical and duplicated samples and pairs of close relatives. mental data. In PAK, Tajima’s test suggested non- Ann Hum Genet 2006; 70(Part 6): 841–847. neutrality, specifically when examining the distribution 11 Leutenegger AL, Sahbatou M, Gazal S, Cann H, Genin E. of non-synonymous SNPs (D ¼À1.37292, emp-P ¼ 0.156, Consanguinity around the world: what do the genomic data of the HGDP-CEPH diversity panel tell us? Eur J Hum Genet 2011; DNS ¼À1.78894, emp-P ¼ 0.035). Fu and Li’s D* and F* 19: 583–587. also rejected neutrality for SS (D* ¼À2.89519, 12 Shi W, Ayub Q, Vermeulen M, Shao RG, Zuniga S, van der F* ¼À2.46698, emp-P ¼ 0.001) and PAK (D* ¼À4.55934, Gaag K et al. A worldwide survey of human male F* ¼À4.07077, emp-P ¼ 0.001). Both populations were demographic history based on Y-SNP and Y-STR data from significantly non-neutral when compared with chimpan- the HGDP-CEPH populations. Mol Biol Evol 2010; 27: zee using the MacDonald–Kreitman test (Po0.02). 385–393. Together, these observations are in agreement with the 13 Beutler E, Gelbart T. Estimating the prevalence of pyruvate proposal that the PKLR gene has been under selective kinase deficiency from the gene frequency in the general white pressure in some of the populations examined, possibly population. Blood 2000; 95: 3585–3588. mediated by the malarial parasite.22,23 14 Demina A, Varughese KI, Barbot J, Forman L, Beutler E. Six previously undescribed pyruvate kinase mutations causing enzyme deficiency. Blood 1998; 92: 647–652. 15 Abu-Melha AM, Ahmed MA, Knox-Macaulay H, Al-Sowayan Conflict of interest SA, el-Yahia A. Erythrocyte pyruvate kinase deficiency in The authors declare no conflict of interest. newborns of eastern Saudi Arabia. Acta Haematol 1991; 85: 192–194. 16 Feng CS, Tsang SS, Mak YT. Prevalence of pyruvate kinase deficiency among the Chinese: determination by the quanti- Acknowledgements tative assay. Am J Hematol 1993; 43: 271–273. 17 Yavarian M, Karimi M, Shahriary M, Afrasiabi AR. Prevalence This study was funded by a Canadian Institutes of of pyruvate kinase deficiency among the south Iranian Health Research (CIHR) Team Grant in Malaria (KCK, population: quantitative assay and molecular analysis. Blood PG), CIHR MOP-13721 (KCK), Genome Canada through Cells Mol Dis 2008; 40: 308–311. the Ontario Genomics Institute (KCK), CIHR Canada 18 Librado P, Rozas J. DnaSP v5: a software for comprehensive Research Chairs (KCK) and CIHR studentships (JB, SH). analysis of DNA polymorphism data. Bioinformatics 2009; 25: 1451–1452. 19 Akey JM, Eberle MA, Rieder MJ, Carlson CS, Shriver MD, Nickerson DA et al. Population history and natural selection References shape patterns of genetic variation in 132 genes. PLoS Biol 2004; 2: e286. 1 University Medical Center. PKLR Mutation Database. 20 Wayne ML, Simonsen KL. Statistical tests of neutrality in the Laboratory for Red Blood Cell Research: Utrecht, 2007. age of weak selection. Trends Ecol Evol 1998; 13: 236–240. 2 Valentini G, Chiarelli LR, Fortin R, Dolzan M, Galizzi A, 21 Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D. Abraham DJ et al. Structure and function of human erythro- Calibrating a coalescent simulation of human genome cyte pyruvate kinase. molecular basis of nonspherocytic sequence variation. Genome Res 2005; 15: 1576–1583. hemolytic anemia. J Biol Chem 2002; 277: 23807–23814. 22 Zhang L, Peek AS, Dunams D, Gaut BS. Population genetics of 3 van Wijk R, Huizinga EG, van Wesel AC, van Oirschot BA, duplicated disease-defense genes, hm1 and hm2, in maize Hadders MA, van Solinge WW. Fifteen novel mutations in (Zea mays ssp. mays L.) and its wild ancestor (Zea mays ssp. PKLR associated with pyruvate kinase (PK) deficiency: parviglumis). Genetics 2002; 162: 851–860.

Genes and Immunity PKLR genetic diversity J Berghout et al 102 23 Mirabello L, Conn JE. Molecular population genetics of 25 Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and the malaria vector Anopheles darlingi in Central and visualization of LD and haplotype maps. Bioinformatics 2005; South America. Heredity 2006; 96: 311–321. 21: 263–265. 24 Cann HM, de Toma C, Cazes L, Legrand MF, Morel V, Piouffre 26 Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, L et al. A human genome diversity cell line panel. Science 2002; Blumenstiel B et al. The structure of haplotype blocks in the 296: 261–262. human genome. Science 2002; 296: 2225–2229.

Supplementary Information accompanies the paper on Genes and Immunity website (http://www.nature.com/gene)

Genes and Immunity