Genetic Variation in African Populations: A Multi-Locus Approach to Understanding Selection and Demography in Humans

Item Type text; Electronic Dissertation

Authors Wood, Elizabeth T

Publisher The University of Arizona.

Rights Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.

Download date 27/09/2021 15:42:32

Link to Item http://hdl.handle.net/10150/195188

GENETIC VARIATION IN AFRICAN POPULATIONS: A MULTI-LOCUS APPROACH TO UNDERSTANDING SELECTION AND DEMOGRAPHY IN HUMANS

By Elizabeth T. Wood

A Dissertation Submitted to the Faculty of the DEPARTMENT OF ECOLOGY AND EVOLUTIONARY BIOLOGY In Partial Fulfillment of the Requirements For the Degree of

DOCTOR OF PHILOSOPHY In the Graduate College of the THE UNIVERSITY OF ARIZONA

2006

2

THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE

As members of the Dissertation Committee, we certify that we have read the dissertation

prepared by Elizabeth Wood

entitled Genetic Variation in African Populations: A Multi-Locus Approach to

Understanding Selection and Demography in Humans

and recommend that it be accepted as fulfilling the dissertation requirement for the

Degree of Doctor of Philosophy

______Date: April 28, 2006 Michael Hammer

______Date: April 28, 2006 Michael Nachman

______Date: April 28, 2006 Carlos Machado

______Date: April 28, 2006 Phil Hedrick

Final approval and acceptance of this dissertation is contingent upon the candidate’s submission of the final copies of the dissertation to the Graduate College.

I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.

______Date: April 28, 2006 Michael Hammer

3

STATEMENT BY THE AUTHOR

This thesis has been submitted in partial fulfillment of the requirements for an advanced degree at The University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library.

Brief quotations from this thesis are allowable without special permission, provided that accurate acknowledgement of source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the head of the major department of the Dean of the Graduate College when in his or her judgment the proposed use of the material is in the interests of scholarship. In all other instances, however, permission must be obtained from the author.

Elizabeth T. Wood

4

ACKNOWLEDGEMENTS

My deepest gratitude goes to many people who supported me during my graduate career. In particular, Michael Hammer and Michael Nachman, as well as other members of my committee (past and present) including Carlos Machado, Phil Hedrick, Kent Campbell, Wayne Maddison and Teri Markow. I also thank Matt Saunders, Bret Payseur, Jason Wilder, Dan Garrigan, Heather Norton, Tasha Altheide, Gabriela Wlasiuk, Armando Geraldes, Tovah Salcedo, Maya Pilkington, and Cora Varas for wonderful discussions and encouragement. I thank Daryn Stover, Veronica Chamberlain, Tanya Karafet, Alan Redd, Matt Kaplan who were great help in the laboratory and Gavin Nelson, Susan Miller, August Woerner who have amazing computational skills. I also thank my collaborators Montgomery Slatkin, Giovanni Destro-Bisol, and Christopher Ehret. I am most grateful for my family and friends including Dan Pavicich, Ty Wood-Pavicich, Ben Wood-Pavicich, my parents Harley Wood and Elizabeth Knowles as well as Eric Knowles and Kathy Wood, and my brother Tyler Wood and Marian Petrash (and Liam and Alex), Dorothy and Milan Pavicich, Sonia Phillips. I thank those at 2nd Street Children’s School and my good friends Tina Valentine and Andy Young (and Simon and Abby), Mari Tomizuka and Anthony Woodrow (and Gus), Kelly Tiffany Goldsmith and Leila Ali-Akbarian.

I also acknowledge European Journal of Human Genetics for allowing me to reproduce Wood et al. 12005 Contrasting Patterns of Y-Chromosome and MtDNA Variation in : Evidence for Sex-Biased Demographic Processes. EJHG 13:867-876.

5

DEDICATION

To my sons, Ben and Ty, who spent their early years watching me learn and work. Follow your passion. 6

TABLE OF CONTENTS

ABSTRACT…………………………………………………………………………...….7

CHAPTER ONE: INTRODUCTION……….……………………………………………9

CHAPTER TWO: PRESENT STUDY……….………………………………...…….…17

REFERENCES……………………………………………………………………….….19

APPENDIX A: THE β-GLOBIN RECOMBINATIONAL HOTSPOT REDUCES THE EFFECTS OF GENETIC HITCHHIKING AROUND HBC, A RECENTLY ARISEN MUTATION PROVIDING RESISTANCE TO MALARIA...... 22

APPENDIX B: THE EFFECTS OF RECOMBINATION, GENE CONVERSION, AND MUTATION ON PATTERNS OF VARIATION AT THE STRONGLY SELECTED β-GLOBIN LOCUS: THE ORIGINS OF THE HβS ALLELE REVISITED…………………………………..……….……..32

APPENDIX C: CONTRASTING PATTERNS OF Y-CHROMOSOME AND MTDNA VARIATION IN AFRICA: EVIDENCE FOR SEX-BIASED DEMOGRAPHIC PROCESSES………………..…………………..….63

APPENDIX D: HUNTER-GATHERER POPULATIONS SHARE AN ANCIENT COMMON ANCESTRY……………………………………….…...... 74

APPENDIX E: THE PHYLOGEOGRAPHY OF THE Y-CHROMOSOME……..……84

7

ABSTRACT

Mutation, recombination, selection, and demographic processes (such as gene flow and genetic drift) have shaped genetic variation, but the relative impact of these evolutionary forces remains poorly understood. This problem motivates this study which examines three regions of the genome β-globin, the Y-chromosome, and mtDNA in a two part approach to assess the relative impact of evolutionary forces on human genetic variation in Africa. The first approach characterizes levels of nucleotide variability and linkage disequilibrium across the β-globin gene and recombinational hotspot in a sample of malarial-resistance alleles (HbC and HbS). Results suggest that the age of the HbC allele is <5,000 years and selection coefficients are 0.04-0.09 and, recombination is observed within 1-kb of the selected site on >1/3 of the chromosomes sampled (Appendix A). A long-standing question regarding the HbS allele is whether it originated multiple times via recurrent mutation or whether it arose once and was transferred to different haplotypic backgrounds through recombination. These results indicate that recombination played a critical role in generating haplotypic diversity at β- globin and can explain the origins of the Bantu and Senegalese HbS haplotypes

(Appendix B). The second approach examines Y-chromosome and mtDNA variation to disentangle the relative effects of demographic forces. A detailed characterization of the

Y-chromosome and mtDNA in >1000 individuals from ~40 populations reveals that patterns of variation from these paternally- and maternally-inherited loci are remarkably different, suggesting that sex-specific demographic processes have influenced African

8

genetic variation, particularly among agriculturalists (Appendix C). Hunter-gatherer populations carry a suite of Y-chromosomes that differ from those of agricultural populations. The examination of Y-SNP and Y-STR variation in eight hunter-gatherer populations reveals the presence of a very old, >50-kya, derived lineage (B2b) shared among these populations, which is absent in agricultural populations, suggesting that hunter-gatherer populations share an ancient common ancestry (Appendix D). Finally, the Y-chromosome results are placed into a broader evolutionary context in a phylogeographic summary as it relates to archeological and linguistic variation in Africa

(Appendix E). Together these results underscore the vastly different effects that various evolutionary forces have had on shaping human genetic variation in Africa.

9

CHAPTER 1: INTRODUCTION

Explanation of the Problem and its Context

Population genetics is the study of the evolutionary forces that shape genetic variation, which include natural selection, mutation, recombination, and demography.

Mutation is the ultimate source of new variation; while recombination reshuffles existing variation producing new allelic combinations upon which natural selection can act.

Rates of mutation and recombination vary considerably across the human genome. For example, the β-globin locus experiences rates of recombination 50-fold greater than the genome average while the Y-chromosome and mtDNA experience very little or no recombination. Natural selection has had a large impact on human variation especially in regions of the genome that confer protection against the infectious disease, malaria.

Demographic processes include gene flow, isolation and/or genetic drift, population size and range expansions, as well as population subdivision and bottlenecks, which affect many regions of the genome in a similar (but not identical) manner. The forces that shape variation change over time and space and leave a multitude of histories in our genome, each one preserving its own record of the past. How have evolutionary forces shaped variation in various regions of the genome differently? What is the impact of recombination and mutation on loci under strong selective pressure? How have demographic forces such as population size and range expansions affected the distribution of genetic variation? Great advances have been made in the fields of molecular biology and population genetics, which have provided new technologies to

10

efficiently examine nucleotide variation and sophisticated theoretical models to test alternative scenarios. The rigorous application of these methods to different regions of the genome will lead us toward a more synthetic reconstruction of human evolutionary history.

This dissertation evaluates genomic patterns of human nucleotide variability in

African populations (i) to disentangle the relative effects of recombination and mutation shaping variation surrounding malarial resistant alleles at the β-globin gene and (ii) to infer the demographic processes that have shaped Y-chromosome and mitochondrial

(mtDNA) variation. The results presented here suggest that the recombination, mutation, selection, and demography have all played an important role in shaping human variation, but that the relative impact of these evolutionary forces is substantially different for different loci.

A Review of the Literature

Early studies of human genetic variation examined differences among individuals that cause (or protect against) diseases and focused on disorders of blood, or hemoglobin.

The first case of a disease associated with molecular variation was detected by protein electrophoresis by Linus Pauling (Pauling et al. 1949), who proposed that sickle cell anemia was a disease of the hemoglobin molecule and was caused by the HbS allele at the β-globin gene. J.B.S. Haldane, who together with Ronald Fisher, and Sewall Wright developed the ideas that formed the modern evolutionary synthesis in the 1930s and

1940s, proposed in 1949 that β-thalassemia major, a disease caused by another allele at

11

the β-globin gene, is lethal if an individual carries two copies of this variant

(homozygote) but is resistant to malaria if an individual carries only one copy

(heterozygote). Anthony Allison (Allison 1954; Allison 1964) showed a similar scenario

- the HbS allele causes sickle cell disease in the homozygous form but confers resistance to malaria in the heterozygous form. These findings formed the basis of ‘the malaria hypothesis’, which suggests that the elevated frequencies of hemoglobinopathies, such as

β-thalasemia and sickle cell disease, are maintained due to selection for the heterozygote

(heterozygote advantage). The HbS allele has since become a classic example of a single allelic variant that is under strong selection in humans.

The protein electrophoresis techniques first applied to the β-globin gene were later used to examine hundreds of other genetic variants in populations distributed across the world. Much of this work was compiled and synthesized in the landmark tome published by L. Luca Cavalli-Sforza and his collaborators Paolo Menozzi and Alberto

Piazza (Cavalli-Sforza et al. 1994), who provided a remarkably comprehensive reconstruction of where human populations originated and the paths by which they spread throughout the world. Since then, advances in molecular biology have allowed researchers to move beyond examinations of protein variation and to directly examine

DNA sequence variation, including the maternally-inherited mtDNA (e.g. Cann et al.

1987) and the paternally-inherited Y-chromosome (e.g. Hammer 1995). The explosion of studies that followed examined maternally and paternally-inherited regions of the genome, as well as X chromosome and autosomal loci, in hundreds of human populations distributed across the globe (Harding et al. 1997; Nachman et al. 1998; Harris and Hey

12

1999; Nachman and Crowell 2000; Przeworski et al. 2000; Excoffier 2002; Hammer et al. 2004).

Natural Selection, Recombination, and Mutation at the β-globin Locus

Infectious diseases (such as malaria, AIDS, and measles) are responsible for 62%

of all deaths in Africa, more than double that found in any other region of the world

(WHO 2003). Because Africa has had long and intense association with infectious

diseases, genetic variation found within African populations provides important clues to

the impact that selection has had on human variation. Natural selection increases or

decreases the frequency of an allele in a particular environment. Positive selection leads

to the increase in frequency of an advantageous allele and can include both positive

directional and balancing selection. Directional selection favors alleles that provide an

advantage in the homozygous form; whereas, balancing selection (sensu overdominance) favors the heterozygous form. Negative selection, sometimes referred to as purifying selection, will remove deleterious variants from a population. Evolution at the malarial- resistant β-globin gene in Africa encompasses all these types of selection. Two of several alleles at the β-globin gene (HbC and HbS) provide strong protection against infection by malaria and are found at appreciable frequencies in Africa. The HbC allele

has been shown to be advantageous in malarial areas in both the heterozygous and

homozygous form and is thus subject to positive directional selection (Modiano et al.

2001; Hedrick 2004). The HbS allele provides resistance to malaria in the heterozygous

form but causes death due to sickle cell disease in the homozygous form and is subject to

balancing selection (Allison 1954). Additionally, the 1,606 bp β-globin gene produces

13

two of the four globin chains found in hemoglobin and some changes in this gene product produce devastating effects such as extreme illness or death and, hence, the β-globin gene is also subject to purifying selection.

An important consequence of these processes is that selection not only influences the frequency of a functional allele, but it also affects variation linked to the selected site.

In the case of positive directional selection, an advantageous allele will increase in frequency carrying linked neutral variation with it, often referred to as genetic hitchhiking (Maynard-Smith and Haigh 1974) . Purifying selection will remove deleterious variants and linked neutral variation from the population, which is known as background selection. The extent of these effects on linked neutral variation depends on the type and strength of selection, the time since the selected allele arose, and the rate of recombination. The HbC and HbS alleles at β-globin are two of the most intensely selected loci in the recent history of humans. The β-globin locus is also remarkable because a hotspot of recombination is located less than 1,000 base pairs from the gene

(Harding et al. 1997; Wall et al. 2003). The close proximity of the β-globin gene and hotspot provides the unique opportunity to empirically examine, for the first time, the direct effects of recombination on a strongly selected locus in humans.

Demographic Processes in Africa

Fossil evidence suggests that our species, Homo sapiens, originated in Africa

more than 100,000 years ago (McDougall et al. 2005). This hypothesis has been

supported by mtDNA and Y-chromosome studies which indicate an African origin of

humans (Cann et al. 1987; Hammer 1995). Africa not only holds important clues to our

14

ancient evolutionary past, but also possesses a rich and complex history in the recent past.

Today, Africa is home to a large number of agricultural groups as well as several hunter- gatherer groups such as the and the Pygmies. The African continent is also one of the most linguistically diverse regions of the world. The distribution of linguistic variation has been strongly influenced by the origin and spread of agriculture (Ehret

2002; Diamond and Bellwood 2003). Two of the major African language families are the

Khoisan and Niger-Congo, which show very different patterns in both the distribution of and variation within each group (Greenberg 1964). Khoisan is spoken by a very small number of hunter-gatherer and pastoralist populations, is widely (but sparsely) distributed across eastern and , and contains unusual click sounds. The Niger-Congo language family is spoken by the vast majority of present-day sub-Saharan Africans, is widely (and densely) distributed, and contains a homogeneous sub-branch of languages, the Bantu languages, which are associated with farmers. Both linguistic and archeological evidence suggest that beginning 3,000 – 5,000 years ago Bantu-speaking farmers underwent a rapid expansion – both in number and in geographic space. Studies of Y-chromosome, mtDNA, and protein polymorphism variation in Africa have revealed the large impact that the expansion of Bantu speaking farmers has had on the distribution of genetic variation (Cavalli-Sforza et al. 1994; Hammer et al. 1998; Hammer et al. 2001;

Cruciani et al. 2002; Salas et al. 2002). However, the effect of this massive migration had on local hunter-gatherer groups and the relative impact that males and females had during this process remain largely unexplored. By comparing genetic variation in maternally- (mtDNA) and paternally-inherited (Y-chromosome) loci to linguistic and

15

geographic variation, the relative role that various demographic forces have had on genetic variation can begin to be disentangled.

Explanation of Dissertation Format

My dissertation work takes advantage of new technological and theoretical methods (i) to address longstanding and controversial questions regarding the history of malarial-resistance alleles and (ii) to test hypotheses regarding population subdivision and migration within Africa. First, I use empirically-determined DNA sequence data across a hotspot of recombination and the β-globin gene to addresses questions about the role of mutation and recombination on malarial-selected alleles, HbC and HbS (Appendix

A and B). Second, I characterize Y-chromosome single nucleotide polymorphisms (Y-

SNPs) and short tandem repeat polymorphisms (Y-STRs) and I compile mtDNA sequence variation in populations across Africa to compare male and female histories and to examine hypotheses generated in the fields of linguistics, archeology, and ethnohistory

(Appendix C, D, and E).

In Appendix A, I describe the impact that recombination has had on patterns of genetic variation surrounding the HbC allele at the β-globin locus. These results demonstrate the large impact that the β-globin recombinational hotspot has had on reducing the effects of selection surrounding the positively selected HbC allele.

Appendix B examines the relative effects of recombination, mutation, and selection surrounding the HbS allele at the β-globin locus. This study is primarily a reexamination of the well established hypothesis that the HbS allele arose multiple independent times

16

via recurrent mutation, and concludes that recombination has played a major role in generating haplotypic diversity at the β-globin locus. Appendix C compares patterns of genetic variation observed in the Y-chromosome and mtDNA in African populations to linguistic and geographic distances to assess the degree of concordance between these two compartments of the genome. This study reveals that patterns of variation from these paternally- and maternally-inherited loci are remarkably different and suggests that sex-specific demographic processes are responsible for the patterns of variation observed.

Appendix D is an examination of Y-SNP and Y-STR variation in two of the few remaining African hunter-gatherer groups, the Khoisan and Pygmies. The presence of a very old derived polymorphism shared among hunter-gatherer populations, which are absent in agricultural populations, suggests that these two groups share an ancient common ancestry. Appendix E is a phylogeographic summary of the Y-chromosome results as it relates to linguistic variation in Africa. This body of work is primarily a population genetics study of human variation but also employs a multidisciplinary approach drawing upon the fields of molecular evolution, genomics, phylogeography, and the anthropological disciplines including archeology, linguistics and ethnohistory.

17

CHAPTER 2: PRESENT STUDY

Detailed background, methods, results, and conclusions of this study are presented in the manuscripts appended to this dissertation document. To characterize genetic variation in Africa and the evolutionary forces responsible for shaping that variation, I have undertaken a multi-locus approach by surveying sequence variation at several loci including the autosomal β-globin locus and the haploid Y-chromosome and mitochondrial DNA (mtDNA) loci. The most important findings of these manuscripts are summarized below.

In Appendix A, a 5.2 kb region was sequenced and cloned in 19 individuals who carry the malaria-resistant HbC allele yielding 17 wild-type (HbA) allele and 21 HbC chromosomes. These data were supplemented with 18 chromosomes surveyed for the study outlined in Appendix B. This study estimated that the HbC mutation originated

<5,000 years ago and that selection coefficients are 0.04-0.09. Additionally, recombination is observed within 1,000 bp 5’ of the site of selection demonstrating the large effect that the β-globin hotspot has in reducing the effects of selection on linked variation. In Appendix B, the same approach was employed as that outlined in Appendix

A. This study examines the patterns of variation surrounding the HbS allele in African populations. The results indicate that a single origin of the HbS allele incorporating high levels of recombination (crossing-over and gene conversion) can explain all of the data but a model of independent origins can not be excluded. In Appendix C, I compared patterns of Y-SNP to mtDNA HVSI sequence variation that I compiled from the

18

literature. This study examines >1000 individuals sampled from ~40 populations distributed across Africa for both the Y-chromosome and mtDNA. The results from this study suggest that patterns of differentiation and gene flow have been very different for men and women in the recent evolutionary past. Sex-biased rates of admixture and/or language borrowing between expanding Bantu farmers and local hunter-gatherers likely played an important role in influencing patterns of genetic variation during the spread of

African agriculture in the last 4000 years. In Appendix D, STR variation was examined in a small subset of Y-chromosome SNP lineages (B2b) to assess the ancient divergence of two hunter-gatherer groups, the Khoisan and Pygmies. This study infers that the

Khoisan, Western Pygmies, and Eastern Pygmies share an ancient common ancestry and that the expansion of migrating Bantu farmers has not completely obscured the ancient shared history of African hunter-gatherer groups. Finally, Appendix E provides a phylogenetic summary of African Y-chromosomes within a framework of archeological, linguistic, and ethnographic data. Together these findings suggest that selection, demography, mutation, and recombination have all played an important role in influencing human variation, but that the relative impact of these evolutionary forces varies significantly among different regions of the genome.

19

REFERENCES

Allison AC (1954) Notes on sickle-cell polymorphism. Ann Hum Genet 19:39-51.

Allison AC (1964) Polymorphism and Natural Selection in Human Populations. Cold

Spring Harb Symp Quant Biol 29:137-149

Cann RL, Stoneking M, Wilson AC (1987) Mitochondrial DNA and human evolution.

Nature 325:31-36

Cavalli-Sforza LL, Menozzi P, Piazza A (1994) The History and Geography of Human

Genes. Princeton University Press, Princeton

Cruciani F, Santolamazza P, Shen PD, Macaulay V, Moral P, Olckers A, Modiano D,

Holmes S, Destro-Bisol G, Coia V, Wallace DC, Oefner PJ, Torroni A, Cavalli-

Sforza LL, Scozzari R, Underhill PA (2002) A back migration from Asia to sub-

Saharan Africa is supported by high-resolution analysis of human Y-chromosome

haplotypes. American Journal of Human Genetics 70:1197-1214

Diamond J, Bellwood P (2003) Farmers and their languages: the first expansions. Science

300:597-603

Ehret C (2002) The Civilizations of Africa: A History to 1800. The University Press of

Virginia

Excoffier L (2002) Human demographic history: refining the recent African origin

model. Curr Opin Genet Dev 12:675-682

Greenberg J (1964) Historical Inferences from Linguistic Research in Sub-Saharan

Africa. In: Butler J (ed) Boston University Papers in African History I, pp 1-15

20

Hammer MF (1995) A recent common ancestry for human Y-chromosomes. Nature

378:376-378

Hammer MF, Garrigan D, Wood E, Wilder JA, Mobasher Z, Bigham A, Krenz JG,

Nachman MW (2004) Heterogeneous patterns of variation among multiple human

X-linked Loci: the possible role of diversity-reducing selection in non-africans.

Genetics 167:1841-1853

Hammer MF, Karafet T, Rasanayagam A, Wood ET, Altheide TK, Jenkins T, Griffiths

RC, Templeton AR, Zegura SL (1998) Out of Africa and back again: nested

cladistic analysis of human Y-chromosome variation. Mol Biol Evol 15:427-441

Hammer MF, Karafet TM, Redd AJ, Jarjanazi H, Santachiara-Benerecetti S, Soodyall H,

Zegura SL (2001) Hierarchical patterns of global human Y-chromosome

diversity. Molecular Biology and Evolution 18:1189-1203

Harding RM, Fullerton SM, Griffiths RC, Bond J, Cox MJ, Schneider JA, Moulin DS,

Clegg JB (1997) Archaic African and Asian lineages in the genetic ancestry of

modern humans. Am J Hum Genet 60:772-789.

Harris EE, Hey J (1999) X chromosome evidence for ancient human histories. Proc Natl

Acad Sci U S A 96:3320-3324

Hedrick P (2004) Estimation of relative fitnesses from relative risk data and the predicted

future of haemoglobin alleles S and C. J Evol Biol 17:221-224.

Maynard-Smith JM, Haigh J (1974) The hitch-hiking effect of a favourable gene. Genet

Res 23:23-35.

21

McDougall I, Brown FH, Fleagle JG (2005) Stratigraphic placement and age of modern

humans from Kibish, . Nature 433:733-736

Modiano D, Luoni G, Sirima BS, Simpore J, Verra F, Konate A, Rastrelli E, Olivieri A,

Calissano C, Paganotti GM, D'Urbano L, Sanou I, Sawadogo A, Modiano G,

Coluzzi M (2001) Haemoglobin C protects against clinical Plasmodium

falciparum malaria. Nature 414:305-308.

Nachman MW, Bauer VL, Crowell SL, Aquadro CF (1998) DNA variability and

recombination rates at X-linked loci in humans. Genetics 150:1133-1141

Nachman MW, Crowell SL (2000) Contrasting evolutionary histories of two introns of

the duchenne muscular dystrophy gene, Dmd, in humans. Genetics 155:1855-

1864

Pauling L, Itano HA, et al. (1949) Sickle cell anemia a molecular disease. Science

110:543-548

Przeworski M, Hudson RR, Di Rienzo A (2000) Adjusting the focus on human variation.

Trends Genet 16:296-302

Salas A, Richards M, De la Fe T, Lareu MV, Sobrino B, Sanchez-Diz P, Macaulay V,

Carracedo A (2002) The making of the African mtDNA landscape. American

Journal of Human Genetics 71:1082-1111

Wall JD, Frisse LA, Hudson RR, Di Rienzo A (2003) Comparative linkage-

disequilibrium analysis of the beta-globin hotspot in primates. Am J Hum Genet

73:1330-1340. Epub 2003 Nov 1318.

WHO (2003) World Health Organization. Africa Malaria Report

22

APPENDIX A: THE β-GLOBIN RECOMBINATIONAL HOTSPOT REDUCES THE EFFECTS OF GENETIC HITCHHIKING AROUND HBC, A RECENTLY ARISEN MUTATION PROVIDING RESISTANCE TO MALARIA

Published: American Journal of Human Genetics: 77: 637-642

23 24

Am. J. Hum. Genet. 77:637–642, 2005 Report

The b-Globin Recombinational Hotspot Reduces the Effects of Strong Selection around HbC, a Recently Arisen Mutation Providing Resistance to Malaria Elizabeth T. Wood,1,2 Daryn A. Stover,1 Montgomery Slatkin,3 Michael W. Nachman,2 and Michael F. Hammer1,2 1Division of Biotechnology and 2Department of Ecology and Evolutionary Biology, University of Arizona, Tucson; and 3Department of Integrative Biology, University of California, Berkeley

Recombination is expected to reduce the effect of selection on the extent of linkage disequilibrium (LD), but the impact that recombinational hotspots have on sites linked to selected mutations has not been investigated. We empirically determine chromosomal linkage phase for 5.2 kb spanning the b-globin gene and hotspot. We estimate that the HbC mutation, which is positively selected because of malaria, originated !5,000 years ago and that selection coefficients are 0.04–0.09. Despite strong selection and the recent origin of the HbC allele, recombination (crossing-over or gene conversion) is observed within 1 kb 5 of the selected site on more than one-third of the HbC chromosomes sampled. The rapid decay in LD upstream of the HbC allele demonstrates the large effect the ß-globin hotspot has in mitigating the effects of positive selection on linked variation.

Recombinational hotspots are a ubiquitous feature of amine the signature of selection across a region that the human genome, occurring every 60–200 kb, and recombines at a rate 50 (Wall et al. 2003; Winckler et likely contribute to the observed pattern of large hap- al. 2005) to 90 (Schneider et al. 2002) times higher than lotypic blocks punctuated by low linkage disequilibrium the genomic average of 1.1 cM/Mb (Kong et al. 2004). (LD) over very short (1–2-kb) distances (Gabriel et al. More than 50 years ago, Haldane (1949) and Allison 2002; Jeffreys and May 2004; McVean et al. 2004). (1954) proposed that elevated frequencies of hemoglo- Recombination breaks up ancestral LD and produces binopathies such as thalassemia and sickle cell disease, new combinations of alleles on which natural selection which are caused by mutations at the b-globin locus, can act. Positive selection increases the frequency of ben- are maintained via balancing selection (“the malarial eficial mutations, creating LD via genetic hitchhiking hypothesis”). Since then, malarial-resistant alleles have (Smith and Haigh 1974). LD has been observed over been identified at several other loci, including G6PD, great physical distances at several genes experiencing re- TNF, and HLA (Kwiatkowski 2005). Early studies of cent selection, including loci associated with malarial the b-globin HbC polymorphism (b6GlurLys) suggested resistance (Tishkoff et al. 2001; Sabeti et al. 2002; Saun- that this allele was also subject to balancing selection ders et al. 2002; Ohashi et al. 2004). The b-globin hot- (Allison 1956). More recently, it has been shown that spot spans ∼1 kb and is located ∼500 bp from the se- HbC provides protection against Plasmodium falcipa- lected site at the b-globin gene (Harding et al. 1997; rum malaria without significantly reducing fitness, in- Wall et al. 2003). The close proximity of these b-globin dicating that this allele is increasing in frequency as a regions allows us, for the first time, to empirically ex- result of positive directional selection (Agarwal et al. 2000; Modiano et al. 2001; Hedrick 2004; Rihet et al. Received May 17, 2005; accepted for publication July 25, 2005; 2004). Because the African HbC allele rarely exceeds electronically published August 29, 2005. frequencies of 20% and is geographically concentrated Address for correspondence and reprints: Dr. Elizabeth Wood, De- in central , it is thought that this mutation partment EEB, Biosciences West, University of Arizona, Tucson, AZ is very young (Livingstone 1976; Trabuchet et al. 1991; 85721. E-mail: [email protected] ᭧ 2005 by The American Society of Human Genetics. All rights reserved. Modiano et al. 2001). Here, we examine the extent of 0002-9297/2005/7704-0012$15.00 LD surrounding the African HbC allele, to estimate its

637 25

638 Am. J. Hum. Genet. 77:637–642, 2005 age and the strength of selection acting on this mutation Our estimates accord well with those obtained using and to test the hypothesis that the b-globin recombi- theoretical predictions from epidemiological data, where ! ∼ national hotspot decouples the selected HbC allele from t1 150 generations ands 0.08 (Hedrick 2004) (see nearby upstream regions (fig. 1a). the “Material and Methods” section [online only]). To generate 5.2 kb of contiguous phased sequence At least three HbC chromosomes in our survey data, we cloned two ∼3-kb fragments that were tiled (15.8%), given at the bottom of figure 2, show evidence together using polymorphic sites in the overlapping re- for crossing-over and/or gene conversion within the hot- gion (fig. 1b). We examined 17 heterozygous and 2 ho- spot. Dgn66HbC contains a 5 haplotype motif identical mozygous individuals carrying the HbC mutation and to seven HbA chromosomes (AGCGTCTGCGA from supplemented the data set by including 18 additional sites Ϫ3092 to Ϫ1944), which is likely the result of a African HbA chromosomes (fig. 2). Within the 5.2-kb single crossover event occurring between sites Ϫ1944 b fragment spanning the -globin hotspot and gene region, and 16. Dgn99HbC possesses polymorphisms ATCTC, we observe 46 segregating sites. To delineate the bound- from sites Ϫ835 to Ϫ598, that are also found in four aries of the hotspot in our data, we examine the extent HbA chromosomes and may be due to gene conversion of LD among 35 HbA chromosomes (fig. 1c). The or double crossing-over. Dgn83HbC has a CACA motif, boundaries for the three recombinational regions cor- at sites Ϫ1927, Ϫ835, Ϫ598, and Ϫ543, that is also respond well with those observed in a sample of 16 found in three HbA chromosomes. However, Dgn83HbC African Hausa individuals (Wall et al. 2003) and in a does not contain intervening polymorphisms that are larger study of 349 globally distributed individuals (Har- identical to HbA chromosomes surveyed here and may ding et al. 1997). be the result of a single cross-over event where the re- To examine the effects of selection, we compared pat- combining HbA chromosome is not represented in our terns of LD on selected chromosomes with those on data set, or it may be the result of multiple cross-over nonselected chromosomes. Substantially more LD is ob- and/or gene conversion events. Because of the higher served among HbC than HbA chromosomes (fig. 2). We rate of gene conversion relative to crossing-over (4:1 to employed a coalescent-based method that incorporates 15:1) (Jeffreys and May 2004), it is likely that gene recombination (c), effective population size (Ne ), and conversion is responsible for some of the patterns we population growth (r) to jointly estimate the selection observe. Evidence for recombination is observed in an coefficient (s) and the time since the origin of the HbC additional five individuals (for a total of 38.1%), un- allele (t1 ). The selection coefficient is estimated using der the assumption that the HbC mutation occurred Slatkin and Bertorelle’s (2001) method for estimating on the most common HbC haplotype observed (e.g., the likelihood of s from allele frequency and the extent Dgn06HbC) (fig. 2). An alternative explanation is that of LD with a neutral marker locus. The allele age is recurrent mutation occurred at site 16-HbC or at sites estimated from the posterior probability distribution of described in the “recombinant” motifs in Dgn66, t1 as a function of s. Analyses were performed on the Dgn99, or Dgn83. However, this hypothesis requires ei- basis of haplotypes generated from sites Ϫ2906 and 16 ther (1) independent mutations at site 16-HbC on four (HbC), where we assume that the A allele at site 16 arose different HbC haplotypic backgrounds or (2) that three on a chromosome carryingaCatϪ2906 and that the (Dgn99), four (Dgn83), or five (Dgn66) sites have mu- frequency of HbC is 15% in the present generation. tated 5 of the gene on both the HbA and HbC back- Figure 3a shows the likelihood of s given the data, under grounds. Thus, it seems unlikely that elevated rates of p p the assumption that Ne 10,000 andr 0 , for a range mutation would cause decay in LD observed on the HbC of recombination rates (c). For all values of c, we reject chromosomes. neutrality (P ! 10Ϫ5 ) and conclude that s is most likely The extent of LD surrounding a selected allele is ex- in the range 0.04–0.09. Neutrality is also rejected when pected to depend on the age of the allele, the strength of we incorporate a population growth rate as high as selection, and the rate of recombination. The recent origin r p 0.009 for c p 0.004 andr p 0.02 for c p 0.008 (!5,000 years) and high selection coefficient (0.04 ! s ! (data not shown). Figure 3b shows the posterior distri- 0.09) of the HbC allele estimated here are roughly com- butions of the age of HbC for the three maximum-like- parable to those of other malarial-resistance loci, includ- lihood estimates of s obtained using three different val- ing G6PD/A- (!11,800 years; 0.02 ! s ! 0.2) (Tishkoff et ues of c. The age of the allele using this method depends al. 2001; Saunders et al. 2002, 2005), and G6PDmed primarily on s and is affected little by c (data not shown). (!6,500 years; 0.01 ! s ! 0.2) (Tishkoff et al. 2001; M. The estimated age is 75–150 generations ago (or 1,875– Saunders, M. Slatkin, M. F. Hammer, and M. W. Nach- 3,750 years, under the assumption of a 25-year gener- man, unpublished data). Long-range LD extends 400– ation time), with an upper bound !275 generations. Past 1,600 kb at loci recombining near the genome aver- growth would make the allele slightly younger, so esti- age, including G6PD/A-, G6PDmed, and TNFSF5/726C mates of age based on ther p 0 results are conservative. (Tishkoff et al. 2001; Sabeti et al. 2002; Saunders et al. 26

Figure 1 a, Map of the b-globin gene cluster found on 11p15.5. Four functional globin genes, one pseudogene, and the locus control region (LCR) that controls transcription and replication of the globin gene cluster (Aladjem et al. 1995) are located 5 of the b-globin gene. The position of the recombinational hotspot identified using RFLP analysis (A) (Chakravarti et al. 1984) and single-sperm typing (B) (Schneider et al. 2002) are shown above. b, Magnified view of b-globin gene and hotspot, as identified using population sequence data in 16 Africans (C) (Wall et al. 2003) and 349 globally distributed individuals (D) (Harding et al. 1997). The locations of primers that amplify the two 3-kb fragments are shown with horizontal arrows. The polymorphisms observed in a 5.2-kb region are indicated with International Union of Pure and Applied Chemistry (IUPAC) codes. Indels were excluded. Exons are shown in horizontal black boxes; introns are depicted by white boxes. Grey boxes indicate repeat motifs. The HbC mutation is circled. c, LD triangle plot of 35 HbA chromosomes in all pairwise comparisons between 42 sites. Black boxes indicate a significantDP value (! .01 ). Values ofFD FF are 1 unless indicated in white numbers whereDF is multiplied by 100. Dark gray boxes indicateFDF p 1 (P 1 .01 ), and white boxes indicateFDF ! 1 (P 1 .01 ). Light gray boxes correspond to the four segregating sites that are observed only on HbC chromosomes. Asterisks (*) indicate that pairwise comparisons are significant with a Bonferroni correction. The two regions of low recombination are outlined. 27

640 Am. J. Hum. Genet. 77:637–642, 2005

Figure 2 Polymorphisms observed in 35 HbA (top) and 21 HbC (bottom) chromosomes. To visually represent the decay in haplotype sharing, the conserved most-common long-range haplotype within the HbA and HbC chromosomes is shaded. Each site was examined se- quentially, moving away from site 16. Singletons were excluded. Shaded areas include sites that are most frequent within the conserved haplotype. Recombination, as well as mutation, can cause a transition from gray to white in this depiction. CAR p Central African Republic. DRC p Democratic Republic of Congo.

2002, 2005; M. Saunders, M. Slatkin, M. F. Hammer, more efficiently when genes are decoupled from one an- and M. W. Nachman, unpublished data). The b-globin other; this hypothesis predicts that recombination will locus, however, is unique in that elevated rates of re- be higher in gene-dense regions (Barton and Charles- combination are sufficiently strong to reduce the effect worth 1998). A positive correlation between recombi- of genetic hitchhiking on the HbC alleles to !1 kb up- nation and gene density (or associated features such as stream of the gene. Thus, recombinational hotspots ap- CpG motifs) has been observed in humans (Kong et al. pear to quickly erase the signal of strong and recent 2004; McVean et al. 2004). Recombinational hotspots selection. have been characterized at the intensely selected b-globin It has been hypothesized that recombination is evo- gene cluster and the gene-dense HLA region (Jeffreys et lutionarily advantageous because selection can operate al. 2000, 2001; Schneider et al. 2002; Winckler et al. 28

Reports 641

Acknowledgments We thank Matt Saunders and Jason Wilder, for fun discus- sions and helpful suggestions on the manuscript; Phil Hedrick and two anonymous reviewers, for thoughtful comments; Bev- erly Strassmann, for Dogon DNA samples and suggestions; and Matt Kaplan and Leigh Hunnicutt, for technical advice. This manuscript was made possible by National Science Foun- dation (NSF) doctoral dissertation improvement grant BCS- 0220737 (to E.T.W.) and by National Institute of General Medical Sciences grant GM-53566 (to M.F.H.). Its contents are solely the responsibility of the authors and do not neces- sarily represent the official views of the NSF or the National Institutes of Health.

References Agarwal A, Guindo A, Cissoko Y, Taylor JG, Coulibaly D, Kone A, Kayentao K, Djimde A, Plowe CV, Doumbo O, Wellems TE, Diallo D (2000) Hemoglobin C associated with protection from severe malaria in the Dogon of , a West African population with a low prevalence of hemoglobin S. Blood 96:2358–2363 Aladjem MI, Groudine M, Brody LL, Dieken ES, Fournier RE, Wahl GM, Epner EM (1995) Participation of the human beta-globin locus control region in initiation of DNA rep- lication. Science 270:815–819 Allison AC (1954) Notes on sickle-cell polymorphism. Ann Hum Genet 19:39–51 ——— (1956) The sickle-cell and haemoglobin C genes in some African populations. Ann Hum Genet 21:67–89 Barton NH, Charlesworth B (1998) Why sex and recombi- nation? Science 281:1986–1990 Chakravarti A, Buetow KH, Antonarakis SE, Waber PG, Boehm CD, Kazazian HH (1984) Nonuniform recombina- tion within the human beta-globin gene cluster. Am J Hum Genet 36:1239–1258 Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blu- Figure 3 a, Log-likelihood surface of selection coefficient (s) menstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, under the assumption of no growth (r p 0 ), an effective population Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward size (Ne ) of 10,000, and a recombination fraction (c) ranging from R, Lander ES, Daly MJ, Altshuler D (2002) The structure 0.4%–0.8%. b, The posterior distribution of the age (in generations) of haplotype blocks in the human genome. Science 296: of the HbC allele over the three values of s (0.04, 0.07, and 0.09) that 2225–2229 correspond with the maximum-likelihood estimate obtained under the Haldane J (1949) The rate of mutation of human genes. Her- different values of c shown in panel a. editas Suppl 35:267–273 Harding RM, Fullerton SM, Griffiths RC, Bond J, Cox MJ, Schneider JA, Moulin DS, Clegg JB (1997) Archaic African 2005). Although it remains unclear whether this cor- and Asian lineages in the genetic ancestry of modern hu- relation is the result of selection favoring recombination mans. Am J Hum Genet 60:772–789 near genes, our data are consistent with these observa- Hedrick P (2004) Estimation of relative fitnesses from relative tions. The location of the b-globin recombinational hot- risk data and the predicted future of haemoglobin alleles S spot may allow selection to act more efficiently on ma- and C. J Evol Biol 17:221–224 Jeffreys AJ, Kauppi L, Neumann R (2001) Intensely punctate laria-resistance alleles at b-globin without significantly meiotic recombination in the class II region of the major affecting the upstream globin gene copies and regulating histocompatibility complex. Nat Genet 29:217–222 locus control region sequence. Future studies will help Jeffreys AJ, May CA (2004) Intense and highly localized gene determine the extent to which recombinational hotspots conversion activity in human meiotic crossover hot spots. are preferentially located near regions that are under Nat Genet 36:151–156 strong selective pressure. Jeffreys AJ, Ritchie A, Neumann R (2000) High resolution 29

642 Am. J. Hum. Genet. 77:637–642, 2005

analysis of haplotype diversity and meiotic crossover in the variability at G6PD and the signature of malarial selection human TAP2 recombination hotspot. Hum Mol Genet 9: in humans. Genetics 162:1849–1861 725–733 Saunders MA, Slatkin M, Garner C, Hammer MF, Nach- Kong A, Barnard J, Gudbjartsson DF, Thorleifsson G, Jons- man MW (2005) The span of linkage disequilibrium caused dottir G, Sigurdardottir S, Richardsson B, Jonsdottir J, Thor- by selection on G6PD in humans. Genetics (http://www geirsson T, Frigge ML, Lamb NE, Sherman S, Gulcher JR, .genetics.org/cgi/content/abstract/genetics.105.048140v1) Stefansson K (2004) Recombination rate and reproductive (electronically published July 14, 2005; accessed August 23, success in humans. Nat Genet 36:1203–1206 2005) Kwiatkowski DP (2005) How malaria has affected the human Schneider JA, Peto TE, Boone RA, Boyce AJ, Clegg JB (2002) genome and what human genetics can teach us about ma- Direct measurement of the male recombination fraction in laria. Am J Hum Genet 77:171–190 the human beta-globin hot spot. Hum Mol Genet 11:207– Livingstone FB (1976) Hemoglobin history in West Africa. 215 Hum Biol 48:487–500 Slatkin M, Bertorelle G (2001) The use of intraallelic vari- McVean GA, Myers SR, Hunt S, Deloukas P, Bentley DR, ability for testing neutrality and estimating population Donnelly P (2004) The fine-scale structure of recombination growth rate. Genetics 158:865–874 rate variation in the human genome. Science 304:581–584 Smith JM, Haigh J (1974) The hitch-hiking effect of a fa- Modiano D, Luoni G, Sirima BS, Simpore J, Verra F, Konate vourable gene. Genet Res 23:23–35 A, Rastrelli E, Olivieri A, Calissano C, Paganotti GM, Tishkoff SA, Varkonyi R, Cahinhinan N, Abbes S, Argyro- D’Urbano L, Sanou I, Sawadogo A, Modiano G, Coluzzi M poulos G, Destro-Bisol G, Drousiotou A, Dangerfield B, (2001) Haemoglobin C protects against clinical Plasmodium Lefranc G, Loiselet J, Piro A, Stoneking M, Tagarelli A, falciparum malaria. Nature 414:305–308 Tagarelli G, Touma EH, Williams SM, Clark AG (2001) Ohashi J, Naka I, Patarapotikul J, Hananantachai H, Britten- Haplotype diversity and linkage disequilibrium at human ham G, Looareesuwan S, Clark AG, Tokunaga K (2004) G6PD: recent origin of alleles that confer malarial resistance. Extended linkage disequilibrium surrounding the hemoglo- Science 293:455–462 bin E variant due to malarial selection. Am J Hum Genet 74:1198–1208 Trabuchet G, Elion J, Dunda O, Lapoumeroulie C, Ducrocq Rihet P, Flori L, Tall F, Traore AS, Fumoux F (2004) Hemo- R, Nadifi S, Zohoun I, et al (1991) Nucleotide sequence globin C is associated with reduced Plasmodium falciparum evidence of the unicentric origin of the beta C mutation in parasitemia and low risk of mild malaria attack. Hum Mol Africa. Hum Genet 87:597–601 Genet 13:1–6 Wall JD, Frisse LA, Hudson RR, Di Rienzo A (2003) Com- b Sabeti PC, Reich DE, Higgins JM, Levine HZ, Richter DJ, parative linkage-disequilibrium analysis of the -globin hot- Schaffner SF, Gabriel SB, Platko JV, Patterson NJ, McDonald spot in primates. Am J Hum Genet 73:1330–1340 GJ, Ackerman HC, Campbell SJ, Altshuler D, Cooper R, Winckler W, Myers SR, Richter DJ, Onofrio RC, McDonald Kwiatkowski D, Ward R, Lander ES (2002) Detecting recent GJ, Bontrop RE, McVean GA, Gabriel SB, Reich D, Don- positive selection in the human genome from haplotype nelly P, Altshuler D (2005) Comparison of fine-scale recom- structure. Nature 419:832–837 bination rates in humans and chimpanzees. Science 308: Saunders MA, Hammer MF, Nachman MW (2002) Nucleotide 107–111 30

Appendix A: Theβ-Globin RecombinationalHotspotReduces the Effects of Strong Selection around HbC,a Recently Arisen Mutation Providing Resistance to Malaria

Elizabeth T. Wood,Daryn A. Stover,MontgomerySlatkin,MichaelW. Nachman, and Michael F. Hammer

Material and Methods

Population Samples

We screened 1,037 Africanindividuals for the presence of the HbC variant, using allele- specific PCR, and we identified 17 heterozygous and 2 homozygous individuals (8 Dogon from Mali; 8 Ga, Ewe, or Fante from ; and 3 individuals of unknownethnicity from the Ivory Coast, Nigeria, and ), which yielded 17 HbAand 21 HbC chromosomes. We supplemented these data with an additional 18 HbA African chromosomes (7 from Senegal and Gambia, 5 from the IvoryCoast and Ghana, 2 from , 1 Dogonfrom Mali, and 3 Baka and Mbuti Pygmies) that were part ofanother study (E. Wood, D. Stover, M. Nachman, and M. Hammer, unpublished data). Sampling protocols were approved by the HumanSubjects Committee at the University of Arizona and similar committees of collaborating institutions.

PCR Amplification, DNA Sequencing, and Cloning

We PCR amplified ~5.2 kb of genomic DNA in two overlapping ~3-kb PCR fragments (Figure 1b) and then sequenced these fragments with primers designed to anneal approximately every 400-600 bp to generate overlapping sequences. The two 3-kb fragments were cloned using the TA cloning kit by Invitrogen (K4520-40). To resolve phase over the 5.2 kb region, we used polymorphic sites in the ~800bp overlapping region to tile the strands together. For 15 individuals, we were able to obtain cloned chromosomes carrying the mutant and wild-type chromosomes for both fragments. For the remaining individuals and fragments, we were unable to obtain both chromosomes after sequencing 2-6 clones per fragment (e.g. all clones contained HbA alleles). In these cases, we determined allelic phase by using one cloned sequence and the diploid sequence. When any sites were unclear, we resequenced and/or recloned until all ambiguities were resolved. Amplification and sequencing primers, as well as reaction conditions, are available on request. Fragments were assembled using Sequencher (GeneCodes). Sequences have been submitted to GenBank under accession numbers DQ126270-DQ126325. We also observed three repeat motifs in this 5.2-kb region(fig. 1b). Because repeat lengthcould not be accuratelydetermined using these methods, we removed from the analysis 34 bp, 10 bp, and 13 bp that correspond with the (TG)n, (ATTTT)n, and (RY)n(T)n repeat motifs, respectively.

Data Analysis

Linkage disequilibrium (LD) or non-random association between pairs of polymorphic sites is evaluated using D' (Lewontin 1964) and significance is determined using a Fisher’s Exact Test. We use a Bonferroni correction to control for non-independence within our data. 31

To estimate allele age (t) and selection coefficients (s), we use the method of Slatkin (2001) method for generating intra-allelic genealogies of selected alleles. This method randomly generated the allele age, t1, from the appropriate prior distribution and a random sample path of allele frequency t1 to the present (t = 0). For each sample path, an intra-allelic genealogy of genealogy of HbC is generated. For each intra-allelic genealogy, the probability of the data - 2906 (14/21 HbC chromosomes with C at -2906) under the assumption that the background frequency of C at -2906 is very low. By averaging over 10,000 replicate sample paths appropriate weighted, an estimate of the probability of the data and hence the likelihood of s is obtained. By accumulating the results and taking the weighted average for each allele age, t1,the probability of the data at –2906 given both s and t1 is estimated. From Bayes theorem of conditional probabilities, the posterior probability of t1 given s and the other parameters is obtained (Slatkin and Bertorelle 2001). We assume that the HbC allele is additive (see below), and population sizes (Ne) are 10,000 under a model of no growth (r = 0) or exponential growth at rate r. We use a range of recombination rates from 0.4% to 0.8%. We obtain the minimum estimate of recombination by using fine-scale population genetics estimates of c across the hotspot (1,045 bp) of 0.136/bp and outside the hotspot of 0.0126/bp (Wall et al. 2003). Using the population recombination parameter, ρ = 4Nec, and assuming Ne = 10,000, we calculate c to be 0.414% from site -2906 to 16. Our maximum estimate of c is slightly less than the value of 0.88% obtained from sperm-typing studies (Schneider et al. 2002). We use an HbC frequency of 15% which is similar to that observed in several populations including the Bisa (18.2%), Gurma (16.5%), and Senufu (15.0 - 22.4%) of Burkina Faso and the Somba of Benin (15.5%) (Livingstone 1985). We also run these analyses with fewer HbC chromosomes and find that neutrality is rejected when frequencies are as low as 5% for r = 0 (data not shown). We calculate the selection coefficient (s) from data reported in Hedrick (2004) using the mortality of m = 0.1 where the fitness values (w) are 0.861, 0.935, and 1 for the AA, AC, and CC genotypes, respectively. We define s as the amount selection for the AC heterozygote where s = 1-w = 1-(0.935/0.861)]= 0.086 or as the amount of selection for the CC homozygote where 2s = (1-w) = 1-(1/0.861) = 0.161. This is very close to additivity. We note that it is possible that the HbC chromosomes may recombine with HbS chromosomes as well as HbA chromosomes. The frequency of the HbS allele in these populations is very low (2.4%-14.6%) and the probability that an HbC chromosome will recombine with an HbS chromosome is the product of their frequencies (0. 4% - 3.2%). Because this value is so low, we do not consider recombination between HbC and HbS chromosome to significantly contribute to the recombinational motifs observed on the HbC chromosomes presented here.

Supplemental References

• Lewontin RC (1964) The interaction of selection and linkage. II. Optimum models. Genetics 50:757-782 First citation in article | PubMed • Livingstone F (1985) Frequencies of hemoglobin variants. Oxford University Press, New York First citation in article • Slatkin M (2001) Simulating genealogies of selected alleles in a population of variable size. Genet Res 78:49-57 First citation in article | PubMed

32

APPENDIX B: THE EFFECTS OF RECOMBINATION, GENE CONVERSION, AND MUTATION ON PATTERNS OF VARIATION AT THE STRONGLY SELECTED β- GLOBIN LOCUS: THE ORIGINS OF THE HbS ALLELE REVISITED

33

The Effects of Recombination, Gene Conversion, and Mutation on Patterns of

Variation at the Strongly Selected β-globin Locus: The Origins of the ΗβS Allele

Revisited

Elizabeth T. Wood1,2, Daryn A. Stover1, Michael W. Nachman2, Michael F. Hammer1,2

1Division of Biotechnology and 2Department of Ecology and Evolutionary Biology,

University of Arizona, Tucson, AZ

Corresponding author:

Elizabeth Wood

Dept. EEB, Biosciences West

University of Arizona, Tucson, AZ 85721 email: [email protected]

keywords: beta-globin, recombination, HbS, linkage disequilibrium, selection, genetic hitchhiking, malaria running title: Origins of HbS β-globin allele 34

Abstract

The ß-globin HbS (sickle cell) allele is found at 15-20% throughout much of sub-Saharan

Africa and is associated with many haplotypic backgrounds. It is generally agreed upon that the HbS allele originated multiple times via recurrent mutation. A minority view is that it arose once and was transferred to different haplotypic backgrounds through recombination. We revisit this debate by examining DNA sequence variation across the

ß-globin gene and recombinational hotspot in a pan-African sample of 31 HbS and 46

HbA chromosomes. Our findings support previous studies indicating that the region surrounding ß-globin comprises three distinct recombinational regimes: a 5´ haplotypic region, a 3´ gene region, and an intervening hotspot with recombination rates as high as

86.3 cM/Mb. We estimate that the majority of recombination events occur within a 1.4- kb segment, and that elevated rates of recombination associated with the hotspot may extend as far as the HbS mutation. Low rates of recombination in the 3´ gene region allow the reconstruction of a single parsimonious gene tree. Two HbS haplotypes

(Senegal and Bantu) share all sites within this 3´ region and differ only at sites 5´ to the

HbS mutation, supporting a common origin with a subsequent crossing-over event that created these two haplotypes. A third HbS haplotype (Benin) shows a more distinctive pattern that is best explained by recurrent mutation; however, a model involving a single mutation and gene conversion cannot be excluded. These results highlight the critical role that local recombination has played in generating haplotypic diversity associated with the HbS allele. 35

Introduction

Natural selection acts on genetic variation generated by mutation and

recombination. One of the strongest selective pressures in the recent history of humans is

infectious diseases. The infectious disease, malaria, is caused by the parasite

Plasmodium falciparum and is the leading causes of childhood mortality in Africa where

is it is estimated that greater than one million children die from the disease each year.

Malarial selection has had an enormous impact on the allele frequencies at loci including

G6PD (A- and Med), Duffy, HLA, α-globin and β-globin [1]. The β-globin locus, in

particular, harbors a remarkable array of alleles conferring resistance to malaria including

HbS, HbC, HbE and many variants that also cause the β-thalassemias [2]. Some of these

malarial resistance alleles are advantageous in both the heterozygous and homozygous

form and are influenced by positive selection (e.g. HbC, HbE, FY-, and a+-thalasemia);

where other alleles cause diseases in the homozygous or hemizygous form and are

subject to balancing selection in malarial areas (e.g., HbS, G6PD A-, β-thalassemias)

[3,4,5,6,7,8,9,10]. Because malarial selection is believed to have recently intensified and selection pressures are variable, the frequencies of most of these alleles are believed to be

in a non-equilibrium state. For example, the HbC allele, which is at relatively low

frequencies in a limited region of West Africa, is predicted to quickly replace the HbS allele when the two alleles co-occur in malarial environments [11]. Moreover, many of 36

these alleles interact epistatically, adding another level of complexity to the effects of selection on malarial resistance alleles [2,12].

Early studies of variation surrounding the HbS allele (β6 Glu→Val) examined

RFLPs in homozygous individuals in a 60-kb region flanking the HbS mutation, allowing the first reconstructions of long-range haplotypes associated with a selected allele in humans [13,14,15]. These seminal studies found that the majority of African HbS alleles occur on one of five common RFLP haplotypes designated as Benin, Bantu, Senegal,

Cameroon, and Asian, which are named after the regions or ethnic groups in which the

HbS haplotype was common [16,17,18,19]. The majority view emerging from the early studies is that the HbS allele arose five times on separate haplotype backgrounds (i.e. via recurrent mutation) [15,16,17,18,19,20,21,22]. An alternative scenario is that the HbS allele arose once and spread to different haplotypes by recombination (crossing-over and/or gene conversion) [2,13,23,24]. This hypothesis is supported by the common occurrence, 5-10%, of ‘atypical’ HbS haplotypes in nearly every study that has examined population genetic variation at the β-globin locus. These 20 or more ‘atypical’ HbS haplotypes are believed to be generated by recombination or interallelic gene conversion of the HbS allele onto pre-existing HbA haplotypes rather than recurrent de novo HbS mutations [23,24,25].

The extent to which we expect to find the HbS mutation on different haplotypic backgrounds depends on the age of the allele, the strength of selection, and the rate of recombination in the region surrounding the ß-globin gene. Recent work has shown that 37

the ß-globin gene sits adjacent to one of the most active recombinational hotspots in the human genome. Located directly 5´ to the ß-globin gene, the ß-globin hotspot recombines at a rate that is > 40 times that of the genome average [7,8,9,10,11]. Given these findings and the paucity of studies that have measured the extent of fine-scale linkage disequilibrium (LD) immediately 5´ and 3´ to the HbS mutation, we set out to re- examine the evolutionary origin(s) of the HbS allele. Here, we clone and sequence 5.1- kb spanning the ß-globin gene and recombinational hotspot in a pan-African sample of 31

HbS and 46 HbA chromosomes to (1) characterize fine-scale patterns of LD, (2) delineate the boundaries of the recombination hotspot, and (3) infer the origin(s) of the HbS allele.

Materials and Methods

DNA samples: We screened 1037 African individuals for the presence of the HbS and/or the HbC variant using a multiplex allele specific PCR (ASPCR) that can determine the presence of both the HbC and HbS allele. We identified 97 individual carrying the HbS and 30 individuals carrying the HbC mutation. The HbC results were previously published [10] but are included here as a comparison with HbS chromosomes.

In this study, we sequenced and cloned 29 heterozygous HbAS and 1 homozygous HbSS individuals and supplemented these data with an additional 17 HbA chromosomes that were paired with HbC chromosomes [10] for a total of 46 HbA, 31 HbS, and 21 HbC chromosomes. Sampling protocols were approved by the Human Subjects Committee at the University of Arizona and those of collaborating institutions. To compare human 38

DNA sequences with other primate species, we sequenced a portion of the region shown

in Figure 1 and combined it with available GenBank data (Chimpanzee: X02345).

PCR Amplification, DNA Sequencing and Cloning: A multiplex ASPCR was

designed to detect the presence of either the HbS allele or the HbC allele. In one reaction

tube we used 6 primers: the two gamma globin and two β-globin HbC primers reported in

a previous study [26] as well as two β-globin HbS primers designed for this study- Sfor

(5´ CGGCTGTCATCACTTAGACC 3´), and Sasp (5´ GTAACGGCAGACTTCTCCCA

3´). The gamma globin primers were used as controls and the HbS and HbC primer pairs

amplified a 215 bp product and a 459 bp product when the HbS and HbC alleles,

respectively, were present. Re-optimized reaction conditions are available upon request.

In a subset of the individuals who carry the HbS, we PCR amplified ~5.1-kb of

genomic DNA in two overlapping ~3-kb PCR fragments (Figure 1b) and then sequenced

these fragments with primers designed to anneal approximately every 400-600 bp to

generate overlapping sequences. The two 3-kb fragments were cloned using the TA

cloning kit by Invitrogen (K4520-40). To resolve phase over the 5.1-kb region, we used

polymorphic sites in the ~800bp overlapping region to tile the strands together.

Fragments were assembled using Sequencher (GeneCodes). Sequences have been

submitted to GenBank under accession numbers xxx-xxx. Numbers differ slightly from

our previously published study [10] and are renumbered with position 0 found at the 5´

end of the cap which and is identical to the nomenclature of several RFLP. We also

observe three repeat motifs in this 5.1-kb region (Figure 1b) and removed from the 39

analysis 34 bp, 10 bp and 13 bp that correspond with the (TG)n, (ATTTT)n and (RY)n(T)n

repeat motifs, respectively.

Other data sources: We compared levels of diversity and divergence in the β-

globin regions to 269 Seattle SNP coding regions. We also compare divergence to

13,455 genes recently reported [27]. To assign HbS haplotypic backgrounds (or

frameworks) to samples surveyed here, we used 7 sites that are segregating within HbS

chromosomes (-1069, -989/HinfI, -710, -551/RasaI, -543, -491) that have previously been

shown studies to define the five HbS haplotypes (or frameworks) using RFLPs and

sequence data in HbS homozygous individuals [17,20,28]. Nucleotide positions in this

study differ slightly from our previously published study [10]. Position 1 in this study

corresponds to the 5´ end of the UTR (Figure 1) and matches site positions reported in

previously published RFLP studies [17,19,20,28].

Data analysis: The population recombination rate was estimated from patterns of

linkage disequilibrium (LD) in the 46 non-selected chromosomes. To map recombination

rates, we employed LDhat which is approximate-likelihood coalescent method [29]

adapted to the finite-site model [30]. We construct a single most parsimonious gene tree

and depict haplotype relationships as a network.

Nucleotide diversity (θπ) [31] and the proportion of segregating sites, θW [32] was calculated in the 46 non-selected HbA chromosomes. Under a standard neutral model, both θπ and θW estimate the population parameter θ = 4Neμ, where Ne is the effective

population size and u is the mutation rate per generation. Divergence between humans

and chimpanzees were calculated as the average number of nucleotide substitutions per 40

site, k, between one chimpanzee and 46 humans using Kimura’s two-parameter model

[33]. Under a neutral model, the amount of divergence, k, between sequences drawn

from two species is given by k = 2μt + 4Neμ where t is the time in generations since the

species have diverged, μ is the mutation rate per generation, and Ne is the ancestral

effective population size [34]. Diversity (θπ and θW) and divergence (k) are estimators of the mutation rate, μ. For comparisons of diversity and divergence to the Seattle SNP data set, 95% confidence regions were generated using rank probabilities.

Results

Recombination rates in HbA chromosomes

We calculate the recombination parameter (ρ) based on the frequency of 45 sites found to be segregating in our sample of 46 HbA chromosomes using LDhat [30].

Estimates vary dramatically across the 5.1-kb region spanning the hotspot and gene region (Figure 1). The highest estimates of the recombination rate parameter (ρ/kb) are

35.1, which are found in the region spanning 15 sites from -1533 to -83, and decrease rapidly moving away from this ~1.4-kb window. The recombination rate drops to less than 2.40 ρ/kb and 0.38 ρ/kb ~350 bp in the 5´ and 3´ directions, respectively (Figure

1c). The recombination parameter (ρ/kb) averages 29.6 ρ/kb in the 2.1-kb window,

which encompasses 24 sites from site -1879 to 250. Thus, the majority of recombination

events, 74.8%, occur within the 1.4-kb window and 94.2% occur within the 2.1-kb

window. Assuming an effective population size (Ne) of 10,000, estimates of

recombination rates for the 1.4- and 2.1-kb windows are 86.3 cM/Mb and 74.1 cM/Mb, 41

respectively. Using the same methods employed here [30] we also recalculated

recombination rates with the data from Wall et al. [37] in a sample of 16 Hausa

individuals. We find that recombination rates are slightly higher, 87.0-108.5 cM/Mb. In

both studies, the boundaries of the hotspot are nearly the same, revealing that a portion of the recombinational hotspot encompasses part of the β-globin gene and may include site

70HbS (Figure 1c). Additionally, a study of 349 globally distributed individuals showed

that 23 sequences (6.6%) recombined 3´ of site -491 and three of these 23 recombinant chromosomes exchange sequence between sites 1 and 458 found in the first exon and intron of the gene [38]. A sperm-typing study estimated a recombination rate of 80 cM/Mb over a much larger 11-kb region 5´ to the β-globin gene (Schneider et al. 2002).

Interestingly, we observe a CCTCCCT motif, located at the 3´ end of the purine-

pyrimadine repeat ((RY)n(T)n in Figure 1b) and <400 bp from the center of the hotspot,

that is similar to motifs found near the centers of other recombinational hotspots [39].

These findings suggest that the majority of recombination events occur within the 1.4-kb

region and that the frequency of recombination events decrease from the center of the

hotspot into the gene region possibly as far 3´ as the 2nd exon. We infer that the HbS

mutation, which is found in the 6th codon of the 1st exon, occurs within the

recombinational hotspot (Figure 1).

Patterns of variation on HbS chromosomes

To infer the haplotypic backgrounds associated with the HbS allele, we used six

sites (-1069, -989/HinfI, -710, -551/RsaI, -543, and -491) to distinguish among the four 42

African RFLP HbS frameworks [17,19,20,28] (see Materials & Methods; Figure 2).

Within the 31 HbS chromosomes examined here, we observe three of the four African

HbS haplotypic frameworks. The Benin haplotype is found in 7 individuals (#51-57), the

Bantu haplotype is found in 14 individuals (#64-77), and the Senegalese type is found in

5 individuals (#55-63) (Figure 2). With one exception (a Benin type found in Gambia),

each of the haplotypes identified here were sampled from locales that correspond to the

known distributions of the Senegal, Bantu, and Benin haplotypes [2,18,19]. The

remaining five chromosomes represent recombinant or ‘atypical’ haplotypes. All of these

‘atypical’ HbS chromosomes contain at least two sites (-543 and -491) and possibly up to

four sites (site -1872, -989/HinfI, -543, and -491) that define the Benin haplotypes

(Figure 2) suggesting that the recombinant haplotypes are more closely related to Benin than to either the Sengalese or Bantu types.

To characterize the evolutionary history of the HbS chromosomes, we performed a parsimony analysis using in the low recombining 3´ region using 20 sites (site 69 to

2099) in 46 HbA and 31 HbS chromosomes sampled. The HbS chromosomes surveyed here are associated with two different 3´ haplotypes (Figure 3). The 14 Bantu haplotypes and the 5 Senegalese haplotypes share all sites within the 3´ haplotype region, forming a single clade. The seven Benin haplotypes are divided among the two clades with one

chromosome occurring in one clade and six Benin chromosomes occurring in another.

The five ‘atypical’ haplotypes are found in both of the HbS clades (Figure 3). This suggests that the Bantu and Senegalese haplotypes are more closely related to one another than either is to the Benin type. 43

Discussion

The majority of studies that examine the origins of the HbS allele have examined

long range RFLP patterns and have favored the hypothesis that the HbS allele originated

multiple times via recurrent mutation - one mutational event for each of the five

geographic types (Benin, Bantu, Senegalese, Cameroon, Asian/Indian)

[15,16,17,18,19,20,21,22]. A few studies suggest that the HbS mutation could have

arisen once and was transferred to different haplotypic backgrounds through

recombination [2,13,23]. To address this controversy, we generated phased DNA

sequence data and examined fine scale patterns of LD over 5.1-kb encompassing the β-

globin recombinational hotspot and gene. Our findings support previous studies

indicating that the region surrounding ß-globin comprises three distinct recombinational

regimes: a 5´ haplotypic region, a 3´ gene region, and an intervening recombinational

hotspot (Figure 1). We estimate that the recombination rate is low in the 5´ and 3´

regions (ρ/kb = 2.40 and 0.38, respectively). The recombinational hotspot, defined as a

DNA segment where recombination rate is elevated over background levels [39], spans

2.1-kb with the recombination parameter averaging 29.6 ρ/kb. Our findings suggest that the majority of recombination events occur within a 1.4-kb segment (i.e., within the 2.1- kb region) and that the frequency of recombination events decrease rapidly from the hotspot into the gene region and can include part or all of the 1st intron {Harding, 1997

#24}. Therefore, elevated rates of recombination associated with the hotspot may extend

as far as the HbS mutation, which is found in the 6th codon of the 1st exon (Figure 1). In 44

the following sections, we revisit the competing hypotheses for the origin(s) of the HbS allele in light of our analyses based on DNA resequence data.

The extremely high rates of recombination estimated at the β-globin hotspot, 74.1

– 86.3 cM/Mb, and the observation that 8-16% of recombination events occur in the gene region that includes the HbS allele, are consistent with the possibility that some or all of the many haplotypic backgrounds associated with the HbS allele have been generated via recombination. In our sample of 31 HbS chromosomes, we observe 8 haplotypes, 5 of which are rare (Figure 2). This is concordant with RFLP data indicating that a large fraction of the HbS chromosomes are associated with recombinant or ‘atypical’ haplotypes. The three most common haplotypes found in our sample represent the

Bantu, Senegal, and Benin haplotypic backgrounds identified in previous studies

[17,19,20,28].

We reconstructed the history of the HbS allele by examining the 18 sites found within the 3´ gene region. This parsimony analysis reveals that there is a single tree (i.e. network) associated with all sites 3´’ to the HbS mutation (sites 148 to 2099) suggesting that recombination rarely occurs within this 3´ haplotype region. RFLPs examined on

HbA chromosomes indicate that no recombination events have occurred in this 10-kb region. The HbC allele is associated with one clade consistent with a single origin of this allele [19]. However, the HbS allele is associated with two clades suggesting that recombination and/or multiple mutations occurred at or near the HbS mutation (Figure

3). The clade defined by sites 569 and 2099 contains all of the Senegalese and Bantu HbS haplotypes. The sharing of this 3´ Senegalese/Bantu haplotype includes an additional 4 45

RFLP sites extending over 10-kb 3´ to the gene [2,18]. One explanation for this pattern is

that the Bantu and Senegalese haplotypes have a common origin and were created

through a single crossing-over event that occurred between sites -543 and 70HbS (i.e.

within the recombinational hotspot). Of the HbA chromosomes surveyed here, there is

one haplotype that could represent the parental chromosome that may have recombined

with the Bantu type (represented by Gna07). On the other hand, here are no HbA types

present in our small sample that represent parental types that could have recombined to

create the Senegalese type (Figure 4). Additional sequencing surveys are needed to identify candidate HbA chromosomes that may have recombined to form the Senegalese

HbS haplotype.

The Benin type is more difficult to explain under the single origin hypothesis

because it differs from the Senegalese/Bantu haplotype at sites both 5´ and 3´ to the HbS mutation based on both sequence data (Figures 2 and 3) and RFLP data surveyed across

60-kb [2,18]. This supports the multiple origins hypothesis for the formation of the

Benin type. It is also possible that the Benin type originated from a double recombination event or by a gene conversion event that placed the HbS mutation on a divergent HbA haplotype (i.e., one that differed from the Bantu/Senegal type at sites both

5´ and 3´ to the HbS allele). If this were the case, the crossing-over event or the 3´ end of gene conversion break-point would have to lie between sites 70HbS and 569, which is on

the edge of the hotspot. Thus, the association of the HbS mutation with common haplotypic backgrounds can be explained either by a single origin of HbS with several 46

recombination events 5´ to the HbS mutation and at least one recombination event 3´ to

the HbS mutation, or by multiple mutational origins with or without recombination.

Our findings suggest that recombination played a critical role in generating HbS

haplotypic diversity at the β-globin locus. However, some observations provide support

for a multiple-origin model of recurrent de novo mutation. Most importantly, the HbS haplotypes have a high degree of geographic specificity [40]. If different HbS mutations

arose in different parts of Africa, we would expect that a single HbS haplotype would be

the predominate type in a given region. On the other hand, if crossing-over or gene

conversion placed the HbS mutation on different haplotypic backgrounds, we would

expect that there would be at least two HbS haplotypes (the original haplotype and the

converted haplotype) present at appreciable frequencies. Surveys of the distribution of

haplotypes suggest that the Senegalese type comprises 80.7-100 % of the HbS types

found in Senegal and Gambia, the Benin type makes up 92.9% of HbS types in Nigeria,

and Bantu type is found in 88-100% of the HbS types across most of Bantu speaking

Africa [2] suggesting that if recombination is responsible for the HbS haplotypic

diversity, original HbS haplotypes may have been lost due to genetic drift [41].

Additionally, malaria resistant alleles associated with the ß-globin locus have arisen

independently outside of Africa [26,42,43,44].

To address the role of mutation in generating haplotypic diversity in our sample,

we examined levels of diversity (θπ) and divergence (k). We show that levels of diversity

are significantly greater than the genome average within the hotspot as well as the 3.5-kb

region that surrounds (but does not include) the gene region, and that levels of diversity 47

within the gene are similar to the genome average. Levels of divergence, on the other

hand, are not significantly elevated in the hotspot or the surrounding 3.5-kb region. In fact, levels of divergence within the β-globin gene are significantly lower than the

genome average. Rates of mutation surrounding the β-globin locus (as measured by

diversity) may be elevated relative to the genome average possibly due to the mutagenic

effects of recombination [45], but that if this increased mutation rate extends to the gene

region, background selection may be removing deleterious mutations before they are

fixed between humans and chimpanzees. It is also possible that the recombinational

hotspot is recent and increased mutation associated with recombination is contributing to

diversity but not to divergence [46].

Taken together, these observations suggest multiple genetic and evolutionary

mechanisms for the origins of the HbS allele. Our data indicate that crossing-over event

5´ to the HbS allele can explain the origin of the Bantu and Senegalese haplotypes. The

Benin type, one the other hand, could have been created by a separate A→T transversion

at the HbS site. We note, however, that a single origin model involving gene conversion

to explain the formation of the Benin type cannot be ruled out. To reconcile the

geographic distribution of HbS with a single origin model, some have suggested that

African populations were subdivided during the early history of the HbS allele [41]. One

possible scenario for generating the observed haplotypic diversity under the single origin

of the HbS allele is shown in Figure 4. The higher diversity associated with the Benin

HbS (Figure 2) suggests an origin in central West Africa. This may have been followed

by a gene conversion event that created the Bantu type. The Bantu type then underwent 48

recombination 5´ to the HbS site creating the Senegalese type. The Bantu type was

carried throughout much of sub-Saharan Africa with the Bantu expansions and the

Senegalese haplotype became the primary type in present-day Senegal and Gambia.

Under this scenario, West African populations were subdivided during the early history of the HbS allele and a single HbS haplotype became the predominant type in these subdivided populations. Later as populations grew in size and geographic range, selective pressures due to malaria also intensified, rapidly increasing the frequency of the

HbS allele [24,40]. The combined effects of population subdivision, genetic drift, and recombination would have created a patchy distribution of HbS haplotypes; while the subsequent intensification of selective pressures, population growth, and range expansions would increase the geographic range of HbS haplotypes [2]. Although we favor a model that collapses the origins of the Bantu and Senegalese types into a single event, reducing the number of independent origins by at least one, we cannot refute the multiple origins model. Future studies that examine both short- and long-range haplotypes associated with selected alleles across the β-globin gene and hotspot in various geographic regions will help disentangle the relative effects of mutation, recombination, selection and demographic processes such as population subdivision, growth, and range expansion on the history of the HbS allele.

49

Table 1: Summary statistics calculated in 46 HbA chromosomes for the entire 5.1-kb sequence length, the recombinational hotspot (1.4 and 2.1-kb window sizes), and the gene region.

Divergence Length # of Haplotype (Ht) (Homo-Pan)

Region (bp) S Hts Diversity θπ % θS % k (%)

Entire Sequence 5139 45 42 0.994 ± 0.007 0.186 0.200 1.30%

1.4 kb Hotspot 1431 15 35 0.978 ± 0.012 0.268 0.240 1.40%

2.1 kb Hotspot 2109 21 39 0.986 ± 0.010 0.217 0.227 0.95%

Gene Region 1606 10 10 0.674 ± 0.051 0.082 0.142 0.87%

50

Figure 1: a) Map of the β-globin gene cluster found on 11p15.5. Four functional globin genes, one pseudogene, and the locus control region (LCR) that controls transcription and replication of the globin gene cluster [47] are located 5´ to the β-globin gene. The position of the recombinational hotspot identified using RFLP analysis (A) [48] and single sperm typing (B) [49] are shown above. b) Magnified view of β-globin gene and hotspot The polymorphisms observed in a 5.1-kb region are indicated with IUPAC codes. Asterisks (*) indicate sites that correspond with the RFLPs, HinfI, RsaI, and AvaII

(see text). Indels were excluded. Exons are shown in horizontal black boxes; introns are depicted by white boxes. Grey boxes indicate repeat motifs. The position of the origin of replication (IR Region) that interacts with the LCR and the 50-bp 5´ promoter region

(UTR) are also shown. c) Recombination rates estimated in the window spanning the gene region and recombinational hotspot using LDhat [30] for the 46 HbA chromosomes reported here (solid line) and 32 African Hausa (dashed line) [37]. The 1.4-kb and 2.1-kb windows corresponds to the regions of maximum recombination and elevated levels of recombination over background, respectively . The regions examined here that overlap with the 5´ and 3´ haplotypes previously identified using RFLPs are shown at the bottom.

These RFLP haplotypes extend ~50kb upstream and ~10kb downstream of the recombinational hotspot [2].

Figure 2: Polymorphisms observed in 46 HbA chromosomes (top), 31 HbS chromosomes (middle), and 21 HbC chromosomes (bottom). The previously published 51

sites used to define the HbS haplotypes are shown at the very bottom [17,20,28]. The 7

Benin, 5 Senegelese, and 14 Bantu HbS haplotypes are labeled. The 5 ‘atypical’ HbS haplotypes (individuals 47-50, 58) are also shown. Nucleotide positions are labeled such that the 1st position of the 5´ UTR is 1. The boundaries of the 1.4 and 2.1-kb recombinational windows are indicated with dashed vertical lines. To visually represent the decay in haplotype sharing, the most common long-range haplotype within the HbA,

HbS, and HbC chromosomes is shaded. We assume that recombination never occurs in the region 3´ to the gene and increases approaching the recombination hotspot [2,13,48].

Each site was examined sequentially, moving away from site 2098 at the 3´ end.

Figure 3: Gene-tree, or network, for 46 HbA (yellow) and 31 HbS (red), and 21 HbC

(blue) chromosomes using 20 sites (Site69HbC to Site2099) found in the low recombining

3´ haplotype region. Circles are proportional to the number of individuals who carry a given haplotype. The numbers denoted on branches correspond to site positions shown in

Figure 2. The HbS haplotypic backgrounds, as defined by other studies (see Results), that are observed in the 31 HbS chromosomes in this study, are shown next to their respective HbS clades.

Figure 4: Hypothetical scenario for the evolution of the HbS allele assuming a single origin (see text for explanation). Horizontal lines represent haplotypes. Stars indicate segregating sites, A = adenine, and T = thymine. Dashed lines indicate a recombination event. 52

Electronic Resources

Seattle SNPs: http://pga.mbt.washington.edu/

OMIM http://www.ncbi.nlm.nih.gov/omim/

CpG Islands: http://bioinformatics.org/sms2/cpg_islands.html

References

1. Kwiatkowski DP (2005) How malaria has affected the human genome and what human

genetics can teach us about malaria. Am J Hum Genet 77: 171-192.

2. Flint J, Harding RM, Boyce AJ, Clegg JB (1998) The population genetics of the

haemoglobinopathies. Baillieres Clin Haematol 11: 1-51.

3. Allison AC (1956) The sickle-cell and haemoglobin C genes in some African

populations. Ann Hum Genet 21: 67-89.

4. Allison AC (1964) Polymorphism and Natural Selection in Human Populations. Cold

Spring Harb Symp Quant Biol 29: 137-149.

5. Haldane J (1949) The rate of mutation of human genes. Hereditas Supplement: 267-

273.

6. Hamblin MT, Thompson EE, Di Rienzo A (2002) Complex signatures of natural

selection at the Duffy blood group locus. Am J Hum Genet 70: 369-383. Epub

2001 Dec 2020. 53

7. Modiano D, Luoni G, Sirima BS, Simpore J, Verra F, et al. (2001) Haemoglobin C

protects against clinical Plasmodium falciparum malaria. Nature 414: 305-308.

8. Ohashi J, Naka I, Patarapotikul J, Hananantachai H, Brittenham G, et al. (2004)

Extended linkage disequilibrium surrounding the hemoglobin E variant due to

malarial selection. Am J Hum Genet 74: 1198-1208. Epub 2004 Apr 1127.

9. Saunders MA, Slatkin M, Garner C, Hammer MF, Nachman MW (2005) The extent of

linkage disequilibrium caused by selection on G6PD in humans. Genetics 171:

1219-1229.

10. Wood ET, Stover DA, Slatkin M, Nachman MW, Hammer MF (2005) The beta -

globin recombinational hotspot reduces the effects of strong selection around

HbC, a recently arisen mutation providing resistance to malaria. American Journal

of Human Genetics 77: 637-642.

11. Hedrick P (2004) Estimation of relative fitnesses from relative risk data and the

predicted future of haemoglobin alleles S and C. J Evol Biol 17: 221-224.

12. Williams TN, Mwangi TW, Wambua S, Peto TE, Weatherall DJ, et al. (2005)

Negative epistasis between the malaria-protective effects of alpha+-thalassemia

and the sickle cell trait. Nat Genet 37: 1253-1257.

13. Antonarakis SE, Boehm CD, Serjeant GR, Theisen CE, Dover GJ, et al. (1984)

Origin of the beta S-globin gene in blacks: the contribution of recurrent mutation

or gene conversion or both. Proc Natl Acad Sci U S A 81: 853-856. 54

14. Kan YW, Dozy AM (1978) Polymorphism of DNA sequence adjacent to human beta-

globin structural gene: relationship to sickle mutation. Proc Natl Acad Sci U S A

75: 5631-5635.

15. Kan YW, Dozy AM (1980) Evolution of the hemoglobin S and C genes in world

populations. Science 209: 388-391.

16. Kulozik AE, Wainscoat JS, Serjeant GR, Kar BC, Al-Awamy B, et al. (1986)

Geographical survey of beta S-globin gene haplotypes: evidence for an

independent Asian origin of the sickle-cell mutation. Am J Hum Genet 39: 239-

244.

17. Lapoumeroulie C, Dunda O, Ducrocq R, Trabuchet G, Mony-Lobe M, et al. (1992) A

novel sickle cell mutation of yet another origin in Africa: the Cameroon type.

Hum Genet 89: 333-337.

18. Pagnier J, Mears JG, Dunda-Belkhodja O, Schaefer-Rego KE, Beldjord C, et al.

(1984) Evidence for the multicentric origin of the sickle cell hemoglobin gene in

Africa. Proc Natl Acad Sci U S A 81: 1771-1773.

19. Trabuchet G, Elion J, Baudot G, Pagnier J, Bouhass R, et al. (1991) Origin and spread

of beta-globin gene mutations in India, Africa, and Mediterranea: analysis of the

5' flanking and intragenic sequences of beta S and beta C genes. Hum Biol 63:

241-252.

20. Chebloune Y, Pagnier J, Trabuchet G, Faure C, Verdier G, et al. (1988) Structural

analysis of the 5' flanking region of the beta-globin gene in African sickle cell 55

anemia patients: further evidence for three origins of the sickle cell mutation in

Africa. Proc Natl Acad Sci U S A 85: 4431-4435.

21. Nagel RL, Labie D (1985) The consequences and implications of the multicentric

origin of the Hb S gene. Prog Clin Biol Res 191: 93-103.

22. Wainscoat JS, Bell JI, Thein SL, Higgs DR, Sarjeant GR, et al. (1983) Multiple

origins of the sickle mutation: evidence from beta S globin gene cluster

polymorphisms. Mol Biol Med 1: 191-197.

23. Zago MA, Silva WA, Jr., Dalle B, Gualandro S, Hutz MH, et al. (2000) Atypical

beta(s) haplotypes are generated by diverse genetic mechanisms. Am J Hematol

63: 79-84.

24. Livingstone FB (1989) Simulation of the diffusion of the beta-globin variants in the

Old World. Hum Biol 61: 297-309.

25. Srinivas R, Dunda O, Krishnamoorthy R, Fabry ME, Georges A, et al. (1988)

Atypical haplotypes linked to the beta S gene in Africa are likely to be the product

of recombination. Am J Hematol 29: 60-62.

26. Sanchaisuriya K, Fucharoen G, Sae-ung N, Siriratmanawong N, Surapot S, et al.

(2001) Molecular characterization of hemoglobin C in Thailand. Am J Hematol

67: 189-193.

27. TheChimpanzeeSequencingandAnalysisConsortium (2005) Initial sequence of the

chimpanzee genome and comparison with the human genome. Nature 437: 69-87.

28. Currat M, Trabuchet G, Rees D, Perrin P, Harding RM, et al. (2002) Molecular

analysis of the beta-globin gene cluster in the Niokholo Mandenka population 56

reveals a recent origin of the beta(S) Senegal mutation. Am J Hum Genet 70: 207-

223.

29. Hudson RR (2001) Two-locus sampling distributions and their application. Genetics

159: 1805-1817.

30. McVean GA, Myers SR, Hunt S, Deloukas P, Bentley DR, et al. (2004) The fine-

scale structure of recombination rate variation in the human genome. Science 304:

581-584.

31. Nei M, Li WH (1979) Mathematical model for studying genetic variation in terms of

restriction endonucleases. Proc Natl Acad Sci U S A 76: 5269-5273.

32. Watterson GA (1975) On the number of segregating sites in genetical models without

recombination. Theor Popul Biol 7: 256-276.

33. Kimura M (1980) A simple method for estimating evolutionary rates of base

substitutions through comparative studies of nucleotide sequences. J Mol Evol 16:

111-120.

34. Kimura M (1983) The Neutral Theory of Molecular Evolution. Cambridge, UK:

Cabridge University Press.

35. Gardiner-Garden M, Frommer M (1987) CpG islands in vertebrate genomes. J Mol

Biol 196: 261-282.

36. Stothard P (2000) The sequence manipulation suite: JavaScript programs for

analyzing and formatting protein and DNA sequences. Biotechniques 28: 1102,

1104. 57

37. Wall JD, Frisse LA, Hudson RR, Di Rienzo A (2003) Comparative linkage-

disequilibrium analysis of the beta-globin hotspot in primates. Am J Hum Genet

73: 1330-1340. Epub 2003 Nov 1318.

38. Harding RM, Fullerton SM, Griffiths RC, Bond J, Cox MJ, et al. (1997) Archaic

African and Asian lineages in the genetic ancestry of modern humans. Am J Hum

Genet 60: 772-789.

39. Myers S, Bottolo L, Freeman C, McVean G, Donnelly P (2005) A fine-scale map of

recombination rates and hotspots across the human genome. Science 310: 321-

324.

40. Nagel RL (2004) Beta-globin-gene haplotypes, mitochondrial DNA, the Y-

chromosome: their impact on the genetic epidemiology of the major structural

hemoglobinopathies. Cell Mol Biol (Noisy-le-grand) 50: 5-21.

41. Fix AG (2003) Simulating hemoglobin history. Hum Biol 75: 607-618.

42. Antonarakis SE, Orkin SH, Kazazian HH, Jr., Goff SC, Boehm CD, et al. (1982)

Evidence for multiple origins of the beta E-globin gene in Southeast Asia. Proc

Natl Acad Sci U S A 79: 6608-6611.

43. Kazazian HH, Jr., Waber PG, Boehm CD, Lee JI, Antonarakis SE, et al. (1984)

Hemoglobin E in Europeans: further evidence for multiple origins of the beta E-

globin gene. Am J Hum Genet 36: 212-217.

44. Trabuchet G, Elion J, Dunda O, Lapoumeroulie C, Ducrocq R, et al. (1991)

Nucleotide sequence evidence of the unicentric origin of the beta C mutation in

Africa. Hum Genet 87: 597-601. 58

45. Hellmann I, Ebersberger I, Ptak SE, Paabo S, Przeworski M (2003) A neutral

explanation for the correlation of diversity with recombination rates in humans.

Am J Hum Genet 72: 1527-1535.

46. Ptak SE, Roeder, A.D., Stephens, M., Gilad, Y., Paabo, S., Przeworski, M. (2004)

Absense of the TAP2 Human Recombination Hotspot in Chimpanzees. PLoS

Biology 2: 849-855.

47. Aladjem MI, Groudine M, Brody LL, Dieken ES, Fournier RE, et al. (1995)

Participation of the human beta-globin locus control region in initiation of DNA

replication. Science 270: 815-819.

48. Chakravarti A, Buetow KH, Antonarakis SE, Waber PG, Boehm CD, et al. (1984)

Nonuniform recombination within the human beta-globin gene cluster. Am J Hum

Genet 36: 1239-1258.

49. Schneider JA, Peto TE, Boone RA, Boyce AJ, Clegg JB (2002) Direct measurement

of the male recombination fraction in the human beta-globin hot spot. Hum Mol

Genet 11: 207-215.

59

Figure 1

a Hotspot A, B LCR ε γ γ ψβ δ β 5’ G A 3’

IR Region b R (TG)n (ATTTT)n (RY)n(T)n YWYY *SKY R YK Y R Y S MWY WK YY RS* WKY*YYYM YY RS RMK WYY K RSS SYY RR 1 2 3 -3000bp -2000 -1000 0 HbC 1000 2000 c HbS

50

40

30

20

10 Recombination Rate(4Nec/kb) -3000bp -2000 -1000 0 1000 2000

1.4 kb recombinational window

2.1 kb recombinational window

5’ haplotype region 3’ haplotype region 60 61

Figure 3

1309

2013

HbC 2078 1946 1720 1161 511 576 1707 1790 148 1850 2099 755 70

HbS HbS 569 5 Senegal 70HbS 14 Bantu HbS 1354 1 Benin 250 1 Atypical 6 Benin 1656 4 Atypical 62

Figure 4

Senegal Bantu T T HbSBantu HbSSenegal A T HbA gene conversion HbSBantu cross-over T A HbS HbA T HbS Benin mutation A HbA Benin

63

APPENDIX C: CONTRASTING PATTERNS OF Y CHROMOSOME AND MTDNA VARIATION IN AFRICA: EVIDENCE FOR SEX-BIASED DEMOGRAPHIC PROCESSES

Published: European Journal of Human Genetics: 13: 867-876

64

European Journal of Human Genetics (2005) 13, 867–876 & 2005 Nature Publishing Group All rights reserved 1018-4813/05 $30.00 www.nature.com/ejhg

ARTICLE Contrasting patterns of Y chromosome and mtDNA variation in Africa: evidence for sex-biased demographic processes

Elizabeth T Wood1,2, Daryn A Stover1, Christopher Ehret3, Giovanni Destro-Bisol4, Gabriella Spedini4, Howard McLeod5, Leslie Louie6, Mike Bamshad7, Beverly I Strassmann8, Himla Soodyall9 and Michael F Hammer*,1,2

1Division of Biotechnology, University of Arizona, Tucson, AZ, USA; 2Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA; 3Department of History, University of Los Angeles, CA, USA; 4Department of Animal and Human Biology, University La Sapienza, Rome, Italy; 5Department of Medicine, Washington University, St Louis, MO, USA; 6Children’s Hospital of Oakland, Oakland, CA, USA; 7University of Utah Health Sciences, Salt Lake City, UT, USA; 8Department of Anthropology, University of Michigan, Ann Arbor, MI, USA; 9Human Genomic Diversity and Disease Research Unit, University of Witwatersand, Johannesberg,

To investigate associations between genetic, linguistic, and geographic variation in Africa, we type 50 Y chromosome SNPs in 1122 individuals from 40 populations representing African geographic and linguistic diversity. We compare these patterns of variation with those that emerge from a similar analysis of published mtDNA HVS1 sequences from 1918 individuals from 39 African populations. For the Y chromosome, Mantel tests reveal a strong partial correlation between genetic and linguistic distances (r ¼ 0.33, P ¼ 0.001) and no correlation between genetic and geographic distances (r ¼0.08, P40.10). In contrast, mtDNA variation is weakly correlated with both language (r ¼ 0.16, P ¼ 0.046) and geography (r ¼ 0.17, P ¼ 0.035). AMOVA indicates that the amount of paternal among-group variation is much higher U when populations are grouped by linguistics ( CT ¼ 0.21) than by geography (UCT ¼ 0.06). Levels of maternal genetic among-group variation are low for both linguistics and geography (UCT ¼ 0.03 and 0.04, respectively). When Bantu speakers are removed from these analyses, the correlation with linguistic variation disappears for the Y chromosome and strengthens for mtDNA. These data suggest that patterns of differentiation and gene flow in Africa have differed for men and women in the recent evolutionary past. We infer that sex-biased rates of admixture and/or language borrowing between expanding Bantu farmers and local hunter-gatherers played an important role in influencing patterns of genetic variation during the spread of African agriculture in the last 4000 years. European Journal of Human Genetics (2005) 13, 867–876. doi:10.1038/sj.ejhg.5201408 Published online 27 April 2005

Keywords: mtDNA; Y chromosome; human; Africa; language; geography; correlation; evolution; Mantel

Introduction *Correspondence: Dr MF Hammer, Department of Ecology and Evolu- An important question in population genetics is identify- tionary Biology, Biosciences West, University of Arizona, Tucson, AZ ing the best predictors of genetic relationships among þ þ 85721, USA. Tel: 1 520 621 9828; Fax: 1 520 621 9247; human populations. Several studies indicate strong corre- E-mail: [email protected] Received 16 July 2004; revised 31 January 2005; accepted 15 February lations between genetic and linguistic relationships among 1,2 2005 globally distributed human populations. At the subcon- 65

Contrasting patterns of Y chromosome and mtDNA variation ET Wood et al 868

tinental scale, correlations between genetic variation and Subjects and methods linguistic or geographic variation differ substantially. Y Population samples chromosome studies have shown that geographic distances Samples include representatives of the four major language correlate with genetic affinities among populations in families: Khoisan, Afroasiatic, Nilo-Saharan, and Niger- ,3 the Americas,4 and Austronesia,5 whereas lan- Congo (Table 1; Figure 1). Many of the 40 populations in guage better explains Y chromosome relationships in Table 1 were analyzed in previously published stu- 22,23,28,29 Siberia.6 Mitochondrial DNA (mtDNA) studies suggest that dies; however, several markers were typed in these linguistic relationships are better correlated with genetic samples for the first time in the current study. Differences affinities among South American populations,7 while in the number of samples in this and previous studies both geography and language are correlated with reflect differences in the availability of DNA, the inclusion maternal variation in Austronesia.8 In Africa the question of new samples, and/or the merging or splitting of of gene–language relationships remains equivocal; some populations according to language or ethnographic criter- classical genetic and Y chromosome studies point to ia. Sampling protocols were approved by the Human language,9,10 while other Y chromosome and mtDNA Subject Committee at the University of Arizona and those studies identify geography11,12 as a better predictor of of collaborating institutions. genetic affinities. The distribution of linguistic variation has been strongly Y chromosome markers and terminology influenced by the Neolithic Revolution, particularly in Fifty biallelic Y-linked markers, SNPs and indels, were typed Africa. Linguistic, archeological, and ethnographic data using a hierarchical protocol.23,26,27 First, we typed muta- suggest that all four African language families arose before tions defining major (eg, A agriculture in West Africa (Niger-Congo), Northeastern defined by M91) and then we typed all markers within a Africa (Afroasiatic), the middle Nile region (Nilo-Saharans), haplogroup until the most derived mutation in that and (Khoisan).13 – 17 Early dispersals of Niger- haplogroup was determined (Figure 2). Thus, not every Congo, Afroasiatic, and possibly Nilo-Saharan languages individual was typed for every marker. Markers were typed are likely associated with migrating farmers.14 – 17 Diamond using allele-specific PCR, restriction enzyme digest, or and Bellwood15 hypothesized that early farmers replaced direct sequencing. Protocols and primer sequences for the languages of hunter-gatherers living in their path of these assays were previously published.23,30 We follow the expansion and that this replacement would lead to strong terminological conventions recommended by the Y Chro- correlations between linguistic and genetic variation. Their mosome Consortium30 for naming NRY lineages. least equivocal example of an association of a language group with the spread of agriculture are the Bantu MtDNA expansions. Beginning B4000 years ago, farmers speaking To compare maternally and paternally inherited patterns of Niger-Congo Bantu languages expanded from a southern variation, we re-examined 366 bp of mtDNA HVS1 Cameroonian homeland over most of subequatorial sequence data compiled from a number of previous Africa.13,18,19 Evidence for the concordant spread of studies.12,31–33 The data set includes 39 populations from Bantu genes and languages comes from autosomal,9,20 the major language groups: Khoisan (!Kung1, !Kung2, mtDNA,12,21 and Y chromosomal10,11,22 – 27 data. Khwe, Hadza), Nilo-Saharan (Kanuri, Songhai, Turkana, The nonrecombining portion of the Y chromosome Nubian, Sudanese, Mbuti, Datoga), Afroasiatic (Moroccan (NRY) and mtDNA are both haploid and uni-parentally Berber, non-Berber Moroccan, Egyptian, Algerian Moza- inherited and, hence, are expected to have a four-fold bite, Tuareg, Somalian, Amhara, Hausa, Podokwo, Man-

reduction in effective population size (Ne) relative to the dara, Uldeme, Iraqw), Niger-Congo non-Bantu ¼ autosomes. In the absence of selection, the reduced Ne (Fulbe Fulfulde, Yoruba, Serer, Wolof, Mandinka, Tupuri), leads to an increased rate of genetic drift, which makes and Niger-Congo Bantu (Bubi, Fang, Biaka, Kikuyu, these haploid regions sensitive indicators of such demo- Mozambique1, Mozambique2, Bakaka, Bassa, Mbenzele, graphic processes as bottlenecks, population subdivision, Sukuma) (Figure 1). Some populations represented in the and population size and range expansions. The compara- original data sets12,31 – 33 were omitted because they are not tive study of patterns of variation at these loci allows the found on the African mainland, are Cameroonian popula- examination of the relative contribution of males and tions not represented in the Y chromosome data set,33 or females in shaping African genetic diversity. In this study, because linguistic designations could not be inferred. we test for associations between genetic, linguistic and geographic differentiation to (1) identify correlates of Mantel tests genetic diversity in Africa, (2) examine the degree of The correlation among genetic, linguistic, and geographic concordance between the Y chromosome and mtDNA, and distances was assessed by the Mantel test34 employing (3) assess the effects of sex-specific demographic processes ARLEQUIN 2.000.35 To test whether statistically significant shaping patterns of variation. associations between linguistic and genetic affiliations

European Journal of Human Genetics 66

Contrasting patterns of Y chromosome and mtDNA variation ET Wood et al 869 Table 1 Sampled populations

Linguistic affiliation Geographic region Ethnicity N Family Sublevel Latitude/Longitude

West Africa Gambia/Senegal Wolof 34 Niger-Congo Atlantic 14N 15W Gambia/Senegal Mandinka 39 Niger-Congo Atlantic 15:5N 15W Mali Dogon 55 Niger-Congo Dogon 14N 3W Ghana Ewe 30 Niger-Congo Kwa 6N 1E Ghana Ga 29 Niger-Congo Kwa 5:5N 0E Ghana Fante 32 Niger-Congo Kwa 6:5N 1:5E

Central Africa Cameroon, North Podokwo 19 Afroasiatic Chadic 12:5N 14:5E Cameroon, North Mandara 28 Afroasiatic Chadic 12:5N 14:5E Cameroon, North Uldeme 13 Afroasiatic Chadic 12:5N 14:5E Cameroon, North Tupuri 9 Niger-Congo Gur 10N 13:5E Cameroon, South Bassa 11 Niger-Congo Bantu 3N 10E Cameroon, South Bakaka 17 Niger-Congo Bantu 3N 10:5E Cameroon, South Ngoumba 31 Niger-Congo Bantu 3N 10E Cameroon, South Bakola Pygmies 33 Niger-Congo Bantu 2:5N 10:5E CAR Biaka Pygmies 31 Niger-Congo Bantu 3N 16E CAR Baka Pygmies 18 Niger-Congo Bantu 3N 16E

East Africa DRC Nande 18 Niger-Congo Bantu 1:5N 30E DRC Hema 18 Niger-Congo Bantu 1:5N 30E DRC Alur 9 Nilo-Saharan Nilotic 1:5N 30E DRC Mbuti Pygmies 47 Nilo-Saharan Sudanic 2N 29E Tanzania S. Cushitic 9 Afroasiatic Cushitic 4S 35E Kenya Massai 26 Nilo-Saharan Nilotic 2S 37E Kenya Luo 9 Nilo-Saharan Nilotic 0:5S 34:5E Kenya Kikuyu & Kamba 42 Niger-Congo Bantu 1:5S 38:5E Uganda Ganda 26 Niger-Congo Bantu 1N 32E Ethiopia Amhara 18 Afroasiatic Semitic 10N 39E Ethiopia Oromo 9 Afroasiatic Cushitic 6N 39E Ethiopia South Semitic 20 Afroasiatic Semitic 12N 38E

South Africa Namibia !Kung/Sekele 32 Khoisan Northern 19:07S 13:39E Namibia Tsumkwe San 29 Khoisan Central 19:13S 17:42E Namibia Dama 18 Khoisan Central 24:50S 17:00E Namibia Nama 11 Khoisan Central 24:00S 17:50E Namibia Herero 24 Niger-Congo Bantu 22:30S 18:58E Namibia Ambo 22 Niger-Congo Bantu 22:30S 18:58E South Africa Sotho-Tswana 28 Niger-Congo Bantu 27S 27E South Africa Zulu 29 Niger-Congo Bantu 31:35S 28:47E South Africa Xhosa 80 Niger-Congo Bantu 33:58S 25:36E Zimbabwe Shona 49 Niger-Congo Bantu 17:50S 31:03E

North Africa Egypt Egyptian 92 Afroasiatic Erythraic 30N 30E Tunisia Tunisian 28 Afroasiatic Semitic 35N 10E

reflect the same events in population history or parallel, among populations. One of us (CE) constructed tree but separate isolation by distance processes, we performed relationships among the languages spoken by the study partial correlations holding geography (or language) con- populations using several sources of linguistic, archeologi- stant.36 Genetic distances were based on Slatkin’s37 cal, and ethnographic data. Divergence times between linearized FST values (ie, incorporating molecular distances related languages were estimated using archeological dates among haplogroups). Geographic distances between po- and glottochronological methods.38 Linguistic relation- pulations were calculated using approximate latitude and ships among populations in this study, as well as among longitude data for the sample sites (Table 1). We used a the populations in the mtDNA data set, are available at novel approach for classifying linguistic relationships http://www.u.arizona.edu/Bewood/data.html. We also

European Journal of Human Genetics 67

Contrasting patterns of Y chromosome and mtDNA variation ET Wood et al 870

Figure 1 Map of Africa. The approximate location of 40 populations typed for Y chromosome markers in this study (K) and 39 populations surveyed for HVS1 sequence data12,31,32,33 (J) are indicated. The distribution of the four African language families was constructed using Greenberg’s39 classifications and further refined with data from the ethnologue (http://www.ethnologue.com/). Three shades of gray on map refer to the distribution of language families: Khoisan (light gray, southwest), Afroasiatic (light gray, north), Niger-Congo (medium gray), and Nilo-Saharan (dark gray). The circled geographic regions include North, West, Central, East, and South Africa.

performed Mantel tests with matrices constructed using (1) were taken into account. We grouped populations by five the method described by Poloni et al10 that uses the tree geographic regions (West, Central, East, South, and North relationships of the languages defined by Greenberg,39 (2) Africa) and by four linguistic groups (Afroasiatic, Nilo- the tree relationships among languages reported in this Saharan, Khoisan, and Niger-Congo) (Figure 1). All samples study without making use of divergence times, (3) equal used for the Mantel analysis were also used in the AMOVA: distances among populations of different language fa- 1122 individuals from 40 populations for the Y chromo- milies, and/or (4) variable distances among populations some and 1918 individuals from 39 populations for of different language families. All matrices yielded very mtDNA. similar correlations (both r and P values) for the entire data Given that levels of population differentiation can be set. Results differed slightly among matrices when we influenced by (1) sample composition and the Y chromo- removed the Bantu speakers. somal and mtDNA data sets presented here are sampled differently, (2) the differing rates and modes of evolution AMOVA characterized by the Y chromosome and mtDNA systems, Analyses of molecular variance (AMOVA) were also per- and (3) the type of polymorphisms examined (eg, pre- formed using ARLEQUIN 2.000.35 Both haplogroup fre- ascertained Y chromosome SNPs versus mtDNA HVS1 quencies and molecular differences among haplogroups sequence data), direct comparisons between these haploid

European Journal of Human Genetics 68

Contrasting patterns of Y chromosome and mtDNA variation ET Wood et al 871

Figure 2 Maximum-parsimony tree of 50 Y chromosome biallelic markers typed in this survey. The root of the tree is denoted by an arrow. Major clades (ie, A– R) are labeled with large capital letters. Subclade labels (eg, A3b) are indicated to the left of the branches. Mutation names are given along the branches. The length of each branch is not proportional to the number of mutations or the age of the mutation. Only the names of the 36 haplogroups observed in the present study are shown to the right of the branches. Haplogroup frequencies are shown on the far right.

European Journal of Human Genetics 69

Contrasting patterns of Y chromosome and mtDNA variation ET Wood et al 872

genetic systems should be considered with caution. Never- populations are grouped according to language family, the F ¼ theless, by comparing linguistic and geographic associa- proportion of among-group variance ( CT 0.21) is more tions within a locus, we can ask whether mtDNA and the Y than three times higher than when populations are F ¼ chromosome have been influenced by similar demo- grouped according to geographic location ( CT 0.06) graphic processes. (Table 2). AMOVA results for the mtDNA data are also

presented in Table 2. The continental mtDNA FST is 0.15. MtDNA F-statistics are very similar when populations are Results placed in either linguistic or geographic groups. Geographic distribution of Y chromosome haplogroups in African populations These results indicate that Y chromosome variation is Phylogenetic analysis of the 50 Y biallelic markers used in significantly partitioned among both geographic and this study yielded 36 haplogroups (Figure 2) (for appendix linguistic groups. Therefore, both language and, to a lesser please refer to Supplementary Information). The vast extent, geography are probably important (albeit over- majority of these lineages (98.1%) belong to five major lapping) predictors of African genetic structure. haplogroups: A (7.1%), B (10.2%), E (70.2%), J (5.4%), and R (5.2%). Haplogroup A is closest to the root of the tree and is found most frequently in the Khoisan, particularly the Mantel tests A2 and A3b1 lineages (47.7%). Haplogroup B chromo- To test the underlying cause of association between genetic somes are most frequently observed in Pygmies (48.9%), and linguistic versus geographic variation, we performed with B2a* and B2b* being nearly exclusive to this group. Mantel tests. These tests ask whether there is a correlation Haplogroup E is overwhelmingly the most common in this between geographic (or language) distance and genetic study. Over half of the individuals in our study (51%) are distance. Mantel tests reveal a statistically significant members of the subclade E3a, which is defined by the P1 positive correlation between Y chromosome variation and mutation. Niger-Congo speakers have the highest fre- linguistics (r ¼ 0.32, P ¼ 0.001) that explains 8.9% of the quency of E-P1* chromosomes (40.7%) and the largest genetic variance. The correlation between genetic and proportion of E-M191 chromosomes (27.5%), particularly linguistic variation remains strong when geography is held in Bantu speakers (31.5%). The E3b1 (E-M78) lineage is constant (r ¼ 0.33, P ¼ 0.001). In contrast, there is no most frequent in Afroasiatics (22.5%). In this study, correlation between paternal genetics and geographic haplogroup J is concentrated in Afroasiatics (19.5%). While distances (r ¼ 0.01, P40.10) (Table 3). Mantel test results African haplogroup R chromosomes are generally quite based on the mtDNA HVS1 data are also presented in rare, R-P25* chromosomes are found at remarkably high Table 3. The correlation between maternal genetics and frequencies in northern Cameroon (60.7–94.7%). The linguistics is significant (r ¼ 0.23, P ¼ 0.016), but weakens remaining haplogroups (K, F*, I, and G) account for only when geography is held constant (r ¼ 0.16, P ¼ 0.046). 1.9% of the individuals in our data set. Similarly, a significant correlation between mtDNA and geography (r ¼ 0.23, P ¼ 0.008) weakens when linguistics is Analysis of molecular variance (AMOVA) held constant (r ¼ 0.17, P ¼ 0.035). It is important to note F The overall Y chromosome ST for the 40 populations is that a failure to find correlations in Mantel tests does not 0.32 (Table 2), a value that is similar to that found in a mean that two variables are not related in some way. F ¼ 28 previous study of African Y-SNP diversity ( ST 0.34). Rather, it means that processes that might cause a positive This value is also similar to that obtained when our African correlation (eg, isolation by distance or directional gene sample is grouped into five geographic regions. When flow in the case of geography, or strict language–gene co-

Table 2 Analysis of molecular variance (AMOVA)

Within populations Among populations within groups Among groups F F F Group No. of groups Variance (%) ST Variance (%) SC Variance (%) CT Y chromosome 1 68.2 0.32 Linguistic groupsa 4 62.1 0.38 16.6 0.21 21.3 0.21 Geographic groupsb 5 67.4 0.33 26.2 0.28 6.4 0.06

mtDNA 1 84.7 0.15 Linguistic groupsa 4 84.0 0.16 13.5 0.14 2.5 0.03 Geographic groupsb 5 84.1 0.16 12.3 0.13 3.6 0.04

All F-statistics. P-values are less than 0.01. aAfroasiatic, Khoisan, Niger-Congo, Nilo-Saharan (see Figure 1). bWest, Central, East, South, North Africa (see Figure 1).

European Journal of Human Genetics 70

Contrasting patterns of Y chromosome and mtDNA variation ET Wood et al 873

Table 3 Correlation and partial correlation coefficients, r significant genetic change, and/or genetic and linguistic (P-value), between genetic, linguistic, and geographic evolution may proceed at heterogeneous rates.2,15,42,43 distances We found a statistically significant association between Y mtDNA NRY variation and linguistic differentiation and a margin- ally significant association between mtDNA variation and Genetics and Linguistics 0.32 (0.001) 0.23 (0.016) linguisitic variation. However, when we performed Mantel Genetics and Linguistics, 0.33 (0.001) 0.16 (0.046) Geography held constant tests controlling for geographic distance, the partial Genetics and Geography 0.01 (0.508) 0.23 (0.008) correlation between maternal genetic and linguistic varia- Genetics and Geography, 0.08 (0.859) 0.17 (0.035) tion weakens, while that between paternal genetic and Linguistics held constant linguistic variation remains statistically significant (Table 3). This suggests that the observed association between Y chromosome and language variation reflects the same co-evolutionary population history events.38 evolution) are unlikely to be the only processes operating These differing patterns for the Y chromosome and mtDNA (Tables 2 and 3). could be the result of a greater degree of female than male admixture and/or the adoption of languages by females to a greater extent than males (see below). In either case, the Discussion implication is that African languages tend to be passed Mantel tests show a statistically significant positive from father to children.10 correlation between Y chromosome and linguistic varia- Associations between genetic and geographic distances tion, while there is no correlation between Y chromosome show the opposite trend than do the aforementioned and geographic variation. Furthermore, when populations associations between genetic and linguistic variation; there are grouped according to language, the amount of among- is no correlation between Y chromosome variation and F group paternal differentiation ( CT) is substantially higher geographic distance. In contrast, there is a stronger than when grouped according to geographic location. correlation between mtDNA variation and geographic Correlations with mtDNA show a different pattern. distance, albeit only marginally significant when language Maternal variation is weakly correlated with both language is held constant (Table 3). Thus, the genetics–language and geography and maternal among-group differentiation correlation is stronger for the Y chromosome and the is nearly the same when populations are grouped according different pattern shown by mtDNA data suggests that men to linguistic affiliation or geographic location. These and women did not have identical demographic histories. results suggest that patterns of differentiation and gene flow in Africa have been different for men and women in Effect of Bantu expansions on Y chromosome the recent evolutionary past.10 In the following sections, and mtDNA variation we discuss (1) the relationships among genetic, linguistic, Numerous studies suggest that the Bantu expansions have and geographic differentiation and the population history had a substantial impact on the distribution of genetic factors that may underlie these relationships, and (2) the variation in Africa.9,10,12,22,25,27 Is the strong Y chromo- effects of Bantu expansions on the distribution of Y some–linguistics correlation we observe across the entire chromosome and mtDNA variation in Africa. continent primarily the result of the massive migrations of Bantu farmers? We sought to clarify the effect of each Associations between genetic and linguistic variation language group on influencing the paternal genetics– The association of genetic and linguistic variation has been linguistics association and the genetics–geography associa- observed at the global level,1 as well as on the regional tion by repeating the Mantel test after removing each scale.36,40,41 What are the underlying causes of these language group in turn (ie, Afroasiatic, Khoisan, Nilo- associations? Sokal41 stated that a common language Saharan, Niger-Congo non-Bantu, and Niger-Congo Ban- usually reflects a common origin for two populations, tu). If one language group were disproportionately con- and a related language indicates a common origin farther tributing to the overall pattern, then the association is back in time. This is generally thought to be an outcome of expected to weaken upon removal of this group. There is common historical processes leading to genetic and only a single language group that has this effect: the linguistic diversification – for example, a founding popula- removal of Bantu–speakers causes the paternal genetics– tion may reproduce biologically and linguistically in a new linguistics correlation to drop from r ¼ 0.33 to 0.08 location and replace the genes and languages of previous (Figure 3).We note that the same trend is observed, albeit residents.2,15,36 Discrepancies between genetic and linguis- to a lesser extent, when other language matrices (see tic differentiation could arise through a number of Subjects and Methods) are employed in the Mantel tests processes: genetic admixture can occur without language (data not shown). The lower correlation coefficient when change, languages can be transmitted horizontally without Bantu populations (but not other linguistic groups) are

European Journal of Human Genetics 71

Contrasting patterns of Y chromosome and mtDNA variation ET Wood et al 874

tions are removed may reflect the fact that these two populations are maternal genetic outliers.31 To further investigate the effect of the Bantu expansions on patterns of geographic variation, we grouped popula- tions by their geographic location and removed each language group in a series of four AMOVA runs (data not shown). Unlike the case for any other language family, the removal of Bantu populations results in a higher Y

chromosome FCT (0.28) than when they are included (0.06). This supports the hypothesis that Bantu Y chromo- somes (eg E-P1*, E-M191) are acting to homogenize geographically differentiated populations. A similar analy- F sis of mtDNA results in slightly higher CT value when the Bantu populations are excluded (0.07 versus 0.04). If Bantu males and females dispersed equally from their Figure 3 Partial correlation coefficient between genetics and West African homeland, replacing the genes of local linguistics holding geography constant (black bars) and genetics and hunter-gatherers in their path of expansion (equal sex geography holding linguistics constant (white bars) for the Y chromosome and mtDNA. **Po0.01, *Po0.05. ratio model), then we would expect similar patterns of association for paternally and maternally inherited loci. If, on the other hand, one sex dispersed more effectively (sex- biased model), we would expect to find differences in the removed suggests that Bantu are contributing more to the degree of association between genetic and linguistic language–Y chromosome relationship than any other variation for the two haploid loci. Several explanations language group. have been offered for observed differences in patterns of Y The Y chromosome–geography correlation shows a chromosome and mtDNA variation among popula- different pattern. While the removal of the Bantu popula- tions.31,44 – 46 Our results support the sex-biased model tions does not produce a correlation, the additional whereby the replacement of pre-existing languages by removal of four northern Cameroonian populations results Bantu languages more closely parallels the turnover of Y in a statistically significant positive correlation (r ¼ 0.27, chromosomes than mtDNA. How can this be explained? P ¼ 0.008). The strong effect of these northern Cameroo- One possibility is that Bantu male farmers dispersed over nian populations on the Y chromosome results can be longer distances or in greater numbers than Bantu females. explained by the very high frequency of derived paternal Another possibility is that males and females dispersed (but not maternal) lineages that originated in non-African equally, but there was a higher ‘effective’ migration rate for populations.27,33 We note that this increased geographical Bantu Y chromosomes than Bantu mtDNA. As Bantu correlation is not entirely attributable to the northern farmers dispersed, they likely intermarried to some extent Cameroonian populations because when only these popu- with the original inhabitants related to modern Pygmies lations are removed, there is no Y chromosome–geography and Khoisan.15 In present-day African populations, the correlation (r ¼0.002, P40.10). Thus, in the absence of direction of intermarriage is usually between hunter- the unique populations from northern Cameroon, the gatherer women and farmer (Bantu) men and the children removal of Bantu speakers leads to an association between of these marriages generally become farmers residing in Y chromosome and geographic differentiation, consistent their father’s village19,47 (ie, patrilocality). If this were with a recent dispersal of Bantu Y chromosomes. typical of practices that existed throughout the Bantu On the other hand, the exclusion of Bantu speakers expansions, we would expect Bantu mtDNA to be diluted strengthens both the mtDNA–linguistics and mtDNA– (with hunter-gatherer mtDNA) to a greater extent than Y geography correlations (Figure 3). Upon further explora- chromosomes. It is also possible that if the ancestral Bantu- tion, we discovered that the increase in both of these speaking population were highly polygynous, then indi- maternal genetic correlations is due solely to Bantu-speak- genous Y chromosomes would have been replaced by a ing Pygmy populations, specifically, the Biaka and Mben- more homogeneous pool of Bantu Y chromosomes, leading zele (data not shown). The strong effect of these Pygmy to a stronger correlation between linguistic and genetic populations on the mtDNA–linguistics correlation sug- variation. Indeed, polygyny is known to be substantially gests that the horizontal transfer of languages from Bantu higher among food-producers than among hunter-gath- farmers to hunter-gatherer Pygmy females19 occurred erers.31 In combination with the above processes, the without significant genetic change.19,31,36 The stronger adoption of Bantu languages by hunter-gatherer Pygmies mtDNA associations with both linguistic and geographic may weaken maternal genetic–linguistic associations. variation observed when the Biaka and Mbenzele popula- Although our data cannot address the relative impact of

European Journal of Human Genetics 72

Contrasting patterns of Y chromosome and mtDNA variation ET Wood et al 875 these sociocultural processes, it is likely that sex-biased 6 Karafet TM, Osipova LP, Gubina MA, Posukh OL, Zegura SL, migration/admixture, patrilocality, polygyny, and/or lan- Hammer MF: High levels of Y-chromosome differentiation among native Siberian populations and the genetic signature of a boreal guage borrowing contributed to the observed patterns of hunter-gatherer way of life. Hum Biol 2002; 74: 761 – 789. 31 African variation. 7 Fagundes NJ, Bonatto SL, Callegari-Jacques SM, Salzano FM: While earlier studies of Y chromosome variation have Genetic, geographic, and linguistic variation among South noted a correspondence between high-frequency hap- American Indians: possible sex influence. Am J Phys Anthropol 2002; 117: 68 – 78. logroups and the distribution of Bantu speakers (ie E-P1* 8 Lum JK, Cann RL, Martinson JJ, Jorde LB: Mitochondrial and 11,22 – 27 and E-M191), this is the first study to demonstrate a nuclear genetic relationships among Pacific Island and Asian statistically significant correlation between Y chromosome populations. Am J Hum Genet 1998; 63: 613 – 624. SNP haplogroups30 and linguistic differentiation in Africa. 9 Excoffier L, Pellegrini B, Sanchez-Mazas A, Simon C, Langaney A: Genetics and history of sub-Saharan Africa. Yrbk Phys Anthropol The data presented here are consistent with the hypothesis 1987; 30: 151 – 194. that prehistoric agriculture dispersed hand-in-hand with 10 Poloni ES, Semino O, Passarino G et al: Human genetic affini- Bantu languages and Y chromosomes, with languages and ties for Y-chromosome P49a,f/TaqI haplotypes show strong correspondence with linguistics. Am J Hum Genet 1997; 61: Y chromosomes replacing those of hunter-gatherers in the 1015 – 1035. 15 paths of expansion. Not all populations speaking Bantu 11 Scozzari R, Cruciani F, Santolamazza P et al: Combined use of languages in our study showed the effects of complete biallelic and microsatellite Y-chromosome polymorphisms to paternal genetic replacement (eg, the Bantu-speaking infer affinities among African populations. Am J Hum Genet 1999; 65: 829 – 846. western Pygmies and northern Cameroonians). It is 12 Salas A, Richards M, De la Fe T et al: The making of the African important to note that different mutation rates, as well as mtDNA landscape. Am J Hum Genet 2002; 71: 1082 – 1111. methods used to assay variation, on the NRY and mtDNA 13 Greenberg J: Historical inferences from linguistic research in Sub-Saharan Africa; In: Butler J (ed): Boston University Papers may contribute to some of the contrasting patterns in African History I. Boston, MA: Boston University Press, 1964, 46 observed here. Future studies that examine Y chromo- pp 1 – 15. some and mitochondrial DNA sequence variation in the 14 Ehret C: The Civilizations of Africa: A History to 1800. Virginia: The same samples representing African geographic and linguis- University Press of Virginia, 2002. 15 Diamond J, Bellwood P: Farmers and their languages: the first tic diversity will help to further elucidate the effects of expansions. Science 2003; 300: 597 – 603. Bantu expansions on the complex genetic landscape of 16 Ehret C: Historical/linguistic evidence for early African food Africa. production; In: Clark JD, Brandt S (eds): From Hunters to Farmers. Berkeley: University of California Press, 1984, pp 26 – 35. 17 Ehret C: Sudanic civilization; In: Adas M (ed): Agricultural and Pastoral Societies in Ancient and Classical History. Philadelphia: Acknowledgements Temple University Press, 2001. 18 Ehret C: Linguistic inferences about early Bantu history; In: We would like to thank Alan Redd, Tanya Karafet, Leigh Hunnicutt, Posnansky CEaM (ed): The Archaeological and Linguistic Reconstruc- Roxane Bonner, and Jared Ragland for typing markers, and Matt tion of African History. Berkeley: University of California Press, Kaplan and the GATC for technical assistance. We also thank 1982, pp 57 – 65. Brendan Hug, Phil Fischer, H Kimura, and AS Santachiara-Benerecetti 19 Klieman K: ‘The Pygmies Were Our Compass’: Bantu and Batwa in for DNA samples, and Jason Wilder, Tanya Karafet, and Maya Metni the History of West , Early Times to c. 1900 C.E. Pilkington and three anonymous reviewers for helpful comments. Portsmouth, NH: Heinemann, 2003. Antonio Salas kindly provided a file with mtDNA data. GDB and GS 20 Cavalli-Sforza LL, Menozzi P, Piazza A: The History and Geography were supported by MIUR (COFIN Grant No. 2003054059). This work of Human Genes. Princeton, Princeton University Press, 1994. 21 Soodyall H, Vigilant L, Hill AV, Stoneking M, Jenkins T: mtDNA was supported by grant GM53566 from the National Institute of control-region sequence variation suggests multiple independent General Medical Sciences to MH. origins of an ‘Asian-specific’ 9-bp deletion in sub-Saharan Africans. Am J Hum Genet 1996; 58: 595 – 608. 22 Hammer MF, Karafet T, Rasanayagam A et al: Out of Africa and References back again: nested cladistic analysis of human Y chromosome 1 Cavalli-Sforza LL, Piazza A, Menozzi P, Mountain J: Reconstruc- variation. Mol Biol Evol 1998; 15: 427 – 441. tion of human evolution: bringing together genetic, archaeolo- 23 Hammer MF, Karafet TM, Redd AJ et al: Hierarchical patterns of gical, and linguistic data. Proc Natl Acad Sci USA 1988; 85: 6002 – global human Y-chromosome diversity. Mol Biol Evol 2001; 18: 6006. 1189 – 1203. 2 Chen JT, Sokal RR, Ruhlen M: Worldwide analysis of genetic and 24 Passarino G, Semino O, Quintana-Murci L, Excoffier L, Hammer linguistic relationships of human-populations. Hum Biol 1995; M, Santachiara-Benerecetti AS: Different genetic components in 67: 595 – 612. the Ethiopian population, identified by mtDNA and Y-chromo- 3 Rosser ZH, Zerjal T, Hurles ME et al: Y-chromosomal diversity in some polymorphisms. Am J Hum Genet 1998; 62: 420 – 434. Europe is clinal and influenced primarily by geography, rather 25 Thomas MG, Parfitt T, Weiss DA et al: Y chromosomes traveling than by language. Am J Hum Genet 2000; 67: 1526 – 1543. south: the cohen modal haplotype and the origins of the Lemba – 4 Zegura SL, Karafet TM, Zhivotovsky LA, Hammer MF: High- the ‘Black Jews of Southern Africa’. Am J Hum Genet 2000; 66: resolution SNPs and microsatellite haplotypes point to a single, 674 – 686. recent entry of Native American Y chromosomes into the 26 Underhill PA, Passarino G, Lin AA et al: The phylogeography of Y Americas. Mol Biol Evol 2004; 21: 164 – 175. chromosome binary haplotypes and the origins of modern 5 Hurles ME, Nicholson J, Bosch E, Renfrew C, Sykes BC, Jobling human populations. Ann Hum Genet 2001; 65: 43 – 62. MA: Y chromosomal evidence for the origins of oceanic-speaking 27 Cruciani F, Santolamazza P, Shen PD et al: A back migration from peoples. Genetics 2002; 160: 289 – 303. Asia to sub-Saharan Africa is supported by high-resolution

European Journal of Human Genetics 73

Contrasting patterns of Y chromosome and mtDNA variation ET Wood et al 876

analysis of human Y-chromosome haplotypes. Am J Hum Genet 37 Slatkin M: A measure of population subdivision based 2002; 70: 1197 – 1214. on microsatellite allele frequencies. Genetics 1995; 139: 457 – 462. 28 Hammer MF, Spurdle AB, Karafet T et al: The geographic 38 Embleton S: Statistics in Historical Linguistics. Bochum, Brock- distribution of human Y chromosome variation. Genetics 1997; meyer, 1986. 145: 787 – 805. 39 Greenberg JH: The Languages of Africa. The Hague: Mouton, 1963. 29 Hammer MF, Redd AJ, Wood ET et al: Jewish and Middle Eastern 40 Barbujani G, Sokal RR: Zones of sharp genetic change in Europe non-Jewish populations share a common pool of Y-chromosome are also linguistic boundaries. Proc Natl Acad Sci USA 1990; 87: biallelic haplotypes. Proc Natl Acad Sci USA 2000; 97: 6769 – 6774. 1816 – 1819. 30 YCC: A nomenclature system for the tree of Y chromosomal 41 Sokal RR: Genetic, geographic, and linguistic distances in Europe. binary haplogroups. Genome Res 2002; 12: 339 – 348. Proc Natl Acad Sci USA 1988; 85: 1722 – 1726. 31 Destro-Bisol G, Donati F, Coia V et al: Variation of female and 42 Cavalli-Sforza LL, Feldman MW: Cultural Transmission and male lineages in Sub-Saharan populations: the importance of Evolution. Princeton: Princeton University Press, 1981. sociocultural factors. Mol Biol Evol 2004; 21: 1673 – 1682. 43 Barbujani G: DNA variation and language affinities. Am J Hum 32 Knight A, Underhill PA, Mortensen HM et al: African Y Genet 1997; 61: 1011 – 1014. chromosome and mtDNA divergence provides insight into the 44 Seielstad MT, Minch E, Cavalli-Sforza LL: Genetic evidence for a history of click languages (vol 13, pg 464, 2003). Curr Biol 2003; higher female migration rate in humans. Nat Genet 1998; 20: 13: 464 – 473. 278 – 280. 33 Coia V, Destro-Bisol G, Verginelli F et al: mtDNA variation in 45 Jorde LB, Watkins WS, Bamshad MJ et al: The distribution of North Cameroon: lack of Asian lineages and implications for back human genetic diversity: a comparison of mitochondrial, auto- migration from Asia to Sub-Saharan Africa. Am J Phys Anthropol somal, and Y-chromosome data. Am J Hum Genet 2000; 66: 2005, in press. 979 – 988. 34 Mantel N: The detection of disease clustering and a generalized 46 Wilder JA, Kingan SB, Mobasher Z, Pilkington MM, Hammer MF: regression approach. Cancer Res 1967; 27: 209 – 220. Global patterns of human mitochondrial DNA and 35 Schneider S, Roessli D, Excoffier L: ARLEQUIN ver 2.000: Y-chromosome structure are not influenced by higher A Software for Population Genetic Analysis. Geneva, Switzerland: migration rates of females versus males. Nat Genet 2004; 36: Genetics and Biometry Laboratory, University of Geneva, 2000. 1122 – 1125. 36 Nettle D, Harriss L: Genetic and linguistic affinities between 47 Jenkins T: Human Evolution in Southern Africa: Human Genetics, Part human populations in Eurasia and West Africa. Hum Biol 2003; A: The Unfolding Genome. New York: Alan R. Liss, Inc., 1982, pp 75: 331 – 444. 227 – 253.

Supplementary Information accompanies the paper on European Journal of Human Genetics website (http://www.nature.com/ejhg).

European Journal of Human Genetics

74

APPENDIX D: HUNTER-GATHERER POPULATIONS SHARE AN ANCIENT COMMON ANCESTRY

75

African hunter-gatherer populations share an ancient common ancestry

Elizabeth T. Wood1,2, Daryn A. Stover1, Veronica A. Chamberlain1 and Michael F. Hammer1,2

1Division of Biotechnology and 2Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ

Corresponding author: Elizabeth Wood Department of Ecology and Evolutionary Biology BSW310, Tucson, AZ 85721 email: [email protected]

Keywords: Khoisan, Pygmy, Y-chromosome, hunter-gatherer, Africa, B2b

76

All humans lived as hunter-gatherers prior to ~10,000 years ago. Agriculture arose independently in different parts of the world and subsequently spread across most of the globe with farmers assimilating or displacing local hunter-gatherers. Today, there are fewer than 25 African hunter-gatherer populations[1]. A central question surrounding modern hunter-gatherers is whether they are the descendants of a common ancestral population whose lifestyle was partially preserved to the present-day[1,2]. Here, we examined Y-chromosome SNP and STR diversity in two African hunter-gatherer groups, the Khoisan and Pygmies, to examine the hypothesis that these populations share an ancient common ancestry. We describe B2b, a unique set of Y-chromosome lineages that are nearly exclusive to African hunter-gatherers suggesting a common history of these populations.

The root of the Y-chromosome tree lies between the African-specific Haplogroup

A and B lineages indicating an African origin of human Y-chromosomes[3]. We previously reported Y-chromosome SNP data showing that the oldest Y-chromosome haplogroups, A and B, are most common among Khoisan and Pygmy groups, respectively[4]. The high frequency (47.7%) of ancestral Haplogroup A chromosomes in the Khoisan (Figure 1) indicate that Khoisan are descendants of some of the oldest patrilineages of modern humans[5]. The Khoisan-specific lineage, A2 (A2* and A2b), is derived from the longest Y-chromosome branch observed to date[3] suggesting an extended period of isolation from other African groups. On the other hand, the A3b

77

lineages (A3b1 and A3b2) are shared with northeastern Africans[3-5] supporting the hypothesis that the earliest ancestors of modern-day Khoisan speakers formed small bands of networked populations who resided across the savannas of eastern and southern

Africa along the Great Rift Valley ~20-40-kya[6]. Pygmies reside in small foraging groups that are widely distributed on the eastern and western edges of the equatorial forest and, like the Khoisan, possess unique phenotypic characteristics. Pygmies carry the highest frequencies of Haplogroup B chromosomes (48.9%) which are found in only

5.1% of agricultural populations (Figure 1). The high frequency of Haplogroup B chromosomes in both Western (59.6%) and Eastern (42.7%) Pygmies support the hypothesis that Pygmies once formed a continuous set of populations living across central

Africa[2]. Notably, the most ancestral lineage found on the Y-chromosome tree (A*) are found only in Bakola Pygmies (9.1%).

Although the oldest Y-chromosome haplogroups are associated with the Khoisan and Pygmy populations, the presence of derived polymorphisms shared exclusively by hunter-gatherer populations has, thus far, not been elucidated. We describe a unique set of lineages, B2b which is comprised of B2b*, B2b1 and B2b4, that are present in both

Pygmies (48/129) and Khoisan (12/90) but are only found in 2 of the 903 other Africans surveyed (Figure 1). Median-joining network analysis[7] based on STR diversity indicates that 1) the B2b lineage is characterized by divergent lineages and 2) the

Pygmies and Khoisan fall on either side of the inferred root (Figure 2). We estimated the age (TMRCA) of the B2b lineage using the average square distance statistic (ASD)[8] and the inferred ancestral haplotype. The TMRCA of the B2b lineage is estimated to be

78

62,400 (95% CI: 36,700-111,600) assuming a mutation rate of 6.9 x 10-4/generation [9]

(see on-line Supplemental Methods). This suggests that the B2b lineage is very old and

likely predates the divergence of the Khoisan and Pygmy populations. Thus, the

Khoisan and Pygmies share ancient Y-chromosome lineages that are not present in

agricultural groups supporting the hypothesis that African hunter-gatherers are descended

from an ancient network of populations.

The Hadza are a hunter-gatherer group residing in Tanzania that is believed to be

descended from the ancient network of Khoisan-speaking populations[6]. If the Hadza

are related to Khoisan, we may expect the Hadza to carry genetic lineages also found in

South African Khoisan. A study of the Hadza[10] did not find the Khoisan-specific

Haplogroup A chromosomes but did find a remarkably high frequency (52%) of the B2b

lineage which is commonly found in both Khoisan (13.3%) and Pygmies (37.2%)

(Figure 1). We suggest that the very high frequency of the B2b lineage support an

ancient connection among the hunter-gatherer Pygmies, Khoisan, and Hadza populations.

The Y-chromosome B2b haplogroup is the only genetic lineage observed to date

that is shared at high frequencies among African hunter-gatherer groups. Several unique

properties of Y-chromosomes in Africa may have allowed the preservation of this hunter-

gatherer lineage while others were lost First, the absence of recombination and the low

occurrence of recurrent mutation allow the unambiguous reconstruction of Y-

chromosome haplotypes and the detection of shared derived lineages. Second, the

reduced effective population size of the Y-chromosome likely resulted in the stochastic

increase (or decrease) of some lineage frequencies through genetic drift. Moreover,

79

hunter-gatherer males likely assimilated less than females during the Bantu expansions[4] which may have allowed rare paternal (but not maternal) polymorphisms shared among hunter-gatherers to survive.

Acknowledgements

We thank F. Marlowe, M. Weale, C. Ehret for helpful comments, M. Bamshad, G.

Destro-Bisol, and S. Horai for DNA samples, and T. Karafet, A. Redd, and L. Hunnicutt and for technical advice. This work was supported by grant GM53566 from the National

Institute of General Medical Sciences to M.H.

References

1. Marlowe, F. (2005) Hunter-Gatherers and Human Evolution. Evol. Anthropol.

14:54-67.

2. Klieman, K. (2003) "The Pygmies Were Our Compass": Bantu and Batwa in the

History of West Central Africa, Early Times to c. 1900 C.E. Heinemann,

Portsmouth, NH.

3. Hammer, M.F., Karafet, T.M., Redd, A.J., Jarjanazi, H., Santachiara-Benerecetti,

S., Soodyall, H., Zegura, S.L. (2001) Hierarchical patterns of global human Y-

chromosome diversity. Mol. Biol. Evol. 18:1189-1203.

4. Wood, E.T., Stover, D.A., Ehret, C., Destro-Bisol, G., Spedini, G., McLeod, H.,

Louie, L., Bamshad, M., Strassmann, B.I., Soodyall, H., Hammer, M.F. (2005)

80

Contrasting patterns of Y chromosome and mtDNA variation in Africa: evidence

for sex-biased demographic processes. Eur. J. Hum. Genet. 13:867-876.

5. Semino, O., Santachiara-Benerecetti, A.S., Falaschi, F., Sforza, L.L.C. &

Underhill, P.A. (2002) and Khoisan share the deepest clades of the

human Y-chromosome phylogeny. Am. J. Hum. Genet. 70:265-268.

6. Nurse, G., Weiner, J. & Jenkins, T. (1985) The peoples of southern Africa and

their affinities. Research Monographs on Human Population Biology. Clarendon

Press, Oxford.

7. Bandelt, H.-J., Forester, P. & Rohl, A. (1999) Median-joining networks for

inferring intraspecific phylogenies. Mol. Biol. Evol. 16:37-48.

8. Behar, D.M., Thomas, M.G., Skorecki, K., Hammer, M.F., Bulygina, E.,

Rosengarten, D., Jones, A.L., Held, K., Moses, V., Goldstein, D, Bradman, N.,

Weale, M.E. (2003) Multiple origins of Ashkenazi Levites: Y chromosome

evidence for both Near Eastern and European ancestries. Am. J. Hum. Genet.

73:768-779.

9. Zhivotovsky, L, Underhill, P.A., Cinnioglu, C., Kayser, M., Morar, B., Kivisild,

T., Scozzari, R., Cruciani, F., Destro-Bisol, G., Spedini, G., Chambers, G.K.,

Herrera, R.J., Yong, K.K., Gresham, D., Tournev, I., Feldman, M.W.,

Kalaydjieva, L. (2004) The Effective Mutation Rate at Y Chromosome Short

Tandem Repeats, with Application to Human Population-Divergence. Am. J.

Hum. Genet. 74:50-61.

81

10. Knight, A., Underhill, P.A., Mortensen, H.M., Zhivotovsky, L.A., Lin, A.A.,,

Henn, B.M.,, Louis, D., Ruhlen, M., Mountain, J. L. (2003) African Y

chromosome and mtDNA divergence provides insight into the history of click

languages. Curr. Biol. 13:705-705.

.

82

Figure 1: Y-chromosome SNP haplogroup frequencies observed in Pygmy, Khoisan, and agricultural African populations. The arrow indicates the root of the tree. Branch lengths are not proportional to age. Frequencies of the B2b lineages are outlined.

Branches for haplotypes F, G, I, and K are not depicted on this tree but frequencies are shown on the far right.

A 10831a M91

3 SRY * 1 2 B M60 P9 M32 * 2 P3 P4 M6 M14 M14 M6 P4 P3

b b M182 * * * a b E M31 P14

P14 4064 50f2 M150 1 12 * * 1 4 JR P7 M51 P6 M13 12f2 P28 M152 M269 YAP SRY A2b B* B2a* A* A1 A2* A3* A3b1 A3b2 B2* B2a1 B2b* B2b1 B2b4 E J R F,G,I,K Khoisan 90 14.4 11.1 22.2 1.1 8.9 4.4 34.4 1.1 2.2 Sekele/!Kung 32 15.6 9.4 21.9 3.1 6.3 43.8 Tsumkwe San 29 27.6 17.2 20.7 24.1 6.9 3.4 Nama 11 9.1 54.5 27.3 9.1 Dama 18 5.6 5.6 5.6 72.2 5.6 5.6

Pygmies 129 2.3 0.8 4.7 4.7 2.3 9.3 27.9 48.1 Mbuti (West) 47 2.1 6.4 10.6 21.3 21.3 38.3 Biaka (East) 31 3.2 3.2 3.2 45.2 45.2 Baka (East) 18 5.6 66.7 27.8 Bakola (East) 33 9.1 6.1 6.1 3.0 75.8

Other Africans 903 0.3 0.1 0.8 3.1 0.1 0.8 4.0 0.1 0.1 75.7 5.2 7.4 2.2

83

Figure 2: Median-joining STR Network of the B2b (N=42) and B2* (N=4) lineages using 12 STRs and 3 SNPs (50f2, P7, P6). Observed STR haplotypes are represented by colored circles with the area of the circles proportional to the number of individuals where yellow = Western Pygmy, red = Eastern Pygmy, blue = Khoisan, green = Kenyans.

The outgroup B2* lineage is circled with the root of the B2b-STR network indicated with a black dot. Branch lengths are roughly proportional to the number of one-repeat mutations separating two haplotypes.

P6 50f2

P7

84

APPENDIX E: THE PHYLOGEOGRAPHY OF THE Y CHROMOSOME

85

The phylogeography of African Y Chromosomes: A Linguistics Perspective

Elizabeth T. Wood1,2, Christopher Ehret3, Daryn A. Stover1, Michael F. Hammer1.2

1Genomics Analysis and Technology Core and 2Department of Ecology and Evolutionary

Biology, University of Arizona, Tucson, AZ 3Department of History, University of Los

Angeles, CA

Corresponding author:

Elizabeth Wood

Department of Ecology and Evolutionary Biology

BSW310, Tucson, AZ 85721 email: [email protected]

86

Abstract

Genetic and linguistic variation are shaped by similar evolutionary processes such as population range expansions and subdivision. When evolution proceeds at similar rates for gene and language, these systems will concord. A strong correlation between Y chromosome and linguistic variation has been observed at the continental level but the degree of concordance at the regional level in Africa has been largely unexplored. We present a detailed phylogeographic description of Y chromosome variation in Africa in relation to African linguistic variation. These findings suggest that the Niger-Congo and

Afroasiatic language families are spread-zones where male speakers of these languages recently (< 10 kya) displaced pre-existing populations. The Khoisan and Nilo-Saharan language families, on the other hand, represent mosaic-zone processes which are characterized by greater linguistic and Y chromosome diversity that trace to more ancient times (>10 kya). Additionally, we infer that the Niger-Congo and Afroasiatic language families were strongly influenced by the spread of agriculture; whereas the Khoisan and

Nilo-Saharan language families are likely the result of the long association with hunter- gatherer and/or pastoralist lifestyle.

87

Introduction

The synthesis of linguistics with archeology and cultural anthropology has

provided an increasingly detailed account of human prehistory and generated testable

hypotheses for population genetics. Human population history has been dramatically

impacted by the Neolithic Revolution. The series of dispersals resulting from the

transition from food gathering to food production, which occurred independently and at

different times in various parts of the world, are thought to be one of the major factors

shaping contemporary patterns of human linguistic, cultural, and biological variation

(‘the farming/language dispersal hypothesis’) (Cavalli-Sforza et al. 1993; Barbujani et al.

1994; Diamond and Bellwood 2003). Africa is one of the areas of the world with the

highest levels of genetic (Excoffier 2002) and linguistic (Nettle 1998) diversity as well as

the continent with the best documented example of an association of a language group

and genetic lineages with the spread of agriculture: the Bantu expansions ((Diamond and

Bellwood 2003).

The historical processes that shaped linguistic variation can be better understood

by examining the geographic distribution of languages. Similarly, the distribution of

genetic variation can shed light on the processes that affected biological variation.

Linguists have noted the marked differences in the geographic distribution of languages

and patterns of language densities. These contrasting patterns arise when a region has

undergone language displacement versus a long period of linguistic stability (Renfrew

1992). If language groups are also biologically-reproducing units, both linguistic and genetic variation will be shaped by similar processes such as population bottlenecks,

88

subdivision, and population size and range expansions. This predicts that patterns of genetic variation will be similar to patterns of linguistic variation. The non-recombining portion of the Y chromosome (NRY) is haploid and only transmitted from father to son and, thus, has a 4-fold reduced effective population size (Ne) relative to autosomes making it sensitive to recent historical processes. The lower Ne of the NRY leads increased genetic drift and is expecte to reduced coalescent times, hence, this compartment of the genome can detect very recent demographic processes. The association between linguistic and Y chromosome variation has been shown to be remarkably strong in Africa indicating that these systems are capturing similar events in population history (Poloni et al. 1997, Wood et al. 2005). In this study, we provide a descriptive analysis of Y chromosome SNP variation to qualitatively examine hypotheses generated from linguistic, archeological and ethnographic studies.

Linguistic Variation in Africa

The four African language families have been shaped by both language displacement (Niger-Congo and Afroasiatic) and linguistic stability (Khoisan and Nilo-

Saharan) (Renfrew 2003). The largest African language family, Niger-Congo, is characterized by deep linguistic diversity and likely originated in West Africa before agriculture, possibly ~12-kya with the earliest migrations occurring with the spread of farming ~10-7 kya (Greenberg 1964; Ehret 1984; Ehret 2002) (Figures 1a, 1b, and 2).

A relatively homogeneous subset of Niger-Congo languages, the Bantu languages, comprise > 500 of this family’s recorded 1436 languages (Williamson and Blench 2000).

89

This is widely attributed to the massive migrations of linguistically-bound farmers throughout most of sub-Saharan Africa (e.g. Phillipson 2003). The origin of the Bantu

languages is placed in the region between the Niger and Congo rivers, possibly in

southern Cameroon ~4-6 kya (Greenberg 1955; Greenberg 1963). Farmers speaking

early Bantu languages first spread into the far northwestern parts of the equatorial

rainforest and then more widely across the forest zone (Klieman 2003) (Figure 1b).

After 3,000 years ago new fronts of Bantu farming expansions emerged and by 1,000

years ago Bantu languages were established across large areas of eastern and southern

Africa (Figure 1a) (Ehret 1982; Vansina 1990; Ehret 1998, Vansina, 1990 #70). The

massive migration of Bantu speakers is supported by a very large number of genetic

studies including classical genetic markers (Excoffier et al. 1987; Cavali-Sforza et al.

1994), mtDNA studies (Soodyall et al. 1996; Salas et al. 2002), and the Y chromosome

(Poloni et al. 1997; Hammer et al. 1998; Scozzari et al. 1999; Passarino et al. 1998;

Thomas et al. 2000; Cruciani et al. 2002; Wood et al. 2005).

The distribution of Afroasiatic languages extends across all of North Africa, East

Africa (Figure 1a), and parts of the Middle East. Five of the six branches of the

Afroasiatic language family (Figure 2) are located in Africa and one (Semitic) is also found in the Levant, Iraq, and the Arabian Peninsula and most are associated with the spread of farmers. The origin and early history of this language family are controversial

(e.g. Bellwood and Diamond and Bellwood 2003; Ehret et al. 2004; Bellwood 2004).

Most linguists support an African origin of this language family (Fleming 1974; Ehret

1978, 1979; Diakonoff 1988, 1998; Blench 1993; see McCall 1998 for further

90

references). The African-origin hypothesis is supported by (1) the ‘fewest-moves’

principle suggests that it is more parsimonious to accept that one branch moved out of

Africa than that five branches moved into Africa and (2) linguistic analysis of internal

sub-classification, reconstructed subsistence and environmental lexicon, integrated with

archeological evidence (McCall 1998). These reconstructions place the Afroasiatic homeland in areas in or immediately north of the Ethiopian Highlands between the Red

Sea and the upper Nile ~15-kya (Figure 1b) (Greenberg 1964; Fleming 1974; Ehret

1979; Diakonoff 1998; Hayward 2000) (Fleming 1969, Munson 1977). The African-

origin hypothesis suggests that the earliest Afroasiatic peoples were likely wild grass or grain collectors with agriculture emerging separately in the several branches of

Afroasiatic—in the , in North Africa (Cushitic), in the Lake Chad Basin

(Chadic), and in the Middle East (Semitic) (Ehret 1979; Militarev 2003). Farmers speaking different Afroasiatic languages then expanded into new areas leading to the present-day distribution. The Asian-origin hypothesis of Afroasiatic languages suggests that the homeland is in the Levant 11.5-kya and that the earliest Afroasiatic speakers were farmers (Militarev 2003). Later two waves moved into Africa, one spread carried farming into the Nile into Egypt and North Africa, giving rise to the Egyptian and Berber branches and another wave of pastorialists moved from western Arabia across the Red

Sea into Ethiopia and giving rise to the Cushitic, Omotic, and Chadic branches

(Militarev 2003). The Semitic branch may have spread earlier but it certainly moved across much Africa with the Arab expansions that occurred within the past 2-3-ky.

According to the Asian-origin hypothesis, the original Middle Eastern homeland has

91

since been masked by the replacement of Afroasiatic languages with languages of the

Semitic branch. Genetic data have been unable to resolve this debate.

The earliest Nilo-Saharan speakers ~15 kya were hunter-gatherers. Their

descendants at around ~10 kya developed two distinctive economies: an Aquatic

adaptation that rapidly spread from the middle Nile in the east to present-day Mali in the west (Sutton 1974), and an agripastoral adaptation restricted at first to the southern eastern Sahara (Wendorf and Schild 1998). The mid-Holocene dry phase, 8500-7500 years ago, led to the spread of the Nilo-Saharan agripastoralists west across the Sahel, in most areas displacing their Aquatic relatives (Ehret 2001), and also southward later on, toward East Africa (David 1982) (Figure 1b). Their descendants formed a continuous set of populations across this vast region until the Arab intrusions within the past 1000 years (Ehret 2001; Ehret 2002). With the past 4,000 years, a subset of Nilo-Saharan speakers, the Nilotes, are believed to have expanded into regions east of Lake Victoria and the southeastern Middle Nile Basin where there were earlier established Nilo-

Saharan or Cushitic (Afroasiatic) agripastoralist populations (Ambrose 1982; Ehret

2003). Today the distribution of Nilo-Saharan languages is primarily in the regions in

and south of the Sahara Desert, around Lake Victoria (Nilotic), and in a small region in

the west near the Niger River (Songay). Because of the poor sampling of this language

group, little is know of the paternal genetic affinities of Nilo-Saharan populations.

Khoisan is the only language family currently not associated with one of the early

African agricultural traditions and is characterized by internal linguistic differentiation

and unique click sounds, suggesting that Khoisan is older than other language groups

92

(Greenberg 1963). The present-day distribution of the is primarily in

Namibia, Botswana, Angola, and South Africa (Figure 1a) and includes the !Kung San who are mostly hunter-gatherers and the Khoekhoe who recently adopted pastoralism. It has been hypothesized that the earliest ancestors of modern day Khoisan speakers formed small bands of networked populations who resided across the savannas of eastern and southern Africa along the Great Rift Valley ~40-20-kya (Figure 1b) (Tobias 1964;

Ambrose 1982; Nurse et al. 1985; Ehret 1998) and formerly extended into Somalia and as far north as the southern edges of the Ethiopian highlands (Ehret, in press). The presence

of two linguistic Khoisan-speaking isolates currently reside in Tanzania, the Hadza and

Sandwe, is consistent with the hypothesis that Khoisan was once more widely distributed.

The relatively high frequency of ancestral Y chromosome haplotypes in southern

Khoisan speakers, the Hazda of Tanzania, and populations as far north as Ethiopia

(Scozzari et al. 1999; Chen et al. 2000; Hammer et al. 2001; Semino et al. 2002) support the antiquity of the Khoisan language family and its once broad geographic distribution.

Pygmies, like the Khoisan, are hunter-gatherers and are phenotypically distinct from other Africans. Residing in small groups on the fringes and within the equatorial forest, Pygmy populations are widely distributed from southern Cameroon/western

Central African Republic (CAR) to the eastern democratic Republic of Congo (DRC),

Ruwanda, and Burundi (Figure 1). Pygmies speak languages that were likely adopted from their farming neighbors (Vansina 1990; Klieman 2003), with the majority of Pygmy populations residing in the west speaking Niger-Congo languages (e.g. Biaka, Bakola,

Baka) and a few populations located in the east speaking Nilo-Saharan languages (e.g.

93

Mbuti and Efe). It has been suggested that the original inhabitants of the area that the

Bantu speakers replaced were hunter-gatherer Pygmies or possibly other fishing

aborigines and that as Bantu expanded in size and migrated, they assimilated or displaced

the populations they encountered (Greenberg 1964; Vansina 1990). Hunter-gatherer

populations often adopt the language of the incoming farmers (Ehret 1988; Refrew 2003)

and occurs because the farming male has greater success in obtaining a mate than does

the hunter-gathering male (Refrew 2003). This hypothesis predicts different patterns of

male and female variation in farming versus hunter-gatherer populations. In Africa,

contrasting patterns of Y chromosome and mtDNA variation support this sex-biased mating hypothesis possibly due to the common cultural practice of marriage between hunter-gatherer women and farmer (Bantu) men where the children of these marriages usually become farmers residing in their father’s village (i.e., patrilocality) (Destro-Bisol

et al. 2004; Wood et al. 2005).

The farming/language dispersal hypothesis predicts contrasting patterns of genetic

variation for farming versus non-farming populations (Renfrew 2003). It is widely

accepted that the expansions of Bantu-speaking farmers had a remarkable effect on

genetic variation in Africa, particularly the paternally-inherited Y chromosome.

However, less is known of the impact that linguistic affiliation had on shaping genetic

variation in other African groups. In this manuscript, we provide a detailed

phylolinguistic analysis of Y chromosome variation in Africa and propose that languages

played an important role in shaping genetic diversity in non-Bantu as well as Bantu

populations.

94

Subjects and Methods

Samples surveyed include individuals representing the four African language families. We divide populations into major seven groups: Khoisan, Afroasiatic, Nilo-

Saharan, and Niger-Congo non-Bantu and Niger-Congo Bantu, Pygmies, and Northern

Cameroonians (also see Discussion) (Table 1). Most of the data represented in this phylogeographic manuscript were previously reported using different analytical methods

(Wood et al. 2005). Here we report new data for an additional 198 individuals: Kenya

(57), Sudan (39), Nigeria (7), Ivory Coast (11), Gambia (21), Ghana (26), and South

African Bantu (34) (Figure 1). These individuals were typed for the same 50 markers as the 1122 individuals surveyed in Wood et al. 2005 (Table 1). SNPs and indels were typed using a hierarchical protocol (see Wood et al. 2005 and references therein) and we used the YCC terminology for naming NRY lineages (YCC 2002). We examined diversity values within each linguistic family by pooling all individuals within a group calculating Nei’s (1987) heterozygosity.

Results

Phylogenetic analysis of the 50 Y biallelic markers used in this analysis yielded

36 African haplogroups (Table 1; Figure 3). Nearly all of these haplogrous, 98.1%, fall into 5 haplogroups: A (7.1%), B (10.7%), E (69.6%), J (5.5%), and R (5.2%).

95

Haplogroup A is closest to the root of the tree and is comprised of four lineages. The

most ancestral lineage (A-M91*), described for the first time in this analysis, was present

in three Bakola Pygmies (9.1%) from southern Cameroon. A2 and A3 are found at high

frequency in the Khoisan (47.7%). A2 includes three lineages that share the M6, M14,

P3, and P4 mutations (Jobling and Tyler-Smith 2003), two of which were examined

here: A-P3* and A-P28. Both of these lineages are exclusive to the Khoisan. Within A3

(defined by M32), we examined A-M32*, A-M51 and A-M13. A-M51 is found mostly

in the Khoisan (22.2%). A-M13 is primarily found in East African Nilo-Saharans

(22.7%), Ethiopian Afroasiatic speaking Amhara (16.7%) and Oromo (11.1%), and

Sudanese (12.8%). A-M13 is also found at appreciable frequencies in the Mandara

(Afroasiatic Chadic) and Tupuri (non-Bantu Niger-Congo) of Northern Cameroon

(14.3% and 22.2% respectively). A-M32* was found in a single Semitic-speaking

Ethiopian.

Haplogroup B chromosomes are most frequent in Pygmies. We found that all but

two of our haplogroup B chromosomes were part of the B2 clade which contains the B2a

(defined by M150) and B2b (defined by 50f2) subclades. We examined two major

haplogroups within B2b: B-P7 and B-P6. B-50f2* chromosomes (i.e., those with the

ancestral state at P6 and P7) are found at their highest frequencies in Pygmies (9.3%),

particularly among the Mbuti (21.3%). The B-P7 lineage is shared between Pygmies

(27.9%) and Khoisan (4.4%) and is only found in 2 individuals elsewhere. B-P6 is

exclusive to the Khoisan (8.9%). Within B2a we examined B-M152 which yielded two

haplogroups within this clade: B-M150* and B-M152. B-M150* chromosomes are also

96

present at their highest frequencies in Pygmies (4.7%). B-M152 is the most common B

lineage and is widely distributed, especially among some southern Bantu populations.

However, it is absent in our Khoisan sample. Haplogroup B chromosomes that are ancestral at all downstream markers surveyed here (i.e., B-M60*) are found in only two

Gambians; but have also been reported in a single Mossi from Burkina Faso and two

Bamileke from southern Cameroon (Cruciani et al. 2002). The ancestral haplogroups A and B are exclusive to Africa which supports an African origin of human Y chromosomes

(Hammer et al. 1998; Underhill et al. 2000).

Haplogroup E chromosomes (i.e., defined by SRY4064) are overwhelmingly the

most common in this study (69.6%) and fall within three major sub-clades: E1, E2, and

E3. Nearly half of the individuals in our study (49.4%) fall into the sub-clade E3a which is defined by the P1 mutation. We examined a single major subclade of E-P1 (E3a7), defined by the M191 mutation. Niger-Congo speakers have the highest frequency of E-

P1* chromosomes (46.1%) with a higher frequency in West African non-Bantu- (58.9%)

than Bantu- (37.7%) speakers. E-P1* chromosomes are found at lower frequencies in

most other populations. The largest proportion of E-M191 chromosomes are also found

in the Niger-Congo group (25.1%), particularly in Bantu speakers (32.2%) and Pygmies

(36.4%). The remaining E-M191 chromosomes are found scattered throughout sub-

Saharan Africa. Notably, the E-P1 mutation is absent in Afroasiatic Ethiopians

(Amhara, Oromo, Mixed South Semitic), Sudan, Tunisia, most Northern Cameroonian

populations, and the Tsumkwe San. The E3b sub-clade contains 11.0% of individuals in this study and includes E-M35*, E-M78, and E-M81 lineages (we did not examine M123

97

and M281). The distribution of E-M35* chromosomes resembles that of haplogroup

A3b, with high frequencies in some East African populations and the Khoisan-speaking

!Kung (12.5%). E-M78 is concentrated in Afroasiatics (29.5%) and is absent in the

Khoisan sample. E-M81 chromosomes are primarily restricted to our Tunisian sample

(35.7%). Chromosomes with the derived state at the P2 marker and the ancestral states at

P1 and M35 (E-P2*) are found primarily in Ethiopian Cushitic (11.1%) and Ethiopian

Amhara, Oromo, and South Semitic (5.6% -11.1%). E-P2* chromosomes have been observed in Burkina Faso, Cameroon, and at appreciable frequencies in Ethiopians (10.4

- 18.2%) (Semino et al. 2002). The E2 sub-clade contains E-M75*, E-M41, and E-M54

lineages and accounts for 5.0% of our sample. Haplogroup E-M41 is restricted to the

region directly west of Lake Victoria (Alur, Hema, and Gandan populations) while E-

M54 is widely distributed and at high frequencies in some southern African populations.

E-M75* chromosomes are found in only six widely dispersed individuals ranging from the Gambia to South Africa. The E1 haplogroup (defined by M33) is virtually restricted to Niger-Congo West Africans and is present at particularly high frequencies in the

Dogon of Mali (45.5%).

Haplogroup J chromosomes are found at high frequencies in the Sudan (59.0%)

decreasing in frequency as one moves eastward into Ethiopia (25.5%), northward to

Egypt (22.8%), and westward into Tunisia (46.4%) and (41%) (Rosser et al.

2000). While African haplogroup R chromosomes are generally quite rare, R-P25*

chromosomes are found at remarkably high frequencies in northern Cameroon (60.7-

94.7%). The remaining haplogroups (K, F*, I, and G) account for 1.8% of the

98

individuals in our dataset; half of these are haplogroup K-M70. All K-M70 lineages are

found in Afroasiatic speakers or in East African populations where linguistic affiliations

could not be assigned.

Discussion

Genetic and linguistic variation are shaped by similar processes such as population splitting events, size and range expansions, and the merging of populations via

admixture (Renfrew 1992). If languages and genes changed simultaneously and at

similar rates, patterns of linguistic and genetic variation will concord. Contrasting patterns of linguistic and genetic variation will result when populations experience a history of population expansion versus subdivision (Renfrew 2003). Spread-zone regions occur when an incoming language(s) displaces existing languages as the result of a

dispersal process. In contrast, mosaic-zone regions occur when a long period of

linguistic stability exists after the initial colonization often prior to the introduction of

farming (Renfrew 2003). The Niger-Congo (specifically Bantu) and Afroasiatic

language families are examples of spread zones where linguistic diversity is low over a

large area and farming dispersals occurred less than 10 ky ago. The Khoisan and Nilo-

Saharan language families, on the other hand, represent the mosaic-zone process which is

characterized by greater linguistic diversity that traces to more ancient times (> 12 kya)

(Renfrew 2003). This ‘linguistic zone’ hypothesis predicts that, if languages and genes

moved together, populations that speak spread-zone languages will possess lower genetic

diversity and the presence of recently-arisen genetic lineages over a large geographic

area. Populations speaking mosaic-zone languages, on the other hand, will have greater

99

genetic diversity of more ancient genetic lineages within a limited geographic region.

We found that Y chromosome diversity values are greatest within the Nilo-Saharan and

Khoisan language families and the lowest within Niger-Congo speakers (Figure 4).

Additionally, the Khoisan carry some of the most ancient NRY lineages (e.g. A2) and the

Niger-Congo speakers possess a very high frequency of the derived E3a lineage. Thus,

Y chromosome variation is consistent with the linguistic zone hypothesis in Africa. In the following sections, we discuss the impact that the dispersal of farmers has had on the genetic diversity within African language groups.

Niger Congo

The Niger-Congo language family has lower Y chromosome diversity than any other African language family which is somewhat surprising given the deep linguistic diversity. This language family, specifically the Bantu speakers, has also had an enormous impact on continental patterns of Y chromosome variation (Wood et al. 2005).

This can be explained by the remarkably high frequencies (>50%) of the E3a lineage that have been associated with the massive migrations of Bantu-speaking farmers < 4000 years ago (Figure 3) (Passarino et al 1998; Scozzari et al. 1999; Hammer et al. 2001;

Underhill et al. 2001, Cruciani et al. 2002, Wood et al. 2005, this study). Notably, age estimate of the E3a lineage (6,200 ± 1,800 years old (Hammer and Zegura 2002; data not shown), accords well with linguistic and archeological dates of the earliest stage of Bantu populations, roughly 5 ky ago. Additionally, the Bantu E3a lineage is five times more recent than the ancient (>35,000 years) haplogroups A and B common among Khoisan

100

and Pygmies. Thus, the extremely high frequency of the E3a lineage and low Y

chromosome diversity within Bantu speakers, the recent origin of E3a lineage, coupled with the low Y-linked STR diversity associated with this haplotype support the conclusion that some Y chromosome lineages expanded recently and rapidly across

Africa with the Bantu expansions {Cruciani, 2002 #13}{Scozzari, 1999 #63}{Thomas,

2000 #90}.

Haplogroup E3a is comprised of many lineages {Underhill, 2001 #27}; two of

which are E-P1* and E-M191 surveyed here. A marked difference in the frequency of

these two haplotypes has been observed in Bantu versus non-Bantu populations. Non-

Bantu Niger-Congo groups possess nearly four times more E-P1* chromosomes than E-

M191; whereas, in Bantu populations these frequencies of these haplotypes are more

similar (Figure 3). These differences are likely due to the E-M191 lineage originating

closer to the Bantu homeland than E-P1* (Cruciani et al. 2002). It is also possible that E-

P1* increased in non-Bantu Niger-Congo groups, perhaps via sociocultural practices such as polygyny or paternal population replacement leading to the very low levels of diversity in this group (Figure 4). Some linguists support an east/west division of Bantu languages, which resulted from two distinct migrations: the eastern branch spread east from the homeland north of the rainforest and the western branch moved southward along rivers through the rainforest (Guthrie 1963; Oliver 1966; Vansina 1995). However, there are no clear differences between the SNP Y chromosome haplotypes found in eastern versus western Bantu Y chromosomes. More intense sampling and a more detailed

101

characterization of SNP and/or STR lineages within the E3a lineage is needed to better understand the nature of the Bantu expansions.

Several other haplotypes are found at appreciable in either non-Bantu Niger-

Congo speakers or Bantu speakers but not both. E-M33 is nearly restricted to west Africa suggesting that this lineage arose after Bantu diverged from other Niger-Congo groups.

B-M152 and E-M54 are found in Bantu populations (6.5% and 8.6%, respectively) but rarely in non-Bantu Niger-Congo (0% and 0.9%, respectively). We suggest that B-M152, as well as E-M54 (Cruciani et al. 2002), are haplotypes that were present in autochthonous populations and recently entered Bantu populations.

Khoisan & Pygmies

The Khoisan language family is believed to be the oldest African language family

(>20,000 years) and consists of three deep branches (Figure 2). The best-known branch

(South African) consists of a number of languages still spoken in the Kalahari Desert and surrounding regions and includes the !Kung San hunter-gatherers and the Nama

Khoekhoe, who adopted pastoralism between 2000 and 2500 years ago (Ehret 1982). In our sample, the South African Khoisan carry the majority of Haplogroup A and B chromosomes and harbor some of the oldest known Y chromosome lineages (refs).

Haplogroup A lineages are concentrated in the South African Khoisan (e.g. A-P3*, A-

P28, A-M51) but are also found at appreciable frequencies in the Nilo-Saharans and some

Afroasiatic populations (e.g. A- M13). The antiquity of haplogroup A (42 kya + 23 kya)

(Hammer and Zegura 2002) coupled with its wide distribution suggests that the M91

102

mutation may have arisen before the origin of the Nilo-Saharan, Afroasiatic and Khoisan

language families. The M6, M14, P2, P4, P28, M51 mutations may then have occurred in

the ancestors of the South African Khoisan and M13 arose in the Northeastern Nilo-

Saharans and/or Afroasiatics. Alternatively, these mutations were once shared between

Khoisan and other groups and were subsequently lost. Both scenarios are consistent with

the hypothesis that there was once a dispersed Khoisan-speaking population occupying

the Great Rift Valley (Passarino et al. 1998, Semino et al. 2002, this study) but may also suggest that this network of populations contributed Y chromosomes to populations of other language families.

Although the majority of Khoisan lineages are of Haplogroup A and B, there are appreciable frequencies of Haplogroup E3a lineages. Assuming that E3a lineages were recently introduced to Khoisan, we can use these lineages to infer levels of admixture between the Khoisan and Bantu speakers. Frequencies of E3a differ substantially among the different Khoisan groups: the Dama carry a very high proportion of these chromosomes (55.5%) as do the !Kung/Sekele (31.3%) and the Nama (18.2%), whereas the Tsumkwe San are completely lacking these haplotypes. The Tsumkwe San have very high frequencies of the A-P3* and A-P28 lineages which are characterized by an extremely long branch suggesting that these the Tsumkwe San may have been isolated from other Khoisan as well as Bantu groups. Thus, there appears to be no paternal Bantu admixture with the Tsumkwe San while over half of the Dama Y- chromosomes are of

Bantu origin. This assumes that the Dama are of Khoisan origin but it is also possible that the Dama are of Bantu origin and very recently adopted a Khoisan language and

103

acquired Khoisan-specific Y chromosome hapolotypes (e.g. A-P28) through admixture

(Spurdle and Jenkins 1992). The European R1b9 (R-M269) lineage is found at

appreciable frequencies in the Herero (8.3%), the Nama (9.1%), and the Dama (5.6%) but

is nearly absent from all other African populations. This is likely due to the introduction of Y chromosome to Southern Africa with the recent European colonization (~ 500

years) of the region.

The Khoisan speaking linguistic isolates, the Sandawe and Hadza (Figure 2),

reside in Tanzania and are believed to be descended from a once more widely distributed

Khoisan-speaking network of population (Tobias 1964; Ambrose 1982; Nurse et al. 1985;

Ehret 1998). If Under this hypothesis, we may expect the Hadza and Sandawe to carry

lineages also found in South African Khoisan. A study of the Hadza (Knight et al. 2003)

did not find the Khoisan-specific Haplogroup A chromosomes but did find a remarkably

high frequency (52%) of the B2b lineage (i.e. B-P6 and B-P7) which are commonly

found in both Pygmies (27.9%) and Khoisan (13.3%). These authors suggest that the

presence of B2b in the Hadza support a model whereby Khoisan click consonants have

been retained over tens of thousands of years. However, some scholars contest the

inclusion of Hadza in the Khoisan family because linguistic similarities are very limited

and the few commonalities found may be due to chance or a shared history before the

origin of the major language families (Guldemann and Vossen 2000). For example, some

Khoisan lexicons in the Southern Cushitic languages (Afroasiatic family) of Kenya and

Tanzania and in the Ik language (Nilo-Saharan) language of Uganda which may have

been borrowed from now extinct Khoisan languages. Although it remains unclear

104

whether the Hadza and Sandawe are the direct descendants of Khoisan populations, the

occurrence of Khoisan lexicons in many disparate language groups and the wide

geographic distribution of Haplogroup A chromosomes support the former existence of

other deep branches of the Khoisan family in southern and eastern Africa (Ehret 1980;

Ehret in press).

Pygmy populations are equatorial forest foragers, resemble one another

phenotypically, and are believed to have formerly spoken several distinct languages of

unknown affiliations (Cavalli-Sforza 1986; Cavalli-Sforza et al. 1994; Klieman 2003). A

central question surrounding modern Pygmy populations is whether they are the

descendants of ancient peoples who lived in central Africa prior to the arrival of

agriculturalists (Kleinman 2003). The most ancestral chromosomes found on the Y

chromosomes tree A-M91* are found in three Bakola Pygmies. Pygmies also carry the

highest frequencies of Haplogroup B chromosomes that are members of an ancient clade

that is estimated to be ~36,800 ± 13,600 years old (Hammer and Zegura 2002) supporting

the idea that Pygmies are descended from an ancient population. Several authors have

suggested that the shared traits of the Pygmies and Khoisan (that are different from all

other Africans) may indicate a common ancient ancestry although the evidence remains weak (Vigilant et al. 1989; Chen et al. 2000). In our study, B2b haplotypes are found in both Pygmies (27.9%) and Khoisan (13.3%) but are nearly absent from all other African populations. We suggest that the very high frequency (52%) of the B2b lineage and anomalous linguistic classification of the Hadza may suggest an ancient connection among Khoisan, Pygmies and Hadza. The B2b haplotype can be further divided into the

105

B-50f2*, B-P6, and B-P7. The ancestral B-50f2* is primarily found in Pygmies (9.3%), particularly Mbuti (21.3%), B-P6 is found only in Khoisan (8.9%), and B-P7 is shared between Pygmies (27.9%) and Khoisan (4.4%). An examination of these haplotypes in the Hadza and Sandawe may shed light on the ancient history of the Khoisan and Pygmy populations.

Four of the B haplogroups (B-M182*, B-M150*, B-50f2*, and B-P7) that are present at high frequencies in Pygmy populations (4.7 - 27.9%) are found at very low frequencies in Bantu-speakers (<1%). If we assume that these B haplotypes were not present at high frequencies in the original Bantu migrants, then gene flow of Pygmy Y chromosomes into Bantu populations occurred infrequently. This is similar to the pattern observed with haplogroup A chromosomes found at high frequencies in Khoisan (see above) suggesting that male gene flow from hunter-gatherers to farming populations was rare. In contrast, the appreciable frequencies of the Bantu haplotypes E-P1* and E-M191 haplogroups in the Pygmies (45.7%) and Khoisan (24.4%) suggests that the reverse— paternal Bantu gene flow into the Khoisan and Pygmy populations—was common.

Afroasiatic

Afroasiatic populations are characterized by a relatively high frequency of E-M78

(29.5%), J-12f2 (26.1%), and E-M81 (8.0%) and the near absence of E-P1* (1.7%) and

E-M191 (0.6%) (Figure 3). Haplogroup E-M78, dating to > 20ky ago, has been associated with the Neolithic expansions (Hammer et al. 1998; Behar et al. 2004) and is found at its highest frequencies in Cushitic-speakers including Somalians (77.6% - 100%)

106

(Sanchez et al. 2005, Cruciani et al. 2004) and the Oromo from Kenya (71.4%) and

Ethiopia (22.2%, 32.0%) (this study, Cruciani et al. 2004). E-M78 is also found at lower

frequencies in Semitic speakers including the Amhara from Ethiopia (8.8%, 33.3%)

(Cruciani et al. 2004, this study) and a mixed south Semitic group from Ethiopia (35.0%).

E-M78 is also found at appreciable frequencies in the Nilo-Saharan speaking Maasai

(15.4%) and Nilo-Saharans from Kenya (11.1%). This suggests that E-M78 may have arisen very early in the history of African languages, possibly before divergence

Afroasiatic and Nilo-Saharan languages and probably before the divergence of Cushitic

and Omotic from other Afroasiatic groups. Microsatellite analysis indicates that the

oldest E-M78 chromosomes (δ) date to 14.7 + 2.7 ky and are geographically restricted in

Africa to the east (particularly x language group) and may have spread during the first

migration(s) of E-M78 lineages from eastern Africa into northern Africa and the Middle

East (Cruciani et al. 2004). The other old E-M78 lineage (γ) dates to 9.6 + 3.3 ky and is

found at the highest frequencies in Cushitic-speakers: the Borana from Kenya (71.4%),

the Oromo from Ethiopia (32.0%), and Somalia (52.2%) (Cruciani et al. 2004). Thus, the

E-M78 mutation most likely arose in Africa and the most ancestral E-M78 lineages have

risen to high frequencies in east African Afroasiatic speakers.

Like E-M78, Haplogroup J has also been associated with the Neolithic demic

diffusion of farmers into Europe (Rosser et al. 2000; Semino et al. 2000), South Asia

(Quintana Murci et al. 1999), and possibly North Africa (Bosch et al. 2001). In contrast

to E-M78, SNP diversity and STR diversity within Haplogroup J is higher in Asia,

consistent a Near Eastern origin for this haplogroup with later migrations bring it to East

107

Africa (Passarino et al. 1998; Semino et al. 2000; Al Zahery et al. 2003; Cinnioglu et al.

2004; Semino et al. 2004). Two major J lineages show different geographic distributions:

J-M172 is at highest frequencies in Europe and the Middle whereas J-M267only one of five lineages is found in Africa (J-M267*) (Semino et al. 2004). A subset of J-M267* chromosomes, defined by the single-banded YCAIIa22-YCAIIb22 motif, is found at

>70% in the Middle East and >90% in North Africa while the European and East African

J-M267* lineages are more diverse and are believed to have diffused into North Africa with the recent Arab incursions (Semino et al. 2004). The Haplogroup J chromosomes in Europe, and to a lesser extent East Africa, are more heterogenous with respect to the

YCAIIa22-YCAIIb22 motif and may be associated with earlier Neolithic migrations

(Semino et al. 2004).

Within Africa, the E-M81 haplogroup shows an opposite clinal gradient to J-12f2 and E-M78 (Semino et al. 2004) with the highest frequencies found in in northwest Africa (65 – 80%) (Cruciani et al. 2004). It has been suggested that this haplotype may be associated with migrations of the Berber branch of Afroasiatic languages (Cruciani et al. 2004). The expansion of Berber-speaking populations occurred ~4-5 ky ago is well documented although the reasons for this expansion are unclear (Ehret 2002). The high frequency of the E-M81 haplotype in the northwest

African populations is accompanied by a low diversity of STR diversity. This is consistent with the origin of Berber languages in the northeast > 9 ky ago and later westward migrations where the increase in frequency of E-M81 is the result of expansion(s) from a smaller founding population into uninhabited landscape (Arredi et al.

108

2004). The low frequency of E-M81 in the Wolof (5.9%) and Mandinka (2.6%) of West

Africa may be due to the once larger geographic distribution of Berber-speaking peoples

or to more recent gene flow possibly through individuals traveling the trans-Saharan trade

routes.

The controversial question of where the Afroasiatic languages originated is

difficult to address using these data. In the simplest case scenario, if all of the Afroasiatic

Y chromosome haplotypes trace to Africa, we might suspect that the Afroasiatic

languages also arose in Africa. Likewise, an Asian origin of Y lineages would lend

support to the Asian-origin hypothese. The pattern observed is more complex with some

Afroasiatic lineages originating in Africa (E-M81, E-M78) and other tracing to Asia (J-

P12f). It is likely that the E-M78 lineage originated in Africa, migrated to Asia, and possibly came back to Africa in populations that were also carrying the J-P12f lineage. It is impossible to know, however, if these migrants that left Africa spoke the earliest

Afroasiatic languages or if they spoke an unknown language that latter gave rise to

Afroasiatic language after arriving in Asia. Unfortunately, this may be a controversy

that can not be resolved using genetic data.

Nilo-Saharan

The poor sampling of Nilo-Saharans in this study and others makes it difficult to

draw conclusions about paternal patterns of diversity associated with this language

family. However, our data indicate that Nilo-Saharan Nilotics share a number of

haplogroups with Afroasiatics (e.g., A-M13 and E-M78) but are lacking the J-12f2

109

haplogroup, supporting a recent introduction of this lineage to Africa. The hypothesis

that Nilo-Saharan speakers formed a continuous set of populations across the Sahel prior to the Arab intrusions would be supported if Nilo-Saharans from the western edges of the range and central portions cluster with Nilotic speakers from eastern Africa The only study that has examined Y chromosomes from west African Nilo-Saharans (the Songhai- speaking Dendi of Benin) suggest that this population is genetically most similar to neighboring populations (Scozzari et al. 1999), indicating extensive genetic admixture and/or language borrowing in the western range of this language family.

Northern Cameroon and Lake Chad

The Sahara Desert and the Sahelian regions south of the desert were considerably wetter than they are today (Wendorf and Schild 1980). From the early to mid-Holcene, the region west of the Nile River valley had many small ponds, lakes and wadis which extended westward toward the giant lake known as Lake Mega-Chad. Today, the region directly south of Lake Chad encompasses the linguistically diverse northern

Cameroonians populations. N. Cameroonians carry a very high frequency of the R-P25* lineage that has been associated with a back-to-Africa migration within the past 2-8-ky

(Cruciani et al. 2002). Another notable haplogroup found in northern Cameroon is A-

M13 (also see Cruciani et al. 2002) which is found primarily in East Africa in both

Afroasiatic and Nilo-Saharan speakers. We suggest that A-M13 may have been widely distributed in the Sahel from East to Central Africa and has decreased in frequency with the introduction of other haplogroups. The presence of extremely divergent haplogroups

110

(R and A) coupled with the high degree of linguistic heterogeneity in northern Cameroon

indicate a unique history of population formation and language acquisition in the region

bordering Lake Chad. We propose that the presence of fresh water lakes and streams in

the Sahara and Sahel during the past 9,000 years acted both as refugia for ancient

haplotypes and as a crossroads for population carrying derived haplotypes. The higher

population sizes and population movements facilitated by riverine or lacustrine regions

likely facilitated increased gene flow and/or language borrowing which may account for

the great similarities among northern Cameroonians despite their linguistic diversity.

Conclusions

Speakers of the Niger-Congo and Afroasiatic languages carry distinct Y chromosome types that differ from each other and from the Khoisan and Nilo-Saharan speakers.

Niger-Congo Bantu populations are characterized by very high frequency (~50%) of a homogeneous set of derived lineages, supporting the hypothesis Bantu populations recently and rapidly expanded across Africa. Non-Bantu Niger-Congo populations have lower Y chromosome diversity than the Bantu speakers, which is surprising given the degree of deep linguistic diversity observed within this language family. Afroasiatic speakers possess a markedly different complement of Y chromosomes than other language families and these lineages show high correspondence with migrational paths of agriculturalists moving across northern Africa. The Khoisan and Nilo-Saharan language families, together with the Pygmies, carry a high frequency of Y chromosomes found in other groups, indicating that these populations have experienced extensive admixture

111

and/or language borrowing. However, these groups retain a low frequency of ancient lineages suggesting that genetic replacement is not complete.

112

Figure 1. Maps of Africa. a) The present-day physical landscape of Africa. b) The

probable areas of the initial domestication of indigenous African crops (from Phillipson

1993). c) The hypothesized distribution of early speakers of the African populations prior

to 3000 years ago. The circled regions indicate the proposed place of origin of each of

the language families (from Ehret 2003). d) The present-day distribution of the four

African linguistic families constructed using Greenberg’s (1963) classifications and

further refined with data from the ethnologue (http://www.ethnologue.com/). The

approximate location of 40 populations typed for Y chromosome markers in Wood et al.

(2005) and the 7 populations reported here are indicated. The numbers correspond to populations listed in Table 1. Three shades of gray on map refer to the distribution of language families: Khoisan (diagonal lines), Afroasiatic (light gray), Niger-Congo

(medium gray), and Nilo-Saharan (dark gray). The location of Pygmy and Bantu populations are also indicated.

Figure 2. Phylolinguistic trees of sub-levels within each African language family. All branching events are believed to have occurred prior to 7000 before present.

Figure 3. Maximum-parsimony tree of 50 Y chromosome biallelic markers. The root of the tree is denoted by an arrow. Major clades (i.e., A-R) are labeled with large capital letters. Subclades labels (e.g., A3b) are indicated to the left of the branches. Mutation names are given along the branches. The length of each branch is not proportional to the number of mutations or the age of the mutation. Only the names of the 36 haplogroups

113

observed in the present study are shown to the right of the branches. Haplogroup frequencies observed in five language groups on the far right. To show the unique composition of haplotypes within Pygmy and Northern Cameroonian populations, frequencies for these groups are shown in a separate column.

Figure 4: Nei’s heterozygosity (H) observed within five linguistic groups, Pygmy populations, and Northern Cameroonian populations.

114

References

Ambrose S (1982) Archaeology and Linguistic Reconstruction of History in East Africa.

In: Ehret C, Posnansky M (eds) The Archaeological and Linguistic Reconstruction of

African History. University of California Press, Berkeley, pp 104-147

Barbujani G, Pilastro A, De Domenico S, Renfrew C (1994) Genetic variation in North

Africa and Eurasia: Neolithic demic diffusion vs. Paleolithic colonisation. Am. J.

Phys. Anthropol. 95:137-54

Bellwood P (2004) The origins of Afroasiatic - Response. Science 306:1681

Cavalli-Sforza LL (1986) African Pygmies. Academic Press, Orlando, FL

Ambrose S (1982) Archaeology and Linguistic Reconstruction of History in East Africa.

In: Ehret C, Posnansky M (eds) The Archaeological and Linguistic

Reconstruction of African History. University of California Press, Berkeley, pp

104-147

Barbujani G, Pilastro A, De Domenico S, Renfrew C (1994) Genetic variation in North

Africa and Eurasia: Neolithic demic diffusion vs. Paleolithic colonisation. Am J

Phys Anthropol 95:137-154

Cavalli-Sforza LL (1986) African Pygmies. Academic Press, Orlando, FL

Cavalli-Sforza LL, Menozzi P, Piazza A (1993) Demic expansions and human evolution.

Science 259:639-646

Cavalli-Sforza LL, Menozzi P, Piazza A (1994) The History and Geography of Human

Genes. Princeton University Press, Princeton

115

Chen YS, Olckers A, Schurr TG, Kogelnik AM, Huoponen K, Wallace DC (2000)

mtDNA variation in the South African Kung and Khwe - and their genetic

relationships to other African populations. Am J Hum Genet 66:1362-1383

David N (1982) Prehistory and Historical Linguistics in Central Africa: Points of

Contact. In: Ehret C, Posnansky M (eds) The Archaeological and Linguistic

Reconstruction of African History. University of California Press, Berkeley, pp

57-65

Diakonoff I (1998) The earliest semitic society - Linguistic data (On the formation and

evolution of Common-Semitic language organization in the Early Neolithic

cultures of the Fertile-Crescent). Journal of Semitic Studies 43:209-219

Diamond J, Bellwood P (2003) Farmers and their languages: the first expansions. Science

300:597-603

Ehret C (1979) On the Antiquity of Agriculture in Ethiopia. Journal of African History

20:161-177

Ehret C (1982) Linguistic inferences about early Bantu history. In: Posnansky CEaM (ed)

The Archaeological and Linguistic Reconstruction of African History. Univ of

California Press, Berkeley, pp 57-65

Ehret C (1998) An African Classical Age: Eastern and Southern Africa in World History,

1000 B.C. to A.D. 400. University Press of Virginia, Charlottesville

Ehret C (2001) Sudanic Civilization. In: Adas M (ed) Agricultural and Pastoral Societies

in Ancient and Classical History. Temple University Press, Philadelphia

116

Ehret C (2002) The Civilizations of Africa: A History to 1800. The University Press of

Virginia

Ehret C (2003) Language Family Expansions: Broadening our Understanding of Cause

from an African Perspective. In: Bellwood PaCRe (ed) Examining the

Farming/Language Dispersal Hypothesis. McDonald Institute for Archaeological

Research, Cambridge, pp 135-150

Excoffier L (2002) Human demographic history: refining the recent African origin

model. Curr Opin Genet Dev 12:675-682

Fleming H (1974) Omotic as an Afroasiatic Family. Studies in African Linguistics Suppl.

5:81-94

Greenberg J (1955) Studies in African Linguistic Classification. Compass Press, New

Haven

Greenberg J (1964) Historical Inferences from Linguistic REsearch in Sub-Saharan

Africa. In: Butler J (ed) Boston University Papers in African History I, pp 1-15

Greenberg JH (1963) The Languages of Africa. The Hague: Mouton.

Hammer MF, Karafet TM, Redd AJ, Jarjanazi H, Santachiara-Benerecetti S, Soodyall H,

Zegura SL (2001) Hierarchical patterns of global human Y-chromosome

diversity. Molecular Biology and Evolution 18:1189-1203

Hayward R (2000) Afroasiatic. In: Nurse BHaD (ed) African Languages: An

Introduction. Cambridge University Press, Cambridge, pp 74-98

Klieman K (2003) "The Pygmies Were Our Compass": Bantu and Batwa in the History

of West Central Africa, Early Times to c. 1900 C.E. Heinemann, Portsmouth, NH

117

McCall D (1998) The Afroasiatic language phylum: African in origin, or Asian? Current

Anthropology 39:139-144

Militarev A (2003) The Prehistory of a Dispersal: the Proto-Afrasian (Afroasiatic). In:

Bellwood PaCRe (ed) Examining the Farming/Language Dispersal Hypothesis.

McDonald Institute for Archaeological Research, Cambridge, pp 135-150

Nettle D (1998) Explaining global patterns of language diversity. J Anthropol Archaeol

17:354-374

Nurse G, Weiner J, Jenkins T (1985) The peoples of southern Africa and their affinities.

Research Monographs on Human Population Biology. 3. Clarendon Press, Oxford

Rosser ZH, Zerjal T, Hurles ME, Adojaan M, Alavantic D, Amorim A, Amos W, et al.

(2000) Y-chromosomal diversity in Europe is clinal and influenced primarily by

geography, rather than by language. Am J Hum Genet 67:1526-1543

Scozzari R, Cruciani F, Santolamazza P, Malaspina P, Torroni A, Sellitto D, Arredi B,

Destro-Bisol G, De Stefano G, Rickards O, Martinez-Labarga C, Modiano D,

Biondi G, Moral P, Olckers A, Wallace DC, Novelletto A (1999) Combined use

of biallelic and microsatellite Y-chromosome polymorphisms to infer affinities

among African populations. Am J Hum Genet 65:829-846

Semino O, Santachiara-Benerecetti AS, Falaschi F, Sforza LLC, Underhill PA (2002)

Ethiopians and Khoisan share the deepest clades of the human Y-chromosome

phylogeny. Am J Hum Genet 70:265-268

Sutton J (1974) The Aquatic Civilization of Middle Africa. Journal of African History

15:527-546

118

Thomas MG, Parfitt T, Weiss DA, Skorecki K, Wilson JF, le Roux M, Bradman N,

Goldstein DB (2000) Y chromosomes traveling south: the cohen modal haplotype

and the origins of the Lemba--the "Black Jews of Southern Africa". Am J Hum

Genet 66:674-686

Tobias P (1964) Bushman Hunter-Gatherers: A Study in Human Ecology. Junk, The

Hague

Vansina J (1990) Paths in the Rainforest. University of Wisconsin Press, Madison

Wendorf F, Schild R (1998) Nabta Playa and its Role. Journal of Anthropological

Archaeology 17:97-123

Williamson K, Blench R (2000) Niger-Congo. In: Nurse BHaD (ed) African Languages:

An Introduction. Cambridge University Press, Cambridge, pp 11-42

119

Figure1

a b s tain oun s M Atla

Sahara Desert Nile River Bulrush millet

Guinea Rice Lake Chad Ethiopia So rg hum Te f f Sahel Highlands Niger River Benue River Ensete Ya m & Noog Equatorial Rainforest Fonio Finger Equator millet Lake Victoria Congo River

Kalahari Desert

C d

Eg yptian Berber

A F AFRO-ASIATIC R O - Sudanic agripastoralists A Cushitic S NILO- Lake ? IA aquatic tradition Megachad T SAHARAN aquatic I NILO-SAHARAN C Dogon West Atlantic Chadic Koman Omotic O NIGER-CONGO Saharan NG -CO ER Mande Nilotic NIG Kwa Bantu

Pygmy Pygmy

N ntu A Ba S OI H K

N A IS O H K

120

Figure 2

Khoisan Nilo-Saharan ‘Northern’ (!Kung) Nilotic South African ‘Central’ (Khoekhoe) Songay Saharan ‘Southern’ Sudanic Sandawe Koman Hadza ?

Bantu Afroasiatic Semitic Niger-Congo Kwa Northern Egyptian Gur-Adamawa Berber Dogon Chadic Atlantic Cushitic Mande Omotic Kordofanian

121 tu n a tu -B n on a n B n n o o o c ra g g ti a n n ro n ia h o o s e a s a -C ie m is a -S r r-C m a o e e h fro ilo ig ig yg . C K A N N N P N 90 176 44 219 395 129 69 * A- M 9 1* 2 . 3 M31 A- M 3 1 1 . 4 1 Nor thern Cameroon M6 M1 4 P3 P 4 * A- P3 * 1 4 .4 (Lake Chad) A M9 1 2 P28 b A- P2 8 1 1 .1 Nilo-Saharans A-M1 3 * A-M32* 0.6 B-M150* M3 2 3 1 M51 A- M 5 1 2 2 .2 1. 8 B-M152 b A-M13 2 M13 A-M13 4.5 22.7 1.0 0.8 8.7 E-M78 B-M150* Afroasiatic * B- M 6 0* 0 . 5 B-M152 R-P25* E-P1* * B-M182* 1 .1 4.7 E-M191E-M41 B M6 0 * B-M150* 2.3 1.8 0. 3 4.7 1.4 a M150 E- M78 1 M152 B-M152 1.7 6.8 6. 3 2.3 7.2 M182 2 B- 50 f 2* 0. 3 9 . 3 * J-12f2 50f 2 1 P6 b B- P6 8 .9 E-M81 4 P7 B-P7 4.4 0.6 27.9 RP S4Y C M1 74 D * E- SR Y 4064*1.10.6* 1.1 0.6 1.5 M33 YAP 1 E- M 3 3 0. 6 14 . 6 * E- M 7 5* 1 .1 0 . 5 0. 5 a M41 2 M7 5 SR Y SRY E- M 4 1 1 3. 6 2. 3 1 0831a 4064 M5 4 b E- M 5 4 1 .1 0.9 8.6 2.3 2.9 E-M33 E * E- P2 * 2. 8 1 . 4 P1 * E-P1* 14.4 1.7 11.4 58.9 37.7 9.3 4.3 a M1 91 E- M191 3 P2 7 E-M191 10 .0 0.6 13.6 15.1 32. 2 36.4 1.4 * E-M35* 6.7 8.5 20.5 1.4 4.8 E-P1* P9 M7 8 b M3 5 1 E-M78 29.5 9.1 0.9 0. 3 1.4 M8 1 2 E- M 8 1 8. 0 1 . 4 Non- Bantu Nige r-Con go * F-P14* 1.1 0. 5 P201 G- M 2 0 1 1. 1 B-M152 M52 G E-M54 P19 H P14 I-P19 0.6 0. 3 12f2a I J J-1 2f 2 1 .1 2 6.1 0. 3 K- M 9 * 1. 1 0 . 5 B-P7 F * E-M191 M70 A-P3* 2 K- M 7 0 5. 1 E-P1* E-P1 M20 E-M19 1 Four B E-M191 A-P28 A-M51 M4 L L ineag es M9 LLY22g M Bantu B-P6 E-P1* B-P7 K M175 N O Pygmy Khoisan

P27 Q * R-M2 07 * 0. 6 M2 0 1 * R-M1 73 * 1. 1 SRY P M7 3 a 10831b R 1 R-S R Y10818b10818b 0. 3 b P25 R-P25* 2.8 0.5 0. 8 72.5 9 M2 69 R-M2 69 2 .2 0. 6 0 . 5 0. 5

122 Figure 4

0.90

0.80

0.70

h 0.60

0.50

0.40

0.30 Pygmy Khoisan Bantu Afroasiatic non-Bantu NiloSaharan N.Cameroon Niger-Congo Niger-Congo