LINGUISTIC AND GENETIC RELATIONSHIPS IN NORTHERN CAMEROON

Brett C. Haberstick1

Erin Shay2

Eric Johnston2

Gary L. Stetler1

John K. Hewitt1

Andrew Smolen1

Zygmunt Frajzyngier2

1 Institute for Behavioral Genetics, University of Colorado, Boulder, CO 80309-0447

2 Department of Linguistics, University of Colorado, Boulder, CO 80309-0295

The authors wish to thank the Butcher Foundation for their support of the Northern

Cameroon Language and Genes Project [NCLGP] as well as Brad Pemberton and Taylor

Roy for their technical assistance related to this project.

Introduction

The goal of this study is to explore the correlation between the relationships among languages and the genetic relationships of the populations that speak them. The narrower goal is to examine whether speakers of languages from the same family, or from the same branch of a given family, exhibit a closer genetic relationship than speakers of languages belonging to different language groups or subgroups. For this purpose, genotypes for 28 autosomal genetic markers were determined from samples obtained from about 30 speakers of each of six different languages belonging to two different language families spoken in Northern

Cameroon. Genetic relationships established using the genetic data were correlated with

established relationships among languages spoken by those who provided samples.

Background

Several prior studies have sought correlations between linguistic classification and

genetic distance among the language populations of Cameroon. Like the present study,

these studies take for granted the established linguistic classifications. Because the

methods of genetic sampling and analysis differ, the results of these studies are not always

comparable. There have been very few studies dedicated to the Chadic-language

populations of Cameroon.

In two studies, Spedini and colleagues [1999, 2001] analyzed the distribution of ten

protein genetic polymorphisms in eighteen populations belonging to three linguistic families

represented in Cameroon, namely Afroasiatic; Nilo-Saharan; and the West-Atlantic,

Adamawa eastern, and Benué-Congo branches of Niger-Kordofanian. The Afroasiatic family

is represented by the Chadic languages and by Shua Arabic. Among the Chadic languages,

Spedini et al. have examined Daba, Giziga (Guiziga, in their spelling), Mafa, Mada, Uldeme,

and Podoko (Podowko, in their spelling), belonging to the Central Chadic branch, and Masa

(Massa in their spelling), which belongs to the Masa branch, as per the Newman 1977 classification. Spedini et al. postulate a partial correlation between linguistic distance and genetic distance, concluding that ‘the language-family relationship between populations contributes more than their geographic location to the genetic differentiation among Chadic

speakers (but not among Niger-Kordofanian)’ (Spedini et al. 1999: 156).

Černý et al. [2004] examined mitochondrial DNA sequences for Hdi (which they call

Hide), Kotoko, Mafa (all Chadic languages of the Central branch), and Masa (Masa branch).

The data are compared with published findings for other populations in Africa. The authors conclude that speakers of the four Chadic languages in their study are more closely related to populations in East Africa than to populations in , pointing out that such similarities may be due to prehistoric migrations or to more recent interactions between the populations.

Linguistic Relationships

A language family is a group of languages thought to be descended from a common

ancestor. The members of a language family may be grouped into branches and sub-

branches whose members are thought to be more closely related to one another than to

members of other branches or sub-branches.

The current study is based on genetic data gathered from speakers of six different

language groups [five from the Chadic family, one from the Niger-Congo family] in a

relatively small area of Northern Cameroon. Within the Chadic family, four languages [Gidar,

Mina, Hdi, and Mafa] belong to the Central (also called the Biu Mandara) branch while one

language [Peve] belongs to the Masa branch. From the Niger-Congo language group, we

sampled speakers of Mambay. Previous research has established the linguistic relationships

among these languages using the standard comparative method [Newman (1977) for the

Chadic grouping, Boyd (1989) for the Niger-Congo grouping]. For the purposes of the

current study, we take these relationships as given.

Basis for language group selection

In order to maximize the detection of common population genetic structure between groups, language groups were selected first by language family and then by geography. For example, within the Central branch of the Chadic language family, Hdi and Mafa were sampled because they are spoken within the immediate vicinity of each other; thus these languages are close both linguistically and geographically. Gidar and Mina language groups, also of the Central branch, are also spoken in the immediate vicinity of one another, though each belongs to a different sub-branch within the Central Chadic language family grouping.

A comparison of the genetic relationships between Hdi and Mafa on the one hand and Gidar

and Mina on the other hand may reveal whether linguistic distance, geographical proximity

or both are reflected in the genetic structuring of these respective populations.

Peve, which belongs to the Masa branch of the Chadic family, was chosen because it

is only distantly related to the other four Chadic languages selected for participation.

Mambay was chosen because it belongs to the Niger-Congo family and thus is not

linguistically related to any other languages in the study [Boyd, 1989]. Because Mambay is

spoken in the geographical area close to where Peve is spoken, including these two

language groups allows an examination of the genetic relatedness when two groups are

geographically proximate and linguistically unrelated. Comparisons between Peve and the

other Chadic languages examined here allow an examination of the genetic relatedness

when two groups are linguistically related but geographically remote. Taken together, the

six language groups selected for participation allowed us to ask whether the

complete absence of linguistic relatedness is also accompanied by population genetic divergence.

One further consideration in selecting the language groups was that several of the

investigators have previously published work on the Hdi, Gidar, and Mina languages

[Frajzyngier, in press; Frajzyngier et al, 2005; Frajzyngier and Shay, 2002] and have developed working familiarity with the chosen groups. Furthermore, the investigators were able to employ a speaker of Peve who is interested in studying his own language and who assisted in the collection of the genetic samples used in the current study.

Social relationships and historical interactions

The language populations from which genetic samples were obtained have had varying degrees of interaction over time. Hdi is a relatively small language, with estimates varying between 15,000 and 30,000 speakers. Mafa, the largest language group selected for study here, numbers more than 100,000 speakers, occupies a large area of the extreme

Northern Province of Cameroon, and surrounds the Hdi population on three sides. There

has been much commercial trading and intermarriage between the Hdi and Mafa

populations, with Mafa women sometimes marrying into the Hdi community and Hdi women

marrying into the Mafa community. The primary factors determining intermarriage between

these two populations are economic rather than cultural [i.e., there is no linguistic exogamy

requirement].

Some Mina and Gidar settlements are separated by as little as 15 kilometers, but the

historical and cultural centers of the Gidar and Mina populations are a considerable distance

apart [see #39 and #61 on Figure 1]. The Mina population, whose speakers number roughly

11,000, was a dominant military force in the area during the 1800s. There are more Gidar

speakers than Mina speakers, with estimates ranging from 40,000 to 70,000 [Ethnologue].

According to Podlewski [1965], there has been a considerable degree of intermarriage or

admixture between Gidar speakers and other populations [not specified by Podlewski].

Peve, the smallest language group selected for our study, is spoken by

approximately 5,720 speakers and is a dialect of Zime, a language spoken by over 100,000

speakers in Cameroon and nearby Chad. The Peve settlement borders the area where

Mundang [#56 in Figure 1] is spoken. Although Peve and Mambay [spoken by roughly 8,000 speakers] belong to different language families, both languages border on the area where

Mundang, a member of the Niger-Congo family, is spoken. Peve speakers have had considerable linguistic and social contact with Mundang, and many Peve speakers speak

Mundang as a second language. The degree of mutual understanding between Mambay and Mundang has been estimated at about 47% [Hamm, 2002]. There are occasional intermarriages between Mambay and Peve speakers. In those situations, Peve men have settled among Mambay, with fewer cases of Mambay men settling among Peve.

Goals of the study

Our objective was to determine whether the degrees of relatedness determined by

our genetic data would reflect the degrees of relatedness as determined by the linguistic

relationships described above. Thus we hypothesized that speakers of languages belonging to

the Chadic language family would be more genetically similar to each other than to speakers

of the Niger-Congo language, Mambay. In particular, we hypothesized that Hdi, Mafa, and

Mina speakers would be more closely related to one another than any one of them is to

Gidar. We also hypothesized that the Hdi, Mafa, Mina, and Gidar groups would be more

closely related to each other than any of them to Peve.

Methods

Sample

Thirty native speakers from each of six villages or settlements representing six distinct

language groups [Gidar, Mina, Peve, Mambay, Hdi, Mafa] in Northern Cameroon were

asked to participate in the investigation. In each population, the leader of the village was

informed about the study and the need to obtain buccal samples from speakers who were

not biologically related. Although considerable effort was made to collect DNA samples only

from speakers not biologically related to one another, some samples may have inadvertently been collected from relatives. Participants in the study were paid approximately $6 (USD) each for their time.

For the Gidar, Mina, Peve, and Hdi language groups, the villages where samples were obtained were the main or central settlements. This was not true for the Mambay and

Mafa language groups, where samples were collected from villages that are peripheral to the main settlement but are geographical neighbors to other languages included in the current study. For example, the sampled Mafa and Mambay villages closely neighbor the Hdi and Peve language groups, respectively.

The genetic samples for Hdi speakers came from Tourou, the cultural and historical

center of the Hdi population. For the Mafa language group, DNA samples were collected

from residents living approximately 15 kilometers (km) away from Tourou. The samples for

Gidar came from the village of Lam, considered to be the religious, cultural, and political

center of the Gidar population. The samples collected from Mina speakers came from a

settlement about 150 km from the Gidar village of Lam and about 200 km from the Hdi and

Mafa sample populations. The Peve samples came from the village of Mayo-Lopé, roughly

200 km from the other Chadic populations [Gidar, Mina, Hdi, Mafa] surveyed and

about 45 to 50 km from the village of Bikallé, where our DNA samples were collected from

Mambay speakers.

DNA collection and genotyping

Buccal cell DNA was collected following signed informed consent. Buccal cells were

collected using two cotton-tipped swabs that were placed in 0.5 ml of lysis buffer (0.5%

SDS, 10 mM Tris-EDTA, pH 8.0), and stored at ambient conditions until shipped to the

Institute for Behavioral Genetics [IBG; University of Colorado, Boulder, Colorado, USA].

Upon arrival in the laboratory, two ml of lysis buffer were added to the samples. Genomic

DNA was extracted using proteinase K treatment followed by isopropyl alcohol and ethanol precipitations. DNA pellets were resuspended in Tris-EDTA buffer at a concentration of 10 ng/µl.

PCR

Three multiplex PCR reactions were used to amplify the 28 Short Tandem Repeat

(STR) loci analyzed. IBG-Hvar1 [Table 1a] is a 12-plex PCR that we have used extensively for zygosity determinations. IBG-Hvar2 [Table 1b] is a 14-plex PCR based on the CODIS

[Combined DNA Index System; Budowle et al, 1999] panel that has been modified for routine use by replacing four of the loci (D21S11, TH01, D18S51, and FGA) with D4S2639, D9S934,

D20S470, and D15S657, respectively. The replaced CODIS loci were analyzed in a separate five-plex PCR reaction [Table 1c]. In practice, the members of each of the four replacement pairs (e.g., D21S11 and D4S2639) can be substituted for one another in any combination if desired, since the four replacement loci were chosen for similar amplicon size, similar primer melting temperature, and lack of interactions among primers. The sources of primer sequences are given in the footnotes to the tables. The heterozygosity values for each locus

were obtained from the Invitrogen web site (http://mp.invitrogen.com/resources/apps/mappairs/). They are given for illustrative purposes only, and do not

represent the observed heterozygosities for these Cameroon populations, which may be

found accompanying the appropriate tables at the website for the Northern Cameroon

Language and Genetics Project [NCLGP; http://ibgwww.colorado.edu/genotyping_lab/NCLGP].

Each 20 µl PCR reaction contained 1 µl of DNA (1-10 ng), 4.4 µl of primer mix

composed as described in the tables, 2.0 µl of GoldSTAR buffer [Promega, Madison, WI]

and two units of AmpliTaq® Gold DNA polymerase [Applied Biosystems, Foster City, CA].

Cycling conditions were as given in Krenke et al [2002]. The alleles were separated and

detected using an ABI PRISM® 3100 Genetic Analyzer. All plates contained at least one control [CEPH 1347-02] and allelic ladders for the CODIS loci. Electropherograms were reviewed by two investigators independently, and discrepancies were resolved by reanalysis of the loci using single-plex reactions.

[Insert Tables 1a, 1b, 1c about here]

Statistical Analyses

Allele frequencies for the 28 STR markers were determined by direct counting as implemented in CONVERT [Glaubitz, 2004]. Tests for deviations from Hardy-Weinberg equilibrium (HWE) were conducted using Fisher’s Exact probability test [Guo and

Thompson, 1992] as implemented in the statistical package Arlequin [Version 3.01; Excoffier et al, 2005].
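As an illustration of the direct-counting step, the following Python sketch estimates allele frequencies and heterozygosity at a single locus. It is not the CONVERT or Arlequin implementation: the function names and the toy genotypes are hypothetical, and the sample-size correction 2n/(2n-1) is an assumed, commonly used estimator of gene diversity.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Estimate allele frequencies at one locus by direct counting.

    genotypes: list of (allele_a, allele_b) tuples, one per diploid individual.
    """
    counts = Counter(allele for pair in genotypes for allele in pair)
    total = sum(counts.values())  # two gene copies per individual
    return {allele: n / total for allele, n in counts.items()}


def observed_heterozygosity(genotypes):
    """Fraction of individuals carrying two different alleles."""
    return sum(a != b for a, b in genotypes) / len(genotypes)


def expected_heterozygosity(genotypes):
    """Gene diversity with an assumed small-sample correction:
    2n / (2n - 1) * (1 - sum of squared allele frequencies)."""
    freqs = allele_frequencies(genotypes)
    n_genes = 2 * len(genotypes)
    return n_genes / (n_genes - 1) * (1 - sum(p * p for p in freqs.values()))


# Hypothetical repeat-number genotypes for four individuals at one STR locus.
toy = [(12, 14), (12, 12), (14, 15), (12, 15)]
print(allele_frequencies(toy))
print(observed_heterozygosity(toy))
print(expected_heterozygosity(toy))
```

A marked gap between observed and expected heterozygosity at a locus is the kind of departure that a formal HWE test, such as the exact test cited above, is designed to evaluate.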

Unbiased genetic distances were calculated between all pairs of populations using

published methods [Reynolds, Weir, and Cockerham, 1983]. Analysis of molecular variance

(AMOVA) was conducted using data from all 28 STR markers as implemented in Arlequin.

AMOVA enables the partition of genetic variation at a locus or several loci into variation

between and within populations [Excoffier, 2001]. In addition, AMOVA can be used for

hierarchical analyses of genetic differences due to variation (1) between individuals within a

population, (2) between populations within groups, and (3) between groups. Significance of

AMOVA values was estimated using 10,100 permutations. The extent of population division

was measured using the fixation index or coancestry coefficient, Fst [Weir and Cockerham,

1984; Excoffier, Smouse, and Quattro, 1992; Weir, 1996]. Fst values range between 0

(no population subdivision, random mating, and no genetic divergence within a

population) and 1 (population isolation), with values below 0.05 suggesting little to no genetic

differentiation [Adeyemo, Chen, Chen, and Rotimi, 2005; Tishkoff and Williams, 2004]. The

significance of the genetic contribution to variation among populations within groups, Fsc,

and among groups, Fct, was examined using permutation procedures.

Genetic structuring of the six Northern Cameroon populations was also assessed using Structure [Version 2.1; Pritchard, Stephens, and Donnelly, 2000], a model-based clustering method for inferring populations using unlinked markers. The model assumes there are K populations, each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are probabilistically assigned to populations, or jointly to two or more populations if their genotypes indicate they are admixed. An advantage of Structure is that it can be applied to most types of genetic markers, including single nucleotide polymorphisms, microsatellites, and Restriction

Fragment-Length Polymorphisms (RFLPs). Genetic data from the 28 STR markers were

examined by specifying an admixture model that assumed correlated allele frequencies

between populations [Falush, Stephens, and Pritchard, 2003; Pritchard et al, 2000]. This

approach has been shown to improve clustering in populations that are weakly differentiated

(e.g. recently admixed) and also allows inference of the pattern of genetic drift or population

divergence [Falush et al, 2003; Rosenberg et al, 2005]. For each run the number of

populations, K, varied between 1 and 7, with a burn-in period of 100,000 and a run length of

500,000 iterations. At each value of K, runs were conducted multiple times to ensure the

consistency of the results.
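The permutation procedure used above to assess significance can be illustrated with a generic label-shuffling test. The sketch below is an assumption-laden simplification and not the AMOVA implementation in Arlequin: it uses a simple between-group sum of squares on one numeric value per individual, and the function names, toy data, and seed are hypothetical. Only the default of 10,100 permutations is taken from the text.

```python
import random


def between_group_ss(values, labels):
    """Sum over groups of n_g * (group mean - grand mean) squared."""
    grand = sum(values) / len(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return sum(len(vs) * (sum(vs) / len(vs) - grand) ** 2
               for vs in groups.values())


def permutation_p_value(values, labels, n_perm=10100, seed=0):
    """Estimate how often shuffled group labels give a statistic at least
    as large as the one observed with the real labels."""
    rng = random.Random(seed)
    observed = between_group_ss(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between value and group
        if between_group_ss(values, shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: two groups of four individuals, perfectly separated.
values = [0, 0, 0, 0, 1, 1, 1, 1]
labels = ["A"] * 4 + ["B"] * 4
print(permutation_p_value(values, labels))
```

In this toy case only 2 of the 70 possible label assignments reproduce the observed separation, so the estimate should fall near 2/70, or roughly 0.03. When groups do not differ at all, every shuffle matches the observed statistic and the estimate is 1.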

Results and Discussion

Tests of HWE for each of the 28 STR markers, polymorphic information content, and

power of discrimination statistics for each locus examined across all populations have been

previously reported [Haberstick et al, submitted]. Tests for deviations from HWE for each of

the 28 markers within the six Northern Cameroon language groups are detailed elsewhere

(http://ibgwww.colorado.edu/genotyping_lab/NCLGP). Number of observed alleles and most

common alleles at each of the 28 loci along with the average gene diversity over the loci

within the six language groups are provided in Tables 2a, 2b, 2c, and 2d. For most loci, the most common allele (MCA) was shared across two to six language groups. Variation in the number of observed alleles, together with their heterozygosity values (0.4286-0.9643), suggests that the 28 STR loci characterized in these populations are highly polymorphic and that within-population variability is high, with a mean genetic diversity of 0.7387 [Table 3].

[Insert Tables 2a, 2b, 2c, 2d about here]

[Insert Table 3 about here]

In order to characterize the extent of the genetic structure among the six language groups studied, Fst values were calculated for each language group separately. Population-specific

Fst values, which measure the degree of population differentiation, ranged between 0.0201

and 0.0216, with an overall Fst value of 0.0208, and suggested little population differentiation.

Analysis of molecular variance results indicated that when haplogroup frequencies were

analyzed without grouping the six populations, the highest fraction of variability was due to

within population differences (97.9%). Pairwise coancestry coefficients based on this model

were low and significant for four of the six language groups. As shown in Table 4, this

suggested that there is a slight but statistically significant differentiation between speakers of

the North-Central language groups (Gidar, Mina) and South-Central language groups

(Mambay, Peve). Though non-significant, the near zero coancestry coefficients for the two

North-Western language groups (Hdi, Mafa) suggest that there are few restrictions upon

mating between these two populations.
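The overall Fst and the within-population share of variance reported above are two views of the same quantity, which a short calculation makes explicit. The sketch assumes a one-level model, in which Fst is the among-population fraction of total variance, and uses the coancestry distance form D = -ln(1 - Fst) associated with Reynolds et al.; the function names are hypothetical and this is not output from Arlequin.

```python
import math


def within_population_share(fst):
    """Fraction of total variance found within populations, assuming a
    one-level model where Fst is the among-population fraction."""
    return 1.0 - fst


def reynolds_distance(fst):
    """Coancestry distance D = -ln(1 - Fst)."""
    return -math.log(1.0 - fst)


overall_fst = 0.0208  # overall value reported in the text
print(round(100 * within_population_share(overall_fst), 1))  # 97.9
print(reynolds_distance(overall_fst))
```

For values this small the distance is very close to Fst itself, which is consistent with reading both as indicating little differentiation under the 0.05 guideline cited in the Methods.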

[Insert Table 4 about here]

To understand better how the between-population variation was distributed, we conducted three hierarchical analyses of molecular variance. For these tests, language groups were arranged by (1) geography: North-Eastern (Mambay, Peve), North-Central

(Mina, Gidar), and North-Western (Mafa, Hdi); (2) linguistic family: Chadic (Gidar, Mafa, Hdi,

Mina, Peve) and Niger-Congo (Mambay); and (3) linguistic sub-family: Central/Biu Mandara

1 (Hdi, Mafa, Mina) and 2 (Gidar), Masa (Peve), and Niger-Congo (Mambay). For each of the three hierarchical models, the genetic variance between groups ranged from -0.95 to 0.95

and was similar when arranged along geographic and linguistic sub-family membership. For

none of the three tests was the percentage of variation accounted for by the between groups

parameter statistically significant.
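The three grouping schemes used in the hierarchical tests can be written out as a small data structure, which makes it easy to check that each scheme places every language group in exactly one higher-level group. The variable and function names below are hypothetical; the group memberships are those listed above.

```python
LANGUAGE_GROUPS = {"Gidar", "Mina", "Hdi", "Mafa", "Peve", "Mambay"}

GROUPINGS = {
    "geography": {
        "North-Eastern": ["Mambay", "Peve"],
        "North-Central": ["Mina", "Gidar"],
        "North-Western": ["Mafa", "Hdi"],
    },
    "linguistic family": {
        "Chadic": ["Gidar", "Mafa", "Hdi", "Mina", "Peve"],
        "Niger-Congo": ["Mambay"],
    },
    "linguistic sub-family": {
        "Central/Biu Mandara 1": ["Hdi", "Mafa", "Mina"],
        "Central/Biu Mandara 2": ["Gidar"],
        "Masa": ["Peve"],
        "Niger-Congo": ["Mambay"],
    },
}


def is_partition(scheme):
    """True if every language group appears in exactly one higher-level group."""
    members = [name for group in scheme.values() for name in group]
    return len(members) == len(set(members)) and set(members) == LANGUAGE_GROUPS


for name, scheme in GROUPINGS.items():
    print(name, len(scheme), is_partition(scheme))
```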

[Insert Table 5 about here]

Because the degree of genetic differentiation between these six groups was low, we

further investigated the genetic population structure using the Bayesian clustering algorithm

implemented in Structure. Results from a model specifying admixture with correlated allele

frequencies were indeterminate, as the likelihoods that individuals descended from two (K = 2;

log likelihood = -17,020.6), three (K = 3; -17,044.6), four (K = 4; -17,053.8), or five (K = 5;

-17,071.2) populations were similar. There was no support for all of the groups belonging to the same

genetic population (K = 1; log likelihood = -17,378.2) or belonging to six different populations

(K = 6; -17,238.6). This suggested some genetic structuring within these six language

groups. Table 6 describes the proportion of individuals assigned to each of the K clusters.

As shown, across all levels of K, no language group could be completely assigned to a single

cluster, which suggested a shared ancestry among these populations. Speakers from the

Gidar, Mina, and Peve groups evidenced higher rates of membership with each other than with Mambay and Hdi. The exception was Mafa, with whom individuals from each of the five other groups shared membership.

[Insert Table 6 about here]

Lastly, to put these results in context, we compared the observed allele frequencies with those previously reported from Cameroon samples. Five of the STR loci characterized here were also genotyped in the Bamileke of the western plateau and the Ewondo of the central-southern area of Cameroon [Destro-Bisol et al, 2000]. Table 7 provides the published allele frequencies and those of our sample. As shown, the number of observed

alleles for three loci (TH01, TPOX, vWA) was identical across all groups. For two loci

(D18S51 and D21S11), however, greater genetic diversity was observed in the Northern

Cameroon populations than in the Bamileke and Ewondo. In comparison with the Northern

Cameroon populations, all three groups shared the most common allele for two markers

(TH01 and D21S11) and completely differed for one (vWA). While interpretation is limited

without formal inclusion of the Destro-Bisol et al [2000] data, these results lend evidence to

the notion that there is a limited degree of genetic diversity among these 8 groups.

[Insert Table 7 about here]

Conclusions

In this report, the allelic diversity for 28 STR loci in six discrete language populations

of Northern Cameroon is reported, in many cases for the first time. A number of characteristics of

STR loci make them useful for the study of migration history, population substructure, and

controlling for the confounding effects of admixture in association-based studies of complex

traits [Barnholtz-Sloan et al, 2005; Jorde et al, 1997; Reddy et al, 2001; Reed and Tishkoff, 2006]. These features include a typically large range in allele sizes, high heterozygosity (Ho) values, relative abundance within the genome, and the relative ease with which STR loci can be characterized [Perez-Miranda et al, 2005]. Although autosomal STR loci are commonly used in studies of world-wide migration patterns [Rosenberg et al, 2002; Bastos-Rodrigues et al, 2006], and large numbers of markers have been characterized in a limited number of African populations, few studies have focused on Cameroon in general and Northern Cameroon specifically.

Based on previous linguistic study [Frajzyngier, in press; Frajzyngier et al, 2005;

Frajzyngier and Shay, 2002], we sought to elucidate the extent of genetic differentiation

among six language groups and determine the correlation between language group

membership and genetic substructure. The most significant inference that could be drawn

from these results was the lack of genetic differentiation among these populations, despite their being

highly differentiated linguistically. While small differences were detected between

populations, they were not along hypothesized lines. The fact that these populations have a

shared genetic substructure [alleles in one population were found in each of the remaining

five populations] suggests that these six groups descended from a common ancestor and

their divergence is somewhat recent.

In these samples, the within-population variation accounted for more than 97.0% of

the genetic diversity. This observation was consistent across hierarchical analyses based on

geography, language family, and linguistic sub-family. While not examined along linguistic

lines, similar estimates have been observed in previous studies based on microsatellite

markers [Adeyemo et al, 2005; Rosenberg et al, 2002] and insertion/deletion polymorphisms

[Bastos-Rodrigues et al, 2006]. One potential reason for this result could be that these six

groups live within a 300 kilometer radius of one another. Furthermore, there are no major

mountain ranges, bodies of water, or other landscape features that would prevent admixture. As described earlier, despite the linguistic differences, these groups have limited contact with one another in the form of trade and marriage partners.

Although the different methodological approaches employed here converge on the notion of limited population substructure among these groups, it is important to keep in mind a number of limitations. First, many of the STR loci characterized within language groups were out of HWE. While this suggests the effects of non-random mating, it also violates an assumption of the clustering algorithm implemented in Structure [Pritchard et al, 2000]. As such, the proportions of membership reported here may be biased. Second, while our choice

of markers overlapped with one previous study [Destro-Bisol et al, 2000] and extended the

available genotypes for larger population based studies, their number may not have been

sufficiently informative. However, the number of loci characterized here exceeds many

studies of other human populations around the world. Third, we sampled a total of 180

individuals (30 speakers from each of six groups), which may have limited our ability to detect

the extent of population genetic substructure. Samples of 200 – 500 individuals per group or

more have been shown to be helpful in determining “clusteredness” [Bamshad et al, 2003;

Rosenberg et al, 2005]. We hope that our future efforts to understand the degree of

relationship between the six language groups examined here and population genetic

structure will address many of these limitations and expand to include the use of

mitochondrial DNA and Y-chromosome haplotype information.

References

A.A. Adeyemo, G. Chen, Y. Chen, C. Rotimi, ‘Genetic structure in four West African population groups’. BMC Genetics 6 (2005), 1-9.

M.J. Bamshad, S. Wooding, W.S. Watkins, C.T. Ostler, M.A. Batzer, L.B. Jorde, ‘Human population genetic structure and inference of group membership’. American Journal of

Human Genetics 72 (2003), 578-589.

J.S. Barnholtz-Sloan, R. Chakraborty, T.A. Sellers, A.G. Schwartz, ‘Examining population stratification via individual ancestry estimates versus self-reported race.’ Cancer

Epidemiology, Biomarkers & Prevention 14 (2005) 1545-1551.

R. Boyd, In J. Bendor-Samuel and R.L. Hartell (eds.). The Niger-Congo languages: A classification and description of Africa’s largest language family. (Lanham, MD: University

Press of America, 1989), 178-215.

B. Budowle, T.R. Moretti, A.L. Baumstark, D.A. Defenbaugh, K.M. Keys, ‘Population data on the thirteen CODIS core short tandem repeat loci in African Americans, U.S. Caucasians,

Hispanics, Bahamians, Jamaicans, and Trinidadians.’ Journal of Forensic Sciences, 44

(1999), 1277-1286.

V. Černý, M. Hájek, R. Čmejla, J. Brůžek, R. Brdička, ‘mtDNA sequences of Chadic-speaking populations from northern Cameroon suggest their affinities with eastern Africa.’

Annals of Human Biology, 31 (2004) 554-569.

G. Destro-Bisol, I. Boschi, A. Caglià, S. Tofanelli, V. Pascali, G. Paoli, G. Spedini,

‘Microsatellite variation in Central Africa: an analysis of intrapopulation and interpopulation genetic diversity’. American Journal of Physical Anthropology 112 (2000) 319-337.

L. Excoffier, Analysis of population subdivision. Handbook of Statistical Genetics. Eds: D.J.

Balding, M. Bishop, C. Cannings (Chichester: John Wiley & Sons, 2001).

L. Excoffier, G. Laval, S. Schneider, ‘Arlequin (version 3.0): An integrated software package for population genetics data analysis,’ Evolutionary Bioinformatics, 1 (2005) 47-50.

D. Falush, M. Stephens, J.K. Pritchard, ‘Inference of population structure using multilocus

genotype data: Linked loci and correlated allele frequencies,’ Genetics, 164 (2003) 1567-

1587.

Z. Frajzyngier, A Grammar of Gidar. (Frankfurt: Peter Lang, in press).

Z. Frajzyngier, E. Johnston, A. Edwards, A Grammar of Mina. (Berlin/New York: Mouton de

Gruyter, 2005).

Z. Frajzyngier, E. Shay, A Grammar of Hdi. (Berlin/New York: Mouton de Gruyter, 2002).

S.W. Guo, E.A. Thompson, ‘Performing the exact test of Hardy-Weinberg proportion for

multiple alleles,’ Biometrics, 48 (1992) 361-372.

J.C. Glaubitz, ‘CONVERT: A user-friendly program to reformat diploid genotypic data for commonly used population genetic software packages,’ Molecular Ecology Notes, 4 (2004)

309-310.

B.C. Haberstick, G.L. Stetler, B. Pemberton, E. Johnston, J.K. Hewitt, Z. Frajzyngier, E.

Shay, A. Smolen, ‘Northern Cameroon population data on 28 STR loci,’ Journal of Forensic

Sciences, submitted.

C. Hamm, ‘A sociolinguistic survey of the Mambay language of Chad and Cameroon,’ SIL

Electronic Survey Reports SILESR, (2002) 39.

B.E. Krenke, A. Tereba, S.J. Anderson, E. Buel, S. Culhane, C.J. Finis, C.S. Tomsey, J.M.

Zachetti, A. Masibay, D.R. Rabbach, E.A. Amiott, C.J. Sprecher, ‘Validation of a 16-Locus

Fluorescent Multiplex System,’ Journal of Forensic Sciences, 47 (2002) 773-85.

M. Mizutani, T. Yamamoto, K. Torii, H. Kawase, T. Yoshimoto, R. Uchihi, M. Tanaka, K.

Tamaki, Y. Katsumata, ‘Analysis of 168 short tandem repeat loci in the Japanese

population, using a screening set for human genetic mapping,’ Journal of Human Genetics

46 (2001) 448-455.

P. Newman, ‘Chadic classification and reconstructions,’ Afroasiatic Linguistics, 51 (1977) 1-

42.

A.M. Podlewski, La dynamique des principales populations du Nord-Cameroun. (Yaoundé:

Institut de Recherches Scientifiques du Cameroun, 1965).

J.K. Pritchard, M. Stephens, P. Donnelly, ‘Inference of population structure using multilocus genotype data,’ Genetics, 155 (2000) 945-959.

F.A. Reed, S.A. Tishkoff, ‘African human diversity, origins and migrations,’ Current

Opinion in Genetics & Development, 16 (2006) 597-605.

J. Reynolds, B.S. Weir, C.C. Cockerham, ‘Estimation of the coancestry coefficient: basis for a short-term genetic distance,’ Genetics, 105 (1983) 767-779.

N.A. Rosenberg, S. Mahajan, S. Ramachandran, C. Zhao, J.K. Pritchard, M.W. Feldman,

‘Clines, clusters, and the effect of study design on the inference of human population

structure,’ PLoS Genetics, 6 (2005) 660-671.

G. Spedini, G. Destro-Bisol, S. Mondovi, L. Kaptué, L. Taglioli, G. Paoli, ‘The peopling of Sub-Saharan Africa: The case study of Cameroon,’ American Journal of Physical Anthropology, 110 (1999) 143-162.

G. Spedini, M. Stefano, G. Paoli, G. Destro-Bisol, ‘Biological and cultural contradictions? A reply to MacEachern,’ American Journal of Physical Anthropology, 114 (2001) 361-364.

S.A. Tishkoff, S.M. Williams, ‘Genetic analysis of African populations: Human evolution and complex disease,’ Nature Reviews Genetics, 3 (2004) 611-621.

A. Urquhart, .J. Oldroyd, C.P. Kimpton, P. Gill, ‘Highly discriminating heptaplex short tandem repeat PCR system for forensic identification,’ Biotechniques, 18 (1995) 116-121.

B.S. Weir, Genetic Data Analysis II: Methods for Discrete Population Genetic Data. (Sinauer Associates, Inc., Sunderland, MA, USA, 1996).

B.S. Weir, C.C. Cockerham, ‘Estimating F-statistics for the analysis of population structure,’ Evolution, 38 (1984) 1358-1370.

Table 1a. Primer sequences and concentrations for IBG-Hvar1 12-plex PCR

Locus (Het) †      Size range (bp)  Primer  Conc. (µM)  Primer sequence (5’ to 3’) and dye label ‡
Amelogenin 1       103-109          F       0.08        NED™-CCCTGGGCTCTGTAAAGAATAGTG
                                    R       0.08        ATCAGAGCTTAAACTGGGAAGCTG
D2S1384 (0.67)     121-165          F       0.60        NED™-AATAGAGGGCCCTTGCTTAA
                                    R       0.60        TTTGGGATAAAAGGTATTTTGC
D13S796 (0.77)     136-176          F       0.20        6FAM™-CATGGATGCAGAAT CACAG
                                    R       0.20        TCATCTCCCTGTTTGGTAGC
D1S679 (0.84)      136-176          F       0.60        HEX™-GCCATCAAGAAAACTAG ACTGC
                                    R       0.60        ACCATGGTACTCAGCAGTGC
D8S1119 (0.81)     170-200          F       1.40        NED™-TCAAAGCAGGTTACTCTCACG
                                    R       1.40        TAAATATGGGAAGGCAGCAG
D4S1627 (0.69)     177-201          F       0.30        6FAM™-AGCATTAGCATTTGTCCTGG
                                    R       0.30        GACTAACCTGACTCCCCCTC
D9S301 (0.75)      205-237          F       0.50        NE6FAM™-AGTTTTCATAACACAAAAGAGAACA
                                    R       0.50        ACCTAAATGTTCATCAAAAGAGG
D3S1766 (0.86)     200-228          F       0.75        HEX™-ACCACATGAGCCAATTCTGT
                                    R       0.75        ACCCAATTATGGTGTTGTTACC
D20S481 (0.81)     215-249          F       0.40        NED™-TGGGTTATGAGTGCACACAG
                                    R       0.40        AACAGCAAAAAGACACACAGC
D7S1808 (0.81)     252-280          F       0.50        6FAM™-CAGAACAAACAAATGGGGAG
                                    R       0.50        CCAAATAAGACTCAGGACGC
D15S652 (0.81)     282-312          F       1.40        NED™-GCAGCACTTGGCAAATACTC
                                    R       1.40        CATCACTCAAGGCTCAAGGT
D6S1277 (0.69)     278-322          F       0.60        HEX™-ACACTGCAGGGTAAGACAGC
                                    R       0.60        AAGACAGTGTCTAAGCTGTCACA

† (Het), heterozygosity values from http://mp.invitrogen.com/resources/apps/mappairs/.
‡ Primer sequences from the GDB Human Genome Database [www.gdb.org].
1 Primer sequences from [Krenke et al, 2002].

Table 1b. Primer sequences and concentrations for IBG-Hvar2 14-plex PCR

Locus (Het) †      Size range (bp)  Primer  Conc. (µM)  Primer sequence (5’ to 3’) and dye label ‡
Amelogenin         103-109          F       0.10        6FAM™-CCCTGGGCTCTGTAAAGAATAGTG
                                    R       0.10        ATCAGAGCTTAAACTGGGAAGCTG
D3S1358 (0.79)     101-147          F       0.20        HEX™-ACTGCAGTCCAATCTGGGT
                                    R       0.20        ATGAAATCAACAGAGGCTTGC
D5S818 (0.70)      119-155          F       0.25        GGTGATTTTCCTCTTTGGTATCC
                                    R       0.25        NED™-AGCCACAGTTTACAACATTTGTATCT
vWA (0.81)         122-182          F       0.18        6FAM™-CCCTAGTGGATGATAAGAATAATCAGTATG
                                    R       0.18        GGACAGATGATAAATACATAGGATGGATGG
D4S2639 1 (0.88)   152-192          F       0.25        HEX™-AAGGTTCCAGGACACATTCA
                                    R       0.25        CTTGAAAGCTCCATAATCATACG
D13S317 (0.79)     157-201          F       0.30        ATTACAGAAGTCTGGGATGTGGAGGA
                                    R       0.30        NED™-GGCAGCCCAAAAAGACAGA
D9S934 2 (0.56)    198-238          F       0.25        HEX™-TTTCCTAGTAGCTCAAGTAAAGAGG
                                    R       0.25        AGACTTGGACTGAATTACACTGC
D8S1179 (0.82)     203-251          F       0.50        ATTGCAACTTATATGTATTTTTGTATTTCATG
                                    R       0.50        FAM™-ACCAAATTGTGTTCATGAGTATAGTTTC
D7S820 (0.82)      211-251          F       0.60        NED™-ATGTTGGTCAGGCTGACTATG
                                    R       0.60        GATTCCACATTTATCCTCATTGAC
TPOX (0.64)        258-294          F       0.40        GCACAGAACAGGCACTTAGG
                                    R       0.40        6FAM™-CGCTCAAACGTGAGGTTG
D16S539 (0.75)     264-304          F       0.40        GGGGGTCTAAGAGCTTGTAAAAAG
                                    R       0.40        NED™-GTTTGTGTGTGCATCTGTAAGCATGTATC
D20S470 3 (0.94)   161-313          F       0.25        HEX™-CCTTGGGGGATATAGCCTAA
                                    R       0.25        TGAGTGACAGAGTGATACCATG
CSF1PO (0.72)      291-331          F       0.25        NED™-CCGGAGGTAAAGGTGTCTTAAAGT
                                    R       0.25        ATTTCCTGTGTCAGACCCTGTT
D15S657 4 (0.72)   332-60           F       0.25        6FAM™-TCTACATTGGACAGAAATGGG
                                    R       0.25        GATACACATTCTGATTCATGCG

† (Het), heterozygosity values from http://mp.invitrogen.com/resources/apps/mappairs/.
‡ Primer sequences from [Krenke et al, 2002].
1 Replaces TH01, primer sequences from the GDB Human Genome Database [www.gdb.org].
2 Replaces D21S11, primer sequences from the GDB Human Genome Database.
3 Replaces D18S51, primer sequences from the GDB Human Genome Database.
4 Replaces FGA, primer sequences from the GDB Human Genome Database.

Table 1c. Primer sequences and concentrations for IBG-Hvar3 5-plex PCR

Locus (Het) †      Size range (bp)  Primer  Conc. (µM)  Primer sequence (5’ to 3’) and dye label ‡
Amelogenin         103-109          F       0.10        6FAM™-CCCTGGGCTCTGTAAAGAATAGTG
                                    R       0.10        ATCAGAGCTTAAACTGGGAAGCTG
TH01 (0.77)        152-196          F       0.60        HEX™-GTGGGCTGAAAAGCTCCCGATTAT
                                    R       0.60        GTGATTCCCATTGGCCTGTTCCTC
D21S11 (0.84)      203-261          F       0.50        HEX™-ATATGTGAGTCAATTCCCCAAG
                                    R       0.50        TGTATTAGTCAATGTTCTCCAG
D18S51 1 (0.88)    262-342          F       0.25        HEX™-CAAACCCGACTACCAGCAAC
                                    R       0.25        GAGCCATGTTCATGCCACTG
FGA (0.86)         308-464          F       0.20        6FAM™-GGCTGCAGGGCATAACATTA
                                    R       0.20        ATTCTATGACTTTGCGCTTCAGGA

† (Het), heterozygosity values from http://mp.invitrogen.com/resources/apps/mappairs/.
‡ Primer sequences from the GDB Human Genome Database [www.gdb.org].
1 Primer sequences from Urquhart et al [1995].

Table 2a. Allele diversity at 7 of 28 STR loci describing the extent of variation within six language groups of Northern Cameroon.

            CSF1PO ²         TH01 ²           TPOX ²           D5S818 ²         D7S820 ²         D13S317 ²        D16S539 ²
Population  Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA
Gidar       7        12      6        7       7        11      7        13      9        11      7        11      7        9
Mina        7        12      6        7       7        8,9     7        11,13   9        11      8        11      6        11
Peve        8        11      5        7       7        9       7        12      7        11      6        11      7        11
Mambay      8        12      6        7       7        9       8        12      8        11      8        10      8        11
Hdi         7        12      6        7       7        8,9     7        12,13   9        11      9        10,11   7        11,12
Mafa        6        11      5        7       7        9       7        12      7        11      7        10      8        12

Note: ² , CODIS marker; Alleles, # of observed alleles; MCA, most common allele(s).

Table 2b. Allele diversity at 7 of 28 STR loci describing the extent of variation within six language groups of Northern Cameroon.

            FGA ²            vWA ²            D3S1358 ²        D18S51 ²         D21S11 ²         D8S1179 ²        D8S1119
Population  Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA
Gidar       14       21,23   7        16      6        16      8        17      12       28      8        14      9        5,9
Mina        12       23      6        16      5        17      12       17      12       28      9        13,15   8        9
Peve        11       23,24   7        15      4        16      11       17      11       28      5        14      7        9
Mambay      12       22      8        17      4        16      10       17      14       30,31   7        14      6        4
Hdi         16       22      6        15,16   5        16      8        16,18   12       30      6        14      9        4,5
Mafa        13       23.2    7        16,17   6        15,16   12       17      11       29      6        13      7        4

Note: ² , CODIS marker; Alleles, # of observed alleles; MCA, most common allele(s).

Table 2c. Allele diversity at 7 of 28 STR loci describing the extent of variation within six language groups of Northern Cameroon.

            D1S1679          D2S1384          D13S796          D4S1627          D4S2639          D9S301           D3S1766
Population  Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA
Gidar       11       8,9     5        8       7        8       6        8       9        6       10       7       10       6
Mina        9        8       7        8,9     9        5,7     8        8       9        5       9        7       4        6
Peve        8        7       5        8       9        7       7        6,7     7        6       9        7       7        6
Mambay      9        8       8        6       10       3,5     6        7,8     8        6       9        7       6        6
Hdi         11       8       9        6,7     7        7       7        6,7,8   9        6       9        7       6        5,6
Mafa        10       8       7        6       7        4,5     6        7,8     10       6       7        7,8     4        6,7

Note: Alleles, # of observed alleles; MCA, most common allele(s).

Table 2d. Allele diversity at 7 of 28 STR loci describing the extent of variation within six language groups of Northern Cameroon.

            D9S934           D20S481          D7S1808          D20S470          D6S1277          D15S652          D15S657
Population  Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA
Gidar       4        6       9        7       6        5       10       4,12    9        6       7        2       7        3
Mina        5        6       8        6,7     7        5       10       9       8        5       7        2,5     4        3
Peve        4        6       8        8       7        3,4,5   9        11      6        4       9        2       6        3
Mambay      5        6       8        3       8        6       9        9,10    8        6       9        4       5        3
Hdi         4        6       7        3       7        5,6     9        12      8        6       7        4,5     5        3
Mafa        6        5       6        5       6        6       10       9,11    8        6       8        2       5        3

Note: Alleles, # of observed alleles; MCA, most common allele(s).

Table 3. Genetic diversity calculated from 28 STR markers.

Population  Genetic diversity index (S.E.)
Gidar       0.7413 (0.386)
Mina        0.7500 (0.376)
Peve        0.7320 (0.377)
Mambay      0.7502 (0.376)
Hdi         0.7166 (0.381)
Mafa        0.7423 (0.377)

Note: S.E., standard error (+/-).

Table 4. Pairwise coancestry coefficients between populations and their significances. ²

Populations  1      2      3      4      5      6
Gidar               -      +      +      +      +
Mina         0.006         +      +      +      +
Peve         0.015  0.007         +      +      +
Mambay       0.034  0.039  0.031         -      -
Hdi          0.020  0.028  0.017  0.001         -
Mafa         0.035  0.041  0.033  0.003  0.002

² Significance of Fst values is shown above the leading diagonal. Population pairwise Fst values are shown below the diagonal. Genetic distances are based on 110 permutations.

Table 5. Analysis of Molecular Variance (AMOVA)

Grouping             Source of variation                 df    Sum of squares  Variance components  Percentage of variation
Geography            Between populations                 2     36.62           0.596                0.95
                     Between population within language  3     33.47           0.832                1.32
                     Within populations                  354   2181.85         6.163                97.73
Language             Between populations                 1     15.14           0.011                0.18
                     Between population within language  4     54.95           0.127                2.02
                     Within populations                  359   2181.85         6.302                97.81
Language sub-family  Between populations                 3     36.56           -0.059               -0.95
                     Between population within language  2     33.53           0.179                2.84
                     Within populations                  354   2181.85         6.163                98.10

Table 6. Proportion of membership in each of the six language groups for K = 2 to K = 6 assuming admixture and correlated allele frequencies.

Population  1      2      3      4      5      6

K = 2
Gidar       0.063  0.937
Mina        0.060  0.940
Peve        0.074  0.926
Mambay      0.853  0.147
Hdi         0.723  0.277
Mafa        0.843  0.157

K = 3
Gidar       0.065  0.382  0.554
Mina        0.053  0.250  0.697
Peve        0.069  0.313  0.619
Mambay      0.692  0.252  0.056
Hdi         0.600  0.248  0.152
Mafa        0.721  0.193  0.085

K = 4
Gidar       0.046  0.261  0.408  0.286
Mina        0.039  0.232  0.488  0.241
Peve        0.051  0.520  0.229  0.200
Mambay      0.680  0.053  0.060  0.207
Hdi         0.575  0.118  0.129  0.178
Mafa        0.704  0.067  0.077  0.152

K = 5
Gidar       0.259  0.387  0.038  0.037  0.279
Mina        0.224  0.481  0.039  0.027  0.229
Peve        0.513  0.212  0.049  0.046  0.180
Mambay      0.055  0.061  0.384  0.320  0.179
Hdi         0.107  0.122  0.259  0.360  0.152
Mafa        0.066  0.068  0.486  0.267  0.112

K = 6
Gidar       0.033  0.194  0.031  0.295  0.231  0.215
Mina        0.034  0.159  0.022  0.385  0.210  0.190
Peve        0.041  0.200  0.038  0.224  0.342  0.154
Mambay      0.374  0.089  0.305  0.045  0.063  0.125
Hdi         0.242  0.125  0.344  0.075  0.092  0.122
Mafa        0.473  0.083  0.246  0.052  0.062  0.083
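
As a reading aid for Table 6, the following minimal Python sketch assigns each language group to the cluster in which its membership proportion is highest, using only the K = 2 values copied from the table. It is not the analysis pipeline used in this study; the names K2_MEMBERSHIP, dominant_cluster, and group_by_dominant_cluster are hypothetical and introduced here for illustration only.

```python
# Membership proportions for K = 2, copied from Table 6
# (cluster 1, cluster 2). All names below are illustrative assumptions.
K2_MEMBERSHIP = {
    "Gidar": (0.063, 0.937),
    "Mina": (0.060, 0.940),
    "Peve": (0.074, 0.926),
    "Mambay": (0.853, 0.147),
    "Hdi": (0.723, 0.277),
    "Mafa": (0.843, 0.157),
}


def dominant_cluster(proportions):
    """Return the 1-based index of the cluster with the largest proportion."""
    best_index = max(range(len(proportions)), key=lambda i: proportions[i])
    return best_index + 1


def group_by_dominant_cluster(membership):
    """Map each cluster number to the populations whose largest share falls in it."""
    groups = {}
    for population, proportions in membership.items():
        groups.setdefault(dominant_cluster(proportions), []).append(population)
    return groups


if __name__ == "__main__":
    for cluster, populations in sorted(group_by_dominant_cluster(K2_MEMBERSHIP).items()):
        print(f"Cluster {cluster}: {', '.join(populations)}")
```

Read this way, the K = 2 rows place Gidar, Mina, and Peve in cluster 2 and Mambay, Hdi, and Mafa in cluster 1. The same function can be applied to the rows for higher K, where the largest share is often well below one half and the assignment is correspondingly weaker.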

Table 7. Allele diversity for 5 STR loci in Northern Cameroon, Bamileke, and Ewondo groups.

                TH01             TPOX             vWA              D18S51           D21S11
Population      Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA     Alleles  MCA
N. Cameroon ²   6        7       7        9       8        15,16   16       17      21       28
Bamileke ³      6        7       7        8       9        15,16   10       15,17   13       28
Ewondo ³        5        7       6        8       9        17      11       15      14       28

Note: ² , values pooled across the Gidar, Mina, Peve, Mambay, Hdi, and Mafa language groups examined here; ³ , reported in Destro-Bisol et al [2000]; Alleles, # of observed alleles; MCA, most common allele(s).

Figure Captions

Figure 1. Linguistic map of Northern Cameroon [Ethnologue].