
University of Groningen

Linguistic probes into human history

Manni, Franz

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below. Document Version Publisher's PDF, also known as Version of record

Publication date: 2017


Citation for published version (APA): Manni, F. (2017). Linguistic probes into human history. University of Groningen.

Copyright Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

The publication may also be distributed here under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license. More information can be found on the University of Groningen website: https://www.rug.nl/library/open-access/self-archiving-pure/taverne-amendment.


LINGUISTIC PROBES INTO HUMAN HISTORY

Franz Manni

The work in this thesis has been carried out under the Graduate School for Humanities (GSH) from the University of Groningen and the Center of Language and Cognition Groningen (CLCG).

Groningen Dissertations in Linguistics n° 162

Franz Manni

Linguistic Probes into Human History
ISBN: 978‐90‐367‐9871‐6 (print version)
ISBN: 978‐90‐367‐9872‐3 (electronic version)

© 2017, F. Manni

All rights are reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means without prior permission of the author.

Cover design: F. Manni © 2017


Linguistic Probes into Human History

PhD thesis

to obtain the degree of PhD at the University of Groningen on the authority of the Rector Magnificus Prof. E. Sterken and in accordance with the decision by the College of Deans.

This thesis will be defended in public on

Thursday 6 July 2017 at 16:15 hours

by

Franz Manni

born on 5 March 1973 in Ferrara, Italy

Supervisors
Prof. J. Nerbonne
Prof. S. Bahuchet

Assessment committee
Prof. M. Cysouw
Prof. M. Dunn
Prof. F. Zwarts


I’ ho tanti vocavoli nella mia lingua materna, ch’i’ m’ho piuttosto da doler del bene intendere le cose, che del mancamento delle parole, colle quali io possa bene espriemere il concetto della mente mia.

LEONARDO DA VINCI

IN STANDARD ENGLISH: I have so many words in my mother tongue that I worry more about not understanding well than about lacking the words to express my thoughts. [Translation by J. Nerbonne]

Other translations,

IN STANDARD DUTCH [T. by W. Heeringa]: In mijn taal zijn er zoveel woorden dat ik me er maar beter zorgen over kan maken dat ik goed begrijp wat er wordt gezegd, dan bang te zijn dat ik de woorden niet kan vinden om mijn gedachten goed te verwoorden.

IN STANDARD GERMAN [T. by H. Goebl]: In meiner Muttersprache habe ich dermaßen viele Wörter, so dass ich mir eher darüber den Kopf zerbrechen müsste, die Dinge an sich gut zu verstehen, als darüber, zu wenig Wörter zu haben, um das, was ich denke, gut ausdrücken zu können.

IN THE DIALECT OF THE CITY OF GRONINGEN [T. by W. Heeringa]: Mien moekes toal het zoʹn bult woorden dat ik mie der moar beter drok over moaken kin dat ik goud begriep wat of ter zegd wordt as baang te wezen dat ik de woorden nait vienden kin om mien gedachten goud oet te drukken.

IN STANDARD FRENCH [T. by P. Mennecier]: Jʹai tant de mots dans ma langue maternelle que je me soucie plutôt de bien entendre les choses que de chercher les mots par lesquels exprimer le plus profond de ma pensée.

IN THE DIALECT OF THE CITY OF FERRARA [T. by M. Leziroli and E. Rinaldi]: A gh’è acsì tant vucabol in tla miè lingua materna che piutòst am preocup ad ben intèndar i quei, chʹàm manca il paròl chʹim sèrav par riusir a dìr bèn quel chʹam frùla par la ment.

IN RURAL FRISIAN [T. by D. Drukker and H. Sijens]: Myn memmetaal hat saʹn soad wurden dat ik der mar better oer yn noed sitte kin dat ik goed begryp wat der sein wurdt, as dat ik bang wêze moat dat ik de wurden net fine kin om myn tinzen goed te ferwurdzjen.

IN FANG‐NTUMU [T. by R.S. Ollomo Hella]: ŋ́kɔ́bə́ wɔ̂ m óbə̄ lə̄ àgbì bíyɛ̂ . mà yɛ̀nà dàŋ bɛ́ɛ́ édzām dá dzôbàn mà yə̀m ná bífyɛ̄ mə́ ná byɔ́ èbàn bí bóó bóó mā ŋ́lō étē.

IN MEMORIAM LAURO MANNI (1939 – 2017)

Acknowledgments

My sincerest gratitude goes to the supervisors and to the members of the assessment committee for their support and guidance.

I also would like to acknowledge the Graduate School for Humanities (GSH) from the University of Groningen and the Center of Language and Cognition Groningen (CLCG) for having allowed this dissertation.

It is a pleasure to thank those that encouraged me, particularly Evelyne Heyer, Pierre Darlu, Hans Goebl, Marie‐Françoise Rombi and Philippe Mennecier. This dissertation is a collective enterprise: the names of many other scholars that provided help are listed at the beginning and at the end of each chapter. I thank them all.

But John Nerbonne and Wilbert Heeringa deserve a special expression of gratitude for a collaboration that lasted fifteen years and that eventually turned into friendship.

Let me say that I have been privileged to visit the Netherlands so often, and the city of Groningen in particular. My Dutch life has always been very enjoyable.

Finally, I would like to name my family, my mother Marilena and my daughter Clelia. I had planned to show the finished dissertation to my father Lauro, because he truly loved the Letters, but I was too late.

Dank u wel en tot ziens! (The only Dutch words I can easily pronounce).


Contents

1. GENERAL INTRODUCTION
1.1 Why genetics and linguistics?
1.1.1 Some thoughts about the emergence of the language faculty
1.1.2 The emergence of the language faculty and the peopling of the world from Africa
1.1.3 The interest of geneticists in the diversity of human languages
1.1.4 Towards a wider Anthropology
1.2 What family names tell about a population
1.2.1 Surnames and dialects
1.2.2 How to compare dialects and surnames?
1.3 How to assess the reliability of linguistic classifications?
1.3.1 Efron’s (1979) bootstrap
1.3.2 Felsenstein’s (1985) bootstrap
1.3.3 Application of resampling techniques to dialectology
1.3.3.1 Bootstrap consensus trees
1.3.3.2 Adoption of a cut‐off value
1.4 The fuel: Lexical databases
1.4.1 The number of items
1.4.2 The choice of the words
1.4.3 Swadesh wordlists
1.5 Outline of the dissertation
1.5.1 CHAPTER 2: Sprachraum and genetics
1.5.2 CHAPTER 3: Projecting Dialect Distances to Geography
1.5.3 CHAPTER 4: To What Extent are Surnames Words?
1.5.4 CHAPTER 5: Surname and linguistic structure of Spain
1.5.5 CHAPTER 6: Linguistic probes into the Bantu history of Gabon
1.5.6 CHAPTER 7: A Central‐Asian linguistic survey
1.5.7 CHAPTER 8: General conclusions and new prospects
References

2. SPRACHRAUM AND GENETICS
2.1 Genetic mapping
2.2 Boundaries in the genetic landscape
2.3 How to identify populations in the landscape, the sampling
2.4 Regional studies
2.5 Concluding remarks
References

3. PROJECTING DIALECT DISTANCES TO GEOGRAPHY: BOOTSTRAP CLUSTERING VS. NOISY CLUSTERING
3.1 Introduction
3.2 Background and motivation
3.2.1 Data
3.3 Bootstrapping clustering
3.4 Clustering with noise
3.5 Projecting to geography
3.6 Results
3.7 Conclusions
References

4. TO WHAT EXTENT ARE SURNAMES WORDS? COMPARING GEOGRAPHIC PATTERNS OF SURNAME AND DIALECT VARIATION IN THE NETHERLANDS
4.1 Introduction
4.1.1 Surnames
4.1.2 Dialects
4.2 Methodology and data
4.2.1 Data
4.2.1.1 Surnames
4.2.1.2 Dialects
4.2.2 Visualization of diversity
4.2.2.1 Multidimensional space: Principal component analysis
4.2.2.2 Geographic analysis: The Monmonier algorithm
4.2.2.2.1 The triangulation
4.2.2.2.2 The algorithm
4.2.2.2.3 Robustness of barriers
4.3 Results
4.3.1 Surnames
4.3.2 Dialects
4.4 Discussion
References

5. FOOTPRINTS OF MIDDLE AGES KINGDOMS ARE STILL VISIBLE IN THE CONTEMPORARY SURNAME STRUCTURE OF SPAIN
5.1 Introduction
5.2 Methods
5.2.1 The surname data
5.2.1.1 Surnames: From isonymy measures of inbreeding to distances
5.2.1.2 Test of robustness ‐ bootstrap
5.2.2 The Linguistic Atlas of the Iberian Peninsula
5.2.2.1 Reanalysis of the dialectometric matrix of linguistic similarity
5.3 Results
5.3.1 Surname diversity in Spain ‐ General statistics
5.3.2 Isonymy levels
5.3.3 Surname diversity in Spain ‐ clustering
5.3.4 Linguistic diversity in Spain
5.3.5 Mantel correlations
5.4 Discussion
5.4.1 Variability of Spanish surnames: Patterns of diversity
5.4.2 Variability of Spanish surnames: Patterns of isonymy
5.4.3 Linguistic diversity
References

6. LINGUISTIC PROBES INTO THE BANTU HISTORY OF GABON
6.1 Introduction
6.1.1 General background
6.1.1.1 Bantu linguistics: Classifications
6.1.1.2 How many major Bantu groups?
6.1.1.3 Proto Bantu homeland
6.1.1.4 Timeframe and dissemination of Bantu varieties
6.1.1.5 Archaeology and linguistics
6.1.1.6 The new synthesis
6.1.1.7 Early population genetics evidence for the Bantu expansion
6.1.2 Linguistic and genetic diversity of Bantu populations from Gabon
6.1.2.1 Origin of the project
6.1.2.2 Linguistic evidence and hypotheses
6.2 Methods
6.2.1 Linguistic datasets
6.2.1.1 Database 1: from Tanzania
6.2.1.1.1 Origin of the Tanzanian dataset
6.2.1.1.2 Mapping
6.2.1.2 Database 2: Atlas Linguistique du Gabon (ALGAB)
6.2.1.2.1 Transcription
6.2.1.2.2 Mapping the ALGAB
6.2.1.3 Database 3: Bastin et al. (1999)
6.2.2 Computation of linguistic distances
6.2.3 Population genetics sampling of Gabon and genetic markers used
6.2.3.1 Genetic markers used
6.2.3.1.1 Mitochondrial DNA
6.2.3.1.2 Y‐chromosome
6.2.3.1.3 Autosomal markers
6.2.3.2 Parentally transmitted markers to seize the recent history
6.3 Results
6.3.1 Linguistics
6.3.1.1 The Tanzanian experiment
6.3.1.1.1 The Levenshtein classification of Bantu languages from Tanzania
6.3.1.2 Classification of 52 Bantu languages from Gabon (ALGAB)
6.3.1.3 Classification of 64 Bantu languages from Gabon (Bastin et al. 1999)
6.3.2 Genetics
6.3.2.1 Mitochondrial diversity
6.3.2.2 Y‐chromosome diversity
6.3.2.3 Autosomal diversity
6.4 Discussion
6.4.1 Bantu dispersal in Gabon: Rainforest versus savannah corridors
6.4.2 Compatibility, Levenshtein classifications vs. other ones
6.4.3 The classification of the Bantu languages from Gabon
6.4.3.1 ALGAB
6.4.3.2 Bastin et al. (1999) / MRAC
6.4.4 Other sources of evidence
6.4.4.1 Population genetics
6.4.4.2 Music
6.4.5 General conclusions
6.4.5.1 The performance of the Levenshtein approach
6.4.5.2 Provisos
6.4.5.3 The Bantu peopling of Gabon
References

7. A CENTRAL ASIAN LANGUAGE SURVEY
7.1 Introduction
7.1.1 Background
7.2 Methodology
7.2.1 Selection of linguistic test sites
7.2.2 Linguistic inquiry
7.2.2.1 Classification of spoken varieties by the extended Swadesh list
7.2.3 Informants, protocol and linguistic database
7.2.4 Computational analysis
7.2.5 Matrix Generation
7.2.6 Relations among varieties
7.2.7 Loan word detection
7.3 Results
7.3.1 General sketch of phonetical variability
7.3.1.1 Turkic languages
7.3.1.2 Iranian languages
7.3.2 Measures of the linguistic variability
7.3.2.1 Matrix consistency
7.3.2.2 Representation of variability, the bootstrap clustering
7.3.2.3 Representation of variability, the Multidimensional Scaling clustering
7.3.2.4 Comparison of results using 100‐wd vs. 200‐wd Swadesh list
7.3.3 Loan word detection
7.3.4 Linguistic contact and loans
7.4 Discussion
7.4.1 Networks versus linguistic distances
7.4.2 Effect of loans on linguistic distances
7.4.3 Swadesh word list
7.4.4 Variationist aspects
7.4.4.1 Homogeneous or areally unstructured lexical diversity
7.4.4.2 Linguistic isolation and contact
7.4.4.3 Bilingualism
7.4.4.4 Kazakh and Karakalpak speakers
7.4.5 Perspectives of investigation
References

8. GENERAL CONCLUSIONS AND NEW PROSPECTS
8.1 The essence of the Levenshtein distance
8.1.1 The Levenshtein distance and the feature system
8.1.2 The Levenshtein distance measures intelligibility
8.1.3 The Levenshtein distance measures contact
8.1.4 The Levenshtein distance measures historical divergence
8.2 Current challenge: Going beyond geography
8.2.1 The spread of linguistic innovations
8.2.2 Levenshtein residual distances
8.2.2.1 The Netherlands
8.2.2.2 Tanzania
8.2.2.3 Gabon
8.3 The influence of migration on regional languages
8.3.1 The effect of linguistic contact
8.3.2 Extensive linguistic contact and demography
8.3.2.1 Dialect change in the Netherlands
8.3.2.2 Migrations inferred from surname data
8.3.2.3 Spanish surnames and internal migrations
8.3.2.4 Spanish migrations and regional languages
References

Summary of the dissertation (in English)

Summary of the dissertation (in Dutch): Nederlandse samenvatting

About the author

GRODIL – Groningen Dissertations in Linguistics


CHAPTER 1

This chapter is unpublished, please cite it as follows:

Manni F. 2017. General Introduction. In: Linguistic probes into human history (Chapter 1). PhD dissertation, Groningen dissertations in linguistics n° 162. ISBN 978‐90‐367‐9872‐3. Groningen: University of Groningen.

GENERAL INTRODUCTION

This dissertation includes five research articles published between 2006 and 2016 in peer‐reviewed publications (Manni et al. 2006, Nerbonne et al. 2007, Manni 2010, Rodriguez Diaz et al. 2016, Mennecier et al. 2016), as well as an extensive, unpublished report (CHAPTER 6, Linguistic probes into the Bantu history of Gabon) that summarizes 12 years of research. The chapters correspond to research conducted over a time span that is probably longer than is usual for a PhD thesis, because until recently I had no intention of obtaining a second PhD degree (the first, in population genetics, was awarded in 2001). That changed when, in 2013, I became a member of the Scientific Committee in charge of determining the contents of the new permanent exhibition of the Musée de l’Homme (Paris).1 I was responsible for the section presenting the linguistic diversity of the world. During this experience, the scholars I approached were surprised to learn that I did not already hold a PhD in linguistics. Their reaction did not surprise me, because I am well aware that academia involves sensitivity to rules ensuring that research meets the standards of the disciplines so that it might enjoy the recognition of peers. The formal credential of the degree indicates that the bearer has operated within the system and is familiar with its expectations. I think this is a good scheme and I would now like to redress my lack of scientific credibility in linguistics by submitting this work to a doctoral committee. My agenda is to further develop scientific inquiry at the interface of demography and linguistics, especially by linking the demographic aspects of speech communities to the sociolinguistic effects that demography influences. Sociolinguistics is about contact among groups, and so is population genetics.

Over the years I have interacted with many linguists. When I sought collaboration in research about surname and genetic variability in European populations, I found in PROFESSOR JOHN NERBONNE an excellent scientific partner because he had the answers to many of my questions and because he was willing to focus on new research problems that, at least initially, only tangentially involved his own scientific interests. He uses computational linguistic methods that parallel what is done in genetics, and the collaboration turned out to be beneficial for both of us. Over the years, I have also received the encouragement of PROFESSOR SERGE BAHUCHET, the director of the scientific department where I work (Hommes, Natures, Sociétés, National Museum of Natural History, Paris). He happens to be a linguist too and, when I asked him, he enthusiastically agreed to co‐supervise this doctoral work.

1 The museum was reopened to the public on the 15th of October, 2016.

My focus on languages and dialects developed outside the discipline of linguistics and, since the chapters concern research involving at least three sister disciplines (linguistics, genetics and demography), this introduction is meant to provide a broader perspective, and not only a summary and description of the chapters that follow. Here I address the rationale for multidisciplinary research (1.1 Why genetics and linguistics?), report on comparative research involving family names and dialects (1.2 What family names tell about a population?), and, finally, introduce methodological questions related to computational linguistic classifications that are relevant to cross‐comparisons (1.3 How to assess the reliability of linguistic classifications?; 1.4 Lexical databases). I end by providing the traditional outline of the dissertation chapters. The last chapter does more than discuss what I have learned from the different experiments and projects. It is meant to attract the attention of the reader by presenting some additional results, based on the datasets that are analysed in the different chapters, in order to link them together and suggest new methodological research directions. By assessing empirical evidence with novel approaches, I have tried to develop a conceptual frame that is larger than the scope of each chapter.

1.1 WHY GENETICS AND LINGUISTICS?

1.1.1 Some thoughts about the emergence of the language faculty in a social context

The cohesion of human societies relies on common beliefs and practices that are interrelated with the environment and the lifestyle. There are simple inferences concerning both the technical skills and communication repertoires needed to ensure the minimal viability of a (small) human group, but the harmonious development of larger societies, with respect to both population size and the geographical area occupied, needs a more complex organization in which the benefit of the group is paramount, perhaps to the detriment of the individual.

The size of the group seems to be the fundamental parameter determining when the shift from personal to common interest arises, leading to the concept of the minimum viable population size for a sustainable population. It has been estimated that a minimum of 150‐180 individuals is necessary (Moore 2003), a figure that matches observations of existing groups of hunter‐gatherer populations. If a minimum size is necessary for reasons related to survival during crises, epidemics and climatic adversity, and to avoid levels of consanguinity that are too high, the maximal size of a population certainly depends on its ability to communicate in order to maintain the cohesion of the society and its effective functioning. In this framework rules and taboos are seen as the necessary architecture of the social system. Intuitively, societies without the communication net that a language enables would rely on immediate personal contact for the organization of their social life.

This is the standpoint of Mark Turner (1998) in explaining the emergence of speech as related to its narrative function, that is, to its role in structuring human behaviours according to shared beliefs that emerge from stories that are told and repeated. In a different way, the French researcher in artificial intelligence Jean‐Louis Dessalles (2014a; 2014b) has suggested the argumentative function as the main advantage enabled by the language faculty. He developed and modelled mathematically several arguments showing that human communication did not emerge as a form of cooperation, but as the ability of the speaker to be relevant and display personal qualities, two crucial aspects for achieving influence and leading a group. The argumentative function of language makes it possible to resolve social conflicts with less physical violence and menace. Actually, the two theories are complementary and point to the possibility of developing larger societies through speech, societies that can maintain cohesion beyond frequent and direct personal contact and outside necessity.

While under a functionalist viewpoint languages would emerge just to improve the practical organization of life, there are exceptions showing that in small groups languages can be extremely simple, such as in the famous example of Pirahã, a language of Amazonia that seems to lack recursion2 and has a very limited set of words and phonemes (Everett 2005). Actually the Pirahã people are monolingual hunter‐gatherers, and their idiom is the only surviving variety of a language that went extinct. They live in a single small group of about 200 people; they all know each other, and they are all related. Direct observation and visual learning are able to substitute for the functions of a majority of the words. This is reminiscent of familiar jargon, which is also limited but sufficient for everyday life; if the group were larger, greater linguistic complexity would be expected. Rather than being used in an attempt to demonstrate that Chomskyan theories are wrong (Everett 2012), the case of Pirahã might be brought to bear from the perspective of the social benefit that a more complex language gives to the cohesion of a larger social group.

Following Chomsky, the quantification of rumour and gossip in natural speech is relevant to the role of language communication in sustaining large social groups whose members have irregular direct contact. A recent study (Beersma and Van Kleef 2012) shows that 90% of conversations in a professional environment concern gossip about colleagues, and this estimate echoes a similar figure concerning human speech in general (Dunbar 1996). Spreading rumours and gossiping take place, by definition, in conversations where the people mentioned in the stories are not present. In a social context, rumours are very useful because they contribute to increasing cohesion: they remind the interlocutors of the existence of members of the society that might be far away or seldom encountered, in the same way that telling stories about dead persons functions to preserve memories and traditions over generations. The link existing between the population size and the language seems clear.

2 This claim created a lively debate, still ongoing. See Nevins et al. (2009) and Everett (2009) for some elements about it.

1.1.2 The emergence of the language faculty and the peopling of the world from Africa

Chomsky’s hypothesis of a UNIVERSAL GRAMMAR, implying that the human mind possesses innate structures allowing us to acquire, comprehend and use language, might be seen through the new prism of neuroimaging and genetics. Mirror neurons and specific genes are involved in speech production, meaning that there is an inherited genetic base enabling language. This kind of research is currently in its infancy and must be linked to social cognition (see Fitch et al. 2010 for a challenging review), but it could well be that some DNA mutations led to a rearrangement of the neuronal networks of our brain, giving rise to new cognitive abilities that made us better able to communicate.3 These genes were probably under positive selection because communication gave offspring a higher chance of survival.

The rise of verbal communication and its complexity enabled us to distinguish space and time and facilitated the expression of symbolic thinking, together with the social advantages that have been listed in the preceding section. The emergence of the language faculty is one of the plausible hypotheses that have been advocated to explain why and how the descendants of human groups that left the African continent some 50,000 to 100,000 years ago have been able to successfully colonize all the continents (Mellars 2006), leading to the almost complete replacement of pre‐existing human populations (Homo erectus, Denisovan hominin, Homo floresiensis and Homo neanderthalensis). While interbreeding with these geographical species was possible, it seems to have contributed to less than 10% of the genome of modern humans ( et al. 2010), either because of low inter‐fertility or because of a larger population size of the immigrants, maybe related to a more efficient use of language. Interestingly, a regulatory gene that all mammals share (FOXP2) (see Takahashi et al. 2013) has recently been suggested to be necessary for the proper development of speech and language in humans, but the sequencing of fossil DNA from Homo neanderthalensis has confirmed that the gene was present in the latter as well, meaning that it is quite possible that Neanderthals could speak (Coop et al. 2008, Krause et al. 2007, Vargha‐Khadem et al. 2005).4 Language is an important factor in explaining human evolution.

3 Chomsky himself, incidentally, argues that communication is secondary, and that language evolved because of the advantages it confers on thinking. He sees language as a set of atomic elements and rules that allow the construction of more complex rules of thinking according to a criterion he calls computational efficiency, which is different from communicative efficiency. For example see: http://www.u-plum.fr/actualites/232-conference-de-noam-chomsky

1.1.3 The recurrent interest of geneticists in the diversity of human languages

The link between the emergence of improved speech production and a better ability to migrate and to constitute viable and successful societies throughout the world echoes an older and similar debate, in the 1980s, concerning the genetic evidence for the Out of Africa model, which seems supported by the extant worldwide linguistic diversity. As an answer to the critics of the evidence for a migration wave that led to the re‐peopling of the world from Africa in “recent” times (Cann et al. 1987; Stoneking and Cann 1989), Cavalli‐Sforza et al. (1988) focused on cultural evolution and published, side by side, a phylogenetic tree based on human genetic diversity and another corresponding to a worldwide linguistic classification provided by Merritt Ruhlen (1987). The authors suggested that the similarities between the two classifications were the proof of synchrony between cultural and biological diversity, implying that cultural divergence happened over a timeframe comparable to that of genetic differentiation. As diversity accumulates at a much faster pace in languages than in DNA, the fact that a correspondence was found had to be seen as a demonstration that all present human populations had a common and recent ancestor in Africa. This paper attracted wide attention and led to a schism between population geneticists and the community of linguists. In fact, the work of Ruhlen had stood outside accepted comparative linguistics by rejecting the “temporal ceiling” beyond which the comparative method fails, considered by some (Kaufman 1990, Nichols 1992) to lie at roughly 6,000 to 8,000 years ago. Ruhlen was comparing lexical items with supposedly close meanings5 showing resemblances in large linguistic groups called (1994), but these were also explainable as chance similarities, estimated to be equally likely once one treats meaning correspondences as approximately as Ruhlen does (Boë et al. 2003).

4 Actually, in more general terms, the current view is that the evolution of human speech capabilities required neural changes rather than modifications of vocal anatomy. Macaques have a speech‐ready vocal tract but lack a speech‐ready brain to control it (Fitch et al. 2016).

5 For example ‘finger, one’ is associated with equivalent etymologies such as: ‘fingernail’, ‘first’, ‘five’, ‘’, ‘guy’, ‘’, ‘index / finger’, ‘merely’, ‘only’, ‘palm (hand)’, ‘paw’, ‘ten’, ‘to point’, ‘to say’, ‘to show’, ‘thing’, ‘toe’ (Ruhlen 1994).

For their part, many population geneticists did not realize that their continued high esteem for Ruhlen’s analyses was disqualifying their work in the eyes of a vast majority of historical linguists, perhaps also because a scholar as authoritative as Charles Darwin had foreseen a match between the two disciplines, formulating his ideas in what was to attain the status of an Ipse dixit:

If we possessed a perfect pedigree of mankind, a genealogical arrangement of the races of man would afford the best classification of the various languages now spoken throughout the world; and if all extinct languages, and all intermediate and slowly changing dialects, had to be included, such an arrangement would, I think, be the only possible one. Yet it might be that some very ancient language had altered little, and had given rise to few new languages, whilst others (owing to the spreading and subsequent isolation and states of civilisation of the several races, descended from a common race) had altered much, and had given rise to many new languages and dialects. The various degrees of difference in the languages from the same stock, would have to be expressed by groups subordinate to groups; but the proper or even only possible arrangement would still be genealogical; and this would be strictly natural, as it would connect together all languages, extinct and modern, by the closest affinities, and would give the filiation and origin of each tongue (Darwin, 1859, 422–423).

Setting aside all the theories that have been mentioned, does it make sense to compare genetic diversity and linguistic diversity? Why should a link be expected? As complex as they might have been, the majority of human societies have left few traces behind them: some artefacts, some bones, when the soil was not too acid to dissolve them, and lineages of offspring that have sometimes survived until the present. Ignoring aspects that might have outlived those societies such as useful tools and techniques, which are difficult to link to more abstract beliefs and myths, the cultural traits that defined such societies, and their symbols, have generally disappeared to a very large extent. Given the importance of communication, and as long as there is demographic continuity in a society, a language is likely to be maintained, unless external influences come into play, leading to bilingualism and language shifts, perhaps because of military threats or because of the attractiveness of the social and economic model associated with another language. Nevertheless languages do change over time, probably in relation to the size of the population speaking them, according to the tightness or looseness of the linguistic net connecting the speakers (social strata, geographic distance), and depending also on sociolinguistic factors and, of course, on linguistic contact. Continuity normally exists, so that we can explore the extent to which two peoples speaking related languages also share some genetic make‐up. This is a likely tendency, but it clearly does not hold in complete generality. This is the very prudent position I embraced when I decided to initiate the multidisciplinary research reported on in this dissertation and, as I was familiar with population genetics research addressing surname diversity, I decided to compare the latter with the linguistic diversity that is found at national scales, that is, the variability of dialects. The models suggest correlations at this scale, and the smaller scale should be more easily testable. CHAPTER 2 is partly about this question.

1.1.4 Towards a wider Anthropology

A wider and synthetic Anthropology is becoming one of the aims of a large community of scholars, and I would like to mention the remarkable efforts made by Professors Luigi-Luca Cavalli-Sforza and Marcus Feldman in promoting a transdisciplinary approach to human and cultural evolution. Their book, Cultural transmission and evolution, as well as many subsequent articles, constitutes a landmark that has inspired similar efforts outside anthropology in its narrower sense. Later, the archaeologist Colin Renfrew also embraced such views, organizing a series of conferences that “forced” the debate among archaeologists, linguists, geneticists and demographers, and then ensuring that the views were publicized in a series of high-quality books (Renfrew and Boyle 2000; Bellwood and Renfrew 2002; Forster and Renfrew 2006). By applying methods from the natural sciences (Bryant et al. 2005), while maintaining contact with the primary linguistic and cultural data, questions considered intractable have been addressed. To mention only the research of those with whom I have personally interacted,6 the research of Professor Russell Gray (Max Planck Institute for the Science of Human History) has made the subfield of historical linguistics less hesitant about sources of evidence that proceed through digital computing (see for example Gray and Atkinson 2003, Dunn et al. 2011, Bouckaert et al. 2012). Comparable efforts to develop a wider Anthropology have been undertaken by Professors Stephen Shennan and James Steele (UCL, UK), especially concerning demography and cultural macro-evolution (see for example Steele and Shennan 2009), and by Professors Mark Pagel (University of Reading) and Tecumseh Fitch (University of Vienna) concerning linguistics (see for example Mesoudi et al. 2011, Pagel et al. 2013, Grollemund et al. 2015). My research interests have been shaped by this incredibly large body of scientific research.

6 By inviting them to conferences and symposiums I organized or by co-editing special issues published in Human Biology (Wayne State University Press, Detroit, MI), a journal I edited from 2008 to 2013.

While a majority of the scholars I mentioned above are biologists, archaeologists or psychologists, meaning that they approached linguistics from a different background, dialectology, unlike historical linguistics, has its own computational tradition. It started with Jean Séguy (1914–1973), was further fostered by Hans Goebl (University of Salzburg; see Goebl 2006) and has been pursued by the scholars of the Department Alfa-Informatica at the University of Groningen. If dialectology experienced this digital evolution earlier, it is probably because larger corpora were available than in historical linguistics. This is similar to surname studies in population genetics: surnames are markers that are freely available in large amounts and that made very detailed studies possible well before the advent of DNA sequencing technology (see Colantonio et al. 2003, Darlu et al. 2011 for a review).

1.2 WHAT FAMILY NAMES TELL US ABOUT A POPULATION

The application of models based on surnames to infer the genetic structure of human populations relies on the parallel transmission of surnames and Y-chromosome DNA, at least in occidental naming practices, where a legitimate son has the surname and the Y-chromosome of his father. This is the reason why the variability of the surnames (i.e. their different types and their frequencies) can be quantified to estimate the consanguinity of different societies, without the need to undertake expensive laboratory DNA typing. This variability also enables the evaluation of population isolation, differentiation, and the directionality of migrations. Many population geneticists made major contributions to this field, including Crow, Cavalli-Sforza, Morton, Relethford, Lasker, and Barrai (see Lasker 1985, Colantonio et al. 2011). Surname methodologies have been applied to more than thirty societies, all around the world. The geographic scope ranges widely, from the household or village to a whole continent. The confirmation that surnames indeed mirror genetic variability came with the advent of full genome sequencing, a technique allowing very deep inferences about regional and micro-regional genetic differences that can be explained by demographic factors that, in turn, can rely on historical and cultural processes. Family names of patrilineal descent have indeed proved to mirror a single locus on the Y-chromosome (King and Jobling, 2009). However, the temporal depth of surnames is limited (between 4 and ±30 generations)7 when compared to the scale of demographic processes inferred by molecular markers, and in any case variation in the Y-chromosome represents an extremely small amount of the total genetic variability.

7 About 20 to 25 generations in European Christian countries, as the Roman Catholic Church started to register births and deaths after AD 1563.

In this context, why should anthropologists take into consideration surname information that, albeit easier to collect than DNA data, is sometimes tricky to interpret? One simple answer is that family names allow a retrospective look at human variation, which is hard to achieve with DNA studies because these require living populations, if they are to be conducted at a reliable scale. Historical documents often report surname information over several successive generations, and with a degree of polymorphism that (for the moment) is larger than the one available with DNA. The major strand in surname studies rests on the exploitation of databases that are increasing in size and exhaustiveness due to widespread digitization. In this respect, Pablo Mateos and Paul Longley's UCL Worldnames database,8 which includes about 6 million surname-types registered in 26 different countries, constitutes an impressive quantity of information and a valuable tool for future research (Mateos 2011). Millions of different types of surnames are drawn from diverse sources, such as national electoral registers, telephone directories, or national online censuses. Moreover, these can be organized according to lexis, phonology (vowels, consonants, morphology) and based on surname type (derived from place names, professions, nicknames, or first names). The second major research direction, besides these attempts to draw from modern registers a large number of surnames in vast geographic regions, involves a focus on historical data that I am going to skip here, but see Darlu et al. (2001). The large expansion of the available data, both in time and space, has led to the development of new methods and analytical tools. Among them, and now widely used, are automatic geographic representations of surname diversity which plot the variations of frequency of a given name, or a set of names, sharing some phonetic or grammatical features. 
Some recent statistical methods are also becoming established, such as Bayesian approaches to infer the origins of migrants (see contribution of G. Brunet in Darlu et al. 2012), Self-Organizing Maps to automatically identify surnames sharing the same geographical origin (Manni et al. 2010), or to identify ethno-cultural groups (Mateos et al. 2011). The purpose of Mateos was to create a ‘universal’ classification of forenames and surnames by ethnic group. 250,000 surname-types and 120,000 forename-types have been aggregated in 150 possible cultural categories. The classification is empirical, not based on scientific or ethnologic background, and the methodology is based on a technique of cross occurrences between forenames and surnames (Mateos and Tucker 2008): with a given forename it is possible to retrieve related surnames and, from the latter ones, to retrieve corresponding forenames. Iteration after iteration, the database expands until a stage is reached where all retrieved surnames correspond to the same set of forenames and vice versa. In this way it is possible to identify clusters of linked individuals corresponding to several isolated clusters existing in the population. They allow depicting contemporary migrations, multiculturalism and assimilation, and the results have a direct interest for social anthropologists and population geneticists.

8 http://worldnames.publicprofiler.org/
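The cross-occurrence procedure described above (forename to surnames, surnames back to forenames, repeated until nothing new is retrieved) can be illustrated with a minimal sketch. It is not the implementation of Mateos and Tucker (2008): the function name `expand_cluster`, the toy name pairs and the absence of any frequency threshold are assumptions made purely for illustration.

```python
def expand_cluster(pairs, seed_forename):
    """Grow a forename/surname cluster from one seed forename.

    `pairs` is an iterable of (forename, surname) tuples, one per person.
    The loop alternates between retrieving the surnames linked to the
    current forenames and the forenames linked to those surnames, and
    stops when an iteration retrieves nothing new.
    """
    pairs = list(pairs)
    forenames, surnames = {seed_forename}, set()
    while True:
        linked_surnames = {s for f, s in pairs if f in forenames}
        linked_forenames = forenames | {f for f, s in pairs if s in linked_surnames}
        if linked_surnames == surnames and linked_forenames == forenames:
            return forenames, surnames
        forenames, surnames = linked_forenames, linked_surnames


# Hypothetical toy register with two unconnected naming communities.
register = [
    ("Giovanni", "Rossi"), ("Giovanni", "Bianchi"), ("Luca", "Bianchi"),
    ("Jan", "de Vries"), ("Pieter", "de Vries"), ("Pieter", "Jansen"),
]

italian_forenames, italian_surnames = expand_cluster(register, "Giovanni")
dutch_forenames, dutch_surnames = expand_cluster(register, "Jan")
```

In this toy register the two clusters never merge because no forename is shared between them. In a real register a single mixed pair would join two clusters, which is presumably why a practical procedure needs some weighting of the cross occurrences; that refinement is left out of the sketch.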

1.2.1 Surnames and dialects

Family names carry social and economic information that merits inclusion in several interdisciplinary approaches to human history. Historians, linguists, and geographers can play as active a role as biologists in surname studies and population analysis. Today, in an age of global migration (Castles and Miller 2009), the distribution of surnames remains far from random and has the potential to allow an intermediate level of access to the recent past and to smaller geographical scales, both of which are difficult to obtain otherwise: this is where parallel studies in dialectology become intriguing. By providing evidence of migration phenomena in different periods, it is possible to delineate past genetic isolates and population structures that have been modified or have disappeared altogether, thus enabling demographic hypotheses to be tested linguistically (Falck et al. 2012). By identifying the geographic origins in large corpora of surnames at national scales and aggregating the results, a double matrix (immigration and emigration) leads to the identification of provinces falling into four categories: 1) attractive provinces, towards which immigration has been strong but from which emigration was weak, which is typical of important urban areas; 2) unattractive provinces, from which emigration has been considerable but which have failed to attract immigrants; these are usually economically poor areas where the surname-set of the population has remained closer to its initial make-up at the time of the introduction of surnames; 3) corridor provinces, in which immigration and emigration have both been important phenomena, leading to a considerable modification of the surname-set over time; and 4) isolated self-sufficient provinces that have never really attracted immigration but from which people have not left; these often correspond to geographically isolated areas. 
These four classes of regions match alternative demographic phenomena, and it is likely that these had different impacts on regional linguistic variability. As a hypothesis to be tested, one would expect to find higher dialectal diversity, and clearer geographical structure, in the provinces where the number of immigrants speaking external varieties has been low, since the immigrants could not influence the speech communities much (unattractive and isolated provinces). On the other hand, the areas that have been the target of massive immigration are expected to have lost linguistic variability, since some of the original local groups will have lost the critical size needed to support a linguistic variety; in fact the identity-marking function of local varieties is less relevant where the number of allochthonous speakers is too large. This hypothesis suggests that demography and variationist linguistics should be able to collaborate profitably.
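The four-way typology of provinces can be made concrete with a small sketch. It assumes a hypothetical origin-by-residence table (`flows[origin][residence]` counts the bearers of surnames originating in one province and living in another) and a single arbitrary threshold separating "strong" from "weak" migration; neither the table nor the threshold value comes from the studies cited above.

```python
def migration_rates(flows):
    """Return {province: (immigration_rate, emigration_rate)}.

    Immigration rate: share of a province's residents whose surname
    originated elsewhere. Emigration rate: share of the bearers of a
    province's surnames who live elsewhere.
    """
    rates = {}
    for p in flows:
        residents = sum(flows[origin][p] for origin in flows)
        natives = sum(flows[p].values())
        stayers = flows[p][p]
        rates[p] = (1 - stayers / residents, 1 - stayers / natives)
    return rates


def classify(immigration, emigration, threshold=0.3):
    """Map a pair of rates onto the four categories (threshold is an assumption)."""
    strong_in, strong_out = immigration >= threshold, emigration >= threshold
    if strong_in and strong_out:
        return "corridor"
    if strong_in:
        return "attractive"
    if strong_out:
        return "unattractive"
    return "isolated"


# Hypothetical counts: rows are provinces of origin, columns provinces of residence.
flows = {
    "A": {"A": 80, "B": 5, "C": 15, "D": 0},
    "B": {"A": 25, "B": 50, "C": 25, "D": 0},
    "C": {"A": 30, "B": 5, "C": 40, "D": 0},
    "D": {"A": 2, "B": 0, "C": 0, "D": 98},
}

categories = {p: classify(*r) for p, r in migration_rates(flows).items()}
```

With these invented counts, province A receives many immigrants but loses few natives, B does the opposite, C does both and D neither. Any real application would need a defensible way to set the threshold, for instance relative to the national median.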

1.2.2 How to compare dialects and surnames?

The cross-comparison of surname and dialect variability (Manni and Barrai 2000, 2001) made me face the same kind of problem that other geneticists were confronted with, namely the impossibility of comparing quantitative information, such as frequency vectors and distance matrices summarizing surname variability, with the qualitative descriptions and maps that dialectologists often use. An obvious way to circumvent the problem was to compute maps of surname variability to be visually compared to existing dialect atlases that, sometimes, report appealing isoglosses. This is the reason why I turned to methods focused on the computation of genetic barriers: areas where the rate of change of a given variable is higher (Manni et al. 2004), boundaries making sense geographically and/or linguistically. But this kind of comparison was unsatisfactory because it is rather easy to lie with maps (Monmonier 1996). As a population geneticist, I was familiar with DNA sequence alignment and aware of an interesting paper addressing, by sequence comparison, the time that toponyms of Arabic origin in Sicily needed to evolve into their current Italianized form, such as Moio Alcantara (Province of Messina) and al-maya al-kantara, which means “the water of the bridge” (Barrai, 1993). Inspired by this example (Fig. 1.1) and with the help of Professor Barrai (University of Ferrara), I later developed an alignment algorithm to measure the linguistic diversity of some dialects, in order to evaluate the congruence between linguistic differences and the surname/genetic diversity of the province of Ferrara (Manni and Barrai 2000, 2001). Linguistic distances were computed according to what we called a unit cost model, in which insertions, deletions, point mutations and the presence/absence of the article each cost one unit, and without normalizing the alignments by their length (Fig. 1.2). 
We were unaware that Kessler (1995) had already applied this same Levenshtein method to Gaelic dialects and that the research group of John Nerbonne had largely improved it over the years (Nerbonne et al. 1996; Heeringa 2004). Actually the idea was “in the air”, as the book Time Warps, String Edits and Macromolecules clearly shows (Sankoff and Kruskal, 1999). When I contacted the research group of Groningen to ask for assistance, they were in the process of computationally measuring the diversity of Dutch dialects and, by an incredible coincidence, I was processing Dutch surnames. This circumstance and the methodological vicinity favoured the establishment of a collaboration to which this PhD dissertation bears witness. In this way the Levenshtein algorithm has provided the computational basis needed to compare linguistic diversity with family name diversity. CHAPTER 4 is about this first collaboration, which was quite dialectical because we were unsure about how to relate the patterns of surname and dialect variation, owing to instability in the clustering at finer scales. The hierarchical clustering of surnames had been tested by resampling techniques, while dialect Levenshtein distances had not. This uncertainty led to a phase of methodological progress directed at better assessing the reliability of linguistic classifications (next section).
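As an illustration of the unit cost model, the following sketch computes the plain Levenshtein distance, in which each insertion, deletion and substitution (the "point mutation" of the text) costs one unit and the result is not normalized by alignment length. It is a textbook dynamic-programming version, not the code used in Manni and Barrai (2000, 2001); in particular, the extra unit charged for the presence or absence of the article is left out, and the example strings are arbitrary spellings rather than phonetic transcriptions.

```python
def levenshtein(source, target):
    """Unit-cost edit distance between two sequences of segments."""
    # previous[j] holds the distance between the first i-1 segments of
    # `source` and the first j segments of `target`.
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # delete s
                current[j - 1] + 1,          # insert t
                previous[j - 1] + (s != t),  # substitute s by t (free if equal)
            ))
        previous = current
    return previous[-1]


distance = levenshtein("moio", "maya")  # three substitutions
```

Because the function only iterates over its arguments, it accepts lists of phonetic segments as well as plain strings, which matters when a single sound is written with more than one character.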

Figure 1.1  Early alignment of lexical data. Matrix of dots used by Barrai (1993) to estimate the homology between toponyms of Arabic origin in Sicily according to the original pronunciation and to the present ‘Italianized’ one. The diagonal visible among the dots corresponds to matching strings.

Figure 1.2  Description of the unit cost model through which linguistic differences were calculated in Manni and Barrai (2000, 2001).

1.3 HOW TO ASSESS THE RELIABILITY OF LINGUISTIC CLASSIFICATIONS?

A recurrent limitation of computational dialectology is that different clustering algorithms sometimes lead to conflicting classifications. In an effort to overcome the instability in clustering and to obtain reliable results, some research had been directed towards the definition of the optimal clustering algorithm (UPGMA, Ward’s, etc.) to be adopted within dialectology, proceeding from an examination of the different biases of the different clustering methods and comparing these to the assumptions of linguistic models, but it is not always easy to theoretically justify why one algorithm is better than another, even if empirical experience can give some clues about it (see chapter 6 in Heeringa 2004). An alternative way to evaluate clustering for dialectology is to examine the frequent discrepancies between a dendrogram (the result of hierarchical clustering) and the projection of a distance matrix onto a two- or three-dimensional plot (obtained e.g. by multi-dimensional scaling or principal component analysis), assuming that the reduction to a small number of dimensions does not involve too much information loss. In routine analyses, it can happen that the clusters identified in the dendrograms are hardly visible, if at all, in the multivariate plot. On the other hand, well identified clusters in multidimensional plots always show up in dendrograms. These rules of thumb and clues were insufficient to definitively compare the geographic patterns of surnames and dialects; this is why we turned to the application of the bootstrap to dialect data.

1.3.1 Efron’s (1979) bootstrap

The bootstrap method (Efron 1979) is a technique for obtaining standard errors and confidence limits for various statistics. The basic idea is to take the sample from which one hopes to estimate a parameter by means of a sample statistic p. Once p is computed, and if the sample is of size n, one carries out a random resampling procedure by sampling with replacement, using the n items from the observed sample as a parent population. For each of these new (resampled) samples one again estimates the desired parameter. Because we are sampling with replacement, most samples will contain two or more replicates of some variates appearing in the observed sample and, consequently, some other variates will be missing. It has been shown that the mean of the estimated statistic, from the bootstrapped samples, approximates the mean of the population, and that the standard deviation of such an estimate approximates the standard error of the statistic, as if we had repeatedly sampled from the unknown population without replacement. This is a very important result because it permits us to calculate standard errors for almost all statistics (Sokal and Rohlf 2001, pp. 823-25).
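The resampling procedure can be sketched in a few lines. The example below is only an illustration under simple assumptions: the observed sample is an arbitrary list of numbers, the statistic is the arithmetic mean, and the function name `bootstrap` and the number of resamples are choices made here, not part of Efron's paper.

```python
import random
import statistics


def bootstrap(sample, statistic, n_resamples=2000, seed=0):
    """Return the mean and standard deviation of a bootstrapped statistic.

    Each resample draws len(sample) items with replacement from the
    observed sample, which plays the role of the parent population.
    """
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        resample = [rng.choice(sample) for _ in range(n)]
        estimates.append(statistic(resample))
    return statistics.mean(estimates), statistics.stdev(estimates)


observed = list(range(1, 21))  # arbitrary toy sample of size n = 20
boot_mean, boot_se = bootstrap(observed, statistics.mean)

# For the mean, the textbook standard error is available for comparison.
classical_se = statistics.stdev(observed) / len(observed) ** 0.5
```

For the mean the bootstrap is unnecessary, since `classical_se` is known in closed form; the comparison only serves as a sanity check. The value of the method lies in statistics for which no such formula exists, where the same function can be called with a different `statistic`.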

1.3.2 Felsenstein’s (1985) bootstrap

Considered to be an improvement over the Jackknife (Efron 1979), the bootstrap has been extensively used in biostatistical applications and has become popular in numerical taxonomic and phylogenetic research, where it was introduced by Felsenstein (1985). In phylogenetic estimation, the branching sequence of t (animal, plant, etc.) species is estimated from a set of n characters, which vary in their states among species. The estimates are based, for example, on constructing the shortest tree, in terms of the amount of implied evolutionary change for a given data set. After constructing such a tree, the comparative biologist may conclude that species A and B are closer to each other than they are to other species, that is, that they form a monophyletic taxon. To test the reliability of such a taxon, Felsenstein suggested sampling n characters with replacement from the original data set, in order to create new data sets from which new minimum-length trees are constructed. The result is a number of trees, say m = 100. If 65% or more of the trees contain the taxon {A, B}, then this branch of the tree is considered well substantiated. If, by contrast, only 30% of the bootstrapped trees show the set {A, B}, little reliance can be placed on that portion of the taxonomic structure of the tree. At this point it should be clear that the bootstrap can be seen as a method to estimate the robustness of a given classification of taxonomic units, in the sense that it tells whether a given difference between them is widespread or not among the characters that describe such taxonomic units. The bootstrap is not a test of how accurate a tree is; it gives information about the stability of the tree topology (the branching order), and it helps in assessing whether the data are adequate to validate the topology.9 An implicit assumption of the bootstrap technique is that the different characters that define the items under classification bear equal weight in determining the classification.

9 A simple example will make the essence of the bootstrap clearer. Let us imagine that we want to classify two persons according to the colour of their clothes (1. cap; 2. shirt; 3. trousers; 4. socks; 5. shoes). If the two individuals are dressed in completely different colours, when resampling the 5 characters that define them in new bootstrapped datasets we will find that all resampled datasets lead to the conclusion that they are different. This is obvious because each colour character that defines them (the colour of their caps, shirts, trousers, socks and shoes) is always different. In such a case the difference between these two individuals is said to be supported by a 100% bootstrap score. By contrast, if the two individuals have clothes identical in colour apart from their shoes (one having brown shoes and the other having shoes of a different colour), we will find that the difference between them will be supported by a minority of the resampled data sets, since there are good chances that the only character that conveys a signal of difference (the shoes) will often be left out in the resampling procedure, simply by chance. Only those datasets containing the colour of the shoes as one of the characters will support a difference between them, whereas those datasets not containing such a character will not. If resampled datasets are numerous enough to allow each character to be sampled with equal probabilities, then the difference between the two individuals will be supported by a bootstrap score of ~20%, since they differ by one character over five.

1.3.3 Application of resampling techniques to dialectology

The application of the bootstrap to dialectology allows us to assess the degree of reliability of a dialect classification, that is, to test whether the variability of the characters (words) that describe each dialect variety supports the final classification, and to what extent. As a hypothetical example, we can think of a group of dialect varieties that, according to scholarly traditions, are usually split into two groups (say a northern and a southern group) and investigate how many of the words that define such varieties exhibit a North/South differentiation by randomly resampling the set of words into many new auxiliary datasets, from which we then compute a corresponding number of distance matrices and, in turn, of dendrograms. If a majority of the words exhibits a North/South difference, then a majority of the trees obtained from the resampled datasets will show a major split between the northern and southern cluster, which could be intuitively summarized in a synthetic tree where the fork separating such clusters is substantiated by a percentage higher than 50%. When using a resampling technique to generate new databases, we can make inferences about the strength of a given signal (North/South) between the words that constitute the database, since in the new resampled datasets some words will be over-represented while some others will be missing, meaning that the randomness of the resampling process gives a different random weight to the original elements constituting a database. To put it differently, once the initial weights are all equal, the resampling procedure randomly emphasizes, or diminishes, the weight of some words and tells whether a split (North/South) is supported by only a few words of the database or by a majority of them.

1.3.3.1 Bootstrap consensus trees

By definition, dendrograms are constituted by branches that are topologically identified by the different splits (nodes) that partition the data into groups (clusters) of different size. A bootstrap consensus tree is a summary of a set of trees computed according to a clustering algorithm (UPGMA, Ward’s, etc.) from a set of distance matrices obtained from resampled databases. As said, visually a bootstrap consensus tree can be identified since its forks (nodes) are coupled with a score proportional to the number of times a given fork (node) appears in all the original trees it summarizes.

There are at least two ways to compute a bootstrap consensus tree from a set of trees: a) strict consensus and b) majority-rule consensus. A strict consensus tree consists of all groups that occur 100% of the time, the rest being ignored. Less stringent, a majority-rule consensus (MRC) tree consists of all groups that occur more than 50% of the time (Margush and McMorris 1981). We stress that in MRC trees there cannot be two conflicting splits that are each supported by more than half of the trees. If there were, there would be at least one tree containing two conflicting splits at the same time, which is absurd. The procedure to compute a MRC tree is quite simple and consists of three steps: 1) computation of all splits; 2) removal of all splits that are not supported by more than half of the trees; 3) computation of the consensus tree containing the remaining splits. The majority rule can be extended to display some other splits that are supported by scores lower than 50%. From now on a majority-rule extended consensus tree will be referred to as a MREC tree. Rephrasing Hillis and Bull (1993), the bootstrap value is a count or percentage of how often each branch was present in exactly the same topology in all the resampled trees, so it gives an impression of how much the tree topology could change if, for example, one were to reconstruct it using a different set of words.
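The whole chain, from resampling words to a majority-rule summary, can be sketched on a toy dataset. Everything in the example is an assumption made for illustration: the four varieties and their six "words" are invented, words are compared categorically (same or different) rather than with the Levenshtein distance to keep the code short, the trees are built with a naive average-linkage (UPGMA) procedure, and the function names are hypothetical. The sketch reports supported clusters and their scores; it does not draw the consensus tree.

```python
import random
from collections import Counter
from itertools import combinations


def variety_distance(words_a, words_b, columns):
    """Share of the resampled word columns on which two varieties differ."""
    return sum(words_a[c] != words_b[c] for c in columns) / len(columns)


def upgma_clusters(data, columns):
    """Average-linkage clustering; returns the non-trivial clusters of the tree."""
    names = sorted(data)
    dist = {frozenset(pair): variety_distance(data[pair[0]], data[pair[1]], columns)
            for pair in combinations(names, 2)}

    def average(a, b):
        return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    active = [frozenset([name]) for name in names]
    found = set()
    while len(active) > 2:  # the last merge only yields the full set
        a, b = min(combinations(active, 2), key=lambda pair: average(*pair))
        active = [c for c in active if c not in (a, b)] + [a | b]
        found.add(a | b)
    return found


def bootstrap_support(data, n_replicates=200, seed=0):
    """Fraction of bootstrap trees in which each cluster appears."""
    rng = random.Random(seed)
    n_words = len(next(iter(data.values())))
    counts = Counter()
    for _ in range(n_replicates):
        columns = [rng.randrange(n_words) for _ in range(n_words)]  # with replacement
        counts.update(upgma_clusters(data, columns))
    return {cluster: k / n_replicates for cluster, k in counts.items()}


def majority_rule(support, threshold=0.5):
    """Keep only the clusters found in more than `threshold` of the trees."""
    return {cluster: s for cluster, s in support.items() if s > threshold}


# Invented data: one variant code per word; North and South differ on every word.
varieties = {
    "N1": ["a", "a", "a", "a", "a", "x"],
    "N2": ["a", "a", "a", "a", "b", "x"],
    "S1": ["c", "c", "c", "c", "c", "y"],
    "S2": ["c", "c", "c", "d", "c", "y"],
}

consensus = majority_rule(bootstrap_support(varieties))
```

Because every word separates the northern from the southern varieties in this toy dataset, the two clusters survive almost any resampling. A signal carried by only a few words would appear in fewer replicates and could fall below the majority threshold, which is exactly the information the bootstrap score is meant to convey.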

1.3.3.2 Adoption of a cut-off value

We said that the nodes of a MREC bootstrap tree can be supported by values varying from 1% to 100%, indicating how stable (robust) their topology (branching order) is. In biology there are many rules of thumb about how to interpret bootstrap values, and a score of 95% is sometimes taken as the minimum since it recalls the 95% confidence interval used in statistics. Nevertheless, a score of 70% is often cited as a cut-off for a 'reliable' branching (see also Hillis and Bull 1993). When applying bootstrap procedures to dialectological data we cannot rely on any available rule since, to our knowledge, this dissertation includes the first papers using the bootstrap to assess the robustness of dialect variants. When choosing a cut-off value for dialect data, implicitly, we are making a decision about the minimal difference that should exist between two dialect variants, if they are to be considered different. In other words: if we record 100 words in two dialects and it appears that these two are perfectly identical apart from 10 words that are different (meaning a low bootstrap score in the fork separating them), will we still consider these variants as distinct or would we decide that, actually, we are dealing with one single variety and that the observed difference has to be attributed to noise? This is a very thorny question since dialectologists have been keen to catalogue new variants and to extend the sampling grid. The measurement of the noise related to variation within the same village, or town, in order to establish which percentage of variation has to be interpreted as sampling error, has not received much attention; probably the first paper to provide a mean and a standard deviation in dialectological measurements is Nerbonne et al. (1996).

1.4 THE FUEL: LEXICAL DATABASES

1.4.1 The number of items

All the research included in the dissertation is based on the comparison of lexical items, the sort of material that is generally readily available from linguistic atlases and databases. We generally try to include about 100 items, because this number has been recognized to be sufficient for aggregate analyses. In fact the words that are processed are in general composed of 5 segments, meaning that the Levenshtein method, involving sequence comparison, can extract a more robust signal from a short list of lexical items (Heeringa et al. 2002) than earlier dialectometric procedures (see below). I will comment on two aspects of the dialectometric approach adopted by Nerbonne, Heeringa and others in Groningen. First, they insist, following Goebl (2006) and others, that a reliable indication of the relations among dialects can only be gleaned from the examination of larger aggregates of comparable dialect material. We recall here Grimm’s (1819) adage that “each word has its own history” (see also Kirk et al. 1985), which we take to be true if we understand it to mean that any single word or dialect feature may be misleading with respect to the inferences it suggests about the relations among dialects (see Nerbonne 2009 where the argument for using aggregates is elaborated on). Some words and features are capricious in some aspects of their distribution. The insistence on examining aggregates is shared by Goebl (2006 and elsewhere), but the Groningen direction has tried to simplify processing and to squeeze more information out of dialect atlas collections by employing sequence comparison (the Levenshtein algorithm) extensively. Séguy’s and Goebl’s work treated word lists as categorical data where items might be the same or different. So the aggregate measure of similarity between settlements (represented by word lists) was the fraction of identical elements, and the distance its complement, i.e. 
the fraction of different elements.10 Sometimes the words were examined to extract a single, simpler feature, e.g. whether the stressed vowel was pronounced high in the mouth (as an [u]) or low (as an [a] or [ɒ]). The extracted feature was then compared categorically, but this comes at the cost of losing information in the other elements of the word’s pronunciation, and at the cost of manually extracting the relevant feature. The Levenshtein algorithm compares the entire sequence of sounds, thereby incorporating more information, and, because it is automated, obviating the need for manual pre-processing. Finally, because it produces a numerical measure of the difference, it is more sensitive than earlier dialectometric procedures. By using consistency measures, Heeringa (2004, p. 176) has demonstrated that pronunciation-based classifications tend to remain stable after processing a minimal number of items of about 30, and that the signal that a 100-word list delivers is not significantly different from the one of a 200-word list and, finally, that what matters is the choice of the words that are used instead of their number. Categorical comparisons require three to four times as much data to reach stability.

10 I am ignoring Goebl’s work on weighting schemes, but it is orthogonal to the point.
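The contrast between the categorical aggregate and the sequence-based aggregate can be shown on two short word lists. The sketch is only indicative: the spellings are invented stand-ins for phonetic transcriptions, dividing each Levenshtein distance by the length of the longer word is one simple normalization chosen here (the published Groningen procedures differ in detail), and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance, kept compact for this comparison."""
    row = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        new_row = [i]
        for j, y in enumerate(b, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + (x != y)))
        row = new_row
    return row[-1]


def categorical_distance(list_a, list_b):
    """Fraction of items whose forms are not strictly identical."""
    return sum(a != b for a, b in zip(list_a, list_b)) / len(list_a)


def sequence_distance(list_a, list_b):
    """Mean per-word edit distance, each divided by the longer word's length."""
    per_word = [edit_distance(a, b) / max(len(a), len(b), 1)
                for a, b in zip(list_a, list_b)]
    return sum(per_word) / len(per_word)


# Invented forms for the same three concepts in two settlements.
settlement_a = ["hus", "melk", "water"]
settlement_b = ["huis", "melk", "woater"]

categorical = categorical_distance(settlement_a, settlement_b)
graded = sequence_distance(settlement_a, settlement_b)
```

Two of the three items differ, so the categorical measure reports a large distance even though each differing pair is separated by a single inserted segment; the sequence-based measure registers the same pairs as close. This graded response is what allows a shorter word list to carry a usable signal.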

1.4.2 The choice of the words

With the Levenshtein method we measure pronunciation distances. While a random list of words is as likely to mirror the phonology of a language as a more specific wordlist, we stress that we are not comparing phonological repertoires but, each time, the pronunciation of a given word in variety A with the pronunciation of the word corresponding to the same concept in variety B, in a pairwise fashion. If shared vocabulary is large, for example because of extensive borrowing, computed linguistic distances will be lower than in the opposite case. While this bias does not seem to be a serious issue in dialectology, because borrowing is not distinguishable in close varieties, it can lead to a systematic error in assessing linguistic diversity when analyzing more distantly related varieties and languages. In this direction, the work of Swadesh (1955) is relevant because it was intended to identify lexical items less likely to be borrowed, words that constitute a basic vocabulary that is expected to have emerged first in every language because they are necessary (like ‘water’, ‘fire’, etc.). Basic vocabulary has probably remained stable over time, meaning that it was not borrowed from another speech community because each group of speakers has its own, long established, words for these concepts. Basic core vocabulary was needed in lexicostatistics to better address the historical phylogeny of languages (see Dyen 1975).

1.4.1 Swadesh wordlists

Swadesh word lists are processed in two chapters of the dissertation. In CHAPTER 6 (concerning Bantu languages), the word‐lists we addressed correspond almost systematically to the core vocabulary. We analyzed a database assembled at the Musée Royal de l’Afrique Centrale, Tervuren (Bastin et al. 1999) and, since this institution has been very active in developing lexicostatistical approaches, the wordlists were based on Swadesh lists. In CHAPTER 7 (concerning Central Asian languages), we process Swadesh word lists too, but in this case we did not have to rely on an existing database, as we did the fieldwork ourselves. The reason for using this kind of word list was to provide novel linguistic material in a form that is comparable to the available literature, which is largely based on Swadesh lists. Although during the fieldwork we noticed that such items offered advantages because speakers pronounced them without hesitation, we had to exclude some of them, either because they were polysemic or because they were not suited to that specific ecological context. This is similar to what Hombert (1990) did when he designed the wordlist to be used in the Atlas Linguistique du Gabon (ALGAB). To anticipate two tangential findings that will not be discussed in the final chapter of the dissertation, we noted inconsistencies between the classifications of the two Gabon datasets we processed (Bastin et al. 1999 and the ALGAB, respectively based on an average of 89 and 132 items). By contrast, the classification of 32 Tanzanian Bantu languages did not change when processing a 1400‐word list or a subset of it consisting of 92 Swadesh concepts (see CHAPTER 6 concerning both examples).

While several articles have speculated on the effectiveness and the appropriateness of the Swadesh list in historical linguistics (Kessler 2001; McMahon and McMahon 2005; Holman et al. 2008), the Loanword Typology project11 coordinated by Haspelmath and Tadmor (2009) identified 113 concepts that are the most stable ones (among 1460 lexical items) in about 50 languages:12 44 of them are included in the 200‐word list of Swadesh (22% of that list). This result is interesting because i) it shows that the Swadesh list is less stable than previously assumed, ii) it suggests that Swadesh lists can also be used to efficiently measure contact (borrowings), iii) it points to the noise that can explain the inconsistencies we noted concerning CHAPTER 6, and iv) it demonstrates that our attempt to measure borrowing between Turkic and Indo‐Iranian languages, as an indirect measure of population contact and gene‐flow, was not biased by the use of a word list that is markedly conservative (see CHAPTER 7). On a personal note, if a new wordlist, more ‘conservative’ than Swadesh’s, had to be designed, many of the concepts that Haspelmath and Tadmor (2009) find resistant to borrowing would be likely to be difficult to elicit in fieldwork conditions, because they are ambiguous or difficult for the elicitor to define when the speakers of the language to be documented are predominantly monolingual (for example: ‘behind’, ‘the day before yesterday’, ‘the day after tomorrow’, ‘today’, ‘the house’, etc.). The book about wordlists is still open.

11 http://wold.clld.org/

12 The list of the 113 most stable concepts according to Haspelmath and Tadmor (2009) is the following (when the item is underlined it appears in the first part [concepts #1‐100], when underlined in italics it appears in the second part [concepts #101‐200] of the Swadesh list): to walk, you, yesterday, black, the back, the nose, the tongue, to kill, the rib, the eyelash, to go out, when?, long, the , to hear, wide, to bring, I, to rise, today, the head louse, the , this, the foot, the toe, few, to fart, the day after tomorrow, to stand, stinking, to blow, to listen, sometimes, up, behind, bright, to borrow, the clay, that, the day before yesterday, the itch, to hollow out, he/she/it, to flow, raw, the nit, the woman, the house, to go, the bark, to carry, the fire, to speak or talk, you (pl.), to meet, the wood, the night, to come, to throw, the flea, to lie down, to follow, , new, the fog, there, the flesh, the sun, the , the lip, the , the breast, the navel, the liver, to cough, to spit, to bite, to sleep, the thunder, to shiver, here, far.

1.5 OUTLINE OF THE DISSERTATION

The order of the chapters corresponds to their focus. CHAPTERS 2 and 3 give introductory and methodological elements; CHAPTERS 4 and 5 report comparative studies involving surname and linguistic variability in the Netherlands and in Spain; CHAPTERS 6 and 7 address wider linguistic contexts: Bantu and Central‐Asian languages. The discussion follows.

1.5.1 CHAPTER 2: Sprachraum and Genetics (Manni 2010)

In this chapter I address the viewpoint of a geneticist with respect to genetic and linguistic cartography in order to provide an historical and methodological background reviewing the steps that led some population geneticists to co‐operate with linguists, a collaboration that started with the comparison of maps. Mapping, in genetics, has a quite recent tradition and is not very accurate, because genetic sampling is generally uneven and relies, to a wide extent, on available published material. I review the first attempts to obtain maps of the genetic variability of populations, to later focus on the discontinuities in the genetic landscape, which are the “boundaries”, the barriers to gene flow (Manni et al. 2004). It is generally difficult to explain the geographical distribution of genes because genetic variability arises over time‐scales that are often much deeper than historical times and because it depends not only on geographical features but also on cultural divides that have changed or disappeared without leaving traces. To provide evidence of some successful attempts to identify and explain these barriers I present some results concerning the surname and linguistic variability in the Netherlands (see CHAPTER 4).

1.5.2 CHAPTER 3: Projecting Dialect Distances to Geography (Nerbonne et al. 2007)

This second methodological chapter addresses clustering instability in dialectology and concerns the application of the bootstrap method to linguistic data. When bootstrapping is impossible because original data is not available, an alternative approach consists in adding random noise to the distance (or similarity) matrices during repeated clustering: this is called noisy clustering. We demonstrate that noisy clustering can parallel a bootstrap test but it has a major weakness: it is impossible to know how much noise has to be added to emulate a given cut‐off (see section 1.3.3.2). The article is quite short and is complemented by section 1.3 of this introduction. This work is important because CHAPTERS 4‐7 heavily rely on bootstrap trees to establish reliable classifications of dialects and languages.
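
To make the idea of noisy clustering concrete, the following is a minimal sketch under stated assumptions. The average‐linkage clustering, the uniform noise distribution, the `noise` ceiling, the function names and the toy distance matrix are all choices made here for illustration; they are not taken from Nerbonne et al. (2007).

```python
import random


def upgma_groups(dist, labels, k):
    """Average-linkage clustering stopped at k groups; returns a set of frozensets."""
    clusters = [frozenset([i]) for i in range(len(labels))]

    def linkage(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = [(linkage(a, b), x, y)
                 for x, a in enumerate(clusters)
                 for y, b in enumerate(clusters) if x < y]
        _, x, y = min(pairs)  # closest pair of clusters
        merged = clusters[x] | clusters[y]
        clusters = [c for z, c in enumerate(clusters) if z not in (x, y)] + [merged]
    return {frozenset(labels[i] for i in c) for c in clusters}


def noisy_support(dist, labels, k, noise, runs=200, seed=0):
    """Fraction of noisy re-clusterings in which each group reappears."""
    rng = random.Random(seed)
    n = len(labels)
    counts = {}
    for _ in range(runs):
        noisy = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                # Uniform noise in [-noise, +noise] is an assumption of this sketch.
                value = max(0.0, dist[i][j] + rng.uniform(-noise, noise))
                noisy[i][j] = noisy[j][i] = value
        for group in upgma_groups(noisy, labels, k):
            counts[group] = counts.get(group, 0) + 1
    return {group: c / runs for group, c in counts.items()}


# Hypothetical distances: A and B are close, C and D are close.
labels = ["A", "B", "C", "D"]
dist = [[0.0, 0.1, 0.9, 0.9],
        [0.1, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.0, 0.1],
        [0.9, 0.9, 0.1, 0.0]]
support = noisy_support(dist, labels, k=2, noise=0.05)
```

Raising `noise` lowers the support of weakly separated groups, which mirrors the weakness mentioned above: the result depends on a noise level that has no obvious counterpart in a bootstrap cut‐off.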

1.5.3 CHAPTER 4: To What Extent are Surnames Words? (Manni et al. 2006)

Our focus in this paper is the analysis of surnames. We compare the distribution of surnames to the distribution of dialect pronunciations, which are clearly culturally transmitted. Because surnames, at the time of their introduction, were words subject to the same linguistic processes that otherwise result in dialect differences, one might expect their geographic distribution to be correlated with dialect pronunciation differences. In this paper we concentrate on the Netherlands where two official languages are spoken, Dutch and Frisian. We analyze 19,910 different surnames, sampled in 226 locations, and 125 different words, whose pronunciation was recorded in 252 sites. We find that, once the collinear effects of geography on both surname and cultural transmission are taken into account, there is no statistically significant association between the two, suggesting that surnames cannot be taken as a proxy for dialect variation, even though they can be safely used as a proxy for Y‐chromosome genetic variation. We find the results historically and geographically insightful, hopefully leading to a deeper understanding of the role that local migrations and cultural diffusion play in surname and dialect diversity.
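
One common way to take the collinear effect of geography into account is a partial correlation, sketched below. This is only an illustration of the general idea: the summary above does not state which procedure was actually used, the function names and the toy distances are hypothetical, and a significance test on distance matrices would additionally require a permutation procedure that is not shown here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Hypothetical pairwise distances between the same four pairs of localities.
geography = [4, 4, 2, 2]
surnames = [5, 3, 3, 1]
dialects = [5, 3, 1, 3]

print(pearson(surnames, dialects))                         # raw association
print(partial_correlation(surnames, dialects, geography))  # geography removed
```

In this toy example the raw correlation between surname and dialect distances is entirely due to their shared dependence on geography, so it vanishes once geography is partialled out.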

1.5.4 CHAPTER 5: Footprints of Middle Ages Kingdoms are Visible in the Surname and Linguistic Structure of Spain (Rodriguez Diaz et al. 2016)

To assess whether the present‐day geographical variability of Spanish surnames mirrors historical phenomena at the time of the names’ introduction (13th‐16th century), we have analyzed the frequency distribution of 33,753 unique surnames (tokens) occurring 51,419,788 times, according to the list of Spanish residents of the year 2008. From family names we measured surname distances among the 47 mainland Spanish provinces (from which we infer consanguinity) and compared these distances to the relations among corresponding language varieties spoken in Spain. A dialectometric analysis of the first volume of the Linguistic Atlas of the Iberian peninsula (ALPI) started in 2009 in the laboratory of Hans Goebl (University of Salzburg, Austria). Original data have been analyzed by Goebl to identify phonetic, morphologic, syntactical and lexical features. Each feature has been processed separately in 375 working maps corresponding to 532 sampling points. From this we computed a final 47 x 47 similarity matrix (one sample point for each continental province) according to the relativer Identitätswert (RIW; Goebl 2006). This index measures the similarity between two basilectal varieties as the percentage of items on which the two varieties agree. The comparison of the two bootstrap consensus trees, accounting for surname and linguistic variability, suggests a similar picture; major clusters are located in the east (Aragón, Cataluña, Valencia) and in the north of the country (Asturias, Galicia, León). The remaining regions appear to be considerably homogeneous. We interpret this pattern as the long‐lasting effect of the surname and linguistic normalization actively led by the Christian kingdoms of the north (Kingdoms of Castilla y León and Aragón) during and after the southwards reconquest (Reconquista) of the territories ruled by the Arabs from the 8th to the late 15th century, that is, when surnames became transmitted in a fixed way and when Castilian linguistic varieties became increasingly prestigious and spread out. The geography of contemporary surname and linguistic variability of Spain does in fact correspond to the political geography at the end of the Middle Ages. The synchrony between surname adoption and the political and cultural effects of the Reconquista has permanently forged a Spanish identity that subsequent migrations, internal or external, did not deface.
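
The similarity index described above (the percentage of items on which two varieties agree) can be sketched as follows. The function name, the encoding of each working map as one categorical value per variety, and the decision to skip items missing in either variety are assumptions of this sketch, not a description of Goebl's software.

```python
def riw(variety_a, variety_b):
    """Percentage of jointly attested items on which two varieties agree."""
    compared = [(a, b) for a, b in zip(variety_a, variety_b)
                if a is not None and b is not None]  # skip missing data (assumed)
    if not compared:
        raise ValueError("no item attested in both varieties")
    matches = sum(a == b for a, b in compared)
    return 100.0 * matches / len(compared)


# Hypothetical feature values for five working maps; None marks missing data.
site_1 = ["x", "y", "z", None, "w"]
site_2 = ["x", "q", "z", "r", "w"]
print(riw(site_1, site_2))  # 3 agreements out of 4 comparable items
```

Applying such a function to every pair of sampling points yields a symmetric similarity matrix of the kind used for the 47 provinces.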

1.5.5 CHAPTER 6: Linguistic Probes into the Bantu History of Gabon

In this extensive unpublished chapter we have compared the linguistic and genetic diversity of Gabon (Africa) in order to contribute new elements to the scenarios concerning the early Bantu expansion related to the adoption of agriculture. Two independently obtained datasets have been processed (Bastin et al. 1999; ALGAB, see Hombert 1990) accounting for a total of 126 different varieties documented by Swadesh word lists. They lead to similar results, showing that the languages cluster into a comparable number of groups. The Levenshtein linguistic distances we computed are fully compatible with the classification of Grollemund et al. (2015) based on shared vocabulary, where sharing is operationalized as the percentage of words (not) having the same historical origin. This coding is unnecessary with the Levenshtein method, making it simpler to use and, because of the larger amount of information it accounts for, more sensitive. We have tried to make the genetic dataset more representative of the 17 ethnic groups studied on the genetic side by filtering out all the DNA donors that were born outside the areas typically inhabited by their respective ethnolinguistic communities. The new results confirm the lack of genetic differentiation, which is even more pronounced than previously observed. The linguistic cartography of our classifications shows well‐delimited areas that might be related to early waves of Bantu migrants that crossed Gabon in the early stages of their dispersal from Cameroon and Nigeria.

1.5.6 CHAPTER 7: A Central‐Asian Language Survey (Mennecier et al. 2016)

In the frame of a large research project aimed at describing and comparing the genetic and social differences of sedentary and semi‐nomadic populations living in Central Asia, we documented language varieties (either Turkic or Indo‐Iranian) spoken in 23 test sites by 88 informants belonging to the major ethnic groups of Kyrgyzstan, Tajikistan and Uzbekistan (Karakalpaks, Kazakhs, Kyrgyz, Tajiks, Uzbeks, Yaghnobis). The recorded linguistic material concerns 176 words of the extended Swadesh list. Phonological diversity is measured by the Levenshtein distance and displayed as a consensus bootstrap tree and as multidimensional scaling plots. Linguistic contact is measured as the number of borrowings, from one linguistic family into the other, according to a Precision/Recall analysis further validated by expert judgment. Concerning Turkic languages, the results do not support regarding Kazakh and Karakalpak as distinct languages and indicate the existence of several distinct Karakalpak varieties. Kyrgyz and Uzbek, on the other hand, appear quite homogeneous. Among the Indo‐Iranian languages, the distinction between Tajik and Yaghnobi varieties is very clear‐cut, despite the endangered status of the latter language, whose speakers are in the process of being assimilated into Tajik society. More generally, the degree of borrowing is higher than average where language families are in contact in one of the many sorts of situations characterizing Central Asia: frequent bilingualism, shifting political boundaries, ethnic groups living outside the “mother” country. The latter case is of special interest because it systematically involves varieties that form clusters more coherent than those spoken inside the “mother” country (Kyrgyz of Uzbekistan vs. Kyrgyz of Kyrgyzstan, Tajiks of Uzbekistan vs. Tajiks of Tajikistan, Uzbeks of Tajikistan vs. Uzbeks of Uzbekistan). We suggest that this phenomenon, namely that emigrant varieties are found to be more similar to one another, might be explained i) by the use of a common set of borrowed words and ii) by a shared decreased exposure to a linguistic norm that is levelling them in the mother country. All the trends we measured are supported by the availability of several speakers per village, which allowed correcting for inter‐individual variation.
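
As an illustration of what a Precision/Recall analysis of borrowing candidates involves, here is a minimal sketch. It assumes, hypothetically, that word pairs from the two families are flagged as borrowings when their pronunciation distance falls below a threshold and that the flags are then scored against expert judgments. The threshold, the data and the function name are invented for the example and do not reproduce the procedure of CHAPTER 7.

```python
def precision_recall(distances, is_loan, threshold):
    """Flag pairs with distance <= threshold as borrowings and score the flags."""
    flagged = [d <= threshold for d in distances]
    true_pos = sum(f and t for f, t in zip(flagged, is_loan))
    false_pos = sum(f and not t for f, t in zip(flagged, is_loan))
    false_neg = sum((not f) and t for f, t in zip(flagged, is_loan))
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical cross-family word pairs: distance and expert judgment (True = loan).
distances = [0.1, 0.2, 0.3, 0.6, 0.8]
expert = [True, True, False, True, False]
precision, recall = precision_recall(distances, expert, threshold=0.35)
print(precision, recall)
```

Lowering the threshold generally raises precision at the expense of recall, so a threshold is chosen as a compromise between the two.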

1.5.7 CHAPTER 8: General conclusions and new prospects

By way of reflection on what has been learned, the final chapter provides a wider methodological discussion about the Levenshtein distance, a discussion based on the empirical assays included in the dissertation and on what they show about its specificities in measuring linguistic difference. I will first review the findings about the relation between pronunciation differences and geographic distance, before suggesting a new line of investigation showing how residual Levenshtein distances can provide testable hypotheses about past linguistic convergence and divergence and, perhaps, address the influence that population growth and migrations have on linguistic variability. To do so, I will focus on family names: markers that enable the depiction of migrations that occurred in historical times, that is, concerning European countries, in the last five centuries. Family names, appropriately processed, make it possible to distinguish the regions that received many immigrants from those that have remained demographically more isolated, aspects that underlie dialect and language contact. The last section develops a perspective from which we may examine the effects of migration on language change.
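
The notion of residual Levenshtein distances can be sketched as the part of the linguistic distance that geography does not predict. The example below assumes a simple least‐squares line fitted on raw geographic distance; the actual model used in the final chapter may differ (for instance by transforming geographic distance), and the names and figures are hypothetical.

```python
def regression_residuals(geo, ling):
    """Residuals of linguistic distance after a least-squares fit on geography."""
    n = len(geo)
    mean_geo, mean_ling = sum(geo) / n, sum(ling) / n
    slope = (sum((g - mean_geo) * (l - mean_ling) for g, l in zip(geo, ling))
             / sum((g - mean_geo) ** 2 for g in geo))
    intercept = mean_ling - slope * mean_geo
    return [l - (intercept + slope * g) for g, l in zip(geo, ling)]


# Hypothetical pairs of sites: geographic distance (km) and Levenshtein distance.
geo_km = [10, 20, 30, 40]
lev = [0.10, 0.25, 0.30, 0.45]
residuals = regression_residuals(geo_km, lev)
print(residuals)
```

Under this reading, a positive residual marks a pair of varieties that differ more than their geographic separation predicts (a candidate for divergence), whereas a negative residual marks a pair that is more similar than expected (a candidate for convergence).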

References:

Barrai I. 1993. The origin of HbS in Sicily: a toponomyc study. Human Evolution, 8: 33‐42.
Bastin Y., Coupez A., Mann M. 1999. Continuity and Divergence in the Bantu Languages: Perspectives from a Lexicostatistic Study. Tervuren: MRAC.
Beersma B., Van Kleef G. 2012. Why People Gossip: An Empirical Analysis of Social Motives, Antecedents, and Consequences. Journal of Applied Social Psychology, 42: 2640‐2670.
Bellwood P., Renfrew C. (eds) 2002. Examining the Farming/Language Dispersal Hypothesis. Oxford (UK): Oxbow books.
Boattini A., Lisa A., Fiorani O., Zei G., Pettener D., Manni F. 2012. General Method to Unravel Ancient Population Structures through Surnames. Final Validation on Italian Data. Human Biology, 84: 235‐270.
Boë L.‐J., Bessière P., Vallée N. 2003. When Ruhlen’s ‘’ theory meets the null hypothesis. In: M.J. Solé, D. Recasens, J. Romero (eds.) Proceedings of the ICPhS‐15, 15th International Congress of Phonetic Sciences, Barcelona (Spain), August 3‐9, pp. 2706‐2708.
Bouckaert R., Lemey P., Dunn M., Greenhill S.J., Alekseyenko A.V., Drummond A.J., ... Atkinson Q.D. 2012. Mapping the origins and expansion of the Indo‐European language family. Science, 337: 957‐960.
Bryant D., Filimon F., Gray R. 2005. Untangling our past: Languages, Trees, Splits and Networks. In: R. Mace, C. Holden, S. Shennan (eds.) The Evolution of Cultural Diversity: Phylogenetic Approaches. London (UK): UCL Press, pp. 69‐85.
Cann R.L., Stoneking M., Wilson A.C. 1987. Mitochondrial DNA and human evolution. Nature, 325: 31‐36.
Castles S., Miller M. 2009. The Age of Migration. London (UK): Palgrave Macmillan.
Cavalli‐Sforza L.‐L., Feldman M.W. 1981. Cultural Transmission and Evolution: A Quantitative Approach. Princeton (NJ): Princeton University Press.
Cavalli‐Sforza L.‐L., Piazza A., Menozzi P., Mountain J. 1988. Reconstruction of human evolution: bringing together genetic, archaeological, and linguistic data. Proceedings of the National Academy of Sciences USA, 85: 6002‐6.
Colantonio S., Lasker G., Kaplan B., Fuster V. 2003. Use of surname models in human population biology: a review of recent developments. Human Biology, 75: 785‐787.
Coop G., Bullaughey K., Luca F., Przeworski M. 2008. The Timing of Selection at the Human FOXP2 Gene. Molecular Biology and Evolution, 25: 1257‐1259. doi:10.1093/molbev/msn091.
Darlu P., Bloothooft G., Boattini A., Brouwer L., Brouwer M., Brunet G., Chareille P., Cheshire J., Coates R., Dräger K., Desjardins B., Hanks P., Longley P., Mandemakers K., Mateos P., Pettener D., Useli A., Manni F. 2012. The family name as socio‐cultural feature and genetic metaphor: From concepts to methods. Human Biology, 84: 169‐214.

Darlu P., Zei G., Brunet G. (eds). 2001. Le patronyme. Histoire, anthropologie, société. Paris: CNRS Editions.
Darwin C. 1859. The origin of species. Oxford (UK): Oxford University Press. Facsimile reprint (World Classic series), 1996.
Dessalles J.‐L. 2014a. Why talk? In: D. Dor, C. Knight, J. Lewis (eds.), The social origins of language. Oxford (UK): Oxford University Press, pp. 284‐296.
Dessalles J.‐L. 2014b. Human language: an evolutionary anomaly. In: T. Heams, P. Huneman, G. Lecointre, M. Silberstein (eds.), Handbook of Evolution Theory in the Sciences. London (UK): Springer, pp. 707‐724.
Dunbar R.I.M. 1996. Grooming, gossip, and the evolution of language. Cambridge: Harvard University Press.
Dunn M., Greenhill S., Levinson S., Gray R. 2011. Evolved structure of language shows lineage‐specific trends in word‐order universals. Nature, 473: 79‐82.
Dyen I. 1975. Linguistic Subgrouping and Lexicostatistics. The Hague: Mouton.
Efron B. 1979. Bootstrap methods: another look at the Jackknife. The Annals of Statistics, 7: 1‐26.
Everett D. 2005. Cultural Constraints on Grammar and Cognition in Pirahã. Current Anthropology, 46: 621‐46.
Everett D. 2009. Pirahã Culture and Grammar: A Response to Some Criticisms. Language, 85: 405‐442.
Everett D. 2012. Language: the cultural tool. New York (NY): Pantheon Books.
Falck O., Heblich S., Lameli A., Südekum J. 2012. Dialects, cultural identity, and economic exchange. Journal of urban economics, 72: 225‐239.
Felsenstein J. 1985. Phylogenies and the Comparative Method. The American Naturalist, 125: 1‐15.
Fitch W.T., de Boer B., Mathur N., Ghazanfar A.A. 2016. Monkey vocal tracts are speech‐ready. Science Advances, 2: e1600723.
Fitch W.T., Huber L., Bugnyar T. 2010. Social Cognition and the Evolution of Language: Constructing Cognitive Phylogenies. Neuron, 65: 795‐814.
Forster P., Renfrew C. (eds) 2006. Phylogenetic Methods and the Prehistory of Languages. Oxford (UK): Oxbow books.
Goebl H. 2006. Recent advances in Salzburg dialectometry. Literary and linguistic computing, 21: 411‐435.
Gray R.D., Atkinson Q.D. 2003. Language‐tree divergence times support the Anatolian theory of Indo‐European origin. Nature, 426: 435‐439.
Green R.E., Krause J., Briggs A.W., Maricic T., Stenzel U., Kircher M., et al. 2010. A Draft Sequence of the Neandertal Genome. Science, 328: 710‐722.

Greenhill S.J., Blust R., Gray R.D. 2008. The austronesian basic vocabulary database: From bioinformatics to lexomics. Evolutionary Bioinformatics, 4: 271‐283.
Grimm J. 1819. Deutsche Grammatik, 1. Theil. Göttingen: Dieterich.
Grollemund R., Branford S., Bostoen K., Meade A., Venditti C., Pagel M. 2015. Bantu expansion shows that habitat alters the route and pace of human dispersals. Proceedings of the National Academy of Sciences USA, 112: 13296‐13301.
Haspelmath M., Tadmor U. (eds.) 2009. Loanwords in world’s languages. Berlin‐New York: De Gruyter Mouton.
Heeringa W.J. 2004. Measuring dialect pronunciation differences using Levenshtein distance. PhD thesis. Groningen: Rijksuniversiteit Groningen.
Heeringa W.J., Nerbonne J., Kleiweg P. 2002. Validating dialect comparison methods. Classification, automation, and new media. Berlin, Heidelberg: Springer, pp. 445‐452.
Hillis D., Bull J. 1993. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Systematic Biology, 42: 182‐192.
Holman E.W., Wichmann S., Brown C.H., Velupillai V., Müller A., Bakker D. 2008. Explorations in automated language classification. Folia Linguistica, 42: 331‐354.
Hombert J.M. 1990. Atlas linguistique du Gabon. Revue gabonaise des Sciences de l’homme, 2: 37‐42.
Kaufman T. 1990. Language History in South America: What We Know and How to Know More. In: D.L. Payne (ed.), Amazonian Linguistics: Studies in Lowland South American Languages. Austin (TX): University of Texas Press, pp. 13‐73.
Kessler B. 1995. Computational Dialectology in Irish Gaelic. In: Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics, pp. 60‐67.
Kessler B. 2001. The significance of word lists. Stanford (CA): CSLI Press.
King T.E., Jobling M.A. 2009. What’s in a name? Y chromosomes and the genetic genealogy revolution. Trends in Genetics, 25: 351‐360.
Kirk J.M., Anderson S., Widdowson J.D.A. (eds). 1985. Studies in Linguistic Geography: The Dialects of English in Britain and Ireland. London (UK): Croom Helm.
Krause J., Lalueza‐Fox C., Orlando L., Enard W., Green R.E., Burbano H.A., ... Bertranpetit J. 2007. The derived FOXP2 variant of modern humans was shared with Neandertals. Current Biology, 17: 1908‐1912.
Lasker G. 1985. Surnames and genetic structure. Cambridge (UK): Cambridge University Press.
Manni F. 2010. Sprachraum and Genetics. In: A. Lameli, R. Kehrein, S. Rabanus (eds.) Language and Space. An International Handbook of Linguistic Variation. Volume 2: Language Mapping. Berlin‐New York: Mouton de Gruyter, pp. 524‐541.

Manni F., Barrai I. 2000. Patterns of genetic and linguistic variation in Italy: A case study. In: C. Renfrew, K. Boyle (eds.) Archaeogenetics: DNA and the population prehistory of Europe. Oxford (UK): Oxbow Books, pp. 333‐338.
Manni F., Barrai I. 2001. Genetic structures and linguistic boundaries in Italy: a microregional approach. Human Biology, 73: 335‐347.
Manni F., Guérard E., Heyer E. 2004. Geographic patterns of (genetic, morphologic, linguistic, etc.) variation: how barriers can be detected by Monmonier’s algorithm. Human Biology, 76: 173‐90.

Manni F., Heeringa W.J., Nerbonne J. 2006. To what extent are surnames words? Comparing the geographic patterns of surname and dialect variation in the Netherlands. Special issue of LLC Literary and Linguistic Computing “Progress in Dialectometry: Toward Explanation”, 21: 507‐27.
Margush T., McMorris F.R. 1981. Consensus‐n trees. Bulletin of Mathematical Biology, 43: 239‐244.
Mateos P., Tucker D.K. 2008. Forenames and Surnames in Spain in 2004. Names: A Journal of Onomastics, 56: 165‐184.
Mateos P. 2011. Ethnicity, geography and populations: Tracing diversity and migration through people’s names. Heidelberg (Germany): Springer.
McMahon A., McMahon R. 2005. Language classification by numbers. Oxford (UK): Oxford University Press.
Mellars P. 2006. Archaeology and the dispersal of modern humans in Europe: deconstructing the Aurignacian. Evolutionary Anthropology, 15: 167‐182.
Mennecier P., Nerbonne J., Heyer E., Manni F. 2016. A Central‐Asian survey. Language Dynamics and Change, 6: 57‐98.
Mesoudi A., McElligott A., Adger D. (eds). 2011. Integrating genetic and cultural evolutionary approaches to language. Human Biology, 83, issue 2.
Monmonier M. 1996. How to lie with maps. Chicago (IL): University of Chicago press, 2nd edition.
Moore J.H. 2003. Kin based crews for interstellar multi‐generational space travel. In: Y. Kondo, F.C. Bruhweiler, J.H. Moore, C. Sheffield (eds), Interstellar Travel and Multigeneration space ships. Burlington (): Apogee Books. [This book contains the papers that were presented at the American Association for the Advancement of Science symposium Boston, Massachusetts in 2002].
Nerbonne J. 2009. Data driven dialectology. Language and Linguistic Compass, 3: 175‐198.
Nerbonne J., Heeringa W. 2010. Measuring Dialect Differences. In: J.E. Schmidt, P. Auer (eds.) Language and Space: Theories and Methods in series Handbooks of Linguistics and Communication Science. Berlin: Mouton De Gruyter, pp. 550‐567.
Nerbonne J., Heeringa W., van den Hout E., van de Kooi P., Otten S., van de Vis W. 1996. Phonetic Distance between Dutch Dialects. In: G. Durieux, W. Daelemans, S. Gillis (eds.) CLIN VI: Proceedings of the Sixth CLIN Meeting. Antwerp: Centre for Dutch Language and Speech (UIA), pp. 185‐202.
Nerbonne J., Kleiweg P., Heeringa W., Manni F. 2007. Projecting Dialect Differences to Geography: Bootstrap Clustering vs. Noisy Clustering. In: C. Preisach, L. Schmidt‐Thieme, H. Burkhardt, R. Decker (eds.) Data Analysis, Machine Learning, and Applications. Proceedings of the 31st Annual Meeting of the German Classification Society. Berlin: Springer.
Nevins A., Pesetsky D., Rodrigues C. 2009. Pirahã Exceptionality: A Reassessment. Language, 85: 355‐404.
Nichols J. 1992. Linguistic diversity in space and time. Chicago (IL): University of Chicago Press.
Pagel M., Atkinson Q.D., Calude A.S., Meade A. 2013. Ultraconserved words point to deep language ancestry across Eurasia. Proceedings of the National Academy of Sciences USA, 110: 8471‐8476.
Renfrew C., Boyle K. (eds). 2000. Archaeogenetics: DNA and the Population Prehistory of Europe. Oxford (UK): Oxbow books.
Rodríguez‐Díaz R., Manni F., Blanco‐Villegas M‐J. 2015. Footprints of Middle Ages Kingdoms Are Still Visible in the Contemporary Surname Structure of Spain. PLoS ONE, 10: e0121472. doi:10.1371/journal.pone.0121472
Ruhlen M. 1987. A guide to the world’s languages. Stanford (CA): Stanford University Press, vol. 1.
Ruhlen M. 1994. On the Origin of Languages: Studies in Linguistic Taxonomy. Stanford (CA): Stanford University Press.
Sankoff D., Kruskal J. (eds.). 1983. Time Warps, String Edits, and Macromolecules. The Theory and Practice of Sequence Comparison. Reading (MA): Addison‐Wesley.
Sokal R.R., Rohlf J. 2001. Biometry: the principles and practice of statistics in biological research. New York (NY): W.H. Freeman & Co., 3rd edition.
Steele J., Shennan S. (eds). 2009. Demography and cultural macroevolution. Special double‐issue of Human Biology, 81, issue 2‐3.
Stoneking M., Cann R.L. 1989. African origin of human mitochondrial DNA. In: P. Mellars and C. Stringer (eds.) The Human Revolution: Behavioural and Biological perspectives on the Origins of Modern Humans. Princeton (NJ): Princeton University Press, pp. 17‐30.
Swadesh M. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics, 21: 121‐37.
Swadesh M. 1971. The Origin and Diversification of Language. Ed. post mortem by J. Sherzer. Chicago (IL): Aldine.
Takahashi H., Takahashi K., Liu F.C. 2010‐2013. FOXP Genes, Neural Development, Speech and Language Disorders. In: Madame Curie Bioscience Database [Internet]. Austin (TX): Landes Bioscience.

Turner M. 1998. The Literary Mind. The Origins of Thought and Language. New York (NY): Oxford University Press.
Vargha‐Khadem F., Gadian D.G., Copp A., Mishkin M. 2005. FOXP2 and the neuroanatomy of speech and language. Nature Reviews Neuroscience, 6: 131‐138.
Wichmann S., Holman E.W., Brown C.H. (eds.). 2016. The ASJP Database (version 17).

CHAPTER 2

This chapter has been published, please cite the original reference:

Manni F. 2010. Sprachraum and Genetics. In: A. Lameli, R. Kehrein, S. Rabanus (eds.) Language and Space. An International Handbook of Linguistic Variation. Volume 2: Language Mapping. De Gruyter, Berlin‐New York. pp. 524‐541. [Reprint permission asked 30/5/17]


ABSTRACT In this chapter I address the viewpoint of a geneticist with respect to genetic and linguistic cartography in order to provide an historical and methodological background reviewing the steps that led some population geneticists to co‐operate with linguists, a collaboration that started with the comparison of maps. Mapping, in genetics, has a quite recent tradition and is not very accurate, because genetic sampling is generally uneven and relies, to a wide extent, on available published material. I review the first attempts to obtain maps of the genetic variability of populations, to later focus on the discontinuities in the genetic landscape, which are the “boundaries”, the barriers to gene flow (Manni et al. 2004). It is generally difficult to explain the geographical distribution of genes because genetic variability arises over time‐scales that are often much deeper than historical times and because it depends not only on geographical features but also on cultural divides that have changed or disappeared without leaving traces. To provide evidence of some successful attempts to identify and explain these barriers I present some results concerning the surname and linguistic variability in the Netherlands (see CHAPTER 4).

SPRACHRAUM AND GENETICS

To avoid any misunderstanding, I am a population geneticist. Therefore, and before coming to more straightforward topics expected in a manual about maps, I think it might be useful to provide an historical overview of the steps that led some population geneticists to co‐operate with linguists, since this research path was not always as obvious as it may seem today.

Actually, population genetics is a kind of biological anthropology, and focusing on maps would thus have meant illustrating the rich cartographic tradition within anthropology. I am not equipped to do so and therefore address only those geographic issues related to population genetics studies. Further, my presentation of genetic cartographical methods is not exhaustive. I have decided to mention only those best known within the discipline and to ignore those used only sporadically. For a more complete list of other methods, I recommend the excellent article by Darlu (1997).

The first section of this chapter will sound more like a complicated story than a scientific summary to some readers; this was my intention. In this specific field, the available scientific summaries display a tendency to smooth over the real motivations and philosophies behind the different approaches. By making these explicit, I hope to clarify what motivated some population geneticists to turn to an interest in cultural transmission.

36 CHAPTER 2

2.1 GENETIC MAPPING

The first attempt to represent genetic data on a geographic map can probably be traced back to John Burdon Sanderson Haldane (1892‐1964). In 1940 he published an article in which the blood‐group frequencies of European peoples were displayed as contour maps (Fig. 2.1). The study is also exemplary in terms of the quality of the data. Nowhere in contemporary studies have I found such a deep and critical discussion of the data used. As the reader will see, sampling issues are of fundamental importance in comparisons of genetic and linguistic data. I would also remind those interested in consulting the literature on this subject that, in genetics, the term cartography has nothing to do with geographical mapping, since it refers exclusively to procedures adopted to localize genes on chromosomes.

Figure 2.1  Contour lines separating zones with different allele frequencies of the ABO blood groups in Europe. Lines are drawn without reference to the points m, o, , , r, v, , which represent minorities and a small Spanish sample (E). Patterns appear geographically consistent. Exceptions are related to small samples. The figure is an original drawing by Haldane himself (1940).

Haldane's aim, following in the footsteps of physical anthropologists, was to infer information about the pre‐Neolithic past of European populations from contemporary blood‐group variability. Almost forty years after Haldane, blood phenotypes still constituted the only genetic markers that offered satisfactory worldwide geographic coverage of human populations, as the study by Arthur Ernest Mourant (1904‐1994) and co‐workers, completed in 1976, shows. Following Haldane, Mourant displayed his results as contour maps, this time drawn by eye. This kind of representation can no longer be considered fashionable and has often been criticized, mainly because of the visual artifacts that originate from the interpolation of unevenly sampled data. In any case, such representations were a step forward compared to the dendrograms that had long been the standard way of representing genetic differences between human populations. Dendrograms tend to visualize such variability in an artificial way and seem antithetical to true geographical mapping. Even if their geometry may reflect geography, their spatial interpretation is not explicit.

To increase precision, later studies were designed to use more DNA polymorphisms (usually called markers) than the few responsible for the typology of blood groups. The graphic presentation of the outcomes therefore had to be modified in order to display the frequency of several items simultaneously. Instead of representing the variability of single markers on thematic maps, in a manner similar to linguistic atlases, there was a need to show overall variability in a single picture. In this context, aggregate analyses were adopted, allowing a depiction further abstracted from the stochastic processes that influence patterns in the variation of single markers. Evolution acts differently on different genes, since the selective pressure acting on certain variants can be absent on others. Moreover, variants can differ in their frequency and geographic distribution according to random molecular mechanisms at the DNA level as well as the demographic history of the populations that carry them.
A way of obtaining visually appealing aggregate analyses was introduced to genetics by Alberto Piazza (1941‐). His suggestion was to plot each of the first (often three) components of a principal components analysis (PCA) in turn on the z‐axis of a three‐dimensional map, the x‐ and y‐axes being assigned to the longitude and latitude of each sample (Menozzi, Piazza and Cavalli‐Sforza 1978a, 1978b). According to their ranking, principal components portray a decreasing fraction of the total variance of the samples, and the percentage of variance represented by the first three components is generally enough to give a satisfactory representation of the whole variability.

This synthetic representation was finally adopted by Luigi Luca Cavalli‐Sforza (1922‐) and co‐workers in a reference book about the history and geography of human genes (1994). To aid visual recognition, such maps typically display eight classes of principal component values. The choice of increasing or decreasing shading density is totally arbitrary; it could be reversed without any loss of information. Intermediate classes are close to the average, while extreme classes indicate populations that globally differ most from each other for the particular principal component under study. Populations and regions with similar shading need not be similar, for they may be very different with regard to another principal component. In such synthetic maps, the location of samples is not displayed. Fig. 2.2 is a good example; according to the authors, it displays the genetic variability of the American continent along the first principal component. The map emphasizes the distinction between the Eskimo + Na‐Dene group and the Amerind populations closer to Eskimos on the one side, and the rest of America on the other. In South America there is differentiation between east and west.

The technique has since become very popular and has caught the eye of scholars from outside the discipline (archaeologists and historical linguists) as well as the attention of a broader public. It is likely that the readers of this contribution are already familiar with this aspect of Cavalli‐Sforza's work, that is, investigating whether the genes of modern populations might contain an inherited historical record of the human species.

Figure 2.2 Example of the Alberto Piazza’s strategy for mapping separately principal components (PCs) extracted from a multivariate set of marker data (72 genes) This map displays only the first PC and accounts for 32.6 percent of the total variance meaning that other PCs can reveal further important patters. Here a north‐south gradient with the greatest slope in Canada is visible (redrawn from Cavalli‐Sforza et al. 1994).


Such methodological innovation aside, the scientific setting of that period needs to be recalled. At the time, the so‐called multiregional theory was quite popular. This theory, developed in the early 1980s by Milford H. Wolpoff (1942‐; see Thorne and Wolpoff 1981; Wolpoff, Xinzhi and Thorne 1984), was intended to explain the apparent similarities between fossils of Homo erectus and Homo sapiens inhabiting the same region. Wolpoff and his supporters explained such apparent regional continuity by suggesting that all extant human populations were the descendants of humans that left Africa at least a million years ago, through a web of lineages in which the genetic contributions to all living peoples varied regionally and temporally. In other words, according to multiregionalism, the different continents had all been inhabited for comparable lengths of time, so that their inhabitants had enough time to differentiate genetically in place. Even if some transversal gene flow was not excluded, significant genetic diversity between continents was expected.

The results that the students of Allan Wilson (1934‐1991) were obtaining in his laboratory in the late 1980s (Cann, Stoneking and Wilson 1987; Stoneking and Cann 1989) agreed that living populations have a common African origin, but disagreed with the timing suggested by Wolpoff. Genetic evidence strongly suggested a migration wave that left Africa around one to two hundred thousand years ago and, roughly, went off to central Asia and Europe on one side and to eastern Asia and the Americas on the other. In the absence of any molecular evidence for admixture with previously established populations, their conclusion was that extant human groups have been living in different continents for too short a time to exhibit genetically the regional (continental) differentiation advocated by the multiregional hypothesis.
To corroborate Wilson's model, Cavalli‐Sforza's group (1988) had the idea of first focusing on cultural evolution and then comparing a tree summarizing the genetic differences of human populations on a global scale with the classification of the world's languages, consisting of 17 families or phyla, suggested by Merritt Ruhlen (1987), a collaborator of Joseph Greenberg (1915‐2001). A close match between the two trees would mean that languages and genes differentiated over a similar time frame, thus demonstrating a synchronization process between cultural and biological diversity, in contradiction to Wolpoff's multiregionalism. When published, the results supported the idea of a strict correspondence between genetic and linguistic differences and were compatible with the time frame provided by archaeologists (see Fig. 2.3). Scientific views have since evolved greatly and my summary is only of historical interest. Multiregionalism is related to the evolution of humanity and the debate is usually rather lively in that multidisciplinary field. I would therefore prefer to be prudent and say that I have oversimplified the question.

Figure 2.3 Genetic tree of 42 populations from all over the world plotted against the linguistic classification of Ruhlen (1987). Redrawn from Cavalli‐Sforza et al. 1988.

Even if the use of Ruhlen's classification was contentious (the majority of historical linguists were convinced that a genealogical tree of human languages was no more than wild speculation; an updated version of this reluctance can be seen in McMahon and McMahon 2006) and some methodological problems concerning the treatment of genetic data were highlighted, the paper had the virtue of attracting the attention of both scientific communities to the possible synchronization of linguistic and genetic evolution hypothesized by Cavalli‐Sforza. This line of research prompted other scholars to focus on these questions, to a degree that explains why I, a geneticist, am writing a chapter addressing linguistic issues.

Apart from Ruhlen and some glottochronological studies, historical linguists were very averse to providing a genealogical tree of languages, even at a macro‐family or family level. This is probably the reason why some population geneticists started to compare the results of genetic surveys, usually consisting of distance matrices and frequency vectors, with mathematically computed matrices of linguistic diversity in order to compute a statistical correlation between genetic and linguistic variability (Sokal 1988; Poloni et al. 1997; Lum et al. 1998). In order to compute a numerical table summarizing linguistic diversity, the authors of such studies developed their own rudimentary mathematical methods, not validated by linguists. For example, Sokal (1988) constructed a matrix assigning a distance of zero to samples sharing the same language family, a distance of one to samples belonging to different language families within the same linguistic phylum, and a distance of two to samples belonging to different phyla. Using this method he concluded that the correlation between genetic and linguistic variability in Europe was significant, even when geographic differentiation was allowed for. As a result of such ad hoc approaches, part of the linguistic community became quite disinclined to consider the issue of the relation between linguistic and genetic affinity. The cultural limits of the two disciplines were reached and, to a large extent, still hold today.

In a similar vein, Dupanloup de Ceuninck et al. (2000) demonstrated a remarkable correlation between genetic and linguistic markers but, at the same time, Rosser et al. (2000) found the correlation between genetic and linguistic distance to be misleading. Since both are highly correlated with geographic distance they seemed interrelated, but when geographic distance was taken into account their correlation disappeared, much as shoe size and reading ability correlate in children because both correlate strongly with age.
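The two ideas just described, a three‐level linguistic distance of the kind attributed to Sokal (1988) and a correlation that vanishes once a third variable is controlled for, can be sketched as follows. This is only an illustration under my own assumptions: the function names, the four example samples, and the simulated "age, shoe size, reading" data are hypothetical, and the partial correlation is the textbook first‐order formula rather than the procedure used in any of the cited studies.

```python
import numpy as np

def sokal_style_distance(families, phyla):
    """0 = same family, 1 = different families in the same phylum,
    2 = different phyla (the three-level scheme described in the text)."""
    n = len(families)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if phyla[i] != phyla[j]:
                D[i, j] = 2
            elif families[i] != families[j]:
                D[i, j] = 1
    return D

def upper_triangle(D):
    """Flatten the pairs i < j of a square distance matrix."""
    i, j = np.triu_indices(D.shape[0], k=1)
    return D[i, j]

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Hypothetical labels for four samples.
families = ["Romance", "Romance", "Germanic", "Finno-Ugric"]
phyla = ["Indo-European", "Indo-European", "Indo-European", "Uralic"]
D = sokal_style_distance(families, phyla)

# Simulated illustration of the shoe-size analogy: both variables depend on age only.
rng = np.random.default_rng(0)
age = rng.normal(size=2000)
shoe = age + rng.normal(size=2000)
reading = age + rng.normal(size=2000)
raw = np.corrcoef(shoe, reading)[0, 1]
partial = partial_correlation(shoe, reading, age)
```

In a real comparison, `x`, `y`, and `z` would be the flattened genetic, linguistic, and geographic distance matrices (for instance via `upper_triangle`). Because the entries of a distance matrix are not independent observations, the significance of such a coefficient is usually assessed by permutation rather than by standard tables.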
The result of Rosser et al. (2000) contradicted previous studies based on other markers (Sokal 1988; Poloni et al. 1997). What diminished the impact of their study was the failure to provide an exhaustive explanation for the observed correlation between language diversity and the linear geographic distance separating groups of speakers.

Interestingly, a discussion of the association, as related to the variability of the climate, can be found in earlier papers by Daniel Nettle (1998, 1999). According to this scholar, where the variability of the climate is greater, the size of the social network necessary for reliable subsistence is larger and single languages tend to be more widespread. On the other hand, where the climate allows continuous food production throughout the year, small groups can be self‐sufficient and the population can fragment into many small language groups. This hypothesis may be compatible with the observed link between geography and language, since climate is correlated with latitude, which in turn is one component of geographic distance. Such a model, valuable on a global scale, would imply that the kind of correlation existing between genetic and linguistic difference could not be linear, but what about smaller geographic scales? My feeling is that general models, however fascinating, are often locally inaccurate, and so they are of little help in understanding the history of specific areas. Any general correlation between two distance matrices is tricky, since it does not tell us where on a geographic map the link is stronger or weaker than the overall correlation.

To cut a long story short, I will not give a chronological list of all the important papers relating to the genetics/linguistics comparison published in the last twenty years; such publications are just the latest episode in a long exchange between scholars from the natural and linguistic sciences. The first biologist (in a broad sense) to express an interest in historical linguistics was Charles Darwin. In his book on the origin of species (1859: 342) he stated that:

If we possessed a perfect pedigree of mankind, a genealogical arrangement of the races of man would afford the best classification of the various languages now spoken throughout the world; and if all extinct languages, and all intermediate and slowly changing dialects, were to be included, such an arrangement would be the only possible one.

In this famous quotation, Darwin mentioned linguistics more as an analogy to illustrate his theory of evolution than as a real research interest. Nevertheless, his ideas had a considerable influence in linguistics. Cross‐fertilization between linguistics and biology is not a recent phenomenon, as the work of Charles Lyell (1797‐1875), Ernst Häckel (1834‐1919), August Schleicher (1821‐1868), and Jean‐Louis de Quatrefages (1810‒1892) demonstrates. Particularly important is Schleicher's organicist theory, published in 1863; in this paper, essential for the development of historical and comparative linguistics, one of the first language trees was published.

In more recent times, the theoretical and philosophical interaction between biologists and linguists continued until the discovery of the structure of DNA (1953) and then seemed to decline. Since the 1970s, geneticists have been very busy re‐establishing the bases of their own discipline through a molecular prism. The interest in biological anthropology and the remote history of human populations was probably revived by the possibility of using DNA sequence variation for phylogenetic reconstruction. The popularity of this methodology grew throughout the 1970s and exploded in the last decade of the 20th century with the advent of PCR (polymerase chain reaction), a method for replicating DNA sequences. As a consequence, new molecular markers, namely special fragments of the genome sequence, became available, thus allowing a very efficient detection of the genetic differences between populations.


2.2 BOUNDARIES IN THE GENETIC LANDSCAPE

Robert Sokal (1927‒2012) was skeptical about Piazza's genetic mapping (Fig. 2.2), since synthetic maps based on principal component analysis were subject to large errors. According to Sokal, these maps, however intriguing, conveyed a false sense of precision insofar as the interpolation process used to fit contour lines to point‐referenced data strongly influenced the mapped pattern, enabling the detection of apparent geographic trends even in spatially random data.

Even though these reservations were made public only much later (Sokal, Oden and Thomson 1999a, 1999b), they explain why Sokal and Guido Barbujani (1955‒) developed alternative ways of representing genetic variability. Their suggestion was to identify sharp differences in a genetic landscape in order to detect possible barriers to gene flow. In a 1990 paper based on European data, they tested whether genetic boundaries, that is, zones where the rate of change of a given genetic variable is higher, were also linguistic boundaries (Fig. 2.4). To do so they adopted the so‐called "wombling" procedure (Womble 1951), the application of which to population genetics was novel. This technique is based on the analysis of surfaces generated from the frequency of a given marker in a territory, much like a map in physical geography. Boundaries are determined by computing the first derivative of such surfaces, that is, the slope of their undulations. The steeper the slope, the larger the first derivative. Abrupt changes are considered synonymous with boundaries, and such barriers are seen as significant when they appear in the same zone of different surfaces, meaning that several genetic markers exhibit the same pattern. With this paper, probably for the first time, geographic research directly inspired genetic mapping.

In a similar fashion, Barbujani et al. (1996) introduced to genetics the "maximum difference algorithm" developed by Mark Monmonier in 1973. This latter method turned out to be well suited to identifying those samples that differed strongly from their neighbors, without the interpolation that wombling requires in order to compute surfaces. Incidentally, it should be kept in mind that boundary methods epitomize geneticists' interest in the differences between populations rather than their homogeneity. This is understandable insofar as just fifteen percent of the variance of the human genome can be explained by differences between population groups, in contrast to individual differences within a population, which account for eighty‐five percent of the total variance; this is the reason why the scientific definition of race does not apply to humans. Linguists, whose groups are normally not difficult to distinguish, may for this reason not have considered such methods.


Figure 2.4 Wombling analysis of 60 human allele frequencies in Europe. The lengths of the rods are proportional to the average magnitude of the gene‐frequency change. The directions of rods are the average directions of the maximum slope across gene‐frequency surfaces. Only most significant differences are represented. Solid and dashed lines repre‐ sent the gene‐frequency boundaries. See Sokal and Barbujani (1990) for further details. The figure has been redrawn and adapted.

Although genetic cartography has not developed much in the past, the mathematical relations between genetic and geographic distances have long been addressed. "Isolation by Distance" (IBD), a model by Sewall Wright (1889‒1988; Wright 1943), one of the founders of population genetics, postulates that the genetic similarity of human groups decreases with geographic distance as the result of spatially limited gene flow, a commonly observed phenomenon in natural populations. Gustave Malécot (1911‒1998) and other theorists pushed this analysis forward by establishing that the relation between genetic and geographic distance is not linear but logarithmic (1948). Nowadays, following Sokal and Oden (1978a, 1978b), current practice is to compute a general correlation between the matrices of genetic and geographic distances or to compute spatial autocorrelograms. Then, if the dependence is found to be statistically significant, it is possible to compute a regression using a logarithmic transformation (Fig. 2.5).
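A general correlation between two distance matrices is commonly assessed with a permutation (Mantel‐type) test, because the entries of a distance matrix are not independent of one another. The following is a minimal sketch of that idea, assuming two symmetric matrices over the same localities; the function name `mantel_test`, the eight positions along a line, and the logarithmic "genetic" distances are hypothetical and serve only to exercise the code.

```python
import numpy as np

def mantel_test(A, B, permutations=999, seed=0):
    """Correlation between the upper triangles of A and B, with a one-sided
    permutation p-value obtained by relabelling the localities of A."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    iu = np.triu_indices(A.shape[0], k=1)
    observed = np.corrcoef(A[iu], B[iu])[0, 1]

    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(permutations):
        p = rng.permutation(A.shape[0])
        permuted = A[np.ix_(p, p)]          # permute rows and columns together
        if np.corrcoef(permuted[iu], B[iu])[0, 1] >= observed:
            hits += 1
    p_value = (hits + 1) / (permutations + 1)
    return observed, p_value

# Hypothetical localities along a line; distances in arbitrary units.
positions = np.array([0.0, 1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0])
geo = np.abs(positions[:, None] - positions[None, :])
gen = np.log1p(geo)                          # made-up "genetic" distances

r, p = mantel_test(gen, geo)
```

Rows and columns are permuted together so that each permuted matrix is still a valid distance matrix over relabelled localities; shuffling individual cells would destroy that structure.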


[Plot: vertical axis "Difference [%]"; regression line; observations A and B with their residuals rA and rB]

Figure 2.5  Regression in which the distances between, say, pairs of dialects are plotted against the logarithm of their geographic distances in a Cartesian diagram. In the example, the majority of observations (empty dots) lie close to the regression line, which can be seen as a good approximation of overall variability. We can conclude that dialect difference increases in proportion to the logarithm of geographic distance. Two exceptions are apparent: the distances A and B, which lie quite far from their expected positions according to the regression. To highlight such deviations numerically, we can compute the residual distances by subtracting expected distances (according to the regression) from observed differences. For A and B, the residuals are respectively 45 − 26 = 19 (rA) and 3 − 18 = −15 (rB); we can then conclude that the distance A is unexpectedly high and B is unexpectedly low when compared to other measures of diversity occurring at the same geographical distance.
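The residual computation described in the caption can be sketched in a few lines. This is an illustration under stated assumptions: the function name `log_distance_residuals` and the six hypothetical pairs of localities are mine, a natural logarithm is used (another base would only rescale the slope), and the fit is an ordinary least‐squares line. Only the four values 45, 26, 3, and 18 come from the figure.

```python
import numpy as np

def log_distance_residuals(geo_distance, observed):
    """Fit observed = a + b * log(geo_distance) by least squares and
    return (a, b, residuals), where residual = observed - expected."""
    x = np.log(np.asarray(geo_distance, dtype=float))
    y = np.asarray(observed, dtype=float)
    b, a = np.polyfit(x, y, 1)        # slope first, then intercept
    expected = a + b * x
    return a, b, y - expected

# The two worked values from Fig. 2.5: observed minus expected.
residual_a = 45 - 26                  # positive: A is higher than expected
residual_b = 3 - 18                   # negative: B is lower than expected

# Hypothetical pairs lying exactly on a logarithmic curve, so residuals vanish.
geo = np.array([10.0, 20.0, 50.0, 100.0, 200.0, 400.0])
obs = 5.0 + 4.0 * np.log(geo)
a, b, residuals = log_distance_residuals(geo, obs)
```

Applied to every pair of localities, the residuals can be arranged back into a square matrix, which is the matrix of residuals discussed in the text that follows.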

This step leads to two matrices of genetic distances: one accounting for the original database and a second matrix based on the regression. Of course, if the genetic distance between pairs of localities were exactly proportional to their geographic distance, there would be no difference between the two matrices. In reality this does not occur, since distances based on original data can be higher or lower. We were interested in such differences (known as residuals) because, according to a working hypothesis, higher residuals were expected to mirror "abnormally" decreased contact and vice versa. Genetic distance values that fit the predictions based on the regression are considered "normal".

In other words, the matrix of residuals accounts for that fraction of the genetic variability that is not explained by spatially limited gene flow, as if all populations were at zero distance from one another. Possible systematic errors of measurement, peculiar geographic features (such as deserts or mountains), cultural barriers, and potential differences related to historical phenomena should all be more apparent when a matrix of residuals is analyzed through a multivariate or clustering method. Unfortunately, in current practice this procedure is almost never followed, even at an exploratory level.

To return to the barrier methods of Barbujani and Sokal (1990), it should be emphasized that they are linked to the Isolation by Distance model (IBD). The chances of finding discontinuities (boundaries) in the pattern of change of genetic variation dwindle when genetic distances fit the IBD. Put the other way around, isolating factors (cultural, geographic, or anthropomorphologic) are likely to act against the expectations of the IBD model by weakening gene flow. As a general intuitive rule, the lower the correlation between genetic and geographic distance, the more barriers are to be expected, and vice versa. This is also true in dialectology: similar varieties tend to form near each other as the result of a "linguistic flow" between neighbors, a flow that is not uniformly constant; the search for boundaries thus makes sense in linguistics as well.

2.3 HOW TO IDENTIFY POPULATIONS IN A LANDSCAPE: THE SAMPLING PROBLEM

For the past 70 years, population geneticists have called the groups their science describes "demes": groups of individuals that are supposed to be more similar to each other than to any other individual (Gilmour and Gregor 1939). Although demes could be identified on the basis of the anonymous grid sampling adopted by wildlife biologists, this strategy is not followed by human population geneticists. In fact, the maps of human groups that the field biologist's random sampling approach to genetic diversity would produce would bear little resemblance to a map of the world's self‐identified human groups (Juengst 1998); as an understandable shortcut, demes have often been assumed to be such (self‐)identified populations. Since the bases of self‐identification can change over time for all kinds of social or political reasons, ethnological knowledge and language have been taken as a good proxy with which to identify social groups. From experience, I can say that the methods of field ethnology are quite distant from the rapid surveys of population geneticists, which often consist of going from village to village to collect blood and saliva. Language has thus become an easier proxy for ethnic identity with which to identify "genetic" populations. I hardly need to add that populations speaking rare and hard‐to‐classify languages seem very attractive to population geneticists in their quest to depict the remote history of human populations.

In recent state‐of‐the‐art samplings, the DNA sequence is linked to the geographical location of the donor and to the language she/he speaks. Hence, Sprachraum is integrated into the sampling itself, and issues relating to bilingualism or multilingualism are usually skipped over since they add a level of complexity that cannot be easily addressed. It is true that the reality of spoken languages is far more fluid and informal than is reflected in the concepts of static language families that emerged in the eighteenth and nineteenth centuries. The equation "one people = one language" is a Eurocentric perception of reality, since languages became codified and fixed in grammars and dictionaries to suit the practical needs of the centralizing European bureaucracies that came to dominate in that very same period. In contrast, non‐European systems of administration posited alternatives, notably religion and tribe, as a preferred basis for identity. However, multilingual situations are very difficult to handle, even for purely linguistic research purposes. When the genetic aspect is added, the degree of complexity can almost become too much to be properly addressed.

A further problem arises when language no longer acts to define a group of people we suppose to be a deme, and instead becomes a point of comparison. To base the sampling criteria in one domain on data from the other weakens the importance of any relationship detected (see McMahon 2004 for further discussion). Another interesting issue was raised by MacEachern (2000: 361) when he noted that, in the 1994 reference book by Cavalli‐Sforza and co‐workers, the hunter‐gatherers speaking Hadza in Tanzania (fewer than 1,000 individuals) occupy the same analytical status as the French population (60,000,000), although it is obvious that the two populations, differing in size by more than four orders of magnitude, are defined according to very different socio‐political beliefs.
According to the historian Benjamin Braude (1945‒); personal communication 2007), genetic research has probably given insufficient attention to recent research into myths of origin and collective identity, resulting in problematic sampling catego‐ ries, probably derived from the norms of the eighteenth‐century European Christian male bourgeoisie. As historians, anthropologists, and some geneticists have demon‐ strated, such common organizing principles as continent, tribe, and language are cul‐ tural constructs responding to transitory social, cultural, and political needs (see also Braude 1997; Lewis and Wigen 1998; Judson 2007; Scott 1998). Although the variability of genetic markers is heavily influenced by stochastic phenomena (random genetic drift, mutation) and selective processes, nowadays there is a degree of consensus that a trustworthy genetic portrait of the differences between human populations can be achieved through a carefully designed sampling consist‐ ing of (1) a reasonable number of DNA donors (no less than thirty per sampled place), (2) good ethnological knowledge of the people undergoing the genetic testing (even if it suffers from the limitations highlighted in the previous section), and (3) a

48 CHAPTER 2 sampling grid providing good geographic coverage of the territory, possibly propor‐ tional to the demographic size of populations. Nonetheless, even when all these re‐ quirements are met completely (which is not often), uncertainty remains. Since there are no biologically meaningful human races and there are no ʺpureʺ populations, even in the most remote and isolated valleys, it is very difficult to control genetic re‐ sults for recent migrations and hence, in metaphorical terms, each population can be seen as a fruit salad, which can have a dominant flavor but still remains a salad. In contrast to population genetics, linguistic surveys are often based on the speech of just one informant who fulfills some stringent criteria. The language variety she/he speaks is then taken as the variety of that place, even if other (micro)varieties are spoken in the same area. As a consequence, if a linguistic fruit salad exists some‐ where, it will hardly be recognized, as the migration‐related variance in the sampling cannot be computed. If a different speaker were interviewed, to what extent would the final linguistic analysis (dendrogram, multivariate analysis, isogloss map, etc.) have been the same? This is a question that points more to the population speaking a language than to the language itself. To be honest, since the sociolinguistic turn in the 1960s, lin‐ guists are usually well aware of the limitations of their sampling method — but it is still used, and available data have often been collected in this way. Similar criticisms concerning dialects have been expressed by J. K. Chambers (1938‒), who suggests dialect topography as an alternative to dialect geography (Chambers 1994). 
The im‐ provements in this kind of approach that are of special interest to a geneticist consist mainly of continuous area samplings and representative samples where both sexes and all ages and social classes are interviewed, in contrast to the NORMs (non‐ mobile, older, rural, male informants) of older studies. The situation is in reality quite different, since dialect topography is still in its infancy and NORM‐based surveys are the rule; this means that available dialect material should only be compared to ge‐ netic data with caution because of the different sampling approaches. Further, changing the scale to languages (wider areas), when we put a linguis‐ tic and a genetic classification side by side, we are comparing an ʺartificialʺ object, the language, with something less artificial, more ʺdirtyʺ in a way — the genetic fruit salad of individuals who constitute a population. The latter is a ʺmixtureʺ in which the immigrants, unless very recent, cannot be identified. All of the linguistic borrow‐ ings (not just the historical ones) that would be highly informative about the popula‐ tion history of the speakers, i.e., their social or cultural interaction with other groups, are minimized. In historical linguistics, at a lexical level, the sampling has often been done on the basis of Swadesh lists, consisting of either 100 or 200 concepts. These special word lists are intended to minimize borrowings in order to better reflect ancient linguistic

affiliations. For English, 99 percent of the words listed in the Oxford English Dictionary are borrowings from other languages (McWhorter 2001), but in the 200‐word Swadesh list only six percent of the lexicon appears to come from foreign languages (Embleton 1986). Now, if we compare a linguistic classification based on Swadesh lists, where contact is minimized (which is the only relevant aspect here; for further discussion of Swadesh lists see Kessler 2001), with genetic surveys where contact cannot be similarly minimized, are we looking at two sides of the same coin? Probably not. Therefore, I suspect that historical linguistic studies based on word lists like the Swadesh list can lead to an underestimation of potential migrations and therefore to contradictory results when genetic and linguistic comparative investigations are undertaken. As a possible solution, I suggest sampling more individuals in each location and recording words not included in the Swadesh list. An interesting example is provided by the technical lexicon of some languages, which can be well suited to revealing the past history of populations and their cultural contacts. There are many examples of this, such as (1) the use of Italian words in the banking system of many countries (agio, banca, bancarotta, cassa, conto, costo, fallire, giroconto, tariffa, valuta, etc.), (2) the English use of many nautical terms from Dutch (ahoy, amidships, avast, ballast, boom, cruise, dock, freight, jeer, skipper, tub, etc.), or (3) the technical vocabulary related to hunting and honey gathering that is extremely similar among the Aka (Central African Republic) and Baka (Cameroon) Pygmy peoples, thus revealing their possible common origin. This finding is very challenging, since the Aka and Baka now speak languages belonging to different families, Bantu and Ubangian, respectively (Bahuchet 1996). 
Technical dictionaries, in contrast to toponyms, tell us more than loan words because they imply a close contact between the speakers of the two languages that, in some cases, might suggest a parallel gene flow.

2.4 REGIONAL STUDIES

When the scale diminishes it is easier to take Sprachraum into account in a more straightforward way, since linguistic differences can be expressed in terms of language varieties that have often been studied in great detail and are readily available as linguistic atlases. At such lower scales it is possible to interpret the observed linguistic variability of more recent origin in the light of some specific geographic features or historical processes that may have influenced such differentiation. Luckily, the precision of dialect maps allows the history of populations to be depicted with more insight compared to the more abstract tree‐like representations widespread in historical linguistics.

Further, and interestingly for a population geneticist, the effect of geographic distance can be tested against linguistic distance and this effect can be modelled, thus making apparent the proportional increase in linguistic diversity with geographic distance. I will spend some ink on this aspect since, in spite of the long tradition of dialect cartography, such proportionality is not as accepted as it is in genetics. Historically speaking, the concept was implicit in Johannes Schmidt's (1872) "wave theory" about Indo‐European languages since, in single‐language dialectology, every language consists of a chain of pairs of mutually intelligible speakers (or speech types), where different varieties gradually shade into one another, with the extremes of the chain in the most differentiated areas. In dialectology, the role played by geographic distance in the constant increase in linguistic divergence has been stressed by Chambers and Trudgill's "traveller's distance" (1998: 5). They repeat the familiar tale of a traveller crossing a linguistic area and repeatedly encountering slightly different dialects and thus experiencing the continuum now frequently appealed to in dialectology. Later, Heeringa and Nerbonne (2001) tested the mathematical association between geographic and Dutch linguistic distances and then summarized it in a mathematical regression, as geneticists do. Hence, unlike authors who just see the continuum as an undulating landscape, Heeringa and Nerbonne have shown that the mean height of such "undulations" is not constant through space, since pairwise comparisons of dialect variants lead to occasionally higher values as dialect borders are encountered, thus suggesting that boundaries may prove a way to describe dialect variation. 
If we were able to compute, in a matrix of mathematically computed distances between pairs of languages (dialects), the percentage of linguistic distance that is sim‐ ply related to the physical geographic distance separating the pairs of localities where the variants are spoken, we would be able to focus on the residual variability, which might signal a pattern of linguistic difference that is more ancient or has arisen through migration. As a consequence of sparser population density, less contact be‐ tween speakers and less reliable transportation, we can (1) imagine that linguistic (dialect) differences were stronger in ancient times than they are today and (2) regard present day differences as a relict of the remote patterns that the analysis of residuals may help more effectively highlight. This issue is related to the Isolation by Distance model generally accepted in genetics studies. Interestingly, the correlation with geo‐ graphic distance of both genetic and linguistic data is not linear and the same loga‐ rithmic transformation can be applied to both datasets in order to obtain an improved sub‐linear model.
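The idea of regressing linguistic distance on log‐transformed geographic distance and then inspecting the residuals can be sketched as follows. This is a minimal illustration with made‐up toy numbers, assuming a simple least‐squares fit of the form ling = a + b·ln(geo); the function names and the data are hypothetical and are not taken from the studies cited here.

```python
import math

def fit_log_regression(geo, ling):
    """Least-squares fit of ling = a + b * ln(geo); returns (a, b)."""
    xs = [math.log(g) for g in geo]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ling) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ling))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def residuals(geo, ling, a, b):
    """Observed minus expected linguistic distance for each pair of sites."""
    return [y - (a + b * math.log(g)) for g, y in zip(geo, ling)]

# Toy data (hypothetical): one value per pair of localities.
geo_km = [10.0, 20.0, 40.0, 80.0, 160.0]
ling_dist = [0.10, 0.16, 0.20, 0.27, 0.30]

a, b = fit_log_regression(geo_km, ling_dist)
res = residuals(geo_km, ling_dist, a, b)
```

Pairs with large positive residuals are more different than geography alone would predict; these are the values that the residual analysis focuses on.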


In two recent papers examining the Netherlands (Manni, Heeringa and Ner‐ bonne 2006 – CHAPTER 4; Manni et al. 2008), we addressed the possible similarities be‐ tween the geographic distribution of surnames, which have been proven to be reliable genetic markers, and that of dialect pronunciations, which are clearly culturally transmitted. To properly compare surnames and dialects, we subjected the linguistic data to a novel treatment that I would like to discuss first. From Dutch dialect atlases, we computed a general regression (see CHAPTER 8) between the Levenshtein distance — a numerical measure of the distance between pairs of dialect varieties (see Heer‐ inga 2004: 121‐144 and Nerbonne 2010) — and the geographic distance between pairs of localities where such varieties are spoken. The mathematical properties of this re‐ gression were satisfactory enough to conclude that a dependence of linguistic diver‐ sity upon geographic distance exists. By making this mathematical relation explicit, we computed the expected Levenshtein distances between pairs of dialects according to the linear geographic distance existing between them, that is, by taking geographic distance as the independent variable, as in the genetic analyses described in section 2.2. Then we measured the differences between the actual and the expected Leven‐ shtein distances. We recall that such differences are called residuals. Their analysis can reveal important divergences between reality and mathematical expectations and we focused on them by computing the zones where the linguistic rate of change, visu‐ alized as Monmonier barriers, is higher. If we imagine that dialect differences could be mapped on a geographic map as if they were a natural landscape feature, we could say that Monmonier barriers run along the beds of valleys and the tops of mountain chains: the areas where the slope in the pattern of change is higher. 
Although barriers might remind linguists of bundles of isoglosses, Monmonierʹs approach may only be applied to numerical (e.g., dialectometrical) data. But it is true that it mirrors the goal of a synthetic representation of variability that isogloss bundles were likewise de‐ signed to operationalize. We compare linguistic barriers obtained from original dis‐ tances with barriers computed from residuals in Fig. 2.6. Comparing the Monmonier barriers based on original data (Fig. 2.6 A) with those obtained after computing residual distances (Fig. 2.6 B) leads to quite similar conclusions: (1) the southwestern area of Zeeland is surrounded by barriers; (2) the Saxon dialect area is widely contoured and (3) the northern province of Friesland (where Frisian is spoken instead of Dutch) appears fragmented because of the dialect islands formed by urban Friso‐Franconian varieties in the Frisian dialect continuum (light gray barriers). Nevertheless, two completely new features appear in the analysis of residual distances (Fig. 2.6 B): (1) the northern part of the Netherlands appears less contoured, and (2) the existence of a peculiar boundary that runs from the west of the country to the south is observed (see arrows). This latter feature, usually undetected

in available studies, can be attributed to a kind of systematic error, i.e., heterogeneous transcription (see Heeringa 2004), while the increased linguistic homogeneity in the north is accounted for by the fact that Frisian varieties, which are today spoken only in the northwest (Friesland), were also spoken in the northeast (Groningen) until the early sixteenth century (see Hoekstra 2001: 139; Niebaum 2001: 431). Aside from a few contemporary phonetic features, there is no linguistic evidence that a different language was once spoken in this area. Focusing on the major discrepancies between the Levenshtein distances computed from original transcriptions and those predicted by the regression, we can expect them to mirror phenomena unrelated to the standard dynamics of linguistic contact. In fact, this is exactly what we obtain: when speakers of Frisian started speaking Saxon dialects, their second language became their first language and their first language (mother tongue Frisian) died out, even while projecting their Frisian substrate into inadequately acquired Saxon.

Figure 2.6  Variability of 252 Dutch dialect variants analyzed by Monmonier's algorithm. Solid lines correspond to the area where the rate of linguistic change is higher. The same kind of analysis has been performed on original linguistic distances (A) and on residual distances computed as in Fig. 2.5 (B). To help the reader, the main dialect areas of the Netherlands and the provinces mentioned in the text are shown in the top left corner.

This process was coupled to a long struggle between the (Saxon‐speaking) city of Groningen and the (Frisian) countryside, which was partly related to land reclamation in the eastern part of the city that stimulated the immigration of Saxon‐speaking settlers. In the same vein, we might have expected the well‐known barrier between Catholics and Protestants (Fig. 2.7) to be apparent since such religious distinction may have acted as a social boundary, increasing dialect differences between populations on opposite sides of the border. The fact that there is no linguistic evidence for such separation, even when residual distances are analyzed, implies that more casual social contacts and interchange were not diminished.

I now turn to the genetic side of the study, surnames, which in patrilineal systems are transmitted virtually unchanged across generations, like a genetic locus on the Y chromosome. Since surnames, at the time of their introduction, were words subject to the same linguistic processes which otherwise result in dialectal differences, our working hypothesis was to expect their geographic distribution to be correlated with dialect pronunciation differences (see CHAPTER 4). When we compared the patterns of the barriers computed on the basis of surname differences (Fig. 2.7) with those based on dialect differences (Fig. 2.6), we found that, overall, the most differentiated areas (located on either side of the barriers) do not correspond to those of the dialects.

Figure 2.7  Variability of Dutch surnames. Solid lines correspond to the area where the rate of surname change is higher. The north‐central area inhabited by Protestants is marked by the letter P, while the southern area, where Roman Catholics prevail, is marked by a C.


Surname variability does mirror the reproductive barrier separating Calvinists and Catholics, in contrast to the linguistic results, which demonstrate that communication proceeded between the two communities, despite a profound social cleft. In general, once the collinear effects of geography on both surname and cultural transmission are taken into account, there is no statistically significant association between the two, suggesting that surnames cannot be taken as a proxy for dialect variation. Although an example from the Netherlands is little more than anecdotal when addressing an international audience, it nonetheless prompts me to suggest, wherever possible, explicitly assessing the mathematical relation between linguistic and geo‐ graphic distance, which should not be seen as the ultimate goal but rather as a tool for gaining further insight into the data. Additionally, such a method may test whether dialect continua, so often observed and advocated, are a satisfactory view of linguistic variability or whether more innovative interpretations of the geographic patterns of dialect variation are needed, especially when dealing with old or archaic linguistic patterns otherwise hidden in the data, which may thus be explicitly depicted (see CHAPTER 8).

2.5 CONCLUDING REMARKS

Those readers expecting to find in this chapter a source of inspiration for further developments in linguistic cartography may have been disappointed since, some notable exceptions aside, map production has not been the primary goal of population geneticists. Their major interest, alongside the study of the biochemical properties of DNA, has been increasing understanding of the mechanisms responsible for the genetic differentiation of populations, namely selection, isolation, migration and their modeling. Simulations have been the preferred tool with which to estimate the different parameters that are common in the mathematical formulas describing the transmission of heredity, and a large part of the literature has addressed the mathematical cross‐dependency of such parameters.

As I understand it, historical linguistics, and to a large extent dialectology, have had a different history, maps being one of the main goals. Since linguistics is an old discipline, scholarly traditions have more weight than in population genetics, and maps are necessary for a broad consensus between linguists. Although I admire the cartographic efforts of linguists, I wonder to what extent the production of maps has undercut more abstract research into the mechanisms responsible for language differentiation, mechanisms that, in my view, should also take into account the speakers of a language as a population defined by demographic parameters. I expect the variability of a language, its rate of change over time, and the degree of linguistic variation between speakers at a given location to be correlated with the size of the speakers' community, whatever the cultural mechanisms of language transmission may be (see Jacquesson 2003 and Nerbonne/Heeringa 2007).

In an interesting attempt to draw parallels between linguistic and genetic differentiation, Salikoko Mufwene (2001) has pointed out that the major force responsible for language change is the interaction of individuals belonging to a population. From my perspective, such a parallel is weakened by a misuse of biological terms and concepts and by the fact that the analogies Mufwene uses are purely ontological. Although he underlines the importance of populations of speakers, he nevertheless places the accent elsewhere, namely, on languages seen as biological organisms. This view is similar to the approach of many geneticists, who see genes as the main object of study, thus forgetting that those genes are carried by interacting individuals. Population genetics has been developed to handle such issues; it is now time to develop a new discipline called "population linguistics".


References:

Bahuchet S. 1996. Fragments pour une histoire de la forêt africaine et de son peuplement. In: C.M. Hladik, A. Hladik, H. Pagezy, O.F. Linares, G.J.A. Koppert and A. Froment (eds.) L'alimentation en forêt tropicale: interactions bioculturelles et perspectives de développement. Paris: UNESCO, pp. 97‐119.
Barbujani G., Sokal R.R. 1990. Zones of sharp genetic change in Europe are also linguistic boundaries. Proceedings of the National Academy of Sciences USA, 87: 1816‐9.
Barbujani G., Stenico M., Excoffier L., Nigro L. 1996. Mitochondrial DNA sequence variation across linguistic and geographic boundaries in Italy. Human Biology, 68: 201‐15.
Braude B. 1997. The Sons of Noah and the Construction of Ethnic and Geographical Identities in the Medieval and Early Modern Periods. William and Mary Quarterly, 3rd series, 54: 103‐142.
Cann R.L., Stoneking M., Wilson A.C. 1987. Mitochondrial DNA and human evolution. Nature, 325: 31‐6.
Cavalli‐Sforza L‐L., Piazza A., Menozzi P., Mountain J. 1988. Reconstruction of human evolution: bringing together genetic, archaeological, and linguistic data. Proceedings of the National Academy of Sciences USA, 85: 6002‐6.
Cavalli‐Sforza L‐L., Menozzi P., Piazza A. 1994. The History and Geography of Human Genes. Princeton, N.J.: Princeton University Press.
Chambers J.K. 1994. An introduction to dialect topography. English World‐Wide, 15: 35‐53.
Chambers J.K., Trudgill P. 1998. Dialectology, 2nd edition. Cambridge (UK): Cambridge University Press.
Darlu P. 1997. Les représentations géographiques de la diversité biologique dans l'espèce humaine. L'Espace géographique, 25: 341‐353.
Darwin C. 1859. The origin of species. Oxford: Oxford University Press, World Classic series, reprint 1996.
Dupanloup de Ceuninck I., Schneider S., Langaney A., Excoffier L. 2000. Inferring the impact of linguistic boundaries on population differentiation: application to the Afro‐Asiatic‐Indo‐European case. European Journal of Human Genetics, 8: 750‐6.
Embleton S. 1986. Statistics in historical linguistics. Bochum: Brockmeyer, Germany.
Gilmour J.S., Gregor J.W. 1939. Demes: a suggested new terminology. Nature, 144: 33.
Haldane J.B.S. 1940. Blood‐Group Frequencies of European Peoples and Racial Origins. Human Biology, 12: 457‐80.
Heeringa W.J. 2004. Measuring dialect pronunciation differences. Groningen: Groningen University Press, The Netherlands.
Heeringa W.J., Nerbonne J. 2001. Dialect areas and Dialect Continua. Language Variation and Change, pp. 375‐400.
Hoekstra E. 2001. Frisian Relics in the Dutch Dialects. In: H.H. Munske (ed.) Handbuch des Friesischen/Handbook of Frisian Studies. Tübingen: Niemeyer, Germany, pp. 138‐142.
Judson P.M. 2007. Guardians of the Nation, Activists on the Language Frontiers of Imperial Austria. Cambridge (MA): Harvard University Press, USA.
Juengst E.T. 1998. Group identity and human diversity: keeping biology straight from culture. American Journal of Human Genetics, 63: 673‐7.
Kessler B. 2001. The Significance of Word Lists: Statistical Tests for Investigating Historical Connections Between Languages. Stanford (CA): CSLI Publications, USA. (Distributed by The University of Chicago Press).
Lewis M.W., Wigen K. 1997. The Myth of Continents, a critique of metageography. Berkeley: University of California Press.
Lum J.K., Cann R.L., Martinson J.J., Jorde L.B. 1998. Mitochondrial and nuclear genetic relationships among Pacific Island and Asian populations. American Journal of Human Genetics, 63: 613‐24.
MacEachern S. 2000. Genes, tribes and African history. Current Anthropology, 41: 357‐384.
Malécot G. 1948. Les mathématiques de l'hérédité. Paris: Masson.
Manni F., Heeringa W.J., Nerbonne J. 2006. To what extent are surnames words? Comparing the geographic patterns of surname and dialect variation in the Netherlands. Special issue of LLC Literary and Linguistic Computing "Progress in Dialectometry: Toward Explanation", 21: 507‐27.
Manni F., Heeringa W.J., Toupance B., Nerbonne J. 2008. Do surname differences mirror dialect variation? Human Biology, 80: 41‐64.
McMahon A., McMahon R. 2006. Why linguists don't do dates. In: Peter Forster and Colin Renfrew (eds.) Phylogenetic methods and the prehistory of languages. Cambridge (UK): McDonald Institute for Archaeological Research monographs.
McMahon R. 2004. Genes and languages. Community Genetics, 7: 2‐13.
McWhorter J. 2001. The power of Babel: a natural history of languages. New York: Times Books, Henry Holt and Co.
Menozzi P., Piazza A., Cavalli‐Sforza L‐L. 1978a. Synthetic maps of human gene frequencies in Europe. Science, 201: 786‐792.
Menozzi P., Piazza A., Cavalli‐Sforza L‐L. 1978b. Synthetic gene frequency maps and an application to the analysis of the spread of the Neolithic in Europe [Abstract]. American Journal of Human Genetics, 30: 125.
Monmonier M. 1973. Maximum‐Difference Barriers: an Alternative Numerical Regionalization Method. Geographical Analysis, 3: 245‐61.
Mourant A.E., Kopec A.C., Domaniewska‐Sobczak K. 1976. The Distribution of the Human Blood Groups and Other Polymorphisms. 2nd ed. Oxford: Oxford University Press.
Mufwene S.S. 2001. The Ecology of Language Evolution. Cambridge University Press.
Nerbonne J., Heeringa W. 2007. Geographic Distributions of Linguistic Variation Reflect Dynamics of Differentiation. In: S. Featherston, W. Sternefeld (eds.) Linguistics in Search of its Evidential Base, Studies in Generative Grammar 96. Berlin and New York: Mouton De Gruyter, pp. 267‐297.
Nerbonne J. 2010. Mapping Aggregate Variation. In: A. Lameli, R. Kehrein, S. Rabanus (eds.) Language and Space. International Handbook of Linguistic Variation. Vol. 2 Language Mapping. Berlin: Mouton De Gruyter. Chap. 24, pp. 476‐495, maps pp. 2401‐2406. (Series Handbooks of Linguistics and Communication Science 30.2).
Nettle D. 1998. Explaining global patterns of language diversity. Journal of Anthropological Archaeology, 17: 354‐374.
Nettle D. 1999. Is the rate of linguistic change constant? Lingua, 108: 119‐136.
Niebaum H. 2001. Der Niedergang des Friesischen zwischen Lauwers und Weser. In: H.H. Munske (ed.) Handbuch des Friesischen/Handbook of Frisian Studies. Tübingen: Niemeyer, Germany, pp. 430‐442.
Poloni E.S., Semino O., Passarino G., Santachiara‐Benerecetti A.S., Dupanloup I., Langaney A., Excoffier L. 1997. Human genetic affinities for Y‐chromosome P49a,f/TaqI haplotypes show strong correspondence with linguistics. American Journal of Human Genetics, 61: 1015‐35.
Rosser Z.H., Zerjal T., Hurles M.E., Adojaan M., Alavantic D., Amorim A., et al. 2000. Y‐chromosomal diversity in Europe is clinal and influenced primarily by geography, rather than by language. American Journal of Human Genetics, 67: 1526‐43.
Ruhlen M. 1987. A guide to the world's languages. Vol. 1. Stanford (CA): Stanford University Press, USA.
Schleicher A. 1863. Die Darwinsche Theorie und die Sprachwissenschaft. Offenes Sendschreiben an Herrn Dr. Ernst Häckel, a. o. Professor der Zoologie und Director des zoologischen Museums an der Universität Jena. Weimar: Hermann Böhlau.
Schmidt J. 1872. Die Verwandtschaftsverhältnisse der indogermanischen Sprachen. Weimar: H. Böhlau.
Scott J.C. 1998. Seeing like a State: How Certain Schemes to Improve the Human Condition Have Failed. New Haven and London: Yale University Press.
Sokal R.R., Oden N.L. 1978a. Spatial autocorrelation in biology 1. Methodology. Biological Journal of the Linnean Society, 10: 199‐228.
Sokal R.R., Oden N.L. 1978b. Spatial autocorrelation in biology 2. Some biological implications and four applications of evolutionary and ecological interest. Biological Journal of the Linnean Society, 10: 229‐249.
Sokal R.R. 1988. Genetic, geographic and linguistic distances in Europe. Proceedings of the National Academy of Sciences USA, 85: 1722‐1726.
Sokal R.R., Oden N.L., Thomson B.A. 1999a. A problem with synthetic maps. Human Biology, 71: 447‐53.
Sokal R.R., Oden N.L., Thomson B.A. 1999b. Problems with synthetic maps remain: reply to Rendine et al. Human Biology, 71: 1‐13.
Stoneking M., Cann R.L. 1989. African origin of human mitochondrial DNA. In: P. Mellars and C. Stringer (eds.) The Human Revolution: Behavioural and Biological Perspectives on the Origins of Modern Humans. Princeton, N.J.: Princeton University Press, pp. 17‐30.
Wolpoff M.H., Thorne A.G. 1981. Regional continuity in Australasian Pleistocene Hominid Evolution. American Journal of Physical Anthropology, 55: 337‐349.
Wolpoff M.H., Wu X., Thorne A.G. 1984. Modern Homo sapiens origins: a general theory of hominid evolution involving the fossil evidence from East Asia. In: F.H. Smith and F. Spencer (eds.) The Origins of Modern Humans: A World Survey of the Fossil Evidence. New York (NY): Alan R. Liss, pp. 411‐483.
Womble W.H. 1951. Differential Systematics. Science, 114: 315‐22.


63 Chapter 1

This chapter has been published, please cite the original reference:

Nerbonne J., Kleiweg P., Heeringa W., Manni F. 2007. Projecting Dialect Differences to Geography: Bootstrap Clustering vs. Noisy Clustering. In: C. Preisach, L. Schmidt‐Thieme, H. Burkhardt, R. Decker (eds.) Data Analysis, Machine Learning, and Applications. Proc. of the 31st Annual Meeting of the German Classification Society. Berlin: Springer, pp. 647‐654. (Studies in Classification, Data Analysis, and Knowledge Organization).

ABSTRACT  Dialectometry produces aggregate distance matrices in which a distance is specified for each pair of sites. By projecting groups obtained by clustering onto geography one compares results with traditional dialectology, which produced maps partitioned into implicitly non‐overlapping dialect areas. The importance of dialect areas has been challenged by proponents of continua, but they too need to compare their findings to older literature, expressed in terms of areas. Simple clustering is unstable, meaning that small differences in the input matrix can lead to large differences in results (Jain et al. 1999). This is illustrated with a 500‐site data set from Bulgaria, where input matrices which correlate very highly (r = 0.97) still yield very different clusterings. Kleiweg et al. (2004) introduce composite clustering, in which random noise is added to matrices during repeated clustering. The resulting borders are then projected onto the map. The present contribution compares Kleiweg et al.ʹs procedure to resampled bootstrap‐ ping, and also shows how the same procedure used to project borders from composite clustering may be used to project borders from bootstrapping.

PROJECTING DIALECT DISTANCES TO GEOGRAPHY: BOOTSTRAP CLUSTERING VS. NOISY CLUSTERING

3.1 INTRODUCTION

We focus on dialectal data, examined at a high level of aggregation, i.e. the average linguistic distance between all pairs of sites in large dialect surveys. It is important to seek groups in this data, both to examine the importance of groups as organizing ele‐ ments in the dialect landscape, but also in order to compare current, computational work to traditional accounts. Clustering is thus important as a means of seeking groups in data, but it suffers from instability: small input differences can lead to large differences in results, i.e., in the groups identified. We investigate two techniques for overcoming the instability in clustering techniques, bootstrapping, well known from the biological literature, and ʺnoisyʺ clustering, which we introduce here. In addition we examine a novel means of projecting the results of (either technique involving) such repeated clusterings to the geographic map, arguing that it is better suited to revealing the detailed structure in dialectological distance matrices.

3.2 BACKGROUND AND MOTIVATION

We assume the view of dialectometry (Goebl, 1984 inter alia) that we characterize dialects in a given area in terms of an aggregate distance matrix, i.e. an assignment of a linguistic distance $d$ to each pair of sites $s_1, s_2$ in the area: $D(s_1, s_2) = d$. Linguistic distances may be derived from vocabulary differences, differences in structural properties such as syntax (Spruit 2006), differences in pronunciation, or otherwise. We ignore the derivation of the distances here, except to note two aspects. First, we derive distances via individual linguistic items (in fact, words), so that we are able to examine the effect of sampling on these items. Second, we focus on true distances, satisfying the usual distance axioms, i.e. having a minimum at zero, $\forall s_1\; D(s_1, s_1) = 0$; symmetry, $\forall s_1, s_2\; D(s_1, s_2) = D(s_2, s_1)$; and the triangle inequality, $\forall s_1, s_2, s_3\; D(s_1, s_2) \le D(s_1, s_3) + D(s_3, s_2)$ (see Kruskal 1999: 22). We return to the issue of whether the distances are ULTRAMETRIC in the sense of the phylogenetic literature below.

We focus here on how to analyze such distance matrices, and in particular how to detect areas of relative similarity. While multi‐dimensional scaling has undoubtedly proven its value in dialectometric studies (Embleton 1987, Nerbonne et al. 1999), we still wish to detect DIALECT AREAS, both in order to examine how well areas function as organizing entities in dialectology, and also in order to compare dialectometric work to traditional dialectology in which dialect areas were seen as the dominant organizing principle.

CLUSTERING is a standard way in which to seek groups in such data, and it is applied frequently and intelligently to the results of dialectometric analyses. The research community is convinced that the linguistic varieties are hierarchically organized; thus, e.g., the urban dialect of Freiburg is a sort of Low Alemannic, which is in turn Alemannic, which is in turn Southern German, etc. This means that the techniques of choice have been different varieties of hierarchical clustering (Schiltz 1996, Mucha and Haimerl 2005).

Hierarchical clustering is most easily understood procedurally: given a square distance matrix of size $n \times n$, we seek the smallest distance in it. Assume that this is the distance between $i$ and $j$. We then fuse the two elements $i$ and $j$, obtaining a square matrix of size $(n - 1) \times (n - 1)$. One needs to determine the distance from the newly added $i + j$ element to all remaining $k$, and there are several alternatives for doing this, including nearest neighbour, average distance, weighted average distance, and minimal variance (Ward's method). See Jain et al. (1999) for discussion. 
We return in the final section to the differences between the clustering algorithms, but in order to focus on the effects of bootstrapping and "noisy" clustering, we use only weighted average (WPGMA) in the experiments that follow. The result of clustering is a dendrogram, a tree in which the history of the clustering may be seen. For any two nodes in the dendrogram we may determine the point at which they fuse, i.e. the smallest internal node which contains them both. In addition, we record the cophenetic distance: this is the distance from one subnode to another at the point in the algorithm at which the subnodes fused (Fig. 3.1).
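The fusing procedure and the cophenetic distance just described can be sketched in a few lines. This is a minimal, unoptimized illustration of weighted average (WPGMA) clustering on a hypothetical four‐site toy matrix; the function name `wpgma`, the site labels and the distances are assumptions made for the example and do not come from the data used in this chapter.

```python
def wpgma(labels, matrix):
    """Weighted-average (WPGMA) hierarchical clustering.

    Returns (merges, cophenetic): merges lists (leaves, height) for each
    fusion in order; cophenetic maps frozenset({leaf_a, leaf_b}) to the
    height at which the two leaves were first fused.
    """
    clusters = {i: (labels[i],) for i in range(len(labels))}  # id -> leaves
    d = {frozenset((i, j)): matrix[i][j]
         for i in range(len(labels)) for j in range(i + 1, len(labels))}
    merges, cophenetic = [], {}
    next_id = len(labels)
    while len(clusters) > 1:
        pair = min(d, key=d.get)          # smallest remaining distance
        i, j = sorted(pair)
        height = d.pop(pair)
        for a in clusters[i]:
            for b in clusters[j]:
                cophenetic[frozenset((a, b))] = height
        merged = clusters.pop(i) + clusters.pop(j)
        for k in clusters:
            # WPGMA update: plain mean of the two old distances to k
            dik = d.pop(frozenset((i, k)))
            djk = d.pop(frozenset((j, k)))
            d[frozenset((next_id, k))] = (dik + djk) / 2.0
        clusters[next_id] = merged
        merges.append((merged, height))
        next_id += 1
    return merges, cophenetic

# Hypothetical toy distances between four sites A, B, C, D.
sites = ["A", "B", "C", "D"]
toy = [[0, 2, 6, 10],
       [2, 0, 8, 10],
       [6, 8, 0, 4],
       [10, 10, 4, 0]]
merges, coph = wpgma(sites, toy)
```

In this toy case A and B fuse first (height 2), then C and D (height 4), and the two groups fuse last at height 8.5, which is therefore the cophenetic distance between, say, A and C. The search for the minimum in each round is the step that makes the result sensitive to small changes in the input.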

Figure 3.1  An example dendrogram. Note that the cophenetic distance is reflected in the horizontal distance from the leaves to the encompassing node. Thus the cophenetic distance between Borstendorf and Gornsdorf is a bit more than 0.04.

Note that the algorithms depend on identifying minimal elements, which leads to instability: small changes in the input data can lead to very different groups being identified (Jain et al., 1999). Nor is this problem merely "theoretical". Figure 3.2 shows two very different cluster results which derive from genuine, extremely similar data (the distance matrices correlated at r = 0.97).

Figure 3.2  Two Bulgarian Datasets from Osenova et al. (2009). Although the distance matrices correlated nearly perfectly (r = 0.97), the results of WPGMA clustering differ substantially. Bootstrapping and noisy clustering resolve this instability. When the colour contrast between two areas under comparison is high, the clustering distance is high as well.

Finally, we note that the distances we shall cluster do not satisfy the ultrametric axiom: $\forall s_1, s_2, s_3:\ D(s_1, s_2) \le \max\{D(s_2, s_3), D(s_1, s_3)\}$ (Page and Holmes 2006, p. 26). Phylogeneticists interpret data satisfying this axiom temporally, i.e., they interpret data points clustered together as later branches in an evolutionary tree. The dialectal data undoubtedly reflects historical developments to some extent, but we proceed from the premise that the social function of dialect variation is to signal geographic provenance, and that similar linguistic variants signal similar provenance. If the signal is subject to change due to contact or migration, as it undoubtedly is, then similarity could also result from recent events. This muddies the history, but does not change the socio‐geographic interpretation.
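Whether a given matrix satisfies the ultrametric axiom can be checked directly by testing every triple of sites. The sketch below is illustrative only; the name `is_ultrametric`, the tolerance parameter and the dictionary layout of the matrix are assumptions. Cophenetic distances read off a dendrogram pass the check, while raw distances typically do not.

```python
from itertools import permutations


def is_ultrametric(sites, dist, tol=1e-12):
    """True if D(a, b) <= max(D(a, c), D(c, b)) holds for all triples.

    dist maps frozenset({a, b}) to the distance between a and b.
    """
    def d(a, b):
        return dist[frozenset([a, b])]

    return all(d(a, b) <= max(d(a, c), d(c, b)) + tol
               for a, b, c in permutations(sites, 3))
```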

3.2.1 Data

In the remainder of the paper we use the data analyzed by Nerbonne and Siedle (2005), consisting of 201 word pronunciations recorded and transcribed at 186 sites throughout all of contemporary Germany. The data was collected and transcribed by researchers at Marburg between 1976 and 1991. It was digitized and analyzed in 2003‐2004. The distance between word pronunciations was measured using a modified version of edit distance, and full details (including the data) are available in Nerbonne and Siedle (2005).

3.3 BOOTSTRAPPING CLUSTERING

The biological literature recommends the use of bootstrapping in order to obtain stable clustering results (Felsenstein 2004, Chap. 20). Mucha and Haimerl (2005) and Manni et al. (2006) likewise recommend bootstrapping for the interpretation of clustering applied to dialectometric data.

In bootstrapped clustering we resample the data with replacement. In our case we resample the set of word‐pronunciation distances. As noted above, each linguistic observation o is associated with a site × site matrix $M_o$. In the observation matrix, each cell represents the linguistic distance between two sites with respect to the observation: $M_o(s, s') = D(o_s, o_{s'})$. In bootstrapping, we assign a weight to each matrix (observation) identical to the number of times it is chosen in resampling:

$$w_o = \begin{cases} n & \text{if observation } o \text{ is drawn } n \text{ times} \\ 0 & \text{otherwise} \end{cases}$$

If we resample I times, then $I = \sum_o w_o$. The result is a subset of the original set of observations (words), where some of the observations may be weighted as a result of the resampling. Each resampled set of words yields a new distance matrix $M_{i \le I}$, namely the average distances of the sites using the weighted set of words obtained via bootstrapping.
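The resampling step can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `bootstrap_matrix` is a hypothetical name, each word is represented by a dictionary from site pairs to distances, and the random generator is passed in explicitly so that runs are repeatable.

```python
import random
from collections import Counter


def bootstrap_matrix(word_matrices, n_draws, rng):
    """Draw words with replacement and average their distance matrices.

    word_matrices: dict word -> {site_pair: distance}.
    Returns (weights, matrix), where weights[o] is the number of times
    word o was drawn and matrix is the weighted mean over the drawn words.
    """
    words = sorted(word_matrices)
    # w_o: number of draws of observation o; undrawn words have weight 0.
    weights = Counter(rng.choice(words) for _ in range(n_draws))
    pairs = list(next(iter(word_matrices.values())))
    matrix = {
        pair: sum(w * word_matrices[o][pair] for o, w in weights.items()) / n_draws
        for pair in pairs
    }
    return weights, matrix
```

Calling the function repeatedly with the same generator, for instance `rng = random.Random(0)`, yields one resampled matrix per call, each of which can then be clustered.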

We apply clustering to each $M_i$ obtained via bootstrapping, recording for each group of sites encountered in the dendrogram (each set of leaves below some node) both i) that the group was encountered, and ii) the cophenetic distance of the group (at the point of fusion). This sounds as if it could lead to a combinatorial problem, but fortunately most of the $2^{180}$ possible groups are never encountered.

In a final step we extract a COMPOSITE DENDROGRAM from this collection, consisting of all of the groups that appear in a majority of the clustering iterations, together with their cophenetic distance. See Fig. 3.3 for an example.
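The bookkeeping behind the composite dendrogram amounts to counting groups across iterations. The sketch below assumes that each clustering run is available as a list of (group, cophenetic distance) pairs; the name `composite_groups` and that input format are illustrative assumptions, and "majority" is read as strictly more than half of the runs.

```python
from collections import defaultdict


def composite_groups(runs):
    """Keep the groups found in a majority of clustering runs.

    runs: list of runs, each a list of (members, cophenetic_distance).
    Returns {group: (times_encountered, mean_cophenetic_distance)}.
    """
    heights = defaultdict(list)
    for merges in runs:
        for members, height in merges:
            heights[frozenset(members)].append(height)
    majority = len(runs) / 2
    return {group: (len(hs), sum(hs) / len(hs))
            for group, hs in heights.items() if len(hs) > majority}
```

Only groups that were actually encountered are stored, which is why the very large number of possible groups poses no practical problem.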

3.4 CLUSTERING WITH NOISE *

Clustering with noise is also motivated by the wish to prevent the sort of instability illustrated in Fig. 3.2. To cluster with noise we assume a single distance matrix, from which it turns out to be convenient to calculate variance (among all the distances). We then specify a small noise ceiling c, e.g. $c = \sigma / 2$, i.e. one‐half standard deviation of distances in the matrix. We then repeat 100 times or more: add random amounts of noise r to the matrix (i.e., different amounts to each cell), allowing r to vary uniformly, $0 \le r \le c$. If we let $M_i$ stand in this case for the matrix obtained by adding noise (in the i‐th iteration), then the rest of the procedure is identical to bootstrapping. We apply clustering to $M_i$ and record the groups clustered together with their cophenetic distances, just as in Fig. 3.3.
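One iteration of the noise step might look as follows. This is a sketch under assumptions: the name `add_noise` is hypothetical, the matrix is a dictionary from site pairs to distances, and the population standard deviation is used, since the text does not say which estimator was applied.

```python
import random
import statistics


def add_noise(dist, rng, fraction=0.5):
    """Add independent uniform noise from [0, c] to every cell.

    The ceiling is c = fraction * sigma, with sigma the (population)
    standard deviation of all distances in the matrix.
    Returns (noisy_matrix, c).
    """
    sigma = statistics.pstdev(list(dist.values()))
    ceiling = fraction * sigma
    noisy = {pair: d + rng.uniform(0.0, ceiling) for pair, d in dist.items()}
    return noisy, ceiling
```

Repeating this 100 times or more and clustering each noisy matrix gives the collection of groups from which the composite dendrogram is extracted.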

Figure 3.3  A Composite Dendrogram where numbers indicate how often a group of sites was clustered and the (horizontal) length of the brackets reflects mean cophenetic distance.


3.5 PROJECTING TO GEOGRAPHY

Since dialectology studies the geographic variation of language, it is particularly important to be able to examine the results of analyses as these correspond to geography. In order to project the results of either bootstrapping or noisy clustering to the geographic map, we use the customary Voronoi tessellation (Goebl 1984), in which each site is embedded in a polygon which separates it from other sites optimally. In this sort of tiling there is exactly one border running between each pair of adjacent sites, and bisecting the imaginary line linking the two. To project mean cophenetic distance matrices onto the map we simply draw the Voronoi tessellation in such a way that the darkness of each line corresponds to the distance between the two sites it separates. See Fig. 3.4 for examples of maps obtained by bootstrapping two different clustering algorithms. These largely corroborate scholarship on German dialectology (König 1991, pp. 230‐231).

Unlike dialect area maps, these composite cluster maps reflect the variable strength of borders, which is represented by the border's darkness and reflects the consensus cophenetic distance between the adjacent sites. Haag (1898) (discussed by Schiltz 1996) proposed a quantitative technique in which the darkness of a border reflected the number of differences counted in a given sample, and similar maps have been in use since. Such maps look similar to the maps we present here, but note that the borders we sketch need not be reflected in local differences between the two sites. The clustering can detect borders even where differences are gradual, when borders emerge only when many sites are compared.1

3.6 RESULTS

Bootstrap clustering and "noisy" clustering identify the same groups in the 186‐site German sample examined here (Fig. 3.4). This is shown by the nearly perfect correlation between the mean cophenetic distances assigned by the two techniques (r = 0.997). Given the general acceptance of bootstrapping as a means of examining the stability of clusters, this result shows that "noisy" clustering is as effective. The usefulness of the composite cluster map may best be appreciated by inspecting the maps in Fig. 3.4. While maps projected from simple clustering (see Fig. 3.2) merely partition an area into non‐overlapping subareas, these composite maps reflect a great deal more of the detailed structure in the data. The map on the left was obtained by bootstrapping using WPGMA.

Although both bootstrapping and adding noise identify stable groups, neither removes the bias of the particular clustering algorithm. Figure 3.4 compares the bootstrapped results of WPGMA clustering with unweighted clustering (UPGMA, see Jain et al. 1999). In both cases bootstrapping and noisy clustering correlate nearly perfectly, but it is clear that WPGMA is sensitive to more structure in the data. For example, it distinguishes Bavaria (in southeastern Germany) from the Southwest (Swabia and Alemannia). So the question of the optimal clustering method for dialectal data remains. For further discussion see: www.let.rug.nl/kleiweg/kaarten/MDS‐clusters.html.

1 Fischer (1980) discusses adding a contiguity constraint to clustering, which structures the hypothesis space in a way that favours clusterings of contiguous regions. Since we use the projection to geography to spot linguistic anomalies (dialect islands, but also field worker and transcriber errors), we do not wish to push the clustering in a direction that would hide these anomalies.

Figure 3.4  Two Composite Cluster Maps. The map on the left was obtained by bootstrapping using weighted group average clustering, and the map on the right by bootstrapping using unweighted group average. We do not show the maps obtained using "noisy" clustering, as these are indistinguishable from the maps obtained via bootstrapping. The composite distance matrices correlate nearly perfectly (r = 0.997) when comparing bootstrapping and "noisy" clustering.
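A correlation between two composite distance matrices, such as the r = 0.997 reported here, can be computed by pairing up the corresponding cells and applying Pearson's formula. The sketch below is illustrative: the names `pearson` and `matrix_correlation` are hypothetical, and each matrix is assumed to be a dictionary holding one entry per unordered pair of sites.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def matrix_correlation(a, b):
    """Correlate two distance matrices over their shared site pairs."""
    pairs = [p for p in a if p in b]
    return pearson([a[p] for p in pairs], [b[p] for p in pairs])
```

Because the cells of a distance matrix are not independent, the coefficient is best read descriptively; significance is usually assessed with a permutation procedure such as the Mantel test.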

3.7 CONCLUSIONS

The "noisy" clustering examined here requires that one specify a parameter, the noise ceiling, and, naturally, one prefers to avoid techniques involving extra parameters. On the other hand it is applicable to single matrices, unlike bootstrapping, which requires that one be able to identify components to be selected in resampling. Both techniques require that one specify a number of iterations, but this is a parameter of convenience. Small numbers of iterations are convenient, and large values result in very stable groupings.

References:

Embleton S. 1987. Multidimensional Scaling as a Dialectometrical Technique. In: R.M. Babitch (Ed.) Papers from the Eleventh Annual Meeting of the Atlantic Provinces Linguistic Association. New Brunswick (Canada): Centre Universitaire de Shippagan, pp. 33‐49.

Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA: Sinauer

Fischer M. 1980. Regional Taxonomy: A Comparison of Some Hierarchic and Non‐Hierarchic Strategies. Regional Science and Urban Economics, 10: 503‐537.

Goebl H. 1984. Dialektometrische Studien: Anhand italoromanischer, rätoromanischer und galloromanischer Sprachmaterialien aus AIS und ALF. 3rd Vol. Tübingen: Max Niemeyer.

Haag K. 1898. Die Mundarten des oberen Neckar‐ und Donaulandes. Reutlingen: Buchdruckerei Egon Hutzler.

Jain A.K., Murty M.N., Flynn P.J. 1999. Data Clustering: A Review. ACM Computing Surveys, 31: 264‐323.

Kleiweg P., Nerbonne J., Bosveld L. 2004. Geographic Projection of Cluster Composites. In: A. Blackwell, K. Marriott, A. Shimojima (eds.) Diagrammatic Representation and Inference. 3rd International Conference, Diagrams 2004. Cambridge, UK, Mar. 2004. (Lecture Notes in Artificial Intelligence 2980). Berlin: Springer, pp. 392‐394.

König W. 1991, 1978. DTV‐Atlas zur deutschen Sprache. München: DTV.

Kruskal J. 1999. An Overview of Sequence Comparison. In: D. Sankoff, J. Kruskal (eds.) Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, 2nd ed. Stanford: CSLI, pp. 1‐44.

Manni F., Heeringa W., Nerbonne J. 2006. To what Extent are Surnames Words? Comparing Geographic Patterns of Surnames and Dialect Variation in the Netherlands. Literary and Linguistic Computing, 21: 507‐528.

Mucha H.J., Haimerl E. 2005. Automatic Validation of Hierarchical Cluster Analysis with Application in Dialectometry. In: C. Weihs and W. Gaul (eds.) Classification, the Ubiquitous Challenge. Proceedings of 28th Meeting Gesellschaft für Klassifikation, Dortmund, Mar. 9‐11, 2004. Berlin: Springer, pp. 513‐520.

Nerbonne J., Heeringa W., Kleiweg P. 1999. Edit Distance and Dialect Proximity. In: D. Sankoff, J. Kruskal (eds.) Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, 2nd ed. Stanford: CSLI, pp. v‐xv.

Nerbonne J., Siedle Ch. 2005. Dialektklassifikation auf der Grundlage aggregierter Ausspracheunterschiede. Zeitschrift für Dialektologie und Linguistik, 72: 129‐147.

Osenova P., Heeringa W., Nerbonne J. 2009. A Quantitative Analysis of Bulgarian Dialect Pronunciation. Zeitschrift für slavische Philologie, 66: 425‐458.

Page R.D.M., Holmes E.C. 2006. Molecular Evolution: A Phylogenetic Approach. Oxford: Blackwell.

Schiltz G. 1996. German Dialectometry. In: H.‐H. Bock, W. Polasek (eds.) Data Analysis and Information Systems: Statistical and Conceptual Approaches. Proceedings of 19th Meeting of Gesellschaft für Klassifikation, Basel, Mar. 8‐10, 1995. Berlin: Springer, pp. 526‐539.

Spruit M. 2006. Measuring Syntactic Variation in Dutch Dialects. In: J. Nerbonne, W. Kretzschmar Jr. (eds.) Progress in Dialectometry: Toward Explanation. Special issue of Literary and Linguistic Computing, 21: 493‐506.


TO WHAT EXTENT ARE SURNAMES WORDS? 73


This chapter has been published, please cite the original reference:

Manni F., Heeringa W.J., Nerbonne J. 2006. To what extent are surnames words? Comparing the Geographic Patterns of Surname and Dialect Variation in the Netherlands. Special issue of LLC Literary and Linguistic Computing “Progress in Dialectometry: Toward Explanation” 21: 507‐27.


ABSTRACT  Since the early studies by Sokal (1988) and Cavalli‐Sforza et al. (1989), there has been an increasing interest in depicting the history of human migrations by comparing genetic and linguistic differences that mirror aspects of human history. Most of the literature concerns continental or macroregional patterns of variation, while regional and microregional scales were investigated less successfully. In this article we concentrate on the Netherlands, an area of only 40,000 km². The focus of the article is on the analysis of surnames, which have been proven to be reliable genetic markers since in patrilineal systems they are transmitted, virtually unchanged, along generations, similar to a genetic locus on the Y‐chromosome. We shall compare their distribution to that of dialect pronunciations, which are clearly culturally transmitted: children learn one of the linguistic varieties they are exposed to, normally that of their peers in the same area, or that of their families. Since surnames, at the time of their introduction, were words subject to the same linguistic processes that otherwise result in dialect differences, one might expect the distribution of surnames to be correlated with dialect pronunciation differences. But we shall argue that once the collinear effects of geography on both genetics and cultural transmission are taken into account, there is in fact no statistically significant association between the two. We show that surnames cannot be taken as a proxy for dialect variation, even though they can be safely used as a proxy for the Y‐chromosome.

We work primarily with regression analyses, which show that both surname and dialect variation are strongly correlated with geographic distance. In view of this strong correlation, we focus on the residuals of the regression, which seeks to explain genetic and linguistic variation on the basis of geography, where geographic distance is the independent variable and surname diversity or linguistic diversity is the dependent variable. We then seek a more detailed portrait of the geographic patterns of variation by identifying the ‘barriers’, namely the areas where the residuals are greatest, by applying the Monmonier algorithm. We find the results historically and geographically insightful, hopefully leading to a deeper understanding of the role of local migrations and cultural diffusion responsible for surname and dialect diversity.

TO WHAT EXTENT ARE SURNAMES WORDS? COMPARING GEOGRAPHIC PATTERNS OF SURNAME AND DIALECT VARIATION IN THE NETHERLANDS

4.1 INTRODUCTION

The aim of this study is to compare the geographic patterns of genetic variation with corresponding linguistic data in the Netherlands (Fig. 4.1). Family names can be regarded as genetic markers since they are transmitted along the male line together with the Y‐chromosome in patrilineal societies. Before becoming surnames with a strict rule of transmission, family names were also words, and so they remain today even if ‘frozen’ to meet the needs of record keeping. We might therefore expect them to pattern with other linguistic material, which is why our study also asks: To what extent are surnames words?

We investigate this dual nature of patronymic markers by comparing the geographic patterns of variability of 19,910 Dutch surnames, accounting for 1,303,369 individuals, with the linguistic differences of the Netherlands measured by Heeringa (2004). As we shall see, surnames are not distributed in the same way as dialect differences. To assess how different surnames are in two locations, we computed a specific pairwise surname distance (‘Nei distance’) between the 226 Dutch localities. These measures were compared with Levenshtein distances that, analogously, assay linguistic diversity. We note that the surname analyses were carried out after excluding very common ‘polyphyletic’ surnames, which otherwise lead to an underestimation of the actual levels of diversity.

4.1.1 Surnames

Male‐transmitted family names simulate neutral alleles of a gene transmitted only through the Y‐chromosome (King et al. 2005; Yasuda and Morton 1967; Yasuda and Furusho 1971; Yasuda et al. 1974; Zei et al. 1983) and therefore satisfy the expectations of the neutral theory of molecular evolution (Cavalli‐Sforza and Bodmer 1971; Crow 1980), which is entirely described by random genetic drift, mutation and migration (Kimura 1983). This property of surnames, together with their ready availability, has made them useful for the study of population structure since 1965, when Crow and Mange published the quantitative relation existing between isonymy1 and inbreeding. Recently, the isonymy method was applied to a genealogical database (Gagnon and Toupance 2002) and consanguinity was estimated both from surnames and from true genealogies. Results indicate that random isonymy, estimated from family names, fits well with consanguinity estimates obtained from genealogical records.

Several papers have focused on surnames on account of their ready availability, from voters’ lists or phone books. They are then useful in the investigation of genetic structures, that is, meaningful differences among populations in geographic space. While the use of patronymic markers is easy and provides very large sample sizes, it may also suffer from limitations related to 1) nonpaternity; 2) surname change; 3) polyphyleticism; and 4) limited temporal depth in generations. Nonpaternity and surname change are not major problems, affecting no more than 10% of the data, but polyphyleticism can decrease the reliability of surname studies.

1 Isonymy, a measure of surnames’ overlap in a population, estimates the degree to which the population is related, i.e. its consanguinity. Real isonymy is obtained by counting the number of marriages where partners have the same surname (isonymic marriages). Isonymy can be estimated by computing the probabilities of isonymic marriages for all surnames. The probability depends on the relative frequency and the number of all the different surnames. This latter measure is called ‘random isonymy’ and assumes that the choice of the partner is not influenced by his or her family name, being, in this respect, completely random. In a village where all the inhabitants have different surnames, isonymy is 0; in another village where all the inhabitants have the same surname, isonymy is 1.
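Random isonymy as characterized in the footnote can be illustrated with a short computation. The sketch assumes one particular estimator, the probability that two distinct individuals drawn at random from the population bear the same surname, chosen because it reproduces the two limiting cases mentioned (0 when all surnames differ, 1 when all are identical). The name `random_isonymy` and this choice of estimator are assumptions, not necessarily the formula used in the studies cited.

```python
from collections import Counter


def random_isonymy(surnames):
    """Probability that two distinct individuals share a surname.

    surnames: list with one entry per individual.
    """
    n = len(surnames)
    if n < 2:
        raise ValueError("at least two individuals are required")
    counts = Counter(surnames)
    # Ordered pairs of distinct individuals with equal surnames,
    # divided by all ordered pairs of distinct individuals.
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
```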

Fig. 4.1  Map of the Netherlands showing the location of the Dutch provinces.


By polyphyleticism we mean the circumstance in which unrelated people may share the same surname. At the time of surname introduction, the same surname (e.g. Woods, Grant, etc.) often came into use in different unrelated families, even those established in different geographic locations. In classical surname analyses, i.e. studies based on surname distances (Chen and Cavalli‐Sforza 1983; Lasker 1985), polyphyletic surnames decrease the value of pairwise distance measures between locations based on the numbers of families with the same surname. To avoid arbitrary exclusions of some family names, published studies were performed on the whole corpus of data by regarding polyphyletic surnames as monophyletic. We have recently shown (Manni et al. 2005) that it is possible to decrease this source of error via a neural network analysis (Kohonen 1995) of the geographic distribution of the surnames. In this way, the identification of some clearly polyphyletic surnames becomes possible, since they share three crucial properties: 1) the absence of a coherent geographic hearth of diffusion; 2) a high average number of people sharing the surname; and 3) a peculiar clustering in specific cells of the Kohonen map.

The second major constraint of surnames, as we mentioned, is related to their limited temporal depth. It is known that they provide no information for periods previous to the late Middle Ages at best, when they first spread in most European countries. In the Netherlands, surnames were not obligatory until the Napoleonic period. As a consequence, surname‐inferred demographic phenomena, such as migrations, drift, and isolation, can be dated at best only within the last six centuries for most of Europe and only within the last two in the Dutch case. The distribution of family names deserves to be studied more extensively in comparison with linguistic variability since dialects evolve at rates detectable over similarly large time frames. This large‐scale synchrony in the diffusion of surname and dialect variants further justifies the comparison that we are undertaking. Possible results might be: 1) similar geographic patterns in surnames and dialects, thus suggesting that social and demographic processes were similar, or 2) genetic variability that differs from linguistic variability, which would show that the social contacts mirrored by dialects do not correspond to the demographic history of the populations speaking them.

But, before addressing such comparisons, it is necessary to discuss an older criticism, related to the dual nature of patronymic markers. If surnames were words, they should mirror linguistic diversity. We note that it is often possible to guess someone’s geographic origin by the sound and spelling of her or his surname. This motivates us to ask whether and to what extent surnames are words. If their variability really was related to their linguistic neighborhood, we would expect to find today patterns similar to those of dialects. This comparison is no longer only hypothetically interesting since dialectologists now process large amounts of data exactly, enabling the establishment of truly quantitative, statistically meaningful correlations between linguistic and surname variability. This step is essential since the outcomes of genetic studies are frequency vectors and distance matrices that deserve comparison with similarly exact information.

4.1.2 Dialects

In genetics, the idea that genetic divergence increases with geographic distance is a well‐accepted and established notion, and large‐scale studies gave evidence of it (for an exhaustive introduction see Cavalli‐Sforza et al. 1989). A similar (but in mechanical detail nonidentical) idea can be traced back to the WAVE THEORY of Johannes Schmidt (1872) about Indo‐European languages. From a distant perspective (following Isidore Dyen, personal communication) all languages link mutually intelligible speakers or speech‐types, where different varieties gradually shade one into another, and where the extremes of the linked network are the most different areas. The role played by geographic distance in the continuous increase of linguistic divergence is also the point of Chambers and Trudgill’s ‘traveller’s distance’ (1998, p. 5). The idea is that a traveler going across a linguistic area will repeatedly encounter dialects whose features overlap to a large extent with those of the last dialect he heard and also of the next one he will hear. He experiences in this way the continuum that is now frequently appealed to in dialectology: neighboring dialects are usually quite similar. A dialectometrical analysis of the traveler has been undertaken, on Dutch dialect data, by Heeringa and Nerbonne (2001), and the association between geographic and linguistic distance was so close that they summarized it in a mathematical regression between geographic and linguistic distance, an approach that was probably first applied to linguistics by Séguy (1971). Unlike authors who see the continuum just as an undulated landscape, Heeringa and Nerbonne have shown that the mean height of such ‘undulations’ is not constant through space, since pairwise comparisons of dialect variants lead to occasionally higher values as dialect borders are encountered.

If we were able to eliminate from a dialectometric matrix of distances the variance explained by geography, we would be able to focus on the residual variance that probably is not related to contact between neighboring speakers. For those interested in the historical evolution of dialect variation, large residual variance may signal a pattern of linguistic difference that is more ancient. We can also imagine that, in ancient times, as a consequence of lower population density, less contact between speakers and less reliable transportation, linguistic differences were stronger.

In this article, we compute a general regression model between Levenshtein dialect distances (see Heeringa 2004, pp. 121–144) and geographic distances between dialect locations, that is, between pronunciation distances and the distances in kilometers between pairs of dialect locations. We computed the Levenshtein distance between the sites in a pairwise fashion. Then, expected distances are subtracted from the observed ones, leading to the computation of residuals. Finally, we construct boundaries based on the residuals.
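The regression and residual steps can be sketched with ordinary least squares. The sketch rests on assumptions: the function name `log_regression_residuals` is hypothetical, and the geographic distance is log‐transformed before fitting, which is one plausible reading of a logarithmic model; the exact model specification of the study is not given at this point.

```python
import math


def log_regression_residuals(geo_km, observed):
    """Fit observed = a + b * ln(geo_km) and return (b, a, residuals).

    geo_km:   geographic distances for each pair of sites (positive).
    observed: linguistic or surname distances for the same pairs.
    A residual is the observed value minus the expected (fitted) value.
    """
    xs = [math.log(g) for g in geo_km]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(observed) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, observed))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, observed)]
    return slope, intercept, residuals
```

Pairs of sites with large positive residuals are more different than their geographic separation alone would predict, and these are the candidates for boundaries.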

4.2 METHODOLOGY AND DATA

4.2.1 Data

4.2.1.1 Surnames

From the original Dutch data set (Manni 2001) of 51,578 surnames, corresponding to 2,294,154 individual telephone users (1997 data) in 226 locations, we have eliminated very frequent surnames (those recorded in more than 100 locations) and very rare surnames (those recorded in fewer than ten locations). The 226 locations are those listed by Barrai et al. (2002), following Manni (2001). The exclusion of very frequent surnames is motivated by their confounding effect on analyses: polyphyleticism leads to inflated estimations of consanguinity (see Section 4.1). Concerning the Netherlands, the demonstration that very frequent surnames are polyphyletic can be found in Manni et al. (2005). Rare surnames were excluded because their contribution to the overall picture is irrelevant (Manni 2005, unpublished). For the 226 locations, the correlation (Manly 1997; Mantel 1967) between a surname distance matrix (whatever the distance) computed by retaining rare surnames and another obtained by eliminating them approaches 1. This exclusion does not bias the data set since removed surnames correspond to a similar fraction of individuals in each of the 226 samples.

From this new data set, consisting of 19,910 surnames (accounting for 1,303,369 individuals, 8.1% of the entire Dutch population), we computed a matrix of Nei distances according to the formula:

where $n_{si}$ denotes the frequency of a given surname s in location i, while $n_{sj}$ denotes the frequency of the same surname in location j. Note that the sums are taken over all surnames. This is the accepted manner of calculating a measure of surname differentiation.
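A minimal sketch of the computation, assuming the normalized form of Nei's distance that is commonly used for surname data, $d_{ij} = -\ln\left( \sum_s n_{si} n_{sj} / \sqrt{\sum_s n_{si}^2 \sum_s n_{sj}^2} \right)$, with $n_{si}$ and $n_{sj}$ taken as relative frequencies. Both this exact normalization and the function name `nei_distance` are assumptions made for illustration, not details confirmed by the text above.

```python
import math


def nei_distance(counts_i, counts_j):
    """Nei-type distance between two locations (assumed normalized form).

    counts_i, counts_j: dict surname -> number of bearers in the location.
    Returns 0 for identical surname distributions and infinity when the
    two locations share no surname at all.
    """
    total_i, total_j = sum(counts_i.values()), sum(counts_j.values())
    surnames = set(counts_i) | set(counts_j)
    p_i = {s: counts_i.get(s, 0) / total_i for s in surnames}
    p_j = {s: counts_j.get(s, 0) / total_j for s in surnames}
    shared = sum(p_i[s] * p_j[s] for s in surnames)
    if shared == 0:
        return math.inf
    within_i = sum(p * p for p in p_i.values())
    within_j = sum(p * p for p in p_j.values())
    return -math.log(shared / math.sqrt(within_i * within_j))
```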


4.2.1.2 Dialects

Heeringa and Nerbonne (2001) analyzed Chambers’ dialectal traveller (Chambers and Trudgill 1998) by sketching a line through the Dutch–Belgian area in which Dutch dialects are spoken. Naturally, this is a small sample of all the sites that one can compare in examining linguistic–geographic correlation, and Nerbonne et al. (1999) have computed regression models for all pairwise distances in the matrix of sampling sites, making the computation more stable than that of the monodimensional ‘traveller’s distance’ along a line. In proceeding this way, we are applying to linguistics the concept of ‘isolation by distance’ that was first introduced in genetics by Malécot (1955), when he demonstrated that close populations are genetically more similar than distant ones. Interestingly, the similarity between genetic and linguistic data can be pushed further since, in both cases, the correlation with geographic distances is not linear and the same logarithmic transformation is applied to both data sets in order to obtain an improved linear model.

Levenshtein distances were computed over all pronunciations, using the same data as Heeringa (2004), although some technical constraints forced us to reduce the number of Dutch sites in the sample to 252. The Levenshtein algorithm calculates the least cost of operations needed to map one pronunciation string (phonetic transcription) into another (Nerbonne et al. 1999). The measurement is consistent for large samples of words (Cronbach’s α > 0.96 for 100‐word samples from this set; see Heeringa 2004, pp. 170–177), and we used 125 words in the current study. The measurements have been validated with respect to scholarly tradition (Heeringa et al. 2002) and again with respect to lay dialect speakers’ judgment of dialectal distance (Heeringa and Gooskens 2004). The latter study showed that the measurement correlated highly with lay speakers’ judgments (r = 0.78). In addition, the technique has now been applied to Norwegian, American English, German, Sardinian, Bulgarian and Bantu. Interestingly, the same Levenshtein algorithm has been applied extensively to measuring differences in long genetic strings (Sankoff and Kruskal 1999).
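The core of the Levenshtein computation is a short dynamic program. The sketch below uses unit costs for insertion, deletion and substitution, which is a simplifying assumption: dialectometric applications typically operate on phonetic segments and may weight operations differently. The function name `levenshtein` is chosen for illustration.

```python
def levenshtein(a, b):
    """Least number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b (unit costs)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete item_a
                current[j - 1] + 1,                    # insert item_b
                previous[j - 1] + (item_a != item_b),  # substitute or match
            ))
        previous = current
    return previous[-1]
```

Because the function only iterates over its arguments, it accepts lists of phonetic segments as readily as plain strings; averaging the result over all words in the sample gives one cell of the site × site distance matrix.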

4.2.2 Visualization of diversity

4.2.2.1 Multidimensional space: Principal component analysis

Principal component analysis (PCA) was applied to the data to graphically identify possible patterns of similarity between the 226 surname samples and 252 dialect samples. The PCA method involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability (variance) in the data as possible, and each subsequent component accounts for as much of the remaining variability as possible.

Multidimensional relationships between items can be seen in a bivariate or trivariate plot if two or three axes are plotted against each other. The analysis was performed with the Excel applet GenAlEx, by Peakall and Smouse (2001), freely available at: http://biology‐assets.anu.edu.au/GenAlEx/Welcome.html

4.2.2.2 Geographic analysis: The Monmonier algorithm

When sampling locations are known, the association between genetic and geographic distances can be tested by regression methods. These tests give some clues about the shape of the genetic landscape. Nevertheless, regression analyses are by themselves unenlightening when attempting to identify where barriers may exist, namely the areas where a given variable shows an abrupt rate of change. To remedy this, we look to a computational geometry approach, which uses computed distances (surname, linguistic), or the residuals of the regression procedure, to identify the locations of barriers, and which can therefore also show where the geographic patterns of two or more variables are similar. Inspired by this idea we have implemented Monmonier’s (1973) maximum difference algorithm anew (Manni et al. 2004),2 in order to identify genetic and linguistic barriers, namely the areas where differences between pairs of populations are unexpectedly large with respect to Nei and Levenshtein distance measures. To avoid ambiguity we stress that we use the term ‘boundary’ synonymously with ‘barrier’.

To test the confidence with which we may view the barriers in a genetic or linguistic landscape, a significance test was implemented in the software by means of bootstrap analysis. As a result, 1) the noise associated with genetic or linguistic markers can be visualized on a geographic map and 2) the areas where barriers are more robust can be identified. Moreover, this multiple approach allows us to inspect the barriers in order to get an idea of 3) the patterns of variation associated with different markers in the same landscape. In this study, bootstrap analysis is undertaken for surname data only.

2 See “Barrier vs. 2.2” at www.mnhn.fr/mnhn/ecoanthropologie/software/barrier.html

4.2.2.2.1 The triangulation

Delaunay triangulation (Brassel and Reif 1979) is the fastest triangulation method to connect a set of points (localities) on a plane (map) by a set of triangles. It is the most direct way to connect (triangulate) adjacent points on a map. It should be noted that Delaunay triangulation is the dual of Voronoi tiling (Voronoi 1908), which results in a set of polygons, each surrounding exactly one site and together covering the area studied. The Delaunay triangulation and Voronoi tiling may be derived from each other. Given a set of populations whose geographic locations are known, there is only one possible Delaunay triangulation. Once a network connecting all the localities is obtained, each edge of the tiling is associated with the pairwise distance between the sites in the tiles taken from a distance matrix (see Goebl 2006).
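The construction described above can be sketched on a handful of points. This is a brute-force illustration of the defining property of a Delaunay triangle (no other site lies inside its circumcircle), not the faster method used in the Barrier software; it assumes points in general position, and the function names and the example distance matrix are hypothetical.

```python
from itertools import combinations

def orientation(a, b, c):
    """Twice the signed area of triangle abc (positive if counterclockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circle through a, b and c."""
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    cx, cy = c[0] - p[0], c[1] - p[1]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    # The sign of det depends on the orientation of abc, so correct for it.
    return det * orientation(a, b, c) > 0

def delaunay_triangles(points):
    """All triangles whose circumcircle contains no other site (O(n^4))."""
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        a, b, c = points[i], points[j], points[k]
        if orientation(a, b, c) == 0:  # collinear sites form no triangle
            continue
        others = (m for m in range(len(points)) if m not in (i, j, k))
        if not any(in_circumcircle(a, b, c, points[m]) for m in others):
            triangles.append((i, j, k))
    return triangles

def weighted_edges(triangles, distance):
    """Attach to each triangulation edge the distance between its two sites."""
    edges = {}
    for t in triangles:
        for i, j in combinations(t, 2):
            edges[(i, j)] = distance[i][j]
    return edges

# Four hypothetical localities and an invented symmetric distance matrix.
sites = [(0, 0), (2, 0), (0, 2), (3, 3)]
dist = [[0, 1, 2, 9],
        [1, 0, 5, 4],
        [2, 5, 0, 3],
        [9, 4, 3, 0]]
triangles = delaunay_triangles(sites)
edges = weighted_edges(triangles, dist)
```

Sites 0 and 3 are not neighbours in this triangulation, so their (large) distance is never attached to an edge: only distances between adjacent localities enter the barrier analysis.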

4.2.2.2.2 The algorithm

Monmonier’s (1973) maximum-difference algorithm is used to identify boundaries. As we noted earlier, each edge in the Voronoi tiling is normally associated with the distance between the sites that the ‘tiles’ surround. To apply Monmonier’s algorithm, we associate the edge either with the linguistic or genetic distance directly or with the residual from the respective geographic regression. By repeatedly selecting edges associated with large residuals, we aim to identify coherent geographic boundaries. Boundaries are traced perpendicular to the edges of the network. Starting from the edge across which the genetic or linguistic distance value is maximum³ and proceeding across adjacent edges, the procedure is continued until the boundary under construction either has reached the limits of the triangulation (map) or has closed on itself by forming a loop around a population. In the case of multiple barriers (constructed sequentially in an order in which the researcher has some choice), the construction stops at a previously computed boundary. In the unusual case where two edges have the same value, the one linking to a triangle with higher total values is included in the boundary.
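A single barrier can be traced along the lines just described with the following sketch. It is an illustration under simplifying assumptions, not the Barrier implementation: triangles and edge weights are given directly, only the first barrier is computed, and ties are resolved arbitrarily rather than by the triangle totals mentioned above. The name `monmonier_barrier` and the toy weights are hypothetical.

```python
def monmonier_barrier(triangles, weight):
    """Trace one barrier across a triangulation.

    triangles: list of 3-tuples of site indices.
    weight: dict mapping frozenset({i, j}) to the distance across that edge.
    Returns the crossed edges, starting with the maximum-distance edge.
    """
    # Record which triangles lie on either side of each edge.
    sides = {}
    for t in triangles:
        for pair in ((t[0], t[1]), (t[1], t[2]), (t[0], t[2])):
            sides.setdefault(frozenset(pair), []).append(t)

    start = max(sides, key=lambda e: weight[e])
    barrier = [start]
    # Extend in both directions; a border edge has only one adjacent triangle.
    for first in sides[start]:
        edge, tri = start, first
        while tri is not None:
            candidates = [frozenset(p)
                          for p in ((tri[0], tri[1]), (tri[1], tri[2]), (tri[0], tri[2]))
                          if frozenset(p) != edge]
            # The 'right or left' decision: cross the larger of the two edges.
            nxt = max(candidates, key=lambda e: weight[e])
            if nxt in barrier:  # the barrier has closed on itself
                break
            barrier.append(nxt)
            beyond = [t for t in sides[nxt] if t != tri]
            edge, tri = nxt, (beyond[0] if beyond else None)  # None: map limit
    return barrier

# Two adjacent triangles sharing edge {1, 2}; the weights are invented.
toy_triangles = [(0, 1, 2), (1, 2, 3)]
toy_weight = {frozenset((0, 1)): 1, frozenset((0, 2)): 2, frozenset((1, 2)): 5,
              frozenset((1, 3)): 4, frozenset((2, 3)): 3}
barrier = monmonier_barrier(toy_triangles, toy_weight)
```

Here the barrier starts on the shared edge {1, 2}, then leaves the first triangle across {0, 2} and the second across {1, 3}, reaching the limit of the map in both directions.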

3 To avoid a frequent misunderstanding, we note that the edge associated with the maximum distance does not need to be on the borders of the triangulation, such a case being more the exception than the rule. If this is the case, the extension of the boundary occurs in one direction only; otherwise, it takes place in two directions simultaneously.

4.2.2.2.3 Robustness of barriers

The execution of Monmonier’s algorithm recalls the splitting process seen in the construction of phylogenetic trees: once a barrier passes across the edges of a triangle, it can be extended only across one of the two remaining edges, in what we will define as a ‘right or left’ decision. To assess the robustness of computed barriers, we have developed a test based on the analysis of resampled bootstrap matrices. We repeated the procedure of finding boundaries using matrices computed on data sets from which random elements have been removed while others, randomly selected as well, appear more than once. As with phylogenetic trees, a score is associated with all the different edges that constitute barriers, thus indicating how many times each one of them is included in one of the boundaries computed from the N matrices (typically N ≥ 100). In other words, if we have 100 matrices and we want to compute the first barrier, 100 separate barriers will be obtained. These 100 barriers, different in the sense that they have been computed on matrices obtained from modified data sets, are displayed in a single picture by increasing the thickness of the edges of the Voronoi tiling in proportion to the number of times they belong to one of the 100 barriers. If a pattern exists, whatever the modification of the original data set, barriers should repeatedly emerge in certain areas of the plot. If barriers emerge everywhere in the plot, then the results may not be robust (in terms of geographic differentiation).

The bootstrap procedure is intended to test whether a given ‘signal’, let us say a north / south differentiation, is reliably perceptible in the original data or not. If a majority of the items, i.e. single surnames, single words, or linguistic features, exhibit a geographic pattern (north / south), then such a pattern will repeatedly emerge even when some items of the original data set are randomly deleted or over-represented. In contrast, if only a minority of the items suggests a pattern, after a repeated random modification of the original data set, only a few barriers will display it. In the latter case the pattern is not robust.
This procedure recalls the use of bootstrap in phylogenetic trees (Felsenstein 1985) and similar advantages accrue to this way of computing barriers, notably the way in which the confidence of the postulation of the barrier is reflected in the visualization.
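The resampling logic can be sketched as follows. This is a deliberately reduced illustration, not the test implemented in the software: items (e.g. single surnames or words) are resampled with replacement, a simple mean absolute difference stands in for Nei or Levenshtein distances, and only the edge carrying the maximum distance (the starting edge of the first barrier) is scored instead of the full barrier. All names and data are hypothetical.

```python
import random

def mean_abs_distance(a, b, items):
    """Stand-in distance: mean absolute difference over the sampled items."""
    return sum(abs(a[k] - b[k]) for k in items) / len(items)

def bootstrap_support(sites, edges, replicates=100, seed=0):
    """Count how often each edge carries the largest distance across replicates."""
    rng = random.Random(seed)
    n_items = len(next(iter(sites.values())))
    support = {e: 0 for e in edges}
    for _ in range(replicates):
        # Resample items with replacement: some drop out, others appear twice.
        sample = [rng.randrange(n_items) for _ in range(n_items)]
        strongest = max(
            edges,
            key=lambda e: mean_abs_distance(sites[e[0]], sites[e[1]], sample))
        support[strongest] += 1
    return support

# Three hypothetical localities described by four items each.
sites = {"A": [1, 2, 3, 4], "B": [1, 2, 3, 5], "C": [9, 8, 7, 0]}
edges = [("A", "B"), ("B", "C")]
support = bootstrap_support(sites, edges)
```

Because every item separates C from B more strongly than B from A, the B–C edge is selected in every replicate; a pattern carried by only a minority of items would instead yield scattered, low scores.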

4.3 RESULTS

4.3.1 Surnames

The PCA plot of surname distances (Fig. 4.2) distinguishes the geographic positions of the 226 sampled localities fairly well. It is possible to identify a well-defined Limburg cluster (see Fig. 4.1 for a geographic map of Dutch provinces) and a second cluster constituted by North Brabant samples. Remaining eastern and western samples are close to each other in a continuous swarm of points, while Zeeland samples are intermediate. A more detailed analysis of the topology of the plot reveals that, within the swarm, there is no overlap between northeastern and northwestern samples. It must further be noted that the topology of samples suggests more heterogeneity in the south of the Netherlands than in the north, where samples are plotted closer to each other. The two axes account for approximately 30% of the total variance, and further axes (even through the tenth) still account for significant fractions of the total variance. Even if the low fraction of variance explained by the first two axes (a frequent phenomenon when large numbers of samples are analyzed) means that Fig. 4.2 is less than optimally representative and suggests a rather complex topology of samples in the multidimensional space, it still provides a reasonable first approximation of overall variability. Further axes point to the specificities of both Limburg and Zeeland and, more generally, to the differences existing between the northern and the southern part of the country.

To understand the geographic variability of surnames, given that general correlations are not informative about local variability, we analyzed the original surname distance matrix using Monmonier’s (1973) algorithm (not shown). The barriers computed highlight some differentiation zones in the northeastern provinces and along the northern border of the Limburg and Dutch Brabant provinces.
Moreover, the Zeeland province appears very fragmented, suggesting that surnames are very heterogeneous in this area, with important differences from one location to another. These conclusions are reinforced by the analysis of 100 bootstrap matrices computed by a resampling procedure of original surname data (thick black lines in Fig. 4.3). Bootstrap analysis leads to a clearer picture since some minor barriers in the northern part of the country and in Zeeland disappear. We also note the presence of a major barrier across the IJsselmeer, the former inland sea in the north of the Netherlands (visible in Fig. 4.1).

To focus on the variance that is not explained by geographic distance, we computed a general regression between geographic and surname distances after a double logarithmic transformation, $\log y = 0.155 \log x + k$, which is equivalent to $\log y = \log x^{0.155} + k$, which in turn is equivalent to $y = \exp(\log x^{0.155} + k) = \exp(\log x^{0.155}) \exp(k)$. If we represent $\exp(k)$ as $c$, then the relation is $y = c\,x^{0.155}$. Geneticists are thus accustomed to analyzing the relation between geography and genetic variation as a power law; in fact, this is standard in the analysis of genetic data. It is interesting to note that Sèguy (1973) analyzed the relation between linguistics and geography as $\mathit{ling} = \mathit{geo}^{0.5}$, and Heeringa and Nerbonne (2001) as $\mathit{ling} = \log(\mathit{geo})$. In fact, it is difficult to distinguish the analyses based on the logarithms of geography from those postulating a power law with a fractional coefficient of this size, and so we also apply the double logarithmic transformation to the linguistic data. Using this model, we computed the expected surname distance, according to geography, between two sampled localities. The residual distance between observed and expected values can be either positive or negative, reflecting the influence of phenomena other than geography (history, systematic errors in data recording, etc.).
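The regression and the resulting residuals can be sketched as below. This is a minimal illustration with invented numbers, not the analysis of the Dutch data: the fit is an ordinary least-squares line through the logarithms, and the residuals are taken here on the original scale (observed minus expected), which is an assumption about how the subtraction was carried out. The function names are hypothetical.

```python
import math

def fit_power_law(geo, dist):
    """Least-squares fit of log y = b log x + k; returns (b, c) with c = exp(k)."""
    lx = [math.log(x) for x in geo]
    ly = [math.log(y) for y in dist]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((a - mx) * (v - my) for a, v in zip(lx, ly))
         / sum((a - mx) ** 2 for a in lx))
    k = my - b * mx
    return b, math.exp(k)

def residuals(geo, dist, b, c):
    """Observed distance minus the distance expected from geography alone."""
    return [y - c * x ** b for x, y in zip(geo, dist)]

# Invented example: distances that follow y = 2 * x^0.155 exactly.
geo = [10.0, 20.0, 40.0, 80.0, 160.0]
dist = [2.0 * x ** 0.155 for x in geo]
b, c = fit_power_law(geo, dist)
res = residuals(geo, dist, b, c)
```

With noise-free input the fit recovers the exponent 0.155 and the constant 2, and all residuals vanish; with real data, positive residuals mark pairs of localities that differ more than geography predicts, and negative residuals pairs that differ less.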

Figure 4.2  Principal component analysis (PCA) of the surname differences in the Netherlands (Nei’s distances). Two almost distinct clusters corresponding to North Brabant and Limburg samples can be identified (on the left side). Remaining samples, belonging to the other provinces, cluster in a single swarm of points. Further details can be found in the text. The first axis explains 17.6% of the total variance, while the second axis accounts for 11.7% of it. The third and fourth axes (not shown) explain 5.8 and 4.3% of the total variance and highlight the diversity of Limburg and Zeeland provinces, respectively. Further axes (5th, 6th, and 7th) point, in different ways, to the differences existing between the northern and the southern part of the country.

In Fig. 4.3 (solid gray lines, often overlapping the black ones), we show the Monmonier analysis of the matrix of residuals. Apart from some very local barriers (numbered as ‘2’; ‘5’; ‘6’; ‘8’; ‘9’; ‘14’; ‘16’), previously observed patterns are confirmed, with the exception of the IJsselmeer boundary, which disappears. Methodologically, it is interesting to note that this latter boundary was traced across some of the longest edges of the Delaunay triangulation. As a consequence, the IJsselmeer boundary mirrors a surname differentiation related to the longer geographic distances (compared to the average length of the Delaunay triangulation edges) that naturally arise when comparing samples on opposite sides of the inland sea. Seen from this perspective, the IJsselmeer does not seem to have been a substantial geographic barrier to internal migrations.

Figure 4.3  Surname barriers. Thick black lines correspond to barriers obtained with the Monmonier algorithm on 100 matrices of surname distances computed according to Nei’s method. The different matrices were computed by a bootstrap resampling of original surnames. Only the first 20 barriers are shown (2000 barriers in total, 20 for each of the 100 matrices). The thickness of barriers is proportional to their bootstrap score, and barriers whose score is lower than 50% are not represented (see scale). Gray lines correspond to barriers obtained from a matrix of residual surname distances computed as follows. After a linear regression between the logarithms of geographic and Nei’s distances, we computed the expected surname distance according to the regression. Expected values were subtracted from observed ones, thus leading to the residuals. The first 20 barriers are shown and numbered at both extremes (from ‘1’ to ‘20’). The Delaunay triangulation is visualized by a thin gray network.

4.3.2 Dialects

With an identical methodology, we analyzed the dialect data of the Netherlands. The matrix of Levenshtein distances is visualized in the bidimensional PCA plot of Fig. 4.4, which shows very good structure in the dialect data. Low Saxon and Low Franconian dialects are grouped into separate groups, while Frisian samples are represented by three different clusters that describe (rural) Frisian, archaic Frisian (Hindeloopen, Schiermonnikoog, and Terschelling island), and Friso-Franconian varieties (Frisian cities, Midsland, Ameland island, and Het Bildt). Intermediate between Friso-Franconian and Low Saxon, we find a small Friso-Saxon group (Westerkwartier and Stellingwerf). Gray dots represent varieties spoken in central Gelderland, while empty dots correspond to varieties of the Dutch province of Zeeland. The first and second axes account for 40.8 and 36.7% of total variance, respectively. The second axis has been mirrored and the plot has been rotated to visually suggest the correlation between the topology of samples and their real geographic locations. We will not describe this classification in more detail since it has already been fully discussed (Heeringa, 2004).

Not surprisingly, Monmonier boundaries obtained from the linguistic distance matrix directly (Fig. 4.5), not from the residuals of the regression analysis, confirm the PCA plot for the most part. We find a northwestern Frisian area (barriers ‘1’ and ‘2’), a small northeastern area, part of the province of Groningen (surrounded by barrier ‘20’), a large northeastern Low Saxon area (barrier ‘8’), a large, more or less southwestern Low Franconian area (south of barrier ‘8’), a small southwestern area in the province of Zeeland (barrier ‘9’), and a small area in the southeast (part of the province of Limburg encircled by boundaries ‘5’; ‘13’; ‘17’). The distinction between Low Saxon and Low Franconian is expected.
One of the best known features that demonstrates this distinction is the pronunciation of the final -en syllable. For example, lopen ‘to walk’ is pronounced as [lo:pm] in Low Saxon and as [lo:p] in Low Franconian dialects. Fragmentation is found in Friesland because of the well-known cohesion among the urban, Friso-Franconian varieties.

As with surname data, we continued the analysis by computing, after a double logarithmic transformation, a linear regression model ($\log y = 0.287 \log x + k$) between geographic and Levenshtein distances to obtain the matrix of residuals that is plotted in the PCA analysis of Fig. 4.6. This is a novel treatment of the linguistic data, which we discussed in Section 4.3.1. The residuals reflect variance that is unrelated to geographic distance in general, and in a way, residuals correspond to the ideal case of linguistic differences that would obtain between locations that were equidistant from each other.


Figure 4.4  Principal components plot on the basis of 252 Dutch dialects. Low Saxon and Low Franconian dialects are grouped into separate clusters, while Frisian samples are represented by three different clusters that describe rural Frisian, archaic Frisian (Hindeloopen, Schiermonnikoog, and Terschelling island), and Friso-Franconian varieties (Frisian cities, Midsland, Ameland island, and Het Bildt). Intermediate between Friso-Franconian and Low Saxon we find a small Friso-Saxon group (Westerkwartier and Stellingwerf). Gray dots represent varieties spoken in central Gelderland, while empty dots correspond to varieties of the Dutch province of Zeeland. The first and second axes account for 40.8 and 36.7% of total variance, respectively. The second axis has been mirrored and the plot has been rotated to visually suggest the correlation between the topology of samples and their real geographic locations.


Figure 4.5  Barriers (solid black lines) obtained with the Monmonier algorithm on a matrix of Levenshtein distances between 252 varieties. The first 20 barriers are shown, numbered from ‘1’ to ‘20’. A thin gray network visualizes the Delaunay triangulation. Boundaries identify areas corresponding to Friesland (local barriers corresponding to different Frisian varieties are displayed in gray to provide a clearer representation) and to parts of Zeeland and Limburg. On a wider scale, it appears that some major barriers depict well the geographic locations where Low Franconian and Low Saxon varieties are spoken (see labels). Further details can be found in Section 4.3.2.

Figure 4.6  (A): Principal components plot of the variability of residual dialect distances after a linear regression between the logarithms of Levenshtein distances and their corresponding geographic distances. Both axes have been mirrored for a better display. (B): Map of the Netherlands showing the correspondence between the multidimensional position of samples (A) and their real geographic location. Different symbols do not necessarily correspond to clusters; they are just intended to help the comparison between the topology of the PCA plot and the geographic map.

Figure 4.7  Barriers (solid black lines) obtained with the Monmonier algorithm on a matrix of residual dialect distances, to be compared with the identical analysis on the original matrix in Fig. 4.5. The provinces of Friesland and Groningen appear as linguistically continuous, but see the text for further details. The first 20 barriers are shown, numbered from ‘1’ to ‘20’. As in Fig. 4.5, barriers corresponding to different Frisian varieties are displayed in gray. The Delaunay triangulation is visualized as a gray network.

Therefore, geographic proximity or distance plays virtually no role in residual distances. In this sense the proximity of samples in the PCA plot of Fig. 4.6 indicates that the same historical and social factors may be responsible for such similarity and vice versa. We find that the remaining structure in the multidimensional PCA plot, computed on residuals, is still striking and appears at some points to reflect geography, perhaps suggesting after all that the influence of geography is not constant. See Goebl (2006) for a reflection on the variable effect of geography. Further research and an appropriate intellectual frame seem necessary to address such new issues, which might a priori be expected to shed light on the mechanisms of linguistic differentiation through space.

The shape of Monmonier barriers, based on the matrix of residual Levenshtein distances (Fig. 4.7), confirms the barriers previously found in Zeeland (boundary ‘9’ in Fig. 4.5 and boundaries ‘11’; ‘15’; ‘17’ in Fig. 4.7) as well as the Saxon dialect area that is still surrounded by several barriers (‘1’; ‘3’; ‘10’; ‘14’).⁴ The surface of the northern part of the Saxon area seems less contoured when compared to Fig. 4.5, since its northern part (corresponding to the province of Groningen) is now geographically continuous with Friesland, but is separated from the province of Drenthe by boundaries ‘1’ and ‘3’. As in the original matrix of Levenshtein distances (Fig. 4.5), Friesland is still fragmented (as shown by the shape of barriers ‘2’; ‘5’; ‘7’; ‘9’ in Fig. 4.7) because of the dialect islands of the urban Frisian mixed varieties (Friso-Franconian) in the Frisian dialect continuum. A completely new feature of Fig. 4.7 is the boundary that begins on the left (west), follows the border between North and South Holland, and then veers south to pass vertically through the provinces of Utrecht and North Brabant (‘16’).
Even if this border has not been discussed extensively in previous studies, so that we cannot easily compare alternative explanations about its meaning, it is nonetheless interesting since it could be attributed either to heterogeneous transcriptions (Heeringa 2004) or to latent linguistic structures emphasized in some traditional maps (Lecoutere 1921).

4.4 DISCUSSION

The major aim of the study was to evaluate to what extent the patterns of geographic variation of surnames overlap with those of linguistic diversity. Family names are a specific part of language. Therefore, their interest as a proxy to Y-chromosome genetic diversity has sometimes been regarded as weak because they were also expected to be influenced by extra-chromosomal factors, i.e. the pressure of the linguistic environment. If this were true, such pressure would be detectable whatever the context. Following the geographical approach used here, and thus focusing on the barriers where geographic influence is insufficient as an explanation of genetic or linguistic difference, we note no striking correspondences between the two markers, e.g. in comparing the areas of differentiation in Figs 4.2 and 4.4, and Figs 4.3 and 4.7. We can then conclude that the pressure of the linguistic environment on surnames is absent or negligible and reasonably extend our claim to future work addressing the comparison of surnames and linguistic markers. With respect to geographic distribution, surnames are not words.

4 Hardly visible in the figure of the print version without proper enlargement.

To be sure, distributions of linguistic and genetic variation correlate very significantly (r = 0.417***),⁵ where significance was calculated using the Mantel test (1967) on the 74 sites common to the surname and dialect samples. But this just reflects the correlation existing between linguistic and geographic distances (N = 252; r = 0.546***) on the one hand, and between surname and geographic distances (N = 226; r = 0.507***) on the other. Because both pronunciation and surnames correlate strongly with geography, they seem to be correlated with each other, much as shoe size and reading ability correlate in children because both correlate strongly with age. But there is no correlation between matrices of residual Nei and Levenshtein distances, i.e. there is no correlation between surname and linguistic differences once their common dependence on geographic distance has been included in a statistical model.

If we describe the situation from the point of view of a multiple regression model in which we test geography and surname differences as independent predictors of linguistic distance, then the two predictors are collinear, leading a hasty analysis to attribute influence to both predictors, where a careful analysis in fact displays none. The correlation between linguistic and surname markers is entirely explained by their common collinearity with geography.
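The Mantel test referred to above can be sketched as a permutation test on two distance matrices. The version below is a minimal illustration under stated assumptions, not the software used for the reported coefficients: it uses Pearson's r on the upper triangles, a one-sided test, a fixed random seed, and an invented matrix; the names `mantel`, `pearson` and `upper_triangle` are hypothetical. In the setting of this chapter the two inputs would be the matrices of residual Nei and Levenshtein distances.

```python
import math
import random

def upper_triangle(m, order):
    """Upper-triangle entries of m after relabelling sites according to order."""
    n = len(order)
    return [m[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel(m1, m2, permutations=199, seed=0):
    """Return (observed r, one-sided permutation p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(m1)))
    x = upper_triangle(m1, identity)
    observed = pearson(x, upper_triangle(m2, identity))
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute rows and columns of m2 together
        if pearson(x, upper_triangle(m2, order)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Invented example: six sites on a line; a matrix is compared with itself.
coords = [0, 1, 3, 7, 12, 20]
geo = [[abs(a - b) for b in coords] for a in coords]
r, p = mantel(geo, geo)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations, which is why an ordinary significance test for r would be inappropriate here.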
In fact, we may strengthen our own conclusion that in the Netherlands there has been no demonstration of a relation between linguistics and surnames by noting the differences between the model used here and those used in our earlier dialectometric work. Nerbonne et al. (1999) calculated a correlation coefficient of r = 0.68** using an overlapping 100-element set drawn from the same full data set (which includes the Flemish part of Belgium) from which the sample used in this study was drawn, but they used a linear regression model rather than the doubly logarithmic power law model used here. The linear model clearly explains a great deal more linguistic variance than the power law model. Heeringa and Nerbonne (2001) use a logarithmic model, and although their data set yielded an unusually high correlation, we have found in general that logarithmic models function best. It appears that the optimal linguistic model takes a logarithmic form, in distinction to the power law relations favored by geneticists. This reinforces our main conclusion, viz. that the linguistic and genetic patterns of variation are different, even if they are both conditioned strongly by geography (see CHAPTER 8 for later developments).

5 Here we adopt the standard scientific notation where (*) means a significance level of 5%, while (**) and (***) indicate significance levels of 1% and 0.1%, respectively.

Our conclusions strikingly differ from those of a similar study comparing surnames and dialects from France by Scapoli et al. (2005). But we suspect that these authors failed to control their matrices of genetic and linguistic distances for common geographic conditioning, leading them to the incorrect conclusion that language similarity is an indicator of genetic kinship even at local levels. This may be occasionally true but needs to be systematically verified by analysing residuals.

Concerning the Netherlands, the only close match between the variation of surnames and dialects is found in the province of Zeeland, which is also geographically separate from surrounding areas (Figs 4.3, 4.5, and 4.7). This special status of Zeeland may be due to its geography, since it consisted until recently of several islands, which, starting in the fourteenth century (Atlas van Nederland 1996), increased in size and, thanks to land reclamation efforts, eventually turned into peninsulas at the beginning of the twentieth century. Relative social and geographic isolation, together with an economy based on fishing and trade, may have maintained and reinforced a closed social structure still visible in surname and dialect variability. This diversity is also mirrored by the different agricultural practices of ‘insular’ Zeeland and Zeeland Flanders (Fig. 4.1). Finally, an additional and complementary explanation is represented by more intense contacts with the adjacent western Flemish area (Belgium).
The computation of a regression model leading to matrices of residuals is expected to better illuminate demography (surnames) and social patterns (dialects), both of which are related to history (in a broad sense) rather than to geography. As a consequence, we can interpret the surname barriers found along the northern borders of Zeeland, North Brabant and Limburg as the results of historical phenomena. The significance of such major separations is confirmed by bootstrap matrices visualized through the Monmonier algorithm and by the analyses of residual distances (barriers ‘4’ and ‘18’ in Fig. 4.3), which brings up new issues.

As we said, the distribution of surnames only mirrors demographic phenomena,⁶ without any influence from their linguistic environment. Therefore, when we seek explanations for such barriers, which we see that linguistic culture does not support, we must turn to other factors. In this case, we are struck by the correspondence between the border induced by common surnames and the border of the Catholic area of the Netherlands (Fig. 4.8). The strength of the surname border (Fig. 4.3) suggests that the frequency of intermarriages between Catholics and Protestants was very low. This religious distinction may have acted as a social boundary, thus increasing surname differences between populations on either side of the border. The fact that there is no linguistic evidence (Fig. 4.7) of such separation means that more casual social contacts and interchange were not diminished between Catholic and Protestant populations. Communication proceeded in spite of a profound social cleft.

6 Population genetic differences heavily depend on demographic phenomena.

Our article focused on patrilineal genetic differences (surnames) and their relation to culture and its transmission. Intuitively, the observed incoherence between patrilineal markers of genetic relatedness and linguistic space distributions may be regarded as misleading, once culture (language) is assumed to be transmitted matrilineally. This was a concern expressed by one of the scholars who reviewed our article. In other words, his question was: Would our findings have been the same if surnames were maternally transmitted?

Some readers may remember a popular study pointing to the greater dispersion of females when compared to males (Seielstad et al. 1998). Such results, based on the comparison between specifically paternal (Y-chromosome) and specifically maternal (mitochondrial DNA) genetic markers, were explained in terms of patrilocality.⁷ Even if alternative explanations (Dupanloup et al. 2003) and different conclusions (Wilder et al. 2004) have been provided since, we note that this debate mainly concerns the deep time frame of prehistory and not the time frame of the surname data. Family names only portray the variability of populations as if ‘Adams’ and ‘Eves’ lived at the time of surname introduction (about two centuries ago in the Netherlands). If surnames can be a proxy to genetics, they are effective only in the depiction of recent demographic phenomena.
Even considering that in Europe ‘matrimonial migrations’ generally consist of only a few kilometers and that we are dealing here with differences that can be traced back for only eight generations, it is likely that patrilocality plays a role in our data set, meaning that females move more than males. A very recent study (Gagnon et al. 2006) based on the ‘core-fringe model’ by Heyer (1993) suggests that sons inherit their propensity to migrate from their fathers, while such transmission is largely absent among women. The intergenerational dependency in the probability of migration implies that the pool of migrants is not a representative sample. The social explanation is that, once settled somewhere, the newcomers seldom become the owners of the land (or of other means of production) so their sons are more likely to move out.

7 By patrilocality we mean a residential pattern in which a married couple settles in the husband’s home or community.

Figure 4.8  Map showing the frequency of Roman Catholics in the Netherlands in 1954 (Redrawn from van Heek 1954).

In this process, their new Y-chromosome variants tend to disappear in the next generation, while daughters of immigrants can become part of the new community by marriage and, therefore, have higher chances to enrich the local pool of genetic diversity. Even if such migrational behavior partially counteracts the effects of patrilocality, females still migrate more than males. To answer the thorny question of our reviewer: if women transmitted Dutch surnames, we would have computed pairwise surname distances smaller than patronymic ones. The picture would have been the same but with a lower level of detail because more migrations imply smaller local differences. Therefore, there are no reasons to expect a higher correlation between surname and dialect variability if female lineages were taken into account. Moreover, concerning the role of the mother in language transmission, we also recall that most linguistic studies emphasize the importance of the peer group, outside the immediate family, in influencing adolescent patterns of speech, and the general suspicion is that these are normally then resistant to change in later life. This would be a valuable area for further research.

Besides the major research question of the article, we think that some methodological outcomes should be reviewed. First of all, the use of matrices of residual linguistic distances obtained after the computation of a regression between geographic and linguistic distances has been rewarding. This approach has enabled us to visualize computationally the geographic affinity of the province of Groningen to the Frisian speaking area (Fig. 4.7). This closer relation may mirror the early linguistic history of the Groningen area, where some Frisian varieties were last spoken in the early part of the sixteenth century (Hoekstra 2001, p. 139; Niebaum 2001, p. 431).
Apart from a few contemporary phonetic features, there has been no linguistic evidence that a different language was once spoken in this area, thus underscoring the effectiveness of the methodological approach we undertook. But see Spruit (2006) for an analysis of the syntactic variability in which the north of the Netherlands appears much less heterogeneous than it does in lexical and phonetic analyses.

We would also like to emphasize the value of Monmonier’s algorithm for linguistic applications (see also Manni et al. 2004 for further discussion). The algorithm allows a geographical visualization of the variability in a distance matrix, showing where differentiation is located. Unless there is a perfect correlation between the variable under study and geographic distances, meaning that there are no major barriers, the Monmonier method adds geographical detail to multidimensional analyses such as multidimensional scaling or PCA, which are still the primary analytical tools for appraising linguistic variability. See also Goebl (2006) for an examination of the variability of the influence of geography on dialect.

At first glance, barriers computed with Monmonier’s algorithm might remind linguists of bundles of isoglosses. While Monmonier’s approach may only be applied to dialectometrical data, since it requires numerical data, it is true that it mirrors the same goal of a synthetic representation of variability that isogloss bundles were likewise designed to operationalize.

Even though the methodologies for analyzing genetic and linguistic data are becoming very similar, at a conceptual level several differences still exist. The architecture of this article reflects one of them: population geneticists are more interested in the differences between populations than in homogeneity or similarity. The main reason lies in the low differentiation of human populations on a global scale.
Only 15% of the variance of the human genome is explained by differences between groups of populations, whereas individual differences explain 85% of the total variance (Barbujani et al. 1997). In other words, two individuals living in the same area are likely to be genetically more different than two individuals living in different continents.⁸ The aforementioned reasons explain why, at small geographic scales, the leading hypothesis of human population geneticists was to expect low or non-existent genetic differentiation. Since homogeneity is expected, all the practical and conceptual work of the discipline has been focused on the detection of differences.

Linguists, on the other hand, have often focused on the geographic distribution of linguistic variety and its composition, regardless of its ultimate explanation. It is not unfair to say that the geographic conditioning of language variety has been studied as much for the light it sheds on the nature of linguistic structure as for the improved view of social history it enables. While linguistic studies, like genetic ones, are keen to catalogue the differences between language varieties that are really very similar, there has been no similar success in quantifying the degree to which language varieties, seen from the perspective of all existing varieties, might differ. Perhaps some further cross-fertilization from genetics into linguistics might be worthwhile.

In conclusion, the development of computational linguistics studies, as well as the application of spatial and statistical analyses enabled by this discipline, will tell us if dialect continua are a satisfactory view of linguistic variability or if more innovative interpretations of the geographic patterns of dialect variation are needed, especially when dealing with old or ancient linguistic patterns.
We hope that future directions of investigation will be focused on interdisciplinary understanding, exhaustively discussed by Goebl (1996), of the interrelations existing between surnames and dialects. Since we are also investigating the varying degrees to which variation in different linguistic levels (pronunciation, vocabulary, syntax) is geographically conditioned (Heeringa and Nerbonne 2006), we shall keep in mind that vocabulary distributions may offer an interesting comparison to surnames.

Acknowledgements:

We are indebted to Bruno Toupance for providing help in data processing, to Pierre Darlu and to Bob Shackleton for commenting on the manuscript, to Isabelle Dupanloup for inspiring us, and to Alain Gagnon for providing insightful references. We also thank George Welling for his critical remarks concerning the history of the Netherlands and Meindert Schroor for improving our geographic insight.

8 By the way, this is the reason why the scientific definition of races does not apply to humans; they are too similar to be partitioned into separate, biologically meaningful groups.

References:

Atlas van Nederland. 1986. 's-Gravenhage: Stichting Wetenschappelijke.
Barbujani G., Magagni A., Minch E., Cavalli-Sforza L.-L. 1997. An apportionment of human DNA diversity. Proceedings of the National Academy of Sciences of the United States of America, 94: 4516–9.
Barrai I., Rodriguez-Larralde A., Manni F., Scapoli C. 2002. Isonymy and Isolation by Distance in the Netherlands. Human Biology, 74: 263–83.
Brassel K.E., Reif D. 1979. A procedure to generate Thiessen polygons. Geographical Analysis, 11: 289–303.
Cavalli-Sforza L.-L., Bodmer W. 1971. Human population Genetics. San Francisco: Freeman.
Cavalli-Sforza L.-L., Piazza A., Menozzi P., Mountain J. 1989. Genetic and linguistic evolution. Science, 244: 1128–9.
Cavalli-Sforza L.-L., Menozzi P., Piazza A. 1994. The history and geography of human genes. Princeton (N.J.): Princeton University Press.
Chambers J. K., Trudgill P. 1998. Dialectology, 2nd edn. Cambridge, UK: Cambridge University Press.
Chen K. H., Cavalli-Sforza L.-L. 1983. Surnames in Taiwan: interpretations based on geography and history. Human Biology, 55: 367–74.
Crow J. F., Mange A. P. 1965. Measurements of inbreeding from the frequency of marriages between persons of the same surnames. Eugenics Quarterly, 12: 199–203.
Crow J. F. 1980. The estimation of inbreeding from isonymy. Human Biology, 52: 1–4.
Dupanloup I., Pereira L., Bertorelle G., et al. 2003. A recent shift from polygyny to monogamy in humans is suggested by the analysis of worldwide Y-chromosome diversity. Molecular Ecology, 13: 853–64.
Dyen I. 2004. Personal communication based on the discussion of the paper "Johannes Schmidt's 'Wave theory' and the Homomeric method", presented at the workshop Phylogenetic methods and the prehistory of languages, held at the McDonald institute for archaeological research, Cambridge (UK), 9–12 July. All papers presented at the meeting (besides this one) were published in: P. Forster, C. Renfrew (eds) 2006. Phylogenetic Methods and the Prehistory of Languages, (UK): Oxbow books.
Felsenstein J. 1985. Confidence limits on phylogenies. An approach using the bootstrap. Evolution, 39: 783–91.
Gagnon A., Toupance B. 2002. Testing isonymy with paternal and maternal lineages in the early Québec population: the impact of polyphyletism and demographic differentials. American Journal of Physical Anthropology, 117: 334–41.
Gagnon A., Toupance B., Tremblay M., Beise J., Heyer E. 2006. Transmission of migration propensity increases genetic divergence between populations. American Journal of Physical Anthropology, 129: 630–6.
Goebl H. 1996. La convergence entre fragmentations géo-linguistique et géo-génétique de l'Italie du Nord. Revue de Linguistique Romane, 60: 25–49.
Goebl H. 2006. Recent Advances in Salzburg Dialectometry. Literary and Linguistic Computing, 21: 411–435.
Heek van, F. 1954. Het geboorteniveau der Nederlandse Rooms-Katholieken. Leiden.
Heeringa W. J., Nerbonne J. 2001. Dialect areas and Dialect Continua. Language Variation and Change, 13: 375–400.
Heeringa W. J., Nerbonne J. 2006. Taalvariatie in het Nederlandse dialectgebied: een analyse op basis van lexicon en uitspraak. Nederlandse Taalkunde, 11: 218–257.
Heeringa W., Nerbonne J., Kleiweg P. 2002. Validating Dialect Comparison Methods. In: W. Gaul, G. Ritter (eds.), Classification, Automation, and New Media. Proceedings of the 24th Annual Conference of the 'Gesellschaft für Klassifikation'. University of Passau, Heidelberg: Springer, pp. 445–52.
Heeringa W., Gooskens C. 2004. Perceptive evaluation of Levenshtein dialect distance measurements using Norwegian dialect data. Language Variation and Change, 16: 189–207.
Heeringa W. 2004. Measuring dialect pronunciation differences. Ph.D. Dissertation, University of Groningen, The Netherlands.
Heyer E. 1993. Population structure and immigration; a study of Valserine valley (French Jura) from the 17th century until present. Annals of Human Biology, 20: 565–73.
Hoekstra E. 2001. Frisian Relics in the Dutch Dialects. In: H.H. Munske (ed.), Handbuch des Friesischen/Handbook of Frisian Studies. Tübingen: Niemeyer, pp. 138–142.
Kimura M. 1983. The Neutral Theory of Molecular Evolution. Cambridge (UK): Cambridge University Press.
King T. E., Ballereau S. J., Schürer K. E., Jobling M.A. 2005. Genetic signatures of coancestry within surnames. Current Biology, 16: 384–8.
Kohonen T. 1995. Self-Organizing Maps. Berlin: Springer.
Lasker G. W. 1985. Surnames and genetic structure. Cambridge (UK): Cambridge University Press.
Lecoutere C.P.F. 1921. Inleiding tot de taalkunde en tot de geschiedenis van het Nederlandsch. Brussel.
Malécot G. 1955b. The decrease of relationship with distance. Cold Spring Harbor Symposia. Quantitative Biology, 20: 52–53.
Manly B.F.J. 1997. Randomization, bootstrap and Monte Carlo methods in biology. 2nd edn, Chapman and Hall.
Manni F. 2001. Strutture genetiche e differenze linguistiche: Un approccio comparato a livello micro e macro regionale. PhD dissertation. Ferrara: University of Ferrara.
Manni F., Guérard E., Heyer E. 2004. Geographic patterns of (genetic, morphologic, linguistic) variation: how barriers can be detected by using Monmonier's algorithm. Human Biology, 76: 173–90.
Manni F., Toupance B., Sabbagh A., Heyer E. 2005. New method for surname studies of ancient patrilineal population structures, and possible application to improvement of Y-chromosome sampling. American Journal of Physical Anthropology, 126: 214–28.
Mantel N. A. 1967. The detection of disease clustering and a generalized regression approach. Cancer Research, 27: 209–20.
Monmonier M. 1973. Maximum-difference barriers: an alternative numerical regionalization method. Geographical Analysis, 3: 245–61.
Nerbonne J., Heeringa W., Kleiweg P. 1999. Edit Distance and Dialect Proximity. In: D. Sankoff, J. Kruskal (eds), Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Stanford: CSLI Press, pp. v–xv.
Niebaum H. 2001. Der Niedergang des Friesischen zwischen Lauwers und Weser. In: H.H. Munske (ed.), Handbuch des Friesischen/Handbook of Frisian Studies. Tübingen: Niemeyer, pp. 430–442.
Peakall R., Smouse P. E. 2001. GenAlEx vs. 5: Genetic analysis in Excel. Population genetic software for teaching and research. Australian National University, Canberra, Australia.
Sankoff D., Kruskal J. (eds). 1999. Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Stanford: CSLI Press.
Scapoli C., Goebl H., Sobota S., Mamolini E., Rodriguez-Larralde A., Barrai I. 2005. Surnames and dialects in France: Population structure and cultural evolution. Journal of Theoretical Biology, 237: 75–86.
Schmidt J. 1872. Die Verwandtschaftsverhältnisse der indogermanischen Sprachen. Weimar: H. Böhlau.
Séguy J. 1971. La relation entre la distance spatiale et la distance lexicale. Revue de Linguistique Romane, 35: 335–357.
Seielstad M. T., Minch E., Cavalli-Sforza L.-L. 1998. Genetic evidence for a higher female migration rate in humans. Nature Genetics, 20: 278–80.
Sokal R.R. 1988. Genetic, geographic and linguistic distances in Europe. Proceedings of the National Academy of Sciences of the United States of America, 85: 1722–6.
Spruit M. 2006. Measuring Syntactic Variation in Dutch Dialects. Literary and Linguistic Computing, 21: 493–506.
Voronoi M. G. 1908. Nouvelles application des paramètres continus à la théorie des formes quadratiques, deuxième mémoire, recherche sur le paralléloedres primitifs. Journal für die reine und angewandte Mathematik, 134: 198–207.
Wilder J. A., Kingan S. B., Mobasher Z., Pilkington M. M., Hammer M. F. 2004. Global patterns of human mitochondrial DNA and Y-chromosome structure are not influenced by higher migration rates of females versus males. Nature Genetics, 36: 1122–5.
Yasuda N., Morton N. E. 1967. Studies on human population structure. In: J.F. Crow, J.V. Neel Jr. (eds.), Third International Congress of Human Genetics. Baltimore: Johns Hopkins Press, pp. 249–65.
Yasuda N., Furusho T. 1971. Random and non random inbreeding revealed from isonymy study. Small cities in Japan. The American Journal of Human Genetics, 23: 303–16.
Yasuda N., Cavalli-Sforza L.-L., Skolnick M., Moroni A. 1974. The evolution of surnames: an analysis of their distribution and extinction. Theoretical Population Biology, 5: 123–42.
Zei G., Matessi R. G., Siri E., Moroni A., Cavalli-Sforza L.-L. 1983. Surnames in Sardinia. I. Fit of frequency distributions for neutral alleles and genetic population structure. Annals of Human Genetics, 474: 329–52.


FOOTPRINTS OF MIDDLE AGES KINGDOMS ARE STILL VISIBLE … IN SPAIN 105

106 CHAPTER 5

This chapter has been published, please cite the original reference:

Rodríguez‐Díaz R., Manni F., Blanco‐Villegas M‐J. 2015. Footprints of Middle Ages Kingdoms Are Still Visible in the Contemporary Surname Structure of Spain. PLoS ONE 10(4): e0121472. doi:10.1371/journal.pone.0121472

SUPPLEMENTARY FILES are available at the Journal’s website.

ABSTRACT  To assess whether the present-day geographical variability of Spanish surnames mirrors historical phenomena that occurred at the time of their introduction in Spain (13th-16th century), and to infer the possible effect of foreign immigration (about 11% of the present-day population) on the observed patterns of diversity, we have analyzed the frequency distribution of 33,753 unique surnames (tokens) occurring 51,419,788 times, according to the list of Spanish residents of the year 2008. Isonymy measures and surname distances have been computed for, and between, the 47 mainland Spanish provinces and compared to a numerical classification of the corresponding language varieties spoken in Spain. The comparison of the two bootstrap consensus trees, representing surname and linguistic variability, suggests a similar picture; major clusters are located in the east (Aragón, Cataluña, Valencia) and in the north of the country (Asturias, Galicia, León). The remaining regions appear to be considerably homogeneous. We interpret this pattern as the long-lasting effect of the surname and linguistic normalization actively led by the Christian kingdoms of the north (Reigns of Castilla y León and Aragón) during and after the southwards reconquest (Reconquista) of the territories ruled by the Arabs from the 8th century to the late 15th century, that is, when surnames became transmitted in a fixed way and when Castilian linguistic varieties became increasingly prestigious and spread out. The geography of contemporary surname and linguistic variability in Spain corresponds to the political geography at the end of the Middle Ages. The synchronicity between surname adoption and the political and cultural effects of the Reconquista has permanently forged a Spanish identity that subsequent migrations did not deface.

FOOTPRINTS OF MIDDLE AGES KINGDOMS ARE STILL VISIBLE IN THE CONTEMPORARY SURNAME STRUCTURE OF SPAIN

5.1 INTRODUCTION

Fifty years have passed since the seminal study on the relation between surname diversity and inbreeding by Crow and Mange [1] and, today, surname studies constitute a large body of research in population genetics. The vertical transmission of surnames (generally along the male line), their availability in large numbers (telephone directories, conscription lists, voters' lists, etc.), and the large spectrum of applications they have in different disciplines (demography, genetics, geography, linguistics, history) have made them a popular source of data. General background can be found in references [2, 3, 4, 5, 6, 7].

In many European countries, surnames started to develop in the early Middle Ages and became stable, generation after generation, towards the 16th century. Initially they were a simple way to identify a family or a person, like a designation or nickname that could vary over time and across generations (the son of Paul, the miller, the man from the river, etc.), while later, because of the administrative need to identify a lineage without ambiguities, they became progressively fixed in meaning and spelling. The geographic diversity of naming practices, the influence of regional languages, and the political organization of a region contributed to the diversity of the initial surnames, which remain, several centuries after their introduction, a marker of regional diversity. Analyzing Spanish surnames as markers of population diversity, or homogeneity, is the purpose of this research.

In Spain, the use of surnames started in the 10th century. They became progressively inherited from the 13th to the 15th century, that is, during the growing expansion of the Reign of Castilla and the Reconquista of the territories ruled by the Arabs in the preceding centuries.
Isabel I, queen of Castilla y León, and her husband Ferdinand II of Aragón (known as the Catholic Majesties) completed the reconquest and gave stability to their respective reigns in a process of large political integration that officially ended, with a unique crown, in 1714 (until this date the two reigns retained separate legal systems). At the time, religion was also a political instrument, and the Spanish Inquisition (a Catholic administration that existed between 1478 and 1834 and was officially devoted to the preservation of religious orthodoxy) contributed to the Castilianization and normalization of surnames (a process that lasted until 1870), ultimately leading to a considerable loss of surname diversity. Newly adopted surnames (often the same prototypical ones) frequently ended with the suffix -ez, a typical Castilian patronymic form (Fernández, Rodríguez, Gutiérrez, etc.). This process corresponds to the growth of the Castilian kingdom, when Aragónese, Asturian-Leonese and Basque names were Castilianized. This is particularly the case for Aragónese and Asturian-Leonese family names. A confirmation that records, and therefore surnames, were generally written in Castilian comes from the Spanish surnames existing in Latin America, a major destination of historical Spanish emigration, which are generally in this form. The process of Castilianization, together with the use of a double-surname system (two single surnames A and B can lead to four double-surname combinations AB, BA, AA, BB), largely explains why Spain has a lower number of surname variants than other European countries [8, 9, 10, 11].
Castilianized forms lasted because, in the 16th century, the prescriptions of the Ecumenical Council of the Roman Catholic Church (held in Trento, Italy, from 1545 to 1563) made mandatory the use of vertically transmitted surnames, stable generation after generation, in order to keep exhaustive birth and marriage records in all parishes (death and census records became compulsory in 1614).

Arab and Jewish surnames, borne by two large communities in the Spain of that time, did not escape the “normalization” described above, meaning that a large proportion of the Arab and Jewish family names existing nowadays in Spain corresponds to later immigration, and not to the descendants of the medieval population.

By using the national telephone directory as a source of data, the Spanish surname corpus has been studied at least three times [12, 13, 14], the last study concerning a wider European frame. The clustering of Spanish regions reported in [12], according to surname similarity, was contradicted in [14] and not addressed in [13]. This is the reason why we focus on Spanish surnames again.

Our aim is to see whether the present-day corpus of Spanish surnames, which internal migrations and international immigration to Spain (Table 5.1) have modified, still conveys the signal of its historical origin. In this frame, Manni et al. [15] and Boattini et al. [16] have developed and validated a method to identify, from modern surname corpora, the family names that are historically typical (“autochthonous”) of given regions, that is, those that better describe the Middle Ages population stock in geographical terms. In this article we adopt the opposite approach by avoiding any surname selection in the analysis of the Padrón municipal, the largest available register of the Spanish population. The Padrón municipal is a national census of all residents, listed regardless of their age and nationality. It also includes all the immigrants whose administrative status has not been decided yet. This database is less conservative than telephone directories or voters' lists, which have a lower sample size and do not represent all social groups, particularly immigrants (see Table 5.1 and reference [17]).

Several languages are currently spoken in Spain (Table 5.2) and they have had an influence on surname diversity.
To take this aspect into account, we report a computational analysis of linguistic features found in the linguistic atlas of the Iberian Peninsula (ALPI) [18], adapted from the original publication of Hans Goebl [19].

All the computational work we present, accounting for surname and linguistic diversity, is based on pairwise distance measures between the 47 mainland Spanish provinces. The Canary and the Balearic islands have been excluded, as well as the Spanish overseas territories located south of the Strait of Gibraltar (Ceuta and Melilla).²

The robustness of the surname classification has been tested by bootstrap [20]. In fact, it is our experience that regular clustering lacks stability because clustering algorithms look for the minimum distance between two points in a distance matrix and, often, several pairs of elements show similar distances. As a consequence, small differences in the input data matrix can lead to considerably different clusters.

Table 5.1  Foreign population in Spain by countries of origin. The foreign population is 11.3% (5,220,577 individuals) over a total of 46,157,822 individuals. Source: Padrón municipal 2008, Instituto Nacional de Estadística de España (INE).

Provenance | Percentage [%] | Provenance | Percentage [%]
Romania | 14 | Ecuador | 8
United Kingdom | 7 | Colombia | 5
Germany | 3 | Bolivia | 5
Italy | 3 | Argentina | 3
Other Europe | 18 | Other Americas | 13
Total Europe | 45 | Total Americas | 34
Morocco | 12 | China | 2
Other Africa | 5 | Other Asia | 2
Total Africa | 17 | Total Asia | 4

Table 5.2  Linguistic varieties spoken in Spain in 2014, according to the Ethnologue [36]. Only the varieties followed by an asterisk are accounted in the Linguistic Atlas of the Iberian Peninsula [18] that we have reprocessed with computational linguistics methods.

Language | Ethnologue code | Speakers
Aragónese* | arg | ~10,000
Asturian-Leonese* | ast | ~100,000
Basque* | eus | ~400,000
Iberian Romani (Calò) | rmq | ~40,000
Catalan* | cat | ~3,750,000
Erromintxela (Basque Calò) | emx | ~500
Extremaduran* | ext | ~200,000
Fala | fax | ~5,000
Galician* | glg | 2,300,000
Gascon, Aranese* | oci | ~4,000
Castilian* | spa | ~38,000,000

To overcome this instability, at least two main methods have been suggested and tested on surname and linguistic data over the years: bootstrapping [20, 21] and noisy clustering [22]. Briefly, noisy clustering can be viewed as a procedure in which different amounts of random noise are added to the distance matrix during repeated clustering, whereas bootstrapping consists of varying the input dataset in subsequent clustering iterations, allowing some surnames, or words, to be repeated. Both techniques lead to a consensus (or composite) dendrogram [23]; this is the kind of output that we will analyze. We consider the test of robustness an essential step, and it is likely that the inconsistencies between the classifications of Spanish surnames we noted in the papers by Rodriguez-Larralde et al. [12] and Scapoli et al. [14] might have been avoided if the authors had applied a test of this kind.
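The instability that noisy clustering is meant to expose can be illustrated with a minimal sketch. It is not the procedure of reference [22]: it only repeats the first step of an agglomerative clustering (finding the closest pair) on a toy distance matrix after adding random noise. The Gaussian noise model, the function names and the toy distances are all assumptions made for illustration.

```python
import random
from collections import Counter


def closest_pair(matrix):
    """Return the pair of items separated by the smallest distance."""
    return min(matrix, key=matrix.get)


def noisy_first_merges(matrix, noise_sd, n_iter=100, seed=0):
    """Count which pair would be merged first when noise is added.

    Gaussian noise with standard deviation `noise_sd` is an assumption;
    the source only says that "random noise" is added to the matrix.
    """
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(n_iter):
        noisy = {pair: d + rng.gauss(0.0, noise_sd) for pair, d in matrix.items()}
        tally[closest_pair(noisy)] += 1
    return tally


# Hypothetical distances: A-B and A-C are nearly tied, B-C is clearly larger.
toy = {("A", "B"): 0.50, ("A", "C"): 0.51, ("B", "C"): 0.90}

print(noisy_first_merges(toy, noise_sd=0.0))   # without noise the first merge never changes
print(noisy_first_merges(toy, noise_sd=0.05))  # with noise the near-tie can flip
```

A full noisy clustering would build a complete dendrogram at every iteration and then summarize all of them in a consensus tree; the sketch stops at the first merge because that is where near-ties in the distance matrix become visible.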

5.2 METHODS

5.2.1 The surname data

The Padrón municipal (list of residents by municipality regardless of age and citizenship) of the year 2008 has been obtained from the Spanish National Institute of Statistics (Instituto Nacional de Estadística, INE; summary statistics are freely available at www.ine.es). We processed the 45,593,365 personal records corresponding to the 47 mainland official provinces and to 15 regions (see Fig. 5.1 and Table 5.3).

Figure 5.1  Geographic map of Spain. The names of the provinces are reported in small capital letters. The names of the regions are shown as black labels. Note that some regions consist of a single province. Spanish islands are not represented in the map because they were not studied.


Table 5.3  Surname data per province and region. [N (INE)] is the official population size (Padrón municipal 2008). [N Tokens] is the number of occurrences of the [S = Tokens] different single surnames (ex: from Rodriguez Diaz we obtain two tokens: Diaz; Rodriguez). Entropy is as in [37].

Province | Region | Municipalities | N (INE) | Density (ind/km2) | N Tokens | N Tokens / N full pop. | S = Tokens | S / N Tokens | Isonymy | Entropy
Albacete | Castilla - La Mancha | 87 | 397493 | 332.53 | 696059 | 1.75 | 2890 | 0.00415 | 0.016301 | 5.612761
Alicante/Alacant | Valencia | 141 | 1891477 | 80.16 | 2753076 | 1.46 | 9151 | 0.00332 | 0.007232 | 6.578898
Almería | Andalucía | 102 | 667635 | 104.92 | 1023128 | 1.53 | 4204 | 0.00411 | 0.013206 | 5.69845
Araba/Álava | Pais Vasco/Basques | 51 | 309635 | 26.95 | 503959 | 1.63 | 5567 | 0.01105 | 0.00724 | 6.705729
Asturias | Asturias | 78 | 1080138 | 102.00 | 1920957 | 1.78 | 6224 | 0.00324 | 0.025725 | 5.427719
Ávila | Castilla y León | 248 | 171815 | 21.45 | 267840 | 1.56 | 1693 | 0.00632 | 0.026546 | 4.974525
Badajoz | Extremadura | 164 | 685246 | 31.90 | 1203833 | 1.76 | 3949 | 0.00328 | 0.007839 | 6.184083
Barcelona | Cataluña | 311 | 5416447 | 715.42 | 301742 | 0.06 | 4061 | 0.01346 | 0.006367 | 6.584385
Bizkaia | Pais Vasco/Basques | 112 | 1146421 | 520.86 | 1810854 | 1.58 | 9452 | 0.00522 | 0.006089 | 6.905668
Burgos | Castilla y León | 371 | 373672 | 26.26 | 590646 | 1.58 | 3858 | 0.00653 | 0.009664 | 6.082197
Cáceres | Extremadura | 221 | 412498 | 20.91 | 696303 | 1.69 | 2814 | 0.00404 | 0.010414 | 5.905415
Cádiz | Andalucía | 44 | 1220467 | 167.21 | 2218350 | 1.82 | 6089 | 0.00275 | 0.007967 | 6.232632
Cantabria | Cantabria | 102 | 582138 | 114.60 | 945248 | 1.62 | 4671 | 0.00494 | 0.012546 | 5.929513
Castellón/Castelló | Valencia | 135 | 594915 | 91.13 | 867299 | 1.46 | 5821 | 0.00671 | 0.004314 | 6.8022
Ciudad Real | Castilla - La Mancha | 102 | 522343 | 26.96 | 886695 | 1.70 | 3312 | 0.00374 | 0.009924 | 5.968138
Córdoba | Andalucía | 75 | 798822 | 58.51 | 1473060 | 1.84 | 4186 | 0.00284 | 0.007301 | 6.120176
Coruña (A) | Galicia | 94 | 1139121 | 145.01 | 2029253 | 1.78 | 4994 | 0.00246 | 0.008848 | 6.165282
Cuenca | Castilla - La Mancha | 238 | 215274 | 12.79 | 324160 | 1.51 | 2079 | 0.00641 | 0.012363 | 5.844352
Gipuzkoa | Pais Vasco/Basques | 88 | 701056 | 356.37 | 1066577 | 1.52 | 7283 | 0.00683 | 0.003606 | 7.024319
Girona | Cataluña | 221 | 731864 | 128.04 | 821377 | 1.12 | 6811 | 0.00829 | 0.004763 | 6.82443
Granada | Andalucía | 168 | 901220 | 73.17 | 1531544 | 1.70 | 3918 | 0.00256 | 0.012391 | 5.726686
Guadalajara | Castilla - La Mancha | 288 | 237787 | 21.01 | 310916 | 1.31 | 3010 | 0.00968 | 0.011244 | 5.974349
Huelva | Andalucía | 79 | 507915 | 51.52 | 860123 | 1.69 | 3553 | 0.00413 | 0.011199 | 5.771795
Huesca | Aragón | 202 | 225271 | 14.60 | 276891 | 1.23 | 3530 | 0.01275 | 0.003198 | 6.883493
Jaén | Andalucía | 97 | 667438 | 49.69 | 1206451 | 1.81 | 3193 | 0.00265 | 0.009574 | 5.899168
Rioja (La) | Rioja | 174 | 317501 | 64.02 | 478477 | 1.51 | 4135 | 0.00864 | 0.009702 | 6.20085
León | Castilla y León | 211 | 500200 | 31.95 | 834196 | 1.67 | 3264 | 0.00391 | 0.022666 | 5.34661
Lleida | Cataluña | 231 | 426872 | 36.53 | 496441 | 1.16 | 5871 | 0.01183 | 0.003101 | 7.149126
Lugo | Galicia | 67 | 355549 | 35.66 | 619753 | 1.74 | 2357 | 0.00380 | 0.022581 | 5.34707
Madrid | Madrid | 179 | 6271638 | 808.32 | 2575022 | 0.41 | 10114 | 0.00393 | 0.009428 | 6.388433
Málaga | Andalucía | 101 | 1563261 | 222.33 | 2508900 | 1.60 | 7438 | 0.00297 | 0.008479 | 6.144749
Murcia | Murcia | 45 | 1426109 | 129.94 | 2426454 | 1.70 | 6650 | 0.00274 | 0.014517 | 5.897324
Navarra | Navarra | 272 | 620377 | 61.03 | 875044 | 1.41 | 6509 | 0.00744 | 0.004311 | 7.012149
Ourense | Galicia | 92 | 336099 | 45.63 | 581330 | 1.73 | 2147 | 0.00369 | 0.024442 | 5.130659
Palencia | Castilla y León | 191 | 173454 | 21.32 | 272273 | 1.57 | 2006 | 0.00737 | 0.011211 | 5.797694
Pontevedra | Galicia | 62 | 953400 | 214.37 | 1710057 | 1.79 | 5019 | 0.00294 | 0.011019 | 6.014782
Salamanca | Castilla y León | 362 | 353404 | 28.58 | 578092 | 1.64 | 2548 | 0.00441 | 0.027121 | 5.137096
Segovia | Castilla y León | 209 | 163899 | 23.72 | 231590 | 1.41 | 1696 | 0.00732 | 0.014007 | 5.549435
Sevilla | Andalucía | 105 | 1875462 | 137.43 | 3385166 | 1.80 | 8302 | 0.00245 | 0.008563 | 6.220057
Soria | Castilla y León | 183 | 94646 | 9.23 | 137558 | 1.45 | 1393 | 0.01013 | 0.010763 | 5.653739
Tarragona | Cataluña | 184 | 788895 | 128.75 | 1020551 | 1.29 | 7542 | 0.00739 | 0.004505 | 6.934706
Teruel | Aragón | 236 | 146324 | 9.77 | 193834 | 1.32 | 2147 | 0.01108 | 0.005673 | 6.233202
Toledo | Castilla - La Mancha | 204 | 670203 | 46.01 | 1009269 | 1.51 | 4194 | 0.00416 | 0.014376 | 5.759388
Valencia/València | Valencia | 266 | 2543209 | 240.10 | 2134762 | 0.84 | 7114 | 0.00333 | 0.00615 | 6.626234
Valladolid | Castilla y León | 225 | 529019 | 65.95 | 892428 | 1.69 | 4706 | 0.00527 | 0.010633 | 6.084073
Zamora | Castilla y León | 248 | 197221 | 18.31 | 311923 | 1.58 | 1873 | 0.00601 | 0.012807 | 5.631792
Zaragoza | Aragón | 293 | 955323 | 56.34 | 1560327 | 1.63 | 10372 | 0.00665 | 0.003952 | 7.198151


The INE sent us the list of surnames appearing at least five times in a single municipality, meaning that very rare surnames can be absent from the dataset.

The vast majority of Spanish surnames are double (e.g. Rodríguez-Diaz): the first one is inherited from the father (Rodríguez-) and the second one from the mother (-Diaz). At each generation the surname of the mother is lost, only the one of the father being constantly listed in the first position (recent laws now allow the order to be swapped or only one surname of choice to be kept). Paternal and maternal surnames constantly mix in a double form that is not stable over time and just depends on the random process of mating. This is why we decided to process them as separate tokens corresponding to single surname types (Diaz; Rodríguez) and not to individuals (Rodríguez Diaz). For computational ease, we kept only the surnames having a frequency higher than 20 occurrences (see Fig. 5.2). In fact, rare surnames can be excluded because their contribution to the computation of a distance matrix is minimal.

Figure 5.2  Diagram showing how surname data have been processed. Please refer to the text for a detailed explanation of the different steps.


Sometimes the records correspond to individuals having only a single surname, either because the bearers were adopted or because they immigrated from countries where a single-surname system is used. In this case, the record contributes only one surname to the tokens database (Fig. 5.2).

The whole procedure (see Fig. 5.2) has been the occasion to correct many obvious misspellings and digitization errors and led to a final working database of 33,753 different single surnames (tokens) occurring 51,419,788 times. The database is available at the following URL: http://ecoanthropologie.cnrs.fr/article966.html.
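The token extraction described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the authors' actual pipeline: the record layout (a pair of surnames, the second possibly missing), the upper-casing used to merge spelling variants, and the function names are hypothetical.

```python
from collections import Counter


def surname_tokens(records):
    """Split (first_surname, second_surname) records into single-surname tokens.

    A record with only one surname (second element None or empty)
    contributes a single token.
    """
    tokens = []
    for first, second in records:
        for name in (first, second):
            if name:
                # Upper-casing is an assumed normalization step.
                tokens.append(name.strip().upper())
    return tokens


def frequent_surnames(tokens, min_count=21):
    """Keep surnames occurring more than 20 times (count >= min_count)."""
    counts = Counter(tokens)
    return {name: n for name, n in counts.items() if n >= min_count}


# Toy records (hypothetical, not taken from the Padrón municipal).
records = [("Rodriguez", "Diaz"), ("Diaz", "Lopez"), ("Smith", None)]
tokens = surname_tokens(records)
print(tokens)
# A lower threshold is used here only because the toy sample is tiny.
print(frequent_surnames(tokens, min_count=2))
```

Counting tokens rather than individuals is what makes N Tokens in Table 5.3 larger than the population size for most provinces, since most residents contribute two surnames.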

5.2.1.1 Surnames: From isonymy measures of inbreeding to surname distances

Several distance measures accounting for the diversity of two sets of surnames (pairwise difference) have been developed over time. They are often based on isonymy, that is, the probability that two surnames, randomly extracted, are identical. This probability is computed for all the different surnames in the database and depends on their frequency. In short, if two areas do not share any surname, the measure of isonymy will be null; in the opposite extreme case, when the two areas are inhabited by individuals all bearing the same single surname, isonymy will be equal to 1. Mathematically, isonymy is computed as:

$$I_{ij} = \sum_{s} n_{si}\, n_{sj} \tag{1}$$

where $n_{si}$ denotes the relative frequency of a given surname $s$ in location $i$, while $n_{sj}$ denotes the relative frequency of the same surname in location $j$. The sum is taken over all surnames.

Isonymy was originally developed to estimate inbreeding in marriage registers by computing the number of same-surname marriages, which often correspond to marriages between first cousins. Since this application, the computation of isonymy, as a probability, has been extended to large databases.

From isonymy, several similar measures of diversity can be computed, for instance Hedrick's [24], Nei's [25], Lasker's [26] and Relethford's [27] coefficients. In this paper we first computed the coefficients of Hedrick and Nei and later kept only the latter. Hedrick's H and Nei's N coefficients are defined as follows:

$$H = \frac{\sum_{s} n_{si}\, n_{sj}}{\frac{1}{2}\left(\sum_{s} n_{si}^{2} + \sum_{s} n_{sj}^{2}\right)} \tag{2}$$

$$N = \frac{\sum_{s} n_{si}\, n_{sj}}{\left(\sum_{s} n_{si}^{2}\, \sum_{s} n_{sj}^{2}\right)^{1/2}} \tag{3}$$

where parameters are defined as in (1). Distances were obtained by the following transformations:

Hedrick's distance: $$d_{H} = -\log(H) \tag{4}$$

Nei's distance: $$d_{N} = -\log(N) \tag{5}$$

We ended up with two 47 × 47 distance matrices (according to formulas 4 and 5) accounting for the 47 mainland Spanish provinces. The computations were run with the DISNEI program, written by P. Darlu. The software is available upon request to [email protected].

5.2.1.2 Test of robustness ‐ bootstrap

Each surname has a peculiar frequency distribution in space. Any new subset of a surname database, once analyzed, leads to a different distance matrix, the differences being determined by the random presence or absence of given surnames. Instead of computing a single distance matrix on the whole dataset, we preferred to resample the original data to obtain 100 subsets and, from them, 100 different pairwise distance matrices. Then, from each matrix, we computed a separate Neighbor Joining (NJ) tree [28]. From the 100 trees, we computed a consensus tree where each node is scored to reflect the number of times it appears in the 100 NJ dendrograms (Fig. 5.3). This procedure is called bootstrap and consists in resampling, with replacement, the original dataset. The essence of the bootstrap is to give different weights to the surnames in each resampled dataset. The consensus tree (Fig. 5.3) shows how stable the classification is across the resampled datasets. In our case the scores, reported for each node, can vary from 1 to 100. Computations were performed with the DISNEI software and the trees were plotted with Treeview [29]. Only the nodes supported by at least 50 NJ trees are shown (score ≥ 50); otherwise the branches have been collapsed (Fig. 5.3).
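The resampling step can be sketched as follows. This is only an illustration under stated assumptions, not the DISNEI implementation: we assume that surnames are drawn with replacement from the full list, that the number of times a surname is drawn acts as its weight in the sums of formula (3), and that the relative frequencies themselves are left unchanged. Function names and toy frequencies are hypothetical. The Neighbor Joining and consensus steps are not shown.

```python
import math
import random


def weighted_nei_distance(freq_i, freq_j, weights):
    """Nei's distance (formulas 3 and 5) with each surname weighted by its draw count."""
    num = sum(w * freq_i.get(s, 0.0) * freq_j.get(s, 0.0) for s, w in weights.items())
    if num == 0:
        return math.inf  # assumption: no shared surname in this replicate
    sum_i = sum(w * freq_i.get(s, 0.0) ** 2 for s, w in weights.items())
    sum_j = sum(w * freq_j.get(s, 0.0) ** 2 for s, w in weights.items())
    return -math.log(num / math.sqrt(sum_i * sum_j))


def bootstrap_matrices(freqs, n_replicates=100, seed=0):
    """Return one pairwise distance matrix (as a dict) per bootstrap replicate.

    `freqs` maps each province to a {surname: relative frequency} dict.
    """
    rng = random.Random(seed)
    surnames = sorted({s for f in freqs.values() for s in f})
    provinces = sorted(freqs)
    replicates = []
    for _ in range(n_replicates):
        # Draw as many surnames as there are in the list, with replacement.
        drawn = rng.choices(surnames, k=len(surnames))
        weights = {}
        for s in drawn:
            weights[s] = weights.get(s, 0) + 1
        matrix = {
            (a, b): weighted_nei_distance(freqs[a], freqs[b], weights)
            for a in provinces
            for b in provinces
            if a < b
        }
        replicates.append(matrix)
    return replicates


# Hypothetical toy frequencies, for illustration only.
toy_freqs = {
    "A": {"Garcia": 0.5, "Puig": 0.5},
    "B": {"Garcia": 0.2, "Lopez": 0.8},
    "C": {"Puig": 0.1, "Lopez": 0.9},
}
toy_replicates = bootstrap_matrices(toy_freqs, n_replicates=100, seed=1)
```

Each of the 100 matrices would then be passed to a tree-building routine, and the support of a node would be the number of trees in which it appears.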


Figure 5.3  Consensus bootstrap trees [20] based on 100 Neighbor Joining [28] trees computed on Hedrick’s [24] and Nei’s [25] surname distances. The major clusters discussed in the text (see labels abbreviated as C1, C2, etc.) are geographically displayed in Fig. 5.7A. A simplified linguistic tree, fully reported in Fig. 5.4, is plotted on the right. All branches having a bootstrap score lower than 50 have been collapsed.

5.2.2 The Linguistic Atlas of the Iberian Peninsula (ALPI – Atlas Lingüístico de la Peninsula Ibérica)

The Linguistic Atlas of the Iberian Peninsula (ALPI) [18] consists of a single volume containing 75 maps accounting for 70 basilectal features in Spain and Portugal. A basilect is a variety of language, often a dialect. The atlas has not been fully published and five volumes are still missing. It is interesting to explain why. The project was started in 1914 by the Spanish philologist Ramón Menéndez Pidal (University of Madrid, Spain) and was supervised by his student Tomás Navarro Tomás. Three teams of fieldworkers covered the Iberian Peninsula, i.e. the Catalan zone, the Castilian-speaking area and the Galician-Portuguese area. The Basque varieties were not recorded and the islands were not fully sampled (this is why we analyzed only the surnames of continental Spain). When the Spanish Civil War (1936-1939) started, the fieldwork had almost been completed, but Navarro Tomás went into exile and took all the materials with him. The notebooks were returned to Spain in 1951, and the remaining surveys were completed between 1947 and 1954. Manuel Sanchis Guarner coordinated, with Lorenzo Rodríguez Castellano and Aníbal Otero, the painstaking work of preparing the materials for publication, which led to the only available volume of the Atlas [18]. Shortly thereafter, the publishing work was suspended, and the ALPI notebooks were left (almost forgotten) in different places (private homes, different kinds of archives) until they were found and photocopied between 1999 and 2001 by David Heap (University of Western Ontario, Canada) in an epic enterprise. The remaining five volumes will hopefully be published one day, once all the notebooks have been transcribed and the data cleaned. At the moment, the original records are available as high-resolution image files. For more details see http://westernlinguistics.ca/alpi/more_info.php.
A dialectometric analysis of the first volume started in 2009 in the laboratory of Hans Goebl (University of Salzburg, Austria). Dialectometry is the subfield of dialectology aimed at mathematically measuring the differences between dialect varieties. The original data were analyzed by Goebl to identify phonetic, morphological, syntactic and lexical features. Each feature was processed separately in 375 working maps corresponding to 532 sampling points. Goebl sent us the final 532 × 532 similarity matrix computed according to the relative identity value (originally in German: relativer Identitätswert, RIW [30, 31]). The RIW measures the similarity between two basilectal varieties as the percentage of items on which the two varieties agree. This is an accepted method to measure differences between dialects and closely related languages. A description of the computational work is reported in [19].


5.2.2.1 Reanalysis of the dialectometric matrix of linguistic similarity

To compare the linguistic similarity of Spain to its surname diversity, we selected a subset of 44 sample points out of the 532 listed in the ALPI [18]. We kept only one linguistic variant per province (47 mainland Spanish provinces minus the 3 Basque-speaking provinces that have not been sampled in the ALPI = 44 provinces). We discarded data about . To make our choice, as many provinces are partly bilingual, we kept the variant spoken in the capital city of each province. This criterion has no special justification besides its simplicity. The selection of different sampling points might have changed our results; nevertheless, a few alternative trials have shown that the main structure of linguistic diversity remains stable. We consider that the selection made, though arbitrary, is sufficient for a study addressing linguistic and surname variation at a provincial level. We stress that selecting the variants recorded in capital cities does not alter the clustering, which would have remained the same had other variants been selected. As we wanted to analyze linguistic diversity in terms of distance matrices, we applied the transformation:

Linguistic distance = 1 − RIW    (6)

As with surnames, tiny differences in the selection of the features that are aggregated in a linguistic database can lead to unstable results (different distance matrices, different clustering). As we had no access to the dataset of linguistic features reported in [19], we could not use the bootstrap; this is why we tested the robustness of linguistic classifications with noisy clustering [21]. Bootstrap and noisy clustering give comparable results once the (arbitrary) level of noise is set to appropriate values [21]. We added a random level of noise to our 44 × 44 distance matrix in order to obtain 100 “noisy” distance matrices (noise level of 0.5; 1,000,000 runs). As with the bootstrap, we computed 100 trees and, finally, a consensus tree (Fig. 5.4). We kept a minimal bootstrap score ≥ 90 because linguistic structures are generally more robust than surname ones. To the consensus tree reported in Fig. 5.4, we have added the Basque provinces excluded from the ALPI, according to the position they would certainly have had after a computational analysis. This artifice is fully justified because Basque is one of the most divergent languages of all Europe and certainly the most divergent of the whole Iberian Peninsula, meaning that any computational method would classify the Basque varieties as an outgroup cluster.
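The generation of the “noisy” matrices can be sketched as follows. The exact noise scheme of [21] is not restated here, so this sketch rests on an assumption: each pairwise distance receives an independent random increment drawn uniformly between 0 and the noise level times the standard deviation of the original distances. The function name and the toy matrix are hypothetical, and the tree-building and consensus steps are not shown.

```python
import random
import statistics


def noisy_matrices(dist, noise_level=0.5, n_matrices=100, seed=0):
    """Return `n_matrices` perturbed copies of a symmetric distance matrix.

    Assumption: the noise ceiling is noise_level * (population) standard
    deviation of the off-diagonal distances; the diagonal stays at zero.
    """
    rng = random.Random(seed)
    n = len(dist)
    upper = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    ceiling = noise_level * statistics.pstdev(upper)
    result = []
    for _ in range(n_matrices):
        noisy = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                value = dist[i][j] + rng.uniform(0.0, ceiling)
                noisy[i][j] = value
                noisy[j][i] = value  # keep the matrix symmetric
        result.append(noisy)
    return result


# Hypothetical 3 x 3 distance matrix, for illustration only.
toy_matrix = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 3.0, 0.0],
]
toy_noisy = noisy_matrices(toy_matrix, noise_level=0.5, n_matrices=100, seed=1)
```

Clusters that survive in most of the perturbed matrices are the ones retained in the consensus tree, in the same spirit as bootstrap support.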


Figure 5.4  Consensus “clustering with noise” [21, 22] tree displaying Spanish linguistic varieties according to the Linguistic Atlas of the Iberian Peninsula [18] as processed in [19]. The linguistic distance (1 − RIW) is based on the relativer Identitätswert [30, 31]. Major clusters (see labels) discussed in the text are geographically displayed in Fig. 5.7B. All branches having a score lower than 90 have been collapsed. Solid lines correspond to the result of the computational analysis, while dotted lines correspond to the position that Basque varieties (absent from the ALPI) would have had after a computational classification (see text for details). A simplified version of the tree is shown in Fig. 5.3.

5.3 RESULTS

5.3.1 Surname diversity in Spain – General statistics

The working database comprised 33,753 tokens (single surnames) occurring 51,419,788 times. This quantity exceeds the number of individuals constituting the Spanish population in 2008 (46,157,822 according to the Padrón municipal) because we added the paternal and the maternal surnames to the database separately (Fig. 5.2). If we had considered the full population, and if each resident had a typical Spanish double surname, we would have had more than 92 million occurrences (46,157,822 × 2 = 92,315,644). The theoretical discrepancy with the working database we processed is ~41 million missing occurrences; one explanation is that many foreigners do not have a double surname (unless they come from Latin American countries using a Spanish double-surname system, see Table 5.1). In 2008, foreign residents officially were 11.3% of the total, and 66% of them came from countries using a single-surname system, meaning that they can account for ~3.5 million missing occurrences (Spanish citizens who do not have a double surname, either because they were adopted or by deliberate personal choice, are a small minority). The choice of the Spanish national institute of statistics (INE) to provide only data corresponding to individuals having a surname appearing at least 5 times within each municipality (to preserve anonymity), and our decision to analyze only the surnames occurring at least 20 times (see Fig. 5.2), account for the remaining, larger fraction of the discrepancy (~37 million). This said, and while our sample remains largely representative of the Spanish population (we processed ~58% of the maximal expected number of occurrences), we noticed that the aggregation of municipal data at the provincial level leads to a lower sample representativeness for some provinces (see ratio N tokens / N full population in Table 5.3).
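The arithmetic behind these figures can be retraced with the short sketch below. All input values come from the text; the variable names are ours. One point is our reading rather than a statement of the text: the ~58% coverage is reproduced when the maximal expected number of occurrences is taken after subtracting the single-surname foreign residents (the ratio to the unadjusted 92,315,644 is about 0.557).

```python
population_2008 = 46_157_822        # Padrón municipal, 2008
observed_occurrences = 51_419_788   # occurrences in the working database

# Every resident bearing a double surname would contribute two occurrences.
expected_if_all_double = 2 * population_2008

# Total discrepancy between expectation and the working database (~41 million).
total_gap = expected_if_all_double - observed_occurrences

# Foreign residents from single-surname countries contribute one surname
# instead of two, i.e. one missing occurrence each (~3.5 million).
foreign_share = 0.113
single_surname_share = 0.66
single_surname_gap = round(population_2008 * foreign_share * single_surname_share)

# Remainder attributed to the INE cut-off and to the frequency threshold (~37 million).
remaining_gap = total_gap - single_surname_gap

# Assumption: coverage is measured against the expectation corrected for
# single-surname residents; the uncorrected ratio is shown for comparison.
coverage_adjusted = observed_occurrences / (expected_if_all_double - single_surname_gap)
coverage_raw = observed_occurrences / expected_if_all_double
```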

In general, the N tokens / N full population ratio is above 1, meaning that our provincial samples correspond to at least 50% of the population (the numerator is theoretically expected to be twice the denominator because of the double-surname system). Very large deviations from this reasonable pattern concern the provinces of Barcelona and Madrid and, to a lesser extent, the province of Valencia (Fig. 5.5). These three provinces host the three biggest agglomerations of Spain. As the proportion of foreign residents bearing a single surname is higher in large cities, because they generally attract many foreign migrants (~20% in Barcelona and Madrid), the INE cut-off of surnames with an absolute frequency lower than 5 bearers leads to a bigger loss of data in large municipalities (like the city of Barcelona, consisting of a single municipality of about 1.6 million residents) than in small ones (like Bellprat, province of Barcelona, ~100 inhabitants).

Figure 5.5  Linear regression of the ratio N(tokens)/N(full population) against the total number of municipalities per province (Table 5.3). The ratio N(tokens)/N(full population) is reported in Table 5.3 and corresponds to the number of single surnames, the tokens of the final database (see Fig. 5.2), divided by the total number of individuals listed in the full data source (Padrón municipal of the year 2008 released by INE, the Spanish national institute of statistics). The trend suggests that the sample representativeness decreases when the number of municipalities in each province rises, becoming very low for the provinces of Barcelona, Madrid and Valencia. See text for details. The regression has been computed by excluding the latter three provinces.

If the size of a municipality leads to a bias, the number of aggregated municipalities constituting each provincial sample leads to another. In our database, the provinces consisting of a large number of municipalities are less representative of their actual population size than those consisting of a low number of municipalities. The ratio N tokens / N full population (Table 5.3) shows a highly significant inverse linear correlation with the number of municipalities (R² = 0.23; p = 0.000544; see Fig. 5.5). To test whether the increased level of missing surnames, at the municipal level, has an influence on the general classification (Fig. 5.3), we experimented by processing only highly frequent surnames (f > 60; f > 80; f > 100). It turns out that the measures of isonymy, and the clustering derived from them, remain very similar to the ones presented in this paper (f ≥ 20). This is not surprising, because the mathematical definition of isonymy (and related distance measures) gives a low weight to infrequent surnames. While there is a systematic bias linked to the number of municipalities that compose each province, it does not seem to significantly influence our results because, after all, the classification of Valencia, Madrid and Barcelona (Fig. 5.3) makes sense both historically and geographically. This said, the low sample size concerning Barcelona is probably related to other biases that we could not identify, and a larger sample size might have improved its clustering.

5.3.2 Isonymy levels

Concerning random isonymy (Table 5.3; Figs. 5.6; 5.7C), we noticed an interesting geographic pattern, with low and quite homogeneous levels in the eastern part of Spain (regions of Aragón, Cataluña, Valencia) and higher values in the rest of the country. Isonymy is quite high throughout Castilla (regions of Castilla-La Mancha and Castilla y León) and in the provinces of Ávila, Asturias, León and Salamanca. Galicia also exhibits very high levels of isonymy. To be sure, there is a strong inverse linear correlation between isonymy and entropy measures (mathematical definition given in the caption of Table 5.3) accounting for the diversity of surnames. It means that the provinces with the highest random isonymy generally have a lower number of surnames, though this cannot be directly inferred from the average number of surnames per person (S/N in Table 5.3).

Figure 5.6  Isonymy levels per region. Please see table 5.3 and Fig. 5.7C for details and refer to isonymy values reported in [12]. Regions are listed in alphabetical order.


Figure 5.7  Geographical plot of the major clusters appearing in the surname (A) and linguistic (B) consensus trees of Figs. 5.3 (Nei’s) and 5.4, and labeled accordingly. In (C) we have plotted isonymy values (see Table 5.3) according to an 8-class interval; the last class (solid black) represents an interval that is not continuous with the preceding one. Note that the same cluster name (C1, C2, etc.) in one map may not correspond to the same provinces in the other. See dendrograms for details.

5.3.3 Surname diversity in Spain – clustering

From isonymy we computed Hedrick’s [24] and Nei’s [25] pairwise distances between all the 47 mainland Spanish provinces and, finally, a bootstrap consensus tree for each of them (Fig. 5.3). While the structure of the two trees is similar, Hedrick’s classification yields a smaller Catalonian cluster (cluster C1) and does not support the existence of a single cluster for the region of Valencia (unlike Nei’s classification, see cluster C2: Alicante, Castellon, Valencia). Further, while the group Albacete-Cuenca-Murcia is highly supported in both trees, its clustering with the region of Valencia is not well supported in Hedrick’s tree (bootstrap score below the cut-off of 50%, Fig. 5.3). Another inconsistency between the trees concerns the clustering of Navarra, which Hedrick’s method puts together with the Basque province of Guipuzcoa, while Nei’s method groups it with the province of La Rioja (see cluster NR in Fig. 5.3). A last difference concerns the ASC cluster (Fig. 5.3), where Ávila and Salamanca (represented together in both trees) are put with Segovia by Hedrick’s approach and with Cáceres by Nei’s classification. As for the similarities between the trees, Albacete, Cuenca and Murcia are always clustered together. Further, the provinces of Galicia (GAL) and Aragón (AR) form coherent clusters. Whatever the distance adopted, we note that no specific Basque cluster appears when bootstrap scores are set to be ≥ 50. From now on, we will discuss only the Nei’s distance consensus tree because it represents larger clusters than Hedrick’s. In the end, and besides some change in the NR and ASC clusters (Fig. 5.3), Hedrick’s tree does not contradict Nei’s representation; it just provides less support for it.

To summarize, the geographic areas corresponding to coherent surname clusters concern only a small part of continental Spain (see Fig. 5.7A), that is Cataluña, the region of Valencia, Aragón, Galicia, Asturias-León and La Mancha-Murcia. We recall that La Mancha is a geographical and historical region that currently does not have any administrative status (it falls in the macroregion called Castilla-La Mancha). La Mancha was formed by portions of the provinces of Albacete, Cuenca, Ciudad Real and Toledo. When we refer to La Mancha we mean only its eastern part, that is Cuenca and Albacete. Interestingly, no clusters with a bootstrap score higher than 50 are found in a very large area covering central and southern Spain (Fig. 5.7A).

5.3.4 Linguistic diversity in Spain

The computational classification of Spanish linguistic varieties is summarized by the consensus tree displayed in Fig. 5.4. We recall that the Linguistic Atlas of the Iberian Peninsula (ALPI) [18] contains no data concerning the three Basque-speaking provinces of Alava, Gipuzkoa and Bizkaya. They are reported in the tree according to the position they would be expected to have had if data were available (dotted branches, see the methodological section of the paper). The first partition of the tree concerns the Catalan-speaking area. This cluster is divided into two subgroups (C1 and C2 in Fig. 5.4), respectively corresponding to Cataluña and to the region of Valencia. Another very robust cluster concerns the four provinces of Galicia (G in Fig. 5.4). All remaining provinces are put together in a very large, quite unstructured cluster. Within it, Asturias and León are together (AL). The same happens for three southern provinces (CSH: Cádiz, Sevilla, Huelva) and for a part of the region of Aragón (provinces of Teruel and Zaragoza, but not Huesca). Similarly to the surname classification, there are no robust clusters in a large part of continental Spain. The larger fraction of the linguistic diversity is located in the north and in the east (Aragonese, Asturian, Basque, Catalan and Galician languages). Further analyses (not shown), restricted to the Castilian-speaking area only, have confirmed its very low differentiation. The linguistic tree (Fig. 5.4) is extremely robust and its coherence with the surname classification is apparent (Figs. 5.7A-5.7B).

Figure 5.8  Political subdivision of Spain during the early and late Middle Ages. (A) shows the political geography of the second half of the 12th century; (B) corresponds to 1492 CE. The timeframe matches the origin of Spanish surnames. By comparing the two maps, the expansion of the kingdoms of Castilla and Aragón towards the southern territories ruled by Arabs is clearly visible (see arrows). This process is known as the reconquest (Reconquista). Please note that the kingdoms of León and Castilla, independent in (A), later merged (B). The kingdom of Navarra remained unchanged.


5.3.5 Mantel correlations

When distance (or similarity) matrices concerning the same elements are available, it is common practice to compute Mantel test correlations [32]. We compared geographic, linguistic and surname distance matrices and the results are reported in Table 5.4. According to the Mantel test, while linguistic and surname measures are correlated with geographic distances, they are not cross-correlated (Table 5.4). This result contrasts with the similar clustering they show in Fig. 5.7. To explain this apparent inconsistency, we note that pairwise distances sometimes reflect noise, as in the Castilian area (but not elsewhere), where historical phenomena linked to the Reconquista have erased a large part of the surname and linguistic variability (Fig. 5.8). Our point, here, is to show that Mantel correlations are often insufficient to describe a phenomenon that concerns only a part of the pairwise elements.

Table 5.4 Mantel correlations between geographic, surname (Nei’s distance) [25] and linguistic distances (1‐RIW] [30, 31].

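| [Distance]          | Geographic (linear) | Geographic (road) | Surname (Nei’s) | Linguistic (1-RIW) |
|---------------------|---------------------|-------------------|-----------------|--------------------|
| Geographic (linear) | 1                   | 0.986 **          | 0.281 *         | 0.598 **           |
| Geographic (road)   | 0.986 **            | 1                 | 0.277 *         | 0.599 **           |
| Surname (Nei’s d.)  | 0.281 *             | 0.277 *           | 1               | n.s.               |
| Linguistic (1-RIW)  | 0.598 **            | 0.599 **          | n.s.            | 1                  |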

Significance levels, according to [38], are reported as asterisks: (*) = 0.01; (**) = 0.001.
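For readers unfamiliar with the Mantel test, a minimal permutation version is sketched below. It is not the software used for Table 5.4: the function names are hypothetical, the correlation is Pearson’s r on the upper triangles, the test is one-sided, and the number of permutations is an arbitrary choice.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square matrix into a list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equally long lists (non-constant inputs)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(a, b, permutations=999, seed=0):
    """Return (r, p) for two symmetric distance matrices over the same items.

    Rows and columns of `b` are permuted jointly; p is the share of
    permutations (observed one included) with a correlation >= the observed r.
    """
    rng = random.Random(seed)
    n = len(a)
    flat_a = upper_triangle(a)
    observed = pearson(flat_a, upper_triangle(b))
    hits = 1  # the observed arrangement counts as one permutation
    order = list(range(n))
    for _ in range(permutations):
        rng.shuffle(order)
        permuted = [b[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]
        if pearson(flat_a, permuted) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)
```

Because the statistic summarizes all pairs at once, a strong association restricted to a subset of provinces can be diluted by unstructured pairs elsewhere, which is the limitation discussed above.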

5.4 DISCUSSION

5.4.1 Variability of Spanish surnames: Patterns of diversity

Our initial research question was to assess whether the present-day geographical variability of Spanish surnames mirrors historical phenomena that occurred at the time of surname introduction (13th-16th century), or whether internal migration and international immigration have erased them. Our analyses may be unrepresentative of the surnames (often rare) corresponding to international immigration, because we processed only a subset of the Padrón Municipal (see Fig. 5.2). To estimate the proportion of non-Spanish surnames, as an automatic retrieval is not possible, we proceeded empirically, by asking two Spanish colleagues to identify them in a printed version of the database. It appears that the percentage of surnames that are not Castilian / Catalan / Basque / Galician is ~7% of the total, in good agreement with the official count of the foreign population in 2008 (~11% in Table 5.1). The discrepancy of about 4% between our estimate and the official one is easily accounted for by the proportion of immigrants from Latin America (34% of the total, see Table 5.1), who often have a typical Spanish surname and escape detection. The conclusion is that the database is representative of the contemporary population of Spanish residents, immigrants included. This conclusion, counterintuitive given the frequency cut-off we applied to the initial data (see Fig. 5.2), reflects the characteristics of the data source, the Padrón Municipal. Its records correspond to individuals of all ages, meaning that the surname of a family of immigrants spanning three generations (grandparents, husband and wife, children) appears several times in the Padrón Municipal, thus making possible the inclusion of their surname in our working database, even if it is rare.

Notwithstanding large-scale immigration, the surname structure of Spain largely reflects the political layout at the time of surname adoption, at the end of the Middle Ages. In fact, the borders of the kingdom of Aragón (Fig. 5.8) almost exactly correspond to three major clusters of surnames, and the same can be said for the kingdom of Navarra (respectively C1/C2/AR and NR in Fig. 5.7A, see Fig. 5.8). The former kingdom of León corresponds well to the surname clusters GAL and ASC in Fig. 5.7A. Does this mean that the Spanish population remained largely unchanged over the centuries? According to its surname structure it did, but this is also the effect of the surname normalization that followed the Reconquista, together with the increased prestige of the Castilian language and identity. Actually, these aspects hide a more mixed genetic background of the population.
For example, the genetic signature of Sephardic Jews (expelled in 1492 CE), North African Muslims and other groups largely present in Spain until the reconquest is still found in the Spanish population, as a comprehensive Y-chromosome study has recently shown [33]. If specific haplogroups are recognizable, their frequency varies geographically, being lower in Cataluña and along the corridor of the Aragonese expansion southwards. Though large, the genetic sampling of Adams et al. [33], conducted by region, does not overlap well with our surname sampling, which consists in the aggregation of municipal data by province. We believe that a deeper comparison between the diversity of the Y-chromosome and the diversity of surnames would be most appropriate, given that the two markers share a similar inheritance along the male line.

Even if the historical signal is well preserved in contemporary Spanish surname data, some features of the classification (Fig. 5.3) are likely to be of recent origin, like the absence of a Basque surname cluster. Actually, it has been shown that Basque surnames, contrary to the Catalan and Galician ones (which are scattered over an area corresponding to their respective languages), are distributed over a region larger than the Basque-speaking domain. Besides the provinces of Alava, Gipuzkoa and Bizkaya (Basque linguistic core area in Spain), Basque surnames are also found in large numbers in Aragón, Cantabria, Castilla y León and La Rioja [13]. As Basque has never been spoken in the latter provinces [34], it is likely that many families migrated outside the Basque-speaking area after the Middle Ages and, probably, quite recently. In addition to this dispersal, the Basque country, a long-established industrial area, has attracted a large number of immigrants from other Spanish provinces and from foreign countries. The joint effect of the two phenomena prevents the existence of a robust Basque surname cluster.

5.4.2 Variability of Spanish surnames: Patterns of isonymy

Isonymy levels at the provincial level (Table 5.3; Fig. 5.7C) are lower in the eastern part of the Iberian Peninsula (Cataluña, Valencia, Aragón and Navarra) and generally higher elsewhere, with extremely high values in Galicia, Asturias, León, Salamanca and Ávila. High values of isonymy are in agreement with the loss of surname diversity in the kingdom of Castilla after the Reconquista, as the consequence of a long-lasting process of Christianization and Castilianization together with the increase of Castilian prestige. Conversely, the lower levels of isonymy in Cataluña, Valencia, Aragón and Navarra can be explained by a lower degree of surname Castilianization (these regions were part of the kingdom of Aragón, which maintained separate administrative systems until 1714 CE) and by the specific language context (corresponding to the Aragonese and Catalan languages) that acted as a source of surname diversity. To these structural aspects, we should add the effect of internal migrations, from all of Spain, to the wealthy province of Valencia and to the region of Cataluña (but also to Madrid and the Basque country), with a recent increase in their surname diversity. The opposite phenomenon probably took place in provinces having very high levels of isonymy (Lugo, Pontevedra, Asturias, León, Salamanca and Ávila). Interestingly, and as a partial explanation deserving more research, none of the provinces listed has, or had, a special economic attractiveness. We note that isonymy also depends on the population size at the time of surname introduction, and it is likely that high levels of isonymy reflect a sparse population during the Middle Ages (Galicia, Asturias, León, Salamanca and Ávila). As a general conclusion, isonymy cannot be readily attributed to a single cause, and multidisciplinary research, involving historians and historical demographers, would help to assess the magnitude of each single factor.
What is clear is that the east/west divide in the pattern of isonymy (Fig. 5.7C) perfectly corresponds to the historical border separating the kingdom of Castilla y León from the kingdom of Aragón (Fig. 5.8). We believe that the levels of isonymy in the eastern part cannot be directly compared to those in the western part, because they correspond to political and administrative differences that distorted the demographic processes they are expected to readily mirror. This is why we suggest that the low isonymy computed in Cataluña, Aragón and Valencia may not correspond, in reality, to a lower level of consanguinity than in the western Castilian area.

In Spain, the estimation of the actual levels of consanguinity from isonymy requires great caution. We found no statistical support for the inverse correlation between consanguinity values (computed from isonymy) and the density of the population per province (Table 5.3) reported by others [12]. To be sure, we have checked our isonymy measures against those in [12] and, despite a difference in the data source and geographic sampling (these authors did not compute isonymy per province but only per city and region), we found substantial agreement (Fig. 5.6). However, we note that the clustering in [12] is very different from the one we have obtained (Fig. 5.3), and we could not replicate their dendrogram classification at all, even when using the distances and clustering algorithms they applied. Interestingly, the same authors [14] later reprocessed their surname data (telephone directory) and provided a second clustering that is different from the first one they published. As neither published tree [12, 14] has been tested for robustness, for example by bootstrap or jackknife methods, they might not portray a solid trend in the data. This is the reason why we will not further comment on them.

5.4.3 Linguistic diversity

The data contained in the Linguistic Atlas of the Iberian Peninsula (ALPI) [18] are about eighty years old, meaning that they concern linguistic varieties that have meanwhile changed. With respect to the ALPI, this change is also linked to the political regime led by General Francisco Franco from 1939 to 1975. Franco pursued strong nationalistic policies that weakened regional identities. His regime discouraged the use of regional languages, and Castilian (Spanish) was the only official and accepted means of expression. The advent of democracy, about forty years ago, brought a strong will to embrace regional cultural identities and to obtain some political and economic independence from the central government, essentially in the Basque region and in Cataluña. Some regional languages (Basque, Catalan, Galician) are now officially accepted, used at all levels of public life, including the media, and taught at school. As a result, they are converging towards a norm.

The clusters yielded by the computational linguistic analysis of the ALPI (Fig. 5.4) largely correspond to the known regional varieties of Cataluña, Aragón, Galicia, Asturias and León. The novel aspect of this work concerns the large homogeneity of Castilian language varieties, a homogeneity that cannot be attributed to any modern leveling because in the 1930s (when the ALPI sampling was carried out) the Spanish lifestyle was still quite traditional and rural. Therefore, the reason for the low level of linguistic variation among the majority of the Castilian varieties must be older.

Before the Arab invasion, the linguistic landscape of Spain consisted of local varieties resulting from the adoption of Latin by populations that previously spoke Celtic and Iberian languages. Latin varieties evolved for about a thousand years, from the Roman conquest of the Iberian peninsula (started in 218 BCE) to the Arab takeover (started in 711 CE), when a progressive linguistic Arabization began. The length of the Arab domination, and its influence on the language, differed geographically, having lasted a couple of centuries in the north of Spain and about eight centuries in the far south. During the progressive reconquest of the peninsula by the Christian kingdoms of the north and the growing expansion of the kingdoms of Castilla y León and Aragón (Fig. 5.8), northern Castilian and Catalan varieties spread to the south, thus replacing a large part of the linguistic varieties encountered. This is why the differences found in the Castilian-speaking area are known to be secondary; in other words, they arose more recently and over a shorter time than the first process of differentiation from Latin [35]. The political prestige of the Castilian crown, together with the religious and cultural “normalization”, kept Castilian quite homogeneous and led, at the same time and as we mentioned already, to the large-scale Castilianization of surnames, which are a specific part of language.
The homogeneity of Castilian dialects underlines the effectiveness of the political power of monarchies that have been able to vigor‐ ously keep a Castilian linguistic norm that has remained a key element of the Spanish identity until the end of the regime of General Franco.

Acknowledgements:

Franz Manni would like to thank Pierre Darlu (CNRS, National Museum of Natural History, Paris, France) for invaluable help with the statistical analysis, for his continued encouragement and for insightful discussions. Hans Goebl (University of Salzburg, Austria) provided linguistic data and remarkable support over time. Slavomir Sobota (University of Salzburg, Austria) promptly transferred the matrix of linguistic distances, which was tested for robustness by Wilbert Heeringa (University of Groningen, The Netherlands). Useful comments concerning the history of Spanish surnames and the Basque language were provided by Dieter Kremer (formerly University of Trier, Germany). Chiara Scapoli (University of Ferrara, Italy) kindly transferred previously published isonymy estimates. Patrick Hanks (University of Wolverhampton, UK) gave access to published materials that were difficult to retrieve. The description of the Linguistic Atlas of the Iberian Peninsula (ALPI) is largely based on the text provided by David Heap (University of Western Ontario, Canada) at http://westernlinguistics.ca/alpi/more_info.php

References:

01 Crow J.F., Mange A.P. 1965. Measurement of inbreeding from the frequency of marriages between persons of the same surname. Eugen Q, 12: 199-203.
02 Beech G., Bourin M., Chareille P. (eds). 2002. Personal names studies of Medieval Europe. (MI) USA: Western Michigan University.
03 Cheshire J., Mateos P., Longley P.A. 2011. Delineating Europe's Cultural Regions: Population Structure and Surname Clustering. Hum Biol, 83: 573-598.
04 Cheshire J. 2014. Analysing surnames as geographic data. J Anthropol Sci., 92: 99-117.
05 Darlu P., Bloothooft G., Boattini A., Brouwer L., Brouwer M., Brunet G., et al. 2012. The family name as socio-cultural feature and genetic metaphor: From concepts to methods. Human Biology, 84: 169-214.
06 King T.E., Jobling M.A. 2009. What's in a name? Y-chromosomes, surnames and the genetic genealogy revolution. Trends Genet, 25: 351-60.
07 Mateos P. 2014. Names, ethnicity and Populations. Springer.
08 Kremer D. 1992. Spanische Anthroponomastik. Lexikon der Romanistischen Linguistik, 6: 457-473.
09 Kremer D. 1996. Morphologie und Wortbildung bei Familiennamen II: Romanisch, Namenforschung. Ein internationales Handbuch zur allgemeinen und europäischen Onomastik, 2. Teilband, Berlin/New York, pp. 1263-1275.
10 Kremer D. 2001. Colonisation onymique, L'onomastica testimone, custode e promotrice delle identità linguistiche, storiche e culturali. Studi in ricordo di Fernando R. Tato Plaza, Rivista Italiana di Onomastica 7: 337-373.
11 Kremer D. 2003. Spanish and Portuguese family names. In P. Hanks (ed.) Dictionary of American family names. New York (USA): Oxford University Press.
12 Rodriguez-Larralde A., Gonzales-Martin A., Scapoli C., Barrai I. 2003. The names of Spain: a study of the isonymy structure of Spain. Am J Phys Anthropol., 121: 280-92.
13 Mateos P., Tucker D.K. 2008. Forenames and Surnames in Spain in 2004. Names: A Journal of Onomastics, 56: 165-184.
14 Scapoli C., Mamolini E., Carrieri A., Rodríguez-Larralde A., Barrai I. 2007. Surnames in Western Europe: a comparison of the subcontinental populations through isonymy. Theor Popul Biol, 71: 37-48.
15 Manni F., Toupance B., Sabbagh A., Heyer E. 2005. New method for surname studies of ancient patrilineal population structures, and possible application to improvement of Y-chromosome sampling. Am J Phys Anthr, 126: 214-28.
16 Boattini A., Lisa A., Fiorani O., Zei G., Pettener D., Manni F. 2012. General Method to Unravel Ancient Population Structures Through Surnames. Final Validation on Italian Data. Human Biology, 84: 235-270.
17 Kreienbrink A. 2008. Resident Foreign Population [of Spain]. Bundeszentrale für politische Bildung (German Federal Agency for Civic Education). Internet publication: http://www.bpb.de/gesellschaft/migration/laenderprofile/58609/foreign-population.
18 ALPI 1962. Atlas Lingüístico de la Península Ibérica. Madrid: C.S.I.C., tomo I, Fonética.
19 Goebl H. 2013. La dialectometrización del ALPI: rápida presentación de los resultados. In: E. Casanova-Herrero, C. Calvo-Rigual (eds) Actas del XXVI Congreso Internacional de Lingüística y de Filología Románicas (volumen VI). Berlin (Germany), Boston (USA): De Gruyter, pp. 143-154.
20 Felsenstein J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 39: 783-791.
21 Nerbonne J., Kleiweg P., Manni F., Heeringa W. 2008. Projecting Dialect Distances to Geography: Bootstrap Clustering vs. Noisy Clustering. In: C. Preisach, L. Schmidt-Thieme, H. Burkhardt, R. Decker (eds.), Data Analysis, Machine Learning and Applications. Proceedings of the 31st Annual Meeting of the German Classification Society. Berlin: Springer, pp. 647-654.
22 Kleiweg P., Nerbonne J., Bosveld L. 2004. Geographic Projection of Cluster Composites. In: A. Blackwell, K. Marriott, A. Shimojima (eds.), Diagrammatic Representation and Inference. Diagrams 2004. Lecture Notes in Artificial Intelligence 2980. Berlin: Springer, pp. 392-394.
23 Valls E., Prokic J., Wieling M., Clua E., Lloret M-R. 2012. Applying the Levenshtein Distance to Catalan dialects: A brief comparison of two dialectometric approaches. Verba, 39: 35-61.
24 Hedrick P.W. 1971. A new approach to measuring genetic similarity. Evolution, 25: 276-280.
25 Nei M. 1973. The theory and estimation of genetic distance. In: N.E. Morton (ed.) Genetic structure of populations. Hawaii (USA): Hawaii University Press.
26 Lasker G.W. 1977. A coefficient of relationship by isonymy: A method for estimating the genetic relationship between populations. Hum Biol, 49: 489-493.
27 Relethford J.H. 1988. Estimation of kinship and genetic distance from surnames. Hum Biol., 60: 475-92.
28 Saitou N., Nei M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4: 406-425.
29 Page R.D.M. 1996. TREEVIEW: An application to display phylogenetic trees on personal computers. Computer Applications in the Biosciences, 12: 357-358.
30 Goebl H. 1984. Dialektometrische Studien. Anhand italoromanischer, rätoromanischer und galloromanischer Sprachmaterialien aus AIS und ALF, Tübingen: Niemeyer. Vol I, pp. 75-78.
31 Goebl H. 1993. Dialectometry. A Short Overview of the Principles and Practice of Quantitative Classification of Linguistic Atlas Data. In: R. Köhler, B.B. Rieger (eds.), Contributions to Quantitative Linguistics. Dordrecht, pp. 277-315.
32 Mantel N.A. 1967. The detection of disease clustering and a generalized regression approach. Cancer Res., 27: 209-20.
33 Adams S.M., Bosch E., Balaresque P.L., Ballereau S.J., Lee A.C., et al. 2008. The genetic legacy of religious diversity and intolerance: paternal lineages of Christians, Jews, and Muslims in the Iberian Peninsula. Am J Hum Genet, 83: 725-736.
34 Aznar A. 1998. La mezcla del pueblo vasco. Empiria: Revista de metodologia de ciencias sociales, 1: 121-177.
35 Baldinger K. 1963. La formación de los dominios lingüísticos en la Peninsula Ibérica. Madrid: Gredos.
36 Ethnologue 2014. Ethnologue: Languages of the World. Summer Institute of Linguistics (SIL), SIL International Publications. Dallas (USA). Online edition available: www.ethnologue.com. Accessed 2015 February 19.
37 Shannon C.E. 1948. A mathematical theory of communication. The Bell System Technical Journal, 27: 379-423 and 623-656.
38 Manly B.F.J. 1997. Randomization, Bootstrap and Monte Carlo Methods in Biology. London (UK): Chapman and Hall, pp. 399.


LINGUISTIC PROBES INTO THE BANTU HISTORY OF GABON


This chapter is unpublished; please cite it as follows:

Manni F., Nerbonne J. 2017. Linguistic probes into the Bantu history of Gabon. In: Linguistic probes into human history (Chapter 6). PhD dissertation, Groningen dissertations in linguistics n° 162. ISBN 978-90-367-9872-3. Groningen: University of Groningen.

APPENDIX 1, APPENDIX 2 and APPENDIX 3 are available at the following link: http://hdl.handle.net/11370/58b394a0-60c6-4129-8178-a55dd634801d

ABSTRACT  In this extensive unpublished chapter we have compared the linguistic and genetic diversity of Gabon (Africa) in order to contribute new elements to the scenarios concerning the early Bantu expansion related to the adoption of agriculture. Two independently obtained datasets have been processed (Bastin et al. 1999; ALGAB, see Hombert 1990), accounting for a total of 126 different varieties documented by Swadesh word lists. They lead to similar results, showing that the languages cluster into a comparable number of groups. The Levenshtein linguistic distances we computed are fully compatible with the classification of Grollemund et al. (2015) based on shared vocabulary, where sharing is operationalized as the percentage of words (not) having the same historical origin. This coding is unnecessary with the Levenshtein method, making it simpler to use and, because of the larger amount of information it accounts for, more sensitive. We have tried to make the genetic dataset more representative of the 17 ethnic groups studied on the genetic side by filtering out all the DNA donors who were born outside the areas typically inhabited by their respective ethnolinguistic communities. The new results confirm the lack of genetic differentiation, which is even more pronounced than previously observed. The linguistic cartography of our classifications shows well-delimited areas that might be related to early waves of Bantu migrants who crossed Gabon in the early stages of their dispersal from Cameroon and Nigeria. Finally, we compare our results with cultural differences documented in a dataset assembled by musicologists, and the two are compatible.
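To make the contrast between cognate coding and the Levenshtein method more concrete, the following minimal Python sketch computes a length-normalized Levenshtein distance between two concept-aligned word lists. It is only an illustration: the toy transcriptions, the normalization by the longer word and the function names are assumptions, not the data or the exact procedure used in this chapter.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def list_distance(words_a, words_b):
    """Mean Levenshtein distance over concept-aligned word pairs, each
    normalized by the length of the longer word (an assumed normalization)."""
    pairs = [(a, b) for a, b in zip(words_a, words_b) if a and b]
    return sum(levenshtein(a, b) / max(len(a), len(b)) for a, b in pairs) / len(pairs)


# Hypothetical toy transcriptions of three concepts in two varieties.
variety_1 = ["muntu", "mbwa", "mai"]
variety_2 = ["mutu", "mbwa", "madi"]
print(list_distance(variety_1, variety_2))
```

No judgement about the historical origin of the words is needed: the distance is computed directly on the transcriptions, so partial similarities between forms also contribute to the result.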

LINGUISTIC PROBES INTO THE BANTU HISTORY OF GABON

6.1 INTRODUCTION

This chapter presents research conducted intermittently over the last fifteen years on the linguistic variability of Bantu varieties spoken in Gabon. The Republic of Gabon is an equatorial African country with a total population of about one and a half million inhabitants. Apart from the Batéké Plateau area (in the southeast, at the border with the Republic of the Congo), Gabon is covered by rainforest. Present-day villages are distributed along the major roads (Fig. 6.1). Gabon became known to Europeans from the 15th century onwards, when the Portuguese and, later, the Dutch established a slave trade with the help of tribes living along the Atlantic coast. Although trade progressively expanded to wood, rubber and ivory, the geographical exploration of Gabon started only in the 19th century, when France progressively colonized the region. Gabon became independent in 1960.

Figure 6.1  Geographic map of Gabon with major cities and roads. The density of the population is shown as shades of gray (source: census 1966). Gabon is generally covered by the rainforest, but there is also savannah on the Plateau Batéké (at the south‐eastern border with People’s Republic of Congo) and in the southern part.

This country is largely uninhabited, with an average population density of about five individuals per square kilometre over a surface comparable to that of the United Kingdom. More than half of the population presently lives in the bigger cities (the capital Libreville, Port-Gentil, Franceville, Oyem, Moanda, etc.), and the population density outside these urban or semi-urban areas is often comparable to that of desert regions. Migration towards the capital and other major cities and towns is a recent and constantly increasing phenomenon related to the late-colonial and post-colonial economy based on the trade of wood, oil and other geological resources.

All the ethnic groups living in Gabon are Bantu populations, with the exception of several Pygmy groups nowadays scattered throughout the country (about 20,000 individuals in total). The use of French (the official language) is widespread in the increasingly multiethnic towns, and many indigenous language varieties are threatened. In terms of linguistic diversity, every ethnic group of Gabon speaks a different variety of Bantu, with the exception of the Baka, a Pygmy group speaking a language belonging to another linguistic family. Here, we will focus only on non-Pygmy Bantu-speaking populations.

Our major aim is to computationally compare the linguistic diversity of Gabon with the genetic diversity of the corresponding populations, in order to contribute to the large field of Bantu historical linguistics and, possibly, to provide new clues about the peopling of this region.

The section that follows starts with a detailed background addressing the key literature that determined the design of this project. The literature reviewed in the introduction generally predates 2002, which is when this comparative project started. More recent literature is addressed in the discussion section. We proceed this way in order to help the reader appraise the major advances in the field chronologically.

6.1.1 General background 1

Bantu languages belong to the Niger-Congo phylum and include about 600 varieties spoken in almost all of sub-Saharan Africa. Their geographic continuity is nearly perfect, interrupted only by the Khoisan languages spoken in South Africa.2

Guthrie (1967) records some 440 Bantu varieties, Grimes (2000) finds 501, Bastin et al. (1999) list 542, Maho (2003) has some 660, while Mann and Dalby (1987) distinguish ca. 680 varieties. These figures change depending on whether varieties are classified as languages or as dialects, which in turn depends on aspects such as their prestige, the existence of written forms, their use in wider or narrower communication, etc. When mutual unintelligibility is the only criterion taken into account, the consensus is to classify about half of the varieties as languages (ca. 300) and the other half as dialects (ca. 300) (Nurse 2001). In practice Bantu varieties are usually referred to as ‘languages’, and we will follow this tradition.

There is a lack of accurate statistics concerning the number of speakers per language (but see Van der Veen 2006a), and it is even more difficult to estimate the number of primary and secondary speakers. However, Nurse (2001) suggests that, among the 400 million Africans speaking Niger-Congo languages, about 240 million use a Bantu variety as their first language.

1 The general sketch of this section is largely based on some excellent contributions, particularly Doneux 2003, Stahl 2004, Mouguiama-Daouda 2005, Phillipson 2005, Van der Veen 2006a.
2 There is also one small and fast-dwindling Khoisan community in Tanzania. Larger communities speak Cushitic (Afro-Asiatic) in the north-eastern part of the Bantu-speaking domain. Other groups speaking Nilo-Saharan and Adamawa-Ubanguian languages live in the northern area. Like Bantu, the Adamawa-Ubanguian languages belong to the Niger-Congo phylum.

6.1.1.1 Bantu linguistics: classifications

Bantu languages are geographically widely dispersed and involve distant populations that have no contact with one another and do not necessarily share a cultural background. For this reason the idea that the Bantu languages form a linguistic unity was not accepted until quite recently. The first intuition goes back to Wilhelm Heinrich Immanuel Bleek (1862-1869). The German philologist noticed that human beings are designated by the same word in many African languages (muntu singular; bantu plural), meaning that a common origin could be envisaged for such languages. This intuition became solid evidence with the application of the comparative method to phonology and grammar by Carl Friedrich Michael Meinhof (1857-1944) (Meinhof 1899, Meinhof and Warmelo 1932). Today, despite differing views, Bantu linguistics relies on a common methodological ground that includes, in addition to the work of Meinhof, the research of Malcolm Guthrie (1903-1972) and Achilles Emile Meeussen (1912-1978).

The most important classification of Bantu languages, which is still used as a practical taxonomic reference, is Guthrie (1967), though more recent ones are available (Mann and Dalby 1987; Grimes 2000). Guthrie was a pioneer. He was confronted with the hard task of compiling a linguistic survey of all known Bantu languages in order to provide a classificatory system that would remain valid over time, including after the description of new varieties. First he defined the geographical boundary of the Bantu linguistic domain and, in doing so, he divided it into an eastern and a western region. The western region includes Cameroon, Gabon, Congo, the western part of the Democratic Republic of Congo (DRC), Angola and a part of Zambia. The eastern region includes the eastern part of DRC and all the eastern countries of sub-Saharan Africa. This split, although currently debated, was consensual for a long time. Guthrie divided each of the two regions into a number of zones and identified them by letters: A, B, C, D, H, K, L, M, R for the West and E, F, G, N, P, S for the East. These zones were designed to maximize the number of shared grammatical features among the languages they encompass, and were partly defined according to a number of isoglosses Guthrie had described.

Guthrie adopted a method that was partly linguistic and partly geographical. He started with a given language (or a group of very close languages) and tried to find other languages having similar features. These were put into groups and the groups into zones labeled by a letter of the alphabet. Each geographical zone was constituted by a maximum of 9 languages, a number that he considered sufficient to define linguistic subgroups in a synchronic perspective. When a zone threatened to become too wide (more than 9 languages), another zone was created. The linguistic features taken into account had little genetic validity, and Guthrie knew that. This is why his classification has no special historical validity. When the languages included in some of the zones he postulated happen to correspond to a genetic classification, it is because the process of linguistic differentiation is often correlated with geographic distance.

Figure 6.2Left: Bantu linguistic domain and zones defined by Guthrie. Right: Example of language classification within each zone (the example concerns varieties in zone A).

Since this chapter adopts Guthrie’s classification, it is necessary to explain how it works in practice. As an example we can consider the first zone identified, zone A (it happens to be in the northwest; Fig. 6.2). Within this zone we find 9 groups called A10; A20; A30; …; A90. The group A30, for example, is called Bubi-Benga and comprises, like the other groups, languages that are very similar. Within a group, individual languages are coded by the number that follows the first digit. In this way, A32 refers to the Batanga language. Lowercase letters after the two-digit number refer to different dialects; for example, A33a identifies the Yasa dialect, whereas A33b corresponds to the Kombe/Ngumbi dialect (Fig. 6.2).

Nowadays, many other languages have been added to the original classification of Guthrie, and linguists have shown that some of the languages included in a geographical zone identified by a given letter are actually closer to those of another zone, or intermediate, depending on the features that are considered. In genetic terms, the internal cohesion of the linguistic zones is often low, and some of them can be merged. Guthrie’s method was practical, like that of an archaeologist who, before a detailed excavation, divides the site into areas according to the kinds of remains they are expected to contain.

It goes without saying that several proposals have been made to abandon the classification of Guthrie in order to adopt a more coherent one, but the practice has been to improve and update his classification without changing it. The need to change the classification fundamentally became much less pressing thanks to the excellent work of Jouni Filip Maho (2003). With the encouragement and help of two well-respected scholars, Gerard Philippson (University of Lyon, France) and Derek Nurse (Memorial University, Canada), Maho released the New Updated Guthrie List (Maho 2009). This is the version of the Guthrie classification we use here.3
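The coding scheme described above (a zone letter, a group given by the tens digit, a language number and an optional lowercase dialect letter) can be sketched as a small parser. This is only an illustration of the scheme as summarized in this section: the function name and the returned fields are assumptions, and any extended codes introduced in later updates of the list are not handled.

```python
import re
from typing import NamedTuple, Optional


class GuthrieCode(NamedTuple):
    zone: str                # e.g. "A"
    group: str               # e.g. "A30"
    language: str            # e.g. "A33"
    dialect: Optional[str]   # e.g. "a", or None when no dialect letter is given


# One uppercase zone letter, two digits, optional lowercase dialect letter.
_PATTERN = re.compile(r"^([A-Z])(\d)(\d)([a-z])?$")


def parse_guthrie(code: str) -> GuthrieCode:
    """Split a basic Guthrie code into zone, group, language and dialect."""
    match = _PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"not a basic Guthrie code: {code!r}")
    zone, tens, units, dialect = match.groups()
    return GuthrieCode(zone, f"{zone}{tens}0", f"{zone}{tens}{units}", dialect)


print(parse_guthrie("A33a"))  # Yasa dialect in the example of the text
print(parse_guthrie("A32"))   # Batanga language in the example of the text
```

Grouping varieties by the `zone` or `group` field is then a simple way to compare a computed classification with Guthrie's referential one.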

6.1.1.2 How many major Bantu groups?

Heine (1973) corroborated, by a lexicostatistical approach,4 the division of Bantu languages into two clusters, a western and an eastern one. The eastern group derives from the western one, while the languages of Gabon and Cameroon are independent lineages. Later, Ehret (1999) suggested that eastern, central and should be merged into a single group called Savannah Bantu that is constituted by three branches (Eastern, Western I, Western II in Fig. 6.3).

Nurse and Philippson (2003) provided one of the most recent historical classifications (Fig. 6.4). It is based on grammatical innovations. While it generally confirms the validity of Guthrie's zones, there are several exceptions. For example, A80 + A90 form a single cluster, as do {B10 + B30} and {B40 + some H12 + some H13}. More importantly, Nurse and Philippson did not confirm the existence of Ehret's Savannah Bantu group, instead suggesting a more classic western / eastern division. The western group is constituted by three overlapping clusters (western Bantu; western-central Bantu; forest Bantu), while the second group (south-eastern Bantu) is constituted by languages that are grammatically different from the first group, though a few features are shared.

The fact that the western group is more differentiated, being composed of three subgroups that diverged, suggests that western Bantu languages are older than the more homogeneous south-eastern ones, which did not evolve into subgroups because time was too limited. The overlapping of western Bantu, forest Bantu and west-central Bantu (Fig. 6.4) supports the common origin of these languages in the north-western part of the Bantu-speaking domain. After the initial differentiation in the west, the languages probably spread southwards and eastwards, and secondary contacts occurred, thus explaining why some varieties have mixed grammatical features (like D10-D20-D30). South-eastern varieties probably originated after a migration of western Bantu varieties to the east and then to the south of Africa. This is the general scenario.

3 http://goto.glocalnet.net/mahopapers/nuglonline.pdf
4 Lexicostatistics is a subfield of comparative linguistics involving the quantification of cognates. Though related to the comparative method, the aim of lexicostatistics is not to reconstruct a protolanguage. The name of this discipline is misleading, as no statistics are involved in the process of mathematically measuring lexical divergence.

Figure 6.3 The classification of Bantu languages according to Ehret (1999). The codes correspond to the zones defined by Guthrie (see Fig. 6.2).

Figure 6.4  The classification of Bantu languages according to Nurse and Philippson (2003), based on grammatical innovations. The codes correspond to the zones defined by Guthrie (Fig. 6.2). Underlined codes correspond to groups of languages that do not cluster together, though belonging to the same Guthrie zone or subgroup.


A large number of other classifications, published by scholars working at the Royal Museum for Central Africa in Tervuren (Belgium), rely on statistical methods addressing grammatical or lexical features. Among them, the classification of Bastin (1983) is of special interest because of the large number of varieties processed (more than 100). This work confirmed the classical western / eastern division of Guthrie, the major difference being that the languages of zone M are classified with the western cluster, whereas Guthrie saw them as ambiguous. This classification has been published in a revised and much enriched form (542 languages in Bastin et al. 1999). The computation of many dendrograms has shown that it is difficult to establish true groups, as the clusters vary according to the features taken into account (this aspect will be further discussed). However, some clusters remain stable across the trees (Fig. 6.5). The status of group B20 is particularly interesting, as some of the languages constituting it (Kota, Wumbu, Sama) tend to be close to central-western Bantu varieties, whereas others (Ngom, Mbahouin, Sake) are very close to the varieties spoken in zone A. Another facet of the classification concerns the aforementioned homogeneity of eastern varieties, pointing to their late and common origin.

Figure 6.5  The classification of Bantu languages according to Bastin et al. (1999). The clusters displayed do not cover the entire linguistic domain, but only the languages that cluster together in a stable way, whatever features are taken into account. The codes correspond to the zones defined by Guthrie (Fig. 6.2).

In conclusion, important recent studies (Bastin et al. 1999; Nurse and Philippson 2003; Rexova et al. 2006) confirm a western / eastern division of the entire Bantu linguistic domain and suggest that the western part is older, since the languages spoken in zones A, B and C emerged first from proto-Bantu varieties.

6.1.1.3 Proto-Bantu homeland

Joseph Harold Greenberg (1915-2001) identified the middle Benue river valley, located between Cameroon and Nigeria, as the homeland of Bantu languages (1955). The evidence was that the Bantu languages spoken in this area are the only ones close enough to Benue-Congo, another branch of the Niger-Congo linguistic family, to which Bantu belongs as well. Benue-Congo languages are spoken in Nigeria. To support his view, Greenberg also stressed the great diversity existing among the Bantu languages spoken in the north-western part of the Bantu linguistic domain. Almost all subsequent lexicostatistical studies have confirmed this geographical origin.

Guthrie (1971) noted that the lexical roots of proto-Bantu are highly shared in this region and decrease everywhere else, meaning that the geographical identification of the nucleus is correct. Guthrie disagreed about the possible relationship of early Bantu languages with the Benue-Congo branch, explaining the similarities between the two as the result of borrowing and suggesting that the nucleus was located in southern Congo. Later studies did not support this hypothesis, however, and the homeland of Bantu languages is now believed to lie along the middle Benue river valley. Identifying a homeland is essential for explaining how Bantu languages have spread over a large part of sub-Saharan Africa.

6.1.1.4 Timeframe and dissemination of Bantu varieties

The existence of two groups of Bantu languages, the western and the eastern one, was seen as the likely result of two major, and probably independent, migrations that took place from the middle Benue river valley. One migration led to the equatorial forest through the south of Cameroon, following the rivers and, finally, progressing further southwards along the Atlantic coast (Bastin et al. 1979). Western Bantu languages differentiated along this path and direction. Another migration took place eastwards, along a route that avoided the equatorial forest. Eastern Bantu varieties emerged along this second direction.

This simple scenario has given rise to a great deal of investigation. Secondary contact between the languages, a different migration pace according to the route (to the south or to the east), secondary migrations, continuous population displacements until today, and the emergence of a number of ethno-linguistic groups (societies whose cohesion also relies on a common language, the stability of which has probably changed over time) complicate the general picture. But when did all that start?

According to the concepts included in the Swadesh list,5 the most divergent Bantu varieties of the western group share only 20% of cognates, whereas about 30-40% cognate sharing is found for the most divergent languages of the eastern part of the Bantu domain. To explain the degree of divergence of the varieties sharing only 20% of cognates, it has been suggested that the dissemination started about 5000 years ago along the southern migration route from the middle Benue river valley, and 3000 years ago along the eastern migration route. The lexicostatistical studies of Greenberg (cited above) maintain that the whole migration process, south or east, did not start before 2000 ybp (ybp = years before present). This timing is contradicted by almost all the other lexicostatistical studies, which systematically point to a dissemination process that started about 5000 ybp.

More recent studies have been published and will mainly be cited in the discussion section. Whatever the correct timeframe, the reader's attention should be drawn to the extremely fast spread of Bantu languages, which have disseminated throughout half of Africa, replacing almost all pre-existing languages, much as Latin did in Europe but, as far as we know, without the help of a well-organized empire, without roads, and without any planning. The reason for this is certainly related to the lifestyle of Bantu populations: they were generally agriculturalists and farmers living in villages and have remained so to a very large extent even today. Solid evidence comes from many reconstructed proto-Bantu lexical roots concerning concepts related to farming and agriculture.6 It has often been said that agriculture requires a considerable workforce to be a viable means of subsistence. In other words, large societies are necessary for successful agriculture and, once populations begin growing, migration processes and population diffusion follow, which are further promoted when the soil rapidly and temporarily (for about 30 years) becomes unproductive due to its increasing impoverishment. Large social groups have many descendants and population splits happen frequently, with new groups colonizing new lands. This demographic scenario was first envisaged for Europe by Renfrew (1990), for the period when agricultural practice spread from the Middle East to the west, but it might well apply to Africa and the Bantu migrations. Societies of hunter-gatherers are generally much smaller, meaning that they can be easily assimilated by the advancing waves of agriculturalists. This is one of the reasons why the languages that preceded Bantu varieties in the present Bantu linguistic domain have disappeared. An interesting case concerns the Pygmies of Africa, small groups that used to be hunter-gatherers and that, while sometimes keeping this way of life, have systematically adopted Bantu languages, though Bahuchet (2012) conjectures that some relicts of their previous languages may survive in the technical lexicon corresponding to the specialized forest activities that they traditionally mastered.

5 The Swadesh (1971, p. 283) list is a list of concepts selected to be universal and culture-independent, that is, existing in almost all the languages of the world. With similar aims, many other lists have been developed; they have mainly been used in the framework of lexicostatistics and glottochronology.

6 For a reference database including about 10,000 entries proposed for Proto‐Bantu reconstructions see www.africamuseum.be/collections/browsecollections/humansciences/blr

6.1.1.5 Archaeology and linguistics

Lexicostatistical assays led to the identification of the most widely shared cognates, that is, the words that are conserved in most Bantu languages. When the geographical distribution of the words is wide, and when languages spoken in both the western and the eastern part of the Bantu linguistic domain are involved, they can be considered relicts of the original proto‐Bantu varieties. By comparing them, several proto‐Bantu lexical roots could be reconstructed.7 The proto‐lexicon includes the concepts “oil palm (tree)”, “yam”, “plantation”, “to grow / to cultivate”, “to fish with a line”, “to fish with a net”, “hook (fishing)”, “banana”. They concern activities typical of an agricultural society that, being largely sedentary, must have left artefacts behind it, that is, archaeological remains. Although linguistic cartography has provided a general frame for the origin and dissemination of the Bantu languages in time and space, such hypotheses cannot be validated without the support of sister disciplines like archaeology, social anthropology and genetic anthropology, the latter being applied to the genetic diversity of human populations and of domesticated plants and animals, as domestication is another facet of agriculture. A first attempt to combine archaeological and linguistic evidence can be traced back to Oliver (1966) but, while pottery related to Bantu migrations was available in eastern Africa, the almost total lack of archaeological material from the north‐western part of the Bantu linguistic domain was a serious impediment to understanding the full Bantu migration patterns. Oliver tried to combine the two theories about the homeland: the central Benue river valley advocated by Greenberg was seen as the original homeland, while the nucleus of Guthrie (southern Congo) was judged to be a secondary, though major, centre of dispersion.
He also introduced the idea that Bantus were ironworkers, thus helping to suggest that the spread of Bantu languages was a consequence of the technical superiority of the Bantu speakers.

Later, Phillipson (1976, 1977a, 1977b, 2002) provided an ambitious reconstruction, arguing that early Bantu language varieties developed in Cameroon about 3000 ybp among a stone‐tool population that domesticated goats and practiced some forms of agriculture. According to his scenario, when they dispersed eastwards, along the northern fringe of the equatorial forest, they met other farmers, probably speaking Central Sudanic languages. After a long phase of contact, they started herding domestic cattle and sheep, and learned about the cultivation of certain cereal crops. It is likely that metal‐working techniques were acquired during this phase of contact. The earliest archaeological evidence for iron metallurgy in sub‐Saharan Africa is not older than 3000 ybp (Phillipson 2002), meaning that metallurgy was acquired later than the first dispersal from the middle Benue river valley and, in fact, there are no lexical roots related to iron in current Proto‐Bantu reconstructions.8 According to Phillipson, a second Bantu‐speaking stone‐tool population making pottery and ground‐stone artefacts moved southwards from Cameroon to lower Congo. They later came into contact with some Urewe9 populations moving westwards. The Urewe are responsible for having introduced several aspects of an Early Iron Age culture to these Bantu populations that had moved more directly southwards. The coalescence of the two groups gave rise to the western Early Iron Age culture (Phillipson 1977a, 2002). Further migrations eastwards (about 1500 ybp), then to the coast of southern Kenya and northern Tanzania, and southwards, through Lake Malawi and Mozambique as far as the South African interior, took place from the Great Lakes region. This elaborate synthesis of Phillipson was harshly criticized (see Heggert 2004, p. 311). It was accused of being a circular argument largely inspired by linguistic research, but it has the great merit of addressing the issue of the contact Bantus had with other coexisting populations, and of the effect this had on Bantu speech.

7 Proto‐Bantu is generally reconstructed with a relatively small set of sounds, 11 consonants and 7 vowels with two tones (rising and falling).

A turnabout came with Vansina (1984, 1990 pp. 49‐57, 1995). In short, this author expressed concerns about the possibility of inferring a reliable route for the dispersal, and criticized the previous migration hypotheses of Phillipson, underlining above all that they were speculations, because archaeology provided little evidence for them.
Vansina also stressed that the major driving force for the Bantu linguistic divergence was the phenomenon of linguistic fission between varieties that had diverged in an earlier phase, with outermost dialects developing into languages after each fission (Heggert 2004, p. 315). According to Vansina, a first migration happened eastwards toward the Great Lakes region, while a simultaneous migration took place southwards, to the south of Cameroon and Gabon, and further south. Later, the continuous pattern of habitat that resulted from the first major migration was disrupted by the presence of both a dense forest and large stretches of marshland. This disruption, together with decreasing contact between the groups, created a number of areas from which linguistic differentiation and new secondary migrations could take place independently. When compared to the model of a single continuous wave of advance, the interpretation of Vansina allows more time for secondary linguistic contact to take place.

This section would not be complete without citing the work of the linguist Christopher Ehret, who tried to relate linguistic and archaeological evidence about Nilo‐Saharan languages. He specialized in detecting loanwords in eastern Bantu languages, postulating that a large part of eastern and southern Africa was speaking Central Sudanic languages before the Bantu expansion took place.

8 There is disagreement between specialists about the reconstruction of the lexical roots concerning metallurgy, but also about the word “banana”. Concerning metal working, it is suggested that the reconstructed roots corresponded to other activities and became related to iron work after a semantic shift. While Guthrie suggested that *‐gèdà and *‐tùd meant “iron” and “to forge” in Proto‐Bantu, in many languages they respectively mean “sharp iron object” and “to beat / to wash”. Banana is the berry (the banana fruit) produced by the plants of the genus Musa. It originally comes from New Guinea and the Philippines and was introduced to Africa about 3000 ybp, that is, probably later than the early Bantu dispersal, implying that Bantu speakers used for it a pre‐existing word that was probably used earlier for another kind of fruit or plant.

9 The Urewe civilization spread around Lake Victoria and extended in many directions. In their westward migrations they may have reached the present‐day Democratic Republic of the Congo.
In 1997, Derek Nurse reviewed the various models of diffusion possibly fitting existing linguistic classifications, adding to the traditional single migration model other ones called “wave of advance”, “discontinuous spread” and “simple wave”, and not excluding a “language substitution” model. Seen from this angle, the hypothesis of Vansina is a combination of the discontinuous spread and of the wave model. Nurse suggested that the diffusion and differentiation of Bantu varieties proceeded according to different and simultaneous processes determined by social and economic contexts that varied in time and space. To admit this is to add a wider and more complete sociolinguistic dimension to the linguistic dispersal, one that allows a variety of different local scenarios but also implies that a simple and general reconstruction of the full dispersal process is likely to remain a vain quest.

6.1.1.6 The new synthesis

In previous sections we have tried to summarize the major steps of the debate surrounding the interpretation of the (constantly increasing) body of linguistic data. Unsurprisingly, the number of possible interpretations rose in parallel, with the result that the Bantu question became known as an “intractable problem”. But a new phase in scientific investigation started with the impulse given by the British archaeologist Colin Renfrew, especially when he was the Director of the McDonald Institute for Archaeological Research in Cambridge (UK). Renfrew had great hope that true multidisciplinary research would help to disentangle a number of problems related to all those human dispersals that have occurred in various areas of the world as a consequence of the parallel and independent “invention” of agriculture by many human groups. Largely inspired by the multidisciplinary perspective of the Italian geneticist Luigi‐Luca Cavalli‐Sforza, he coined the expression “new synthesis” (Renfrew 2000, 2010), suggesting that a reliable tale of human history arises when linguistic, archaeological and genetic evidence are put together. To demonstrate it, in the early 2000s Renfrew organized a memorable series of meetings convening archaeologists, linguists and human population geneticists. The research presented in this chapter of the dissertation has its roots in this train of anthropological thought.

Concerning Bantu languages, new computational studies were published that applied maximum parsimony analyses10 to 75 Bantu languages (Holden 2002) extracted from the 542 Bantu languages published by Bastin et al. (1999). As the purpose was to (later) compare the obtained classification with anthropological data, the languages taken into account were only those corresponding to anthropologically well‐described populations. In a follow‐up paper (Holden and Gray 2006), the same 75 Bantu languages11 were reanalyzed using Bayesian phylogenetic inference, a technique derived from biological taxonomy12 and new in linguistics.

10 Maximum parsimony is a criterion to identify the phylogenetic tree that minimizes the number of character‐state changes. This criterion minimizes the amount of homoplasy, that is, states that appear identical not ‘by descent’ (= divergence) but because of convergence, evolutionary reversal, etc.

11 The languages were defined by a word‐list of 92 concepts, very close to the 100‐word Swadesh list.

12 In rather intimidating mathematical jargon, Bayesian inference of phylogeny is based on Bayes’ theorem and combines the prior probability of a genealogical tree with the likelihood of the data to produce a posterior probability of the trees. The posterior probability of a tree indicates the probability that it is correct, and the tree with the highest posterior probability is the one chosen to best represent a phylogeny. The advantage of this method is that it can account for phylogenetic uncertainty. Many tutorials are available on the Internet for a better explanation that is, necessarily, much longer and beyond the scope of this note. Bayes’ theorem was developed in the 18th century but gained popularity only after the advent of computers, as its application is computationally very intensive. It can be applied to DNA sequences thought to have had a common ancestor and to have diverged by mutation processes. As DNA sequences are chains built from only 4 possible repeating elements (adenine, cytosine, guanine, thymine), it is easy to code sequences as multistate characters. Similarly, wordlists can be coded as multistate characters once cognate judgments are available. This is how biologists have applied Bayesian computation to languages.

The application of novel techniques to Bantu linguistics was legitimate, as the conclusion of the vast lexicostatistical work of Bastin et al. (1999) was disappointing for all those interested in a reliable classification of Bantu languages. In fact, the authors concluded that it was impossible to find a good format to present the linguistic relationships they discovered, because different results and classifications emerged when different parameters (“connectivity” and “exclusivity”) were used.13 They say (p. 223):

…groups that emerge from any attempt to impose a hierarchical structure on the Bantu languages largely fail to exhibit full connectivity or exclusivity. By varying the priority given to these two conflicting properties, it is possible to construct a series of trees which expose the ambiguities of relationship. It seems likely that such ambiguities reflect different ‘periods of association’ with other groups (which sometimes may reflect known history).

After such definitive words, the tree of Holden and Gray (2006) attracted the interest of the scientific community because the groupings of Bantu languages correlated well (as in Holden 2002) with the archaeological hypothesis about the large migration of Bantu agriculturalists eastwards. The tree consisted of pectinate clusters ordered according to the theories of Guthrie and Phillipson: Northwest Bantu, West Bantu, West Savannah Bantu, Central Savannah Bantu, East African Bantu, East Bantu and South‐eastern African Bantu.

The many criticisms pointing to some details of the tree, and to methodological approximations in the way the cognates had been coded (see Marten 2006 for a review), did not dampen the new optimism that a reliable genetic classification of Bantu languages might be achieved. In fact, the phylogenetic signal found by Holden and Gray (2006) was strong enough to build meaningful classifications, contrary to the conclusions of Bastin et al. (1999). In the paper, Holden and Gray also adopted a network approach14 pointing to three phenomena characterizing Bantu languages, namely “rapid radiation”, “chain‐like convergence” and “tree‐like divergence”: phenomena that could explain why persistent ambiguities remained under discussion for decades. They explained that western Bantu languages underwent a rapid and early radiation, called ‘star‐like’, with little evidence for borrowing, which implies that speech communities were quite isolated from one another. By contrast, eastern Bantu languages show early‐stage borrowing of a sort found in large dialect continua. Borrowing is considered to have been an early process, because many languages are concerned. A fairly tree‐like structure was found in south‐eastern Bantu languages, meaning that they diverged in the absence of major linguistic contact. The focus of Holden and Gray (2006) on the West / East differences in the evolution of Bantu languages is missing in the work of Bastin et al. (1999) and probably explains why the latter authors failed to provide a coherent general classification.15 The study of Holden and Gray gained wide renown, because it was compatible with the archaeological evidence documenting the spread of agriculture from the north‐west and because it fit the “new synthesis” advocated by Colin Renfrew.

13 Connectivity is defined as the degree of coherence within one postulated sub‐group. Exclusivity is the degree to which vocabulary is shared with members outside the postulated sub‐group.

14 Phylogenetic trees provide a clear representation that enables the testing of hypotheses. However, more complex evolutionary scenarios are poorly described by trees and, even in the case of a tree‐like evolution, a richer visualization of the data is provided by phylogenetic networks. First adopted in genetics, they display reticulate events such as hybridization, horizontal gene transfer, recombination, or gene duplication and loss. A parallel to such molecular events can be found in linguistics, as languages also exhibit similar phenomena (borrowing, fission, change of meaning for the same lexical item, etc.).
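The coding of wordlists as multistate characters (note 12) and the parsimony criterion (note 10) can be illustrated with a minimal sketch. It is not the procedure of Holden (2002) or Holden and Gray (2006): the four languages, the cognate classes, the two candidate trees and all function names are hypothetical, and the number of changes is counted with the standard Fitch algorithm on a fixed rooted binary tree rather than by a search through tree space.

```python
from typing import Dict, List, Tuple, Union

# A rooted binary tree: either a language name or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def code_characters(cognates: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Recode cognate-class labels as integer states, one character per concept."""
    languages = list(cognates)
    n_concepts = len(next(iter(cognates.values())))
    coded: Dict[str, List[int]] = {lang: [] for lang in languages}
    for i in range(n_concepts):
        states: Dict[str, int] = {}  # label -> state, numbered in order of appearance
        for lang in languages:
            label = cognates[lang][i]
            coded[lang].append(states.setdefault(label, len(states)))
    return coded


def fitch_changes(tree: Tree, states: Dict[str, int]) -> int:
    """Minimum number of state changes for one character on the given tree."""
    def visit(node: Tree):
        if isinstance(node, str):  # leaf: the observed state, no change yet
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:  # the two subtrees can agree without an extra change
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1
    return visit(tree)[1]


def parsimony_score(tree: Tree, coded: Dict[str, List[int]]) -> int:
    """Total number of changes over all characters (lower is more parsimonious)."""
    n_characters = len(next(iter(coded.values())))
    return sum(
        fitch_changes(tree, {lang: row[i] for lang, row in coded.items()})
        for i in range(n_characters)
    )


# Hypothetical cognate judgments for three concepts in four invented languages.
cognates = {
    "L1": ["a", "a", "a"],
    "L2": ["a", "a", "b"],
    "L3": ["b", "a", "b"],
    "L4": ["b", "b", "a"],
}
coded = code_characters(cognates)
tree_1 = (("L1", "L2"), ("L3", "L4"))
tree_2 = (("L1", "L3"), ("L2", "L4"))
print(parsimony_score(tree_1, coded), parsimony_score(tree_2, coded))
```

On this toy data the first tree requires fewer changes than the second and would therefore be preferred under the parsimony criterion; a Bayesian analysis would instead weigh such trees by their posterior probability.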

6.1.1.7 Early population genetics evidence for the Bantu expansion

Genetically, Bantu populations are relatively homogeneous (Li et al. 2014), a finding compatible with their fast spread from western Africa to the rest of the sub‐Saharan continent after the adoption of agriculture. Moreover, their genetic homogeneity is higher than that of other western African populations.

As genetic material is expensive to collect and to analyze, and as DNA sequencing has been routine practice for only about 20 years, the first comprehensive study concerning the genetic diversity of human populations at a global scale (Cavalli‐Sforza et al. 1994) did not include DNA sequences and was mainly about the frequency of specific alleles16 in different populations. Aggregate results for many genetic systems17 were presented as principal component maps, where each principal component was plotted separately, over geographic maps, as an interpolated grey‐scale pattern.

Cavalli‐Sforza suggested that the geographic pattern of the fourth principal component for the African continent was related to the Bantu expansion (Fig. 6.6). The map displays two major poles of variation: the first (in black) is considered to correspond to the original Bantu homeland, whereas the second pole (the lightest shading) is considered to be the genetic mark of a preceding and different expansion in western Africa, probably related to agriculture as well. According to the genetic mapping, the original Bantu homeland is not located in the middle Benue river valley (see section 6.1.1.3) but further to the south‐east. Nevertheless, Cavalli‐Sforza explained that genetic mapping is much less accurate than archaeological or linguistic cartography, and concluded that the discrepancy was not relevant because it was related to an uneven sampling scheme.

15 The comparison of the trees obtained by Bastin et al. (1999) with those reported in later studies has been possible only at a rather superficial topological level, because the original matrices published by the group of Tervuren have since been lost (Jacky Maniacki 2015, personal communication).

16 Variants of genes. Many genes were considered.

17 A genetic system is jargon defining the heredity of given genes, that is, the identification and transmission of the variants (alleles) in populations. Genes that are highly polymorphic, and whose variants do not give a special advantage in evolutionary terms, have been favoured by scientists to estimate the diversity of human populations because they are ‘neutral’, that is, freer to change.

Figure 6.6Synthetic map of Africa displaying the values of the fourth principal com‐ ponent. The scale provides the two extremes of the variation, the black area being one of them. The circle with the cross ‘’ corresponds to the exact location of the middle Benue river valley, the probable origin of the Bantu languages. The black maximum is located close to it. Genetic cartography is not very accurate, especially when interpolation meth‐ ods are used (like here), therefore the black maximum may be regarded as overlapping the Middle Benue river. [From Cavalli Sforza et al. 1994, p. 192, with permission]

Until the year 2002, there was not much to add to this genetic reconstruction, as uniparental markers,18 like the Y‐chromosome and the mitochondrial DNA, had not been typed and sequenced in numbers sufficient to address the genetic variability of Bantu‐speaking populations in a detailed way. Concerning mitochondrial DNA, the first comprehensive paper focusing on Bantu populations is probably that of Salas et al. (2002). By reanalyzing 2,847 samples from throughout the continent, including 307 new sequences from south‐east African Bantu speakers, they estimated that the ages of the major founder‐types of both West and East Africans are consistent with the timing of Bantu dispersals, founder‐types from the west somewhat predating those from the east (similarly to what linguistic inference was suggesting; see section 6.1.1.2). Despite this composite picture, the south‐eastern African Bantu groups are indistinguishable from each other with respect to their mitochondrial DNA, suggesting that they either had a common origin at the point of entry into south‐eastern Africa or have undergone extensive maternal gene‐flow since.

18 Uniparental markers are transmitted through the generations, along the male line (Y‐chromosome) or the female line (mitochondrial DNA), virtually unchanged apart from the mutations that can arise along the way.

Concerning the Y‐chromosome, Underhill et al. (2001; 2002) published a comprehensive Y‐chromosome phylogeny at a worldwide scale and wrote (Underhill 2002, p. 72):

“The widespread Bantu expansion is a recent event19 from a discreet homeland. We now have capability to define high‐resolution Y‐chromosome binary and microsatellite‐defined haplogroups [=major types], in both putative regions of origin and destination. Thus, it appears promising that it will be possible to eventually provide additional resolution to any earlier demographic events, as well as the more recent Bantu east and west migration streams.”

6.1.2 Linguistic and genetic diversity of Bantu population from Gabon

6.1.2.1 Origin of the project

This rather extensive introduction was necessary to situate our study in time. The study is a comparison of the cultural and genetic diversity of the Bantu populations living in Gabon. The research started in 2000, right at the beginning of the new ‘optimistic phase’ advocated by Colin Renfrew in his New Synthesis paradigm. At first, the Gabon project was part of the « Origine de l’Homme, du Langage et des Langues » scientific programme launched and funded by the French CNRS, later extended by the interdisciplinary Eurocores OMLL action (Origin of Man, Language and Languages) financed by the European Science Foundation. In both cases it was under the supervision of Professor Lolke Van der Veen (University of Lyon, France), a linguist specializing in the languages spoken in Gabon at the CNRS research laboratory “Dynamique du Langage” (Lyon, France), directed by Professor Jean‐Marie Hombert (for an early sketch of the project see Van der Veen and Hombert 2001).

19 In genetics ‘recent’ refers to the pace of the genetic differentiation. Two identical sequences evolving in two independent genealogical branches need several millennia to significantly diverge. When the difference is low, the separation is said to be recent, but can be very old in historical terms.

The uncommon scientific profile of Hombert, an engineer in informatics and a linguist, together with his influential role in French science, gave him sufficient leadership to open French linguistics to other disciplines, like population genetics. This is how we became members of the project, with the assignment to compare linguistic and genetic data in a statistical way. The original project was intended to compare the linguistic and genetic classifications of Bantu speakers in a wider frame than Gabon itself (including Angola, Kenya and Tanzania), in order to have different test‐sites representative of the many linguistic clusters into which Bantu languages can be divided. Political instability in Angola and the lack of sufficient financial support narrowed the research to Gabon only.

6.1.2.2 Linguistic evidence and hypotheses

The region comprising Gabon, the Democratic Republic of Congo (DRC) and Congo has a quite complex linguistic landscape (Van der Veen 2006b). According to the classification of Bastin et al. (1999) (see Fig. 6.5), a major linguistic boundary separates North‐western Bantu from Central‐western Bantu. This partition is broadly confirmed by the classification of Holden and Gray (2006), though convergence phenomena can be noted along the boundary.

North‐western Bantu varieties of Gabon include languages from the Guthrie zones A and B. Four varieties are from zone A (Benga A34; Fang A75; Shiva A83; Bekwil A85b), while zone B is represented by the groups B10 (MYENE), B20 (KOTA‐KELE) and B30 (TSOGO). As already noted by Nurse and Philippson (2003) in their study of grammatical innovations, and later confirmed by Mouguiama‐Daouda and Van der Veen (2005) on lexical traits, the group B10‐B30 is distinct from other languages in the region, but it is not clear whether this is the consequence of a common genealogical origin or the result of linguistic convergence related to contact. The status of the group B20 is ambiguous, both because its unity is unclear and because its affiliation to North‐western Bantu, advocated by Bastin et al. (1999), is debated.

Central‐western Bantu languages from Gabon include the group B40 (SIRA),20 linguistically related to languages spoken in zone H, and the three groups B50 (NJABI), B60 (MBETE) and B70 (TEKE), which are similar to the languages spoken in zone C. The latter three groups can be classified into a single cluster B50‐B60‐B70, in which the languages B60‐B70 are very close and those corresponding to B50 might have a different origin, though they have clearly converged.

20 To which the variety Vili (H12a) has to be added.

According to Clist (2005, p. 490), starting about 2600 ybp, Gabon has been progressively peopled by waves of Bantu‐speaking populations coming from the north‐east, but also from the south and the east. This means that the peopling scenario is more complex than the simple southward movement from Cameroon implied by the wider migration model of the Bantu expansion proposed by Vansina (1995), a model in which Gabon would have been crossed by the western Bantu expansion wave moving to the south. Clist (2005) explains that, in reality, the main migration wave from the Benue river valley to the south might not have passed through Gabon but further east, through a savannah corridor created by dry climatic conditions in what previously was a dense equatorial forest. This corridor is believed to have lasted from 2800 ybp to 2100 ybp, that is, about seven centuries (Maley 2001): a time‐span long enough to enable continued human migrations. Unfortunately, Clist fails to provide a really convincing and well‐supported scenario, his views being a compilation of very heterogeneous sources, and the absence of anthropological remains and burials21 adds to the uncertainty.

Whatever the general migration scenario, there is a consensus that the Fang languages (A75) correspond to a rather recent migration wave from Cameroon, which started 1000 ybp and continued until the 1930s. The Fang crossed the north of Gabon, settled there and did not go much further than the region surrounding the present‐day capital city (Libreville). While shared and strong local beliefs point to an Egyptian origin of Fang populations that would still be reflected in the speech, detailed studies have demonstrated that Fang languages fall perfectly within the Bantu linguistic domain, although they have some peculiar features (8 vowels, diphthongs, labiovelar consonants, etc.) (Guthrie 1948, 1967, 1971; Hombert et al. 1989; Medjo Mvé 1997).
In order to corroborate the demographic scenario emerging from the study of the linguistic data, 21 test‐sites were selected to sample human saliva and blood, in order to enable DNA testing. Ethnic and linguistic information was collected during the fieldwork, to be better able to relate linguistic classifications to the populations speaking the languages.

Genetic material was analyzed independently by two research groups. The variability of the mitochondrial DNA (transmitted along the female line) was assessed at the Institut Pasteur, Paris, whereas the diversity of the Y‐chromosome (transmitted along the male line) was measured at the University Pompeu Fabra, Barcelona. The analysis of sex‐specific genetic lineages usually enables a high resolution and makes it possible to take into account social phenomena such as patrilocality and matrilocality.

21 The acid soil of the forests has chemical properties that do not favour the preservation of bones.

Genetic results are generally available in a numerical form, as distance matrices and frequency vectors. To allow a direct comparison with linguistic data, collaboration with the University of Groningen started in 2003. The rationale was to compute Levenshtein distance (also called edit distance; Heeringa 2004) matrices accounting for the lexical diversity of the ALGAB (Atlas Linguistique du GABon; see APPENDIX 1). At that time, computational linguistic methods were quite new, and the Levenshtein algorithm had never been applied to phonological systems outside Germanic and Romance dialects. The group in Lyon was therefore unsure whether the method would correctly depict the lexical similarities and differences of Bantu languages (also because tonal information is not taken into account by the edit distance). The ALGAB could not be processed right away, because tedious manual coding and the verification of transcriptions were required.

For this reason we decided to first run a computational test on Bantu varieties that did not require further work: we applied the edit distance analysis to a Tanzanian dataset (Nurse and Philippson 1975).

The computational classification obtained by using the Levenshtein algorithm has been validated by Professor Gérard Philippson. As has been seen in the background introduction, Tanzania belongs to the eastern part of the Bantu linguistic domain, a region where the linguistic variability is smaller than in the northern part, where Gabon is located. The good performance of the edit distance on Tanzanian varieties, considered to be closer to each other than those included in the ALGAB, was reassuring, so that the latter were finally processed too.
In the next section both experiments will be presented, Tanzania first followed by Gabon, but the comparison of linguistic data with genetic evidence will concern only the latter region.
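As an illustration of the edit distance mentioned above, the following minimal sketch computes the plain Levenshtein distance between two transcriptions and averages it over a wordlist to obtain a distance matrix between varieties. It is not the implementation used in this study: the normalization by the length of the longer word, the function names and the three toy wordlists (invented strings, not ALGAB or Tanzanian data) are assumptions made for the example. As noted in the text, tonal information is ignored.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def lexical_distance(words_a: List[str], words_b: List[str]) -> float:
    """Mean normalized distance over the concepts attested in both varieties."""
    pairs = [(x, y) for x, y in zip(words_a, words_b) if x and y]
    if not pairs:
        raise ValueError("no concept is attested in both wordlists")
    return sum(normalized_distance(x, y) for x, y in pairs) / len(pairs)


def distance_matrix(wordlists: Dict[str, List[str]]) -> Tuple[List[str], List[List[float]]]:
    """Symmetric matrix of lexical distances between all pairs of varieties."""
    names = sorted(wordlists)
    matrix = [[lexical_distance(wordlists[a], wordlists[b]) for b in names]
              for a in names]
    return names, matrix


# Invented toy transcriptions for two concepts in three hypothetical varieties.
toy_wordlists = {
    "V1": ["mutu", "mbwa"],
    "V2": ["muntu", "mbwa"],
    "V3": ["moto", "imbua"],
}
names, matrix = distance_matrix(toy_wordlists)
for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```

A matrix of this kind has the same numerical form as the genetic distance matrices, which is what makes a direct statistical comparison between the two kinds of data possible.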

6.2 METHODS

6.2.1 Linguistic datasets

6.2.1.1 Database 1: Bantu languages from Tanzania

In 2002, thirty‐two languages were selected by Professor Gérard Philippson (University of Lyon, France) from a database including 74 Tanzanian languages (Nurse and Philippson 1975), in order to test the classificatory performance of the Levenshtein algorithm. Each wordlist comprised 1079 concepts. The languages correspond to four geographically separate groups meant to be a representative subsample of the full database.

Before processing the data, the current classification of the 32 languages according to the geographical zones defined by Guthrie and updated by Maho (2009) was verified (Table 6.1). The selection was made in order to include several distinct groups (E50 KIKUYU‐KAMBA Group; E60 CHAGA Group; E70 NYIKA‐TAITA Group; F20 SUKUMA‐NYAMWEZI Group; G20 SHAMBALA Group; G30 ZIGULA‐ZARAMO Group; G50 POGOLO‐NDAMBA Group; M10 FIPA‐MAMBWE Group; M20 NYIHA‐SAFWA Group; M30 NYAKYUSA‐NGONDE Group; N10 MANDA Group; P10 MATUUMBI Group; P20 YAO Group) located in four distinct geographical areas (Fig. 6.7). Each of the four geographically separate groups includes neighboring languages. This choice was meant to assess the ability of the Levenshtein method to distinguish the groups and to detect more tenuous linguistic differences within them.

6.2.1.1.1 Origin of the Tanzanian dataset

The data come from a lexicostatistical survey of about 100 Tanzanian languages undertaken in the early 1970s by Derek Nurse, Gérard Philippson and a team from the Department of Foreign Languages and Linguistics of the University of Dar es Salaam (Tanzania). For more details on the survey and the early analyses of the data see Nurse and Philippson (1980). See APPENDIX 2.

The source documents were sets of printed paper forms, each one containing 1079 entries. These forms were distributed to native Tanzanian students of the University of Dar es Salaam for translation into the target Tanzanian languages once the students were back in their family areas for the holiday season. Students could fill in the forms themselves if they were familiar with the varieties, but they were also asked to check their translations with relatives and people of their community. Each set of forms began with a page of instructions and a short section for details about the person filling in the form and the language being documented.

The printed forms for the survey were numbered in two sections. The main section consists of a wordlist in parallel columns, in Swahili and English, with room for a translation into the target language. The list contains 1052 entries numbered from 1 to 1038, with entry 929 missing; 15 additional entries are suffixed with the letter ‘a’ (e.g. entry 50a, which follows entry 50) to make up the section's total. After this, there is a short section of phrases in English, some of which are translated into Swahili, for which single terms in the target language were sought. This section, containing 27 entries, has been excluded from this study because including it resulted in inconsistent alignment of the columns of the datafile we accessed.

Paper forms were typeset and OCR scanned at Berkeley in the 1990s, then uploaded to the website http://www.cbold.ish-lyon.cnrs.fr/.22 In 2002, the 32 languages analyzed in this study were downloaded from the website by G. Philippson (Table 6.1). Later on, and for an unknown reason, the files were deleted from the website. The 1079 concepts included many items which were not reliable, or included loans from Swahili, or were unknown to the respondents, which is why Gérard Philippson and Derek Nurse reduced the lists to 400 items (Derek Nurse, personal communication 2016). Unfortunately, the 400-item wordlists have been lost. This is why we have processed a slightly shorter version (1052 concepts) of the full database that used to appear on the website preceded by a disclaimer.23
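The entry counts quoted in this section can be checked with a few lines of arithmetic. The variable names below are ours and serve only this illustration.

```python
# Main section: numbers 1..1038 with entry 929 missing,
# plus 15 additional entries suffixed with 'a' (such as 50a).
numbered_entries = 1038 - 1
suffixed_entries = 15
main_section = numbered_entries + suffixed_entries

# The phrase section (27 entries) was excluded from this study.
phrase_section = 27
full_form = main_section + phrase_section

print(main_section, full_form)
```

The main section thus accounts for the 1052 concepts processed here, and adding the phrase section gives the 1079 entries of the printed forms.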

6.2.1.1.2 Mapping

Since the 32 Tanzanian languages were not linked to geographical coordinates, we plotted them on a map according to the language areas reported by the Ethnologue (Grimes 2000) (Fig. 6.7). We preferred this mapping to the one provided by Maho (2009) because the latter represents linguistic areas group by group in separate maps, meaning that a synthetic map would have required too much time to produce. A later comparison has shown that the two mappings are consistent, with just one exception (Meru; E53).

6.2.1.2 Database 2: Atlas Linguistique du Gabon (ALGAB)

The linguistic atlas of Gabon (Atlas Linguistique du Gabon) (Hombert 1990a) has not yet been published, though it is expected to be.24 The data concern Gabon, with the exception of two sites located in Congo, next to the border. At the moment the ALGAB should be regarded as work in progress, though close to completion.25 In its current version it consists of an Excel spreadsheet including 158 glosses gathered in the frame of ethnolinguistic interviews made from the early 1990s to the end of 2006 by several linguists (Table 6.2 and APPENDIX 2). All transcriptions were checked and made homogeneous (by listening to the original recordings) under the supervision of the same linguist (Lolke Van der Veen, University of Lyon). The choice of the words included in the list is largely inspired by the 200-word list of Swadesh but has been further modified by J.M. Hombert in order to better correspond to the ecological context of Gabon.26

22 Transcribers: Stanley Cushingham (Yale University); Lawrence Greening (Memorial University of Newfoundland); Towhid bin Muzaffar (Memorial University of Newfoundland); John Lowe (CBOLD, University of California at Berkeley); Jeff Good (CBOLD, University of California at Berkeley).
23 “Pending final review for accuracy by Prof. Nurse and others, the information here should not be relied upon. The database facility is operational now, but it is intended only to assist in research.”
24 Five volumes are scheduled: I Geolinguistics; II Phonetic change and classification; III Phonological systems; IV Tonal systems; V Syntax.
25 New languages have been added: several B20 varieties (see Mokrani 2016), the Baka Ubanguian language spoken by the Baka pygmies (by Pascale Paulin) and Bekwil (A85b – by Marion Cheucle).

26 The ALGAB 158-word list (French / English) is the following (the concepts underlined also appear in Bastin et al. 1999): 1: Bouche / Mouth; 2: Œil / Eye; 3: Tête / Head; 4: Poil / Hair; 5: Dent / Tooth; 6: Langue / Tongue; 7: Nez / Nose; 8: Oreille / Ear; 9: Cou / Neck; 10: Sein / Breast; 11: Bras / Arm; 12: Ongle / Nail; 13a: Jambe / Leg; 13b: Cuisse / Thigh; 14: Fesse / Buttock; 15: Ventre / Belly; 16: Nombril / Belly button; 17: Intestins / Guts; 18: Sang / Blood; 19: Urine / Urine; 20: Os / Bone; 21: Peau / Skin (human); 22: Aile / Wing; 23: Plume / Feather; 24: Corne / Horn; 25: Queue / Tail; 26: Personne / Person (human being); 27: Mâle / Male; 28: Femme / Woman; 29: Mari / Husband; 30: Enfant / Child; 31: Nom / Name; 32: Ciel / Sky; 33: Nuit / Night; 34: Lune / Moon; 35: Soleil / Sun; 36: Vent / Wind; 37: Nuage / Cloud; 38: Rosée / Dew; 39: Pluie / Rain; 40: Terre / Ground (on the); 41: Sable / Sand; 42: Chemin / Path; 43: Eau / Water; 44: Rivière / River; 45: Maison / House; 46: Feu / Fire; 47: Bois (de chauffage) / Firewood; 48: Fumée / Smoke; 49: Cendre / Ash; 50: Couteau / Knife; 51: Corde / Rope; 52: Lance / Spear; 53: Guerre / War; 54: Animal / Animal; 55: Chien / Dog; 56: Éléphant / Elephant; 57: Chèvre / Goat; 58: Oiseau / Bird; 59: Tortue / Turtle; 60: Serpent / Snake; 61: Poisson / Fish; 62: Pou / Louse; 63: Œuf / Egg; 64: Arbre / Tree; 65: Écorce / Bark; 66: Feuille / Leaf; 67: Racine / Root; 68: Sel / Salt; 69: Graisse / Fat (animal); 70: Faim / Hunger; 71: Fer / Iron; 72: Cœur / Heart; 73: Étoile / Star; 74: Foie / Liver; 75: Genou / Knee; 76: Montagne / Mountain; 77: Pierre / Stone; 78: Graine / Seed; 79: Champignon / Mushroom; 80: Pygmée / Pygmy; 81: Paume (de la main) / Palm; 82: Menton / Chin; 83: Lit / Bed; 84: Visage / Face; 85: Cheveu / (A strand of) Hair; 86: Poitrine / Chest; 87: Village / Village; 88: Honte / Shame; 89: Sommeil / A sleep; 90: Un / One; 91: Deux / Two; 92: Trois / Three; 93: Quatre / Four; 94: Cinq / Five; 95: Six / Six; 96: Sept / Seven; 97: Huit / Eight; 98: Neuf / Nine; 99: Dix / Ten; 100: Venir / To come; 101: Envoyer / To send; 102: Marcher / To walk; 103: Tomber / To fall; 104: Partir / To leave; 105: Voler / To fly (bird); 106: Verser / To pour; 107: Frapper / To hit; 108: Cultiver / To cultivate; 109: Enterrer / To bury; 110: Brûler / To burn; 111: Manger / To eat; 112: Boire / To drink; 113: Vomir / To vomit; 114: Mordre / To bite; 115: Lever / To lift; 116: Fendre / To cleave; 117: Donner / To give; 118: Voler / To steal; 119: Presser / To squeeze; 120: Sucer / To suck; 121: Cracher / To spit; 122: Souffler / To blow; 123: Enfler / To swell; 124: Donner naissance / To give birth; 125: Mourir / To die; 126: Tuer / To kill; 127: Pousser / To push; 128: Tirer / To shoot; 129: Chanter / To sing; 130: Jouer / To play; 131: Avoir peur / To be afraid; 132: Vouloir / To want; 133: Dire / To say; 134: Voir / To see; 135: Montrer / To show; 136: Entendre / To hear; 137: Savoir / To know; 138: Compter / To count; 139: Être assis / To sit; 140: Nager / To swim; 141: Blanc / White; 142: Noir / Black; 143: Rouge / Red; 144: Chaud / Warm (weather); 145: Froid / Cold (weather); 146: Beaucoup / Many; 147: Tous / All (of them); 148: Sec / Dry; 149: Mouillé / Wet; 150: Bon / Good; 151: Grand / Big; 152: Long / Long; 153: Petit / Small; 154: Plein / Full; 155: Nouveau / New; 156: Qui ? / Who?; 157: Quoi ? / What? – 5 concepts of the 93 in Bastin are missing in the ALGAB (‘to lie down’; ‘man’ (man of masculine sex); ‘meat’; ‘round’ (as a ball); ‘to stand’), with an overlap of 88 concepts.

The ALGAB wordlist was designed for preliminary linguistic research adapted to the linguistic and cultural situation of Gabon. It draws on existing elicitation lists such as the ALAC list27 and takes previous experience and knowledge of the (extended) area into account. The list of 158 words includes mainly nouns (89) and verbs (41), and additionally numerals (from one to ten), adjectives (13), adpositions (2), interrogative pronouns (2) and a few unclassifiable items (11). The set was chosen to obtain high-frequency core vocabulary that is not culturally marked, at least not to a great degree.28

As a rule, both singular and plural forms were collected, though for some varieties there is only one form. Having singular and plural forms is important to Bantu specialists because morphological information, such as the gender of substantives, is reflected by the choice of plural prefixes. Although tone and stress information have been ignored in this study, these features are included in the database and are relevant. They will require a fuller computational treatment in the future. Stress is not marked in the database because it is predictable in all varieties: it usually falls on the first syllable of the noun stem, while straightforward penultimate stress is sometimes used. No stress contrasts have been found (within single varieties). While the decision not to mark stress is understandable from the point of view of phonological theory, it would be preferable to have data marked with stress to keep track of its distinctive use among different varieties. As far as is known, tone is distinctive in most if not all varieties.

Previous analysis has revealed a few different basic categories in the tone systems in use. This is one of several factors that make a proper study and verification of the tonal transcription throughout all the data very time consuming. Tonal information had to be discarded from the data in the current research because it was not systematically transcribed in the field, for different reasons: i) the absence of tonal contrast at the surface; ii) priority given to the segmental level; iii) inability of the transcriber.
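Discarding tone from transcriptions can be pictured with the sketch below. It assumes, purely for illustration, that forms are written with standard Unicode combining diacritics, which is not how the IPALA-encoded source stores them (see section 6.2.1.2.1). The set of marks treated as tonal and the function name are likewise assumptions. Nasalization (tilde) and the syllabic marker are deliberately kept, since they are segmental.

```python
import unicodedata

# Combining marks treated as tone marks in this sketch (an assumption).
TONE_MARKS = {
    "\u0300",  # combining grave accent (low tone)
    "\u0301",  # combining acute accent (high tone)
    "\u0302",  # combining circumflex (falling tone)
    "\u030C",  # combining caron (rising tone)
    "\u0304",  # combining macron (mid tone)
}


def strip_tones(form: str) -> str:
    """Remove tone diacritics, keeping all other diacritics."""
    decomposed = unicodedata.normalize("NFD", form)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


# A nasalized vowel with a high tone keeps its nasalization only.
print(strip_tones("a\u0303\u0301") == "\u00e3")
```

Decomposing to NFD first makes the function independent of whether the input uses precomposed or combining characters.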

27 Atlas Linguistique de l’Afrique Centrale. See also Dieu and Renauld (1983).
28 Fieldwork was performed by a team comprising some 15 well-trained elicitors: Jean-Marie Hombert, Gilbert Puech, Jean Alain Blanchon, Louise Fontaney, Lolke Van der Veen, Pither Medjo Mve, Patrick Mouguiama-Daouda, Daniel-Franck Idiata and Roger Mickala-Manfoumbi. The few initial and principal elicitors (Hombert, Puech, Blanchon, Fontaney) are all experienced fieldworkers and worked closely with less experienced participants, often supervising them.

Table 6.1Tanzanian dataset Variety is the name of the language; NUGL is the updated Guthrie code according to Maho (2009); 1‐2‐3‐4 show to which geographical group the varieties belong (more explanation in the text, but see Fig. 6.7); Annotations include useful information about nomenclature and Ethnologue corresponds to the three‐letters code used by the Ethnologue.com (Grimes 2000) to identify the language. The wordlists processed are constituted by 1052 concepts.

Variety NUGL 1 2 3 4 Annotations Ethhnologue Fipa M10 (M13) + Fip Lambya M20 (M201a) + Lai Lungwa M10 (M12) + Rungwa in Ethnologue Rnw Malila M20 (M24) + Mgq Mambwe M10 (M15) + Mgr Nam- M20 (M22) + Mwn Ndali M30 (M301) + Ndh Nyakyusa M30 (M31) + Nyy Nyiha M20 (M23) + Nih Pimbwe M10 (M11) + Piw Rungu M10 (M14) + Mgr Safwa M20 (M25) + Sbk Wanda M20 (M21) + Wbh Wungu F20 (F25) + Bungu in Maho (2009) Wun Manda N10 (N11) + Mgs Matengo N10 (N13) + Mgv Mpoto N10 (N14) + Mpa Mwera P20 (P22) + Mwe Ngoni N10 (N12) + Ngu Mbunga P10 (P15) + Mgy Ndamba G50 (G52) + Ndj Pogoro G50 (G51) + Poy Dawida E70 (E74a) + Dabida in Maho (2009) Dav Gweno E60 (E65) + Gwe Kibosho E60 + Kivoso in Maho (2009) Kaf Kiseri E60 (E623a) + Not found in Ethnologue - Machame E60 (E621b) + Mashami in Maho (2009) jmc Meru E50 (E53) + Maho (2009) maps it differently mer Mkuu E60 (E623c) + kaf Seuta G20 (G23;G24) + + Assemblage by Nurse and ksb (Shambala) G30 (G31;G34) Philippson (1980, p. 688) of bou (Bondei) four varieties: G23 (Shambala) ziw (Zigula) + G24 (Bondei) + G31 (Zigula) ngp (Ngulu) + G34 (Ngulu) and called Seuta is the name of a mythical common ancestor. Siha E60 (E621c) + kaf Vunjo E60 (E622c) + Wuunjo in Maho (2009) vun


Figure 6.7  Map of Tanzania where the 32 varieties analysed in this study are mapped according to the publication Ethnologue.com (Grimes 2000). See Table 6.1.

The table of data has 10,417 filled cells, approximately 64% of the theoretically possible total. There are a few more data points, as some entries consist of more than one linguistic equivalent. There are two relatively frequent diacritics present in the data: nasalization and the syllabic marker.

Informants were chosen in various ways. Whenever possible, the choice was made in consultation with the elders of the communities or, failing that, based on a preliminary check. Elicitation was usually carried out in French with a bilingual speaker, and in a few cases through an interpreter. Many interviews were conducted in villages and hamlets, but others took place more informally on the roadside. In some cases several speakers were interviewed for a single variety; therefore in many cases the entries reported are a consensus. Both the data collection activities and the data itself were documented carefully, including as many details as possible: language varieties with their name(s), dates, names of consultants, names of elicitors, number of items collected, nature and quality of elicited material, locations, maps with precise or approximate location(s) for each language variety, etc. (Table 6.2, Fig. 6.8).

The data collected was systematically checked with the help of additional consultants and on the basis of the good quality recordings made in the field (as a rule, word lists were recorded using DAT or mini-disk recorders). The sound recordings were particularly important in checking transcriptions by less experienced elicitors, as they served to safeguard the uniformity and the reliability of transcriptions. Additionally, judgments of reliability were attributed to each sample collected in the field, which resulted in some data being discarded.

Sample lists may be incomplete for several reasons. Many of the varieties of Gabon are nearly extinct, and their speakers are not always able to recall the equivalents of the entries of the word list. In addition, multilingualism being the rule, speakers tend to mix up languages. In several cases, lists are incomplete because of a lack of time. This also explains why certain samples contain only the initial part of the list, i.e. the nouns. Since the task of a language assistant is tedious, another understandable reason is a lack of motivation on the part of the consultants, who all participated on a voluntary basis.

6.2.1.2.1 Transcription

The data used are a careful simplification of a larger corpus under development in Lyon (Laboratoire Dynamique du Langage, DDL). This version was transformed based on an up-to-date analysis of the respective language variants; predictable features such as contextual nasalization or lengthening have not been retained. The data were supplied in a Unicode encoding, though not in Unicode IPA: the encoding uses a special set of characters which must be viewed in combination with the IPALA font.29 Conversion to a more standard format was therefore necessary before analysis. Since the Levenshtein computations had been implemented using X-SAMPA until a new version of the software was recently released, the IPALA-coded characters were mapped to X-SAMPA. This conversion was verified, since IPALA is not fully documented. The bulk of the calculations was done by using the L04 dialectometric package,30 developed at the University of Groningen (see section 6.2.2 for more details about the computation of linguistic distances).
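A character-by-character conversion to X-SAMPA can be sketched as follows. Because the IPALA code points are not documented here, the example starts from standard Unicode IPA instead, which is an assumption made for illustration only. The mapping table is a small excerpt of the X-SAMPA conventions and not the table actually used, and the function name is hypothetical. Unknown characters raise an error rather than passing through silently, in the spirit of the verification mentioned above.

```python
import unicodedata

# Small excerpt of IPA -> X-SAMPA correspondences (illustrative, not exhaustive).
IPA_TO_XSAMPA = {
    "\u025B": "E",    # open-mid front unrounded vowel
    "\u0254": "O",    # open-mid back rounded vowel
    "\u0259": "@",    # schwa
    "\u014B": "N",    # velar nasal
    "\u0272": "J",    # palatal nasal
    "\u0283": "S",    # voiceless postalveolar fricative
    "\u0292": "Z",    # voiced postalveolar fricative
    "\u0263": "G",    # voiced velar fricative
    "\u03B2": "B",    # voiced bilabial fricative
    "\u025F": "J\\",  # voiced palatal plosive
    "\u0303": "~",    # nasalization (combining tilde)
    "\u0329": "=",    # syllabic marker (combining vertical line below)
}


def to_xsampa(form: str) -> str:
    """Convert a Unicode IPA string to X-SAMPA, failing on unmapped symbols."""
    converted = []
    for ch in unicodedata.normalize("NFD", form):
        if ch in IPA_TO_XSAMPA:
            converted.append(IPA_TO_XSAMPA[ch])
        elif ch.isascii() and ch.islower():
            # Plain lowercase letters have the same value in IPA and X-SAMPA.
            converted.append(ch)
        else:
            raise ValueError(f"no X-SAMPA mapping for U+{ord(ch):04X}")
    return "".join(converted)


print(to_xsampa("\u014b\u0254"))
```

Raising on unmapped characters makes gaps in the table visible immediately, which is useful when the source encoding is only partly documented.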

29 The IPALA font was created by Egidio Marsico (Dynamique du langage, Lyon) to facilitate the transcription of the ALGAB (Hombert 1990b).
30 http://www.let.rug.nl/~kleiweg/L04/

Table 6.2  53 Bantu varieties from Gabon according to the ALGAB. Label is the name used in the figures; Variety is the language name, sometimes followed by a geographical annotation; Items corresponds to the number of concepts in the corresponding word-list; S and P indicate whether singular or plural forms were recorded; Investig. corresponds to the initials of the investigator (see note 28 for fully spelled names); LA corresponds to the number of language assistants; Location / Year / Lat. / Long. correspond to the place and year of the recording; Ref. is a reference code used in Lyon to retrieve recordings. When coordinates are in italics the position is inferred (see Fig. 6.8). For the full database see APPENDIX 2.

Label Variety Items S P Investig. LA Location Year Lat Long Ref.

A34 Benga 79 N JMH 1 Malibe 1988 0.5690 9.3673 A34 A43a Basaá Y Y GTD 1 Botmakak (west-central ? 4.0000 10.9167 A43a Cameroon) A75 (Ntumu)Fang 159 Y Y PMM 1 Bitam 2000 2.0833 11.4833 A75Bi Bitam of Bitam A75 (Okak)Fang of 138 Y Y PMM 1 Medouneu 1995 0.9500 10.7833 A75Me Medouneu Medouneu A75 (Mvègn) Fang 154 Y Y PMM 1 Minvoul 1992 2.1500 12.1333 A75Mi Minvoul of Minvoul (extreme north). B11a Mpongwe 110 Y Y PMD 3 Region of ? 0.38333 9.45 B11a (Myene Libreville. group) B11c 1991 Galwa 159 Y Y VDV 1 Lambarené 1991 -0.7000 10.2167 B11c/91 (Myene) (Sampled in Lyon) B11c Galwa 114 Y Y JMH 2 Sampled in Libreville 1987 0.38333 9.45 B11c/87 (Myene) (Eloye). B11d Adjumba 89 Y Y JMH 1 Village NWof Lambare- 1987 -0.5295 10.0270 B11d (Myene) né (Azingo Lake district) B201 Ndasa 149 Y Y JMH 1 Maloundou 1985 -2.1835 13.6488 B201 (Maloundou) 80 km S of Franceville, near Boumango B202 Sisigu / 89 Y Y JMH 1 Boundji 1986 -0.8167 12.5500 B202 Sigu 30 km W of Lastoursvil. B203 Samayi 155 Y Y SM 2 Itebe 2003 0.8167 13.4667 B203It (Itebe) 2004 B204 Ndambo-mo 89 Y Y JMH 1 Kekele 1986 0.0000 11.9500 B204Ke Kekele (Kekele) S of Makokou B204 Ndambomo 32 Y Y JMH 1 Ntoua 1986 -0.0667 11.9667 B204Nt Ntua (Ntua) GP S of Makokou Mwesa Mwesa 89 Y Y JMH 1 Mvadi 1986 1.2167 13.2000 B20Mw (B20x) NE of Makokou in the Mont de Belinga Tombidi Tombidi 71 Y N JMH Region NE 1988 -2.4167 12.2333 B20To (B20x) to Malinga (south) Gabon B21 Seki / 82 Y N GP 1 Mèndjouè 1987 0.9833 9.6000 B21 Sekyani) Region of Cocobeach

B22a Kele 142 Y Y JMH 1 Makouké 1988 -0.1593 11,7886 B22a NE of Lambarené, north of Bellevue B22b Bu-Ngom 89 Y Y JMH 1 Ekata 1986 0.6667 14.3000 B22b GP Ogooué-Ivindo, southeast of Mekambo) B23 Mbawe 86 Y Y JMH 1 Mopia 1985 -1.8833 13.5833 B23 Mopia (Mban-gwe) S of Franceville


B24 Wumbvu 159 Y Y JMH 1 Poubara 1985 -1.8121 13.5670 B24 (Poubara) near Mopia B25 (1986) I-Kota 159 Y Y JMH 1 Makokou 1988 0.5667 12.8667 B25/88 (Makokou) B25 (1988) I-Kota 89 Y Y JMH 1 Makokou 1986 0.5667 12.8667 B25/86 (Makokou) B251 Sake (Booué) 89 Y Y JMH 1 Djidji 1989 0.2144 11.8077 B251 Dyuyu 55 km NW of Booué B252 Mahong-we 94 Y Y PMM 1 Nkeyi Bokaboka 2000 1.0167 13.9333 B252Nk Nkei Region of Mekambo Bokaboka B252 Mahong-we Y N JMH Region of 1989 1.0167 13.9333 B252/89 (1989) Mekambo B301 Ge-Viya 159 Y Y VDV 1 village next to 1988 -1.2167 10.6000 B301 Fougamou B302 Ge-Himba 159 Y Y JMH 1 Vieux-Mimongo 1988 -1.0333 10.6667 B302 (Ngounié) SE of Sindara B304 Ge-Pinzi 159 Y Y JMH Region between 1988 -1.8667 11.0167 B304 VDV Mouila and Fougamou B305 Ge-Vove 159 Y Y VDV 1 W to Koula-Moutou: 1986 1.5000 10.8000 B305 region Koulamoutou- 1987 Mouila (Pouvi)-Baniati B31 Ge-Tsogo 159 Y Y RW Mimongo 1989 -1.0333 10.6667 B31 MN Sindara B32 O-Kande 95 Y Y JMH 2 Region of Makogué W 1986 -0,0500 11.6167 B32 (Booué) of Booué, Boleko 1988 B41 Gi-Sir(a) 146 Y Y JB Fougamou -1.2167 10.6000 B41 B42 I-Sangu 159 Y Y DFI 1 Dibassa 2000 -1,6775 11.8440 B42Mi Mimongo (Mimongo) 35 km E of Mimongo, = 20 km to N of Mbigou B42 Mbigou B43 Yi-Punu 148 Y Y JB >2 Tchibanga 1987 -2.8500 11.0333 B43 (Tchiba-nga) SW 1988 B44 Yi-Lumbu 142 Y Y JB Mayumba -3.4167 10.6500 B44 SW B501 Wanzi 158 Y Y JMH 1 Moanda 1985 -1.5667 13.2000 B501Mo Moanda (Moanda) W of Franceville B501 (Est) Wanzi 159 Y Y MM 1 Moanda 1990 -1.5667 13.2000 B501Es (eastern) W of Franceville B503 I-Vili 89 Y Y JMH 1 Sindara 1987 -1.0333 10.6667 B503 (Sindara) N of Fougamou B51 Li-Duma 125 Y Y JMH Region of Lastoursville 1985 -0.8167 12.7000 B51 B51 Li-Duma 153 Y Y JMH 1 Lastoursville 1985 -0.8167 12.7000 B51La Lastourville Lastoursville B52 Nzebi 155 Y Y JMH 1 Mounana 1985 -1.4333 13.1667 B52 (Mouna-na) B53 Tsengi 153 Y Y JMH 1 Poungui 1985 -1.8288 13.0218 B53 near Bakoumba, W of Boumango B601 Le-Mpini 152 Y Y JMH 1 Omoy 1985 
-1.5167 13.8000 B601 2 km after Ngouoni on road to Akieni B602 Kaningi 156 Y Y JMH 1 Mbouma-Makama 1985 -1.3333 13.3500 B602No (Nord) (Northern) B602 Kaningi 158 Y Y JMH 1 Mopia 1985 -1.8833 13.5833 B602Su (Sud) (Southern) B62 Le-Mbaama 158 Y Y JMH 1 Ambinda 1985 -0.4500 14.1000 B62 NE of Okondja B63 Ndumu 158 Y Y JMH 1 Region of Franceville 1985 -1.6333 13.5833 B63


B700 A-Tsitse-ge 157 Y Y JMH 1 Mboa (Misiami) 1985 -2.1333 13.7333 B700 (Southeast) B71a Teke 157 Y Y JMH 1 Djoko 1985 -1.2624 14.4252 B71aDj Djoko (Djoko) 40 km NE of Leconi B71a Teke 95 Y Y JMH Brazzaville (Congo) 1985 ? ? B71aIb Ibali (Ibali)

B71a Teke 84 Y Y JMH 1 Leconi 1985 -1.5833 14.2333 B71aLe Lekoni (Lekoni) B71a Teke 152 Y Y JMH 1 Ossele-Kessala 1985 1.1333 13.8833 B71aOs Ossele (Ossele) beyond Onkoua, near Kessala

6.2.1.2.2 Mapping the ALGAB

While exact coordinates of many collection sites were provided, other locations were only described. Gazetteers were used to verify and augment the list as much as possible. A few locations were calculated from fairly detailed descriptions such as “75km north of Z” or “between X and Y”, where X and Y were fairly close. Other location names or descriptions refer only approximately to a collection site, or have a name that refers to one of several sites in gazetteer data, usually related ones. Because of this, a number of locations are not exactly mapped, namely B11a, B11d, B22b, B20x, B31, B32, B304, B42, B252, B305, B602, B71a (Ossele), B71a (Ibali) and B71a (Djoko) (Fig. 6.8). The vagueness in the reference of place names is not the only problem in locating the provenance of linguistic varieties: respondents were not always sure where their group was normally located, inter alia because the members had moved a good deal, and because several varieties are scattered rather widely.

In a previous study concerning the Bantu languages spoken in Gabon, the varieties were mapped and the corresponding language areas were attributed according to a neighbourhood corresponding to a Voronoi tessellation (Alewijnse et al. 2007). This basic cartography gave unsatisfactory results because the pattern was chaotic. The reason is the presence of large uninhabited areas within Gabon, particularly in the northeastern and central areas, where the forest is very dense (see Figs. 6.1 and 6.9).

To establish a coherent cartography of the languages that would solve the issues listed above, several sources of data were compared and cross-checked in order to see whether the coordinates of each variety were compatible with the available knowledge about the geographical distribution of the languages. A careful verification was conducted. First, the uninhabited zones were drawn according to satellite thermographic images published by the Central Intelligence Agency (U.S.A.) identifying all the traces of residential presence in Gabon. From them, uninhabited areas have been safely inferred and recorded on a map.31 Then, similarly to what was done for the linguistic map of Tanzania (Fig. 6.7), consensus linguistic areas were plotted on the map after checking their location according to two sources: the Ethnologue (Grimes 2000, Simons 2016) and Maho (2009). Generally the areas overlapped. When the Ethnologue does not report given languages, their mapping follows Maho (2009).32 It is worth noting that the Ethnologue is sometimes regarded in the linguistic community as defective, but the mapping process described above suggests that the language locations it provides are accurate. To conclude, the mapping used in Alewijnse et al. (2007) was generally correct, but we base the present analysis on the consensus map reported in Fig. 6.9, which represents a significant improvement.

Figure 6.8  Location of the 53 varieties reported in the ALGAB (hollow squares; when a diagonal appears inside them the position is approximated) and of the 64 varieties of Bastin et al. 1999 (solid circles), that is according to the collection of the Royal Museum of Central Africa, MRAC, Tervuren (Belgium). See Tables 6.2 and 6.3 for details.

31 When corridors are plotted on the map, they correspond to roads with settlements on either side.
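The Voronoi-style attribution used in the earlier study amounts to assigning every point of the map to its nearest sampling site. The sketch below illustrates that rule only. The great-circle (haversine) distance, the function names and the choice of three sites are assumptions for this example; the coordinates are the Lat / Long values listed in Table 6.2. Nothing here models the uninhabited areas, which is precisely the limitation that the consensus map addresses.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(p: tuple, q: tuple) -> float:
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(radians, p)
    lat2, lon2 = map(radians, q)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest_site(point: tuple, sites: dict) -> str:
    """Label of the sampling site whose Voronoi cell contains the point."""
    return min(sites, key=lambda label: haversine_km(point, sites[label]))


# Three sampling sites with the coordinates given in Table 6.2.
sites = {
    "B25": (0.5667, 12.8667),    # Makokou
    "B11c": (-0.7000, 10.2167),  # Lambarené
    "B43": (-2.8500, 11.0333),   # Tchibanga
}

# An arbitrary query point chosen for the example.
print(nearest_site((0.0, 12.0), sites))
```

With only a few sites, each cell covers a large territory regardless of whether anyone lives there, which helps explain why the resulting pattern was judged chaotic.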

Figure 6.9  Consensus map of the linguistic varieties spoken in Gabon (labels accord‐ ing to Maho 2009). See section 6.2.1.2.2 for details about the sources and the mapping.

6.2.1.3 Database 3: Bastin et al. (1999)

From the Musée Royal de l’Afrique centrale of Tervuren (MRAC), Lolke van der Veen (University of Lyon) received a PDF file (“BastinLexico_Gabon.pdf”) containing lexical data (core vocabulary) for all the varieties concerning Gabon and for some close varieties spoken in nearby countries (Cameroon, Congo-Brazzaville). All of them had been analyzed by Bastin et al. (1999). Data were collected over a time span of some 15 to 20 years using the short Swadesh 100-word list of basic concepts (reduced to 92 entries – see note #26). As most of the entries overlap with the ALGAB,33 Van der Veen was interested in comparing these two independently collected datasets. Soraya Mokrani (Laboratoire Dynamique du Langage, Lyon) manually re-entered all the data because incompatibilities prevented it from being imported automatically.34

As the linguistic material from the MRAC was generally collected in situ (i.e. tape or cassette recorded) and later transcribed in Tervuren, it is far from perfect. Re-entering the data provided an occasion to correct it as much as possible. In fact, the sources are heterogeneous (many transcribers, different transcription principles, missing items, contradictions, obscure transcriptions). In some cases data were taken from publications, meaning that they correspond to mere roots. The harmonization process led to the elimination of obvious inconsistencies, to the replacement of vowels by semivowels (when this was justified) and to the reconstruction of the word when only roots were given (relying on gender information that is often provided for nouns, or relying on available knowledge about the language).

Soraya Mokrani divided the Excel file into a series of sheets, each corresponding to a language. When different varieties of the same language were available, the corresponding word lists are reported in different columns. Two columns have been used per variety, one for the singular forms and one for the plural forms (plural forms are not systematically available). Varieties labelled according to the specific nomenclature in use at the MRAC have been renamed according to Maho (2003) and, whenever possible, using the 3-letter code of the Ethnologue (Simons 2016) (otherwise replaced by ‘???’). See APPENDIX 3, Table 6.3 and Fig. 6.8. When multiple word lists correspond to the same variety they are labelled using Maho’s reference followed by the name of the locality where the language is/was spoken. Geographical coordinates for the varieties were retrieved from the introductory part of Bastin et al. 1999. As tones have not been transcribed systematically in the MRAC lists and are sometimes phonemic and sometimes phonetic, they have been left out.35

32 Maho adopts Guthrie codes for the languages, the Ethnologue does not. To combine the two maps the names reported by the latter source have been matched with Maho according to the alternate names provided.

33 There are a few exceptions: #13 (‘burn’ intr.) sometimes corresponds to the ALGAB entry and sometimes does not (in several cases the transitive verb was obtained); #44 ‘lie down’ (absent from ALGAB); #66 ‘round (as a ball)’ (absent from ALGAB); see note 26.
34 Some varieties were excluded; they mainly concern the TEKE language cluster. Some do not concern the area under investigation or are doubtful. Varieties with too many missing items were also excluded.
35 Further details concerning phonetic transcription: [y] = palatal approximant (IPA: [j]); [ɟ] (= horizontally barred j) = palatal plosive or palatal affricate.

Table 6.3  Dataset of 64 varieties spoken in Gabon and surrounding countries, from Bastin et al. (1999). Label (Maho / MRAC) is the updated Guthrie code according to the nomenclature of Maho (2009) and, between square brackets, according to the nomenclature used by the MRAC (Musée Royal de l’Afrique Centrale, Tervuren); please refer to Maho for the full name of the variety; Location is the name of the country where the variety was collected, generally followed by a geographical annotation; ‘Transcribers / Informants / Year’ corresponds to the source of the data; Long and Lat correspond to the geographical coordinates of the place where the variety was recorded; Items corresponds to the number of lexical items included in the word list. See Fig. 6.8 for the mapping and APPENDIX 3 for the database.

Label (Maho / MRAC) | Location | Transcribers / Informants / Year | Long (E) | Lat (N/S) | Items
A34 Benga 1 | Equatorial Guinea, Corisco | Informant: G. Menz. Transcriber: J. Vansina | 9,7 | 0,5N | 92
A34 Benga 2 | Gabon, Estuary | Informant: Maman Jeanne. Transcriber: H.A Hazoumé | 9,4 | 0,4N | 89
A75 Fang 1 | Gabon, Adzan-Esamdom | Informant & transcriber: P. Ondo-Mebiame | 11,6 | 1,62N | 91
A75 Fang 2 | Gabon, Aloumendong | Informant & transcriber: Y. Nzang Bie | 10 | 0,5N | 89
A75 Fang 3 | Congo, Souanke | Informant: Souanke. Transcriber: J. Ndamba & CH. Lia | 14,1 | 2,1N | 90
A85b Mpo 1 [A86 MPo 1] | Gabon, name of location not mentioned | Informant: J.V. Mesa-Mvele. Transcriber: M. Dufeil | 14,15 | 1,35N | 82
A85b Mpo 2 [A86 Mpo 2] | Gabon, Mayibot 2 | Informant: G. Igindita & Mayibot. Transcriber: C. Marchal-Nasse | 14,1 | 1,3N | 91
A85b Mpo 3 [A86 Mpo 3] | Congo, Messok (Sangha region) | Informant: J. Aselam. Transcriber: G. Elounga | 14,2 | 1,8N | 92
B11a Mpongwe | Gabon, Libreville | Informant: E. Bebedi. Transcriber: C. Marchal-Nasse | 9,45 | 0,3N | 91
B11e Rungu-Nkomi Dialect [B11b Rungu-Nkomi Dialect Nkomi] | Gabon, Fernan-Vaz | Informant: E. Ogandaga. Transcriber: Cl. Grégoire 1989 | 9,22 | 1,3 | 91
B11c Galwa 1 | Gabon | Informant: M. Ossouacah Owanga. Transcriber: M. Dufeil | 9,5 | 0,9N | 91
B11c Galwa 2 | Gabon, Lambaréné | Informant: G. Mbezo. Transcriber: A. Coupez | 10,1 | 0,8N | 92
B203 Sama Mohongwe [B20 Sama] | Gabon, Makokou | Informant: Mouba. Transcriber: J. Vansina. Appears as Mahongwe Sama B251 | 12,85 | 0,55N | 92
B21 Seki | Gabon, Libreville | Informant: G. Kinkata. Transcriber: J. Vansina | 9,5 | 0,5S | 92
B22b Ngom | Gabon (Akele), Moyen-Ogooué | Informant: J.-Ch. Madouma. Transcriber: M. Dufeil | 10,5 | 0,5S | 92
B23 Mbangwe | Gabon, Masuku | Informant: B. Mbangalivoua Maka. Transcribers: E. Eyindanga & C. | 13,4 | 1,7S | 76
B24 Wumvu 2 | Congo | Transcriber: L.Y. Bouka 1989 | 12,1 | 2,7S | 89
B24 Wumvu 3 | Congo, Kissiele | Informant: F. Mouyikou. Transcriber: J. Ndamba & J. Baka 1989 | 12 | 2S | 90
B25 Kota 1 | Gabon, Madungwe, Etakangai | Informant: Kadima Mbola Zamba. Transcriber: P. Piron 1989 | 14 | 1,1N | 89
B25 Kota 2 | Gabon | Transcriber: P. Medjo Mvé & C. Marchal-Nasse 1988 | 14 | 1,15N | 89
B25 Kota 3 | Gabon, Mekambo | Informant: N. Mafomangoya & F.C. Moatekouba. Transcriber: C. Marchal-Nasse 1988 | 13,9 | 1,1N | 92
B25 Kota 4 | Congo, Mbomo | Informant: A.E. Mbelibadi. Transcriber: L. Polak 1982 | 12 | 0,7S | 91
B252 Sama Mahongwe [B26 Mahongwe] | Gabon | Transcriber: J. Vansina | 14,3 | 1N | 92
B251 Sake 1 [B27 Sake 1] | Gabon, Booué | Informant: J.M. Benga-Mboudza. Transcriber: C. Marchal-Nasse 1986? | 11,9 | 0,1S | 85
B251 Sake 2 [B27 Sake 2] | Gabon | Transcriber: C. Marchal-Nasse 1988 | 11,8 | 0,2S | 85
B201 Ndasa 1 [B28 Ndasa 1] | Congo, Lekoumou region | Informant: R. Moutsimba. Transcriber: A. Lipu 1989 | 13,2 | 3,3S | 90
B201 Ndasa 2 [B28 Ndasa 1] | Congo | Transcriber: L.Y. Bouka 1989 | 13,2 | 3,4S | 88
***B303 [B31] | Dialect spoken by the Pygmies Bongwe | Informant: Djita. Transcriber: Nzete | NO | NO | 88
B31 Tsogo | Gabon, Mimongo | Informant: J. de D. Moumbegna. Transcriber: C. Marchal-Nasse 1979 | 11,5 | 1,75S | 88
B304 Pinji [B33 Pinji] | Gabon, Ngounié | Transcriber: C. Marchal-Nasse 1988 | 11 | 1,8S | 91
B305 Pove [B34 Pove] | Gabon | Informant: J. de D. Moubegna. Transcriber: C. Marchal-Nasse 1986 | 12,2 | 1,2S | 90
B302 Himba [B36 Himba] | Gabon, Ngounié | Informant: L.M. Embiault. Transcriber: J.P. Rekanga 1989 | 11,54 | 1,48S | 86
B40 Bwali | Gabon, Ngounié | Informant: Marly-Mounguiba. Transcriber: C. Marchal-Nasse | 10,6 | 1,4S | 90
B41 Shira | Gabon, Dakartango | Informant: Moussanga. Transcriber: J. Vansina | 10,5 | 1,5S | 92
B42 Sangu 1 | Gabon, Mimongo | Informant: J.M. Mombo-Tsoungou. Transcriber: P. Ondo-Mebiame 1988 | 11,58 | 1,62S | 91
B42 Sangu 2 | Gabon, Koulamoutou | Informant: R. Nzambi. Transcriber: C. Marchal-Nasse | 12,5 | 1,2S | 91
B43 Punu 1 | Gabon, Mimongo | Informant: A.B. Boulinguy. Transcriber: F.M. Rodegem 1973 | 11 | 2,5S | 91
B43 Punu 2 | Congo, Niari | Transcribers: J. Ndamba & J. Baka 1989 | 12,1 | 2,7S | 91
B44 Lumbu 1 | Gabon, north of Mayoumba | Transcriber: C. Marchal-Nasse 1987? N.B. Confusion in list between 1 and 2 | 10 | 2,8S | 92
B44 Lumbu 2 | Congo, Niari, Banda | Informant: J.R. Mouanda. Transcriber: G. Elounga 1988. N.B. Confusion in list between 1 and 2 | 11,9 | 3,5S | 92
B44 Lumbu 3 | Congo, Nkola | Informant: J.P. Usangila. Transcriber: J. Ndamba 1989 | 12,7 | 4,1S | 92
B44 Lumbu 4 | Gabon, Libreville | Informant: M.-Th. Moanda. Transcribers: V. Koumba & C. Marchal-Nasse | 9,45 | 0,3S | 92
B401 Bwisi [B45 Bwisi] | Congo, Niari, Loubetsi | Transcriber: J. Ndamba & J. Baka 1989 | 13,1 | 3,25S | 91
B402 Varama 1 [B46 Varama 1] | Gabon, Ogooué-Maritime | Transcriber: C. Marchal-Nasse 1986 | 9,7 | 2,5S | 90
B402 Varama 2 [B46 Varama 2] | Gabon, Ogooué-Maritime | Informant: Ch. Mumbo. Transcriber: Ch. Mumbo & C. Grégoire 1990 | 9,8 | 2,6S | 91
B403 Vungu 1 [B47 Vungu 1] | Gabon, Ogooué-Lolo | Transcriber: C. Marchal-Nasse | 10,7 | 2,1S | 85
B403 Vungu 2 [B47 Vungu 2] | Gabon, Ngounié | Informant: Nziengui Mouckagny. Transcriber: M. Dufeil 1986 | 10,7 | 2,1S | 91
B51 Duma 1 | Gabon, Lastoursville | Informant: R. Kouyi-Kayi. Transcriber: V. R. Mickala 1990 | 12,6 | 0,8S | 87
B501 Duma 2 d wanzi [B51 Duma 2 d. Wanzi] | Gabon, Kessipoughou | Informant: P. Missambo. Transcriber: H. Tourneux 1980 | 12,8 | 0,9S | 90
B52 Nzebi 1 | Gabon, Lekindou | Informant: M. Chacha. Transcriber: C. Paulian | 11,9 | 1,9S | 91
B52 Nzebi 2 | Congo, Niari | Informant: G. Moulebe. Transcribers: J. Ndamba & J. Baka 1989 | 12,1 | 2,6S | 90
B53 Tsangi | Congo, Niari | Informant: P. Issangou. Transcriber: J. Ndamba 1989 | 13,7 | 2,9S | 90
B61 Mbere | Gabon | Informant: M.J. Lekouleoundjo. Transcriber: J.M. Rodegem 1973 | 14 | 0,4S | 91
B62 Mbamba 1 | Gabon, Franceville? | Informant: Teta Andjouomo. Transcriber: C. Marchal-Nasse 1986? | 13,9 | 1,2S | 92
B62 Mbamba 2 | Congo | Transcriber: L.Y. Bouka 1989? N.B. Confusion with B62 Mbamba 3 | 14,5 | 1,2S | 89
B63 Ndumo 1 | Gabon, Franceville region | Transcriber: C. Marchal-Nasse 1987? | 13,5 | 1,6S | 81
B63 Ndumo 2 [B63 Ndumo 2 d. Kuya] | Gabon, Epila/Ombele | Transcriber: C. Marchal-Nasse 1987? | 13,4 | 1,5S | 82
B602 Kaningi [B64 Kaningi] | Gabon, Haut-Ogooué, Masuku | Transcriber: C. Marchal-Nasse 1988 | 13,6 | 1,75S | 87
B71a Tee1 [B71 Tee 1] | Congo, Cuvette, Boundji | Informant: P. Mempabo. Transcriber: L. Polak 1990 | 15,3 | 1,1S | 89
B71a Tee2 [B71 Tee 2 Abala] | Congo, Plateaux, Abala | Informant: M. Ondele. Transcriber: C. Pereira & B. Nkounkou | 15,4 | 1,3S | 90
B71a Tee4 [B71 Tee 4] | Congo, Omvula | Informant: D. Ilanga. Transcriber: J. Vansina 1964 | 14,5 | 1,7S | 86
B73 Teke-W 1 [B71a Teke-W1] | Congo, Komono, Otsyene | Informant: Tsumaka. Transcriber: J. Vansina 1964 | 13,2 | 3,3S | 89
***B73a [B73 Teke-W 3 Kissiele] | Congo, Niari | Informant: A. Libele. Transcriber: J. Ndamba & J. Baka 1989 | 12,6 | 2,8S | 90

6.2.2 Computation of linguistic distances

Pronunciations recorded in the three datasets (Tanzania, ALGAB, Bastin et al. 1999; respectively APPENDIX 1, 2 and 3) were compared using the Levenshtein distance, which may be understood as the cost of the optimal set of operations needed to map one string to another. Heeringa (2004) provides an extensive introduction to the application of Levenshtein distance to the problem of measuring the distance between pronunciations provided in phonetic transcription (see also CHAPTERS 1 and 8). The phonetic model has discrete costs, meaning that identical tokens cost nothing, while vowel-vowel and consonant-consonant substitutions cost one unit, as do insertions and deletions. In general this version of the algorithm only allows substitutions respecting syllabicity, i.e. vowels for vowels and consonants for consonants. There are three exceptions to strict vowel-consonant borders: the semivowels [j] and [w] as well as the maximally high vowels [i] and [u] may match both vowels and consonants, and [] may match sonorant consonants. Consonant-vowel substitutions are much more expensive than the combination of a deletion and an insertion to the same effect, which enforces the syllabicity constraint and also causes the Levenshtein results to have slightly longer alignments that are usually more natural.

Diacritics are not considered by the present model, meaning that the ninety-five occurrences of syllabic markers (marking syllabic sonorants) and the forty occurrences of nasalization in the version of the ALGAB we processed are ignored.36 These counts are low enough with respect to the overall dataset that we are confident that the results were not greatly affected. Following the analysis of Heeringa et al. (2006), the approach used attempts to respect phonetic context by applying the phonetic model not to words represented as sequences of character unigrams, but rather to words represented as sequences of character bigrams, thereby including effects of (direct) phonetic context. The resulting comparison costs were not normalized by length, also following the findings of Heeringa et al. (2006). The result of the pairwise distance measures between all sites is a difference matrix containing linguistic distances between all pairs of sites. Synonyms and empty entries are processed as in Heeringa (2004).

Concerning the databases, missing values are ignored in the analyses: the distance between two sites is calculated from the pronunciations that are present at both, and the mean distance over all the words that are compared is then computed. This implies that some language distances are based on more comparisons than others and are therefore statistically more robust, but the amount of data is large enough that no comparison is unreliable. Some varieties record singular and plural forms for each gloss, while others have only a single form. This is not problematic in the calculation of the distances because the L04 dialectometric software handles this inequality by seeking optimal matches and using the mean of those. In the cases where one variety has one form and the other has two forms, the comparison is essentially the average of the two distances.
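The discrete-cost model described in this section can be illustrated with a short Python sketch. It is not the L04 implementation and rests on several assumptions: each character is treated as one segment, the vowel inventory is a small hypothetical set, and the cost of 3 for a vowel-consonant substitution is an arbitrary value chosen only to exceed a deletion plus an insertion. The bigram representation, the handling of synonyms and the third exception to the syllabicity constraint are left out, and the word forms are invented.

```python
from typing import Dict, Optional

# Assumption: a small, hypothetical vowel inventory; any other symbol is a consonant.
VOWELS = set("aeiouɛɔ")
# Segments allowed to match both vowels and consonants (semivowels and high vowels).
FLEXIBLE = set("jwiu")
# Assumption: any value above 2 makes deletion + insertion cheaper than a V/C substitution.
MISMATCH_COST = 3


def substitution_cost(a: str, b: str) -> int:
    """Discrete cost: 0 if identical, 1 within a class, MISMATCH_COST across classes."""
    if a == b:
        return 0
    if a in FLEXIBLE or b in FLEXIBLE:
        return 1
    same_class = (a in VOWELS) == (b in VOWELS)
    return 1 if same_class else MISMATCH_COST


def levenshtein(x: str, y: str) -> int:
    """Edit distance with unit insertions and deletions, not normalized by length."""
    previous = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        current = [i]
        for j, b in enumerate(y, start=1):
            current.append(min(
                previous[j] + 1,                             # delete a
                current[j - 1] + 1,                          # insert b
                previous[j - 1] + substitution_cost(a, b),   # substitute a by b
            ))
        previous = current
    return previous[-1]


def site_distance(site_a: Dict[str, Optional[str]],
                  site_b: Dict[str, Optional[str]]) -> float:
    """Mean word distance over the concepts attested at both sites."""
    distances = [
        levenshtein(site_a[c], site_b[c])
        for c in site_a
        if site_a.get(c) and site_b.get(c)
    ]
    if not distances:
        raise ValueError("no concept is attested at both sites")
    return sum(distances) / len(distances)


# Invented forms, for illustration only.
site_1 = {"water": "madi", "fire": None, "eye": "diso"}
site_2 = {"water": "mazi", "fire": "moto", "eye": "liso"}
print(levenshtein("at", "ta"))        # deletion + insertion is cheaper than V/C substitutions
print(site_distance(site_1, site_2))  # "fire" is skipped because it is missing at site_1
```

Keeping the substitution cost in its own function means a graded segment-distance table could later replace the discrete costs without touching the alignment code.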

6.2.3 Population genetic sampling of the Gabon population and markers used

The sampling of genetic material (saliva and blood to allow the extraction of DNA) took place in 2005, 2006 and 2007 (together with some linguistic fieldwork concerning languages of the groups A34, B21, B20, B30 + Ngom, Koya, Ndasa and Ndambono) in medical centres in Libreville, Booué, Cap Esterias, Fougamou, Franceville, Lambarené, la Lopé, Lastourville, Malibé, Minvoul, Mouila, Port-Gentil and Sindara (Fig. 6.1 reports some). After a short training period, teams composed of one linguist and one or two anthropologists (either professionals or students) started a simultaneous collection in several locations. Each team was composed of collaborators familiar with the populations to be sampled. The sampled individuals were generally males37 above 35 years of age, having both parents belonging to the same ethnic group. When this was not easy to find, a parent belonging to a closely related ethnic group was accepted. Before each sampling, a detailed ethnological questionnaire, including genealogical information, was filled in by the volunteer. In total 960 individuals were sampled (DNA samples are currently stored at the Institut Pasteur, Paris, Laboratory of Dr. Lluis Quintana-Murci).

As the project was initially intended to embrace three countries, each with a different history in the frame of the Bantu expansion (Angola, Gabon, Tanzania), only a limited number of samples was to be collected in Gabon. After the withdrawal of the other scientific partners, the research was refocused on Gabon only, and the number of sampled populations was increased to 21. The 21 ethnicities, out of a total of about 50 existing in Gabon, cover the genetic diversity of this country quite well. There are 17 populations reported in this study because one non-Bantu speaking Pygmy group (Baka) has been discarded from this analysis together with three populations for which too few volunteers were found (Bekwil, Mbangwe, Okande).38 The ethnolinguistic database includes all the sampled individuals and reports information like the names of clans and lineages for the ancestors of each DNA donor. See Table 6.4 for details about the sampling.

36 This kind of statistic has not been computed for the dataset reported in Table 6.3.
37 Males carry both the Y-chromosome and the mitochondrial DNA, though they do not transmit the latter to their offspring (see section 6.2.3.1).

6.2.3.1 Genetic markers used

6.2.3.1.1 Mitochondrial DNA

Human mitochondrial DNA is inherited from the mother. Human sperm cells contained in the seminal fluid carry about 50-100 mitochondria each;39 these form a sheath around the flagellar axoneme, a structure within the sperm tail, and provide chemical energy for the tail to move and the sperm to “swim”. Sperm cells cannot divide and have a limited life span, but after fusion with an egg cell during fertilization, a new organism begins developing. Their contribution to this new organism is limited to the paternal DNA they carry in a very compact structure referred to as the “head”. Only the head enters the egg; the tail, and the mitochondria it contains, remain outside the egg. This is why the mitochondria within a fertilized egg are only those of the mother.

38 In population genetics investigations volunteers have to be as unrelated as possible to enable a good depiction of the genetic diversity of their group. When the communities are too small, it can be hard to identify DNA donors that do not share a large number of ancestors.
39 Mitochondria are considered to have originated from proteobacteria through endosymbiosis, becoming adapted to live inside cells, thus benefiting from a protected (endocellular) environment and providing chemical energy to the cell as they can oxidize sugar. Without mitochondria a cell would produce energy only by fermentation, a process that is much less efficient. Inside human cells there are many mitochondria, from a few to several thousand.

Table 6.4  Population genetics sampling of Gabon
The sampling corresponds to Berniell-Lee et al. (2007) and Quintana-Murci et al. (2008). Population Name / Size gives the name of each ethnic group sampled for genetic testing; population sizes are the estimates of Idiata (1997). mtDNA sample, Y-chr. sample and Autos. sample are the numbers of individuals tested for the mitochondrial DNA, the non-recombinant portion of the Y-chromosome and several autosomal regions. Location in Gabon / Geographic coordinates [N/E] corresponds to the barycenter of the area typically inhabited by each ethnic group according to Lolke Van der Veen (University of Lyon). Language Guthrie/Maho ref. indicates the Guthrie zone and the code, according to Maho (2009), of the language predominantly spoken by each ethnic group.

Population name | Size | mtDNA sample | Y-chr. sample | Autos. sample | Location in Gabon | Geographic coordinates [N/E] | Language Guthrie (Maho ref.)
Akele | 1000-3000 | 48 | 50 | 13 | W | -0.7000, 10.2167 | B20 (B22a)
Ateke | 30,000 | 54 | 48 | 30 | SE | -0.8167, 12.7000 | B70 (B71a)
Benga | 1500 | 50 | 48 | - | NW | +0.5835, 9.33349 | A30 (A34)
Duma | 10,000 | 47 | 46 | - | E | -0.8167, 12.7000 | B50 (B51)
Eshira Gisir | 30-40,000 | 40 | 42 | - | W | -1.2167, 10.6000 | B40 (B41)
Eviya | 50 | 38 | 24 | 28 | Centre | -1.2123, 10.5982 | B30 (B301)
Fang | 400,000 | 66 | 60 | 30 | N | +2.0800, 11.4800 | A70 (A75)
Galwa | 10,000 | 51 | 47 | - | W | -0.7000, 10.2167 | B10 (B11c)
Kota | 25,000 | 56 | 53 | 30 | E | +0.5667, 12.8667 | B20 (B25)
Makina (chiwa) | 1000-3000 | 45 | 43 | 28 | Centre | -0.1000, 11.9333 | A80
Mitsogo | 13,000 | 64 | 60 | 30 | Centre | -1.0333, 10.6667 | B30 (B31)
Ndumu | 3000 | 39 | 36 | 26 | SE | +1.6333, 13.5833 | B60 (B63)
Nzebi | 50,000 | 63 | 57 | 30 | SE | -1.5667, 13.2000 | B50 (B52)
Obamba | 50,000 | 47 | 46 | 29 | SE | -0.6833, 13.7833 | B60 (B62)
Orungu | 10,000 | 20 | 21 | 21 | W | -0.7167, 8.78330 | B10 (B11b)
Punu | 150,000 | 52 | 58 | 28 | SW | -1.8667, 11.0167 | B40 (B43)
Shake | 8,000 | 51 | 43 | - | E | -0.8167, 12.7000 | B20 (B251)

Mitochondria have their own DNA, different from that of the cell they live in: a single chromosome, short and circular like that of bacteria. They are capable of duplicating within the cell that hosts them. Mitochondrial DNA can change over replication cycles because of a certain instability of the DNA itself (which can get damaged), or because the mitochondrial chromosome is not copied accurately during replication.

Mitochondrial DNA has a length of 16,569 base-pairs40 coding for 37 genes, but some parts of the sequence do not code, meaning that they do not have a documented function.41 Non-coding regions have been targeted by geneticists as reliable markers of DNA evolution because they are much more variable than coding regions; in fact mutations can accumulate in them without necessarily compromising any metabolic function (they are called Hypervariable Segments, HVS I and II). This is different from the coding regions (those that give the instructions for assembling proteins), where mutations are generally deleterious because the resulting protein is likely to be non-functional, thus possibly causing metabolic drawbacks.

Cann et al. (1986) (see also Vigilant et al. 1991) demonstrated that the HVS I region carries many mutations that make it possible to reconstruct a maternal genealogy of human populations, a genealogy that makes sense in geographical terms. For example, from this mitochondrial region they reconstructed the history of human migrations out of Africa and through the different continents. Very ancient mutations are shared by many groups, recent mutations by fewer, much as in the transmission history of medieval manuscripts, where copies of books can be dated according to the spelling errors that copyists introduced over time. This is why Hypervariable Regions (HVRs) have become classical markers in population genetics, even for populations that are genealogically close, because mitochondrial DNA varies quite rapidly, giving rise to many lineages. As in glottochronology, a “molecular clock” has been advocated in order to estimate when mutations occurred. And, as in glottochronology, the clock is biased and not universal.

Concerning the Gabon populations considered in this study, the first hypervariable segment (HVS-I) of the control region was sequenced in all samples, and variable positions were determined from position 16,024 to 16,383. The complete mtDNA sequences have been submitted to GenBank (accession numbers EU273476–EU273502).42 Ten Single Nucleotide Polymorphisms (SNPs)43 were initially genotyped in all samples, to identify the major haplogroups to which each mitochondrial sequence belonged. Haplogroups are groups of DNA sequences carrying a given set of mutations. The analyses reported in this study are based on the pairwise comparison of the frequency of haplogroups for pairs of populations in order to estimate FST distance measures (see section 6.3.2 Genetics). The higher the number of haplogroups that are shared by two populations in similar frequencies, the higher their genetic similarity.

40 Each base-pair is also called ‘position’.
41 They might have a regulatory function.
42 https://www.ncbi.nlm.nih.gov/genbank/
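The pairwise comparison of haplogroup frequencies described in section 6.2.3.1.1 can be sketched as follows. The estimator actually used is not spelled out at this point in the text, so the code is only an assumption-laden illustration: it uses a simplified Nei-style statistic, FST = (HT - HS) / HT, with equal weight for both populations and no sample-size correction, and the haplogroup labels (hgA, hgB) are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def frequencies(haplogroups: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each haplogroup label in one population sample."""
    counts = Counter(haplogroups)
    total = sum(counts.values())
    return {h: n / total for h, n in counts.items()}


def gene_diversity(freqs: Dict[str, float]) -> float:
    """Probability that two randomly drawn sequences carry different haplogroups."""
    return 1.0 - sum(p * p for p in freqs.values())


def pairwise_fst(pop_a: Iterable[str], pop_b: Iterable[str]) -> float:
    """Simplified FST = (H_total - H_within) / H_total for two populations."""
    fa, fb = frequencies(pop_a), frequencies(pop_b)
    pooled = {h: (fa.get(h, 0.0) + fb.get(h, 0.0)) / 2 for h in set(fa) | set(fb)}
    h_total = gene_diversity(pooled)
    if h_total == 0.0:
        return 0.0  # both samples are fixed for the same haplogroup
    h_within = (gene_diversity(fa) + gene_diversity(fb)) / 2
    return (h_total - h_within) / h_total


# Hypothetical samples: the two populations share haplogroups in different proportions.
pop_1 = ["hgA", "hgA", "hgA", "hgB"]
pop_2 = ["hgA", "hgB", "hgB", "hgB"]
print(pairwise_fst(pop_1, pop_2))
```

Populations with identical frequencies give a value of zero and populations sharing no haplogroup give the maximum, which matches the verbal description of genetic similarity given above.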

6.2.3.1.2 Y-chromosome

If the mitochondrial DNA makes it possible to follow the genetic history of the maternal line (‘back to Eve’), a part of the Y-chromosome makes possible a similar kind of inference for the paternal genealogy of human populations (‘back to Adam’). The Y-chromosome is one of the two sex chromosomes in mammals (X and Y). It is carried only by males. Unlike the mitochondrial DNA, it does not evolve only by mutation (damage and copy error, together with other mechanisms), but also by recombination. Recombination consists in a physical exchange of DNA fragments, so that a new DNA filament is obtained whose combination of bases is novel, being a mosaic of previous combinations. Nevertheless, a large part of the Y-chromosome does not undergo recombination; it is called the non-recombining section of the Y-chromosome (NRY) and can be considered to be transmitted unchanged along the male line. As described above for the mitochondrial DNA, it is possible to establish a genealogy of this chromosome based on regions that do not code and which are, therefore, fast-mutating. Haplogroups can be defined according to particular sets of mutations. For short-range molecular phylogenetics, the Y-chromosome is highly effective because it is one of the fastest-evolving parts of the human genome.

All the individuals in this study were typed for 35 Y single nucleotide polymorphisms (SNPs) as in Berniell-Lee et al. (2007). In order to refine the phylogenetic resolution in some branches of the Y-chromosome phylogeny, some individuals were further typed for six single nucleotide polymorphisms. Eighteen highly informative Y short tandem repeats (STRs) were also adopted. Full details are reported in the reference provided.

6.2.3.1.3 Autosomal markers

Concerning autosomal markers, located on chromosomes other than the sex chromosomes, 28 tetranucleotide microsatellites have been genotyped as in Verdu et al. (2009). Please refer to this publication for their names. Unlike the Y-chromosome and the mitochondrial DNA, these markers are informative about all the lineages of an individual, allowing a more general assessment of the genetic relatedness of populations. For reasons explained in the following section (6.2.3.2), their higher effective population size makes them less efficient at describing recent genetic change, because they are less likely to drift. The samples from the Akele, Fang, Kota, Nzébi, Teke and Tsogho populations were typed and published in Verdu et al. (2009); the other samples appear in this work for the first time.

43 A single nucleotide polymorphism often consists in the variation of a single nucleotide that occurs at a specific position in the genome.

6.2.3.2 Uniparentally transmitted markers capture the recent history of populations better

If the mitochondrial DNA and the non-recombinant portion of the Y-chromosome allow one to follow maternal and paternal genetic inheritance respectively, there is another reason why these two genetic markers (as they are called in the genetic jargon) are of special interest: their effective population size is expected to be one-quarter of the effective population size for nuclear genes. Let us first focus on the Y-chromosome.

In diploids each reproduction involves 4 copies of every nuclear gene (two per parent for each autosome44), while only the male carries a single Y. Under simple neutral models with constant and equal male and female population sizes, the genetic diversity is expected to be proportional to the relative number of each chromosome in the population: X-chromosome diversity is expected to be three-quarters of autosomal diversity (because a breeding pair carries three X-chromosomes for four copies of each autosome), and Y-chromosome diversity is expected to be one-quarter of autosomal diversity (Caballero 1995). In fact, a human population of 2000 individuals with a 1:1 sex ratio (1000 males and 1000 females) corresponds to a chromosomal population of 1000 Y-chromosomes (and 3000 X-chromosomes) for 4000 copies of each of the 22 non-sex chromosomes: that is a ¼ ratio.
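The chromosome counts behind the ¼ ratio can be checked with a few lines of Python. This is only a worked illustration: the function name and the figure of 1000 males and 1000 females are chosen for the example, and the mitochondrial genome is counted as one transmissible lineage per female, which is a simplifying assumption.

```python
from fractions import Fraction
from typing import Dict


def gene_copies(males: int, females: int) -> Dict[str, int]:
    """Number of transmissible copies of each kind of chromosome in a population."""
    return {
        "autosome": 2 * (males + females),  # two copies per individual
        "X": males + 2 * females,           # one per male, two per female
        "Y": males,                         # one per male
        "mtDNA": females,                   # assumption: one maternal lineage per female
    }


copies = gene_copies(males=1000, females=1000)
ratios = {name: Fraction(n, copies["autosome"]) for name, n in copies.items()}
for name in copies:
    print(name, copies[name], ratios[name])
```

With equal numbers of males and females the Y-chromosome and the mitochondrial DNA each come out at one quarter of the autosomal count, and the X-chromosome at three quarters.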

44 In humans, the genome is composed of 22 pairs of non-sex chromosomes plus one pair of sex chromosomes (XY for a male; XX for a female). Non-sex chromosomes are called autosomes. All autosomes undergo recombination, which consists in the exchange of homologous fragments. Two topologically corresponding parts of a pair of homologous chromosomes (say chromosome 4 from the father and chromosome 4 from the mother) will have a very similar DNA sequence, because the order of the different genes (sensu lato) in the DNA will be identical. At meiosis the pairing of homologous autosomes takes place and a kind of “chemical attraction” is established, leading to the exchange of fragments, that is, giving rise to a new sequence that is different from the father’s and from the mother’s. Pairing also occurs between the X-chromosome and the Y-chromosome, because a portion of the latter has homology with the X, but only to a small extent.

Mitochondrial DNA is similar because it is transmitted only by the mother, symmetrically to the Y-chromosome in the opposite sex. In other words, i) the non-recombinant portion of the Y-chromosome and ii) the mitochondrial DNA correspond to a haploid system, meaning that every mutation appearing in them can be immediately transmitted to the next generation. This is not the case with nuclear genes, because there are more copies of them (see paragraph above) and the mutated version of a gene is not necessarily inherited by the offspring. Even if the ¼ ratio is only theoretical,45 a ratio much lower than 1 consistently applies to the amount of genetic diversity portrayed by both mitochondrial and Y-chromosomal DNA, thus explaining why their genetic drift is higher.46
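Why fewer gene copies imply stronger drift can be illustrated with a standard textbook result that is not taken from this chapter: in an idealized neutral Wright-Fisher population, expected gene diversity shrinks by a factor (1 - 1/N) per generation, where N is the number of gene copies. The sketch below applies it to illustrative values (4000 autosomal copies against 1000 Y-chromosomal or mitochondrial copies, a starting diversity of 0.8 and 200 generations, all assumptions); mutation and migration are ignored.

```python
def expected_diversity(h0: float, copies: int, generations: int) -> float:
    """Expected gene diversity after pure drift among `copies` gene copies."""
    return h0 * (1.0 - 1.0 / copies) ** generations


# Illustrative values only: same population, different numbers of gene copies.
for label, copies in (("autosome", 4000), ("Y-chromosome or mtDNA", 1000)):
    print(label, round(expected_diversity(0.8, copies, 200), 3))
```

Under these assumptions the marker with one quarter of the copies loses diversity faster, which is the sense in which uniparental markers are more sensitive to recent demographic events.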

45 Mitochondrial effective population sizes do not take into account the differences in selective interference between the mitochondrial genome and the nuclear one, nor, concerning both mitochondria and the Y-chromosome, the variance in reproductive success among sexes (e.g. if fewer males and females breed).
46 Genetic drift is the phenomenon corresponding to the change in frequency of a gene variant in a population due to the randomness of survival and reproduction rates. When a genetic variant is represented by few copies of DNA carrying the information, this effect is larger, similarly to small populations where the change of a linguistic variety can be faster than in a larger one, as the “norm” is shared by a smaller number of speakers who can depart from it more easily.

6.3 RESULTS

There are several kinds of results to present in this section: 1) those about the first experiment of classification of Bantu languages, on a Tanzanian dataset, to test the performance of the Levenshtein algorithm on this particular phonological system; 2) the classifications of the Bantu linguistic varieties spoken in Gabon according to two different datasets, that is 53 varieties documented in the Linguistic Atlas of Gabon (ALGAB) and 64 varieties of the same region according to Bastin et al. (1999); and, finally, 3) the outcomes of the population genetics analysis of 17 ethnic groups from the same country. The first set of results (the Tanzanian dataset) will be discussed here because we would like to leave the discussion section (6.4) to Gabon questions only. Other results that could have been reported in this section concern the cross-comparison of the classifications based on the ALGAB and on Bastin et al. (1999), but we prefer to address them in the discussion section.

6.3.1 Linguistics

6.3.1.1 The Tanzanian experiment

A first dendrogram showing the Levenshtein classification of the 32 Bantu languages spoken in Tanzania (Nurse and Philippson 1980), a dataset consisting of wordlists of 1052 items, was computed in 2003. At the time our computational approach was quite novel and required validation by an expert. This is why we submitted the dendrogram to Professor Gérard Philippson (University of Lyon) for assessment. He let us know that the tree correctly mirrored existing knowledge, meaning that the Levenshtein methodology, previously applied only to European dialects, could be used to analyze computationally the diversity of Bantu languages from Gabon, which we later did. At the time, bootstrap resampling techniques had not yet been added to the L04 software (see section 6.2.2 Computation of linguistic distances) and no other computational classification of the Tanzanian varieties was available, besides a lexicostatistical study (Nurse and Philippson 1980). Very recently, an independent classification of 424 Bantu languages, including the same Tanzanian varieties we had processed, has become available (Grollemund et al. 2015). This study is based on a Bayesian classification of language varieties coded as a multistate matrix of cognacy based on a 92-concept word list (the more than one thousand concepts of the Tanzanian data were reduced to 92 items to be homogeneous with the other word lists of the 424-language collection).

The availability of this new study prompted us to compare our earlier classifi‐ cation with it and, to do so, we recomputed all the distances according to the most recent Levenshtein software of the University of Groningen and by applying a boot‐ strap test of robustness to the new tree. The results of the Levenshtein classification and their comparison with the work of Grollemund et al. (2015) are presented below.

6.3.1.1.1 The Levenshtein classification of 32 Bantu languages from Tanzania

Even though the authors of the Tanzanian linguistic database recently mentioned issues about the correctness of some entries (Derek Nurse, 2016, personal communication), we reprocessed the same list of concepts (1052 from a full list of 1079 we first analysed in 2003) because D. Nurse could not provide improved wordlists. The bootstrap consensus tree accounting for the diversity of the 32 languages is reported in Fig. 6.10.

Figure 6.10  Classification of 32 Tanzanian languages according to a UPGMA boot‐ strap consensus tree. The tree reports only the nodes having a bootstrap score higher than 90. 1052 concepts; gradual segmental distances. Synonyms and empty entries are taken into account as in Heeringa (2004). See Table 6.1 for details about the languages.

Two major clusters appear in the tree, one including the languages classified as E50-E60, and another including all remaining varieties. This second cluster can be roughly dissected into two main subgroups: N10-P10-G50 and F20-M10-M20-M30; Safwa stands alone. More exactly, three languages should be added to the N10-P10-G50 cluster: Dawida, Mwera and Seuta. Seuta will not be further addressed because it is an artificial assemblage of varieties (G20 + G30) that Gérard Philippson designed for linguistic purposes that go beyond the scope of this chapter. Almost all the nodes of the bootstrap tree are supported by scores above the very high threshold of 90, meaning that the classification is extremely robust, also at the level of the many sub-clusters visible in Fig. 6.10.

As already stated, the focus of this study is on the languages from Gabon, and Tanzania was just a test database. This is why the results presented here will point to the weaknesses of the “Tanzanian” classification, seen as possible clues to some systematic bias likely to be encountered with the dataset concerning Gabon. The major discrepancy between our classification and Nurse and Philippson (1980) concerns Dawida because, according to their analysis, this language should be placed on the opposite side of the tree, that is with the varieties E50 and E60.

The tree of Fig. 6.10 looks very similar to the corresponding multidimensional scaling (MDS) plot that we computed in two (Fig. 6.11) and three dimensions (not shown). The third dimension provides evidence for the extreme position occupied by Safwa and Meru, thus confirming their outlier position in the tree. In the plot, the position of Dawida remains closer to the languages with which it is clustered in the tree, even if it maps closer to the varieties E50/E60 than the simple inspection of the UPGMA tree would suggest. Therefore, the inconsistency with the “expected” classification of Dawida remains.

To be able to directly compare our tree with the one published by Grollemund et al. (2015) we adapted our classification by: 1) reducing our 1052 concepts to the 92 processed by Grollemund; 2) redrawing the tree published by Grollemund according to the detailed one provided in the Supplementary Information of their article; 3) adopting the same cut-off for the bootstrap scores.47

Before directly comparing our tree with Grollemund’s, it is interesting to see what happens to the Levenshtein classification when only ~10% of the original data are kept (1052 → 85 items48) (Fig. 6.12). Interestingly, the 85-word list yields the same clustering as the 1052-word list of concepts, and the relations among the languages are portrayed in a similar way. Nevertheless some differences appear: 1) bootstrap scores support a lower number of clusters with the shorter list (as expected); 2) with 85 concepts the language Safwa clusters with the group F10-M10-M20-M30, while with 1052 concepts it does not; and 3) the language Meru is an outlier when the longer wordlist is processed, but clusters with the varieties E50-E60 when the shorter wordlist is processed. Safwa and Meru, however, have a peculiar position in the 3D MDS (not shown) that fits these classification swaps. Interestingly, Dawida is still represented in the same non-consensual position.

47 Grollemund et al. (2015) test the robustness of their tree according to the Jackknife method, while we use bootstrap. We consider this difference minor; in fact the two methodologies generally yield results that are largely similar or identical.
48 The original wordlists used by Grollemund are not included in their article but do appear on the website of the last author, http://www.evolution.rdg.ac.uk/DataSets.html. Our reduced list of concepts actually contains 85 words because, for 7 of them, it was not possible to identify a correspondence between the original list of 1074 words of Nurse and Philippson (1980) and the concepts of Grollemund et al. (2015). With respect to the latter list they are: #7 ‘Big’, not reported; #29 ‘Fat / Oil’, ambiguous; #30 ‘Feather’, ambiguous; #45 ‘Horn’, ambiguity with ‘Ivory’; #58 ‘Man’, not reported; #66 ‘One’, not reported; #75 ‘Shame’, ambiguous.

Figure 6.11  Comparison of the Levenshtein tree classification of Fig. 6.10 with the corresponding two‐dimensional Multidimensional Scaling plot. Stress values: in 1 dimension = 0.1588, in 2 dim. = 0.0966 (plot reported), in 3 dim. = 0.0688.

As mentioned, to compare our tree with that of Grollemund et al. (2015), we manually redrew the tree given in the Supplementary Information of their article and kept the section listing the 32 Tanzanian varieties under examination here (in solid black on the right side of Fig. 6.13).

To avoid comparing unstable clusters, we decided to focus on the most robust nodes, those supported by a bootstrap score of at least 90 (see note #49). All the nodes of the tree of Grollemund et al. (2015) (Fig. 6.13) supported by fewer than 90% of the subreplicates have been collapsed (Fig. 6.14). A simplified tree, taking into account only the 32 Bantu languages of Tanzania that we have also processed, has been compared to our Levenshtein classification (Fig. 6.15).

Figure 6.12  Comparison of the UPGMA Levenshtein tree classifications of 32 Tanzanian languages using 1052 concepts and a reduced list of 85, almost the same as in Grollemund et al. 2015 (see note #48). Gradual segmental distances. The trees report only the nodes having a bootstrap score of at least 90.

Besides a minor difference in the clustering of Mwera, and the already noted unexpected clustering of Dawida, the two classifications of Fig. 6.15 correspond, thereby confirming the judgment of Professor Philippson when, in 2003, he visually validated the Levenshtein classification of Tanzanian languages. The inexplicable50 misclassification of Dawida does not seem a serious impediment to the further application of the Levenshtein methodology to Bantu languages, which is why we are confident that the classification of the varieties from Gabon (next sections) is based on a method validated twice (by expert judgment and by this experiment).
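To make the distance underlying these classifications concrete, the following is a minimal sketch of a Levenshtein distance between two wordlists. It assumes plain unit costs for insertions, deletions and substitutions, not the gradual segmental costs used for the trees of this chapter, and it normalises each word pair by the length of the longer word, which is also an assumption. The function names and the example words are hypothetical placeholders, not data from the wordlists.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit costs (not the gradual segmental variant)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def language_distance(words_a: dict, words_b: dict) -> float:
    """Mean length-normalised distance over the concepts attested in both wordlists."""
    shared = [concept for concept in words_a if concept in words_b]
    if not shared:
        raise ValueError("the two wordlists share no concept")
    total = 0.0
    for concept in shared:
        a, b = words_a[concept], words_b[concept]
        total += levenshtein(a, b) / max(len(a), len(b), 1)
    return total / len(shared)


# Invented forms, for illustration only.
variety_1 = {"water": "madi", "fire": "moto"}
variety_2 = {"water": "mazi", "fire": "moto"}
print(language_distance(variety_1, variety_2))
```

Restricting `words_a` and `words_b` to a shorter concept list, as done above when passing from 1052 to 85 items, only changes the set of shared concepts over which the mean is taken.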

49 The trees presented in the other sections of the Results adopt a cut‐off of 70.

50 Derek Nurse (2016, personal communication) sold his scientific books and is unable to help further.

Figure 6.13  The full tree appearing in Grollemund et al. (2015) has been redrawn from the Supplementary Information, which gives access to the full list of varieties. Only the section of the tree listing the 32 Tanzanian varieties we examined is shown here. See Fig. 6.12.

Figure 6.14  Simplified tree corresponding to the section redrawn in Fig. 6.13 from Grollemund et al. (2015). Nodes reported are supported by at least 90% of the subreplicates; all other nodes have been collapsed. Varieties are labelled according to Guthrie codes: those in black correspond to the 32 Tanzanian languages we processed, those in grey do not.

Figure 6.15  Congruence between the Levenshtein classification (left) and the one of Grollemund et al. (2015) (right – see Fig. 6.14). Nodes reported are supported by at least 90% of the subreplicates according to the bootstrap (left) and the jackknife method (right). The arrow shows the lack of congruence for Dawida (E74a).

6.3.1.2 Classification of 53 Bantu languages from Gabon (ALGAB)

The classification of the 53 Bantu languages of Gabon, represented as a UPGMA tree built from the Levenshtein distance matrix computed on the full dataset (APPENDIX 1), suggests a major partition between the B10/B30 cluster and all the rest. The rest is divided into two subclusters: {A75; B20; A34; B21} and {B40; B50; B60; B70; B202} (Fig. 6.16). The latter group can be further divided into two subclusters, {B40} vs. {B50; B60; B70}, where the varieties B60 and B70 are so homogeneous that they cannot be separated (they coexist in the same subclusters). While there are reasons supporting the inclusion of the languages of the groups B10 and B30 in a single group (see Nurse and Philippson 2003), the cluster formed by the A75 Fang varieties (spoken in the North of Gabon) together with the group B20 is not expected. We note that this tree (Fig. 6.16) roughly fits the larger classification of Bastin et al. (1999) (Fig. 6.5), where the groups Mbam‐bubi (some A varieties, some B20 varieties), North‐western Bantu (some A, B10, B30, some B20 varieties) and Central Western Bantu (B40, B50, B60, B70, B80, C, H, K, R) are described as quite stable. If we focus only on Gabon, only two of the groups of Bastin et al. are relevant:

• North‐western Bantu varieties, including languages of the A zone (Benga‐A34; Fang‐A75; Shiwa‐A83 and Bekwil‐A85b)51 and three groups: MYENE‐B10; (KOTA)KELE‐B20 and TSOGO‐B30.
• North‐central Bantu varieties, including the groups SIRA‐B40 (see note #52) and NJABI‐MBETE‐TEKE‐B50/60/70.

The partition of Bastin et al. is exactly the same as ours, with the exception of the cluster {B10; B30} that stands alone, but this group has been postulated to deviate because of similarities that are related to vertical transmission and to convergence phenomena inherent to a long phase of contact (Mouguiama‐Daounda and Van der Veen 2005). In more detail, the dichotomy North‐western Bantu vs. North‐central Bantu is based on differences concerning the number of phonemic vowels, the presence/absence of distinctive vocalic length, the presence/absence of spirantization for obstruents, shared lexical innovations and some tonal specificities.

But is this tree robust enough to constitute a working hypothesis about the possible relatedness of Gabon populations, to be tested with genetic markers? The consensus tree based on 100 bootstrap resampled matrices (Fig. 6.17A) shows that the main partition B10/B30 versus all‐the‐rest is not robust. When only nodes with a bootstrap score of at least 70% are kept, the main clusters of the tree correspond to 5 groups of languages: {A75}; {B10, B30}; {B20}; {B40}; {B50, B60, B70}. This representation does not contradict the tree computed on the full matrix of Levenshtein distances (Fig. 6.16), because the major sub‐clusters still appear and are compatible with the partition North‐western versus North‐central Bantu, although they do not support it fully. If we accept the tree of Fig. 6.17A and plot its clusters geographically over the consensus linguistic areas we defined (see Fig. 6.9 and section 6.2.1.2.2 Mapping the ALGAB), we obtain a map showing that the degree of geographic coherence of the clustering is remarkable, with very few exceptions (mainly concerning languages B10 and B20) (Fig. 6.17B). The mapping visually suggests a correlation between geographic and linguistic distances, and this correlation is indeed statistically significant (r = 0.461**).
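The logic of the bootstrap scores can be sketched as follows. Concepts are resampled with replacement, a mean distance matrix is recomputed for each replicate, and the score of a group is the percentage of replicates in which the group is recovered. The sketch is illustrative only: it assumes that one distance matrix per concept is available, and it replaces the UPGMA clustering step by a simplified and stricter criterion (every member of the group is closer to all other members than to any outsider). All names are hypothetical.

```python
import random


def mean_matrix(per_concept, sample):
    """Average the per-concept distance matrices over the sampled concept indices."""
    n = len(per_concept[0])
    return [[sum(per_concept[c][i][j] for c in sample) / len(sample)
             for j in range(n)] for i in range(n)]


def is_tight_group(dist, group):
    """Simplified stand-in for 'the group forms a cluster' (not a UPGMA criterion)."""
    outsiders = [k for k in range(len(dist)) if k not in group]
    for i in group:
        worst_inside = max((dist[i][j] for j in group if j != i), default=0.0)
        best_outside = min((dist[i][k] for k in outsiders), default=float("inf"))
        if worst_inside >= best_outside:
            return False
    return True


def bootstrap_support(per_concept, group, replicates=100, seed=0):
    """Percentage of bootstrap replicates in which `group` is recovered."""
    rng = random.Random(seed)
    n_concepts = len(per_concept)
    hits = 0
    for _ in range(replicates):
        sample = [rng.randrange(n_concepts) for _ in range(n_concepts)]
        if is_tight_group(mean_matrix(per_concept, sample), group):
            hits += 1
    return 100.0 * hits / replicates
```

Under this scheme a node would be collapsed in the consensus tree whenever its score falls below the chosen cut‐off (70 here, 90 in the comparison with the Tanzanian data).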
One of the major working hypotheses of the group of linguists of Lyon at the origin of the general project (Professors Jean‐Marie Hombert and Lolke van der Veen) concerned a contact zone in the central area of Gabon, in between the two main groups found by Bastin et al. (1999), that is North‐western Bantu and North‐central Bantu. The trees in Figs. 6.16 and 6.17A do not support this hypothesis. As dendrograms portray the variability in a rigid and categorical way, it might be that varieties belonging to two very different linguistic groups, even if in direct contact and reciprocally influenced by each other, would still be classified as separate clusters when plotted as a tree. A multidimensional scaling representation shows the variability in a more gradual way and might be better suited to display the convergence area. The 3D MDS of Fig. 6.18 confirms that the distinction between the FANG languages (A75) and all the other varieties is clear‐cut and shows that all non‐Fang languages are quite close to each other. The varieties of the group B20 indeed form a single group, with the exception of the language Sheke‐B21 and the dialect West‐Kele‐B22a, both intermediate between the groups A75 and B20, in agreement with their geographical position (they are the only two B20 varieties in contact with Fang languages; moreover, Sheke‐B21 is completely surrounded by varieties not belonging to the B zone).

51 Not all of them are included in this study.

52 To this group should be added the Vili dialect (H12a).

Figure 6.16  ALGAB: UPGMA classification of the 53 varieties according to the full dataset. Gradual segmental Levenshtein distances. The asterisk corresponds to the language B202, the only B20 variety not clustered with the other ones.

Another exception to the cohesion of the B20 group is the language Sigu‐B202, whose classification is intermediate between the languages B40 and B60; again, this has to be related to its geographical position (see Fig. 6.17B). A similar phenomenon arises concerning B203, B24 and B201. Finally, the five languages belonging to the group B40 cluster together tightly. If only the first and the second dimensions of the multidimensional scaling are considered, all non‐FANG (A75) languages would be quite close, forming a single swarm of points, with the exception of the languages of the group B10 that would stand apart.

Figure 6.17  A: UPGMA bootstrap consensus tree concerning the classification of 53 varieties from the ALGAB. Nodes supported by less than 70% of the subreplicates have been collapsed. The number of lexical items available for each language is reported after the labels. B: Mapping of the major clusters on the consensus map we obtained (see Fig. 6.9).

Figure 6.18  ALGAB. Multidimensional scaling plot concerning 53 languages from Gabon. Varieties are coloured as in Fig. 6.17 for the reader’s ease. Stress values: in 1 dimension = 0.3247, in 2 dim. = 0.1641, in 3 dim. = 0.1215 (plot shown).

When the third dimension is included, it is clear that the languages B10 and B30 form a distinct cluster compared to the varieties of the groups B50/B60/B70, which are almost indistinguishable from each other. This cluster B10/B30 is interesting because the corresponding languages are spoken in non‐neighbouring linguistic areas. Moreover, and differently from the majority of the languages B10/30, three varieties (B11a, B11c and B32) are not in contact with the languages of the group B40. This likely explains why a two‐dimensional MDS representation (Fig. 6.18, taking into account the first two dimensions only) splits the group B10/30 into two sub‐clusters. To conclude, even with a multidimensional scaling plot, the contact zone does not show up.
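The stress values quoted with the MDS plots measure how faithfully a low‐dimensional configuration reproduces the input distances, which is why they decrease when a dimension is added. As a hedged illustration, the sketch below computes Kruskal's stress‐1 in its metric form, sqrt(Σ(d_ij − δ_ij)² / Σ d_ij²), where d_ij are the distances in the plot and δ_ij the input distances. The stress formula actually used by the software that produced the figures is not stated here, so this choice is an assumption, and other normalisations exist.

```python
import math


def stress_1(dissimilarities, coords):
    """Kruskal stress-1 of a configuration, using the input distances as disparities."""
    n = len(coords)
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])  # distance in the MDS plot
            numerator += (d - dissimilarities[i][j]) ** 2
            denominator += d ** 2
    return math.sqrt(numerator / denominator)


# Three points that truly lie on a line are reproduced without error in one dimension.
on_a_line = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
print(stress_1(on_a_line, [(0.0,), (1.0,), (2.0,)]))
```

A value of 0 means a perfect fit; the larger the value, the more the plot distorts the original distance matrix.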

6.3.1.3 Classification of 64 Bantu languages from Gabon and neighbouring areas from Bastin et al. (1999)

As mentioned, the Musée Royal de l’Afrique centrale of Tervuren (MRAC) sent to Lolke van der Veen (University of Lyon) a set of around 70 varieties that mostly concern Gabon, but also neighbouring regions. In spite of the caveats listed in the methodological section 6.2.1.3 about the lower quality of the data and the smaller number of concepts taken into account (92), this database covers a wider geographical and linguistic range, which may provide interesting hypotheses about the clustering of the major groups obtained in the computational analysis of the ALGAB: {B10, B30}; {B20}; {B40}; {B50, B60, B70}.

The UPGMA tree (not shown) computed without the use of resampling techniques, i.e. on the full dataset of Bastin et al. (1999), largely corresponds to the tree of Fig. 6.16. Similarly, the corresponding multidimensional scaling plot (not shown) mirrors the topology of the plot in Fig. 6.18. We just note that they both highlight, once more, the marginal position of the A75 varieties and the main partition between a single group B10/B30 vs. all the rest. However, the weaknesses of the dataset of Bastin et al. (1999) suggest that it might not be reasonable to focus on the details of the representation. This is why we will just analyze the general structure of the UPGMA consensus bootstrap tree of Fig. 6.19, examining it in more detail in the discussion section.
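For readers unfamiliar with UPGMA, the clustering step can be sketched in a few lines: the two closest clusters are merged repeatedly, and the distance from the merged cluster to any other cluster is the size‐weighted average of the two former distances. This is a minimal illustration and not the software used for the trees of this chapter; it returns only the topology (as nested tuples) without branch lengths, and the labels and distances in the example are invented.

```python
def upgma(labels, dist):
    """Return the UPGMA topology as nested tuples, given a symmetric distance matrix."""
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    # Pairwise distances, always keyed as (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        (a, b), _ = min(d.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted average of the distances to the two merged clusters.
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: value for pair, value in d.items() if a not in pair and b not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Invented distances between four hypothetical varieties.
toy = [[0, 1, 10, 10],
       [1, 0, 10, 10],
       [10, 10, 0, 2],
       [10, 10, 2, 0]]
print(upgma(["V1", "V2", "V3", "V4"], toy))
```

Applied to each bootstrap replicate, such a procedure yields the set of trees from which a consensus tree like the one of Fig. 6.19 is summarised.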

Figure 6.19  UPGMA bootstrap consensus tree concerning the classification of 64 varieties from the Royal Museum of Central Africa (MRAC, Tervuren, Belgium) as in Bastin et al. (1999), see Table 6.3. Wordlists of about 92 items. Levenshtein gradual segmental distances. Nodes supported by less than 70% of the subreplicates have been collapsed, except one (in blue), discussed in section 6.4.3.2. Colours as in Fig. 6.17.


This tree yields a representation different from the one obtained by clustering the data of the ALGAB. Concerning the similarities, we note the presence of a cluster {B10; B30} that is even more robust than with ALGAB data (bootstrap score of 98% vs. 72%) and the existence of a Fang‐A75 cluster (that also includes the varieties A85b, which are not present in the ALGAB). Concerning the differences, we report that the group B20 is now split into three clusters and that many varieties are classified together (bootstrap score of 70%) into a big group including some languages of the group B20, plus those of the groups B40, B50, B60 and B70. This cluster is interesting because it corresponds to the North‐central Bantu group that Bastin et al. (1999) found with their lexicostatistical approach, but this is not surprising because we are processing the same data here.

A major topological difference between the tree based on ALGAB data (Fig. 6.17) and the tree based on MRAC / Bastin data is that the major clusters of the former cannot be found in the latter. In fact, the clusters {B40} and {B50, B60, B70} found in ALGAB data tend to partly collapse. This observation, expected given the lower number of lexical items, is contradicted by the appearance of the North‐central Bantu cluster and by the increased robustness of the {B10; B30} group. Is this phenomenon a consequence of a wider geographical context or is it related to the differences between the word lists? A closer look at the concepts reported in the word list of Bastin et al. (1999) shows that they are a subset of those listed in the ALGAB, the latter including 65 additional items (50 nouns + 15 verbs; see note #26). The question of the wordlists will be discussed in section 6.4.5.2.

6.3.2 Genetics

The genetic data presented in this study have already been published (Berniell‐Lee et al. 2009, Quintana‐Murci et al. 2008), but the analyses that follow are novel because genetic distances have been computed according to parameters different from those of the original articles, and because all the data have been reprocessed to make sure that each DNA donor is fully representative of the ethnic group to which he or she is attributed. The latter aspect is essential because 1) published papers (Berniell‐Lee et al. 2009, Quintana‐Murci et al. 2008) demonstrate that the genetic diversity of the human populations living in Gabon is very low, meaning that the sampling design might be questioned about its capacity to gauge the genetic differences of the ethnic groups, and 2) because the purpose of the general project was to measure the correlation between linguistic and genetic diversity by comparing two independent datasets (genes and languages), a task that requires extreme caution. In fact, and in contrast to other projects where geneticists and linguists have been documenting these two aspects of human diversity together and at the same time (for example Mennecier et al. 2016, see CHAPTER 7), the genetic sampling of Gabon took place many years after linguistic varieties were recorded, meaning that the cross‐comparison is based on individuals and locations that are not necessarily the same.

Figure 6.20  Approximate location of the 17 populations typed for mitochondrial, Y‐chromosome and autosomal markers in Gabon (see Table 6.5). The location of populations has been provided by Professor Lolke Van der Veen (University of Lyon). In the background we show the location of the languages (see Fig. 6.8).

During fieldwork, in order to ascertain whether DNA donors were representative of a linguistic group, an ethnological questionnaire was established for each of them (see section 2.3). To verify whether the ethnicity, generally self‐assessed, was ethnologically unquestionable,53 we rechecked the ethnological questionnaires54 to be sure that the birthplace of each DNA donor fell within (or close to) the area traditionally inhabited by his/her ethnic group.55

Two genetic databases have been processed for the mitochondrial DNA and two other ones for the Y‐chromosome. In both cases the first database corresponds to the data already published (Quintana‐Murci et al. 2008 and Berniell‐Lee et al. 2009), while the second database is a subset of the first, including only the DNA donors born inside or close to (within 50 km of) the area occupied by their respective ethnic groups. Individuals born outside such traditional areas have been filtered out as potentially unrepresentative.56 In practice, the approach has consisted in matching the birthplace location‐names reported in the ethnological questionnaire with a database recording all known inhabited places, and finally plotting them over a geographic map of Gabon. The mappings have been compared to Fig. 6.20, where the position of the groups addressed in this study is shown.57

The seventeen studied populations include some of the major ethnic groups of Gabon, and there are at least two populations speaking languages falling within each of the major linguistic clusters we found (see Table 6.5): three populations speak a language of the group B20; two of B10; two of B30; two of B40; two of B50; two of B60; one of B70; and three a language of the linguistic zone A.

53 Self‐assessed ethnic identities are sometimes different from the genealogical ones because prestige factors are at play. The investigator (local or from abroad) has an influence on the kind of identity the DNA donor will declare, similarly to linguistic inquiry, where local varieties will be more or less close to the norm according to the way in which the interview is conducted.

54 Accessible in the lab of Professor L. van der Veen (Lyon, France).
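The birthplace filter described above can be illustrated with a great‐circle distance computation. The sketch below is only a hedged approximation of the procedure: it assumes that the traditional area of a group is modelled as a disc of radius `area_radius_km` around the barycentre (a hypothetical parameter, since the actual extent of each area is not given here), to which the 50 km tolerance is added. The birthplace coordinates are invented; the barycentre is the one reported for the Fang in Table 6.5.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_representative(birthplace, barycentre, area_radius_km, margin_km=50.0):
    """True if the birthplace lies inside the assumed area or within the margin around it."""
    return haversine_km(*birthplace, *barycentre) <= area_radius_km + margin_km


fang_barycentre = (2.0800, 11.4800)   # from Table 6.5
nearby_village = (2.00, 11.50)        # hypothetical birthplace, about 9 km away
distant_town = (0.39, 9.45)           # hypothetical birthplace, roughly 300 km away
print(is_representative(nearby_village, fang_barycentre, area_radius_km=0.0))
print(is_representative(distant_town, fang_barycentre, area_radius_km=0.0))
```

In the actual procedure the coordinates came from matching the questionnaire location‐names against a database of inhabited places; donors whose village could not be retrieved were excluded as well.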

Table 6.5  Bantu populations included in the genetic analysis. Population name (approximate population sizes as in Idiata 2007) and parental system are reported in the first two columns. The three following columns, labelled Sample size, correspond to the number of individuals typed for the different markers. Concerning the mitochondrial DNA (mtDNA) and the Y‐chromosome (Y‐chr.), the first number corresponds to the original sample size, while the second number is the sample size after the exclusion of individuals born outside the area typically inhabited by their ethnic group. The numbers of individuals filtered out differ because the exclusion has been more prudent concerning mitochondrial variability and more stringent concerning the Y‐chromosome. The full sample size, per population, is sometimes lower concerning the Y‐chromosome because technical problems were encountered with some samples. This table is similar to Table 6.4.

55 On this point, the methodological section noted that many DNA donors were located in the capital city of Gabon and in other cities, as a large majority of the citizens of Gabon have now abandoned rural life; the birthplace is therefore especially relevant to establish where they migrated from.

56 Because there is no exact distribution map for ethnic groups, traditional areas have been determined by reference to a barycentre of distribution (Tab. 6.5) established by L. van der Veen (University of Lyon).

57 The match has been above 90%. When a village could not be retrieved, the corresponding individual has been filtered out.

Table 6.5 (caption above)  Bantu populations included in the genetic analysis

Population,              Parental      Sample size     Sample size     Sample size  Location  Barycentre of ethnic group  Linguistic affiliation
approx. size             system        mtDNA           Y-chr.          autosomes              [N/E], according to Tab. 5  (Guthrie/Maho)

Akele, 1000-3000         Patrilinear   48; 44 (-4)     50; 48 (-2)     13           W         -0.7000, 10.2167            B20 (B22a)
Teke, 30,000 (?)         Patrilinear   54; 48 (-6)     48; 24 (-24)    30           SE        -0.8167, 12.7000            B70 (B71a)
Benga, 1500              Patrilinear   50; 49 (-1)     48; 48 (0)      -            NW        +0.5835, 9.33349            A30 (A34)
Duma, 10,000             Matrilinear   47; 44 (-3)     46; 42 (-4)     -            E         -0.8167, 12.7000            B50 (B51)
Eshira Gisir, 30-40,000  Matrilinear   40; 37 (-3)     42; 30 (-12)    -            W         -1.2167, 10.6000            B40 (B41)
Eviya, 50                ?             38; 22 (-16)    24; 21 (-3)     28           Centre    -1.2123, 10.5982            B30 (B301)
Fang, 400,000            Patrilinear   66; 66 (0)      60; 39 (-21)    30           N         +2.0800, 11.4800            A70 (A75)
Galwa, 10,000            Matrilinear   51; 40 (-11)    47; 39 (-8)     -            W         -0.7000, 10.2167            B10 (B11c)
Kota, 25,000             Patrilinear   56; 52 (-4)     53; 38 (-15)    30           E         +0.5667, 12.8667            B20 (B25)
Makina, 1000-3000        Patrilinear   45; 41 (-4)     43; 37 (-6)     28           Centre    -0.1000, 11.9333            A80
Mitsogo, 13,000          Matrilinear   64; 56 (-8)     60; 27 (-33)    30           Centre    -1.0333, 10.6667            B30 (B31)
Ndumu, 3000 (?)          Matrilinear   39; 33 (-6)     36; 33 (-3)     26           SE        +1.6333, 13.5833            B60 (B63)
Nzebi, 50,000            Matrilinear   63; 45 (-18)    57; 24 (-33)    30           SE        -1.5667, 13.2000            B50 (B52)
Obamba, 50,000 (?)       Patrilinear   47; 34 (-13)    46; 23 (-23)    29           SE        -0.6833, 13.7833            B60 (B62)
Orungu, 10,000           Matrilinear   20; 17 (-3)     21; 18 (-3)     21           W         -0.7167, 8.78330            B10 (B11b)
Punu, 150,000            Matrilinear   52; 47 (-5)     58; 25 (-33)    28           SW        -1.8667, 11.0167            B40 (B43)
Shake, 8,000             Patrilinear   51; 44 (-7)     43; 18 (-25)    -            E         -0.8167, 12.7000            B20 (B251)

6.3.2.1 Mitochondrial DNA diversity

A pairwise matrix of genetic diversity has been computed using the Fixation index (FST)58 as the differentiation measure. Computations are based on the relative frequency of the mitochondrial haplogroups (see Quintana‐Murci et al. 2008 for more details). The first FST matrix (Tab. 6.6) accounts for 831 individuals, which is about 1/2000 of the full population of the country, a rather high proportion for population genetics studies. The large number of null distances, together with the generally low FST values, clearly shows that the genetic differentiation of these 17 Bantu populations is very weak (Table 6.10), meaning that they are quite homogeneous, a result that is very clear for almost a half of them: the Eshira, the Akele, the Ndumu, the Obamba and the Teke are characterized by fewer than four significant FST distances in the distance matrix (Tab. 6.6). By repeating the analysis after excluding, according to birthplace, 112 potentially unrepresentative individuals, we obtain the distance matrix reported in Table 6.7. The exclusion rate (13.5%) could have been higher, but we have taken care to keep the population sample sizes as high as possible. This new matrix does not lead to a significantly different number of null pairwise distances, and the average FST value also remains similar (Table 6.10) to that of the unfiltered dataset.

In general, when applying multidimensional scaling to distance matrices that include many null distances, the projection has to be interpreted with caution, because pairs of populations with a zero pairwise distance will not necessarily be plotted next to each other. A separate position in the plot can be the effect of pairwise input distances that “push” the sample in different directions. To summarize in a single representation the two multidimensional scaling plots (before and after the exclusion of DNA donors, Tables 6.6 and 6.7), we adopt a Procrustes analysis (Fig. 6.21).59

58 The Fixation index (FST) is a measure of population differentiation due to genetic structure. It can be estimated from genetic polymorphism data, such as single‐nucleotide polymorphisms (SNPs) or microsatellite data.
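To illustrate how a pairwise FST based on haplogroup frequencies and its permutation test work in principle, a minimal sketch follows. It uses a simple heterozygosity‐based estimator, FST = (H_T − H_S) / H_T with H = 1 − Σ p², and is therefore an assumption‐laden simplification, not the estimator of the software used for Tables 6.6 and 6.7. The haplogroup labels in the example are placeholders.

```python
import random
from collections import Counter


def heterozygosity(sample):
    """Gene diversity H = 1 - sum of squared haplogroup frequencies."""
    n = len(sample)
    return 1.0 - sum((count / n) ** 2 for count in Counter(sample).values())


def pairwise_fst(pop_a, pop_b):
    """Simplified FST = (H_T - H_S) / H_T for two lists of haplogroup labels."""
    n_a, n_b = len(pop_a), len(pop_b)
    h_s = (n_a * heterozygosity(pop_a) + n_b * heterozygosity(pop_b)) / (n_a + n_b)
    h_t = heterozygosity(list(pop_a) + list(pop_b))
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def permutation_p_value(pop_a, pop_b, permutations=10_000, seed=0):
    """Share of random reassignments of individuals giving an FST at least as large."""
    rng = random.Random(seed)
    observed = pairwise_fst(pop_a, pop_b)
    pooled = list(pop_a) + list(pop_b)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(pooled)
        if pairwise_fst(pooled[:len(pop_a)], pooled[len(pop_a):]) >= observed:
            hits += 1
    return (hits + 1) / (permutations + 1)


# Placeholder haplogroup labels: identical compositions give FST = 0.
same_a = ["H1", "H2"] * 5
same_b = ["H1", "H2"] * 5
print(pairwise_fst(same_a, same_b), permutation_p_value(same_a, same_b, permutations=200))
```

A distance whose p‐value is not significant would be reported as ‘ns’ and treated as a zero distance, which is how the many null entries of the matrices arise.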

Table 6.6  FST distance matrix for the mitochondrial DNA diversity of 17 Bantu populations from Gabon (see Table 6.5). Data correspond to Quintana‐Murci et al. (2008). The FST pairwise distances vary from 0 (populations sharing the haplogroups to the same degree) to 1 (populations not sharing any haplogroup). For display ease, only the digits after the decimal point are reported (example: 035 stands for 0.035). The significance of distances has been tested by performing 10,000 random permutations. When FST p‐values were not significant, distances have been replaced by ‘ns’ and correspond to zero distances. Computations, which do not take into account the phylogenetic tree of haplogroups, have been performed according to Excoffier and Lischer (2010).

59 The Procrustes analysis has been developed to compare shapes. To assess the degree of similarity of two shapes (i.e. two polygons), they are optimally superimposed by allowing the following set of actions: translation, rotation, rescaling. This means that size and position do not matter. If the polygons perfectly coincide after translating / rotating / rescaling, the Procrustes statistic will be 1; if no coincidence is possible (like trying to superimpose a circle on a square) it will be practically 0. Two multidimensional scaling plots can be compared in this way, because all that matters in them is the relative position of the dots corresponding to the samples. The orientation of the axes and their scaling are not relevant, provided that they remain orthogonal. Two plots corresponding to the same set of samples are made coincident to reach an optimum quantified by the Procrustes distance (a least‐squares shape metric that requires two aligned shapes with one‐to‐one point correspondence). For more details: https://graphics.stanford.edu/courses/cs164-09-spring/Handouts/paper_shape_spaces_imm403.pdf
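For two‐dimensional plots, the superposition described in this note can be sketched very compactly by treating each point as a complex number: after centring and rescaling both configurations to unit size, the best rotation is obtained analytically and the goodness of fit is the modulus of their inner product. This is an illustrative sketch under stated assumptions (two dimensions only, no reflection allowed, points listed in the same order in both plots); it is not the software used for Fig. 6.21, and the function name is hypothetical.

```python
def procrustes_fit_2d(points_a, points_b):
    """Fit in [0, 1] after optimal translation, rescaling and rotation (1 = perfect match)."""
    def normalise(points):
        zs = [complex(x, y) for x, y in points]
        centre = sum(zs) / len(zs)
        zs = [z - centre for z in zs]                    # translation
        size = sum(abs(z) ** 2 for z in zs) ** 0.5
        if size == 0:
            raise ValueError("all points coincide")
        return [z / size for z in zs]                    # rescaling to unit size

    a, b = normalise(points_a), normalise(points_b)
    # The modulus of the inner product is the fit reached by the best rotation.
    return abs(sum(za.conjugate() * zb for za, zb in zip(a, b)))


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
# The same square rotated by 90 degrees, enlarged three times and shifted.
moved = [(5, 5), (5, 8), (2, 8), (2, 5)]
print(procrustes_fit_2d(square, moved))
```

Here the two configurations coincide after superposition, so the fit is 1 up to rounding; populations whose position changes between the full and the reduced dataset would lower it.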


Table 6.6 (see caption above)  FST distance matrix for the mitochondrial DNA diversity of 17 Bantu populations from Gabon

        Ben  Dum  Evi  Fan  Gal  Esh  Ake  Kot  Mak  Ndu  Nze  Oba  Oru  Pun  Sha  Tek  Tso
Benga   -    ns   035  ns   029  ns   ns   017  ns   ns   ns   ns   ns   ns   ns   ns   038
Duma    ns   -    028  024  ns   ns   018  047  027  ns   ns   ns   ns   ns   032  ns   023
Eviya   035  028  -    024  022  ns   ns   046  028  023  032  ns   ns   019  031  ns   021
Fang    ns   024  024  -    028  ns   ns   040  ns   ns   ns   ns   ns   ns   016  ns   020
Galoa   029  ns   022  028  -    019  015  037  030  ns   018  ns   ns   013  ns   012  ns
Eshira  ns   ns   ns   ns   019  -    ns   ns   ns   ns   ns   ns   ns   016  ns   ns   035
Akele   ns   018  ns   ns   015  ns   -    ns   ns   ns   017  ns   ns   ns   ns   ns   ns
Kota    017  047  046  040  037  ns   ns   -    024  037  030  019  ns   038  020  026  059
Makina  ns   027  028  ns   030  ns   ns   024  -    ns   ns   ns   ns   ns   ns   ns   024
Ndumu   ns   ns   023  ns   ns   ns   ns   037  ns   -    ns   ns   ns   ns   018  ns   ns
Nzebi   ns   ns   032  ns   018  ns   017  030  ns   ns   -    ns   ns   ns   027  ns   021
Obamba  ns   ns   ns   ns   ns   ns   ns   019  ns   ns   ns   -    ns   ns   ns   ns   ns
Orungu  ns   ns   ns   ns   ns   ns   ns   ns   ns   ns   ns   ns   -    ns   ns   ns   ns
Punu    ns   ns   019  ns   013  016  ns   038  ns   ns   ns   ns   ns   -    020  ns   ns
Shake   ns   032  031  016  ns   ns   ns   020  ns   018  027  ns   ns   020  -    ns   022
Teke    ns   ns   ns   ns   012  ns   ns   026  ns   ns   ns   ns   ns   ns   ns   -    014
Tsogo   038  023  021  020  ns   035  ns   059  024  ns   021  ns   ns   ns   022  014  -

Table 6.7  FST distance matrix for the mitochondrial DNA diversity of 17 Bantu populations from Gabon (as in Table 6.6) after the exclusion of 112 individuals born outside the area typically inhabited by their respective ethnic groups, and therefore possibly unrepresentative of them. For technical details about the computation see Table 6.6.

        Ben  Dum  Evi  Fan  Gal  Esh  Ake  Kot  Mak  Ndu  Nze  Oba  Oru  Pun  Sha  Tek  Tso
Benga   -    ns   064  ns   039  ns   ns   ns   ns   ns   ns   ns   ns   ns   ns   ns   047
Duma    ns   -    040  027  ns   ns   ns   038  036  ns   020  ns   ns   ns   045  ns   026
Eviya   064  040  -    030  ns   033  ns   071  034  ns   048  ns   ns   ns   039  ns   ns
Fang    ns   027  030  -    030  ns   ns   032  ns   ns   ns   ns   ns   ns   020  ns   026
Galoa   039  ns   ns   030  -    024  023  037  035  ns   023  ns   ns   ns   027  018  ns
Eshira  ns   ns   033  ns   024  -    ns   ns   ns   ns   ns   ns   ns   ns   ns   ns   041
Akele   ns   ns   ns   ns   023  ns   -    ns   ns   ns   018  ns   ns   ns   ns   ns   019
Kota    ns   038  071  032  037  ns   ns   -    028  031  020  ns   ns   027  ns   018  059
Makina  ns   036  034  ns   035  ns   ns   028  -    ns   ns   ns   ns   ns   ns   ns   030
Ndumu   ns   ns   ns   ns   ns   ns   ns   031  ns   -    ns   ns   ns   ns   021  ns   ns
Nzebi   ns   020  048  ns   023  ns   018  020  ns   ns   -    ns   ns   ns   029  ns   026
Obamba  ns   ns   ns   ns   ns   ns   ns   ns   ns   ns   ns   -    ns   ns   ns   ns   ns
Orungu  ns   ns   ns   ns   ns   ns   ns   ns   ns   ns   ns   ns   -    ns   ns   ns   ns
Punu    ns   ns   ns   ns   ns   ns   ns   027  ns   ns   ns   ns   ns   -    021  ns   ns
Shake   ns   045  039  020  027  ns   ns   ns   ns   021  029  ns   ns   021  -    ns   036
Teke    ns   ns   ns   ns   018  ns   ns   018  ns   ns   ns   ns   ns   ns   ns   -    022
Tsogo   047  026  ns   026  ns   041  019  059  030  ns   026  ns   ns   ns   036  022  -

Figure 6.21  Variability of the mitochondrial DNA in 17 ethnic groups from Gabon. Procrustes superposition of two multidimensional scaling plots (full dataset vs. reduced dataset), concerning the Hypervariable region 1. Arrows indicate the differences in the topologies. In gray: MDS projection of the full dataset according to the FST matrix of Table 6.6. In black: MDS projection of the reduced dataset according to the FST matrix of Table 6.7. The reduced dataset (in black) is obtained by filtering out 112 individuals born outside the areas typically occupied by their respective ethnic group. Each label includes the name of the population and its linguistic affiliation. The box in the top right part shows the many null pairwise distances as segments connecting the corresponding populations; this plot is the same as the one in black in the general figure.

When looking at the double representation of Fig. 6.21, we see that the topological position of the populations does not really change, with the exception of the Eviya, the ethnic group with the largest relative drop in sample size after the correction (from 38 to 22 individuals, see Table 6.5). A closer look at the plots, taking into account geographical distance or linguistic affiliations, does not provide further clues to interpret the swarm of points, and the Mantel test correlations we computed between genetic, geographic and linguistic distances are not significant. To conclude, the genetic variability is very low, there are no clusters, and DNA variation does not mirror any cultural difference (at least not as lexical data reflect it) or any social organization pattern linked to patrilinearity or matrilinearity (Table 6.5).
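Since the Mantel test recurs throughout this chapter, a minimal sketch of its logic may help. The correlation between two distance matrices is computed over their upper triangles, and its significance is assessed by repeatedly permuting the rows and columns of one matrix together, because the entries of a distance matrix are not independent observations. The sketch assumes a Pearson correlation and a one‐tailed test, which may differ from the settings actually used; the example matrix is invented.

```python
import random


def upper_triangle(matrix, order=None):
    """Flatten the upper triangle, optionally after relabelling rows and columns."""
    n = len(matrix)
    order = list(range(n)) if order is None else order
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / (s_xx * s_yy) ** 0.5


def mantel(matrix_1, matrix_2, permutations=999, seed=0):
    """Return (r, one-tailed p-value) for two symmetric distance matrices."""
    rng = random.Random(seed)
    x = upper_triangle(matrix_1)
    observed = pearson(x, upper_triangle(matrix_2))
    order = list(range(len(matrix_1)))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)  # rows and columns are permuted together
        if pearson(x, upper_triangle(matrix_2, order)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Invented example: five places on a line, compared with themselves.
positions = [0, 1, 3, 6, 10]
geo = [[abs(p - q) for q in positions] for p in positions]
r, p_value = mantel(geo, geo)
print(round(r, 3), p_value)
```

With real data, `matrix_1` and `matrix_2` would be, for instance, the geographic and the FST distances between the same 17 populations, listed in the same order.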

6.3.2.2 Y‐chromosome diversity

The same seventeen populations typed for the mitochondrial DNA have been typed for the non‐recombinant part of the Y‐chromosome, to evaluate whether the paternal lineages convey a different signal from the maternal ones. As with the mitochondrial DNA, a pairwise FST distance matrix of genetic diversity has been computed. Computations are based on the relative frequency of the Y‐chromosome haplogroups (see Berniell‐Lee et al. 2009 for more details). The first matrix (Tab. 6.8) accounts for 782 individuals, which corresponds to about 1/2000 of the full Gabon population. While a very large number of the pairwise distances computed in the matrix were not significant (48 out of 136 = 35%), the proportion of significant ones is, nonetheless, almost double that obtained with mitochondrial DNA markers (65% vs. 36%). The average FST distance is also higher (Table 6.10). The least differentiated ethnic groups are the Duma, the Ndumu and the Orungu, while the most differentiated ones are the Eviya and the Fang.

Concerning the mitochondrial DNA, we said that the rate of exclusion of individuals from the original database (those born outside the region typically inhabited by their ethnic group) was rather conservative, meaning that a larger number of individuals might have been excluded. We have also seen that the MDS plots corresponding to the full and to the reduced database are similar (Fig. 6.21). This is why, when this procedure was repeated for the Y‐chromosome, a stronger exclusion rate (32%) was adopted (see Table 6.10), to determine whether the new FST distance matrix (Table 6.9) would correlate better than the original one with geographic distances or linguistic affiliations.
With Y‐chromosome data, when filtering off more individuals (248), we notice that the number of null distances increases (from 48 to 63) and that the average pairwise FST distance also increases (from 0.024 to 0.028; t‐test significant, see Table 6.10), meaning that the filter has an effect here. Some differences also appear in the Procrustes analysis comparing the MDS plots corresponding to the full and to the reduced dataset (Fig. 6.22).

Table 6.8  FST distance matrix concerning the 17 Bantu populations from Gabon typed for the Y‐chromosome. Concerning the Bantu populations reported, the dataset is exactly the one processed in Berniell‐Lee et al. (2009). For ease of display, only the digits after the decimal point are reported (example: 021 stands for 0.021). The significance of distances has been tested by performing 10,000 random permutations. When FST p‐values were not significant, distances have been replaced by ‘ns’ and correspond to zero distances. The computations, which do not take into account the phylogenetic tree of haplogroups, have been performed according to Excoffier and Lischer (2010).

Table 6.8 (see caption above)  FST distance matrix concerning the 17 Bantu populations from Gabon typed for the Y‐chromosome

        Ben Dum Evi Fan Gal Esh Ake Kot Mak Ndu Nze Oba Oru Pun Sha Tek Tso
Benga    -  021 077 094 052 060 043 019 052 030 018 034 049 038 038 031 020
Duma    021  -  031 058 026 029 ns  ns  ns  ns  ns  ns  ns  ns  ns  ns  ns
Eviya   077 031  -  079 077 ns  045 053 035 052 045 036 ns  037 026 042 044
Fang    094 058 079  -  090 107 050 042 059 043 068 060 061 034 036 043 070
Galoa   052 026 077 090  -  055 017 030 052 027 014 024 ns  026 045 021 013
Eshira  060 029 ns  107 055  -  037 051 052 045 026 ns  ns  039 031 036 023
Kele    043 ns  045 050 017 037  -  013 024 ns  019 016 ns  013 ns  015 016
Kota    019 ns  053 042 030 051 013  -  ns  ns  ns  025 019 ns  013 ns  013
Makina  052 ns  035 059 052 052 024 ns   -  ns  ns  042 036 014 ns  031 030
Ndumu   030 ns  052 043 027 045 ns  ns  ns   -  ns  ns  ns  ns  ns  ns  ns
Nzebi   018 ns  045 068 014 026 019 ns  ns  ns   -  014 ns  ns  021 ns  ns
Obamba  034 ns  036 060 024 ns  016 025 042 ns  014  -  ns  ns  ns  ns  013
Orungu  049 ns  ns  061 ns  ns  ns  019 036 ns  ns  ns   -  ns  ns  ns  ns
Punu    038 ns  037 034 026 039 013 ns  014 ns  ns  ns  ns   -  ns  ns  013
Shake   038 ns  026 036 045 031 ns  013 ns  ns  021 ns  ns  ns   -  016 019
Teke    031 ns  042 043 021 036 015 ns  031 ns  ns  ns  ns  ns  016  -  012
Tsogo   020 ns  044 070 013 023 016 013 030 ns  ns  013 ns  013 019 012  -
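The permutation test mentioned in the caption of Table 6.8 can be illustrated with a minimal sketch. It is not the implementation of Excoffier and Lischer (2010): it assumes a simple frequency‐based estimator, FST = (H_T - H_S) / H_T, computed from haplogroup labels, and it tests significance by shuffling individuals between the two populations. The haplogroup labels, sample sizes and function names are invented for the example.

```python
import random
from collections import Counter


def gene_diversity(haplogroups):
    """Probability that two draws (with replacement) carry different haplogroups."""
    n = len(haplogroups)
    return 1.0 - sum((count / n) ** 2 for count in Counter(haplogroups).values())


def fst(pop_a, pop_b):
    """Frequency-based FST = (H_T - H_S) / H_T for two samples of haplogroup labels."""
    h_s = (gene_diversity(pop_a) + gene_diversity(pop_b)) / 2
    freq_a, freq_b = Counter(pop_a), Counter(pop_b)
    h_t = 1.0 - sum(
        ((freq_a[h] / len(pop_a) + freq_b[h] / len(pop_b)) / 2) ** 2
        for h in set(freq_a) | set(freq_b)
    )
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def permutation_p_value(pop_a, pop_b, permutations=999, seed=0):
    """Return (observed FST, p-value) by shuffling individuals between populations."""
    rng = random.Random(seed)
    observed = fst(pop_a, pop_b)
    pooled = list(pop_a) + list(pop_b)
    hits = 1  # the observed partition counts as one permutation
    for _ in range(permutations):
        rng.shuffle(pooled)
        if fst(pooled[:len(pop_a)], pooled[len(pop_a):]) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)


# Invented samples: two populations with no haplogroup in common,
# and two populations with identical haplogroup frequencies.
distinct_fst, distinct_p = permutation_p_value(["hgA"] * 20, ["hgB"] * 20)
same_fst, same_p = permutation_p_value(["hgA", "hgB"] * 10, ["hgA", "hgB"] * 10)
```

In the tables, a distance whose p‐value is not significant is reported as ‘ns’ and treated as zero; in this sketch the second pair of samples would fall in that category.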

Table 6.9  FST distance matrix concerning the 17 Bantu populations typed for the Y‐chromosome as in Table 6.8 but after the exclusion of 248 individuals born outside the area typically inhabited by their respective ethnic group, and therefore possibly unrepresentative of them. For technical details see the caption of Table 6.8.

        Ben Dum Evi Fan Gal Esh Ake Kot Mak Ndu Nze Oba Oru Pun Sha Tek Tso
Benga    -  ns  090 111 052 074 043 019 047 028 ns  061 056 052 048 ns  ns
Duma    ns   -  040 073 023 041 ns  ns  ns  ns  ns  030 ns  ns  ns  ns  ns
Eviya   090 040  -  107 088 ns  053 057 036 063 053 067 036 035 058 052 056
Fang    111 073 107  -  112 132 061 050 077 058 087 077 067 044 032 054 069
Galoa   052 023 088 112  -  078 018 027 056 029 ns  046 ns  039 073 ns  ns
Eshira  074 041 ns  132 078  -  053 067 052 055 036 060 ns  045 054 049 028
Kele    043 ns  053 061 018 053  -  ns  024 ns  ns  026 ns  ns  ns  ns  ns
Kota    019 ns  057 050 027 067 ns   -  ns  ns  ns  043 ns  ns  029 ns  ns
Makina  047 ns  036 077 056 052 024 ns   -  021 ns  060 031 ns  ns  ns  029
Ndumu   028 ns  063 058 029 055 ns  ns  021  -  ns  ns  ns  ns  ns  ns  ns
Nzebi   ns  ns  053 087 ns  036 ns  ns  ns  ns   -  033 ns  ns  046 ns  ns
Obamba  061 030 067 077 046 060 026 043 060 ns  033  -  ns  ns  ns  ns  028
Orungu  056 ns  036 067 ns  ns  ns  ns  031 ns  ns  ns   -  ns  037 ns  ns
Punu    052 ns  035 044 039 045 ns  ns  ns  ns  ns  ns  ns   -  ns  ns  ns
Shake   048 026 058 032 073 054 ns  029 ns  ns  046 ns  037 ns   -  ns  ns
Teke    ns  ns  052 054 ns  049 ns  ns  ns  ns  ns  ns  ns  ns  ns   -  ns
Tsogo   ns  ns  056 069 ns  028 ns  ns  029 ns  ns  028 ns  ns  ns  ns   -


Table 6.10 Summary statistics concerning FST values (Tables 6.6, 6.7, 6.8, 6.9)

                                    mtDNA            mtDNA              Y-Chromosome     Y-Chromosome
                                    (full dataset)   (reduced dataset)  (full dataset)   (reduced dataset)
                                    See Tab. 6.6     See Tab. 6.7       See Tab. 6.8     See Tab. 6.9
Number of individuals               831 (100%)       719 (86.5%)        782 (100%)       534 (68.2%)
Null or not significant distances   87 (64%)         93 (68%)           48 (35%)         63 (46%)
Average FST (St. Dev.)              0.009 (0.014)    0.010 (0.016)      0.024 (0.024)    0.028 (0.031)
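The percentages of null distances in Table 6.10 follow from the number of distinct population pairs, n(n - 1)/2 = 136 for n = 17. The short check below recomputes them from the counts reported in the table. The variable names are ours, and percentages are rounded to the nearest integer.

```python
def pair_count(n_populations):
    """Number of distinct unordered pairs among n populations."""
    return n_populations * (n_populations - 1) // 2


def percentage(count, total):
    """Share of `count` in `total`, rounded to the nearest integer percent."""
    return round(100 * count / total)


pairs = pair_count(17)

# Counts of null or non-significant distances as reported in Table 6.10.
null_counts = {
    "mtDNA full": 87,
    "mtDNA reduced": 93,
    "Y-chromosome full": 48,
    "Y-chromosome reduced": 63,
}
null_shares = {name: percentage(count, pairs) for name, count in null_counts.items()}
```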

Figure 6.22 Variability of the Y‐chromosome in 17 ethnic groups from Gabon. Pro‐ crustes superposition of two multidimensional scaling plots (full dataset vs. reduced data‐ set). Arrows indicate the differences in the topology. In gray: MDS projection of the full dataset according to the FST matrix of Table 6.8. Stress values: in 1 dimension = 0.3379, in 2 dim. = 0.1682 (plot reported), in 3 dim. = 0.1134. In black: MDS projection of the reduced dataset according to the Fst matrix of Table 6.9. The reduced dataset is obtained by filter‐ ing‐off 248 individuals born outside the areas typically occupied by their respective ethnic groups. Stress values: in 1 dimension = 0.3347, in 2 dim. = 0.1793 (plot reported), in 3 dim. = 0.1168. Labels mention the linguistic affiliation. The box in the top right part shows the many null pairwise distances as segments connecting corresponding populations, this plot is the same as the one in black. 204 CHAPTER 6

The topological shift is considerable for the Obamba population, which becomes closer to the Teke and the Ndumu; incidentally, the three populations speak closely related languages. While there are no noticeable clusters, the outlier positions of the Eshira, Eviya and Fang samples are clear (Fig. 6.22). To suggest caution in interpreting the plot, however, we invite the reader to note how the Eshira and the Orungu are plotted quite far from each other, although their pairwise FST distance = 0.
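The distortion noted for the Eshira and the Orungu can be illustrated with a stress computation. This is a minimal sketch and not the MDS software used for Fig. 6.22: it assumes the raw form of Kruskal's stress‐1, comparing the input dissimilarities directly with the Euclidean distances in the plot, whereas non‐metric MDS would first replace the dissimilarities by monotone disparities. The three‐point example is invented. Setting one distance to zero, as is done for ‘ns’ values, makes the matrix impossible to plot exactly, because two points at distance zero should share the same distances to a third point.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def stress1(dissimilarities, coords):
    """Raw Kruskal stress-1 between a dissimilarity matrix and a configuration."""
    numerator = 0.0
    denominator = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            plotted = euclidean(coords[i], coords[j])
            numerator += (dissimilarities[i][j] - plotted) ** 2
            denominator += plotted ** 2
    return math.sqrt(numerator / denominator)


# Invented configuration: a 3-4-5 right triangle, plotted exactly.
coords = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
exact = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]

# Same matrix with one distance set to zero, mimicking an 'ns' entry.
with_null = [[0, 0, 4], [0, 0, 5], [4, 5, 0]]

stress_exact = stress1(exact, coords)
stress_with_null = stress1(with_null, coords)
```

The first stress value is zero, the second is not: the two points separated by a null distance still have to sit apart in the plot to respect their different distances to the third point.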

6.3.2.3 Autosomal diversity

The analysis of 28 tetranucleotide microsatellites (Verdu et al. 2009) located on chromosomes other than the Y and transmitted by both parents leads to an FST matrix (not shown) where the only significant pairwise distances are those concerning the Eviya population with respect to the other ones. While 17 groups were typed for the mitochondrial DNA and Y‐chromosome, here we have only 12 populations because the lack of results and the cost of the analyses discouraged us from processing other samples (Benga, Duma, Eshira, Galwa and Shake were not typed). It is important to say that 28 autosomal markers are a relatively small, and thereby limited, genetic dataset, especially for weakly differentiated populations. Therefore, this dataset might very well be underpowered to describe fine population structure among Bantu‐speaking populations from Gabon. In other words, the fact that we do not see significant genetic differences does not mean that there are none in reality. Current work on genomewide data and several hundreds of thousands of autosomal SNP markers on these same DNA samples is allowing finer genetic structure to be revealed in Gabon (E. Patin 2017, personal communication).

6.4 DISCUSSION

The question of the Bantu dispersal has vigorously resurfaced thanks to the work of the team of Koen Bostoen (University of Gent) (see Bostoen et al. 2015 for a review). In this section we will review updated literature that has not been cited in the introduction (section 6.1), with a special focus on Gabon and in the light of the results that have been obtained concerning the ALGAB (Atlas Linguistique du GABon) and the corresponding part of the database of Bastin et al. (1999), coming from the archives of the Musée Royal de l’Afrique Centrale, MRAC, Tervuren (Belgium). We will also address other sources of evidence, namely archaeology, genetics and musicology. At the end we will review the limitations of our research and suggest possible directions for future investigation.

6.4.1 Bantu dispersal in Gabon: Rainforest versus savannah corridors

The routes of the Bantu expansion suggested by Grollemund et al. (2015) rely on the geographical plot of the backbone of a consensus Bayesian tree based on cognacy judgments. This tree portrays a dataset comprising the whole Bantu linguistic domain according to the data of Bastin et al. (1999) and was calibrated according to some archaeological dates in order to provide a temporal frame for the different splits. To plot the tree on a geographical map of Africa, it was assumed that the contemporary location of the processed languages overlaps the position of the ancestral languages they derive from. This assumption is typical of phylogeographic methods in which the geographical location of the frequency peaks of given traits is considered to be a good indicator of the sites where they first arose. Once the tree was plotted, Grollemund et al. (2015) suggested that savannah corridors through the rainforest, rather than the rainforest itself, were the most efficient migration routes for human displacement, the rationale being that Bantu‐speaking societies were adapted to this kind of environment because their homeland was characterized by savannah. These authors strongly stress that the adaptation to other kinds of landscapes would have been difficult. According to Grollemund et al. (2015), when the shift happened, many generations were necessary to master the techniques ensuring survival; learning new techniques required time and limited migration speed (on average they find a difference of 300 years between two competing routes through the rainforest or through savannah).

According to palynological evidence, a progressive formation of savannah corridors took place through the rainforest during the Middle and Late Holocene, that is from 4000 ybp to 2500 ybp, when the surface of savannah was at its maximum extension (Lézine et al. 2013; Bostoen et al. 2015).
When we look at the region corresponding to Gabon, the formation of several savannah corridors can be inferred (Fig. 6.23) and, if the Bantu migration proceeded through them, it means that Bantu peopling was possible from the east and the south (the west being the Atlantic seashore) but not from the north.

Did the Bantu‐speaking populations enter Gabon following the progressive formation of savannah corridors starting 4000 ybp, or were their migrations independent of it? We cannot answer directly, because the linguistic analyses we conducted do not explicitly address temporal issues related to peopling phases. To set a timeframe according to archaeological sites excavated in Gabon, we refer to the calibrated radiocarbon (14C) dates compiled by Oslisly et al. (2013) and to their classification in four main periods corresponding to different stages of technological knowledge: Late Stone Age (5500‐3500 ybp), 14 sites; Neolithic Stage (3500‐1900 ybp), 33 sites; Early Iron Age (2800‐1000 ybp), 79 sites; Late Iron Age (1000‐100 ybp), 40 sites.⁶⁰ While some dating uncertainty cannot be excluded because of the inherent complexity of radiocarbon dating, the temporal sequence of occupation addressed by Oslisly et al. (2013) overlaps with the timeframe of the Bantu expansion and, interestingly, points to a population crash that started about 2400 ybp and lasted until recent centuries (Oslisly 2001; Wotzka 2006), followed by a new maximum five centuries ago and a further decline until the colonial period (Fig. 6.24).

Figure 6.23  Gabon: progressive appearance of savannah corridors (white) in the rainforest (gray) according to palaeoenvironmental data (adapted from Grollemund et al. 2015). Equatorial Guinea is not shaded. BP means “before present” = years ago.

60 The periods sometimes overlap because two technological phases can coexist at the same time, like typewriters and computers in the late 1990s.

Figure 6.24  Survey of radiocarbon dates over the past 4000 years in central Africa including Gabon (from Oslisly et al. 2013).

The Neolithic Stage corresponds to the transition between the Late Stone Age and the Early Iron Age, that is when people started to become sedentary, made polished stone tools and pottery and used stone hoes and axes to practice slash‐and‐burn agriculture, a cultivation technique corresponding to the first massive anthropogenic impact on the rainforest. This whole phase is related to the arrival of Bantu migrations, with a demographic explosion in the subsequent period, the Iron Age. Because the Neolithic stage started 3500 ybp, we date the first arrival of Bantu groups at this point; the preceding times concern non‐Bantu hunter‐gatherer populations (see Fig. 6.24). This timeframe fits well with the scenario of Grollemund et al. (2015) and the theory of savannah corridors, but is not incompatible with early Bantu migrations southwards, directly through the rainforest.

Was the rainforest a real impediment to migrations? More generally, we can answer by saying that the rivers existing in the rainforest were potential ways of travel⁶¹ and admit that a migration route from Cameroon southwards, along the generally sandy seashore, as advocated by Bastin et al. (1979), is possible, though Grollemund et al. (2015) did not find evidence of it. Concerning the rainforest, it is probably useful to remember that the practice of agriculture is not incompatible with the rainforest environment, in which slash‐and‐burn agriculture is an efficient technique, possible without metallic tools if reliable stone tools are available to cut the trees. See Gonthier (1987, p. 171) for examples about Papua New Guinea stone adzes and hatchets.⁶²

61 Early Bantu‐speaking populations were probably familiar with navigation techniques, as their homeland was located along a river (Benue).
62 Equipped with such tools, three persons can cut a tree of about 60 cm in diameter in half a day.

The present‐day practice of slash‐and‐burn cultivation in Gabon, widespread among the vast majority of ethno‐linguistic groups, can provide some clues to understand the past; it generally occurs within a five‐kilometre radius around each settlement. A forest clearing is cut, then another when the soil of the first is impoverished. This pattern is repeated until the first clearing has regenerated and can be used again (Meunier et al. 2014). This process requires rather simple technical knowledge,⁶³ is possible almost everywhere in the forest, and requires a reduced workforce (Meunier et al. 2014).⁶⁴

Migrations through the savannah are considered more likely by the team of Koen Bostoen (University of Gent) because, even though agriculture is possible, living in the forest requires skills that the Bantu‐speaking groups did not have (Bostoen et al. 2015, Grollemund et al. 2015). These authors explain that innovations arise through a long process of trial‐and‐error and accumulate at a slow pace, especially in small groups of migrants (fewer people = fewer trials). But this conjecture has to take into account the possibility of horizontal technological transfer, as Bostoen et al. (2015) also extensively discuss, from populations used to the rainforest like the Pygmies, whose ancestors diverged from other African populations millennia ago (Patin et al. 2009, Verdu et al. 2009, Batini et al. 2011).

Today Pygmies are still largely present in Gabon and widespread throughout the equatorial rainforest of Africa. They are, to varying degrees, hunter‐gatherers living in close association with neighbouring Bantu‐speaking farmers, with whom they trade. This is a constant feature among all Pygmies, seminomadic or sedentary, and the partnership has lasted for centuries, as ample material demonstrates.
The Pygmies are largely represented in the mythology of the farmers (see Bahuchet 2012 for a review), and the founding myth of many central African societies is a prototypical tale about an initial migration from distant regions during which the Pygmies were encountered and acted as guides, introducing the farmers to the forest world and transmitting to them rites, initiations, and techniques, including some that are not typical of hunter‐gatherers, like the forging of iron (Arom and Thomas 1974; Laburthe‐Tolra 1981). Particularly in Cameroon and Gabon, oral traditions provide many details about initial contact with different Pygmy groups throughout time. The strength of the past association of Bantu‐speaking farmers with Pygmy populations is confirmed by the general language shift of the latter to the varieties of their non‐Pygmy neighbours.

63 This statement does not deny the importance of acquiring detailed knowledge about the size of the patch, the moment to slash and burn with reference to the season, etc. For instance, in terms of the choice of the patch of forest, some tree varieties are easier to cut and burn than others and better enrich the soil.
64 The reduced workforce implies that the population can be of small size.

To recognize the extent of the horizontal transfer of technology and knowledge from the Pygmies to the Bantus is to admit that a migration through the rainforest was indeed possible, without the need to privilege savannah corridors as routes for displacement. This phenomenon leads to a different interpretation of the slower pace of migration through forest regions that Grollemund et al. (2015) measure and interpret as the time necessary to get used to a new environment, hostile at first.⁶⁵ In fact, the rainforest can be seen as a richer environment able to provide several complementary means of subsistence (forest plus agriculture resources), with a decreased necessity to move elsewhere and with a reduced need to constitute large communities, as slash‐and‐burn agriculture can be practiced by small groups and is sustainable within small areas, as the observations of Meunier et al. (2014) on contemporary practice in Gabon show.⁶⁶ For all the above reasons, we will not discuss the Bantu peopling of Gabon as necessarily dependent on the climatic change that led the rainforest to shrink and the savannah habitat to increase its surface.

To conclude, the arrival of the first Bantu‐speaking groups in Gabon probably started 3500 ybp and was possible from all directions: through the rainforest, by savannah corridors or along the coast. And it should not be forgotten that Gabon is very close (200 km) to the capital of Cameroon, Yaoundé, which corresponds to a very important secondary hub for linguistic differentiation (Bostoen et al. 2015).

6.4.2 Compatibility of the Levenshtein‐based classification with previously published ones

Our first attempt to classify Bantu languages with the Levenshtein algorithm concerned 32 Tanzanian languages (see section 6.3.1.1 The Tanzanian experiment). The tree we computed is compatible with the most recent classification available (Grollemund et al. 2015), yielding clusters that correspond. This is why we now test the agreement of our classification of the varieties included in the ALGAB (Linguistic Atlas of Gabon) with the work of Grollemund. This comparison has even more relevance concerning Gabon, because Grollemund et al. (2015) processed many word lists taken from the ALGAB (not all of them, as varieties of the groups B40; B50; B60; B70 are missing). Concerning the varieties that appear in both classifications (A75; B10; B20; B30) we note that the alignment of our classification with that of Grollemund is almost perfect, with subclusters corresponding to very similar or identical subclusters (Fig. 6.25). This is particularly interesting because Grollemund’s tree was obtained according to a methodology fundamentally different from our approach (Bayesian analyses on multistate matrices of cognacy judgments versus automatic unsupervised Levenshtein alignment of segments).

65 Physically speaking, it is easy to penetrate the African equatorial forest as there is no significant undergrowth.
66 It should be added that the practice of agriculture in the savannah has some drawbacks too. For example, the roots of cultivated plants form, at the end of the season, a thick layer that is difficult to break up, forcing farmers to move farther.

The agreement between the two classifications concerning the wordlists of Gabon (but also concerning Tanzanian data; see section 6.3.1.1.1 and Fig. 6.15) is the major methodological result of this study. It suggests that the Levenshtein algorithm captures the same signal of linguistic relatedness (or difference) as a method based on cognate coding, but with the immense advantage of not requiring experts to issue cognacy judgments. We believe that the Levenshtein classification provides a finer categorization of the varieties since, on average, each word consists of five segments that can be compared to each other, while cognacy methods yield simpler information of binary type (basically cognate/not cognate). The main discrepancy between the two classifications concerns the clustering of the group B20, which we classify as a single cluster (subdivided in two clusters) with ALGAB data, whereas Grollemund et al. (2015) do not. This issue is discussed in the next section.

Two other recent publications can be linked to Grollemund’s frequently cited study. The first is the above‐mentioned publication by Bostoen et al. (2015).
This paper includes a shorter version (168 languages) of the classification they published the same year (Grollemund et al. 2015; 424 languages). The second one is a classification of the whole Bantu linguistic domain (Currie et al. 2013; 542 languages) that overlaps in many respects with Grollemund’s work, being based i) on a character‐based Bayesian method, ii) on a similar geographical plotting of the tree and iii) on the same word‐list source (Bastin et al. 1999); the only difference being that Currie et al. (2013) do not calibrate their tree with archaeological dates. Despite the large methodological overlap, Currie’s conclusions are different from Grollemund’s concerning the migration scenario of the Bantu dispersal from the homeland. According to Currie, the early Bantu‐speaking groups first moved through the rainforest and “emerged” on the southern side of it, with one branch moving south and another east (plus another directed to the great lakes). This suggests that western Bantu languages (central‐western and west‐western Bantu) are a paraphyletic group,⁶⁷ whereas Grollemund and co‐workers find them monophyletic. Another difference: Currie finds no evidence of an early linguistic split between western and eastern Bantu languages, while Grollemund and Bostoen do.

67 Relating to a taxonomic group that includes some but not all of the descendants of a common ancestor.
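The difference between a monophyletic and a paraphyletic grouping can be made concrete with a small check on a rooted tree. This is only an illustrative sketch: trees are written as nested tuples, the language labels (W1, W2, W3 for western varieties, E1, E2 for eastern ones) are invented, and the two toy topologies merely mimic the contrast discussed above, not the published trees.

```python
def leaves(tree):
    """Set of leaf labels below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def is_monophyletic(tree, group):
    """True if some node of the tree has exactly `group` as its set of leaves."""
    group = frozenset(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


western = {"W1", "W2", "W3"}
eastern = {"E1", "E2"}

# Pectinate (ladder-like) tree: eastern varieties are nested inside the western ones,
# so the western varieties alone do not form a complete branch (paraphyly).
pectinate = ("W1", ("W2", ("W3", ("E1", "E2"))))

# Tree with an early west/east split: both groups are complete branches.
early_split = (("W1", ("W2", "W3")), ("E1", "E2"))

western_in_pectinate = is_monophyletic(pectinate, western)
western_in_early_split = is_monophyletic(early_split, western)
```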

Figure 6.25  UPGMA bootstrap consensus tree concerning the ALGAB (see Fig. 6.17) plotted against the section of the tree of Grollemund et al. (2015) including the same varieties. The nodes of both trees are supported by at least 70% of the subreplicates (jackknife method on the left; bootstrap method on the right).

Concerning the classification, Currie et al. (2013) state that a majority of the linguistic zones defined by Guthrie have no genetic validity. For example, they categorize the languages B21; B23; B22b; B251 with languages of the zone A (for example A75 and A86), while Grollemund does not. Such inconsistencies between the two papers indicate that the general similarity of the pectinate topology of their two trees is only superficial, with differences that can be significant at fine scales. They probably mean that alternative settings of character‐based Bayesian phylogenies can deliver different clusterings when the (many) parameters of the analysis are not identical. This aspect, typical of all classification work, is interesting because Bayesian phylogenies are considered to outperform distance‐based methods, like ours. While it is true that lexicostatistical methods (Bastin et al. 1999 is a good example) seem unable to distinguish between retentions and innovations and imply a constant rate of evolutionary change that makes phylogenetic assessment problematic, we stress that our distance‐based Levenshtein methodology is different in this respect, as it does not rely on a lexicostatistical approach.
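To make the contrast with binary cognacy coding concrete, the following minimal sketch computes a Levenshtein distance between two words represented as sequences of segments, and averages it over the items shared by two word lists. It is an illustration under stated assumptions, not the exact procedure of this study: all edit operations cost 1, each word distance is normalized by the length of the longer word, and the word lists are invented placeholders rather than ALGAB data.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, 1):
        current = [i]
        for j, seg_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                    # deletion
                current[j - 1] + 1,                 # insertion
                previous[j - 1] + (seg_a != seg_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


def variety_distance(list_a, list_b):
    """Mean normalized Levenshtein distance over the items present in both lists."""
    shared = [item for item in list_a if item in list_b]
    total = 0.0
    for item in shared:
        word_a, word_b = list_a[item], list_b[item]
        total += levenshtein(word_a, word_b) / max(len(word_a), len(word_b))
    return total / len(shared)


# Invented word lists: item -> tuple of segments (placeholders, not real data).
variety_1 = {"item1": ("a", "b", "c", "d"), "item2": ("a", "b")}
variety_2 = {"item1": ("a", "b", "e", "d"), "item2": ("a", "b"), "item3": ("x",)}

distance = variety_distance(variety_1, variety_2)
```

A cognate‐based coding would record only whether each pair of words is cognate or not, whereas here a single differing segment out of four contributes a graded distance of 0.25 for that item.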

6.4.3 The Levenshtein classification of Bantu languages from Gabon

6.4.3.1 ALGAB

Concerning Gabon, we applied the Levenshtein method to two datasets, the varieties included in the ALGAB and the corresponding ones from Bastin et al. (1999). We focused on the ALGAB with more energy because, in contrast to Bastin, these data were collected by the same group of linguists, with replicable methods and with homogeneous transcriptions, and were checked and validated by the same person (Professor Lolke van der Veen, University of Lyon, France). Furthermore, the wordlists of the ALGAB are generally longer (132 items on average) than the ones reported by Bastin (89 items on average).

Concerning the ALGAB, we computed a bootstrap consensus tree (Fig. 6.17) and an MDS (Fig. 6.18) to geographically plot the clusters according to linguistic areas that were defined by harmonizing several references (Bastin et al. 1999, Van der Veen 2007, Maho 2008, Simons 2016). About the cartography, we do not know if the present‐day location of each group of languages corresponds to the position they had in the past, and we consider extensive migrations over millennia more than likely, meaning that ancestral languages might have been spoken elsewhere, not necessarily where they are today. We also consider it possible that, after an initial stage of peopling, some Bantu languages diffused from one Bantu group to another, in the absence of population movements. Finally, we recognize that the vast majority of African populations today are multilingual, and there is no reason to think that the past situation was different. Multilingualism, in itself, is a source of language diversification and this is not a recent phenomenon (Whiteley et al. 1971). We also find it reasonable to admit that borrowing and secondary contact between differentiated languages have been a major force in the process of linguistic differentiation.
These issues lead us to question the representativeness that present‐day linguistic cartography has with respect to the past. To accept this blindly, as Grollemund et al. (2015) and Currie et al. (2013) do, is hazardous, even when dealing with sedentary populations practicing agriculture. However, while the typical phylogeographic assumption about the correspondence of present and past locations is not a serious issue concerning large classifications addressing migrations over the whole Bantu linguistic domain, it becomes more serious in the frame of fine‐scale studies, like this one. For example, the penetration of the Fang populations and languages from the north of Gabon, which started about five centuries ago, proceeded to the detriment of languages of the MBETE group (B60)⁶⁸ because the Fang population, having an established warrior tradition, triggered a southwards migration of more peaceful ethnolinguistic groups. While this recent southwards migration is somewhat documented (because it happened after the first contact of Europeans with Gabon), it is easy to recognize that similar processes took place in the past even if no historical records document them. According to the distribution of archaeological dates (Fig. 6.24), the Bantu linguistic history of Gabon concerns at least three millennia of peopling, and this is a very long time in terms of linguistic differentiation and human settlements.

Coming back to the linguistic areas of Gabon, the mapping of two clusters we found, {B40} and {B50, B60, B70}, corresponds to zones that are geographically tight (Fig. 6.17B), while the group B20 and the cluster {B10, B30} are more scattered and widespread. An obvious working hypothesis is to consider such wider distributions as the relict of older migration waves, possibly from the north through the rainforest (B20) and/or following the Atlantic coast (B10, B30). Later, other migration waves followed (see Mouguiama‐Daounda 2005).
At this point we would like to remind the reader that this research stems from a project set by a group of linguists in Lyon (Laboratoire Dynamique du Langage, Lyon), linguists who had their own working hypotheses, one being the exogenous origin of all the linguistic varieties of Gabon, implying their progressive penetration from various directions. According to Vansina (1995) this took place as follows: the group B40 from the south; the groups B50, B60, B70 from the southeast; the group B10‐B30 and the group B20 from the northeast. As an example, a possible common origin of the group B40 and some languages of the group H10 (located south of Gabon in Congo) was conjectured, though there is no evidence of it in the most recent classification of Grollemund et al. (2015) (see section 6.1.2.2, third paragraph).

68 They were spoken in an area further north than the present one (Klieman 2003, p. 47).

In reality, the comprehensive and referential linguistic cartography of the whole Bantu linguistic domain of Bastin et al. (1999)⁶⁹ shows that the majority of the languages included in each linguistic group that is found today in Gabon are located within Gabon itself, or very close to it,⁷⁰ so we can conjecture the reverse phenomenon: perhaps all the Bantu languages of the group B that are today spoken in Gabon had an endogenous origin and later diffused to surrounding regions. This scenario is plausible concerning the varieties B70 because, even if many of them are located on the Congo side of the Batéké plateau, our classification provides evidence for a very robust cluster B60‐B70 (bootstrap score = 100) where all the languages B60 are generally located within Gabon, and not east of it. The same holds for the strong group B10‐B30, partly located along the sea and possibly linked to a coastal Bantu migration. It is difficult to draw conclusions for the group B40, similarly located along the coast, because the ALGAB documents too few varieties for it (5).

The geographical plot of the linguistic clusters we identified according to the Levenshtein classification shows that, even if past linguistic areas were larger and located differently, they have remained distinct, without the general phenomena of lexical convergence that their geographical proximity might suggest. If this were not so, we would not find the high bootstrap scores our consensus tree (Fig. 6.17A) reports, with the exception of the group B20 (having a bootstrap score = 73). This latter cluster deserves further discussion because its genetic unity is contradicted by all available studies.

6.4.3.2 Bastin et al. (1999) / MRAC

After a careful verification of the word lists provided by Bastin et al. (1999) and the correction of several inconsistencies and obvious transcription errors related to heterogeneous sources (a “cleaning” step that Currie et al. 2013 and Grollemund et al. 2015 did not explicitly undertake), we have applied the Levenshtein method to the varieties corresponding to those reported in the ALGAB. Unlike the ALGAB, the dataset of Bastin et al. (1999) has a slightly larger geographical coverage, as it also documents languages spoken in neighbouring Congo areas.

69 Generally correct besides some languages mapped outside their likely location because of opportunistic linguistic interviews (Currie et al. 2013).
70 Some B20 and B40 languages can be found south of Gabon; many B70 languages east of it.

The classification we obtained by using the Levenshtein algorithm (Fig. 6.19) differs from the one concerning the ALGAB, especially about the languages B20, which are split in two major and very robust clusters {B25, B252, B203} and {B251, B22b, B23}.⁷¹ Further, the varieties B201 and B24 are classified elsewhere, that is within a larger cluster (bootstrap score = 70) including languages of the groups B40; B50; B60 and B70 (see similar results by Nurse and Philippson 2003). This very large cluster was already reported and discussed by Bastin et al. (1999), who call it North‐central western Bantu (see Fig. 6.5). If one cluster of our analysis based on Gabon wordlists from Bastin et al. (1999) were to include all the varieties B20 (like the classification of the ALGAB), then we would have to admit the validity of a very large group including all the languages of the Guthrie zone B, besides those classified as B10/B30 (see cluster in blue with score = 67 in Fig. 6.19). While we do not have reasons to favour or reject the North‐central western Bantu hypothesis, we believe that it may have its origin in a long phase of language contact among geographically close languages (see previous section). The two datasets (ALGAB, Bastin) lead to a different clustering of the varieties B20 (group KELE); the discrepancy cannot be attributed to the different length of the wordlists (ALGAB: 132 items on average; Bastin et al.: 92 items) because, concerning B20, the ALGAB systematically reports wordlists that are much shorter than usual (Fig. 6.17A). By further comparing the two consensus trees we note that the one corresponding to the dataset of Bastin systematically yields a higher number of smaller clusters than the ALGAB does.
Moreover, unlike in the ALGAB classification, there is no unity for group B40, and group B50 stands alone, meaning that it is not clustered with the B60‐B70 varieties even when quite low bootstrap scores are allowed.

The easiest way to explain such inconsistencies is to attribute them to the different sampling schemes, to the different number of lexical items the two datasets embrace (with the noted exception of B20) and to random phenomena. To test the real effect of these potential biases, and to verify whether varieties labelled in the same way would cluster together anyway, we merged the two datasets to compute a single bootstrap consensus tree (not shown) that exhibits a number of interesting features:

1. The unity of the group B10‐B30 (bootstrap score = 99);
2. The close association of the languages B60‐B70 (bootstrap score = 82);
3. The coherence of the languages B50, which form one cluster (bootstrap score = 79) that is distinct from B60/B70;
4. The looseness of the cluster B40 (bootstrap score = 60);
5. The “explosion” of the group B20, now split into five different and independent clusters, generally subdivided into several robust subclusters.

71 There are phonological reasons supporting the KELE cluster vs. KOTA cluster, although contact phenomena make it less apparent (space contact, indirect contact related to multilingualism, migration contact). Van der Veen is investigating this question (forthcoming title: Une étude synchronique et diachroniques des voyelles de 38 variétés (idiolectales) du B20).

Two other aspects of the classification are interesting:

6. The varieties labelled in the same way, although they come from two independent databases, are generally clustered next to each other in the terminal leaves of the tree. This suggests that the discrepancies between the classifications are not related to significant heterogeneity in the wordlists and that, if there is heterogeneity, the Levenshtein method seems to overcome it.
7. The bootstrap support for the North‐central western Bantu cluster advocated by Bastin et al. (1999) becomes very weak (bootstrap score = 54). We are inclined to consider it a classification artefact determined by the strength of the group B10‐B30 which, acting as an outlier, creates a spurious “agglomeration” of the other varieties of zone B into a large cluster.

As we mentioned in the introduction, the higher diversity of the Bantu languages of the West, as opposed to those of the East, has been one of the reasons leading to the (correct) hypothesis of the earlier differentiation of the former, because divergence needs time to occur. We can now apply this way of thinking to group B20 in order to explain the topology of its differentiated clusters. By hypothesizing the genetic unity of B20 (which the classification of the ALGAB supports), it is possible to conjecture that this group entered Gabon earlier than the others but that its monophyletic origin is not clearly detectable (a hypothesis independently formulated by Mouguiama‐Daouda 2005) because of later secondary diversification linked to long‐lasting linguistic contact (still going on) that took place predominantly in the southern part of Gabon (Fig. 6.17B). This is to say that the B20 languages spoken by the Kota ethnic group (B25) might be closer to the ancestral type, given that they currently have no neighbours, but we do not know whether this was also true in the past. A more conservative scenario is to admit that not all the B20 languages are related, because they correspond to different migrations, but that they later and progressively converged to a large extent. Convergence also needs time to take place, which again implies an early migration of these languages (currently classified as B20). In both cases these varieties seem to predate the others in terms of diffusion in Gabon.

In contrast, we speculate that the looseness of group B40 (close to group H10, located south of Gabon) is due either to convergence between varieties that were not closely related and that came into contact, or to a progressive loss of their unity because of independent differentiation linked to geographical isolation. Whatever the reason, we believe that their tight and coherent linguistic area and their looseness are two antithetic factors suggesting a linguistic (and social?) dynamic that certainly deserves more attention.


6.4.4 Other sources of evidence

6.4.4.1 Population genetics

The DNA analyses of 17 populations representative of all the linguistic groups of zone B, including the Fang populations (A75), did not show noteworthy signals of genetic differentiation between them, meaning that the whole population of Gabon is quite homogeneous (see section 6.3.2). While nuclear DNA shows complete homogeneity (apart from the very small and highly consanguineous Evyia population), there are measurable, though very weak, differences between the 17 samples for the mitochondrial DNA and the Y‐chromosome. The Y‐chromosome exhibits a higher level of differentiation, which becomes apparent when the number of null pairwise distances in the corresponding FST distance matrices is taken into account (see Table 6.10). Y‐chromosome differentiation is especially significant for the Fang ethnic group, which clearly stands out. We assume that this is due to its recent immigration to Gabon from the north, meaning that these recent immigrants can be genetically distinguished from the other groups, which have been living in Gabon for a longer time.

A level of genetic differentiation that is lower for the mitochondrial DNA than for the Y‐chromosome is a result often described in the literature. In fact, the migration rate among many populations worldwide was estimated to be nearly eight times higher for females than for males. The difference was found to be noteworthy at both local and wide scales (Perez‐Lezaun 1999; Seielstad et al. 1998), though a later study did not confirm it at a global scale (Wilder et al. 2004).
The differential migration rate can be attributed to the widespread practice of patrilocality, in which women move into their husbands’ residences after marriage, a behaviour that occurs even in matrilineal societies.72 Nevertheless, this behaviour is not universal: for example, the differentiation between Eastern and Western Africa is less pronounced for the Y‐chromosome than for the mitochondrial DNA, contrary to the usual pattern. According to FST values for sub‐Saharan African populations, Destro‐Bisol et al. (2004) noted a striking difference in genetic structure between food producers (like Bantu‐speaking populations) and hunter‐gatherers (including Pygmy groups). In agreement with our analyses of the 17 populations from Gabon, Destro‐Bisol et al. (2004) found that the Y‐chromosome is more differentiated than the mitochondrial DNA in food producers; hunter‐gatherers show the reverse pattern. These authors suggested a model in which asymmetric gene flow, but also polygyny73 and patrilocality, are the factors at play, since the pressure of genetic drift and gene flow on maternal and paternal lineages differs between the two groups. Verdu et al. (2013) build on this model and find results consistent with a higher prevalence of polygyny and patrilocality in Bantu‐speaking populations than in Pygmy groups.

72 A matrilineal society is a society in which lineage, birthright and social classification are traced through the mother’s ancestry rather than the father’s, as is common in patriarchal (= patrilineal) societies.

If different migration rates between males and females matter, they are certainly not the only explanatory factor, because differences in effective population size74 between men and women, which can be explained by variance in reproductive success, also play a role. Such variance arises because of polygyny but also because of specific descent rules and the transmission of reproductive success75 (Heyer et al. 2012). Our samples concern agriculturalists (Bantu‐speaking populations), and we can attribute the pattern of genetic differentiation we obtained for the mitochondrial DNA and the Y‐chromosome to the factors listed above. It would probably be possible to assess the role of polygyny and patrilocality in Gabon, because much ethnological information about each donor was collected during the fieldwork (see section 6.2.3 Genetic sampling of the Gabon population), but we leave this to future work. We simply note that, for a majority of the ethnic groups of Gabon (see Table 6.5), the matrilineal transmission of the social lineage does not seem to have an effect, meaning that patrilocality and matrilinearity are likely to coexist in the same ethnic groups. More relevant to the subject of this work is the absence of correlation between genetic differences and geographic or linguistic distances.
We did not find genetic clusters making sense in terms of linguistic classifications, and this is hardly attributable to flawed sampling. In fact, when we filtered the dataset according to the birthplace of the donors, i.e. by excluding those born outside the area typically inhabited by their respective ethnic groups, the topologies of the resulting MDS plots were almost identical to those based on the full dataset (Figs. 6.21 and 6.22). The low number of meaningful FST distances is, in fact, the major result of our genetic analyses (Table 6.10), given that only one third of the pairwise distances are significant for the Y‐chromosome and only half for the mitochondrial DNA. There are almost no significant differences for autosomal markers.

73 Polygamy practiced by males.

74 The effective population size is a kind of normalization used to measure the importance of some parameters describing the genetic diversity. It was defined as the number of breeding individuals an idealised population would need to show the same amount 1) of dispersion of the genetic diversity under random genetic drift or 2) of inbreeding.

75 It has been noted that the number of children per lineage is not random; the descendants of families having many children tend to replicate this behaviour. The reverse is true as well.
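As a minimal illustration of the notion defined in footnote 74, the sketch below shows how an unequal number of breeding men and women lowers the effective population size. It assumes the classical textbook approximation for a population with separate sexes, Ne = 4·Nm·Nf / (Nm + Nf); this formula, the function name `effective_size` and the figures used are illustrative assumptions and are not taken from the analyses of this chapter.

```python
def effective_size(breeding_males: int, breeding_females: int) -> float:
    """Effective size for autosomal loci in a population with separate sexes.

    Textbook approximation (an assumption of this sketch):
    Ne = 4 * Nm * Nf / (Nm + Nf).
    """
    if breeding_males <= 0 or breeding_females <= 0:
        raise ValueError("both sexes need at least one breeding individual")
    total = breeding_males + breeding_females
    return 4 * breeding_males * breeding_females / total


# Hypothetical figures: 100 adult women and 100 adult men.
balanced = effective_size(100, 100)    # every adult reproduces
polygynous = effective_size(25, 100)   # only a quarter of the men reproduce
print(balanced, polygynous)
```

With the same number of adults, the second scenario yields a much smaller effective size, which is one way of seeing why variance in male reproductive success can increase drift on paternal lineages.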

An interesting study bringing together genetic and linguistic evidence over the whole Bantu linguistic domain (de Filippo et al. 2012) aimed at testing two models of Bantu dispersal: i) Early Split, north of the rainforest, of eastern and western populations (and languages) about 4000 ybp; ii) Late Split, south of the rainforest, of the eastern group from the western group about 2000 ybp. The authors measured genetic diversity as a function of the geographic distance from the Bantu homeland along possible inferred itineraries of migration and found a progressive reduction in genetic diversity with distance from the homeland. This pattern supports the demic diffusion76 of the Bantu expansion and correlates better with the Late Split model. De Filippo et al. present these conclusions as being in agreement with patristic linguistic distances computed on Bayesian linguistic phylogenies. But is this match a conclusive argument or a chance effect? While the demic diffusion process is very likely (Diamond and Bellwood 2003; Salas et al. 2002; de Filippo et al. 2011), the correlations of de Filippo et al. (2012) are not a final confirmation of one kind of split, because they are based on a low number of unevenly distributed populations. Interestingly, a majority of the samples at a short distance from the Bantu homeland are those of the ethnic groups of Gabon. While we found very few genetic differences between them, these authors found that their haplotype diversity is higher than in other Bantu populations located farther from the Bantu homeland. Is this evidence sufficient to choose between an Early Split and a Late Split? The answer to the two questions above probably comes from Li et al. (2014).
By using microsatellite data and a Bayesian approach, they simulated various demographic scenarios and estimated the first expansion of the Bantu‐speaking groups at around 5600 years ago. They found that, to explain the genetic variability of eastern sub‐Saharan Africa, a migration to the east and then to the south is statistically as likely as other models consisting of direct migrations from the west and/or implying significant gene flow from and to the western branch of Bantu speakers. This is why the match with linguistic distances observed by de Filippo et al. (2012) might relate to chance and to uneven sampling.

76 The Wikipedia definition is excellent and we report it here: Demic diffusion is a demographic term referring to a migratory model, developed by Luigi‐Luca Cavalli‐Sforza, of population diffusion into and across an area that had been previously uninhabited by that group, possibly, but not necessarily, displacing, replacing, or intermixing with pre‐existing populations.

In any case, considerable linguistic differences are found in the population of Gabon, a population that is genetically homogeneous and undifferentiated today, but whose genetic diversity, as a whole, is not totally negligible when we compare it to other Bantu‐speaking groups. Both aspects are in agreement with the closeness of Gabon to the Bantu homeland, as this region, with reference to the Bantu dispersal, has been peopled for a long time.
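The haplotype diversity discussed above in connection with de Filippo et al. is commonly estimated with Nei's unbiased estimator, H = n/(n − 1) · (1 − Σ pᵢ²), where n is the sample size and pᵢ the frequency of the i‐th haplotype. The sketch below assumes this estimator (the text does not state which one those authors used); the function name and the toy labels are hypothetical.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's unbiased haplotype diversity for a sample of haplotype labels.

    H = n / (n - 1) * (1 - sum(p_i ** 2)), with p_i the sample frequencies.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled haplotypes")
    counts = Counter(haplotypes)
    sum_squared_freqs = sum((count / n) ** 2 for count in counts.values())
    return n / (n - 1) * (1 - sum_squared_freqs)


# Hypothetical labels, not real data: two haplotypes, each observed twice.
print(haplotype_diversity(["h1", "h1", "h2", "h2"]))
```

The value is 0 when every sampled individual carries the same haplotype and approaches 1 when all sampled haplotypes differ.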

6.4.4.2 Music

The recent publication of a study addressing the diversity of music practices in Gabon (Le Bomin et al. 2016) attracted our attention. Although music, like language, is an important aspect of the cultural identity of societies, a computational comparison between the two is difficult because of the general lack of suitably coded musical data. Le Bomin et al. (2016) present an original dataset concerning 28 ethnic groups divided into 58 subgroups, which accounts for many features: the social context in which the music is made; the instruments that are used; and intrinsic parameters of the music itself such as metrics, rhythm, and melody. These traits are coded as multistate characters, which makes phylogenetic inference possible. By using cladistic methods, the authors try to identify the likely mode of transmission of musical traditions (vertical versus horizontal) and claim that the transmission is vertical. The tree they publish is divided into two major clusters, corresponding respectively to patrilineal and matrilineal populations, although there are exceptions.

We have reported that the linguistic classifications we computed have no correspondence with such social categories; this is why we were not surprised to find no obvious correlation between our classification of languages and musical traditions.77 But the tree of Le Bomin et al. (2016) also attracted our attention from a methodological point of view, as the solidity of its topology was assessed by using Bremer indices (Bremer 1994) which, for some hierarchically important nodes, are null or extremely low, meaning that these nodes collapse in a strict consensus tree. The Bremer index is somewhat analogous to the bootstrap test of robustness (the one used for the linguistic classifications included in this study), but its meaning is not as obvious, and a direct comparison of the two is not possible because they rely on completely different approaches (Müller 2005).
This is why we decided to test the robustness of the tree published by Le Bomin et al. (2016) by applying the bootstrap procedure to the data ourselves, that is, by computing a consensus tree with the very same cladistic method used by Le Bomin et al. (Fig. 6.26). The resulting bootstrap consensus tree points to the considerable weakness of many nodes and raises serious doubts about the vertical transmission of music along social categories such as patrilinearity and matrilinearity.

77 The comparison was made by adding, to the names of the populations appearing in the tree of Le Bomin et al. (2016), the Guthrie code corresponding to the linguistic variety spoken by each group. We were able to do so for all of them.

Figure 6.26 Bootstrap consensus tree of the classification of traditional music practices in Gabon according to a cladistic analysis. Original dataset from Le Bomin et al. (2016). Nodes supported by less than 70% of the subreplicates have been collapsed. Pygmy populations (gray labels on the leaves of the tree) are not discussed in the text. The names of the major linguistic (Guthrie’s) groups are reported on the clusters (B20, B30, etc.), but ‘D’, ‘E’ and ‘H’ are general labels used to identify clusters and have no relation with linguistic classifications.

The bootstrap tree of Fig. 6.26 shows that 21 populations do not form clusters, while 37 are clustered. This time, unlike in the tree originally published, musical differences make sense linguistically, and we find clusters corresponding to the linguistic groups A75, B10, B20, B30 and B60‐B70. This match does not mean that all the samples corresponding to a given linguistic type are clustered together, as some of them fall in the undifferentiated part of the tree, which suggests caution in the interpretation. For example, the cluster labelled B30 is made up of two of the three populations speaking the language B31, the third being in the undifferentiated part of the tree together with four other groups linguistically classified as B30. Some other clusters (D and E in Fig. 6.26) encompass linguistically heterogeneous groups. Interestingly, the six populations speaking languages classified as B40 do not form a cluster and are all located in the undifferentiated part of the tree.

To conclude, the bootstrap consensus tree of musical traditions is quite compatible with available linguistic classifications, and there are interesting correspondences with the linguistic cluster B60‐B70. The weak linguistic cohesion of the B40 varieties is also mirrored in the music tree.

The perspectives that this pioneering classification of musical data opens for multidisciplinary research in the digital humanities are exciting. A reanalysis of the dataset of Le Bomin et al. (2016), with a separate treatment of the parameters concerning the social context and of those related to intrinsic musical properties (metrics, rhythm, and melody), will probably lead to a better cross‐comparison.

6.4.5 General conclusions

6.4.5.1 The performance of the Levenshtein approach

We have shown that the Levenshtein method is an efficient approach to classifying Bantu languages and that the categorizations obtained in this way are compatible with Bayesian methods based on the cognate coding of the same wordlists. In this work we did not address the variability of the full Bantu linguistic domain, focusing only on linguistic data from Gabon, but this is something we plan for future work. In our opinion the Levenshtein approach is likely to outperform methods based on cognate coding, because the alignment of words delivers more information than the simple classification into sets of cognates. Moreover, as mentioned, it is not necessary to identify the cognates to run Levenshtein analyses, and this is a considerable advantage in historical linguistics. Another virtue of the Levenshtein method is that it readily delivers pairwise distance matrices that make possible a wide array of other analyses: multidimensional projections, correlations with distance matrices pertaining to other data (geographic distances, genetic variability, etc.), and spatial analyses to identify discontinuities in the pattern of change (Manni et al. 2004). Distance matrices can also be obtained from Bayesian phylogenies by computing patristic distances, which are distances calculated from tree branch lengths, but these distances rely entirely on the topology of the tree, meaning that they depend on the many parameters used to compute it. The two approaches are not mutually exclusive but complementary, and there is no reason to disregard one or the other: both are equally good, with the provisos stated above and below.
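To make the step from wordlists to a pairwise distance matrix concrete, the sketch below computes the plain Levenshtein distance between two transcriptions and averages it over the items shared by two varieties. It is a minimal sketch under stated assumptions: unit costs for all operations, normalisation by the length of the longer word, and invented strings for three hypothetical varieties. The analyses of this chapter may rely on different segment costs and a different normalisation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def variety_distance(words_a, words_b):
    """Mean length-normalised distance over the items attested in both lists."""
    scores = []
    for word_a, word_b in zip(words_a, words_b):
        if word_a and word_b:  # skip items missing in either variety
            scores.append(levenshtein(word_a, word_b) / max(len(word_a), len(word_b)))
    if not scores:
        raise ValueError("the two varieties share no items")
    return sum(scores) / len(scores)


# Invented strings for three hypothetical varieties (not real data);
# an empty string marks an item that was not recorded.
wordlists = {
    "V1": ["taka", "lomi", "sune"],
    "V2": ["taka", "lomu", "sun"],
    "V3": ["daga", "romi", ""],
}
names = sorted(wordlists)
matrix = {(x, y): variety_distance(wordlists[x], wordlists[y])
          for x in names for y in names}
print(matrix[("V1", "V2")], matrix[("V1", "V3")])
```

The resulting dictionary plays the role of the pairwise distance matrix mentioned above and can be passed on to clustering, multidimensional scaling or matrix correlation procedures. No cognate judgement is needed at any step.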

6.4.5.2 Provisos

Equating tree topologies to migration paths of human populations is certainly an oversimplification, which can nevertheless provide hypotheses that can be further tested. The same limitation applies to phylogeographic approaches inferring the past location of ancestral languages from present‐day distributions. This is particularly true for Gabon. Here linguistic diversification has been proceeding for a time span comparable to the evolution of Latin from a dialect spoken in central Italy to the present‐day Romance languages. Unlike Bantu, Latin did not spread by demic diffusion, but Bantu languages have certainly been influenced by similar phenomena such as i) population admixture related to migrations (Africa was not empty when Bantu‐speaking populations were migrating), ii) population displacements (like the recent southward movement of the Mbete ethnic group following Fang immigration), iii) linguistic replacement (like the wide language shift of the Pygmies to Bantu languages), iv) in‐situ diversification (Bantu languages form large dialect chains) and v) secondary linguistic contact. The latter certainly took place in an environmental context that was changing over time (forest → savannah corridors → forest).

A closer look at Fig. 6.24 clearly shows that about 1000 ybp there was a considerable drop in the number of human‐made artefacts. We equate it to a population drop of about 70%. Given the very low population densities of Gabon, this means that the region experienced a severe crisis that certainly modified the pattern of human settlement. While there are no explanations for the crisis (epidemic?), a phenomenon of this magnitude necessarily transformed speech communities, leading first to a loss of linguistic variability and later to a phase of diversification. Do we have the correct instruments to address linguistically such a highly complex and historically undocumented succession of events?
On a large scale, all we have are wordlists of 92 items (Bastin et al. 1999) concerning the core vocabulary, which is more resistant to borrowing and better conveys the historical signal (see Haspelmath and Tadmor 2009). When we computed Cronbach’s α on the full ALGAB dataset,78 we obtained a value of 0.93, meaning that we have enough data for a clear signal. In a very large study of the borrowability of basic vocabulary, Tadmor et al. (2010) found that nouns are more borrowable than adjectives or verbs. Would this mean that the historical signal of relatedness of the Bantu languages spoken in Gabon is better mirrored by the ALGAB because it contains 15 additional verbs with respect to Bastin et al. (1999)? A partial answer will come from reprocessing the ALGAB wordlists while keeping only the 92 words used by Bastin, something we will address in the near future, notwithstanding the similarity of Bantu languages. In fact, closely related languages are more likely to borrow from each other (McMahon 1994, p. 204), because borrowing is easier between mutually intelligible languages, in which loanwords sound “familiar” anyway.

78 Cronbach’s α is a measure of consistency in the data (Cronbach 1951).

This aspect raises the question of the diversification of the so‐called Bantu languages. While we will stay away from the quicksand of defining dialects versus languages, we note that Bantu languages behave as a continuum rather than as discrete categories (Schadeberg 2003, p. 158). This is clearly visible in the multidimensional scaling plot we computed on ALGAB data (Fig. 6.18) and is confirmed by the high correlation between linguistic and geographic distances (r = 0.478**, Mantel test). Unfortunately, experiments on mutual intelligibility are missing; we believe that they would be of considerable help in reframing the research question about Bantu varieties.

All the above provisos, often stated and then forgotten, suggest great caution in describing the peopling phases of Bantu groups in specific regions that are much smaller than the full Bantu linguistic domain, as is the case for Gabon.
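The Mantel test mentioned above correlates two distance matrices computed on the same set of objects and assesses significance by randomly relabelling the objects of one matrix. The sketch below is a minimal, one‐sided permutation version written for illustration only; the function names, the number of permutations and the toy matrices are assumptions and do not reproduce the software or the settings used in this chapter.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mantel(d1, d2, permutations=999, seed=0):
    """One-sided Mantel test between two square symmetric distance matrices.

    Returns (r, p): the correlation of the upper triangles and the share of
    arrangements (the observed one included) whose r is at least as high.
    """
    n = len(d1)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    x = [d1[i][j] for i, j in pairs]
    observed = pearson(x, [d2[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    at_least_as_high = 1  # the observed arrangement counts once
    for _ in range(permutations):
        rng.shuffle(order)  # relabel the objects of the second matrix
        permuted = [d2[order[i]][order[j]] for i, j in pairs]
        if pearson(x, permuted) >= observed:
            at_least_as_high += 1
    return observed, at_least_as_high / (permutations + 1)


# Hypothetical toy input: five sites on a line and a second matrix that is
# exactly twice the first one.
positions = [0, 1, 2, 3, 5]
geo = [[abs(a - b) for b in positions] for a in positions]
ling = [[2 * value for value in row] for row in geo]
r, p = mantel(geo, ling)
print(round(r, 3), p)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations, which is why an ordinary significance test on the correlation coefficient would not be appropriate here.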

6.4.5.3 The Bantu peopling of Gabon

Palaeoenvironmental and archaeological studies show that the opening of savannah plains in the coastal region of Gabon started about 4000 ybp (Bostoen et al. 2015), with a Neolithization process dated at around 3500 ybp (Oslisly 2001) and a detectable sedentarization starting at 2700 ybp in northern Gabon (Bostoen et al. 2015). On the basis of linguistic cartography, we suggest that the B20 varieties (which form one cluster with ALGAB data) emerged after an early migration southwards from Cameroon, through the rainforest, to the north of Gabon. It is possible that one or more other early migration waves followed the Atlantic coast from Cameroon to Gabon (varieties B10 and B30). Other languages probably emerged or arrived later (B40, B50, B60, B70), and the ethnic groups speaking them have remained for a long time within a defined geographic region, as their geographically continuous linguistic areas show.

Apart from the Fang ethnic group, which migrated to Gabon over the last five centuries and shows genetic signals of differentiation in the variability of the Y‐chromosome (but not of the mitochondrial DNA, meaning that women did not necessarily accompany the men in this migration), the Bantu‐speaking populations of Gabon are genetically homogeneous. This means that the different migration waves concerned closely related people, thereby confirming the demic diffusion process of the Bantu‐speaking populations. Nevertheless, 3000 years of history are enough to allow a genetic differentiation detectable on uniparental markers (Y‐chromosome, mitochondrial DNA). The fact that we do not find evidence for it suggests that considerable gene flow has taken place over time and that the construction of social identity and ethnic belonging may rely more on social norms and cultural aspects than on ancestry.
A link between the two might be investigated by focusing on patrilinearity and matrilinearity, but genetic and cultural evidence (linguistics, musicology) does not support these categories as particularly relevant. However, patterns of genetic diversity in manioc plants (Manihot esculenta) cultivated in Gabon have been deciphered by analyzing the rules that structure society.79 Marriage prohibitions and kinship systems do structure social networks of seed exchange between farmer communities and influence the movement of seeds in metapopulations, shaping crop diversity at local and regional levels (Delêtre et al. 2011). It remains to be ascertained when these rules were established, because the genetic diversification of cultivated plants (by artificial selection) proceeds much faster than the genetic diversification of human populations, which reproduce, on average, every 25 years.

To end, we would like to mention once more the population crisis that started 1000 years ago (Oslisly et al. 2013; Fig. 6.24). It was brutal enough to scramble pre‐existing patterns of genetic and cultural diversity, which leads us to suspect that the later cultural and genetic variability of Gabon has been largely shaped by population dynamics that took place over the last millennium. These dynamics seem to blur the historical signal. For some perspectives of investigation please see section 8.2.2.3 in CHAPTER 8, General conclusions and new prospects.

Acknowledgments:

We thank Professor Jean‐Marie Hombert, Professor Gérard Philippson and Professor Lolke Van der Veen (Laboratory Dynamique du langage, Lyon, France) for giving access to the Tanzanian and to the ALGAB database and, more importantly, for their continued support over the years. L. Van der Veen has reviewed this chapter and provided valuable input for future research directions. We also thank Professor Derek Nurse and Dr. Rebecca Grollemund for comments and advice about their published work. It is a pleasure to mention Dr. Marie‐Françoise Rombi for sharing with us her sophisticated knowledge about Bantu linguistics and Paul Verdu for typing autosomal markers, performing Procrustes analyses and reading the manuscript. Wilbert Heeringa processed the linguistic varieties from Tanzania, Bart Alewijnse those from Gabon. We are indebted to Dr. Pierre Darlu for the reanalysis of music data from Gabon.

79 Marc Delêtre (forthcoming) is pushing this research forward by analyzing the genetic diversity of the viruses that are associated with Manihot esculenta, with the aim of comparing their genetic diversity to the cultural diversity of Gabon.

References:

Alewijnse B., Nerbonne J., van der Veen L., Manni F. 2007. A Computational Analysis of Gabon Varieties In: P. Osenova et al. (eds.) Proceedings of the RANLP Workshop on Computa‐ tional Phonology, Workshop at the conference Recent Advances in Natural Language Phonology Borovetz (Bulgaria), pp. 3‐12. Arom S., Thomas J. M. C. 1974. Les Mimbo, génies du piégeage et le monde surnaturel des Ngbaka‐maʹbo (R.C.A). Paris: SELAF. Bahuchet S. 2012. Changing language, remaining Pigmy. Human Biology, 84: 11‐43. Bastin Y. 1983. Essai de classification de quatre‐vingt langues bantoues par la statistique grammaticale. Africana linguistica, 9: 11‐108. Bastin Y., Coupez A., de Halleux B. 1979. Statistique lexicale et grammaticale pour la classi‐ fication historique des langues bantoues. Bulletin des séances de l’Académie royale de Sciences d’Outre Mer, 3: 375‐387. Bastin Y., Coupez A., Mann M. 1999. Continuity and Divergence in the Bantu Languages: Per‐ spectives from a Lexicostatistic Study. Tervuren: MRAC. Batini C., Ferri G., Destro‐Bisol G., Brisighelli F., Luiselli D., Sánchez‐Diz P., Rocha J., Simonson T., Brehm A., Montano V., Elwali N.E., Spedini G., DʹAmato M.E., Myres N., Eb‐ besen P., Comas D., Capelli C. 2011. Signatures of the preagricultural peopling processes in sub‐Saharan Africa as revealed by the phylogeography of early Y chromosome lineages. Molecular Biology and Evolution, 28: 2603‐2613. Berniell‐Lee G., Calafell F., Bosch E., Heyer E., Sica L., Mouguiama‐Daouda P., Van der Veen L., Hombert J‐M., Quintana‐Murci L., Comas D. 2009. Genetic and demographic im‐ plications of the Bantu expansion: insights from human paternal lineages. Molecular Biology and Evolution, 26:1581‐9. doi: 10.1093/molbev/msp069. Epub 2009 Apr 15. Bostoen K., Clist B., Doumenge C., Grollemund R., Hombert J. M., Muluwa J. K., Maley Jean. 2015. Middle to late Holocene paleoclimatic change and the early Bantu expansion in the rain forests of Western Central Africa. 
Current Anthropology, 56: 354‐384. Bremer K. 1994. Branch support and tree stability. Cladistics, 10: 295–304 Caballero A. 1995. On the Effective Size of Populations with Separate Sexes, with Particular Reference to Sex‐Linked Genes. Genetics, 139: 1007–1011. Cann R.L., Stoneking M., Wilson A.C. 1987. Mitochondrial DNA and human evolution. Na‐ ture, 325: 31‐36. Cavalli‐Sforza L‐L., Menozzi P., Piazza A. 1994. The History and Geography of Human Genes. Princeton (NJ) Princeton University Press. Clist B. 2005. Des premiers villages aux premiers européens autour de l’estuaire du Gabon : quatre millénaires d’interaction entre l’homme et son milieux. PhD dissertation. Bruxelles, Université Libre de Bruxelles. LINGUISTIC PROBES INTO THE BANTU HISTORY OF GABON 227

Cronbach L. 1951. Coefficient alpha and the internal structure of tests. Psychometrika, 16: 297‐334. Currie T.E., Meade A., Guillon M., Mace R. 2013. Cultural phylogeography of the Bantu languages of sub‐Saharan Africa. Proceedings of the Royal Society B. Biological Sciences. 280: DOI: 10.1098/rspb.2013.0695 de Filippo C., Barbieri C., Whitten M., Mpoloka S.W., Gunnarsdóttir E.D., Bostoen K., Nyambe T., Beyer K., Schreiber H., de Knijff P., Luiselli D., Stoneking M., Pakendorf B. 2011. Y‐chromosomal variation in Sub‐Saharan Africa: Insights into the ‐Congo groups. Molecular Biology and Evolution, 28: 1255–1269. de Filippo C., Bostoen K., Stoneking M., Pakendorf B. 2012. Bringing together linguistic and genetic evidence to test the Bantu expansion. Proceedings of the Royal Society B. 279: DOI: 10.1098/rspb.2012.0318 Delêtre M., McKey D., Hodkinson T. 2011. Marriage exchanges, seed exchanges, and the dynamics of manioc diversity. Proceedings of the National Academy of Sciences USA, 108: 18249–18254. Destro‐Bisol G., Donati F., Coia V., Boschi I., Verginelli F., Caglià A., Tofanelli S., Spedini G., Capelli C. 2004. Variation of Female and Male Lineages in Sub‐Saharan Populations: the Importance of Sociocultural Factors. Molecular Biology and Evolution, 21: 1673‐1682. Diamond J., Bellwood P. 2003 Farmers and their languages: the first expansions. Science, 300: 597–603. Dieu M., Renaud P. (eds.). (1983), Atlas linguistique du Cameroun ALCAM : Inventaire préliminaire. Paris / Yaoundé : ACCT / CERDOTOLA Doneux J.L. 2003. Histoire de la linguistique africaine. Aix en Provence: Publications de l’Université de Provence. Ehret C. 1999. Subclassifying Bantu: the evidence of stem morpheme innovations. In: J.M. Hombert and L.M. Heyman (eds.) Bantu historical linguistics: theoretical and empirical perspec‐ tives. Stanford (CA): CSLI, pp. 43‐147. Ehret C. 2002. Language family expansion: broadening our understandings of cause from an African perspective. In: P. Bellwood and C. 
Renfrew (eds.). Examining the farming / language dispersal hypothesis. Cambridge (UK): McDonald Institute Monographs, pp. 163‐176. Excoffier L. and Lischer H.E. L. 2010. Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources, 10: 564‐567. Gonthier E. 1987. Etude du matériel lithique des Papous indonesiens. Ecole des Hautes Etudes en Sciences Sociales, Paris (France), unpublished PhD dissertation available at the Library of the Institut de Paléontologie Humaine (IPH), Paris. Greenberg J.H. 1955. Studies in African linguistics classification. New Haven: The Compass Publishing Company. Grimes B. F. (ed.). 2000. Ethnologue. Dallas: SIL International. 2 vols., (14th edition).

Grollemund R., Branford S., Bostoen K., Meade A., Venditti C., Pagel M. 2015. Bantu expan‐ sion shows that habitat alters the route and pace of human dispersals. Proceedings of the Na‐ tional Academy of Sciences USA, 112: 13296–13301. Guthrie M. 1948. The classification of the Bantu languages. London: Oxford University Press for the International African Institute. Guthrie M. 1967. Comparative Bantu. Farnborough: Gregg International Publishers Ltd. Vols. 1‐4. Guthrie M. 1971. The western Bantu languages. In: T.A. Sebeok (ed.) Current trends in lin‐ guistics, 7: Linguistics in sub‐Saharan Africa. The Hague and Paris: Mouton & Co., pp. 357‐366. Haspelmath M., Tadmor U. 2009. Loanwords in world’s languages. Berlin‐New York: De Gruyter Mouton. Heeringa W., Kleiweg, P., Gooskens C., Nerbonne J. 2006. Evaluation of string distance al‐ gorithms for dialectology. In: Proceedings of the Workshop on Linguistic Distances. Sydney (Australia): Association for Computational Linguistics, pp. 51–62. Heeringa W. 2004. Measuring dialect pronunciation differences using Levenshtein distance. Groningen Dissertations in Linguistics 46. PhD dissertation, Groningen: University of Gro‐ ningen. Heggert M. 2004. The Bantu problem and African archaeology. In: A.B. Stahl (ed.) African archaeology a critical introduction. Blackwell Publishing, pp.301‐326. Heine B. 1973. Zur genetischen Gliederung der Bantu‐sprachen. Afrika und Űrbersee, 56: 164‐ 195. Heyer E., Chaix R., Pavard S., Austerlitz F. 2012 Sex‐specific demographic behaviours that shape human genomic variation. Molecular Ecology, 21: 597‐612. Holden C.J. 2002. Bantu language trees reflect the spread of farming across sub‐Saharan Africa: a maximum‐parsimony analysis. Proceedings of the Royal Society B. Biological Sciences, 22: 793‐9. Holden C.J., Gray R.D. 2006. Rapid radiation, borrowing and dialect continua in the Bantu languages. In: P. Forster and C. Renfrew (eds.) Phylogenetic methods and the prehistory of lan‐ guages. 
Cambridge (UK): McDonald Institute Monographs, pp. 19‐32. Hombert J.M., Medjo Mvé P. Nguéma, R. 1989. Les Fangs sont‐ils bantu? Pholia, 4: 133‐147. Hombert J.M., 1990a. Atlas linguistique du Gabon. Revue gabonaise des Sciences de lʹhomme, 2: 37‐42. Hombert J.M., 1990b, Présentation de lʹAlphabet scientifique des langues du Gabon. Revue gabonaise des Sciences de lʹhomme, 2: 105‐112. Idiata D.F. 2007. Les langues du Gabon. Paris: L’Harmattan. Klieman K. 2003. The Pygmies were our compass. Bantu and Batwa in the history of West Central Africa, early times to c. 1900 C.E. Portsmouth (NH): Heinemann.

Laburthe‐Tolra P. 1981. Les seigneurs de la forêt: essai sur le passé historique, l’organisation sociale et les normes éthiques des anciens Beti du Cameroun. Paris: EdKarthala. Le Bomin S., Lecointre G., Heyer E. 2016. The Evolution of Musical Diversity: The Key Role of Vertical Transmission. PLoS One, http://dx.doi.org/10.1371/journal.pone.0151570 Lézine A‐C., Assi‐Khaudjis C. , Roche E., Vincens A., Achoundong G. 2013. Towards an understanding of West African montane forest response to climate change. Journal of Bio‐ geography, 40: 183–196. Li S., Schlebusch C., Jakobsson M. 2014. Genetic variation reveals large‐scale population expansion and migration during the expansion of Bantu‐speaking peoples. Proceedings of the Royal Society B. 281: DOI: 10.1098/rspb.2014.1448 Maho J.F. 2003. A classification of the Bantu languages: an update of Guthrie’s referential system. In: D. Nurse and G. Philippson (eds.) The Bantu languages. Language family series, n. 4. London and New York: Routledge, pp. 639‐651. Maho J.F. 2009. NUGL online: The online version of the New Updated Guthrie List, a refer‐ ential classification of the Bantu languages. [goto.glocalnet.net/mahopapers/nuglonline.pdf] Maley J. 2001. La destruction catastrophique des forêts dʹAfrique centrale survenue il y a environ 2500 ans exerce encore une influence majeure sur la répartition actuelle des forma‐ tions végétales. Systematic and Geography of Plants, 71: 777‐796. Mann M., Dalby D. 1987. A Thesaurus of African Languages. London: Hans Zell Publishers. Manni F., Guérard E., Heyer E. 2004. Geographic patterns of (genetic, morphcooologic, lin‐ guistic, etc.) variation: how barriers can be detected by Monmonier’s algorithm. Human Bi‐ ology, 76:173‐90. Marten L. 2006. Bantu classification, Bantu trees and phylogenetic methods. In: P. Forster and C. Renfrew (eds.) Phylogenetic methods and the prehistory of languages. Cambridge (UK): McDonald Institute Monographs, pp. 43‐56. McMahon A. 1994. 
Understanding language change. Cambridge (UK): Cambridge University Press. Medjo Mvé, P. 1997. Essai sur la phonologie panchronique des parlers fang du Gabon et ses implications historiques, PhD dissertation, Sciences du Langage, University Lumière Lyon 2, Lyon, p. 544. Meinhof C. 1899. Grundriss einer Lautlehre der Bantusprachen, nebst Anleitung zur Aufnahme von Bantusprachen. Anhang: Verzeichnis von Bantuwortstämmen. Leipzig, In commission bei F.A. Brockhaus. Meinhof C., Van Warmelo N.J. 1932. Introduction to the Phonology of the Bantu Languages. Berlin: D. Reimer and E. Vohser Publishers. Mennecier P., Nerbonne J., Heyer E., Manni F. 2016. A Central‐Asian survey. Language Dynamics and Change. 6: 57‐98.

Meunier Q., Boldrini S., Moumbogou C., Morin A., Ibinga S., Vermeulen C. 2014. Place de l’agriculture itinérante familiale dans la foresterie communautaire au Gabon. Bois et Forêts des Tropiques, 319: 65‐69. Mokrani S. 2016. Etude comparée des parlers du groupe Bantu KOTA‐KELE (B20) du Ga‐ bon: a la recherche de nouveaux critères classificatoires. PhD dissertation, Lyon: University of Lyon2. Mouguiama‐Daouda P. 2005. Contribution de la linguistique à l’histoire des peuples du Ga‐ bon. Paris: CNRS Editions. Mouguiama‐Daouda P., Van der Veen L.J. 2005. B10‐B30 : conglomérat phylogénétique ou produit d’une hybridation. In: K. Bostoen K., J. Maniacky J. (eds.), Studies in African Com‐ parative Linguistics, with special focus on Bantu and Mande. Tervuren: Royal Museum for Cen‐ tral Africa (RMCA/MRAC), Sciences Humaines, pp. 1781‐9857. Müller K.F. 2005. The efficiency of different search strategies in estimating parsimony jack‐ knife, bootstrap, and Bremer support. BMC Evolutionary Biology, 5: 58. Nurse D. 1997. The contributions of linguistics to the study of the . Journal of African history, 38: 359‐91. Nurse D. 2001. A survey report for the Bantu languages. Dallas: Summer Institute of Lin‐ guistics, [online publication at http://www‐01.sil.org/silesr/2002/016/silesr2002‐016.htm ac‐ cessed the 9/6/2016]. Nurse D., Philippson G. 2003. Towards a historical classification of the Bantu. In: D. Nurse and G. Philippson (eds.), The Bantu languages. London: Routledge, pp. 164‐179. Nurse D., Philippson G.. 1975. The North‐East Bantu Languages of Tanzania and Kenya: a Classification. KiSwahili, 45: 1‐28. Nurse D., Phillipson G. 1980. The Bantu languages of East Africa: A lexicostatistical survey. Language in Tanzania. (E. D. Polomé and C. P. Hill, eds.) London: International African Institute by OUP. Nurse D. 1979. Description of sample Bantu languages of Tanzania. African Languages, 5: 1‐ 150. Oliver R. 1966. The problem of the Bantu expansion. 
Journal of African History, 7: 361‐376. Oslisly R. 2001. The history of human settlement in the middle Ogooué valley (Gabon): implications for the environment. In: Weber W., White L.J.T., Vedder A. Naughton‐Treves L. (eds) African rain forest ecology and conservation. New Haven: Yale University Press, pp. 101‐18. Oslisly R., Bentaleb I., Favier C., Fontugne M., Gillet JF. 2013. West Central African peoples: survey of radiocarbon dates over the past 4000 years, Proceedings of the 21st International Radiocarbon Conference, A.J.Jull & C. Hatté Eds., Radiocarbon, 55: 1377–1382. Patin E., Laval G., Barreiro L.B., Salas A., Semino O., Santachiara‐Benerecetti S., Kidd K.K., Kidd J.R., Van der Veen L., Hombert J.M., Gessain A., Froment A., Bahuchet S., Heyer E., Quintana‐Murci L. 2009. Inferring the demographic history of African farmers and pygmy hunter‐gatherers using a multilocus resequencing data set. PLoS Genetics, 5: e1000448.

Pérez‐Lezaun A., Calafell F., Comas D., Mateu E., Bosch E. et al. 1999. Sex‐specific migration patterns in Central Asian populations, revealed by analysis of Y‐chromosome short tandem repeats and mtDNA. The American Journal of Human Genetics, 65: 208‐219 Phillipson D.W. 1976. Archaeology and Bantu linguistics. World Archaeology, 8: 65‐82. Phillipson D.W. 1977a. Later prehistory of eastern and southern Africa. London: Heinemann. Phillipson D.W. 1977b. The spread of the Bantu languages. Scientific American 236: 106‐14. Phillipson D.W. 2002. Language and farming dispersals in Sub‐Saharan Africa. In: P. Bell‐ wood and C. Renfrew (eds.). Examining the farming / language dispersal hypothesis. Cambridge (UK): McDonald Institute Monographs. Phillipson D.W. 2005. African archaeology. Third edition. Cambridge, Cambridge University Press. Quintana‐Murci L., Quach H., Harmant C., Luca F., Massonnet B., Patin E., Sica L., Mouguiama‐Daouda P., Comas D., Tzur S., Balanovsky O., Kidd K.K., Kidd J.R., van der Veen L., Hombert J.M., Gessain A., Verdu P., Froment A., Bahuchet S., Heyer E., Dausset J., Salas A., Behar D.M. 2008. Maternal traces of deep common ancestry and asymmetric gene flow between Pygmy hunter‐gatherers and Bantu‐speaking farmers. Proceedings of the Na‐ tional Academy of Sciences USA, 105: 1596‐601. Renfrew C. 1990. Archaeology and Language: The Puzzle of Indo‐European Origins. New York: Cambridge University Press. Renfrew C. 2000. At the edge of knowability: Towards a prehistory of languages. Cambridge Archaeological Journal, 10: 7‐34. Renfrew C. 2010. Archaeogenetics —Towards a ‘New Synthesis’? Current Biology, 20: 162‐ 165. Rexova K., Bastin Y., Frynta D. 2006. Cladistics analysis of bantu languages: a new tree based on combined lexical and grammatical data. Naturwissenschaften 93: 189‐194. Salas A., Richards M., De la Fe T., Lareu M.‐V., Sobrino B., Sánchez‐Diz P., Macaulay V., Carracedo A. 2002. The making of the African mtDNA landscape. 
American Journal of Human Genetics, 71: 1082–1111. Schadeberg T. 2003. Historical linguistics. In: D. Nurse and G. Philippson (eds.) The Bantu languages. London, Routledge, pp. 143‐163. Seielstad M.T., Minch E., Cavalli‐Sforza L.L. 1998. Genetic evidence for a higher female migration rate in humans. Nature Genetics. 20: 278−280. Simons G. (ed). 2016. Ethnologue. Languages of the world. Dallas (TX): SIL International. Internet publication accessible at www.ethnologue.com Stahl A.B. (ed.) 2004. African archaeology a critical introduction. Blackwell Publishing. Swadesh M. 1971. The Origin and Diversification of Language. Ed. post mortem by J. Sherzer. Chicago: Aldine.

Tadmor U., Haspelmath M., Taylor B. 2010. Borrowability and the notion of basic vocabu‐ lary. Diachronica, 27: 226‐246. Underhill P.A. 2002. Inference of Neolithic population histories using Y‐chromosome haplo‐ types. In: P. Bellwood and C. Renfrew (eds.). Examining the farming / language dispersal hy‐ pothesis. Cambridge (UK): McDonald Institute Monographs, pp. 49‐64. Underhill P.A., Passarino G., Lin A.A., Shen P., Mirazón Lahr M., Foley R.A., Oefner P.J., Cavalli‐Sforza L.L. 2001. The phylogeography of Y chromosome binary haplotypes and the origins of modern human populations. Annals of Human Genetics, 65: 43‐62. Van der Veen L. 2006a. Gabon: Language Situation. In: K. Brown (ed.) Encyclopaedia of Language and Linguistics, Second Edition, volume 4. Oxford: Elsevier, pp. 708‐715. Van der Veen L. 2006b. La situation linguistique du Gabon : état des recherches. Proceed‐ ings of the conference Crossing borders: Feasability of an Integrated Archaeological and Linguistic Approach to Population Dynamics in southern Central Africa. Tervuren (Bel‐ gium): MRAC (Royal Museum of Central Africa). February 3. Van der Veen L. 2007. Rapport scientifique de fin dʹopération pour les projets ʺLanguage, Culture and Genes in Bantu: a Multidisciplinary Approach to the Bantu‐speaking Popula‐ tions of Africaʺ et ʺLangues et gènes en Afriqueʺ. Lyon, Lab Dynamique du Langage (DDL), 68 pages. Available at the following link : www.ddl.ish‐lyon.cnrs.fr/fulltext/Van%20Der%20Veen/Van%20der%20Veen_2007_LCGB.pdf

Van der Veen L., Hombert J.M. 2001. On the origin and diffusion of Bantu: a multidisciplinary approach, Proceedings of 32nd Annual Conference on African Languages, Berkeley, USA. Vansina J. 1984. Western Bantu Expansion. Journal of African history, 25: 129‐145. Vansina J. 1990. Paths in the rainforest. London: Currey. Vansina J. 1995. New linguistic evidence of the Bantu expansion. Journal of African history, 36: 173‐195. Verdu P., Austerlitz F., Estoup A., Vitalis R., Georges M., Thery S., Froment A., Le Bomin S., Gessain A., Hombert J.M., Van der Veen L., Quintana‐Murci L., Bahuchet S., Heyer E. 2009. Origins and genetic diversity of pygmy hunter‐gatherers from Western Central Africa. Current Biology, 19: 312–318. Verdu P., Becker N.S., Froment A., Georges M., Grugni V., Quintana‐Murci L., Hombert J.M., Van der Veen L., Le Bomin S., Bahuchet S., Heyer E., Austerlitz F. 2013. Sociocultural behavior, sex‐biased admixture, and effective population sizes in Central and non‐Pygmies. Molecular Biology and Evolution, 30: 918‐937. Vigilant L., Stoneking M., Harpending H., Hawkes K., Wilson A.C. 1991. African populations and the evolution of human mitochondrial DNA. Science, 253: 1503‐1507. Whiteley W.H. 1971. Introduction. In: Wilfred H. Whiteley (ed.) Language use and social change: Problems of multilingualism with special reference to Eastern Africa. Oxford: Oxford University Press, pp. 1–23.

Wilder J.A., Kingan S.B., Mobasher Z., Pilkington M.M., Hammer M. 2004. Global patterns of human mitochondrial DNA and Y‐chromosome structure are not influenced by higher migration rates of females versus males. Nature Genetics, 36: 1122‐1125. Wotzka H‐P. 2006. Records of activity: radiocarbon and the structure of iron age settlement in central Africa. In: H‐P. Wotzka (ed.). Grundlegungen. Beiträge zur europäischen und afrikani‐ schen Archäologie für Manfred K.H. Eggert. Tübingen: Francke Attempto Verlag and Co., pp. 271‐289.


CHAPTER 7

A CENTRAL ASIAN LANGUAGE SURVEY

This chapter has been published; please cite the original reference:

Mennecier P., Nerbonne J., Heyer E., Manni F. 2016. A Central Asian language survey. Collecting data, measuring relatedness and detecting loans. Language Dynamics and Change 6(1): 57‐98.

SUPPLEMENTARY FILES are available at the following Internet address: http://booksandjournals.brillonline.com/content/journals/10.1163/22105832-00601015

ABSTRACT  We have documented language varieties (either Turkic or Indo‐Iranian) spoken in 23 test sites by 88 informants belonging to the major ethnic groups of Kyrgyzstan, Tajikistan and Uzbekistan (Karakalpaks, Kazakhs, Kyrgyz, Tajiks, Uzbeks, Yagnobis). The recorded linguistic material concerns 176 words of the extended Swadesh list. Phonological diversity is measured by the Levenshtein distance and displayed as a consensus bootstrap tree and as multidimensional scaling plots. Linguistic contact is measured as the number of borrowings, from one linguistic family into the other, according to a precision/recall analysis further validated by expert judgment. Concerning Turkic languages, the results of our sample do not support regarding Kazakh and Karakalpak as distinct languages and indicate the existence of several separate Karakalpak varieties. Kyrgyz and Uzbek, on the other hand, appear quite homogeneous. Among the Indo‐Iranian languages, the distinction between Tajik and Yagnobi varieties is very clear‐cut. More generally, the degree of borrowing is higher than average where language families are in contact in one of the many sorts of situations characterizing Central Asia: frequent bilingualism, shifting political boundaries, ethnic groups living outside the “mother” country.

A CENTRAL ASIAN LANGUAGE SURVEY

7.1 INTRODUCTION

The primary purpose of this paper is to survey and document the linguistic relations among a genealogically mixed group of languages in Central Asia, some Turkic and others Indo‐Iranian. This is a descriptive goal. Methodologically, we have the modest aim of using a measure of pronunciation dissimilarity as an inverse measure of relatedness within the two genealogical families. We measure pronunciation dissimilarity using the Levenshtein distance, which has seen a great deal of use in dialectology (Wieling and Nerbonne 2015), but less in assaying relations at a great historical depth. Using the Levenshtein distance in this way is nevertheless not new: the Automated Similarity Judgment Program (ASJP) demonstrates the usefulness of the Levenshtein distance (also known as edit distance) in historical inference (Wichmann et al. 2013; but see also Jäger 2013). We therefore expect to be successful in detecting relations within the two language families. We note that the effort is also interesting because of the close contact among the peoples of Central Asia, which complicates the situation. We likewise report, incidentally, that the pronunciation distances derived from the 200‐word Swadesh list are essentially indistinguishable from those derived from the 100‐word list.

We also document borrowing among these languages as a likely reflection of contact (including indirect contact), and we try out the obvious idea of recognizing loan words by unexpectedly similar pronunciations, as measured by the Levenshtein distance. We emphasize here and below that, while we can detect loan words, our methods cannot distinguish direct loans from loans that enter the language via a third variety. It nevertheless turns out that we can indeed recognize loan words, albeit imperfectly.

In the subsection immediately following we sketch the background of the project, which was initiated by population geneticists. We note this here because of the effect it had on the sample of language varieties studied. This is followed by sections on methodology and on results, and we conclude with a discussion section.
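The idea of recognizing loan words by unexpectedly similar pronunciations, and of scoring the outcome against expert judgment by precision and recall, can be sketched as follows. This is a minimal illustration and not the procedure used in the study: the fixed threshold, the normalization by the longer form, the function names and the word forms are all assumptions made for the example, and the forms are invented toy strings rather than survey data.

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer form (an assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def flag_candidates(pairs, threshold):
    """Flag concepts whose forms in the two families are unexpectedly similar."""
    return {concept for concept, (form_a, form_b) in pairs.items()
            if normalized_distance(form_a, form_b) <= threshold}


def precision_recall(flagged, expert_loans):
    """Compare the flagged concepts with the loans identified by an expert."""
    true_positives = len(flagged & expert_loans)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(expert_loans) if expert_loans else 0.0
    return precision, recall


# Hypothetical cross-family pairs (family A form, family B form); toy data only.
PAIRS = {
    "book": ("kitob", "kitab"),
    "water": ("suv", "ob"),
    "hand": ("qol", "dast"),
    "bread": ("non", "nan"),
    "market": ("bozor", "bazaar"),
}
EXPERT_LOANS = {"book", "market"}  # hypothetical expert judgment

FLAGGED = flag_candidates(PAIRS, threshold=0.35)
PRECISION, RECALL = precision_recall(FLAGGED, EXPERT_LOANS)
```

In this toy setting one flagged pair is not a loan according to the (invented) expert list and one expert‐identified loan is missed, which is the kind of imperfection that a precision/recall analysis is meant to quantify.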

7.1.1 Background

For the past thirteen years, and within diverse projects, a good deal of research has been conducted at the Musée de l’Homme (Paris) to better describe the peopling of Central Asia. The main objective has been to characterize the social and genetic differences, the latter inferred through DNA testing, of populations living in Kyrgyzstan, Tajikistan and Uzbekistan and belonging to several ethnic groups such as the Karakalpaks, Kazakhs, Kyrgyz, Tajiks, Turkmens, Uzbeks and Yagnobis. It has been shown (Martinez‐Cruz et al. 2011) that these populations cluster genetically in two groups that closely follow the linguistic classification of Indo‐Iranian versus Turkic populations, with the exception of the Turkmens, whose language is Turkic but who cluster genetically with Indo‐Iranian populations. This general distinction also makes sense in social and economic terms. On the one hand, the Turkic populations were traditionally herders while the Indo‐Iranians were farmers, which contributed to the emergence of a differentiation in some genes involved in food metabolism. For example, possible genetic differences between herders and farmers have been reported concerning the metabolism of milk (Heyer et al. 2011) and carbohydrates (Segurel et al. 2013), but not the metabolism of proteins (Segurel et al. 2010). On the other hand, social differences between the groups can also be observed. While both groups are patrilocal, Turkic populations are exogamous and inherit cultural traditions from the father, whereas Indo‐Iranian populations are endogamous and both parents contribute to the transmission of culture and beliefs. These differences in social organization have influenced genetic diversity (Chaix et al. 2004, Chaix et al. 2007, Segurel et al. 2008, Heyer et al. 2009) in a sex‐specific manner: males from the Turkic populations show a low level of migration, while females conversely have a high level of inter‐population migration.
During the genetic fieldwork we also probed the linguistic diversity of this region, as languages are widely regarded as a proxy for the cultural diversity among and between human populations (see CHAPTER 2). Although human populations are overall genetically very similar, some heterogeneity can be found between geographically distant or ethnically different groups. It goes without saying that the noise and the randomness associated with a genetic sampling, limited in time and space, may reflect local phenomena that a more general classification of human groups would not show. Similarly, local linguistic differences may not fit well within general linguistic classifications, which are often based on scholarly criteria culled from surveys rather than on general and replicable procedures. Therefore, in order to analyze the linguistic (cultural) diversity of the populations under study and to compare it to their genetic diversity, we recorded the language variants of the same individuals who agreed to donate their DNA. In this way, and because we are working on dialectal variation, we did not have to rely on distant linguistic classifications that may not apply to the groups we approached (Baskakov 1966; Menges 1968; Andreev and Sunnik 1982; Johanson and Csató, 1998). A genetic sampling of a given population is not immune from the confounding effect of recent migrations, which make inferences about the ancestral genetic variability difficult. Bearing this limitation in mind, it may be misleading to compare a genetic classification of human populations with a linguistic phylogeny obtained without taking population contact into account. Historical linguistics, largely based on the comparison of regular correspondences and on the comparative method (that is, on the comparison of cognate words), focuses more on the language itself than on the speakers that kept it alive, speakers that are often non‐native and bilingual.
This is why we included loanwords in our linguistic analysis: they are symptoms of population contact and admixture, the kind of phenomena that population genetics can address. This article concerns the methodology we adopted to document local speech and to measure its differences in order to further compare it to the genetic diversity of the same populations. To do this we collected the realization of the concepts in the 200‐word Swadesh list. First we identified the range of phonological variation of the two language groups under investigation (Turkic: Karakalpak, Kazakh, Kyrgyz, Uzbek; Indo‐Iranian: Tajik, Yagnobi), then we computed the aggregate linguistic differences among the different varieties using the Levenshtein (1966) algorithm (see also Heeringa 2004).
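As a rough illustration of how an aggregate difference between two varieties can be derived from word‐level Levenshtein distances, consider the sketch below. It assumes unit edit costs, normalization by the length of the longer transcription, and a plain mean over the concepts shared by both word lists; these choices, the function names and the three toy transcriptions are assumptions made for the example and do not reproduce the settings or the data of the study.

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer form (an assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def aggregate_distance(variety_a, variety_b):
    """Mean normalized distance over the concepts recorded in both varieties."""
    shared = [concept for concept in variety_a if concept in variety_b]
    if not shared:
        raise ValueError("the two varieties share no concepts")
    total = sum(normalized_distance(variety_a[c], variety_b[c]) for c in shared)
    return total / len(shared)


# Hypothetical transcriptions for two sites; toy data, not taken from the survey.
SITE_X = {"water": "suw", "hand": "qol", "stone": "tas"}
SITE_Y = {"water": "suv", "hand": "qol", "stone": "tosh"}

DISTANCE_XY = aggregate_distance(SITE_X, SITE_Y)
```

Repeating the computation for every pair of sites would yield a symmetric site‐by‐site distance matrix of the kind that trees and multidimensional scaling plots are built from.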

We should note finally that our goal of understanding the human history of Central Asia will result in a description of its genetic variability, which will be summarized as a frequency matrix of given DNA motifs and consanguinity estimates. This makes the quantitative analysis of language variation, which results in matrices of linguistic distance, a most attractive corresponding goal, as it enables a statistical comparison with the genetics.
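One common way to compare two distance matrices defined over the same populations is a permutation test of the correlation between their off‐diagonal entries (a Mantel‐type test). The text above does not name the statistical procedure intended, so the sketch below should be read as an assumption about what such a comparison might look like, not as the method of the project; the matrices are invented toy values and all names are hypothetical.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / (var_x * var_y) ** 0.5


def pairwise(matrix, order):
    """Distances between all pairs of populations, taken in the given order."""
    return [matrix[i][j] for i, j in combinations(order, 2)]


def mantel(matrix_a, matrix_b, permutations=999, seed=0):
    """Observed correlation and one-sided permutation p-value."""
    identity = list(range(len(matrix_a)))
    fixed = pairwise(matrix_a, identity)
    observed = pearson(fixed, pairwise(matrix_b, identity))
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # relabel the populations of the second matrix
        if pearson(fixed, pairwise(matrix_b, order)) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (permutations + 1)


# Hypothetical symmetric distance matrices for four populations (toy values).
LINGUISTIC = [
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.5, 0.6],
    [0.6, 0.5, 0.0, 0.2],
    [0.7, 0.6, 0.2, 0.0],
]
GENETIC = [
    [0.00, 0.02, 0.10, 0.12],
    [0.02, 0.00, 0.09, 0.11],
    [0.10, 0.09, 0.00, 0.03],
    [0.12, 0.11, 0.03, 0.00],
]

CORRELATION, P_VALUE = mantel(LINGUISTIC, GENETIC)
```

Permuting the population labels, rather than the individual distances, keeps the dependence among the entries of each matrix intact, which is why this kind of test is usually preferred to an ordinary correlation test on the raw entries.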

7.2 METHODOLOGY

7.2.1 Selection of linguistic test sites

Table 7.1 lists the locations where the linguistic data were collected. As the ultimate aim of the general project is to understand how cultural (linguistic) differences can influence human migration and gene flow, we deliberately selected test sites with an eye to the quite complex human and linguistic geography of this region of Central Asia. We have targeted populations living within the Indo‐Iranian and Turkic‐speaking zones but also at the borders between them (see Fig. 7.1). Further, when possible, we have documented linguistic varieties surrounded by a different language family. While several Indo‐Iranian Tajik‐speaking groups live in the officially Turkic‐speaking Uzbekistan, the opposite situation (Uzbek‐speaking groups living in a Tajik‐speaking area) is less common and only one village falls into this category (Urtoqqishloq; see Fig. 7.1). As far as the Karakalpak‐speaking area is concerned, sampling sites were chosen according to the suggestions of a competent social anthropologist (Dr. Svetlana Jacquesson, personal communication).¹ More generally, suggestions about appropriate sampling sites came from local historians, social anthropologists, and local authorities. Our quest for “autochthonous” villages explains why the majority of the sites are away from large urban areas such as the cities of Tashkent (Uzbekistan) and Bishkek (Kyrgyzstan). An exception was made for the city of Bukhara, which is of special interest as it has been inhabited since Neolithic times. For reasons related to the genetic ambitions of the larger project (explained in the introduction), the linguistic varieties under study correspond to populations representative of the two life‐styles of Central Asia, farmers (i.e., the Tajiks) and pastoralists (the Kazakhs).
While we wanted to investigate a wider region, linguistic varieties in the Tajik portion of the Pamir mountains have not been documented because of severe weather conditions during the fieldwork and because there was too little time to get used to the high altitude. There are no sampling sites in the southern region of Kyrgyzstan because local authorities did not allow investigation because of political instability at the time.

1 See also Jacquesson Sv. 2002

Table 7.1  Geographical distribution of the 88 respondents across 23 test sites. Latitude and longitude coordinates are expressed as decimal values.

Language    Country     Region          Place         Latitude       Longitude      Questionnaires
Karakalpak  Uzbekistan  Karakalpakstan  Kokdarya      43.09          58.78          4
Karakalpak  Uzbekistan  Karakalpakstan  Shege         43.77          59.02          4
Karakalpak  Uzbekistan  Karakalpakstan  Halqabad      42.94          59.78          3
Kazakh      Uzbekistan  Karakalpakstan  Raushan       43.04          58.84          4
Kazakh      Uzbekistan  Bukhara         Gazli         40.08          63.56          7
Kyrgyz      Kyrgyzstan  Issyk Kul       Tamga         42.16          77.57          2
Kyrgyz      Kyrgyzstan  Issyk Kul       Barskoom      42.17          77.64
Kyrgyz      Kyrgyzstan  Naryn           Kulanak       41.36          75.50          5
Kyrgyz      Kyrgyzstan  Naryn           Akmuz         41.25          76.00          3
Kyrgyz      Uzbekistan  Andizhan        Orday         40.77          72.31          4
Uzbek       Uzbekistan  Andizhan        Soy Mahalla   40.77          72.31          3
Uzbek       Uzbekistan  Bukhara         Zarmanak      39.73          64.27          3
Uzbek       Uzbekistan  Bukhara         Novmetan      39.73          64.27
Uzbek       Uzbekistan  Karakalpakstan  Hitoy         43.04          58.84          3
Uzbek       Tajikistan  Penjikent       Urtoqqishloq  39.49          67.54          4
Tajik       Uzbekistan  Bukhara         Zarmanak      39.73          64.27          4
Tajik       Uzbekistan  Bukhara         Novmetan      39.73          64.27
Tajik       Uzbekistan  Fergana         Kaptarhona    40.25          71.87          5
Tajik       Uzbekistan  Fergana         Rishtan       40.36          71.28          3
Tajik       Uzbekistan  Samarqand       Kamangaron    39.50          67.27          5
Tajik       Uzbekistan  Samarqand       Agalik        39.54          66.89
Tajik       Tajikistan  Ayni            Shink         39.28          67.81          3
Tajik       Tajikistan  Ayni            Urmetan       39.44          68.26          4
Tajik       Tajikistan  Penjikent       Nushor        39.11          70.86          3
Tajik       Tajikistan  Tajikabad       Navdî         39.11          70.45          3
Tajik       Tajikistan  Tajikabad       Nimich        39.12          70.67          4
Yagnobi     Tajikistan  Dushanbe        Safedorak     38.57          68.78          3
Yagnobi     Tajikistan  Dushanbe        Dugova        next to above  next to above  2

Total = 88

7.2.2 Linguistic inquiry

Local healthcare professionals selected the volunteers likely to be enrolled in the study after our arrival in the different villages, and our first contact with them was often the day of the DNA donation and, for some of them, the day of the linguistic interview. This is why we experimented with different approaches to identify the language variety spoken by the informants (Fig. 7.1). Before systematically adopting the extended Swadesh list of 200 words on which this paper is based, we experimented with other word lists. The first one concerned “diagnostic” words chosen by F. Jacquesson (personal communication, see also Jacquesson Fr. 2002) meant to distinguish the speakers of three Turkic languages: Uzbek, Kazakh and Karakalpak. Because of the dialect continuum existing among Turkic languages of Central Asia² and the high degree of mutual intelligibility among the speakers of Uzbek, Kazakh and Karakalpak, we progressively adopted a longer list of diagnostic words according to Junusaliev (1966) and Menges (1968). While the results (not shown) were quite accurate and confirmed the linguistic literature, we felt that they were not portraying the dialect continuum appropriately. For this reason, we finally based our inquiry on the extended Swadesh word list of 200 items (Swadesh 1955, 1972).

7.2.2.1 Classification of spoken language varieties by using the extended Swadesh list

Widely used in linguistic studies, the Swadesh list was developed for glottochronogy, i.e. dating linguistic events such as the splitting of the Germanic languages into North, East and West. It was designed to include the basic notions that we expect to find in the lexicons of all languages and to include, for comparative purposes, the words less likely to have been borrowed. Swadesh’s approach was highly controversial, espe‐ cially his notion of a “basic vocabulary”. His idea was that essential words are more widely used and are more stable over time. For assaying linguistic relatedness, the words of the Swadesh list offer some advantages because speakers use them without hesitation, whilst more “marginal” words require longer reflection. Finally, the essen‐ tial words are generally simple, not derived. Many articles have addressed the effec‐ tiveness and appropriateness of the Swadesh list and we redirect interested readers to reviews (Kessler 2001; McMahon and McMahon 2005; Holman et al. 2008). While the standard Swadesh list seems a convenient base to collect phonologi‐ cal and lexical data and to compute linguistic distances between the speakers, we had to adapt it to this specific linguistic context by excluding some words (see Supplemen‐ tary Materials, Tab. S1).3 To minimize polysemy, a better list had to be developed and

used instead, but our fundamental choice to use the Swadesh list (because of its wide use in linguistics) was modified only slightly. By using the Swadesh list of concepts we intend to facilitate the comparison with linguistic data collected in the same format in other populations.

2 This is particularly true for Kyrgyz, Kazakh and Karakalpak, and partly for the Uzbek of Karakalpakstan. Official Uzbek, which has been influenced by Iranian languages, is slightly different, although its Turkic basis still shimmers through, further attesting to the phonetic continuum among the Turkic languages.
3 http://booksandjournals.brillonline.com/content/journals/10.1163/22105832‐00601015

7.2.3 Informants, protocol and linguistic database

The Swadesh 200‐word list was submitted to 88 people during fieldwork in Uzbekistan, Kyrgyzstan and Tajikistan from 2003 to 2007 (Tab. 7.1). More than 17,000 items have been digitally recorded and manually transcribed (Supplementary Materials, Tab. S2).⁴ All informants were approached by the same linguist (PM), usually in rural health centres. For genetic‐testing purposes, they were generally male adults aged at least 40 years. Although the social background of the informants is uneven (mainly manual workers in rural areas and lower middle‐class or middle‐class in urban areas), they all went to school during the times of the Union of Soviet Socialist Republics (USSR) and could understand Russian well. In practice, the informants were asked to orally translate into their mother tongue the Swadesh words that were asked in Russian.⁵ The conditions of the linguistic interview (microphone on the table, empty room, medical environment, interview conducted in Russian by a linguist coming from Paris, specific request to speak clearly, words pronounced out of context) could not have been more formal, meaning that recorded variants are quite far from natural language conditions and inter‐individual variability could not be captured fully. Unfortunately, no alternative setup was possible. As we were unable to identify a priori the speaker best representing the variety spoken in a given site, we adopted a sociolinguistic approach and collected about four

questionnaires per village (though their number varied according to the size of the villages and was sometimes conditioned by the lack of Russian‐speaking volunteers; see Table 7.1). The number of questionnaires concerning Yagnobi speakers is lower than the average because of difficult working conditions during the inquiry. When a respondent did not understand a concept (usually one used less in everyday speech), a drawing corresponding to the concept was sometimes shown. When the informant provided two different responses, we have generally chosen the first realization because the second one was usually an attempt to speak closer to the norm. We note that words were asked in the order established by Swadesh, except for the first 21 (abstract words in a grammatical context) that were asked at the end (Supplementary Materials, Tab. S1).

4 http://booksandjournals.brillonline.com/content/journals/10.1163/22105832‐00601015
5 While informants were asked to speak their everyday language, some made visible efforts to “speak well”, reproducing the official language. We stressed that we were interested in recording their traditional language and history, which was often a convincing argument in the context of the nationalistic policies pursued by many Republics after the collapse of the Soviet Union and their independence. We note that the Uzbek spoken in Uzbekistan is much more homogeneous than the Tajik of the Tajik‐speaking minorities living in Uzbekistan. The speakers who agreed to “play the game” and were proud to use their everyday language are rare; they are notably the informants coded Ka‐Gazli 5, Ka‐Gazli 7, T‐Kamangaron 2, T‐Kaptarhona 5, T‐Navdî 1, T‐Nimich 3, T‐Nimich 4, T‐Nushôr 3, T‐Rishtan 3, T‐Urmetan 1, T‐Urmetan 3. It is significant that these are essentially the Tajiks of Uzbekistan, less influenced by the Tajik norm. For example, they are perfectly aware that they employ the syllabic consonants [C] instead of groups [V+C].
The whole linguistic interview lasted for about half an hour for each informant.

To be sure that no misunderstandings arose during the interview and that recorded words indeed corresponded to the concepts of the Swadesh list, all realizations have been verified according to several references (Andreev and Sunik 1982, Balci et al. 2001, Baskakov 1966, Junusaliev 1966, Kerimova 1959, Moukhtor et al. 2003, Rastogueva 1963, 1964). Later, we transcribed the realizations in IPA (International Phonetic Alphabet) but we did not try to reconstitute the phonology. Each transcription has been compared to the corresponding recording several times. The phonetic transcriptions were subsequently translated into the X‐Sampa codification for computational processing (Wells 1997; for more details see http://coral.lili.uni‐bielefeld.de/LangDoc/EGA/Formats/Sampa/sampa.html).

It must be pointed out that we are dealing with the phonetics of words pronounced in isolation. For example, the devoicing of the final consonants in all the languages of the region studied does not occur before a consonant in the following word or in a suffix.

7.2.4 Computational analysis

We estimated the similarity of varieties using a pronunciation distance metric, in other words, ignoring syntax and morphology. Since lexical differences also result in pronunciation differences, these are incorporated into the method. It is occasionally objected that behavioral tests are needed to determine how closely related language varieties are, e.g., tests of (mutual) intelligibility, and the objection is not without merit. But behavioral tests, e.g., of intelligibility, are not only expensive to conduct, but also reflect linguistic similarity only partially, as language attitudes and experience likewise play a role. Finally, as Gooskens et al. (2008) show, pronunciation similarity together with lexical overlap (shared cognates) predicts intelligibility quite well (explaining 81% of variance).

We therefore compared the phonetic transcriptions of the pronunciations using the Levenshtein algorithm, also known as Edit Distance (Levenshtein 1966). When calculating edit distance between a pair of words in two different varieties, the algorithm seeks the minimal set of operations that can be used to transform one realization into another. The operations can be insertions, deletions, substitutions or swaps and each is associated with a cost (Table 7.2). Although we have experimented with elaborate cost schemes, we generally found simple schemes to function effectively when the purpose is to characterize the overall similarity among varieties (Heeringa 2004: p. 186). Therefore, a standard cost scheme in which all operations cost a single unit (1.0) is adopted here for the Levenshtein distance computation. Heeringa (2004) presents the application of Levenshtein distance in great detail. We ensure, roughly, that only vowels substitute for vowels, and consonants for consonants, and the distance scores are normalized according to word length (see Heeringa et al. 2006 for details).
The guarantee is only rough because we do allow vowels to substitute for sonorous consonants (/r, l, n, m/ etc.), and approximants such as /w, j/ may substitute for vowels but also for consonants.

As in previous dialectometric research conducted by the Groningen group, the software packages L04, developed by P. Kleiweg (www.let.rug.nl/kleiweg/L04), and GabMap (www.gabmap.nl/) (Nerbonne et al. 2011) are adopted for analysis. These packages contain several methods to analyze phonological and lexical data statistically, building on the Levenshtein measure for string data, but including routines to analyze numerical data such as frequencies or formant values, and others to analyze categorical data (such as lexical choices). The focus is on analyzing string data such as IPA transcriptions.⁶

For the purpose of testing effectiveness in distinguishing different varieties, L04 is used to calculate a distance between each pair of words, and then an aggregate distance score for each pair of sites (the mean of word distances). We collected the aggregate site distances into a site × site matrix, which was further analyzed using multivariate analyses and hierarchical clustering.
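The aggregation step just described (word distances averaged into a site × site matrix) can be illustrated with a minimal sketch. It is not the L04 implementation: the function names, the toy transcriptions and the choice to normalize by the length of the longer word are assumptions made for illustration, and the vowel/consonant restriction on substitutions is left out.

```python
from itertools import combinations
from statistics import mean


def levenshtein(a, b):
    """Unit-cost edit distance between two sequences of segments."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer word length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def site_distance(words_a, words_b):
    """Mean normalized word distance over the concepts realized at both sites."""
    shared = sorted(set(words_a) & set(words_b))
    return mean(normalized_distance(words_a[c], words_b[c]) for c in shared)


def distance_matrix(sites):
    """Symmetric site x site matrix, stored as a dict keyed by (site, site)."""
    matrix = {(s, s): 0.0 for s in sites}
    for s, t in combinations(sorted(sites), 2):
        matrix[(s, t)] = matrix[(t, s)] = site_distance(sites[s], sites[t])
    return matrix


# Hypothetical toy data (concept -> transcription); not taken from the survey.
sites = {
    "site_A": {"ice": "muz", "salt": "tuz"},
    "site_B": {"ice": "mus", "salt": "duz"},
    "site_C": {"ice": "jax", "salt": "namak"},
}
matrix = distance_matrix(sites)
```

In this toy example the two sites whose forms differ by a single segment end up much closer to each other than either is to the third site, which shares no segments with them.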

6 Naturally there are alternatives available, notably the ASJP work and Jäger’s (2013) work, both cited in the introduction, but also Mattis List’s LingPy programs (List 2014). In fact we have developed more sensitive measures as well (Wieling et al. 2012). Given our focus on varietal‐level comparison, we feel that more sensitive measures are unlikely to contribute much.

Table 7.2  Levenshtein distance example concerning a pairwise distance computation between two realizations for the word ‘round’ in two languages: Tajik horizontal (Agalik) and Uzbek vertical (Zarmanak).

            l    [?]   n     d     a
      0     1     2    3     4     5     null
 a    1     1     2    3     4     4     insertion
 j    2     2     2    3     4     5     insertion
 l    3     2     3    3     4     5     null
 a    4     3     3    4     4     4     substitution
 n    5     4     4    3     4     5     deletion
 a    6     5     5    4     4     4     null

The algorithm begins with all the cells in the matrix empty except for the zero in the upper left‐hand corner. Each cell is then filled in with the minimum of three possible values: (i) the value in the cell to the left plus one (corresponding to an insertion); (ii) the value of the cell above plus one (corresponding to a deletion); or (iii) the value of the cell diagonally above and to the left plus one if the row and column symbols differ, or plus zero if they are the same. The cell (l,l) involved a null change with respect to the cell (j,0). The value in the lower right‐hand corner (4) is then the minimal edit‐distance between the two strings, the least number of edit operations required to transform ajlana to l[?]nda. See Tab. 7.1 about the location of Agalik and Zarmanak.
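The cell‐filling rule in the caption of Table 7.2 can be written out as a short sketch. The function name is hypothetical, and the letter V is only a placeholder for the second segment of the Tajik form, whose symbol is not reproduced here; any symbol that does not occur in the Uzbek form gives the same matrix.

```python
def edit_matrix(source, target):
    """Full unit-cost edit-distance matrix (rows: source, columns: target)."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        d[0][j] = j                      # only insertions along the first row
    for i in range(rows):
        d[i][0] = i                      # only deletions down the first column
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            d[i][j] = min(
                d[i][j - 1] + 1,                       # (i) cell to the left + 1
                d[i - 1][j] + 1,                       # (ii) cell above + 1
                d[i - 1][j - 1] + (0 if same else 1),  # (iii) diagonal + 0 or 1
            )
    return d


# "V" is a placeholder symbol (assumption), see the lead-in above.
matrix = edit_matrix("ajlana", "lVnda")
for label, row in zip(" ajlana", matrix):
    print(label, row)
distance = matrix[-1][-1]
```

The printed rows can be compared cell by cell with the values reported in the table; the lower right‐hand cell holds the edit distance between the two strings.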

7.2.5 Matrix Generation

Given our use of Levenshtein distance as a measure of pronunciation difference, it is natural to continue using distance‐based methods as opposed to character‐based methods to understand how the linguistic varieties relate to one another. Kassian (2015) confirms the general wisdom of preferring distance‐based analyses. Site × site matrices of mean edit‐distances were generated for the entire dataset and for the two main language groups in it, namely Turkic (Karakalpak, Kazakh, Kyrgyz and Uzbek) and Indo‐Iranian (Tajik and Yagnobi). Although Yagnobi is classified in a different subgroup (Eastern Iranian) from Tajik (Western Iranian), we processed them together. We also processed the dataset per wordlist, meaning that linguistic distances between all the pairs of speakers have been computed according to the shorter (100 words) and to the longer Swadesh list (200 words). There is a discussion in the literature as to whether the 100‐word or 200‐word list is better for the purpose of assaying linguistic relatedness, and we wished to know the degree to which the relatedness would overlap depending on which of the two sets was used. Because the 100‐word set is largely a subset of the 200‐word set, the two are not at all statistically independent, so we will not attempt to interpret the significance of the (very high) correlation we obtain.
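One simple way to quantify how far two such matrices agree is a Pearson correlation over their upper triangles. The sketch below assumes this is the kind of correlation meant; the matrices are invented toy values, not results from the survey, and the function names are hypothetical.

```python
from math import sqrt


def upper_triangle(matrix):
    """Distances above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy site x site matrices standing in for the 100-word and 200-word lists.
d100 = [[0.0, 0.2, 0.8], [0.2, 0.0, 0.9], [0.8, 0.9, 0.0]]
d200 = [[0.0, 0.25, 0.75], [0.25, 0.0, 0.85], [0.75, 0.85, 0.0]]
r = pearson(upper_triangle(d100), upper_triangle(d200))
```

Because the cells of a distance matrix are not independent observations, a coefficient obtained this way describes the overlap but should not be read as a conventional significance test.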


7.2.6 Relations among varieties

We investigate the structure of the site × site matrix of linguistic distances using (bootstrap) clustering (Nerbonne et al. 2008) on the one hand and multidimensional scaling (MDS) on the other (Nerbonne, Heeringa and Kleiweg 1999). We do not wish to assume that the varieties are tree structured, i.e., the result of purely vertical inheritance with occasional splits. The high level of contact, systematic migration and potential for population admixture that we find in Central Asia suggest that we should expect to find horizontal transfer as well. We therefore prefer techniques such as MDS, at least initially, to phylogenetic inference, which does assume a tree structure. We hasten to add that we have no reason to doubt the clear separation of the Turkic from the Indo‐Iranian varieties. But, as we note below (see section 7.4.4.4), the frequency of lexical borrowing confirms our suspicion that horizontal transfer was also an important factor determining the current relations among the varieties studied. This will be reflected in MDS plots but not in phylogenetic trees.

The linguistic distance matrices (between all pairs of speakers and between all pairs of speakers within a language group) have therefore been analyzed and visualized as a consensus bootstrap tree (Fig. 7.2) and as classical multidimensional scaling plots where the squared error is minimized (Fig. 7.3). The bootstrap tree guards against too strong an assumption of tree‐like structure by resampling the original data (with replacement) a hundred times, obtaining 100 randomly resampled new datasets containing the same number of items (words) as the original (though with some items appearing repeatedly and some not appearing at all, due to the randomness of the resampling procedure). More details about the procedure can be found in Nerbonne et al. (2008; see CHAPTER 3).
The length of a branch reflects the cophenetic distance from an internal node to the daughter nodes, which may be leaves (sites). The robustness of the clustering (see scores at each node of the tree in Fig. 7.2) is proportional to the number of times a cluster appears in the 100 different trees. In figure 7.2 we set a cutoff value of 70%, meaning that all nodes supported by fewer than 70 of 100 iterations were collapsed. 70% is an arbitrary threshold commonly accepted as a reasonable compromise. The many non‐binary branches in Fig. 7.2 (see the Tajik leaves as well as the second Kazakh node) reflect groups within which further tree‐like structure could not be reliably ascertained. While we used three different clustering algorithms, with results that are largely comparable, the clustering we present was produced by Ward’s method (Fig. 7.2). Ward’s method is one of the four techniques that have been found to recognize hierarchically organized groups in dialects well (Prokić and Nerbonne 2008). The major clusters in the bootstrap tree (Fig. 7.2) have been labeled and some labels are reported in figures 7.1 (part A) and 7.3 (parts A and B) for the readers’ visual ease.
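The resampling logic behind the node scores can be sketched as follows. This is a deliberate simplification and not the procedure of Nerbonne et al. (2008): instead of building a full Ward tree for every resampled dataset, it only records which pair of sites would be joined first (the closest pair), and all names and numbers are hypothetical.

```python
import random
from collections import Counter


def bootstrap_first_merge(word_distances, n_resamples=100, seed=0):
    """Percentage of resamples in which each site pair is the closest pair.

    word_distances maps a site pair to its per-word distances, with the
    words in the same order for every pair.
    """
    rng = random.Random(seed)
    n_words = len(next(iter(word_distances.values())))
    counts = Counter()
    for _ in range(n_resamples):
        # Draw as many words as in the original list, with replacement.
        sample = [rng.randrange(n_words) for _ in range(n_words)]
        means = {
            pair: sum(dists[k] for k in sample) / n_words
            for pair, dists in word_distances.items()
        }
        counts[min(means, key=means.get)] += 1
    return {pair: 100 * c / n_resamples for pair, c in counts.items()}


# Toy per-word distances for three hypothetical sites and four words.
word_distances = {
    ("A", "B"): [0.1, 0.2, 0.0, 0.3],
    ("A", "C"): [0.8, 0.9, 1.0, 0.7],
    ("B", "C"): [0.9, 0.8, 1.0, 0.6],
}
support = bootstrap_first_merge(word_distances)

# Groupings below the 70% cutoff would be collapsed.
robust = {pair for pair, score in support.items() if score >= 70}
```

In a full analysis each resampled matrix would be clustered and every cluster of the resulting tree counted, so that a score is obtained for each internal node rather than only for the first merge.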

7.2.7 Loan word detection

To determine the number of loans, we followed a three‐step procedure. First, we filtered out all the speaker pairs from the same language group (i.e., both from the Indo‐Iranian group or both from the Turkic group) because we focus on the borrowings that have occurred from one language family to the other. Second, for each pair of speakers from different groups, we used the Levenshtein algorithm to compare the transcriptions of each word probed. Our leading hypothesis was that near‐identical pronunciations of the same word in different language families would indicate that the word had been borrowed from one language family into the other. The third step consisted in evaluating this hypothesis against PM’s expert judgment⁷ as to which words were borrowings. For this we used a technique from information retrieval, the 11‐pt interpolated average precision curve (Manning, Raghavan and Schütze 2008, pp. 145‐148), which compares the Levenshtein scores to PM’s expert classification of words into borrowed and unborrowed (Supplementary Materials, Table S2). We elaborate on this below.

For each concept in the Swadesh list, and for each pair of sites, we obtained a pronunciation distance, i.e. the edit distance between the pronunciations realized at the one site and the pronunciations realized at the other. We use these single‐word distances to detect likely loan words, under the leading hypothesis that words from unrelated language families that are very similar are probably loan words. We quantified the success of recognizing loan words using precision and recall (Manning et al. 2008). Recall is the fraction of genuine loanwords that is correctly recognized, i.e., the percentage of realization‐pairs expertly classified as loanwords which are also automatically recognized as such (i.e. by having a low score for edit‐distance).

Figure 7.1  A. Geographical sketch of the region investigated. Test sites are reported as dots. Major cities are reported as gray squares. Gray boxes with labels corresponding to the clusters found in the bootstrap tree of figure 7.2 are reported. Major linguistic classifications (Turkic, Indo‐Iranian) are shown at the top. B. Test sites in A have been plotted as circles whose surface is proportional to the percentage of loanwords from the other linguistic family appearing in the Swadesh list of 176 concepts (see scale on the right). Rounded‐corner rectangles encompass the test sites belonging to the same linguistic affiliation and, within them, the loans from the other linguistic family are colored accordingly (red, blue).

7 Philippe Mennecier, co‐author of this study.


Precision is the percentage of the pairs identified as loanwords on the basis of low edit‐distance scores that were also expertly classified by PM as loanwords. Note that there is an obvious trade‐off between precision and recall: the lower we set the edit‐distance threshold, the better our precision gets, while recall, however, drops. For this reason, we prefer to examine a curve, and one conventional presentation graphs the average precision at eleven different recall levels, namely 0%, 10%, etc. through 100%. Fig. 7.4 presents our detection of loan words as the curve showing precision at these eleven different levels of recall. It shows that precision is nearly perfect at low edit distances, while recall is still 50%.

Finally, we also investigated whether the words related by loan as a set differ from other words (whether the mean realization differences differ significantly), and we tested whether the distribution of edit distances might be better understood as a mix of two distributions, using the EM algorithm (Du 2002), implemented in the ‘mixdist’ package in R (http://www.r‐project.org/). This routine tries to analyze an input distribution as the sum of two Gaussians. The results may be examined in Van der Ark et al. (2007). This confirmed the cutoff point suggested by the precision‐recall analysis.
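The precision and recall computation and the 11‐point interpolated curve described above can be sketched in a few lines. The distances and expert labels below are invented for illustration, the function names are hypothetical, and precision is set to 1.0 by convention when nothing is flagged.

```python
def precision_recall(distances, is_loan, threshold):
    """Precision and recall when pairs with distance <= threshold are flagged."""
    flagged = [d <= threshold for d in distances]
    true_pos = sum(f and loan for f, loan in zip(flagged, is_loan))
    n_flagged = sum(flagged)
    n_loans = sum(is_loan)
    precision = true_pos / n_flagged if n_flagged else 1.0  # convention
    recall = true_pos / n_loans if n_loans else 0.0
    return precision, recall


def eleven_point_curve(distances, is_loan):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points = [precision_recall(distances, is_loan, t)
              for t in sorted(set(distances))]
    curve = []
    for level in [i / 10 for i in range(11)]:
        # Interpolated precision: best precision at this recall or higher.
        candidates = [p for p, r in points if r >= level]
        curve.append(max(candidates) if candidates else 0.0)
    return curve


# Toy data: normalized edit distances and the expert's loan/non-loan labels.
distances = [0.0, 0.1, 0.2, 0.4, 0.5, 0.8]
is_loan = [True, True, False, True, False, False]

p, r = precision_recall(distances, is_loan, threshold=0.2)
curve = eleven_point_curve(distances, is_loan)
```

Raising the threshold flags more pairs, so recall can only stay level or grow while precision tends to fall; the interpolated curve summarizes this trade‐off at fixed recall levels.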

Figure 7.2  Bootstrap consensus tree accounting for the linguistic similarities and differences between 88 informants interviewed in 23 test sites according to Levenshtein linguistic distances. Labels at the leaves of the tree correspond to the country where the inquiry took place, followed by the name of the location and a number that identifies the speaker, as in the following example: ‘KK Shege 1’ = Karakalpakstan; village of Shege; speaker 1. The scores at each node of the consensus tree correspond to the number of times each bifurcation is observed in the 100 trees obtained from the 100 matrices corresponding to 100 datasets re‐sampled from the original dataset with the bootstrap method. Nodes not supported by at least 70% of the 100 bootstrap re‐sampled datasets have been collapsed, thus sometimes giving rise to a “comb” geometry, meaning that no robust hierarchical clustering can be assessed at the corresponding level of the tree. This cut‐off value is arbitrary, though quite standard in similar analyses. Major linguistic bifurcations are very stable (bootstrap score = 100). Further clusters are labeled by an alphanumeric code (ex. Kk1, Kk2, Kk3, etc.) and also reported in figure 7.1 for visual ease in geographical comparisons. Distance matrices concern aggregated Levenshtein distances accounting for pair‐wise comparisons of the realizations of 176 words that are included in the 200‐item list of Swadesh concepts (see Supplementary Materials, table S1, for details about the wordlist).



Figure 7.4  The precision (or accuracy) of loan word detection as a function of the recall (the fraction of loanwords detected). Recall increases as the Levenshtein distance threshold is raised. See text for further explanation (sections 7.2.7 and 7.3.3).

Before leaving this section we would like to note that the reliable detection of loan words is a further task that might be assigned to edit distance approaches to dialectology and diachronic linguistics. We are certain that the fairly rough approach taken here can be improved, for example using more sensitive measures and perhaps also by exploring the sensitivity of borrowed words to the structural disparities between their source and target languages.

Figure 7.3  Two‐dimensional multidimensional scaling (MDS) plots accounting for the Levenshtein linguistic distances between the 88 informants interviewed in 23 test sites. Informants are displayed by linguistic family and all together (A: Turkic; B: Indo‐Iranian; A+B: all together). Symbols are provided when necessary to distinguish single languages. Some diamonds (in red in the original article) correspond to samples that stand out in a three‐dimensional representation (not shown). This is the case of the village of Kokdaria in A, and of the village of Kaptarhona in B. All plots provide a representation of variability that is complementary to the tree displayed in figure 7.2, though they are not based on re‐sampled matrices. For ease of cross‐comparison, some groups of points are encompassed by a circle that corresponds to the clusters (ex. Kk1, Ka1, Kk2, etc.) appearing in figure 7.2 and also in figure 7.1. Stress values, corresponding to the deformation of each projection, are reported for each plot both for the two‐dimensional analysis (shown) and for the three‐dimensional one (not shown).

7.3 RESULTS

7.3.1 General sketch of phonetical variability

It would be vain to try to establish, on the basis of a Swadesh list of 176 terms (24 words were excluded; see Supplementary Materials, Tab. S1), the regular connections between languages of the same family. In this section, our purpose is to highlight the phonetic similarities (and differences) of different Central Asian varieties to suggest that their diversity falls in a range comparable to the European dialects we have studied so far (Gooskens and Heeringa 2004; Nerbonne and Siedle 2005; Wieling et al. 2007; Prokić et al. 2009; Wieling et al. 2013; Šimičić et al. 2013; Montemagni et al. 2013). As a consequence, the computational methods we used to measure the linguistic diversity, originally designed to analyze dialect diversity in Europe, can be regarded as appropriate tools for the task at hand. The linguistic variation we find is within the bounds we find in dialectological studies, where the tools have been found to validly detect the relations among varieties (Heeringa et al. 2002, 2006). See, too, the next section.

7.3.1.1 Turkic languages

There are recurrent phenomena in the Turkic languages (Karakalpak, Kazakh, Kirghiz, Uzbek) of this region of Central Asia:
(1) the devoicing of the final consonants (words pronounced singly) (ex: muz / mz ‘ice’);
(2) the frequent consonant palatalization before [e/] (ex: bes / bs ‘five’, ke ‘wide’);
(3) in Kazakh and Karakalpak, the labialization of plosives before [] and [u], and the epenthesis of [w] in word‐initial position (ex: trt ‘four’, kz ‘eye’; urman > wrman ‘forest’);
(4) in Kazakh and Uzbek, the deletion of interconsonantal [i], with the subsequent assibilation of the next consonant (ex: qsqa ‘short’ > qsqa);
(5) in Uzbek, the frequent deletion of [u] (ex: tuxum ‘egg’ > txum);
(6) in Kazakh and Uzbek, the velarization of final [l] (ex: Kaz. q ‘hand’; Uzb. kw ‘lake’);
(7) in Kazakh, the tendency to weaken initial [h] (ex: hajal > ajal ‘woman’);
(8) in the Kazakh variety spoken in the village of Shege (Fig. 7.1) and the Karakalpak variety spoken in the village of Halqabad (see Fig. 7.1), the lenition (voicing) of initial [t] (ex: tuman ‘fog’ > duman; tuz ‘salt’ > duz).
Some more phonetic tendencies in the Turkic languages are listed in Table 7.3.

7.3.1.2 Iranian languages

In recorded Iranian varieties the changes are mostly lexical. Nevertheless, we observe phenomena similar to those occurring in Turkic languages, which probably reflect an areal influence, such as the palatalization before [e] (ex: Tjk set ‘three’) or the deletion of interconsonantal [u] with the subsequent consonant syllabization (ex: Tjk tuxum ‘egg’ > txum / tuxm / txm / txm; dum ‘tail’ > dm). For Yagnobi varieties, we do not have enough informants to establish fine comparisons.

Table 7.3  Sketch of phonetic tendencies in the Turkic languages.

Kyrgyz  Kazakh  Karakalpak  Uzbek  Commentaries
a a   Vowel correspondences.
y  /u i  e/ #je #ji #i 
Aw a/  Deletion of intervocalic consonant and vowel lengthening in Kyrgyz.
te te  Consonant palatalization before [e].
#p #b #b  Consonant correspondences in initial position. Tendency to sonorization in Karakalpak.
#m #m #b
#te #de #te #de  Palatalization of dentals before [e].
#t #t / #d #d #t
# # #s #  Assibilation of initial [] in Karakalpak.
t# # t#  Lenition of final [t] in Kazakh and Karakalpak.
#  s# #  Assibilation of final [] in Kazakh and Karakalpak.
#d # #j  Lenition of [d] in Kazakh, Karakalpak and Uzbek.
#k # #k/t  Examples of correspondences for velars.
# #k #k

7.3.2 Measures of the linguistic variability

7.3.2.1 Matrix consistency

Distance matrices were generated for the entire data collection and, separately, by language family (Turkic; Indo‐Iranian sites). We verified that we had enough data to obtain a strong signal by calculating the mean inter‐item correlation coefficient, i.e. the degree to which word measures correlate over all pairs of sites, and derived from that Cronbach’s α (Nunnally 1978, p. 245), which depends on the mean inter‐item correlation and on n_w, the number of items (words).
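In its usual standardized form, which we assume is the one intended here, Cronbach’s α is obtained from the mean inter‐item correlation (written r̄ below, a symbol introduced only for this illustration) and the number of items n_w as:

```latex
\alpha = \frac{n_w \, \bar{r}}{1 + (n_w - 1)\, \bar{r}}
```

For a fixed positive mean correlation, α grows with the number of words, which is why long word lists can yield very high values even when individual words correlate only moderately.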


Scores above 0.9 are generally regarded as extremely reliable, so this calculation confirms that there is a strong signal in the data that does not diminish significantly when the two families of languages are merged in the analysis (Table 7.4). We also computed distances by word list, that is, according to the shorter 100‐word Swadesh list or according to the longer 200‐word Swadesh list, and in this case matrix consistency is also very high (Tab. 7.4).

Table 7.4  Cronbach’s alpha scores for all wordlists.

Group             All (176)   Sw200 (163)   Sw100 (86)   N
All Respondents   0.993       0.992         0.988        78
Turkic            0.986       0.984         0.975        39
Indo-Iranian      0.961       0.952         0.921        39

7.3.2.2 Representation of variability, the bootstrap clustering

The consensus bootstrap clustering of figure 7.2, displaying all the 88 informants we approached during the fieldwork, shows a major split between the Turkic languages in the top half of the diagram and the Indo‐Iranian languages in the bottom half. The clustering provided by this consensus bootstrap tree is quite robust because all major nodes are supported by a score of at least 70% (100% for major clusters).

Concerning the Turkic group, there are three major clusters (groups underlined correspond to clusters in Fig. 7.2): Kazakh/Karakalpak, Kyrgyz and Uzbek. The Kazakh/Karakalpak cluster consists of two sub‐clusters. On the one hand we have several Karakalpak varieties (Kk1 mainly corresponding to the village of Shege; Kk2‐Kk3‐Kk4 corresponding to Kokdaria), the Kazakh variety spoken in the village of Raushan (Ka1) and the variety spoken in Hitoj, corresponding to a population identifying itself as Uzbek though we classify it as Karakalpak (KaUZ). On the other hand there is the cluster formed by the Kazakh speakers of Gazli (Ka2). We note that the two Kazakh varieties in our dataset (Raushan and Gazli) do not form a single cluster by themselves, Raushan being closer to Karakalpak than Gazli is. The Kyrgyz cluster is divided into two subclusters, one corresponding to all the varieties spoken in Kyrgyzstan (Ki1) and a second consisting of the four speakers of Orday (Ki2), a village that is now part of Uzbekistan. While the Uzbek group is quite homogeneous (villages of Novmetan, Zarmanak and Soj Mahalla), we note that three speakers of Urtoqqishloq are grouped in a subcluster (Uz1), which makes sense given the isolated position of this village in the linguistic landscape of the region (Fig. 7.1, part A).

The Indo‐Iranian cluster is split into two groups that correspond to the two languages it includes: Tajik and Yagnobi. While the Yagnobi cluster shows some differences between the two villages of Safedorak and Dugova, which are geographically very close to each other, the Tajik cluster is more complex. Although two speakers from Navdi (Navdi1 and Navdi2) and three speakers from Nimich and Nushor (Nimich3, Nimich4, Nushor3) are grouped together, other speakers from these villages belong to independent branches, as do the speakers from the villages of Agalik, Shink and Urmetan. A closer look at figure 7.1 reminds us that these villages are in Tajikistan (labeled as ‘Tajik’ in Fig. 7.1, part A), apart from the village of Agalik, which is in Uzbekistan, although it is not far from the border. Their belonging to single branches indicates considerable inter‐individual variation, which is unexpected given that Navdi, Nimich and Nushor are very close to each other. In addition, there are two subclusters within the Tajik cluster, one corresponding to the village of Kaptarhona (T2) and another to the villages of Kamangaron, Rishtan and Zarmanak/Novmetan (T1). These five Tajik‐speaking villages (T1 and T2) are located outside Tajikistan (in Uzbekistan), which probably explains their clustering. We note that Zarmanak and Novmetan are very close and host a bilingual community, though speakers use only one language (Tajik or Uzbek) at home; this explains why speakers from Zarmanak and Novmetan appear in different clusters.

7.3.2.3 Representation of variability, the multidimensional scaling (MDS) clustering

The MDS analysis (Fig. 7.3) is complementary to the hierarchical analysis of the bootstrap tree (Fig. 7.2) and visually shows the extent of the linguistic differences we measured. The plot that concerns both language families (Fig. 7.3 ‘A+B’) shows that the variability within the Turkic languages is much higher than that within the Indo‐Iranian family, since the Kazakh/Karakalpak, Kyrgyz and Uzbek varieties span a larger area of the plot than the one occupied by Tajik and Yagnobi speakers, who are much closer to each other. In general, we note that each language (with the exception of Kazakh and Karakalpak, which form a single swarm of points) corresponds to a non‐overlapping cluster, confirming the major groups of the bootstrap tree (Fig. 7.2). The large linguistic distances between the speakers of the two language groups, Turkic and Indo‐Iranian, distort the representation and muddy the topology of the points corresponding to the same languages, or the same group of languages. This is why we have computed separate plots for Turkic (Fig. 7.3, part A) and Indo‐Iranian speakers (Fig. 7.3, part B).

Concerning the Turkic group (Fig. 7.3, part A), the Uzbek, Kyrgyz and Kazakh/Karakalpak speakers are nicely separated into three non‐overlapping swarms of points. In more detail, the three Uzbek speakers of Urtoqqishloq, Tajikistan (they correspond to the Uz1 cluster of the bootstrap tree of Fig. 7.2) are next to each other and slightly farther from the other Uzbek speakers. The Kyrgyz of Ordaj (cluster Ki2) are quite distinct from the other Kyrgyz living in Kyrgyzstan (besides the speaker Ordaj 4). We note that the topology of the Kazakh/Karakalpak speakers is more complex and does not correspond well to the classification of the bootstrap tree, as the clusters Kk1, Ka1, Ka2 and KkUZ are not distinct from each other in the MDS plot.
This phe‐ nomenon is also related to the distortion of the two‐dimensional presentation of the plot, and in fact a closer look at the third dimension (not shown) provides evidence for the separate position of the Ka2 cluster corresponding to the village of Gazli and for the considerable linguistic heterogeneity within the village of Kokdaria. The plot involving Indo‐Iranian (Fig. 7.3, part B) provides clear evidence of the difference between speakers of Tajik and of Yagnobi. The latter language is nowa‐ days endangered and only spoken by a small community. Though inter‐individual diversity (idiolects) is decreasing (this is what was observed during the fieldwork), because speakers are in the process of being integrated into the Tajik group with a loss of linguistic diversity, the linguistic differences based on the Swadesh list still appear to be substantial. As far as Tajik speakers are concerned, the two sub‐clusters T1 and T2 highlighted in the tree re‐appear here clearly, though not as distinctly as the boot‐ strap tree would suggest. This is probably related to a lack of accuracy in the two‐ dimensional representation that is linked to a stress value quite high (0.45), but not far from those predicted according to the tables of Sturrock and Rocha (2000) with 88 objects (stress = 0.39). A closer look at the third dimension shows a clear separation between the five speakers from the village of Kaptarhona and all the others. The Tajik speakers of the villages of Nimich, Navdi and Nushor are linguistically similar, as expected given their geographical vicinity, while those from Agalik, Shink, and Ur‐ metan are more varied. Actually, the tree suggests considerable individual variation within all the six villages of Nimich / Navdi / Nushor and Agalik / Shink / Urmetan, which is reflected in the relatively larger cophenetic distances (branch lengths) in the dendrogram. 
The MDS plot in figure 7.3, part B, also represents this group as fairly diverse (see lower left‐hand quadrant of the plot). The reason is related to the boot‐ strap procedure that seeks for a consensus tree over resampled datasets, whereas the MDS plot concerns the full dataset without any resampling. Otherwise, both methods (bootstrap and MDS) point to the significant linguistic heterogeneity among the speakers of the villages of Agalik, Shink and Urmetan.
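The stress values quoted above measure how faithfully a low-dimensional plot reproduces the original distances. The text does not say which stress formula was used, so the following minimal sketch assumes Kruskal's stress-1 in its metric form; the function names and the three-point toy configuration are hypothetical and serve only as illustration.

```python
import math

def pairwise_distances(points):
    """Euclidean distances between all pairs of plotted points, keyed by (i, j)."""
    n = len(points)
    return {(i, j): math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)}

def stress1(dissimilarities, points):
    """Kruskal stress-1 (assumed formula): mismatch between the original
    dissimilarities and the distances in the plot, scaled by the plot distances."""
    embedded = pairwise_distances(points)
    numerator = sum((dissimilarities[i][j] - d) ** 2
                    for (i, j), d in embedded.items())
    denominator = sum(d ** 2 for d in embedded.values())
    return math.sqrt(numerator / denominator)

# Toy configuration: a 3-4-5 right triangle in two dimensions.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
exact = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]       # perfectly represented
distorted = [[0, 3, 4], [3, 0, 6], [4, 6, 0]]   # one distance is off by 1

print(stress1(exact, points))      # no distortion
print(stress1(distorted, points))  # some distortion
```

A stress of zero means that the plot reproduces every distance exactly; the larger the value, the more the two-dimensional picture has to be read with caution, which is why the third dimension is inspected in the text.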

7.3.2.4 Comparison of results using 100-wd vs. 200-wd Swadesh list

To gauge the possible impact on the results of the two different Swadesh lists, we computed an overall Levenshtein distance matrix based on the shorter list (100 words) and another on the full list (200 words). The two matrices are almost identical (Mantel test correlation: 0.997 with a significance level of 1‰), and the topologies of the samples in the MDS plots (both highly correlated to the original distance matrices, r = 0.90 (Indo-Iranian) and r = 0.94 (Turkic); plots not shown) are almost identical as well. We just noted an increased distance of the Yagnobi from the Tajik speakers in the reduction based on the 100-wd. sample. This result is in agreement with the claim that the shorter Swadesh list is more conservative and therefore more likely to reflect older linguistic relations: because the concepts it contains are used very frequently, they are less likely to be borrowed (Kessler 2001, McMahon and McMahon 2005). It should be clear, however, that the differences are minimal.
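The comparison of the two distance matrices can be illustrated with a minimal sketch of a Mantel test. It assumes the usual formulation (Pearson correlation between the upper triangles of two symmetric matrices, with significance estimated by jointly permuting the rows and columns of one matrix); the function names, the number of permutations and the toy matrices are hypothetical and are not taken from the study.

```python
import math
import random

def upper_triangle(matrix, order=None):
    """Flatten the upper triangle, optionally after reordering the objects."""
    n = len(matrix)
    idx = order if order is not None else list(range(n))
    return [matrix[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mantel(a, b, permutations=999, seed=0):
    """Return the matrix correlation and a one-sided permutation p-value."""
    rng = random.Random(seed)
    flat_b = upper_triangle(b)
    observed = pearson(upper_triangle(a), flat_b)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = list(range(len(a)))
        rng.shuffle(order)  # relabel the objects of matrix a
        if pearson(upper_triangle(a, order), flat_b) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)

# Toy example: five varieties placed on a line; b is a rescaled copy of a.
positions = [0, 1, 3, 6, 10]
a = [[abs(x - y) for y in positions] for x in positions]
b = [[2 * value for value in row] for row in a]
r, p = mantel(a, b)
print(r, p)
```

Permuting whole rows and columns, rather than individual cells, respects the fact that the entries of a distance matrix are not independent of each other.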

7.3.3 Loan word detection

In order to evaluate how well we can detect borrowings, we analyze a range of low Levenshtein distances as hypothetical thresholds. It makes sense that word pairs with very low edit distances would be borrowings, and the P/R analysis starts by considering the lowest edit-distances (zero linguistic distance = identical pronunciation). For a given low Levenshtein distance d, we ask how well we would detect borrowings if we hypothesized that all word pairs where distance(w1, w2) ≤ d were borrowings. To evaluate this hypothesis we use a technique from information retrieval (Manning and Schütze 1999), measuring both the PRECISION of the detection (how many of the hypothesized borrowings really are borrowings) and its RECALL (how many genuine borrowings are detected at this threshold). In the Precision/Recall analysis (P/R), it is sufficient to examine about a half of the full set of pronunciation-pairs that are considered (about 250,000 pairs involving an Indo-Iranian speaker and a Turkic speaker). We reach 100% recall after roughly 37,000 records, at which point the last pair which had been classified as belonging to the same cognate group (and therefore as a loan word in one of the languages) is found. Therefore the precision vs. recall (P/R) curves of figure 7.4 are based on about 15% of the records. Fig. 7.4 shows that, initially and up to the thirtieth percentile, precision is almost perfect, meaning that all the words up to this point correspond to those manually classified as cognate. Since the pair of realizations is found in two different language families, the pair consists of a loanword and its cognate “source” in another language family. We realize that we are using the term “cognate” loosely here to include borrowing; in this sense English ‘beef’ and French bœuf are cognate, and, indeed, they arise from the same source, the English word arising via borrowing from French.
The analysis shows that edit-distances are close to zero up to the thirtieth percentile of recall. After the fiftieth percentile the precision-score starts dropping more dramatically, which happens at an average Levenshtein-distance of about 0.06. Based on this we can infer the score corresponding to edit-distances low enough to detect a loan reliably. Two thresholds were chosen. The first one considered is the 0.06 normalized edit-distance, based on a precision score of 0.977 at the fiftieth percentile, that is, just before the precision drop. The second threshold is the 0.02 score, at the thirtieth percentile, up to which precision scores are almost perfect. For this second threshold it can safely be stated that all the pairs that are below it can be considered loans. Once the thresholds are defined, no more runs of the P/R-analysis are necessary. We add that we also tried applying the P/R-analysis within the same language groups (Turkic or Indo-Iranian) but, within each group, the degree of cognacy is too high to identify possible loans, which are hardly distinguishable from cognate pairs. We thus cannot detect what linguists call “intimate borrowing” (Jeffers and Lehiste 1979, p. 150).

Based on the thresholds of Levenshtein distance 0.06 and 0.02, all the pronunciation-pairs corresponding to loans can be compared with the manual classification (Fig. 7.1, part B, Supplementary materials Tab. S2). We found the automatic detection of the loans (see also Van der Ark 2008 for an earlier approach) to be proportional to the estimates of the classification in both directions (that is, Indo-Iranian words into Turkic languages and vice versa), even though at least about 30% of expert-identified loans escape the automatic detection. This proportionality is an important result, as it shows that the Levenshtein distance computation is not biased in one linguistic group with respect to the other one; in fact, the same 50% error-rate (under-detection) is found in both.
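The threshold logic of the P/R analysis can be sketched in a few lines. The example below is a minimal illustration, not the procedure actually run for figure 7.4: the function name, the data layout (one tuple per pronunciation pair, holding its normalized distance and the expert judgment) and the eight toy pairs are all hypothetical.

```python
def precision_recall(pairs, threshold):
    """Treat every pair with distance <= threshold as a hypothesized loan and
    score that hypothesis against the expert judgments."""
    hypothesized = [is_loan for distance, is_loan in pairs if distance <= threshold]
    true_positives = sum(hypothesized)
    total_loans = sum(is_loan for _, is_loan in pairs)
    precision = true_positives / len(hypothesized) if hypothesized else 1.0
    recall = true_positives / total_loans if total_loans else 0.0
    return precision, recall

# Toy cross-family pairs: (normalized Levenshtein distance, expert says "loan").
pairs = [
    (0.00, True), (0.00, True), (0.02, True), (0.05, False),
    (0.06, True), (0.10, False), (0.20, True), (0.30, False),
]

for threshold in (0.02, 0.06):
    precision, recall = precision_recall(pairs, threshold)
    print(threshold, round(precision, 3), round(recall, 3))
```

In this toy data the stricter threshold yields no false positives but misses more genuine loans, while the looser threshold recovers more loans at the price of lower precision; this is the trade-off that motivates reporting two thresholds.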

7.3.4 Linguistic contact and loans

We mentioned that the linguistic research presented in this article is preliminary to an upcoming inquiry about the correlation between the cultural and genetic diversity of the very same populations. This is why we try to estimate social/cultural contact from word borrowing. According to the estimates of PM, the percentages of borrowing (from one linguistic group into the other) are reported in table S1 (Supplementary Materials) and visualized in figure 7.1, part B.

The first result about the loans concerns the higher percentage of borrowing in locations close to the borders of linguistic groups (Rishtan, Kaptarhona, Soj Mahalla and Orday) or where two linguistic communities live in the same place (Zarmanak and Novmetan). In all these villages the linguistic exchange seems symmetrical, except in the village of Urtoqqisloq, which, being a Turkic (Uzbek) linguistic isolate in an Indo-Iranian-speaking area, borrows more from Tajik than vice versa.

We also note that Karakalpak speakers (in the villages of Hitoj, Kokdaria, Seghe and Halqabad) seem to borrow more from Indo-Iranian than do the two Kazakh villages of Raushan and Gazli. Actually, the speakers of Gazli8 (but not those of Raushan) come from a group recently emigrated from Kazakhstan, a country that has no Indo-Iranian speakers nearby. As expected, the Tajik speakers living in Tajikistan (Shink, Urmetan, Navdi, Nimich, Nushor) and the Kyrgyz speakers living in Kyrgyzstan (Kulanak, Ak-Muz, Tamga, Barskoon) show a very low degree of borrowing (actually the few loans are historical ones as, currently, there are no allophone neighbors from which the borrowing could have happened in recent times). Some possible reasons will be discussed in the next section.

7.4 DISCUSSION

The purpose of this paper was twofold. First, we provided a survey of linguistic relations in a complex area in Central Asia; we described how we designed a linguistic survey, how we computed linguistic distances between and within the two groups of languages we studied (Indo-Iranian and Turkic), and how well these reflected traditional designations. We used edit distance for this purpose, which predictably worked well. Second, we estimated the proportion of loans from one language family into the other, and we tested whether loan words might be detected automatically. This worked less well, but the automatic procedure might free larger-scale studies from needing to check all candidate loan words by hand.

The linguistic classifications we derived from the data are not intended to be an assessment of the historical relatedness of the linguistic varieties under study; in fact our approach is, in many aspects, more similar to socio-linguistic inquiry than to historical linguistics methodology. With respect to our longer-term goals of understanding the parallels between genetic and linguistic diversity, we note that population genetics initially shifted from a quest for a systematic correlation to a denial of any reciprocal influence, where any correlation was seen as a by-product of the decreasing chance of human interaction when geographic distance rises (see contribution of P. Chareille in Darlu et al. 2012 and Boattini et al. 2012 for examples about migrations in recent historical times).

8 Gazli, 90 km from Bukhara, was founded in 1958 in the middle of the Kyzyl-Kum desert to exploit natural gas resources in the region.

However, a more pragmatic approach has also emerged, that is to check whether a correlation exists between the two, and to seek explanations for correlations that do emerge in the realm of cultural and demographic interaction. This is particularly the case in Central Asia, where the (semi) nomadic lifestyle of many populations makes its history complex to understand.

Historical linguists have often been reluctant to provide a genealogical tree of languages, even at a or family level, making the statistical correlation of linguistic with genetic data (available as numbers) an unapproachable issue, until the quite recent spread of reliable computational linguistics methods allowing fast, reliable comparison (for an early review see Forster and Renfrew 2006). Cladistic methods or network analysis can be applied to historical linguistics with aims that are similar to glottochronology and lexicostatistics. In a similar vein, the Levenshtein distance has been developed by dialectologists as a measure of pronunciation difference enabling a kind of inference that is more geographical or social than historical.

In the bootstrap tree (Fig. 7.2) only the first separation between Indo-Iranian and Turkic languages has a historical explanation. Subsequent clusters rely more on geographical factors than on the historical split of languages.

7.4.1 Networks versus linguistic distances

Network analyses make possible a new kind of inquiry into linguistic phylogeny by also displaying conflicting signals that weaken the vertical evolutionary signals. A potential cause of error is faulty cognacy judgments (for instance possible chance similarities), which increase the measured similarity between languages and lead to an underestimation of the divergence times. Unrecognized borrowing between closely related languages would have a similar effect. Conversely, unrecognized borrowing between distantly related languages will incorrectly depress (not inflate as stated in Gray and Atkinson 2003) branch lengths at the origin of the tree and, therefore, increase the estimates about divergence-times.

We turned to linguistic distances because we wanted to geographically portray linguistic differences, regardless of their origin in time. While phylogenetic trees are better suited to provide historically reliable family trees of languages and the dates of language splits, the genetic distances we will compare (in future work) to our linguistic distances are much “noisier” because they also include the effect of migrations and other areal factors. The closeness and relative geographic contiguity of the populations we studied involves cross-migration and admixture, meaning that their phylogeny is multi-faceted and difficult to disentangle. This is why we chose to measure linguistic differentiation by the Levenshtein distance, as a way to address population differences without exclusively relying on traditional language phylogeny.

The Levenshtein method was used to measure linguistic differences between similar varieties, just as it has been applied to analyze the relations among dialects (Nerbonne and Heeringa 2010) and closely related languages (Alewijnse et al. 2007).
When the linguistic differences are too great, the Levenshtein method may reach a ceiling so that it no longer reflects common provenance (Greenhill 2011), but Jäger (2013) calls this into question. Moreover, Greenhill’s sample of languages is restricted to Austronesian, a family which has been shown to be the most recalcitrant in the world when it comes to obtaining good matches between a Levenshtein-distance based phylogenetic approach and standard classifications (Wichmann et al. 2010).

While the dissimilarity of Indo-Iranian and Turkic language groups is too high to be appropriately measured using edit distance, the range of difference within each group is comparable to the differences we find among dialects in some language areas. This is why we do not discuss inter-group distances.

7.4.2 Effect of loans on linguistic distances

The distance matrices we computed take into account linguistic contact. Loans are not excluded from the analyses, meaning that the realization pairs corresponding to loans have very low, or null, Levenshtein distances. These low edit-distances decrease the aggregated distance that is obtained from the sum of pairwise distances corresponding to each realization pair. To be sure that all loans are recognized as null distances, we have verified their status based on the judgment of an expert. The two estimates are proportional but about 30-50% of the loans escape automatic detection. As for the reasons for the discrepancy, the most probable one is the existence of some structural disparities at work, such as different phoneme inventories. Where an experienced linguist would easily see the relation between two realizations where a vowel has shifted, the Levenshtein algorithm does not. Also for this reason, all the computed linguistic distances between language groups are overestimated, which is not a serious concern, as we said that we do not expect the Levenshtein method to measure the actual linguistic distance between the two families perfectly. It merely needs to correlate well with the “real distances” among the varieties, and it does (Heeringa et al. 2006). Put differently, and because there are no detectable borrowings within a language family, the measurements of the linguistic distances among Indo-Iranian and Turkic varieties are not biased. The lower performance of the automatic detection, though proportional to the expert’s judgment, convinced us to use the loan estimates provided by the expert as a better proxy to language and population contact (Fig. 7.1, Tab. S2 available at journal site as supplementary material).
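How identical (borrowed) realizations pull an aggregated distance down can be illustrated with a minimal sketch. It makes two assumptions that are not stated in the text: the edit distance is normalized by the length of the longer transcription, and the aggregate is the sum of the per-concept distances divided by the number of shared concepts. The function names and the two toy speakers are hypothetical; the transcriptions are among those quoted in this chapter.

```python
def levenshtein(a, b):
    """Plain edit distance with unit costs for insertion, deletion, substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def normalized_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def aggregate_distance(speaker_a, speaker_b):
    """Mean normalized distance over the concepts both speakers realized."""
    shared = [concept for concept in speaker_a if concept in speaker_b]
    if not shared:
        raise ValueError("the two speakers share no concepts")
    total = sum(normalized_distance(speaker_a[c], speaker_b[c]) for c in shared)
    return total / len(shared)

# Toy speakers: 'flower' is realized identically (distance 0), 'short' differs
# in one segment out of five (distance 0.2).
speaker_a = {"flower": "gʉl", "short": "kalta"}
speaker_b = {"flower": "gʉl", "short": "qalta"}
print(aggregate_distance(speaker_a, speaker_b))
```

Because every identical pair contributes zero to the sum, a larger share of shared loans lowers the aggregate directly. A purely segment-based comparison of this kind also treats a regular vowel shift as an ordinary mismatch, which is one reason why an expert recognizes loans that the threshold misses.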

7.4.3 Swadesh word list

As the reconstruction of a historical phylogeny of languages was outside our aims, the use of the Swadesh word list might be questioned. In fact, the list was designed to better assess the history of languages by including concepts that are less likely to be borrowed, thus maximizing the number of cognate pairs and, as a consequence, limiting the possibility of detecting linguistic (population) contact. However, we turned to the Swadesh list because it is in widespread use.

In this perspective, an interesting point concerns the supposed stronger historical signal conveyed by the shorter Swadesh list (100 words) when compared to its extended version (200 words), because the concepts of the former are believed to be borrowed less (Kessler 2001, McMahon and McMahon 2005). This seems to be the case with the Yagnobi speakers, who are more distant from the Tajik group in the MDS plot (not shown) based on the shorter Swadesh list than in the MDS plot based on the longer list (Fig. 7.3, part B). This phenomenon matches fieldwork observations, where we noticed that the Yagnobi varieties spoken in Dugova and Safedorak are lexically close to Tajik. While the progressive replacement of the original Yagnobi vocabulary is related to the endangered status of this language (12,000 speakers in 2004 according to the Ethnologue 2015) and to the resettlement of this people to Zafarabad in the 1970s,9 the concepts described in the Swadesh list have resisted replacement, in particular those of the short version (the Tajik / Yagnobi separation in Fig. 7.2 is supported by a bootstrap score of 100%). Of course, the distances assayed by the two lists correlate nearly perfectly, as we noted in the results section.

7.4.4 Variationist aspects

7.4.4.1 Homogeneous or areally unstructured lexical diversity

We encounter homogeneity in Kyrgyzstan (Kulanak, Ak-Muz, Tamga, Barskoon) where all the different speakers used almost exactly the same words for the Swadesh concepts. This level of homogeneity, in villages that can be quite distant (a full day by car), may be the result of school education because our informants went to school during the times of the USSR, when (secondary school) instruction mainly took place in Russian.10 Nevertheless, even if another normalization process (loss of diversity after the collapse of the USSR and the rise of national linguistic policies) were an explanation, we would expect the same phenomenon to arise among the Tajik speakers from Tajikistan (villages of Shink, Urmetan, Navdi, Nimich and Nushor), which are located at geographic distances comparable to those existing between the Kyrgyz sites. Actually the Tajik speakers from Tajikistan are less homogeneous and linguistically more distant from one another than the comparable Kyrgyz sites. We found that this greater variability is not areally structured, because the bootstrap analysis shows no subgroups within the Tajik cluster of Fig. 7.2 (clusters T1 and T2 correspond to the speakers outside the country). The reason for this lack of areal structure is not obvious, and at the moment we are unable to explain it.

Concerning the homogeneity of Kyrgyz varieties, it could also be that their semi-nomadic lifestyle11 (not shared by the Tajiks) enabled long-range contacts among distant groups, thus retarding the lexical divergence that customary traditional tribal meetings have further hampered. Of course, the Tajik and Kyrgyz speakers living where the official language is the same as the one they speak in everyday life are those showing the lowest rate of borrowing from the other language family, respectively Turkic and Indo-Iranian. We will return to this.

Finally, we note that the linguistic diversity we analyzed is lower than the one existing in the region in general, because linguistic interviews were conducted in a very formal context that is very far from natural language conditions. Conversely, recorded varieties are probably quite conservative because we enrolled, in large majority, middle-aged male informants, who, in general, have been found to be more conservative than females (Labov 1990; Chambers 1995: pp. 102-103).

9 Zafarabad is located in the northern Tajikistan plain, while the homeland is the Yagnob Valley, north-west Tajikistan, between the southern slope of the Zarafshan Range and the northern slope of the Gissar Range.

7.4.4.2 Linguistic isolation and contact

In our sampling design, chosen by the colleagues involved in the genetics part of the project, there are several speakers of a given language that live in a country whose official language is different. Among them, there are the Kyrgyz speakers of Orday (Uzbekistan), the Tajik speakers of Rishtan, Kaptarhona, Ohalik, Kamangaron (Uzbekistan) and finally the Uzbek speakers of Urtoqqisloq (Tajikistan). In all the cases (besides the single informant from Agalik), the speakers of the villages we mentioned are clustered together with those of the “motherland” though they systematically belong to specific clusters in the bootstrap tree of figure 7.2 (Orday → cluster Ki2; Rishtan → cluster T1; Kaptarhona → cluster T2; Kamangaron → cluster T1; Urtoqqisloq → cluster Uz1). These speakers are somewhat “isolated” because they live in a country whose official language is part of a different language family (the Indo-Iranian speakers of Rishtan-Kaptarhona-Kamangaron live in the Turkic-speaking Uzbekistan, and the Turkic speakers from Urtoqqisloq live in the Indo-Iranian-speaking Tajikistan), with the only exception of the Kyrgyz speakers living in Orday, which is located just across the Uzbekistan border (Kyrgyz and Uzbek are both Turkic languages). The higher borrowing is easy to explain because all these linguistic pseudo-isolates (we say ‘pseudo’ because there are other Tajiks in Uzbekistan and vice versa) are constantly exposed to an official language that is quite different from the language spoken at home.

In this context, their belonging to specific groups of the bootstrap tree (Fig. 7.2) can be interpreted in two different ways. One explanation is that some borrowed words, specific to these communities living “abroad”, may decrease the overall linguistic distances among them and, conversely, inflate those with the Tajiks living in Tajikistan.12 Another explanation is that Tajik speakers living outside Tajikistan have maintained a vocabulary that has been less conditioned by the normalization process we addressed in the previous section. Finally, as the consensus dendrogram suppresses internal structure that is not reliable (bootstrap), Tajik varieties higher in the node may appear less similar because the clustering algorithm experiences contention about the internal nodes.

As we noted in the body of the text, it is clear that loan words from language l in l′ do not show that the l′ speakers borrowed the words directly from l. It is always possible that a third party or third parties were involved. We emphasize therefore that loan words are to be interpreted as evidence of direct or indirect contact, perhaps via third parties.

10 While Russian was the official language at school, some teaching in Kyrgyz was tolerated in remote areas like many of those we sampled (Derbisheva 2009).

11 Kyrgyz were forced to settle as recently as the Soviets’ time.

7.4.4.3 Bilingualism

As we have seen, in Central Asia it is commonplace to find ethnic groups speaking a language different from the official one, and one similarly often finds bilingual ethnic groups, such as those in Zarmanak and Novmetan, where our informants could speak Uzbek and Tajik perfectly, although one language was preferred at home. Their bilingualism, together with phenomena described in the preceding section, is another reason for the high number of words borrowed from the other language family (Fig. 7.1, part B).

12 Borrowings from Uzbek seem to be similar for many of the Tajik speakers living in Uzbekistan. For example the word for ‘forest’ (urmon) is used in 5 of 6 sites. Less frequent is the use of the Uzbek words for ‘sand’ (qum), ‘seed’ (urok/uruk), ‘to hunt’ (aw), ‘to think’ (ujla-), ‘to turn’ (ajnali), ‘to squeeze’ (qisi), ‘lake’ (kul), and ‘cloud’ (bulut). In Zarmanak and Novmetan we noted the widespread use of the words for ‘sea’ (deŋiz), ‘mountain’ (to), ‘father’ (ota), ‘mother’ (ona). We have not attempted to quantify this effect and compare it to the effect of borrowing in other areas.

7.4.4.4 Kazakh and Karakalpak speakers

Kazakh and Karakalpak speakers deserve a separate discussion because the two languages seem really close, at least as far as our samples showed. The Kazakh speakers from Raushan (cluster Ka1 in Fig. 7.2) are clustered together with the Karakalpak ones from Shege-Halqabad (Kk1, in Fig. 7.2), the ones from Kokdaria (Kk2 and Kk3 in Fig. 7.2) and the self-defined Uzbeks from Hitoj (they actually speak Karakalpak → cluster KkUZ in Fig. 7.2). This finding is in agreement with Kirchner (1998), who treats Karakalpak as “so closely related” (p. 318) that he describes only the aspects that differ from Kazakh. By contrast, the Kazakh speakers from Gazli belong to a different group (Ka2 in Fig. 7.2), which corresponds to a group of Kazakh workers that migrated to Gazli (Uzbekistan), 470 kilometers (as the crow flies) from Raushan (Uzbekistan). This discrepancy cannot be explained as a misclassification of the Kazakhs from Raushan because every speaker enrolled in the study was questioned about his or her ethnic affiliation, according to the recommendations of a competent ethnologist. The reason why the two non-Karakalpak ethnic groups we sampled in Karakalpakstan (the Uzbeks from Hitoj and the Kazakhs from Raushan) actually speak Karakalpak is not clear to us. Either they were originally Karakalpaks that later embraced another ethnic identity (for example for social reasons, such as acquiring prestige and thereby obtaining access to certain jobs) or, indeed, they belong to a different ethnic group that has completely lost its language. This question is however interesting because it is an exception to the expected indivisible transmission of the “cultural package” (traditions, beliefs, language).
What can be outlined is that the inhabited part of Karakalpakstan is quite small and surrounded by the desert, corresponding to a quite isolated region where cultural assimilation may happen in a way that is different from regions that are less isolated and more extended. Finally, the fact that all the speakers located in Karakalpakstan, whatever their ethnic affiliation, use a percentage of words borrowed from Indo-Iranian that is higher than average cannot be explained as an increased contact with Tajik speakers but ought to be seen as a secondary effect of the Uzbek language they are exposed to. The following Karakalpak words are Indo-Iranian in origin, and are also found in Uzbek: Karakalpak [kalta/kjeltje], Uzbek [kalta/qalta/kɛlte] ‘short’; Karakalpak [tjerek/tjerɛk/daraxt], Uzbek [tjerɛk/daraxt] ‘tree’; Karakalpak [gʉl/gʊl], Uzbek [gʉl/gʊl/gyl] ‘flower’; Karakalpak [gʉʃ/gʊʃ/gøʃ], Uzbek [gʉʃ/gøʃ/gʉʃt/gwʉʃt/gʊʃ] ‘meat’; see, too, the pronunciations in S2 for the concepts ‘fruit’, ‘seed’, ‘egg’, ‘horn’, ‘tail’, ‘feather’, ‘river’, ‘dust’, ‘old’, and ‘left’. While we find no Indo-Iranian words in Karakalpak that are not also attested in Uzbek, we do occasionally find Indo-Iranian loans in Uzbek that have not moved on into Karakalpak (see S2, concepts ‘dig’, ‘say’, ‘sing’, ‘fire’), confirming that the path led from Indo-Iranian through Uzbek into Karakalpak.

In reality, according to the Uzbek norm, many words that are close to Tajik have been deliberately replaced, for political purposes, by others that do not correspond to the spoken language. This is similar to what happened to American English, which has diverged from British English as a result of deliberate intervention to reform spelling, and as the result of cultural independence when new words were adopted for concepts that also existed in the United Kingdom.

7.4.5 Perspectives of investigation

As we mentioned already, our next scientific endeavor will be to compare the patterns of genetic variability with those of the linguistic differentiation described in this paper. As a working hypothesis, we expect the groups that are bilingual and/or that use many loan words to be more admixed genetically and vice versa. A good mapping of the ethnic groups would also be useful to see which languages are in direct contact or not. The only comprehensive documentation available at the moment (CIA 1993) is not fully convincing, because many ethnic groups are located outside the country to which they culturally “belong” (Tajiks in Uzbekistan, Uzbeks in Tajikistan or Kyrgyzstan), all of whom appear in the documentation as less numerous than our fieldwork revealed. It goes without saying that we could not approach the authors of the C.I.A. map to question them about the methodology used to obtain it. This is why we invite the reader to consult such a map only to get a rough idea of the capricious human geography of the region and to appreciate the extent of uninhabited territories.

Concerning Karakalpakstan, it will be interesting to see whether the Uzbek and Kazakh groups that (probably) lost their language exhibit a genetic difference from the Karakalpaks that exceeds the average difference within the Karakalpak group itself. As far as cultural anthropology is concerned, Karakalpakstan is an exceptionally interesting area deserving further research.

Linguistically, our paper suggests that there is a place for further work in the automatic detection of loan words. We used a very rough measure of pronunciation difference, and we would expect the detection rate to improve if we employed a more sensitive measure, but we leave this, too, to future work.

Acknowledgments:

We would like to thank three anonymous reviewers and the Editor of the Journal for having significantly contributed to improving the paper. The study was supported by a European Science Foundation OMLL (Origin of Man, Language and Languages) research grant to François Jacquesson (CNRS), by the ANR (Agence Nationale de la Recherche, France) NUTGENEVOL grant (07-BLAN-0064) to Evelyne Heyer and by the CNRS (Centre National de la Recherche Scientifique, France) cooperation program PICS 122377 DEMOAC to Evelyne Heyer. It is a special pleasure to thank Professor Almaz Aldashev13 (Academy of Sciences of Kyrgyzstan), Dzhypara Turdubayeva, MD, Professor Tamara Aripova, Dr. Tatyana Hegay, Professor Khodzhakhmet Esbergenov (Academy of Sciences of Uzbekistan), Dr. Firuza Nasyrova (Academy of Sciences of Tajikistan), Dr. Nargis Khodzhaeva (Donish Institute of History, Archaeology and Ethnography, Dushanbe, Tajikistan) and Dr. Sayfiddin Mirzoev (Rudaki Institute of Language, Literature, Oriental and Written heritage, Dushanbe, Tajikistan) for their scientific guidance and help over the years. Valuable scientific discussion and input have been provided by Professor Éva Ágnes Csató Johanson (University of Uppsala), Professor Pierre Darlu and Professor Bernard Dupaigne (both at the National Museum of Natural History, Paris). We express gratitude to all the local Authorities that have provided authorizations to conduct investigations in many Oblasts of Central Asia, as well as to all the eighty-eight anonymous volunteers who agreed to be enrolled in the linguistic inquiry, for their time, dedication and enthusiasm. Logistic support during fieldwork was partly provided by the Institut Français d’Etudes sur l’Asie Centrale (IFEAC, Uzbekistan/Kyrgyzstan) and by Mr. Stanislav Ashuraliev (Uzbekistan).

13 Professor Almaz A. Aldashev (1953‐2016) was Vice‐President of the National Academy of Sciences of the Kyrgyz Republic and Director of the Institute of Molecular Biology and Medicine, Bishkek, Kyrgyz Republic. I remember him as a very nice and humble man.

References:

Alewijnse, Bart, Nerbonne, John, Van der Veen, Lolke and Franz Manni. 2007. A Computational Analysis of Gabon Varieties. In: Petya Osenova et al. (eds.) Proceedings of the RANLP Workshop on Computational Phonology, Workshop at the conference Recent Advances in Natural Language Processing, Borovetz (Bulgaria), 3‐12.

Andreev, Nikolaj Dimitrievič and Orest Petrovič Sunik. 1982. O probleme rodstva altajskix jazykov i metodax ee rešenija. Voprosy jazykoznanija, 2: 26‐35.

Balci, Bayram, Ibraguimov, Khouïdakoul, Mansourov, Ouloughbek and Johann Uhrès. 2001. Dictionnaire ouzbek‐français. Paris: LʹAsiathèque.

Baskakov, Nikolaj Aleksandrovič (ed.). 1966. Tjurkskie jazyki, Jazyki narodov USSR II. Moscow, USSR: Nauka.

Boattini, Alessio, Lisa, Antonella, Fiorani, Ornella, Zei, Gianna, Pettener, Davide and Franz Manni. 2012. General method to unravel ancient population structures through surnames. Final validation on Italian data. Human Biology 84(3): 235‐270.

Chaix, Raphaëlle, Austerlitz, Frédéric, Khegay, Tatjana, Jacquesson, Svetlana, Hammer, Michael F., Heyer, Evelyne and Lluis Quintana Murci. 2004. The genetic or mythical ancestry of descent groups: lessons from the Y chromosome. American Journal of Human Genetics 75: 1113‐1116.

Chaix, Raphaëlle, Quintana Murci, Lluis, Hegay, Tatyana, Hammer, Michael F., Mobasher, Zahra, Austerlitz, Frédéric and Evelyne Heyer. 2007. From social to genetic structures in central Asia. Current Biology, 17: 43‐48.

Chambers, Jack. 1995. Sociolinguistic theory. Linguistic variation and its social significance. Oxford (UK) and Cambridge (USA): Blackwell Publishers.

CIA (Central Intelligence Agency). 1993. Major ethnic groups in Central Asia, map n°729792 9‐93 [25 x 34 cm, color]. CIA, Washington DC (USA). Accessed through the website of the Library of Congress of the USA (#93686639) www.loc.gov/item/93686639.

Darlu, Pierre, Bloothooft, Gerrit, Boattini, Alessio, Brouwer, Leendert, Brouwer, Matthijs, Brunet, Guy, Chareille, Pascal, Cheshire, James, Coates, Richard, Longley, Paul, Dräger, Kathrin, Desjardins, Bertrand, Hanks, Patrick, Mandemakers, Kees, Mateos, Pablo, Pettener, Davide, Useli, Antonella and Franz Manni. 2012. The family name as socio‐cultural feature and genetic metaphor: from concepts to methods. Human Biology 84(2): 169‐214.

Derbisheva, Zamira Kasymbekova. 2009. Jazykovaja politika i jazykovaja situacija v Kyrgyzstane. Journal, 59: 1.

Du, Juan. 2002. Combined Algorithms for Constrained Estimation of Finite Mixture Distributions with Grouped and Conditional Data. MA Thesis, Ontario, Canada: McMaster University.

Ethnologue. 2015. Ethnologue, languages of the world. Summer Institute of Linguistics (SIL) International Publications, Dallas (TX), USA. Online version at www.ethnologue.com

Forster, Peter and Colin Renfrew (eds.). 2006. Phylogenetic methods and the prehistory of languages. Cambridge (UK): McDonald Institute for Archaeological Research.

Gooskens, Charlotte and Wilbert Heeringa. 2004. Perceptive evaluation of Levenshtein dialect distance measurements using Norwegian dialect data. Language Variation and Change, 16(3): 189‐207.

Gooskens, Charlotte, Heeringa, Wilbert and Karin Beijering. 2008. Phonetic and lexical predictors of intelligibility. International Journal of Humanities and Arts Computing, 2(1‐2): 63‐81.

Gray, Russell D. and Quentin D. Atkinson. 2003. Language‐tree divergence times support the Anatolian theory of Indo‐European origin. Nature 426(6965): 435‐9.

Greenhill, Simon J. 2011. Levenshtein distances fail to identify language relationships accurately. Computational Linguistics, 37(4): 689‐698.

Heeringa, Wilbert, Kleiweg, Peter, Gooskens, Charlotte and John Nerbonne. 2006. Evaluation of String Distance Algorithms for Dialectology. In: John Nerbonne and Erhard W. Hinrichs (eds.) Linguistic Distances Workshop at the joint conference of International Committee on Computational Linguistics and the Association for Computational Linguistics, 51‐62. Sydney.

Heeringa, Wilbert, Nerbonne, John and Peter Kleiweg. 2002. Validating dialect comparison methods. In: Wolfgang Gaul and Gerd Ritter (eds.) Classification, automation, and new media. Proceedings of the 24th Annual Conference of the Gesellschaft für Klassifikation, 445‐452. Berlin: Springer.

Heeringa, Wilbert. 2004. Measuring Dialect Pronunciation Differences using Levenshtein Distance. PhD dissertation. The Netherlands: Rijksuniversiteit Groningen.

Heyer, Evelyne, Balaresque, Patricia, Jobling, Marc A., Quintana Murci, Lluis, Chaix, Raphaëlle, Segurel, Laure, Aldashev, Almaz and Tatyana Hegay. 2009. Genetic diversity and the emergence of ethnic groups in Central Asia. BMC Genetics, 10: 49.

Heyer, Evelyne, Brazier, Lionel, Segurel, Laure, Hegay, Tatyana, Austerlitz, Frédéric, Quintana Murci, Lluis, Georges, Myriam, Pasquet, Patrick and Michel Veuille. 2011. Lactase persistence in central Asia: phenotype, genotype, and evolution. Human Biology 83: 379‐392.

Holman, Eric W., Wichmann, Søren, Brown, Cecil H., Velupillai, Viveka, Müller, André and Dik Bakker. 2008. Explorations in automated language classification. Folia Linguistica, 42(3‐4): 331‐354.

Jacquesson, François. 2002. Les parlers karakalpak dans leur contexte. Cahiers dʹAsie Centrale, “Karakalpaks et autres gens de l’Aral, entre rivages et déserts” 10: 93‐137. Tachkent (Uzbekistan), Aix‐en‐Provence (France): Edisud.

Jacquesson, Svetlana. 2002. Parcours ethnographiques dans l’histoire des deltas. Cahiers dʹAsie Centrale, “Karakalpaks et autres gens de l’Aral, entre rivages et déserts” 10: 51‐92. Tachkent (Uzbekistan), Aix‐en‐Provence (France): Edisud.

Jäger, Gerhard. 2013. Phylogenetic inference from word lists using weighted alignment with empirically determined weights. Language Dynamics and Change 3(2): 245‐291.

Jeffers, Robert and Ilse Lehiste. 1979. Principles and methods for historical linguistics. Cambridge (USA): MIT Press.

Johanson, Lars and Éva Á. Csató (eds.), The Turkic languages. London (UK): Routledge. pp. 333‐343.

Junusaliev, Bolot Muratalievič. 1966. « Kirgizskij jazyk », in: V.V. Vinogradov, red., Jazyki narodov USSR, 2, Tjurkskie jazyki, 482‐505. Moscow, USSR: Nauka.

Kassian, Alexei. 2015. Towards a formal genealogical classification of the Lezgian languages (North Caucasus): Testing various phylogenetic methods on lexical data. PloS one, 10(2), DOI: 10.1371/journal.pone.0116950

Kerimova, Aza Alimovna. 1959. Govor tadžikov Buxary, Izd. vostočnoj literatury, Moscow, USSR, 163 pp.

Kessler, Brett. 2001. The significance of word lists. Stanford (USA): CSLI Press.

Kirchner, Mark. 1998. Kazakh and Karakalpak. In: Lars Johanson and Éva Á. Csató (eds.) The Turkic languages, 318‐332. London: Routledge.

Labov, William. 1990. The intersection of sex and social class in the course of linguistic change. Language Variation and Change 2: 205‐254.

Levenshtein, Vladimir Iosifovich. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Cybernetics and Control Theory, 10: 707–710.

List, Johann‐Mattis. 2014. Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.

Manni, Franz. 2010. Sprachraum and genetics. In: Alfred Lameli, Roland Kehrein and Stefan Rabanus (eds.) Mapping language, 524‐541. Berlin, New York (USA): Mouton de Gruyter.

Manning, Chris D. and Hinrich Schütze. 1999. Foundations of statistical natural language processing. Cambridge (USA): MIT Press.

Manning, Chris D., Raghavan, Prabhakar and Hinrich Schütze. 2008. Introduction to information retrieval. Cambridge (USA): Cambridge University Press.

Martinez Cruz, Begoña, Vitalis, Renaud, Segurel, Laure, Austerlitz, Frédéric, Georges, Myriam, Thery, Sylvain, Quintana Murci, Lluis, Hegay, Tatyana, Aldashev, Almaz, Nasyrova, Firuza and Evelyne Heyer. 2011. In the heartland of Eurasia: the multilocus genetic landscape of Central Asian populations. European Journal of Human Genetics, 19: 216‐223.

McMahon, April and Robert McMahon. 2005. Language classification by numbers. Oxford (UK): Oxford University Press.

Menges, Karl H. 1968. The Turkic languages and peoples – An Introduction to Turkic studies, Ural‐Altaische Bibliothek. Wiesbaden: Otto Harrassowitz.

Montemagni, Simonetta, Wieling, Martijn, de Jonge, Bob and John Nerbonne. 2013. Synchronic Patterns of Tuscan Phonetic Variation and Diachronic Change: Evidence from a Dialectometric Study. LLC: Journal of Digital Scholarship in the Humanities 28(1): 157‐172.

Moukhtor, Chokir, Ibraguimov, Khouïdakoul and Ouloughbek Mansourov. 2003. Dictionnaire Tajik‐français, published by “Langues & Mondes‐LʹAsiathèque & IFEAC”, Paris, France, 357 pp.

Nerbonne, John and Christine Siedle. 2005. Dialektklassifikation auf der Grundlage aggregierter Ausspracheunterschiede. Zeitschrift für Dialektologie und Linguistik 72(2): 129‐147.

Nerbonne, John, Colen, Rinke, Gooskens, Charlotte, Kleiweg, Peter and Therese Leinonen. 2011. Gabmap - a web application for dialectology. Dialectologia: revista electrònica, 65‐89.

Nerbonne, John, Heeringa, Wilbert and Peter Kleiweg. 1999. Edit distance and dialect proximity. In: David Sankoff and Joseph Kruskal (eds.) Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, 2nd ed., v‐xv. Stanford (USA): CSLI Press.

Nerbonne, John, Kleiweg, Peter, Heeringa, Wilbert and Franz Manni. 2008. Projecting Dialect Differences to Geography: Bootstrap Clustering vs. Noisy Clustering. In: Christine Preisach, Lars Schmidt‐Thieme, Hans Burkhardt and Reinhold Decker (eds.) Data Analysis, Machine Learning, and Applications. Proceedings of the 31st Annual Meeting of the German Classification Society, 647‐654. Berlin: Springer (Studies in Classification, Data Analysis, and Knowledge Organization).

Nerbonne, John. 2009. Data‐Driven Dialectology. Language and Linguistics Compass, 3(1): 75‐198.

Nunnally, Jum C. 1978. Psychometric theory, 2nd edition. New York: McGraw‐Hill.

Prokić, Jelena and John Nerbonne. 2008. Recognising groups among dialects. International Journal of Humanities and Arts Computing, 2(1‐2): 153‐172.

Prokić, Jelena, Nerbonne, John, Zhobov, Vladimir, Osenova, Petya, Simov, Kiril, Zastrow, Thomas and Erhard Hinrichs. 2009. The computational analysis of Bulgarian dialect pronunciation. Serdica Journal of Computing, 3(3): 269‐298.

Rastorgueva, Vera Sergeevna. 1963. Očerki po tadžikskoj dialektologii, 5: Tadžiksko‐russkij dialektnyj slovar’, AN USSR, Moscow, 250 pp.

Rastorgueva, Vera Sergeevna. 1964. Opyt sravnitel’nogo izučenija tadžikskix govorov, Nauka, Moscow, USSR, 188 pp.

Segurel, Laure, Martinez Cruz, Begoña, Quintana Murci, Lluis, Balaresque, Patricia, Georges, Myriam, Hegay, Tatyana, Aldashev, Almaz, Nasyrova, Firuza, Jobling, Marc, Heyer, Evelyne and Renaud Vitalis. 2008. Sex‐specific genetic structure and social organization in Central Asia: insights from a multi‐locus study. PLoS Genetics DOI: 10.1371/journal.pgen.1000200.

Segurel, Laure, Austerlitz, Frédéric, Toupance, Bruno, Gautier, Mathieu, Kelley, Johanna L., Pasquet, Patrick, Lonjou, Christine, Georges, Myriam, Voisin, Sarah, Cruaud, Corinne, Hegay, Tatyana, Aldashev, Almaz, Vitalis, Renaud and Evelyne Heyer. 2013. Positive selection of protective variants for type 2 diabetes from the Neolithic onward: a case study in Central Asia. European Journal of Human Genetics, 21: 1146‐1151. doi:10.1038/ejhg.2012.295

Segurel, Laure, Lafosse, Sophie, Heyer, Evelyne and Renaud Vitalis. 2010. Frequency of the AGT Pro11Leu polymorphism in humans: Does diet matter? Annals of Human Genetics 74: 57‐64.

Šimičić, Lucija, Houtzagers, Peter, Sujoldžić, Anita and John Nerbonne. 2013. Diatopic Patterning of Croatian Varieties in the Adriatic Region. Journal of Slavic Linguistics, 21(2): 259‐301.

Sturrock, Kenneth and Jorge Rocha. 2000. A multidimensional scaling stress evaluation table. Field Methods, 12: 49‐60.

Swadesh, Morris. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics, 21: 121‐137.

Swadesh, Morris. 1972. What is glottochronology. In: Morris Swadesh (ed.) The origin and diversification of languages, 271‐284. London (UK): Routledge & Kegan Paul.

Van Der Ark, René, Mennecier, Philippe, Nerbonne, John and Franz Manni. 2007. Preliminary Identification of Language Groups and Loan Words in Central Asia. In: Petya Osenova et al. (eds.) Proceedings of the RANLP Workshop on Computational Phonology, Workshop at the conference Recent Advances in Natural Language Processing, Borovetz (Bulgaria), 12‐20.

Van der Ark, René. 2008. Comparing languages and dialects in Central Asia. M.A. thesis. Groningen: University of Groningen.

Wells, John. 1997. SAMPA computer readable phonetic alphabet. In: Dafydd Gibbon, Roger K. Moore and Richard Winski (eds.) Handbook of standards and resources for spoken language systems. Berlin: Walter de Gruyter. Appendix B.

Wichmann, Søren et al. 2013. The ASJP Database (version 16). Available at http://asjp.clld.org/

Wichmann, Søren, Holman, Eric W., Bakker, Dik and Cecil H. Brown. 2010. Evaluating linguistic distance measures. Physica A, 389: 3632‐3639.

Wieling, Martijn, Margaretha, Eliza and John Nerbonne. 2012. Inducing a measure of phonetic similarity from pronunciation variation. Journal of Phonetics, 40: 307‐314.

Wieling, Martijn and John Nerbonne. 2015. Advances in dialectometry. Annual Review of Linguistics, 1(1): 243‐264.

Wieling, Martijn, Shackleton, Robert Jr. and John Nerbonne. 2013. Analyzing phonetic variation in the traditional English dialects: Simultaneously clustering dialect and phonetic features. LLC: Journal of Digital Scholarship in the Humanities 28: 31‐41.

CHAPTER 8

This chapter is unpublished; please cite it as follows:

Manni F. 2017. General conclusions and new prospects. In: Linguistic probes into human history (Chapter 8). PhD dissertation, Groningen dissertations in linguistics n° 162. ISBN 978‐90‐367‐9872‐3. Groningen: University of Groningen.

GENERAL CONCLUSIONS AND NEW PROSPECTS

The chapters of this dissertation largely overlap in aims and methods, forming a coherent set of empirical studies meant to shed light on the peopling phases of different countries and areas: i) CHAPTERS 4 and 5 concern the mapping and classification of Dutch dialects and Spanish languages with respect to the surname differences of the two countries; ii) CHAPTER 6 is about the classification of the Bantu languages spoken in Gabon in connection with population genetics inquiry; iii) CHAPTER 7 addresses the classification of several languages spoken in Central Asia and the quantification of their borrowing from each other, with the aim of providing migration and population‐contact hypotheses that population geneticists can test. Methodological questions, inherent to every cross‐disciplinary effort, have also been addressed in this thesis. They concern the rationale for genetic and linguistic comparisons (see CHAPTER 2), the assessment of the robustness of linguistic classifications (CHAPTER 3) and, more generally (except in CHAPTER 5), the use of the Levenshtein distance to measure pronunciation distances. Each chapter ends with a specific conclusions section that remains valid today, some years after the publication of the corresponding articles,¹ and there is no need to repeat here what has been said. Nonetheless, it is worthwhile to tie the various pieces together and to reflect on what we have learned. This final chapter provides a wider methodological discussion of the Levenshtein distance, because the empirical studies included in the dissertation shed light on its specificities in measuring linguistic difference. This is the focus of Section 8.1. In Section 8.2 I will first review the findings about the relation between pronunciation differences and geographic distance, before suggesting a new line of investigation showing how residual Levenshtein distances can provide testable hypotheses about past linguistic convergence and divergence, and perhaps address the influence that population growth and migrations have on linguistic variability. To do so, I will focus on family names: markers that make it possible to depict migrations that occurred in historical times, that is, for European countries, in the last five centuries.² Family names, appropriately processed, make it possible to distinguish the regions that received many immigrants from those that have remained demographically more isolated, aspects that underlie dialect and language contact. Finally, Section 8.3 develops a perspective from which we may examine the effects of migration on language change.

1 Except for CHAPTER 6, which is unpublished.

2 Surnames became fixed starting in the sixteenth century, when the Roman Catholic Church made it compulsory for every parish to keep a register listing newborns and the deceased.

8.1 THE ESSENCE OF THE LEVENSHTEIN DISTANCE

8.1.1 The Levenshtein distance and the feature system

To measure the difference between two pronunciations computationally, Nerbonne et al. (1996) adopted the Levenshtein distance, an edit distance that consists in optimally aligning two strings and counting the number of operations needed to turn one string into the other; for example, one step is required to go from ‘ABA’ to ‘ACA’, namely replacing B with C. This first implementation of the method was very simple, with 1/0 costs and no features describing segments. The results were very encouraging and, soon thereafter, Nerbonne and Heeringa (1997) based distance functions on binary features, as Gildea and Jurafsky (1996) had done. Phonetic segments were represented by binary vectors in which every entry stood for a single articulatory feature, thus enabling the distinction of a large number of phonetic segments. Later applications have tested other feature systems that are closer to the IPA features, especially concerning vowels. Other experiments concerned normalizing by the total length of the alignment, forbidding consonants to align with vowels,³ and treating transpositions (swaps) in a special manner. The calculations reported in CHAPTER 4 are based on a unit‐cost model normalized by the length of the alignment, while those of CHAPTERS 6 and 7 are based on gradual segmental distances not normalized by length (see Table 8.1).

The issue of segmental similarity was a major focus of the work in Groningen for about ten years, culminating in Heeringa (2004), which devotes a 52‐page chapter to the question of how to measure the similarity of two phonetic segments by comparing three phonetic feature systems: first, a 7‐feature system borrowed from ‘The Sound Pattern of English’; second, those of Vieregge et al. (1984) and Cucchiarini (1993), for vowels and consonants respectively; and third, a system developed for phonetic segment characterisation by Almeida and Braun (1985, 1986).
The more sensitive representations failed to lead to significant improvements when measured against the perception judgments of dialect speakers (Heeringa 2004, p. 186) in the frame of the aggregate comparison of full dialect varieties. Other distances exist when the task is to focus with very high precision on the phonological difference of single words (see next section).

3 To deal with syllabicity, the Levenshtein algorithm may be adapted so that only vowels may match with vowels, and consonants with consonants, with several exceptions: [j] and [w] may match with both consonants and vowels, [i] and [u] with both vowels and consonants, and central vowels with both vowels and sonorant consonants. So [i], [u], [j] and [w] align with anything, [ə] with syllabic (sonorant) consonants, but otherwise vowels align with vowels and consonants with consonants. In this way unlikely matches (e.g., a [p] with an [a]) are prevented. This approach was first applied to Sardinian dialects (Bolognesi and Heeringa 2002), and then to Dutch (Heeringa and Braun 2003) in a validation study.
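The two refinements discussed above, feature‐based (gradual) segment distances and the ban on vowel/consonant alignments, can be illustrated with a minimal Python sketch. It is not the implementation used in Groningen: the five‐feature table, the indel cost of 0.5 and all function names are hypothetical choices made for illustration only, and the toy feature values are not meant to be phonetically authoritative.

```python
from math import inf

# Hypothetical toy feature table (NOT one of the feature systems cited above).
# Order of the binary features: (syllabic, voiced, nasal, high, back)
FEATURES = {
    "p": (0, 0, 0, 0, 0),
    "b": (0, 1, 0, 0, 0),
    "m": (0, 1, 1, 0, 0),
    "j": (0, 1, 0, 1, 0),
    "w": (0, 1, 0, 1, 1),
    "a": (1, 1, 0, 0, 0),
    "i": (1, 1, 0, 1, 0),
    "u": (1, 1, 0, 1, 1),
}

# Segments that may align with both vowels and consonants (see the footnote).
FLEXIBLE = {"i", "u", "j", "w"}


def is_vowel(segment):
    """A segment counts as a vowel when its 'syllabic' feature is set."""
    return FEATURES[segment][0] == 1


def substitution_cost(x, y):
    """Gradual cost: share of differing features; inf for forbidden V/C pairs."""
    if x == y:
        return 0.0
    if is_vowel(x) != is_vowel(y) and x not in FLEXIBLE and y not in FLEXIBLE:
        return inf  # forbidden: the aligner must use two indels instead
    fx, fy = FEATURES[x], FEATURES[y]
    return sum(a != b for a, b in zip(fx, fy)) / len(fx)


def weighted_levenshtein(a, b, indel=0.5):
    """Levenshtein distance with gradual substitution costs (indel cost assumed)."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * indel]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + substitution_cost(x, y),  # substitute
                           prev[j] + indel,                        # delete x
                           cur[j - 1] + indel))                    # insert y
        prev = cur
    return prev[-1]


print(weighted_levenshtein(["p", "a"], ["b", "a"]))  # only voicing differs
print(weighted_levenshtein(["p"], ["a"]))            # forbidden pair: two indels
```

With such a scheme [p] and [b] come out as much closer than [p] and [m], whereas the unit‐cost model treats every substitution alike.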

Table 8.1  Alignment of the word ‘rabbit’ (Kaninchen in German and konijn in Dutch) with a unit‐cost model. Five differences are found over an alignment of nine positions. When normalization by length is applied, a distance of 5/9 ≈ 0.56 (56%) is found. A dot marks a gap. From Heeringa (2004), p. 131.

German: k a n i n ç ə n        Dutch: k o n ɛ i n

Position    1  2  3  4  5  6  7  8  9
German      k  a  n  .  i  n  ç  ə  n
Dutch       k  o  n  ɛ  i  n  .  .  .
Cost        0  1  0  1  0  0  1  1  1   = 5
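The unit‐cost computation of Table 8.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is hypothetical, the segment symbols used for the German and Dutch forms are assumed transcriptions, and the normalization divides by the length of the longest alignment among those of minimal cost, which is what yields nine positions in the table.

```python
def normalized_levenshtein(a, b):
    """Unit-cost Levenshtein distance between two segment sequences.

    Returns (cost, alignment_length, cost / alignment_length), where the
    alignment length is that of the longest minimum-cost alignment.
    """
    n, m = len(a), len(b)
    # best[i][j] = (cost, -length) for aligning a[:i] with b[:j]. Tuples are
    # compared lexicographically, so min() prefers the lowest cost and, among
    # equal costs, the longest alignment.
    best = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = (i, -i)  # i deletions
    for j in range(1, m + 1):
        best[0][j] = (j, -j)  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            best[i][j] = min(
                (best[i - 1][j - 1][0] + sub, best[i - 1][j - 1][1] - 1),
                (best[i - 1][j][0] + 1, best[i - 1][j][1] - 1),
                (best[i][j - 1][0] + 1, best[i][j - 1][1] - 1),
            )
    cost, negative_length = best[n][m]
    length = -negative_length
    return cost, length, (cost / length if length else 0.0)


# Assumed segmentations of the two words compared in Table 8.1.
german = ["k", "a", "n", "i", "n", "ç", "ə", "n"]
dutch = ["k", "o", "n", "ɛ", "i", "n"]
print(normalized_levenshtein(german, dutch))  # cost 5 over 9 positions
print(normalized_levenshtein(list("ABA"), list("ACA")))  # one substitution
```

Dividing by the alignment length keeps long words from dominating an aggregate comparison simply because they offer more positions in which to differ.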

8.1.2 The Levenshtein distance measures intelligibility

Given the extensive use of the Levenshtein method in the dissertation (see also CHAPTER 1), it seems appropriate to review here the discussion it has stimulated in the literature. Although some of the research included in the dissertation was conducted before certain criticisms were levelled, they cannot be ignored now. First, it should be made clear that the Levenshtein algorithm performed very well in the comparison of dialect varieties based on fairly large wordlists. For the purposes of identifying cognates, recognizing loan words, or identifying sound correspondences, however, more sensitive measures were designed. They are evaluated by how well they work per word rather than on sets of words, and they find application in computational phonology, including dialectometry (Covington 1996; Somers 1998; Gildea and Jurafsky 1996; Kessler 1995; Oakes 2000; see Kondrak 2003 and Kondrak and Sherif 2006 for a review). Among them, Kondrak’s algorithms (2002), COGIT and ALINE, gained renown in historical linguistics. While COGIT is meant to identify cognate words automatically, ALINE is similar to the Levenshtein distance, but it incorporates the notions of phonetic coalescence and break‐up and uses a non‐binary feature system. If the author does not compare his algorithm to the Levenshtein method (or to the other existing ones, except Covington’s (1996)), this is probably because the application is different, mainly concerning language reconstruction, whereas the Levenshtein approach was initially used to identify language areas. It would be interesting to see how much Kondrak’s alterations would improve the performance of the Levenshtein method.

Explicit criticisms of the use of the Levenshtein approach were expressed by McMahon and McMahon (2005)⁴ and Greenhill (2011). The latter author described the approach as blind, unable to distinguish chance similarities from real cognates and lying at the linguistic “surface”, suggesting that better methods should be adopted to classify language varieties: a criticism that, from his standpoint, could have been levelled at a majority of the phonological distances cited. While Greenhill (2011) conceded that the Levenshtein distance worked well at low time depths, he suggested using, instead, adaptive algorithms that learn the transition weights through the application of naïve Bayesian classifiers or stochastic transducers. The former are generally used for natural language processing tasks and have met with great success in the subfield of authorship analysis, the process of attributing an anonymous text to an author according to its writing characteristics (Juola et al. 2006). Stochastic transducers, popular in automatic translation, have been applied to dialectology by Wieling et al. (2007a) and Scherrer (2007). Scherrer’s work compares stochastic transducers with the Levenshtein distance, which was used as a baseline for experiments of bilingual lexical induction between the Swiss German dialect of Bern and standard German. Although less efficient, the Levenshtein distance performed well, as had been shown by Mann and Yarowsky (2001), who induced over 90% of the Spanish‐Portuguese cognate vocabulary with an 11% F‐measure improvement over the Levenshtein algorithm; in several cases, however, stochastic transducers offered no improvement over the Levenshtein distance. This result is in agreement with the findings of Wieling et al. (2007a), showing that the induced segment distances correlate well with distances in formant space.
4 See the reply of John Nerbonne (2005), where he reviews a large number of methodological misunderstandings, by the McMahons, about the way the Levenshtein distance is computed. McMahon and McMahon (2005) object that Nerbonne and Heeringa’s (1997, p. 11) “earlier work calculated edit distance in the simplest possible way meaning that the pair [a,t] count as different to the same degree as [a, ]”. But in fact the paper they cite focuses on how to differentiate such sounds more subtly, exploiting phonetic and phonological features for this purpose.

Besides the efforts to improve the sensitivity of the Levenshtein distance using features taken from phonetics and phonology, data‐driven techniques have also been applied for the same purpose. Wieling et al. (2012) used an iterative technique to induce segment distances. They first applied the Levenshtein algorithm to a large dataset of dialect pronunciations and collected all the alignments. From these, they extracted the segment correspondences and their frequencies, which they gathered in a large contingency table from which they could calculate an information‐theoretic measure of the affinity of pairs of correspondences called pointwise mutual information (PMI).⁵ Finally, they used (an inversion of) the PMI values as substitution costs in the following iteration of the procedure. Experiments with several datasets showed that the procedure stabilized within ten or fewer iterations. An evaluation based on alignment accuracy confirmed that the PMI‐based version of the Levenshtein algorithm reduced error only slightly (from 3% error to about 2.5%) with respect to the classical method. Jäger (2015) used a simplified version of this algorithm to avoid the need for expert judgments on cognacy in historical linguistics, showing that the results were confirmed by Glottolog classifications (Hammarström et al. 2016).

As already noted, the Levenshtein method was not originally introduced to linguistics for the task of identifying cognates, meaning that the distances it yields are not to be taken as an estimate of the divergence time between language varieties, and suggesting that the criticisms of Greenhill (2011) and McMahon and McMahon (2005) concerning its use for this purpose were hasty. While it would be interesting to compare some of the more sophisticated versions of the Levenshtein distance with other algorithms, that must be taken as a note for future work. In fact, dialectology and historical linguistics have divergent aims.
In the former the focus is on the overall similarity of entire varieties, whereas in the latter attention goes to the identification of cognates and to measuring the similarity of individual words. The primary use of the Levenshtein approach has been to seek and identify the signal of geographic provenance in dialect speech, while historical linguistics addresses a signal of historical “relatedness” at the level of the variety and at the level of individual words. Another important difference between the two is their relation to geography, which influences the distribution of dialectal varieties massively, but not necessarily discretely. Heeringa and Nerbonne (2001) have shown how dialectal analysis provides an analytical foundation for the notion of “dialect continuum”, where the classification into discrete groups, the very heart of phylogenetic analysis and historical classifications, plays no role. Because they were obtained outside an explicit historical linguistics frame, McMahon and McMahon (2005) judged Levenshtein classifications “to a great extent uncorroborated” (p. 213), basing their judgment on expected methodological drawbacks that, in reality, turned out to be far from true: when the technique was applied to Dutch, German, American English, Sardinian, Norwegian, Bulgarian and Catalan, the groupings it provided were recognized by specialists in these dialects and languages. We review formal validation efforts below.

5 Pointwise mutual information (PMI), or point mutual information (Fano 1961), is a measure of association used in information theory, statistics and computational linguistics (see for example Church and Hanks 1990).
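As an illustration of how PMI can be obtained from segment correspondences, the following Python sketch computes PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ) from a list of aligned segment pairs and then rescales the values into substitution costs. It shows a single step, not the full iterative procedure, and it rests on assumptions: the function names are hypothetical, the pairs are treated as ordered, and the linear rescaling used as the “inversion” is only one possible choice that is not taken from the cited work.

```python
from collections import Counter
from math import log2


def pmi_table(aligned_pairs):
    """PMI for each observed (x, y) pair of aligned segments.

    p(x, y) is the relative frequency of the pair; p(x) and p(y) are the
    relative frequencies of x in the first and y in the second position.
    """
    pairs = list(aligned_pairs)
    total = len(pairs)
    pair_counts = Counter(pairs)
    left_counts = Counter(x for x, _ in pairs)
    right_counts = Counter(y for _, y in pairs)
    return {
        (x, y): log2((count / total)
                     / ((left_counts[x] / total) * (right_counts[y] / total)))
        for (x, y), count in pair_counts.items()
    }


def pmi_to_costs(pmi):
    """Assumed inversion: rescale so that the highest PMI gets cost 0, the lowest 1."""
    highest, lowest = max(pmi.values()), min(pmi.values())
    if highest == lowest:
        return {pair: 0.0 for pair in pmi}
    return {pair: (highest - value) / (highest - lowest)
            for pair, value in pmi.items()}


# Made-up correspondences, as if read off a handful of alignments.
pairs = [("a", "a")] * 4 + [("a", "o")] + [("o", "o")] * 3
pmi = pmi_table(pairs)
costs = pmi_to_costs(pmi)
print(pmi[("o", "o")], costs[("o", "o")])  # strongest association, lowest cost
print(pmi[("a", "o")], costs[("a", "o")])  # weakest association, highest cost
```

Pairs that co‐occur more often than chance predicts receive a positive PMI and therefore a low cost, so that in a subsequent alignment round such correspondences are preferred.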

Concerning dialectology and Catalan varieties, the classificatory effectiveness of the Levenshtein algorithm has been tested in comparison with another computational method, the mCOD (Méthode COD) (Clua et al. 2008; Clua and Lloret 2015), which embraces the study of linguistic variation in the areas of phonetics, phonology and inflection. The mCOD approach differs from other dialectometric analyses in that the latter are quantitatively surface‐oriented, while the mCOD was designed to capture the differences among varieties not only quantitatively but also qualitatively, in order to increase the accuracy of the groupings. The distances obtained with the two methods (mCOD vs. Levenshtein) correlate very highly (r = 0.868) and converge in identifying the same borders between dialect areas; the differences between the two methods concern specific details (Valls et al. 2012).

To recall the extensive work conducted in Groningen (see Heeringa 2004, Nerbonne and Heeringa 2010) prior to the widespread application of the Levenshtein method, the classifications have been tested quantitatively by i) addressing the sensitivity of the measure to segment order and to phonological context, ii) taking into account the use or non‐use of length normalisation, iii) testing the linguistic constraint that all alignments respect the consonant/vowel distinction, but also by iv) measuring their (good) match with the overall similarity judgments of dialect speakers (Gooskens and Heeringa 2004,⁶ Gooskens and Heeringa 2006, Heeringa et al. 2006) and comparing them with native speakers’ judgments of accent strength (Wieling et al. 2014). Inspired by the latter research direction, Fontan et al. (2015) have successfully applied the Levenshtein method to measure intelligibility in a project concerning the tuning of hearing aids, thereby setting up automatic measures of speech intelligibility for the recognition of isolated words and sentences, similarly to Sanders and Chin (2009), who found the Levenshtein method to correlate extremely well (r = 0.925**) with naïve human transcriptions of the speech of pediatric cochlear implant users. This is why the Levenshtein distance can be seen as a good measure of intelligibility, that is, the perception a speaker has of the linguistic difference of someone else’s speech (Beijering et al. 2008).

6 For Norwegian dialects, Gooskens and Heeringa (2004) report that perceived linguistic differences correlate at ~0.8 with measured Levenshtein distances, and the correlation would probably be higher if all the (naïve) speakers had the same (high) level of linguistic competence in assessing the varieties located farther away than their neighbourhood. In fact, the perception of linguistic differences is finer within the radius of human interaction and decreases outside it, meaning that speakers are less familiar with distant varieties, which they tend to classify as “very different”, whatever the real geographic or linguistic distance is.
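The correlations quoted in this section (for example r = 0.868 between mCOD and Levenshtein distances, or ~0.8 with perceived differences) are computed over pairs of varieties. A minimal Python sketch of such a comparison follows; the function names and the two toy matrices are hypothetical and are not data from any of the cited studies. Only the coefficient is computed: because the entries of a distance matrix are not independent, significance is usually assessed with a permutation (Mantel) test, which is not shown here.

```python
from math import sqrt


def upper_triangle(matrix):
    """Values above the diagonal of a square distance matrix, row by row."""
    size = len(matrix)
    return [matrix[i][j] for i in range(size) for j in range(i + 1, size)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Toy symmetric matrices for three varieties (illustrative values only):
# 'perceived' is an exact linear function of 'computed', so r equals 1.
computed = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
perceived = [[0, 3, 5], [3, 0, 7], [5, 7, 0]]
print(pearson_r(upper_triangle(computed), upper_triangle(perceived)))
```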

8.1.3 The Levenshtein distance measures contact

To summarize, the Levenshtein algorithm has been criticized for not distinguishing cognates from chance similarities at great time depths, and for measuring the linguistic “surface”, which is perhaps why it correlates well with individual perception.7 This latter aspect is essential in the frame of the research presented in this dissertation, which addresses the cultural proximity of human populations with respect to their ancestry, inferred through genetic markers or family names. By comparing population genetics data to cultural differences measured through linguistic diversity, we are adding detail to the same research question: describing and explaining how people interact and mate. Human mating is influenced by several factors, the first one being the chance to meet (which depends on geographical proximity and social stratification) but also, to a large extent, the perceived attractiveness of the partner. This is where cultural differences play a significant role, together with economic considerations, traditional rules of descent, taboos, etc. Speech conveys information about the geographical origin of the partner, about the social status of the family, about education, etc. Our speech plays a significant role, conscious and unconscious, in the feeling we have about the possible mutual understanding lato sensu with a new partner. When we speak to someone we are not mentally counting the number of shared cognates or borrowed words we both employ; instead we perceive, intuitively and very rapidly, the extent to which her/his speech is close to ours, and a sentiment of closeness or distance can arise, leading to a stronger or weaker desire to interact. Speech is also evocative of many preconceived opinions we have about others, which arise from beliefs, traditions and history.
A measure able to capture perceived linguistic proximity, as the Levenshtein distance does, is useful in the context of cross-disciplinary research involving population genetic and demographic inquiry. However, phonological differences do not concern only the linguistic surface. In an example about Dutch dialects, it has been shown that they correlate with syntactic differences (Spruit et al. 2009), meaning that they reflect a deeper level of the languages, showing how linguistic levels arise from a similar pattern of historical geographic contact.8 This finding has special relevance as it contradicts the notion that (morpho)syntax is hostile to geographic diffusion and suggests that similar mechanisms of diffusion apply to grammar, syntax and pronunciation (Szmrecsanyi 2013, p. 159).

7 But the book is not closed on the suitability of modified Levenshtein algorithms for cognate recognition. T. Rama et al. (in preparation) reject Greenhill’s criticism forcefully, concluding that “PMI [-based Levenshtein] systems yield […] better accuracies than current state-of-the-art systems.” (personal communication to J. Nerbonne, 2017)
8 Spruit et al. (2009) find that pronunciation is associated with syntax and lexis, while syntax and lexis are only weakly associated. The main cause is that pronunciation and syntax are both strongly associated with geography, while lexis is not. When geographic distance is controlled for, as the underlying factor, the association between pronunciation and syntax remains but weakens considerably, while the association between syntax and lexis disappears.

8.1.4 The Levenshtein distance measures historical divergence

Chance similarities aside, two cognate words generally result in the alignment of strings that are similar, meaning that there are fewer differences between related words (Kaninchen/konijn; konijn/coniglio) than between words that are unrelated (Kaninchen/lapin; konijn/‘rabbit’). Given that the less variable part of a set of related words generally concerns the left part of the alignment (Kaninchen, konijn, coniglio), Kondrak (2003) suggested that the Levenshtein distance overestimates the differences between historically related words because all the segments in a word, located on the left or on the right, equally contribute to the measured distance. When the words under scrutiny have significantly diverged, the corresponding pairwise distances account less for their common origin and more for their divergence through time and space, as correctly noted by Greenhill (2011). A strict cognate-based method would not classify Kaninchen, konijn, coniglio as “the same word = zero difference”, even though they all come from the Latin word cuniculus, because the Dutch and German words are borrowings from French. A Levenshtein method would note their similarity, which exceeds that of chance words. This is to say that the Levenshtein distance captures, at the same time, a part of the historical signal that words convey by being cognates (or not), perhaps via borrowing, and the signal related to the phonological change that occurred after the separation, like linguistic divergence or contact; it captures everything related to similarity, agnostically.

Linguistic contact is one of the main reasons explaining the generally observed high degree of correlation between Levenshtein distances and corresponding geographic distances. Two varieties can become increasingly similar through extensive borrowing, to the point that their original (historical) difference is obscured. Since the degree of borrowing and the intensity of communication are proportional to geographic proximity, I have experimented with residual Levenshtein distances in order to see if the historical signal would be emphasized after correcting Levenshtein measurements by controlling for the language contact related to geographic proximity (see section 8.2.2).

While the Levenshtein distance measures both the signal of historical relatedness and the contact between the languages, its ability to match classifications based on shared cognates identified by the comparative method is much higher than the criticisms mentioned above would suggest. In CHAPTER 6 the very good match between the clusters identified by Grollemund et al. (2015) and the corresponding Levenshtein classifications has been reported (Fig. 6.25). This result might be explained by the fact that Bantu languages are linguistically quite close, often forming dialect chains: this is a scenario closer to the initial application of the Levenshtein method to dialectology. Nevertheless, when the Levenshtein classification of six Indo-Iranian and Turkic Central Asian languages (Tajik, Yaghnobi and Kazakh, Karakalpak, Kyrgyz, Uzbek, respectively), described in CHAPTER 7, is compared to a tree only accounting for shared cognates, the two representations overlap well again (Fig. 8.1), without discrepancies, suggesting that the Levenshtein distance, after all, normally captures the same historical signal that a cognate-based approach does, additionally delivering sub-clusters that are less capricious.

Jäger (2015) presents a very promising application of a modified Levenshtein algorithm (Needleman-Wunsch)9 to the problem of detecting historical relatedness. Jäger notes inter alia that the application of the Levenshtein method does not require that language family experts first annotate all of the data to indicate which words are cognates, as the classificatory experiments of the Bantu languages of CHAPTER 6 also indicate. It is therefore much more widely applicable.

In addition to the efforts to adapt the Levenshtein method to the task of identifying cognates carried on by other groups, future research practices might be based on the comparative use of both methods, appropriate versions of the Levenshtein distance versus standard cognate-based classifications. The degree of their divergence, when there is one, is likely to be proportional to a wide panel of effects, linguistic contact and convergence being the first candidates. The discrepancies between Levenshtein and cognate-based classifications, instead of being reported as flaws, could be investigated as clues able to shed light on demographic, sociolinguistic and geographic phenomena that pair off with the verticality of the linguistic transmission.

After 20 years of hectic research focused on the application of computational methods to understanding linguistic variability, the moment has come to recognize that more research has been devoted to methods better able to establish a correct phylogeny of the languages than to approaches able to explain how the contact of the speakers took place and with which consequences: the Levenshtein distance is certainly one of the latter.
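The contrast between related and unrelated word pairs discussed in this section can be illustrated with a minimal sketch. It is only an approximation of the measure used in this dissertation: it works on orthographic strings rather than phonetic transcriptions, it uses unit costs instead of graded segment distances, and the normalisation by the length of the longer string is an assumption made here for simplicity. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit costs for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def normalised_distance(a: str, b: str) -> float:
    """Distance divided by the length of the longer string (an assumed, simple normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    # Word pairs taken from the text, in lower-case orthography
    pairs = [("kaninchen", "konijn"), ("konijn", "coniglio"),
             ("kaninchen", "lapin"), ("konijn", "rabbit")]
    for left, right in pairs:
        print(left, right, round(normalised_distance(left, right), 3))
```

Because every position contributes equally to the count, the sketch also makes Kondrak's (2003) objection concrete: a difference at the end of a word weighs exactly as much as a difference in its more stable initial part.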

9 The Needleman-Wunsch (1970) method is a dynamic programming algorithm for global sequence alignment, a technique particularly appropriate when sequences are of similar length; it finds application in many aspects of computer science.

Figure 8.1  Linguistic similarities and differences between 88 informants interviewed in 23 test sites as in CHAPTER 7. The two trees show a very similar topology. A) Expert cognacy judgments. Cladistic majority-rule consensus tree obtained from 100 bootstrap trees. Branch lengths are posterior estimations. The nodes supported by less than 70% of the trees have been collapsed. Tree length = 1596; Consistency index = 0.4518; Homoplasy Index = 0.5482; Retention index = 0.8699; Rescaled consistency index = 0.3930. Courtesy of Pierre Darlu (National Museum of Natural History, Paris). B) Levenshtein distance. Consensus tree obtained from 100 UPGMA bootstrap trees (Fig. 7.2 in CHAPTER 7). The nodes supported by less than 70% of the trees have been collapsed.

8.2 CURRENT CHALLENGE: GOING BEYOND GEOGRAPHY

In this and the following sections we turn to challenges we are now better prepared to face on the basis of the work in this dissertation. Given the result that the Levenshtein distance is a valid indicator of aggregate similarity, and that it reflects the influence of separation among linguistic varieties, we may ask what other factors influence it. For this purpose we examine residual Levenshtein distances here.

8.2.1 The spread of linguistic innovations

It has long been accepted in linguistics that the geographic distance between varieties has an effect on their evolution, namely that closer varieties are generally more similar than distant ones. The first model of the spread of linguistic innovations was probably the WAVE THEORY of Johannes Schmidt (1872). However, a mathematical modelling of this empirical evidence came only a century later, when Séguy (1971) started to develop computational dialectology. In a similar vein, a theoretical formalization of the phenomenon states that the similarity of dialects is a function of both the geographic proximity and the population size of speech communities. Like the WAVE THEORY, Trudgill’s (1974) GRAVITY MODEL explains the spread of linguistic innovations as a radiation from a centre but, and this is the novel aspect, one which has an effect on larger centres at first, and then spreads to the smaller ones in a cascade of effects (Labov 2001, p. 285) depending on the population size, that is the frequency of linguistic contact.10 When contact occurs, the speakers are influenced by one another’s speech and modify their own, sometimes adopting innovations (Lewis 1979). This model does not take into account geographic features that are likely to increase or decrease the contact (rivers, deserts, mountains, etc.) or social strata, different levels of economic attractiveness or other factors such as the perceived prestige of a given variety, probably because the model was primarily meant to formalize theoretical thoughts, rather than to provide explicit clues about the way to test it empirically.

The spread of innovations is a central aspect in historical- and socio-linguistics and it is gaining increasing attention (see Eisenstein et al. 2014), because today’s social networks enable the measurement of the spread of innovations in real time and make it possible to follow their geographical directions. However interesting, such investigation deals with technologies enabling very easy contact between the speakers (who can remain virtual to each other), with the consequence that the present-day spread of linguistic innovations might actually deviate from the neighbourhood dynamics that Labov and Trudgill assume in their modelling. We note, however, that Eisenstein et al. (2014) definitely find locality effects in Twitter data even though it is a medium allowing world-wide contact.

10 The gravity model assumes that populations are sedentary.

While Séguy (1971) presented dialect distances as a function of the square root of geographic distance, Trudgill (1974) suggested that the spread of innovations declines quadratically. Nerbonne and Heeringa (2007) and Nerbonne (2010) found a logarithmic model to function better, similarly to the models of population genetics concerning the biological differences of neighbouring populations that are a function of migration processes.

The mathematical relations between genetic and geographic distances have long been addressed and Wright (1943) postulated ISOLATION BY DISTANCE (IBD), a model explaining that the genetic similarity of human groups decreases with their geographic distance with reference to spatially limited gene flow, a frequent phenomenon in natural populations. Gustave Malécot (1948) pushed this analysis onwards by establishing that the relation is not linear but logarithmic, and this is what is currently found with surname studies addressing the diversity of local populations (see Scapoli et al. 2007 for European case-studies). The agreement existing in linguistics and population genetics about the exponential decay of human interaction as a function of the geographic distance is probably not due to a chance similarity. The easiest and most logical explanation is to admit that the speakers interact in a neighbourhood that leads, at the same time, to the dissemination of linguistic innovations and offspring.
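The competing descriptions of how linguistic distance grows with geographic distance (linear, square root, logarithmic) can be compared on any dataset by regressing the linguistic distances on each transform of the geographic distances and inspecting the share of variance explained. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the numbers in the usage example are invented for illustration only, and no significance test is attempted (pairwise distances are not independent observations, so a permutation procedure such as the Mantel test would be needed for that purpose).

```python
import math


def r_squared(x, y):
    """Coefficient of determination of a simple least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def compare_transforms(geo_km, ling):
    """R^2 of linguistic distance regressed on three transforms of geographic distance."""
    transforms = {"linear": lambda d: d, "sqrt": math.sqrt, "log": math.log}
    return {name: r_squared([f(d) for d in geo_km], ling)
            for name, f in transforms.items()}


if __name__ == "__main__":
    # Invented pairwise values (km, aggregate linguistic distance), for illustration only
    geo_km = [5, 12, 30, 55, 90, 140, 220, 300]
    ling = [0.10, 0.17, 0.24, 0.29, 0.33, 0.36, 0.40, 0.42]
    for name, value in compare_transforms(geo_km, ling).items():
        print(f"{name}: R^2 = {value:.3f}")
```

The transform with the highest value is the one that best linearizes the relation in the data at hand; geographic distances must be strictly positive for the logarithmic transform to be defined.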

8.2.2 Levenshtein residual distances

At this point it is interesting to speculate on future developments of the research direction illustrated in the previous sections. When a matrix of aggregate linguistic distances, such as those produced by the Levenshtein algorithm, is found to be significantly correlated with the corresponding matrix of geographic distances, it is possible to compute a regression (linguistic distance vs. log [geographic distance]) in order to compute, from it and for each pairwise comparison, the linguistic distance that is expected according to the geographic distance. This procedure leads to a matrix of expected pairwise linguistic distances that can be subtracted from the linguistic distances obtained from the original data. The matrix that results after the subtraction consists of residual distances that can be positive, negative or null. They will be positive when the linguistic distance computed on original data is higher than the one expected from the regression; they will be negative when two localities exhibit a linguistic distance that is lower than what is expected according to the regression. The idea is that residual distances account for the fraction of the linguistic variability that is not explained by normal linguistic contact between neighbours; in fact, residuals correspond to the virtual case in which all the populations would be located at the same geographic distance from one another.11

The matrix of original distances can be compared to the matrix of residual distances, once they are both represented as multidimensional projections or as trees. The differences between the two representations are likely to correspond to the relations among the varieties before they drifted apart due to geographic remoteness or to convergence/divergence phenomena related to contact with other varieties, operationalized as geographic distance. In the two representations, the differences of single linguistic varieties with respect to the others should be considered with caution, because their different positions rely on a geographic correction (the regression model) that arises by taking into account the whole set of pairwise distances, while single varieties might be affected by specific phenomena that the general model of regression does not take into account. In fact, some subsets of the entire dataset, when they are taken separately, can lead to a regression having a different slope (Simpson’s paradox), meaning that the computation of residuals according to the full dataset is not optimal for every sample, because it does not take into account local geographic phenomena. The regression is correct on the whole, but not necessarily in its details. For this reason a matrix of residuals provides tendencies that should be interpreted generally, that is in terms of clusters, to answer questions like the following ones: Are the clusters appearing in the projection based on residual distances the same ones as those that the matrix of original linguistic distances delivers? Does the relative distance between the clusters change from one plot to the other? To explain the usefulness of residual distances, three empirical examples taken from datasets analyzed in this dissertation will be examined in this way.
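The procedure described in this section (regression of linguistic distance on the logarithm of geographic distance, followed by the subtraction of expected from observed distances) can be sketched as follows. This is a minimal illustration and not the software used for the analyses reported here: the function name is hypothetical, ordinary least squares over the upper triangle of the matrices is assumed, and the small matrices in the usage example are invented.

```python
import math


def residual_distances(ling, geo):
    """Return (residual matrix, slope, intercept) for two symmetric distance matrices.

    ling and geo are square lists of lists. Only the pairs i < j enter the regression
    ling = intercept + slope * log(geo); geographic distances must be > 0 off the diagonal.
    """
    n = len(ling)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    x = [math.log(geo[i][j]) for i, j in pairs]
    y = [ling[i][j] for i, j in pairs]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [[0.0] * n for _ in range(n)]  # the diagonal stays at zero
    for (i, j), xi, yi in zip(pairs, x, y):
        expected = intercept + slope * xi      # distance predicted from geography alone
        residuals[i][j] = residuals[j][i] = yi - expected
    return residuals, slope, intercept


if __name__ == "__main__":
    # Invented example with four sites (km and aggregate linguistic distances)
    geo = [[0, 10, 50, 200], [10, 0, 40, 190], [50, 40, 0, 150], [200, 190, 150, 0]]
    ling = [[0, 0.12, 0.30, 0.35], [0.12, 0, 0.20, 0.41],
            [0.30, 0.20, 0, 0.33], [0.35, 0.41, 0.33, 0]]
    matrix, slope, intercept = residual_distances(ling, geo)
    for row in matrix:
        print([round(value, 3) for value in row])
```

A positive entry marks a pair of varieties that differ more than their geographic distance predicts, a negative entry a pair that differs less. Since the regression is fitted on the whole set of pairs, single entries carry the caveats discussed above and are best read at the level of clusters.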

8.2.2.1 The Netherlands

This first example concerns the Dutch dialect areas presented in CHAPTER 4. The matrix of residuals reveals two aspects that the “regular” Levenshtein matrix does not show: i) hidden structures in the phonology of the province of Groningen that still testify to proximity with the Frisian varieties that used to be spoken in the province of Groningen some centuries ago and which have gradually been replaced by the language of the city of Groningen (Lower Saxon),12 and ii) a transcribers’ barrier in the southern part of the country (see Fig. 4.7 in CHAPTER 4).13

11 Controlling for geographic distance is very easy with distance-based methods, like the Levenshtein, but is not readily applicable to multi-character-based methods, like cladistic ones for example. To do so, it is necessary to obtain a distance matrix from the phylogenetic tree by computing, for all the taxa, patristic distances on tree branches and, then, to establish regressions between linguistic and kilometric distances. This routine does not seem to have been applied frequently in the literature.

8.2.2.2 Tanzania

When the regression method is applied to the dataset of Bantu languages from Tanzania (see Fig. 6.11 and section 6.3.1.1 in CHAPTER 6) and residual linguistic distances are compared to the original ones, the topologies of corresponding projections match but the clusters occupy different surfaces (see Fig. 8.2). With residuals the cluster {E50, E60} becomes more compact and closer to other varieties, while the group {F20, M10, M20, M30} becomes less so. The cluster {N10, P10, P20, G50} remains stable. By definition, the topology of the samples portrayed by residual distances is not linked to the linguistic contact between neighbouring varieties (geography is controlled for), meaning that the expansion/contraction of clusters is explained by other factors, probably historical.14

A working hypothesis to be tested could be that the first Bantu speakers that settled in Tanzania spoke varieties that, while divergent, did not belong to clearly identified separate groups. Once the speakers became sedentary, differential linguistic contact between the Bantu immigrants led to phenomena of linguistic convergence in some areas (F20, M10, M20, M30) but not in other ones (E50, E60). Methodologically it is interesting to see that the stress values of multidimensional representations of residuals’ matrices are generally considerably higher than those of original distances. This is a clear indication that residuals’ variability is not easily represented in few dimensions, differently from data that have been shaped by linguistic contact happening in the two dimensions that geography allows.

8.2.2.3 Gabon

Concerning the ALGAB dataset (see section 6.3.1.2 in CHAPTER 6), the residual distances have been computed after a linear regression (R2=0.216; the logarithmic transformation of geographical distances makes almost no difference: R2=0.222). They form clusters that correspond well to those obtained by plotting original distances, but give rise to groups that are more evenly dispersed and better distinguished (Fig. 8.3, Table 8.2).

12 Note that the distinction between Friesland and Groningen is quite strong in pronunciation and in lexis, and much weaker in syntax. Compare Spruit et al.’s (2009) Figures 6 and 7 (pronunciation and lexis, respectively) on the one hand and Figure 8 (syntax) on the other.
13 See Mathussek (2016) for examples about computationally inferred transcribers’ borders.
14 No transcribers’ borders here as each variety has been transcribed by a different person.

Figure 8.2 (see also Fig. 6.11 in CHAPTER 6)  Multidimensional scaling plots concerning 32 Tanzanian languages and 1052 concepts. Left: Projection based on original gradual segmental Levenshtein distance. Stress values: in 1 dimension = 0.1588, in 2 dim. = 0.0966 (plot reported), in 3 dim. = 0.0688. Correlation between geographic and linguistic distance = 0.7. Right: Projection based on residual distances after computing the regression (R2 = 0.4983) between the logarithm of kilometric distances and the corresponding Levenshtein distances. Stress values: in 1 dimension = 0.3779, in 2 dim. = 0.2471 (plot reported), in 3 dim. = 0.1885. Residuals are normally distributed.

Residual distances convey very interesting clues about the possible historical scenario of linguistic diversification of the Bantu varieties in Gabon, a setting that is quite hard to interpret (see section 6.4.5.3 in CHAPTER 6). They point to a certain amount of linguistic diversity between different languages that long-lasting linguistic contact and convergence have progressively effaced (see Table 8.2 for a summary). With residual distances the varieties B50, B60, B70, spoken in a more densely inhabited area, remain close to each other, but we see changes in their relative distances from the groups corresponding to the languages classified as B40 and B20. Further, with residual distances the group B20 is resolved into two separate clusters (corresponding to the two subclusters reported in the bootstrap tree of Fig. 6.17 in CHAPTER 6: {B20-I, B20-II} and {B20-III, B20-IV}), suggesting that varieties B20 are not genealogically related.

Interestingly, the relative distance between the clusters B10 and B30 increases from one plot to the other (Fig. 8.3), recalling the debate about their possible convergence after a separate phylogenetic origin (Nurse and Philippson 2003). The fact that residual distances put the Fang languages (A75) much closer to the group B40 than measured Levenshtein distances do is also intriguing, and can be related to a similar geographic provenance.

Table 8.2  Summary of the possible effects on the Bantu varieties from Gabon that linguistic contact has determined (ALGAB data, see section 6.2.1.2 in CHAPTER 6). This scenario is inferred by comparing the 2 plots of Fig. 8.3.

Varieties   With reference to the initial stage (inferred by residuals) the varieties later…
B10         Converged with B30
B30         Converged with B10
B20         Two initially separate clusters converged together
B40         Converged with B50
B50         Diverged slightly from B60/B70 becoming closer to B40 and some varieties B20
B60/B70     Diverged slightly from B50
A75         A75 arrived in Gabon recently (~5 centuries ago), when B40 was already spoken. Their closeness (residuals’ plot) might correspond to a similar geographic origin in Cameroon: but today they are very different.

Figure 8.3  Multidimensional scaling projections concerning 53 languages from the Linguistic Atlas of Gabon (ALGAB). Top: Original Levenshtein distances. Stress values: in 1 dimension = 0.3247, in 2 dim. = 0.1641 (plot reported), in 3 dim. = 0.1215 (plot shown in Fig. 6.18 in CHAPTER 6). Correlation between geographic and linguistic distances = 0.478. Bottom: Residual distances after computing the regression (R2 = 0.216) between the kilometric distances and the corresponding Levenshtein distances. Stress values: in 1 dimension = 0.399, in 2 dim. = 0.249 (plot reported), in 3 dim. = 0.171. Residuals are normally distributed.

8.3 THE INFLUENCE OF MIGRATION ON REGIONAL LANGUAGES 15

We examine in this section a second promising arena for future work, namely the influence of migration on language. Population genetics and demography can provide evidence about the sources, destinations and sizes of population movements, providing a wealth of data on which to test hypotheses about the effect of migration on language. Our goal here is complementary to Falck et al. (2012), who showed that people prefer to move to areas in which the local dialect is more similar to their own.

8.3.1 The effect of linguistic contact

The GRAVITY MODEL (Trudgill 1974) explains the spread of linguistic innovations using as parameters the geographic distance between speech communities and the number of speakers of the settlements/inhabited areas. While a demographic influence (population size, migrations) on linguistic diversity is obvious (linguistic barriers can orient migrations, however), there are few quantitative studies on this influence, probably due to the lack of detailed and readily available historical demographic data concerning a full linguistic domain. Let us first focus on linguistic contact to address the quantification of the demographic phenomena that drive it. For 40 years a large body of research has focused on investigating the effect of contact between mutually intelligible dialects.16 Many case-studies have demonstrated that different varieties, in direct contact through face-to-face interaction (linguistic accommodation), become more alike over time. Contact-induced linguistic accommodation involves several processes that can be briefly summarized as follows:

1. Levelling, the reduction in either the number of linguistic variants or in the degree of their variability. When two alternative forms exist, usually only one is preserved.
2. Emergence of intermediate varieties, where some linguistic forms may be new and may not have occurred in any of the dialects before the contact (koine).
3. Reallocation, in which alternative forms are retained but assigned different roles in the sociolinguistic use of the dialects, or in their grammatical use.
4. Simplification and increase in regularity.

15 This section is largely based on the work about linguistic contact of J. Chambers (ex Univ. of Toronto), W. Labov (Univ. of Pennsylvania) and P. Trudgill (Univ. of East Anglia). Among others, some key references are Dodsworth (2017) and Kerswill (2006).
16 Urban districts receiving immigrants from rural areas should be distinguished from sparsely populated areas that became the target of massive immigration, like new towns or colonial settlements, because linguistic innovations spread faster in recent speech communities than in pre-existing ones.

In the phase that follows the initial dialect contact, rudimentary levelling or extreme inter-speaker / intra-speaker variability can be noticed. Later a new focused variety gets established and “homogeneously” adopted by the whole speech community. This process lasts at least one generation because new variants get fixed in adolescence and adults are less likely to modify their speech.

An attractive area, say an economically dynamic town, has probably been a destination for migrants for centuries; initially they came from close areas but, as time passed, immigrants from more distant areas (where more divergent dialects are spoken) moved in. This is to say that linguistic levelling and simplification are expected to be stronger in the dialect of an attractive area than in the dialect of an area that is not, because immigration from distant regions is less likely in the second case, and contact phenomena are stronger when the linguistic difference between the varieties in contact is higher. On the other hand, an unappealing town has likely lost a large part of its population, which migrated to find better living conditions. A linguistic consequence of this phenomenon is that the dialect of the latter has remained stable over time, and has not undergone processes of linguistic simplification (see Fig. 8.4).17

8.3.2 Extensive linguistic contact and demography

8.3.2.1 Dialect change in the Netherlands

Some attempts to measure extensive dialect change were based on the comparison of linguistic atlases made in different epochs, or on comparing the pronunciation of speakers of different ages within the same family or community. Dutch dialectology offers interesting computational work addressing the linguistic change of dialect varieties over time. Wieling (2007b/c) compared two pronunciation datasets collected, approximately, at an interval of two generations (50 years)18 and found that Friesland and Limburg are areas of dynamic convergence, while the south-eastern part of Low Saxony (Groningen, Drenthe, Overijssel, and the eastern part of Gelderland) is an area of divergence. His results do not align well with those of Heeringa and Hinskens (2015), who compared the pronunciation of present-day older male speakers and younger female speakers (a two-generation interval), obtaining capricious patterns. While the latter study is based on more consistent generation samples than the first one, the results of both have not been related to the population growth and size of the corresponding provinces. In fact the increase in the population size could have been a key to link dialect change, population growth and immigration. Actually, Wieling et al. (2007b, 2007c) and Heeringa and Hinskens (2015) addressed a linguistic change that took place in very recent times, when the use of dialects was already growing less frequent and when the speakers were extensively exposed to the linguistic norm (standard Dutch), two factors that make the interpretation of the observed change difficult, because several sociolinguistic effects overlap.

17 To explain the considerable linguistic effect of immigration it should be recalled that emigrants are generally younger than the average of the population, meaning that they are more likely to bring linguistic innovations.
18 The atlas of Blancquaert and Peé (1982), created during the period 1925-1982 (but data generally correspond to the first half of the period), was compared with the Goeman-Taeldeman-van-Reenen dataset (Goeman and Taeldeman, 1996; Van den Berg 2003) collected over the years 1980 – 1995.

Figure 8.4  Two extreme scenarios of dialect contact according to migration. Attractive Area: The variety spoken here is frequently in contact with new varieties; earlier immigration comes from the neighbourhood, while later immigrants come from distant areas. Besides the normal population growth over time, the demographic balance is as positive as the migratory balance. The timeline reported can be assumed to cover some centuries. Unattractive Area: The spoken variety is less exposed to the contact with different varieties because the majority of immigrants comes from neighbouring areas. The population size may fall or remain somewhat stable over the time because there are few immigrants and many emigrants. The growth of the population counterbalances the loss of populations only partially. The timeline reported can be assumed to cover some centuries.

Ideally, to better assess the linguistic change that dialects experienced over the recent demographic transition from the rural to the urban way of life,19 one would require a linguistic atlas of some centuries ago (of course not available) to be compared to a linguistic atlas concerning data collected in the first half of the last century, when dialects were still largely spoken and less influenced by the norm (such atlases are available for a majority of European countries). Along those lines, the study of Heeringa and Joseph (2007) was aimed at comparing the pronunciation data of the linguistic atlas of Blancquaert and Peé (1982) to the reconstructed pronunciation of the proto‐Germanic lexicon, and at estimating the degree of conservatism of contemporary Dutch dialects accordingly. I will not discuss the special case of Frisian20 and just focus on the findings of Heeringa and Joseph (2007) concerning the other Dutch varieties, that is, Low Franconian dialects (central western part of the Netherlands), which tend to be more conservative (particularly Holland, the eastern part of North Brabant and the southern part of Gelderland), and Low Saxon dialects, which are phonologically more innovative than the first group, according to reconstructed proto‐forms. Is this pattern the outcome of historical phenomena that took place during the linguistic differentiation from proto‐Germanic over two millennia? While it is difficult to answer directly, a paradox should be noted: nearly all the most conservative areas identified by Heeringa and Joseph (2007) are located in the four provinces (South Holland, North Holland, North Brabant and Gelderland) that have had continued and very high population growth (and immigration) for about a century (Fig. 8.5). In fact, the phonological conservatism of many Franconian dialects might be a recent effect if we adopt the following sociolinguistic perspective: the high immigration and population expansion in some regions of the Franconian linguistic domain led to linguistic levelling, with an overall reduction of the number of different realizations of phonological variables but, also, with a general and increasing exposure to Standard Dutch, a variety that has a remarkably conservative sound system (Donaldson 1983, p. 161). Conversely, the Low Saxon dialects are more innovative as they experienced a lesser degree of levelling because of the lower population growth and immigration, meaning that standard Dutch was not needed as a lingua franca, and, also, because of their different and more rural socioeconomic environment (but see Haartsen et al. 2003).21

19 It must be emphasised that in the middle of the seventeenth century a large proportion of the Dutch population was already living in towns and cities.

20 That turned out to be conservative too.

21 Some influence of phonologically more innovative varieties located across the German border, influenced by the Standard German norm, cannot be excluded. For example, the First Germanic Sound Shift (Campbell 2004) can be found reasonably intact in Dutch, and all other Low German and Scandinavian languages for that matter, but a further shift of p/t/k occurred only in German, which momentarily obscures the origin of some sounds (Donaldson 1983, p. 123).

If none of the studies cited above (Wieling 2007b, 2007c, Heeringa and Joseph 2007, Heeringa and Hinskens 2015) provides conclusive results and patterns about the evolution of Dutch dialects over time, it may be because i) they are exclusively based on phonology while dialect mixing, stability, hypervariability, reallocation and convergence also concern morphosyntax and rhythmic differences (see Dodsworth 2017 for a review of modern case‐studies),22 because ii) the timeframe addressed by Wieling et al. (2007b, 2007c) and Heeringa and Hinskens (2015) is blurred by the progressive contemporary levelling of dialects and, finally, because iii) the sampling scheme was not designed to correspond to areas demographically comparable in terms of migration and population growth.

8.3.2.2 Migrations inferred from surname data

Aside from special cases where the time of the arrival and the provenance of immigrants are known (as in the Canadian province of Québec for example)23 it is often difficult to have this information, even for historically recent times. This is why surnames can be of help when no alternative documentation is available. They make it possible to identify the direction of migrations that took place in, say, a European country over the last four or five centuries but, unlike historical registers, they do not show when such migrations took place. They could have happened anytime between the first introduction of family names and the last generation, that is, over a time span of about five centuries.24 We know that the majority of these migratory movements took place after the industrial revolution, when new means of displacement became available and new jobs were created on a massive scale, therefore contributing to the establishment of coherent migration routes (like the substantial northward migration that took place from the rural southern part of Italy to its economically dynamic northern side). The detailed comparison of these migration routes (Manni et al. 2005, Boattini et al. 2012, Rodriguez Diaz et al. 2015, 2017) reveals that neighbouring provinces can be quite different in the number and the provenance of the immigrants they attracted, but also concerning the directions of the emigrants that left them. This heterogeneity is valuable to test the cumulative effect25 that migrations had on the dialects spoken in two neighbouring areas that initially were linguistically very similar, meaning that they later diverged as a consequence of the dialect contact that migration processes drove. A possible experimental set‐up would be to compare pairs of locations that initially had a comparable population size and where very close varieties were spoken before linguistically diverging because of a different migration history, similarly to the two cases shown in Fig. 8.4. The linguistic differences between pairs of localities26 (inferred using a linguistic atlas) selected in this way are expected to correspond to different types of migration‐induced dialect contact.

22 The examples concern the dialects of London (UK), Sao Paulo (Brazil), Xining (PRC), Amman (Jordan) and the Spanish varieties spoken in New York (USA).

23 In Quebec migration registers have been kept since the beginning of the French rule.

24 In the Netherlands surnames have a more recent origin.

Figure 8.5  Conservativeness of Dutch dialects and demographic growth by province. Right: Phonetically conservative and innovative dialects in the Netherlands according to Heeringa and Joseph (2007). Ten shadings cover the spectrum from conservative (lighter gray) to innovative areas (darker gray). The three most conservative classes of the spectrum are here coloured in blue, to show that they are mainly located in the provinces of North and South Holland, North Brabant and Gelderland. Friesland is not discussed in the text. Adapted from Heeringa and Joseph (2007). Left: Demographic growth per Dutch province according to several sources, including Blink (1897) and the Dutch Central Bureau of Statistics. The x‐axis reports the years of the census, the y‐axis indicates the population size of each province expressed in millions. Source: http://www.populstat.info/Europe/netherlp.htm

25 Cumulative means the aggregation of all migration movements over time, because, as noted above, surnames do not allow their timing to be distinguished.

26 Only dialect contact between similar dialects spoken within the same country is taken into account here, not the linguistic contact with other languages.

This is a kind of contact that surname studies can help to describe in detail, since they make it possible to say how many speakers of each variety came into contact with the original dialect spoken in the two locations under study. A working hypothesis is that the immigration of very different varieties has a greater impact on the receiving speech communities than does the immigration of similar varieties, in which case we would expect the receiving community’s speech to remain more “stable”. This stability can imply a higher level of areal heterogeneity and a lower number of innovations. According to Trudgill (1992, p. 199) innovation and simplification can be synonymous because the growth of new forms, which were not present in the initial mixture but developed out of the interaction between varieties, gives rise to interdialects that are more regular. Speech communities having frequent contacts with other groups tend to have simplified (innovative) languages or dialects. An illustration comes from a forthcoming article concerning Spanish data (Rodriguez‐Diaz et al. 2017).

8.3.2.3 Spanish surnames and internal migrations

The way to account for the intensity and the directions of the internal migrations that took place in Spain after the introduction of surnames is quite simple. It consists in coding the surnames listed in the database of current Spanish residents (Padrón municipal)27 as vectors whose components correspond to the relative frequency of each surname in, say, all the 47 continental Spanish provinces.28 Then all the vectors (surnames) are classified into a discrete number of clusters by using Kohonen maps (Kohonen 1982, 1984, Kaski 1997) or other similar methods, so that each cluster corresponds to a group of surnames having a comparable geographic distribution over the country: frequent in some provinces and not in others. Finally, such groups are plotted over a geographic map to see if there are visible peaks of frequency corresponding to a single province.29 If one assumes that the province where the relative frequency of a surname is the highest corresponds to the geo‐historical origin of that surname, it is possible to measure migrations, because the diffusion centre of these family names is known as well as their present‐day distribution. In this way all migration patterns can be summarized in two migration matrices, one for the aggregate immigration processes and one for the aggregate emigration processes that took place over the last five centuries (Fig. 8.6).30

27 Only surnames occurring at least 20 times have been processed.

28 Example: Rodriguez; (Province 1 =) 0.0047; (Province 2 =) 0.0030; …; (Province 47 =) 0.0018.

29 In some cases, the peak of frequency is geographically ambiguous because it corresponds to two (or more) provinces. Such ambiguities are related to the fact that many surnames, spelled in the same way, independently became the name of unrelated families located in different areas (as is the case for de Boer, van Dijk, de Jong, Visser in the Netherlands). Only the surnames with a clear origin in one province have been kept.

As was anticipated in the introduction of the dissertation (see CHAPTER 1, section 1.2), emigration and immigration phenomena are not symmetrical. This is why Spanish provinces can be classified into four groups: 1) Isolated provinces (low emigration, low immigration); 2) Corridor provinces (high emigration, high immigration); 3) Unattractive provinces (high emigration, low immigration); 4) Attractive provinces (low emigration, high immigration), as in Fig. 8.7.
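As a rough illustration of the origin‐assignment step described in this section, the following minimal sketch (in Python) codes each surname as a vector of relative frequencies, takes the province with the highest relative frequency as the presumed origin, and counts the bearers living elsewhere as migrants. It is only a sketch under stated assumptions: it skips the Kohonen‐map clustering and assigns each surname individually, it produces a single origin‐by‐destination table (rows can be read as emigration, columns as immigration) rather than two separate matrices, and all function names and toy counts are hypothetical, not taken from the Spanish dataset.

```python
from collections import defaultdict


def relative_frequencies(bearers, residents):
    """Code one surname as a vector of relative frequencies by province.

    bearers:   {province: number of residents carrying the surname}
    residents: {province: total number of residents processed}
    """
    return {p: bearers.get(p, 0) / residents[p] for p in residents}


def presumed_origin(vector):
    """Province with the highest relative frequency, or None when the
    peak is shared by several provinces (ambiguous surnames are dropped)."""
    peak = max(vector.values())
    winners = [p for p, f in vector.items() if f == peak]
    return winners[0] if len(winners) == 1 else None


def migration_table(surname_counts, residents):
    """table[origin][destination] = bearers living outside the presumed origin."""
    table = defaultdict(lambda: defaultdict(int))
    for bearers in surname_counts.values():
        origin = presumed_origin(relative_frequencies(bearers, residents))
        if origin is None:
            continue  # no clear origin in a single province
        for province, n in bearers.items():
            if province != origin:
                table[origin][province] += n
    return {o: dict(d) for o, d in table.items()}


# Hypothetical toy data: three provinces, three surnames.
RESIDENTS = {"A": 1000, "B": 1000, "C": 500}
SURNAMES = {
    "Garcia": {"A": 90, "B": 10},        # peak in A
    "Lopez": {"A": 5, "B": 40, "C": 5},  # peak in B
    "Ruiz": {"A": 10, "B": 10},          # tie between A and B: dropped
}

print(migration_table(SURNAMES, RESIDENTS))
```

Dropping tied surnames mirrors the choice, stated in footnote 29, of keeping only surnames with a clear origin in one province; the exact‐equality test used here to detect ties is a simplification.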

Figure 8.6  Immigration and emigration in Spain. (Left): Major emigration routes from each continental Spanish province. (Right): Major immigration routes from all continental Spanish provinces to each one of them. Analysis based on the distribution of 25,714 single surnames (like ‘Diaz’; ‘Rodriguez’; etc.) corresponding to the 12,348,109 Spanish residents processed by Rodriguez‐Diaz et al. (2017).

Migration distances can be classified as short‐, medium‐ and long‐range (Fig. 8.8). It is reasonable to think that the medium‐ and long‐range movements took place in more recent times, when the mechanization of transportation and the industrialization of the country led to a massive displacement of the population, which progressively abandoned rural life. By contrast, other provinces are characterized by very local emigration distances directed to neighbouring areas; they correspond to processes that took place within a more traditional frame of displacement, probably when people used to move by their own means, progressively diffusing as described in the WAVE THEORY of Schmidt (1872).

30 Spanish surnames became fixed starting with the 16th century.

Figure 8.7  Two‐dimensional plot of immigration (x‐axis: [%] of surnames of foreign origin in each province) and emigration (y‐axis: [%] of surnames located outside the province of origin) by province. Four different cases can be identified in the plot. Dataset as in Fig. 8.6.
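The four cases can be made explicit with a small sketch (Python) that takes the two percentages plotted in Fig. 8.7 and returns the corresponding group. The 50% cut‐offs separating "low" from "high" are purely hypothetical placeholders, since the text does not state where the boundaries lie; the function name and the example values are likewise illustrative.

```python
def classify_province(immigration_pct, emigration_pct,
                      immigration_cut=50.0, emigration_cut=50.0):
    """Assign a province to one of the four migration cases.

    immigration_pct: % of surnames of foreign origin in the province
    emigration_pct:  % of surnames located outside the province of origin
    The cut-offs are assumptions made for illustration only.
    """
    high_in = immigration_pct >= immigration_cut
    high_out = emigration_pct >= emigration_cut
    if high_in and high_out:
        return "corridor"      # high emigration, high immigration
    if high_in:
        return "attractive"    # low emigration, high immigration
    if high_out:
        return "unattractive"  # high emigration, low immigration
    return "isolated"          # low emigration, low immigration


# Hypothetical provinces given as (immigration %, emigration %).
EXAMPLES = {"P1": (10, 15), "P2": (70, 80), "P3": (20, 75), "P4": (65, 30)}
for name, (imm, emi) in EXAMPLES.items():
    print(name, classify_province(imm, emi))
```

In practice the cut‐offs would have to be chosen from the data, for instance from the medians of the two distributions; that choice is not specified here.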

8.3.2.4 Spanish migrations and regional languages

Concerning Spain, it is interesting to note the significant overlap between:

i) The areas that remained rather isolated in terms of internal Spanish migrations (see bottom part of Fig. 8.7);

ii) The provinces in which emigration was mainly directed to neighbouring areas following a short‐range migration pattern of isolation by distance (Wright 1943, Malécot 1948) (see the bottom left part of Fig. 8.8);

iii) The regions where languages other than Castilian have resisted the political will to set Castilian as The 31 (see Fig. 8.9), therefore showing the positive effect that reduced immigration had on the persistence of language areas.

To conclude on this challenging result, I note that, in the past, linguistics has inspired a considerable number of hypotheses about the anthropological diversity of human populations. Today, demographers and geneticists can also help by providing an accurate and large‐scale quantification of the demographic processes that led to linguistic contact, setting a new framework to understand linguistic differentiation.

Figure 8.8  Emigration by distance classes from each Spanish province. By comparing the present‐day distribution of Spanish surnames to their inferred geographical origin (where they were first adopted about five centuries ago), it is possible to dissect emigration distances (emigration distances have been ranked in 8 distance classes; see triangles). The 47 vectors (one per province) accounting for the distance classes have been the input of the Principal Component Analysis shown above. Provinces from which emigration was directed to the closest neighbouring areas (bottom left) can be distinguished from the others. Dataset as in Fig. 8.6.

31 This policy lasted from the beginning of the 16th century, with the unification of the crowns of Aragon and Castilla, to the regime of General Franco, which ended in 1975.

Figure 8.9  Migration and linguistic diversity in Spain. (A): Provinces that have attracted a low number of immigrants (see Fig. 8.7). (B): Spanish provinces from which migration has been local and directed to neighbouring areas (Fig. 8.8). (C): Major linguistic areas according to the computational linguistic analyses reported in CHAPTER 5.


References:

Almeida A., Braun A. 1986. ‘Richtig’ und ‘Falsch’ in phonetischer Transkription; Vorschläge zum Vergleich von Transkriptionen mit Beispielen aus deutschen Dialekten. Zeitschrift für Dialektologie und Linguistik, 53: 158‐172.

Almeida A., Braun A. 1985. What is Transcription? In: W. Kurschner, R. Vogt (eds.) Grammatik, Semantik, Textlinguistik. Akten des 19 Linguistischen Kolloquiums Vechta 1984. Vol. 1, Tübingen, pp. 37‐48.

Beijering K., Gooskens C., Heeringa W. 2008. Predicting intelligibility and perceived linguistic distances by means of the Levenshtein algorithm. Linguistics in the Netherlands, 25: 13‐24.

Blancquaert E., Peé W. (eds.). 1925‐1982. Reeks Nederlands(ch)e Dialectatlassen. Antwerpen: De Sikkel.

Blink H. 1897. Tegenwoordige staat van Nederland. Amsterdam: S.L. van Looy.

Boattini A., Lisa A., Fiorani O., Zei G., Pettener D., Manni F. 2012. General Method to Unravel Ancient Population Structures through Surnames. Final Validation on Italian Data. Human Biology, 84: 235‐270.

Bolognesi R., Heeringa W. 2002. De invloed van dominante talen op het lexicon en de fonologie van Sardische dialecten. Gramma/TTT: tijdschrift voor taalwetenschap, 9: 45‐84.

Campbell L. 2004. Historical linguistics (2nd ed.). Cambridge: MIT Press.

Church K.W., Hanks P. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16: 22‐29.

Clua E., Valls E., Viaplana J. 2008. Analisi dialettometrica del catalano partendo dai dati del COD. Una prima approssimazione alla gerarchia tra varietà. In: G. Blaikner Hohenwart et al. (eds.) Ladinometria. Miscellanea per Hans Goebl per il 65º compleanno. Edizione multilingue, vol. 2. Vigo di Fassa: Istituto Culturale Ladino, pp. 27‐42.

Clua E., Lloret M‐R. 2015. COD2: An Oral Dialectal Corpus for the Analysis of Spatial and Temporal Variations in Catalan. In: Proceedings of the 7th International Conference on Corpus Linguistics. Current Work in Corpus Linguistics. Working with Traditionally‐conceived Corpora and Beyond (CILC 2015). Valladolid, 5‐7 March, pp. 89‐94.

Covington M.A. 1996. An Algorithm to Align Words for Historical Comparison. Computational Linguistics, 22: 481‐496.

Cucchiarini C. 1993. Phonetic Transcription: a Methodological and Empirical Study. PhD dissertation. Nijmegen: Katholieke Universiteit Nijmegen.

Dodsworth R. 2017. Migration and Dialect Contact. Annual Review of Linguistics, 3: 331‐346.

Donaldson B.C. 1983. Dutch: A Linguistic History of Holland and Belgium. Leiden: Martinus Nijhoff.

Eisenstein J., O’Connor B., Smith N.A., Xing E.P. 2014. Diffusion of lexical change in social media. PLoS ONE, 9. http://dx.doi.org/10.1371/journal.pone.0113114

Falck O., Heblich S., Lameli A., Südekum J. 2012. Dialects, cultural identity, and economic exchange. Journal of Urban Economics, 72: 225‐239.

Fano R.M. 1961. Transmission of Information: A Statistical Theory of Communications. Cambridge (MA): MIT Press.

Fontan L., Farinas J., Ferrané I., Pinquier J., Aumont X. 2015. Automatic intelligibility measures applied to speech signals simulating age‐related hearing loss. In: Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), Dresden, 6‐10 September.

Gildea D., Jurafsky D. 1996. Learning Bias and Phonological‐Rule Induction. Computational Linguistics, 22: 497‐530.

Goeman T., Taeldeman J. 1996. Fonologie en morfologie van de Nederlandse dialecten: een nieuwe materiaalverzameling en twee nieuwe atlasprojecten. Taal en Tongval, 48: 38‐59.

Gooskens C., Heeringa W. 2004. Perceptive evaluation of Levenshtein dialect distance measurements using Norwegian dialect data. Language Variation and Change, 16: 189‐207.

Gooskens C., Heeringa W. 2006. The Relative Contribution of Pronunciation, Lexical and Prosodic Differences to the Perceived Distances between Norwegian dialects. Literary and Linguistic Computing, 21: 477‐492.

Greenhill S. 2011. Levenshtein distances fail to identify language relationship accurately. Computational Linguistics, 37: 689‐698.

Grollemund R., Branford S., Bostoen K., Meade A., Venditti C., Pagel M. 2015. Bantu expansion shows that habitat alters the route and pace of human dispersals. Proceedings of the National Academy of Sciences USA, 112: 13296‐13301.

Haartsen T., Groote P., Huigen P.P.P. 2003. Rural areas in the Netherlands. Tijdschrift voor Economische en Sociale Geografie, 94: 129‐136.

Hammarström H., Forkel R., Haspelmath M., Bank S. 2016. Glottolog 2.7. Jena: Max Planck Institute for the Science of Human History. (Available at http://glottolog.org, accessed on 2017‐01‐25.)

Heeringa W. 2004. Measuring dialect pronunciation differences using Levenshtein distance. PhD dissertation. Groningen: Rijksuniversiteit Groningen.

Heeringa W., Braun A. 2003. The Use of the Almeida‐Braun System in the Measurement of Dutch Dialect Distances. Computers and the Humanities, 37: 257‐271.

Heeringa W., Hinskens F. 2015. Dialect change and its consequences for the Dutch dialect landscape. How much is due to the standard variety and how much is not? Journal of Linguistic Geography, 3: 20‐33.


Heeringa W., Joseph B. 2007. The Relative Divergence of Dutch Dialect Pronunciations from their Common Source: An Exploratory Study. In: J. Nerbonne, T. Mark Ellison, G. Kondrak (eds.), SigMorPhon 07 ACL 2007, Computing and Historical Phonology, Proceedings of the Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology. Prague (Czech Republic), Stroudsburg (PA): The Association for Computational Linguistics (ACL), June 28, pp. 31‐39.

Heeringa W., Kleiweg P., Gooskens C., Nerbonne J. 2006. Evaluation of String Distance Algorithms for Dialectology. In: J. Nerbonne, E. Hinrichs (eds.) Linguistic Distances Workshop at the joint conference of International Committee on Computational Linguistics and the Association for Computational Linguistics, Sydney, July, pp. 51‐62.

Heeringa W., Nerbonne J. 2001. Dialect areas and dialect continua. Language Variation and Change, 13: 375‐400.

Heeringa W., Nerbonne J., Kleiweg P. 2002. Validating Dialect Comparison Methods. In: W. Gaul, G. Ritter (eds.), Classification, Automation, and New Media. Proceedings of the 24th Annual Conference of the Gesellschaft für Klassifikation, pp. 445‐452.

Jäger G. 2015. Support for linguistic macrofamilies from weighted sequence alignment. Proceedings of the National Academy of Sciences USA, 112: 12752‐12757.

Juola P., Sofko J., Brennan P. 2006. A Prototype for authorship attribution studies. Literary and Linguistic Computing, 21: 169‐178.

Kaski S. 1997. Data exploration using self‐organizing‐maps. Acta Polytechnica Scandinavica, 82: 1‐57.

Kerswill P. 2006. Migration and language. In: K. Mattheier, U. Ammon, P. Trudgill (eds.), Sociolinguistics/Soziolinguistik. An international handbook of the science of language and society, 2nd edition, volume 3, Berlin: De Gruyter.

Kessler B. 1995. Computational Dialectology in Irish Gaelic. In: Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics, pp. 60‐67.

Kohonen T. 1982. Self‐organized formation of topologically correct feature maps. Biological Cybernetics, 43: 59‐69.

Kohonen T. 1984. Self‐organization and Associative Memory. Berlin: Springer.

Kondrak G. 2002. Algorithms for language reconstruction. (Doctoral dissertation, University of Toronto). Dissertation Abstracts International, 63: 5934.

Kondrak G. 2003. Phonetic alignment and similarity. Computers and the Humanities, 37: 273‐291.

Kondrak G., Sherif T. 2006. Evaluation of several phonetic similarity algorithms on the task of cognate identification. In: Proceedings of the ACL Workshop on Linguistic Distances. Sydney, Australia, pp. 43‐50.

Labov W. 2001. Principles of linguistic change: Social factors. Vol. II. Malden: Blackwell.

Lewis D.K. 1979. Scorekeeping in a language game. Journal of Philosophical Logic, 8: 339‐359.

Malécot G. 1948. Les mathématiques de l’hérédité. Paris: Masson.

Mann G.S., Yarowsky D. 2001. Multipath translation lexicon induction via bridge languages. In: Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, June 1‐7, pp. 1‐8.

Manni F., Toupance B., Sabbagh A., Heyer E. 2005. New method for surname studies of ancient patrilineal population structures, and possible application to improvement of Y‐chromosome sampling. American Journal of Physical Anthropology, 126: 214‐28.

Mathussek A. 2016. On the problem of field worker isoglosses. In: M‐H Côté, R. Knooihuizen, J. Nerbonne (eds.), The future of dialects: Selected papers from Methods in Dialectology XV. Berlin: Language Science Press, pp. 99‐116.

McMahon A., McMahon R. 2005. Language classification by the numbers. Oxford: Oxford University Press.

Needleman S.B., Wunsch C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48: 443‐53.

Nerbonne J. 2005. Review of April McMahon and Robert McMahon Language Classification by the Numbers. Oxford: Oxford University Press. Linguistic Typology, 11: 425‐436.

Nerbonne J. 2010. Measuring the diffusion of linguistic change. Philosophical Transactions of the Royal Society B, 365: 3821‐3828.

Nerbonne J., Heeringa W. 1997. Measuring Dialect Distance Phonetically. In: J. Coleman (ed.) Workshop on Computational Phonology. Special Interest Group of the Association for Computational Linguistics. Madrid: ACL, pp. 11‐18.

Nerbonne J., Heeringa W. 2007. Geographic Distributions of Linguistic Variation Reflect Dynamics of Differentiation. In: S. Featherston, W. Sternefeld (eds.) Roots: Linguistics in Search of its Evidential Base. Berlin: Mouton De Gruyter, pp. 267‐297.

Nerbonne J., Heeringa W. 2010. Measuring dialect differences. In: P. Auer, J.E. Schmidt (eds.) Language and Space: Theories and Methods. Berlin: Mouton De Gruyter, pp. 550‐566.

Nerbonne J., Heeringa W., van den Hout E., van de Kooi P., Otten S., van de Vis W. 1996. Phonetic Distance between Dutch Dialects. In: G. Durieux, W. Daelemans, S. Gillis (eds.) CLIN VI: Proceedings of the Sixth CLIN Meeting. Antwerp: Centre for Dutch Language and Speech (UIA), pp. 185‐202.

Nurse D., Philippson G. 2003. Towards a historical classification of the Bantu. In: D. Nurse and G. Philippson (eds.), The Bantu languages. London: Routledge, pp. 164‐179.

Oakes M.P. 2000. Computer estimation of vocabulary in protolanguage from word lists in four daughter languages. Journal of Quantitative Linguistics, 7: 233‐243.

Rama T., Wahle J., Sofroniev P., Jäger G. 2017. Fast and unsupervised methods for multilingual cognate clustering. Unpublished. ArXiv preprint, arXiv:1702.04938.

Rodríguez Díaz R., Manni F., Blanco Villegas M‐J. 2015. Footprints of Middle Ages Kingdoms Are Still Visible in the Contemporary Surname Structure of Spain. PLoS ONE, 10. doi:10.1371/journal.pone.0121472

Rodríguez Díaz R., Blanco Villegas M‐J, Manni F. 2017. From surnames to linguistic and genetic diversity: Five centuries of internal migrations in Spain. Journal of Anthropological Sciences, 95: 000‐000, forthcoming.

Sanders N., Chin S. 2009. Phonological distance measures. Journal of Quantitative Linguistics, 16: 96‐114.

Scapoli C., Mamolini E., Carrieri A., Rodriguez‐Larralde A., Barrai I. 2007. Surnames in Western Europe: a comparison of the subcontinental populations through isonymy. Theoretical Population Biology, 71: 37‐48.

Scherrer Y. 2007. Adaptive string distance measures for bilingual dialect lexicon induction. ACL ’07 Proceedings of the 45th Annual Meeting of the ACL: Student Research Workshop, Prague (Czech Republic), June 25‐26, pp. 55‐60.

Schmidt J. 1872. Die Verwandtschaftsverhältnisse der indogermanischen Sprachen. Weimar: H. Böhlau.

Séguy J. 1971. La relation entre la distance spatiale et la distance lexicale. Revue de Linguistique Romane, 35: 335‐357.

Somers H.L. 1998. Similarity metrics for aligning children’s articulation data. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pp. 1227‐1231.

Spruit M.R., Heeringa W., Nerbonne J. 2009. Associations among linguistic levels. Lingua, 119: 1624‐1642.

Szmrecsanyi B. 2013. Grammatical variation in British English dialects. A study in corpus based dialectometry. Cambridge (UK): Cambridge University Press.

Trudgill P. 1974. Linguistic Change and Diffusion: Description and explanation in sociolinguistic dialect geography. Language in Society, 2: 215‐246.

Trudgill P. 1992. Dialect typology and social structure. In: E.H. Jahr (ed.) Language contact. Berlin, New York (NY): Mouton de Gruyter, pp. 195‐212.

Valls E., Nerbonne J., Prokic J., Wieling M., Clua E., Lloret M‐R. 2012. Applying the Levenshtein distance to Catalan dialects: A brief comparison of two dialectometry approaches. Verba, 39: 35‐61.

Van den Berg B.L. 2003. Phonology and Morphology of Dutch and Frisian Dialects in 1.1 million transcriptions. Goeman‐Taeldeman‐Van Reenen project 1980‐1995, Meertens Instituut Electronic Publications in Linguistics 3. Amsterdam: Meertens Instituut (CD‐ROM).

Vieregge W.H., Rietveld A.C.M., Jansen C.I.E. 1984. A distinctive feature based system for the evaluation of segmental transcription in Dutch. In: M.P.R. Van den Broecke, A. Cohen (eds.), Proceedings of the 10th International Congress of Phonetic Sciences, Dordrecht and Cinnaminson: Foris Publications, pp. 654‐659.

Wieling M., Leinonen T., Nerbonne J. 2007a. Inducing Sound Segment Differences using Pair Hidden Markov Models. In: J. Nerbonne, M. Ellison, G. Kondrak (eds.) Computing and Historical Phonology: 9th Meeting of ACL Special Interest Group for Computational Morphology and Phonology Workshop at ACL. Prague, pp. 48‐56.

Wieling M. 2007b. Comparison of Dutch Dialects. Master thesis, University of Groningen.

Wieling M., Heeringa W., Nerbonne J. 2007c. An aggregate analysis of pronunciation in the Goeman‐Taeldeman‐van Reenen Project data. Taal en Tongval, 59: 84‐116.

Wieling M., Bloem J., Mignella K., Timmermeister M., Nerbonne J. 2014. Automatically measuring foreign accent strength in English. Validating Levenshtein Distance as a Measure. Language Dynamics and Change, 4: 253‐269.

Wieling M., Margaretha E., Nerbonne J. 2012. Inducing a measure of phonetic similarity from pronunciation variation. Journal of Phonetics, 40: 307‐314.

Wright S. 1943. Isolation by distance. Genetics, 28: 114‐138.

Summary of the Dissertation

This thesis in linguistics includes five published articles and one study to appear, in which I review, test and use computational linguistic methods to classify linguistic va‐ rieties consisting of lexical items — the sort of material that is generally readily avail‐ able from linguistic atlases and databases. A minimum of 90 items is used in all the empirical studies reported in the disser‐ tation, because this number has been recognized to be sufficient for aggregate analyses. To compare linguistic varieties and classify them, two methods that lead to the compu‐ tation of a linguistic distance matrix are used. (i) One compares two basilectal varieties as the percentage of phonetic, morphologic, syntactical and lexical features with respect to which the two varieties agree according to the relativer Identitätswert (RIW). (ii) The other one is based on the Levenshtein distance (edit distance), which accounts for the phonological differences between two varieties through string comparison (of phonetic transcriptions). Words with the same meaning are aligned and, to determine the extent to which the two strings differ from each other, the number of substitutions, insertions, and deletions that are necessary to change one string into the other are calculated. The RIW method is used in one section only, while the Levenshtein method is used in the remaining chapters. The studies reported respectively concern the classification of Dutch varieties from the Netherlands; languages and dialects from Spain; Bantu varieties from Gabon, Tanzania and neighbouring countries and, finally, Turkic and Indo‐Iranian languages spoken in Kyrgyzstan, Tajikistan and Uzbekistan. The dissertation aims at probing human history by interpreting linguistic differ‐ ences together with other sources of evidence. 
In a multidisciplinary perspective aimed at providing a higher level of anthropological synthesis, linguistic diversity is used as a proxy for the cultural differences of the corresponding populations and is then compared to the variability of family names (their number, frequency and geographic distribution) or to genetic differences based on molecular markers in the DNA. The comparison has required methodological developments to assess the stability of linguistic classifications, in particular the use of the bootstrap method and the use of residual Levenshtein distances as a way to separate the linguistic similarity due to geographic contact from the remaining similarity, presumably determined by historical (genealogical) relatedness. The relation between pronunciation differences and geographic distance suggests a new line of investigation, showing how residual Levenshtein distances provide testable hypotheses about past linguistic convergence and divergence and, perhaps, addressing the influence that population growth and migrations have on linguistic variability. With respect to the latter, the analysis of family names enables the depiction of migrations that have taken place in historical times and allows us to distinguish regions that have received many immigrants from those that have remained demographically more stable. We conjecture that such migration patterns have influenced dialect and language contact. This is a novel perspective from which we may examine the effects of migration on language change.

The order of the chapters corresponds to their focus. CHAPTERS 2 and 3 give introductory and methodological elements; CHAPTERS 4 and 5 report comparative studies involving surname and linguistic variability in the Netherlands and in Spain; CHAPTERS 6 and 7 address wider linguistic contexts: Bantu and Central-Asian languages. A concluding discussion follows.
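
As an illustration of the Levenshtein comparison and of the residual distances mentioned in this summary, the following minimal Python sketch may help. It is not the software used in the dissertation: the function names, the normalization by the length of the longer transcription and the plain linear regression on geographic distance are all assumptions made for the example, and any input strings are expected to be transcriptions supplied by the reader.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions
    needed to change sequence a into sequence b (unit costs)."""
    previous = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        current = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def variety_distance(words_a, words_b):
    """Aggregate distance between two varieties, each given as a dict
    mapping a concept to its transcription. Each word pair is divided by
    the length of the longer transcription (an assumption of this sketch)
    and the result is averaged over the concepts both varieties share."""
    shared = [concept for concept in words_a if concept in words_b]
    if not shared:
        raise ValueError("the two varieties share no concepts")
    total = 0.0
    for concept in shared:
        a, b = words_a[concept], words_b[concept]
        longest = max(len(a), len(b))
        total += levenshtein(a, b) / longest if longest else 0.0
    return total / len(shared)


def residual_distances(linguistic, geographic):
    """Residuals of an ordinary least-squares fit of linguistic distance
    on geographic distance, one value per pair of sites. A straight-line
    fit is assumed here; the dissertation's regression model may differ."""
    n = len(linguistic)
    mean_x = sum(geographic) / n
    mean_y = sum(linguistic) / n
    sxx = sum((x - mean_x) ** 2 for x in geographic)
    if sxx == 0:
        raise ValueError("geographic distances must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(geographic, linguistic))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x)
            for x, y in zip(geographic, linguistic)]
```

In this sketch, a positive residual marks a pair of sites that is linguistically more distant than their geographic separation alone would predict, and a negative residual marks a pair that is more similar than predicted.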

CHAPTER 2 (“Sprachraum and genetics”, 2010) presents a geneticist's viewpoint on genetic and linguistic cartography. It provides a historical and methodological background, reviewing the steps that led some population geneticists to cooperate with linguists, a collaboration that historically started with the comparison of maps.

CHAPTER 3 (“Projecting Dialect Distances to Geography”, 2007) is about clustering instability in dialectology and the application of the bootstrap method to linguistic data. When bootstrapping is impossible because the original data are not available, an alternative approach is described that consists in adding random noise to the distance (or similarity) matrices: this is called noisy clustering.
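
The idea of noisy clustering can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure as implemented in Chapter 3: the average-linkage clustering, the uniform noise and its ceiling (`noise_ceiling`), the number of runs and all function names are choices made for the example.

```python
import random
from collections import Counter
from itertools import combinations


def average_linkage_clusters(dist):
    """Cluster a square, symmetric distance matrix by average linkage and
    return every non-trivial cluster formed along the way, as frozensets
    of item indices (the cluster containing all items is left out)."""
    n = len(dist)
    clusters = [frozenset([i]) for i in range(n)]
    found = set()
    while len(clusters) > 1:
        # Merge the two clusters with the smallest mean pairwise distance.
        best = min(
            combinations(clusters, 2),
            key=lambda pair: sum(dist[i][j] for i in pair[0] for j in pair[1])
            / (len(pair[0]) * len(pair[1])),
        )
        merged = best[0] | best[1]
        clusters = [c for c in clusters if c not in best] + [merged]
        if len(merged) < n:
            found.add(merged)
    return found


def noisy_clustering(dist, runs=100, noise_ceiling=0.2, seed=0):
    """Repeat the clustering on copies of the matrix perturbed by random
    noise and report, for each cluster, the share of runs in which it
    appeared. The noise model (uniform between 0 and noise_ceiling) is an
    assumption of this sketch."""
    rng = random.Random(seed)
    n = len(dist)
    counts = Counter()
    for _ in range(runs):
        noisy = [[0.0] * n for _ in range(n)]
        for i, j in combinations(range(n), 2):
            value = dist[i][j] + rng.uniform(0.0, noise_ceiling)
            noisy[i][j] = noisy[j][i] = value
        counts.update(average_linkage_clusters(noisy))
    return {cluster: count / runs for cluster, count in counts.items()}
```

Clusters that reappear in nearly all runs can be read as stable, while clusters that appear only occasionally depend on small differences in the input distances.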

CHAPTER 4 (“To What Extent are Surnames Words?”, 2006) compares the distribution of surnames to the distribution of dialect pronunciations, which are clearly culturally transmitted, in the Netherlands. A total of 19,910 different surnames, sampled in 226 locations, and 125 different words, whose pronunciation was recorded in 252 sites, are analyzed. We find that, once the collinear effects of geography on both surname and cultural transmission are taken into account, there is no statistically significant association between dialect and surname variability, suggesting that surnames cannot be taken as a proxy for dialect variation. We find the results historically and geographically insightful, and we hope they will lead to a deeper understanding of the role that local migrations and cultural diffusion play in surname and dialect diversity.

CHAPTER 5 (“Footprints of Middle Ages Kingdoms are Visible in the Surname and Linguistic Structure of Spain”, 2016) assesses whether the present-day geographical variability of Spanish surnames mirrors historical phenomena at the time of the names' introduction (13th-16th century). From the frequency distribution of 33,753 unique family names, the surname distances among the 47 mainland Spanish provinces have been measured and compared to the relations among the corresponding language varieties, as portrayed by a dialectometric analysis of phonetic, morphological, syntactic and lexical features. Surname and linguistic variability suggest a similar picture: major clusters are located in the east (Aragón, Cataluña, Valencia) and in the north of the country (Asturias, Galicia, León). The remaining regions appear to be quite homogeneous. We interpret this pattern as the long-lasting effect of political and demographic phenomena related to the southward “reconquest” (Reconquista) of the territories ruled by the Arabs from the 8th to the late 15th century.

CHAPTER 6 (“Linguistic probes into the Bantu history of Gabon”, unpublished) concerns the cross-comparison of the linguistic and genetic diversity of Gabon (Africa) in order to contribute new perspectives to the scenarios about the early Bantu expansion related to the adoption of agriculture some millennia ago. Two independently obtained datasets, accounting for a total of 126 different varieties, have been processed. They lead to similar results, showing that the languages cluster into similar groups. The Levenshtein linguistic distances are fully compatible with other classifications based on shared vocabulary, where sharing is operationalized as the percentage of words (not) having the same historical origin. While the alternative method requires that experts label cognate words in different varieties, this coding is unnecessary with the Levenshtein method, making it simpler to use and, due to the larger amount of information it takes into account (all the sounds in the words), more sensitive. Genetic data indicate a lack of differentiation between populations at a level higher than previously observed. The linguistic cartography of our classifications shows well-delimited areas that might be related to early waves of Bantu migrants that crossed Gabon in the early stages of their dispersal from Cameroon and Nigeria.

CHAPTER 7 (“A Central-Asian linguistic survey”, 2016) is related to a large research project aimed at describing and comparing the genetic and social differences of sedentary and semi-nomadic populations living in Central Asia. The language varieties studied (either Turkic or Indo-Iranian) come from 23 test sites corresponding to the major ethnic groups of Kyrgyzstan, Tajikistan and Uzbekistan (Karakalpaks, Kazakhs, Kyrgyz, Tajiks, Uzbeks, Yaghnobis). The measure of phonological diversity obtained by applying the Levenshtein distance has been paralleled by a measure of linguistic contact, taken as proportional to the number of borrowings from one linguistic family into the other, according to a Precision/Recall analysis validated by expert judgment. Concerning Turkic languages, the results do not support regarding Kazakh and Karakalpak as distinct languages and indicate the existence of several distinct Karakalpak varieties. Kyrgyz and Uzbek, on the other hand, appear to be quite homogeneous. Among the Indo-Iranian languages, the distinction between Tajik and Yaghnobi varieties is very clear-cut, despite the endangered status of the latter language, whose speakers are in the process of being assimilated into Tajik society.
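
For readers unfamiliar with the Precision/Recall evaluation mentioned for the borrowings, a minimal sketch is given below. It only shows how the two scores are computed once a list of automatically flagged candidate borrowings and a list of borrowings confirmed by experts are available; how candidates are detected is not described in this summary, and the function name and the placeholder identifiers are hypothetical.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a set of automatically flagged items
    (predicted) against the items confirmed by expert judgment (gold)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of flagged items that the experts confirm.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of expert-confirmed items that were flagged.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Placeholder identifiers, not data from the dissertation.
flagged_by_method = {"item_01", "item_02", "item_03", "item_04"}
confirmed_by_experts = {"item_02", "item_03", "item_05"}
precision, recall = precision_recall(flagged_by_method, confirmed_by_experts)
```

With these placeholder sets, two of the four flagged items are confirmed, so precision is 0.5, and two of the three confirmed items were flagged, so recall is two thirds.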

CHAPTER 8 (“General conclusions and new prospects”) provides a wider methodological discussion of the Levenshtein distance, based on the empirical assays included in the dissertation and on what they show about its specificities in measuring linguistic difference as related to contact or historical linguistics. If in the past linguistics has driven a considerable number of hypotheses about the anthropological diversity of human populations, today demographers and geneticists can provide an accurate and large-scale quantification of the demographic processes leading to linguistic contact, setting a new framework to understand linguistic differentiation in a wider perspective that might be referred to as POPULATION LINGUISTICS.

Dutch Summary (Nederlandse Samenvatting)

This dissertation comprises five previously published articles and one study that will appear shortly. In them I have examined, tested and used linguistic methods to classify linguistic varieties on the basis of samples consisting of lexical items, the type of material that is generally readily available in linguistic atlases and databases.

All the empirical studies reported in the dissertation use at least 90 items. This number is recognized as sufficient for analyses in which measurements are aggregated. To compare and classify linguistic varieties, two methods are used that lead to the computation of a linguistic distance matrix. (i) The first compares two basilectal varieties as the percentage of phonetic, morphological, syntactic and lexical features on which the two varieties agree. This method is called the relativer Identitätswert (RIW). (ii) The other method is based on the Levenshtein distance (edit distance), which computes the phonological differences between two varieties by comparing phonetic transcriptions with each other. Words with the same meaning are aligned with one another. To determine the extent to which the two transcriptions differ, the number of substitutions, insertions and deletions needed to change one transcription into the other is determined. The RIW method is used in one section only, while the Levenshtein method is used in the other chapters. The studies reported concern the classification of Dutch varieties from the Netherlands, languages and dialects from Spain, Bantu varieties from Gabon, Tanzania and neighbouring countries and, finally, Turkic and Indo-Iranian languages spoken in Kyrgyzstan, Tajikistan and Uzbekistan.
The dissertation aims to investigate human history by interpreting linguistic differences together with different kinds of evidence. Within a multidisciplinary perspective aimed at providing a higher level of anthropological synthesis, linguistic diversity is used as a proxy for the cultural differences of the corresponding populations and is then compared with the variability of family names (their number, frequency and geographic distribution) or with genetic differences based on molecular markers in the DNA. The comparison requires methodological adjustments in order to assess the stability of linguistic classifications, in particular through the use of the bootstrap method and the use of residual Levenshtein distances as a way to examine whether linguistic relatedness is determined partly by geographic contact and, for the rest, by historical (genealogical) relatedness. The relation between pronunciation differences and geographic distances suggests a new line of research showing how residual Levenshtein distances offer testable hypotheses about linguistic convergence and divergence in the past and possibly shed more light on the influence of population growth and migrations on linguistic variability. With respect to the latter, the analysis of family names can reveal migrations that may have taken place in historical times, and we can distinguish regions that have received many immigrants who left regions that remained demographically more stable (we distinguish these receiving regions from source regions that have seen many emigrants leave). We conjecture that such migration patterns have influenced dialect and language contact. This is a new perspective from which we can investigate the effects of migration on language change.
The order of the chapters corresponds to their focus. CHAPTERS 2 and 3 are introductory and methodological in nature; CHAPTERS 4 and 5 report comparative studies of surname and linguistic variation in the Netherlands and Spain, respectively; CHAPTERS 6 and 7 concern wider linguistic contexts: Bantu and Central-Asian languages. Conclusions and discussion follow in CHAPTER 8.

CHAPTER 2 (“Sprachraum and Genetics”, 2010) discusses how a geneticist views genetic and linguistic cartography. He wants to use these as a historical and methodological foundation for the research. In doing so he aims to evaluate the steps that lead some population geneticists to collaborate with linguists, a collaboration that historically began with the comparison of maps.

CHAPTER 3 (“Projecting Dialect Distances to Geography”, 2007) is about the instability of cluster analysis in dialectology and the application of the bootstrap method to linguistic data. For cases in which bootstrapping is impossible because the original, non-aggregated data are not available, an alternative approach is described that consists of adding random noise to the distance or similarity measurements. This approach is called noisy clustering.

CHAPTER 4 (“To What Extent are Surnames Words?”, 2006) is about comparing the distribution of surnames with the distribution of dialect pronunciations, which are clearly culturally transmitted, in the Netherlands. A total of 19,910 different family names, randomly selected from 226 locations, and 125 different words, whose pronunciation was recorded in 252 places, were analyzed. We found that, when the collinear effects of geography on both the transmission of surnames and on culture are taken into consideration, there is no longer a statistically significant relation between dialect and surname variation. This confirms that surnames cannot be used as a proxy for dialect variation. We find the results both historically and geographically insightful, and we hope they will lead to a deeper understanding of the role that local migrations and cultural diffusion play in the diversity of surnames and dialects.

CHAPTER 5 (“Footprints of Middle Ages Kingdoms are Visible in the Surname and Linguistic Structure of Spain”, 2016) aims to determine whether the present-day geographic variability of Spanish surnames reflects the historical phenomena as they were at the time the names were introduced (13th-16th century). On the basis of the frequency distribution of 33,753 unique family names, the surname distances between 47 provinces on the Spanish mainland were measured and compared with the relations between the corresponding language varieties. These relations were determined by means of a dialectometric analysis of variation in phonetic, morphological, syntactic and lexical features. Surname and linguistic variability suggest a similar picture; the most important clusters are found in the east (Aragón, Catalonia, Valencia) and in the north of the country (Asturias, Galicia, León). The remaining regions appear to be rather homogeneous. We interpret this pattern as the long-lasting effect of political and demographic phenomena connected with the southward “reconquest” (Reconquista) of the territories that were under Arab rule from the eighth to the late fifteenth century.

CHAPTER 6 (“Linguistic Probes into the Bantu History of Gabon”, unpublished) concerns the comparison of the linguistic and genetic diversity of Gabon (Africa). This offers new perspectives for the study of the relation between the early Bantu expansion and the transition to agriculture some millennia ago. Two independently obtained datasets, comprising a total of 126 different varieties, were processed. They lead to comparable results and show that the languages are clustered into the same groups. The Levenshtein linguistic distances agree fully with other classifications based on shared vocabulary, that is, on distances computed as the percentage of words that do (not) have the same historical origin. While other approaches require experts to identify the cognate words in the different varieties, this is not necessary when the Levenshtein distance is used, which is simpler to use and, given the larger amount of information it processes (all the sounds in the words), more sensitive in detecting differences. The genetic data now point, more strongly than previously observed, to a lack of differentiation between populations. The linguistic cartography of our classifications shows well-delimited areas that are related to early migration waves of Bantu speakers who passed through Gabon in the early stages of their dispersal from Cameroon and Nigeria.

CHAPTER 7 (“A Central-Asian Language Survey”, 2016) relates to a large research project whose aim is to describe and compare genetic and social differences between sedentary and semi-nomadic populations living in Central Asia. The language varieties studied (Turkic or Indo-Iranian) come from 23 places corresponding to the major ethnic groups of Kyrgyzstan, Tajikistan and Uzbekistan (Karakalpak, Kazakh, Kyrgyz, Tajik, Uzbek, Yaghnobi). On the one hand, phonological diversity was measured with the Levenshtein distance; on the other hand, linguistic contact was measured as the number of loanwords that one linguistic family has borrowed from the other, according to a precision/recall analysis validated by expert judgments. As for the Turkic languages, the results do not support the idea that Kazakh and Karakalpak are different languages, and they indicate that several Karakalpak varieties exist. Kyrgyz and Uzbek, by contrast, appear to be rather homogeneous. Within the Indo-Iranian languages, the distinction between varieties of Tajik and Yaghnobi is very clear, despite the endangered status of the latter language, whose speakers are in the midst of a process of assimilation into Tajik society.

CHAPTER 8 (“General Conclusions and New Prospects”) offers a broader methodological discussion of the Levenshtein distance, based on the empirical tests included in the dissertation and on what they show about its specific characteristics when linguistic differences are related to language contact or to historical linguistics. Whereas in the past linguistics raised and investigated a considerable number of hypotheses about the anthropological diversity of human populations, today demographers and geneticists can provide an accurate and large-scale quantification of the demographic processes that lead to language contact and set up a new framework for understanding linguistic differentiation in a broader perspective that can be referred to as POPULATION LINGUISTICS. [Dutch translation by Wilbert Heeringa]


About the author

Franz Manni (1973) studied Biology and Genetics at the University of Ferrara, Italy. In 1997 he graduated with distinction (cum laude) with a Laurea in Biology concerning Historical Demography. In 2000 he completed a PhD project in Population Genetics at the Department of Biology of the University of Ferrara under the supervision of Italo Barrai. In 2003 he became Maître de Conférences (Assistant Professor) in Genetics at the Department Hommes, Natures, Sociétés of the National Museum of Natural History of Paris (France), where he conducts multidisciplinary research involving demography, genetics and linguistics. From 2008 to 2013 he was the Executive Editor of the journal Human Biology (Wayne State University Press, Detroit, MI) and, since 2013, he has been Scientific Commissioner at the Musée de l’Homme, Paris. Franz Manni has been collaborating with the Department Alfa-Informatica of the University of Groningen since 2001 on the projects included in this dissertation.

Groningen dissertations in linguistics (GRODIL)

1. Henriëtte de Swart (1991). Adverbs of Quantification: A Generalized Quantifier Approach. 2. Eric Hoekstra (1991). Licensing Conditions on Phrase Structure. 3. Dicky Gilbers (1992). Phonological Networks. A Theory of Segment Representation. 4. Helen de Hoop (1992). Case Configuration and Noun Phrase Interpretation. 5. Gosse Bouma (1993). Nonmonotonicity and Categorial Unification Grammar. 6. Peter I. Blok (1993). The Interpretation of Focus. 7. Roelien Bastiaanse (1993). Studies in Aphasia. 8. Bert Bos (1993). Rapid User Interface Development with the Script Language Gist. 9. Wim Kosmeijer (1993). Barriers and Licensing. 10. Jan‐Wouter Zwart (1993). Dutch Syntax: A Minimalist Approach. 11. Mark Kas (1993). Essays on Boolean Functions and Negative Polarity. 12. Ton van der Wouden (1994). Negative Contexts. 13. Joop Houtman (1994). Coordination and Constituency: A Study in Categorial Grammar. 14. Petra Hendriks (1995). Comparatives and Categorial Grammar. 15. Maarten de Wind (1995). Inversion in French. 16. Jelly Julia de Jong (1996). The Case of Bound Pronouns in Peripheral Romance. 17. Sjoukje van der Wal (1996). Negative Polarity Items and Negation: Tandem Acquisition. 18. Anastasia Giannakidou (1997). The Landscape of Polarity Items. 19. Karen Lattewitz (1997). Adjacency in Dutch and German. 20. Edith Kaan (1997). Processing Subject‐Object Ambiguities in Dutch. 21. Henny Klein (1997). Adverbs of Degree in Dutch. 22. Leonie Bosveld‐de Smet (1998). On Mass and Plural Quantification: The case of French ‘des’/‘du’‐ NPs. 23. Rita Landeweerd (1998). Discourse semantics of perspective and temporal structure. 24. Mettina Veenstra (1998). Formalizing the Minimalist Program. 25. Roel Jonkers (1998). Comprehension and Production of Verbs in aphasic Speakers. 26. Erik F. Tjong Kim Sang (1998). Machine Learning of Phonotactics. 27. Paulien Rijkhoek (1998). On Degree Phrases and Result Clauses. 28. Jan de Jong (1999). 
Specific Language Impairment in Dutch: Inflectional Morphology and Argument Structure. 29. H. Wee (1999). Definite Focus. 30. Eun‐Hee Lee (2000). Dynamic and Stative Information in Temporal Reasoning: Korean tense and aspect in discourse. 31. Ivilin P. Stoianov (2001). Connectionist Lexical Processing. 32. Klarien van der Linde (2001). Sonority substitutions. 33. Monique Lamers (2001). Sentence processing: using syntactic, semantic, and thematic information. 34. Shalom Zuckerman (2001). The Acquisition of ʺOptionalʺ Movement. 35. Rob Koeling (2001). Dialogue‐Based Disambiguation: Using Dialogue Status to Improve Speech Un‐ derstanding. 36. Esther Ruigendijk (2002). Case assignment in Agrammatism: a cross‐linguistic study. 37. Tony Mullen (2002). An Investigation into Compositional Features and Feature Merging for Maxi‐ mum Entropy‐Based Parse Selection. 38. Nanette Bienfait (2002). Grammatica‐onderwijs aan allochtone jongeren. 39. Dirk‐Bart den Ouden (2002). Phonology in Aphasia: Syllables and segments in level‐specific deficits. 40. Rienk Withaar (2002). The Role of the Phonological Loop in Sentence Comprehension. 41. Kim Sauter (2002). Transfer and Access to Universal Grammar in Adult Second Language Acquisition. GRONINGEN DISSERTATIONS IN LINGUISTICS 321

42. Laura Sabourin (2003). Grammatical Gender and Second Language Processing: An ERP Study. 43. Hein van Schie (2003). Visual Semantics. 44. Lilia Schürcks‐Grozeva (2003). Binding and Bulgarian. 45. Stasinos Konstantopoulos (2003). Using ILP to Learn Local Linguistic Structures. 46. Wilbert Heeringa (2004). Measuring Dialect Pronunciation Differences using Levenshtein Distance. 47. Wouter Jansen (2004). Laryngeal Contrast and Phonetic Voicing: ALaboratory Phonology. 48. Judith Rispens (2004). Syntactic and phonological processing indevelopmentaldyslexia. 49. Danielle Bougaïré (2004). Lʹapproche communicative des campagnes de sensibilisation en santé publique au Burkina Faso: Les cas de la planification familiale, du sida et de lʹexcision. 50. Tanja Gaustad (2004). Linguistic Knowledge and Word Sense Disambiguation. 51. Susanne Schoof (2004). An HPSG Account of Nonfinite Verbal Complements in Latin. 52. M. Begoña Villada Moirón (2005). Data‐driven identification of fixed expressions and their modifi‐ ability. 53. Robbert Prins (2005). Finite‐State Pre‐Processing for Natural Language Analysis. 54. Leonoor van der Beek (2005) Topics in Corpus‐Based Dutch Syntax 55. Keiko Yoshioka (2005). Linguistic and gestural introduction and tracking of referents in L1 and L2 discourse. 56. Sible Andringa (2005). Form‐focused instruction and the development of second language profi‐ ciency. 57. Joanneke Prenger (2005). Taal telt! Een onderzoek naar de rol van taalvaardigheid en tekstbegrip in het realistisch wiskundeonderwijs. 58. Neslihan Kansu‐Yetkiner (2006). Blood, Shame and Fear: Self‐Presentation Strategies of Turkish Women’s Talk about their Health and Sexuality. 59. Mónika Z. Zempléni (2006). Functional imaging of the hemispheric contribution to language proc‐ essing. 60. Maartje Schreuder (2006). Prosodic Processes in Language and Music. 61. Hidetoshi Shiraishi (2006). Topics in Nivkh Phonology. 62. Tamás Biró (2006). 
Finding the Right Words: Implementing Optimality Theory with Simulated Annealing. 63. Dieuwke de Goede (2006). Verbs in Spoken Sentence Processing: Unraveling the Activation Pat‐ tern of the Matrix Verb. 64. Eleonora Rossi (2007). Clitic production in Italian agrammatism. 65. Holger Hopp (2007). Ultimate Attainment at the Interfaces in Second Language Acquisition: Grammar and Processing. 66. Gerlof Bouma (2008). Starting a Sentence in Dutch: A corpus study of subject‐ and object‐fronting. 67. Julia Klitsch (2008). Open your eyes and listen carefully. Auditory and audiovisual speech perception and the McGurk effect in Dutch speakers with and without aphasia. 68. Janneke ter Beek (2008). Restructuring and Infinitival Complements in Dutch. 69. Jori Mur (2008). Off‐line Answer Extraction for Question Answering. 70. Lonneke van der Plas (2008). Automatic Lexico‐Semantic Acquisition for Question Answering. 71. Arjen Versloot (2008). Mechanisms of Language Change: Vowel reduction in 15th century West Frisian. 72. Ismail Fahmi (2009). Automatic term and Relation Extraction for Medical Question Answering System. 73. Tuba Yarbay Duman (2009). Turkish Agrammatic Aphasia: Word Order, Time Reference and Case. 74. Maria Trofimova (2009). Case Assignment by Prepositions in Russian Aphasia. 75. Rasmus Steinkrauss (2009). Frequency and Function in WH Question Acquisition. A Usage‐Based Case Study of German L1 Acquisition. 76. Marjolein Deunk (2009). Discourse Practices in Preschool. Young Children’s Participation in Everyday Classroom Activities. 322 GRODIL

77. Sake Jager (2009). Towards ICT‐Integrated Language Learning: Developing an Implementation Framework in terms of Pedagogy, Technology and Environment. 78. Francisco Dellatorre Borges (2010). Parse Selection with Support Vector Machines. 79. Geoffrey Andogah (2010). Geographically Constrained Information Retrieval. 80. Jacqueline van Kruiningen (2010). Onderwijsontwerp als conversatie. Probleemoplossing in interprofessioneel overleg. 81. Robert G. Shackleton (2010). Quantitative Assessment of English‐American Speech Relationships. 82. Tim Van de Cruys (2010). Mining for Meaning: The Extraction of Lexico‐semantic Knowledge from Text. 83. Therese Leinonen (2010). An Acoustic Analysis of Vowel Pronunciation in Swedish Dialects. 84. Erik‐Jan Smits (2010). Acquiring Quantification. How Children Use Semantics and Pragmatics to Constrain Meaning. 85. Tal Caspi (2010). A Dynamic Perspective on Second Language Development. 86. Teodora Mehotcheva (2010). After the fiesta is over. Foreign language attrition of Spanish in Dutch and German Erasmus Student. 87. Xiaoyan Xu (2010). attrition and retention in Chinese and Dutch university stu‐ dents. 88. Jelena Prokić (2010). Families and Resemblances. 89. Radek Šimík (2011). Modal existential wh‐constructions. 90. Katrien Colman (2011). Behavioral and neuroimaging studies on language processing in Dutch speakers with Parkinson’s disease. 91. Siti Mina Tamah (2011). A Study on Student Interaction in the Implementation of the Jigsaw Tech‐ nique in Language Teaching. 92. Aletta Kwant (2011).Geraakt door prentenboeken. Effecten van het gebruik van prentenboeken op de sociaal‐emotionele ontwikkeling van kleuters. 93. Marlies Kluck (2011). Sentence amalgamation. 94. Anja Schüppert (2011). Origin of asymmetry: Mutual intelligibility of spoken Danish and Swedish. 95. Peter Nabende (2011).Applying Dynamic Bayesian Networks in Transliteration Detection and Generation. 96. Barbara Plank (2011). Domain Adaptation for Parsing. 97. 
Cagri Coltekin (2011).Catching Words in a Stream of Speech: Computational simulations of seg‐ menting transcribed child‐directed speech. 98. Dörte Hessler (2011).Audiovisual Processing in Aphasic and Non‐Brain‐Damaged Listeners: The Whole is More than the Sum of its Parts. 99. Herman Heringa (2012). Appositional constructions. 100. Diana Dimitrova (2012). Neural Correlates of Prosody and Information Structure. 101. Harwintha Anjarningsih (2012).Time Reference in Standard Indonesian Agrammatic Aphasia. 102. Myrte Gosen (2012). Tracing learning in interaction. An analysis of shared reading of picture books at kindergarten. 103. Martijn Wieling (2012). A Quantitative Approach to Social and Geographical Dialect Variation. 104. Gisi Cannizzaro (2012). Early word order and animacy. 105. Kostadin Cholakov (2012). Lexical Acquisition for Computational Grammars. A Unified Model. 106. Karin Beijering (2012). Expressions of epistemic modality in Mainland Scandinavian. A study into the lexicalization‐grammaticalization‐pragmaticalization interface. 107. Veerle Baaijen (2012). The development of understanding through writing. 108. Jacolien van Rij (2012).Pronoun processing: Computational, behavioral, and psychophysiological studies in children and adults. 109. Ankelien Schippers (2012). Variation and change in Germanic long‐distance dependencies. GRONINGEN DISSERTATIONS IN LINGUISTICS 323

110. Hanneke Loerts (2012). Uncommon gender: Eyes and brains, native and second language learners, & grammatical gender.
111. Marjoleine Sloos (2013). Frequency and phonological grammar: An integrated approach. Evidence from German, Indonesian, and Japanese.
112. Aysa Arylova (2013). Possession in the Russian clause. Towards dynamicity in syntax.
113. Daniël de Kok (2013). Reversible Stochastic Attribute‐Value Grammars.
114. Gideon Kotzé (2013). Complementary approaches to tree alignment: Combining statistical and rule‐based methods.
115. Fridah Katushemererwe (2013). Computational Morphology and Bantu Language Learning: an Implementation for Runyakitara.
116. Ryan C. Taylor (2013). Tracking Referents: Markedness, World Knowledge and Pronoun Resolution.
117. Hana Smiskova‐Gustafsson (2013). Chunks in L2 Development: A Usage‐based Perspective.
118. Milada Walková (2013). The aspectual function of particles in phrasal verbs.
119. Tom O. Abuom (2013). Verb and Word Order Deficits in Swahili‐English bilingual agrammatic speakers.
120. Gülsen Yılmaz (2013). Bilingual Language Development among the First Generation Turkish Immigrants in the Netherlands.
121. Trevor Benjamin (2013). Signaling Trouble: On the linguistic design of other‐initiation of repair in English conversation.
122. Nguyen Hong Thi Phuong (2013). A Dynamic Usage‐based Approach to Second Language Teaching.
123. Harm Brouwer (2014). The Electrophysiology of Language Comprehension: A Neurocomputational Model.
124. Kendall Decker (2014). Orthography Development for Creole Languages.
125. Laura S. Bos (2015). The Brain, Verbs, and the Past: Neurolinguistic Studies on Time Reference.
126. Rimke Groenewold (2015). Direct and indirect speech in aphasia: Studies of spoken discourse production and comprehension.
127. Huiping Chan (2015). A Dynamic Approach to the Development of Lexicon and Syntax in a Second Language.
128. James Griffiths (2015). On appositives.
129. Pavel Rudnev (2015). Dependency and discourse‐configurationality: A study of Avar.
130. Kirsten Kolstrup (2015). Opportunities to speak. A qualitative study of a second language in use.
131. Güliz Güneş (2015). Deriving Prosodic structures.
132. Cornelia Lahmann (2015). Beyond barriers. Complexity, accuracy, and fluency in long‐term L2 speakers’ speech.
133. Sri Wachyunni (2015). Scaffolding and Cooperative Learning: Effects on Reading Comprehension and Vocabulary Knowledge in English as a Foreign Language.
134. Albert Walsweer (2015). Ruimte voor leren. Een etnografisch onderzoek naar het verloop van een interventie gericht op versterking van het taalgebruik in een knowledge building environment op kleine Friese basisscholen.
135. Aleyda Lizeth Linares Calix (2015). Raising Metacognitive Genre Awareness in L2 Academic Readers and Writers.
136. Fathima Mufeeda Irshad (2015). Second Language Development through the Lens of a Dynamic Usage‐Based Approach.
137. Oscar Strik (2015). Modelling analogical change. A history of Swedish and Frisian verb inflection.
138. He Sun (2015). Predictors and stages of very young child EFL learners’ English development in China.
139. Marieke Haan (2015). Mode Matters. Effects of survey modes on participation and answering behavior.
140. Nienke Houtzager (2015). Bilingual advantages in middle‐aged and elderly populations.

141. Noortje Joost Venhuizen (2015). Projection in Discourse: A data‐driven formal semantic analysis.
142. Valerio Basile (2015). From Logic to Language: Natural Language Generation from Logical Forms.
143. Jinxing Yue (2016). Tone‐word Recognition in Mandarin Chinese: Influences of lexical‐level representations.
144. Seçkin Arslan (2016). Neurolinguistic and Psycholinguistic Investigations on Evidentiality in Turkish.
145. Rui Qin (2016). Neurophysiological Studies of Reading Fluency. Towards Visual and Auditory Markers of Developmental Dyslexia.
146. Kashmiri Stec (2016). Visible Quotation: The Multimodal Expression of Viewpoint.
147. Yinxing Jin (2016). Foreign language classroom anxiety: A study of Chinese university students of Japanese and English over time.
148. Joost Hurkmans (2016). The Treatment of Apraxia of Speech. Speech and Music Therapy, an Innovative Joint Effort.
149. Franziska Köder (2016). Between direct and indirect speech: The acquisition of pronouns in reported speech.
150. Femke Swarte (2016). Predicting the mutual intelligibility of Germanic languages from linguistic and extra‐linguistic factors.
151. Sanne Kuijper (2016). Communication abilities of children with ASD and ADHD. Production, comprehension, and cognitive mechanisms.
152. Jelena Golubović (2016). Mutual intelligibility in the Slavic language area.
153. Nynke van der Schaaf (2016). “Kijk eens wat ik kan!” Sociale praktijken in de interactie tussen kinderen van 4 tot 8 jaar in de buitenschoolse opvang.
154. Simon Šuster (2016). Empirical studies on word representations.
155. Kilian Evang (2016). Cross‐lingual Semantic Parsing with Categorial Grammars.
156. Miren Arantzeta Pérez (2017). Sentence comprehension in monolingual and bilingual aphasia: Evidence from behavioral and eye‐tracking methods.
157. Sana‐e‐Zehra Haidry (2017). Assessment of Dyslexia in the Urdu Language.
158. Srđan Popov (2017). Auditory and Visual ERP Correlates of Gender Agreement Processing in Dutch and Italian.
159. Molood Sadat Safavi (2017). The Competition of Memory and Expectation in Resolving Long‐Distance Dependencies: Psycholinguistic Evidence from Persian Complex Predicates.
160. Christopher Bergmann (2017). Facets of native‐likeness: First‐language attrition among German emigrants to Anglophone North America.
161. Stefanie Keulen (2017). Foreign Accent Syndrome: A Neurolinguistic Analysis.
162. Franz Manni (2017). Linguistic Probes into Human History.

GRODIL
Center for Language and Cognition Groningen (CLCG)
P.O. Box 716
9700 AS Groningen
The Netherlands