
JASs Reports. Journal of Anthropological Sciences, Vol. 95 (2017), pp. 249-267. doi (e-pub ahead of print): 10.4436/jass.95020; doi: 10.4436/jass.89003

From surnames to linguistic and genetic diversity: five centuries of internal migrations in Spain

Roberto Rodríguez-Díaz1,2, María José Blanco-Villegas1 & Franz Manni2,1

1) Área de Antropología Física, Departamento de Biología Animal, Facultad de Farmacia, Campus Miguel de Unamuno. 37007, , Spain 2) National Museum of Natural History, Musée de l’Homme. 17, Place du Trocadéro, 75116, Paris, France e-mail: [email protected]

Summary - In a previous study concerning 33,753 single Spanish surnames (considered as tokens) occurring 51,419,788 times, we showed that the present-day geography of surname variability in Spain still corresponds to the political geography of the country at the end of the Middle Ages. Here we reprocess the same database, clustering surnames with Self-Organizing Maps (SOMs) according to their geographic distribution, to identify the monophyletic surnames showing a geo-historical origin in one of the 47 provinces of continental Spain. They number 25,714 and occur 12,348,109 times, meaning that about 75% of the Spanish population bears a surname that had a polyphyletic origin. From monophyletic surnames we compute migration matrices accounting for the internal migrations that have taken place over the last five centuries, since Spanish surnames started to be patrilineally inherited. The mono/polyphyletic classification we obtain fits ancient census data and is compatible with the published molecular diversity of the Y-chromosomes associated with selected Spanish surnames. Monophyletic surnames indicate that i) the provinces exhibiting a higher percentage of autochthonous surnames are also ii) those from which emigration corresponds to a local isolation-by-distance model of diffusion and iii) those that attracted a lower number of immigrants. These are also the provinces where languages other than Castilian are spoken. We suggest that demographic stability explains linguistic resilience, as people prefer to move to areas in which the linguistic variety is more similar to their own. So far the reciprocal influence of migration and language has been investigated at local scales; here we outline how to investigate it at national scales and for time-depths of centuries.

Keywords - Spain, Surnames, Languages, Self-Organizing Maps, Migrations, Census data, Y-chromosome.

Introduction

Background
Family names carry social and economic information that has earned them a place in several interdisciplinary approaches to human history. Historians, linguists, and geographers can play as active a role as biologists in surname studies and population analyses. Today, in an age of global migration (Castles & Miller, 2009), the distribution of surnames remains far from random and has the potential to allow an intermediate level of access to the recent past and to smaller geographical scales, both of which are difficult to obtain otherwise. By providing evidence for migration phenomena in different periods and controlling for them, it is possible to delineate past genetic isolates and population structures that have been modified or have disappeared altogether (Boattini et al., 2010; Rodriguez-Diaz & Blanco-Villegas, 2010; Boattini et al., 2012; Darlu et al., 2012).

The large expansion of the available surname datasets, both in time and space, has led to the development of new methods and analytical tools. Among them, and now widely used, are

The JASs is published by the Istituto Italiano di Antropologia, www.isita-org.com

automatic geographic representations of surname diversity which plot the variations of frequency of a given family name, or of a set of family names sharing some phonetic or grammatical features (see the contributions of Bloothooft and Dräger reported in the collective article by Darlu et al., 2012; Bloothooft & Darlu, 2013). Some recent statistical methods are also becoming established, such as Bayesian approaches to infer the origins of migrants (Darlu et al., 2012), Self-Organizing Maps (SOMs) to automatically identify surnames having the same geographical origin (Manni et al., 2005; Boattini et al., 2012), or approaches to identify ethno-cultural groups (Mateos, 2011) by retrieving forenames associated with a given set of surnames and checking to which other surnames they are linked, iteratively, until an optimum is achieved (Mateos & Tucker, 2008). In all cases, the classification is empirical, not based on scientific or ethnologic background, and makes it possible to identify clusters of linked individuals corresponding to several isolated subgroups existing in the population. All the applications listed point to the same endeavour: the depiction of contemporary and past migrations and the assessment of multiculturalism and assimilation phenomena. The results are of direct interest to social anthropologists and population geneticists.

Migrations inferred from surnames
Aside from special cases, it is often difficult to depict the historical internal migrations that occurred within a given region; surnames can help when no alternative documentation is available. They make it possible to identify the direction of migrations that took place in, say, a European country over the last four or five centuries but, unlike historical registers, they do not say when the migrations took place. They could have happened at any time between the advent of fixed surnames and the last generation, that is, over a time span of about five centuries (Spanish surnames became fixed starting with the 16th century).

The way we account here for the intensity of the internal migrations is quite simple. It consists in coding the surnames listed in the database of current Spanish residents as vectors whose components correspond to the relative frequency of each surname in the provinces under consideration. Then all the vectors (surnames) are classified in a discrete number of clusters by using Kohonen maps (Kohonen, 1982, 1984; Kaski, 1997) or other similar methods, so that each cluster corresponds to a group of surnames having a similar geographic distribution over the country: frequent in some provinces and not in others. Finally, such groups are plotted over a geographic map to see if there are visible peaks of frequency corresponding to a single province.

If one assumes that the province where the relative frequency of each surname is the highest corresponds to the geo-historical origin, it is possible to measure migrations because the diffusion centre of each family name is known, as well as its present-day distribution. In some cases, the peak of frequency is geographically ambiguous because it corresponds to two (or more) provinces. Such ambiguities are related to the fact that many surnames, spelled in the same way, independently became the name of unrelated families located in different areas: they are called polyphyletic surnames. Only the surnames with a clear origin in one province, that is the monophyletic surnames, are used to assess migrations (see Manni et al., 2005; Boattini et al., 2012 for details about the procedure). Migration patterns can be summarized in two migration matrices, one for the aggregate immigration processes and one for the aggregate emigration processes that took place over the last five centuries. From them the provinces can be classified into four categories: 1) Isolated provinces (low emigration, low immigration); 2) Corridor provinces (high emigration, high immigration); 3) Unattractive provinces (high emigration, low immigration); 4) Attractive provinces (low emigration, high immigration).

Migration range
Concerning migration distances, Boattini et al. (2012) classified them as short-, medium- and long-range, and we do the same here. It is reasonable to think that the medium and long range movements took place in more recent

times, when the mechanization of transportation and industrialization led to massive displacement of the population, which progressively abandoned rural life. By contrast, other provinces are characterized by very local emigration distances directed to neighbouring areas; they correspond to processes that took place within a more traditional frame of displacement, probably when people used to move by their own means, progressively diffusing. Long or short, the aggregation of migration distances, surname after surname, contributes to the establishment of coherent migration routes. Aggregated data show (Boattini et al., 2012) that neighbouring provinces can be quite different in the number and the provenance of the immigrants they attracted, but also in the directions of the emigrants that left them. We test if this is the case in Spain, but we anticipate that the detailed analysis of the directions of the migratory flow will be presented in a forthcoming article.

The goals of this study
This report follows a previous assessment of the variability of Spanish surnames (Rodriguez-Diaz et al., 2015) in which we analyzed the frequency distribution of 33,753 unique surnames (tokens) occurring 51,419,788 times, according to the list of Spanish residents of the year 2008. From family names we measured surname distances among the 47 mainland Spanish provinces (for which we inferred consanguinity through isonymy levels) and compared these distances to the differences existing between the language varieties spoken in Spain according to Goebl (2010, 2013). The comparison showed a similar picture; major surname and linguistic clusters are located in the east (Aragón, Cataluña, ), and in the north of the country (, , León). The remaining regions appear to be considerably homogeneous. We interpreted this pattern as the long-lasting effect of the surname and linguistic normalization actively led by the Christian kingdoms of the north (Reigns of Castilla y León and Aragón) during and after the southwards reconquest (Reconquista) of the territories ruled by the Arabs from the 8th to the late 15th century. This is a time when surnames became transmitted in a fixed way and when Castilian linguistic varieties gained prestige and spread out, ultimately leading to the castillanization of family names by large-scale surname change favouring a limited set of prototypical Castilian- and Leónese-sounding surnames (a process that lasted until 1870) and explaining why Spain has a lower number of surnames than other European countries (Kremer, 1992, 1996, 2001, 2003).

In this report we reprocess the surname corpus of Rodriguez-Diaz et al. (2015) with a double purpose. The first aim, as we said, consists in the identification of monophyletic surnames and in the description of the internal migrations that took place in Spain since the beginning of fixed surname transmission, five centuries ago. The second aim is to address the linguistic diversity of Spain with respect to the migrations that we infer. Spain is certainly an ideal location to address this question since its linguistic diversity is remarkable. Over time, Basque, Catalan-Valencian and Galician have remained lively languages and now have an official status, being used at all levels of public life, including the media, and taught at school. Nevertheless, and despite a lower level of official support and a minor political connotation, other Spanish languages have persisted too, like Asturian-Leónese, Extremaduran and Aragónese (Tab. 1).

Our goal, complementary to Falck et al. (2012), who showed that people prefer to move to areas in which the linguistic variety is more similar to their own, is to make a preliminary examination of the reciprocal influence of migration and language. While the influence that demographics has on the evolution of linguistic diversity is obvious, there are few quantitative studies addressing it, probably due to the lack of detailed and readily available historical demographic data concerning a full linguistic domain.

For forty years a large body of research has been focused on investigating the effect of contact between mutually intelligible linguistic varieties (see Kerswill, 2006 and Dodsworth, 2017 for reviews). Many case-studies have


Tab. 1 - Languages spoken in Spain according to the Ethnologue (Simons, 2016). Regions appear between parentheses, provinces do not (see Fig. 2).

LANGUAGE             SPEAKERS     STATUS                     PROVINCES, (REGIONS)
Aragónese            30,000       Threatened                 ;
Asturian/Leónese     550,000      3 dialects; Threatened     Asturias; León; (, Castilla)
Basque               468,000      Provincial language        Alava; Bizkaia; ; Navarra
Catalan/Valencian    8,900,000    5 dialects; Provincial     (, Valenciana, , Aragón, )
Extremaduran         200,000                                 Caceres; (, Castilla and León)
Fala (Extremaduran)  11,000       Vigorous                   (Extremadura)
Galician             2,340,000    Many dialects; Provincial  Galicia; Asturias; León; Zamora
Gascon/Aranese       4,800        Provincial                 (Catalonia)
Spanish              45,890,000   Many dialects              All provinces

demonstrated that different varieties, in direct contact through face-to-face interaction (linguistic accommodation), become more alike with time. Contact-induced linguistic accommodation involves several processes: Levelling, the reduction in either the number of linguistic variants or in the degree of their variability; Koine, when new linguistic forms that did not exist before the contact arise; Reallocation, in which alternative forms are retained but assigned different roles in the sociolinguistic use of the dialects, or in their grammatical use; Simplification and increase in regularity. During the phase that follows the initial linguistic contact, a new focused variety gets established and is "homogeneously" adopted by the whole community.

An attractive area, say an economically dynamic town or province, has probably been a destination of migrants for centuries; initially they came from close areas but, as time passed, immigrants from more distant areas (where more divergent dialects are spoken) moved in. This is to say that linguistic levelling and simplification are expected to be stronger in the dialect of an attractive area than in the dialect of an area that is not, because immigration from distant regions is less likely in the second case, and contact phenomena are stronger when the linguistic difference between the varieties in contact is higher. On the other hand, an unappealing region has probably lost a large part of its population, which migrated elsewhere to find better living conditions. A linguistic consequence of this phenomenon is that the dialect of the latter has remained stable over time, and has not undergone processes of linguistic simplification (Fig. 1).

We test if higher linguistic diversity and clearer geographical structure are found in the provinces where the number of immigrants speaking external varieties has been low (unattractive and isolated provinces). On the other hand, the areas that have been the target of massive immigration are expected to have lost linguistic variability; in fact the identity-marking function of local varieties is less relevant where the number of allochthonous speakers is too large.

Three studies linking the molecular variability of the Y-chromosome to Spanish surnames have been recently published (Calderon et al., 2015; Solé-Morada et al., 2015; Martinez-Cadenas et al., 2016). This is why, in the discussion section, we will tangentially tie our results to them.

Fig. 1 - Two extreme scenarios of linguistic contact according to migration. Attractive Area: The variety spoken here is frequently in contact with new varieties; earlier immigration comes from the neighbourhood, while later immigrants come from distant areas. Besides the normal population growth over time, the demographic balance is as positive as the migratory balance. Unattractive Area: The spoken variety is less exposed to contact with different varieties because the majority of immigrants come from neighbouring areas. The population size may fall or remain somewhat stable over time because there are few immigrants and many emigrants. The growth of the population counterbalances the loss of population only partially. The timeline reported can be assumed to cover some centuries.

Materials and methods

Study area
Spain is located at the extreme south-west end of Europe. Administratively it is divided into 17 regions (2 overseas) and 52 provinces (5 overseas). We processed only the 47 continental provinces reported in Fig. 2, as we did in Rodriguez-Diaz et al. (2015).

Surname data
Transmission of Spanish surnames. In the traditional Spanish system of surname transmission, individuals inherit two surnames, the first one from the father and the second one from the mother. For example, the son of Mister Alvarez-Gomez and Miss Lopez-Garcia will have Alvarez-Lopez as surname: only the first surname of each parent is transmitted. Alvarez and Lopez are transmitted, Gomez and Garcia are not. While the mother contributes a surname, the patrilineal transmission prevails because her surname, being the second one, is lost at the following generation. A recent law made it possible to deviate from this traditional scheme. When processing the data we separated the paternal from the maternal surnames, adding them, as tokens, to the database constituted by Single Surname Types, which we also call SSTs in the article (to follow the example, the SSTs are: Alvarez; Garcia; Gomez; Lopez).
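The transmission rule and the splitting of double surnames into SSTs can be illustrated with a minimal sketch. It assumes, for illustration only, that a double surname is stored as two parts joined by a hyphen (as in the example of the text); the function names `child_surname` and `count_ssts` are hypothetical and do not come from the original study.

```python
from collections import Counter


def child_surname(father: str, mother: str) -> str:
    """Return the child's double surname: the first surname of each parent."""
    paternal = father.split("-")[0]  # first surname of the father
    maternal = mother.split("-")[0]  # first surname of the mother
    return f"{paternal}-{maternal}"


def count_ssts(residents):
    """Count Single Surname Types (SSTs) by splitting every double surname."""
    counts = Counter()
    for full_name in residents:
        # Assumption: the two surnames are separated by a single hyphen.
        counts.update(full_name.split("-"))
    return counts


child = child_surname("Alvarez-Gomez", "Lopez-Garcia")
ssts = count_ssts(["Alvarez-Gomez", "Lopez-Garcia", child])
print(child)
print(sorted(ssts.items()))
```

In this toy family of three residents, each person contributes two tokens, which is why the total occurrence of SSTs can approach twice the number of individuals.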


Fig. 2 - The study area consists of the 47 continental provinces of Spain (small labels). Regions are shown as black labels. In white the former reign of Castilla y León, in gray the former reign of Aragón.

The database. The database we initially processed is based on the full list of Spanish residents of the year 2008 (Padrón municipal), which is established regardless of nationality and citizenship. The Spanish National Statistics Institute (INE) sent us the list of surnames appearing at least five times in a single province. For computational ease, we kept only the surnames having a frequency higher than 20 occurrences; the methodological core of this article is the identification of the geographic origin of each surname in the database according to its geographical distribution, and 20 occurrences are the minimal frequency needed to make a pattern visible. The initial database is the same one processed in Rodriguez-Diaz et al. (2015): 33,753 different single-surname-types (SST = tokens) occurring 51,419,788 times. From this list we manually deleted 1834 foreign surnames to focus on the internal migrations of the Spanish population only. This task was easy because we deleted exclusively the surnames that are obviously "not-Spanish" (Johnson, Mueller, etc.); when we were not totally sure we did not delete them (for example, Catalan surnames can be similar to French and Italian ones). The final dataset we processed includes 31,919 single-surname-types (SST) occurring 51,215,362 times.

Data correction and treatment. Before proceeding to the clustering of surnames (SSTs) according to their spatial distribution, we corrected their frequency as in Boattini et al. (2012).

1) The absolute SST-frequencies in each province p ($F_{ip}$) have been weighted by the natural logarithm of the corresponding full population-size as estimated from the data ($N_p$):

$$f_{ip} = F_{ip} / \ln(N_p)$$

This transformation scales SST frequencies according to population sizes. The logarithmic procedure is related to the fact that the population sizes of various provinces vary by different orders of magnitude and

a simple division by $N_p$ leads to erroneous geographic attributions, as many SSTs have a small population size.

2) The obtained frequencies $f_{ip}$ have been weighted a second time by their absolute frequencies in the whole set of 47 Spanish provinces, by dividing them by their per-surname sum (the sum runs over all 47 provinces q):

$$W_{ip} = f_{ip} \Big/ \sum_{q=1}^{47} f_{iq}$$

The latter step is meant to normalize the relative surname frequency as if all surnames had the same absolute frequencies. Subsequently, the $W_{ip}$ values were arranged in vectors (one per surname) consisting of the weighted relative frequency of each of the 31,919 SSTs in the 47 continental provinces. These vectors have been the input of a Self-Organizing Map of Kohonen (SOM) (Kohonen, 1982, 1984), a clustering-type data-mining procedure based on artificial intelligence.

SOMs are unsupervised learning neural networks that allow classifying together SSTs according to their frequency in each of the 47 Spanish provinces we analyze. After its first application (Manni et al., 2005), the method has been replicated and validated several times (Boattini et al., 2010; Rodríguez Díaz et al., 2010; Boattini et al., 2012).

Software used. To cluster surnames we used the library "kohonen" (R Project; Wehrens and Buydens, 2007). We classified the 31,919 SSTs in a Self-Organizing Map (SOM) of size 17 x 17, meaning that surnames have been classified in 289 clusters. The size of the map, arbitrary, was decided after testing alternative sizes (10 x 10, 20 x 20, etc.) in order to have at least one cluster of the output map corresponding to each of the 47 provinces. The parameters of the analysis (radius of the neighbourhood function = 9; initial and late learning rate (α) = 0.05 and 0.01) were set according to the recommendations of Dr. Samuel Kaski (2008, personal communication). The final clustering was obtained by inputting 1000 times the entire corpus of input vectors to the SOM.

Inferring the geo-historical origin of surnames
The Self-organizing map (SOM). The SOM clustering accounts for 289 groups summarized in Table S1, where each cell corresponds to a cluster of SSTs having a very similar distribution pattern. The surname frequencies of each cluster have been merged and plotted on a corresponding number of geographical maps of Spain by using the software Arcgis 10.0. The visual inspection of the maps makes visible (or not) the presence of frequency-peaks corresponding to three cases: i) surnames with a single clear origin in one province (monophyletic surnames); ii) surnames of ambiguous origin involving two or more geographic frequency-peaks; iii) surnames with multiple origins when the patterns concern large geographic areas with many frequency peaks. We consider that the visual inspection is perfectly sufficient to identify the peaks of frequency without a statistical test, similarly to the visual identification of clusters in multivariate plots. By attributing a geo-historical origin corresponding to the frequency-peak of a group of monophyletic SSTs we consider that the area where they are most frequent corresponds to the region where their regular use and transmission began. Surnames not showing a frequency-peak because they are likely polyphyletic have been discarded.

Computation of migration matrices. We computed migration matrices (Bodmer & Cavalli-Sforza, 1968) by comparing the present-day distribution of monophyletic SSTs to their inferred geo-historical origin. The matrices tell how many surnames of endogenous or exogenous origin exist in each Spanish province and where the latter came from. The study is based on analyzing these matrices.

A migration matrix is a formal representation of population mobility (here the population of the 47 Spanish continental provinces) across one or more generations (approximately 20 generations, considering a generation every 25 years in the 500-year time depth of Spanish surnames). Data were organized in a first 47 by 47 migration matrix (M), where rows represent the provinces of origin of surnames, and columns represent their current location. In this way, the $M_{ij}$ element of the matrix M corresponds to the population bearing a surname that historically originated in the province i and is currently found in the province j. Such M matrix was then transformed in a forward


matrix F, by dividing each element of M by the row sum (row-stochastic matrix), and in a backward matrix B by dividing each element of M by the column sum (column-stochastic matrix). The F matrix tells to which other provinces the people who adopted a surname in a given province (and all their male descendants until the year 2008) emigrated. The B matrix is somewhat complementary; it tells from which provinces the migratory flow to a given province took place. Analytically, we define

$$d_F = 1 - f_{ii} \qquad d_B = 1 - b_{ii}$$

where $f_{ii}$ and $b_{ii}$ respectively are the diagonal elements of the F and B matrices. In this way the $d_F$ and $d_B$ coefficients provide information about the dispersal outside the province of origin of given surnames (male lineages) and about the amount of immigrants to that province. We invite the reader to pay careful attention to the definition of the $d_F$ and $d_B$ indexes as they are important to understand the article (see Table S2).

Province-specific migration patterns were estimated according to the non-diagonal values of the M matrix, and geographic distances between pairs of Spanish provinces were prepared in a square distance matrix (G).

Mean emigration distances were calculated as:

$$d_i = \sum_{j \neq i} (m_{ij} \cdot g_{ij}) \Big/ \sum_{j \neq i} m_{ij}$$

and mean immigration distances were obtained as:

$$d_j = \sum_{i \neq j} (m_{ij} \cdot g_{ij}) \Big/ \sum_{i \neq j} m_{ij}$$

where $m_{ij}$ are the elements of the M matrix and $g_{ij}$ are the elements of the G matrix.

Results

SOM clustering
According to Tab. S1, 195 clusters of the Self-Organizing Map correspond to 25,714 Single Surname Types (SSTs) having a clear geographical peak of origin (monophyletic surnames) and occurring 12,348,109 times, while 93 clusters account for 6,205 SSTs having an ambiguous pattern that does not allow identifying their geo-historical origin (polyphyletic surnames); the latter have been discarded from the analysis. One cluster of the map is empty (Table S1). From the 25,714 monophyletic SSTs we compute migration matrices (Bodmer & Cavalli-Sforza, 1968) corresponding to the aggregate migratory flow that took place in Spain since surnames became patrilineally transmitted in a fixed way (16th century).

General statistics
Table S2 accounts for the detailed statistics concerning the number of polyphyletic, monophyletic and autochthonous Single Surname Types (SSTs) per province as absolute frequencies; their total occurrence is accounted for as well. The occurrence of all SSTs is 150% of the official population size of the country; this happens because we separated the double surname of Spanish residents into two separate SSTs, when this was applicable. As many citizens officially use one single surname, the ratio is always lower than the expected 200% (two surnames per individual). We note that this ratio is the lowest in (6%), (41%) and Valencia (94%) because of a known sampling bias (see Rodriguez-Diaz et al., 2015 for full details) related to the size of . On average, 44% of the SSTs initially processed are polyphyletic (for the 47 provinces, MIN 29%, MAX 59%): they occur 38,867,253 times, that is 75% of the full database (51,215,362), with large variation according to the province (MIN 47%, MAX 92%). The monophyletic surnames (25,714 SSTs occurring 12,348,109 times) that are still found inside the province of origin are 6,571,331. See Table S2 for more details about frequencies.

In Rodriguez-Diaz et al. (2015) we have shown that the past administrative boundaries between the reigns of Castilla y León and Aragón (Fig. 2) played a role in shaping the surname diversity of Spain because two separate legal systems were retained until AD 1714, with massive surname-change occurring in the first

but not in the second kingdom. These historical differences might distort the SOM clustering of Spanish surnames because of a border-effect between the two former reigns; in fact Self-organizing maps (SOMs) are sensitive to the input data. Differently from other neural network applications, SOMs become perfectly adapted to classify the set of input vectors that are used, and modification of the input dataset leads to modification in the classification. For this technical reason, we have tested the "border-effect" mentioned above by processing separately the surname corpora corresponding to the two halves of the country, that is to the mentioned kingdoms, and compared the two resulting SOM-clusterings to the one corresponding to the full country (on which the results of the article are based – Tab. S1).

Fig. 3 - Absolute frequency and total occurrence of polyphyletic and monophyletic single surnames (SSTs) identified according to three different Self-Organizing Maps (SOMs): 'SOM Aragón' corresponds to the provinces of the former reign of Aragón (see Fig. 2); 'SOM Castilla' corresponds to the provinces located in the former reign of Castilla y León (see Fig. 2); 'SOM Full Spain' corresponds to the SOM used in this article. The three separate analyses were conducted to measure the effect of the administrative border that existed until 1714 between the two reigns, a border that corresponded to different practices in surname ascription; see text for details.

The comparison reported in Fig. 3 (based on a slightly different dataset because we had experimented by excluding more surnames of foreign origin) shows that large differences exist in terms of the number of identified polyphyletic surnames. The SOM corresponding to all Spain delivers 5311 polyphyletic surnames (1545 + 1035 + 1324 + 1407), but the separate SOMs for Aragón and Castilla y León identify 6187 (2771 + 377 + 3039) additional polyphyletic ones (+116%) (Fig. 3). Such discrepancy is expected because the number of polyphyletic surnames that are identified is determined by the geographical area of distribution. Concerning monophyletic surnames, 26,608 of them appear in the SOM corresponding to full Spain while the SOMs for Castilla y León and Aragón identify 3620 additional surnames (+13%). The overlap between the SOM for full Spain and the SOM for Castilla y León, or the SOM for Aragón, corresponds to 25,752 monophyletic surnames, over a grand total of 30,228 monophyletic ones identified by at least one of the three SOM analyses, an overlap of 85%. This is the error related to the border effect, and probably to the different naming practices that existed in the two kingdoms. While not negligible, we find


this error acceptable for a study aimed at a general description of the migratory movements that occurred over a time-span of five centuries (Tab. S2). We remind the reader that our inferences concerning the geo-historical origin of Spanish surnames do not reach the accuracy of genealogical documents, but we also note that historical records lack the temporal continuity needed for a general study like this one.

The percentage of autochthonous surnames
An interesting parameter, to visualize how migration altered the composition of the Middle-Age population, is the percentage of autochthonous surnames per province (Tab. S2, Fig. 4), that is, the surnames that had their geo-historical origin inside the province where they are currently found. When only their proportion over the total number of SSTs is taken into consideration, we see no clear pattern at a national scale (left of Fig. 4) but, when the total number of occurrences of autochthonous SSTs is the variable under examination, it is apparent that their frequency is higher along the northern and the eastern side of Spain (right of Fig. 4).

Fig. 4 - Left: Percentage of the surnames (SSTs) autochthonous of each province over the total number of SSTs. Here the absolute number of SSTs is reported, not their occurrence. Right: Percentage of the occurrence of autochthonous surnames (reported on the left) over the total occurrence of all SSTs per province. See Tab. S2 for details.

Autochthonous surnames and historical census data
Attributing a geo-historical origin to surnames provides evidence about the founding stock of the Spanish population when surnames became fixed. If we assume that the extinction rate of surnames is independent of their geographical and historical context, then the number of surnames (currently existing in the Spanish population) that we see having their geo-historical origin in each province is expected to be proportional to the population size that provinces had when surnames became established five centuries ago. The fact that only the oldest historical census data (AD 1787 in Fig. 5) is correlated with our estimates of the number of surnames originating in each province goes in this direction, and means that in AD 1787 the population size of provinces was still close to that of the end of the Middle Ages (three centuries before). Later migrations and differential population growth made the correlation with more recent census data non-significant (Tab. S3, Fig. 5). This finding, expected and already noted in Manni et al. (2005), helps to validate our SOM clustering.
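The two percentages mapped in Fig. 4 differ only in what is counted: distinct surnames (SSTs) on the left, bearers (occurrences) on the right. The following is a minimal sketch of that distinction, not the procedure actually used in the article. The record layout, the province labels `P1`/`P2`, the surname labels and all counts are hypothetical, and the denominator is assumed to be the SSTs (or occurrences) currently found in each province.

```python
from collections import defaultdict

# Hypothetical toy records (not data from the article):
# (surname, province of origin inferred by the SOM, province of residence, bearers)
records = [
    ("Surname_A", "P1", "P1", 120),
    ("Surname_A", "P1", "P2", 30),
    ("Surname_B", "P2", "P1", 50),
    ("Surname_C", "P1", "P1", 10),
    ("Surname_D", "P2", "P2", 200),
    ("Surname_E", "P1", "P2", 20),
]


def autochthonous_percentages(rows):
    """Return {province: (pct_of_SSTs, pct_of_occurrences)} for autochthonous surnames."""
    types_all = defaultdict(set)    # distinct surnames found in each province
    types_auto = defaultdict(set)   # of which: surnames originating in that province
    occ_all = defaultdict(int)      # bearers living in each province
    occ_auto = defaultdict(int)     # of which: bearers of autochthonous surnames
    for surname, origin, residence, bearers in rows:
        types_all[residence].add(surname)
        occ_all[residence] += bearers
        if origin == residence:
            types_auto[residence].add(surname)
            occ_auto[residence] += bearers
    return {
        p: (100.0 * len(types_auto[p]) / len(types_all[p]),
            100.0 * occ_auto[p] / occ_all[p])
        for p in types_all
    }


result = autochthonous_percentages(records)
for province, (pct_types, pct_occ) in sorted(result.items()):
    print(f"{province}: {pct_types:.1f}% of SSTs, {pct_occ:.1f}% of occurrences")
```

In this toy case province P2 has few autochthonous SSTs but most of its inhabitants bear them, which is the kind of contrast between the left and right maps that the text describes.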

Fig. 5 - Correlation between the number of single surnames (SSTs) autochthonous of each province (roughly proportional to the population size of the provinces when surnames became transmitted in a fixed way) and the census size per province from AD 1787 to AD 2001. The correlation decreases with time according to the progressive demographic change of the population. Only the oldest census data is significantly correlated (see P values on the right). Phases of change and stability can be identified. See Tab. S3 for details (www.ine.es).
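The comparison in Fig. 5 amounts to correlating one fixed vector (autochthonous SSTs per province) with a series of census vectors. The sketch below illustrates the idea with a Pearson coefficient, which is an assumption: the text does not state which correlation statistic or significance test was used. All province values are hypothetical and the P values are not computed here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for five provinces (not taken from Tab. S2 or Tab. S3).
autochthonous_ssts = [400, 650, 300, 900, 520]
census = {
    1787: [210_000, 330_000, 160_000, 450_000, 270_000],
    1900: [260_000, 340_000, 300_000, 480_000, 250_000],
    2001: [900_000, 350_000, 1_200_000, 500_000, 240_000],
}

correlations = {
    year: pearson_r(autochthonous_ssts, sizes) for year, sizes in census.items()
}
for year in sorted(correlations):
    print(year, round(correlations[year], 3))
```

The toy census figures were chosen so that the oldest census is nearly proportional to the surname counts and the most recent one is not, mimicking the qualitative trend reported in the caption.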

Migration distances
The surnames that originated in a given province (Tab. S2) can be found at a variable distance from it. If most of them are found in the provinces neighboring the one from which they come, we can conclude that the migration pattern has been very local, and vice versa. To investigate this aspect (as the percentage of autochthonous surnames per province is not informative about the migration distances, see Fig. 4), we ordered emigration distances in different classes. In this way, each province corresponds to a vector constituted by 8 components that represent 8 distance classes: the resulting 47 vectors (one per continental province) have been the input of a Principal Component Analysis (Fig. 6). The PCA plot suggests four migration patterns to which all the Spanish provinces, more or less, belong:

a) Isolation-by-distance. This pattern is typical of provinces from which emigration took place at a very local scale (major distance class: 0-100 km), with a general negative exponential decay that fits the isolation-by-distance model (Wright 1943, Malécot 1948, 1955). This pattern applies to provinces located along the northern and eastern coast of Spain (A in Fig. 6).
b) Gravity deformed. This is similar to the previous one but the second distance class of emigration (201-300 km) is also frequent, suggesting that emigrants moved a little further than their immediate neighborhood; ‘gravity deformed’ applies to provinces located in inland Spain (B in Fig. 6).
c) Medium- to long-range emigration. This pattern is characterized by emigration movements that contrast with the isolation-by-distance model, as large percentages of emigrants moved quite far (201-600 km). It concerns provinces generally located in the southern part of Spain but also and Zaragoza (C in Fig. 6).
d) Very long-range emigration. This pattern is characterized by emigration distances above 601 and up to 1000 km and applies to the periphery of the Peninsula (D in Fig. 6).

Fig. 6 - Principal Component Analysis classifying the 47 continental Spanish provinces according to their emigration pattern. Each province is processed as a vector constituted by 8 components representing 8 distance classes (0; 0-100 km; 101-200 km; etc.). The provinces plotted in each quadrant are geographically mapped and a summary chart is reported.

Migration versus immigration
Emigration (frequency of autochthonous surnames per province) and immigration (number of surnames of allochthonous provenance) were estimated by the dF and dB parameters (see Tab. S2). dF and dB values correspond to different aspects of the contemporary Spanish population and are summarized in Fig. 7, where the provinces correspond, to a variable degree, to the four cases mentioned in the introduction of this article: 1) Isolated provinces (low emigration, low immigration) are located in Galicia, along the northern coast and in the regions previously part of the kingdom of Aragón (see also Fig. 2); 2) Corridor provinces (high emigration, high immigration) stretch from the capital city, Madrid, but are also located in more southern Spain; 3) Unattractive provinces (high emigration, low immigration) are scattered and concern the deep south, Cantabria, , and Zaragoza; 4) Attractive provinces (low emigration, high immigration) stretch, in their majority, from Galicia to the south, that is along the western side of the country.
We invite the reader to carefully examine the quadrants of Fig. 7 to appreciate the variable intensity of both parameters.
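The distance-class vectors described under "Migration distances" can be illustrated with a short sketch of how one province's 8-component vector might be assembled before it is passed to a PCA. This is only an illustration under stated assumptions: the class boundaries below (0 km, six 100 km bins, and a last class above 600 km) are inferred from the ranges quoted in the text, the expression of each class as a percentage of bearers is our choice, and the toy counts are hypothetical.

```python
# Upper bounds (km) of the classes that follow the "0 km" class.
# ASSUMPTION: boundaries inferred from the ranges quoted in the text
# (0; 0-100; 101-200; ...; 501-600; above 600 km), giving 8 classes in total.
UPPER_BOUNDS_KM = [100, 200, 300, 400, 500, 600]
N_CLASSES = len(UPPER_BOUNDS_KM) + 2  # "0 km" class + six bins + "above 600 km"


def distance_class(distance_km):
    """Map a distance to a class index (0 = still in the province of origin)."""
    if distance_km == 0:
        return 0
    for index, upper in enumerate(UPPER_BOUNDS_KM, start=1):
        if distance_km <= upper:
            return index
    return N_CLASSES - 1  # open-ended last class


def emigration_vector(occurrences_by_distance):
    """Turn [(distance_km, bearers), ...] into percentages per distance class."""
    counts = [0] * N_CLASSES
    for distance_km, bearers in occurrences_by_distance:
        counts[distance_class(distance_km)] += bearers
    total = sum(counts)
    return [100.0 * c / total for c in counts]


# Hypothetical province: bearers of its autochthonous surnames, grouped by the
# distance between the province of origin and the province of residence.
toy_province = [(0, 500), (80, 300), (150, 100), (250, 50), (650, 50)]
vector = emigration_vector(toy_province)
print(vector)
```

Stacking one such vector per province gives the 47 x 8 matrix on which a PCA can be run with any standard statistical package; a steep monotonic decline across the classes corresponds to the isolation-by-distance pattern (a).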

Fig. 7 - Bidimensional plot of emigration (x-axis: see dF values in Tab. S2) and immigration (y-axis: see dB values in Tab. S2) by province. When examining the plot, it is possible to identify four different cases, directly labelled on the plot. The location of provinces is shown each time.
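The four cases of Fig. 7 correspond to the quadrants of a two-dimensional plot. A minimal sketch of such a quadrant assignment is given below. It assumes that each province has one emigration score and one immigration score (standing in for the dF and dB values of Tab. S2, whose exact definition is not repeated here) and that the quadrants are separated at the median of each score. The median split, the province labels and the scores are all hypothetical.

```python
from statistics import median

# (high emigration?, high immigration?) -> label used in the text
LABELS = {
    (False, False): "Isolated",
    (True, True): "Corridor",
    (True, False): "Unattractive",
    (False, True): "Attractive",
}


def classify_provinces(scores):
    """scores: {province: (emigration_score, immigration_score)} -> {province: label}."""
    # ASSUMPTION: quadrants are split at the median of each score.
    emigration_cut = median(e for e, _ in scores.values())
    immigration_cut = median(i for _, i in scores.values())
    return {
        province: LABELS[(e > emigration_cut, i > immigration_cut)]
        for province, (e, i) in scores.items()
    }


# Hypothetical scores for four provinces (not values from Tab. S2).
toy_scores = {
    "P1": (0.1, 0.2),
    "P2": (0.9, 0.8),
    "P3": (0.8, 0.1),
    "P4": (0.2, 0.9),
}
quadrants = classify_provinces(toy_scores)
print(quadrants)
```

A hard threshold hides the "variable degree" to which provinces belong to each case, so a plot like Fig. 7 remains more informative than the labels alone.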

Discussion

In a previous study concerning the variability of Spanish family names (Rodriguez-Diaz et al., 2015) we noted that the historically impelled adoption of the same set of Castilian surnames by many unrelated families (Kremer, 1992, 1996, 2001, 2003) led to considerable bias in the estimation of isonymy/consanguinity in the regions corresponding to the former kingdom of Castilla y León, but not in those corresponding to the former kingdom of Aragón, where surname ascription was less influenced by this kind of phenomenon. Before addressing the matter of this study, migrations and language diversity, we review some recent literature addressing Spanish surnames and the Y-chromosome variability of the corresponding population. Both markers are patrilineally transmitted and possibly linked.


Polyphyletic vs. monophyletic surnames and their Y-chromosome diversity
We have shown that only 25,714 single surnames, of the initial 31,919 ones we processed, have an identified geo-historical origin, meaning that they are monophyletic. The remaining 6,136 ones are either ambiguous or clearly polyphyletic; they represent about 75% of the full number of occurrences (see Tab. S2). As a consequence, we can say that three quarters of the Spanish population bears a surname that is probably polyphyletic, a percentage that is higher than in other countries (~50% and ~25% in the according to Boattini et al. 2012 and Manni et al. 2005 respectively). The large polyphyleticism of Spanish surnames implies that the bearers of an identical surname do not often carry the same Y-chromosome by descent. Rare surnames are the best candidates to test the association between Y-chromosome haplogroups and family names; in fact they are less likely to have been adopted several times by unrelated families, otherwise their frequency would be higher (King et al., 2006; King & Jobling, 2009b; Winney et al., 2012).
A recent study aimed at comparing the extent of the relationship between 353 single surname types (SSTs) and the Y-chromosome haplotypes of 416 males from (southern part of Spain) reports few concordances whatever the frequency of analyzed surnames (Calderon et al., 2015). This finding fits our results, because the authors state that they processed a representative sample of the Spanish population, meaning that 75% of the time the surnames were polyphyletic. With similar aims, but using 37 Spanish surnames selected to correspond to four classes of frequency (very frequent, frequent, rare, very rare), Martinez-Cadenas et al. (2016) typed the Y-chromosome haplogroups of 2121 individuals from Castilla, Catalonia and the Basque region. In agreement with King & Jobling (2009b), they found that the degree of co-ancestry is higher when surnames are less frequent. Only the frequency matters, not the typology of surnames (place name, byname, , etc.) or the geographic region examined. In a similar way, Solé-Morata et al. (2015) examined the Y-chromosome diversity of 50 selected Catalan surnames, finding results comparable to those of Martinez-Cadenas et al. (2016). Interestingly, by dating the most recent common ancestor of the Y-chromosome lineages associated with rare surnames, both studies estimate the age of Spanish family names at about 500 years, in agreement with historical knowledge.
Although Calderon et al. (2015), to preserve the anonymity of the DNA donors, did not send us the list of the surnames they processed, we could check the compatibility of our SOM mono/polyphyletic classification with the results of the other two studies we mentioned, that is, see whether the degree of coancestry of surnames and Y-chromosomes is higher for the monophyletic surnames appearing in the set of 37+50 of Solé-Morata and Martinez-Cadenas. The molecular difference between the two groups (mono vs. polyphyletic) is highly significant and further validates the SOM. Polyphyletic surnames have a much higher haplotype diversity and a lower match with Y-chromosome haplogroups than monophyletic surnames do (Tab. 2).

Autochthonous surnames, migrations and linguistic diversity
By examining the demographic growth, province by province, of the Spanish population over the last two centuries (Tab. S3), it is apparent that the growth trend remained rather stable until AD 1910. Later, the population size changed considerably (particularly in the second half of the century), with some provinces losing population and others being the destination of a large number of rural immigrants (Fuster & Colantonio, 2002), attracted by economically favorable conditions (also see Fig. 5). This phenomenon suggests that the long-distance migratory movements (Fig. 6) have been a recent event.
The highlighted significant proportionality between the number of surnames autochthonous of each province and the census of AD 1787 (Tab. S3) is in agreement with the conclusions of Adams et al. (2008) about the genetic stability

of Spain over the centuries. Also, the fact that internal Spanish migrations were predominantly directed to neighboring provinces, with frequent isolation-by-distance-like (IBD) patterns (Fig. 6), goes in the same direction. Continental and southern provinces deviate from a typical IBD pattern, while coastal provinces located along the Mediterranean and Cantabrian arcs do not (Fig. 6). Interestingly, the latter are those where languages and dialects other than Castilian are spoken (Tab. 1), those where the percentage of autochthonous surnames is higher (Fig. 4), and also those that have remained more isolated in terms of immigration (Fig. 7). We find these correspondences remarkable. See Fig. 8 for a synthesis.
A possible explanation for the overlap (see the computational linguistics analysis of Rodriguez-Diaz et al., 2015) is that people prefer to move to areas in which the local language is more similar to their own (Falck et al., 2012). While economic and political aspects have also contributed to orient the migration pattern, it seems that human displacement took place at a more local scale in the areas where major linguistic difference occurs in Spain: the link between language (or dialect) diversity and the intensity of immigration is apparent.
Along the same line, we reported the considerable homogeneity of Castilian dialects (Rodriguez-Diaz et al., 2015). While Castilian language varieties are known to be of secondary differentiation, having diverged during and after their southwards spread linked to the Reconquista (that is, much later than their first differentiation from Latin), their low diversity is also explainable as the consequence of the linguistic simplification driven by the kind of language contact that immigration brings. In fact, many Castilian-speaking provinces have been the target of large immigration (‘attractive’ provinces in Fig. 7). Both findings constitute a promising arena for future work.

Directions for future investigation
To better focus on the cumulative effect of the internal Spanish migrations that occurred over the last 5 centuries, a possible experimental set-up would be to compare pairs of locations that i) initially had a comparable population size and ii) where very close varieties were spoken iii) before linguistically diverging in relation to a different migration history (Fig. 1). The linguistic differences between pairs of localities selected in this way are expected to correspond to different degrees of migration-induced linguistic contact.

Tab. 2 - Statistics about the molecular difference between Y-chromosome haplogroups according to the mono or polyphyletic classification of surnames obtained in this study. ‘Estimator’ and ‘Reference’ correspond to the measure reported in the original publications cited. S is the number of surnames analyzed. ‘SSTs SOM M/P-A’ is the number of monophyletic and polyphyletic/ambiguous surnames according to the SOM analysis. ‘M Av. (SD)’ and ‘P-A Av. (SD)’ are the averages and Std. Dev. for the reported estimators concerning monophyletic and polyphyletic/ambiguous surnames. ‘WMW’ is the Wilcoxon-Mann-Whitney probability (according to Marx et al. 2016) under the hypothesis that the haplotype diversity of monophyletic and polyphyletic/ambiguous surnames is different.

ESTIMATOR            REFERENCE                      S    SSTs SOM (M / P-A)   M AV. (SD)      P-A AV. (SD)    W-M-W TEST P (M ≠ P-A)
Haplotype diversity  Solé-Morata et al. 2015        50   48 (28; 20)          18.04 (14.64)   31.15 (17.64)   5.00E-3
N descent clusters   Solé-Morata et al. 2015        50   48 (28; 20)          0.92 (0.09)     0.97 (0.02)     1.37E-3
Match P score        Martinez-Cadenas et al. 2016   37   34 (14; 20)          19.81 (20.12)   5.55 (5.93)     2.62E-4
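The comparison summarized in Tab. 2 contrasts one estimator between two groups of surnames (monophyletic versus polyphyletic/ambiguous) with a Wilcoxon-Mann-Whitney test. The sketch below shows the general shape of such a test. It is not the exact dynamic-programming solution of Marx et al. (2016) cited in the table caption: it uses the textbook normal approximation with tie correction and no continuity correction, and the per-surname values are hypothetical.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Return (U statistic of sample_a, two-sided p-value by normal approximation)."""
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    n_a, n_b = len(sample_a), len(sample_b)
    n = n_a + n_b
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and all receive their average rank.
        midrank = (i + 1 + j) / 2.0
        ties = j - i
        tie_term += ties ** 3 - ties
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    mean_u = n_a * n_b / 2.0
    var_u = n_a * n_b / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_a - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return u_a, p_two_sided


# Hypothetical per-surname estimator values (not the data behind Tab. 2).
monophyletic = [0.80, 0.85, 0.90, 0.92]
polyphyletic = [0.95, 0.97, 0.98, 0.99, 0.96]

u_stat, p_value = mann_whitney_u(monophyletic, polyphyletic)
print(u_stat, round(p_value, 4))
```

With samples as small as those in Tab. 2 an exact test is preferable to the normal approximation, which is presumably why the exact solution was used in the article.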


Fig. 8 - A: Provinces that have attracted a low number of immigrants. B: Spanish provinces from which migration has been local and directed to neighbouring areas. C: Major linguistic areas according to the computational linguistics analysis of Rodriguez-Diaz et al. (2015).

The sources of linguistic data to be used can be either the linguistic atlas of the Iberian Peninsula (ALPI, 1962) or the linguistic atlas of Catalonia (Griera, 1923-1964).
In this article we have shown how to measure demographic contact by surname analyses, and we suggest equating it to linguistic contact. In fact, we can say how many immigrants, i.e. speakers of other linguistic varieties, came into contact with the variety spoken in a given Spanish location. The logical working hypothesis is that the immigration of people speaking very different varieties has a greater impact on the receiving speech communities than does the arrival of individuals speaking similar varieties, where we would expect that the receiving community’s speech should remain more stable. This stability can be measured as a higher level of areal heterogeneity. The computational linguistic treatment of the ALPI (1962) by Professor H. Goebl (2010, 2013), that is the identification of 375 linguistic variables, provides an appropriate corpus that can be used to measure heterogeneity at fine scales, together with the spread of linguistic innovations. We remind the reader that the surname database we processed relies on municipality data (7,967 municipalities in continental Spain) and would align with the very exhaustive sampling grid of the linguistic atlas of the Iberian Peninsula (529 test sites). This future work might be coupled with experiments of mutual intelligibility concerning presently spoken varieties to assess the likelihood of linguistic accommodation (people modify their speech in face-to-face interaction only if communication is possible).
To conclude, the geographic pattern of linguistic diversity seems to obey the same dynamics that explain human diffusion and emigration. In linguistics, Trudgill’s (1974) Gravity Model explains the spread of linguistic innovations as a radiation from a centre, having an effect on larger centres at first and then spreading to the smaller ones, in a cascade of effects depending on the frequency of linguistic contact and on the population size. Trudgill (1974) suggested that this spread declines quadratically, but Nerbonne & Heeringa (2007) and Nerbonne (2010) found that a logarithmic model (similar to isolation-by-distance) fits better. This analogy suggests that the exponential decay of human interaction, as a function of geographic distance, applies in a similar way to the dissemination of linguistic innovations and offspring (surnames).
If in the past linguistics has driven a considerable number of hypotheses about the anthropological diversity of human populations, population genetics and demography can now provide evidence about the sources, destinations and magnitude of population movements, providing a wealth of data on which to test hypotheses about the social and demographic mechanisms of linguistic differentiation.

Acknowledgments

We would like to thank Professor Pierre Darlu (CNRS; National Museum of Natural History, Paris), Professor Hans Goebl (University of Salzburg) and Professor John Nerbonne (University of Groningen) for continued support and proficient discussion. We also thank Professor Francesc (University Pompeu Fabra, Barcelona) for insightful comments during a preliminary presentation of the results. Our extensive revisions have been slow and we are thankful to the Editor of the Journal for his remarkable patience. The collaboration between the authors was made possible by a grant of the French Embassy in Madrid (Programme de missions et d’invitations pour la cooperation universitaire et scientifique) to F. Manni in 2015.

References

Adams S.M., Bosch E., Balaresque P.L., Ballereau S.J., Lee A.C., Arroyo E., Lopez-Parra A.M., Aler M., Grifo M.S., Brion M. et al. 2008. The genetic legacy of religious diversity and intolerance: paternal lineages of Christians, , and Muslims in the Iberian Peninsula. Am. J. Hum. Genet., 83: 725-736.
ALPI 1962. Atlas Lingüístico de la Península Ibérica. C.S.I.C., tomo I, Fonética, Madrid.
Bloothooft G. & Darlu P. 2013. Evaluation of the Bayesian method to derive migration patterns from changes in surname distributions over time. Hum. Biol., 85: 553-568.
Boattini A., Lisa A., Fiorani O., Zei G., Pettener D. & Manni F. 2012. General Method to Unravel Ancient Population Structures through Surnames, Final Validation on Italian Data. Hum. Biol., 84: 235-270.
Boattini A., Pedrosi M.E., Luiselli D. & Pettener D. 2010. Dissecting a human isolate: Novel sampling criteria for analysis of the genetic structure of the Val di Scalve (Italian Pre-Alps). Ann. Hum. Biol., 37: 604-609.
Bodmer W.F. & Cavalli-Sforza L.L. 1968. A migration matrix model for the study of random genetic drift. Genetics, 59: 565-592.


Calderón R., Hernández C.L., Cuesta P. & Dugoujon J-M. 2015. Surnames and Y-Chromosomal Markers Reveal Low Relationships in Southern Spain. PLoS One, https://doi.org/10.1371/journal.pone.0123098.
Castles S. & Miller M.J. 2009. The Age of Migration: International Population Movements in the Modern World (4th edition). Palgrave MacMillan, Basingstoke.
Darlu P., Bloothooft G., Boattini A., Brouwer L., Brouwer M., Brunet G., Chareille P., Cheshire J., Coates R., Dräger K. et al. 2012. The family name as socio-cultural feature and genetic metaphor: From concepts to methods. Hum. Biol., 84: 169-214.
Dodsworth R. 2017. Migration and Dialect Contact. Ann. Rev. Linguist., 3: 331-346.
Falck O., Heblich S., Lameli A. & Südekum J. 2012. Dialects, cultural identity, and economic exchange. J. Urban Econ., 72: 225-239.
Fuster V. & Colantonio S.E. 2002. Consanguinity in Spain: socioeconomic, demographic, and geographic influences. Hum. Biol., 74: 301-315.
Goebl H. 2010. La dialectometrización del ALPI: Rápida presentación de los resultados. 26th CILFR. Valencia.
Goebl H. 2013. La dialectometrización del ALPI: rápida presentación de los resultados. In E. Casanova Herrero & C. Calvo Rigual (eds): Actas del XXVI Congreso Internacional de Lingüística y de Filología Románicas (volumen VI), pp. 143-154. Walter De Gruyter, Berlin (Germany), Boston (USA).
Griera A. 1923-1964. Atlas Lingüístic de Catalunya. Institut d’Estudis , Ediciones Polígrafa, Barcelona.
Kaski S. 1997. Data exploration using self-organizing-maps. Acta Polytech. Scand., 82: 1-57.
Kerswill P. 2006. Migration and language. In K. Mattheier, U. Ammon & P. Trudgill (eds): /Soziolinguistik. An international handbook of the of language and society, 2nd edition, volume 3. De Gruyter, Berlin.
King T.E., Ballereau S.J., Schurer K.E. & Jobling M.A. 2006. Genetic signatures of coancestry within surnames. Curr. Biol., 16: 384-388.
King T.E. & Jobling M.A. 2009a. Founders, drift and infidelity: the relationship between Y chromosome diversity and patrilineal surnames. Mol. Biol. Evol., 26: 1093-1102.
King T.E. & Jobling M.A. 2009b. What’s in a name? Y chromosomes, surnames and the genetic genealogy revolution. Trends Genet., 25: 351-360.
Kohonen T. 1982. Self-organized formation of topologically correct feature maps. Biol. Cyber., 43: 59-69.
Kohonen T. 1984. Self-organization and associative memory. Springer, Berlin.
Kremer D. 1992. Spanische Anthroponomastik. Lexikon der Romanistischen Linguistik, 6: 457-473.
Kremer D. 1996. Morphologie und Wortbildung bei Familiennamen II: Romanisch, Namenforschung. Ein internationales Handbuch zur allgemeinen und europäischen Onomastik, 2, pp. 1263-1275. Teilband, Berlin/New York.
Kremer D. 2001. Colonisation onymique. RIOn., 7: 337-373.
Kremer D. 2003. Spanish and Portuguese family names. In P. Hanks (ed): Dictionary of American family names. Oxford University Press, New York (USA).
Malécot G. 1948. Les mathématiques de l’hérédité. Masson, Paris.
Malécot G. 1955. Decrease of relationship with distance. Cold Spring Harbor Symp. Quant. Biol., 20: 52-53.
Manni F., Toupance B., Sabbagh A. & Heyer E. 2005. New method for surname studies of ancient patrilineal population structures, and possible application to improvement of Y-chromosome sampling. Am. J. Phys. Anthropol., 126: 214-228.
Marx A., Backes C., Meese E., Lenhof H.P. & Keller A. 2016. EDISON-WMW: Exact Dynamic Programing Solution of the Wilcoxon-Mann-Whitney Test. GPB, 14: 55-61.
Martinez-Cadenas C., Blanco- A., Hernando B., Busby G.B., Brion M., Carracedo A., Salas A. & Capelli C. 2016. The relationship between surname frequency and Y chromosome variation in Spain. Eur. J. Hum. Genet., 24: 120-8.
Mateos P. 2011. Ethnicity, geography and populations: Tracing diversity and migration through people’s names. Springer, Heidelberg (Germany).

Mateos P. & Tucker D.K. 2008. Forenames and Surnames in Spain in 2004. Names, 56: 165-184.
Nerbonne J. 2010. Measuring the diffusion of linguistic change. Philos. Trans. R. Soc. Lond. B, 365: 3821-3828.
Nerbonne J. & Heeringa W. 2007. Geographic Distributions of Linguistic Variation Reflect Dynamics of Differentiation. In S. Featherston & W. Sternefeld (eds): Roots: Linguistics in Search of its Evidential Base, pp. 267-297. Mouton De Gruyter, Berlin.
Rodríguez-Díaz R. & Blanco-Villegas M.J. 2010. Genetic structure of a rural region in Spain: distribution of surnames and gene flow. Hum. Biol., 82: 301-14.
Rodríguez-Díaz R., Manni F. & Blanco-Villegas M.J. 2015. Footprints of Middle Ages Kingdoms Are Still Visible in the Contemporary Surname Structure of Spain. PLoS One, 10. doi:10.1371/journal.pone.0121472
Simons G. (ed). 2016. Ethnologue. Languages of the world. SIL International, Dallas (TX). Internet publication accessible at www.ethnologue.com
Solé-Morata N., Bertranpetit J., Comas D. & Calafell F. 2015. Y-chromosome diversity in Catalan surname samples: insights into surname origin and frequency. Eur. J. Hum. Genet., 23: 1549-57.
Trudgill P. 1974. Linguistic Change and Diffusion: Description and explanation in sociolinguistic dialect geography. Lang. Soc., 2: 215-246.
Wehrens R. & Buydens L.M.C. 2007. Self- and Super-organising Maps in R: the Kohonen package. J. Stat. Softw., 21.
Winney B., Boumertit A., Day T., Davison D., Echeta C., Evseeva I., Hutnik K., Leslie S., Nicodemus K., Royrvik E.C. et al. 2012. People of the British Isles: preliminary analysis of genotypes and surnames in a UK-control population. Eur. J. Hum. Genet., 20: 203-10.
Wright S. 1943. Isolation by distance. Genetics, 28: 114-138.

Editor, Giovanni Destro Bisol

This work is distributed under the terms of a Creative Commons Attribution-NonCommercial 4.0 Unported License http://creativecommons.org/licenses/by-nc/4.0/
