Insights & Perspectives: Think again

Networks of lexical borrowing and lateral gene transfer in language and genome evolution

Johann-Mattis List1)*, Shijulal Nelson-Sathi2), Hans Geisler3) and William Martin2)

Like biological species, languages change over time. As noted by Darwin, there are many parallels between language evolution and biological evolution. Insights into these parallels have also undergone change in the past 150 years. Just like genes, words change over time, and language evolution can be likened to genome evolution accordingly, but what kind of evolution? There are fundamental differences between eukaryotic and prokaryotic evolution. In the former, natural variation entails the gradual accumulation of minor mutations in alleles. In the latter, lateral gene transfer is an integral mechanism of natural variation. The study of language evolution using biological methods has attracted much interest of late, most approaches focusing on language tree construction. These approaches may underestimate the important role that borrowing plays in language evolution. Network approaches that were originally designed to study lateral gene transfer may provide more realistic insights into the complexities of language evolution.

Keywords: borrowing; language evolution; lateral transfer; network approaches; prokaryotic evolution

Additional supporting information may be found in the online version of this article at the publisher’s web-site.

DOI 10.1002/bies.201300096

1) Research Center Deutscher Sprachatlas, Philipps-University Marburg, Marburg, Germany
2) Institute of Molecular Evolution, Heinrich-Heine University Düsseldorf, Düsseldorf, Germany
3) Institute of Romance Languages and Literature, Heinrich-Heine University Düsseldorf, Düsseldorf, Germany

*Corresponding author: Johann-Mattis List, E-mail: [email protected]

Introduction

For a long time, both biologists and linguists have been using family trees to model how species and languages evolve. But in contrast to biology – where the family tree is generally accepted to be the most realistic way to model how eukaryotic species (species with nucleated cells, such as animals and plants) evolve – linguists have always treated language trees with a certain suspicion. They have emphasized that – given the important role that horizontal transmission plays in language history – such trees can only capture vertical aspects of language evolution, while horizontal aspects (which linguists traditionally model as “waves” that spread out in circles around a center in geographic space) are ignored.

In the last decade, language trees have experienced a strong revival, especially in the public notion of linguistics as reflected in popular scientific literature and in articles addressed to a not exclusively linguistic readership [1]. Earlier linguistic work on phylogenetic reconstruction was, with a few exceptions [2–8], qualitative in its nature. But starting about 10 years ago, computer methods originally designed to infer trees from molecular sequence data made their way into the analysis of large linguistic datasets, leading to a resurgence of language trees [9–15]. If the reconstruction of trees had only played a minor role in historical linguistics up to that point, it has now become a specific field of interest, and some scholars even go so far as proclaiming tree construction as a priority for historical linguistic endeavor [16].

In traditional historical linguistics, these new approaches are met with a certain amount of reservation, since their results are often not in concordance with those achieved by traditional methods [17–20]. One important reason for such discrepancies is the relatively large number of individual

Bioessays 36: 141–150, © 2013 The Authors. BioEssays published by WILEY Periodicals, Inc. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes. www.bioessays-journal.com

and methodological errors in linguistic datasets [19]; this is reflected by numerous cases of wrong translations, wrong homology assessments (incorrect identification of cognate words), and undetected cases of lateral transfer (borrowing) [17, 18].

In this paper, we argue that the problem of the new quantitative methods is that they focus too much on the vertical aspects of language evolution, thereby forcing the data into tree-like structures. We show that network approaches that were originally designed to study reticulation and lateral gene transfer in the evolution of prokaryotic species (microbes without cell nuclei, such as bacteria and archaea) can cope with these problems, hence providing a more realistic way to model the complexities of language history by combining both its tree-like (vertical) and its wave-like (horizontal) aspects.

Figure 1. Three early language trees in the history of linguistics. A: August Schleicher’s first tree of Germanic and Balto-Slavic languages. B: Schleicher’s first tree of the Indo-European language family. C: An early tree of the Slavic languages by František Ladislav Čelakovský.

Historical linguists were always skeptical about language trees

In 1853 the German linguist August Schleicher (1821–1868) published two articles [21, 22] (Fig. 1A and B) in which he showed how branching trees can be used to illustrate the historical development of languages (Table 1A). It is possible [23] that Schleicher himself adopted the idea from a colleague, the Czech linguist František Ladislav Čelakovský (1799–1852), whose posthumously published lectures contain an early tree diagram of the Slavic languages [24] (Fig. 1C). Schleicher was very interested in biology, especially botany, and in his work we find many passages where he compares languages with organisms, assuming that they went through stages of birth, youth, middle age, old age, and – finally – death [25]. He emphasized that language classification was quite similar to biological classification of animals or plants [25]. He also mentioned the problem of distinguishing vertically from horizontally transmitted traits, drawing a parallel between “foreign influence” due to language contact in language history, and “crossbreeding” in evolutionary biology [26] (Table 1B).

In biology, the concept of evolutionary trees was not introduced until


Table 1. Early quotes on language history from August Schleicher and Hugo Schuchardt

(A) August Schleicher [26]
Translation: We know both the Old Latin and the Romance languages which demonstrably descended from the former via differentiation and – you would call it crossbreeding – foreign influence
Original: Wir kennen sowohl das Altlateinische, als auch die durch Differenzierung und durch fremden Einfluss – Ihr würdet sagen durch Kreuzung – nachweislich aus ihm hervorgegangenen romanischen Sprachen

(B) August Schleicher [22]
Translation: These assumptions which logically follow from the previous research can be best illustrated with the help of a branching tree
Original: Diese Annahmen, logisch folgend aus den Ergebnissen der bisherigen Forschung, lassen sich am besten unter dem Bilde eines sich verästelnden Baumes anschaulich machen

(C) Hugo Schuchardt [32]
Translation: We connect the branches and twigs of the family tree with countless horizontal lines and it ceases to be a tree
Original: Wir verbinden die Äste und Zweige des Stammbaums durch zahllose horizontale Linien, und er hört auf ein Stammbaum zu sein

Charles Darwin’s (1809–1882) mentioning of the “Great Tree of Life” in 1859 [27], but it soon became deeply ingrained in thinking on the topic. Notably, it was later reinforced by many influential drawings from Ernst Haeckel (1834–1919, see [28] for details), culminating in the inference of trees from molecular sequences [29], and the reconstruction of phylogenetic trees for all organisms using ribosomal and informational gene phylogenies [30].

In linguistics the popularity of language trees began to fade soon after they were first proposed [31]. In 1872 Johannes Schmidt (1843–1901) pointed out that linguistic data contradicted the idea of simple, tree-like differentiation [32]. Instead of the family tree theory he proposed the “wave theory” (Wellentheorie in German), which states that certain changes spread like waves in concentric circles over neighboring speech communities. And before Schmidt, Hugo Schuchardt (1842–1927) had criticized the idea of split and independent differentiation [33], emphasizing that languages diverge gradually while at the same time mutually influencing each other (Table 1C). Even today, historical linguists continue to hold strong reservations about the tree model. In textbooks on historical linguistics, both the tree and the wave theory are usually introduced as two complementary models, each of which only depicts one aspect of language history [34, 35]. Thus, if linguists are asked whether language evolves in a tree-like manner, most linguists would probably answer as Hoenigswald did in 1990: “Yes, of course it does, if we so wish; but we had better be very careful” [36].

Borrowing is a constitutive part of language history

If we take the most frequent 1,000 Latin words and look at how they survived in Latin’s daughter languages, we will find that 67% of all words were directly inherited in at least one language, yet only 14% were inherited in all Romance languages [37]. However, this drastic loss of Latin words during Romance language history is only part of the story: Since Latin never ceased to serve as a cultural adstrate language (a language that co-exists in some form in parallel with another language with which it is in contact), with a particularly great impact on written vernaculars, only 33% of all 1,000 words were completely lost, and about 50% survive as borrowings from the ancestor language in the daughter languages [37]. Moreover, lexical transfer during the history of the Romance languages was not restricted to the influence of Latin alone, and contact among the Romance languages and other neighboring Indo-European languages was very frequent and vivid. According to a recent survey of 2,137 common words in Romanian [38], for example, 894 (41.8%) were classified as loanwords from other languages. The largest share of these borrowed words was transferred from Slavic donor languages (about 14%). Only a small number of words were borrowed from Latin (about 3%).

On the “borrowability scale” [39], which ranks the ease with which different elements of language are assimilated by recipient languages, borrowing of words ranks highest. Lexical borrowing can affect only small parts of the vocabulary of a given language (such as specific terms for religious concepts, cultural items, or artifacts), or result in a situation where large parts of the language’s original lexicon are replaced. This can even result in complete relexification, as in Creole languages. In the World Loanword Database [40] the frequency of direct borrowing events documented for 41 languages varies greatly, ranging from 1% for Mandarin Chinese to 62% for Selice Romani, with an average of 25% and a standard deviation of 13% [41].

Borrowing cannot be completely ignored in quantitative approaches

With few exceptions [42–44], the majority of the new biological methods for tree construction make use of lexical language data. This is due to the fact that it is much easier to compile lexical datasets for large numbers of languages: in many cases – especially for less well-studied language families – wordlists are the only things available for study. However, analysis of lexical items also reflects the basic practice of the traditional method for linguistic reconstruction, which starts with the comparison of words and morphemes [35, 45, 46]. Similarly to earlier quantitative approaches in historical linguistics [8], the biological methods require that borrowings be filtered out of the data before the analysis is applied. Since reliable automatic methods are lacking, cognate and borrowing


assignments are usually carried out manually. In order to make this painstaking process easier, scholars revived an old idea proposed in the 1950s [4, 5, 47], and restrict the lexical comparison to words that belong to the realm of the so-called “basic vocabulary” [12]. Basic vocabulary is merely a technical term that refers to a list of about 100–200 basic concepts (such as “hand”, “foot”, “stone”) that are translated into the languages under investigation. These lists are usually called Swadesh lists, in acknowledgement of Morris Swadesh (1909–1967), who popularized their use in linguistics. The basic assumption regarding Swadesh lists is that (a) every language has words that express the concepts, (b) the words evolve slowly (enabling us to recognize similarities across languages), and (c) the words are rather resistant to borrowing [16]. Unfortunately, the last assumption, in particular, is highly problematic. Although the use of Swadesh lists may decrease the number of borrowings to a certain degree, it cannot exclude all of them. In a recent survey of 1,504 common words in English, for example, 616 (41%) were judged to be loanwords [48], yet in the traditional English Swadesh list there are still 32 borrowings out of 200 (16.5%), mostly from Old Norse and Old French [18]; and in a recent revision of the Albanian Swadesh list, 34 out of 107 words (31.8%) were identified as possible borrowings [49].

Manual detection of borrowings can range between trivial and impossible, depending on the case in point. Some borrowing processes are very transparent. Neither a linguist nor a German speaker has problems in identifying the word Job “job” as a recent borrowing from English, since the initial sound of the word is not yet “integrated” into the German sound system. But the situation is not always that simple. Thus, while no German native speaker would hesitate to assume that Fett “grease” is a “normal” German word, the word has in fact been borrowed from Low German dialects [50], as can be proven from its irregular correspondence with English fat: If the words were truly cognate, we would expect the German word to end with an [s] (spelled as ß in German) instead of a [t], as in German heiß “hot,” which is truly cognate with English hot [50]. Identifying borrowings with help of these techniques requires expert knowledge of the languages under investigation, and the deeper one goes back in time, the harder it becomes even for the experts, since the available phonological information may be lost. Recent tests on simulated data have shown how crucial it is to screen the linguistic data carefully before applying quantitative analyses [51].

How difficult it is to prepare the data and to filter out all borrowings correctly is reflected by the fact that the most frequently used datasets, the Comparative Indo-European Database ([52], http://www.wordgumbo.com/ie/cmp/), and the Austronesian Basic Vocabulary Database ([53], http://language.psy.auckland.ac.nz/austronesian/), contain many undetected borrowings and various levels of erroneous cognate judgments [17–19, 49]. But “scrubbing” the data of false cognate assignments does not seem to be feasible for large datasets. Quantitative studies that are based on the Indo-European Lexical Cognacy Database (IELex, http://ielex.mpi.nl/), whose goal was to significantly enhance the notoriously flawed database composed by [52], still yield subgroupings that contradict traditional genetic classification (compare, for example, the strange grouping of Polish in [13] and [54]). One reason for these problems is that the database still contains many undetected borrowings and other errors. The other reason is that the exclusion of borrowings necessarily yields a loss of information that can have large impacts on the results [49]. It seems that the a priori exclusion of suspected borrowings from the data is not enough, especially in cases where the history of a language family is not yet well understood. Instead of making tree reconstruction the key objective of historical linguistics, we need quantitative methods that can deal with borrowings and – ideally – handle both vertical and lateral transmission.

Language history bears a close resemblance to prokaryote evolution

If historical linguists want to profit from biological expertise in large-scale analyses of big datasets, they need to make up their mind regarding the methods they need in linguistics, and the methods that biology can provide. That evolutionary biology has developed some sophisticated tools to reconstruct phylogenetic trees, and that these tools can be easily applied to linguistic datasets, has been demonstrated frequently during the last decade. Yet is this really all that biology has to offer?

In several fundamental aspects, the genomes of eukaryotic species – such as animals and plants – and prokaryotic species – such as bacteria and archaea – evolve in very different ways, and lateral gene transfer is generally at the root of those differences. Gene families are one example. Gene families are sets of homologous (cognate) genes that were formed by duplication of an ancestral gene, quite similar to the reflexes of the root of a word in the same or different language. In eukaryotes, gene families arise through duplication: a resident gene duplicates, perhaps several times, and the resulting gene family consists of members that are closely related at the outset and undergo divergence and functional specialization [55]. In prokaryotes, gene families arise via the acquisition of related sequences through lateral gene transfer, not through duplication [56]. As another example, in eukaryotes, meiosis ensures that only members of the same species exchange genes, and recombination is reciprocal. In prokaryotes, there are well-studied mechanisms that mediate gene transfer, both within and across species boundaries [57].

Furthermore, if we sequence 61 human genomes, we will find – to all intents and purposes – the same collection of about 30,000 genes in each individual, with allelic variants at many loci, and the 46 chromosomes will almost always be colinear: the genes appearing at similar positions. If we sequence 61 genomes of Escherichia coli, a bacterium usually found in the intestines of warm-blooded species, we will find about 4,500 genes in each individual genome, but only about 1,000 genes that are present in all genomes. Summing up the different genes we find in all individuals, there are about 18,000 different genes distributed among them, and this count will further increase if we add more individual genomes to this calculation,


hence yielding an ever growing pangenome of Escherichia coli [58]. These examples underscore fundamental differences in the nature of the processes of evolutionary divergence in prokaryotic and eukaryotic populations: Eukaryotic populations generate tree-like structures of divergence over time [59], while genome evolution in prokaryotes generates both tree-like and net-like components of relatedness over time [60].

Recalling the scores on shared inherited words and borrowings we reported for the Romance languages earlier, it seems obvious that language history shows a much closer resemblance to prokaryotic evolution than to eukaryotic evolution. Thus, if one says that language history and genome evolution have a lot in common, it seems much more appropriate to emphasize that language evolution may resemble prokaryotic evolution much more than it resembles eukaryotic evolution. We do not claim to make a binary distinction here: As the amount of contact-induced change differs from language to language, so do the underlying evolutionary processes, and it is rather a continuum between strictly tree-like and strictly network-like evolution that we are dealing with. Nevertheless, if we want to employ quantitative methods from biology to supplement our research in historical linguistics, it could be much more fruitful to get away from focusing exclusively on those methods that yield simple family trees, and instead look for methods that were designed to handle lateral transfer.

Network approaches offer new possibilities for quantitative analyses in language evolution

Despite the dissatisfaction of many historical linguists with both the tree and the wave model, there are – to our knowledge – only a few attempts to combine both approaches within a new framework [35, 61, 62]; furthermore, unfortunately most of these proposals remain a mere visualization of the scholars’ intuitions regarding the data, from which no further insights can be drawn. If one wants to include both the vertical and the horizontal aspects, it seems natural to turn to networks as a format to represent language history.

In evolutionary biology, different network approaches have been developed in order to study reticulation in biological datasets (see the overviews in [63] and [64]). Among the most popular of these methods are those that produce unrooted networks (splits graphs) such as split decomposition [65] or NeighborNet [66]. These methods enjoy some popularity in recent quantitative studies in historical linguistics, and have been applied to quite a few different datasets [67–71]. In contrast to the popular quantitative methods for tree construction, such as Neighbor-Joining [72], or Bayesian inference [73], they are unbiased with respect to “tree-likeness”, and provide a direct visualization of the degree of conflict in a given dataset [74]. They have proven to be a very useful tool for data exploration, and have even been used to measure reticulation directly from lexical distance matrices across the world’s language families [75]. The drawback of these methods is that they are distance-based, hence aggregating lexical information on the taxonomic level. The information on shared cognates in the underlying datasets is converted to distance scores, and the result is an unrooted network that only indicates whether there are conflicting signals in the data, but does not directly point to the cognate sets that are responsible for these conflicts.

A more realistic modeling of language history could be achieved by methods that automatically infer hidden borrowings in the data. While quite common in evolutionary biology [76, 77], these methods are still in their infancy in historical linguistics. Two early approaches [70, 78] are distance-based, and therefore do not allow the direct identification of the characters that conflict in the reference trees. The first character-based approach to this problem [79] uses maximum parsimony to determine the characters that conflict with an inferred family tree. Unfortunately, the method has only been tested on a very small dataset, and no further applications are known to us. An alternative proposal expands the notion of perfect phylogenetic trees [10] to the notion of perfect phylogenetic networks [80]. The method yields direct statements as to which characters have been inferred as being borrowed in a given dataset. Unfortunately, the algorithm is very time-consuming, and it is thus not feasible to apply it to larger datasets [81].

Ancestral genome sizes reveal the minimum amount of lateral transfer in microbial evolution

A more recent method for lateral gene transfer detection in prokaryotic genomes is the so-called minimal lateral network approach (MLN, [82]). This method applies the technique of gain-loss mapping [83–85] to presence-absence patterns of gene families in order to infer patterns that are suggestive of lateral transfer. Gain-loss mapping starts from a given reference tree that should reflect the vertical component of evolution as closely as possible. With help of the reference tree, specific gain-loss scenarios for all gene families in the dataset are inferred. A gain-loss scenario provides an explanation of how a given character could have evolved along the reference tree when character evolution is modeled as a simple process of gain and loss events. In order to confirm the assumption that a given character evolves in an exclusively vertical manner, the inferred gain-loss scenario should contain only one gain event. If more than one gain event is inferred, the character is judged to be suggestive of lateral transfer (see Fig. 2 for an example applied to linguistic data).

The crucial point of the MLN method is to select the best gain-loss scenarios out of the multitude of possible ones. The key argument in biology is the notion of ancestral genome size distributions [84]: If, for example, all gene families are assumed to originate only once along the reference tree, this may result in ancestral genomes that contain many more genes than are observed in the contemporary genomes. If, on the other hand, one assumes that all gene families are explained by lateral gene transfer only, then the vertical component of genome evolution disappears, and ancestral genome sizes become too tiny to support life. Between those extremes there are amounts of vertical and lateral inheritance that will


Figure 2. Illustration of the MLN method. A: Two cognate sets for “to count” in three Germanic and three Romance languages. The English word is a known borrowing from Old French. The original reflex of Proto-Germanic *tal- is still preserved in English “to tell,” but its original meaning has shifted under the influence of the borrowing from Old French, and it is thus not listed in this sample. B: The loss-only scenario assumes that the cognate set with reflexes of Latin computare originated in the root and was then lost independently in both German and Danish. C: The two-gain scenario infers two separate origins of the cognate set. The pattern is thus suggestive of lateral transfer, and one lateral transfer event is inferred. This is marked by the link drawn between the two nodes where the characters first originate. D: Combination of scenarios for both cognate sets based on the loss-only scenario in B. Note that this scenario forces us to assume that the ancestor of the Germanic languages had two words expressing the concept “to count.” While this is not improbable per se, cases of inferred overwhelming amounts of synonymy are suspicious in language history. E: Combination of scenarios for both cognate sets based on the two-gain scenario in C. This scenario is preferred by the MLN method, since the number of synonyms in the ancestral languages is in balance with the modern languages. Note that the inference does not tell us which language is the real donor (which is Old French). According to our model, it could be any of the three Romance languages. For this reason, the edge is drawn from the ancestor of all three Romance languages.
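The contrast between the loss-only and the two-gain scenario in Fig. 2 can be made concrete with a small weighted-parsimony sketch in Python. It is an illustration under stated assumptions and not the implementation used in this study: the reference tree, the three Romance language names, the function names `best` and `map_gain_loss`, and the weights are all hypothetical. For one presence-absence character, the code searches for the cheapest gain-loss scenario on the tree and reports its cost together with the number of gain events; more than one gain is what the text calls suggestive of lateral transfer.

```python
INF = float("inf")

# Hypothetical reference tree as nested tuples; leaves are language names.
TREE = ((("English", "German"), "Danish"), ("French", ("Italian", "Spanish")))

def best(node, state, present, gain, loss):
    """Cheapest (cost, gains) for the subtree below `node`,
    given that `node` itself has the character (state 1) or lacks it (state 0)."""
    if isinstance(node, str):  # leaf: the state must match the observation
        observed = 1 if node in present else 0
        return (0, 0) if state == observed else (INF, 0)
    total_cost, total_gains = 0, 0
    for child in node:
        options = []
        for child_state in (0, 1):
            cost, gains = best(child, child_state, present, gain, loss)
            if state == 0 and child_state == 1:    # gain on this branch
                cost, gains = cost + gain, gains + 1
            elif state == 1 and child_state == 0:  # loss on this branch
                cost += loss
            options.append((cost, gains))
        cost, gains = min(options)  # cheapest first, then fewest gains
        total_cost += cost
        total_gains += gains
    return (total_cost, total_gains)

def map_gain_loss(tree, present, gain=1, loss=1):
    """Return (cost, number_of_gains) of the cheapest scenario.
    A character that is present at the root still needs one origin (gain)."""
    absent_at_root = best(tree, 0, present, gain, loss)
    cost, gains = best(tree, 1, present, gain, loss)
    present_at_root = (cost + gain, gains + 1)
    return min(absent_at_root, present_at_root)

# Cognate set of the Romance words for "to count" plus the English borrowing.
romance_set = {"English", "French", "Italian", "Spanish"}
print(map_gain_loss(TREE, romance_set, gain=1, loss=1))  # (2, 2): two-gain scenario
print(map_gain_loss(TREE, romance_set, gain=3, loss=1))  # (5, 1): loss-only scenario
```

With equal weights the two-gain scenario of panel C is cheaper, whereas making gains three times as expensive as losses favors the loss-only scenario of panel B. Which weighting is appropriate is exactly the scenario-selection question that the ancestral size criterion is meant to answer.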


bring the distribution of inferred ancestral genome sizes into agreement with the attested distribution of contemporary genome sizes. Those distributions can be tested statistically, and the gain-loss scenarios with the amount of lateral gene transfer that best fits the data can be determined. Having selected the best scenarios, a rooted phylogenetic network can be reconstructed. Here, multiple origins of the same gene family on different branches of the reference tree are connected by lateral links; edges connecting the same two nodes for different gene families are joined to form weighted edges [82].

How minimal lateral networks can be applied to linguistic data

Technically, the application of the MLN approach to language data can be carried out in a rather straightforward way, by investigating presence-absence patterns of cognate sets instead of presence-absence patterns in gene families. Theoretically, however, the application of the approach requires some caveats: while genomes are physical entities whose size can be directly determined, the linguistic data consist of samples based on meaning lists. We can restate the genome size criterion for scenario selection in such a way that we prefer those scenarios in which the number of words used to express specific meanings does not differ much between ancestral and contemporary languages. However, we need to keep in mind that new words can also shift into the meaning slots from outside the sample. Although parallel semantic shift involving cognate words in different branches of a language family is surely much rarer than borrowing, this has to be considered when applying the method to linguistic data.

The MLN approach was first applied to the well-known Comparative Indo-European Database [52], and revealed a rather high degree of non-tree-like signal: 61% of all 2,346 cognate sets in the data were found to be suggestive of borrowing [86]. Since the study employed a very simple top-down algorithm for gain-loss mapping [84], the inferred amount of cognate sets contradicting the reference tree is surely too high. In order to test whether more refined techniques of gain-loss mapping can yield more realistic results, we applied a refined variant of the MLN approach to a subset of 40 Indo-European languages taken from the IELex (dump from May 2013 kindly provided by M. Dunn). The modified MLN approach is implemented as part of a freely available Python library for quantitative tasks in historical linguistics [87]. It employs weighted parsimony for the task of gain-loss mapping [83] and also allows for a certain proportion of parallel evolution. A Python script along with the data to run all analyses can be downloaded from: https://gist.github.com/LinguList/7475830.

The advantage of the IELex is that known borrowings are not only marked as such, but that they are also assigned to the cognate sets to which they would belong, if they were not borrowings. Thus, English mountain is clustered with the reflexes of Vulgar Latin *montanea (derived from Latin mōns) in the Romance languages, such as, among others, French montagne, Italian montagna, and Spanish montaña. This gives us the possibility to test the usefulness of the refined MLN approach. We corrected some obvious errors in the data, especially in some of the Slavic languages (the whole dataset is provided in Supplementary Material I). Excluding 1,864 words that could not be shown to be cognate to any other word in the data, this yielded a total of 1,190 cognate sets. As a reference tree, we chose the one provided by Ethnologue [88]. The choice of this tree is for practical reasons, since it was proposed independently of quantitative methods, and reflects an openly available “quasi-standard”. This does not mean that we are unaware of the many problems that this tree contains, especially in the classification of the subgroups.

Figure 3. Minimal Lateral Network of 40 Indo-European languages. The size of the nodes reflects the number of cognate sets in each language as inferred by the MLN approach. The links reflect the minimal amount of lateral transfer events that is needed to bring the distributions of synonyms in the contemporary languages (leaves of the tree) and the ancestral languages (internal nodes of the tree) as closely together as possible.

Figure 3 shows the rooted phylogenetic network that the refined MLN

Bioessays 36: 141–150, ß 2013 The Authors. BioEssays Published by WILEY Periodicals, Inc. 147 J.-M. List et al. Insights & Perspectives.....

approach reconstructed from the data. As can be seen, the method nicely recovers some well-known cases of contact relations among the languages in the sample. English, for example, shows two heavily weighted edges, one with the ancestor of the Scandinavian languages, and one with the ancestor of the Romance languages, nicely reflecting two of its major donors: Scandinavian words made their way into the English lexicon as a result of Danish and northern Scandinavian invasions starting in the 8th–9th century [89], and Old Norman (a northern French dialect) came to England as a result of the Norman conquest in 1066. Old Norman even developed into a distinct variety called Anglo-Norman, which was spoken in England by the higher social strata from the 12th to the 15th century. The ensuing intensive language contact resulted in a boom of “French” loans, which eventually became a formative element of the English lexicon [89]. Albanian also shows strong connections with the ancestor of the Romance languages, reflecting the large number of Latin loanwords in the language [49].

Of the 105 cognate sets in the data that contain known hidden borrowings, the method identifies 76 correctly (see the specific results in Supplementary Material I). In total, the method identifies 369 out of 1,190 cognate sets (31%) that do not correspond to the reference tree. If the number of known borrowings reflected the true amount of borrowings in the data, and the reference tree displayed the true vertical history of the languages, this would mean that the method largely overstates the amount of lateral transfer. However, given the uncertainty regarding the subgrouping of the Indo-European languages that is also reflected in the reference tree, and the uncertainty of the cognate judgments in the data, we are confident that the results provide a good starting point for further research that may reveal further hidden borrowings and erroneous cognate judgments.

This can be exemplified by an inspection of the specific results that the method yields for English: Of the 32 borrowings into English [18], eight are singletons and five have reflexes in almost all Germanic languages in the sample and can thus technically not be identified by the MLN approach. Of the remaining 19 words, 17 (89%) are correctly identified. A further 17 words are found to be incompatible with the reference tree, but three of these words are known borrowings in other languages. Of the remaining 14 words, four words (belly, narrow, dull, smoke) are obviously erroneously coded, since they are linked with words outside the Germanic branch, although their deeper etymology or the etymology of their presumed cognates is unclear; and four words (at, leaf, small, know) seem to be real cases of parallel semantic development (be it retention or innovation) with other languages (see Supplementary Material II). The remaining six words (back, few, many, snake, tree, with) are exclusively shared with the Scandinavian languages inside the Germanic branch. Whether this pattern results from innovations on the West Germanic mainland, by which the reflexes of the words in Frisian, German, and Dutch were replaced, or from hitherto unnoticed Scandinavian influence requires further investigation. A full list of all words with further comments is supplied in Supplementary Material II.

The modified MLN approach is surely not perfect. It relies heavily on the underlying data, and the selection of the reference tree in particular can have a strong influence on the results. Furthermore, it can only recover those cases of borrowing that occur inside a given language family. External influences cannot be recovered. Further research is required in order to assess to what degree it overestimates borrowing rates because of its inability to handle independent parallel developments. However, it is a first step en route to more realistic quantitative models of language evolution, and could prove useful for scholars working on quantitative applications in historical linguistics, since it not only tests the tree-likeness of datasets but also provides direct hints as to the characters that cause reticulation. It can help us to improve the quality of our datasets by identifying possible hidden borrowings and erroneous cognate assignments.

Conclusion and outlook

Different metaphors and models have, over the past century or two, been developed to describe the evolution of languages, but realistic quantitative models that can explain horizontal evolutionary processes in addition to genealogical relationships were lacking. Since similar evolutionary processes shaped both genomes and languages into contemporary forms, it is possible to apply methods that were developed to study genome evolution to the study of language evolution. Since lateral transfer in language evolution constitutes a real form of natural variation, phylogenetic network approaches provide a better means to model language evolution than strictly bifurcating phylogenetic trees. We strongly support the recent attempts to strengthen the quantitative basis of historical linguistics by building large databases and adapting computational methods from biology. Great work has been done in the past 10 years, and we know that errors are unavoidable when building large databases that accumulate historical linguistic knowledge. However, since errors are not only unavoidable but, in the case of undetected borrowings, also reflect one vivid aspect of language history, we think it is time to rethink claims about the major processes underlying language evolution. Applying network approaches in historical linguistics can provide new insights into both the vertical and the lateral components of language history, and help to bring traditional and more quantitative research closer together.

Acknowledgements

This research was supported by the ERC grants 240816 and F020515005. We thank Søren Wichmann and two anonymous reviewers for constructive comments. We also thank Michael Dunn for providing us with the Indo-European data which we used for the analyses presented in this study.

References

1. Kiparsky P. 2014. New perspectives in historical linguistics. In Bowern C, Evans B, eds; The Routledge Handbook of Historical Linguistics. London and New York: Routledge.
2. Kroeber AL, Chrétien CD. 1937. Quantitative classification of Indo-European languages. Language 13: 83–103.
3. Ross ASC. 1950. Philological probability problems. J R Stat Soc 12: 19–59.

4. Swadesh M. 1952. Lexico-statistic dating of prehistoric ethnic contacts. Proc Am Philos Soc 96: 452–63.
5. Swadesh M. 1955. Towards greater accuracy in lexicostatistic dating. Int J Am Linguist 21: 121–37.
6. Sankoff D. 1970. On the rate of replacement of word-meaning relationships. Language 46: 564–9.
7. Embleton SM. 1986. Statistics in Historical Linguistics. Bochum: Studienverlag Brockmeyer.
8. Starostin SA. 1989. Sravnitel’no-istoričeskoe jazykoznanie i leksikostatistika [Comparative historical linguistics and lexicostatistics]. In Kullanda SV, Longinov JD, Militarev AJ, Nosenko EJ, et al., eds; Materialy k Diskussijam na Konferencii [Materials for the Discussions at the Conference]. Moscow: Institut Vostokovedenija. p. 3–39.
9. Holden CJ. 2002. Bantu language trees reflect the spread of farming across sub-Saharan Africa: a maximum-parsimony analysis. Proc Biol Sci 269: 793–9.
10. Ringe D, Warnow T, Taylor A. 2002. Indo-European and computational cladistics. Trans Philol Soc 100: 59–129.
11. Gray RD, Atkinson QD. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426: 435–9.
12. Atkinson QD, Gray RD. 2006. How old is the Indo-European language family? Illumination or more moths to the flame? In Forster P, Renfrew C, eds; Phylogenetic Methods and the Prehistory of Languages. Cambridge: McDonald Institute for Archaeological Research. p. 91–109.
13. Bouckaert R, Lemey P, Dunn M, Greenhill SJ, et al. 2012. Mapping the origins and expansion of the Indo-European language family. Science 337: 957–60.
14. Dunn M, Levinson SC, Lindstroem E, Reesink G, et al. 2008. Structural phylogeny in historical linguistics: methodological explorations applied in island Melanesia. Language 84: 710–59.
15. Gray RD, Jordan FM. 2000. Language trees support the express-train sequences of Austronesian expansion. Nature 450: 1052–5.
16. Pagel M. 2009. Human language as a culturally transmitted replicator. Nat Rev Genet 10: 405–15.
17. Holm HJ. 2007. The new arboretum of Indo-European “trees”. J Quant Linguist 14: 167–214.
18. Donohue M, Denham T, Oppenheimer S. 2012. New methodologies for historical linguistics? Calibrating a lexicon-based methodology for diffusion vs. subgrouping. Diachronica 29: 505–22.
19. Geisler H, List J-M. 2014. Beautiful trees on unstable ground. Notes on the data problem in lexicostatistics. In Hettrich H, Ziegler S, eds; Die Ausbreitung des Indogermanischen. Thesen aus Sprachwissenschaft, Archäologie und Genetik [The Spread of Indo-European. Theses From Linguistics, Archaeology, and Genetics]. Wiesbaden: Reichert.
20. Häkkinen J. 2012. Problems in the method and interpretations of the computational phylogenetics based on linguistic data. An example of wishful thinking: Bouckaert et al. 2012. URL: http://www.elisanet.fi/alkupera/Problems_of_phylogenetics.pdf.
21. Schleicher A. 1853. O jazyku litevském, zvláště na slovanský [On the Lithuanian language, and specifically on Slavic]. Časopis Českého Museum [J Czech Mus] 27: 320–4.
22. Schleicher A. 1853. Die ersten Spaltungen des indogermanischen Urvolkes [The first splits of the Proto-Indo-European people]. Allgemeine Monatsschrift für Wissenschaft und Literatur [Mon J Sci Lit] 3: 786–7.
23. Sutrop U. 1999. Diskussionsbeiträge zur Stammbaumtheorie [Discussing the theories of family trees]. Fenno-Ugristica 22: 223–51.
24. Čelakovský FL. 1853. Čtení o Srovnavací Mluvnici Slovanské [Readings on the Comparison of Slavic Languages]. V komisí u F. Řivnáče: Prague.
25. Schleicher A. 1848. Zur Vergleichenden Sprachengeschichte [On Comparative Language History]. Bonn: König.
26. Schleicher A. 1863. Die Darwinsche Theorie und die Sprachwissenschaft [Darwin’s Theory and Linguistics]. Leipzig: Hermann Böhlau.
27. Darwin C. 1859. On the Origin of Species by Means of Natural Selection, or, the Preservation of Favoured Races in the Struggle for Life. London: John Murray.
28. Oppenheimer JM. 1987. Haeckel’s variations on Darwin. In Hoenigswald HM, ed; Biological Metaphor and Cladistic Classification: An Interdisciplinary Perspective. Philadelphia: University of Pennsylvania Press. p. 123–35.
29. Fitch WM, Margoliash E. 1967. Construction of phylogenetic trees. Science 155: 279–84.
30. Woese CR, Kandler O, Wheelis ML. 1990. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci USA 87: 4576–9.
31. Geisler H, List JM. 2013. Do languages grow on trees? The tree metaphor in the history of linguistics. In Fangerau H, Geisler H, Halling T, Martin W, eds; Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Stuttgart: Steiner. p. 111–24.
32. Schmidt J. 1872. Die Verwantschaftsverhältnisse der Indogermanischen Sprachen [The Relationship of the Indo-European Languages]. Leipzig: Hermann Böhlau.
33. Schuchardt H. 1900. Über die Klassifikation der Romanischen Mundarten. Probe-Vorlesung, Gehalten zu Leipzig am 30. April 1870 [On the Classification of the Romance Dialects. Test lecture, held in Leipzig 30th of April 1870]. Graz.
34. Campbell L. 1999. Historical Linguistics: An Introduction. Edinburgh: Edinburgh University Press.
35. Anttila R. 1972. An Introduction to Historical and Comparative Linguistics. New York: Macmillan.
36. Hoenigswald HM. 1990. Does language grow on trees? Proc Am Philos Soc 134: 10–8.
37. Stefenelli A. 1992. Das Schicksal des Lateinischen Wortschatzes in den Romanischen Sprachen [The Fate of the Lexicon in the Romance Languages]. Rothe.
38. Schulte K. 2009. Loanwords in Romanian. In Haspelmath M, Tadmor U, eds; Loanwords in the World’s Languages. Berlin and New York: de Gruyter. p. 231–59.
39. Thomason S, Kaufman T. 1988. Language Contact, Creolization, and Genetic Linguistics. Berkeley: University of California Press.
40. Haspelmath M, Tadmor U. 2009. The loanword typology project and the world loanword database. In Haspelmath M, Tadmor U, eds; Loanwords in the World’s Languages. Berlin and New York: de Gruyter. p. 1–34.
41. Tadmor U. 2009. Loanwords in the world’s languages. In Haspelmath M, Tadmor U, eds; Loanwords in the World’s Languages. Berlin and New York: de Gruyter. p. 55–75.
42. Dunn M, Terrill A, Reesink G, Foley RA, et al. 2005. Structural phylogenetics and the reconstruction of ancient language history. Science 309: 2072–5.
43. Colonna V, Boattini A, Guardiano C, Dall’ara I, et al. 2010. Long-range comparison between genes and languages based on syntactic distances. Hum Hered 70: 245–54.
44. Longobardi G, Guardiano C, Silvestri G, Boattini A, et al. 2013. Toward a syntactic phylogeny of modern Indo-European languages. J Hist Linguist 3: 122–52.
45. Campbell L, Poser WJ. 2008. Language Classification: History and Method. Cambridge: Cambridge University Press.
46. Dybo A, Starostin G. 2008. In defense of the comparative method, or the end of the Vovin controversy. In Smirnov IS, ed; Aspekty Komparativistiki [Aspects of Comparativistics]. Moscow: RGGU. p. 119–258.
47. Swadesh M. 1950. Salish internal relationships. Int J Am Linguist 16: 157–67.
48. Grant AP. 2009. Loanwords in British English. In Haspelmath M, Tadmor U, eds; Loanwords in the World’s Languages. Berlin and New York: de Gruyter. p. 360–82.
49. Holm HJ. 2011. “Swadesh lists” of Albanian revisited and consequences for its Position in the Indo-European Languages. J Indo-Eur Stud 39: 43–99.
50. Kluge F, Seebold H. eds; 2002. Etymologisches Wörterbuch der Deutschen Sprache [Etymological dictionary of the German language]. Berlin and New York: de Gruyter.
51. Barbançon F, Evans SN, Ringe D, Warnow T. 2013. An experimental study comparing linguistic phylogenetic reconstruction methods. Diachronica 30: 143–70.
52. Dyen I, Kruskal JB, Black P. 1992. An Indoeuropean classification. Trans Am Philos Soc 82: 1–132.
53. Greenhill SJ, Blust R, Gray RD. 2008. The Austronesian basic vocabulary database: from bioinformatics to lexomics. Evol Bioinforma 4: 271–83.
54. Dunn M, Greenhill SJ, Levinson SC, Gray RD. 2011. Evolved structure of language shows lineage-specific trends in word-order universals. Nature 473: 79–82.
55. Zhang J. 2003. Evolution by gene duplication: an update. Trends Ecol Evol 18: 292–8.
56. Treangen TJ, Rocha EP. 2011. Horizontal transfer, not duplication, drives the expansion of protein families in prokaryotes. PLoS Genet 7: e1001284.
57. Popa O, Dagan T. 2011. Trends and barriers to lateral gene transfer in prokaryotes. Curr Opin Microbiol 14: 615–23.
58. Lukjancenko O, Wassenaar TM, Ussery DW. 2010. Comparison of 61 sequenced Escherichia coli genomes. Microb Ecol 60: 708–20.
59. Bapteste E, O’Malley M, Beiko R, Ereshefsky M, et al. 2009. Prokaryotic evolution and the tree of life are two different things. Biol Direct 4: 34.
60. Puigbo P, Wolf YI, Koonin EV. 2010. The tree and net components of prokaryote evolution. Genome Biol Evol 2: 745–56.
61. Southworth FC. 1964. Family-tree diagrams. Language 40: 557–65.

62. Holzer G. 1996. Das Erschließen Unbelegter Sprachen [Reconstructing unattested languages]. Frankfurt am Main: Lang.
63. Huson DH, Rupp R, Scornavacca C. 2010. Phylogenetic Networks. Cambridge: Cambridge University Press.
64. Morrison DA. 2011. An Introduction to Phylogenetic Networks. Uppsala: RJR Productions.
65. Bandelt HJ, Dress AW. 1992. Split decomposition: a new and useful approach to phylogenetic analysis of distance data. Mol Phylogenet Evol 1: 242–52.
66. Bryant D, Moulton V. 2004. Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol Biol Evol 21: 255–65.
67. Hamed MB, Wang F. 2006. Stuck in the forest: trees, networks and Chinese dialects. Diachronica 23: 29–60.
68. Hamed MB. 2005. Neighbour-nets portray the Chinese dialect continuum and the linguistic legacy of China’s demic history. Proc R Soc B 272: 1015–22.
69. Heggarty P, Maguire W, McMahon A. 2010. Splits or waves? Trees or webs? How divergence measures and network analysis can unravel language histories. Philos Trans R Soc B 365: 3829–43.
70. McMahon A, Heggarty P, McMahon R, Slaska N. 2005. Swadesh sublists and the benefits of borrowing: an Andean case study. Trans Philol Soc 103: 147–70.
71. Bowern C. 2010. Historical linguistics in Australia: trees, networks and their implications. Philos Trans R Soc B 365: 3845–54.
72. Saitou N, Nei M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4: 406–25.
73. Ronquist F. 2004. Bayesian inference of character evolution. Trends Ecol Evol 19: 475–81.
74. Gray RD, Bryant D, Greenhill SJ. 2010. On the shape and fabric of human history. Philos Trans R Soc B 365: 3923–33.
75. Wichmann S, Holman EW, Raman T, Walker RS. 2011. Correlates of reticulation in linguistic phylogenies. Lang Dyn Chang 1: 205–40.
76. Koonin EV, Makarova KS, Aravind L. 2001. Horizontal gene transfer in prokaryotes: quantification and classification. Annu Rev Microbiol 55: 709–42.
77. Huson DH, Scornavacca C. 2011. A survey of combinatorial methods for phylogenetic networks. Genome Biol Evol 3: 23–35.
78. Wang WS-Y, Minett JW. 2005. Vertical and horizontal transmission in language evolution. Trans Philol Soc 103: 121–46.
79. Minett JW, Wang WS-Y. 2003. On detecting borrowing. Diachronica 20: 289–330.
80. Nakhleh L, Ringe D, Warnow T. 2005. Perfect phylogenetic networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81: 382–420.
81. Nichols J, Warnow T. 2008. Tutorial on computational linguistic phylogeny. Linguist Compass 2: 760–820.
82. Dagan T, Artzy-Randrup Y, Martin W. 2008. Modular networks and cumulative impact of lateral transfer in prokaryote genome evolution. Proc Natl Acad Sci USA 105: 10039–44.
83. Mirkin BG, Fenner TI, Galperin MY, Koonin EV. 2003. Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol Biol 3: 2.
84. Dagan T, Martin W. 2007. Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proc Natl Acad Sci USA 104: 870–5.
85. Cohen O, Pupko T. 2011. Inference of gain and loss events from phyletic patterns using stochastic mapping and maximum parsimony – a simulation study. Genome Biol Evol 3: 1265–75.
86. Nelson-Sathi S, List J-M, Geisler H, Fangerau H, et al. 2011. Networks uncover hidden lexical borrowing in Indo-European language evolution. Proc R Soc B 278: 1794–803.
87. List J-M, Moran S. 2013. An open source toolkit for quantitative historical linguistics. Proc ACL 2013 Syst Demonstr 13–8.
88. Lewis MP, Fennig CD. 2013. Ethnologue. Dallas: SIL International.
89. Harbert W. 2007. The Germanic Languages. Cambridge: Cambridge University Press.

Bioessays 36: 141–150, © 2013 The Authors. BioEssays Published by WILEY Periodicals, Inc.
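The idea of reading lateral links off presence-absence patterns of cognate sets, as described in the section on applying minimal lateral networks to linguistic data, can be illustrated with a minimal sketch. It is not the refined MLN approach or its LingPy implementation. It assumes a deliberately simplified, loss-free mapping in which every maximal clade whose languages all carry a cognate set counts as one origin, and all origins of the same set are linked pairwise. The tree `TREE`, the toy data `COGNATE_SETS` (apart from the "mountain" pattern mentioned in the text), and the function names `leaves`, `origins`, and `lateral_network` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy reference tree: internal node -> list of children.
TREE = {
    "root": ["Germanic", "Romance"],
    "Germanic": ["English", "German", "Swedish"],
    "Romance": ["French", "Italian", "Spanish"],
}

# Presence patterns: cognate set -> languages that have a reflex.
# "mountain" follows the example in the text; the other sets are invented.
COGNATE_SETS = {
    "mountain": {"English", "French", "Italian", "Spanish"},
    "toy_a": {"English", "French", "Italian", "Spanish"},
    "toy_b": {"English", "German", "Swedish"},
    "toy_c": {"English", "Swedish"},
}


def leaves(node):
    """Return the set of languages below a node (a leaf returns itself)."""
    if node not in TREE:
        return {node}
    result = set()
    for child in TREE[node]:
        result |= leaves(child)
    return result


def origins(node, present):
    """Loss-free mapping (an assumption of this sketch): one origin per
    maximal clade in which every language carries the cognate set."""
    below = leaves(node)
    if below <= present:
        return [node]
    if node not in TREE or not (below & present):
        return []
    found = []
    for child in TREE[node]:
        found += origins(child, present)
    return found


def lateral_network(cognate_sets):
    """Link all origins of a cognate set pairwise; links between the same
    two nodes from different cognate sets add up to a weighted edge."""
    edges = Counter()
    for present in cognate_sets.values():
        nodes = sorted(origins("root", present))
        for a, b in combinations(nodes, 2):
            edges[(a, b)] += 1
    return edges


if __name__ == "__main__":
    for (a, b), weight in sorted(lateral_network(COGNATE_SETS).items()):
        print(a, b, weight)
```

Because this sketch forbids losses altogether, a pattern such as `toy_c` is always read as a lateral link between English and Swedish, although a single loss in German would explain it equally well. This is the kind of overestimation that more refined gain-loss mapping is meant to reduce.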