Language Distance and Tree Reconstruction
Total Page:16
File Type:pdf, Size:1020Kb
Home Search Collections Journals About Contact us My IOPscience Language distance and tree reconstruction This content has been downloaded from IOPscience. Please scroll down to see the full text. J. Stat. Mech. (2008) P08012 (http://iopscience.iop.org/1742-5468/2008/08/P08012) View the table of contents for this issue, or go to the journal homepage for more Download details: IP Address: 131.215.225.9 This content was downloaded on 11/01/2015 at 00:26 Please note that terms and conditions apply. ournal of Statistical Mechanics: Theory and Experiment JAn IOP and SISSA journal Language distance and tree reconstruction J. Stat. Mech. Filippo Petroni1 and Maurizio Serva2 1 GRAPES, B5, Sart Tilman, B-4000 Li`ege, Belgium 2 Dipartimento di Matematica, Universit`a dell’Aquila, I-67010 L’Aquila, Italy E-mail: [email protected] and [email protected] Received 5 June 2008 Accepted 28 July 2008 Published 22 August 2008 Online at stacks.iop.org/JSTAT/2008/P08012 doi:10.1088/1742-5468/2008/08/P08012 Abstract. Languages evolve over time according to a process in which reproduction, mutation and extinction are all possible. This is very similar (2008) P08012 to haploid evolution for asexual organisms and for the mitochondrial DNA of complex ones. Exploiting this similarity, it is possible, in principle, to verify hypotheses concerning the relationship among languages and to reconstruct their family tree. The key point is the definition of the distances among pairs of languages in analogy with the genetic distances among pairs of organisms. Distances can be evaluated by comparing grammar and/or vocabulary, but while it is difficult, if not impossible, to quantify grammar distance, it is possible to measure a distance from vocabulary differences. The method used by glottochronology computes distances from the percentage of shared ‘cognates’, which are words with a common historical origin. The weak point of this method is that subjective judgment plays a significant role. Here we define the distance of two languages by considering a renormalized edit distance among words with the same meaning and averaging over the two hundred words contained in a Swadesh list. In our approach the vocabulary of a language is the analogue of DNA for organisms. The advantage is that we avoid subjectivity and, furthermore, reproducibility of results is guaranteed. We apply our method to the Indo-European and the Austronesian groups, considering, in both cases, fifty different languages. The two trees obtained are, in many respects, similar to those found by glottochronologists, with some important differences as regards the positions of a few languages. In order to support these different results we separately analyze the structure of the distances of these languages with respect to all the others. Keywords: phylogeny (theory), population dynamics (theory), analysis of algorithms, new applications of statistical mechanics c 2008 IOP Publishing Ltd and SISSA 1742-5468/08/P08012+15$30.00 Language distances and tree reconstruction Contents 1. Introduction 2 2. Method and database 3 3. Time distance between languages 4 4. Tree and distance distributions 5 5. Malagasy and Romani 11 J. Stat. Mech. 6. Discussion and conclusions 13 Acknowledgments 15 References 15 1. Introduction The Swadesh list, in contrast to a list of arbitrary words, contains terms which are common to all cultures and which concern the basic activities of humans. The choice is motivated by the fact that the vocabulary learned during childhood changes very slowly over time. The use of Swadesh lists in glottochronology has been popular for half a century. Glottochronologists use the percentage of shared cognates in order to (2008) P08012 compute the distances between pairs of languages. These lexical distances are on average logarithmically proportional to the divergence time. In fact, changes in vocabulary accumulate year after year and two languages initially similar become more and more different. Recent examples of the use of Swadesh lists and cognates to construct language trees are the studies of Gray and Atkinson [1] and Gray and Jordan [2]. Cognates are words inferred to have a common historical origin; their identification is often a matter of sense and personal knowledge. In fact, the task of counting the number of cognate words in the list is far from trivial because cognates do not necessarily look similar. Therefore, subjectivity plays a significant role. Furthermore, results are often biased since it is easier for European or American scholars to find those cognates belonging to western languages. For instance, the Spanish word leche and the Greek word gala are cognates. In fact, leche comes from the Latin lac with genitive form lactis, while the genitive form of gala is galactos. Also the English wheel and Hindi cakra are cognates. These two identifications are possible because of our historical records; it would hardly have been possible for languages of, let us say, Central Africa or Australia. In this paper we use an automated method which avoids any subjectivity so that our results can be replicated by other scholars assuming that the database used is the same. For any language we write down a list of the same 200 words according to the original choice of Swadesh [3]; then we compare words with the same meaning belonging to different languages only considering orthographical differences. This may appear reductive since words may look similar by chance, while cognate words may have a completely different orthography, but we will try to convince the reader that this is indeed a simpler, more objective and more efficient choice than the traditional glottochronological approach. doi:10.1088/1742-5468/2008/08/P08012 2 Language distances and tree reconstruction In section 2 we describe how we measure the distance between words and then between languages. Once all the distances among pairs of languages are given it is possible to find the time separations (section 3) and finally the genealogical trees (section 4). In this work we include some results previously obtained in [4] concerning the Indo-European group of languages and we extend our results to the Austronesian group. For each of the two groups we consider 50 languages and we obtain two genealogical trees which are similar to those found by [1]and[2], with some important differences as regards the positions of a few languages and subgroups. Indeed we think that these differences carry some new information about the structure of the tree and about the positions of some languages, for which we separately analyze the structure of distances with respect to all the others J. Stat. Mech. (section 5). 2. Method and database In order to measure distances between pairs of words in different languages we use a modification of the Levenshtein distance (or edit distance) which is defined as the minimum number of operations needed to transform one word into another. An operation is an insertion, deletion, or substitution of a single character. Our definition of lexical distance between two words is taken as the edit distance normalized by the number of characters of the longer of the two. The reason for the renormalization is easily understood: if two words differ by one character this is much more important for short words than it is for long words. This is taken into account by the renormalization factor. We use distance between word pairs, as defined above, to construct a distance between (2008) P08012 pairs of languages. The first step is to find lists of words with the same meaning for all languages for which we intend to construct a distance. Then, we compute the lexical distance for each pair of words with the same meaning in one language pair. Finally, the distance of each language pair is defined as the average of the distances between the word pairs. As a result we have a number between 0 and 1 which we claim to be the lexical distance between the two languages. The databases used here3 to construct the phylogenetic trees are composed of 50 languages of the Indo-European group and 50 languages of the Austronesian group. The main source for the database of the Indo-European group is the file prepared by Dyen et al in [5] which contains the Swadesh list of 200 words for 96 languages. This list consists of items of basic vocabulary, like body parts, pronouns, numbers, which are known to be resistant to borrowings. Many words are missing in [5] but for our choice of 50 languages we have filled most of the gaps and corrected some errors by finding the words on Swadesh lists and in dictionaries freely available on the Web. For the Austronesian group we used, as the main source, the lists contained in the huge database available in [6]. The lists in [6] contain more than 200 words which do not coincide completely with the words in the original Swadesh list [3]. For our choice of 50 Austronesian languages we have retained only the words which are in [6] and also are in the original Swadesh list. The list obtained has many gaps due to missing words in [6] and because of the incomplete overlap of [6] with the original Swadesh list. Also in this case we have filled some of the gaps by finding 3 The databases, as modified by the authors, are available at the following Web address: http://univaq.it/∼serva/languages/languages.html. Readers are welcome to modify, correct and add words to the databases. doi:10.1088/1742-5468/2008/08/P08012 3 Language distances and tree reconstruction the words on Swadesh lists available on the Web and in one case (Malagasy) from direct knowledge of the language. Both databases [5]and[6] contain information on ‘cognates’ for languages that we do not use in this work.