Language Distance and Tree Reconstruction

Total Page:16

File Type:pdf, Size:1020Kb

Language Distance and Tree Reconstruction Home Search Collections Journals About Contact us My IOPscience Language distance and tree reconstruction This content has been downloaded from IOPscience. Please scroll down to see the full text. J. Stat. Mech. (2008) P08012 (http://iopscience.iop.org/1742-5468/2008/08/P08012) View the table of contents for this issue, or go to the journal homepage for more Download details: IP Address: 131.215.225.9 This content was downloaded on 11/01/2015 at 00:26 Please note that terms and conditions apply. ournal of Statistical Mechanics: Theory and Experiment JAn IOP and SISSA journal Language distance and tree reconstruction J. Stat. Mech. Filippo Petroni1 and Maurizio Serva2 1 GRAPES, B5, Sart Tilman, B-4000 Li`ege, Belgium 2 Dipartimento di Matematica, Universit`a dell’Aquila, I-67010 L’Aquila, Italy E-mail: [email protected] and [email protected] Received 5 June 2008 Accepted 28 July 2008 Published 22 August 2008 Online at stacks.iop.org/JSTAT/2008/P08012 doi:10.1088/1742-5468/2008/08/P08012 Abstract. Languages evolve over time according to a process in which reproduction, mutation and extinction are all possible. This is very similar (2008) P08012 to haploid evolution for asexual organisms and for the mitochondrial DNA of complex ones. Exploiting this similarity, it is possible, in principle, to verify hypotheses concerning the relationship among languages and to reconstruct their family tree. The key point is the definition of the distances among pairs of languages in analogy with the genetic distances among pairs of organisms. Distances can be evaluated by comparing grammar and/or vocabulary, but while it is difficult, if not impossible, to quantify grammar distance, it is possible to measure a distance from vocabulary differences. The method used by glottochronology computes distances from the percentage of shared ‘cognates’, which are words with a common historical origin. The weak point of this method is that subjective judgment plays a significant role. Here we define the distance of two languages by considering a renormalized edit distance among words with the same meaning and averaging over the two hundred words contained in a Swadesh list. In our approach the vocabulary of a language is the analogue of DNA for organisms. The advantage is that we avoid subjectivity and, furthermore, reproducibility of results is guaranteed. We apply our method to the Indo-European and the Austronesian groups, considering, in both cases, fifty different languages. The two trees obtained are, in many respects, similar to those found by glottochronologists, with some important differences as regards the positions of a few languages. In order to support these different results we separately analyze the structure of the distances of these languages with respect to all the others. Keywords: phylogeny (theory), population dynamics (theory), analysis of algorithms, new applications of statistical mechanics c 2008 IOP Publishing Ltd and SISSA 1742-5468/08/P08012+15$30.00 Language distances and tree reconstruction Contents 1. Introduction 2 2. Method and database 3 3. Time distance between languages 4 4. Tree and distance distributions 5 5. Malagasy and Romani 11 J. Stat. Mech. 6. Discussion and conclusions 13 Acknowledgments 15 References 15 1. Introduction The Swadesh list, in contrast to a list of arbitrary words, contains terms which are common to all cultures and which concern the basic activities of humans. The choice is motivated by the fact that the vocabulary learned during childhood changes very slowly over time. The use of Swadesh lists in glottochronology has been popular for half a century. Glottochronologists use the percentage of shared cognates in order to (2008) P08012 compute the distances between pairs of languages. These lexical distances are on average logarithmically proportional to the divergence time. In fact, changes in vocabulary accumulate year after year and two languages initially similar become more and more different. Recent examples of the use of Swadesh lists and cognates to construct language trees are the studies of Gray and Atkinson [1] and Gray and Jordan [2]. Cognates are words inferred to have a common historical origin; their identification is often a matter of sense and personal knowledge. In fact, the task of counting the number of cognate words in the list is far from trivial because cognates do not necessarily look similar. Therefore, subjectivity plays a significant role. Furthermore, results are often biased since it is easier for European or American scholars to find those cognates belonging to western languages. For instance, the Spanish word leche and the Greek word gala are cognates. In fact, leche comes from the Latin lac with genitive form lactis, while the genitive form of gala is galactos. Also the English wheel and Hindi cakra are cognates. These two identifications are possible because of our historical records; it would hardly have been possible for languages of, let us say, Central Africa or Australia. In this paper we use an automated method which avoids any subjectivity so that our results can be replicated by other scholars assuming that the database used is the same. For any language we write down a list of the same 200 words according to the original choice of Swadesh [3]; then we compare words with the same meaning belonging to different languages only considering orthographical differences. This may appear reductive since words may look similar by chance, while cognate words may have a completely different orthography, but we will try to convince the reader that this is indeed a simpler, more objective and more efficient choice than the traditional glottochronological approach. doi:10.1088/1742-5468/2008/08/P08012 2 Language distances and tree reconstruction In section 2 we describe how we measure the distance between words and then between languages. Once all the distances among pairs of languages are given it is possible to find the time separations (section 3) and finally the genealogical trees (section 4). In this work we include some results previously obtained in [4] concerning the Indo-European group of languages and we extend our results to the Austronesian group. For each of the two groups we consider 50 languages and we obtain two genealogical trees which are similar to those found by [1]and[2], with some important differences as regards the positions of a few languages and subgroups. Indeed we think that these differences carry some new information about the structure of the tree and about the positions of some languages, for which we separately analyze the structure of distances with respect to all the others J. Stat. Mech. (section 5). 2. Method and database In order to measure distances between pairs of words in different languages we use a modification of the Levenshtein distance (or edit distance) which is defined as the minimum number of operations needed to transform one word into another. An operation is an insertion, deletion, or substitution of a single character. Our definition of lexical distance between two words is taken as the edit distance normalized by the number of characters of the longer of the two. The reason for the renormalization is easily understood: if two words differ by one character this is much more important for short words than it is for long words. This is taken into account by the renormalization factor. We use distance between word pairs, as defined above, to construct a distance between (2008) P08012 pairs of languages. The first step is to find lists of words with the same meaning for all languages for which we intend to construct a distance. Then, we compute the lexical distance for each pair of words with the same meaning in one language pair. Finally, the distance of each language pair is defined as the average of the distances between the word pairs. As a result we have a number between 0 and 1 which we claim to be the lexical distance between the two languages. The databases used here3 to construct the phylogenetic trees are composed of 50 languages of the Indo-European group and 50 languages of the Austronesian group. The main source for the database of the Indo-European group is the file prepared by Dyen et al in [5] which contains the Swadesh list of 200 words for 96 languages. This list consists of items of basic vocabulary, like body parts, pronouns, numbers, which are known to be resistant to borrowings. Many words are missing in [5] but for our choice of 50 languages we have filled most of the gaps and corrected some errors by finding the words on Swadesh lists and in dictionaries freely available on the Web. For the Austronesian group we used, as the main source, the lists contained in the huge database available in [6]. The lists in [6] contain more than 200 words which do not coincide completely with the words in the original Swadesh list [3]. For our choice of 50 Austronesian languages we have retained only the words which are in [6] and also are in the original Swadesh list. The list obtained has many gaps due to missing words in [6] and because of the incomplete overlap of [6] with the original Swadesh list. Also in this case we have filled some of the gaps by finding 3 The databases, as modified by the authors, are available at the following Web address: http://univaq.it/∼serva/languages/languages.html. Readers are welcome to modify, correct and add words to the databases. doi:10.1088/1742-5468/2008/08/P08012 3 Language distances and tree reconstruction the words on Swadesh lists available on the Web and in one case (Malagasy) from direct knowledge of the language. Both databases [5]and[6] contain information on ‘cognates’ for languages that we do not use in this work.
Recommended publications
  • GRAMMAR of SOLRESOL Or the Universal Language of François SUDRE
    GRAMMAR OF SOLRESOL or the Universal Language of François SUDRE by BOLESLAS GAJEWSKI, Professor [M. Vincent GAJEWSKI, professor, d. Paris in 1881, is the father of the author of this Grammar. He was for thirty years the president of the Central committee for the study and advancement of Solresol, a committee founded in Paris in 1869 by Madame SUDRE, widow of the Inventor.] [This edition from taken from: Copyright © 1997, Stephen L. Rice, Last update: Nov. 19, 1997 URL: http://www2.polarnet.com/~srice/solresol/sorsoeng.htm Edits in [brackets], as well as chapter headings and formatting by Doug Bigham, 2005, for LIN 312.] I. Introduction II. General concepts of solresol III. Words of one [and two] syllable[s] IV. Suppression of synonyms V. Reversed meanings VI. Important note VII. Word groups VIII. Classification of ideas: 1º simple notes IX. Classification of ideas: 2º repeated notes X. Genders XI. Numbers XII. Parts of speech XIII. Number of words XIV. Separation of homonyms XV. Verbs XVI. Subjunctive XVII. Passive verbs XVIII. Reflexive verbs XIX. Impersonal verbs XX. Interrogation and negation XXI. Syntax XXII. Fasi, sifa XXIII. Partitive XXIV. Different kinds of writing XXV. Different ways of communicating XXVI. Brief extract from the dictionary I. Introduction In all the business of life, people must understand one another. But how is it possible to understand foreigners, when there are around three thousand different languages spoken on earth? For everyone's sake, to facilitate travel and international relations, and to promote the progress of beneficial science, a language is needed that is easy, shared by all peoples, and capable of serving as a means of interpretation in all countries.
    [Show full text]
  • A Pipeline for Computational Historical Linguistics
    A Pipeline for Computational Historical Linguistics Lydia Steiner Peter F. Stadler Bioinformatics Group, Interdisciplinary Bioinformatics Group, University of Center for Bioinformatics, University of Leipzig, Interdisciplinary Center for Leipzig ∗ Bioinformatics, University of Leipzig∗; Max-Planck-Institute for Mathematics in the Science; Fraunhofer Institute for Cell Therapy and Immunology; Center for non-coding RNA in Technology and Health, University of Copenhagen; Institute for Theoretical Chemistry, University of Vienna; Santa Fe Institute Michael Cysouw Research unit Quantitative Language Comparison, LMU München ∗∗ There are many parallels between historical linguistics and molecular phylogenetics. In this paper we describe an algorithmic pipeline that mimics, as closely as possible, the traditional workflow of language reconstruction known as the comparative method. The pipeline consists of suitably modified algorithms based on recent research in bioinformatics, that are adapted to the specifics of linguistic data. This approach can alleviate much of the laborious research needed to establish proof of historical relationships between languages. Equally important to our proposal is that each step in the workflow of the comparative method is implemented independently, so language specialists have the possibility to scrutinize intermediate results. We have used our pipeline to investigate two groups of languages, the Tsezic languages from the Caucasus and the Mataco-Guaicuruan languages from South America, based on the lexical data from the Intercontinental Dictionary Series (IDS). The results of these tests show that the current approach is a viable and useful extension to historical linguistic research. 1. Introduction Molecular phylogenetics and historical linguistics are both concerned with the recon- struction of evolutionary histories, the former of biological organisms, the latter of human languages.
    [Show full text]
  • Swadesh Lists Are Not Long Enough: Drawing Phonological Generalizations from Limited Data
    Language Documentation and Description ISSN 1740-6234 ___________________________________________ This article appears in: Language Documentation and Description, vol 16. Editor: Peter K. Austin Swadesh lists are not long enough: Drawing phonological generalizations from limited data RIKKER DOCKUM & CLAIRE BOWERN Cite this article: Rikker Dockum & Claire Bowern (2018). Swadesh lists are not long enough: Drawing phonological generalizations from limited data. In Peter K. Austin (ed.) Language Documentation and Description, vol 16. London: EL Publishing. pp. 35-54 Link to this article: http://www.elpublishing.org/PID/168 This electronic version first published: August 2019 __________________________________________________ This article is published under a Creative Commons License CC-BY-NC (Attribution-NonCommercial). The licence permits users to use, reproduce, disseminate or display the article provided that the author is attributed as the original creator and that the reuse is restricted to non-commercial purposes i.e. research or educational use. See http://creativecommons.org/licenses/by-nc/4.0/ ______________________________________________________ EL Publishing For more EL Publishing articles and services: Website: http://www.elpublishing.org Submissions: http://www.elpublishing.org/submissions Swadesh lists are not long enough: Drawing phonological generalizations from limited data Rikker Dockum and Claire Bowern Yale University Abstract This paper presents the results of experiments on the minimally sufficient wordlist size for drawing phonological generalizations about languages. Given a limited lexicon for an under-documented language, are conclusions that can be drawn from those data representative of the language as a whole? Linguistics necessarily involves generalizing from limited data, as documentation can never completely capture the full complexity of a linguistic system. We performed a series of sampling experiments on 36 Australian languages in the Chirila database (Bowern 2016) with lexicons ranging from 2,000 to 10,000 items.
    [Show full text]
  • Northwest Caucasian Languages and Hattic
    Kafkasya Calışmaları - Sosyal Bilimler Dergisi / Journal of Caucasian Studies Kasım 2020 / November 2020, Yıl / Vol. 6, № 11 ISSN 2149–9527 E-ISSN 2149-9101 Northwest Caucasian Languages and Hattic Ayla Bozkurt Applebaum* Abstract The relationships among five Northwest Caucasian languages and Hattic were investigated. A list of 193 core vocabulary words was constructed and examined to find look-alike words. Data for Abhkaz, Abaza, Kabardian (East Circassian), Adyghe (West Circassian) and Ubykh drew on the work of Starostin, Chirikba and Kuipers. A sub-set list of 15 look-alike words for Hattic was constructed from Soysal (2003). These lists were formulated as character data for reconstructing the phylogenetic relationships of the languages. The phylogenetic relationships of these languages were investigated by a well-known method, Neighbor Joining, as implemented in PAUP* 4.0. Supporting and dissenting evidence from human genetic population studies and archeological evidence were discussed. This project has produced a provisional set of character data for the Northwest Caucasian languages and, to a limited extent, Hattic. Phylogenetic trees have been generated and displayed to show their general character and the types of differences obtained by alternate methods. This research is a basis for further inquiries into the development of the Caucasian languages. Moreover, it presents an example of the method for contrast queries application in studying the evolution of language families. Keywords: Northwest Caucasian Languages, Hattic, Historical Linguistics, Circassian, Adyghe, Kabardian * Ayla Bozkurt Applebaum, ORCID 0000-0003-4866-4407, E-mail: [email protected] (Received/Gönderim: 15.10.2020; Accepted/Kabul: 28.11.2020) 63 Ayla Bozkurt Applebaum Kuzeybatı Kafkas Dilleri ve Hattice Özet Bu araştırma beş Kuzeybatı Kafkas Dilleri ve Hatik arasındaki ilişkiyi incelemektedir.
    [Show full text]
  • Exploring the Corresponding Words Among the Subgroupings, Revising Swadesh List, Compared with Chinese
    International Journal of Language and Linguistics 2019; 7(3): 110-118 http://www.sciencepublishinggroup.com/j/ijll doi: 10.11648/j.ijll.20190703.13 ISSN: 2330-0205 (Print); ISSN: 2330-0221 (Online) Exploring the Corresponding Words Among the Subgroupings, Revising Swadesh List, Compared with Chinese Ke Luo Beijing Institute of Spacecraft Environment Engineering, Beijing, China Email address: To cite this article: Ke Luo. Exploring the Corresponding Words Among the Subgroupings, Revising Swadesh List, Compared with Chinese. International Journal of Language and Linguistics . Vol. 7, No. 3, 2019, pp. 110-118. doi: 10.11648/j.ijll.20190703.13 Received : March 2, 2019; Accepted : April 29, 2019; Published : May 31, 2019 Abstract: In 1647, the Dutch linguist Marcus Zuerius van Boxhorn noted the similarity among certain Asian and European languages and first theorized that they were derived from a primitive common language which he called Scythian (proto-language). For hundreds of years, many scholars have been studying the cognate of the subgroupings of Indo-European languages. This paper with the Swadesh list compares several subgroupings of Indo-European languages and finds out that their cognate correspondence is closer. It is inferred that the Proto-Indo-European was a language with very rich vocabulary and should be contained independently in the subgroupings at the beginning of its dissemination. This paper proposes a revised Swadesh list which can be used to assess language homology degree in high or low, and compares English with Chinese. The result shows that Chinese and English have the same origin. Keywords: Indo-European Languages, Swadesh List, English, Chinese, Cognate conclusion.
    [Show full text]
  • Ancestral Dravidian Languages in Indus Civilization
    ARTICLE https://doi.org/10.1057/s41599-021-00868-w OPEN Ancestral Dravidian languages in Indus Civilization: ultraconserved Dravidian tooth-word reveals deep linguistic ancestry and supports genetics ✉ Bahata Ansumali Mukhopadhyay 1 Ever since the discovery of Indus valley civilization, scholars have debated the linguistic identities of its people. This study analyzes numerous archaeological, linguistic, archae- 1234567890():,; ogenetic and historical evidences to claim that the words used for elephant (like, ‘pīri’, ‘pīru’) in Bronze Age Mesopotamia, the elephant-word used in the Hurrian part of an Amarna letter of ca. 1400 BC, and the ivory-word (‘pîruš’) recorded in certain sixth century BC Old Persian documents, were all originally borrowed from ‘pīlu’, a Proto-Dravidian elephant-word, which was prevalent in the Indus valley civilization, and was etymologically related to the Proto- Dravidian tooth-word ‘*pal’ and its alternate forms (‘*pīl’/‘*piḷ’/‘*pel’). This paper argues that there is sufficient morphophonemic evidence of an ancient Dravidian ‘*piḷ’/‘*pīl’-based root, which meant ‘splitting/crushing’, and was semantically related to the meanings ‘tooth/tusk’. This paper further observes that ‘pīlu’ is among the most ancient and common phytonyms of the toothbrush tree Salvadora persica, which is a characteristic flora of Indus valley, and whose roots and twigs have been widely used as toothbrush in IVC regions since antiquity. This study claims that this phytonym ‘pīlu’ had also originated from the same Proto-Dravidian tooth-word, and argues that since IVC people had named their toothbrush trees and tuskers (elephants) using a Proto-Dravidian tooth-word, and since these names were widely used across IVC regions, a significant population of Indus valley civilization must have used that Proto-Dravidian tooth-word in their daily communication.
    [Show full text]
  • Towards a History of Concept List Compilation in Historical Linguistics
    Towards a history of concept list compilation in historical linguistics Johann-Mattis List Max Planck Institute for the Science of Human History, Jena A large proportion of lexical data of the world’s languages is presented in form of word lists in which a set of concepts was translated into the language varieties of a specific language family or geographic region. The basis of these word lists are concept lists, that is, questionnaires of comparative concepts (in the sense of Haspelmath 2010), which scholars used to elicit the respective translations in their field work. Thus, a concept list is in the end not much more than a bunch of elicitation glosses, often (but not necessarily) based on English as an elicitation language, and a typical concept list may look like the following one, quoted from Swadesh (1950: 161): I, thou, he, we, ye, one, two, three, four, five, six, seven, eight, nine, ten, hundred, all, animal, ashes, back, bad, bark, belly, big, […] this, tongue, tooth, tree, warm, water, what, where, white, who, wife, wind, woman, year, yellow. But scholars may also present their concept list in tabular form, adding additional information in additional columns, numbering and ranking items, providing exemplary translations into other languages, or marking specific items as obsolete. The compilation of concept lists for the purpose of historical language comparison has a long tradition in historical linguistics, dating back at least to the 18th century (Leibniz 1768, Pallas 1786), if not even earlier (see Kaplan 2017). But concept lists were not solely compiled for the purpose of historical language comparison.
    [Show full text]
  • Proof of the Algonquian-Wakashan Relationship
    Sergei L. Nikolaev Institute of Slavic studies of the Russian Academy of Sciences (Moscow/Novosibirsk); [email protected] Toward the reconstruction of Proto-Algonquian-Wakashan. Part 1: Proof of the Algonquian-Wakashan relationship The first part of the present study, following a general introduction (§ 1), presents a classifi- cation and approximate glottochronological dating for the Algonquian-Wakashan languages (§ 2), a preliminary discussion of regular sound correspondences between Proto-Wakashan, Proto-Nivkh, and Proto-Algic (§ 3), and an analysis of the Algonquian-Wakashan “basic lexi- con” (§ 4). The main novelty of the present article is in its attempt at formal demonstration of a genetic relationship between the Nivkh, Algic, and Wakashan languages, arrived at by means of the standard comparative method, i. e. establishing a system of regular sound cor- respondences between the vocabularies of the compared languages. Proto-Salishan is con- sidered as a remote relative of Proto-Algonquian-Wakashan; at the same time, no close (“Mosan”) relationship between Wakashan and Salish has been traced. Additionally, lexical correspondences between Proto-Chukchi-Kamchatkan, Proto-Algonquian-Wakashan, and Proto-Salishan are also reviewed. The conclusion is that no genetic relationship exists be- tween Chukchi-Kamchatkan, on the one hand, and Algonquian-Wakashan, languages (Nivkh included), on the other hand. Instead, it seems more likely that Proto-Chukchi-Kamchatkan has borrowed words from Wakashan, Salishan, and Algic (but probably not vice versa; § 5). The Algonquian-Wakashan, Salishan and Chukchi-Kamchatkan common cultural lexicon is also examined, resulting in the identification of numerous “cultural” loans from Wakashan and Salish into Proto-Chukchi-Kamchatkan. Borrowing from Salishan into Proto-Nivkh was far less intensive, as there are no reliable Nivkh-Wakashan contact words.
    [Show full text]
  • Diachronic Lexical Semantics in Ancient Egyptian–Coptic
    Eitan Grossman (HUJi) + Stéphane Polis (FNRS – ULg) Diachronic lexical semantics in Ancient Egyptian–Coptic The Egyptianness of basic vocabulary in Coptic + Background n It is well known that Coptic, as it comes down to us in written texts, is massively influenced by Greek in the domain of lexicon n The Leipzig-Berlin Dictionary and Database of Greek Loanwords in Coptic project, headed by T. Sebastian Richter, has already recorded c. 5000 loan word types and c. 60.000 loan word tokens + Background n It is well known that Coptic, as it comes down to us in written texts, is massively influenced by Greek in the domain of lexicon n The Leipzig-Berlin Dictionary and Database of Greek Loanwords in Coptic project, headed by T. Sebastian Richter, has already recorded c. 5000 loan word types and c. 60.000 loan word tokens n On this basis, linguists, philologists, and historians often make assumptions about the nature and extent of bilingualism. Some linguists have even proposed that Coptic is a case of ‘code-mixing’ of Egyptian and Greek, which assumes extensive bilingualism among Egyptians in Late Antiquity + Goal of the talk n We tackle this question from another angle, by determining the extent to which Greek influenced Coptic in terms of its basic vocabulary n It may be that we can learn more about bilingualism in Late Antique Egypt this way, since overall lexical borrowing need not correlate with lexical borrowing in the domain of basic vocabulary n As a (significant) side effect of this study, we can also say something about the rate of replacement of basic vocabulary in Egyptian- Coptic across its 4000 years of attestation, as well as the semantic domains and periods in which lexical replacement was faster or slower + Basic vocabulary lists What they are and what they’re for n The most famous lists are that of Morris Swadesh (1909-1967) in the 1950s (with first 225, then 165, 215, 200 and finally meanings) n Basic vocabulary is an intuitive notion, as Swadesh proposed his list on an intuitive basis for his historical-comparative purposes Morris Swadesh, 1950.
    [Show full text]
  • Multilingual Dictionary Based Construction of Core Vocabulary
    Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4211–4217 Marseille, 11–16 May 2020 c European Language Resources Association (ELRA), licensed under CC-BY-NC Multilingual Dictionary-Based Construction of Core Vocabulary Winston Wu, Garrett Nicolai, David Yarowsky Center for Language and Speech Processing Johns Hopkins University (wswu, gnicola2, yarowsky)@jhu.edu Abstract We propose a new functional definition and construction method for core vocabulary sets for multiple applications based on the relative coverage of a target concept in thousands of bilingual dictionaries. Our newly developed core concept vocabulary list derived from these dictionary consensus methods achieves high overlap with existing widely utilized core vocabulary lists targeted at applications such as first and second language learning or field linguistics. Our in-depth analysis illustrates multiple desirable properties of our newly proposed core vocabulary set, including their non-compositionality. We employ a cognate prediction method to recover missing coverage of this core vocabulary in massively multilingual dictionary construction, and we argue that this core vocabulary should be prioritized for elicitation when creating new dictionaries for low-resource languages for multiple downstream tasks including machine translation and language learning. Keywords: core vocabulary, Swadesh list, multilingual dictionaries 1. Introduction 1825 Dictionaries are available for most of the world’s lan- 1800 guages, but coverage can be sparse for those with fewer 1775 resources. In sparse dictionaries, many entries are core 1750 vocabulary words from lists such as the Swadesh list 1725 (Swadesh, 1952; Swadesh, 1955), probably the most well- # of dictionaries known formulation of a core vocabulary containing around 1700 100–200 words, depending on the version.
    [Show full text]
  • An Algorithm for Building Language Superfamilies Using Swadesh Lists
    University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Computer Science and Engineering: Theses, Computer Science and Engineering, Department Dissertations, and Student Research of Spring 4-23-2020 An Algorithm For Building Language Superfamilies Using Swadesh Lists Bill Mutabazi University of Nebraska - Lincoln, [email protected] Follow this and additional works at: https://digitalcommons.unl.edu/computerscidiss Part of the Computer Engineering Commons, and the Computer Sciences Commons Mutabazi, Bill, "An Algorithm For Building Language Superfamilies Using Swadesh Lists" (2020). Computer Science and Engineering: Theses, Dissertations, and Student Research. 191. https://digitalcommons.unl.edu/computerscidiss/191 This Article is brought to you for free and open access by the Computer Science and Engineering, Department of at DigitalCommons@University of Nebraska - Lincoln. It has been accepted for inclusion in Computer Science and Engineering: Theses, Dissertations, and Student Research by an authorized administrator of DigitalCommons@University of Nebraska - Lincoln. AN ALGORITHM FOR BUILDING LANGUAGE SUPERFAMILES USING SWADESH LISTS By Bill Jean Claudien Mutabazi A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfilment of Requirements For the Degree of Master of Science Major: Computer Science Under the Supervision of Professor Peter Revesz Lincoln, Nebraska May 2020 AN ALGORITHM FOR BUILDING LANGUAGE SUPERFAMILES USING SWADESH LIST Bill Jean Claudien Mutabazi, M.S. University of Nebraska, 2020 Adviser: Peter Revesz To make sense of language originality and their evolution over the years is a daunting task. Most of scientific studies had been based on an asymmetric study of word convergence between different languages, borrowing words and common origin word meanings to define their linguistic family classification.
    [Show full text]
  • Lexical Matches Between Sumerian and Hurro-Urartian: Possible Historical Scenarios
    Cuneiform Digital Library Journal 2014:4 <http://www.cdli.ucla.edu/pubs/cdlj/2014/cdlj2014_4.html> © Cuneiform Digital Library Initiative ISSN 1540-8779 Version: 3 December 2014 Lexical Matches between Sumerian and Hurro-Urartian: Possible Historical Scenarios Alexei Kassian Institute of Linguistics of the Russian Academy of Sciences, Moscow Th e paper deals with lexical matches between two ancient Near Eastern languages: Sumerian and Hurrian (Hurro-Urartian); namely, several basic terms (like ‘hand,’ ‘rain,’ etc.), that demonstrate phonetical similarities in both languages, are discussed. Four possible scenarios are evaluated from the typological, etymological and statistical points of view: (1) chance coincidences; (2) lexi- cal borrowings from Sumerian into Hurro-Urartian or vice versa; (3) genetic relationship between Sumerian and Hurro-Urartian; (4) prehistoric language shift : adoption by a Hurro-Urartian (or closely related) group of the Sumerian language or vice versa. Out of these four, two scenarios—lexical borrowings and genetic relationship—are typologically unlikely. Th e statistical probability of chance coincidences is low, although formally this explanation cannot be excluded. Th e fourth scenario—language shift —fi ts linguistic evidence and does not contradict archaeological data. Keywords: Hurrian, Sumerian, Ancient Near East, language contacts, language shift , loanwords, lexicostatistics §1. Introduction a language of the Urartian empire (present-day Armenia §1.1. Th e languages and neighboring areas).2 For the preliterate period, it is §1.1.1. Sumerian is a language spoken in southern Mes- natural to associate the HU people with the Kura-Araxes opotamia (modern Iraq). Its earliest cuneiform attesta- (Early Trans-Caucasian) archaeological culture (Kassian tions date from the late 4th or early 3rd millennium BC, 2010: 423-428 with further references).
    [Show full text]