CogNet: a Large-Scale Cognate Database


Khuyagbaatar Batsuren† Gábor Bella† Fausto Giunchiglia†§
DISI, University of Trento, Trento, Italy†
Jilin University, Changchun, China§
{k.batsuren; gabor.bella; fausto.giunchiglia}@unitn.it

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3136–3145, Florence, Italy, July 28 - August 2, 2019. © 2019 Association for Computational Linguistics

Abstract

This paper introduces CogNet, a new, large-scale lexical database that provides cognates (words of common origin and meaning) across languages. The database currently contains 3.1 million cognate pairs across 338 languages using 35 writing systems. The paper also describes the automated method by which cognates were computed from publicly available wordnets, with an accuracy evaluated at 94%. Finally, statistics and early insights about the cognate data are presented, hinting at a possible future exploitation of the resource¹ by various fields of linguistics.

¹ The CogNet resource and WikTra tool are available on http://cognet.ukc.disi.unitn.it.

1 Introduction

Cognates are words in different languages that share a common origin and the same meaning, such as the English letter and the French lettre. Cognates and the problem of cognate identification have been extensively studied in the fields of language typology and historical linguistics, as cognates are considered useful for researching the relatedness of languages (Bhattacharya et al., 2018). Cognates are also used in computational linguistics, e.g., for lexicon extension (Wu and Yarowsky, 2018) or to improve cross-lingual NLP tasks such as machine translation or bilingual word recognition (Kondrak et al., 2003; Tsvetkov and Dyer, 2015).

Despite the interest in using cognate data for research, state-of-the-art cognate databases have had limited practical uses from an applied perspective, for two reasons. Firstly, popular cognate-coded databases that are used in historical linguistics, such as ASJP (Wichmann et al., 2010), IELex², or ABVD (Greenhill et al., 2008), cover only the small set of 225 Swadesh basic concepts, although with an extremely wide coverage of up to 4000 languages. Secondly, in these databases, lexical entries that belong to scripts other than Latin or Cyrillic mostly appear in phonetic transcription instead of using their actual orthographies in their original scripts. These limitations prevent such resources from being used in real-world computational tasks on written language.

² Indo-European Lexical Cognacy Database, http://ielex.mpi.nl/

This paper describes CogNet, a new large-scale, high-precision, multilingual cognate database, as well as the method used to build it. Our main technical contributions are (1) a general method to detect cognates from multilingual lexical resources, with precision and recall that can be tuned according to usage needs; (2) a large-scale cognate database containing 3.1 million word pairs across 338 languages, generated with the method above; (3) WikTra, a multilingual transliteration dictionary and library derived from Wiktionary data; and (4) an online platform that lets users explore the resource.

The paper is organised as follows. Section 2 presents the state of the art. Section 3 describes the main cognate discovery algorithm and section 4 the way various forms of evidence used by the algorithm are computed. The method is parametrised and the results are evaluated in section 5. Section 6 describes the resulting CogNet database in terms of structure and statistical insights. Finally, section 7 concludes the paper.

2 State of the Art

To our knowledge, cognates have so far been defined and explored in two fundamental ways by two distinct research communities. On the one hand, cognate identification has been studied within linguistic typology and historical linguistics. On the other hand, computational linguists have been researching methods for cognate production.

The very definition of the term 'cognate' varies according to the research community. In historical linguistics, cognates must have a provable etymological relationship and must be translated into each language (Bhattacharya et al., 2018). Accordingly, the English skyscraper and the German Wolkenkratzer are considered as cognates but the English song and the Japanese ソング (/songu/) are not. In computational linguistics, the notion of cognate is more relaxed with respect to etymology and loanwords are also considered as cognates (Kondrak et al., 2003). For our work we adopted the latter, computational point of view.

In historical linguistics, cognate identification methods proceed in two main steps. First, a similarity matrix of all words is estimated by three types of similarity measures: semantic similarity, phonetic similarity, and orthographic similarity. For information on semantic similarity, special-purpose multilingual dictionaries, such as the well-known Swadesh List, are used. For orthographic similarity, string metrics (Hauer and Kondrak, 2011; St Arnaud et al., 2017) are often employed, e.g., edit distance, Dice's coefficient, or LCSR. As these methods do not work across scripts, they are complemented by phonetic similarity, exploiting transformations and sound changes across related languages (Kondrak, 2000; Jäger, 2013; Rama et al., 2017). Phonetic similarity measures, however, require phonetic transcriptions to be a priori available. More recently, historical linguists have started exploiting identified cognates to infer phylogenetic relationships across languages (Rama et al., 2018; Jäger, 2018).

In computational linguistics, cognate production consists of finding for a word in a given language its cognate pair in another language. State-of-the-art methods (Beinborn et al., 2013; Sennrich et al., 2016) have employed character-based machine translation, trained from parallel corpora, to produce cognates or transliterations. Wu and Yarowsky (2018) also employ similar techniques, as well as multilingual dictionaries, to produce large-scale cognate clusters for Romance and Turkic languages. Although the cognates produced in this manner are, in principle, a good source for improving certain cross-lingual tasks in NLP, the quality of the output often suffers because certain linguistic phenomena are not handled properly. For example, words in languages such as Arabic or Hebrew are written without vowels and machine-produced transliterations often fail to vowelize such words (Karimi et al., 2011). The solution we propose is the use of a dictionary-based transliteration tool over machine transliteration.

Our method provides new contributions for both research directions. Firstly, to our knowledge no other work on cognate generation has so far used high-quality multilingual lexical resources on a scale as large as ours, covering hundreds of languages and more than 100,000 cross-lingual concepts. Secondly, this large cross-lingual coverage could only be achieved thanks to a robust transliteration tool that is part of the contributions of our paper. Finally, our novel, combined use of multiple (orthographic, semantic, geographic, and etymological) sources of evidence for detecting cognates was crucial to obtain high-quality results, in terms of both precision and recall.

3 The Algorithm

For our work we have adopted a computational-linguistic interpretation of the notion of cognate (Kondrak et al., 2003): two words in different languages are cognates if they have the same meaning and present a similarity in orthography, resulting from a supposed underlying etymological relationship (common ancestry or borrowing).

Following this interpretation, our algorithm rests on three main principles: (1) semantic equivalence, i.e., that the two words share a common meaning; (2) sufficient proof of etymological relatedness; and (3) the logical transitivity of the cognate relationship.

The core resource for obtaining cross-lingual evidence on semantic equivalence, i.e., the sameness of word meanings, is the Universal Knowledge Core (UKC), a large multilingual lexico-semantic database (Giunchiglia et al., 2018) already used both in linguistics research and in practical applications (Bella et al., 2016; Giunchiglia et al., 2017; Bella et al., 2017). The UKC includes the lexicons and lexico-semantic relations for 338 languages, containing 1,717,735 words and 2,512,704 language-specific word meanings. It was built from wordnets (Miller, 1995) and wiktionaries converted into wordnets (Bond and Foster, 2013). As all of the resources composing the UKC were built and validated by humans (Giunchiglia et al., 2015), we consider the quality of our input data to be high enough for obtaining accurate results on cognates (Giunchiglia et al., 2017). As most wordnets map their units of meaning (synsets in WordNet terminology) to English meanings, they can effectively be interconnected into a cross-lingual lexical resource. The UKC reifies all of these mappings as supra-lingual lexical concepts (107,196 in total, excluding named entities such as Ulaanbaatar). For example, if the German Fahrrad and the Italian bicicletta are mapped to the English bicycle then a single concept is created to which all three language-specific meanings (i.e., wordnet synsets) will be mapped.

Algorithm 1: Cognate Discovery Algorithm
Input : c, a lexical concept
Input : R, a lexical resource
Output : G+, graph of all cognates of c
 1 V, E ← ∅, ∅;
 2 L ← Languages_R(c);
 3 for each language l ∈ L do
 4     for each word w ∈ Words_R(c, l) do
 5         V ← V ∪ {v = <w, l>};
 6 for each node v1 = <w1, l1> ∈ V do
 7     for each node v2 = <w2, l2> ∈ V do
 8         if l1 = l2 then
 9             continue;
10         if EtyRel(w1, l1, w2, l2) then
11             E ← E ∪ {e = <v1, v2>};
12         else if OrthSim(w1, l1, w2, l2) + T_G × GeoProx(l1, l2) > T_F then
13             E ← E ∪ {e = <v1, v2>};
14 G ← <V, E>;
15 G+ = TransitiveClosure(G)
16 return G+;

In terms of etymological evidence, we use both direct and indirect evidence of etymological relatedness. Direct evidence is provided by gold-standard etymological resources, such as the one we use and present in section 4.1.