Reconstructing Language Ancestry by Performing Word Prediction with Neural Networks

MSc Artificial Intelligence
Track: Natural Language Processing
Master Thesis

by Peter Dekker (10820973)
Defense date: 25th January 2018
42 ECTS, January 2017 – January 2018

Supervisors: dr. Jelle Zuidema, prof. dr. Gerhard Jäger
Assessor: dr. Raquel Fernández

Institute for Logic, Language and Computation – University of Amsterdam
Seminar für Sprachwissenschaft – University of Tübingen

Contents

1 Introduction
  1.1 Historical linguistics
    1.1.1 Historical linguistics: the comparative method and beyond
    1.1.2 Sound changes
    1.1.3 Computational methods in historical linguistics
  1.2 Developments in natural language processing
    1.2.1 Natural language processing
    1.2.2 Machine learning and language
    1.2.3 Deep neural networks
  1.3 Word prediction
    1.3.1 Word prediction
    1.3.2 Model desiderata
  1.4 Summary
2 Method
  2.1 Pairwise word prediction
    2.1.1 Task
    2.1.2 Models
    2.1.3 Data
    2.1.4 Experiments
  2.2 Applications
    2.2.1 Phylogenetic tree reconstruction
    2.2.2 Sound correspondence identification
    2.2.3 Cognate detection
  2.3 Summary
3 Results
  3.1 Word prediction
  3.2 Phylogenetic tree reconstruction
  3.3 Identification of sound correspondences
  3.4 Cognate detection
  3.5 Summary
4 Context vector analysis
  4.1 Extraction of context vectors and input/target words
  4.2 PCA visualization
  4.3 Cluster analysis
    4.3.1 Distance matrices
    4.3.2 Clustering
  4.4 Summary
5 Phylogenetic word prediction
  5.1 Beyond pairwise word prediction
  5.2 Method
    5.2.1 Network architecture
    5.2.2 Weight sharing and protoform inference
    5.2.3 Implementation details
    5.2.4 Training and prediction
    5.2.5 Experiments
  5.3 Results
  5.4 Discussion
  5.5 Summary
6 Conclusion and discussion
7 Appendix

Chapter 1
Introduction

How are the languages of the world related and how have they evolved? This is the central question in one of the oldest linguistic disciplines: historical linguistics. Recently, computational methods have been applied to aid historical linguists. Independently, in the computer science and natural language processing community, machine learning methods have become popular: by learning from data, a computer can learn to perform advanced tasks. In this thesis, I will explore the following question:

How can machine learning algorithms be used to predict words between languages and serve as a model of sound change, in order to reconstruct the ancestry of languages?

In this introductory chapter, I will first describe existing methods in historical linguistics. Subsequently, I will cover the machine learning methods used in natural language processing. Finally, I will propose word prediction as a task to apply machine learning to historical linguistics.

1.1 Historical linguistics

1.1.1 Historical linguistics: the comparative method and beyond

In historical linguistics, the task of reconstructing the ancestry of languages is generally performed using the comparative method (Clackson, 2007). The goal is to find genetic relationships between languages, excluding borrowings. Different linguistic levels, like phonology, morphology and syntax, can be taken into account. Mostly, the phonological forms of a fixed list of basic vocabulary are used, containing concepts with a low probability of being borrowed.

Durie and Ross (1996, ch. Introduction) describe the general workflow of the comparative method, as it is followed by many historical linguists.

1. Word lists are gathered for languages which are assumed to be genetically related, based on diagnostic evidence. Diagnostic evidence can be common knowledge: Nichols (1996), for example, argues that historical texts by Slavic speakers show that the Slavic languages have long been assumed to be related. Another possibility is morphological or syntactic evidence.
2. Sets of cognates (words that are ancestrally related) are created.
3. Based on these cognates, sound correspondences are identified: which sounds always change between cognates in two languages?
4. The sound correspondences are used to reconstruct a common protolanguage. First, protosounds are inferred; from these, protoforms of words are reconstructed.
5. Common innovations of sounds and words between groups of languages are identified.
6. Based on the common innovations, a phylogenetic tree is reconstructed: language B is a child of language A, if B has all the innovations that A has.
7. An etymological dictionary can be constructed, taking into account borrowing and semantic change.

This process can be performed in an iterative fashion: cognate sets (2) are updated by finding new sound correspondences (3), and by further steps, the set of languages taken into account (1) can be altered.

Although the comparative method is regarded as the most reliable method to establish genetic relationships between languages, sometimes less constrained methods are applied, to be able to look further back in time (Campbell, 2003, p. 348). One of the most notable is Greenberg's mass lexical comparison (Greenberg, 1957), in which genetic relationships are inferred from the surface forms of words across a large number of languages, instead of from sound correspondences. This workflow is criticized for not distinguishing between genetically related words (cognates) and other effects which could cause similarity of words, such as chance similarity and borrowing (Campbell, 2013).

In this thesis, we will therefore take the comparative method as starting point, and look for computational methods that can automate parts of that workflow.
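
Step 6 of the workflow above can be illustrated with a minimal sketch in Python. It is not taken from the thesis: the language names and innovation labels are hypothetical toy data, and two choices are assumptions made for the example. A proper-subset test is used, so that two languages with identical innovations are not treated as each other's child, and the closest ancestor is taken to be the candidate with the most innovations.

```python
# Hypothetical toy data: each language maps to the set of innovations it shows.
innovations = {
    "A": {"p>f"},
    "B": {"p>f", "t>d"},
    "C": {"p>f", "t>d", "k>h"},
    "D": {"s>h"},
}

def is_descendant(child, ancestor, table):
    """True if `child` has all innovations of `ancestor` and at least one more.

    The proper-subset test (<) is an assumption of this sketch.
    """
    return table[ancestor] < table[child]

def parent_of(lang, table):
    """Return the closest ancestor of `lang`, or None if there is none.

    'Closest' is taken to mean the candidate ancestor with the most innovations.
    """
    candidates = [a for a in table if a != lang and is_descendant(lang, a, table)]
    return max(candidates, key=lambda a: len(table[a]), default=None)

# Map every language to its parent; None marks a root of the tree.
tree = {lang: parent_of(lang, innovations) for lang in innovations}
print(tree)
```

In this toy data, C shares all innovations of both A and B, and is attached to B because B has more innovations. D shares no innovations with the others and stays unattached.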
1.1.2 Sound changes

When comparing words in different languages, sound changes between these words are observed. Most sound changes are assumed to be regular: if a change occurs in a word in a certain context, it should also occur in the same context in another word. The Neogrammarian hypothesis of the regularity of sound change states that “sound change takes place according to laws that admit no exception” (Osthoff and Brugmann, 1880). There are, however, also cases of irregular, sporadic sound change (Durie and Ross, 1996). Regular sound change is crucial for the reconstruction of language ancestry using the comparative method.

I will now discuss possible sound changes, divided into three categories: phonemic changes, loss of segments and insertion/movement of segments. For each sound change, I will mention whether it is regular or sporadic. I will refer to the sound changes and their regularity later, when looking at different computer algorithms that should be able to model these sound changes. This is a compilation of the sound changes described in Hock and Joseph (2009), Trask (1996), Beekes (2011) and Campbell (2013).

Phonemic changes

Phonemic changes are sound changes which change the inventory of phonemes. When a phonemic change occurs, a sound changes into another sound in all words in a language; this is thus a regular change. Two general patterns of phonemic change can be distinguished: mergers and splits. A merger is a phonemic change where two distinct sounds in the phoneme inventory are merged into one, existing or new, sound. A split is a phonemic change where one phoneme splits into two phonemes. Based on these two general patterns on the phoneme inventory level, a number of concrete changes in word forms can occur, of which I will highlight two: assimilation and vowel changes.

Assimilation (regular)
Assimilation is the process where one sound in a word becomes more similar to another sound.
For example, in Latin nokte ‘night’ > Italian notte, /k/ changes to /t/, influenced by the neighboring /t/. Assimilation of sounds more distantly located in a word is also possible. Umlaut in Germanic phonology is an example of this: the first vowel of the word changes to be more similar to the second vowel. For example, the plural of the Proto-Germanic word gast ‘guest’ became gestiz. /a/ changed to /e/ to be more similar to /i/, as both /e/ and /i/ are front vowels. This Proto-Germanic umlaut is still visible in the German plural: Gast/Gäste.

Lenition (regular)
Lenition is a process which reduces articulatory effort and only affects consonants. Lenitive changes include voicing (voiceless to voiced), degemination (geminate, two of the same sounds, to simplex, one sound) and nasalization (non-nasal to nasal). An example of voicing is the change from Latin strata ‘road’ to Italian strada: a voiceless /t/ changes to a voiced /d/. Degemination can be observed in the change from Latin gutta ‘drop’ to Spanish gota. The reduction of articulatory effort can continue so far that a sound is completely omitted, for example in Latin regāle > Spanish real.

Vowel changes (regular)
There are a number of changes where vowels change into other vowels, some of them caused by the aforementioned processes, like lenition. However, some vowel changes should be mentioned explicitly. There are a number of changes in which a vowel acquires a new phonetic feature, like lowering, fronting or rounding. An example of fronting is Basque dut ‘I have it’ > Zuberoan dʏt, where back /u/ changes to near-front /ʏ/. Coalescence is the effect where two identical vowels change into one. Compensatory lengthening is the lengthening of a vowel after the loss of the following consonant, for example Old French bɛst ‘beast’ > French bɛ:t.

Loss of segments

In the previous section, I discussed changes of a sound into another sound. Now, I will cover sound changes where segments are completely lost.

Loss (regular)
There are several types of regular loss. Aphaeresis is loss at the beginning of a word, as in the loss of k in English knee.
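
The regularity of the sound changes discussed above can be made concrete by writing each change as a context-dependent rewrite rule. The following Python sketch is an illustration under simplifying assumptions, not the model used in this thesis: the rule names, the regular-expression contexts and the treatment of written forms as strings of sounds are all hypothetical. The forms okto ‘eight’ and lakte ‘milk’ (Italian otto, latte) are added here to show the same rule applying in the same context in other words; they do not appear in the text above.

```python
import re

# Each rule is a (pattern, replacement) pair. The contexts are simplified
# assumptions made for this sketch, not attested rule formulations.
SOUND_CHANGES = {
    # Assimilation: /k/ becomes /t/ before /t/ (nokte > notte).
    "assimilation": (r"kt", "tt"),
    # Lenition by voicing: /t/ becomes /d/ between vowels (strata > strada).
    "voicing": (r"(?<=[aeiou])t(?=[aeiou])", "d"),
    # Lenition by degemination: a doubled consonant becomes single.
    "degemination": (r"([^aeiou])\1", r"\1"),
    # Aphaeresis: word-initial k is lost before n (knee).
    "aphaeresis": (r"^k(?=n)", ""),
}

def apply_change(name, word):
    """Apply the named rule everywhere its context matches in `word`."""
    pattern, replacement = SOUND_CHANGES[name]
    return re.sub(pattern, replacement, word)

# A regular change applies to every word in which its context occurs.
latin = ["nokte", "okto", "lakte"]
italian = [apply_change("assimilation", w) for w in latin]
print(italian)

print(apply_change("voicing", "strata"))
print(apply_change("degemination", "gutta"))
print(apply_change("aphaeresis", "knee"))
```

Degemination alone turns gutta into guta; the further change of /u/ to /o/ in Spanish gota is a separate vowel change that this sketch does not model.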