Open-Source Morphology for Endangered Mordvinic Languages

Total Page:16

File Type:pdf, Size:1020Kb

Open-Source Morphology for Endangered Mordvinic Languages Open-Source Morphology for Endangered Mordvinic Languages Jack Rueter Mika Hämäläinen Niko Partanen Dept. of Digital Humanities Dept. of Digital Humanities Dept. of Finnish, University of Helsinki University of Helsinki Finno-Ugrian [email protected] and Rootroo Ltd and Scandinavian Studies [email protected] University of Helsinki [email protected] Abstract languages. 2002 saw the publication of the first monolingual dictionary of Erzya (Abramov, 2002), This document describes shared development and the manuscript was proclaimed open by the of finite-state description of two closely re- author for future development. The Mordvin lan- lated but endangered minority languages, Erzya and Moksha. It touches upon mor- guages have continued to receive a fair share of pholexical unity and diversity of the two lan- linguistic research interest in the recent years (Luu- guages and how this provides a motivation for tonen, 2014; Hamari and Aasmäe, 2015; Kashkin shared open-source FST development. We de- and Nikiforova, 2015; Grünthal, 2016). scribe how we have designed the transducers After the release first finite-state transducer for so that they can benefit from existing open- the closely related Komi-Zyrian (Rueter, 2000), source infrastructures and are as reusable as possible. it was only obvious that similar work should be done for Erzya Mordvin. Fortunately, over the 1 Introduction past decade there has been an increasing number of publications on Erzya, relating to its morphol- There are over 5000 languages spoken world wide, ogy (Rueter, 2010), its OCR tools (Silfverberg and and a vast majority of them are endangered (see Rueter, 2015) and universal dependencies (Rueter Moseley 2010). The Mordvinic languages Erzya and Tyers, 2018). and Moksha are no exception. One of the first NLP This document discusses open-source morphol- solutions that are typically developed along with ogy development, which has greatly benefited from lexical resources for any low-resourced language open-source projects, most notably achievements is a morphological analyzer (cf Zueva et al. 2020; attributed to the GiellaLT infrastructure (Mosha- Tyers et al. 2019; Lovick et al. 2018). gen et al., 2014), i.e. Giellatekno & Divvun at the In this paper, we describe the development of Norwegian Arctic University in Tromsø, Norway. an open-source FST (finite-state transducer) based morphological analyzer, lemmatizer and generator It is also very important that people new to lan- for Erzya and Moksha. We highlight the impor- guage documentation be given opportunities to de- tance of certain design decisions to ensure the com- velop their understanding of languages through par- patibility of our transducers with existing systems. ticipation in projects. Here we must mention an In addition, we will describe how the transducers important span of time 1988–1997, during which can be edited in a system that creates an abstrac- the first author did word processing for the 2073- tion layer behind a graphical user interface for FST page ’Dictionary of Mordvin Dialects’ by Heikki development. Paasonen. The unity and diversity of the Mordvin liter- The transducers are available on GitHub for ary languages of today, Erzya and Moksha, has Erzya 1 and Moksha 2. The nightly builds are avail- been a subject of research for over two hundred able through a Python library called UralicNLP3 years. The first grammars were published in 1830s (Hämäläinen, 2019). – Moksha in 1838 (Ornatov, 1838) and Erzya in 1838–1839 (Gabelentz, 1839). The subsequent 1https://github.com/giellalt/lang-myv 180 years brought scholars for fieldwork, gram- 2https://github.com/giellalt/lang-mdf mars, dictionaries, and the popularization of the 3https://github.com/mikahama/uralicNLP 94 Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 94–100 Virtual Conference, November 19, 2020. c 2020 Association for Computational Linguistics input output 2 Designing for a Reusable API вирев+A+Sg+Nom+Indef analyzer вирев вирь+N+SP+Lat+Indef It is not uncommon that similar tools and methods generator вирев+A+Sg+Nom+Indef вирев are developed by different research groups in differ- Table 1: Examples of GiellaLT style input and output ent parts of the world for language documentation purposes. To name a few, there are projects de- veloping similar language documentation systems HFST based transducers, and in fact Apertium type for African languages (Jones and Muftic, 2020), transducers can be compiled to HFST form as well Indonesian languages (Nasution et al., 2018) and (Pirinen and Tyers, 2012). However, their tagset Yupik (Hunt et al., 2019), while at the same time, is different from that of GiellaLT. This is solved it has been shown that digital humanities projects by applying filters at the time the FST is compiled that are seemingly different, still face the very same to produce a separate, Apertium-compatible, trans- technical problems (Mäkelä et al., 2020). In ar- ducer automatically. guably, any work conducted with endangered lan- guages to document them and better serve the small 3 Erzya and Moksha language pair community of speakers has a lot of value. However, At the moment Erzya and Moksha infrastructure in the fact that the wheel gets reinvented over and GiellaLT can be considered to be in nearly equal over again leads to fragmentation in the resources standing, despite the fact that work on Erzya was available for smaller languages and makes their use started considerably earlier. This does not mean, more difficult as each individual tool exposes an however, that the resources for both languages are API of their own. identical in all measures, although in basic numeric Rule-based morphology can be implemented in levels their sizes are comparable. There are many many different systems. Popular tools include situations where additional consistency between XFST (Karttunen et al., 1997), Foma (Hulden, the infrastructures for these two languages would 2009) and OpenFST (Allauzen et al., 2007). How- be quite desirable and beneficial. We discuss next ever, we use HFST (Helsinki-Finite State Technol- what these instances could be, and outline some ogy) (Lindén et al., 2009) because it is the system of the major questions while working with this used in GiellaLT (Moshagen et al., 2014). language pair. GiellaLT provides our transducers with a list of The open-source machine translation infrastruc- quality attributes. Their infrastructure consists of ture Apertium uses a shallow-transfer strategy. By work done for multiple endangered languages in definition, shallow transfer makes a morphosyntac- such a way that the morphological resources can tic analysis of the source language, translates the be used in a multitude of different contexts such as lemmas from source to target language, generates disambiguation (Trosterud, 2004; Ens et al., 2019), morphology acquired from the source language in dependency parsing (Antonsen et al., 2010), on- the target language and makes adjustments to the line dictionaries (Rueter and Hämäläinen, 2019), syntax of the target language. This concept of paral- spell checkers (Wiechetek et al., 2019), online cre- lel lexica, morphology and syntax, would therefore, ative writing tools (Hämäläinen, 2018), automated seem most effective in translation between closely news generation (Alnajjar et al., 2019) and lan- related languages. In the case of the Mordvin pair, guage learning tools (Antonsen and Argese, 2018). this means the use of mutual tags for describing In order to gain the added benefit from the Giel- mutual phenomena. laLT infrastructure, we have to design our transduc- Mutual phenomena in Erzya and Moksha, how- ers so that they are compatible with HFST and that ever, can be distinguished as exact matches and they follow a certain morphological tagset and that fuzzy matches, as it were. While there is no doubt they take the input and output in a certain format. that the two languages share nearly the same cat- These requirements define the API our transducers egories of case, person, number and definiteness, need to implement. An example of this can be seen it must also be noted that the paradigms do not in Table1. share the same cellular structure (Keresztes 1999; Apertium (Forcada et al., 2011) is another open- Trosterud 2006; Rueter 2016). When the paradigm source system that uses FST transducers for rule- structure is diverse, it is suggested that a union based machine translation. They use their own of morphological tags be taken as a starting point. transducer format, but fortunately also support When one language distinguishes a category of 95 +Nom +Gen +Dat +Nom +Gen +Dat Sg+PxSg1 (my son) цёразе цёразень цёразти цёрам туртов цёрам цёране цёранень цёраненди Sg+PxSg1 (my son) цёрам ∼ цёрань туртов Sg+PxSg1 (my sons) ∼ цёрань Sg+PxSg2 (your son) цёраце цёрацень цёрацти ∼ цёранень цёратне цёратнень цёратненди цёран цёран цёран туртов Pl+PxSg2 (your sons) Pl+PxSg1 (my sons) Sg+PxSg3 (his/her son) цёрац цёранц цёранцты ∼ цёрам ∼ цёрам цёрам туртов Pl+PxSg3 (his/her son) цёранза цёранзон цёранзонды цёрать цёрать туртов SP+PxPl1 (our son/sons) цёраньке цёраньконь цёраньконди Sg+PxSg2 (your son) цёрат SP+PxPl2 (your son/sons) цёранте цёрантень цёрантенди ∼ цёрат ∼ цёратень SP+PxPl3 (their son/sons) цёрасна цёраснон цёраснонды PxSg2 Pl (your son) цёрат цёрат цёрат туртов цёранстэнь ∼ PxSg3 Sg (his/her son) цёразо цёранзо цёранзо туртов Table 2: Symmetric possessive declension of Moksha цёранстэнь ∼ PxSg3 Pl (his/her sons) цёранзо цёранзо core cases цёранзо туртов SP+PxPl1 (our son/sons) цёранок цёранок цёранк туртов SP+PxPl2 (your son/sons) цёранк цёранк цёранк туртов цёранстэнь ∼ SP+PxPl3 (their son/sons) цёраст цёраст number, for instance, and the other does not, there цёраст туртов comes a point where number must be determined. And, in order to facilitate a transition from the ab- Table 3: Asymmetric possessive declension of Erzya sence of the category of number to its presence core cases and vice versa, the rudiments of tagging this cate- Obj+Sg1 Obj+Pl1 gory must be put in place, e.g.
Recommended publications
  • Финно-Угроведение, 2020 (№ 61) Научный Журнал Основан В Феврале 1994 Г
    ФИННО-УГРОВЕДЕНИЕ, 2020 (№ 61) Научный журнал Основан в феврале 1994 г. ISSN 2312-0312 (Print), ISSN 2713-2250 (Online) Учредитель – Государственное бюджетное научное учреждение при Правительстве Республики Марий Эл «Марийский научно-исследовательский институт языка, литературы и истории им. В.М. Васильева» Редакционный совет: Кузьмин Евгений Петрович, председатель ред. совета, канд. ист. н., директор МарНИИЯЛИ (г. Йошкар-Ола) Пенькова Мария Викентьевна, секретарь ред. совета, канд. филол. н., зам. директора-ученый секретарь МарНИИЯЛИ (г. Йошкар-Ола) Агранат Татьяна Борисовна, д-р филол. н., ведущий научный сотрудник Института языкознания РАН (г. Москва) Воронцов Владимир Степанович, канд. ист. н., старший научный сотрудник Удмуртского института истории, языка и литературы УрО РАН (г. Ижевск) Зайцева Нина Григорьевна, д-р филол. н., зав. сектором языкознания Института языка, литературы и истории КарНЦ РАН (г. Петрозаводск) Косинцева Елена Викторовна, д-р филол. н., доцент, главный научный сотрудник Обско-угорского института прикладных исследований и разработок (г. Ханты-Мансийск) Крыласова Наталья Борисовна, д-р ист. н., главный научный сотрудник Пермского федерального исследовательского центра УрО РАН (г. Пермь) Куршева Галина Александровна, д-р ист. н., директор Научно-исследовательского института гуманитарных наук при Правительстве Республики Мордовия (г. Саранск) Лехтинен Илдико, д-р философии, доцент Хельсинкского университета (г. Хельсинки, Финляндия) Ошаев Алексей Григорьевич, канд. ист. н., декан историко-филологического факультета Марийского государственного университета (г. Йошкар-Ола) Пустаи Янош, профессор, д-р философии, председатель Ассоциации финно-угорских литератур, директор Института Collegium Fenno-Ugricum (г. Будапешт, Венгрия) Ракин Анатолий Николаевич, д-р филол. н., главный научный сотрудник Института языка, литературы и истории Коми НЦ УрО РАН (г. Сыктывкар) Сергеев Олег Арсентьевич, канд. филол. н., старший научный сотрудник МарНИИЯЛИ (г.
    [Show full text]
  • Multilingual Facilitation
    Multilingual Facilitation Honoring the career of Jack Rueter Mika Hämäläinen, Niko Partanen and Khalid Alnajjar (eds.) Multilingual Facilitation This book has been authored for Jack Rueter in honor of his 60th birthday. Mika Hämäläinen, Niko Partanen and Khalid Alnajjar (eds.) All papers accepted to appear in this book have undergone a rigorous peer review to ensure high scientific quality. The call for papers has been open to anyone interested. We have accepted submissions in any language that Jack Rueter speaks. Hämäläinen, M., Partanen N., & Alnajjar K. (eds.) (2021) Multilingual Facilitation. University of Helsinki Library. ISBN (print) 979-871-33-6227-0 (Independently published) ISBN (electronic) 978-951-51-5025-7 (University of Helsinki Library) DOI: https://doi.org/10.31885/9789515150257 The contents of this book have been published under the CC BY 4.0 license1. 1 https://creativecommons.org/licenses/by/4.0/ Tabula Gratulatoria Jack Rueter has been in an important figure in our academic lives and we would like to congratulate him on his 60th birthday. Mika Hämäläinen, University of Helsinki Niko Partanen, University of Helsinki Khalid Alnajjar, University of Helsinki Alexandra Kellner, Valtioneuvoston kanslia Anssi Yli-Jyrä, University of Helsinki Cornelius Hasselblatt Elena Skribnik, LMU München Eric & Joel Rueter Heidi Jauhiainen, University of Helsinki Helene Sterr Henry Ivan Rueter Irma Reijonen, Kansalliskirjasto Janne Saarikivi, Helsingin yliopisto Jeremy Bradley, University of Vienna Jörg Tiedemann, University of Helsinki Joshua Wilbur, Tartu Ülikool Juha Kuokkala, Helsingin yliopisto Jukka Mettovaara, Oulun yliopisto Jussi-Pekka Hakkarainen, Kansalliskirjasto Jussi Ylikoski, University of Oulu Kaisla Kaheinen, Helsingin yliopisto Karina Lukin, University of Helsinki Larry Rueter LI Līvõd institūt Lotta Jalava, Kotimaisten kielten keskus Mans Hulden, University of Colorado Marcus & Jackie James Mari Siiroinen, Helsingin yliopisto Marja Lappalainen, M.
    [Show full text]
  • FENNO SUECANA FENNO-UGRICA SUECANA Nova Series 15 UGRICA
    FENNO-UGRICA SUECANA Nova series Journal of Fenno -Ugric R esearch in Scandinavia 15 Institutionen för slaviska och baltiska språk, finska, nederländska och tyska Stockholm 2016 FENNO-UGRICA SUECANA Nova series Journal of Fenno-Ugric Research in Scandinavia 15 Editor-in-chief: Jarmo Lainio Issue editors: Peter S. Piispanen & Merlijn de Smit Editorial board: Jarmo Lainio, Stockholm Peter S. Piispanen, Stockholm Merlijn de Smit, Stockholm/Turku Stockholm 2016 © 2016 The authors Institutionen för slaviska och baltiska språk, finska, nederländska och tyska ISSN 0348-3045 ISBN 978-91-981559-0-7 FENNO-UGRICA SUECANA – Nova series 15 ARTICLES • Peter Piispanen : Statistical Dating of Finno-Mordvinic Languages through Comparative Linguistics and Sound Laws, p. 1 – 58 • Ante Aikio & Jussi Ylikoski: The origin of the Finnic l-cases, p. 59 – 158 • Håkan Rydving: Sydsamisk eller umesamisk? ”Södra Tärna” i det samiska språklandskapet, p. 159 – 174 • Riitta-Liisa Valijärvi : Ruotsinsuomalaisten opiskelijoiden kirjallisen tuotoksen morfosyntaksin ja sanaston virheanalyysia, p. 175 – 200 REVIEW No reviews in this issue. REPORTS • Lasse Vuorsola : Atmosfärförändring inom klimatdebatten, p. 201 – 207 REVIEW ARTICLE No review articles in this issue. FUS 15, 2016, ISSN 0348-3045, ISBN 978-91-981559-0-7 (online) Fenno-Ugrica Suecana – Nova Series This is the second issue of the revitalized Fenno-Ugrica Suecana series (now additionally termed ‘Nova Series‘), which continues to focus on issues concerning Fenno-Ugristics and Fennistics, in a wide sense. We invite researchers, teachers and other scientifically interested persons to send us contributions for publication. The planned annual deadlines are October 15th in the fall and February 15th in the spring.
    [Show full text]
  • Περì O̓ρθότητος Ἐ Τύμων Uusiutuva Uralilainen Etymologia
    ΠΕΡI OΡΘΌΤΗΤΌΣ EΤΎΜΩΝ UUSIUTUVA URALILAINEN ETYMOLOGIA Uralica Helsingiensia11 Περì ̓ ρθότητοςo ἐ τύμων Uusiutuva uralilainen etymologia TOIMITTANEET / EDITED BY SAMPSA HOLOPAINEN & JANNE SAARIKIVI HELSINKI 2018 Sampsa Holopainen & Janne Saarikivi (eds): Περı` o̓ ρθότητος ἐ τύμων. Uusiutuva uralilainen etymologia. Uralica Helsingiensia 11. Cover Rigina Ajanki Layout Mari Saraheimo English proofreading Alexandra Kellner ISBN 978-952-7262-04-7 (print) ISBN 978-952-7262-05-4 (online) ISSN 1797-3945 Orders • Tilaukset Printon Tiedekirja <www.tiedekirja.fi> Tallinn 2018 Snellmaninkatu 13 <[email protected]> FI-00170 Helsinki Uralica Helsingiensia Uralica Helsingiensia is a series published by the Finno-Ugrian Society. It features mono- graphs and thematic collections of articles with a research focus on Uralic languages, and it also covers the linguistic and cultural aspects of Estonian, Hungarian and Saami studies at the University of Helsinki. The series has a peer review system, i.e. the manuscripts of all articles and monographs submitted for publication will be refereed by two anonymous reviewers before binding decisions are made concerning the publication of the material. Uralica Helsingiensia on Suomalais-Ugrilaisen Seuran julkaisema sarja, jossa ilmestyy mono- grafioita ja temaattisia artikkelikokoelmia. Niiden aihepiiri liittyy uralilaisten kielten tutkimuk- seen ja kattaa myös Helsingin yliopistossa tehtävän Viron kielen ja kulttuurin, Unkarin kielen ja kulttuurin sekä saamentutkimuksen. Sarjassa noudatetaan refereekäytäntöä, eli
    [Show full text]
  • Book of Abstracts
    Congressus Duodecimus Internationalis Fenno-Ugristarum, Oulu 2015 Book of Abstracts Edited by Harri Mantila Jari Sivonen Sisko Brunni Kaisa Leinonen Santeri Palviainen University of Oulu, 2015 Oulun yliopisto, 2015 Photographs: © Oulun kaupunki ja Oulun yliopisto ISBN: 978-952-62-0851-0 Juvenes Print This book of abstracts contains all the abstracts of CIFU XII presentations that were accepted. Chapter 1 includes the abstracts of the plenary presentations, chapter 2 the abstracts of the general session papers and chapter 3 the abstracts of the papers submitted to the symposia. The abstracts are presented in alphabetical order by authors' last names except the plenary abstracts, which are in the order of their presentation in the Congress. The abstracts are in English. Titles in the language of presentation are given in brackets. We have retained the transliteration of the names from Cyrillic to Latin script as it was in the original papers. Table of Contents 1 Plenary presentations 7 2 Section presentations 19 3 Symposia 197 Symp. 1. Change of Finnic languages in a multilinguistic environment .......................................................................... 199 Symp. 2. Multilingual practices and code-switching in Finno-Ugric communities .......................................................................... 213 Symp. 3. From spoken Baltic-Finnic vernaculars to their national standardizations and new literary languages – cancelled ...... 231 Symp. 4. The syntax of Samoyedic and Ob-Ugric languages ...... 231 Symp. 5. The development
    [Show full text]
  • Kone Foundation Te Language Programme 2012–2016
    Kone Foundation !e Language Programme – Kone Foundation !e Language Programme .. Patrik Söderlund Mynäprint, Mynämäki "e Kone Foundation Language Programme – !"#$%#$& Background Kone Foundation Places Emphasis on the Minority Languages of the Baltic Region and on Finnish Supporting Multilingualism with the Help of the Language Programme "e Current Support for Finno-Ugrian Minorities and Languages Mission Vision "e Mission: Documentation of Small Finno-Ugrian Languages, Finnish, and Minority Languages in Finland Enhancing the Availability, Accessibility and Usability of Corpora Taking into Consideration All Kinds of Language Users Advancing a Culture of Openness and Interaction Advancing Multidisciplinarity and Crossing New Boundaries How Are the Objectives Achieved? Concrete Measures – Communication and Cooperation between Projects Projects Funded "rough Annual Grant Application Rounds, Specially Allocated Grant Calls, and Projects Initiated by the Foundation An Example of the Use and Development of Language Technology Applications Monitoring the Programme Koneen Säätiön ovat perustaneet vuorineuvos Heikki H. Herlin ja eko- nomi Pekka Herlin vuonna . Perustajat toimivat samanaikaisesti sekä säätiön että Kone Osakeyhtiön johdossa, joten kahden organisaation yh- teys oli läheinen. Nykyisin Koneen Säätiö on itsenäinen ja riippumaton yleishyödyllinen organisaatio, joka edistää tieteellistä tutkimustyötä, eri- tyisesti humanistista, yhteiskuntatieteellistä ja ympäristöntutkimusta, sekä taidetta ja kulttuuria Suomessa.
    [Show full text]
  • The Grown-Up Siblings: History and Functions of Western Uralic *Kse
    Rigina Ajanki University of Helsinki The grown-up siblings: history and functions of Western Uralic *kse In this paper, it is claimed that the case suffix *kse, known as translative, dates back to the Finnic-Mordvin proto language, where it functioned as a functive� It is illustrated using synchronic data from Finnic-Mordvin languages that the functions of *kse do not display an inherent feature of directionality ‘into’, or in other terms, lative� It is even possible that the suffix was neutral with respect to time stability, as it is in contemporary Erzya� Further, it is assumed that since the Northern Finnic languages have acquired a new stative case, the functive labelled essive *nA, formerly applied as an intralocal case, the functions of *kse have changed in these languages: *kse has become mainly the marker of a transformative, with an inherent feature of dynamicity� 1� Introduction 5�1� Translative with 2� Typological background: stative copula in Finnish: *kse as a functive similatives and functives 3� Translatives in the case systems 5�2� Finnish ditransitive of Finnic-Mordvin languages constructions displaying 3�1� The Finnish case translative-essive variation system revisited 5�3� Erzya and Finnish expressions 3�2� Functives as of order in translative non-verbal predicates 6� The developmental path 4� The emergence of *kse of translative *kse as a case suffix 7� The emergence of essive 5� The functions of *kse and its consequences for in contemporary the functions of *kse Finnic-Mordvin languages 1. Introduction The translative
    [Show full text]
  • Baltic Loanwords in Mordvin
    Riho Grünthal Department of Finnish, Finno-Ugrian and Scandinavian Studies University of Helsinki Baltic loanwords in Mordvin Linguists have been aware of the existence of Baltic loanwords in the Mord- vinic languages Erzya (E) and Moksha (M) since the 19th century. However, the analysis and interpretation of individual etymologies and the contacts between these two language groups have been ambiguous, as the assumptions on the place and age of the contacts have changed. The assertions on the prehistoric development and early language contacts between the Finno-Ugric (Uralic) and Indo-European languages have changed as well. The main evidence concerning early Baltic loanwords in the Finno-Ugric languages is drawn from the Finnic languages, which are located geographically further west relative to Mordvinic. The high number of early Baltic loanwords in the Finnic languages suggests that the most intensive contacts took place between the early varieties of the Finnic and Baltic languages and did not infl uence other Finno-Ugric languages to the same extent. In principle, the continuity of these contacts extends until the modern era and very recent contacts between Estonian, Livonian, and Latvian that are geographical neighbors and documented languages with a concrete geo- graphical distribution, historical and cultural context. The Baltic infl uence on the Saamic and Mordvinic languages was much less intensive, as evidenced by the considerably lesser number of loanwords. Moreover, the majority of Baltic loanwords in Saamic are attested in the Finnic languages as well, whereas the early Baltic infl uence on the Mordvinic lan- guages diverges from that on the Finnic languages.
    [Show full text]
  • Book of Abstracts
    Congressus Duodecimus Internationalis Fenno-Ugristarum, Oulu 2015 Book of Abstracts Edited by Harri Mantila Jari Sivonen Sisko Brunni Kaisa Leinonen Santeri Palviainen University of Oulu, 2015 Oulun yliopisto, 2015 Photographs: © Oulun kaupunki ja Oulun yliopisto ISBN: 978-952-62-0851-0 Juvenes Print This book of abstracts contains all the abstracts of CIFU XII presentations that were accepted. Chapter 1 includes the abstracts of the plenary presentations, chapter 2 the abstracts of the general session papers and chapter 3 the abstracts of the papers submitted to the symposia. The abstracts are presented in alphabetical order by authors' last names except the plenary abstracts, which are in the order of their presentation in the Congress. The abstracts are in English. Titles in the language of presentation are given in brackets. We have retained the transliteration of the names from Cyrillic to Latin script as it was in the original papers. Table of Contents 1 Plenary presentations 7 2 Section presentations 19 3 Symposia 199 Symp. 1. Change of Finnic languages in a multilinguistic environment .......................................................................... 201 Symp. 2. Multilingual practices and code-switching in Finno-Ugric communities .......................................................................... 215 Symp. 3. From spoken Baltic-Finnic vernaculars to their national standardizations and new literary languages – cancelled ...... 233 Symp. 4. The syntax of Samoyedic and Ob-Ugric languages ...... 233 Symp. 5. The development
    [Show full text]
  • Information-Theoretic Causal Inference of Lexical Flow
    Information-theoretic causal inference of lexical flow Johannes Dellert language Language Variation 4 science press Language Variation Editors: John Nerbonne, Martijn Wieling In this series: 1. Côté, Marie-Hélène, Remco Knooihuizen and John Nerbonne (eds.). The future of dialects. 2. Schäfer, Lea. Sprachliche Imitation: Jiddisch in der deutschsprachigen Literatur (18.–20. Jahrhundert). 3. Juskan, Martin. Sound change, priming, salience: Producing and perceiving variation in Liverpool English. 4. Dellert, Johannes. Information-theoretic causal inference of lexical flow. ISSN: 2366-7818 Information-theoretic causal inference of lexical flow Johannes Dellert language science press Dellert, Johannes. 2019. Information-theoretic causal inference of lexical flow (Language Variation 4). Berlin: Language Science Press. This title can be downloaded at: http://langsci-press.org/catalog/book/233 © 2019, Johannes Dellert Published under the Creative Commons Attribution 4.0 Licence (CC BY 4.0): http://creativecommons.org/licenses/by/4.0/ ISBN: 978-3-96110-143-6 (Digital) 978-3-96110-144-3 (Hardcover) ISSN: 2366-7818 DOI:10.5281/zenodo.3247415 Source code available from www.github.com/langsci/233 Collaborative reading: paperhive.org/documents/remote?type=langsci&id=233 Cover and concept of design: Ulrike Harbort Typesetting: Johannes Dellert Proofreading: Amir Ghorbanpour, Aniefon Daniel, Barend Beekhuizen, David Lukeš, Gereon Kaiping, Jeroen van de Weijer, Fonts: Linux Libertine, Libertinus Math, Arimo, DejaVu Sans Mono Typesetting software:Ǝ X LATEX Language Science Press Unter den Linden 6 10099 Berlin, Germany langsci-press.org Storage and cataloguing done by FU Berlin Contents Preface vii Acknowledgments xi 1 Introduction 1 2 Foundations: Historical linguistics 7 2.1 Language relationship and family trees .............
    [Show full text]
  • Finno-Ugric Languages in Time and Space Tartu, Estonia, February 1-3, 2015
    Ariste Conference Finno-Ugric Languages in Time and Space Tartu, Estonia, February 1-3, 2015 On the Morphological Alignment of Erzya and Moksha Verbal Paradigms Feb. 3, 2015 Jack Rueter, PhD. University of Helsinki Finno-Ugric Languages in Time and Space => • Финнэнь-угрань кельтне: • Шкастонзо • Тарказост аравтозь => • Finno-Ugric Languages • In time • In their place Feb. 3, 2015 Jack Rueter, PhD. University of Helsinki Not “Why” but “What for” • Open morphological analyzers • Giellatekno.uit.no • Robust • Requires more analyses to disambiguate itself • Open syntactical disambiguation • Constraint Grammar course in the works for Autumn 2015 • Open text to speech, Open translation Feb. 3, 2015 Jack Rueter, PhD. University of Helsinki • Language affinity statistics • Morphology and Function • Moods and tenses • First Preterit • Non-ambiguous forms Feb. 3, 2015 Jack Rueter, PhD. University of Helsinki Statistics we have seen or heard on Erzya and Moksha language affinity • Vocabulary statistics for Erzya and Moksha: • 80% affinity (Keresztes 2011: 118). • Actual figure 80.80% (Zaicz 1990: 7-13). • 116 words: 92.42% • 902 Erzya words: 685 Moksha cognates • 75.94% • 1039 Moksha words: 890 Erzya cognates • 85.66% Feb. 3, 2015 Jack Rueter, PhD. University of Helsinki Statistics few have counted on Mordvin affinity • Paasonen’s “Mordwinisches Wörterbuch” 1990-1996, 2703 pages • 6952 root entries containing cognate information • Erzya 5302 = 76% • Moksha 4808 = 69% • Intersection 3158 = 45% • Stem entries: 21,754 • Erzya 14,524 = 67% • Moksha 12,945 = 59% • Intersection 5,785 = 27% Feb. 3, 2015 Jack Rueter, PhD. University of Helsinki Mutual Mordvin verb inventory based on Paasonen’s dictionary (1990-1996) Macro root articles (6,952) Erzya Moksha Intersection Union 1210 1198 734 1674 72.2% 71.5% 44% 100% Stem articles (21,754) Erzya Moksha Intersection Union 5,127 5,371 2,183 8,315 61.6% 64.6% 26% 100% Feb.
    [Show full text]
  • Arxiv:2103.13275V1 [Cs.CL] 24 Mar 2021
    When Word Embeddings Become Endangered Khalid Alnajjar[0000−0002−7986−2994] Department of Digital Humanities, Faculty of Arts, University of Helsinki, Finland [email protected] Abstract. Big languages such as English and Finnish have many nat- ural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages as such resources are so scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionar- ies and universal dependencies. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. Thereafter, the embeddings are fine-tuned using the sentences in the universal dependencies and aligned to match the semantic spaces of the big languages; resulting in cross- lingual embeddings. The endangered languages we work with here are Erzya, Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment analysis model for all the languages that are part of this study, whether endangered or not, by utilizing cross-lingual word em- beddings. The evaluation conducted shows that our word embeddings for endangered languages are well-aligned with the resource-rich languages, and they are suitable for training task-specific models as demonstrated by our sentiment analysis model which achieved a high accuracy. All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library. Keywords: Cross-lingual Word Embeddings · Endangered Languages · Sentiment Analysis.
    [Show full text]