The Relevance of the Source Language in Transfer Learning for ASR


Nils Hjortnæs (Indiana University)
Niko Partanen (University of Helsinki, Helsinki, Finland)
Michael Rießler (University of Eastern Finland, Joensuu, Finland)
Francis M. Tyers (Department of Linguistics, Indiana University, Bloomington, IN)

Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages: Vol. 1 Papers, pages 63–69, Online, March 2–3, 2021.

Abstract

This study presents new experiments on Zyrian Komi speech recognition. We use DeepSpeech to train ASR models from a language documentation corpus that contains both contemporary and archival recordings. Earlier studies have shown that transfer learning from English and using a domain-matching Komi language model both improve the CER and WER. In this study we experiment with transfer learning from a more relevant source language, Russian, and with including Russian text in the language model construction. The motivation for this is that Russian and Komi are contemporary contact languages, and Russian is regularly present in the corpus. We found that despite the close contact of Russian and Komi, the size of the English speech corpus yielded greater performance when English was used as the source language. Additionally, we can report that an update of the DeepSpeech version alone improved the CER by 3.9% compared with the earlier studies, which is an important step in the development of Komi ASR.

1 Introduction

This study describes an Automatic Speech Recognition (ASR) experiment on Zyrian Komi, an endangered, low-resource Uralic language spoken in Russia. Komi has approximately 160,000 speakers, and the writing system is well established using the Cyrillic script. Although Zyrian Komi is endangered, it is used widely in various media, and also in the education system in the Komi Republic.

This paper continues our experiments on Zyrian Komi ASR using DeepSpeech, which started in Hjortnaes et al. (2020b). The scores reported there were very low, but in a later study we found that a language model built on more data increased the performance dramatically (Hjortnaes et al., 2020a). We are not yet at a level that would be immediately useful for our goals, but we continue to explore different ways to improve our result. This study uses the same dataset, but attempts to take better account of the multilingual processes found in the corpus.

ASR has progressed greatly for high resource languages, and several advances have been made to extend that progress to low resource languages. There are still, however, numerous challenges in situations where the available data is limited. In many situations with the larger languages, the training data for speech recognition is collected expressly for the purpose of ASR. These approaches, such as the Common Voice platform, can also be extended to endangered languages (see e.g. Berkson et al., 2019), so there is no clear-cut boundary between the resources available for different languages. While having dedicated, purpose-built data is good for the performance of ASR, it also leaves a large quantity of more challenging but usable data untapped. At the same time, materials not explicitly produced for this purpose may be more realistic for the resources we intend to use the ASR for in the later stages.

The data collected in language documentation work customarily originates from recorded and transcribed conversations and/or elicitations in the target language. While this data does not have the desirable features of a custom speech recognition dataset, such as a large variety of speakers and accents, and includes far fewer recorded hours, for many endangered languages the language documentation corpora are the only available source.

However, attention should also be paid to the differences in endangered language contexts globally. There is enormous variation in how much a language has previously been documented, and in whether earlier resources exist. This also connects to the new materials collected, as some languages need a full linguistic description, and some already have established orthographies and a variety of descriptions available. For example, in the case of Komi our transcription choice is the existing orthography, which is also used in other resources (Gerstenberger et al., 2016, 32). Our spoken language corpus is connected with the entire NLP ecosystem of the Komi language, which includes a Finite State Transducer (Rueter, 2000), well developed dictionaries both online (https://dict.fu-lab.ru) and in print (Rueter et al., 2020; Beznosikova et al., 2012; Alnajjar et al., 2019), several treebanks (Partanen et al., 2018) and also written language corpora (Fedina, 2019; http://komicorpora.ru). We use this technical stack to annotate our corpus directly into the ELAN files (Gerstenberger et al., 2017), but also to create versions where identifiable information has been restricted (Partanen et al., 2020). Thereby our goal is not to describe the language from scratch, but to create a spoken language corpus that is not separate from the rest of the work and infrastructure done on this language. From this point of view we need an ASR system that produces the contemporary orthography, and not purely the phonetic or phonemic levels. It can be expected that entirely undocumented languages and languages with a long tradition of documentation need very different ASR approaches, although both fall under the umbrella of endangered language documentation.

In this work we expand on the use of transfer learning to improve the quality of a speech recognition system for dialectal Zyrian Komi. Our data consists of about 35 hours of transcribed speech data that will be available as an independent dataset in the Language Bank of Finland (Blokland et al., forthcoming). While this collection is under preparation, the original raw multimedia is available by request in The Language Archive in Nijmegen (Blokland et al., 2021). This is a relatively large dataset for a low resource language, but is still nowhere near high resource datasets such as Librispeech (Panayotov et al., 2015), which has about 1000 hours of English.

One of the largest challenges in our dataset is that there is significant code switching between Komi and Russian. This is a feature shared with other similar corpora (compare e.g. Shi et al., 2021). All speakers are bi- or multilingual and use several languages regularly, so there are large segments where the language is Russian, although the main language is Komi. There are also very fragmentary occurrences of Tundra Nenets, Kildin Saami and Northern Mansi, but these are so rare at the moment that we have not addressed them separately. In addition, none of the data is annotated for which language is being spoken, and we only have transcriptions in the contemporary Cyrillic orthographies of these languages, as explained above. We propose two possible methods to accommodate these properties of the data. First, we compare whether it is better to transfer from a high resource language, English, or from the contact language, Russian. Second, we analyze the impact of constructing language models from different combinations of Komi and Russian sources. The goal is to make the language model more representative of the data and thereby improve performance.

2 Prior Work

A majority of the work on Speech Recognition focuses on improving the performance of models for high resource languages. Very small improvements may be made through advances such as improving the network and the information available to it, as in Han et al. (2020) or Li et al. (2019), though as performance increases the gains of these new methods decrease. Another avenue is to try to make these systems more robust to noise through data augmentation (Braun et al., 2017; Park et al., 2019). As with improving the networks, however, these improvements become more and more marginal as performance increases.

As more models for ASR become available as open source (Hannun et al., 2014; Pratap et al., 2019), it becomes easier to develop these tools for low resource languages and to create best practice standards for doing so. This is the fundamental goal of Common Voice (Ardila et al., 2020). Others also work on individual languages, such as Fantaye et al. (2020) and Dalmia et al. (2018).

In the language documentation context we have seen a large number of experiments on endangered languages in the last few years, but these often focus on datasets with a single speaker. Under this constraint a few hours of transcribed data has already been shown to result in relatively good accuracy, as shown by Adams et al. (2018). Partanen et al. (2020) also report very good results on the extinct Samoyedic language Kamas, where the model was also trained with one speaker, for whom, however, a relatively large corpus exists. Under many circumstances it is realistic and important to record individual speakers in numerous recording sessions, and such collections appear to be numerous in the archives containing past field recordings, so there is no doubt that single-speaker systems can also be useful, although not ideal. Recently, Shi et al. (2021) also report very encouraging results on Yoloxóchitl Mixtec and Puebla Nahuatl, especially as the corpus contains multiple speakers.

… cleaned any sections which were too long or too short as defined by DeepSpeech. The alphabet is based on the Komi data and not the text used to construct the language models, as it is what determines the output of the network. We obtained our English model from DeepSpeech's publicly available release models.

3.3 DeepSpeech

We trained our models using Mozilla's open source DeepSpeech (Hannun et al., 2014) version …
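The preprocessing mentioned in the fragment above (dropping segments that are too short or too long, and deriving the alphabet from the Komi transcripts rather than from the language-model text) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `Segment` structure, the function names, the example utterances and the duration bounds are hypothetical, and the limits that DeepSpeech itself applies are not stated in the text.

```python
from dataclasses import dataclass

# Hypothetical duration bounds in seconds. The limits DeepSpeech actually
# applies are not given in the text, so these values are placeholders.
MIN_SECONDS = 1.0
MAX_SECONDS = 10.0


@dataclass
class Segment:
    """One transcribed utterance (illustrative structure, not the corpus format)."""
    audio_path: str
    duration: float  # seconds
    transcript: str


def filter_by_duration(segments, min_s=MIN_SECONDS, max_s=MAX_SECONDS):
    """Keep only the segments whose duration lies within [min_s, max_s]."""
    return [s for s in segments if min_s <= s.duration <= max_s]


def build_alphabet(segments):
    """Collect every character that occurs in the acoustic transcripts.

    Language-model text is deliberately not consulted here: the alphabet
    fixes the symbols the network can output, so it follows the transcripts.
    """
    chars = set()
    for s in segments:
        chars.update(s.transcript)
    return sorted(chars)


# Invented example data, for illustration only.
segments = [
    Segment("a.wav", 0.3, "да"),                       # too short, dropped
    Segment("b.wav", 4.2, "ме муна"),                  # kept
    Segment("c.wav", 25.0, "очень длинный сегмент"),   # too long, dropped
]

kept = filter_by_duration(segments)
alphabet = build_alphabet(kept)
print([s.audio_path for s in kept])
print(alphabet)
```

Because the alphabet is built only from the segments that survive filtering, any character that appears solely in the language-model text (for instance in additional Russian material) would not become an output symbol of the network in this sketch.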