Instant Annotations in ELAN Corpora of Spoken and Written Komi, an Endangered Language of the Barents Sea Region
Total Page:16
File Type:pdf, Size:1020Kb
Instant annotations in ELAN corpora of spoken and written Komi, an endangered language of the Barents Sea region Ciprian Gerstenberger∗ Niko Partanen UiT The Arctic University of Norway University of Hamburg Giellatekno – Saami Language Technology Department of Uralic Studies [email protected] [email protected] Michael Rießler University of Freiburg Freiburg Institute for Advanced Studies [email protected] Abstract NLP methods to more efficiently annotate qualita- tively and quantitatively language data for endan- The paper describes work-in-progress by gered languages, and this, despite the fact that the the Izhva Komi language documentation relevant computational methods and tools are well- project, which records new spoken lan- known from corpus-driven linguistic research on guage data, digitizes available recordings larger written languages. With respect to the data and annotate these multimedia data in types involved, endangered language documenta- order to provide a comprehensive lan- tion generally seems similar to corpus linguistics guage corpus as a databases for future re- (i.e. “corpus building”) of non-endangered lan- search on and for this endangered – and guages. Both provide primary data for secondary under-described – Uralic speech commu- (synchronic or diachronic) data derivations and nity. While working with a spoken vari- analyses (for data types in language documenta- ety and in the framework of documentary tion, cf. Himmelmann 2012; for a comparison be- linguistics, we apply language technol- tween corpus linguistics and language documenta- ogy methods and tools, which have been tion, cf. Cox 2011). applied so far only to normalized writ- The main difference is that traditional cor- ten languages. Specifically, we describe pus (and computational) linguistics deals predom- a script providing interactivity between inantly with larger non-endangered languages, for ELAN, a Graphical User Interface tool for which huge amounts of mainly written corpus data annotating and presenting multimodal cor- are available. The documentation of endangered pora, and different morphosyntactic ana- languages, on the other hand, typically results in lysis modules implemented as Finite State rather small corpora of spoken genres. Transducers and Constraint Grammar for Although relatively small endangered languages rule-based morphosyntactic tagging and are also increasingly gaining attention by the com- disambiguation. Our aim is to challenge putational linguistic research (as an example for current manual approaches in the annota- Northern Saami, see Trosterud 2006a; and for tion of language documentation corpora. Plains Cree, see Snoek et al. 2014), these projects work predominantly with written language vari- 1 Introduction eties. Current computational linguistic projects on endangered languages seem to have simply Endangered language documentation (aka docu- copied their approach from already established re- mentary linguistics) has made huge technological search on the major languages, including the fo- progress in regard to collaborative tools and user cus on written language. The resulting corpora are interfaces for transcribing, searching, and archiv- impressively large and include higher-level mor- ing multimedia recordings. However, paradoxi- phosyntactic annotations. However, they repre- cally, the field has only rarely considered applying sent a rather limited range of text genres and in- The∗ order of the authors’ names is alphabetical. clude predominantly translations from the relevant majority languages. endangered languages or dialects (for Russian di- In turn, researchers working within the frame- alects, cf. Waldenfels et al. 2014), the use of phone- work of endangered language documentation, i.e. mic transcriptions in the mentioned language doc- fieldwork-based documentation, preservation, and umentation projects seems highly anachronistic description of endangered languages, often col- if orthographic systems are available and in use. lect and annotate natural texts from a broad va- Note also, that many contemporary and historical riety of genres. Commonly, the resulting spoken printed texts written in these languages are also al- language corpora have phonemic transcriptions as ready digitized,4 which would make it easily possi- well as several morphosyntactic annotation layers ble in principle to combine spoken and written data produced either manually or semi-manually with in one and the same corpus later. The use of an the help of software like Field Linguist’s Toolbox orthography-based transcription system not only (or Toolbox, for short),1 FieldWorks Language Ex- ease, and hence, speeds up the transcription pro- plorer (or FLEx, for short),2 or similar tools. Com- cess, but it enables a straightforward integration of mon morphosyntactic annotations include glossed all available (digitized) printed texts into our cor- text with morpheme-by-morpheme interlineariza- pus in a uniform way, too. Note also, that com- tion. Whereas these annotations are qualitatively bining all available spoken and written data into rich, including the time alignment of annotation one corpus would not only make the corpora larger layers to the original audio or video recordings, token-wise, but such corpora will also be ideal for the resulting corpora are relatively small and rarely future corpus-based variational sociolinguistic in- reach 150,000 word tokens. Typically, they are vestigations. Falling back to orthographic tran- considerably smaller. The main reason for the lim- scriptions of our spoken language documentation ited size of such annotated language documenta- corpora seems therefore most sensible. Last but tion corpora is that manual glossing is an extremely not least, if research in language documentation is time consuming task and even semi-manual gloss- intended to be useful for the communities as well, ing using FLEx or similar tools is a bottleneck the representation of transcribed texts in orthogra- because disambiguation of homonyms has to be phy is the most logical choice. solved manually. In our own language documentation projects, Another problem we identified especially in the we question especially the special value given documentation of endangered languages in North- to time consuming phonemic transcriptions and ern Eurasia is that sometimes the existence of or- (semi-)manual morpheme-by-morpheme interlin- thographies is ignored, and instead, phonemic or earizations when basic phonological and morpho- even detailed phonetic transcription is employed. logical descriptions are already available (which It is a matter of fact that as a result of institutional- is arguably true for the majority of languages in ized and/or community-driven language planning Northern Eurasia), as such descriptions serve as and revitalization efforts many endangered lan- a resource to accessing phonological and morpho- guages in Russia have established written stan- logical structures. Instead, we propose a step-by- dards and do regularly use their languages in writ- step approach to reach higher-level morphosyntac- ing. Russia seems to provide a special case com- tic annotations by using and incrementally improv- pared to many other small endangered languages ing genuine computational methods. This proce- in the world, but for some of these languages, not dure encompasses a systematic integration of all only Komi-Zyrian, but also for instance North- available resources into the language documenta- ern Khanty, Northern Selkup, or Tundra Nenets tion endeavor: textual, lexicographic, and gram- (which are all object to documentation at present), matical. a significant amount of printed texts can be found The language documentation projects we work in books and newspapers printed during different with (cf. Blokland, Gerstenberger, et al. 2015; periods, and several of these languages are also Gerstenberger et al. 2017b; Gerstenberger et al. used digitally on the Internet today.3 Compared 2017a) are concerned with the building of mul- to common practice in corpus building for non- 4For printed sources from the Soviet Union and earlier, 1 http://www-01.sil.org/computing/toolbox the Fenno-Ugrica Collection is especially relevant: http: 2http://fieldworks.sil.org/flex //fennougrica.kansalliskirjasto.fi; contemporary 3See, for instance, The Finno-Ugric Languages and The printed sources are also systematically digitized, e.g. for both Internet Project ( Jauhiainen et al. 2015). Komi languages, cf. http://komikyv.ru. timodal language corpora, including at least spo- and is spoken by approximately 160,000 people ken and written (i.e. transcribed spoken) data and who live predominantly in the Komi Republic of applying innovative methodology at the interface the Russian Federation. Although formally recog- between endangered language documentation and nized as the second official language of the Re- endangered language technology. We understand public, the language is endangered as the result of language technology as the functional application rapid language shift to Russian. of computational linguistics as it is aimed at an- Komi is a morphologically rich and agglutinat- alyzing and generating natural language in vari- ing language with primarily suffixing morphol- ous ways and for a variety of purposes. Language ogy and predominantly head-final constituent or- technology can create tools which analyze spoken der, differential object marking, and accusative- language corpora in a much more