The First Komi-Zyrian Universal Dependencies Treebanks
Niko Partanen [1], Rogier Blokland [2], KyungTae Lim [3], Thierry Poibeau [3], Michael Rießler [4]
[1] Institute for the Languages of Finland
[2] University of Uppsala
[3] LATTICE (CNRS & ENS / PSL & Université Sorbonne nouvelle / USPC)
[4] University of Bielefeld

Abstract

Two Komi-Zyrian treebanks were included in the Universal Dependencies 2.2 release. This article contextualizes the treebanks, discusses the process through which they were created, and outlines the future plans and timeline for the next improvements. Special attention is paid to the possibilities of using UD in the documentation and description of endangered languages.

1 Introduction

Komi-Zyrian is a Uralic language spoken in the north-eastern corner of the European part of Russia. Smaller Komi settlements can also be found elsewhere in northern Russia, from the Kola Peninsula to Western Siberia. The language has approximately 160,000 speakers and, although not moribund, is still threatened by the local majority language, Russian. There is a long history of linguistic research on Komi, but contemporary descriptions and computational resources could be greatly improved. Over the last few years some larger documentation projects have been carried out on Komi. These projects have focused on the most endangered spoken varieties, while at the same time new written resources for Standard Komi have become available.

This paper discusses the creation of two Komi treebanks, one containing written and the other spoken data. Both the treebanks and the scripts used to create them are included in this paper as supplementary materials, and the treebanks are part of the Universal Dependencies 2.2 release (Nivre et al., 2018). The treebanks are called Lattice and IKDP, because most of the work on them has been carried out at the LATTICE-CNRS laboratory in Paris, and the work has been done collaboratively with the IKDP-2 project (footnote 1), which is a continuation of earlier work that produced a language documentation corpus of Komi called IKDP (Blokland et al., 2009-2018). A comprehensive descriptive grammar of Komi with a focus on syntax is currently being written by members of the team. The present treebanks are intended to support the grammatical description.

The authors' recent research at the LATTICE laboratory has focused on dependency parsing of low-resource languages, using Komi and North Saami as examples (Lim et al., 2018). The Lattice treebank was initially created for use in testing dependency parsers, and the IKDP treebank was created at a later date with the aim of also including spoken language data.

Footnote 1: https://langdoc.github.io/IKDP-2

2 Language Documentation

Language documentation refers to a linguistic practice aiming at the provision of long-lasting and accountable records of speech events, usually carried out in the context of endangered languages and with the goal of understanding spoken communication beyond mere structural grammar. Himmelmann (1998) was the first to define "Documentary Linguistics" as separate from "Descriptive Linguistics", although with considerable overlap between the two. He also pays special attention to the interface between research outputs and primary data, ideally including audio and video recordings (Himmelmann, 2006). This has generally been the approach in the present work too, so that the spoken language UD corpus is directly connected to the documentary multimedia corpus through matching sentence IDs. This allows the treebank sentences to be connected to rich non-linguistic metadata. Additionally, the coded time-alignment in the original utterances provides information about turn-taking and overlapping at the millisecond level. The documentary corpus refers to the materials collected and processed within the language documentation activities, which are usually fieldwork-based and aim to represent various genres and speech practices, all of which are often under threat of disappearance.

In language documentation, traditional annotation methods have mainly consisted of so-called interlinear glossing (footnote 2). This is normally done manually or semi-manually, i.e. with little or no use of natural language processing tools (cf. Gerstenberger et al., 2016). With the available Komi data in our project, however, we wanted to apply an annotation method that would connect our work more closely to established corpus linguistics and NLP. Universal Dependencies appeared to be a very attractive annotation scheme as it aims at cross-linguistic comparability and already contains several Uralic languages. Komi-Zyrian is currently the sixth Uralic language to be included in the project.

Work on Komi complements the developments associated with the emergence of new Uralic treebanks in 2017, with new repositories created for North Saami (footnote 3) and Erzya (Rueter and Tyers, 2018). Another noteworthy trend is that several treebanks are currently being created for endangered languages in situations similar to that of Komi. As far as we have been able to ascertain, these are, at least: Dargwa spoken in the Caucasus (Kozhukhar, 2017), Pnar (footnote 4) spoken in South-East Asia and Shipibo-Konibo (footnote 5) spoken in Peru. The description of the last treebank mentioned does not indicate the use of language documentation materials, but as the language is very small, the context is comparable. To our knowledge, the IKDP treebank discussed here is the first treebank included in the UD release that is directly based on language documentation material. It is too early to say whether there will be more similar treebanks in the future and within what timeframe, but having more materials like these included in UD would fit very well into the original ideas of the multifunctional language documentation enterprise.

Footnote 2: Cf., e.g., the Leipzig Glossing Rules https://www.eva.mpg.de/lingua/resources/glossing-rules.php
Footnote 3: https://github.com/UniversalDependencies/UD_North_Sami-Giella
Footnote 4: https://github.com/UniversalDependencies/UD_Pnar-PTB
Footnote 5: https://github.com/UniversalDependencies/UD_Shipibo_Konibo-PUCP

3 Methodology

The initial analysis of Komi plain text was created using Giellatekno's (footnote 6) open infrastructure (Moshagen et al., 2014), which is currently at a rather mature level for Komi. The syntactic analysis component demands the most further work, which in turn can be guided by the work on treebanks. Similar rule-based architectures have already been used for other treebanks as well. The North Saami and Erzya corpora, for example, seem to have been created using a similar approach. Some work has been conducted on integrating these NLP tools into workflows commonly used in language documentation (Gerstenberger et al., 2017a,b, 2016). Since these languages often lack larger annotated resources, the use of infrastructures other than rule-based ones has not been common or possible, but these workflows have been implemented in a modular fashion that would enable the integration of other tools when they become available or reach the needed accuracy.

It has been demonstrated that it is possible to convert annotations from Giellatekno's annotation scheme into the UD scheme (Sheyanova and Tyers, 2017), and this has also worked well in our case. The exact procedure will continue to be refined as the token count of the corpus grows, which will ultimately also reveal rarer and not-yet-analysed morphosyntactic features. After starting with manually editing CoNLL-U files, the UD Annotatrix tool (Tyers et al., 2018) was adopted in January 2018, which marked the midpoint in the project's timeline. This greatly improved the annotation speed and consistency.

The treebank creation thus consisted of the following steps:

1. Sending Komi sentences to the Giellatekno morphosyntactic analyser (consisting of an FST component for morphological categories and a syntactic component using Constraint Grammar)
2. Resolving the remaining ambiguity manually
3. Adding the missing syntactic relations manually in the UD Annotatrix
4. Automatically converting the analyser's XPOS-tags into UPOS-tags and converting morphological feature tags into their UD counterparts
5. Manual correction and verification

The current workflow involves a rather large amount of manual work. We are interested in testing various approaches to morphological and syntactic analysis so that different (rule-based, statistical and hybrid) parsers can eventually replace the manual work. Some tests have already been carried out with the dependency parser used by the Lattice team in the CoNLL 2017 Shared Task (Lim and Poibeau, 2017) and a follow-up project (Partanen et al., 2018).

The treebank processing pipeline has been tied to several scripts and existing tools. The primary analysis is done within the Giellatekno [...]

Footnote 6: http://giellatekno.uit.no

[...] books being digitized, made available online (footnote 9) and converted into a linguistic corpus (footnote 10). The corpus currently contains 40 million words, and the long-term goal is to digitize all books and other printed texts ever published in Komi-Zyrian. The number of publications is approximately 4,500 books, plus tens of thousands of newspaper and journal issues. A significant portion of the latter is available in the public domain as part of the Fenno-Ugrica project of the National Library of Finland (footnote 11). We have chosen to use exclusively openly available data for the Lattice treebank in order to ensure as broad and simple reuse as possible. The forthcoming releases will include more genres of text, such as newspaper texts and longer sections of Wikipedia articles.

All sentences in the Lattice treebank are presented in the contemporary orthography, even when they were originally published using various earlier Komi writing systems. The proportion of texts originally written in the Molodcov alphabet will rise dramatically in the next releases, as this is probably the most commonly used orthography in the upcoming texts.
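Step 4 of the workflow in Section 3, the automatic conversion of the analyser's XPOS-tags into UPOS-tags and of morphological tags into UD features, can be illustrated with a minimal Python sketch. The mapping tables, the function names `convert_analysis` and `to_conllu_line`, and the example token below are assumptions made for illustration only. They follow common Giellatekno-style tag names but are not the conversion rules actually used for the treebanks.

```python
# Hypothetical mapping tables (assumptions for illustration, not the
# project's real conversion rules).
XPOS_TO_UPOS = {
    "N": "NOUN",
    "V": "VERB",
    "A": "ADJ",
    "Adv": "ADV",
    "Pron": "PRON",
    "Po": "ADP",
    "CC": "CCONJ",
    "CS": "SCONJ",
    "Num": "NUM",
}

TAG_TO_FEATURE = {
    "Sg": ("Number", "Sing"),
    "Pl": ("Number", "Plur"),
    "Nom": ("Case", "Nom"),
    "Gen": ("Case", "Gen"),
    "Ine": ("Case", "Ine"),
}


def convert_analysis(xpos, tags):
    """Map one token's analyser tags to (UPOS, FEATS, unmapped tags)."""
    upos = XPOS_TO_UPOS.get(xpos, "X")  # "X" is the UD tag for "other"
    feats = {}
    unmapped = []  # tags left for manual correction (step 5)
    for tag in tags:
        if tag in TAG_TO_FEATURE:
            name, value = TAG_TO_FEATURE[tag]
            feats[name] = value
        else:
            unmapped.append(tag)
    # CoNLL-U expects features sorted by name and joined with "|";
    # an empty feature set is written as "_".
    pairs = sorted(feats.items(), key=lambda kv: kv[0].lower())
    feats_str = "|".join(f"{k}={v}" for k, v in pairs) or "_"
    return upos, feats_str, unmapped


def to_conllu_line(index, form, lemma, xpos, tags):
    """Build a 10-column CoNLL-U token line; HEAD and DEPREL stay empty
    because syntactic relations are added manually in this workflow."""
    upos, feats, _ = convert_analysis(xpos, tags)
    columns = [str(index), form, lemma, upos, xpos, feats, "_", "_", "_", "_"]
    return "\t".join(columns)


if __name__ == "__main__":
    # Illustrative token: an inessive singular noun.
    print(to_conllu_line(1, "керкаын", "керка", "N", ["Sg", "Ine"]))
```

Tags without a mapping are returned separately instead of being dropped, so that rarer, not-yet-analysed features surface during manual verification rather than disappearing silently.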