The First Komi-Zyrian Universal Dependencies

Niko Partanen1, Rogier Blokland2, KyungTae Lim3, Thierry Poibeau3, Michael Rießler4 [email protected], [email protected], [email protected], [email protected], [email protected]

1Institute for the Languages of Finland 2University of Uppsala 3LATTICE (CNRS & ENS / PSL & Université Sorbonne nouvelle / USPC) 4University of Bielefeld

Abstract laboratory in Paris, and the work has been done collaboratively with the IKDP-21 project, which is Two Komi-Zyrian treebanks were included in a continuation of earlier work that produced a lan- the Universal Dependencies 2.2 release. This guage documentation corpus of Komi called IKDP article contextualizes the treebanks, discusses the process through which they were created, (Blokland et al., 2009-2018). A comprehensive and outlines the future plans and timeline for descriptive grammar of Komi with a focus on syn- the next improvements. Special attention is tax is currently being written by members of the paid to the possibilities of using UD in the doc- team. The present treebanks are intended to sup- umentation and description of endangered lan- port the grammatical description. guages. The authors’ recent research at LATTICE labo- ratory has focused on dependency of low- 1 Introduction resource languages, using Komi and North Saami Komi-Zyrian is a Uralic language spoken in the as examples (Lim et al., 2018). The Lattice tree- north-eastern corner of the European part of Rus- bank was initially created for use in testing depen- sia. Smaller Komi settlements can also be found dency parsers, and the IKDP was created elsewhere in northern , from the Kola at a later date with the aim of also including spo- Peninsula to Western Siberia. The language has ken language data. approximately 160,000 speakers and, although not moribund, is still threatened by the local major- 2 Language Documentation ity language, Russian. There is a long history of Language documentation refers to a linguistic research on Komi, but contemporary descriptions practice aiming at the provision of long-lasting and computational resources could be greatly im- and accountable records of speech events, usu- proved. Over the last few years some larger docu- ally carried out in the context of endangered lan- mentation projects have been carried out on Komi. guages and with the goal of understanding spoken These projects have focused on the most endan- communication beyond mere structural grammar. gered spoken varieties, while at the same time, Himmelmann (1998) was the first to define "Doc- new written resources for Standard Komi have be- umentary Linguistics" as separate from "Descrip- came available. tive Linguistics", although with considerable over- This paper discusses the creation of two Komi lap between the two. He also pays special atten- treebanks, one containing written and another spo- tion to the interface between research outputs and ken data. Both the treebanks and the scripts used primary data, ideally including audio and video to create them are included in this paper as sup- recordings (Himmelmann, 2006). This has gen- plementary materials, and the treebanks are part erally been the approach in the present work too, of the Universal Dependencies 2.2 release (Nivre so that the spoken language UD corpus is directly Lattice et al., 2018). The treebanks are called and connected to the documentary multimedia corpus IKDP, due to the fact that most of the work on them has been carried out at the LATTICE-CNRS 1https://langdoc.github.io/IKDP-2 through matching sentence IDs. This allows the based on language documentation material. It is treebank sentences to be connected to rich non- too early to say whether there will be more simi- linguistic metadata. Additionally, the coded time- lar treebanks in the future and within what time- alignment in the original utterances provides in- frame, but having more materials like these in- formation about turn-taking and overlapping at the cluded in UD would fit into the original ideas of millisecond level. The documentary corpus refers the multifunctional language documentation en- to the materials collected and processed within the terprise very well. language documentation activities, which are usu- ally fieldwork-based and aim to represent various 3 Methodology genres and speech practices, all of which are often The initial analysis of Komi plain text was created under a threat of disappearance. using Giellatekno’s6 open infrastructure (Mosha- In language documentation, traditional annota- gen et al., 2014), which is currently at a rather ma- tion methods have mainly consisted of so-called ture level for Komi. The syntactic analysis compo- interlinear glossing.2 This is normally done man- nent demands the most further work, which in turn ually or semi-manually, i.e. with little or no use can be guided by the work on treebanks. Simi- of natural language processing tools (cf. Gersten- lar rule-based architectures have already been used berger et al., 2016). With the available Komi for other treebanks as well. The Northern Saami data in our project, however, we wanted to ap- and Erzya corpora, for example, seem to have been ply an annotation method that would connect our created using a similar approach. Some work has work more closely to established corpus linguis- been conducted with integrating these NLP tools tics and NLP. Universal Dependencies appeared to into workflows commonly used in language doc- be a very attractive annotation scheme as it aims umentation (Gerstenberger et al., 2017a,b, 2016). at cross-linguistic comparability and already con- Since these languages often lack larger annotated tains several . Komi-Zyrian is resources, the use of infrastructures other than currently the sixth Uralic language to be included rule-based ones has not been common or possible, in the project. but these workflows have been implemented in a Work with Komi complements well the devel- modular fashion that would make enable the inte- opments associated with the emergence of new gration of other tools when they become available Uralic treebanks in 2017, with new repositories or reach needed accuracy. created for North Saami3 and Erzya (Rueter and It has been demonstrated that it is possible to Tyers, 2018). Another noteworthy trend is that convert annotations from Giellatekno’s annotation there are several treebanks currently being created scheme into the UD scheme (Sheyanova and Ty- for endangered languages in situations similar to ers, 2017), and this has also worked well in our that of Komi. As far as we have been able to case, although the exact procedure will continue ascertain, these are, at least: Dargwa spoken in to be refined while the token count of the corpus the Caucasus (Kozhukhar, 2017), Pnar4 spoken in grows, which will ultimately also reveal rarer and South-East Asia and Shipibo-Konibo5 spoken in not-yet-analysed morphosyntactic features. Af- Peru. The description of the last treebank men- ter starting with manually editing CoNLL-U files, tioned does not indicate the use of language doc- the UD Annotatrix tool (Tyers et al., 2018) was umentation materials, but as the language is very adopted in January 2018, which marked the mid- small, the context is comparable. To our knowl- point in the project’s timeline. This greatly im- edge, the IKDP treebank discussed here is the first proved the annotation speed and consistency. treebank included in the UD release that is directly The treebank creation thus consisted of the fol- 2Cf., e.g., the Leipzig Glossing Rules https: lowing steps: //www.eva.mpg.de/lingua/resources/ glossing-rules.php 3https://github.com/ 1. Sending Komi sentences to the Giellatekno UniversalDependencies/UD_North_ morphosyntactic analyser (consisting of an Sami-Giella FST component for morphological categories 4https://github.com/ UniversalDependencies/UD_Pnar-PTB and a syntactic component using Constraint 5https://github.com/ Grammar) UniversalDependencies/UD_Shipibo_ Konibo-PUCP 6http://giellatekno.uit.no 2. Resolving the remaining ambiguity manually books being digitalized, made available online9 and converted into a linguistic corpus.10 The cor- 3. Adding the missing syntactic relations manu- pus is currently 40 million large, and the ally to the UD Annotatrix long-term goal is to digitalize all books and other printed texts ever published in Komi-Zyrian. The 4. Automatically converting the analyzer’s number of publications is approximately 4,500 XPOS-tags into UPOS-tags and converting books, plus tens of thousands of newspaper and morphological feature tags into their UD journal issues. A significant portion of the lat- counterparts ter are available in the Public Domain as part of 5. Manual correction and verification the Fenno-Ugrica project of the National Library of Finland11. We have exclusively chosen to use The current workflow involves a rather large openly available data for the Lattice treebank in amount of manual work. We are interested in order to ensure as broad and simple reuse as pos- testing various approaches to morphological and sible. The forthcoming releases will include more syntactic analysis so that different (rule-based, genres of text, such as newspaper texts and longer statistic-based and hybrid) parsers can eventually sections of Wikipedia articles. replace the manual work. Some tests have already All sentences in the Lattice treebank are pre- been carried out with the dependency parser used sented in the contemporary orthography, even by the Lattice team in the CoNLL-U Shared Task when they were originally published using vari- 2017 (Lim and Poibeau, 2017) and a follow-up ous earlier Komi writing systems. The propor- project (Partanen et al., 2018). tion of texts originally written in the Molodcov al- The treebank processing pipeline has been phabet will rise dramatically in the next releases, tied to several scripts and existing tools. The as this is probably the most commonly used or- primary analysisis done within the Giellatekno thography in the upcoming texts. Storing several toolkit (building on FST Morphology and Con- orthographic variants may be necessary. Conver- straint Grammar), where tokenization, morpho- sion between systems has been carried out using 12 logical analysis and rule-based disambiguation are FU-Lab’s Molodcov converter . The data orig- tied to the script ‘kpvdep’. The script returns a inates from scanned books through text recogni- vislcg3 file that contains all ambiguities left after tion, currently with loss of page coordinates. This the analysis. Once the ambiguities are resolved connects to the question of how to retrieve arbi- manually, the vislcg3 file can be imported into the trary information from different sources that can UD Annotatrix tool. As a final step, the Giellate- be connected to the sentence IDs: metadata, page kno POS-tags and morphological features are con- positions, page images, time codes and audio seg- verted to follow the UD standard with a Python ments. script, originally written by Francis Tyers7.A We considered it very important to also in- modified version of the script with the conversion clude spoken language in the treebank, ideally pattern file is stored in not-to-release folder in the eventually covering all dialects. During the last dev-branch of the Lattice treebank, which is the lo- years, one of the largest research projects inves- cation where all development scripts of both tree- tigating spoken Komi has been the IKDP project, banks will be maintained. led by Rogier Blokland in 20142016, which re- sulted in a large transcribed spoken language cor- 4 Data Sources and Design Principles pus (Blokland et al., 2009-2018). The IKDP treebank contains dialectal texts taken from this Most of the work on the is cur- corpus, and since written Komi does not follow 8 rently being done by collaborators of FU-Lab in the exact same principles employed in the tran- , the capital of the in scriptions, it seems problematic to mix these ma- Russia. The work of FU-Lab, led by Marina Fe- terials together. The orthographic conventions dina, has been particularly exceptional, as it has resulted in a significant number of Komi-language 9http://komikyv.org 10http://komicorpora.ru 7https://github.com/ftyers/ud-scripts/ 11https://fennougrica. blob/master/conllu-feats.py kansalliskirjasto.fi 8http://fu-lab.ru 12http://fu-lab.ru/convertermolodcov of the spoken treebank are basically similar to include many examples of languages with very those used in the recent Komi dialect dictionary complex case systems. For example, Komi has (Beznosikova et al., 2012), with only relatively two values of nominal case that were not yet in- subtle differences.What it comes to spoken fea- cluded in the earlier documentation, namely the tures, corrections are kept and marked with the re- approximative and the egressive. One issue aris- lation reparandum, but features such as pauses are ing when comparing current treebanks is the cross- not separately marked. The user can access the comparability of the case labels applied. Komi has original archived audio, which enables a more de- two cases that express a path of some sort, tra- tailed analysis of spoken phenomena if desired. In ditionally called prolative and transitive in Komi their typographic simplicity, the transcribed texts and Uralic linguistics. These would match closely are reminiscent of some of the dialect texts pub- with a case label already in the UD documentation, lished previously in various printed text collec- perlative, found in Warlpiri, but the fact that there tions (without the original audio recordings). The are two very similar cases already makes the label- context of the spoken data here is therefore not ing problematic. Differences in case labeling are only a faithful representation of the spoken sig- related to further linguistic analyses that are possi- nal, which could include also more exact phonetic ble with the corpora, as well as to parsing accuracy transcriptions, but also the larger landscape of spo- in multilingual scenarios. In the present treebanks, ken language resources which we would like to in- the traditional labels for Komi cases are used. tegrate into our NLP ecosystem. Another theoretical question arising from Komi Furthermore, because local Komi speech and concerns the way different cases can be combined, research communities are often conscious of or- resulting in "double case marking". For example, thographic norms, we wanted to draw a clear it is entirely possible to use several spatial case boundary between written and spoken representa- markers linearly combined in one and the same tions. Additionally, the spoken language treebank inflected noun form, and, although this is some- contains a large number of Russian phrases due to what rare, examples can be easily found even for code-switching, which makes it to some degree a more marginal combinations. For example, the multilingual treebank. In the IKDP treebank, Rus- case suffixes for elative and terminative can com- sian items are currently marked with a language bine to mark subtle changes in focus: vengrija-iC- tag in the misc-field, but verification that Russian edý Hungary-ELA-TER ‘all the way from Hungary’ annotations are consistent with monolingual Rus- (see e.g. (Bartens, 2003, 53). This raises the ques- sian treebanks is a topic that requires further atten- tion of how to best annotate this in UD. Of course tion. each combination could be labeled as a new case, The sentences represent running texts and narra- which is also sometimes seen in the literature on tives, and, to a great extent, they link together into Komi nominal case (Kuznetsov, 2012, p. 374), continuous larger text units. There are deviations but this would greatly increase the number of case from this in situations where individual examples values that need to be documented, and most of have been selected in order to include instances them would be very marginal and specific to in- of each dependency relation in the treebank. This dividual languages. Another solution would be to was done particularly in the early stages of the allow several case affixes to be added to one treebanks when it was important to gain more un- form. However, this would only help when sev- derstanding of how different syntactic relations are eral cases are clearly combined and would not be tagged consistently in UD. In the upcoming re- useful when new spatial cases have emerged from leases, occurrences of each morphosyntactic phe- postpositions, a phenomenon typical of Komi and nomena present in Komi may also be hand-picked Udmurt dialects. from corpora to ensure that they occur in the tree- banks, the need for which is discussed next. Currently, a large portion of the cases in UD documentation are used only in Hungarian. In- 5 Some Questions Arising From cluding more languages with large case systems, Komi-Zyrian such as Uralic or Northeast Caucasian languages like Lezgian, would only increase the number of As the majority of languages in UD are larger names for case values used mainly in individ- Indo-European languages, the project does not yet ual languages. Eventually this also boils down to the question of how comparable the cases in along with some larger Wikipedia texts, will be in- different languages actually are. Haspelmath has cluded in the next UD release 2.3. The next phase argued convincingly that case labels are valid of the IKDP treebank will include individual texts only for particular languages (Haspelmath, 2009, from the Komi recordings made by Eric Vászolyi 510), and the issue probably cannot be explicitly in the 1950s and 1960s (Vászolyi-Vasse, 2003), solved within UD either, but for the sake of us- which the present authors have acquired permis- ability of treebanks and their suitability for mul- sion to re-publish electronically. These texts orig- tilingual NLP applications, some harmonization inate from a time and place of intensive language would seem desirable. One alternative could be contact between Komi-Zyrian and Tundra Nenets, to create a higher layer of mapping that connects what makes them a particularly interesting target language-specific labels to broader shared cate- for further study. gories. In this way, both Komi cases expressing One possibly useful addition to the treebank a path could be connected to a concept of move- could be English glosses in the misc-field, since ment along a path, but the language-specific nu- many linguists are used to working with data from ances would not be lost. endangered languages in a format like this. The English gloss could contain a contextual transla- 6 Conclusion and Further Work tion of the lemma, for example, which would make The written and spoken treebanks have 1389 and the sentences in the treebank much more accessi- 988 tokens, respectively. Due to their small size, ble to different linguistic audiences. they have not been split into test and development In terms of size, the target is to reach 5,000 sets. Based on this experience, it already seems tokens in both treebanks during 2018, and to in- clear that providing annotations in this framework crease this to 20,000 in the first half of 2019. Our has several advantages compared to traditional long-term goal is to create a resource that would methods used in language documentation projects. contribute to research on Komi and provide bet- The main benefit is the comparability between dif- ter resources for Natural Language Processing of ferent languages, and also straightforward licens- this language, which has yet to receive sufficient ing and distribution within UD framework. attention in computational linguistic research. It can be argued that tagging according the UD principles is necessarily a compromise, and that it may not express all particularities of individ- 7 Acknowledgements ual languages. One possible way to solve this problem is to include further annotations in the We want to thank the reviewers for their use- misc-column. Another possible approach would ful comments. This work has been developed be to provide different parts of the documentary in the framework of the LAKME project funded corpus with varying degrees of annotations. In by a grant from Paris Sciences et Lettres (IDEX any case, based on our experience, we would PSL reference ANR-10-IDEX-0001-02). Thierry strongly encourage endangered language docu- Poibeau is partially supported by a RGNF-CNRS mentation projects to take a small segment of their (grant between the LATTICE-CNRS Laboratory materials and add to it an additional layer of anno- and the Russian State University for the Human- tations in the Universal Dependencies framework. ities in Moscow). Kyungtae Lim is partially sup- Language documentation data is usually stored in ported by the ERA-NET Atlantis project. Niko archives that require access requests. This is not Partanen’s work has been carried out at the LAT- very compatible with openly available treebanks. TICE laboratory, and besides Partanen, both Ro- Still, it should be possible to collect small subsets gier Blokland and Michael Rießler collaborate of materials with the clear intention and permis- within the project Language Documentation meets sion for these recordings to be openly licensed, or Language Technology: the Next Step in the De- to use texts old enough that they are copyright free. scription of Komi, funded by the Kone Founda- New material is currently being brought into the tion. Thanks to Jack Rueter for numerous discus- Lattice treebank. The main genres obtained from sions on Komi and Erzya, and to Alexandra Kell- Fenno-Ugrica collection are newspaper texts, non- ner for proofreading the paper. fiction works and schoolbooks. Samples of these, References KyungTae Lim, Niko Partanen, and Thierry Poibeau. 2018. Multilingual Dependency Parsing for Low- Raija Bartens. 2003. Kahden kaasuspäätteen jonoista Resource Languages: Case Studies on North Saami suomalais-ugrilaisissa kielissä. In Bakró-Nagy Mar- and Komi-Zyrian. In Proceedings of the Eleventh ianne and Károly Rédei, editors, Ünnepi kötet Honti International Conference on Language Resources László tiszteletére, pages 46–54. MTA, Budapest. and Evaluation. ELRA.

L.M. Beznosikova, E.A. Ajbabina, N.K. Zaboeva, and KyungTae Lim and Thierry Poibeau. 2017. A sys- R.I. Kosnyreva. 2012. Komi sërnisikas kyvˇcukör. tem for multilingual dependency parsing based on Slovar dialektov komi âzyka: v 2-h tomah. Insti- bidirectional LSTM feature representations. In Pro- tut âzyka, literatury i istorii Komi naunogo centra ceedings of the CoNLL 2017 Shared Task: Multilin- Uralskogo otdeleniâ Rossijskoj akademii nauk, Syk- gual Parsing from Raw Text to Universal Dependen- tyvkar. cies, pages 63–70. ACL.

Rogier Blokland, Marina Fedina, Niko Partanen, and Sjur Moshagen, Jack Rueter, Tommi Pirinen, Trond Michael Rießler. 2009-2018. IKDP. In The Trosterud, and Francis M Tyers. 2014. Open-source Language Archive (TLA): Donated Corpora. Max infrastructures for collaborative work on under- Planck Institute for Psycholinguistics, Nijmegen. resourced languages. In Proceedings of the Ninth International Conference on Language Resources Ciprian Gerstenberger, Niko Partanen, and Michael and Evaluation, pages 71–77. ELRA. Rießler. 2017a. Instant annotations in ELAN cor- pora of spoken and written Komi, an endangered Joakim Nivre, Mitchell Abrams, Željko Agic,´ Lars language of the Barents Sea region. In Proceed- Ahrenberg, Lene Antonsen, Maria Jesus Aranz- ings of the 2nd Workshop on the Use of Compu- abe, Gashaw Arutie, Masayuki Asahara, Luma tational Methods in the Study of Endangered Lan- Ateyah, Mohammed Attia, Aitziber Atutxa, Lies- guages, pages 57–66. ACL. beth Augustinus, Elena Badmaeva, Miguel Balles- teros, Esha Banerjee, Sebastian Bank, Verginica Ciprian Gerstenberger, Niko Partanen, Michael Barbu Mititelu, John Bauer, Sandra Bellato, Kepa Rießler, and Joshua Wilbur. 2016. Utilizing Bengoetxea, Riyaz Ahmad Bhat, Erica Biagetti, language technology in the documentation of Eckhard Bick, Rogier Blokland, Victoria Bobicev, endangered Uralic languages. Northern European Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Journal of Language Technology, 4:29–47. Bowman, Adriane Boyd, Aljoscha Burchardt, Marie Candito, Bernard Caron, Gauthier Caron, Gül¸sen Cebiroglu˘ Eryigit,˘ Giuseppe G. A. Celano, Savas Ciprian Gerstenberger, Niko Partanen, Michael Cetin, Fabricio Chalub, Jinho Choi, Yongseok Cho, Rießler, and Joshua Wilbur. 2017b. Instant anno- Jayeol Chun, Silvie Cinková, Aurélie Collomb, tations: Applying NLP methods to the annotation Çagrı˘ Çöltekin, Miriam Connor, Marine Courtin, of spoken language documentation corpora. In Elizabeth Davidson, Marie-Catherine de Marneffe, Proceedings of the 3rd International Workshop on Valeria de Paiva, Arantza Diaz de Ilarraza, Carly Computational Linguistics for Uralic languages, Dickerson, Peter Dirix, Kaja Dobrovoljc, Tim- pages 25–36. ACL. othy Dozat, Kira Droganova, Puneet Dwivedi, Marhaba Eli, Ali Elkahky, Binyam Ephrem, Tomaž Martin Haspelmath. 2009. Terminology of case. In Erjavec, Aline Etienne, Richárd Farkas, Hector Andrew Spencer and Andrej L. Malchukov, editors, Fernandez Alcalde, Jennifer Foster, Cláudia Fre- The Oxford handbook of case, pages 505–517. OUP, itas, Katarína Gajdošová, Daniel Galbraith, Mar- Oxford. cos Garcia, Moa Gärdenfors, Kim Gerdes, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Memduh Nikolaus Himmelmann. 2006. Language documenta- Gökırmak, Yoav Goldberg, Xavier Gómez Guino- tion: What is it and what is it good for? In Jost Gip- vart, Berta Gonzáles Saavedra, Matias Grioni, Nor- pert, Ulrike Mosel, and Nikolaus Himmelmann, ed- munds Gruz¯ ¯ıtis, Bruno Guillaume, Céline Guillot- itors, Essentials of Language Documentation, pages Barbance, Nizar Habash, Jan Hajic,ˇ Jan Hajicˇ jr., 1–30. Mouton de Gruyter, Berlin. Linh Hà My,˜ Na-Rae Han, Kim Harris, Dag Haug, Barbora Hladká, Jaroslava Hlavácová,ˇ Florinel Nikolaus P. Himmelmann. 1998. Documentary and de- Hociung, Petter Hohle, Jena Hwang, Radu Ion, scriptive linguistics. Linguistics, 36:161–195. Elena Irimia, Tomáš Jelínek, Anders Johannsen, Fredrik Jørgensen, Hüner Ka¸sıkara, Sylvain Ka- Alexandra Kozhukhar. 2017. Universal dependencies hane, Hiroshi Kanayama, Jenna Kanerva, Tolga for Dargwa Mehweb. In Proceedings of the Fourth Kayadelen, Václava Kettnerová, Jesse Kirchner, International Conference on Dependency Linguis- Natalia Kotsyba, Simon Krek, Sookyoung Kwak, tics, pages 92–99. ACL. Veronika Laippala, Lorenzo Lambertino, Tatiana Lando, Septina Dian Larasati, Alexei Lavrentiev, Nikolay Kuznetsov. 2012. Matrix of cognitive do- John Lee, Phng Lê Hông,` Alessandro Lenci, Saran mains for Komi local cases. Journal of Estonian and Lertpradit, Herman Leung, Cheuk Ying Li, Josie Finno-Ugric Linguistics, 3(1):373–394. Li, Keying Li, KyungTae Lim, Nikola Ljubešic,´ Olga Loginova, Olga Lyashevskaya, Teresa Lynn, source universal-dependency treebank for Erzya. Vivien Macketanz, Aibek Makazhanov, Michael In Proceedings of the 4th International Workshop Mandl, Christopher Manning, Ruli Manurung, on Computational Linguistics for Uralic languages, Cat˘ alina˘ Mar˘ anduc,˘ David Marecek,ˇ Katrin Marhei- pages 106–118. ACL. necke, Héctor Martínez Alonso, André Martins, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Gustavo Mariya Sheyanova and Francis M. Tyers. 2017. Anno- Mendonça, Niko Miekka, Anna Missilä, Cat˘ alin˘ tation schemes in North Sámi dependency parsing. Mititelu, Yusuke Miyao, Simonetta Montemagni, In Proceedings of the Third Workshop on Computa- Amir More, Laura Moreno Romero, Shinsuke tional Linguistics for Uralic Languages, pages 66– Mori, Bjartur Mortensen, Bohdan Moskalevskyi, 75. ACL. Kadri Muischnek, Yugo Murawaki, Kaili Müürisep, Pinkey Nainwani, Juan Ignacio Navarro Horñi- Francis M Tyers, Mariya Sheyanova, and acek, Anna Nedoluzhko, Gunta Nešpore-Berzkalne,¯ Jonathan North Washington. 2018. UD Annotatrix: An annotation tool for universal dependencies. In Lng Nguyên˜ Thi, Huyên` Nguyên˜ Thi Minh, Vi- . . Proceedings of the 16th International Workshop on taly Nikolaev, Rattima Nitisaroj, Hanna Nurmi, Treebanks and Linguistic Theories, pages 10–17. Stina Ojala, Adédayo Olúòkun, Mai Omura, Petya . ACL. Osenova, Östling, Lilja Øvrelid, Niko Partanen, Elena Pascual, Marco Passarotti, Ag- Eric Vászolyi-Vasse. 2003. Syrjaenica: Narratives, nieszka Patejuk, Siyao Peng, Cenel-Augusto Perez, folklore and folk poetry from eight dialects of the Guy Perrier, Slav Petrov, Jussi Piitulainen, Emily Komi language. Vol. 1, Upper Izhma, Lower Ob, Pitler, Barbara Plank, Thierry Poibeau, Mar- Kanin Peninsula, Upper Jusva, Middle Inva, Udora. tin Popel, Lauma Pretkalnin, a, Sophie Prévost, Savariae, Szombathely. Prokopis Prokopidis, Adam Przepiórkowski, Ti- ina Puolakainen, Sampo Pyysalo, Andriela Rääbis, Alexandre Rademaker, Loganathan Ramasamy, Taraka Rama, Carlos Ramisch, Vinit Ravishankar, Livy Real, Siva Reddy, Georg Rehm, Michael Rießler, Larissa Rinaldi, Laura Rituma, Luisa Rocha, Mykhailo Romanenko, Rudolf Rosa, Da- vide Rovati, Valentin Roca, Olga Rudina, Shoval Sadde, Shadi Saleh, Tanja Samardžic,´ Stephanie Samson, Manuela Sanguinetti, Baiba Saul¯ıte, Yanin Sawanakunanon, Nathan Schneider, Sebas- tian Schuster, Djamé Seddah, Wolfgang Seeker, Mojgan Seraji, Mo Shen, Atsuko Shimada, Muh Shohibussirri, Dmitry Sichinava, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Mária Šimková, Kiril Simov, Aaron Smith, Is- abela Soares-Bastos, Antonio Stella, Milan Straka, Jana Strnadová, Alane Suhr, Umut Sulubacak, Zsolt Szántó, Dima Taji, Yuta Takahashi, Takaaki Tanaka, Isabelle Tellier, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Francis Tyers, Sumire Ue- matsu, Zdenkaˇ Urešová, Larraitz Uria, Hans Uszko- reit, Sowmya Vajjala, Daniel van Niekerk, Gertjan van Noord, Viktor Varga, Veronika Vincze, Lars Wallin, Jonathan North Washington, Seyi Williams, Mats Wirén, Tsegay Woldemariam, Tak-sum Wong, Chunxiao Yan, Marat M. Yavrumyan, Zhuoran Yu, Zdenekˇ Žabokrtský, Amir Zeldes, Daniel Zeman, Manying Zhang, and Hanzhi Zhu. 2018. Univer- sal dependencies 2.2. LINDAT/CLARIN digital li- brary at the Institute of Formal and Applied Linguis- tics (ÚFAL), Faculty of Mathematics and Physics, Charles University. Niko Partanen, KyungTae Lim, Michael Rießler, and Thierry Poibeau. 2018. Dependency parsing of code-switching data with cross-lingual feature rep- resentations. In Proceedings of the 4th International Workshop on Computational Linguistics for Uralic languages, pages 1–17. ACL. Jack Rueter and Francis Tyers. 2018. Towards an open-