Proceedings of the Sixth International Workshop for Computational
Total Page:16
File Type:pdf, Size:1020Kb
ACL SIGUR 2020 The Sixth International Workshop on Computational Linguistics for Uralic Languages Proceedings of the Conference January 10 — 11, 2020 Wien, Austria c 2020 The Association for Computational Linguistics Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL) 209 N. Eighth Street Stroudsburg, PA 18360 USA Tel: +1–570–476–8006 Fax: +1–570–476–0860 [email protected] ISBN 978-1-952148-00-2 ii Introduction IWCLUL 2020 was held in Wien, Austria. iii Organizers: ACL SIGUR: Tommi A. Pirinen, University of Hamburg Francis M. Tyers, Indiana University Bloomington Local Organizers: Jeremy Bradley, University of Wien Johannes Hirvonen, University of Wien Program Committee: Tommi A. Pirinen, University of Hamburg Francis M. Tyers, Indiana University Bloomington Jeremy Bradley, University of Vienna Eszter Simon, Research Institute for Linguistics, Hungarian Academy of Sciences Kadri Muischnek, University of Tartu Michael Rießler, University of Eastern Finland Filip Ginter, University of Turku Timofey Arkhangelskiy, University of Hamburg / Alexander von Humboldt Foundation Veronika Vincze, Research Group on Articial Intelligence, Hungarian Academy of Sciences Roman Yangarber, University of Helsinki Miikka Silfverberg, University of Helsinki Invited Speaker: Timofey Arkhangelskiy, University of Hamburg / Alexander von Humboldt Foundation v Table of Contents A pseudonymisation method for language documentation corpora: An experiment with spoken Komi Rogier Blokland, Niko Partanen and Michael Rießler . .1 Effort-value payoff in lemmatisation for Uralic languages Nick Howell, Maria Bibaeva and Francis M. Tyers . .9 On the questions in developing computational infrastructure for Komi-Permyak Jack Rueter, Niko Partanen and Larisa Ponomareva . 15 On Editing Dictionaries for Uralic Languages in an Online Environment Khalid Alnajjar, Mika Hämäläinen and Jack Rueter . 26 Towards a Speech Recognizer for Komi, an Endangered and Low-Resource Uralic Language Nils Hjortnaes, Niko Partanen, Michael Rießler and Francis M. Tyers . 31 Hunting for antiharmonic stems in Erzya László Fejes. .38 apPILcation: an Android-based Tool for Learning Mansi Gábor Bobály, Csilla Horváth and Veronika Vincze . 48 vii Conference Program Friday, January 10th, 2020 9:00–10:00 Registration 10:00–10:15 Opening 10:15–11:00 Invited Talk: Timofey Arkhangelskiy 11:00–11:30 Coffee break 11:30–13:00 Session 1 11:30–12:00 A pseudonymisation method for language documentation corpora: An experiment with spoken Komi Rogier Blokland, Niko Partanen and Michael Rießler 12:00–12:30 Effort-value payoff in lemmatisation for Uralic languages Nick Howell, Maria Bibaeva and Francis M. Tyers 12:30–13:00 On the questions in developing computational infrastructure for Komi-Permyak Jack Rueter, Niko Partanen and Larisa Ponomareva 13:00–14:00 Lunch ix Friday, January 10th, 2020 (continued) 14:00–15:00 Session 2 14:00–14:30 On Editing Dictionaries for Uralic Languages in an Online Environment Khalid Alnajjar, Mika Hämäläinen and Jack Rueter 14:30–15:00 Towards a Speech Recognizer for Komi, an Endangered and Low-Resource Uralic Language Nils Hjortnaes, Niko Partanen, Michael Rießler and Francis M. Tyers 15:00–15:30 Coffee break 15:30–17:00 Session 3 15:30–16:00 Hunting for antiharmonic stems in Erzya László Fejes 16:00–16:30 apPILcation: an Android-based Tool for Learning Mansi Gábor Bobály, Csilla Horváth and Veronika Vincze 16:30–17:00 Closing 17:00–18:30 SIGUR business meeting 20:00 Dinner x Saturday, January 11th, 2020 9:00–16:00 Tutorials xi A pseudonymisation method for language documentation corpora: An experiment with spoken Komi Niko Partanen University of Helsinki Helsinki, Finland [email protected] Rogier Blokland Michael Rießler Uppsala University University of Eastern Finland Uppsala, Sweden Joensuu, Finland [email protected] [email protected] Abstract 1 Introduction This article introduces a novel and cre- The research presented in this paper is predomi- ative application of the Constraint Gram- nantly relevant for documentary linguistics, aiming mar formalism, by presenting an automated at the creation of a “lasting multipurpose record of method for pseudonymising a Zyrian Komi a language” (Himmelmann, 2006, 1), while apply- spoken language corpus in an effective, re- ing a computational linguistic approach to an en- liable and scalable manner. The method dangered Uralic language. Specifically, we are de- is intended to be used to minimize vari- veloping an automated method for pseudonymising ous kinds of personal information found in the textual representation of a spoken language cor- the corpus in order to make spoken lan- pus in order to make the corpus data publishable guage data available while preventing the while 1) preventing the spread of sensitive personal spread of sensitive personal data about the data, 2) overcome manual work in the process to the recorded informants or other persons men- extent possible, and 3) keeping the pseudonymised tioned in the texts. In our implementation, a data as one – more openly distributed – version of Constraint Grammar based pseudonymisa- the original – and less openly distributed – data, tion tool is used as an automatically applied rather than destroying the latter by overwriting or shallow layer that derives from the original cutting away parts of them. corpus data a version which can be shared To our knowledge, this is a novel approach in for open research use. documentary linguistics, which so far seems to rely mostly on manual methods for pseudonymising (or anonymising) corpus data or bypasses the problem, Teesiq typically by generally applying very restrict access protocols to corpus data preventing them from be- Seo artikli tutvustas vahtsõt ja loovat ing openly published. piirdmiisi grammatiga (PG) formalismõ Computational linguistic projects aiming at cor- pruuk´mist. Taas om metod´, kon PG pru- pus building for endangered languages, on the other ugitas tuusjaos, et süräkomi kõnõldu keele hand, are rarely faced the problem of personal data korpusõ lindistuisi saassiq tegüsähe, kim- protection because their corpora typically originate mähe ja kontrol´misõvõimalusõga vaŕonim- from written texts, which either are openly available miga käkkiq. Seo metod´ om tett, et kor- to begin with or for which access rights have been pusõn saassiq kõnõlõjidõ andmit nii pall´o cleared before the work with corpus building starts. vähembäs võttaq, ku või, ja et tulõmit saas- Different from written-language data, corpora re- siq kergehe käsilde kontrolliq. Mi plaani sulting from fieldwork-based spoken-language doc- perrä pruugitas taad ku automaatsõt ki- umentation invariably contain large amounts and hti, miä tege korpusõ säändses, et taad või various kinds of personal data. The reason for kergehe uuŕmisõ jaos jakaq. this is that documentary linguists transcribe authen- tic speech samples meant to be re-used in multi- disciplinary research beyond structural linguistics. 1 Proceedings of the 6th International Workshop on Computational Linguistics of Uralic Languages, pages 1–8 Wien, Austria, January 10 — 11, 2020. c 2020 Association for Computational Linguistics https://doi.org/10.18653/v1/FIXME Typically, recordings are done with members of pseudonymisation system described in this paper. small communities, where most individuals know We refer to this method as pseudonymisation, since each other, and common topics include oral histo- the actual identity information is not discarded per- ries about places, persons or events inevitably in- manently from existence. We just derive a version cluding personal information about the informants where the personal data is minimised, and provide itself or other individuals. that to one particular user group. Although the recorded speaker’s informed con- We think this approach can be a satisfactory mid- sent to further processing and re-using the speech dle way to make the data accessible to the scientific sample needs to be at hand in any case, fieldwork- community without needlessly revealing the per- based recordings nearly always include information sonal information of individual speakers. We aim, which should not be made entirely openly avail- however, to make the complete corpus available for able. Simultaneously, it is in the natural interest of research use with a specific application procedure, documentary linguistics to make as many materi- as is common with language documentation corpora als as widely available as possible. The approach in other archives. discussed in this paper consequently attempts to These methods to share the corpora primarily find a suitable middle way in presenting our Zyr- serve academic users in Europe, but not the com- ian Komi corpus to a wider audience, in this case munity itself, and to resolve this issue our colleagues mainly researchers such as linguists and anthropol- from the community have also made a selection of ogists, whilst ensuring that the privacy of individual the corpus available as a ‘community edition’ at a speakers is respected and the risk of miss-using per- website videocorpora.ru², which is maintained and sonal data can be excluded. curated by FU-Lab (the Finno-Ugric Laboratory for Best practice recommendations related to the the Support of Electronic Representation of Re- problem of personal data protection in Open Sci- gional Languages in Syktyvkar, Russia). This, how- ence are currently evolving (cf. Seyfeddinipur et al., ever, though trilingual (Zyrian Komi, Russian, En-