The First Komi-Zyrian Universal Dependencies Treebanks

Total Page:16

File Type:pdf, Size:1020Kb

The First Komi-Zyrian Universal Dependencies Treebanks The First Komi-Zyrian Universal Dependencies Treebanks Niko Partanen1, Rogier Blokland2, KyungTae Lim3, Thierry Poibeau3, Michael Rießler4 [email protected], [email protected], [email protected], [email protected], [email protected] 1Institute for the Languages of Finland 2University of Uppsala 3LATTICE (CNRS & ENS / PSL & Université Sorbonne nouvelle / USPC) 4University of Bielefeld Abstract laboratory in Paris, and the work has been done collaboratively with the IKDP-21 project, which is Two Komi-Zyrian treebanks were included in a continuation of earlier work that produced a lan- the Universal Dependencies 2.2 release. This guage documentation corpus of Komi called IKDP article contextualizes the treebanks, discusses the process through which they were created, (Blokland et al., 2009-2018). A comprehensive and outlines the future plans and timeline for descriptive grammar of Komi with a focus on syn- the next improvements. Special attention is tax is currently being written by members of the paid to the possibilities of using UD in the doc- team. The present treebanks are intended to sup- umentation and description of endangered lan- port the grammatical description. guages. The authors’ recent research at LATTICE labo- ratory has focused on dependency parsing of low- 1 Introduction resource languages, using Komi and North Saami Komi-Zyrian is a Uralic language spoken in the as examples (Lim et al., 2018). The Lattice tree- north-eastern corner of the European part of Rus- bank was initially created for use in testing depen- sia. Smaller Komi settlements can also be found dency parsers, and the IKDP treebank was created elsewhere in northern Russia, from the Kola at a later date with the aim of also including spo- Peninsula to Western Siberia. The language has ken language data. approximately 160,000 speakers and, although not moribund, is still threatened by the local major- 2 Language Documentation ity language, Russian. There is a long history of Language documentation refers to a linguistic research on Komi, but contemporary descriptions practice aiming at the provision of long-lasting and computational resources could be greatly im- and accountable records of speech events, usu- proved. Over the last few years some larger docu- ally carried out in the context of endangered lan- mentation projects have been carried out on Komi. guages and with the goal of understanding spoken These projects have focused on the most endan- communication beyond mere structural grammar. gered spoken varieties, while at the same time, Himmelmann (1998) was the first to define "Doc- new written resources for Standard Komi have be- umentary Linguistics" as separate from "Descrip- came available. tive Linguistics", although with considerable over- This paper discusses the creation of two Komi lap between the two. He also pays special atten- treebanks, one containing written and another spo- tion to the interface between research outputs and ken data. Both the treebanks and the scripts used primary data, ideally including audio and video to create them are included in this paper as sup- recordings (Himmelmann, 2006). This has gen- plementary materials, and the treebanks are part erally been the approach in the present work too, of the Universal Dependencies 2.2 release (Nivre so that the spoken language UD corpus is directly Lattice et al., 2018). The treebanks are called and connected to the documentary multimedia corpus IKDP, due to the fact that most of the work on them has been carried out at the LATTICE-CNRS 1https://langdoc.github.io/IKDP-2 through matching sentence IDs. This allows the based on language documentation material. It is treebank sentences to be connected to rich non- too early to say whether there will be more simi- linguistic metadata. Additionally, the coded time- lar treebanks in the future and within what time- alignment in the original utterances provides in- frame, but having more materials like these in- formation about turn-taking and overlapping at the cluded in UD would fit into the original ideas of millisecond level. The documentary corpus refers the multifunctional language documentation en- to the materials collected and processed within the terprise very well. language documentation activities, which are usu- ally fieldwork-based and aim to represent various 3 Methodology genres and speech practices, all of which are often The initial analysis of Komi plain text was created under a threat of disappearance. using Giellatekno’s6 open infrastructure (Mosha- In language documentation, traditional annota- gen et al., 2014), which is currently at a rather ma- tion methods have mainly consisted of so-called ture level for Komi. The syntactic analysis compo- interlinear glossing.2 This is normally done man- nent demands the most further work, which in turn ually or semi-manually, i.e. with little or no use can be guided by the work on treebanks. Simi- of natural language processing tools (cf. Gersten- lar rule-based architectures have already been used berger et al., 2016). With the available Komi for other treebanks as well. The Northern Saami data in our project, however, we wanted to ap- and Erzya corpora, for example, seem to have been ply an annotation method that would connect our created using a similar approach. Some work has work more closely to established corpus linguis- been conducted with integrating these NLP tools tics and NLP. Universal Dependencies appeared to into workflows commonly used in language doc- be a very attractive annotation scheme as it aims umentation (Gerstenberger et al., 2017a,b, 2016). at cross-linguistic comparability and already con- Since these languages often lack larger annotated tains several Uralic languages. Komi-Zyrian is resources, the use of infrastructures other than currently the sixth Uralic language to be included rule-based ones has not been common or possible, in the project. but these workflows have been implemented in a Work with Komi complements well the devel- modular fashion that would make enable the inte- opments associated with the emergence of new gration of other tools when they become available Uralic treebanks in 2017, with new repositories or reach needed accuracy. created for North Saami3 and Erzya (Rueter and It has been demonstrated that it is possible to Tyers, 2018). Another noteworthy trend is that convert annotations from Giellatekno’s annotation there are several treebanks currently being created scheme into the UD scheme (Sheyanova and Ty- for endangered languages in situations similar to ers, 2017), and this has also worked well in our that of Komi. As far as we have been able to case, although the exact procedure will continue ascertain, these are, at least: Dargwa spoken in to be refined while the token count of the corpus the Caucasus (Kozhukhar, 2017), Pnar4 spoken in grows, which will ultimately also reveal rarer and South-East Asia and Shipibo-Konibo5 spoken in not-yet-analysed morphosyntactic features. Af- Peru. The description of the last treebank men- ter starting with manually editing CoNLL-U files, tioned does not indicate the use of language doc- the UD Annotatrix tool (Tyers et al., 2018) was umentation materials, but as the language is very adopted in January 2018, which marked the mid- small, the context is comparable. To our knowl- point in the project’s timeline. This greatly im- edge, the IKDP treebank discussed here is the first proved the annotation speed and consistency. treebank included in the UD release that is directly The treebank creation thus consisted of the fol- 2Cf., e.g., the Leipzig Glossing Rules https: lowing steps: //www.eva.mpg.de/lingua/resources/ glossing-rules.php 3https://github.com/ 1. Sending Komi sentences to the Giellatekno UniversalDependencies/UD_North_ morphosyntactic analyser (consisting of an Sami-Giella FST component for morphological categories 4https://github.com/ UniversalDependencies/UD_Pnar-PTB and a syntactic component using Constraint 5https://github.com/ Grammar) UniversalDependencies/UD_Shipibo_ Konibo-PUCP 6http://giellatekno.uit.no 2. Resolving the remaining ambiguity manually books being digitalized, made available online9 and converted into a linguistic corpus.10 The cor- 3. Adding the missing syntactic relations manu- pus is currently 40 million words large, and the ally to the UD Annotatrix long-term goal is to digitalize all books and other printed texts ever published in Komi-Zyrian. The 4. Automatically converting the analyzer’s number of publications is approximately 4,500 XPOS-tags into UPOS-tags and converting books, plus tens of thousands of newspaper and morphological feature tags into their UD journal issues. A significant portion of the lat- counterparts ter are available in the Public Domain as part of 5. Manual correction and verification the Fenno-Ugrica project of the National Library of Finland11. We have exclusively chosen to use The current workflow involves a rather large openly available data for the Lattice treebank in amount of manual work. We are interested in order to ensure as broad and simple reuse as pos- testing various approaches to morphological and sible. The forthcoming releases will include more syntactic analysis so that different (rule-based, genres of text, such as newspaper texts and longer statistic-based and hybrid) parsers can eventually sections of Wikipedia articles. replace the manual work. Some tests have already All sentences in the Lattice treebank are pre- been carried out with the dependency parser used sented in the contemporary orthography, even by the Lattice team in the CoNLL-U Shared Task when they were originally published using vari- 2017 (Lim and Poibeau, 2017) and a follow-up ous earlier Komi writing systems. The propor- project (Partanen et al., 2018). tion of texts originally written in the Molodcov al- The treebank processing pipeline has been phabet will rise dramatically in the next releases, tied to several scripts and existing tools. The as this is probably the most commonly used or- primary analysisis done within the Giellatekno thography in the upcoming texts.
Recommended publications
  • Arxiv:1908.07448V1
    Evaluating Contextualized Embeddings on 54 Languages in POS Tagging, Lemmatization and Dependency Parsing Milan Straka and Jana Strakova´ and Jan Hajicˇ Charles University Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics {strakova,straka,hajic}@ufal.mff.cuni.cz Abstract Shared Task (Zeman et al., 2018). • We report our best results on UD 2.3. The We present an extensive evaluation of three addition of contextualized embeddings im- recently proposed methods for contextualized 25% embeddings on 89 corpora in 54 languages provements range from relative error re- of the Universal Dependencies 2.3 in three duction for English treebanks, through 20% tasks: POS tagging, lemmatization, and de- relative error reduction for high resource lan- pendency parsing. Employing the BERT, guages, to 10% relative error reduction for all Flair and ELMo as pretrained embedding in- UD 2.3 languages which have a training set. puts in a strong baseline of UDPipe 2.0, one of the best-performing systems of the 2 Related Work CoNLL 2018 Shared Task and an overall win- ner of the EPE 2018, we present a one-to- A new type of deep contextualized word repre- one comparison of the three contextualized sentation was introduced by Peters et al. (2018). word embedding methods, as well as a com- The proposed embeddings, called ELMo, were ob- parison with word2vec-like pretrained em- tained from internal states of deep bidirectional beddings and with end-to-end character-level word embeddings. We report state-of-the-art language model, pretrained on a large text corpus. results in all three tasks as compared to results Akbik et al.
    [Show full text]
  • Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format
    Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format Alina Wróblewska Institute of Computer Science Polish Academy of Sciences ul. Jana Kazimierza 5 01-248 Warsaw, Poland [email protected] Abstract even for languages with rich morphology and rel- atively free word order, such as Polish. The paper presents the largest Polish Depen- The supervised learning methods require gold- dency Bank in Universal Dependencies for- mat – PDBUD – with 22K trees and 352K standard training data, whose creation is a time- tokens. PDBUD builds on its previous ver- consuming and expensive process. Nevertheless, sion, i.e. the Polish UD treebank (PL-SZ), dependency treebanks have been created for many and contains all 8K PL-SZ trees. The PL- languages, in particular within the Universal De- SZ trees are checked and possibly corrected pendencies initiative (UD, Nivre et al., 2016). in the current edition of PDBUD. Further The UD leaders aim at developing a cross- 14K trees are automatically converted from linguistically consistent tree annotation schema a new version of Polish Dependency Bank. and at building a large multilingual collection of The PDBUD trees are expanded with the en- hanced edges encoding the shared dependents dependency treebanks annotated according to this and the shared governors of the coordinated schema. conjuncts and with the semantic roles of some Polish is also represented in the Universal dependents. The conducted evaluation exper- Dependencies collection. There are two Polish iments show that PDBUD is large enough treebanks in UD: the Polish UD treebank (PL- for training a high-quality graph-based depen- SZ) converted from Składnica zalezno˙ sciowa´ 1 and dency parser for Polish.
    [Show full text]
  • Universal Dependencies According to BERT: Both More Specific and More General
    Universal Dependencies According to BERT: Both More Specific and More General Tomasz Limisiewicz and David Marecekˇ and Rudolf Rosa Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics Charles University, Prague, Czech Republic {limisiewicz,rosa,marecek}@ufal.mff.cuni.cz Abstract former based systems particular heads tend to cap- ture specific dependency relation types (e.g. in one This work focuses on analyzing the form and extent of syntactic abstraction captured by head the attention at the predicate is usually focused BERT by extracting labeled dependency trees on the nominal subject). from self-attentions. We extend understanding of syntax in BERT by Previous work showed that individual BERT examining the ways in which it systematically di- heads tend to encode particular dependency verges from standard annotation (UD). We attempt relation types. We extend these findings by to bridge the gap between them in three ways: explicitly comparing BERT relations to Uni- • We modify the UD annotation of three lin- versal Dependencies (UD) annotations, show- ing that they often do not match one-to-one. guistic phenomena to better match the BERT We suggest a method for relation identification syntax (x3) and syntactic tree construction. Our approach • We introduce a head ensemble method, com- produces significantly more consistent depen- dency trees than previous work, showing that bining multiple heads which capture the same it better explains the syntactic abstractions in dependency relation label (x4) BERT. • We observe and analyze multipurpose heads, At the same time, it can be successfully ap- containing multiple syntactic functions (x7) plied with only a minimal amount of supervi- sion and generalizes well across languages.
    [Show full text]
  • Introduction. Komi Folklore Studies: Connecting Points1
    https://doi.org/10.7592/FEJF2019.76.introduction INTRODUCTION. KOMI FOLKLORE STUDIES: CONNECTING POINTS1 Liudmila Lobanova Researcher Department of Folklore, Institute of Language, Literature, and History Komi Science Centre, Russian Academy of Sciences, Russia Email: [email protected] Nikolay Kuznetsov Lecturer in Finno-Ugric Languages Department of Finno-Ugric Studies University of Tartu Email: [email protected] The special edition of Folklore: Electronic Journal of Folklore is dedicated to Komi2 folklore and folklore studies. The issue was prepared within the frame- work of cooperation between the Department of Folkloristics of the Estonian Literary Museum and the Folklore Department of the Komi Science Centre by Komi and Estonian folklore researchers. Prior to this, the authors published one of the issues (vol. 17, 2016) of the Sator periodical, which was also dedi- cated to Komi folklore studies. The goal of this issue is to present some of the results of recent Komi folklore studies to wider academic circles, overcoming the natural linguistic obstacles. The majority of articles are written within the research project “Local Folklore Traditions of the European Northeast of Russia: Mechanisms of Development and Adaptation, System of Genres, Ethnocultural Folklore Interaction” (№ AAAA-A17-117021310066-4). The history of Komi folklore studies reveals processes typical for the Rus- sian, Soviet, and post-Soviet research dealing with folklore (the research field extended and became more limited over time), as well as studying the Komi language and culture as part of the general development of Finno-Ugric stud- ies. Traditionally, academician Andreas Sjögren (1794–1855) is considered to have discovered Komi folklore – in 1827, he transcribed folklore texts and published them as examples of the Komi language.
    [Show full text]
  • Second Report Submitted by the Russian Federation Pursuant to The
    ACFC/SR/II(2005)003 SECOND REPORT SUBMITTED BY THE RUSSIAN FEDERATION PURSUANT TO ARTICLE 25, PARAGRAPH 2 OF THE FRAMEWORK CONVENTION FOR THE PROTECTION OF NATIONAL MINORITIES (Received on 26 April 2005) MINISTRY OF REGIONAL DEVELOPMENT OF THE RUSSIAN FEDERATION REPORT OF THE RUSSIAN FEDERATION ON THE IMPLEMENTATION OF PROVISIONS OF THE FRAMEWORK CONVENTION FOR THE PROTECTION OF NATIONAL MINORITIES Report of the Russian Federation on the progress of the second cycle of monitoring in accordance with Article 25 of the Framework Convention for the Protection of National Minorities MOSCOW, 2005 2 Table of contents PREAMBLE ..............................................................................................................................4 1. Introduction........................................................................................................................4 2. The legislation of the Russian Federation for the protection of national minorities rights5 3. Major lines of implementation of the law of the Russian Federation and the Framework Convention for the Protection of National Minorities .............................................................15 3.1. National territorial subdivisions...................................................................................15 3.2 Public associations – national cultural autonomies and national public organizations17 3.3 National minorities in the system of federal government............................................18 3.4 Development of Ethnic Communities’ National
    [Show full text]
  • Finno-Ugric Republics and Their State Languages: Balancing Powers in Constitutional Order in the Early 1990S
    SUSA/JSFOu 94, 2013 Konstantin ZAMYATIN (Helsinki) Finno-Ugric Republics and Their State Languages: Balancing Powers in Constitutional Order in the Early 1990s Most of Russia’s national republics established titular and Russian as co-official state languages in their constitutions of the early 1990s. There is no consensus on the reasons and consequences of this act, whether it should be seen as a mere symbolic gesture, a measure to ensure a language revival, an instrument in political debate or an ethnic institution. From an institutional and comparative perspective, this study explores the constitutional systems of the Finno-Ugric republics and demonstrates that across the republics, the official status of the state languages was among the few references to ethnicity built into their constitutions. However, only in the case of language require- ments for the top officials, its inclusion could be interpreted as an attempt at instrumen- tally using ethnicity for political ends. Otherwise, constitutional recognition of the state languages should be rather understood as an element of institutionalized ethnicity that remains a potential resource for political mobilization. This latter circumstance might clarify why federal authorities could see an obstacle for their Russian nation-building agenda in the official status of languages. 1. Introduction The period of social transformations of the late 1980s and early 1990s in Eastern Europe was characterized by countries’ transition from the communist administra- tive−command systems towards the representative democracy and market economy. One important driving force of change in the Union of Soviet Socialist Republics (USSR) was the rise of popular movements out of national resentment and dissatis- faction with the state-of-the-art in the sphere of inter-ethnic relations.
    [Show full text]
  • ISSN 2221-2698 Arkhangelsk, Russia DOI 10.17238/Issn2221-2698.2016
    ISSN 2221-2698 Arkhangelsk, Russia DOI 10.17238/issn2221-2698.2016.25 Arctic and North. 2016. N 25 2 ISSN 2221-2698 Arctic and North. 2016. N 25. CC BY-SA © Northern (Arctic) Federal University named after M.V. Lomonosov, 2016 © Editorial board of electronic scientific journal “Arctic and North”, 2016 The journal “Arctic and North” is registered at Roskomnadzor as an internet periodical issued in Russian and English, Registration certificate El № FS77-42809, November 26, 2010; at the system of the Russian Science Citation Index (RSCI), license contract № 96-04/2011R, April 12, 2011; Scientific Electronic Library "Сyberleninka" (2016); in the catalogs of international databases: Directory of Open Access Journals — DOAJ (2013); Global Serials Directory Ulrichsweb, USA (2013); NSD, Norway (2015); InfoBase Index, India (2015); ERIH PLUS, Norway (2016). The Journal is issued not less than 4 times per year; 25 issues were published in 2011—2016. The Founder — Northern (Arctic) Federal University named after M.V. Lomonosov (Arkhangelsk, Russia). Editor-in-Chief — Yury F. Lukin, Doctor of Historical Sciences, Professor, Honorary Worker of the higher education of the Russian Federation. All journal issues are available free of charge in Russian and English. Rules and regulations on submission, peer reviews, publication and the Declaration of Ethics are available at: http://narfu.ru/aan/rules/ The Journal is devoted to the scientific articles focused on the Arctic and the North relevant for the following professional degrees: 08.00.00 Economics; 22.00.00 Sociology; 23.00.00 Political science; 24.00.00 Culturology. No payments for publication are collected from authors, including students and post-graduate students.
    [Show full text]
  • Universal Dependencies for Japanese
    Universal Dependencies for Japanese Takaaki Tanaka∗, Yusuke Miyaoy, Masayuki Asahara}, Sumire Uematsuy, Hiroshi Kanayama♠, Shinsuke Mori|, Yuji Matsumotoz ∗NTT Communication Science Labolatories, yNational Institute of Informatics, }National Institute for Japanese Language and Linguistics, ♠IBM Research - Tokyo, |Kyoto University, zNara Institute of Science and Technology [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] Abstract We present an attempt to port the international syntactic annotation scheme, Universal Dependencies, to the Japanese language in this paper. Since the Japanese syntactic structure is usually annotated on the basis of unique chunk-based dependencies, we first introduce word-based dependencies by using a word unit called the Short Unit Word, which usually corresponds to an entry in the lexicon UniDic. Porting is done by mapping the part-of-speech tagset in UniDic to the universal part-of-speech tagset, and converting a constituent-based treebank to a typed dependency tree. The conversion is not straightforward, and we discuss the problems that arose in the conversion and the current solutions. A treebank consisting of 10,000 sentences was built by converting the existent resources and currently released to the public. Keywords: typed dependencies, Short Unit Word, multiword expression, UniDic 1. Introduction 2. Word unit The definition of a word unit is indispensable in UD an- notation, which is not a trivial question for Japanese, since The Universal Dependencies (UD) project has been de- a sentence is not segmented into words or morphemes by veloping cross-linguistically consistent treebank annota- white space in its orthography.
    [Show full text]
  • Universal Dependencies for Persian
    Universal Dependencies for Persian *Mojgan Seraji, **Filip Ginter, *Joakim Nivre *Uppsala University, Department of Linguistics and Philology, Sweden **University of Turku, Department of Information Technology, Finland *firstname.lastname@lingfil.uu.se **figint@utu.fi Abstract The Persian Universal Dependency Treebank (Persian UD) is a recent effort of treebanking Persian with Universal Dependencies (UD), an ongoing project that designs unified and cross-linguistically valid grammatical representations including part-of-speech tags, morphological features, and dependency relations. The Persian UD is the converted version of the Uppsala Persian Dependency Treebank (UPDT) to the universal dependencies framework and consists of nearly 6,000 sentences and 152,871 word tokens with an average sentence length of 25 words. In addition to the universal dependencies syntactic annotation guidelines, the two treebanks differ in tokenization. All words containing unsegmented clitics (pronominal and copula clitics) annotated with complex labels in the UPDT have been separated from the clitics and appear with distinct labels in the Persian UD. The treebank has its original syntactic annota- tion scheme based on Stanford Typed Dependencies. In this paper, we present the approaches taken in the development of the Persian UD. Keywords: Universal Dependencies, Persian, Treebank 1. Introduction this paper, we present how we adapt the Universal Depen- In the past decade, the development of numerous depen- dencies to Persian by converting the Uppsala Persian De- dency parsers for different languages has frequently been pendency Treebank (UPDT) (Seraji, 2015) to the Persian benefited by the use of syntactically annotated resources, Universal Dependencies (Persian UD). First, we briefly de- or treebanks (Bohmov¨ a´ et al., 2003; Haverinen et al., 2010; scribe the Universal Dependencies and then we present the Kromann, 2003; Foth et al., 2014; Seraji et al., 2015; morphosyntactic annotations used in the extended version Vincze et al., 2010).
    [Show full text]
  • Language Technology Meets Documentary Linguistics: What We Have to Tell Each Other
    Language technology meets documentary linguistics: What we have to tell each other Language technology meets documentary linguistics: What we have to tell each other Trond Trosterud Giellatekno, Centre for Saami Language Technology http://giellatekno.uit.no/ . February 15, 2018 . Language technology meets documentary linguistics: What we have to tell each other Contents Introduction Language technology for the documentary linguist Language technology for the language society Conclusion . Language technology meets documentary linguistics: What we have to tell each other Introduction Introduction I Giellatekno: started in 2001 (UiT). Research group for language technology on Saami and other northern languages Gramm. modelling, dictionaries, ICALL, corpus analysis, MT, ... I Trond Trosterud, Lene Antonsen, Ciprian Gerstenberger, Chiara Argese I Divvun: Started in 2005 (UiT < Min. of Local Government). Infrastructure, proofing tools, synthetic speech, terminology I Sjur Moshagen, Thomas Omma, Maja Kappfjell, Børre Gaup, Tomi Pieski, Elena Paulsen, Linda Wiechetek . Language technology meets documentary linguistics: What we have to tell each other Introduction The most important languages we work on . Language technology meets documentary linguistics: What we have to tell each other Language technology for the documentary linguist Language technology for documentary linguistics ... Language technology meets documentary linguistics: What we have to tell each other Language technology for the documentary linguist ... what’s in it for the language community I work for? I Let’s pretend there are two types of language communities: 1. Language communities without plans for revitalisation or use in domains other than oral use 2. Language communities with such plans . Language technology meets documentary linguistics: What we have to tell each other Language technology for the documentary linguist Language communities without such plans I Gather empirical material and do your linguistic analysis I (The triplet: Grammar, text collection and dictionary) I ..
    [Show full text]
  • Foreword to the Special Issue on Uralic Languages
    Northern European Journal of Language Technology, 2016, Vol. 4, Article 1, pp 1–9 DOI 10.3384/nejlt.2000-1533.1641 Foreword to the Special Issue on Uralic Languages Tommi A Pirinen Hamburger Zentrum für Sprachkorpora Universität Hamburg [email protected] Trond Trosterud HSL-fakultehta UiT Norgga árktalaš universitehta [email protected] Francis M. Tyers Veronika Vincze HSL-fakultehta MTA-SZTE UiT Norgga árktalaš universitehta Szegedi Tudomány Egyetem [email protected] [email protected] Eszter Simon Jack Rueter Research Institute for Linguistics Helsingin yliopisto Hungarian Academy of Sciences Nykykielten laitos [email protected] [email protected] March 7, 2017 Abstract In this introduction we have tried to present concisely the history of language tech- nology for Uralic languages up until today, and a bit of a desiderata from the point of view of why we organised this special issue. It is of course not possible to cover everything that has happened in a short introduction like this. We have attempted to cover the beginnings of the (Uralic) language-technology scene in 1980’s as far as it’s relevant to much of the current work, including the ones presented in this issue. We also go through the Uralic area by the main languages to survey on existing resources, to also form a systematic overview of what is missing. Finally we talk about some possible future directions on the pan-Uralic level of language technology management. Northern European Journal of Language Technology, 2016, Vol. 4, Article 1, pp 1–9 DOI 10.3384/nejlt.2000-1533.1641 Figure 1: A map of the Uralic language area show approximate distribution of languages spoken by area.
    [Show full text]
  • The Universal Dependencies Treebank of Spoken Slovenian
    The Universal Dependencies Treebank of Spoken Slovenian Kaja Dobrovoljc1, Joakim Nivre2 1Institute for Applied Slovene Studies Trojina, Ljubljana, Slovenia 1Department of Slovenian Studies, Faculty of Arts, University of Ljubljana 2Department of Linguistics and Philology, Uppsala University [email protected], joakim.nivre@lingfil.uu.se Abstract This paper presents the construction of an open-source dependency treebank of spoken Slovenian, the first syntactically annotated collection of spontaneous speech in Slovenian. The treebank has been manually annotated using the Universal Dependencies annotation scheme, a one-layer syntactic annotation scheme with a high degree of cross-modality, cross-framework and cross-language interoper- ability. In this original application of the scheme to spoken language transcripts, we address a wide spectrum of syntactic particularities in speech, either by extending the scope of application of existing universal labels or by proposing new speech-specific extensions. The initial analysis of the resulting treebank and its comparison with the written Slovenian UD treebank confirms significant syntactic differences between the two language modalities, with spoken data consisting of shorter and more elliptic sentences, less and simpler nominal phrases, and more relations marking disfluencies, interaction, deixis and modality. Keywords: dependency treebank, spontaneous speech, Universal Dependencies 1. Introduction actually out-performs state-of-the-art pipeline approaches (Rasooli and Tetreault, 2013; Honnibal and Johnson, 2014). It is nowadays a well-established fact that data-driven pars- Such heterogeneity of spoken language annotation schemes ing systems used in different speech-processing applica- inevitably leads to a restricted usage of existing spoken tions benefit from learning on annotated spoken data, rather language treebanks in linguistic research and parsing sys- than using models built on written language observation.
    [Show full text]