The LIA Treebank of Spoken Norwegian Dialects
Total Page:16
File Type:pdf, Size:1020Kb
The LIA Treebank of Spoken Norwegian Dialects Lilja Øvrelidy, Andre Kasen˚ yz, Kristin Hagenz, Anders Nøklestadz, Per Erik Solberg∗ and Janne Bondi Johannessenz y Department of Informatics, University of Oslo z Department of Linguistics and Scandinavian Studies, University of Oslo ∗ Department of Philosophy, Classics, History of Arts and Ideas, University of Oslo fliljao,andrekaag@ifi.uio.no, fkristin.hagen,j.b.johannessen, [email protected], p.e.solberg@ifikk.uio.no Abstract This article presents the LIA treebank of transcribed spoken Norwegian dialects. It consists of dialect recordings made in the period between 1950–1990, which have been digitised, transcribed, and subsequently annotated with morphological and dependency-style syntactic analysis as part of the LIA (Language Infrastructure made Accessible) project at the University of Oslo. In this article, we describe the LIA material of dialect recordings and its transcription, transliteration and further morphosyntactic annotation. We focus in particular on the extension of the native NDT annotation scheme to spoken language phenomena, such as pauses and various types of disfluencies, and present the subsequent conversion of the treebank to the Universal Dependencies scheme. The treebank currently consists of 13,608 tokens, distributed over 1396 segments taken from three different dialects of spoken Norwegian. The LIA treebank annotation is an on-going effort and future releases will extend on the current data set. Keywords: treebanks, spoken language, dialects, Norwegian, Universal Dependencies 1. Introduction 2. The LIA material Large-scale initiatives like the CoNLL shared tasks on de- The LIA project (Language Infrastructure made Accessi- pendency parsing (Surdeanu et al., 2008), the Universal ble) is a five-year national collaboration project between Dependencies (UD) initiative (Nivre et al., 2016) and the four Norwegian universities (University of Oslo, Univer- recent shared task on multilingual parsing from raw text sity of Bergen, University of Tromsø and The Norwegian (Zeman et al., 2017) have made available syntactic tree- University of Science and Technology), Norsk ordbok 2014 banks for a large number of languages, thus enabling pars- and Sprakbanken˚ at the National Library, in addition to in- ing research for a wide variety of languages. Available tree- ternational partners. banks are still, however, largely based on written textual The main aim of the LIA project is to collect dialect record- resources, with a few exceptions (Dobrovoljc and Nivre, ings from the four participating universities, digitise them, 2016; Ostling¨ et al., 2017). inventorise, catalogue and safely store them and make them The LIA project1 has as its main objective to create a cor- accessible for further research. The most interesting record- pus consisting of old dialect recordings and make these ac- ings are transcribed and text-sound synchronised with the cessible for research in linguistics and digital humanities. transcription tool ELAN2. Finally they are morphologically By digitization, transcription and further linguistic process- tagged and parsed. This process is described below. The fi- ing, this corpus can play an important role in the diachronic nal outcome of the LIA project is a user-friendly searchable study of Norwegian dialects and more generally the linguis- dialect corpus. tic variation in Norway. This article describes the LIA tree- The audio files that constitute the data set are recorded be- bank of spoken Norwegian dialects. A longterm goal of this tween 1950 and 1990 in order to explore and survey the work is to develop a parser for spoken Norwegian, with the many different dialects in Norway. Sometimes the research immediate goal of parsing the whole LIA material. This questions also concern person or place names. Most of the will enable more fine-grained linguistic analyses to be car- informants are older people who are native speakers of their ried out over the material. dialect. Typically, the recordings are interviews about old In this paper we present the LIA data set, its transcrip- trades such as agriculture, fisheries, logging and life at the tion and subsequent morphological and syntactic annota- summer farm. Other topics are weaving, knitting, baking tion, with a focus on the extended annotation guidelines or dialects. The recordings are semi-formal or informal and of the Norwegian Dependency Treebank (NDT) for spoken often take place in an informant’s home. language phenomena and the conversion of the treebanked 2.1. Transcription data to the Universal Dependencies (UD) scheme (de Marn- effe et al., 2014). The UD version of the treebank was made The LIA project makes use of a semi-phonetic transcription available with the v2.1 release of the UD treebanks (Nivre standard, similar to that of Papazian and Helleland (2005) et al., 2017). and described in Hagen et al. (2017). This standard is cho- sen mainly to conserve particularities in the different di- 1http://www.hf.uio.no/iln/ alects. english/research/projects/ The speech flow is separated into what we call segments. language-infrastructure-made-accessible A segment is our spoken language approximation of a sen- 2https://tla.mpi.nl/tools/tla-tools/elan 4482 tence. Few special characters are in use. The exceptions are Head Dependent the Norwegian letters æ, ø and a,˚ quotation marks for indi- cating indirect speech, ‘#’ signifying a pause in the speech Preposition Prepositional complement flow and variations of ‘+’ and ‘%’ combined with a letter Finite verb Complementizer indicating unclear speech or laughter etc. The ‘%’ charac- First conjunct Subsequent conjuncts ter followed by a letter represents an independent incident Finite auxiliary Lexical/main verb in the speech flow like laughter, coughing etc. The ‘+’ char- Noun Determiner acterizes the following word or word group. ‘+u’ means for instance that the following word(s) are unclear. ‘+x’ means Table 1: Annotation choices in the NDT that the word(s) are not listed in the dictionary. The vari- ants of ‘+’ and ‘%’ are stripped from the transcripts prior to 3. Morphosyntactic annotation further morphosyntactic processing. This is done under the For grammatical phenomena which are not specific to spo- assumption that these phenomena do not have any syntac- ken language, we have followed the annotation scheme of tic significance. They are inserted back into the transcripts the Norwegian Dependency Treebank (NDT) (Solberg et before the transcripts are made available for search online. al., 2014). An important reason for this choice was the de- 2.2. Transliteration tailed language-specific annotation guidelines which were developed for the NDT project (Kinn et al., 2014). These Before tokenization and lemmatization the semi-phonetic guidelines are custom-made for Norwegian, following the transcriptions are semi-automatically transliterated to stan- Norwegian Reference Grammar (Faarlund et al., 1997) dard Norwegian Nynorsk3 orthography by the Oslo closely. Furthermore, the NDT scheme has performed well Transliterator4. The transliterator can be trained to translit- in previous measures of inter-annotator agreement. Solberg erate any dialect or language variety into any other ortho- et al. (2014) report agreement scores of 96.8% unlabeled graphical representation, and it is so far trained on more and 95.3% labeled accuracy and Skjærholt (2014) quan- than 100 Norwegian dialects in the LIA project. The out- tified inter-annotator agreement using a chance-corrected come from the transliterator is manually corrected and the metric derived from Krippendorff’s α and showed that resulting pair of transcriptions are used for training the agreement on the NDT data is high: scoring an α of about transliterator for this particular dialect, improving perfor- 98%, among the highest of all the data sets studied. This mance on subsequent transliterations of that dialect. annotation scheme was therefore a natural choice for the The Oslo Transliterator has a web interface where the tran- current project. An automatic conversion procedure to Uni- scriptions can be uploaded and associated with the appro- versal Dependencies has furthermore been developed for priate dialect. The transcriptions are divided into smaller the written NDT data set (Øvrelid and Hohle, 2016; Vell- parts, which are transliterated one by one. Each part is man- dal et al., 2017). It is in other words possible to convert ually corrected and added to the training material before the the LIA treebank to the UD annotation scheme, with a few transliterator is trained once more and performs better on modifications, see section 5. the next transcription part. The results of each iteration of the training process are stored in a MySQL database. When 3.1. The NDT scheme all parts are completed, the transcriptions can be down- The Norwegian Dependency Treebank contains manually loaded as ELAN files with the semi-phonetic transcription annotated syntactic and morphological information for both and the orthograpic transcription as separate layers. varieties of Norwegian. The part-of-speech annotation fol- New dialects can also be registered in the web interface. lows the Oslo-Bergen Tagger scheme (Hagen et al., 2000). Instead of starting from scratch on the new dialect, the This scheme also marks inflectional features such as tense, transliterator employs a technique in which suggestions for number, gender and categories such as demonstrative and transliterated word forms for the new dialect are based on