The Universal Dependencies Treebank of Spoken Slovenian

Kaja Dobrovoljc¹, Joakim Nivre²
¹ Institute for Applied Slovene Studies Trojina, Ljubljana, Slovenia
¹ Department of Slovenian Studies, Faculty of Arts, University of Ljubljana
² Department of Linguistics and Philology, Uppsala University
[email protected], joakim.nivre@lingfil.uu.se

Abstract

This paper presents the construction of an open-source dependency treebank of spoken Slovenian, the first syntactically annotated collection of spontaneous speech in Slovenian. The treebank has been manually annotated using the Universal Dependencies annotation scheme, a one-layer syntactic annotation scheme with a high degree of cross-modality, cross-framework and cross-language interoperability. In this original application of the scheme to spoken language transcripts, we address a wide spectrum of syntactic particularities in speech, either by extending the scope of application of existing universal labels or by proposing new speech-specific extensions. The initial analysis of the resulting treebank and its comparison with the written Slovenian UD treebank confirms significant syntactic differences between the two language modalities, with spoken data consisting of shorter and more elliptic sentences, fewer and simpler nominal phrases, and more relations marking disfluencies, interaction, deixis and modality.

Keywords: dependency treebank, spontaneous speech, Universal Dependencies

1. Introduction

It is nowadays a well-established fact that data-driven parsing systems used in different speech-processing applications benefit from learning on annotated spoken data, rather than using models built on written language observation. Since the influential syntactic annotation of the Switchboard section of the Penn Treebank (Godfrey et al., 1992; Marcus et al., 1993), several syntactically annotated spoken corpora have emerged, such as the Verbmobil treebanks for English, German and Japanese (Hinrichs et al., 2000), the CGN treebank for Dutch (van der Wouden et al., 2002), the NoTa treebank for Norwegian (Johannessen and Jørgensen, 2006), the PDTSL treebank for Czech (Hajič et al., 2008), and the Rhapsodie treebank for French (Lacheret et al., 2014). However, until now, no syntactically annotated data has been available for spoken Slovenian.

In addition to differences in the underlying phrase-based or dependency-based grammatical formalisms, existing spoken treebanks vary considerably in their approach to the annotation of syntactic particularities of spoken language, even though these are not generally considered language-specific. On one side of the spectrum are annotation schemes providing syntactic analysis of all transcribed lexical tokens, typically by introducing new labels for speech-specific phenomena, while on the other side we find schemes in which only well-formed, written-like constructions are included in the resulting syntactic trees, disregarding disfluencies and other types of 'noisy' structural particularities.

This prevalent multi-layer approach has partially been motivated by the data-driven parsing systems themselves, which usually adopt a two-pass pipeline architecture in which the structural particularities are first removed, followed by parsing of the normalized transcriptions (Charniak and Johnson, 2001). Recent advances in parsing systems using non-monotonic transition-based algorithms, however, show that joint treatment of disfluencies and other syntactic relations actually outperforms state-of-the-art pipeline approaches (Rasooli and Tetreault, 2013; Honnibal and Johnson, 2014).

Such heterogeneity of spoken language annotation schemes inevitably leads to a restricted usage of existing spoken language treebanks in linguistic research and parsing systems alike, limiting any direct comparison between spoken language treebanks of different formalisms, modalities (spoken or written) or languages. The need for a cross-linguistically harmonized treatment of non-language-specific phenomena is even more important in the field of spoken language resources, as these are still very limited in terms of number, size and availability due to their costly construction.

To ensure its wide and long-term usability, the new treebank of spoken Slovenian adopts the recently proposed Universal Dependencies annotation scheme, aimed at cross-linguistically consistent dependency treebank annotation. In the following part of this paper, we first briefly describe the process of the treebank construction and the general principles related to its tokenization, segmentation and spelling. Given that this is the first attempt to apply the Universal Dependencies scheme to extensive spoken data, we then present its adaptation for various types of speech-specific phenomena, describe the annotation process, and show how the new spoken treebank compares to its written counterpart.

2. Treebank Construction

The Spoken Slovenian Treebank is a sample of the Gos reference corpus of Spoken Slovenian (Zwitter Vitez et al., 2013), a collection of audio recordings and transcripts of approximately 120 hours (1 million words) of monologic, dialogic and multi-party spontaneous speech in different everyday situations. The reference corpus was balanced to be representative of speakers (sex, age, region, education), communication channels (TV, radio, telephone, personal contact) and spoken communication settings, broadly categorized into public informative and educational (TV and radio shows, interviews, debates; school lessons, academic lectures), public entertainment (talk shows, morning radio shows, sports broadcasting), non-public non-private (work meetings, consultations, sale and other services) and non-public private (conversations between friends or family).

To ensure a similar distribution of text type, channel and speaker demographics to the reference corpus, the Spoken Slovenian Treebank was sampled by taking a random segment with a proportional number of tokens from each of the 287 texts in the original corpus. Each sampled text segment consists of one or more subsequent turns (units of speech by one speaker), which in turn consist of one or more utterances (semantically, syntactically and acoustically delimited units, roughly corresponding to written sentences). Thus, the large majority of texts in the treebank include longer continuous spans of complete turns by one or more speakers, enabling posterior extension, re-segmentation or addition of other layers of linguistic annotation, such as discourse relations annotation or dialogue act annotation. A detailed description of the sampled treebank, currently amounting to 3,188 utterances or 29,468 tokens, is given in Table 1.

Type    Texts  Speakers  Turns  Utterances  Tokens
PI        129       263    703         959   9,898
PE         42        78    499         726   6,826
NN         45       102    425         497   4,535
NP         71       163    833       1,006   8,209
Total     287       606  2,460       3,188  29,468

Table 1: Treebank size by text type: PI = public informative and educational; PE = public entertainment; NN = non-public non-private; NP = non-public private.

3. Segmentation, Tokenization and Spelling

Typically, spoken language annotation denotes annotation of its representation in the form of written transcription. In the Spoken Slovenian Treebank, the spelling, tokenization and segmentation principles follow the [...] nauš 'you won't' and the normalized ne boš 'you will not'), the normalized tokenization is selected, but the mapping of both spellings is maintained.

In addition to lexical tokens (words), the transcripts also include tokens signalling filled pauses (fillers), unfinished or incomprehensible words, as well as extralinguistic tokens marking basic prosodic information, such as exclamation or interrogation intonation markers, silent pauses (if longer than 1.5 sec), non-turn-taking speaker interruptions, vocal sounds (e.g. laughing, sighing, yawning) and non-vocal sounds (e.g. applauding, ringing). In the treebank, all transcription tokens are considered nodes of dependency trees; however, it is a straightforward task to filter out the non-lexical tokens and obtain representations with words only.

4. Annotation Scheme

4.1. Universal Dependencies

Universal Dependencies is a recently proposed annotation scheme for the development of cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective (Nivre, 2015). It is the result of previous similar standardization projects (Zeman, 2008; Petrov et al., 2012; Marneffe et al., 2014) and has already been applied to more than 30 different languages (Nivre et al., 2015), including (written) Slovenian. A detailed description of the design principles and the relation taxonomy is given in Nivre (2015) and Nivre et al. (2016), with the main principles being that dependency relations hold primarily between content words, function words attach to the content word they specify, and punctuation marks attach to the clause or phrase to which they belong. The basic dependency representation forms a tree, but additional dependencies can be added in the so-called enhanced representation.

From the perspective of spoken language annotation, the two most important features are that the universal taxonomy already includes labels for several speech-specific loose-joining syntactic relations, such as reparandum, parataxis, discourse, dislocated, and vocative, and that the scheme design allows for language-specific extensions,
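
The proportional sampling described in Section 2 can be illustrated with a minimal sketch. It assumes that each text is available as a list of turns, each turn being a list of tokens, and that a sample is a run of whole consecutive turns starting at a random position. The function name `sample_segment`, the `fraction` parameter and the turn-level granularity are assumptions made for illustration; the text does not specify the exact procedure that was used.

```python
import random


def sample_segment(turns, fraction, rng):
    """Pick a random run of whole consecutive turns covering about
    `fraction` of the tokens in one text.

    Assumes a non-empty text and 0 < fraction <= 1 (illustrative only).
    """
    total = sum(len(turn) for turn in turns)
    target = max(1, round(total * fraction))

    # Number of tokens from each turn to the end of the text.
    remaining, suffix = 0, [0] * len(turns)
    for i in range(len(turns) - 1, -1, -1):
        remaining += len(turns[i])
        suffix[i] = remaining

    # Only start where enough tokens remain to reach the target.
    valid_starts = [i for i, n in enumerate(suffix) if n >= target]
    start = rng.choice(valid_starts)

    segment, count = [], 0
    for turn in turns[start:]:
        if count >= target:
            break
        segment.append(turn)
        count += len(turn)
    return segment
```

Taking the reported sizes at face value (29,468 tokens drawn from approximately 1 million words), `fraction` would be roughly 0.03. Because whole turns are kept, a segment may slightly exceed its target size.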
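
The remark in Section 3 that non-lexical tokens can be filtered out to obtain words-only representations can be sketched as follows. The `Token` class, its `lexical` flag, the bracketed placeholder form and the `dep` label are hypothetical; how non-lexical tokens are actually marked in the treebank files is not stated here. The sketch removes flagged tokens, renumbers the remaining ones, and reattaches any word whose head was removed to its nearest retained ancestor.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int        # 1-based position in the utterance
    form: str
    head: int      # id of the head token, 0 for the root
    deprel: str
    lexical: bool  # hypothetical flag: False for pauses, sounds, etc.


def words_only(tokens):
    """Return a copy of the tree restricted to lexical tokens."""
    by_id = {t.id: t for t in tokens}
    kept = [t for t in tokens if t.lexical]
    new_id = {t.id: i for i, t in enumerate(kept, start=1)}

    def retained_head(head):
        # Climb past removed tokens to the nearest retained ancestor.
        while head != 0 and not by_id[head].lexical:
            head = by_id[head].head
        return head

    result = []
    for t in kept:
        head = retained_head(t.head)
        # If every ancestor was removed, the word becomes a root (head 0).
        result.append(Token(new_id[t.id], t.form, new_id.get(head, 0),
                            t.deprel, True))
    return result


# Toy utterance with English placeholders, not real treebank data.
utterance = [
    Token(1, "[pause]", 3, "dep", False),
    Token(2, "I", 3, "nsubj", True),
    Token(3, "know", 0, "root", True),
]
filtered = words_only(utterance)
```

If the removed tokens are always leaves, the climbing step is never triggered and the operation reduces to deletion plus renumbering. If a removed token was itself the root, several words may end up with head 0 and would need a further attachment decision.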
