Universal Dependencies for Swedish Sign Language

Robert Östling, Carl Börstell, Moa Gärdenfors, Mats Wirén
Department of Linguistics
Stockholm University
{robert,calle,moa.gardenfors,mats.wiren}@ling.su.se

Proceedings of the 21st Nordic Conference of Computational Linguistics, pages 303–308, Gothenburg, Sweden, 23-24 May 2017. © 2017 Linköping University Electronic Press

Abstract

We describe the first effort to annotate a signed language with syntactic dependency structure: the Swedish Sign Language portion of the Universal Dependencies treebanks. The visual modality presents some unique challenges in analysis and annotation, such as the possibility of both hands articulating separate signs simultaneously, which has implications for the concept of projectivity in dependency grammars. Our data is sourced from the Swedish Sign Language Corpus, and if used in conjunction these resources contain very richly annotated data: dependency structure and parts of speech, video recordings, signer metadata, and, since the whole material is also translated into Swedish, a parallel text.

1 Introduction

The Universal Dependencies (UD) project (Nivre et al., 2016b) has produced a language-independent but extensible standard for morphological and syntactic annotation using a formalism based on dependency grammar. This standard has been used to create the Universal Dependencies treebanks (Nivre et al., 2016a), whose latest release at the time of writing (version 1.4) contains 64 treebanks in 47 languages, one of which is Swedish Sign Language (SSL, ISO 639-3: SWL), the topic of this article.

There are very few sign languages for which there are corpora. Most of the available sign language corpora feature only simple sign segmentation and annotations, often also with some type of translation into a spoken language (either as written translations or as spoken voice-overs). Sign language corpora with more extensive syntactic annotation are limited to Australian Sign Language, whose corpus contains some basic syntactic segmentation and annotation (Johnston, 2014). Apart from this, smaller parts of the corpora of Finnish Sign Language (Jantunen et al., 2016) and Polish Sign Language (Rutkowski and Łozińska, 2016) have had some syntactic segmentation and analysis, and another such project is under way on British Sign Language.[1]

To the best of our knowledge, we present the first dependency annotation and parsing experiments with sign language data. This brings us one step closer to the goal of bridging the gap in availability between written, spoken and sign language natural language processing tools.

[1] http://www.bslcorpusproject.org/projects/bsl-syntax-project/

2 Universal Dependencies

The Universal Dependencies project aims to provide uniform morphological and syntactic (in the form of dependency trees) annotations across languages (Nivre et al., 2016b).[2] Built on a language-universal common core of 17 parts of speech and 40 dependency relations, there are also language-specific guidelines which interpret and, when necessary, extend those in the context of a given language.

[2] Note that our work predates version 2 of the UD guidelines, and is based on the first version.

3 Swedish Sign Language

Swedish Sign Language (SSL) is the main sign language of the Swedish Deaf community.[3] It is estimated to be used by at least 10,000 people as one of their primary languages, and is the only sign language to be recognized in Swedish law, giving it a special status alongside the official minority languages (Ahlgren and Bergman, 2006; Parkvall, 2015). The history of SSL goes back at least 200 years, to the inauguration of the first Deaf school in Sweden, and SSL has also influenced the two sign languages of Finland (i.e. Finnish Sign Language and Finland-Swedish Sign Language), with which it can be said to be related (Bergman and Engberg-Pedersen, 2010).

[3] Capital D "Deaf" is generally used to refer to the language community as a cultural and linguistic group, rather than 'deaf' as a medical label.

4 Data source

The SSL Corpus Project ran during the years 2009–2011 with the intention to establish the first systematically designed and publicly available corpus of SSL, resulting in the SSL Corpus (SSLC). Approximately 24 hours of video data of pairs of signers conversing was recorded, comprising 42 signers of different age, gender, and geographical background, spanning 300 individual video files (Mesch, 2012). The translation and annotation work is still on-going, with new releases being made available online as the work moves forward. The last official release of the SSLC includes just under 7 hours of video data (Mesch et al., 2012) along with annotation files containing 53,625 sign tokens across 6,197 sign types (Mesch, 2016).

The corpus is annotated using the ELAN software (Wittenburg et al., 2006), and the annotation files are distributed in the corresponding XML-based .eaf format. Each annotation file contains tiers on which annotations are aligned with the video file, both video and annotation tiers being visible in the ELAN interface (see Figure 1). The SSLC annotation files currently include tiers for sign glosses, and others for Swedish translations. Sign glosses are written word labels that represent signs with approximate meanings (e.g. PRO1 for a first person pronoun). The sign gloss annotation tiers are thus segmented for lexical items (i.e. individual signs), and come in pairs for each signer, with each tier representing one of the signer's hands (one tier for the so-called dominant hand, and another for the non-dominant hand) (Mesch and Wallin, 2015).[4] Sign glosses also contain a part-of-speech (PoS) tag which has been derived by manually correcting the output of a semi-automatic method for cross-lingual PoS tagging (Östling et al., 2015). The translation tier is segmented into longer chunks, representing stretches of discourse that can be represented by an idiomatic Swedish translation. However, the translation segmentations do not represent clausal boundaries in either SSL or Swedish (Börstell et al., 2014). More recently, a portion of the SSLC was segmented into clausal units and annotated for basic syntactic roles (Börstell et al., 2016), which led to the current UD annotation work. Figure 1 shows the basic view of the SSLC videos and annotations in the ELAN software, with tiers for sign glosses and translations on the video timeline.

[4] The dominant hand is defined as the hand preferred by a signer when signing a one-handed sign.

5 Annotation procedure and principles for SSL

For practical purposes, annotation was performed by extending the ELAN files of our source material from the SSLC project (see Figure 2 for an example). These annotations were automatically converted to the CoNLL-U format used by Universal Dependencies.

The annotation of UD-based syntactic structure started by coming up with a procedure for annotating a signed language using ELAN. Signed language is more simultaneous than spoken language, particularly in the use of paired parallel articulators in the form of the signer's two hands (Vermeerbergen et al., 2007). We handle this by allowing signs from both hands into the same tree structure, which leads to well-formed trees consistent with the dependency grammar formalism's single-head, connectedness and acyclicity constraints. These trees can however have some unusual properties compared to spoken languages. For the purpose of conforming to the CoNLL-U data format, which requires an ordered sequence of tokens, we sort signs by their chronological order. The chronological order spans both sign tiers per signer, and is defined by the onset time of each sign annotation. When two signs, one on each hand tier (i.e. dominant vs. non-dominant hand), have identical onsets, favor is given to the sign articulated by the signer's dominant hand. This working definition is by no means the only reasonable linearization, which means that the notion of projectivity to some extent loses its meaning. A tree can be considered projective or non-projective depending on how the ordering of simultaneously articulated signs is defined, assuming one wants to impose such an ordering in the first place.

Because the source material contains no segmentation above the sign level, we decided to use [...]

Figure 1: Screenshot of an SSLC file in ELAN. This is the material we base our dependency annotations on, and the annotator can easily view the source video recording.

Figure 2: Screenshot zooming into the UD annotation tiers and sign–dependency linking for the utterance from Figure 1. This is the interface used by the annotator.

Figure 3: The example from Figure 1 and Figure 2 with dependency annotations visualized. The (Q) suffix on the ÄTA(Q) gloss indicates which of the multiple signs for 'eat' in SSL is used in this case.
Glosses: SÄTTA-SIG ÄTA(Q) TITTA-PÅ SNÖ^GUBBE PEK ÄTA(Q)
English glosses: SIT-DOWN EAT(Q) LOOK-AT SNOW^OLD-MAN POINT EAT(Q)
PoS tags: verb verb verb noun det verb
Dependency relations in the tree: conj, root, conj, conj, dobj, det
Translation: 'He is sitting there eating looking out at the snowman.'
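The linearization rule from Section 5 (sort signs by onset time, and on identical onsets place the dominant-hand sign first) and its effect on projectivity can be sketched in Python. This is only an illustration under stated assumptions, not the conversion script used for the treebank: the `Sign` class, the function names, the onset times and the four-sign tree are all invented for the example, and projectivity is checked with the common crossing-arcs test with an artificial root at position 0.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Sign:
    gloss: str       # hypothetical label; must be unique in this toy example
    onset_ms: int    # onset time of the sign annotation
    dominant: bool   # True if the sign is on the dominant-hand tier


def linearize(signs: List[Sign], dominant_first: bool = True) -> List[Sign]:
    """Sort signs by onset; on identical onsets, prefer one hand's tier."""
    return sorted(signs, key=lambda s: (s.onset_ms, s.dominant != dominant_first))


def is_projective(order: List[Sign], head_of: Dict[str, Optional[str]]) -> bool:
    """True if no two arcs cross; the artificial root sits at position 0."""
    pos = {s.gloss: i + 1 for i, s in enumerate(order)}
    arcs = []
    for s in order:
        head = head_of[s.gloss]
        a, b = pos[s.gloss], (pos[head] if head is not None else 0)
        arcs.append((min(a, b), max(a, b)))
    return not any(a1 < a2 < b1 < b2 for a1, b1 in arcs for a2, b2 in arcs)


# Invented data: B (dominant hand) and C (non-dominant hand) start together.
SIGNS = [Sign("A", 0, True), Sign("B", 100, True),
         Sign("C", 100, False), Sign("D", 200, True)]
HEAD_OF = {"A": None, "B": "A", "C": "A", "D": "B"}

for dominant_first in (True, False):
    order = linearize(SIGNS, dominant_first)
    print([s.gloss for s in order], is_projective(order, HEAD_OF))
```

In this invented example the same tree has crossing arcs under the dominant-hand-first order (A, B, C, D) but none when the tie is broken the other way (A, C, B, D), which illustrates why projectivity depends on how simultaneous signs are ordered.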
7 Dependency parsing

Given that this is the first sign language UD treebank, we decided to perform some dependency parsing experiments to establish baseline results. We use the parser of Straka et al. (2015), part of the UDPipe toolkit (Straka et al., 2016), for our experiments. The training (334 tokens), development (48 tokens) and test (290 tokens) split from UD treebanks 1.4 was used. A hundred iterations of random hyperparameter search were performed for each of their parser models (projective, partially non-projective and fully non-projective), and the model with the highest development set accuracy was chosen.

Figure 4: Distribution of tree sizes for the Swedish [...] (x-axis: number of sign tokens; y-axis: number of trees).
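The model selection procedure described above (random hyperparameter search per parser model, keeping the configuration with the highest development set accuracy) might look roughly like the following Python sketch. It does not call UDPipe: `train_and_score` is a hypothetical callback that the reader would implement to train a parser and return its development accuracy, and the search space in `sample_params` is an assumption, since the text does not list the hyperparameters that were searched.

```python
import random
from typing import Callable, Dict, Tuple

MODELS = ("projective", "partially non-projective", "fully non-projective")


def sample_params(rng: random.Random) -> Dict[str, float]:
    # Hypothetical search space; the paper does not list its hyperparameters.
    return {"learning_rate": 10 ** rng.uniform(-3, -1),
            "hidden_size": rng.choice([50, 100, 200])}


def random_search(train_and_score: Callable[[str, Dict[str, float]], float],
                  iterations: int = 100,
                  seed: int = 0) -> Tuple[float, str, Dict[str, float]]:
    """Return (best dev score, model name, parameters) over all models."""
    rng = random.Random(seed)
    best: Tuple[float, str, Dict[str, float]] = (float("-inf"), "", {})
    for model in MODELS:
        for _ in range(iterations):
            params = sample_params(rng)
            score = train_and_score(model, params)  # dev-set accuracy
            if score > best[0]:
                best = (score, model, params)
    return best
```

With the default of 100 iterations and three parser models, the callback is invoked 300 times. Note that with a development set of only 48 tokens, differences between configurations may be small and noisy.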