Converting an English-Swedish Parallel Treebank to Universal Dependencies

Converting an English-Swedish Parallel Treebank to Universal Dependencies Lars Ahrenberg Linkoping¨ University Department of Computer and Information Science [email protected] Abstract which treats punctuation marks and some clitics as separate tokens, but treats all spaces as token The paper reports experiences of automat- separators. Thus, multiword expressions are not ically converting the dependency analy- recognized as such until the dependency layer. sis of the LinES English-Swedish parallel For parts-of-speech a tag set comprising 17 dif- treebank to universal dependencies (UD). ferent tags only is recommended with a basis in The most tangible result is a version of the twelve categories proposed by (Petrov et al., the treebank that actually employs the re- 2012). For an overview, see Table 2 in section 3. lations and parts-of-speech categories re- LinES (Ahrenberg, 2007) is a parallel treebank quired by UD, and no other. It is also currently comprising seven sub-corpora (see Ta- more complete in that punctuation marks ble 1). Future plans for LinES include a substan- have received dependencies, which is not tial increase in the amount of data included. This the case in the original version. We discuss would also entail that new contents would not, as our method in the light of problems that a rule, be manually reviewed. Harmonizing its arise from the desire to keep the syntactic markup with that of other treebanks would make analyses of a parallel treebank internally it possible to develop more accurate taggers and consistent, while available monolingual parsers for it, and thus increase its usefulness as a UD treebanks for English and Swedish di- resource. Conversely, the monolingual treebanks verge somewhat in their use of UD annota- can be used to augment other treebanks for En- tions. Finally, we compare the output from glish or Swedish as training data for parsers and the conversion program with the existing taggers. UD treebanks. Source Segments EN tkns SE tkns 1 Introduction Access help 595 10451 8898 Universal Dependency Annotation (UD) is an ini- Auster 788 13512 13337 tiative taken to increase returns for investments in Bellow 604 10310 9964 multilingual language technology (McDonald et Conrad 622 13063 12092 al., 2013). The idea is that a common set of de- Europarl 594 9334 8715 pendency relations, and a common set of defini- Gordimer 756 15181 15778 tions and guidelines for their application, will bet- Rowlings 605 10299 10635 ter support the development of a common cross- Total 4564 82150 79419 lingual infrastructure for the building of language technology tools such as parsers and translation Table 1: LinES corpora before conversion. systems. UD actually comprises more than just depen- The primary aim of this work is the creation of a dency relations. To be compatible and possible UD-compatible version of LinES, LinES-UD. As to merge in a common collection, the resources far as possible this should happen through auto- for a language should use the same principles of matic conversion. The hypothesis is that LinES tokenization, and common inventories of part-of- markup is sufficient to support automatic conver- speech tags and morphological features. UD ad- sion to universal dependencies for both languages vocates a conservative approach to tokenization, by the same process. 10 Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 10–19, Uppsala, Sweden, August 24–26 2015. The paper is organised as follows. The next Several other UD treebanks have been devel- section reports related work. Section 3 presents oped as a result of automatic conversion, e.g. for the primary differences between the design of the Italian (Bosco et al., 2013), Russian (Lipenkova LinES treebank and the UD framework. In section and Soucek,ˇ 2014), and Finnish (Pyysalo et al., 4 we describe our approach to develop the con- 2015). The process used here for LinES is quite version program, and in section 5 we present and similar to these works with the special twist that discuss the results. Section 6, finally, states the here two parallel treebanks are converted simul- conclusions. taneously. Thus, the approach is rule-based, although the rules are not available in an external 2 Related work rule format, but implemented as conditions and ac- tions in a Perl script. Also, unlike these works no Universal Dependencies is a project involving sev- new language-specific UD-scheme is developed as eral research groups around the world with a com- part of this work, as such schemes exist for English mon interest in treebank development, multilin- and Swedish already. gual parsing and cross-lingual learning (Univer- sal dependencies, 2015). The annotation scheme 3 Differences in design for dependency relations has its roots in universal Stanford dependencies (de Marneffe and Man- The original LinES design has several differences ning, 2008; de Marneffe et al., 2014) and the from the UD treebanks. The differences pertain- project also embraces a slightly extended version ing to parts of speech are fairly small, while differ- of the Google universal tag set for parts-of-speech ences in sentence segmentation, tokenization and (Petrov et al., 2012). At the time of writing tree- dependency analysis are larger. banks using UD are available for download from We first observe that parallel treebanks are often the LINDAT/CLARIN Repository Home for 18 created for different purposes than mono-lingual different languages (Agic´ et al., 2015). treebanks. UD treebanks have parser development The first release of UD treebanks included six as a primary goal, while the most important pur- languages. Two of these, the ones for English pose of the LinES treebank is as a resource for and Swedish, were created by automatic conver- studying the strategies of human translators and sion (McDonald et al., 2013). The English tree- for testing properties that are sometimes claimed bank used the Stanford parser (v1.6.8) on the WSJ to be typical for translated texts. One way to de- section of the Penn treebank for this purpose. scribe the relation between a translation and its The Swedish Talbanken treebank was converted source text is by trying to quantify the amount of by a set of deterministic rules, and the outcome structural changes, or shifts, that have been per- is claimed to have a high precision “due to the formed. Such a task is obviously helped by using fine-grained label set used in the Swedish Tree- the same annotation scheme for both languages bank” (p. 93). The treebanks are divided into and the demands on consistency in application of three sections for the purposes of parser develop- the categories are high. A measure of structural ment, a training part, a development part, and a test change should reflect real differences; if they in- part. We refer to them in the sequel as the English stead are introduced by alternative schemes of to- UD Treebank (EUD) and the Swedish UD Tree- kenization or by the use of different categories or bank (SUD), respectively, using suffixes 1.0 and definitions, the value of the measure is reduced. 1.1 to differentiate the versions. They have been Some of the differences in the available English used extensively in the current project for compar- and Swedish UD treebanks will be detailed in sec- isons. In the most recent release (1.1) some cor- tion 4. Here we only note that they pose prob- rections have been made to both treebanks. As far lems for a developer of parallel English-Swedish as the syntactic annotation is concerned, the cor- treebanks. As just said, in a parallel treebank rections affect less than 1% of the tokens in EUD, we would like to see parallel constructions be and about 4% of the tokens in SUD. Most of the annotated in the same way for both languages, development work on LinES-UD was made with but if they are not annotated this way in the the previous versions as targets, but the compar- (usually much larger) available monolingual tree- isons reported in section 5 refers to the versions banks, the increase in parsing consistency that we 1.1. expect from training the parser on a union of UD- 11 treebanks, will not be as large as it could be. and den dar¨ (that). Although they are not very nu- merous, some 10% of all sentences would contain 3.1 Sentence segmentation a multiword token. As the tokenization principles The largest syntactic unit in LinES is a translation for UD favours a strict adherence to spaces as sep- unit. This means that it should correspond under arators, instead signalling multiword expressions translation to a similar unit in the other language. in the dependency annotation, the conversion to When the translator has chosen to translate one UD must retokenize the data. English sentence by two Swedish sentences, or The treatment of clitics in LinES are largely the two English sentences by one Swedish sentence, same as in UD with one exception, the English s- LinES treats the two sentences as a single sen- genitive. This is treated as a separate token in the tential unit sharing a single root token. From the English UD treebank, but in LinES it is taken as a monolingual perspective there are two sentences, morpheme, both for English and Swedish. While each with its own root, but from the bilingual arguments can be given to treat the s-genitive as perspective there is a single unit and a single a phrasal clitic also in Swedish, it is usually not root. The two sentences can be analysed as either done, because it is harder to detect in Swedish than being coordinated or one being subordinated to in English.

Load more