Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format

Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format Alina Wróblewska Institute of Computer Science Polish Academy of Sciences ul. Jana Kazimierza 5 01-248 Warsaw, Poland [email protected] Abstract even for languages with rich morphology and relatively free word order, such as Polish. The paper presents the largest Polish Depen- The supervised learning methods require gold- dency Bank in Universal Dependencies format – PDBUD – with 22K trees and 352K standard training data, whose creation is a time- tokens. PDBUD builds on its previous ver- consuming and expensive process. Nevertheless, sion, i.e. the Polish UD treebank (PL-SZ), dependency treebanks have been created for many and contains all 8K PL-SZ trees. The PL- languages, in particular within the Universal De- SZ trees are checked and possibly corrected pendencies initiative (UD, Nivre et al., 2016). in the current edition of PDBUD. Further The UD leaders aim at developing a cross- 14K trees are automatically converted from linguistically consistent tree annotation schema a new version of Polish Dependency Bank. and at building a large multilingual collection of The PDBUD trees are expanded with the enhanced edges encoding the shared dependents dependency treebanks annotated according to this and the shared governors of the coordinated schema. conjuncts and with the semantic roles of some Polish is also represented in the Universal dependents. The conducted evaluation exper- Dependencies collection. There are two Polish iments show that PDBUD is large enough treebanks in UD: the Polish UD treebank (PL- for training a high-quality graph-based depen- SZ) converted from Składnica zalezno˙ sciowa´ 1 and dency parser for Polish. the LFG enhanced UD treebank (PL-LFG) con- 1 Introduction verted from a corpus of the Polish LFG structures.2 PL-SZ contains more than 8K sentences Natural language processing (NLP) is nowadays with 10.1 tokens per sentence on average. PL-LFG dominated by machine learning methods, espe- is larger and contains more than 17K sentences, cially deep learning methods. Data-driven NLP but the average number of tokens per sentence is tools not only perform more accurately than only 7.6.3 rule-based tools, but are also easier to develop. This paper presents the largest Polish Depen- The shift towards machine learning methods is dency Bank in Universal Dependencies format also visible in syntactic parsing, especially depen- – PDBUD4 – with 22K trees and 352K to- dency parsing. The vast majority of the contempo- kens (hence 15.8 tokens per sentence on aver- rary dependency parsing systems (e.g. Nivre et al., age). PDBUD builds on its previous version, i.e. 2006; Bohnet, 2010; Dozat et al., 2017; Straka and the Polish UD treebank (PL-SZ), and contains all Straková, 2017) take advantage of machine learn- 8K PL-SZ trees. The PL-SZ trees are checked ing methods. Based on training data, parsers learn and possibly corrected in the current edition of to analyse sentences and to predict the most appro- priate dependency structures of these sentences. 1Składnica zalezno˙ sciowa´ was converted to the UD for- Even if various learning methods were applied to mat by Zeman et al.(2014). 2LFG structures were converted by A. Przepiórkowski data-driven dependency parsing (e.g. Jiang et al., and A. Patejuk. 2016), the best results so far are given by the su- 3A detailed comparison of PL-SZ and PL-LFG is pre- pervised methods (cf. Zeman et al., 2017). Su- sented on http://universaldependencies.org/ treebanks/pl-comparison.html. pervised dependency parsers trained on correctly 4PDBUD is publicly available on http://zil. annotated data achieve high parsing performance ipipan.waw.pl/PDB. PDBUD. Further 14K trees are automatically con- Składnica. The projection-based trees were also verted from a new version of Polish Dependency more complex and 235 of them are non-projective Bank (PDB, see Section2). Polish sentences un- (i.e. 5.9% of all added trees). The entire set of derlying the additional PDB trees contain prob- Składnica trees and the projection-based trees is lematic linguistic phenomena whose conversion called Polish Dependency Bank (PDB). requires some modifications of the UD annotation PDB is still being developed at the Institute of schema (see Section3). Furthermore, the PDBUD Computer Science PAS. The current version of trees are expanded with the enhanced edges en- PDB is enlarged with a suite of 10K sentences an- coding the shared dependents and the shared gov- notated with the dependency trees. The additional ernors of the coordinated conjuncts (see Section sentences are relatively complex (20.5 tokens per 4) and with the semantic roles of some dependents sentence on average) and come from Polish Na- (see Section5). Finally, we conduct some eval- tional Corpus (Przepiórkowski et al., 2012), Pol- uation experiments. The evaluation results show ish CDSCorpus6 (Wróblewska and Krasnowska- that PDBUD is large enough for training a high- Kieras´, 2017), and literature. There are 1388 non- quality graph-based dependency parser for Polish projective trees in this set (i.e. 13.9% of 10K (see Section6). trees). Besides enlarging PDB, the development consists in correcting the previous PDB trees. 2 Polish Dependency Bank The Składnica trees and the projection-based trees are manually checked and corrected if necessary. 2.1 PDB The current version of PDB consists of more The first Polish dependency treebank – Składnica than 22K trees with 15.8 tokens per sentence zalezno˙ sciowa´ (Wróblewska, 2012) – was a col- on average (see Table1). There are 1912 non- lection of about 8K trees which were automati- projective trees in PDB (i.e. 8.61% of all trees). cally converted from Polish constituent trees of Składnica frazowa (Wolinski´ et al., 2011). All sentences of Składnica were derived from Pol- PDB PDBUD ish National Corpus (Przepiórkowski et al., 2012). # sentences 22,208 The annotated sentences are rather short with 10.2 # tokens 351,715 tokens per sentence on average and corresponding # tokens per sentence 15.84 trees are relatively simple (there is only 289 non- # dependency types 28 31 (48)* projective trees,5 i.e. 3.5% of all trees). % non-projective edges 1.76 1.75 This first version of Polish dependency treebank % non-projective trees 8.61 8.03 was enlarged with 4K trees (Wróblewska, 2014). The additional trees resulted from the projection % enhanced edges n/a 4.96 of English dependency structures on Polish paral- % enhanced graphs n/a 41.58 lel sentences from Europarl (Koehn, 2005), DGT- Table 1: Statistics of Polish Dependency Bank (PDB) Translation Memory (Steinberger et al., 2012), and its UD conversion (PDBUD). *There are 31 uni- OPUS (Tiedemann, 2012) and Pelcra Parallel versal dependency types in PDBUD and 48 universal Corpus (P˛eziket al., 2011). The additional sen- types with the Polish-specific subtypes. tences with the average length of 15.9 tokens per sentence were longer than the sentences from 5Non-projective trees contain long distance dependencies 2.2 PDBUD resulting in crossing edges. See the topicalisation example The PDB trees are automatically converted to Czerwon ˛akupiłam sukienk˛e ‘I bought a red dress’ (lit. ‘Red I bought a dress’) with the following non-projective depen- the UD trees according to the guidelines of Uni- dency tree: versal Dependencies v27 and the resulting set is called PDBUD (i.e. Polish Dependency Bank in amod root Universal Dependencies format). PDBUD con- obj tains all trees of the Polish UD treebank (PL- ROOT Czerwon ˛a kupiłam sukienk˛e 6http://zil.ipipan.waw.pl/Scwad/ red I bought dress CDSCorpus 7http://universaldependencies.org/ guidelines.html SZ), which are possibly corrected. The size of paratives of inequality marked with NIZ˙ (‘than’).8 PDBUD is exactly the same as the size of PDB, All markers introducing comparative construc- i.e. 22K trees and 351K tokens (see Table1). 1783 tions, e.g. JAK, NIZ˙ , JAKBY, NICZYM, are con- of the PDBUD trees are non-projective, i.e. 8.03% verted as the subordinate conjunctions SCONJ of all trees. There are 17K enhanced edges (4.96% with the feature ConjType=Cmpr.9 Comparative of all edges) in PDBUD and 41.6% of the PDBUD constructions are annotated with the following de- graphs have at least one enhanced edge. pendencies (see Figure1): the comparative marker The converted PDBUD trees are largely con- is labelled mark and it depends on the main el- sistent with the PL-SZ trees. While converting, ement of the comparative construction labelled we try to preserve the universality principle of obl:cmpr (a new UD subtype). UD, but some necessary modifications are es- sential. The PL-SZ trees are rather simple and advmod obl:cmpr the sentences underlying this data set do not con- obj mark tain some linguistic phenomena, e.g. ellipsis, com- znali ceny lepiej niz˙ kelnerzy parative constructions, directed speech, interpo- lations and comments, nominative noun phrases they know prices better than waiters used in the vocative function, and many others. Therefore, the repertoire of the UD relation sub- Figure 1: The PDBUD tree of [...] znali ceny potraw types and language-specific features is slightly ex- lepiej niz˙ kelnerzy (‘they know the prices of dishes bet- tended in PDBUD to cover these phenomena (see ter than the waiters’) with the comparative construc- Section3). Furthermore, in contrast to the PL-SZ tion. trees, the PDBUD graphs contain enhanced edges encoding shared dependents or shared governors of coordinated elements (see Section4). Finally, 3.2 Constructions with JAKO some semantic labels are added that goes beyond The lexeme JAKO is one of the uninflectable the standard annotation scheme of Universal De- Polish parts of speech. It causes considerable pendencies (see Section5). difficulties and is heterogeneously analysed as a preposition, a coordinating conjunction, a sub- 3 Corrections and extensions ordinating conjunction, or an adverb in the tra- Plenty of errors are corrected in the original Skład- ditional Polish linguistics.

Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format

Arxiv:1908.07448V1

Universal Dependencies According to BERT: Both More Specific and More General

Universal Dependencies for Japanese

Universal Dependencies for Persian

Language Technology Meets Documentary Linguistics: What We Have to Tell Each Other

Foreword to the Special Issue on Uralic Languages

The Universal Dependencies Treebank of Spoken Slovenian

Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

Conll-2017 Shared Task

Conferenceabstracts

A Multilingual Collection of Conll-U-Compatible Morphological Lexicons

Proceedings of the 55Th Annual Meeting of the Association For