Multi-Source Synthetic Treebank Creation for Improved Cross-Lingual Dependency Parsing

Multi-source synthetic treebank creation for improved cross-lingual dependency parsing Francis M. Tyers Mariya Sheyanova Department of Linguistics School of Linguistics Indiana University Higher School of Economics Bloomington, IN Moscow [email protected] [email protected] Alexandra Martynova Pavel Stepachev Konstantin Vinogradovsky School of Linguistics School of Linguistics School of Linguistics Higher School of Economics Higher School of Economics Higher School of Economics Moscow Moscow Moscow [email protected] [email protected] [email protected] Abstract ish and Swedish for which large syntactically- annotated corpora exist. This paper describes a method of creating synthetic treebanks for cross-lingual depen- Compared with the other Nordic languages, dency parsing using a combination of machine Faroese has a full nominal case system of four translation (including pivot translation), anno- cases: Nominative, Genitive, Accusative and Da- tation projection and the spanning tree algo- tive, where the other languages have only a Gen- rithm. Sentences are first automatically trans- itive case. It has three grammatical genders, like lated from a lesser-resourced language to a Norwegian Nynorsk, but unlike Norwegian Bok- number of related highly-resourced languages, mål, Danish and Swedish, which have a two- parsed and then the annotations are projected gender agreement system. Like the other Nordic back to the lesser-resourced language, leading to multiple trees for each sentence from the languages, it is a verb-second (V2) language and lesser-resourced language. The final treebank the word order is generally similar. Faroese is how- is created by merging the possible trees into a ever not mutually intelligible with any of the main- graph and running the spanning tree algorithm land Nordic languages. to vote for the best tree for each sentence. We Using these treebanks we perform experiments present experiments aimed at parsing Faroese using a combination of Danish, Swedish and using two well-known methods, delexicalised pars- Norwegian. In a similar experimental setup to ing (Zeman and Resnik, 2008; McDonald et al., the CoNLL 2018 shared task on dependency 2011) and synthetic treebanking using annotation parsing we report state-of-the-art results on de- projection (Tiedemann and Agić, 2016), and in ad- pendency parsing for Faroese using an off-the- dition propose a new method based on voting over shelf parser. possible projected trees using the maximum spanning tree algorithm. This can be thought of as cre- 1 Introduction ating a synthetic treebank where the tree for each In this paper, we describe and compare a num- sentence is the result of voting over the set of trees ber of approaches to cross-lingual parsing for generated by parsing different translations. Faroese, a Nordic language spoken by approxi- The remainder of the paper is laid out as follows: mately 66,000 people on the Faroe Islands in the Section 2 describes prior work on both Faroese and North Atlantic. Faroese is a moderately under- on cross-lingual dependency parsing; Section 3 de- resourced language. It has a standardised orthog- scribes the resources we used for the experiments, raphy and fairly long written tradition, but lacks including a description of how the gold-standard large syntactically-annotated corpora. There are for Faroese was made; Section 4 describes the however related well-resourced languages, such methodology, including both the baseline models as Norwegian (both Bokmål and Nynorsk), Dan- and our proposed method. Sections 5 and 6 describe the experiments we performed and the re- not evaluate end-to-end results (the evaluation was sults and discussion respectively and finally: Sec- done over gold standard POS and morphology). tion 7 describes future avenues for research and Section 8 concludes. 3 Resources 2 Prior work In the experiments we used raw Faroese text extracted from Wikipedia, a manually created Our work is closely related to two main trends gold-standard corpus of trees, treebanks for the in cross-lingual dependency parsing. The first is source languages (Danish, Swedish and Norwe- multi-source delexicalised dependency parsing as gian) and machine translation systems between the described by McDonald et al. (2011). languages. The following subsections describe The second is the work on synthetic treebanking these resources. by Tiedemann and Agić (2016); Tiedemann (2017). In these works, sentences in the target language 3.1 Raw data (e.g. Faroese) is first translated by a machine trans- The Faroese raw data that we used in our exper- lation system to a well-resourced language (e.g. iments comes from Wikipedia dump which was Norwegian). The machine-translated Norwegian preliminary cleaned of all the markup using the sentences are then parsed by a parser trained on a WikiExtractor script.1 Then, both manually and treebank of Norwegian, and word aligned to the via regular expressions, we deleted non-Faroese Faroese originals. The output tree from the Nor- texts, poetic texts, reference lists, short sentences wegian parser is then projected back to the Faroese with little or no dependencies. All sentences con- sentences via the word alignments. taining only non-alphanumeric symbols were also In terms of voting for parse trees, the CoNLL deleted. shared task on dependency parsing in 2007 (Nivre For sentence segmentation we used regular ex- et al., 2007) reported that using a similar architec- pressions splitting on sentence-final punctuation, ture to the one we describe here, they were able but taking care to ignore month names, ordinal to get significantly better results by combining the numbers and abbreviations. After cleaning the cor- trees produced by the top three systems, and found pus we ended up with a total of 28,862 sentences. that even after adding all the systems, including the This data was used in the creation of the gold stan- worst-performing system, the performance did not dard (§3.2) and in creating the parallel data used drop below that of the top-performing system. for the synthetic treebanking experiments (§4.2). Our work is very similar to Agić et al. (2016), in that we use spanning tree to find the best parse 3.2 Gold standard in a graph that has been induced from aligned par- In order to evaluate the methods we needed to cre- allel corpora. However, their focus is on cross- ate a gold-standard treebank of Faroese. This was linguality rather than on producing the best system done manually by annotating sentences from the for a related language, and as such the performance Faroese Wikipedia.2 The gold standard contains they report is lower. 10,002 tokens in 1,208 sentences. The annotation It is also worth noting the work by Schlichtkrull procedure was as follows: We extracted sentences and Søgaard (2017), who present a system that can from the Faroese Wikipedia and analysed them us- learn from dependency graphs over tokens as op- ing the Faroese morphological analyser and con- posed to over the well-formed dependency trees straint grammar described by Trosterud (2009). that are typically assumed for other systems. This gave us a corpus where for each token in each In terms of dependency parsing specifically for sentence we had a lemma, a part of speech and a Faroese, we can include the work by Antonsen set of morphological features. These were checked et al. (2010), who apply a slightly-modified rule- manually and on top of these analyses, a depen- based parser written for North Sámi to parsing dency tree was added according to the guidelines Faroese. They achieved good results, F-score of in version 2.0 of Universal Dependencies (Nivre over 0.98, on a small test set of 100 sentences. Un- et al., 2016). Each tree was added manually by the fortunately their work is not directly comparable 1https://github.com/attardi/wikiextractor as it relies on a very different annotation scheme to 2The treebank is available online at https://github. that which we use in our work, in addition they did com/UniversalDependencies/UD_Faroese-OFT. (1) translate Treebank Sentences Tokens UD_Swedish-Talbanken 4,304 66,673 (fao) Maja býr nú í Malmø. (nob) Maja bor nå i Malmø. UD_Danish 4,384 80,378 UD_Norwegian-Nynorsk 14,175 245,330 (2) translate UD_Norwegian-Bokmaal 15,696 243,887 (nno) Maja bur no i Malmø. (nob) Maja bor nå i Malmø. (dan) Maja bor nu i Malmø. Table 1: Number of sentences and tokens in UD treebanks (swe) Maja bor nu i Malmö. for training the delexicalised models Figure 1: Example of pivot translation from Faroese to Swedish, Danish and Norwegian Nynorsk via Norwegian first author in discussion with a native speaker of Bokmål. The sentence Maja býr nú í Malmø translates in En- Faroese and members of the Universal Dependen- glish as ‘Maja now lives in Malmø’. The translation to the other Nordic languages is word-by-word and monotonic. cies community.3 The part-of-speech tags and features were converted automatically to ones compatible with Universal Dependencies using a lookup 4 Methodology table and the longest-match set overlap procedure described in Gökırmak and Tyers (2017). In this section we describe the two baseline methods and our multi-source approach. 3.3 Other treebanks For training the delexicalised models we used the 4.1 Delexicalised parsing following treebanks: UD_Swedish-Talbanken, For the delexicalised parsing baseline, we trained UD_Danish-DDT (Johannsen et al., 2015), delexicalised models on the Swedish, Danish, Nor- UD_Norwegian-Bokmaal (Øvrelid and Hohle, wegian Bokmål and Norwegian Nynorsk Univer- 2016) and UD_Norwegian-Nynorsk. Some sal Dependencies treebanks. Delexicalised models statistics about these treebanks are presented in are models trained only on the sequence of POS- Table 1. tags and morphological features, omitting both lemmas and surface forms. The idea behind this 3.4 Machine translation is to make the model maximally language indepen- Faroese is not supported by the mainstream on- dent.

Multi-Source Synthetic Treebank Creation for Improved Cross-Lingual Dependency Parsing

Arxiv:1908.07448V1

Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format

Universal Dependencies According to BERT: Both More Specific and More General

Universal Dependencies for Japanese

Universal Dependencies for Persian

Language Technology Meets Documentary Linguistics: What We Have to Tell Each Other

Foreword to the Special Issue on Uralic Languages

The Universal Dependencies Treebank of Spoken Slovenian

Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

Conll-2017 Shared Task

Conferenceabstracts

A Multilingual Collection of Conll-U-Compatible Morphological Lexicons