82 Treebanks, 34 Models: Universal Dependency Parsing with Multi-Treebank Models

Aaron Smith∗  Bernd Bohnet†  Miryam de Lhoneux∗  Joakim Nivre∗  Yan Shao∗  Sara Stymne∗
∗Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden
†Google Research, London, UK

Abstract

We present the Uppsala system for the CoNLL 2018 Shared Task on universal dependency parsing. Our system is a pipeline consisting of three components: the first performs joint word and sentence segmentation; the second predicts part-of-speech tags and morphological features; the third predicts dependency trees from words and tags. Instead of training a single parsing model for each treebank, we trained models with multiple treebanks for one language or closely related languages, greatly reducing the number of models. On the official test run, we ranked 7th of 27 teams for the LAS and MLAS metrics. Our system obtained the best scores overall for word segmentation, universal POS tagging, and morphological features.

1 Introduction

The CoNLL 2018 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2018) requires participants to build systems that take as input raw text, without any linguistic annotation, and output full labelled dependency trees for 82 test treebanks covering 46 different languages. Besides the labeled attachment score (LAS) used to evaluate systems in the 2017 edition of the Shared Task (Zeman et al., 2017), this year's task introduces two new metrics: morphology-aware labeled attachment score (MLAS) and bi-lexical dependency score (BLEX). The Uppsala system focuses exclusively on LAS and MLAS, and consists of a three-step pipeline. The first step is a model for joint sentence and word segmentation which uses the BiRNN-CRF framework of Shao et al. (2017, 2018) to predict sentence and word boundaries in the raw input and simultaneously marks multiword tokens that need non-segmental analysis. The second component is a part-of-speech (POS) tagger based on Bohnet et al. (2018), which employs a sentence-based character model and also predicts morphological features. The final stage is a greedy transition-based dependency parser that takes segmented words and their predicted POS tags as input and produces full dependency trees. While the segmenter and tagger models are trained on a single treebank, the parser uses multi-treebank learning to boost performance and reduce the number of models.

After evaluation on the official test sets (Nivre et al., 2018), which was run on the TIRA server (Potthast et al., 2014), the Uppsala system ranked 7th of 27 systems with respect to LAS, with a macro-average F1 of 72.37, and 7th of 27 systems with respect to MLAS, with a macro-average F1 of 59.20. It also reached the highest average score for word segmentation (98.18), universal POS (UPOS) tagging (90.91), and morphological features (87.59).

Corrigendum: After the test phase was over, we discovered that we had used a non-permitted resource when developing the UPOS tagger for Thai PUD (see Section 4). Setting our LAS, MLAS and UPOS scores to 0.00 for Thai PUD gives the corrected scores: LAS 72.31, MLAS 59.17, UPOS 90.50. This does not affect the ranking for any of the three scores, as confirmed by the shared task organizers.

2 Resources

All three components of our system were trained principally on the training sets of Universal Dependencies v2.2, released to coincide with the shared task (Nivre et al., 2018). The tagger and parser also make use of the pre-trained word embeddings provided by the organisers, as well as Facebook word embeddings (Bojanowski et al., 2017), and both word and character embeddings trained on Wikipedia text (https://dumps.wikimedia.org/backup-index-bydb.html) with word2vec (Mikolov et al., 2013). For languages with no training data, we also used external resources in the form of Wikipedia text, parallel data from OPUS (Tiedemann, 2012), the Moses statistical machine translation system (Koehn et al., 2007), and the Apertium morphological transducer for Breton (https://github.com/apertium/apertium-bre).

3 Sentence and Word Segmentation

We employ the model of Shao et al. (2018) for joint sentence segmentation and word segmentation. Given the input character sequence, we model the prediction of word boundary tags as a sequence labelling problem using a BiRNN-CRF framework (Huang et al., 2015; Shao et al., 2017). This is complemented with an attention-based LSTM model (Bahdanau et al., 2014) for transducing non-segmental multiword tokens. To enable joint sentence segmentation, we add extra boundary tags as in de Lhoneux et al. (2017a).

We use the default parameter settings introduced by Shao et al. (2018) and train a segmentation model for all treebanks with at least 50 sentences of training data. For treebanks with less or no training data (except Thai, discussed below), we substitute a model for another treebank/language:

• For Japanese Modern, Czech PUD, English PUD and Swedish PUD, we use the model trained on the largest treebank from the same language (Japanese GSD, Czech PDT, English EWT and Swedish Talbanken).

• For Finnish PUD, we use Finnish TDT rather than the slightly larger Finnish FTB, because the latter does not contain raw text suitable for training a segmenter.

• For Naija NSC, we use English EWT.

• For other test sets with little or no training data, we select models based on the size of the intersection of the character sets measured on Wikipedia data (see Table 2 for details). (North Sami Giella was included in this group by mistake, as we underestimated the size of the treebank.)

Thai Segmentation of Thai was a particularly difficult case: Thai uses a unique script, with no spaces between words, and there was no training data available. Spaces in Thai text can function as sentence boundaries, but are also used equivalently to commas in English. For Thai sentence segmentation, we exploited the fact that four other datasets are parallel, i.e., there is a one-to-one correspondence between sentences in Thai and in Czech PUD, English PUD, Finnish PUD and Swedish PUD. (This information was available in the README files distributed with the training data and available to all participants.) First, we split the Thai text by white space and treat the obtained character strings as potential sentences or sub-sentences. We then align them to the segmented sentences of the four parallel datasets using the Gale-Church algorithm (Gale and Church, 1993). Finally, we compare the sentence boundaries obtained from different parallel datasets and adopt the ones that are shared within at least three parallel datasets.

For word segmentation, we use a trie-based segmenter with a word list derived from the Facebook word embeddings (github.com/facebookresearch/fastText). The segmenter retrieves words by greedy forward maximum matching (Wong and Chan, 1996), as sketched below. This method requires no training but gave us the highest word segmentation score of 69.93% for Thai, compared to the baseline score of 8.56%.
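To make the matching procedure concrete, here is a minimal sketch of greedy forward maximum matching; the toy lexicon and the maximum word length are placeholders rather than the actual word list derived from the fastText embeddings.

```python
def max_match(text, lexicon, max_len=20):
    """Greedy forward maximum matching: at each position, take the
    longest lexicon entry that matches; back off to a single character
    when nothing in the lexicon fits."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Toy example with a hypothetical lexicon.
print(max_match("abcd", {"ab", "abc", "bcd", "d"}))  # ['abc', 'd']
```

A trie makes the longest-match lookup more efficient than the set-based scan shown here, which is presumably why the actual segmenter is trie-based.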
4 Tagging and Morphological Analysis

We use two separate instantiations of the tagger described in Bohnet et al. (2018) (https://github.com/google/meta_tagger) to predict UPOS tags and morphological features, respectively. The tagger uses a Meta-BiLSTM over the output of a sentence-based character model and a word model. Two features mainly distinguish the tagger from previous work. First, the character BiLSTMs use the full context of the sentence, in contrast to most other taggers, which use words only as context for the character model. Second, this character model is combined with the word model in the Meta-BiLSTM relatively late, after two layers of BiLSTMs.

For both the word and character models, we use two layers of BiLSTMs with 300 LSTM cells per layer. We employ batches with 8000 words and 20000 characters (grouped roughly as in the sketch below). We keep all other hyper-parameters as defined in Bohnet et al. (2018).
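The dual budget can be read as follows: a batch is closed as soon as adding another sentence would exceed either the word or the character limit. This is our interpretation, sketched below; the tagger's actual batching code may differ in details.

```python
def batches(sentences, max_words=8000, max_chars=20000):
    """Group sentences under a joint word and character budget."""
    batch, n_words, n_chars = [], 0, 0
    for sent in sentences:  # each sentence is a list of word strings
        w, c = len(sent), sum(len(word) for word in sent)
        if batch and (n_words + w > max_words or n_chars + c > max_chars):
            yield batch
            batch, n_words, n_chars = [], 0, 0
        batch.append(sent)
        n_words, n_chars = n_words + w, n_chars + c
    if batch:
        yield batch
```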

From the training schema described in the above paper, we deviate slightly in that we perform early stopping on the word, character and meta-model independently. We apply early stopping based on the performance on the development set (or training set when no development set is available) and stop when no improvement occurs in 1000 training steps. We use the same settings for UPOS tagging and morphological features.

To deal with languages that have little or no training data, we adopt different strategies:

• For the PUD treebanks (except Thai), Japanese Modern and Naija NSC, we use the same model substitutions as for segmentation (see Table 2).

• For Faroese we used the model for Norwegian Nynorsk, as we believe this to be the most closely related language.

• For treebanks with small training sets we use only the provided training sets for training. Since these treebanks do not have development sets, we use the training sets for early stopping as well.

• For Breton and Thai, which have no training sets and no suitable substitution models, we use a bootstrapping approach to train taggers, as described below.

Bootstrapping We first annotate an unlabeled corpus using an external morphological analyzer. We then create a (fuzzy and context-independent) mapping from the morphological analysis to universal POS tags and features, which allows us to relabel the annotated corpus and train taggers using the same settings as for other languages. For Breton, we annotated about 60,000 sentences from Breton OfisPublik, which is part of OPUS (https://opus.nlpl.eu/OfisPublik.php), using the Apertium morphological analyzer. The Apertium tags could be mapped to universal POS tags and a few morphological features like person, number and gender. For Thai, we annotated about 33,000 sentences from Wikipedia using PyThaiNLP (https://github.com/PyThaiNLP/pythainlp/wiki/PyThaiNLP-1.4) and mapped only to UPOS tags (no features). Unfortunately, we realized only after the test phase that PyThaiNLP was not a permitted resource, which invalidates our UPOS tagging scores for Thai, as well as the LAS and MLAS scores which depend on the tagger. Note, however, that the score for morphological features is not affected, as we did not predict features at all for Thai. The same goes for sentence and word segmentation, which do not depend on the tagger.
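Returning to the bootstrapping procedure above, the relabeling step can be pictured with a small fragment of such a mapping. The table below is hypothetical and much smaller than the real Apertium-to-UD correspondences, but it shows the fuzzy, context-independent character of the conversion.

```python
# Hypothetical fragment of a mapping from Apertium-style tags to
# universal POS tags and morphological features.
APERTIUM_TO_UD = {
    "n":     ("NOUN", {}),
    "vblex": ("VERB", {}),
    "adj":   ("ADJ",  {}),
    "sg":    (None, {"Number": "Sing"}),
    "pl":    (None, {"Number": "Plur"}),
    "p3":    (None, {"Person": "3"}),
}

def relabel(analysis):
    """Map one analyzer output, e.g. ['n', 'pl'], to (UPOS, features)."""
    upos, feats = "X", {}
    for tag in analysis:
        pos, fs = APERTIUM_TO_UD.get(tag, (None, {}))
        if pos is not None:
            upos = pos
        feats.update(fs)
    return upos, feats

print(relabel(["n", "pl"]))  # ('NOUN', {'Number': 'Plur'})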
Lemmas Due to time constraints we chose not to focus on the BLEX metric in this shared task. In order to avoid zero scores, however, we simply copied a lowercased version of the raw token into the lemma column.

5 Dependency Parsing

We use a greedy transition-based parser (Nivre, 2008) based on the framework of Kiperwasser and Goldberg (2016b), where BiLSTMs (Hochreiter and Schmidhuber, 1997; Graves, 2008) learn representations of tokens in context and are trained together with a multi-layer perceptron that predicts transitions and arc labels based on a few BiLSTM vectors. Our parser is extended with a SWAP transition to allow the construction of non-projective dependency trees (Nivre, 2009). We also introduce a static-dynamic oracle to allow the parser to learn from non-optimal configurations at training time in order to recover better from mistakes at test time (de Lhoneux et al., 2017b).

In our parser, the vector representation x_i of a word type w_i, before it is passed to the BiLSTM feature extractors, is given by:

    x_i = e(w_i) ◦ e(p_i) ◦ BiLSTM(ch_{1:m})

Here, e(w_i) represents the word embedding and e(p_i) the POS tag embedding (Chen and Manning, 2014); these are concatenated to a character-based vector, obtained by running a BiLSTM over the characters ch_{1:m} of w_i.

With the aim of training multi-treebank models, we additionally created a variant of the parser which adds a treebank embedding e(tb_i) to the input vectors, in a spirit similar to the language embeddings of Ammar et al. (2016) and de Lhoneux et al. (2017a):

    x_i = e(w_i) ◦ e(p_i) ◦ BiLSTM(ch_{1:m}) ◦ e(tb_i)

We have previously shown that treebank embeddings provide an effective way to combine multiple monolingual heterogeneous treebanks (Stymne et al., 2018) and applied them to low-resource languages (de Lhoneux et al., 2017a). In this shared task, the treebank embedding model was used both monolingually, to combine several treebanks for a single language, and multilingually, mainly for closely related languages, both for languages with no or small treebanks and for languages with medium and large treebanks, as described in Section 6.
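In vector terms, the parser input is a plain concatenation. The sketch below spells this out with numpy, using the dimensions from Table 1; the randomly initialized vectors stand in for the learned embeddings and the character BiLSTM output.

```python
import numpy as np

rng = np.random.default_rng(0)
e_word = rng.normal(size=100)     # e(w_i), word embedding
e_pos = rng.normal(size=20)       # e(p_i), POS tag embedding
char_vec = rng.normal(size=200)   # BiLSTM(ch_1:m), character vector
e_treebank = rng.normal(size=12)  # e(tb_i), treebank embedding

# x_i = e(w_i) ◦ e(p_i) ◦ BiLSTM(ch_1:m) ◦ e(tb_i)
x_i = np.concatenate([e_word, e_pos, char_vec, e_treebank])
print(x_i.shape)  # (332,)
```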

During training, a word embedding for each word type in the training data is initialized using the pre-trained embeddings provided by the organizers where available. For the remaining languages, we use different strategies:

• For Afrikaans, Armenian, Buryat, Gothic, Kurmanji, North Sami, Serbian and Upper Sorbian, we carry out our own pre-training on the Wikipedia dumps of these languages, tokenising them with the baseline UDPipe models and running the implementation of word2vec in the Gensim Python library (https://radimrehurek.com/gensim/) with 30 iterations and a minimum count of 1 (see the code sketch after Table 1).

• For Breton and Thai, we use specially-trained multilingual embeddings (see Section 6).

• For Naija and Old French, we substitute English and French embeddings, respectively.

• For Faroese, we do not use pre-trained embeddings. While it would be possible to train such embeddings on Wikipedia data, as there is no UD training data for Faroese we choose instead to rely on its similarity to other Scandinavian languages (see Section 6).

Word types in the training data that are not found amongst the pre-trained embeddings are initialized randomly using Glorot initialization (Glorot and Bengio, 2010), as are all POS tag and treebank embeddings. Character vectors are also initialized randomly, except for Chinese, Japanese and Korean, in which case we pre-train character vectors using word2vec on the Wikipedia dumps of these languages. At test time, we first look for out-of-vocabulary (OOV) words and characters (i.e., those that are not found in the treebank training data) amongst the pre-trained embeddings and otherwise assign them a trained OOV vector. (An alternative strategy is to have the parser store embeddings for all words that appear in either the training data or the pre-trained embeddings, but this uses far more memory.) A variant of word dropout is applied to the word embeddings, as described in Kiperwasser and Goldberg (2016a), and we apply dropout also to the character vectors.

We use the extended feature set of Kiperwasser and Goldberg (2016b): the top 3 items on the stack together with their rightmost and leftmost dependents, plus the first item on the buffer with its leftmost dependent. We train all models for 30 epochs with the hyper-parameter settings shown in Table 1. Note our unusually large character embedding sizes; we have previously found these to be effective, especially for morphologically rich languages (Smith et al., 2018). Our code is publicly available; we release the version used here as UUParser 2.3 (https://github.com/UppsalaNLP/uuparser).

Character embedding dimension         500
Character BiLSTM layers                 1
Character BiLSTM output dimension     200
Word embedding dimension              100
POS embedding dimension                20
Treebank embedding dimension           12
Word BiLSTM layers                      2
Word BiLSTM hidden/output dimension   250
Hidden units in MLP                   100
Word dropout                         0.33
α (for OOV vector training)          0.25
Character dropout                    0.33
p_agg (for exploration training)      0.1

Table 1: Hyper-parameter values for parsing.
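The Wikipedia pre-training in the first bullet above maps to a short Gensim call, roughly as below. The sketch assumes Gensim 4.x, where the relevant parameters are vector_size and epochs (in the 3.x releases current at the time they were called size and iter); the corpus variable is a placeholder for UDPipe-tokenized Wikipedia text.

```python
from gensim.models import Word2Vec

# Placeholder corpus: an iterable of token lists produced by
# tokenising a Wikipedia dump with the baseline UDPipe model.
tokenized_wiki = [["a", "sample", "sentence"], ["another", "one"]]

model = Word2Vec(
    sentences=tokenized_wiki,
    vector_size=100,  # matches the parser's word embedding dimension
    min_count=1,      # minimum count of 1, as in the paper
    epochs=30,        # 30 iterations, as in the paper
)
model.wv.save_word2vec_format("wiki_embeddings.vec")
```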
Using Morphological Features Having a strong morphological analyzer, we were interested in finding out whether or not we could improve parsing accuracy using predicted morphological information. We conducted several experiments on the development sets for a subset of treebanks. However, no experiment gave us any improvement in terms of LAS, and we decided not to use this technique for the shared task.

What we tried was to create an embedding representing either the full set of morphological features or a subset of potentially useful features, for example case (which has been shown to be useful for parsing by Kapociute-Dzikiene et al. (2013) and Eryigit et al. (2008)), verb form, and a few others. That embedding was concatenated to the word embedding at the input of the BiLSTM. We varied the embedding size (10, 20, 30, 40), tried different subsets of morphological features, and tried with and without dropout on that embedding. We also tried creating an embedding of the concatenation of the universal POS tag and the Case feature and replacing the POS embedding with this one. We are currently unsure why none of these experiments were successful and plan to investigate this in the future. It would be interesting to find out whether or not this information is captured somewhere else. A way to test this would be to use diagnostic classifiers on vector representations, as is done for example in Hupkes et al. (2018) or in Adi et al. (2017).
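A diagnostic classifier in this sense is just a simple probe trained on frozen vectors. The sketch below shows the shape of such an experiment with scikit-learn; the data is random stand-in material and the Case labels are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-ins: 1000 token representations (e.g., BiLSTM states of
# dimension 250) with a hypothetical binary label such as
# Case=Nom vs. Case=Acc.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 250))
y = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Accuracy well above the majority baseline would suggest that the
# probed feature is encoded somewhere in the representation.
print(probe.score(X_te, y_te))
```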

6 Multi-Treebank Models

One of our main goals was to leverage information across treebanks to improve performance and reduce the number of parsing models. We use two different types of models:

1. Single models, where we train one model per treebank (17 models applied to 18 treebanks, including special models for Breton KEB and Thai PUD).

2. Multi-treebank models:
   • Monolingual models, based on multiple treebanks for one language (4 models, trained on 10 treebanks, applied to 11 treebanks).
   • Multilingual models, based on treebanks from several (mostly) closely related languages (12 models, trained on 48 treebanks, applied to 52 treebanks; plus a special model for Naija NSC).

When a multi-treebank model is applied to a test set from a treebank with training data, we naturally use the treebank embedding of that treebank also for the test sentences. However, when parsing a test set with no corresponding training data, we have to use one of the other treebank embeddings. In the following, we refer to the treebank selected for this purpose as the proxy treebank (or simply proxy).

In order to keep the training times and language balance in each model reasonable, we cap the number of sentences used from each treebank at 15,000, with a new random sample selected at each epoch. This only affects a small number of treebanks, since most training sets are smaller than 15,000 sentences. For all our multi-treebank models, we apply the treebank embeddings described in Section 5. Where two or more treebanks in a multilingual model come from the same language, we use separate treebank embeddings for each of them. We have previously shown that multi-treebank models can boost LAS in many cases, especially for small treebanks, when applied monolingually (Stymne et al., 2018), and have applied the approach to low-resource languages (de Lhoneux et al., 2017a). In this paper, we add POS tags and pre-trained embeddings to that framework, and extend it to also cover multilingual parsing for languages with varying amounts of training data.

Treebanks sharing a single model are grouped together in Table 2. To decide which languages to combine in our multilingual models, we use two sources: knowledge about language families and language relatedness, and clusterings of treebank embeddings from training our parser with all available languages. We created the clusterings by training single parser models with treebank embeddings for all treebanks with training data, capping the maximum number of sentences per treebank at 800. We then used Ward's method to perform a hierarchical cluster analysis (sketched below).

We found that the most stable clusters were for closely related languages. There was also a tendency for treebanks containing old languages (i.e., Ancient Greek, Gothic, Latin and Old Church Slavonic) to cluster together. One reason for these languages parsing well together could be that several of the 7 treebanks come from the same annotation projects, four from PROIEL and two from Perseus, containing consistently annotated and at least partially parallel data, e.g., from the Bible.

For the multi-treebank models, we performed preliminary experiments on development data investigating the effect of different groupings of languages. The main tendency we found was that it was better to use smaller groups of closely related languages rather than larger groups of slightly less related languages. For example, using multilingual models only for Galician-Portuguese and Spanish-Catalan was better than combining all Romance languages in a larger model, and combining Dutch-German-Afrikaans was better than also including English.
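The clustering step described above is a standard hierarchical analysis; a sketch with SciPy follows, where the random matrix stands in for the learned 12-dimensional treebank embeddings and the treebank names are examples only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

treebanks = ["cs_pdt", "cs_cac", "sv_talbanken", "da_ddt"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(treebanks), 12))  # stand-in vectors

# Ward's method on the embedding matrix, then a flat cut into 2 groups.
Z = linkage(embeddings, method="ward")
groups = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(treebanks, groups)))
```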
One case where we do use less closely related languages is for languages with very little training data (31 sentences or less), where we believe it may be beneficial in this special case. We implemented this for Buryat, Uyghur and Kazakh, which are trained with Turkish, and Kurmanji, which is trained with Persian, even though these languages are not so closely related. For Armenian, which has only 50 training sentences, we could not find a close enough language, and instead train a single model on the available data. For the four languages that are not in a multilingual cluster but have more than one available treebank, we use monolingual multi-treebank models (English, French, Italian and Korean).

For the nine treebanks that have no training data, we use different strategies:

• For Japanese Modern, we apply the mono-treebank Japanese GSD model.

• For the four PUD treebanks, we apply the multi-treebank models trained using the other treebanks from that language, with the largest available treebank as proxy (except for Finnish, where we prefer Finnish TDT over FTB; cf. Section 3 and Stymne et al. (2018)).

• For Faroese, we apply the model for the Scandinavian languages, which are closely related, with Norwegian Nynorsk as proxy (cf. Section 4). In addition, we map the Faroese characters {Íýúð}, which do not occur in the other Scandinavian languages, to {Iyud}.

• For Naija, an English-based creole whose treebank according to the README file contains spoken language data, we train a special multilingual model on English EWT and the three small spoken treebanks for French, Norwegian, and Slovenian, and use English EWT as proxy. (We had found this combination to be useful in preliminary experiments where we tried to parse French Spoken without any French training data.)

• For Thai and Breton, we create multilingual models trained with word and POS embeddings only (i.e., no character models or treebank embeddings) on Chinese and Irish, respectively. These models make use of multilingual word embeddings provided with Facebook's MUSE multilingual embeddings (https://github.com/facebookresearch/MUSE), as described in more detail below.

For all multi-treebank models, we choose the model from the epoch that has the best mean LAS score among the treebanks that have available development data. This means that treebanks without development data rely on a model that is good for other languages in the group. In the case of the mono-treebank Armenian and Irish models, where there is no development data, we choose the model from the final training epoch. This also applies to the Breton model trained on Irish data.

Thai–Chinese For the Thai model trained on Chinese, we were able to map Facebook's monolingual embeddings for each language to English using MUSE, thus creating multilingual Thai-Chinese embeddings. We then trained a monolingual parser model using the mapped Chinese embeddings to initialize all word embeddings, ensuring that these were not updated during training (unlike in the standard parser setup described in Section 5). At test time, we look up all OOV word types, which are the great majority, in the mapped Thai embeddings first, and otherwise assign them to a learned OOV vector. Note that in this case, we had to increase the word embedding dimension in our parser to 300 to accommodate the larger Facebook embeddings.

Breton–Irish For Breton and Irish, the Facebook software does not come with the necessary resources to map these languages into English. Here we instead created a small dictionary by using all available parallel data from OPUS (Ubuntu, KDE and Gnome, a total of 350K text snippets) and training a statistical machine translation model using Moses (Koehn et al., 2007). From the lexical word-to-word correspondences created, we kept all cases where the translation probabilities in both directions were at least 0.4 and the words were not identical (in order to exclude a lot of English noise in the data), resulting in a word list of 6027 words. We then trained monolingual embeddings for Breton using word2vec on Wikipedia data, and mapped them directly to Irish using MUSE. A parser model was then trained, similarly to the Thai-Chinese case, using Irish embeddings as initialization, turning off updates to the word embeddings, and applying the mapped Breton embeddings at test time.
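The dictionary extraction amounts to intersecting the two lexical translation tables produced by Moses training. The sketch below assumes one source-target-probability triple per line, which is our approximation of the lexical table format; the file names are placeholders.

```python
def read_lex(path):
    """Read a Moses-style lexical table: one 'src trg prob' per line."""
    probs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            src, trg, p = line.split()
            probs[(src, trg)] = float(p)
    return probs

br2ga = read_lex("lex.br-ga")  # placeholder file names
ga2br = read_lex("lex.ga-br")

# Keep pairs translating with probability >= 0.4 in both directions,
# excluding identical strings (to filter out English noise).
word_list = [
    (br, ga) for (br, ga), p in br2ga.items()
    if p >= 0.4 and ga2br.get((ga, br), 0.0) >= 0.4 and br != ga
]
```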
7 Results and Discussion

Table 2 shows selected test results for the Uppsala system, including the two main metrics LAS and MLAS (plus a mono-treebank baseline for LAS; since our system does not predict lemmas, the third main metric BLEX is not very meaningful), the sentence and word segmentation accuracy, and the accuracy of UPOS tagging and morphological features (UFEATS). To make the table more readable, we have added a simple color coding.

LANGUAGE TREEBANK | LAS | LAS (mono) | MLAS | SENTS | WORDS | UPOS | UFEATS | SUBSTITUTE/PROXY (SEG / TAG / PARSE)
Arabic PADT | 73.54 | 73.54 | 61.04 | 68.06 | 96.19 | 90.70 | 88.25 |
Armenian ArmTDP | 23.90 | 23.90 | 6.97 | 57.44 | 93.20 | 75.39 | 54.45 |
Basque BDT | 78.12 | 78.12 | 67.67 | 100.00 | 100.00 | 96.05 | 92.50 |
Bulgarian BTB | 88.69 | 88.69 | 81.20 | 95.58 | 99.92 | 98.85 | 97.51 |
Breton KEB | 33.62 | 33.62 | 13.91 | 91.43 | 90.97 | 85.01 | 70.26 | French GSD / SPECIAL*
Chinese GSD | 69.17 | 69.17 | 59.53 | 99.10 | 93.52 | 89.15 | 92.35 |
Greek GDT | 86.39 | 86.39 | 72.29 | 91.92 | 99.69 | 97.26 | 93.65 |
Hebrew HTB | 67.72 | 67.72 | 44.19 | 100.00 | 90.98 | 80.26 | 79.49 |
Hungarian Szeged | 73.97 | 73.97 | 56.22 | 94.57 | 99.78 | 94.60 | 86.87 |
Indonesian GSD | 78.15 | 78.15 | 67.90 | 93.47 | 99.99 | 93.70 | 95.83 |
Irish IDT | 68.14 | 68.14 | 41.72 | 94.90 | 99.60 | 91.55 | 81.78 |
Japanese GSD | 79.97 | 79.97 | 65.47 | 94.92 | 93.32 | 91.73 | 91.66 |
Japanese Modern | 28.27 | 28.27 | 11.82 | 0.00 | 72.76 | 54.60 | 71.06 | Japanese GSD
Latvian LVTB | 76.97 | 76.97 | 63.90 | 96.97 | 99.67 | 94.95 | 91.73 |
Old French SRCMF | 78.71 | 78.71 | 69.82 | 59.15 | 100.00 | 95.48 | 97.26 |
Romanian RRT | 84.33 | 84.33 | 76.00 | 95.81 | 99.74 | 97.46 | 97.25 |
Thai PUD | 4.86 | 4.86 | 2.22 | 11.69 | 69.93 | 33.75 | 65.72 | SPECIAL*
Vietnamese VTB | 46.15 | 46.15 | 40.03 | 88.69 | 86.71 | 78.89 | 86.43 |
Afrikaans AfriBooms | 79.47 | 78.89 | 66.35 | 99.65 | 99.37 | 96.28 | 95.39 |
Dutch Alpino | 83.58 | 81.73 | 71.11 | 89.04 | 99.62 | 95.78 | 95.89 |
Dutch LassySmall | 82.25 | 79.59 | 70.88 | 73.62 | 99.87 | 96.18 | 95.85 |
German GSD | 75.48 | 75.15 | 53.67 | 79.36 | 99.37 | 94.02 | 88.13 |
Ancient Greek Perseus | 65.17 | 62.95 | 44.31 | 98.93 | 99.97 | 92.40 | 90.12 |
Ancient Greek PROIEL | 72.24 | 71.58 | 54.98 | 51.17 | 99.99 | 97.05 | 91.04 |
Gothic PROIEL | 63.40 | 60.58 | 49.79 | 31.97 | 100.00 | 93.43 | 88.60 |
Latin ITTB | 83.00 | 82.55 | 75.38 | 94.54 | 99.99 | 98.34 | 96.78 |
Latin Perseus | 58.32 | 49.86 | 37.57 | 98.41 | 100.00 | 88.73 | 78.86 |
Latin PROIEL | 64.10 | 63.85 | 51.45 | 37.64 | 100.00 | 96.21 | 91.46 |
Old Church Slavonic PROIEL | 70.44 | 70.31 | 58.31 | 44.56 | 99.99 | 95.76 | 88.91 |
Buryat BDT | 17.96 | 8.45 | 1.26 | 93.18 | 99.04 | 50.83 | 40.63 | Russian SynTagRus
Kazakh KTB | 31.93 | 23.85 | 8.62 | 94.21 | 97.40 | 61.72 | 48.45 | Russian SynTagRus
Turkish IMST | 61.34 | 61.77 | 51.23 | 96.63 | 97.80 | 93.72 | 90.42 |
Uyghur UDT | 62.94 | 62.38 | 42.54 | 83.47 | 99.69 | 89.19 | 87.00 |
Catalan AnCora | 88.94 | 88.68 | 81.39 | 99.35 | 99.79 | 98.38 | 97.90 |
Spanish AnCora | 88.79 | 88.65 | 81.75 | 97.97 | 99.92 | 98.69 | 98.23 |
Croatian SET | 84.62 | 84.13 | 70.53 | 96.97 | 99.93 | 97.93 | 91.70 |
Serbian SET | 86.99 | 85.14 | 75.54 | 93.07 | 99.94 | 97.61 | 93.70 |
Slovenian SSJ | 87.18 | 87.28 | 77.81 | 93.23 | 99.62 | 97.99 | 94.73 |
Slovenian SST | 56.06 | 53.27 | 41.22 | 23.98 | 100.00 | 93.18 | 84.75 |
Czech CAC | 89.49 | 88.94 | 82.25 | 100.00 | 99.94 | 99.17 | 95.84 |
Czech FicTree | 89.76 | 87.78 | 80.63 | 98.72 | 99.85 | 98.42 | 95.52 |
Czech PDT | 88.15 | 88.09 | 82.39 | 92.29 | 99.96 | 99.07 | 96.89 |
Czech PUD | 84.36 | 83.35 | 74.46 | 96.29 | 99.62 | 97.02 | 93.66 | Czech PDT
Polish LFG | 93.14 | 92.85 | 84.09 | 99.74 | 99.91 | 98.57 | 94.68 |
Polish SZ | 89.80 | 88.48 | 77.28 | 98.91 | 99.94 | 97.95 | 91.82 |
Slovak SNK | 86.34 | 83.80 | 71.15 | 88.11 | 99.98 | 96.57 | 89.51 |
Upper Sorbian UFAL | 28.85 | 2.70 | 3.43 | 73.40 | 95.15 | 58.91 | 42.10 | Spanish AnCora
Danish DDT | 80.08 | 79.68 | 71.19 | 90.10 | 99.85 | 97.14 | 97.03 |
Faroese OFT | 41.69 | 39.94 | 0.70 | 95.32 | 99.25 | 65.54 | 34.56 | Danish DDT / Norwegian Nynorsk
Norwegian Bokmaal | 88.30 | 87.68 | 81.68 | 95.13 | 99.84 | 98.04 | 97.18 |
Norwegian Nynorsk | 87.40 | 86.23 | 79.42 | 92.09 | 99.94 | 97.57 | 96.88 |
Norwegian NynorskLIA | 59.66 | 55.51 | 45.51 | 99.86 | 99.99 | 90.02 | 89.62 |
Swedish LinES | 80.53 | 78.33 | 65.38 | 85.17 | 99.99 | 96.64 | 89.54 |
Swedish PUD | 78.15 | 75.52 | 49.73 | 91.57 | 98.78 | 93.12 | 78.53 | Swedish Talbanken
Swedish Talbanken | 84.26 | 83.29 | 76.74 | 96.45 | 99.96 | 97.45 | 96.82 |
English EWT | 81.47 | 81.18 | 72.98 | 75.41 | 99.10 | 95.28 | 96.02 |
English GUM | 81.28 | 79.23 | 69.62 | 81.16 | 99.71 | 94.67 | 95.80 |
English LinES | 78.64 | 76.28 | 70.18 | 88.18 | 99.96 | 96.47 | 96.52 |
English PUD | 84.09 | 83.67 | 72.49 | 97.02 | 99.69 | 95.23 | 95.16 | English EWT
Estonian EDT | 81.09 | 81.47 | 74.11 | 92.16 | 99.96 | 97.16 | 95.80 |
Finnish FTB | 84.19 | 83.12 | 76.40 | 87.91 | 99.98 | 96.30 | 96.73 |
Finnish PUD | 86.48 | 86.48 | 80.52 | 92.95 | 99.69 | 97.59 | 96.84 | Finnish TDT
Finnish TDT | 84.33 | 84.24 | 77.50 | 91.12 | 99.78 | 97.06 | 95.58 |
North Saami Giella | 64.85 | 64.14 | 51.67 | 98.27 | 99.32 | 90.44 | 85.03 | German GSD
French GSD | 85.61 | 85.16 | 76.79 | 95.40 | 99.30 | 96.86 | 96.26 |
French Sequoia | 87.39 | 86.26 | 79.97 | 87.33 | 99.44 | 97.92 | 97.47 |
French Spoken | 71.26 | 69.44 | 60.12 | 23.54 | 100.00 | 95.51 | 100.00 |
Galician CTG | 78.41 | 78.27 | 65.52 | 96.46 | 98.01 | 95.80 | 97.78 |
Galician TreeGal | 72.67 | 70.16 | 58.22 | 82.97 | 97.90 | 93.25 | 92.15 |
Portuguese Bosque | 84.41 | 84.27 | 71.76 | 90.89 | 99.00 | 95.90 | 95.41 |
Hindi HDTB | 89.37 | 89.23 | 74.62 | 99.02 | 100.00 | 97.44 | 93.55 |
Urdu UDTB | 80.40 | 79.85 | 52.15 | 98.60 | 100.00 | 93.66 | 80.78 |
Italian ISDT | 89.43 | 89.37 | 81.17 | 99.38 | 99.55 | 97.79 | 97.36 |
Italian PoSTWITA | 76.75 | 76.46 | 66.46 | 54.00 | 99.04 | 95.61 | 95.63 |
Korean GSD | 81.92 | 81.12 | 77.25 | 92.78 | 99.87 | 95.61 | 99.63 |
Korean Kaist | 84.98 | 84.74 | 78.90 | 100.00 | 100.00 | 95.21 | 100.00 |
Kurmanji MG | 29.54 | 7.61 | 5.77 | 90.85 | 96.97 | 61.33 | 48.26 | Spanish AnCora
Persian Seraji | 83.39 | 83.22 | 76.97 | 99.50 | 99.60 | 96.79 | 97.02 |
Naija NSC | 20.44 | 19.44 | 3.55 | 0.00 | 98.53 | 57.19 | 36.09 | English EWT / SPECIAL*
Russian SynTagRus | 89.00 | 89.39 | 81.01 | 98.79 | 99.61 | 98.59 | 94.89 |
Russian Taiga | 65.49 | 59.32 | 46.07 | 66.40 | 97.81 | 89.32 | 82.15 |
Ukrainian IU | 82.70 | 81.41 | 59.15 | 93.42 | 99.76 | 96.89 | 81.95 |
ALL (official) | 72.37 | 70.71 | 59.20 | 83.80 | 98.18 | 90.91 | 87.59 |
ALL (corrected) | 72.31 | 70.65 | 59.17 | 83.80 | 98.18 | 90.50 | 87.59 |
BIG | 80.25 | 79.61 | 68.81 | 87.23 | 99.10 | 95.59 | 93.65 |
PUD | 72.27 | 71.46 | 57.80 | 75.57 | 94.11 | 87.51 | 87.05 |
SMALL | 63.60 | 60.06 | 46.00 | 80.68 | 99.23 | 90.93 | 84.91 |
LOW-RESOURCE (official) | 25.87 | 18.26 | 5.16 | 67.50 | 93.38 | 61.07 | 48.95 |
LOW-RESOURCE (corrected) | 25.33 | 17.72 | 4.91 | 67.50 | 93.38 | 57.32 | 48.95 |

Table 2: Results for LAS (plus the mono-treebank baseline), MLAS, sentence and word segmentation, UPOS tagging and morphological features (UFEATS). Treebanks sharing a parsing model are grouped together (adjacent rows); substitute and proxy treebanks for segmentation, tagging and parsing are given in the rightmost column (SPECIAL models are detailed in the text). Confidence interval boundaries for the color coding in the original table: µ−σ < µ−SE < µ < µ+SE < µ+σ.

Scores that are significantly higher or lower than the mean score of the 21 systems that successfully parsed all test sets are marked with two shades of green/red. The lighter shade marks differences that are outside the interval defined by the standard error of the mean (µ ± SE, SE = σ/√N) but within one standard deviation (std dev) from the mean. The darker shade marks differences that are more than one std dev above/below the mean (µ ± σ). Finally, scores that are no longer valid because of the Thai UPOS tagger are crossed out in yellow cells, and corrected scores are added where relevant.

Looking first at the LAS scores, we see that our results are significantly above the mean for all aggregate sets of treebanks (ALL, BIG, PUD, SMALL, LOW-RESOURCE), with an especially strong result for the low-resource group (even after setting the Thai score to 0.00). If we look at specific languages, we do particularly well on low-resource languages like Breton, Buryat, Kazakh and Kurmanji, but also on languages like Arabic, Hebrew, Japanese and Chinese, where we benefit from having better word segmentation than most other systems. Our results are significantly worse than the mean only for Afrikaans AfriBooms, Old French SRCMF, Galician CTG, Latin PROIEL, and Portuguese Bosque. For Galician and Portuguese, this may be the effect of lower word segmentation and tagging accuracy.

To find out whether our multi-treebank and multilingual models were in fact beneficial for parsing accuracy, we ran a post-evaluation experiment with one model per test set, each trained only on a single treebank. We refer to this as the mono-treebank baseline; the LAS scores can be found in the second (uncolored) LAS column in Table 2.
The results show that merging treebanks and languages did in fact improve parsing accuracy in a remarkably consistent fashion. For the 64 test sets that were parsed with a multi-treebank model, only four had a (marginally) higher score with the mono-treebank baseline model: Estonian EDT, Russian SynTagRus, Slovenian SSJ, and Turkish IMST. Looking at the aggregate sets, we see that, as expected, the pooling of resources helps most for LOW-RESOURCE (25.33 vs. 17.72) and SMALL (63.60 vs. 60.06), but even for BIG there is some improvement (80.25 vs. 79.61). We find these results very encouraging, as they indicate that our treebank embedding method is a reliable method for pooling training data both within and across languages. It is also worth noting that this method is easy to use and does not require the extra external resources used in most work on multilingual parsing, like multilingual word embeddings (Ammar et al., 2016) or linguistic re-write rules (Aufrant et al., 2016), to achieve good results.

Turning to the MLAS scores, we see a very similar picture, but our results are relatively speaking stronger also for PUD and SMALL. There are a few striking reversals, where we do significantly better than the mean for LAS but significantly worse for MLAS, including Buryat BDT, Hebrew HTB and Ukrainian IU. Buryat and Ukrainian are languages for which we use a multilingual model for parsing, but not for UPOS tagging and morphological features, so the reversal may be due to sparse data for tags and morphology, since these languages have very little training data. This is supported by the observation that low-resource languages in general have a larger drop from LAS to MLAS than other languages.

For sentence segmentation, the Uppsala system achieved the second best scores overall, and results are significantly above the mean for all aggregates except SMALL, which perhaps indicates a sensitivity to data sparseness for the data-driven joint sentence and word segmenter (we see the same pattern for word segmentation). However, there is a much larger variance in the results than for the parsing scores, with altogether 23 treebanks having scores significantly below the mean.

For word segmentation, we obtained the best results overall, strongly outperforming the mean for all groups except SMALL. We know from previous work (Shao et al., 2018) that our word segmenter performs well on more challenging languages like Arabic, Hebrew, Japanese, and Chinese (although we were beaten by the Stanford team for the former two and by the HIT-SCIR team for the latter two). By contrast, it sometimes falls below the mean for the easier languages, but typically only by a very small fraction (for example 99.99 vs. 100.00 for 3 treebanks). Finally, it is worth noting that the maximum-matching segmenter developed specifically for Thai achieved a score of 69.93, which was more than 5 points better than any other system.

Our results for UPOS tagging indicate that this may be the strongest component of the system, although it is clearly helped by getting its input from a highly accurate word segmenter.

The Uppsala system ranks first overall with scores more than one std dev above the mean for all aggregates. There is also much less variance than in the segmentation results, and scores are significantly below the mean only for five treebanks: Galician CTG, Gothic PROIEL, Hebrew HTB, Upper Sorbian UFAL, and Portuguese Bosque. For Galician and Upper Sorbian, the result can at least partly be explained by a lower-than-average word segmentation accuracy.

The results for morphological features are similar to the ones for UPOS tagging, with the best overall score but with less substantial improvements over the mean. The four treebanks where scores are significantly below the mean are all languages with little or no training data: Upper Sorbian UFAL, Hungarian Szeged, Naija NSC and Ukrainian IU.

All in all, the 2018 edition of the Uppsala parser can be characterized as a system that is strong on segmentation (especially word segmentation) and prediction of UPOS tags and morphological features, and where the dependency parsing component performs well in low-resource scenarios thanks to the use of multi-treebank models, both within and across languages. For what it is worth, we also seem to have the highest ranking single-parser transition-based system in a task that is otherwise dominated by graph-based models, in particular variants of the winning Stanford system from 2017 (Dozat et al., 2017).

8 Extrinsic Parser Evaluation

In addition to the official shared task evaluation, we also participated in the 2018 edition of the Extrinsic Parser Evaluation Initiative (EPE) (Fares et al., 2018), where parsers developed for the CoNLL 2018 shared task were evaluated with respect to their contribution to three downstream systems: biological event extraction, fine-grained opinion analysis, and negation resolution. The downstream systems are available for English only, and we participated with our English model trained on English EWT, English LinES and English GUM, using English EWT as the proxy.

In the extrinsic evaluation, the Uppsala system ranked second for event extraction, first for opinion analysis, and 16th (out of 16 systems) for negation resolution. Our results for the first two tasks are better than expected, given that our system ranks in the middle with respect to intrinsic evaluation on English (9th for LAS, 6th for UPOS). By contrast, our performance is very low on the negation resolution task, which we suspect is due to the fact that our system only predicts universal part-of-speech tags (UPOS) and not the language-specific PTB tags (XPOS), since the three systems that only predict UPOS are all ranked at the bottom of the list.

9 Conclusion

We have described the Uppsala submission to the CoNLL 2018 shared task, consisting of a segmenter that jointly extracts words and sentences from a raw text, a tagger that provides UPOS tags and morphological features, and a parser that builds a dependency tree given the words and tags of each sentence. For the parser we applied multi-treebank models both monolingually and multilingually, resulting in only 34 models for 82 treebanks as well as significant improvements in parsing accuracy, especially for low-resource languages. We ranked 7th for the official LAS and MLAS scores, and first for the unofficial scores on word segmentation, UPOS tagging and morphological features.

Acknowledgments

We are grateful to the shared task organizers and to Dan Zeman and Martin Potthast in particular, and we acknowledge the computational resources provided by CSC in Helsinki and Sigma2 in Oslo through NeIC-NLPL (www.nlpl.eu). Aaron Smith was supported by the Swedish Research Council.
References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks. In International Conference on Learning Representations.

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah Smith. 2016. Many Languages, One Parser. Transactions of the Association for Computational Linguistics 4:431–444.

Lauriane Aufrant, Guillaume Wisniewski, and François Yvon. 2016. Zero-resource Dependency Parsing: Boosting Delexicalized Cross-lingual Transfer with Linguistic Knowledge. In Proceedings of the 26th International Conference on Computational Linguistics (COLING). pages 119–130.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.

Bernd Bohnet, Ryan McDonald, Goncalo Simoes, Daniel Andor, Emily Pitler, and Joshua Maynez. 2018. Morphosyntactic Tagging with a Meta-BiLSTM Model over Context Sensitive Token Encodings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL).

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5:135–146.

Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). pages 740–750.

Miryam de Lhoneux, Yan Shao, Ali Basirat, Eliyahu Kiperwasser, Sara Stymne, Yoav Goldberg, and Joakim Nivre. 2017a. From Raw Text to Universal Dependencies – Look, No Tags! In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

Miryam de Lhoneux, Sara Stymne, and Joakim Nivre. 2017b. Arc-Hybrid Non-Projective Dependency Parsing with a Static-Dynamic Oracle. In Proceedings of the 15th International Conference on Parsing Technologies. pages 99–104.

Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford's Graph-based Neural Dependency Parser at the CoNLL 2017 Shared Task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. pages 20–30.

Gülsen Eryigit, Joakim Nivre, and Kemal Oflazer. 2008. Dependency Parsing of Turkish. Computational Linguistics 34.
Murhaf Fares, Stephan Oepen, Lilja Øvrelid, Jari Björne, and Richard Johansson. 2018. The 2018 Shared Task on Extrinsic Parser Evaluation: On the downstream utility of English universal dependency parsers. In Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL).

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics 19(1):75–102.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS. pages 249–256.

Alex Graves. 2008. Supervised Sequence Labelling with Recurrent Neural Networks. Ph.D. thesis, Technical University Munich.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and 'Diagnostic Classifiers' Reveal how Recurrent and Recursive Neural Networks Process Hierarchical Structure. Journal of Artificial Intelligence Research 61:907–926.

Jurgita Kapociute-Dzikiene, Joakim Nivre, and Algis Krupavicius. 2013. Lithuanian dependency parsing with rich morphological features. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages. pages 12–21.

Eliyahu Kiperwasser and Yoav Goldberg. 2016a. Easy-First Dependency Parsing with Hierarchical Tree LSTMs. Transactions of the Association for Computational Linguistics 4:445–461.

Eliyahu Kiperwasser and Yoav Goldberg. 2016b. Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. Transactions of the Association for Computational Linguistics 4:313–327.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL, Demo and Poster Sessions. Prague, Czech Republic, pages 177–180.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems. pages 3111–3119.

Joakim Nivre. 2008. Algorithms for Deterministic Incremental Dependency Parsing. Computational Linguistics 34(4):513–553.

Joakim Nivre. 2009. Non-Projective Dependency Parsing in Expected Linear Time. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP). pages 351–359.

Joakim Nivre, Mitchell Abrams, Željko Agić, et al. 2018. Universal Dependencies 2.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. http://hdl.handle.net/11234/1-2837.

Martin Potthast, Tim Gollub, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. 2014. Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In Evangelos Kanoulas, Mihai Lupu, Paul Clough, Mark Sanderson, Mark Hall, Allan Hanbury, and Elaine Toms, editors, Information Access Evaluation. Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). pages 268–299.

Yan Shao, Christian Hardmeier, and Joakim Nivre. 2018. Universal Word Segmentation: Implementation and Interpretation. Transactions of the Association for Computational Linguistics 6:421–435.

Yan Shao, Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre. 2017. Character-Based Joint Segmentation and POS Tagging for Chinese using Bidirectional RNN-CRF. In The 8th International Joint Conference on Natural Language Processing. pages 173–183.

Aaron Smith, Miryam de Lhoneux, Sara Stymne, and Joakim Nivre. 2018. An Investigation of the Interactions Between Pre-Trained Word Embeddings, Character Models and POS Tags in Dependency Parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Sara Stymne, Miryam de Lhoneux, Aaron Smith, and Joakim Nivre. 2018. Parser Training with Heterogeneous Treebanks. In Proceedings of the 56th Annual Meeting of the ACL, Short Papers. pages 619–625.

Jörg Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA).

Pak-Kwong Wong and Chorkin Chan. 1996. Chinese Word Segmentation Based on Maximum Matching and Word Binding Force. In Proceedings of the 16th International Conference on Computational Linguistics. pages 200–203.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

Daniel Zeman, Martin Popel, Milan Straka, et al. 2017. CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.
