CoNLL 2017 Shared Task
TurkuNLP: Delexicalized Pre-training of Word Embeddings for Dependency Parsing

Jenna Kanerva (1,2), Juhani Luotolahti (1,2), and Filip Ginter (1)
(1) Turku NLP Group, (2) University of Turku Graduate School (UTUGS)
University of Turku, Finland
[email protected], [email protected], [email protected]

Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 119–125, Vancouver, Canada, August 3–4, 2017. © 2017 Association for Computational Linguistics

Abstract

We present the TurkuNLP entry in the CoNLL 2017 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies. The system is based on the UDPipe parser, and our focus is on exploring various techniques to pre-train the word embeddings used by the parser in order to improve its performance, especially on languages with small training sets. The system ranked 11th among the 33 participants overall, being 8th on the small treebanks, 10th on the large treebanks, 12th on the parallel test sets, and 26th on the surprise languages.

1 Introduction

In this paper we describe the TurkuNLP entry in the CoNLL 2017 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2017). The Universal Dependencies (UD) treebank collection (Nivre et al., 2017b) has 70 treebanks for 50 languages with cross-linguistically consistent annotation. Of these, the 63 treebanks which have at least 10,000 tokens in their test section are used for training and testing the systems. Further, a parallel corpus consisting of 1,000 sentences in 14 languages was developed as an additional test set, and finally, the shared task included test sets for four "surprise" languages not known until a week prior to the test phase of the shared task (Nivre et al., 2017a). No training data was provided for these languages; only a handful of sentences was given as an example. As an additional novelty, participation in the shared task involved developing an end-to-end parsing system, from raw text to dependency trees, for all of the languages and treebanks. The participants were provided with automatically predicted word and sentence segmentation as well as morphological tags for the test sets, which they could choose to use as an alternative to developing their own segmentation and tagging. These baseline segmentations and morphological analyses were provided by UDPipe v1.1 (Straka et al., 2016).

In addition to the manually annotated treebanks, the shared task organizers also distributed a large collection of web-crawled text for all but one of the languages in the shared task, totaling over 90 billion tokens of fully dependency-parsed data. Once again, these analyses were produced by the UDPipe system. This automatically processed large dataset was intended by the organizers to complement the manually annotated data and, for instance, support the induction of word embeddings.

As an overall strategy for the shared task, we chose to build on an existing parser and focus on exploring various methods of pre-training the parser and especially its embeddings, using the large, automatically analyzed corpus provided by the organizers. We expected this strategy to be particularly helpful for languages with only a little training data. On the other hand, we put only minimal effort into the surprise languages. We also chose to use the word and sentence segmentation of the test datasets, as provided by the organizers. As we will demonstrate, the results of our system correlate with the focus of our efforts. Initially, we focused on the latest ParseySaurus parser (Alberti et al., 2017), but due to the magnitude of the task and restrictions on time, we ultimately used the UDPipe parsing pipeline of Straka et al. (2016) as the basis of our efforts.

2 Word Embeddings

The most important component of our system is the set of novel techniques we used to pre-train the word embeddings. Our word embeddings combine three important aspects: 1) delexicalized syntactic contexts for inducing word embeddings, 2) word embeddings built from character n-grams, and 3) post-training injection and modification of input context embeddings for unseen words.

2.1 Delexicalized Syntactic Contexts

Word embeddings induced from large text corpora have been a key resource in many NLP tasks in recent years. In many common tools for learning word embeddings, such as word2vec (Mikolov et al., 2013), the context for a focus word is a sliding window of words surrounding the focus word in linear order. Levy and Goldberg (2014) extend the context with dependency trees, where the context is defined as the words nearby in the dependency tree, with the dependency relation additionally attached to the context words, for example interesting/amod.

In Kanerva et al. (2017), we show that word embeddings trained in a strongly syntactic fashion outperform standard word2vec embeddings in dependency parsing. In particular, the context is fully delexicalized: instead of using words in the word2vec output layer, only part-of-speech tags, morphological features and syntactic functions are predicted. This delexicalized syntactic context is shown to lead to higher performance as well as to generalize better across languages.

In our shared task submission we build on top of this previous work and optimize the embeddings even closer to the parsing task: we extend the original delexicalized context to also predict the parsing actions of a transition-based parser. From the existing parse trees in the raw data collection, we create the transition sequence used to produce each tree, and for each word we collect features describing the actions taken when the word is in the top two positions of the stack, e.g. if the focus word is first on the stack, what the next action is. Our word–context pairs are illustrated in Table 1. In this way, we strive to build embeddings which relate together words that appear in similar configurations of the parser.

    word          context
    -----------   ----------------
    interesting   ADJ
    interesting   Degree=Pos
    interesting   amod
    question      NOUN
    question      Number=Sing
    question      root
    question      Dep amod
    interesting   stack1 shift
    interesting   stack2 left-arc
    interesting   stack2 left amod
    question      stack1 left-arc
    question      stack1 left amod
    question      stack1 root

    character four-grams
    interesting:  char $int, char inte, char nter, char tere, char eres,
                  char rest, char esti, char stin, char ting, char ing$
    question:     char $que, char ques, char uest, char esti, char stio,
                  char tion, char ion$

Table 1: Delexicalized contexts and character n-grams for word embeddings, for the example phrase "interesting question", where question is the root and interesting its amod dependent.

2.2 Word embeddings from character n-grams

It is known that the initial embeddings affect the performance of neural dependency parsers, and that pre-training the embedding matrix prior to training has an important effect on the performance of neural systems (Chen and Manning, 2014; Andor et al., 2016). Since the scope of languages in the task is large, the embeddings used for this dependency parsing task need to be representative of languages with both large and small available resources, and they also need to capture morphological information for languages with complex morphologies as well as for those with less morphological variation.

To better address the needs of small languages with very little data available, as well as of morphologically rich languages, we build our word embedding models with the methods used in the popular fastText representation of Bojanowski et al. (2016) (https://github.com/facebookresearch/fastText). They suggest using character n-grams, instead of tokens, as the basic units in building embeddings for words. First, words are split into character n-grams, and embeddings are learned separately for each of these character n-grams. Secondly, word vectors are assembled from these trained character embeddings by averaging all character n-grams present in a word. To make the embeddings more informative, the n-grams include special markers for the beginning and the end of the token, allowing the model, for example, to learn special embeddings for word suffixes, which are often used as inflectional markers. In addition to the n-grams, a vector for the full word is also trained, and when final word vectors are produced, the full-word vector is treated like any other n-gram and is averaged as part of the final product along with the rest. This allows the potential benefits of token-level embeddings to be realized in our model.

Table 1 demonstrates the splitting of a word into character four-grams with special start and end markers. When learning embeddings for these character n-grams, the context for each n-gram is replicated from the original word context, i.e. each character n-gram created from the word interesting gets the delexicalized context assigned to that word, namely ADJ, Degree=Pos, amod, stack1 shift, stack2 left-arc, stack2 left amod in Table 1.

We train our embedding models using the word2vecf software by Levy and Goldberg (2014) with negative sampling, the skip-gram architecture, an embedding dimensionality of 100, the delexicalized syntactic contexts explained in Section 2.1, and character n-grams of length 3–6. The amount of raw data used is limited to a maximum of 50 million unique sentences in order to keep the training times bearable, and sentences longer than 30 tokens are discarded. For languages with only very limited resources we run 10 training iterations, but for the rest only one iteration is used. These character n-gram models, explained in detail in Section 2.2, can then be used to build word embeddings for arbitrary words, only requiring that at least one of the extracted character n-grams is present in our embedding model.

For parsing we also used the parameters optimized for the UDPipe baseline system and changed only the parts related to the pre-trained embeddings. As we do not perform further parameter optimization, we always trained our models using the full training set, also in cases where separate development sets were not provided.
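To make the delexicalized contexts of Section 2.1 concrete, the sketch below (a hypothetical helper, not the authors' actual code) derives the lexical rows of Table 1 from a toy dependency-parsed sentence: each word is paired with its part-of-speech tag, each of its morphological features, its own dependency relation, and a "Dep <relation>" context for each of its dependents. The transition-based contexts (stack1 shift, stack2 left-arc, ...) would additionally require replaying the parser's transition sequence and are omitted here.

```python
# Sketch: delexicalized word-context pairs (Section 2.1 / Table 1).
# Token fields loosely follow CoNLL-U: FORM, UPOS, FEATS, HEAD (1-based, 0 = root), DEPREL.

def delex_contexts(sentence):
    """Yield (word, context) pairs with fully delexicalized syntactic contexts."""
    pairs = []
    for i, (form, upos, feats, _head, deprel) in enumerate(sentence, start=1):
        pairs.append((form, upos))                       # part-of-speech tag
        for feat in (feats.split("|") if feats != "_" else []):
            pairs.append((form, feat))                   # morphological features
        pairs.append((form, deprel))                     # own syntactic function
        for _, _, _, dhead, ddeprel in sentence:         # one context per dependent,
            if dhead == i:                               # marking the head side of the relation
                pairs.append((form, "Dep " + ddeprel))
    return pairs

# "interesting question": question is the root, interesting its amod dependent
sentence = [
    ("interesting", "ADJ",  "Degree=Pos",  2, "amod"),
    ("question",    "NOUN", "Number=Sing", 0, "root"),
]
print(delex_contexts(sentence))
```

Run on the example phrase, this yields exactly the first seven context rows of Table 1.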
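The character n-gram splitting of Section 2.2 can be sketched as follows; this is a minimal illustration of the stated scheme (boundary markers, n-grams of lengths 3–6), not the fastText implementation itself. With the length fixed to 4 it reproduces the character four-grams of Table 1.

```python
def char_ngrams(word, n_min=3, n_max=6, marker="$"):
    """Split a word into character n-grams, with start/end boundary markers."""
    marked = marker + word + marker
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

# Character four-grams as in Table 1
print(char_ngrams("interesting", 4, 4))
# ['$int', 'inte', 'nter', 'tere', 'eres', 'rest', 'esti', 'stin', 'ting', 'ing$']
print(char_ngrams("question", 4, 4))
# ['$que', 'ques', 'uest', 'esti', 'stio', 'tion', 'ion$']
```

Note how the boundary markers let the model learn distinct embeddings for suffixes (ing$) as opposed to word-internal character sequences.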
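Finally, assembling a word vector from trained n-gram embeddings amounts to averaging the vectors of all of the word's n-grams together with the full-word vector when one was trained; a word unseen during training still receives a vector as long as at least one of its n-grams is in the model. A minimal sketch with toy 3-dimensional vectors (the real models use dimensionality 100; the emb dictionary and helper are illustrative, not the authors' code):

```python
def word_vector(word, emb, n_min=3, n_max=6):
    """Average the embeddings of the word's character n-grams (plus the
    full-word embedding, if trained) into a single word vector.
    Returns None when no unit of the word is present in the model."""
    marked = "$" + word + "$"
    units = [marked[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(marked) - n + 1)]
    units.append(word)  # full-word vector is treated like any other n-gram
    vecs = [emb[u] for u in units if u in emb]
    if not vecs:
        return None
    return [sum(comp) / len(vecs) for comp in zip(*vecs)]

# toy model: two trained n-grams and one full-word vector, dimensionality 3
emb = {"$ca": [1.0, 0.0, 0.0], "at$": [0.0, 1.0, 0.0], "cat": [0.0, 0.0, 1.0]}
print(word_vector("cat", emb))  # average over all units present in the model
print(word_vector("bat", emb))  # unseen word; only 'at$' is known: [0.0, 1.0, 0.0]
```

This is what makes the approach useful for morphologically rich and low-resource languages: out-of-vocabulary inflected forms still receive vectors through their shared affixes.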