Universal Dependency Parsing from Scratch
Peng Qi,* Timothy Dozat,* Yuhao Zhang,* Christopher D. Manning
Stanford University
Stanford, CA 94305
{pengqi, tdozat, yuhaozhang, manning}@stanford.edu

Abstract

This paper describes Stanford's system at the CoNLL 2018 UD Shared Task. We introduce a complete neural pipeline system that takes raw text as input, and performs all tasks required by the shared task, ranging from tokenization and sentence segmentation, to POS tagging and dependency parsing. Our single system submission achieved very competitive performance on big treebanks. Moreover, after fixing an unfortunate bug, our corrected system would have placed 2nd, 1st, and 3rd on the official evaluation metrics LAS, MLAS, and BLEX, and would have outperformed all submission systems on low-resource treebank categories on all metrics by a large margin. We further show the effectiveness of different model components through extensive ablation studies.

1 Introduction

Dependency parsing is an important component in various natural language processing (NLP) systems for semantic role labeling (Marcheggiani and Titov, 2017), relation extraction (Zhang et al., 2018), and machine translation (Chen et al., 2017). However, most research has treated dependency parsing in isolation, and largely ignored upstream NLP components that prepare relevant data for the parser, e.g., tokenizers and lemmatizers (Zeman et al., 2017). In reality, however, these upstream systems are still far from perfect.

To this end, in our submission to the CoNLL 2018 UD Shared Task, we built a raw-text-to-CoNLL-U pipeline system that performs all tasks required by the Shared Task (Zeman et al., 2018).1 Harnessing the power of neural systems, this pipeline achieves competitive performance in each of the inter-linked stages: tokenization, sentence and word segmentation, part-of-speech (POS)/morphological features (UFeats) tagging, lemmatization, and finally, dependency parsing. Our main contributions include:

• New methods for combining symbolic statistical knowledge with flexible, powerful neural systems to improve robustness;
• A biaffine classifier for joint POS/UFeats prediction that improves prediction consistency;
• A lemmatizer enhanced with an edit classifier that improves the robustness of a sequence-to-sequence model on rare sequences; and
• Extensions to our parser from Dozat et al. (2017) to model linearization.

Our system achieves competitive performance on big treebanks. After fixing an unfortunate bug, the corrected system would have placed 2nd, 1st, and 3rd on the official evaluation metrics LAS, MLAS, and BLEX, and would have outperformed all submission systems on low-resource treebank categories on all metrics by a large margin. We perform extensive ablation studies to demonstrate the effectiveness of our novel methods, and highlight future directions to improve the system.2

* These authors contributed roughly equally.
1 We chose to develop a pipeline system mainly because it allows easier parallel development and faster model tuning in a shared task context.
2 To facilitate future research, we make our implementation public at: https://github.com/stanfordnlp/stanfordnlp.

2 System Description

In this section, we present detailed descriptions of each component of our neural pipeline system, namely the tokenizer, the POS/UFeats tagger, the lemmatizer, and finally the dependency parser.
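To make the pipeline structure concrete, the following is a minimal Python sketch of how such a raw-text-to-CoNLL-U pipeline could be wired together. All class and method names here are hypothetical placeholders for illustration, not the actual stanfordnlp API.

from typing import List

class Pipeline:
    # Chains the separately trained components described in this section.
    def __init__(self, tokenizer, mwt_expander, tagger, lemmatizer, parser):
        self.tokenizer = tokenizer        # raw text -> sentences of tokens, with MWT labels
        self.mwt_expander = mwt_expander  # multi-word tokens -> syntactic words
        self.tagger = tagger              # words -> UPOS/XPOS + UFeats
        self.lemmatizer = lemmatizer      # words (+ tags) -> lemmas
        self.parser = parser              # words (+ tags) -> heads and dependency relations

    def process(self, raw_text: str) -> List[dict]:
        # Each stage consumes the previous stage's output, mirroring the CoNLL-U columns.
        sentences = self.tokenizer.tokenize(raw_text)
        sentences = self.mwt_expander.expand(sentences)
        sentences = self.tagger.tag(sentences)
        sentences = self.lemmatizer.lemmatize(sentences)
        return self.parser.parse(sentences)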
2.1 Tokenizer

To prepare sentences in the form of a list of words for downstream processing, the tokenizer component reads raw text and outputs sentences in the CoNLL-U format. This is achieved with two subsystems: one for joint tokenization and sentence segmentation, and the other for splitting multi-word tokens into syntactic words.

[Figure 1: Illustration of the tokenizer/sentence segmenter model. Components in blue represent the gating mechanism between the two layers.]

Tokenization and sentence segmentation. We treat joint tokenization and sentence segmentation as a unit-level sequence tagging problem. For most languages, a unit of text is a single character; however, in Vietnamese orthography, the most natural units of text are single syllables.3 We assign one out of five tags to each of these units: end of token (EOT), end of sentence (EOS), multi-word token (MWT), multi-word end of sentence (MWS), and other (OTHER). We use bidirectional LSTMs (BiLSTMs) as the base model to make unit-level predictions. At each unit, the model predicts hierarchically: it first decides whether a given unit is at the end of a token with a score s^{(tok)}, then classifies token endings into finer-grained categories with two independent binary classifiers: one for sentence ending s^{(sent)}, and one for MWT s^{(MWT)}.

Since sentence boundaries and MWTs usually require a larger context to determine (e.g., periods following abbreviations or the ambiguous word “des” in French), we incorporate token-level information into a two-layer BiLSTM as follows (see also Figure 1). The first-layer BiLSTM operates directly on raw units, and makes an initial prediction over the categories. To help capture local unit patterns more easily, we also combine the first-layer BiLSTM with 1-D convolutional networks, by using a one-hidden-layer convolutional network (CNN) with ReLU nonlinearity at its first layer, giving an effect a little like a residual connection (He et al., 2016). The output of the CNN is simply added to the concatenated hidden states of the BiLSTM for downstream computation:

h_1^{RNN} = [\overrightarrow{h}_1; \overleftarrow{h}_1] = \mathrm{BiLSTM}_1(x),    (1)
h_1^{CNN} = \mathrm{CNN}(x),    (2)
h_1 = h_1^{RNN} + h_1^{CNN},    (3)
[s_1^{(tok)}; s_1^{(sent)}; s_1^{(MWT)}] = W_1 h_1,    (4)

where x is the input character representations, and W_1 contains the weights and biases for a linear classifier.4 For each unit, we concatenate its trainable embedding with a four-dimensional binary feature vector as input, each dimension corresponding to one of the following feature functions: (1) does the unit start with whitespace; (2) does it start with a capitalized letter; (3) is the unit fully capitalized; and (4) is it purely numerical.

To incorporate token-level information at the second layer, we use a gating mechanism to suppress representations at non-token boundaries before propagating hidden states upward:

g_1 = h_1 \odot \sigma(s_1^{(tok)}),    (5)
h_2 = [\overrightarrow{h}_2; \overleftarrow{h}_2] = \mathrm{BiLSTM}_2(g_1),    (6)
[s_2^{(tok)}; s_2^{(sent)}; s_2^{(MWT)}] = W_2 h_2,    (7)

where ⊙ is an element-wise product broadcast over all dimensions of h_1 for each unit. This can be viewed as a simpler alternative to multi-resolution RNNs (Serban et al., 2017), where the first-layer BiLSTM operates at the unit level, and the second layer operates at the token level. Unlike multi-resolution RNNs, this formulation is end-to-end differentiable, and can more easily leverage efficient off-the-shelf RNN implementations.

To combine predictions from both layers of the BiLSTM, we simply sum the scores to obtain s^{(X)} = s_1^{(X)} + s_2^{(X)}, where X ∈ {tok, sent, MWT}. The final probability over the tags is then

p_{EOT} = p_{+--},   p_{EOS} = p_{++-},    (8)
p_{MWT} = p_{+-+},   p_{MWS} = p_{+++},    (9)

where p_{±±±} = σ(±s^{(tok)}) σ(±s^{(sent)}) σ(±s^{(MWT)}), and σ(·) is the logistic sigmoid function. p_{OTHER} is simply σ(−s^{(tok)}). The model is trained to minimize the standard cross entropy loss.

3 In this case, we define a syllable as a consecutive run of alphabetic characters, numbers, or individual symbols, together with any leading white spaces before them.
4 We will omit bias terms in affine transforms for clarity.
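To make the two-layer model concrete, below is a minimal PyTorch sketch of Eqs. (1)-(7), with the scores of the two layers summed as described above. The layer sizes, CNN kernel width, and other hyperparameters are illustrative assumptions rather than the settings used in the actual submission.

import torch
import torch.nn as nn

class TokenizerModel(nn.Module):
    def __init__(self, num_units, emb_dim=32, feat_dim=4, hidden_dim=64):
        super().__init__()
        in_dim = emb_dim + feat_dim  # trainable unit embedding + 4 binary features
        self.emb = nn.Embedding(num_units, emb_dim)
        # First layer: BiLSTM over raw units, Eq. (1)
        self.bilstm1 = nn.LSTM(in_dim, hidden_dim, bidirectional=True, batch_first=True)
        # One-hidden-layer 1-D CNN with ReLU at its first layer, Eq. (2)
        self.cnn = nn.Sequential(
            nn.Conv1d(in_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, 2 * hidden_dim, kernel_size=3, padding=1),
        )
        # Linear classifiers producing [s(tok); s(sent); s(MWT)], Eqs. (4) and (7)
        self.clf1 = nn.Linear(2 * hidden_dim, 3)
        self.clf2 = nn.Linear(2 * hidden_dim, 3)
        # Second layer: BiLSTM over gated representations, Eq. (6)
        self.bilstm2 = nn.LSTM(2 * hidden_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, units, feats):
        # units: (batch, seq) unit ids; feats: (batch, seq, 4) float binary features
        x = torch.cat([self.emb(units), feats], dim=-1)
        h1_rnn, _ = self.bilstm1(x)                            # Eq. (1)
        h1_cnn = self.cnn(x.transpose(1, 2)).transpose(1, 2)   # Eq. (2)
        h1 = h1_rnn + h1_cnn                                   # Eq. (3)
        s1 = self.clf1(h1)                                     # Eq. (4): columns = tok, sent, MWT
        g1 = h1 * torch.sigmoid(s1[..., :1])                   # Eq. (5): gate by first-layer s(tok)
        h2, _ = self.bilstm2(g1)                               # Eq. (6)
        s2 = self.clf2(h2)                                     # Eq. (7)
        return s1 + s2                                         # summed scores s(tok), s(sent), s(MWT)

The summed scores can then be turned into the five tag probabilities through the sigmoid products of Eqs. (8)-(9) and trained with the standard cross entropy loss.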
Multi-word Token Expansion. The tokenizer/sentence segmenter produces a collection of sentences, each being a list of tokens, some of which are labeled as multi-word tokens (MWTs). We must expand these MWTs into the underlying syntactic words they correspond to (e.g., “im” to “in dem” in German), in order for downstream systems to process them properly. To achieve this, we take a hybrid approach that combines symbolic statistical knowledge with the power of neural systems.

The symbolic statistical side is a frequency lexicon. Many languages, like German, have only a handful of rules for expanding a few MWTs. We leverage this information by simply counting the number of times a MWT is expanded into different sequences of words in the training set, and retaining the most frequent expansion in a dictionary to use at test time. When building this dictionary, we lowercase all words in the expansions to improve robustness. However, this approach would fail for languages with rich clitics, a large set of unique MWTs, and/or complex rules for MWT expansion, such as Arabic and Hebrew.

To bring the symbolic and neural systems together, we train them separately and use the following protocol during evaluation: for each MWT, we first look it up in the dictionary, and return the expansion recorded there if one can be found. If this fails, we retry by lowercasing the incoming token. If that fails again, we resort to the neural system to predict the final expansion. This allows us not only to account for languages with flexible MWT patterns (Arabic and Hebrew), but also to leverage the training set statistics to cover both languages with simpler MWT rules, and MWTs in the flexible languages seen in the training set, without fail. This results in a high-performance, robust system for multi-word token expansion.

2.2 POS/UFeats Tagger

Our tagger follows closely that of Dozat et al. (2017), with a few extensions. As in that work, the core of the tagger is a highway BiLSTM (Srivastava et al., 2015) with inputs coming from the concatenation of three sources: (1) a pretrained word embedding, from the word2vec embeddings provided with the task when available (Mikolov et al., 2013), and from fastText embeddings otherwise;