Polyglot Semantic Role Labeling

Phoebe Mulcaire♥  Swabha Swayamdipta♦  Noah A. Smith♥
♥Paul G. Allen School of Computer Science & Engineering, University of Washington
♦School of Computer Science, Carnegie Mellon University
{pmulc,nasmith}@cs.washington.edu, [email protected]

Abstract

Previous approaches to multilingual semantic dependency parsing treat languages independently, without exploiting the similarities between semantic structures across languages. We experiment with a new approach where we combine resources from a pair of languages in the CoNLL 2009 shared task (Hajič et al., 2009) to build a polyglot semantic role labeler. Notwithstanding the absence of parallel data, and the dissimilarity in annotations between languages, our approach results in an improvement in SRL performance on multiple languages over a monolingual baseline. Analysis of the polyglot model shows it to be advantageous in lower-resource settings.

ENG: I think Peter even made some deals with the gorillas .
     O O A0 AM-ADV O O A1 AM-ADV O O

SPA: Pero el suizo difícilmente atacará a Rominger en la montaña .
     O O arg0-agt argM-adv O O arg1-pat argM-loc O O
     ("But the Swiss will hardly attack Rominger in the mountains.")

CES: Četrans oslovil sedm velkých evropských výrobců nákladních automobilů.
     O O RSTR RSTR RSTR O O PAT
     ("Četrans approached seven large European truck manufacturers.")

Figure 1: Example predicate-argument structures from English, Spanish, and Czech. Note that the argument labels are different in each language.

1 Introduction

The standard approach to multilingual NLP is to design a single architecture, but tune and train a separate model for each language. While this method allows for customizing the model to the particulars of each language and the available data, it also presents a problem when little data is available: extensive language-specific annotation is required. The reality is that most languages have very little annotated data for most NLP tasks.

Ammar et al. (2016a) found that using training data from multiple languages annotated with Universal Dependencies (Nivre et al., 2016), and represented using multilingual word vectors, outperformed monolingual training. Inspired by this, we apply the idea of training one model on multiple languages, which we call polyglot training, to PropBank-style semantic role labeling (SRL). We train several parsers for each language in the CoNLL 2009 dataset (Hajič et al., 2009): a traditional monolingual version, and variants which additionally incorporate supervision from the English portion of the dataset. To our knowledge, this is the first multilingual SRL approach to combine supervision from several languages.

The CoNLL 2009 dataset includes seven different languages, allowing study of trends across the same. Unlike the Universal Dependencies dataset, however, the semantic label spaces are entirely language-specific, making our task more challenging. Nonetheless, the success of polyglot training in this setting demonstrates that sharing of statistical strength across languages does not depend on explicit alignment in annotation conventions, and can be done simply through parameter sharing. We show that polyglot training can result in better labeling accuracy than a monolingual parser, especially for low-resource languages. We find that even a simple combination of data is as effective as more complex kinds of polyglot training. We include a breakdown into label categories of the differences between the monolingual and polyglot models. Our findings indicate that polyglot training consistently improves label accuracy for common labels.

2 Data

We evaluate our system on the semantic role labeling portion of the CoNLL-2009 shared task (Hajič et al., 2009), on all seven languages, namely Catalan, Chinese, Czech, English, German, Japanese, and Spanish. For each language, certain tokens in each sentence in the dataset are marked as predicates. Each predicate takes as arguments other words in the same sentence, their relationship marked by labeled dependency arcs. Sentences may contain no predicates.

Despite the consistency of this format, there are significant differences between the training sets across languages. (This is expected, as the datasets were annotated independently under diverse formalisms and only later converted into CoNLL format; Hajič et al., 2009.) English uses PropBank role labels (Palmer et al., 2005). Catalan, Chinese, English, German, and Spanish include (but are not limited to) labels such as "arg0-agt" (for "agent") or "A0" that may correspond to some degree to each other and to the English roles. Catalan and Spanish share most labels (being drawn from the same source corpus, AnCora; Taulé et al., 2008), and English and German share some labels. Czech and Japanese each have their own distinct sets of argument labels, most of which do not have clear correspondences to English or to each other.

We also note that, due to semi-automatic projection of annotations to construct the German dataset, more than half of German sentences do not include labeled predicates and arguments. Thus while German has almost as many sentences as Czech, it has by far the fewest training examples (predicate-argument structures); see Table 1.

        # sentences   # sentences w/ 1+ predicates   # predicates
CAT     13200         12876                          37444
CES     38727         38579                          414133
DEU     36020         14282                          17400
ENG     39279         37847                          179014
JPN     4393          4344                           25712
SPA     14329         13836                          43828
ZHO     22277         21073                          102827

Table 1: Train data statistics. Languages are indicated with ISO 639-3 codes.

3 Model

Given a sentence with a marked predicate, the CoNLL 2009 shared task requires disambiguation of the sense of the predicate, and labeling all its dependent arguments. The shared task assumed predicates have already been identified, hence we do not handle the predicate identification task.

Our basic model adapts the span-based dependency SRL model of He et al. (2017). This adaptation treats the dependent arguments as argument spans of length 1. Additionally, BIO consistency constraints are removed from the original model: each token is tagged simply with the argument label or an empty tag. A similar approach has also been proposed by Marcheggiani et al. (2017).

The input to the model consists of a sequence of pretrained embeddings for the surface forms of the sentence tokens. Each token embedding is also concatenated with a vector indicating whether the word is a predicate or not. Since the part-of-speech tags in the CoNLL 2009 dataset are based on a different tagset for each language, we do not use these. Each training instance consists of the annotations for a single predicate. These representations are then passed through a deep, multi-layer bidirectional LSTM (Graves, 2013; Hochreiter and Schmidhuber, 1997) with highway connections (Srivastava et al., 2015).

We use the hidden representations produced by the deep biLSTM for both argument labeling and predicate sense disambiguation in a multitask setup; this is a modification to the models of He et al. (2017), who did not handle predicate senses, and of Marcheggiani et al. (2017), who used a separate model. These two predictions are made independently, with separate softmaxes over different last-layer parameters; we then combine the losses for each task when training. For predicate sense disambiguation, since the predicate has been identified, we choose from a small set of valid predicate senses as the tag for that token. This set of possible senses is selected based on the training data: we map from lemmatized tokens to predicates and from predicates to the set of all senses of that predicate. Most predicates are only observed to have one or two corresponding senses, making the set of available senses at test time quite small (less than five senses per predicate on average across all languages). If a particular lemma was not observed in training, we heuristically predict it as the first sense of that predicate. For Czech and Japanese, the predicate sense annotation is simply the lemmatized token of the predicate, giving a one-to-one predicate-"sense" mapping.

For argument labeling, every token in the sentence is assigned one of the argument labels, or NULL if the model predicts it is not an argument to the indicated predicate.

3.1 Monolingual Baseline

We use pretrained word embeddings as input to the model. For each of the shared task languages, we produced GloVe vectors (Pennington et al., 2014) from the news, web, and Wikipedia text of the Leipzig Corpora Collection (Goldhahn et al., 2012). We trained 300-dimensional vectors, then reduced them to 100 dimensions with principal component analysis for efficiency.

3.2 Simple Polyglot Sharing

In the first polyglot variant, we consider multilingual sharing between each language and English by using pretrained multilingual embeddings. This polyglot model is trained on the union of annotations in the two languages. We use stratified sampling to give the two datasets equal effective weight in training, and we ensure that every training instance is seen at least once per epoch.

3.3 Language Identification

In the second variant, we concatenate a language ID vector to each multilingual word embedding and predicate indicator feature in the input representation. This vector is randomly initialized and updated in training. These additional parameters provide a small degree of language-specificity in the model, while still sharing most parameters.

3.4 Language-Specific LSTMs

This third variant takes inspiration from the "frustratingly easy" architecture of Daumé III (2007) for domain adaptation. In addition to processing every example with a shared biLSTM as in previous models, we add language-specific biLSTMs that are trained only on the examples belonging to one language. Each of these language-specific biLSTMs is two layers deep, and is combined with the shared biLSTM in the input to the third layer. This adds a greater degree of language-specific processing while still sharing representations across languages. It also uses the language identification vector and multilingual word vectors in the input.
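The lemma-to-senses lookup used for predicate sense disambiguation (Section 3) can be sketched as follows. This is a minimal illustration; the class name and API are ours, not from the paper's code. It shows the two behaviors described in the text: restricting prediction to senses observed in training, and backing off to the first sense for unseen lemmas.

```python
from collections import defaultdict

class SenseLexicon:
    """Map lemmas to the predicate senses observed in training data."""

    def __init__(self):
        self.senses = defaultdict(set)  # lemma -> set of sense labels

    def observe(self, lemma, sense):
        """Record a (lemma, sense) pair seen in the training data."""
        self.senses[lemma].add(sense)

    def candidates(self, lemma):
        """Return the valid senses to choose among at test time."""
        if lemma in self.senses:
            return sorted(self.senses[lemma])
        # Unseen lemma: heuristically back off to the first sense.
        return [lemma + ".01"]

lexicon = SenseLexicon()
lexicon.observe("look", "look.01")
lexicon.observe("look", "look.02")

print(lexicon.candidates("look"))  # ['look.01', 'look.02']
print(lexicon.candidates("gaze"))  # ['gaze.01'] (first-sense fallback)
```

For Czech and Japanese, where the sense is just the lemma, the candidate set trivially has size one.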

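The stratified sampling scheme of Section 3.2 is not spelled out in detail; one simple way to realize "equal effective weight" while seeing every instance at least once per epoch is to oversample the smaller dataset to the size of the larger one. A sketch under that assumption (the function name and exact procedure are ours):

```python
import random

def polyglot_epoch(primary, english, seed=0):
    """Build one epoch mixing two datasets of unequal size.

    The smaller dataset is repeated (plus a random partial copy) until
    it matches the larger one, so both contribute equally per epoch and
    every instance from both datasets appears at least once.
    """
    rng = random.Random(seed)
    larger, smaller = (primary, english) if len(primary) >= len(english) else (english, primary)
    reps, extra = divmod(len(larger), len(smaller))
    upsampled = smaller * reps + rng.sample(smaller, extra)
    epoch = larger + upsampled
    rng.shuffle(epoch)
    return epoch

epoch = polyglot_epoch([("cat", i) for i in range(5)],
                       [("eng", i) for i in range(2)])
print(len(epoch))  # 10: five primary instances + five oversampled English
```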
Pretrained multilingual embeddings. The basis of our polyglot training is the use of pretrained multilingual word vectors, which allow representing entirely distinct vocabularies (such as the tokens of different languages) in a shared representation space, allowing crosslingual learning (Klementiev et al., 2012). We produced multilingual embeddings from the monolingual embeddings using the method of Ammar et al. (2016b): for each non-English language, a small crosslingual dictionary and canonical correlation analysis was used to find a transformation of the non-English vectors into the English vector space (Faruqui and Dyer, 2014). (For English, we used the vectors provided on the GloVe website, nlp.stanford.edu/projects/glove/.)

Unlike multilingual word representations, argument label sets are disjoint between language pairs, and correspondences are not clearly defined. Hence, we use separate label representations for each language's labels. Similarly, while (for example) ENG:look and SPA:mira may be semantically connected, the senses look.01 and mira.01 may not correspond. Hence, predicate sense representations are also language-specific.

4 Experiments

We present our results in Table 2. We observe that simple polyglot training improves over monolingual training, with the exception of Czech, where we observe no change in performance. The languages with the fewest training examples (German, Japanese, Catalan) show the most improvement, while large-dataset languages such as Czech or Chinese see little or no improvement (Figure 2).

The language ID model performs inconsistently; it is better than the simple polyglot model in some cases, including Czech, but not in all. The language-specific LSTMs model performs best on a few languages, such as Catalan and Chinese, but worst on others. While these results may reflect differences between languages in the optimal amount of crosslingual sharing, we focus on the simple polyglot results in our analysis, which sufficiently demonstrate that polyglot training can improve performance over monolingual training.

We also report performance of state-of-the-art systems in each of these languages, all of which make explicit use of syntactic features, Marcheggiani et al. (2017) excepted. While this results in better performance on many languages, our model has the advantage of not relying on a syntactic parser, and is hence more applicable to languages with lower resources. However, the results suggest that syntactic information is critical for strong performance on German, which has the fewest predicates and thus the least semantic annotation for a semantics-only model to learn from. Nevertheless, our baseline is on par with the best published scores for Chinese, and it shows strong performance on most languages.

Model                             CAT    CES    DEU    ENG    JPN    SPA    ZHO
Marcheggiani et al. (2017)        -      86.00  -      87.60  -      80.30  81.20
Best previously reported          80.32  86.00  80.10  89.10  78.15  80.50  81.20
Monolingual                       77.31  84.87  66.71  86.54  74.99  75.98  81.26
+ ENG (simple polyglot)           79.08  84.82  69.97  -      76.00  76.45  81.50
+ ENG (language ID)               79.05  85.14  69.49  -      75.77  77.32  81.42
+ ENG (language-specific LSTMs)   79.45  84.78  68.30  -      75.88  76.86  81.89

Table 2: Semantic F1 scores (including predicate sense disambiguation) on the CoNLL 2009 dataset. State of the art for Catalan and Japanese is from Zhao et al. (2009), for German and Spanish from Roth and Lapata (2016), and for English and Chinese from Marcheggiani and Titov (2017). Italics indicate use of syntax.

[Figure 2: Improvement in absolute F1 with polyglot training with addition of English. Languages are sorted in order of increasing number of predicates in the training set: DEU (17400) +3.26; JPN (25712) +1.01; CAT (37444) +1.77; SPA (43828) +0.47; ZHO (102827) +0.24; CES (414133) -0.04.]

Label-wise results. Table 3 gives the F1 scores for individual label categories in the Catalan and Spanish datasets, as an illustration of the larger trend. In both languages, we find a small but consistent improvement in the most common label categories (e.g., arg1 and argM). Less common label categories are sensitive to small changes in performance; they have the largest changes in F1 in absolute value, but without a consistent direction. This could be attributed to the addition of English data, which improves learning of representations that are useful for the most common labels, but is essentially a random perturbation for the rarer ones. This pattern is seen across languages, and consistently results in overall gains from polyglot training.

One exception is in Czech, where polyglot training reduces accuracy on several common argument labels, e.g., PAT and LOC. While the effect sizes are small (consistent with other languages), the overall F1 score on Czech decreases slightly in the polyglot condition. It may be that the Czech dataset is too large to make use of the comparatively small amount of English data, or that differences in the annotation schemes prevent effective crosslingual transfer. Future work on language pairs that do not include English could provide further insights. Catalan and Spanish, for example, are closely related and use the same argument label set (both being drawn from the AnCora corpus), which would allow for sharing output representations as well as input tokens and parameters.

Polyglot English results. For each language pair, we also evaluated the simple polyglot model on the English test set from the CoNLL 2009 shared task (Table 4). English SRL consistently benefits from polyglot training, with an increase of 0.25-0.7 absolute F1 points, depending on the language. Surprisingly, Czech provides the smallest improvement, despite the large amount of data added; the absence of crosslingual transfer in both directions for the English-Czech case, breaking the pattern seen in other languages, could therefore be due to differences in annotation rather than questions of dataset size.

Labeled vs. unlabeled F1. Table 5 provides unlabeled F1 scores for each language pair. As can be seen here, the unlabeled F1 improvements are generally positive but small, indicating that polyglot training can help both in structure prediction and labeling of arguments. The pattern of seeing the largest improvements on the languages with the smallest datasets generally holds here: the largest F1 gains are in German and Catalan, followed by Japanese, with minimal or no improvement elsewhere.

5 Related Work

Recent improvements in multilingual SRL can be attributed to neural architectures. Swayamdipta et al. (2016) present a transition-based stack LSTM model that predicts syntax and semantics jointly, as a remedy to the reliance on pipelined models. Guo et al. (2016) and Roth and Lapata (2016) use deep biLSTM architectures which use syntactic information to guide the composition. Marcheggiani et al. (2017) use a simple LSTM model over word tokens to tag semantic dependencies, like our model. Their model predicts a token's label based on the combination of the token vector and the predicate vector, and saw benefits from using POS tags, both improvements that could be added to our model. Marcheggiani and Titov (2017) apply the recently-developed graph convolutional networks to SRL, obtaining state-of-the-art results on English and Chinese. All of these approaches are orthogonal to ours, and might benefit from polyglot training.

                         arg0    arg1    arg2    arg3    arg4    argL    argM
Gold label count (CAT)   2117    4296    1713    61      71      49      2968
Monolingual CAT F1       82.06   79.06   68.95   28.89   42.42   39.51   60.85
+ ENG improvement        +2.75   +2.58   +4.53   +18.17  +9.81   +1.35   +1.10

Gold label count (SPA)   2438    4295    1677    49      82      46      3237
Monolingual SPA F1       82.44   77.93   70.24   28.89   41.15   22.50   58.89
+ ENG improvement        +0.37   +0.43   +1.35   -3.40   -3.48   +4.01   +1.26

Table 3: Per-label breakdown of F1 scores for Catalan and Spanish. These numbers reflect labels for each argument; the combination is different from the overall semantic F1, which includes predicate sense disambiguation.

ENG-only   +CAT    +CES    +DEU    +JPN    +SPA    +ZHO
86.54      86.79   87.07   87.07   87.11   87.24   87.10

Table 4: Semantic F1 scores on the English test set for each language pair.
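The reported range of English-side gains can be recomputed directly from Table 4's entries (a quick arithmetic check, not part of the original paper):

```python
# Semantic F1 on the English test set (Table 4).
eng_only = 86.54
with_pair = {"CAT": 86.79, "CES": 87.07, "DEU": 87.07,
             "JPN": 87.11, "SPA": 87.24, "ZHO": 87.10}

# Absolute F1 gain over the English-only baseline for each added language.
gains = {lang: round(f1 - eng_only, 2) for lang, f1 in with_pair.items()}
print(gains)
print(min(gains.values()), max(gains.values()))  # 0.25 0.7
```

The smallest gain (Catalan, +0.25) and the largest (Spanish, +0.70) bracket the 0.25-0.7 range stated in the text.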

Model         CAT    CES    DEU    ENG    JPN    SPA    ZHO
Monolingual   93.92  91.92  87.95  92.87  85.55  93.61  87.93
+ ENG         94.09  91.97  89.01  -      86.17  93.65  87.90

Table 5: Unlabeled semantic F1 scores on the CoNLL 2009 dataset.

Other polyglot models have been proposed for semantics. Richardson et al. (2018) train on multiple (natural language)-(programming language) pairs to improve a model that translates API text into code signature representations. Duong et al. (2017) treat English and German semantic parsing as a multi-task learning problem and saw improvement over monolingual baselines, especially for small datasets. Most relevant to our work is Johannsen et al. (2015), which trains a polyglot model for frame-semantic parsing. In addition to sharing features with multilingual word vectors, they use them to find word translations of target language words for additional lexical features.

6 Conclusion

In this work, we have explored a straightforward method for polyglot training in SRL: use multilingual word vectors and combine training data across languages. This allows sharing without crosslingual alignments, shared annotation, or parallel data. We demonstrate that a polyglot model can outperform a monolingual one for semantic analysis, particularly for languages with less data.

Acknowledgments

We thank Luke Zettlemoyer, Luheng He, and the anonymous reviewers for helpful comments and feedback. This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) Information Innovation Office (I2O) under the Low Resource Languages for Emergent Incidents (LORELEI) program issued by DARPA/I2O under contract HR001115C0113 to BBN. Views expressed are those of the authors alone.

References

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah Smith. 2016a. Many languages, one parser. Transactions of the Association for Computational Linguistics, 4:431-444.

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016b. Massively multilingual word embeddings. arXiv:1602.01925.

Hal Daumé III. 2007. Frustratingly easy domain adaptation. In Proceedings of ACL.

Long Duong, Hadi Afshar, Dominique Estival, Glen Pink, Philip Cohen, and Mark Johnson. 2017. Multilingual semantic parsing and code-switching. In Proceedings of CoNLL.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL.

Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. 2012. Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages. In Proceedings of LREC.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv:1308.0850.

Jiang Guo, Wanxiang Che, Haifeng Wang, Ting Liu, and Jun Xu. 2016. A unified architecture for semantic role labeling and relation classification. In Proceedings of COLING.

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, et al. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of CoNLL.

Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In Proceedings of ACL.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Anders Johannsen, Héctor Martínez Alonso, and Anders Søgaard. 2015. Any-language frame-semantic parsing. In Proceedings of EMNLP.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING.

Diego Marcheggiani, Anton Frolov, and Ivan Titov. 2017. A simple and accurate syntax-agnostic neural model for dependency-based semantic role labeling. arXiv:1701.02593.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of EMNLP.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan T. McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual collection. In Proceedings of LREC.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71-106.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.

Kyle Richardson, Jonathan Berant, and Jonas Kuhn. 2018. Polyglot semantic parsing in APIs. In Proceedings of NAACL.

Michael Roth and Mirella Lapata. 2016. Neural semantic role labeling with dependency path embeddings. arXiv:1605.07515.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In NIPS.

Swabha Swayamdipta, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Greedy, joint syntactic-semantic parsing with stack LSTMs. In Proceedings of CoNLL.

Mariona Taulé, M. Antònia Martí, and Marta Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of LREC.

Hai Zhao, Wenliang Chen, Jun'ichi Kazama, Kiyotaka Uchimoto, and Kentaro Torisawa. 2009. Multilingual dependency learning: Exploiting rich features for tagging syntactic and semantic dependencies. In Proceedings of CoNLL.