Lexical Resources for Low-Resource PoS Tagging in Neural Times
Barbara Plank and Sigrid Klerke
Department of Computer Science, ITU, IT University of Copenhagen, Denmark
[email protected], [email protected]

Abstract

More and more evidence is appearing that integrating symbolic lexical knowledge into neural models aids learning. This contrasts the widely-held belief that neural networks largely learn their own feature representations. For example, recent work has shown benefits of integrating lexicons to aid cross-lingual part-of-speech (PoS) tagging. However, little is known about how complementary such additional information is, and to what extent improvements depend on the coverage and quality of these external resources. This paper seeks to fill this gap by providing a thorough analysis of the contributions of lexical resources for cross-lingual PoS tagging in neural times.

1 Introduction

In natural language processing, the deep learning revolution has shifted the focus from conventional hand-crafted symbolic representations to dense inputs, i.e., adequate representations learned automatically from corpora. However, particularly when working with low-resource languages, small amounts of symbolic lexical resources such as user-generated lexicons are often available even when gold-standard corpora are not. Recent work has shown benefits of integrating conventional lexical information into neural cross-lingual part-of-speech (PoS) tagging (Plank and Agić, 2018). However, little is known about how complementary such additional information is, and to what extent improvements depend on the coverage and quality of these external resources.

The contribution of this paper is an analysis of the models' components (tagger transfer through annotation projection vs. the contribution of encoding lexical and morphosyntactic resources). We seek to understand under which conditions a low-resource neural tagger benefits from external lexical knowledge. In particular:

a) we evaluate the neural tagger across a total of 20+ languages, proposing a novel baseline which uses retrofitting;

b) we investigate the reliance on dictionary size and properties;

c) we analyze model-internal representations via a probing task to investigate to what extent they capture morphosyntactic information.

Our experiments confirm the synergetic effect between a neural tagger and symbolic linguistic knowledge. Moreover, our analysis shows that the composition of the dictionary plays a more important role than its coverage.

2 Methodology

Our base tagger is a bidirectional long short-term memory network (bi-LSTM) (Graves and Schmidhuber, 2005; Hochreiter and Schmidhuber, 1997; Plank et al., 2016) with a rich word encoding model which consists of a character-based bi-LSTM representation $\vec{c}_w$ paired with pre-trained word embeddings $\vec{w}$. Sub-word and especially character-level modeling is currently pervasive in top-performing neural sequence taggers, owing to its capacity to effectively capture morphological features that are useful for labeling out-of-vocabulary (OOV) items. Sub-word information is often coupled with standard word embeddings to mitigate OOV issues. Specifically, i) word embeddings are typically built from massive unlabeled datasets, so OOVs are less likely to be encountered at test time, while ii) character embeddings offer a further, linguistically plausible fallback for the remaining OOVs by modeling intra-word relations.
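The following is a minimal PyTorch-style sketch of this word encoding, i.e., the concatenation $\vec{w} \circ \vec{c}_w$ of a pre-trained word embedding with the final states of a character-level bi-LSTM. It is our rendering of the description above, not the authors' implementation; all names and dimensions are illustrative.

```python
# Illustrative sketch (not the authors' code): the word encoding
# w ∘ c_w, a pre-trained word embedding concatenated with the final
# states of a character-level bi-LSTM. Dimensions are invented.
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    def __init__(self, vocab_size, char_vocab_size,
                 word_dim=64, char_dim=100):
        super().__init__()
        # Would ideally be initialized from pre-trained vectors.
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_dim // 2,
                                 bidirectional=True, batch_first=True)

    def forward(self, word_id, char_ids):
        # char_ids: (1, n_chars) character indices of a single token.
        _, (h, _) = self.char_lstm(self.char_emb(char_ids))
        c_w = torch.cat([h[0], h[1]], dim=-1)  # fwd ∘ bwd final states
        w = self.word_emb(word_id)             # (1, word_dim)
        return torch.cat([w, c_w], dim=-1)     # w ∘ c_w, fed to the sentence bi-LSTM
```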
Through these approaches, multilingual PoS tagging has seen tangible gains from neural methods in recent years.

2.1 Lexical resources

We use linguistic resources that are user-generated and available for many languages. The first is WIKTIONARY, a word type dictionary that maps words to one of the 12 Universal PoS tags (Li et al., 2012; Petrov et al., 2012). The second resource is UNIMORPH, a morphological dictionary that provides inflectional paradigms for 350 languages (Kirov et al., 2016). For Wiktionary, we use the freely available dictionaries from Li et al. (2012). UniMorph covers between 8 and 38 morphological properties (for English and Finnish, respectively).¹ The sizes of the dictionaries vary considerably, from a few thousand entries (e.g., for Hindi and Bulgarian) to 2M entries (Finnish UniMorph). We study the impact of smaller dictionary sizes in Section 4.1.

¹More details: http://unimorph.org/

The tagger we analyze in this paper is an extension of the base tagger, called the distant supervision from disparate sources (DSDS) tagger (Plank and Agić, 2018). It is trained on projected data and further differs from the base tagger in its integration of lexicon information. In particular, given a lexicon src, DSDS uses $\vec{e}_{src}$ to embed the lexicon into an $l$-dimensional space, where $\vec{e}_{src}$ is the concatenation of all $m$ embedded properties of length $l$ (empirically set, see Section 2.2), and a zero vector for words not in the lexicon. A property here is a possible PoS tag (for Wiktionary) or a morphological feature (for UniMorph). To integrate the type-level supervision, the lexicon embedding vector is created and concatenated to the word- and character-level representations for every token: $\vec{w} \circ \vec{c}_w \circ \vec{e}$.
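A minimal sketch of this lexicon embedding, under our reading of the description above (not the authors' implementation): each of the $m$ lexicon properties gets an $l$-dimensional embedding, and a word's vector $\vec{e}$ concatenates the embeddings of the properties the lexicon lists for it, with zero sub-vectors for absent properties, so that out-of-lexicon words receive an all-zero vector. Class and variable names are our own.

```python
# Hedged sketch of the lexicon embedding e: one l-dimensional slot per
# lexicon property; slots of properties a word carries are filled with
# learned embeddings, the rest stay zero. Out-of-lexicon words (empty
# property set) thus map to the all-zero vector, as described above.
import torch
import torch.nn as nn

class LexiconEmbedder(nn.Module):
    def __init__(self, properties, l=40):
        super().__init__()
        self.index = {p: i for i, p in enumerate(properties)}  # m properties
        self.prop_emb = nn.Embedding(len(properties), l)
        self.l = l

    def forward(self, word_props):
        # word_props: set of properties (PoS tags / morphological
        # features) the lexicon lists for this word type.
        slots = torch.zeros(len(self.index), self.l)
        for p in word_props:
            i = self.index[p]
            slots[i] = self.prop_emb(torch.tensor(i))
        return slots.flatten()  # e, appended per token as w ∘ c_w ∘ e
```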
We compare DSDS to alternative ways of using lexical information. The first approach uses lexical information directly during decoding (Täckström et al., 2013). The second approach is more implicit and uses the lexicon to induce better word embeddings for tagger initialization. In particular, we use the dictionary for retrofitting off-the-shelf embeddings (Faruqui et al., 2015) and initialize the tagger with those. The latter is a novel approach which, to the best of our knowledge, has not yet been evaluated in the neural tagging literature. The idea is to bring the off-the-shelf embeddings closer to the PoS tagging task by retrofitting them with syntactic clusters derived from the lexicon.

We take a deeper look at the quality of the lexicons by comparing tag sets to the gold treebank data, inspired by Li et al. (2012). In particular, let $T$ be the dictionary derived from the gold treebank (development data), and $W$ the user-generated dictionary, i.e., the respective Wiktionary (as we are looking at PoS tags). For each word type, we compare the tag sets in $T$ and $W$ and distinguish six cases:

1. NONE: the word type is in the training data but not in the lexicon (out-of-lexicon);
2. EQUAL: $W = T$;
3. DISJOINT: $W \cap T = \emptyset$;
4. OVERLAP: $W \cap T \neq \emptyset$;
5. SUBSET: $W \subset T$;
6. SUPERSET: $W \supset T$.

In an ideal setup, the dictionaries contain no disjoint tag sets and large amounts of tag sets that are equal to or supersets of those in the treebank data. This is particularly desirable for approaches that take lexical information as type-level supervision.

2.2 Experimental setup

In this section we describe the baselines, the data and the tagger hyperparameters.

Data We use the 12 Universal PoS tags (Petrov et al., 2012). The set of languages is motivated by the accessibility of embeddings and dictionaries. We focus here on the 21 dev sets of Universal Dependencies 2.1 (Nivre et al., 2017); test set results are reported by Plank and Agić (2018), showing that DSDS provides a viable alternative.

Annotation projection To build the taggers for new languages, we resort to annotation projection following Plank and Agić (2018). In particular, they employ the approach by Agić et al. (2016), where labels are projected from multiple sources to multiple targets and then decoded through weighted majority voting with word alignment probabilities and source PoS tagger confidences. The wide-coverage Watchtower corpus (WTC) by Agić et al. (2016) is used, from which 5k instances are selected by alignment coverage, following Plank and Agić (2018).

Baselines We compare to the following alternatives: type-constraint Wiktionary supervision (Li et al., 2012) and retrofitting initialization.

Hyperparameters We empirically find it best not to update the word embeddings in this noisy training setup, as that results in better performance; see Section 4.4.

[Figure 1: Analysis of Wiktionary vs gold (dev set) tag sets. 'None': percentage of word types not covered in the lexicon. 'Disjoint': the gold data and Wiktionary do not agree on the tag sets. See Section 2.1 for details on the other categories.]

3 Results

Table 1 presents our replication results, i.e., tagging accuracy for the 21 individual languages, with means over all languages and over language families (for which at least two languages are available).

Table 1: Dev sets (UD 2.1), tagging accuracy.

LANGUAGE            5k     TCW    RETRO   DSDS
Bulgarian (bg)      89.8   89.9   87.1    91.0
Croatian (hr)       84.7   85.2   83.0    85.9
Czech (cs)          87.5   87.5   84.9    87.4
Danish (da)         89.8   89.3   88.2    90.1
Dutch (nl)          88.6   89.2   86.6    89.6
English (en)        86.4   87.6   82.5    87.3
Finnish (fi)        81.7   81.4   79.2    83.1
French (fr)         91.5   90.0   89.8    91.3
German (de)         85.8   87.1   84.7    87.5
Greek (el)          80.9   86.1   79.3    79.2
Hebrew (he)         75.8   75.9   71.7    76.8
Hindi (hi)          63.8   63.9   63.0    66.2
Hungarian (hu)      77.5   77.5   75.5    76.2
Italian (it)        92.2   91.8   90.0    93.7
Norwegian (no)      91.0   91.1   88.8    91.4
Persian (fa)        43.6   43.8   44.1    43.6
Polish (pl)         84.9   84.9   83.3    85.4
Portuguese (pt)     92.4   92.2   88.6    93.1
Romanian (ro)       84.2   84.2   80.2    86.0
Spanish (es)        90.7   88.9   88.9    91.7
Swedish (sv)        89.4   89.2   87.0    89.8
AVG (21)            83.4   83.6   81.3    84.1
GERMANIC (6)        88.5   88.9   86.3    89.3
ROMANCE (5)         90.8   90.1   88.4    91.4
SLAVIC (4)          86.7   86.8   84.6    87.4
INDO-IRANIAN (2)    53.7   53.8   53.5    54.9
URALIC (2)          79.6   79.4   79.2    79.6
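The six tag-set categories from Section 2.1, which Figure 1 analyzes, reduce to simple set comparisons. A minimal sketch follows, with one assumption of ours: we treat the categories as mutually exclusive by testing the more specific relations before OVERLAP (the paper's definition of OVERLAP, $W \cap T \neq \emptyset$, would otherwise also cover EQUAL, SUBSET and SUPERSET).

```python
# Sketch of the six-way tag-set comparison from Section 2.1. We assume
# the categories are mutually exclusive, so the more specific relations
# are tested before OVERLAP; W and T are the tag sets of one word type.
def compare_tag_sets(W, T):
    if not W:
        return "NONE"      # word type absent from the lexicon
    if W == T:
        return "EQUAL"     # W = T
    if not (W & T):
        return "DISJOINT"  # W ∩ T = ∅
    if W < T:
        return "SUBSET"    # W ⊂ T
    if W > T:
        return "SUPERSET"  # W ⊃ T
    return "OVERLAP"       # W ∩ T ≠ ∅, but none of the above

# Example: Wiktionary lists {NOUN}, the treebank lists {NOUN, VERB}.
assert compare_tag_sets({"NOUN"}, {"NOUN", "VERB"}) == "SUBSET"
assert compare_tag_sets({"ADJ"}, {"NOUN"}) == "DISJOINT"
```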