A Stronger Baseline for Multilingual Word Embeddings

Philipp Dufter, Hinrich Schütze
Center for Information and Language Processing (CIS), LMU Munich, Germany
[email protected]

Abstract

Levy, Søgaard and Goldberg's (2017) S-ID (sentence ID) method applies to tuples containing a sentence ID and a word from the sentence. It has been shown to be a strong baseline for learning multilingual embeddings. Inspired by recent work on concept based embedding learning we propose SC-ID, an extension to S-ID: given a sentence aligned corpus, we use sampling to extract concepts that are then processed in the same manner as S-IDs. We perform experiments on the Parallel Bible Corpus across 1000+ languages and show that SC-ID yields up to 6% performance increase in a word translation task. In addition, we provide evidence that SC-ID is easily and widely applicable by reporting competitive results across 8 tasks on a EuroParl based corpus.

1 Introduction

Multilingual embeddings are useful because they provide meaning representations of source and target in the same space and because they are a basis for machine translation and transfer learning (Zeman and Resnik, 2008; McDonald et al., 2011; Klementiev et al., 2012; Hermann and Blunsom, 2014b; Tsvetkov et al., 2014; Guo et al., 2016). In contrast to prior multilingual work, automatically learned embeddings perform well, but are potentially more efficient and easier to use. Thus, multilingual embedding learning is important for NLP.

The quality of multilingual embeddings is driven by the underlying feature set more than by the type of algorithm used for training the embeddings (Upadhyay et al., 2016; Ruder et al., 2017). Most embedding learners build on context information: Levy et al. (2017) showed that using sentence IDs as feature yields a strong baseline. Dufter et al. (2018) recently showed that concept information can be effective for multilingual embedding learning as well.

Here, we propose the method SC-ID. SC-ID combines the concept identification method "anymalign" (Lardilleux and Lepage, 2009) and S-ID (Levy et al., 2017) into an embedding learning method that is based on both concept (C-ID) and context (S-ID). We show below that SC-ID is effective. On a massively parallel corpus, the Parallel Bible Corpus (PBC) covering 1000+ languages, SC-ID outperforms S-ID in a word translation task. In addition, we show that SC-ID outperforms S-ID on EuroParl, a corpus with characteristics quite different from PBC. For both corpora there are embedding learners that perform better. However, SC-ID is the only one that scales easily across the 1000+ languages and at the same time exhibits a stable performance across datasets.

In summary, we make the following contributions in this paper: i) We demonstrate that using concept IDs equivalently to (Levy et al., 2017)'s sentence IDs works well. ii) We show that combining C-IDs and S-IDs yields higher quality embeddings than either by itself. iii) In extensive experiments investigating hyperparameters we find that, despite the large number of languages, lower dimensional spaces work better. iv) We demonstrate that our method works on datasets with different characteristics and yields very competitive performance on a EuroParl based corpus.

2 Methods

Throughout this section we describe how we identify concepts and then write out text corpora that are used as input to the embedding learning algorithm.

2.1 Concept Induction

Lardilleux and Lepage (2009) propose a word alignment algorithm which we use for concept induction. They argue that hapax legomena (i.e., words which occur exactly once in a corpus) are easy to align in a sentence-aligned corpus. If hapax legomena across multiple languages occur in the same sentence, their meanings are likely the same. Similarly, words across multiple languages that occur more than once, but strictly in the same sentences, can be considered translations. We call words that occur strictly in the same sentences perfectly aligned. Further we define a concept as a set of words that are perfectly aligned.

By this definition one expects the number of identified concepts to be low. Coverage can be increased by not only considering the original parallel corpus, but by sampling subcorpora from the parallel corpus. As the number of sentences is smaller in each sample and there is a high number of sampled subcorpora, the number of perfect alignments is much higher. In addition to words, word ngrams are also considered to increase coverage. The complement of an ngram (i.e., the sentence in one language without the perfectly aligned words) is also treated as a perfect alignment, as their meaning can be assumed to be equivalent as well.

For example, if for a particular subsample the English ngram "mount of olives" occurs exactly in the same sentences as "montagne des oliviers", this gives rise to a concept, even if "olives" or "mountain" might not be perfectly aligned in this particular subsample.

Figure 1 shows Lardilleux and Lepage (2009)'s anymalign algorithm. Given a sentence aligned parallel corpus, "alingual" sentences are created by concatenating each sentence across all languages. We then consider the set of all alingual sentences V, which Lardilleux and Lepage (2009) call an alingual corpus. The core of the algorithm iterates the following loop: (i) draw a random sample of alingual sentences V′ ⊂ V; (ii) extract perfect alignments in V′. The perfect alignments are then added to the set of concepts.

Algorithm 1 Anymalign
1: procedure GETCONCEPTS(V, MINL, MAXN, T)
2:     C = ∅
3:     while runtime ≤ T do
4:         V′ = get-subsample(V)
5:         A = get-concepts(V′)
6:         A = filter-concepts(A, MINL, MAXN)
7:         C = C ∪ A
8:     end while
9: end procedure

Figure 1: V is an alingual corpus. get-subsample creates a subcorpus by randomly selecting lines from the alingual corpus. get-concepts extracts words and word ngrams that are perfectly aligned. filter-concepts imposes the constraint that ngrams have a specified length and concepts cover enough languages.

Anymalign's hyperparameters include the minimum number of languages a perfect alignment should cover (MINL) and the maximum ngram length (MAXN). The size of a subsample is adjusted automatically to maximize the probability that each sentence is sampled at least once. Obviously this probability depends on the number of samples drawn and thus on the runtime of the algorithm. Thus the runtime (T) is another hyperparameter. Note that T only affects the number of distinct concepts, not the quality of an individual concept. For details see (Lardilleux and Lepage, 2009).

Note that most members of a concept are neither hapax legomena nor perfect alignments in V. A word can be part of multiple concepts within V′ and obviously also within V (it can be found in multiple iterations).

One can interpret the concept identification as a form of data augmentation. From an existing parallel corpus new parallel "sentences" are extracted by considering perfectly aligned subsequences.
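The sampling loop of Figure 1 fits in a few lines of Python. The following sketch is an illustration only, not the implementation used in the experiments (the experiments use the anymalign tool): it extracts unigram concepts only (MAXN is omitted) and uses a fixed 10% subsample size, whereas anymalign adjusts the subsample size automatically.

    import random
    import time
    from collections import defaultdict

    def perfectly_aligned(subcorpus):
        # Map each word to the set of alingual lines it occurs in.
        occurrences = defaultdict(set)
        for i, line in enumerate(subcorpus):
            for word in line.split():
                occurrences[word].add(i)
        # Words with identical occurrence sets are perfectly aligned;
        # each such group of words is a concept candidate.
        groups = defaultdict(set)
        for word, lines in occurrences.items():
            groups[frozenset(lines)].add(word)
        return groups.values()

    def get_concepts(V, minl, t_seconds):
        # V: list of alingual lines; tokens carry edition prefixes
        # such as 'enge:Jerusalem' (cf. Figure 2).
        C = set()
        deadline = time.time() + t_seconds
        while time.time() < deadline:
            subsample = random.sample(V, max(1, len(V) // 10))
            for words in perfectly_aligned(subsample):
                editions = {w.split(":", 1)[0] for w in words}
                if len(editions) >= minl:  # filter-concepts: enough editions
                    C.add(frozenset(words))
        return C

Repeated sampling is what makes this productive: a word pair that is not perfectly aligned in the full corpus often is perfectly aligned in a small subsample.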
2.2 Corpus Creation

Method S-ID. We adopt Levy and Goldberg (2014)'s framework; it formalizes the basic information that is passed to the embedding learner as a set of pairs. In the monolingual case each pair consists of two words that occur in the same context. A successful approach to multilingual embedding learning for parallel corpora is to use a corpus of pairs (one per line) of a word and a sentence ID (Levy et al., 2017). We refer to this method as S-ID. Note that we use S-ID for the sentence identifier, for the corpus creation method that is based on these identifiers, for the embedding learning method based on such corpora and for the embeddings produced by the method. The same applies to C-ID. Which sense is meant should be clear from context.

Method C-ID. We use the same method for writing a corpus using our identified concepts and call this method C-ID. Figure 2 gives examples of the generated corpora that are passed to the embedding learner.

    [...]
    48001018 enge:fifteen
    48001018 enge:,
    48001018 enge:years
    48001018 enge:after
    48001018 enge:Jerusalem
    48001018 enge:three
    [...]

    [...]
    C:911 kqc0:Jerusalem
    C:911 por5:Jerusalém
    C:911 eng7:Jerusalem
    C:911 haw0:Ierusalema
    C:911 ilb0:Jelusalemu
    C:911 fra1:Jérusalem
    [...]

    45016016 Salute one another with an holy kiss . The churches of Christ salute you .
    48001018 Then after three years I went up to Jerusalem to see Peter , and abode with him fifteen days .
    [...]

Figure 2: Samples of S-ID (top) and C-ID (middle) corpora that are input to word2vec. Each word is prefixed by a 3 character ISO 639-3 language identifier followed by an alphanumeric character to distinguish multiple editions in the same language (e.g., enge = KJV). Bottom: KJV text sample. The text is whitespace tokenized.

Method SC-ID. We combine S-IDs and C-IDs by simply concatenating their corpora before learning embeddings. However, we apply different frequency thresholds when learning embeddings from this corpus: for S-ID we investigate a frequency threshold on a development set whereas for C-ID we always set it to 1. As argued before, each word in a concept carries a strong multilingual signal, which is why we do not apply any frequency filtering here. In the implementation we simply delete words in the S-ID part of the corpus with frequency lower than our threshold and set the minimum count parameter in word2vec during embedding learning to 1. Note that this is in line with how word2vec applies its minimum count threshold: words below the frequency threshold are simply ignored.
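Writing these corpora is mechanical. The following sketch is illustrative only (helper names are ours): verses is assumed to be a list of (verse ID, whitespace-tokenized text with edition-prefixed words) pairs, and concept IDs are simply enumeration indices in the style of Figure 2.

    from collections import Counter

    def sid_lines(verses, min_count):
        # S-ID corpus: one '<verse-id> <word>' pair per line (Figure 2, top).
        # The S-ID frequency threshold is applied here, at corpus-writing
        # time, so that word2vec can later run with minimum count 1.
        freq = Counter(w for _, text in verses for w in text.split())
        for verse_id, text in verses:
            for word in text.split():
                if freq[word] >= min_count:
                    yield f"{verse_id} {word}"

    def cid_lines(concepts):
        # C-ID corpus: one 'C:<concept-id> <word>' pair per line
        # (Figure 2, middle). No frequency filtering is applied.
        for cid, words in enumerate(concepts):
            for word in sorted(words):
                yield f"C:{cid} {word}"

    def write_scid(path, verses, concepts, min_count=2):
        # SC-ID: plain concatenation of the S-ID and C-ID corpora.
        with open(path, "w", encoding="utf-8") as out:
            for line in sid_lines(verses, min_count):
                out.write(line + "\n")
            for line in cid_lines(concepts):
                out.write(line + "\n")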
2.3 Embedding learning

We use word2vec skipgram [1] (Mikolov et al., 2013a) with default hyperparameters, except for three: number of iterations (ITER), minimum frequency of a word (MINC) and embedding dimension (DIM). For details see Table 2.

[1] We use code.google.com/archive/p/word2vec
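The experiments use the original word2vec tool (footnote [1]). As an illustration only, an equivalent skipgram run over an SC-ID corpus with the hyperparameters selected in Section 6.1 (ITER = 100, DIM = 100; MINC is applied at corpus-writing time, cf. Section 2.2) could look as follows in gensim; treating each corpus line as a two-token sentence, and the window value, are assumptions of this sketch.

    from gensim.models import Word2Vec

    # Each corpus line is a two-token "sentence": an ID and a word,
    # e.g. ["48001018", "enge:Jerusalem"] or ["C:911", "fra1:Jérusalem"].
    with open("scid.txt", encoding="utf-8") as f:
        pairs = [line.split() for line in f]

    model = Word2Vec(
        pairs,
        sg=1,             # skipgram
        vector_size=100,  # DIM: lower-dimensional spaces worked better
        min_count=1,      # MINC: frequency filtering done at corpus writing
        epochs=100,       # ITER
        window=2,         # ID and word always fall within one window
    )
    model.wv.save_word2vec_format("embeddings.txt")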

3 Data

We work on PBC, the Parallel Bible Corpus (Mayer and Cysouw, 2014), a verse-aligned corpus of 1000+ translations of the New Testament. For the sake of comparability we use the same 1664 Bible editions across 1259 languages (distinct ISO 639-3 codes) and the same 6458 training verses as in (Dufter et al., 2018). [2] We follow their terminology and refer to "translations" as "editions". PBC is a good model for resource-poverty; e.g., the training set of KJV contains fewer than 150,000 tokens in 6458 verses. KJV spans a vocabulary of 6162 words while all 32 English editions together cover 23772 unique words. See Table 1 for an example verse. We use the tokenization provided in the data, which is erroneous for some hard-to-tokenize languages (e.g., Khmer, Japanese), and do not apply additional preprocessing.

[2] See http://cistern.cis.lmu.de/comult

English King James Version (KJV): For the earth bringeth forth fruit of herself ; first the blade , then the ear , after that the full corn in the ear
German Elberfelder 1905: Die Erde bringt von selbst Frucht hervor , zuerst Gras , dann eine Ähre , dann vollen Weizen in der Ähre
Spanish Americas: La tierra produce fruto por sí misma ; primero la hoja , luego la espiga , y después el grano maduro en la espiga

Table 1: Instances of verse 41004028 from the New Testament. For details on the verse ID, see (Mayer and Cysouw, 2014).

4 Evaluation

Dufter et al. (2018) introduce roundtrip translation as a multilingual embedding evaluation for languages for which no gold standard is available. We use this method for evaluating our methods on PBC. A query word w in language L1 is translated to its closest (with respect to embedding similarity) neighbor v in L2 and then backtranslated to its closest neighbor w′ in L1. Roundtrip translation is successful if w = w′. As a measure of similarity we use cosine similarity. The above roundtrip is extended by considering kI intermediate neighbors and, for each intermediate neighbor, kT predictions in the query language.

    q     ⇒ Ie(q)  ⇒ Te(q)
    woman ⇒ mujer  ⇒ wife woman women widows daughters daughter marry married
          ⇒ esposa ⇒ marry wife woman married marriage virgin daughters bridegroom

Figure 3: Example for a roundtrip translation with the query "woman". Intermediates and predictions are ordered by cosine similarity. Intermediate edition: Spanish Americas. Figure taken from (Dufter et al., 2018).

The predictions are then compared to a groundtruth set Gw. There is a strict and a relaxed groundtruth. The former only contains the query word w whereas the latter contains all words with the same lemma and part-of-speech as w. This accommodates the fact that for a single query multiple translations can be considered correct (e.g., an inflected form of a query). We average the binary results (per query and intermediate edition) over editions and report the mean and median over queries. Inspired by the precision@k evaluation measure for word translation we vary (kI, kT) as follows: "S1" (1, 1), "R1" (1, 1), "S4" (2, 2), and "S16" (2, 8), where S stands for using the strict groundtruth, R for using the relaxed groundtruth.

If w is not contained in the embedding space, then the predictions are counted as incorrect. The number of queries contained in the embedding space is denoted by N. We use the same queries as in (Dufter et al., 2018), which are based on (Swadesh, 1946)'s 100 universal words. As we perform hyperparameter selection we introduce a development set of queries: an earlier list by Swadesh that contains 215 words. [3] After excluding the queries from the test set, there are 151 queries in the development set. Due to this large number of queries we do not compute the relaxed measure as this requires manual creation of the groundtruth. We work on the level of Bible editions, i.e., two editions in the same language are considered different "languages". We use KJV as the query edition if KJV contains the query word. Else we randomly choose another English edition.

[3] See .clld.org/contributions/Swadesh-1950-215
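The roundtrip check itself is a small computation. The following sketch is illustrative only: it assumes embeddings in a word-to-vector dict, with l1_words and l2_words listing the vocabularies of the query and intermediate editions.

    import numpy as np

    def nearest(vec, mat, words, k):
        # k nearest neighbors of vec among the rows of mat (cosine similarity).
        sims = (mat @ vec) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(vec) + 1e-12)
        return [words[i] for i in np.argsort(-sims)[:k]]

    def roundtrip_hit(w, emb, l1_words, l2_words, k_i, k_t):
        # Strict roundtrip: translate w to k_i intermediates in L2, then
        # back-translate each to k_t predictions in L1; success if w reappears.
        if w not in emb:
            return False  # out-of-vocabulary queries count as incorrect
        m1 = np.array([emb[x] for x in l1_words])
        m2 = np.array([emb[x] for x in l2_words])
        for v in nearest(emb[w], m2, l2_words, k_i):
            if w in nearest(emb[v], m1, l1_words, k_t):
                return True
        return False

With (k_i, k_t) set to (1, 1), (2, 2) and (2, 8) this yields the S1, S4 and S16 settings; R1 differs from S1 only in the groundtruth used.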
5 Baselines

We compute a diverse set of baselines to pinpoint reasons for performance changes as much as possible.

5.1 Context Based Embedding Space

We use (Levy et al., 2017)'s method and call it S-ID.

5.2 Transformation Based Embedding Space

The baseline LINEAR follows (Duong et al., 2017). We pick one edition as the "embedding space defining edition"; in our case we pick English KJV. We then create 1664 bilingual embedding spaces using S-ID; in each case the two editions covered are KJV and one of the other 1664 languages. We then use (Mikolov et al., 2013b)'s linear transformation method to map all embedding spaces to the KJV embedding space. More specifically, let X ∈ R^(nX×d), Y ∈ R^(nY×d) be two embedding spaces, with nX, nY the number of words, and d the embedding dimension. We then select nT transformation words that are contained in both embedding spaces. This gives us X′ ∈ R^(nT×d), Y′ ∈ R^(nT×d). The transformation matrix W ∈ R^(d×d) is then given by

    argmin_W ‖X′W − Y′‖_F

where ‖·‖_F denotes the Frobenius norm. The closed form solution for the transformation is given by

    W* = X′⁺ Y′

where X′⁺ is the Moore-Penrose pseudoinverse (Penrose, 1956). In our case X and Y are bilingual embedding spaces where both contain the vocabulary of the English KJV edition.
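In numpy, this closed-form solution is a one-liner; X_t and Y_t below are our names for the nT × d matrices X′ and Y′ restricted to the shared transformation words.

    import numpy as np

    def fit_linear_map(X_t, Y_t):
        # Least-squares solution W* = pinv(X') @ Y' of min_W ||X'W - Y'||_F.
        return np.linalg.pinv(X_t) @ Y_t

    # Mapping a whole bilingual space X into the target (KJV) space:
    # X_mapped = X @ fit_linear_map(X_t, Y_t)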

5.3 Bilingual Embedding Spaces

In BILING we use the same bilingual embedding spaces as in LINEAR. However, we perform roundtrip translation in each embedding space separately, i.e., there is no transformation to a common space. This baseline allows us to assess the effect of a massively multilingual space compared to having only bilingual embedding spaces.

5.4 Unsupervised Embedding Learning

We apply the recent unsupervised embedding learning method by Lample et al. (2018) and call it MUSE. Given unaligned corpora in two languages, MUSE learns two separate embedding spaces that are subsequently unified by a linear transformation. This transformation is learned using a discriminator neural network that tries to identify the original language of a word vector. Subsequently a refinement step based on the Procrustes algorithm is performed. Note that this method only yields bilingual embedding spaces. As in BILING, we perform roundtrip translation in each embedding space separately.

Very recently, Chen and Cardie (2018) extended MUSE multilingually. We will include this in future work.

5.5 Non-Embedding-Space-Based Baseline

To show that the embedding space provides some advantages over just using the concepts as is, we introduce C-SIMPLE, a non-embedding baseline that follows the idea of roundtrip translation. Given a query word and an intermediate edition we consider all words that share a concept ID with the query word as possible intermediate words. We then choose intermediate words randomly (with probability weights according to their relative frequency across concepts). For the back translation we apply the same procedure. This baseline is inspired by (Dufter et al., 2018)'s RTSIMPLE baseline.
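One translation step of C-SIMPLE can be sketched as follows; reading "relative frequency across concepts" as the number of concepts a candidate shares with the query is an interpretation of this sketch, not necessarily the exact weighting used. The back translation applies the same function in the opposite direction.

    import random
    from collections import Counter

    def c_simple_candidates(query, edition, concepts, k=1):
        # Candidates: words of the target edition that share a concept with
        # the query, weighted by how many concepts they share with it.
        weights = Counter()
        for concept in concepts:
            if query in concept:
                for w in concept:
                    if w.startswith(edition + ":"):
                        weights[w] += 1
        if not weights:
            return []
        words, counts = zip(*weights.items())
        return random.choices(words, weights=counts, k=k)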
6 Results

6.1 Hyperparameters

Word2vec. Since a grid search for optimal values for the parameters ITER, MINC and DIM would take too long, we search greedily instead. More iterations yield better performance. ITER = 100 is a good efficiency-performance trade-off. For MINC (minimum frequency of a word) the best performance is found using 2. This is mainly due to increased coverage. Surprisingly, smaller embedding dimensions work better to some degree. A highly multilingual embedding space is expected to suffer more from ambiguity and that is an argument for higher dimensionality; cf. Li and Jurafsky (2015). But this effect seems to be counteracted by the low-resource properties of PBC, for which the increased number of parameters of higher dimensionalities cannot be estimated reliably. We choose embedding size 100.

C-ID. We use the tuned word2vec settings to investigate the hyperparameters for C-ID. As mentioned we set MINC to 1 whenever we learn on a concept based corpus. MINL is the minimum number of editions. We find the best performance when setting MINL to 100. As expected coverage worsens when choosing higher values for MINL. MAXN is the maximum ngram length. We see the best performance when setting MAXN to 3. This seems intuitive, as this allows for – at least in European languages – common word compounds. T is the time in hours, i.e., for how long to sample (which inherently steers the size and number of sampled subcorpora). For T, 10 and 15 differ only slightly. We set T to 15 to get slightly higher coverage. See Table 3 for basic statistics about the identified concepts.

                              S1        S4        S16
          ITER  MINC  DIM   µ   Md    µ   Md    µ   Md     N
   1 S-ID  100     5  200  29   21   43   46   56   78   103
   2 S-ID    5             14   11   25   22   41   45   103
   3 S-ID   10             25   16   38   34   51   60   103
   4 S-ID   25             27   20   41   43   53   69   103
   5 S-ID   50             27   16   40   40   53   67   103
   6 S-ID  150             29   21   43   47   56   79   103
   7 S-ID          2       35   31   52   60   66   90   130
   8 S-ID         10       24    5   36   17   48   63    85
   9 S-ID             100  30   24   45   48   58   83   103
  10 S-ID             300  28   19   42   41   54   69   103

                              S1        S4        S16
          MINL  MAXN   T    µ   Md    µ   Md    µ   Md     N
   1 C-ID  100     3   10  30   26   46   46   60   71   120
   2 C-ID   50             26   21   39   42   56   79   104
   3 C-ID  150             22   14   34   28   47   56    94
   4 C-ID  500             15    0   25    0   33    0    59
   5 C-ID          1       17    0   27    0   38    0    73
   6 C-ID          5       28   20   40   35   55   58   122
   7 C-ID              5   24   20   36   34   48   52   114
   8 C-ID             15   30   26   46   44   61   76   121

Table 2: Hyperparameter optimization for word2vec (top) and anymalign (bottom). Initial parameters in first row; empty cell = unchanged parameter. Selected hyperparameter values: ITER = 100, MINC = 2, DIM = 100 (word2vec) and MINL = 100, MAXN = 3, T = 15 (anymalign).

              Mean  Median  Std.   Min    Max
  #editions    250     194   160   101   1530
  #tokens      259     198   172   102   2163

Table 3: Descriptive statistics of concept size for our selected hyperparameters. On average a concept contains 259 tokens across 250 editions. The total number of concepts is 119,040. The largest concept contains 2163 tokens across 1530 editions, describing the 2-gram "Simon Peter".

6.2 Roundtrip Translation Results

Table 4 presents roundtrip translation results on the test set.

                   S1        R1        S4        S16
                 µ   Md    µ   Md    µ   Md    µ   Md     N
  1 C-SIMPLE    35   33   35   34   49   54   56   56    67
  2 S-ID        48   47   53   59   65   72   83   93    69
  3 C-ID        43   42   46   43   58   60   79   91    67
  4 SC-ID       51   50   56   65   69   80   86   96    69
  5 LINEAR      35   32   36   34   47   51   65   72    69
  6 BILING†     42   34   44   41   55   54   70   85    69
  7 MUSE†       31   35   31   35   52   57   78   86    69
  8 N(t)*       54   59   61   69   80   87   94  100    69
  9 SAMPLE*     33   23   43   42   54   59   82   96    65

Table 4: Roundtrip evaluation of multilingual embedding learning methods. Results marked with * are taken from (Dufter et al., 2018). Results marked with † are bilingual embedding spaces.

C-SIMPLE performs reasonably well, but is outperformed by almost all embedding spaces. This indicates that learning embeddings augments the information stored in concepts. LINEAR performs similar to C-SIMPLE (but outperforms it for S16). This supports the hypothesis that PBC offers too little data for training mono-/bilingual embeddings. As expected BILING works better than LINEAR. However, keep in mind that this is not a universal embedding space and has fewer constraints than the other embedding spaces. Thus it is not directly comparable. MUSE works surprisingly well given that this method does not exploit the sentence alignment information in the corpus. While this method is ranked last in S1 and R1 (µ), it gains traction on S4 and S16, where it ranks sixth in S16. However, the numbers are not directly comparable as no multilingual embedding space is created by MUSE (only 1664 bilingual ones). Thus there are fewer constraints overall. The performance drop between BILING and LINEAR can be considered a rough proxy for how much we would expect MUSE to worsen if considering a true multilingual embedding space. In future work we want to investigate this effect using (Chen and Cardie, 2018). Note that MUSE is quite inefficient: it takes around 1 hour to train a bilingual embedding space for a language pair on a standard GPU.

N(t) by Dufter et al. (2018) shows consistently the best performance. However, we will see later that this method only works in a massively multilingual setting and fails on EuroParl. SAMPLE by Dufter et al. (2018) is based on sampling like C-ID. However, it induces concepts only on a small set of pivot languages, not on all 1664 editions. This does not only work worse (C-ID beats SAMPLE except for S16), but it also requires additional word alignment information and is thus computationally more expensive.

As has been observed frequently, S-ID is highly effective. S-ID ranks consistently third. Representing a word as its binary vector of verse occurrence provides a clear signal to the embedding learner. Concept-based methods can be considered complementary to this feature set because they can exploit aggregated information across languages as well as across verses. C-ID alone has slightly lower performance. SC-ID, the combination, outperforms C-ID and S-ID and seems to be the "best of both worlds". It yields consistently the best performance with 3% to 6% relative performance increase (for µ) compared to S-ID alone and always ranks in second place. In short: N(t) is the best method on this corpus, followed by SC-ID and subsequently S-ID.
Repre- tion framework covering one extrinsic (their de- senting a word as its binary vector of verse oc- pendency models are not available for currence provides a clear signal to the embedding learner. Concept-based methods can be consid- 4See http://128.2.220.95/multilingual/data/ Document WordSimilarity QVEC QVEC-CCA WordTranslation Classification monol. multil. monol. multil. monol. multil.

                 Doc. Class.    Word Similarity                QVEC                          QVEC-CCA                      Word Transl.
                                monol.         multil.         monol.         multil.        monol.         multil.
  multiCluster*  92.11 (48.16)  38.07 (57.57)  57.45 (73.89)  10.36 (98.62)   9.32 (82.01)  62.46 (98.62)  43.34 (82.01)  43.94 (45.22)
  multiCCA*      92.18 (62.81)  43.09 (71.09)  69.99 (77.94)  10.75 (99.09)   8.78 (87.03)  63.50 (99.09)  41.52 (87.03)  36.21 (54.87)
  multiSkip*     90.46 (45.73)  33.94 (55.41)  60.24 (67.55)   8.41 (98.09)   7.21 (75.69)  58.98 (98.09)  36.34 (75.69)  46.71 (39.55)
  INVARIANCE*    91.10 (31.35)  51.08 (23.06)  59.13 (62.50)   8.11 (91.71)   5.38 (74.78)  65.83 (91.71)  46.21 (74.78)  63.91 (30.36)
  S-ID           86.47 (38.89)  47.12 (36.04)  53.26 (56.46)  12.59 (94.19)  11.82 (71.00)  34.63 (94.19)  25.86 (71.00)  61.64 (33.89)
  SC-ID          86.99 (46.43)  45.71 (46.41)  53.94 (66.27)  12.36 (97.05)  11.07 (75.34)  34.67 (97.05)  26.04 (75.34)  64.99 (38.72)

Table 5: RIGHT: Word translation results on the test dataset. Methods marked with * are from (Ammar et al., 2016). LEFT: Results on several tasks on the test set. We downloaded their embedding spaces and performed the evaluation using their code. While almost all results are reproduced (except for rounding) we get slightly different results in the multilingual word similarity task and for multilingual QVEC (multiSkip only). Coverage in percent is given in parentheses.

7.3 Hyperparameters

We optimize corpus specific hyperparameters (e.g., the number of languages a concept should contain) on the development set of the word translation task.

SC-ID: we vary MINC and the minimum number of languages that need to be covered by the concept identification algorithm. For S-ID we only vary MINC. N(t) requires pivot languages. Following Dufter et al. (2018) we choose as pivot languages those with the lowest type-token ratio (these are Greek, Danish, Spanish, French, Italian, English) and vary the number between 2, 4 and 6.

  SC-ID                          S-ID                    N(t)
  MINL  MINC  Word Trans.        MINC  Word Trans.      NPIV  Word Trans.
  6     5     63.86 (37.51)      5     57.14 (46.89)    4     42.86 (0.65)
  3           63.03 (37.42)      2     54.46 (48.47)    2     42.31 (2.41)
  9           61.88 (37.51)      10    62.02 (42.86)    6     33.33 (0.28)
  12          60.15 (37.51)
        1     59.90 (37.51)
        2     57.66 (31.29)
        10    60.97 (36.40)

Table 6: Hyperparameter selection on EuroParl. Initial parameter in first row; empty cell = unchanged parameter. Selected hyperparameter values: MINL = 6, MINC = 5 (SC-ID), MINC = 10 (S-ID), NPIV = 4 (N(t)). Numbers in parentheses indicate the coverage.

Table 6 gives an overview of our hyperparameter selection. For SC-ID, we choose MINL 6 and MINC 5 (compared to 2 in PBC). Given that this corpus provides much more information per language it seems intuitive that MINC should be higher (again note that for the C-ID part we always use MINC 1). This is confirmed for S-ID where we find the best performance with MINC 10. For N(t) the best result is obtained when using 4 pivot languages. N(t) exhibits an exceptionally low coverage, which obviously varies with the number of pivot languages: more pivot languages make it less likely that N(t) will find two words that have exactly the same neighbors in the dictionary graph. See (Dufter et al., 2018) for details on N(t). Even for 2 pivot languages the coverage is too low to be useful for any application. We computed the coverage on the test set, which is only 0.56 when using 4 pivot languages. Still the precision is quite good. This indicates that N(t) is an effective method, but it is only applicable to a corpus covering a large number of languages. Thus we do not report results on the test set.

7.4 Results

Table 5 (RIGHT) gives results on the word translation test set. SC-ID performs best, followed by INVARIANCE (Huang et al., 2015). While INVARIANCE is theoretically applicable to PBC, there are strong indications why this is not possible in practice.
INVARIANCE considers the full cooccurrence matrix across all languages, a matrix that would be in the size of terabytes. In addition, word alignment matrices would need to be stored. The methods proposed by Ammar et al. (2016) yield reasonable, but clearly lower performance.

Table 5 (LEFT) gives results for the remaining tasks. Immediately it becomes clear that different embedding spaces have different strengths and weaknesses. The results for SC-ID are competitive throughout all tasks and there is no method which consistently outperforms SC-ID. SC-ID beats S-ID in 4 out of 7 tasks and in 3 out of 4 multilingual tasks. In addition SC-ID provides a significantly higher coverage throughout all tasks.

Overall SC-ID outperforms S-ID clearly on PBC and on word translation on EuroParl. On other tasks it outperforms S-ID in three out of four multilingual tasks and else has a competitive performance. None of the many baselines in this paper consistently outperforms SC-ID on both datasets. Thus, SC-ID is a simple baseline, yet a stronger one than S-ID.

8 Related Work

Much research has been dedicated to identifying multilingual concepts. BabelNet (Navigli and Ponzetto, 2012) leverages existing resources (mostly manual annotations), including Wikipedia. While BabelNet could be directly used to learn concept based embeddings, it covers only 284 languages and thus cannot be applied to all PBC languages. Other work induces concepts within a graph (Ammar et al., 2016; Dufter et al., 2018) or with alignment algorithms (Östling, 2014).

We now review multilingual embedding learning methods for parallel corpora. We have clustered them into three groups.

Group 1 consists of using monolingual algorithms for creating monolingual embedding spaces. Subsequently they are projected into a unified space using a (linear) transformation. We use (Mikolov et al., 2013b) together with (Duong et al., 2017) in our baseline LINEAR. Zou et al. (2013), Xiao and Guo (2014) and Faruqui and Dyer (2014) use similar approaches (e.g., by computing the transformation using CCA). Recently it has been shown that computing the transformation using discriminator neural networks works well, even in a completely unsupervised setting. See, e.g., (Vulić and Moens, 2012; Ammar et al., 2016; Lample et al., 2018; Chen and Cardie, 2018). We used (Lample et al., 2018) as a baseline.

Group 2 is true multilingual embedding learning: it integrates multilingual information in the objective of embedding learning. Klementiev et al. (2012) and Gouws et al. (2015) add a word alignment based term. Luong et al. (2015) introduce BiSkip as a bilingual extension of word2vec. For n editions, including O(n²) bilingual terms does not scale. A slightly different objective function expresses that representations of aligned sentences should be similar. Approaches based on neural networks are (Hermann and Blunsom, 2014a) (BiCVM), (Sarath Chandar et al., 2014) (autoencoders) and (Soyer et al., 2014).

Group 3 creates multilingual corpora and uses monolingual embedding learners. A successful approach is (Levy et al., 2017)'s sentence ID (S-ID). Vulić and Moens (2015) create pseudo-corpora by merging words from multiple languages into a single corpus. Dufter et al. (2018) found this method to perform poorly on PBC. Søgaard et al. (2015) learn a space by factorizing an interlingual matrix based on Wikipedia concepts. word2vec is roughly equivalent to matrix factorization (Levy and Goldberg, 2014), so this work fits this category.
9 Summary

S-ID, the method for multilingual embedding learning proposed by Levy et al. (2017), is a surprisingly strong baseline. It exploits sentential context of words. In this paper, we showed that what we call C-ID – concepts learned by Lardilleux and Lepage (2009)'s method – is a competitive alternative. Concepts exploit the conceptual content of words. We demonstrated that SC-ID, the combination of concept information and context information, performs better by a large margin than each by itself. We provided experimental evidence for this conclusion on two datasets with different characteristics: one low-resource and highly multilingual (PBC) and one high-resource and mildly multilingual (EuroParl).

Acknowledgements. We gratefully acknowledge funding for this work by the European Research Council (ERC #740516) and by Zentrum Digitalisierung.Bayern (ZD.B), the digital technology initiative of the State of Bavaria.

References

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.

Xilun Chen and Claire Cardie. 2018. Unsupervised multilingual word embeddings. arXiv preprint arXiv:1808.08933.

Philipp Dufter, Mengjie Zhao, Martin Schmitt, Alexander Fraser, and Hinrich Schütze. 2018. Embedding learning through multilingual concept induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2017. Multilingual training of crosslingual word embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics.

Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast bilingual distributed representations without word alignments. In Proceedings of the 32nd International Conference on Machine Learning.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2016. A representation learning framework for multi-source transfer parsing. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.

Karl Moritz Hermann and Phil Blunsom. 2014a. Multilingual distributed representations without word alignment. In Proceedings of the 2014 International Conference on Learning Representations.

Karl Moritz Hermann and Phil Blunsom. 2014b. Multilingual models for compositional distributed semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Kejun Huang, Matt Gardner, Evangelos Papalexakis, Christos Faloutsos, Nikos Sidiropoulos, Tom Mitchell, Partha P. Talukdar, and Xiao Fu. 2015. Translation invariant word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1084–1088.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of the 24th International Conference on Computational Linguistics.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In Proceedings of the 2018 International Conference on Learning Representations.

Adrien Lardilleux and Yves Lepage. 2009. Sampling-based multilingual alignment. In Proceedings of the 7th Conference on Recent Advances in Natural Language Processing.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada.

Omer Levy, Anders Søgaard, and Yoav Goldberg. 2017. A strong baseline for learning cross-lingual word embeddings from sentence alignments. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics.

Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing.

Thomas Mayer and Michael Cysouw. 2014. Creating a massively parallel bible corpus. In Proceedings of the 9th International Conference on Language Resources and Evaluation.

Ryan T. McDonald, Slav Petrov, and Keith B. Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence.

Robert Östling. 2014. Bayesian word alignment for massively parallel texts. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics.

Roger Penrose. 1956. On best approximate solutions of linear matrix equations. In Mathematical Proceedings of the Cambridge Philosophical Society.

Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2017. A survey of cross-lingual word embedding models. arXiv preprint arXiv:1706.04902.

A. P. Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh M. Khapra, Balaraman Ravindran, Vikas C. Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Proceedings of the 2014 Annual Conference on Neural Information Processing Systems.

Anders Søgaard, Željko Agić, Héctor Martínez Alonso, Barbara Plank, Bernd Bohnet, and Anders Johannsen. 2015. Inverted indexing for cross-lingual NLP. In The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference of the Asian Federation of Natural Language Processing.

Hubert Soyer, Pontus Stenetorp, and Akiko Aizawa. 2014. Leveraging monolingual data for crosslingual compositional word representations. In Proceedings of the 2015 International Conference on Learning Representations.

Morris Swadesh. 1946. South Greenlandic (Eskimo). In Cornelius Osgood, editor, Linguistic Structures of Native America. Viking Fund Inc. (Johnson Reprint Corp.), New York.

Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman, Eric Nyberg, and Chris Dyer. 2014. Metaphor detection with cross-lingual model transfer. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. 2015. Evaluation of word vector representations by subspace alignment. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2049–2054.

Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical comparison. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Ivan Vulić and Marie-Francine Moens. 2012. Detecting highly confident word translations from comparable corpora without any prior knowledge. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics.

Ivan Vulić and Marie-Francine Moens. 2015. Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.

Min Xiao and Yuhong Guo. 2014. Distributed word representation learning for cross-lingual dependency parsing. In Proceedings of the 18th Conference on Computational Natural Language Learning.

Daniel Zeman and Philip Resnik. 2008. Cross-language parser adaptation between related languages. In Proceedings of the 3rd International Joint Conference on Natural Language Processing.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.