
Injecting Word Embeddings with Another Language's Resource: An Application of Bilingual Embeddings

Prakhar Pandey, Vikram Pudi, Manish Shrivastava
International Institute of Information Technology, Hyderabad, Telangana, India

International Joint Conference on Natural Language Processing (IJCNLP-2017), Taipei, Taiwan
Report No: IIIT/TR/2017/-1, Language Technologies Research Centre, IIIT Hyderabad, November 2017

Abstract

Word embeddings learned from a text corpus can be improved by injecting knowledge from external resources, while at the same time specializing them for similarity or relatedness. These knowledge resources (like WordNet or the Paraphrase Database) may not exist for all languages. In this work we introduce a method to inject the word embeddings of a language with a knowledge resource of another language by leveraging bilingual embeddings. First we improve the word embeddings of German, Italian, French and Spanish using resources of English and test them on a variety of word similarity tasks. Then we demonstrate the utility of our method by creating improved embeddings for the Urdu and Telugu languages using the Hindi WordNet, beating the previously established baseline for Urdu.

1 Introduction

Recently, fast and scalable methods to generate dense vector space models have become very popular following the works of (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014). These methods take large amounts of text corpus and generate real valued vector representations for words (word embeddings) which carry many semantic properties. Mikolov et al. (2013b) extended this model to two languages by introducing bilingual embeddings, where the word embeddings of two languages are simultaneously represented in the same vector space. The model is trained such that the word embeddings capture not only the semantic information of monolingual words, but also semantic relationships across different languages. A number of different methods have since been proposed to construct bilingual embeddings (Zou et al., 2013; Vulic and Moens, 2015; Coulmance et al., 2016).

A disadvantage of learning word embeddings only from a text corpus is that the valuable knowledge contained in knowledge resources like WordNet (Miller, 1995) is not used. Numerous methods have been proposed to incorporate knowledge from external resources into word embeddings for their refinement (Xu et al., 2014; Bian et al., 2014; Mrkšić et al., 2016). (Faruqui et al., 2015) introduced retrofitting as a light graph based technique that improves learned word embeddings.

In this work we introduce a method to improve the word embeddings of one language (the target language) using knowledge resources from some other similar language (the source language). To accomplish this, we represent both languages in the same vector space (bilingual embeddings) and obtain translations of the source language's resources. We then use these translations to improve the embeddings of the target language through retrofitting, leveraging the information contained in the bilingual space to adjust the retrofitting process and handle noise. We also show why a dictionary based translation would be ineffective for this problem, and how to handle situations where the vocabulary of the target embeddings is too big or too small compared to the size of the resource.

(Kiela et al., 2015) demonstrated the advantage of specializing word embeddings for either similarity or relatedness, which we also incorporate. Our method is also independent of the way the bilingual embeddings were obtained. An added advantage of using bilingual embeddings is that they are better than their monolingual counterparts due to incorporating multilingual evidence (Faruqui and Dyer, 2014; Mrkšić et al., 2017).

2 Background

2.1 Bilingual Embeddings

Various methods have been proposed to generate bilingual embeddings. One class of methods learns mappings to transform words from one monolingual model to another, using some form of dictionary (Mikolov et al., 2013b; Faruqui and Dyer, 2014). Another class of methods jointly optimizes monolingual and cross-lingual objectives using a word aligned parallel corpus (Klementiev et al., 2012; Zou et al., 2013) or a sentence aligned parallel corpus (Chandar A P et al., 2014; Hermann and Blunsom, 2014). There are also methods which use monolingual data together with a smaller set of sentence aligned parallel text (Coulmance et al., 2016), and those which use non-parallel document aligned data (Vulic and Moens, 2015).

We experiment with the translation invariant bilingual embeddings of (Gardner et al., 2015). We also experiment with the method proposed by (Artetxe et al., 2016), which learns a linear transform between two monolingual embeddings with monolingual invariance preserved, using a small bilingual dictionary. These methods are useful in our situation because they preserve the quality of the original monolingual embeddings and do not require parallel text (beneficial in the case of Indian languages).

2.2 Retrofitting

Retrofitting was introduced by (Faruqui et al., 2015) as a light graph based procedure for enriching word embeddings with semantic lexicons. The method operates as a post-processing step, i.e. it can be applied to word embeddings obtained from any standard technique such as Word2vec, Glove etc. The method encourages the improved vectors to be similar to the vectors of similar words as well as to the original vectors. This similarity relation among words (such as synonymy, hypernymy, paraphrase) is derived from a knowledge resource such as PPDB, WordNet etc. Retrofitting works as follows:

Let matrix Q contain the word embeddings to be improved. Let V = {w_1, w_2, ..., w_n} be the vocabulary, whose size equals the number of rows of Q, and let d be the dimension of the word vectors, which equals the number of columns. Also let Ω be the ontology that contains the intra-word relations that must be injected into the embeddings. The objective of retrofitting is to find a matrix Q̂ such that the new word vectors are close to their original vectors as well as to the vectors of related words. The function to be minimized to accomplish this objective is:

\Phi(Q) = \sum_{i=1}^{n} \Big[ \alpha_i \lVert q_i - \hat{q}_i \rVert^2 + \sum_{(i,j) \in E} \beta_{ij} \lVert q_i - q_j \rVert^2 \Big]

The iterative update equation is:

q_i = \frac{\sum_{j:(i,j) \in E} \beta_{ij} q_j + \alpha_i \hat{q}_i}{\sum_{j:(i,j) \in E} \beta_{ij} + \alpha_i}

\alpha and \beta are the parameters used to control the process. We discuss in Section 3.2 how we set them to adapt the process to bilingual settings.
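The update rule above is straightforward to implement. The following is a minimal Python sketch of the retrofitting loop (not the authors' released code); it assumes word_vecs maps words to numpy arrays, ontology maps each word to its list of related words (the edge set E), and it uses the uniform weighting of (Faruqui et al., 2015), i.e. alpha = 1 and beta = 1/gamma. All names are illustrative.

```python
import numpy as np

def retrofit(word_vecs, ontology, alpha=1.0, iterations=10):
    """Minimal retrofitting sketch in the style of (Faruqui et al., 2015).

    word_vecs: dict word -> np.ndarray, the original vectors q_hat
    ontology:  dict word -> list of related words, the edge set E
    alpha:     weight pulling each vector back towards its original value
    """
    new_vecs = {w: v.copy() for w, v in word_vecs.items()}
    for _ in range(iterations):
        for word, related in ontology.items():
            related = [r for r in related if r in new_vecs]
            if word not in new_vecs or not related:
                continue                      # nothing to fit this word to
            beta = 1.0 / len(related)         # equal weight for every neighbour
            # q_i = (sum_j beta_ij * q_j + alpha_i * q_hat_i) / (sum_j beta_ij + alpha_i)
            numerator = beta * sum(new_vecs[r] for r in related) + alpha * word_vecs[word]
            denominator = beta * len(related) + alpha
            new_vecs[word] = numerator / denominator
    return new_vecs
```

(Faruqui et al., 2015) report that running this update for around 10 iterations is sufficient, since each step only averages a word's original vector with the current vectors of its neighbours.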
2.3 Dictionary based approach

Before discussing our method, we would like to point out that using a dictionary to translate the lexical resource and then retrofitting with this translated resource is not feasible. Firstly, obtaining good quality dictionaries is a difficult and costly process¹. Secondly, it is not guaranteed that the translations obtained will be within the vocabulary of the embeddings to be improved. To demonstrate this, we obtain translations for the embeddings of three languages² and show the results in Table 1. In all cases the number of translations that are also present in the embedding's vocabulary is too small.

Language   Vocab     Matches
German     43,527     9,950
Italian    73,427    24,716
Spanish    41,779    16,547

Table 1: Using a dictionary based approach.

¹ e.g. the Google and Bing Translate APIs have become paid.
² Using the Yandex Translate API; it took around 3 days.
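The Matches column above is simply the size of the overlap between the translated resource vocabulary and the embedding vocabulary. A minimal sketch of this feasibility check, assuming translations is a plain word-to-word dictionary (e.g. obtained from a translation API) and embedding_vocab is the set of words covered by the target embeddings; both names are illustrative.

```python
def count_matches(resource_words, translations, embedding_vocab):
    """Count how many dictionary translations of the resource words
    actually occur in the target embedding's vocabulary (Table 1)."""
    embedding_vocab = set(embedding_vocab)
    matches = 0
    for word in resource_words:
        translated = translations.get(word)   # may be None if untranslatable
        if translated is not None and translated in embedding_vocab:
            matches += 1
    return matches
```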
This similarity rela- tion among words (such as synonymy, hypernymy, Let S, T and R be the vocabularies of source, tar- paraphrase) is derived from a knowledge resource get and resource respectively. Size of R is always such as PPDB, WordNet etc. Retrofitting works as fixed while size of S and T depends on embed- follows: dings. The relation between S, T and R is shown S R Let matrix Q contain the word embeddings to in Figure 1. Sets and have one to one map- ping which in not necessarily onto, while T and be improved. Let V = fw1; w2:::wng be the vocab- S ulary which is equal to number of rows in Q and have many to one mapping. Consider the ideal R S d be the dimension of word vectors which is equal case where every word in is also in and every S T to number of columns. Also let Ω be the ontol- word in has the exact translation from as its ogy that contains the intra word relations that must nearest neighbour in the bilingual space. Then the be injected into the embeddings. The objective of 1eg. Google and Bing Translate APIs have become paid. retrofitting is to find a matrix Q^ such that the new 2using Yandex Translate API, it took around 3 days T S R original vector and β controls movement towards vectors to be fitted with. (Faruqui et al., 2015) set 1 ragazzo α as 1 and β as γ where γ is the number of vectors boy to be fitted with. Thus they give equal weights to figlio each vector. Cosine similarity between a word and its trans- lation is a good measure of the confidence in its translation. We use it to set β such that differ- Figure 1: Relationships between Source, Target ent vectors get different weights. A word for and Resource Vocabularies. which we have higher confidence in translation (i.e higher cosine similarity) is given more weight simple approach for translation would be assign- when retrofitting. Therefore wi being the weights, ing every si 2 S its nearest neighbour ti 2 T as α; β are set as : the translation. γ First problem is that in practical settings these X conditions are almost never satisfied. Secondly the α = wi; βi = wi sizes of S, T and R can be very different. Suppose i=1 the size of S, T is large compared to R or the size of T is large but size of S is comparatively smaller. Further reduction in weights of noisy words can In both cases size of translated resource will be too be done by taking powers of cosine similarity. An small to make impact. Thirdly words common to example in Table 2 shows weights of similar words both R and S will be lesser than the total words in for Italian word professore derived by taking pow- R.