Learning Word Embeddings for Low-resource Languages by PU Learning

Chao Jiang, University of Virginia    Hsiang-Fu Yu, Amazon
Cho-Jui Hsieh, University of California, Davis    Kai-Wei Chang, University of California, Los Angeles

Abstract

Word embedding is a key component in many downstream applications for processing natural languages. Existing approaches often assume the existence of a large collection of text for learning effective word embeddings. However, such a corpus may not be available for some low-resource languages. In this paper, we study how to effectively learn a word embedding model on a corpus with only a few million tokens. In such a situation, the co-occurrence matrix is sparse, as the co-occurrences of many word pairs are unobserved. In contrast to existing approaches, which often only sample a few unobserved word pairs as negative samples, we argue that the zero entries in the co-occurrence matrix also provide valuable information. We then design a Positive-Unlabeled Learning (PU-Learning) approach to factorize the co-occurrence matrix and validate the proposed approach in four different languages.

1 Introduction

Learning word representations has become a fundamental problem in processing natural languages. These semantic representations, which map a word into a point in a linear space, have been widely applied in downstream applications, including named entity recognition (Guo et al., 2014), document ranking (Nalisnick et al., 2016), sentiment analysis (Irsoy and Cardie, 2014), question answering (Antol et al., 2015), and image captioning (Karpathy and Fei-Fei, 2015).

Over the past few years, various approaches have been proposed to learn word vectors (e.g., (Pennington et al., 2014; Mikolov et al., 2013a; Levy and Goldberg, 2014b; Ji et al., 2015)) based on co-occurrence information between words observed in the training corpus. The intuition behind this is to represent words with similar vectors if they have similar contexts. To learn a good word embedding, most approaches assume that a large collection of text is freely available, such that the estimation of word co-occurrences is accurate. For example, the Google Word2Vec model (Mikolov et al., 2013a) is trained on the Google News dataset, which contains around 100 billion tokens, and the GloVe embedding (Pennington et al., 2014) is trained on a crawled corpus that contains 840 billion tokens in total. However, such an assumption may not hold for low-resource languages such as Inuit or Sindhi, which are not spoken by many people or have not been put into a digital format. For those languages, usually only a limited-size corpus is available, and training word vectors under such a setting is a challenging problem.

One key restriction of the existing approaches is that they mainly rely on the word pairs that are observed to co-occur in the training data. When the size of the corpus is small, most word pairs are unobserved, resulting in an extremely sparse co-occurrence matrix (i.e., most entries are zero)[1]. For example, the text8[2] corpus has about 17,000,000 tokens and 71,000 distinct words. The corresponding co-occurrence matrix has more than five billion entries, but only about 45,000,000 are non-zero (observed in the training corpus). Most existing approaches, such as GloVe and Skip-gram, cannot handle a vast number of zero terms in the co-occurrence matrix; therefore, they only sub-sample a small subset of zero entries during training.

In contrast, we argue that the unobserved word pairs can provide valuable information for training a word embedding model, especially when the co-occurrence matrix is very sparse.

[1] Note that a zero entry can mean either that the pair of words cannot co-occur or that the co-occurrence is not observed in the training corpus.
[2] http://mattmahoney.net/dc/text8.zip

Inspired by the success of Positive-Unlabeled Learning (PU-Learning) in collaborative filtering applications (Pan et al., 2008; Hu et al., 2008; Pan and Scholz, 2009; Qin et al., 2010; Paquet and Koenigstein, 2013; Hsieh et al., 2015), we design an algorithm to effectively learn word embeddings from both positive (observed terms) and unlabeled (unobserved/zero terms) examples. Essentially, by using the square loss to model the unobserved terms and designing an efficient update rule based on linear algebra operations, the proposed PU-Learning framework can be trained efficiently and effectively.

We evaluate the performance of the proposed approach on English[3] and three other resource-scarce languages. We collected unlabeled corpora from Wikipedia and compared the proposed approach with popular approaches, the GloVe and the Skip-gram models, for training word embeddings. The experimental results show that our approach significantly outperforms the baseline models, especially when the size of the training corpus is small.

Our key contributions are summarized below.

• We propose a PU-Learning framework for learning word embeddings.
• We tailor the coordinate descent algorithm (Yu et al., 2017b) to solve the corresponding optimization problem.
• Our experimental results show that PU-Learning improves word embedding training in the low-resource setting.

[3] Although English is not a resource-scarce language, we simulate the low-resource setting on an English corpus. In this way, we can leverage existing evaluation methods to evaluate the proposed approach.

2 Related work

Learning word vectors. The idea of learning word representations can be traced back to Latent Semantic Analysis (LSA) (Deerwester et al., 1990) and Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996), where word vectors are generated by factorizing a word-document and a word-word co-occurrence matrix, respectively. Similar approaches can also be extended to learn other types of relations between words (Yih et al., 2012; Chang et al., 2013) or entities (Chang et al., 2014). However, due to the limitations of principal component analysis, these approaches are often less flexible. Besides, directly factorizing the co-occurrence matrix may cause the frequent words to dominate the training objective.

In the past decade, various approaches have been proposed to improve the training of word embeddings. For example, instead of factorizing the co-occurrence count matrix, Bullinaria and Levy (2007) and Levy and Goldberg (2014b) proposed to factorize point-wise mutual information (PMI) and positive PMI (PPMI) matrices, as these metrics rescale the co-occurrence counts. The Skip-gram model with negative sampling (SGNS) and the Continuous Bag-of-Words model (Mikolov et al., 2013b) were proposed for training word vectors on a large scale without consuming a large amount of memory. GloVe (Pennington et al., 2014) was proposed as an alternative that decomposes a weighted log co-occurrence matrix with a bias term added to each word. Very recently, the WordRank model (Ji et al., 2015) was proposed to minimize a ranking loss, which naturally fits tasks with ranking-based evaluation metrics. Stratos et al. (2015) also proposed a CCA (canonical correlation analysis)-based word embedding which shows competitive performance. All these approaches focus on situations where a large text corpus is available.

Positive and Unlabeled (PU) Learning. Positive and Unlabeled (PU) learning (Li and Liu, 2005) is proposed for training a model when the positive instances are partially labeled and the unlabeled instances are mostly negative. Recently, PU learning has been used in many classification and collaborative filtering applications due to the nature of "implicit feedback" in many recommendation systems—users usually only provide positive feedback (e.g., purchases, clicks) and it is very hard to collect negative feedback.

To resolve this problem, a series of PU matrix completion algorithms have been proposed (Pan et al., 2008; Hu et al., 2008; Pan and Scholz, 2009; Qin et al., 2010; Paquet and Koenigstein, 2013; Hsieh et al., 2015; Yu et al., 2017b). The main idea is to assign a small uniform weight to all the missing or zero entries and factorize the corresponding matrix. Among them, Yu et al. (2017b) proposed an efficient algorithm for matrix factorization with PU-learning, such that the weighted matrix is constructed implicitly. In this paper, we design a new approach for training word vectors by leveraging the PU-Learning framework and existing word embedding techniques. To the best of our knowledge, this is the first work to train word embedding models using the PU-learning framework.

3 PU-Learning for Word Embedding

Similar to GloVe and other word embedding learning algorithms, the proposed approach consists of three steps. The first step is to construct a co-occurrence matrix. Following the literature (Levy and Goldberg, 2014a), we use the PPMI metric to measure the co-occurrence between words. Then, in the second step, a PU-Learning approach is applied to factorize the co-occurrence matrix and generate word vectors and context vectors. Finally, a post-processing step generates the final embedding vector for each word by combining the word vector and the context vector.

We summarize the notations used in this paper in Table 1 and describe the details of each step in the remainder of this section.

W, C          vocabulary of central and context words
m, n          vocabulary sizes
k             dimension of word vectors
W, H          m x k and n x k latent matrices
C_ij          weight for the (i, j) entry
A_ij          value of the PPMI matrix
Q_ij          value of the co-occurrence matrix
w_i, h_j      i-th row of W and j-th row of H
b, b̂          bias terms
λ_i, λ_j      regularization parameters
|·|           the size of a set
Ω             set of possible word-context pairs
Ω+            set of observed word-context pairs
Ω−            set of unobserved word-context pairs

Table 1: Notations.

3.1 Building the Co-Occurrence Matrix

Various metrics can be used for estimating the co-occurrence between words in a corpus. The PPMI metric stems from point-wise mutual information (PMI), which has been widely used as a measure of word association in NLP for various tasks (Church and Hanks, 1990). In our case, each entry PMI(w, c) represents the strength of association between a word w and a context word c, computed as the ratio between their joint probability (the chance they appear together in a local context window) and their marginal probabilities (the chance they appear independently) (Levy and Goldberg, 2014b). More specifically, each entry of the PMI matrix is defined by

    PMI(w, c) = \log \frac{\hat{P}(w, c)}{\hat{P}(w) \cdot \hat{P}(c)},    (1)

where \hat{P}(w), \hat{P}(c), and \hat{P}(w, c) are the frequencies of word w, word c, and the word pair (w, c), respectively. The PMI matrix can be computed from the co-occurrence counts of word pairs, and it is an information-theoretic association measure which effectively eliminates the large differences in magnitude among entries of the co-occurrence matrix.

Extending the PMI metric, the PPMI metric replaces all the negative entries in the PMI matrix by 0:

    PPMI(w, c) = \max(PMI(w, c), 0).    (2)

The intuition behind this is that people usually perceive positive associations between words (e.g., "ice" and "snow"). In contrast, negative associations are hard to define (Levy and Goldberg, 2014b). Therefore, it is reasonable to replace the negative entries in the PMI matrix by 0, such that negative associations are treated as "uninformative". Empirically, several existing works (Levy et al., 2015; Bullinaria and Levy, 2007) showed that the PPMI metric achieves good performance on various tasks.

In practice, we follow the pipeline described in Levy et al. (2015) to build the PPMI matrix and apply several useful tricks to improve its quality. First, we apply a context distribution smoothing mechanism to enlarge the probability of sampling a rare context. In particular, all context counts are scaled to the power of α[4]:

    PPMI_{\alpha}(w, c) = \max\left( \log \frac{\hat{P}(w, c)}{\hat{P}(w) \hat{P}_{\alpha}(c)}, \, 0 \right),
    \qquad \hat{P}_{\alpha}(c) = \frac{\#(c)^{\alpha}}{\sum_{\bar{c}} \#(\bar{c})^{\alpha}},

where #(w) denotes the number of times word w appears. This smoothing mechanism effectively alleviates PPMI's bias towards rare words (Levy et al., 2015).

[4] Empirically, α = 0.75 works well (Mikolov et al., 2013b).
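To make the construction above concrete, the following NumPy sketch builds a smoothed PPMI matrix from raw windowed co-occurrence counts. It is an illustrative re-implementation under our own simplifying assumptions (dense matrices, a tiny toy corpus, marginals estimated from the co-occurrence counts); the paper itself follows the hyperwords pipeline of Levy et al. (2015) rather than this code.

```python
import numpy as np

def build_ppmi(tokens, window=2, alpha=0.75):
    """Smoothed PPMI matrix from a token list (illustrative sketch only)."""
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)

    # Co-occurrence counts Q_ij within a symmetric window around each position.
    Q = np.zeros((n, n))
    for pos, w in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for c in tokens[lo:pos] + tokens[pos + 1:hi]:
            Q[idx[w], idx[c]] += 1.0

    total = Q.sum()
    p_wc = Q / total                       # joint estimate P^(w, c)
    p_w = Q.sum(axis=1) / total            # word marginal P^(w)
    ctx = Q.sum(axis=0) ** alpha           # context counts raised to alpha
    p_c_alpha = ctx / ctx.sum()            # smoothed context distribution P^_alpha(c)

    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / np.outer(p_w, p_c_alpha))
    # PPMI: clip negative associations to 0; unseen pairs (log 0) also become 0.
    return np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0), vocab

A, vocab = build_ppmi("ice is cold and snow is cold and white".split())
```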

Next, previous studies show that words that occur too frequently often dominate the training objective (Levy et al., 2015) and degrade the performance of word embeddings. To avoid this issue, we follow Levy et al. (2015) and sub-sample words whose frequency exceeds a threshold t, removing each occurrence with probability p defined as:

    p = 1 - \sqrt{t / \hat{P}(w)}.
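As a small illustration of this sub-sampling step (our own sketch, not the authors' preprocessing code), each occurrence of a frequent word is dropped with the probability p above; the threshold value below is a commonly used default and an assumption on our part.

```python
import numpy as np

def subsample_frequent(tokens, t=1e-5, seed=0):
    """Drop each occurrence of word w with probability p = 1 - sqrt(t / P^(w))."""
    rng = np.random.default_rng(seed)
    total = len(tokens)
    counts = {}
    for w in tokens:
        counts[w] = counts.get(w, 0) + 1
    kept = []
    for w in tokens:
        # t / P^(w) = t * total / count(w); clamp so rare words are always kept.
        p_drop = max(0.0, 1.0 - np.sqrt(t * total / counts[w]))
        if rng.random() >= p_drop:
            kept.append(w)
    return kept
```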
3.2 PU-Learning for Matrix Factorization

We propose a matrix-factorization-based word embedding model which aims to minimize the reconstruction error on the PPMI matrix. The low-rank embeddings are obtained by solving the following optimization problem:

    \min_{W, H} \sum_{(i,j) \in \Omega} C_{ij} (A_{ij} - w_i^{\top} h_j - b_i - \hat{b}_j)^2 + \sum_i \lambda_i \|w_i\|^2 + \sum_j \lambda_j \|h_j\|^2,    (3)

where W and H are m x k and n x k latent matrices, representing words and context words, respectively. The first term in Eq. (3) minimizes the reconstruction error, and the second and third terms are regularization terms, with weights λ_i and λ_j. They are hyper-parameters that need to be tuned.

The zero entries in the co-occurrence matrix denote that two words never appear together in the current corpus; we also refer to them as unobserved terms. An unobserved term can be either a true zero (the two words would not co-occur even in a very large corpus) or simply missing from the small corpus. In contrast to SGNS, which sub-samples a small set of zero entries as negative samples, our model uses the information from all zeros.

The set Ω includes all the |W| x |C| entries—both positive and zero entries:

    \Omega = \Omega^{+} \cup \Omega^{-}.    (4)

Note that we define the positive samples Ω+ to be all the (w, c) pairs that appear at least once in the corpus, and the negative samples Ω− to be the word pairs that never appear in the corpus.

Weighting function. Eq. (3) is very similar to the objective used in previous matrix factorization approaches such as GloVe, but we propose a new way to set the weights C_ij. If we set equal weights for all the entries (C_ij = constant), the model is very similar to performing SVD on the PPMI matrix, and previous work has shown that this approach often suffers from poor performance (Pennington et al., 2014). More advanced methods, such as GloVe, set non-uniform weights for observed entries to reflect their confidence. However, the time complexity of their algorithm is proportional to the number of nonzero weights, |{(i, j) | C_ij ≠ 0}|, so they have to either set zero weights for all the unobserved entries (C_ij = 0 for Ω−) or incorporate only a small set of unobserved entries by negative sampling. We propose to set the weights for Ω+ and Ω− differently, using the following scheme:

    C_{ij} = \begin{cases}
        (Q_{ij} / x_{max})^{\alpha} & \text{if } Q_{ij} \le x_{max} \text{ and } (i, j) \in \Omega^{+} \\
        1 & \text{if } Q_{ij} > x_{max} \text{ and } (i, j) \in \Omega^{+} \\
        \rho & \text{if } (i, j) \in \Omega^{-}
    \end{cases}    (5)

Here x_max and α are re-weighting parameters, and ρ is the unified weight for unobserved terms; we discuss them below.

For entries in Ω+, we set non-uniform weights as in GloVe (Pennington et al., 2014), which assigns larger weights to context words that appear more often with the given word, while preventing frequent pairs from overwhelming the other terms. For entries in Ω−, instead of setting their weights to 0, we assign a small constant weight ρ. The main idea comes from the PU-learning literature (Hu et al., 2008; Hsieh et al., 2015): although missing entries are highly uncertain, they are still likely to be true zeros, so we incorporate them in the learning process but multiply them by a smaller weight according to this uncertainty. Therefore, ρ in (5) reflects how confident we are in the zero entries.

In our experiments, we set x_max = 10 and α = 3/4 following (Pennington et al., 2014), and treat ρ as a parameter to tune. Experiments show that the weighting function clearly improves performance, especially on analogy tasks.
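The weighting scheme of Eq. (5) can be transcribed directly. The sketch below uses the GloVe defaults quoted above for x_max and α, and the value of ρ the paper reports as working well (ρ = 0.0625, see Section 3.3); it is an illustration of the equation, not the authors' implementation.

```python
def pu_weight(q_ij, observed, x_max=10.0, alpha=0.75, rho=0.0625):
    """C_ij from Eq. (5): GloVe-style confidence for observed pairs, rho otherwise."""
    if not observed:                     # (i, j) in Omega^-
        return rho
    if q_ij > x_max:                     # frequent observed pair: cap the weight at 1
        return 1.0
    return (q_ij / x_max) ** alpha       # rare observed pair: down-weighted smoothly

# A rare observed pair, a frequent observed pair, and an unobserved pair.
print(pu_weight(3, True), pu_weight(50, True), pu_weight(0, False))
```

In the actual solver these weights are never materialized as an m x n matrix; the uniform ρ part is handled implicitly, which is what the complexity analysis below relies on.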

Bias term. Unlike previous work on PU matrix completion (Yu et al., 2017b; Hsieh et al., 2015), we add bias terms for the word and context word vectors. Instead of directly using w_i^⊤ h_j to approximate A_ij, we use

    A_{ij} \approx w_i^{\top} h_j + b_i + \hat{b}_j.

Yu et al. (2017b) design an efficient column-wise coordinate descent algorithm for solving the PU matrix factorization problem; however, they do not consider the bias term in their implementation. To incorporate the bias term in (3), we propose the following training algorithm based on the coordinate descent approach. Our algorithm does not introduce much overhead compared to that of Yu et al. (2017b).

We augment each w_i, h_j ∈ R^k into the following (k + 2)-dimensional vectors:

    w_i' = (w_{i1}, \dots, w_{ik}, 1, b_i)^{\top}, \qquad h_j' = (h_{j1}, \dots, h_{jk}, \hat{b}_j, 1)^{\top}.

Therefore, for each word and context vector, we have the equality

    \langle w_i', h_j' \rangle = \langle w_i, h_j \rangle + b_i + \hat{b}_j,

which means the loss function in (3) can be written as

    \sum_{(i,j) \in \Omega} C_{ij} (A_{ij} - w_i'^{\top} h_j')^2.

We denote W' = [w_1', w_2', ..., w_n']^⊤ and H' = [h_1', h_2', ..., h_n']^⊤. In the column-wise coordinate descent method, at each iteration we pick a t ∈ {1, ..., k + 2} and update the t-th columns of W' and H'. The updates are derived for the following two cases:

a. When t ≤ k, the elements in the t-th column are w_{1t}, ..., w_{nt}, and we can directly use the update rule derived in Yu et al. (2017b) to update them.

b. When t = k + 1, we do not update the corresponding column of W' since its elements are all 1, and we use a similar coordinate descent update for the (k + 1)-th column of H' (corresponding to \hat{b}_1, ..., \hat{b}_n). When t = k + 2, we do not update the corresponding column of H' (its elements are all 1) and we update the (k + 2)-th column of W' (corresponding to b_1, ..., b_n) using coordinate descent.

With some further derivations, we can show that the algorithm only requires O(nnz(A) + nk) time to update each column[5], so the overall complexity is O(nnz(A)k + nk^2) time per epoch, which is only proportional to the number of nonzero terms in A. Therefore, with the same time complexity as GloVe, we can utilize the information from all the zero entries in A instead of only sub-sampling a small set of zero entries.

[5] Here we assume m = n for the sake of simplicity, and nnz(A) denotes the number of nonzero terms in the matrix A.
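The sketch below is not the authors' C++ coordinate descent solver; it only illustrates, under toy assumptions (small dense matrices, bias terms omitted), why all of Ω− can be used at this cost: with a uniform weight ρ on unobserved entries, their contribution to the objective collapses into the Gram-matrix term ρ · trace((WᵀW)(HᵀH)), so evaluating the loss only touches the nonzero entries of A.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, rho, x_max, alpha = 40, 50, 8, 0.0625, 10.0, 0.75

# Toy PPMI-like matrix: mostly zeros (unobserved pairs), a few positive entries.
A = np.zeros((m, n))
mask = rng.random((m, n)) < 0.05
A[mask] = 3.0 * rng.random(mask.sum())
C_pos = np.minimum((A / x_max) ** alpha, 1.0)   # GloVe-style weights, Eq. (5)

W = 0.1 * rng.normal(size=(m, k))
H = 0.1 * rng.normal(size=(n, k))

def pu_loss_dense(A, W, H, C_pos, rho):
    """O(mnk): materialize every reconstruction entry and every weight."""
    X = W @ H.T
    C = np.where(A > 0, C_pos, rho)
    return np.sum(C * (A - X) ** 2)

def pu_loss_implicit(A, W, H, C_pos, rho):
    """O(nnz(A)k + (m+n)k^2): the rho-weighted sum over *all* pairs equals
    rho * ||W H^T||_F^2 = rho * trace((W^T W)(H^T H)), so zero entries are never
    enumerated -- the idea behind the implicit construction of Yu et al. (2017b)
    that the training algorithm builds on."""
    i, j = np.nonzero(A)
    x_obs = np.einsum("ik,ik->i", W[i], H[j])           # w_i . h_j on observed pairs
    obs = np.sum(C_pos[i, j] * (A[i, j] - x_obs) ** 2 - rho * x_obs ** 2)
    return obs + rho * np.trace((W.T @ W) @ (H.T @ H))

print(np.isclose(pu_loss_dense(A, W, H, C_pos, rho),
                 pu_loss_implicit(A, W, H, C_pos, rho)))   # True
```

The same kind of decomposition is what keeps the per-column updates cheap in PU matrix factorization solvers: the dense part of the residual statistics can be refreshed from k x k Gram matrices instead of from all m·n entries.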
3.3 Interpretation of Parameters

In the PU-Learning formulation, ρ represents the unified weight assigned to the unobserved terms. Intuitively, ρ reflects our confidence in the unobserved entries—a larger ρ means that we are quite certain about the zeros, while a small ρ indicates that many of the unobserved pairs may not be truly zero. When ρ = 0, the PU-Learning approach reduces to a model similar to GloVe, which discards all the unobserved terms. In practice, ρ is an important parameter to tune, and we find that ρ = 0.0625 achieves the best results in general. Regarding the other parameter, λ is the regularization weight that prevents the embedding model from over-fitting. In practice, we found the performance is not very sensitive to λ as long as it is reasonably small. More discussion about the parameter setting can be found in Section 5.

Post-processing of Word/Context Vectors. The PU-Learning framework factorizes the PPMI matrix and generates two vectors for each word i, w_i ∈ R^k and h_i ∈ R^k. The former represents the word when it is the central word, and the latter represents the word when it is in the context. Levy et al. (2015) show that averaging these two vectors (u_i^avg = w_i + h_i) leads to consistently better performance. The same trick of constructing word vectors is also used in GloVe. Therefore, in the experiments, we evaluate all models with u^avg.

4 Experimental Setup

Our goal in this paper is to train word embedding models for low-resource languages. In this section, we describe the experimental designs to evaluate the proposed PU-learning approach. We first describe the data sets and the evaluation metrics. Then, we provide details of parameter tuning.

                   Similarity task                                       Analogy task
Word embedding   WS353    Similarity   Relatedness   M. Turk   MEN      3CosAdd   3CosMul
GloVe            48.7     50.9         53.7          54.1      17.6     32.1      28.5
SGNS             67.2     70.3         67.9          59.9*     25.1*    30.4      27.8
PU-learning      68.3*    71.8*        68.2*         57.0      22.7     32.6*     32.3*

Table 2: Performance of the best SGNS, GloVe, and PU-Learning models, trained on the text8 corpus. Results show that our proposed model is better than SGNS and GloVe. A star indicates that the result is significantly better than the second-best algorithm in the same column according to the Wilcoxon signed-rank test (p < 0.05).

                      Similarity task                                    Analogy task
Language        WS353    Similarity   Relatedness   M. Turk   MEN       Google
English (en)    353      203          252           287       3,000     19,544
Czech (cs)      337      193          241           268       2,810     18,650
Danish (da)     346      198          247           283       2,951     18,340
Dutch (nl)      346      200          247           279       2,852     17,684

Table 3: The size of the test sets. The data sets in English are the original test sets. To evaluate other languages, we translate the data sets from English.

4.1 Evaluation tasks

We consider two widely used tasks for evaluating word embeddings: the word similarity task and the word analogy task. In the word similarity task, each question contains a word pair and an annotated similarity score. The goal is to predict the similarity score between the two words based on the inner product of the corresponding word vectors. The performance is measured by Spearman's rank correlation coefficient, which estimates the correlation between the model predictions and the human annotations. Following the settings in the literature, the experiments are conducted on five data sets: WordSim353 (Finkelstein et al., 2001), WordSim Similarity (Zesch et al., 2008), WordSim Relatedness (Agirre et al., 2009), Mechanical Turk (Radinsky et al., 2011), and MEN (Bruni et al., 2012).

In the word analogy task, we aim at solving analogy puzzles like "man is to woman as king is to ?", where the expected answer is "queen." We consider two approaches for generating answers to the puzzles, namely 3CosAdd and 3CosMul (see (Levy and Goldberg, 2014a) for details). We evaluate the performance on the Google analogy dataset (Mikolov et al., 2013a), which contains 8,860 semantic and 10,675 syntactic questions. For the analogy task, only an answer that exactly matches the annotated answer is counted as correct. As a result, the analogy task is more difficult than the similarity task, because the evaluation metric is stricter and it requires the algorithm to differentiate between words with similar meanings and find the right answer.

To evaluate the performance of the models in the low-resource setting, we train word embedding models on Dutch, Danish, Czech, and English data sets collected from Wikipedia. The original Wikipedia corpora in Dutch, Danish, Czech, and English contain 216 million, 47 million, 92 million, and 1.8 billion tokens, respectively. To simulate the low-resource setting, we sub-sample the Wikipedia corpora and create a subset of 64 million tokens for Dutch and Czech and a subset of 32 million tokens for English. We will demonstrate how the size of the corpus affects the performance of the embedding models in the experiments.

To evaluate the performance of word embeddings in Czech, Danish, and Dutch, we translate the English similarity and analogy test sets to the other languages using the Google Cloud Translation API[6]. However, an English word may be translated to multiple words in another language (e.g., compound nouns). We discard questions containing such words (see Table 3 for details). Because all approaches are compared on the same test set for each language, the comparisons are fair.

[6] https://cloud.google.com/translate
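As an illustration of the two evaluation protocols (our own sketch, not the paper's evaluation scripts): word similarity is scored by correlating model cosine scores with the human ratings using Spearman's ρ, and analogy questions are answered with 3CosAdd or 3CosMul over unit-normalized vectors, following Levy and Goldberg (2014a).

```python
import numpy as np
from scipy.stats import spearmanr

def normalize(E):
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def similarity_score(E, vocab, pairs, gold):
    """Spearman correlation between model cosine scores and human ratings."""
    idx = {w: i for i, w in enumerate(vocab)}
    X = normalize(E)
    pred = [float(X[idx[a]] @ X[idx[b]]) for a, b in pairs]
    return spearmanr(pred, gold)[0]

def solve_analogy(E, vocab, a, a_star, b, method="3CosMul", eps=1e-3):
    """Answer "a : a_star :: b : ?" with 3CosAdd or 3CosMul."""
    idx = {w: i for i, w in enumerate(vocab)}
    X = normalize(E)
    cos = lambda w: X @ X[idx[w]]            # cosine similarity to every word
    if method == "3CosAdd":
        score = cos(a_star) - cos(a) + cos(b)
    else:                                     # 3CosMul: shift cosines into (0, 1]
        s = lambda w: (cos(w) + 1.0) / 2.0
        score = s(a_star) * s(b) / (s(a) + eps)
    for w in (a, a_star, b):                  # the query words are excluded as answers
        score[idx[w]] = -np.inf
    return vocab[int(np.argmax(score))]
```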

Dutch (nl)         Similarity task                                       Analogy task
Word embedding   WS353    Similarity   Relatedness   M. Turk   MEN      3CosAdd   3CosMul
GloVe            35.4     35.0         41.7          44.3      11       21.2      20.2
SGNS             51.9     52.9         53.5          49.8*     15.4     22.1      23.6
PU-learning      53.7*    53.4*        55.1*         46.7      16.4*    23.5*     24.7*

Danish (da)
Word embedding   WS353    Similarity   Relatedness   M. Turk   MEN      3CosAdd   3CosMul
GloVe            25.7     18.4         40.3          49.0      16.4     25.8*     24.3*
SGNS             49.7     47.1         52.1          51.5      22.4     22.0      21.2
PU-learning      53.5*    49.5*        59.3*         51.7*     22.7*    22.6      22.8

Czech (cs)
Word embedding   WS353    Similarity   Relatedness   M. Turk   MEN      3CosAdd   3CosMul
GloVe            34.3     23.2         48.9          36.5      16.2     8.9       8.6
SGNS             51.4     42.7         61.1          44.2      21.3     10.4*     9.8
PU-learning      54.0*    45.4*        65.3*         46.2*     21.7*    9.9       10.1*

English (en)
Word embedding   WS353    Similarity   Relatedness   M. Turk   MEN      3CosAdd   3CosMul
GloVe            47.9     52.1         49.5          58.8      19.1     34.3      32.6
SGNS             65.7     67.1*        66.5          62.8*     26.1*    31.2      27.4
PU-learning      67.0*    66.7         69.6*         59.4      22.4     39.2*     38.8*

Table 4: Performance of SGNS, GloVe, and the proposed PU-Learning model in four different languages. Results show that the proposed PU-Learning model outperforms SGNS and GloVe in most cases when the size of the corpus is relatively small (around 50 million tokens). A star indicates that the result is significantly better than the second-best algorithm in the same column according to the Wilcoxon signed-rank test (p < 0.05).

4.2 Implementation and Parameter Setting

We compare the proposed approach with two baseline methods, GloVe and SGNS. The implementations of GloVe[7] and SGNS[8] are provided by the original authors, and we apply the default settings when appropriate. The proposed PU-Learning framework is implemented based on Yu et al. (2017a). With the implementation of efficient update rules, our model requires less than 500 seconds to perform one iteration over the entire text8 corpus, which consists of 17 million tokens[9]. All the models are implemented in C++.

We follow Levy et al. (2015)[10] and set the window size to 15, the minimal count to 5, and the dimension of word vectors to 300 in the experiments. Training word embedding models involves selecting several hyper-parameters. However, as word embeddings are usually evaluated in an unsupervised setting (i.e., the evaluation data sets are not seen during training), the parameters should not be tuned on each dataset. To conduct a fair comparison, we tune the hyper-parameters on the text8 dataset. For the GloVe model, we tune the discount parameter x_max and find that x_max = 10 performs the best. SGNS has a natural parameter k which denotes the number of negative samples; as in Levy et al. (2015), we found that setting k to 5 leads to the best performance. For the PU-learning model, ρ and λ are two important parameters that denote the unified weight of zero entries and the weight of the regularization terms, respectively. We tune ρ in a range from 2^{-1} to 2^{-14} and λ in a range from 2^{0} to 2^{-10}. We analyze the sensitivity of the model to these hyper-parameters in the experimental result section. The best performance of each model on the text8 dataset is shown in Table 2. It shows that the PU-learning model outperforms the two baseline models.

[7] https://nlp.stanford.edu/projects/glove
[8] https://code.google.com/archive/p/word2vec/
[9] http://mattmahoney.net/dc/text8.zip
[10] https://bitbucket.org/omerlevy/hyperwords

5 Experimental Results

We compared the proposed PU-Learning framework with two popular word embedding models – SGNS (Mikolov et al., 2013b) and GloVe (Pennington et al., 2014) – on English and three other languages. The experimental results are reported in Table 4. The results show that the proposed PU-Learning framework outperforms the two baseline approaches significantly on most datasets.

Figure 1: Performance change as the corpus size grows, (a) on the Google word analogy task (left) and (b) on the WS353 word similarity task (right). We demonstrate the performance on four languages: Dutch, Danish, Czech, and English. Results show that the PU-Learning model consistently outperforms SGNS and GloVe when the size of the corpus is small.

Figure 2: Impact of ρ and λ in the PU-Learning framework.

These results confirm that the unobserved word pairs carry important information and that the PU-Learning model leverages such information to achieve better performance. To better understand the model, we conduct a detailed analysis as follows.

Performance vs. corpus size. We investigate the performance of our algorithm with respect to different corpus sizes and plot the results in Figure 1. The results on the analogy task are obtained by the 3CosMul method (Levy and Goldberg, 2014a). As the corpus size grows, the performance of all models improves, and the PU-learning model consistently outperforms the other methods in all the tasks. However, as the size of the corpus increases, the difference becomes smaller. This is reasonable: as the corpus size increases, the number of zero terms becomes smaller, and the PU-learning approach becomes more similar to GloVe.

Impacts of ρ and λ. We investigate how sensitive the model is to the hyper-parameters ρ and λ. Figure 2 shows the performance for various values of λ and ρ when training on the text8 corpus. Note that the x-axis is in log scale. When ρ is fixed, a large λ degrades the performance of the model significantly; this is because when λ is too large the model suffers from under-fitting. The model is less sensitive when λ is small and, in general, λ = 2^{-11} achieves consistently good performance.

When λ is fixed, we observe that a large ρ (e.g., ρ ≈ 2^{-4}) leads to better performance. As ρ represents the weight assigned to the unobserved terms, this result confirms that the model benefits from using the zero terms in the co-occurrence matrix.

6 Conclusion

In this paper, we presented a PU-Learning framework for learning word embeddings of low-resource languages. We evaluated the proposed approach on English and three other languages and showed that the proposed approach outperforms other baselines by effectively leveraging the information from unobserved word pairs.

In the future, we would like to conduct experiments on other languages for which available text corpora are relatively hard to obtain. We are also interested in applying the proposed approach to domains, such as legal documents and clinical notes, where the amount of accessible data is small. Besides, we plan to study how to leverage other information to facilitate the training of word embeddings under the low-resource setting.

Acknowledgments

This work was supported in part by National Science Foundation Grants IIS-1760523 and IIS-1719097, and an NVIDIA Hardware Grant.

References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. pages 19–27.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision. pages 2425–2433.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. Association for Computational Linguistics, pages 136–145.

John A Bullinaria and Joseph P Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods 39(3):510–526.

Kai-Wei Chang, Wen-tau Yih, Bishan Yang, and Chris Meek. 2014. Typed tensor decomposition of knowledge bases for relation extraction. In EMNLP.

Kai-Wei Chang, Wen-tau Yih, and Christopher Meek. 2013. Multi-relational latent semantic analysis. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. pages 1602–1612.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics 16(1):22–29.

Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6):391.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web. ACM, pages 406–414.

Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Revisiting embedding features for simple semi-supervised learning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pages 110–120.

Cho-Jui Hsieh, Nagarajan Natarajan, and Inderjit Dhillon. 2015. PU learning for matrix completion. In International Conference on Machine Learning. pages 2445–2453.

Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Proceedings of the IEEE International Conference on Data Mining (ICDM). pages 263–272.

Ozan Irsoy and Claire Cardie. 2014. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems. pages 2096–2104.

Shihao Ji, Hyokun Yun, Pinar Yanardag, Shin Matsushima, and SVN Vishwanathan. 2015. WordRank: Learning word embeddings via robust ranking. arXiv preprint arXiv:1506.02761.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pages 3128–3137.

Omer Levy and Yoav Goldberg. 2014a. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. pages 171–180.

Omer Levy and Yoav Goldberg. 2014b. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems. pages 2177–2185.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3:211–225.

Xiao-Li Li and Bing Liu. 2005. Learning from positive and unlabeled examples with different data distributions. In European Conference on Machine Learning. Springer, pages 218–229.

Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers 28(2):203–208.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. pages 3111–3119.

Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. 2016. Improving document ranking with dual word embeddings. In Proceedings of the 25th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, pages 83–84.

Rong Pan and Martin Scholz. 2009. Mind the gaps: Weighting the unknown in large-scale one-class collaborative filtering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). pages 667–676.

Rong Pan, Yunhong Zhou, Bin Cao, Nathan N Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. 2008. One-class collaborative filtering. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, pages 502–511.

Ulrich Paquet and Noam Koenigstein. 2013. One-class collaborative filtering with random graphs. In Proceedings of the 22nd International Conference on World Wide Web. ACM, pages 999–1008.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pages 1532–1543.

Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. 2010. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval 13(4):346–374.

Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. 2011. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web. ACM, pages 337–346.

Karl Stratos, Michael Collins, and Daniel Hsu. 2015. Model-based word embeddings from decompositions of count matrices. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). volume 1, pages 1282–1291.

Wen-tau Yih, Geoffrey Zweig, and John C Platt. 2012. Polarity inducing latent semantic analysis. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, pages 1212–1222.

Hsiang-Fu Yu, Mikhail Bilenko, and Chih-Jen Lin. 2017a. Selection of negative samples for one-class matrix factorization. In Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, pages 363–371.

Hsiang-Fu Yu, Hsin-Yuan Huang, Inderjit S Dhillon, and Chih-Jen Lin. 2017b. A unified algorithm for one-class structured matrix factorization with side information. In AAAI. pages 2845–2851.

Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Using Wiktionary for computing semantic relatedness. In AAAI. volume 8, pages 861–866.
