Cross-lingual Lexical Sememe Prediction

Fanchao Qi1*, Yankai Lin1*, Maosong Sun1,2†, Hao Zhu1, Ruobing Xie3, Zhiyuan Liu1
1Department of Computer Science and Technology, Tsinghua University; Institute for Artificial Intelligence, Tsinghua University; State Key Lab on Intelligent Technology and Systems, Tsinghua University
2Jiangsu Collaborative Innovation Center for Language Ability, Jiangsu Normal University
3Search Product Center, WeChat Search Application Department, Tencent, China
{qfc17, linyk14}@mails.tsinghua.edu.cn, [email protected], [email protected], [email protected], [email protected]
* Indicates equal contribution
† Corresponding author

Abstract

Sememes are defined as the minimum semantic units of human languages. As important knowledge sources, sememe-based linguistic knowledge bases have been widely used in many NLP tasks. However, most languages still do not have sememe-based linguistic knowledge bases. Thus we present the task of cross-lingual lexical sememe prediction, which aims to automatically predict sememes for words in other languages. We propose a novel framework to model correlations between sememes and multi-lingual words in a low-dimensional semantic space for sememe prediction. Experimental results on real-world datasets show that our proposed model achieves consistent and significant improvements over baseline methods in cross-lingual sememe prediction. The code and data of this paper are available at https://github.com/thunlp/CL-SP.

[Figure 1: An example of HowNet. The word apple has two senses, apple (fruit) and apple (brand); apple (fruit) is annotated with the sememe fruit, and apple (brand) with the sememes computer, PatternValue, able, bring and SpecificBrand.]

1 Introduction

Words are regarded as the smallest meaningful units of speech or writing that can stand by themselves in human languages, but they are not the smallest indivisible semantic units of meaning. That is, the meaning of a word can be represented as a set of semantic components. For example, “Man = human + male + adult” and “Boy = human + male + child”. In linguistics, the minimum semantic unit of meaning is named a sememe (Bloomfield, 1926). Some researchers believe that the semantic meanings of concepts such as words can be composed from a limited closed set of sememes, and that sememes can help us comprehend human languages better.

Unfortunately, the lexical sememes of words are not explicit in most human languages. Hence, people construct sememe-based linguistic knowledge bases (KBs) by manually annotating every word with a pre-defined closed set of sememes. HowNet (Dong and Dong, 2003) is one of the most well-known sememe-based linguistic KBs. Different from WordNet (Miller, 1995), which focuses on the relations between senses, HowNet annotates each word with one or more relevant sememes. As illustrated in Fig. 1, the word apple has two senses in HowNet, apple (fruit) and apple (brand). The sense apple (fruit) has one sememe, fruit, and the sense apple (brand) has five sememes: computer, PatternValue, able, bring and SpecificBrand. There are about 2,000 sememes and over 100 thousand labeled Chinese and English words in HowNet. HowNet has been widely used in various NLP applications such as word similarity computation (Liu and Li, 2002), word sense disambiguation (Zhang et al., 2005), question classification (Sun et al., 2007) and sentiment classification (Dang and Zhang, 2010).
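To make the word-sense-sememe structure of Fig. 1 concrete, here is a minimal sketch of how such annotations could be represented in Python. The nested-mapping layout and the lookup helper are our illustrative assumptions, not HowNet's actual data format.

```python
# HowNet-style annotation sketch: a word maps to its senses, and each
# sense maps to its set of sememes (the apple example from Fig. 1).
# This nested-dict layout is illustrative, not HowNet's real format.
hownet_sample = {
    "apple": {
        "apple (fruit)": {"fruit"},
        "apple (brand)": {"computer", "PatternValue", "able",
                          "bring", "SpecificBrand"},
    },
}

def sememes_of(word: str) -> set:
    """Collect the sememes attached to any sense of the given word."""
    senses = hownet_sample.get(word, {})
    return set().union(*senses.values()) if senses else set()

print(sorted(sememes_of("apple")))
# ['PatternValue', 'SpecificBrand', 'able', 'bring', 'computer', 'fruit']
```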
However, most languages do not have such sememe-based linguistic KBs, which prevents us from understanding and utilizing human languages to a greater extent. Therefore, it is important to build sememe-based linguistic KBs for various languages. Manually constructing a sememe-based linguistic KB requires the efforts of many linguistic experts and is time-consuming and labor-intensive. For example, the construction of HowNet took many Chinese linguistic experts more than 10 years.

To address the high labor cost of manual annotation, we propose a new task, cross-lingual lexical sememe prediction (CLSP), which aims to automatically predict lexical sememes for words in other languages and thereby assist linguistic experts in annotation. There are two critical challenges for CLSP: (1) There is no consistent one-to-one match between words in different languages. For example, the English word “beautiful” can correspond to either of the Chinese words “美丽” and “漂亮”. Hence, we cannot simply translate HowNet into another language, and recognizing the semantic meaning of a word in the other language becomes a critical problem. (2) Since there is a gap between the semantic meanings of words and those of sememes, we need to build semantic representations of words and sememes that capture the semantic relatedness between them.

To tackle these challenges, in this paper we propose a novel model for CLSP, which transfers a sememe-based linguistic KB from a source language to a target language. Our model contains three modules: (1) monolingual word embedding learning, which learns semantic representations of words for the source and target languages respectively; (2) cross-lingual word embedding alignment, which bridges the gap between the semantic representations of words in the two languages; and (3) sememe-based word embedding learning, which incorporates sememe information into word representations. For simplicity, we do not consider the hierarchy information in HowNet in this paper.
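As a toy illustration of how these three modules could fit together, the sketch below combines per-module losses into one joint objective. The specific loss forms (squared distances) and the plain summation are our expository assumptions, not the model's actual objectives; a Skip-gram sketch for module (1) is given after Eq. (4) below.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

# Toy embedding tables for source words, target words and sememes.
# The vocabulary, seed lexicon and annotations are invented for illustration.
src_emb = {w: rng.normal(size=dim) for w in ["美丽", "苹果"]}
tgt_emb = {w: rng.normal(size=dim) for w in ["beautiful", "apple"]}
sem_emb = {s: rng.normal(size=dim) for s in ["fruit", "computer"]}

def cross_lingual_loss(seed_lexicon):
    """Module (2): pull seed translation pairs together in the shared
    space; squared Euclidean distance is an illustrative choice."""
    return sum(np.sum((src_emb[s] - tgt_emb[t]) ** 2)
               for s, t in seed_lexicon)

def sememe_loss(annotations):
    """Module (3): pull each annotated word towards the centroid of its
    sememe embeddings; again an illustrative choice of loss."""
    loss = 0.0
    for word, sememes in annotations.items():
        centroid = np.mean([sem_emb[s] for s in sememes], axis=0)
        loss += np.sum((src_emb[word] - centroid) ** 2)
    return loss

# Assumed joint objective: simply sum the module losses (module (1),
# the monolingual Skip-gram loss, is sketched separately below).
seed_lexicon = [("美丽", "beautiful"), ("苹果", "apple")]
annotations = {"苹果": ["fruit", "computer"]}
print(cross_lingual_loss(seed_lexicon) + sememe_loss(annotations))
```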
In experiments, we take Chinese as the source language and English as the target language to show the effectiveness of our model. Experimental results show that our proposed model can effectively predict lexical sememes for words with different frequencies in the other language. By jointly learning the representations of sememes and of words in the source and target languages, our model also achieves consistent improvements on two auxiliary tasks, bilingual lexicon induction and monolingual word similarity computation.

2 Related Work

Since HowNet was published (Dong and Dong, 2003), it has attracted wide attention from researchers. Most related works focus on applying HowNet to specific NLP tasks (Liu and Li, 2002; Zhang et al., 2005; Sun et al., 2007; Dang and Zhang, 2010; Fu et al., 2013; Niu et al., 2017; Zeng et al., 2018; Gu et al., 2018). To the best of our knowledge, only Xie et al. (2017) and Jin et al. (2018) study augmenting HowNet by recommending sememes for new words. However, both works recommend sememes for monolingual words only and are not applicable to the cross-lingual setting. Accordingly, our work is the first effort to automatically perform cross-lingual sememe prediction to enrich sememe-based linguistic KBs.

Our model builds on word representation learning (WRL). Recent years have witnessed great advances in WRL: models like Skip-gram, CBOW (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014) are immensely popular and achieve remarkable performance in many NLP tasks. However, most WRL methods learn only distributional information of words from large corpora, while the valuable information contained in semantic lexicons is disregarded. Therefore, some works try to inject the semantic information of KBs into WRL (Faruqui et al., 2015; Liu et al., 2015; Mrkšić et al., 2016; Bollegala et al., 2016). Nevertheless, these works all target word-based KBs such as WordNet; few pay attention to incorporating the knowledge of sememe-based linguistic KBs.

There have also been plenty of studies on cross-lingual WRL (Upadhyay et al., 2016; Ruder, 2017). Most of them require parallel corpora (Zou et al., 2013; AP et al., 2014; Hermann and Blunsom, 2014; Kočiský et al., 2014; Gouws et al., 2015; Luong et al., 2015; Coulmance et al., 2015). Some adopt unsupervised or weakly supervised methods (Mikolov et al., 2013b; Vulić and Moens, 2015; Conneau et al., 2017; Artetxe et al., 2017). There are also works that use a seed lexicon as the cross-lingual signal (Dinu et al., 2014; Faruqui and Dyer, 2014; Lazaridou et al., 2015; Shi et al., 2015; Lu et al., 2015; Gouws et al., 2015; Wick et al., 2016; Ammar et al., 2016; Duong et al., 2016; Vulić and Korhonen, 2016).

For our cross-lingual sememe prediction task, bilingual WRL methods based on parallel data are unsuitable because most language pairs have no large parallel corpora. Unsupervised methods are not appropriate either, as they generally struggle to learn high-quality bilingual word embeddings. Therefore, we choose the seed lexicon approach in our model, and further introduce a matching mechanism inspired by Zhang et al. (2017) to enhance its performance.

3 Methodology

In this section, we introduce our novel model for CLSP. Here we define the language with sememe annotations as the source language and the language without sememe annotations as the target language. The main idea of our model is to learn the word embeddings of the source and target languages jointly in a unified semantic space, and then to predict sememes for words in the target language according to the words with similar semantic meanings in the source language.

We exploit the Skip-gram model (Mikolov et al., 2013a) to learn monolingual word embeddings. The Skip-gram model aims to maximize the predictive probability of context words conditioned on the centered word. Formally, taking the source side as an example, given a training word sequence $\{w^S_1, \dots, w^S_n\}$, the Skip-gram model intends to minimize

\[
\mathcal{L}^S_{\mathrm{mono}} = -\sum_{c=K+1}^{n-K} \sum_{\substack{-K \le k \le K \\ k \ne 0}} \log P(w^S_{c+k} \mid w^S_c), \tag{3}
\]

where $K$ is the size of the sliding window and $P(w^S_{c+k} \mid w^S_c)$ stands for the predictive probability of one of the context words conditioned on the centered word $w^S_c$, formalized by the following softmax function:

\[
P(w^S_{c+k} \mid w^S_c) = \frac{\exp(\mathbf{w}^S_{c+k} \cdot \mathbf{w}^S_c)}{\sum_{w^S \in V^S} \exp(\mathbf{w}^S \cdot \mathbf{w}^S_c)}, \tag{4}
\]

where $V^S$ is the source-language vocabulary and boldface symbols denote word embeddings.
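The sketch below is a direct, unoptimized numpy rendering of Eqs. (3) and (4): the full-softmax predictive probability over a toy vocabulary and the windowed negative log-likelihood. The toy corpus and the 50-dimensional embeddings are our assumptions; a practical implementation would replace the full softmax with negative sampling or hierarchical softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "apple", "is", "a", "fruit"]       # toy source vocabulary
idx = {w: i for i, w in enumerate(vocab)}
W = rng.normal(scale=0.1, size=(len(vocab), 50))   # word embeddings w^S

def predict_prob(context_word, center_word):
    """Eq. (4): softmax over dot products with the whole vocabulary."""
    scores = W @ W[idx[center_word]]
    probs = np.exp(scores - scores.max())          # max-shift for stability
    probs /= probs.sum()
    return probs[idx[context_word]]

def skipgram_loss(sequence, K=2):
    """Eq. (3): negative log-likelihood of the context words within a
    window of size K around each centered word (0-indexed here)."""
    loss = 0.0
    for c in range(K, len(sequence) - K):
        for k in range(-K, K + 1):
            if k != 0:
                loss -= np.log(predict_prob(sequence[c + k], sequence[c]))
    return loss

print(skipgram_loss(["the", "apple", "is", "a", "fruit"]))
```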