Cross-lingual Lexical Sememe Prediction

Fanchao Qi1*, Yankai Lin1*, Maosong Sun1,2†, Hao Zhu1, Ruobing Xie3, Zhiyuan Liu1
1Department of Computer Science and Technology, Tsinghua University
Institute for Artificial Intelligence, Tsinghua University
State Key Lab on Intelligent Technology and Systems, Tsinghua University
2Jiangsu Collaborative Innovation Center for Language Ability, Jiangsu Normal University
3Search Product Center, WeChat Search Application Department, Tencent, China
{qfc17,linyk14}@mails.tsinghua.edu.cn, [email protected], [email protected], [email protected], [email protected]

Abstract

Sememes are defined as the minimum semantic units of human languages. As important knowledge sources, sememe-based linguistic knowledge bases have been widely used in many NLP tasks. However, most languages still do not have sememe-based linguistic knowledge bases. Thus we present a task of cross-lingual lexical sememe prediction, aiming to automatically predict sememes for words in other languages. We propose a novel framework to model correlations between sememes and multi-lingual words in a low-dimensional semantic space for sememe prediction. Experimental results on real-world datasets show that our proposed model achieves consistent and significant improvements as compared to baseline methods in cross-lingual sememe prediction. The codes and data of this paper are available at https://github.com/thunlp/CL-SP.

[Figure 1: An example of HowNet, showing the word "apple" with its two senses apple (fruit) and apple (brand) and their sememes fruit, computer, PatternValue, able, bring and SpecificBrand.]

* Indicates equal contribution.
† Corresponding author.

1 Introduction

Words are regarded as the smallest meaningful units of speech or writing that can stand by themselves in human languages, but not as the smallest indivisible units of meaning. That is, the meaning of a word can be represented as a set of semantic components. For example, "Man = human + male + adult" and "Boy = human + male + child". In linguistics, the minimum semantic unit of meaning is named sememe (Bloomfield, 1926). Some people believe that the semantic meanings of concepts such as words can be composed of a limited closed set of sememes, and that sememes can help us comprehend human languages better.

Unfortunately, the lexical sememes of words are not explicit in most human languages. Hence, people construct sememe-based linguistic knowledge bases (KBs) by manually annotating every word with sememes drawn from a pre-defined closed set. HowNet (Dong and Dong, 2003) is one of the most well-known sememe-based linguistic KBs. Different from WordNet (Miller, 1995), which focuses on the relations between senses, HowNet annotates each word with one or more relevant sememes. As illustrated in Fig. 1, the word apple has two senses in HowNet, apple (fruit) and apple (brand). The sense apple (fruit) has one sememe, fruit, while the sense apple (brand) has five sememes: computer, PatternValue, able, bring and SpecificBrand. There are about 2,000 sememes and over 100 thousand annotated Chinese and English words in HowNet. HowNet has been widely used in various NLP applications such as word similarity computation (Liu and Li, 2002), word sense disambiguation (Zhang et al., 2005), question classification (Sun et al., 2007) and sentiment classification (Dang and Zhang, 2010).

However, most languages do not have such sememe-based linguistic KBs, which prevents us from understanding and utilizing those languages to a greater extent. Therefore, it is important to build sememe-based linguistic KBs for various languages. Manual construction of sememe-based linguistic KBs requires the efforts of many linguistic experts and is time-consuming and labor-intensive. For example, the construction of HowNet cost many Chinese linguistic experts more than 10 years.

To address the high labor cost of manual annotation, we propose a new task, cross-lingual lexical sememe prediction (CLSP), which aims to automatically predict lexical sememes for words in other languages and thereby assist the annotation work of linguistic experts. There are two critical challenges for CLSP: (1) There is no consistent one-to-one match between words in different languages. For example, the English word "beautiful" can refer to either of the Chinese words "美丽" and "漂亮". Hence, we cannot simply translate HowNet into another language, and recognizing the semantic meaning of a word in another language becomes a critical problem. (2) Since there is a gap between the semantic meanings of words and sememes, we need to build semantic representations for both words and sememes to capture the semantic relatedness between them.

To tackle these challenges, we propose a novel model for CLSP, which transfers a sememe-based linguistic KB from a source language to a target language. Our model contains three modules: (1) monolingual word embedding learning, which learns semantic representations of words for the source and target languages respectively; (2) cross-lingual word embedding alignment, which bridges the gap between the semantic representations of words in the two languages; and (3) sememe-based word embedding learning, which incorporates sememe information into word representations. For simplicity, we do not consider the hierarchy information in HowNet in this paper.

In experiments, we take Chinese as the source language and English as the target language to show the effectiveness of our model. Experimental results show that our proposed model can effectively predict lexical sememes for words of different frequencies in the target language. By jointly learning the representations of sememes and of words in the source and target languages, our model also achieves consistent improvements on two auxiliary experiments, bilingual lexicon induction and monolingual word similarity computation.

2 Related Work

Since HowNet was published (Dong and Dong, 2003), it has attracted wide attention from researchers. Most related works focus on applying HowNet to specific NLP tasks (Liu and Li, 2002; Zhang et al., 2005; Sun et al., 2007; Dang and Zhang, 2010; Fu et al., 2013; Niu et al., 2017; Zeng et al., 2018; Gu et al., 2018). To the best of our knowledge, only Xie et al. (2017) and Jin et al. (2018) study augmenting HowNet by recommending sememes for new words. However, both of these works aim to recommend sememes for monolingual words and are not applicable to the cross-lingual setting. Accordingly, our work is the first effort to automatically perform cross-lingual sememe prediction to enrich sememe-based linguistic KBs.

Our model adopts the method of word representation learning (WRL). Recent years have witnessed great advances in WRL. Models like Skip-gram, CBOW (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014) are immensely popular and achieve remarkable performance in many NLP tasks. However, most WRL methods learn distributional information of words from large corpora while disregarding the valuable information contained in semantic KBs. Therefore, some works try to inject the semantic information of KBs into WRL (Faruqui et al., 2015; Liu et al., 2015; Mrkšić et al., 2016; Bollegala et al., 2016). Nevertheless, these works all target word-based KBs such as WordNet; few works pay attention to incorporating knowledge from sememe-based linguistic KBs.

There have also been plenty of studies on cross-lingual WRL (Upadhyay et al., 2016; Ruder, 2017). Most of them require parallel corpora (Zou et al., 2013; AP et al., 2014; Hermann and Blunsom, 2014; Kočiský et al., 2014; Gouws et al., 2015; Luong et al., 2015; Coulmance et al., 2015). Some adopt unsupervised or weakly supervised methods (Mikolov et al., 2013b; Vulić and Moens, 2015; Conneau et al., 2017; Artetxe et al., 2017). There are also works using a seed lexicon as the cross-lingual signal (Dinu et al., 2014; Faruqui and Dyer, 2014; Lazaridou et al., 2015; Shi et al., 2015; Lu et al., 2015; Gouws et al., 2015; Wick et al., 2016; Ammar et al., 2016; Duong et al., 2016; Vulić and Korhonen, 2016).

For our cross-lingual sememe prediction task, parallel-data-based bilingual WRL methods are unsuitable because most language pairs have no large parallel corpora. Unsupervised methods are not appropriate either, as they generally struggle to learn high-quality bilingual word embeddings. Therefore, we choose the seed lexicon method in our model, and further introduce a matching mechanism inspired by Zhang et al. (2017) to enhance its performance.

3 Methodology

In this section, we introduce our novel model for CLSP. We define the language with sememe annotations as the source language and the language without sememe annotations as the target language. The main idea of our model is to learn word embeddings of the source and target languages jointly in a unified semantic space, and then predict sememes for words in the target language according to the words with similar semantic meanings in the source language.

Our method consists of three parts: monolingual word representation learning, cross-lingual word embedding alignment and sememe-based word representation learning. Hence, we define the objective function of our method as the sum of the three corresponding terms:

    \mathcal{L} = \mathcal{L}_{mono} + \mathcal{L}_{cross} + \mathcal{L}_{sememe}.    (1)

Here, the monolingual term \mathcal{L}_{mono} is designed for learning monolingual word embeddings from non-parallel corpora for the source and target languages respectively. The cross-lingual term \mathcal{L}_{cross} aims to align cross-lingual word embeddings in a unified semantic space. And \mathcal{L}_{sememe} draws sememe information into word representation learning and conduces to better word embeddings for sememe prediction. In the following subsections, we introduce the three parts in detail.

3.1 Monolingual Word Representation

Monolingual word representation is responsible for capturing the regularities in the monolingual corpora of the source and target languages. Since the two corpora are non-parallel, \mathcal{L}_{mono} comprises two monolingual sub-models that are independent of each other:

    \mathcal{L}_{mono} = \mathcal{L}^S_{mono} + \mathcal{L}^T_{mono},    (2)

where the superscripts S and T denote the source and target languages respectively.

As a common practice, we choose the well-established Skip-gram model to obtain monolingual word embeddings. The Skip-gram model maximizes the predictive probability of context words conditioned on the centered word. Formally, taking the source side as an example, given a training word sequence \{w^S_1, \cdots, w^S_n\}, the Skip-gram model intends to minimize:

    \mathcal{L}^S_{mono} = -\sum_{c=K+1}^{n-K} \sum_{-K \le k \le K,\, k \ne 0} \log P(w^S_{c+k} \mid w^S_c),    (3)

where K is the size of the sliding window. P(w^S_{c+k} \mid w^S_c) stands for the predictive probability of one of the context words conditioned on the centered word w^S_c, formalized by the following softmax function:

    P(w^S_{c+k} \mid w^S_c) = \frac{\exp(\mathbf{w}^S_{c+k} \cdot \mathbf{w}^S_c)}{\sum_{w^S_s \in V^S} \exp(\mathbf{w}^S_s \cdot \mathbf{w}^S_c)},    (4)

in which V^S indicates the word vocabulary of the source language. \mathcal{L}^T_{mono} can be formulated similarly.

3.2 Cross-lingual Word Embedding Alignment

Cross-lingual word embedding alignment aims to build a unified semantic space for the words in the source and target languages. Inspired by Zhang et al. (2017), we align the cross-lingual word embeddings with the signals of a seed lexicon and self-matching.

Formally, \mathcal{L}_{cross} is composed of two terms, alignment by seed lexicon \mathcal{L}_{seed} and alignment by matching \mathcal{L}_{match}:

    \mathcal{L}_{cross} = \lambda_s \mathcal{L}_{seed} + \lambda_m \mathcal{L}_{match},    (5)

where \lambda_s and \lambda_m are hyperparameters controlling the relative weights of the two terms.

Alignment by Seed Lexicon

The seed lexicon term \mathcal{L}_{seed} encourages the word embeddings of translation pairs in a seed lexicon \mathcal{D} to be close, which can be achieved via an L2 regularizer:

    \mathcal{L}_{seed} = \sum_{\langle w^S_s, w^T_t \rangle \in \mathcal{D}} \| \mathbf{w}^S_s - \mathbf{w}^T_t \|^2,    (6)

in which w^S_s and w^T_t indicate the source- and target-language words of a pair in the seed lexicon.
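As a concrete illustration, the seed-lexicon term and its gradient can be sketched in a few lines of NumPy. This is a toy sketch, not the released code: the function name and the dense-dictionary representation of embeddings are our assumptions.

```python
import numpy as np

def seed_lexicon_loss_and_grads(src_emb, tgt_emb, seed_pairs):
    """L_seed = sum ||w_s - w_t||^2 over seed-lexicon pairs (cf. Eq. 6).

    src_emb, tgt_emb: dicts mapping words to 1-D numpy vectors.
    seed_pairs: list of (source_word, target_word) tuples.
    Returns the loss and per-word gradients usable in a plain SGD step.
    """
    loss = 0.0
    grads_src, grads_tgt = {}, {}
    for ws, wt in seed_pairs:
        diff = src_emb[ws] - tgt_emb[wt]
        loss += float(diff @ diff)
        # d/dw_s ||w_s - w_t||^2 = 2 (w_s - w_t); the target gradient is its negative
        grads_src[ws] = grads_src.get(ws, 0.0) + 2.0 * diff
        grads_tgt[wt] = grads_tgt.get(wt, 0.0) - 2.0 * diff
    return loss, grads_src, grads_tgt
```

In the full model this term would be weighted by \lambda_s and optimized jointly with the other objectives by SGD.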
Alignment by Matching Mechanism

The matching process is founded on the assumption that each target word should be matched to a single source word or to a special empty word, and vice versa. The goal of the matching process is to find the matched source (target) word for each target (source) word and to maximize the matching probabilities of all the matched word pairs. The loss of this part can be formulated as:

    \mathcal{L}_{match} = \mathcal{L}^{T2S}_{match} + \mathcal{L}^{S2T}_{match},    (7)

where \mathcal{L}^{T2S}_{match} is the term for target-to-source matching and \mathcal{L}^{S2T}_{match} is the term for source-to-target matching. Next, we give a detailed explanation of target-to-source matching; source-to-target matching is defined in the same way.

We first introduce a latent variable m_t \in \{0, 1, \cdots, |V^S|\} (t = 1, 2, \cdots, |V^T|) for each target word w^T_t, where |V^S| and |V^T| indicate the vocabulary sizes of the source and target languages respectively. Here, m_t specifies the index of the source word that w^T_t matches with, and m_t = 0 signifies that the empty word is matched. Then we have \mathbf{m} = \{m_1, m_2, \cdots, m_{|V^T|}\}, and can formalize the target-to-source matching term:

    \mathcal{L}^{T2S}_{match} = -\log P(\mathcal{C}^T \mid \mathcal{C}^S) = -\log \sum_{\mathbf{m}} P(\mathcal{C}^T, \mathbf{m} \mid \mathcal{C}^S),    (8)

where \mathcal{C}^T and \mathcal{C}^S denote the target and source corpora respectively. Here, we simply assume that the matching processes of target words are independent of each other. Therefore, we have:

    P(\mathcal{C}^T, \mathbf{m} \mid \mathcal{C}^S) = \prod_{w^T \in \mathcal{C}^T} P(w^T, \mathbf{m} \mid \mathcal{C}^S) = \prod_{t=1}^{|V^T|} P(w^T_t \mid w^S_{m_t})^{c(w^T_t)},    (9)

where w^S_{m_t} is the source word that w^T_t matches with, and c(w^T_t) is the number of times w^T_t occurs in the target corpus.

3.3 Sememe-based Word Representation

Sememe-based word representation is intended to improve word embeddings for sememe prediction by introducing the information of the sememe-based linguistic KB of the source language. In this section, we present two methods of sememe-based word representation.

Word Relation-based Approach

A simple and intuitive method is to let words with similar sememe annotations tend to have similar word embeddings, which we name the word relation-based approach. To begin with, we construct a synonym list from the sememe-based linguistic KB of the source language, where we regard words sharing a certain number of sememes as synonyms. Next, we force synonyms to have closer word embeddings.

Formally, we let \mathbf{w}^S_i be the original word embedding of w^S_i and \hat{\mathbf{w}}^S_i be its adjusted word embedding, and let Syn(w^S_i) denote the synonym set of word w^S_i. Then the loss function is:

    \mathcal{L}_{sememe} = \sum_{w^S_i \in V^S} \Big[ \alpha_i \| \mathbf{w}^S_i - \hat{\mathbf{w}}^S_i \|^2 + \sum_{w^S_j \in Syn(w^S_i)} \beta_{ij} \| \hat{\mathbf{w}}^S_i - \hat{\mathbf{w}}^S_j \|^2 \Big],    (10)

where \alpha and \beta control the relative strengths of the two terms. It should be noted that the idea of forcing similar words to have close word embeddings resembles the state-of-the-art retrofitting approach (Faruqui et al., 2015). However, the retrofitting approach cannot be applied directly here, because sememe-based linguistic KBs such as HowNet do not provide the synonym list it needs.

Sememe Embedding-based Approach

Simple and effective as the word relation-based approach is, it cannot make full use of the information in sememe-based linguistic KBs, because it disregards the complicated relations between sememes and words as well as the relations between different sememes. To address this limitation, we propose the sememe embedding-based approach, which learns sememe and word embeddings jointly.

In this approach, we represent sememes with distributed vectors as well and place them into the same semantic space as words. Similar to SPSE (Xie et al., 2017), which learns sememe embeddings by decomposing a word-sememe matrix and a sememe-sememe matrix, our method utilizes sememe embeddings as regularizers to learn better word embeddings. Different from SPSE, we do not use pre-trained word embeddings; instead, we learn word embeddings and sememe embeddings simultaneously.

More specifically, from HowNet we can extract a source-side word-sememe matrix \mathbf{M}^S, with M^S_{sj} = 1 indicating that word w^S_s is annotated with sememe x_j, and M^S_{sj} = 0 otherwise. Hence, by factorizing \mathbf{M}^S, we can define the loss function as:

    \mathcal{L}_{sememe} = \sum_{w^S_s \in V^S,\, x_j \in X} (\mathbf{w}^S_s \cdot \mathbf{x}_j + b_s + b'_j - M^S_{sj})^2,    (11)

where b_s and b'_j are the biases of w^S_s and x_j, and X denotes the sememe set.

In this approach, we obtain word and sememe embeddings in a unified semantic space. The sememe embeddings bear all the information about the relationships between words and sememes, and they inject this information into the word embeddings. Therefore, the word embeddings are expected to be more suitable for sememe prediction.

3.4 Training and Prediction

Training

When training monolingual word embeddings, we use negative sampling following Mikolov et al. (2013a). In the optimization of the sememe part, we adopt the iterative updating method following Faruqui et al. (2015) for the word relation-based approach and stochastic gradient descent (SGD) for the sememe embedding-based approach. For the optimization of the seed lexicon term of the cross-lingual part, we also apply SGD.

Nevertheless, due to the existence of the latent variable, optimization of the matching process in the cross-lingual part poses a challenge. We settle on the Viterbi EM algorithm to address this problem. We again take the target-to-source side as an example and describe the training process in detail.

The Viterbi EM algorithm alternates between a Viterbi E step and a subsequent M step. The Viterbi E step aims to find the most probable matched word pairs given the current parameters. Considering the independence assumption, we can seek the match for each word individually:

    \hat{m}_t = \arg\max_{s \in \{0, 1, \cdots, |V^S|\}} P(w^T_t \mid w^S_s).    (12)

As for the parametrization of the matching probability, there are various choices. For computational simplicity, we select cosine similarity:

    P(w^T_t \mid w^S_s) = \epsilon  if s = 0;  \cos(\mathbf{w}^T_t, \mathbf{w}^S_s)  otherwise,    (13)

where \epsilon is a hyperparameter indicating the probability of matching the empty word. Therefore, the Viterbi E step computes the matching by:

    \tilde{m}_t = \arg\max_{s \in \{1, \cdots, |V^S|\}} \cos(\mathbf{w}^T_t, \mathbf{w}^S_s),    (14)

    \hat{m}_t = \tilde{m}_t  if \cos(\mathbf{w}^T_t, \mathbf{w}^S_{\tilde{m}_t}) > \epsilon;  0  otherwise.    (15)

From this, we can see that \epsilon serves as a threshold to keep out unreliable matched pairs.

The Viterbi M step performs maximization as if the latent variable had been observed in the Viterbi E step. Thus, we can treat the matched pairs as correct translations and use an L2 regularizer as well. Consequently, the M step computes:

    (\hat{\mathbf{w}}^S, \hat{\mathbf{w}}^T) = \arg\max_{\mathbf{w}^S, \mathbf{w}^T} \mathcal{M}(\mathbf{w}^S, \mathbf{w}^T),    (16)

where \mathcal{M}(\mathbf{w}^S, \mathbf{w}^T) is defined as:

    \mathcal{M}(\mathbf{w}^S, \mathbf{w}^T) = -\sum_{t=1}^{|V^T|} \mathbb{I}[\tilde{m}_t \ne 0] \, \frac{c(w^T_t)}{|\mathcal{C}^T|} \, \| \mathbf{w}^T_t - \mathbf{w}^S_{\tilde{m}_t} \|^2.    (17)

Prediction

We assume that words with similar sememe annotations are similar and that similar words should have similar sememes, which resembles collaborative filtering in personalized recommendation. Hence, we can recommend sememes for target words according to their most similar source words. Formally, we define the score function P(x_j \mid w^T_t) of sememe x_j given a target word w^T_t as:

    P(x_j \mid w^T_t) = \sum_{w^S_s \in V^S} \cos(\mathbf{w}^S_s, \mathbf{w}^T_t) \cdot M^S_{sj} \cdot c^{r_s},    (18)

where r_s is the descending rank of the word similarity \cos(\mathbf{w}^S_s, \mathbf{w}^T_t) for the source word w^S_s, and c \in (0, 1) is a hyperparameter. Thus, c^{r_s} is a declined confidence factor which eliminates the noise from irrelevant source words and concentrates on the most similar source words when predicting sememes for target words.
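The scoring function above can be sketched as follows. This is a toy illustration under our own assumptions: the function name, the dictionary-based data layout, and the rank-from-zero convention (so the most similar source word gets weight c^0 = 1) are ours, not from the released code.

```python
import numpy as np

def predict_sememes(w_t, src_words, src_emb, M, sememes, c=0.8, top_n=100):
    """Score sememes for a target word in the spirit of Eq. (18):
    score(x_j) = sum_s cos(w_s, w_t) * M_sj * c^rank(s),
    summing over only the top_n source words most similar to w_t.

    src_emb: dict word -> 1-D numpy vector; M: dict word -> set of its sememes.
    Returns (sememe, score) pairs sorted by descending score.
    """
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Rank source words by similarity to the target word, most similar first.
    sims = sorted(((cos(src_emb[w], w_t), w) for w in src_words), reverse=True)
    scores = {x: 0.0 for x in sememes}
    for rank, (sim, w) in enumerate(sims[:top_n]):
        for x in M.get(w, ()):
            scores[x] += sim * (c ** rank)   # declined confidence factor
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

The top-scored sememes are then recommended as the predicted annotation for the target word.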
4 Experiments

In this section, we first introduce the dataset used in the experiments and then describe the experimental settings of both the baseline method and our model. Next, we present the experimental results of different methods on the task of cross-lingual lexical sememe prediction, followed by detailed analysis and case studies. We then investigate the effect of word frequency on cross-lingual sememe prediction results, and finally perform further quantitative analysis on two sub-tasks, bilingual lexicon induction and word similarity computation.

4.1 Dataset

We use the sememe annotations in HowNet for sememe prediction. HowNet annotates sememes for 118,346 Chinese words and 104,025 English words, with 1,983 sememes in total. Since some sememes appear only a few times in HowNet and are thus expected to be unimportant, we filter out these low-frequency sememes. Specifically, the frequency threshold is 5, and the final number of distinct sememes used in our experiments is 1,400.

In our experiments, Chinese is the source language and English is the target language. To learn Chinese and English monolingual word embeddings, we extract about 2.0G of text from Sogou-T [1] and Wikipedia [2] respectively, and we use THULAC [3] (Li and Sun, 2009) for Chinese word segmentation.

As for the seed lexicon, we build it in a similar way to Zhang et al. (2017). First, we employ the Google Translation API [4] to translate the source-side (Chinese) vocabulary. Then the translations in the target language (English) are queried again in the reverse direction to translate back to the source language (Chinese), and we only keep the translation pairs whose back-translated words match the original source words.

In the task of bilingual lexicon induction, we use the Chinese-English Translation Lexicon Version 3.0 [5] as the gold standard. In the task of word similarity computation, we choose the WordSim-240 and WordSim-297 (Jin and Wu, 2012) datasets for Chinese, and the WordSim-353 (Finkelstein et al., 2002) and SimLex-999 (Hill et al., 2015) datasets for English. These datasets contain word pairs together with human-assigned similarity scores. The word vectors are evaluated by ranking the word pairs according to their cosine similarities and measuring Spearman's rank correlation coefficient with the human ratings.

[1] Sogou-T is a corpus of web pages provided by a Chinese commercial search engine. https://www.sogou.com/labs/resource/t.php
[2] https://dumps.wikimedia.org/
[3] http://thulac.thunlp.org/
[4] https://cloud.google.com/translate/
[5] https://catalog.ldc.upenn.edu/LDC2002L27

4.2 Experimental Settings

We empirically set the dimension of word and sememe embeddings to 200, and all embeddings are randomly initialized. In monolingual word embedding learning, we follow the optimal parameter settings in Mikolov et al. (2013a): the window size K is 5, the down-sampling rate for high-frequency words is 10^-5, the learning rate is 0.025 and the number of negative samples is 5. In cross-lingual word embedding alignment, the seed lexicon term weight \lambda_s is 0.01 and the matching term weight \lambda_m is 1,000. In sememe-based word representation, the number of shared sememes for synonyms in the word relation-based approach is 2. In the training of the matching process, we set \epsilon to 0.5 empirically. When predicting sememes for words in the target language, we only consider the 100 most similar source words for each target word, and the attenuation parameter c is 0.8. The testing set for cross-lingual lexical sememe prediction contains 2,000 English words randomly selected from the vocabulary.

4.3 Cross-lingual Lexical Sememe Prediction

We evaluate our model by recommending sememes for English words. In HowNet, many words have multiple sememes, so sememe prediction can be regarded as a multi-label classification task. We use mean average precision (MAP) and F1 score to evaluate the sememe prediction results.

We compare our model with the word relation-based approach (named CLSP-WR) and our model which jointly trains word and sememe embeddings (named CLSP-SE) against a baseline method, BiLex (Zhang et al., 2017), a bilingual WRL model without incorporation of sememe information. For BiLex, we use its trained bilingual word embeddings to predict sememes for words in the target language with our sememe prediction approach.

Table 1 exhibits the evaluation results of cross-lingual lexical sememe prediction with different seed lexicon sizes.

Table 1: Evaluation results of cross-lingual lexical sememe prediction with different seed lexicon sizes.

Method    Seed Lexicon    MAP      F1 Score
BiLex     1000            27.57    16.08
BiLex     2000            33.79    22.33
BiLex     4000            35.78    25.74
BiLex     6000            38.29    28.71
CLSP-WR   1000            28.12    18.55
CLSP-WR   2000            33.78    23.64
CLSP-WR   4000            38.30    27.74
CLSP-WR   6000            41.23    30.64
CLSP-SE   1000            31.78    18.22
CLSP-SE   2000            37.70    24.31
CLSP-SE   4000            40.77    29.33
CLSP-SE   6000            43.16    32.49

[Figure 2: Two examples of nearest English and Chinese words, showing t-SNE projections of the neighborhoods of "handcuffs" (with words such as 手铐 "handcuffs", 镣铐 "shackles", cuffs, noose) and "canoeist" (with words such as 独木舟 "canoe", 皮艇 "kayak", 短跑 "sprint", 名将 "sports star", kayak, rower, canoeist).]
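The per-word MAP and F1 metrics reported above can be sketched as follows. The exact evaluation protocol is not spelled out here, so this is a sketch under a common convention (assumed by us) in which F1 is computed over the top-k predicted sememes.

```python
def average_precision(ranked_sememes, gold):
    """AP for one word: mean of precision@k taken at each gold hit in the ranking."""
    hits, precisions = 0, []
    for k, x in enumerate(ranked_sememes, start=1):
        if x in gold:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(gold) if gold else 0.0

def f1_at_k(ranked_sememes, gold, k):
    """F1 of the top-k predicted sememes against the gold annotation set."""
    pred = set(ranked_sememes[:k])
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)
```

MAP is then the mean of `average_precision` over all test words.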

The seed lexicon size varies in {1000, 2000, 4000, 6000}; the largest size is 6000 because that is the maximum number of translation word pairs we can obtain from the bilingual corpora. From the table, we can clearly see that:

(1) Our two models perform much better than BiLex in all seed lexicon size settings. This indicates that incorporating sememe information into word embeddings can effectively improve the performance of predicting sememes for target words. The reason is that both of our models make words with similar sememe annotations have similar embeddings, and as a result, we can recommend better sememes for a target word according to its related source words.

(2) The CLSP-SE model achieves better results than the CLSP-WR model. The reason is that by representing sememes in a latent semantic space, the CLSP-SE model can further capture the relatedness between sememes as well as the relatedness between words and sememes, which is helpful for modeling the representations of words with similar sememes.

4.4 Case Study

In the case study, we conduct qualitative analysis to explain the effectiveness of our models with detailed cases. We show two examples of cross-lingual sememe prediction, in which we predict sememes for handcuffs and canoeist. Fig. 2 shows the embeddings of the five closest Chinese and English words to handcuffs and canoeist, with the vector of each word projected down to two dimensions using t-SNE (Maaten and Hinton, 2008). Table 2 lists the top-5 sememes we predict for the two words; the sememes annotated for each word in HowNet are in boldface. In the table, we also exhibit the annotated sememes of the five closest Chinese words.

In the first example, our model finds the best translated word for handcuffs in Chinese, 手铐 "handcuffs", whose sememe annotations are exactly the same as those of handcuffs. In addition, the second closest Chinese word, 镣铐 "shackles", is a synonym of 手铐 "handcuffs" and also has the same sememe annotations. Therefore, our model successfully predicts all the correct sememes. From the prediction results of this example, we notice that our model can accurately predict general sememes like 用具 "tool" and 人 "human", which are supposed to be difficult to predict.

In the second example, an accurate Chinese translated counterpart for canoeist does not exist, but our model still hits all three annotated sememes within the top-5 predicted sememes. By observing the most similar Chinese words, we find that although these words do not have the same meaning as canoeist, they are related to canoeist in different aspects. For example, 短跑 "sprint" and canoeist are both in the sports domain, so they share the sememes 锻炼 "exercise" and 体育 "sport". 名将 "sports star" denotes a sports star and can provide the sememe 人 "human" in sememe prediction. Furthermore, it is noteworthy that our model predicts 船 "ship" due to the nearest Chinese words 独木舟 "canoe" and 皮艇 "kayak", whereas 船 "ship" is not annotated for canoeist in HowNet. It is obvious that 船 "ship" is an appropriate sememe for canoeist. Since HowNet is manually annotated by experts, misannotated words inevitably exist, which in some cases underestimates our models.
Table 2: Two examples of cross-lingual lexical sememe prediction. For each English word, the top-5 predicted sememes are listed together with the annotated sememes of its five nearest Chinese words.

English Word: handcuffs: 用具 "tool", "police", "detain", 人 "human", "guilty"
5 Nearest Chinese Words:
    手铐 "handcuffs": "guilty", "police", 人 "human", "detain", 用具 "tool"
    镣铐 "shackles": "guilty", "police", 人 "human", "detain", 用具 "tool"
    绑 "tie": 包扎 "wrap"
    螺丝刀 "screwdriver": 用具 "tool", 放松 "loosen", 勒紧 "tighten"
    绳 "rope": 线 "linear", 材料 "material", 拴连 "fasten"

English Word: canoeist: 锻炼 "exercise", 人 "human", 体育 "sport", 事情 "fact", 船 "ship"
5 Nearest Chinese Words:
    短跑 "sprint": 事情 "fact", 锻炼 "exercise", 体育 "sport"
    独木舟 "canoe": 船 "ship"
    皮艇 "kayak": 船 "ship"
    名将 "sports star": 著名 "famous", 人 "human", 官 "official", 军 "military"
    皮划艇 "kayak": 事情 "fact", 锻炼 "exercise", 体育 "sport"

4.5 Effect of Word Frequency

To explore how the frequencies of target words affect cross-lingual sememe prediction results, we split the testing set into four subsets according to word frequency and calculate the sememe prediction MAP and F1 score for each subset. The results are shown in Table 3.

Table 3: Evaluation results of cross-lingual lexical sememe prediction with different word frequencies. The number of words in each frequency range is 497, 458, 522 and 523 respectively.

Method    Word Frequency    MAP      F1 Score
BiLex     <200              30.35    21.83
BiLex     200-500           34.83    25.95
BiLex     501-2500          40.21    28.62
BiLex     >2500             47.56    35.80
CLSP-WR   <200              34.73    24.41
CLSP-WR   200-500           39.50    29.49
CLSP-WR   501-2500          43.92    33.87
CLSP-WR   >2500             47.33    34.99
CLSP-SE   <200              36.54    27.49
CLSP-SE   200-500           41.46    30.09
CLSP-SE   501-2500          45.35    35.01
CLSP-SE   >2500             49.34    37.16

From the table we can see that: (1) The more frequently a target word appears in the corpus, the better its predicted sememes are. This is because high-frequency words normally have better word embeddings, which are crucial to sememe prediction. (2) Our models evidently perform better than BiLex at all word frequencies, especially for low-frequency words. This indicates that by considering the external information of HowNet, our models are more robust and can competently handle sparse scenarios.

4.6 Further Quantitative Analysis

In this section, we conduct two typical auxiliary experiments to further analyze the superiority of our models quantitatively.

Bilingual Lexicon Induction

Our models learn bilingual word embeddings in one unified semantic space. Here we use translation top-1 and top-5 average precision (P@1 and P@5) to evaluate the bilingual lexicon induction performance of our models and BiLex. The seed lexicon size again varies in {1000, 2000, 4000, 6000}.

Table 4: Bilingual lexicon induction performance with different seed lexicon sizes.

Method    Seed Lexicon    P@1      P@5
BiLex     1000            6.48     10.78
BiLex     2000            10.84    15.84
BiLex     4000            19.48    23.96
BiLex     6000            25.89    29.59
CLSP-WR   1000            6.89     11.28
CLSP-WR   2000            11.96    18.08
CLSP-WR   4000            19.50    25.78
CLSP-WR   6000            25.83    31.03
CLSP-SE   1000            6.60     11.04
CLSP-SE   2000            11.90    18.62
CLSP-SE   4000            19.26    25.11
CLSP-SE   6000            26.91    32.17

The results are shown in Table 4. From this table, we observe that our models, especially the CLSP-SE model, enhance word translation performance compared to BiLex no matter how large the seed lexicon is. This indicates that our models can bind bilingual word embeddings better.

Word Similarity Computation

We also evaluate monolingual word similarity computation on the WordSim-240 (WS-240) and WordSim-297 (WS-297) datasets for Chinese, and the WordSim-353 (WS-353) and SimLex-999 (SL-999) datasets for English.

Table 5: Performance on monolingual word similarity computation with seed lexicon size 6000.

Method     Chinese (source)       English (target)
           WS-240    WS-297       WS-353    SL-999
BiLex      60.36     62.17        60.46     27.22
CLSP-WR    61.27     65.25        60.46     27.22
CLSP-SE    60.84     65.62        62.47     28.79

Table 5 shows the results of monolingual word similarity computation on the four datasets. From the table, we find that: (1) Our models perform better than BiLex on both Chinese word similarity datasets, which signifies that incorporating sememe information helps learn better monolingual embeddings. (2) The CLSP-WR model does not improve the English word similarity results, but the CLSP-SE model does. This is because the CLSP-WR model only post-processes the Chinese word embeddings and keeps the English word embeddings unchanged, while the CLSP-SE model performs bilingual alignment and sememe information incorporation together, which improves the English word embeddings along with the Chinese ones.

5 Conclusion and Future Work

In this paper, we introduce a new task of cross-lingual sememe prediction. This task is important because the construction of sememe-based linguistic knowledge bases in various languages is beneficial to better understanding these languages. We propose a simple and effective model for this task, consisting of monolingual word representation learning, cross-lingual word representation alignment and sememe-based word representation learning. Experimental results on real-world datasets show that our model achieves consistent and significant improvements compared to the baseline method in cross-lingual sememe prediction.

In the future, we will explore the following research directions: (1) In this paper, for simplification, we ignore the rich hierarchy information in HowNet as well as the fact that a word may have multiple senses. We will extend our models to consider the structure information of sememes and the multiple senses of words. (2) Our framework for cross-lingual lexical sememe prediction can be transferred to other cross-lingual tasks. We will explore the effectiveness of our model in tasks such as cross-lingual information retrieval.

Acknowledgments

This research is funded by the National 973 project (No. 2014CB340501). It is also partially supported by the NExT++ project, the National Research Foundation, Prime Minister's Office, Singapore under its IRC@Singapore Funding Initiative. Hao Zhu is supported by the Tsinghua University Initiative Scientific Research Program. We also thank the anonymous reviewers for their valuable comments and suggestions.

References

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A Smith. 2016. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.

Sarath Chandar AP, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Proceedings of NIPS.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of ACL.

Leonard Bloomfield. 1926. A set of postulates for the science of language. Language, 2(3):153–164.

Danushka Bollegala, Mohammed Alsuhaibani, Takanori Maehara, and Ken-ichi Kawarabayashi. 2016. Joint word representation learning using a corpus and a semantic lexicon. In Proceedings of AAAI.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wenzek, and Amine Benhalloum. 2015. Trans-gram, fast cross-lingual word-embeddings. In Proceedings of EMNLP.

Lei Dang and Lei Zhang. 2010. Method of discriminant for Chinese sentence sentiment orientation based on HowNet. Application Research of Computers, 4:43.

Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2014. Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568.

Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In Proceedings of ACL-IJCNLP.

Zhendong Dong and Qiang Dong.
2003. Hownet-a hy- Zhongguo Li and Maosong Sun. 2009. Punctuation as brid language and knowledge resource. In Proceed- implicit annotations for chinese word segmentation. ings of NLP-KE. Computational Linguistics, 35(4):505–512. Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Bird, and Trevor Cohn. 2016. Learning crosslingual Yu Hu. 2015. Learning semantic word embeddings word embeddings without bilingual corpora. In Pro- based on ordinal knowledge constraints. In Proceed- ceedings of EMNLP. ings of ACL-IJCNLP. Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Qun Liu and Sujian Li. 2002. Word similarity comput- Chris Dyer, Eduard Hovy, and Noah A Smith. 2015. ing based on hownet. International Journal of Com- Retrofitting word vectors to semantic lexicons. In putational Linguistics & Chinese Language Process- Proceedings of NAACL-HLT. ing, 7(2):59–76. Manaal Faruqui and Chris Dyer. 2014. Improving vec- Ang Lu, Weiran Wang, Mohit Bansal, Kevin Gimpel, tor space word representations using multilingual and Karen Livescu. 2015. Deep multilingual corre- correlation. In Proceedings of the EACL. lation for improved word embeddings. In Proceed- ings of NAANL-HLT. Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Ey- Thang Luong, Hieu Pham, and Christopher D Manning. tan Ruppin. 2002. Placing search in context: The 2015. Bilingual word representations with monolin- concept revisited. ACM Transactions on Informa- gual quality in mind. In Proceedings of the 1st Work- tion Systems, 20(1):116–131. shop on Vector Space Modeling for Natural Lan- guage Processing. Xianghua Fu, Guo Liu, Yanyan Guo, and Zhiqiang Wang. 2013. Multi-aspect sentiment analysis for Laurens van der Maaten and Geoffrey E Hinton. 2008. chinese online social reviews based on topic model- Visualizing data using t-sne. Journal of Machine ing and hownet lexicon. Knowledge-Based Systems, Learning Research, 9:2579–2605. 
37:186–195. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Stephan Gouws, Yoshua Bengio, and Greg Corrado. Dean. 2013a. Efficient estimation of word represen- 2015. Bilbowa: fast bilingual distributed represen- tations in vector space. In Proceedings of ICLR. tations without word alignments. In Proceedings of ICML. Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine Yihong Gu, Jun Yan, Hao Zhu, Zhiyuan Liu, Ruobing translation. arXiv preprint arXiv:1309.4168. Xie, Maosong Sun, Fen Lin, and Leyu Lin. 2018. Language modeling with sparse product of sememe George A Miller. 1995. Wordnet: a lexical database for experts. In Proceedings of EMNLP. english. Communications of the ACM, 38(11):39– 41. Karl Moritz Hermann and Phil Blunsom. 2014. Mul- tilingual distributed representations without word Nikola Mrkšic, Diarmuid OSéaghdha, Blaise Thom- alignment. In Proceedings of ICLR. son, Milica Gašic, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Felix Hill, Roi Reichart, and Anna Korhonen. 2015. Young. 2016. Counter-fitting word vectors to lin- Simlex-999: Evaluating semantic models with (gen- guistic constraints. In Proceedings of NAACL-HLT. uine) similarity estimation. Computational Linguis- tics, 41(4):665–695. Yilin Niu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2017. Improved word representation learning Huiming Jin, Hao Zhu, Zhiyuan Liu, Ruobing Xie, with sememes. In Proceedings of ACL. Maosong Sun, Fen Lin, and Leyu Lin. 2018. In- corporating chinese characters of words for lexical Jeffrey Pennington, Richard Socher, and Christopher sememe prediction. In Proceedings of ACL. Manning. 2014. Glove: Global vectors for word rep- resentation. In Proceedings of EMNLP. Peng Jin and Yunfang Wu. 2012. SemEval-2012 Task 4: Evaluating chinese word similarity. In Proced- Sebastian Ruder. 2017. A survey of cross-lingual em- dings of *SEM. bedding models. arXiv preprint arXiv:1706.04902. 
Tomáš Kočiskỳ, Karl Moritz Hermann, and Phil Blun- Tianze Shi, Zhiyuan Liu, Yang Liu, and Maosong Sun. som. 2014. Learning bilingual word representa- 2015. Learning cross-lingual word embeddings via tions by marginalizing alignments. In Proceedings matrix co-factorization. In Proceedings of ACL- of ACL. IJCNLP. Jingguang Sun, Dongfeng Cai, Dexin Lv, and Yanju Dong. 2007. Hownet based chinese question auto- matic classification. Journal of Chinese Information Processing, 21(1):90–95. Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word em- beddings: An empirical comparison. In Proceedings of ACL. Ivan Vulić and Anna Korhonen. 2016. On the role of seed lexicons in learning bilingual word embed- dings. In Proceedings of ACL. Ivan Vulić and Marie-Francine Moens. 2015. Bilin- gual word embeddings from non-parallel document- aligned data applied to bilingual lexicon induction. In Proceedings of ACL-IJCNLP. Michael Wick, Pallika Kanani, and Adam Craig Pocock. 2016. Minimally-constrained multilingual embeddings via artificial code-switching. In Pro- ceedings of AAAI. Ruobing Xie, Xingchi Yuan, Zhiyuan Liu, and Maosong Sun. 2017. Lexical sememe prediction via word embeddings and matrix factorization. In Pro- ceedings of AAAI. Xiangkai Zeng, Cheng Yang, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. 2018. Chinese liwc lexicon expansion via hierarchical classification of word em- beddings with sememe attention. In Proceedings of AAAI. Meng Zhang, Haoruo Peng, Yang Liu, Huan-Bo Luan, and Maosong Sun. 2017. Bilingual lexicon induc- tion from non-parallel data with minimal supervi- sion. In Proceedings of AAAI. Yuntao Zhang, Ling Gong, and Yongcheng Wang. 2005. Chinese word sense disambiguation using hownet. In Proceedings of International Conference on Natural Computation. Will Y Zou, Richard Socher, Daniel Cer, and Christo- pher D Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceed- ings of EMNLP.
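Supplementary note: the two ranking metrics reported in the experiments above, mean average precision (MAP) for sememe prediction and top-k precision (P@k) for bilingual lexicon induction, can be sketched as follows. This is an illustrative Python reimplementation with hypothetical variable names and toy data, not the authors' released code (see the repository linked in the abstract for the official version).

```python
def average_precision(ranked, relevant):
    # Average precision of one ranked candidate list against a gold set.
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / max(len(relevant), 1)

def mean_average_precision(rankings, gold):
    # MAP over all test words; both arguments map word -> candidates / gold set.
    return sum(average_precision(rankings[w], gold[w]) for w in gold) / len(gold)

def precision_at_k(rankings, gold, k):
    # P@k for lexicon induction: fraction of source words whose correct
    # translation appears among the top-k ranked candidates.
    hit = sum(any(c in gold[w] for c in rankings[w][:k]) for w in gold)
    return hit / len(gold)

# Toy example (hypothetical data, echoing the apple example from Fig. 1):
rankings = {"apple": ["fruit", "computer", "able"]}
gold = {"apple": {"fruit", "able"}}
print(mean_average_precision(rankings, gold))  # (1/1 + 2/3) / 2 ≈ 0.833
print(precision_at_k(rankings, gold, 1))       # 1.0
```

In the paper's setting, `rankings` would hold the sememe (or translation) candidates sorted by model score for each target word, and `gold` the HowNet annotations (or seed-lexicon translations).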