
Improved Word Representation Learning with Sememes

Yilin Niu1∗, Ruobing Xie1∗, Zhiyuan Liu1,2†, Maosong Sun1,2
1 Department of Computer Science and Technology, State Key Lab on Intelligent Technology and Systems, National Lab for Information Science and Technology, Tsinghua University, Beijing, China
2 Jiangsu Collaborative Innovation Center for Language Ability, Jiangsu Normal University, Xuzhou 221009, China

Abstract

Sememes are minimum semantic units of word meanings, and the meaning of each word sense is typically composed by several sememes. Since sememes are not explicit for each word, people manually annotate word sememes and form linguistic common-sense knowledge bases. In this paper, we present that word sememe information can improve word representation learning (WRL), which maps words into a low-dimensional semantic space and serves as a fundamental step for many NLP tasks. The key idea is to utilize word sememes to capture the exact meanings of a word within specific contexts accurately. More specifically, we follow the framework of Skip-gram and present three sememe-encoded models to learn representations of sememes, senses and words, where we apply the attention scheme to detect word senses in various contexts. We conduct experiments on two tasks including word similarity and word analogy, and our models significantly outperform baselines. The results indicate that WRL can benefit from sememes via the attention scheme, and also confirm our models being capable of correctly modeling sememe information.

∗ Indicates equal contribution.
† Corresponding author: Z. Liu ([email protected])

1 Introduction

Sememes are defined as minimum semantic units of word meanings, and there exists a limited close set of sememes to compose the semantic meanings of an open set of concepts (i.e., word senses). However, sememes are not explicit for each word. Hence, people manually annotate word sememes and build linguistic common-sense knowledge bases.

HowNet (Dong and Dong, 2003) is one of such knowledge bases, which annotates each concept in Chinese with one or more relevant sememes. Different from WordNet (Miller, 1995), the philosophy of HowNet emphasizes the significance of parts and attributes represented by sememes. HowNet has been widely utilized in word similarity computation (Liu and Li, 2002) and sentiment analysis (Xianghua et al., 2013), and in Section 3.1 we will give a detailed introduction to sememes, senses and words in HowNet.

In this paper, we aim to incorporate word sememes into word representation learning (WRL) and learn improved word embeddings in a low-dimensional semantic space. WRL is a fundamental and critical step in many NLP tasks such as language modeling (Bengio et al., 2003) and neural machine translation (Sutskever et al., 2014).

There have been a lot of researches for learning word representations, among which word2vec (Mikolov et al., 2013) achieves a nice balance between effectiveness and efficiency. In word2vec, each word corresponds to one single embedding, ignoring the polysemy of most words. To address this issue, (Huang et al., 2012) introduces a multi-prototype model for WRL, conducting unsupervised word sense induction and learning embeddings according to context clusters. (Chen et al., 2014) further utilizes the synset information in WordNet to instruct word sense representation learning.

From these previous studies, we conclude that word sense disambiguation is critical for WRL, and we believe that the sememe annotation of word senses in HowNet can provide necessary semantic regularization for both tasks. To explore its feasibility, we propose a novel Sememe-Encoded Word Representation Learning (SE-WRL) model, which detects word senses and learns representations simultaneously. More specifically, this framework regards each word sense as a combination of its sememes, and iteratively performs word sense disambiguation according to contexts and learns representations of sememes, senses and words by extending Skip-gram in word2vec (Mikolov et al., 2013). In this framework, an attention-based method is proposed to select appropriate word senses according to contexts automatically. To take full advantage of sememes, we propose three different learning and attention strategies for SE-WRL.

In experiments, we evaluate our framework on two tasks including word similarity and word analogy, and further conduct case studies on sememe, sense and word representations. The evaluation results show that our models outperform other baselines significantly, especially on word analogy. This indicates that our models can build better knowledge representations with the help of sememe information, and also implies the potential of our models on word sense disambiguation.

The key contributions of this work are concluded as follows: (1) To the best of our knowledge, this is the first work to utilize sememes in HowNet to improve word representation learning. (2) We successfully apply the attention scheme to detect word senses and learn representations according to contexts with the favor of the sememe annotation in HowNet. (3) We conduct extensive experiments and verify the effectiveness of incorporating word sememes for improved WRL.

2 Related Work

2.1 Word Representation

Recent years have witnessed a great thrive in word representation learning. It is simple and straightforward to represent words using one-hot representations, but this usually struggles with the data sparsity issue and the neglect of semantic relations between words.

To address these issues, (Rumelhart et al., 1988) proposes the idea of distributed representation, which projects all words into a continuous low-dimensional semantic space, considering each word as a vector. Distributed word representations are powerful and have been widely utilized in many NLP tasks, including neural language models (Bengio et al., 2003; Mikolov et al., 2010), machine translation (Sutskever et al., 2014; Bahdanau et al., 2015), parsing (Chen and Manning, 2014) and text classification (Zhang et al., 2015). Distributed word representations are capable of encoding semantic meanings in vector space, serving as the fundamental and essential inputs of many NLP tasks.

There are large amounts of efforts devoted to learning better word representations. With the exponential growth of text corpora, model efficiency becomes an important issue. (Mikolov et al., 2013) proposes two models, CBOW and Skip-gram, achieving a good balance between effectiveness and efficiency. These models assume that the meanings of words can be well reflected by their contexts, and learn word representations by maximizing the predictive probabilities between words and their contexts. (Pennington et al., 2014) further utilizes matrix factorization on a word affinity matrix to learn word representations. However, these models merely arrange one vector for each word, regardless of the fact that many words have multiple senses. (Huang et al., 2012; Tian et al., 2014) utilize multi-prototype vector models to learn word representations and build distinct vectors for each word sense. (Neelakantan et al., 2015) presents an extension to the Skip-gram model for learning non-parametric multiple embeddings per word. (Rothe and Schütze, 2015) also utilizes an autoencoder to jointly learn word, sense and synset representations in the same semantic space.

This paper, for the first time, jointly learns representations of sememes, senses and words. The sememe annotation in HowNet provides useful semantic regularization for WRL. Moreover, the unified representations incorporated with sememes also provide us more explicit explanations of both word and sense embeddings.

2.2 Word Sense Disambiguation and Representation Learning

Word sense disambiguation (WSD) aims to identify word senses or meanings in a certain context computationally. There are mainly two approaches for WSD, namely the supervised methods and the knowledge-based methods. Supervised methods usually take the surrounding words or senses as features and use classifiers like SVM for word sense disambiguation (Lee et al., 2004), which are intensively limited by the time-consuming human annotation of training data.

On the contrary, knowledge-based methods utilize large external knowledge resources such as knowledge bases or dictionaries to suggest possible senses for a word. (Banerjee and Pedersen, 2002) exploits the rich hierarchy of semantic relations in WordNet (Miller, 1995) for an adapted Lesk-based WSD algorithm. (Bordes et al., 2011) introduces synset information in WordNet to WRL. (Chen et al., 2014) considers synsets in WordNet as different word senses, and jointly conducts word sense disambiguation and word / sense representation learning. (Guo et al., 2014) considers bilingual datasets to learn sense-specific word representations. Moreover, (Jauhar et al., 2015) proposes two approaches to learn sense-specific word representations that are grounded to ontologies. (Pilehvar and Collier, 2016) utilizes personalized PageRank to learn de-conflated semantic representations of words.

In this paper, we follow the knowledge-based approach and automatically detect word senses according to the contexts with the favor of sememe information in HowNet. To the best of our knowledge, this is the first attempt to apply attention-based models to encode sememe information for word representation learning.

3 Methodology

In this section, we present our framework Sememe-Encoded WRL (SE-WRL) that considers sememe information for word sense disambiguation and representation learning. Specifically, we learn our models on a large-scale text corpus with the semantic regularization of the sememe annotation in HowNet, and obtain sememe, sense and word embeddings for evaluation tasks.

In the following sections, we first introduce HowNet and the structures of sememes, senses and words. Then we discuss the conventional WRL model Skip-gram that we utilize for the sememe-encoded framework. Finally, we propose three sememe-encoded models in detail.

3.1 Sememes, Senses and Words in HowNet

In this section, we first introduce the arrangement of sememes, senses and words in HowNet. HowNet annotates precise senses to each word, and for each sense, HowNet annotates the significance of parts and attributes represented by sememes.

Fig. 1 gives an example of sememes, senses and words in HowNet. The first layer represents the word "apple". The word "apple" actually has two main senses shown on the second layer: one is a sort of juicy fruit (apple), and another is a famous computer brand (Apple brand). The third and following layers are those sememes explaining each sense. For instance, the first sense Apple brand indicates a computer brand, and thus has sememes computer, bring and SpeBrand.

[Figure 1: Examples of sememes, senses and words. The word 苹果 ("Apple brand / apple") has two senses: sense1 (Apple brand) is defined by the sememe 电脑 (computer), modified by 样式值 (PatternValue), 能 (able), 携带 (bring) and 特定牌子 (SpeBrand); sense2 (apple) is defined by the sememe 水果 (fruit).]

From Fig. 1 we can find that sememes of many senses in HowNet are annotated with various relations, such as define and modifier, and form complicated hierarchical structures. In this paper, for simplicity, we only consider all annotated sememes of each sense as a sememe set without considering their internal structure. HowNet assumes that the limited annotated sememes can well represent senses and words in the real-world scenario, and thus sememes are expected to be useful for both WSD and WRL.

We introduce the notions utilized in the following sections as follows. We define the overall sememe, sense and word sets used in training as X, S and W respectively. For each w ∈ W, there are possibly multiple senses s_i^{(w)} ∈ S^{(w)}, where S^{(w)} represents the sense set of w. Each sense s_i^{(w)} consists of several sememes x_j^{(s_i)} ∈ X_i^{(w)}. For each target word w in a sequential plain text, C(w) represents its context word set.
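The notation just introduced can be made concrete with a small sketch. The toy entry below mirrors the "apple" example of Fig. 1; the nested-dictionary layout is our illustrative assumption, not HowNet's actual file format:

```python
# Toy fragment of the word -> sense -> sememe-set hierarchy (hypothetical layout).
# A word w in W maps to its sense set S^(w); each sense s_i maps to its sememe
# set X_i^(w), following the "apple" example of Fig. 1.
hownet_toy = {
    "apple": {
        "Apple brand": {"computer", "PatternValue", "able", "bring", "SpeBrand"},
        "apple": {"fruit"},
    },
}

senses = set(hownet_toy["apple"])                      # S^(w) for w = "apple"
sememes = set().union(*hownet_toy["apple"].values())   # all sememes of w
```

In the models below, each sememe in such a set receives its own embedding vector, shared across every word and sense it annotates.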
3.2 Conventional Skip-gram Model

We directly utilize the widely-used Skip-gram model to implement our SE-WRL model, because Skip-gram has well balanced effectiveness as well as efficiency (Mikolov et al., 2013). The standard Skip-gram model assumes that word embeddings should relate to their context words. It aims at maximizing the predictive probability of context words conditioned on the target word w. Formally, we utilize a sliding window to select the context word set C(w). For a word sequence H = {w_1, ..., w_n}, the Skip-gram model intends to maximize:

L(H) = \sum_{i=K}^{n-K} \log \Pr(w_{i-K}, \cdots, w_{i+K} \mid w_i),   (1)

where K is the size of the sliding window. Pr(w_{i−K}, ..., w_{i+K} | w_i) represents the predictive probability of context words conditioned on the target word w_i, formalized by the following softmax function:

\Pr(w_{i-K}, \cdots, w_{i+K} \mid w_i) = \prod_{w_c \in C(w_i)} \Pr(w_c \mid w_i) = \prod_{w_c \in C(w_i)} \frac{\exp(\mathbf{w}_c^\top \cdot \mathbf{w}_i)}{\sum_{w'_c \in W} \exp(\mathbf{w}'^{\top}_c \cdot \mathbf{w}_i)},   (2)

in which w_c and w_i stand for the embeddings of context word w_c ∈ C(w_i) and target word w_i respectively. We can also follow the strategy of negative sampling proposed in (Mikolov et al., 2013) to accelerate the calculation of the softmax.
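Eqs. (1) and (2) can be sketched with an exact full-vocabulary softmax as below. The toy vocabulary size, dimension and random embeddings are hypothetical, and a real implementation would replace the exact normalizer with negative sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 8                        # toy vocabulary size and embedding dimension
W_target = rng.normal(size=(V, d))   # target-word embeddings w_i
W_context = rng.normal(size=(V, d))  # context-word embeddings w_c

def log_pr_context(context_ids, target_id):
    """log Pr(w_{i-K}, ..., w_{i+K} | w_i): Eq. (2) as a product of softmaxes."""
    scores = W_context @ W_target[target_id]        # w_c'^T . w_i for every w_c'
    log_softmax = scores - scores.max() - np.log(np.exp(scores - scores.max()).sum())
    return float(sum(log_softmax[c] for c in context_ids))

def skipgram_objective(sequence, K=2):
    """L(H) of Eq. (1): sum of window log-probabilities (0-indexed positions)."""
    total = 0.0
    for i in range(K, len(sequence) - K):
        context = sequence[i - K:i] + sequence[i + 1:i + K + 1]
        total += log_pr_context(context, sequence[i])
    return total
```

Maximizing this objective with gradient ascent over `W_target` and `W_context` recovers the conventional Skip-gram training that the sememe-encoded models below extend.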
3.3 SE-WRL Model

In this section, we introduce the SE-WRL models with three different strategies to utilize sememe information, including the Simple Sememe Aggregation Model (SSA), the Sememe Attention over Context Model (SAC) and the Sememe Attention over Target Model (SAT).

3.3.1 Simple Sememe Aggregation Model

The Simple Sememe Aggregation Model (SSA) is a straightforward idea based on the Skip-gram model. For each word, SSA considers all sememes in all senses of the word together, and represents the target word using the average of all its sememe embeddings. Formally, we have:

\mathbf{w} = \frac{1}{m} \sum_{s_i^{(w)} \in S^{(w)}} \sum_{x_j^{(s_i)} \in X_i^{(w)}} \mathbf{x}_j^{(s_i)},   (3)

which means the word embedding of w is composed by the average of all its sememe embeddings. Here, m stands for the overall number of sememes belonging to w.

This model simply follows the assumption that the semantic meaning of a word is composed of its semantic units, i.e., sememes. As compared to the conventional Skip-gram model, since sememes are shared by multiple words, this model can utilize sememe information to encode latent semantic correlations between words. In this case, similar words that share the same sememes may finally obtain similar representations.

3.3.2 Sememe Attention over Context Model

The SSA model replaces the target word embedding with the aggregated sememe embeddings to encode sememe information into word representation learning. However, each word in the SSA model still has only one single representation in different contexts, which cannot deal with the polysemy of most words. It is intuitive that we should construct distinct embeddings for a target word according to specific contexts, with the favor of the word sense annotation in HowNet.

To address this issue, we come up with the Sememe Attention over Context Model (SAC). SAC utilizes the attention scheme to automatically select appropriate senses for context words according to the target word. That is, SAC conducts word sense disambiguation for context words to learn better representations of target words. The structure of the SAC model is shown in Fig. 2.

[Figure 2: Sememe Attention over Context Model. The target word embedding w_t serves as attention (att_1, att_2, att_3) over the sense embeddings S_1, S_2, S_3 of each context word w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2}, where each sense is composed of its sememe embeddings.]

More specifically, we utilize the original word embedding for the target word w, but use sememe embeddings to represent each context word w_c instead of the original context word embedding. Suppose a word typically demonstrates some specific sense in one sentence. Here we employ the target word embedding as an attention to select the most appropriate senses to make up the context word embeddings. We formalize the context word embedding w_c as follows:

\mathbf{w}_c = \sum_{j=1}^{|S^{(w_c)}|} \mathrm{att}(s_j^{(w_c)}) \cdot \mathbf{s}_j^{(w_c)},   (4)

where s_j^{(w_c)} stands for the j-th sense embedding of w_c, and att(s_j^{(w_c)}) represents the attention score of the j-th sense with respect to the target word w, defined as follows:

\mathrm{att}(s_j^{(w_c)}) = \frac{\exp(\mathbf{w} \cdot \hat{\mathbf{s}}_j^{(w_c)})}{\sum_{k=1}^{|S^{(w_c)}|} \exp(\mathbf{w} \cdot \hat{\mathbf{s}}_k^{(w_c)})}.   (5)

Note that, when calculating attention, we use the average of sememe embeddings to represent each sense s_j^{(w_c)}:

\hat{\mathbf{s}}_j^{(w_c)} = \frac{1}{|X_j^{(w_c)}|} \sum_{k=1}^{|X_j^{(w_c)}|} \mathbf{x}_k^{(s_j)}.   (6)

The attention strategy assumes that the more relevant a context word sense embedding is to the target word w, the more this sense should be considered when building the context word embeddings. With the favor of the attention scheme, we can represent each context word as a particular distribution over its senses. This can be regarded as soft WSD. As shown in experiments, it helps learn better word representations.

3.3.3 Sememe Attention over Target Model

The Sememe Attention over Context Model can flexibly select appropriate senses and sememes for context words according to the target word. The process can also be applied to select appropriate senses for the target word, by taking context words as attention. Hence, we propose the Sememe Attention over Target Model (SAT) as shown in Fig. 3.

Different from the SAC model, SAT learns the original word embeddings for context words, but sememe embeddings for target words. We apply context words as attention over the multiple senses of the target word w to build the embedding of w, formalized as follows:

\mathbf{w} = \sum_{j=1}^{|S^{(w)}|} \mathrm{att}(s_j^{(w)}) \cdot \mathbf{s}_j^{(w)},   (7)

where s_j^{(w)} stands for the j-th sense embedding of w, and the context-based attention is defined as follows:

\mathrm{att}(s_j^{(w)}) = \frac{\exp(\mathbf{w}'_c \cdot \hat{\mathbf{s}}_j^{(w)})}{\sum_{k=1}^{|S^{(w)}|} \exp(\mathbf{w}'_c \cdot \hat{\mathbf{s}}_k^{(w)})},   (8)

where, similar to Eq. (6), we also use the average of sememe embeddings to represent each sense s_j^{(w)}. Here, w'_c is the context embedding, consisting of a constrained window of word embeddings in C(w_i). We have:

\mathbf{w}'_c = \frac{1}{2K'} \sum_{k=i-K'}^{i+K'} \mathbf{w}_k, \quad k \neq i.   (9)

Note that, since in experiments we find the sense selection of the target word only relies on more limited context words for calculating attention, we select a smaller K' as compared to K.

Recall that SAC only uses one target word as attention to select the senses of context words, but SAT uses several context words together as attention to select the appropriate senses of target words. Hence SAT is expected to conduct more reliable WSD and result in more accurate word representations, which will be explored in experiments.
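A minimal sketch of the SAT computation (Eqs. (6)–(9)), assuming a hypothetical target word with two senses and randomly initialized toy embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension (the paper uses 200)

# Hypothetical parameters: sense 1 has three sememes, sense 2 has one;
# the context is a window of 2K' = 4 context word embeddings.
sememes_per_sense = [rng.normal(size=(3, d)), rng.normal(size=(1, d))]
context_window = rng.normal(size=(4, d))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Eq. (6): represent each sense by the average of its sememe embeddings.
sense_hat = np.stack([x.mean(axis=0) for x in sememes_per_sense])
# Eq. (9): context embedding w'_c = average of the window's word embeddings.
w_c_prime = context_window.mean(axis=0)
# Eq. (8): attention of each sense with respect to the context.
att = softmax(sense_hat @ w_c_prime)
# Eq. (7): target word embedding = attention-weighted sum of sense embeddings.
w = att @ sense_hat
```

SAC follows the mirror-image computation: the roles are swapped, with the target word embedding attending over the senses of each context word (Eqs. (4)–(6)).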
[Figure 3: Sememe Attention over Target Model. The context word embeddings w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2} are averaged into a context embedding that serves as attention (att_1, att_2, att_3) over the sense embeddings S_1, S_2, S_3 of the target word w_t, where each sense is composed of its sememe embeddings.]

4 Experiments

In this section, we evaluate the effectiveness of our SE-WRL models[1] on two tasks including word similarity and word analogy, which are two classical evaluation tasks mainly focusing on evaluating the quality of learned word representations. We also explore the potential of our models in word sense disambiguation with a case study, showing the power of our attention-based models.

4.1 Dataset

We use the web pages in Sogou-T[2] as the text corpus to learn WRL models. Sogou-T is provided by a Chinese commercial search engine, and contains 2.7 billion words in total.

We also utilize the sememe annotation in HowNet. The number of distinct sememes used in this paper is 1,889. The average number of senses for each word is about 2.4, while the average number of sememes for each sense is about 1.6. Throughout the Sogou-T corpus, we find that 42.2% of words have multiple senses. This indicates the significance of WSD.

For evaluation, we choose wordsim-240 and wordsim-297[3] to evaluate the performance of word similarity computation. The two datasets both contain frequently-used Chinese word pairs with similarity scores annotated manually. We choose the Chinese Word Analogy dataset proposed by (Chen et al., 2015) to evaluate the performance of word analogy inference, that is, w("king") − w("man") ≃ w("queen") − w("woman").

[1] https://github.com/thunlp/SE-WRL
[2] https://www.sogou.com/labs/resource/t.php
[3] https://github.com/Leonard-Xu/CWE/tree/master/data
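Word-pair benchmarks of this kind are scored by ranking the pairs with cosine similarity and computing the Spearman correlation against the human judgments (the protocol of Section 4.3.1). A minimal sketch with hypothetical embeddings and made-up gold scores standing in for the benchmark entries:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks (no tie handling)."""
    def ranks(v):
        r = np.empty(len(v))
        r[np.argsort(v)] = np.arange(len(v))
        return r
    return float(np.corrcoef(ranks(np.asarray(x)), ranks(np.asarray(y)))[0, 1])

# Hypothetical embeddings and pairs standing in for wordsim-240/297 entries.
emb = {w: rng.normal(size=8) for w in ["king", "queen", "apple", "fruit"]}
pairs = [("king", "queen"), ("apple", "fruit"), ("king", "apple")]
human_scores = [8.5, 7.9, 1.2]  # made-up gold similarity judgments
model_scores = [cosine(emb[a], emb[b]) for a, b in pairs]
rho = spearman(model_scores, human_scores)  # the metric reported in Table 1
```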
4.2 Experimental Settings

We evaluate the three SE-WRL models, SSA, SAC and SAT, on all tasks. As baselines, we consider three conventional WRL models: Skip-gram, CBOW and GloVe. For Skip-gram and CBOW, we directly use the code released by Google (Mikolov et al., 2013). GloVe is proposed by (Pennington et al., 2014), which seeks the advantages of the WRL models based on statistics and those based on prediction. Moreover, we propose another model, the Maximum Selection over Target Model (MST), for further comparison, inspired by (Chen et al., 2014). It represents the current word embedding with only the most probable sense according to the contexts, instead of viewing a word as a particular distribution over all its senses as SAT does.

For a fair comparison, we train these models with the same experimental settings and with their best parameters. As for the parameter settings, we set the context window size K = 8 as the upper bound, and during training, the window size is dynamically selected ranging from 1 to 8 randomly. We set the dimensions of word, sense and sememe embeddings to be the same, 200. For the learning rate α, its initial value is 0.025 and it descends through iterations. We set the number of negative samples to be 25. We also set a lower bound of word frequency as 50; words less frequent than this bound are filtered out of the training set. For SAT, we set K' = 2.

4.3 Word Similarity

The task of word similarity aims to evaluate the quality of word representations by comparing the similarity ranks of word pairs computed by WRL models with the ranks given by the dataset. WRL models typically compute word similarities according to their distances in the semantic space.

4.3.1 Evaluation Protocol

In experiments, we choose the cosine similarity between two word embeddings to rank word pairs. For evaluation, we compute the Spearman correlation between the ranks of the models and the ranks of the human judgments.

Model       Wordsim-240   Wordsim-297
CBOW           57.7          61.1
GloVe          59.8          58.7
Skip-gram      58.5          63.3
SSA            58.9          64.0
MST            59.2          62.8
SAC            59.1          61.0
SAT            61.2          63.3

Table 1: Evaluation results of word similarity computation.

4.3.2 Experiment Results

Table 1 shows the results of these models for word similarity computation. From the results we can observe that:

(1) Our models outperform all baselines on both test sets. This indicates that, by utilizing sememe annotation properly, our models can better capture the semantic relations of words and learn more accurate word embeddings.

(2) The SSA model represents a word with the average of its sememe embeddings. In general, the SSA model performs slightly better than the baselines, which tentatively proves that sememe information is helpful. The reason is that words which share common sememe embeddings will benefit from each other. In particular, words with lower frequency, which cannot be learned sufficiently using conventional WRL models, can in contrast obtain better word embeddings from SSA, simply because their sememe embeddings can be trained sufficiently through other words.

(3) The SAT model performs better than the baselines and SAC, especially on Wordsim-240. This indicates that SAT can obtain a more precise sense distribution of a word. The reason, as mentioned above, is that different from SAC, which uses only one target word as attention for WSD, SAT adopts richer contextual information as attention for WSD.

                       Accuracy                          Mean Rank
Model       Capital  City  Relationship  All    Capital  City  Relationship   All
CBOW          49.8   85.7      86.0      64.2    36.98   1.23     62.64      37.62
GloVe         57.3   74.3      81.6      65.8    19.09   1.71      3.58      12.63
Skip-gram     66.8   93.7      76.8      73.4   137.19   1.07      2.95      83.51
SSA           62.3   93.7      81.6      71.9    45.74   1.06      3.33      28.52
MST           65.7   95.4      82.7      74.5    50.29   1.05      2.48      31.05
SAC           79.2   97.7      75.0      81.0    28.88   1.02      2.23      18.09
SAT           82.6   98.9      80.1      84.5    14.78   1.01      1.72       9.48

Table 2: Evaluation results of word analogy inference.

(4) SAT works better than MST, and we can conclude that a soft disambiguation over senses prevents the inevitable errors of selecting only the one most-probable sense. The result makes sense because, for many words, their various senses are not always entirely different from each other, but share some common elements. In some contexts, a single sense may not convey the exact meaning of the word.

4.4 Word Analogy

Word analogy inference is another widely-used task to evaluate the quality of WRL models (Mikolov et al., 2013).

4.4.1 Evaluation Protocol

The dataset proposed by (Chen et al., 2015) consists of 1,124 analogies, which contain three analogy types: (1) capitals of countries (Capital), 677 groups; (2) states/provinces of cities (City), 175 groups; (3) family words (Relationship), 272 groups. Given an analogy group of words (w1, w2, w3, w4), WRL models are expected to yield w2 − w1 + w3 approximately equal to w4. Hence for word analogy inference, we suppose w4 is missing, and WRL models rank all candidate words according to their scores as follows:

R(w) = \cos(\mathbf{w}_2 - \mathbf{w}_1 + \mathbf{w}_3, \mathbf{w}),   (10)

and select the top-ranked word as the answer.

For word analogy inference, we consider two evaluation metrics: (1) Accuracy. For each analogy group, a WRL model selects the top-ranked word w = arg max_w R(w), which is judged as positive if w = w4. The percentage of positive samples is regarded as the accuracy score of the WRL model. (2) Mean Rank. For each analogy group, a WRL model assigns a rank to the gold standard word w4 according to the scores computed by Eq. (10). We use the mean rank of all gold standard words as the evaluation metric.

4.4.2 Experiment Results

Table 2 shows the evaluation results of these models for word analogy inference. From the table, we can observe that:

(1) The SAT model performs best among all models, and its superiority is more significant than that on word similarity computation. This indicates that SAT enhances the modeling of implicit relations between word embeddings in the semantic space. The reason is that the sememes annotated to word senses have encoded these word relations. For example, capital and Cuba are two sememes of the word "Havana", which provide explicit semantic relations between the words "Cuba" and "Havana".

(2) The SAT model does well on both the Capital and City classes, because some words in these classes have low frequencies, while their sememes occur so many times that the sememe embeddings can be learned sufficiently. With these sememe embeddings, such low-frequency words can be learned more efficiently by SAT.

(3) It seems that CBOW works better than SAT on the Relationship class. However, on mean rank, CBOW gets the worst results, which indicates that the performance of CBOW is unstable. On the contrary, although the accuracy of SAT is a bit lower than that of CBOW, SAT seldom gives an outrageous prediction. In most wrong cases, SAT predicts the word "grandfather" instead of "grandmother", which is not complete nonsense, because in HowNet the words "grandmother", "grandfather", "grandma" and some other similar words share four common sememes while only one of their sememes differs. These similar sememes make the attention process less discriminative between the words. But in the wrong cases of CBOW, we find that many mistakes are about words with low frequencies, such as "stepdaughter", which occurs merely 358 times. Considering sememes may relieve this problem.
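The ranking protocol above (Eq. (10) with the accuracy and mean-rank metrics) can be sketched as follows; the vocabulary, embeddings and analogy group are hypothetical toys:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["man", "woman", "king", "queen", "apple"]
emb = {w: rng.normal(size=8) for w in vocab}  # hypothetical toy embeddings

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_answer(w1, w2, w3, w4):
    """Rank of the gold word w4 under R(w) = cos(w2 - w1 + w3, w), Eq. (10)."""
    query = emb[w2] - emb[w1] + emb[w3]
    candidates = sorted((w for w in vocab if w not in (w1, w2, w3)),
                        key=lambda w: -cosine(query, emb[w]))
    return candidates.index(w4) + 1  # rank 1 = top-ranked, counted as correct

groups = [("man", "king", "woman", "queen")]  # toy analogy group
ranks = [rank_answer(*g) for g in groups]
accuracy = sum(r == 1 for r in ranks) / len(ranks)  # metric (1)
mean_rank = sum(ranks) / len(ranks)                 # metric (2)
```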
Word: 苹果 ("Apple brand / apple")
  sense1: Apple brand (computer, PatternValue, able, bring, SpeBrand); sense2: apple (fruit)
  "Apple is always famous as the king of fruits"                  → Apple brand: 0.28, apple: 0.72
  "The Apple brand computer can not start up normally"            → Apple brand: 0.87, apple: 0.13

Word: 扩散 ("proliferate / metastasize")
  sense1: proliferate (disperse); sense2: metastasize (disperse, disease)
  "Prevent the epidemic from metastasizing"                       → proliferate: 0.06, metastasize: 0.94
  "Treaty on the Non-Proliferation of Nuclear Weapons"            → proliferate: 0.68, metastasize: 0.32

Word: 队伍 ("contingent / troops")
  sense1: contingent (community); sense2: troops (army)
  "Eight contingents enter the second stage of team competition"  → contingent: 0.90, troops: 0.10
  "Construct the organization of public security's troops in grass-roots units" → contingent: 0.15, troops: 0.85

Table 3: Examples of sememes, senses and words in context with attention.

4.5 Case Study

The above experiments verify the effectiveness of our models for WRL. Here we show some examples of sememes, senses and words for case study.

4.5.1 Word Sense Disambiguation

To demonstrate the validity of Sememe Attention, we select three attention results from the training set, as shown in Table 3. In this table, the first row of each of the three examples is the word-sense-sememe structure of the word. For instance, in the third example, the word has two senses, contingent and troops; contingent has one sememe community, while troops has one sememe army. The three examples all indicate that our models can estimate appropriate distributions of senses for a word given a context.

4.5.2 Effect of Context Words for Attention

We demonstrate the effect of context words for attention in Table 4. The word "Havana" consists of four sememes, among which the two sememes capital and Cuba describe distinct attributes of the word from different aspects.

Word                 哈瓦那 ("Havana")
Sememe               首都 (capital)   古巴 (Cuba)
古巴 ("Cuba")             0.39           0.42
俄罗斯 ("Russia")         0.39          -0.09
雪茄 ("cigar")            0.00           0.36

Table 4: Sememe weights for computing attention.

Here, we list three different context words: "Cuba", "Russia" and "cigar". Given the context word "Cuba", both sememes get high weights, indicating their contributions to the meaning of "Havana" in this context. The context word "Russia" is more relevant to the sememe capital. When the context word is "cigar", the sememe Cuba has more influence, because the cigar is a famous specialty of Cuba. From these examples, we can conclude that our Sememe Attention can accurately capture word meanings in complicated contexts.

5 Conclusion and Future Work

In this paper, we propose a novel method to model sememe information for learning better word representations. Specifically, we utilize sememe information to represent the various senses of each word and propose Sememe Attention to select appropriate senses in contexts automatically. We evaluate our models on word similarity and word analogy, and the results show the advantages of our Sememe-Encoded WRL models. We also analyze several cases in WSD and WRL, which confirms that our models are capable of selecting appropriate word senses with the favor of sememe attention.

We will explore the following research directions in future: (1) The sememe information in HowNet is annotated with hierarchical structure and relations, which have not been considered in our framework. We will explore utilizing these annotations for better WRL. (2) We believe the idea of sememes is universal and could function well beyond languages. We will explore the effectiveness of sememe information for WRL in other languages.

Acknowledgments

This work is supported by the 973 Program (No. 2014CB340501), the National Natural Science Foundation of China (NSFC No. 61572273, 61661146007), and the Key Technologies Research and Development Program of China (No. 2014BAK04B03). This work is also funded by the Natural Science Foundation of China (NSFC) and the German Research Foundation (DFG) in Project Crossmodal Learning, NSFC 61621136008 / DFG TRR-169.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Satanjeev Banerjee and Ted Pedersen. 2002. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of CICLing, pages 136–145.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. JMLR 3:1137–1155.

Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. 2011. Learning structured embeddings of knowledge bases. In Conference on Artificial Intelligence.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP, pages 740–750.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In Proceedings of EMNLP, pages 1025–1035.

Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huan-Bo Luan. 2015. Joint learning of character and word embeddings. In Proceedings of IJCAI, pages 1236–1242.

Zhendong Dong and Qiang Dong. 2003. HowNet: a hybrid language and knowledge resource. In Proceedings of NLP-KE, IEEE, pages 820–824.

Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning sense-specific word embeddings by exploiting bilingual resources. In Proceedings of COLING, pages 497–507.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of ACL, pages 873–882.

Sujay Kumar Jauhar, Chris Dyer, and Eduard Hovy. 2015. Ontologically grounded multi-sense representation learning for semantic vector space models. In Proceedings of NAACL, volume 1.

Yoong Keok Lee, Hwee Tou Ng, and Tee Kiah Chia. 2004. Supervised word sense disambiguation with support vector machines and multiple knowledge sources. In Proceedings of SENSEVAL-3, pages 137–140.

Qun Liu and Sujian Li. 2002. Word similarity computing based on HowNet. CLCLP 7(2):59–76.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of ICLR.

Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech, volume 2, page 3.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM 38(11):39–41.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2015. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of EMNLP.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, volume 14, pages 1532–1543.

Mohammad Taher Pilehvar and Nigel Collier. 2016. De-conflated semantic representations. In Proceedings of EMNLP.

Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of ACL.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1988. Learning representations by back-propagating errors. Cognitive modeling 5(3):1.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS, pages 3104–3112.

Fei Tian, Hanjun Dai, Jiang Bian, Bin Gao, Rui Zhang, Enhong Chen, and Tie-Yan Liu. 2014. A probabilistic model for learning multi-prototype word embeddings. In Proceedings of COLING, pages 151–160.

Fu Xianghua, Liu Guo, Guo Yanyan, and Wang Zhiqiang. 2013. Multi-aspect sentiment analysis for Chinese online social reviews based on topic modeling and HowNet lexicon. Knowledge-Based Systems 37:186–195.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Proceedings of NIPS, pages 649–657.