Lexical Sememe Prediction using Definitions by Capturing Local Semantic Correspondence

Jiaju Du, Fanchao Qi, Maosong Sun, Zhiyuan Liu
Department of Computer Science and Technology, Tsinghua University
Institute for Artificial Intelligence, Tsinghua University
Beijing National Research Center for Information Science and Technology
{djj18,qfc17}@mails.tsinghua.edu.cn, {sms,lzy}@tsinghua.edu.cn

arXiv:2001.05954v1 [cs.CL] 16 Jan 2020

This paper is accepted by Journal of Chinese Information Processing (in Chinese). This version is for English readers.

Abstract

Sememes, defined as the minimum semantic units of human languages in linguistics, have been proven useful in many NLP tasks. Since manual construction and update of sememe knowledge bases (KBs) are costly, the task of automatic sememe prediction has been proposed to assist sememe annotation. In this paper, we explore the approach of applying dictionary definitions to predicting sememes for unannotated words. We find that the sememes of each word are usually semantically matched to different words in its dictionary definition, and we name this matching relationship local semantic correspondence. Accordingly, we propose a Sememe Correspondence Pooling (SCorP) model, which is able to capture this kind of matching to predict sememes. We evaluate our model and baseline methods on the well-known sememe KB HowNet and find that our model achieves state-of-the-art performance. Moreover, further quantitative analysis shows that our model can properly learn the local semantic correspondence between sememes and words in dictionary definitions, which explains the effectiveness of our model. The source code of this paper can be obtained from https://github.com/thunlp/scorp.

1 Introduction

In linguistics, sememes are defined as the minimum semantic units of human languages (Bloomfield, 1926). Some linguists believe that the meanings of all words can be described with a limited closed set of sememes, which is similar to the idea of semantic primes (Wierzbicka, 1996). However, sememes are normally implicit and cannot be recognized directly. Therefore, people manually annotate words with a set of predefined sememes to construct sememe KBs.

HowNet (Dong and Dong, 2003) is the most famous sememe KB. It defines about 2,000 language-independent sememes and uses them to annotate over 100 thousand Chinese and English words. Every word in HowNet contains some senses, and each sense is annotated with a hierarchical structure of sememes, i.e., a sememe tree. Figure 1 shows an example of how words are annotated in HowNet. The word “observatory” has one sense, and this sense has 4 sememes, namely “location”, “investigate”, “celestial body” and “celestial event”. HowNet has been successfully utilized in various NLP applications such as semantic similarity computation (Liu and Li, 2002), word sense disambiguation (Zhang et al., 2005; Duan et al., 2007), sentiment analysis (Zhu et al., 2006; Fu et al., 2013; Dang and Zhang, 2010), language modeling (Gu et al., 2018), word embedding (Niu et al., 2017), and word classification (Zeng et al., 2018).

Figure 1: The annotation of the word “observatory” in HowNet and its definition in the dictionary. Sememes and definition words with the same superscript are semantically matched.

Since numerous new words emerge constantly and the meanings of existing words keep changing, it is labor-intensive and time-consuming to expand and update sememe KBs like HowNet. To solve this problem, Xie et al. (2017) propose the task of sememe prediction, aiming to automatically recommend related sememes for words without sememe annotation. They also present two simple but effective embedding-based models for the task. These two models ignore the hierarchical structure of sememe trees and predict a set of sememes for every word. Jin et al. (2018) further propose to incorporate the Chinese characters of words into sememe prediction and improve the prediction performance.

These methods rely heavily on the representations of words or characters. As a result, it is hard to predict proper sememes for low-frequency words or words with low-frequency characters because of their poor embeddings. In fact, there are other available resources that can be used in sememe prediction to tackle this challenge. Dictionary definitions clearly explain the meanings of words, including low-frequency words, and they are also easy to obtain. Thus, we believe dictionary definitions are appropriate for sememe prediction. Li et al. (2018) are the first to use dictionary definitions in sememe prediction. However, they adopt a sequence to sequence model that imposes an inappropriate order on sememes. Additionally, they regard definitions as sequences of characters, which can hurt the representations of the definitions owing to the ambiguity of Chinese characters.

In this paper, we propose a novel model for sememe prediction using dictionary definitions. The model not only addresses the issues of existing methods but can also capture local semantic correspondence, a particular matching relationship between sememe trees and definitions. More explicitly, for a given word, we find that each of its sememes can be matched to one or more of its definition words, i.e., words in the dictionary definition. Taking Figure 1 as an example, the sememes “location”, “investigate”, and “celestial body, celestial event” are semantically matched to the words “institution”, “observing, researching”, and “celestial, astronomy” respectively.

The reason for local semantic correspondence is easy to explain. Both a word’s sememe tree and its dictionary definition express the same meaning. The sememe tree of “observatory” represents the meaning of “location of investigating celestial events and celestial bodies”, which is almost identical to that of the dictionary definition. Furthermore, sememe trees and dictionary definitions can be decomposed into sememes and definition words respectively. Consequently, it is natural that there is local semantic correspondence between a word’s sememes and its definition words.

To take advantage of local semantic correspondence, we propose the Sememe Correspondence Pooling (SCorP) model. SCorP computes the correspondence scores between all sememes and the words in a definition, and performs max-pooling over the correspondence scores for every sememe. Moreover, following previous works (Xie et al., 2017; Jin et al., 2018; Li et al., 2018), the SCorP model ignores the hierarchical structures of sememes in sememe prediction, and it uses the framework of sequence to set multi-label classification to avoid imposing an order on sememes. The inputs to the SCorP model are word sequences rather than character sequences, which remedies the ambiguity problem of characters and at the same time enables utilizing the sememe information of the definition words. In addition, we propose two effective retrofitting operations for SCorP which improve the prediction performance. In our experiments, we evaluate the sememe prediction performance of our SCorP model and several baselines. Experimental results show that our model achieves state-of-the-art performance. We also conduct further quantitative analysis and find that our model can correctly match definition words to sememes, which explains the superiority of our model.

To conclude, the contributions of this paper are twofold: (1) we discover local semantic correspondence, a specific kind of semantic matching between a word’s sememe tree and its definition; (2) we propose a novel model which utilizes local semantic correspondence to predict sememes, and it achieves state-of-the-art performance in sememe prediction.

2 Related Work

Applications of Sememes. Sememes have been widely used in NLP tasks, such as semantic similarity computation (Liu and Li, 2002), word sense disambiguation (Zhang et al., 2005; Duan et al., 2007), and sentiment analysis (Fu et al., 2013; Dang and Zhang, 2010; Zhu et al., 2006; Huang et al., 2014). In addition, Niu et al. (2017) incorporate sememes into word representation learning and utilize sememes to capture the exact meaning of a word within specific contexts. Zeng et al. (2018) use sememe knowledge with the attention mechanism to determine the type of a word and expand the Chinese LIWC lexicon (Pennebaker et al., 2007). Gu et al. (2018) propose to regard sememes as linguistic experts to help predict suitable words in language modeling.

Sememe Prediction. Xie et al. (2017) present the automatic sememe prediction task for the first time. They also propose a collaborative filtering-based model SPWE and a matrix factorization-based model SPSE for this task and achieve acceptable sememe prediction results. Following this work, Jin et al. (2018) propose two models, SPWCF and SPCSE, which can leverage the internal character information of Chinese words, as well as an ensemble model CSP which considers both internal and external information. Li et al. (2018) first explore the role of dictionary descriptions in sememe prediction. They propose a sequence to sequence model named LD+Seq2Seq, where the inputs are character sequences of dictionary definitions or wiki descriptions, and the outputs are sequences of predicted sememes. In addition, there are also works trying to build a sememe KB for another language by cross-lingual sememe prediction (Qi et al., 2018).

Applications of Dictionary Definitions. Dictionary definitions are abundant and valuable research resources in NLP. They have been used in various tasks, such as word sense disambiguation (Luo et al., 2018), knowledge representation learning (Xie et al., 2016; Zhong et al., 2015), reading comprehension (Long et al., 2017), and reverse dictionary (Hill et al., 2016; Bosc and Vincent, 2018). Most of the previous works encode definitions into vectors for downstream tasks. In addition, some works adopt graph-based methods to build a word graph using dictionary definitions for downstream tasks (Thorat and Choudhari, 2016; Tissier et al., 2017).

3 Methodology

In this section, we first introduce some terms and notations which will be used below. Then we give a brief description of a basic sequence to set multi-label classification framework, on which our SCorP model is based. Next, we present our SCorP model and two different retrofitting operations in detail. Finally, we succinctly describe an ensemble model.

3.1 Terms and Notations

As mentioned previously, we use “definition word” to refer to a word in dictionary definitions, and we use “target word” to signify the word for which we want to predict sememes. We define $W$ as the vocabulary set and $S$ as the set of sememes. Given a word $w \in W$, all of its sememes form a set $S_w = \{s_1, \dots, s_{|S_w|}\} \subseteq S$, where $|\cdot|$ represents the cardinality of a set. The dictionary definition of word $w$ is denoted by $D_w = \{d_1, \dots, d_{|D_w|}\}$, where $d_i$ is its $i$-th definition word. We use lowercase boldface symbols for vectors and uppercase boldface symbols for matrices. For example, $\mathbf{d}_i$ is the word vector of $d_i$, $\mathbf{s}_i$ is the sememe vector of $s_i$, and $\mathbf{D}_w$ is the matrix composed of the word vectors $\mathbf{d}_1, \dots, \mathbf{d}_{|D_w|}$.

3.2 Basic Multi-label Classification Framework

The sequence to set multi-label classification (MC) framework serves as the base of our SCorP model. It has two main parts: the encoder, which encodes a dictionary definition into a vector, and the multi-label classifier, which uses the definition vector to compute an association score for each sememe. The sememes with high scores are selected as the predicted sememes.

We choose Bidirectional LSTM (BiLSTM) (Schuster and Paliwal, 1997) as the encoder. Formally, for the dictionary definition $D_w = \{d_1, \dots, d_{|D_w|}\}$ of a target word $w$, we pass $\{\mathbf{d}_1, \dots, \mathbf{d}_{|D_w|}\}$, the pre-trained word embeddings of the definition words, to the BiLSTM. The BiLSTM then outputs two sequences of hidden states:

$$\{\overrightarrow{\mathbf{h}}_1, \dots, \overrightarrow{\mathbf{h}}_{|D_w|}\}, \{\overleftarrow{\mathbf{h}}_1, \dots, \overleftarrow{\mathbf{h}}_{|D_w|}\} = \mathrm{BiLSTM}(\mathbf{d}_1, \dots, \mathbf{d}_{|D_w|}). \tag{1}$$

We use the concatenation of the last hidden states of both directions as the definition vector, which is denoted by $\mathbf{v}$, and feed it to a fully connected layer:

$$\mathbf{v} = \mathrm{Concatenate}(\overrightarrow{\mathbf{h}}_{|D_w|}, \overleftarrow{\mathbf{h}}_1), \qquad \mathbf{x} = \mathbf{W}\mathbf{v} + \mathbf{b}, \tag{2}$$

where $\mathbf{W} \in \mathbb{R}^{|S| \times 2l}$, $\mathbf{x}, \mathbf{b} \in \mathbb{R}^{|S|}$, and $l$ represents the dimension of the hidden states in a single direction. $[\mathbf{x}]_j$, the $j$-th element of $\mathbf{x}$, denotes the association score of the $j$-th sememe. For training, we use the multi-label one-versus-all cross-entropy loss:

$$\mathcal{L} = -\frac{1}{|S|} \sum_{j=1}^{|S|} \Big( [\mathbf{y}]_j \, \sigma([\mathbf{x}]_j) + (1 - [\mathbf{y}]_j) \, \sigma(-[\mathbf{x}]_j) \Big), \tag{3}$$

where $[\mathbf{y}]_j \in \{0, 1\}$ represents whether the $j$-th sememe is in the sememe set of word $w$.

3.3 SCorP Model

The basic MC model encodes the whole definition into a single vector, which causes information loss and makes it especially difficult to cope with long definitions. Our Sememe Correspondence Pooling (SCorP) model, which exploits local semantic correspondence, can handle this problem.

A sememe prediction example of the SCorP model is illustrated in Figure 2. The encoder of the SCorP model is similar to that of the MC model. However, the SCorP model uses the concatenations of all the bidirectional hidden states in Equation 1, rather than only the concatenation of the last bidirectional hidden states:

$$\mathbf{h}_i = \mathrm{Concatenate}(\overrightarrow{\mathbf{h}}_i, \overleftarrow{\mathbf{h}}_i), \quad 1 \le i \le |D_w|. \tag{4}$$

Figure 2: A sememe prediction example of the SCorP model. The darker a cell of the correspondence score matrix is, the higher the correspondence score between the sememe and the definition word. See Figure 1 for the English translation of the definition.

Here $\mathbf{h}_i$ contains the contextual semantic information of $d_i$, the $i$-th definition word. Every $\mathbf{h}_i$ is then passed to a fully connected layer, and the outputs constitute a matrix:

$$\mathbf{y}_i = \mathbf{W}\mathbf{h}_i + \mathbf{b}, \qquad \mathbf{Y} = (\mathbf{y}_1, \dots, \mathbf{y}_{|D_w|}), \tag{5}$$

where $\mathbf{Y} \in \mathbb{R}^{|S| \times |D_w|}$ is in fact the semantic correspondence matrix, and $[\mathbf{Y}]_{ji}$ measures the correspondence between the $j$-th sememe and the $i$-th definition word.

According to local semantic correspondence, each sememe of a word strongly corresponds to one or more of its definition words. In other words, in sememe prediction, whether a sememe is selected as a predicted result depends on whether there is strong correspondence between the sememe and some words in the definition. Inspired by this, the SCorP model calculates the final association score of every sememe by max-pooling over the correspondence scores between the sememe and all the definition words:

$$[\mathbf{x}]_j = \max_{1 \le i \le |D_w|} [\mathbf{y}_i]_j. \tag{6}$$

3.4 Retrofitting Operations

Incorporating Sememes of Definition Words. In this subsection, we propose to incorporate the sememe information of definition words into the encoder. On the one hand, sememe information has been proven effective in improving word embeddings (Niu et al., 2017). On the other hand, we find that if a sememe corresponds to a definition word, it usually belongs to the sememe set of that definition word. Still taking the target word “observatory” as an example, its sememe “investigate” semantically corresponds to the definition word “observing”, and at the same time “investigate” is also annotated to “observing” in HowNet. Therefore, the sememes of definition words should be helpful for sememe prediction.

We use a simple but effective method to incorporate the sememe information of definition words, i.e., adding the average of the sememe embeddings to the embedding of the definition word:

$$\mathbf{d}'_i = \mathbf{d}_i + \frac{1}{|S_{d_i}|} \sum_{s_j \in S_{d_i}} \mathbf{s}_j, \tag{7}$$

where $S_{d_i}$ is the sememe set of $d_i$ and $\mathbf{d}'_i$ is the retrofitted embedding of $d_i$. For definition words without sememe annotation, the retrofitted embeddings are equal to their original word embeddings. For simplicity, we use “+SE” to indicate incorporating sememes of definition words.

Integrating Embeddings of Target Words. In this subsection, we present an approach to integrating the embeddings of target words into our model. If a target word has a word embedding, we simply put it and a separator in front of its dictionary definition to obtain the retrofitted definition sequence. We choose the colon (:) as the separator. Formally, for a given target word $w$, the retrofitted definition sequence is $\{w, :, d_1, \dots, d_{|D_w|}\}$. We then feed the embeddings of the elements of the retrofitted definition sequence to the BiLSTM encoder.

As for the target words that have no embedding, we borrow the idea of byte pair encoding from sequence modeling (Sennrich et al., 2016) to deal with them. For a target word without an embedding, we first split it into smaller subwords or characters which have embeddings. Then we add these subwords or characters and a separator to the definition sequence of the target word. We use “+TW” to indicate integrating embeddings of target words.

3.5 Ensemble Model

Our SCorP model utilizes dictionary definitions and word embeddings, while the previous CSP model (Jin et al., 2018) mainly uses the internal character information of target words. Since the two models use different information, we naturally combine them to build an ensemble model. This model scores a sememe by a weighted sum of the sememe scores output by the two submodels.

4 Experiments

4.1 Dataset

We choose HowNet as the sememe KB, which annotates 103,843 Chinese words and 103,390 English words. These words contain 212,539 language-independent senses. Following previous works (Xie et al., 2017; Jin et al., 2018), we disregard the hierarchical structures of sememes in sememe prediction and filter out the low-frequency sememes which appear fewer than 5 times in HowNet. The final number of sememes we use is 1,400. In addition, we concatenate all definitions of a word and use the concatenated definition for training and evaluation.

We evaluate our models on Chinese and English. For Chinese, we use the Modern Chinese Dictionary (6th Edition)¹ as the source of dictionary definitions. It defines about 69,000 Chinese words and characters. We extract dictionary definitions and segment them into words using THULAC (Li and Sun, 2009). The final dataset contains 48,383 words, which simultaneously have sememe annotation, word embeddings, and dictionary definitions. Word embeddings are trained using Skip-gram (Mikolov et al., 2013) on the SogouT corpus². The dataset is randomly divided into non-overlapping training, validation, and test sets in the ratio of 8:1:1.

For English, we use WordNet (Miller, 1995) as the source of dictionary definitions, and use the pre-trained GloVe (Pennington et al., 2014) word embeddings. The dataset has 43,907 words and is also divided in the ratio of 8:1:1. Table 1 shows some statistics of the two datasets.

| | Chinese | English |
|---|---|---|
| #words in dictionary | 69,000 | 147,306 |
| #senses in dictionary | 98,732 | 206,978 |
| #words in HowNet | 103,843 | 103,390 |
| #words with embeddings | 5,606,977 | 400,000 |
| #words in dataset | 48,383 | 43,907 |
| avg. sememes of words | 2.57 | 3.00 |

Table 1: Some statistics of the two datasets.

¹ http://www.cp.com.cn/book/978-7-100-08467-3_44.html
² Sogou-T is a corpus of web pages provided by a Chinese commercial search engine. https://www.sogou.com/labs/resource/t.php

4.2 Experimental Setup

Baseline Methods. We choose the following 4 models as baselines: (1) SPWE (Xie et al., 2017), an embedding-based model, which first finds some similar words for the target word in the embedding space and then recommends the sememes of those similar words; (2) CSP (Jin et al., 2018), an ensemble model utilizing both internal character information, which is employed by its submodels SPWCF and SPCSE, and external word information, which is employed by its submodels SPWE and SPSE (Xie et al., 2017); (3) LD+Seq2Seq (Li et al., 2018), which utilizes dictionary definitions with a reformed sequence to sequence model; (4) MC, the basic sequence to set multi-label classification model in Section 3.2.

Hyper-parameters and Training. For all the models, the dimension of word embeddings is 200. For the SCorP and MC models, the dimension of BiLSTM hidden states is 256 × 2. For all the other baseline methods, the hyper-parameters except the word embedding dimension are tuned to the best on the validation set. For our ensemble model, the weight ratio of correspondence scores is $\lambda_{\mathrm{SCorP}} : \lambda_{\mathrm{CSP}} = 1 : 10$, which is also determined by tuning on the validation set. For training, we use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001. During training, the word embeddings are frozen. Dropout is employed in SCorP and MC.

Evaluation Protocol. Following previous works, we use mean average precision (MAP) and F1 score as evaluation metrics. For a word with $K$ sememes, the model scores all sememes and ranks them by score. If the rankings of the $K$ sememes are $p_1 \le p_2 \le \dots \le p_K$, we can compute MAP by

$$\mathrm{MAP} = \sum_{k=1}^{K} \frac{k}{p_k}. \tag{8}$$

For the F1 score, we set a threshold $\delta$ for all models except LD+Seq2Seq. All sememes with scores higher than $\delta$ compose the output set. $\delta$ is a hyper-parameter and is also tuned to the best on the validation set. We conduct 10-fold cross-validation in experiments.

4.3 Overall Sememe Prediction Performance

Table 2 lists the sememe prediction performance of our models and the baseline methods.

| Models | Chinese MAP | Chinese F1 | English MAP | English F1 |
|---|---|---|---|---|
| SPWE | 55.04 | 48.23 | 40.56 | 38.25 |
| CSP | 58.93 | 50.26 | 42.58 | 37.56 |
| LD+Seq2Seq | 30.49 | 33.28 | 24.63 | 28.70 |
| MC | 51.24 | 42.06 | 49.09 | 39.71 |
| MC(+TW) | 59.15 | 48.77 | 54.56 | 45.34 |
| MC(+SE) | 53.99 | 45.90 | 52.62 | 45.24 |
| MC(+TW,SE) | 60.55 | 50.84 | 56.57 | 48.95 |
| SCorP | 54.95 | 46.89 | 56.17 | 50.41 |
| SCorP(+TW) | 63.46 | 53.07 | 59.70 | 52.89 |
| SCorP(+SE) | 57.57 | 49.99 | 58.28 | 52.83 |
| SCorP(+TW,SE) | 64.65 | 54.62 | 61.53 | 55.22 |
| Ensemble | 69.19 | 57.99 | 63.60 | 56.50 |

Table 2: Overall sememe prediction results of all the models on the test set.

From this table, we can see that:

(1) Our proposed SCorP(+TW,SE) model achieves the best sememe prediction performance compared with all the single models and the previous ensemble model CSP. In addition, our proposed ensemble model achieves state-of-the-art performance.

(2) Our SCorP model and its variants always perform substantially better than the MC model and its corresponding variants. This suggests that a fixed-size definition vector is not enough to contain all the information of the dictionary definition, and that it is necessary to consider all the hidden states of the definition. Moreover, the max-pooling operation can effectively capture the local semantic correspondence. To validate this, we compare max-pooling with mean-pooling and the attention mechanism in SCorP. All three operations can collect information from the hidden states of the BiLSTM. Mean-pooling averages the correspondence scores for every sememe. In the model with the attention mechanism, for every sememe, we use the sememe vector to attend over all the hidden states of the LSTM, and the output of the attention is used to compute the association score of the sememe through a fully connected layer. After replacing max-pooling with mean-pooling and the attention mechanism, the MAP drops from 54.95 to 51.76 and 52.23 respectively. The reason is that a sememe of the target word usually corresponds semantically to only a few words and has nothing to do with most of the definition words. In the models with mean-pooling or the attention mechanism, the association score of a sememe is inevitably affected by all the hidden states of the definition words, and the hidden states of irrelevant words bring a lot of noise.

(3) Both +SE and +TW generally enhance the sememe prediction performance, for both the MC and SCorP models. +TW yields the largest gain, because word embeddings are powerful, and previous works relying only on target word embeddings have already achieved acceptable results.

(4) The LD+Seq2Seq model has relatively poor performance. This indicates that it is not a good idea to impose an inappropriate order on sememes. Furthermore, using character sequences as input is another possible reason for the poor performance, because the ambiguity of characters may hurt the precise representations of definitions.

4.4 Sememe Prediction for OOV Words

The models which rely heavily on target word embeddings do not work or suffer serious performance deterioration when faced with out-of-vocabulary (OOV) words. Our SCorP model can cope with OOV words because it needs no target word embeddings. The word splitting workaround of our +TW operation can also mitigate the OOV problem.

To test the ability of existing models to handle the OOV problem, we expand the original dataset by adding 1,595 OOV words which have no embeddings. Then we evaluate our models and the baseline methods on the expanded dataset. Notice that for CSP, only the two submodels SPWCF and SPCSE can predict sememes for OOV words, and the other two submodels are not used.

| Model | IV | OOV |
|---|---|---|
| CSP | 59.03 | 42.92 |
| LD+Seq2Seq | 29.33 | 28.15 |
| SCorP | 54.44 | 51.05 |
| SCorP(+TW) | 63.10 | 54.30 |
| SCorP(+TW-WS) | 62.97 | 51.20 |

Table 3: Sememe prediction MAPs for in-vocabulary (IV) and out-of-vocabulary (OOV) words of different models. +TW-WS signifies removing word splitting in the +TW operation.

The evaluation results are shown in Table 3. From the table, we can observe that our plain SCorP model and its variant SCorP(+TW) dramatically outperform the two baseline methods, especially for the OOV words. This demonstrates our model’s strong generalization capability in tackling the OOV problem. We also confirm the effectiveness of the word splitting workaround by an ablation experiment. After removing word splitting for target words in +TW (denoted by +TW-WS), the SCorP(+TW-WS) model performs considerably worse in sememe prediction for the OOV words, coming close to the plain SCorP model, which makes no use of target word embeddings.

4.5 Influence of Word Frequency

Table 4 lists the evaluation results of 4 models for words with different frequencies. SCorP(+TW,SE) produces the best performance in all the word frequency ranges, which demonstrates its robustness. CSP, which relies heavily on word embeddings, suffers a sharp performance drop when faced with low-frequency words (word frequency less than 50), although it performs acceptably on high-frequency words. LD+Seq2Seq is less affected by word frequency, but it lags far behind the other models in every word frequency range.

| Word Frequency | ≥5000 | 500-5000 | 50-500 | ≤50 |
|---|---|---|---|---|
| CSP | 62.18 | 59.60 | 46.60 | 26.34 |
| LD+Seq2Seq | 29.31 | 31.83 | 33.61 | 31.93 |
| MC(+TW,SE) | 63.18 | 61.75 | 59.08 | 54.32 |
| SCorP(+TW,SE) | 65.42 | 64.71 | 62.94 | 58.91 |

Table 4: Sememe prediction MAPs for words with different frequency ranges. The numbers of words in the four ranges are 31481, 10714, 4528 and 1660 respectively.

4.6 Case Study

In this section, we conduct a case study on a sememe prediction example to illustrate the effectiveness of our SCorP model in capturing local semantic correspondence.

We again pick the word “observatory”, which has four sememes: “location”, “investigate”, “celestial event” and “celestial body”. Our SCorP model obtains the correspondence matrix measuring the semantic correspondence between each sememe and each definition word. Table 5 shows the six most corresponding sememes and their correspondence scores for each definition word. If a sememe and its correspondence score in a certain row are in boldface, the sememe achieves its highest correspondence score on the definition word in this row. For example, in the row of the definition word “institution”, the sememe “location” is in boldface, which means “location” corresponds with “institution” most closely. In addition, the sememes “investigate” and “celestial body, celestial event” correspond most closely with the words “observe” and “celestial body”, respectively. The semantic correspondence scores between all punctuation marks and all sememes are relatively low. These results demonstrate that our SCorP model can properly capture the local semantic correspondence. The last row lists the predicted sememes with the highest correspondence scores after max-pooling. We can clearly see that the four correct sememes obtain the highest scores and are ranked exactly in the top 4.
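To make the ranking claim concrete, here is a small Python sketch of the per-word ranking score behind Equation 8. It is an illustration under stated assumptions rather than the evaluation script used in the experiments: the function names are hypothetical, and both the division by K and the averaging over words follow the conventional definition of mean average precision, which Equation 8 does not show explicitly.

```python
def average_precision(ranks):
    """Ranking score for one word, given the 1-based ranks of its K gold sememes.

    The sum of k / p_k follows Equation 8. Dividing by K is an ASSUMPTION
    (the conventional normalization); it keeps the score in the range (0, 1].
    """
    ordered = sorted(ranks)  # p_1 <= p_2 <= ... <= p_K
    total = sum(k / p for k, p in enumerate(ordered, start=1))
    return total / len(ordered)


def mean_average_precision(ranks_per_word):
    """Average the per-word scores over all evaluated words (assumed convention)."""
    scores = [average_precision(ranks) for ranks in ranks_per_word]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # "observatory": the four gold sememes are ranked 1, 2, 3 and 4.
    print(average_precision([1, 2, 3, 4]))
    # A hypothetical word whose three gold sememes are ranked 1, 3 and 6.
    print(average_precision([1, 3, 6]))
```

For “observatory”, whose four correct sememes occupy ranks 1 to 4, `average_precision([1, 2, 3, 4])` evaluates to 1.0, the maximum under this normalization.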
| Definition Words | Predicted Sememes |
|---|---|
| 天文台 (observatory) | celestial body/0.20, tell/-2.60, celestial event/-3.26, images/-4.63, light/-4.65, knowledge/-5.00 |
| : | celestial body/-4.37, tell/-5.66, time/-5.74, part/-6.38, morning/-7.08, past/-7.17 |
| 观测 (observe) | investigate/2.37, celestial event/-0.20, far/-1.28, celestial body/-1.35, measurement/-2.19, look/-2.68 |
| 天体 (celestial body) | celestial body/6.16, celestial event/1.68, investigate/0.63, measurement/-2.12, far/-3.40, look/-3.51 |
| 和 (and) | celestial body/-1.95, investigate/-2.66, celestial event/-3.61, find/-4.26, look/-5.32, choose/-5.86 |
| 研究 (research) | investigate/1.78, celestial event/-0.88, celestial body/-1.99, research/-2.27, find/-2.62, knowledge/-3.46 |
| 天文学 (astronomy) | celestial body/2.50, investigate/1.54, celestial event/1.29, knowledge/0.56, daytime/-1.31, earch/-1.85 |
| 的 (of) | part/-5.56, time/-6.31, human/-6.36, place/-7.04, tell/-7.16, head/-7.74 |
| 机构 (institution) | location/1.97, celestial event/-1.32, knowledge/-1.48, celestial body/-2.02, facility/-2.14, tool/-2.90 |
| 。 | time/-1.66, celestial body/-3.02, part/-3.21, investigate/-4.90, celestial event/-4.95, daytime/-5.46 |
| Predicted Sememes | celestial body/6.16, investigate/2.37, location/1.97, celestial event/1.68, knowledge/0.56, far/-1.28 |

Table 5: Sememe prediction case for the word “observatory”. Each row of the first column, except the last row, contains a definition word of “observatory”, and the second column lists the top six sememes and their correspondence scores for that definition word. The last row lists the final predicted sememes and their correspondence scores after max-pooling. Sememes in boldface are the correct sememes.
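The pooling step in the last row of Table 5 can be checked by hand. The following minimal Python sketch applies the max-pooling of Equation 6 to the scores printed in Table 5. The dictionary `TABLE5` and the helpers `max_pool` and `top_k` are illustrative names, not taken from the released code, and the definition words are keyed by their English glosses. Because the table lists only the six highest-scoring sememes per definition word, the pooled values are computed from that truncated view of the correspondence matrix.

```python
# Correspondence scores copied from Table 5 (six listed sememes per definition word).
# Outer keys: English glosses of the definition words; ":" and "。" are punctuation.
TABLE5 = {
    "observatory": {"celestial body": 0.20, "tell": -2.60, "celestial event": -3.26,
                    "images": -4.63, "light": -4.65, "knowledge": -5.00},
    ":": {"celestial body": -4.37, "tell": -5.66, "time": -5.74,
          "part": -6.38, "morning": -7.08, "past": -7.17},
    "observe": {"investigate": 2.37, "celestial event": -0.20, "far": -1.28,
                "celestial body": -1.35, "measurement": -2.19, "look": -2.68},
    "celestial body": {"celestial body": 6.16, "celestial event": 1.68, "investigate": 0.63,
                       "measurement": -2.12, "far": -3.40, "look": -3.51},
    "and": {"celestial body": -1.95, "investigate": -2.66, "celestial event": -3.61,
            "find": -4.26, "look": -5.32, "choose": -5.86},
    "research": {"investigate": 1.78, "celestial event": -0.88, "celestial body": -1.99,
                 "research": -2.27, "find": -2.62, "knowledge": -3.46},
    "astronomy": {"celestial body": 2.50, "investigate": 1.54, "celestial event": 1.29,
                  "knowledge": 0.56, "daytime": -1.31, "earch": -1.85},
    "of": {"part": -5.56, "time": -6.31, "human": -6.36,
           "place": -7.04, "tell": -7.16, "head": -7.74},
    "institution": {"location": 1.97, "celestial event": -1.32, "knowledge": -1.48,
                    "celestial body": -2.02, "facility": -2.14, "tool": -2.90},
    "。": {"time": -1.66, "celestial body": -3.02, "part": -3.21,
          "investigate": -4.90, "celestial event": -4.95, "daytime": -5.46},
}


def max_pool(scores_by_word):
    """Equation 6: for each sememe, keep its best score over all definition words."""
    pooled = {}
    for row in scores_by_word.values():
        for sememe, score in row.items():
            if sememe not in pooled or score > pooled[sememe]:
                pooled[sememe] = score
    return pooled


def top_k(pooled, k):
    """Return the k sememes with the highest pooled scores, best first."""
    return sorted(pooled.items(), key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    for sememe, score in top_k(max_pool(TABLE5), 6):
        print(f"{sememe}\t{score:.2f}")
```

Running the script prints celestial body, investigate, location, celestial event, knowledge and far, in that order, which agrees with the last row of Table 5.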

5 Conclusion and Future Work

In this paper, we discover the particular local semantic correspondence in sememe prediction using dictionary definitions. Accordingly, we propose a novel model which can exploit this property to predict sememes. We also design two useful retrofitting operations: incorporating sememes of definition words and integrating embeddings of target words. Experimental results show that our model with the two retrofitting operations achieves state-of-the-art performance.

We will explore the following research directions in the future: (1) taking into account the hierarchical structures of sememes and allocating context-dependent weights when incorporating sememes of definition words; (2) predicting the hierarchical structure of sememes as well, which is a key step for expanding and updating sememe KBs like HowNet. Moreover, dictionary definitions are suitable for structured sememe prediction because different types of definition words correspond to the sememes in different hierarchical layers; (3) transferring our model to cross-lingual sememe prediction to assist in the building of sememe KBs for other languages.

References

Leonard Bloomfield. 1926. A set of postulates for the science of language. Language, 2(3):153–164.

Tom Bosc and Pascal Vincent. 2018. Auto-encoding dictionary definitions into consistent word embeddings. In Proceedings of EMNLP.

Lei Dang and Lei Zhang. 2010. Method of discriminant for Chinese sentence sentiment orientation based on HowNet. Application Research of Computers, 4:43.

Zhendong Dong and Qiang Dong. 2003. HowNet-a hybrid language and knowledge resource. In Proceedings of International Conference on Natural Language Processing and Knowledge Engineering.

Xiangyu Duan, Jun Zhao, and Bo Xu. 2007. Word sense disambiguation through sememe labeling. In Proceedings of IJCAI.

Xianghua Fu, Guo Liu, Yanyan Guo, and Zhiqiang Wang. 2013. Multi-aspect sentiment analysis for Chinese online social reviews based on topic modeling and HowNet lexicon. Knowledge-Based Systems, 37:186–195.

Yihong Gu, Jun Yan, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and Leyu Lin. 2018. Language modeling with sparse product of sememe experts. In Proceedings of EMNLP.

Felix Hill, Kyunghyun Cho, Anna Korhonen, and Yoshua Bengio. 2016. Learning to understand phrases by embedding the dictionary. Transactions of the Association for Computational Linguistics, 4:17–30.

Minlie Huang, Borui Ye, Yichen Wang, Haiqiang Chen, Junjun Cheng, and Xiaoyan Zhu. 2014. New word detection for sentiment analysis. In Proceedings of ACL.

Huiming Jin, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and Leyu Lin. 2018. Incorporating Chinese characters of words for lexical sememe prediction. In Proceedings of ACL.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Wei Li, Xuancheng Ren, Damai Dai, Yunfang Wu, Houfeng Wang, and Xu Sun. 2018. Sememe prediction: Learning semantic knowledge from unstructured textual wiki descriptions. arXiv preprint arXiv:1808.05437.

Zhongguo Li and Maosong Sun. 2009. Punctuation as implicit annotations for Chinese word segmentation. Computational Linguistics, 35(4):505–512.

Qun Liu and Sujian Li. 2002. Word similarity computing based on HowNet. International Journal of Computational Linguistics and Chinese Language Processing, 7(2):59–76.

Teng Long, Emmanuel Bengio, Ryan Lowe, Jackie Chi Kit Cheung, and Doina Precup. 2017. World knowledge for reading comprehension: Rare entity prediction with hierarchical LSTMs using external descriptions. In Proceedings of EMNLP.

Fuli Luo, Tianyu Liu, Qiaolin Xia, Baobao Chang, and Zhifang Sui. 2018. Incorporating glosses into neural word sense disambiguation. In Proceedings of ACL.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.

George A Miller. 1995. WordNet: a lexical database for English. Communications of the ACM.

Yilin Niu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2017. Improved word representation learning with sememes. In Proceedings of ACL.

James W Pennebaker, Roger J Booth, and Martha E Francis. 2007. Linguistic inquiry and word count: LIWC2007. Mahway: Lawrence Erlbaum Associates.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.

Fanchao Qi, Yankai Lin, Maosong Sun, Hao Zhu, Ruobing Xie, and Zhiyuan Liu. 2018. Cross-lingual lexical sememe prediction. In Proceedings of EMNLP.

Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL.

Sushrut Thorat and Varad Choudhari. 2016. Implementing a reverse dictionary, based on word definitions, using a node-graph architecture. In Proceedings of COLING.

Julien Tissier, Christopher Gravier, and Amaury Habrard. 2017. Dict2vec: Learning word embeddings using lexical dictionaries. In Proceedings of EMNLP.

Anna Wierzbicka. 1996. Semantics: Primes and universals. Oxford University Press, UK.

Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. 2016. Representation learning of knowledge graphs with entity descriptions. In Proceedings of AAAI.

Ruobing Xie, Xingchi Yuan, Zhiyuan Liu, and Maosong Sun. 2017. Lexical sememe prediction via word embeddings and matrix factorization. In Proceedings of IJCAI.

Xiangkai Zeng, Cheng Yang, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. 2018. Chinese LIWC lexicon expansion via hierarchical classification of word embeddings with sememe attention. In Proceedings of AAAI.

Yuntao Zhang, Ling Gong, and Yongcheng Wang. 2005. Chinese word sense disambiguation using HowNet. In Proceedings of International Conference on Natural Computation.

Huaping Zhong, Jianwen Zhang, Zhen Wang, Hai Wan, and Zheng Chen. 2015. Aligning knowledge and text embeddings by entity descriptions. In Proceedings of EMNLP.

Yan-Lan Zhu, Jin Min, Ya-qian Zhou, Xuan-jing Huang, and Li-De Wu. 2006. Semantic orientation computing based on HowNet. Journal of Chinese Information Processing, 20(1):14–20.