arXiv:2001.05954v1 [cs.CL] 16 Jan 2020

Lexical Sememe Prediction using Dictionary Definitions by Capturing Local Semantic Correspondence

Jiaju Du, Fanchao Qi, Maosong Sun, Zhiyuan Liu
Department of Computer Science and Technology, Tsinghua University
Institute for Artificial Intelligence, Tsinghua University
Beijing National Research Center for Information Science and Technology
{djj18,[email protected], {sms,[email protected]

This paper is accepted by Journal of Chinese Information Processing (in Chinese). This version is for English readers.

Abstract

Sememes, defined in linguistics as the minimum semantic units of human languages, have proven useful in many NLP tasks. Since manually constructing and updating sememe knowledge bases (KBs) is costly, the task of automatic sememe prediction has been proposed to assist sememe annotation. In this paper, we explore the approach of applying dictionary definitions to predicting sememes for unannotated words. We find that the sememes of each word are usually semantically matched to different words in its dictionary definition, and we name this matching relationship local semantic correspondence. Accordingly, we propose a Sememe Correspondence Pooling (SCorP) model, which is able to capture this kind of matching to predict sememes. We evaluate our model and baseline methods on HowNet, a famous sememe KB, and find that our model achieves state-of-the-art performance. Moreover, further quantitative analysis shows that our model can properly learn the local semantic correspondence between sememes and words in dictionary definitions, which explains the effectiveness of our model. The source code of this paper can be obtained from https://github.com/thunlp/scorp.

1 Introduction

In linguistics, sememes are defined as the minimum semantic units of human languages (Bloomfield, 1926). Some linguists believe that the meanings of all words can be described with a limited closed set of sememes, which is similar to the idea of semantic primes (Wierzbicka, 1996). However, sememes are normally implicit and cannot be recognized directly. Therefore, people manually annotate words with a set of predefined sememes to construct sememe KBs.

HowNet (Dong and Dong, 2003) is the most famous sememe KB. It defines about 2,000 language-independent sememes and uses them to annotate over 100 thousand Chinese and English words. Every word in HowNet contains some senses, and each sense is annotated with a hierarchical structure of sememes, i.e., a sememe tree. Figure 1 exhibits an example of how words are annotated in HowNet. The word “observatory” has one sense, and this sense has 4 sememes, namely “location”, “investigate”, “celestial body” and “celestial event”. HowNet has been successfully utilized in various NLP applications such as semantic similarity computation (Liu and Li, 2002), word sense disambiguation (Zhang et al., 2005; Duan et al., 2007), sentiment analysis (Zhu et al., 2006; Fu et al., 2013; Dang and Zhang, 2010), language modeling (Gu et al., 2018), word embedding (Niu et al., 2017), and word classification (Zeng et al., 2018).

Figure 1: The annotation of word “observatory” in HowNet and its definition in dictionary. Sememes and definition words with the same superscript are semantically matched.

Since new words emerge constantly and the meanings of existing words keep changing, it is labor-intensive and time-consuming to expand and update sememe KBs like HowNet. To solve this problem, Xie et al. (2017) propose the task of sememe prediction, aiming to automatically recommend related sememes for words without sememe annotation. They also present two simple but effective embedding-based models for the task. These two models ignore the hierarchical structure of sememe trees and predict a set of sememes for every word. Jin et al. (2018) further propose to incorporate the Chinese characters of words into sememe prediction and improve the prediction performance.

These methods rely heavily on the representations of words or characters. As a result, it is hard to predict proper sememes for low-frequency words or words with low-frequency characters because of their poor embeddings. In fact, there are other available resources that can be used in sememe prediction to tackle this challenge. Dictionary definitions clearly explain the meanings of words, including low-frequency words, and they are also easy to obtain. Thus, we believe dictionary definitions are appropriate for sememe prediction. Li et al. (2018) were the first to propose using dictionary definitions in sememe prediction. However, they adopt a sequence-to-sequence model that foists an inappropriate order on sememes. Additionally, they regard definitions as sequences of characters, which can hurt the representations of the definitions owing to the ambiguity of Chinese characters.

In this paper, we propose a novel model for sememe prediction using dictionary definitions. The model not only addresses the issues of existing methods but also can capture local semantic correspondence, a particular kind of matching relationship between sememe trees and definitions. More explicitly, for a given word, we find that each of its sememes can be matched to one or more of its definition words, i.e., words in the dictionary definition. Taking Figure 1 as an example, the sememes “location”, “investigate”, and “celestial body, celestial event” are semantically matched to the words “institution”, “observing, researching”, and “celestial, astronomy” respectively.

The reason for local semantic correspondence is easy to explain. Both a word’s sememe tree and its dictionary definition express the same meaning. The sememe tree of “observatory” represents the meaning of “location of investigating celestial events and celestial bodies”, which is almost identical to that of the dictionary definition. Furthermore, sememe trees and dictionary definitions can be decomposed into sememes and definition words respectively. Consequently, it is natural to see that there is local semantic correspondence between a word’s sememes and definition words.

To take advantage of local semantic correspondence, we propose the Sememe Correspondence Pooling (SCorP) model. SCorP computes the correspondence scores between all sememes and the words in a definition, and performs max-pooling over the correspondence scores for every sememe. Moreover, following previous works (Xie et al., 2017; Jin et al., 2018; Li et al., 2018), the SCorP model ignores the hierarchical structures of sememes in sememe prediction, and it uses the framework of sequence-to-set multi-label classification to avoid imposing an order on sememes. The inputs to the SCorP model are word sequences rather than character sequences, which remedies the ambiguity problem of characters and at the same time enables utilizing the sememe information of the definition words. In addition, we propose two effective retrofitting operations for SCorP which improve the prediction performance. In our experiments, we evaluate the sememe prediction performance of our SCorP model and several baselines. Experimental results show that our model achieves state-of-the-art performance. We also conduct further quantitative analysis and find that our model can correctly match definition words to sememes, which explains the superiority of our model.

To conclude, the contributions of this paper are twofold: (1) we discover local semantic correspondence, a specific kind of semantic matching between a word’s sememe tree and definition; (2) we propose a novel model which utilizes local semantic correspondence to predict sememes, and it achieves state-of-the-art performance in sememe prediction.

2 Related Work

Applications of Sememes. Sememes have been widely used in NLP tasks, such as semantic similarity computation (Liu and Li, 2002), word sense disambiguation (Zhang et al., 2005; Duan et al., 2007), and sentiment analysis (Fu et al., 2013; Dang and Zhang, 2010; Zhu et al., 2006; Huang et al., 2014). In addition, Niu et al. (2017) incorporate sememes into word representation learning and utilize sememes to capture the exact meaning of a word within specific contexts. Zeng et al. (2018) use sememe knowledge with the attention mechanism to determine the type of a word and expand the Chinese LIWC lexicon (Pennebaker et al., 2007). Gu et al. (2018) propose to regard sememes as linguistic experts to help predict suitable words in language modeling.

Sememe Prediction. Xie et al. (2017) present the automatic sememe prediction task for the first time. They also propose a collaborative filtering-based model SPWE and a matrix factorization-based model SPSE for this task and achieve acceptable sememe prediction results. Following this work, Jin et al. (2018) propose two models, SPWCF and SPCSE, which can leverage the internal character information of Chinese words, as well as an ensemble model CSP which considers both internal and external information. Li et al. (2018) first explore the role of dictionary descriptions in sememe prediction.

[…] our SCorP model is based. Next, we present our SCorP model and two different retrofitting operations in detail. Finally, we succinctly describe an ensemble model.

3.1 Terms and Notations

As mentioned previously, we use “definition word” to refer to a word in dictionary definitions, and we use “target word” to signify the word for which we want to predict sememes. We define $W$ as the vocabulary set and $S$ as the set of sememes. Given a word $w \in W$, all of its sememes form a set $S_w = \{s_1, \ldots, s_{|S_w|}\} \subseteq S$, where $|\cdot|$ represents the cardinality of a set. The dictionary definition of word $w$ is denoted by $D_w = \{d_1, \ldots, d_{|D_w|}\}$, where $d_i$ is its $i$-th definition word. We use lowercase boldface symbols for vectors and uppercase boldface symbols for matrices. For example, $\mathbf{d}_i$ is the word vector of $d_i$, $\mathbf{s}_i$ is the sememe vector of $s_i$, and $\mathbf{D}_w$ is the matrix composed of the word vectors $\mathbf{d}_1, \ldots, \mathbf{d}_{|D_w|}$.

3.2 Basic Multi-label Classification
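As a rough illustration of the notation in Section 3.1 and of the correspondence pooling idea outlined in the Introduction, the sketch below scores every sememe vector against every definition word vector and keeps the maximum per sememe. It is only a minimal sketch under stated assumptions: the dot product is a stand-in for the correspondence score (this section does not specify the scoring function), the function names `dot` and `correspondence_pooling` are hypothetical, and the three-dimensional vectors are hand-made toy values rather than learned embeddings.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    # Assumption: a plain dot product stands in for the correspondence score.
    return sum(a * b for a, b in zip(u, v))


def correspondence_pooling(
    sememe_vecs: Dict[str, Vector],
    definition: List[str],
    word_vecs: Dict[str, Vector],
) -> Dict[str, Tuple[float, str]]:
    """Score each sememe against each definition word, then max-pool per sememe."""
    pooled = {}
    for sememe, s_vec in sememe_vecs.items():
        # One score per definition word d_i in D_w.
        scored = [(dot(s_vec, word_vecs[d]), d) for d in definition]
        # Max-pooling: keep the best score and the word that produced it.
        pooled[sememe] = max(scored, key=lambda pair: pair[0])
    return pooled


# Toy, hand-made vectors (hypothetical values, not learned embeddings).
word_vecs = {
    "institution": [1.0, 0.0, 0.0],
    "observing": [0.0, 1.0, 0.0],
    "celestial": [0.0, 0.0, 1.0],
}
sememe_vecs = {
    "location": [0.9, 0.1, 0.0],
    "investigate": [0.1, 0.8, 0.1],
    "celestial body": [0.0, 0.2, 0.9],
}
definition = ["institution", "observing", "celestial"]  # D_w for a target word w

pooled = correspondence_pooling(sememe_vecs, definition, word_vecs)
for sememe, (score, word) in pooled.items():
    print(f"{sememe!r} best matches {word!r} (score {score:.2f})")
```

With these toy vectors, each sememe is paired with the definition word it was constructed to resemble, mirroring the matching shown for “observatory” in Figure 1. In a trained model the pooled scores would come from learned representations, and how they are turned into sememe predictions is left to the model description.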
