Component-Enhanced Chinese Character Embeddings
Yanran Li1, Wenjie Li1, Fei Sun2, and Sujian Li3
1Department of Computing, The Hong Kong Polytechnic University, Hong Kong
2Institute of Computing Technology, Chinese Academy of Sciences, China
3Key Laboratory of Computational Linguistics, Peking University, MOE, China
{csyli, cswjli}@comp.polyu.edu.hk, [email protected], [email protected]

Abstract

Distributed word representations are very useful for capturing semantic information and have been successfully applied in a variety of NLP tasks, especially on English. In this work, we innovatively develop two component-enhanced Chinese character embedding models and their bi-gram extensions. Distinguished from English word embeddings, our models explore the compositions of Chinese characters, which often serve as semantic indicators inherently. The evaluations on both word similarity and text classification demonstrate the effectiveness of our models.

1 Introduction

Due to its advantage over the traditional one-hot representation, distributed word representation has demonstrated its benefit for semantic representation in various NLP tasks. Among the existing approaches (Huang et al., 2012; Levy and Goldberg, 2014; Yang and Eisenstein, 2015), the continuous bag-of-words model (CBOW) and the continuous skip-gram model (SkipGram) remain the most popular ones for building word embeddings efficiently (Mikolov et al., 2013a; Mikolov et al., 2013b). These two models learn the distributed representation of a word based on its context. The context defined by the window of surrounding words may unavoidably include certain less semantically relevant words and/or miss words with important and relevant meanings (Levy and Goldberg, 2014).

To overcome this shortcoming, one line of research exploits the order information of the words in the contexts, either by deriving the contexts from the dependency relations in which the target word participates (Levy and Goldberg, 2014; Yu and Dredze, 2014; Bansal et al., 2014) or by directly keeping the order features (Ling et al., 2015). In another line, Luong et al. (2013) capture morphological composition using neural networks, and Qiu et al. (2014) introduce morphological knowledge as both an additional input representation and an auxiliary supervision signal to the neural network framework. While most previous work focuses on English, there is little work on Chinese. Zhang et al. (2013) extract syntactical morphemes, and Cheng et al. (2014) incorporate POS tags and dependency relations. Basically, the work on Chinese follows the same ideas as that on English.

Distinguished from English, Chinese characters are logograms, of which over 80% are phono-semantic compounds, with a semantic component giving a broad category of meaning and a phonetic component suggesting the sound.1 For example, the semantic component 亻 (human) of the Chinese character 他 (he) provides the meaning connected with human. In fact, the components of most Chinese characters inherently carry certain levels of semantics regardless of the contexts. Given that the components of Chinese characters are finer-grained semantic units, an important question arises before turning to the applications of word embeddings: would it be better to learn the semantic representations from the character components in Chinese?

We approach this question from both the practical and the cognitive points of view. In practice, we expect the representations to be optimized for good generalization. As analyzed above, the components are the more generic units inside Chinese characters that provide semantics. Such inherent information somewhat alleviates the shortcoming of the external contexts. From the cognitive point of view, it has been found that knowledge of semantic components correlates significantly with Chinese word reading and sentence comprehension (Ho et al., 2003).

1 http://en.wikipedia.org/wiki/Radical_(Chinese_characters)
This evidence inspires us to explore novel Chinese character embedding models. Different from word embeddings, character embeddings relate Chinese characters that occur in similar contexts through their component information. Chinese characters convey meanings through their components, and beyond that, the meanings of most Chinese words also take root in their composite characters. For example, the meaning of the Chinese word 摇篮 (cradle) can be interpreted in terms of its composite characters 摇 (sway) and 篮 (basket). Considering this, we further extend the character embeddings from uni-gram models to bi-gram models.

At the core of our work is the exploration of Chinese semantic representations from a novel character-based perspective. Our proposed Chinese character embeddings incorporate the finer-grained semantics from the components of characters and in turn enrich the representations inherently, in addition to utilizing the external contexts. The evaluations on both intrinsic word similarity and extrinsic text classification demonstrate the effectiveness and potential of the new models.

2 Component-Enhanced Character Embeddings

Chinese characters are often composed of smaller, primitive components called radicals or radical-like components, which serve as the most basic units for building character meanings. Dating back to the 2nd century AD, the Han dynasty scholar Shen Xu organized his etymological dictionary shuō wén jiě zì (word and expression) by selecting 540 recurring graphic components that he called bù (meaning "categories"). Bù is nearly the same as what we call radicals today.2 Most radicals are common semantic components. Over time, some original radicals have evolved into radical-like components. Nowadays, a Chinese character often contains exactly one radical (rarely two) and several other radical-like components. In what follows, we refer to both radicals and radical-like components as components.

Distinguished from English, these composite components are unique and inherent features inside Chinese characters. Much of the time, they allow us to reasonably understand or infer the meanings of characters without any context. In other words, the component-level features inherently carry additional information that benefits the semantic representations of characters. For example, we know that the characters 你 (you), 他 (he), 伙 (companion), 侣 (companion), and 们 (people) all have meanings related to human because of their shared component 亻 (human), a variant of the Chinese character 人 (human). This kind of component information is intrinsically different from the contexts derived from dependency relations and POS tags. It motivates us to investigate component-enhanced Chinese character embedding models. While Sun et al. (2014) utilize radical information in a supervised fashion, we build our models in a holistic, unsupervised, and bottom-up way.

It is important to note the variation of a radical inside a character. There are two types of variations. The main type is position-related. For example, the radical of the Chinese character 水 (water) is itself, but it becomes 氵 as the radical of 池 (pool). The original radicals are stretched or squeezed so that they fit into the general square shape of a Chinese character. The second type emerges from the history of character simplification, when traditional characters were converted into simplified characters. For instance, 食 (eat) is written as 飠 when it forms part of some traditional characters, but as 饣 in simplified characters. To cope with these variations and recover the semantics, we match all the radical variants back to their original forms.

2 http://en.wikipedia.org/wiki/Radical_(Chinese_characters)
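As a minimal sketch of this normalization step, a lookup table mapping variant forms back to their original stand-alone forms suffices. The table contents and the function name below are our own illustrative assumptions, not the paper's actual resource; a full mapping would cover every position-related and simplification-related variant.

```python
# Map each radical variant to its original (stand-alone) character form.
# A small illustrative sample, not the full inventory used in the paper.
VARIANT_TO_ORIGINAL = {
    "亻": "人",  # position-related variant of "human"
    "氵": "水",  # position-related variant of "water"
    "飠": "食",  # traditional in-character form of "eat"
    "饣": "食",  # simplified in-character form of "eat"
}

def normalize(components):
    """Map every radical variant in a component list back to its original form."""
    return [VARIANT_TO_ORIGINAL.get(c, c) for c in components]

# E.g., the component list of 池 (pool) contains the variant 氵,
# which is recovered as 水 (water):
print(normalize(["氵", "也"]))  # -> ['水', '也']
```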
We basic units for building character meanings. Dat- extract all the components to build a component ing back to the 2nd century AD, the Han dynasty list for each Chinese character. With the assump- scholar Shen XU organizes his etymological dic- tion that a character’s radical often bring more im- tionary shuō wén jiě zì (word and expression) by portant semantics than the rest3, we regard the rad- selecting 540 recurring graphic components that ical of a character as the first component in its com- he called bù (means “categories”). Bù is nearly ponent list. the same as what we call radicals today2. Most Let a sequence of characters D = fz ; : : : ; z g radicals are common semantic components. Over 1 N denotes a corpus of N characters over the char- time, some original radicals evolve into radical- acter vocabulary V . And z; c; e; K; T; M; jV j de- like components. Nowadays, a Chinese character note the Chinese character, the context charac- often contains exactly one radical (rarely has two) ter, the component list, the corresponding em- and several other radical-like components. In what bedding dimension, the context window size, the follows, we refer to as components both radicals number of components taken into account for and radical-like components. each character, and the vocabulary size, respec- Distinguished from English, these composite tively. We develop two component-enhanced components are unique and inherent features in- character embedding models, namely charCBOW side Chinese characters. A lot of times, they allow 3Inside a character, its radical often serves as the semantic- 2http://en.wikipedia.org/wiki/Radical_ component while its other radical-like components may be (Chinese_characters) phonetics. and charSkipGram. INPUT PROJECTION OUTPUT charCBOW follows the original continuous ci−T bag-of-words model (CBOW) proposed by ci−T ei−T (Mikolov et al., 2013a). We predict the central ci−T +1 ci−T +1 character zi conditioned on a 2(M+1)TK- ei−T +1 dimensional vector that is the concatena- O . z tion of the remaining character-level contexts . i (ci−T ; : : : ; ci−1; ci+1; : : : ; ci+T ) and the compo- ci+T −1 ci+T −1 nents in their component lists. More formally, ei+T −1 ci+T we wish to maximize the log likelihood of all the ci+T characters as follows, ei+T X (a) charCBOW L = log p(zijhi); zn2D i INPUT PROJECTION OUTPUT hi = cat(ci−T ; ei−T ; : : : ; ci+T ; ei+T ) ci−T where hi denotes the concatenation of the ei−T component-enhanced contexts. We make predic- c − tion using a 2KT (M+1)jV j-dimensional matrix i T +1 O. Different from the original CBOW model, ei−T +1 the extra parameter introduced in the matrix O zi . allows us to maintain the relative order of the ci+T −1 components and treat the radical differently from e − the rest components. i+T 1 The development of charSkipGram is straight- ci+T forward. We derive the component-enhanced con- ei+T texts as (hci−T ; ei−T i;:::; hci+T ; ei+T i) based on (b) charSkipGram the central character zi.