VCWE: Visual Character-Enhanced Word Embeddings

Chi Sun, Xipeng Qiu∗, Xuanjing Huang
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
825 Zhangheng Road, Shanghai, China
{sunc17,xpqiu,xjhuang}@fudan.edu.cn

Abstract

Chinese is a logographic writing system, and the shapes of Chinese characters contain rich syntactic and semantic information. In this paper, we propose a model to learn Chinese word embeddings via three-level composition: (1) a convolutional neural network to extract the intra-character compositionality from the visual shape of a character; (2) a recurrent neural network with self-attention to compose character representations into word embeddings; (3) the Skip-Gram framework to capture non-compositionality directly from the contextual information. Evaluations demonstrate the superior performance of our model on four tasks: word similarity, sentiment analysis, named entity recognition and part-of-speech tagging.1

∗ Corresponding author.
1 The source codes are available at https://github.com/HSLCY/VCWE

1 Introduction

Distributed representations of words, namely word embeddings, encode both semantic and syntactic information into a dense vector. Currently, word embeddings have been playing a pivotal role in many natural language processing (NLP) tasks. Most of these NLP tasks also benefit from pre-trained word embeddings, such as word2vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014), which are based on the distributional hypothesis (Harris, 1954): words that occur in the same contexts tend to have similar meanings. Earlier word embeddings often take a word as a basic unit; they ignore the compositionality of its sub-word information such as morphemes and character n-grams, and cannot competently handle rare words. To improve the performance of word embeddings, sub-word information has been employed (Luong et al., 2013; Qiu et al., 2014; Cao and Rei, 2016; Sun et al., 2016a; Wieting et al., 2016; Bojanowski et al., 2016).

Compositionality is even more critical for Chinese, since Chinese is a logographic writing system. In Chinese, each word typically consists of fewer characters, and each character also contains richer semantic information. For example, the Chinese character “休” (rest) is composed of the characters for “人” (person) and “木” (tree), with the intended idea of someone leaning against a tree, i.e., resting.

Based on these linguistic features of Chinese, recent methods have used character information to improve Chinese word embeddings. These methods can be categorized into two kinds:

1) One kind of method learns word embeddings with a word’s constituent characters (Chen et al., 2015), radicals2 (Shi et al., 2015; Yin et al., 2016; Yu et al., 2017) or strokes3 (Cao et al., 2018). However, these methods usually use simple operations, such as averaging and n-grams, to model the inherent compositionality within a word, which is not enough to handle the complicated linguistic compositionality.

2 The graphical component of Chinese, referring to https://en.wikipedia.org/wiki/Radical_(Chinese_characters)
3 The basic pattern of Chinese characters, referring to https://en.wikipedia.org/wiki/Stroke_(CJKV_character)

2) The other kind of method learns word embeddings with the visual information of the characters. Liu et al. (2017) learn character embeddings based on their visual characteristics in a text classification task. Su and Lee (2017) also introduce a pixel-based model that learns character features from font images. However, their model is not shown to be better than word2vec, because it has little flexibility and uses fixed character features.

Besides, most of these methods pay little attention to non-compositionality. For example, the semantics of the Chinese word “沙发” (sofa) cannot be composed from its contained characters “沙” (sand) and “发” (hair).

In this paper, we fully consider the compositionality and non-compositionality of Chinese words and propose a visual character-enhanced word embedding (VCWE) model to learn Chinese word embeddings. VCWE learns Chinese word embeddings via three-level composition:

• The first level learns the intra-character composition, which obtains the representation of each character from its visual appearance via a convolutional neural network;

• The second level learns the inter-character composition, where a bidirectional long short-term memory network (Bi-LSTM) (Hochreiter and Schmidhuber, 1997) with self-attention composes the character representations into a word embedding;

• The third level learns the non-compositionality: the contextual information can be learned because the overall framework of our model is based on Skip-Gram.

Evaluations demonstrate the superior performance of our model on four tasks: word similarity, sentiment analysis, named entity recognition and part-of-speech tagging.

2 Related Work

In the past decade, there has been much research on word embeddings. Bengio et al. (2003) use a feedforward neural network language model to predict the next word given its history. Later methods (Mikolov et al., 2010) replace the feedforward neural network with a recurrent neural network for further exploration. The most popular word embedding system is word2vec, which uses continuous-bag-of-words and Skip-gram models, in conjunction with negative sampling for efficient conditional probability estimation (Mikolov et al., 2013a).

A different way to learn word embeddings is through factorization of word co-occurrence matrices, such as GloVe embeddings (Pennington et al., 2014), which have been shown to be intrinsically linked to Skip-gram and negative sampling (Levy and Goldberg, 2014).

The models mentioned above are popular and useful, but they regard individual words as atomic tokens, and the potentially useful internal structured information of words is ignored. To improve the performance of word embeddings, sub-word information has been employed (Luong et al., 2013; Qiu et al., 2014; Cao and Rei, 2016; Sun et al., 2016a; Wieting et al., 2016; Bojanowski et al., 2016). These methods focus on alphabetic writing systems, and they are not directly applicable to logographic writing systems.

For Chinese, a logographic writing system, research on word embeddings has gradually emerged. These methods focus on making full use of sub-word information. Chen et al. (2015) design a CWE model for jointly learning Chinese character and word embeddings. Based on the CWE model, Yin et al. (2016) present a multi-granularity embedding (MGE) model, additionally using the embeddings associated with the radicals detected in the target word. Xu et al. (2016) propose a similarity-based character-enhanced word embedding (SCWE) model, exploiting the similarity between a word and its component characters with semantic knowledge obtained from other languages. Shi et al. (2015) utilize radical information to improve Chinese word embeddings. Yu et al. (2017) introduce a joint learning word embedding (JWE) model, and Cao et al. (2018) represent Chinese words as sequences of strokes and learn word embeddings with stroke n-gram information.

From another perspective, Liu et al. (2017) provide a new way to automatically extract character-level features, creating an image for the character and running it through a convolutional neural network to produce a visual character embedding. Su and Lee (2017) also introduce a pixel-based model that learns character features from images.

Chinese word embeddings have recently begun to be explored, and have so far shown great promise. In this paper, we propose a visual character-enhanced word embedding (VCWE) model that can learn Chinese word embeddings from a corpus and images of characters. The model combines the semantic information of the context with the image features of the characters, with superior performance on several benchmarks.

3 Proposed Model

In this section, we introduce the visual character-enhanced word embedding (VCWE) model for Chinese word representation.

[Figure 1: The overall architecture of our approach. (a) The intra-character CNN: two blocks of 3×3 convolution (32 filters), 2×2 max pooling and batch normalization, followed by a linear layer, batch normalization and a ReLU projection to the output. (b) The Bi-LSTM with self-attention over the character representations c1, c2, c3 and hidden states h1, h2, h3. (c) The Skip-gram objective.]
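For concreteness, panel (a) can be read as the following minimal PyTorch sketch of the character-image CNN (described in Section 3.1 below). Only the 3×3/32-filter convolutions, 2×2 max pooling, batch normalization, linear layer and ReLU come from the figure; the embedding dimension, the padding, and the exact ordering of pooling and normalization inside each block are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Intra-character composition: a 40x40 glyph image -> character embedding.

    A minimal sketch of the stack in Figure 1(a); the embedding size and the
    padding are assumptions, only the 3x3/32-filter convolutions, 2x2 max
    pooling, batch normalization and the final linear projection come from
    the paper.
    """

    def __init__(self, emb_dim: int = 100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # stroke-level features
            nn.MaxPool2d(2),                              # 40x40 -> 20x20
            nn.BatchNorm2d(32),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),  # radical/component-level features
            nn.MaxPool2d(2),                              # 20x20 -> 10x10
            nn.BatchNorm2d(32),
        )
        self.project = nn.Sequential(
            nn.Linear(32 * 10 * 10, emb_dim),
            nn.BatchNorm1d(emb_dim),
            nn.ReLU(),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 1, 40, 40), mean-centered glyph images
        x = self.features(images)
        return self.project(x.flatten(1))

# e.g. glyphs = torch.randn(8, 1, 40, 40); CharCNN()(glyphs).shape == (8, 100)
```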

Given a Chinese word w consisting of n characters c1, · · · , cn, its semantics may come from either its contained characters or its contexts. Therefore, we use a two-level hierarchical composition to compose the word embedding, which is further learned according to its context.

The overall architecture of our approach is shown in Figure 1. We first use a convolutional neural network (CNN) to model the intra-character compositionality of a character from its visual shape information, and we use the output of the CNN as the embedding of the character. The character embeddings are then used as the input of a bidirectional LSTM network to model the inter-character compositionality. After a self-attention layer, we obtain the representation of the word. Finally, based on the Skip-Gram framework, we learn the word embeddings with the visual character-enhanced embedding of the context.

3.1 Intra-Character Composition

Since the shape of a Chinese character provides rich syntactic and semantic information, the representation of a character can be composed from its intrinsic visual components. Following the success of the convolutional neural network (CNN) (LeCun et al., 1995) in computer vision, we use a CNN to directly model the natural composition of a character from its image.

We first convert each character into an image of size 40 × 40, and a deep CNN is used to fully fuse its visual information. The specific structure of the CNN is shown in Figure 1(a); it consists of two convolution layers and one linear layer. Each convolution layer is followed by a max pooling layer and a batch normalization layer. The lower layers aim to capture the stroke-level information, and the higher layers aim to capture the radical-level and component-level information.

The output of the CNN can be regarded as the representation of the character. The character representation built from its visual information can fully capture the intrinsic syntactic and semantic information through the intra-character compositionality. The parameters of the CNN are learned through backpropagation in an end-to-end fashion.

3.2 Inter-Character Composition

After obtaining the representations of the characters, we combine them into a word embedding. The word embedding needs to fully capture the character-level compositionality. Here, we use a bidirectional LSTM (Bi-LSTM) (Hochreiter and Schmidhuber, 1997) with self-attention to fuse the inter-character information of a word. The structure of our Bi-LSTM with self-attention is shown in Figure 1(b).

Given a word w consisting of n characters c1, · · · , cn, we use e1, · · · , en to denote the character representations, which are the output of the CNN rather than randomly initialized. The word w is firstly encoded using a Bi-LSTM:

    h_i^F = LSTM(h_{i-1}^F, e_i),            (1)
    h_i^B = LSTM(h_{i+1}^B, e_i),            (2)
    h_i = [h_i^F; h_i^B],                    (3)
    H = [h_1, h_2, ..., h_n],                (4)

where h_i is the hidden state of the i-th character in w. Then we use self-attention to obtain the inter-character compositionality. Following the self-attention proposed by Lin et al. (2017), we compute an attention vector α:

    α = softmax(v^T tanh(U h_i^T)),          (5)

where v and U are learnable weight parameters. Finally, the representation of word w is:

    m = Σ_{i=1}^{n} α_i h_i.                 (6)

Since the Bi-LSTM hidden state of each character differs according to its context, we believe the hidden states can capture both the compositional and non-compositional relations of the characters within a word.

After obtaining the word representation, Skip-Gram (Mikolov et al., 2013a) is used to learn the word embedding with its context information. Skip-Gram is a useful framework for learning word vectors, which aims to predict context words given a target word in a sentence. Given a pair of words (w, c), we denote p(c|w) as the probability that the word c is observed in the context of the target word w. With the negative-sampling approach, Skip-Gram formulates this probability as:

    p(D = 1|w, c) = σ(w^T c),                (7)

where w and c are the embedding vectors of w and c respectively, and σ is the sigmoid function. The probability of not observing word c in the context of w is given by:

    p(D = 0|w, c) = 1 − σ(w^T c).            (8)
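To make Equations (1)–(6) concrete, here is a minimal PyTorch sketch of the inter-character composition. Setting the LSTM hidden size to half the embedding dimension keeps m the same size as the lookup embeddings; this choice and the attention size are our assumptions, not specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterCharComposition(nn.Module):
    """Compose the CNN character embeddings e_1..e_n of one word into a word
    vector m (Equations 1-6). Hidden and attention sizes are assumptions."""

    def __init__(self, emb_dim: int = 100, attn_dim: int = 100):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, emb_dim // 2, batch_first=True,
                              bidirectional=True)          # h_i = [h_i^F; h_i^B]
        self.U = nn.Linear(emb_dim, attn_dim, bias=False)   # U in Eq. (5)
        self.v = nn.Linear(attn_dim, 1, bias=False)         # v in Eq. (5)

    def forward(self, char_embs: torch.Tensor) -> torch.Tensor:
        # char_embs: (batch, n, emb_dim), the CNN outputs for the n characters
        H, _ = self.bilstm(char_embs)                        # (batch, n, emb_dim)
        scores = self.v(torch.tanh(self.U(H))).squeeze(-1)   # v^T tanh(U h_i)
        alpha = F.softmax(scores, dim=-1)                    # attention weights alpha_i
        m = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)      # m = sum_i alpha_i h_i
        return m                                             # visually enhanced word embedding
```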
4 Training

4.1 Objective Function

Given the target word w, let c be its context word and c̃_1, ..., c̃_k be k negative words. The word w is a word selected from a sentence in the corpus, the context c is a nearby word within a window of size l, and each negative sample c̃_i is a word randomly sampled at a certain frequency from the vocabulary. The loss function of the VCWE model is as follows:

    L = L1 + L2,                                            (9)
    L1 = log σ(w^T c) + Σ_{i=1}^{k} log σ(−w^T c̃_i),        (10)
    L2 = log σ(w^T m_c) + Σ_{i=1}^{k} log σ(−w^T m̃_i),      (11)

where w is the lookup embedding of the target word; c and c̃_i are the lookup embeddings of the context and negative words respectively; and m_c and m̃_i are the visually enhanced word embeddings of the context and negative words respectively.

Here, we use the visually enhanced word embedding as the representation of the context word instead of the target word. The final embedding of the target word is thus indirectly affected by the visual information. As a result, the final word embedding has the advantage of fully utilizing the intra-character compositionality from the CNN, the inter-character compositionality from the LSTM, and the context information from Skip-Gram.

4.2 Word Sampling

We use a word sampling scheme similar to the implementation in word2vec (Mikolov et al., 2013a,b) to balance the importance of frequent words and rare words. Frequent words such as “的” (of), “是” (is) and “这” (this) are not as meaningful as relatively less frequent words such as “猫” (cat), “喜欢” (like) and “水果” (fruit). To improve the performance of the word embeddings, we use subsampling (Mikolov et al., 2013b) to discard a word w with probability

    P(w) = 1 − sqrt(t / f(w))

when generating a batch, where f(w) is the frequency of word w and t is a chosen threshold, typically around 10^−5.

To generate negative context words, we sample each word w according to the distribution P(w) ∝ U(w)^{3/4}, where U(w) is the unigram distribution, i.e., the frequency of single words appearing in the corpus. This method also plays a role in reducing the frequency of occurrence of high-frequency words.
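The objective in Equations (9)–(11) and the sampling scheme above can be sketched as follows. The tensor shapes and the NumPy vocabulary interface are illustrative assumptions, and the loss is written as a quantity to minimize (the negative of L).

```python
import torch
import torch.nn.functional as F
import numpy as np

def vcwe_loss(w, c, c_neg, m_c, m_neg):
    """Equations (9)-(11), negated so it can be minimized.

    w:      (batch, d)     lookup embedding of the target word
    c:      (batch, d)     lookup embedding of the context word
    c_neg:  (batch, k, d)  lookup embeddings of k negative words
    m_c:    (batch, d)     visually enhanced embedding of the context word
    m_neg:  (batch, k, d)  visually enhanced embeddings of the negatives
    """
    def nce(pos, neg):
        pos_score = F.logsigmoid((w * pos).sum(-1))                     # log sigma(w^T c)
        neg_score = F.logsigmoid(-(neg @ w.unsqueeze(-1)).squeeze(-1))  # log sigma(-w^T c~_i)
        return pos_score + neg_score.sum(-1)

    l1 = nce(c, c_neg)
    l2 = nce(m_c, m_neg)
    return -(l1 + l2).mean()

def keep_probability(freq: np.ndarray, t: float = 1e-5) -> np.ndarray:
    """Subsampling: discard w with probability P(w) = 1 - sqrt(t / f(w))."""
    return np.minimum(1.0, np.sqrt(t / freq))

def negative_sampling_distribution(unigram_counts: np.ndarray) -> np.ndarray:
    """Sample negatives proportionally to U(w)^(3/4)."""
    probs = unigram_counts ** 0.75
    return probs / probs.sum()
```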
5 Experiments

5.1 Preparation for training data

We download the Chinese Wikipedia dump4 of May 20, 2018, which consists of 278K Chinese Wikipedia articles. We use the WikiExtractor toolkit5 to convert the data from XML into text format. We find that the corpus consists of both simplified and traditional Chinese characters, hence we utilize the opencc toolkit6 to normalize all characters to simplified Chinese. We remove non-Chinese characters such as punctuation marks by retaining only the characters whose Unicode code points fall into the range between 0x4E00 and 0x9FA5. We use THULAC7 (Sun et al., 2016b) for word segmentation.

We discard words that appear less than 100 times and obtain a vocabulary of size 66,856. We count the frequency of occurrence of each word to prepare for the subsampling step.

From these 66,856 words, we extract 5,030 unique characters. We use a Chinese character image generation software to generate the images of these Chinese characters. We subtract a mean image from each input image to center it before feeding it into the CNN. The pre-processed Chinese character images are shown in Figure 2.

[Figure 2: The pre-processed Chinese character images.]

4 https://dumps.wikimedia.org/zhwiki/20180520/
5 https://github.com/attardi/wikiextractor/blob/master/WikiExtractor.py
6 https://github.com/BYVoid/OpenCC
7 https://github.com/thunlp/THULAC-Python
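A sketch of this preprocessing pipeline under the stated tools. The OpenCC configuration name and the THULAC call follow those toolkits' commonly documented Python usage, but exact constructor arguments may differ across versions; the function names and I/O layout here are hypothetical.

```python
import re
from collections import Counter

import opencc   # Python bindings for OpenCC (https://github.com/BYVoid/OpenCC)
import thulac   # THULAC-Python (https://github.com/thunlp/THULAC-Python)

# Keep only CJK characters in the range U+4E00..U+9FA5, as described above.
CHINESE_ONLY = re.compile(r'[^\u4e00-\u9fa5]')

converter = opencc.OpenCC('t2s')            # traditional -> simplified Chinese
segmenter = thulac.thulac(seg_only=True)    # word segmentation only, no POS tags

def preprocess_line(line: str) -> list[str]:
    line = converter.convert(line)          # normalize to simplified Chinese
    line = CHINESE_ONLY.sub(' ', line)      # drop punctuation, Latin letters, digits, ...
    return segmenter.cut(line, text=True).split()

def build_vocab(lines, min_count: int = 100) -> Counter:
    """Count word frequencies and drop words seen fewer than min_count times."""
    counts = Counter(w for line in lines for w in preprocess_line(line))
    return Counter({w: c for w, c in counts.items() if c >= min_count})
```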

5.2 Hyperparameters

Models used for evaluation have dimension D = 100 and use a context window l = 5 unless stated otherwise. We use the threshold t = 10^−5 for subsampling, which is the recommended value for word2vec Skip-gram (Mikolov et al., 2013a) on large datasets. The number of negative samples per word is 5. We use mini-batch asynchronous gradient descent with Adam (Kingma and Ba, 2014). The initial learning rate is 0.001.

5.3 Baselines

We compare our model to the following open-source state-of-the-art models:

• word2vec8 (Mikolov et al., 2013a) is arguably the most popular word embedding, which uses continuous-bag-of-words (CBOW) and Skip-gram models. We train word2vec with both the Skip-gram and CBOW models. We did not train GloVe (Pennington et al., 2014) because it did not perform well in many previous Chinese word embedding papers.

• CWE9 (Chen et al., 2015) is character-enhanced word embeddings, which introduce internal character information into word embedding methods to alleviate excessive reliance on external information.

• GWE10 (Su and Lee, 2017) is a pixel-based Chinese word embedding model, which exploits character features from font images by convolutional autoencoders.

• JWE11 (Yu et al., 2017) is a model that jointly learns the embeddings of Chinese words, characters, and sub-character components.

For a fair comparison between the different algorithms, we use the same corpus and the same hyperparameters mentioned in the previous subsections.

8 https://code.google.com/archive/p/word2vec/
9 https://github.com/Leonard-Xu/CWE
10 https://github.com/ray1007/gwe
11 https://github.com/hkust-knowcomp/jwe

5.4 Word Similarity Task

We evaluate our embeddings on the Chinese word similarity datasets wordsim-240 and wordsim-296 provided by Chen et al. (2015). Besides, we translate two English word similarity datasets, MC-30 (Miller and Charles, 1991) and RG-65 (Rubenstein and Goodenough, 1965), to Chinese12.

12 https://github.com/FudanNLP/VCWE

Model      WS-240  WS-296  MC-30  RG-65  avg    ∆
Skip-gram  50.23   56.94   69.66  59.86  59.17  -
CBOW       51.49   61.01   68.97  63.85  61.33  +2.16
CWE        52.63   58.98   68.82  59.60  60.01  +0.84
GWE        52.74   58.22   68.23  60.74  59.98  +0.81
JWE        51.92   59.84   70.27  62.83  61.22  +2.05
VCWE       57.81   61.29   72.77  70.62  65.62  +6.45
 -CNN      55.82   59.60   66.87  68.53  62.71  +3.54
 -LSTM     58.13   60.85   68.03  69.78  64.20  +5.03

Table 1: Spearman correlation for the word similarity datasets. “-CNN” denotes replacing the CNN and the image information with randomly initialized character embeddings; “-LSTM” denotes replacing the Bi-LSTM network and self-attention with the averaging operation. For each dataset, we boldface the score with the best performance across all models.

Model      NOTEBOOK  CAR    CAMERA  PHONE  ALL    avg    ∆
Skip-gram  69.84     77.12  80.80   81.25  86.65  79.13  -
CBOW       74.60     75.42  82.59   82.81  84.07  79.90  +0.77
CWE        73.02     80.51  81.25   81.25  82.09  79.62  +0.49
GWE        74.60     78.81  79.46   83.98  83.92  80.15  +1.02
JWE        77.78     78.81  81.70   81.64  85.13  81.01  +1.88
VCWE       80.95     85.59  83.93   84.38  88.92  84.75  +5.62
 -CNN      84.13     81.36  81.70   83.69  84.22  83.02  +3.89
 -LSTM     79.37     80.51  80.36   84.38  85.58  82.04  +2.91

Table 2: Accuracy for the sentiment analysis task. The configurations are the same as the ones used in Table 1.

Each dataset contains a list of word pairs with a human score of how related or similar the two words are. We calculate the Spearman correlation (Spearman, 1904) between the labels and the scores generated by the embeddings. The Spearman correlation is a rank-based correlation measure that assesses how well the scores describe the true labels. The evaluation results of our model and the baseline methods on the word similarity datasets are shown in Table 1.

From the results, we can see that VCWE outperforms the other baseline models. The effect of CBOW is much better than that of Skip-gram. The impacts of GWE and CWE are relatively close. The JWE model works better than the other benchmark models. In the VCWE model, when we remove the CNN and the image information, the result falls by 2.91; when we replace the Bi-LSTM network and self-attention with the averaging operation, the result drops by 1.42. In the last subsection, we qualitatively analyze the word similarity results of the different models.
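As a concrete picture of this protocol, the following sketch scores each word pair by cosine similarity and reports Spearman's ρ against the human ratings; the data structures and the out-of-vocabulary handling are assumptions on our part.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_similarity(embeddings: dict[str, np.ndarray], pairs) -> float:
    """pairs: iterable of (word1, word2, human_score), e.g. from wordsim-240.

    Returns the Spearman correlation between model scores and human scores,
    skipping pairs with out-of-vocabulary words.
    """
    model_scores, human_scores = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(score)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```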

5.5 Sentiment Analysis Task

To evaluate the quality of our vectors regarding semantics, we use the datasets13 collected by Peng et al. (2018), which contain Chinese reviews in four domains: notebook, car, camera, and phone. They manually labeled the sentiment polarity towards each aspect target as either positive or negative, so it is a binary classification task. Similar to how we process the training data, we remove non-Chinese characters and use THULAC for Chinese word segmentation. We build classifiers with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) network with self-attention (Lin et al., 2017). We use the standard training/dev/test split and report accuracy using the embeddings generated by the different methods in Table 2.

As shown in Table 2, Skip-gram performs well on the combination of the four groups, but it does not perform well on any particular group. JWE outstrips the other baseline methods by around 1.1 points. The VCWE model achieves outstanding results in the car, camera and phone categories, with an accuracy rate at least 3 points higher than the other models, indicating that training word embeddings with visual character-level features can achieve better results on downstream tasks.

13 http://sentic.net/chinese-review-datasets.zip

5.6 Named Entity Recognition Task

We evaluate our model on the named entity recognition (NER) task. We use an open-source Chinese NER model14 to test our word embeddings on the MSRA dataset. MSRA is a dataset for simplified Chinese NER; it comes from the SIGHAN 2006 shared task for Chinese NER (Levow, 2006). We pre-train word embeddings with the different models and feed them into the input layer as features.

The key to this task is to extract named entities and their associated types. Better word embeddings should yield a higher F1 score for NER. The results in Table 3 show that our model also outperforms the baseline models on this task. The performance of the CWE and GWE models is similar, both slightly lower than the Skip-gram and CBOW models. The F1 score of the JWE model exceeds that of the other baseline models and is similar to that of our model. Even when removing the CNN and the image information, our LSTM with self-attention model can still achieve the best results on this task, indicating that the learned inter-character composition is practical.

14 https://github.com/bamtercelboo/pytorch_NER_PosTag_Bi_LSTM_CRF

Model      NER Prec.  NER Recall  NER F1  POS Acc
Skip-gram  85.30      84.18       84.74   95.87
CBOW       85.64      82.98       84.29   95.79
CWE        83.89      82.57       83.23   95.45
GWE        84.06      82.52       83.28   95.45
JWE        85.74      84.87       85.30   95.91
VCWE       86.93      84.64       85.77   96.00
 -CNN      86.73      84.83       85.77   95.92
 -LSTM     85.98      84.53       85.25   95.96

Table 3: Chinese NER and POS tagging results for different pretrained embeddings. The configurations are the same as the ones used in Table 1.

5.7 Part-of-speech Tagging Task

The evaluation is performed on PKU's People's Daily15 (PPD) (Yu et al., 2001) with the standard training/dev/test split. The model is trained with the bidirectional LSTM using the same hyper-parameters. POS tagging accuracy on the test set is reported in Table 3.

The gap between the usage of the different embeddings is not significant, and our model achieves the best results with a slight advantage.

15 http://klcl.pku.edu.cn/zygx/zyxz/index.htm
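Sections 5.5–5.7 share the same recipe: the pretrained embeddings are fed into the input layer of a Bi-LSTM classifier or tagger. The skeleton below illustrates that setup; the hidden size, the pooling choice, and the omission of the CRF and self-attention layers used by the cited implementations are simplifying assumptions.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Downstream model skeleton: pretrained word embeddings -> Bi-LSTM -> labels.

    `pretrained` is a (vocab_size, emb_dim) tensor produced by VCWE or any
    baseline; num_labels is 2 for sentiment, or the tag-set size for NER/POS.
    """

    def __init__(self, pretrained: torch.Tensor, hidden: int = 128,
                 num_labels: int = 2, freeze: bool = True):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=freeze)
        self.encoder = nn.LSTM(pretrained.size(1), hidden,
                               batch_first=True, bidirectional=True)
        self.classify = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) word indices into the pretrained vocabulary
        h, _ = self.encoder(self.embed(token_ids))
        # Sequence labeling (NER/POS): score every position.
        # For sentence classification (sentiment), pool h (e.g. h.mean(1)) first.
        return self.classify(h)
```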

5.8 Qualitative analysis

To better understand the quality of the learned word embeddings of each model, we conduct a qualitative analysis by doing some case studies in Table 4 to illustrate the most similar words for certain target words under the different methods. Explicitly, we present the top 10 words that are most similar to the target word. The similar words are retrieved based on the cosine similarity calculated using the learned embeddings.

Target 唐诗 (Tang poetry):
  Skip-gram: 散曲(Qu-Poetry), 琴谱(music score), 佚(anonymity), 宋词(Song Poetry), 白居易(Bai Juyi), 绝句(jueju), 著录(record), 楚辞(Chu Songs), 乐府(Yuefu), 辑录(compile)
  GWE: 宋诗(Song poetry), 赋诗(indite), 诗韵(rhyme), 汉诗(Chinese poetry), 唐璜(Don Juan), 吟诗(recite poems), 唐寅(Tang Yin), 唐僧(Monk Tang), 唐括(Tang Ku), 诗(poetry)
  JWE: 诗话(notes on poetry), 古今(ancient and modern), 佚(anonymity), 乐府(Yuefu), 琴谱(music score), 辑录(compile), 刻本(carving copy), 传世(be handed down), 古诗(ancient poetry), 散曲(Qu-Poetry)
  VCWE: 诗话(notes on poetry), 宋诗(Song poetry), 绝句(jueju), 宋词(Song Ci Poetry), 吟咏(chant), 乐府(Yuefu), 七言(seven-character), 李商隐(Li Shangyin), 古诗(ancient poetry), 诗文(poetic prose)

Target 沙发 (sofa):
  Skip-gram: 办公桌(bureau), 卧室(bedroom), 椅子(chair), 楼上(upstairs), 客厅(living room), 浴缸(bathtub), 楼下(downstairs), 雨衣(raincoat), 血迹(bloodstain), 电话亭(telephone box)
  GWE: 沙漏(hourglass), 沙尘(sand), 沙袋(sandbag), 沙盒(sandbox), 沙哑(raucity), 沙嘴(sandspit), 沙嗲(satay), 沙包(sandbag), 沙织(Saori), 沙蚕(nereid)
  JWE: 桌子(desk), 衣柜(wardrobe), 毛巾(washcloth), 书桌(secretaire), 棉被(quilt), 长椅(bench), 窗帘(curtain), 浴缸(bathtub), 房门(door), 广告牌(billboard)
  VCWE: 衣柜(wardrobe), 卧室(bedroom), 浴缸(bathtub), 客厅(living room), 窗帘(curtain), 椅子(chair), 壁炉(fireplace), 房门(door), 长椅(bench), 桌子(desk)

Table 4: Case study for qualitative analysis. Given the target word, we list the top 10 similar words from each algorithm so as to observe the differences.

The first example word we consider is “唐诗” (Tang poetry). It refers to poetry written in or around the time of, or in the characteristic style of, China's Tang Dynasty. All the top-ranked words identified by GWE contain the characters “唐” (Tang) or “诗” (poetry), but in addition to the Tang Dynasty, “唐” (Tang) also has other meanings, such as a surname. GWE yields several words such as “唐璜” (Don Juan), “唐寅” (Tang Yin), “唐僧” (Monk Tang) and “唐括” (Tang Ku), which do not appear to be semantically close to the target word. In Skip-gram and JWE, certain words such as “佚” (anonymity) and “古今” (ancient and modern) do not appear to be semantically very closely related to the target word. In our VCWE model, all the top-ranked words are semantically related to the target word, including genres of poetry, poets of the Tang Dynasty, and so on.

We choose “沙发” (sofa) as the second target word. Like the first target word, GWE only pays attention to the character “沙” (sand). Skip-gram and JWE have some irrelevant words such as “电话亭” (telephone box) and “广告牌” (billboard). VCWE pays more attention to the non-compositionality, and its results are better than those of the other models.

Limited by the width of the table, we do not show the results of the CWE model. The results of the GWE model are not much different from those of the CWE model, indicating that the image features obtained by the pre-training of GWE may not play a decisive role. By contrast, our model does not pre-train the image information; it jointly trains and dynamically updates the image feature information, and it works better. The JWE model is similar to the Skip-gram model in that both pay more attention to contextual information, but sometimes these models retrieve irrelevant words.

6 Discussion

Unlike phonograms, logograms carry word and phrase meanings by themselves. The images of Chinese characters contain rich semantic information. Since logographic languages are more closely associated with images than alphabetic languages, it makes sense to mine the characteristics of these images.

Liu et al. (2017) provide a new way to automatically extract character-level features, creating an image for the character and running it through a convolutional neural network to produce a visual character embedding. However, this method does not utilize the rich semantic information of contextual words. Our model extracts both image features and contextual semantic information.

Su and Lee (2017) introduce a pixel-based model that learns character features from font images. However, they use a convolutional autoencoder (convAE) to extract image features in advance and then add these features to the CWE (Chen et al., 2015) model; in the end, the effect of their model is not much different from CWE. Our model is an end-to-end model: we update the image feature parameters in real time during training, and our model achieves better results than the GWE model.

Our research focuses on simplified Chinese word embeddings, and the idea can also be applied to other languages that share a similar writing system, such as traditional Chinese, Japanese, and so on.

7 Conclusion and Future Work

In this paper, we proposed a pixel-based model to learn Chinese word embeddings with character embeddings that are compositional in the components of the characters. We utilized the visual features of Chinese characters to enhance the word embeddings, and we showed that our model outperforms the baseline models on the word similarity, sentiment analysis, named entity recognition and part-of-speech tagging tasks.

In summary, we optimized our pixel-based word embedding method to make the model end-to-end and to make full use of the contextual information. In the future, we hope to apply our model to other downstream tasks and other logographic writing systems.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments. The research work is supported by Shanghai Municipal Science and Technology Commission (No. 17JC1404100 and 16JC1420401), National Key Research and Development Program of China (No. 2017YFB1002104), and National Natural Science Foundation of China (No. 61672162 and 61751201).

References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Kris Cao and Marek Rei. 2016. A joint model for word embedding and word morphology. arXiv preprint arXiv:1606.02601.

Shaosheng Cao, Wei Lu, Jun Zhou, and Xiaolong Li. 2018. cw2vec: Learning chinese word embeddings with stroke n-gram information.

Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huan-Bo Luan. 2015. Joint learning of character and word embeddings. In IJCAI, pages 1236–1242.

Zellig S Harris. 1954. Distributional structure. Word, 10(2-3):146–162.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Yann LeCun, Yoshua Bengio, et al. 1995. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995.

Gina-Anne Levow. 2006. The third international chinese language processing bakeoff: Word segmentation and named entity recognition. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 108–117.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pages 2177–2185.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.

Frederick Liu, Han Lu, Chieh Lo, and Graham Neubig. 2017. Learning character-level compositionality with visual features. arXiv preprint arXiv:1704.04859.

Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

George A Miller and Walter G Charles. 1991. Contextual correlates of semantic similarity. Language and cognitive processes, 6(1):1–28.

Haiyun Peng, Yukun Ma, Yang Li, and Erik Cambria. 2018. Learning multi-grained aspect target sequence for chinese sentiment analysis. Knowledge-Based Systems, 148:167–176.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Siyu Qiu, Qing Cui, Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Co-learning of word representations and morpheme representations. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 141–150.

Herbert Rubenstein and John B Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.

Xinlei Shi, Junjie Zhai, Xudong Yang, Zehua Xie, and Chao Liu. 2015. Radical embedding: Delving deeper to chinese radicals. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 594–598.

Charles Spearman. 1904. The proof and measurement of association between two things. The American journal of psychology, 15(1):72–101.

Tzu-Ray Su and Hung-Yi Lee. 2017. Learning chinese word representations from glyphs of characters. arXiv preprint arXiv:1708.04755.

Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. 2016a. Inside out: Two jointly predictive models for word representations and phrase representations. In AAAI, pages 2821–2827.

Maosong Sun, Xinxiong Chen, Kaixu Zhang, Zhipeng Guo, and Zhiyuan Liu. 2016b. THULAC: An efficient lexical analyzer for chinese. Technical report.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Charagram: Embedding words and sentences via character n-grams. arXiv preprint arXiv:1607.02789.

Jian Xu, Jiawei Liu, Liangang Zhang, Zhengyu Li, and Huanhuan Chen. 2016. Improve chinese word embeddings by exploiting internal structure. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1041–1050.

Rongchao Yin, Quan Wang, Peng Li, Rui Li, and Bin Wang. 2016. Multi-granularity chinese word embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 981–986.

Jinxing Yu, Xun Jian, Hao Xin, and Yangqiu Song. 2017. Joint embeddings of chinese words, characters, and fine-grained subcharacter components. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 286–291.

Shiwen Yu, Jianming Lu, Xuefeng Zhu, Huiming Duan, Shiyong Kang, Honglin Sun, Hui Wang, Qiang Zhao, and Weidong Zhan. 2001. Processing norms of modern chinese corpus. Technical report.