Arxiv:1711.01889V2 [Cs.CV] 29 Mar 2018 Than 20,000), Increasing Novel Characters (E.G
Total Page:16
File Type:pdf, Size:1020Kb
RADICAL ANALYSIS NETWORK FOR ZERO-SHOT LEARNING IN PRINTED CHINESE CHARACTER RECOGNITION Jianshu Zhang, Yixing Zhu, Jun Du and Lirong Dai National Engineering Laboratory for Speech and Language Information Processing University of Science and Technology of China, Hefei, Anhui, P. R. China [email protected], [email protected], [email protected], [email protected] ABSTRACT Chinese characters have a huge set of character categories, more than 20,000 and the number is still increasing as more and more novel characters continue being created. However, the enormous characters can be decomposed into a compact set of about 500 fundamental and structural radicals. This pa- per introduces a novel radical analysis network (RAN) to rec- ognize printed Chinese characters by identifying radicals and analyzing two-dimensional spatial structures among them. testing procedure The proposed RAN first extracts visual features from input by employing convolutional neural networks as an encoder. Then a decoder based on recurrent neural networks is em- ployed, aiming at generating captions of Chinese characters by detecting radicals and two-dimensional structures through a { stl { 尸 d { 龷 八 } } d { 几 又 } } a spatial attention mechanism. The manner of treating a Chi- nese character as a composition of radicals rather than a single Fig. 1: The comparison between the conventional whole-character character class largely reduces the size of vocabulary and en- based methods and the proposed RAN method. ables RAN to possess the ability of recognizing unseen Chi- nese character classes, namely zero-shot learning. we propose a novel radical analysis network (RAN) to gener- Index Terms— Chinese characters, radical analysis, ate captions for recognition of Chinese characters. RAN has zero-shot learning, encoder-decoder, attention two distinctive properties compared with traditional methods: 1) The size of the radical vocabulary is largely reduced com- 1. INTRODUCTION pared with the character vocabulary; 2) It is a novel zero-shot learning of Chinese characters which can recognize unseen The recognition of Chinese characters is an intricate problem Chinese characters in the recognition stage because the cor- due to a large number of existing character categories (more responding radicals and spatial relationships are learned from arXiv:1711.01889v2 [cs.CV] 29 Mar 2018 than 20,000), increasing novel characters (e.g. the character other seen characters in the training stage. Fig. 1 illustrates “Duang” created by Jackie Chan) and complicated internal a clear comparison between traditional methods and RAN structures. However, most conventional approaches [1] can for recognizing Chinese characters. In the traditional meth- only recognize about 4,000 commonly used characters with ods, the classifier takes the character input as a single picture no capability of handling unseen or newly created characters. and tries to learn a mapping between the input picture and Also each character sample is treated as a whole without con- a pre-defined class. If the testing character class is unseen sidering the similarity and internal structures among different in training samples, like the character in red dashed rectan- characters. gle in Fig. 1, it will be misclassified. While RAN attempts It is well-known that all Chinese characters are composed to imitate the manner of recognizing Chinese characters as of basic structural components, called radicals. Only about Chinese learners. For example, before asking children to re- 500 radicals [2] are adequate to describe more than 20,000 member and recognize Chinese characters, Chinese teachers Chinese characters. Therefore, it is an intuitive way to de- first teach them to identify radicals, understand the meaning compose Chinese characters into radicals and describe their of radicals and grasp the possible spatial structures between spatial structures as captions for recognition. In this paper, them. Such learning way is more generative and helps im- prove the memory ability of students to learn so many Chi- a d stl str sbl nese characters, which is perfectly adopted in RAN. As the example in top-right part of Fig. 1 shows, the six training samples contain ten different radicals and exhibit top-bottom, left-right and top-left-surround spatial structures. When en- sl sb st s w countering the unseen character during testing, RAN can still generate the corresponding caption of that character (i.e. the bottom-right rectangle of Fig. 1) as it has already learned the essential radicals and structures. Technically, the enormous Fig. 2: Graphical representation of ten common spatial structures Chinese characters, as well as the newly created characters between Chinese radicals. can all be identified by a compact set of radicals and spatial structures learned in the training stage. In the past few decades, lots of efforts have been made in cjk-decomp 1, we decompose Chinese characters into cor- for radical-based Chinese character recognition. [3] over- responding captions. Regarding to spatial structures among segmented characters into candidate radicals and could only radicals, we show ten common structures in Fig. 2 where “a” handle the left-right structure. [4] first detected separate rad- represents a left-right structure, “d” represents a top-bottom icals and then employed a hierarchical radical matching structure, “stl” represents a top-left-surround structure, “str” method to recognize a Chinese character. Recently, [5] also represents a top-right-surround structure, “sbl” represents a tried to detect position-dependent radicals using a deep resid- bottom-left-surround structure, “sl” represents a left-surround ual network with multi-labeled learning. Generally, these ap- structure, “sb” represents a bottom-surround structure, “st” proaches have difficulties when dealing with the complicated represents a top-surround structure, “s” represents a surround 2D structures between radicals and do not focus on the recog- structure and “w” represents a within structure. nition of unseen character classes. We use a pair of braces to constrain a single structure in Regarding to the network architecture, the proposed character caption. Taking “stl” as an example, it is captioned RAN is an improved version of the attention-based encoder- as “stl f radical-1 radical-2 g”. Usually, like the common decoder model in [6]. We employ convolutional neural net- instances mentioned above, a structure is described by two works (CNN) [7] as the encoder to extract high-level visual different radicals. However, as for some unique structures, features from input Chinese characters. The decoder is re- they are described by three or more radicals. current neural networks with gated recurrent units (GRU) [8] which converts the high-level visual features into output char- 3. NETWORK ARCHITECTURE OF RAN acter captions. We adopt a coverage based spatial attention model built in the decoder to detect the radicals and internal The attention based encoder-decoder model first learns to en- two-dimensional structures simultaneously. The GRU based code input into high-level representations. A fixed-length decoder also performs like a potential language model aiming context vector is then generated via weighted summing the to grasp the rules of composing Chinese character captions high-level representations. The attention performs as the after successfully detecting radicals and structures. The main weighting coefficients so that it can choose the most relevant contributions of this study are summarized as follows: parts from the whole input for calculating the context vec- • We propose RAN for the zero-shot learning of Chinese tor. Finally, the decoder uses this context vector to generate character recognition, solving the problem of handling variable-length output sequence word by word. The frame- unseen or newly created characters. work has been extensively applied to many applications in- cluding machine translation [9], image captioning [10, 11], • We describe how to caption Chinese characters based video processing [12] and handwriting recognition [13–15]. on detailed analysis of Chinese radicals and structures. • We experimentally demonstrate how RAN performs on 3.1. CNN encoder recognizing seen/unseen Chinese characters and show its efficiency through attention visualization. In this paper, we evaluate RAN on printed Chinese charac- ters. The inputs are greyscale images and the pixel value is normalized between 0 and 1. The overall architecture of RAN 2. RADICAL ANALYSIS is shown in Fig. 3. We employ CNN as the encoder which is proven to be a powerful model to extract high-quality visual Compared with enormous Chinese character categories, the features from images. Rather than extracting features after amount of radical categories is quite limited. It is declared in a fully connected layer, we employ a CNN framework con- GB13000.1 standard [2], which is published by National Lan- taining only convolution, pooling and activation layers, called guage Committee of China, that 20902 Chinese characters are composed of 560 different radicals. Following the strategy 1https://github.com/amake/cjk-decomp a { where g denotes a softmax activation function over all the stl { words in the vocabulary, h denotes a maxout activation func- 尸 K× m m×n m×D d tion, Wo 2 2 , Ws 2 , Wc 2 , and E { R R R 龷 denotes the embedding matrix, m and n are the dimensions 八 of embedding and GRU decoder. } } The decoder adopts two unidirectional GRU layers to cal- d { culate the hidden state st: 几 又 } ^s = GRU (y ; s ) (4) } t t−1 t−1 1.input 2.CNN encoder 3.RNN decoder with attention 4.output ct = fcatt (^st; A) (5) st = GRU (ct;^st) (6) Fig. 3: The overall architecture of RAN for printed Chinese charac- ter recognition. The purple, blue and green rectangles are examples where st−1 denotes the previous hidden state, ^st is the predic- of attention visualization when predicting the words that are of the tion of current GRU hidden state, and f denotes the cover- same color in output caption. catt age based spatial attention model (Section 3.2.2).