Handwriting Recognition in Low-resource Scripts using Adversarial Learning

Ayan Kumar Bhunia¹, Abhirup Das², Ankan Kumar Bhunia³, Perla Sai Raj Kishore², Partha Pratim Roy⁴
¹Nanyang Technological University, Singapore  ²Institute of Engineering & Management, India  ³Jadavpur University, India  ⁴Indian Institute of Technology Roorkee, India
[email protected]

Abstract

Handwritten Word Recognition and Spotting is a challenging field dealing with handwritten text possessing irregular and complex shapes. The design of deep neural network models makes it necessary to extend training datasets in order to introduce variations and increase the number of samples; word-retrieval is therefore very difficult in low-resource scripts. Much of the existing literature comprises preprocessing strategies which are seldom sufficient to cover all possible variations. We propose an Adversarial Feature Deformation Module (AFDM) that learns ways to elastically warp extracted features in a scalable manner. The AFDM is inserted between intermediate layers and trained alternately with the original framework, boosting its capability to better learn highly informative features rather than trivial ones. We test our meta-framework, which is built on top of popular word-spotting and word-recognition frameworks and enhanced by the AFDM, not only on extensive Latin word datasets but also on sparser Indic scripts. We record results for varying sizes of training data, and observe that our enhanced network generalizes much better in the low-data regime; the overall word-error rates and mAP scores improve as well.

1. Introduction

Handwriting recognition has been a very popular area of research over the last two decades, owing to handwritten documents being a personal choice of communication for humans, other than speech. The technology is applicable in postal automation, bank cheque processing, digitization of handwritten documents, and also as a reading aid for the visually handicapped. Handwritten character recognition and word spotting and recognition systems have evolved significantly over the years. Since Nipkow's scanner [27] and LeNet [21], modern deep-learning based approaches [18, 29, 41] have sought to robustly recognize handwritten text by learning local invariant patterns that are consistent across diverse handwriting styles in individual characters and scripts. These algorithms require vast amounts of data to train models that are robust to real-world handwritten data. While large datasets of both word-level and separated handwritten characters are available for scripts like Latin, a large number of scripts with larger vocabularies have limited data, posing challenges to research on word-spotting and recognition in languages using these scripts.

Deep learning algorithms, which have emerged in recent times, enable networks to effectively extract informative features from inputs and automatically generate transcriptions [31] of images of handwritten text, or spot [40] query words, with high accuracy. In the case of scripts where abundant training data is not available, Deep Neural Networks (DNNs) often fall short, overfitting on the training set and thus generalizing poorly during evaluation. Popular methods allow models to use existing data more effectively, while batch-normalization [15] and dropout [39] prevent overfitting. Augmentation strategies such as random translations, flips, rotations and the addition of Gaussian noise to input samples are often used to extend the original dataset [20] and prove to be beneficial not only for limited but also for large datasets like ImageNet [7]. The existing literature [6, 19, 29, 51] augments the training data prior to training, before classifying over as many as 3755 character classes [51]. Such transformations, however, fail to incorporate the wide variations in writing style and the complex shapes assumed by characters in words, by virtue of the free-flowing nature of handwritten text. Due to the huge space of possible variances in handwritten images, training by generating deformed examples through such generic means is not sufficient, as the network easily adapts to these policies. Models need to become robust to uncommon deformations in inputs by learning to effectively utilize the more informative invariances, and it is not optimal to utilize just "hard" examples to do so [34, 43]. Instead, we propose an adversarial-learning based framework for handwritten word retrieval tasks in low-resource scripts, in order to train deep networks from a limited number of samples.
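For concreteness, a typical image-level augmentation pipeline of this generic kind can be written as follows. This is an illustrative sketch only: the transforms and parameter values are our own choices, not the exact augmentation policies of [29, 40] used later as baseline B1.

```python
import torch
import torchvision.transforms as T

# A generic "handcrafted" image-level augmentation pipeline for word images:
# small affine jitter, occasional blur, and additive Gaussian noise.
generic_augment = T.Compose([
    T.RandomAffine(degrees=5, translate=(0.05, 0.05), shear=5, fill=255),
    T.RandomApply([T.GaussianBlur(kernel_size=3)], p=0.3),
    T.ToTensor(),
    # Additive Gaussian noise, clamped back to the valid pixel range.
    T.Lambda(lambda x: (x + 0.02 * torch.randn_like(x)).clamp(0.0, 1.0)),
])
```

Because such a policy is fixed and easy to adapt to, a network quickly becomes invariant to exactly these perturbations and nothing more, which motivates learning the deformations adversarially instead.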

Information retrieval from handwritten images can be mainly classified into two types: (a) Handwritten Word Recognition (HWR), which outputs the complete transcription of the word-image, and (b) Handwritten Word Spotting (HWS), which finds occurrences of a query keyword (either a string or a sample word-image) in a collection of sample word-images. The existing literature on deep-learning based word retrieval, which covers mostly English words, makes use of large available datasets or uses image augmentation techniques to increase the number of training samples [19]. Bhunia et al. [3] proposed a cross-lingual framework for Indic scripts where training is performed using a script that is abundantly available and testing is done on the low-resource script using character-mapping. The feasibility of this approach mostly depends on the extent of similarity between the source and target scripts. Antoniou et al. [2] proposed a data augmentation framework using Generative Adversarial Networks (GANs) which can generate augmented data for new classes in a one-shot setup.

Inspired by the recent success of adversarial learning in different tasks like cross-domain image translation [52] and domain adaptation [44], we propose a generative adversarial learning based paradigm that augments word images in a high-dimensional feature space using spatial transformations [17]. We term it the Adversarial Feature Deformation Module (AFDM); it is added on top of the original task network performing either recognition or spotting, and prevents the latter from overfitting to easily learnable and trivial features. Consequently, frameworks enhanced by the proposed module generalize well to real-world testing data with rare deformations. Both the adversarial generator (AFDM) and the task network are trained jointly: the adversarial generator intends to generate "hard" examples while the task network attempts to learn invariances to difficult variations, gradually becoming better over time. In this paper, we make the following novel contributions:

1. We propose a scalable solution to HWR and HWS in low-resource scripts using adversarial learning to augment the data in a high-dimensional convolutional feature space. The various deformations introduced by the adversarial generator encourage the task network to learn from different variations of handwriting, even from a limited amount of data.

2. We compare our adversarial augmentation method with different baselines, and the comparison clearly shows that the proposed framework can improve the performance of state-of-the-art handwritten word spotting and recognition systems. Not only is the performance improved in the case of low-resource scripts, but models also generalize better to real-world handwritten data.

2. Related Works

Handwriting recognition has been researched in great detail in the past and in-depth reviews of it exist [28]. Nevertheless, the search for a better and more accurate technique continues to date. Results presented in [16] show that models should preferably use word-embeddings over bag-of-n-grams approaches. Based on this, another approach [29] employed a ConvNet to estimate a frequency-based profile of the n-grams constituting spatial parts of the word in input images and correlated it with the profiles of existing words in a dictionary, demonstrating an attribute-based word-encoding scheme. In [40], Sudholt et al. adopted the VGG-Net [37] and used the terminal fully connected layers to predict holistic representations of the handwritten words in images by embedding their pyramidal histogram of characters (PHOC [1]) attributes. Architectures such as [18, 40, 48] similarly embedded features into a textual embedding space. The paper [49] demonstrated a region-proposal network driven word-spotting mechanism, where the end-to-end model encodes regional features into a distributed word-embedding space in which searches are performed. Sequence-discriminative training based on the Connectionist Temporal Classification (CTC) criterion, proposed by Graves et al. in [10] for training RNNs [14], has attracted much attention and has been widely used in works like [11, 31]. In Shi et al. [31], the sequence of image features engineered by the ConvNet is given to a recurrent network such as an LSTM [11] or MDLSTM [45, 4] for computing word transcriptions. The authors of [19] additionally included an affine-transformation based attention mechanism to reorient original images spatially prior to sequence-to-sequence transcription, for improved detection accuracy. In most of the aforementioned methods, it is important to preprocess images in different ways to extend the original dataset, as observed in [18, 19, 20, 29, 35].

The process of augmenting to extend datasets is seen even in the case of large, extensive datasets [19, 7] and in works focusing on Chinese handwritten character recognition, where there are close to 4000 classes in standard datasets. In a different class of approaches, online hard example mining (OHEM) has proved effective, boosting accuracy by targeting the fewer "hard" examples in a dataset, as shown in [22, 34, 36, 46]. With the advent of adversarial learning and GANs in recent years, several approaches have incorporated generative modeling to create synthetic data that is realistic [8, 26, 50], following the architectural guidelines described by Goodfellow et al. for stable GAN training [9]. Papers such as [2] use GANs to augment limited datasets by computing over a sample class image to output samples that belong to the same class.

A recent work by Wang et al. [47] describes an adversarial model that generates hard examples by using the generator [9] to incorporate occlusions as well as spatial deformations into the feature space, forcing the detector to adapt to uncommon and rare deformations in actual inputs to the model. In our framework, we use a similar strategy to make our word-retrieval detector robust and invariant to all sorts of variations seen in natural images of handwritten text. Another similar approach [38] explores the use of adversarial learning in visual tracking and object detection and attempts to alleviate the class-imbalance problem in datasets, where the amount of data in one class far exceeds that in another. Having a larger number of easy-to-recognize samples in a dataset deters the training process, as the detector remains unaware of the more valuable "hard" examples.

3. Handwritten Word Retrieval Models

We use CRNN [31] and PHOCNet [40] as the baseline frameworks for handwritten word recognition and spotting respectively; on top of these, we implement our adversarial augmentation method. Significantly, our model is a meta-framework in the sense that the augmentation module can also be incorporated along with a ResNet-like architecture, instead of the VGG-like architecture adopted originally in both frameworks.

CRNN for HWR: Shi et al. [31] introduced an end-to-end trainable Convolutional Recurrent Neural Network with the Connectionist Temporal Classification (CTC) loss, which can handle word sequences of arbitrary length without character segmentation and can predict the transcription of out-of-vocabulary word images using both lexicon-based and lexicon-free approaches. The 'Map-to-Sequence' layer [31] acts as the bridge between the convolutional and the recurrent layers. The input is first fed to the convolutional layers; a recurrent network then makes a per-frame prediction for each frame of the extracted features. Finally, a transcription layer translates the predictions from the recurrent layers into a label sequence.

PHOCNet for HWS: PHOCNet [40] is a state-of-the-art approach in word-spotting, achieving exemplary results for both QbE (Query by Example) and QbS (Query by String) methods. The model reduces images of handwritten words to encoded representations of their corresponding visual attributes. The PHOC label [40] of a word is obtained by segmenting it into histograms at multiple levels. The histograms of the characters in a word and its n-grams are calculated and concatenated to obtain a final representation. Once trained, an estimated PHOC representation is predicted for input word-images of varying sizes by using a Spatial Pyramid Pooling layer [13]. These semantic representations of query and word-images can be compared directly by a simple nearest-neighbor search (for QbE), or the PHOC of a query string can be compared with the output representations of the deep model for the word-images in the dataset (for QbS). PHOCNet uses sigmoid activations to generate the histograms instead of Softmax, following a multi-label classification approach.
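To make the PHOC encoding concrete, the sketch below builds the unigram part of the descriptor: at pyramid level L, the word is split into L segments, and a character fires in a segment if the majority of its normalized span falls inside it. The alphabet and the levels are typical choices from [1, 40] rather than values fixed by this paper, and the n-gram attributes are omitted for brevity.

```python
import numpy as np

def phoc(word, alphabet="abcdefghijklmnopqrstuvwxyz0123456789",
         levels=(1, 2, 3, 4)):
    """Simplified pyramidal histogram of characters (PHOC) for one word."""
    word = word.lower()
    parts = []
    for level in levels:
        for s in range(level):                     # s-th split at this level
            lo, hi = s / level, (s + 1) / level
            hist = np.zeros(len(alphabet))
            for i, ch in enumerate(word):
                if ch not in alphabet:
                    continue
                c_lo, c_hi = i / len(word), (i + 1) / len(word)
                # Character belongs to the split if > 50% of its span overlaps.
                if min(hi, c_hi) - max(lo, c_lo) > 0.5 * (c_hi - c_lo):
                    hist[alphabet.index(ch)] = 1.0
            parts.append(hist)
    return np.concatenate(parts)                   # e.g. 360-d for 36 characters
```

QbE spotting then reduces to a nearest-neighbor search between the network's predicted vectors, while QbS compares the PHOC built from the query string against them.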
4. Proposed Methodology

4.1. Overview

The generic augmentation techniques popularly observed in HWR and HWS frameworks are often insufficient for models to generalize to real-world handwritten data, especially in the case of low-resource scripts, where existing datasets are small and cover only a fraction of the irregularities observed in the real world. We propose a modular deformation network that is trained to learn a manifold of parameters seeking to deform the features learned by the original task network, thus encouraging it to adapt to difficult examples and uncommon irregularities.

Let $T$ be the task network whose input is an image $I$. By task network, we mean either a word recognition [31] or word spotting network [40]; let the corresponding task loss be $\mathcal{L}_{task}$, which can be either the CTC loss [31] (for word recognition) or the cross-entropy loss [40] (for word spotting). We will use the terms task network and task loss for simplicity of description. We introduce an Adversarial Feature Deformation Module (AFDM) after one of the intermediate layers of the task network. Let us consider the task network $T$ to be dissected into three parts, namely $T_A$, $T_B$ and $R$. $R$ is the final label-prediction part, predicting either the word-level transcription for recognition or PHOC labels [40] for word spotting; $T_A$ and $T_B$ are the two successive convolutional parts of the task network $T$. The exact position of dissection between $T_A$ and $T_B$ is discussed in Section 5.2. Let us assume $F$ is the output feature map of $T_A$, i.e. $F = T_A(I)$. The warped feature map $F'$ from the AFDM is thereafter passed through $T_B$ and $R$ for final label prediction. While the complete task network $T$ is trained with the objective of correctly predicting the output label, the AFDM tries to deform the features so that $T$ cannot predict the correct label easily. $T$ is thereby forced to generalize to more discriminative invariances and informative features in the handwritten text data. The feature deformation network $A$ and the task network $T$ compete in this adversarial game during training. During inference, we only use $T$.
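The overview translates directly into a thin wrapper around any backbone. The following PyTorch sketch (our illustration, not the authors' released code) dissects a task network into $T_A$, $T_B$ and $R$ and applies the AFDM between $T_A$ and $T_B$ during adversarial training only:

```python
import torch.nn as nn

class DissectedTaskNet(nn.Module):
    """Task network T = R(T_B(T_A(I))), with an optional AFDM warping the
    intermediate map F = T_A(I); at inference the AFDM is bypassed."""
    def __init__(self, t_a: nn.Module, t_b: nn.Module, r: nn.Module,
                 afdm: nn.Module = None):
        super().__init__()
        self.t_a, self.t_b, self.r, self.afdm = t_a, t_b, r, afdm

    def forward(self, img, adversarial: bool = False):
        f = self.t_a(img)                          # F = T_A(I)
        if adversarial and self.afdm is not None:
            f = self.afdm(f)                       # F' = deformed features
        return self.r(self.t_b(f))                 # CTC logits or PHOC logits
```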

Figure 1: The architecture of our training network, with the Adversarial Feature Deformation Module (comprising the Localisation Network, Grid Generator and Sampler) inserted between $T_A$ and $T_B$ of the task network. The illustration depicts the use of the AFDM to uniformly deform the complete feature map $F$.

4.2. Adversarial Feature Deformation Module

The AFDM, inspired by Spatial Transformer Networks (STN) [17], fulfills our objective of warping the features learned by $T_A$ so as to make recognition (or spotting) difficult for the task network. The module uses its adversarial Localisation Network $A$ to predict a set of parameters $\theta$, which are needed to compute the transformation matrix $T_\theta$. The Grid Generator generates a sampling grid $S$ by applying the transformation $T_\theta$ to the points of a regular grid $S'$ representing coordinates in $F'$. The obtained grid $S$ represents the $N$ points in the original map $F$ from which the corresponding points in the target map $F'$ should be sampled, such that the latter appears spatially warped in the manner described by $T_\theta$. The grid $S$ and the original feature map are then passed through a Bilinear Sampler to obtain the target feature map $F'$. Although a number of transformations [17] can be used in the AFDM, the Thin Plate Spline (TPS) transformation [5] is suggested to be the most powerful according to Jaderberg et al. [17]. We use TPS because of its degree of flexibility and its ability to elastically deform a plane by solving a two-dimensional interpolation problem: the computation of a map $\mathbb{R}^2 \rightarrow \mathbb{R}^2$ from a set of arbitrary control points [5]. Furthermore, the matrix operations for grid-generation and transformation in TPS being differentiable, the module can backpropagate gradients as well.

The parameters predicted adversarially by $A$ denote $K$ control points $P = [p_1, \cdots, p_K] \in \mathbb{R}^{2 \times K}$, with $p_v = [x_v, y_v]^\top$ pointing to coordinates in $F$, obtained by regressing over their $x, y$ values, which are normalised to lie within $[-1, 1]$. The Grid Generator uses the parameters representing the control points in $P$ to define a transformation function with respect to a set of corresponding control points $P' = [p'_1, \cdots, p'_K]$, called the base control points, representing positions in $F'$. Since the base control points are fixed, $P'$ is a constant. The transformation is denoted by a matrix $T_\theta \in \mathbb{R}^{2 \times (K+3)}$ that can be computed as

$$T_\theta = \left( \Delta_{P'}^{-1} \times \begin{bmatrix} P^\top \\ 0_{3 \times 2} \end{bmatrix} \right)^{\top} \tag{1}$$

where $\Delta_{P'} \in \mathbb{R}^{(K+3) \times (K+3)}$ is also a constant, given by

$$\Delta_{P'} = \begin{bmatrix} 1_{K \times 1} & P'^\top & E \\ 0 & 0 & 1_{1 \times K} \\ 0 & 0 & P' \end{bmatrix} \tag{2}$$

where the element in the $i$-th row and $j$-th column of matrix $E$ is the Euclidean distance between the base control points $p'_i$ and $p'_j$. Now, given that the grid of target points in $F'$ is denoted as $S' = \{s'_i\}_{i=1,\cdots,N}$, with $s'_i = [x'_i, y'_i]^\top$ being the $x, y$ coordinates of the $i$-th of $N$ feature points, for every point $s'_i$ we find the corresponding sampling position $s_i = [x_i, y_i]^\top$ in $F$ through the following steps:

$$e'_{i,k} = d_{i,k}^2 \ln d_{i,k}^2 \tag{3}$$

$$\hat{s}'_i = [1, x'_i, y'_i, e'_{i,1}, \cdots, e'_{i,K}]^\top \in \mathbb{R}^{(K+3) \times 1} \tag{4}$$

$$s_i = T_\theta \cdot \hat{s}'_i \tag{5}$$

where $d_{i,k}$ is the Euclidean distance between $s'_i$ and the $k$-th base control point $p'_k$. We iteratively apply eqn. (5) to all $N$ points in $S'$ to define the grid-transform function $T_\theta(\cdot)$ and produce the sampling grid $S$:

$$S = T_\theta(\{s'_i\}), \quad i = 1, 2, \cdots, N \tag{6}$$

We thus obtain the grid $S = \{s_i\}_{i=1,\cdots,N}$ representing sampling points in $F$.

The network represented by $A$ includes a final fully-connected (fc) layer predicting $2K$ normalized coordinate values. It is fitted with the $\tanh(\cdot)$ activation, after which the values are reshaped to form the matrix $P$. It is to be noted that the aforementioned equations define the deformation operation executed by the AFDM such that all channels of the original map are deformed uniformly. In Section 4.3 we discuss the partitioning strategy, where smaller sub-maps of $F$ are fed into it individually for deformation.
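The grid computation of eqns. (1)-(6) is a handful of matrix operations. Below is a minimal NumPy sketch under the stated conventions (normalized coordinates, $K$ fixed base control points); the function and variable names are our own.

```python
import numpy as np

def tps_grid(P, P_prime, S_prime):
    """Map target grid points S' (N,2) in F' to sampling positions S (N,2)
    in F, given predicted control points P (K,2) and fixed base control
    points P' (K,2), following eqns. (1)-(6)."""
    K = P_prime.shape[0]
    # TPS kernel over base control points: e = d^2 * ln(d^2), cf. eqn. (3)
    d2 = np.sum((P_prime[:, None] - P_prime[None, :]) ** 2, axis=-1)
    E = np.where(d2 > 0, d2 * np.log(d2 + 1e-12), 0.0)            # (K,K)
    # Constant matrix Delta_{P'} of eqn. (2)
    delta = np.zeros((K + 3, K + 3))
    delta[:K, 0], delta[:K, 1:3], delta[:K, 3:] = 1.0, P_prime, E
    delta[K, 3:] = 1.0
    delta[K + 1:, 3:] = P_prime.T
    # T_theta of eqn. (1), shape (2, K+3)
    rhs = np.concatenate([P, np.zeros((3, 2))], axis=0)           # (K+3,2)
    T_theta = (np.linalg.inv(delta) @ rhs).T
    # Lift each target point as in eqns. (3)-(4), then apply eqn. (5)
    d2s = np.sum((S_prime[:, None] - P_prime[None, :]) ** 2, axis=-1)
    es = np.where(d2s > 0, d2s * np.log(d2s + 1e-12), 0.0)        # (N,K)
    s_hat = np.concatenate([np.ones((len(S_prime), 1)), S_prime, es], axis=1)
    return s_hat @ T_theta.T                                      # S, (N,2)
```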

4.3. Adversarial Learning

Traditional approaches to adversarial learning [9] involve training a model to learn a generator $G$ which, given a vector $z$ sampled from a noise distribution $P_{noise}(z)$, outputs an image $G(z)$. The discriminator $D$ takes either the generated image or a real image $x$ from the distribution $P_{data}(x)$ as input, and identifies whether it is real or fake. The objective function for training the network using the cross-entropy loss is defined as

$$\mathcal{L} = \min_G \max_D \; \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_{noise}(z)}[\log(1 - D(G(z)))] \tag{7}$$

In traditional GANs, the generator $G$ learns a mapping of $z$ from the noise distribution $P_{noise}(z)$ to the data distribution $P_{data}(x)$ over data $x$. In our framework, $G$ (i.e. the AFDM) learns a mapping of $F$ from the distribution of undistorted features $P_{undistorted}(F)$ to the space of distorted features $P_{distorted}(F')$:

$$\mathcal{L} = \min_G \max_D \; \mathbb{E}_{F' \sim P_{distorted}(F')}[\log D(F')] + \mathbb{E}_{F \sim P_{undistorted}(F)}[\log(1 - D(G(F)))] \tag{8}$$

In our framework, we train the AFDM $A$ (analogous to $G$) and the task network $T$ (analogous to $D$) alternately in an adversarial manner. Initially, $A$ generates random deformations, but as adversarial learning progresses it learns strategies to warp the intermediate feature space so that recognition (spotting) becomes hard for $T$. In other words, the generator framework $A$ tries to deform the feature map (see Figure 3) to make the task harder. Meanwhile, we train the discriminating network, i.e. the task network $T$, in a supervised manner using labelled samples, encouraging it to accurately retrieve handwritten inputs despite the deformations present in them.

Now, instead of deforming $F$ (with height $H$, width $W$ and $C$ channels) uniformly, we modify the $k$ sub-maps constituting it in $k$ different ways ($k$ being a value much smaller than the number of channels $C$), thereby increasing the complexity of the task and preventing $A$ from learning trivial warping strategies. $F$ is divided into sub-maps $f_1$ through $f_k$, each of which has $C/k$ channels. The $m$-th sub-map $f_m \in \mathbb{R}^{H \times W \times C/k}$ is fed into $A$ to generate $\theta_m$ and compute the grid-transform function $T_{\theta_m}(\cdot)$. The latter, as shown in eqns. (1) through (6), transforms a given grid $S'_m$ to obtain the corresponding grid of sampling points $S_m$ for points belonging to sub-map $f_m$. The deformed feature map $F'$ is thus computed as

$$F' = (f_1 \odot S_1) \oplus (f_2 \odot S_2) \oplus \cdots \oplus (f_k \odot S_k) \tag{9}$$

where $\oplus$ denotes the channel-wise concatenation operation and $\odot$ denotes the bilinear-sampling mechanism corresponding to the transforms described in [17]. Each sub-map $f_m$ is thus sampled to obtain $(f_m \odot S_m) \in \mathbb{R}^{H \times W \times C/k}$, and the results are concatenated to get $F'$ with the same dimensions as the original feature map $F$. The AFDM thus learns a function $A(\cdot)$ that computes the encoded features of the $m$-th sub-map to generate $\theta_m = A(f_m)$.
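Putting eqn. (9) into code, the AFDM forward pass splits the map channel-wise, predicts $\theta_m$ per sub-map, and resamples each sub-map on its own grid before concatenation. In this PyTorch sketch, tps_sampling_grid is a hypothetical helper that wraps the computation of eqns. (1)-(6) (cf. the NumPy sketch above) and returns a (B, H, W, 2) grid of normalized sampling positions:

```python
import torch
import torch.nn.functional as F_nn

def afdm_forward(feat, loc_net, tps_sampling_grid, k=4):
    """Deform feature map `feat` (B, C, H, W) per eqn. (9): each of the
    k channel sub-maps is warped by its own TPS parameters theta_m."""
    warped = []
    for f_m in torch.chunk(feat, k, dim=1):        # f_1 ... f_k, C/k channels
        theta_m = loc_net(f_m)                     # theta_m = A(f_m), (B, K, 2)
        grid_m = tps_sampling_grid(theta_m, f_m.shape)  # S_m via eqns. (1)-(6)
        # Bilinear sampling, i.e. the circled-dot operation of eqn. (9)
        warped.append(F_nn.grid_sample(f_m, grid_m, align_corners=True))
    return torch.cat(warped, dim=1)                # channel-wise concatenation
```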
In the absence of the AFDM (e.g. during testing), the output $F$ of sub-network $T_A$ is passed directly through $T_B$ and $R$. The recognizer $R$ outputs the predicted word-label $L_p$ for a word image $I$. The word-label can be either a word-level annotation represented by a series of characters, or a PHOC label [40], depending on the type of system. Let the ground-truth label be $L_g$. Our original word-retrieval loss $\mathcal{L}_{task}$ can then be defined as

$$\mathcal{L}_{task} = Q_{word}(L_p, L_g) \tag{10}$$

where $Q_{word}(\cdot)$ represents a general function that computes the loss between the prediction $L_p$ and the ground-truth label $L_g$, which is either the CTC loss used in [31] or the sigmoid cross-entropy loss described in [40].

During training, we have two different networks: the task network $T$ and the Localisation Network $A$. Let their parameters be $\theta_T$ and $\theta_A$ respectively. In one training iteration, the data flow in the forward pass is as follows: $I \rightarrow T_A(\cdot) \rightarrow AFDM(\cdot) \rightarrow T_B(\cdot) \rightarrow R(\cdot) \rightarrow L_p$, where $AFDM(\cdot)$ represents the complete deformation operation, including parameter prediction by $A$ and the grid-generation and sampling operations; the last two do not involve learning any parameters. $A$ needs to learn feature-deforming strategies through $\theta_A$ so that the recognizer should fail. We thus obtain $\theta_A$ by maximizing the loss function $\mathcal{L}_{task}$. On the other hand, $\theta_T$ is optimized to minimize the task loss $\mathcal{L}_{task}$:

$$\theta_A = \arg\max_{\theta_A} \mathcal{L}_{task} \tag{11}$$

$$\theta_T = \arg\min_{\theta_T} \mathcal{L}_{task} \tag{12}$$

As a result, if the deformation caused by the AFDM makes image $I$ hard to recognize, the task network $T$ receives a high loss and $A$ a low one; if instead the modified features are easy to recognize, $A$ suffers a high loss.
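The alternating optimization of eqns. (11)-(12) amounts to two optimizers sharing one loss with opposite signs. A simplified sketch follows, using the DissectedTaskNet wrapper from Section 4.1; the actual schedule additionally includes the pre-training and half-batch deformation tricks of Section 5.2.

```python
def adversarial_step(model, imgs, labels, opt_task, opt_afdm, task_loss_fn):
    """One alternating update: the AFDM ascends L_task (eqn. 11), the task
    network descends it (eqn. 12). `opt_afdm` holds only the localisation
    network's parameters, `opt_task` only the task network's."""
    # Update theta_A: maximize the task loss (gradient ascent via negation).
    preds = model(imgs, adversarial=True)
    opt_afdm.zero_grad()
    (-task_loss_fn(preds, labels)).backward()
    opt_afdm.step()

    # Update theta_T: minimize the task loss on freshly deformed features.
    preds = model(imgs, adversarial=True)
    loss = task_loss_fn(preds, labels)
    opt_task.zero_grad()
    loss.backward()
    opt_task.step()
    return loss.item()
```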
Table 1: Performance in (a) Handwritten Word Recognition (HWR) and (b) Handwritten Word Spotting (HWS) of our baselines as well as state-of-the-art approaches on different datasets. B2, the cross-script transfer-learning baseline, applies only to the Indic datasets.

Handwritten Word Recognition (Unconstrained), WER / CER:

| Method | IAM | RIMES | IndicBAN | IndicDEV |
|---|---|---|---|---|
| B1 | 23.14 / 12.02 | 16.04 / 11.17 | 26.31 / 14.67 | 25.35 / 13.69 |
| B2 | - | - | 25.17 / 13.08 | 24.37 / 12.14 |
| B3 | 21.58 / 11.45 | 14.61 / 10.37 | 20.28 / 11.13 | 19.07 / 10.34 |
| B4 | 19.97 / 9.81 | 12.42 / 7.61 | 17.67 / 9.19 | 16.46 / 8.34 |
| RARE [32] | 19.72 / 9.69 | 12.32 / 7.65 | 20.15 / 10.52 | 19.19 / 9.72 |
| ASTER [33] | 17.01 / 8.11 | 10.52 / 6.62 | 18.31 / 9.22 | 17.22 / 8.13 |
| MORAN [23] | 17.95 / 8.96 | 11.27 / 7.05 | 19.62 / 9.83 | 18.56 / 9.01 |
| Ours | 17.19 / 8.41 | 10.47 / 6.44 | 15.47 / 7.12 | 14.3 / 6.14 |

Handwritten Word Recognition (Lexicon), WER / CER:

| Method | IAM | RIMES | IndicBAN | IndicDEV |
|---|---|---|---|---|
| B1 | 15.98 / 10.05 | 12.51 / 9.64 | 16.67 / 10.21 | 15.67 / 9.78 |
| B2 | - | - | 15.87 / 9.47 | 14.69 / 8.41 |
| B3 | 12.17 / 8.45 | 10.13 / 7.17 | 11.37 / 7.64 | 10.24 / 6.76 |
| B4 | 10.24 / 7.21 | 7.59 / 5.56 | 9.69 / 5.41 | 8.67 / 4.67 |
| RARE [32] | 9.93 / 7.33 | 7.48 / 5.25 | 11.13 / 7.55 | 10.20 / 6.68 |
| ASTER [33] | 8.73 / 5.83 | 6.17 / 3.12 | 9.44 / 5.42 | 8.61 / 4.59 |
| MORAN [23] | 9.19 / 6.52 | 6.83 / 3.83 | 10.52 / 6.61 | 9.68 / 5.46 |
| Ours | 8.87 / 5.94 | 6.31 / 3.17 | 7.49 / 4.37 | 6.59 / 3.97 |

Handwritten Word Spotting, mAP QbS / QbE:

| Method | IAM | RIMES | IndicBAN | IndicDEV |
|---|---|---|---|---|
| B1 | 83.12 / 72.67 | 86.31 / 77.69 | 80.37 / 76.91 | 81.67 / 77.61 |
| B2 | - | - | 81.04 / 77.67 | 82.64 / 78.64 |
| B3 | 85.1 / 73.67 | 87.69 / 79.67 | 84.67 / 84.73 | 85.61 / 86.19 |
| B4 | 86.94 / 75.64 | 90.34 / 80.67 | 87.67 / 85.49 | 88.17 / 86.49 |
| TPP-PHOCNet [42] | 92.97 / 84.80 | 94.31 / 85.89 | 89.21 / 86.69 | 89.97 / 87.82 |
| Ours | 88.69 / 77.94 | 92.94 / 82.67 | 89.34 / 86.47 | 90.13 / 87.67 |

5. Experimentation Details

5.1. Datasets

We use two very popular datasets of Latin script, namely IAM (115,320 words) and RIMES (66,982 words), which are used extensively by the handwritten document image analysis community. IAM [24] is one of the largest datasets available for HWR and HWS in Latin script, allowing us to demonstrate the effectiveness of our feature-warping strategy at different training set sizes (see Figure 2). To demonstrate the effectiveness of our model in low-resource scripts (in terms of availability of training data), we choose two Indic scripts, namely Bangla and Devanagari (Hindi), as examples of the benefits of adversarial training via the AFDM.
Hindi and Bangla are the fifth and sixth most popular languages globally [27] and use the scripts Devanagari and Bangla, respectively. Both scripts are far more complex than Latin due to the presence of modifiers [30] and complex cursive shapes [30], and are sparse compared to Latin [12, 24]. To the best of our knowledge, there exists only one publicly available dataset [3, 30], which contains a total of 17,091 and 16,128 words for Bangla and Devanagari, respectively. We denote these two datasets as IndBAN (BANgla) and IndDEV (DEVanagari) respectively. For IAM, IndBAN and IndDEV, we use the same partitions for training, validation and testing provided along with the datasets. For the RIMES dataset, we follow the partition released for the ICDAR 2011 competition.

5.2. Implementation Details

While experimenting, we noticed that it is important to first pre-train the task network for a certain number of iterations so that it can learn a basic model that understands the shapes of different characters to an extent. If we start training both networks together, the AFDM often overpowers the task network, which then fails to learn a meaningful representation. Therefore, we first train the task network for 10K iterations without the AFDM. Thereafter, we include the latter to fulfill its adversarial objective of deforming the intermediate convolutional feature maps. We use 500 continuous iterations to train the parameter localization network $A$ alone for better initialization. It is observed that, due to its large degree of flexibility, TPS often finds some especially difficult deformations to which the task network fails to generalize later on. Hence, we use a simple trick to solve this stability issue: we randomly deform only half of the data samples in a batch through the AFDM and keep the rest unchanged for retrieval; this greatly improves stability. For the Localisation Network, we use four convolutional layers with stride 2 and kernel size 3 × 3, followed by 2 fully-connected layers, finally predicting 18 parameter values using tanh activation. We keep the number of sub-map divisions k at 4 and use a batch size of 32. Following the earlier initialization, the task network and the AFDM are trained alternately for a total of 100K iterations. We use the Adam optimizer for both the task network and the AFDM; however, we keep the learning rate for the task network at 10^-4, while that for the Localisation Network of the AFDM is 10^-3. PHOCNet consists of 13 convolutional layers followed by an SPP layer and 3 fully connected layers, finally predicting the PHOC labels using sigmoid activation. We name these conv layers as follows: conv1_1, conv1_2, conv2_1, conv2_2, conv3_1, conv3_2, conv3_3, conv3_4, conv3_5, conv3_6, conv4_1, conv4_2 and conv4_3. There are two pooling layers (2 × 2) after conv1_2 and conv2_2. Every layer has a kernel of size 3 × 3, and the numbers of filters are 64, 128, 256 and 512 for conv1_X, conv2_X, conv3_X and conv4_X respectively. Our CRNN framework, on the other hand, comprises a stack of conv layers followed by a 'Map-to-Sequence' layer and a 2-layer BLSTM unit. The architecture is: conv1, conv2, conv3_1, conv3_2, conv4_1, conv4_2, conv5_1, conv5_2, conv6. All layers except the last have 3 × 3 kernels; the last layer (conv6) has a 2 × 2 kernel. There are 64, 128 and 256 filters in conv1, conv2 and conv3_X, and 512 filters from conv4_1 till conv6; the pooling layers are after conv1, conv2, conv3_2, conv4_2 and conv5_2. While the pooling windows of the first two pooling layers are 2 × 2, the rest are 1 × 2. Based on the experimental analysis, we introduce the AFDM after the conv4_1 layer in PHOCNet, and after the conv4_1 layer in CRNN. It is to be noted that the input is resized to a height of 64 pixels keeping the aspect ratio the same. More analysis is given in Section 5.5.
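Following the description above, the Localisation Network $A$ of the AFDM can be sketched as below. The four stride-2 3 × 3 conv layers, the two fc layers, and the 18-value tanh output (K = 9 control points per sub-map) follow the text; the channel width and the global pooling before the fc layers are our assumptions.

```python
import torch.nn as nn

class LocalisationNet(nn.Module):
    """Localisation network A per Sec. 5.2: four stride-2 3x3 conv layers
    and two fc layers, with a tanh output of 18 values, i.e. K = 9 control
    points in [-1, 1]^2. Channel width and the global pooling before the
    fc layers are our assumptions; they are not specified in the text."""
    def __init__(self, in_channels, num_params=18, width=64):
        super().__init__()
        self.num_params = num_params
        convs, c = [], in_channels
        for _ in range(4):
            convs += [nn.Conv2d(c, width, 3, stride=2, padding=1),
                      nn.ReLU(inplace=True)]
            c = width
        self.features = nn.Sequential(*convs)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(width, width), nn.ReLU(inplace=True),
            nn.Linear(width, num_params), nn.Tanh(),
        )

    def forward(self, f_sub):
        theta = self.head(self.features(f_sub))
        return theta.view(-1, self.num_params // 2, 2)  # (B, K, 2) points
```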
Figure 2: (a) Word Error Rate (WER) for HWR (unconstrained) and (b) mean Average Precision (mAP) for QbS in HWS, for different numbers of training samples on the standard testing set, using different data augmentation strategies on the IAM dataset. (c) and (d) show the performance of different sub-map partitioning schemes; the setup is described in Section 5.2.

Table 2: Mean Average Precision (mAP) when using the AFDM after a specific layer in PHOCNet, for Query by String.

| Layer | conv3_1 | conv3_2 | conv3_3 | conv3_4 | conv3_5 | conv3_6 | conv4_1 | conv4_2 | conv4_3 |
|---|---|---|---|---|---|---|---|---|---|
| AFDM (TPS) | 85.29 | 85.20 | 85.97 | 86.94 | 87.88 | 88.19 | 88.69 | 87.81 | 87.77 |
| AFDM (Affine) | 84.13 | 84.12 | 84.53 | 85.01 | 85.33 | 85.81 | 86.94 | 86.02 | 85.24 |

Table 3: Word Error Rate (WER) when using the AFDM after a specific layer in CRNN (unconstrained).

| Layer | conv3_1 | conv3_2 | conv4_1 | conv4_2 | conv5_1 | conv5_2 |
|---|---|---|---|---|---|---|
| A-TPS | 17.98 | 17.41 | 17.19 | 17.25 | 20.32 | 20.41 |
| A-Affine | 22.01 | 20.11 | 19.97 | 20.01 | 19.99 | 20.21 |

Figure 3: Visualization with three examples (column-wise): the original image (a), the first channel of the undistorted feature map (b), and the distorted feature map (c).

5.3. Baseline Methods

To the best of our knowledge, there is no prior work dealing with an adversarial data augmentation strategy for HWS and HWR. Based on different popular data augmentation and transfer learning strategies, we define the following baselines to demonstrate the effectiveness of the AFDM.

• B1: In this baseline, we perform the different image-level data augmentation techniques mentioned in [29] and [40] on the handwritten word images to increase the total number of word samples (∼500K) in the training set.

• B2: Here we use a transfer learning strategy to alleviate the problem of data insufficiency in low-resource scripts. We train both the HWR and HWS models using the large amount of data available for Latin scripts; thereafter, we fix the weights up to the conv5_2 (conv4_2) layer of the CRNN (PHOCNet) network and fine-tune the rest of the layers on the available annotated data from the Indic scripts.

• B3: This is identical to our adversarial learning based framework, except that it deforms data in image space using the TPS mechanism (Section 4.2). The input to the AFDM is the original training image.

• B4: Here, we use an affine transformation [17] in place of TPS, using fewer parameters (six) to devise warping policies with a relatively lower degree of freedom for deformation.

5.4. Performance on HWR and HWS

In our experiments, we use Character Error Rate (CER) and Word Error Rate (WER) as metrics [4] for HWR, while the mean Average Precision (mAP) metric [40] is used for HWS. For lexicon-based recognition on the IAM dataset, we use all the unique words present in the dataset; we use the lexicon provided in the ICDAR 2011 competition for the RIMES dataset, and the lexicons provided with the original datasets for IndBAN and IndDEV.

From Table 1, it is to be noted that our adversarial feature augmentation method using TPS significantly outperforms B1, which applies the different image-level data augmentation techniques seen in [29, 31] together. This signifies that image-level "handcrafted" data augmentation alone cannot improve performance significantly, even if we increase the number of data samples through all feasible transformations. We notice that weight initialization from pretrained weights in B2 helps to increase the performance of both HWR and HWS to a reasonable extent and also speeds up the training procedure significantly. Both B3 and B4 are adversarial frameworks. From the results on both HWR and HWS, it can be concluded that adversarial data augmentation works better when introduced in the intermediate layers of the convolutional network, rather than as adversarial deformation in image space as done in B3. Also, TPS performs better than a simple affine transformation (B4) due to its greater degree of flexibility in deformation.
Overall, the improvement due to adversarial data augmentation is clearly higher for both IndBAN and IndDEV. Also, performance is better on the IndBAN and IndDEV datasets than on the other two datasets, in spite of our claim of greater complexity in the Bangla and Devanagari scripts. The major reason behind this is that the IndBAN and IndDEV datasets have multiple copies of the same words by the same author in both training and testing sets, as well as simpler words (having 4 characters on average), while the IAM dataset has more complex samples in its testing set. Word retrieval in real-world scenarios for Bangla and Devanagari script is far more complex than on the unseen testing set. Moreover, due to limited training data as well as a large number of character classes [3, 30], image-level data augmentation cannot generalize the model well on the testing set, giving poor performance for both HWR and HWS. In contrast, the proposed method using adversarial learning provides a significant performance gain compared to image-level data augmentation. We have also compared the recent state-of-the-art methods [33, 42, 23, 32] with ours. Note that ours is a meta-framework; the AFDM module can be incorporated in [33, 42, 23, 32] too. Overall, our results are competitive with recent frameworks on popular datasets like IAM and RIMES, and show reasonable improvement on low-resource scripts (e.g. IndicBAN and IndicDEV).

5.5. Ablation Study

We have comprehensively studied the improvements achieved by different augmentation techniques at various training data sizes on the IAM dataset. We experiment over 8 instances with our training set size ranging from 10K to 80K, using the standard testing set for evaluation. From Figure 2, it is evident that the proposed method performs well in the low-data regime, producing a reasonable improvement over image-level augmentation. It is to be noted that, with increasing training data, the improvement gained by our model over other baselines (which do not use adversarial augmentation) is reduced. We also evaluated the performance of including the AFDM at different positions of the ConvNet feature extraction units in PHOCNet and CRNN (shown in Tables 2 and 3). We observe that if the AFDM is inserted between shallower layers, the model diverges and we do not achieve a desirable result. Better performance, along with improved stability in training, is observed in the mid-to-deeper parts of the task network, which encode a higher-level understanding of the extracted feature information. The performance again drops at very deep layers. We also evaluate the performance of the model by partitioning the original feature map into 1, 2, 4, 8 and 16 sub-maps, keeping the rest of the standard setup. It was noticed that 4 divisions provide the optimum result (Figure 2).

Adversarial vs. Non-adversarial Learning: In contrast to the AFDM, which is based on the STN [17] and trained using an adversarial objective, a non-adversarial alternative is the work by Shi et al. [32], where an STN is used to rectify the spatial orientation of a word-image to make recognition easier for the Sequence Recognition Network [32], according to the original philosophy of [17]. Following [32], we introduce an STN module with TPS before the CRNN and PHOCNet architectures and train the complete architecture (STN + Task Network) in an end-to-end manner with the task loss objective (Equation 10), keeping the rest of the standard experimental setup of the task network the same. The unconstrained WER for the non-adversarial pipeline using the STN is 20.07% and the mAP value for QbS is 85.64%, trailing behind the proposed adversarial framework by 3.51% (WER) and 3.05% (mAP) respectively. Next, we divide the IAM dataset into hard and easy word samples using the framework of Mor et al. [25] with CRNN as the baseline recognizer. We consider the top 70% of word images as easy samples and 30% as hard samples, based on the confidence score. A high score signifies easily recognizable images without much deformation, while images with lower scores contain ample deformation. We train both the adversarial and non-adversarial pipelines using these easy samples and test on the hard samples. This experimental setup challenges the models to learn invariances that generalize to hard unseen word samples which are absent during training. It is observed that, while the non-adversarial pipeline gives 40.22% unconstrained WER (71.31 mAP-QbS), our adversarial framework achieves 27.64% WER (82.67 mAP-QbS), outperforming the non-adversarial alternative by a large margin of 12.58% WER (11.36 mAP-QbS). Although the objective of both pipelines is to learn a robust model invariant to different types of deformation in handwritten data, the non-adversarial method tries to learn the invariance only from the available training data, failing to generalize to unseen irregularities and deformations. Due to the free-flowing nature of handwriting, it is not possible to include every possible variation in the training dataset.
Hence, our adversarial framework proves useful for learning a robust model that can generalize well to unseen deformations which are absent in sparse datasets.

6. Conclusion

We study a common difficulty often faced by researchers exploring handwriting recognition in low-resource scripts and try to overcome the limitations of generic data augmentation strategies. The AFDM can be flexibly added to frameworks for both word-spotting and recognition, allowing deep networks to generalize well even in low-data settings. Rather than augmenting handwritten data in image space using "handcrafted" techniques, adversarially warping the intermediate feature space using TPS is a scalable solution to overcome the dearth of variations seen in some sparse training datasets. The higher degree of flexibility afforded by TPS, together with the adversarial parameterisation strategy, goes a long way toward incorporating rare unseen variations, beating deformation policies that frameworks can easily overfit to.
References

[1] Jon Almazán, Albert Gordo, Alicia Fornés, and Ernest Valveny. Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12):2552–2566, 2014.
[2] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.
[3] Ayan Kumar Bhunia, Partha Pratim Roy, Akash Mohta, and Umapada Pal. Cross-language framework for word recognition and spotting of Indic scripts. Pattern Recognition, 79:12–31, 2018.
[4] Théodore Bluche, Jérôme Louradour, and Ronaldo Messina. Scan, attend and read: End-to-end handwritten paragraph recognition with MDLSTM attention. In ICDAR, volume 1, pages 1050–1055, 2017.
[5] Fred L. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(6):567–585, 1989.
[6] Dan Cireşan and Ueli Meier. Multi-column deep neural networks for offline handwritten Chinese character classification. In IJCNN, pages 1–6, 2015.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[8] Hao Dong, Paarth Neekhara, Chao Wu, and Yike Guo. Unsupervised image-to-image translation with generative adversarial networks. arXiv preprint arXiv:1701.02676, 2017.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[10] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, pages 369–376, 2006.
[11] Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868, 2009.
[12] Emmanuèle Grosicki and Haikal El-Abed. ICDAR 2011 French handwriting recognition competition. In ICDAR, pages 1459–1463, 2011.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, pages 346–361, 2014.
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, volume 37, pages 448–456, 2015.
[16] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.
[17] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
[18] Praveen Krishnan, Kartik Dutta, and C. V. Jawahar. Deep feature embedding for accurate recognition and retrieval of handwritten text. In ICFHR, pages 289–294, 2016.
[19] Praveen Krishnan, Kartik Dutta, and C. V. Jawahar. Word spotting and recognition using deep embedding. In DAS, pages 1–6, 2018.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[21] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[22] Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.
[23] Canjie Luo, Lianwen Jin, and Zenghui Sun. MORAN: A multi-object rectified attention network for scene text recognition. Pattern Recognition, 2019.
[24] U.-V. Marti and Horst Bunke. The IAM-database: an English sentence database for offline handwriting recognition. IJDAR, 5(1):39–46, 2002.
[25] Noam Mor and Lior Wolf. Confidence prediction for lexicon-free OCR. In WACV, pages 218–225, 2018.
[26] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.
[27] U. Pal and B. B. Chaudhuri. Indian script character recognition: a survey. Pattern Recognition, 37(9):1887–1899, 2004.
[28] Réjean Plamondon and Sargur N. Srihari. Online and off-line handwriting recognition: a comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):63–84, 2000.
[29] Arik Poznanski and Lior Wolf. CNN-N-Gram for handwriting word recognition. In CVPR, pages 2305–2314, 2016.
[30] Partha Pratim Roy, Ayan Kumar Bhunia, Ayan Das, Prasenjit Dey, and Umapada Pal. HMM-based Indic handwritten word recognition using zone segmentation. Pattern Recognition, 60:1057–1075, 2016.
[31] Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2298–2304, 2017.
[32] Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. Robust scene text recognition with automatic rectification. In CVPR, pages 4168–4176, 2016.
[33] Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. ASTER: An attentional scene text recognizer with flexible rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[34] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In CVPR, pages 761–769, 2016.
[35] Patrice Y. Simard, Dave Steinkraus, and John C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, page 958, 2003.
[36] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, and Francesc Moreno-Noguer. Fracking deep convolutional image descriptors. arXiv preprint arXiv:1412.6537, 2014.
[37] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[38] Yibing Song, Chao Ma, Xiaohe Wu, Lijun Gong, Linchao Bao, Wangmeng Zuo, Chunhua Shen, Rynson Lau, and Ming-Hsuan Yang. VITAL: Visual tracking via adversarial learning. arXiv preprint arXiv:1804.04273, 2018.
[39] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[40] Sebastian Sudholt and Gernot A. Fink. PHOCNet: A deep convolutional neural network for word spotting in handwritten documents. In ICFHR, pages 277–282, 2016.
[41] Sebastian Sudholt and Gernot A. Fink. Evaluating word string embeddings and loss functions for CNN-based word spotting. In ICDAR, volume 1, pages 493–498, 2017.
[42] Sebastian Sudholt and Gernot A. Fink. Attribute CNNs for word spotting in handwritten documents. International Journal on Document Analysis and Recognition (IJDAR), 21(3):199–218, 2018.
[43] Martin Takáč, Avleen Singh Bijral, Peter Richtárik, and Nati Srebro. Mini-batch primal and dual methods for SVMs. In ICML (3), pages 1022–1030, 2013.
[44] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In CVPR, volume 1, page 4, 2017.
[45] Paul Voigtlaender, Patrick Doetsch, and Hermann Ney. Handwriting recognition with large multidimensional long short-term memory recurrent neural networks. In ICFHR, pages 228–233, 2016.
[46] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, pages 2794–2802, 2015.
[47] Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-Fast-RCNN: Hard positive generation via adversary for object detection. In CVPR, 2017.
[48] Tomas Wilkinson and Anders Brun. Semantic and verbatim word spotting using deep neural networks. In ICFHR, pages 307–312, 2016.
[49] Tomas Wilkinson, Jonas Lindström, and Anders Brun. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections. In ICCV, pages 4443–4452, 2017.
[50] Raymond Yeh, Chen Chen, Teck Yian Lim, Mark Hasegawa-Johnson, and Minh N. Do. Semantic image inpainting with perceptual and contextual losses. arXiv preprint arXiv:1607.07539, 2016.
[51] Zhuoyao Zhong, Lianwen Jin, and Zecheng Xie. High performance offline handwritten Chinese character recognition using GoogLeNet and directional feature maps. In ICDAR, pages 846–850, 2015.
[52] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
