Improving Text Recognition Using Optical and Language Model Writer Adaptation
Yann Soullard, Wassim Swaileh, Pierrick Tranouez, Thierry Paquet
LITIS lab, Normandie University, Université de Rouen Normandie
76800 Saint Etienne du Rouvray, France
Email: [email protected]

Clément Chatelain
LITIS lab, INSA Rouen Normandie
76800 Saint Etienne du Rouvray, France
Email: [email protected]

Abstract—State-of-the-art methods for handwriting text recognition are based on deep learning approaches and language modeling that require large data sets during training. In practice, some applications require the system to process mono-writer documents, and would thus benefit from being trained on examples from that writer. However, it is not common to have numerous examples coming from just one writer. In this paper, we propose an approach to adapt both the optical model and the language model to a particular writer, starting from a generic system trained on large data sets with a variety of examples. We show the benefits of optical and language model writer adaptation. Our approach reaches competitive results on the READ 2018 data set, which is dedicated to model adaptation to particular writers.

Keywords-Handwriting recognition; writer adaptation; deep neural network; optical model; language model

I. INTRODUCTION

Deep learning approaches combined with language modeling are now state-of-the-art approaches for handwriting recognition tasks. However, they require large amounts of labeled training data to achieve optimal performance. In speech and handwriting recognition, many personalized applications have to deal with a specific handwritten document or audio recording coming from one writer or speaker. Besides, in the field of digital humanities, there is a need for the digitization and transcription of old manuscripts, many of which come from one writer only.

Designing a handwriting recognition system trained to recognize the writing and language style of a specific writer is attractive, but incompatible with the large amount of examples required to train deep architectures and language models. Another difficulty is that speech and handwriting recognition systems may also suffer from a shift in the distribution of the target language between the training data set and the test data set.

As a consequence, there is a growing interest in designing learning algorithms that can be trained on small labeled data sets from the targeted domain while keeping a good generalization ability. Domain adaptation [1] and transfer learning [2], [3] are machine learning approaches that specifically address the problem of data distribution shift between data sets that have some similar properties. Transfer learning is now a widespread approach that has been shown to be very efficient on various recognition tasks such as handwriting recognition, signature identification or medical imaging [4], [5]. In addition, the language itself, but also lexicons as well as word frequencies, may not match the statistical distributions learned during language model training. Statistical language model adaptation algorithms [6] have also been designed in this respect.

This paper presents an approach to adapt a generic text recognition system to a particular writer. In this respect, both the optical model and the language model of the generic system are considered during the adaptation phase. The generic system components (optical and language models) are first trained on a large data set (one that does not contain any example from the writer). The optical model is based on a reference architecture that can benefit from transfer learning and data augmentation. The language model benefits from sub-lexical modeling [7] and model interpolation to circumvent the lack of training examples from the writer.

Our approach is evaluated on the 2018 READ competition data sets [8], which are dedicated to writer adaptation analysis. We show that our method reaches competitive performance on the adaptation task and outperforms the systems that were submitted to the competition. To the best of our knowledge, it is the first time that both optical model and language model adaptation are evaluated simultaneously on an adaptation task.

The article is organized as follows. In section II, we discuss related works on handwriting recognition system adaptation, both for the optical model and the language model. Section III presents the proposed strategy for writer adaptation. Finally, experiments on the 2018 READ competition are described in section IV.
II. RELATED WORKS

Speech and handwriting recognition both suffer from the large variability between speakers or writers, for both the acoustic or optical model and the language model. This variability is known to limit the recognition performance of a generic multi-writer (or multi-speaker) system. Adapting a generic recognition system to a specific data set should therefore involve the adaptation of both the optical model and the language model; this is the proposition of this paper. Note that language model adaptation has been studied extensively in speech recognition [9], but only few works have focused on handwriting recognition until now.

A. Optical model adaptation

The adaptation strategy is either a data normalization process [10] or a model adaptation process that modifies the parameters of a general-purpose recognizer to fit the specific speech or handwriting. In this respect, MLLR (Maximum Likelihood Linear Regression) [11] and MAP (Maximum A Posteriori) [12] have been proposed within the well-established framework of Hidden Markov Models (HMM) to adapt the parameters of the acoustic/optical models, or their structure [13]. It should also be noticed that most state-of-the-art industrial OCR systems rely on some adaptation principles [14], introducing heuristic rules or statistics gathered at the document level.

Recently, as part of the ICFHR 2018 READ competition [8], most of the participants proposed an optical model composed of a Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BLSTM) layers, as in [15] and [16]. One of them used Multi-Dimensional LSTM (MDLSTM) [17], which provided good performance when trained on a large generic data set, but showed difficulty in carrying out the writer adaptation process with few samples. After the competition, [18] proposed a fully convolutional network architecture trained with the CTC loss function for text recognition; this system achieves the best results on the READ 2018 data set, significantly outperforming the other systems. Following the guidelines of the READ 2018 competition, all systems are trained on a generic data set and then fine-tuned on specific documents using transfer learning and data augmentation strategies, as sketched below.
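To illustrate this fine-tuning step, here is a minimal sketch (in PyTorch) of adapting a generic CNN-BLSTM optical model trained with the CTC loss to a single writer. It is not the code of any competition system: the layer sizes, the checkpoint name generic_optical_model.pt and the writer_specific_loader iterator are illustrative assumptions.

    # Minimal sketch of writer adaptation by transfer learning: a generic
    # CNN-BLSTM model trained with CTC is fine-tuned on the few labeled
    # lines available for one writer.
    import torch
    import torch.nn as nn

    class CRNN(nn.Module):
        """CNN feature extractor followed by a bidirectional LSTM, as in [15], [16]."""
        def __init__(self, num_chars, hidden=256):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.blstm = nn.LSTM(128 * 8, hidden, num_layers=2,
                                 bidirectional=True, batch_first=False)
            self.head = nn.Linear(2 * hidden, num_chars + 1)  # +1 for the CTC blank

        def forward(self, x):            # x: (batch, 1, 32, width) line images
            f = self.cnn(x)              # (batch, 128, 8, width / 4)
            b, c, h, w = f.shape
            f = f.permute(3, 0, 1, 2).reshape(w, b, c * h)   # (time, batch, features)
            out, _ = self.blstm(f)
            return self.head(out).log_softmax(-1)            # CTC expects log-probs

    model = CRNN(num_chars=80)
    model.load_state_dict(torch.load("generic_optical_model.pt"))  # generic weights

    # Freeze the convolutional layers; only the recurrent part and the
    # output layer are adapted on the small specific data set.
    for p in model.cnn.parameters():
        p.requires_grad = False

    ctc = nn.CTCLoss(blank=80, zero_infinity=True)  # blank index = num_chars
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)

    # The loader is assumed to yield frame lengths matching the model's
    # downsampled output length (width / 4).
    for images, targets, in_lens, tgt_lens in writer_specific_loader:
        log_probs = model(images)
        loss = ctc(log_probs, targets, in_lens, tgt_lens)
        opt.zero_grad(); loss.backward(); opt.step()

Freezing the convolutional feature extractor and using a small learning rate are standard precautions against overfitting when only a few labeled lines from the writer are available.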
B. Language model adaptation

Language model (LM) adaptation has been studied extensively in speech recognition. In practice, there is rarely enough data to learn a reliable language model on samples from one single writer only [6]. Thus, language model adaptation consists in deriving a specific language model from a generic language model and a small training data set (text samples of one writer). The generic LM is previously trained on a much larger and more general data set. Language model adaptation is commonly performed by interpolating the two models (specific and generic LM) [19], typically using a linear interpolation. In such approaches, a back-off to the generic LM can also be used as a way to fill up missing statistics of the specific LM; a back-off threshold has to be determined in this case.

In speech recognition, interpolations between a generic LM and task-specific uni-grams have been proposed in [20]. In handwriting recognition, Xiu and Baird [22] adapted a word lexicon from OCR results, and Lee and Smith [23] modified word uni-gram probabilities in caches. Besides, Wang et al. [19] presented three methods for language model adaptation in order to match a writer corpus with a predefined set of generic LMs.

Finally, some of the participants in the READ 2018 competition proposed to use language models such as n-grams of words, sub-lexical units or characters. In two systems, the language model is the result of the interpolation of a document-specific (writer-specific) language model with a generic one; for both systems, the balance between the two LMs is set beforehand.

III. PROPOSED APPROACH

We present in this section our strategy to adapt a generic handwriting recognition system to a specific writer. In the following, we will call generic data set a large data set composed of multiple documents from different writers and languages, and specific data set a smaller data set coming from a document (or a subset of documents) written by a single writer. Similarly, we will call generic model and specific model an optical model or a language model trained on the generic data set or the specific data set, respectively.

A. System overview

The system architecture is illustrated in Figure 1. We first train a generic handwriting recognition system composed of an optical model P_G(O|W) and a language model P_G(W). This system is trained on the generic data set, in order to model the variability between different writing styles, languages and document image backgrounds. Note that we showed in [24] that a language model can account for numerous languages without affecting the performance, thus providing a multilingual LM.

Both generic models (optical and language) serve as initial models of the adaptation process, which is intended to produce the specific optical model P_S(O|W) and the specific language model P_S(W). At decoding time, the outputs of the optical model, either generic or specific, are analyzed by the appropriate language model by exploring a search graph built as a Weighted Finite State Transducer (WFST), using the Viterbi algorithm. The method selects the sentence Ŵ maximizing the a posteriori probability P(W|O) among all possible sentences W by applying the Bayes formula:

    Ŵ = arg max_W P(W|O) = arg max_W P(O|W) P(W)^α β^|W|    (1)

where O is the observation sequence coming from the input image and P(O|W) is the probability of the observation sequence given the sentence W.
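As a concrete reading of equation (1) combined with the linear LM interpolation of section II-B, the sketch below rescores an n-best list of candidate sentences. It assumes the usual interpretation of α as the language model scale factor and β as a word insertion penalty; the callables p_spec and p_gen and all weight values are hypothetical placeholders, and the actual system decodes over a WFST search graph with the Viterbi algorithm rather than rescoring an explicit list.

    # Minimal sketch of equation (1) with a linearly interpolated language
    # model: P_S(W) is approximated by lam * P_spec + (1 - lam) * P_gen.
    import math

    def interpolated_lm_logprob(words, p_spec, p_gen, lam=0.5):
        """log P_S(W) under a linear interpolation of specific and generic n-grams."""
        logp = 0.0
        for i, w in enumerate(words):
            hist = tuple(words[max(0, i - 2):i])          # trigram history
            p = lam * p_spec(w, hist) + (1 - lam) * p_gen(w, hist)
            logp += math.log(max(p, 1e-12))               # floor to avoid log(0)
        return logp

    def decode(hypotheses, p_spec, p_gen, alpha=0.8, beta=1.2, lam=0.5):
        """Pick W maximizing log P(O|W) + alpha * log P_S(W) + |W| * log(beta)."""
        def score(hyp):
            words, optical_logprob = hyp                  # log P(O|W) from the optical model
            return (optical_logprob
                    + alpha * interpolated_lm_logprob(words, p_spec, p_gen, lam)
                    + len(words) * math.log(beta))        # word insertion penalty
        return max(hypotheses, key=score)

With lam = 1.0 the decoder relies on the specific LM alone, while lam = 0.0 falls back to the generic LM; intermediate values implement the fixed balance between the two LMs mentioned above for the READ 2018 systems.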