Document Image Analysis and Text Recognition on Khmer Historical Manuscripts
Total Page:16
File Type:pdf, Size:1020Kb
Document Image Analysis and Text Recognition on Khmer Historical Manuscripts A thesis submitted in partial fulfillment of the requirements for the degree of Docteur en sciences de l’ingénieur et technologie, Université catholique de Louvain January 2020 DONA VALY Thesis committee: Prof. Michel Verleysen (UCLouvain), supervisor Prof. Jean-Pierre Raskin (UCLouvain), chairperson Prof. John Lee (UCLouvain), secretary Prof. Bernard Gosselin (UMons, Belgium) Dr. Sophea Chhun (ITC, Cambodia) List of Figures Figure 2.1: A collection of Khmer palm leaf manuscripts ................... 8 Figure 2.2: Palm leaves are being placed to dry ................................... 9 Figure 2.3: A black mixture is used to make the engraved text easy to read ..................................................................................................... 10 Figure 2.4: Several types of deformations and defects found in palm leaf manuscripts .................................................................................. 13 Figure 2.5: Example of similarity between certain Khmer characters 14 Figure 2.6: Examples of double-decker and triple-decker clusters of Khmer consonants .............................................................................. 15 Figure 2.7: Some examples of ligatures between certain letters which form new shapes ................................................................................. 16 Figure 2.8: Irregular sequential ordering of symbols in a word. The different order of the position of the characters in the text image (top) and the Unicode sequence of the text (bottom) .................................. 17 Figure 2.9: Simulation of how the word “TABLE” could have been written in Khmer style (vowel ‘A’ composed of two parts which are placed on top and on the left side of ‘T’, consonant ‘L’ written as a sub-from under ‘B’, and vowel ‘E’ placed under ‘B’ and the sub-f . 17 Figure 3.1: (a) Sigmoid, (b) Tanh, (c) ReLU, (d) Leaky ReLU ......... 28 Figure 3.2: Architecture of a simple CNN ......................................... 31 Figure 3.3: Architecture of a simple RNN (a) rolled, (b) unrolled .... 33 Figure 3.4: Architecture of an LSTM cell .......................................... 35 Figure 3.5: Architecture of a simple Bidirectional-RNN ................... 37 Figure 3.6: Axes representing the 4 possible directions in 2D-RNN . 39 Figure 4.1: Samples of digitized images from top to bottom: EFEO, BI, and NL .......................................................................................... 45 Figure 4.2: A set up for our digitization campaign ............................ 46 i Figure 4.3: Sample images of our digitization campaign (from top to bottom: Phnom Penh, Kandal, and Siem Reap) ................................. 46 Figure 4.4: Annotation of individual character dataset ...................... 48 Figure 4.5: Examples of (a) characters containing multiple parts and (b) merged shapes ............................................................................... 49 Figure 4.6: Application of inpainting technique on a character image patch (a) input image, (b) inpainting mask using polygon boundary, (c) result .............................................................................................. 49 Figure 4.7: Order sequence of characters in a word ........................... 50 Figure 4.8: Annotation of word dataset .............................................. 50 Figure 4.9: (a) Sample Khmer text, (b) Word separation, (c) Sub- syllable separation .............................................................................. 52 Figure 4.10: Construction of line segmentation ground truth ............ 53 Figure 4.11: Sample of an xml file storing annotation information of a manuscript page .................................................................................. 54 Figure 4.12: Samples of annotated character patch images ............... 55 Figure 4.13: Samples of annotated word patch images ...................... 55 Figure 4.14: Balinese palm leaf manuscripts ..................................... 57 Figure 4.15: Sundanese palm leaf manuscript .................................... 57 Figure 5.3: Khmer manuscript with binarized ground truth image .... 61 Figure 5.3: Balinese manuscript with binarized ground truth image . 61 Figure 5.3: Sundanese manuscript with binarized ground truth image ............................................................................................................ 61 Figure 5.4: Binarization of Khmer manuscript with ICFHR G1 method ............................................................................................................ 65 Figure 5.5: Binarization of Balinese manuscript with ICFHR G2 method ................................................................................................ 65 Figure 5.6: Binarization of Sundanese manuscript with Niblack’s method ................................................................................................ 65 Figure 5.7: Overview of the proposed line segmentation pipeline ..... 66 ii Figure 5.8: (a) Original Image, (b) Edge map using Canny edge detection, (c) Stroke map .................................................................... 68 Figure 5.9: (a) Smooth projection profiles and the estimated medial points (red dots), (b) Adaption of the medial points to surrounding connected components (black dots) ................................................... 71 Figure 5.10: An example of an optimal path going from the start state 1 to the goal state ........................................................................ 74 Figure 5.11: Segmentation results of the proposed approach (pairs of whole segmented manuscript page and zoomed out area with medial seams marked in red) .......................................................................... 76 Figure 6.1: Overview of the workflow of the text recognition module ............................................................................................................ 79 Figure 6.2: CNN based network ......................................................... 81 Figure 6.3: Column wise LSTM network .......................................... 81 Figure 6.4: Column wise and row wise LSTM .................................. 81 Figure 6.5: A combination of convolutional and recurrent neural network ............................................................................................... 82 Figure 6.6: (a). Original image, (b). Gray scaled and resized, (c). Normalized ......................................................................................... 83 Figure 6.7: Sample images of Balinese characters ............................. 84 Figure 6.8: Sample images of Sundanese characters ......................... 84 Figure 6.9: (a). Original word image patch, (b). Annotated character information in the word: polygon boundaries of all characters, (c). Glyph-class map ................................................................................. 87 Figure 6.10: General architecture of the networks in the first trial .... 88 Figure 6.11: Architecture of the 1D-LSTM layer (Trial-Net1) .......... 89 Figure 6.12: Architecture of the 2D-LSTM layer (Trial-Net2) .......... 89 Figure 6.13: (a). Initial sample order, (b). Sort by the width of each sample, (c). Pad each sample to the maximum width in the batch, (d). Shuffle batch order ............................................................................. 90 iii Figure 7.6: (a). Original word image, (b). Ground truth GCM, (c). Result predicted by the Trial-Net1, (d). Result predicted by the Trial- Net2 .................................................................................................... 91 Figure 7.1: Overview of the architecture of proposed word recognition model .................................................................................................. 94 Figure 7.2: Detailed architecture of the GCM decoder ...................... 98 Figure 7.3: Overview of the architecture of the SubSyl-Net ............ 102 Figure 7.4: Modified version of GCM, (a) Raw image patch I, (b) Map I' containing polygon boundaries of all glyphs, (c) Downsampling I' by applying nearest neighbor interpolation to obtain the new GCM 103 Figure 7.5: Detailed architecture of the inception layer. Values in the parenthesis are the numbers of filters (corresponding respectively to the first and the second inception blocks in the SubSyl-Net ............ 104 Figure 7.7: Sample results from the SubSyl-Net showing the predicted GCM and the attention map at each time step (the region highlighted in red is where the decoder attends to) ............................................. 117 iv List of Tables Table 4.1: Number of digital Image collections available in various establishments .................................................................................... 44 Table 4.2: Collection of digitized palm leaf manuscripts from our digitization campaign ......................................................................... 45 Table 4.3: Collection of palm leaf manuscript images from different sources composing SleukRith Set ...................................................... 47 Table 4.4 Summary of the statistics of the SleukRith Set .................. 54 Table 5.1: Palm leaf manuscript datasets for binarization task .......... 60 Table 5.2: Experimental results for the binarization task ................... 62 Table 5.3 Result of the performance evaluation of line segmentation methods