
Journal of Imaging

Article

Benchmarking of Document Image Analysis Tasks for Palm Leaf Manuscripts from Southeast Asia

Made Windu Antara Kesiman 1,2,*, Dona Valy 3,4, Jean-Christophe Burie 1, Erick Paulus 5, Mira Suryani 5, Setiawan Hadi 5, Michel Verleysen 2, Sophea Chhun 4 and Jean-Marc Ogier 1

1 Laboratoire Informatique Image Interaction (L3i), Université de La Rochelle, 17042 La Rochelle, France; [email protected] (J.-C.B.); [email protected] (J.-M.O.)
2 Laboratory of Cultural Informatics (LCI), Universitas Pendidikan Ganesha, Singaraja 81116, Indonesia; [email protected]
3 Institute of Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM), Université Catholique de Louvain, 1348 Louvain-la-Neuve, Belgium; [email protected]
4 Department of Information and Communication Engineering, Institute of Technology of Cambodia, Phnom Penh, Cambodia; [email protected]
5 Department of Computer Science, Universitas Padjadjaran, Bandung 45363, Indonesia; [email protected] (E.P.); [email protected] (M.S.); [email protected] (S.H.)
* Correspondence: [email protected]

Received: 15 December 2017; Accepted: 18 February 2018; Published: 22 February 2018

Abstract: This paper presents a comprehensive test of the principal tasks in document image analysis (DIA), starting with binarization, text line segmentation, and isolated character/glyph recognition, and continuing on to word recognition and transliteration, for a new and challenging collection of palm leaf manuscripts from Southeast Asia. The research is performed on a complete dataset collection of Southeast Asian palm leaf manuscripts covering three different scripts: Khmer script from Cambodia, and Balinese and Sundanese scripts from Indonesia. The binarization task is evaluated on many methods, up to the latest ones presented in recent binarization competitions. The seam carving method is evaluated for the text line segmentation task, compared to a recently proposed text line segmentation method for palm leaf manuscripts. For the isolated character/glyph recognition task, the evaluation is reported for a handcrafted feature extraction method, a neural network with unsupervised feature learning, and a Convolutional Neural Network (CNN) based method. Finally, a Recurrent Neural Network-Long Short-Term Memory (RNN-LSTM) based method is used to analyze the word recognition and transliteration task for the palm leaf manuscripts. The results of all experiments provide the latest findings and a quantitative benchmark for palm leaf manuscript analysis for researchers in the DIA community.

Keywords: document image analysis; binarization; character recognition; text line segmentation; word recognition; transliteration; palm leaf manuscript; dataset; benchmark; experimental test

1. Introduction

Since the world entered the digital age in the late 20th century, the need for document image analysis (DIA) systems has been increasing. This is due to the dramatic increase in efforts to digitize the various types of document collections available, especially the ancient documents of historical relics found in various parts of the world. Some very interesting projects on a wide variety of heritage document collections can be mentioned here: for example, the tranScriptorium project (http://transcriptorium.eu/) [1]; the READ (Recognition and Enrichment of Archival Documents) project (https://read.transkribus.eu/) [2], which works on documents from the Middle Ages to today, and also focuses on different languages ranging from Ancient Greek to modern English; the

J. Imaging 2018, 4, 43; doi:10.3390/jimaging4020043

IAM Historical Document Database (IAM-HistDB) (http://www.fki.inf.unibe.ch/databases/iam-historical-document-database) [3], which includes handwritten historical manuscript images from the Saint Gall Database from the 9th century in Latin, the Parzival Database from the 13th century in German, and the Washington Database from the 18th century in English; the Ancient Lives Project (https://www.ancientlives.org/) [4], which asks volunteers to transcribe Ancient Greek text fragments from the Oxyrhynchus Papyri collection; and many other projects.

To accelerate the process of accessing, preserving, and disseminating the contents of heritage documents, a DIA system is needed. Besides aiming to preserve the existence of such ancient documents physically, the DIA system is expected to enable open access to the contents of the documents and provide opportunities for a wider audience to access all the important information stored in them. DIA is the process of using various technologies to extract text, printed or handwritten, and graphics from digitized document files (http://www.cvisiontech.com/library/pdf/pdf-document/document-image-analysis.html) [5]. DIA systems generally have a major role in identifying, analyzing, extracting, structuring, and transferring document contents more quickly, effectively, and efficiently. Such systems are able to work semi-automatically or even fully automatically without human intervention. The DIA system is expected to save time, cost, and effort at many points in the heritage document preservation process.

However, although DIA research is developing rapidly, it is undeniable that most of the document collections used in the initial step are from developed regions such as America and European countries. The document samples from these countries are mostly written in English or old English with Latin/Roman script. Several important document collections were eventually adopted as standard benchmarks for the evaluation of the latest DIA research results. The next wave of DIA research began to deal with documents from non-English-speaking areas with non-Latin scripts, such as Arabic, Chinese, and Japanese documents. During the evolution of DIA research in the last two decades, DIA researchers have proposed and achieved satisfactory solutions for many complex problems of document analysis for these types of documents. However, the DIA research challenge is ongoing. The latest challenge is documents from Asia, with new languages and more complex scripts to explore, such as Bangla script [11] and various other South and Southeast Asian scripts [6-10,12], and the case of multiple languages and scripts in documents from India. Optical character recognition (OCR) for Indian languages is considered more difficult in general than for European languages because of the large number of vowels, consonants, and conjuncts (combinations of vowels and consonants) [13].

This work is part of an exploration of DIA research for a palm leaf manuscripts collection from Southeast Asia. This collection offers a new challenge for DIA researchers because palm leaves are used as the writing medium and the languages and scripts have never been analyzed before. In this paper, we performed a comprehensive benchmark experimental test of the principal tasks in a DIA system, covering binarization, text line segmentation, isolated character/glyph recognition, word recognition, and transliteration.
To the best of our knowledge, this work is the first comprehensive study of its kind for the DIA research community and the first to perform a complete series of experimental benchmarking analyses on palm leaf manuscripts. The results of this research will be very useful in accelerating, evaluating, and improving the performance of existing DIA systems for a new type of document. This paper is organized as follows. Section 2 gives a brief description of the palm leaf manuscripts collection from Southeast Asia, especially the Khmer palm leaf manuscript corpus from Cambodia and two palm leaf manuscript corpuses, the Balinese and Sundanese manuscripts, from Indonesia. The challenges of DIA for this manuscript corpus are also presented in this section. Section 3 describes the DIA tasks that need to be developed for the palm leaf manuscript collections, followed by a description of the methods investigated for those tasks. The datasets and evaluation methods for each DIA task used in the experimental studies are presented in Section 4. Section 5 reports and analyzes the detailed results of the experiments. Finally, conclusions are given in Section 6.

2. Palm Leaf Manuscripts from Southeast Asia

Regarding the use of writing materials and tools, history records the discovery of important documents written on stone plates, clay plates or tablets, bark, skin, animal bones, ivory, tortoiseshell, papyrus, parchment (a form of leather made of processed sheepskin or calfskin) (http://www.casepaper.com/company/paper-history) [14], copper and bronze plates, bamboo, palm leaves, and other materials [15]. The choice of natural materials that can be used as a medium for document writing is strongly influenced by the geographical condition and location of a nation. For example, because bamboo and palm trees are easily found in Asia, both types of materials were the first choice of writing material in Asia. In Southeast Asia, most ancient manuscripts were written on palm leaves. For example, in Cambodia, palm leaves have been used as a writing material dating back to the first appearance of Buddhism in the country. In Thailand, dried palm leaves have also been used as one of the most popular written documents for over 500 years [16]. Palm leaves were also historically used as writing supports in manuscripts from the Indonesian archipelago. The leaves of the sugar, or toddy, palm (Borassus flabellifer) are known as lontar. The existence of ancient palm leaf manuscripts in Southeast Asia is very important both in terms of the quantity and the variety of historical contents.

2.1. Balinese Palm Leaf Manuscripts—Collection from Bali, Indonesia

2.1.1. Corpus

Apart from the collections at the museums (Museum Gedong Kertya in Singaraja and Museum Bali in Denpasar), it is estimated that there are more than 50,000 lontar collections owned by private families (Figure 1). For this research, in order to obtain a large variety of manuscript images, sample images have been collected from 23 different collections, which come from five different locations (regions): two museums and three private families. They consist of 10 randomly selected collections from Museum Gedong Kertya, City of Singaraja, Regency of Buleleng, North Bali, Indonesia, four collections from the manuscript collections of Museum Bali, City of Denpasar, South Bali, seven collections from a private family collection from the village of Jagaraga, Regency of Buleleng, and two other private family collections from the village of Susut, Regency of Bangli, and the village of Rendang, Regency of Karangasem [17].

Figure 1. Balinese palm leaf manuscripts.

2.1.2. Balinese Script and Language

Although the official language of Indonesia, Bahasa Indonesia, is written in the Latin script, Indonesia has many local, traditional scripts, most of which are ultimately derived from Brahmi [18]. In Bali, palm leaf manuscripts were written in the Balinese script in the Balinese language, including ancient literary texts composed in the old Javanese language of Kawi and in Sanskrit. Balinese is a Malayo-Polynesian language spoken by more than 3 million people, mainly in Bali, Indonesia (www.omniglot.com/writing/balinese.htm) [19]. Balinese is the native language of the people of Bali, known locally as Basa Bali [18]. The alphabet and numbers of Balinese script are composed of ±100 character classes including consonants, vowels, and some other special compound characters. According to the Unicode Standard 9.0, the Balinese script occupies the Unicode range 1B00 to 1B7F.


2.2. Khmer Palm Leaf Manuscripts—Collection from Cambodia

2.2.1. Corpus

In Cambodia, Khmer palm leaf manuscripts (Figure 2) are still seen in Buddhist establishments and are traditionally used by monks as reading scriptures. Various libraries and institutions have been collecting and digitizing these manuscripts and have even shared the digital images with the public. For instance, the École Française d'Extrême-Orient (EFEO) has launched an online database (http://khmermanuscripts.efeo.fr) [20] of microfilm images of hundreds of Khmer palm leaf manuscript collections. Some digitized collections are also obtained from the Buddhist Institute, which is one of the biggest institutes in Cambodia responsible for research on Cambodian literature and language related to Buddhism, and also from the National Library (situated in the capital city, Phnom Penh), which is home to a large collection of palm leaf manuscripts. Moreover, a standard digitization campaign was conducted in order to collect palm leaf manuscript images found in Buddhist temples in different locations throughout Cambodia: Phnom Penh, Kandal, and Siem Reap [21].

Figure 2. Khmer palm leaf manuscript.

2.2.2. Khmer Script and Language

According to the era during which the documents were created, slightly different versions of Khmer characters are used in the writing of Khmer palm leaf manuscripts. The Khmer alphabet is famous for its numerous symbols (~70), including consonants, different types of vowels, diacritics, and special characters. Certain symbols even have multiple shapes and forms depending on what other symbols are combined with them to create words. The languages written on palm leaf documents vary from Khmer, the official language of Cambodia, to Pali and Sanskrit, by which the modern Khmer language was considerably influenced. Only a minority of Cambodian people, such as philologists and Buddhist monks, are able to read and understand the latter languages.

2.3. Sundanese Palm Leaf Manuscripts—Collection from West Java, Indonesia

2.3.1. Corpus

The collection of Sundanese palm leaf manuscripts (Figure 3) comes from Situs Kabuyutan Ciburuy, Garut, West Java, Indonesia. Kabuyutan Ciburuy is a cultural heritage complex associated with Prabu Siliwangi and Prabu Kian Santang, the king and the son of the Padjadjaran kingdom. The cultural complex consists of six buildings. One of them is Bale Padaleuman, which is used to store the Sundanese palm leaf manuscripts. The oldest Sundanese palm leaf manuscript in Situs Kabuyutan Ciburuy dates from the 15th century. In Bale Padaleuman, there are 27 collections of Sundanese manuscripts. Each collection contains 15 to 30 pages, with dimensions of 25–45 cm in length × 10–15 cm in width [22].

2.3.2. Sundanese Script and Language

The Sundanese palm leaf manuscripts were written in the ancient Sundanese language and script. The characters consist of numbers, vowels (such as a, i, u, e, and o), basic characters (such as ha, na, ca, ra, etc.), punctuation, diacritics (such as panghulu, pangwisad, paneuleung, panyuku, etc.), and many special compound characters.

Figure 3. Sundanese palm leaf manuscript.

2.4. Challenges of Document Image Analysis for Palm Leaf Manuscripts

There are two main technical challenges to assessing palm leaf manuscripts in a DIA system. The first challenge is the physical condition of the palm leaf manuscript, which will strongly influence the quality of the captured document images. In the image capturing process for DIA research, data in a paper document are usually captured by optical scanning, but when the document is on a different medium such as microfilm, palm leaves, or fabric, photographic methods are often used to capture the images [13]. Nowadays, due to the specific characteristics of the physical support of the manuscripts, the development of DIA methods for palm leaf manuscripts in order to extract relevant information is considered a new research problem in handwritten document analysis. Ancient palm leaf manuscripts contain artifacts due to aging, foxing, yellowing, strain, local shading effects, low intensity variations or poor contrast, random noises, discolored parts, fading, and other types of degradation.

The second challenge is the complexity of the script. The Southeast Asian manuscripts with different scripts and languages provide real challenges for document analysis methods, not only because of the different forms of characters in the script, but also because the writing style of each script (e.g., how to join or separate a character in a text line) differs. These difficulties affect a wide range of tasks, from the binarization process [23–25], text line segmentation [26,27], and character and text recognition [25,28,29], to word spotting methods [30].

In the domain of DIA, handwritten character and text recognition has been the subject of intensive research during the last three decades. Some methods have already reached satisfactory performance, especially for Latin, Chinese, and Japanese scripts. However, the development of handwritten character and text recognition methods for various other Asian scripts presents many issues. In the OCR task and development for palm leaf manuscripts from Southeast Asia, several deformations in the character shapes are visible due to the merges and fractures caused by the use of nonstandard fonts. The similarities of distinct character shapes, overlaps, and interconnection of the neighboring characters further complicate the OCR system [31]. One of the main problems faced when dealing with segmented handwritten character recognition is the ambiguity and illegibility of the characters [32]. These characteristics provide suitable conditions to test and evaluate the robustness of feature extraction methods that were proposed for character recognition.

3. Document Image Analysis Tasks and Investigated Methods

Heritage document preservation is not just about converting physical documents into document images. With many physical documents being digitized, stored in large document databases, and sent and received via digital machines, the interest and demand grew to require more functionalities than simply viewing and printing the images [33]. Further treatment is required before the collection of document images can be explored more extensively. For example, a more specific research field needed to be developed to add machine capabilities for extracting information from these images, reading text on a document page, finding sentences, and locating paragraphs, lines, words, and symbols on a diagram [33].

In this work, the methods for each DIA task were investigated for palm leaf manuscripts. The binarization task is evaluated using the latest methods from binarization competitions. The seam carving method is evaluated for the text line segmentation task, compared to a recent text line segmentation method for palm leaf manuscripts [27]. For the isolated character/glyph recognition task, the evaluation is reported for the handcrafted feature extraction method, the neural network with unsupervised learning feature, and the CNN based method. Finally, the RNN-LSTM based method is used to analyze the word recognition and transliteration task for palm leaf manuscripts.

3.1. Binarization

Binarization is widely applied as the first pre-processing step in document image analysis [34]. It converts gray image values into a binary representation of background and foreground or, more specifically, text and non-text, which is then fed into further document processing tasks such as text line segmentation and optical character recognition. The performance of binarization techniques has a great impact on, and directly affects, the performance of the recognition task [35]. Non-optimal binarization methods produce unrecognizable characters with noise [16]. Many binarization methods have been reported. These methods have been tested and evaluated on different types of document collections. Based on the choice of the thresholding value, binarization methods can generally be divided into two types: global binarization and local adaptive binarization [16]. Some surveys and comparative studies of the performance of several binarization methods have been reported [35,36]. A binarization method that performs well for one document collection cannot necessarily be applied to another document collection with the same performance [34]. For this reason, there is always a need to perform a comprehensive evaluation of existing binarization methods for a new document collection with different characteristics, for example historical archive documents [36]. In this work, we compared several alternative binarization algorithms for palm leaf manuscripts. We tested and evaluated some well-known standard binarization methods, and some binarization methods that are experimentally promising for historical archive documents, though not specifically for images of palm leaf manuscripts. We also tested binarization methods from the Document Image Binarization Competition (DIBCO) [37,38], for example Howe's method [39], and methods from the International Conference on Frontiers in Handwriting Recognition (ICFHR) competition (amadi.univ-lr.fr/ICFHR2016_Contest) [25,40].

3.1.1. Global Thresholding

Global thresholding is the simplest technique and the most conventional approach for binarization [34,41]. A single threshold value is calculated from the global characteristics of the image. This value should be properly chosen, based on a heuristic technique or a statistical measurement, to be able to give promising optimal binarization results [36]. It is widely known that using a global threshold to process a batch of archive images with different illumination and noise variation is not a proper choice. The variation between images in the foreground and background colors of low-quality document images gives unsatisfactory results. It is difficult to choose one fixed threshold value that is adaptable to all images [36,42]. Otsu's method is a very popular global binarization technique [34,41]. Conceptually, Otsu's method tries to find an optimum global threshold on an image by minimizing the weighted sum of variances of the object and background pixels [34]. Otsu's method is implemented as a standard binarization technique in a built-in Matlab function called graythresh (https://fr.mathworks.com/help/images/ref/graythresh.html) [43].
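To make Otsu's criterion concrete, the following minimal NumPy sketch searches all 256 candidate gray levels and picks the one maximizing the between-class variance, which is equivalent to minimizing the weighted within-class variance mentioned above. It is an illustrative re-implementation, not the Matlab graythresh code, and the helper names are our own.

```python
# Minimal sketch of Otsu's global thresholding for an 8-bit grayscale image.
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the gray level that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()                   # gray-level probabilities
    omega = np.cumsum(prob)                    # class-0 probability w0(t)
    mu = np.cumsum(prob * np.arange(256))      # cumulative mean
    mu_total = mu[-1]
    # Between-class variance: sigma_b^2(t) = (mu_T*w0 - mu)^2 / (w0*(1 - w0))
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b2 = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b2 = np.nan_to_num(sigma_b2)         # ignore degenerate splits
    return int(np.argmax(sigma_b2))

def binarize_global(gray: np.ndarray) -> np.ndarray:
    t = otsu_threshold(gray)
    # 1 marks pixels brighter than the threshold (typically background
    # for dark ink on a light palm leaf surface).
    return (gray > t).astype(np.uint8)
```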

3.1.2. Local Adaptive Binarization

To overcome the weakness of the global binarization technique, many local adaptive binarization techniques were proposed, for example Niblack's method [34,36,41,42,44], Sauvola's method [34,36,41,42,44,45], Wolf's method [42,44,46], the NICK method [44], and the Rais method [34]. The threshold value in a local adaptive binarization technique is calculated for each smaller local image area, region, or window. Niblack's method proposed a local threshold computed from the local mean and local standard deviation of a rectangular local window around each pixel of the image. The rectangular sliding local window covers the neighborhood of each pixel. Using this concept, Niblack's method was reported to outperform many thresholding techniques and gave optimal results for many document collections. However, there is still a drawback to this method. It was found that Niblack's method works optimally only on text regions, but is not well suited to large non-text regions of an image. The absence of text in local areas forces Niblack's method to detect noise as text. The suitable window size should be chosen based on the character and stroke size, which may vary for each image. Many other local adaptive binarization techniques were proposed to improve the performance of the basic Niblack method. For example, Sauvola's method is a modified version of Niblack's method. Sauvola's method proposes a local binarization technique to deal with light texture, large variations, and uneven illumination. The improvement over Niblack's method lies in the adaptive contribution of the standard deviation in determining the local threshold on the gray values of text and non-text pixels. Sauvola's method processes the image in N × N adjacent and non-overlapping blocks separately. Wolf's method tried to overcome the problem of Sauvola's method when the gray values of text and non-text pixels are close to each other, by normalizing the contrast and the mean gray value of the image to compute the local threshold. However, a sharp change in background gray values across the image decreases the performance of Wolf's method. Two other improvements to Niblack's method are the NICK method and the Rais method. The NICK method proposes a threshold computation derived from the basic Niblack's method, and the Rais method proposes an optimal window size for the local binarization.
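As an illustration, the sketch below computes the classic Niblack threshold T = m + k·s and the Sauvola threshold T = m·(1 + k·(s/R − 1)) from local means and standard deviations obtained with box filters. This sliding-window variant (Sauvola's original formulation uses non-overlapping blocks), the window size w, and the constants k and R are illustrative defaults, not the exact settings used in our experiments.

```python
# Sketch of Niblack's and Sauvola's local thresholds via box filters.
import numpy as np
from scipy.ndimage import uniform_filter

def local_stats(gray: np.ndarray, w: int = 25):
    """Local mean and standard deviation over a w x w sliding window."""
    g = gray.astype(np.float64)
    mean = uniform_filter(g, size=w)
    sq_mean = uniform_filter(g * g, size=w)
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))
    return mean, std

def niblack(gray, w=25, k=-0.2):
    mean, std = local_stats(gray, w)
    return (gray > mean + k * std).astype(np.uint8)      # T = m + k*s

def sauvola(gray, w=25, k=0.5, R=128.0):
    mean, std = local_stats(gray, w)
    thresh = mean * (1.0 + k * (std / R - 1.0))          # T = m*(1 + k*(s/R - 1))
    return (gray > thresh).astype(np.uint8)              # 1 = above threshold
```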

3.1.3. Training-Based Binarization

The top two methods proposed in the Binarization Challenge of the ICFHR 2016 Competition on the Analysis of Handwritten Text in Images of Balinese Palm Leaf Manuscripts are training-based binarization methods [25]. The best method in this competition employs a Fully Convolutional Network (FCN). It takes a color sub-image as input and outputs the probability that each pixel in the sub-image is part of the foreground. The FCN is pre-trained on normal handwritten document images with automatically generated "ground truth" binarizations (using the method of Wolf et al. [46]). The FCN is then fine-tuned using DIBCO and H-DIBCO competition images and their corresponding ground truth binarizations. Finally, the FCN is fine-tuned again on the provided Balinese palm leaf images. The pixel probabilities of foreground are efficiently predicted for the whole image at once and thresholded at 0.5 to create a binarized output image. The second-best method uses two neural network classifiers, C1 and C2, to classify each pixel as background or not, producing two binarized images, B1 and B2. C1 is a rough classifier that tries to detect all the foreground pixels, while probably making mistakes on some background pixels. C2 is an accurate classifier that should not classify a background pixel as a foreground pixel, but probably misses some foreground pixels. These two binary images are then combined to obtain the final classification result.
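For intuition, here is a minimal PyTorch sketch of an FCN-style binarizer in the spirit of the winning method: a small fully convolutional stack maps a color sub-image to per-pixel foreground probabilities, which are thresholded at 0.5. The layer sizes and the class name TinyFCN are our own illustrative choices, not the competition architecture.

```python
# Sketch of FCN-style per-pixel binarization (illustrative, untrained).
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1),             # 1x1 conv -> foreground logit
        )

    def forward(self, x):                    # x: (N, 3, H, W) in [0, 1]
        return torch.sigmoid(self.body(x))   # per-pixel probability map

def binarize(model: nn.Module, image: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W) float tensor; returns a uint8 binary map."""
    with torch.no_grad():
        prob = model(image.unsqueeze(0))[0, 0]
    return (prob > 0.5).to(torch.uint8)       # threshold at 0.5 as in [25]
```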

3.2. Text Line Segmentation

Text line segmentation is a crucial pre-processing step in most DIA pipelines. The task aims at extracting and separating text regions into individual lines. Most line segmentation approaches in the literature require that the input image be binarized. However, due to the degradation and noise often found in historical documents such as palm leaf manuscripts, the binarization task is not able to produce good enough results (see Section 5.1). In this paper, we investigate two line segmentation methods that are independent of the binarization task. These approaches work directly on color/grayscale images.

3.2.1. Seam Carving Method

Arvanitopoulos and Süsstrunk [47] proposed a binarization-free method based on a two-stage process: medial seam and separating seam computation. The approach computes medial seams by splitting the input page image into columns whose smoothed projection profiles are then calculated. The positions of the medial seams are obtained from the local maxima of the profiles. The goal of the second stage is to compute separating seams by applying an energy map within the area restricted by the medial seams of two neighboring lines found in the previous stage. The technique carves paths that traverse the image from left to right, accumulating energy. The path with the minimum cumulative energy is then chosen.
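The separating-seam stage can be illustrated with a compact dynamic-programming sketch: accumulate energy column by column from left to right, then backtrack the path of minimum cumulative energy. The gradient-magnitude energy and the helper names are illustrative assumptions, not the exact formulation of [47].

```python
# Sketch of a horizontal (left-to-right) minimum-energy seam.
import numpy as np

def gradient_energy(gray: np.ndarray) -> np.ndarray:
    gy, gx = np.gradient(gray.astype(np.float64))
    return np.abs(gx) + np.abs(gy)

def min_horizontal_seam(energy: np.ndarray) -> np.ndarray:
    h, w = energy.shape
    cost = energy.copy()
    # Accumulate: each pixel extends the cheapest of its 3 left neighbors.
    for x in range(1, w):
        left = cost[:, x - 1]
        up = np.r_[np.inf, left[:-1]]        # neighbor one row above
        down = np.r_[left[1:], np.inf]       # neighbor one row below
        cost[:, x] += np.minimum(left, np.minimum(up, down))
    # Backtrack from the cheapest pixel in the rightmost column.
    seam = np.empty(w, dtype=int)
    seam[-1] = int(np.argmin(cost[:, -1]))
    for x in range(w - 2, -1, -1):
        y = seam[x + 1]
        lo, hi = max(0, y - 1), min(h, y + 2)
        seam[x] = lo + int(np.argmin(cost[lo:hi, x]))
    return seam   # seam[x] = row index of the separating path at column x
```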

3.2.2. Adaptive Path Finding Method

This approach was proposed by Valy et al. [27]. The method takes as input a grayscale image of a document page. Connected components are extracted from the input image using the stroke width information, by applying the stroke width transform (SWT) on the Canny edge map. The set of extracted components (filtered to remove components that come from noise and artifacts) is used to create a stroke map. Using column-wise projection profiles on the output map, the estimated number and medial positions of the text lines can be defined. To adapt better to skew and fluctuation, an unsupervised learning technique called competitive learning is applied on the set of connected components found previously. Finally, a path finding technique is applied in order to create seam borders between adjacent lines by using a combination of two cost functions: one penalizing a path that goes through the foreground text (intensity difference cost function D) and one favoring a path that stays close to the estimated medial lines (vertical distance cost function V). Figure 4 illustrates an example of an optimal path.
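A hedged sketch of this idea is given below: a shortest path from the left to the right edge is scored by a weighted sum of an intensity term D (penalizing dark foreground pixels) and a vertical-distance term V (penalizing drift from the estimated border position). The concrete cost definitions, the weights alpha and beta, and the function names are our own assumptions; the exact formulation of Valy et al. [27] may differ.

```python
# Sketch of a combined-cost separating path via Dijkstra search.
import heapq
import numpy as np

def find_separating_path(gray, border_row, alpha=1.0, beta=0.1):
    h, w = gray.shape
    g = gray.astype(np.float64) / 255.0

    def cost(y, x):
        d = 1.0 - g[y, x]              # D: darker pixel -> higher cost
        v = abs(y - border_row)        # V: distance from estimated border
        return alpha * d + beta * v

    # Dijkstra over pixels; allowed moves: right, right-up, right-down.
    dist = np.full((h, w), np.inf)
    prev = {}
    pq = [(cost(y, 0), y, 0) for y in range(h)]
    for c, y, x in pq:
        dist[y, 0] = c
    heapq.heapify(pq)
    while pq:
        c, y, x = heapq.heappop(pq)
        if c > dist[y, x]:
            continue                   # stale queue entry
        if x == w - 1:
            break                      # reached the right edge
        for dy in (-1, 0, 1):
            ny, nx = y + dy, x + 1
            if 0 <= ny < h and c + cost(ny, nx) < dist[ny, nx]:
                dist[ny, nx] = c + cost(ny, nx)
                prev[(ny, nx)] = (y, x)
                heapq.heappush(pq, (dist[ny, nx], ny, nx))
    # Backtrack the cheapest path ending in the last column.
    node = (int(np.argmin(dist[:, -1])), w - 1)
    path = [node]
    while node in prev:
        node = prev[node]
        path.append(node)
    return path[::-1]                  # list of (row, col) along the seam
```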

Figure 4. An example of an optimal path going from start state S1 to goal state Sn.

3.3. Isolated Character/Glyph Recognition

In a DIA system, word or text recognition tasks are generally categorized into two different approaches: segmentation-based and segmentation-free methods. In segmentation-based methods, the isolated character recognition task is a very important process [9]. A proper feature extraction and a correct classifier selection can increase the recognition rate [48]. Although many methods for isolated character recognition have been developed and tested, especially for Latin-based scripts and alphabets, there is still a need for in-depth evaluation of those methods as applied to various other scripts. This includes the isolated character recognition task for many Southeast Asian scripts, and more specifically scripts that were written on ancient palm leaf manuscripts.

Previous studies on isolated character recognition in palm leaf manuscripts have already been reported, but only with the Balinese script as the benchmark dataset [28,29]. In the first work, an experimental study on feature extraction methods for character recognition of Balinese script was performed [28]. In the second work, a training-based method with a neural network and unsupervised feature learning was used to increase the recognition rate [29]. In this paper, we will conduct a broader evaluation of the robustness of the methods previously tested on Balinese script, using the other two palm leaf manuscript corpuses with Khmer and Sundanese scripts. In the next sub-sections, we provide a brief description of the methods. For a detailed description of each method, interested readers can refer to our previous works.

3.3.1. Handcrafted Feature Extraction Methods

Since the beginning of pattern recognition research, many feature extraction methods for character recognition have been presented in the literature. In our previous work [28], we investigated and evaluated the performance of 10 feature extraction methods with two classifiers, k-NN (k-Nearest Neighbor) and SVM (Support Vector Machine), in 29 different schemes for Balinese script on palm leaf manuscripts. After evaluating the performance of those individual feature extraction methods, we found that the Histogram of Gradient (HoG) features as directional gradient-based features [9,49] (Figure 5), the Neighborhood Pixels Weights (NPW) [50] (Figure 6), the Kirsch Directional Edges [50], and Zoning [12,32,50,51] (Figure 7) give very promising results. We then proposed a new feature extraction method applying NPW on Kirsch edge images (Figure 8), and concatenated the NPW–Kirsch features with two other features, HoG and Zoning, with k-NN as the classifier.

Figure 5. The representation of the array of cells in HoG [28].

(Figure 6 shows the neighborhood of a pixel P, in which the surrounding pixels are weighted 1, 2, or 3 according to their distance from P.)

Figure 6. Neighborhood pixels for NPW features [28].

Figure 7. Type of Zoning (from left to right: vertical, horizontal, block, diagonal, circular, and radial zoning) [28].

Figure 8. Scheme of NPW on Kirsch features [28].
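To make two of these handcrafted features concrete, the sketch below computes block-zoning densities and HoG descriptors for a glyph image, concatenates them, and feeds them to a k-NN classifier. The parameters (4 × 4 zones, 8 orientations, 8 × 8 HoG cells, k = 5) and the crude mean-based ink mask are illustrative assumptions, not the exact settings of [28].

```python
# Illustrative zoning + HoG feature extraction with a k-NN classifier.
import numpy as np
from skimage.feature import hog
from sklearn.neighbors import KNeighborsClassifier

def zoning_features(binary_char: np.ndarray, zones: int = 4) -> np.ndarray:
    """Foreground-pixel density in each cell of a zones x zones grid."""
    h, w = binary_char.shape
    feats = []
    for i in range(zones):
        for j in range(zones):
            cell = binary_char[i * h // zones:(i + 1) * h // zones,
                               j * w // zones:(j + 1) * w // zones]
            feats.append(cell.mean())
    return np.array(feats)

def char_features(gray_char: np.ndarray) -> np.ndarray:
    ink = (gray_char < gray_char.mean()).astype(np.float64)  # crude ink mask
    hog_feat = hog(gray_char, orientations=8,
                   pixels_per_cell=(8, 8), cells_per_block=(1, 1))
    return np.concatenate([zoning_features(ink), hog_feat])

# Usage sketch, assuming stacks of 48x48 grayscale glyph images:
# clf = KNeighborsClassifier(n_neighbors=5)
# clf.fit(np.stack([char_features(im) for im in train_images]), train_labels)
# pred = clf.predict(np.stack([char_features(im) for im in test_images]))
```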


3.3.2. Unsupervised Learning Feature and Neural Network

With the aim of improving the performance of our proposed feature extraction method, we continued our research on isolated character recognition by implementing a neural network as the classifier. In this second step [29], the same combination of feature extraction methods was used, and the result was sent as the input feature vector to a single-layer neural network character recognizer. In addition to using only the neural network, we also applied an additional sub-module for initial unsupervised learning based on K-Means clustering (Figure 9). This schema was inspired by the study of Coates et al. [52,53]. The unsupervised learning step calculates the initial learning weights for the neural network training phase from the cluster centers of all feature vectors.

(Figure 9 pipeline: training images pass through feature extraction and K-Means clustering, and the cluster centers are used as the initial weights, instead of random initial weights, for neural network training; test images pass through feature extraction and the final trained network to produce the recognition results.)

Figure 9. Schema of character recognizer with feature extraction method, unsupervised learning feature, and neural network [29].
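A hedged sketch of this K-Means initialization, in the spirit of Coates et al. [52,53], is shown below: the training feature vectors are clustered, and the (unit-normalized) cluster centers become the initial hidden-layer weights of the single-layer network before standard backpropagation training begins. The normalization and function name are illustrative choices, not the exact procedure of [29].

```python
# Sketch of K-Means-based weight initialization for a single-layer network.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_initial_weights(features: np.ndarray, n_hidden: int) -> np.ndarray:
    """features: (n_samples, n_dims); returns (n_dims, n_hidden) weights."""
    km = KMeans(n_clusters=n_hidden, n_init=10).fit(features)
    centers = km.cluster_centers_                     # (n_hidden, n_dims)
    norms = np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8
    return (centers / norms).T                        # unit-norm weight columns

# The returned matrix replaces the random initial weights of the hidden
# layer; training then proceeds with ordinary backpropagation.
```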

3.3.3. Convolutional Neural Network

Multilayer convolutional neural networks (CNN) have proven very effective in areas such as image recognition and classification. In this evaluation experiment, a vanilla CNN is used. The architecture of the CNN (Figure 10) is described as follows (this architecture has also been reported as the Khmer isolated character recognition baseline in [21]). The grayscale input images of isolated characters are rescaled to 48 × 48 pixels in size and normalized by applying histogram stretching. The network consists of three sets of convolution and max pooling pairs. All convolutional layers use a stride of one and are zero padded so that the output is the same size as the input. The output of each convolutional layer is activated using the ReLU function and followed by a max pooling over 2 × 2 blocks. The numbers of feature maps (of size 5 × 5) used in the three consecutive convolutional layers are 8, 16, and 32, respectively. The output of the last layer is flattened, and a fully-connected layer with 1024 neurons (also activated with ReLU) is added, followed by the last output layer (softmax activation) consisting of N_class neurons, where N_class is the number of character classes. Dropout with probability p = 0.5 is applied before the output layer to prevent overfitting. We trained the network using an Adam optimizer with a batch size of 100 and a learning rate of 0.0001.

Figure 10. Architecture of the CNN.
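The description above maps directly to the following PyTorch sketch: three 5 × 5 convolution plus 2 × 2 max-pooling pairs with 8, 16, and 32 feature maps, a 1024-unit fully-connected layer, dropout of 0.5, and an N_class-unit output, trained with Adam (learning rate 0.0001). As is idiomatic in PyTorch, the softmax is folded into the cross-entropy loss rather than written as a layer; the class name and this framework choice are ours, not necessarily those of the original implementation.

```python
# PyTorch rendering of the CNN architecture described in the text.
import torch
import torch.nn as nn

class GlyphCNN(nn.Module):
    def __init__(self, n_class: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(8, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
            nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2), # 12 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 6 * 6, 1024), nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(1024, n_class),    # softmax is applied inside the loss
        )

    def forward(self, x):                # x: (N, 1, 48, 48) grayscale glyphs
        return self.classifier(self.features(x))

model = GlyphCNN(n_class=100)            # 100 classes as an example
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()          # combines log-softmax and NLL
```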


3.4. Word Recognition and Transliteration

In order to make the palm leaf manuscripts more accessible, readable, and understandable to a wider audience, an optical character recognition (OCR) system should be developed. In many DIA systems, word or text recognition is the final task in the processing pipeline. However, in Southeast Asian scripts, the speech sound of a syllable often changes according to certain phonological rules. In this case, an OCR system is not enough. Therefore, a transliteration system should also be developed to help transliterate the ancient scripts in these manuscripts. Transliteration is defined as the process of obtaining the phonetic translation of names across languages [54]; it involves rendering a language from one writing system to another. In [54], the problem is stated formally as a sequence labeling problem from one language alphabet to another. Transliteration will help us to index and to quickly and efficiently access the content of the manuscripts. In our previous work [29], a complete scheme for segmentation-based glyph recognition and transliteration specific to Balinese palm leaf manuscripts was proposed. In this work, a segmentation-free method is evaluated to recognize and transliterate the words from the three different scripts of the palm leaf manuscripts.

RNN/LSTM-Based Methods

Over the last decade, sequence-analysis-based methods using the Recurrent Neural Network-Long Short-Term Memory (RNN-LSTM) type of learning network have become very popular among researchers in text recognition. An RNN-LSTM-based method, together with Connectionist Temporal Classification (CTC), works as a segmentation-free learning-based method to recognize the sequence of characters in a word or text without any handcrafted feature extraction method. The raw image pixels can be sent directly as the input to the learning network, and there is no requirement to segment the training data sequence. An RNN is basically an extended version of the basic feedforward neural network: in an RNN, the neurons in the hidden layer are connected to each other. RNNs offer very good context-aware processing to recognize patterns in a sequence or time series. One drawback of RNNs is the vanishing gradient problem. To deal with this problem, the LSTM architecture was introduced. The LSTM network adds multiplicative gates and additive feedback. Bidirectional LSTM is an LSTM architecture with two-directional (forward and backward) context processing. The LSTM architecture has been widely evaluated as a generic and language-independent text recognizer [55]. In this work, the OCRopy (https://github.com/tmbdev/ocropy) [56] framework is used to test and evaluate the word recognition and transliteration tasks for the palm leaf manuscript collection. OCRopy provides the functional library of an OCR system using the RNN-LSTM architecture (http://graal.hypotheses.org/786) [57,58]. We evaluated the dataset with the unidirectional LSTM and the bidirectional LSTM (BLSTM) architectures.
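As a rough illustration of this segmentation-free setup, the sketch below builds a BLSTM over the pixel columns of a word image and pairs it with the CTC loss. This is our own simplification in Python/TensorFlow, not the OCRopy implementation; the sequence depth and neuron size here are the parameters varied in the experiments reported later (Section 5.4), and build_blstm is our name.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_blstm(n_symbols, seq_depth=100, neurons=200):
        # each time step is one pixel column of the (rescaled) word image
        inputs = tf.keras.Input(shape=(None, seq_depth))
        x = layers.Bidirectional(layers.LSTM(neurons, return_sequences=True))(inputs)
        # one extra output class for the CTC "blank" symbol
        logits = layers.Dense(n_symbols + 1)(x)
        return tf.keras.Model(inputs, logits)

    # CTC aligns the unsegmented column sequence with the target transliteration,
    # so no glyph-level segmentation of the training data is required:
    # loss = tf.nn.ctc_loss(labels, logits, label_length, logit_length,
    #                       logits_time_major=False, blank_index=n_symbols)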

4. Experiments: Datasets and Evaluation Methods

From the three manuscript corpora (Khmer, Balinese, and Sundanese), the datasets for each DIA task were extracted and used in the experimental work for this research.

4.1. Binarization

4.1.1. Datasets

The palm leaf manuscript datasets for the binarization task are presented in Table 1. For Khmer manuscripts, one ground truth binarized image is provided for each image, but for Balinese and Sundanese manuscripts, each image has two different ground truth binarized images [17,25]. The study of ground truth variability and subjectivity was reported in previous work [24]. In this research, we only used the first binarized ground truth image for evaluation. The binarized ground truth images for Khmer manuscripts were generated manually with the help of photo editing software (Figure 11): a pressure-sensitive tip stylus is used to trace each text stroke while keeping the original stroke width [59]. For the manuscripts from Bali, the binarized ground truth images have been created with a semi-automatic scheme [17,23–25] (Figure 12). The binarized ground truth images for Sundanese manuscripts were manually generated using PixLabeler [60] [22] (Figure 13). A training set is provided only for the Balinese dataset. We used all images of the Khmer and Sundanese corpora as a test set because the training-based binarization method (ICFHR G1 method, see Section 5.1) was evaluated for the Khmer and Sundanese datasets by using only the model trained on the Balinese training set.

Table 1. Palm leaf manuscript datasets for binarization task.

Manuscripts   Train      Test       Ground Truth    Dataset
Balinese      50 pages   50 pages   2 × 100 pages   Extracted from AMADI_LontarSet [17,25,40]
Khmer         -          46 pages   1 × 46 pages    Extracted from EFEO [20,59]
Sundanese     -          61 pages   2 × 61 pages    Extracted from Sunda Dataset ICDAR2017 [22]

Figure 11. Khmer manuscript with binarized ground truth image.

Figure 12. Balinese manuscript with binarized ground truth image.

Figure 13. Sundanese manuscript with binarized ground truth image.

4.1.2. Evaluation Method

Following our previous work [24] and the evaluation method from the ICFHR competition [25], three binarization evaluation metrics that were used in the DIBCO 2009 contest [37] are used in the binarization task evaluation for this work. These three metrics are F-Measure (FM) (Equation (3)), Peak SNR (PSNR) (Equation (5)), and Negative Rate Metric (NRM) (Equation (8)).

F-Measure (FM): FM is defined from Recall and Precision.

Recall = TP / (FN + TP) × 100    (1)

Precision = TP / (FP + TP) × 100    (2)

TP, defined as true positive, occurs when the image pixel is labeled as foreground and the ground truth pixel is also labeled as foreground. FP, defined as false positive, occurs when the image pixel is labeled as foreground but the ground truth pixel is labeled as background. FN, defined as false negative, occurs when the image pixel is labeled as background but the ground truth pixel is labeled as foreground (Equations (1) and (2)).

FM = (2 × Recall × Precision) / (Recall + Precision)    (3)

A higher F-Measure indicates a better match.

Peak SNR (PSNR): PSNR is calculated from the Mean Square Error (MSE) (Equation (4)).

MSE = (1 / (M × N)) × Σ_{x=1..M} Σ_{y=1..N} (I1(x, y) − I2(x, y))²    (4)

PSNR = 10 × log10(C² / MSE),    (5)

where C is defined as 1, the difference between foreground and background colors in the case of a binary image. A higher PSNR indicates a better match.

Negative Rate Metric (NRM): NRM is defined from the negative rate of false negative (NRFN) (Equation (6)) and the negative rate of false positive (NRFP) (Equation (7)):

NRFN = FN / (FN + TP)    (6)

NRFP = FP / (FP + TN)    (7)

TN, defined as true negative, occurs when both the image pixel and the ground truth pixel are labeled as background. The definitions of TP, FN, and FP are the same as the ones given for the F-Measure.

NRM = (NRFN + NRFP) / 2    (8)

A lower NRM indicates a better match.
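For clarity, Equations (1)–(8) can all be computed directly from the confusion counts of the two binary images. The following is a minimal sketch in Python/NumPy, assuming foreground pixels are marked with 1; it is our own illustration, not the official DIBCO evaluation tool.

    import numpy as np

    def binarization_metrics(result, gt):
        result, gt = result.astype(bool), gt.astype(bool)
        tp = np.sum(result & gt)        # foreground in both images
        fp = np.sum(result & ~gt)       # foreground only in the result
        fn = np.sum(~result & gt)       # foreground only in the ground truth
        tn = np.sum(~result & ~gt)      # background in both images
        recall = 100.0 * tp / (fn + tp)                     # Equation (1)
        precision = 100.0 * tp / (fp + tp)                  # Equation (2)
        fm = 2 * recall * precision / (recall + precision)  # Equation (3)
        mse = np.mean((result.astype(float) - gt.astype(float)) ** 2)  # Equation (4)
        psnr = 10 * np.log10(1.0 / mse)                     # Equation (5), C = 1
        nrm = (fn / (fn + tp) + fp / (fp + tn)) / 2         # Equations (6)-(8)
        return fm, psnr, nrm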

4.2. Text Line Segmentation

4.2.1. Datasets

The palm leaf manuscript datasets for the text line segmentation task are presented in Table 2. The text line segmentation ground truth data for the Balinese and Sundanese manuscripts have been generated by hand based on the binarized ground truth images [17]. For Khmer 1, a semi-automatic scheme is used [26,59]: a set of medial points for each text line is generated automatically on the binarization ground truth of the page image, and those points can then be moved up or down with a tool to fit the skew and fluctuation of the real text lines. We also note touching components spreading over multiple lines and the locations where they can be separated. For Khmer 2 and 3, each annotated character is associated with the ID of the line it belongs to. The region of a text line is the union of the areas of the polygon boundaries of all annotated characters composing it [21,27].

Table 2. Palm leaf manuscript datasets for text line segmentation task.

Manuscripts             Pages       Text Lines       Dataset
Balinese 1              35 pages    140 text lines   Extracted from AMADI_LontarSet [17,26,40]
Balinese 2 (Bali-2.1)   47 pages    181 text lines   Extracted from AMADI_LontarSet [17]
Balinese 2 (Bali-2.2)   49 pages    182 text lines   Extracted from AMADI_LontarSet [17]
Khmer 1                 43 pages    191 text lines   Extracted from EFEO [20,26,59]
Khmer 2                 100 pages   476 text lines   Extracted from SleukRith Set [21,27]
Khmer 3                 200 pages   971 text lines   Extracted from SleukRith Set [21]
Sundanese 1             12 pages    46 text lines    Extracted from Sunda Dataset [26]
Sundanese 2             61 pages    242 text lines   Extracted from Sunda Dataset [22]

4.2.2. Evaluation Method

Following our previous work [26], we use the evaluation criteria and tool provided by the ICDAR2013 Handwriting Segmentation Contest [61]. First, the one-to-one (o2o) match score is computed for a region pair based on the evaluator's acceptance threshold. In our experiments, we used 90% as the acceptance threshold. Let N be the count of ground truth elements and M the count of result elements. From the o2o score, three metrics are calculated: the detection rate (DR = o2o/N), the recognition accuracy (RA = o2o/M), and the performance metric (FM), the harmonic mean of DR and RA, as sketched below.
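A minimal sketch of these three scores (our own restatement of the contest definitions, not the official evaluation tool):

    def segmentation_scores(o2o, n_gt, m_res):
        dr = 100.0 * o2o / n_gt        # detection rate
        ra = 100.0 * o2o / m_res       # recognition accuracy
        fm = 2 * dr * ra / (dr + ra)   # performance metric (harmonic mean)
        return dr, ra, fm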

4.3. Isolated Character/Glyph Recognition

4.3.1. Datasets

The palm leaf manuscript datasets for the isolated character/glyph recognition task are presented in Table 3. For the Balinese character dataset, Balinese philologists manually annotated the segments of connected components that represent a correct character in Balinese script from the word-level binarized images that were manually annotated [11,17,20] using Aletheia (http://www.primaresearch.org/tools/Aletheia) [62,63] (Figure 14). The Sundanese character dataset was annotated manually [22] (Figure 15). For the Khmer character dataset, a tool has been developed to annotate characters/glyphs on the document page. The polygon boundary of each character is traced manually by dotting out its vertices one by one. A label is given to each annotated character after its boundary has been constructed [21] (Figure 16).

Table 3. Palm leaf manuscript datasets for isolated character/glyph recognition task.

Manuscripts   Classes       Train            Test            Dataset
Balinese      133 classes   11,710 images    7673 images     AMADI_LontarSet [17,25,28]
Khmer         111 classes   113,206 images   90,669 images   SleukRith Set [21]
Sundanese     60 classes    4555 images      2816 images     Sunda Dataset [22]

Figure 14. Balinese character dataset.


Figure 15. Sundanese character dataset.

Figure 16. Khmer character dataset.

4.3.2. Evaluation Method

Following the evaluation method from the ICFHR competition [25], the recognition rate, i.e., the percentage of correctly classified samples over the test samples (C/N), is calculated, where C is the number of correctly recognized samples and N is the total number of test samples.

4.4. Word Recognition and Transliteration

4.4.1. Datasets

The palm leaf manuscript datasets for the word recognition and transliteration task are presented in Table 4. For the Khmer dataset, all characters on the page have been annotated and grouped together into words (Figure 17). More than one label may be given to the created word. The order in which each character of the word is selected is also kept [21]. The Balinese (Figure 18) and Sundanese (Figure 19) word datasets were manually annotated using Aletheia [63].

Table 4. Palm leaf manuscript datasets for word recognition and transliteration tasks.

Manuscripts   Train                               Test                              Text              Dataset
Balinese      15,022 images from 130 pages        10,475 images from 100 pages      Latin             AMADI_LontarSet [17,25]
Khmer         16,333 images (part of 657 pages)   7791 images (part of 657 pages)   Latin and Khmer   SleukRith Set [21]
Sundanese     1427 images from 20 pages           318 images from 10 pages          Latin             Sunda Dataset [22]

Figure 17. Khmer word dataset.

Figure 18. Balinese word dataset.

Figure 19. Sundanese word dataset.

4.4.2. Evaluation Method

The error rate is defined by the edit distance between the ground truth and the recognizer output and is computed using the provided OCRopy function ocropus-errs (https://github.com/tmbdev/ocropy/blob/master/ocropus-errs) [56].

5. Experimental Results and Discussion

In this section, the performance of each method for the DIA tasks on the palm leaf manuscript collections is presented.

5.1. Binarization

The experimental results for the binarization task are presented in Table 5. These results show that the performance of all methods on each dataset is still quite low. Most of the methods achieve less than a 50% FM score. This means that palm leaf manuscripts are still an open challenge for the binarization task. The different parameter values for the local adaptive binarization methods show significant differences in performance, but still give unsatisfactory results. In these experiments, the ICFHR G1 method was evaluated for the Khmer and Sundanese datasets using the model trained on the Balinese training set. Based on these experiments, Niblack's method gives the highest FM score for Sundanese manuscripts (Figure 20), the ICFHR G1 method gives the highest FM score for Khmer manuscripts (Figure 21), and ICFHR G2 gives the highest FM score for Balinese manuscripts (Figure 22). However, visually, many broken and unrecognizable characters/glyphs, as well as noise, are still present in the binarized images.

Figure 20. Binarization of Sundanese manuscript with Niblack's method.
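As a point of reference, Niblack's rule thresholds each pixel against the local mean and standard deviation, T = m + k × s, computed over a sliding window. The following is a minimal sketch with the parameters used in Table 5 (window = 50, k = −0.2), assuming SciPy for the local filtering; it is our own illustration, not the evaluated implementation.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def niblack_binarize(img, window=50, k=-0.2):
        img = img.astype(float)
        mean = uniform_filter(img, window)                   # local mean m
        sq_mean = uniform_filter(img ** 2, window)
        std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))  # local std s
        return (img < mean + k * std).astype(np.uint8)       # 1 = ink (dark pixel)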

Figure 21. Binarization of Khmer manuscript with ICFHR G1 method.


Table 5. Experimental results for binarization task in F-Measure (FM), Peak SNR (PSNR), and Negative Rate Metric (NRM). A higher F-measure and PSNR, and a lower NRM, indicate a better result.

Methods               Parameter                        Manuscripts   FM (%)     NRM        PSNR
OtsuGray [34,41]      Otsu from gray image,            Balinese      18.98178   0.398894   5.019868
                      using Matlab graythresh [43]     Khmer         23.92159   0.313062   7.387765
                                                       Sundanese     23.70566   0.326681   9.998433
OtsuRed [34,41]       Otsu from red image channel,     Balinese      29.20352   0.300145   10.94973
                      using Matlab graythresh          Khmer         21.15379   0.337171   5.907433
                                                       Sundanese     21.25153   0.38641    12.60233
Sauvola               window = 50, k = 0.5, R = 128    Balinese      13.20997   0.462312   27.69732
[34,36,41,42,44,45]                                    Khmer         44.73579   0.268527   26.06089
                                                       Sundanese     6.190919   0.479984   24.78595
Sauvola2              window = 50, k = 0.2, R = 128    Balinese      40.18596   0.274551   25.0988
[34,36,41,42,44,45]                                    Khmer         47.55924   0.155722   21.96846
                                                       Sundanese     43.04994   0.299694   23.65228
Sauvola3              window = 50, k = 0.0, R = 128    Balinese      35.38635   0.165839   17.05408
[34,36,41,42,44,45]                                    Khmer         30.5562    0.190081   12.78953
                                                       Sundanese     40.29642   0.181465   16.25056
Niblack               window = 50, k = −0.2            Balinese      41.55696   0.175795   21.24452
[34,36,41,42,44]                                       Khmer         38.01222   0.160807   16.84153
                                                       Sundanese     46.79678   0.195015   20.31759
Niblack2              window = 50, k = 0.0             Balinese      35.38635   0.165839   17.05408
[34,36,41,42,44]                                       Khmer         30.5562    0.190081   12.78953
                                                       Sundanese     40.29642   0.181465   16.25056
NICK [44]             window = 50, k = −0.2            Balinese      37.85919   0.328327   27.59038
                                                       Khmer         51.2578    0.176003   24.51998
                                                       Sundanese     29.5918    0.390431   24.26187
Rais [34]             window = 50                      Balinese      34.46977   0.171096   16.84049
                                                       Khmer         31.59138   0.187948   13.52816
                                                       Sundanese     40.65458   0.177016   16.35472
Wolf [42,44]          window = 50, k = 0.5             Balinese      27.94817   0.392937   27.1625
                                                       Khmer         46.78589   0.23739    25.1946
                                                       Sundanese     42.40799   0.299157   23.61075
Howe1 [39]            Default values [39]              Balinese      44.70123   0.267627   28.35427
                                                       Khmer         40.20485   0.280604   25.59887
                                                       Sundanese     45.90779   0.235175   21.90439
Howe2 [39]            Default values                   Balinese      40.5555    0.273994   28.02874
                                                       Khmer         32.35603   0.294016   25.96965
                                                       Sundanese     35.35973   0.274865   22.36583
Howe3 [39]            Default values                   Balinese      42.15377   0.304962   28.38466
                                                       Khmer         30.7186    0.382087   26.36983
                                                       Sundanese     25.77321   0.350349   23.66912
Howe4 [39]            Default values                   Balinese      45.73681   0.273018   28.60561
                                                       Khmer         36.48396   0.280519   25.83969
                                                       Sundanese     38.98445   0.281118   22.83914
ICFHR G1              See ref. [25]                    Balinese      63.32      0.15       31.37
                                                       Khmer         52.65608   0.250503   28.16886
                                                       Sundanese     38.95626   0.329042   24.15279
ICFHR G2              See ref. [25]                    Balinese      68.76      0.13       33.39
                                                       Khmer         -          -          -
                                                       Sundanese     -          -          -
ICFHR G3              See ref. [25]                    Balinese      52.20      0.18       26.92
                                                       Khmer         -          -          -
                                                       Sundanese     -          -          -
ICFHR G4              See ref. [25]                    Balinese      58.57      0.17       29.98
                                                       Khmer         -          -          -
                                                       Sundanese     -          -          -


Figure 22. Binarization of Balinese manuscript with ICFHR G2 method.

5.2. Text Line Segmentation

The experimental results for the text line segmentation task are presented in Table 6. According to these results, both methods perform sufficiently well for most datasets, except Khmer 1 (Figures 23–25). This is because all images in this set are of low quality, as they were digitized from microfilms. Nevertheless, the adaptive path finding method achieves better results than the seam carving method on all datasets of palm leaf manuscripts in our experiment. The main difference between the two approaches is that instead of finding an optimal separating path within an area constrained by the medial seam locations of two adjacent lines (as in the seam carving method), the adaptive path finding approach tries to find a path close to an estimated straight seam line section. These line sections already represent the seam borders between two neighboring lines well, so they can be considered a better guide for finding good paths, hence producing better results.

One common error that we encounter for both methods is in the medial position computation stage. Detecting correct medial positions of text lines is crucial for the path-finding stage of both methods. In our experiment, we noticed that some parameters play an important role: for instance, the number of columns/slices r of the seam carving method and the high and low thresholding values of the edge detection algorithm in the adaptive path finding approach. In order to select these parameters, a validation set consisting of five random pages is used. The optimal parameter values are then empirically selected based on the results from this validation set.
Experimental a validation results set for consisting text line segmentation of five random task: the pages countis of used.ground The truth optimal elements values(N), of the parametersand are the then count empirically of result elements selected (M), the based one-to on-one the (o2o) results match from score thisis computed validation for a set.region pair based on 90% acceptance threshold, detection rate (DR), recognition accuracy (RA), and performance Tablemetric 6. Experimental (FM). results for text line segmentation task: the count of ground truth elements (N), and the count of result elements (M), the one-to-one (o2o) match score is computed for a region pair Methods Manuscripts N M o2o DR (%) RA (%) FM (%) based on 90% acceptance threshold, detection rate (DR), recognition accuracy (RA), and performance Balinese 1 140 167 128 91.42 76.64 83.38 metric (FM). Bali-2.1 181 210 163 90.05 77.61 83.37 Bali-2.2 182 219 161 88.46 73.51 80.29 Methods Manuscripts N M o2o DR (%) RA (%) FM (%) Khmer 1 191 145 57 29.84 39.31 33.92 Seam carving [47] Balinese 1Khmer 140 2 476 167 665 128356 53.53 91.42 74.79 76.64 62.40 83.38 Bali-2.1 181 210 163 90.05 77.61 83.37 Bali-2.2Khmer 182 3 971 219 1046 161845 87.02 88.46 80.78 73.51 83.78 80.29 Khmer 1 191 145 57 29.84 39.31 33.92 Seam carving [47] Sundanese 1 46 43 36 78.26 83.72 80.89 Khmer 2Sundanese 476 2 242 665 257 356218 90.08 53.53 84.82 74.79 87.37 62.40 Khmer 3 971 1046 845 87.02 80.78 83.78 Sundanese Balinese 1 140 143 132 94.28 92.30 93.28 46 43 36 78.26 83.72 80.89 1 Bali-2.1 181 188 159 87.84 84.57 86.17 Sundanese Bali242-2.2 182 257 191 218164 90.10 90.08 85.86 84.82 87.93 87.37 2 Khmer 1 191 169 118 61.78 69.82 65.55 Adaptive Path Finding [27] Balinese 1Khmer 140 2 476 143 484 132446 92.15 94.28 93.70 92.30 92.92 93.28 Bali-2.1 181 188 159 87.84 84.57 86.17 Bali-2.2Khmer 182 3 971 191 990 164910 93.71 90.10 91.91 85.86 92.80 87.93 Khmer 1 191 169 118 61.78 69.82 65.55 Adaptive Path Finding [27] Sundanese 1 46 50 41 89.13 82.00 85.41 Khmer 2Sundanese 476 2 242 484 253 446222 91.73 92.15 87.74 93.70 89.69 92.92 Khmer 3 971 990 910 93.71 91.91 92.80 Sundanese 46 50 41 89.13 82.00 85.41 1 Sundanese 242 253 222 91.73 87.74 89.69 2 J. Imaging 2018, 4, 43 21 of 27 J. Imaging 2017, 4, x FOR PEER REVIEW 20 of 26 J. Imaging 2017, 4, x FOR PEER REVIEW 20 of 26 J. Imaging 2017, 4, x FOR PEER REVIEW 20 of 26

Figure 23. Text line segmentation of Balinese manuscript with the Seam Carving method (green) and Adaptive Path Finding (red).

Figure 24. Text line segmentation of Khmer manuscript with the Seam Carving method (green) and Adaptive Path Finding (red).

Figure 25. Text line segmentation of Sundanese manuscript with the Seam Carving method (green) and Adaptive Path Finding (red).

5.3. Isolated Character/Glyph Recognition

The experimental results for the isolated character/glyph recognition task are presented in Table 7. For the handcrafted feature with k-NN, the Khmer set, with 113,206 train images and 90,669 test images, would need a considerable amount of time for one-to-one k-NN comparison, so we do not think it is reasonable to use it. For CNN 1, previous work only reported results for the Balinese set. For all ICFHR competition methods, the competition was proposed only for the Balinese set, so we only have the reported results for the Balinese set. According to these results, the handcrafted feature extraction combination of HoG-NPW-Kirsch-Zoning is a proper choice, resulting in a good recognition rate for Balinese and Khmer characters/glyphs. The CNN methods also show satisfactory results, but the differences in recognition rates compared with the handcrafted feature combinations are not too significant. The unbalanced number of image samples for each character class means the CNN method did not perform optimally. For the Sundanese dataset, the handcrafted feature with NN slightly outperformed the CNN method. The UFL method slightly increased the recognition rate of the pure NN method for the Khmer and Balinese datasets.

Table 7. Experimental results for isolated character/glyph recognition tasks (in % recognition rate).

Methods                                                          Balinese   Khmer   Sundanese
Handcrafted Feature (HoG-NPW-Kirsch-Zoning) with k-NN [28]       85.16      -       72.91
Handcrafted Feature (HoG-NPW-Kirsch-Zoning) with NN [29]         85.51      92.15   79.69
Handcrafted Feature (HoG-NPW-Kirsch-Zoning) with UFL + NN [29]   85.63      92.44   79.33
CNN 1 [28]                                                       84.31      -       -
CNN 2                                                            85.39      93.96   79.05
ICFHR G1: VCMF [25]                                              87.44      -       -
ICFHR G1: VMQDF [25]                                             88.39      -       -
ICFHR G3 [25]                                                    77.83      -       -
ICFHR G5 [25]                                                    77.70      -       -

5.4. Word Recognition and Transliteration

The experimental results for the word recognition and transliteration task are presented in Table 8. The error rates on the word recognition and transliteration test sets at each training model iteration are shown in Figures 26–28. The LSTM-based architecture of OCRopy seems very promising in terms of recognizing and directly transliterating Balinese words. For the Khmer and Sundanese datasets, the LSTM architecture seems to struggle to learn the training data. More synthetic training data with more frequent words should be generated in order to support the training process. For the Balinese dataset, a sequence depth of 100 pixels with a neuron size of 200 gives a better result for both the LSTM and BLSTM architectures. Most of the Southeast Asian scripts are syllabic scripts. One character/glyph in these scripts represents a syllable, which corresponds to a sequence of letters in Latin script. In this case, word transliteration is not just word recognition with a one-to-one glyph-to-letter association. This makes word transliteration more challenging than character/glyph recognition.

Table 8. Experimental results for word recognition and transliteration tasks (in % error rate for test).

Methods (with OCRopy [56] Framework)       Balinese   Khmer               Sundanese
BLSTM 1 (seq_depth 60, neuron size 100)    43.13      Latin text: 73.76   75.52
                                                      Khmer text: 77.88
LSTM 1 (seq_depth 100, neuron size 100)    42.88      -                   -
BLSTM 2 (seq_depth 100, neuron size 200)   40.54      -                   -
LSTM 2 (seq_depth 100, neuron size 200)    39.70      -                   -
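These error rates are the edit-distance measure described in Section 4.4.2, normalized by the length of the ground truth text. A minimal Python equivalent is sketched below (our own illustration; the ocropus-errs script is the tool actually used).

    def edit_distance(ref, hyp):
        # single-row Levenshtein distance between two transliteration strings
        d = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            prev, d[0] = d[0], i
            for j, h in enumerate(hyp, 1):
                cur = d[j]
                d[j] = min(d[j] + 1,            # deletion
                           d[j - 1] + 1,        # insertion
                           prev + (r != h))     # substitution (free if equal)
                prev = cur
        return d[-1]

    def error_rate(ref, hyp):
        return 100.0 * edit_distance(ref, hyp) / max(len(ref), 1)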

Figure 26. Error rate for Balinese word recognition and transliteration test set.

Figure 27. Error rate for Khmer word recognition and transliteration test set.

Figure 28. Error rate for Sundanese word recognition and transliteration test set.

6. Conclusions and Future Work

A comprehensive experimental test of the principal tasks in a DIA system, starting with binarization, text line segmentation, and isolated character/glyph recognition, and continuing on to word recognition and transliteration for a new collection of palm leaf manuscripts from Southeast Asia, has been presented. The results from all experiments provide the latest findings and a quantitative benchmark of palm leaf manuscript analysis for researchers in the DIA community. Binarizing the palm leaf manuscript images seems very challenging. With many broken and unrecognizable characters/glyphs and much noise detected in the images, binarization should be reconsidered as the first step in the DIA process for palm leaf manuscripts. On the other hand, although there are already training-based DIA methods that do not require this binarization process, they usually require adequate training data. The problem of inadequate training data also influences glyph recognition and word transliteration. The unbalanced number of image samples for each character class means the CNN methods did not perform optimally in glyph recognition. The differences in the recognition rates of the CNN methods compared with the handcrafted feature combinations are not too significant. For future work, more synthetic training data for palm leaf manuscript images should be generated in order to support the training process. Especially for the word transliteration task, more synthetic training data with more frequent words should be generated in order to improve the training process. Many examples of glyph-to-syllable association should be synthetically generated to transliterate the syllabic scripts of Southeast Asia. The special characteristics and challenges posed by the palm leaf manuscript collections will require a thorough adaptation of the DIA system. Some specific adjustments need to be applied to the DIA methods developed for other types of documents. The adaptation of a DIA system for palm leaf manuscripts is not unique and is not universal for all types of problems from different collections. However, among the DIA system's non-unique solutions, one specific solution can still be designed to deliver the most optimal DIA system performance while taking into account the conditions of each collection.

Acknowledgments: The authors would like to thank Museum Gedong Kertya, Museum Bali, Mr. Undang Ahmad Darsa, the philologists from the Sundanese Centre Studies of Universitas Padjadjaran, the Situs Kabuyutan Ciburuy Garut, all families in Bali, Indonesia, the EFEO team, the Buddhist Institute, and the National Library in Cambodia for providing us with samples of palm leaf manuscripts. We also thank the students from the Department of Informatics Education and the Department of Balinese Literature, University of Pendidikan Ganesha, the Institute of Technology of Cambodia, and the National Institute of Post, Telecommunication and ICT for helping us with the ground truthing process for this research project. This work is supported by the DIKTI BPPLN Indonesian Scholarship Program, the STIC Asia Program implemented by the French Ministry of Foreign Affairs and International Development (MAEDI), ARES-CCD (program AI 2014-2019) under the funding of Belgian university cooperation, and DRPMI Universitas Padjadjaran, DIKTI International Collaboration and Publication grant 2017.

Author Contributions: The Balinese dataset was prepared by Made Windu Antara Kesiman. The Khmer dataset was prepared by Dona Valy and Sophea Chhun. The Sundanese dataset was prepared by Erick Paulus, Mira Suryani, and Setiawan Hadi. Jean-Christophe Burie, Michel Verleysen, and Jean-Marc Ogier contributed to designing a ground truth validation protocol. Made Windu Antara Kesiman and Dona Valy conceived, designed, and performed the experiments. Made Windu Antara Kesiman, Dona Valy, and Jean-Christophe Burie contributed to paper writing and editing.

Conflicts of Interest: The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. tranScriptorium. Available online: http://transcriptorium.eu/ (accessed on 20 February 2018).
2. READ Project—Recognition and Enrichment of Archival Documents. Available online: https://read.transkribus.eu/ (accessed on 20 February 2018).
3. IAM Historical Document Database (IAM-HistDB)—Computer Vision and Artificial Intelligence. Available online: http://www.fki.inf.unibe.ch/databases/iam-historical-document-database (accessed on 20 February 2018).
4. Ancient Lives: Archive. Available online: https://www.ancientlives.org/ (accessed on 20 February 2018).
5. Document Image Analysis—CVISION Technologies. Available online: http://www.cvisiontech.com/library/pdf/pdf-document/document-image-analysis.html (accessed on 20 February 2018).
6. Ramteke, R.J. Invariant Moments Based Feature Extraction for Handwritten Devanagari Vowels Recognition. Int. J. Comput. Appl. 2010, 1, 1–5. [CrossRef]
7. Siddharth, K.S.; Dhir, R.; Rani, R. Handwritten Gurmukhi Numeral Recognition using Different Feature Sets. Int. J. Comput. Appl. 2011, 28, 20–24. [CrossRef]
8. Sharma, D.; Jhajj, P. Recognition of Isolated Handwritten Characters in Gurmukhi Script. Int. J. Comput. Appl. 2010, 4, 9–17. [CrossRef]
9. Aggarwal, A.; Singh, K.; Singh, K. Use of Gradient Technique for Extracting Features from Handwritten Gurmukhi Characters and Numerals. Procedia Comput. Sci. 2015, 46, 1716–1723. [CrossRef]
10. Lehal, G.S.; Singh, C.A. Gurmukhi script recognition system. In Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, Spain, 3–7 September 2000; pp. 557–560.
11. Rothacker, L.; Fink, G.A.; Banerjee, P.; Bhattacharya, U.; Chaudhuri, B.B. Bag-of-features HMMs for segmentation-free Bangla word spotting. In Proceedings of the 4th International Workshop on Multilingual OCR, Washington, DC, USA, 24 August 2013; p. 5.
12. Ashlin Deepa, R.N.; Rao, R.R. Feature Extraction Techniques for Recognition of Malayalam Handwritten Characters: Review. Int. J. Adv. Trends Comput. Sci. Eng. 2014, 3, 481–485.
13. Kasturi, R.; O'Gorman, L.; Govindaraju, V. Document image analysis: A primer. Sadhana 2002, 27, 3–22. [CrossRef]
14. Paper History. Case Paper. Available online: http://www.casepaper.com/company/paper-history/ (accessed on 20 February 2018).
15. Doermann, D. Handbook of Document Image Processing and Recognition; Tombre, K., Ed.; Springer: London, UK, 2014; p. 1055.
16. Chamchong, R.; Fung, C.C.; Wong, K.W. Comparing Binarisation Techniques for the Processing of Ancient Manuscripts; Nakatsu, R., Tosa, N., Naghdy, F., Wong, K.W., Codognet, P., Eds.; Springer: Berlin, Germany, 2010; pp. 55–64.
17. Kesiman, M.W.A.; Burie, J.-C.; Ogier, J.-M.; Wibawantara, G.N.M.A.; Sunarya, I.M.G. AMADI_LontarSet: The First Handwritten Balinese Palm Leaf Manuscripts Dataset. In Proceedings of the 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Shenzhen, China, 23–26 October 2016; pp. 168–172.
18. The Unicode® Standard, Version 9.0—Core Specification; The Unicode Consortium: Mountain View, CA, USA, 2016.
19. Balinese Alphabet, Language and Pronunciation. Available online: http://www.omniglot.com/writing/balinese.htm (accessed on 20 February 2018).
20. Khmer Manuscript—Recherche. Available online: http://khmermanuscripts.efeo.fr/ (accessed on 20 February 2018).

21. Valy, D.; Verleysen, M.; Chhun, S.; Burie, J.-C. A New Khmer Palm Leaf Manuscript Dataset for Document Analysis and Recognition—SleukRith Set. In Proceedings of the 4th International Workshop on Historical Document Imaging and Processing, Kyoto, Japan, 10–11 November 2017; pp. 1–6.
22. Suryani, M.; Paulus, E.; Hadi, S.; Darsa, U.A.; Burie, J.-C. The Handwritten Sundanese Palm Leaf Manuscript Dataset from 15th Century. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 796–800.
23. Kesiman, M.W.A.; Prum, S.; Burie, J.-C.; Ogier, J.-M. An Initial Study on the Construction of Ground Truth Binarized Images of Ancient Palm Leaf Manuscripts. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), Nancy, France, 23–26 August 2015; pp. 656–660.
24. Kesiman, M.W.A.; Prum, S.; Sunarya, I.M.G.; Burie, J.-C.; Ogier, J.-M. An Analysis of Ground Truth Binarized Image Variability of Palm Leaf Manuscripts. In Proceedings of the 5th International Conference on Image Processing Theory, Tools and Applications (IPTA 2015), Orleans, France, 10–13 November 2015; pp. 229–233.
25. Burie, J.-C.; Coustaty, M.; Hadi, S.; Kesiman, M.W.A.; Ogier, J.-M.; Paulus, E.; Sok, K.; Sunarya, I.M.G.; Valy, D. ICFHR 2016 Competition on the Analysis of Handwritten Text in Images of Balinese Palm Leaf Manuscripts. In Proceedings of the 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Shenzhen, China, 23–26 October 2016; pp. 596–601.
26. Kesiman, M.W.A.; Valy, D.; Burie, J.-C.; Paulus, E.; Sunarya, I.M.G.; Hadi, S.; Sok, K.H.; Ogier, J.-M. Southeast Asian palm leaf manuscript images: A review of handwritten text line segmentation methods and new challenges. J. Electron. Imaging 2016, 26, 011011. [CrossRef]
27. Valy, D.; Verleysen, M.; Sok, K. Line Segmentation for Grayscale Text Images of Khmer Palm Leaf Manuscripts. In Proceedings of the 7th International Conference on Image Processing Theory, Tools and Applications (IPTA 2017), Montreal, QC, Canada, 28 November–1 December 2017.
28. Kesiman, M.W.A.; Prum, S.; Burie, J.-C.; Ogier, J.-M. Study on Feature Extraction Methods for Character Recognition of Balinese Script on Palm Leaf Manuscript Images. In Proceedings of the 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 4017–4022.
29. Kesiman, M.W.A.; Burie, J.-C.; Ogier, J.-M. A Complete Scheme of Spatially Categorized Glyph Recognition for the Transliteration of Balinese Palm Leaf Manuscripts. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 125–130.
30. Bezerra, B.L.D.; Zanchettin, C.; Toselli, A.H.; Pirlo, G. (Eds.) Handwriting: Recognition, Development and Analysis; Nova Science Publishers, Inc.: Hauppauge, NY, USA, 2017; ISBN 978-1-53611-957-2.
31. Arica, N.; Yarman-Vural, F.T. Optical character recognition for cursive handwriting. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 801–813. [CrossRef]
32. Blumenstein, M.; Verma, B.; Basli, H. A novel feature extraction technique for the recognition of segmented handwritten characters. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, Edinburgh, UK, 3–6 August 2003; pp. 137–141.
33. O'Gorman, L.; Kasturi, R. Executive Briefing: Document Image Analysis; IEEE Computer Society Press: Los Alamitos, CA, USA, 1997; p. 107.
34. Rais, N.B.; Hanif, M.S.; Taj, I.A. Adaptive thresholding technique for document image analysis. In Proceedings of the 8th International Multitopic Conference (INMIC 2004), Lahore, Pakistan, 24–26 December 2004; pp. 61–66.
35. Ntirogiannis, K.; Gatos, B.; Pratikakis, I. An Objective Evaluation Methodology for Document Image Binarization Techniques. In Proceedings of the Eighth IAPR International Workshop on Document Analysis Systems (DAS 2008), Nara, Japan, 16–19 September 2008; pp. 217–224.
36. He, J.; Do, Q.D.M.; Downton, A.C.; Kim, J.H. A comparison of binarization methods for historical archive documents. In Proceedings of the Eighth International Conference on Document Analysis and Recognition (ICDAR'05), Seoul, South Korea, 31 August–1 September 2005; pp. 538–542.
37. Gatos, B.; Ntirogiannis, K.; Pratikakis, I. DIBCO 2009: Document image binarization contest. Int. J. Doc. Anal. Recognit. 2011, 14, 35–44. [CrossRef]
38. Pratikakis, I.; Gatos, B.; Ntirogiannis, K. ICDAR 2013 Document Image Binarization Contest (DIBCO 2013). In Proceedings of the 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 1471–1476.
39. Howe, N.R. Document binarization with automatic parameter tuning. Int. J. Doc. Anal. Recognit. 2013, 16, 247–258. [CrossRef]

40. ICFHR2016 Competition on the Analysis of Handwritten Text in Images of Balinese Palm Leaf Manuscripts. Available online: http://amadi.univ-lr.fr/ICFHR2016_Contest/ (accessed on 20 February 2018).
41. Gupta, M.R.; Jacobson, N.P.; Garcia, E.K. OCR binarization and image pre-processing for searching historical documents. Pattern Recognit. 2007, 40, 389–397. [CrossRef]
42. Feng, M.-L.; Tan, Y.-P. Contrast adaptive binarization of low quality document images. IEICE Electron. Express 2004, 1, 501–506. [CrossRef]
43. Global Image Threshold Using Otsu's Method—MATLAB Graythresh—MathWorks France. Available online: https://fr.mathworks.com/help/images/ref/graythresh.html?requestedDomain=true (accessed on 20 February 2018).
44. Khurshid, K.; Siddiqi, I.; Faure, C.; Vincent, N. Comparison of Niblack Inspired Binarization Methods for Ancient Documents. In Proceedings of Document Recognition and Retrieval XVI, San Jose, CA, USA, 21 January 2009; p. 72470U. [CrossRef]
45. Sauvola, J.; Pietikäinen, M. Adaptive document image binarization. Pattern Recognit. 2000, 33, 225–236. [CrossRef]
46. Wolf, C.; Jolion, J.-M.; Chassaing, F. Text Localization, Enhancement and Binarization in Multimedia Documents. In Proceedings of the 16th International Conference on Pattern Recognition (ICPR 2002), Quebec City, QC, Canada, 11–15 August 2002; pp. 1037–1040.
47. Arvanitopoulos, N.; Süsstrunk, S. Seam Carving for Text Line Extraction on Color and Grayscale Historical Manuscripts. In Proceedings of the 14th International Conference on Frontiers in Handwriting Recognition, Heraklion, Greece, 1–4 September 2014; pp. 726–731.
48. Hossain, M.Z.; Amin, M.A.; Yan, H. Rapid Feature Extraction for Optical Character Recognition. Available online: http://arxiv.org/abs/1206.0238 (accessed on 20 February 2018).
49. Fujisawa, Y.; Shi, M.; Wakabayashi, T.; Kimura, F. Handwritten numeral recognition using gradient and curvature of gray scale image. In Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR'99), Bangalore, India, 22 September 1999; pp. 277–280.
50. Kumar, S. Neighborhood Pixels Weights—A New Feature Extractor. Int. J. Comput. Theory Eng. 2009, 2, 69–77. [CrossRef]
51. Bokser, M. Omnidocument technologies. Proc. IEEE 1992, 80, 1066–1078. [CrossRef]
52. Coates, A.; Lee, H.; Ng, A.Y. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 215–223.
53. Coates, A.; Carpenter, B.; Case, C.; Satheesh, S.; Suresh, B.; Wang, T.; Wu, D.J.; Ng, A.Y. Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning. In Proceedings of the International Conference on Document Analysis and Recognition, Beijing, China, 18–21 September 2011; pp. 440–445.
54. Shishtla, P.; Ganesh, V.S.; Subramaniam, S.; Varma, V. A language-independent transliteration schema using character aligned models at NEWS 2009. In Proceedings of the 2009 Named Entities Workshop (NEWS 2009), Suntec, Singapore, 7 August 2009; p. 40. [CrossRef]
55. Ul-Hasan, A.; Breuel, T.M. Can we build language-independent OCR using LSTM networks? In Proceedings of the 4th International Workshop on Multilingual OCR, Washington, DC, USA, 24 August 2013.
56. Ocropy: Python-Based Tools for Document Analysis and OCR, 2018. Available online: https://github.com/tmbdev/ocropy (accessed on 20 February 2018).
57. Homemade Manuscript OCR (1): OCRopy. Sacré Gr@@l. Available online: https://graal.hypotheses.org/786 (accessed on 20 February 2018).
58. Breuel, T.M.; Ul-Hasan, A.; Al-Azawi, M.A.; Shafait, F. High-Performance OCR for Printed English and Fraktur Using LSTM Networks. In Proceedings of the 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 683–687. [CrossRef]
59. Valy, D.; Verleysen, M.; Sok, K. Line Segmentation Approach for Ancient Palm Leaf Manuscripts using Competitive Learning Algorithm. In Proceedings of the 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Shenzhen, China, 23–26 October 2016.
60. Saund, E.; Lin, J.; Sarkar, P. PixLabeler: User Interface for Pixel-Level Labeling of Elements in Document Images. In Proceedings of the 10th International Conference on Document Analysis and Recognition, Barcelona, Spain, 26–29 July 2009; pp. 646–650. [CrossRef]

61. Stamatopoulos, N.; Gatos, B.; Louloudis, G.; Pal, U.; Alaei, A. ICDAR 2013 Handwriting Segmentation Contest. In Proceedings of the 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 1402–1406. [CrossRef]
62. PRImA. Available online: http://www.primaresearch.org/tools/Aletheia (accessed on 20 February 2018).
63. Clausner, C.; Pletschacher, S.; Antonacopoulos, A. Aletheia—An Advanced Document Layout and Text Ground-Truthing System for Production Environments. In Proceedings of the International Conference on Document Analysis and Recognition, Beijing, China, 18–21 September 2011; pp. 48–52. [CrossRef]

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).