2015 Third International Conference on Image Information Processing

A Database of Printed Jawi Character Image

Khairun Saddami, Khairul Munadi, Fitri Arnia
Graduate Program in Electrical Engineering, Syiah Kuala University, Banda Aceh, Indonesia
[email protected], [email protected], [email protected]

978-1-5090-0148-4/15/$31.00 © 2015 IEEE

Abstract- In this paper we present a database of printed Jawi characters for recognition research. Latin and Arabic character databases are currently available, but no Jawi character database exists yet, either printed or handwritten. The database was developed to help researchers analyse and evaluate Jawi character recognition methods. The characters, words, and sentences were typed in a word processing application, then printed and scanned into image format. The database contains 1524 character images in four fonts. In addition, it includes 168 printed word and sentence images.

Keywords- Jawi database; Jawi template; Jawi printed image; Jawi printed text

I. INTRODUCTION

Character recognition is a sub-discipline of image processing. Its aim is to create computers capable of recognizing characters in a document, whether handwritten or printed. Research in character recognition started at the beginning of the 1970s. At first it focused on Latin characters, but nowadays it has extended widely to other scripts such as Arabic, Chinese, and Japanese.

Jawi is one of the variants of Arabic, besides Urdu and Farsi. Jawi, which is adopted from Arabic, and the Malay language have been used as a lingua franca for hundreds of years in Nusantara, which includes Indonesia, Malaysia, Brunei Darussalam, southern Thailand, the southern Philippines, and southern Myanmar [1].

Recognition of Jawi characters did not start until the end of the 1990s [2]. Research on Jawi character recognition was published in 2000 [3] and 2002 [4], and many papers have been published since then. Even though Jawi character recognition research has been conducted for many years, there is no Jawi character database that can be accessed and used for research. This paper provides a database of printed Jawi (DPJ) characters that can be used for Jawi character recognition research.

The second section of this paper reviews currently available character databases. The third section presents our DPJ database, and the final section concludes the paper.

II. LITERATURE REVIEW

Many databases exist for character recognition, and many of them contain Arabic and Latin characters. UNLV collected 2889 pages from different sources such as newspapers and magazines. The dataset was used to analyze the character recognition system they developed [5].

MAYASTROUN provides an unconstrained database of Arabic and Latin characters. It consists of character, text, word, and signature images. The database contains 67825 character images written by 355 writers [6].

In 2014, a handwritten Arabic database was proposed. It provides 5600 samples of handwritten Arabic characters, divided into beginning, middle, end, and isolated forms [7].

The King Fahd University Arabic Font Database (KAFD) is a collection of Arabic characters in various fonts, styles, sizes, and resolutions. The database includes 40 characters in 10 different font sizes, 4 font styles, 4 resolutions, and two forms. KAFD contains 115068 page images and a total of 2576024 lines. The whole dataset is organized into three main groups: training, testing, and validation [8].

Besides Arabic and Latin character databases, many character databases for other languages have been established, such as HCL2000 [9] and CASIA-OLHWDB1 [10] for Chinese, the Multi-tech Research Group Online Handwritten Tibetan Character database (MRG-OHTC) [11] for Tibetan, PE92 [12] for Korean, [13][14] for Japanese, NFI [15] for Dutch, RIMES [16] for French, [17][18] for Indian scripts, and LAMIS-MSHD [19], which is a multi-language collection.

Farsi, as a variant of Arabic, has several databases. One of them is the Handwritten Farsi Text Database (HaFT) [20]. HaFT is a collection of 1800 greyscale images written by 600 people. The database includes several image versions, such as variants in text length and text size.

Apart from Farsi, Urdu, another Arabic variant, also has a character database. CENIP-UCCP [21] is a database for Urdu character recognition research. It was completed with automatic line segmentation. It contains 400 digital forms written by 200 people, holding 23833 words in 2051 text lines.

The characteristics of Jawi characters are similar to those of Arabic. The difference between them is that the Jawi alphabet has six additional characters, used to cover sounds that do not exist in Arabic. The additional characters are nga, nya, pa, va, ga, and ca. The Jawi alphabet is shown in TABLE I.
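As an illustration, the six additional characters named above can be listed together with Unicode code points. The paper does not say how the characters are encoded, so the code points below are an assumption (they are the ones commonly used for Jawi text), and the names `ADDITIONAL_JAWI_LETTERS` and `describe` are hypothetical.

```python
import unicodedata

# Assumption: Unicode code points commonly used for the six additional
# Jawi letters. The paper itself does not specify an encoding.
ADDITIONAL_JAWI_LETTERS = {
    "ca": "\u0686",
    "nga": "\u06A0",
    "pa": "\u06A4",
    "ga": "\u0762",
    "nya": "\u06BD",
    "va": "\u06CF",
}


def describe(letters):
    """Return (letter name, code point, Unicode character name) rows."""
    rows = []
    for name, char in letters.items():
        rows.append((name, f"U+{ord(char):04X}", unicodedata.name(char)))
    return rows


if __name__ == "__main__":
    for name, code_point, unicode_name in describe(ADDITIONAL_JAWI_LETTERS):
        print(f"{name:<4} {code_point}  {unicode_name}")
```

Listing the letters this way makes it easy to check whether a given font or text file actually covers all six characters before using it with the database.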

TABLE I. CHARACTERISTICS OF THE JAWI ALPHABET

No  Character name  Isolated  Other forms             Exp
1   Alif            ا         end
2   Ba              ب         beginning, middle, end
3   Ta              ت         beginning, middle, end
4   Ta Marbuthah    ة         end
5   Tsa             ث         beginning, middle, end
6   Jim             ج         beginning, middle, end
7   Ca              چ         beginning, middle, end  add
8   Ha              ح         beginning, middle, end
9   Kha             خ         beginning, middle, end
10  Da              د         end
11  Dza             ذ         end
12  Ra              ر         end
13  Za              ز         end
14  Sin             س         beginning, middle, end
15  Sya             ش         beginning, middle, end
16  Sha             ص         beginning, middle, end
17  Dha             ض         beginning, middle, end
18  Tha             ط         beginning, middle, end
19  Zha             ظ         beginning, middle, end
20  Ain             ع         beginning, middle, end
21  Ghain           غ         beginning, middle, end
22  Nga             ڠ         beginning, middle, end  add
23  Fa              ف         beginning, middle, end
24  Pa              ڤ         beginning, middle, end  add
25  Qa              ق         beginning, middle, end
26  Kaf             ک         beginning, middle, end
27  Ga              ݢ         beginning, middle, end  add
28  Lam             ل         beginning, middle, end
29  Mim             م         beginning, middle, end
30  Nun             ن         beginning, middle, end
31  Nya             ڽ         beginning, middle, end  add
32  Wau             و         end
33  Va              ۏ         end                     add
34  Haa             ه         beginning, middle, end
35  Yaa             ي         beginning, middle, end
36  Hamzah          ء         -
37  Lamalif         لا        end

III. DPJ OVERVIEW

In this section we propose a database of printed Jawi characters. The main purpose of the paper is to provide a Jawi dataset for Jawi character recognition research.

A. Character Image Creation

Prior to creating the database, we installed a Jawi keyboard layout. The installation was performed using an application downloaded from Jawiware, which provides software and research about Jawi [22]. Following the installation manual, after the installation was completed, the English keyboard layout was changed to the Jawi keyboard layout [23].

Next, we created the Jawi images, which consist of characters, words, and sentences. Each character, word, and sentence was typed in a word processing application (MS Word 2007) using the fonts Sakkal Majalla, Times New Roman, Courier New, and Segoe UI. The characters, words, and sentences were then printed and scanned using a scanner. After scanning, the images of the Jawi characters, words, and sentences were cropped using an image processing application (MS Paint 2007). The size of each printed Jawi character image is 100x100. The size of each printed Jawi word image is 200x200, while the size of each printed Jawi sentence image is 275x75.

Fig 1. Examples of printed Jawi character image: (a) font Courier New, (b) font Times New Roman, (c) font Sakkal Majalla, (d) font Segoe UI

B. Printed Jawi Character Dataset

The printed Jawi character image database consists of 1524 images in total, drawn from four fonts. Characters are available in up to four forms: the isolated (stand-alone) character, the character at the beginning of a word, the character in the middle of a word, and the character at the end of a word. Each character image is provided in three sizes, 100x100, 200x200, and 400x400 pixels, and is saved in JPEG format. Examples of the characters can be seen in Fig 1.
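The character counts can be cross-checked with a short sketch. It assumes that the forms available for each character follow standard Arabic joining behaviour (most letters have four forms, letters that do not join to the following letter have two, and hamzah has one). The paper does not state this breakdown explicitly, and the file layout produced by `build_manifest` is purely hypothetical. Under these assumptions the count comes to 127 images per font and size, which agrees with TABLE II.

```python
from itertools import product

FONTS = ["Courier New", "Times New Roman", "Sakkal Majalla", "Segoe UI"]
SIZES = [100, 200, 400]  # square character images, in pixels

# Assumption: form availability follows standard Arabic joining rules.
FOUR_FORMS = ("isolated", "beginning", "middle", "end")
TWO_FORMS = ("isolated", "end")
ONE_FORM = ("isolated",)

FORMS_BY_CHARACTER = {}
for name in ("Ba", "Ta", "Tsa", "Jim", "Ca", "Ha", "Kha", "Sin", "Sya",
             "Sha", "Dha", "Tha", "Zha", "Ain", "Ghain", "Nga", "Fa", "Pa",
             "Qa", "Kaf", "Ga", "Lam", "Mim", "Nun", "Nya", "Haa", "Yaa"):
    FORMS_BY_CHARACTER[name] = FOUR_FORMS
for name in ("Alif", "Ta Marbuthah", "Da", "Dza", "Ra", "Za", "Wau", "Va",
             "Lamalif"):
    FORMS_BY_CHARACTER[name] = TWO_FORMS
FORMS_BY_CHARACTER["Hamzah"] = ONE_FORM


def build_manifest():
    """List one relative path per character image (hypothetical layout)."""
    paths = []
    for font, size in product(FONTS, SIZES):
        folder = f"{font.replace(' ', '_')}/{size}x{size}"
        for name, forms in FORMS_BY_CHARACTER.items():
            for form in forms:
                paths.append(f"{folder}/{name.replace(' ', '_')}_{form}.jpg")
    return paths


if __name__ == "__main__":
    per_font_and_size = sum(len(f) for f in FORMS_BY_CHARACTER.values())
    print(per_font_and_size)      # images per font and size
    print(len(build_manifest()))  # images in the whole character dataset
```

A manifest of this kind is one way to verify that a local copy of the dataset is complete before running recognition experiments.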


C. Printed Jawi Text Dataset

In the database, we provide not only Jawi character images but also printed Jawi word and sentence images. Four fonts are used for the words and sentences. Each word has three image sizes: (1) 150x100 pixels, (2) 300x200 pixels, and (3) 450x300 pixels. Each sentence has three image sizes: (1) 550x150 pixels, (2) 275x75 pixels, and (3) 1100x300 pixels. The images are stored in JPEG format. The font size of the word and sentence images is 30 points. Examples of words and sentences can be seen in Fig 2 and Fig 3.

D. Statistics

In total, the database contains 1524 character images. Each font (Sakkal Majalla, Times New Roman, Courier New, Segoe UI) has 381 character images (TABLE II). Furthermore, the database contains 168 word and sentence images (TABLE III and IV).
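The totals quoted above follow from the per-font counts and the three image sizes of each category. The sketch below reproduces that arithmetic and also checks that the three sizes in each category share one aspect ratio. All numbers are taken from the text and from TABLE II to IV. The variable and function names are hypothetical.

```python
FONTS = ["Courier New", "Times New Roman", "Sakkal Majalla", "Segoe UI"]

# (width, height) in pixels, in the order given in the text.
SIZES = {
    "character": [(100, 100), (200, 200), (400, 400)],
    "word": [(150, 100), (300, 200), (450, 300)],
    "sentence": [(550, 150), (275, 75), (1100, 300)],
}

# Number of distinct items per font at each size (TABLE II to IV).
ITEMS_PER_FONT = {"character": 127, "word": 11, "sentence": 3}


def total_images(kind):
    """Items per font x number of sizes x number of fonts."""
    return ITEMS_PER_FONT[kind] * len(SIZES[kind]) * len(FONTS)


def same_aspect_ratio(sizes):
    """True if every (width, height) pair has the same aspect ratio."""
    first_width, first_height = sizes[0]
    return all(w * first_height == h * first_width for w, h in sizes)


if __name__ == "__main__":
    for kind in SIZES:
        print(kind, total_images(kind), same_aspect_ratio(SIZES[kind]))
    print("word + sentence:", total_images("word") + total_images("sentence"))
```

The check is done with integer cross-multiplication rather than floating-point division, so equal ratios compare exactly.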

TABLE II. STATISTIC DISTRIBUTION OF PRINTED CHARACTER

Font             100x100  200x200  400x400  Total
Courier New      127      127      127      381
Times New Roman  127      127      127      381
Sakkal Majalla   127      127      127      381
Segoe UI         127      127      127      381
Total            508      508      508      1524

TABLE III. STATISTIC DISTRIBUTION OF PRINTED WORDS

Font             1st size  2nd size  3rd size  Total
Courier New      11        11        11        33
Times New Roman  11        11        11        33
Sakkal Majalla   11        11        11        33
Segoe UI         11        11        11        33
Total            44        44        44        132

TABLE IV. STATISTIC DISTRIBUTION OF PRINTED SENTENCES

Font             1st size  2nd size  3rd size  Total
Courier New      3         3         3         9
Times New Roman  3         3         3         9
Sakkal Majalla   3         3         3         9
Segoe UI         3         3         3         9
Total            12        12        12        36

Fig 2. Examples of printed Jawi word image: (a) font Courier New, (b) font Times New Roman, (c) font Sakkal Majalla, (d) font Segoe UI

Fig 3. Examples of printed Jawi text image: (a) font Courier New, (b) font Times New Roman, (c) font Sakkal Majalla, (d) font Segoe UI

IV. CONCLUSION

DPJ (Database of Printed Jawi) is a Jawi character image database that can be used for research. The database contains two parts: (1) Jawi character images and (2) Jawi text images. The Jawi text images comprise words and sentences. We present the database to serve researchers in character recognition and, to the best of our knowledge, it is the first database for Jawi characters. Our future work is to create a handwritten Jawi character database, so that Jawi character recognition research can be further improved.


REFERENCES

[1] Amat Juhari Moain, Perancangan Bahasa: Sejarah aksara jawi. Kuala Lumpur, 1996.
[2] Nasrudin, M.F., Omar, Khairuddin, Zakaria, M.S., and Yeun, L.C., "Handwritten cursive Jawi character recognition: a survey", in 5th International Conference on Computer Graphics, Imaging, and Visualisation, CGIV 2008, pp. 247-256.
[3] Omar, Khairuddin, "Jawi handwritten text recognition using multi-level classifier" (in Malay), PhD Thesis, Universiti Putra Malaysia, 2000.
[4] Manaf, Mazani, "Jawi handwritten text recognition using recurrent Bama neural network" (in Malay), PhD Thesis, Universiti Kebangsaan Malaysia, 2002.
[5] http://www.isri.unlv.edu/ISRI/OCRtk
[6] Njah, S., Nouma, B.B., Bezine, H., and Alimi, A.M., "MAYASTROUN: A multilanguage handwritten database", in 13th International Conference on Frontiers in Handwriting Recognition, ICFHR 2012, pp. 308-312.
[7] Bahashwan, M.A., and Abu Bakar, S.A., "A database of Arabic handwritten characters", IEEE International Conference on Control System, Computing and Engineering, ICCSCE 2014, pp. 632-635.
[8] H. Luqman, S.A. Mahmoud, and S. Awaida, "KAFD Arabic font database", Pattern Recognition, vol. 47, no. 6, pp. 2231-2240, 2014.
[9] Honggang Zhang, Jun Guo, Guang Chen, and Chunguang Li, "HCL2000 - A large scale handwritten Chinese character database for handwritten character recognition", 10th International Conference on Document Analysis and Recognition, ICDAR 2009, pp. 286-290.
[10] Da-Han Wang, Cheng-Lin Liu, Jin-Lun Yu, and Xiang-Dong Zhou, "CASIA-OLHWDB1: A database of online handwritten Chinese characters", 10th International Conference on Document Analysis and Recognition, ICDAR 2009, pp. 1206-1210.
[11] Long-Long Ma, Hui-Liu, and Jian Wu, "MRG-OHTC database for online handwritten Tibetan character recognition", International Conference on Document Analysis and Recognition, ICDAR 2011, pp. 207-211.
[12] D.H. Kim, Y.S. Hwang, S.T. Park, E.J. Kim, S.H. Paek, and S.Y. Bang, "Handwritten Korean character image database PE92", in Proceeding of 2nd International Conference on Document Analysis and Recognition, ICDAR 1993, pp. 470-473.
[13] M. Nakagawa, T. Higashiyama, Y. Yamanaka, S. Sawada, L. Higashigawa, and K. Akiyama, "On-line handwritten character pattern database sampled in a sequence of sentences without any writing instructions", in Proceeding of 4th International Conference on Document Analysis and Recognition, ICDAR 1997, pp. 376-381.
[14] K. Matsumoto, T. Fukushima, and M. Nakagawa, "Collection and analysis of on-line handwritten Japanese character patterns", in Proceeding of 6th International Conference on Document Analysis and Recognition, ICDAR 2001, pp. 496-500.
[15] A. Brink, L. Schomaker, and M. Bulacu, "Towards explainable writer verification and identification using vantage writers", in 7th International Conference on Document Analysis and Recognition, ICDAR 2007, vol. 2, pp. 824-828.
[16] E. Grosicki, M. Carr, and E. Geoffrois, "RIMES evaluation campaign for handwritten mail processing", in Proceedings of the International Conference on Frontier in Handwriting Recognition, ICFHR 2008, pp. 941-945.
[17] Jayadevan, R., Kolhe, S.R., Patil, P.M., and Pal, U., "Database development and recognition of handwritten Devanagari legal amount words", in International Conference on Document Analysis and Recognition, ICDAR 2011, pp. 304-308.
[18] Nethravathi, B., Archana, C.P., Shashikiran, K., Ramakrishnan, A.G., and Kumar, V., "Creation of a huge annotated database for Tamil and Kannada OHR", in International Conference on Frontier in Handwriting Recognition, ICFHR 2008, pp. 415-420.
[19] C. Djeddi, A. Gattal, L.S. Meslati, I. Siddiqi, Y. Chibani, and H. El Abed, "LAMIS-MSHD: A multi-script offline handwritten database", in International Conference on Frontiers in Handwritten Recognition, ICFHR 2014, pp. 93-97.
[20] Safabaksh, R., Ghanbarian, A.R., and Ghiasi, G., "HaFT: A handwritten Farsi text database", in 8th Iranian Conference on Machine Vision and Image Processing, MVIP 2013, pp. 89-94.
[21] Raza, A., Siddiqi, I., Abidi, A., and Arif, F., "An unconstrained benchmark Urdu handwritten sentence database with automatic line segmentation", in 13th International Conference on Frontiers in Handwriting Recognition, ICFHR 2012, pp. 491-496.
[22] http://www.jawiware.org/
[23] Murah, Mohd Zamri, Abdul Rahman, Hamdan, and Omar, Khairuddin, "Tulisan Jawi dan teknologi maklumat" (in Malay), in Persidangan Kebangsaan Tulisan Jawi Kali Kedua, UKM Malaysia, December 2011.
