2010 International Conference on Pattern Recognition

Offline Arabic Handwriting Identification Using Language Diacritics

Mohammed Lutf, Xinge You, Hong Li
Huazhong University of Science and Technology, Wuhan, China
Email: [email protected], [email protected], [email protected]
1051-4651/10 $26.00 © 2010 IEEE. DOI 10.1109/ICPR.2010.471

Abstract: In this paper, we present an approach for writer identification using off-line Arabic handwriting. The proposed method treats Arabic writing in a new form, presenting it in terms of its basic components rather than its alphabet. We split the input document into two parts, one for the letters and the other for the diacritics. We extract all diacritics from the input image, calculate the LBP histogram of each diacritic, and concatenate these histograms to use them as handwriting features. We use the IFN/ENIT database in the experiments reported here, and our tests involve 287 writers. The results show that our method is very effective and makes Arabic handwriting easier to handle than before.

Keywords: Diacritics; Arabic handwriting; LBP

I. INTRODUCTION

Personal identification based on handwriting is a behavioral biometric identification approach. Because of its wide range of real-life applications, from forensic and historical document analysis to the enhancement of handwriting recognition systems, writer identification has become a very important field of research. Many techniques have been developed over the past years; according to these techniques, writer identification can be classified into three categories: on-line vs. off-line, text-dependent vs. text-independent, and structural vs. statistical.

Arabic is the native language of more than 280 million people, but Arabic writer identification has not been addressed as extensively as Latin or Chinese writer identification. Recently, a number of new Arabic writer identification approaches have been proposed, applying both statistical and structural methods. Faddaoui et al. used handwriting texture analysis with a set of 16 Gabor filters to extract 32 features [1]. Nejad et al. proposed another Gabor multi-channel method for Persian writer identification [2]. In [3], Ubul et al. used Gabor multi-channel wavelets for the Uyghur language, which is written using Arabic and Persian characters; they obtained 144 extracted features. Bulacu et al. [4] considered a set of edge-based joint directional probability distributions, such as the contour-direction probability distribution function (PDF), the contour-hinge PDF, and the direction co-occurrence PDF. High performance is achieved by combining joint directional probability distributions with grapheme-emission distributions (allographic features). Similarly, Al-Ma'adid et al. [5] employed edge-based directional probability distributions, combined with moment invariants and structural word features such as area, length, height, length from baseline to upper edge, and length from baseline to lower edge. Rafiee et al. [6] introduced a new writer identification method for Persian writers: eight different types of features associated with the height and width of the text are extracted, and a feed-forward neural network is used for classification. In [7], Abdi et al. propose a stroke-based method; they extract the strokes from the document and then produce different PDF combinations of the measurements of height/size, length, and curvature. Gazzah et al. combined structural and statistical methods: textural analysis using lifting-scheme wavelet transforms, and structural features such as average line height, spaces between subwords, inclination of the ascenders, and features extracted from dots such as height and width [8].

The remainder of the paper is organized as follows. Section 2 introduces a new description of Arabic writing on which our proposed approach is based. In Section 3, we describe our proposed approach. Section 4 deals with the testing process and the experimental results, and Section 5 concludes and gives some perspectives on this work.

II. ARABIC DIACRITICS AND INDIVIDUALITY

Arabic is the second most widely used alphabetic writing system in the world (the Latin alphabet is the most widespread). Originally developed for writing the Arabic language and carried across much of the Eastern Hemisphere by the spread of Islam, the Arabic script has been adapted to such diverse languages as Persian, Turkish, Spanish, and Swahili.

The Arabic alphabet has 28 basic letters. Unlike cursive writing based on the Latin alphabet, in the standard Arabic style a letter has a substantially different shape depending on whether it connects to a preceding and/or a succeeding letter, which means each letter has between two and four shapes. The shapes correspond to the four positions: beginning of a (sub)word, middle of a (sub)word, end of a (sub)word, and in isolation.

The alphabet was first used to write texts in Arabic, most notably the Qur'an, the holy book of Islam. With the spread of Islam, it came to be used to write many languages of many language families including, at various times, Urdu, Pashto, Baloch, Malay, Fulfulde-Pular, Hausa, Mandinka (in West Africa), Swahili (in East Africa), Balti, Brahui, Panjabi (in Pakistan), Kashmiri, Sindhi (in India and Pakistan), Arwi (in Sri Lanka), Chinese, Uyghur (in China), Kazakh (in Central Asia), Uzbek (in Central Asia), Kyrgyz (in Central Asia), Azerbaijani (in Iran), Kurdish (in Iraq and Iran), Belarusian (amongst Belarusian Tatars), Ottoman Turkish and Spanish (in Western Europe). To accommodate the needs of these other languages, new letters and symbols were added to the original alphabet.
This process is known as the Ajami transcription system, which is different from the original Arabic alphabet.

With the passage of time, many modifications and improvements have been made to the Arabic script, resulting in additional letters and strokes. The new strokes are called diacritics, and the purpose of adding them was to:
1) Distinguish between letters of the same or similar form.
2) Indicate sounds (vowels and tones) that are not conveyed by the basic alphabet.
3) Indicate the absence of a vowel.
4) Clarify the difference in meaning between words consisting of the same letters.

Table I
ARABIC DIACRITICS.

No   Diacritic Name                   Shape
1    Fatha & Kasra                    َ
2    Tanwin (Fathatain & Kasratain)
3    Damma                            ُ
4    Hamza                            ء
5    Madda                            ~
6    Shadda                           ّ
7    Tanwin (Dammatain)
8    Sukun
9    One dot
10   Two dots
11   Three dots

Nowadays, the diacritics are not additional or optional to the language; they have become an important part of the language itself and are necessary for children or foreigners learning Arabic. Just as writers need to master the writing style of the Arabic alphabet, they also need to master the writing and correct placement of the diacritics in order to write correct and fully understandable Arabic. Table I lists all Arabic diacritics used in modern Arabic handwriting. We can therefore conclude that the modern Arabic writing system consists of two main parts: letters and diacritics.

Now, suppose we split a given Arabic document into its original parts, one containing the letters and the other containing the diacritics, as shown in Fig. 1. The diacritics are clearly easier to locate and segment than the letters, since there is almost no connection between them. Because diacritics always lie above or below the baseline, once the baseline is detected we can extract all diacritics in the document image using any existing technique: we can locate the diacritics as black objects on a white background within the document image, or segment each line and apply a vertical projection profile to it. Once the diacritics are segmented, it is easy to apply either texture or structural techniques.

Figure 1. Arabic document decomposition: (a) the original text, (b) diacritics part, (c) letters part.

The question is: are the diacritics enough to represent the handwriting style in an Arabic handwritten document? The answer is yes. Handwriting individuality exists whether the document involves the whole character set of the language or only part of it; nothing is excluded in principle from the identification process. The more elements the handwritten document contains, the more accurate the result we can get. However, regardless of the language's character set, or of how many or which characters are included in the handwritten document, the individuality of the handwriting style should still exist, because if the handwriting style in a given document is individual, the elements that make up this document should be individual too. In Arabic, both letters and diacritics can be used for handwriting identification, but segmenting or normalizing letters is not a good choice, so we choose to work with diacritics, which are easy to segment and small in size. Fig. 2 shows the "Two dots" diacritic written by 25 different writers.

Figure 2. "Two dots" diacritic written by 25 different writers.

III. PROPOSED APPROACH

Figure 3. Diacritics segmentation: (a) locating the start points, (b) after clearing the text, (c) final diacritics segmentation.
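The idea from Section II of locating diacritics as black objects that lie above or below the baseline can be illustrated with a small sketch. It is not the segmentation procedure of Fig. 3; it is a simplified stand-in under stated assumptions: the image is a binary grid (1 = ink), objects are 8-connected components, and the baseline is taken to be the row with the most ink. The function names `connected_components`, `baseline_row`, and `split_diacritics` are hypothetical.

```python
from collections import deque

def connected_components(img):
    """Return the 8-connected groups of ink pixels (1 = ink, 0 = background)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:  # breadth-first flood fill from the seed pixel
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            comps.append(pixels)
    return comps

def baseline_row(img):
    """Assumed heuristic: the baseline is the row containing the most ink."""
    sums = [sum(row) for row in img]
    return sums.index(max(sums))

def split_diacritics(img):
    """Components crossing the baseline count as letters, the rest as diacritics."""
    base = baseline_row(img)
    letters, diacritics = [], []
    for pixels in connected_components(img):
        rows = [y for y, _ in pixels]
        if min(rows) <= base <= max(rows):
            letters.append(pixels)
        else:
            diacritics.append(pixels)
    return letters, diacritics

# Toy text line: one stroke on the baseline, a dot above it, a mark below it.
TOY = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
]
letters, diacritics = split_diacritics(TOY)
```

On real scans a size threshold or a per-line baseline estimate would likely be needed as well; the sketch only shows why objects detached from the baseline are easy to isolate.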
In our approach, we first apply a preprocessing step to the handwritten document: we extract all the diacritics in the document image and produce a new input image that contains only diacritics. We then calculate the local binary pattern (LBP) histogram of each diacritic and concatenate these histograms to form the histogram features. For classification, we use a K-nearest-neighbor classifier with the $\chi^2$ distance as the distance function.

$\chi^2$ distance: $\sum^{n} (Y - Z)^2$

A.

[...] the input data, we want the system to be able to identify the writer even if we have only one word. We first calculate the minimum diacritic distance within the same writer, then sum these distances to form the writer distance as follows:
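The feature and distance computations described in this section can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: it assumes the basic 8-neighbor LBP operator with a 256-bin histogram, the common form of the chi-square histogram distance, $\sum_i (Y_i - Z_i)^2 / (Y_i + Z_i)$, and a writer distance that sums, over the query diacritics, the minimum distance to any reference diacritic of that writer. The names `lbp_histogram`, `chi2_distance`, `writer_distance`, and `identify_writer` are hypothetical.

```python
def lbp_histogram(patch):
    """256-bin histogram of basic 8-neighbor LBP codes over interior pixels.

    `patch` is a 2D list of gray or binary values; border pixels are skipped.
    """
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                # A neighbor at least as bright as the center sets its bit.
                if patch[y + dy][x + dx] >= patch[y][x]:
                    code |= 1 << bit
            hist[code] += 1
    return hist

def chi2_distance(y, z):
    """Assumed chi-square form: sum of (y_i - z_i)^2 / (y_i + z_i), empty bins skipped."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(y, z) if a + b > 0)

def writer_distance(query_hists, writer_hists):
    """Sum, over query diacritics, of the minimum distance to the writer's diacritics."""
    return sum(min(chi2_distance(q, r) for r in writer_hists)
               for q in query_hists)

def identify_writer(query_hists, gallery):
    """Nearest-neighbor decision: the writer with the smallest writer distance."""
    return min(gallery, key=lambda name: writer_distance(query_hists, gallery[name]))
```

Here `gallery` is assumed to map a writer label to the list of LBP histograms of that writer's reference diacritics; `identify_writer` is the K = 1 case of the K-nearest-neighbor rule.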