LUCIDAH Ligative and Unligative Characters in a Dataset for Arabic Handwriting
Total Page:16
File Type:pdf, Size:1020Kb
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 10, No. 8, 2019 LUCIDAH Ligative and Unligative Characters in a Dataset for Arabic Handwriting 1 Yousef Elarian Abdelmalek Zidouri3 Teaching and Learning Coaching Electrical Engineering Department ® Interserve King Fahd University of Petroleum & Minerals Madīnah, Saudi Arabia Dhahran 31261, Saudi Arabia 2 Irfan Ahmad Wasfi G. Al-Khatib4 Information and Computer Science Department Information and Computer Science Department King Fahd University of Petroleum & Minerals King Fahd University of Petroleum & Minerals Dhahran 31261, Saudi Arabia Dhahran 31261, Saudi Arabia Abstract—Arabic script is inherently cursive, even when either written in the normal horizontal connection form or as a machine-printed. When connected to other characters, some ligature, depending on the writer’s choice. Arabic characters may be optionally written in compact aesthetic forms known as ligatures. It is useful to distinguish ligatures Many character sequences are not ligative (or “unligative”, from ordinary characters for several applications, especially as we refer to them); i.e., they do not have optional ligature ,family ”ـ٬“ or ”٫“) automatic text recognition. Datasets that do not annotate these forms. One particular family of ligatures ,and Madda (“~”) combinations (”ء“) ligatures may confuse the recognition system training. Some with different Hamza is considered (”ـ٪“ ,”٩“ ,”ـ٦“ ,”٥“ ,”ـ٨“ ,”٧“ ,popular datasets manually annotate ligatures, but no dataset namely (prior to this work) took ligatures into consideration from the obligatory, i.e., its constituent LAM and ALEF characters design phase. In this paper, a detailed study of Arabic ligatures cannot be connected in the common horizontal form. Table I and a design for a dataset that considers the representation of shows different connections of a ligative, an unligative, and an ligative and unligative characters are presented. Then, pilot data obligatorily ligative examples. collection and recognition experiments are conducted on the presented dataset and on another popular dataset of handwritten Ligatures are often used in handwriting for their esthetic Arabic words. These experiments show the benefit of annotating effects and their compactness. Their presence, however, can ligatures in datasets by reducing error-rates in character add complexities to tasks like Arabic character recognition and recognition tasks. synthesis. Researchers in these and similar fields have attempted to handle some ligatures [6] [7] [8] [9], but a Keywords—Arabic ligatures; automatic text recognition; comprehensive and systematic list of ligatives was only handwriting datasets; Hidden Markov Models available after the initial work presented by the authors in [4]. I. INTRODUCTION In addition to handwriting, Arabic ligatures are increasingly Arabic script is widely used, not only for the Arabic being incorporated in modern computer fonts to help make language, but also for other languages like Urdu, Jawi, and texts more compact, beautiful, and as an alternative method for Persian [1]. Despite such importance of the Arabic script, paragraph justification [10] [11]. research in technologies that serve it still remains The "Arabic Presentation Forms-A" of the Unicode underdeveloped [2]. This is mainly attributed to the scarcity of standard encodes more than 300 Arabic and Extended-Arabic specialized resources and algorithms that take the peculiarities ligatures [12]. We notice, however, that their set suffers from ”خئ“ of the Arabic script into consideration [2] [3]. incompleteness and inconsistencies. For example, the is absent in Unicode despite the presence (”داخئ“ Among the peculiarities of the Arabic script is that it is ligature (as in .([cf. [4) ”خن“ and ,”حن“ ,”جن“ ,”حئ“ ,”جئ“ inherently (and diversely) cursive. Twenty-two out of the 28 of similar ligatures such as Arabic letters must be connected to the letter that comes next to them in a word. The general form of connection is via short To the best of our knowledge, Arabic ligatures were not horizontal extensions, shown in the second row of Table I. systematically analyzed and studied before. In this paper, we Some characters, however, also accept to be connected in present a detailed study of Arabic ligatures that lead to the special compact forms known as ligatures [4] (or typographic design of LUCIDAH, a Ligative and Unligative Dataset for ligatures [5]). Examples of Arabic ligatures are shown in the Arabic Handwriting. LUCIDAH is especially designed to be last row of Table I. representative of both: ligatures and characters in their non- ligatured forms. It is also designed to be practically concise and Most ligatures are optional, i.e., when certain sequences of natural; based on the guidelines in [4]. characters (that we refer to as “ligatives”) occur, they can be 406 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 10, No. 8, 2019 TABLE. I. EXAMPLES OF (A) A LIGATIVE, (B) AN UNLIGATIVE AND (C) AN connecting. Finally, as was mentioned earlier, there is one ,that does not accept connections from either side ”ء“ ,OBLIGATORILY-LIGATIVE CHARACTERS character Mode of connection Ligative Unligative Obligatorily and hence; it only has the Alone (A) character-shape. B. Arabic Typographic Models مػ ـا مػ ـس مػ ـمػ ػح Unconnected Character-shapes NA There are several Arabic typographic models. The مس مـمح Non-Ligature Connections traditional model introduces up to 4 character-shapes per letter Ligature Connections NA or connecting letters having four shapes, non-connecting letters) ﻻ has one). The traditional model ”ء“ having two, and The rest of this paper is organized as follows. In Section II, encompasses more than a hundred character-shapes, which can background on Arabic characters in several typographic be a large number for some applications [17]. Hence, several models is presented. Section III is devoted to the discussion of other models were introduced to represent Arabic script with ligatures and their categorization. In Section IV, the design of less numbers of shapes. LUCIDAH is described. In Section V, implementation issues such as the data collection forms, the demographic distribution One of these reduction models is the 2-Shapes model [18]. of writers, and some scanning parameters are discussed. In The 2-Shapes model clusters the (B) and (M) shapes together Section VI, text recognition experiments are presented to and the (A) and (E) shapes together, whenever the differences highlight the impact of annotating ligatures on recognition are merely the connecting extension found preceding the (M) error-rates. Finally, conclusions are presented in Section VII. and (E) shape, as seen in Fig. 3 (left and middle parts). II. BACKGROUND ON THE ARABIC SCRIPT The 1-Shape model introduces root shapes, or basic glyph parts that are present in all shapes of a character [17]. In The Arabic script consists of 28 alphabetic letters that are addition to the common root shapes, many (A) and (E) mapped to the 36 computer characters (shown and explained in character-shapes also have tails. For example, the character details in the subsection of the Arabic Typographic Models). “ ” in Fig. 3 (right) has the “ ” root in all of its shapes, and ظػ ص Most Arabic characters must connect to subsequent characters .tail in its (A) and (E) shapes ”ں“ to compose words. A few Arabic characters, however, do not the and “ٍ” in ”ح“ connect to subsequent characters (e.g., Letters الرحيم Arabic Word Fig. 1). If any of the non-connecting characters occurs before ا لر حيم another letter in a word, they cause the word to split into Three PAWs disconnected Pieces of Arabic Words (PAWs) [13] [14] [15] or PAWs Ordered from right to left 3rd 2nd 1st subwords [16]. (In addition to that, there is one non-connecting Number of characters in PAW Three Two One .that does not allow previous characters to Fig. 1. An Arabic Word Divided into its PAWs ,”ء“ ,character connect to it, even when the previous character is a connecting one; hence, its occurrence also causes words to be split into TABLE. II. DIFFERENT CHARACTER-SHAPES OF THE ARABIC CHARACTER PAWs). "ّ" BASED ON ITS CONNECTIONS TO PREVIOUS AND NEXT CHARACTERS Example Connected Connected Example Size of A. Arabic Character-Shapes Shape Character- to previous to next Word PAWa Each Arabic character takes a “character-shape” based on Shape character character No No 1 درس س (its position in a PAW, as detailed in the following subsection. (A No Yes 2+ أسف سـ (Arabic traditionally introduces four character-shapes, based on (B Yes Yes 3+ تسامح ـسـ (characters' context (i.e. connection with previous and next (M characters in a PAW). If a PAW consists of only one character, Yes No 2+ خس ـس (that character takes the Alone (A) character-shape. PAWs with (E more than one letter must start with a Beginning (B) character- a The "+" sign after a number X in this column denotes "X or more characters". shape and end with an Ending (E) character-shape. If the PAW ا لر حيم PAWs ـ consists of three or more letters, there would be characters ا لـ ـر حـ يـ ـم between the Beginning and the Ending character-shapes. These Character-Shapes must be Middle (M) character-shapes. Table II shows examples Character Shape Labels E M B E B A ."ّ" of (A), (B), (M), and (E) character-shapes of the letter Fig. 2. An Arabic word divided into PAWs with character-shapes annotated. Fig. 2 shows an example word consisting of three PAWs and annotates its character-shapes (as Alone (A), Beginning 4-Shapes Model 2-Shape Model 1-Shape Model (B), Middle (M), and Ending (E)). Note that Arabic text is (A) (E) (M) (B) (A)and (E) (M)and (B) (A), (E), (M), and (B) ٓـ ـٔـ ـْ ّ ٓـ ـٔـ ـْ ّ ٓـ ـٔـ ـْ ّ written from right to left; hence, the Beginning character-shape of an Arabic multi-lettered PAW appears at the right end of the ٗـ ـ٘ـ ـٖ ٕ ٗـ ـ٘ـ ـٖ ٕ ٗـ ـ٘ـ ـٖ ٕ PAW, whereas the Ending character-shape comes at the ٛـ ـٜـ ٛـ ـٜـ ـٙ ٚ ـٛ ٙ ٚـ ـٜـ ـleftmost end of it.