International Journal of Advancements in Research & Technology, Volume 2, Issue 10, October-2013 174 ISSN 2278-7763

OCR for Classifying and Retrieving Different Syriac Forms

Prof. Dr. Abdul Monem S. Rahma1, Dr. Basima Z. Yacob2 and Danny T. Baito2

1 Computer Science, University of Technology, Baghdad. Email: [email protected]
2 Computer Science, University of Duhok, Duhok, Iraq. Email: sadaanweya@yahoo.com

ABSTRACT In this paper, a method is presented for recognizing Syriac alphabet forms using invariant moments and for retrieving the other forms of the recognized character. The character is recognized first, and then its other two forms are retrieved. A minimum-distance nearest-neighbor classifier is adopted for classification. The experimental results show a recognition and retrieval accuracy of 100% for the different Syriac alphabet forms.

Keywords: OCR, Syriac alphabet forms, West Syriac (Serta) alphabet, Estrangela alphabet

1 INTRODUCTION
Optical character recognition (OCR) is one of the important active areas of research in pattern recognition. At present, communications, business and trade are carried out through technology; hence, the development of reliable OCR systems for different scripts and languages is inevitable. Even though reasonably good OCR systems are available in the market for local and universal languages and scripts, the recognition rate drops drastically when OCR reads a document containing different font styles (forms) of the characters [1].

In the past several decades, a large number of OCR systems have been developed for natural languages [2], [3], [4]. However, the problem of Syriac character recognition has rarely been addressed [5]. The purpose of this paper is to develop an efficient system for recognizing different Syriac alphabet forms by computing invariant moments as features. Syriac is an ancient language of Iraq that is still in cultural use there. It has many religious scripts as well as scientific and literary books, which have been written in different Syriac alphabet forms (Estrangela (bible font), Serta (West) and Madnkhaya (East)).

In [5], the authors used invariant moments as features to recognize the East Syriac alphabet form. In this paper, the Estrangela (bible font) and Serta (West) forms are recognized by computing the invariant moments as features, and the corresponding characters in the other two Syriac alphabet forms are also retrieved.

For a language with multiple alphabet forms such as Syriac, it is common that many scripts and books through the long history of the language have been written in the three different alphabet forms. This creates the need for a single recognition system that covers the Syriac alphabet forms.

2 OVERVIEW OF SYRIAC ALPHABET FORMS
The Syriac language is one of the languages spoken in Iraq, Syria, Turkey and Iran by Assyrians. It is an ancient language, one of the rarest and oldest in the world. The Syriac alphabet consists of 22 characters and is written from right to left [6].

There are three important forms of the Syriac alphabet: Estrangela (bible font), Serta (West) and Madnkhaya (East).

Estrangela (bible font), meaning 'rounded', is the oldest form and is considered the classical version of the Syriac alphabet. It was revived during the 10th century and is now used mainly in scholarly publications, titles and inscriptions. Fig. 1 shows the Estrangela Syriac alphabet form.

Fig. 1. Estrangela Syriac alphabet form.

West Syriac is generally written with Serta, meaning 'line', which is also known as the Psheta (simple), Maronite or Jacobite form. It was modeled on Estrangela but with simpler, more flowing lines. A version of Serta appeared in the earliest Syriac manuscripts, and it became popular during the 8th century. Fig. 2 shows the Serta (West) Syriac alphabet form.

Fig. 2. Serta (West) Syriac alphabet form.

TABLE 1
SEVEN INVARIANT MOMENTS FOR SERTA (WEST) SYRIAC ALPHABET

East Syriac is usually written in the Madnkhaya (Eastern) form of the alphabet, which is also known as Swadaya (conversational or contemporary). Madnkhaya is closer to Estrangela than Serta is. Fig. 3 shows the Madnkhaya Syriac alphabet form [7][8][9].


Fig 3. Madnkḥaya(East) Syriac alphabet form.

3 OVERVIEW OF THE SYSTEM
The major steps involved in the recognition of characters include preprocessing, segmentation, feature extraction and classification [10].

The same OCR system stages used for East Syriac alphabet recognition in [5] are used to recognize the two other Syriac alphabet forms (Estrangela and Serta), except for the classification and recognition stage, which has been modified. The feature extraction stage extracts the moments for each Syriac alphabet form (Serta (West) and Estrangela) as attributes to build a database for each form. Tables 1, 2 and 3 show the seven invariant moments for the Serta (West), Estrangela (bible font) and East Syriac alphabets, respectively.
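The feature extraction stage can be illustrated with a short sketch. The paper does not restate the moment formulas, so this assumes that the "seven invariant moments" are Hu's classical moment invariants computed from normalised central moments. The function name `hu_moments` and the list-of-rows image representation are illustrative choices, not the authors' implementation.

```python
def hu_moments(image):
    """Return Hu's seven invariant moments of a 2-D image.

    `image` is a list of rows of pixel intensities (0 = background).
    Assumption: the paper's "seven invariant moments" are Hu's invariants.
    """
    # Keep only foreground pixels as (x, y, intensity) triples.
    pixels = [(x, y, float(v))
              for y, row in enumerate(image)
              for x, v in enumerate(row) if v]
    m00 = sum(v for _, _, v in pixels)
    if m00 == 0:
        raise ValueError("image has no foreground pixels")
    # Centroid, used to make the moments translation invariant.
    x_bar = sum(x * v for x, _, v in pixels) / m00
    y_bar = sum(y * v for _, y, v in pixels) / m00

    def eta(p, q):
        # Central moment mu_pq, normalised by m00 for scale invariance.
        mu = sum((x - x_bar) ** p * (y - y_bar) ** q * v for x, y, v in pixels)
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03          # recurring sums
    c, d = n30 - 3 * n12, 3 * n21 - n03  # recurring differences
    return [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ]
```

Calling `hu_moments` on each segmented character image of one alphabet form would give one seven-element feature vector per character, which is the kind of per-form database the section describes.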


TABLE 2
SEVEN INVARIANT MOMENTS FOR ESTRANGELA SYRIAC ALPHABET

TABLE 3
SEVEN INVARIANT MOMENTS FOR EAST SYRIAC ALPHABET [5]
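As a rough illustration of how per-form feature libraries such as those in Tables 1 to 3 can drive minimum-distance classification and index-based retrieval, here is a minimal Python sketch. All names (`LIBRARIES`, `euclidean`, `classify_and_retrieve`, the parameter `k`) and all numeric values are hypothetical placeholders, not the paper's data or code. The optional `k` restricts the comparison to the first `k` moments.

```python
import math

# Hypothetical feature libraries: form name -> one feature vector per
# character index. The numbers are made up for illustration only.
LIBRARIES = {
    "Estrangela": [[0.21, 0.010], [0.35, 0.042], [0.50, 0.003]],
    "Serta":      [[0.25, 0.020], [0.31, 0.060], [0.47, 0.009]],
    "East":       [[0.19, 0.015], [0.38, 0.051], [0.55, 0.001]],
}


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_and_retrieve(test_vec, libraries, k=None):
    """Nearest-neighbour classification followed by index-based retrieval.

    Returns (form, index, other_forms): the form and character index of the
    closest library entry, plus the names of the two other forms, whose
    entries at the same index are the retrieved counterparts.
    """
    best = min(
        (euclidean(test_vec[:k], vec[:k]), form, idx)
        for form, vectors in libraries.items()
        for idx, vec in enumerate(vectors)
    )
    _, form, idx = best
    other_forms = [f for f in libraries if f != form]
    return form, idx, other_forms


form, idx, others = classify_and_retrieve([0.30, 0.058], LIBRARIES)
print(form, idx, others)
```

With these placeholder values the test vector is closest to the second Serta entry, so the Estrangela and East entries at index 1 would be the retrieved counterparts.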

3.1 Classifying and Retrieving the Two Other Syriac Forms
Euclidean distances between the invariant moments of the training images and the test images are obtained, and the nearest-neighbor classifier is used to recognize the test character images. The nearest-neighbor classifier relies on a distance function between patterns. The Euclidean distance between the invariant moments of a trained sample and those of the test sample is obtained using the following formula:

$$ d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} $$

where a is the invariant-moment feature vector of the trained image, b is the feature vector of the test character image, and n is the number of moments used. The test image is preprocessed and its invariant moments are computed. The Euclidean distances between the training feature vectors and the test feature vector are calculated and stored in the library. The minimum distance is determined, and the test image is declared to belong to the corresponding class. Then the index of the recognized character is used to retrieve the two other forms. The following algorithm performs the classification and retrieval tasks.

Algorithm: Classifying and retrieving the two other Syriac forms
Input: Isolated test Syriac character images.
Output: Recognition of the Syriac character and retrieval of the two other Syriac character forms.
Begin
1. Compute the moments for the test character.
2. Compute the Euclidean distance between the moments of the test image and the training feature vectors of each class stored in the three libraries.
3. Assign the image to the class with the minimum distance and retrieve the two other Syriac forms that have the same index as the recognized image.
End.

4 EXPERIMENTS AND RESULTS
Databases for each Syriac alphabet form, East [5], Estrangela (bible font) and Serta (West), are built by calculating the moments of each character. The next step is to compute the Euclidean distance between the invariant moments of the entered character and those of each Syriac character, to be used later in classification. Recognition is achieved by selecting the shortest distance between the invariant moments of the entered character and the Syriac character database. The index of the recognized character is then used to retrieve the two other Syriac alphabet forms with the same index. For example, if “Taw” is recognized in the East form with index 27, then the two other forms of “Taw” in Estrangela and Serta (West) can be retrieved using the same index. Fig. 4 shows the forms of the letter “Taw”.

Fig. 4. Letter “Taw” has three different forms: (a) Estrangela form, (b) Serta (West) form, (c) Madnkḥaya (East) form.

Using seven, six, five, four, three, two or one moment as the extracted feature for classification gives the same recognition rate, and the recognition and retrieval accuracy is 100% for the different Syriac alphabet forms. The proposed OCR can therefore use one moment instead of two to seven moments, which reduces the time needed to recognize the character. Table 4 shows the recognition and retrieval time in milliseconds for one character using between 1 and 7 moments.

TABLE 4
THE RECOGNITION AND RETRIEVAL TIME (MS) USING BETWEEN 7 AND 1 MOMENTS

Number of moments    Recognition time (ms) for one character + retrieval time of the two other forms
7                    230.5
6                    227.5
5                    225.9
4                    225.5
3                    223.2
2                    222.0
1                    220.1

5 CONCLUSION
A simple recognition system for Syriac alphabet characters in the three different forms is proposed. It uses invariant moments as features and uses the index of the recognized letter to retrieve the other two forms of that letter. The test results show that using seven, six, five, four, three, two or one moment as a feature gives the same recognition and retrieval results. Consequently, one moment can be used instead of two to seven moments to recognize the character, which reduces the running time of the Syriac alphabet forms recognition system.

REFERENCES
[1] Hangarge, Shashikala Patil and B. V. Dhandra, “Multi-font/size Kannada Vowels and Numerals Recognition Based on Modified Invariant Moments”, IJCA Special Issue on “Recent Trends in Image Processing and Pattern Recognition”, RTIPPR, pp. 126-130, 2010.
[2] S. Sardar, A. Wahab, “Optical character recognition system for Urdu”, International Conference on Information and Emerging Technologies, June 2010.
[3] M. Soleymani and F. Razzazi, “An efficient front-end system for Isolated Persian/Arabic character Recognition of handwritten data-Entry Forms”, International Journal of Computational Intelligence, vol. 1, pp. 193-196, 2003.
[4] B. B. Chaudhuri and U. Pal, “A Complete Printed Bangla OCR System”, Pattern Recognition, Vol. 31, pp. 531-549, 1998.
[5] Abdul Monem S. Rahma, Basima Z. Yacob and Danny T. Baito, “Moment invariants Based Features Extraction For Classification of Syriac Alphabet Language”, International Journal of Advances in Engineering and Technology, Vol. 6, Issue 4, pp. 1442-1451, Sept. 2013.
[6] Rev. Shlemon I. Khoshaba, “Lessons in the teaching of the Syriac language”, AL-Mashrq House Cultural, Duhok, Iraq, 2010.
[7] http://learnaramaic.blogspot.com/2012/05/estrangelo-syriac- .html#.UjxT-NLwl7I, 2012.
[8] http://www.omniglot.com/writing/syriac.htm
[9] Syriac alphabet, http://en.wikipedia.org/wiki/Syriac_alphabet
[10] Dhandra B. V., Malemath V. S., Mallikarjun H., Hegadi Ravindra, “Multi-font English Character Recognition based on Modified Invariant Moments”, Journal of Combinatorial Mathematics and Combinatorial Computing, Vol. 67, pp. 153-162, 2008.

Copyright © 2013 SciResPub. IJOART