
URDU NASKH TYPOGRAPHY PATTERN RECOGNITION OF TYPE FOUNDRIES

Naila Fareen
Dept. of Computer Science, Allama Iqbal Open University
Islamabad, Pakistan
[email protected]

Mohammad Abid Khan
Dept. of Computer Science, University of Peshawar
Peshawar, Pakistan
[email protected]

Attash Durrani
Dept. of Computer Science, Allama Iqbal Open University
Islamabad, Pakistan
[email protected]

Abstract

Optical Character Recognition (OCR) is the process of converting printed, handwritten and typed text into its equivalent machine-readable form. The study of OCR is becoming more popular all over the world, including Pakistan. A lot of work has been done on Urdu literature and the history of Muslims, and these old documents need to be transferred into electronic form. This research addresses pattern recognition of the typed Urdu Naskh font of type foundries. The aim is to develop a more reliable, accurate and quick system for typed Urdu Naskh pattern recognition, so that the cultural heritage left by our ancestors over centuries can be put to use. The research introduces and evaluates different font sizes of the Naskh font of type foundries from the perspective of spatial resolution, normalization, training, segmentation and downsampling. A new method is proposed for the offline printed character recognition problem, which uses a sequence of segmentation, training and recognition algorithms. The proposed method is tested on the digits and characters of various type foundries. The system has been implemented using the self-organizing map (SOM) technique; a recognition rate of 98.18% is achieved.

Keywords: Urdu OCR, Naskh Typography, Pattern Recognition, Urdu Type Foundries, Urdu Offline OCR.

1. INTRODUCTION

The study of OCR is becoming more popular day by day all over the world, including Pakistan.
For centuries, plenty of work has been done on Urdu literature, social sciences, the history of Muslims, Islam and many other areas. This material needs to be transferred to electronic form automatically, to avoid the intensive labour involved in re-composing such printed material. In this way, this treasure of knowledge and culture will be preserved to guide future generations, and the text can be processed easily to draw useful conclusions. As electronic media become more and more widespread, the need to transfer older documents (books, newspapers and magazines) to the electronic domain arises. In particular, published Urdu books calligraphed with metallic nibs of different point sizes, or printed in foundry lead type, are easy to scan, provided an OCR is developed accordingly. This research mainly focuses on the Naskh font of type foundries. Figure 1 shows the character set of Urdu Naskh, including the numerals of Urdu Naskh.

International Journal of Computational Linguistics (IJCL), Volume (3) : Issue (2) : 2012

Figure 1: Urdu Naskh Character Set

Recognition of individual characters within a ligature becomes quite difficult, as the characters in Urdu overlap vertically and do not touch each other [1].

2. Type Foundries

Before the advent of the offset printing press, all books were printed with the fonts developed by type foundries. These type foundries held the lead characters and symbols used for writing. The type was inked and then pressed onto the paper for printing. Foundry printing is a mechanical process that uses cast characters to print the inked image; the resulting prints contain more noise and are difficult to recognize. Just like books in other languages, Urdu books were also set in the Urdu types prepared by those type foundries.

3. Compound Characters

In Urdu, each character has two to four different shapes depending upon its position in the word: isolated, initial, medial or final.
Figure 2 shows the four different shapes of the letter “ ” of the Urdu alphabet. The four different shapes of the basic isolated characters of the Urdu alphabet are shown in Appendix B.

Figure 2: Basic Shapes

Compound characters are created in Urdu by combining two or more individual characters. Examples of some compound characters are shown in Figure 3.

Figure 3: Compound Characters

Urdu is written from right to left, in contrast to other Indian and European scripts, which are written from left to right. There is a structural similarity between the Urdu, Arabic and Farsi scripts; basically, all of these are called Arabic script. Urdu contains Arabic as well as Farsi characters. In Unicode, this script is called Arabic and is assigned table 06 (the Arabic block, U+0600 to U+06FF). Unicode Table 6 is shown in Appendix A (The Unicode Consortium, 2011).

4. OCR Functions and Problems

This research primarily focuses on recognizing the typed Urdu Naskh font. It introduces and evaluates different font sizes of the Naskh font of type foundries from the perspective of spatial resolution, normalization, training, segmentation and downsampling. The literature review shows that researchers are working on Nastaleeq, but there is no work on the Naskh font of type foundries. The typed Urdu Naskh font that was developed and used by foundries before the advent of computerized printing has not been addressed by any researcher. The following are some challenges in accomplishing this work for safeguarding national heritage.

1. Availability of the font
2. Foundry books and page color
3. Segmentation
4. Downsampling
5. Obtaining machine-readable data
6. Improving the recognition accuracy rate

5. Research Solutions

1. The basic characters of the Naskh foundry font were not easily available.
For testing purposes, basic characters and sample pages were brought in from punchand industries and press, Lahore, and the other material was collected from the library of the National Language Authority.
2. Books printed in the foundry font were in poor condition and the pages had turned yellow. Detection and enhancement algorithms were therefore used to remove noise from the pages, making them usable for the Urdu Naskh OCR testing procedure.
3. The algorithms developed and used for line segmentation and word segmentation are simple and easy to implement.
4. Downsampling is the technique of transforming the image into a much lower-resolution image. It removes any non-essential white space around the character in the given image, so that only the area occupied by the character is processed. A lower-resolution image requires fewer input neurons than the full-sized image. Downsampling also neutralizes the character size: the result is the same whether a large or a small character was drawn.
5. The program is first trained using letters that have actually been acquired from the scanner or from a directory. Before training, the system adds the Urdu Naskh font to the file “training .dat” and saves that file; for recognition, it has to load the training file each time. To produce machine-readable output, the system performs loading of the training file, the training process, loading of the input image, sampling and the recognition process.
6. To improve recognition speed, we simplified the recognition algorithms to save processing time.

6. Scanning

In this phase, a grayscale image is captured and digitized using a scanner. It is important to capture as good a quality image as possible, in order to obtain information from the image later.
However, in real applications, clean, good-quality documents are seldom encountered. Smudged characters, poor print quality and poor contrast between the writing and the background are common problems with the digital images obtained in testing. Generally, images are scanned at 300 dpi to obtain adequate results.

7. Image Resolution

Digital images are made up of small squares called pixels. Image quality depends on image resolution. There are two types of image resolution: spatial and output. Spatial resolution is defined in terms of width and height in pixels. Output resolution is defined in terms of the number of dots or pixels per inch (dpi). The higher the spatial resolution, the more pixels are available to create the image. This makes more detail possible and creates a sharper image. Figure 4 shows a 512 × 512 image sub-sampled down to 16 × 16; Figure 5 shows the results of stretching the sub-samples back to the size of the original 512 × 512 image.

Figure 4: A 512×512 image sub-sampled down to 16×16.

Figure 5: Results of stretching the sub-samples to the same size as the original 512×512 image.

8. Thresholding

To determine the value of a character, the next step is to extract a binary (0, 1) image from the obtained digital image. Image thresholding classifies the pixels of an image as foreground or background. The general idea is to take the grayscale values, which range from 0 to 255, and to convert pixels on one side of a certain gray level into foreground and pixels on the other side into background. Thresholding is complicated by differences in contrast and brightness between characters in a given image and between various images. How light or dark the background and the writing appear depends on image quality. To threshold the image, a dynamic process must therefore be used.
For example, a threshold value can be derived from the pixels in a corner of the image, which are assumed to be background. This method gives poor and spotty results, because it ignores the foreground and pays attention only to the shade of the background.
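
The contrast between the corner-based estimate and a threshold computed from the whole image can be sketched as follows. This is an illustrative sketch only, not the implementation used in this research: Otsu's method is swapped in here as one common dynamic thresholding technique that the paper does not name, and the function names, the margin of 40 gray levels and the assumption of dark writing on a lighter page are all hypothetical choices made for the example.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale values in 0..255


def corner_threshold(img: Image, patch: int = 1, margin: int = 40) -> float:
    """Corner-based estimate: treat the top-left patch as background.

    The margin (hypothetical value) is subtracted so that only pixels
    clearly darker than the assumed background count as writing.
    """
    corner = [img[r][c] for r in range(patch) for c in range(patch)]
    background = sum(corner) / len(corner)
    return background - margin


def otsu_threshold(img: Image) -> int:
    """Otsu's method: pick the level that maximizes between-class variance.

    Unlike the corner estimate, this uses the histogram of the whole image,
    so both the writing and the background influence the result.
    """
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]          # pixels with value <= t
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue               # no dark class yet
        w_light = total - w_dark   # pixels with value > t
        if w_light == 0:
            break                  # no light class left
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(img: Image, threshold: float) -> List[List[int]]:
    """Return a (0, 1) image: 1 for writing (dark), 0 for background."""
    return [[1 if v <= threshold else 0 for v in row] for row in img]


# A tiny synthetic page: dark writing (20) on light paper (220).
page = [
    [220, 220, 220, 220],
    [220, 20, 20, 220],
    [220, 20, 20, 220],
    [220, 220, 220, 220],
]
binary_page = binarize(page, otsu_threshold(page))
```

On a clean page both estimates separate the writing from the paper. If the corner pixel happens to be smudged, the corner estimate shifts with it and can lose the writing entirely, whereas the histogram-based threshold still depends on the whole image.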
