Naila Fareen1, Mohammad Abid Khan2 & Attash Durrani3

URDU NASKH TYPOGRAPHY PATTERN RECOGNITION OF TYPE FOUNDRIES

Naila Fareen, Dept. of Computer Science, Allama Iqbal Open University, Pakistan

Mohammad Abid Khan, Dept. of Computer Science, University of Peshawar, Peshawar, Pakistan

Attash Durrani, Dept. of Computer Science, Allama Iqbal Open University, Islamabad, Pakistan

Abstract

Optical Character Recognition (OCR) is the process of converting printed, handwritten or typed text into its equivalent machine-readable form. The study of OCR is becoming more popular all over the world, including Pakistan. A great deal of work has been published on the literature and history of Muslims, and these old documents need to be transferred into electronic form. This research addresses pattern recognition of typed Urdu Naskh as produced by type foundries, and focuses primarily on recognizing the typed Urdu Naskh font. The aim is to develop a more reliable, accurate and fast system for typed Urdu Naskh pattern recognition, so that the cultural heritage left by our ancestors over centuries can be put to use. The research introduces and evaluates different font sizes of the Naskh font of type foundries from the perspective of spatial resolution, normalization, training, segmentation and downsampling. A new method is proposed for the offline printed character recognition problem, which applies segmentation, training and recognition algorithms in sequence. The proposed method is tested on the digits and characters of various type foundries. The system has been implemented using the Self Organization Map (SOM) technique, and a recognition rate of 98.18% is achieved.

Keywords: Urdu OCR, Naskh Typography, Pattern recognition, Urdu Type Foundries, Urdu Offline OCR.

1. INTRODUCTION

The study of OCR is becoming more popular day by day all over the world, including Pakistan. Over the centuries, a great deal of work has been published on Urdu literature, social sciences, the history of Muslims, Islam and many other areas. This material needs to be transferred to electronic form automatically, to avoid the intensive labour involved in re-composing such printed material. In this way, this treasure of knowledge and culture will be preserved to guide future generations, and the text can be processed easily to draw useful conclusions. As electronic media become more and more widespread, the need arises to transfer older documents (books, newspapers and magazines) to the electronic domain. In particular, published Urdu books that were calligraphed with metallic nibs of different point sizes, or printed with the lead type of foundries, are easy to scan if an OCR system is developed accordingly. This research mainly focuses on the Naskh font of type foundries. Figure 1 shows the character set of Urdu Naskh, including the numerals.

1 International Journal of Computational Linguistics (IJCL), Volume (3) : Issue (2) : 2012 Naila Fareen1, Mohammad Abid Khan2 & Attash Durrani3

Figure 1: Urdu Naskh Character Set

Recognition of individual characters within a ligature becomes quite difficult as the characters in Urdu overlap vertically and do not touch each other [1].

2. Type Foundries

Before the advent of the offset printing press, all books were printed with the fonts developed by type foundries. These foundries cast the lead characters and symbols used for composing text. The type was inked and then pressed onto the paper for printing. Foundry printing is thus a mechanical process in which cast characters transfer an inked image, which produces more noise and makes the printed characters difficult to recognize. Like books in other languages, Urdu books were also set in Urdu type prepared by these foundries.

3. Compound Characters

In Urdu, each character has two to four different shapes depending upon its position in the word: isolated, initial, medial or final. Figure 2 shows the four different shapes of the letter “ ” of the Urdu alphabet. The different shapes of the basic isolated characters of the Urdu alphabet are shown in Appendix B.

Figure 2: Basic Shapes

Compound characters are created in Urdu by combining two or more individual characters. Examples of some compound characters are shown in Figure 3.


Figure 3: Compound Characters

Urdu is written from right to left, in contrast to other Indian scripts and European scripts, which are written from left to right. Urdu, Arabic and Farsi scripts are structurally similar, and all of them are basically called Arabic script. Urdu contains Arabic as well as Farsi characters. In Unicode, this script is called Arabic and is assigned code chart 06 (code points U+0600 to U+06FF). Unicode Table 06 is shown in Appendix A (The Unicode Consortium, 2011).

4. OCR Functions and Problems

This research primarily focuses on recognizing the typed Urdu Naskh font. It introduces and evaluates different font sizes of the Naskh font of type foundries from the perspective of spatial resolution, normalization, training, segmentation and downsampling. The literature review shows that researchers are working on Nastaleeq, but there is no work on the Naskh font of type foundries. The typography of the typed Urdu Naskh font, which was developed and used by foundries before the advent of computerized printing, has not been addressed by any researcher. The following are some challenges in accomplishing this work for safeguarding national heritage.

1. Availability of the font

2. Condition and color of foundry books and pages

3. Segmentation

4. Downsampling

5. Obtaining machine-readable data

6. Improving the recognition accuracy rate

5. Research Solutions

1. The basic characters of the Naskh foundry font were not easily available. For testing purposes, basic characters and sample pages were obtained from punchand industries and press Lahore, and the other material was collected from the library of the National Language Authority.

2. Books printed in the foundry font were in poor condition and their pages had turned yellow. Detection and enhancement algorithms were therefore used to remove noise from the pages and make them usable for the Urdu Naskh OCR testing procedure.

3. The algorithms developed and used for line segmentation and word segmentation are simple and easy to implement.

4. Downsampling transforms the image into a much lower-resolution image. It removes non-essential white space around the character in the given image, so that only the area occupied by the character is processed. A lower-resolution image requires fewer input neurons than a full-sized image. Downsampling also neutralizes the character size, so it does not matter whether a large or a small character is presented.

5. The program is first trained using letters acquired from the scanner or from a directory. Before training, the system adds the Urdu Naskh font samples to the file “training.dat” and saves that file; the training file then has to be loaded each time for recognition. To produce machine-readable output, the system loads the training file, runs the training process, loads the input image, downsamples it and performs recognition.


6. To improve recognition speed, we simplified the recognition algorithms to save processing time.

6. Scanning

In this phase, a grayscale image is captured and digitized using a scanner. It is important to obtain as good a quality image as possible so that information can be extracted from it later. In real applications, however, clean, good-quality documents are seldom encountered. Smudged characters, poor print quality and poor contrast between the writing and the background are common problems with the digital images obtained in testing. Images are generally scanned at 300 dpi to obtain adequate results.

7. Image Resolution

Digital images are made up of small squares called pixels. Image quality depends on image resolution. There are two types of image resolution: spatial and output. Spatial resolution is defined in terms of width and height in pixels. Output resolution is defined in terms of the number of dots or pixels per inch (dpi). The higher the spatial resolution, the more pixels are available to create the image, which makes more detail possible and creates a sharper image. Figure 4 shows a 512 × 512 image sub-sampled down to 16 × 16; Figure 5 shows the result of stretching the sub-samples back to the size of the original 512 × 512 image.

Figure 4: A 512×512 image sub-sampled down to 16×16.

Figure 5: Results of stretching the sub-samples to the same size of the original 512×512 image.
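The sub-sampling and stretching illustrated in Figures 4 and 5 can be sketched in a few lines. This is a minimal illustration only, assuming that sub-sampling keeps every 32nd pixel and that stretching replicates pixels (nearest neighbour); the function names `subsample` and `stretch` and the synthetic test image are hypothetical and are not taken from the original system.

```python
def subsample(image, factor):
    """Keep every `factor`-th pixel in both directions."""
    return [row[::factor] for row in image[::factor]]


def stretch(image, factor):
    """Enlarge an image by replicating each pixel `factor` times per axis."""
    out = []
    for row in image:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# Synthetic 512 x 512 grayscale ramp standing in for a scanned page.
original = [[(r + c) % 256 for c in range(512)] for r in range(512)]

small = subsample(original, 32)   # 512 / 32 = 16 pixels per side
restored = stretch(small, 32)     # 512 x 512 again, but visibly blocky
```

The stretched image has the original dimensions, but each 32 × 32 block carries a single gray value, which is why detail lost by sub-sampling cannot be recovered by stretching.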

8. Thresholding

To determine the value of a character, the next step is to extract a binary (0, 1) image from the digitized image. Image thresholding classifies the pixels of an image into foreground and background. The general idea is to take the grayscale values, which range from 0 to 255, and convert pixels below a certain gray level into foreground and pixels above that level into background. Contrast and brightness vary between characters within an image and between different images, and whether the background and writing appear clear or dark depends on image quality, so a dynamic process must be used to threshold the image. For example, one approach assumes that the pixels in a corner of the image belong to the background and derives the threshold value from them. This method gives poor and spotty results because it ignores the foreground and pays attention only to the shade of the background. Another approach uses a histogram of the gray values of the image, in which the foreground pixels are indicated by a smaller peak and the background pixels by the larger peak. Figure 6 shows the method of selecting the thresholding point and its upper and lower bounds. Black pixels represent the foreground and white pixels represent the background.

Figure 6: Method of selecting thresholding point.
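As an illustration of histogram-based threshold selection, the sketch below uses the iterative intermeans rule (the threshold is repeatedly moved to the midpoint between the mean dark and mean light gray values). This is a swapped-in, well-known technique chosen for brevity and is not claimed to be the selection method of Figure 6; the names `select_threshold` and `binarize` and the tiny sample image are hypothetical.

```python
def histogram(pixels):
    """Count how many pixels take each gray value 0..255."""
    counts = [0] * 256
    for p in pixels:
        counts[p] += 1
    return counts


def select_threshold(pixels):
    """Iterative intermeans: midpoint between mean dark and mean light values."""
    counts = histogram(pixels)
    total = sum(counts)
    t = sum(g * n for g, n in enumerate(counts)) / total  # start at global mean
    while True:
        dark_n = sum(counts[g] for g in range(256) if g <= t)
        light_n = total - dark_n
        if dark_n == 0 or light_n == 0:
            return t  # only one gray population is present
        dark_mean = sum(g * counts[g] for g in range(256) if g <= t) / dark_n
        light_mean = sum(g * counts[g] for g in range(256) if g > t) / light_n
        new_t = (dark_mean + light_mean) / 2
        if abs(new_t - t) < 0.5:
            return new_t
        t = new_t


def binarize(image, t):
    """1 = foreground (dark ink), 0 = background (light paper)."""
    return [[1 if p <= t else 0 for p in row] for row in image]


# Light paper (220) with a few dark ink pixels (30).
image = [
    [220, 220, 220, 220],
    [220, 30, 30, 220],
    [220, 220, 30, 220],
]
threshold = select_threshold([p for row in image for p in row])
binary = binarize(image, threshold)
```

Because the rule looks at both gray populations, it avoids the weakness of the corner-based method described above, which considers only the background shade.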

9. Training

In this training, no expected output is provided. Such training is used by a neural network to classify the input into several groups. This work uses the unsupervised method for training: the neural network is provided with sample data that does not contain anticipated outputs.

9.1 Unsupervised Training

An unsupervised neural network is the opposite of a supervised one: the output is not known, so the network is allowed to settle into suitable states by discovering special features and patterns in the available data without external help. During training, the input is classified into several groups, and each output unit is trained to respond to a group of patterns in the input. In this paradigm, the system is supposed to discover statistically salient features of the input population. Unlike the supervised learning paradigm, there is no predefined set of categories into which the patterns are to be classified.

10. Segmentation

Segmentation is the process of dividing an image into its constituent components or objects. In general, segmentation into sub-regions is one of the more intricate tasks in digital image processing. A rugged segmentation procedure carries the process a long way towards a successful solution of imaging problems that require objects to be identified individually. On the other hand, weak or erratic segmentation algorithms almost always guarantee eventual failure. In short, the more accurate the segmentation, the more likely recognition is to succeed [6].

Segmentation is a very important stage for Farsi/Arabic character recognition systems. There are two main approaches to word recognition: segmentation-based and segmentation-free. The authors of [2] use the segmentation-based approach, in which each word or sub-word is first split into a set of single characters. The word is then recognized from the sequence of its characters. Figure 7 shows different shapes of characters, letters and sub-words.


Figure 7: Some characteristics of Farsi/Arabic script [5].

The authors of [2] proposed a new segmentation algorithm for multi-font Farsi/Arabic texts. The algorithm was tested on printed Farsi text containing a data set of 22236 characters and achieved 97% accuracy in segmentation. They note that there are 32 Farsi/Arabic characters and that 7 of them do not join to their left neighbouring characters. However, they do not list the seven characters that cause the break in a ligature. If the four characters in Figure 7 are considered together with their diacritics, there are 10 characters that do not join to their left neighbouring character, as given in Figure 8.

Figure 8: Disjoint ten Urdu characters
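To illustrate why non-joining characters matter for segmentation, the sketch below splits a word (in logical, right-to-left reading order) into its connected ligatures: a ligature ends after every character that does not join to its left neighbour. The set of ten letters used here is an assumption based on the commonly listed Urdu non-joiners (alef, dal, ddal, thal, reh, rreh, zain, jeh, waw, yeh barree); it is not read from Figure 8 and may differ from the set shown there. The function name `split_ligatures` is hypothetical.

```python
# ASSUMPTION: commonly listed Urdu non-joining letters, not taken from Figure 8.
NON_JOINERS = {
    "\u0627",  # alef
    "\u062F",  # dal
    "\u0688",  # ddal
    "\u0630",  # thal
    "\u0631",  # reh
    "\u0691",  # rreh
    "\u0632",  # zain
    "\u0698",  # jeh
    "\u0648",  # waw
    "\u06D2",  # yeh barree
}


def split_ligatures(word):
    """Split a word into ligatures; a non-joiner closes the current ligature."""
    parts, current = [], ""
    for ch in word:
        current += ch
        if ch in NON_JOINERS:
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts


kitab = "\u06A9\u062A\u0627\u0628"  # keheh, teh, alef, beh ("book")
urdu = "\u0627\u0631\u062F\u0648"   # alef, reh, dal, waw ("Urdu")
```

With this assumed set, the word for "book" breaks into two ligatures after the alef, while every letter of the word "Urdu" stands alone.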

It should also be stressed that the segmentation of isolated symbol images plays a key role in the final accuracy of a character recognition system; errors at this stage are reported to make up a large portion of overall recognition errors [5]. The effect of character spacing on recognition is reproduced in Figure 9.

Figure 9: Results of character spacing on recognition. [3]

10.1. Line Segmentation

Line segmentation is the process of extracting the individual text lines from the image. The most frequently used technique for extracting lines from a document is the horizontal projection of the document image. When the text lines are straight and well separated, the horizontal projection shows distinct peaks and valleys, and the valleys serve as dividers between the text lines. These valleys are easily detected and are used to determine the location of the boundaries between lines.
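A minimal sketch of line segmentation by horizontal projection follows. It assumes a binarized page in which 1 marks ink and 0 marks paper, and it treats every row with a zero ink count as part of a valley; a real scanned page would likely need a small noise tolerance. The function names and the toy page are hypothetical.

```python
def horizontal_projection(binary):
    """Number of ink pixels in each row of a binary image."""
    return [sum(row) for row in binary]


def segment_lines(binary):
    """Return (top, bottom) row ranges of text lines; bottom is exclusive."""
    profile = horizontal_projection(binary)
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y                  # a peak begins: top of a text line
        elif ink == 0 and start is not None:
            lines.append((start, y))   # a valley begins: bottom of the line
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
line_ranges = segment_lines(page)
```

Each returned range can be used to slice the page into one image per text line, which is then passed on to word segmentation.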

10.2. Word Segmentation

Word segmentation is the process of extracting the individual words from the segmented lines; that is, dividing a string of written language into its constituent words. It relies on the observation that there is a distance between two words. Figure 10 shows the word segmentation.

Figure 10: Word Segmentation
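The idea that words are separated by a distance can be sketched with a vertical projection of a single text line: runs of empty columns at least `min_gap` wide are treated as word boundaries, while narrower gaps are kept inside a word. The value of `min_gap`, the function names and the toy line image are assumptions for illustration, not parameters of the original system. Spans are returned rightmost first, since Urdu is read from right to left.

```python
def vertical_projection(line):
    """Number of ink pixels in each column of a binary line image."""
    return [sum(col) for col in zip(*line)]


def segment_words(line, min_gap=3):
    """Return (left, right) column spans of words, rightmost word first."""
    profile = vertical_projection(line)
    spans, start, gap = [], None, 0
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:         # gap wide enough to separate words
                spans.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        spans.append((start, len(profile) - gap))
    return spans[::-1]


line = [
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1, 0],
]
word_spans = segment_words(line, min_gap=3)
```

In the toy line, the one-column gap at column 2 stays inside the left word, while the three-column gap separates the two words.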

10.3. Character Segmentation

Character segmentation is the process of extracting the individual characters from the segmented words. It is a crucial step in OCR systems because it extracts meaningful regions for analysis. This step attempts to decompose the image into classifiable units called characters. Figure 11 shows the ligature, initial, final and isolated shapes of characters produced by the character segmentation process.

Figure 11: Character Segmentation

11. Recognition Process

Recognition almost always follows the output of a segmentation stage, which usually is raw pixel data constituting either the boundary of a region (i.e., the set of pixels separating one image region from another) or the whole region itself. Recognition is the process that assigns a label (e.g., "vehicle") to an object based on its descriptors. The author of [7] concludes the coverage of digital image processing with the development of methods for the recognition of individual objects [7].

In [4], Hindi words are identified in bilingual or multilingual documents based on features of the Devanagari script or using Support Vector Machines. The authors report the following six causes of incorrect recognition [4].

1. Incorrect word segmentation.

2. Incorrect character segmentation.

3. Missing punctuation such as commas, periods and parentheses.

4. Character misclassification due to noise.

5. Character similarity.


6. Special symbols which are noise-like.

Identified words are then segmented into individual characters in the next step, where the composite characters are identified and further segmented based on the structural properties of the script and statistical information. Their system was based on (1) script identification; (2) character segmentation; (3) training sample creation and (4) character recognition [4]. The Self Organization Map (SOM) is a popular neural network based on unsupervised learning. The recognition process of this research work is based on SOM.

12. Topology Diagram

The Self Organization Map (SOM) is also called topology preserving because it preserves the relative distance between points. A SOM contains only an input layer and an output layer; there is no hidden layer [8]. As shown in Figure 12, the SOM needs normalized inputs to determine the winning neuron. Only one output neuron fires for a given input, so each output neuron effectively produces a single value, either 0 or 1. As a result, the output from the SOM is usually the index of the fired neuron.

Figure 12: Topology Diagram showing the Input labels, Output labels and winning Neuron.
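The selection of the winning neuron can be sketched as follows. This is a simplified illustration that assumes plain unit-length (Euclidean) normalization of the input and a dot product between the input and each output neuron's weight vector; the normalization actually used in [8] or in the implemented system may differ. The names `normalize` and `winning_neuron` and the example weights are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(v * v for v in vec))
    return [v / length for v in vec] if length else list(vec)


def winning_neuron(inputs, weights):
    """Index of the output neuron with the largest output for this input."""
    x = normalize(inputs)
    outputs = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(len(outputs)), key=outputs.__getitem__)


# Two output neurons, each with one weight per input neuron (4 inputs here).
weights = [
    normalize([1, 0, 0, 1]),
    normalize([0, 1, 1, 0]),
]
exact = winning_neuron([1, 0, 0, 1], weights)
noisy = winning_neuron([0, 1, 1, 1], weights)
```

The network reports only the index of the fired neuron, so a separate table (built from the training samples) is needed to map that index back to a character.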

13. Downsampling the Image

After line segmentation and word segmentation, the segmented images must be downsampled. A solution is proposed to downsample the segmented text images for the recognition process. Downsampling neutralizes the size of a given character: the system may have been trained with characters of a particular size, e.g. 14 point, while at recognition time a smaller or larger character may be presented. With downsampling, training and recognition become independent of font size. Downsampling also removes non-essential white space around the character in the given image, so that only the area occupied by the character is processed. The character may lie at the top, bottom, right or left of the image, but the system focuses on the character itself and not on its position. Downsampling has two main advantages. First, a lower-resolution image requires fewer input neurons than a full-sized image. Second, downsampling neutralizes the character size, so it does not matter whether a large or a small character is presented. In short, it is a technique for transforming the image into a much lower-resolution image.
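A minimal sketch of this kind of downsampling is given below. It assumes a binary character image (1 = ink), crops it to the bounding box of the ink, and then marks a grid cell as 1 if any pixel in the corresponding region is ink. The cell rule, the function names and the reading of the 50 × 70 grid as 50 columns by 70 rows are assumptions for illustration.

```python
def crop_to_ink(binary):
    """Remove surrounding white space: crop to the bounding box of the ink."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    if not rows:
        return binary
    cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    return [row[cols[0]:cols[-1] + 1] for row in binary[rows[0]:rows[-1] + 1]]


def downsample(binary, grid_w, grid_h):
    """Map the cropped character onto a fixed grid_w x grid_h grid."""
    cropped = crop_to_ink(binary)
    h, w = len(cropped), len(cropped[0])
    grid = []
    for gy in range(grid_h):
        y0 = gy * h // grid_h
        y1 = max(y0 + 1, (gy + 1) * h // grid_h)
        row = []
        for gx in range(grid_w):
            x0 = gx * w // grid_w
            x1 = max(x0 + 1, (gx + 1) * w // grid_w)
            ink = any(cropped[y][x] for y in range(y0, y1) for x in range(x0, x1))
            row.append(1 if ink else 0)
        grid.append(row)
    return grid


small = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# The same shape drawn three times larger, with three times the margin.
large = [[p for p in row for _ in range(3)] for row in small for _ in range(3)]

grid_small = downsample(small, 2, 2)
grid_large = downsample(large, 2, 2)
```

Both sizes of the same shape map to the same grid, which is the size neutralization described above; the position of the character inside the image has no influence because the margins are cropped first.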

14. Self Organization Map (SOM)

The Self Organization Map (SOM) is a popular neural network based on unsupervised learning. The output nodes, including the neighbours of the winning node, are arranged in a topological structure (a lattice, a graph or a net). The software determines the winning neuron by comparing the values of the output neurons.


The neuron with the largest output value is called the winning neuron. The weights of the winning node are adjusted according to the distance of the winning node from the input.

Figure 13: Winning Node, Adjacent nodes and two dimensional array of SOM

Figure 13 shows a two-dimensional array of nodes. The filled circle is the winning node, and the three rectangles mark neighbourhoods of different sizes around it. The SOM yields the effect that nodes adjacent to the winning node in the array are also adjacent in the feature space. Like hard and soft competitive learning, the SOM can be used for vector quantization, but it is particularly appropriate for finding adjacent nodes [8]. In a SOM, weights are adjusted in each epoch. An epoch occurs when the training data is presented to the self-organizing map and the weights are adjusted based on the results for this data. The original method for calculating the changes to the weights is the additive method. Equation 1 shows how the SOM weights are adjusted by the additive method.

$$w^{t+1} = \frac{w^{t} + \alpha x}{\lVert w^{t} + \alpha x \rVert} \qquad (1)$$

The variable $x$ is the training vector presented to the network, $w^{t}$ is the weight of the winning neuron, $\alpha$ is the learning rate, and the outcome of the equation, $w^{t+1}$, is the new weight. This additive method generally works well for most self-organizing maps; however, in cases for which the additive method shows excessive instability and fails to converge, an alternate method can be used. This method is called the subtractive method. Equations 2 and 3 show the subtractive method of adjusting the SOM weights [8].

$$e = x - w^{t} \qquad (2)$$

$$w^{t+1} = w^{t} + \alpha e \qquad (3)$$

These two equations describe the basic transformation that occurs on the weights of the network. The downsampled image is then used to feed the input neurons of the Self Organization Map (SOM), which means that there is one input neuron for every pixel of the downsampled image. Figure 14 shows the downsampled image of the proposed system with a grid of 50 × 70, so the total number of input neurons is 50 × 70 = 3500.
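The two weight adjustment rules can be sketched in code. The sketch assumes the forms of the additive rule (add the scaled training vector to the weight, then rescale to unit length) and the subtractive rule (move the weight a fraction of the way towards the training vector) as they are commonly stated for Kohonen networks; the learning rate `alpha` and the function names are assumptions introduced for illustration.

```python
import math


def additive_update(w, x, alpha):
    """Additive rule: w_new = (w + alpha * x) / ||w + alpha * x||."""
    summed = [wi + alpha * xi for wi, xi in zip(w, x)]
    length = math.sqrt(sum(s * s for s in summed))
    if length == 0:
        return list(w)  # degenerate case: leave the weight unchanged
    return [s / length for s in summed]


def subtractive_update(w, x, alpha):
    """Subtractive rule: e = x - w, then w_new = w + alpha * e."""
    return [wi + alpha * (xi - wi) for wi, xi in zip(w, x)]


winner = [0.0, 1.0]      # weight vector of the winning neuron
sample = [1.0, 0.0]      # training vector presented to the network

added = additive_update(winner, sample, alpha=1.0)
moved = subtractive_update(winner, sample, alpha=0.5)
```

In both rules the weight of the winning neuron moves towards the training vector; the additive rule keeps the weight at unit length, while the subtractive rule moves it along the straight line between the old weight and the sample.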


Figure 14: Downsampling the Image

The number of output neurons depends directly on the total number of character samples provided. In this case, the training.dat file contains the following character image samples.

1. Figure 15 shows the 38 isolated Urdu characters of the Urdu Naskh font of the type foundry. Urdu Naskh characters have no base line.

Figure 15: Urdu Naskh Basic Characters of Type Foundry

2. Figure 16 shows the Urdu numbers from zero to nine of type foundries.

Figure 16: Urdu Naskh numbers of Type Foundry

3. English numbers are also used in the type foundries, so this application is also able to recognize them. Figure 17 shows the English numbers.

Figure 17: English Numbers

4. Symbols are also used in the type foundries. Figure 18 shows some of these symbols.

Figure 18: Few Symbols of Type Foundry


5. In addition to the isolated, initial, medial and final characters, Urdu numbers, English numbers and symbols, the foundries also use ligatures, as shown in Figure 19.

Figure 19: Some Ligatures of Type Foundry

15. Conclusion

The overall requirement of the proposed system is to receive printed text from the user and convert it into machine-readable form, so that researchers can search the material easily. The system performs segmentation, removes non-essential white space from the given image by downsampling, and recognizes the text using downsampling, normalization and the Self Organization Map (SOM). Unsupervised training is used in this research because no expected output is provided during training. The main purpose of the research is to give the general public access to the material (books) of the type foundries in electronic form. The Naskh OCR operates on the input image and efficiently recognizes ligatures and words; the work also involved collecting real-world data and scanning and labeling it. Each of the recognition approaches to OCR is tested. Selected books from Majlis-e-Tarqi-e-Adab are used for testing. The recognition of basic Urdu characters, Urdu digits, English digits, symbols and Urdu ligatures is tested.

16. References

Conference Papers:

[1] S. A. Hussain, A. Sajjad & F. Anwar “Online Urdu Character Recognition System”, IAPR Conference on Machine Vision Applications, Tokyo, Japan, pp. 98-101, 2007.

[2] M. Omidyeganeh, K. Nayabi, R. Azmi & A. Javadtalab “A New Segmentation Technique for Multi Font Farsi/Arabic Texts”, ICASSP, IEEE, 757-760, 2005.

[3] M. Bokser “Omnidocument Technologies”, Proceedings of the IEEE, 80(7), 1066-1078, 1992.

[4] H. Ma & D. Doermann “Adaptive Hindi OCR Using Generalized Hausdorff Image Comparison”, ACM Transactions on Asian Language Information Processing, 2(3), 193-218, 2003.

Thesis:

[5] S. Leishman “Shape-Free Statistical Information in Optical Character Recognition”, MS thesis, University of Toronto, 2007.

Books:

[6] D. Zhang, X. Jing & J. Yang “Biometric Image Discrimination Technologies”, Idea Group Publishing, 2006.

[7] A. Bovik “The Essential Guide to Image Processing”, Elsevier Inc, 2009.

[8] J. Heaton “Introduction to Neural Networks for C#”, Publisher: Heaton Research, Inc, Editor: WordsRU.com, ISBN: 1-60439-009-3, Second Edition, Softcover, 2008.



APPENDICES APPENDIX A-1

Table A.1 Unicode Table 06
