A TECHNIQUE FOR THE DESIGN AND IMPLEMENTATION OF AN OCR FOR PRINTED NASTALIQUE TEXT

By

SOHAIL ABDUL SATTAR

Thesis submitted for the Degree of Doctor of Philosophy

Department of Computer Science and Information Technology

N.E.D. University of Engineering & Technology, Karachi, Sindh, Pakistan

July, 2009

A TECHNIQUE FOR THE DESIGN AND IMPLEMENTATION OF AN OCR FOR PRINTED NASTALIQUE TEXT

Ph.D. Thesis

By

SOHAIL ABDUL SATTAR

Research supervisor: Dr. Shamsul Haque

Research co-supervisor: Dr. Mahmood Khan Pathan

Department of Computer Science and Information Technology

N.E.D. University of Engineering & Technology, Karachi, Sindh, Pakistan

July, 2009


Statement of Copyright

This copy of the thesis has been supplied on condition that anyone who consults it is understood to recognize that the copyright rests with the University and that no quotations from the thesis and no information derived from it may be published without the permission of the University.

© 2009 by NED University of Engineering & Technology


Certificate

Certified that the thesis entitled, “A Technique for the Design and Implementation of an OCR for Printed Nastalique Text”, which is being submitted by Sohail Abdul Sattar for the award of the degree of Doctor of Philosophy in the Department of Computer Science and Information Technology, NED University of Engineering and Technology, Karachi, is a record of the candidate’s own original work carried out by him under my supervision and guidance. The work incorporated in this thesis has not been submitted elsewhere for the award of any other degree.

Dr. Shamsul Haque

Research supervisor
Department of Computer Science & Information Technology
NED University of Engineering & Technology, Karachi.


Lovingly dedicated to the following departed souls,

my father, Dr. Abdul Ghaffar, my inspiration for achievement,

~

my beloved mother, Zubaida Khatoon, whose prayers stood by me at all times and whose confidence in me willed me to go on,

~

my respected supervisor Dr. Syed Salahuddin Hyder, for introducing me to the challenges in Computer vision and leading me to the path of research.


Acknowledgements

All praise and thanks to Almighty Allah (SWT) Who provides opportunities to gain knowledge and opens up ways to make the most daunting tasks possible and achievable.

My most humble thanks to,

The authorities of the NED University of Engineering and Technology for allowing me to pursue my dream in carrying out this research and for providing me with all the facilities throughout the course.

Dr. Shamsul Haque, my supervisor whose continuous guidance and support helped me bring this work to completion.

Dr. Mahmood Khan Pathan, my co-supervisor, for his supervision from proposal preparation till the finalizing of the thesis.

Dr. Mubarak Shah, my supervisor at the Computer Vision Lab, University of Central Florida, for his generous help and guidance in computer vision and for his invaluable ideas on quality paper writing.

Colleagues at the CV Lab, UCF, especially Javed, Arslan and Alexie.

My friends in Orlando whose hospitality and love made me feel at home and cared for.

My daughter Maryam, who had to struggle with Maths alone during my stay in Orlando.

My son Abdullah, who did not forget to brush his teeth, even though I was not there.

Many friends, colleagues and companions whose help and contributions have in many ways made my work easier and enjoyable.

Last, but not least, I am highly indebted to my wife, Aysha, for the endless hours she spent on proofreading and language corrections in this document and for standing by me whenever I was depressed or felt stressed.


Abstract

This thesis presents a novel segmentation-free technique for the design and implementation of an OCR (Optical Character Recognition) system for printed Nastalique text.

The specific area of this thesis is document understanding and recognition, which is a branch of computer vision and in turn a sub-class of Artificial Intelligence.

Optical character recognition is the translation of optically scanned bitmaps of printed or handwritten text into digitally editable data files. OCRs developed for many world languages are already in efficient use, but none exists for Nastalique – a calligraphic adaptation of the Arabic script, just as Jawi is for Malay. More often, a single script with its basic character shapes is adapted for writing in multiple languages, e.g. the Roman script for English, German and French, and the Arabic script for Persian, Sindhi, Pashtu and Malay.

Urdu has 39 characters against the Arabic 28. Each character then has two to four different shapes according to its position in the word: isolated, initial, medial and final. Many character shapes have multiple instances and are context sensitive – a character's shape changes with changes in the preceding or following character. At times even the third or the fourth character may cause a similar change, depicting an n-gram model in a Markov chain. Unlike in the Roman script, word and character overlapping in Nastalique makes optical recognition extremely complex.

Compared to Roman script languages’ OCRs, very little research work has been done on Arabic Naskh OCR. Only a few Arabic Naskh OCR systems are available today, and they too are far from perfect, lagging behind Roman script OCR systems in accuracy.

In this respect Nastalique is even more complicated than Naskh, as it has multiple baselines, more overlapping of characters within a ligature and between adjacent ligatures, vertical stacking of characters in a ligature, etc.

Urdu has still not attracted researchers’ attention for the development of an OCR, partly due to lack of funds in this area but mainly due to the challenges the Nastalique style offers because of its cursiveness and context-sensitivity. For the same reason, published research work in this area is nearly non-existent.


The proposed system for Nastalique OCR does not require segmentation of a ligature into constituent character shapes. However, it does require segmentation at two levels: first the text image is segmented into lines of text, then each line of text is further segmented into ligatures or isolated characters. The next step is a line-by-line cross-correlation for recognition of characters in the ligatures, whereby character codes are written into a text file in the sequence the characters are found in the ligature. As the recognition process is completed, the character codes in the text file are given to the rendering engine, which displays the recognized text in a text region.

The limitation of the proposed Nastalique character recognition system is that it is font dependent: it needs for recognition the same font file that was used to write the text. The undertaking has greater challenges as it aims to overcome the inherent cursiveness and context sensitivity of the Nastalique style of writing.

For Nastalique OCR, we develop character-based True Type Font files for a few Nastalique words. These words are written using the same character-based TTF font, and an image is made of the Nastalique text. The image is then given to our Nastalique OCR. After recognition, the rendering is done by using the same TTF font file to display the recognized text. The work is therefore threefold: development of a character-based Nastalique True Type Font, Nastalique character recognition, and rendering the recognized text using the character-based Nastalique True Type Font.

Since our character-based segmentation-free Nastalique OCR algorithm needs, as groundwork, a character-based Nastalique Text Processor, we have also proposed a Finite State Nastalique Text Processor Model. Implementation is not yet done, so results are not reported. However, this model could serve as an impetus for future research in this challenging field.

Optical Character Recognition for Roman script languages is almost a solved problem for document images and researchers are now focusing on extraction and recognition of text from video scenes. This new and emerging field in character recognition is called Video OCR and has numerous applications like video annotation, indexing, retrieval, search, digital libraries, and lecture video indexing.

This emerging field of character recognition is attracting research on other scripts like Chinese but, to the best of our knowledge, no work is reported as yet on Video OCR for Arabic script languages like Arabic, Persian and Urdu.


As an extension of our Nastalique OCR to Video OCR for Arabic script languages, we have also performed experiments on video text identification, localization and extraction for its recognition. We have used a MACH (Maximum Average Correlation Height) filter to identify text regions in video frames; these text regions are then localized and extracted for recognition. All research and development work is done using Matlab 7.0. Experiments and results are reported in the thesis.


Table of Contents

Statement of Copyright ...... iii
Certificate ...... iv
Acknowledgements ...... vi
Abstract ...... vii
List of Tables ...... xv
List of Figures ...... xvi

CHAPTER 1: Introduction ...... 1
1.1 Computer Vision ...... 2
1.2 Character Recognition ...... 5
1.2.1 Online Character Recognition ...... 6
1.2.2 Offline Character Recognition ...... 6
1.2.3 Magnetic ink Character Recognition (MICR) ...... 7
1.2.4 Optical Character Recognition (OCR) ...... 7
1.3 History of OCR ...... 9
1.4 OCR Processes ...... 10
1.4.1 Scanning ...... 11
1.4.2 Document Image Analysis ...... 11
1.4.3 Pre-processing ...... 11
1.4.4 Segmentation ...... 11
1.4.5 Recognition ...... 12
1.5 World Languages and Scripts ...... 12
1.5.1 Non-Cursive Script ...... 13
1.5.2 Cursive Script ...... 13
1.6 Perso-Arabic Script ...... 14
1.7 Urdu Language ...... 14
1.7.1 History of Urdu Language ...... 15
1.7.2 Nastalique Script ...... 15
1.7.2.1 Nastalique Type Setting ...... 16
1.7.3 Noori Nastalique ...... 17
1.8 Nastalique Text Processor ...... 17
1.8.1 InPage Urdu ...... 18
1.9 The Digital Divide ...... 20
1.10 Importance of Bridging the Digital Divide ...... 22
1.11 Approaches for Arabic Naskh OCR ...... 22
1.11.1 Segmentation based OCR ...... 22
1.11.2 Ligature based OCR ...... 23
1.12 Motivation and Research Objective ...... 23
1.13 Main Contribution of this Research ...... 24
1.13.1 Nastalique OCR (NOCR) Application ...... 25
1.13.2 NOCR Process ...... 25
1.13.3 NOCR Procedure ...... 26
1.14 Additional Contribution of this Research ...... 26
1.14.1 Video OCR ...... 26
1.14.2 Nastalique Text Processor Model ...... 27
1.15 Thesis overview ...... 28
1.16 Conclusion ...... 29

CHAPTER 2: Literature Survey ...... 31
2.1 Introduction ...... 31
2.2 Previous Work on Urdu OCR ...... 33
2.3 Approaches for Arabic script OCR ...... 38
2.3.1 Ligature-based Approach ...... 39
2.3.2 Segmentation-based Approach ...... 39
2.4 Previous Work in Ligature-Based Arabic OCR ...... 39
2.5 Previous Work in Segmentation-Based Arabic OCR ...... 48

CHAPTER 3: Video OCR ...... 59
3.1 Introduction ...... 59
3.2 Types of video text ...... 62
3.3 Applications of Video-OCR ...... 63
3.4 Literature Survey ...... 63
3.5 Correlation Pattern Recognition ...... 84
3.5.1 The MACH Filter ...... 86
3.5.2 OT-MACH Filter ...... 90
3.6 Text Region Detection in Video Frames ...... 91
3.7 Results ...... 92
3.7.1 Video OCR Results ...... 93
3.7.1.1 Training a MACH filter ...... 93
3.7.1.2 Training Images ...... 93
3.7.1.3 Trained MACH filter ...... 94
3.7.1.4 Text area detection and localization ...... 94
3.7.1.5 Video clips with Arabic text ...... 104
3.8 Conclusion and Future work ...... 107

CHAPTER 4: Implementation Challenges for Nastalique OCR ...... 109
4.1 Introduction ...... 109
4.2 Nastalique Character Set ...... 109
4.3 Nastalique Script Characteristics ...... 111
4.4 Computational Analysis of Urdu Alphabet ...... 111
4.4.1 Classes of base shapes in the Urdu alphabet ...... 111
4.4.2 Dots in Urdu Characters ...... 113
4.4.3 Context Sensitive shapes in the Urdu alphabet ...... 115
4.4.4 Comparison of Urdu, Arabic and Farsi Alphabets ...... 116
4.4.5 Bi-directional pen movement (from top left to bottom right) ...... 117
4.4.6 Bi-directional writing (numbers written from left to right) ...... 118
4.4.7 Context Sensitive shapes of the character “Quaf ق” ...... 118
4.5 Nastalique Script for Urdu ...... 119
4.5.1 Character ...... 119
4.5.2 Glyph ...... 120
4.5.3 Ligature ...... 120
4.6 Ligature in Urdu ...... 120
4.7 Word Forming in Urdu ...... 121
4.8 Styles of Urdu Writing ...... 122
4.8.1 Naskh ...... 122
4.8.2 Nastalique ...... 123
4.9 Nastalique Script Complexities ...... 123
4.9.1 Cursiveness ...... 124
4.9.2 Context Sensitivity ...... 125
4.9.3 Dot Positioning ...... 126
4.9.4 Kerning ...... 127
4.9.5 Character Overlapping ...... 128
4.9.5.1 Within a Ligature ...... 129
4.9.5.2 Between two adjacent Ligatures ...... 129
4.10 Sloping and Multiple Base-Lines ...... 130
4.11 A Generic OCR Model ...... 131
4.12 Working of a Roman Script OCR ...... 132
4.13 Working of a Nastalique Script OCR ...... 133
4.14 Approaches for Nastalique OCR ...... 134
4.14.1 Character-based Approach ...... 134
4.14.2 Ligature-based Approach ...... 136

CHAPTER 5: The Proposed Nastalique OCR System ...... 137
5.1 Introduction ...... 137
5.2 The Nastalique OCR Implementation ...... 138
5.3 The Novel Segmentation-free Nastalique OCR Algorithm ...... 138
5.4 Nastalique OCR Algorithm Description ...... 140
5.5 Segmentation of Text Image into Lines ...... 141
5.6 Segmentation of Text Line into Ligatures ...... 142
5.7 Character Recognition by Cross-Correlation ...... 143
5.8 Nastalique Text Segmentation ...... 145
5.9 Segmentation of Text Image into Lines and Ligatures ...... 145
5.10 Recognition Technique ...... 159
5.11 Nastalique OCR Application ...... 161
5.11.1 The Dialogue Boxes ...... 161
5.12 The Recognition Procedure ...... 163
5.13 The Recognition Process ...... 168

CHAPTER 6: Conclusion and Future Work ...... 169
6.1 Introduction ...... 169
6.2 Nastalique Character Shapes ...... 174
6.3 Nastalique Joining Characters Features Set ...... 175
6.4 Proposed Nastalique Text Processor Model ...... 176
6.5 Components of Nastalique Text Processor Model (NTPM) ...... 179
6.5.1 Character Shape Recognizer ...... 179
6.5.2 Next-State Function ...... 180
6.6 Conclusion ...... 181
6.7 Contribution ...... 189
6.8 Future Work ...... 190

References ...... 191
List of Publications ...... 198


List of Tables

Table 4.1: Shapes of Nastalique characters ...... 110
Table 4.2: Base shape classes in Urdu alphabet ...... 112
Table 4.3: Dots in Urdu characters ...... 114
Table 4.4: Context sensitive shapes of ب initial ...... 115
Table 4.5: Urdu Alphabet ...... 116
Table 4.6: Arabic Alphabet ...... 116
Table 4.7: Farsi Alphabet ...... 116
Table 6.1: Transition Table for NTPM ...... 178


List of Figures

Figure 1.1: Sub-fields of Artificial Intelligence ...... 4
Figure 1.2: Classification of Character Recognition ...... 5
Figure 1.3: OCR Processes ...... 10
Figure 1.4: Cursive and non-cursive scripts ...... 13
Figure 3.1: Training images ...... 93
Figure 3.2: Trained MACH filter ...... 94
Figure 3.3: Text region detected and extracted in color-1 ...... 95
Figure 3.4: Extracted text region is binarized-1 ...... 96
Figure 3.5: Text region detected and extracted in color-2 ...... 97
Figure 3.6: Extracted text region is binarized-2 ...... 98
Figure 3.7: Text region detected and extracted in color-3 ...... 99
Figure 3.8: Extracted text region is binarized-3 ...... 100
Figure 3.9: Extracted text region in color-4 ...... 101
Figure 3.10: Extracted text region is binarized-4 ...... 102
Figure 3.11: Artificial text on plane background ...... 103
Figure 3.12: Extracted text region in color-1 ...... 104
Figure 3.13: Extracted text region is binarized-2 ...... 105
Figure 3.14: Extracted text region in color-3 ...... 106
Figure 3.15: Extracted text region is binarized-4 ...... 107
Figure 4.1: Bi-directional pen movement ...... 117
Figure 4.2: Bi-directional writing ...... 118
Figure 4.3: Context sensitive shapes of quaf ق ...... 119
Figure 4.4: Ligature in Urdu ...... 121
Figure 4.5: Word forming in Urdu ...... 121
Figure 4.6: Styles of Urdu writing ...... 122
Figure 4.7: Naskh style of writing ...... 123
Figure 4.8: Nastalique style of writing ...... 123
Figure 4.9: Nastalique cursiveness ...... 124
Figure 4.10: Word forming in Nastalique ...... 125
Figure 4.11: Context Sensitivity; Two different shapes of Bay-initial ...... 126


Figure 4.12 (a): Dots in Urdu characters ...... 126
Figure 4.12 (b): Dots in Urdu characters ...... 126
Figure 4.12 (c): Dots in Urdu characters ...... 126
Figure 4.13: Roman kerning pair ...... 127
Figure 4.14: Nastalique kerning pair ...... 128
Figure 4.15: Character overlapping in Nastalique ...... 128
Figure 4.16: Character overlap within a ligature ...... 129
Figure 4.17: Intra-ligature character overlap ...... 130
Figure 4.18: Nastalique sloping base-line ...... 130
Figure 4.19: Multiple baselines ...... 130
Figure 4.20: Different phases of an OCR ...... 131
Figure 4.21: Roman OCR has three levels of segmentation ...... 132
Figure 4.22: Nastalique OCR has two levels of segmentation ...... 134
Figure 4.23: A segmented word in Nastalique ...... 135
Figure 4.24: Ligature-based approach ...... 136
Figure 5.1: Nastalique OCR Algorithm ...... 139
Figure 5.2: Flowchart for NOCR ...... 140
Figure 5.3: Binarized Nastalique Text Image ...... 146
Figure 5.4: Binarized Nastalique Text Image ...... 147
Figure 5.5: Binarized Nastalique Text Image ...... 147
Figure 5.6: Binarized Nastalique Text Image ...... 148
Figure 5.7: Lines of Text Separated ...... 149
Figure 5.8: Ligatures are separated ...... 150
Figure 5.9 (a): Analysis of text line 1 ...... 152
Figure 5.9 (b): Analysis of text line 2 ...... 152
Figure 5.9 (c): Analysis of text line 3 ...... 153
Figure 5.9 (d): Analysis of text line 4 ...... 153
Figure 5.9 (e): Analysis of text line 5 ...... 154
Figure 5.9 (f): Analysis of text line 6 ...... 154
Figure 5.9 (g): Analysis of text line 7 ...... 155
Figure 5.10: All elements in the text image separated ...... 155
Figure 5.11 (a): Ligature overlap line 1 ...... 156


Figure 5.11 (b): Ligature overlap line 2 ...... 158
Figure 5.11 (c): Ligature overlap line 3 ...... 158
Figure 5.11 (d): Ligature overlap line 4 ...... 159
Figure 5.12 (a): A word in Nastalique ...... 159
Figure 5.12 (b): Separated character shapes in a word ...... 160
Figure 5.12 (c): A word in Nastalique ...... 160
Figure 5.13: Nastalique OCR User Interface ...... 162
Figure 5.14: Input word image selection ...... 162
Figure 5.15: Font selection ...... 163
Figure 5.16: Cross-correlation for recognition ...... 164
Figure 5.17: Recognition process ...... 165
Figure 5.18: Recognition complete ...... 166
Figure 5.19: Multiple words single line ...... 167
Figure 5.20: Multiple words multiple lines ...... 167
Figure 6.1: Measurements of Letters in qat ...... 174
Figure 6.2: Nastalique Text Processor Model ...... 176
Figure 6.3: Transition Diagram for NTPM ...... 178


CHAPTER 1 Introduction

In this thesis we have presented a novel segmentation-free technique for the design and implementation of an OCR (Optical Character Recognition) for printed Nastalique text, a calligraphic style of Urdu which uses the Arabic script for its writing.

Just as a single script with its basic character shapes is adapted for writing in multiple languages, e.g. the Latin script for English, German, French etc., the Arabic script has been adapted for writing Persian, Urdu, Pashtu, Malay etc. Arabic writing has many forms and styles, the more common being Naskh, while its calligraphic counterpart is Nastalique. Beautiful and decorative as it is, Nastalique is also highly cursive by nature.

The Urdu alphabet today contains 39 basic characters, compared to Roman script languages which have fewer characters in their alphabets. For this reason the development of an Urdu OCR is considerably more difficult than for Roman script languages. So far no work has been done with regard to developing an Urdu OCR [55].

Urdu has 39 characters against the Arabic 28. Each character then has 2-4 different shapes according to its position in the word: isolated, initial, medial and final. Many character shapes have multiple instances. The shapes are context sensitive too – a character's shape changes with changes in the preceding or following character. At times even the 3rd or 4th character may cause a similar change, depicting an n-gram model in a Markov chain [66].


Optical character recognition is the translation of optically scanned bitmaps of printed or written text into digitally editable data files. OCRs developed for many world languages are already in efficient use, but none exists for Nastalique.

In Nastalique, inter-word and intra-word character overlapping makes optical recognition more complex. Optical character recognition of the Latin script is relatively easier.

1.1 Computer Vision

Human intelligence is described as the capability to make decisions based on information which is incomplete and noisy. It is this ability which makes human beings the most superior among all living creatures in the world.

There are five senses that provide information to humans for making everyday decisions, and of these the senses of hearing and vision are the sharpest. The auditory sense helps us recognize sounds and classify them. It is this sense which tells us that the person on the phone is a friend because his voice is recognizable. We can differentiate between an endless variety of sounds, voices and utterances and put them in exactly the slots they belong to: animal sounds, musical notes, wind swishing, the footsteps of a family member, all are within the recognition range of a person with a normal sense of hearing.

The other, and the more profound, is human vision, which allows us to identify a known person in a crowd of unknowns merely by casting a cursory glance at him. It allows us to pick an object that belongs to us from a number of those looking exactly like ours, and to recognize a misspelt word in a sentence and unconsciously correct it. The fact is that the human mind is capable of identifying an image on features spontaneously determined and not predefined or predetermined.

With the development of technology these human processes are imitated to create intelligent machines, hence the immense growth of robotics and intelligent decision-making systems; yet the work done so far is not comparable to any natural, involuntary human action or process. The hindrance is that it is not practically possible to imitate all the functions of the human mind and make computer vision as efficient and accurate as the human eye, but even though such a possibility may be remote, efforts are consistently being made to come as close to it as possible.

Artificial Intelligence is a broad field of computer science taking into account a number of other disciplines that form the bulk of its study. One of the most popular definitions of Artificial Intelligence (AI) was given by Elaine Rich and Kevin Knight as, “Artificial Intelligence (AI) is the study of how to make computers do things which, at the moment, people do better” [20].

One important branch of Artificial Intelligence is computer vision, shown in Figure 1.1 along with a few of its sub-branches; it is the area of research which aims to imitate human vision and forms the basis of all image acquisition and processing, document understanding and recognition.

Computer vision relies on a solid understanding of the physical process of image formation to obtain simple inferences from individual pixel values, like the shape of an object, and to recognize objects using geometric information or probabilistic techniques [19].

Document understanding is in its own turn a vast and difficult area, for the focus of research today lies in being able to make content-based searches which hope to allow machines to look beyond the keywords, headings or mere topics to find a piece of information. A far more streamlined field of document recognition and understanding is Optical Character Recognition, which attempts to identify a single character from an optically read text image as part of a word that can then be used for further processing. The area gains rising significance as more and more information each day needs to be stored, processed and retrieved rather than being keyed in from an already present printed or handwritten source.

Figure 1.1: Sub-fields of Artificial Intelligence (a hierarchy: Artificial Intelligence ⊃ Computer Vision ⊃ Document Understanding & Recognition ⊃ Optical Character Recognition)


1.2 Character Recognition

Character recognition is a sub-field of pattern recognition in which images of characters from a text image are recognized; as a result of recognition, the respective character codes are returned, which when rendered give the text in the image.

The problem of character recognition is the problem of automatic recognition of raster images as being letters, digits or some other symbol and it is like any other problem in computer vision [57].

Character recognition is further classified into two types according to the manner in which input is provided to the recognition engine. As the classification hierarchy in figure 1.2 shows, the two types of character recognition are:

a. On-line character recognition

b. Off-line character Recognition

Figure 1.2: Classification of Character Recognition

1.2.1 Online Character Recognition

Online character recognition systems deal with character recognition in real time. The process involves a dynamic procedure using special sensor-based equipment that can capture input from a transducer while text is being written on a pressure-sensitive, electrostatic or electromagnetic digitizing tablet. The input text is automatically converted, with the help of a recognition algorithm, to a series of electronic signals which can be stored for further processing in the form of letter codes. The recognition system functions on the basis of the x and y coordinates generated in a temporal sequence by the pen-tip movements as they create recognizable patterns on a special digitizer while the text is written.
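As a hedged illustration of the data an online recognizer consumes, the sketch below (in Python; all names here are hypothetical, not taken from any system described in this thesis) models the pen-tip samples as (x, y, t) triples grouped into strokes between pen-down and pen-up events.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Stroke:
    # (x, y, t) samples captured between one pen-down and the next pen-up
    points: List[Tuple[float, float, float]] = field(default_factory=list)

@dataclass
class InkTrace:
    strokes: List[Stroke] = field(default_factory=list)

    def pen_down(self) -> None:
        self.strokes.append(Stroke())  # a new stroke begins

    def sample(self, x: float, y: float, t: float) -> None:
        self.strokes[-1].points.append((x, y, t))

# Usage: a digitizer driver would call pen_down() on contact and sample()
# for every position report while the pen stays on the tablet.
trace = InkTrace()
trace.pen_down()
for t, (x, y) in enumerate([(0.0, 0.0), (1.0, 0.5), (2.0, 1.5)]):
    trace.sample(x, y, float(t))
```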

1.2.2 Offline Character Recognition

There is a major difference in the input systems of off-line and on-line character recognition which influences the design, architecture and methodologies employed to develop recognition systems for the two. In online recognition the input data is available in the form of a temporal sequence or real-time text generated on a sensory device, thus providing time-sequenced contextual information. On the contrary, in an off-line system the actual recognition begins after the data has been written down, as it does not require real-time contextual information.

Offline character recognition is further classified into two types according to the input provided to the system for recognition of characters. These are:

i. Magnetic ink Character Recognition (MICR)

ii. Optical Character Recognition (OCR)


1.2.3 Magnetic ink Character Recognition (MICR)

MICR is a unique technology that relies on recognizing text which has been printed in special fonts with magnetic ink usually containing iron oxide. As the machine prepares to read the code, the printed characters become magnetized on the paper with the north pole on the right of each MICR character, creating recognizable waveforms and patterns that are captured and used for further processing. The reading device is comparable to a tape recorder head that recognizes the wave patterns of sound recorded on magnetic tape [79].

The system has been in efficient use for a long time in banks around the world to process checks, as results give high accuracy rates with relatively low chances of error.

There are special fonts for MICR, the most common fonts being E-13B and CMC-7.

1.2.4 Optical Character Recognition (OCR)

Optical Character Recognition or OCR is the text recognition system that allows hard copies of written or printed text to be rendered into editable, soft-copy versions. It is the translation of optically scanned bitmaps of printed or written text into digitally editable data files. An OCR facilitates the conversion of a geometric source object into a digitally representable character in the ASCII or Unicode scheme of digital character representation [37].

Many times we want to have an editable copy of text which we have in the form of a hard copy, like a fax or pages from a book or a magazine. The system employs an optical input device, usually a digital camera or a scanner, which passes the captured images to a recognition system that, after passing them through a number of processes, converts them to a soft copy like an MS Word document.

When we scan a sheet of paper we reformat it from a hard copy to a soft copy, which we save as an image. The image can be handled as a whole but its text cannot be manipulated separately. In order to be able to do so, we need to ask the computer to recognize the text as such and to let us manipulate it as if it were text in a word document. The OCR application does that; it recognizes the characters and makes the text editable and searchable, which is what we need. The technology has also enabled such materials to be stored using much less storage space than the hard-copy materials. OCR technology has made a huge impact on the way information is stored, shared and communicated.

OCRs are of two types:

i. OCRs for recognizing printed characters

ii. OCRs for recognizing hand-written text.

OCRs meant for printed text recognition are generally more accurate and reliable because the characters belong to standard font files and it is relatively easy to match images with the ones present in the existing library. As far as handwriting recognition is concerned, the vast variety of human writing styles and customs makes the recognition task more challenging.


Today we have OCRs for printed text in the Latin script as an everyday tool in offices, while OCR for handwriting is still in the research and development stage to achieve higher accuracy.

Optical Character Recognition (OCR) is one of the most common and useful applications of machine vision, which is a sub-class of artificial intelligence, and has long been a topic of research, recently gaining even more popularity with the development of prototype digital libraries which imply the electronic rendering of paper or film based documents through an imaging process.

1.3 History of OCR

The history of OCR dates back to the early 1950s with the invention of Gismo, a machine that could translate printed messages into machine codes for computer processing.

The product was a combined effort of David Shepard, a cryptanalyst at the Armed Forces Security Agency (AFSA), and Harvey Cook. This successful achievement was followed by the construction of the world’s first OCR system, also by David Shepard, under his Intelligent Machines Research Corporation. Shepard’s customers included The Readers’ Digest, Standard Oil Company of California for making credit card imprints, Ohio Bell Telephone Company for a bill stub reader, and the U.S. Air Force for reading and transmitting teletype messages [81].

Another success came in the area of postal number recognition, which, although crude in its inception stage, became more refined with advancement in technology.


By 1965 Jacob Rabinow, an American inventor, had patented an OCR for sorting mail, which was then used by the U.S. Postal Service [81].

The OCR has since become an interesting field of research posing numerous complexities and offering unique possibilities in the areas of Artificial Intelligence and Computer Vision.

As advancements were made and new challenges were undertaken, it became more and more clear that the scope of such an undertaking, though attractive, would be daunting and adventurous.

1.4 OCR Processes

The OCR process begins with the scanning and subsequent digital reproduction of the text in the image. It involves the following discrete sub-processes, as shown in figure 1.3.

Figure 1.3: OCR Processes (scanner → bitmap image → document image analysis: slant/skew detection and correction → pre-processing: noise removal, blur removal, thinning, skeletonization, edge detection → segmentation of lines and words → recognition algorithm → display)
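The stages of Figure 1.3 can be read as a simple pipeline. The Python skeleton below is only an illustration of the stage ordering; the function bodies are placeholders (assumptions for illustration), and sections 1.4.1 to 1.4.5 describe what each stage actually involves.

```python
import numpy as np

def document_image_analysis(img: np.ndarray) -> np.ndarray:
    return img    # placeholder: skew/slant detection and correction

def preprocess(img: np.ndarray) -> np.ndarray:
    return img    # placeholder: noise/blur removal, binarization, thinning

def segment(img: np.ndarray) -> list:
    return [img]  # placeholder: split into lines, then words/letters

def recognize(segments: list) -> str:
    return ""     # placeholder: map each segment to character codes

def ocr(bitmap: np.ndarray) -> str:
    """Run the stages of Figure 1.3 in order on a scanned bitmap."""
    return recognize(segment(preprocess(document_image_analysis(bitmap))))
```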


1.4.1 Scanning

A flat-bed scanner is usually used at 300 dpi, which converts the printed material on the page being scanned into a bitmap image.

1.4.2 Document Image Analysis

The bitmap image of the text is analyzed for the presence of skew or slant, and consequently these are removed.

Quite a lot of printed literature has combinations of text with tables, graphs and other forms of illustration. It is therefore important that the text area is identified separately from the other images so that it can be localized and extracted.
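One common way to detect skew (an assumption for illustration, not necessarily the method used in this work) is to rotate the binarized page over a range of candidate angles and keep the angle whose horizontal projection profile has maximum variance: correctly aligned text lines produce sharp peaks and valleys in the row sums. A minimal Python sketch:

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary: np.ndarray, max_angle: float = 5.0,
                  step: float = 0.5) -> float:
    """Return the rotation angle (degrees) that best aligns text lines."""
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = rotate(binary, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)      # ink per row
        score = float(np.var(profile))     # peaky profile => aligned lines
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

def deskew(binary: np.ndarray) -> np.ndarray:
    return rotate(binary, estimate_skew(binary), reshape=False, order=0)
```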

1.4.3 Pre-processing

In this phase several processes are applied to the text image, like noise and blur removal, binarization, thinning, skeletonization, edge detection and some morphological processes, so as to get an OCR-ready image of the text region which is free from noise and blur.
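As one concrete example of the binarization step, the sketch below implements Otsu's global threshold (a standard choice, assumed here purely for illustration): the threshold that maximizes the between-class variance of the grayscale histogram separates ink from background.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Otsu's method on an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    global_mean = np.dot(np.arange(256), hist) / total
    best_t, best_var = 0, 0.0
    cum_w, cum_mean = 0.0, 0.0
    for t in range(256):
        cum_w += hist[t]
        cum_mean += t * hist[t]
        if cum_w == 0 or cum_w == total:
            continue
        w0 = cum_w / total                      # background class weight
        mu0 = cum_mean / cum_w                  # background class mean
        mu1 = (global_mean * total - cum_mean) / (total - cum_w)
        var_between = w0 * (1 - w0) * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray: np.ndarray) -> np.ndarray:
    # dark ink on light paper: pixels below the threshold become 1 (ink)
    return (gray < otsu_threshold(gray)).astype(np.uint8)
```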

1.4.4 Segmentation

If the whole image consists of text only, the image is first segmented into separate lines of text. These lines are then segmented into words and finally words into individual letters.


Once the individual letters are identified, localized and segmented out of a text image, it becomes a matter of choosing a recognition algorithm to get the text in the image into a text processor.
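A minimal sketch of the line-level stage (the classic horizontal projection approach; the details here are illustrative assumptions, and Chapter 5 describes the segmentation actually used in this work): rows containing no ink separate consecutive lines of text, and the same idea applied to column sums within a line yields word boundaries.

```python
import numpy as np

def segment_lines(binary: np.ndarray) -> list:
    """Return (top, bottom) row ranges of text lines in a 0/1 ink image."""
    ink_rows = binary.sum(axis=1) > 0
    lines, start = [], None
    for r, has_ink in enumerate(ink_rows):
        if has_ink and start is None:
            start = r                    # a new line of text begins
        elif not has_ink and start is not None:
            lines.append((start, r))     # a blank row ends the line
            start = None
    if start is not None:
        lines.append((start, len(ink_rows)))
    return lines

# Usage: crop each detected line for further (word/letter) segmentation.
# line_images = [binary[top:bottom, :] for top, bottom in segment_lines(binary)]
```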

1.4.5 Recognition

This is the most vital phase, in which the recognition algorithm is applied to the character-level segments of the text image.

As a result of recognition, the character code corresponding to each image is returned by the system and then passed to a word processor to be displayed on the screen, where it can be edited, modified and saved in a new file format.

1.5 World Languages and Scripts

Communication around the world takes place in several hundred languages today. There is a great variety of ways in which these languages are written down, but it has been found that more than 90 languages use the Latin script to write their words, English being one of them. There are several other scripts that serve as means to write down a combination of languages. The Arabic script stands second to Latin, as it has been adopted by more than 25 different languages to form their alphabets [78].

The way a script is written down and the patterns it follows help us divide scripts into two separate categories, as figure 1.4 shows:

i. Non-cursive scripts.

ii. Cursive scripts.

1.5.1 Non-Cursive Script

The non-cursive scripts, which are more common, are inherently discrete as far as printed text is concerned. This means that each character has a separate and definite shape that combines with the next one by being placed side by side with it, never overlapping or shadowing the preceding or succeeding letters. However, when these scripts are handwritten, the scribe’s hand can make the letters decorative, cursive and more flowing in form and shape. The Latin script is an example of such a script, where handwritten text can be as decorative and cursive as the writer’s choice while printed text is easily recognizable.

Figure 1.4: Cursive and non-cursive scripts (examples of discrete characters, hand-printed characters, cursive handwriting, and Nastalique: ﻧﺳﺗﻌﻠﯾق اﯾک ﭘﯾﭼﯾده رﺳم اﻟﺧط ﮨﮯ)

1.5.2 Cursive Script

On the other hand, the naturally cursive scripts, e.g. Arabic, have a unique feature in the formation of words. The characters in these scripts are not discrete but are joined to each other to form ligatures and then words. These free-flowing character forms create words by overlapping each other, sometimes even stacking on each other vertically. The non-discrete nature of these scripts makes them ever more difficult to develop as font types for printing, and poses challenges for character recognition. Creating discrete characters for text processing for the cursive scripts and placing them side by side to form words like discrete scripts, for convenience in character recognition, as suggested by Abuhaiba [2], mars the original shape of words, giving them an artificial and unnatural look.

1.6 Perso-Arabic Script

The Arabic alphabet has been adapted by several other languages because of the resemblance of sounds and phonetic systems. This script was originally an exclusive writing system to support the Arabic language; it was later adapted and modified to accommodate the demands of writing the Persian language. Because the Persian language has a greater variety of sounds than Arabic, four more letters were added to this adapted script [38]. This new version of the Arabic writing system is known as the Perso-Arabic script and is considered a wide-ranging writing system not just for the Persian alphabet, but also for Urdu, Sindhi, Kurdish, Sorani, Balochi, Punjabi Shahmukhi, Azeri, Tajik-Persian and several others, each adding more characters to the basic Arabic alphabet according to the sounds the language has beyond those available in the basic Arabic alphabet.

To write Urdu we need eleven more characters added to the Arabic alphabet, giving a total of thirty-nine characters in the Urdu alphabet.

1.7 Urdu Language

The Urdu language derives many of its features, including its script, from Persian and Arabic. The Urdu alphabet is therefore the same right-to-left alphabet as followed by Persian and Arabic. However, there are many small differences in the written styles of the two. While Arabic follows the less flowing Naskh style, Urdu is written in the more cursive Nastalique style. Although written Urdu more closely resembles Arabic, spoken Urdu resembles Hindi, although Hindi is written in a totally different script, Devanagari, which is based on the Sanskrit script.

1.7.1 History of Urdu Language

The Urdu language has a unique history: it developed as a common means of expression during the Mughal dynasty in South Asia due to Arab, Persian and Turkish influences in the region. The written script adopted a modification of the Persian alphabet following the same phonetic and pronunciation system. Due to the flowing cursive quality of the script, most books, newspapers and other publishing materials continued to use handwritten copies of the text created by master writers or kaatibs, also known as khushnavees. This went on till the 1980s.

In the early 80s, however, the daily Jang, a Pakistani newspaper, transferred almost all of the typing work of its newspaper to the computer. Today almost all Urdu publishing tasks are carried out through a variety of software available in the market supporting Urdu text editing features and providing a number of other font styles for writing.

1.7.2 Nastalique Script

In 14th-century Iran, Islamic calligraphy reached its zenith, its main genre being the Nastalique writing style. Although it was originally a script meant for composing in Arabic, over a period of time it became more popular for writing Persian, Turkic and other South Asian languages like Urdu and Balochi. This fluid and more cursive amalgam of Arabic Naskh and Taaliq has been extensively used as an art form for calligraphy in Iran and Afghanistan.

Apparently the art of Arabic calligraphy originated and flourished in fourteenth-century Iran, spreading out later to many other Muslim countries. The art form gradually gained significance, giving birth to such eminent calligraphists as Mir Emad, whose work is considered most elegant in splendor. Many of the most beautiful transcriptions of Quranic verses have been rendered in this script by various calligraphists.

The modern-day adaptation of Nastalique is, however, more specific in terms of the rules of character proportion laid down by Mirza Reza Kalhouri, who created a version of Nastalique that could easily be used for machine printing. This advancement allowed wider and easier use of the script with a formalized set of characters available [80].

1.7.2.1 Nastalique Type Setting

Handwritten Nastalique, written from right to left, was indeed a creative art form most suited to calligraphic renderings, and attempts to produce high-quality printed texts remained challenging undertakings.

Although many attempts had been made to develop Nastalique typography, most of them failed drastically, mostly because it was difficult to create the vast set of individually designed character combinations, or ligatures, that forms the basis of the Nastalique writing system. Earlier, a version of Nastalique was developed by Fort William College, but this could not gain popularity and remained limited to use in publishing the college's own literature.

A Nastalique typewriter was developed in the state of Hyderabad Deccan but could never be put to any functional use, and further attempts to make a Nastalique typeface were abandoned, it being considered an impossible task.

1.7.3 Noori Nastalique

The old system of typesetting Urdu involved single characters joined together to form ligatures and words. This meant compromising the natural, elegant and flowing style of the script and developing a unique set of characters that would more conveniently join together to form words. Ahmed Mirza Jamil replaced this old system by developing ligatures for a new typeset for Urdu which restored the cursive, calligraphic quality of Nastalique. The new typeface contained a set of almost 18,000 ligatures, on the basis of which the new version of the script was developed. This was named Noori Nastalique. Agfa Monotype Corporation holds the copyright for the Noori Nastalique digital font data.

1.8 Nastalique Text Processor

We have a few Urdu text processors, none of which is anywhere close to Latin script text processors. The only one which has true calligraphic Nastalique support is InPage Urdu, which is why it is the most popular and commonly used Nastalique text processor; it also supports other Urdu fonts. However, InPage Urdu does not produce character-based Nastalique text; instead, it uses a large collection of ligatures to write text in Nastalique.


A character-based Nastalique text processor, which could produce true Nastalique calligraphic text using a character-based Nastalique font instead of writing it with a large collection of ligatures, is the need of the time.

1.8.1 InPage Urdu

InPage Urdu is widely used software for making page layouts in Urdu using the Nastalique style of the Arabic script.

InPage works under Windows and, alongside Urdu, caters to writing in Pashto, Persian and Arabic – all languages that use the same script for writing.

This publishing tool is popular in Pakistan, particularly with the newspaper companies, who used to employ large numbers of calligraphers to make corrections to Urdu text created in the Monotype font. In fact, InPage Urdu was actually developed for Pakistan’s newspaper industry in 1994 through the collaborative effort of a UK-based company, Multi-lingual Solutions Windows on World Limited, led by Kamran Rouhi, and an Indian software development team, Concept Software Private Limited, led by Ravindra Singh and Vijay Gupta. This newly developed software was licensed as the Noori Nastalique typeface. It was improved and augmented from the original Monotype font and is now available for use as the main Urdu font along with 40 or so other non-Nastalique fonts.

InPage offers a truly authentic style of the Nastalique script with a wide set of ligatures (around 18,000 ligatures in 85 font files); it also keeps the character display in the WYSIWYG format, making the on-screen and printed result comparatively more attractive than any other software available in the market. An added quality is that almost all operations and features of desktop publishing packages available for English are similarly available in InPage, making desktop publishing in Urdu, Arabic, Pushtu or Persian as convenient as it is in English.

These features have largely overcome the problems that Noori Nastalique suffered from in the 1990s. The Noori style of Nastalique was a digital typeface developed in 1981 through the collaboration of Ahmed Mirza Jamil (as calligrapher) and Monotype Imaging (formerly Monotype Corp.).

These two problems were:

i. Standard platforms such as Windows or Mac did not have a built-in environment to support writing this script.

ii. The text could not be entered on the basis of a WYSIWYG (What You See Is What You Get) method; the data had to be entered with the operator relying on an understanding of Monotype’s proprietary page description language.

From the OCR point of view, Nastalique is more complicated than Naskh as it has multiple baselines, more overlapping of characters within a ligature and between adjacent ligatures, vertical stacking of characters in a ligature, etc. Compared to Latin script languages’ OCRs, very little research work has been done on Arabic Naskh OCR. Only a few Arabic Naskh OCR systems are available today and they are far from perfect, lagging far behind Latin script OCR systems in accuracy.


Urdu has still not attracted researchers’ attention for OCR research, partly due to lack of funds in this area and partly due to the challenges the Nastalique style offers for its optical recognition. Not only is there no Nastalique OCR system available today, but published research in this area is nearly non-existent.

1.9 The Digital Divide

Today a phenomenal amount of communication takes place through the internet, and standardization of the Latin, non-cursive script has paved the way for efficient communication, sharing of knowledge and information, financial transactions and business correspondence between the users of this script. These countries justly belong to the much acclaimed “global village”, bringing their societies, cultures and technologies close to one another and merging their social boundaries into one whole global community. They have access to one another’s research and share in harmony their achievements and accomplishments, benefiting one from the other and thereby building on each other’s strengths in all spheres of life.

But this global village is a distinct society that has no recognition of the existence of one important culture that is as rich and varied as theirs – perhaps even more so in the context of history, traditional values, languages and learning.

What happens if a language’s writing script is not recognized in the medium of communication which is used elsewhere in the world? The answer is simple: a gradual but steady decline in the global understanding and appreciation of a script reflective of a rich cultural background and heritage, and very soon a priceless aspect of an important language vanishes from the scene.

What we essentially need today is a complete writing system supported by the most sophisticated browsing facilities and web script, as well as a text recognition system for Nastalique.

In this regard it would be right to mention that considerable work has been done, and is being assembled for improvement, for the Arabic (Naskh) script; yet it is still not well supported by search engines for keyword search and information retrieval, whereas access through the Roman script is a matter of seconds. Users of the Nastalique script can grope for ages trying to find material in their required script.

Research facilities on the net are possible only through the Roman script, for which a massive amount of data, from literature to science, is available because it could be scanned, recognized and stored in cyberspace as searchable and retrievable text.

However, no substantial amount of work could be done for a cursive script like Nastalique, and if we need to search modern-day theses, articles, research papers or current news, the data has firstly not been digitally stored, and secondly it is not convenient to search for it through the net for research purposes because the computer does not recognize the Nastalique script as it does the Roman. This situation drastically affects the survival of a language in the modern world.


1.10 Importance of Bridging the Digital Divide

This establishes that if all the world's cultures are to survive and not become isolated or extinct, and if knowledge and wisdom are to flow freely between nations, a strong and willful effort will have to be made to preserve the Nastalique script of writing Urdu in the digital context, as it is the most widely used style of writing Urdu, one of the world's major languages.

1.11 Approaches for Arabic Naskh OCR

Approaches for the optical character recognition of cursive scripts like Arabic Naskh can be broadly divided into two categories:

i. Segmentation based

ii. Ligature based

1.11.1 Segmentation based OCR

Segmentation-based approaches break a ligature into component character shapes before they are presented to the recognition phase of the OCR system. To the author's knowledge, no perfect segmentation scheme has been presented as yet which could divide a ligature into its component character shapes precisely and accurately. Certainly, characters accurately segmented from a ligature have a higher recognition probability than poorly segmented character shapes.


One of the main reasons for the low recognition rates of cursive script OCR systems based on the segmentation approach is the imperfect segmentation of a ligature into its component character shapes.

1.11.2 Ligature based OCR

Ligature-based approaches for cursive script OCR systems do not break a ligature into its component character shapes, but use whole ligatures as the primitive elements of writing and train a learning machine, such as a Neural Network (NN), Support Vector Machine (SVM) or Hidden Markov Model (HMM), to recognize these ligatures when presented to it.

A ligature-based cursive script OCR system needs all the ligatures used to write the script's languages. Urdu, for example, is written in the Nastalique style with around 18,000 ligatures, so all these ligature shapes would be required to train a learning machine like an NN for optical recognition; this is a large data set and poses a constraint.

Our proposed system for Nastalique character recognition does not require segmentation of a ligature into constituent character shapes; rather, it requires only two levels of segmentation: first the text image is segmented into lines of text, then each line of text is further segmented into ligatures and isolated characters.

1.12 Motivation and Research Objective

OCRs for many of the world's major languages have been developed and are being used, but at present an OCR for Nastalique is not available anywhere in the world. With a Nastalique OCR we would be able to convert our whole wealth of Nastalique literature into digital form and make it available on the World Wide Web.

The objective of this research is to design and implement an OCR for printed Nastalique text, which is not only a national need but could also provide a means to bridge the digital divide in the countries of the region, where the Nastalique-reading population is of a considerable volume.

1.13 Main Contribution of this Research

Nastalique is inherently cursive in nature; there is much character as well as word overlapping, which makes segmentation of ligatures into constituent character shapes nearly impossible. To our knowledge, up till now nobody has succeeded in the perfect segmentation of Arabic script text into constituent character shapes which would give recognition results comparable to Roman script OCR.

In this research, we have proposed a novel segmentation-free technique for the design and implementation of a Nastalique OCR based on correlation pattern recognition; we use cross-correlation in the recognition phase of our Nastalique OCR.

Our proposed system for Nastalique character recognition does not require segmentation of a ligature into constituent character shapes; rather, it requires only two levels of segmentation: first the text image is segmented into lines of text, then each line of text is further segmented into ligatures and isolated characters.


It then uses cross-correlation for recognition of characters in the ligatures line by line, writing their character codes into a text file in sequence, as each character is found in a ligature, along with the x-position of the start of the character shape. As the recognition process is completed, the character codes in the text file are sorted based on x-positions, and the sorted sequence of character codes is given to the rendering engine, which displays the recognized text in a text region.
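A minimal sketch of this recognition step follows, with illustrative assumptions: `templates` maps each character code to its glyph bitmap rendered from the same font file, `line` is one segmented binary text-line image, and the 0.9 threshold is arbitrary. Each template slides over the line; matches above the threshold are recorded with their x-position, and sorting by x recovers the character order even where ligatures overlap. A real implementation would also suppress the cluster of duplicate hits around each correlation peak.

```python
import numpy as np

def normalized_corr(window: np.ndarray, templ: np.ndarray) -> float:
    """Zero-mean normalized cross-correlation of two equal-size patches."""
    w = window - window.mean()
    t = templ - templ.mean()
    denom = np.sqrt((w * w).sum() * (t * t).sum())
    return float((w * t).sum() / denom) if denom > 0 else 0.0

def recognize_line(line: np.ndarray, templates: dict, thresh: float = 0.9):
    hits = []
    H, W = line.shape
    for code, templ in templates.items():       # one pass per glyph template
        th, tw = templ.shape
        for y in range(H - th + 1):
            for x in range(W - tw + 1):
                if normalized_corr(line[y:y+th, x:x+tw], templ) >= thresh:
                    hits.append((x, code))      # record where the glyph starts
    hits.sort()                                 # order character codes by x-position
    return [code for _, code in hits]
```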

The limitation of our proposed Nastalique character recognition system is that it is font dependent; it needs for recognition the same font file that was used to write the text whose image is given to the Nastalique OCR.

1.13.1 Nastalique OCR (NOCR) Application

In this research we have proposed a segmentation-free algorithm for the implementation of an OCR for printed Nastalique text. Most of the experimentation and rapid prototyping is done using Matlab 7, while the main application is developed in Microsoft Visual C++ 6.0.

1.13.2 NOCR Process

Our proposed system for Nastalique character recognition segments the text image into lines of text; each line of text is then further segmented into ligatures and isolated characters.

It then uses cross-correlation for recognition of characters in the ligatures line by line, writing their character codes into a text file in sequence, as each character is found in a ligature. As the recognition process is completed, the character codes in the text file are given to the rendering engine, which displays the recognized text in a text region.

1.13.3 NOCR Procedure

i. We use FontLab software to make a character-based True Type Font (TTF).

ii. Using our TTF font file, we write a few words in Nastalique.

iii. We make an image of the written text.

iv. This image is given to the Nastalique OCR for recognition.

v. The recognized text is rendered in a text region.

1.14 Additional Contribution of this Research

In addition to the main research topic of Nastalique OCR, we have also done the following:

i. Video OCR

ii. Nastalique Text Processor Model

1.14.1 Video OCR

Optical Character Recognition for Roman script languages is almost a solved problem for document images, and researchers are now focusing on extraction and recognition of text from video scenes. This new and emerging field in character recognition is called Video OCR and has numerous applications like video indexing, video data retrieval, etc. The emerging field of character recognition in video frames is attracting research on eastern scripts like Chinese but, to the best of our knowledge, no work is reported as yet on Video OCR for Arabic script languages like Arabic, Persian and Urdu.

As an extension of our Nastalique OCR to Video OCR for Arabic script languages, we have also performed experiments on video text identification, localization and extraction for its recognition. We have used a MACH (Maximum Average Correlation Height) filter to identify text regions in video frames; these text regions are then localized and extracted for recognition. Experiments and results are reported in the thesis.
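The following is a simplified sketch in the MACH/OT-MACH family of correlation filters, assuming equally sized grayscale training patches of text; it is not the exact filter formulation or parameters used in this thesis. The filter is synthesized in the frequency domain from the training set and applied to a candidate window by correlation; a sharp peak in the correlation plane marks a likely text region.

```python
import numpy as np

def build_filter(train_imgs):
    """Synthesize a MACH-style filter from equal-size training patches."""
    specs = [np.fft.fft2(img) for img in train_imgs]
    m = np.mean(specs, axis=0)                        # mean training spectrum
    d = np.mean([np.abs(s) ** 2 for s in specs], axis=0)  # mean power spectrum
    return np.conj(m) / (d + 1e-8)                    # simplified filter H

def correlate(window, H):
    """Correlate a frame window (same size as training patches) with H."""
    out = np.fft.ifft2(np.fft.fft2(window) * H)
    return np.abs(np.fft.fftshift(out))               # correlation plane

# Usage: slide (or tile) a window over each video frame and keep windows
# whose correlation peak exceeds a threshold as candidate text regions,
# which are then localized, extracted and passed on for recognition.
```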

1.14.2 Nastalique Text Processor Model

For Nastalique OCR, we develop character-based True Type Font files for a few Nastalique words. These words are written using the same character-based TTF font, and an image is made of the Nastalique text. The image is then given to our Nastalique OCR. After recognition, the rendering is done by using the same TTF font file to display the recognized text. The work is therefore threefold: development of a character-based Nastalique True Type Font, Nastalique character recognition, and rendering the recognized text using the character-based Nastalique True Type Font.

Since our character-based segmentation-free Nastalique OCR algorithm needs, as groundwork, a character-based Nastalique Text Processor, we have also proposed a Finite State Nastalique Text Processor Model. Implementation is not yet done, so results are not reported; however, this model could serve as a basis and impetus for future research in this challenging field.


1.15 Thesis overview

The rest of the thesis is organized as follows:

Chapter 2 Literature Survey

This chapter was intended to cover the research work done on Nastalique OCR or Urdu OCR; unfortunately, published research on either is almost non-existent, and whatever is available is included here with the comments of the author of this thesis.

On the other hand, a substantial amount of background research on Arabic Naskh OCR is included here, as the script and the writing rules are the same for Arabic and Urdu while the styles differ: Arabic uses Naskh, while Urdu uses the Nastalique style of writing, which poses more challenges for character recognition than Arabic Naskh.

Chapter 3 Video OCR

The new and emerging field of character recognition in video frames is called Video OCR and has numerous applications like video indexing, video data retrieval, etc. As an extension of our Nastalique OCR to Video OCR for Arabic script languages, we have performed experiments on video text identification, localization and extraction for its recognition. We have used the MACH (Maximum Average Correlation Height) filter to identify text regions in video frames; these text regions are then localized and extracted for recognition. Experiments and results are reported and discussed in this chapter.


Chapter 4 Implementation Challenges for Nastalique OCR

No Nastalique OCR exists so far, and published research on Nastalique OCR, Urdu OCR, or even on any area of Urdu computing is almost non-existent, the reason being the challenges that the Nastalique style poses for its optical recognition. The complexities of the Nastalique style in particular, and of the Arabic script in general, along with the challenges they pose for character recognition, are discussed here.

Chapter 5 The Proposed Nastalique OCR System

In this research we have developed a novel character-based, segmentation-free algorithm for the recognition of printed Nastalique text, which we call the NOCR algorithm. The NOCR algorithm, its implementation and results are presented and discussed in this chapter.

Chapter 6 Conclusion and Future Work

Since our character-based, segmentation-free Nastalique OCR algorithm needs, as groundwork, a character-based Nastalique text processor, we have also proposed a Finite State Nastalique Text Processor Model, which is presented and discussed in this chapter. Implementation has not been done and is planned as future work.

All the research work done on this project is summarized in this chapter, and directions for future research on this and related topics are discussed.

1.16 Conclusion

Our effort in this more challenging and less rewarding research area will set a milestone for new endeavors and will provide the groundwork needed to embark upon future research projects in this area.


We have tested our Nastalique OCR algorithm on a subset of Urdu words. This is because the work in this research also involves the development of a TrueType Font with which the test text is written before being imaged for recognition.

This research opens multiple directions for future research, e.g. (1) character-based Nastalique TrueType Font development, (2) a character-based Nastalique text processor, and (3) enhancement and extension of the Nastalique OCR system.


CHAPTER 2 Literature Survey

2.1 Introduction

While a phenomenal amount of research in IT has made Roman script languages extremely adaptable in all areas of computing, the insubstantial work on Urdu in this area accounts for very little computerization of this script.

Published research is close to non-existent, and the complexities associated with Nastalique OCR development make it a highly challenging undertaking. Many reasons can be held responsible for this lack of interest in making efforts to implement Urdu in computing, and a serious shortage of funds is perhaps a major one. Another reason for the lack of a complete Urdu OCR system is the limited support for the Urdu language in computing.

Urdu uses an extended and adapted Arabic script; it has 39 characters while Arabic has 28. Each character has two to four different shapes depending upon its position in the word: initial, medial or final. When a character shape is written alone, it is called an isolated character shape. Each of these initial, medial and final character shapes can have multiple instances, the character shape changing with the preceding or succeeding character; this characteristic is called context sensitivity. The Urdu alphabet therefore contains a large number of character shapes compared to Roman script languages, which have fewer characters in their alphabets. For this reason the development of an Urdu OCR is considerably more difficult than for Roman script languages. So far no work has been done with regard to developing an Urdu OCR [55].


A complete language script comprises an alphabet and a style of writing. Urdu, with its extended Arabic script, has two main styles, Naskh and Nastalique. Nastalique is a calligraphic, more stylistic form and is widely used for writing Urdu.

Urdu writing is inherently cursive in nature: neighboring characters in a word are combined together, under certain restrictions, to form a compound character or a ligature.

This makes Urdu text processing very difficult compared to text processing of Latin script languages, which follow character-based writing styles in which each character in a word retains its shape. Urdu follows a complex style of writing because it has a few characters that can neither start a ligature nor occupy the middle position of one; they can appear either in their isolated form or at the end of a word. When such a character appears in the middle of a word, the word is broken into more than one ligature. Even with words split into multiple ligatures, most Urdu ligatures remain considerably long.

The problem would have been easier to solve if all characters in Urdu had different shapes. Unfortunately, Urdu characters can be grouped into multiple classes, with each class containing anywhere from two to five characters. All characters in a class use the same base shape, and the individual characters in a class are distinguished from each other by the number and position of dots (diacritic marks). In Urdu, these dots appear either above or below the base shape. A little less than half of the Urdu characters (17 out of 39) belong to such classes, and one, two, or three dots are used to differentiate between the various characters.


Urdu uses the Arabic script for writing, with the most prevalent style being Nastalique. Published research in Urdu text recognition is almost non-existent; however, considerable research has been done on Arabic text recognition, which uses the Naskh style of writing. The Arabic language is considered a difficult one, with a much richer alphabet than the Latin: the form of a letter is a function of its position in the word (isolated, initial, medial or final), it changes its shape depending upon that position, each shape has multiple instances, and words are written from right to left (RTL) [29].

2.2 Previous Work on Urdu OCR

The research study conducted by U. Pal and Anirban Sarkar [55] at the Indian Statistical Institute states a number of difficulties in the development of an OCR for Urdu; the most important of these are the large number of characters in the Urdu alphabet and the similarity in the shapes and forms of many of them. Their system proposes the development of an OCR on three embedded processes:

i- Skew detection and correction

ii- Line segmentation

iii- Character segmentation

The technique used for skew detection and correction is based on the Hough transform, which is applied to selected components and computes results on selected candidate points.
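A generic MATLAB sketch of Hough-based skew estimation follows (our illustration; Pal and Sarkar work on selected components and candidate points, which is not reproduced here, and the handling of the angle wrap-around at +/-90 degrees is simplified):

% Minimal sketch: estimate and correct page skew with the Hough transform.
img = imread('page.png');                   % grayscale scanned page (assumed file)
bw = ~im2bw(img, graythresh(img));          % binarize; text pixels become 1
[H, theta, rho] = hough(bw);
p = houghpeaks(H, 5);                       % a few strongest line candidates
ang = theta(p(:, 2));                       % normal angles of the detected lines
skew = 90 - abs(median(double(ang)));       % deviation of the text lines from horizontal
deskewed = imrotate(bw, -skew, 'bilinear', 'crop');  % sign may need flipping per page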


For the segmentation of lines of text from a document, the system refers to the valleys of the projection profile, calculating the number of black pixels in each row. These valleys or ‘troughs’ represent the boundaries between two lines of text, and the lines are separated accordingly.
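A minimal sketch of this valley-based line separation, assuming a clean, deskewed binary page with fully blank rows between lines:

% Minimal sketch: cut a page into text lines at empty projection-profile rows.
bw = ~im2bw(imread('page.png'), 0.5);       % binarized page, text = 1 (assumed file)
profile = sum(bw, 2);                       % number of black pixels in each row
inText = profile > 0;                       % rows that contain ink
d = diff([0; inText; 0]);
tops = find(d == 1);                        % first row of each line of text
bottoms = find(d == -1) - 1;                % last row of each line of text
for k = 1:numel(tops)
    lineImg = bw(tops(k):bottoms(k), :);    % the k-th segmented line
end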

The final step is character segmentation, which is achieved through component labeling and vertical projection profile methods. This involves the recognition of topological features, contour-based features and features obtained through the application of the ‘water reservoir’ method.

The collected features are employed to generate a tree classifier, where the decision at each node of the tree is taken on the basis of the presence or absence of a particular feature.

They state the water reservoir principle as follows: if water is poured from one side of a component, the cavity regions of the component where the water is stored are considered reservoirs. The main concept is that a reservoir is obtained when water is poured from the top (bottom) or the left (right) of the component.
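The top reservoir can be computed per column in the same way as the classic trapped-rain-water calculation on the component's upper contour; the sketch below is our illustration of the principle, not the authors' code:

% Minimal sketch: depth of water standing on a component when poured from the top.
bw = ~im2bw(imread('component.png'), 0.5);  % one connected component (assumed file)
[rows, cols] = size(bw);
top = zeros(1, cols);                       % height of the upper contour per column
for j = 1:cols
    r = find(bw(:, j), 1, 'first');         % first ink pixel from the top
    if isempty(r), r = rows + 1; end
    top(j) = rows + 1 - r;
end
leftMax = top; rightMax = top;
for j = 2:cols, leftMax(j) = max(leftMax(j-1), top(j)); end
for j = cols-1:-1:1, rightMax(j) = max(rightMax(j+1), top(j)); end
water = max(min(leftMax, rightMax) - top, 0);  % reservoir depth in each column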

The system, however, has its limitations: other than the basic and isolated characters and numerals, more complex and compound characters or ligatures are not recognized. The study also does not report recognition results; only the segmentation accuracy of the text is reported. Proposed extension work includes upgrading the system to recognize compound characters and ligatures.

Lodhi and Matin [46] published work dealing with the development of a robust Urdu character pattern classification, representation and recognition system using Fourier descriptors for optical networks. The Fourier transform is a mathematical operation that gives the spectral density of an image, i.e. the distribution of the different frequency components of that image. The goal is to extract a finite set of numerical features from a closed curve, features that tend to separate the shapes of different classes relative to the intra-class dispersion. The end product is a system that can classify patterns even if they are deformed by transformations like rotation, scaling and translation, or a combination of them, in the presence of noise. The pre-processing stage is important as it filters out noise and improves the image through various algorithms. In the next stage, Fourier descriptors are used to uniquely represent the given characters' polygonal signatures, thereby recognizing them even when they have undergone geometric transformations.
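A minimal sketch of the standard invariant Fourier-descriptor construction for a closed character boundary follows (the exact normalization used by Lodhi and Matin may differ):

% Minimal sketch: invariant Fourier descriptors of a closed boundary.
bw = ~im2bw(imread('char.png'), 0.5);       % one character image (assumed file)
B = bwboundaries(bw);                       % traced boundaries, outer boundary first
z = complex(B{1}(:, 2), B{1}(:, 1));        % boundary as a complex signature x + iy
F = fft(z);
F(1) = 0;                                   % drop the DC term: translation invariance
F = F / abs(F(2));                          % divide by the first harmonic: scale invariance
descriptor = abs(F(2:11));                  % magnitudes: rotation/start-point invariance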

Shumaila Malik and S. A. Khan [51] proposed a system which takes online input from the user, who writes an Urdu character/letter with a stylus pen or mouse, and converts the handwriting information into Urdu text. The process of online handwritten text recognition is divided into six phases, each of which uses a different technique for recognition depending upon the speed of the writer and the level of accuracy.

Faisal Shafait et al [25] present a layout analysis for Urdu document image understanding, highlighting Urdu's right-to-left reading and writing order in contrast with Latin script languages, which run from left to right. The work considers layout analysis an important component of an OCR. The authors have experimented with a method of extracting text lines in the reading order of the Urdu script, which is presented as an essential consideration for Urdu document image understanding.


Inam Shamsher et al [31] propose a method for recognizing isolated characters, claiming 98.3% accuracy for the printed Urdu alphabet. However, the system claims to be script independent and yet is designed for Urdu only. The objective of the research is to develop an efficient recognition system for Urdu characters using minimum processing time.

The Multi-Layer Perceptron (MLP) network is described as having three layers: one input, one hidden and one output. The input layer has 150 neurons, the hidden layer 250 and the output layer 16: the input is a binary image of size 10x15 pixels, the 250 neurons of the hidden layer were decided on a trial-and-error basis, and the 16 output neurons correspond to the 16 bits of a character code in the Unicode character encoding scheme.

The paper does not clarify what tools were used for system implementation, though the algorithmic details of the system training and neural network implementation resemble those of the MATLAB Neural Network Toolbox.

The network of the developed software has been trained and tested on the Arial font at a 72 pt font size. It does not, however, indicate whether the system would work with equal efficiency at a smaller, more practical font size.

The system is also limited to working on isolated characters in a single line, showing no capacity to segment text images into lines of text. Although the feature extraction methods are claimed to be simple and robust, it is not mentioned which features are extracted or through which techniques.


The 98.3% accuracy is claimed without giving any account or details of the experimental processes or procedures.

Zaheer Ahmad et al [73] have published a paper entitled ‘Urdu Nastalique Optical Character Recognition’, but within the paper the explanation is that ‘a prototype of the system has been tested on Urdu text.’ There is very little description of Nastalique in the paper.

The paper discusses Urdu script characteristics and a simple but, as claimed by the authors, novel and robust technique to recognize the printed Urdu alphabet without using a lexicon. The technique uses the inherent complexity of the Urdu script for character recognition: a word is scanned and analyzed for its level of complexity, and where the level changes, the point is marked for a character. The character is then segmented and fed to a neural network.

Character segmentation is explained in three steps. In the first step, lines of text are identified; then words are separated; and in the final step, characters are segmented and extracted from the words or sub-words using the complexity level. These are then fed to a neural network for final recognition/classification. Throughout the paper, Urdu words are presented in disjoint reverse order, e.g. (و د ر ا), except for a few, e.g. (اﺳﻢ), which is printed with the correct ligature form.

Table 1 does not clarify or present the forms of Urdu letters it claims to; in fact, it presents all the Urdu words in their disjoint, reverse order, i.e. left to right. The contents of Table 2 are equally questionable in clarity and presentation.


Three levels of complexity are mentioned for characters, yet no clear explanation is given as to what forms the basis for differentiation between the various levels.

The word ‘character’ is repeatedly confused with the word ‘alphabet’, although the two have completely different meanings in the context of languages. There are instances of other rather significant technical errors, e.g. ‘an isolated word scanned vertically from right to left’ and horizontally from ‘top to bottom’.

The paper claims that the system achieves 93.4% accuracy but does not provide supporting evidence of the procedures.

There are a total of six references to other papers, and five of them are cited together in one instance.

2.3 Approaches for Arabic script OCR

Arabic characters have features that make the direct application of character classification algorithms developed for other languages difficult to achieve, as the structure of Arabic is very different [50].

An extensive literature survey on Arabic script OCR showed that researchers in this area have followed mainly two different approaches for the implementation of an OCR system for printed Arabic script text, namely segmentation-based and ligature-based approaches.


2.3.1 Ligature-based Approach

In the ligature-based approach to the implementation of Arabic script OCR there are only two levels of segmentation, at the end of which the system has isolated character shapes or ligatures, segmented out of the lines of text.

2.3.2 Segmentation-based Approach

If the segmentation-based approach is followed for the implementation of Arabic script OCR, then at the recognition phase the text images must be segmented down to the character level, similar to a Latin script OCR; that is, all the ligatures in the words are segmented into their constituent character shapes.

2.4 Previous Work in Ligature-Based Arabic OCR

Here we present a brief overview of the previous work that has been done on ligature-based Arabic OCR.

Al-Badr and R. Haralick [3] highlight some of the hindrances in the development of an Arabic OCR and the reasons for inadequate research in this area. They attribute the difficulties in Arabic character recognition to the more complex features of the Arabic script, e.g. cursiveness, vertical stacking of characters to form many of the ligatures, context sensitivity, and vertical overlapping of shapes. The paper discusses the design and implementation of an Arabic word recognition system which works on the principle of symbol recognition without initially segmenting words into characters, claiming that most recognition errors occur at the crucial stage of segmentation because of the typical shape combinations of characters in the Arabic script. The system first recognizes the input word by detecting a set of ‘shape primitives’ on the word. It then matches the regions of the word with a set of symbol models. The recognized word is thus presented in the form of a spatial arrangement of symbol models matching the regions of the word. Since the possible combinations of symbol models are potentially large, the system imposes constraints in terms of word structure and spatial consistency. The accuracy of the system is reported as 94.1% for isolated, scanned symbols and 73% for scanned words.

Pechwitz and Märgner [58] present an off-line recognition system for Arabic handwritten text. The work highlights the efficiency of cursive text recognition methods based on Hidden Markov Models (HMMs). The study was conducted on a semi-continuous, one-dimensional HMM and describes in detail the modification and adaptation of the preprocessing and feature extraction processes for the recognition of Arabic writing. The experiments were based on first estimating the normalization parameters of each binary word image, followed by normalization of height, length and baseline skew. The features are then collected using a sliding window technique.
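The sketch below illustrates sliding-window feature extraction of the kind such HMM systems use; the window width and the two features are assumptions chosen for illustration:

% Minimal sketch: turn a normalized word image into an observation sequence.
word = ~im2bw(imread('word.png'), 0.5);     % normalized binary word image (assumed file)
[h, w] = size(word);
win = 3;                                    % window width in pixels (assumed)
feats = [];
for x = 1:win:(w - win + 1)
    strip = double(word(:, x:x + win - 1)); % current vertical strip
    density = sum(strip(:)) / numel(strip); % ink density in the window
    rowInk = sum(strip, 2);
    centre = sum((1:h)' .* rowInk) / max(sum(rowInk), 1);  % vertical centre of ink
    feats = [feats; density, centre / h];   % one feature vector per window
end
% 'feats' is the observation sequence that would be presented to the HMM.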

Alma’adeed et al. [6] present a complete scheme for character recognition of totally unconstrained Arabic text based on a model-discriminant HMM. The system proposes feature extraction following the removal of variations in the word images which do not affect the identity of the written word. The system then encodes the skeleton and edges of the word, and a classification process based on the HMM is used. The result is a word matching one in a dictionary. The study reports successful results from a detailed experiment.


Alma’adeed et al. [5] present a scheme to recognize handwritten Arabic text. The overall engine is modeled on the basis of multiple HMMs and a global feature extraction scheme. The system initially removes variations in the word images and then encodes the skeleton and edge of the word for feature extraction. A rule-based classification is then used as a global recognition engine. Finally, for each group, the HMM approach is used for trial classification. The given output is a word that matches one present in a lexicon. Once the model has been established, the Viterbi algorithm is used to recognize the segments of letters composing a word. The study gives details about the segmentation step as well as the off-line recognition operations. The study emphasizes the development of two substantially different recognition engines, because there are multiple ways in which one Arabic word can be written down: the first engine is a global feature scheme using ascender and descender features and a rule-based classification engine, while the second is based on a set of features using an HMM classifier.

Farah [27] presents work on the construction of a recognition system around a modular architecture of feature extraction and word classification units, in an attempt to solve the problem of recognizing handwritten Arabic bank checks. The research stresses the efficiency of a multi-classifier system with three parallel classifiers working on the same set of structural features. The classification-stage results are first normalized, and after using the contextual information present in the syntactical module, the final decision on the candidate words is made.

El-Hajj [22] describes a one-dimensional HMM off-line handwriting recognition system using an analytical approach. Specific models are used for each character, and word models are built by concatenating the appropriate character models. The system is supported by a set of robust, language-independent features extracted from binary images. The study focuses on baselines as an important feature in character recognition; the baseline-dependent features are added to the original set of features. Feature vectors are extracted using the sliding window technique.

Alaa Hamid and Ramzi [30] present a technique to segment handwritten Arabic text through a neural network. It comprises three initial steps before the Artificial Neural Network (ANN) verification: scanning, binarization and, finally, feature extraction. A recursive, conventional algorithm is used to segment text into connected blocks of characters and to generate pre-segmentation points for these blocks; this heuristic algorithm generates the topographic features from the text and calculates the pre-segmentation points. An Artificial Neural Network then verifies the accuracy of these segmentation points. The results show an accuracy range between 53.11% and 69.72%, considering various features. The inaccuracies are attributed to complexities in the shapes of characters or to dislocated external objects.

Khorsheed [41] proposes a segmentation-free approach to recognize Arabic text using the HMM Toolkit (HTK), a portable toolkit for building and manipulating Hidden Markov Models. It decomposes the document image into text-line images and extracts a set of simple statistical features using the sliding window technique; the feature vector extracted from the text is thus computed as a function of an independent variable. The toolkit is then used to train models on a sequence of training images and, finally, for the recognition of characters. The experiments were initially conducted on a corpus of data in the Arabic font ‘Thuluth’ and later on others. Tahoma and Andalus scored the highest recognition rates, while Naskh and Thuluth performed lowest. The system was nevertheless capable of recognizing complex ligatures and overlaps, and showed an overall improvement with the use of the tri-model scheme. Suggested improvements for future development are the expansion of the data corpus and the utilization of further HTK capabilities.

Abuhaiba [2] presents possible solutions as a progressive approach toward overcoming the cursive rules that put constraints on the Arabic language for digital recognition through an OCR. Abuhaiba proposes the development of new font styles that appear cursive but are in reality discrete on closer inspection. This new font system suggests the creation of a discrete Arabic script rather than a cursive one, so that the bulk of information created every day in this script remains machine recognizable. The development of such a recognition engine would sidestep the problem of Arabic character recognition because it does not imply the development of a new recognition technique, but rather the application of the same techniques used for Latin script recognition to newly developed Arabic fonts; this is within the reach of any research undertaken for cursive scripts where the old or original styles or forms can be compromised for newer, modified ones. However, the development of a Nastalique OCR implies the preservation of its unique and typical font style and ligature shapes; moreover, the ligatures are more cursive and calligraphic than Arabic Naskh, and therefore Abuhaiba's proposition cannot be applied to solve the problem of an Urdu OCR for Nastalique.


Erlandson et al [23] implemented a word-level Arabic text recognition system that did not require character segmentation. They characterized the shapes of Arabic words by unique feature vectors, which were then matched against a database of feature vectors derived from a dictionary of known words. The database stored multiple feature vectors for each word in a dictionary of 48,200 words, and the word whose feature vectors matched most strongly was returned as the hypothesis.

Al-Badr and Haralick [3] designed and implemented an Arabic word recognition system that recognized an input word by detecting a set of shape primitives on the word. The regions of words represented by these shape primitives were then matched with a set of symbol models. The description of the recognized word was obtained from a spatial arrangement of the symbol models that were matched to regions of the word.

Bazzi et al [14] implemented a segmentation-free OCR system based on Hidden Markov Models (HMMs). They chose the text line as the major unit for training and recognition. A page of printed text was decomposed into a set of horizontal lines, using the horizontal position along each line as an independent variable: they scanned a text line from right to left, and at each horizontal position a feature vector was computed from a narrow vertical strip of the input. The system was based on a 14-state HMM for each character. The output of the system comprised the sequence of characters with the maximum likelihood.

Amin [10] analyzed the shape of Arabic words with a unique vector of features. This feature vector was then represented in attribute/value form to an inductive learning system that created rules for recognizing characters of the same class. The technique was composed of three major steps. The first step was pre-processing, in which the original image was scanned at 300 dpi, transformed into a binary image, and the connected components were formed. Second, global features such as the number of sub-words, the number of peaks within each sub-word, and the number and position of the complementary characters were extracted. Finally, machine learning was used for character classification to generate a decision tree.

Kraus and Dougherty [44] implemented the morphological hit-or-miss transform. They developed a basic class of structuring-element pairs for segmentation-free character recognition via the hit-or-miss transform. Both the hit and miss structuring elements were selected so that the transform could be applied across the test image without prior segmentation: the authors marked the locations at which one structuring element fitted within the pixel set corresponding to a shape of interest while the other structuring element fitted outside that pixel set.
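MATLAB's Image Processing Toolbox exposes this operation directly as bwhitmiss. The toy sketch below marks every position where a small 'hit' pattern fits inside the foreground while a surrounding 'miss' ring fits the background; the patterns are illustrative, not Kraus and Dougherty's structuring elements.

% Minimal sketch: segmentation-free shape marking with the hit-or-miss transform.
bw = ~im2bw(imread('page.png'), 0.5);       % binary text image (assumed file)
hit = strel(ones(3));                       % must fit inside the stroke (toy pattern)
miss = strel(padarray(zeros(3), [2 2], 1)); % ring that must fit in the background
marks = bwhitmiss(bw, hit, miss);           % 1-pixels mark locations where both fit
[y, x] = find(marks);                       % shape positions found without segmentation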

Clocksin and Khorsheed [42] used a new technique for recognizing Arabic cursive words. They used the holistic approach to word recognition, in which the word is treated as a whole and the features are extracted from the unsegmented word image. Each word was represented by a separate template comprising part of its Fourier spectrum. Recognition of a word was based on the Euclidean distance to those templates.

Jelodar et al [32] used the morphological hit-or-miss transform to recognize Persian script in machine-printed documents. They first used the horizontal histogram of the document to separate lines, and the vertical histogram of the lines to separate sub-words, and then removed the dots to make the recognition stage simpler. The words were then thinned using a sequential thinning method based on the morphological hit-or-miss transform. After thinning, feature extraction was done by an exhaustive search process using the hit-or-miss operator with a complete set of structuring elements corresponding to the different geometric patterns of interest. These extracted features, along with the number and position of the dots, provided the information required for identifying the sub-word.

Fan Xiaolong and Brijesh Verma [26] compared segmentation-based and non-segmentation-based neural techniques for cursive word recognition. They discuss three papers on non-segmentation-based cursive word recognition, described as follows. Govindaraju et al proposed a segmentation-free neural technique for cursive word recognition: they extracted features by traversing the strokes in the word image without performing word segmentation, and the word image was then mapped onto a feature-vector matrix of the uptrends and downtrends of strokes. Guillevic et al proposed a segmentation-free method to extract four types of global features: ascenders, descenders, loops and word length; a contour-tracing procedure was applied to the input image and a representation of the input was obtained as a list of connected components. Parisse extracted global features based on the word's upper and lower contours, using training and recognition based on n-gram extraction and identification.

Obaid and Dobrowiecki [54] proposed a segmentation-free method, called N-markers, for the recognition of printed Arabic text. Their method was a mixture of global and structural approaches. They collected informative points lying at the centers of the characters; these points were the basis of the coordinate system for the configuration of sensors, called N-markers, designed to identify the necessary strokes. By distributing enough markers over a character, a letter or a group of letters in a text line could be detected.

Khorsheed and Clocksin [43] proposed an approach based on a hidden Markov model (HMM), where the word is recognized as a single unit. Their method avoided segmenting words into characters or other primitives and used a predefined lexicon, which acted as a look-up dictionary. They first applied a thinning algorithm based on Stentiford's algorithm to find the skeleton graph of the word image and then calculated the centroid of the image. The word image was then transformed into a stream of feature vectors, using the fact that the skeleton graph of the image consists of a number of segments, each of which starts and ends at a feature point. These segments were listed in descending order of the horizontal coordinate of each segment's starting feature point. During segment extraction, loops were also extracted from the skeleton. The segments were then transformed into 8-dimensional feature vectors. Each feature vector was mapped to the closest symbol in the lexicon, and the resulting stream of observations was presented to the HMM. The path-discriminant HMM approach was used, in which a pattern is classified to the word with the maximum path probability; the Viterbi algorithm was used to find the optimal path.
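For reference, a compact, generic Viterbi decoder over log-probabilities is sketched below (our illustration, not the authors' implementation; A, B and pi0 are assumed model parameters):

% Minimal sketch: Viterbi decoding of the most probable HMM state path.
% A(i,j) = P(state j | state i); B(j,k) = P(symbol k | state j); pi0 = initial probs.
function path = viterbi(A, B, pi0, obs)
N = size(A, 1);
T = numel(obs);
logd = log(pi0(:)) + log(B(:, obs(1)));     % best log-probability ending in each state
psi = zeros(N, T);                          % back-pointers
for t = 2:T
    scores = repmat(logd, 1, N) + log(A);   % scores(i,j): be in state i, move to j
    [best, arg] = max(scores, [], 1);       % best predecessor for every state j
    logd = best(:) + log(B(:, obs(t)));
    psi(:, t) = arg(:);
end
path = zeros(1, T);
[ignore, path(T)] = max(logd);              % most probable final state
for t = T-1:-1:1
    path(t) = psi(path(t + 1), t + 1);      % follow the back-pointers
end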

Reza Safabakhsh et al [62] proposed a system for the recognition of Farsi Nastalique handwritten words using a continuous-density, variable-duration hidden Markov model (CDVDHMM). In this system, after the pre-processing stage, the ascenders, descenders, dots and other secondary strokes are eliminated from the original image. Segmentation is done by analyzing the upper contour, thus avoiding the under-segmentation problem, while the variable-duration states of the model handle the over-segmentation problem. Features are extracted which are invariant to size and shift. At the recognition stage, a modified version of the Viterbi algorithm is used.

Mohammad S. Khorsheed et al [42] introduced a holistic approach for Arabic word recognition which uses a normalization process to compensate for dilation and translation. The adopted process transforms the image of an Arabic word from Cartesian coordinates to polar coordinates, similar to a log-polar transformation; rotation is also converted into translation by this transformation.

2.5 Previous Work in Segmentation-Based Arabic OCR

Ligature-based OCR has its limitations and cannot be expected to identify all possible ligatures, because it needs a prohibitively large database to store all the possible combinations of characters that can form long ligatures. Since every ligature is composed of characters, the segmentation of ligatures could generate individual characters that can be recognized.

Machine-generated (printed) Arabic (Naskh style) text usually follows a horizontal baseline where most characters, irrespective of their shape, have a horizontal segment of constant width. If these horizontal constant-width segments can be separated from a ligature, the remaining components of the ligature can be recognized. However, this is not the case with Nastalique, which has multiple baselines, horizontal as well as sloping, making Nastalique a complex style for optical recognition.
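The following sketch illustrates the idea for Naskh under strong simplifying assumptions (a single horizontal baseline and a known approximate stroke width): the baseline row is taken as the maximum of the horizontal projection, a band of that width is erased, and the surviving connected components are character-body candidates. Nastalique defeats exactly this trick, since no single such row exists.

% Minimal sketch: strip the horizontal joining band from a Naskh ligature.
lig = ~im2bw(imread('ligature.png'), 0.5);  % one ligature image (assumed file)
[peak, base] = max(sum(lig, 2));            % row with the most ink ~ the baseline
w = 2;                                      % half of the stroke width (assumed)
lig(max(base - w, 1):min(base + w, end), :) = 0;  % erase the constant-width band
[labels, n] = bwlabel(lig);                 % surviving blobs ~ character bodies
fprintf('%d candidate character segments\n', n);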

This section presents a brief overview of the previous work that has been done on segmentation-based Arabic OCR.


Bouslama and Kishibe [15] proposed a method that combined structural and statistical approaches for feature extraction with a classification technique based on fuzzy logic. They segmented characters into a main segment and complementary characters. The main segment was then centred and projected horizontally and vertically. The classification features were extracted from the number of complementary characters and from the horizontal and vertical projection profiles of the main character. A set of fuzzy rules was used for classification. The recognition algorithm was tested on three different fonts and high recognition rates were achieved.

Zidouri et al [77] proposed a sub-word segmentation technique that was independent of font type and font size. After applying pre-processing techniques, they employed horizontal and vertical segmentation to segment a page into separate lines and the lines into sub-words, respectively. To divide sub-words into characters, they first skeletonized the image of the sub-words without dots. Then, for all the rows, they scanned the image row-wise from right to left to find a band of horizontal pixels of length greater than or equal to the width of the smallest character. The vertical projection of this scanned band was then taken, and if no pixel was found, a vertical guide band was drawn in an empty image; in this way several guide bands were drawn for all the rows. A special mark below the location of the baseline was used for the guide bands of each row. In order to select the correct guide band, several features were extracted and tested against several predefined rules: a guide band satisfying the rules was selected, otherwise it was rejected.


Motawa et al [52] proposed an algorithm for the automatic segmentation of Arabic words using mathematical morphology. They first digitized the image using a 300 dpi scanner and then detected and corrected any slanted strokes: an erosion operation was applied to the image, the average slope of all the strokes was computed, and every pixel was then transformed to a new location according to a formula to correct the slant. After slant correction, connected components were constructed, which formed the skeleton for all further analysis of the image. Morphological opening and closing operations were applied to the word image to locate singularities and regularities. Singularities represent the start, the end, or the transition to another character; regularities contain the information required for connecting a character to the next character. The regularities were therefore the candidates for segmentation.

Tolba and Shaddad [67] proposed a segmentation algorithm for the separation of Arabic characters. In their algorithm, they slid a window over a word horizontally from right to left and, at each instant, calculated a segmentation parameter which was then matched against a predefined set of threshold values. If the segmentation parameter was less than the threshold value, the region was marked as a silence region. Detecting a silence region after the beginning of a letter identified the end of the letter; when the value of the segmentation parameter increased again, the beginning of the next letter started.

Al-Yousefi and Udpa [9] introduced a statistical approach for Arabic character recognition. They used a two-level segmentation scheme: the words were first segmented into characters, and a lower-level segmentation was then applied to segment the characters into primary and secondary parts (dots and zigzags). They then computed the moments of the horizontal and vertical projections of the primary parts and normalized them to the zero-order moment. The features were extracted from the normalized moments of the vertical and horizontal projections, and the classification of the primary characters was done using a quadratic Bayesian classifier. The secondary parts were isolated and identified separately; their recognition was done during the pre-processing and segmentation stages.

Amin and Mari [11] proposed a structural probabilistic technique for the automatic recognition of multi-font printed Arabic text, based on character recognition and word recognition. They first transformed the image of the text into separate lines of text by taking a horizontal projection, and then segmented the text lines into words and sub-words by taking a vertical projection. To segment words into characters, they took the vertical projection of the word; columns whose projection values fell below the average over all columns indicated connectivity points, and each part of the word delimited by such columns was segmented as a separate character. This resulted in a number of segments that were then connected together in the recognition phase to form the basic shapes of the characters, while segments not connected to any other segment were considered complementary characters. They used Freeman codes of the characters and consulted a character recognition dictionary to recognize characters. The word recognition part of the technique used a tree-representation lexicon and the Viterbi algorithm.
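A minimal sketch of this below-average vertical-projection rule (file name and details assumed):

% Minimal sketch: candidate character cuts where the vertical projection is weak.
word = ~im2bw(imread('word.png'), 0.5);     % binary word image (assumed file)
proj = sum(word, 1);                        % ink count per column
avg = mean(proj(proj > 0));                 % average over the non-empty columns
weak = proj > 0 & proj < avg;               % thin, below-average columns
cuts = find(diff([0 weak]) == 1);           % left edge of each weak run = cut candidate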

Zheng et al [75] proposed a new segmentation algorithm for machine-printed Arabic characters. Their algorithm was based on the vertical histogram of sub-words and on rules derived from four kinds of features. Initially, they used some rules to check whether a sub-word consisted of only one character. They then scanned the vertical histogram of a sub-word consisting of more than one character in the direction of writing and marked a point as a potential segmentation point if the histogram value increased and the point was near the baseline. Further rules were then used to check whether the point was a real segmentation point, and the sub-word was segmented at it.

Arica and Yarman-Vural [12] proposed an analytical approach for offline cursive handwriting recognition based on a sequence of segmentation and recognition algorithms. They did not pre-process the image for noise reduction; however, they estimated global features such as the baseline, average stroke width/height, skew and slant angle, and integrated them to improve segmentation and recognition. They determined the segmentation regions in the word image and then searched these regions for segmentation boundaries between characters: the shortest segmentation path from the top to the bottom of the word image was searched for, segmenting the connected characters within the segmentation regions. Each character candidate was represented by a fixed-size string of codes, and a feature extraction scheme was employed. HMM training was applied to the selected output of the segmentation stage, and the HMM and feature-space parameters were estimated. Each string of codes was fed into the HMM recognizer and labelled with its HMM probability. The candidate characters and the associated HMM probabilities were then represented in a word graph, and the best path in the graph, corresponding to a valid word, was found using dynamic programming.

Lorigo and Govindaraju [47] introduced a new algorithm for the segmentation and pre-recognition of offline Arabic handwritten text. They used binary text images, with the baseline heights at the left and right edges of each image, from the IFN/ENIT image database as input to their system. They detected the connected components and separated them on the basis of dots and sub-words. Dots were combined into dot groups and a new image was made for each sub-word. They identified the loops of each sub-word image and modified the baseline based on the projection of black pixels onto a vertical line, which yielded the horizontal sub-word baseline. Then a list of candidate segmentation points, where each point was a range of x-coordinates, was determined using two methods. The gradient method was used to compute horizontal and vertical gradients in the baseline strip, which were then used to indicate the connection points of characters. The down-up method was used to find the short candidate points that were missed by the gradient method. These two methods over-segmented the image, so extra breakpoints in loops and at edges were removed using knowledge of letter shapes. Dot groups were then assigned to letters for pre-recognition.

Lee et al [45] presented a word segmentation algorithm for segmenting a word into a prefix-stem-suffix sequence using a small, manually segmented Arabic corpus and a table of prefixes and suffixes of the language. They first tokenized an Arabic sentence and used a trigram language model to segment the tokens into a stream of morphemes; the trigram language model probabilities were estimated from the morpheme-segmented corpus. For segmentation, each token was compared against the prefix/suffix table and all the matching prefixes and suffixes were identified. Each matching prefix/suffix was then identified at each character position, and all prefix-stem-suffix sequences were enumerated. The trigram language model score for each segmentation was calculated, and the top N scored segmentations were set aside; some illegal sub-segmentations were filtered out on the basis of the manually segmented corpus information. After developing the segmenter on the basis of the manually segmented corpus, the segmentation accuracy was improved by iteratively expanding the stem vocabulary and retraining the language model on a large unsegmented Arabic corpus. The model parameters were re-estimated with the expanded vocabulary and training corpus.

Goraine et al [28] presented a segmentation-based approach for offline Arabic character recognition. They applied a thinning algorithm to isolated Arabic words, input via a video camera, to get the string points of the characters. They then segmented the words into principal and secondary strokes, classifying points into connection points, which were pixels with two neighbours; feature points, which were pixels with either one or three neighbours; and strokes, which were strings of pixels between two consecutive feature points. They then repeatedly applied a stroke-finder algorithm to find the start and end points and to trace each stroke from start to end until all strokes were traversed. The strokes were then coded using 8-direction codes and represented by strings of characters. For stroke classification, 11 primitives were used, and each stroke string was compared with each primitive to find the exact match; new shapes were stored in a lookup table.

Parhami and Taraghi [56] presented a technique for the automatic recognition of printed Farsi text. The technique is applicable, with little or no modification, to printed Arabic text (Farsi is written in the Arabic script and also uses the Naskh style of writing). The most important parts of the system are: (1) the isolation of symbols within each sub-word; and (2) recognition. The main step in segmenting symbols is to determine the pen (script) thickness, which is used to find candidate connection columns. Practical application of the technique to Farsi newspaper headlines has been 100% successful, as reported by the authors. However, fonts of smaller point size will result in less-than-perfect recognition; the system is heavily font dependent, and the segmentation process is expected to give degraded results in some cases.

Almuallim and Yamaguchi [8] first segmented words into strokes, since it is difficult to separate a cursive word directly into characters. These strokes were then classified using their geometrical and topological properties. The relative positions of the classified strokes were examined, and the strokes were combined in several steps into a string of characters representing the recognized word. A maximum recognition rate of 91% was achieved; the system failure, in most cases, was due to wrong segmentation of words.

Ramsis et al [59] adopted a method of segmenting Arabic typewritten characters after recognition. As the characters are not yet separated, they assume that the rightmost columns of a word, equal in number to the width of the smallest character, constitute a character. Moments are calculated and checked against the feature space of the font; if a character is not found, another column is appended to the current portion of the word and the moments are calculated and checked again. This process is repeated until a character is recognized or the end of the word is reached.

The method allowed the system to handle overlapping and to isolate the connecting baseline between connected characters, but it seems sensitive to font type and input pattern variations. The system also uses intensive computations to compute the required accumulative moments. No figures are reported regarding the system's recognition rate and efficiency.


Amin and Mari [11] presented a structural probabilistic approach to recognize Arabic printed text. The system is based on character recognition and word recognition: character recognition includes the segmentation of words into characters using vertical projections and the identification of the characters, while word recognition is based on the Viterbi algorithm and can handle some identification errors. The system was tested on just a few words and no figures were reported about its performance. The method has inherent ambiguities and deficiencies due to the interconnectivity of Arabic text.

Al-Emami and Usher [4] presented an on-line system to recognize handwritten Arabic words. Words are segmented into primitives that are usually smaller than characters. The system is trained by being fed the specifications of the primitives of each character. In the recognition process, the parameters of each primitive are found and special rules are applied to select the combination of primitives that best matches the features of the learned characters. The method requires manual adjustment of some parameters. The system was tested on only 170 words (540 characters), written by 11 different subjects.

Zahour et al [74] presented a method for the automatic recognition of off-line Arabic cursive handwritten words based on a syntactic description of words. The features of a word are extracted and ordered to form a tree description of the script with two primitive classes: branches and loops. In this description, the loops are characterized by their classes, and the branches by their marked curvature, their relationships, and their clockwise or counterclockwise direction. Geometrical attributes are applied to the primitives, which are combined to form larger basic forms; a character is then described by a sequence of the basic forms. The reported recognition rate of the system is 86%.


Abuhaiba [1] presented a text recognition system capable of recognizing off-line handwritten Arabic cursive text. A straight-line approximation of an off-line stroke is converted to a one-dimensional representation, from which tokens are extracted; the tokens of a stroke are then re-combined into meaningful strings of tokens. Algorithms to recognize and learn token strings were presented, together with the process of extracting the best set of basic shapes representing the best set of token strings that constitute an unknown stroke. A method was developed to extract lines from pages of handwritten text, arrange the main strokes of the extracted lines in the order in which they were written, and assign secondary strokes to main strokes. The assigned secondary strokes are combined with basic shapes to obtain the final characters by formulating and solving assignment problems for this purpose. The system was tested against the handwriting of 20 subjects, yielding overall sub-word and character recognition rates of 55.4% and 51.1%, respectively.

In general, the strategies to segment cursive script for recognition purposes can be classified into two broad categories. In the first category, the word is segmented into several characters and character recognition techniques are applied to each segment. This method depends heavily on the accuracy of the segmentation points found; however, such an accurate segmentation technique is not yet available.

In the other category, a loose segmentation finds a number of potential segmentation points in a pre-segmentation procedure; the final segmentation and the word length are determined later, in the recognition stage, with the help of a lexicon.


Although there have been attempts to segment cursive script into characters, we could find only very few Arabic OCR systems, and their performance still lags behind that of Latin and Chinese systems. In [39][40], Kanungo et al. reported evaluation results for two popular Arabic OCR products: Sakhr's Automatic Reader 3.01 and Caere's OmniPage Pro v2.0 for Arabic. Sakhr's Automatic Reader is the best-known Arabic OCR software. In their evaluation, they established that Sakhr and OmniPage have page accuracy rates of 90.33% and 86.89%, respectively. These are not high recognition rates compared to those of Latin and Chinese OCR systems; again, we stress that this is mainly due to the cursive property of the Arabic script.


CHAPTER 3 Video OCR

3.1 Introduction

Optical Character Recognition for Roman script languages is almost a solved problem for document images, and researchers are now focusing on the extraction and recognition of text from video scenes. This new and emerging field in character recognition is called Video OCR and has numerous applications like video annotation, indexing, retrieval, search, digital libraries, and lecture video indexing.

The emerging field of character recognition in video frames is attracting research on other scripts like Chinese but, to the best of our knowledge, no work has been reported as yet on Video OCR for Arabic script languages like Arabic, Persian and Urdu.

As an extension of our Nastalique OCR to Video OCR for Arabic script languages, we have also performed experiments on video text identification, localization and extraction for its recognition. We have used the MACH (Maximum Average Correlation Height) filter to identify text regions in video frames; these text regions are then localized and extracted for recognition. All research and development work is done using Matlab 7.0. Experiments and results are reported in the thesis.

An extensive literature survey on Video OCR, including discussions of the various techniques used by researchers, is also included here.


3.2 Introduction to Video OCR

Machine recognition of text images from documents, printed or handwritten, has been an area of popular research and development with a considerable degree of success. A recent extension of this field has been text recognition from video frames. This is, however, a completely different domain: text in printed documents is restricted to uni-colored characters written on a uniform background, making it relatively easy to separate the text from the background.

In video text recognition, a number of noise components make the text comparatively more difficult to separate from the background. Besides, the characters can be moving, or presented in a variety of colors, sizes and fonts that are not standardized. Added to this is the fact that the background is usually moving, making text extraction a more complicated procedure.

Most videos contain two kinds of text: scene text and artificial text. Scene text is text that becomes part of the scene itself as it is recorded at the time of filming; artificial text, on the other hand, is created independently of the scene and is laid over it at a later stage, during post-processing. The appearance of artificial text is therefore carefully directed. This type of text carries important information that helps in video referencing, indexing and, later, retrieval.


Such superimposed text on video frames is also used for annotation, semantic video analysis and search. An example is the graphic logo that continuously appears on broadcasts, at a fixed place on the screen, giving either the name of the station or other similar pieces of information to the viewers. This kind of overlaid text is usually easier to recognize automatically and can be used as annotation.

On news broadcasts and reports, anchor names and locations appear on the screen and are readily recognizable text, as they can be extracted by focusing on the bottom third of the screen.

Video text serves a variety of purposes. At the beginning of a video there are titles, names and other pieces of information necessary for the understanding or appreciation of the video; at the end are the credits and other important details, including references.

Sports coverage often includes text that displays the current score; interviews and documentaries display speakers' names or translated subtitles. Within a broadcast, text also gives important information about the subject currently on the screen, its location or other relevant details.

Similarly, text appearing in commercials and advertisements relays the message: the product's name, the company's name, etc. These text appearances are all carefully directed and are important for the video's complete understanding. They are easily recognizable as they are deliberately overlaid on the scene.

In many videos, text often appears as part of the scene itself. This text might be just words, names printed or written on billboards, names of shops, or even words written on a wall, which become part of the scene at the time the shooting is done. This kind of text is the most difficult to separate from the background and recognize, because it can appear just about anywhere on the screen, in any type of lighting, and in any shape, size, font, style or orientation; e.g. words that appear on banners and placards could be straight or wavy, tilted or slanting, and caught from any camera angle.

Superimposed text is fundamentally different from text that is embedded in the view of the scene itself, and recognizing the latter is the most demanding task. The difficulties arise from background complexities and variations and from an inherently low resolution.

3.3 Types of video text

There are two types of text in videos:

i- Scene Text
It appears written on billboards, buildings, banners, shirts, etc. in a video scene.

ii- Artificial Text
This text is added artificially into a video by editing devices after the video is made, and it mostly appears at a particular position on the screen, e.g. horizontally in the lower part. Examples are a speaker's or newscaster's name, information about them, translations of dialogue in a video movie, and subtitles.


3.4 Applications of Video-OCR

Videotext, i.e. the superimposed or artificial text on the frames, can be used for:

i- video annotation,

ii- indexing,

iii- retrieval,

iv- semantic video analysis,

v- search,

vi- digital libraries,

vii- home video summarization, and,

viii- lecture video indexing.

3.5 Literature Survey

Nevenka Dimitrova, Lalitha Agnihotri, Chitra Dorai and Ruud Bolle described the following computational framework for extracting text regions from video frames [53].

The procedure is described as follows:

They classified approaches to extracting text from video into three categories: (i) methods that use region analysis to extract text components, (ii) methods that perform edge analysis to extract characters, and (iii) methods that use texture features to locate the presence of text. However, the authors suggested that a common framework can be designed to describe a generic videotext extraction system. Given an image or a sequence of frames with color or grayscale values, they proposed the following framework, which employs three major steps to extract the text embedded in the frames:

Step 1: Removal of Background Non-Text Scene Content

In the initial step, all non-textual background is removed to obtain candidate text regions in the video frame. This separation can be achieved by several different approaches. One approach is to employ region analysis to determine chromatically or spatially connected and homogeneous components in the frame, grouping them to locate text regions. Pixels of similar color or intensity are combined, and regions that are too large or too small are rejected, retaining only candidate text regions.

Another approach employs edge detection in the video frame to identify and locate text regions. This involves detecting and analyzing the geometrical arrangement of video content, which helps to separate out regions that form regular geometrical patterns in the frames concerned; text lines are detected on the basis of edge directions, filtering text lines out of the background scene. Other approaches use texture-based analysis to extract text regions from the frames. This works on the principle that text content exhibits a distinctive contrast with the background and a certain connectivity in the horizontal arrangement of characters, reflected in the varying color and light intensities as characters connect or space out.
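To make the first of these approaches concrete, the following Matlab sketch locates candidate text regions by edge density, in the spirit of the generic framework above. The function name, window size and thresholds are our own illustrative assumptions, not values taken from the cited paper.

% Hypothetical sketch: edge-density based text candidate detection.
% The window size and the two thresholds are illustrative assumptions.
function mask = textCandidates(frame)
    gray  = rgb2gray(frame);              % work on luminance only
    edges = edge(gray, 'canny');          % binary edge map
    % Local edge density: fraction of edge pixels in a sliding window.
    density = conv2(double(edges), ones(15, 45) / (15 * 45), 'same');
    mask = density > 0.15;                % keep dense, text-like regions
    mask = bwareaopen(mask, 200);         % reject blobs too small to be text
end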

Step 2: Verification of Text Characteristics

The next step is to analyze the extracted text to rule out the possibility of false positive text regions. After the text regions have been drawn out from the video frames, individual regions are analysed, which may involve grouping of connected components.

Common features of text and characters are used as criteria for judging whether the extracted regions present actual text features and attributes. These features may be monochromaticity of individual characters, character size restrictions, horizontal alignment of text, consistent inter-character spacing, etc.

Step 3: Consistency Analysis for Output

The last step typically prepares the remaining text regions for the final usage intended for the detected text.

One approach employed for final detection of text regions is to group characters into words and words into text lines. This process allows their bounding blocks to be marked off. The approach emphasizes automatic location of text and therefore requires human intervention for identifying and recognizing the characters present in the bounding blocks.

Other approaches analyze the text regions further to extract clear-cut character boundaries, generating frames that are ready to be fed to an OCR system. These frames are obtained and represented in the form of bit-maps and can then be recognized automatically. The recognized characters, words and lines can finally be used to annotate indices for video referencing or for querying a video database.

Rainer Lienhart and Wolfgang Effelsberg, in 'Automatic Text Segmentation and Text Recognition for Video Indexing' [61], reported a feature-based scheme to segment and recognize video text for automatic indexing. The proposed scheme works as follows:

i- Presentation of text features

ii- Description of segmentation algorithms based on the described text features

iii- Text recognition

The study explains a straightforward indexing and retrieval scheme which has been used in the experiments to demonstrate the appropriateness of the algorithms for indexing and retrieval of video sequences.

There is an emphasis on extraction of artificial text. It is also noted that the appearance of artificial text is subject to more constraints than that of scene text, because it is made deliberately to be easily readable.

Text Features

The majority of artificial text appearances have the following characteristic features:

i- Characters are in the foreground. They are never partially occluded.

ii- Characters are monochrome.

iii- Characters have size restrictions. A letter is not as large as the whole screen, nor is a letter smaller than a certain number of pixels, as it would otherwise be illegible to viewers.

iv- Characters are rigid. They do not change their shape, size or orientation from frame to frame.

v- Characters are either stationary or linearly moving.

vi- Characters are mostly upright.

vii- Moving characters have a dominant translation direction: horizontally from right to left or vertically from bottom to top.

viii- Characters appear in clusters at a limited distance, aligned to a horizontal line, since that is the natural method of writing down words and word groups.

ix- Characters contrast with their background, since artificial text is designed to be read easily.

Their text segmentation algorithms are based on these features. However, they also take into account the fact that some of these features are relaxed in practice due to artifacts caused by the narrow bandwidth of the TV signal or other technical imperfections.

Edges are localized by means of the Canny edge detector extended to color images, i.e. the standard Canny edge detector is applied to each image band and the results are integrated by vector addition. Edge detection is completed by non-maximum suppression and contrast enhancement.
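The per-band combination step might be sketched in Matlab as below. This shows only the gradient computation with vector addition; the non-maximum suppression and hysteresis stages of a full Canny implementation are omitted, and the function name is our own.

% Simplified sketch of the colour edge step: per-band Sobel gradients
% are combined by vector addition before any thresholding.
function mag = colorGradientMagnitude(frame)
    sy = fspecial('sobel');              % kernel for the vertical derivative
    sx = sy';                            % kernel for the horizontal derivative
    gx = 0; gy = 0;
    for band = 1:size(frame, 3)
        I  = double(frame(:, :, band));
        gx = gx + imfilter(I, sx, 'replicate');   % vector addition of the
        gy = gy + imfilter(I, sy, 'replicate');   % band gradient components
    end
    mag = sqrt(gx.^2 + gy.^2);           % combined gradient magnitude
end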

Recognition of Text

During the implementation, the OCR software development kit Recognita V3.0 for Windows 95 was integrated for text recognition. Most of the overlaid text in video images is written in block letters, so the OCR module for typed text was used to translate the prepared binary text images into ASCII.

Overall, the recognition process could give better and more accurate results if benefit is derived from the repeated occurrence of the same words in consecutive video frames. However, the same character appears in a slightly different form in various frames owing to noise, changes in the background, or the relative position of the word itself. By overcoming these hitches and combining the recognition results into a single final character, higher recognition accuracy may be achieved.


In digital videos artificial text appears in a variety of fonts, decorative or plain, and in various font sizes and styles. This enormous variety compounds the problem of OCR, and errors are inevitable. Another problem is coping with garbage characters resulting from non-character regions that could be removed neither by the system nor by the OCR module.

Jinsik Kim, Taehun Kim and Jiexi Lin, in 'Implementation of a Video Text Detection System' [33], a term report for CS570 Artificial Intelligence, KAIST, South Korea, described the implementation of a video text detection system based on the following methods:

i- Edge detection

ii- Area-based detection

iii- Texture-based detection

iv- Continuous frame detection

Edge detection is based on the idea that there is a distinct difference in brightness and color intensity between the text and the background. Area-based detection rests on the underlying principle that the color within a certain text area is more or less the same. Texture-based detection is founded on the fact that a text area exhibits a clearly different texture, which helps it stand out from the background and presents possibilities for separating it out and preparing it for recognition through segmentation techniques. Continuous frame detection implies that each video frame is compared with the next and the previous ones to spot appearing and disappearing text.


Edge detection is by far the most widely used and perhaps the most effective method employed in video text extraction, and was thus chosen as the method for their experimentation.

Edge Detection

Edge detection can be attained by many methods, e.g. Sobel edge detection and Canny edge detection. The research used Canny edge detection, and the authors report having achieved finer results with it.

The scheme can be summarized as follows:

Video frame → Canny edge detection → long line removal → horizontal stroke detection → vertical stroke detection → text area detection → bounding box detection → validation → detected text recognition

Chong-Wah Ngo and Chi-Kwong Chan, in 'Video text detection and segmentation for optical character recognition' [16], presented their approach to detect, segment and recognize text in video frames.

The authors aimed to introduce approaches to help detect and segment text present in video frames. It is shown that by classifying the background complexities, appropriate operators can be applied selectively to detect text in video frames of different modalities. To remove noise from candidate text areas of high edge density, effective operators, e.g. repeated shifting operations, are applied. A text enhancement technique is also applied so that text images in low-contrast regions can be highlighted. Lastly, a coarse-to-fine projection technique extracts text lines from the video frames.

Results from the experiments conducted with the described methods report higher levels of recognition accuracy compared to machine-learning-based systems, e.g. SVM and ANN. The segmented foreground text is then recognized using a commercial OCR package.

The work is described as follows:

Extracting text information from videos is divided into three major steps:

i- Text detection: locating regions containing text.

ii- Text segmentation: segmenting text in the detected text regions. The result of segmentation is usually a binary image for text recognition.

iii- Text recognition: converting the text in the video frames into ASCII characters.

Overview of the system

Christian Wolf and Jean-Michel Jolion, in 'Extraction and Recognition of Artificial Text in Multimedia Documents' [17], Technical Report RFV RR-2002.01, Laboratoire Reconnaissance de Formes et Vision, INSA, France, reported a novel technique to extract and recognize video text.

This research focused on text extraction and recognition in videos. The approach is to extract text automatically from the videos in a database and then save it with a link referring to the video sequence and the frame number. The focus in this approach is on basic keywords: the user submits a request by providing some keywords, which are robustly matched against the previously extracted text in the database, and videos containing the keywords (or a part of them) are presented to the user. This can also be merged with image features like color or texture.

At the detection stage the algorithm is applied to individual frames separately. Text rectangles are identified, extracted and passed on to the tracking step, where the extracted rectangles are matched with text images from other frames. From this process of matching each text image with those from the other frames, an enhanced version of the image is generated and converted to a bitmap. Lastly, the binarized images are passed to a commercial OCR system to be recognized.

The authors have described the steps of recognition in the following way:

Gray Level Constraints

The focus here is on detecting horizontal text lines in the artificial text region. This text is designed deliberately to be easily readable, so it can be assumed that the contrast between the text and the background is high. However, it cannot be assumed that the background will remain uniform, and therefore arbitrarily complex backgrounds need to be supported. The other feature under consideration is the color properties of the text.

Because the text is easily readable, it is also relatively easy to extract based on the luminance information embedded in the video signal. Therefore all video frames are converted into grayscale images as a pre-processing step.


Temporal Constraints – Tracking

In videos, the text images that need to be recognized remain present for a series of video frames, allowing enough time for the various recognition processes. This feature is taken advantage of in order to associate the same text across frames and to separate out false alarms. This is done by an overlapping technique, whereby the list of detected rectangles is overlapped with the list of currently active rectangles, on the expectation that in each subsequent frame the text of the previous frame will also be visible.

Temporal Constraints – Multiple Frame Integration

In this approach, before the images of a text appearance are passed to the OCR software, all images of it from the various video frames are drawn out and their contents enhanced to generate a better quality image than those initially drawn from the sequence. This enhancement also yields better resolution of the image without adding any extra information to it. This is an important step, as most commercial OCR programs are developed for scanned documents.

Final Binarization

In the end the enhanced image needs to be binarized. Low quality video does not produce text images of the same quality or characteristics as scanned document images. The authors were of the opinion that simpler algorithms are more robust to the noise present in video images. For this reason they used Niblack's method for their system, and finally derived a similar method based on a criterion maximizing local contrast.


Niblack’s algorithm calculates a threshold surface by shifting a rectangular window across

the image. The threshold T for the center pixel of the window is computed using the mean

and the variance of the gray values in the window.
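A minimal Matlab sketch of Niblack's rule is given below, computing the threshold as the local mean plus k times the local standard deviation over a sliding window. The window size and the value of k are tuning parameters of our own choosing, not values taken from the paper.

% Minimal sketch of Niblack binarization: T(x,y) = mu(x,y) + k*sd(x,y),
% with the local statistics taken over a win-by-win sliding window.
function bw = niblack(gray, win, k)
    I   = double(gray);
    box = ones(win) / win^2;              % averaging kernel
    mu  = conv2(I, box, 'same');          % local mean
    mu2 = conv2(I.^2, box, 'same');       % local mean of squares
    sd  = sqrt(max(mu2 - mu.^2, 0));      % local standard deviation
    bw  = I > mu + k * sd;                % compare against threshold surface
end
% Illustrative call: bw = niblack(gray, 15, -0.2);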

They then passed the final binarized text boxes to a commercial OCR product (Abbyy FineReader 5.0).

Chuan-Jie Lin, Che-Chia Liu, and Hsin-Hsi Chen, in 'A Simple Method for Chinese Video OCR and Its Application to Question Answering' [18], propose a simple video OCR method for Chinese captions, covering image capturing, caption region detection, background removal, character segmentation, OCR and post-processing.

The characteristics of captions are:

i- they are always in a straight line, from left to right or top to bottom;

ii- the characters usually have colors which contrast with the background, and often have perceivable borders;

iii- they are always in the foreground of the image;

iv- they usually consist of two or more characters;

v- the height of the caption region is usually no more than one third of the height of the image, because characters cannot be too large or too small for reading;

vi- they have fixed height, width, and size;

vii- they have fixed colors.

These characteristics are employed to locate captions.


Removing Backgrounds by Means of Multiple Images

After detecting a sequence of images with the same caption text, they use the following method to remove the backgrounds. Let NumFrames be the total number of sequential images. They consider each point in the caption region: if it is black in 90% of the images (i.e., NumFrames × 0.9), then they set the point as black; otherwise it is set as a white point.
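In Matlab this voting rule reduces to a few lines, sketched below under the assumption that the binary caption images are stacked into a single array (the variable names are ours).

% Sketch of the 90% voting rule. 'frames' is assumed to be an
% H-by-W-by-NumFrames stack of binary caption images, 1 = black pixel.
blackVotes = sum(frames, 3);                       % per-pixel black count
caption    = blackVotes >= 0.9 * size(frames, 3);  % 1 = black, 0 = white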

Character Segmentation

This approach relies on the fact that a binary image of the caption has already been obtained, bearing black characters on a white background. Boundaries for each character are then marked off, and recognition is made possible through a traditional OCR system.

They first decide the left and right boundaries of each character, using projection profiles to segment the text into individual characters. The projection profile records the black points along each vertical line; the projection drops to zero in the spaces between Chinese characters, but zero gaps can also occur within a character itself. This problem is solved with the help of another feature of Chinese characters: their width is more or less equal to their height, so the height of the caption region may be regarded as the expected character width, and possible segmentation points are marked off by accepting only gaps that lie at a distance of 0.7 to 1.4 times the image height from the previous gap.

Once the left and right boundaries have been decided, a similar method is employed to decide the upper and lower boundaries of each character.
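A Matlab sketch of this projection-profile rule follows; the variable names and the handling of the first cut are our own assumptions.

% Sketch of projection-profile segmentation. 'bw' is a binary caption
% image with text pixels set to 1; columns whose projection is zero are
% candidate gaps, filtered by the 0.7-1.4 x height rule from the paper.
profile = sum(bw, 1);                  % black pixels in each column
gaps    = find(profile == 0);          % candidate segmentation columns
h       = size(bw, 1);                 % caption height ~ character width
cuts    = [];                          % accepted character boundaries
prev    = 0;                           % position of the previous cut
for g = gaps
    if g - prev >= 0.7 * h && g - prev <= 1.4 * h   % plausible width
        cuts(end + 1) = g;             %#ok<AGROW> record the boundary
        prev = g;
    end
end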

Optical Character Recognition

A statistical model is adopted to perform Chinese OCR. Each Chinese character is broken down into 16 component parts, each equal in size. Signature values are attributed to each of these parts by observing from the centre outward in all four directions: up, down, left and right. As observations are made, black points in each direction are sought. If a black point is found in a certain direction, a signature value of 0 is set; if not, the value 1 is attributed at that point. As the process completes, 64 values (16 × 4), called signature values, are obtained for each character image.

For this study the signature images for a standard corpus of characters were collected from a number of Discovery Channel films. For text extracted from the films, the procedure is first to extract its signature and then compare it with those in the standard corpus. The number of matching values is counted, and the similarities are calculated and saved. The similarity score always lies between 0 and 64; greater scores represent closer resemblance between the patterns. A new image is regarded as a non-character image if the highest score attained from its observation is less than 16.
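The signature computation might be sketched in Matlab as below. Details such as the exact grid split and the scan limits within each cell are our assumptions, since the description above leaves them open.

% Hedged sketch of the 64-value signature: split the glyph into a 4x4
% grid; in each cell, scan from the cell centre up, down, left and
% right, writing 0 if a black pixel is met and 1 otherwise.
function sig = glyphSignature(bw)       % bw: binary glyph image, 1 = black
    [H, W] = size(bw);
    rs = round(linspace(0, H, 5));      % row boundaries of the 4x4 grid
    cs = round(linspace(0, W, 5));      % column boundaries
    sig = zeros(1, 64);
    n = 0;
    for i = 1:4
        for j = 1:4
            blk = bw(rs(i)+1:rs(i+1), cs(j)+1:cs(j+1));
            [h, w] = size(blk);
            r = ceil(h / 2); c = ceil(w / 2);          % cell centre
            scans = {blk(r:-1:1, c), blk(r:h, c), ...  % up, down
                     blk(r, c:-1:1), blk(r, c:w)};     % left, right
            for d = 1:4
                n = n + 1;
                sig(n) = ~any(scans{d});  % 0 if a black pixel is found
            end
        end
    end
end
% Similarity between two glyphs: score = sum(sigA == sigB), in 0..64.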

Jean-Marc Odobez and Datong Chen, in 'Robust Video Text Segmentation and Recognition with Multiple Hypotheses' [32], describe a robust method for segmenting and recognizing video text, as follows:

The authors of this study have attempted to present segmentation as well as recognition techniques for text embedded in video films. The study proposes multiple segmentations of the same text region, producing multiple hypotheses of binary text images.

The segmentation algorithm is stated as a statistical labeling problem based on a Markov random field (MRF) model of the label map. Background regions in each hypothesis are then removed by performing a connected component analysis and by enforcing a more stringent constraint (called the GCC) on the grayscale values of the text characters, using a robust 1D median operator. Each text image hypothesis is then processed by optical character recognition (OCR) software, and the final result is selected from the set of output strings. Results show that the use of both multiple hypotheses and the GCC significantly improves the results.

Description of the Method

The first step is to locate the text regions in the video frames. The method employed integrates horizontal and vertical edges in a texture-based search for locating the text areas. Using the baseline technique, these localized text areas are further segmented into single-line text candidates. The drawn-out text regions are then recognized using a Support Vector Machine (SVM).

The images drawn out in the initial text location step are presented as rectangles, so OCR software will not work directly on them; they must undergo further treatment in order to be understood by the OCR. Experimental evidence suggests that OCR results are not very reliable and depend greatly on the quality of the segmentation procedure that produces the clear-cut images fed to the OCR. The authors have therefore presented a sophisticated system in which multiple text layer candidates are provided to the OCR, making the decision process relatively easier for the OCR software.


The algorithm used by the authors works as follows: a segmentation step generates the text image hypotheses; connected components are then analyzed; OCR software processes the hypotheses; and the final result is selected from the output strings.

Datong Chen, Kim Shearer and Hervé Bourlard, in 'Video OCR for Sport Video Annotation and Retrieval' [19], give their scheme as follows:

The authors have introduced a video OCR system for building annotations of sports films. The system works by automatically extracting closed captions from video frames, which they call 'cues', and treating them as keywords for reference. This extraction is done using Support Vector Machines (SVMs), which identify the text regions containing closed captions. These images are then enhanced to generate better quality text images using two groups of asymmetric filters. Commercial OCR software is then used to recognize the text.

The whole procedure for automatically extracting text from sports videos can be summarized in three distinct steps:

i- identification of single-line texts from video frames using SVMs,

ii- text recognition, and

iii- production of cues for annotation.

The employed algorithm extracts texture patterns of texts, considering them on the basis of horizontal and vertical edges mixed together to form word structures.


They integrate this texture information with temporal information for text detection in the following process (steps iv and v are sketched in code below):

i- Multiple intensity frame integration is performed by computing the average image of consecutive frames.

ii- Edges are detected in the vertical and horizontal orientations respectively with Canny operators.

iii- Temporal edge information is integrated by keeping only the edge points that appear in two consecutive average images.

iv- Vertical and horizontal edges are dilated respectively into clusters. Different dilation operators are used so that the vertical edges are connected in the horizontal direction while the horizontal edges are connected in the vertical direction. The dilation operators are designed to have rectangular shapes: vertical operator 5×1, horizontal operator 3×6.

v- Vertical and horizontal edge clusters are integrated by keeping the pixels that appear in both the vertical and the horizontal dilated edge images.
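A Matlab sketch of steps (iv) and (v), using the structuring-element sizes quoted above; the variable names are our own, and the orientation convention of the two operators is our reading of the paper.

% Sketch of the dilation and intersection steps. vEdges and hEdges are
% assumed to be binary maps of vertical and horizontal edge points.
vDil = imdilate(vEdges, ones(5, 1));   % cluster the vertical edges
hDil = imdilate(hEdges, ones(3, 6));   % cluster the horizontal edges
textMask = vDil & hDil;                % keep pixels present in both maps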

Because the text may have varying gray-scales in video frames, they choose a gray-scale independent feature, the distance map of a 16×16 sliding window, as the input feature of the SVMs. The SVMs are trained on a database of 6000 samples labeled as text or non-text (false alarms resulting from text line location), using the software package SVMTorch.

Rainer Lienhart and Frank Stuber, in 'Automatic text recognition in digital videos', Technical Report TR-95-036, Department for Mathematics and Computer Science, University of Mannheim, Germany, 1995 [60], reported their algorithm for the recognition of text in digital videos.


The authors have developed a system for automatic recognition of text in moving pictures through character segmentation algorithms. These text images are drawn from a variety of regions of the videos, including pre-title sequences, credits and closing sequences, which could contain titles, credits or other bits of text. These are drawn out automatically and efficiently with the help of the employed algorithm. The system focuses on enhancing the segmentation quality by using typical characteristics of text in videos, resulting in better recognition results.

As segmented characters are drawn out from the videos, they can be parsed by any OCR software. Multiple instances of the same character are drawn from a series of video frames and combined to generate a higher quality text image. This in turn improves the recognition results used to compute the final output.

This technique, however, has been developed only to deal with artificial text that has been carefully laid over the original scene; it has no capacity to deal with scene text embedded within the scene. In particular, the system deals with the recognition of text that has been added to the videos with the help of video title machines. The premise for this exclusive treatment is that scene text and artificial text represent two distinct ways in which text is laid out in a video, therefore requiring different treatments for drawing them out for recognition.

Owing to this difference, the authors use the words 'text' and 'character' only to refer to text produced by video title machines or similar input devices that add text after the scene has been shot.


Before characters and, thus, words and text can be recognized, the features of their appearance have to be analyzed.

Their list of features includes:

i- Characters are monochrome. Only a very small percentage are polychrome, and those are of no further interest here.

ii- Characters are rigid. They do not change their shape, size or orientation from frame to frame. Again, the very small percentage of characters that do change size and/or shape are of no further interest here.

iii- Characters have size restrictions.

iv- A letter is not as large as the whole frame, nor are letters smaller than a certain number of pixels, as they would otherwise be illegible to human beings.

v- Characters are either stationary or linearly moving.

vi- Stationary characters do not move; their position remains fixed over multiple subsequent frames. Moving characters move steadily and also have a dominant translation direction: generally, they move either horizontally or vertically, and many just move from right to left or bottom to top.

vii- Characters contrast with their background.

viii- Artificial text is designed to be read and thus must be in contrast to its background.

ix- The same characters appear in multiple consecutive frames (temporal relation).

x- Characters often appear in clusters at a limited distance, aligned to a horizontal line (spatial relation), since that is the natural method of writing down words and word groups. This is not a prerequisite, just a strong indicator; from time to time a single character might appear on one line.

xi- Character outlines/borders are degraded by current TV technology and digitizer boards. Characters often blend into the background, especially on their left side. Characters designed to be monochrome no longer appear monochrome: the color is very noisy and sometimes changes slightly, spatially and temporally, e.g. by interference with the colors of the surroundings.

Even stationary text may jump around by a few pixels; these are typical analog television/video artifacts. Any (artificial) text segmentation and recognition approach has to be based on these observed features.

Segmentation of Character Candidate Regions

Theoretically, the segmentation step extracts all pixels belonging to text appearing in a video. However, this cannot be done without knowing where and what the characters are. Therefore, the actual aim of the segmentation step is to divide the pixels of each frame of a video into two classes:

i- regions which do not contain text, and

ii- regions which possibly contain text.

Regions which do not contain text are discarded, since they cannot contribute anything to the recognition process, and regions which might contain text are kept. These are called character candidate regions, since they are (not exactly) a superset of the character regions; they are passed on to the recognition step for evaluation.

The segmentation process can be divided into three parts, each part increasing the set of non-character regions of the previous part by further regions which do not contain text, thereby reducing the character candidate regions and approximating them more and more to the real character regions. Each frame is first processed independently of the others. Then, the multiple instances of identical text in consecutive frames are exploited.


Finally, they analyze the contrast of the remaining regions in each frame to further reduce the number of candidate regions and to build the final character candidate regions.

Character Recognition

After the segmentation process isolates the text regions in the frames, each of these frames has to be parsed by OCR software. The experiment reports using the authors' own OCR system, based on a feature vector classification approach. The results were not perfect, and it is suggested that the use of more reliable commercial OCR software would give higher quality text recognition results.

Wing Hang Cheung, Ka Fai Pang, Michael R. Lyu, Kam Wing Ng, and Irwin King, in 'Chinese optical character recognition for information extraction from video images' [71], presented a simple method for the recognition of Chinese video text.

Chinese and English scripts are fundamentally different from one another and therefore require entirely different approaches for character recognition, so there is a strong need to develop methods for recognizing Chinese text. For this study the authors applied OCR techniques to moving images to extract characters from the video frames. Through this method they were able to extract video text automatically and then use the Chinese subtitles for indexing and searching in a digital video library. They applied methods to filter out the heavy noise and to segment out each Chinese character in the video segments.

The authors have also given details of how the OCR was applied to detect Chinese characters and how its performance was evaluated.


The Chinese video text extraction system was based on the following assumptions:

i- Text characters remain in the foreground of the scene and are not obscured by other contents.

ii- They stand in contrast with the background and are presented in a monochrome scheme.

iii- Their forms are upright and rigid, and therefore it is not expected that character shapes, sizes or orientations will change from frame to frame.

iv- They are restricted to a specified size range, not exceeding a certain size nor being smaller than a certain recommendation.

v- They do not slide into a scene, nor fade in or fade out; rather, they pop onto the screen.

vi- With reference to Chinese subtitles in videos, the text appears in clusters aligned horizontally from left to right.

All these features form the basis for the noise filtration and character segmentation procedures employed during the experiments, though extracting characters from video remains a difficult task. To draw characters out of videos, the position of the lines of characters must first be located. Since the text area is rich in edges and stands in good contrast against the background, edge density is used as a feature to locate the text areas. The property of high edge density has thus been used as the basis for developing a similar segmentation method for Chinese characters.

Xian-Sheng Hua, Pei Yin, and HongJiang Zhang, in 'Efficient video text recognition using multiple frame integration' [72], reported as follows:

Artificial text that is superimposed on a scene is usually a carrier of important information about the video or broadcast itself. This information is of significant use for video indexing and retrieval, and research in this area (video OCR) has recently become popular and widespread. However, there are complexities that need to be overcome in order to create an efficient video OCR, the major hurdles being low resolution and background complexity. The authors of the study have aimed to introduce an efficient system to overcome at least the second difficulty, that of background complexity. They use multiple frames with repeated occurrences of the same words to obtain the clearest image of each character. Before this extraction procedure, however, a multiple frame verification procedure is adopted to exclude the possibility of false text alarms.

In the next step those frames are selected in which the text is the clearest, which makes the recognition process easier. Next, all the clearest and similar frames are detected and joined together to generate sets of finer man-made frames. On these artificially created frames a block-based adaptive thresholding is applied and the images are binarized. These binarized frames are finally sent to the OCR engine for recognition.

A summary of the scheme is presented in the following sequence:

Video stream → text detection → multiple frame verification → get frames containing the same text → high contrast frame selection → block division → high contrast block averaging → block merging → block adaptive thresholding → OCR → text

3.6 Correlation Pattern Recognition

An important problem in image recognition has been finding patterns in images. Investigations have increasingly been carried out through both optical and digital methods to tackle this problem, and an area receiving popular acclaim today is the use of correlators for identifying recognizable patterns in images.

Correlators are inherently shift invariant, which allows a pattern to be located in the input scene merely by locating the correlation peak. This makes it unnecessary to register or segment an image prior to correlation, as is done in a variety of other methods for pattern recognition.

Amongst the more widely known correlation functions, the Matched Spatial Filter (MSF) is perhaps the most popular [68]. The MSF depends on a single view of the pattern to be detected and is optimal for detecting a known pattern in the presence of additive noise.

MSFs, however, have one major drawback: their correlation peak degrades quite rapidly if the input pattern deviates even slightly from the reference pattern. Such pattern deviations are highly common and may occur due to changes as minor as scaling and rotation. This inflexibility renders MSFs inadequate for most pattern recognition applications.

A method employed to overcome this distortion sensitivity of the MSFs was to use a different MSF for each different view. This meant that a large number of filters had to be stored and made available in order to filter a variety of images, making this approach rather impractical.


An alternative solution to this problem was to create composite filters, which could be enhanced to give an optimal distortion tolerance better than the MSFs.

Similar techniques for obtaining various types of correlation filters have been tried and optimized in recent years [69].

These filters all work on the premise that correlation filters can be 'trained' to recognize any object on the basis of a set of representative views of that object. These sets of representative views are called 'training images', and the correlation filters can adequately recognize an object with reference to this set as long as the distortions are amply characterized by the training set. However, proper selection, registration and clustering of training images are important tasks in the process of designing composite filters.

Here we will consider a class of composite filters whose tolerance to distortions can be analytically optimized: the Maximum Average Correlation Height (MACH) filter [49], which has been experimentally proven to surpass the MSF in performance with regard to handling noise and distortion problems [48].

3.6.1 The MACH Filter

Maximum Average Correlation Height (MACH) filters are among the most popular composite correlation filters with researchers in pattern recognition.

Here we take the problem of detecting a distorted target image (or pattern) in the presence of additive noise, discussed at length by B. V. K. Vijaya Kumar et al. [70]. The objective is to design a correlation filter such that its performance is optimized with respect to not only noise but also distortions.


Let g(m,n) denote the correlation surface produced by a filter h(m,n) in response to the input image f(m,n). Strictly speaking, the entire correlation surface g(m,n) is the output of the filter. However, the point g(0,0) is often referred to as the "filter output" or the correlation "peak". By maximizing the correlation output at the origin, we will be forcing the real peak to be large. The peak filter output is given by

g(0,0) = \sum_m \sum_n f(m,n)\, h(m,n) = f^T h    (3.1)

where the superscript T denotes the transpose, and f and h are the vector versions of f(m,n) and h(m,n), respectively. Typically, for noise tolerance, it is desired that this filter output be as immune to the effect of noise as possible.

Optimizing distortion tolerance elementarily implies that the correlation plane is not an entirely new pattern but rather a linearly transformed version of the input created by the filter. Here it is important to consider that if the filter has a high capacity for distortion tolerance, the output will remain relatively unchanged even if the input pattern shows some variation. The consideration is thus not so much the correlation peak as essentially the complete shape of the correlation surface.

Considering this, a metric for distortion is defined as the average variation in images after filtering. If g_i(m,n) is the correlation surface produced in response to the i-th training image, we can quantify the variability in these correlation outputs by the average similarity measure (ASM), defined as follows:


ASM = \frac{1}{N} \sum_{i=1}^{N} \sum_m \sum_n \left[ g_i(m,n) - \bar{g}(m,n) \right]^2    (3.2)

where \bar{g}(m,n) = \frac{1}{N} \sum_{i=1}^{N} g_i(m,n) is the average of the N training-image correlation surfaces. ASM is a mean square error measure of the distortions in the correlation surfaces relative to an average shape. In an ideal situation, all correlation surfaces produced by a distortion invariant filter (in response to a valid input pattern) would be the same, and ASM would be zero. In practice, minimizing ASM improves the filter's stability.

We now discuss how to formulate ASM as a performance criterion for filter synthesis. Using Parseval's theorem, ASM can be expressed in the frequency domain as

ASM = \frac{1}{Nd} \sum_{i=1}^{N} \sum_k \sum_l \left| G_i(k,l) - \bar{G}(k,l) \right|^2    (3.3)

where G_i(k,l) and \bar{G}(k,l) are the 2-D Fourier transforms of g_i(m,n) and \bar{g}(m,n), respectively, and d is the number of frequency elements. In vector notation,

ASM = \frac{1}{Nd} \sum_{i=1}^{N} \left\| g_i - \bar{g} \right\|^2    (3.4)

Let m = \frac{1}{N} \sum_{i=1}^{N} x_i represent the average of the training image Fourier transforms.

In the following discussion, we treat M and X_i as diagonal matrices with the same elements along the main diagonal as in the vectors m and x_i. Using the frequency domain relations g_i = X_i^* h and \bar{g} = M^* h, the ASM can be rewritten as follows:

ASM = \frac{1}{Nd} \sum_{i=1}^{N} \left\| X_i^* h - M^* h \right\|^2 = \frac{1}{Nd} \sum_{i=1}^{N} h^+ (X_i - M)^* (X_i - M) h    (3.5)

= h^+ \left[ \frac{1}{Nd} \sum_{i=1}^{N} (X_i - M)^* (X_i - M) \right] h = h^+ S h    (3.6)

where the matrix S = \frac{1}{Nd} \sum_{i=1}^{N} (X_i - M)^* (X_i - M) is also diagonal, making its inversion relatively easy.

Detection in the presence of noise and distortion also requires that a correlation filter yield large peak values. To achieve this, the filter's response to the training images needs to be maximized without imposing hard constraints on the representative images, as is common in traditional Synthetic Discriminant Functions (SDFs). The goal is that the filter should, on average, yield a large peak over the complete training set. This end is achieved by maximizing the average correlation height (ACH) criterion, defined as

ACH = \frac{1}{N} \sum_{i=1}^{N} x_i^+ h = m^+ h    (3.7)

Another important consideration is to minimize the effect of noise on the filter's output, which is achieved by reducing the output noise variance (ONV). ONV is a quadratic term given by h^+ C h, where C is the noise covariance matrix [69].

Taking the example of a white noise model, and aiming to enlarge ACH while minimizing ASM and ONV, the filter is designed to maximize

J(h) = \frac{\left| ACH \right|^2}{ASM + ONV} = \frac{\left| m^+ h \right|^2}{h^+ S h + h^+ C h} = \frac{h^+ m m^+ h}{h^+ (S + C) h}    (3.8)

The optimum filter which maximizes this criterion is the dominant eigenvector of (S + C)^{-1} m m^+, or

h = \gamma (S + C)^{-1} m    (3.9)


where \gamma is a normalizing scale factor. The filter given by Equation (3.9) is referred to as the Maximum Average Correlation Height (MACH) filter and is considered one of the most attractive composite correlation filters for researchers in the field of pattern recognition against noisy backgrounds.

3.6.2 OT-MACH Filter

The OT-MACH filter is an extension of the MACH filter, and can be viewed as a composite template representing a general model of an object. It is designed to be invariant to distortions of the object by including the expected shapes of the object during its construction. The basic MACH filter is derived by maximizing the average correlation height (ACH) to get a high peak in the output correlation surface [70]. The OT-MACH filter, however, maximizes ACH while it also minimizes:

• the average correlation energy (ACE), to get a sharper peak,

• the average similarity measure (ASM), to achieve distortion tolerance, and

• the output noise variance (ONV), to achieve noise tolerance [62] [49] [13].

Thus, instead of satisfying these performance measures separately, a single energy function (to be minimized) is formed by combining them as follows:

J(h) = \alpha(ONV) + \beta(ACE) + \gamma(ASM) - \delta(ACH)    (3.10)

where ONV = h' C h, ACE = h' D_x h, ASM = h' S_x h, and ACH = h^T m_x. The weighting factors \alpha, \beta and \gamma are the design parameters, used to accomplish the trade-off among the four performance measures. Furthermore, h is the vector form of the 2-D DFT of the filter that minimizes the energy function, m_x is the average of the vector forms of the 2-D DFTs of the training images, and the superscript signs (T) and (') represent the transpose and the conjugate-transpose operations, respectively. C is the diagonal noise covariance matrix, which is normally set to \sigma^2 I (where \sigma is the standard deviation parameter and I is the identity matrix) if the actual noise model is not known. D_x is the diagonal matrix representing the average power spectral density of the training images and is defined as:

D_x = \frac{1}{N} \sum_{i=1}^{N} X_i^* X_i    (3.11)

where X_i is a diagonal matrix whose diagonal elements are the 2-D DFT coefficients (in vector form) of the i-th training image, N is the number of training images, and * denotes conjugation. S_x is the diagonal average similarity matrix defined as:

S_x = \frac{1}{N} \sum_{i=1}^{N} (X_i - M_x)^* (X_i - M_x)    (3.12)

where M_x is a diagonal matrix whose diagonal elements are the elements of the vector m_x. The filter that minimizes the composite energy function given in Eq. (3.10) is expressed as [62] [49] [76]:

h = (\alpha C + \beta D_x + \gamma S_x)^{-1} m_x    (3.13)

which is called the Optimal Trade-off Maximum Average Correlation Height (OT-MACH) filter.
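Since C, D_x and S_x are all diagonal, Eq. (3.13) reduces to element-wise arithmetic on the 2-D DFTs of the training images. The following Matlab sketch shows one possible synthesis routine; the function name and the white-noise assumption C = \sigma^2 I are ours.

% Minimal sketch of OT-MACH synthesis following Eqs. (3.11)-(3.13).
% 'train' is an h-by-w-by-N stack of registered training templates;
% alpha, beta, gamma and sigma are user-chosen design parameters.
function H = otmach(train, alpha, beta, gamma, sigma)
    [h, w, N] = size(train);
    X  = fft2(double(train));                      % per-image 2-D DFTs
    Mx = mean(X, 3);                               % m_x: average spectrum
    Dx = mean(abs(X).^2, 3);                       % Eq. (3.11), diagonal form
    Sx = mean(abs(X - repmat(Mx, [1 1 N])).^2, 3); % Eq. (3.12), diagonal form
    C  = sigma^2 * ones(h, w);                     % white-noise covariance
    H  = Mx ./ (alpha*C + beta*Dx + gamma*Sx);     % Eq. (3.13), element-wise
end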

3.7 Text Region Detection in Video Frames

Text region detection in video frames is a pattern recognition problem in a complex background, and the same techniques that have proven successful in pattern recognition can be applied here with some modifications and enhancements.


We used an extended version of the Maximum Average Correlation Height (MACH) filter, the Optimal Trade-off Maximum Average Correlation Height (OT-MACH) filter, to detect text regions in video frames.

We kept our problem simple by focusing on artificial text in video frames which appears in the lower one-third region of the screen. This text is usually about the speaker currently on the screen, a speech translation, or a breaking news slide.

Our objective is to detect and isolate text regions which appear in the lower one-third region of the screen in video frames. For this purpose we implemented an OT-MACH filter in Matlab 7, which successfully detects and isolates the text regions in the video frames and extracts them for further processing.
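The detection step itself can be sketched as a frequency-domain correlation followed by a peak search restricted to the lower third of the frame. The routine below is an illustrative outline, not our exact implementation; the acceptance threshold on the returned peak value would be tuned on training clips.

% Sketch of applying a synthesized filter H to one frame: correlate in
% the frequency domain, then search for the strongest response in the
% lower one-third of the correlation surface.
function [r, c, peak] = detectTextRegion(frame, H)
    g = double(rgb2gray(frame));
    G = fft2(g, size(H, 1), size(H, 2));     % zero-padded frame DFT
    corrSurf = real(ifft2(G .* conj(H)));    % correlation surface
    top = floor(2 * size(corrSurf, 1) / 3);  % start of the lower third
    roi = corrSurf(top + 1:end, :);          % restrict the peak search
    [peak, idx] = max(roi(:));               % strongest response
    [r, c] = ind2sub(size(roi), idx);
    r = r + top;                             % back to frame coordinates
end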

We synthesize three separate filters for text regions of three different sizes, i.e. 80%, 100% and 120%, using normalized training images of text regions of different sizes, to achieve text region size invariance. Specifically, we generate the training images using the following steps for each of the three sizes:

• Extract some templates of different text regions from real-world video sequences

• Resize the templates to an average size

3.8 Results

The OT-MACH filter is trained on the training images (templates) so generated, containing text only in the English language, and tested on real-world video clips which contain text in English, with a few containing Arabic text as well. The results are very encouraging, with nearly 80% accurate detection and extraction of text regions. There are a few false negatives and also some false positives, which we plan to address in future work.

3.8.1 Video OCR Results

3.8.1.1 Training a MACH filter

A MACH filter is trained to detect text regions in the lower one-third screen area. The training data consists of images of text regions extracted from the frames of video clips carrying artificial caption text.

3.8.1.2 Training Images

Training images are obtained from real-world video clips. Here we give a few of the training images containing artificial text which we used as templates to train the MACH filter.

Figure 3.1: Training images

3.8.1.3 Trained MACH filter

Figure 3.2 illustrates the trained MACH filter on text region images.

Figure 3.2: Trained MACH filter

3.8.1.4 Text area detection and localization

Figure 3.3 illustrates text region detection and extraction from a real world video clip.

The upper part of the image shows text region detection, while the lower part shows the same detected text region extracted in a separate window in color for further processing.


Figure 3.3: Text region detected and extracted in color-1


Figure 3.4: Extracted text region is binarized-1

Like figure 3.3, figure 3.4 shows the same process of text region detection and extraction; however, the extracted text region in the smaller window is further processed to binarize the image.


Figure 3.5: Text region detected and extracted in color-2

Figure 3.5 shows the process of detection and extraction of the text region, with the extracted text region in color.


Figure 3.6: Extracted text region is binarized-2

The text region extracted in figure 3.5 is shown binarized in figure 3.6.


Figure 3.7: Text region detected and extracted in color-3

Text region detection and extraction, with the extracted text region in color, is shown in figure 3.7.


Figure 3.8: Extracted text region is binarized-3

In figure 3.8 the extracted text region is binarized.


Figure 3.9: Extracted text region in color-4

The extracted text region is shown in color in figure 3.9.


Figure 3.10: Extracted text region is binarized-4

Figure 3.10 shows the binarized extracted text region.


Figure 3.11: Artificial text on plain background

Figure 3.11 shows a video frame where the artificial text appears on a simple plain background: here white text appears on a black background. This type of text is easy to detect, extract and further process for character recognition.


3.8.1.5 Video clips with Arabic text

We also used video clips with Arabic artificial text for text region detection and extraction. The results of text region detection and extraction in this case were very much the same as in the case of English text, although our system was trained using only images containing English text.

Figure 3.12: Extracted text region in color-1

Figure 3.12 shows Arabic text region detection and extraction. The extracted text region is in color.


Figure 3.13: Extracted text region is binarized-2

Figure 3.13 shows the binarized extracted text region.


Figure 3.14: Extracted text region in color-3

Text region detection and extraction for Arabic artificial text is shown in figure 3.14. The extracted text region is in color.


Figure 3.15: Extracted text region is binarized-4

The binarized extracted text region is shown in figure 3.15.

3.9 Conclusion and Future work

Two main challenges in video OCR are:

i. complex background, and

ii. low resolution of the final text image.


We can simplify the problem of complex background by confining our focus to video clips that have artificial text on a simple plain background, like white text on a black background (as in the example above). In this case we are left with only one big challenge: the low resolution of the final text image.

As almost all video OCR research shows, people use commercial OCR software for the recognition of the final text images, and they need to further process the image to get an image of acceptable resolution for the OCR software.

We plan to take up the task of post-processing the extracted text image to increase its resolution so that it is accepted by a commercial OCR. We shall test its accuracy first on text in the English language and, after successfully getting video OCR results on English text, shall move on to our Urdu OCR to test it on video text images.


CHAPTER 4 Implementation Challenges for Nastalique OCR

4.1 Introduction

In this chapter we discuss the complexities and challenges the Nastalique script offers for digital implementation, generally in Urdu computing and particularly in character recognition.

The two main challenges, besides others, are the highly cursive nature of the Nastalique style of writing Urdu, which is in fact calligraphic and has artistic beauty, and the context sensitive behavior of its character shapes.

The Nastalique style is also bi-directional, with characters running from right to left while numerals run in the opposite direction. As words are written down there is also vertical stacking of characters as they are kerned and cursively joined, while some characters move backward beyond the previous character. These factors add to the difficulty of developing an OCR for Nastalique.

4.2 Nastalique Character Set

Urdu uses an extended, adapted Arabic script; it has 39 characters as against the Arabic 28. Each character has two to four different shapes depending upon its position in the word: initial, medial, final or isolated.

When a character shape is written alone, it is called an isolated character shape. Each of the initial, medial and final character shapes can have multiple instances: the character shape changes depending upon the change in the antecedent or precedent character. This characteristic of having multiple instances of character shapes is called context sensitivity.

Table 4.1: Shapes of Nastalique characters

A complete language script comprises an alphabet and a style of writing. Urdu uses an extended Arabic script for writing called the Perso-Arabic script. It has two main styles, Naskh and Nastalique. Nastalique is a calligraphic, beautiful and more aesthetic style and is widely used for writing Urdu.


4.3 Nastalique Script Characteristics

Nastalique script has the following characteristics:

• Text is written from right to left

• Numbers are written from left to right

• Urdu Nastalique script is inherently cursive in nature

• A ligature is formed by joining two or more characters cursively in a free-flow form

• A ligature is not necessarily a complete word; in most cases it is a part of a word, sometimes referred to as a sub-word

• A word in Nastalique is composed of ligatures and isolated characters

• Word formation in Nastalique is context sensitive, i.e. characters in a ligature change shape depending upon their position and the preceding or succeeding characters

4.4 Computational Analysis of Urdu Alphabet

The Nastalique script uses the Urdu alphabet for its writing, which is based on the Perso-Arabic script, an extension of the Arabic script. We present a computational analysis of the Urdu alphabet in the following.

4.4.1 Classes of base shapes in the Urdu alphabet.

The character set of the Urdu alphabet can be classified into 21 basic shapes that may or may not take a different set of dots or diacritical marks to represent a variety of sounds.


All 39 characters of the alphabet take their shapes from one of these basic shapes to represent a single character.

Table 4.2: Base shape classes in Urdu alphabet

For example, as shown in table 4.2, the second basic shape in the set is used to represent five different characters – ب، پ، ت، ٹ، ث – depending on the number, position or placement of the dots that go along with it, the dots suggesting the difference in each character and serving as a mark of identification for that particular character. These identification marks are quite ambiguous at times, and generally, in handwritten Urdu, the presence of a certain character in a word can only be recognized through a contextual understanding of the word image.

4.4.2 Dots in Urdu Characters

Urdu characters have two types of accents: dots and the small tua ( ط ). We have presented an analysis of Urdu characters with accents in table 4.3, with the details in the following.

In the Urdu alphabet, 17 of the characters take dots in a variety of positions, changing the sounds represented by the basic shape class of characters. These dots are placed in a variety of positions, e.g. below the character, above the character or even inside it. The dots also occur in varying numbers, from one to three, and in a number of placement combinations; e.g. a set of three dots may be placed as a triangular arrangement with the pointed side up or down in two different characters ( پ، ث ).

Table 4.3 presents a variety of ways in which the alphabet can be classified according to the number, position or placement of dots in relation to the characters. It can be observed that there are 17 characters that contain a single dot or a combination of dots, three characters that have a 'tua' sign above the character, and nineteen characters that have neither a dot nor a tua sign attached. There is a further classification due to the placement of the dots: e.g. a set of three dots placed on the fourth basic shape, 'ra' ر, makes it 'zha' ژ, and the same set in its inverted position placed inside the third basic shape, 'Ha' ح, makes it 'cha' چ.

In the same way several characters carry a pair of horizontally placed dots, e.g. 'ta' ت has the two dots placed above the centre of the character, but the character 'qaf' ق in its initial position has the same set of dots placed in the top right quadrant of the character rectangle that represents its four parts.

Table 4.3: Dots in Urdu characters


4.4.3 Context Sensitive shapes in the Urdu alphabet.

The 39 individual characters of the Urdu alphabet have multiple shape representations depending upon their position in the word. There are four basic positions in which a character may occur – initial, medial, final and isolated – and in each case the character's form changes considerably. But this does not limit each character's shapes to four.

Each individual character forms a unique shape at the initial, medial or final position when it combines with a different character; e.g. the character ba ' ب ' may take as many as 20 different initial shapes as it combines with varying characters, as illustrated in table 4.4.

Table 4.4: Context sensitive shapes of ب in the initial position


This context sensitivity increases the number of character representations in Urdu and Arabic manifold, as compared to Latin-script languages, which have only two basic ways of presenting a character – lower or upper case.

4.4.4 Comparison of Urdu, Arabic and Farsi Alphabets

Table 4.5: Urdu Alphabet

Table 4.6: Arabic Alphabet

Table 4.7: Farsi Alphabet


The Arabic script was originally adopted for writing Farsi and Urdu as well as numerous other languages. However, as the script was adopted to represent a different set of sounds for each distinct language, variations were made to the number, placements and positions of the dots and other accents used with the characters to represent the sounds. The Arabic alphabet has a set of 28 characters (table 4.6), while Farsi (table 4.7) has 4 additional characters, which make it possible to represent the 'zha' ژ, 'pa' پ, 'cha' چ and 'ga' گ sounds that are not present in the Arabic language. In the same way Urdu (table 4.5) has a supplementary set of 11 characters that have been derived from the basic character shapes by adding dots or accents to them in an array of combinations. Tables 4.5, 4.6 and 4.7 show how the Arabic script has been adapted by Urdu and Persian to correspond to each language's distinct sounds.

4.4.5 Bi-directional pen movement (from top left to bottom right)

The formation of words as Urdu is scribed on paper has a unique pattern of flow. Words are written beginning at the top right end and finishing at the bottom left, figure 4.1. This pattern holds for every word: each character curves and joins with the preceding and succeeding ones to create uniquely shaped ligatures and words.

Figure 4.1: Bi-directional pen movement


The arrows in the given word images represent the direction of the pen movement, showing that Urdu words are written in a bi-directional mode – that is, there are two movements involved in relation to the directions the forming words take,

a. the movement from right to left and,

b. the movement from top to bottom.

4.4.6 Bi-directional writing (numbers written from left to right)

Although Urdu writing follows the Arabic pattern of writing from right to left, when numbers are written down they follow the Latin pattern of writing from left to right. In the example shown below, figure 4.2, the given sentence is presented in a right-to-left flow, while the dates set inside – ‘1947’ and ‘14’ – although written in Arabic numerals, run in the Roman form, from left to right.

Figure 4.2: Bi-directional writing

This feature is another aspect of Urdu’s bi-directional writing style.

4.4.7 Context Sensitive Shapes of the Character ‘Qaf’ ق

Each character in the Urdu alphabet has a different shape according to its position in the word. In the following example, figure 4.3, the character qaf’s various shapes are demonstrated as they appear at different positions in the word. It can be noticed that the three shapes of this character are considerably different from one another in its varying positions, and identifying them would need a substantial amount of training.

Figure 4.3: Context sensitive shapes of qaf ق

4.5 Nastalique Script for Urdu

Urdu uses the Arabic script for writing, with a basic character set of 39 against the Arabic 28. There are mainly two styles of Urdu writing, Naskh and Nastalique, of which Nastalique is the more prevalent and widely used. Nastalique is a calligraphic, beautiful and more aesthetic style of Urdu writing. Due to its calligraphic nature it poses many difficulties for digital implementation, specifically in text recognition. It is pertinent to define some related terminologies here:

4.5.1 Character

According to the Unicode definitions a character is the smallest element of the written language which has semantic value. The character so identified is an abstract entity, such as ‘Latin capital letter A’ or ‘Arabic letter ح’. Every character has only one code point in digital character representation schemes like ASCII or Unicode. A text file in Unicode will invariably contain references to characters and not to glyphs.

4.5.2 Glyph

The visual representation of a character made on the screen or on paper is called a glyph.

A glyph is the shape or form that a character or characters take when represented in writing or displayed in print. Digital fonts consist of glyphs, while natural languages consist of characters. A character can have more than one glyph; e.g. the character ‘a’ written in different fonts gives many glyphs of the same character.

4.5.3 Ligature

The Nastalique style of writing is inherently cursive. The character shapes join together cursively to form words or ligatures. A ligature in Nastalique is a complex unit of several characters bound together while one word may be composed of one or more ligatures as well as isolated characters.

4.6 Ligature in Urdu

A ligature in Urdu is a complex unit of several characters bound together cursively to give a single fluid form. A ligature is not a complete word but can be considered as a compound character.

For example, the word shown in figure 4.4 is formed by two ligatures and an isolated character.


Figure 4.4: Ligature in Urdu

4.7 Word Forming in Urdu

A word in Urdu may be composed of one or more ligatures as well as isolated characters. For example the word ﭘﺎﮐﺳﺗﺎن (Pakistan) is formed by two ligatures and an isolated character.


Figure 4.5: Word forming in Urdu


A word in Urdu is composed of ligatures and isolated characters. The word ﭘﺎﮐﺳﺗﺎن (Pakistan) consists of two ligatures and an isolated character, as illustrated in figure 4.5. Some Urdu words consist of only one ligature, like ‘Sohail’ ﺳﮩﯾل and ‘Shams’ ﺷﻣس. The word ‘Qustuntunia’ ﻗﺳطﻧطﻧﯾہ is among the few words comprising a single ligature with eight character shapes.

4.8 Styles of Urdu Writing

The two prominent styles of writing Urdu are Naskh and Nastalique, figure 4.6.

However Nastalique is the more widely used one.

The figure shows the same sentence written in Naskh, which has a single base line, and in Nastalique, which has multiple base lines.

Figure 4.6: Styles of Urdu writing

4.8.1 Naskh



Figure 4.7: Naskh style of writing

The Naskh style of writing Arabic script languages has a flat, single base line; words are spread horizontally along the base line, taking more space for writing a ligature, figure 4.7. The Naskh style is easier for character recognition compared to Nastalique due to its linearity in writing, its single base line and the non-overlapping of characters in adjacent ligatures.

4.8.2 Nastalique


Figure 4.8: Nastalique style of writing

The Nastalique style of writing is highly cursive in nature, with multiple base lines, overlapping of characters in adjacent ligatures, and vertical stacking of characters within a single ligature, figure 4.8. All this makes character recognition in Nastalique more challenging.

4.9 Nastalique Script Complexities

No Nastalique OCR exists so far, and published research in Nastalique text recognition is almost non-existent. This is a new undertaking which faces additional challenges because no standardized groundwork exists – no Nastalique text processor, no character-based Nastalique font, no Unicode support for Nastalique and no keyboard layout for Nastalique.

The two main issues in Nastalique optical character recognition are its cursive nature and its context sensitivity.

Here we describe the complexities of Nastalique script with particular reference to character recognition.

4.9.1 Cursiveness

The Nastalique style of writing, because of its flowing character forms and shapes, is inherently cursive. The character shapes join together cursively to form words or ligatures. A ligature in Nastalique is a single unit of several characters bound together cursively in a fluid form that creates a variety of compound characters, figure 4.9. A single word may therefore be composed of one or more ligatures as well as isolated characters. Thus Nastalique, with its inherent cursiveness, is a complex script.

Figure 4.9: Nastalique cursiveness

There are 39 different characters in the Urdu alphabet, and most of them have three different shapes, depending on the position in which they appear in a word – initial, medial and final – besides an isolated shape.


Characters join together cursively in a fluid form called a compound character or a ligature.

Words are composed of ligatures and isolated characters. For example, the word ﭘﺎﮐﺳﺗﺎن has two ligatures (ﭘﺎ and ﮐﺳﺗﺎ) and one isolated character (ن), figure 4.10.

Figure 4.10: Word forming in Nastalique

4.9.2 Context Sensitivity

Ligatures in Nastalique are unique combinations or units of characters that change their shape according to their position within the unit. Each of the 39 characters in the alphabet can have two to four different shapes depending upon its position in a word. An initial ‘Bay’ (ب), for example, which is the second character in the alphabet, is quite different from the shape it bears as a medial, final or isolated character. Added to this is the dependence of each character on the preceding or succeeding characters it joins with. A character might take as many as 20 different shapes according to the character it is joining with. Sometimes even the 3rd or 4th preceding or succeeding character may initiate a change in shape. Thus Bay (ب) has multiple instances, and may have as many as 20 different shapes for its initial form. A similar pattern follows for all other characters in the script, which enlarges the database of all possible character shapes for all the letters of the alphabet manifold. Figure 4.11 illustrates two different context sensitive shapes of ‘bay’ ب initial.


Figure 4.11: Context Sensitivity; Two different shapes of Bay-initial

This feature is called context sensitivity – a character shape changes with varying preceding or succeeding characters – and can be presented as an n-gram model in a Markov chain. The first-order Markov model conditions each character shape on its immediate predecessor:

$C_i \mid C_{i-1}$

4.9.3 Dot Positioning

Out of the 39 characters in the Urdu alphabet, 17 have dots above, below or inside the character. These dots can be one, two or three in number, as illustrated in figure 4.12.

There can be three situations of ambiguity between two characters.

Figure 4.12 (a, b, c): Dots in Urdu characters

In situation one, a character has a dot and another does not, figure 4.12 (a).


In the second situation, differentiation between two characters is possible only by determining the number of dots, which may occupy the same position, figure 4.12 (b).

In the last situation, two characters are differentiated by the position of the dots, figure 4.12 (c).

Situations of ambiguity arise because very small changes in the number or position of dots are used to change the sounds of characters.

4.9.4 Kerning

The adjustment of space within a character pair is called kerning. It is usually done by reducing the space between two characters to create a good visual effect; for example, figure 4.13 shows a character pair both kerned and not kerned.

Not-kerned Kerned

Figure 4.13: Roman kerning pair

This shows that in the Roman script, if a character pair is kerned – that is, if the natural space between the two characters is reduced – the characters will overlap each other and become difficult to recognize. This problem is rare in writing the Roman script but very common in the Nastalique style, figure 4.14, because a great number of ligatures form their natural shape only when kerned, thereby making character identification in scanned texts, particularly in cursive scripts, more complicated and tricky.

Not-kerned Kerned

Figure 4.14: Nastalique kerning pair

4.9.5 Character Overlapping

Besides kerning, which is implemented in the font file, the Nastalique style of writing poses another challenge for text recognition that is inherent in this style of writing: character overlapping in words. For example, figure 4.15 shows character overlapping in Nastalique.

Figure 4.15: Character overlapping in Nastalique

In both of the cases in figure 4.15 the isolated character ‘ ’ (Alif) is difficult to identify separately, although both of the words are composed of one ligature and an isolated character.

The cursive nature of Nastalique script results in inter-word as well as intra-word character overlap which makes the character recognition for Nastalique script more challenging.


4.9.5.1 Within a Ligature

Nastalique is naturally cursive, connecting each subsequent character with the previous one by means of delicately curving joints. The joining points or curves in characters and ligatures follow a predefined set of formation rules governed by the size of the pen nib with which the words were originally written down.

Figure 4.16: Character overlap within a ligature

To form a ligature, each character connects with the adjoining character in a fluid form, beginning from the top right and moving diagonally to the bottom left. This feature makes writing in Nastalique space-saving, as characters are stacked up vertically as the ligature moves ahead. It also distinguishes Nastalique from the Arabic Naskh, which is written on a primarily horizontal baseline. One drawback of this feature of Nastalique is that words with multiple characters in a line may clash with words on the preceding lines.

4.9.5.2 Between two adjacent Ligatures

The cursive nature of the Nastalique script results in vertical stacking of characters in a ligature, and the adjacent ligatures also overlap and shadow each other in most cases, as shown in figure 4.17.


Figure 4.17: Inter-ligature character overlap

4.10 Sloping and Multiple Base-Lines

The Nastalique style of writing does not have a horizontal base-line, as is the case with the Roman script; instead it has a sloping base-line, figure 4.18, due to the fact that each word or ligature in Nastalique is written from top right to bottom left. Ligatures in Nastalique are tilted at approximately 30–40 degrees.

Figure 4.18: Nastalique sloping base-line

The inherent cursiveness of the Nastalique style of writing gives rise to sloping as well as multiple base-lines, figure 4.19.

Figure 4.19: Multiple baselines


4.11 A Generic OCR Model

Optical character recognition is, in a broad sense, a branch of artificial intelligence, and it is also a branch of computer vision. Nevertheless, it is a distinct discipline in its own right, analogous to speech recognition. If we imagine designing a robot, both reading and listening functions are indispensable if the robot is to be really intelligent. Of course, even a conventional computer must have the capacity to read input documents, such as in office and library work. Furthermore, prototype electronic libraries have recently started to come on-line. In part, the optical character recognition field has grown because of this specific application.

Figure 4.20 depicts the working of an OCR: a text page is scanned into a page image, which then passes through the OCR components of preprocessing, segmentation and recognition before the recognized text is shown on a display device as a text page in a text editor.

Figure 4.20: Different phases of an OCR

A generic OCR model has a number of phases for optical recognition of printed text as illustrated in figure 4.20. Details of these processes are already covered in chapter one of this thesis.


4.12 Working of a Roman Script OCR

Many languages use the Roman script for writing, e.g. English, German and French. A Roman script OCR performs segmentation at three levels. At the first level the lines of text are identified: the text image is segmented into lines of text and the image of each text line is placed inside a bounding box. This can be done by scanning the text image vertically for black pixels.

Figure 4.21 summarizes the three levels: lines are separated using the horizontal projection profile (level 1), words are separated using the vertical projection profile (level 2), and characters are separated using digital image processing techniques (level 3).

Figure 4.21: Roman OCR has three levels of segmentation

English printed text has a single, horizontal base-line. Some characters extend above the base-line, like b, f, h and l, called ascenders; some extend below the base-line, like g, j, p and q, called descenders; characters such as a, c, e and i remain on the base-line. Lines in a text image can be identified on the basis of the number of white pixels present between the descenders of the previous line and the ascenders of the next line, or a line of text can be described as all black pixels between the start of the ascenders and the end of the descenders.

Next is word-level segmentation, in which, inside the bounding box containing a line of text, smaller boxes are made to contain each of the words in the line. This is done by detecting the presence of wide white spaces between words on a horizontal scan from left to right. When we write a word in a text editor we key in the letters; as soon as all the letters are keyed in we press the spacebar, which inserts a white space character, roughly one character in width, after the word, and on the basis of this white space words are separated from each other. At the last level of segmentation, words are broken down into their constituent letters by making even smaller bounding boxes inside the previously made boxes, one for each letter in the word image, by identifying white pixels between two adjacent letters. In English printed text the letters in a word are placed side by side without touching each other; they are not joined or connected. So at the end of the segmentation process each of the characters in the given text image is isolated, bound in a separate box, and ready to be presented to the recognition stage. The whole process of isolating individual characters is shown in figure 4.21.
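The gap-based word separation just described can be sketched in a few lines of Matlab (the environment used later in this thesis for prototyping). This is a minimal illustrative sketch, not the thesis’s actual code: lineImg is assumed to be a logical image of one text line with foreground (ink) pixels equal to 1, and the inter-word gap threshold minGap is an assumed value that would have to be tuned to the font size.

    % Word segmentation of one Roman text line: project the ink onto
    % the horizontal axis and split where a run of empty columns is
    % wider than the assumed inter-word gap threshold.
    Hv   = sum(lineImg, 1);              % vertical projection profile
    ink  = Hv > 0;                       % columns containing ink
    d    = diff([0, double(ink), 0]);
    lefts  = find(d == 1);               % left edge of each ink run
    rights = find(d == -1) - 1;          % right edge of each ink run
    minGap = 4;                          % assumed inter-word gap (pixels)
    words  = [];                         % rows: [left, right] per word
    start  = lefts(1);
    for k = 1:numel(lefts) - 1
        if lefts(k+1) - rights(k) - 1 >= minGap
            words = [words; start, rights(k)];  % gap wide enough: word ends
            start = lefts(k+1);
        end
    end
    words = [words; start, rights(end)];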

4.13 Working of a Nastalique Script OCR

In the segmentation phase of a Nastalique OCR there are only two levels instead of three, as is the case with a Roman script OCR. The first level of segmentation is the same in both cases, i.e. Nastalique and Roman scripts: the whole text image is segmented into lines of text by placing each line of text inside a bounding box. At the next and final level of segmentation, smaller boxes are made inside the larger bounding boxes containing lines of text by scanning the text image horizontally from right to left; these smaller bounding boxes contain ligatures or isolated character shapes, as illustrated in figure 4.22.

Figure 4.22 summarizes the two levels: the text image is segmented into lines (level 1), and the lines are segmented into words and ligatures (level 2).

Figure 4.22: Nastalique OCR has two levels of segmentation

In the case of a Nastalique OCR, at the end of the segmentation phase it is not all character shapes that are bounded inside the smaller bounding boxes, but ligatures and isolated characters; presented to the recognition stage, these will only result in the recognition of isolated characters and not of the ligatures. A ligature in Nastalique is a complex unit of several characters bound together cursively, while one word may comprise one or more ligatures as well as isolated characters.

4.14 Approaches for Nastalique OCR

4.14.1 Character-based Approach

If we follow the segmentation approach for a Nastalique OCR, then at the recognition phase we will need all the text images segmented down to the character level, similar to a Roman script OCR; that is, all the ligatures in the words are segmented into their constituent character shapes. For illustration we take the example of the Urdu word ﺑﺷﯾﺮ written in Nastalique. Figure 4.23 shows the segmentation of the word into its respective components.


Figure 4.23: A segmented word in Nastalique

The word consists of four components, namely Bay-initial, Sheen-medial, Yey-medial and Ray-final. Now, breaking a ligature into its components can be done only if we are able to define the points of segmentation. The task is not easy even for a simple ligature like ﺑﮩت, and it becomes more complicated in a ligature where characters overlap each other, like ﻋﺟﺎ or ﻟﻣﺣہ. If by any means we are able to separate all the character shapes in the ligature ﺑﮩت, then, as we already know, each character shape in Nastalique has multiple instances – the Bay-initial has 20 – so which character code should be returned on recognition of the Bay-initial here? This has to be supported in the character encoding scheme as well as in the font file, but the problem still needs to be addressed.


4.14.2 Ligature-based Approach

In the ligature-based approach for a Nastalique OCR, which is also segmentation-free, we rely on the two-level segmentation phase, at the end of which we have isolated character shapes or ligatures separated in bounding boxes, as shown in figure 4.24.

Figure 4.24: Ligature-based approach

A ligature-based Nastalique OCR would require a very large number of ligature images to train a learning machine like an Artificial Neural Network (ANN), a Hidden Markov Model (HMM) or a Support Vector Machine (SVM). Many researchers have tried this approach in the case of Arabic Naskh OCR; however, they were not able to achieve high recognition accuracy.

In this thesis we have tried to explore a more innovative technique and create a novel algorithm to implement a Nastalique OCR which is segmentation-free and still character-based.


CHAPTER 5 The Proposed Nastalique OCR System

5.1 Introduction

The main contribution of this thesis is a new algorithm for the design and implementation of an OCR for printed Nastalique text. This algorithm is character-based and at the same time is segmentation-free.

Here, by character-based and segmentation-free we mean that our algorithm recognizes characters in a ligature without segmenting the ligature into its constituent character shapes; this is the novelty of our new algorithm.

In chapter 2, the literature survey, we have seen that researchers in Arabic script OCR have usually followed one of two approaches: either ligature-based, which is segmentation-free and does not break a ligature into its characters, or segmentation-based, which divides a ligature into its character shapes. Neither approach has produced promising results.

Observing that the main challenge in the segmentation-based approach is the improper segmentation of a ligature into its corresponding character shapes, we decided to adopt a segmentation-free approach while still emphasizing character-based recognition, to avoid keeping a large lexicon of ligatures or training a learning machine like a Support Vector Machine (SVM), a Hidden Markov Model (HMM) or an Artificial Neural Network (ANN) on a large database of ligature shapes.


5.2 The Nastalique OCR System Implementation

We used Matlab for rapid prototyping and experimentation; however, the Urdu Nastalique character recognition application was implemented in Microsoft VC++ 6.0.

We performed experiments on Urdu Nastalique words keeping the same font size, and the results are very encouraging.

Our Nastalique character recognition system requires a character-based True Type Nastalique font and an image of Nastalique printed text written with that character-based True Type Nastalique font.

After segmentation is completed, the isolated character shapes and the ligatures have been identified and bounded in rectangular boxes. In the recognition phase, the True Type Nastalique font file is loaded into main memory, and each of the character shapes in the font file is matched against the shapes identified in the text image, line by line, using cross-correlation; as each character is found, its character code is written into a text file in sequence, together with its starting address with respect to the bounding box. When the recognition process is completed, the text file is sorted according to the x-positions, giving a new order to the character codes; these character codes are then given to the rendering engine, which displays the recognized text in a text region.

5.3 The Novel Segmentation-free Nastalique OCR Algorithm

Our novel Nastalique OCR algorithm is presented in figure 5.1 and the corresponding flowchart is given in figure 5.2.


Figure 5.1: Nastalique OCR Algorithm


Figure 5.2: Flowchart for NOCR

5.4 Nastalique OCR Algorithm Description

i. The text image is given to the OCR engine.

ii. A True Type Font file, with which the text in the image was written, is loaded into main memory.

iii. The text image is segmented into lines of text, and each line of text is further segmented into ligatures and isolated characters.

iv. Bounding boxes are made around
   a. each of the ligatures and isolated characters in the segmented text image,
   b. each of the character shapes in the font file, to be used as templates.

v. The first character shape is picked from the font file as a template and moved in the first bounding box in the first line of the text image, starting from the right-hand side.

vi. Template matching is done by cross-correlation.

vii. For each pass the highest peak of the cross-correlation is noted, at which the x-position and the corresponding character code are saved into an array.

viii. If the first character shape is not found in the first bounding box in the text image, the next character shape in the font file is picked and the process repeated, until either all character shapes in the image box are found or the font file is exhausted.

ix. The next bounding box in the text image is selected and the process repeated until all of the bounding boxes in the text image are exhausted and the x-positions and corresponding character codes are saved in an array, in the order they are found in the search procedure.

x. The array is sorted according to the x-positions, giving a new order for the character codes.

xi. The character codes in the sorted order are given to the rendering engine, which uses the same TTF file to render the recognized text in a text region (a sketch of steps v–x is given below).
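To make the flow of steps v–x concrete, the following Matlab sketch shows one plausible realization of the template-matching loop for a single ligature window. It is illustrative only: templates (a cell array of character-shape images), codes (their character codes) and the acceptance threshold are assumptions for illustration, and normxcorr2 is one convenient Matlab routine for the normalized cross-correlation of step vi; the production system was written in VC++.

    % Steps v-x for one ligature window 'lig' (logical image, ink = 1):
    % try every character-shape template, note the x-position of each
    % accepted correlation peak, then sort the hits into reading order.
    hits = [];                                % rows: [x-position, code]
    for t = 1:numel(templates)
        g = templates{t};                     % one character-shape image
        if size(g,1) > size(lig,1) || size(g,2) > size(lig,2)
            continue;                         % template larger than window
        end
        c = normxcorr2(double(g), double(lig));
        [peak, idx] = max(c(:));
        if peak > 0.9                         % assumed acceptance threshold
            [ypeak, xpeak] = ind2sub(size(c), idx);  % peak location
            x = xpeak - size(g,2) + 1;        % left edge of the match
            hits = [hits; x, codes(t)];       % step vii: save code and x
        end
    end
    hits = sortrows(hits, 1);                 % step x: order by x-position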

5.5 Segmentation of Text Image into Lines

To segment a text region in an image into lines of text we use the horizontal projection profile (or histogram) of the text image.


Considering an image F(i, j) of size m×n, the projection of all foreground pixels perpendicular to the vertical axis and along the horizontal direction can be given as:

$H_h(i) = \sum_{j=1}^{n} F(i, j)$ for $1 \le i \le m$ (5.1)

The horizontal projection profile of the text image separates the lines of text based on the presence of white pixels between two adjacent lines. Scanning vertically from top to bottom, a line of text covers the foreground pixels from the top of the ascenders to the end of the descenders in the line of text.
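A minimal Matlab sketch of this line separation follows, under the assumption that F is a logical image with foreground (ink) pixels equal to 1; the row-sum call realizes equation (5.1) directly.

    % Compute H_h(i) of equation (5.1) and cut the image at empty rows.
    Hh   = sum(F, 2);                    % H_h(i): ink pixels in row i
    ink  = Hh > 0;                       % rows belonging to some line
    d    = diff([0; double(ink); 0]);
    tops = find(d == 1);                 % first row of each text line
    bots = find(d == -1) - 1;            % last row of each text line
    for k = 1:numel(tops)
        lineImg = F(tops(k):bots(k), :); % line k, kept in reading order
        % ... further processing of line k ...
    end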

5.6 Segmentation of Text Line into Ligatures

One of the techniques for segmenting a text-line image into ligatures and isolated characters is the vertical projection profile (or histogram) of the text-line image. The projection of all foreground pixels in the extracted line image L(i, j), of size r×s, perpendicular to the horizontal axis and along the vertical direction can be given as:

$H_v(j) = \sum_{i=1}^{r} L(i, j)$ for $1 \le j \le s$ (5.2)

The ligatures and the isolated characters present in the text-line image are identified by the white pixels separating them. The limitation of this technique is that it does not work where adjacent ligatures overlap or shadow each other; the histogram so obtained portrays them as a single ligature.


In other words, this method fails wherever characters in different ligatures overlap vertically. We therefore use connected component labeling to segment out the ligatures.
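The thesis states that connected component labeling was done in Matlab; a plausible sketch using the Image Processing Toolbox routines bwlabel and regionprops (one way of doing it, not necessarily the exact code used) is:

    % Isolate the connected components (ligatures, isolated characters
    % and, as a side effect, diacritics) of one text-line image.
    [lbl, n] = bwlabel(lineImg, 8);            % 8-connected labeling
    props = regionprops(lbl, 'BoundingBox', 'Area');
    for k = 1:n
        comp = imcrop(double(lineImg), props(k).BoundingBox);
        % ... comp is one ligature, isolated character or diacritic ...
    end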

5.7 Character Recognition by Cross-Correlation

A simple method that deals with two-dimensional information is the cross-correlation method. It is a typical method used in pattern matching; since the image of a character shape carries two-dimensional information, we can directly apply cross-correlation to find the identity of the character image.

The correlation of two functions f(x, y) and h(x, y) is defined, in the continuous and discrete cases respectively, as

$f(x,y) \circ h(x,y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f^{*}(\alpha, \beta)\, h(x+\alpha, y+\beta)\, d\alpha\, d\beta$ (4.1a)

$f(x,y) \circ h(x,y) = \frac{1}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f^{*}(m,n)\, h(x+m, y+n)$ (4.1b)

where f* denotes the complex conjugate of f. We normally deal with real functions (images), in which case

$f^{*} = f$ (4.1c)


The term cross-correlation is often used in place of the term correlation to clarify that the images being correlated are different, as opposed to autocorrelation, in which both images are identical.

We consider correlation as the basis for finding matches of a subimage w(x, y) of size J×K within an image f(x, y) of size M×N, where we assume that J ≤ M and K ≤ N. The correlation between f(x, y) and w(x, y) is

$c(x,y) = \sum_{s} \sum_{t} f(s,t)\, w(x+s, y+t)$ (4.2)

for x = 0, 1, 2, …, M−1, y = 0, 1, 2, …, N−1, where the summation is taken over the image region in which w and f overlap.

Let the i-th template, the unknown input character and its domain be g_i(x, y), f(x, y) and R, respectively. The similarity between the template and the input character, based on cross-correlation, is defined as

$S_i(f) = \dfrac{\iint_R f(x,y)\, g_i(x,y)\, dx\, dy}{\left[\iint_R f(x,y)^2\, dx\, dy\right]^{1/2} \left[\iint_R g_i(x,y)^2\, dx\, dy\right]^{1/2}}$ (4.3)

where i = 1, 2, …, L, and L is the number of characters of the given alphabet.


The maximum value of S_i(f) is found by scanning over i; if the maximum is S_j(f), then the input character f is identified as class j. The denominator of (4.3) normalizes the amplitude, [64].
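For two digital images of the same size, equation (4.3) reduces to sums over pixels. A small Matlab helper illustrating this discrete form follows; the function name is ours, for illustration only.

    % Discrete form of the similarity S_i(f) of equation (4.3):
    % the normalized inner product of the input image and a template.
    function s = similarity(f, g)
        f = double(f); g = double(g);                % same-size images
        num = sum(sum(f .* g));                      % correlation term
        den = sqrt(sum(sum(f.^2)) * sum(sum(g.^2))); % amplitude normalization
        s = num / den;                               % 1 for identical images
    end

Scanning i over all L templates and taking the largest value of similarity(f, g_i) then yields the class j described above.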

5.8 Nastalique Text Segmentation

i. The image of Nastalique text is pre-processed to change it into bi-level form.

ii. The text region is segmented into lines of text, separating lines on the presence of white space in a horizontal scan of the text image, using the horizontal projection profile or histogram.

iii. Ligatures in the text image are further separated by applying connected component labeling on the text image.

5.9 Segmentation of Text Image into Lines and Ligatures

The binarized text image for further processing by the Nastalique OCR is given in figure 5.3.


Figure 5.3: Binarized Nastalique Text Image

In the pre-processing stage the text image is binarized before further processing by the Nastalique OCR. Figure 5.3 gives the binarized text image before further processing by NOCR, and figure 5.4 gives the horizontal and vertical projection profiles (histograms) of the binarized text image.


Figure 5.4: Horizontal and vertical projection profiles of the text image

The horizontal and vertical projection profiles (histograms) of the text image, showing the distribution of black pixels over the entire text region, are illustrated in figure 5.4.

Figure 5.5: Text lines separated using the horizontal projection profile


The horizontal projection profile (histogram) is used to separate the text lines in a text image, as shown in figure 5.5. All black pixels are projected along the vertical axis in the horizontal direction: the vertical axis gives the position, while the horizontal axis gives the number of black pixels, or the pixel density, at a particular position. The gaps between the black projections along the vertical direction show the white spaces between lines of text; on the basis of these wide white spaces, the lines of text are separated in the text image.

Figure 5.6: Text image and its horizontal projection profile

Figure 5.6 illustrates the separation of lines in the text image using the horizontal projection profile or histogram; it also gives the correspondence between the horizontal histograms and the lines in the text image.


Figure 5.7: Lines of Text Separated

The text image segmented into lines of text is illustrated in figure 5.7, where the lines of text are separated for further processing by the Nastalique OCR. The system stores each segmented line as a separate image for further processing, assigning it its actual line number in the text image, so that when the recognized text is rendered sequentially, line by line, it has the same flow as the text in the original image.

We used connected component labeling in Matlab to identify and isolate all the ligatures in the text image, as illustrated in figure 5.8. This way we are able to isolate all ligatures and isolated characters. However, as an undesired effect, all diacritic marks are also separated, such as the dots belonging to certain characters. When a character keeps its dots it is identified correctly; otherwise it will be interpreted as some other character. In Urdu we have different characters with the same base shape but with different numbers of dots, dots at different locations, or no dots at all. The dots therefore have to be associated with the character shape so that it can be interpreted as the correct character.


Figure 5.8: Ligatures are separated

The segmentation phase is shown in figure 5.9, a–g, where the text image is first segmented into lines, and then each of the lines into ligatures, isolated characters and primitives. Here we call the diacritic marks primitives.

After segmenting the text image into separate lines, each text-line image is assigned its number and processed separately to identify the number of ligatures, isolated characters and diacritical marks it contains.

The given text image consists of seven lines. The text image is first segmented into these seven lines, and then each line is further processed, with the results shown in figures 5.9 (a) to 5.9 (g).

Figure 5.9 (a) shows that line one has 22 elements, of which 10 are diacritical marks and the remaining 12 are ligatures. The distinction between a ligature and a diacritical mark is made by a trial-and-error method, counting the number of black pixels in each bounding box when connected component labeling is applied to the separated text-line images. By experimentation we found that if the number of black pixels in a bounding box is more than 25 then it is a ligature, otherwise it is a diacritical mark.

Figures 5.9 (a) to (g) are a result of connected component labeling and of counting the number of foreground pixels in each bounding box.

In these plots, the number of bounding boxes is plotted along the vertical axis, and the separation between the start of a bounding box and the vertical axis is plotted along the horizontal axis. The decision whether a bounding box contains a ligature or a diacritic mark is made by counting the number of pixels the bounding box contains: if there are twenty-five or fewer, the box contains a diacritical mark; if there are more than twenty-five, the box bounds a ligature.

So for each text-line image we obtain its line number, the number of ligatures and diacritical marks it contains, and an image showing the separated connected components.
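The 25-pixel decision rule can be written directly on top of the labeling step. This sketch again assumes the bwlabel/regionprops route and a logical line image with ink pixels equal to 1.

    % Split the connected components of a line into ligatures and
    % diacritical marks using the experimentally found 25-pixel rule.
    props = regionprops(bwlabel(lineImg, 8), 'Area');
    areas = [props.Area];                % foreground pixels per component
    isLig = areas > 25;                  % more than 25 pixels: ligature
    nLigatures  = sum(isLig);            % e.g. 12 for text line 1
    nDiacritics = sum(~isLig);           % e.g. 10 for text line 1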


Figure 5.9 (a): Analysis of text line 1

Figure 5.9-a shows line one has 22 elements with 12 ligatures and 10 diacritical marks

Figure 5.9 (b): Analysis of text line 2

Figure 5.9-b shows line two has 25 elements with 11 ligatures and 14 diacritical marks


Figure 5.9 (c): Analysis of text line 3

Figure 5.9-c shows line three has 42 elements with 18 ligatures and 24 diacritical marks

Figure 5.9 (d): Analysis of text line 4

Figure 5.9-d shows line four has 29 elements with 11 ligatures and 18 diacritical marks


Figure 5.9 (e): Analysis of text line 5

Figure 5.9-e shows line five has 38 elements with 13 ligatures and 25 diacritical marks

Figure 5.9 (f): Analysis of text line 6

Figure 5.9-f shows line six has 27 elements with 9 ligatures and 18 diacritical marks


Figure 5.9 (g): Analysis of text line 7

Figure 5.9-g shows line seven has 45 elements with 16 ligatures and 29 diacritical marks

Figure 5.10: All elements in the text image separated


All the ligatures, isolated characters and diacritical marks are separated using connected component labeling in Matlab, as shown in figure 5.10.

The diacritical marks include dots, punctuation marks, vowel marks, hamza etc.

Figure 5.11 (a): Ligature overlap line 1

In the case of the Latin script and other discrete-script OCRs, we can use the vertical projection profile (or histogram) to separate all the characters in a line of text, on the basis of the small white spaces present between characters and their non-overlapping nature. In figures 5.11 (a) to (d) the vertical projection profile is plotted, with the position on the text line along the horizontal axis and the pixel density along the vertical axis. However, some ligatures could not be separated for recognition due to the overlapping between adjacent ligatures.


Once the text-line images are separated from the text image, each is processed as a separate image containing only one line of text. In discrete scripts like the Roman script, individual characters are separated by the vertical projection profile of the text image, projecting all black pixels in the text-line image along the vertical axis and leaving white gaps between the characters; in printed Roman text, discrete characters do not overlap or shadow each other. In the case of the Nastalique script, however, its cursiveness causes a great deal of overlapping and shadowing of ligatures in the printed text, so ligatures cannot be separated using a simple vertical histogram of the text-line image. Figures 5.11 (a) to (d) illustrate clearly that there is much overlapping of words and ligatures in printed Nastalique text, which makes ligature separation a more challenging job.

The text-line image is scanned horizontally from left to right looking for black (foreground) pixels, and as they are found they are projected along the vertical axis to give the histogram, showing the distribution of pixels along the horizontal axis and the pixel densities along the vertical axis. Thus we see that there is much overlapping between adjacent ligatures.


Figure 5.11 (b): Ligature overlap line 2

Figure 5.11 (c): Ligature overlap line 3


Figure 5.11 (d): Ligature overlap line 4

5.10 Recognition Technique

Figure 5.12 (a) is an image of a word written in the Nastalique script (the word is actually ‘Nastalique’, ﻧﺳﺗﻌﻠﯾق). This word is a good example for illustrating the cursiveness and context sensitivity of the Nastalique script, which are the two main challenges in Nastalique character recognition. The word Nastalique has seven characters joined cursively in a fluid form to give a single-ligature word.

Figure 5.12 (a): A word in Nastalique


Figure 5.12 (b): Separated character shapes in a word

The word has one character at the initial and one at the final position, while the remaining five characters all occupy medial positions, as both their left and right ends join other characters. These initial, medial and final character shapes are shown separated from each other in figure 5.12 (b). These separate context sensitive shapes are stored in the font file with their corresponding character codes, as illustrated in figure 5.12 (c).

71 70 69 68 67 66 65

Figure 5.12 (c): Character shapes with their corresponding character codes

In text-processing applications we key in character codes to display the characters on the screen, while in character recognition the algorithm returns a character code as the result of recognizing a character in an image. In the recognition process of our Nastalique OCR, character codes are returned as recognition results and passed to the rendering engine to display the recognized text on the screen.


5.11 Nastalique OCR Application

The graphical user interface, figure 5.13, along with the OCR application for Nastalique text recognition, has been implemented in Microsoft Visual C++ 6.0. The user interface presents an easy and interactive way to use the application. Its features are explained as follows.

5.11.1 The Dialogue Boxes

The dialogue boxes in our Nastalique application are explained as follows:

a. The ‘Open Image’ dialogue box allows the system to select a text image from the source and prepare it for recognition. This image is displayed in the larger text box appearing in the centre of the screen.

b. The ‘Load Font File’ dialogue box presents the list of fonts that can be selected from the drop-down menu and activates the selection for the recognition procedure.

c. The ‘Result’ dialogue box displays the recognized text in the smaller left-hand screen.

d. The ‘Exit’ button allows the user to quit the program.


Figure 5.13: Nastalique OCR User Interface

Figure 5.14: Input word image selection

Figure 5.14 shows an image selected and loaded in the input image area.


Figure 5.15: Font selection

The ‘Load Font’ button displays the dialogue box that prompts the user to supply basic information about the font file to the system, as shown in figure 5.15. A selection can be made from the available font files, which should ideally correspond to the font style in which the input text was created.

5.12 The Recognition Procedure

Two windows are formed: a larger one around the ligature in the image, and a smaller one around the character-shape image from the font file, which acts as a template. The top-right and bottom-left corner addresses of the larger ligature window are saved with respect to its line number and the position of the ligature in its line.


In the recognition process a rectangular bounding box is made around each of the ligatures and isolated characters in a separated text-line image. The font file is already loaded in main memory, and bounding boxes are also made around the character shapes in the font file to act as templates. The first template from the font file is picked with its small window, and this small window is moved within the larger image window around a ligature in the image, from the top-right corner towards the bottom-left corner, one pixel left and one pixel down at a time, matching the template from the font file against the character shapes in the larger ligature window using cross-correlation and observing the correlation peak value at each step.

Figure 5.16 illustrates the recognition procedure: character-shape templates from the font file (character codes 65–71) are matched against the image by template matching, and the character codes with their x-positions are saved in a text file.

Figure 5.16: Cross-correlation for recognition

Figure 5.16 shows that each ligature in the text image is bound with a rectangular bounding box, and the start and end positions of each boundary along the horizontal axis are noted.


Figure 5.17 illustrates the sorting step: the unsorted text file of character codes and x-positions is sorted by x-position to put the character codes into reading order for display.

Figure 5.17: Recognition process

One by one, character-shape templates from the font file are tried to find a match in the ligature window. As a template is matched to a character shape somewhere in the ligature, its character code, along with the starting address of the character shape within the ligature window, is saved in a two-dimensional array, figure 5.17, pairing character codes with the respective starting addresses of the character shapes in a given ligature.



Figure 5.18: Recognition complete

When the recognition procedure is completed, the array is sorted with respect to the starting addresses of the character codes in each ligature, relative to their ligature windows and lines. These sorted arrays are then given to the rendering engine to display the recognized text in the order it was found in the text image.

The result of single word recognition is illustrated in figure 5.18.


Figure 5.19: Multiple words single line

Figure 5.19 shows the result of recognizing multiple words in a single line of the text image.

Figure 5.20: Multiple words multiple lines

Figure 5.20 shows the result of recognizing multiple words in multiple lines of the text image.

5.13 The Recognition Process

The recognition process is a step-by-step procedure in which the system reads, recognizes and displays the recognized text on the screen. The process takes the first character template from the font file and matches it against all the characters present in the segmented input text image. Once a match is made, the template’s character code is recorded along with the address at which it matched the corresponding character within the bounding box of the ligature image, giving it a subscript in the array x, where x represents all the character addresses in a single ligature image. The recognized characters are first stored in the order they are recognized and then sorted according to the addresses at which they were found in the input image. Once all the characters have been put in the order in which they appear in the text image, they are displayed on the screen.


CHAPTER 6 Conclusion and Future Work

6.1 Introduction

This chapter summarizes the research work done on this project and discusses directions for future research on this and related topics. We have also included in this chapter our proposed Nastalique Text Processor Model which, in our opinion, if realized, shall help in increasing the processing as well as the recognition efficiency of our proposed Nastalique OCR system.

6.2 Introduction to Nastalique Text Processor Model

For our Nastalique OCR system, we develop character-based True Type Font files for Nastalique. Words are written using this character-based TTF font and an image is made of the Nastalique text. The image is then given to our Nastalique OCR system. After recognition, rendering is done using the same TTF font file to display the recognized text. The work is therefore three-fold: development of a character-based Nastalique True Type Font, Nastalique character recognition, and rendering of the recognized text using the character-based Nastalique True Type Font.

The term writing system implies a formatting and layout system in which the user is required to do no more than key in a pure sequence of letters in their spoken order; the computer stores and transmits this information as a plain text sequence, performing automatic contextual formatting and directional layout to render the text in its proper typeset appearance.

Writing systems for most languages are simple, owing to the fact that they are free from context dependence and are not inherently cursive. On the contrary, Urdu Nastalique, because of its context sensitive features and cursiveness, is considered one of the most complex styles for electronic typography. Creating a model for the digital implementation of Urdu Nastalique therefore has its limitations and challenges, which can be duly attributed to its natural cursive and context sensitive features. However, the goal, though difficult to achieve, can be made possible by creating versatile and extensive algorithms.

A basic requirement for creating a writing system is to make available a good set of fonts representing each character in a variety of ways. A unique way to do this is to form a character-based Nastalique font which stores all individual characters along with their possible contextual shapes. This results in a greater variety of ways to combine characters but increases the number of rules for character combinations.

This would imply that all possible shapes of characters in ligatures are determined and added to the font files. A set of rules would then be formed to allow for the correct combination of characters into words, including rules governing context sensitivity, cursiveness, the positions of diacritic marks, bi-directional behaviour, etc.

Besides being cursive, characters in the Urdu Nastalique writing script are not consistent but context sensitive: they change and adopt shapes according to their position in the word and their adjoining characters on both sides. Also, the words do not follow a completely flat, horizontal base line; rather, they begin at the top right and, moving diagonally, finish at the bottom left. Character and word heights vary accordingly.

We discussed the complexities of the Urdu Nastalique script in detail, in particular those related to character recognition and to computing in general, in chapter two of this thesis, so we are not going to discuss the same topics again here, for the sake of avoiding redundancy.

The native writing system for Urdu Nastalique will serve for the localization of regional languages, desktop publishing in Urdu and global text searches.

Over the years Naskh became the most popular Arabic writing style for typing because of its relatively flat, horizontal baseline and its legibility, which made it comparatively easier to adopt for typography. Since then a more standardized version of Naskh has evolved and found its way into printed Arabic text. The fancier form, Nastalique, which was adopted for Urdu and Persian, did not gain the same popularity despite its space-saving features and more artistic form. However, interest is now being shown in automating Nastalique for professional typesetting [36].

Except for a few characters, most characters in Arabic join the subsequent ones to form ligatures. There are just a few that do not join a subsequent character and remain either isolated or joined at one end only.


Earlier attempts to make a Nastalique typewriter failed due to the large number of ligatures required to create a font set. Likewise, the task of computerizing Nastalique has two approaches: the ligature-based approach and the character-based approach.

The ligature-based approach offers the creation of extremely fine and well-formed fonts, though the number of glyphs increases phenomenally, to approximately 17,000 and perhaps even more, because new words from foreign languages need to be absorbed. There are thus innumerable combinations that can be made with characters joining other characters in a variety of ways. The visual effect of this approach is obviously of higher quality, as the joining of characters is embedded in the system and not based on an algorithm. This approach, however, makes it impossible to build the font files from different calligraphers because of the large number of glyphs required in the font file.

The character-based approach works in a different manner. Here the shapes of the characters depend on the preceding and subsequent glyphs, which is what makes Nastalique a highly complex script. For developing this system of writing, the various shapes of individual Urdu characters are included in the sets of glyphs, and rules are determined that govern the formation of words. The advantage of this approach is the relatively small number of glyphs required for its creation and its accessibility to be recreated by different calligraphers.

Urdu is one of those world languages which have a complex writing system and are context sensitive. Nastalique text formatting is therefore an arduous undertaking, and implementing a text processing system would mean creating extensive rules for letterform combination and analysis.


As Urdu takes its origin from the Arabic writing script, it follows a bi-directional style: although all words are formed from right to left, numerals are written from left to right. Another inherent feature of the Nastalique script is cursiveness. Each letter connects with the adjoining character in a fluid form, beginning from the top right and moving diagonally to the bottom left. This feature makes writing in Nastalique space-saving, as characters are stacked up vertically as the word moves ahead, and it also distinguishes Nastalique from the Arabic Naskh, which is written on a primarily horizontal baseline. One drawback of this feature is that words with multiple characters in a line may clash with words on the preceding lines. The system is also non-monotonic in nature, and other complexities arise due to the positions of dots and other diacritical marks. All these complexities make Nastalique an extremely difficult writing system to implement for computerized text processing.

Nastalique is naturally cursive, connecting each subsequent character with the previous one by means of delicately curving joints. The joining points or curves in characters and ligatures follow a predefined set of formation rules governed by the size of the pen nib with which the words were originally written.

Each character in Nastalique follows prescribed writing rules which are defined in relation to the width of the flat nib of the pen, called the qat. Letters are written using a flat nib, and both the trajectory of the pen and the angle of the nib define the glyph representing a letter. Figure 6.1 displays some of the rules for forming characters in Nastalique in terms of measurement in qat [20].


Figure 6.1: Measurements of Letters in qat

Because of its cursive qualities Urdu Nastalique is more space-sparing, while Naskh, with its relatively more horizontal baseline, occupies more space for each word. Due to this feature there is also an element of style and decorative value, as words thus formed have a more artistic fluidity.

Nastalique is also non-monotonic in nature: certain characters have strokes that move backwards, sometimes even beyond the previous characters.

6.3 Nastalique Character Shapes

In Nastalique we have two to four basic character shapes for each character in the alphabet: initial, medial, final and isolated. Except for the isolated shape, these are position-dependent shapes and can have several forms depending on the preceding as well as the succeeding characters.

Each of these character shapes has multiple instances; for example Bay, ب, which is the second letter in the Urdu alphabet, has 20 different shapes for its initial form. These character shapes are context sensitive – character shapes change if the preceding or the succeeding character changes. At times even the 3rd or the 4th character may cause a similar change, depicting an n-gram model in a Markov chain.

Besides the four basic character shapes in Nastalique typography, i.e. initial, medial, final and isolated, we also define a white space character (ws). In Nastalique typography the white space character is an indicator of the completion of a ligature. As we key in a sequence of characters from the keyboard, the first input character takes the initial shape, while the subsequent characters keep taking medial shapes until a white space character is input; the last character keyed in just before the white space takes the final shape.

6.4 Nastalique Joining Characters Features Set

We call a set comprising the features of Nastalique character shapes a Nastalique Joining Characters Features Set, which can take the following feature values: height, thickness, angle and rotation. These features are also termed attributes; thus the Nastalique feature set is given as:

FS = { Height, Thickness, Angle, Rotation }

Word forming in Nastalique is not as simple as in the Roman script. In Nastalique a word is composed of ligatures and isolated character shapes. Sometimes a word consists of only a single ligature, e.g. ﺑﮩت.

Two or more character shapes join cursively to form a ligature. While forming a ligature, a character shape can join another from the left as well as the right-hand side. When a character shape joins another from its right side with its own left side, then at the joining point the two feature sets must match.

So we have features defined at one or both of the two possible joining ends of each character shape: Left Features (LF) and Right Features (RF).

6.5 Proposed Nastalique Text Processor Model

In this section we propose a finite state model of a character-based Nastalique text processor, illustrated in figure 6.2, which, when implemented, would produce perfect Nastalique text using a character-based font and a knowledge base consisting of rules for joining the various character shapes cursively together to form a Nastalique ligature.

The proposed Nastalique Text Processor can be modeled as a finite state automaton:

Figure 6.2: Nastalique Text Processor Model

A finite state automaton A is defined by a five-tuple [35] as follows:

A = ( Q, ∑, δ, q0, F ), where

Q: a finite set of states
∑: a finite set of input symbols, also called the alphabet
q0: the starting state
F: a finite set of accepting states
δ: the transition function, specified as δ: (Q × ∑) → Q

The transition function δ takes two input arguments, an ordered pair whose first element is from Q, the set of states, and whose second element is from ∑, the set of input symbols or alphabet, and returns as its output an element of Q, that is, a state. So the transition function processes an input symbol at a particular state and gives the next state the automaton shall be in. The transitions made by the transition function on the various input symbols of our proposed Nastalique Text Processor Model are shown in figure 6.3. All possible transitions by the transition function on the input symbols from ∑, the alphabet, are shown in table 6.1.

The set of states of the proposed Nastalique Text Processor Model is

Q : { q0, q1, q2, q3, q4 }, where

q0 = starting state and final state
q1 = input is an initial character shape
q2 = input is a medial character shape
q3 = error state / incomplete word
q4 = input is a final character shape

The input character set, or alphabet, ∑, can take the following values:

∑ : { initial, medial, final, isolated, white space }


= { in, m, f, is, ws }

F, the set of final or accepting states, is given as

F = { q0 }

Table 6.1: Transition Table for NTPM

δ      in    m     f     is    ws

q0     q1    q3    q3    q0    q3

q1     q3    q2    q4    q3    q3

q2     q3    q2    q4    q3    q3

q3     q3    q3    q3    q3    q3

q4     q3    q3    q3    q3    q0

Figure 6.3: Transition Diagram for NTPM
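Table 6.1 can be executed directly as a lookup matrix. The Matlab sketch below encodes the states q0–q4 as indices 1–5 and the symbols {in, m, f, is, ws} as indices 1–5, and runs the automaton over one keyed-in shape sequence. It is an illustration of the model only, not part of the OCR implementation.

    % delta(state, symbol): rows are q0..q4, columns are in, m, f, is, ws;
    % entries are next states (q0..q4 stored as 1..5), per table 6.1.
    delta = [2 4 4 1 4;    % q0
             4 3 5 4 4;    % q1
             4 3 5 4 4;    % q2
             4 4 4 4 4;    % q3 (absorbing error state)
             4 4 4 4 1];   % q4 (white space completes the ligature)
    symNames = {'in','m','f','is','ws'};
    word  = {'in','m','f','ws'};          % a three-character ligature + ws
    state = 1;                            % start in q0
    for k = 1:numel(word)
        s = find(strcmp(symNames, word{k}));  % symbol index
        state = delta(state, s);              % one transition
    end
    accepted = (state == 1);              % F = {q0}: true for this input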


6.6 Components of Nastalique Text Processor Model (NTPM)

Our proposed Nastalique text processor model has the following two functional components:

6.6.1 Character Shape Recognizer

A character in Nastalique can take any of the four basic shapes from its shape set. This component receives the input from the keyboard as a character code and checks whether the character has LFs only, RFs only, both LFs and RFs, or neither. On the basis of the character shape determination rules, given below, it decides whether the newly keyed-in character takes the initial, medial, final or isolated shape.

The character shape determination rules are as follows:

1. If (X has RFs) ∧ ¬(X has LFs) then (X takes the Final shape)
2. If (X has LFs) ∧ ¬(X has RFs) then (X takes the Initial shape)
3. If (X has LFs) ∧ (X has RFs) then (X takes the Medial shape)
4. If ¬(X has RFs) ∧ ¬(X has LFs) then (X takes the Isolated shape)

These rules, put simply, state that if an input character shape has only LFs then it is Initial; if it has only RFs then it is Final; if it has both LFs and RFs then it is Medial; and if none of the features are available then it is Isolated.
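These rules map directly onto code. The sketch below assumes the CharacterShape representation given earlier in this chapter, in which an absent feature set is None; the function name is an illustrative assumption.

    def shape_of(char):
        # Character shape determination rules of section 6.6.1.
        has_lf = char.lf is not None
        has_rf = char.rf is not None
        if has_rf and not has_lf:
            return 'final'       # rule 1
        if has_lf and not has_rf:
            return 'initial'     # rule 2
        if has_lf and has_rf:
            return 'medial'      # rule 3
        return 'isolated'        # rule 4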

6.6.2 Next-State Function

Having decided the shapes of characters in a ligature as initial, medial, final or isolated, the system checks for multiple instances of these character shapes.

In Nastalique, multiple instances of these character shapes are encoded in the font file. The next-state function compares the LFs and RFs on a one-to-one basis, decides the correct instance of the character shape for a particular position in the ligature, returns its character code, and moves to the next state, where it expects the next input from the keyboard.

As the input process proceeds and character codes are received from the keyboard, this component decides which particular contextual shape, among the many available, an input character takes.

As soon as the next-state function receives a white space character as input from the keyboard, it displays the completed ligature on the screen and returns to the starting state, awaiting the first input of the next ligature.

The Nastalique Text Processor Model is always in a state expecting a new input character from the keyboard, until it receives a white space character, which results in either an error state, indicating an incomplete ligature, or an accepting state, in which all the characters received from the keyboard are displayed joined together in the form of a ligature.
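The instance selection step of the next-state function can be sketched as follows; matching of the feature sets at the joining point is simplified here to plain equality of feature values, and both the function and its failure behaviour are illustrative assumptions rather than the thesis's exact procedure.

    def select_instance(prev_shape, candidates):
        # Pick the instance of the new character whose right features
        # match the left features of the previously placed shape at the
        # joining point (the earlier character sits to the right in
        # right-to-left text).
        for inst in candidates:
            if inst.rf is not None and inst.rf == prev_shape.lf:
                return inst
        # No matching instance: the automaton falls into the error
        # state q3 (incomplete word).
        return None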

6.7 Conclusion

Urdu writing takes many of its features from written Arabic; however, it differs from Arabic in the number of characters and the placement of several diacritical marks. In the era of language automation and text processing, Arabic Naskh gained greater popularity due to its simplicity of style, its flatter horizontal baseline and its convenience of processing. Most multinational software companies have tried to automate Naskh to increase their foreign market share, yet a reliable Naskh OCR with an acceptable rate of accuracy is still not available.

As far as research work is concerned, literature is available describing techniques that have been tried and tested on Arabic to represent the script electronically and to create recognition models, although these have not been completely successful.

Approaches that have so far been adopted to develop Arabic character recognition fall into two major categories: ligature based and character based.

On the other hand, Nastalique, a more decorative and artistic version of written Arabic which has been adopted to write Persian and Urdu, was long considered too complex, because of the variety of its features, to be fully automated and made accessible to machine recognition. Advancement in technology has, however, brought growing interest in automating complex writing systems such as those of many Asian languages, and hence our interest in bringing the digital recognition of the Nastalique script to par with this research.

With relatively little previous research available in this context, our exploration began with the various methods that have been employed to recognize Arabic Naskh, which exhibits considerable, though not complete, resemblance to Nastalique. However, the methods, tools and techniques that had been used with relative success on Naskh did not prove equally successful with Nastalique. We therefore had to find more innovative ways to create a new recognition strategy for this script, and in this regard we were able to develop and present a novel technique for recognizing Nastalique text which is character based and segmentation free.

This thesis presents the various steps in the design and implementation of a novel algorithm for a Nastalique OCR system. It also gives details of the various approaches currently in use in the design of OCR systems, and it includes the extensive research work undertaken to investigate past work in similar areas, with particular emphasis on Arabic-script languages.

The thesis presents in detail the complexities encountered in the implementation of a Nastalique OCR system and explores the reasons behind its limitations.

Some of the details described in the thesis are summarized below.

Urdu has a unique system of representation. It derives its script from Arabic, following most of its rules for word and ligature formation. Although Arabic today is written in the Naskh script, Urdu follows a style that evolved later as a merger of Taleeq and Naskh. Urdu also has a greater number of characters, owing to the more varied set of sounds characteristic of the Urdu language.

Nastalique also has inherent complexities, which were explored in detail in this thesis. It is naturally cursive, connecting each subsequent character to the previous one by means of delicately curving joints. The joining points and curves in characters and ligatures follow a predefined set of formation rules governed by the size of the pen nib with which the words were originally written.

The style is also bi-directional, with characters moving from right to left while numerals run in the opposite direction. As words are written down, characters are stacked vertically as they are kerned and cursively joined, while some characters move backward beyond the previous character. These factors add to the difficulty of developing an OCR for Nastalique.

While Naskh was adopted to scribe Arabic, Nastalique became the more popular style for Persian and Urdu. Compared to Nastalique, Naskh is simpler in its features, following a single, flat, horizontal baseline, while Nastalique has multiple baselines. Nastalique also has other features that make it a complex system to adapt for computer automation.

Our research indicates that there is practically no reliable work done to automate Urdu Nastalique, either for developing an OCR or for everyday computing needs; whatever work is currently available is proprietary rather than based on universally accepted standards. The Unicode standard also treats Urdu as a sub-language of Arabic, with a few added characters to cater for those not present in Arabic. Unicode supports the writing of Urdu in the Naskh style but so far does not support the Nastalique style of Urdu writing.

This is why our literature survey comprises more work on Arabic than on Urdu. The survey reveals that the experimentation and research conducted so far on the design and implementation of Arabic OCR have followed two main approaches:

1. Character based

2. Ligature based

Both approaches have their own potentials and drawbacks, but in general the results of neither have been very encouraging.

The character based approach needs segmentation of the text down to the character level, and in Arabic-script text the points of segmentation are the main challenge to define, so poor segmentation results in low recognition accuracy.

The ligature based approach to an Arabic-script OCR, on the other hand, requires a very large number of ligature images to train a learning machine such as an ANN, HMM or SVM. Many researchers have tried this approach but were unable to achieve high recognition accuracy.

In this thesis we have explored a more innovative technique to create a novel algorithm for implementing a Nastalique OCR.


Extensive research was done on the Urdu Nastalique script to explore the complete alphabet, the positions of dots, the diacritical marks and the rules of word and ligature formation, as presented in chapter 4 of this thesis.

Urdu has a more extended alphabet than Arabic, owing to the many sounds it needs to cater for. This is achieved by increasing or decreasing the number of dots, changing their position with respect to some basic character shapes, or changing the diacritical marks.

We have categorized the Nastalique alphabet according to the base shapes and the positions of diacritical marks, and on this basis a computational model has been constructed.

We have also made thorough comparisons between Nastalique and Naskh and analyzed the differences. Naskh is written essentially horizontally, while Nastalique promotes vertical stacking of characters, which often overlap one another when joined cursively.

We have also studied in detail Roman-script OCRs, which use the character based segmentation approach for recognition.

A word in Urdu is either a single ligature or a combination of ligatures and isolated characters. However, from the point of view of building an OCR, the ligature based approach was also not feasible, as it requires a very large set of ligatures and is computationally expensive.


In chapter 5 we proposed a novel segmentation-free algorithm for the design and implementation of an OCR (Optical Character Recognition) system for printed Nastalique text, a calligraphic style of Urdu, which uses the Arabic script for its writing.

Our proposed algorithm for Nastalique character recognition does not require segmentation of a ligature into its constituent character shapes; rather, it requires only two levels of segmentation: first the text image is segmented into lines of text, and then each line of text is further segmented into ligatures and isolated characters.

The algorithm takes a text image as input. Segmentation is done at two levels: the text image is segmented into lines of text, and the lines are segmented into ligatures. There is no further segmentation at the ligature level to split a ligature into its constituent character shapes.
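The thesis performs this segmentation in Matlab 7; as a language-neutral illustration only, the following NumPy sketch applies the same two-level scheme to a binary text image (ink = 1) using projection profiles, a common way of locating the blank gaps between lines and between ligatures. It is a simplification and not necessarily the exact method of chapter 5; in particular it ignores ligatures that overlap horizontally.

    import numpy as np

    def ink_spans(binary, axis):
        # Return (start, end) index pairs of the non-blank runs in the
        # projection profile taken along the given axis.
        profile = binary.sum(axis=axis)
        spans, start = [], None
        for i, v in enumerate(profile):
            if v > 0 and start is None:
                start = i
            elif v == 0 and start is not None:
                spans.append((start, i))
                start = None
        if start is not None:
            spans.append((start, len(profile)))
        return spans

    def segment_page(binary):
        # Level 1: split the page into lines of text; level 2: split each
        # line into ligatures and isolated shapes. No segmentation is
        # performed below the ligature level.
        pieces = []
        for top, bottom in ink_spans(binary, axis=1):
            line = binary[top:bottom, :]
            for left, right in ink_spans(line, axis=0):
                pieces.append(line[:, left:right])
        return pieces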

At the recognition step the system uses character templates from the font files, which have been loaded into main memory. These templates are matched against the character shapes present in the ligatures using cross-correlation. A character shape is successfully found in a ligature at a particular point if the template matching process gives its highest correlation peak at that point in the ligature.
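A minimal sketch of this matching step is given below, using OpenCV's normalized cross-correlation; the function name, the threshold value and the use of cv2.matchTemplate are assumptions of this illustration, since the thesis specifies only that the highest correlation peak marks the position of a character shape.

    import cv2

    def find_character(ligature_img, template, threshold=0.9):
        # Slide the character template over the ligature image (both
        # single-channel, same dtype) and compute the normalized
        # cross-correlation at every position.
        result = cv2.matchTemplate(ligature_img, template, cv2.TM_CCORR_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(result)
        # The character shape is taken to be present where the highest
        # correlation peak occurs, provided the peak is strong enough.
        if max_val >= threshold:
            return max_loc, max_val   # (x, y) of the match and peak value
        return None, max_val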

In chapter 5 we also presented the process of segmenting the lines of text into ligatures, which was done using Matlab 7.

Our technique deliberately avoids the previously adopted segmentation techniques, so that the negative aspects of segmentation processes that lower recognition results are kept to a minimum. Its novelty lies in the fact that it segments lines of text into ligatures and then uses the ligatures for character based recognition without splitting them into constituent character shapes.

Our research and experimentation point to an important prerequisite for making character based Nastalique recognition possible: a character based writing system, which is currently not available. As such, there is no standard character based Nastalique text processing system. For this purpose we created a Nastalique character based font, wrote the text using this font file, made an image of the text and then gave it to our Nastalique OCR for recognition.

In this thesis we have also proposed a Finite State Model for such a text processor. It is intended to set a line of action that would lead to the development of a suitable text processor and so make text recognition of Urdu Nastalique easier to accomplish. Although we have created and presented the model, the text processor has not yet been implemented, so results have not been reported. It is hoped, however, that future research will pave the way for an appropriate implementation of this model.

Our research sets the groundwork for further exploration of the presented techniques in various directions, e.g. video text recognition.

We have presented OCR work that extends into the field of video text detection and extraction. There is considerable interest in the area of video text extraction, and our experimentation in this regard has given encouraging results.


Video text recognition is a distinct field of character recognition, in which the text of interest often appears against a more complex background than printed text, which appears on a background that is more consistent in texture and color.

Video text is primarily of two kinds: scene text and artificial text. Extracting artificial text is comparatively easier, as it has been superimposed on the background at the post-processing and video-making stage and largely conforms to current text-writing patterns and standards. Scene text, i.e. text embedded within the video scenes themselves, is the most difficult to identify, owing to the fact that it can appear almost anywhere and in any form, colour or size.

Our experimentation with VOCR is limited to text area detection and text extraction from the lower one-third segment of the video frames, which generally displays information about the speaker on the screen, breaking-news slides or video subtitles. The results of text area detection and extraction are reported in the thesis.

Writing systems for most languages are simple, in that they are free from context dependence and are not inherently cursive. On the contrary, Urdu Nastalique, because of its context-sensitive features and cursiveness, is considered one of the most complex scripts for electronic typography. Creating a model for the digital implementation of Urdu Nastalique therefore has its limitations and challenges.

Since our character-based segmentation-free Nastalique OCR algorithm needs, as groundwork, a character-based Nastalique text processor, and since no native standard text processing system is available, we have proposed a Finite State Nastalique Text Processor Model. The implementation has not yet been done, so results are not reported; however, this model could serve as an impetus for future research in this challenging field.

Our proposed Finite State Model of a character-based Nastalique text processor, when implemented, would produce correct Nastalique text using a character-based font and a knowledge-base consisting of rules for joining different character shapes cursively to form Nastalique ligatures.

A native writing system for Urdu Nastalique will serve the localization of regional languages, desktop publishing in Urdu and global searching.

6.8 Contribution

The main contribution of this thesis is the character based, segmentation-free algorithm of an OCR for printed Nastalique text, presented in chapter 5 of this thesis together with experimentation and results on text image segmentation and text recognition.

In chapter 3 we have included the details of our experimentation, with results, on the detection and extraction of text regions from video frames for video OCR, together with the relevant literature survey.


We have also included in this thesis some research work that extends our main research topic, including the Finite State Model of a Nastalique Text Processor presented at the beginning of this chapter.

6.9 Future Work

To keep our problem of investigating a simple and straightforward technique for the design and implementation of an OCR for printed Nastalique text tractable, we kept its boundaries narrow. For future work, however, we plan to extend our Nastalique OCR algorithm to work on a larger set of Urdu Nastalique words and to make it more flexible and robust, so that it accepts text in different font sizes.

We also plan to extend our Urdu Nastalique OCR to a video OCR for Arabic-script languages. We have already performed successful experiments on the detection and extraction of text regions from video frames; the results of these experiments shall be used for recognition by our Urdu Nastalique OCR.


References

[1] S. I. Abuhaiba, “Recognition of Off-Line Handwritten Cursive Text,” Ph.D. thesis, Department of Electronic and Electrical Engineering, Loughborough University, Loughborough, U.K., 1996.

[2] I.S. Abuhaiba, “A Discrete Arabic Script for Better Automatic Document Understanding,” The Arabian J. Science and Eng., vol. 28, pp. 77-94, 2003.

[3] Al-Badr B, Haralick R. Segmentation-free approach to text recognition with application to Arabic text. International Journal on Document Analysis and Recognition, (1998) 1: 147-166.

[4] Al-Emami S, Usher M. On-line recognition of handwritten Arabic characters. IEEE Trans. Pattern Analysis and Machine Intelligence 1990; 12(7): 704-710.

[5] S. Alma’adeed, C. Higgens, and D. Elliman, “Off-Line Recognition of Handwritten Arabic Words Using Multiple Hidden Markov Models,” Knowledge-Based Systems, vol. 17, pp. 75-79, 2004.

[6] S. Alma’adeed, C. Higgens, and D. Elliman, “Recognition of Off-Line Handwritten Arabic Words Using Hidden Markov Model Approach,” Proc. 16th Int’l Conf. Pattern Recognition, vol. 3, pp. 481-484, 2002.

[7] S. Alma’adeed, D. Elliman, and C.A. Higgins, “A Data Base for Arabic Handwritten Text Recognition Research,” Proc. Eighth Int’l Workshop Frontiers in Handwriting Recognition, pp. 485-489, 2002.

[8] H. Almuallim and S. Yamaguchi, “A Method of Recognition of Arabic Cursive Handwriting,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 9, pp. 715-722, 1987.

[9] Al-Yousefi H, Udpa S. Recognition of Arabic characters. IEEE Transactions on Pattern Analysis and Machine Intelligence 1992; 14(8): 853-857.

[10] Amin A. Recognition of printed Arabic text using machine learning. Proceedings of the International Society of Optical Engineers, SPIE, 1998; 3305:63-70.

[11] Amin A, Mari J. Machine recognition and correction of printed Arabic text. IEEE Transactions on Man, Machine and Cybernetics 1989; 19(5): 1300-1306.

[12] Arica N, Yarman-Vural FT. Optical character recognition for cursive handwriting. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002; 24(6): 801- 813.

[13] Bahri, Z.; and Kumar, B. V. K. 1988. Generalized Synthetic Discriminant Functions. J. Opt. Soc. Am. A 5 (4) 562–571


[14] I. Bazzi, R. Schwartz, and J. Makhoul, “An Omnifont Open-Vocabulary OCR System for English and Arabic,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, pp. 495-504, 1999.

[15] Bouslama F, Kishibe H. Fuzzy logic in the recognition of printed Arabic text. IEEE Transactions on 1999: 1150-1154.

[16] Chong-Wah Ngo, Chi-Kwong Chan, ‘Video text detection and segmentation for optical character recognition’, Multimedia Systems 10: 261-272 (2005), Digital Object Identifier (DOI) 10.1007/s00530-004-0157-0.

[17] Christian Wolf and Jean-Michel Jolion, ‘Extraction and Recognition of Artificial Text in Multimedia Documents’, Technical Report RFV RR-2002.01, Laboratoire Reconnaissance de Formes et Vision, INSA de Lyon, Bât. J. Verne, 20 Av. Albert Einstein, 69621 Villeurbanne cedex, France.

[18] Chuan-Jie Lin, Che-Chia Liu, Hsin-Hsi Chen, ‘A Simple Method for Chinese Video OCR and Its Application to Question Answering’, Computational Linguistics and Chinese Language Processing, Vol. 6, No. 2, August 2001, pp. 11-30, The Association for Computational Linguistics and Chinese Language Processing.

[19] Datong Chen, Kim Shearer and Hervé Bourlard, ‘Video OCR for Sport Video Annotation and Retrieval’, Proceedings of the 8th International Conference on Mechatronics and Machine Vision in Practice (M2VIP 2001), Hong Kong, 27-29 August 2001, pp. 57-62.

[20] David A. Forsyth and Jean Ponce, ‘Computer Vision: A Modern Approach’, Pearson Prentice Hall, NJ, USA, 2003

[21] Ejaz Raqum, Usool-o-Qavaid Khush Naveesi, Mir Muhammad Kutub Khana, Arram Bagh, Karachi, Pakistan

[22] R. El-Hajj, L. Likforman-Sulem, and C. Mokbel, “Arabic Handwriting Recognition Using Baseline Dependant Features and Hidden Markov Modeling,” Proc. Int’l Conf. Document Analysis and Recognition, pp. 893-897, 2005

[23] Elaine Rich and Kevin Knight, ‘Artificial Intelligence’, 2nd Edition, McGraw Hill, NY, USA, 1991

[24] Erlandson E, Trenkle J, Vogt R, Word level recognition of multifont Arabic text using a feature vector matching approach, Proceedings of International Society for Optical Engineers, SPIE, 1996; 2660: 63-70.

[25] Faisal Shafait, Adnan-ul-Hasan, Daniel Keysers, and Thomas M. Breuel, Layout Analysis of Urdu Document Images. Proceedings of the 10th IEEE International Multitopic Conference, INMIC ’06, Islamabad, Pakistan, Dec. 2006.

[26] Fan X, Verma B. Segmentation vs. Non-Segmentation Based Neural Techniques for Cursive Word Recognition: An Experimental Analysis. Fourth International Conference on Computational Intelligence and Multimedia Applications (ICCIMA'01), p. 251.

[27] N. Farah, L. Souici, L. Farah, and M. Sellami, “Arabic Words Recognition with Classifiers Combination: An Application to Literal Amounts,” Proc. Artificial Intelligence: Methodology, Systems, and Applications, pp. 420-429, 2004.

[28] H. Goraine, M. Usher, and S. Al-Emami, “Off-Line Arabic Character Recognition,” Computer, vol. 25, pp. 71-74, 1992.

[29] Hadjar, K., Ingold, R. (2003) Arabic Newspaper Segmentation. In: Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), pp. 895-899. IEEE Computer Society.

[30] Hamid, A. and Haraty, R.,“A Neuro-Heuristic Approach for Segmenting Handwritten Arabic Text”, ACS/IEEE International Conference on Computer Systems and Applications, Beirut, Lebanon, 25-06-2001 – 29-06-2001, pp: 110-113.

[31] Inam Shamsher, Zaheer Ahmad, Jehanzeb Khan Orakzai, and Awais Adnan, OCR For Printed Urdu Script Using Feed Forward Neural Network, Proceedings of World Academy of Science, Engineering and Technology volume 23 August 2007 ISSN 1307-6884.

[32] Jean-Marc Odobez and Datong Chen, ‘Robust Video Text Segmentation and Recognition with Multiple Hypotheses’, International Conference on Image Processing, Rochester, New York, USA, Sept. 22-25, 2002, Vol. 2, IEEE, pp. 433-436.

[33] Jelodar MS, Fadaeieslam MJ, Mozayani N, Fazeli M. A Persian OCR System using Morphological Operators. Transactions On Engineering, Computing And Technology, Vol. 4, February 2005.

[34] Jinsik Kim, Taehun Kim, Jiexi Lin, ‘Implementation of a Video Text Detection System’, CS570 Artificial Intelligence, Team Foxtrot, KAIST, Korea Advanced Institute of Science & Technology, South Korea 305-701, CS570-2004.

[35] John E. Hopcroft, Rajiv Motwani and Jeffrey D. Ullman, “Introduction to Automata Theory, Languages, and Computation”, 2nd edition, Addison-Wesley, 2000.

[36] Joseph D. Becker, Arabic Word Processing, Communications of the ACM Volume 30, Issue 7 (July 1987) Pages: 600 – 610, ISSN:0001-0782

[37] Jürgen Schürmann, Norbert Bartneck, Thomas Bayer, Jürgen Franke, Eberhard Mandler, and Matthias Oberländer, “Document Analysis - From Pixels to Contents”, Proceedings of the IEEE, Vol. 80, No. 7, July 1992, pp. 1101-1119.

[38] Kamal Mansour, Guidelines to Use of Arabic Characters, 24th Internationalization & Unicode Conference, Atlanta, GA September 2003


[39] T. Kanungo, G. A. Marton, and O. Bulbul, "OmniPage vs. Sakhr: Paired Model Evaluation of Two Arabic OCR products," Proceedings of SPIE Conference on Document Recognition, San Jose, CA, Vol. 3651, pp. 109-120, 1999.

[40] T. Kanungo, G. E. Marton, and O. Bulbul, “Performance Evaluation of Two Arabic OCR Products”, Proc. of AIPR Workshop on Advances in Computer Assisted Recognition, SPIE Vol. 3584, Washington DC, 1998.

[41] M. S. Khorsheed, “Offline recognition of omnifont Arabic text using the HMM ToolKit (HTK)”, Pattern Recognition Letters 28 (2007) pp. 1563-1571.

[42] Khorsheed MS, Clocksin WF. Spectral features for Arabic word recognition. The IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP’2000, Istanbul, Turkey, June 5-9, 2000, pp. 3574-3577

[43] Khorsheed MS, Clocksin WF. Structural features of cursive Arabic script. Proceedings of the Tenth British Machine Vision Conference, Nottingham, UK, 1999; 2: pp. 422-431.

[44] Kraus E, Dougherty E. Segmentation-free morphological character recognition. Proceedings of the International Society for Optical Engineers, SPIE, 1994; 2181: 14–23

[45] Lee Y, Papineni K, Roukos S, Emam O, Hassan H. Language model based Arabic word segmentation.

[46] S. M. Lodhi and M. A. Matin, “Urdu character recognition using Fourier descriptors for optical networks”, Photonic Devices and Algorithms for Computing VII, Proc. of SPIE Vol. 5907-59070O, (2005).

[47] L. Lorigo and V. Govindaraju, “Segmentation and Pre-Recognition of Arabic Handwriting,” Proc. Int’l Conf. Document Analysis and Recognition, pp. 605-609, 2005.

[48] A. Mahalanobis, B. V. K. Vijaya Kumar, “On the optimality of the MACH filter for detection of targets in noise”, Optical Engineering, Special Issue on Correlation Pattern Recognition, Vol. 36 (10), pp. 2642-2648, October 1997.

[49] A. Mahalanobis, B. V. K. Vijaya Kumar, S. R. F. Sims, and J. F. Epperson. “Unconstrained Correlation Filters,” Applied Optics, Vol. 33, pp. 3751-3759, 1994

[50] Majid M. Altuwaijri and Magdy A. Bayoumi, “Arabic Text Recognition Using Neural Networks”, IEEE International Symposium on Circuits and Systems, ISCAS’94, London, UK, 30 May- 02 June,1994, Vol. 6, pp. 415-418.

[51] Malik, S., Khan, S.A. (2005) Urdu Online Handwriting Recognition. In: IEEE International Conference on Emerging Technologies, pp 27-31. Islamabad, Pakistan


[52] Motawa D, Amin A, Sabourin R, Segmentation of Arabic Cursive Script. Proceedings of the 4th International Conference on Document Analysis and Recognition, 1997: 625 – 628.

[53] Nevenka Dimitrova, Lalitha Agnihotri, Chitra Dorai, Ruud Bolle, ‘MPEG-7 Videotext description scheme for superimposed text in images and video’, Elsevier Signal Processing, 2000.

[54] Obaid AM, Dobrowiecki TP. Heuristic Approach to the Recognition of Printed Arabic Script.

[55] U. Pal and Anirban Sarkar, “Recognition of Printed Urdu Script”, Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), IEEE Computer Society, Edinburgh, Scotland, Aug. 3-6, 2003, Vol. 2, pp. 1183-1187.

[56] B. Parhami and M. Taraghi, “Automatic Recognition of Printed Farsi Texts,” Pattern Recognition, vol. 14, pp. 395-403, 1981.

[57] Parker JR. Algorithms For Image Processing and Computer Vision. John Wiley & Sons, 1997

[58] M. Pechwitz and V. Märgner, “HMM Based Approach for Handwritten Arabic Word Recognition Using the IFN/ENIT-Database,” Proc. Int’l Conf. Document Analysis and Recognition, pp. 890-894, 2003.

[59] Rabiner L, Juang B. Fundamentals of Speech Recognition. Prentice Hall, 1993

[60] Rainer Lienhart, Frank Stuber. Automatic text recognition in digital videos. Technical Report TR-95-036, Department for Mathematics and Computer Science, University of Mannheim, 1995.

[61] Rainer Lienhart and Wolfgang Effelsberg, ‘Automatic Text Segmentation and Text Recognition for Video Indexing’, submitted to ACM/Springer Multimedia Systems Magazine, 9/98.

[62] R. Ramsis, S.S. El-Dabi, and A. Kamel, Arabic Character Recognition System, IBM Kuwait Scientific Centre, report No. KSC027, 1988.

[63] P. Refregier. Optimal trade-off filters for noise robustness, sharpness of the correlation peak and Horner efficiency. Opt. Lett. 16 (11), 829–831, 1991

[64] Reza Safabakhsh and Peyman Abidi, “Nastaaligh Handwritten Word Recognition Using a Continuous-Density Variable-Duration HMM”, The Arabian Journal for Science and Engineering, Volume 30, Number 1B, April 2005, pp. 95-118

[65] Shunji Mori, Hirobumi Nishida, Hiromitsu Yamada, Optical Character Recognition, John Wiley and Sons, New York, USA, 1999.


[66] Sohail Abdul Sattar, Syed Salahuddin Hyder, Mahmood Khan Pathan, “Problems of Nastalique OCR: A comparison of Nastalique and Roman script OCRs”, Proceedings of the ICCCE 06, International Conference on Computer and Communication Engineering, organized by IEEE and the Faculty of Engineering, International Islamic University, Malaysia, May 9-11, 2006, Kuala Lumpur, Vol. 2, pp. 1066-1071.

[67] Tolba M, Shaddad E. On the automatic reading of printed Arabic characters. Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Los Angeles, CA, 1990; 496-498.

[68] A. VanderLugt, “Signal detection by complex spatial filtering,” IEEE Transactions on Information Theory, Vol. 10, pp. 139-145, 1964.

[69] B. V. K. Vijaya Kumar, “Tutorial survey of composite filter designs for optical correlators,” Applied Optics, Vol. 33, pp. 4773-4801, 1992

[70] B. V. K. Vijaya Kumar, Abhijit Mahalanobis and Richard D. Juday, Correlation Pattern Recognition, Cambridge University Press, Cambridge, UK, 2005.

[71] Wing Hang Cheung, Ka Fai Pang, Michael R. Lyu, Kam Wing Ng, and Irwin King. Chinese optical character recognition for information extraction from video images. In Hamid R. Arabnia, editor, Proceedings of The 2000 International Conference on Imaging Science, Systems and Technology (CISST'2000), Volume One, pp. 269-275. CSREA Press, 2000.

[72] Xian-Sheng Hua, Pei Yin, HongJiang Zhang: Efficient video text recognition using multiple frame integration. International Conference on Image Processing (ICIP 2002), Rochester, New York, USA, Sept. 22-25, 2002, Vol. 2, IEEE, pp. 397-400.

[73] Zaheer Ahmad, Jehanzeb Khan Orakzai, Inam Shamsher, and Awais Adnan, Urdu Nastaleeq Optical Character Recognition, Proceedings of World Academy of Science, Engineering and Technology, Volume 26, December 2007, ISSN 1307-6884.

[74] A. Zahour, B. Taconet, and A. Faure, "Machine Recognition of Arabic Cursive Writing", in From Pixels to Features III:Frontiers in Handwriting Recognition, ed. S. Impedovo and J.C. Simon. Amsterdam: Elsevier Science Publishers B.V., 1992, pp. 289-296.

[75] Zheng L, Hassin AH, Tang X. A new algorithm for machine printed Arabic character segmentation. Pattern Recognition Letters 25, 2004; 1723-1729.

[76] H. Zhou and T.-H. Chao. MACH filter synthesising for detecting targets in cluttered environment for gray-scale optical correlator. Proc. SPIE 715, 394–398, 1999

[77] Zidouri A, Sarfraz M, Shahab SA, Jafri SM. Adaptive dissection based subword segmentation of printed Arabic text. IEEE Transactions on 2005: 239-243.

http://en.wikipedia.org/wiki/List_of_languages_by_writing_system

[79] http://en.wikipedia.org/wiki/Magnetic_Ink_Character_Recognition

[80] http://en.wikipedia.org/wiki/Nastaliq

[81] http://en.wikipedia.org/wiki/Optical_character_recognition


List of Publications

Below is a list of publications produced as part of this research work.

[1] Sohail Abdul Sattar, Syed Salahuddin Hyder, Mahmood Khan Pathan, “Problems of Nastaliq OCR: A comparison of Nastaliq and Roman script OCRs”, Proceedings of the ICCCE 06, International Conference on Computer and Communication Engineering, in collaboration with IEEE, May 9-11, 2006, Kuala Lumpur, Malaysia, Vol. 2, pp. 1066-1071.

[2] Sohail A. Sattar, Shamsul Haque and Mahmood K. Pathan, “Nastaliq Optical Character Recognition”. Proceedings of ACM SE 2008 Conference, Auburn, Alabama, USA, March 27-28, 2008, proceedings CD.

[3] Sohail A. Sattar, Shamsul Haque, Mahmood K. Pathan and Quintin Gee, “Implementation Challenges for Nastaliq Character Recognition”. Communications in Computer and Information Science, Vol. 20, ISSN 1865-0929, 2008, Springer-Verlag, Berlin, Germany, pp. 279-285.

[4] Sohail Abdul Sattar, Shamsul Haque and Mahmood Khan Pathan, “Segmentation of Nastaliq Script for OCR”, International Conference on Computing and Informatics (ICOCI-09), Kuala Lumpur, Malaysia, June 24-25, 2009.
