A TECHNIQUE FOR THE DESIGN AND IMPLEMENTATION OF AN OCR FOR PRINTED NASTALIQUE TEXT
By
SOHAIL ABDUL SATTAR
Thesis submitted for the Degree of Doctor of Philosophy
Department of Computer Science and Information Technology
N.E.D. University of Engineering & Technology Karachi, Sindh Pakistan
July, 2009

A TECHNIQUE FOR THE DESIGN AND IMPLEMENTATION OF AN OCR FOR PRINTED NASTALIQUE TEXT
Ph.D. Thesis
By
SOHAIL ABDUL SATTAR
Research supervisor: Dr. Shamsul Haque
Research co-supervisor: Dr. Mahmood Khan Pathan
Department of Computer Science and Information Technology
N.E.D. University of Engineering & Technology Karachi, Sindh Pakistan
July, 2009
Statement of Copyright
This copy of the thesis has been supplied on condition that anyone who consults it is understood to recognize that the copyright rests with the University, and that no quotation from the thesis and no information derived from it may be published without the permission of the University.
© 2009 by NED University of Engineering & Technology
Certificate
Certified that the thesis entitled “A Technique for the Design and Implementation of an OCR for Printed Nastalique Text”, being submitted by Sohail Abdul Sattar for the award of the degree of Doctor of Philosophy in the Department of Computer Science and Information Technology, NED University of Engineering and Technology, Karachi, is a record of the candidate’s own original work carried out by him under my supervision and guidance. The work incorporated in this thesis has not been submitted elsewhere for the award of any other degree.
Dr. Shamsul Haque
Research supervisor Department of Computer Science & Information Technology NED University of Engineering & Technology, Karachi.
Lovingly dedicated to the following departed souls,
my father, Dr. Abdul Ghaffar, my inspiration for achievement,
~
my beloved mother, Zubaida Khatoon, whose prayers stood by me at all times and whose confidence in me willed me to go on,
~
my respected supervisor Dr. Syed Salahuddin Hyder, for introducing me to the challenges in Computer vision and leading me to the path of research.
Acknowledgements
All praise and thanks to Almighty Allah (SWT) Who provides opportunities to gain knowledge and opens up ways to make the most daunting tasks possible and achievable.
My most humble thanks to,
The authorities of the NED University of Engineering and Technology for allowing me to pursue my dream in carrying out this research and for providing me all the facilities throughout the course.
Dr. Shamsul Haque, my supervisor whose continuous guidance and support helped me bring this work to completion.
Dr. Mahmood Khan Pathan, my co-supervisor, for his supervision from proposal preparation to the finalizing of the thesis.
Dr. Mubarak Shah, my supervisor at the Computer Vision Lab, University of Central Florida, for his generous help and guidance in computer vision and for his invaluable ideas on quality paper writing.
Colleagues at the CV lab UCF especially Javed, Arslan and Alexie.
My friends in Orlando whose hospitality and love made me feel at home and cared for.
My daughter Maryam, who had to struggle with Maths alone during my stay in Orlando.
My son Abdullah, who did not forget to brush his teeth, even though I was not there.
Many friends, colleagues and companions whose help and contributions have in many ways made my work easier and enjoyable.
Last but not least, I am highly indebted to my wife, Aysha, for the endless hours she spent proofreading and correcting the language of this document, and for standing by me whenever I was depressed or stressed.
Abstract
This thesis presents a novel segmentation free technique for the design and implementation of an OCR (Optical Character Recognition) system for printed Nastalique text.
The specific area of this thesis is document understanding and recognition, a branch of computer vision, which is in turn a sub-field of Artificial Intelligence.
Optical character recognition is the translation of optically scanned bitmaps of printed or handwritten text into digitally editable data files. OCR systems developed for many world languages are already in efficient use, but none exists for Nastalique, a calligraphic adaptation of the Arabic script, just as Jawi is for Malay. Often a single script with its basic character shapes is adapted for writing multiple languages, e.g. the Roman script for English, German and French, and the Arabic script for Persian, Sindhi, Urdu, Pashtu and Malay.
Urdu has 39 characters against the 28 of Arabic. Each character has two to four different shapes according to its position in the word: isolated, initial, medial and final. Many character shapes have multiple instances and are context sensitive, changing with changes in the antecedent or precedent character. At times even the third or fourth character may cause a similar change, depicting an n-gram model in a Markov chain. Unlike in the Roman script, word and character overlapping in Nastalique makes optical recognition extremely complex.
Compared with the OCR systems available for Roman-script languages, very little research has been done on Arabic Naskh OCR. Only a few Arabic Naskh OCR systems are available today, and they too are far from perfect, lagging behind Roman-script OCR systems in accuracy.
In this respect Nastalique is even more complicated than Naskh, as it has multiple baselines, more overlapping of characters within a ligature and between adjacent ligatures, and vertical stacking of characters within a ligature.
Urdu has still not attracted researchers’ attention for the development of an OCR, partly due to a lack of funding in this area but mainly due to the challenges the Nastalique style poses through its cursiveness and context sensitivity. For the same reason, published research in this area is nearly non-existent.
The proposed system for Nastalique OCR does not require segmentation of a ligature into its constituent character shapes. It does, however, require segmentation at two levels: first the text image is segmented into lines of text, and then each line is further segmented into ligatures or isolated characters. The next step is a line-by-line cross-correlation for recognition of the characters in the ligatures, whereby character codes are written into a text file in the sequence in which the characters are found in the ligature. When the recognition process is complete, the character codes in the text file are given to the rendering engine, which displays the recognized text in a text region.
The limitation of the proposed Nastalique character recognition system is that it is font dependent: recognition requires the same font file that was used to write the text. Future work faces greater challenges, as it must overcome the inherent cursiveness and context sensitivity of the Nastalique style of writing.
For Nastalique OCR, we develop character-based TrueType Font files for a few Nastalique words. These words are written using the same character-based TTF font, and an image is made of the Nastalique text. The image is then given to our Nastalique OCR. After recognition, rendering is done using the same TTF font file to display the recognized text. The work is therefore threefold: development of a character-based Nastalique TrueType Font, Nastalique character recognition, and rendering of the recognized text using the character-based Nastalique TrueType Font.
Since our character-based, segmentation-free Nastalique OCR algorithm needs, as groundwork, a character-based Nastalique text processor, we have also proposed a Finite State Nastalique Text Processor Model. As the model has not yet been implemented, no results are reported for it; however, it could serve as an impetus for future research in this challenging field.
Optical Character Recognition for Roman script languages is almost a solved problem for document images and researchers are now focusing on extraction and recognition of text from video scenes. This new and emerging field in character recognition is called Video OCR and has numerous applications like video annotation, indexing, retrieval, search, digital libraries, and lecture video indexing.
The emerging field for character recognition is attracting research on other scripts like Chinese, but to the best of our knowledge, no work is reported as yet, on Video OCR for Arabic script languages like Arabic, Persian and Urdu.
As an extension of our Nastalique OCR towards Video OCR for Arabic-script languages, we have also performed experiments on video text identification, localization and extraction for recognition. We have used a MACH (Maximum Average Correlation Height) filter to identify text regions in video frames; these text regions are then localized and extracted for recognition. All research and development work was done using Matlab 7.0. Experiments and results are reported in the thesis.
Table of Contents
Statement of Copyright ...... iii
Certificate ...... iv
Acknowledgements ...... vi
Abstract ...... vii
List of Tables ...... xv
List of Figures ...... xvi
CHAPTER 1: Introduction ...... 1
1.1 Computer Vision ...... 2
1.2 Character Recognition ...... 5
1.2.1 Online Character Recognition ...... 6
1.2.2 Offline Character Recognition ...... 6
1.2.3 Magnetic ink Character Recognition (MICR) ...... 7
1.2.4 Optical Character Recognition (OCR) ...... 7
1.3 History of OCR ...... 9
1.4 OCR Processes ...... 10
1.4.1 Scanning ...... 11
1.4.2 Document Image Analysis ...... 11
1.4.3 Pre-processing ...... 11
1.4.4 Segmentation ...... 11
1.4.5 Recognition ...... 12
1.5 World Languages and Scripts ...... 12
1.5.1 Non-Cursive Script ...... 13
1.5.2 Cursive Script ...... 13
1.6 Perso-Arabic Script ...... 14
1.7 Urdu Language ...... 14
1.7.1 History of Urdu Language ...... 15
1.7.2 Nastalique Script ...... 15
1.7.2.1 Nastalique Type Setting ...... 16
1.7.3 Noori Nastalique ...... 17
1.8 Nastalique Text Processor ...... 17
1.8.1 InPage Urdu ...... 18
1.9 The Digital Divide ...... 20
1.10 Importance of Bridging the Digital Divide ...... 22
1.11 Approaches for Arabic Naskh OCR ...... 22
1.11.1 Segmentation based OCR ...... 22
1.11.2 Ligature based OCR ...... 23
1.12 Motivation and Research Objective ...... 23
1.13 Main Contribution of this Research ...... 24
1.13.1 Nastalique OCR (NOCR) Application ...... 25
1.13.2 NOCR Process ...... 25
1.13.3 NOCR Procedure ...... 26
1.14 Additional Contribution of this Research ...... 26
1.14.1 Video OCR ...... 26
1.14.2 Nastalique Text Processor Model ...... 27
1.15 Thesis overview ...... 28
1.16 Conclusion ...... 29
CHAPTER 2: Literature Survey ...... 31
2.1 Introduction ...... 31
2.2 Previous Work on Urdu OCR ...... 33
2.3 Approaches for Arabic script OCR ...... 38
2.3.1 Ligature-based Approach ...... 39
2.3.2 Segmentation-based Approach ...... 39
2.4 Previous Work in Ligature-Based Arabic OCR ...... 39
2.5 Previous Work in Segmentation-Based Arabic OCR ...... 48
CHAPTER 3: Video OCR ...... 59
3.1 Introduction ...... 59
3.2 Types of video text ...... 62
3.3 Applications of Video-OCR ...... 63
3.4 Literature Survey ...... 63
3.5 Correlation Pattern Recognition ...... 84
3.5.1 The MACH Filter ...... 86
3.5.2 OT-MACH Filter ...... 90
3.6 Text Region Detection in Video Frames ...... 91
3.7 Results ...... 92
3.7.1 Video OCR Results ...... 93
3.7.1.1 Training a MACH filter ...... 93
3.7.1.2 Training Images ...... 93
3.7.1.3 Trained MACH filter ...... 94
3.7.1.4 Text area detection and localization ...... 94
3.7.1.5 Video clips with Arabic text ...... 104
3.8 Conclusion and Future work ...... 107
CHAPTER 4: Implementation Challenges for Nastalique OCR ...... 109
4.1 Introduction ...... 109
4.2 Nastalique Character Set ...... 109
4.3 Nastalique Script Characteristics ...... 111
4.4 Computational Analysis of Urdu Alphabet ...... 111
4.4.1 Classes of base shapes in the Urdu alphabet ...... 111
4.4.2 Dots in Urdu Characters ...... 113
4.4.3 Context Sensitive shapes in the Urdu alphabet ...... 115
4.4.4 Comparison of Urdu, Arabic and Farsi Alphabets ...... 116
4.4.5 Bi-directional pen movement (from top left to bottom right) ...... 117
4.4.6 Bi-directional writing (numbers written from left to right) ...... 118
4.4.7 Context Sensitive shapes of the character Quaf “ق” ...... 118
4.5 Nastalique Script for Urdu ...... 119
4.5.1 Character ...... 119
4.5.2 Glyph ...... 120
4.5.3 Ligature ...... 120
4.6 Ligature in Urdu ...... 120
4.7 Word Forming in Urdu ...... 121
4.8 Styles of Urdu Writing ...... 122
4.8.1 Naskh ...... 122
4.8.2 Nastalique ...... 123
4.9 Nastalique Script Complexities ...... 123
4.9.1 Cursiveness ...... 124
4.9.2 Context Sensitivity ...... 125
4.9.3 Dot Positioning ...... 126
4.9.4 Kerning ...... 127
4.9.5 Character Overlapping ...... 128
4.9.5.1 Within a Ligature ...... 129
4.9.5.2 Between two adjacent Ligatures ...... 129
4.10 Sloping and Multiple Base-Lines ...... 130
4.11 A Generic OCR Model ...... 131
4.12 Working of a Roman Script OCR ...... 132
4.13 Working of a Nastalique Script OCR ...... 133
4.14 Approaches for Nastalique OCR ...... 134
4.14.1 Character-based Approach ...... 134
4.14.2 Ligature-based Approach ...... 136
CHAPTER 5: The Proposed Nastalique OCR System ...... 137
5.1 Introduction ...... 137
5.2 The Nastalique OCR Implementation ...... 138
5.3 The Novel Segmentation-free Nastalique OCR Algorithm ...... 138
5.4 Nastalique OCR Algorithm Description ...... 140
5.5 Segmentation of Text Image into Lines ...... 141
5.6 Segmentation of Text Line into Ligatures ...... 142
5.7 Character Recognition by Cross-Correlation ...... 143
5.8 Nastalique Text Segmentation ...... 145
5.9 Segmentation of Text Image into Lines and Ligatures ...... 145
5.10 Recognition Technique ...... 159
5.11 Nastalique OCR Application ...... 161
5.11.1 The Dialogue Boxes ...... 161
5.12 The Recognition Procedure ...... 163
5.13 The Recognition Process ...... 168
CHAPTER 6: Conclusion and Future Work ...... 169
6.1 Introduction ...... 169
6.2 Nastalique Character Shapes ...... 174
6.3 Nastalique Joining Characters Features Set ...... 175
6.4 Proposed Nastalique Text Processor Model ...... 176
6.5 Components of Nastalique Text Processor Model (NTPM) ...... 179
6.5.1 Character Shape Recognizer ...... 179
6.5.2 Next-State Function ...... 180
6.6 Conclusion ...... 181
6.7 Contribution ...... 189
6.8 Future Work ...... 190
References ...... 191
List of Publications ...... 198
List of Tables
Table 4.1: Shapes of Nastalique characters ...... 110
Table 4.2: Base shape classes in Urdu alphabet ...... 112
Table 4.3: Dots in Urdu characters ...... 114
Table 4.4: Context sensitive shapes of ب initial ...... 115
Table 4.5: Urdu Alphabet ...... 116
Table 4.6: Arabic Alphabet ...... 116
Table 4.7: Farsi Alphabet ...... 116
Table 6.1: Transition Table for NTPM ...... 178
List of Figures
Figure 1.1: Sub-fields of Artificial Intelligence ...... 4
Figure 1.2: Classification of Character Recognition ...... 5
Figure 1.3: OCR Processes ...... 10
Figure 1.4: Cursive and non-cursive scripts ...... 13
Figure 3.1: Training images ...... 93
Figure 3.2: Trained MACH filter ...... 94
Figure 3.3: Text region detected and extracted in color-1 ...... 95
Figure 3.4: Extracted text region is binarized-1 ...... 96
Figure 3.5: Text region detected and extracted in color-2 ...... 97
Figure 3.6: Extracted text region is binarized-2 ...... 98
Figure 3.7: Text region detected and extracted in color-3 ...... 99
Figure 3.8: Extracted text region is binarized-3 ...... 100
Figure 3.9: Extracted text region in color-4 ...... 101
Figure 3.10: Extracted text region is binarized-4 ...... 102
Figure 3.11: Artificial text on plane background ...... 103
Figure 3.12: Extracted text region in color-1 ...... 104
Figure 3.13: Extracted text region is binarized-2 ...... 105
Figure 3.14: Extracted text region in color-3 ...... 106
Figure 3.15: Extracted text region is binarized-4 ...... 107
Figure 4.1: Bi-directional pen movement ...... 117
Figure 4.2: Bi-directional writing ...... 118
Figure 4.3: Context sensitive shapes of quaf ق ...... 119
Figure 4.4: Ligature in Urdu ...... 121
Figure 4.5: Word forming in Urdu ...... 121
Figure 4.6: Styles of Urdu writing ...... 122
Figure 4.7: Naskh style of writing ...... 123
Figure 4.8: Nastalique style of writing ...... 123
Figure 4.9: Nastalique cursiveness ...... 124
Figure 4.10: Word forming in Nastalique ...... 125
Figure 4.11: Context Sensitivity; Two different shapes of Bay-initial ...... 126
Figure 4.12 (a): Dots in Urdu characters ...... 126
Figure 4.12 (b): Dots in Urdu characters ...... 126
Figure 4.12 (c): Dots in Urdu characters ...... 126
Figure 4.13: Roman kerning pair ...... 127
Figure 4.14: Nastalique kerning pair ...... 128
Figure 4.15: Character overlapping in Nastalique ...... 128
Figure 4.16: Character overlap within a ligature ...... 129
Figure 4.17: Intra-ligature character overlap ...... 130
Figure 4.18: Nastalique sloping base-line ...... 130
Figure 4.19: Multiple baselines ...... 130
Figure 4.20: Different phases of an OCR ...... 131
Figure 4.21: Roman OCR has three levels of segmentation ...... 132
Figure 4.22: Nastalique OCR has two levels of segmentation ...... 134
Figure 4.23: A segmented word in Nastalique ...... 135
Figure 4.24: Ligature-based approach ...... 136
Figure 5.1: Nastalique OCR Algorithm ...... 139
Figure 5.2: Flowchart for NOCR ...... 140
Figure 5.3: Binarized Nastalique Text Image ...... 146
Figure 5.4: Binarized Nastalique Text Image ...... 147
Figure 5.5: Binarized Nastalique Text Image ...... 147
Figure 5.6: Binarized Nastalique Text Image ...... 148
Figure 5.7: Lines of Text Separated ...... 149
Figure 5.8: Ligatures are separated ...... 150
Figure 5.9 (a): Analysis of text line 1 ...... 152
Figure 5.9 (b): Analysis of text line 2 ...... 152
Figure 5.9 (c): Analysis of text line 3 ...... 153
Figure 5.9 (d): Analysis of text line 4 ...... 153
Figure 5.9 (e): Analysis of text line 5 ...... 154
Figure 5.9 (f): Analysis of text line 6 ...... 154
Figure 5.9 (g): Analysis of text line 7 ...... 155
Figure 5.10: All elements in the text image separated ...... 155
Figure 5.11 (a): Ligature overlap line 1 ...... 156
Figure 5.11 (b): Ligature overlap line 2 ...... 158
Figure 5.11 (c): Ligature overlap line 3 ...... 158
Figure 5.11 (d): Ligature overlap line 4 ...... 159
Figure 5.12 (a): A word in Nastalique ...... 159
Figure 5.12 (b): Separated character shapes in a word ...... 160
Figure 5.12 (c): A word in Nastalique ...... 160
Figure 5.13: Nastalique OCR User Interface ...... 162
Figure 5.14: Input word image selection ...... 162
Figure 5.15: Font selection ...... 163
Figure 5.16: Cross-correlation for recognition ...... 164
Figure 5.17: Recognition process ...... 165
Figure 5.18: Recognition complete ...... 166
Figure 5.19: Multiple words single line ...... 167
Figure 5.20: Multiple words multiple lines ...... 167
Figure 6.1: Measurements of Letters in qat ...... 174
Figure 6.2: Nastalique Text Processor Model ...... 176
Figure 6.3: Transition Diagram for NTPM ...... 178
CHAPTER 1 Introduction
In this thesis we present a novel segmentation-free technique for the design and implementation of an OCR (Optical Character Recognition) system for printed Nastalique text, a calligraphic style of Urdu, which is written in the Arabic script.
Just as a single script with its basic character shapes is adapted for writing multiple languages, e.g. the Latin script for English, German and French, the Arabic script has been adapted for writing Persian, Urdu, Pashtu, Malay and others. Arabic writing has many forms and styles, the most common being Naskh; its calligraphic counterpart is Nastalique. Beautiful and decorative as it is, Nastalique is also highly cursive by nature.
The Urdu alphabet today contains 39 basic characters, compared with the smaller alphabets of Roman-script languages. For this reason the development of an Urdu OCR is considerably more difficult than for Roman-script languages. So far no work has been done on developing an Urdu OCR [55].
Urdu has 39 characters against the 28 of Arabic. Each character has two to four different shapes according to its position in the word: isolated, initial, medial and final. Many character shapes have multiple instances, and the shapes are context sensitive too: a character’s shape changes with changes in the antecedent or precedent character. At times even the third or fourth character may cause a similar change, depicting an n-gram model in a Markov chain [66].
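The positional component of this context sensitivity can be sketched in code. The following Python fragment is purely illustrative and not part of the thesis software: the set of non-joining letters and the sample word are invented stand-ins for their Arabic-script counterparts (letters analogous to alif, dal and ray, which do not join to the character that follows them).

```python
# Invented demo alphabet: the uppercase letters stand in for characters
# that do not join forward to the following character (cf. alif, dal, ray).
NON_JOINERS = set("ADR")

def joins_forward(ch):
    """Does this character connect to the character after it?"""
    return ch not in NON_JOINERS

def positional_forms(word):
    """Assign isolated/initial/medial/final to each character of a word,
    based only on whether its neighbours connect to it."""
    forms = []
    for i, ch in enumerate(word):
        joined_to_prev = i > 0 and joins_forward(word[i - 1])
        joined_to_next = i < len(word) - 1 and joins_forward(ch)
        if joined_to_prev and joined_to_next:
            form = "medial"
        elif joined_to_prev:
            form = "final"
        elif joined_to_next:
            form = "initial"
        else:
            form = "isolated"
        forms.append((ch, form))
    return forms
```

For the invented word "bAt" this yields b-initial, A-final, t-isolated, because "A" refuses to join forward and so forces the following letter back to its isolated form. Real Nastalique shaping goes much further, since each positional slot can itself hold several context-sensitive variants.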
Optical character recognition is the translation of optically scanned bitmaps of printed or written text into digitally editable data files. OCR systems developed for many world languages are already in efficient use, but none exists for Nastalique.
In Nastalique, inter-word and intra-word character overlapping makes optical recognition more complex; optical character recognition of the Latin script is comparatively easier.
1.1 Computer Vision
Human intelligence can be described as the capability to make decisions based on information that is incomplete and noisy. It is this ability that makes human beings the most capable of all living creatures in the world.
Five senses provide the information humans use to make everyday decisions, and of these, hearing and vision are the sharpest. The auditory sense helps us recognize sounds and classify them; it is this sense that tells us the person on the phone is a friend because his voice is recognizable. We can differentiate between an endless variety of sounds, voices and utterances and put them in exactly the slots they belong to: animal sounds, musical notes, wind swishing, the footsteps of a family member, all are within the recognition range of a person with a normal sense of hearing.
The more profound of the two is human vision, which allows us to identify a known person in a crowd of strangers merely by casting a cursory glance, to pick out an object that belongs to us from among others that look exactly like ours, and to recognize a misspelt word in a sentence and unconsciously correct it. The fact is that the human mind is capable of identifying an image based on features determined spontaneously, not predefined or predetermined.
With the development of technology these human processes are imitated to create intelligent machines, hence the immense growth of robotics and intelligent decision-making systems; yet the work done so far is not comparable to any natural, involuntary human action or process. It is not practically possible to imitate all the functions of the human mind and make computer vision as efficient and accurate as the human eye, but even though such a possibility may be remote, efforts are consistently being made to come as close to it as possible.
Artificial Intelligence is a broad field of computer science that draws on a number of other disciplines. One of the most popular definitions of Artificial Intelligence (AI) was given by Elaine Rich and Kevin Knight: “Artificial Intelligence (AI) is the study of how to make computers do things which, at the moment, people do better” [20].
One important branch of Artificial Intelligence is computer vision, shown with a few of its sub-branches in Figure 1.1. It is the area of research that aims to imitate human vision, and it forms the basis of all image acquisition and processing, document understanding and recognition. Computer vision relies on a solid understanding of the physical process of image formation, both to draw simple inferences from individual pixel values, such as the shape of an object, and to recognize objects using geometric information or probabilistic techniques [19].
Document understanding is in turn a vast and difficult area, for the focus of research today lies in content-based searches, which aim to allow machines to look beyond keywords, headings or topics to find a piece of information. A more specialized field within document recognition and understanding is Optical Character Recognition, which attempts to identify a single character from an optically read text image as part of a word that can then be processed further. The area gains rising significance as, each day, more and more information needs to be stored, processed and retrieved rather than being keyed in from an already available printed or handwritten source.
Figure 1.1: Sub-fields of Artificial Intelligence (nested fields: Artificial Intelligence > Computer Vision > Document Understanding & Recognition > Optical Character Recognition)
1.2 Character Recognition
Character recognition is a sub-field of pattern recognition in which images of characters from a text image are recognized and the corresponding character codes are returned; these codes, when rendered, reproduce the text in the image.
The problem of character recognition, the automatic recognition of raster images as letters, digits or other symbols, is like any other problem in computer vision [57].
Character recognition is further classified into two types according to the manner in which input is provided to the recognition engine. As the classification hierarchy in Figure 1.2 shows, the two types of character recognition are:
a. On-line character recognition
b. Off-line character Recognition
Figure 1.2: Classification of Character Recognition
1.2.1 Online Character Recognition
Online character recognition systems deal with character recognition in real time. The process is dynamic, using special sensor-based equipment that captures input from a transducer while text is being written on a pressure-sensitive, electrostatic or electromagnetic digitizing tablet. With the help of a recognition algorithm, the input text is automatically converted into a series of electronic signals that can be stored for further processing in the form of letter codes. The recognition system works from the x and y coordinates generated in temporal sequence by the pen-tip movements as they create recognizable patterns on the digitizer while the text is written.
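As a sketch of the kind of input an online recognizer consumes (our illustration, not code for any particular tablet), each pen stroke can be modelled as a temporal sequence of (x, y, t) samples, from which simple temporal features such as pen-movement direction are derived:

```python
# Illustrative model (not from the thesis) of online-recognition input:
# a pen stroke as a temporal sequence of (x, y, t) samples captured from
# a digitizing tablet. The movement direction between samples is a
# temporal feature that offline OCR never sees.
import math
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Stroke:
    # Each sample is (x, y, t): tablet coordinates plus a timestamp.
    points: List[Tuple[float, float, float]] = field(default_factory=list)

    def add(self, x: float, y: float, t: float) -> None:
        self.points.append((x, y, t))

    def directions(self) -> List[float]:
        """Angles (radians) of pen movement between consecutive samples."""
        return [math.atan2(y1 - y0, x1 - x0)
                for (x0, y0, _), (x1, y1, _) in zip(self.points, self.points[1:])]
```

A recognizer would classify strokes from sequences of such features rather than from a static bitmap.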
1.2.2 Offline Character Recognition
There is a major difference between the input systems of offline and online character recognition, and it influences the design, architecture and methodologies employed in the recognition systems for the two. In online recognition the input data is available as a temporal sequence of real-time text generated on a sensory device, which provides time-sequenced contextual information. In contrast, in an offline system the actual recognition begins after the data has been written down, as it does not require real-time contextual information.
Offline character recognition is further classified into two types according to the input provided to the system for recognition of characters. These are:
i. Magnetic ink Character Recognition (MICR)
ii. Optical Character Recognition (OCR)
1.2.3 Magnetic ink Character Recognition (MICR)
MICR is a technology that relies on recognizing text printed in special fonts with magnetic ink, usually containing iron oxide. As the machine prepares to read the code, the printed characters become magnetized on the paper, with the north pole on the right of each MICR character, creating recognizable waveforms and patterns that are captured and used for further processing. The reading device is comparable to a tape-recorder head recognizing the wave patterns of sound recorded on magnetic tape [79].
The system has long been in efficient use in banks around the world to process checks, as it gives high accuracy rates with relatively low chances of error.
There are special fonts for MICR, the most common fonts being E-13B and CMC-7.
1.2.4 Optical Character Recognition (OCR)
Optical Character Recognition or OCR is the text recognition system that allows hard copies of written or printed text to be rendered into editable, soft copy versions. It is the translation of optically scanned bitmaps of printed or written text into digitally editable data files. An OCR facilitates the conversion of geometric source object into a digitally representable character in ASCII or Unicode scheme of digital character representation [37].
We often want an editable copy of text that exists only as hard copy, such as a fax or pages from a book or magazine. The system employs an optical input device, usually a digital camera or a scanner, which passes the captured images to a recognition system that, through a number of processes, converts them into a soft copy such as an MS Word document.
When we scan a sheet of paper we reformat it from a hard copy to a soft copy, which we save as an image. The image can be handled as a whole, but its text cannot be manipulated separately. To do that, we need the computer to recognize the text as such and let us manipulate it as if it were text in a word-processing document. The OCR application does exactly this: it recognizes the characters and makes the text editable and searchable. The technology also enables such materials to be stored in much less space than the hard copies require. OCR technology has made a huge impact on the way information is stored, shared and communicated.
OCRs are of two types:
i. OCRs for recognizing printed characters
ii. OCRs for recognizing hand-written text.
OCRs meant for printed text are generally more accurate and reliable, because the characters belong to standard font files and it is relatively easy to match character images against those in an existing library. In handwriting recognition, by contrast, the vast variety of human writing styles and habits makes the recognition task more challenging.
Today, OCRs for printed text in the Latin script are everyday office tools, while OCR for handwriting is still under research and development to improve its accuracy.
Optical Character Recognition (OCR) is one of the most common and useful applications of machine vision, a sub-class of artificial intelligence. It has long been a topic of research and has recently gained even more popularity with the development of prototype digital libraries, which involve the electronic rendering of paper- or film-based documents through an imaging process.
1.3 History of OCR
The history of OCR dates back to the early 1950s and the invention of Gismo, a machine that could translate printed messages into machine codes for computer processing. The product was a combined effort of David Shepard, a cryptanalyst at the Armed Forces Security Agency (AFSA), and Harvey Cook. This achievement was followed by the construction of the world’s first OCR system, also by David Shepard, under his Intelligent Machines Research Corporation. Shepard’s customers included The Readers’ Digest, Standard Oil Company of California (for making credit card imprints), Ohio Bell Telephone Company (for a bill stub reader) and the U.S. Air Force (for reading and transmitting teletype messages) [81].
Another success came in the area of postal number recognition, which, although crude at its inception, became more refined with advances in technology.
By 1965 Jacob Rabinow, an American inventor, had patented an OCR for sorting mail, which was then used by the U.S. Postal Service [81].
OCR has since become an interesting field of research, posing numerous complexities and offering unique possibilities in the areas of Artificial Intelligence and Computer Vision. As advances were made and new challenges undertaken, it became increasingly clear that the scope of such an undertaking, though attractive, would be daunting.
1.4 OCR Processes
The OCR process begins with the scanning and subsequent digital reproduction of the text in the image. It involves the following discrete sub-processes, as shown in figure 1.3.
[Figure: Document (scanner) → Bitmap Image → Image Analysis (slant/skew detection and correction) → Pre-Processing (noise removal, blur removal, thinning, skeletonization, edge detection) → Segmentation (segmentation of lines and words) → Recognition (recognition algorithm) → Display]
Figure 1.3: OCR Processes
1.4.1 Scanning
A flat-bed scanner, usually operated at 300 dpi, converts the printed material on the scanned page into a bitmap image.
1.4.2 Document Image Analysis
The bitmap image of the text is analyzed for the presence of skew or slant, and any found is removed.
Quite a lot of printed literature combines text with tables, graphs and other forms of illustration. It is therefore important that the text area is identified separately from the other images so it can be localized and extracted.
1.4.3 Pre-processing
In this phase several processes are applied to the text image, such as noise and blur removal, binarization, thinning, skeletonization, edge detection and some morphological operations, so as to obtain an OCR-ready image of the text region that is free from noise and blur.
1.4.4 Segmentation
If the whole image consists of text only, the image is first segmented into separate lines of text. These lines are then segmented into words and finally words into individual letters.
Once the individual letters are identified, localized and segmented out of a text image, it becomes a matter of choosing a recognition algorithm to get the text in the image into a text processor.
1.4.5 Recognition
This is the most vital phase, in which a recognition algorithm is applied to the character-level segments of the text image.
As a result of recognition, the character code corresponding to each character image is returned by the system and passed to a word processor, where the text is displayed on screen and can be edited, modified and saved in a new file format.
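The sub-processes above can be pictured as a simple pipeline. The following Python fragment is purely an illustrative sketch, not the thesis implementation (which used Matlab 7 and Visual C++); each stage is reduced to a toy operation on a bitmap held as a list of 0/1 pixel rows, and all function names are this sketch's own:

```python
# Illustrative OCR pipeline sketch with hypothetical stage functions.
# A "gray image" is a list of rows of 0..255 values; a bitmap is 0/1.

def binarize(gray_rows, threshold=128):
    """Pre-processing: turn a grayscale image into a 0/1 bitmap."""
    return [[1 if px < threshold else 0 for px in row] for row in gray_rows]

def segment_lines(bitmap):
    """Segmentation: split on empty rows (rows containing no ink)."""
    lines, current = [], []
    for row in bitmap:
        if any(row):
            current.append(row)
        elif current:
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines

def recognize(line, templates):
    """Recognition: trivially match a whole line against stored templates."""
    for code, shape in templates.items():
        if shape == line:
            return code
    return "?"

def ocr(gray_rows, templates):
    """Scan → binarize → segment → recognize, one code per text line."""
    bitmap = binarize(gray_rows)
    return [recognize(line, templates) for line in segment_lines(bitmap)]
```

A real system would of course add skew correction, noise removal and a far more robust matcher between these stages; the sketch only fixes the order of the phases.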
1.5 World Languages and Scripts
Communication around the world takes place in several hundred languages today. There is a great variety of ways in which these languages are written down, but more than 90 languages use the Latin script to scribe their words, English being one of them. Several other scripts also serve as the means of writing a number of languages. The Arabic script stands second to Latin, having been adopted by more than 25 different languages to form their alphabets [78].
The way a script is written down and the patterns it follows allow us to divide scripts into two separate categories, as figure 1.4 shows.
i. Non-cursive scripts.
ii. Cursive scripts.
1.5.1 Non-Cursive Script
The non-cursive scripts, which are the more common, are inherently discrete as far as printed text is concerned. This means that each character has a separate and definite shape that combines with the next simply by being placed beside it, never overlapping or overshadowing the preceding or succeeding letters. However, when these scripts are handwritten, the scribe's hand can make the letters decorative, cursive and more flowing in form and shape. The Latin script is an example of such a script: handwritten text can be as decorative and cursive as the writer chooses, while printed text is easily recognizable.
[Figure: samples of discrete characters, hand-printed characters, and cursive handwriting (the Urdu sentence ﻧﺳﺗﻌﻠﯾق اﯾک ﭘﯾﭼﯾده رﺳم اﻟﺧط ﮨﮯ).]
Figure 1.4: Cursive and non-cursive scripts
1.5.2 Cursive Script
On the other hand, the naturally cursive scripts, e.g. Arabic, have a unique feature in the formation of words. The characters in these scripts are not discrete but are joined to each other to form ligatures and then words. These free-flowing character forms create words by overlapping each other, sometimes even stacking on one another vertically. The non-discrete nature of these scripts makes them more difficult to develop as font types for printing, and it also poses challenges for character recognition. Creating discrete characters for text processing of the cursive scripts and placing them side by side to form words, as in discrete scripts, for convenience in character recognition, as suggested by Abuhaiba [2], mars the original shape of words, giving them an artificial and unnatural look.
1.6 Perso-Arabic Script
The Arabic alphabet has been adapted by several other languages because of resemblances in sounds and phonetic systems. The script was originally an exclusive writing system for the Arabic language; it was later adapted and modified to accommodate the demands of writing Persian. Because the Persian language has a greater variety of sounds than Arabic, four more letters were added to the adapted script [38]. This new version of the Arabic writing system is known as the Perso-Arabic script and is considered a wide-ranging writing system, used not just for the Persian alphabet but also for Urdu, Sindhi, Kurdish, Sorani, Balochi, Punjabi Shahmukhi, Azeri, Tajik-Persian and several others, each adding characters to the basic Arabic alphabet for the sounds the language has beyond those available in Arabic.
To write Urdu, eleven more characters are added to the Arabic alphabet, giving a total of thirty-nine characters in the Urdu alphabet.
1.7 Urdu Language
The Urdu language derives many of its features, including its script, from Persian and Arabic. The Urdu alphabet is therefore written right to left, as are Persian and Arabic. There are, however, many small differences in the written styles of the two: while Arabic follows the less flowing Naskh style, Urdu is written in the more cursive Nastalique style. Although written Urdu closely resembles Arabic, spoken Urdu resembles Hindi, even though Hindi is written in a totally different script, Devanagari, which derives from the Sanskrit script.
1.7.1 History of Urdu Language
The Urdu language has a unique history: it developed as a common means of expression during the Mughal dynasty in South Asia, under Arab, Persian and Turkish influences in the region. Its written script adapted a modification of the Persian alphabet, following the same phonetic and pronunciation system. Owing to the flowing, cursive quality of the script, most books, newspapers and other published material continued to use handwritten copies of the text created by master writers, or kaatibs, also known as khushnavees. This went on till the 1980s.
In the early 1980s, however, the daily Jang, a Pakistani newspaper, transferred almost all of the typesetting work of its newspaper to the computer. Today almost all Urdu publishing tasks are carried out through a variety of software available in the market that supports Urdu text editing and provides a number of other font styles for writing.
1.7.2 Nastalique Script
In 14th-century Iran, Islamic calligraphy reached its zenith, its main genre being the Nastalique writing style. Although it was originally a script meant for composing Arabic, over a period of time it became more popular for writing Persian, Turkic and South Asian languages like Urdu, Pashto and Balochi. This fluid, more cursive amalgam of Arabic Naskh and Taaliq has been used extensively as an art form for calligraphy in Iran and Afghanistan.
The art of Arabic calligraphy originated and flourished in fourteenth-century Iran, spreading later to many other Muslim countries. The art form gradually gained significance, giving birth to such eminent calligraphists as Mir Emad, whose work is considered among the most elegant in its splendor. Many of the most beautiful transcriptions of Quranic verses have been rendered in this script by various calligraphists.
The modern-day adaptation of Nastalique is, however, more specific in terms of the rules of character proportion laid down by Mirza Reza Kalhouri, who created a version of Nastalique that could easily be used for machine printing. This advancement allowed wider and easier use of the script with a formalized set of characters [80].
1.7.2.1 Nastalique Type Setting
Handwritten Nastalique, written from right to left, was indeed a creative art form most suited to calligraphic rendering, and attempts to produce high-quality printed texts remained challenging undertakings.
Although many attempts were made to develop Nastalique typography, most of them failed, largely because it was difficult to create the vast set of individually designed character combinations, or ligatures, that form the basis of the Nastalique writing system. An early version of Nastalique type was developed at Fort William College, but it did not gain popularity and remained limited to publishing the college's own literature.
A Nastalique typewriter was developed in the state of Hyderabad Deccan but could never be put to any functional use, and further attempts to make a Nastalique typeface were abandoned, the task being considered impossible.
1.7.3 Noori Nastalique
The old system of typesetting Urdu joined single characters together to form ligatures and words. This meant compromising the natural, elegant, flowing style of the script and developing a unique set of characters that would more conveniently join to form words. Ahmed Mirza Jamil replaced this old system by developing ligatures for a new typeset for Urdu, which restored the cursive, calligraphic quality of Nastalique. The new typeface contained a set of almost 18,000 ligatures, on the basis of which the new version of the script, named Noori Nastalique, was developed. Agfa Monotype Corporation holds the copyright for the Noori Nastalique digital font data.
1.8 Nastalique Text Processor
We have few Urdu text processors, and none comes anywhere close to Latin-script text processors. The only one with true calligraphic Nastalique support is InPage Urdu, which is why it is the most popular and commonly used Nastalique text processor; it also supports other Urdu fonts. However, InPage Urdu does not produce character-based Nastalique text; instead it uses a large collection of ligatures to write text in Nastalique.
A character-based Nastalique text processor is the need of the time: one that could produce true Nastalique calligraphic text using a character-based Nastalique font instead of writing it with a large collection of ligatures.
1.8.1 InPage Urdu
InPage Urdu is a widely used software package for making page layouts in Urdu using the Nastalique style of the Arabic script.
InPage works under Windows and, alongside Urdu, caters to writing in Pashto, Persian and Arabic – all languages that use the same script for writing.
This publishing tool is popular in Pakistan, particularly with newspaper companies, which used to employ large numbers of calligraphers to make corrections to Urdu text created in the Monotype font. InPage Urdu was in fact developed for Pakistan's newspaper industry in 1994 through the collaborative effort of a UK-based company, Multilingual Solutions Windows on World Limited, led by Kamran Rouhi, and an Indian software development team, Concept Software Private Limited, led by Ravindra Singh and Vijay Gupta. The newly developed software was licensed with the Noori Nastalique typeface, improved and augmented from the original Monotype font, which is now available as the main Urdu font along with forty or so other non-Nastalique fonts.
InPage offers a truly authentic style of the Nastalique script, with a wide set of ligatures (around 18,000 ligatures in 85 font files). It also keeps the character display in WYSIWYG format, making the on-screen and printed result comparatively more attractive than any other software available in the market. An added quality is that almost all operations and features of desktop publishing packages available for English are similarly available in InPage, making desktop publishing in Urdu, Arabic, Pashto or Persian as convenient as it is in English.
These features have largely overcome the two problems that Noori Nastalique suffered from in the 1990s. The Noori style of Nastalique was a digital typeface developed in 1981 through the collaboration of Ahmed Mirza Jamil (as calligrapher) and Monotype Imaging (formerly Monotype Corp.).
These two problems were:
i. Standard platforms such as Windows or Mac did not have a built-in environment to support writing this script.
ii. Text could not be entered in a WYSIWYG (What You See Is What You Get) manner; data had to be entered by an operator relying on an understanding of Monotype's proprietary page description language.
From an OCR point of view, Nastalique is more complicated than Naskh: it has multiple baselines, more overlapping of characters within a ligature and between adjacent ligatures, vertical stacking of characters in a ligature, and so on. Compared with OCRs for Latin-script languages, very little research has been done on Arabic Naskh OCR. Only a few Arabic Naskh OCR systems are available today, and they are far from perfect, lagging well behind Latin-script OCR systems in accuracy.
Urdu has still not attracted researchers' attention for OCR research, partly due to a lack of funds in this area and partly due to the challenges the Nastalique style offers for its optical recognition. Not only is there no Nastalique OCR system available today, but published research in this area is nearly non-existent.
1.9 The Digital Divide
Today a phenomenal amount of communication takes place through the internet, and standardization of the Latin, non-cursive script has paved the way for efficient communication, sharing of knowledge and information, financial transactions and business correspondence between users of this script. The countries concerned justly belong to the much acclaimed "global village", which brings their societies, cultures and technologies close to one another and merges their social boundaries into one whole global community. They have access to one another's research and share their achievements and accomplishments in harmony, benefiting from one another and thereby building on each other's strengths in all spheres of life.
But this global village is a distinct society that has no recognition of the existence of one
important culture that is as rich and varied as theirs – perhaps even more so in the context of
history, traditional values, languages and learning.
What happens if a language's writing script is not recognized in the medium of communication used elsewhere in the world? The answer is simple: a gradual but steady decline in the global understanding and appreciation of a script reflective of a rich cultural background and heritage, and very soon a priceless aspect of an important language vanishes from the scene.
What we essentially need today is a complete writing system supported by the most sophisticated browsing facilities, web script as well as text recognition system in Nastalique.
In this regard, it would be right to mention that considerable work has been done and is being assembled for the Arabic (Naskh) script, yet it is still not well supported by search engines for keyword search and information retrieval, whereas access through the Roman script is a matter of seconds. Users of the Nastalique script can grope for ages trying to find material in their required script.
Research facilities on the net are possible only through the Roman script, for which a massive amount of data, from literature to science, is available because it could be scanned, recognized and stored in cyberspace as searchable and retrievable text.
However, no substantial amount of work has been done for a cursive script like Nastalique. If we need to search modern-day theses, articles, research papers or current news, the data has, firstly, not been digitally stored, and secondly, it is not convenient to search on the net for research purposes, because the computer does not recognize the Nastalique script as it does the Roman. This situation drastically affects the survival of a language in the modern world.
1.10 Importance of Bridging the Digital Divide
This establishes that if all the world's cultures are to survive and not become isolated or extinct, and if knowledge and wisdom are to flow freely between nations, a strong and willful effort will have to be made to preserve the Nastalique script for writing Urdu in the digital context, as it is the most widely used style of writing Urdu, one of the world's major languages.
1.11 Approaches for Arabic Naskh OCR
Approaches to the optical character recognition of cursive scripts like Arabic Naskh can be broadly divided into two categories:
i. Segmentation based
ii. Ligature based
1.11.1 Segmentation based OCR
Segmentation-based approaches break a ligature into component character shapes before they are presented to the recognition phase of the OCR system. To the author's knowledge, no perfect segmentation scheme has yet been presented that can divide a ligature into its component character shapes precisely and accurately. Characters accurately segmented from a ligature will certainly have a higher recognition probability than poorly segmented character shapes.
One of the main reasons for the low recognition rates of cursive-script OCR systems based on the segmentation approach is the imperfect segmentation of ligatures into their component character shapes.
1.11.2 Ligature based OCR
Ligature-based approaches for cursive-script OCR systems do not break a ligature into component character shapes; they treat whole ligatures as the primitive elements of writing and train a learning machine, such as a Neural Network (NN), Support Vector Machine (SVM) or Hidden Markov Model (HMM), to recognize these ligatures when presented to it. A ligature-based cursive-script OCR system needs all the ligatures used to write the script's languages. Urdu, for example, is written in the Nastalique style with around 18,000 ligatures, so all these ligature shapes would be required to train a learning machine such as an NN for optical recognition: a large data set that poses a serious constraint.
Our proposed system for Nastalique character recognition does not require segmentation of a ligature into constituent character shapes; rather, it requires only two levels of segmentation: first, the text image is segmented into lines of text, then each line is further segmented into ligatures and isolated characters.
1.12 Motivation and Research Objective
OCRs for many of the world's major languages have been developed and are in use, but at present no OCR for Nastalique exists. With a Nastalique OCR we would be able to convert our whole wealth of Nastalique literature into digital form and make it available on the World Wide Web.
The objective of this research is to design and implement an OCR for printed Nastalique text, which is not only a national need but could also provide a means to bridge the digital divide in the countries of the region, where the Nastalique-reading population is of considerable size.
1.13 Main Contribution of this Research
Nastalique is inherently cursive in nature; there is much character as well as word overlapping, which makes segmentation of ligatures into constituent character shapes nearly impossible. To our knowledge, nobody has yet succeeded in perfectly segmenting Arabic-script text into constituent character shapes with recognition results comparable to Roman-script OCR.
In this research, we have proposed a novel segmentation-free technique for the design and implementation of a Nastalique OCR based on correlation pattern recognition; we use cross-correlation in the recognition phase of our Nastalique OCR.
Our proposed system for Nastalique character recognition does not require segmentation of a ligature into constituent character shapes; rather, it requires only two levels of segmentation: first, the text image is segmented into lines of text, then each line is further segmented into ligatures and isolated characters.
It then uses cross-correlation to recognize characters in the ligatures line by line, writing their character codes into a text file in sequence as each character is found in a ligature, along with the x-position of the start of the character shape. When the recognition process is complete, the character codes in the text file are sorted by x-position, and the sorted sequence of character codes is given to the rendering engine, which displays the recognized text in a text region.
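This write-then-sort step can be illustrated with a toy sketch. The sketch reduces cross-correlation to an exact-match ink score between a small binary template and a window of a one-line image; the templates, codes and scoring rule are invented for illustration, whereas the actual NOCR correlates glyph images taken from the font file:

```python
def correlate_hits(line, template, code, threshold=1.0):
    """Slide `template` across the line image; record (x, code) wherever
    the crude score (overlapping ink / template ink) reaches threshold.
    A score of 1.0 means every ink pixel of the template is covered."""
    h, w = len(template), len(template[0])
    ink = sum(map(sum, template)) or 1
    hits = []
    for x in range(len(line[0]) - w + 1):
        score = sum(line[r][x + c] * template[r][c]
                    for r in range(h) for c in range(w)) / ink
        if score >= threshold:
            hits.append((x, code))
    return hits

def recognize_line(line, templates):
    """Find every template in the line, then sort hits by x-position,
    mirroring NOCR's write-then-sort of character codes."""
    hits = []
    for code, tpl in templates.items():
        hits.extend(correlate_hits(line, tpl, code))
    return "".join(code for x, code in sorted(hits))
```

Because hits are collected template by template, they arrive out of reading order; the final sort by x-position is what restores the visual sequence of the line, exactly as described above.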
The limitation of our proposed Nastalique character recognition system is that it is font dependent: recognition requires the same font file that was used to write the text whose image is given to the Nastalique OCR.
1.13.1 Nastalique OCR (NOCR) Application
In this research we have proposed a segmentation-free algorithm for the implementation of an OCR for printed Nastalique text. Most of the experimentation and rapid prototyping was done using Matlab 7, while the main application was developed in Microsoft Visual C++ 6.0.
1.13.2 NOCR Process
Our proposed system for Nastalique character recognition segments the text image into lines of text; each line is then further segmented into ligatures and isolated characters.
It then uses cross-correlation to recognize characters in the ligatures line by line, writing their character codes into a text file in sequence as each character is found in a ligature. When the recognition process is complete, the character codes in the text file are given to the rendering engine, which displays the recognized text in a text region.
1.13.3 NOCR Procedure
i. We use FontLab software to make a character-based True Type font.
ii. Using our TTF font file, we write a few words in Nastalique.
iii. We make an image of the written text file.
iv. This image is given to the Nastalique OCR for recognition.
v. The recognized text is displayed in a text region.
1.14 Additional Contribution of this Research
In addition to the main research topic of Nastalique OCR, we have also done the following.
i. Video OCR
ii. Nastalique Text Processor Model
1.14.1 Video OCR
Optical character recognition for Roman-script languages is almost a solved problem for document images, and researchers are now focusing on the extraction and recognition of text from video scenes. This new and emerging field in character recognition is called Video OCR and has numerous applications, such as video indexing and video data retrieval. The emerging field of character recognition in video frames is attracting research on eastern scripts like Chinese, but to the best of our knowledge, no work has been reported as yet on Video OCR for Arabic-script languages like Arabic, Persian and Urdu.
As an extension of our Nastalique OCR to Video OCR for Arabic-script languages, we have also performed experiments on video text identification, localization and extraction for recognition. We have used a MACH (Maximum Average Correlation Height) filter to identify text regions in video frames; these text regions are then localized and extracted for recognition. Experiments and results are reported in the thesis.
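The filtering step can be caricatured as follows. A real MACH filter is synthesized in the frequency domain to maximize the average correlation height over the training images while suppressing distortion and noise terms; the sketch below substitutes a plain average-of-training-patches filter and spatial-domain correlation, so it conveys only the flavor of correlation-based region detection, not the MACH formulation itself:

```python
def average_filter(patches):
    """Crude stand-in for a MACH filter: the element-wise average of the
    training text patches (each a list of rows of pixel values)."""
    n, h, w = len(patches), len(patches[0]), len(patches[0][0])
    return [[sum(p[r][c] for p in patches) / n for c in range(w)]
            for r in range(h)]

def detect_regions(frame, filt, threshold):
    """Correlate the filter with every window of the frame; return the
    top-left (x, y) corners whose correlation score clears threshold."""
    fh, fw = len(filt), len(filt[0])
    found = []
    for y in range(len(frame) - fh + 1):
        for x in range(len(frame[0]) - fw + 1):
            score = sum(frame[y + r][x + c] * filt[r][c]
                        for r in range(fh) for c in range(fw))
            if score >= threshold:
                found.append((x, y))
    return found
```

Windows whose correlation peak clears the threshold are the candidate text regions, which would then be localized and cropped for recognition.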
1.14.2 Nastalique Text Processor Model
For the Nastalique OCR, we develop character-based True Type Font files for a few Nastalique words. These words are written using the same character-based TTF font, and an image is made of the Nastalique text. The image is then given to our Nastalique OCR. After recognition, rendering is done using the same TTF font file to display the recognized text. The work is therefore threefold: development of a character-based Nastalique True Type Font, Nastalique character recognition, and rendering of the recognized text using the character-based Nastalique True Type Font.
Since our character-based, segmentation-free Nastalique OCR algorithm needs, as groundwork, a character-based Nastalique text processor, we have also proposed a Finite State Nastalique Text Processor Model. Implementation is not yet done, so results are not reported; however, this model could serve as a basis for future research in this challenging field.
1.15 Thesis overview
The rest of the thesis is organized as follows:
Chapter 2 Literature survey
This chapter should have covered the research work done on Nastalique OCR or Urdu OCR; unfortunately, published research on these topics is almost non-existent, and whatever is available is included here with the comments of the author of this thesis.
On the other hand, a substantial amount of background research on Arabic Naskh OCR is included, as the script and writing rules are the same for Arabic and Urdu while the styles differ: Arabic uses Naskh, while Urdu uses the Nastalique style of writing, which poses more challenges for character recognition than Arabic Naskh.
Chapter 3 Video OCR
The new and emerging field of character recognition in video frames is called Video OCR and has numerous applications, such as video indexing and video data retrieval. As an extension of our Nastalique OCR to Video OCR for Arabic-script languages, we have performed experiments on video text identification, localization and extraction for recognition. We have used a MACH (Maximum Average Correlation Height) filter to identify text regions in video frames; these text regions are then localized and extracted for recognition. Experiments and results are reported and discussed in this chapter.
Chapter 4 Implementation Challenges for Nastalique OCR
No Nastalique OCR exists so far, and published research on Nastalique OCR, Urdu OCR, or even any area of Urdu computing is almost non-existent, the reason being the challenges that the Nastalique style poses for its optical recognition.
The complexities of the Nastalique style in particular, and of the Arabic script in general, along with the challenges they pose for character recognition, are discussed here.
Chapter 5 The Proposed Nastalique OCR System
In this research we have developed a novel character-based, segmentation-free algorithm for the recognition of printed Nastalique text, which we call the NOCR algorithm. The NOCR algorithm, its implementation and its results are presented and discussed in this chapter.
Chapter 6 Conclusion and Future Work
Since our character-based, segmentation-free Nastalique OCR algorithm needs, as groundwork, a character-based Nastalique text processor, we have also proposed a Finite State Nastalique Text Processor Model, which is presented and discussed in this chapter. Implementation is not done and is planned as future work.
All the research work done on this project is summarized in this chapter and the directions for future research work on this and related topics are discussed.
1.16 Conclusion
Our effort in this more challenging and less rewarding research area will set a milestone for new endeavors and shall provide the ground work needed to embark upon the future research projects in this area.
We have tested our Nastalique OCR algorithm on a subset of Urdu words, because our work in this research also involves the development of a True Type Font for writing the text that is then rendered for recognition.
This research opens multiple directions for future research, e.g. (1) character-based Nastalique True Type Font development, (2) a character-based Nastalique text processor, and (3) enhancement and extension of the Nastalique OCR system.
CHAPTER 2 Literature Survey
2.1 Introduction
While a phenomenal amount of research in IT has made Roman-script languages extremely adaptable in all areas of computing, insubstantial work on Urdu in this area accounts for very little computerization of this script.
Published research is close to non-existent, and the complexities associated with Nastalique OCR development make it a highly challenging undertaking. Many reasons can be held responsible for this lack of interest in implementing Urdu in computing; perhaps a serious shortage of funds is a major one. Another reason for the lack of a complete Urdu OCR system is the limited support for the Urdu language in computing.
Urdu uses an extended and adapted Arabic script; it has 39 characters while Arabic has 28. Each character has 2-4 different shapes depending upon its position in the word: initial, medial or final. When a character shape is written alone, it is called an isolated character shape. Each of these initial, medial and final character shapes can have multiple instances, the character shape changing with the preceding or succeeding character. This characteristic is called context sensitivity. The Urdu alphabet contains a large number of character shapes compared with Roman-script languages, which have fewer characters in their alphabets. For this reason the development of an Urdu OCR is considerably more difficult than for Roman-script languages. So far no work has been done with regard to developing an Urdu OCR [55].
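Context sensitivity can be pictured as a lookup from a (letter, position) pair to a glyph shape. The miniature table below is hypothetical (invented glyph names, a single fully-joining letter) and does not reflect the actual Urdu shaping rules, which involve many more contextual variants per position:

```python
# Hypothetical miniature shaping table: maps (letter, position) to a
# glyph name. Real Urdu fonts hold many contextual variants per slot.
SHAPES = {
    ("beh", "isolated"): "beh.isol",
    ("beh", "initial"):  "beh.init",
    ("beh", "medial"):   "beh.medi",
    ("beh", "final"):    "beh.fina",
}

def shape_word(letters):
    """Pick a positional form for each letter of a fully-joining word:
    a lone letter is isolated; otherwise first=initial, last=final,
    everything in between medial."""
    glyphs = []
    for i, letter in enumerate(letters):
        if len(letters) == 1:
            pos = "isolated"
        elif i == 0:
            pos = "initial"
        elif i == len(letters) - 1:
            pos = "final"
        else:
            pos = "medial"
        glyphs.append(SHAPES[(letter, pos)])
    return glyphs
```

Even this toy table shows why the shape inventory grows so quickly: every letter multiplies into positional forms before any contextual variants are counted.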
A complete language script comprises an alphabet and style of writing. Urdu with its
extended Arabic script for writing has two main styles, Naskh and Nastalique. Nastalique is a
calligraphic and more stylistic form and is widely used for writing Urdu.
Urdu writing is inherently cursive in nature: neighboring characters in a word are combined, under certain restrictions, to form a compound character or ligature. This makes Urdu text processing very difficult compared with text processing of Latin-script languages, which follow character-based writing styles in which each character in a word retains its shape. Urdu follows a complex style of writing because it has a few characters that can neither start a ligature nor occupy the middle position of one; they can appear either in their isolated form or at the end of a ligature. When such a character appears in the middle of a word, the word is broken into more than one ligature. This splitting of words into multiple ligatures makes most Urdu ligatures considerably long.
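The ligature-splitting rule just described can be sketched as a small function. The set of non-joining letters here is illustrative only (in the real script, letters such as alif, dal, re and wao do not join to a following letter), and the letter names are placeholders rather than a complete Urdu inventory:

```python
# Illustrative sketch: letters in NON_JOINING cannot connect to the
# following letter, so the word breaks into a new ligature after them.
NON_JOINING = {"alif", "dal", "re", "wao"}

def split_into_ligatures(word):
    """Split a word (a list of letter names) into ligatures: a new
    ligature starts after every non-joining letter."""
    ligatures, current = [], []
    for letter in word:
        current.append(letter)
        if letter in NON_JOINING:
            ligatures.append(current)
            current = []
    if current:
        ligatures.append(current)
    return ligatures
```

In this model a non-joining letter always ends its ligature, which is why such letters can appear only in isolated or final ligature positions, exactly as stated above.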
The problem would have been easier to solve if all characters in Urdu had different shapes.
Unfortunately, Urdu characters can be grouped in multiple classes with each class containing
anywhere from 2 to 5 characters. All characters in a class use the same base shape and
individual characters in a certain class are distinguished from each other by the number and
position of dots (diacritic marks). In Urdu, these dots either appear above or below the base
shape. A little less than half of Urdu characters (17 out of 39) belong to such classes and one,
two, or three dots are used to differentiate between various characters.
Urdu uses the Arabic script for writing, the most prevalent style being Nastalique. Published research on Urdu text recognition is almost non-existent; however, considerable research has been done on Arabic text recognition, which uses the Naskh style of writing.
The Arabic language is considered a difficult one to process, with a much richer alphabet than the Latin one: the form of a letter is a function of its position in the word (isolated, initial, medial or final), each shape has multiple instances, and words are written from right to left (RTL) [29].
2.2 Previous Work on Urdu OCR
The research study conducted by U. Pal and Anirban Sarkar [55] at the Indian Statistical Institute states a number of difficulties in the development of an OCR for Urdu, the most important being the large number of characters in the Urdu alphabet and the similarity in the shapes and forms of many of them. Their system proposes the development of an OCR built on three embedded processes:
i- Skew detection and correction
ii- Line segmentation
iii- Character segmentation
The technique used for skew detection and correction is based on the Hough transform, which picks selected components and computes results on selected candidate points.
For segmentation of lines of text from a document, the system calculates the number of black pixels in each row and refers to the valleys of the resulting projection profile. These valleys or ‘troughs’ represent the boundaries between two adjacent lines of text, which are separated accordingly.
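The valley-based line segmentation described above can be sketched in a few lines; this is a minimal illustration assuming a binarized page image with 1 for ink, not the authors' exact implementation:

```python
import numpy as np

def segment_lines(binary):
    """Split a binary page image (1 = ink) into horizontal text-line bands
    by locating zero-valleys in the row-wise projection profile."""
    profile = binary.sum(axis=1)          # black-pixel count per row
    ink_rows = profile > 0
    lines, start = [], None
    for y, has_ink in enumerate(ink_rows):
        if has_ink and start is None:
            start = y                     # a new line band begins
        elif not has_ink and start is not None:
            lines.append((start, y))      # valley reached: close the band
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines

# two synthetic "text lines" separated by a blank valley
page = np.zeros((10, 20), dtype=int)
page[1:3, :] = 1
page[6:9, :] = 1
print(segment_lines(page))                # [(1, 3), (6, 9)]
```

In practice a small threshold above zero is used instead of exact emptiness, since scanned pages contain noise pixels in the inter-line valleys.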
The final step is character segmentation, which is achieved through component labeling and vertical projection profile methods. This involves recognition of topological features, contour-based features and features obtained through the application of the ‘water reservoir’ method.
The collected features are employed for generating a tree classifier, where the decision at each node of the tree is taken on the basis of the presence or absence of a particular feature.
They state the water reservoir principle as follows: if water is poured from one side of a component, the cavity regions of the component where water would be stored are considered reservoirs. The main concept is that a reservoir is obtained when water is poured from the top (bottom) or the left (right) of the component.
The system, however, has its limitations. Other than basic isolated characters and numerals, more complex compound characters or ligatures are not recognized.
The study also does not report recognition results; only the segmentation accuracy of the processed text is reported. Proposed extension work includes upgrading the system to recognize compound characters and ligatures.
Lodhi and Matin [46] published work that deals with the development of a robust Urdu character pattern classification, representation and recognition system using Fourier descriptors for optical networks. The Fourier transform is a mathematical operation that tells
us the spectral density of an image i.e. the distribution of the different frequency components
of that image. The goal is to extract a finite set of numerical features from a closed curve,
features that will tend to separate the shapes of different classes relative to the intra-class
dispersion. The end product is a system that can classify patterns even if they are deformed by transformations such as rotation, scaling and translation, or a combination of them, in the presence of noise. The pre-processing stage is important as it filters out noise and improves the image through various algorithms. In the next stage, Fourier descriptors are used to uniquely represent the given characters’ polygonal signatures, thereby recognizing them even when they have undergone geometric transformations.
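The invariance property of Fourier descriptors can be illustrated with a short sketch. The normalization steps used here (dropping the DC term, taking magnitudes, dividing by the first harmonic) are a standard recipe for translation-, rotation- and scale-invariance, not necessarily the exact scheme of [46]:

```python
import numpy as np

def fourier_descriptors(contour, k=8):
    """Translation-, scale- and rotation-invariant Fourier descriptors of a
    closed contour given as an (N, 2) array of (x, y) boundary points."""
    z = contour[:, 0] + 1j * contour[:, 1]    # boundary as a complex signal
    F = np.fft.fft(z)
    F[0] = 0                                  # drop DC term -> translation invariance
    mags = np.abs(F)                          # magnitudes -> rotation invariance
    mags /= mags[1]                           # first-harmonic norm -> scale invariance
    return mags[1:k + 1]

# a square-like contour; descriptors are unchanged by shifting or scaling it
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
square = np.stack([np.sign(np.cos(t)), np.sign(np.sin(t))], axis=1)
d1 = fourier_descriptors(square)
d2 = fourier_descriptors(3.0 * square + 7.0)  # scaled and translated copy
print(np.allclose(d1, d2))                    # True
```

Classification then reduces to comparing descriptor vectors, e.g. by nearest-neighbour distance between an unknown character and class prototypes.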
Shumaila Malik and S. A. Khan [51] proposed a system which takes online input from the user, who writes an Urdu character with a stylus pen or mouse, and converts the handwriting information into Urdu text. The process of online handwritten text recognition is divided into six phases, each of which uses a different recognition technique depending upon the speed of the writer and the level of accuracy.
Faisal Shafait et al [25] present a layout analysis system for Urdu document image understanding, highlighting the script’s right-to-left reading and writing order in contrast with Latin-script languages that run left to right. They consider layout analysis an important component of an OCR. The authors have experimented with a method of extracting text lines in the reading order of the Urdu script, which is presented as an essential consideration for Urdu document image understanding.
Inam Shamsher et al [31] propose a method for recognizing isolated characters, claiming 98.3% accuracy for the printed Urdu alphabet. The system also claims to be script independent, and yet it is designed for Urdu only. The objective of the research is to develop an efficient recognition system for Urdu characters using minimum processing time.
The Multi Layer Perceptron (MLP) network is described as having three layers: one input, one hidden and one output. The input layer has 150 neurons because the input is a binary image of size 10x15 pixels; the 250 neurons of the hidden layer were decided on a trial-and-error basis; and the output layer has 16 neurons corresponding to the 16 bits of a character code in the Unicode character encoding scheme.
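The reported 150-250-16 topology can be sketched as a plain forward pass. The weights below are random and untrained, purely to show the shapes involved, and the sigmoid activation is an assumption since the paper does not state its activation function:

```python
import numpy as np

rng = np.random.default_rng(0)

# layer sizes reported in the paper: 10x15 binary input, 250 hidden, 16-bit output
n_in, n_hidden, n_out = 150, 250, 16
W1 = rng.normal(0, 0.1, (n_hidden, n_in))
W2 = rng.normal(0, 0.1, (n_out, n_hidden))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(image_10x15):
    """Forward pass: a 10x15 binary glyph image in, 16 sigmoid outputs out.
    Thresholding each output at 0.5 would yield a 16-bit character code."""
    x = image_10x15.reshape(n_in)
    h = sigmoid(W1 @ x)
    return sigmoid(W2 @ h)

glyph = rng.integers(0, 2, (10, 15))
bits = (forward(glyph) > 0.5).astype(int)
print(bits.shape)   # (16,)
```

Encoding the target as 16 Unicode bits is an unusual design choice; most OCR networks instead use one output neuron per character class.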
The paper does not clarify what tools have been used for system implementation, though the algorithmic details of the system training and neural network implementation resemble those of the MATLAB Neural Network toolbox.
The network for the developed software has been trained and tested on the Arial font at 72 pt size. It does not, however, indicate whether the system will work with equal efficiency on a smaller font of more practical size.
The system is also limited to working on isolated characters in a single line, showing no capacity to segment text images into lines of text. Although the feature extraction methods are claimed to be simple and robust, it is not mentioned which features are extracted or through which techniques.
The 98.3% accuracy is claimed without any account or details of the experimental processes or procedures.
Zaheer Ahmad et al [73] have published a paper entitled ‘Urdu Nastalique Optical Character Recognition’, but within the paper the explanation is that ‘a prototype of the system has been tested on Urdu text.’ There is very little description of Nastalique in the paper.
The paper discusses Urdu script characteristics and a simple but novel and robust technique to recognize printed Urdu alphabet without using a lexicon, as claimed by the authors. The technique uses the inherent complexity of Urdu script for character recognition. A word is scanned and analyzed for the level of complexity and as the level changes the point is marked for a character. It is then segmented and fed to a Neural Network.
Character segmentation is explained to be in three steps. In the first step lines of text are identified, then words are separated and in the final step characters are segmented and extracted from words or sub words using its complexity level. These are then fed to a Neural
Network for final recognition/classification. Throughout the paper Urdu words are presented in disjoint reverse-order form (و د ر ا), except for a few, e.g. (اسم), which are printed with the correct ligature.
Table 1 does not clarify or present the forms of Urdu letters it claims to. In fact, it presents all the Urdu words in their disjoint reverse order, i.e. left to right. The contents of Table 2 are
equally questionable for clarity and presentation.
Three levels of complexities are mentioned for characters yet no clear explanation is given as
to what forms the basis for differentiation between the various levels.
The word ‘character’ is repeatedly confused with the word ‘alphabet’, both of which have completely different meanings in the context of languages. There are instances of other rather significant technical errors, e.g. ‘an isolated word scanned vertically from right to left’ and horizontally from ‘top to bottom’.
The paper claims that the system achieves 93.4% accuracy but does not provide supporting evidence of the procedures used.
There are a total of six references to other papers, five of which are cited together in one instance.
2.3 Approaches for Arabic script OCR
Arabic characters have features that make direct application of algorithms for character classification in other languages difficult to achieve as the structure of Arabic is very different [50].
An extensive literature survey on Arabic script OCR showed that researchers in this area have followed mainly two approaches to implementing an OCR system for printed Arabic script text, namely the segmentation-based and the ligature-based approach.
2.3.1 Ligature-based Approach
In the ligature-based approach to implementing an Arabic script OCR there are only two levels of segmentation: the document is segmented into lines of text, and the lines into ligatures or isolated character shapes, which are recognized without further segmentation.
2.3.2 Segmentation-based Approach
If the segmentation-based approach is followed, then before the recognition phase all text images must be segmented down to the character level, similar to a Latin script OCR; that is, all the ligatures in the words are segmented into their constituent character shapes.
2.4 Previous Work in Ligature-Based Arabic OCR
Here we present a brief overview of the previous work that has been done on ligature-
based Arabic OCR.
Al-Badr and R. Haralick [3] highlight some of the hindrances in the development of an
Arabic OCR and the reasons for inadequate research in this area. The paper attributes the difficulties in Arabic character recognition to the more complex features of the Arabic script, e.g. cursiveness, vertical stacking of characters to form many of the ligatures, context-sensitivity, vertical overlapping of shapes, etc. It then discusses the design and implementation of an
Arabic Word Recognition system which works on the principles of symbol recognition without initially segmenting words into characters, claiming that most recognition errors occur at the crucial stage of segmentation because of the typical shape combinations of characters in the Arabic script. The system first recognizes the input word by detecting a set
of ‘shape primitives’ on the word. It then matches the regions of the word with a set of
symbol models. The recognized word is thus presented in the form of a spatial arrangement
of symbol models matching the region of the word. Since the possible combinations of
symbol models are potentially large, the system imposes constraints in terms of word
structure and spatial consistency. The accuracy of the system is shown to be 94.1% for
isolated, scanned symbols and 73% for scanned words.
Pechwitz and Märgner [58] present an off-line recognition system for Arabic handwritten text. The work highlights the efficiency of cursive text recognition methods based on Hidden
Markov Models (HMMs). The study was conducted on a semi-continuous, one dimensional
HMM and describes in detail the modification and adaptation of the preprocessing and feature extraction processes for recognition of Arabic writing. The experiments were based on first estimating the normalization parameters of each binary word image and following it by normalization of height, length and baseline skew. The features are then collected using a sliding window technique.
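The sliding-window feature extraction mentioned above can be illustrated as follows; the three features computed here (ink density, vertical centre of gravity, ink extent) are common illustrative choices, not the exact feature set of [58]:

```python
import numpy as np

def sliding_window_features(line_img, width=4, step=2):
    """Slide a narrow vertical window across a normalized text-line image and
    emit one simple feature vector per position."""
    h, w = line_img.shape
    feats = []
    for x in range(0, w - width + 1, step):
        win = line_img[:, x:x + width]
        total = win.sum()
        if total == 0:
            feats.append((0.0, 0.0, 0.0))          # empty window
            continue
        rows = np.nonzero(win)[0]
        feats.append((total / win.size,             # ink density
                      rows.mean() / h,              # vertical centre of gravity
                      (rows.max() - rows.min() + 1) / h))  # ink extent
    return np.array(feats)

line = np.zeros((8, 12), dtype=int)
line[3:5, :] = 1                                    # a horizontal stroke
f = sliding_window_features(line)
print(f.shape)                                      # (5, 3)
```

The resulting sequence of feature vectors, one per window position, is exactly the kind of observation stream an HMM recognizer consumes.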
Alma’adeed et al. [6] present a complete scheme for character recognition of totally unconstrained Arabic text based on a Model Discriminant HMM. The system proposes feature extraction following the removal of variations in the word images which do not affect the identity of the written word. The system then encodes the skeleton and edges of the word
and a classification process based on the HMM is used. The result is a word matching one in
a dictionary. The study gives indication of successful results of a detailed experiment.
Alma’adeed et al. [5] present a scheme to recognize handwritten Arabic text. The overall engine is modeled on multiple HMMs and a global feature extraction scheme.
The system initially removes variations in the word images and then encodes the skeleton and edge of the word for feature extraction. A rule-based classification is then used as a global recognition engine. Finally, for each group, the HMM approach is used for trial classification. The given output is a word that matches one present in a lexicon. Once the model has been established, the Viterbi algorithm is used to recognize the segments of letters composing a word. The study gives details about the segmentation step as well as off-line recognition operations. The study emphasizes the development of two substantially different recognition engines because there are multiple ways in which one Arabic word can be written down. The first engine is a global feature scheme using some ascender and descender features and making use of a rule-based classification engine. The second scheme is based on a set of features using an HMM classifier.
Farah [27] presents work on the construction of a recognition system around a modular architecture of feature extraction and word classification units, in an attempt to solve the problem of recognizing handwritten Arabic bank checks. The research stresses the efficiency of a multi-classifier system with three parallel classifiers working on the same set of structural features. The classification-stage results are first normalized, and the final decision on the candidate words is made after using contextual information present in the syntactical module.
El-Hajj [22] describes a one-dimensional HMM off-line handwriting recognition system using an analytical approach. Specific models are used for each character
and word models are built by concatenating the appropriate character models. The system is
supported by a set of robust language independent features extracted on binary images.
The study focuses on baselines as an important feature in character recognition. The baseline-dependent features are added to the original feature set. Feature vectors are extracted using the sliding window technique.
Alaa Hamid and Ramzi [30] present a technique to segment handwritten Arabic text using a neural network. Three initial steps precede the Artificial Neural Network (ANN) verification: scanning, binarization and finally feature extraction. A recursive, conventional algorithm is used to segment text into connected blocks of characters and generate pre-segmentation points for these blocks. This heuristic algorithm is responsible for generating the topographic features from the text and calculating pre-segmentation points. An Artificial Neural Network then verifies the accuracy of these segmentation points. The results have shown an accuracy range between 53.11% and 69.72% across various features. The inaccuracies have been attributed to complexities in the shapes of characters or to dislocated external objects.
Khorsheed [41] proposes a segmentation free approach to recognize Arabic text using the
HMM toolkit, a portable toolkit for building and manipulating Hidden Markov Models. It decomposes the document image into text line images and extracts a set of simple statistical features using the sliding window technique. The Hidden Markov Model Toolkit is then used to develop a sequence of training images and finally for recognition of characters. This implies that the feature vector, extracted from the text, is computed as a function of an
independent variable. The experiments were initially conducted on a corpus of data collected in the Arabic font ‘Thuluth’ and later with others. Tahoma and Andalus scored the highest recognition rates while Naskh and Thuluth performed lowest. The system was nevertheless capable of recognizing complex ligatures and overlaps and showed an overall improvement with the use of the tri-model scheme. Suggested improvements for future development are expansion of the data corpus and utilization of HTK capabilities.
Abuhaiba [2] presents a progressive approach toward overcoming the cursive constraints that the Arabic script places on digital recognition through an OCR. Abuhaiba proposes the development of new font styles that appear cursive but are in reality discrete on closer inspection. This suggests the creation of a discrete Arabic script rather than a cursive one, so that the bulk of information created every day in this script remains machine recognizable. Such a recognition engine would sidestep the problem of Arabic character recognition because it does not require the development of a new recognition technique; rather, the same techniques used for Latin script recognition are applied to newly developed font files for Arabic. This is within reach of any research on cursive scripts in which the old or original styles can be compromised for newer, modified ones. However, the development of a Nastalique OCR implies the preservation of its unique font style and ligature shapes; moreover, its ligatures are more cursive and calligraphic than Arabic Naskh, and therefore Abuhaiba’s proposition cannot be applied to solve the problem of an Urdu OCR for Nastalique.
Erlandson et al [23] implemented a word-level Arabic text recognition system that did not
require character segmentation. They characterized the shape of Arabic words by unique
feature vectors. These feature vectors were then matched against a database of feature vectors
derived from a dictionary of known words. The database stored multiple feature vectors for each word in a dictionary of 48,200 words. The word whose feature vectors strongly matched was returned as the hypothesis.
Al-Badr and Haralick [3] designed and implemented an Arabic word recognition system that recognized an input word by detecting a set of shape primitives on the word. The regions of words represented by these shape primitives were then matched with a set of symbol models.
The description of the recognized word was obtained from a spatial arrangement of symbol models that were matched to regions of the word.
Bazzi et al [14] implemented a segmentation free OCR system that was based on Hidden
Markov Model (HMM). They chose a text line as a major unit for training and recognition. A
page of printed text was decomposed into a set of horizontal lines, using horizontal position along each line as an independent variable. They scanned a text line from right to left and at each horizontal position a feature vector was computed from a narrow vertical strip of the input. The system was based on 14-state HMMs for each character. The output of the system comprised the sequences of characters that had the maximum likelihood.
Amin [10] analyzed the shape of Arabic words with a unique vector of features. This feature vector was then represented in attribute/value form to an inductive learning system that created rules for recognizing characters of the same class. The technique was composed of
three major steps. The first step was pre-processing, in which the original image was
transformed into a binary image utilizing a 300 dpi scanner and then forming the connected component. Second, global features such as number of sub-words, number of peaks within the sub-word, number and position of the complementary character, etc. were then extracted.
Finally, machine learning was used for character classification to generate a decision tree.
Kraus and Dougherty [44] implemented the morphological hit-or-miss transform. They developed a basic class of structuring-element pairs for segmentation-free character recognition via the morphological hit-or-miss transform. Both the hit and the miss structuring elements were selected so that the hit-or-miss transform could be applied across the test image without prior segmentation. The authors marked the locations at which one structuring element fitted within the pixel set corresponding to a shape of interest while the other structuring element fitted outside the pixel set.
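A minimal sketch of the hit-or-miss transform follows, implemented directly from its definition: erode the image by the hit element, erode the complement by the miss element, and intersect the results. The structuring elements here are illustrative, not those of [44]:

```python
import numpy as np

def erode(img, se):
    """Binary erosion of img by structuring element se (both 0/1 arrays)."""
    H, W = img.shape
    h, w = se.shape
    out = np.zeros_like(img)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            if np.all(img[y:y + h, x:x + w] >= se):
                out[y, x] = 1             # se fits entirely inside the foreground
    return out

def hit_or_miss(img, hit, miss):
    """Hit-or-miss transform: 'hit' must fit the foreground and 'miss' must
    fit the background at the same location (no prior segmentation needed)."""
    return erode(img, hit) & erode(1 - img, miss)

# locate an isolated single pixel: hit = the pixel, miss = its 8-neighbourhood
img = np.zeros((6, 6), dtype=int)
img[2, 2] = 1
hit = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
miss = 1 - hit
# note: matches are reported at the window's top-left corner, not its centre
print(np.argwhere(hit_or_miss(img, hit, miss)))  # [[1 1]]
```

Sliding such hit/miss pairs across a whole page image is what allows recognition without first segmenting the text, as the paper describes.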
Clocksin and Khorsheed [42] used a new technique for recognizing Arabic cursive words.
They used the holistic approach for word recognition in which the word was treated as a whole and the features were extracted from the un-segmented word image. Each word was represented by a separate template that was part of the Fourier spectrum. The recognition of the word was based on the Euclidean distance from those templates.
Jelodar et al [32] used the morphological hit/miss transform to recognize Persian script in machine-printed documents. They first used the horizontal histogram of the document to separate lines and the vertical histogram of the lines to separate sub-words, and then removed dots to make the recognition stage simpler. The words were then thinned using the sequential thinning method
based on morphological hit/miss transformation. After thinning, feature extraction was done
by an exhaustive search process using hit/miss operator with a complete set of structuring
elements corresponding to the different geometric patterns of interest. These extracted
features along with the number and position of dots provided the required information for
identification of the sub-word.
Fan Xialong and Verma Brijesh [26] compared segmentation-based and non-segmentation-based neural techniques for cursive word recognition. They discussed three papers in non-segmentation-based cursive word recognition, described as follows. Govindaraju
et al proposed a segmentation-free neural technique for cursive word recognition. They
extracted the features by traversing the strokes in the word image without performing word
segmentation. The word image was then mapped on to the feature vector matrix of uptrends
and downtrends of strokes. Guillevic et al proposed a segmentation-free method to extract 4
types of global features: ascenders, descenders, loops and word length. The contour tracing
procedure was applied on the input image and a representation of the input was obtained as a
list of connected components. Parisse extracted global features depending on the word’s
upper and lower contour. He used training and recognition that was based on n-gram
extraction and identification.
Obaid and Dobrowiecki [54] proposed a segmentation free method called N-markers for recognition of printed Arabic text. Their method was a mixture of global and structural approaches. They collected the informative points lying in the center of the characters. These points were the basis of the coordinate system for the configuration of sensors designed to
identify the necessary strokes and were called N-markers. By distributing enough markers
over the character, a letter or a group of letters in a text line was detected.
Khorsheed and Clocksin [43] proposed an approach based on hidden Markov Model (HMM),
where the word was recognized as a single unit. Their method avoided segmenting words
into characters or other primitives and used a predefined lexicon, which acted as a look-up
dictionary. They first applied a thinning algorithm based on Stentiford’s algorithm to find a
skeleton graph of the word image and then calculated the centroid of the image. Then the
word image was transformed into a stream of feature vectors. This was done using the fact
that the skeleton graph of the image consisted of a number of segments where each segment
started and ended at a feature point. These segments were listed in descending order relative
to the horizontal value of the starting feature point of each segment. During segment
extraction, loops were also extracted from the skeleton. The segments were then transformed
into 8-dimensional feature vectors. Each feature vector was then mapped to the closest
symbol in the lexicon and the resulting stream of observations was presented to the HMM.
The path discriminant HMM approach was used where a pattern was classified to the word
with maximum path probability. The Viterbi algorithm was used for finding optimal path.
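The Viterbi decoding step used above can be sketched on a toy discrete HMM; the model parameters below are invented for illustration only:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path for an observation sequence under an HMM with
    initial probabilities pi, transition matrix A and emission matrix B."""
    n_states, T = len(pi), len(obs)
    delta = np.zeros((T, n_states))           # best path prob ending in each state
    psi = np.zeros((T, n_states), dtype=int)  # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = scores.argmax()
            delta[t, j] = scores.max() * B[j, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):             # backtrack along stored pointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# toy 2-state model: state 0 mostly emits symbol 0, state 1 mostly emits 1
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 0, 1, 1], pi, A, B))        # [0, 0, 1, 1]
```

In the path-discriminant setting described above, each word model is scored this way and the word whose model yields the maximum path probability wins.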
Reza Safabakhsh et al [62] proposed a system for Farsi Nastalique handwritten word recognition using a continuous-density variable-duration hidden Markov model (CDVDHMM).
In this system, after the pre-processing stage, the ascenders, descenders, dots and other secondary strokes are eliminated from the original image. Segmentation is done by analyzing the upper contour, thus avoiding the under-segmentation problem. Variable-duration states in the
system cover the over-segmentation problem. Features are extracted which are invariant to
size and shift. At the recognition stage a modified version of Viterbi algorithm is used.
Mohammad S Khorsheed et al [42] introduced a holistic approach for Arabic word
recognition which uses a normalization process to compensate for dilation and translation.
The adapted process transforms the image of an Arabic word from Cartesian coordinates to
polar coordinates similar to log polar transformation. Rotation is also converted into
translation by this transformation.
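A log-polar resampling of the kind described can be sketched as follows; the nearest-neighbour sampling and grid sizes are illustrative choices, not the exact transformation of [42]:

```python
import numpy as np

def log_polar(img, n_rho=32, n_theta=32):
    """Resample an image from Cartesian to log-polar coordinates about its
    centre; a rotation of the input becomes a shift along the theta axis."""
    h, w = img.shape
    cy, cx = (h - 1) / 2, (w - 1) / 2
    max_rho = np.log(min(cy, cx))             # log-radius of the largest circle
    out = np.zeros((n_rho, n_theta))
    for i in range(n_rho):
        for j in range(n_theta):
            rho = np.exp(max_rho * (i + 1) / n_rho)
            theta = 2 * np.pi * j / n_theta
            y = int(round(cy + rho * np.sin(theta)))
            x = int(round(cx + rho * np.cos(theta)))
            if 0 <= y < h and 0 <= x < w:
                out[i, j] = img[y, x]         # nearest-neighbour sampling
    return out

img = np.zeros((21, 21))
img[10, 13] = 1                               # a point to the right of centre
lp = log_polar(img)
print(lp.shape)                               # (32, 32)
```

Because rotation becomes a cyclic shift of the theta axis, matching in the log-polar domain (or on its Fourier magnitudes) is insensitive to the word's orientation.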
2.5 Previous Work in Segmentation-Based Arabic OCR
Ligature-based OCR has its limitation and cannot be expected to identify all possible
ligatures because it needs a prohibitively large database to store all possible combinations of
characters that can form long ligatures. Since every ligature is composed of characters, the
segmentation of ligatures could generate individual characters that can be recognized.
Machine generated (printed) Arabic (Naskh style) text usually follows a horizontal base-line where most characters, irrespective of their shape, have a horizontal segment of constant width. If we can separate these horizontal constant-width segments from a ligature, the remaining components of the ligature could be recognized. However, this is not the case with
Nastalique which has multiple base-lines, horizontal as well as sloping, making Nastalique a complex style for optical recognition.
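The constant-width horizontal baseline of Naskh can be estimated with a simple projection heuristic, sketched below under the assumption of a binarized single text line; as noted above, this single-baseline assumption is exactly what breaks down for Nastalique:

```python
import numpy as np

def find_baseline(line_img):
    """Estimate the horizontal baseline of a Naskh-style text line as the row
    with the highest black-pixel count, since the connecting strokes of most
    characters lie on that row."""
    return int(line_img.sum(axis=1).argmax())

line = np.zeros((10, 30), dtype=int)
line[2:7, 4] = 1                      # a vertical stem
line[6, :25] = 1                      # the dense connecting baseline stroke
print(find_baseline(line))            # 6
```

Once the baseline row is known, the constant-width horizontal segments along it become candidates for removal, leaving the remaining components of each ligature for recognition.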
This section presents a brief overview of the previous work that has been done on segmentation-based Arabic OCR.
Bouslama and Kishibe [15] proposed a method that combined the structural and statistical approaches for feature extraction with a classification technique based on fuzzy logic. They segmented characters into a main segment and complement characters. The main segment was then centred and projected horizontally and vertically. The classification features were then extracted from the number of complement characters and from the horizontal and vertical projection profiles of the main character. A set of fuzzy rules was used for classification. The recognition algorithm was tested on three different fonts and high recognition rates were achieved.
Zidouri et al [77] proposed a sub-word segmentation technique that was independent of font type and font size. After applying pre-processing techniques, they employed horizontal and vertical segmentation to segment a page into separate lines and lines into sub-words respectively. To divide sub-words into characters, they first skeletonized the image of sub-words without dots. Then, for all the rows, they scanned the image row-wise from right to left to find a band of horizontal pixels of length greater than or equal to the width of the smallest character. The vertical projection of this scanned band was then taken and, if no pixel was found, a vertical guide band was drawn in an empty image. Thus several guide bands were drawn for all the rows. A special mark for the guide bands of each row was placed below the location of the baseline. In order to select the correct guide band, several features were extracted and tested against several predefined rules. If a guide band satisfied the rules it was selected; otherwise it was rejected.
Motawa et al [52] proposed an algorithm for automatic segmentation of Arabic words using
mathematical morphology. They first digitized the image using a 300 dpi scanner and then
detected and corrected any slanted strokes. They applied erosion operation on the image and
computed the average slope of all the strokes. Every pixel was then transformed to a new
location according to a formula to correct the slant. After slant correction, connected
components were constructed which formed the skeleton for all future analysis of the image.
Morphological operations, opening and closing, were applied to word image to allocate
singularities and regularities. Singularities represent the start, the end or the transition to
another character. Regularities contain the information required for connecting a character to the next character. So the regularities were the candidates for segmentation.
Tolba and Shaddad [67] proposed a segmentation algorithm for the separation of Arabic characters. In their algorithm, they slid a window over a word horizontally from right to left and at each instant calculated a segmentation parameter, which was then matched against a predefined set of threshold values. If the segmentation parameter was less than the threshold value, the region was marked as a silence region. Detecting a silence region after the beginning of a letter identified the end of that letter. When the value of the segmentation parameter increased, the next letter began.
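The silence-region idea can be sketched as follows, using raw ink count as the segmentation parameter; the window width and threshold are illustrative values, not those of [67]:

```python
import numpy as np

def silence_regions(word_img, width=3, threshold=2):
    """Slide a window over a word image from right to left and mark positions
    whose ink count falls below a threshold as 'silence' regions, i.e.
    candidate boundaries between letters."""
    h, w = word_img.shape
    silent = []
    for x in range(w - width, -1, -1):          # right-to-left scan
        if word_img[:, x:x + width].sum() < threshold:
            silent.append(x)
    return silent

# two dense letter bodies joined by a thin connector stroke
word = np.zeros((6, 12), dtype=int)
word[1:5, 0:4] = 1
word[1:5, 8:12] = 1
word[3, 4:8] = 1                                 # thin connecting region
print(silence_regions(word, width=2, threshold=3))  # [6, 5, 4]
```

The runs of consecutive silent positions correspond to the thin connecting strokes, so a cut placed inside each run separates the letters.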
Al-Yousefi and Udpa [9] introduced a statistical approach for Arabic character recognition.
They used a two-level segmentation scheme. They first segmented the words into characters.
Then a lower level segmentation was applied to segment the characters into primary and secondary parts (dots and zigzags). Then they computed the moments of horizontal and vertical projections of the primary parts and normalized them to zero order moment. The
features were extracted from the normalized moments of the vertical and horizontal projections and the classification of the primary characters was done using quadratic
Bayesian classifier. The secondary parts were isolated and identified separately. Their recognition was done during the pre-processing and segmentation stages.
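The normalized projection moments can be sketched as below; taking central moments of orders two and three is an illustrative reading of the description, not the exact feature set of [9]:

```python
import numpy as np

def projection_moments(glyph, max_order=3):
    """Central moments (orders 2..max_order) of the vertical and horizontal
    projections of a binary glyph, normalized by the zero-order moment."""
    feats = []
    for proj in (glyph.sum(axis=0), glyph.sum(axis=1)):  # vertical, horizontal
        m0 = proj.sum()                                   # zero-order moment
        idx = np.arange(len(proj))
        mean = (idx * proj).sum() / m0                    # centre of mass
        for p in range(2, max_order + 1):
            feats.append(((idx - mean) ** p * proj).sum() / m0)
    return np.array(feats)

glyph = np.zeros((8, 8), dtype=int)
glyph[2:6, 3:5] = 1                                       # a small rectangular blob
print(projection_moments(glyph).shape)                    # (4,)
```

For a left-right symmetric glyph like this one, the odd-order moment of the vertical projection vanishes, which is what makes such moments discriminative shape features.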
Amin and Mari [11] proposed a structural probabilistic technique for automatic recognition of multi-font printed Arabic text that was based on character recognition and word recognition. They first transformed the image of text into separate lines of text by taking a horizontal projection and then segmented the text lines into words and sub-words by taking vertical projection. To segment words into characters, they took vertical projection of the word and the least sum of the average value over all columns showed the connectivity point.
Thus each part of the word having a value less than the average sum was segmented into a different character. This resulted in the number of segments that were then connected together in the recognition phase to form the basic shape of the character and the segments that were not connected to any other segment were considered to be complementary characters. They used Freeman codes of the characters and consulted character recognition dictionary to recognize characters. The word recognition part of the technique used the tree representation lexicon and the Viterbi algorithm.
Zheng et al [75] proposed a new machine printed Arabic character segmentation algorithm.
Their algorithm was based on vertical histogram of sub-words and some rules that were based on four kinds of features. Initially they used some rules to check if the sub-word consisted of only one character. Then they scanned the vertical histogram of sub-word that consisted of more than one characters in the direction of writing and marked a point as
potential segmentation point if the histogram value was increased and the point was near the
baseline. Then they used some rules to check if the point was a real segmentation point and
segmented the sub-word at it.
Arica and Yarman-Vural [12] proposed an analytical approach for offline cursive handwriting recognition that was based on a sequence of segmentation and recognition algorithms. They did not pre-process the image for noise reduction; however, they estimated global features such as baseline, average stroke width/height, skew and slant angle, and integrated them to improve segmentation and recognition. They determined the segmentation
regions in the word image, and then searched these regions for finding segmentation
boundaries between characters. The shortest segmentation path from top to bottom of a word
image was searched that segmented the connected characters into segmentation regions. Each
character candidate was represented by a fixed size string of codes and a feature extraction
scheme was employed. HMM training was applied on the selected output of the segmentation
stage and the HMM and feature space parameters were estimated. Each string of codes was
fed into HMM recognizer and were labelled with the HMM probabilities. The candidate
characters and associated HMM probabilities were then represented in a word graph and the best path in the graph was found using dynamic programming which corresponded to a valid
word.
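The final word-graph search described above can be illustrated with a small sketch. The graph, labels and log-probabilities below are hypothetical stand-ins for HMM outputs, not the authors' model:

```python
import math

def best_path(n_nodes, edges):
    """Dynamic programming over a word graph: nodes are segmentation
    breaks, edges are candidate characters with log-probabilities; the
    best path from the first break to the last spells the word."""
    best = {0: (0.0, [])}                     # node -> (score, labels)
    for node in range(1, n_nodes):
        cands = [(best[s][0] + lp, best[s][1] + [label])
                 for (s, e, label, lp) in edges
                 if e == node and s in best]
        if cands:
            best[node] = max(cands)           # keep the highest score
    return best.get(n_nodes - 1)

# hypothetical graph: two single-segment characters competing with one
# character hypothesis that spans both segments
edges = [(0, 1, "a", math.log(0.6)),
         (1, 2, "b", math.log(0.7)),
         (0, 2, "c", math.log(0.5))]
score, word = best_path(3, edges)
print(word)   # -> ['c']
```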
Lorigo and Govindaraju [47] introduced a new algorithm for the segmentation and pre-recognition of offline Arabic handwritten text. Their system took as input binary text images, with baseline heights at the left and right edges, from the IFN/ENIT image database. They detected connected components and separated them into dots and sub-words; dots were combined into dot groups, and a new image was made for each sub-word. For each sub-word image they identified the loops and refined the baseline from the projection of black pixels onto a vertical line, which yielded the horizontal sub-word baseline. A list of candidate segmentation points, each a range of x-coordinates, was then determined by two methods. The gradient method computed horizontal and vertical gradients in the baseline strip and used them to indicate the connection points between characters; the down-up method found the short candidate points missed by the gradient method. Because these two methods over-segment the image, extra breakpoints in loops and at edges were removed using knowledge of letter shapes. Dot groups were then assigned to letters for pre-recognition.
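One plausible reading of the gradient test in the baseline strip can be sketched as follows; the strip bounds, the dominance test, and the toy image are all assumptions for illustration:

```python
import numpy as np

def connection_columns(img, base_lo, base_hi):
    """Inside the baseline strip, a flat connecting stroke shows strong
    vertical change (its top and bottom edges) but little horizontal
    change. Columns where that holds are taken as candidate
    segmentation points."""
    strip = img[base_lo:base_hi].astype(float)
    gy, gx = np.gradient(strip)              # vertical / horizontal gradients
    col_gy = np.abs(gy).sum(axis=0)
    col_gx = np.abs(gx).sum(axis=0)
    return np.nonzero((col_gy > 0) & (col_gx == 0))[0]

# toy word: two character bodies joined by a flat baseline stroke
img = np.zeros((6, 9), dtype=int)
img[0:6, 0:3] = 1     # one character body
img[0:6, 6:9] = 1     # the other character body
img[4, 3:6] = 1       # flat connecting stroke on the baseline
print(connection_columns(img, 3, 6))   # -> [4]
```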
Lee et al [45] presented a word segmentation algorithm that divides a word into a prefix-stem-suffix sequence, using a small manually segmented Arabic corpus and a table of the language's prefixes and suffixes. They first tokenized an Arabic sentence and used a trigram language model to segment the tokens into a stream of morphemes, estimating the trigram probabilities from the morpheme-segmented corpus. For segmentation, each token was compared against the prefix/suffix table, all matching prefixes and suffixes were identified at each character position, and all possible prefix-stem-suffix sequences were enumerated. A trigram language-model score was then computed for each segmentation, and the top N scored segmentations were kept; some illegal sub-segmentations were filtered out using information from the manually segmented corpus. After the segmenter had been developed on the manually segmented corpus, segmentation accuracy was improved by iteratively expanding the stem vocabulary and retraining the language model on a large unsegmented Arabic corpus; the model parameters were re-estimated with the expanded vocabulary and training corpus.
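The enumerate-and-score step can be sketched as follows. The romanized affix tables and the scores are hypothetical stand-ins; the real system used Arabic morpheme tables and a trigram language model:

```python
# hypothetical tables; the real system scored candidates with a
# trigram language model estimated from a segmented corpus
PREFIXES = {"", "al"}
SUFFIXES = {"", "ha"}
STEM_SCORE = {"kitab": 0.9, "kitabha": 0.2, "alkitab": 0.1}

def segmentations(token):
    """Enumerate every prefix-stem-suffix split licensed by the tables."""
    out = []
    for p in PREFIXES:
        for s in SUFFIXES:
            if token.startswith(p) and token.endswith(s):
                stem = token[len(p):len(token) - len(s)]
                if stem:                      # stem must be non-empty
                    out.append((p, stem, s))
    return out

def best_segmentation(token):
    """Pick the candidate whose stem has the highest stand-in score."""
    return max(segmentations(token),
               key=lambda seg: STEM_SCORE.get(seg[1], 0.0))

print(best_segmentation("alkitabha"))   # -> ('al', 'kitab', 'ha')
```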
Goraine et al [28] presented a segmentation-based approach to offline Arabic character recognition. They applied a thinning algorithm to isolated Arabic words, input via a video camera, to obtain strings of points representing the characters. They then segmented the words into principal and secondary strokes and classified the pixels into connection points (pixels with two neighbours), feature points (pixels with either one or three neighbours), and strokes (strings of pixels between two consecutive feature points). A stroke-finder algorithm was applied repeatedly to find the start and end points of each stroke and to trace it from start to end, until all strokes had been traversed. The strokes were then coded using 8-direction codes and represented as strings of characters. For stroke classification, 11 primitives were used; each stroke string was compared with each primitive to find an exact match, and new shapes were stored in a lookup table.
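The neighbour-count classification above can be sketched directly; the toy skeleton is an assumption for illustration:

```python
import numpy as np

def classify_points(skel):
    """Classify skeleton pixels by 8-neighbour count, as in the scheme
    above: 2 neighbours -> connection point; 1 or 3 neighbours ->
    feature point (a stroke end or a branch)."""
    conn, feat = [], []
    rows, cols = skel.shape
    for r in range(rows):
        for c in range(cols):
            if not skel[r, c]:
                continue
            r0, r1 = max(r - 1, 0), min(r + 2, rows)
            c0, c1 = max(c - 1, 0), min(c + 2, cols)
            n = skel[r0:r1, c0:c1].sum() - 1   # neighbours, excluding self
            if n == 2:
                conn.append((r, c))
            elif n in (1, 3):
                feat.append((r, c))
    return conn, feat

# toy skeleton: a horizontal 3-pixel stroke
skel = np.zeros((3, 5), dtype=int)
skel[1, 1:4] = 1
conn, feat = classify_points(skel)
print(feat)   # -> [(1, 1), (1, 3)]  (the two stroke end points)
```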
Parhami and Taraghi [56] presented a technique for the automatic recognition of printed Farsi text. The technique is applicable, with little or no modification, to printed Arabic text, since Farsi is written in the Arabic script and also uses the Naskh style. The most important parts of the system are (1) the isolation of symbols within each sub-word and (2) recognition. The main step in segmenting symbols is to determine the pen (script) thickness, which is used to find candidate connection columns. The authors report that practical application of the technique to Farsi newspaper headlines was 100% successful; however, fonts of smaller point size yield less than perfect recognition. The system is heavily font-dependent, and the segmentation process can be expected to give degraded results in some cases.
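One plausible reading of the pen-thickness test can be sketched as follows; in practice the thickness would be estimated from the image (e.g. as the most frequent black-run length), which is assumed here:

```python
import numpy as np

def connection_candidates(word, pen):
    """Columns whose total ink does not exceed the pen thickness are
    candidate connection columns: a horizontal connecting stroke is
    only about one pen-width high, while symbol bodies are taller."""
    return [x for x in range(word.shape[1])
            if 0 < word[:, x].sum() <= pen]

# toy word: two symbol bodies joined by a one-pen-width connector
word = np.zeros((7, 10), dtype=int)
word[1:6, 1:4] = 1     # symbol body
word[1:6, 6:9] = 1     # symbol body
word[3, 4:6] = 1       # thin connection stroke
print(connection_candidates(word, pen=1))   # -> [4, 5]
```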
Almuallim and Yamaguchi [8] first segmented words into strokes, since it is difficult to separate a cursive word directly into characters. These strokes were then classified using their geometrical and topological properties; the relative positions of the classified strokes were examined, and the strokes were combined in several steps into a string of characters representing the recognized word. A maximum recognition rate of 91% was achieved; in most cases, failures were due to incorrect segmentation of words.
Ramsis et al [59] adopted a method of segmenting Arabic typewritten characters after recognition. As the characters are not separated yet, they assume that the rightmost columns of a word, the number of which equals the width of the smallest character, constitute a character. Moments are calculated and checked against the feature space of the font. If a character is not found, another column is appended to the underlying portion of the word and moments are calculated and checked again. This process is repeated until a character is recognized or the end of the word is reached.
The method allowed the system to handle overlapping characters and to isolate the connecting baseline between connected characters. However, it appears sensitive to font type and to variations in the input patterns, and it requires intensive computation of the accumulative moments. No figures were reported for the system's recognition rate or efficiency.
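The recognize-then-segment loop described above can be sketched as follows. The "moment check" is replaced by a toy lookup of known column patterns, and the sketch scans left-to-right rather than from the rightmost columns; both are simplifying assumptions:

```python
def segment_by_recognition(word_cols, known, min_width=2):
    """Start with the smallest character width; if the piece is not
    recognized, append one more column and try again, accepting a
    character as soon as the piece matches, as in the method above."""
    chars, start = [], 0
    while start < len(word_cols):
        end = start + min_width
        while end <= len(word_cols):
            piece = tuple(word_cols[start:end])
            if piece in known:           # "moments match a character"
                chars.append(known[piece])
                start = end
                break
            end += 1                     # append one more column
        else:
            break                        # end of word reached, no match
    return chars

# toy "feature space": column-profile patterns of two known characters
known = {(3, 3): "A", (1, 2, 1): "B"}
result = segment_by_recognition([3, 3, 1, 2, 1], known)
print(result)   # -> ['A', 'B']
```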
Amin and Mari [11] presented a structural probabilistic approach to recognizing printed Arabic text, based on character recognition and word recognition. Character recognition includes the segmentation of words into characters using vertical projections and the identification of the characters; word recognition is based on the Viterbi algorithm and can handle some identification errors. The system was tested on only a few words, and no performance figures were reported. The method has inherent ambiguities and deficiencies due to the interconnectivity of Arabic text.
Al-Emami and Usher [4] presented an on-line system to recognize handwritten Arabic words.
Words are segmented into primitives that are usually smaller than characters. The system is taught by being fed the specifications of the primitives of each character. In the recognition process, the parameters of each primitive are found, and special rules are applied to select the combination of primitives that best matches the features of the learned characters. The method requires manual adjustment of some parameters, and the system was tested on only 170 words (540 characters) written by 11 different subjects.
Zahour et al [74] presented a method for automatic recognition of off-line Arabic cursive handwritten words based on a syntactic description of words. The features of a word are extracted and ordered to form a tree description of the script with two primitive classes: branches and loops. In this description, the loops are characterized by their classes and the branches by their marked curvature, their relationship, and whether they are in clockwise or counterclockwise direction. Some geometrical attributes are applied to the primitives that are combined to form larger basic forms. A character is then described by a sequence of the basic forms. The reported recognition rate of the system is 86%.
Abuhaiba [1] presented a text recognition system capable of recognizing offline handwritten Arabic cursive text. A straight-line approximation of an offline stroke is converted to a one-dimensional representation, from which tokens are extracted; the tokens of a stroke are then re-combined into meaningful strings of tokens. Algorithms to recognize and learn token strings were presented, along with a process for extracting the best set of basic shapes representing the token strings that constitute an unknown stroke. A method was also developed to extract lines from pages of handwritten text, arrange the main strokes of the extracted lines in the order in which they were written, and assign secondary strokes to main strokes; the secondary strokes are combined with basic shapes to obtain the final characters by formulating and solving assignment problems. The system was tested on the handwriting of 20 subjects, yielding overall sub-word and character recognition rates of 55.4% and 51.1%, respectively.
In general, the strategies to segment cursive script for recognition purposes can be classified into two broad categories: In the first category, the word is segmented into several characters and character recognition techniques are applied to each segment. This method depends heavily on the accuracy of the segmentation points found. However, such an accurate segmentation technique is not yet available.
In the other category, a loose segmentation to find a number of potential segmentation points in the pre-segmentation procedure is performed. The final segmentation and the word length are determined later in the recognition stage with the help of a lexicon.
Although there have been many attempts to segment cursive script into characters, very few complete Arabic OCR systems exist, and their performance still lags behind that of Latin and Chinese systems. In [39][40], Kanungo et al. reported evaluation results for two popular Arabic OCR products: Sakhr’s Automatic Reader 3.01 and Caere’s OmniPage Pro v2.0 for Arabic. Sakhr’s Automatic Reader is the best-known Arabic OCR software. Their evaluation established page accuracy rates of 90.33% for Sakhr and 86.89% for OmniPage. These are not high recognition rates compared with those of Latin and Chinese OCR systems, and again this is mainly due to the cursive nature of the Arabic script.
CHAPTER 3 Video OCR
3.1 Introduction
Optical character recognition for Roman-script languages is almost a solved problem for document images, and researchers are now focusing on the extraction and recognition of text from video scenes. This new and emerging field of character recognition is called Video OCR, and it has numerous applications, such as video annotation, indexing, retrieval, search, digital libraries, and lecture video indexing.
The emerging field of character recognition in video frames is attracting research on other scripts such as Chinese, but to the best of our knowledge no work has yet been reported on Video OCR for Arabic-script languages such as Arabic, Persian and Urdu.
As an extension of our Nastalique OCR to Video OCR for Arabic-script languages, we have also performed experiments on video text identification, localization and extraction for recognition. We use a MACH (Maximum Average Correlation Height) filter to identify text regions in video frames; these regions are then localized and extracted for recognition. All research and development work was done in Matlab 7.0, and the experiments and results are reported in this thesis.
Extensive literature survey on Video OCR including discussions on various techniques used by researchers is also included here.
3.2 Introduction to Video OCR
Machine recognition of text images from printed or handwritten documents has been a popular area of research and development, with a considerable degree of success. A recent extension of this field is text recognition from video frames, which is a substantially different problem: text in printed documents is restricted to uni-coloured characters on a uniform background, making it relatively easy to separate the text from the background.
In video text recognition, a number of noise components make the text comparatively harder to separate from the background. In addition, the characters may move and can appear in a variety of non-standardized colours, sizes and fonts, and the background itself is usually moving, which complicates text extraction further.
Most videos contain two kinds of text: scene text and artificial text. Scene text becomes part of the scene itself because it is recorded while the scene is filmed; artificial text is created independently of the scene and overlaid on it at a later stage, during post-processing. The appearance of artificial text is therefore carefully directed, and this type of text carries important information that helps in video referencing, indexing and, later, retrieval.
Such superimposed text on video frames is also used for annotation, semantic video analysis and search. An example is the graphic logo that appears continuously on broadcasts at some fixed place on the screen, giving the name of the station or conveying similar information to the viewers. This kind of overlaid text is usually easier to recognize automatically and can be used as annotation.
On news broadcasts and reports, anchor names and locations appear on the screen and are readily recognizable, as they can be extracted by focusing on the bottom third of the screen.
Video text serves a variety of purposes. At the beginning of a video there are titles, names and other information necessary for understanding or appreciating it; at the end come the credits and other details, including references.
Sports coverage often includes text displaying the current score; interviews and documentaries display speakers’ names or translated subtitles. Within a broadcast, text also gives important information about the subject currently on screen, its location, or other relevant details.
Similarly, text appearing in commercials and advertisements relays the message, the product’s name, the company’s name, and so on. These text appearances are all carefully directed and are important for a complete understanding of the video; they are comparatively easy to recognize because they are deliberately overlaid on the scene.
In many videos, text often appears as part of the scene itself: words printed or written on billboards, names of shops, or even words written on a wall, all of which become part of the scene at shooting time. This kind of text is the most difficult to separate from the background and recognize, because it can appear almost anywhere on the screen, under any lighting, in any shape, size, font, style or orientation; for example, words on banners and placards may be straight or wavy, tilted or slanting, and captured from any camera angle.
Text embedded in the view of the scene is thus fundamentally different from superimposed text, and recognizing it is a most demanding process; the difficulties arise from background complexity and variation and from inherently low resolution.
3.3 Types of video text
There are two types of text in videos:
i- Scene text: appears written on billboards, buildings, banners, shirts, etc. in the video scene.
ii- Artificial text: added to the video by editing devices after the video is made; it usually appears at a particular position on the screen, e.g. horizontally in the lower part. Examples are speakers’ or newscasters’ names, information about them, translated dialogue in a film, and subtitles.
3.4 Applications of Video-OCR
Videotext, i.e. the superimposed or artificial text on the frames, can be used for:
i- video annotation,
ii- indexing,
iii- retrieval,
iv- semantic video analysis,
v- search,
vi- digital libraries,
vii- home video summarization, and,
viii- lecture video indexing.
3.5 Literature Survey
Nevenka Dimitrova, Lalitha Agnihotri, Chitra Dorai and Ruud Bolle described the following computational framework for extracting text regions from video frames [53].
They classified approaches to extracting text from video into three categories: (i) methods that use region analysis to extract text components, (ii) methods that perform edge analysis to extract characters, and (iii) methods that use texture features to locate the presence of text.
However, the authors suggested that a common framework can be devised to describe a generic videotext extraction system. Given an image or a sequence of frames with colour or grayscale values, they proposed a framework with three major steps to extract the text embedded in the frames:
Step 1: Removal of non-text background content. The first step attempts to remove all non-textual background, leaving candidate text regions in the video frame. This separation can be achieved by several different approaches. One approach employs region analysis to find chromatically or spatially connected, homogeneous components in the frame and groups them to locate text regions: pixels of similar colour or intensity are merged, regions that are too large or too small are rejected, and only candidate text regions are retained.
Another approach applies edge detection to the video frame to identify and locate text regions. This involves detecting and analysing the geometrical arrangement of the video content, separating out regions that form regular geometrical patterns; text lines are detected on the basis of edge directions, filtering them out from the background scene. Other approaches use texture-based analysis, relying on the observation that text exhibits a distinctive contrast with the background and a certain connectivity in the horizontal run of characters, reflected in varying colour and intensity as characters connect or space out.
Step 2: Verification of text characteristics. The next step analyses the extracted regions to rule out false positives. After the candidate text regions have been drawn out of the video frames, individual regions are analysed, which may involve grouping connected components. Common features of text and characters serve as criteria for judging whether an extracted region really exhibits text attributes; such features include the monochromaticity of individual characters, character-size restrictions, horizontal alignment of text, and consistent inter-character spacing.
Step 3: Consistency analysis for output. The last step prepares the remaining text regions for the intended use of the detected text. One approach for the final detection of text regions is to group characters into words and words into text lines, which allows their bounding blocks to be marked off. This approach emphasizes the automatic location of text and therefore requires human intervention to identify and recognize the characters inside the bounding blocks.
Other approaches analyse the text regions further to extract clear-cut character boundaries, generating frames that are ready to be fed to an OCR system. These frames are represented as bitmaps and can then be recognized automatically. The recognized characters, words and lines can finally be used to build annotation indices for video referencing or for querying a video database.
Rainer Lienhart and Wolfgang Effelsberg, in ‘Automatic Text Segmentation and Text Recognition for Video Indexing’ [61], reported a feature-based scheme to segment and recognize video text for automatic indexing. The proposed scheme works as follows:
i- presentation of text features;
ii- description of segmentation algorithms based on the described text features;
iii- text recognition.
The study also describes a straightforward indexing and retrieval scheme, used in the experiments to demonstrate the suitability of the algorithms for indexing and retrieval of video sequences.
The emphasis is on the extraction of artificial text. The authors note that the appearance of artificial text is subject to more constraints than that of scene text, because it is deliberately made to be easily readable.
Text features. Most artificial text appearances share the following characteristics:

i- Characters are in the foreground; they are never partially occluded.
ii- Characters are monochrome.
iii- Characters have size restrictions: a letter is neither as large as the whole screen nor smaller than a certain number of pixels, as it would otherwise be illegible to viewers.
iv- Characters are rigid; they do not change shape, size or orientation from frame to frame.
v- Characters are either stationary or moving linearly.
vi- Characters are mostly upright.
vii- Moving characters have a dominant translation direction: horizontally from right to left or vertically from bottom to top.
viii- Characters appear in clusters at a limited distance, aligned to a horizontal line, since that is the natural way of writing words and word groups.
ix- Characters contrast with their background, since artificial text is designed to be read easily.
Their text segmentation algorithms are based on these features. However, they also take into account the fact that some of these features are relaxed in practice due to artifacts caused by the narrow bandwidth of the TV signal or other technical imperfections.
Edges are localized by means of the Canny edge detector extended to colour images: the standard Canny detector is applied to each image band, and the results are integrated by vector addition. Edge detection is completed by non-maximum suppression and contrast enhancement.
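The per-band integration step can be illustrated as follows. For a self-contained sketch, simple finite-difference gradients stand in for the full Canny detector (with its smoothing and non-maximum suppression), so this shows only the vector-addition idea:

```python
import numpy as np

def color_gradient_magnitude(img):
    """Compute gradients on each colour band separately and integrate
    them by vector addition of the band gradient components, in the
    spirit of the colour edge detection described above."""
    gx = np.zeros(img.shape[:2])
    gy = np.zeros(img.shape[:2])
    for b in range(img.shape[2]):             # per-band gradients
        by, bx = np.gradient(img[:, :, b].astype(float))
        gx += bx                              # vector addition of the
        gy += by                              # band gradient components
    return np.hypot(gx, gy)                   # combined edge magnitude

img = np.zeros((5, 5, 3))
img[:, 2:, :] = 255                           # a vertical colour edge
mag = color_gradient_magnitude(img)
print(mag[2, 2] > mag[2, 0])                  # strongest response at the edge
```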
Recognition of text. During implementation, the OCR software development kit Recognita V3.0 for Windows 95 was integrated for text recognition. Since most overlaid text in video images is written in block letters, the OCR module for typed text was used to translate the binarized text images into ASCII.
Overall, the recognition process could give better, more accurate results by exploiting the repeated occurrence of the same words in consecutive video frames. However, the same character appears in slightly different form in different frames, owing to noise, changes in the background, or the relative position of the word itself. By overcoming these hitches and combining the recognition results into a single final character, higher recognition accuracy may be achieved.
In digital videos, artificial text appears in a wide variety of fonts, decorative or plain, and in various sizes and styles. This variety compounds the OCR problem, and errors are inevitable. A further problem is coping with garbage characters resulting from non-character regions that neither the system nor the OCR module could remove.
Jinsik Kim, Taehun Kim and Jiexi Lin, in ‘Implementation of a Video Text Detection System’ [33], a term report for CS570 Artificial Intelligence at KAIST, South Korea, described the implementation of a video text detection system. The authors describe their methodology as follows:
i- Edge-based detection: relies on the distinct difference in brightness and colour intensity between the text and the background.
ii- Area-based detection: relies on the principle that the colour within a text area is more or less uniform.
iii- Texture-based detection: relies on the fact that a text area exhibits a clearly different texture from the background, which makes it stand out and allows it to be separated out and prepared for recognition through segmentation techniques.
iv- Continuous-frame detection: compares each video frame with the next and the previous one to spot appearing and disappearing text.
Edge detection is by far the most widely used, and perhaps the most effective, method for video text extraction, and was therefore chosen as the method for their experiments.
Edge detection. Edge detection can be performed by many methods, e.g. Sobel and Canny edge detection. The authors used Canny edge detection and report finer results with it.
The scheme can be summarized as follows:
video frame → Canny edge detection → long-line removal → horizontal stroke detection → vertical stroke detection → text area detection → bounding-box detection → validation → recognition of the detected text.
Chong-Wah Ngo and Chi-Kwong Chan, in ‘Video text detection and segmentation for optical character recognition’ [16], presented an approach to detecting, segmenting and recognizing text in video frames. They show that by classifying background complexity, appropriate operators can be applied selectively to detect text in video frames of different modalities.
To remove noise from candidate text regions of high edge density, effective operators such as repeated shifting operations are applied, and a text enhancement technique highlights text in low-contrast regions. Finally, a coarse-to-fine projection technique extracts text lines from the video frames.
Experiments with the described methods are reported to achieve higher recognition accuracy than machine-learning-based systems such as SVMs and ANNs; the segmented foreground text is then recognized using a commercial OCR package.
The work is described as follows:
Extracting text information from videos is divided into three major steps:
i- Text detection: locating the regions containing text.
ii- Text segmentation: segmenting the text in the detected regions; the result is usually a binary image suitable for recognition.
iii- Text recognition: converting the text in the video frames into ASCII characters.
[Figure: overview of the system]
Christian Wolf and Jean-Michel Jolion, in ‘Extraction and Recognition of Artificial Text in Multimedia Documents’ [17], Technical Report RFV RR-2002.01, Laboratoire Reconnaissance de Formes et Vision, INSA, France, reported a novel technique to extract and recognize video text.
This research focused on text extraction and recognition in videos. The approach automatically extracts text from the videos in a database and saves it with a link to the video sequence and frame number. The focus is on key words: the user submits a request by providing some keywords, which are robustly matched against the previously extracted text in the database, and videos containing the keywords (or part of them) are presented to the user. This can also be combined with image features such as colour or texture.
At the detection stage, the algorithm is applied to individual frames separately: text rectangles are identified, extracted, and passed to the tracking step, where they are matched against text images from other frames. From this matching, an enhanced version of each text image is generated and converted to a bitmap; finally, the binarized images are passed to a commercial OCR system for recognition.
The authors have described the steps of recognition in the following way:
Gray-level constraints. The focus here is on detecting horizontal text lines in the artificial text region. Since this text is deliberately designed to be easily readable, the contrast between the text and the background can be assumed to be high; however, the background cannot be assumed to remain uniform, so arbitrarily complex backgrounds must be supported. The other property considered is the colour of the text: because the text is easily readable, it can be extracted from the luminance information embedded in the video signal alone, so all video frames are converted to grayscale images as a pre-processing step.
Temporal constraints – tracking. Text that needs to be recognized remains present for a series of video frames, allowing enough time for the various recognition processes. This property is exploited to group similar text and to reject false detections, using an overlap technique in which the list of detected rectangles is matched against the list of currently active rectangles; the text visible in one frame is expected to be visible in the next.
Temporal constraints – multiple-frame integration. Before the images of a text appearance are passed to the OCR software, all its images from the various video frames are extracted and their contents combined to generate an image of better quality and higher resolution than any single frame, without adding extraneous information. This step is important because most commercial OCR programs are developed for scanned documents.
Final binarization. Finally, the enhanced image is binarized. Low-quality videos do not produce text images with the same characteristics as scanned documents, and the authors argue that simpler algorithms are more robust to the noise present in video images. For this reason they started from Niblack’s method and derived a similar method based on a criterion that maximizes local contrast.
Niblack’s algorithm calculates a threshold surface by shifting a rectangular window across the image. The threshold T for the centre pixel of the window is computed from the mean m and the standard deviation s of the gray values in the window, as T = m + k·s, where k is a constant (typically −0.2 for dark text on a light background).
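Niblack's rule can be sketched directly. This is a minimal, unoptimized reading (the k = −0.2 choice and the toy image are assumptions; real implementations use integral images for speed):

```python
import numpy as np

def niblack(img, window=3, k=-0.2):
    """For each pixel, T = m + k*s over a window centred on it, where
    m and s are the local mean and standard deviation. Returns True
    where the pixel exceeds T, i.e. background for dark-on-light text."""
    h, w = img.shape
    half = window // 2
    out = np.zeros((h, w), dtype=bool)
    for r in range(h):
        for c in range(w):
            win = img[max(r - half, 0):r + half + 1,
                      max(c - half, 0):c + half + 1].astype(float)
            out[r, c] = img[r, c] > win.mean() + k * win.std()
    return out

gray = np.full((3, 3), 200)
gray[1, 1] = 50                   # one dark "text" pixel
binar = niblack(gray)
print(binar[1, 1], binar[0, 0])   # text pixel below T, background above
```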
They then passed the final binarized text boxes to a commercial OCR product (Abbyy
Finereader 5.0).
Chuan-Jie Lin, Che-Chia Liu and Hsin-Hsi Chen, in ‘A Simple Method for Chinese Video OCR and Its Application to Question Answering’ [18], propose a simple video OCR method for Chinese captions, covering image capture, caption-region detection, background removal, character segmentation, OCR and post-processing.
The characteristics of captions are:
i- they always run in a straight line, left to right or top to bottom;
ii- the characters usually have colours that contrast with the background, and often have perceivable borders;
iii- they are always in the foreground of the image;
iv- they usually consist of two or more characters;
v- the height of the caption region is usually no more than one third of the image height, because characters cannot be too large or too small for reading;
vi- they have fixed height, width and size;
vii- they have fixed colours.

These characteristics are employed to locate captions.
Removing Backgrounds by Means of Multiple Images

After detecting a sequence of images with the same caption text, they use the following method to remove the backgrounds.
Let NumFrames be the total number of sequential images. They consider each point in the
caption region. If it is black in 90% of the images (i.e., NumFrames × 0.9), then they set the
point as black. Otherwise, it is set as a white point.
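The voting rule above can be sketched directly. The binary encoding (0 for black, 255 for white) is an illustrative assumption:

```python
import numpy as np

def remove_background(binary_frames, ratio=0.9):
    """Keep a pixel black only if it is black in at least `ratio` of the
    sequential frames showing the same caption; background pixels, which
    change between frames, fail the vote and are set to white.

    `binary_frames`: sequence of binary images, 0 = black, 255 = white.
    """
    stack = np.stack(binary_frames)            # (NumFrames, H, W)
    black_votes = (stack == 0).sum(axis=0)     # how often each pixel is black
    threshold = len(binary_frames) * ratio     # NumFrames x 0.9
    return np.where(black_votes >= threshold, 0, 255).astype(np.uint8)
```

Because caption pixels are stable across the frame sequence while the scene behind them moves, the vote keeps the caption and erases most of the background.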
Character Segmentation

This approach works on the basis that a binary image of the caption has initially been obtained, bearing black characters on a white background. Traditional OCR systems are then used to extract the text image from the background. Boundaries for each character are then marked off, after which recognition is performed by the OCR.
They first decide the right and left boundaries of each character, using projection profiles to segment the text into individual characters. The projection profile sums the black points in each column onto a horizontal line. The projection over a space between Chinese characters is zero, but gaps can also be detected within a character itself. This problem is solved with another feature of Chinese characters: their width is roughly equal to their height, so the height of the caption region can be taken as the character width, and possible segmentation points are marked off by considering only gaps that lie at a distance of 0.7 to 1.4 times the image height from the previous gap.
Once the left and right boundaries have been decided, a similar method is employed to decide the upper and lower boundaries of each character.
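A minimal sketch of projection-profile segmentation, assuming black text (0) on a white (255) background. The refinement that merges gaps falling inside a roughly square character (the 0.7x–1.4x height rule above) is omitted for brevity:

```python
import numpy as np

def segment_characters(binary_caption):
    """Split a binary caption (black text = 0 on white = 255) into
    character boxes using a vertical projection profile.

    Columns whose profile is zero (no black pixels) are gaps; each run
    of non-zero columns between gaps is a character candidate.
    """
    profile = (binary_caption == 0).sum(axis=0)   # black pixels per column
    boxes, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                             # a run of text columns begins
        elif count == 0 and start is not None:
            boxes.append((start, x - 1))          # run ended at the previous column
            start = None
    if start is not None:
        boxes.append((start, len(profile) - 1))   # run reaches the right edge
    return boxes
```

Each returned pair gives the left and right column of one candidate character; the same idea applied to rows gives the upper and lower boundaries.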
Optical Character Recognition

A statistical model is adopted to perform Chinese OCR. Each Chinese character image is divided into 16 equally sized parts. Signature values are assigned to each part by looking outward from its centre in all four directions: up, down, left and right. In each direction, a black point is sought: if one is found, a signature value of 0 is recorded; otherwise the value is 1. When the process completes, 64 values (16 x 4), called signature values, are obtained for each character image.
For this study, the signature values for a standard corpus of characters were collected from a number of Discovery Channel films. To recognize text extracted from a film, its signature is first extracted and then compared with those in the standard corpus. The number of matching values is counted and the resulting similarity scores are saved. The similarity score always lies between 0 and 64; higher scores indicate that the patterns more closely resemble each other. A new image is regarded as a non-character image if the highest score it attains is less than 16.
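The signature scheme might be sketched as follows; the 4x4 block layout and the ray-scanning details are our reading of the description, not code from the paper:

```python
import numpy as np

def signature(char_img, grid=4):
    """Compute the 64 signature values (16 blocks x 4 directions) of a
    binary character image (0 = black, 255 = white).

    For each block, look from the block centre up, down, left and right
    within the block; a direction scores 0 if a black point is seen
    along the ray, 1 otherwise.
    """
    h, w = char_img.shape
    bh, bw = h // grid, w // grid
    sig = []
    for by in range(grid):
        for bx in range(grid):
            block = char_img[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
            cy, cx = bh // 2, bw // 2
            rays = [block[:cy + 1, cx][::-1],   # up
                    block[cy:, cx],             # down
                    block[cy, :cx + 1][::-1],   # left
                    block[cy, cx:]]             # right
            sig.extend(0 if (ray == 0).any() else 1 for ray in rays)
    return sig

def similarity(sig_a, sig_b):
    """Number of matching signature values, between 0 and 64."""
    return sum(a == b for a, b in zip(sig_a, sig_b))
```

A candidate image is matched against every stored signature; following the paper's rule, a best score below 16 would mark it as a non-character image.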
Jean-Marc Odobez and Datong Chen, in ‘Robust Video Text Segmentation and Recognition with Multiple Hypotheses’[32], describe a robust method for segmenting and recognizing video text, as follows:
The authors of this study present segmentation and recognition techniques for text embedded in videos. The study proposes multiple segmentations of the same text region, producing multiple hypotheses of binary text images.
The segmentation algorithm is formulated as a statistical labeling problem based on a Markov random field (MRF) model of the label map. Background regions in each hypothesis are then removed by performing a connected component analysis and by enforcing a more stringent constraint (called the GCC) on the grayscale values of the text characters using a robust 1D median operator. Each text image hypothesis is then processed by optical character recognition
(OCR) software. The final result is then selected from the set of output strings. Results show that the use of both multiple hypotheses and the GCC significantly improve the results.
Description of the Method

The first step is to locate the text regions in the video frames. The method employed integrates horizontal and vertical edges into a texture-based search for the text areas. Using the baseline technique, these localized text areas are further segmented into single-line text candidates. The extracted text regions are then recognized using a Support Vector Machine (SVM).
The images extracted in the initial text location step are presented as rectangles, and so OCR software will not work directly on them. They must undergo further treatment in order to be understood by the OCR. Experimental evidence suggests that OCR results are not very reliable and depend greatly on the quality of the segmentation procedure that produces the clean images fed to the OCR. The authors present a system in which multiple text layer candidates are provided to the OCR, making the decision process relatively easier for the OCR software.
The authors’ algorithm works as follows: a segmentation step generates the text image hypotheses; connected components are then analyzed; OCR software processes the hypotheses; and the final result is selected from the output strings.
Datong Chen, Kim Shearer and Hervé Bourlard, in ‘Video OCR for Sport Video Annotation
and Retrieval’, [19], give their scheme as:
The authors have introduced a video OCR system for building annotations of sports films.
The system works by automatically extracting closed captions from video frames which they
call ‘cues’ and treating them as key words for reference. This extraction is done by the use of
Support Vector Machines (SVM) that identifies the text regions with closed captions. These
images are then enhanced to generate better quality text images by using two groups of
asymmetric filters. Commercial OCR software is then used to recognize the text.
The whole procedure for automatically extracting text from sports videos can be summarized
in three distinct steps.
i- identification of single-line texts from video frames using SVMs;
ii- text recognition; and
iii- production of cues for annotation.
The employed algorithm extracts texture patterns of texts considering them on the basis of
horizontal and vertical edges mixed together to form word structures.
They integrate this texture information and temporal information for text detection in the
following process:
i- Multiple intensity frame integration is performed by computing the average image of consecutive frames.
ii- Edges are detected in the vertical and horizontal orientations respectively with Canny operators.
iii- Temporal edge information is integrated by keeping only edge points that appear in two consecutive average images.
iv- Vertical and horizontal edges are dilated respectively into clusters. Different dilation operators are used so that vertical edges are connected in the horizontal direction while horizontal edges are connected in the vertical direction. The dilation operators are designed to have rectangular shapes: vertical operator 5×1, horizontal operator 3×6.
v- Vertical and horizontal edge clusters are integrated by keeping the pixels that appear in both the vertical and horizontal dilated edge images.
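The five steps above can be sketched with plain NumPy. A simple gradient threshold stands in for the Canny operators, and the orientation of the 5×1 and 3×6 operators (wide for vertical edges, tall for horizontal edges) is our interpretation of the stated shapes:

```python
import numpy as np

def dilate(mask, height, width):
    """Binary dilation of a boolean mask with a height x width rectangle."""
    out = np.zeros_like(mask)
    ph, pw = height // 2, width // 2
    padded = np.pad(mask, ((ph, ph), (pw, pw)))
    for dy in range(height):
        for dx in range(width):
            out |= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out

def text_candidate_mask(frames, threshold=40):
    """Average consecutive frames, find vertical/horizontal edges, keep
    edges present in both average images, dilate each edge map with a
    rectangular operator, and intersect the two clusters."""
    a = np.mean(frames[:len(frames) // 2], axis=0)   # first average image
    b = np.mean(frames[len(frames) // 2:], axis=0)   # second average image

    def edges(img):
        gx = np.abs(np.diff(img, axis=1, prepend=img[:, :1]))  # vertical edges
        gy = np.abs(np.diff(img, axis=0, prepend=img[:1, :]))  # horizontal edges
        return gx > threshold, gy > threshold

    vx_a, hy_a = edges(a)
    vx_b, hy_b = edges(b)
    v = vx_a & vx_b                    # temporal integration of vertical edges
    h = hy_a & hy_b                    # temporal integration of horizontal edges
    v_clusters = dilate(v, 1, 5)       # link vertical edges horizontally
    h_clusters = dilate(h, 6, 3)       # link horizontal edges vertically
    return v_clusters & h_clusters     # pixels present in both cluster maps
```

Text-like regions, which are dense in both edge orientations, survive the final intersection; isolated edges of one orientation are suppressed.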
Because the text may have varying gray-scales in video frames, they choose a gray-scale independent feature: the distance map (as explained below) of a 16×16 sliding window is the input feature of the SVMs. The SVMs are trained, using the software package SVMTorch, on a database of 6000 samples labeled as text or non-text (false alarms resulting from text line location).
Rainer Lienhart and Frank Stuber, ‘Automatic text recognition in digital videos.’ Technical
Report TR-95-036, Department for Mathematics and Computer Science, University of
Mannheim, Germany, 1995, [60], reported their algorithm for the recognition of text in digital videos.
The authors have developed a system for automatic recognition of text in moving pictures through character segmentation algorithms. Text images are extracted from a variety of places in the videos, including pre-title sequences, credits, and closing sequences, which may contain titles, credits or other text items. These are extracted automatically and efficiently by the employed algorithm. The system focuses on enhancing segmentation quality using typical characteristics of text in videos, resulting in better recognition results.
As segmented characters are drawn out from the videos they can be parsed by any OCR software. Multiple instances of the same character are drawn out from a series of video frames and are combined to generate a higher quality text image. This in turn improves recognition results and computes a final output.
This technique however has been developed only to deal with artificial text that has been carefully laid over the original scene and has no capacity to deal with scene text embedded within the scene. In particular the system deals with the recognition of text that has been added to the videos with the help of video title machines. The premise for this exclusive treatment is that scene text and artificial text imply two distinct ways in which text is laid out in a video therefore requiring different treatments for drawing them out for recognition.
Owing to this difference, the authors have used the words ‘text’ and ‘character’ only to refer to texts produced by video title machines or similar input devices that contribute to text addition after the scene has been shot.
Before characters and, thus, words and text can be recognized, the features of their appearance have to be analyzed.
Their list of features includes:
i- Characters are monochrome. Only a very small percentage are polychrome, and these are of no further interest here.
ii- Characters are rigid. They do not change their shape, size or orientation from frame to frame. Again, the very small percentage of characters that do change size and/or shape are of no further interest here.
iii- Characters have size restrictions.
iv- A letter is not as large as the whole frame, nor are letters smaller than a certain number of pixels, as they would otherwise be illegible to human beings.
v- Characters are either stationary or linearly moving.
vi- Stationary characters do not move. Their position remains fixed over multiple subsequent frames. Moving characters move steadily and also have a dominant translation direction: generally, they move either horizontally or vertically. Moreover, many just move from right to left or bottom to top.
vii- Characters contrast with their background.
viii- Artificial text is designed to be read and, thus, must be in contrast to its background.
ix- The same characters appear in multiple consecutive frames (temporal relation).
x- Characters often appear in clusters at a limited distance aligned to a horizontal line (spatial relation), since that is the natural method of writing down words and word groups. But this is not a pre-requisite, just a strong indicator. From time to time a single character might appear alone on one line.
xi- Character outlines/borders are degraded by current TV technology and digitizer boards. Characters often blend into the background, especially on their left side. Monochrome-designed characters do not appear to be monochrome any more. The color is very noisy and sometimes changes slightly spatially and temporally, e.g. by interference with the colors of the surroundings.
Even stationary text may jump around by a few pixels. Those are typical analog television/
video artifacts. Any (artificial) text segmentation and recognition approach has to be based
upon these observed features.
Segmentation of Character Candidate Regions

Theoretically the segmentation step extracts all pixels belonging to text appearing in a video. However, this cannot be done without knowing where and what the characters are. Therefore, the actual aim of the segmentation step is to divide the pixels of each frame of a video into two classes:
i- regions which do not contain text, and
ii- regions which possibly contain text.
Regions which do not contain text are discarded, since they cannot contribute anything to the recognition process, while regions which might contain text are kept. The authors call these character candidate regions, since they are (not exactly) a superset of the character regions. They are passed on to the recognition step for evaluation.
The segmentation process can be divided into three parts, each part increasing the set of non- character regions of the previous part by further regions which do not contain text, thereby reducing the character candidate regions, approximating them more and more to real character regions. They first process each frame independently of the others. Then, they try to profit from the multiple instances of identical text in consecutive frames.
Finally, they analyze the contrast of the remaining regions in each frame to further reduce the
number of candidate regions and to build the final character candidate regions.
Character Recognition

After the segmentation process isolates the text regions in the frames, each of these frames has to be parsed by OCR software. The experiment reports using the authors’ own OCR system, which applied a feature vector classification approach for recognition. The results were not perfect, and it is suggested that more reliable commercial OCR software would give higher quality text recognition results.
Wing Hang Cheung, Ka Fai Pang, Michael R. Lyu, Kam Wing Ng, and Irwin King, in
‘Chinese optical character recognition for information extraction from video images’, [71]
presented a simple method for the recognition of Chinese video text.
Chinese and English scripts are fundamentally different from one another and therefore
require entirely different approaches for character recognition. There is a strong need to
develop methods for recognizing the Chinese text. For this study the authors applied OCR
techniques to moving images for extracting characters from the video frames. Through this
method they were able to automatically extract video text, and then use the Chinese subtitles
for indexing and searching in a digital video library. They applied methods to filter out the
heavy noise and segment out each Chinese character in video segments.
The authors have also given details of how the OCR was applied to detect Chinese characters
and evaluate its performance.
The Chinese video text extraction system was based on the following assumptions.
i- Text characters remain in the foreground of the scene and are not obscured by other contents.
ii- They stand in contrast with the background and are presented in a monochrome scheme.
iii- Their forms are upright and rigid, and therefore it is not expected that character shapes, sizes or orientations will change from frame to frame.
iv- They are restricted to a specified size limitation, neither exceeding a certain size nor being smaller than a certain minimum.
v- They do not slide into a scene, nor fade in or fade out; rather they pop out onto the screen.
vi- With reference to Chinese subtitles in videos, the text appears in clusters aligned horizontally from left to right.
All these features form the basis for the noise filtration and character segmentation procedures employed during the experiments, although the video background makes the task of extracting characters more difficult. To extract characters from videos, the position of the lines of characters must first be located. Since the text area is rich in edges and stands in good contrast against the background, edge density is used as a feature to locate the text areas. This property of high edge density has thus been used as the basis for developing a similar segmentation method for Chinese characters.
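A minimal sketch of edge-density text-line location. The gradient threshold and the "twice the average density" rule are illustrative assumptions, not values from the paper:

```python
import numpy as np

def edge_density_rows(gray, threshold=40):
    """Return the row indices of candidate text lines: rows whose share
    of strong-gradient (edge) pixels is well above the frame average."""
    g = gray.astype(float)
    gx = np.abs(np.diff(g, axis=1, prepend=g[:, :1]))  # horizontal gradient
    gy = np.abs(np.diff(g, axis=0, prepend=g[:1, :]))  # vertical gradient
    edges = np.maximum(gx, gy) > threshold             # strong edge pixels
    density = edges.mean(axis=1)                       # edge density per row
    return np.where(density > 2 * density.mean())[0]   # illustrative cut-off
```

Consecutive selected rows would then be grouped into text bands, and the same profile applied to columns bounds each band horizontally.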
Xian-Sheng Hua, Pei Yin, and HongJiang Zhang, in ‘Efficient video text recognition using
multiple frame integration’, [72], reported as:
Artificial text that is superimposed on a scene is usually a carrier of important information about the video or broadcast itself. This information is of significant use for video indexing and retrieval. Research in this area (video OCR) has recently become popular and
widespread. However there are complexities that need to be overcome in order to create an
efficient video OCR, the major hurdles in this regard being low resolution and background
complexities. The authors of the study have aimed to introduce an efficient system to
overcome, at least, the second difficulty: that of reducing background complexities. They
have used multiple frames with repeated occurrences of the same words to obtain the clearest
image of the same character. However before this extraction procedure a multiple frame
verification procedure is adopted to exclude the possibility of receiving false text alarms.
In the next step those frames are selected in which the text is the clearest, which makes the recognition process easier. They then detect and join all the clearest, similar frames together to generate sets of finer, artificially made frames. On these artificially created frames a block-based adaptive thresholding is applied and the images are binarized.
These binarized frames are finally sent to the OCR engine for recognition.
A summary of the scheme is presented in the following sequence.
Video stream → Text detection → Multiple frame verification → Get frames containing the same text → High contrast frame selection → Block division → High contrast block averaging → Block merging → Block adaptive thresholding → OCR → Text
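The block adaptive thresholding step might be sketched as follows; tiling the image and thresholding each tile at its own mean is a simple stand-in for the exact block rule used by the authors:

```python
import numpy as np

def block_adaptive_threshold(gray, block=8):
    """Binarize by dividing the image into block x block tiles and
    thresholding each tile at its own mean intensity, so that text is
    separated correctly even when brightness varies across the frame."""
    h, w = gray.shape
    out = np.empty((h, w), dtype=np.uint8)
    for y in range(0, h, block):
        for x in range(0, w, block):
            tile = gray[y:y + block, x:x + block].astype(float)
            t = tile.mean()                        # per-block threshold
            out[y:y + block, x:x + block] = np.where(tile > t, 255, 0)
    return out
```

A global threshold would fail when one side of the frame is darker than the other; the per-block threshold adapts to each region's local brightness.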
3.6 Correlation Pattern Recognition
An important problem in image recognition has been finding patterns in images.
Investigations have increasingly been carried out through both optical as well as digital
methods to tackle this problem and an area receiving popular acclaim today is the use of
correlators for identifying recognizable patterns in images.
Correlators are inherently shift invariant, which allows a pattern to be located in the input scene merely by locating the correlation peak. This makes it unnecessary to register or segment an image prior to correlation, as is done in a variety of other methods for pattern recognition.
Amongst the more widely known correlation functions, the Matched Spatial Filter (MSF) is
perhaps the most popular [68]. The MSF depends on a single view of the pattern to be
detected and is optimal for detecting a known pattern in the presence of additive noise.
MSFs, however, have one major drawback: their correlation peak degrades quite rapidly if the input pattern deviates even slightly from the reference pattern. Such pattern
deviations are highly common and may occur due to such minor changes as scale changes
and rotation. This inflexibility renders the MSFs inadequate for most pattern recognition
applications.
A method employed to overcome such distortion sensitivity of the MSFs was to use a different MSF for each different view. This meant that a large number of filters were to be
stored and made available in order to filter a variety of images making this approach rather
impractical.
An alternative solution to this problem was to create composite filters which could be
enhanced to give an optimal distortion tolerance better than that of the MSFs.
Similar techniques for obtaining various types of correlation filters have been tried and
optimized in the recent years [69].
These filters all work on the premise that correlation filters can be ‘trained’ to recognize any
object on the basis of a set of representative views of that object. These sets of representative
views are called ‘training images’ and the correlation filters can adequately recognize an
object with reference to this set as long as distortions are amply characterized by the training
set too. However, proper selection, registration and clustering of training images are
important tasks in the process of designing composite filters.
Here we will consider a class of composite filters whose tolerance to distortions can be analytically optimized: the Maximum Average Correlation Height (MACH) filter [49], which has been experimentally proven to surpass the MSF in performance with regard to handling noise and distortion [48].
3.6.1 The MACH Filter
Maximum Average Correlation Height (MACH) filters are among the most popular composite correlation filters with researchers in pattern recognition.
Here we take the problem of detecting a distorted target image (or pattern) in the presence of
additive noise, discussed at length by B. V. K. Vijaya Kumar et al [70]. The objective is to
design a correlation filter such that its performance is optimized with respect to not only
noise, but also distortions. Let g(m,n) denote the correlation surface produced by a filter h(m,n) in response to the input image f(m,n). Strictly speaking, the entire correlation surface g(m,n) is the output of the filter. However, the point g(0,0) is often referred to as the “filter output” or the correlation “peak”. By maximizing the correlation output at the origin, we force the real peak to be large. The peak filter output is given by
\[ g(0,0) = \sum_{m}\sum_{n} f(m,n)\, h(m,n) = f^{T} h \qquad (3.1) \]
where superscript T denotes the transpose, and f and h are the vector versions of f(m,n) and h(m,n), respectively. Typically, for noise tolerance it is desired that this filter output be as immune to the effect of noise as possible.
Optimizing distortion tolerance rests on the observation that a correlation plane is not an entirely new pattern, but rather a linearly transformed version of the input image created by the filter. If the filter has a high capacity for distortion tolerance, the output will remain relatively unchanged even if the input pattern shows some variations. The consideration is thus not so much the correlation peak, but essentially the complete shape of the correlation surface.
Considering this, a metric for distortion is defined as the average variation in images after filtering. If g_i(m,n) is the correlation surface produced in response to the i-th training image, we can quantify the variability in these correlation outputs by the average similarity measure (ASM), defined as follows:
\[ \mathrm{ASM} = \frac{1}{N}\sum_{i=1}^{N}\sum_{m}\sum_{n}\left[ g_i(m,n) - \bar{g}(m,n) \right]^2 \qquad (3.2) \]
where \( \bar{g}(m,n) = \frac{1}{N}\sum_{i=1}^{N} g_i(m,n) \) is the average of the N training-image correlation surfaces. ASM is a mean square error measure of distortions in the correlation surfaces
relative to an average shape. In an ideal situation, all correlation surfaces produced by a
distortion invariant filter (in response to a valid input pattern) would be the same, and ASM would be zero. In practice, minimizing ASM improves the filter’s stability.
We now discuss how to formulate ASM as a performance criterion for filter synthesis. Using
Parseval’s theorem, ASM can be expressed in the frequency domain as
\[ \mathrm{ASM} = \frac{1}{N d}\sum_{i=1}^{N}\sum_{k}\sum_{l}\left| G_i(k,l) - \bar{G}(k,l) \right|^2 \qquad (3.3) \]
where G_i(k,l) and \( \bar{G}(k,l) \) are the 2-D Fourier transforms of g_i(m,n) and \( \bar{g}(m,n) \), respectively. In vector notation,
\[ \mathrm{ASM} = \frac{1}{N d}\sum_{i=1}^{N}\left\| \mathbf{g}_i - \bar{\mathbf{g}} \right\|^2 \qquad (3.4) \]
Let \( \mathbf{m} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i \) represent the average of the training-image Fourier transforms.
In the following discussion, we treat M and X_i as diagonal matrices with the same elements along the main diagonal as in the vectors m and x_i. Using the frequency domain relations \( \mathbf{g}_i = X_i^{*}\mathbf{h} \) and \( \bar{\mathbf{g}} = M^{*}\mathbf{h} \), the ASM can be rewritten as follows:
\[ \mathrm{ASM} = \frac{1}{N d}\sum_{i=1}^{N}\left\| X_i^{*}\mathbf{h} - M^{*}\mathbf{h} \right\|^2 = \frac{1}{N d}\, \mathbf{h}^{+} \sum_{i=1}^{N} \left( X_i - M \right)^{*}\left( X_i - M \right) \mathbf{h} \qquad (3.5) \]
\[ = \mathbf{h}^{+}\left[ \frac{1}{N d}\sum_{i=1}^{N} \left( X_i - M \right)^{*}\left( X_i - M \right) \right] \mathbf{h} = \mathbf{h}^{+} S\, \mathbf{h} \qquad (3.6) \]
where the matrix \( S = \frac{1}{N d}\sum_{i=1}^{N} \left( X_i - M \right)^{*}\left( X_i - M \right) \) is also diagonal, making its inversion relatively easy.
Detection in the presence of noise and distortion also requires that the correlation filter yield large peak values.
To achieve this, the filter’s response to the training images needs to be maximized without imposing the hard constraints on the representative images that are common in traditional Synthetic Discriminant Functions (SDFs). The goal is that the filter should, on average, yield a large peak over the complete training set. This end is achieved by maximizing the average correlation height (ACH) criterion, defined as
\[ \mathrm{ACH} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i^{+}\mathbf{h} = \mathbf{m}^{+}\mathbf{h} \qquad (3.7) \]
Another important consideration is to minimize the effect of noise on the filter’s output, achieved by reducing the output noise variance (ONV). ONV is a quadratic term given by \( \mathbf{h}^{+}C\mathbf{h} \), where C is the noise covariance matrix [69].
Taking the example of a white noise model, and aiming to maximize ACH while minimizing ASM and ONV, the filter is designed to maximize
\[ J(\mathbf{h}) = \frac{\left| \mathrm{ACH} \right|^2}{\mathrm{ASM} + \mathrm{ONV}} = \frac{\left| \mathbf{m}^{+}\mathbf{h} \right|^2}{\mathbf{h}^{+}S\mathbf{h} + \mathbf{h}^{+}C\mathbf{h}} = \frac{\mathbf{h}^{+}\mathbf{m}\mathbf{m}^{+}\mathbf{h}}{\mathbf{h}^{+}\left( S + C \right)\mathbf{h}} \qquad (3.8) \]
The optimum filter which maximizes this criterion is the dominant eigenvector of \( (S + C)^{-1}\mathbf{m}\mathbf{m}^{+} \), i.e.,
\[ \mathbf{h} = \gamma \left( S + C \right)^{-1}\mathbf{m} \qquad (3.9) \]
where γ is a normalizing scale factor. The filter given by Equation (3.9) is referred to as the Maximum Average Correlation Height (MACH) filter, and is regarded as one of the most attractive composite correlation filters for pattern recognition against noisy backgrounds.
3.6.2 OT-MACH Filter
The OT-MACH filter is an extension of the MACH filter, which can be viewed as a composite template representing a general model of an object. It is designed to be invariant to distortions of the object by including the object’s expected shapes during its construction. The basic MACH filter is derived by maximizing the average correlation height (ACH) to get a high peak in the output correlation surface [70]. The OT-MACH filter, however, maximizes ACH while also minimizing:
• the average correlation energy (ACE), to get a sharper peak,
• the average similarity measure (ASM), to achieve distortion tolerance, and
• the output noise variance (ONV), to achieve noise tolerance [62] [49] [13].
Thus, instead of optimizing these performance measures separately, a single energy function (to be minimized) is formed by combining them as follows:
\[ J(\mathbf{h}) = \alpha\,(\mathrm{ONV}) + \beta\,(\mathrm{ACE}) + \gamma\,(\mathrm{ASM}) - \delta\,(\mathrm{ACH}) \qquad (3.10) \]
where \( \mathrm{ONV} = \mathbf{h}'C\mathbf{h} \), \( \mathrm{ACE} = \mathbf{h}'D_x\mathbf{h} \), \( \mathrm{ASM} = \mathbf{h}'S_x\mathbf{h} \), and \( \mathrm{ACH} = \mathbf{h}^{T}\mathbf{m}_x \). The weighting factors α, β, and γ are design parameters used to accomplish the trade-off among the four performance measures. Furthermore, h is the vector form of the 2-D DFT of the filter that minimizes the energy function, m_x is the average of the vector forms of the 2-D DFTs of the training images, and the superscripts (T) and (’) represent the transpose and the conjugate-transpose operations, respectively. C is the diagonal noise covariance matrix, which is normally set to σ²I (where σ is the standard deviation parameter and I is the identity matrix) if the actual noise model is not known. D_x is the diagonal matrix representing the average power spectral density of the training images, defined as:
\[ D_x = \frac{1}{N}\sum_{i=1}^{N} X_i X_i^{*} \qquad (3.11) \]
where X_i is a diagonal matrix whose diagonal elements are the 2-D DFT coefficients (in vector form) of the i-th training image, N is the number of training images, and * denotes conjugation. S_x is the diagonal average similarity matrix, defined as:
\[ S_x = \frac{1}{N}\sum_{i=1}^{N} \left( X_i - M_x \right)\left( X_i - M_x \right)^{*} \qquad (3.12) \]
where M_x is a diagonal matrix having the same diagonal elements as the elements of m_x. The filter that minimizes the composite energy function given in Eq. (3.10) is expressed as [62] [49] [76]:
\[ \mathbf{h} = \left( \alpha C + \beta D_x + \gamma S_x \right)^{-1}\mathbf{m}_x \qquad (3.13) \]
which is called the Optimal Trade-off Maximum Average Correlation Height (OT-MACH) filter.
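Since all the matrices in Eq. (3.13) are diagonal, the synthesis is again an element-wise division in the frequency domain. A sketch, with C taken as the identity (white noise, σ = 1) and illustrative weighting factors:

```python
import numpy as np

def ot_mach_filter(training_images, alpha=0.1, beta=0.1, gamma=0.1):
    """Synthesize the OT-MACH filter h = (alpha*C + beta*Dx + gamma*Sx)^(-1) mx
    of Eq. (3.13) in the frequency domain. All matrices are diagonal, so
    the inversion is an element-wise division; C is taken as the identity
    (white noise with sigma = 1)."""
    X = np.stack([np.fft.fft2(img) for img in training_images])  # (N, H, W)
    mx = X.mean(axis=0)                          # average training spectrum
    Dx = (np.abs(X) ** 2).mean(axis=0)           # average power spectral density
    Sx = (np.abs(X - mx) ** 2).mean(axis=0)      # average similarity diagonal
    C = np.ones_like(Dx)                         # sigma^2 * I with sigma = 1
    return mx / (alpha * C + beta * Dx + gamma * Sx)
```

Raising β sharpens the correlation peak at the cost of distortion tolerance, while raising γ has the opposite effect; tuning these weights is the "optimal trade-off" the filter is named for.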
3.7 Text Region Detection in Video Frames
Text region detection in video frames is a pattern recognition problem in a complex background, and the same techniques that have proven successful in pattern recognition can be applied here with some modifications and enhancements.
We used an extended version of the Maximum Average Correlation Height (MACH) filter called the Optimal Trade-off Maximum Average Correlation Height (OT-MACH) filter to detect text regions in video frames.
We kept our problem simple by focusing on artificial text that appears in the lower one-third region of the screen. This text is usually about the speaker currently on screen, a speech translation, or a breaking news slide.
Our objective is to detect and isolate text regions which appear in the lower one-third region of the screen. For this purpose we implemented an OT-MACH filter in Matlab 7, which successfully detects and isolates the text regions in the video frames and extracts them for further processing.
We synthesize three separate filters for text regions of three different sizes, i.e. 80%, 100% and 120%, using normalized training images of text regions of different sizes, so as to achieve text region size invariance. Specifically, we generate the training images using the following steps for each of the three sizes:
• Extract some templates of different text regions from real-world video sequences
• Resize the templates to an average size
3.8 Results
The OT-MACH filter is trained on the training images (templates) so generated, containing text in the English language only, and tested on real-world video clips which contain English text, along with a few containing Arabic text. The results are very encouraging, with nearly 80% accurate detection and extraction of text regions. There are a few false negatives and some false positives, which we plan to address in future work.
3.8.1 Video OCR Results
3.8.1.1 Training a MACH filter
A MACH filter is trained to detect text regions in the lower one-third screen area.
Training data consist of images of text regions extracted from the frames of video clips containing artificial caption text.
3.8.1.2 Training Images
Training images were obtained from real-world video clips. Here we give a few training images containing artificial text which we used as templates to train the MACH filter.
Figure 3.1: Training images
3.8.1.3 Trained MACH filter
Figure 3.2 illustrates the trained MACH filter on text region images.
Figure 3.2: Trained MACH filter
3.8.1.4 Text area detection and localization
Figure 3.3 illustrates text region detection and extraction from a real world video clip.
The upper part of each image shows text region detection, while the lower part shows the same detected text region extracted into a separate window, in color, for further processing.
Figure 3.3: Text region detected and extracted in color-1
Figure 3.4: Extracted text region is binarized-1
As in figure 3.3, the same process of text region detection and extraction is shown in figure 3.4; however, the extracted text region in the smaller window is further processed to binarize the image.
Figure 3.5: Text region detected and extracted in color-2
Figure 3.5 shows the process of detection and extraction of text region with the extracted text region in color.
Figure 3.6: Extracted text region is binarized-2
The text region extracted in figure 3.5 is shown binarized in figure 3.6.
Figure 3.7: Text region detected and extracted in color-3
The text region detection and extraction with extracted text region in color is shown in figure
3.7.
Figure 3.8: Extracted text region is binarized-3
In figure 3.8 the extracted text region is binarized.
Figure 3.9: Extracted text region in color-4
The extracted text region is shown in color in figure 3.9.
Figure 3.10: Extracted text region is binarized-4
Figure 3.10 shows the binarized extracted text region.
Figure 3.11: Artificial text on plain background
Figure 3.11 shows a video frame where the artificial text appears on a simple plain background: here, white text on a black background. This type of text is easy to detect, extract and further process for character recognition.
3.8.1.5 Video clips with Arabic text
We also applied text region detection and extraction to video clips containing artificial Arabic text. The results in this case were very much the same as for English text, although our system was trained using only images containing English text.
Figure 3.12: Extracted text region in color-1
Figure 3.12 shows Arabic text region detection and extraction. The extracted text region is in color.
Figure 3.13: Extracted text region is binarized-2
Figure 3.13 shows the binarized extracted text region.
Figure 3.14: Extracted text region in color-3
Text region detection and extraction for Arabic artificial text is shown in figure 3.14. The extracted text region is in color.
Figure 3.15: Extracted text region is binarized-4
The binarized extracted text region is shown in figure 3.15.
3.9 Conclusion and Future work
Two main challenges in video OCR are:
i. complex background, and
ii. low resolution of final text image.
We can simplify the problem of complex background by confining our focus to video clips that have artificial text on a simple plain background, such as white text on a black background (as in the example above). In that case we are left with only one big challenge: the low resolution of the final text image.
As almost all video OCR research shows, people use commercial OCR software for the recognition of final text images, so they need to further process the image to obtain an image of acceptable resolution for the OCR software.
We plan to take up the task of post-processing the extracted text image to increase its resolution so that it is accepted by a commercial OCR. We shall first test accuracy on English-language text and, after obtaining successful video OCR results on English text, shall move on to our Urdu OCR and test it on video text images.
CHAPTER 4 Implementation Challenges for Nastalique OCR
4.1 Introduction
In this chapter we discuss the complexities and challenges the Nastalique script offers for its digital implementation generally in Urdu computing and particularly in character recognition.
The two main challenges, among others, are the highly cursive nature of the Nastalique style of writing Urdu, which is in fact calligraphic and has artistic beauty, and the context sensitive behavior of its character shapes.
The Nastalique style is also bi-directional, with characters moving from right to left while numerals run in the opposite direction. As words are written there is also vertical stacking of characters as they are kerned and cursively joined, while some characters move backward, beyond the previous character. These factors add to the difficulty of developing an OCR for Nastalique.
4.2 Nastalique Character Set
Urdu uses an extended, adapted Arabic script; it has 39 characters as against the 28 of Arabic.
Each character then has two to four different shapes depending upon its position in the word; initial, medial, final or isolated.
When a character shape is written alone it is called the isolated character shape. Each of the initial, medial and final character shapes can have multiple instances; the character shape changes depending upon the change in the antecedent or precedent character. This characteristic of having multiple instances of these character shapes is called context sensitivity.
Table 4.1: Shapes of Nastalique characters
A complete language script comprises an alphabet and a style of writing. Urdu uses an extended Arabic script for writing called the Perso-Arabic script. It has two main styles, Naskh and Nastalique. Nastalique is a calligraphic, beautiful and more aesthetic style and is widely used for writing Urdu.
4.3 Nastalique Script Characteristics
Nastalique script has the following characteristics:
• Text is written from right to left
• Numbers are written from left to right
• Urdu Nastalique script is inherently cursive in nature
• A ligature is formed by joining two or more characters cursively in a free-flow form
• A ligature is not necessarily a complete word; in most cases it is part of a word, sometimes referred to as a sub-word
• A word in Nastalique is composed of ligatures and isolated characters
• Word forming in Nastalique is context sensitive, i.e. characters in a ligature change shape depending upon their position and the preceding or succeeding characters
4.4 Computational Analysis of Urdu Alphabet
The Nastalique script is used to write the Urdu alphabet, which is based on the Perso-Arabic script, an extension of the Arabic script.
We present a computational analysis of the Urdu alphabet in the following.
4.4.1 Classes of base shapes in the Urdu alphabet.
The character set of the Urdu alphabet can be classified into 21 basic shapes that may or may not take a different set of dots or diacritical marks to represent a variety of sounds.
Each of the 39 characters takes its shape from one of these basic shapes to represent a single character.
Table 4.2: Base shape classes in Urdu alphabet
For example, as shown in table 4.2, the second basic shape in the set is used to represent five different characters – ب، پ، ت، ٹ، ث – depending on the number, position or placement of the dots that go along with it, the dots suggesting the difference between the characters and serving as a mark of identification for each particular character. These identification marks are quite ambiguous at times and, generally, in handwritten Urdu the presence of a certain character in a word can only be recognized through a contextual understanding of the word image.
4.4.2 Dots in Urdu Characters
Urdu characters have two types of accents: dots and the small tua (ط). We have presented an analysis of Urdu characters with accents in table 4.3, with details in the following.
In the Urdu alphabet, 17 of the characters take dots in a variety of positions, bringing changes in the sounds represented by the basic shape class of characters. These dots are placed in a variety of positions, e.g. below the character, above the character or even inside it. The dots also occur in varying numbers, ranging from one to three, and these in turn are placed in a number of placement combinations; e.g. a set of three dots may be placed in a triangular representation with the pointed side up or down in two different characters (پ، ث).
Table 4.3 presents a variety of ways in which the alphabet can be classified according to the number, position or placement of dots in relation to the characters. It can be observed that there are 17 characters that contain a single dot or a combination of dots, three characters that have a 'tua' sign above the character, and nineteen characters that have neither a dot nor a tua sign attached. There is a further classification, one due to the placement of the dots: e.g. a set of three dots placed on the fourth basic shape 'ra' (ر) makes it 'zha' (ژ), and the same set in its inverted position placed inside the third basic shape 'Ha' (ح) makes it 'cha' (چ). In the same way several characters have a pair of horizontally placed dots with them: e.g. 'ta' (ت) has the two dots placed above, in the centre of the character, but the character 'qaf' (ق) in its initial position has the same set of dots placed in the top right quadrant of the character rectangle that represents its four parts.
Table 4.3: Dots in Urdu characters
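The dot-based identification described above lends itself to a simple lookup table. The snippet below covers only a small illustrative subset of the alphabet, not the full contents of table 4.3:

```python
# Illustrative subset only: (base shape class, dot count, dot position).
# A count of 0 means the character carries no dot.
URDU_DOTS = {
    'ب': ('ba-shape', 1, 'below'),
    'پ': ('ba-shape', 3, 'below'),
    'ت': ('ba-shape', 2, 'above'),
    'ث': ('ba-shape', 3, 'above'),
    'ر': ('ra-shape', 0, None),
    'ز': ('ra-shape', 1, 'above'),
    'ژ': ('ra-shape', 3, 'above'),
}

def same_base_shape(a, b):
    """True when two characters share a base shape, differing only in dots."""
    return URDU_DOTS[a][0] == URDU_DOTS[b][0]
```

Under this view, recognition of the base shape and recognition of the dot pattern are separable sub-problems.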
4.4.3 Context Sensitive shapes in the Urdu alphabet.
The 39 individual characters of the Urdu alphabet have multiple representations of shapes depending upon their position in the word. There are four basic positions that a character might be present in, initial, medial, final and isolated and in each case the character’s form changes considerably. But this does not limit each character’s shapes to four.
Each individual character will form a unique shape at the initial, medial or final position when it combines with a different character; e.g. the character ba (ب) may take as many as 20 different initial shapes as it combines with varying characters, as illustrated in table 4.4.
Table 4.4: Context sensitive shapes of ب initial
This context sensitivity increases the number of character representations in Urdu and Arabic manifold compared to the Latin script languages, which have only two basic ways in which to present a character: lower or upper case.
4.4.4 Comparison of Urdu, Arabic and Farsi Alphabets
Table 4.5: Urdu Alphabet
Table 4.6: Arabic Alphabet
Table 4.7: Farsi Alphabet
The Arabic script was originally adopted for writing Farsi and Urdu as well as numerous other languages. However, as the script was adapted to represent a different set of sounds for each distinct language, variations were made to the number, placements and positions of dots or other accents used with the characters to represent the sounds. The Arabic alphabet has a set of 28 characters, table 4.6, while Farsi, table 4.7, has 4 added characters, which make it possible to represent the 'zha' (ژ), 'pa' (پ), 'cha' (چ) and 'ga' (گ) sounds that are not present in the Arabic language. In the same way Urdu, table 4.5, has a supplementary set of 11 characters that have been derived by taking the basic character shapes and adding dots or accents to them in an array of combinations. Tables 4.5, 4.6 and 4.7 represent the way the Arabic script has been adopted by Urdu and Persian to correspond to each language's distinct sounds.
4.4.5 Bi-directional pen movement (from top left to bottom right)
The formation of words as Urdu is scribed on paper has a unique pattern of flow. The words begin to be written from the top right end and finish at the bottom left, figure 4.1. This pattern follows for writing every word in which each character curves and joins with the preceding and succeeding ones to create uniquely shaped ligatures and words.
Figure 4.1: Bi-directional pen movement
The arrows in the given word images represent the direction of the pen movement, showing that Urdu words are written in a bi-directional mode; that is, there are two movements involved in relation to the directions the forming words take,
a. the movement from right to left and,
b. the movement from top to bottom.
4.4.6 Bi-directional writing (numbers written from left to right)
Although Urdu writing follows the Arabic pattern of writing from right to left, when numbers are written down they follow the Latin pattern of writing from left to right. In the example shown below, figure 4.2, the given sentence is presented in a right-to-left flow, while the dates set inside – '1947' and '14' – although they take Arabic numerals, are written in the Roman form, from left to right.
Figure 4.2: Bi-directional writing
This feature is another aspect of Urdu’s bi-directional writing style.
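This bi-directional behaviour is mirrored in the Unicode bidirectional categories of the characters, which a rendering engine consults when laying out mixed text. A quick check with Python's standard unicodedata module:

```python
import unicodedata

# Arabic letters carry the right-to-left category 'AL' (Arabic Letter),
# while the digits used for dates like '1947' carry 'EN' (European Number),
# which is laid out left to right even inside right-to-left text.
print(unicodedata.bidirectional('ب'))  # AL
print(unicodedata.bidirectional('1'))  # EN
```

The Unicode Bidirectional Algorithm uses exactly these categories to produce the mixed-direction layout described above.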
4.4.7 Context Sensitive shapes of the character "Quaf" (ق)
Each character in the Urdu alphabet has a different shape according to its position in the word. In the following example, figure 4.3, the character quaf's various shapes are demonstrated as they appear in various positions in the word. It can be noticed that the three shapes of this character are considerably different from one another in its varying positions, and identifying them would need a substantial amount of training.
Figure 4.3: Context sensitive shapes of quaf (ق)
4.5 Nastalique Script for Urdu
Urdu uses the Arabic script for writing, with a basic character set of 39 against the Arabic 28. There are mainly two styles of Urdu writing, Naskh and Nastalique, of which Nastalique is the more prevalent and is widely used for writing Urdu. Nastalique is a calligraphic, beautiful and more aesthetic style of Urdu writing. Due to its calligraphic nature it poses many difficulties for digital implementation, specifically in text recognition. It is pertinent to define some related terminology here:
4.5.1 Character
According to the Unicode definition, a character is the smallest element of written language that has semantic value. A character is an abstract entity, such as "Latin capital letter A" or "Arabic letter Hah (ح)". Every character has only one code point in digital character representation schemes like ASCII or Unicode. A text file in Unicode will invariably contain references to characters and not to glyphs.
4.5.2 Glyph
The visual representation of a character made on the screen or paper is called a glyph.
A glyph is the shape or form that a character or characters take when represented in writing or displayed in print. Digital fonts consist of glyphs, while natural languages contain characters. A character can have more than one glyph; the character 'a' written in different fonts gives many glyphs of the same character.
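The character/glyph distinction is easy to demonstrate in code: however many glyphs a font provides, the abstract character maps to a single code point. For example:

```python
# Each abstract character has exactly one code point, regardless of how
# many glyphs (font renderings or positional forms) exist for it.
print(hex(ord('A')))  # 0x41  - Latin capital letter A
print(hex(ord('ح')))  # 0x62d - Arabic letter Hah
```

An OCR that outputs Unicode text therefore emits these code points, leaving the choice of glyph to the rendering engine and font.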
4.5.3 Ligature
The Nastalique style of writing is inherently cursive. The character shapes join together cursively to form words or ligatures. A ligature in Nastalique is a complex unit of several characters bound together while one word may be composed of one or more ligatures as well as isolated characters.
4.6 Ligature in Urdu
A ligature in Urdu is a complex unit of several characters bound together cursively to give a single fluid form. A ligature is not a complete word but can be considered as a compound character.
For example, the word ﭘﺎﮐﺳﺗﺎن (Pakistan) is formed by two ligatures and an isolated character, figure 4.4.
Figure 4.4: Ligature in Urdu
4.7 Word Forming in Urdu
A word in Urdu may be composed of one or more ligatures as well as isolated characters.
For example, the word ﭘﺎﮐﺳﺗﺎن (Pakistan) is formed by two ligatures and an isolated character.
Figure 4.5: Word forming in Urdu
A word in Urdu is composed of ligatures and isolated characters. The word ﭘﺎﮐﺳﺗﺎن (Pakistan) consists of two ligatures and an isolated character, illustrated in figure 4.5. Some Urdu words consist of only one ligature, like 'Sohail' (ﺳﮩﯾل) and 'Shams' (ﺷﻣس).
The word 'Qustuntunia' (ﻗﺳطﻧطﻧﯾہ) is among the few words comprising a single ligature with eight character shapes.
4.8 Styles of Urdu Writing
The two prominent styles of writing Urdu are Naskh and Nastalique, figure 4.6.
However Nastalique is the more widely used one.
Naskh is written on a single base line, while Nastalique has multiple base lines, figure 4.6.
Figure 4.6: Styles of Urdu writing
4.8.1 Naskh
Figure 4.7: Naskh style of writing
The Naskh style of writing Arabic script languages has a flat single base line; words are spread horizontally along the base line, taking more space for writing a ligature, figure 4.7.
The Naskh style is easier for character recognition compared to Nastalique due to its linearity in writing, having a single base line and non-overlapping of characters in adjacent ligatures.
4.8.2 Nastalique
Figure 4.8: Nastalique style of writing
The Nastalique style of writing is highly cursive in nature, with multiple base lines, overlapping of characters in adjacent ligatures, and vertical stacking of characters within a single ligature, figure 4.8. All this makes character recognition in Nastalique more challenging.
4.9 Nastalique Script Complexities
No Nastalique OCR exists so far, and published research in Nastalique text recognition is almost non-existent. This is a new undertaking which faces additional challenges, as no standardized groundwork exists: no Nastalique text processor, character-based Nastalique font, Unicode support for Nastalique, or keyboard layout for Nastalique.
The two main issues in Nastalique optical character recognition are its cursive nature and its context sensitivity.
Here we describe the complexities of Nastalique script with particular reference to character recognition.
4.9.1 Cursiveness
The Nastalique style of writing, because of its flowing character forms and shapes, is inherently cursive. The character shapes join together cursively to form words or ligatures. A ligature in Nastalique is a single unit of several characters bound together cursively in a fluid form, creating a variety of compound characters, figure 4.9. A single word may therefore be composed of one or more ligatures as well as isolated characters. Thus Nastalique, with its inherent cursiveness, is a complex script.
Figure 4.9: Nastalique cursiveness
There are 39 different characters in the Urdu alphabet and most of them have three different shapes, depending on the position in which they appear in a word – initial, medial and final – besides an isolated shape.
Characters join together cursively in a fluid form called a compound character or a ligature.
Words are composed of ligatures and isolated characters. For example, the word ﭘﺎﮐﺳﺗﺎن has two ligatures (ﭘﺎ and ﮐﺳﺗﺎ) and one isolated character (ن), figure 4.10.
Figure 4.10: Word forming in Nastalique
4.9.2 Context Sensitivity
Ligatures in Nastalique are unique combinations or units of characters that change their shape according to their position within the unit. Each of the 39 characters in the alphabet can have two to four different shapes depending upon its position in a word. An initial "BA" (ب), for example, which is the second character in the alphabet, is quite different from the shape it bears as a medial, final or isolated one. Added to this is the dependence of each character on the preceding or succeeding characters it joins with. A character might take as many as 20 different shapes according to the character it is joining with. Sometimes even the 3rd or 4th preceding or succeeding character may initiate a change in shape. Thus Ba (ب) has multiple instances, and may have as many as 20 different shapes for its initial form alone. A similar pattern follows for all other characters in the script, which enlarges the database of all possible character shapes for all the letters of the alphabet manifold. Figure 4.11 illustrates two different context sensitive shapes of 'bay' (ب) initial.
Figure 4.11: Context Sensitivity; Two different shapes of Bay-initial
This feature is called context sensitivity – the character shape changing with varying antecedent or precedent characters – and can be presented as an n-gram model in a Markov chain. The first-order Markov model conditions each character shape on its immediate predecessor:
P(Ci | Ci-1)
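Such a first-order model can be estimated from text as bigram relative frequencies. A minimal sketch, in which the single-letter symbols are placeholders rather than the actual character-shape classes:

```python
from collections import Counter, defaultdict

def bigram_model(words):
    """Estimate P(c_i | c_{i-1}) from a list of words by relative frequency."""
    counts = defaultdict(Counter)
    for w in words:
        for prev, cur in zip(w, w[1:]):   # adjacent character pairs
            counts[prev][cur] += 1
    # normalize each row of counts into conditional probabilities
    return {p: {c: n / sum(ctr.values()) for c, n in ctr.items()}
            for p, ctr in counts.items()}

model = bigram_model(['ab', 'ac', 'ab'])
print(model['a']['b'])  # 2 of the 3 continuations of 'a' are 'b'
```

In a shape-recognition setting the states would be context-sensitive shape classes rather than raw letters, but the estimation is the same.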
4.9.3 Dot Positioning
Out of the 39 characters in the Urdu alphabet, 17 have dots above, below or inside the character. These dots can be one, two or three in number, as illustrated in figure 4.12.
There can be three situations of ambiguity between two characters.
Figure 4.12 (a, b, c): Dots in Urdu characters
In situation one, a character has a dot and another does not, figure 4.12 (a).
In the second situation differentiation between two characters is possible only by determining the number of dots which may be in the same position with respect to place, figure 4.12 (b).
In the last situation two characters are differentiated by the position of the dots, figure 4.12 (c).
Situations of ambiguity arise because very small changes in dot numbers or positions are used to bring changes in sounds of characters.
4.9.4 Kerning
Kerning is the adjustment of the space within a character pair. It is usually done by reducing the space between two characters to create a good visual effect; for example, figure 4.13 shows a character pair kerned and not kerned.
Figure 4.13: Roman kerning pair
This shows that if a Roman character pair is kerned, that is, if the natural space between the two characters is reduced, they will overlap each other and become difficult to recognize. This problem is rare in the Roman script but very common in the Nastalique style, figure 4.14, because a great number of ligatures form their natural shape only when kerned, thereby making character identification in scanned texts, particularly in cursive scripts, more complicated and tricky.
Figure 4.14: Nastalique kerning pair
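Kerning itself is stored in the font as per-pair advance adjustments. The toy sketch below shows how a kern table shifts glyph positions; the advance widths and kern values are made up for illustration, not taken from any real font:

```python
# Hypothetical advance widths and kern pairs (values are illustrative only).
ADVANCE = {'A': 10, 'V': 10}
KERN = {('A', 'V'): -2}   # a negative kern pulls the pair closer together

def glyph_positions(text):
    """x-position of each glyph after applying pair kerning."""
    x, positions = 0, []
    for i, ch in enumerate(text):
        positions.append(x)
        x += ADVANCE[ch]
        if i + 1 < len(text):
            x += KERN.get((ch, text[i + 1]), 0)
    return positions

print(glyph_positions('AV'))  # [0, 8] - 'V' starts 2 units earlier than the un-kerned [0, 10]
```

It is precisely this shifting of the second glyph into the first glyph's box that complicates segmentation of kerned pairs.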
4.9.5 Character Overlapping
Besides kerning, which is implemented in the font file, Nastalique poses another challenge for text recognition that is inherent in this style of writing: character overlapping within words. For example, figure 4.15 shows character overlapping in Nastalique.
Figure 4.15: Character overlapping in Nastalique
In both of the cases in figure 4.15 the isolated character 'ا' (Alif) is difficult to identify separately, although both of the words are composed of one ligature and an isolated character.
The cursive nature of Nastalique script results in inter-word as well as intra-word character overlap which makes the character recognition for Nastalique script more challenging.
4.9.5.1 Within a Ligature
Nastalique is naturally cursive connecting each subsequent character with the previous one by means of delicately curving joints. The joining points or curves in characters and ligatures follow a predefined set of rules for formation that are governed by the size of the pen nib with which the words used to be originally written down.
Figure 4.16: Character overlap within a ligature
To form a ligature, each character connects with the adjoining character in a fluid form, beginning from the top right and moving diagonally to the bottom left. This feature makes writing in Nastalique space-saving, as characters are also stacked up vertically as the ligature moves ahead, figure 4.16. It also differs from Arabic Naskh, which is written on a primarily horizontal baseline. One drawback of this feature of Nastalique is that words with multiple characters in a line may clash with words on the preceding lines.
4.9.5.2 Between two adjacent Ligatures
The cursive nature of the Nastalique script results in vertical stacking of characters within a ligature, and adjacent ligatures overlap and shadow each other in most cases, as shown in figure 4.17.
Figure 4.17: Inter-ligature character overlap
4.10 Sloping and Multiple Base-Lines
The Nastalique style of writing does not have a horizontal base-line, as is the case with the Roman script; instead it has a sloping base-line, figure 4.18, because each word or ligature in Nastalique is written from top right to bottom left. Ligatures in Nastalique are tilted at approximately 30-40 degrees.
Figure 4.18: Nastalique sloping base-line
The inherent cursiveness of the Nastalique style of writing gives rise to sloping as well as multiple base-lines, figure 4.19.
Figure 4.19: Multiple baselines
4.11 A Generic OCR Model
Optical character recognition is, in a broad sense, a branch of artificial intelligence, and it is also a branch of computer vision. Nevertheless, it is a distinct discipline in its own right, analogous to speech recognition. If we imagine designing a robot, both reading and listening functions are indispensable if the robot is to be really intelligent. Of course, even a conventional computer must have the capacity to read input documents, such as in office and library work. Furthermore, prototype electronic libraries have recently started to come on-line; in part, the optical character recognition field has grown because of this specific application.
Working of an OCR (figure 4.20): a text page is scanned to produce a page image; the OCR components – preprocessing, segmentation and recognition – then produce a text page shown in a text editor on a display device.
Figure 4.20: Different phases of an OCR
A generic OCR model has a number of phases for optical recognition of printed text as illustrated in figure 4.20. Details of these processes are already covered in chapter one of this thesis.
4.12 Working of a Roman Script OCR
Many languages use the Roman script for writing like English, German and French.
A Roman script OCR performs segmentation at three levels. At the first level the lines of text are identified, and the text image is segmented into lines of text with each line image placed inside a bounding box. This can be done by scanning the text image vertically for black pixels, i.e. computing its horizontal projection profile.
Working of a Roman Script OCR (figure 4.21):
Level 1: Line segmentation – lines are separated using the horizontal projection profile
Level 2: Word segmentation – words are separated using the vertical projection profile
Level 3: Character segmentation – characters are separated using DIP techniques
Figure 4.21: Roman OCR has three levels of segmentation
English printed text has a single, horizontal base-line. Some characters extend above, like b, f, h, l, called ascenders; some extend below the base-line, like g, j, p, q, called descenders; characters such as a, c, e, i remain on the base-line. Lines in a text image can be identified on the basis of the white pixels present between the descenders of one line and the ascenders of the next, or a line of text can be described as all black pixels between the start of the ascenders and the end of the descenders.
Next is word-level segmentation, in which, inside the bounding box containing a line of text, smaller boxes are made to contain each of the words in the line. This is done by detecting the wide white space between words on a horizontal scan from left to right. When we write a word in a text editor we key in the letters; as soon as all the letters are keyed in we press the spacebar, which inserts a white space character, one character in width, after the word; on the basis of this white space, words are separated from each other. At the last level of segmentation, words are broken down into their constituent letters by making even smaller bounding boxes inside the previous ones, one per letter in the word image, by identifying the white pixels between adjacent letters. In English printed text the letters in a word are placed side by side without touching each other; they are not joined or connected. So, at the end of the segmentation process, each character in the given text image is isolated, bound in a separate box, and ready to be presented to the recognition stage. The whole process of isolating individual characters is shown in figure 4.21.
4.13 Working of a Nastalique Script OCR
In the segmentation phase of a Nastalique OCR there are only two levels instead of three, as is the case with a Roman script OCR. The first level of segmentation is the same in both cases, i.e. Nastalique and Roman scripts: the whole text image is segmented into lines of text by placing each line of text inside a bounding box. At the next and final level of segmentation, smaller boxes are made inside the larger bounding boxes containing lines of text, by horizontally scanning the text image from right to left; these smaller bounding boxes contain ligatures or isolated character shapes, as illustrated in figure 4.22.
Working of a Nastalique Script OCR (figure 4.22):
Level 1: Text segmented into lines
Level 2: Lines segmented into words / ligatures
Figure 4.22: Nastalique OCR has two levels of segmentation
In the case of a Nastalique OCR, at the end of the segmentation phase not all character shapes are bounded inside the smaller bounding boxes; rather, ligatures and isolated characters are bounded, which, when presented to the recognition stage, will result only in recognition of isolated characters and not of the ligatures. A ligature in Nastalique is a complex unit of several characters bound together cursively, while one word may comprise one or more ligatures as well as isolated characters.
4.14 Approaches for Nastalique OCR
4.14.1 Character-based Approach
If we follow the segmentation approach for Nastalique OCR, then at the recognition phase we will need all the text images segmented up to the character level, similar to a Roman script OCR; that is, all the ligatures in the words are segmented into their constituent character shapes. For illustration we take the example of the Urdu word ﺑﺷﯾﺮ written in Nastalique. Figure 4.23 shows the segmentation of the word into its respective components.
Figure 4.23: A segmented word in Nastalique
The word consists of four components, namely Bay-initial, Sheen-medial, Yey-medial and Ray-final. Now, breaking a ligature into its components can be done only if we are able to define the points of segmentation. The task is not easy even for a simple ligature like ﺑﮩت, and it becomes more complicated in a ligature where characters overlap each other, like ﻋﺟﺎ or ﻟﻣﺣہ. Even if, by some means, we are able to separate all the character shapes in the ligature ﺑﮩت, we already know that in Nastalique each character shape has multiple instances – the Bay-initial has 20 – so which character code is to be returned upon recognition of the Bay-initial here? This has to be supported in the character encoding scheme as well as in the font file, and the problem still needs to be addressed.
4.14.2 Ligature-based Approach
In the ligature-based approach for a Nastalique OCR, which is also segmentation-free, we rely on the two-level segmentation phase, at the end of which we have isolated character shapes or ligatures separated in bounding boxes, as shown in figure 4.24.
Figure 4.24: Ligature-based approach
A ligature-based Nastalique OCR would require a very large number of ligature images to train a learning machine such as an Artificial Neural Network (ANN), a Hidden Markov Model (HMM) or a Support Vector Machine (SVM). Many researchers have tried this approach in the case of Arabic Naskh OCR; however, they were not able to achieve high recognition accuracy.
In this thesis we explore a more innovative technique: a novel algorithm for implementing a Nastalique OCR which is segmentation-free and still character-based.
CHAPTER 5 The Proposed Nastalique OCR System
5.1 Introduction
The main contribution of this thesis is a new algorithm for the design and implementation of an OCR for printed Nastalique text. This algorithm is character-based and at the same time is segmentation-free.
Here, by character-based and segmentation-free, we mean that our algorithm recognizes characters in a ligature without segmenting the ligature into its constituent character shapes; this is the novelty of our new algorithm.
In chapter 2, the literature survey, we have seen that researchers in Arabic script OCR have usually followed one of two approaches: either ligature-based, which is segmentation-free and does not break a ligature into its characters, or segmentation-based, which divides a ligature into its character shapes; neither approach has produced promising results.
Observing that the main challenge in the segmentation-based approach is improper segmentation of a ligature into its corresponding character shapes, we decided to adopt a segmentation-free approach while still emphasizing character-based recognition, to avoid keeping a large lexicon of ligatures or training a learning machine such as a Support Vector Machine (SVM), a Hidden Markov Model (HMM) or an Artificial Neural Network (ANN) on a large database of ligature shapes.
5.2 The Nastalique OCR System Implementation
We used Matlab for rapid prototyping and experimentation; however the Urdu
Nastalique character recognition application was implemented using Microsoft VC++ 6.0.
We performed experiments on Urdu Nastalique words keeping the same font size, and the results are very encouraging.
Our Nastalique character recognition system requires a character-based True Type Nastalique font, and an image of Nastalique printed text written with that font.
After segmentation is completed, the isolated character shapes and the ligatures have been identified and bounded in rectangular boxes. In the recognition phase, the True Type Nastalique font file is loaded into main memory, and each of the character shapes in the font file is matched against the shapes identified in the text image using cross-correlation, line by line, writing the character codes into a text file in the sequence they are found, along with their starting x-position with respect to the bounding box. When the recognition process is complete, the file is sorted according to the x-positions, giving a new order to the character codes; these character codes are then given to the rendering engine, which displays the recognized text in a text region.
5.3 The Novel Segmentation-free Nastalique OCR Algorithm
Our novel Nastalique OCR algorithm is presented in figure 5.1 and the corresponding flowchart is given in figure 5.2.
Figure 5.1: Nastalique OCR Algorithm
Figure 5.2: Flowchart for NOCR
5.4 Nastalique OCR Algorithm Description
i. The text image is given to the OCR engine
ii. A True Type Font file, with which the text in the image was written, is loaded into main memory
iii. The text image is segmented into lines of text, and each line is further segmented into ligatures and isolated characters
iv. Bounding boxes are made around
a. each of the ligatures and isolated characters in the segmented text image
b. each of the character shapes in the font file, to be used as templates
v. The first character shape is picked from the font file as a template and moved through the first bounding box in the first line of the text image, starting from the right-hand side
vi. Template matching is done by cross-correlation
vii. For each pass, the highest cross-correlation peak is noted; the x-position at which it occurs and the corresponding character code are saved into an array
viii. If the first character shape is not found in the first bounding box of the text image, the next character shape in the font file is picked and the process is repeated, until either all character shapes in the image box are found or the font file is exhausted
ix. Next bounding box in the text image is selected and the process repeated until all
of the bounding boxes in the text image are exhausted and the x-positions and
corresponding character codes are saved in an array, in the order they are found in
the search procedure
x. The array is sorted according to the x-positions giving a new order for the
character codes
xi. Now the character codes in the sorted order are given to the rendering engine
which uses the same TTF file to render the recognized text in a text region
5.5 Segmentation of Text Image into Lines
To segment a text region in an image into lines of text we use the horizontal projection profile (or histogram) of the text image.
Considering an image F(i, j) of size m×n, the projection of all foreground pixels perpendicular to the vertical axis and along the horizontal direction is given as:
H_h(i) = Σ_{j=1}^{n} F(i, j),  for 1 ≤ i ≤ m        (5.1)
The horizontal projection profile of the text image separates the lines of text wherever rows of white pixels occur between two adjacent lines. Scanning vertically from top to bottom, a line of text spans the foreground pixels from the top of its ascenders to the end of its descenders.
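Equation (5.1) and the line splitting it supports can be sketched in a few lines. This is an illustrative sketch assuming a NumPy binary image with foreground pixels set to 1; `horizontal_profile` and `split_lines` are hypothetical names, not from the thesis.

```python
# Line segmentation by horizontal projection profile (eq. 5.1): sum
# foreground pixels per row, then split where all-background rows occur.
import numpy as np

def horizontal_profile(img: np.ndarray) -> np.ndarray:
    """H_h(i) = sum_j F(i, j): foreground count of each pixel row."""
    return img.sum(axis=1)

def split_lines(img: np.ndarray) -> list[tuple[int, int]]:
    """Return (top, bottom) row ranges of text lines, separated by
    all-background (white) rows."""
    profile = horizontal_profile(img)
    lines, start = [], None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i                       # line begins
        elif count == 0 and start is not None:
            lines.append((start, i))        # line ends at blank row
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines

img = np.zeros((6, 4), dtype=int)
img[1] = 1          # first text line occupies row 1
img[3:5, 0] = 1     # second line occupies rows 3-4
print(split_lines(img))  # [(1, 2), (3, 5)]
```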
5.6 Segmentation of Text Line into Ligatures
One technique for segmenting a text-line image into ligatures and isolated characters is the vertical projection profile (or histogram) of the line image. For an extracted line L(i, j) of size r×s, the projection of all foreground pixels perpendicular to the horizontal axis and along the vertical direction is given as:
H_v(j) = Σ_{i=1}^{r} L(i, j),  for 1 ≤ j ≤ s        (5.2)
The ligatures and the isolated characters present in the text-line image are identified by the columns of white pixels separating them. The limitation of this technique is that it fails where adjacent ligatures overlap or shadow each other; the resulting histogram portrays them as a single ligature.
Since this method fails wherever characters in different ligatures overlap vertically, we use connected component labeling to segment out the ligatures.
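Connected component labeling, used here in place of the vertical profile, can be sketched as follows. The thesis used Matlab's labeling; this is a minimal pure-Python stand-in with hypothetical naming (`label_components`), using 8-connectivity so that diagonally touching pixels of a cursive stroke stay in one component.

```python
# Connected-component labeling by breadth-first search over the
# 8-neighbourhood of each unvisited foreground pixel.
from collections import deque

def label_components(img: list[list[int]]) -> list[list[tuple[int, int]]]:
    """Return a list of components, each a list of (row, col) pixels."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] and not seen[r][c]:
                comp, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:                      # BFS over the 8-neighbourhood
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                components.append(comp)
    return components

img = [[1, 1, 0, 0, 1],
       [0, 1, 0, 0, 1],
       [0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0]]
print(len(label_components(img)))  # 3 components (ligatures / marks)
```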
5.7 Character Recognition by Cross-Correlation
A simple method that deals directly with two-dimensional information is cross-correlation, a typical method in pattern matching. Since the image of a character shape carries two-dimensional information, we can apply cross-correlation directly to find the identity of the character image.
The correlation of two functions f(x, y) and h(x, y) is defined as

f(x, y) ∘ h(x, y) = ∫∫ f*(α, β) h(x + α, y + β) dα dβ        (4.1a)

and, in the discrete case,

f(x, y) ∘ h(x, y) = (1/MN) Σ_{m=0}^{M-1} Σ_{n=0}^{N-1} f*(m, n) h(x + m, y + n)        (4.1b)

where f* denotes the complex conjugate of f. We normally deal with real functions (images), in which case

f* = f        (4.1c)
The term cross-correlation is often used in place of correlation to clarify that the images being correlated are different, as opposed to autocorrelation, in which both images are identical.
We consider correlation as the basis for finding matches of a subimage w(x, y) of size J×K within an image f(x, y) of size M×N, where we assume that J ≤ M and K ≤ N.
The correlation between f(x, y) and w(x, y) is
c(x, y) = Σ_s Σ_t f(s, t) w(x + s, y + t)        (4.2)
for x=0,1,2,…,M-1, y=0,1,2,…,N-1 and the summation is taken over the image region where w and f overlap.
Let the i-th template, the unknown input character, and its domain be g_i(x, y), f(x, y), and R, respectively. The similarity between the template and the input character, based on cross-correlation, is defined as

S_i(f) = [ ∫∫_R f(x, y) g_i(x, y) dx dy ] / [ √(∫∫_R f(x, y)² dx dy) · √(∫∫_R g_i(x, y)² dx dy) ]        (4.3)
where i = 1, 2, ..., L, and L is the number of characters of a given alphabet.
The maximum value of S_i(f) is found by scanning over i; if the maximum is S_j(f), then the input character f is identified as class j. The denominator of (4.3) normalizes the amplitude [64].
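Template matching by the normalized cross-correlation of eq. (4.3) can be sketched as follows, assuming NumPy arrays. The names `ncc` and `best_match` are illustrative, and a brute-force scan over all window positions stands in for the production search.

```python
# Normalized cross-correlation score of eq. (4.3):
# S = <f, g> / sqrt(<f, f> <g, g>), computed over a template-sized window.
import numpy as np

def ncc(window: np.ndarray, template: np.ndarray) -> float:
    """Normalized cross-correlation of two equal-sized patches."""
    num = float((window * template).sum())
    den = np.sqrt(float((window ** 2).sum()) * float((template ** 2).sum()))
    return num / den if den else 0.0

def best_match(image: np.ndarray, template: np.ndarray) -> tuple[int, int, float]:
    """Slide the template over the image; return (x, y, peak score)."""
    th, tw = template.shape
    ih, iw = image.shape
    best = (0, 0, -1.0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            s = ncc(image[y:y + th, x:x + tw], template)
            if s > best[2]:
                best = (x, y, s)
    return best

image = np.array([[0, 0, 0, 0],
                  [0, 1, 1, 0],
                  [0, 1, 0, 0]], dtype=float)
template = np.array([[1, 1],
                     [1, 0]], dtype=float)
print(best_match(image, template))  # (1, 1, 1.0): exact match at x=1, y=1
```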
5.8 Nastalique Text Segmentation
i. The image of Nastalique text is pre-processed to convert it into bi-level (binary) form
ii. The text region is segmented into lines of text using the horizontal projection profile (histogram), separating lines at the white space found on horizontal scanning of the text image
iii. The ligatures in the text image are then separated by applying connected component labeling
5.9 Segmentation of Text Image into Lines and Ligatures
The binarized text image for further processing by Nastalique OCR is given in figure
5.3.
Figure 5.3: Binarized Nastalique Text Image
In the pre-processing stage the text image is binarized before further processing by the Nastalique OCR; figure 5.3 shows the result. Figure 5.4 gives the horizontal and vertical projection profiles (histograms) of the binarized text image.
Figure 5.4: Binarized Nastalique Text Image
Horizontal and vertical projection profile (histograms) of text image showing the distribution of black pixels over the entire text region are illustrated in figure 5.4.
Figure 5.5: Binarized Nastalique Text Image
Horizontal projection profile (histogram) is used to separate text lines in a text image as shown in figure 5.5.
All black pixels are projected along the horizontal direction onto the vertical axis. The vertical axis gives the position, while the horizontal axis gives the number of black pixels (the pixel density) at that position. The gaps between the black projections along the vertical direction correspond to the white spaces between lines of text; on the basis of these wide white gaps, the lines of text are separated in the text image.
Figure 5.6: Binarized Nastalique Text Image
Figure 5.6 illustrates the separation of lines in the text image using the horizontal projection profile (histogram); it also shows the correspondence between the horizontal histograms and the lines in the text image.
Figure 5.7: Lines of Text Separated
Text image segmented into lines of text is illustrated in figure 5.7, where lines of text in the text image are separated for further processing by Nastalique OCR.
The system stores each segmented line as a separate image for further processing, assigning it its actual line number in the text image, so that when the recognized text is rendered line by line it has the same flow as the original text image.
We used connected component labeling in Matlab to identify and isolate all the ligatures in the text image, as illustrated in figure 5.8. In this way we are able to isolate all ligatures and isolated characters. However, as an undesired effect, all diacritic marks are also separated, including the dots belonging to certain characters. A character with dots must be identified together with its dots, or it will be interpreted as some other character: in Urdu, different characters share the same base shape but differ in the number of dots, the location of the dots, or whether they carry dots at all. The dots therefore have to be associated with the base character shape so that it can be interpreted as the correct character.
Figure 5.8: Ligatures are separated
The segmentation phase is shown in figure 5.9 (a–g), where the text image is first segmented into lines, and then each line into ligatures, isolated characters and primitives. Here we refer to diacritic marks as primitives.
After the text image is segmented into separate lines, each text-line image is assigned its line number and processed further to identify the number of ligatures, isolated characters and diacritical marks it contains.
The given text image consists of seven lines. The text image is first segmented into these seven lines, and each line is then processed further; the results are shown in figures 5.9 (a) to 5.9 (g).
Figure 5.9 (a) shows that line one has 22 elements, of which 10 are diacritical marks and the remaining 12 are ligatures. The distinction between a ligature and a diacritical mark is made, by trial and error, by counting the number of black pixels in each bounding box after connected component labeling is applied to the separated text-line images. By experimentation we found that if the number of black pixels in a bounding box is more than 25 it is a ligature; otherwise it is a diacritical mark.
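The pixel-count decision described above can be sketched in a few lines. This is only an illustration of the empirically chosen threshold of 25 foreground pixels reported in the thesis; `classify_box` is a hypothetical name.

```python
# Ligature vs. diacritic decision: more than `threshold` black pixels
# in a bounding box means a ligature, otherwise a diacritical mark
# (dot, vowel mark, hamza, ...).
def classify_box(pixel_count: int, threshold: int = 25) -> str:
    return "ligature" if pixel_count > threshold else "diacritic"

counts = [140, 8, 60, 3, 25]
print([classify_box(c) for c in counts])
# ['ligature', 'diacritic', 'ligature', 'diacritic', 'diacritic']
```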
Figures 5.9 (a–g) are the result of connected component labeling and of counting the number of foreground pixels in each bounding box.
The number of bounding boxes is plotted along the vertical axis, while the separation between the horizontal axis and the start of a bounding box is plotted along the horizontal axis. The decision whether a bounding box contains a ligature or a diacritic mark is made by counting the pixels it contains: twenty-five or fewer means the box contains a diacritical mark; more than twenty-five means it bounds a ligature.
Thus, for each text-line image we obtain its line number, the number of ligatures and diacritical marks it contains, and an image showing the separated connected components.
Figure 5.9 (a): Analysis of text line 1
Figure 5.9-a shows line one has 22 elements with 12 ligatures and 10 diacritical marks
Figure 5.9 (b): Analysis of text line 2
Figure 5.9-b shows line two has 25 elements with 11 ligatures and 14 diacritical marks
Figure 5.9 (c): Analysis of text line 3
Figure 5.9-c shows line three has 42 elements with 18 ligatures and 24 diacritical marks
Figure 5.9 (d): Analysis of text line 4
Figure 5.9-d shows line four has 29 elements with 11 ligatures and 18 diacritical marks
Figure 5.9 (e): Analysis of text line 5
Figure 5.9-e shows line five has 38 elements with 13 ligatures and 25 diacritical marks
Figure 5.9 (f): Analysis of text line 6
Figure 5.9-f shows line six has 27 elements with 9 ligatures and 18 diacritical marks
Figure 5.9 (g): Analysis of text line 7
Figure 5.9-g shows line seven has 45 elements with 16 ligatures and 29 diacritical marks
Figure 5.10: All elements in the text image separated
All the ligatures, isolated characters and diacritical marks are separated using connected component labeling in Matlab, as shown in figure 5.10.
The diacritical marks include dots, punctuation marks, vowel marks, hamza etc.
Figure 5.11 (a): Ligature overlap line 1
For Latin script or other discrete-script OCRs, the vertical projection profile (or histogram) can be used to separate all characters in a line of text, on the basis of the small white spaces present between characters and their non-overlapping nature.
In figures 5.11 (a) to (d), the vertical projection profile is plotted: position on the text line along the horizontal axis and pixel density along the vertical axis. However, some ligatures could not be separated for recognition due to the overlap between adjacent ligatures.
Once the text-line images are separated from the text image, each is processed as a separate image containing only one line of text. In discrete scripts like the Roman script, individual characters are separated by the vertical projection profile of the text image: all black pixels in the text-line image are projected along the vertical axis, leaving white gaps between the characters. In printed Roman text, discrete characters do not overlap or shadow each other.
In the case of the Nastalique script, however, its cursiveness causes considerable overlapping and shadowing of ligatures in the printed text, so ligatures cannot be separated using a simple vertical histogram of the text-line image.
Figures 5.11 (a) to (d) clearly illustrate the heavy overlapping of words and ligatures in printed Nastalique text, which makes ligature separation a far more challenging job.
The text-line image is scanned horizontally from left to right looking for black (foreground) pixels; as they are found, they are projected along the vertical axis to give a histogram showing the distribution of pixels along the horizontal axis and the pixel densities along the vertical axis. From this we see that adjacent ligatures overlap heavily.
Figure 5.11 (b): Ligature overlap line 2
Figure 5.11 (c): Ligature overlap line 3
Figure 5.11 (d): Ligature overlap line 4
5.10 Recognition Technique
Figure 5.12 (a) is an image of a word written in Nastalique script (the word is actually Nastalique, ﻧﺳﺗﻌﻠﯾق). This word is a good example for illustrating the cursiveness and context sensitivity of the Nastalique script, which are the two main challenges in Nastalique character recognition. The word Nastalique has seven characters joined cursively in a fluid form to give a single-ligature word.
Figure 5.12 (a): A word in Nastalique
Figure 5.12 (b): Separated character shapes in a word
The word has one character in the initial position and one in the final position, while the remaining five characters all occupy medial positions, as both their left and right ends join other characters. These initial, medial and final character shapes are shown separated from each other in figure 5.12 (b). These separate context-sensitive shapes are stored in the font file with their corresponding character codes, illustrated in figure 5.12 (c).
Figure 5.12 (c): A word in Nastalique
In text processing applications we key in the character codes to display the character on the screen while in character recognition the algorithm returns the character code as a result of recognizing a character in an image.
In the recognition process of our Nastalique OCR, character codes are returned as the result of recognition and passed to the rendering engine to display the recognized text on the screen.
5.11 Nastalique OCR Application
The graphical user interface, figure 5.13, along with the OCR application for Nastalique text recognition, has been implemented in Microsoft Visual C++ 6.0. The user interface presents an easy and interactive way to use the application. Its features are explained as follows.
5.11.1 The Dialogue Boxes
The dialogue boxes in our Nastalique application are explained as follows:
a. The ‘Open Image’ dialogue box lets the user select a text image from the source and prepare it for recognition. This image is displayed in the larger text box appearing in the centre of the screen.
b. The ‘Load Font File’ dialogue box presents the list of fonts that can be selected from the drop-down menu and activates the selection for the recognition procedure.
c. The ‘Result’ dialogue box displays the recognized text in the smaller left hand
screen.
d. The ‘Exit’ button allows the user to quit the program.
Figure 5.13: Nastalique OCR User Interface
Figure 5.14: Input word image selection
Figure 5.14 shows an image selected and loaded into the input image area.
Figure 5.15: Font selection
Pressing the ‘Load Font’ button displays the dialogue box that prompts the user to supply basic information about the font file to the system, shown in figure 5.15. A selection can be made from the available font files, which should ideally correspond to the font style in which the input text was created.
5.12 The Recognition Procedure
Two windows are formed: the larger one around the ligature in the image, and the smaller one around the character-shape image in the font file, to act as a template. The top-right and bottom-left corner addresses of the larger ligature window are saved, together with its line number and the position of the ligature in its line.
In the recognition process, a rectangular bounding box is drawn around each ligature and isolated character in a separated text-line image. The font file is already loaded in main memory, and bounding boxes are also drawn around the character shapes in the font file to act as templates. The first template from the font file is then picked with its small window and moved within the larger image window around a ligature, from the top-right corner towards the bottom-left corner, one pixel left and one pixel down at a time, matching the template against the character shapes in the ligature window using cross-correlation and observing the correlation peak value at each step.
Figure 5.16: Cross-correlation for recognition (templates with character codes 65–71 from the font file are matched against the ligature image; the character codes and their x-positions are saved to a text file)
Figure 5.16 shows that each ligature in the text image is enclosed in a rectangular bounding box; the start and end positions of each box along the horizontal axis are noted.
Figure 5.17: Recognition process (the unsorted text file of character codes 65–71 with x-positions x4, x1, x2, x6, x0, x5, x3 is sorted by x-position into the order 69, 66, 67, 71, 65, 70, 68)
One by one, character-shape templates from the font file are tried in the ligature window. As a template matches a character shape somewhere in the ligature, its character code, along with the starting address of the character shape with respect to the ligature window, is saved in a two-dimensional array (figure 5.17), pairing each character code with the starting address of its character shape in the given ligature.
Figure 5.18: Recognition complete (recognized text displayed in a text box)
When the recognition procedure is complete, the array is sorted by the starting addresses of the character codes within each ligature window and line. The sorted arrays are then given to the rendering engine to display the recognized text in the order it was found in the text image.
The result of single word recognition is illustrated in figure 5.18.
Figure 5.19: Multiple words single line
Figure 5.19 shows the result of recognizing multiple words in a single line of the text image.
Figure 5.20: Multiple words multiple lines
Figure 5.20 shows the result of recognizing multiple words in multiple lines of the text image.
5.13 The Recognition Process
The recognition process is a step-by-step procedure in which the system reads, recognizes and displays the recognized text on the screen. The process takes the first character template from the font file and matches it against all the characters present in the segmented input text image. Once a match is made, the template's character code is recorded together with the address at which it matched the corresponding character within the bounding box of the ligature image, giving it a subscript of the array x, where x represents all the character addresses in a single ligature image. The recognized characters are first stored in the order they are recognized and then sorted according to the addresses at which they were found in the input image. Once all the characters have been put in the order they appear in the text image, they are displayed on the screen.
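The sort-and-display ordering step can be sketched as follows, reproducing the unsorted-to-sorted reordering depicted in figure 5.17; `reading_order` is an illustrative name.

```python
# Ordering step of the recognition process: (character code, x-position)
# pairs are recorded in match order, then sorted by x-position to
# restore the order in which the shapes appear in the image.
def reading_order(hits: list[tuple[int, int]]) -> list[int]:
    """hits: (code, x) pairs in match order -> codes in x order."""
    return [code for code, _ in sorted(hits, key=lambda h: h[1])]

# codes 65..71 found out of order during template matching
hits = [(65, 4), (66, 1), (67, 2), (68, 6), (69, 0), (70, 5), (71, 3)]
print(reading_order(hits))  # [69, 66, 67, 71, 65, 70, 68]
```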
CHAPTER 6 Conclusion and Future Work
6.1 Introduction
This chapter summarizes the research work done on this project and discusses directions for future research on this and related topics. We have also included our proposed Nastalique Text Processor Model which, in our opinion, if realized, would improve both the processing and the recognition efficiency of our proposed Nastalique OCR system.
6.2 Introduction to Nastalique Text Processor Model
For our Nastalique OCR system, we develop character-based True Type Font files for Nastalique. Words are written using such a character-based TTF font, and an image is made of the Nastalique text. The image is then given to our Nastalique OCR system. After recognition, rendering is done using the same TTF font file to display the recognized text. The work is therefore threefold: development of a character-based Nastalique True Type Font, Nastalique character recognition, and rendering of the recognized text using the character-based Nastalique True Type Font.
The term writing system implies a formatting and layout system in which the user is required to do no more than key in a pure sequence of letters in their spoken order; the computer stores and transmits this information as a plain text sequence, performing automatic contextual formatting and directional layout to render the text in its proper typeset appearance.
Writing systems for most languages are simple, owing to the fact that they are free from context dependence and are not inherently cursive. On the contrary, Urdu Nastalique, because of its context-sensitive features and cursiveness, is considered one of the most complex styles for electronic typography. Creating a model for the digital implementation of Urdu Nastalique therefore has limitations and challenges which can be attributed to its natural cursive and context-sensitive features. The goal, though difficult, could be achieved by creating versatile and extensive algorithms.
A basic requirement for creating a writing system is to make a good set of fonts available for representing each character in a variety of ways. A unique way to do this is to form a character based Nastalique font which stores all individual characters along with their possible contextual shapes. This would result in a greater variety of ways to combine characters but would increase the rules for character combinations.
This would imply that all possible shapes of characters in ligatures are determined and added to the font files. A set of rules would then be formed to allow for the correct combination of characters that form words which would include rules governing context sensitivity, cursiveness, positions of diacritic marks, bi-directional attitude etc.
Other than being cursive, characters in the Urdu Nastalique script are not consistent but context sensitive: they change and adopt shapes according to their position in the word and their adjoining characters on both sides. The words also do not follow a completely flat horizontal baseline; rather, they begin from the top right and, moving diagonally, finish at the bottom left. Character and word heights vary accordingly.
We discussed the complexities of the Urdu Nastalique script in detail, in particular those related to character recognition and to computing in general, in chapter two of this thesis, so we will not repeat that discussion here.
The native writing system for Urdu Nastalique will serve for localization of regional languages, desktop publishing in Urdu and making global searches.
Over the years Naskh became the most popular Arabic writing style for typing because of its relatively flat, horizontal baseline and legibility which made it comparatively easier to adopt for typography. Since then a more standardized version of Naskh evolved and found its way for scripting printed Arabic text. The fancier form of Nastalique which was adopted for Urdu and Persian did not gain the same popularity despite its space saving features and more artistic form. However interest is now being shown to automate Nastalique for professional typesetting [36].
In Arabic, most characters join the subsequent one to form ligatures; only a few do not join a subsequent character and remain either isolated or joined at one end only.
Earlier attempts to make a Nastalique typewriter failed due to the large number of ligatures required to create a font set. Likewise, the task of computerizing Nastalique admits two approaches: the ligature-based approach and the character-based approach.
The ligature-based approach offers the creation of extremely fine and well-formed fonts, though the number of glyphs increases phenomenally, to around 17,000 and perhaps even more, because new words from foreign languages need to be absorbed. There are innumerable combinations that can be made with characters joining other characters in a variety of ways. The visual effect of this approach is of higher quality, as the joining of characters is embedded in the system rather than based on an algorithm.
This approach also makes it impossible to build the font files from different calligraphers because of the large number of glyphs required in the font file.
The character based approach works in a different manner. Here the shapes of the characters depend on the preceding and subsequent glyphs. This makes Nastalique a highly complex script. For developing this system of writing various shapes of individual Urdu characters are included in the sets of glyphs and rules are determined that govern the formation of words.
The advantage of this system is the relatively small number of glyphs required for its creation and its accessibility to be recreated by different calligraphers.
Urdu is one of those world languages which have a complex writing system and are context sensitive. Nastalique text formatting therefore is an arduous undertaking and implementation of a text processing system would mean creating extensive rules for letterform combinations and analysis.
As Urdu takes its origins from the Arabic writing script, it follows a bi-directional style: although words are formed from right to left, numerals are written from left to right. Another inherent feature of the Nastalique script is cursiveness. Each letter connects with the adjoining character in a fluid form, beginning from the top right and moving diagonally to the bottom left. This feature makes writing in Nastalique space saving, as characters are stacked vertically as the word advances; it also distinguishes Nastalique from Arabic Naskh, which is written on a primarily horizontal baseline. One drawback of this feature is that words with multiple characters in a line may clash with words on the preceding lines. The system is also non-monotonic in nature, and further complexities arise from the positions of dots and other diacritical marks. All these complexities make Nastalique an extremely difficult writing system to implement for computerized text processing.
Nastalique is naturally cursive, connecting each subsequent character with the previous one by means of delicately curving joints. The joining points and curves in characters and ligatures follow a predefined set of formation rules governed by the size of the pen nib with which the words were originally written.
Each character in Nastalique follows prescribed writing rules which are determined in relation to the width of the flat nib of the pen, called the qat. Letters are written using a flat nib, and both the trajectory of the pen and the angle of the nib define the glyph representing a letter. Figure
6.1 displays some of the rules for forming characters in Nastalique in terms of measurement in qat [20].
Figure 6.1: Measurements of Letters in qat
Because of its cursive qualities, Urdu Nastalique is more space sparing, while Naskh, with its relatively more horizontal baseline, occupies more space for each word. This feature also lends an element of style and decorative value, as the words thus formed have a more artistic fluidity.
Nastalique is also non-monotonic in nature, where certain characters have strokes that move backwards sometimes even beyond the previous characters.
6.3 Nastalique Character Shapes
In Nastalique we have two to four basic shapes for each character of the alphabet: Initial, Medial, Final and Isolated. Except for Isolated, these are position-dependent shapes and can take several forms depending on the preceding as well as the following characters.
Each of these character shapes has multiple instances; for example Bay (ب), the second letter in the Urdu alphabet, has 20 different shapes for its initial form alone. These character shapes are context sensitive: a character's shape changes if the preceding or the following character changes. At times even the 3rd or 4th character away may cause a similar change, suggesting an n-gram model in a Markov chain.
Besides the four basic character shapes in Nastalique typography (initial, medial, final and isolated), we also define a white space character (ws). In Nastalique typography, the white space character indicates the completion of a ligature. As we key in a sequence of characters from the keyboard, the first input character takes the initial shape, and subsequent characters take the medial shape until a white space character is input; the last character keyed in just before the white space takes the final shape.
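The white-space-driven shape assignment described above can be sketched as follows. This is a simplified illustration that assumes every character is joinable (Urdu also has non-joining characters); `assign_shapes` is a hypothetical name.

```python
# Positional shape assignment: within each ligature (a run of keys
# between white spaces), the first character is Initial, subsequent
# ones Medial, the one keyed just before a space Final, and a lone
# character between spaces Isolated.
def assign_shapes(keys: list[str]) -> list[tuple[str, str]]:
    shapes: list[tuple[str, str]] = []   # (char, form) per non-space key
    run: list[int] = []                  # indices into `shapes` of the current run
    for k in keys + [" "]:               # sentinel space flushes the last run
        if k == " ":
            if len(run) == 1:            # lone character -> isolated
                c, _ = shapes[run[0]]
                shapes[run[0]] = (c, "isolated")
            elif run:                    # last of a run -> final
                c, _ = shapes[run[-1]]
                shapes[run[-1]] = (c, "final")
            run = []
        else:
            run.append(len(shapes))
            shapes.append((k, "initial" if len(run) == 1 else "medial"))
    return shapes

print(assign_shapes(list("abc") + [" "] + ["d"]))
# [('a', 'initial'), ('b', 'medial'), ('c', 'final'), ('d', 'isolated')]
```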
6.4 Nastalique Joining Characters Features Set
We call the set comprising the features of Nastalique character shapes the Nastalique Joining Characters Features Set, which can take the following feature values:
Height, Thickness, Angle and Rotation
These features are also termed as attributes, thus the Nastalique feature set is given as:
FS = { Height, Thickness, Angle, Rotation }
Word formation in Nastalique is not as simple as in Roman script. In Nastalique a word is composed of ligatures and isolated character shapes; sometimes a word consists of only a single ligature, e.g. ﺑﮩت.
Two or more character shapes join cursively to form a ligature. While forming a ligature, a character shape can join another from the left as well as the right-hand side. When a character shape joins another from its right side with its own left side, the two feature sets must match at the joining point.
So we have features defined at one or both of the two possible joining ends of each character shape as:
Left Features (LF) and Right Features (RF).
6.5 Proposed Nastalique Text Processor Model
In this section we propose a finite state model of a character-based Nastalique text processor, illustrated in figure 6.2, which, when implemented, would produce well-formed Nastalique text using a character-based font and a knowledge base consisting of rules for joining various character shapes cursively to form a Nastalique ligature.
The proposed Nastalique Text Processor can be modeled as a finite state automaton:
Figure 6.2: Nastalique Text Processor Model
A finite state automaton A is defined by a five tuple [35] as follows:
A = ( Q, ∑, δ, q0, F ), where
Q: a finite set of states
∑: a finite set of input symbols, also called the alphabet
q0: the start state, q0 ∈ Q
F: a finite set of accepting states
δ: the transition function, specified as δ: (Q × ∑) → Q
The transition function δ takes two arguments, an ordered pair whose first element is from Q, the set of states, and whose second element is from ∑, the set of input symbols or alphabet; it returns an element of Q, that is, a state. Thus the transition function processes an input symbol at a particular state and gives the next state the automaton shall be in. The transitions of our proposed Nastalique Text Processor Model on the various input symbols are shown in figure 6.3, and all possible transitions on the symbols of the alphabet ∑ are shown in table 6.1.
The set of states of the Proposed Nastalique Text Processor Model is Q,
Q : { q0, q1, q2, q3, q4 }, where
q0 = starting state and final state
q1 = input is an initial character shape
q2 = input is a medial or final character shape
q3 = error state / incomplete word
q4 = input is a white space
The input character set or the alphabet, ∑, can take the following values:
∑ : { initial, medial, final, isolated, white space }
= { in, m, f, is, ws }
F, the set of final or accepting states is given as
F= { q0}
Table 6.1: Transition Table for NTPM
δ     in    m     f     is    ws
q0    q1    q3    q3    q0    q3
q1    q3    q2    q4    q3    q3
q2    q3    q2    q4    q3    q3
q3    q3    q3    q3    q3    q3
q4    q3    q3    q3    q3    q0
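As a check on the model, the transition table above can be transcribed directly into code. The following is a minimal Python sketch of the NTPM automaton (an illustration only, not part of the thesis implementation); the symbol abbreviations in, m, f, is and ws follow ∑ as defined above, and the function names are ours.

```python
# Transition table of the proposed Nastalique Text Processor Model (NTPM),
# transcribed from Table 6.1. States are q0..q4; q0 is both the start state
# and the only accepting state; q3 is the error / incomplete-word state.
DELTA = {
    "q0": {"in": "q1", "m": "q3", "f": "q3", "is": "q0", "ws": "q3"},
    "q1": {"in": "q3", "m": "q2", "f": "q4", "is": "q3", "ws": "q3"},
    "q2": {"in": "q3", "m": "q2", "f": "q4", "is": "q3", "ws": "q3"},
    "q3": {"in": "q3", "m": "q3", "f": "q3", "is": "q3", "ws": "q3"},
    "q4": {"in": "q3", "m": "q3", "f": "q3", "is": "q3", "ws": "q0"},
}

def ntpm_accepts(symbols):
    """Run the NTPM over a sequence of shape symbols; accept iff it ends in q0."""
    state = "q0"
    for s in symbols:
        state = DELTA[state][s]
    return state == "q0"
```

For example, the sequence initial-medial-final followed by a white space traces q0 → q1 → q2 → q4 → q0 and is accepted as a complete ligature, while initial-medial followed directly by a white space falls into the error state q3.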
Figure 6.3: Transition Diagram for NTPM
6.6 Components of Nastalique Text Processor Model (NTPM)
Our proposed Nastalique text processor model has the following two functional components:
6.6.1 Character Shape Recognizer
A character in Nastalique can take any of the four basic shapes from its shape set. The following rules determine the shape of an input character that has just been keyed in: whether it is an initial, medial, final or isolated character.
This component gets its input from the keyboard as a character code and checks whether the character has LFs only, RFs only, both LFs and RFs, or neither. On the basis of the character shape determination rules, the decision is made whether the new input character shall take the initial, medial, final or isolated shape.
The character shape determination rules are as follows:
1. If (X has RFs) ^ ~(X has LFs) then (X takes Final shape)
2. If (X has LFs) ^ ~(X has RFs) then (X takes Initial shape)
3. If (X has LFs) ^ (X has RFs) then (X takes Medial shape)
4. If ~(X has RFs) ^ ~(X has LFs) then (X takes Isolated shape)
These rules, put in simple words, state that if an input character shape has only LFs, the character shape is Initial; if it has only RFs, the character shape is Final; if it has both LFs and RFs, the character shape is Medial; and if neither feature set is available, the character shape is Isolated.
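The four rules can be expressed as a small decision function. The sketch below is illustrative only; it assumes the feature sets have already been reduced to two booleans (whether the character defines any Left Features and any Right Features), and the function name is ours.

```python
def character_shape(has_lf, has_rf):
    """Decide the contextual shape of a character from its joining features.

    has_lf / has_rf: whether the character shape defines Left Features (LFs)
    or Right Features (RFs) respectively.
    """
    if has_rf and not has_lf:
        return "Final"      # Rule 1: RFs only
    if has_lf and not has_rf:
        return "Initial"    # Rule 2: LFs only
    if has_lf and has_rf:
        return "Medial"     # Rule 3: both feature sets
    return "Isolated"       # Rule 4: neither feature set
```

The four branches cover all combinations of the two booleans, so every input character is assigned exactly one of the four basic shapes.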
6.6.2 Next-State Function
Having decided the shapes of characters in a ligature as initial, medial, final or isolated, the system checks for multiple instances of these character shapes.
In Nastalique we have multiple instances of these character shapes encoded in the font file. The next-state function compares the LFs and RFs on a one-to-one basis, decides the correct instance of the character shape present at a particular position in a ligature, returns its character code, and moves to the next state, where it expects the next input from the keyboard.
As the input process proceeds and character codes are received from the keyboard, this component decides on the particular contextual shape of each input character amongst the many available for it.
As soon as the next-state function receives a white space character from the keyboard, it displays the completed ligature on the screen and returns to the starting state, awaiting the first input for the next ligature.
The Nastalique Text Processor Model will always be in a state expecting a new input character from the keyboard until it receives a white space character, which results in either an error state, indicating an incomplete ligature, or an accepting state, in which all the characters received from the keyboard are displayed joined together in the form of a ligature.
6.7 Conclusion
Urdu writing takes many of its features from written Arabic. However, it differs from Arabic in the number of characters and the placement of several diacritical marks. In the era of language automation and text processing, Arabic Naskh gained greater popularity due to its simplicity of style, its flatter horizontal baseline and its convenience of processing. Most multinational software companies have tried to automate Naskh to increase their foreign market, yet a reliable OCR for Naskh with an acceptable rate of accuracy is still not available.
This means that, as far as research work is concerned, literature is available on techniques that have been tried and tested on Arabic to represent the script electronically and to create recognition models, although these have not been completely perfect.
Approaches that have so far been adopted to develop Arabic character recognition fall into two major categories, ligature based and character based.
On the other hand, Nastalique, a more decorative and artistic version of written Arabic that has been adopted to write Persian and Urdu, was considered too complex, because of the variety of its features, to be fully automated and made accessible to machine recognition. Advancement in technology has, however, brought growing interest in automating complex language systems such as many Asian languages, and hence our interest in bringing Nastalique to par with this research in terms of the digital recognition of its script.
With relatively little previous research available in this context, our exploration began with the various methods employed to recognize Arabic Naskh, which exhibits considerable, though not complete, resemblance to Nastalique. However, the methods, tools and techniques that had been used with relative success on Naskh were not found to be so with Nastalique. We therefore had to find more innovative ways to create a new recognition strategy for this script. In this regard we were able to develop and present a novel technique for recognizing Nastalique text which is character based and segmentation free.
This thesis presents the various steps in the design and implementation of a novel algorithm for a Nastalique OCR system. It also gives details of the various approaches currently in use in the designing of OCR systems. It includes the extensive research work undertaken to investigate past references to work in similar areas with particular emphasis on Arabic script languages.
The thesis presents in detail the complexities encountered in the implementation of a
Nastalique OCR system and explores the reasons that determine the limitations.
Some of the details as described in the thesis are summarized below.
Urdu has a unique system of representation. It derives its script from Arabic, following most of its rules for word and ligature formation. Although Arabic today is written in the Naskh script, Urdu follows a style that evolved later as a merger of Taleeq and Naskh. Urdu also has a greater number of characters due to the more varied set of sounds characteristic of the Urdu language.
Nastalique also has inherent complexities that were explored in detail in this thesis. It is naturally cursive connecting each subsequent character with the previous one by means of delicately curving joints. The joining points or curves in characters and ligatures follow a predefined set of rules for formation that are governed by the size of the pen nib with which the words used to be originally written down.
The style is also bi-directional, with characters moving from right to left while numerals run in the opposite direction. As words are written down there is also a vertical stacking of characters as they are kerned and cursively joined, while some characters move backward beyond the previous character. These factors add to the difficulty of developing an OCR for Nastalique.
While Naskh was adopted to scribe Arabic, Nastalique became the more popular version for
Persian and Urdu. As compared to Nastalique, Naskh is simpler in features, following a single, flat and horizontal base line while Nastalique has multiple baselines. Nastalique also has other features that make it a complex system to adopt for computer automation.
Our research indicates that there is practically no reliable work done on automating Urdu Nastalique, either for developing an OCR or for everyday computing needs; whatever work is currently available is under proprietary rights rather than being based on universally accepted standards. Unicode also treats Urdu as a sub-language of Arabic, with a few added characters to cater for those not present in Arabic. Unicode thus supports the writing of Urdu in the Naskh style and so far does not support the Nastalique style of Urdu writing.
This forms the basis for the fact that our literature survey comprises more work on Arabic than on Urdu. The survey reveals that the experimentation and research conducted so far on the design and implementation of Arabic OCRs have followed two main approaches:
1. Character based
2. Ligature based
Both approaches have their own potentials and drawbacks, but in general the results have not been very encouraging for either.
The character based approach needs segmentation of text down to the character level, while in Arabic script text the points of segmentation are the main challenge to define, so poor segmentation results in low recognition accuracy.
The ligature based approach to an Arabic script OCR, on the other hand, requires a very large number of ligature images to train a learning machine such as an ANN, HMM or SVM. Many researchers have tried this approach; however, they were not able to achieve high rates of recognition accuracy.
In this thesis we have tried to explore a more innovative technique to create a novel algorithm for implementing a Nastalique OCR.
Extensive research was done on the Urdu Nastalique script to explore the complete alphabet, the positions of dots, the diacritical marks and the rules of word and ligature formation, presented in chapter 4 of this thesis.
Urdu has a more extended alphabet than Arabic due to the many sounds that it needs to cater to. This is done by increasing or decreasing the number or changing the position of dots with respect to some basic character shapes or by the changes in diacritical marks.
We have categorized the Nastalique alphabet according to the base shapes and the positions of diacritical marks and on the basis of this a computational model has been constructed.
We have also made detailed comparisons between Nastalique and Naskh and analyzed the differences. Naskh is written essentially horizontally, while Nastalique promotes vertical stacking of characters, as characters often overlap one another when they are joined cursively.
We have also studied in detail the Roman script OCRs which use the character based segmentation approach for recognition.
A word in Urdu is either a single ligature or a combination of ligatures and isolated characters. However, from the point of view of building an OCR, the ligature based approach was also not feasible, as it requires extensive computation, making the end result computationally expensive.
In Chapter 5 we have proposed a novel segmentation free algorithm for the design and implementation of an OCR (Optical Character Recognition) for printed Nastalique text, a calligraphic style of Urdu which uses the Arabic script for its writing.
Our proposed algorithm for Nastalique character recognition does not require segmentation of a ligature into its constituent character shapes; rather, it requires only two levels of segmentation: first the text image is segmented into lines of text, then each line of text is further segmented into ligatures and isolated characters.
The algorithm takes a text image as input. Segmentation is done at two levels: the text image is segmented into lines of text, and the lines are segmented into ligatures. There is no further segmentation at the ligature level to split the ligature into its respective character shapes.
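The first level of segmentation, splitting the text image into lines, is commonly done with a row projection profile. The thesis performs its segmentation in Matlab 7; the Python sketch below is an illustration of the projection-profile idea only, not the author's actual code, and it assumes a simple binary image where ink rows are separated by fully blank rows.

```python
def segment_lines(image):
    """Split a binary text image (lists of 0/1 rows, 1 = ink) into
    horizontal bands of text. Rows with no ink are treated as the
    gaps between consecutive lines of text."""
    profile = [sum(row) for row in image]   # ink count per row
    lines, start = [], None
    for i, ink in enumerate(profile):
        if ink and start is None:
            start = i                        # entering a text band
        elif not ink and start is not None:
            lines.append((start, i))         # leaving a text band
            start = None
    if start is not None:                    # band runs to the bottom edge
        lines.append((start, len(profile)))
    return lines                             # list of (top_row, bottom_row)
```

Note that this simple form relies on clean inter-line gaps; for Nastalique, with its multiple baselines and vertical stacking, the gaps can be narrow, which is part of why line segmentation is harder than for Naskh or Roman script.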
At the recognition step the system uses character templates from the font files, which have been loaded into main memory. These templates are matched against the character shapes present in the ligatures using cross-correlation. A character shape is successfully found in a ligature at a particular point if the template matching process gives the highest correlation peak at that point in the ligature.
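The template matching step above can be illustrated with a toy sketch: slide the template over the image, score each position by correlation, and take the position of the highest peak. This is an illustration on small binary arrays only; the thesis itself matches character templates from the font files against ligature images, and the function names here are ours.

```python
def match_score(image, template, r0, c0):
    """Correlation score of the template with the image patch at (r0, c0):
    the sum of products of corresponding pixels."""
    th, tw = len(template), len(template[0])
    return sum(template[r][c] * image[r0 + r][c0 + c]
               for r in range(th) for c in range(tw))

def best_match(image, template):
    """Exhaustively slide the template over the image and return the
    position and value of the highest correlation peak as (row, col, score)."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            s = match_score(image, template, r, c)
            if best is None or s > best[2]:
                best = (r, c, s)
    return best
```

In practice the raw product sum would be normalized (normalized cross-correlation) so that the peak height is comparable across templates of different size and ink density, and a peak is accepted only above a threshold.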
In Chapter 5 we also have presented the process of segmenting the lines of text into ligatures which is done using Matlab 7.
Our innovative technique has, however, aimed to avoid the previously adopted segmentation techniques so that the negative aspects of segmentation processes that give low recognition results are kept to a minimum. Its novelty lies in the fact that it segments lines of text into ligatures and then uses the ligatures for character based recognition without splitting them into constituent character shapes.
Our research and experimentation point to an important prerequisite for making character based Nastalique recognition possible: a character based writing system, which is currently not available. As such, there is no standard character based Nastalique text processing system. For this purpose we create a Nastalique character based font, write the text using this font file, make an image of the text and then give it to our Nastalique OCR for recognition.
In this thesis we have also proposed a Finite State Model for such a text processor. It is intended to set a line of action leading to the development of a suitable text processor, which would make text recognition of Urdu Nastalique easier to accomplish. Although we have created and presented the model, the text processor has not yet been implemented, and so results have not been reported. It is hoped, however, that future research will pave the way for an appropriate implementation of this model.
Our research sets the groundwork for further exploration of the presented techniques, which could proceed in various directions, e.g. video text recognition.
We have presented OCR work that extended into the field of video text detection and extraction. There is considerable interest in the area of video text extraction, and our experimentation in this regard has given encouraging results.
Video text recognition is a different field of character recognition, in which the text under consideration is often presented against a more complex background, compared to printed text, which appears on a background that is more consistent in texture and color.
Video text is primarily of two kinds: scene text and artificial text. Extracting artificial text is comparatively easier, as it has been superimposed on the background at a post-processing or video-making stage and conforms to prevailing text writing patterns and standards to a great extent. Scene text, however, i.e. text embedded within the video scenes, is the most difficult to identify, owing to the fact that it can appear almost anywhere and in any form, colour or size.
Our experimentations on VOCR are limited to text area detection and text extraction from the lower one third segment of the video frames that generally display information about the speaker on the screen, breaking news slides or video subtitles. The results of text area detection and extraction are reported in the thesis.
Writing systems for most languages are simple, owing to the fact that they are free from context dependence and are not inherently cursive. On the contrary, Urdu Nastalique, because of its context sensitive features and cursiveness, is considered one of the most complex styles for electronic typography. Creating a model for the digital implementation of Urdu Nastalique therefore has its limitations and challenges.
Since our character-based segmentation-free Nastalique OCR algorithm needs, as groundwork, a character-based Nastalique text processor, and since no native standard text processing system is available, we have proposed a Finite State Nastalique Text Processor Model. It has not yet been implemented, so results are not reported; however, this model could serve as an impetus for future research in this challenging field.
Our proposed Finite State Model of a character-based Nastalique text processor when implemented would produce a perfect Nastalique text using a character-based font and a knowledge-base consisting of rules for joining different character shapes cursively together to form a Nastalique ligature.
The native writing system for Urdu Nastalique will serve for localization of regional languages, desktop publishing in Urdu and for making global searches.
6.8 Contribution
The main contribution of this thesis is the character based segmentation free algorithm of an OCR for printed Nastalique text, which is presented in chapter 5 of this thesis with experimentations and results on text image segmentation and text recognition.
In chapter 3 we have included the details of experimentation with results for detection and extraction of text regions from video frames for video OCR, with the relevant literature survey.
We have included in this thesis some research work as an extension to our main research topic, including a Finite State Model of a Nastalique Text Processor, presented at the beginning of this chapter.
6.9 Future Work
To keep our problem, investigating a simple and straightforward technique for the design and implementation of an OCR for printed Nastalique text, tractable, we kept our boundaries narrow. For future work, we plan to extend our Nastalique OCR algorithm to work on a larger set of Urdu Nastalique words and to make it more flexible and robust, accepting text in different font sizes.
We also plan to extend our Urdu Nastalique OCR to a video OCR for Arabic script languages. We have already performed successful experiments on the detection and extraction of text regions from video frames. The results of these experiments shall be used for recognition by our Urdu Nastalique OCR.
References

[1] S. I. Abuhaiba, “Recognition of Off-Line Handwritten Cursive Text,” Ph.D. thesis, Department of Electronic and Electrical Engineering, Loughborough University, Loughborough, U.K., 1996.
[2] I.S. Abuhaiba, “A Discrete Arabic Script for Better Automatic Document Understanding,” The Arabian J. Science and Eng., vol. 28, pp. 77-94, 2003.
[3] Albadr B, Haralick R, Segmentation-free approach to text recognition with application to Arabic text, International Journal on Document Analysis and Recognition, (1998) 1: 147-166.
[4] Alemami S, Usher M. On-line recognition of handwritten Arabic characters. IEEE Trans Pattern Analysis and Machine Intelligence 1990; 12(7): 704–710
[5] S. Alma’adeed, C. Higgens, and D. Elliman, “Off-Line Recognition of Handwritten Arabic Words Using Multiple Hidden Markov Models,” Knowledge-Based Systems, vol. 17, pp. 75-79, 2004.
[6] S. Alma’adeed, C. Higgens, and D. Elliman, “Recognition of Off-Line Handwritten Arabic Words Using Hidden Markov Model Approach,” Proc. 16th Int’l Conf. Pattern Recognition, vol. 3, pp. 481-484, 2002.
[7] S. Alma’adeed, D. Elliman, and C.A. Higgins, “A Data Base for Arabic Handwritten Text Recognition Research,” Proc. Eighth Int’l Workshop Frontiers in Handwriting Recognition, pp. 485-489, 2002.
[8] H. Almuallim and S. Yamaguchi, “A Method of Recognition of Arabic Cursive Handwriting,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 9, pp. 715-722, 1987.
[9] Al-Yousefi H, Udpa S. Recognition of Arabic characters. IEEE Transactions on Pattern Analysis and Machine Intelligence 1992; 14(8): 853-857.
[10] Amin A. Recognition of printed Arabic text using machine learning. Proceedings of the International Society of Optical Engineers, SPIE, 1998; 3305:63-70.
[11] Amin A, Mari J. Machine recognition and correction of printed Arabic text. IEEE Transactions on Man, Machine and Cybernetics 1989; 19(5): 1300-1306.
[12] Arica N, Yarman-Vural FT. Optical character recognition for cursive handwriting. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002; 24(6): 801- 813.
[13] Bahri, Z.; and Kumar, B. V. K. 1988. Generalized Synthetic Discriminant Functions. J. Opt. Soc. Am. A 5 (4) 562–571
[14] I. Bazzi, R. Schwartz, and J. Makhoul, “An Omnifont Open-Vocabulary OCR System for English and Arabic,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, pp. 495-504, 1999.
[15] Bouslama F, Kishibe H. Fuzzy logic in the recognition of printed Arabic text. IEEE Transactions on 1999: 1150-1154.
[16] Chong-Wah Ngo, Chi-Kwong Chan, ‘Video text detection and segmentation for optical character recognition ‘Multimedia Systems 10: 261–272 (2005), Digital Object Identifier (DOI) 10.1007/s00530-004-0157-0
[17] Christian Wolf and Jean-Michel Jolion, ‘Extraction and Recognition of Artificial Text in Multimedia Documents’, Technical Report RFV RR-2002.01, Laboratoire Reconnaissance de Formes et Vision, INSA de Lyon, Bât. J. Verne, 20 Av. Albert Einstein, 69621 Villeurbanne Cedex, France
[18] Chuan-Jie Lin, Che-Chia Liu, Hsin-Hsi Chen, ‘A Simple Method for Chinese Video OCR and Its Application to Question Answering ‘ Computational Linguistics and Chinese Language Processing Vol. 6, No. 2, August 2001, pp. 11-30 The Association for Computational Linguistics and Chinese Language Processing
[19] Datong Chen, Kim Shearer and Hervé Bourlard, ‘Video OCR for Sport Video Annotation and Retrieval ‘ Proceedings of then 8th International Conference on Mechatronics and Machine Vision in Practice, (M2VIP 2001), Hong Kong 27-29 August 2001 ., pp57-62
[20] David A. Forsyth and Jean Ponce, ‘Computer Vision: A Modern Approach’, Pearson Prentice Hall, NJ, USA, 2003
[21] Ejaz Raqum, Usool-o-Qavaid Khush Naveesi, Mir Muhammad Kutub Khana, Arram Bagh, Karachi, Pakistan
[22] R. El-Hajj, L. Likforman-Sulem, and C. Mokbel, “Arabic Handwriting Recognition Using Baseline Dependant Features and Hidden Markov Modeling,” Proc. Int’l Conf. Document Analysis and Recognition, pp. 893-897, 2005
[23] Elaine Rich and Kevin Knight, ‘Artificial Intelligence’, 2nd Edition, McGraw Hill, NY, USA, 1991
[24] Erlandson E, Trenkle J, Vogt R, Word level recognition of multifont Arabic text using a feature vector matching approach, Proceedings of International Society for Optical Engineers, SPIE, 1996; 2660: 63-70.
[25] Faisal Shafait, Adnan-ul-Hasan, Daniel Keysers, and Thomas M. Breuel, Layout Analysis of Urdu Document Images. Proceedings of 10th IEEE International Multitopic Conference, WMIC ’06, Islamabad, Pakistan, Dec. 2006.
[26] Fan X, Verma B. Segmentation vs. Non-Segmentation Based Neural Techniques for Cursive Word Recognition: An Experimental Analysis. Fourth International
Conference on Computational Intelligence and Multimedia Applications (ICCIMA'01) P. 251.
[27] N. Farah, L. Souici, L. Farah, and M. Sellami, “Arabic Words Recognition with Classifiers Combination: An Application to Literal Amounts,” Proc. Artificial Intelligence: Methodology, Systems, and Applications, pp. 420-429, 2004.
[28] H. Goraine, M. Usher, and S. Al-Emami, “Off-Line Arabic Character Recognition,” Computer, vol. 25, pp. 71-74, 1992.
[29] Hadjar, K., Ingold, R. (2003) Arabic News paper segmentation. In: Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), pp 895-899. IEEE COMPUTER SOCIETY.
[30] Hamid, A. and Haraty, R.,“A Neuro-Heuristic Approach for Segmenting Handwritten Arabic Text”, ACS/IEEE International Conference on Computer Systems and Applications, Beirut, Lebanon, 25-06-2001 – 29-06-2001, pp: 110-113.
[31] Inam Shamsher, Zaheer Ahmad, Jehanzeb Khan Orakzai, and Awais Adnan, OCR For Printed Urdu Script Using Feed Forward Neural Network, Proceedings of World Academy of Science, Engineering and Technology volume 23 August 2007 ISSN 1307-6884.
[32] Jean-Marc Odobez and Datong Chen, ‘Robust Video Text Segmentation and Recognition with Multiple Hypotheses’, International Conference on Image Processing, Rochester, New York, USA, Sept. 22-25, 2002, Vol. 2, IEEE, pp. 433-436
[33] Jelodar MS, Fadaeieslam MJ, Mozayani N, Fazeli M. A Persian OCR System using Morphological Operators. Transactions On Engineering, Computing And Technology, Vol. 4, February 2005.
[34] Jinsik Kim, Taehun Kim, Jiexi Lin, ‘Implementation of a Video Text Detection System’, CS570 Artificial Intelligence, Team Foxtrot, Korean Advanced Institute of Science & Technology (KAIST), South Korea 305-701, CS570-2004
[35] John E. Hopcroft, Rajiv Motwani and Jeffrey D. Ullman. “Introduction to Automata theory, Languages and Computation”, 2nd edition, Addison-Wesley, 2000.
[36] Joseph D. Becker, Arabic Word Processing, Communications of the ACM Volume 30, Issue 7 (July 1987) Pages: 600 – 610, ISSN:0001-0782
[37] Jürgen Schürmann, Norbert Bartneck, Thomas Bayer, Jürgen Franke, Eberhard Mandler, and Matthais Oberländer, “Document Analysis-From Pixels to Contents” Proceedings of the IEEE VOL.80 No.7, July1992, pp.1101-1119
[38] Kamal Mansour, Guidelines to Use of Arabic Characters, 24th Internationalization & Unicode Conference, Atlanta, GA September 2003
[39] T. Kanungo, G. A. Marton, and O. Bulbul, "OmniPage vs. Sakhr: Paired Model Evaluation of Two Arabic OCR products," Proceedings of SPIE Conference on Document Recognition, San Jose, CA, Vol. 3651, pp. 109-120, 1999.
[40] T. Kanungo, G. E. Marton, and O. Bulbul, “Performance Evaluation of Two Arabic OCR Products”, Proc. of AIPR Workshop on Advances in Computer Assisted Recognition, SPIE Vol. 3584, Washington DC, 1998.
[41] M. S. Khorsheed, “Offline recognition of omnifont Arabic text using the HMM ToolKit (HTK)”, Pattern Recognition Letters 28 (2007) pp. 1563-1571.
[42] Khorsheed MS, Clocksin WF. Spectral features for Arabic word recognition. The IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP’2000, Istanbul, Turkey, June 5-9, 2000, pp. 3574-3577
[43] Khorsheed MS, Clocksin WF. Structural features of cursive Arabic script.’ Proceedings of the Tenth British Machine Vision Conference, Nottingham, UK, 1999; 2: pp. 422–431
[44] Kraus E, Dougherty E. Segmentation-free morphological character recognition. Proceedings of the International Society for Optical Engineers, SPIE, 1994; 2181: 14–23
[45] Lee Y, Papineni K, Roukos S, Emam O, Hassan H. Language model based Arabic word segmentation.
[46] S. M. Lodhi and M. A. Matin, “Urdu character recognition using Fourier descriptors for optical networks”, Photonic Devices and Algorithms for Computing VII, Proc. of SPIE Vol. 5907-59070O, (2005).
[47] L. Lorigo and V. Govindaraju, “Segmentation and Pre-Recognition of Arabic Handwriting,” Proc. Int’l Conf. Document Analysis and Recognition, pp. 605-609, 2005.
[48] A. Mahalanobis, B. V. K. Vijaya Kumar, “On the optimality of the MACH filter for detection of targets in noise” Optical Engineering, Special Issue on Correlation Pattern Recognition, Vol. 36 (10), pp. 2642-2648,October 1997
[49] A. Mahalanobis, B. V. K. Vijaya Kumar, S. R. F. Sims, and J. F. Epperson. “Unconstrained Correlation Filters,” Applied Optics, Vol. 33, pp. 3751-3759, 1994
[50] Majid M. Altuwaijri and Magdy A. Bayoumi, “Arabic Text Recognition Using Neural Networks”, IEEE International Symposium on Circuits and Systems, ISCAS’94, London, UK, 30 May- 02 June,1994, Vol. 6, pp. 415-418.
[51] Malik, S., Khan, S.A. (2005) Urdu Online Handwriting Recognition. In: IEEE International Conference on Emerging Technologies, pp 27-31. Islamabad, Pakistan
[52] Motawa D, Amin A, Sabourin R, Segmentation of Arabic Cursive Script. Proceedings of the 4th International Conference on Document Analysis and Recognition, 1997: 625 – 628.
[53] Nevenka Dimitrova, Lalitha Agnihotri, Chitra Dorai, Ruud Bolle, ‘MPEG-7 Videotext description scheme for superimposed text in images and video’, Elsevier Signal Processing: 2000
[54] Obaid AM, Dobrowiecki TP. Heuristic Approach to the Recognition of Printed Arabic Script.
[55] U. Pal and Anirban Sarkar, “ Recognition for Printed Urdu Script”, Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), IEEE COMPUTER SOCIETY, Edinburgh, Scotland, Aug, 3-6, 2003, Vol. 2, pp. 1183-1187.
[56] B. Parhami and M. Taraghi, “Automatic Recognition of Printed Farsi Texts,” Pattern Recognition, vol. 14, pp. 395-403, 1981.
[57] Parker JR. Algorithms For Image Processing and Computer Vision. John Wiley & Sons, 1997
[58] M. Pechwitz and V. Ma¨rgner, “HMM Based Approach for Handwritten Arabic Word Recognition Using the IFN/ENIT-Database,” Proc. Int’l Conf. Document Analysis and Recognition, pp. 890-894, 2003.
[59] Rabiner L, Juang B. Fundamentals of Speech Recognition. Prentice Hall, 1993
[60] Rainer Lienhart, Frank Stuber.Automatic text recognition in digital videos. Technical Report TR-95-036, Department for Mathematics and Computer Science, University of Mannheim, 1995.
[61] Rainer Lienhart and Wolfgang Effelsberg, ‘Automatic Text Segmentation and Text Recognition for Video Indexing ‘Submitted to ACM/Springer Multimedia Systems Magazine, 9/98
[62] R. Ramsis, S.S. El-Dabi, and A. Kamel, Arabic Character Recognition System, IBM Kuwait Scientific Centre, report No. KSC027, 1988.
[63] P. Refregier. Optimal trade-off filters for noise robustness, sharpness of the correlation peak and Horner efficiency. Opt. Lett. 16 (11), 829–831, 1991
[64] Reza Safabakhsh and Peyman Abidi, “Nastaaligh Handwritten Word Recognition Using a Continuous-Density Variable-Duration HMM”, The Arabian Journal for Science and Engineering, Volume 30, Number 1B, April 2005, pp. 95-118
[65] Shunji Mori, Hirobumi Nishida, Hiromitsu Yamada, Optical Character Recognition, John Wiley and Sons, New York, USA, 1999.
[66] Sohail Abdul Sattar, Syed Salahuddin Hyder, Mahmood Khan Pathan, “Problems of Nastalique OCR: A comparison of Nastalique and Roman script OCRs,” Proceedings of ICCCE ’06, International Conference on Computer and Communication Engineering, organized by IEEE and the Faculty of Engineering, International Islamic University, Malaysia, May 9-11, 2006, Kuala Lumpur, Vol. 2, pp. 1066-1071
[67] Tolba M, Shaddad E. On the automatic reading of printed arabic characters. Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Los Angeles, CA, 1990; 496–498
[68] A. VanderLugt, “Signal detection by complex spatial filtering,” IEEE Transaction on Information Theory, Vol 10, pp. 139-145, 1964
[69] B. V. K. Vijaya Kumar, “Tutorial survey of composite filter designs for optical correlators,” Applied Optics, Vol. 33, pp. 4773-4801, 1992
[70] B. V. K. Vijaya Kumar, Abhijit Mahalanobis and Richard D. Juday, Correlation Pattern Recognition, Cambridge University Press, Cambridge, UK, 2005
[71] Wing Hang Cheung, Ka Fai Pang, Michael R. Lyu, Kam Wing Ng, and Irwin King. Chinese optical character recognition for information extraction from video images. In Hamid R. Arabnia, editor, Proceedings of The 2000 International Conference on Imaging Science, Systems and Technology (CISST'2000) Volumn One, pages 269-- 275. CSREA Press, 2000
[72] Xian-Sheng Hua, Pei Yin, HongJiang Zhang: Efficient video text recognition using multiple frame integration. International Conference on Image Processing, Rochester, New York, USA, Sept. 22-25, 2002, Vol. ICIP (2) 2002: 397-400, IEEE
[73] Zaheer Ahmad, Jehanzeb Khan Orakzai, Inam Shamsher, and Awais Adnan, Urdu Nastaleeq Optical Character Recognition , Proceedings of World Academy of Science, Engineering and Technology volume 26 December 2007 ISSN 1307-6884.
[74] A. Zahour, B. Taconet, and A. Faure, "Machine Recognition of Arabic Cursive Writing", in From Pixels to Features III:Frontiers in Handwriting Recognition, ed. S. Impedovo and J.C. Simon. Amsterdam: Elsevier Science Publishers B.V., 1992, pp. 289-296.
[75] Zheng L, Hassin AH, Tang X. A new algorithm for machine printed Arabic character segmentation. Pattern Recognition Letters 25, 2004; 1723-1729.
[76] H. Zhou and T.-H. Chao. MACH filter synthesising for detecting targets in cluttered environment for gray-scale optical correlator. Proc. SPIE 715, 394–398, 1999
[77] Zidouri A. Sarfraz M. Shahab SA, Jafri SM. Adaptive dissection based subword segmentation of printed Arabic text. IEEE Transactions on 2005: 239-243.
[78] http://en.wikipedia.org/wiki/List_of_languages_by_writing_system
[79] http://en.wikipedia.org/wiki/Magnetic_Ink_Character_Recognition
[80] http://en.wikipedia.org/wiki/Nastaliq
[81] http://en.wikipedia.org/wiki/Optical_character_recognition
List of Publications

Below is a list of publications produced as part of this research work.
[1] Sohail Abdul Sattar, Syed Salahuddin Hyder, Mahmood Khan Pathan, “Problems of Nastaliq OCR: A comparison of Nastaliq and Roman script OCRs” Proceedings of the ICCCE 06 , International Conference on Computer and Communication Engineering in collaboration with IEEE, May 9 – 11, 2006, Kuala Lumpur, Malaysia, Vol. 2, pp 1066-1071.
[2] Sohail A. Sattar, Shamsul Haque and Mahmood K. Pathan, “Nastaliq Optical Character Recognition”. Proceedings of ACM SE 2008 Conference, Auburn, Alabama, USA, March 27-28, 2008, proceedings CD.
[3] Sohail A. Sattar, Shamsul Haque, Mahmood K. Pathan and Quintin Gee, “Implementation Challenges for Nastaliq Character Recognition”. Communications in Computer and Information Science Vol. 20, ISSN 1865-0929, 2008, Springer- Verlag, Berlin, Germany, pp 279-285.
[4] Sohail Abdul Sattar, Shamsul Haque and Mahmood Khan Pathan “Segmentation of Nastaliq Script for OCR” International Conference on Computing and Informatics (ICOCI-09), Kuala Lumpur, Malaysia, June 24-25, 2009.