Multistage Hybrid Arabic/Indian Numeral OCR System
Total Page:16
File Type:pdf, Size:1020Kb
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 1, 2010 Multistage Hybrid Arabic/Indian Numeral OCR System Yasser M. Alginaih, Ph.D., P.Eng. IEEE Member Abdul Ahad Siddiqi, Ph.D., Member IEEE & PEC Dept. of Computer Science Dept. of Computer Science Taibah University Taibah University Madinah, Kingdom of Saudi Arabia Madinah, Kingdom of Saudi Arabia [email protected] [email protected] Abstract— The use of OCR in postal services is not yet numeral OCR systems for Postal services have been used universal and there are still many countries that process in some countries, but still there are problems in such mail sorting manually. Automated Arabic/Indian numeral systems, stemming from the fact that machines are unable Optical Character Recognition (OCR) systems for Postal to read the crucial information needed to distribute the services are being used in some countries, but still there are mail efficiently. Historically, most civilizations have errors during the mail sorting process, thus causing a different symbols that convey numerical values, but the reduction in efficiency. The need to investigate fast and Arabic version is the simplest and most widely efficient recognition algorithms/systems is important so as to correctly read the postal codes from mail addresses and to acceptable. In most Middle Eastern countries both the eliminate any errors during the mail sorting stage. The Arabic (0,1,2,3,4,5,6,7,8,9) and Indian ۷,۸,۹) numerals are used. The objective of,٤,٥,٦,objective of this study is to recognize printed numerical (۰,۱,۲,۳ postal codes from mail addresses. The proposed system is a this work is to develop a numeral Arabic/Indian OCR multistage hybrid system which consists of three different system to recognize postal codes from mail letters feature extraction methods, i.e., binary, zoning, and fuzzy processed in the Middle Eastern countries. A brief history features, and three different classifiers, i.e., Hamming Nets, on the development of postal services is qouted from [1]. Euclidean Distance, and Fuzzy Neural Network Classifiers. “The broad development of mechanization in postal The proposed system, systematically compares the operations was not applied until the mid-1950s. The performance of each of these methods, and ensures that the numerals are recognized correctly. Comprehensive results translation from mechanization to automation of the U.S. provide a very high recognition rate, outperforming the Postal Services (USPS) started in 1982, when the first other known developed methods in literature. optical character reader was installed in Los Angeles. The introduction of computers revolutionized the postal industry, and since then, the pace of change has Keywords-component; Hamming Net; Euclidean Distance; accelerated dramatically [1].” Fuzzy Neural Network; Feature Extration; Arabic/Indian Numerals In the 1980s, the first OCRs were confined to reading the Zip Code. In the 1990s they expanded their I. INTRODUCTION capabilities to reading the entire address, and in 1996, the Optical Character Recognition (OCR) is the electronic Remote Computer Reader (RCR) for the USPS could translation of images of printed or handwritten text into recognize about 35% of machine printed and 2% of machine-editable text format; such images are captured handwritten letter mail pieces. Today, modern systems through a scanner or a digital camera. The research work can recognize 93% of machine-printed and about 88% of in OCR encompasses many different areas, such as handwritten letter mail. Due to this progress in pattern recognition, machine vision, artificial intelligence, recognition technology the most important factor in the and digital image processing. OCR has been used in efficiency of mail sorting equipment is the reduction of many areas, e.g., postal services, banks, libraries, cost in mail processing. Therefore, a decade intensive museums to convert historical scripts into digital formats, investment in automated sorting technology, resulted in automatic text entry, information retrieval, etc. high recognition rates of machine-printed and handwritten addresses delivered by state-of- the-art systems [1 – 2] The objective of this work is to develop a numerical OCR system for postal codes. Automatic Arabic/Indian 9 http://sites.google.com/site/ijcsis/ ISSN 1947-5500 (IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 1, 2010 According to the postal addressing standards [3], a been implemented. Many OCR systems are available in standardized mail address is one that is fully spelled out the market, which are multi font and multilingual. and abbreviated by using the postal services standard Moreover, most of these systems provide high recognition abbreviations. The standard requires that the mail rate for printed characters. The recognition rate is addressed to countries outside of the USA must have the between 95% - 100%, depending on the quality of the address typed or printed in Roman capital letters and scanned images, fed into the systems, and the application Arabic numerals. The complete address must include the it is used for [9]. The Kingdom of Saudi Arabia has also name of addressee, house number with street address or initiated its efforts in deploying the latest technology of box number/zip code, city, province, and country. automatic mail sorting. It is reported in [10], that Saudi Examples of postal addresses used in the Middle East are Post has installed an advanced Postal Automation System, given in table 1. working with a new GEO-data based postal code system, an Automatic Letter Sorting Machine, and an OCR for TABLE 1: Examples of postal addresses used in the Middle East simultaneous reading of Arabic and English addresses. It comprises components for automatic forwarding, Address with Arabic numerals Address with Indian Numerals sequencing, and coding ﺍﻟﺴﻴﺪ ﻣﺤﻤﺪ ﻋﻠﻲ Mr. Ibrahim Mohammad ﺹ. ﺏ: ٥۲۱۰٦ P.O. Box 56577 [In his in-depth research study, Fujisawa, in [11 ﺍﻟﺮﻳﺎﺽ: RIYADH 11564 ۱۲۳٤٥ reports on the key technical developments for Kanji ﺍﻟﻤﻤﻠﻜﺔ ﺍﻟﻌﺮﺑﻴﺔ ﺍﻟﺴﻌﻮﺩﻳﺔ SAUDI ARABIA (Chinese character) recognition in Japan. Palumbo and Standards are being developed to make it easy to Srihari [12] described a Hand Written Address integrate newer technologies into available components Interpretation (HWAI) system, and reported a throughput instead of replacing such components, which is very rate of 12 letters per second. An Indian postal automation costly; such standards are the OCR/Video Coding based on recognition of pin-code and city name, proposed Systems (VCS) developed by the European Committee by Roy et al in [13] uses Artificial Neural Networks for for standardization. The OCR/VCS enables postal the classification of English and Bangla postal zip codes. operators to work with different suppliers on needed In their system they used three classifiers for the replacements or extensions of sub-systems without recognition. The first classifier deals with 16-class incurring significant engineering cost [1] [4]. problem (because of shape similarity the number is reduced from 20) for simultaneous recognition of Bangla Many research articles are available in the field of and English numerals. The other two classifiers are for automation of postal systems. Several systems have been recognition of Bangla and English numerals, individually. developed for address reading, such as in USA [5], UK Ming Su et al. [14], developed an OCR system, where the [6], Japan [7], Canada [8], etc. But very few countries in goal was to accomplish the automatic mail sorting of the Middle East use automated mail-processing systems. Chinese postal system by the integration of a mechanized This is due to the absence of organized mailing address sorting machine, computer vision, and the development of systems, thus current processing is done in post offices OCR. El-Emami and Usher [15] tried to recognize postal which are limited and use only P.O. boxes. Canada Post address words, after segmenting these into letters. A is processing 2.8 billion letter mail pieces annually structural analysis method was used for selecting features through 61 Multi-line Optical Character Readers of Arabic characters. On the other hand, U.Pal et.al., (MLOCRs) in 17 letter sorting Centers. The MLOCR – [16], argues that under three-language formula, the Year 2000 has an error rate of 1.5% for machine print destination address block of postal document of an Indian reading only, and the MLOCR/RCR – Year 2003 has an state is generally written in three languages: English, error rate of 1.7% which is for print/script reading. Most Hindi and the State official language. Because of inter- of these low read errors are on handwritten addresses and mixing of these scripts in postal address writings, it is on outgoing foreign mail [9]. very difficult to identify the script by which a pin-code is written. In their work, they proposed a tri-lingual The postal automation systems, developed so far, are (English, Hindi and Bangla) 6-digit full pin-code string capable of distinguishing the city/country names, post and recognition, and obtained 99.01% reliability from their zip codes on handwritten machine-printed standard style proposed system whereas error and rejection rates were envelopes. In these systems, the identification of the 0.83% and 15.27%, respectively. In regards to postal addresses is achieved by implementing an address recognizing the Arabic numerals, Sameh et.al. [17], recognition strategy that consists of a number of stages, described a technique for the recognition of optical off- including pre-processing, address block location, address line handwritten Arabic (Indian) numerals using Hidden segmentation, character recognition, and contextual post Markov Models (HMM). Features that measure the image processing. The academic research in this area has characteristics at local, intermediate, and large scales provided many algorithms and techniques, which have were applied. Gradient, structural, and concavity features 10 http://sites.google.com/site/ijcsis/ ISSN 1947-5500 (IJCSIS) International Journal of Computer Science and Information Security, Vol.