Comparison of Optical Character Recognition (OCR) Software


Angelica Gabasio
May 2013

Master's thesis work carried out at Sweco Position.
Supervisors: Björn Harrtell, Sweco; Tobias Lennartsson, Sweco
Examiner: Jacek Malec, Lund University

The purpose of the thesis was to test several optical character recognition (OCR) tools to find out which one gives the best output. Different kinds of images were OCR scanned to test the software. The output from the OCR tools was automatically compared to the correct text, and the error percentage was calculated.

Introduction

Optical character recognition (OCR) is used to extract plain text from images containing text. This is useful when information in paper form is going to be digitized. OCR is applied to the images to make the text editable or searchable. Another way of using OCR scanning is to process forms automatically.[1]

If the output from the OCR tool contains many errors, fixing them requires a lot of manual work. This is why it is important to get a correct or almost correct output when using OCR. The output from different OCR tools may differ a lot, and the purpose of this comparison is to find out which OCR software to use for OCR scanning.

OCR steps

As seen in Figure 1, the OCR software takes an image, applies OCR to it and outputs plain text. Figure 1 also shows the basic steps of the OCR algorithm, which are explained below[1, 2]:

Preprocessing

Preprocessing can be done either manually before the image is given to the OCR tool, or internally in the software. The most common preprocessing steps are: 1) converting the image to black and white only, 2) removing noise, and 3) rotating the image to make the text as horizontal as possible. The purpose of preprocessing is to modify the image in a way that makes the recognition as successful as possible.
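The first two preprocessing operations can be illustrated with a minimal sketch in Python. It is not taken from the thesis or from any of the tested tools: the function names `binarize` and `median_filter`, the fixed threshold of 128 and the representation of an image as a list of rows of grayscale values (0 to 255) are all assumptions made for illustration. Rotation (deskewing) is left out because it needs an estimate of the skew angle, which is beyond a short example.

```python
def binarize(image, threshold=128):
    """Map each grayscale pixel (0-255) to 0 (black) or 255 (white).

    The threshold of 128 is an illustrative assumption, not a value
    used by any of the tools in the comparison.
    """
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in image]


def median_filter(image):
    """Remove isolated specks with a 3x3 median filter.

    Each pixel is replaced by the median of its neighbourhood; at the
    borders the neighbourhood is clipped to the pixels that exist.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = sorted(
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            )
            # Upper median, so a binary image stays binary.
            row.append(neighbours[len(neighbours) // 2])
        result.append(row)
    return result


if __name__ == "__main__":
    # A white 3x3 patch with a single black noise pixel in the middle.
    noisy = [[255, 255, 255], [255, 0, 255], [255, 255, 255]]
    print(median_filter(binarize(noisy)))
```

Applying the filter after binarization keeps the example simple; a real pipeline might denoise first or use an adaptive threshold instead.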
Layout analysis

This step determines how to read the text in the image. It can include identification of columns or images in the text.

Recognition

The recognition step is when the OCR tool decides which character to output. The software extracts each character from the image and uses a database to match it to the (hopefully) correct character.

Figure 1: OCR - the basic steps (a text image is passed to the OCR tool, which performs preprocessing, layout analysis and recognition, and outputs plain text)

Method

Software

Nine different OCR tools have been used, four of them commercial and five open source. The tools included in the comparison are Tesseract, Ocrad, CuneiForm, GOCR, OCRopus, TOCR, Abbyy CLI OCR, Leadtools OCR SDK and OCR API Service. TOCR, Abbyy, Leadtools and OCR API Service are commercial tools; the rest are open source.

Input

Since it is important to get a good output from the OCR scan, different kinds of images have been tested to see how successful the tools are on images of different quality. The types of images used are skewed images, images containing handwriting, noisy and stained images, light images, images with pictures in the text, and images with underlined text. Most of the text in the images is in Swedish, but some images contain text in English as well.

Comparison

To decide how correct the OCR output is, the Wagner-Fischer string comparison algorithm is used. The algorithm calculates the edit distance between two strings. The edit distance is the minimum number of character insertions/deletions/substitutions needed to change one string into the other.[3] The edit distance is used to get a percentage value of the error:

$$\mathrm{error}(s_1, s_2) = \frac{\mathrm{edit\ distance}(s_1, s_2)}{\max\{|s_1|, |s_2|\}} \cdot 100,$$

where $|s|$ is the length of the string $s$.
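The edit distance and the error measure defined above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the code used in the thesis: the function names `edit_distance` and `error_percentage` are hypothetical, and the handling of two empty strings (returning 0) is a choice made here because the formula is undefined in that case.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Wagner-Fischer dynamic programme for the edit distance.

    Only the previous row of the table is kept, so memory use is
    proportional to the length of s2 rather than to len(s1) * len(s2).
    """
    previous = list(range(len(s2) + 1))  # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute c1 with c2 (or keep it)
            ))
        previous = current
    return previous[-1]


def error_percentage(ocr_output: str, correct_text: str) -> float:
    """Edit distance divided by the longer string's length, times 100."""
    longest = max(len(ocr_output), len(correct_text))
    if longest == 0:
        return 0.0  # assumption: two empty strings count as error-free
    return edit_distance(ocr_output, correct_text) / longest * 100


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(error_percentage("abcd", "abcf"))
```

Because the comparison is made character by character, a missing newline in the OCR output counts as one edit, just like a misread letter.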
Result

The mean error, in increasing order, from the tested OCR tools on all of the images is:

TOCR              8.79%
Leadtools         20.06%
Abbyy             24.53%
OCR API Service   27.81%
OCRopus           31.42%
Tesseract         33.9%
CuneiForm         37.68%
Ocrad             50.25%
GOCR              64.46%

Conclusion

The four commercial tools in the comparison are more accurate than any of the open source tools. TOCR gives the most accurate output on most of the scanned images. In some of the images, TOCR generates the correct text but some whitespace between the rows is missing. The comparison algorithm compares the strings character by character, and will result in an error if a new line is missing.

The comparison shows that it probably is a good thing to invest in a commercial tool, preferably TOCR, since it is much more accurate than any of the other tools.

Software used

• Tesseract-ocr, http://code.google.com/p/tesseract-ocr/
• Ocrad, http://www.gnu.org/software/ocrad/
• CuneiForm, http://cognitiveforms.com/ru.html#1189-CuneiForm
• GOCR, http://jocr.sourceforge.net/
• OCRopus, https://code.google.com/p/ocropus/
• TOCR, http://www.transym.com/tocr-the-integrators-choice.htm/
• Abbyy CLI OCR, http://www.ocr4linux.com/
• Leadtools OCR SDK, http://www.leadtools.com/sdk/ocr/
• OCR API Service, http://ocrapiservice.com/

References

[1] Inad Aljarrah, Osama Al-Khaleel, Khaldoon Mhaidat, Mu'ath Alrefai, Abdullah Alzu'bi, Mohammad Rabab'ah, Automated System for Arabic Optical Character Recognition with Lookup Dictionary, Journal of Emerging Technologies in Web Intelligence, Vol. 4, Issue 4, Nov 2012, pp. 362-370

[2] Tobias Blanke, Michael Bryant, Mark Hedges, Open source optical character recognition for historical research, Journal of Documentation, Vol. 68, Issue 5, 2012, pp. 659-683

[3] Robert A. Wagner, Michael J. Fischer, The String-to-String Correction Problem, Journal of the Association for Computing Machinery, Vol. 21, No. 1, January 1974, pp. 168-173