Optical Character Recognition

Total Page:16

File Type:pdf, Size:1020Kb

Optical Character Recognition OPTICAL CHARACTER RECOGNITION Màster de Visió per Computador Curs 2006 - 2007 Outline • Introduction • Pre-processing (document level) – Binarization – Skew correction • Segmentation – Layout analysis – Character segmentation • Pre-processing (character level) • Feature extraction – Image-based features – Statistical features – Transform-based features – Structural features • Classification • Post-proccessing – Classifier combination – Exploitation of context information • Examples of OCR systems • Bibliography 2 Outline 1 Optical Character Recognition Pattern Recognition Statistical Structural Image Processing Pattern Recognition Pattern Recognition Methods Applications Optical Document Character Analysis Recognition 3 Introduction Some examples Drawings, maps Books, journals, reports License plates Identity PDAs Cards Old documents Cheques, bills Postal addresses Quality control 4 Introduction 2 Document Image Analysis G. Nagy: Twenty years of document image analysis in PAMI. IEEE Trans. on PAMI, vol. 22, nº 1, pp. 38-62 January 2000. Document image analysis is the subfield of digital image processing that aims at converting document images to symbolic form for modification, storage, retrieval, reuse and transmission. Document image analysis is the theory and practice of recovering the symbol structure of digital images scanned from paper or produced by computer. What is a document? Objects created expressly to convey information encoded as iconic symbols – Scanned images from paper documents – Electronic documents – Multimedia documents (video with text) –… 5 Introduction Applications of DIA Document Imaging: •Digitization •Storage •Compression •Re-printing Applications of DIA Document understanding •Recognition •Interpretation •Indexing •Retrieval 6 Introduction 3 DIA tasks Mostly text Mostly graphics •Acquisition •Acquisition •Binarization •Binarization Document Imaging •Filtering •Filtering •Skew correction •Vectorization •Segmentation •Text-graphics •Layout analysis separation Document Understanding •OCR •Symbol recognition •Interpretation 7 Introduction Outline of the course Focus: Document understanding of mostly text documents 1. Acquisition 2. Pre-processing − Binarization − Skew correction 3. Layout analysis 4. Character segmentation 5. OCR − Feature extraction − Classification − Post-processing 8 Introduction 4 Categorization of Character Recognition Optical Character Recognition According to the type of writing Machine-printed Hand-written character recognition character recognition According to the type of acquisition Off-line On-line character recognition character recognition 9 Introduction Machine-printed character recognition • Characters are totally defined by the font type: – Dimensions (segmentation) • Character width • Inter-character separation • Character height – Shape (recognition) • Typographic effects (boldface, italics, underline). •Challenges: – Similar shapes among characters – Multiple fonts – Joined characters – Digitization noise: broken lines, random noise, heavy characters, etc. – Document degradation: old documents, photocopies, etc. 10 Introduction 5 Machine-printed character recognition • Classification of machine-printed OCR systems – Monofont: • One single type of font – Multifont: • Recognition of a fixed and known set of fonts • It is necessary to identify and learn the differences between characters of all the types of fonts – Omnifont: • Recognition of any arbitrary type of font, even if it has not been previously learned 11 Introduction Off-line hand-written character recognition • Hand-written • Off-line: acquisition by a scanner or a camera • Challenges: – Shape variability among images of the same character – Character segmentation • Subproblems: – Hand-written numeral recognition: digit recognition – Hand-printed character recognition: well-separated characters – Cursive character recognition: non-separated characters 12 Introduction 6 On-line hand-written character recognition • On-line acquisition – Digitizer tablets – Digital Pen – Tablet PC • Advantages with respect to off-line acquisition: – Image is acquired while the text is written – We can take advantage of dynamic information: • Temporal information: writing order, stroke segmentation, etc • Writing speed • Pen pressure • Subproblems: – Cursive script recognition. – Signature verification/recognition. 13 Introduction Levels of difficulty in character recognition •S.Mori, H.Nishida, H.Yamada. Optical Character Recognition. John Wiley and sons. 1999. •S.V. Rice, G. Nagy, T.A. Nartker. Optical Character Recognition: An illustrated guide to the frontier. Kluwer Academic Publishers. 1999. • Little shape variability 0.0. Printed characters. Specific font. Constant size. Roman • Small number of characters alphabets. • Little noise 0.1. Constrained hand-printed characters. Arabic numerals. Level 0 • Medium variation in shape 1.0. Printed characters. Multiple fonts. Nº characters < 100 • Medium noise 1.1. Loosely constrained hand-printed char. Nº char < 100 1.2. Chinese characters of few fonts Level 1 1.3. Loosely constrained hand-printed char. Nº char≈1000 • Much variation in shape 2.0. Printed characters of multiple fonts • Heavy noise 2.1. Unconstrained hand-printed characters 2.2. Affine transformed characters Level 2 • Nonsegmented strings of 3.0. Touching/broken characters characters 3.1. Cursive handwriting characters Level 3 3.2. Characters on a textured background 14 Introduction 7 Levels of difficulty in character recognition Level 0 0.0. Printed character of a specific font with a constant size • Constant size • Connectivity of characters • Variation in the stroke thickness • Little noise 0.1. Constrained hand-printed characters • Characters are written according to some instructions or box guidelines Solved problem 15 Introduction Levels of difficulty in character recognition Level 1 1.0. Printed characters of multiple fonts 1.1. Loosely constrained hand-printed characters 1.2. Chinese characters of few fonts. 1.3. Loosely constrained hand-printed characters. Nº characters ≈ 1000 Solved problem 16 Introduction 8 Levels of difficulty in character recognition Level 2 2.0. Printed characters of multiple fonts 2.1. Unconstrained hand-printed characters 2.2. Affine transformed characters 17 Introduction Levels of difficulty in character recognition Level 3 3.0. Caràcters no separats o trencats 3.1. Cursive handwriting characters 3.2. Characters on a textured background 18 Introduction 9 Databases for OCR Off-line Hand-written characters On-line Machine- characters printed CEDAR CENPARMI NIST UNIPEN Univ. Washington • 50.000 • 17.000 • More than • Definition of a • More than segmented manually 1.000.000 format to 1500 pages of numerals from segmented characters from represent on- articles in zip codes numerals from forms line data english • 5.000 zip codes zip codes • Several learning • 4.500.000 • More than 500 • 5.000 city and test sets characters pages of articles names (more • Segmented in japanese • 9.000 state variability) characters, • Originals, names • 91.500 words and photocopies and sentences with a sentences pages with dictionary arificially generated noise • Page segmentation into labeled zones 19 Introduction Performance evaluation of OCR systems • Hand-printed Character Recognition – Institute for Posts and Telecommunications Policy (IPIP) – Japan - 1996 – 5000 hand-written numerals from japanese zip codes – Performance of the best system: 97.94% (human performance: 99.84%) • Machine-printed Character Recognition – “The fifth Annual Test of OCR Accuracy”. Information Science Research Institute. TR-96-01. April 1996. http://www.isri.unlv.edu – 5.000.000 characters from 2000 pages in journals, newspapers, letters and technical reports – Performance in good quality documents: 99.77% - 99.13% – Performance in medium quality documents: 99.27% - 98.21% – Performance in low quality documents: 97.01% - 89.34% • Performance 99% => 30 errors /page (3000 characters/page) 20 Introduction 10 Performance evaluation of OCR systems C.Y. Suen, J. Tan: Analysis of errors of handwritten digits made by a multitude of classifiers. PRL 26, pp. 369-379. 2005 Classifier N of test samples Recognition (%) Error (%) MNIST database GPR 10,000 99.06 0.94 VSVMb 10,000 99.62 0.38 VSV2 10,000 99.44 0.56 LeNet-5 10,000 99.18 0.82 POE 10,000 98.32 1.68 CENPARMI database VSVM 2000 98.7 1.3 USPS database VSVM 2007 97.66 2.34 NIST SD19 database MLP 30,000 99.16 0.84 21 Introduction Performance evaluation of OCR systems Classifier N. of errors Category 1 Category 2 Category 3 MNIST database GPR 94 24 11 59 VSVMb 38 15 6 17 VSV2 56 15 9 32 LeNet5 82 17 14 51 POE 168 41 9 118 CENPARMI database VSVM 26 6 4 16 USPS database VSVM 47 13 13 21 NIST SD 19 database MLP 119 30 8 81 Sum 630 161 74 395 Percentage (%) 100 25.56 11.75 62.70 22 Introduction 11 Components of an OCR system ACQUISITION • Filtering. DOCUMENT • Binarization. PRE-PROCESSING • Skew correction. • Layout analysis SEGMENTATION • Text/graphics separation • Character segmentation CHARACTER • Filtering PRE-PROCESSING • Normalization • Image-based features FEATURE • Statistical features EXTRACTION • Transform-based features • Structural features LEARNING CLASSIFICATION Models Context POSTPROCESSING infromation 23 Introduction Acquisition. Scanners A scanner is a linear camera with a lighting and a displacement system Scan line Opening Video circuit Digital image text Lens CCD Document Lights 24 Acquisition 12 Acquisition. Scanners Types of scanners: • Flatbed scanners. The CCD
Recommended publications
  • OCR Pwds and Assistive Qatari Using OCR Issue No
    Arabic Optical State of the Smart Character Art in Arabic Apps for Recognition OCR PWDs and Assistive Qatari using OCR Issue no. 15 Technology Research Nafath Efforts Page 04 Page 07 Page 27 Machine Learning, Deep Learning and OCR Revitalizing Technology Arabic Optical Character Recognition (OCR) Technology at Qatar National Library Overview of Arabic OCR and Related Applications www.mada.org.qa Nafath About AboutIssue 15 Content Mada Nafath3 Page Nafath aims to be a key information 04 Arabic Optical Character resource for disseminating the facts about Recognition and Assistive Mada Center is a private institution for public benefit, which latest trends and innovation in the field of Technology was founded in 2010 as an initiative that aims at promoting ICT Accessibility. It is published in English digital inclusion and building a technology-based community and Arabic languages on a quarterly basis 07 State of the Art in Arabic OCR that meets the needs of persons with functional limitations and intends to be a window of information Qatari Research Efforts (PFLs) – persons with disabilities (PWDs) and the elderly in to the world, highlighting the pioneering Qatar. Mada today is the world’s Center of Excellence in digital work done in our field to meet the growing access in Arabic. Overview of Arabic demands of ICT Accessibility and Assistive 11 OCR and Related Through strategic partnerships, the center works to Technology products and services in Qatar Applications enable the education, culture and community sectors and the Arab region. through ICT to achieve an inclusive community and educational system. The Center achieves its goals 14 Examples of Optical by building partners’ capabilities and supporting the Character Recognition Tools development and accreditation of digital platforms in accordance with international standards of digital access.
    [Show full text]
  • Choosing Character Recognition Software To
    CHOOSING CHARACTER RECOGNITION SOFTWARE TO SPEED UP INPUT OF PERSONIFIED DATA ON CONTRIBUTIONS TO THE PENSION FUND OF UKRAINE Prepared by USAID/PADCO under Social Sector Restructuring Project Kyiv 1999 <CHARACTERRECOGNITION_E_ZH.DOC> printed on June 25, 2002 2 CONTENTS LIST OF ACRONYMS.......................................................................................................................................................................... 3 INTRODUCTION................................................................................................................................................................................ 4 1. TYPES OF INFORMATION SYSTEMS....................................................................................................................................... 4 2. ANALYSIS OF EXISTING SYSTEMS FOR AUTOMATED TEXT RECOGNITION................................................................... 5 2.1. Classification of automated text recognition systems .............................................................................................. 5 3. ATRS BASIC CHARACTERISTICS............................................................................................................................................ 6 3.1. CuneiForm....................................................................................................................................................................... 6 3.1.1. Some information on Cognitive Technologies ..................................................................................................
    [Show full text]
  • Development of the Documents Comparison Module for an Electronic Document Management System
    Development of the documents comparison module for an electronic document management system M A Mikheev1, P Y Yakimov1,2 1Samara National Research University, Moskovskoe Shosse 34А, Samara, Russia, 443086 2Image Processing Systems Institute of RAS - Branch of the FSRC "Crystallography and Photonics" RAS, Molodogvardejskaya street 151, Samara, Russia, 443001 e-mail: [email protected], [email protected] Abstract. The article is devoted to solving the problem of document versions comparison in electronic document management systems. Systems-analogues were considered, the process of comparing text documents was studied. In order to recognize the text on the scanned image, the technology of optical character recognition and its implementation — Tesseract library were chosen. The Myers algorithm is applied to compare received texts. The software implementation of the text document comparison module was implemented using the solutions described above. 1. Introduction Document management is an important part of any business today. Negotiations, contracts, protocols, agreements must be documented. Documents go through several stages of preparation in most cases, and there is a possibility of replacing or correcting of files by one member without notifying other members. EDMSs with built-in checking mechanisms guarantees the identity of two documents versions in this case. А variety of data formats as well as a low degree of automation or its complete absence significantly complicates the process of finding differences between documents. As a result, some companies ignore such important process without considering how risky this approach to organization of workflow in a company can be [1]. All the above suggests that the topic chosen by the authors is relevant.
    [Show full text]
  • Design and Implementation of a System for Recording and Remote Monitoring of a Parking Using Computer Vision and Ip Surveillance Systems
    VOL. 11, NO. 24, DECEMBER 2016 ISSN 1819-6608 ARPN Journal of Engineering and Applied Sciences ©2006-2016 Asian Research Publishing Network (ARPN). All rights reserved. www.arpnjournals.com DESIGN AND IMPLEMENTATION OF A SYSTEM FOR RECORDING AND REMOTE MONITORING OF A PARKING USING COMPUTER VISION AND IP SURVEILLANCE SYSTEMS Albeiro Cortés Cabezas, Rafael Charry Andrade and Harrinson Cárdenas Almario Department of Electronic Engineering, Surcolombiana University Grupo de Tratamiento de Señales y Telecomunicaciones, Pastrana Av. Neiva Colombia, Huila E-Mail: [email protected] ABSTRACT This article presents the design and implementation of a system for detection of license plates for a public parking located at the municipality of Altamira at the state of Huila in Colombia. The system includes also, a module of surveillance cameras for remote monitoring. The detection system consists of three steps: the first step consists in locating and cutting the vehicle license plate from an image taken by an IP camera. The second step is an optical character recognition (OCR), responsible for identifying alphanumeric characters included in the license plate obtained in the first step. And the third step consists in the date and time register operation for the vehicles entry and exit. This data base stores also the license plate number, information about the contract between parking and vehicle owner and billing. Keywords: image processing, vision, monitoring, surveillance, OCR. 1. INTRODUCTION vehicles in a parking and store them in digital way to gain In recent years the automatic recognition plates control of the activity in the parking, minimizing errors numbers technology -ANPR- (Automatic Number Plate that entail manual registration and storage of information Recognition) has had a great development in congestion in books accounting and papers.
    [Show full text]
  • Comparison of Optical Character Recognition (OCR) Software
    Angelica Gabasio May 2013 Comparison of optical character recognition (OCR) software Master's thesis work carried out at Sweco Position. Supervisors Bj¨ornHarrtell, Sweco Tobias Lennartsson, Sweco Examiner Jacek Malec, Lund University The purpose of the thesis was to test some different optical char- acter recognition (OCR) software to find out which one gives the best output. Different kinds of images have been OCR scanned to test the software. The output from the OCR tools has been automatically compared to the correct text and the error percentage has been cal- culated. Introduction Optical character recognition (OCR) is used to extract plain text from images containing text. This is useful when information in paper form is going to be digitized. OCR is applied to the images to make the text editable or searchable. Another way of using OCR scanning is to process forms automatically.[1] If the output from the OCR tool contains many errors, it would require a lot of manual work to fix them. This is why it is important to get a correct or almost correct output when using OCR. The output from different OCR tools may differ a lot, and the purpose of this comparison is to find out which OCR software to use for OCR scanning. OCR steps As seen in Figure 1, the OCR software takes an image, applies OCR on it and outputs plain text. Figure 1 also shows the basic steps of the OCR algorithm, which are explained below[1, 2]: Preprocessing Preprocessing can be done either manually before the image is given to the OCR tool, or it can be done internally in the software.
    [Show full text]
  • Yelieseva S. V. Information Technologies in Translation.Pdf
    The Ministry of Education and Science of Ukraine Petro Mohyla Black Sea National University S. V. Yelisieieva INFORMATION TECHNOLOGIES IN TRANSLATION A STUDY GUIDE PMBSNU Publishing House Mykolaiv – 2018 S. V. Yelisieieva UDC 81' 322.2 Y 40 Рекомендовано до друку вченою радою Чорноморського національного університету імені Петра Могили (протокол № 12 від 03.07.2018). Рецензенти: Демецька В. В., д-р філол. наук, професор, декан факультету перекладознавства Херсонського державного університету; Філіпова Н. М., д-р філол. наук, професор кафедри прикладної лінгвістики Національного морського університету ім. адмірала Макарова; Кондратенко Ю. П., д-р техн. наук, професор, професор кафедри інтелектуальних інформаційних систем факультету коп’ютерних наук Чорноморського національного університету ім. Петра Могили; Бабій Ю. Б., канд. філол. наук, доцент кафедри прикла- дної лінгвістики Миколаївського національного університету ім. В. О. Сухомлинського. Y 40 Yelisieieva S. V. Information Technologies in Translation : A Study Guide / S. V. Yelisieieva. – Мykolaiv : PMBSNU Publishing House, 2018. – 176 р. ISBN 978-966-336-399-8 The basis of the training course Information Technologies in Translation lies the model of work cycle on translation, which describes the sequence of necessary actions for qualified performance and further maintenance of written translation order. At each stage, translators use the appropriate software (S/W), which simplifies the work and allows to improve the quality of the finished documentation. The course familiarizes students with all stages of the translation work cycle and with those components of the software used at each stage. Besides, this training course includes information on working with the latest multimedia technologies that are widely used nowadays for presentations of various information materials.
    [Show full text]
  • PDF Extractor
    DELIVERABLE Project Acronym: EuDML Grant Agreement number: 250503 Project Title: The European Digital Mathematics Library D7.1: State of the Art of Augmenting Metadata Techniques and Technology Revision: 1.2 as of 2nd November 2010 Authors: Petr Sojka MU, Masaryk University, Brno Josef Baker UB, University of Birmingham Alan Sexton UB, University of Birmingham Volker Sorge UB, University of Birmingham Contributors: Michael Jost FIZ, Karlsruhe Aleksander Nowinski´ ICM, Warsaw Peter Stanchev IMI-BAS, Sofia Thierry Bouche, Claude Goutorbe CMD/UJF, Grenoble Nuno Freie, Hugo Manguinhas IST, Lisbon Łukasz Bolikowski ICM, Warsaw Project co-funded by the European Comission within the ICT Policy Support Programme Dissemination Level P Public X C Confidential, only for members of the consortium and the Commission Services Revision History Revision Date Author Organisation Description 0.1 31th October 2010 Petr Sojka MU First prerelease. 0.2 2nd November 2010 Josef Baker UB OCR stuff added. 0.3 11th November 2010 PS,JB,AS,VS MU, UB Restructured. 0.4 23th November 2010 Petr Sojka MU TB’s part added. 0.5 27th November 2010 Petr Sojka MU Near to final polishing, TODO removal, release for language and stylistic checking. 1.0 27th November 2010 Petr Sojka MU Release for internal review, with language checked by AS and NER part by ICM. 1.1 28th November 2010 Alan Sexton UB Minor corrections. 1.2 2nd December 2010 Petr Sojka MU Release for EU with com- ments from JR, MBa, and with part added by HM and extended conversions part. Statement of originality: This deliverable contains original unpublished work except where clearly indicated otherwise.
    [Show full text]
  • Extracción Automáticas De Modelos Uml Contenidos En Imágenes
    Universidad Carlos III de Madrid Departamento de Informática Doctorado en Ciencia y Tecnología Informática EXTRACCIÓN AUTOMÁTICA DE MODELOS UML CONTENIDOS EN IMÁGENES Tesis Doctoral Autor Valentín Moreno Pelayo Director Prof. Dr. D. Juan Bautista Llorens Morillo Prof. Dr. Gonzalo Génova Fuster Leganés, noviembre de 2015 TESIS DOCTORAL EXTRACCIÓN AUTOMÁTICAS DE MODELOS UML CONTENIDOS EN IMÁGENES Autor: Valentín Moreno Pelayo Director/es: Juan Bautista Llorens Morillo Gonzalo Génova Fuster Firma del Tribunal Calificador: Firma Presidente: (Nombre y apellidos) Vocal: (Nombre y apellidos) Secretario: (Nombre y apellidos) Calificación: Leganés, de de A mi familia y amigos Resumen Aunque parezca extraño, pese a no poder encontrar sitios web especializados en ofertar diseños de software representados mediante diagramas UML, existe una ingente cantidad de documentación a disposición de cualquiera, y que contiene dichos modelos: como imágenes en documentos textuales. Este universo de información no se encuentra fácilmente accesible para los desarrolladores porque no es posible, con la tecnología actual, buscar de forma precisa información semántica dentro de imágenes. Lo único que pueden hacer los desarrolladores es intentar buscar documentos relevantes, leerlos, y decidir si los diseños le sirven a sus intereses. Para evitar este problema, y conseguir poner a disposición de toda la comunidad de desarrolladores centenas de miles de diseños, este trabajo pretende desarrollar la metodología necesaria para poder extraer la información textual y gráfica de las imágenes que representen diagramas UML, y convertirla en información pura UML (es decir, en modelos UML reales). El poner a disposición de los analistas, desarrolladores de software, o interesados tal cantidad de diagramas y modelos de software permitirá la aplicación de técnicas modernas de reutilización de software basadas en la búsqueda de diagramas UML.
    [Show full text]
  • Evaluation of Off-The-Shelf OCR Technologies
    Masaryk University Faculty of Informatics Evaluation of off-the-shelf OCR technologies Bachelor’s Thesis Martin Tomaschek Brno, Fall 2017 Masaryk University Faculty of Informatics Evaluation of off-the-shelf OCR technologies Bachelor’s Thesis Martin Tomaschek Brno, Fall 2017 This is where a copy of the official signed thesis assignment and a copy ofthe Statement of an Author is located in the printed version of the document. Acknowledgements I would like to thank my advisor for patience, my brother for help and my parents for their love. iii Abstract A OCR software comparison iv Keywords ocr, benchmark v Contents 1 Preface 1 2 Outlines of the OCR process 3 3 Challenges to OCR 5 4 OCR benchmarking and evaluation 9 4.1 Dataset creation ........................9 4.1.1 Synthetic and real data . 10 4.1.2 Formats . 11 4.2 Evaluation metrics ...................... 12 4.2.1 Text recognition . 12 4.2.2 Text segmentation . 13 4.2.3 Existing datasets . 15 4.2.4 Ground-truthing tools . 16 4.3 Evaluation tools ........................ 16 4.3.1 The ISRI Analytic Tools[13] . 16 4.3.2 hOCR tools[15] . 18 4.3.3 An open-source OCR evaluation tool . 18 5 The tested OCR systems 19 5.1 Proprietary .......................... 19 5.1.1 Abby FineReader . 19 5.1.2 Readiris 16 . 20 5.1.3 Adobe Acrobat 11 . 20 5.1.4 Omnipage . 21 5.2 Open source .......................... 21 5.2.1 Tesseract . 21 5.2.2 GNU Ocrad . 21 5.2.3 Gocr . 22 5.2.4 Ocropus . 22 5.2.5 Cuneiform .
    [Show full text]
  • Download Download
    -1- Which OCR toolset is good and why? A comparative study Pooja Jain1,*, Dr. Kavita Taneja1, Dr. Harmunish Taneja2 1Dept. Of Computer Science & Applications, Panjab University, Chandigarh, India. 2Dept. Of Computer Science & Information Tech., DAV College, Sec - 10, Chandigarh, India. *Corresponding author: [email protected] Abstract Optical Character Recognition (OCR) is a very active research area in many challenging fields like pattern recognition, natural language processing (NLP), computer vision, biomedical informatics, machine learning (ML), and artificial intelligence (AI). This computational technology extracts the text in an editable format (MS Word/Excel, text files, etc.) from PDF files, scanned or hand-written documents, images (photographs, advertisements, and alike), etc. for further processing and has been utilized in many real-world applications including banking, education, insurance, finance, healthcare and keyword-based search in documents, etc. Many OCR toolsets are available under various categories, including open-source, proprietary, and online services. This research paper provides a comparative study of various OCR toolsets considering a variety of parameters. Keywords: ABBYY FineReader; Calamari; Google Docs; OCR; Tesseract. 1. Introduction OCR (Bokser, 1992; Mori et al., 1992) is a 1.1. OCR process commonly used technology for recognizing text within digital images such as OCR process (Goswami et al., 2013; Cao, scanned documents, advertisements, 2014; Tomaschek, 2018) generally goes photographs, etc. It is widely used as an through multiple stages, as shown in Figure information entry tool that can extract useful 1, including Image-acquisition (downloading information from scanned documents, image from an online source or capturing it including printed forms (filled by users), using a camera or scanner), Pre-processing computerized receipts, bank statements, (modifying image in a way that may increase invoices, business cards, passport documents, the accuracy of OCR), Binarization mails, or any other suitable documentation.
    [Show full text]
  • TEXT RECOGNITION Kazylina Y.I
    XX International conference for students and young scientists «MODERN TECHNIQUE AND TECHNOLOGIES» Section 7: Informatics And Control In Engeneering Systems TEXT RECOGNITION Kazylina Y.I. Scientific supervisor: lepustin a.v., senior teacher Tomsk Polytechnic University, 634050, Russia, Tomsk, Lenin Avenue, 30 E-mail: [email protected] Optical character recognition, or OCR, is a mechanical or electronic process of a handwritten, typewritten or printed text conversion into text data, i.e. code sequence representing computer symbols, for example, in a word processor. OCR is widely used for converting books and documents into an electronic format, business accounting system computerization and text publication in the Web. Text recognition on the PC The process of digitization and OCR includes five steps. Fig 2. Method of recognition • Page data input. On this stage scanned or photographed document is transferred into a computer • image. Document reconstruction. After the recognition process is complete the software starts to • Layout analysis. OCR application detects the recreate pages, combining separate symbols into arrangement of text, pictures, tables etc. on the page words, words into sentences, sentences into and divide it into blocks. The software sequentially paragraphs etc., using the embedded vocabulary. To splits the page into smaller blocks: the text into quicken the process the results of layout analysis (step paragraphs, then into sentences, separate words and 2) are used. Moreover, using special methods symbols. At the end layout analysis the page is applications try to take into account grammatical represented by a set of separate symbols. The features of the text so the result would represent application remembers the location of every one of grammatically correct sentences.
    [Show full text]
  • TEXT LINE EXTRACTION USING SEAM CARVING a Thesis
    TEXT LINE EXTRACTION USING SEAM CARVING A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirement for the Degree Master of Science Christopher Stoll May, 2015 TEXT LINE EXTRACTION USING SEAM CARVING Christopher Stoll Thesis Approved: Accepted: __________________________________ __________________________________ Advisor Dean of the College Dr. Zhong-Hui Duan Dr. Chand Midha __________________________________ __________________________________ Faculty Reader Interim Dean of the Graduate School Dr. Chien-Chung Chan Dr. Rex Ramsier __________________________________ __________________________________ Faculty Reader Date Dr. Yingcai Xiao __________________________________ Department Chair Dr. Timothy Norfolk !ii ABSTRACT Optical character recognition (OCR) is a well researched area of computer science; presently there are numerous commercial and open source applications which can perform OCR, albeit with varying levels of accuracy. Since the process of performing OCR requires extracting text from images, it should follow that text line extraction is also a well researched area. And indeed there are many methods to extract text from images of scanned documents, the process known in that field as document analysis and recognition. However, the existing text extraction techniques were largely devised to feed existing character recognition techniques. Since this work was originally conceived from the perspective of computer vision and pattern recognition and pattern analysis and machine intelligence, a new approach seemed necessary to meet the new objectives. Out of that need an apparently novel approach to text extraction was devised which relies upon the central idea behind seam carving. Text images are examined for seams, but rather than removing the lowest energy seams, they are evaluated to determine where text is located within the image.
    [Show full text]