Evaluation of Off-The-Shelf OCR Technologies
Masaryk University
Faculty of Informatics

Evaluation of off-the-shelf OCR technologies

Bachelor's Thesis

Martin Tomaschek

Brno, Fall 2017

This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Acknowledgements

I would like to thank my advisor for patience, my brother for help and my parents for their love.

Abstract

An OCR software comparison.

Keywords

ocr, benchmark

Contents

1 Preface
2 Outlines of the OCR process
3 Challenges to OCR
4 OCR benchmarking and evaluation
  4.1 Dataset creation
    4.1.1 Synthetic and real data
    4.1.2 Formats
  4.2 Evaluation metrics
    4.2.1 Text recognition
    4.2.2 Text segmentation
    4.2.3 Existing datasets
    4.2.4 Ground-truthing tools
  4.3 Evaluation tools
    4.3.1 The ISRI Analytic Tools [13]
    4.3.2 hOCR tools [15]
    4.3.3 An open-source OCR evaluation tool
5 The tested OCR systems
  5.1 Proprietary
    5.1.1 Abby FineReader
    5.1.2 Readiris 16
    5.1.3 Adobe Acrobat 11
    5.1.4 Omnipage
  5.2 Open source
    5.2.1 Tesseract
    5.2.2 GNU Ocrad
    5.2.3 Gocr
    5.2.4 Ocropus
    5.2.5 Cuneiform
  5.3 Online services
    5.3.1 Google docs
  5.4 Tests
6 Conclusion
Bibliography

1 Preface

Optical character recognition (OCR) is the extraction of machine-encoded text from an image. It is a subfield of computer vision and has many applications: digitizing scanned documents to enable editing, searching and indexing or to store them more effectively, processing bank cheques, sorting mail [1], recognizing license plate numbers in highway toll systems, etc.

Since the first commercial OCR systems were created in the 1950s [2], they have improved significantly, alongside the computer: once room-sized, expensive, custom-built systems used only by large organizations, nowadays an OCR application can even run on a smartphone and leverage its built-in camera to take the picture. Early OCR systems were limited to monospace^1 text, often of a single typeface; today's OCR software supports many common proportional^2 fonts.

OCR is a complex and computationally demanding task. There are countless combinations of document type, layout, paper type, font, language, script and other variables, such as material degradation and defects of imaging and print. Because of this there is also a large variety of OCR software, each designed for a particular application: for example recognizing Hebrew^3 or Japanese^4, recognizing handprinted script, or an OCR package fine-tuned for reading medieval scripts, and so on.

This thesis focuses on the evaluation of the most common type of OCR software, designed to recognize Western languages written in Latin script and its derivatives. English has the most samples in the datasets used for the tests in this work; fewer Slovak and Czech documents are examined; the ISRI dataset [3], which will also be used, contains Spanish documents as well; other languages were not tested.

Figure 1.1: Examples of scripts used around the globe

The aim of this thesis is to compare available OCR software, selected among the industry leaders and various open-source projects. While a few papers already exist on the subject, these tend to be rather outdated (e.g. [3] from 1996) or focused on a specific document type (e.g. [4]). Some websites contain more up-to-date reviews and comparisons; however, they are often not very credible, as they seldom describe their methodology, test the OCR solutions only on very small datasets, contain subjective performance measures or just list available features (e.g. [5]).

The first chapter provides an overview of the OCR process itself. The second lists factors and problems affecting OCR performance. The third explains various metrics that can be used to evaluate OCR systems and presents the tools used to measure them. In the fourth chapter the tested OCR programs are introduced. The fifth chapter investigates the actual impact of various variables, such as image resolution, lossy compression, skew and the font of the text, on the accuracy of OCR systems. The last chapter presents and discusses the results obtained in the tests.

1. Monospace: every character occupies the same, i.e. fixed, width.
2. Proportional: the opposite of monospace.
3. Hebrew is an "impure abjad", written right to left with an alphabet of 22 (+5) consonants; vowels are indicated by diacritical marks beneath the consonants. https://en.wikipedia.org/wiki/Hebrew_alphabet
4. Japanese uses four scripts: logographic characters adopted from China (kanji), two syllabic scripts (hiragana and katakana), and Latin for some foreign words (mostly acronyms), along with Arabic numerals. The core set of kanji used daily has 3000 symbols, with a few thousand more used from time to time. Japanese can be written both left to right and top to bottom. https://en.wikipedia.org/wiki/Japanese_writing_system

2 Outlines of the OCR process

The OCR process generally involves these stages (a minimal end-to-end sketch follows the list):

∙ Image acquisition – an image is taken using a scanner, a camera or a similar device. To achieve highly accurate results, a good quality image is needed.

∙ Preprocessing – text orientation detection, deskewing, noise filtering, perspective correction (if the source is a photograph), etc.

∙ Binarization – the content is separated from the background.

∙ Page segmentation – the document is divided into homogeneous regions, such as columns of text, tables, images, etc.

∙ Line, word and character segmentation [6] – the image is further divided up to the character level.^1

∙ Recognition [7]
  – Feature extraction – various characteristics (called features) are calculated for every character image.^2
  – Classification – the features are compared with trained data^3 to determine what the output character should be, via a classifier (a program).^4

∙ Postprocessing – dictionaries and various language models can be used to enhance results.
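To make the stages above concrete, the following is a minimal sketch of such a pipeline in Python, using OpenCV for preprocessing and binarization and the Tesseract engine (through the pytesseract wrapper) for the remaining stages. It is an illustration, not code from this thesis; the file name, the median-blur kernel and the choice of Otsu thresholding are illustrative assumptions.

import cv2
import pytesseract

def ocr_page(path: str, lang: str = "eng") -> str:
    """Minimal OCR pipeline: acquisition, preprocessing, binarization, recognition."""
    # Image acquisition: load a scanned page from disk.
    image = cv2.imread(path)

    # Preprocessing: convert to grayscale and suppress salt-and-pepper noise.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 3)

    # Binarization: separate dark text from the light background (Otsu's threshold).
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Page, line and character segmentation, recognition and dictionary-based
    # postprocessing are all delegated to the Tesseract engine here; a complete
    # pipeline would also deskew and correct perspective before this step.
    return pytesseract.image_to_string(binary, lang=lang)

print(ocr_page("scanned_page.png"))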
OCR packages use very different algorithms and techniques to perform their task. Some examples can be found in the articles referenced above.

1. Line and word segmentation is relatively easy, especially on printed documents, where lines are straight and evenly spread. Character segmentation is a much tougher problem and is often closely coupled with recognition, because already recognized characters can be used to improve segmentation accuracy. Some recognition approaches (notably hidden Markov model (HMM) ones) do not need character-level pre-segmentation.
2. For instance, gradient features can be obtained by splitting the character image into a 4 by 4 grid of tiles and applying the Sobel operator to calculate the gradient orientation at each pixel, which is then quantized into 12 orientations (as per 5 minutes or 1 hour on a clock). Finally, for each tile the features are defined as the count of pixels with a given gradient orientation, normalized by the tile size.
3. Basically all OCR programs require reference data, which is used to identify the patterns in an image. This data is usually bundled with the OCR package. Some OCR software is user-trainable, which allows adding new symbols or even languages and scripts, or improving accuracy. Training is principally done by presenting images of characters or even whole sentences to the OCR program together with the correct solution. See [8] for examples.
4. Many types of character classifiers exist; each one works with a different set of features and is therefore good at distinguishing among different character classes. Several classifiers are often used together to leverage their individual strengths and achieve better accuracy.
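As an illustration of the gradient features described in note 2 above, the sketch below computes such a descriptor for a single character image. It is a simplified reconstruction of the idea, not code from this thesis: the 4 by 4 grid and the 12 orientation bins follow the note, while the function name, the restriction to pixels with non-zero gradient magnitude and the assumption that the crop is at least 4 by 4 pixels are choices made for the example.

import cv2
import numpy as np

def gradient_features(char_img, grid=4, bins=12):
    """Gradient-orientation histogram features for one character image (note 2)."""
    gray = char_img.astype(np.float32)

    # Sobel operator: horizontal and vertical derivatives at every pixel.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)

    # Gradient orientation in [0, 2*pi), quantized into 12 "clock" directions.
    orientation = np.arctan2(gy, gx) % (2 * np.pi)
    quantized = np.floor(orientation / (2 * np.pi) * bins).astype(int) % bins
    magnitude = np.hypot(gx, gy)

    # Split the image into a 4x4 grid of tiles; for each tile count the pixels
    # falling into each orientation bin, normalized by the tile size.
    h, w = gray.shape
    features = []
    for ty in range(grid):
        for tx in range(grid):
            ys = slice(ty * h // grid, (ty + 1) * h // grid)
            xs = slice(tx * w // grid, (tx + 1) * w // grid)
            tile_q = quantized[ys, xs]
            tile_m = magnitude[ys, xs]
            for b in range(bins):
                # Count only pixels where an edge is actually present.
                count = np.count_nonzero((tile_q == b) & (tile_m > 0))
                features.append(count / tile_q.size)
    return np.array(features)  # 4 * 4 * 12 = 192 values

For a character crop this yields a 192-dimensional feature vector, which a classifier (note 4) then compares against the trained reference data (note 3) to decide which character the image represents.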
3 Challenges to OCR

This chapter presents the challenges OCR software has to overcome in order to be able to correctly convert an image to text, expanding on Nartker, Nagy and Rice [9], who have described some key factors contributing to OCR errors.

∙ Imaging defects – there are many ways imaging defects may be introduced while printing and scanning a document. Common imaging defects include:
  – Heavy or light print – heavy print may be produced, for example, when the tape in a dot matrix printer has just been replaced with a new one; light print when a printer is running low on ink or toner.
  – Uneven contrast – cheap or old laser printers often do not produce quality output, scanning a book results in darker areas near the binding, etc.
  – Stray marks
  – Curved baselines
  – Lens geometry and perspective transformation – these affect images acquired by a camera.
  – Paper quality – paper slowly degrades over time and so does the information it carries.

∙ Similar symbols – many characters look similar to a vertical line and therefore to one another: i, j, I, l, 1, !, |. Some capital letters differ from their lowercase counterparts only in size, e.g. v/V, o/O, s/S, z/Z. Other pairs of glyphs that bear close resemblance are 0/O, (/{/[, u/v, U/V, p/P, k/K and so on. Commas (,) and dots (.) look almost identical in some fonts, as do many other punctuation symbols. While English does not use diacritical marks much, some languages do so extensively. Punctuation and diacritical marks are often very small and thus hard to recognize correctly and easy to mistake for noise.

∙ Special or new symbols – Unicode contains a great many characters and OCR software is simply not trained to recognize all of them. Many languages have their own peculiarities and use different alphabets (in Spanish an inverted question mark opens a question, Scandinavian languages slash some letters, etc.) in addition to the aforementioned diacritical marks and different punctuation, and OCR systems use different trained data and/or minor modifications to support them.