Masaryk University Faculty of Informatics

Evaluation of off-the-shelf OCR technologies

Bachelor’s Thesis

Martin Tomaschek

Brno, Fall 2017


This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Acknowledgements

I would like to thank my advisor for patience, my brother for help and my parents for their love.

Abstract

An OCR comparison

Keywords: OCR, benchmark


Contents

1 Preface

2 Outlines of the OCR process

3 Challenges to OCR

4 OCR benchmarking and evaluation
 4.1 Dataset creation
  4.1.1 Synthetic and real data
  4.1.2 Formats
 4.2 Evaluation metrics
  4.2.1 Text recognition
  4.2.2 Text segmentation
  4.2.3 Existing datasets
  4.2.4 Ground-truthing tools
 4.3 Evaluation tools
  4.3.1 The ISRI Analytic Tools [13]
  4.3.2 hOCR tools [15]
  4.3.3 An open-source OCR evaluation tool

5 The tested OCR systems
 5.1 Proprietary
  5.1.1 ABBYY FineReader
  5.1.2 Readiris 16
  5.1.3 Adobe Acrobat 11
  5.1.4 Omnipage
 5.2 Open source
  5.2.1 Tesseract
  5.2.2 GNU Ocrad
  5.2.3 Gocr
  5.2.4 Ocropus
  5.2.5 Cuneiform
 5.3 Online services
  5.3.1 Google Docs

6 Tests

7 Conclusion

Bibliography

1 Preface

Optical character recognition (OCR) is the extraction of machine-encoded text from an image. It is a subfield of computer vision and has many applications: digitizing scanned documents to enable editing, searching and indexing or storing them more effectively, processing bank cheques, sorting mail [1], recognition of license plate numbers in highway toll systems, etc.

Since the first commercial OCR systems were created in the 1950s [2], they have improved significantly, alongside the computer itself – once room-sized, expensive custom-built systems used only by large organizations, OCR applications can nowadays run even on a smartphone, leveraging its built-in camera to take the picture. Early OCR systems were limited to monospace1 text, often of a single typeface; today's OCR software supports many common proportional2 fonts.

OCR is a complex and computationally demanding task. There are uncountable combinations of document type, layout, paper type, font, language and script, and countless other variables, such as material degradation, defects of imaging and print, etc. Because of this there is also a large variety of OCR software, each designed for a particular application – for example recognizing Hebrew3 or Japanese4, recognizing hand-printed script, or an OCR package fine-tuned for reading medieval scripts, and so on.

This thesis focuses on the evaluation of the most common type of OCR software, designed to recognize western languages using the Latin script and its derivatives. English has the most samples in the datasets used for the tests in this work; fewer Slovak and Czech documents are examined; the ISRI dataset [3], which is also used, contains some Spanish documents; other languages were not tested.

1. Every character occupies the same, i.e. fixed, width.
2. The opposite of monospace.
3. Hebrew is an "impure abjad": it uses an alphabet of 22 (+5) consonants, vowels are indicated by diacritical marks beneath the consonants, and it is written right to left. https://en.wikipedia.org/wiki/Hebrew_alphabet
4. Japanese uses four scripts – logographic characters adopted from China, i.e. kanji, two syllabic scripts, hiragana and katakana, and Latin for some foreign words (mostly acronyms), along with Arabic numerals. The core set of kanji used daily has 3000 symbols, with a few thousand more used from time to time. Japanese can be written both left to right and top to bottom. https://en.wikipedia.org/wiki/Japanese_writing_system


Figure 1.1: Examples of scripts used around the globe

The aim of this thesis is to compare available OCR software, selected among the industry leaders and various open-source projects. While a few papers already exist on the subject, they tend to be rather outdated (e.g. [3] from 1996) or focused on a specific document type (e.g. [4]). Some websites contain more up-to-date reviews and comparisons; however, they are often not very credible, as they seldom describe their methodology, test the OCR solutions on very small datasets, contain subjective performance measures or just list available features (e.g. [5]).

Chapter 2 provides an overview of the OCR process itself. Chapter 3 lists factors and problems affecting OCR performance. Chapter 4 explains various metrics that can be used to evaluate OCR systems and presents the tools used to measure them. In chapter 5 the tested OCR programs are introduced. Chapter 6 investigates the actual impact of various variables, such as image resolution, lossy compression, skew, font of the text, etc., on the accuracy of OCR systems. The last chapter presents and discusses the results obtained in the tests.


2 Outlines of the OCR process

The OCR process generally involves these stages:

∙ Image acquisition – an image is taken using a scanner, a camera or a similar device. To achieve high-accuracy results, a good quality image is needed.

∙ Preprocessing – text orientation detection, deskewing, noise filtering, perspective correction (if the source is a photograph), etc.

∙ Binarization – the content is separated from the background.

∙ Page segmentation – the document is divided into homogeneous regions, such as columns of text, tables, images, etc.

∙ Line, word and character segmentation [6] – the image is further divided, down to the character level.1

∙ Recognition [7]

– Feature extraction – various characteristics (called features) are calculated for every character image.2

– Classification – the features are compared with trained data3 to determine what the output character should be, via a classifier (a program).4

∙ Postprocessing – dictionaries and various language models can be used to enhance the results.
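To make the pipeline concrete, here is a minimal sketch in Python; it assumes the third-party Pillow and pytesseract packages (my choice for illustration, not tools used in this thesis) and collapses preprocessing into a naive global binarization:

    # Minimal OCR pipeline sketch: acquisition (a saved scan), grayscale
    # conversion, global-threshold binarization, then recognition.
    from PIL import Image
    import pytesseract

    def ocr_page(path, threshold=128):
        img = Image.open(path)                      # image acquisition
        gray = img.convert("L")                     # drop colour information
        # naive global binarization; production systems adapt the
        # threshold locally (see "adaptive binarization" in chapter 3)
        binary = gray.point(lambda p: 255 if p > threshold else 0)
        return pytesseract.image_to_string(binary)  # segmentation + recognition

    print(ocr_page("scan.png"))

Tesseract performs page, line and character segmentation internally, so the sketch leaves those stages to the engine.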


OCR packages use very different algorithms and techniques to perform their task. Some examples can be found in articles referenced above.

1. Line and word segmentation is relatively easy to do, especially on printed documents, where lines are straight and evenly spread. Character segmentation is a much tougher problem and is often closely coupled with recognition, because already recognized characters can be used to improve segmentation accuracy. Some recognition approaches (notably hidden Markov model (HMM) ones) do not need character-level pre-segmentation.
2. For instance, gradient features can be obtained by splitting the character image into a 4-by-4 grid of tiles and applying the Sobel operator to calculate the gradient orientation at each pixel, which is then quantized into 12 orientations (as per 5 minutes or 1 hour on the clock). Finally, for each tile the features are defined as the count of pixels with a given gradient orientation, normalized by tile size.
3. Basically all OCR programs require reference data, which is used to identify the patterns in an image. This data is usually bundled in the OCR package. Some OCR software is user-trainable, which allows adding new symbols or even languages and scripts, or improving accuracy. Training is principally done by presenting images of characters or even whole sentences to the OCR program together with the correct solution. See [8] for examples.
4. Many types of character classifiers exist; each one works using a different set of features, and is therefore good at distinguishing among different character classes. Several classifiers are often used together to leverage their individual strengths and achieve better accuracy.
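As an illustration of the gradient features described in footnote 2, the following Python sketch (assuming NumPy and SciPy are available and a grayscale character image is given as a NumPy array) computes the 4 × 4 × 12 feature vector:

    # Gradient features: Sobel gradients, orientation quantized into 12
    # "clock" directions, counted per tile and normalized by tile size.
    import numpy as np
    from scipy import ndimage

    def gradient_features(char_img, grid=4, bins=12):
        img = char_img.astype(float)
        gy = ndimage.sobel(img, axis=0)             # vertical gradient
        gx = ndimage.sobel(img, axis=1)             # horizontal gradient
        orient = (np.arctan2(gy, gx) + np.pi) / (2 * np.pi)  # map to 0..1
        labels = np.minimum((orient * bins).astype(int), bins - 1)
        h, w = img.shape
        th, tw = h // grid, w // grid
        features = []
        for r in range(grid):
            for c in range(grid):
                tile = labels[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
                counts = np.bincount(tile.ravel(), minlength=bins)
                features.extend(counts / tile.size)  # normalize by tile size
        return np.array(features)                   # 4 * 4 * 12 = 192 values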

3 Challenges to OCR

This chapter presents challenges OCR software has to overcome in order to be able to correctly convert an image to text, expanding on Nartker, Nagy and Rice [9], who have described some key factors contributing to OCR errors.

∙ Imaging defects – there are many ways imaging defects may be introduced while printing and scanning a document. Common imaging defects include:

– Heavy or light print – heavy print may be produced, for example, when the tape in a dot matrix printer is replaced with a new one; light print when a printer is running low on ink or toner.

– Uneven contrast – cheap or old laser printers often do not produce quality output, scanning a book results in darker areas near the binding, etc.

– Stray marks

– Curved baselines

– Lens geometry and perspective transformation – these affect images acquired by a camera.

– Paper quality – paper slowly degrades over time, and so does the information it carries.

∙ Similar symbols – many characters look similar to a vertical line and therefore to one another – i, j, I, l, 1, !, |. Some capital letters differ from their lowercase counterparts just by size – e.g. v,V, o,O, s,S, z,Z. Other pairs of glyphs that bear close resemblance are 0,O, (,{,[, u,v, U,V, p,P, k,K and so on. Commas (,) and dots (.) look almost identical in some fonts, and so do many other punctuation symbols. While English does not use diacritical marks much, some languages do so extensively. Punctuation and diacritical marks are often very small and thus hard to correctly recognize and easy to mistake for noise.


∙ Special or new symbols – Unicode contains a great many characters, and OCR software is simply not trained to recognize all of them. Many languages have little peculiarities and use different alphabets (in Spanish a question mark is also written upside down, Scandinavian languages slash some letters, etc.) in addition to the aforementioned diacritical marks and different punctuation, and OCR systems use different trained data and/or minor modifications to support them. Languages using other scripts (e.g. Chinese, the Arabic language family, etc.) often require entirely new approaches and are developed separately.

∙ Typography – documents are usually made for human readers, and what makes the text easy to read and aesthetically pleasing for us can make it harder for an OCR system.

– Proportional fonts [10] – unlike monospace ones, characters have varying widths (an m is up to three times wider than an i), which makes character segmentation much harder, as the line cannot simply be split into same-length segments.

– Italic and oblique type – in this case the characters are typically slanted to the right, and some parts of them extend over or under neighbouring characters (these parts are called kerns), overlapping their notional "patch". Thus the line cannot be segmented so easily, i.e. vertically, and more elaborate approaches must be used.

– Uneven stroke width – many of the most commonly used fonts evolved from the penmanship of medieval authors, aiming to imitate its aesthetics. The varying stroke width was a natural result of the techniques and instruments used. In contrast, fonts such as OCR-A and OCR-B, which were developed specifically for OCR purposes, have uniform stroke width.

– Shaded background – varying contrast between foreground and background demands more advanced binarization techniques, i.e. adaptive binarization, where the threshold is adjusted based on local conditions.

– Too small or too big type


Figure 3.1: Example of a rumpled page

Figure 3.2: Text with heavy print

– Unusual typeface – stylized fonts are not easy to OCR, especially those imitating handwriting (on which segmentation is harder), broken fonts and outline fonts. Some OCR programs have support for historic "fraktur" (Gothic) style fonts.

– Spacing – if the distance between two characters is too big or too small, errors are more likely to occur in the word and especially character segmentation process. On the one hand, too wide gaps between letters may result in a word being split in two; on the other, touching glyphs are very hard to segment correctly. These are the most prominent reasons for varying spacing:

* Justification – distances between words and letters are adjusted to align text on both margins.

* Kerning – certain combinations of letters are moved closer together, as in fig N, because people seem to prefer roughly the same area between characters. Obviously this practice creates kerns, which make correct segmentation harder, as discussed previously.

* Ligatures – similar to kerning, but the characters fuse together.


Figure 3.3: Text with light print

Figure 3.4: Text with curved baseline

Figure 3.5: A page from a book with dark area near the binding

4 OCR benchmarking and evaluation

Each step of the OCR process could be evaluated separately; however, such an approach is only useful (and feasible) for OCR system developers and researchers. A user's interest instead focuses on the final result – the recognized text – and so a black-box approach is more suitable. To evaluate OCR system performance [11], its output is compared to the ground-truth1 associated with the document image being processed.

4.1 Dataset creation

Creating a dataset is not an easy task. There are two distinct ways of creating datasets – using real-life data and generating synthetic data. The former approach usually involves a lot more manual work; the latter requires a lot of insight to create an accurate model for the data generation. Often these approaches are mixed together to strike a good balance: for example, real data for which ground-truth is available is used as a basis for various transformations to generate new data, or OCR is used to obtain a first version of a text, which is then proofread and corrected by a human. The process consists of the following tasks:

∙ Data selection and collection/generation – the first part of the dataset is the document images. There are two main aspects to consider for data selection – being realistic and being representative of the problem domain. Being realistic means that the images should be as close to real images as possible, while being representative signifies that the dataset should be constructed from all the classes of documents that belong to that problem domain in a balanced way. For example, a dataset for character recognition should contain documents using different combinations of typefaces, sizes of script and other formatting options, while a dataset for testing the segmentation capabilities of an OCR package should be comprised of documents with varying layouts of differently sized text regions, tables and graphics. To achieve a realistic and representative dataset, it is necessary for the samples to contain

1. The expected result, sometimes also called truth text or golden text.


distortion, noise and other image degradations of varying levels, just as real-use scenarios would. After determining what kinds of document images should be selected, a large enough set is collected, where "large enough" is connected with the representative trait discussed earlier.

∙ Ground-truth definition and annotation – the second part of the dataset consists of the ground-truth associated with the image data. The nature of the ground-truth data is very specific to a given application domain. For example, text strings are required for text recognition evaluation, bounding boxes and polygons are usually used in zone segmentation, skew angles for autodeskewing tests, foreground pixels for binarization and so on. It is not always clear how this information is best represented, and multiple formats and standards have developed. Afterwards all pictures must be annotated with ground-truth data. This is very time-consuming and tedious if done manually, especially if the ground-truth is complex. This problem is eliminated by using synthetic data, which allows the ground-truth to be generated as well. To help annotate real-life documents, many tools have been developed that allow interactive or collaborative editing.

∙ Organization and structuring of the dataset – in the end, image and ground-truth data are put together. The dataset may be sorted/annotated according to various criteria such as language, document type or degree of degradation.

4.1.1 Synthetic and real data

As has been briefly explained in the previous section, there are two ways of gathering data – collecting real-world data or generating synthetic data.

Using real data trivially satisfies the "being realistic" requirement, but has several drawbacks. Firstly, it may be hard to collect a sufficient number of real images, especially if one must consider licensing, copyright and confidentiality limitations for a dataset to be released to the public. Secondly, as has already been mentioned, annotation of


large sets is a very costly process in terms of human effort. It is also prone to human errors, so sometimes several people proofread the ground-truth, further increasing the costs. To make annotation easier, several interactive and collaborative tools have been developed; a comprehensive list can be found in the handbook [11], and a brief overview of a few of them is given in the "Ground-truthing tools" section. Crowdsourcing (mass collaboration) has also been successfully used, for example in the Google books project2. To reduce the cost of this process, OCR or another document analysis tool can be used to generate an initial version of the ground-truth, which is then proofread by humans and the necessary corrections are made. And lastly, the amount of noise and degradation of real documents is hard to quantify, and thus it is hard to sort documents by this attribute.

Generating synthetic data allows for arbitrary amounts of data to be generated along with all the necessary ground-truth, requiring little manual effort. Developing models for realistic and representative automatic generation is, however, a very hard problem if the domain is broad, and a lot of factors must be considered. A few theoretical models of degradation have been proposed, but simpler image transformations can also be used to generate degraded images, as sketched below. There has been little progress in developing methods for the generation of documents with varied and realistic layouts and formatting.
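As a toy illustration of such transformations, the following Python sketch (assuming Pillow and NumPy; the file names are placeholders) skews a clean page by a known angle and adds salt-and-pepper noise, yielding a degraded sample together with its exactly known ground-truth skew:

    # Synthetic degradation: rotate by a known angle and sprinkle
    # salt-and-pepper noise; the ground-truth skew comes for free.
    import numpy as np
    from PIL import Image

    def degrade(path, angle=1.5, noise_fraction=0.01, seed=0):
        page = Image.open(path).convert("L")
        skewed = page.rotate(angle, expand=True, fillcolor=255)
        pixels = np.array(skewed)
        rng = np.random.default_rng(seed)
        mask = rng.random(pixels.shape) < noise_fraction
        pixels[mask] = rng.choice([0, 255], size=int(mask.sum()))
        return Image.fromarray(pixels), {"skew_angle": angle}

    degraded, ground_truth = degrade("clean_page.png")
    degraded.save("degraded_page.png")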

4.1.2 Formats

Most formats that have been proposed to hold OCR results and/or ground-truth (the OCR result and the ground-truth should have a similar format, i.e. expose the same information, to be comparable) are based on XML. The most notable are ALTO, PAGE and hOCR, and for each a different set of tools was developed. To make things worse, each

2. reCAPTCHA is a system for detecting whether the user of a website is human or not. The most recent version requires the user to check an "I'm not a robot" checkbox; earlier versions, however, used images of words from the Google books project. Images of two words, one already known control word and one that the OCR programs used by reCAPTCHA recognized differently, were distorted in some way (to make the task more challenging to other OCR software) and presented to the user. If enough users who solved the control word used the same transcription, it was selected as the correct one; on the contrary, if too many users used different transcriptions, the word was discarded as unreadable.


OCR package usually supports export to only one of these formats, or even worse, its own proprietary format. Conversions are possible to a degree, using XSL stylesheets, XQuery or similar techniques. The easiest format to use and produce is simple plaintext ground-truth, and plaintext output is supported by every OCR package. However, it is of limited use when evaluating more complex aspects of the document, such as segmentation or non-trivial reading order.

4.2 Evaluation metrics

This section explains various metrics and other terms related to the evaluation of text segmentation, i.e. the ability to properly divide the page into homogeneous regions, most importantly text regions; and text recognition, i.e. the software's ability to correctly recognize the characters in said text regions.

4.2.1 Text recognition

Evaluation of text recognition is usually done by aligning the OCR output with the ground-truth and calculating the Levenshtein distance3, sometimes also called edit distance, which is the minimal number of insertions, deletions and substitutions needed to make both texts equal, i.e. the number of errors. Then:

∙ Character recognition error rate – the ratio of the number of errors to the number of all characters in the ground-truth text.

∙ Character recognition accuracy – the ratio of the number of correctly recognized characters to the number of all characters in the ground-truth text.

A word is correctly recognized if all its letters have been correctly recognized. Punctuation, digits, other special symbols, and sometimes even case are ignored when evaluating word recognition. Then:

∙ Word recognition error rate – the ratio of the number of incorrectly recognized words to the number of all words in the ground-truth text.


∙ Word recognition accuracy – the ratio of the number of correctly recognized words to the number of all words in the ground-truth text.

Word accuracy is an important metric for several use cases, such as building a database for an information retrieval system, where a whole word needs to be correct in order to fetch relevant information4. These systems often do not index all the words in a text; instead they skip stopwords, which are words that are not very useful for information retrieval, e.g. articles ("a", "the"), prepositions ("of", "to") or conjunctions ("and", "or"). Then:

∙ Non-stopword recognition accuracy – the ratio of the number of correctly recognized words that are not stopwords to the number of all words in the ground-truth text that are not stopwords.

If two systems have similar character recognition accuracy but one has higher word recognition accuracy, the errors of the latter are more concentrated5 and so easier to correct. In a similar way to word recognition accuracy, phrase recognition accuracy for a given phrase length6 can be computed and would indicate how concentrated or spread the errors are.

Word recognition accuracy can also be approximated by the bag-of-words method, which disregards word order and records only the number of occurrences of each word.
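The following Python sketch illustrates these definitions on plain ground-truth and OCR output strings (a simple dynamic-programming Levenshtein distance, character accuracy and the bag-of-words approximation; not the ISRI implementation):

    # Recognition metrics: edit distance, character accuracy and a
    # bag-of-words word accuracy approximation that ignores word order.
    from collections import Counter

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def character_accuracy(truth, ocr):
        return (len(truth) - levenshtein(truth, ocr)) / len(truth)

    def bag_of_words_accuracy(truth, ocr):
        t, o = Counter(truth.split()), Counter(ocr.split())
        return sum((t & o).values()) / sum(t.values())

    print(character_accuracy("optical character", "optica1 character"))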

4.2.2 Text segmentation

Evaluation of segmentation results is more difficult than evaluation of text recognition results. The first approach uses plain-text ground-truth, in which the text is organized according to reading order. The

3. https://en.wikipedia.org/wiki/Levenshtein_distance
4. For example, suppose there is a document which contains the word "mouse", but "m" gets substituted for "h" and thus the resulting word in the database is "house". If a user queried the database for the word "mouse", the document wouldn't be selected; and a query for "house" would fetch a document that has nothing to do with houses.
5. More words are without errors, but more words contain multiple errors.
6. Word recognition accuracy can be viewed as phrase recognition accuracy with a phrase length of 1.

OCR software then performs segmentation, reading order detection and finally character recognition. The output is then aligned with the ground-truth. If a text block hasn't been detected, the OCR software's output will have characters missing; if noise or graphics is mistaken for a text region, the output might contain some extra characters. If the reading order has not been detected correctly, a block of text will end up at a wrong position and a block move operation is needed to correct that. Next, the number of character insertions/deletions and block move operations7 needed to make the two texts equal is established. The result is then normalized by the length of the text.

This approach has several shortcomings. The OCR result may contain errors not only due to page segmentation but also due to text recognition. To mitigate this factor, the correction cost may be calculated once more with an OCR result obtained using manual zoning; subtracting the two values should produce an estimate of the costs incurred due to page segmentation only. Another problem is testing documents with arbitrary reading order: there should be multiple correct solutions, but this process penalizes them with additional move operations. Luckily, the block move operation's cost is quite low compared to not detecting the block at all, provided that its length is more than the threshold chosen for the operation's replacement with insert/delete operations. This method, however, is not a direct comparison and provides few details about segmentation errors.

Most of the other approaches rely on zoning information being directly available in the OCR result and the ground-truth. The zones are usually defined as either axis-aligned boxes or arbitrary polygons. These are then geometrically aligned within a reasonable margin. Several special cases must be considered, such as two zones merging together, or one splitting in two. Sometimes this is not a problem and should be allowed, e.g. a region containing two paragraphs may be represented by one or two zones; other times it is not acceptable, e.g. two columns merging into one zone, or zones of two different types merging. Afterwards the quality of the match can be expressed by calculating

7. Short moves, e.g. below five characters, may be replaced by an equivalent number of insertions and deletions, because this method tries to estimate the cost of the corrections as if made by a human operator, who is more likely to delete and retype a very short word than to cut and paste it to a different location.

several values, such as undersegmentation (segments missing) and oversegmentation (extra segments), or various metrics based on how much the generated zones overlap with those in the ground-truth.
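A minimal sketch of the overlap-based comparison in Python, assuming zones given as axis-aligned boxes (x0, y0, x1, y1) and a hand-picked intersection-over-union threshold:

    # Zone matching by overlap: a ground-truth zone counts as detected
    # when some OCR zone overlaps it strongly enough; the leftovers are
    # under- and oversegmentation respectively.
    def iou(a, b):
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0

    def compare_zones(truth_zones, ocr_zones, threshold=0.7):
        matched = [(t, o) for t in truth_zones for o in ocr_zones
                   if iou(t, o) >= threshold]
        undersegmented = len(truth_zones) - len({t for t, _ in matched})
        oversegmented = len(ocr_zones) - len({o for _, o in matched})
        return matched, undersegmented, oversegmented

Note that this simple matching does not handle the merge and split cases discussed above; real tools treat those separately.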

4.2.3 Existing datasets

Quite a few datasets have been produced so far; a list covering most of them can be found in the handbook [11]. Listed here are those suitable for printed character recognition evaluation and (at least partially) publicly available free of charge.

ISRI-UNLV

This dataset was used and expanded each year between 1992 and 1996 in the annual tests of OCR accuracy, an open competition in character recognition held at the Information Science Research Institute at the University of Nevada, Las Vegas. It contains over 2200 images, organized in several categories: business letters, legal documents, magazines, news, reports and technical documents. Most of the images are in English, but the set also includes Spanish newspaper and German business letter samples. The images are binary, in G4-compressed TIFF format, and range from having no noise to having heavy noise. Ground-truth is present in plaintext; zoning information is provided via ".uzn" files, basically plaintext files containing a list of rectangles with an optional short label (see the sketch below). License: unknown.
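A sketch of a ".uzn" reader in Python; the assumed column order (left, top, width, height, then an optional label) is my reading of the format description above, not a verified specification:

    # Parse a .uzn zoning file: four integers per line plus an
    # optional textual label; blank or malformed lines are skipped.
    def read_uzn(path):
        zones = []
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) < 4:
                    continue
                left, top, width, height = map(int, parts[:4])
                label = " ".join(parts[4:]) or None
                zones.append(((left, top, width, height), label))
        return zones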

Google 1000 book sample

This dataset is made up of 1000 books from the Google books project, in several languages. The images of the pages, acquired with a camera, have been preprocessed – dewarped, deskewed and cropped to page size. Images are in JPEG format. Sadly, no ground-truth is available apart from the OCR output of an undisclosed OCR solution Google used on the images. License: free for personal and non-commercial use; the books themselves are public domain.


PRImA Layout Analysis Dataset

A large dataset, with detailed ground-truth in the PAGE XML format, used in the recent ICDAR8 2017 Competition on Recognition of Documents with Complex Layouts. License: free for personal and non-commercial use.

4.2.4 Ground-truthing tools

The last chapter of the handbook [11] lists several ground-truthing tools. I experienced trouble building the older ones, due to dependencies on outdated packages and environments, and stability or usability issues with some others. In this thesis, the capabilities of some of them are compared.

Aletheia

Aletheia is the most recent ground-truthing environment and is still being developed. It is very feature-complete, allows defining ground-truth on multiple levels – page, zone, line and character – and provides intuitive tools to do so. Every document element can be assigned a rich set of metadata, and advanced reading order editing is supported. Some functionality is, however, available only in the paid version, and the tool supports Windows only, though it can be run under Wine with certain stability and usability issues. License: free lite version for personal and non-commercial use.

4.3 Evaluation tools

4.3.1 The ISRI Analytic Tools [13]

Developed at the Information Science Research Institute of the University of Nevada, this set of tools was used to test OCR systems on an annual basis from 1992 to 1996. While, 21 years later, most of the OCR systems tested in the study have been discontinued, the tools are still relevant and were released under the Apache 2.0 license. Eddie Antonio Santos added support for UTF-8, which allows the tools to be

8. International Conference on Document Analysis and Recognition


used for other languages without the additional conversions to extended ASCII that were required previously. The ISRI tools consist of 17 programs [13], including:

∙ accuracy – computes character accuracy by comparing the OCR result with the ground-truth. It generates a nice report, including totals, results by group (lowercase, digits, spacing, etc.), accuracy per character, and the number of errors of each kind occurring in the text (e.g. how many times an o was substituted). The program expects that the zoning information (columns and such) is made available to the OCR program, and thus segmentation errors are taken out of the equation.

∙ synctext – aligns texts, so it is possible to see the differences between the generated text and the ground-truth.

∙ accsum – aggregates multiple reports into one.

∙ groupacc – reprocesses reports to generate new ones with character accuracies grouped by user-defined rules.

∙ acci – estimates a 95% confidence interval for character accuracy for a given set of reports.

∙ accdist – calculates the distribution of character accuracy in a set of accuracy reports.

∙ ngram – computes n-gram9 statistics for a text file for a given n.

∙ vote – takes the output of multiple OCR programs and combines them together. ISRI tests have shown up to an 80% reduction in error rate compared to the best OCR software used. A similar technique is often used inside an OCR program, which may internally use several character classifiers.

∙ wordacc – computes word accuracy; the output is similar to accuracy. Wordaccsum and wordacci work analogously to their character accuracy counterparts.

∙ editop – estimates the zoning capabilities of an OCR system. It does this by counting the number of insert, delete and block

9. N-gram is a contiguous sequence of n items from a given sequence of text.


move operations needed to make the OCR output equal to the ground-truth file.

∙ editopsum – aggregates editop reports.

Available from GitHub [13]. License: Apache 2.0.
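A hypothetical Python driver for a batch evaluation with these tools; it assumes the accuracy and accsum binaries are on the PATH and accept the arguments shown (ground-truth file, OCR output, report file), which should be checked against the tools' manual:

    # Run "accuracy" per page, then aggregate the reports with "accsum".
    import subprocess

    pages = ["page1", "page2"]
    for p in pages:
        subprocess.run(["accuracy", f"{p}.gt.txt", f"{p}.ocr.txt", f"{p}.acc"],
                       check=True)
    summary = subprocess.run(["accsum"] + [f"{p}.acc" for p in pages],
                             capture_output=True, text=True, check=True)
    print(summary.stdout)   # combined character accuracy report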

4.3.2 hOCR tools [15]

This is a set of tools for manipulating the hOCR file format, which is based on XML and was designed as a free standard to hold OCR results. Its XML elements contain the recognized text and its coordinates in the image, often down to the character level. Only a handful of OCR packages support output directly in the hOCR format (e.g. Tesseract and OCRopus); others can output their own proprietary XML-based file formats, which might be transformed to hOCR using XSL, XQuery or similar techniques. The most important parts of the package for testing OCR are:

∙ hocr-combine – joins multiple hOCR files into one.

∙ hocr-eval – compares two hOCR files by aligning them geometrically, then counts both the segmentation errors and the actual recognition errors.

∙ hocr-eval-lines – compares hOCR to plain-text ground-truth; this method requires that line breaks match between the generated text and the ground-truth (some OCR packages, e.g. ABBYY, have a function to remove line breaks – this must be turned off).

∙ hocr-eval-geometry – computes under-/oversegmentation on a selected hOCR element level against the ground-truth.

Available from GitHub [15]. License: Apache 2.0.
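For orientation, this Python sketch pulls recognized words and their bounding boxes out of an hOCR file using only the standard library; it assumes the common ocrx_word class and the "bbox x0 y0 x1 y1" title convention of hOCR:

    # Extract (word, bounding box) pairs from an hOCR document.
    from html.parser import HTMLParser

    class HocrWords(HTMLParser):
        def __init__(self):
            super().__init__()
            self.words, self._bbox = [], None

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if "ocrx_word" in a.get("class", ""):
                # title looks like: "bbox 393 100 480 131; x_wconf 95"
                for field in a.get("title", "").split(";"):
                    parts = field.split()
                    if parts and parts[0] == "bbox":
                        self._bbox = tuple(map(int, parts[1:5]))

        def handle_data(self, data):
            if self._bbox and data.strip():
                self.words.append((data.strip(), self._bbox))
                self._bbox = None

    parser = HocrWords()
    with open("page.hocr", encoding="utf-8") as f:
        parser.feed(f.read())
    print(parser.words[:5])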

4.3.3 An open-source OCR evaluation tool

This tool, by Rafael C. Carrasco, is written in Java and provides a graphical interface which not only computes character error rates and similar metrics, but also highlights the differences between the OCR-generated text and the ground-truth side by side.

5 The tested OCR systems

The OCR systems tested in the following chapters have been selected by this process: candidates were established by searching the web for OCR software and previous comparisons. (There is also a list of OCR software on Wikipedia.) Companies were contacted via e-mail, in which a free license was requested for the purpose of the evaluation. Only ABBYY agreed to provide a license for their product, and theirs is therefore the only proprietary OCR software to undergo all the tests. The others were tested only on a few pages (because of the limitations of their respective free trial versions), just to get a glimpse of their performance. Many open-source OCR projects were given a chance in the short test, but only the best ones were further explored.

5.1 Proprietary

5.1.1 ABBYY FineReader

ABBYY is one of the market leaders in OCR software and is seldom omitted in OCR tests and evaluations, as their product's performance often acts as a reference point for state-of-the-art results. It supports a wide range of languages and input/output formats. The tested version provides a good graphical user interface with many options, but sadly batch processing is reserved for the "Enterprise" version. ABBYY also sells their OCR in the form of an SDK, an online/cloud service and a server-based variant. Some of their products support various platforms, including Linux. Tested versions: 12 Pro (license provided free of charge by ABBYY), 14 Trial.

5.1.2 Readiris 16

Readiris also seems to perform very well. Images must be processed one by one; only the Readiris Pro version is capable of batch processing whole folders of images. The free trial version is restricted to a maximum of 100 image recognitions, so testing was considerably limited. The user interface is quite attractive and intuitive, and thus suitable for a wide range of users. Readiris also provides their OCR for various platforms (Windows, Mac, iOS and Android) and in the form of an SDK.

5.1.3 Adobe Acrobat 11

Acrobat is a suite for manipulating PDF documents, not a software package designed specifically for OCR. It doesn't provide any tools for manual preprocessing or zoning, but its accuracy is reportedly high. It also allows batch processing. A free trial version is available for a few days of use.


5.1.4 Omnipage

Successor of the Recognita OCR engine, Omnipage is another established brand in the OCR business. It supports many languages and runs on Windows, Mac and Linux.

5.2 Open source

None of the open-source OCR packages comes with a GUI; many options are offered by third parties, but those severely lack in functionality and quality compared to their proprietary counterparts. A good open-source scanning and preprocessing tool is Scan Tailor; the OCR engines themselves are best controlled via their command line interfaces.

5.2.1 Tesseract

The most accurate and most widely used open-source OCR package. Originally developed by HP, it was open-sourced and is now maintained and improved by Google. It supports many languages and quite a few image formats, and can output hOCR among other formats. It can also be used as a library; Android, Linux and Windows builds are available. It includes the necessary tools for teaching it new languages and symbols. Automatic preprocessing options are orientation detection and minor skew correction. It has a layout analyser. Tested versions: 3.04, 4.0a. License: Apache 2.0.

5.2.2 GNU Ocrad

GNU's OCR is a small project and receives updates only rarely. It only works with pnm1 pictures and outputs plain text or its own (not XML-based) format. Tested version: 0.27. License: GNU GPLv2+.

1. PNM means one of the pbm (bitmap), pgm (greyscale) or ppm (colour) formats.


5.2.3 Gocr

Another open-source OCR engine. There are occasional updates, but not very often. It supports more input file formats, thanks to automatic conversion. Outputs plain text. Tested version: 0.51. License: GNU GPL.

5.2.4 Ocropus

This OCR system does not consist of just one binary that does it all, but rather of a collection of document analysis tools which together function as an OCR system. The upside is that it is very modular and multiple implementations of each step may exist2. The downside is that getting even one page OCRed takes several commands. Tested version: 1.3.3. License: Apache 2.0.

5.2.5 Cuneiform

Developed by the Russian company Cognitive, this OCR package was open-sourced in 2008. Tested version: 1.1.0. License: BSD.

5.3 Online services

5.3.1 Google Docs

Google Docs is primarily a cloud document storage and editing platform; however, it automatically performs recognition on all uploaded images and PDFs, after which a plain-text extract can be downloaded. A Google account is required to use the service.

2. There are two OCR engines available at the moment, and a new layout analyser, which will use GPU-accelerated deep learning methods, has been announced.

6 Tests

A test was conducted using a synthetic page containing the first paragraph of the Wikipedia article on OCR, set in 12 pt Times New Roman at 300 DPI. Most of the tested OCR packages achieved accuracy greater than 98%.

Figure 6.1: Table containing the results for the synthetic page – achieved accuracies and elapsed times

Then several documents were collected from the Internet, and from them several pages were selected at random. Overall the sample contained a balanced mix of document classes: a recipe book, legal documents, magazines, etc. These pages were then manually zoned using the Aletheia tool and the text copied over from the originals. A simple Python script was used to parse the PAGE XML file and cut the image into the defined zones, which were then fed to the OCR software (a sketch follows). The results were compared to the ground-truth using the ISRI analytic tools.
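A sketch of such a script, assuming the 2017 PRImA PAGE namespace (the exact namespace string varies by PAGE version) and Pillow for the image cropping:

    # Cut a scanned page into its ground-truth text regions: parse the
    # PAGE XML "Coords" polygons and crop each region's bounding box.
    import xml.etree.ElementTree as ET
    from PIL import Image

    NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15"}

    def cut_zones(page_xml, image_path, prefix="zone"):
        img = Image.open(image_path)
        root = ET.parse(page_xml).getroot()
        for i, region in enumerate(root.iter(f"{{{NS['pc']}}}TextRegion")):
            coords = region.find("pc:Coords", NS)
            points = [tuple(map(int, p.split(",")))
                      for p in coords.get("points").split()]
            xs, ys = zip(*points)
            box = (min(xs), min(ys), max(xs), max(ys))
            img.crop(box).save(f"{prefix}_{i}.png")

    cut_zones("page.xml", "page.png")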


Figure 6.2: Table containing the results for the collected pages – achieved accuracies and elapsed times

7 Conclusion

This thesis presented the basic theoretical and practical concepts of the OCR process. An overview of available OCR packages, evaluation methodologies and testing tools was given. Finally, the OCR packages were tested on a prepared dataset.


Bibliography

[1] M. Gilloux, "Document analysis in postal applications and check processing", in Handbook of Document Image Processing and Recognition, D. Doermann and K. Tombre, Eds., Springer-Verlag, 2014. doi: 10.1007/978-0-85729-859-1.
[2] H. S. Baird and K. Tombre, "The evolution of document image analysis", in Handbook of Document Image Processing and Recognition, D. Doermann and K. Tombre, Eds., Springer-Verlag, 2014. doi: 10.1007/978-0-85729-859-1.
[3] S. V. Rice, G. L. Nagy, and T. A. Nartker, "The fifth annual test of OCR accuracy", Information Science Research Institute, University of Nevada, Las Vegas, Tech. Rep., 1996.
[4] J. C. Lecoq, L. Najman, O. Gibot, and E. Trupin, "Benchmarking commercial OCR engines for technical drawings indexing", in Proceedings of the Sixth International Conference on Document Analysis and Recognition, 2001, pp. 138–142. doi: 10.1109/ICDAR.2001.953770.
[5] J. Stone, The best OCR software of 2017. [Online]. Available: http://www.toptenreviews.com/business/software/best-ocr-software/.
[6] N. Nobile and C. Y. Suen, "Text segmentation for document recognition", in Handbook of Document Image Processing and Recognition, D. Doermann and K. Tombre, Eds., Springer-Verlag, 2014. doi: 10.1007/978-0-85729-859-1.
[7] H. Cao, "Machine-printed character recognition", in Handbook of Document Image Processing and Recognition, D. Doermann and K. Tombre, Eds., Springer-Verlag, 2014. doi: 10.1007/978-0-85729-859-1.
[8] Training Tesseract. [Online]. Available: https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract.
[9] S. V. Rice, G. L. Nagy, and T. A. Nartker, Optical Character Recognition: An Illustrated Guide to the Frontier. Norwell, MA, USA: Kluwer Academic Publishers, 1999, isbn: 079238492X. doi: 10.1007/978-1-4615-5021-1.


[10] I. Vynckier, Segmenting words and characters. [Online]. Available: http://www.how-ocr-works.com/OCR/word-character-segmentation.
[11] V. Märgner and H. E. Abed, "Tools and metrics for document analysis systems evaluation", in Handbook of Document Image Processing and Recognition, D. Doermann and K. Tombre, Eds., Springer-Verlag, 2014. doi: 10.1007/978-0-85729-859-1.
[12] OCRFeeder. [Online]. Available: https://github.com/GNOME/.
[13] ISRI tools. [Online]. Available: https://github.com/eddieantonio/isri-ocr-evaluation-tools.
[14] ABBYY on voting API. [Online]. Available: https://abbyy.technology/en:features:ocr:classifier.
[15] hOCR tools. [Online]. Available: https://github.com/tmbdev/hocr-tools.
