
A Comparison of OCR Methods on Natural Images in Different Image Domains


DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020


AGNES FORSBERG MELVIN LUNDQVIST

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

En jämförelse av OCR-metoder i olika domäner

Degree Project in Computer Science, DD142X
Date: June 8, 2020
Supervisor: Kevin Smith
Examiner: Pawel Herman
School of Electrical Engineering and Computer Science

Abstract

Optical character recognition (OCR) is a blanket term for methods that convert printed or handwritten text into machine-encoded text. As the digital world keeps growing, the number of digital images containing text increases, and so does the need for OCR methods that can handle more than plain text documents. There are OCR engines that can convert images of clean documents with an over 99% recognition rate. OCR for natural images is getting more and more attention, but because natural images can be far more diverse than plain text documents, it also leads to complications. To combat these issues, it needs to be clear in what areas the OCR methods of today struggle. This thesis aims to answer this by testing three popular, readily available OCR methods on a dataset comprised only of natural images containing text. The results show that one of the methods, GOCR, cannot handle natural images, as its test results were very far from correct. For the other two methods, ABBYY FineReader and Tesseract, the results were better but also show that there is still a long way to go, especially when it comes to images with special fonts. However, when the images are less complicated, some of the methods performed above our expectations.

Sammanfattning

Optical character recognition (OCR) is a collective term for methods that convert printed or handwritten text into machine-encoded text. As the digital world grows, so does the number of digital images containing text, and with it the need for OCR methods that can handle more than plain text documents. Today there are OCR engines that can convert images of clean documents into machine-encoded text with over 99% accuracy. OCR for photographs is receiving more and more attention, but since photographs are far more diverse than clean text documents, this also leads to problems. Handling this requires clarity about the areas in which today's OCR methods struggle. This thesis aims to answer this question by examining and testing popular, readily available OCR methods on a dataset that only contains photographs of natural environments with text. The results showed that one of the methods, GOCR, cannot handle photographs; GOCR's test results were far from correct. For the other methods, ABBYY FineReader and Tesseract, the results were better but showed that there is still much work to be done in the field, especially when it comes to images with special fonts. For less complicated images, however, we were surprised by how well some of the methods performed.

Contents

1 Introduction
  1.1 Problem statement
  1.2 Scope
  1.3 Hypothesis
  1.4 Outline
2 Background
  2.1 Optical character recognition
  2.2 Tesseract
  2.3 ABBYY FineReader Engine
  2.4 GOCR
3 Method
  3.1 The NEOCR dataset
  3.2 Image domains
  3.3 Image filtering
  3.4 Experiment
4 Results
  4.1 Font
  4.2 Texture
  4.3 Arrangement
  4.4 Contrast
  4.5 Blurriness
  4.6 Comparative study
5 Discussion
  5.1 Method discussion
  5.2 Discussion of results
  5.3 Future research
  5.4 Ethical and sustainability considerations
  5.5 Societal considerations
  5.6 Conclusion
References

1 Introduction

Optical character recognition (OCR) is the process of converting images of typewritten, handwritten or printed text to editable machine-encoded text [7]. The first true OCR machine was installed in 1954 to read typewritten sales reports [4]. As the digital world grows and the number of digital images containing text increases, the potential use of OCR technology is expanding. Nowadays there are many great OCR engines for converting pictures of clean documents to editable text that can be used by computers, some with an over 99% recognition rate [6]. OCR is therefore very helpful for converting physical office documents to digital ones, but it is far from faultless in scenarios such as recognising text in natural scene images [7]. Recognising text in real world images is getting more and more attention, but comes with numerous complications, since text in natural images is far more diverse than in plain text documents [6].

1.1 Problem statement

OCR engines are very complex and consist of several steps. To develop OCR technology for natural images further, it must be clear in which cases OCR methods perform poorly, in order to know what should be in focus. Therefore, the goal of this study is to evaluate the current state of three different OCR methods and identify possible scenarios in which each respective method demonstrates a lower recognition rate, by answering the following question:

How accurate are the OCR methods Tesseract, ABBYY FineReader and GOCR on natural images overall and in different domains?

1.2 Scope

The study is solely comparative and will not take into account the accessibility or computational cost of the OCR methods.

The study is restricted to investigating the performance of the three popular off-the-shelf OCR methods Tesseract (version 4.0.0-beta.1)1, ABBYY FineReader Engine

1 https://tesseract-ocr.github.io/tessdoc/Home.html

(version 12 for Linux)2 and GOCR (version 0.52-20181015)3. The experiment will use the accessible software of the OCR methods and will not make use of additional support, such as language input, training or neural networks. The string metric Levenshtein distance will be used to evaluate the performance.

Furthermore, the OCR methods will only be evaluated on their overall performance and when applied to data in the presence of the domains (confounding factors) font, texture, arrangement, contrast and blurredness (from here on referred to as blurriness). All images for the study are included in the dataset NEOCR version 1.0 and will be filtered for specific domains. For more specifics of the characteristics of the images in NEOCR, see section 3.1.

1.3 Hypothesis

GOCR seems to be a rather simple OCR method. In terms of results, we believe that this will be a disadvantage in complex images of natural environments. ABBYY FineReader commercially appears to be a tool for digital documents only; however, the SDK engine used for these experiments supposedly has support for text recognition in images as well. It is also used in many large scale applications4. Because of this, we expect ABBYY FineReader to have a better accuracy than the other methods. Tesseract is open source but maintained by Google and is one of the most used OCR systems in the world. However, we do not know how well it performs on natural images. We expect it to have good accuracy, but not as good as the ABBYY FineReader Engine. In addition, we expect all methods to perform better on images with standard font, horizontal arrangement and high contrast, while performing worse on images with high texture and blurriness, since text in these types of images is more difficult to distinguish for the eye and then hypothetically also for OCR methods.

2 https://abbyy.technology/en:products:fre:linux
3 https://www-e.ovgu.de/jschulen/ocr/download.html
4 https://www.abbyy.com/en-gb/case-studies/?product=3250

1.4 Outline

The next section presents a more detailed background required for the comprehension of the rest of the report, based on previous studies. Section three describes how the experiment was carried out. It contains an explanation of how the images in the dataset were filtered and a technical specification of how the OCR tools were accessed and their outputs retrieved and processed. The fourth section presents the results of the experiment and compares them. In the fifth section, there is a discussion of the course of action and of the reliability and usefulness of the results, both for answering the research question and for future research. Finally, a conclusion of the study is presented.

2 Background

In this section a theoretical background for the area of study will be given. Specifically OCR technology will be introduced further, and the three OCR methods Tesseract, ABBYY FineReader and GOCR will be described.

2.1 Optical character recognition

Optical character recognition (OCR) uses technology to distinguish printed or handwritten text characters in digital images. It was originally invented as a tool that read text out loud for the blind or visually impaired. Nowadays it is most frequently used to transform historic documents and books into machine-readable, searchable text. OCR methods use algorithms to recognize the characters, of which there are two variants. Pattern recognition is where the algorithm is trained with examples of characters in different fonts and can then use this training to try and recognize characters from the input. Feature recognition is where the algorithm has a specific set of rules regarding the features of characters, for example the number of angles and crossed lines. The algorithm then uses this to recognize the text [5].

In recent years neural networks (NN) have become increasingly popular for solving these tasks, especially for natural images, where other OCR methods struggle [10]. The OCR methods tested in this report are commercial or in other ways readily available, and advertise being able to perform well on plain text documents.

2.2 Tesseract

Tesseract is a free open source OCR engine, with a first version fully developed in 1994. It began as a PhD project in HP Labs and a possible add-on for HP scanners. In 2005, it was released as open source [8]. Nowadays it is one of the most used OCR systems world-wide and is maintained by Google [9]. Architecturally it is built on a step-by-step pipeline of several types of processing [8]. The first step converts the input image to a binary one. The second step of the pipeline, connected component analysis, identifies character outlines [7], which makes Tesseract recognise white-on-black text as easily as the opposite [8]. These outlines are then grouped into “blobs”, regions of the image that in some way differ from their surroundings [9]. The blobs are then organized into text lines, analysed for character size, and separated into words using fuzzy spaces. The pipeline includes a linguistic analysis, but it is relatively basic. When processing a word, the OCR compares it to the closest words in several categories, such as dictionary words, numeric words and upper/lower case words [8].

2.3 ABBYY FineReader Engine

The ABBYY FineReader Engine is an AI-powered OCR SDK developed by ABBYY. It is available to developers to integrate into other systems through licensing agreements. It can perform text recognition and PDF-conversion among other things. ABBYY as a company offers commercial products based on the FineReader Engine.

The FineReader Engine method is comprised of three steps. It accepts a wide range of file formats as input and output. Pre-processing is performed on the input by the engine to enhance the image quality; for example, images might be rotated or de-skewed. In the next step the engine analyses the input to find the layout, the structure and where in the file there is text. After those steps the recognition starts. ABBYY has not disclosed specifics of the architecture and algorithms of their recognition [1].

2.4 GOCR

GOCR is an open source OCR tool developed under the GNU public license, originally by Jörg Schulenburg. The recognition is divided into two phases. In the first phase, whole documents are processed and in the next phase the unrecognized characters are processed again [3]. GOCR claims it can handle single-column sans-serif fonts of 20–60 pixels in height. It reports trouble with serif fonts, overlapping characters, handwritten text, heterogeneous fonts, noisy images, large angles of skew, and text in anything other than a Latin alphabet. The tool is downloadable and available online5.

5https://www.offidocs.com/index.php/desktop-online-utilities-apps/-online

Other fundamental facts about GOCR, as well as Tesseract and ABBYY FineReader, can be found in Table 2.1.

Method             Latest release year   Supported languages
Tesseract          2019                  100+
ABBYY FineReader   2019                  192
GOCR               2018                  20+

Table 2.1: Comparison of the chosen OCR methods [11]

3 Method

In this section we will describe how we conducted our study and why. The NEOCR dataset and its parameters that are relevant for the study will be presented.

3.1 The NEOCR dataset

The Natural Environment OCR (NEOCR) dataset is a dataset of 659 images containing 5238 bounding boxes (from here on referred to as text fields). The dataset includes only images of natural scenes with one or more text fields per image. The dataset also includes images with properties that might make it more difficult to read the text, for example text rotation and poor contrast. There are also images with text in different languages. The NEOCR dataset includes annotations for each image in XML format as well as metadata about the images [6]. This is the main reason this dataset was chosen for the experiments, as it allows us to categorize the images into different domains and perform comparisons based on this. As can be seen in Table 3.1, the NEOCR dataset has far more text fields per image than other comparable datasets.

Dataset            #Images   #Bounding boxes   Average #char/box
ICDAR 2003         509       2263              6.15
Chars74K           312       2112              6.47
MS Text DB         307       1729              10.76
Street View Text   350       904               6.83
NEOCR              659       5238              17.62

Table 3.1: Comparison of available image datasets [6].

3.2 Image domains

The different domains of images focused on are font, texture, arrangement, contrast and blurriness [6]. These domains are part of the metadata about each image provided by the NEOCR dataset. For the factors that have a float as datatype (contrast and blurriness), boundaries were set for low, mid and high by calculating values that gave roughly the same number of images in each category. The texture levels low, mid and high, as well as the categories in font (standard, special, handwritten) and arrangement (horizontal, vertical, circular), were taken from the dataset.

3.2.1 Font

Font is divided into standard, handwritten and special by the authors of the NEOCR dataset. Information about specific typefaces and the categorization of fonts into special and standard is not given in the NEOCR specification. Font thickness is not a part of the metadata. This domain was chosen since different font families occur frequently in natural images, for example on restaurant signs.

Figure 3.1: Examples of font types: (a) standard, (b) special, (c) handwriting.

3.2.2 Texture

Texture is divided into low (single-colored text and background), mid (multi- colored text or background) and high (multi-colored text and background or text without a continuous surface). These levels are given in the NEOCR specification. Texture is interesting since medium or high texture rarely occurs in text documents.

Figure 3.2: Examples of texture levels: (a) low, (b) mid, (c) high.

3.2.3 Arrangement

Arrangement can be horizontal, vertical or circular. Arrangement was chosen over rotation because it separates text arranged vertically into a category which was considered interesting. Circular text is interesting to test since that is something rarely found in a text document.

Figure 3.3: Examples of arrangement: (a) horizontal, (b) vertical, (c) circular.

3.2.4 Contrast

Contrast is defined as the standard deviation of the luma channel in each image in the NEOCR dataset, and ranges from 1 to 123 [6]. The images’ metadata contain information about contrast both in each text field and for the image as a whole. The latter was chosen for the experiments.

Figure 3.4: Examples of contrast levels: (a) low, (b) mid, (c) high.

3.2.5 Blurriness

Blurriness is calculated by finding the kurtosis of the Laplacian variation in each image [6]. For each text field in the dataset, blurriness is represented by a float between 0 and 100000. It is common for natural images to have either motion blur or lens blur. When examining the images and this domain, it was found that very few text fields in the dataset have a blurriness value above 100. Therefore the boundaries for the low, mid and high categories had to be set depending on the number of images in them. This leads to the high category having a very large range, as can be seen in Table 3.3.

Figure 3.5: Examples of blurriness: (a) low, (b) mid, (c) high.

3.3 Image filtering

In each image there are on average around 8 text fields (see Table 3.1). However, the OCR methods do not run once for each text field but once per image. This makes calculating which domain an image should belong to somewhat difficult, since the different text fields in the image might belong to different domains. For the most accurate result, all the text fields in an image should be in the correct domain, for example high texture. As can be seen in Table 3.2, this on some occasions leads to very small testing samples. As a solution, images with at least one text field in the correct domain were chosen when the all value (the number of images with all text fields in the correct domain) was found to be too low; a minimal sketch of this selection rule is given after Table 3.2. In the texture domain, images with all text fields having low texture could be used for the low category, but for the mid and high categories, images with some text fields having mid or high texture respectively had to be used. Additionally, when running GOCR on all images, some took too much time to process. Because of this, 20 images had to be excluded from the selection for domains and values in the GOCR results. In Table 3.3 all domain information can be found. The boundaries for contrast and blurriness were calculated to give a similar number of images in each category, and will from here on be referred to as low, mid and high.

Texture            Low   Mid   High
All text fields    487   10    3
Some text field    641   149   42

Table 3.2: Example of number of images in the texture domain
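For illustration, a minimal sketch of the "all text fields" versus "some text field" selection rule is given below, over a hypothetical per-image list of text-field labels; in the actual experiment these labels come from the NEOCR XML annotations, and the real filtering code may be structured differently.

// Minimal sketch of the two selection rules used when building the test sets.
// The list of labels per image is a hypothetical stand-in for the NEOCR metadata.
import java.util.List;

public class DomainFilter {
    // true if every text field in the image has the wanted label (e.g. "low" texture)
    public static boolean allInDomain(List<String> textFieldLabels, String wanted) {
        return !textFieldLabels.isEmpty()
                && textFieldLabels.stream().allMatch(wanted::equals);
    }

    // true if at least one text field in the image has the wanted label
    public static boolean someInDomain(List<String> textFieldLabels, String wanted) {
        return textFieldLabels.stream().anyMatch(wanted::equals);
    }

    public static void main(String[] args) {
        List<String> labels = List.of("low", "mid", "low"); // e.g. texture labels of one image
        System.out.println(allInDomain(labels, "low"));   // false
        System.out.println(someInDomain(labels, "mid"));  // true
    }
}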

The dataset used for the experiments includes images with text in different languages. The OCR methods in this study all have support for multiple languages, see Table 2.1. In the experiments however the OCR methods were not provided with information about the language of the image. This was based on the OCR methods having a large range in their number of supported languages. The report’s intent is to show how well the OCR methods handle image quality and text differences, not how many languages they have support for. Nevertheless, it is important to be aware of this fact while comparing the results of the experiments, since it might affect some OCR methods negatively - especially GOCR that does not support as many languages as the other two. This will be taken into consideration while doing the comparative study.

3.4 Experiment

3.4.1 Main experiment

For all three OCR methods, the Linux distribution and command line interface were used on an Ubuntu 18.04.4 OS. Scripts were written that ran the OCR methods on the 659 input files. The images in the NEOCR dataset are in jpg format, which GOCR does not accept, so they had to be converted to png for the GOCR tests. This does not affect the results, as jpg is a lossy format and png is a lossless one, so no further information is lost in the conversion. The output of each OCR method was written to result files named after the image file and the method used, to simplify later stages.
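As an illustration, a minimal Java sketch of such a driver is shown below. It assumes Tesseract's standard command line invocation (tesseract <image> <output base>, which writes <output base>.txt) and uses hypothetical directory names; the actual scripts used in the experiment may have been structured differently.

// Minimal sketch of a per-image driver for one OCR method (here Tesseract).
// Directory names are hypothetical placeholders.
import java.io.File;
import java.io.IOException;

public class RunOcr {
    public static void main(String[] args) throws IOException, InterruptedException {
        File imageDir = new File("neocr/images");         // hypothetical input directory
        File resultDir = new File("results/tesseract");   // hypothetical output directory
        resultDir.mkdirs();
        for (File image : imageDir.listFiles((dir, name) -> name.endsWith(".jpg"))) {
            String base = image.getName().replaceAll("\\.jpg$", "");
            // Tesseract writes its recognized text to "<output base>.txt"
            Process p = new ProcessBuilder(
                    "tesseract", image.getPath(),
                    new File(resultDir, base).getPath())
                    .inheritIO()
                    .start();
            p.waitFor();
        }
    }
}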

3.4.2 Comparison

Font
  Standard: All text fields (395/385)
  Special: All text fields (19/18)
  Handwritten: All text fields (15/14)

Texture
  Low: All text fields (487/472)
  Mid: Some text field (149/147)
  High: Some text field (42/39)

Arrangement
  Horizontal: All text fields (507/490)
  Vertical: Some text field (36/36)
  Circular: Some text field (121/118)

Contrast
  Low (0.0-55.7473): Only global image data (220/217)
  Mid (55.7473-68.1854): Only global image data (219/210)
  High (68.1854-123.0): Only global image data (220/212)

Blurriness
  Low (0.0-6.5816): Text field average (219/209)
  Mid (6.5816-14.1457): Text field average (219/212)
  High (14.1457-100000): Text field average (220/217)

Table 3.3: Image selection criteria for each domain and value, and the number of processed images selected, given as (number for Tesseract and ABBYY FineReader / number for GOCR).

The accuracy was computed with a Levenshtein distance algorithm. Levenshtein distance is a method of measuring the difference between two text strings. Given two strings, the source s and the target t, the Levenshtein distance between them is defined as the number of deletions, insertions and substitutions required to transform s into t. If the two strings are equal, the Levenshtein distance is 0. If, for example, s = hello and t = help, the Levenshtein distance is 2 (e.g. remove o and substitute l for p). A mathematical definition of the Levenshtein distance between the two strings s and t, given by lev_{s,t}(length(s), length(t)), can be found in Figure 3.6, where s_i is the character with index i (1-based) in string s, lev_{s,t}(i, j) is the distance between the first i characters of s and the first j characters of t, and 1_{(s_i ≠ t_j)} equals 0 if s_i = t_j and 1 otherwise.

\[
\mathrm{lev}_{s,t}(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0,\\
\min\bigl(\,\mathrm{lev}_{s,t}(i-1, j) + 1,\;\; \mathrm{lev}_{s,t}(i, j-1) + 1,\;\; \mathrm{lev}_{s,t}(i-1, j-1) + 1_{(s_i \neq t_j)}\bigr) & \text{otherwise.}
\end{cases}
\]

Figure 3.6: Mathematical definition of Levenshtein distance.
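For reference, a straightforward Java implementation of the recurrence in Figure 3.6, using a full distance matrix, could look as follows. This is only an illustration of the metric itself; the evaluation program described below uses a trie-based variant for performance.

// Minimal dynamic-programming version of the Levenshtein recurrence in Figure 3.6.
public class Levenshtein {
    public static int distance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i; // only deletions
        for (int j = 0; j <= t.length(); j++) d[0][j] = j; // only insertions
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1,      // deletion
                                 d[i][j - 1] + 1),     // insertion
                        d[i - 1][j - 1] + cost);       // substitution (or match)
            }
        }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("hello", "help")); // prints 2, as in the example above
    }
}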

A Java program, referred to as the Levenshtein program in Figure 3.8, was written (see appendices A, B and C). This program fetched the text found by the OCR methods in the result files and divided each file into a list of non-empty lines. This list was then compared to another list fetched from the annotations for each image respectively (the correct text). The program started by matching the two lines (one from each list) with the closest Levenshtein distance, then the second closest pair, and so on. If an OCR method failed to process a complete correct line and divided it into two separate lines, only one of these lines could be matched to the correct line, and the other would most likely be perceived by the program as text that should not have been found.

Since images can contain quite long lines of characters, a plain recursive implementation of the algorithm had too high computational costs. The algorithm was therefore rewritten using a trie to store the correct lines, and distance matrices (see the example in Figure 3.7) to calculate the distance to the different lines found by the OCR methods.

Figure 3.7: An example of a Levenshtein distance matrix [2].

The total distance for all lines was then divided by the total number of characters in the correct text from the annotations. The average distance per correct character was decided to be the best way to compute accuracy, since an image with more and longer lines naturally increases the risk of errors. In addition, the OCR methods get a worse result (a longer average distance) if they find text that is not in the image, even though they also find the correct text in between. A perfect result would be 0, a result of 1 most likely means an empty result from the OCR method, and a result greater than 1 means that the OCR method added extra characters (at least as many as the number of correct characters) that should not be there. The average distance for each image was calculated and then written to new result files.
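A condensed, hypothetical sketch of this scoring step is given below: OCR output lines are greedily matched to correct lines by smallest Levenshtein distance, and the total distance is divided by the number of correct characters. The helper names are illustrative, the real program (appendices A-C) is structured differently around a trie, and the sketch reuses the Levenshtein.distance method from the previous example.

// Hypothetical sketch of per-image scoring: greedy closest-pair matching of
// lines, followed by normalization with the number of correct characters.
import java.util.ArrayList;
import java.util.List;

public class Scoring {
    public static double averageDistancePerCorrectChar(List<String> ocrLines,
                                                       List<String> correctLines) {
        List<String> remainingOcr = new ArrayList<>(ocrLines);
        List<String> remainingCorrect = new ArrayList<>(correctLines);
        int totalDistance = 0;
        int correctChars = correctLines.stream().mapToInt(String::length).sum();

        // Repeatedly pick the globally closest (OCR line, correct line) pair.
        while (!remainingOcr.isEmpty() && !remainingCorrect.isEmpty()) {
            int bestOcr = 0, bestCorrect = 0, bestDist = Integer.MAX_VALUE;
            for (int i = 0; i < remainingOcr.size(); i++) {
                for (int j = 0; j < remainingCorrect.size(); j++) {
                    int d = Levenshtein.distance(remainingOcr.get(i), remainingCorrect.get(j));
                    if (d < bestDist) { bestDist = d; bestOcr = i; bestCorrect = j; }
                }
            }
            totalDistance += bestDist;
            remainingOcr.remove(bestOcr);
            remainingCorrect.remove(bestCorrect);
        }
        // Unmatched correct lines count as entirely missed text;
        // unmatched OCR lines count as text that should not have been found.
        for (String line : remainingCorrect) totalDistance += line.length();
        for (String line : remainingOcr) totalDistance += line.length();

        return (double) totalDistance / correctChars;
    }
}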

The final result for each domain and value was produced by another Java program (referred to as the Average program in Figure 3.8), which fetched the filenames of the images that matched the criteria (e.g. font = special) and calculated the average result (distance per correct character) over all those images. This gave an overview of how well the OCR methods performed, but also a more specific look at what types of images each method struggled with. For the domains with values represented as floats, another Java program was written to calculate suitable limits for dividing the groups. The limits were calculated so that they would divide the images into three groups of equal size; a sketch of this boundary calculation is given after Figure 3.8.

Figure 3.8: The pipeline of the experiment.
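A minimal sketch of how such limits can be computed is shown below: sort the float values for a domain and read off the cut points after the first and second thirds, so that each of the three groups contains roughly the same number of images. The variable names and example values are illustrative and not taken from the thesis code.

// Minimal sketch: tercile boundaries for a float-valued domain (contrast or blurriness).
import java.util.Arrays;

public class Boundaries {
    public static double[] tercileBoundaries(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        // Cut points after the first and second third of the sorted values.
        double lowMid = sorted[n / 3];
        double midHigh = sorted[2 * n / 3];
        return new double[] { lowMid, midHigh };
    }

    public static void main(String[] args) {
        double[] contrastValues = { 12.0, 40.5, 55.7, 61.2, 68.2, 90.1 }; // made-up example values
        System.out.println(Arrays.toString(tercileBoundaries(contrastValues)));
    }
}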

4 Results

The results of the study will be presented in this section. A comparison of the results will be illustrated and described.

The results of the OCR methods for the different domains and values are shown in Tables 4.1, 4.2, 4.3, 4.4 and 4.5, rounded to four decimal places and followed by the number of images in each group in parentheses.

4.1 Font

ABBYY FineReader is more accurate than Tesseract on images with only standard font text fields or only special font text fields. This is especially true for the latter, where Tesseract's results have an average Levenshtein distance per correct character of approximately 2.7. On the other hand, Tesseract handles images with only handwritten text fields better. GOCR performs much worse than the others on all values, finding far more text in the images than they actually contain, but it performs better on special and standard fonts than on handwritten ones. The reason GOCR gets such large distances is that it finds characters in details in the background. If the background for example contains a tree, GOCR often finds characters in the shapes of the leaves.

Method             Standard         Special          Handwritten
Tesseract          1.8652 (395)     2.7480 (19)      1.6297 (15)
ABBYY FineReader   1.7756 (395)     1.6413 (19)      1.7005 (15)
GOCR               306.6821 (365)   236.0985 (18)    468.2075 (14)

Table 4.1: Average Levenshtein distance per correct character and number of images for the domain font.

4.2 Texture

On images with only low texture text fields and on images with some mid or high texture text fields, ABBYY FineReader has the highest accuracy, with Tesseract shortly thereafter and GOCR more than 100 times worse. Tesseract and ABBYY FineReader have their worst results on low texture, while GOCR has its worst on high texture.

Method             Low              Mid              High
Tesseract          1.8207 (487)     1.7396 (149)     1.7075 (42)
ABBYY FineReader   1.7341 (487)     1.6825 (149)     1.6891 (42)
GOCR               249.8101 (39)    199.0539 (147)   273.4938 (472)

Table 4.2: Average Levenshtein distance per correct character and number of images for the domain texture.

4.3 Arrangement

Tesseract handles vertical and circular text fields best, while ABBYY FineReader has higher accuracy on images with only horizontal text fields. GOCR has its best results in the study on images with vertical or circular text fields, but is almost three times worse on images with horizontal text.

Method             Horizontal       Vertical         Circular
Tesseract          1.8568 (507)     1.5206 (36)      1.6259 (121)
ABBYY FineReader   1.7311 (507)     1.6055 (36)      1.7217 (121)
GOCR               275.1143 (490)   105.1187 (36)    105.3215 (118)

Table 4.3: Average Levenshtein distance per correct character and number of images for the domain arrangement.

4.4 Contrast

In images with mid and high contrast, the results of Tesseract and ABBYY FineReader are very similar. They both perform worse on low contrast images, especially Tesseract. GOCR performs much worse than both of them, and also has its worst result on low contrast images.

Method             Low              Mid              High
Tesseract          1.9329 (220)     1.7187 (219)     1.7476 (220)
ABBYY FineReader   1.7550 (220)     1.7099 (219)     1.7034 (220)
GOCR               352.7407 (217)   197.1780 (210)   153.7890 (212)

Table 4.4: Average Levenshtein distance per correct character and number of images for the domain contrast.

4.5 Blurriness

In images with a mid or high average text field blurriness, ABBYY FineReader performs the best, but it has somewhat lower accuracy in images with low blurriness, in which Tesseract performs better. The results for GOCR are worse than the others for all values, but are relatively better in images with mid blurriness.

Method             Low              Mid              High
Tesseract          1.8387 (219)     1.7252 (219)     1.8382 (220)
ABBYY FineReader   1.9683 (219)     1.6207 (219)     1.5824 (220)
GOCR               256.8601 (209)   175.1308 (212)   267.4770 (217)

Table 4.5: Average Levenshtein distance per correct character and number of images for the domain blurriness.

4.6 Comparative study

As shown in Table 4.6, Tesseract and ABBYY FineReader result in very similar accuracy overall, with an average Levenshtein distance per correct character of 1.7-1.8. For both methods, the Levenshtein algorithm had, on average, to perform more than one operation per correct character, which means they found more text than there actually is in many of the images. In comparison to both these OCR methods, GOCR performs much worse.

A more detailed comparison of Tesseract and ABBYY FineReader can be found in Figure 4.1. The chart shows that the results in different domains and values are quite unvaried for both OCR methods. There is, however, an exception for images that only have text fields with special fonts, where Tesseract performs worse than its average. It can also be seen that there is a slightly higher accuracy on images with vertical and circular text fields for both methods, but in particular for Tesseract. ABBYY FineReader has the most varied results in the blurriness domain; along with low blurriness, its worst scenarios are standard font and low contrast. Tesseract also performs comparatively worse on low contrast images, as well as on images with standard and, in particular, special fonts. The worst scenarios for GOCR are similar, but GOCR has relatively low accuracy on images with handwritten fonts rather than special fonts.

Tesseract   ABBYY FineReader   GOCR
1.7999      1.7228             235.6109

Table 4.6: Average Levenshtein distance per correct character overall rounded to four decimal places.

Figure 4.1: Bar chart of the results for Tesseract and ABBYY FineReader: average distance per correct character, overall and for each value of the domains font, texture, arrangement, contrast and blurriness.

5 Discussion

This section contains a discussion of how the study was performed and of possible shortcomings. The reliability of the results and their potential use in future research will be discussed. Finally, the results are used to answer the research question in a conclusion.

5.1 Method discussion

When sorting out images for each domain and value, our initial idea was to eliminate parameters that could interfere with the results. GOCR, for example, does not support nearly as many languages as the other two methods. Therefore we would only want images with text in languages supported by all three methods, as anything else would interfere with the results. However, we realised that the dataset simply had too few images to filter away all images that could affect the reliability of the results. For some values it would have been possible, but to keep the tests on different values in each domain under the same conditions, we chose to only filter on the value of each domain respectively. On the other hand, this could have been done for the domains where the value was represented as a float (contrast and blurriness), since each group had approximately 220 images.

Relating to this, would it make more sense to set the boundaries for these domains based on the range of values, e.g. 0 - 33333 - 66666 - 100000 for blurriness? We tried this at first, discovering that there is no image with a blurriness over 66666 and that this would not give us a sufficient number of images for some samples. We are still unsure how the range is defined, but it is definitely not a balanced interval.

As for the Java program evaluating the OCR results, there are some limitations. Each OCR method divided its result into lines in the output files. Every line containing at least one non-whitespace character was retrieved as it was and added to a list. As for the correct text, text fields with multiple lines included the character combination "<br/>" (one or more times) in the annotations to separate the different lines. Each line from all text fields in an image was retrieved and added to the other list, without taking into account which lines belonged to which text field. It may be relevant to test whether the text is recognized in the same order as it appears in the image, but our program did not implement this. If a correct line was recognized correctly, but divided into two separate lines in the output file, it pulled down the result. In addition, if two different text fields are placed next to each other in an image, the OCR methods might interpret lines across the two text fields. In this case the methods can recognize the text correctly, but there will only be different parts of those lines to match them against.

After conducting all tests, we came across two incorrectly produced distances in the results from Tesseract, by checking manually on images simple enough to easily see whether the result was correct. The problems were in the trie and the Levenshtein algorithm. The program did not initiate the Levenshtein algorithm for each of the child characters of the root, only for the first letter of the recognized line, preventing e.g. "flower" from matching a correct line "lower" instead of another correct line "friends". In addition, when e.g. deleting a line "the" from the trie after it had been matched, the program did not check whether there were any other lines with the same start, for example "there", in the trie before deleting it, which meant that no line could match the line "there" after that. We fixed these problems and redid the tests, but the occurrence made us realise that there can still be problems or edge cases our program does not handle correctly. Even though we made some manual inspections, more thorough testing should have been done. Unfortunately, our time constraints prevented this.

5.2 Discussion of results

To begin with, not completing the processing of all images for GOCR led to inaccurate results. Each image was given at least 30 minutes to process, and if GOCR had not moved on to the next image by then, it was excluded from the GOCR results. Some images were even given several hours to compute when leaving the computer with the OCR running. Would it have been better to rule out these images from the results of all OCR methods? That would have made the comparison of GOCR against the other methods more valid, but we also wanted the results for Tesseract and ABBYY FineReader to be as accurate as possible. Looking back after conducting all tests, it would not have been interesting to process the last images with GOCR. Firstly, because the results of GOCR were so distant from the other results that a big margin of error did not matter. Secondly, because we discovered that the images GOCR had problems with were images with a lot of details, e.g. images with leaves on trees. In images like this where GOCR did produce a result, it would recognize thousands of characters that should not be there. Therefore, the results for the images we excluded would probably just have been even worse than the results we included.

In several tests, domains that would be considered more difficult, e.g. vertical and circular arrangement and high texture, got better results than horizontal arrangement and low texture. This contradicts the fact that the OCR methods perform better on plain text documents than on natural images. However, the sample size and the distance calculation are likely to blame. The small sample size meant that in the texture domain we had 487 images with low texture, but only 149 with mid and 42 with high texture. With horizontal, low texture text there is a higher chance for the OCR method to find text, and therefore a higher risk of extra incorrect characters being found. Our distance measure values not finding anything over finding too much. This means that an image with vertical text might result in no text being found, which gives it a better result than an image with horizontal text where the method found many extra characters.

The NEOCR dataset was great in many ways, mainly due to its detailed annotations. However, we discovered some faults during testing. For example, one image was annotated as having negative blurriness, a value that should be in the range of 0 to 100000. Another example is an image we used for initial debugging, whose annotations were missing a whole line of text that clearly was in the picture. We did not investigate this any further, since it would take too much time to go through the images manually, but the fact that we stumbled upon this image made us believe there could be many others like it. These discoveries call into question the reliability of the annotations, and by extension the results of the study.

The results of this study lose some connection to the real usage of OCR technology, since it is often used with training data and other tools like language specification which were not used while conducting the tests in this study.

5.3 Future research

This study could be improved and extended in several ways. The main improvement would be to use a dataset with more images in each domain, so that only images that match the criteria on all their text fields would be selected. With more powerful computational resources, all of the images could have been processed for GOCR, and with fewer time constraints, more thorough testing could have been done on the program evaluating the OCR methods.

The results of this study can be of interest in the development of OCR methods with a specific target area. For example, when developing an OCR method to recognize text in scenarios where the text is often in unusual fonts, with high texture and varied arrangement, the results of this study indicate that handling of special fonts should be the focus for improvement if Tesseract's code is used as a starting point. Many domains were left out of this study; testing them could reveal for what types of images OCR methods particularly need to be developed further. Additionally, to have the study represent today's OCR technology in general, more OCR methods should be tested.

5.4 Ethical and sustainability considerations

Since OCR on natural images without a specific purpose is still unpredictable and generates errors, it should only be used in scenarios where the result can help the user, rather than being relied on to be correct. For example, when taking pictures of street signs in a foreign language to let other software translate them, users should keep in mind that the result may not be perfect, in the same way that e.g. Google Translate does not always produce a perfect translation. The result may help the user, but if it is completely out of context, it can be disregarded. On the other hand, some OCR methods with a more specific usage can have much higher accuracy, and can therefore be relied on to a greater extent.

This study can contribute to the development of OCR methods with higher accuracy. In the future, OCR may be used in a lot more areas. This can have consequences, such as humans losing their jobs, but also making a lot of processes more efficient and less energy consuming.

5.5 Societal considerations

OCR is used in many applications in society. However, most are for plain text documents, for example when scanning paper documents to make them editable. Evolving OCR for natural images could increase efficiency in many areas. Drivers can benefit from the car interpreting street signs, and as self-driving cars develop further, OCR is an important aspect of making them as safe as possible.

Another area that can benefit from the usage of OCR is the categorization of images. Computers are becoming increasingly good at recognizing shapes and objects to categorize images, and by adding OCR, the text in images can also be used to categorize them. This can help people working with file systems where categorizing images by hand would take too long.

5.6 Conclusion

In this study, the different OCR methods were tested on images both overall and in different domains and values. Overall, the results aligned quite well with the hypothesis. ABBYY FineReader has an average Levenshtein distance per correct character of approximately 1.72, while the same measurement for Tesseract was approximately 1.80. In comparison to GOCR with an average distance of roughly 235.61, they both have high accuracy, but in this study, a distance over 1.0 is still worse than an OCR not recognizing any characters at all in any images.

The results in different domains were very similar, which goes against the hypothesis. In some cases e.g. in the domain arrangement, the OCR methods performed better on vertical and circular images than horizontal, which we believed would be the easiest to recognize. The only significant deviation was that Tesseract had a relatively lower accuracy in images with special font text fields. The results of GOCR varied a lot, but given how badly it performed, the result was deemed to be irrelevant.

We believe that the results of this study are fairly reliable, but question their relevance to real usage of OCR technology, in which training data and other tools are utilized.

References

[1] ABBYY. ABBYY FineReader Engine Specification. [Online; accessed 15-May-2020]. 2020. URL: https://www.abbyy.com/en-gb/ocr-sdk/ocr-stages/.

[2] Cuelogic Insights. The Levenshtein Algorithm. [Online; accessed 15-May-2020]. 2017. URL: https://www.cuelogic.com/blog/the-levenshtein-algorithm.

[3] Dhiman, Shivani and Singh, A. “Tesseract vs gocr a comparative study”. In: International Journal of Recent Technology and Engineering 2.4 (2013), p. 80.

[4] Eikvil, Line. “Optical character recognition”. In: citeseer.ist.psu.edu/142042.html (1993).

[5] Margaret Rouse. Definition OCR. [Online; accessed 15-May-2020]. 2019. URL: https://searchcontentmanagement.techtarget.com/definition/OCR-optical-character-recognition.

[6] Nagy, Robert, Dicker, Anders, and Meyer-Wegener, Klaus. “Definition and Evaluation of the NEOCR Dataset for Natural-Image Text Recognition”. In: (2011).

[7] Patel, Chirag, Patel, Atul, and Patel, Dharmendra. “Optical character recognition by open source OCR tool tesseract: A case study”. In: International Journal of Computer Applications 55.10 (2012), pp. 50–56.

[8] Smith, Ray. “An overview of the Tesseract OCR engine”. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007). Vol. 2. IEEE. 2007, pp. 629–633.

[9] Tafti, Ahmad P et al. “OCR as a service: an experimental evaluation of Google Docs OCR, Tesseract, ABBYY FineReader, and Transym”. In: International Symposium on Visual Computing. Springer. 2016, pp. 735–746.

[10] Wang, Jianfeng and Hu, Xiaolin. “Gated recurrent convolution neural network for ocr”. In: Advances in Neural Information Processing Systems (2017).

[11] Wikipedia contributors. Comparison of optical character recognition software. [Online; accessed 29-April-2020]. 2020. URL: https://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software.

Appendix - Contents

A Neocr3.java
B Trie.java
C TrieNode.java
D Link to GitHub Repository

A Neocr3.java

B Trie.java

C TrieNode.java

D Link to GitHub Repository
https://gits-15.sys.kth.se/agnesfo/kexOCR

TRITA-EECS-EX-2020:368
